# Catastrophic Forgetting Check

This notebook evaluates both the **base model** and **fine-tuned model** on a general benchmark to check for **catastrophic forgetting**.

**Purpose**: Verify that fine-tuning for terminal commands hasn't degraded the model's general capabilities.

**Benchmark**: HellaSwag (commonsense reasoning)

**Expected Result**: The fine-tuned model's HellaSwag accuracy should stay within 5 percentage points of the base model's, the threshold this notebook applies in Cell 7.

## Cell 1: Setup

In [1]:
import os
import json
import torch
import warnings
import gc
from pathlib import Path
from datetime import datetime
from tqdm import tqdm

warnings.filterwarnings('ignore')
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
from datasets import load_dataset

if torch.cuda.is_available():
    print(f"✅ CUDA available: {torch.cuda.get_device_name(0)}")
    torch.backends.cudnn.benchmark = True
    device = torch.device("cuda:0")
else:
    print("⚠️ Running on CPU")
    device = torch.device("cpu")

✅ CUDA available: NVIDIA GeForce RTX 2060


## Cell 2: Configuration

In [2]:
HF_USERNAME = "Eng-Elias"  # Your HuggingFace username

CONFIG = {
    # Models
    "base_model": "Qwen/Qwen3-0.6B",
    "finetuned_repo": f"{HF_USERNAME}/qwen3-0.6b-terminal-instruct",
    "local_adapter_path": "../outputs/lora_adapters",
    
    # Benchmark
    "benchmark": "Rowan/hellaswag",  # HellaSwag dataset
    "eval_samples": 100,  # Number of samples to evaluate (use more for accurate results)
    
    # Results
    "results_dir": "../outputs/eval_results",
}

Path(CONFIG["results_dir"]).mkdir(parents=True, exist_ok=True)

print("=" * 60)
print("CATASTROPHIC FORGETTING CHECK CONFIGURATION")
print("=" * 60)
print(f"Base Model: {CONFIG['base_model']}")
print(f"Fine-tuned: {CONFIG['finetuned_repo']}")
print(f"Benchmark: {CONFIG['benchmark']}")
print(f"Eval Samples: {CONFIG['eval_samples']}")
print("=" * 60)

CATASTROPHIC FORGETTING CHECK CONFIGURATION
Base Model: Qwen/Qwen3-0.6B
Fine-tuned: Eng-Elias/qwen3-0.6b-terminal-instruct
Benchmark: Rowan/hellaswag
Eval Samples: 100


## Cell 3: Load HellaSwag Benchmark Dataset

In [3]:
print("📥 Loading HellaSwag benchmark dataset...")

# Load HellaSwag validation set
hellaswag = load_dataset(CONFIG["benchmark"], split="validation")

print(f"✅ Loaded {len(hellaswag)} samples from HellaSwag validation set")
print(f"   Using {CONFIG['eval_samples']} samples for evaluation")

# Preview a sample
print("\n📋 Sample from HellaSwag:")
sample = hellaswag[0]
print(f"   Context: {sample['ctx'][:100]}...")
print(f"   Endings: {len(sample['endings'])} options")
print(f"   Correct: Option {sample['label']}")

📥 Loading HellaSwag benchmark dataset...



✅ Loaded 10042 samples from HellaSwag validation set
   Using 100 samples for evaluation

📋 Sample from HellaSwag:
   Context: A man is sitting on a roof. he...
   Endings: 4 options
   Correct: Option 3
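
The evaluation helper in Cell 4 takes the first `eval_samples` rows of the validation split. If a random subset is preferred, a seeded draw keeps the run reproducible. The sketch below uses only the standard library; `pick_eval_indices` and the seed value are hypothetical names and choices, not part of the notebook.

```python
import random

def pick_eval_indices(dataset_size, num_samples, seed=42):
    """Return a sorted, reproducible random subset of row indices."""
    k = min(num_samples, dataset_size)
    rng = random.Random(seed)  # local RNG, leaves the global random state alone
    return sorted(rng.sample(range(dataset_size), k))

# 10042 is the validation-set size reported above; 100 matches eval_samples.
indices = pick_eval_indices(dataset_size=10042, num_samples=100)
print(len(indices), indices[:5])
```

The resulting list could then be passed to `hellaswag.select(indices)` in place of `range(num_samples)`, assuming the same subset is used for both models.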


## Cell 4: Helper Functions

In [4]:
def get_bnb_config():
    """Get BitsAndBytes config for 4-bit quantization."""
    return BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.float16
    )

def clear_gpu_memory():
    """Clear GPU memory."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

def compute_log_likelihood(model, tokenizer, context, ending):
    """
    Compute the log-likelihood of an ending given context.
    Used for multiple-choice evaluation.
    """
    full_text = context + " " + ending
    context_ids = tokenizer(context, return_tensors="pt")["input_ids"].to(device)
    full_ids = tokenizer(full_text, return_tensors="pt")["input_ids"].to(device)
    
    context_len = context_ids.shape[1]
    
    with torch.no_grad():
        outputs = model(full_ids)
        logits = outputs.logits
    
    # Get log probabilities for the ending tokens
    log_probs = torch.nn.functional.log_softmax(logits[:, :-1, :], dim=-1)
    
    # Sum log probs for ending tokens only
    ending_log_prob = 0.0
    for i in range(context_len - 1, full_ids.shape[1] - 1):
        token_id = full_ids[0, i + 1]
        ending_log_prob += log_probs[0, i, token_id].item()
    
    # Normalize by length
    ending_len = full_ids.shape[1] - context_len
    if ending_len > 0:
        ending_log_prob /= ending_len
    
    return ending_log_prob

def evaluate_hellaswag(model, tokenizer, dataset, num_samples):
    """
    Evaluate model on HellaSwag benchmark.
    Returns a tuple (accuracy, correct, total), where accuracy is the
    percentage of correct predictions.
    """
    correct = 0
    total = 0
    
    samples = dataset.select(range(min(num_samples, len(dataset))))
    
    for sample in tqdm(samples, desc="HellaSwag Evaluation"):
        context = sample["ctx"]
        endings = sample["endings"]
        correct_idx = int(sample["label"])
        
        # Compute log-likelihood for each ending
        scores = []
        for ending in endings:
            score = compute_log_likelihood(model, tokenizer, context, ending)
            scores.append(score)
        
        # Predict the ending with highest score
        predicted_idx = scores.index(max(scores))
        
        if predicted_idx == correct_idx:
            correct += 1
        total += 1
    
    accuracy = 100 * correct / total
    return accuracy, correct, total

print("✅ Helper functions defined")

✅ Helper functions defined
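
`compute_log_likelihood` divides the summed log-probability by the number of ending tokens, so a longer ending is not penalized simply for having more tokens. The toy sketch below shows how that normalization can change which ending wins. The per-token log-probabilities are made-up values for illustration, and `mean_log_prob` is a hypothetical helper.

```python
def mean_log_prob(token_log_probs):
    """Average per-token log-probability; 0.0 for an empty ending, as in the notebook."""
    if not token_log_probs:
        return 0.0
    return sum(token_log_probs) / len(token_log_probs)

# Hypothetical per-token log-probabilities for two candidate endings.
short_ending = [-2.0, -2.0]                   # sum -4.0, mean -2.0
long_ending = [-1.0, -1.0, -1.0, -1.0, -1.0]  # sum -5.0, mean -1.0
candidates = [short_ending, long_ending]

# The raw sum favors the short ending; the length-normalized score favors the long one.
by_sum = max(range(len(candidates)), key=lambda i: sum(candidates[i]))
by_mean = max(range(len(candidates)), key=lambda i: mean_log_prob(candidates[i]))
print(by_sum, by_mean)  # 0 1
```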


## Cell 5: Evaluate Base Model on HellaSwag

In [5]:
print("=" * 60)
print("📊 EVALUATING BASE MODEL ON HELLASWAG")
print("=" * 60)
print(f"Model: {CONFIG['base_model']}")

# Load base model
tokenizer_base = AutoTokenizer.from_pretrained(CONFIG["base_model"])
if tokenizer_base.pad_token is None:
    tokenizer_base.pad_token = tokenizer_base.eos_token

base_model = AutoModelForCausalLM.from_pretrained(
    CONFIG["base_model"],
    quantization_config=get_bnb_config(),
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.float16
)
base_model.eval()
print("✅ Base model loaded")

# Evaluate
base_accuracy, base_correct, base_total = evaluate_hellaswag(
    base_model, tokenizer_base, hellaswag, CONFIG["eval_samples"]
)

print(f"\n📊 Base Model HellaSwag Accuracy: {base_accuracy:.1f}% ({base_correct}/{base_total})")

# Cleanup
del base_model, tokenizer_base
clear_gpu_memory()
print("✅ Base model cleared from memory")

📊 EVALUATING BASE MODEL ON HELLASWAG
Model: Qwen/Qwen3-0.6B


`torch_dtype` is deprecated! Use `dtype` instead!


✅ Base model loaded


HellaSwag Evaluation: 100%|██████████| 100/100 [01:04<00:00,  1.55it/s]


📊 Base Model HellaSwag Accuracy: 44.0% (44/100)
✅ Base model cleared from memory
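
With only 100 samples, a single accuracy figure carries wide uncertainty. The sketch below computes a 95% Wilson score interval for 44 correct out of 100, which comes out at roughly 35% to 54%. `wilson_interval` is a hypothetical helper, and the calculation assumes the samples are independent draws.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

low, high = wilson_interval(44, 100)
print(f"95% interval: {100 * low:.1f}% to {100 * high:.1f}%")
```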





## Cell 6: Evaluate Fine-tuned Model on HellaSwag

In [6]:
print("=" * 60)
print("📊 EVALUATING FINE-TUNED MODEL ON HELLASWAG")
print("=" * 60)
print(f"Base: {CONFIG['base_model']}")
print(f"Adapters: {CONFIG['local_adapter_path']}")

# Load fine-tuned model (base + LoRA adapters)
tokenizer_ft = AutoTokenizer.from_pretrained(CONFIG["local_adapter_path"])
if tokenizer_ft.pad_token is None:
    tokenizer_ft.pad_token = tokenizer_ft.eos_token

base_for_ft = AutoModelForCausalLM.from_pretrained(
    CONFIG["base_model"],
    quantization_config=get_bnb_config(),
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.float16
)

finetuned_model = PeftModel.from_pretrained(base_for_ft, CONFIG["local_adapter_path"])
finetuned_model.eval()
print("✅ Fine-tuned model loaded")

# Evaluate
ft_accuracy, ft_correct, ft_total = evaluate_hellaswag(
    finetuned_model, tokenizer_ft, hellaswag, CONFIG["eval_samples"]
)

print(f"\n📊 Fine-tuned Model HellaSwag Accuracy: {ft_accuracy:.1f}% ({ft_correct}/{ft_total})")

# Cleanup
del finetuned_model, base_for_ft, tokenizer_ft
clear_gpu_memory()
print("✅ Fine-tuned model cleared from memory")

📊 EVALUATING FINE-TUNED MODEL ON HELLASWAG
Base: Qwen/Qwen3-0.6B
Adapters: ../outputs/lora_adapters
✅ Fine-tuned model loaded


HellaSwag Evaluation: 100%|██████████| 100/100 [01:22<00:00,  1.21it/s]


📊 Fine-tuned Model HellaSwag Accuracy: 36.0% (36/100)
✅ Fine-tuned model cleared from memory





## Cell 7: Comparison and Analysis

In [7]:
print("\n" + "=" * 70)
print("📊 CATASTROPHIC FORGETTING CHECK - RESULTS")
print("=" * 70)

print(f"\n{'Model':<40} {'HellaSwag Accuracy':<20}")
print("-" * 60)
print(f"{'Base Qwen3-0.6B (untuned)':<40} {base_accuracy:.1f}%")
print(f"{'Fine-tuned (terminal commands)':<40} {ft_accuracy:.1f}%")
print("-" * 60)

# Calculate difference
diff = ft_accuracy - base_accuracy

print(f"\n{'Difference:':<40} {diff:+.1f}%")

# Analysis
print("\n" + "=" * 70)
print("📋 ANALYSIS")
print("=" * 70)

if abs(diff) <= 5:
    print("\n✅ NO CATASTROPHIC FORGETTING DETECTED")
    print(f"   The fine-tuned model maintains similar general capabilities.")
    print(f"   Difference of {diff:+.1f}% is within acceptable range (±5%).")
elif diff < -5:
    print("\n⚠️ POTENTIAL CATASTROPHIC FORGETTING")
    print(f"   The fine-tuned model shows {abs(diff):.1f}% lower accuracy.")
    print(f"   This may indicate some loss of general capabilities.")
    print(f"   Consider: using lower learning rate, fewer epochs, or more diverse data.")
else:
    print("\n✅ IMPROVED GENERAL CAPABILITIES")
    print(f"   The fine-tuned model shows {diff:.1f}% higher accuracy.")
    print(f"   Fine-tuning may have improved general reasoning slightly.")

print("\n" + "=" * 70)


📊 CATASTROPHIC FORGETTING CHECK - RESULTS

Model                                    HellaSwag Accuracy  
------------------------------------------------------------
Base Qwen3-0.6B (untuned)                44.0%
Fine-tuned (terminal commands)           36.0%
------------------------------------------------------------

Difference:                              -8.0%

📋 ANALYSIS

⚠️ POTENTIAL CATASTROPHIC FORGETTING
   The fine-tuned model shows 8.0% lower accuracy.
   This may indicate some loss of general capabilities.
   Consider: using lower learning rate, fewer epochs, or more diverse data.
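
The reported difference is in percentage points (36.0% minus 44.0%), and at 100 samples a gap of that size can still fall within sampling noise. As a rough check, the sketch below runs an unpaired two-proportion z-test on 44/100 versus 36/100. `two_proportion_z` is a hypothetical helper. Both models were scored on the same items, so a paired test such as McNemar's would be more appropriate, but it needs per-sample results that this notebook does not record.

```python
import math

def two_proportion_z(correct_a, correct_b, n):
    """Unpaired two-proportion z-test for equal sample sizes n (two-sided p-value)."""
    p_a, p_b = correct_a / n, correct_b / n
    pooled = (correct_a + correct_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value

# Base model: 44/100 correct; fine-tuned model: 36/100 correct.
z, p_value = two_proportion_z(44, 36, 100)
print(f"z = {z:.2f}, two-sided p = {p_value:.2f}")
```

Under these assumptions the p-value lands well above 0.05, so evaluating on more samples would be needed before treating the drop as established.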



## Cell 8: Save Results

In [8]:
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
results_file = f"{CONFIG['results_dir']}/catastrophic_forgetting_check_{timestamp}.json"

results = {
    "timestamp": timestamp,
    "benchmark": "HellaSwag",
    "eval_samples": CONFIG["eval_samples"],
    "base_model": {
        "name": CONFIG["base_model"],
        "accuracy": base_accuracy,
        "correct": base_correct,
        "total": base_total
    },
    "finetuned_model": {
        "name": CONFIG["finetuned_repo"],
        "adapters": CONFIG["local_adapter_path"],
        "accuracy": ft_accuracy,
        "correct": ft_correct,
        "total": ft_total
    },
    "difference": diff,
    # True only when accuracy dropped by more than 5 points
    "catastrophic_forgetting": diff < -5
}

with open(results_file, 'w', encoding='utf-8') as f:
    json.dump(results, f, indent=2)

print(f"✅ Results saved to: {results_file}")

✅ Results saved to: ../outputs/eval_results/catastrophic_forgetting_check_20251230_221214.json
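
Because the file name embeds a `YYYYMMDD_HHMMSS` timestamp, sorting the names alphabetically also sorts them chronologically. The sketch below reads back the most recent results file. `load_latest_result` is a hypothetical helper, and the demo writes placeholder files to a temporary directory rather than touching `../outputs/eval_results`.

```python
import json
import tempfile
from pathlib import Path

def load_latest_result(results_dir):
    """Return the parsed JSON of the newest check file, or None if there is none."""
    files = sorted(Path(results_dir).glob("catastrophic_forgetting_check_*.json"))
    if not files:
        return None
    with open(files[-1], encoding="utf-8") as f:
        return json.load(f)

# Demo with placeholder files; the first timestamp and its value are invented.
with tempfile.TemporaryDirectory() as tmp:
    for stamp, difference in [("20250101_000000", -1.0), ("20251230_221214", -8.0)]:
        path = Path(tmp) / f"catastrophic_forgetting_check_{stamp}.json"
        path.write_text(
            json.dumps({"timestamp": stamp, "difference": difference}),
            encoding="utf-8",
        )
    latest = load_latest_result(tmp)

print(latest["timestamp"], latest["difference"])
```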


## Cell 9: Summary for Model Card

In [9]:
print("\n" + "=" * 70)
print("📋 SUMMARY FOR MODEL CARD / DOCUMENTATION")
print("=" * 70)

# Choose the verdict and closing sentence from the measured difference, so the
# summary does not claim retained capabilities after a drop of more than 5 points.
if diff >= -5:
    verdict = "No catastrophic forgetting detected."
    closing = (
        "The fine-tuned model maintains general reasoning capabilities\n"
        "while gaining specialized terminal command generation skills."
    )
else:
    verdict = "Some capability degradation observed."
    closing = (
        f"The fine-tuned model scores {abs(diff):.1f} percentage points lower than the\n"
        "base model, which may indicate some loss of general capabilities."
    )

summary = f"""
## Catastrophic Forgetting Check

We evaluated both the base model and fine-tuned model on the HellaSwag
benchmark to check for catastrophic forgetting.

| Model | HellaSwag Accuracy |
|-------|-------------------|
| Base Qwen3-0.6B | {base_accuracy:.1f}% |
| Fine-tuned (terminal commands) | {ft_accuracy:.1f}% |

**Result**: {verdict}
{closing}
"""

print(summary)
print("=" * 70)
print("\n✅ Copy the above summary to your model card on HuggingFace.")


📋 SUMMARY FOR MODEL CARD / DOCUMENTATION

## Catastrophic Forgetting Check

We evaluated both the base model and fine-tuned model on the HellaSwag
benchmark to check for catastrophic forgetting.

| Model | HellaSwag Accuracy |
|-------|-------------------|
| Base Qwen3-0.6B | 44.0% |
| Fine-tuned (terminal commands) | 36.0% |

**Result**: Some capability degradation observed.
The fine-tuned model scores 8.0 percentage points lower than the
base model, which may indicate some loss of general capabilities.


✅ Copy the above summary to your model card on HuggingFace.
