# Base Model Test - LLaMAntino-3-ANITA-8B-Inst-DPO-ITA

**Purpose**: Test whether the base model (WITHOUT the LoRA adapter) makes the same grammar mistakes as the LoRA fine-tuned model.

**Hypothesis**: The base model handles Italian correctly, but the LoRA adapter (alpha=24) is too strong and causes catastrophic forgetting.

**Test Cases**:
1. Gender agreement: "ragno" (spider) - should use masculine "il/un"
2. Gender agreement: "lombrico" (worm) - should use masculine "il/un"
3. Topic adherence: Generate exercises about spiders without topic drift
4. Semantic coherence: Avoid nonsensical scenarios

In [None]:
# Cell 1: Install dependencies
!pip install transformers accelerate torch bitsandbytes -q  # bitsandbytes is required for 4-bit loading
print("✅ Dependencies installed")

In [None]:
# Cell 2: Load base model (NO LoRA)
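import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

BASE_MODEL = "swap-uniba/LLaMAntino-3-ANITA-8B-Inst-DPO-ITA"

print("Loading base model (this may take 2-3 minutes)...")

# 4-bit quantization (via bitsandbytes) so the 8B model fits in a Colab GPU.
# Passing load_in_4bit directly to from_pretrained is deprecated.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,
    device_map="auto",
    quantization_config=quant_config,
)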

print(f"✅ Base model loaded: {BASE_MODEL}")
print(f"   Device: {model.device}")
print(f"   Memory: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

In [None]:
# Cell 3: Define generation function
def generate_exercises(cefr_level, grammar_focus, topic, quantity, exercise_types):
    """
    Generate exercises using base model.
    """
    system_prompt = (
        "You are an expert Italian language teacher. Generate high-quality Italian language exercises "
        "based on the assignment specification. Output ONLY a JSON array of exercises.\n\n"
        "Each exercise must have:\n"
        "- type: exercise type\n"
        "- question: the exercise question in Italian\n"
        "- answer: the correct answer\n"
        "- explanation: explanation in Italian\n"
        "- options: array of 4 options (for multiple_choice only)\n"
    )
    
    user_prompt = (
        f"Generate {quantity} Italian language exercises:\n"
        f"CEFR Level: {cefr_level}\n"
        f"Grammar Focus: {grammar_focus}\n"
        f"Topic: {topic}\n"
        f"Exercise Types: {', '.join(exercise_types)}\n\n"
        f"CRITICAL RULES:\n"
        f"1. TOPIC: Every exercise MUST be about \"{topic}\" - stay on topic throughout\n"
        f"2. REALISM: Use factual, natural scenarios appropriate for the topic\n"
        f"3. GRAMMAR: Every exercise MUST test \"{grammar_focus}\" at {cefr_level} level\n"
        f"4. MULTIPLE CHOICE: Provide 4 DIFFERENT grammatical forms as options\n"
        f"5. CONSISTENCY: Do not mix different topics or introduce unrelated subjects\n\n"
        f"Output ONLY the JSON array, no additional text."
    )
    
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ]
    
    # Format with chat template
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    
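    # Generate (chat templates normally insert BOS already, so skip adding special tokens again)
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)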
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=1500,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    
    # Decode only the newly generated tokens, not the echoed prompt
    prompt_len = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
    
    # Extract the JSON array from the model's reply
    if "[{" in response:
        json_start = response.find('[{')
        json_text = response[json_start:]
        # Find end of JSON array
        if "}]" in json_text:
            json_end = json_text.rfind("}]") + 2
            json_text = json_text[:json_end]
        return json_text
    else:
        return response

print("✅ Generation function defined")

---

## Test 1: Spider Exercises (Gender Test)

**Expected**: "il ragno" / "un ragno" (masculine)

**Failed in LoRA**: "la ragno" (feminine - WRONG)
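
The same substring check is repeated in the test cells, so it can be pulled into one table-driven helper. The sketch below keeps the notebook's substring matching but reports masculine and feminine hits separately instead of stopping at the first match. The names `find_article_usage`, `MASCULINE_ARTICLES`, and `FEMININE_ARTICLES` are hypothetical and not part of the notebook.

```python
# Article forms limited to the ones this notebook already checks for.
MASCULINE_ARTICLES = ("il", "un", "i")
FEMININE_ARTICLES = ("la", "una", "le")


def find_article_usage(text, noun_stem):
    """Return the article + noun_stem phrases found in text, grouped by gender.

    Uses plain substring matching on the lowercased text, like the test cells.
    """
    lowered = text.lower()

    def hits(articles):
        return [f"{a} {noun_stem}" for a in articles if f"{a} {noun_stem}" in lowered]

    return {
        "masculine": hits(MASCULINE_ARTICLES),
        "feminine": hits(FEMININE_ARTICLES),
    }


# Example with the failure mode seen in the LoRA model.
usage = find_article_usage("Ieri la ragno ha mangiato una mosca.", "ragno")
print(usage)
```

Any entry under `"feminine"` for "ragno" or "lombric" counts as a gender error. Pass the same stems the cells use: a shorter stem such as "ragn" would also match "la ragnatela", which is correctly feminine.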

In [None]:
# Cell 4: Test with spiders (gender test)
import json

print("=== TEST 1: Spider Exercises (Gender Test) ===")
print("Topic: spiders")
print("Grammar: past_tense")
print("Expected: 'il ragno' or 'un ragno' (masculine)\n")

result = generate_exercises(
    cefr_level="A2",
    grammar_focus="past_tense",
    topic="spiders",
    quantity=3,
    exercise_types=["fill_in_blank", "translation", "multiple_choice"]
)

print("\n" + "="*80)
print("RAW OUTPUT:")
print("="*80)
print(result)

# Try to parse JSON
try:
    exercises = json.loads(result)
    print("\n" + "="*80)
    print("PARSED EXERCISES:")
    print("="*80)
    for i, ex in enumerate(exercises, 1):
        print(f"\nExercise {i}:")
        print(f"  Type: {ex.get('type')}")
        print(f"  Question: {ex.get('question')}")
        print(f"  Answer: {ex.get('answer')}")
        if ex.get('options'):
            print(f"  Options: {ex.get('options')}")
        
        # Check for gender errors
        question = ex.get('question', '')
        answer = ex.get('answer', '')
        text = question + " " + answer
        
        if 'la ragno' in text.lower() or 'una ragno' in text.lower():
            print("  ❌ GENDER ERROR FOUND: 'la ragno' or 'una ragno' (should be masculine)")
        elif 'il ragno' in text.lower() or 'un ragno' in text.lower():
            print("  ✅ CORRECT: Uses masculine article with 'ragno'")
            
except json.JSONDecodeError as e:
    print(f"\n❌ JSON parsing failed: {e}")

---

## Test 2: Worm Exercises (Gender Test)

**Expected**: "il lombrico" / "un lombrico" (masculine)

**Failed in LoRA**: "le lombrichi" (feminine - WRONG)

In [None]:
# Cell 5: Test with worms (gender test)
print("=== TEST 2: Worm Exercises (Gender Test) ===")
print("Topic: worms")
print("Grammar: past_tense")
print("Expected: 'il lombrico' or 'un lombrico' (masculine)\n")

result = generate_exercises(
    cefr_level="A2",
    grammar_focus="past_tense",
    topic="worms",
    quantity=3,
    exercise_types=["fill_in_blank", "translation", "multiple_choice"]
)

print("\n" + "="*80)
print("RAW OUTPUT:")
print("="*80)
print(result)

# Try to parse JSON
try:
    exercises = json.loads(result)
    print("\n" + "="*80)
    print("PARSED EXERCISES:")
    print("="*80)
    for i, ex in enumerate(exercises, 1):
        print(f"\nExercise {i}:")
        print(f"  Type: {ex.get('type')}")
        print(f"  Question: {ex.get('question')}")
        print(f"  Answer: {ex.get('answer')}")
        if ex.get('options'):
            print(f"  Options: {ex.get('options')}")
        
        # Check for gender errors
        question = ex.get('question', '')
        answer = ex.get('answer', '')
        text = question + " " + answer
        
        if 'la lombric' in text.lower() or 'una lombric' in text.lower() or 'le lombric' in text.lower():
            print("  ❌ GENDER ERROR FOUND: Uses feminine article with 'lombrico' (should be masculine)")
        elif 'il lombric' in text.lower() or 'un lombric' in text.lower() or 'i lombric' in text.lower():
            print("  ✅ CORRECT: Uses masculine article with 'lombrico'")
        
        # Check for topic drift
        if any(word in text.lower() for word in ['lumaca', 'snail', 'granchio', 'crab']):
            print("  ⚠️  TOPIC DRIFT: Mentions snails or crabs instead of worms")
            
except json.JSONDecodeError as e:
    print(f"\n❌ JSON parsing failed: {e}")

---

## Test 3: Simple Italian Sentences (General Grammar Test)

Test whether the base model can generate grammatically correct Italian sentences.
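
Generation in this notebook uses `do_sample=True`, so one answer per prompt is a single random draw. Repeating each prompt several times and reporting how often a feminine article appears gives a steadier picture. The sketch below assumes a hypothetical `sample_fn` callable that wraps the generation call; it is stubbed here so the example runs without the model.

```python
def gender_error_rate(sample_fn, prompt, error_phrases, n_trials=10):
    """Fraction of sampled answers that contain any of the error phrases."""
    errors = 0
    for _ in range(n_trials):
        answer = sample_fn(prompt).lower()
        if any(phrase in answer for phrase in error_phrases):
            errors += 1
    return errors / n_trials


# Stub standing in for the real model call (assumption: it returns the answer text).
def fake_sampler(prompt):
    return "Il ragno è nero."


rate = gender_error_rate(
    fake_sampler,
    "Complete the sentence with correct article: ___ ragno è nero.",
    ["la ragno", "una ragno"],
    n_trials=5,
)
print(f"Gender error rate: {rate:.0%}")
```

The same function can be run on the base model and the LoRA model to compare error rates rather than single outputs.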

In [None]:
# Cell 6: Simple grammar test
print("=== TEST 3: Simple Italian Grammar ===")

test_prompts = [
    "Complete the sentence with correct article: ___ ragno è nero. (The spider is black)",
    "Complete the sentence with correct article: ___ lombrico vive nella terra. (The worm lives in the soil)",
    "Translate to Italian: The spider climbed the wall.",
]

for prompt in test_prompts:
    messages = [
        {"role": "system", "content": "You are an Italian language expert. Answer concisely and correctly."},
        {"role": "user", "content": prompt}
    ]
    
    formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
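    # Chat templates normally insert BOS already, so skip adding special tokens again
    inputs = tokenizer(formatted_prompt, return_tensors="pt", add_special_tokens=False).to(model.device)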
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=100,
            temperature=0.3,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    
    # Decode only the newly generated tokens so the echoed prompt is excluded
    # (the Llama 3 chat template has no "<|assistant|>" marker to split on)
    prompt_len = inputs["input_ids"].shape[1]
    answer = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True).strip()
    
    print(f"\nPrompt: {prompt}")
    print(f"Answer: {answer}")
    
    # Check for errors
    if 'la ragno' in answer.lower() or 'una ragno' in answer.lower():
        print("❌ GENDER ERROR: 'la ragno' or 'una ragno'")
    elif 'la lombric' in answer.lower() or 'una lombric' in answer.lower():
        print("❌ GENDER ERROR: 'la lombrico' or 'una lombrico'")
    else:
        print("✅ No obvious gender errors")

---

## Summary and Analysis

Run all cells above, then analyze the results:

### If Base Model is CORRECT:
- ✅ Base model uses "il ragno" and "il lombrico" correctly
- ✅ Base model adheres to topics
- ✅ Base model generates realistic scenarios
- **Conclusion**: The errors are introduced by fine-tuning, which is consistent with the hypothesis that LoRA alpha=24 is too strong and causes catastrophic forgetting
- **Solution**: Lower LoRA alpha to 4-6 to preserve base knowledge
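
In the standard LoRA formulation the adapter update is multiplied by `alpha / rank`, so an alpha value only means something relative to the rank. A small sketch of that arithmetic follows. The rank of 8 comes from the Next Steps list; applying it to the alpha=24 run is an assumption, since this notebook does not state that run's rank.

```python
def lora_scale(alpha, rank):
    """Scaling factor applied to the LoRA update in the standard formulation."""
    return alpha / rank


# rank=8 is assumed for every row, including the original alpha=24 run.
for alpha in (24, 6, 4):
    print(f"alpha={alpha:>2}, rank=8 -> scale={lora_scale(alpha, 8):.2f}")
```

Under that assumption, dropping alpha from 24 to 6 reduces the adapter's contribution by a factor of four.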

### If Base Model is WRONG:
- ❌ Base model also makes gender errors
- **Conclusion**: A different base model or a different approach is needed
- **Alternative**: Use spaCy post-processing to detect and correct gender errors

### Next Steps:
1. If the base model is correct: Implement a weaker LoRA configuration (alpha=6, rank=8, fewer target modules)
2. If the base model is wrong: Consider a different base model or rely on spaCy validation
3. In either case, implement EWC (Elastic Weight Consolidation) regularization to limit forgetting during training
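
For reference, step 3 relies on the usual EWC objective, which adds a quadratic penalty that pulls each weight back toward its pre-fine-tuning value in proportion to how important that weight was for the original task. The symbols below are the conventional ones and are assumptions here: $\theta^*$ is the weights before fine-tuning, $F_i$ is the diagonal Fisher information estimate for weight $i$, and $\lambda$ sets the penalty strength. How the penalty would be applied to LoRA adapter weights is not specified in this notebook.

```latex
\mathcal{L}(\theta) = \mathcal{L}_{\text{task}}(\theta)
  + \frac{\lambda}{2} \sum_i F_i \left(\theta_i - \theta^*_i\right)^2
```

A larger $\lambda$ preserves more of the base model's behavior at the cost of slower adaptation to the exercise-generation task.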