# GRPO Training for Italian Exercise Generator

Train the V5 model using Group Relative Policy Optimization (GRPO) with a comprehensive reward function.

**Hardware**: A100 GPU (Colab Pro)
**Expected time**: ~2-4 hours for 2000 samples

## Setup

In [1]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Navigate to project
%cd /content/drive/MyDrive/Colab\ Notebooks/italian_teacher

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/Colab Notebooks/italian_teacher


In [2]:
# Install dependencies
!pip install -q transformers trl accelerate peft datasets spacy sentence-transformers bitsandbytes json5
!python -m spacy download it_core_news_sm

Collecting it-core-news-sm==3.8.0
  Using cached https://github.com/explosion/spacy-models/releases/download/it_core_news_sm-3.8.0/it_core_news_sm-3.8.0-py3-none-any.whl (13.0 MB)
✔ Download and installation successful
You can now load the package via spacy.load('it_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [3]:
import json
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import GRPOConfig, GRPOTrainer
from datasets import Dataset

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

PyTorch version: 2.8.0+cu126
CUDA available: True
GPU: NVIDIA A100-SXM4-80GB


## Load Reward Function

In [4]:
from src.rl.reward_function import ExerciseRewardFunction

# Initialize reward function with CUDA (sentence transformer is small ~420MB)
print("Loading reward function...")
print("⚠️ Using CUDA for reward computation (sentence transformer is tiny)")
reward_fn = ExerciseRewardFunction(device="cuda")  # ✅ Keep on GPU
print("✅ Reward function ready (running on GPU)")
print("   Sentence transformer: ~420MB (acceptable overhead)")

Loading reward function...
⚠️ Using CUDA for reward computation (sentence transformer is tiny)
Loading spaCy model: it_core_news_sm...
✅ spaCy model loaded
Reward function will use device: cuda
Initializing scorers...
Pre-loading CEFR vocabulary (16,887 words)...
✅ Loaded 16887 Italian words from vocabulary list
✅ Loaded vocabulary for all CEFR levels
Loading sentence transformer for topic similarity...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


✅ Sentence transformer loaded in cuda
✅ Reward function initialized with all scorers (including coherence)
✅ Reward function ready (running on GPU)
   Sentence transformer: ~420MB (acceptable overhead)


## Load Training Requests

In [5]:
import os

# Load pre-generated training requests
if os.path.exists("src/rl/training_requests.json"):
    print("Loading existing training requests...")
    with open("src/rl/training_requests.json", "r") as f:
        training_requests = json.load(f)
else:
    # If not exists, generate them
    from src.rl.generate_training_requests import generate_training_requests
    print("Generating new training requests...")
    training_requests = generate_training_requests(
        num_requests=2000,
        output_path="src/rl/training_requests.json"
    )

print(f"✅ Loaded {len(training_requests)} training requests")

Loading existing training requests...
✅ Loaded 2000 training requests


## Format Dataset

In [6]:
def format_prompt_with_chat_template(request: dict, tokenizer) -> str:
    """
    Format request using chat template + EXACT API prompt format.

    Combines:
    1. Llama3 chat template (required for V4)
    2. Detailed API prompt (ensures proper JSON format)
    """
    topic = request.get('topic', 'general Italian')
    grammar = request.get('grammar_focus', 'general practice')

    # Create numbered placeholders to guide the model
    exercise_numbers = ", ".join([f"#{i+1}" for i in range(request['num_exercises'])])

    topic_instruction = f"about '{topic}'"
    grammar_instruction = f"focusing on {grammar}"
    focus_text = f"{topic_instruction} {grammar_instruction}".strip()

    # Build grammar-specific instruction
    grammar_rule = ""
    if "past" in grammar.lower() or "passato" in grammar.lower():
        grammar_rule = "\n⚠️ MANDATORY: Use ONLY past tense (passato prossimo like 'ho fatto', 'sono andato' OR imperfetto like 'facevo', 'andavo'). NO present tense!"
    elif "present" in grammar.lower() or "presente" in grammar.lower():
        grammar_rule = "\n⚠️ MANDATORY: Use ONLY present tense (presente indicativo like 'faccio', 'vado'). NO past or future!"
    elif "future" in grammar.lower() or "futuro" in grammar.lower():
        grammar_rule = "\n⚠️ MANDATORY: Use ONLY future tense (futuro semplice like 'farò', 'andrò'). NO present or past!"

    # EXACT API PROMPT FORMAT - goes in user message
    user_message = f"""Create exactly {request['num_exercises']} Italian language exercises ({exercise_numbers}) in JSON format {focus_text}.

REQUIREMENTS:
Level: {request['level']}
Topic: {topic}
Grammar: {grammar}{grammar_rule}
Exercise types: {', '.join(request['exercise_types'])}

CRITICAL RULES:
1. TOPIC: Every exercise MUST be about "{topic}" - stay on topic throughout
2. REALISM: Use factual, natural scenarios appropriate for the topic
3. GRAMMAR: EVERY SINGLE exercise MUST test "{grammar}" at {request['level']} level
4. MULTIPLE CHOICE: Provide 4 DIFFERENT grammatical forms as options
5. CONSISTENCY: Do not mix different topics or introduce unrelated subjects

OUTPUT FORMAT - JSON array with exercises testing {grammar}:
[
  {{"type": "fill_in_blank", "question": "[Italian sentence about {topic} with ___ blank for {grammar}]", "correct_answer": "[conjugated form in {grammar}]", "options": null, "explanation": "[grammar rule explanation]"}},
  {{"type": "translation", "question": "Translate: [English sentence about {topic} in {grammar}]", "correct_answer": "[Italian translation using {grammar}]", "options": null, "explanation": "[grammar note]"}},
  {{"type": "multiple_choice", "question": "[Italian sentence about {topic} with blank]", "correct_answer": "[correct form in {grammar}]", "options": ["[alt1]", "[alt2]", "[alt3]", "[alt4]"], "explanation": "[why this form is correct]"}}
]

NOW GENERATE {request['num_exercises']} EXERCISES ABOUT "{topic}" TESTING "{grammar}" (remember: {grammar} ONLY!):
["""

    # Apply chat template with system + user messages
    messages = [
        {"role": "system", "content": "You are an expert Italian language teacher. Generate high-quality exercises based on the assignment specification. Output exercises in JSON format."},
        {"role": "user", "content": user_message}
    ]

    # Use tokenizer's chat template to format properly
    formatted_prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    return formatted_prompt

print("✅ Prompt formatter defined")

✅ Prompt formatter defined
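The formatter above falls back to defaults for `topic` and `grammar_focus`, but it indexes `level`, `num_exercises`, and `exercise_types` directly, so a request missing any of those raises a `KeyError` partway through dataset creation. A minimal sketch of a pre-check follows; `check_request` and `REQUIRED_KEYS` are hypothetical names used for illustration and are not part of the project code.

```python
# Keys the formatter indexes directly (no .get() default), per the cell above.
REQUIRED_KEYS = ("level", "num_exercises", "exercise_types")


def check_request(request: dict) -> list[str]:
    """Return a list of problems found in a request dict (empty list means OK).

    Hypothetical helper for illustration only.
    """
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in request]
    if problems:
        return problems
    if not isinstance(request["num_exercises"], int) or request["num_exercises"] < 1:
        problems.append("num_exercises must be a positive integer")
    if not request["exercise_types"]:
        problems.append("exercise_types must be a non-empty list")
    return problems


example = {
    "level": "A2",
    "grammar_focus": "past_tense",
    "topic": "travel",
    "num_exercises": 3,
    "exercise_types": ["fill_in_blank", "translation"],
}
print(check_request(example))
print(check_request({"topic": "travel"}))
```

Running a check like this over `training_requests` before building prompts would surface malformed entries early instead of in the middle of the list comprehension.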


In [7]:
import random
from datasets import Dataset

# Load tokenizer first to apply chat template
from transformers import AutoTokenizer
MODEL_PATH = "./models/italian_v8_grpo_round2"
temp_tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

# Create prompts WITH CHAT TEMPLATE + DETAILED INSTRUCTIONS
prompts = [format_prompt_with_chat_template(req, temp_tokenizer) for req in training_requests]

# ⚠️ ROUND 2 OPTIMIZED: Use 400 samples (balanced between quality and cost)
# A 400-sample subset keeps memory use and training time manageable
ROUND2_SIZE = 400
if len(prompts) > ROUND2_SIZE:
    random.seed(42)  # Reproducible sampling
    random_indices = random.sample(range(len(prompts)), ROUND2_SIZE)
    prompts = [prompts[i] for i in random_indices]
    training_requests_subset = [training_requests[i] for i in random_indices]
else:
    training_requests_subset = training_requests

# Create dataset
train_dataset = Dataset.from_dict({
    "prompt": prompts,
    "request": training_requests_subset,
})

print(f"✅ Created dataset with {len(train_dataset)} examples (ROUND 2 OPTIMIZED)")
print(f"   Using: Chat template + detailed API instructions")
print(f"   Seed: 42 (reproducible)")
print(f"   Size: {len(train_dataset)} (balanced for quality & cost)")
print(f"\nExample prompt (first 400 chars):\n{train_dataset[0]['prompt'][:400]}...")


✅ Created dataset with 400 examples (ROUND 2 OPTIMIZED)
   Using: Chat template + detailed API instructions
   Seed: 42 (reproducible)
   Size: 400 (balanced for quality & cost)

Example prompt (first 400 chars):
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert Italian language teacher. Generate high-quality exercises based on the assignment specification. Output exercises in JSON format.<|eot_id|><|start_header_id|>user<|end_header_id|>

Create exactly 4 Italian language exercises (#1, #2, #3, #4) in JSON format about 'librario' focusing on past_tense.

REQUIREMENTS:
Level: A...
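The subset step only works if each prompt stays paired with the request it was built from, since the reward function later scores completions against the `request` column. The sketch below isolates that idea; `sample_aligned` is a hypothetical helper, and it uses a local `random.Random` instance rather than the global `random.seed(42)` call in the cell above, so the sampled indices will not match the notebook's.

```python
import random


def sample_aligned(prompts, requests, size, seed=42):
    """Sample `size` positions once and apply the same positions to both lists."""
    if len(prompts) != len(requests):
        raise ValueError("prompts and requests must have the same length")
    if len(prompts) <= size:
        return list(prompts), list(requests)
    rng = random.Random(seed)  # local RNG; the global random state is left untouched
    indices = rng.sample(range(len(prompts)), size)
    return [prompts[i] for i in indices], [requests[i] for i in indices]


# Toy data: prompt i is derived from request i.
prompts = [f"prompt-{i}" for i in range(10)]
requests = [{"id": i} for i in range(10)]

sub_prompts, sub_requests = sample_aligned(prompts, requests, size=4)
for p, r in zip(sub_prompts, sub_requests):
    print(p, r)
```

Sampling indices once and indexing both lists with them is what keeps the pairs aligned; sampling the two lists independently would silently break the reward computation.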


# Load YOUR V4 model - NOT a base model!
# V4 was already trained on exercise generation, so it knows the format
MODEL_NAME = "./models/italian_exercise_generator_v4_merged"

print(f"Loading YOUR V4 model: {MODEL_NAME}")
print("⚠️ IMPORTANT: Using V4 model that already knows exercise format!")

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Add padding token if not present
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load model in bfloat16 (no quantization is applied here)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

print("✅ V4 Model loaded - this model already generates valid exercises!")

In [8]:
# ⚠️ UPDATED FOR ROUND 2 - MEMORY OPTIMIZED
# Model: Round 1 GRPO checkpoint (path set in MODEL_NAME below)
# Optimizations: gradient checkpointing + KV cache disabled during training

MODEL_NAME = "./models/italian_v8_grpo_round2"  # ✅ Round 1 GRPO model

print(f"Loading Round 1 GRPO model: {MODEL_NAME}")
print("⚠️ Training on top of Round 1 with improved reward function")
print("⚠️ Applying memory optimizations...")

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Add padding token if not present
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load model with MEMORY OPTIMIZATIONS
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
    use_cache=False,  # ⚠️ Disable KV cache during training (saves memory)
)

# Enable gradient checkpointing (trades compute for memory)
model.gradient_checkpointing_enable()
print("✅ Gradient checkpointing enabled (saves ~20-30% memory)")

print("✅ Model loaded with memory optimizations")

Loading Round 1 GRPO model: ./models/italian_v8_grpo_round2
⚠️ Training on top of Round 1 with improved reward function
⚠️ Applying memory optimizations...


`torch_dtype` is deprecated! Use `dtype` instead!


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

✅ Gradient checkpointing enabled (saves ~20-30% memory)
✅ Model loaded with memory optimizations


## Define Reward Function (TRL Format)

In [9]:
def italian_exercise_reward(prompts=None, completions=None, completion_ids=None, **kwargs):
    """
    Reward function for Italian exercise generation.

    TRL calls this function with keyword arguments:
    - prompts: List of prompt strings
    - completions: List of generated completion strings
    - completion_ids: Token IDs of completions
    - **kwargs: All dataset columns (includes 'request')

    Returns:
        List of float rewards (0.0 to 1.0)
    """
    import re
    import json5  # More lenient JSON parser

    # Extract 'request' from kwargs (comes from dataset column)
    requests = kwargs.get('request', [])

    rewards = []

    for completion, req in zip(completions, requests):
        try:
            # Parse generated JSON
            completion_text = completion.strip()

            # ✅ ROBUST JSON PARSING STRATEGY:
            # Try multiple parsing strategies in order of preference

            # Strategy 1: Standard JSON (fastest)
            try:
                exercises = json.loads(completion_text)
            except json.JSONDecodeError:
                # Strategy 2: Fix common issues and retry
                fixed_text = completion_text

                # Fix 1: Replace \' with ' (Python-style escaping)
                fixed_text = fixed_text.replace("\\'", "'")

                # Fix 2: Convert single-quoted keys to double-quoted keys (values are left as-is)
                fixed_text = re.sub(r"'([^']*)'(\s*:)", r'"\1"\2', fixed_text)

                # Fix 3: Close incomplete JSON arrays/objects
                if fixed_text.count('[') > fixed_text.count(']'):
                    fixed_text += ']'
                if fixed_text.count('{') > fixed_text.count('}'):
                    last_complete = fixed_text.rfind('}')
                    if last_complete > 0:
                        fixed_text = fixed_text[:last_complete + 1] + ']'

                try:
                    exercises = json.loads(fixed_text)
                except json.JSONDecodeError:
                    # Strategy 3: Use json5 (handles trailing commas, comments, etc.)
                    try:
                        exercises = json5.loads(fixed_text)
                    except Exception:
                        # Strategy 4: Extract individual valid exercises
                        exercises = []
                        for match in re.finditer(r'\{[^{}]*"type"[^{}]*?\}', fixed_text, re.DOTALL):
                            try:
                                ex = json.loads(match.group())
                                if 'type' in ex and 'question' in ex:
                                    exercises.append(ex)
                            except Exception:
                                continue

                        if not exercises:
                            raise ValueError("No valid exercises found")

            if not isinstance(exercises, list):
                exercises = [exercises]

            if not exercises:
                raise ValueError("Empty exercise list")

            # Score each exercise with comprehensive reward function
            scores = []
            for exercise in exercises:
                score, _ = reward_fn.score(exercise, req)
                scores.append(score / 100.0)  # Normalize to 0-1

            # Average score across all exercises in the completion
            raw_reward = sum(scores) / len(scores) if scores else 0.0

            # ⚠️ REWARD CALIBRATION: Make rewards more discriminative
            # Current problem: scores cluster around 0.90-1.0, too little variance
            # Solution: Apply power scaling to spread the distribution

            # Power scaling: reward^2 makes differences more pronounced
            # 0.95^2 = 0.90  (5% difference becomes 10%)
            # 0.90^2 = 0.81  (10% difference becomes 19%)
            # 0.80^2 = 0.64  (20% difference becomes 36%)
            calibrated_reward = raw_reward ** 2

            # Alternative: linear rescaling of the top range for even more discrimination
            # calibrated_reward = (raw_reward - 0.7) * 2.5  # Maps [0.7-1.0] to [0-0.75]
            # calibrated_reward = max(0, min(1, calibrated_reward))

            reward = calibrated_reward

        except Exception as e:
            # If all parsing strategies fail, give low reward
            reward = 0.1
            print(f"❌ Failed to parse: {str(e)[:100]}")
            print(f"   First 200 chars: {completion_text[:200]}")

        rewards.append(reward)

    # Print rewards with both raw and calibrated statistics
    raw_avg = sum([r**0.5 for r in rewards]) / len(rewards)  # Undo the squaring to approximate raw scores (parse-failure rewards of 0.1 are included)
    print(f"🎯 Rewards: min={min(rewards):.3f}, max={max(rewards):.3f}, avg={sum(rewards)/len(rewards):.3f} (raw_avg={raw_avg:.3f})")

    # Clear CUDA cache to prevent memory fragmentation
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

    return rewards

print("✅ Reward function defined (with discriminative calibration)")

✅ Reward function defined (with discriminative calibration)
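The calibration step squares the averaged raw score. A short worked example of that arithmetic, using the same sample values as the comments in the cell above; `calibrate` is a hypothetical stand-in for the single line `raw_reward ** 2`.

```python
def calibrate(raw_reward: float) -> float:
    """Power scaling as in the cell above: square the raw reward."""
    return raw_reward ** 2


raw_scores = [0.95, 0.90, 0.80]
for r in raw_scores:
    c = calibrate(r)
    print(f"raw={r:.2f}  calibrated={c:.4f}  distance from 1.0: {1 - r:.2f} -> {1 - c:.4f}")

# Gap between two completions before and after calibration.
raw_gap = 0.95 - 0.90
calibrated_gap = calibrate(0.95) - calibrate(0.90)
print(f"gap between 0.95 and 0.90: raw={raw_gap:.4f}, calibrated={calibrated_gap:.4f}")
```

Because a² − b² = (a − b)(a + b), the gap between two scores is multiplied by their sum, so differences near the top of the range come out a little under twice as large after squaring.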


# GRPO configuration - OPTIMIZED FOR PILOT RUN
grpo_config = GRPOConfig(
    output_dir="./models/italian_v5_grpo_pilot",
    num_train_epochs=1,  # Reduced from 3 to 1 for pilot
    per_device_train_batch_size=1,  # Reduced from 2 to 1 to save memory
    gradient_accumulation_steps=4,  # Reduced from 8 to 4 (effective batch = 4)
    learning_rate=5e-6,
    warmup_steps=20,  # Reduced from 50
    logging_steps=5,  # More frequent logging for pilot
    save_steps=100,  # Save more frequently for pilot
    save_total_limit=2,
    bf16=True,
    remove_unused_columns=False,
    
    # GRPO-specific parameters - OPTIMIZED
    num_generations=4,  # Keep 4 for GRPO algorithm
    max_completion_length=320,  # Reduced from 512 (exercises don't need 512 tokens)
    temperature=0.7,
    
    # Generation optimization
    generation_batch_size=4,  # 4 completions per generation batch (one full group) to save memory
    
    # CRITICAL: Add generation kwargs with stop tokens (like your API uses!)
    generation_kwargs={
        "do_sample": True,
        "top_p": 0.9,
        "eos_token_id": [128009],  # Stop token
        "pad_token_id": 128009,
    }
)

print("✅ GRPO config created (OPTIMIZED FOR PILOT)")
print(f"   Added stop tokens to prevent verbose output!")
print(f"   Estimated time: ~30-45 minutes for 200 samples")
print(f"   Memory usage should be ~35-40GB (vs 62GB)")
print(f"\n   For full training:")
print(f"   - Change ROUND2_SIZE to 2000 in dataset cell")
print(f"   - Change num_train_epochs to 2-3")
print(f"   - Consider increasing batch sizes if training succeeds")

In [10]:
# GRPO Configuration - OPTIMIZED FOR MEMORY & SPEED
# Goal: Better throughput while staying under memory limit

grpo_config = GRPOConfig(
    output_dir="./models/italian_v9_grpo_round2",

    # Training schedule
    num_train_epochs=1,
    per_device_train_batch_size=6,
    gradient_accumulation_steps=8,  # ⚠️ Increased to 8 for smoother gradient estimates

    # Learning rate
    learning_rate=1.5e-6,
    warmup_steps=30,

    # Logging & checkpoints
    logging_steps=10,
    save_steps=300,
    save_total_limit=2,

    # Precision
    bf16=True,
    remove_unused_columns=False,

    # GRPO-specific - OPTIMIZED FOR YOUR 63GB USAGE
    num_generations=3,  # Completions sampled per prompt (group size)
    max_completion_length=300,
    temperature=0.7,
    generation_batch_size=18,  # 18 completions per generation batch (divisible by num_generations=3)
    # generation_batch_size controls how many completions generate in parallel
    # Too high = long waits, memory spikes
    # Too low = slow generation

    # Stop tokens
    generation_kwargs={
        "do_sample": True,
        "top_p": 0.9,
        "eos_token_id": [128009],
        "pad_token_id": 128009,
    }
)
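The batch-related settings in this config interact, so it can help to spell out the arithmetic. The sketch below is plain Python and does not call TRL. It assumes that batch sizes are counted in completions rather than prompts, that training runs in a single process, and that the dataset holds 400 prompts; the two divisibility checks mirror constraints that recent TRL releases are understood to enforce, which should be confirmed against the installed version.

```python
# Values copied from the GRPOConfig cell above.
per_device_train_batch_size = 6
gradient_accumulation_steps = 8
num_generations = 3
generation_batch_size = 18

# Assumptions, not taken from the config:
num_processes = 1   # single-process training
num_prompts = 400   # size of the sampled dataset

# Each prompt needs a full group of num_generations completions.
assert generation_batch_size % num_generations == 0
# Assumed constraint: a generation batch splits evenly into per-device batches.
assert generation_batch_size % (per_device_train_batch_size * num_processes) == 0

completions_per_optimizer_step = (
    per_device_train_batch_size * num_processes * gradient_accumulation_steps
)
prompts_per_generation_batch = generation_batch_size // num_generations
total_completions = num_prompts * num_generations

print(f"completions per optimizer step: {completions_per_optimizer_step}")
print(f"unique prompts per generation batch: {prompts_per_generation_batch}")
print(f"total completions for one epoch: {total_completions}")
```

If either assertion fails after changing a value, the config is likely to be rejected at trainer construction, so this check is cheaper than waiting for the model to load.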



## Initialize GRPO Trainer

In [11]:
trainer = GRPOTrainer(
    model=model,
    args=grpo_config,  # Changed from 'config' to 'args'
    reward_funcs=italian_exercise_reward,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # Changed from 'tokenizer' to 'processing_class'
)

The model is already on multiple devices. Skipping the move to device specified in `args`.


## Start Training

In [12]:
trainer.train()

print("\n✅ Training complete!")

# Save model
output_dir = "./models/italian_v9_grpo_round2"
print(f"\nSaving model to {output_dir}...")
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)
print("✅ Model saved!")

`generation_config` default values have been modified to match model-specific defaults: {'max_length': 8192}. If this is not desired, please set these values explicitly.


🎯 Rewards: min=0.538, max=0.944, avg=0.697 (raw_avg=0.833)
🎯 Rewards: min=0.591, max=1.000, avg=0.740 (raw_avg=0.858)
🎯 Rewards: min=0.546, max=0.914, avg=0.729 (raw_avg=0.852)


| Step | Training Loss |
|------|---------------|
| 10   | -0.0077       |
| 20   | 0.0149        |


🎯 Rewards: min=0.455, max=0.986, avg=0.725 (raw_avg=0.848)
🎯 Rewards: min=0.569, max=0.832, avg=0.705 (raw_avg=0.838)
🎯 Rewards: min=0.635, max=0.938, avg=0.760 (raw_avg=0.870)
🎯 Rewards: min=0.607, max=0.911, avg=0.780 (raw_avg=0.882)
🎯 Rewards: min=0.589, max=0.958, avg=0.699 (raw_avg=0.834)
🎯 Rewards: min=0.605, max=0.893, avg=0.751 (raw_avg=0.866)
🎯 Rewards: min=0.580, max=0.948, avg=0.748 (raw_avg=0.862)
🎯 Rewards: min=0.444, max=0.882, avg=0.690 (raw_avg=0.828)
🎯 Rewards: min=0.497, max=0.907, avg=0.712 (raw_avg=0.841)
🎯 Rewards: min=0.477, max=0.930, avg=0.737 (raw_avg=0.856)
🎯 Rewards: min=0.470, max=0.853, avg=0.625 (raw_avg=0.788)
🎯 Rewards: min=0.564, max=0.916, avg=0.702 (raw_avg=0.836)
🎯 Rewards: min=0.569, max=0.846, avg=0.701 (raw_avg=0.836)
🎯 Rewards: min=0.570, max=0.944, avg=0.718 (raw_avg=0.846)
🎯 Rewards: min=0.470, max=0.889, avg=0.693 (raw_avg=0.830)
🎯 Rewards: min=0.625, max=0.842, avg=0.710 (raw_avg=0.842)
🎯 Rewards: min=0.557, max=0.931, avg=0.738 (raw_avg=0.85
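The `raw_avg` figure in these log lines is computed by taking the square root of every calibrated reward, including the fixed 0.1 assigned on a parse failure, which was never squared in the first place. A minimal sketch of how that skews the number; `summarize` and the `parse_failed` flags are hypothetical, since the reward function above does not record which completions failed to parse.

```python
import math


def summarize(rewards, parse_failed):
    """Compare the logged raw_avg with one that excludes parse failures.

    rewards: calibrated rewards (raw ** 2, or the fallback value on a parse failure)
    parse_failed: booleans aligned with rewards (hypothetical bookkeeping)
    """
    # What the notebook prints: square root of every reward, failures included.
    logged_raw_avg = sum(math.sqrt(r) for r in rewards) / len(rewards)
    # Raw average over completions that were actually scored.
    scored = [math.sqrt(r) for r, failed in zip(rewards, parse_failed) if not failed]
    scored_raw_avg = sum(scored) / len(scored) if scored else float("nan")
    return logged_raw_avg, scored_raw_avg


# Toy batch: two scored completions (raw 0.9 and 0.8) and one parse failure.
rewards = [0.81, 0.64, 0.1]
parse_failed = [False, False, True]

logged, scored_only = summarize(rewards, parse_failed)
print(f"logged raw_avg: {logged:.3f}")
print(f"raw_avg over scored completions only: {scored_only:.3f}")
```

When no completion in a batch fails to parse, the two numbers coincide, so the logged `raw_avg` is exact in that case.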

In [13]:
import re
import json

# ========== 1️⃣ TEST REQUEST ==========
test_request = {
    "level": "C2",
    "grammar_focus": "past_tense",
    "topic": "shoe laces",
    "num_exercises": 2,
    "exercise_types": ["translation", "fill_in_blank"]
}

# ========== 2️⃣ FORMAT PROMPT ==========
test_prompt = format_prompt_with_chat_template(test_request, tokenizer)
print(f"PROMPT SENT TO MODEL:\n{test_prompt}\n")

inputs = tokenizer(test_prompt, return_tensors="pt").to(model.device)

# ========== 3️⃣ GENERATE ==========
outputs = model.generate(
    **inputs,
    max_new_tokens=400,
    temperature=0.7,
    do_sample=True,
    top_p=0.9,
    eos_token_id=128009,  # IMPORTANT: Same as training
    pad_token_id=128009
)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()
print("🔍 FULL RAW MODEL OUTPUT:")
print(generated_text)
print("--------------------------------------------------")

# ========== 4️⃣ CLEAN & EXTRACT JSON ONLY ==========
# Use REGEX to find JSON arrays. The decoded text includes the prompt, which
# contains a format example, so take the LAST match (the model's actual output)
matches = re.findall(r"\[\s*{.*?}\s*\]", generated_text, re.DOTALL)
if not matches:
    raise ValueError("❌ No valid JSON array found in output!")
json_text = matches[-1]  # Use the LAST block (real output)

print("✅ CLEANED JSON BLOCK EXTRACTED:")
print(json_text)
print("--------------------------------------------------")

# ========== 5️⃣ PARSE JSON SAFELY ==========
try:
    exercises = json.loads(json_text)
except json.JSONDecodeError as e:
    raise ValueError(f"❌ JSON parsing error: {e}")

if not isinstance(exercises, list):
    exercises = [exercises]

print(f"🎯 SUCCESS: Parsed {len(exercises)} exercises!")
print("--------------------------------------------------")

# ========== 6️⃣ SCORE OUTPUT ==========
for i, ex in enumerate(exercises):
    score, breakdown = reward_fn.score(ex, test_request)
    print(f"\n{'='*60}")
    print(f"📝 Exercise {i+1} Score: {score}/100")
    print(f"{'='*60}")
    print(f"Type: {ex.get('type')}")
    print(f"Question: {ex.get('question')}")
    print(f"Answer: {ex.get('correct_answer')}")
    if ex.get('options'):
        print(f"Options: {ex.get('options')}")
    print(f"\nDetails: {breakdown}")


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
Caching is incompatible with gradient checkpointing in LlamaDecoderLayer. Setting `past_key_values=None`.


PROMPT SENT TO MODEL:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert Italian language teacher. Generate high-quality exercises based on the assignment specification. Output exercises in JSON format.<|eot_id|><|start_header_id|>user<|end_header_id|>

Create exactly 2 Italian language exercises (#1, #2) in JSON format about 'shoe laces' focusing on past_tense.

REQUIREMENTS:
Level: C2
Topic: shoe laces
Grammar: past_tense
⚠️ MANDATORY: Use ONLY past tense (passato prossimo like 'ho fatto', 'sono andato' OR imperfetto like 'facevo', 'andavo'). NO present tense!
Exercise types: translation, fill_in_blank

CRITICAL RULES:
1. TOPIC: Every exercise MUST be about "shoe laces" - stay on topic throughout
2. REALISM: Use factual, natural scenarios appropriate for the topic
3. GRAMMAR: EVERY SINGLE exercise MUST test "past_tense" at C2 level
4. MULTIPLE CHOICE: Provide 4 DIFFERENT grammatical forms as options
5. CONSISTENCY: Do not mix different topics or introduce 



🔍 FULL RAW MODEL OUTPUT:
system

You are an expert Italian language teacher. Generate high-quality exercises based on the assignment specification. Output exercises in JSON format.user

Create exactly 2 Italian language exercises (#1, #2) in JSON format about'shoe laces' focusing on past_tense.

REQUIREMENTS:
Level: C2
Topic: shoe laces
Grammar: past_tense
⚠️ MANDATORY: Use ONLY past tense (passato prossimo like 'ho fatto','sono andato' OR imperfetto like 'facevo', 'andavo'). NO present tense!
Exercise types: translation, fill_in_blank

CRITICAL RULES:
1. TOPIC: Every exercise MUST be about "shoe laces" - stay on topic throughout
2. REALISM: Use factual, natural scenarios appropriate for the topic
3. GRAMMAR: EVERY SINGLE exercise MUST test "past_tense" at C2 level
4. MULTIPLE CHOICE: Provide 4 DIFFERENT grammatical forms as options
5. CONSISTENCY: Do not mix different topics or introduce unrelated subjects

OUTPUT FORMAT - JSON array with exercises testing past_tense:
[
  {"type": "fi