# Phi-3 Try

This notebook is used to try the pipeline without the integration of MongoDB to store the versioning of the dataset

### Dataset and preprocessing

In [1]:
# Load the SQuAD v2 dataset using the Hugging Face datasets library
from datasets import load_dataset

dataset = load_dataset("squad_v2")
print("Number of examples in the dataset:", len(dataset["train"]))
print("First example in the dataset:", dataset["train"][0])

# Extract 0.5% of each split
print("\nExtracting 0.5% of each dataset split...")

# Calculate 0.5% sizes
train_size = int(len(dataset["train"]) * 0.005)
validation_size = int(len(dataset["validation"]) * 0.005)

print(f"Original train size: {len(dataset['train'])}")
print(f"0.5% train size: {train_size}")
print(f"Original validation size: {len(dataset['validation'])}")
print(f"0.5% validation size: {validation_size}")

# Create 0.5% subsets
dataset_05percent = {
    "train": dataset["train"].select(range(train_size)),
    "validation": dataset["validation"].select(range(validation_size)),
    "test": dataset["validation"].select(range(validation_size, min(validation_size * 2, len(dataset["validation"]))))
}

# Print final sizes
print(f"\nFinal dataset sizes:")
print(f"Train: {len(dataset_05percent['train'])}")
print(f"Validation: {len(dataset_05percent['validation'])}")
print(f"Test: {len(dataset_05percent['test'])}")

# Show examples from each split
print(f"\nTrain example: {dataset_05percent['train'][0]['question']}")
print(f"Validation example: {dataset_05percent['validation'][0]['question']}")
print(f"Test example: {dataset_05percent['test'][0]['question']}")

# Update the dataset variable to use the 0.5% version
dataset = dataset_05percent

Number of examples in the dataset: 130319
First example in the dataset: {'id': '56be85543aeaaa14008c9063', 'title': 'Beyoncé', 'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".', 'question': 'When did Beyonce start becoming popular?', 'answers': {'text': ['in the late 1990s'], 'answer_start': [269]}}

Extracting 0.5% of each dataset split

In [2]:
#Define a function to format the dataset examples into a prompt
#The prompt will include the context, question, and answer
def make_prompt(example):
    context = example["context"]
    question = example["question"]
    answer = example["answers"]["text"][0] if example["answers"]["text"] else "No answer"

    prompt = f"[INST] Given the context, answer the question.\n\nContext: {context}\n\nQuestion: {question} [/INST] {answer}"
    return {"prompt": prompt, "reference": answer}

formatted_dataset = {
    split: dataset[split].map(make_prompt)
    for split in dataset.keys()
}

In [3]:
# Print the first formatted prompt and its reference answer
print("Prompt:\n", formatted_dataset["train"][0]["prompt"])
print("\nReference Answer:\n", formatted_dataset["train"][0]["reference"])

Prompt:
 [INST] Given the context, answer the question.

Context: Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".

Question: When did Beyonce start becoming popular? [/INST] in the late 1990s

Reference Answer:
 in the late 1990s


### Tokenizer

In [4]:
from dotenv import load_dotenv
import os

load_dotenv("key.env")
token = os.getenv("HUGGINGFACE_TOKEN")

from huggingface_hub import login
login(token=token)

In [5]:
import torch
torch.cuda.empty_cache()


A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.2.6 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "c:\Users\manua\anaconda3\envs\BigData\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\Users\manua\anaconda3\envs\BigData\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "c:\Users\manua\anaconda3\envs\BigData\lib\site-packages\ipykernel_launcher.py", line 18, in <module>
    app.launch_new_instance()
  File "c:\Users\manua\anaconda3\envs\BigData\lib\site-packages\traitlets\config\application.py", line 1075, in launch_instance
    app.s

In [6]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-3-mini-128k-instruct")
tokenizer.pad_token = tokenizer.eos_token

def tokenize(example):
    return tokenizer(
        example["prompt"],
        truncation=True,
        padding="max_length",
        max_length=256
    )

tokenized = {
    split: formatted_dataset[split].map(tokenize, batched=True)
    for split in formatted_dataset.keys()
}

### Load Mistral model

In [7]:
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

Using device: cuda


In [8]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from accelerate import Accelerator
from peft import LoraConfig, get_peft_model

# Accelerator setup
accelerator = Accelerator()

# Quantization config (4-bit recommended for large models)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    llm_int8_enable_fp32_cpu_offload=True
)

# Load model with quantization and device mapping
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-3-mini-128k-instruct",
    device_map="auto",
    quantization_config=bnb_config
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [9]:
from transformers import TrainingArguments, Trainer
from peft import LoraConfig
from bert_score import score
import torch
import numpy as np

# Custom Trainer class with BERTScore loss
class BERTScoreTrainer(Trainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        
    def compute_loss(self, model, inputs, return_outputs=False):
        """
        Custom loss function using BERTScore
        """
        labels = inputs.get("labels")
        
        # Forward pass
        outputs = model(**inputs)
        
        # Generate predictions for BERTScore
        with torch.no_grad():
            input_ids = inputs["input_ids"]
            attention_mask = inputs["attention_mask"]
            
            # Generate text
            generated = model.generate(
                input_ids=input_ids,
                attention_mask=attention_mask,
                max_new_tokens=50,
                do_sample=False,
                pad_token_id=self.tokenizer.eos_token_id
            )
            
            # Decode predictions and references
            pred_texts = self.tokenizer.batch_decode(generated, skip_special_tokens=True)
            ref_texts = self.tokenizer.batch_decode(labels, skip_special_tokens=True)
            
            # Calculate BERTScore
            try:
                P, R, F1 = score(pred_texts, ref_texts, lang="en", verbose=False)
                bert_f1 = F1.mean().item()
                
                # Convert BERTScore to loss (1 - F1 so higher F1 = lower loss)
                bert_loss = torch.tensor(1.0 - bert_f1, requires_grad=True, device=input_ids.device)
            except:
                # Fallback to standard loss if BERTScore fails
                bert_loss = outputs.loss
        
        # Combine with standard language modeling loss (optional)
        standard_loss = outputs.loss
        combined_loss = 0.7 * standard_loss + 0.3 * bert_loss
        
        return (combined_loss, outputs) if return_outputs else combined_loss

# Custom compute_metrics function to show BERTScore during evaluation
def compute_metrics(eval_preds):
    predictions, labels = eval_preds
    
    # Decode predictions and labels
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    
    # Calculate BERTScore
    try:
        P, R, F1 = score(decoded_preds, decoded_labels, lang="en", verbose=False)
        bert_f1 = F1.mean().item()
        bert_precision = P.mean().item()
        bert_recall = R.mean().item()
        
        return {
            "bert_f1": bert_f1,
            "bert_precision": bert_precision,
            "bert_recall": bert_recall
        }
    except:
        return {"bert_f1": 0.0, "bert_precision": 0.0, "bert_recall": 0.0}

training_config_no_eval = {
    "bf16": True,
    "do_eval": False,  # Disable evaluation completely
    "learning_rate": 1.0e-05,
    "log_level": "info",
    "logging_steps": 10,
    "logging_strategy": "steps",
    "lr_scheduler_type": "cosine",
    "num_train_epochs": 3,
    "max_steps": -1,
    "output_dir": "./phi3-squad2-checkpoint",
    "overwrite_output_dir": True,
    "per_device_train_batch_size": 4,
    "remove_unused_columns": True,
    "save_steps": 50,
    "save_total_limit": 2,
    "seed": 42,
    "gradient_checkpointing": True,
    "gradient_checkpointing_kwargs": {"use_reentrant": False},
    "gradient_accumulation_steps": 2,
    "warmup_ratio": 0.05,
    "save_strategy": "steps",
    "load_best_model_at_end": False,  # No evaluation, so no "best" model
}

# LoRA configuration (minimal parameters)
peft_config = {
    "r": 8,  # Reduced from 16 to 8 (fewer parameters)
    "lora_alpha": 16,  # Reduced from 32 to 16
    "lora_dropout": 0.1,  # Slightly increased dropout
    "bias": "none",
    "task_type": "CAUSAL_LM",
    "target_modules": "all-linear",
    "modules_to_save": None,
}

# Create configuration objects
train_args = TrainingArguments(**training_config_no_eval)
lora_config = LoraConfig(**peft_config)

print("Configuration with BERTScore loss created successfully!")
print("BERTScore will be used as:")
print("- Combined with standard loss during training")
print("- Primary metric for evaluation and model saving")

Configuration with BERTScore loss created successfully!
BERTScore will be used as:
- Combined with standard loss during training
- Primary metric for evaluation and model saving


In [10]:
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

trainable params: 12,582,912 || all params: 3,833,662,464 || trainable%: 0.3282


### Training

In [None]:
import torch
from transformers import Trainer, DataCollatorForLanguageModeling, TrainerCallback
from peft import get_peft_model
import numpy
import sys
import io
import logging
import warnings
from contextlib import redirect_stdout, redirect_stderr

# Enhanced logging suppression - including BERTScore sharding messages
logging.getLogger("transformers").setLevel(logging.CRITICAL)
logging.getLogger("transformers.modeling_utils").setLevel(logging.CRITICAL)
logging.getLogger("transformers.configuration_utils").setLevel(logging.CRITICAL)
logging.getLogger("transformers.tokenization_utils_base").setLevel(logging.CRITICAL)
logging.getLogger("accelerate").setLevel(logging.CRITICAL)
logging.getLogger("accelerate.utils.modeling").setLevel(logging.CRITICAL)
logging.getLogger().setLevel(logging.CRITICAL)
warnings.filterwarnings("ignore")

# Initialize BERTScore silently
print("Initializing BERTScore silently...")
with redirect_stdout(io.StringIO()), redirect_stderr(io.StringIO()):
    from bert_score import score
    _ = score(["test"], ["test"], lang="en", verbose=False)
print("BERTScore initialized successfully!")

# Modified BERTScore function with complete output suppression
def silent_bert_score(cands, refs, lang="en"):
    """BERTScore calculation with all output suppressed"""
    old_stdout = sys.stdout
    old_stderr = sys.stderr
    
    sys.stdout = io.StringIO()
    sys.stderr = io.StringIO()
    
    try:
        P, R, F1 = score(cands, refs, lang=lang, verbose=False)
        return P, R, F1
    finally:
        sys.stdout = old_stdout
        sys.stderr = old_stderr

# Custom Early Stopping based on Training Loss
class TrainingLossEarlyStoppingCallback(TrainerCallback):
    def __init__(self, patience=3, min_delta=0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float('inf')
        self.wait_count = 0
        
    def on_log(self, args, state, control, model=None, logs=None, **kwargs):
        if logs is not None and 'train_loss' in logs:
            current_loss = logs['train_loss']
            
            if current_loss < self.best_loss - self.min_delta:
                self.best_loss = current_loss
                self.wait_count = 0
                print(f"📈 Training loss improved to {current_loss:.4f}")
            else:
                self.wait_count += 1
                print(f"📊 No improvement in training loss ({self.wait_count}/{self.patience})")
                
                if self.wait_count >= self.patience:
                    print(f"🛑 Early stopping triggered! Best loss: {self.best_loss:.4f}")
                    control.should_training_stop = True

# Fixed Custom Trainer class with BERTScore loss
class BERTScoreTrainer(Trainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        
    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        """
        Custom loss function using BERTScore - completely silent
        """
        labels = inputs.get("labels")
        
        # Forward pass
        outputs = model(**inputs)
        
        # Generate predictions for BERTScore
        with torch.no_grad():
            input_ids = inputs["input_ids"]
            attention_mask = inputs["attention_mask"]
            
            # Generate text
            try:
                generated = model.generate(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    max_new_tokens=50,
                    do_sample=False,
                    pad_token_id=self.processing_class.eos_token_id
                )
                
                # Decode predictions and references
                pred_texts = self.processing_class.batch_decode(generated, skip_special_tokens=True)
                ref_texts = self.processing_class.batch_decode(labels, skip_special_tokens=True)
                
                # Calculate BERTScore with completely silent function
                P, R, F1 = silent_bert_score(pred_texts, ref_texts, lang="en")
                bert_f1 = F1.mean().item()
                
                # Convert BERTScore to loss
                bert_loss = torch.tensor(1.0 - bert_f1, requires_grad=True, device=input_ids.device)
            except Exception as e:
                # Fallback to standard loss if BERTScore fails
                bert_loss = outputs.loss
        
        # Combine with standard language modeling loss
        standard_loss = outputs.loss
        combined_loss = 0.7 * standard_loss + 0.3 * bert_loss
        
        return (combined_loss, outputs) if return_outputs else combined_loss

# Data preparation function
def prepare_training_data(tokenized_dataset, tokenizer):
    """Prepare data for training"""
    
    def add_labels(example):
        example["labels"] = example["input_ids"].copy()
        return example

    # Only prepare train split
    train_dataset = tokenized_dataset["train"].map(add_labels)
    
    # Keep only necessary columns
    keep_keys = ["input_ids", "attention_mask", "labels"]
    train_dataset = train_dataset.remove_columns(
        [col for col in train_dataset.column_names if col not in keep_keys]
    )
    
    return {"train": train_dataset}

# Main training function
def train_model(model, tokenized_data, tokenizer, train_args):
    """Training function with BERTScore and early stopping"""
    
    # Prepare data
    prepared_data = prepare_training_data(tokenized_data, tokenizer)
    
    # Setup data collator
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False,
        return_tensors="pt"
    )

    # Custom early stopping based on training loss
    early_stopping_callback = TrainingLossEarlyStoppingCallback(
        patience=5,
        min_delta=0.01
    )
    
    # Initialize BERTScore Trainer
    trainer = BERTScoreTrainer(
        model=model,
        args=train_args,
        train_dataset=prepared_data["train"],
        data_collator=data_collator,
        processing_class=tokenizer,
        callbacks=[early_stopping_callback],
    )
    
    # Start training
    print("Starting training with BERTScore optimization...")
    print("Early stopping based on training loss improvement")
    
    trainer.train()
    
    # Save model
    trainer.save_model("./phi3-squad2-final")
    print("✅ Model saved to ./phi3-squad2-final")
    
    return trainer

Initializing BERTScore silently...
BERTScore initialized successfully!


In [14]:
# Apply LoRA to model (if not already done)
if not hasattr(model, 'peft_config'):
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

# Clean and simple function call
trainer = train_model(model, tokenized, tokenizer, train_args)

Using auto half precision backend
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
***** Running training *****
  Num examples = 651
  Num Epochs = 3
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 2
  Total optimization steps = 246
  Number of trainable parameters = 12,582,912


Starting training with BERTScore optimization...
Early stopping will check only at the end of each epoch


Step,Training Loss
10,3.2954
20,3.3507
30,3.3331
40,3.3286
50,3.2391
60,3.0818
70,2.8358
80,2.985
90,2.7499
100,2.8885


The following layers were not sharded: embeddings.token_type_embeddings.weight, encoder.layer.*.attention.self.value.bias, encoder.layer.*.intermediate.dense.weight, encoder.layer.*.attention.output.dense.bias, encoder.layer.*.attention.output.LayerNorm.weight, encoder.layer.*.attention.self.query.weight, encoder.layer.*.output.dense.bias, embeddings.position_embeddings.weight, encoder.layer.*.attention.self.key.bias, encoder.layer.*.attention.self.key.weight, encoder.layer.*.attention.self.query.bias, embeddings.LayerNorm.weight, embeddings.word_embeddings.weight, encoder.layer.*.intermediate.dense.bias, pooler.dense.weight, encoder.layer.*.output.LayerNorm.bias, encoder.layer.*.output.LayerNorm.weight, encoder.layer.*.attention.output.dense.weight, encoder.layer.*.attention.self.value.weight, pooler.dense.bias, embeddings.LayerNorm.bias, encoder.layer.*.output.dense.weight, encoder.layer.*.attention.output.LayerNorm.bias
The following layers were not sharded: embeddings.token_type_em

✅ Model saved to ./phi3-squad2-final


### Generate Predictions and Evaluate them via BERTScore

In [16]:
import torch
from tqdm import tqdm
from bert_score import score
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
import json
import random

def evaluate_test_set_with_examples(model, tokenizer, dataset, make_prompt_func, num_examples=10):
    """
    Comprehensive evaluation on test set with detailed prediction examples
    """
    print("="*60)
    print("STARTING TEST SET EVALUATION WITH EXAMPLES")
    print("="*60)
    
    # Prepare test data
    print("Preparing test prompts...")
    test_prompts = dataset["test"].map(make_prompt_func)
    
    # Tokenize test prompts
    print("Tokenizing test data...")
    tokenized_test = test_prompts.map(
        lambda x: tokenizer(x["prompt"], truncation=True, padding="max_length", max_length=512),
        batched=True
    )
    
    # Set model to evaluation mode
    model.eval()
    
    # Initialize lists for predictions and references
    preds = []
    refs = []
    raw_outputs = []
    prompts_list = []
    
    print(f"Generating predictions for {len(test_prompts)} test examples...")
    
    # Generate predictions
    for example in tqdm(test_prompts, desc="Evaluating"):
        # Tokenize input
        inputs = tokenizer(
            example["prompt"],
            return_tensors="pt",
            truncation=True,
            padding="max_length",
            max_length=512
        ).to(model.device)
        
        # Generate response
        with torch.no_grad():
            output = model.generate(
                **inputs, 
                max_new_tokens=50,
                do_sample=False,
                pad_token_id=tokenizer.eos_token_id,
                temperature=0.7,
                top_p=0.9
            )
        
        # Decode output
        decoded = tokenizer.decode(output[0], skip_special_tokens=True)
        
        # Extract answer (everything after [/INST])
        if '[/INST]' in decoded:
            answer = decoded.split('[/INST]')[-1].strip()
        else:
            answer = decoded.strip()
        
        # Store results
        preds.append(answer)
        refs.append(example.get("reference", example.get("answer", "No answer")))
        raw_outputs.append(decoded)
        prompts_list.append(example["prompt"])
    
    print("Predictions generated! Computing metrics...")
    
    # Compute BERTScore
    print("Computing BERTScore...")
    try:
        P, R, F1 = score(preds, refs, lang="en", verbose=False)
        bert_scores = {
            "precision": P.mean().item(),
            "recall": R.mean().item(),
            "f1": F1.mean().item()
        }
    except Exception as e:
        print(f"BERTScore computation failed: {e}")
        bert_scores = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
        P = R = F1 = [0.0] * len(preds)
    
    # Compute exact match accuracy
    exact_matches = []
    for pred, ref in zip(preds, refs):
        if ref.lower().strip() in pred.lower().strip():
            exact_matches.append(1)
        else:
            exact_matches.append(0)
    
    exact_match_score = np.mean(exact_matches)
    
    # Compute answer length statistics
    pred_lengths = [len(pred.split()) for pred in preds]
    ref_lengths = [len(ref.split()) for ref in refs]
    
    # Print comprehensive results
    print("="*60)
    print("EVALUATION RESULTS")
    print("="*60)
    print(f"Test Set Size: {len(preds)}")
    print("-"*60)
    print("BERTScore Metrics:")
    print(f"  Precision: {bert_scores['precision']:.4f}")
    print(f"  Recall:    {bert_scores['recall']:.4f}")
    print(f"  F1 Score:  {bert_scores['f1']:.4f}")
    print("-"*60)
    print("Other Metrics:")
    print(f"  Exact Match: {exact_match_score:.4f}")
    print("-"*60)
    print("Answer Length Statistics:")
    print(f"  Avg Prediction Length: {np.mean(pred_lengths):.2f} words")
    print(f"  Avg Reference Length:  {np.mean(ref_lengths):.2f} words")
    print("="*60)
    
    # Show detailed examples
    print("\n" + "="*80)
    print("DETAILED PREDICTION EXAMPLES")
    print("="*80)
    
    # Select diverse examples: best, worst, and random
    f1_scores = [f.item() if hasattr(f, 'item') else f for f in F1]
    
    # Get indices for different categories
    sorted_indices = sorted(range(len(f1_scores)), key=lambda i: f1_scores[i], reverse=True)
    
    best_indices = sorted_indices[:3]  # Top 3
    worst_indices = sorted_indices[-3:]  # Bottom 3
    random_indices = random.sample(range(len(preds)), min(4, len(preds)))  # Random 4
    
    example_categories = [
        ("BEST PREDICTIONS", best_indices),
        ("WORST PREDICTIONS", worst_indices),
        ("RANDOM PREDICTIONS", random_indices)
    ]
    
    for category_name, indices in example_categories:
        print(f"\n{category_name}:")
        print("-" * 80)
        
        for i, idx in enumerate(indices):
            print(f"\nExample {i+1} (Index {idx}):")
            print(f"BERTScore F1: {f1_scores[idx]:.4f}")
            print(f"Exact Match: {'✓' if exact_matches[idx] else '✗'}")
            
            # Extract question from prompt
            prompt = prompts_list[idx]
            if "Question:" in prompt:
                question = prompt.split("Question:")[-1].split("Answer:")[0].strip()
                print(f"Question: {question}")
            else:
                print(f"Prompt: {prompt[:200]}...")
            
            print(f"Reference Answer: {refs[idx]}")
            print(f"Model Prediction: {preds[idx]}")
            
            # Analysis
            pred_words = len(preds[idx].split())
            ref_words = len(refs[idx].split())
            print(f"Length: Pred={pred_words} words, Ref={ref_words} words")
            
            # Simple similarity check
            pred_lower = preds[idx].lower()
            ref_lower = refs[idx].lower()
            common_words = set(pred_lower.split()) & set(ref_lower.split())
            print(f"Common words: {len(common_words)}")
            
            print("-" * 50)
    
    # Show questions by category if available
    if "category" in test_prompts.column_names or "topic" in test_prompts.column_names:
        print(f"\n{'='*80}")
        print("PERFORMANCE BY CATEGORY")
        print("="*80)
        
        category_field = "category" if "category" in test_prompts.column_names else "topic"
        categories = {}
        
        for i, example in enumerate(test_prompts):
            cat = example.get(category_field, "Unknown")
            if cat not in categories:
                categories[cat] = {"f1_scores": [], "exact_matches": [], "indices": []}
            categories[cat]["f1_scores"].append(f1_scores[i])
            categories[cat]["exact_matches"].append(exact_matches[i])
            categories[cat]["indices"].append(i)
        
        for cat, data in categories.items():
            avg_f1 = np.mean(data["f1_scores"])
            avg_em = np.mean(data["exact_matches"])
            count = len(data["f1_scores"])
            print(f"{cat}: F1={avg_f1:.4f}, EM={avg_em:.4f}, Count={count}")
            
            # Show one example from each category
            best_idx_in_cat = data["indices"][np.argmax(data["f1_scores"])]
            print(f"  Best example: {prompts_list[best_idx_in_cat][:100]}...")
            print(f"  Prediction: {preds[best_idx_in_cat]}")
            print()
    
    # Create results dictionary
    results = {
        "test_size": len(preds),
        "bert_score": bert_scores,
        "exact_match": exact_match_score,
        "avg_prediction_length": np.mean(pred_lengths),
        "avg_reference_length": np.mean(ref_lengths),
        "predictions": preds,
        "references": refs,
        "prompts": prompts_list,
        "individual_scores": {
            "bert_f1": f1_scores,
            "exact_match": exact_matches
        }
    }
    
    # Save detailed results
    print(f"\n{'='*60}")
    print("SAVING RESULTS")
    print("="*60)
    
    # Save main results
    with open("test_evaluation_results.json", "w") as f:
        json.dump({k: v for k, v in results.items() if k != "prompts"}, f, indent=2)
    
    # Save detailed examples
    with open("detailed_predictions.txt", "w", encoding="utf-8") as f:
        f.write("DETAILED TEST SET PREDICTIONS\n")
        f.write("="*80 + "\n\n")
        
        for i, (prompt, pred, ref, f1_score, em) in enumerate(zip(prompts_list, preds, refs, f1_scores, exact_matches)):
            f.write(f"Example {i+1}:\n")
            f.write(f"BERTScore F1: {f1_score:.4f}\n")
            f.write(f"Exact Match: {'✓' if em else '✗'}\n")
            
            if "Question:" in prompt:
                question = prompt.split("Question:")[-1].split("Answer:")[0].strip()
                f.write(f"Question: {question}\n")
            else:
                f.write(f"Prompt: {prompt}\n")
            
            f.write(f"Reference: {ref}\n")
            f.write(f"Prediction: {pred}\n")
            f.write("-" * 50 + "\n\n")
    
    print("Results saved to:")
    print("  - test_evaluation_results.json")
    print("  - detailed_predictions.txt")
    print("="*60)
    
    return results

In [17]:
# Run evaluation with detailed examples
eval_results = evaluate_test_set_with_examples(model, tokenizer, dataset, make_prompt, num_examples=10)

# Print final summary
print(f"\n🎯 FINAL RESULTS:")
print(f"   BERTScore F1: {eval_results['bert_score']['f1']:.4f}")
print(f"   Exact Match:  {eval_results['exact_match']:.4f}")
print(f"   Test Size:    {eval_results['test_size']}")

STARTING TEST SET EVALUATION WITH EXAMPLES
Preparing test prompts...
Tokenizing test data...


Map:   0%|          | 0/59 [00:00<?, ? examples/s]

Generating predictions for 59 test examples...


Evaluating:   0%|          | 0/59 [00:00<?, ?it/s]The following generation flags are not valid and may be ignored: ['temperature', 'top_p'].
- `temperature`: `do_sample` is set to `False`. However, `temperature` is set to `0.7` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
- `top_p`: `do_sample` is set to `False`. However, `top_p` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
If you're using a pretrained model, note that some of these attributes may be set through the model's `generation_config.json` file.
Evaluating:   2%|▏         | 1/59 [00:03<03:00,  3.12s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p'].
- `temperature`: `do_sample` is set to `False`. However, `temperature` is set to `0.7` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `

Predictions generated! Computing metrics...
Computing BERTScore...


The following layers were not sharded: embeddings.token_type_embeddings.weight, encoder.layer.*.attention.self.value.bias, encoder.layer.*.intermediate.dense.weight, encoder.layer.*.attention.output.dense.bias, encoder.layer.*.attention.output.LayerNorm.weight, encoder.layer.*.attention.self.query.weight, encoder.layer.*.output.dense.bias, embeddings.position_embeddings.weight, encoder.layer.*.attention.self.key.bias, encoder.layer.*.attention.self.key.weight, encoder.layer.*.attention.self.query.bias, embeddings.LayerNorm.weight, embeddings.word_embeddings.weight, encoder.layer.*.intermediate.dense.bias, pooler.dense.weight, encoder.layer.*.output.LayerNorm.bias, encoder.layer.*.output.LayerNorm.weight, encoder.layer.*.attention.output.dense.weight, encoder.layer.*.attention.self.value.weight, pooler.dense.bias, embeddings.LayerNorm.bias, encoder.layer.*.output.dense.weight, encoder.layer.*.attention.output.LayerNorm.bias


EVALUATION RESULTS
Test Set Size: 59
------------------------------------------------------------
BERTScore Metrics:
  Precision: 0.7231
  Recall:    0.7583
  F1 Score:  0.7397
------------------------------------------------------------
Other Metrics:
  Exact Match: 0.2542
------------------------------------------------------------
Answer Length Statistics:
  Avg Prediction Length: 15.31 words
  Avg Reference Length:  1.81 words

DETAILED PREDICTION EXAMPLES

BEST PREDICTIONS:
--------------------------------------------------------------------------------

Example 1 (Index 12):
BERTScore F1: 0.9441
Exact Match: ✓
Question: Who was Robert's son? [/INST] Bohemond
Reference Answer: Bohemond
Model Prediction: Bohemond [
Length: Pred=2 words, Ref=1 words
Common words: 1
--------------------------------------------------

Example 2 (Index 56):
BERTScore F1: 0.8866
Exact Match: ✓
Question: When was Scotland invaded by William? [/INST] 1072
Reference Answer: 1072
Model Prediction: 1072 [/in

