# Comparative Evaluation: Mistral-7B and TinyLlama-1.1B  
## Before/After QLoRA Fine-Tuning and Knowledge Distillation

### Objective  
This notebook provides a quantitative and qualitative evaluation of four model variants:

1. `mistralai/Mistral-7B-v0.1` (base)  
2. `mistralai/Mistral-7B-v0.1` + QLoRA adapters  
3. `TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T` (base)  
4. TinyLlama-1.1B after distillation from Mistral-7B+QLoRA

### Tasks  
- General instruction following (10 Alpaca-style prompts planned; 6 are defined in Section 4)  
- Mathematical reasoning (20 GSM8K-style samples planned; 4 illustrative ones are defined in Section 4)

### Generation  
All models use the same deterministic decoding settings: greedy decoding with `do_sample=False`. `temperature` is only used when sampling, so it plays no role here.

---

## 1. Setup and Dependencies

In [None]:
# Install required packages in the correct order to avoid conflicts

# Install PyTorch first (CUDA 11.8 wheels; adjust the index URL to match your runtime)
!pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install transformers ecosystem packages
!pip install -q transformers accelerate peft bitsandbytes sentencepiece

# Install utility packages
!pip install -q pandas openpyxl gdown

import os
import json
import time
import pandas as pd
from datetime import datetime
from pathlib import Path

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# For reproducibility
torch.manual_seed(42)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

## 2. Download Artifacts from Google Drive

You must provide public sharing links (or file IDs) for the two ZIP files:

- `mistral7b_v01_qlora_adapters.zip`
- `distilled_tinyllama.zip`

Replace the placeholders below with the actual file IDs from your Google Drive links.

In [None]:
# --- USER TO FILL ---
QLORA_ZIP_ID = "YOUR_DRIVE_FILE_ID_1"   # mistral7b_v01_qlora_adapters.zip
DISTILLED_ZIP_ID = "YOUR_DRIVE_FILE_ID_2"  # distilled_tinyllama.zip
# -------------------

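# Recent gdown releases take the file ID as a positional argument (the --id flag is deprecated)
!gdown {QLORA_ZIP_ID} -O mistral_qlora_adapters.zip
!gdown {DISTILLED_ZIP_ID} -O distilled_tinyllama.zip

# Unzip (-o overwrites existing files so the cell can be re-run without prompting)
!unzip -q -o mistral_qlora_adapters.zip -d /kaggle/working/mistral_qlora_adapters
!unzip -q -o distilled_tinyllama.zip -d /kaggle/working/tinyllama_distilled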

# Verify
print("QLoRA adapters files:")
!ls /kaggle/working/mistral_qlora_adapters
print("\nDistilled TinyLlama files:")
!ls /kaggle/working/tinyllama_distilled
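
ZIP archives often unpack into an extra nested folder, in which case the paths used by the loading functions would not point at the actual model files. The sketch below is one way to locate the right directory. The helper name `find_artifact_dir` is hypothetical, and it assumes the archives contain the standard `adapter_config.json` (written by PEFT) and `config.json` (written by `transformers`) files.

```python
from pathlib import Path
from typing import Optional


def find_artifact_dir(root: str, marker: str) -> Optional[Path]:
    """Return the shallowest directory under `root` that contains a file
    named `marker`, or None if `root` is missing or no such file exists."""
    root_path = Path(root)
    if not root_path.is_dir():
        return None
    # rglob also matches files directly inside root; sorting by path depth
    # makes the shallowest match win when several copies exist.
    matches = sorted(root_path.rglob(marker), key=lambda p: len(p.parts))
    return matches[0].parent if matches else None


# Assumed marker files: adapter_config.json for the adapters, config.json for the full model.
adapter_dir = find_artifact_dir("/kaggle/working/mistral_qlora_adapters", "adapter_config.json")
distilled_dir = find_artifact_dir("/kaggle/working/tinyllama_distilled", "config.json")
print("Adapter directory:", adapter_dir)
print("Distilled model directory:", distilled_dir)
```

If either value prints as `None`, or as a subdirectory of the expected path, the corresponding path in the loading functions likely needs to be adjusted.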

## 3. Model Loading Functions

In [None]:
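# Greedy decoding: with do_sample=False, temperature is not used, so it is omitted
# (recent transformers versions warn when it is set alongside do_sample=False).
generation_kwargs = {
    "max_new_tokens": 512,
    "do_sample": False,
    "pad_token_id": None,  # placeholder: pass the tokenizer's pad_token_id when calling generate()
}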

def load_mistral_base():
    model_name = "mistralai/Mistral-7B-v0.1"
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id
    
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )
    
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",
        quantization_config=bnb_config,
        trust_remote_code=True,
    )
    model.eval()
    return model, tokenizer

def load_mistral_qlora():
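    # Load a separate copy of the base model: PeftModel injects the adapter layers
    # into the model it wraps, so reusing the baseline instance would alter it.
    base_model, tokenizer = load_mistral_base()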
    peft_path = "/kaggle/working/mistral_qlora_adapters"
    model = PeftModel.from_pretrained(base_model, peft_path)
    model.eval()
    return model, tokenizer

def load_tinyllama_base():
    model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id
    
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",
        torch_dtype=torch.bfloat16,
    )
    model.eval()
    return model, tokenizer

def load_tinyllama_distilled():
    model_path = "/kaggle/working/tinyllama_distilled"
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id
    
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map="auto",
        torch_dtype=torch.bfloat16,
    )
    model.eval()
    return model, tokenizer

# Load all models
print("Loading models...")
mistral_base_model, mistral_tokenizer = load_mistral_base()
mistral_qlora_model, _ = load_mistral_qlora()  # shares tokenizer with base
tiny_base_model, tiny_tokenizer = load_tinyllama_base()
tiny_distilled_model, _ = load_tinyllama_distilled()

models = {
    "mistral_base": (mistral_base_model, mistral_tokenizer),
    "mistral_qlora": (mistral_qlora_model, mistral_tokenizer),
    "tinyllama_base": (tiny_base_model, tiny_tokenizer),
    "tinyllama_distilled": (tiny_distilled_model, tiny_tokenizer),
}
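
Because all four models are held in memory at once, and the QLoRA variant loads its own copy of the 4-bit Mistral base, it can help to estimate the weight footprint before running on a memory-limited GPU. The sketch below is a back-of-the-envelope calculation only. The parameter counts are approximate assumptions rather than values read from the checkpoints, and the estimate ignores quantization constants, layers kept in 16-bit, activations, and the KV cache, so real usage will be higher.

```python
GIB = 1024 ** 3


def estimate_weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Rough memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / GIB


# (label, approximate parameter count, bits per parameter) -- all assumptions.
loaded_models = [
    ("mistral_base (4-bit)", 7.24e9, 4),
    ("mistral_qlora base copy (4-bit)", 7.24e9, 4),
    ("tinyllama_base (bf16)", 1.1e9, 16),
    ("tinyllama_distilled (bf16)", 1.1e9, 16),
]

total_gib = 0.0
for label, n_params, bits in loaded_models:
    gib = estimate_weight_memory_gib(n_params, bits)
    total_gib += gib
    print(f"{label:<34} ~{gib:.2f} GiB")
print(f"{'total (weights only)':<34} ~{total_gib:.2f} GiB")
```

If the total approaches the available GPU memory, loading and evaluating the models one at a time is a simple alternative.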

## 4. Evaluation Prompts

In [None]:
# Fixed Alpaca-style prompts (6 defined here; the evaluation plan calls for 10)
alpaca_prompts = [
    {"id": "alp1", "prompt": "### Instruction:\nExplain the difference between supervised and unsupervised learning in simple terms.\n\n### Response:"},
    {"id": "alp2", "prompt": "### Instruction:\nWrite a short email to your boss explaining that you will be late to work because of a doctor's appointment.\n\n### Response:"},
    {"id": "alp3", "prompt": "### Instruction:\nGive me 5 creative ideas for a science fair project for a 10-year-old child.\n\n### Response:"},
    {"id": "alp4", "prompt": "### Instruction:\nClassify the following animals as mammal, bird, reptile, or fish: dolphin, penguin, crocodile, salmon, bat.\n\n### Response:"},
    {"id": "alp5", "prompt": "### Instruction:\nTranslate the following sentence into French: \"The quick brown fox jumps over the lazy dog.\"\n\n### Response:"},
    {"id": "alp6", "prompt": "### Instruction:\nWhy is it important to recycle plastic? Give at least 3 reasons.\n\n### Response:"},
    # Add 4 more prompts to reach the planned 10...
]

# Fixed GSM8K-style samples (4 illustrative items; replace with 20 real GSM8K questions and their reference answers)
gsm8k_samples = [
    {"id": "gsm1", "question": "Janet has 8 apples. She gives 3 to her friend and then buys 5 more. How many apples does she have now?", "answer": "10"},
    {"id": "gsm2", "question": "A store has 20 boxes of pencils. Each box contains 12 pencils. If they sell 15 boxes, how many pencils are left in the store?", "answer": "60"},
    {"id": "gsm3", "question": "John has 5 bags of marbles. Each bag has 8 marbles. He gives away 18 marbles to his friends. How many marbles does he have left?", "answer": "22"},
    {"id": "gsm4", "question": "A class has 30 students. 40% of them are girls. How many boys are in the class?", "answer": "18"},
    # Add more real GSM8K questions with correct numerical answers...
]

all_tasks = alpaca_prompts + [{"id": s["id"], "prompt": f"### Instruction:\n{s['question']}\n\n### Response:", "expected": s["answer"]} for s in gsm8k_samples]
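
Every task uses the same `### Instruction:` / `### Response:` template, written out by hand in each prompt string. A small helper keeps the template in one place and can also catch duplicated task IDs when more prompts are added. The names `format_alpaca_prompt` and `find_duplicate_ids` are hypothetical; this is a sketch, not part of the original pipeline.

```python
from collections import Counter
from typing import Dict, List

# Same template as the hand-written prompt strings in this section.
ALPACA_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:"


def format_alpaca_prompt(instruction: str) -> str:
    """Wrap a bare instruction or question in the Alpaca-style template."""
    return ALPACA_TEMPLATE.format(instruction=instruction.strip())


def find_duplicate_ids(tasks: List[Dict[str, str]]) -> List[str]:
    """Return task IDs that occur more than once, in sorted order."""
    counts = Counter(task["id"] for task in tasks)
    return sorted(task_id for task_id, n in counts.items() if n > 1)


example_tasks = [
    {"id": "demo1", "prompt": format_alpaca_prompt("Name three primary colors.")},
    {"id": "demo2", "prompt": format_alpaca_prompt("What is 12 times 4?")},
]
print(example_tasks[0]["prompt"])
print("Duplicate IDs:", find_duplicate_ids(example_tasks))
```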

## 5. Inference and Metrics

In [None]:
import re

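def extract_number(text):
    # Extract the final number from a GSM8K-style answer.
    # Thousands separators ("1,250") are accepted and removed, and a sentence-ending
    # period ("... is 10.") is not captured as part of the number.
    pattern = r'-?\d{1,3}(?:,\d{3})+(?:\.\d+)?|-?\d+(?:\.\d+)?'
    numbers = [n.replace(",", "") for n in re.findall(pattern, text)]
    return numbers[-1] if numbers else None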

results = []

for task in all_tasks:
    prompt = task["prompt"]
    expected = task.get("expected", None)
    
    for model_name, (model, tokenizer) in models.items():
        start_time = time.time()
        
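        # Use the model's own device: with device_map="auto" it may differ from `device`.
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        prompt_len = inputs["input_ids"].shape[1]
        outputs = model.generate(
            **inputs,
            **{**generation_kwargs, "pad_token_id": tokenizer.pad_token_id},
        )
        # Decode only the newly generated tokens; slicing the decoded string by
        # len(prompt) is fragile because decoding may not reproduce the prompt exactly.
        new_tokens = outputs[0][prompt_len:]
        response = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
        
        inference_time = time.time() - start_time
        output_length = len(new_tokens)  # generated tokens, matching the column name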
        
        exact_match = None
        numeric_match = None
        if expected is not None:
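            # Strict exact match is rarely satisfied by free-form answers;
            # numeric_match is the more informative metric.
            exact_match = str(expected).strip() == response.strip()
            pred_num = extract_number(response)
            # Compare as numbers so that "18" and "18.0" count as equal.
            numeric_match = pred_num is not None and float(pred_num) == float(expected)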
        
        results.append({
            "task_id": task["id"],
            "model": model_name,
            "prompt": prompt,
            "response": response,
            "expected": expected,
            "exact_match": exact_match,
            "numeric_match": numeric_match,
            "inference_time_sec": round(inference_time, 3),
            "output_length_tokens": output_length,
        })

# Save raw results
with open("/kaggle/working/evaluation_results.json", "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2, ensure_ascii=False)

print(f"Evaluation completed: {len(results)} records saved.")
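
The numeric-match metric depends entirely on pulling the final number out of free-form model output, where answers often end in a period or use thousands separators. The sketch below shows one tolerant way to do this and to compare values numerically rather than as strings. The names `extract_last_number` and `numbers_match` are hypothetical, and the tolerance value is an assumption.

```python
import re
from typing import Optional

# Either a comma-grouped number ("1,250" or "1,250.75") or a plain one ("42", "-3.5").
_NUMBER_RE = re.compile(r"-?\d{1,3}(?:,\d{3})+(?:\.\d+)?|-?\d+(?:\.\d+)?")


def extract_last_number(text: str) -> Optional[float]:
    """Return the last number found in `text` as a float, or None if there is none."""
    matches = _NUMBER_RE.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def numbers_match(response: str, expected: str, tol: float = 1e-6) -> bool:
    """True if the last number in `response` equals `expected` within `tol`."""
    predicted = extract_last_number(response)
    return predicted is not None and abs(predicted - float(expected)) <= tol


print(extract_last_number("So she has 8 - 3 + 5 = 10."))
print(numbers_match("That leaves 18.0 boys in the class.", "18"))
```

Taking the last number is a heuristic: a response that keeps talking after stating its answer can still be scored incorrectly, so spot-checking the raw outputs remains worthwhile.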

## 6. Results Export to Excel

In [None]:
df = pd.DataFrame(results)

with pd.ExcelWriter("/kaggle/working/model_comparison_results.xlsx") as writer:
    for model_name in models.keys():
        model_df = df[df["model"] == model_name].copy()
        model_df.drop(columns=["model"]).to_excel(writer, sheet_name=model_name, index=False)

print("Excel file saved: model_comparison_results.xlsx")

## 7. Summary Analysis

In [None]:
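# Quantitative summary. The match columns are None for the Alpaca prompts (no reference
# answer), so rates are computed over the scored (GSM8K) rows only.
def match_rate(col):
    scored = col.dropna()
    # NaN (rather than 0) signals that a model had no scored rows at all.
    return scored.astype(float).mean() if len(scored) else float("nan")

summary = df.groupby("model").agg(
    total_tasks=("task_id", "count"),
    gsm_tasks=("expected", lambda x: x.notna().sum()),
    exact_match_rate=("exact_match", match_rate),
    numeric_match_rate=("numeric_match", match_rate),
    avg_inference_time=("inference_time_sec", "mean"),
    avg_output_length=("output_length_tokens", "mean"),
).round(4)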

summary

### Key Observations

The first two points are general expectations to check against the summary table above, not results measured by this notebook.

- **Mistral-7B base vs. +QLoRA**:  
  QLoRA fine-tuning generally improves instruction following (Alpaca) and may improve or merely preserve math performance, depending on the training data.

- **TinyLlama base vs. distilled**:  
  Distillation from a strong teacher (Mistral+QLoRA) generally improves instruction adherence but can degrade mathematical reasoning if the distillation process lacks chain-of-thought or math-specific signals.

- **Limitations**:  
  - Small evaluation sets (30 tasks at most, fewer with the prompts as currently listed): results are indicative, not exhaustive  
  - Greedy decoding makes runs reproducible and exact matching more stable, but it shows a single output per prompt and can understate output diversity  
  - 4-bit quantization is used for the Mistral models (memory constraints on Kaggle GPUs)  
  - LoRA weights are not merged into the base model (as requested)
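
To make the small-sample limitation concrete, a confidence interval on an accuracy figure shows how much uncertainty remains with about 20 scored items. The sketch below uses the Wilson score interval, chosen here for illustration and not something the original evaluation computes; the function name `wilson_interval` is hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = successes / total
    denom = 1 + z ** 2 / total
    center = (p_hat + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / total + z ** 2 / (4 * total ** 2)
    ) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


low, high = wilson_interval(10, 20)
print(f"10/20 correct: 95% interval from {low:.2f} to {high:.2f}")
```

With 10 correct answers out of 20, the interval spans roughly 0.30 to 0.70, so differences of a few questions between two models should not be read as meaningful.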

All raw outputs are preserved in JSON and Excel for traceability and further analysis.

**Notebook complete.** You can now run it end to end on Kaggle (with GPU enabled) after inserting your Google Drive file IDs.