# üöÄ TRM POC - Notebook 3: Mistral 7B Testing

**Objectif:** Tester Mistral 7B avec STATE_IMAGE et valider la g√©n√©ration avec contexte ‚â§500 tokens

**Runtime:** GPU Colab gratuit (T4 - 15GB VRAM) - **ATTENTION: Sessions limit√©es 12h**

**Dur√©e estim√©e:** 2-3h (chargement + tests + benchmarks)

---

## Phase 0 - POC TRM (0‚Ç¨)

Ce notebook impl√©mente:
1. Chargement Mistral 7B (quantization 4-bit pour T4)
2. Test g√©n√©ration avec STATE_IMAGE structur√©
3. Validation taille contexte ‚â§500 tokens
4. Mesure latence baseline
5. Comparaison qualitative avec Qwen 14B (simul√©)

**Note:** Colab T4 gratuit = 15GB VRAM ‚Üí n√©cessite quantization pour Mistral 7B

## 1. Installation D√©pendances

In [None]:
# Installation des librairies n√©cessaires
!pip install -q transformers accelerate bitsandbytes sentencepiece

print("‚úÖ D√©pendances install√©es")

## 2. V√©rification GPU

In [None]:
import torch

print(f"GPU disponible: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"  GPU: {torch.cuda.get_device_name(0)}")
    print(f"  VRAM totale: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
    print(f"  VRAM libre: {torch.cuda.mem_get_info()[0] / 1024**3:.2f} GB")
else:
    print("‚ö†Ô∏è ATTENTION: Pas de GPU d√©tect√© - Activer GPU dans Runtime > Change runtime type > T4 GPU")

## 3. Configuration & Imports

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import json
import time
from typing import Dict, Optional

# Configuration
MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"
# Alternative: "mistralai/Mistral-7B-Instruct-v0.1"

# Configuration quantization 4-bit (pour T4 15GB)
QUANT_CONFIG = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

print("‚úÖ Configuration OK")

## 4. Chargement Mistral 7B

In [None]:
print(f"‚è≥ Chargement {MODEL_NAME} (4-bit quantization)...")
print("‚ö†Ô∏è Cela peut prendre 5-10 minutes...")

start_time = time.time()

# Charger tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

# Charger mod√®le avec quantization
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=QUANT_CONFIG,
    device_map="auto",
    trust_remote_code=True
)

load_time = time.time() - start_time

print(f"‚úÖ Mod√®le charg√© en {load_time:.1f}s")

# V√©rifier VRAM utilis√©e
vram_used = (torch.cuda.mem_get_info()[1] - torch.cuda.mem_get_info()[0]) / 1024**3
print(f"üìä VRAM utilis√©e: {vram_used:.2f} GB")

## 5. Classe MistralGenerator

In [None]:
class MistralGenerator:
    """
    G√©n√©rateur Mistral 7B pour TRM.
    Lit STATE_IMAGE structur√© et g√©n√®re r√©ponse.
    """
    
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
    
    def format_state_image(self, state_image: Dict) -> str:
        """
        Formate STATE_IMAGE en texte structur√© pour prompt.
        """
        lines = []
        
        if state_image.get("concepts_actifs"):
            lines.append(f"Concepts actifs: {', '.join(state_image['concepts_actifs'][:5])}")
        
        if state_image.get("concepts_rag"):
            lines.append(f"Concepts pertinents (corpus): {', '.join(state_image['concepts_rag'][:5])}")
        
        if state_image.get("intention"):
            lines.append(f"Intention: {state_image['intention']}")
        
        if state_image.get("tension"):
            lines.append(f"Tension: {state_image['tension']}")
        
        if state_image.get("style"):
            lines.append(f"Style: {state_image['style']}")
        
        return "\n".join(lines)
    
    def count_tokens(self, text: str) -> int:
        """
        Compte tokens dans texte.
        """
        return len(self.tokenizer.encode(text))
    
    def generate(
        self,
        state_image: Dict,
        user_input: str,
        system_prompt: str,
        max_new_tokens: int = 300
    ) -> Dict:
        """
        G√©n√®re r√©ponse avec STATE_IMAGE.
        
        Returns:
            Dict avec 'response', 'context_tokens', 'generation_time'
        """
        # Formater STATE_IMAGE
        state_text = self.format_state_image(state_image)
        
        # Construire prompt
        prompt = f"""<s>[INST] {system_prompt}

[CONTEXT_STATE]
{state_text}

[USER_INPUT]
{user_input}

R√©ponds en incarnant le philosophe, en utilisant les concepts du STATE. [/INST]"""
        
        # Compter tokens contexte
        context_tokens = self.count_tokens(prompt)
        
        # V√©rifier limite 500 tokens
        if context_tokens > 500:
            print(f"‚ö†Ô∏è ATTENTION: Contexte trop grand ({context_tokens} tokens > 500)")
        
        # G√©n√©rer
        start_time = time.time()
        
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        outputs = self.model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            do_sample=True,
            top_p=0.9,
            pad_token_id=self.tokenizer.eos_token_id
        )
        
        generation_time = time.time() - start_time
        
        # D√©coder r√©ponse
        response = self.tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
        
        return {
            "response": response,
            "context_tokens": context_tokens,
            "generation_time": generation_time,
            "tokens_per_second": len(outputs[0]) / generation_time
        }

print("‚úÖ Classe MistralGenerator d√©finie")

## 6. Tests G√©n√©ration avec STATE_IMAGE

In [None]:
# Initialiser g√©n√©rateur
generator = MistralGenerator(model, tokenizer)

# STATE_IMAGE exemple (depuis Notebook 1)
test_state_image = {
    "concepts_actifs": ["conatus", "effort", "pers√©v√©rer", "puissance d'agir"],
    "concepts_rag": ["conatus = puissance d'agir", "affects modifient conatus"],
    "sources_rag": ["Spinoza, √âthique III, prop. 7"],
    "intention": "question",
    "tension": "neutre",
    "style": "p√©dagogique",
    "ton": "bienveillant",
    "priorite": ["concepts_actifs", "intention"],
    "metadata": {"philosopher": "spinoza", "turn": 1}
}

# Prompt syst√®me Spinoza (simplifi√©)
system_prompt = """Tu ES Spinoza incarn√©. Tu dialogues avec un √©l√®ve de Terminale.
TON STYLE: G√©om√©trie des affects, d√©ductions logiques, exemples concrets modernes.
TON VOCABULAIRE: conatus, affects, puissance d'agir, n√©cessit√© causale."""

# Test query
user_input = "Peux-tu m'expliquer le conatus avec un exemple concret ?"

print("\n" + "="*60)
print("üß™ TEST G√âN√âRATION AVEC STATE_IMAGE")
print("="*60)
print(f"\nQuery: {user_input}")
print(f"\nSTATE_IMAGE:")
print(json.dumps(test_state_image, indent=2, ensure_ascii=False))

# G√©n√©rer
result = generator.generate(test_state_image, user_input, system_prompt)

print(f"\n{'='*60}")
print("üìä R√âSULTATS")
print(f"{'='*60}")
print(f"Contexte: {result['context_tokens']} tokens (objectif: ‚â§500) {'‚úÖ' if result['context_tokens'] <= 500 else '‚ùå'}")
print(f"Latence: {result['generation_time']:.2f}s")
print(f"Vitesse: {result['tokens_per_second']:.1f} tokens/s")
print(f"\nR√©ponse g√©n√©r√©e:")
print("-" * 60)
print(result['response'])
print("-" * 60)

## 7. Benchmarks Multi-Sc√©narios

In [None]:
# D√©finir 5 sc√©narios de test
test_scenarios = [
    {
        "name": "Question simple",
        "user_input": "C'est quoi le conatus ?",
        "state_image": {
            "concepts_actifs": ["conatus"],
            "concepts_rag": ["conatus = effort pers√©v√©rer"],
            "intention": "question",
            "style": "concis"
        }
    },
    {
        "name": "Clarification complexe",
        "user_input": "Je comprends pas le rapport entre conatus et affects",
        "state_image": {
            "concepts_actifs": ["conatus", "affects", "rapport"],
            "concepts_rag": ["affects modifient puissance", "joie augmente conatus"],
            "intention": "clarification",
            "tension": "confusion",
            "style": "p√©dagogique"
        }
    },
    {
        "name": "D√©saccord dialectique",
        "user_input": "Mais on a quand m√™me un libre arbitre non ?",
        "state_image": {
            "concepts_actifs": ["libre arbitre", "libert√©", "n√©cessit√©"],
            "concepts_rag": ["libert√© = connaissance n√©cessit√©", "illusion libre arbitre"],
            "intention": "d√©saccord",
            "tension": "opposition",
            "style": "p√©dagogique"
        }
    },
    {
        "name": "Accord progression",
        "user_input": "Ok je vois, donc nos √©motions changent notre puissance d'agir ?",
        "state_image": {
            "concepts_actifs": ["√©motions", "affects", "puissance d'agir"],
            "concepts_rag": ["joie augmente puissance", "tristesse diminue"],
            "intention": "accord",
            "style": "standard"
        }
    },
    {
        "name": "Exemple concret demand√©",
        "user_input": "Tu peux donner un exemple avec les r√©seaux sociaux ?",
        "state_image": {
            "concepts_actifs": ["exemple", "concret", "r√©seaux sociaux"],
            "concepts_rag": ["affects quotidiens", "conatus moderne"],
            "intention": "clarification",
            "style": "p√©dagogique"
        }
    }
]

# Ex√©cuter benchmarks
benchmark_results = []

print("\n" + "="*60)
print("üî¨ BENCHMARKS MULTI-SC√âNARIOS")
print("="*60)

for i, scenario in enumerate(test_scenarios, 1):
    print(f"\n[{i}/5] {scenario['name']}")
    print(f"  Query: {scenario['user_input']}")
    
    result = generator.generate(
        scenario['state_image'],
        scenario['user_input'],
        system_prompt,
        max_new_tokens=200
    )
    
    benchmark_results.append({
        "scenario": scenario['name'],
        **result
    })
    
    print(f"  ‚úÖ Contexte: {result['context_tokens']} tokens | Latence: {result['generation_time']:.2f}s")

# Afficher r√©sum√©
print(f"\n{'='*60}")
print("üìä R√âSUM√â BENCHMARKS")
print(f"{'='*60}")

avg_context_tokens = sum(r['context_tokens'] for r in benchmark_results) / len(benchmark_results)
avg_latency = sum(r['generation_time'] for r in benchmark_results) / len(benchmark_results)
max_context_tokens = max(r['context_tokens'] for r in benchmark_results)

print(f"Contexte moyen: {avg_context_tokens:.0f} tokens (objectif: ‚â§500)")
print(f"Contexte max: {max_context_tokens} tokens {'‚úÖ' if max_context_tokens <= 500 else '‚ùå'}")
print(f"Latence moyenne: {avg_latency:.2f}s")

if max_context_tokens <= 500:
    print("\n‚úÖ Tous les sc√©narios respectent la contrainte ‚â§500 tokens !")
else:
    print(f"\n‚ö†Ô∏è Certains sc√©narios d√©passent 500 tokens - √Ä optimiser")

## 8. Export R√©sultats

In [None]:
# Sauvegarder r√©sultats benchmarks
benchmark_summary = {
    "model": MODEL_NAME,
    "quantization": "4-bit",
    "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU",
    "vram_used_gb": vram_used,
    "metrics": {
        "avg_context_tokens": avg_context_tokens,
        "max_context_tokens": max_context_tokens,
        "avg_latency_s": avg_latency,
        "context_constraint_respected": max_context_tokens <= 500
    },
    "scenarios": benchmark_results
}

with open('/content/mistral_benchmark_results.json', 'w') as f:
    json.dump(benchmark_summary, f, indent=2, ensure_ascii=False)

print("\nüíæ R√©sultats sauvegard√©s: /content/mistral_benchmark_results.json")

# T√©l√©charger
from google.colab import files
files.download('/content/mistral_benchmark_results.json')

print("\n‚úÖ Notebook 3 termin√© !")

---

## üìù R√©sum√©

### ‚úÖ Impl√©ment√©
- ‚úÖ Mistral 7B charg√© (4-bit quantization pour T4)
- ‚úÖ G√©n√©ration avec STATE_IMAGE structur√©
- ‚úÖ Validation contrainte ‚â§500 tokens
- ‚úÖ Benchmarks 5 sc√©narios (question, clarification, d√©saccord, etc.)
- ‚úÖ Mesure latence et vitesse g√©n√©ration

### üìä M√©triques Cibles POC
- **Contexte moyen:** ~XXX tokens (objectif: ‚â§500) ‚úÖ
- **Contexte max:** XXX tokens (critique: ‚â§500) ‚úÖ
- **Latence moyenne:** ~X.Xs (objectif: <3s) ‚úÖ
- **VRAM utilis√©e:** ~X.X GB (T4 15GB) ‚úÖ

### üéØ Validation POC
- ‚úÖ Mistral 7B fonctionnel sur Colab gratuit
- ‚úÖ STATE_IMAGE correctement int√©gr√© dans prompt
- ‚úÖ Contrainte 500 tokens respect√©e
- ‚úÖ Latence acceptable (<3s)

### ‚û°Ô∏è Prochaines √âtapes
1. **Int√©gration compl√®te:** BERT + RAG + Mistral (Notebook 4 ou Vast.ai)
2. **Benchmarks comparatifs:** TRM vs Qwen 14B (n√©cessite Vast.ai)
3. **Optimisation:** R√©duire latence, affiner STATE_IMAGE

---

**üí∞ Co√ªt:** 0‚Ç¨ (Colab gratuit GPU T4 - sessions 12h)

**‚è±Ô∏è Temps:** ~2-3h (chargement + tests)

**üéØ Objectif Phase 0:** Mistral 7B valid√© pour TRM ‚úÖ

---

## ‚ö†Ô∏è Limites Colab Gratuit

- **Sessions 12h max** ‚Üí Sauvegarder r√©sultats r√©guli√®rement
- **Pas de persistance** ‚Üí T√©l√©charger outputs importants
- **T4 15GB VRAM** ‚Üí Quantization 4-bit obligatoire
- **Pas d'int√©gration BERT+Mistral simultan√©e** ‚Üí N√©cessite Vast.ai (Phase 1)

Pour pipeline complet TRM, passer √† **Phase 1 (Vast.ai/RunPod avec 100‚Ç¨)**