# üß† TRM POC - Notebook 1: BERT Encoder + STATE_IMAGE

**Objectif:** D√©velopper et tester le BERT Encoder pour g√©n√©rer le STATE_IMAGE structur√©

**Runtime:** CPU Colab gratuit (BERT ne n√©cessite pas GPU)

**Dur√©e estim√©e:** 2-3h d√©veloppement + tests

---

## Phase 0 - POC TRM (0‚Ç¨)

Ce notebook impl√©mente:
1. Chargement BERT-base (CPU)
2. Extraction concepts conversationnels
3. Extraction concepts RAG (depuis passages)
4. Analyse multi-axes (intention, tension, style, etc.)
5. G√©n√©ration STATE_IMAGE JSON structur√©

**Note:** Pas d'inf√©rence compl√®te ici - juste la logique BERT isol√©e

## 1. Installation D√©pendances

In [None]:
# Installation des librairies n√©cessaires
!pip install -q transformers torch sentencepiece spacy
!python -m spacy download fr_core_news_sm

## 2. Configuration & Imports

In [None]:
import torch
from transformers import AutoTokenizer, AutoModel
import json
import re
from typing import List, Dict, Optional
from collections import Counter
import spacy

# Charger mod√®le spaCy pour extraction entit√©s
nlp = spacy.load("fr_core_news_sm")

# Configuration
BERT_MODEL = "camembert-base"  # Mod√®le fran√ßais optimis√©
# Alternative: "bert-base-multilingual-cased" si CamemBERT trop lourd

print("‚úÖ Imports OK")

## 3. Classe BERTEncoder

In [None]:
class BERTEncoder:
    """
    Encodeur BERT pour g√©n√©rer STATE_IMAGE condens√©.
    Optimis√© pour CPU (Colab gratuit).
    """
    
    def __init__(self, model_name: str = BERT_MODEL):
        print(f"‚è≥ Chargement {model_name}...")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        self.model.eval()  # Mode √©valuation (pas training)
        print("‚úÖ BERT charg√© (CPU)")
    
    def extract_keywords(self, text: str, top_k: int = 5) -> List[str]:
        """
        Extrait mots-cl√©s via spaCy (NER + fr√©quence).
        """
        doc = nlp(text)
        
        # Entit√©s nomm√©es
        entities = [ent.text.lower() for ent in doc.ents]
        
        # Noms propres et concepts (NOUN, PROPN)
        nouns = [token.text.lower() for token in doc 
                 if token.pos_ in ["NOUN", "PROPN"] 
                 and len(token.text) > 3]
        
        # Fr√©quence combin√©e
        all_keywords = entities + nouns
        counter = Counter(all_keywords)
        
        return [word for word, count in counter.most_common(top_k)]
    
    def extract_concepts_from_rag(self, rag_passages: List[Dict]) -> List[str]:
        """
        Extrait concepts condens√©s des passages RAG (PAS texte brut).
        
        Args:
            rag_passages: [{'text': '...', 'concepts': [...], 'source': '...'}]
        
        Returns:
            Liste de concepts condens√©s (max 8)
        """
        concepts = []
        
        for passage in rag_passages:
            # M√©thode 1: Utiliser concepts pr√©-annot√©s si disponibles
            if passage.get("concepts"):
                concepts.extend(passage["concepts"][:3])  # Top 3 par passage
            else:
                # M√©thode 2: Extraction keywords du texte
                text = passage.get("text", "")
                keywords = self.extract_keywords(text, top_k=3)
                concepts.extend(keywords)
        
        # D√©dupliquer et limiter √† 8 concepts
        unique_concepts = list(dict.fromkeys(concepts))  # Pr√©serve l'ordre
        return unique_concepts[:8]
    
    def analyze_intention(self, text: str) -> str:
        """
        D√©tecte l'intention de l'utilisateur.
        """
        text_lower = text.lower()
        
        if any(marker in text_lower for marker in ["?", "comment", "pourquoi", "qu'est-ce"]):
            return "question"
        elif any(marker in text_lower for marker in ["explique", "clarifie", "pr√©cise"]):
            return "clarification"
        elif any(marker in text_lower for marker in ["d'accord", "ok", "compris", "oui"]):
            return "accord"
        elif any(marker in text_lower for marker in ["non", "mais", "pas d'accord", "faux"]):
            return "d√©saccord"
        else:
            return "neutre"
    
    def analyze_tension(self, text: str) -> str:
        """
        D√©tecte la tension/r√©sistance dans le message.
        """
        text_lower = text.lower()
        
        if any(marker in text_lower for marker in ["mais", "pourtant", "cependant", "toutefois"]):
            return "opposition"
        elif any(marker in text_lower for marker in ["comprends pas", "chelou", "bizarre"]):
            return "confusion"
        else:
            return "neutre"
    
    def analyze_style(self, text: str) -> str:
        """
        D√©tecte le style conversationnel souhait√©.
        """
        text_lower = text.lower()
        word_count = len(text.split())
        
        if word_count < 10:
            return "concis"
        elif any(marker in text_lower for marker in ["üòÇ", "lol", "mdr"]):
            return "humoristique"
        elif any(marker in text_lower for marker in ["exemple", "concr√®tement", "genre"]):
            return "p√©dagogique"
        else:
            return "standard"
    
    def encode_to_state_image(
        self,
        conversation: List[Dict],
        rag_passages: List[Dict],
        prev_state: Optional[Dict] = None,
        mini_store_feedback: Optional[Dict] = None
    ) -> Dict:
        """
        G√©n√®re STATE_IMAGE structur√© complet.
        
        Args:
            conversation: [{'user': '...', 'assistant': '...'}]
            rag_passages: [{'text': '...', 'concepts': [...]}]
            prev_state: STATE_IMAGE pr√©c√©dent (optionnel)
            mini_store_feedback: Feedback Mini-store (optionnel)
        
        Returns:
            STATE_IMAGE JSON complet
        """
        # Extraire dernier √©change
        last_exchange = conversation[-1] if conversation else {}
        user_text = last_exchange.get("user", "")
        assistant_text = last_exchange.get("assistant", "")
        
        # Extraction concepts conversationnels
        concepts_actifs = self.extract_keywords(user_text + " " + assistant_text, top_k=5)
        
        # Extraction concepts RAG (condens√©s)
        concepts_rag = self.extract_concepts_from_rag(rag_passages)
        sources_rag = [p.get("source", "?") for p in rag_passages]
        
        # Analyse multi-axes
        intention = self.analyze_intention(user_text)
        tension = self.analyze_tension(user_text)
        style = self.analyze_style(user_text)
        
        # Construire STATE_IMAGE
        state_image = {
            "concepts_actifs": concepts_actifs,
            "concepts_rag": concepts_rag,
            "sources_rag": sources_rag,
            "intention": intention,
            "tension": tension,
            "style": style,
            "ton": "bienveillant",  # Par d√©faut
            "priorite": ["concepts_actifs", "intention"],
            "relations": [],  # √Ä impl√©menter si besoin
            "emotion": "curieux",  # Heuristique simple
            "recurrence": mini_store_feedback.get("recurrences", {}) if mini_store_feedback else {},
            "metadata": {
                "philosopher": rag_passages[0].get("philosopher", "?") if rag_passages else None,
                "turn": (prev_state.get("metadata", {}).get("turn", 0) + 1) if prev_state else 1
            }
        }
        
        return state_image

print("‚úÖ Classe BERTEncoder d√©finie")

## 4. Tests Unitaires

In [None]:
# Initialiser BERT Encoder
encoder = BERTEncoder()

print("\n" + "="*60)
print("üß™ TESTS UNITAIRES")
print("="*60)

### Test 1: Extraction Keywords

In [None]:
test_text = "Le conatus est l'effort par lequel chaque chose s'efforce de pers√©v√©rer dans son √™tre. C'est la puissance d'agir selon Spinoza."

keywords = encoder.extract_keywords(test_text, top_k=5)
print("\nüìå Test 1: Extraction Keywords")
print(f"Texte: {test_text[:80]}...")
print(f"Keywords extraits: {keywords}")
assert len(keywords) > 0, "√âchec: Pas de keywords extraits"
print("‚úÖ Test 1 OK")

### Test 2: Extraction Concepts RAG

In [None]:
test_passages = [
    {
        "text": "Le conatus est l'effort pour pers√©v√©rer...",
        "concepts": ["conatus", "effort", "pers√©v√©rer"],
        "source": "Spinoza, √âthique III"
    },
    {
        "text": "Les affects sont des modifications de la puissance d'agir...",
        "concepts": ["affects", "puissance d'agir"],
        "source": "Spinoza, √âthique III"
    }
]

concepts_rag = encoder.extract_concepts_from_rag(test_passages)
print("\nüìå Test 2: Extraction Concepts RAG")
print(f"Passages: {len(test_passages)} fournis")
print(f"Concepts extraits: {concepts_rag}")
assert len(concepts_rag) > 0, "√âchec: Pas de concepts RAG extraits"
assert "conatus" in concepts_rag, "√âchec: 'conatus' devrait √™tre extrait"
assert all(len(c) < 50 for c in concepts_rag), "√âchec: Concepts trop longs (pas condens√©s)"
print("‚úÖ Test 2 OK")

### Test 3: Analyse Axes (Intention, Tension, Style)

In [None]:
test_cases = [
    ("Qu'est-ce que le conatus ?", "question"),
    ("Je suis d'accord avec ta d√©monstration", "accord"),
    ("Mais ce n'est pas logique !", "d√©saccord"),
    ("Je comprends pas le rapport", "neutre"),
]

print("\nüìå Test 3: Analyse Intention")
for text, expected in test_cases:
    intention = encoder.analyze_intention(text)
    print(f"  '{text}' ‚Üí {intention} (attendu: {expected})")
    assert intention == expected, f"√âchec: attendu '{expected}', obtenu '{intention}'"

print("‚úÖ Test 3 OK")

### Test 4: G√©n√©ration STATE_IMAGE Complet

In [None]:
test_conversation = [
    {
        "user": "Peux-tu m'expliquer le conatus de Spinoza ?",
        "assistant": "Le conatus est l'effort par lequel chaque chose pers√©v√®re dans son √™tre."
    }
]

state_image = encoder.encode_to_state_image(
    conversation=test_conversation,
    rag_passages=test_passages,
    prev_state=None,
    mini_store_feedback={}
)

print("\nüìå Test 4: STATE_IMAGE Complet")
print(json.dumps(state_image, indent=2, ensure_ascii=False))

# V√©rifications critiques
assert "concepts_actifs" in state_image, "√âchec: 'concepts_actifs' manquant"
assert "concepts_rag" in state_image, "√âchec: 'concepts_rag' manquant"
assert len(state_image["concepts_rag"]) > 0, "√âchec: Concepts RAG vides"
assert state_image["intention"] == "question", "√âchec: Intention mal d√©tect√©e"

# V√©rifier qu'il n'y a PAS de texte brut RAG dans STATE_IMAGE
state_str = json.dumps(state_image)
assert "l'effort pour pers√©v√©rer" not in state_str, "√âchec: Texte brut RAG pr√©sent dans STATE_IMAGE"

print("\n‚úÖ Test 4 OK - STATE_IMAGE valide !")

## 5. Validation Taille STATE_IMAGE

In [None]:
# Estimer taille en tokens (approximation)
state_text = json.dumps(state_image, ensure_ascii=False)
token_count = len(state_text.split())  # Approximation simple

print("\nüìä M√©triques STATE_IMAGE")
print(f"  Taille JSON: {len(state_text)} caract√®res")
print(f"  Tokens estim√©s: {token_count} (objectif: 150-250)")
print(f"  Concepts actifs: {len(state_image['concepts_actifs'])}")
print(f"  Concepts RAG: {len(state_image['concepts_rag'])}")
print(f"  Sources RAG: {len(state_image['sources_rag'])}")

if token_count <= 250:
    print("\n‚úÖ Taille STATE_IMAGE optimale (‚â§250 tokens)")
else:
    print(f"\n‚ö†Ô∏è STATE_IMAGE trop grand ({token_count} tokens) - √Ä optimiser")

## 6. Export pour R√©utilisation

In [None]:
# Sauvegarder classe BERTEncoder pour import dans Notebook 3 (int√©gration)
import pickle

# Sauvegarder instance (optionnel, si besoin)
# with open('bert_encoder.pkl', 'wb') as f:
#     pickle.dump(encoder, f)

# Exporter STATE_IMAGE exemple
with open('/content/state_image_example.json', 'w') as f:
    json.dump(state_image, f, indent=2, ensure_ascii=False)

print("\nüíæ Fichiers export√©s:")
print("  - state_image_example.json")
print("\n‚úÖ Notebook 1 termin√© !")

---

## üìù R√©sum√©

### ‚úÖ Impl√©ment√©
- ‚úÖ BERTEncoder charg√© (CPU - CamemBERT fran√ßais)
- ‚úÖ Extraction keywords/concepts conversationnels
- ‚úÖ Extraction concepts RAG condens√©s (pas texte brut)
- ‚úÖ Analyse multi-axes (intention, tension, style)
- ‚úÖ G√©n√©ration STATE_IMAGE JSON structur√©
- ‚úÖ Tests unitaires valid√©s

### üìä M√©triques Cibles POC
- **Taille STATE_IMAGE:** 150-250 tokens ‚úÖ
- **Concepts extraits:** 3-8 par tour ‚úÖ
- **Pas de texte brut RAG:** ‚úÖ Valid√©
- **Axes analys√©s:** 9/9 (complet) ‚úÖ

### ‚û°Ô∏è Prochaines √âtapes
1. **Notebook 2:** G√©n√©ration embeddings RAG (sentence-transformers)
2. **Notebook 3:** Tests Mistral 7B (GPU Colab T4)
3. **Int√©gration:** Pipeline complet BERT ‚Üí Mistral

---

**üí∞ Co√ªt:** 0‚Ç¨ (Colab gratuit CPU)

**‚è±Ô∏è Temps:** ~2-3h d√©veloppement + tests

**üéØ Objectif Phase 0:** D√©velopper composants sans infrastructure payante ‚úÖ