# 🔍 TRM POC - Notebook 2: RAG Embeddings Generation

**Goal:** Generate semantic embeddings for the corpora (Spinoza/Bergson/Kant) with sentence-transformers

**Runtime:** Free Colab GPU (T4, 12h max)

**Estimated duration:** 1-2h (generation + indexing)

---

## Phase 0 - TRM POC (0€)

This notebook implements:
1. Upload of the philosophical corpora (Spinoza/Bergson/Kant)
2. Splitting into passages (Markdown sections)
3. Embedding generation (sentence-transformers)
4. FAISS indexing for fast retrieval
5. Index export for reuse (Notebook 3 + Vast.ai)

**Note:** The free T4 GPU is sufficient for the embeddings (lightweight model)

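## 1. Installing Dependencies

In [None]:
# Install the required libraries.
# faiss-cpu is enough here: the notebook only builds flat CPU indexes, and the
# faiss-gpu package on PyPI may fail to install on recent Python versions.
!pip install -q sentence-transformers faiss-cpu

print("✅ Dependencies installed")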

## 2. Uploading the Corpus Files

**Manual action:** Upload the following files from `bergsonAndFriends/data/RAG/`:
- `Corpus Spinoza Dialogique 18k - Éthique II-IV.md`
- `Glossaire Conversationnel Spinoza - 12 Concepts.md`
- `corpus_bergson_27k_dialogique.md`
- `glossaire_bergson_conversationnel.md`
- `corpus_kant_20k.txt.md`
- `glossaire_kant_conversationnel.md`

Or clone the repo directly:

In [None]:
# Option 1: manual upload through the Colab interface
from google.colab import files
print("📤 Upload the corpus files (6 .md files expected)")
uploaded = files.upload()

# Option 2: clone the repo (if public), then copy the files into the working
# directory, because the cells below open them by bare file name
# !git clone https://github.com/YOUR_USERNAME/bergsonAndFriends.git
# CORPUS_DIR = "bergsonAndFriends/data/RAG/"
# !cp bergsonAndFriends/data/RAG/*.md .
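
The later cells open each file by name and skip any file that is missing, so a quick check after the upload can catch a forgotten or misnamed file early. The helper below is a minimal sketch, not part of the original notebook: `missing_corpus_files` and its arguments are hypothetical names, and it assumes the files sit in the current working directory.

```python
from pathlib import Path
from typing import Dict, List


def missing_corpus_files(corpus_files: Dict[str, Dict[str, str]],
                         directory: str = ".") -> List[str]:
    """Return the expected file names that are absent from `directory`."""
    base = Path(directory)
    missing = []
    for file_names in corpus_files.values():
        for name in file_names.values():
            if not (base / name).exists():
                missing.append(name)
    return missing


# Hypothetical two-file configuration, mirroring the CORPUS_FILES layout
example_files = {
    "bergson": {
        "corpus": "corpus_bergson_27k_dialogique.md",
        "glossaire": "glossaire_bergson_conversationnel.md",
    }
}
for name in missing_corpus_files(example_files):
    print(f"Missing: {name}")
```

In the notebook itself, the same function could be called with `CORPUS_FILES` once that dictionary is defined.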

## 3. Configuration & Imports

In [None]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
import json
import pickle
from pathlib import Path
from typing import List, Dict
import re

# Configuration
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # Light and fast, but trained mainly on English text
# Multilingual alternative (likely a better fit for French corpora): "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"

CORPUS_FILES = {
    "spinoza": {
        "corpus": "Corpus Spinoza Dialogique 18k - Éthique II-IV.md",
        "glossaire": "Glossaire Conversationnel Spinoza - 12 Concepts.md"
    },
    "bergson": {
        "corpus": "corpus_bergson_27k_dialogique.md",
        "glossaire": "glossaire_bergson_conversationnel.md"
    },
    "kant": {
        "corpus": "corpus_kant_20k.txt.md",
        "glossaire": "glossaire_kant_conversationnel.md"
    }
}

print("✅ Configuration OK")

## 4. Loading the Embedding Model

In [None]:
print(f"⏳ Loading {EMBEDDING_MODEL}...")
embedder = SentenceTransformer(EMBEDDING_MODEL)
print(f"✅ Model loaded - Dimension: {embedder.get_sentence_embedding_dimension()}")

## 5. Parsing the Corpus into Passages

In [None]:
def parse_corpus_to_passages(text: str, philosopher: str) -> List[Dict]:
    """
    Split a Markdown corpus into passages, one per header.

    Any line starting with '##' (level 2 or deeper) opens a new passage.

    Returns:
        List of passages with metadata
    """
    passages = []
    lines = text.split('\n')
    current_section = {"title": "", "content": ""}

    for line in lines:
        if line.startswith('##'):
            # Save the previous section
            if current_section["content"].strip():
                passages.append({
                    "text": f"{current_section['title']}\n{current_section['content']}",
                    "title": current_section["title"],
                    "philosopher": philosopher,
                    "type": "corpus"
                })
            # Start a new section, stripping the leading '#' characters from the title
            current_section = {
                "title": re.sub(r'^#+\s*', '', line),
                "content": ""
            }
        else:
            current_section["content"] += line + '\n'

    # Last section
    if current_section["content"].strip():
        passages.append({
            "text": f"{current_section['title']}\n{current_section['content']}",
            "title": current_section["title"],
            "philosopher": philosopher,
            "type": "corpus"
        })

    return passages

print("✅ Parsing function defined")
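
Header-based sections can be much longer than what the embedding model reads: all-MiniLM-L6-v2 truncates input after 256 word pieces by default, so the tail of a long section does not influence its embedding. One possible mitigation, sketched below under assumptions, is to split long passages into overlapping word windows before encoding. `split_long_passage`, `max_words` and `overlap` are hypothetical names and values, not part of the original notebook, and word counts only approximate the model's token counts.

```python
from typing import Dict, List


def split_long_passage(passage: Dict, max_words: int = 150,
                       overlap: int = 30) -> List[Dict]:
    """Split one passage dict into overlapping word windows.

    Passages at or under max_words are returned unchanged (as a one-item list).
    Whitespace, including line breaks, is collapsed in the split chunks.
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = passage["text"].split()
    if len(words) <= max_words:
        return [passage]
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + max_words]
        # Copy the metadata (title, philosopher, type) and replace the text
        chunks.append({**passage, "text": " ".join(window)})
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


# Synthetic 400-word passage, for illustration only
sample = {"text": " ".join(f"w{i}" for i in range(400)),
          "title": "Demo", "philosopher": "spinoza", "type": "corpus"}
chunks = split_long_passage(sample)
print(len(chunks), [len(c["text"].split()) for c in chunks])
```

The overlap keeps a sentence that straddles a window boundary retrievable from at least one chunk, at the cost of a somewhat larger index.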

## 6. Generating Embeddings per Philosopher

In [None]:
# Store passages + embeddings per philosopher
all_data = {}

# The loop variable is named file_names so that it does not shadow google.colab.files
for philosopher, file_names in CORPUS_FILES.items():
    print(f"\n{'='*60}")
    print(f"📚 Processing: {philosopher.upper()}")
    print(f"{'='*60}")

    passages = []

    # Load the corpus
    corpus_file = file_names["corpus"]
    if Path(corpus_file).exists():
        with open(corpus_file, 'r', encoding='utf-8') as f:
            corpus_text = f.read()
        corpus_passages = parse_corpus_to_passages(corpus_text, philosopher)
        passages.extend(corpus_passages)
        print(f"  ✅ Corpus: {len(corpus_passages)} passages")
    else:
        print(f"  ⚠️ Corpus file not found: {corpus_file}")

    # Load the glossary
    glossaire_file = file_names["glossaire"]
    if Path(glossaire_file).exists():
        with open(glossaire_file, 'r', encoding='utf-8') as f:
            glossaire_text = f.read()
        glossaire_passages = parse_corpus_to_passages(glossaire_text, philosopher)
        for p in glossaire_passages:
            p["type"] = "glossaire"  # Mark as glossary
        passages.extend(glossaire_passages)
        print(f"  ✅ Glossary: {len(glossaire_passages)} passages")
    else:
        print(f"  ⚠️ Glossary file not found: {glossaire_file}")

    print(f"  📊 Total passages: {len(passages)}")

    # Skip philosophers with no passages: an empty embedding matrix would break indexing
    if not passages:
        print(f"  ⚠️ No passages for {philosopher}, skipping")
        continue

    # Generate embeddings
    print(f"  ⏳ Generating embeddings...")
    texts = [p["text"] for p in passages]
    embeddings = embedder.encode(texts, show_progress_bar=True, convert_to_numpy=True)
    print(f"  ✅ Embeddings generated: {embeddings.shape}")

    # Store
    all_data[philosopher] = {
        "passages": passages,
        "embeddings": embeddings
    }

print(f"\n✅ All embeddings generated!")
for phil, data in all_data.items():
    print(f"  {phil}: {len(data['passages'])} passages")

## 7. FAISS Indexing

In [None]:
# Build one FAISS index per philosopher
faiss_indexes = {}

for philosopher, data in all_data.items():
    print(f"\n🔍 FAISS indexing: {philosopher}")

    embeddings = data["embeddings"]
    dimension = embeddings.shape[1]

    # Create the FAISS index (IndexFlatIP = exact inner-product search)
    index = faiss.IndexFlatIP(dimension)

    # L2-normalize in place so that the inner product equals cosine similarity
    faiss.normalize_L2(embeddings)

    # Add to the index
    index.add(embeddings)

    faiss_indexes[philosopher] = index
    print(f"  ✅ Index created: {index.ntotal} vectors")

print("\n✅ All FAISS indexes created!")
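
The normalization step matters because `IndexFlatIP` ranks by raw inner product, and only for unit-length vectors does that equal cosine similarity. The pure-Python sketch below illustrates the identity on toy 3-dimensional vectors (made-up values, not real embeddings); the helper names are illustrative and do not come from FAISS.

```python
import math
from typing import List


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity computed directly from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [1.0, 2.0, 2.0]
short_doc = [0.5, 1.0, 1.0]    # same direction as the query, small norm
long_doc = [10.0, 0.0, 0.0]    # different direction, large norm

# The raw inner product favors the long vector...
print(dot(query, short_doc), dot(query, long_doc))
# ...while the inner product of normalized vectors matches cosine similarity
print(dot(l2_normalize(query), l2_normalize(short_doc)), cosine(query, short_doc))
```

Without normalization, a passage whose embedding merely has a large norm could outrank a passage pointing in the same direction as the query.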

## 8. Retrieval Test

In [None]:
def retrieve_passages(query: str, philosopher: str, top_k: int = 3):
    """
    Retrieve the most relevant passages via FAISS.
    """
    # Encode the query
    query_emb = embedder.encode([query], convert_to_numpy=True)
    faiss.normalize_L2(query_emb)

    # Search
    index = faiss_indexes[philosopher]
    scores, indices = index.search(query_emb, top_k)

    # Collect the passages
    results = []
    for score, idx in zip(scores[0], indices[0]):
        if idx < 0:
            # FAISS pads with -1 when the index holds fewer than top_k vectors
            continue
        # Copy the dict so the score is not written into the stored passages,
        # which are pickled as-is in the export step
        passage = dict(all_data[philosopher]["passages"][idx])
        passage["similarity_score"] = float(score)
        results.append(passage)

    return results

# Test
print("\n🧪 RETRIEVAL TEST")
print("="*60)
test_query = "Qu'est-ce que le conatus ?"  # French query, matching the corpus language
results = retrieve_passages(test_query, "spinoza", top_k=3)

print(f"Query: '{test_query}'")
print(f"\nTop 3 passages (Spinoza):\n")
for i, r in enumerate(results, 1):
    print(f"{i}. [{r['type']}] {r['title']} (score: {r['similarity_score']:.3f})")
    print(f"   {r['text'][:150]}...\n")

print("✅ Retrieval working!")
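
A flat index performs an exhaustive search: the query is compared against every stored vector and the `top_k` highest scores are returned. The pure-Python sketch below mirrors that logic on hand-made unit vectors so the ranking can be checked by eye. It is an illustration with hypothetical names and toy data, not a replacement for FAISS.

```python
import heapq
from typing import List, Tuple


def brute_force_top_k(query: List[float], vectors: List[List[float]],
                      top_k: int = 3) -> List[Tuple[float, int]]:
    """Return (score, index) pairs for the top_k largest inner products."""
    scored = [
        (sum(q * v for q, v in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    ]
    # nlargest returns fewer than top_k pairs when there are fewer vectors
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[0])


# Toy unit vectors in 2D (assumed already L2-normalized)
stored = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [-1.0, 0.0]]
query_vec = [0.8, 0.6]

for score, idx in brute_force_top_k(query_vec, stored, top_k=2):
    print(f"index={idx} score={score:.2f}")
```

Exhaustive search costs one inner product per stored vector per query, which stays cheap for a few thousand passages per philosopher.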

## 9. Export for Reuse

In [None]:
# Create the export folder
!mkdir -p /content/rag_exports

# Save the FAISS index + passages per philosopher
for philosopher in all_data.keys():
    print(f"\n💾 Export: {philosopher}")

    # FAISS index
    faiss.write_index(faiss_indexes[philosopher], f"/content/rag_exports/{philosopher}_faiss.index")
    print(f"  ✅ FAISS index saved")

    # Passages (pickle)
    with open(f"/content/rag_exports/{philosopher}_passages.pkl", 'wb') as f:
        pickle.dump(all_data[philosopher]["passages"], f)
    print(f"  ✅ Passages saved")

# Save the config
config = {
    "embedding_model": EMBEDDING_MODEL,
    "dimension": embedder.get_sentence_embedding_dimension(),
    "philosophers": list(all_data.keys()),
    "total_passages": {phil: len(data["passages"]) for phil, data in all_data.items()}
}

with open('/content/rag_exports/config.json', 'w', encoding='utf-8') as f:
    json.dump(config, f, indent=2)

print("\n✅ All files exported to /content/rag_exports/")
print("\n📋 Files to download:")
!ls -lh /content/rag_exports/
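
Before uploading the export elsewhere, it may be worth checking that it is complete and consistent. The sketch below reads `config.json` and verifies that each listed philosopher has both files and that the pickled passage count matches `total_passages`. `verify_export` is a hypothetical helper, and the demo builds a throwaway export in a temporary directory with a placeholder file rather than a real FAISS index.

```python
import json
import pickle
import tempfile
from pathlib import Path
from typing import List


def verify_export(export_dir: str) -> List[str]:
    """Return a list of problems found in an export folder (empty if none)."""
    base = Path(export_dir)
    config = json.loads((base / "config.json").read_text(encoding="utf-8"))
    problems = []
    for phil in config["philosophers"]:
        index_path = base / f"{phil}_faiss.index"
        passages_path = base / f"{phil}_passages.pkl"
        if not index_path.exists():
            problems.append(f"{phil}: missing {index_path.name}")
        if not passages_path.exists():
            problems.append(f"{phil}: missing {passages_path.name}")
            continue
        # Only unpickle files you created yourself: pickle can run arbitrary code
        with open(passages_path, "rb") as f:
            passages = pickle.load(f)
        expected = config["total_passages"][phil]
        if len(passages) != expected:
            problems.append(f"{phil}: {len(passages)} passages, expected {expected}")
    return problems


# Demo on a fake export (placeholder bytes stand in for the FAISS index)
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    (tmp_path / "kant_faiss.index").write_bytes(b"placeholder")
    with open(tmp_path / "kant_passages.pkl", "wb") as f:
        pickle.dump([{"text": "demo", "title": "t"}], f)
    cfg = {"philosophers": ["kant"], "total_passages": {"kant": 1}}
    (tmp_path / "config.json").write_text(json.dumps(cfg), encoding="utf-8")
    demo_problems = verify_export(tmp)

print(demo_problems)
```

On the real export, the call would be `verify_export("/content/rag_exports")`; the check does not open the index files, so it cannot detect a corrupted index.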

## 10. ZIP Download

In [None]:
# Create a ZIP archive for download
!zip -r /content/rag_exports.zip /content/rag_exports/

# Download through the Colab interface
from google.colab import files
files.download('/content/rag_exports.zip')

print("\n✅ Archive downloaded - upload it to Vast.ai for Phase 1!")

---

## 📝 Summary

### ✅ Implemented
- ✅ Parsing of the 3 corpora (Spinoza/Bergson/Kant) into passages
- ✅ Embedding generation (sentence-transformers)
- ✅ FAISS indexing per philosopher
- ✅ Working semantic retrieval
- ✅ Export of index + passages for reuse

### 📊 Statistics
- **Model:** all-MiniLM-L6-v2 (dimension: 384)
- **Total passages:** ~XXX (to be filled in after execution)
- **FAISS indexes:** 3 (Spinoza, Bergson, Kant)

### 📦 Exported Files
1. `spinoza_faiss.index` + `spinoza_passages.pkl`
2. `bergson_faiss.index` + `bergson_passages.pkl`
3. `kant_faiss.index` + `kant_passages.pkl`
4. `config.json` (metadata)

### ➡️ Next Steps
1. **Notebook 3:** Mistral 7B tests (Colab T4 GPU)
2. **Phase 1 (Vast.ai):** Upload the RAG index + full integration

---

**💰 Cost:** 0€ (free Colab T4 GPU)

**⏱️ Time:** ~1-2h (depends on corpus size)

**🎯 Phase 0 goal:** Semantic RAG ready for the POC ✅