# 🧠 Creating Embeddings

Generate embedding vectors with **Mistral AI** or **sentence-transformers**.

**Steps:**

1. Provider configuration
2. Vector generation
3. Validation and analysis
4. Export for FAISS


## 1️⃣ Configuration


In [1]:
import json
import numpy as np
from pathlib import Path
from typing import List

from tqdm.notebook import tqdm

from src.config.constants import PROCESSED_DATA_DIR
from src.config.settings import settings

print("✅ Configuration OK")
print(f"📊 Embedding provider: {settings.embedding_provider}")
print(f"🧠 Model: {settings.embedding_model}")

✅ Configuration OK
📊 Embedding provider: mistral
🧠 Model: mistral-embed
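The `settings` object comes from `src.config.settings`, whose implementation is not shown in this notebook. As a rough illustration only, an object exposing the three attributes used here (`embedding_provider`, `embedding_model`, `mistral_api_key`) could be sketched with the standard library as below. The class name `EmbeddingSettings`, the `load_settings` helper, and the environment variable names are all assumptions, not the project's actual code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class EmbeddingSettings:
    """Hypothetical stand-in for the project's `settings` object."""

    embedding_provider: str = "mistral"
    embedding_model: str = "mistral-embed"
    mistral_api_key: Optional[str] = None


def load_settings(env: Mapping[str, str] = os.environ) -> EmbeddingSettings:
    """Build settings from environment variables (variable names are assumed)."""
    return EmbeddingSettings(
        embedding_provider=env.get("EMBEDDING_PROVIDER", "mistral"),
        embedding_model=env.get("EMBEDDING_MODEL", "mistral-embed"),
        # Treat an empty string the same as a missing key.
        mistral_api_key=env.get("MISTRAL_API_KEY") or None,
    )


demo = load_settings({"EMBEDDING_PROVIDER": "mistral", "MISTRAL_API_KEY": "dummy-key"})
print(demo.embedding_provider, demo.embedding_model)
```

Passing the environment as a plain mapping keeps the helper easy to exercise without touching real credentials.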


## 2️⃣ Loading the Data


In [2]:
# Load the preprocessed documents
documents_path = PROCESSED_DATA_DIR / "rag_documents.json"

with open(documents_path, "r", encoding="utf-8") as f:
    documents = json.load(f)

print(f"📊 {len(documents)} documents loaded")
print("\n📋 Sample content to vectorize:")
print(documents[0]["content"][:300])

📊 497 documents loaded

📋 Sample content to vectorize:
Titre: Formation Civique et Citoyenne "Cadre de vie et architecture"
Ville: Marseille
Date: Du 14/06/2022 à 07:00 au 31/12/2025 à 22:59
Description: Formation obligatoire à l'attention des volontaires de Service Civique, proposée par la MAV PACA.
Adresse: 12 Bd Théodore Thurner 13006 Marseille
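Later cells read the `content`, `title`, and `id` keys of every document, so a malformed entry would only surface as a `KeyError` partway through the notebook. A minimal sketch of an early check follows; the helper name `find_invalid_documents` and the sample records are hypothetical and not part of the project.

```python
from typing import Any, Dict, List, Tuple

# Keys read by later cells of this notebook.
REQUIRED_KEYS = ("id", "title", "content")


def find_invalid_documents(docs: List[Dict[str, Any]]) -> List[Tuple[int, str]]:
    """Return (index, reason) pairs for documents that would break later cells."""
    problems = []
    for index, doc in enumerate(docs):
        missing = [key for key in REQUIRED_KEYS if key not in doc]
        if missing:
            problems.append((index, f"missing keys: {', '.join(missing)}"))
        elif not str(doc["content"]).strip():
            problems.append((index, "empty content"))
    return problems


# Invented sample records, for illustration only.
sample_docs = [
    {"id": "a1", "title": "Exposition", "content": "Titre: Exposition\nVille: Marseille"},
    {"id": "a2", "title": "Sans contenu", "content": "   "},
    {"id": "a3", "content": "Titre: ..."},
]
print(find_invalid_documents(sample_docs))
```

With the notebook's data, the call would be `find_invalid_documents(documents)`, and an empty list would mean every document can be embedded and exported.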


## 3️⃣ Unified Embedding Class


In [3]:
from typing import List, Optional

from mistralai import Mistral
from sentence_transformers import SentenceTransformer


class EmbeddingGenerator:
    """Unified embedding generator (Mistral or SentenceTransformers)."""

    def __init__(
        self,
        provider: Optional[str] = None,
        model_name: Optional[str] = None,
        api_key: Optional[str] = None,
    ):
        self.provider = provider or settings.embedding_provider
        self.model_name = model_name or settings.embedding_model

        print(f"🔄 Initializing provider: {self.provider}")

        if self.provider == "mistral":
            # An explicit api_key argument takes precedence over the settings value
            resolved_key = api_key or settings.mistral_api_key
            if not resolved_key:
                raise ValueError("Mistral API key required")

            self.client = Mistral(api_key=resolved_key)
            self.dimension = 1024  # output dimension of mistral-embed
            print(f"✅ Mistral client initialized (dimension: {self.dimension})")

        else:  # any other provider value falls back to sentence-transformers
            self.model = SentenceTransformer(self.model_name)
            self.dimension = self.model.get_sentence_embedding_dimension()
            print(f"✅ SentenceTransformer loaded (dimension: {self.dimension})")

    def encode(
        self, texts: List[str], batch_size: int = 32, show_progress: bool = True
    ) -> np.ndarray:
        """Encode a list of texts into embeddings."""

        if self.provider == "mistral":
            return self._encode_mistral(texts, batch_size, show_progress)
        else:
            return self._encode_sentence_transformer(texts, batch_size, show_progress)

    def _encode_mistral(self, texts: List[str], batch_size: int, show_progress: bool) -> np.ndarray:
        """Encode with the Mistral API, one request per batch."""
        embeddings = []

        iterator = range(0, len(texts), batch_size)
        if show_progress:
            iterator = tqdm(iterator, desc="🧠 Mistral Embeddings")

        for i in iterator:
            batch = texts[i : i + batch_size]

            # Mistral API call
            response = self.client.embeddings.create(model=self.model_name, inputs=batch)

            # Extract the embeddings, in the same order as the inputs
            batch_embeddings = [item.embedding for item in response.data]
            embeddings.extend(batch_embeddings)

        return np.array(embeddings, dtype=np.float32)

    def _encode_sentence_transformer(
        self, texts: List[str], batch_size: int, show_progress: bool
    ) -> np.ndarray:
        """Encode with SentenceTransformer."""
        return self.model.encode(
            texts, batch_size=batch_size, show_progress_bar=show_progress, convert_to_numpy=True
        )


print("✅ EmbeddingGenerator class defined")

✅ EmbeddingGenerator class defined


## 4️⃣ Generating the Embeddings


In [4]:
# Initialize the generator
generator = EmbeddingGenerator()

# Extract the contents
contents = [doc["content"] for doc in documents]

print(f"🧠 Generating {len(contents)} embeddings...")
print("⚠️ This may take several minutes\n")

# Generate the embeddings
embeddings = generator.encode(contents, batch_size=32, show_progress=True)

print("\n✅ Embeddings generated")
print(f"📐 Shape: {embeddings.shape}")
print(f"💾 Size: {embeddings.nbytes / 1024 / 1024:.2f} MB")

🔄 Initializing provider: mistral
✅ Mistral client initialized (dimension: 1024)
🧠 Generating 497 embeddings...
⚠️ This may take several minutes



🧠 Mistral Embeddings:   0%|          | 0/16 [00:00<?, ?it/s]


✅ Embeddings generated
📐 Shape: (497, 1024)
💾 Size: 1.94 MB
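The figures in this output can be checked by hand: the progress bar counts 16 iterations because 497 documents are sent in batches of 32, and the reported size follows from storing 497 × 1024 `float32` values. The short sketch below redoes that arithmetic with the numbers shown above; the variable names are illustrative only.

```python
import math

num_documents = 497   # documents loaded earlier in this notebook
batch_size = 32       # value passed to generator.encode
dimension = 1024      # second axis of the reported shape
bytes_per_value = 4   # float32

# One API request per batch; the last batch is smaller (497 = 15 * 32 + 17).
num_batches = math.ceil(num_documents / batch_size)
last_batch_size = num_documents - (num_batches - 1) * batch_size

# Same arithmetic as `embeddings.nbytes / 1024 / 1024`.
size_mb = num_documents * dimension * bytes_per_value / 1024 / 1024

print(num_batches, last_batch_size, f"{size_mb:.2f} MB")
```

This prints `16 17 1.94 MB`, matching the 16 progress-bar steps and the size reported by the cell.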


## 5️⃣ Validation


In [5]:
# Check the quality of the embeddings
print("✅ Embedding validation:")
print(f"  Count: {len(embeddings)}")
print(f"  Dimension: {embeddings.shape[1]}")
print(f"  Dtype: {embeddings.dtype}")
print(f"  NaN values: {np.isnan(embeddings).sum()}")
print(f"  Min: {embeddings.min():.4f}")
print(f"  Max: {embeddings.max():.4f}")
print(f"  Mean: {embeddings.mean():.4f}")

✅ Embedding validation:
  Count: 497
  Dimension: 1024
  Dtype: float32
  NaN values: 0
  Min: -0.1210
  Max: 0.1220
  Mean: -0.0002
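These statistics are only printed, so a bad batch would go unnoticed unless someone reads the output. One possible way to turn them into hard checks is sketched below. The helper name `check_embeddings` is hypothetical, and the row norms are reported rather than asserted because this notebook does not state whether the vectors are normalized.

```python
import numpy as np


def check_embeddings(vectors: np.ndarray, expected_rows: int, expected_dim: int) -> dict:
    """Raise ValueError on structural problems; return norm statistics otherwise."""
    if vectors.ndim != 2 or vectors.shape != (expected_rows, expected_dim):
        raise ValueError(f"unexpected shape {vectors.shape}")
    if not np.isfinite(vectors).all():
        raise ValueError("embeddings contain NaN or infinite values")
    norms = np.linalg.norm(vectors, axis=1)
    if (norms == 0).any():
        raise ValueError("at least one embedding is the zero vector")
    return {"min_norm": float(norms.min()), "max_norm": float(norms.max())}


# Synthetic stand-in data; with the notebook's variables the call would be
# check_embeddings(embeddings, len(documents), generator.dimension).
rng = np.random.default_rng(0)
fake = rng.normal(size=(5, 8)).astype(np.float32)
print(check_embeddings(fake, expected_rows=5, expected_dim=8))
```

A zero vector is worth rejecting explicitly, since it has no direction and its cosine similarity to any other vector is undefined.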


## 6️⃣ Similarity Test


In [6]:
from sklearn.metrics.pairwise import cosine_similarity

# Test on the first 5 documents
sample_embeddings = embeddings[:5]
similarity_matrix = cosine_similarity(sample_embeddings)

print("🔍 Similarity matrix (first 5 documents):")
print("\nDocument indices:")
for i in range(5):
    print(f"{i}: {documents[i]['title'][:50]}")

print("\n📊 Similarities:")
print(similarity_matrix.round(3))

🔍 Similarity matrix (first 5 documents):

Document indices:
0: Formation Civique et Citoyenne "Cadre de vie et ar
1: "CONSTRUCTEURS DE DEMAIN, Découverte ludique des m
2: "Viv(r)e l'architecture, un architecte dans la cla
3: Exposition "Hors Site (mais pas hors sol)"
4: Devenez conducteur de bus - POEI RTM

📊 Similarities:
[[1.    0.872 0.902 0.85  0.84 ]
 [0.872 1.    0.924 0.873 0.838]
 [0.902 0.924 1.    0.876 0.836]
 [0.85  0.873 0.876 1.    0.814]
 [0.84  0.838 0.836 0.814 1.   ]]
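For readers who want to see what the similarity matrix contains, cosine similarity is the dot product of the rows after each has been scaled to unit length. The NumPy sketch below illustrates that definition on toy vectors; `cosine_similarity_matrix` is a hypothetical helper written for this explanation, not a replacement for the scikit-learn function used above.

```python
import numpy as np


def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms  # assumes no row is the zero vector
    return unit @ unit.T


# Toy vectors: the second is at 45 degrees to both the first and the third.
toy = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
print(cosine_similarity_matrix(toy).round(3))
```

The diagonal is 1 because every vector is identical to itself, the orthogonal pair scores 0, and the 45 degree pairs score about 0.707. Vector length plays no role: the third toy vector has norm 2 and still compares like a unit vector.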


## 7️⃣ Export


In [7]:
# Create the embeddings directory (and any missing parent directories)
embeddings_dir = PROCESSED_DATA_DIR / "embeddings"
embeddings_dir.mkdir(parents=True, exist_ok=True)

# Export the embeddings (NumPy format)
embeddings_path = embeddings_dir / "embeddings.npy"
np.save(embeddings_path, embeddings)

# Export the metadata; provider and model are read from the generator so the
# file records what was actually used, even if the defaults were overridden
metadata_path = embeddings_dir / "metadata.json"
metadata = {
    "provider": generator.provider,
    "model_name": generator.model_name,
    "embedding_dim": embeddings.shape[1],
    "num_documents": len(embeddings),
    "document_ids": [doc["id"] for doc in documents],
}

with open(metadata_path, "w", encoding="utf-8") as f:
    json.dump(metadata, f, ensure_ascii=False, indent=2)

print(f"💾 Embeddings: {embeddings_path}")
print(f"💾 Metadata: {metadata_path}")
print("\n✅ Export complete!")
print("\n➡️ Next step: Notebook 04 - Building the FAISS index")

💾 Embeddings: /Users/ppluton/Documents/Repositories/OC7---Projet-RAG-Assistant-Intelligent-Events/data/processed/embeddings/embeddings.npy
💾 Metadata: /Users/ppluton/Documents/Repositories/OC7---Projet-RAG-Assistant-Intelligent-Events/data/processed/embeddings/metadata.json

✅ Export complete!

➡️ Next step: Notebook 04 - Building the FAISS index
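Because the vectors and the document IDs are written to two separate files, the row order of `embeddings.npy` must stay aligned with `document_ids` in `metadata.json`. A sketch of how a downstream notebook might reload both files and confirm they agree is given below. The helper `load_and_verify` is hypothetical, and the demonstration uses a temporary directory with tiny synthetic data rather than the real export.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def load_and_verify(directory: Path):
    """Reload embeddings.npy and metadata.json and check that they agree."""
    vectors = np.load(directory / "embeddings.npy")
    with open(directory / "metadata.json", "r", encoding="utf-8") as f:
        meta = json.load(f)
    if vectors.shape != (meta["num_documents"], meta["embedding_dim"]):
        raise ValueError(f"shape {vectors.shape} does not match metadata")
    if len(meta["document_ids"]) != vectors.shape[0]:
        raise ValueError("document_ids and embeddings have different lengths")
    return vectors, meta


# Demonstration on a temporary directory with tiny synthetic data.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    np.save(tmp_dir / "embeddings.npy", np.zeros((3, 4), dtype=np.float32))
    meta_in = {"embedding_dim": 4, "num_documents": 3, "document_ids": ["a", "b", "c"]}
    (tmp_dir / "metadata.json").write_text(json.dumps(meta_in), encoding="utf-8")
    loaded, meta_out = load_and_verify(tmp_dir)

print(loaded.shape, meta_out["document_ids"])
```

With the real export, the call would be `load_and_verify(embeddings_dir)`; row `i` of the returned array then corresponds to `document_ids[i]`.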
