<h1>This notebook presents a complete implementation of a question-answering system that uses ChromaDB as the vector database, LLaMA-2 for answer generation, and SQuAD as the corpus whose embeddings are indexed.</h1>

<h1>📌 Overall pipeline:</h1>

1. **Preprocessing:** Extract the questions, contexts, and answers from the SQuAD dataset.
2. **Vectorization:** Convert the questions and contexts into embeddings with sentence-transformers.
3. **Indexing:** Store the question embeddings in ChromaDB, with the matching question, context, and answer attached as metadata.
4. **Retrieval:** When a new question is asked, fetch the most relevant passages via ChromaDB.
5. **Answer generation:** Use LLaMA-2 to generate the answer from the retrieved passages. The code loads the stock `meta-llama/Llama-2-7b-chat-hf` checkpoint; no fine-tuning on SQuAD is performed in this notebook.
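
To make the flow of steps 1 to 4 concrete, here is a minimal sketch that runs without any model or database. It is not the notebook's implementation: `toy_embed` is a hypothetical stand-in for the sentence-transformers model (plain word counts), a Python list stands in for ChromaDB, and the two records are made up for illustration.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    cleaned = text.lower().replace("?", " ").replace(".", " ")
    return Counter(cleaned.split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Step 1: made-up (question, context) records standing in for SQuAD rows
records = [
    {"question": "Where is Paris located?",
     "context": "Paris is located in northern central France."},
    {"question": "Who wrote Hamlet?",
     "context": "Hamlet is a tragedy written by William Shakespeare."},
]

# Steps 2-3: embed each stored question and keep it next to its record
index = [(toy_embed(r["question"]), r) for r in records]

# Step 4: rank stored questions by similarity to the new question
def toy_retrieve(question, top_k=1):
    query_vec = toy_embed(question)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [rec["context"] for _, rec in ranked[:top_k]]

print(toy_retrieve("Where is Paris?"))
```

The real pipeline below follows the same shape: the new question is compared with the stored questions, and the context attached to the closest matches is what gets returned.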

# 0. Installing dependencies


In [None]:
#pip install chromadb sentence-transformers transformers torch

In [None]:
#!pip install datasets
#!pip install --upgrade sqlalchemy

# 1. Loading the SQuAD dataset

In [None]:
from datasets import load_dataset

# Load SQuAD v1.1 (training split)
dataset = load_dataset("squad", split="train")

# Extract the questions and contexts
questions = dataset["question"]
contexts = dataset["context"]
answers = [ans["text"][0] if ans["text"] else "" for ans in dataset["answers"]]  # Keep the first answer
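
In SQuAD, several questions are asked about the same paragraph, so the `contexts` list contains many repeated strings. If you only need each paragraph once (for example, to embed contexts without redundant work), a small order-preserving deduplication helps. The sketch below uses a made-up toy list in place of `dataset["context"]`, and `unique_in_order` is a hypothetical helper name.

```python
def unique_in_order(items):
    """Return the distinct items, keeping first-seen order."""
    # dict preserves insertion order, so this drops repeats without reordering
    return list(dict.fromkeys(items))

# Toy stand-in for dataset["context"]: two questions share the first paragraph
toy_contexts = ["Paragraph A", "Paragraph A", "Paragraph B"]
unique_contexts = unique_in_order(toy_contexts)
print(len(toy_contexts), "->", len(unique_contexts))  # 3 -> 2
```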

# 2. Creating embeddings with Sentence Transformers

In [None]:
from sentence_transformers import SentenceTransformer

# Load SentenceTransformer to generate the embeddings
embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Convert the questions and contexts into embeddings
# (only question_embeddings is indexed in section 3)
question_embeddings = embedding_model.encode(questions, convert_to_numpy=True)
context_embeddings = embedding_model.encode(contexts, convert_to_numpy=True)
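
Encoding the whole training split produces two large matrices, so it can be worth estimating their size beforehand. The arithmetic below is a rough sketch under stated assumptions: 87,599 rows (the size of the SQuAD v1.1 training split), 384 dimensions (the output size of `all-MiniLM-L6-v2`), and 4 bytes per value (float32). `embedding_matrix_bytes` is a hypothetical helper, not part of any library.

```python
def embedding_matrix_bytes(num_texts, dim=384, bytes_per_value=4):
    """Approximate in-memory size of a dense num_texts x dim embedding matrix."""
    return num_texts * dim * bytes_per_value

size = embedding_matrix_bytes(87_599)
print(f"{size / 1e6:.1f} MB per matrix")  # 134.6 MB per matrix
```

Under these assumptions, `question_embeddings` and `context_embeddings` together take roughly twice that amount.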

# 3. Storing the embeddings in ChromaDB

In [None]:
import chromadb

# Initialize ChromaDB (persisted on disk under ./chroma_db)
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_or_create_collection(name="squad_qna")

# Add the question embeddings to ChromaDB, one record per call
for i, question in enumerate(questions):
    collection.add(
        ids=[str(i)],
        embeddings=[question_embeddings[i].tolist()],
        metadatas=[{"question": question, "context": contexts[i], "answer": answers[i]}]
    )
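
Calling `collection.add` once per record works but is slow for tens of thousands of rows, since `add` accepts lists of ids, embeddings, and metadatas. One possible alternative is to insert in batches. The sketch below only implements the index chunking; `batch_ranges` is a hypothetical helper, the batch size of 1000 is an arbitrary assumption, and the ChromaDB call is shown as a comment because it depends on the objects created in the cells above.

```python
def batch_ranges(total, batch_size):
    """Yield (start, end) index pairs that cover range(total) in chunks."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)

# Hypothetical usage with the notebook's objects (not executed here):
# for start, end in batch_ranges(len(questions), 1000):
#     collection.add(
#         ids=[str(i) for i in range(start, end)],
#         embeddings=question_embeddings[start:end].tolist(),
#         metadatas=[
#             {"question": questions[i], "context": contexts[i], "answer": answers[i]}
#             for i in range(start, end)
#         ],
#     )

print(list(batch_ranges(10, 4)))  # [(0, 4), (4, 8), (8, 10)]
```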

# 4. Retrieving relevant passages

In [None]:
def retrieve_context(question, top_k=3):
    # Encode the question
    query_embedding = embedding_model.encode([question], convert_to_numpy=True)

    # Search ChromaDB for the nearest stored questions
    results = collection.query(
        query_embeddings=query_embedding.tolist(),
        n_results=top_k
    )

    # Return the contexts attached to the closest matches
    return [doc["context"] for doc in results["metadatas"][0]]
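
Because the index stores one entry per question and several questions share a paragraph, the top matches can point to the same context more than once. A possible refinement is to keep only distinct contexts. The sketch below assumes the result layout used above (`results["metadatas"][0]` is a list of metadata dicts); `contexts_from_results` is a hypothetical helper and `fake_results` is a hand-written stand-in, not real ChromaDB output.

```python
def contexts_from_results(results, max_contexts=3):
    """Collect distinct `context` strings from a Chroma-style query result."""
    distinct = []
    for metadata in results["metadatas"][0]:
        context = metadata["context"]
        if context not in distinct:
            distinct.append(context)
    return distinct[:max_contexts]

# Hand-written stand-in for the result of a single query
fake_results = {
    "metadatas": [[
        {"question": "q1", "context": "Paris is located in northern central France.", "answer": "a1"},
        {"question": "q2", "context": "Paris is located in northern central France.", "answer": "a2"},
        {"question": "q3", "context": "Another paragraph.", "answer": "a3"},
    ]]
}

print(len(contexts_from_results(fake_results)))  # 2
```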

In [None]:
# Example
query = "What is the capital of France?"
relevant_contexts = retrieve_context(query)
print(len(relevant_contexts))

In [None]:
for context in relevant_contexts:
  print(context)
  print(50*'+')

Paris is located in northern central France. By road it is 450 kilometres (280 mi) south-east of London, 287 kilometres (178 mi) south of Calais, 305 kilometres (190 mi) south-west of Brussels, 774 kilometres (481 mi) north of Marseille, 385 kilometres (239 mi) north-east of Nantes, and 135 kilometres (84 mi) south-east of Rouen. Paris is located in the north-bending arc of the river Seine and includes two islands, the Île Saint-Louis and the larger Île de la Cité, which form the oldest part of the city. The river's mouth on the English Channel (La Manche) is about 233 mi (375 km) downstream of the city, established around 7600 BC. The city is spread widely on both banks of the river. Overall, the city is relatively flat, and the lowest point is 35 m (115 ft) above sea level. Paris has several prominent hills, the highest of which is Montmartre at 130 m (427 ft). Montmartre gained its name from the martyrdom of Saint Denis, first bishop of Paris, atop the Mons Martyrum, "Martyr's mound

# 5. Answer generation with LLaMA-2

In [None]:
from huggingface_hub import login
login()

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load LLaMA-2 (gated model: requires the Hugging Face login above)
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

In [None]:
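def generate_answer(question):
    # Retrieve the supporting passages (a list of strings) for the question
    context = retrieve_context(question)
    prompt = f"Question: {question}\nContext: {context}\nAnswer:"

    # Send the inputs to the model's device (chosen by device_map="auto")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # max_new_tokens limits only the generated continuation, not the prompt
    output = model.generate(**inputs, max_new_tokens=200)

    # The decoded text contains the prompt followed by the generated answer
    return tokenizer.decode(output[0], skip_special_tokens=True)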
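
Note that `context` is a Python list, so formatting it directly into the f-string inserts its `repr` (brackets, quotes, escaped apostrophes) into the prompt. One possible alternative is to join the passages into plain text first. This is a sketch, not the notebook's method: `build_prompt` is a hypothetical helper that keeps the same Question / Context / Answer layout.

```python
def build_prompt(question, contexts):
    """Join retrieved passages into plain text, then add them to the prompt."""
    context_block = "\n\n".join(contexts)
    return f"Question: {question}\nContext: {context_block}\nAnswer:"

prompt = build_prompt(
    "What is the capital of France?",
    ["Paris is located in northern central France.", "Another passage."],
)
print(prompt)
```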

In [None]:
# Example
query = "What is the capital of France?"
answer = generate_answer(query)
print(answer)

Question: What is the capital of France?
Context: ['Paris is located in northern central France. By road it is 450 kilometres (280 mi) south-east of London, 287 kilometres (178 mi) south of Calais, 305 kilometres (190 mi) south-west of Brussels, 774 kilometres (481 mi) north of Marseille, 385 kilometres (239 mi) north-east of Nantes, and 135 kilometres (84 mi) south-east of Rouen. Paris is located in the north-bending arc of the river Seine and includes two islands, the Île Saint-Louis and the larger Île de la Cité, which form the oldest part of the city. The river\'s mouth on the English Channel (La Manche) is about 233 mi (375 km) downstream of the city, established around 7600 BC. The city is spread widely on both banks of the river. Overall, the city is relatively flat, and the lowest point is 35 m (115 ft) above sea level. Paris has several prominent hills, the highest of which is Montmartre at 130 m (427 ft). Montmartre gained its name from the martyrdom of Saint Denis, first bis

In [None]:
print(len(answer))

2647
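
The printed text starts with the prompt itself, so the 2647 characters include the question and the retrieved context, not just the model's answer. A minimal sketch for keeping only the generated part is shown below. `strip_prompt` is a hypothetical helper, and it assumes the decoded text begins with the exact prompt string; when it does not, the text is returned unchanged apart from trimming. Slicing the prompt tokens off `output[0]` before decoding is another common option.

```python
def strip_prompt(decoded, prompt):
    """Return only the text that follows the prompt, if the prompt is echoed."""
    if decoded.startswith(prompt):
        return decoded[len(prompt):].strip()
    # Fallback: the prompt was not reproduced verbatim, so keep everything
    return decoded.strip()

example_prompt = "Question: What is the capital of France?\nContext: ...\nAnswer:"
example_decoded = example_prompt + " Paris."
print(strip_prompt(example_decoded, example_prompt))  # Paris.
```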
