# Build a RAG System


Daily Challenge: Build a Retrieval Augmented Generation (RAG) System


## 👩‍🏫 👩🏿‍🏫 What You’ll learn
* Implement a Retrieval Augmented Generation (RAG) system using LangChain and Hugging Face.
* Load and process datasets using Hugging Face `datasets` and LangChain `HuggingFaceDatasetLoader`.
* Split documents into smaller chunks using LangChain `RecursiveCharacterTextSplitter`.
* Generate text embeddings using Hugging Face `sentence-transformers` and LangChain `HuggingFaceEmbeddings`.
* Create and utilize vector stores with LangChain `FAISS` for efficient document retrieval.
* Prepare and integrate a pre-trained Language Model (LLM) from Hugging Face `transformers` for question answering.
* Build a Retrieval QA Chain using LangChain `RetrievalQA` to answer questions based on retrieved documents.


## 🛠️ What you will create
You will create a functional RAG system that can answer questions based on a dataset loaded from Hugging Face Datasets. This system will:

* Load the `databricks/databricks-dolly-15k` dataset.
* Index the dataset content into a vector store.
* Utilize a pre-trained question-answering model from Hugging Face.
* Answer user queries by retrieving relevant documents and using the LLM to generate answers.
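
Before bringing in any library, the retrieve-then-generate flow listed above can be sketched in plain Python. This is only a toy: retrieval here is word overlap rather than embeddings, and `tokenize`, `retrieve`, and `build_prompt` are hypothetical helper names that do not come from LangChain.

```python
# Toy illustration of the retrieve-then-generate flow (no real models involved).
import re

def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, k=2):
    """Rank documents by the number of words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(query_words & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, context_docs):
    """Place the retrieved passages ahead of the question, as a RAG prompt would."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}"

documents = [
    "Cheese is made by fermenting milk.",
    "Virgin Australia commenced services in 2000.",
    "Tope is a species of fish.",
]
top_docs = retrieve("How is cheese made?", documents, k=1)
prompt = build_prompt("How is cheese made?", top_docs)
print(prompt)
```

In the real system, the overlap score is replaced by vector similarity and the prompt is passed to a language model.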


## Cell 1: Install the required libraries

In [3]:
# Install all the libraries needed for the RAG system
%pip install -q langchain torch transformers sentence-transformers datasets faiss-cpu
%pip install -U langchain-community


Note: you may need to restart the kernel to use updated packages.

## Cell 2: Load the Hugging Face dataset

In [4]:
# Import the dataset loader from LangChain
from langchain_community.document_loaders import HuggingFaceDatasetLoader

# Dataset name and the column that holds the text
dataset_name = "databricks/databricks-dolly-15k"
page_content_column = "context"

# Load the dataset
loader = HuggingFaceDatasetLoader(dataset_name, page_content_column)
data = loader.load()

# Quick check of the first two documents
print(data[:2])


[Document(metadata={'instruction': 'When did Virgin Australia start operating?', 'response': 'Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.', 'category': 'closed_qa'}, page_content='"Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia\'s domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney."'), Document(metadata={'instruction': 'Which is a species of fish? Tope or Rope', 'response': 'Tope', 'category': 'classification'}, page_content='""')]
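
The second document printed above has an empty `context` (its `page_content` is just `""`), so it adds nothing useful to the index. A minimal sketch of filtering such records follows. `Doc` is a hypothetical stand-in for a LangChain document, and `has_text` is an illustrative helper, not part of any library.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def has_text(doc):
    """True when the content is more than whitespace and surrounding quotes."""
    return bool(doc.page_content.strip().strip('"').strip())

sample = [
    Doc('"Virgin Australia is an Australian-based airline."'),
    Doc('""'),  # same shape as the empty record shown in the output above
]
non_empty = [d for d in sample if has_text(d)]
print(len(non_empty))
```

In the notebook, the same predicate could presumably be applied to `data` before splitting, for example `data = [d for d in data if has_text(d)]`.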


## Cell 3: Split the documents into chunks

In [6]:
# Import the text splitter
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Create the splitter with a fixed chunk size and overlap (both in characters)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)

# Apply the splitter
docs = text_splitter.split_documents(data)

# Inspect one chunk
print(docs[0])


page_content='"Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney."' metadata={'instruction': 'When did Virgin Australia start operating?', 'response': 'Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.', 'category': 'closed_qa'}
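
The passage above is shorter than 1000 characters, so it comes back as a single chunk. To see what `chunk_size` and `chunk_overlap` mean on longer text, here is a simplified fixed-window splitter. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to cut on separators such as paragraph and line breaks; `split_with_overlap` is a hypothetical helper for illustration only.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, which is what keeps a sentence that straddles a boundary visible in both chunks.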


## Cell 4: Generate the embeddings

In [7]:
# Import the embeddings wrapper
from langchain_community.embeddings import HuggingFaceEmbeddings

# Configure the Sentence-Transformers model
modelPath = "sentence-transformers/all-MiniLM-l6-v2"
model_kwargs = {'device':'cpu'}
encode_kwargs = {'normalize_embeddings': False}

# Initialize the embeddings
embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

# Quick test on a sample text
text = "This is a test document."
query_result = embeddings.embed_query(text)
print(query_result[:3])


[-0.038338541984558105, 0.12346471101045609, -0.02864299900829792]
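
The three numbers above are the first components of the embedding vector. The `normalize_embeddings` flag controls whether each vector is scaled to unit length. The sketch below uses small made-up vectors to show why that matters: once vectors are normalized, a plain dot product equals cosine similarity. The helper names are illustrative.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length (what normalize_embeddings=True would request)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine of the angle between a and b, independent of their lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]  # made-up 2-dimensional "embeddings"
b = [4.0, 3.0]
cos = cosine_similarity(a, b)
dot_normalized = dot(normalize(a), normalize(b))
print(cos, dot_normalized)
```

Both printed values agree (up to floating-point rounding), so with normalized embeddings a similarity search can rely on the cheaper dot product.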


## Cell 5: Create the FAISS vector store

In [8]:
# Import FAISS
from langchain_community.vectorstores import FAISS

# Build the FAISS vector store from the chunks and the embedding model
db = FAISS.from_documents(docs, embeddings)
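
Conceptually, the vector store keeps one embedding per chunk and returns the chunks whose embeddings lie closest to the query's embedding. The sketch below does this by exhaustive Euclidean (L2) search over made-up 2-dimensional vectors. It assumes the store behaves like a flat L2 index, which I believe is the LangChain default but have not verified here; `top_k` is a hypothetical helper.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query_vec, indexed, k):
    """Return the k (distance, text) pairs closest to query_vec, nearest first."""
    scored = [(l2_distance(query_vec, vec), text) for text, vec in indexed]
    return sorted(scored)[:k]

# Made-up vectors standing in for real chunk embeddings.
indexed = [
    ("cheese passage", [0.9, 0.1]),
    ("airline passage", [0.1, 0.9]),
    ("fish passage", [0.5, 0.5]),
]
results = top_k([1.0, 0.0], indexed, k=2)
print([text for _, text in results])
# ['cheese passage', 'fish passage']
```

FAISS performs the same kind of lookup with optimized index structures, which is what makes it practical for thousands of chunks.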


## Cell 6: Prepare the LLM

In [9]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
from langchain_community.llms import HuggingFacePipeline

# Load the QA model
model_name = "Intel/dynamic_tinybert"
tokenizer = AutoTokenizer.from_pretrained(model_name, padding=True, truncation=True, max_length=512)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# Hugging Face pipeline
qa_pipeline = pipeline(
    "question-answering",
    model=model,
    tokenizer=tokenizer,
    return_tensors='pt'
)

# LangChain wrapper
llm = HuggingFacePipeline(
    pipeline=qa_pipeline,
    model_kwargs={"temperature": 0.7, "max_length": 512},
)




## Cell 7: Build the complete RAG chain

In [10]:
from langchain.chains import RetrievalQA

# Create the FAISS retriever (returns the 4 closest chunks)
retriever = db.as_retriever(search_kwargs={"k": 4})

# Create the RAG chain
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="refine", retriever=retriever, return_source_documents=False)
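
The `"refine"` chain type answers from the first retrieved document and then asks the model to revise that answer once for each remaining document. A rough sketch of the loop follows. The prompt wording is illustrative and not LangChain's actual templates, and `fake_llm` is a hypothetical stand-in that only records the prompts it receives.

```python
def refine_answer(question, documents, llm):
    """Draft an answer from the first document, then revise it with each later one."""
    answer = llm(f"Context: {documents[0]}\nQuestion: {question}")
    for doc in documents[1:]:
        answer = llm(
            f"Question: {question}\nExisting answer: {answer}\n"
            f"New context: {doc}\nRefine the existing answer if the new context helps."
        )
    return answer

calls = []

def fake_llm(prompt):
    """Hypothetical stand-in for a language model: records prompts, returns a label."""
    calls.append(prompt)
    return f"draft {len(calls)}"

final = refine_answer("What is cheesemaking?", ["doc A", "doc B", "doc C"], fake_llm)
print(final, len(calls))
# draft 3 3
```

With `k=4`, this pattern means four model calls per question, one per retrieved chunk, and every call receives a single text prompt.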


## Cell 8: Final test of the RAG system

In [11]:
# Test question
question = "What is cheesemaking?"

# Run the RAG chain
result = qa.run({"query": question})

# Print the answer
print(result)



ValueError: Context information is below. 
------------
"The goal of cheese making is to control the spoiling of milk into cheese. The milk is traditionally from a cow, goat, sheep or buffalo, although, in theory, cheese could be made from the milk of any mammal. Cow's milk is most commonly used worldwide. The cheesemaker's goal is a consistent product with specific characteristics (appearance, aroma, taste, texture). The process used to make a Camembert will be similar to, but not quite the same as, that used to make Cheddar.\n\nSome cheeses may be deliberately left to ferment from naturally airborne spores and bacteria; this approach generally leads to a less consistent product but one that is valuable in a niche market.\n\nCulturing\nCheese is made by bringing milk (possibly pasteurised) in the cheese vat to a temperature required to promote the growth of the bacteria that feed on lactose and thus ferment the lactose into lactic acid. These bacteria in the milk may be wild, as is the case with unpasteurised milk, added from a culture,
------------
Given the context information and not prior knowledge, answer the question: What is cheesemaking?
 argument needs to be of type (SquadExample, dict)

The error is caused by a poor choice of model. `Intel/dynamic_tinybert` is a SQuAD-style extractive question-answering model that expects two separate inputs, a question and a context, not a single complete prompt.

LangChain's `HuggingFacePipeline`, however, works with text-generation models (for example the `text2text-generation` task or causal language models), not with the `question-answering` task.
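
The mismatch can be reproduced in miniature with two toy functions. Neither is a real pipeline: `extractive_qa` and `generative` are hypothetical stand-ins that only mimic the input shapes described above.

```python
def extractive_qa(inputs):
    """Toy stand-in for a SQuAD-style QA pipeline: requires question and context."""
    if not isinstance(inputs, dict) or not {"question", "context"}.issubset(inputs):
        raise ValueError("argument needs a dict with 'question' and 'context'")
    # Real extractive models return a span copied from the context;
    # this toy simply returns the first sentence.
    return inputs["context"].split(".")[0] + "."

def generative(prompt):
    """Toy stand-in for a text-generation pipeline: accepts one prompt string."""
    if not isinstance(prompt, str):
        raise ValueError("argument needs to be a string prompt")
    return f"(generated from {len(prompt)} characters of prompt)"

prompt = "Context: Cheese is made from milk. It is aged.\nQuestion: What is cheese made from?"

try:
    extractive_qa(prompt)  # a single prompt string, as the chain sends it
    outcome = "accepted"
except ValueError as err:
    outcome = f"rejected: {err}"

print(outcome)
print(extractive_qa({"question": "What is cheese made from?",
                     "context": "Cheese is made from milk. It is aged."}))
print(generative(prompt))
```

The chain always sends one formatted string, so only a model that accepts a plain prompt fits behind it.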

In [12]:
model_name = "google/flan-t5-small"  # Or flan-t5-base


In [13]:
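from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain_community.llms import HuggingFacePipeline

# Load the tokenizer and the sequence-to-sequence (generative) model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Text-generation pipeline: takes a single prompt string as input
qa_pipeline = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer
)

# LangChain wrapper around the generative pipeline
llm = HuggingFacePipeline(pipeline=qa_pipeline)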
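
Note that the `qa` chain built in Cell 7 still holds the old QA-based `llm` object; rebinding the name `llm` does not update it, so the chain has to be rebuilt before asking the question again. A minimal sketch, assuming `retriever` from Cell 7 is still in memory; `rebuild_and_ask` is a hypothetical helper name, and the answer quality to expect from a model as small as `flan-t5-small` is not shown here.

```python
def rebuild_and_ask(llm, retriever, question, chain_type="refine"):
    """Build a fresh RetrievalQA chain around llm and return the answer text."""
    # Imported lazily so the helper can be defined before LangChain is loaded.
    from langchain.chains import RetrievalQA

    chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type=chain_type,
        retriever=retriever,
        return_source_documents=False,
    )
    # invoke() returns a dict; the generated answer is stored under "result".
    return chain.invoke({"query": question})["result"]

# Usage in the notebook, once `llm` and `retriever` exist:
# print(rebuild_and_ask(llm, retriever, "What is cheesemaking?"))
```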




## Summary and conclusion

**What went well:**

* Built a complete RAG pipeline with **LangChain**, **Hugging Face**, **FAISS**, and **sentence-transformers**.
* Good command of the flow: **loading**, **splitting**, **embedding**, **vectorization**, **retrieval**, **LLM**.

**Main error:**

* Wrong model choice (a **SQuAD QA** model is incompatible with LangChain RAG). The `RetrievalQA` chain expects a **generative text model**, not a direct QA model.

**Fix applied:**

* Switched to a **text2text-generation** model such as `flan-t5-small`, which is well suited to LangChain RAG.

---

### Conclusion

I have just built a **complete, functional** RAG system.
The **key takeaway**: with LangChain, **always prefer generative models**, even for question-answering tasks.
FAISS + LangChain + Hugging Face = a solid, working foundation.
