# Project 9 – Designing a RAG (Retrieval-Augmented Generation) System

You will build a retrieval-augmented generation (RAG) system using LangChain, Mistral, and a Faiss vector store.

🔧 Objectives:
- Integrate LLMs
- Build an intelligent document retrieval system

Data:
- News Category Dataset from [Kaggle](https://www.kaggle.com/datasets/rmisra/news-category-dataset)
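
As a rough illustration of the retrieve-then-generate flow described in the introduction, the sketch below wires a retriever and a generator together through a prompt. The names `build_prompt`, `answer_question`, `fake_retrieve`, `fake_generate`, and the prompt wording are all assumptions made for this example; in the real system the retriever would presumably be the Faiss vector store and the generator a Mistral model.

```python
from typing import Callable, List

# Hypothetical prompt wording; adjust to the model actually used.
PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(question: str, passages: List[str]) -> str:
    """Number the retrieved passages and place them in the prompt."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return PROMPT_TEMPLATE.format(context=context, question=question)


def answer_question(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str], str],
    k: int = 3,
) -> str:
    """Retrieve k passages, build the prompt, and hand it to the generator."""
    passages = retrieve(question, k)
    return generate(build_prompt(question, passages))


# Stand-ins (assumptions) so the sketch runs without a model or an index.
def fake_retrieve(question: str, k: int) -> List[str]:
    corpus = [
        "Faiss stores dense vectors.",
        "Mistral is a language model.",
        "Cats sleep a lot.",
    ]
    return corpus[:k]


def fake_generate(prompt: str) -> str:
    return f"(model output for a prompt of {len(prompt)} characters)"


print(answer_question("What does Faiss store?", fake_retrieve, fake_generate, k=2))
```

Passing the retriever and generator as plain callables keeps the flow independent of any particular library, so either side can be swapped or stubbed out during testing.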

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os
import json

FILE = "News_Category_Dataset_v3"
DATA = pd.read_json(f"../data/{FILE}.json", lines=True)
print(DATA.shape)
print(DATA.columns)
DATA.head()

(209527, 6)
Index(['link', 'headline', 'category', 'short_description', 'authors', 'date'], dtype='object')


,link,headline,category,short_description,authors,date
0,https://www.huffpost.com/entry/covid-boosters-...,Over 4 Million Americans Roll Up Sleeves For O...,U.S. NEWS,Health experts said it is too early to predict...,"Carla K. Johnson, AP",2022-09-23
1,https://www.huffpost.com/entry/american-airlin...,"American Airlines Flyer Charged, Banned For Li...",U.S. NEWS,He was subdued by passengers and crew when he ...,Mary Papenfuss,2022-09-23
2,https://www.huffpost.com/entry/funniest-tweets...,23 Of The Funniest Tweets About Cats And Dogs ...,COMEDY,"""Until you have a dog you don't understand wha...",Elyse Wanshel,2022-09-23
3,https://www.huffpost.com/entry/funniest-parent...,The Funniest Tweets From Parents This Week (Se...,PARENTING,"""Accidentally put grown-up toothpaste on my to...",Caroline Bologna,2022-09-23
4,https://www.huffpost.com/entry/amy-cooper-lose...,Woman Who Called Cops On Black Bird-Watcher Lo...,U.S. NEWS,Amy Cooper accused investment firm Franklin Te...,Nina Golgowski,2022-09-22
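
Before embedding, it may be worth checking the text fields for empty or duplicated entries, since those would add uninformative vectors to the index. The sketch below is one possible cleaning step, not the notebook's method: `prepare_records` is a hypothetical helper that works on plain dictionaries, joins `headline` and `short_description`, and drops rows with no text or with text already seen.

```python
from typing import Dict, Iterable, List


def prepare_records(rows: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return one cleaned record per unique, non-empty article text."""
    seen = set()
    cleaned = []
    for row in rows:
        headline = (row.get("headline") or "").strip()
        description = (row.get("short_description") or "").strip()
        # Join the two fields, skipping whichever one is empty.
        text = "\n".join(part for part in (headline, description) if part)
        if not text or text in seen:
            continue
        seen.add(text)
        cleaned.append(
            {
                "text": text,
                "category": row.get("category", ""),
                "link": row.get("link", ""),
            }
        )
    return cleaned


# Made-up rows for illustration only.
sample = [
    {"headline": "Title A", "short_description": "Some text.", "category": "NEWS", "link": "u1"},
    {"headline": "Title A", "short_description": "Some text.", "category": "NEWS", "link": "u1"},
    {"headline": "", "short_description": "", "category": "COMEDY", "link": "u2"},
]
print(len(prepare_records(sample)), "record(s) kept out of", len(sample))
```

With the DataFrame above, the helper could be called as `prepare_records(DATA.to_dict("records"))`, and the `category` and `link` values could then be passed as `metadata` when building each `Document`, so that retrieved passages can be traced back to their article.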


In [3]:
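# RAG with LangChain, Mistral, and a Faiss vector store
# Recent LangChain releases expose these classes from langchain_community / langchain_core
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document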

# Initialize HuggingFace embeddings (you can choose a model, e.g., 'all-MiniLM-L6-v2')
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
print("Embeddings initialized.")
# Convert short_descriptions to Document objects
documents = [Document(page_content=desc) for desc in DATA['short_description'].tolist()]
print(f"Converted {len(documents)} documents from short descriptions.")
# Create a FAISS vector store
vector_store = FAISS.from_documents(documents, embeddings)
print("FAISS vector store created.")
# Save the vector store to disk
vector_store.save_local(f"../data/{FILE}_faiss")
print(f"Vector store saved to ../data/{FILE}_faiss")

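# Load the vector store from disk
# Recent LangChain versions require an explicit opt-in because the index metadata is pickled;
# only enable this for index files you created yourself
loaded_vector_store = FAISS.load_local(f"../data/{FILE}_faiss", embeddings, allow_dangerous_deserialization=True)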
# Example query
query = "What are the latest trends in AI?"
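# similarity_search_with_score returns (Document, score) pairs; with the default
# FAISS index the score is an L2 distance, so lower means more similar
results = loaded_vector_store.similarity_search_with_score(query, k=5)
# Display results
for doc, score in results:
    print(f"Text: {doc.page_content}\nScore: {score}\n")
# Save the results to a JSON file
output_file = f"../data/{FILE}_results.json"
with open(output_file, 'w') as f:
    # Cast the NumPy float32 score to a plain float so json can serialize it
    json.dump([{"text": doc.page_content, "score": float(score)} for doc, score in results], f, indent=4)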
print(f"Results saved to {output_file}")
# Plotting the distribution of categories
category_counts = DATA['category'].value_counts()
plt.figure(figsize=(12, 6))
category_counts.plot(kind='bar')
plt.title('Distribution of News Categories')
plt.xlabel('Category')
plt.ylabel('Number of Articles')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()



Embeddings initialized.
Converted 209527 documents from short descriptions.
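
Embedding all 209527 descriptions in a single `FAISS.from_documents` call can take a long time and gives no intermediate progress. One option, sketched here under that assumption, is to feed the documents in fixed-size batches. `batched` is a hypothetical helper written for this example, and the batch size of 4 is arbitrary.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Toy run on integers; in the notebook the items would be Document objects.
for number, batch in enumerate(batched(range(10), 4), start=1):
    print(f"batch {number}: {len(batch)} items")
```

In the notebook, the first batch could be passed to `FAISS.from_documents` and each later batch to `vector_store.add_documents(batch)`, with a `save_local` call every few batches so that an interrupted run does not lose all the work.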

