# Building a Retrieval-Augmented Generation (RAG) Application with LangChain and MistralAI

This notebook demonstrates how to build a RAG application with LangChain and MistralAI that answers questions over a collection of markdown documents.

## 1. Setup and Imports

Import the necessary libraries and load your API keys.

In [1]:
import os
from dotenv import load_dotenv
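# Community integrations (loaders, vector stores) live in langchain_community in recent LangChain releases
from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS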
from langchain.chains import RetrievalQA
from langchain_mistralai.chat_models import ChatMistralAI

# Load environment variables
load_dotenv()
mistral_api_key = os.getenv("MISTRAL_API_KEY")

# Ensure the API key is loaded
assert mistral_api_key is not None, "Please set your MISTRAL_API_KEY in the .env file."

## 2. Load Markdown Documents

Load all markdown files from the `data/` directory, including all subdirectories.

In [None]:
import os

def load_documents(root_dir):
    """Load every .md file under root_dir (including subdirectories) as LangChain documents."""
    documents = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for filename in filenames:
            if filename.endswith(".md"):
                filepath = os.path.join(dirpath, filename)
                loader = UnstructuredMarkdownLoader(filepath)
                documents.extend(loader.load())
    return documents

# Load documents from the data directory
docs = load_documents("data")

print(f"Loaded {len(docs)} documents.")
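For comparison, the same recursive search can be sketched with `pathlib` from the standard library. The helper name `find_markdown_files` is hypothetical and not part of this notebook. It only lists paths and leaves the actual loading to `UnstructuredMarkdownLoader`.

```python
from pathlib import Path


def find_markdown_files(root_dir):
    """Return all .md files under root_dir, including subdirectories, in sorted order.

    Hypothetical helper for illustration; it does not load or parse the files.
    """
    return sorted(p for p in Path(root_dir).rglob("*.md") if p.is_file())
```

Each returned path could then be passed to `UnstructuredMarkdownLoader(str(path))`, as in `load_documents` above.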

## 3. Split Documents into Chunks

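Split the documents into smaller, overlapping chunks so that retrieval returns focused passages rather than whole files. Each chunk holds at most 1,000 characters and shares up to 200 characters with its neighbor.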

In [None]:
# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""]
)

# Split the documents
split_docs = text_splitter.split_documents(docs)

print(f"Split into {len(split_docs)} chunks.")
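To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch in plain Python. It is a simplification and not the algorithm used by `RecursiveCharacterTextSplitter`, which additionally tries to break on the listed separators. The function name `sliding_chunks` is an assumption made for this illustration.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows in which consecutive windows share chunk_overlap characters.

    Simplified illustration only: unlike RecursiveCharacterTextSplitter,
    it ignores separators and may cut in the middle of a word.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


example_chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
```

With a 10-character input, a window of 4, and an overlap of 2, each chunk starts 2 characters after the previous one, so the last 2 characters of one chunk reappear at the start of the next.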

## 4. Create Embeddings for the Text Chunks

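Initialize a pre-trained sentence-transformers model that maps text to vector embeddings. The chunks themselves are embedded in the next step, when the vector store is built.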

In [None]:
# Initialize the embedding model
embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

## 5. Build a Vector Store with FAISS

Store the embeddings in a vector store to enable efficient similarity searches.

In [None]:
# Build the FAISS vector store from the documents and embeddings
vectorstore = FAISS.from_documents(split_docs, embedding)

print("Vector store created.")
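Conceptually, a similarity search ranks stored vectors by how close they are to the query vector and returns the top few. The sketch below shows this with toy 2-dimensional vectors and cosine similarity in plain Python. It is only an illustration: the names `cosine_similarity`, `top_k`, and `toy_vectors` are assumptions, real embeddings have hundreds of dimensions, and the FAISS index may use a different distance metric and a much faster search.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the names of the k stored vectors most similar to query_vec."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy data: "a" points the same way as the query, "b" is orthogonal, "c" is in between
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
nearest = top_k([1.0, 0.0], toy_vectors, k=2)
```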

## 6. Set Up a Retriever

Create a retriever that uses the vector store to find relevant documents based on a query.

In [6]:
# Create a retriever from the vector store.
# k is the number of chunks returned per query; adjust it to fit the model's context window.
retriever = vectorstore.as_retriever(search_kwargs={"k": 50})
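With `k = 50` and the 1,000-character chunk limit from step 3, a single query can pass up to about 50,000 characters of context to the model. The sketch below works through that arithmetic. The function names are hypothetical, and the 4-characters-per-token ratio is a rough assumption that varies with the tokenizer and the language of the text.

```python
def estimated_context_chars(k, chunk_size):
    """Upper bound on retrieved characters, assuming every chunk is full-size."""
    return k * chunk_size


def estimated_context_tokens(k, chunk_size, chars_per_token=4):
    """Very rough token estimate; chars_per_token=4 is an assumed rule of thumb."""
    return estimated_context_chars(k, chunk_size) // chars_per_token


max_chars = estimated_context_chars(k=50, chunk_size=1000)    # 50 * 1000 = 50000
max_tokens = estimated_context_tokens(k=50, chunk_size=1000)  # 50000 // 4 = 12500
```

If that budget is too large for the chosen model or too costly per query, lowering `k` is the most direct adjustment.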

## 7. Initialize the MistralAI Language Model

Set up the MistralAI language model to generate responses.

In [7]:
# Initialize the MistralAI language model
llm = ChatMistralAI(model="mistral-large-2407", api_key=mistral_api_key)

## 8. Create a RetrievalQA Chain

Combine the retriever and the language model into a chain that can handle queries.

In [None]:
# Create the RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever
)

print("RetrievalQA chain created.")
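The `"stuff"` chain type places all retrieved chunks into a single prompt and sends it to the model in one call. The sketch below illustrates that idea in plain Python. The function name `build_stuffed_prompt` and the template wording are hypothetical and are not the prompt LangChain actually uses.

```python
def build_stuffed_prompt(question, chunks):
    """Concatenate every retrieved chunk into one context block, then append the question.

    Illustrative template only; LangChain's real prompt text differs.
    """
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


example_prompt = build_stuffed_prompt(
    "What does chunk B say?",
    ["Chunk A text.", "Chunk B text."],
)
```

Because every chunk ends up in the same prompt, the total size grows with the retriever's `k`, which is why the two settings are usually tuned together.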

## 9. Test the RAG Application

Run a test query to see the RAG application in action.

In [None]:
# Define a test query (in French):
# "What are the course units (UEs) of the 3D track in the second year?"
query = "Quelles sont les UEs de la filière 3d en 2ème année ?"

# Run the query through the RetrievalQA chain
result = qa_chain.invoke(query)

# Display the result (the chain returns a dict; the answer text is under "result")
print("Query :", query)
print("")
print("Answer:", result["result"])

In [None]:
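# Define a second test query (in French): "Give the details of each UE."
# Note: the chain keeps no conversation history, so this query is answered on its own.
query = "Donne le détail de chaque UE."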

# Run the query through the RetrievalQA chain
result = qa_chain.invoke(query)

# Display the result (the chain returns a dict; the answer text is under "result")
print("Query :", query)
print("")
print("Answer:", result["result"])

## 10. Conclusion

You've successfully built a RAG application that can answer queries based on your markdown documents. You can now expand upon this foundation to handle more complex queries or integrate additional features.