### Building a RAG System with LangChain and FAISS 
### Introduction to RAG (Retrieval-Augmented Generation)
RAG combines the power of retrieval systems with generative AI models. Instead of relying solely on the model's training data, RAG:

1. Retrieves relevant documents from a knowledge base
2. Uses these documents as context for the LLM
3. Generates responses based on both the retrieved context and the model's knowledge
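
The three steps above can be sketched without any external services. This is a toy illustration only: `KNOWLEDGE_BASE`, `retrieve`, `build_prompt`, and `fake_llm` are hypothetical names, word overlap stands in for vector search, and no real model is called.

```python
# Toy retrieve-then-generate loop; no real embedding model or LLM is involved.
import re

KNOWLEDGE_BASE = [
    "FAISS is a library for similarity search over dense vectors.",
    "Machine learning algorithms find patterns in data.",
    "Deep learning uses neural networks with multiple layers.",
]

def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(question, documents, k=1):
    """Step 1: rank documents by word overlap with the question (stand-in for vector search)."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, context_docs):
    """Step 2: place the retrieved documents into the prompt as context."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

def fake_llm(prompt):
    """Step 3: placeholder for a real model call; it only reports the prompt size."""
    return f"[model response to a {len(prompt)}-character prompt]"

question = "What is deep learning?"
top_docs = retrieve(question, KNOWLEDGE_BASE, k=1)
prompt = build_prompt(question, top_docs)
print(prompt)
print(fake_llm(prompt))
```

In the rest of this notebook, the retrieval step is handled by OpenAI embeddings plus a FAISS index instead of word overlap.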

### FAISS 
https://github.com/facebookresearch/faiss

FAISS is a library for efficient similarity search and clustering of dense vectors.

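Key advantages:
1. Fast similarity search, with both exact (brute-force) and approximate index types
2. Memory efficient: compressed index types such as product quantization reduce the per-vector footprint
3. Supports GPU acceleration
4. Scales to collections of millions of vectors and beyond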

How it works:
- Indexes vectors for fast nearest neighbor search
- Returns most similar vectors based on distance metrics
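
To make the idea concrete, here is a minimal pure-Python sketch of what an exact (flat) L2 index computes conceptually: score every stored vector against the query and keep the closest `k`. The names `l2_squared` and `flat_search` are hypothetical; FAISS itself does this in optimized native code and also offers approximate indexes that avoid scanning every vector.

```python
def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_search(index_vectors, query, k):
    """Exhaustive search: return the k closest (distance, position) pairs, nearest first."""
    scored = sorted((l2_squared(v, query), i) for i, v in enumerate(index_vectors))
    return scored[:k]

vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
hits = flat_search(vectors, query=[0.9, 0.1], k=2)

for distance, position in hits:
    print(f"vector #{position} at squared distance {distance:.2f}")
```

The nearest vector is `[1.0, 0.0]` (position 1), followed by `[0.0, 0.0]` (position 0). Smaller distances mean more similar vectors.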


In [1]:
## load libraries
import os
from dotenv import load_dotenv
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# LangChain core imports
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.messages import HumanMessage, AIMessage

# LangChain specific imports
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import TextLoader, PyPDFLoader
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

# Load environment variables
load_dotenv()

True

### Data Ingestion and Processing

In [2]:
sample_documents = [
    Document(
        page_content="""
        Artificial Intelligence (AI) is the simulation of human intelligence in machines.
        These systems are designed to think like humans and mimic their actions.
        AI can be categorized into narrow AI and general AI.
        """,
        metadata={"source": "AI Introduction", "page": 1, "topic": "AI"}
    ),
    Document(
        page_content="""
        Machine Learning is a subset of AI that enables systems to learn from data.
        Instead of being explicitly programmed, ML algorithms find patterns in data.
        Common types include supervised, unsupervised, and reinforcement learning.
        """,
        metadata={"source": "ML Basics", "page": 1, "topic": "ML"}
    ),
    Document(
        page_content="""
        Deep Learning is a subset of machine learning based on artificial neural networks.
        It uses multiple layers to progressively extract higher-level features from raw input.
        Deep learning has revolutionized computer vision, NLP, and speech recognition.
        """,
        metadata={"source": "Deep Learning", "page": 1, "topic": "DL"}
    ),
    Document(
        page_content="""
        Natural Language Processing (NLP) is a branch of AI that helps computers understand human language.
        It combines computational linguistics with machine learning and deep learning models.
        Applications include chatbots, translation, sentiment analysis, and text summarization.
        """,
        metadata={"source": "NLP Overview", "page": 1, "topic": "NLP"}
    )
]

print(sample_documents)

[Document(metadata={'source': 'AI Introduction', 'page': 1, 'topic': 'AI'}, page_content='\n        Artificial Intelligence (AI) is the simulation of human intelligence in machines.\n        These systems are designed to think like humans and mimic their actions.\n        AI can be categorized into narrow AI and general AI.\n        '), Document(metadata={'source': 'ML Basics', 'page': 1, 'topic': 'ML'}, page_content='\n        Machine Learning is a subset of AI that enables systems to learn from data.\n        Instead of being explicitly programmed, ML algorithms find patterns in data.\n        Common types include supervised, unsupervised, and reinforcement learning.\n        '), Document(metadata={'source': 'Deep Learning', 'page': 1, 'topic': 'DL'}, page_content='\n        Deep Learning is a subset of machine learning based on artificial neural networks.\n        It uses multiple layers to progressively extract higher-level features from raw input.\n        Deep learning has revolu

### Text Splitting


In [3]:
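text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,       # maximum chunk length, measured by length_function
    chunk_overlap=50,     # maximum overlap between consecutive chunks
    length_function=len,  # measure length in characters
    separators=[" "]      # split only on spaces, so newlines stay inside chunks
)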

## Split the documents into chunks 
chunks = text_splitter.split_documents(sample_documents)
print(chunks)

[Document(metadata={'source': 'AI Introduction', 'page': 1, 'topic': 'AI'}, page_content='Artificial Intelligence (AI) is the simulation of human intelligence in machines.\n        These systems are designed to think like humans and mimic their actions.\n        AI can be categorized into narrow AI and general AI.'), Document(metadata={'source': 'ML Basics', 'page': 1, 'topic': 'ML'}, page_content='Machine Learning is a subset of AI that enables systems to learn from data.\n        Instead of being explicitly programmed, ML algorithms find patterns in data.\n        Common types include supervised, unsupervised, and reinforcement learning.'), Document(metadata={'source': 'Deep Learning', 'page': 1, 'topic': 'DL'}, page_content='Deep Learning is a subset of machine learning based on artificial neural networks.\n        It uses multiple layers to progressively extract higher-level features from raw input.\n        Deep learning has revolutionized computer vision, NLP, and speech recognit

In [4]:
print(chunks[0])
print(chunks[1])

page_content='Artificial Intelligence (AI) is the simulation of human intelligence in machines.
        These systems are designed to think like humans and mimic their actions.
        AI can be categorized into narrow AI and general AI.' metadata={'source': 'AI Introduction', 'page': 1, 'topic': 'AI'}
page_content='Machine Learning is a subset of AI that enables systems to learn from data.
        Instead of being explicitly programmed, ML algorithms find patterns in data.
        Common types include supervised, unsupervised, and reinforcement learning.' metadata={'source': 'ML Basics', 'page': 1, 'topic': 'ML'}


In [5]:

print(f"Created {len(chunks)} chunks from {len(sample_documents)} documents")
print("\nExample chunk:")
print(f"Content: {chunks[0].page_content}")
print(f"Metadata: {chunks[0].metadata}")

Created 4 chunks from 4 documents

Example chunk:
Content: Artificial Intelligence (AI) is the simulation of human intelligence in machines.
        These systems are designed to think like humans and mimic their actions.
        AI can be categorized into narrow AI and general AI.
Metadata: {'source': 'AI Introduction', 'page': 1, 'topic': 'AI'}
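
Each sample document is shorter than `chunk_size=500`, which is why four documents produce exactly four chunks. To see how chunk size and overlap interact on longer text, here is a simplified sketch using fixed character windows. The helper `sliding_chunks` is hypothetical and is not how `RecursiveCharacterTextSplitter` works internally (that class splits on separators and merges the pieces); it only illustrates the size and overlap idea.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

pieces = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(pieces)

# A text shorter than chunk_size comes back as a single chunk.
print(sliding_chunks("short text", chunk_size=500, chunk_overlap=50))
```

Consecutive pieces share three characters (for example, the first ends with `hij` and the second starts with `hij`), so content that straddles a boundary still appears intact in at least one chunk.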


In [6]:
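## Load the OpenAI API key from the environment
load_dotenv()

api_key = os.getenv("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")
os.environ["OPENAI_API_KEY"] = api_key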


In [8]:
# Initialize OpenAI embeddings (1536 is the default output size for text-embedding-3-small)

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    dimensions=1536
)

In [9]:
## Example: create an embedding for a single text
sample_text = "What is machine learning ?"
sample_embedding = embeddings.embed_query(sample_text)
sample_embedding

[-0.0014619150897487998,
 -0.008316448889672756,
 0.0008396898629143834,
 -0.04223589599132538,
 0.029203105717897415,
 0.010262317024171352,
 -0.011866276152431965,
 -0.0034618352074176073,
 -0.029605353251099586,
 0.0467410609126091,
 0.018211716786026955,
 -0.027553895488381386,
 -0.049959033727645874,
 0.003064616583287716,
 0.046057239174842834,
 0.01571778766810894,
 -0.011695320717990398,
 -0.024899067357182503,
 0.02065536566078663,
 0.001236908370628953,
 0.01386745274066925,
 0.03652399405837059,
 0.04509184882044792,
 -0.041793424636125565,
 0.027051087468862534,
 -0.03586028888821602,
 0.02214367687702179,
 0.007989623583853245,
 -0.010337738320231438,
 -0.012067398987710476,
 -0.0012205671519041061,
 -0.020735815167427063,
 -0.02769468165934086,
 0.02326996810734272,
 0.025703560560941696,
 -0.005857716780155897,
 -0.008386842906475067,
 -0.01381717249751091,
 -0.05969339981675148,
 0.01525520347058773,
 -0.06355497241020203,
 -0.017226211726665497,
 0.0088594825938344,
 0

In [10]:
texts = ["AI","Machine Learning","Deep Learning","Neural Networks"]
batch_embeddings = embeddings.embed_documents(texts)
print(batch_embeddings[0])

[-0.008146658539772034, -0.024611903354525566, 0.002883070847019553, 0.025180581957101822, 0.0064902156591415405, -0.028275255113840103, -0.004995780065655708, 0.02094855159521103, -0.036871567368507385, 0.012861405499279499, -0.0030467314645648003, -0.02016827091574669, 0.00026408862322568893, -0.03277178853750229, 0.006460458971560001, -0.02531283348798752, -0.031052524223923683, -0.0543815940618515, 0.03279823809862137, -0.018396107479929924, 0.01662394590675831, 0.048324499279260635, -0.02488962933421135, 0.014402128756046295, 0.029359711334109306, 0.004013816360384226, 0.00925756711512804, 0.013410246931016445, 0.0025061555206775665, -0.022588463500142097, 0.032136980444192886, -0.02798430249094963, 0.005392532795667648, -0.03819407522678375, -0.016690069809556007, 0.014362453483045101, -0.03861727938055992, -0.010348637588322163, -0.010540401563048363, -0.019189612939953804, 0.03203118219971657, 0.014679855667054653, -0.021504005417227745, 0.016094941645860672, -0.011836460791528

In [11]:
print(batch_embeddings[1])


[-0.022064441815018654, -0.0035424039233475924, -0.019189447164535522, -0.034043584018945694, 0.03376977518200874, 0.00861357431858778, 0.0014845581026747823, 0.026034671813249588, -0.041345156729221344, 0.04209813103079796, -0.0007044877274893224, -0.03792254626750946, -0.03769437223672867, -0.0016385756898671389, 0.01605205237865448, 0.016531217843294144, 0.01924649067223072, -0.017113061621785164, 0.01753518357872963, 0.017649270594120026, 0.021779224276542664, 0.02427772991359234, 0.019931012764573097, -0.017352644354104996, 0.03997611254453659, -0.02019341289997101, 0.02913784049451351, 0.04091162607073784, -0.0072331209667027, -0.02719835937023163, -0.014945407398045063, -0.014580328948795795, -0.038059450685977936, -0.016576852649450302, 0.02393546886742115, -0.001637149602174759, -0.0030546814668923616, 0.026764828711748123, -0.04905744642019272, 0.00863068737089634, -0.0289324838668108, -0.015264851041138172, 0.01534471195191145, 0.07365462183952332, -0.01756940968334675, -0.0

In [13]:
### Compare embeddings using cosine similarity

def compare_embeddings(text1: str, text2: str):
    """Compare the semantic similarity of two texts using their embeddings."""
    emb1 = np.array(embeddings.embed_query(text1))
    emb2 = np.array(embeddings.embed_query(text2))

    # Cosine similarity: dot product divided by the product of the two norms
    similarity = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
    return similarity

In [15]:
# Test Semantic Similarity
print("\n Semantic Similarity Examples")
print(f"'AI' Vs 'Artificial Intelligence' : {compare_embeddings('AI','Artificial Intelligence')}")


 Semantic Similarity Examples
'AI' Vs 'Artificial Intelligence' : 0.5634487441113435


In [16]:
print(f"'ML' Vs 'Machine Learning' : {compare_embeddings('ML','Machine Learning'):.3f}")

'ML' Vs 'Machine Learning' : 0.461
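
Cosine similarity divides the dot product of two vectors by the product of their norms, so it depends only on direction and not on length. The sketch below uses small hand-checkable vectors instead of real embeddings. The function name `cosine_similarity` is illustrative.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the two Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors, close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])      # perpendicular vectors, 0
opposite = cosine_similarity([1.0, 0.0], [-5.0, 0.0])       # opposite directions, -1

print(same_direction, orthogonal, opposite)
```

Note that the denominator needs its own parentheses: because of operator precedence, `dot / norm_a * norm_b` divides by the first norm and then multiplies by the second. The two expressions agree only when both vectors have unit length.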


### Create FAISS Vector Store

In [17]:
vectorstore = FAISS.from_documents(
    documents=chunks,
    embedding=embeddings
)
print(f"Vector store created with {vectorstore.index.ntotal} vectors")

Vector store created with 4 vectors


In [18]:
vectorstore

<langchain_community.vectorstores.faiss.FAISS at 0x11004dd30>

In [19]:
vectorstore.save_local("faiss_index")
print("Vector store saved to faiss_index directory")

Vector store saved to faiss_index directory


In [20]:
## Load the vector store
loaded_vectorstore = FAISS.load_local(
    'faiss_index',
    embeddings,
    allow_dangerous_deserialization=True  # the docstore is pickled; only load files you created or trust
)
print(f"Loaded vector store contains {loaded_vectorstore.index.ntotal} vectors")

Loaded vector store contains 4 vectors


In [21]:
query = "What is deep learning ?"

results = vectorstore.similarity_search(query,k=3)
print(results)

[Document(id='5b87b350-4f04-43c6-a660-ae4405f990fd', metadata={'source': 'Deep Learning', 'page': 1, 'topic': 'DL'}, page_content='Deep Learning is a subset of machine learning based on artificial neural networks.\n        It uses multiple layers to progressively extract higher-level features from raw input.\n        Deep learning has revolutionized computer vision, NLP, and speech recognition.'), Document(id='cbdbdc43-394b-418b-9321-781f5f8db15e', metadata={'source': 'ML Basics', 'page': 1, 'topic': 'ML'}, page_content='Machine Learning is a subset of AI that enables systems to learn from data.\n        Instead of being explicitly programmed, ML algorithms find patterns in data.\n        Common types include supervised, unsupervised, and reinforcement learning.'), Document(id='a5b3a2be-97a9-43da-bd44-33e3a6962d87', metadata={'source': 'NLP Overview', 'page': 1, 'topic': 'NLP'}, page_content='Natural Language Processing (NLP) is a branch of AI that helps computers understand human lang

In [22]:
print(f"Query: {query}\n")
print("Top 3 similar chunks:")
for i, doc in enumerate(results):
    print(f"\n{i+1}. Source: {doc.metadata['source']}")
    print(f"   Content: {doc.page_content[:200]}...")

Query: What is deep learning ?

Top 3 similar chunks:

1. Source: Deep Learning
   Content: Deep Learning is a subset of machine learning based on artificial neural networks.
        It uses multiple layers to progressively extract higher-level features from raw input.
        Deep learning ...

2. Source: ML Basics
   Content: Machine Learning is a subset of AI that enables systems to learn from data.
        Instead of being explicitly programmed, ML algorithms find patterns in data.
        Common types include supervised...

3. Source: NLP Overview
   Content: Natural Language Processing (NLP) is a branch of AI that helps computers understand human language.
        It combines computational linguistics with machine learning and deep learning models.
      ...


In [23]:
## Similarity search with score
## (the score is a distance, so lower values mean closer matches)
results_with_scores = vectorstore.similarity_search_with_score(query, k=3)
print("\n\nSimilarity search with scores:")
for doc, score in results_with_scores:
    print(f"\nScore: {score:.3f}")
    print(f"Source: {doc.metadata['source']}")
    print(f"Content preview: {doc.page_content[:100]}...")
                                                    



Similarity search with scores:

Score: 0.525
Source: Deep Learning
Content preview: Deep Learning is a subset of machine learning based on artificial neural networks.
        It uses m...

Score: 1.163
Source: ML Basics
Content preview: Machine Learning is a subset of AI that enables systems to learn from data.
        Instead of being...

Score: 1.241
Source: NLP Overview
Content preview: Natural Language Processing (NLP) is a branch of AI that helps computers understand human language.
...
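
The scores above increase as the match gets worse, which is what a distance does. As a rough aid for reading them, the sketch below converts a squared L2 distance into the cosine similarity it implies. It rests on two assumptions about this setup that the notebook does not verify: that the store uses its default L2 index, which reports squared Euclidean distance, and that the embeddings have unit length. Under those assumptions, `||a - b||^2 = 2 - 2 * cos(a, b)`. The helper name `cosine_from_squared_l2` is hypothetical.

```python
def cosine_from_squared_l2(squared_distance):
    """For unit vectors a and b: ||a - b||^2 = 2 - 2 * cos(a, b), so cos = 1 - d^2 / 2."""
    return 1.0 - squared_distance / 2.0

# Sanity check of the identity on two unit vectors: a = (1, 0), b = (0.6, 0.8).
a, b = (1.0, 0.0), (0.6, 0.8)
squared = sum((x - y) ** 2 for x, y in zip(a, b))
dot = sum(x * y for x, y in zip(a, b))
print(f"direct cosine = {dot:.3f}, from distance = {cosine_from_squared_l2(squared):.3f}")

# Scores copied from the cell output above.
observed = [("Deep Learning", 0.525), ("ML Basics", 1.163), ("NLP Overview", 1.241)]
for source, score in observed:
    print(f"{source}: implied cosine = {cosine_from_squared_l2(score):.3f}")
```

A distance of 0 corresponds to identical directions (cosine 1), and a squared distance of 2 corresponds to orthogonal vectors (cosine 0).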


In [24]:
chunks

[Document(metadata={'source': 'AI Introduction', 'page': 1, 'topic': 'AI'}, page_content='Artificial Intelligence (AI) is the simulation of human intelligence in machines.\n        These systems are designed to think like humans and mimic their actions.\n        AI can be categorized into narrow AI and general AI.'),
 Document(metadata={'source': 'ML Basics', 'page': 1, 'topic': 'ML'}, page_content='Machine Learning is a subset of AI that enables systems to learn from data.\n        Instead of being explicitly programmed, ML algorithms find patterns in data.\n        Common types include supervised, unsupervised, and reinforcement learning.'),
 Document(metadata={'source': 'Deep Learning', 'page': 1, 'topic': 'DL'}, page_content='Deep Learning is a subset of machine learning based on artificial neural networks.\n        It uses multiple layers to progressively extract higher-level features from raw input.\n        Deep learning has revolutionized computer vision, NLP, and speech recogn

In [26]:
### Search with metadata filtering 

filter_dict = {"topic":"ML"}
filtered_results = vectorstore.similarity_search(
    query,
    k=3,
    filter=filter_dict
)

print(filtered_results)

[Document(id='cbdbdc43-394b-418b-9321-781f5f8db15e', metadata={'source': 'ML Basics', 'page': 1, 'topic': 'ML'}, page_content='Machine Learning is a subset of AI that enables systems to learn from data.\n        Instead of being explicitly programmed, ML algorithms find patterns in data.\n        Common types include supervised, unsupervised, and reinforcement learning.')]


In [27]:

filter_dict = {"topic":"AI"}
filtered_results = vectorstore.similarity_search(
    query,
    k=3,
    filter=filter_dict
)

print(filtered_results)

[Document(id='978c8237-fdf8-4f1d-85cd-183a1ba91b59', metadata={'source': 'AI Introduction', 'page': 1, 'topic': 'AI'}, page_content='Artificial Intelligence (AI) is the simulation of human intelligence in machines.\n        These systems are designed to think like humans and mimic their actions.\n        AI can be categorized into narrow AI and general AI.')]


In [28]:

filter_dict = {"topic":"DL"}
filtered_results = vectorstore.similarity_search(
    query,
    k=3,
    filter=filter_dict
)

print(filtered_results)

[Document(id='5b87b350-4f04-43c6-a660-ae4405f990fd', metadata={'source': 'Deep Learning', 'page': 1, 'topic': 'DL'}, page_content='Deep Learning is a subset of machine learning based on artificial neural networks.\n        It uses multiple layers to progressively extract higher-level features from raw input.\n        Deep learning has revolutionized computer vision, NLP, and speech recognition.')]


In [29]:

filter_dict = {"topic":"NLP"}
filtered_results = vectorstore.similarity_search(
    query,
    k=3,
    filter=filter_dict
)

print(filtered_results)

[Document(id='a5b3a2be-97a9-43da-bd44-33e3a6962d87', metadata={'source': 'NLP Overview', 'page': 1, 'topic': 'NLP'}, page_content='Natural Language Processing (NLP) is a branch of AI that helps computers understand human language.\n        It combines computational linguistics with machine learning and deep learning models.\n        Applications include chatbots, translation, sentiment analysis, and text summarization.')]
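
Each filtered search returns a single document even though `k=3`, because only one chunk carries each topic. The sketch below illustrates the idea of filtering a ranked result list by metadata. The helpers `matches` and `filtered_search` are hypothetical, and treating the filter as a step applied to already-ranked candidates is an assumption about how the wrapper behaves, not a statement of its exact implementation.

```python
def matches(metadata, filter_dict):
    """True when every key in filter_dict has the same value in metadata."""
    return all(metadata.get(key) == value for key, value in filter_dict.items())

def filtered_search(ranked_docs, filter_dict, k):
    """Keep at most k documents, in ranked order, whose metadata satisfies the filter."""
    kept = [doc for doc in ranked_docs if matches(doc["metadata"], filter_dict)]
    return kept[:k]

# Documents in the similarity order observed for the query "What is deep learning ?".
ranked = [
    {"source": "Deep Learning", "metadata": {"topic": "DL", "page": 1}},
    {"source": "ML Basics", "metadata": {"topic": "ML", "page": 1}},
    {"source": "NLP Overview", "metadata": {"topic": "NLP", "page": 1}},
    {"source": "AI Introduction", "metadata": {"topic": "AI", "page": 1}},
]

for topic in ["ML", "AI", "DL", "NLP"]:
    hits = filtered_search(ranked, {"topic": topic}, k=3)
    print(topic, [doc["source"] for doc in hits])

# A filter on a field every chunk shares keeps the ranked order and returns up to k results.
print([doc["source"] for doc in filtered_search(ranked, {"page": 1}, k=3)])
```

The same loop-over-topics pattern could replace the four near-identical cells above when run against the real vector store.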
