# Building a RAG System with LangChain and FAISS 
### Introduction to RAG (Retrieval-Augmented Generation)
RAG combines the power of retrieval systems with generative AI models. Instead of relying solely on the model's training data, RAG:

1. Retrieves relevant documents from a knowledge base
2. Uses these documents as context for the LLM
3. Generates responses based on both the retrieved context and the model's knowledge
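
The three steps above can be sketched without any external service. This is a toy illustration only: `KNOWLEDGE_BASE`, `overlap_score`, and `fake_llm` are hypothetical stand-ins (word overlap replaces embedding similarity, and no real model is called).

```python
from collections import Counter

# Hypothetical in-memory knowledge base; a real system would use a vector store.
KNOWLEDGE_BASE = [
    "Machine learning lets systems learn patterns from data.",
    "Deep learning uses neural networks with many layers.",
    "Pizza is a baked dish topped with cheese and tomato.",
]

def overlap_score(query: str, doc: str) -> int:
    """Count shared lowercase words; a crude stand-in for embedding similarity."""
    q = Counter(query.lower().split())
    d = Counter(doc.lower().split())
    return sum((q & d).values())

def retrieve(query: str, k: int = 2) -> list[str]:
    """Step 1: pick the k documents that best match the query."""
    ranked = sorted(KNOWLEDGE_BASE, key=lambda doc: overlap_score(query, doc), reverse=True)
    return ranked[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Step 2: place the retrieved documents in the prompt as context."""
    context = "\n".join(f"- {doc}" for doc in docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

def fake_llm(prompt: str) -> str:
    """Step 3 placeholder: a real LLM call would go here."""
    return f"(answer generated from a prompt of {len(prompt)} characters)"

question = "what is deep learning"
docs = retrieve(question)
prompt = build_prompt(question, docs)
print(prompt)
print(fake_llm(prompt))
```

The rest of the notebook replaces each placeholder with a real component: OpenAI embeddings and FAISS for retrieval, a prompt template for augmentation, and a hosted chat model for generation.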

### FAISS 
https://github.com/facebookresearch/faiss

FAISS is a library for efficient similarity search and clustering of dense vectors.

Key advantages:
1. Extremely fast similarity search
2. Memory efficient
3. Supports GPU acceleration
4. Can handle millions of vectors

How it works:
- Indexes vectors for fast nearest neighbor search
- Returns most similar vectors based on distance metrics
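
To make "nearest neighbor search" concrete, here is a minimal brute-force version in NumPy. It is a sketch of the idea, not of FAISS internals: `nearest_neighbors` is a hypothetical helper, and FAISS adds optimized kernels and index structures on top of the same principle.

```python
import numpy as np

def nearest_neighbors(index_vectors: np.ndarray, query: np.ndarray, k: int = 3):
    """Return (distances, ids) of the k stored vectors closest to `query` (squared L2)."""
    diffs = index_vectors - query            # broadcast the query over every row
    dists = np.sum(diffs * diffs, axis=1)    # squared Euclidean distance per row
    ids = np.argsort(dists)[:k]              # smallest distance first
    return dists[ids], ids

# Toy 2-D "embeddings"; real embeddings have hundreds or thousands of dimensions.
stored = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]], dtype="float32")
query_vec = np.array([0.9, 0.1], dtype="float32")

dists, ids = nearest_neighbors(stored, query_vec, k=2)
print(ids, dists)
```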


In [35]:
## load libraries
import os
from dotenv import load_dotenv
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# LangChain core imports
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.messages import HumanMessage, AIMessage

# LangChain specific imports
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import TextLoader, PyPDFLoader
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

# Load environment variables
load_dotenv()

True

### Data Ingestion And Processing


In [36]:
sample_documents = [
    Document(
        page_content="""
        Artificial Intelligence (AI) is the simulation of human intelligence in machines.
        These systems are designed to think like humans and mimic their actions.
        AI can be categorized into narrow AI and general AI.
        """,
        metadata={"source": "AI Introduction", "page": 1, "topic": "AI"}
    ),
    Document(
        page_content="""
        Machine Learning is a subset of AI that enables systems to learn from data.
        Instead of being explicitly programmed, ML algorithms find patterns in data.
        Common types include supervised, unsupervised, and reinforcement learning.
        """,
        metadata={"source": "ML Basics", "page": 1, "topic": "ML"}
    ),
    Document(
        page_content="""
        Deep Learning is a subset of machine learning based on artificial neural networks.
        It uses multiple layers to progressively extract higher-level features from raw input.
        Deep learning has revolutionized computer vision, NLP, and speech recognition.
        """,
        metadata={"source": "Deep Learning", "page": 1, "topic": "DL"}
    ),
    Document(
        page_content="""
        Natural Language Processing (NLP) is a branch of AI that helps computers understand human language.
        It combines computational linguistics with machine learning and deep learning models.
        Applications include chatbots, translation, sentiment analysis, and text summarization.
        """,
        metadata={"source": "NLP Overview", "page": 1, "topic": "NLP"}
    )
]

print(sample_documents)

[Document(metadata={'source': 'AI Introduction', 'page': 1, 'topic': 'AI'}, page_content='\n        Artificial Intelligence (AI) is the simulation of human intelligence in machines.\n        These systems are designed to think like humans and mimic their actions.\n        AI can be categorized into narrow AI and general AI.\n        '), Document(metadata={'source': 'ML Basics', 'page': 1, 'topic': 'ML'}, page_content='\n        Machine Learning is a subset of AI that enables systems to learn from data.\n        Instead of being explicitly programmed, ML algorithms find patterns in data.\n        Common types include supervised, unsupervised, and reinforcement learning.\n        '), Document(metadata={'source': 'Deep Learning', 'page': 1, 'topic': 'DL'}, page_content='\n        Deep Learning is a subset of machine learning based on artificial neural networks.\n        It uses multiple layers to progressively extract higher-level features from raw input.\n        Deep learning has revolu

In [39]:
## text splitting
text_splitter=RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    length_function=len,
    separators=[" "]
)

## split the documents into chunks
chunks = text_splitter.split_documents(sample_documents)
print(chunks[0])
print(chunks[1])


page_content='Artificial Intelligence (AI) is the simulation of human intelligence in machines.
        These systems are designed to think like humans and mimic their actions.
        AI can be categorized into narrow AI and general AI.' metadata={'source': 'AI Introduction', 'page': 1, 'topic': 'AI'}
page_content='Machine Learning is a subset of AI that enables systems to learn from data.
        Instead of being explicitly programmed, ML algorithms find patterns in data.
        Common types include supervised, unsupervised, and reinforcement learning.' metadata={'source': 'ML Basics', 'page': 1, 'topic': 'ML'}


In [40]:

print(f"Created {len(chunks)} chunks from {len(sample_documents)} documents")
print("\nExample chunk:")
print(f"Content: {chunks[0].page_content}")
print(f"Metadata: {chunks[0].metadata}")

Created 4 chunks from 4 documents

Example chunk:
Content: Artificial Intelligence (AI) is the simulation of human intelligence in machines.
        These systems are designed to think like humans and mimic their actions.
        AI can be categorized into narrow AI and general AI.
Metadata: {'source': 'AI Introduction', 'page': 1, 'topic': 'AI'}
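
Each sample document is shorter than `chunk_size`, which is why four documents produce exactly four chunks. The sketch below shows what `chunk_size` and `chunk_overlap` mean using a plain sliding window. It is a simplification: `split_with_overlap` is a hypothetical helper, and `RecursiveCharacterTextSplitter` additionally tries to cut at separators rather than at fixed offsets.

```python
def split_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Cut `text` into windows of at most `chunk_size` characters that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return pieces

short_doc = "Machine Learning is a subset of AI."   # shorter than chunk_size
long_doc = "x" * 1200                                # longer than chunk_size

print(len(split_with_overlap(short_doc)))               # a short text fits in a single window
print([len(piece) for piece in split_with_overlap(long_doc)])
```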


In [41]:
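### Load the OpenAI API key for the embedding model
load_dotenv()

# Fail early with a clear message instead of a TypeError when the key is missing
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")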

In [None]:
# Initialize OpenAI embeddings with the latest model

embeddings=OpenAIEmbeddings(
    model="text-embedding-3-small",
    dimensions=1536
)

## Example: create an embedding for a single text
sample_text="What is machine learning"
sample_embedding=embeddings.embed_query(sample_text)
sample_embedding

In [None]:
texts=["AI","MAchine learning","Deep Learning","Neural Network"]
batch_embeddings=embeddings.embed_documents(texts)
print(batch_embeddings[0])

In [45]:
print(batch_embeddings[1])

[-0.01823362149298191, 0.01074262149631977, 0.017460593953728676, -0.03524021804332733, 0.037424325942993164, 0.02608659490942955, 0.015890000388026237, 0.008374459110200405, -0.010945080779492855, 0.034872107207775116, -0.006638216320425272, -0.06866443157196045, -0.0323689728975296, 0.008969567716121674, 0.03658994659781456, 0.018712162971496582, 0.016270378604531288, -0.015399189665913582, 0.020614054054021835, 0.012945135124027729, -0.00261049997061491, 0.03781697154045105, 0.03403773158788681, -0.030013080686330795, 0.020650865510106087, 0.007926594465970993, 0.03614821657538414, 0.03766972944140434, 0.01052175648510456, -0.03482302650809288, -0.010932810604572296, -0.024233784526586533, -0.013914486393332481, -0.013865405693650246, 0.011945107951760292, -0.0018451418727636337, -0.027583567425608635, 0.027828972786664963, -0.04508097469806671, -0.005190324503928423, -0.019755134359002113, -0.04277416318655014, 0.03438129648566246, 0.06827178597450256, -0.020577242597937584, -0.034

In [46]:
### Compare Embedding using cosine similarity

def compare_embeddings(text1: str, text2: str) -> float:
    """Compare the semantic similarity of two texts using embeddings."""

    emb1 = np.array(embeddings.embed_query(text1))
    emb2 = np.array(embeddings.embed_query(text2))

    # Cosine similarity: dot product divided by the product of the vector norms
    similarity = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
    return similarity

In [47]:
# Test semantic similarity
print("\nSemantic Similarity Examples:")
print(f"'AI' vs 'Artificial Intelligence': {compare_embeddings('AI', 'Artificial Intelligence'):.3f}")


Semantic Similarity Examples:
'AI' vs 'Artificial Intelligence': 0.563


In [48]:
print(f"'AI' vs 'Pizza': {compare_embeddings('AI', 'Pizza'):.3f}")

'AI' vs 'Pizza': 0.254


In [49]:
print(f"'Machine Learning' vs 'ML': {compare_embeddings('Machine Learning', 'ML'):.3f}")

'Machine Learning' vs 'ML': 0.461


### Create FAISS Vector Store

In [50]:
vectorstore=FAISS.from_documents(
    documents=chunks,
    embedding=embeddings
)
print(f"Vector store created with {vectorstore.index.ntotal} vectors")

Vector store created with 4 vectors


In [51]:
vectorstore

<langchain_community.vectorstores.faiss.FAISS at 0x1b1fa6e0d60>

In [52]:
## Save vector store for later use
vectorstore.save_local("faiss_index")
print("Vector store saved to 'faiss_index' directory")

Vector store saved to 'faiss_index' directory


In [53]:
## load vector store
loaded_vectorstore=FAISS.load_local(
    "faiss_index",
    embeddings,
    allow_dangerous_deserialization=True  # loads a pickle file; only enable for indexes you created yourself
)

print(f"Loaded vector store contains {loaded_vectorstore.index.ntotal} vectors")

Loaded vector store contains 4 vectors


In [54]:
## Similarity Search 
query="What is deep learning"

results=vectorstore.similarity_search(query,k=3)
print(results)

[Document(id='8f8852bf-c2e1-49f5-91dd-8e2418c7d6e4', metadata={'source': 'Deep Learning', 'page': 1, 'topic': 'DL'}, page_content='Deep Learning is a subset of machine learning based on artificial neural networks.\n        It uses multiple layers to progressively extract higher-level features from raw input.\n        Deep learning has revolutionized computer vision, NLP, and speech recognition.'), Document(id='c0566de3-4f5b-42a1-be7f-2f6cecf6d6eb', metadata={'source': 'ML Basics', 'page': 1, 'topic': 'ML'}, page_content='Machine Learning is a subset of AI that enables systems to learn from data.\n        Instead of being explicitly programmed, ML algorithms find patterns in data.\n        Common types include supervised, unsupervised, and reinforcement learning.'), Document(id='2062f147-d8c2-4fa2-a03b-8647d8b73f75', metadata={'source': 'NLP Overview', 'page': 1, 'topic': 'NLP'}, page_content='Natural Language Processing (NLP) is a branch of AI that helps computers understand human lang

In [55]:
print(f"Query: {query}\n")
print("Top 3 similar chunks:")
for i, doc in enumerate(results):
    print(f"\n{i+1}. Source: {doc.metadata['source']}")
    print(f"   Content: {doc.page_content[:200]}...")

Query: What is deep learning

Top 3 similar chunks:

1. Source: Deep Learning
   Content: Deep Learning is a subset of machine learning based on artificial neural networks.
        It uses multiple layers to progressively extract higher-level features from raw input.
        Deep learning ...

2. Source: ML Basics
   Content: Machine Learning is a subset of AI that enables systems to learn from data.
        Instead of being explicitly programmed, ML algorithms find patterns in data.
        Common types include supervised...

3. Source: NLP Overview
   Content: Natural Language Processing (NLP) is a branch of AI that helps computers understand human language.
        It combines computational linguistics with machine learning and deep learning models.
      ...


In [56]:
### Similarity search with score
# With the default FAISS settings the score is an L2 distance, so lower means more similar
results_with_scores=vectorstore.similarity_search_with_score(query,k=3)

print("\n\nSimilarity search with scores:")
for doc, score in results_with_scores:
    print(f"\nScore: {score:.3f}")
    print(f"Source: {doc.metadata['source']}")
    print(f"Content preview: {doc.page_content[:100]}...")



Similarity search with scores:

Score: 0.556
Source: Deep Learning
Content preview: Deep Learning is a subset of machine learning based on artificial neural networks.
        It uses m...

Score: 1.208
Source: ML Basics
Content preview: Machine Learning is a subset of AI that enables systems to learn from data.
        Instead of being...

Score: 1.274
Source: NLP Overview
Content preview: Natural Language Processing (NLP) is a branch of AI that helps computers understand human language.
...
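
These scores are distances, not similarities, which is why the best match has the smallest value. Assuming the embeddings are unit-length and the index reports squared L2 distance (both are assumptions about this setup, not something the output itself proves), distance and cosine similarity are tied by `d² = 2 - 2·cos`. The sketch below checks that identity on random unit vectors and applies it to the scores printed above; `distance_to_cosine` is a hypothetical helper.

```python
import numpy as np

def squared_l2(a: np.ndarray, b: np.ndarray) -> float:
    diff = a - b
    return float(np.dot(diff, diff))

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two random vectors scaled to unit length.
rng = np.random.default_rng(0)
a = rng.normal(size=8)
b = rng.normal(size=8)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

d2 = squared_l2(a, b)
cos = cosine(a, b)
print(np.isclose(d2, 2 - 2 * cos))  # the identity holds for unit vectors

def distance_to_cosine(score: float) -> float:
    """Invert d^2 = 2 - 2*cos; only valid for unit-length vectors."""
    return 1.0 - score / 2.0

for score in (0.556, 1.208, 1.274):
    print(score, round(distance_to_cosine(score), 3))
```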


In [57]:
chunks

[Document(metadata={'source': 'AI Introduction', 'page': 1, 'topic': 'AI'}, page_content='Artificial Intelligence (AI) is the simulation of human intelligence in machines.\n        These systems are designed to think like humans and mimic their actions.\n        AI can be categorized into narrow AI and general AI.'),
 Document(metadata={'source': 'ML Basics', 'page': 1, 'topic': 'ML'}, page_content='Machine Learning is a subset of AI that enables systems to learn from data.\n        Instead of being explicitly programmed, ML algorithms find patterns in data.\n        Common types include supervised, unsupervised, and reinforcement learning.'),
 Document(metadata={'source': 'Deep Learning', 'page': 1, 'topic': 'DL'}, page_content='Deep Learning is a subset of machine learning based on artificial neural networks.\n        It uses multiple layers to progressively extract higher-level features from raw input.\n        Deep learning has revolutionized computer vision, NLP, and speech recogn

In [59]:
### Search with metadata filtering
filter_dict={"topic":"ML"}
filtered_results=vectorstore.similarity_search(
    query,
    k=3,
    filter=filter_dict
)
print(filtered_results)

[Document(id='c0566de3-4f5b-42a1-be7f-2f6cecf6d6eb', metadata={'source': 'ML Basics', 'page': 1, 'topic': 'ML'}, page_content='Machine Learning is a subset of AI that enables systems to learn from data.\n        Instead of being explicitly programmed, ML algorithms find patterns in data.\n        Common types include supervised, unsupervised, and reinforcement learning.')]


In [60]:
len(filtered_results)

1
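
Only one chunk carries `topic == "ML"`, so the filtered search returns a single result even though `k=3`. The sketch below illustrates one way such filtering can work: run the vector search first, then keep matching hits from a candidate pool. It assumes post-filtering over a pool of `fetch_k` candidates (the default of 20 is an assumption here); `filtered_search` and the ranked hit list are hypothetical.

```python
def filtered_search(ranked_hits, metadata_filter, k=3, fetch_k=20):
    """Return up to k hits from the top fetch_k whose metadata matches every filter key."""
    candidates = ranked_hits[:fetch_k]  # hits are assumed to be sorted best-first
    matches = [
        hit for hit in candidates
        if all(hit["metadata"].get(key) == value for key, value in metadata_filter.items())
    ]
    return matches[:k]

# Hypothetical ranked hits for the query "What is deep learning", best match first.
ranked_hits = [
    {"source": "Deep Learning", "metadata": {"topic": "DL"}},
    {"source": "ML Basics", "metadata": {"topic": "ML"}},
    {"source": "NLP Overview", "metadata": {"topic": "NLP"}},
    {"source": "AI Introduction", "metadata": {"topic": "AI"}},
]

hits = filtered_search(ranked_hits, {"topic": "ML"}, k=3)
print([hit["source"] for hit in hits])
```

One consequence of post-filtering is that a matching document ranked below the candidate pool is never returned, so a larger pool may be needed when the filter is very selective.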

## Build RAG Chain With LCEL 

In [75]:
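## Groq-hosted LLM
from langchain.chat_models import init_chat_model

# load_dotenv() has already placed GROQ_API_KEY in the environment; fail early if it is missing
if not os.getenv("GROQ_API_KEY"):
    raise RuntimeError("GROQ_API_KEY is not set; add it to your .env file")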

llm=init_chat_model(model="groq:gemma2-9b-it")
llm

ChatGroq(client=<groq.resources.chat.completions.Completions object at 0x000001B1F90CB360>, async_client=<groq.resources.chat.completions.AsyncCompletions object at 0x000001B1F90C8FC0>, model_name='gemma2-9b-it', model_kwargs={}, groq_api_key=SecretStr('**********'))

In [76]:
llm.invoke("Hi")

AIMessage(content='Hello! 👋\n\nHow can I help you today? 😊\n', additional_kwargs={}, response_metadata={'token_usage': {'completion_tokens': 15, 'prompt_tokens': 10, 'total_tokens': 25, 'completion_time': 0.027272727, 'prompt_time': 0.00117063, 'queue_time': 0.253079486, 'total_time': 0.028443357}, 'model_name': 'gemma2-9b-it', 'system_fingerprint': 'fp_10c08bf97d', 'finish_reason': 'stop', 'logprobs': None}, id='run--208de7c5-1ae3-45a8-a674-9e8c14783105-0', usage_metadata={'input_tokens': 10, 'output_tokens': 15, 'total_tokens': 25})

In [62]:
# 1. Simple RAG Chain with LCEL
simple_prompt = ChatPromptTemplate.from_template("""Answer the question based only on the following context:
Context: {context}

Question: {question}

Answer:""")

In [64]:
## Basic retriever
retriever=vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k":3}
)

In [65]:
retriever

VectorStoreRetriever(tags=['FAISS', 'OpenAIEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x000001B1FA6E0D60>, search_kwargs={'k': 3})

In [None]:
from typing import List
# Format documents for the prompt
def format_docs(docs: List[Document]) -> str:
    """Format documents for insertion into prompt"""
    formatted = []
    for i, doc in enumerate(docs):
        source = doc.metadata.get('source', 'Unknown')
        formatted.append(f"Document {i+1} (Source: {source}):\n{doc.page_content}")
    return "\n\n".join(formatted)

In [78]:
simple_rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | simple_prompt
    | llm
    | StrOutputParser()
)

In [79]:
simple_rag_chain

{
  context: VectorStoreRetriever(tags=['FAISS', 'OpenAIEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x000001B1FA6E0D60>, search_kwargs={'k': 3})
           | RunnableLambda(format_docs),
  question: RunnablePassthrough()
}
| ChatPromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template='Answer the question based only on the following context:\nContext: {context}\n\nQuestion: {question}\n\nAnswer:'), additional_kwargs={})])
| ChatGroq(client=<groq.resources.chat.completions.Completions object at 0x000001B1F90CB360>, async_client=<groq.resources.chat.completions.AsyncCompletions object at 0x000001B1F90C8FC0>, model_name='gemma2-9b-it', model_kwargs={}, groq_api_key=SecretStr('**********'))
| StrOutputParser()
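
The `|` syntax can look like magic at first. The sketch below shows the underlying idea with plain Python operator overloading: each step wraps a function, `a | b` feeds the output of `a` into `b`, and a dict of branches runs every branch on the same input. `Step`, `parallel`, and the fake components are hypothetical stand-ins, not LangChain classes.

```python
class Step:
    """Tiny stand-in for a runnable: wraps a function and composes with `|`."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # `a | b` builds a new step that feeds a's output into b.
        other = other if isinstance(other, Step) else Step(other)
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value):
        return self.func(value)

def parallel(**branches):
    """Run every branch on the same input and collect the results in a dict."""
    return Step(lambda value: {name: branch.invoke(value) for name, branch in branches.items()})

# Fake components that mirror the roles in simple_rag_chain.
fake_retriever = Step(lambda q: ["doc about " + q])
join_docs = Step(lambda docs: "\n".join(docs))
passthrough = Step(lambda value: value)
prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")
fake_llm = Step(lambda text: text.upper())

chain = parallel(context=fake_retriever | join_docs, question=passthrough) | prompt | fake_llm
print(chain.invoke("faiss"))
```

In the real chain the same question string reaches both branches: the retriever branch turns it into formatted context, while `RunnablePassthrough()` forwards it unchanged as `question`.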

In [80]:
### Conversational RAG Chain

conversational_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful AI assistant. Use the provided context to answer questions."),
    ("placeholder", "{chat_history}"),
    ("human", "Context: {context}\n\nQuestion: {input}"),
])

In [81]:
def create_conversational_rag():
    """Create a conversational RAG chain; the caller supplies `chat_history` on each call."""
    return (
        RunnablePassthrough.assign(
            context=lambda x: format_docs(retriever.invoke(x["input"]))
        )
        | conversational_prompt
        | llm
        | StrOutputParser()
    )

conversational_rag = create_conversational_rag()

In [82]:
conversational_rag

RunnableAssign(mapper={
  context: RunnableLambda(lambda x: format_docs(retriever.invoke(x['input'])))
})
| ChatPromptTemplate(input_variables=['context', 'input'], optional_variables=['chat_history'], input_types={'chat_history': list[typing.Annotated[typing.Union[typing.Annotated[langchain_core.messages.ai.AIMessage, Tag(tag='ai')], typing.Annotated[langchain_core.messages.human.HumanMessage, Tag(tag='human')], typing.Annotated[langchain_core.messages.chat.ChatMessage, Tag(tag='chat')], typing.Annotated[langchain_core.messages.system.SystemMessage, Tag(tag='system')], typing.Annotated[langchain_core.messages.function.FunctionMessage, Tag(tag='function')], typing.Annotated[langchain_core.messages.tool.ToolMessage, Tag(tag='tool')], typing.Annotated[langchain_core.messages.ai.AIMessageChunk, Tag(tag='AIMessageChunk')], typing.Annotated[langchain_core.messages.human.HumanMessageChunk, Tag(tag='HumanMessageChunk')], typing.Annotated[langchain_core.messages.chat.ChatMessageChunk, Tag(tag=

In [83]:
### streaming RAG chain
streaming_rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | simple_prompt
    | llm
)

print("Modern RAG chains created successfully!")
print("Available chains:")
print("- simple_rag_chain: Basic Q&A")
print("- conversational_rag: Maintains conversation history")
print("- streaming_rag_chain: Supports token streaming")

Modern RAG chains created successfully!
Available chains:
- simple_rag_chain: Basic Q&A
- conversational_rag: Maintains conversation history
- streaming_rag_chain: Supports token streaming


In [86]:
# Test function for different chain types
def test_rag_chains(question: str):
    """Test all RAG chain variants"""
    print(f"Question: {question}")
    print("=" * 80)
    
    # 1. Simple RAG
    print("\n1. Simple RAG Chain:")
    answer = simple_rag_chain.invoke(question)
    print(f"Answer: {answer}")

    print("\n2. Streaming RAG:")
    print("Answer: ", end="", flush=True)
    for chunk in streaming_rag_chain.stream(question):
        print(chunk.content, end="", flush=True)
    print()

In [87]:
test_rag_chains("What is the difference between AI and machine learning")

Question: What is the difference between AI and machine learning

1. Simple RAG Chain:
Answer: AI is the broader concept of machines performing tasks that typically require human intelligence. Machine learning is a specific subset of AI where systems learn from data instead of being explicitly programmed.  


2. Streaming RAG:
Answer: AI is the broad concept of machines simulating human intelligence, while machine learning is a subset of AI that allows systems to learn from data without explicit programming. 



In [88]:
# Test with multiple questions
test_questions = [
    "What is the difference between AI and Machine Learning?",
    "Explain deep learning in simple terms",
    "How does NLP work?"
]

for question in test_questions:
    print("\n" + "=" * 80 + "\n")
    test_rag_chains(question)



Question: What is the difference between AI and Machine Learning?

1. Simple RAG Chain:
Answer: AI is the broader concept of machines performing tasks that typically require human intelligence, while Machine Learning is a specific subset of AI that allows systems to learn from data without explicit programming.  


2. Streaming RAG:
Answer: Artificial Intelligence (AI) is the broad concept of machines simulating human intelligence, while Machine Learning (ML) is a subset of AI that focuses on enabling systems to learn from data without explicit programming.  



Question: Explain deep learning in simple terms

1. Simple RAG Chain:
Answer: Deep learning is a type of machine learning that uses artificial networks with many layers to learn from data.  Think of it like teaching a computer to recognize patterns, like pictures or words, by showing it lots of examples. Each layer in the network helps it learn more complex features, ultimately allowing it to make accurate predictions. 


2. 

In [89]:
## Conversational example
print("\n3. Conversational RAG Example:")
chat_history = []

# First question
q1 = "What is machine learning?"
a1 = conversational_rag.invoke({
    "input": q1,
    "chat_history": chat_history
})

print(f"Q1: {q1}")
print(f"A1: {a1}")


3. Conversational RAG Example:
Q1: What is machine learning?
A1: Machine learning is a subset of artificial intelligence that allows systems to learn from data instead of being explicitly programmed. 

ML algorithms identify patterns in data to make predictions or decisions. 



In [90]:
# Update history
chat_history.extend([
    HumanMessage(content=q1),
    AIMessage(content=a1)
])

In [91]:
# Follow-up question
q2 = "How is it different from traditional programming?"
a2 = conversational_rag.invoke({
    "input": q2,
    "chat_history": chat_history
})
print(f"\nQ2: {q2}")
print(f"A2: {a2}")


Q2: How is it different from traditional programming?
A2: Here's how machine learning differs from traditional programming, drawing from the provided context:

**Traditional Programming:**

* **Explicit Instructions:**  Programmers write very specific instructions (code) that the computer follows step-by-step.  
* **Rule-Based:**  Programs rely on predefined rules and logic to process information.

**Machine Learning:**

* **Learning from Data:** Instead of explicit rules, ML algorithms learn patterns and relationships from large datasets.
* **Data-Driven:** The "program" is not fixed code but a set of algorithms that adapt and improve based on the data they are trained on.

**In essence:**

Traditional programming is like giving the computer a detailed recipe to follow. Machine learning is more like giving the computer a bunch of ingredients and letting it figure out how to make something delicious based on patterns it observes. 
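
Note that the chain itself stores nothing between calls: the history lives in the `chat_history` list that the caller extends after every turn, and the retriever only sees the latest question. That bookkeeping can be wrapped in a small helper, sketched below. `Conversation` and `EchoChain` are hypothetical; with the real chain, the history entries would be `HumanMessage` and `AIMessage` objects as in the cells above.

```python
from dataclasses import dataclass, field

@dataclass
class Conversation:
    """Keeps the running history on the caller's side and passes it in on every call."""
    chain: object                                  # anything with .invoke(dict) -> str
    history: list = field(default_factory=list)

    def ask(self, question: str) -> str:
        answer = self.chain.invoke({"input": question, "chat_history": list(self.history)})
        # Record the turn only after the call succeeded.
        self.history.extend([("human", question), ("ai", answer)])
        return answer

class EchoChain:
    """Hypothetical stand-in for `conversational_rag`; reports how much history it received."""

    def invoke(self, payload: dict) -> str:
        return f"answer to {payload['input']!r} with {len(payload['chat_history'])} prior messages"

chat = Conversation(chain=EchoChain())
first = chat.ask("What is machine learning?")
second = chat.ask("How is it different from traditional programming?")
print(first)
print(second)
```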






# **Notes**

## **Introduction and Fundamentals**

**Introduction to FAISS**

> FAISS (Facebook AI Similarity Search) is a **high-performance library for similarity search and clustering of dense vectors**. It is widely used in machine learning and AI pipelines to **search, retrieve, and index large-scale vector representations**, such as embeddings generated by neural networks.

---

**What is FAISS?**

* FAISS provides **efficient indexing and retrieval** of high-dimensional vectors.
* It supports **both exact and approximate nearest neighbor (ANN) search**.
* Designed for **large-scale datasets**, it is optimized for **speed and memory usage** and can leverage **GPU acceleration** for massive vector collections.

---

**History and Evolution**

* Developed by **Facebook AI Research (FAIR)** to handle large-scale embedding search problems.
* Initially focused on CPU-based indexing and similarity search.
* Evolved to include **GPU support, advanced indexing structures**, and distributed capabilities.
* Became widely adopted in AI applications such as **RAG (Retrieval-Augmented Generation), recommendation systems, and image retrieval**.

---

**Applications and Use Cases**

* **Semantic Search:** Find documents, images, or audio similar to a query embedding.
* **Recommendation Systems:** Suggest products, content, or media based on user embeddings.
* **RAG Pipelines:** Retrieve relevant context for language models using vector similarity.
* **Clustering and Anomaly Detection:** Identify patterns or outliers in high-dimensional data.

---

**Comparison with Other Vector Search Libraries**

| Library      | Strengths                                                 | Limitations                                                 | Use Case                                |
| ------------ | --------------------------------------------------------- | ----------------------------------------------------------- | --------------------------------------- |
| **FAISS**    | High-performance, GPU support, exact & approximate search | Complex indexing for very large clusters may require tuning | Large-scale RAG, embedding search       |
| **Pinecone** | Managed service, scalable, simple API                     | Cloud-dependent, higher cost for large datasets             | SaaS applications, multi-user scenarios |
| **Milvus**   | Open-source, distributed, supports hybrid queries         | Requires separate deployment & management                   | Enterprise-grade vector DB              |
| **Weaviate** | Schema-aware, vector + semantic search                    | Slightly slower for large-scale GPU computation             | Knowledge graphs, semantic search       |

---

**Core Concepts**

**Vectors and Embeddings**

* Vectors are **numerical representations** of data, often derived from LLMs, CNNs, or other neural networks.
* Embeddings encode **semantic information** allowing similarity comparisons.

**Distance Metrics**
FAISS natively supports L2 and inner-product metrics; cosine similarity is obtained by L2-normalizing the vectors and searching with inner product:

| Metric                          | Description                                                         | Use Case                             |
| ------------------------------- | ------------------------------------------------------------------- | ------------------------------------ |
| **L2 (Euclidean Distance)**     | Measures straight-line distance between vectors                     | Most common for dense embeddings     |
| **Cosine Similarity**           | Measures angle between vectors; inner product on normalized vectors | Text embeddings, semantic similarity |
| **Inner Product (Dot Product)** | Measures projection similarity                                      | Ranking, recommendation tasks        |
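
The three metrics in the table can be compared on a pair of toy vectors. This is a plain NumPy sketch with hypothetical helper names; it also shows that the inner product of normalized vectors equals the cosine similarity of the originals.

```python
import numpy as np

def l2_distance(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))

def inner_product(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b))

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

# Same direction, different length: far apart in L2, identical in angle.
a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])

print("L2 distance:        ", l2_distance(a, b))
print("Inner product:      ", inner_product(a, b))
print("Cosine similarity:  ", cosine_similarity(a, b))
print("IP after normalize: ", inner_product(normalize(a), normalize(b)))
```

The vectors point the same way, so cosine similarity treats them as identical even though their L2 distance is large. That difference is the main thing to weigh when picking a metric for embeddings whose length carries no meaning.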

**Indexes and Search Types**

* **Flat Index:** Exact search, stores all vectors, high accuracy but slower for large datasets.
* **IVF (Inverted File Index):** Approximate search, partitions dataset to reduce search space.
* **HNSW (Hierarchical Navigable Small World):** Graph-based ANN, efficient and accurate for high-dimensional data.
* **PQ (Product Quantization):** Compresses vectors to save memory while enabling fast search.

**Approximate vs. Exact Search**

* **Exact Search:** Guarantees correct nearest neighbors but slower for large datasets.
* **Approximate Search:** Sacrifices some accuracy for faster search, ideal for large-scale applications where speed is critical.
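
The accuracy given up by approximate search is usually reported as recall@k: the share of the true top-k neighbors that the approximate index also returned. A minimal sketch follows, with a hypothetical helper and made-up id lists.

```python
def recall_at_k(exact_ids: list[int], approx_ids: list[int]) -> float:
    """Fraction of the true nearest neighbors that the approximate search also returned."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    return len(set(exact_ids) & set(approx_ids)) / len(exact_ids)

exact = [12, 7, 3, 44, 9]     # ids from an exact (flat) search
approx = [12, 3, 7, 51, 9]    # ids from an approximate search for the same query

print(recall_at_k(exact, approx))  # 4 of the 5 true neighbors were found: 0.8
```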

---

**Setting Up the Environment**

**Installing FAISS (CPU vs. GPU)**

* **CPU Installation:**

```bash
pip install faiss-cpu
```

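* **GPU Installation** (the `faiss-gpu` wheels on PyPI are community-maintained and may lag behind; the FAISS project documents conda as its supported install route):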

```bash
pip install faiss-gpu
```

**Dependencies and Requirements**

* **Python >=3.8**
* **NumPy**
* Optional: an **NVIDIA GPU with a compatible CUDA runtime** for GPU acceleration; PyTorch and CuPy are not required by FAISS itself

**Configuring GPU Support**

* Verify GPU availability using:

```python
import faiss
print(faiss.get_num_gpus())
```

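* Installing `faiss-gpu` does not move indexes to the GPU automatically; indexes are created on the CPU.
* Transfer an index to a GPU explicitly:

```python
cpu_index = faiss.IndexFlatL2(128)     # any CPU index; 128 is an example dimension
res = faiss.StandardGpuResources()     # allocates GPU scratch memory
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)  # 0 = GPU device id
```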

## **Creating Indexes in FAISS**

> FAISS provides a variety of **index types** optimized for different use cases, balancing **speed, memory efficiency, and accuracy**. Selecting the appropriate index depends on **dataset size, dimensionality, and performance requirements**.

---

**Flat (Brute-Force) Index**

* **Description:**

  * The simplest index in FAISS.
  * Stores all vectors in memory and performs **exact nearest neighbor search** by comparing the query vector against all stored vectors.
* **Characteristics:**

  * High accuracy (guaranteed exact results).
  * Linear search complexity (`O(n)`), slower for very large datasets.
  * Low preprocessing overhead, easy to implement.
* **Use Cases:**

  * Small to medium datasets.
  * Scenarios where **accuracy is critical** and search speed is less important.
* **Example:**

```python
import faiss
import numpy as np

d = 128  # Dimension
nb = 10000  # Number of vectors
xb = np.random.random((nb, d)).astype('float32')

index = faiss.IndexFlatL2(d)  # L2 distance
index.add(xb)
query = np.random.random((1, d)).astype('float32')
D, I = index.search(query, k=5)  # Top 5 nearest neighbors
```

---

**IndexIVFFlat (Inverted File Index)**

* **Description:**

  * Partitions the vector dataset into **clusters (inverted lists)** for faster search.
  * Performs **approximate nearest neighbor (ANN) search** by searching only a subset of clusters.
* **Characteristics:**

  * Faster than Flat index for large datasets.
  * Requires **training** on sample data to generate cluster centroids.
  * Trade-off between **accuracy and speed** depending on the number of clusters and probes.
* **Use Cases:**

  * Large-scale datasets with millions of vectors.
  * Applications where **slightly approximate results are acceptable** for speed gains.
* **Example:**

```python
nlist = 100  # Number of clusters
quantizer = faiss.IndexFlatL2(d)
index_ivf = faiss.IndexIVFFlat(quantizer, d, nlist)
index_ivf.train(xb)  # Train on the dataset
index_ivf.add(xb)
index_ivf.nprobe = 10  # Number of clusters to search
D, I = index_ivf.search(query, k=5)
```
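
To see what `train`, `add`, and `nprobe` do conceptually, the following NumPy sketch rebuilds a toy inverted-file index by hand. It is an illustration under simplifying assumptions (a few rounds of naive k-means, small random data, hypothetical helper names), not the FAISS implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, nb, nlist, nprobe, k = 16, 2000, 20, 4, 5
xb = rng.random((nb, d), dtype=np.float32)

def sq_dists(points: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """Squared L2 distance between every row of `points` and every row of `centers`."""
    return ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

# "Training": a few rounds of k-means to place nlist centroids.
centroids = xb[rng.choice(nb, nlist, replace=False)].copy()
for _ in range(10):
    assign = sq_dists(xb, centroids).argmin(axis=1)
    for c in range(nlist):
        members = xb[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)

# "Adding": store each vector id in the inverted list of its nearest centroid.
assign = sq_dists(xb, centroids).argmin(axis=1)
inverted_lists = [np.flatnonzero(assign == c) for c in range(nlist)]

def ivf_search(q: np.ndarray, nprobe: int, k: int) -> np.ndarray:
    """Scan only the nprobe inverted lists whose centroids are closest to q."""
    probe = sq_dists(q[None, :], centroids)[0].argsort()[:nprobe]
    candidates = np.concatenate([inverted_lists[c] for c in probe])
    dists = sq_dists(q[None, :], xb[candidates])[0]
    return candidates[dists.argsort()[:k]]

q = rng.random(d, dtype=np.float32)
exact = sq_dists(q[None, :], xb)[0].argsort()[:k]   # brute-force ground truth
approx = ivf_search(q, nprobe, k)
full = ivf_search(q, nlist, k)                      # probing every list is exhaustive

print("recall with nprobe=4:    ", len(set(exact) & set(approx)) / k)
print("recall with nprobe=nlist:", len(set(exact) & set(full)) / k)
```

Raising `nprobe` scans more lists, which moves the result toward the exact answer at the cost of more distance computations.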

---

**IndexHNSW (Hierarchical Navigable Small World)**

* **Description:**

  * Graph-based approximate nearest neighbor index.
  * Connects vectors in a **small-world graph structure** to navigate efficiently.
* **Characteristics:**

  * High search speed and accuracy for **high-dimensional data**.
  * No training required.
  * Supports **incremental insertion of new vectors** (removing vectors from an HNSW index is not supported).
* **Use Cases:**

  * Real-time applications needing **fast queries with high recall**.
  * Scenarios with **frequent additions** to the dataset.
* **Example:**

```python
M = 32  # Number of neighbors in the graph
index_hnsw = faiss.IndexHNSWFlat(d, M)
index_hnsw.add(xb)
index_hnsw.hnsw.efSearch = 50  # Trade-off between speed and accuracy
D, I = index_hnsw.search(query, k=5)
```

---

**IndexPQ (Product Quantization)**

* **Description:**

  * Compresses vectors using **sub-vector quantization**, reducing memory footprint.
  * Performs **approximate search** on compressed representations.
* **Characteristics:**

  * Reduces storage requirements for large datasets.
  * May slightly reduce accuracy compared to exact search.
  * Often combined with other indexes like IVF for **scalable ANN search**.
* **Use Cases:**

  * Extremely large-scale datasets where memory efficiency is critical.
  * Cloud or GPU deployments with limited memory.
* **Example:**

```python
m = 8  # Number of sub-vectors (d must be divisible by m)
nbits = 8  # Bits per sub-vector code (2**nbits centroids per sub-quantizer)
index_pq = faiss.IndexPQ(d, m, nbits)
index_pq.train(xb)
index_pq.add(xb)
D, I = index_pq.search(query, k=5)
```
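
The memory saving follows from simple arithmetic, worked through below for the parameters used above. The helpers are hypothetical and count only raw vector storage and codebooks, ignoring any per-index bookkeeping overhead.

```python
def flat_bytes(d: int, n: int) -> int:
    """Raw float32 storage: 4 bytes per dimension per vector."""
    return 4 * d * n

def pq_bytes(d: int, n: int, m: int, nbits: int) -> int:
    """PQ storage: m codes of nbits each per vector, plus the shared float32 codebooks."""
    code_bytes = n * m * nbits // 8
    codebook_bytes = 4 * d * (2 ** nbits)   # m codebooks x 2**nbits centroids x (d // m) dims
    return code_bytes + codebook_bytes

d, nb, m, nbits = 128, 10_000, 8, 8

print(flat_bytes(d, nb))           # 5120000 bytes of raw vectors
print(pq_bytes(d, nb, m, nbits))   # 80000 bytes of codes + 131072 bytes of codebooks = 211072
print(4 * d / (m * nbits / 8))     # per-vector compression ratio: 512 bytes -> 8 bytes = 64.0
```

The codebooks are a fixed cost, so the saving grows with the number of vectors: at this small scale they outweigh the codes, while at millions of vectors they become negligible.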

---

**Choosing the Right Index**

| Index Type  | Accuracy    | Speed                   | Memory | Training Required | Best Use Case                             |
| ----------- | ----------- | ----------------------- | ------ | ----------------- | ----------------------------------------- |
| **Flat**    | Exact       | Slow for large datasets | High   | No                | Small datasets, high precision needs      |
| **IVFFlat** | Approximate | Fast                    | Medium | Yes               | Large datasets, balanced speed & accuracy |
| **HNSW**    | High        | Very Fast               | High   | No                | Real-time search, high-dimensional data   |
| **PQ**      | Approximate | Fast                    | Low    | Yes               | Massive datasets with memory constraints  |


## **Adding and Searching Vectors in FAISS**


> Once an index is created in FAISS, the next critical steps are **adding vectors** to the index and **performing search queries**. These operations enable similarity search, semantic retrieval, and recommendation systems based on embeddings.

---

**Adding Vectors to Indexes**

* **Vectors** are added to FAISS indexes using the `.add()` method.
* Vectors must be **NumPy arrays of type `float32`** with shape `(number_of_vectors, dimension)`.
* Depending on the index type, vectors may require **training** before adding (e.g., IVF, PQ).

**Steps:**

1. Prepare vectors (from embeddings).
2. Train the index if required.
3. Add vectors to the index.

**Example: Adding vectors to different indexes**

```python
import faiss
import numpy as np

d = 128  # Dimension
nb = 10000  # Number of vectors
xb = np.random.random((nb, d)).astype('float32')

# Flat index
flat_index = faiss.IndexFlatL2(d)
flat_index.add(xb)

# IVF index
nlist = 100
quantizer = faiss.IndexFlatL2(d)
ivf_index = faiss.IndexIVFFlat(quantizer, d, nlist)
ivf_index.train(xb)
ivf_index.add(xb)
```

**Tips:**

* For **large datasets**, consider batch adding vectors to reduce memory spikes.
* Use **persistent storage** or save the index to disk for future use:

```python
faiss.write_index(flat_index, "flat.index")
faiss.write_index(ivf_index, "ivf.index")
```

---

**Performing Search Queries**

* FAISS supports **K-nearest neighbor (KNN) search** for querying vectors.
* The `.search()` method returns two arrays:

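  1. **Distances (`D`)** – distance or similarity scores between each query and its neighbors (squared L2 distances for L2 indexes, inner products for inner-product indexes).
  2. **Indices (`I`)** – positions of the nearest neighbors in the index; `-1` marks slots where fewer than `k` neighbors were found.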

**Syntax:**

```python
D, I = index.search(query_vectors, k)
```

* `query_vectors`: NumPy array of query embeddings, shape `(n_queries, d)`.
* `k`: Number of nearest neighbors to retrieve.

**Example: Searching for nearest neighbors**

```python
query = np.random.random((5, d)).astype('float32')  # 5 query vectors
k = 5  # Retrieve top 5 neighbors

# Using Flat index
D_flat, I_flat = flat_index.search(query, k)

# Using IVF index
ivf_index.nprobe = 10  # Number of clusters to probe
D_ivf, I_ivf = ivf_index.search(query, k)

print("Distances:", D_flat)
print("Indices:", I_flat)
```

**Tips for Efficient Search:**

* **IVF indexes:** Adjust `nprobe` to trade off **accuracy vs. speed**. Higher `nprobe` → more accurate but slower.
* **HNSW indexes:** Adjust `efSearch` parameter for similar trade-offs.
* **Batch queries:** Send multiple query vectors together for **better throughput**.

---

**K-Nearest Neighbor (KNN) Search**

* KNN search is the **core operation** for retrieving vectors similar to a query.
* Metrics used for similarity depend on the index:

  * **L2 distance**: closer vectors have smaller distances.
  * **Cosine similarity / inner product**: larger values indicate higher similarity; the inner product equals cosine similarity only when vectors are L2-normalized.
* FAISS can handle **single query or multiple queries** simultaneously.

**Use Cases of KNN in FAISS:**

1. **Semantic Search:** Retrieve top relevant documents or sentences.
2. **Recommendation Systems:** Find similar items based on embeddings.
3. **Clustering & Outlier Detection:** Identify nearest points in high-dimensional space.
4. **RAG Pipelines:** Retrieve context passages for language models.

---

**Saving and Loading Indexes**

* After adding vectors, indexes can be **persisted to disk** for reuse:

```python
faiss.write_index(flat_index, "flat.index")
loaded_index = faiss.read_index("flat.index")
```

* GPU indexes must be copied back to the CPU before saving, because `faiss.write_index()` serializes CPU indexes:

```python
cpu_index = faiss.index_gpu_to_cpu(gpu_index)
faiss.write_index(cpu_index, "gpu_index.index")
```



## **Serialization and Persistence in FAISS**

> FAISS provides **robust mechanisms for saving, loading, and managing indexes**, enabling long-term storage, sharing, and incremental updates. Proper serialization and persistence are crucial for **production systems, large-scale embeddings, and distributed workflows**.

---

**Saving Indexes to Disk**

* FAISS indexes can be **persisted to disk** using `faiss.write_index()`.
* This ensures that **precomputed vectors and trained indexes** do not need to be recomputed every time the application starts.

**Example: Saving an index**

```python
import faiss
import numpy as np

d = 128
xb = np.random.random((10000, d)).astype('float32')

# Create and add vectors to a Flat index
index = faiss.IndexFlatL2(d)
index.add(xb)

# Save index to disk
faiss.write_index(index, "flat.index")
```

* **Best Practices:**

  * Include **metadata or versioning** to track index updates.
  * For GPU indexes, transfer to CPU before saving:

  ```python
  cpu_index = faiss.index_gpu_to_cpu(gpu_index)
  faiss.write_index(cpu_index, "gpu_index.index")
  ```

---

**Loading Indexes**

* Previously saved indexes can be **reloaded into memory** for immediate use.
* `faiss.read_index()` automatically reconstructs the index structure.

**Example: Loading an index**

```python
# Load the index from disk
loaded_index = faiss.read_index("flat.index")

# Perform search on the loaded index
query = np.random.random((1, d)).astype('float32')
D, I = loaded_index.search(query, k=5)
print("Nearest neighbors indices:", I)
```

* **GPU usage:** Loaded indexes can be transferred to GPU for faster queries:

```python
res = faiss.StandardGpuResources()
gpu_index = faiss.index_cpu_to_gpu(res, 0, loaded_index)
```

---

**Updating and Merging Indexes**

**Adding New Vectors**

* Vectors can be **added incrementally** without retraining (for most index types like Flat, HNSW).
* For IVF or PQ indexes, if the **new data significantly differs**, retraining might improve accuracy.

**Example: Adding new vectors**

```python
new_vectors = np.random.random((5000, d)).astype('float32')
loaded_index.add(new_vectors)  # Incremental addition
```

**Merging Indexes**

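* FAISS can **merge indexes** that were built separately, which is useful for distributed datasets or batch indexing.
* `faiss.merge_into()` works on IVF indexes that share the same trained quantizer. Flat indexes store the raw vectors, so they can be combined by reconstructing the vectors of one index and adding them to the other.

**Example: Merging indexes**

```python
index1 = faiss.IndexFlatL2(d)
index2 = faiss.IndexFlatL2(d)

index1.add(np.random.random((5000, d)).astype('float32'))
index2.add(np.random.random((5000, d)).astype('float32'))

# Flat indexes: copy index1's vectors into index2 (their IDs continue after index2's existing ones)
index2.add(index1.reconstruct_n(0, index1.ntotal))

# IVF indexes trained with the same quantizer can be merged with:
# faiss.merge_into(ivf_index2, ivf_index1, True)  # True shifts the IDs of ivf_index1
```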

* **Considerations:**

  * For IVF or PQ indexes, the indexes must have **compatible parameters** (same dimension, same trained quantizer and codebooks).
  * Merging does not retrain anything; if the merged data differs noticeably from the original training sample, **retrain and rebuild** the index to maintain accuracy.

---

**Practical Tips for Persistence**

| Task            | Recommendation                                                 |
| --------------- | -------------------------------------------------------------- |
| Saving indexes  | Use versioned filenames and CPU indexes for compatibility      |
| Loading indexes | Transfer to GPU only if needed for fast search                 |
| Adding new data | Batch additions to avoid frequent reallocation overhead        |
| Merging indexes | Ensure index compatibility; retrain ANN indexes if necessary   |
| Backup          | Periodically backup indexes to prevent data loss in production |

---


## **Index Types and Trade-offs in FAISS**

> FAISS offers a variety of **index types** designed to optimize **speed, memory usage, and accuracy** for different vector search scenarios. Understanding the **trade-offs** between these indexes is crucial to selecting the right solution for your dataset and use case.

---

**Flat, IVFFlat, IVF-PQ, HNSW, and Hybrid Indexes**

| Index Type                                        | Description                                                       | Accuracy       | Speed                           | Memory Usage | Training Required          | Best Use Case                                                           |
| ------------------------------------------------- | ----------------------------------------------------------------- | -------------- | ------------------------------- | ------------ | -------------------------- | ----------------------------------------------------------------------- |
| **Flat (Brute-Force)**                            | Exact nearest neighbor search; compares query with all vectors    | 100%           | Slow for large datasets         | High         | No                         | Small datasets, high precision needs                                    |
| **IVFFlat (Inverted File)**                       | Partitions dataset into clusters; searches only relevant clusters | High (approx.) | Faster than Flat                | Medium       | Yes (train on sample data) | Large datasets, balanced speed & accuracy                               |
| **IVF-PQ (Inverted File + Product Quantization)** | Combines IVF with quantization to reduce memory footprint         | Medium to High | Fast                            | Low          | Yes                        | Massive datasets where memory is limited, approximate search acceptable |
| **HNSW (Hierarchical Navigable Small World)**     | Graph-based ANN search; vectors connected in small-world graph    | High           | Very Fast                       | High         | No                         | Real-time search, high-dimensional embeddings, dynamic datasets         |
| **Hybrid Indexes**                                | Combines multiple strategies (e.g., IVF + HNSW)                   | High           | Optimized for specific workload | Medium       | Depends on combination     | Custom pipelines, large-scale RAG, recommendation systems               |

---

**Memory vs. Speed Trade-offs**

* **Flat Index:**

  * Pros: Exact results, simple to implement.
  * Cons: Memory-intensive and slow for large datasets.

* **IVFFlat / IVF-PQ:**

  * Pros: Much faster search on large datasets; IVF-PQ reduces memory usage significantly.
  * Cons: Slight loss in accuracy for approximate search; requires training.

* **HNSW:**

  * Pros: High accuracy and very fast queries; supports dynamic insertions.
  * Cons: Higher memory usage than Flat, because graph links are stored alongside the full vectors; index construction is slower.

* **Hybrid Indexes:**

  * Pros: Can be optimized for specific use cases combining speed, accuracy, and memory.
  * Cons: Configuration complexity; tuning required.

**Rule of Thumb:**

* Small datasets → **Flat Index** (accuracy prioritized).
* Medium to large datasets → **IVFFlat** (good balance).
* Very large datasets with memory constraints → **IVF-PQ**.
* Real-time queries and high-dimensional embeddings → **HNSW**.
* Custom pipelines → **Hybrid** indexes.

---

**Selecting the Right Index for Your Data**

1. **Dataset Size:**

   * Small (<100k vectors) → Flat index
   * Large (>1M vectors) → IVF or HNSW

2. **Memory Constraints:**

   * Limited memory → IVF-PQ
   * Memory available → Flat or HNSW

3. **Query Latency Requirements:**

   * Real-time → HNSW or Hybrid
   * Batch processing → IVFFlat or IVF-PQ

4. **Accuracy Needs:**

   * Critical accuracy → Flat or HNSW
   * Approximate acceptable → IVFFlat or IVF-PQ

5. **Update Frequency:**

   * Frequent updates → HNSW (dynamic insertions)
   * Static dataset → IVF-PQ or IVFFlat

**Example Decision Flow:**

| Condition                                        | Recommended Index |
| ------------------------------------------------ | ----------------- |
| Small dataset, exact search                      | Flat              |
| Large dataset, approximate search, medium memory | IVFFlat           |
| Large dataset, memory-limited, approximate       | IVF-PQ            |
| High-dimensional embeddings, real-time queries   | HNSW              |
| Custom needs, mixed requirements                 | Hybrid            |
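The decision table can be encoded as a small helper, sketched below. The function name, the 100k threshold (taken from the dataset-size rule in this section), and the specific `nlist`, `m`, and `M` values in the returned strings are illustrative assumptions; the "Hybrid" row is left out because it depends on the workload.

```python
def recommend_index(n_vectors, memory_limited=False, exact=False, realtime=False):
    """Map the decision table to a FAISS index_factory string (illustrative thresholds)."""
    if exact or n_vectors < 100_000:
        return "Flat"              # small dataset or exact search
    if memory_limited:
        return "IVF1024,PQ8"       # IVF-PQ: compressed codes, approximate
    if realtime:
        return "HNSW32"            # graph index, low query latency
    return "IVF1024,Flat"          # IVFFlat: balanced speed and accuracy

print(recommend_index(50_000))                          # Flat
print(recommend_index(5_000_000, memory_limited=True))  # IVF1024,PQ8
print(recommend_index(5_000_000, realtime=True))        # HNSW32
print(recommend_index(5_000_000))                       # IVF1024,Flat
```

The returned string can be passed to `faiss.index_factory(d, spec)` to build the index; note that `PQ8` requires the dimension `d` to be divisible by 8.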




## **Quantization Techniques in FAISS**

> Quantization is a set of **techniques used to compress vectors**, reducing memory usage and speeding up approximate nearest neighbor (ANN) search at the cost of some accuracy. FAISS provides several quantization methods to handle **large-scale vector datasets efficiently**.

---

**Product Quantization (PQ)**

* **Concept:**

  * Divides each high-dimensional vector into **sub-vectors**.
  * Each sub-vector is **quantized separately** using a codebook, allowing compact storage.
* **Benefits:**

  * Drastically reduces memory footprint.
  * Enables fast approximate distance computation using precomputed tables.
* **Trade-offs:**

  * Slight loss of accuracy compared to exact search.
  * Works best when combined with IVF for large datasets.
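To make the memory reduction concrete, here is a back-of-the-envelope calculation using the same `d`, `m`, and `nbits` values as the FAISS example in this section. The dataset size of one million vectors is an assumption for illustration.

```python
d = 128        # vector dimension
m = 8          # sub-vectors per vector
nbits = 8      # bits per sub-vector code
n = 1_000_000  # number of vectors (illustrative)

raw_bytes_per_vector = d * 4                      # float32 storage
pq_bytes_per_vector = m * nbits // 8              # one code per sub-vector
codebook_bytes = m * (2 ** nbits) * (d // m) * 4  # m codebooks of 2^nbits centroids each

compression_ratio = raw_bytes_per_vector // pq_bytes_per_vector
print(raw_bytes_per_vector, pq_bytes_per_vector)  # 512 8
print(compression_ratio)                          # 64
print(n * raw_bytes_per_vector / 1e6, "MB raw")   # 512.0 MB raw
print((n * pq_bytes_per_vector + codebook_bytes) / 1e6, "MB with PQ")
```

The codebooks are a fixed cost that does not grow with the number of vectors, so the per-vector code size dominates for large datasets.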

**Example:** Using PQ in FAISS

```python
import faiss
import numpy as np

d = 128  # Dimension
m = 8    # Number of sub-vectors
nbits = 8  # Bits per sub-vector

xb = np.random.random((10000, d)).astype('float32')
index_pq = faiss.IndexPQ(d, m, nbits)
index_pq.train(xb)  # Train PQ codebooks
index_pq.add(xb)
query = np.random.random((1, d)).astype('float32')
D, I = index_pq.search(query, k=5)
```

---

**Scalar Quantization**

* **Concept:**

  * Each vector component is quantized independently to a lower-precision representation (e.g., 8-bit integers).
* **Benefits:**

  * Extremely fast and memory-efficient.
  * Simple to implement.
* **Limitations:**

  * Accuracy loss may be higher for high-dimensional or highly correlated vectors.

---

**OPQ (Optimized Product Quantization)**

* **Concept:**

  * Enhances standard PQ by **rotating vectors before quantization** to reduce quantization error.
  * A learned rotation balances variance across sub-vectors so that each sub-quantizer loses less information.
* **Benefits:**

  * Improves accuracy compared to vanilla PQ.
  * Especially useful for **high-dimensional embeddings**.
* **Trade-offs:**

  * Requires extra **training** for the rotation matrix.
  * Slightly more computational overhead during indexing.

**Example:** Using OPQ with FAISS

```python
d = 128
m = 8
nbits = 8

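# xb: training vectors from the PQ example above
opq = faiss.OPQMatrix(d, m)              # Learned rotation applied before quantization
pq = faiss.IndexPQ(d, m, nbits)
index_opq = faiss.IndexPreTransform(opq, pq)
index_opq.train(xb)                      # Trains the rotation, then the PQ codebooks
index_opq.add(xb)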
```

---

**Clustering and Partitioning**

* FAISS uses **clustering algorithms like k-means** to partition vectors into **coarse groups**.
* Helps in **IVF-based indexes** by reducing the number of vectors searched per query.

**Coarse Quantizers:**

* Assign each vector to the nearest cluster centroid.
* During search, only vectors in the **closest clusters** are considered, improving speed.
* **Example:** IVF index uses coarse quantization:

```python
nlist = 100  # Number of clusters
quantizer = faiss.IndexFlatL2(d)
ivf_index = faiss.IndexIVFFlat(quantizer, d, nlist)
ivf_index.train(xb)
ivf_index.add(xb)
```

* **Probing:** During query, the `nprobe` parameter determines **how many clusters to search**, balancing accuracy vs. speed.

---

**Handling Large Datasets**

1. **Batch Processing:**

   * Add vectors in batches to avoid memory spikes.

2. **Compressed Indexes:**

   * Use **IVF-PQ** or **IndexIVFScalarQuantizer** to reduce RAM usage by storing compact codes instead of full `float32` vectors.

3. **GPU Acceleration:**

   * Transfer indexes to GPU for faster training and search:

   ```python
   res = faiss.StandardGpuResources()
   gpu_index = faiss.index_cpu_to_gpu(res, 0, ivf_index)
   ```

4. **Hybrid Approaches:**

   * Combine coarse quantization (IVF) with PQ or OPQ for **memory-efficient approximate search** on massive datasets.

## **GPU Acceleration in FAISS**

> FAISS provides **GPU acceleration** to significantly speed up vector indexing and similarity search, especially for **large-scale, high-dimensional datasets**. Using GPUs allows **parallel computation of distance metrics** and fast approximate nearest neighbor searches.

---

**GPU vs. CPU Indexes**

| Feature                   | CPU Index                             | GPU Index                                                     |
| ------------------------- | ------------------------------------- | ------------------------------------------------------------- |
| **Speed**                 | Slower for large datasets             | Much faster due to parallelism                                |
| **Memory**                | Limited to system RAM                 | Limited to GPU VRAM; multiple GPUs can be used                |
| **Scalability**           | Easy to scale with disk-based storage | Better for real-time, in-memory search on massive datasets    |
| **Index Types Supported** | All FAISS indexes                     | A subset (e.g., Flat, IVFFlat, IVFPQ); HNSW runs on CPU only  |
| **Ease of Use**           | Simple, no dependencies               | Requires CUDA, GPU drivers, and proper FAISS GPU installation |

**Example:** Creating a CPU vs. GPU index

```python
import faiss
import numpy as np

d = 128
xb = np.random.random((10000, d)).astype('float32')
query = np.random.random((1, d)).astype('float32')

# CPU Flat Index
cpu_index = faiss.IndexFlatL2(d)
cpu_index.add(xb)
D_cpu, I_cpu = cpu_index.search(query, k=5)

# GPU Flat Index
res = faiss.StandardGpuResources()
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)  # Transfer CPU index to GPU
D_gpu, I_gpu = gpu_index.search(query, k=5)
```

---

**Multi-GPU Support**

* FAISS can leverage **multiple GPUs** for indexing and search of extremely large datasets.
* Useful when **single GPU memory is insufficient** or parallel computation is needed for high throughput.

**Steps for Multi-GPU:**

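1. Build or load the index on the **CPU**.
2. Clone it onto every visible GPU with `faiss.index_cpu_to_all_gpus()`. By default each GPU receives a full replica; pass a `faiss.GpuMultipleClonerOptions` object with `shard = True` to split the vectors across GPUs instead.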

**Example:**

```python
gpu_index_multi = faiss.index_cpu_to_all_gpus(cpu_index)
D_multi, I_multi = gpu_index_multi.search(query, k=5)
```

* With replicas, FAISS **spreads query batches across GPUs**, so throughput can scale close to linearly with the number of GPUs for large batches.
* Multi-GPU works best with **large batch queries or huge indexes**.

---

**Performance Tuning**

To maximize performance on GPU:

1. **Batch Queries:**

   * Perform multiple queries simultaneously to fully utilize GPU cores.

2. **Use Approximate Search:**

   * Use IVF, optionally combined with PQ, to reduce the number of distance computations.

3. **Tune IVF Parameters:**

   * `nlist` (number of clusters) and `nprobe` (number of clusters to search) control the **trade-off between speed and accuracy**.

4. **Memory Management:**

   * Monitor GPU memory usage to prevent **out-of-memory errors**.
   * Use **fewer PQ sub-quantizers (`m`) or fewer bits per code** to reduce memory footprint.

5. **Hybrid Indexing:**

   * Combine coarse quantizers (IVF) with PQ/OPQ on GPU for **efficient large-scale ANN search**.

6. **Use Pretrained Indexes:**

   * Load precomputed indexes to GPU for **immediate high-performance search** without retraining.

## **Embedding Generation and FAISS Integration**

> FAISS works most effectively when **vectors (embeddings) represent meaningful features** of your data. Proper embedding generation and preprocessing are crucial for **accurate similarity search** and **retrieval-augmented generation (RAG) pipelines**.

---

**Integrating FAISS with NLP Embeddings**

**NLP embeddings** are dense vector representations of text generated by models such as **BERT, OpenAI embeddings, or Hugging Face Transformers**. These embeddings capture **semantic meaning**, enabling FAISS to perform **semantic search and retrieval**.

**Steps:**

1. Generate embeddings using your model of choice:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
texts = ["Hello world", "FAISS vector search", "OpenAI embeddings"]
embeddings = model.encode(texts).astype('float32')
```

2. Add embeddings to a FAISS index:

```python
import faiss
d = embeddings.shape[1]
index = faiss.IndexFlatL2(d)
index.add(embeddings)
```

3. Perform similarity search:

```python
query = model.encode(["Vector search example"]).astype('float32')
D, I = index.search(query, k=3)
print("Nearest neighbors:", I)
```

**Tips:**

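* For **cosine similarity**, L2-normalize both the stored embeddings (before calling `index.add`) and the query, and search with an inner-product index such as `faiss.IndexFlatIP`: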

```python
faiss.normalize_L2(embeddings)
faiss.normalize_L2(query)
```

---

**Image Embeddings**

FAISS is also widely used for **image similarity search**. Embeddings can be generated using:

* **CNN Features:** Extracted from models like ResNet, EfficientNet.
* **CLIP:** Produces joint image-text embeddings, allowing **cross-modal search**.

**Example with CLIP:**

```python
import torch
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open("example1.jpg"), Image.open("example2.jpg")]
inputs = processor(images=images, return_tensors="pt")
with torch.no_grad():
    embeddings = model.get_image_features(**inputs).numpy().astype('float32')
```

* Add embeddings to FAISS and perform similarity queries just like text embeddings.

---

**Preprocessing and Normalization**

1. **Standardize Vector Dimensions:** Ensure all embeddings have the same size (`d`).
2. **Normalization:**

   * Cosine similarity requires L2-normalized embeddings.
   * `faiss.normalize_L2(vectors)` normalizes a `float32` array in place.
3. **Batch Processing:** Efficiently generate embeddings in batches to save memory.

---

**Retrieval-Augmented Generation (RAG)**

* FAISS is commonly used in **RAG pipelines** to retrieve relevant documents or context for LLMs.
* Workflow:

  1. Embed a **corpus of documents** into vectors.
  2. Store them in FAISS.
  3. When a **user query** arrives, embed it and retrieve nearest neighbors.
  4. Feed retrieved documents into an LLM for **contextualized generation**.

**Example Integration with OpenAI LLM:**

```python
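from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment
question = "Explain FAISS indexing"
query_embedding = model.encode([question]).astype('float32')
D, I = index.search(query_embedding, k=3)

# Retrieve top documents and pass them to the LLM together with the question
docs = [texts[i] for i in I[0]]
context = "\n".join(docs)
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(response.choices[0].message.content)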
```

---

**Using FAISS with Vector Databases**

* FAISS can be used as a **standalone vector search engine** or integrated into **vector databases** like **Chroma, Milvus, or Weaviate**.
* Vector databases often provide **metadata storage, filtering, and persistence**, while FAISS handles **high-performance similarity search**.

---

**LangChain and FAISS Integration**

* FAISS integrates seamlessly with **LangChain** for building **RAG pipelines and QA systems**.
* Steps:

  1. Store embeddings in a FAISS index.
  2. Use LangChain **retriever interfaces** to query the index.
  3. Feed retrieved documents to LLM chains for response generation.

**Example:**

```python
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_texts(texts, embedding_model)

# k is passed through search_kwargs, not as a direct argument
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 3})
query = "How does FAISS handle large datasets?"
results = retriever.invoke(query)
```

---

**Building QA Systems**

1. **Document Ingestion:** Preprocess and embed knowledge base documents.
2. **Indexing:** Add embeddings to FAISS.
3. **Query Handling:** Embed user query and retrieve nearest vectors.
4. **LLM Response Generation:** Pass retrieved context to an LLM for answer generation.
5. **Optional Enhancements:**

   * Use **metadata filtering** for domain-specific retrieval.
   * Implement **multi-hop reasoning** with multiple retrievals.
   * Cache embeddings for frequently asked questions.

## **Hybrid Search**

> Hybrid search combines the strengths of **traditional keyword-based search** with **semantic vector-based search**, delivering more **relevant, context-aware results**. This approach is widely used in **modern search engines, QA systems, and recommendation platforms**.

---

**Combining Keyword and Vector Search**

* **Keyword Search:**

  * Matches query terms directly against text or metadata.
  * Fast and precise for exact matches but limited in understanding context or synonyms.

* **Vector Search:**

  * Uses embeddings to find **semantically similar content**.
  * Captures meaning beyond exact words, robust to paraphrasing.

**Hybrid Approach:**

* Both search types are executed in parallel.
* Results are **merged using a weighted scoring system** to rank documents.
* Ensures both **exact matches and semantic relevance** are considered.

**Example Workflow:**

1. Generate **query embedding** for vector search.
2. Run **keyword search** on document corpus.
3. Retrieve **top-N results** from both methods.
4. Combine results using a **custom scoring formula**.

---

**Weighted Scoring Mechanisms**

* Each document receives **two scores**:

  1. **Vector similarity score** (cosine similarity or L2 distance).
  2. **Keyword relevance score** (BM25, TF-IDF, or custom metric).

* **Combined score:**

```python
final_score = alpha * vector_score + (1 - alpha) * keyword_score
```

* `alpha` controls the balance:

  * `alpha = 0.7` → more emphasis on semantic relevance
  * `alpha = 0.3` → more emphasis on exact keyword matches

**Tips:**

* Normalize scores to the same scale before combining.
* Tune `alpha` based on application type:

  * QA Systems → higher vector weight.
  * Legal or scientific search → higher keyword weight.
* Can extend to **multi-modal hybrid search** (e.g., combining text, image, and metadata).
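The normalization tip matters because vector and keyword scores usually live on different scales. Below is a minimal sketch using min-max scaling, which is one possible choice among several; the raw score values are made up for illustration.

```python
import numpy as np

def min_max(scores):
    """Rescale scores to [0, 1]; constant inputs map to all zeros."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    return (scores - scores.min()) / span if span > 0 else np.zeros_like(scores)

# Hypothetical raw scores for three documents
vector_scores = np.array([0.82, 0.40, 0.65])  # e.g. cosine similarities
keyword_scores = np.array([1.2, 7.5, 3.1])    # e.g. BM25 scores on a different scale

alpha = 0.7
final_scores = alpha * min_max(vector_scores) + (1 - alpha) * min_max(keyword_scores)
ranking = np.argsort(-final_scores)
print(ranking)  # [0 2 1]
```

Without rescaling, the keyword scores in this example would dominate the sum regardless of `alpha`, simply because their raw magnitudes are larger.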

---

**Real-World Examples**

| Use Case                   | Hybrid Strategy                                                                                       | Benefits                                                                          |
| -------------------------- | ----------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------- |
| **E-commerce Search**      | Combine product title keyword search with embedding-based semantic search of descriptions and reviews | Improves discovery of relevant products even when exact keywords are missing      |
| **Enterprise Document QA** | Keyword search on legal terms + vector search on embeddings of policy documents                       | Ensures precise legal term matches while retrieving semantically relevant content |
| **Content Recommendation** | Vector search on user interaction history + keyword-based category filters                            | Balances semantic similarity with categorical relevance                           |
| **Academic Research**      | Keyword search on metadata + vector search on paper abstracts                                         | Retrieves exact subject papers while identifying semantically related research    |

---

**Implementing Hybrid Search**

**Example using FAISS + Keyword Search**

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import faiss
import numpy as np

# Sample documents
docs = ["Machine learning basics", "Deep learning with PyTorch", "FAISS vector search"]

# Keyword-based scoring
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)
query = "vector search"
keyword_scores = tfidf_matrix @ vectorizer.transform([query]).T
keyword_scores = keyword_scores.toarray().flatten()

# Vector-based scoring using embeddings
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(docs).astype('float32')
query_embedding = model.encode([query]).astype('float32')

index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)
D, I = index.search(query_embedding, k=len(docs))

# search() returns results sorted by distance, so map each score back to its document position
vector_scores = np.zeros(len(docs))
vector_scores[I[0]] = 1 / (1 + D[0])  # Convert distance to similarity

# Hybrid score
alpha = 0.6
final_scores = alpha * vector_scores + (1 - alpha) * keyword_scores
ranked_docs = [docs[i] for i in np.argsort(-final_scores)]
print(ranked_docs)
```


## **Handling Large Datasets in FAISS**

> FAISS is designed to **scale to millions or even billions of vectors**, but working with very large datasets requires **special strategies for memory management, distributed computing, and incremental updates**. Proper handling ensures **efficient, low-latency searches** without exhausting system resources.

---

**Sharding Indexes**

* **Concept:** Split a large dataset into smaller **shards**, each with its own FAISS index.
* **Benefits:**

  * Reduces memory usage per index.
  * Enables **parallel search across shards**.
* **Implementation:**

  * Divide embeddings into N chunks.
  * Create separate indexes for each chunk.
  * Perform queries across all shards and merge results.

**Example:**

```python
shard_size = 100000
shards = [embeddings[i:i+shard_size] for i in range(0, len(embeddings), shard_size)]
indexes = []
for shard in shards:
    idx = faiss.IndexFlatL2(embeddings.shape[1])
    idx.add(shard)
    indexes.append(idx)
```

---

**Memory Management**

* **Strategies:**

  * Use **IVF-PQ or scalar quantization** to reduce RAM usage; HNSW keeps full vectors plus graph links, so it does not save memory.
  * Normalize vectors in-place to avoid extra copies.
  * Store indexes on **disk or persistent storage** when not in active use.
* **GPU Considerations:**

  * Transfer only active shards to GPU for processing.
  * Use `faiss.index_cpu_to_gpu()` selectively for high-demand queries.

---

**Incremental Updates**

* FAISS supports **adding new vectors without retraining** for many index types (e.g., Flat, HNSW).
* For **IVF-PQ or IVF-Flat**, retraining may be needed periodically to maintain accuracy.
* **Best practices:**

  * Batch additions to minimize overhead.
  * Keep a **versioned index system** for rollback and consistency.

**Example:**

```python
new_vectors = np.random.random((5000, indexes[0].d)).astype('float32')  # Must match the index dimension
indexes[0].add(new_vectors)  # Incremental addition to first shard
```

---

**Distributed FAISS**

* For extremely large datasets, FAISS can be **distributed across multiple machines**.
* Strategies include:

  1. **Sharded Indexes across nodes** – Each node handles a portion of the dataset.
  2. **Parallel Queries** – Query all nodes simultaneously and merge results.

**Multi-Node Setup:**

* Each node runs a FAISS instance with a shard.
* A **central coordinator** aggregates results.
* Optional: Use **gRPC or REST APIs** for communication between nodes.

---

**Parallelizing Search**

* **Within a single node:**

  * FAISS supports **multi-threaded search** using `omp_set_num_threads()`.

  ```python
  faiss.omp_set_num_threads(8)
  ```
* **Across multiple nodes or GPUs:**

  * Each shard can be queried independently.
  * Merge top-k results using a **priority queue** or sorted merge.

**Tips:**

* Use **GPU acceleration** for high-dimensional embeddings.
* Adjust **nprobe** (for IVF indexes) to balance **speed and accuracy**.

---

**Best Practices**

| Area               | Recommendation                                                           |
| ------------------ | ------------------------------------------------------------------------ |
| Indexing           | Use approximate indexes (IVF-PQ, HNSW) for massive datasets              |
| Sharding           | Split dataset into manageable chunks to reduce memory usage              |
| Updates            | Batch incremental additions; retrain IVF indexes periodically            |
| Parallelization    | Use multi-threading, multi-GPU, or multi-node setups for high throughput |
| Persistence        | Save indexes to disk regularly; maintain versioned backups               |
| Monitoring         | Track memory and GPU usage; monitor search latency                       |
| Query Optimization | Adjust `nprobe` (IVF) or `efSearch` (HNSW) for optimal trade-offs        |



## **Performance Optimization in FAISS**

> Optimizing FAISS is crucial for **low-latency, high-throughput vector search** in production environments. Performance tuning involves **index selection, parameter optimization, hardware utilization, and careful deployment strategies**.

---

**Speeding Up Searches**

1. **Use Approximate Nearest Neighbor (ANN) Indexes:**

   * **IVF-PQ, HNSW, or Hybrid indexes** provide significant speedups over brute-force Flat indexes.
   * Trade-off between speed and accuracy is adjustable via parameters like `nprobe` or `efSearch`.

2. **GPU Acceleration:**

   * Transfer indexes to GPU using `faiss.index_cpu_to_gpu()` for **parallelized distance computation**.
   * Multi-GPU setups can handle very large indexes efficiently.

3. **Batch Queries:**

   * Process multiple queries simultaneously to maximize **parallel computation efficiency**.

4. **Normalized Embeddings:**

   * Normalize vectors to unit L2 norm so that the inner product equals **cosine similarity**; an inner-product index (e.g., `IndexFlatIP`) can then serve cosine queries without computing norms at search time.

5. **Index Sharding:**

   * Split large datasets into multiple shards to **reduce per-index memory footprint** and enable parallel search.
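
To see why normalization matters, note that for unit-length vectors the squared L2 distance and the inner product are tied by ‖a − b‖² = 2 − 2·(a·b), so both metrics produce the same ranking. The NumPy check below is a small sketch on random data (the shapes and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 64)).astype("float32")
query = rng.normal(size=(1, 64)).astype("float32")

# Scale every vector to unit L2 norm.
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
query /= np.linalg.norm(query, axis=1, keepdims=True)

inner = (vectors @ query.T).ravel()            # cosine similarity for unit vectors
sq_l2 = ((vectors - query) ** 2).sum(axis=1)   # squared Euclidean distance

# Identity: ||a - b||^2 = 2 - 2 * (a . b) when ||a|| = ||b|| = 1.
identity_holds = bool(np.allclose(sq_l2, 2.0 - 2.0 * inner, atol=1e-5))

# Highest inner product and lowest L2 distance select the same neighbors.
top5_ip = set(np.argsort(-inner)[:5].tolist())
top5_l2 = set(np.argsort(sq_l2)[:5].tolist())
same_top5 = top5_ip == top5_l2
```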

---

**Index Tuning and Parameters**

| Index Type   | Key Parameters                                               | Effect                                                                                                                 |
| ------------ | ------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------- |
| **IVF**      | `nlist` (number of clusters), `nprobe` (clusters to search)  | Higher `nlist` gives finer partitions (fewer vectors per cluster); higher `nprobe` improves accuracy but increases latency |
| **HNSW**     | `M` (max connections per node), `efConstruction`, `efSearch` | Higher `M` and `efConstruction` improve accuracy but require more memory; `efSearch` balances query speed vs. accuracy |
| **PQ / OPQ** | `m` (sub-vectors), `nbits` (quantization bits)               | More sub-vectors and bits improve accuracy but increase memory usage                                                   |
| **Flat**     | N/A                                                          | Exact search; no tuning required, but slower for large datasets                                                        |

* **Recommendation:** Start with default parameters and **benchmark on your dataset**, then tune based on latency vs. accuracy requirements.

---

**Benchmarking and Profiling**

* **Measure query latency, throughput, and memory usage** on representative datasets.
* **Python example for timing searches:**

```python
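import time

start = time.perf_counter()  # monotonic, high-resolution timer
D, I = index.search(query_embeddings, k=10)
elapsed = time.perf_counter() - start
print(f"Batch latency: {elapsed:.4f} s ({elapsed / len(query_embeddings) * 1000:.2f} ms per query)")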
```

* **Profile memory usage:** Monitor RAM and GPU VRAM consumption to avoid **out-of-memory errors**.
* **Compare different index types** for trade-offs between **speed, accuracy, and memory footprint**.
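
A single timing is noisy, so it usually helps to time many queries and report percentiles. The helper below is a minimal sketch using only the standard library; `measure_latency`, `summarize`, and `fake_search` are hypothetical names, and `fake_search` merely stands in for a call such as `index.search`.

```python
import statistics
import time


def measure_latency(search_fn, queries, warmup=2):
    """Return per-query latencies in milliseconds for search_fn(query)."""
    for q in queries[:warmup]:
        search_fn(q)  # warm caches before timing
    latencies = []
    for q in queries:
        start = time.perf_counter()
        search_fn(q)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies):
    """Mean, median, 95th percentile (ms) and sequential queries per second."""
    ordered = sorted(latencies)
    p95_index = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return {
        "mean_ms": statistics.mean(ordered),
        "p50_ms": statistics.median(ordered),
        "p95_ms": ordered[p95_index],
        "qps": 1000.0 * len(ordered) / sum(ordered),
    }


def fake_search(q):
    # Stand-in workload: find the 10 integers closest to q.
    return sorted(range(1000), key=lambda i: abs(i - q))[:10]


stats = summarize(measure_latency(fake_search, list(range(50))))
```

Tail latency (`p95_ms`) is often more informative than the mean when sizing a production service.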

---

**Deployment**

**Serving FAISS with APIs**

* FAISS indexes can be **exposed as REST or gRPC APIs** for real-time vector search.
* Frameworks like **FastAPI** or **Flask** allow easy integration:

```python
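from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import faiss
import numpy as np

app = FastAPI()
index = faiss.read_index("faiss.index")  # load once at startup

class SearchRequest(BaseModel):
    query_vector: list[float]
    k: int = 5

@app.post("/search")
def search(req: SearchRequest):
    # FAISS expects a float32 array of shape (n_queries, d)
    q = np.asarray(req.query_vector, dtype="float32").reshape(1, -1)
    if q.shape[1] != index.d:
        raise HTTPException(status_code=400, detail=f"Expected dimension {index.d}")
    D, I = index.search(q, req.k)
    return {"indices": I.tolist(), "distances": D.tolist()}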
```

---

**Cloud Deployment (AWS, GCP, Azure)**

* **Cloud VMs or GPU instances** can host FAISS for large-scale search:

  * **AWS:** EC2 GPU instances (e.g., P- or G-series)
  * **GCP:** Compute Engine GPU VMs, Vertex AI for LLM integrations
  * **Azure:** N-series VMs for GPU workloads
* Use **object storage** (S3, GCS, Azure Blob) for storing indexes, loaded into memory on startup.
* Consider **auto-scaling** and **load balancing** for high query throughput.

---

**Integrating with Microservices**

* FAISS can be deployed as a **microservice** within a larger architecture:

  * Separate service for vector storage and search.
  * Other services (RAG pipeline, LLM orchestration) query FAISS via **HTTP/gRPC API**.
* Benefits:

  * **Decoupled architecture**
  * Easy scaling of the FAISS service independently
  * Enables **multi-language or multi-platform integration**

---

**Summary**

Optimizing FAISS involves:

1. Choosing the **right index type** for your data and query load.
2. **Tuning index parameters** like `nprobe`, `M`, and `efSearch` for your workload.
3. Leveraging **GPU acceleration and multi-threading**.
4. Efficiently managing **memory and large datasets** via sharding.
5. Deploying indexes as **APIs or microservices** for scalable production environments.

## **Approximate Nearest Neighbor (ANN) Algorithms**

Approximate Nearest Neighbor (ANN) algorithms are designed to **find vectors that are close to a query vector quickly**, trading off some accuracy for **speed and memory efficiency**. They are critical when working with **high-dimensional embeddings** and **large-scale datasets**.

**Key Benefits:**

* Drastically reduce search time compared to exact nearest neighbor methods.
* Enable scaling to **millions or billions of vectors**.
* Often sufficient for **semantic search, recommendation systems, and RAG pipelines** where slight approximations are acceptable.

**Common ANN Algorithms in FAISS:**

* **HNSW (Hierarchical Navigable Small World graphs)**
* **IVF-PQ (Inverted File with Product Quantization)**
* **IVFFlat**

---

**HNSW in Depth**

HNSW is a **graph-based ANN algorithm** that builds a **hierarchical network of nodes**, where each node represents a vector.

**How it Works:**

1. Nodes are connected to a small number of neighbors (small-world property).
2. Search starts at the top layer and progressively descends through layers to refine nearest neighbors.
3. This yields **search time that grows roughly logarithmically with the number of vectors**, which keeps queries fast on large, high-dimensional collections.
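
The greedy descent at the heart of these steps can be illustrated on a single layer. The toy sketch below is not FAISS's implementation: it builds a plain k-nearest-neighbor graph with NumPy and walks it greedily, whereas real HNSW stacks several layers, keeps a candidate list of size `efSearch`, and selects neighbors heuristically. All names and sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((300, 8)).astype("float32")
M = 8  # neighbors kept per node in this toy graph

# Build a simple k-NN graph: each node links to its M closest points.
sq_dists = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
np.fill_diagonal(sq_dists, np.inf)  # a node is not its own neighbor
neighbors = np.argsort(sq_dists, axis=1)[:, :M]


def greedy_search(query, entry):
    """Walk the graph, always moving to the neighbor closest to the query."""
    current = entry
    current_dist = float(((points[current] - query) ** 2).sum())
    while True:
        cand = neighbors[current]
        cand_dists = ((points[cand] - query) ** 2).sum(axis=1)
        best = int(np.argmin(cand_dists))
        if cand_dists[best] >= current_dist:
            return current, current_dist  # no neighbor is closer: local minimum
        current, current_dist = int(cand[best]), float(cand_dists[best])


query = rng.random(8).astype("float32")
entry_dist = float(((points[0] - query) ** 2).sum())
found, found_dist = greedy_search(query, entry=0)
```

A single greedy walk can stop at a local minimum, which is one reason HNSW adds upper layers for long-range hops and a wider candidate list at the bottom layer.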

**Key Parameters:**

| Parameter          | Description                                    | Impact                                                  |
| ------------------ | ---------------------------------------------- | ------------------------------------------------------- |
| **M**              | Maximum number of connections per node         | Higher M increases accuracy and memory usage            |
| **efConstruction** | Size of dynamic list during index construction | Higher value improves index quality but slows the build |
| **efSearch**       | Size of candidate list during search           | Higher value improves accuracy but increases query time |

**Advantages:**

* Very high accuracy for approximate search.
* Supports **dynamic insertions** without retraining.
* Excellent for **real-time applications** and **high-dimensional embeddings**.

**Example:**

```python
import faiss
d = 128
index = faiss.IndexHNSWFlat(d, 32)  # 32 is M, passed positionally
index.hnsw.efConstruction = 200
index.add(embeddings)
index.hnsw.efSearch = 100
D, I = index.search(query_embeddings, k=5)
```

---

**IVF-PQ Trade-offs**

**IVF-PQ (Inverted File + Product Quantization)** combines **coarse clustering** with **vector compression**.

* **How it Works:**

  1. Dataset is partitioned into clusters (IVF).
  2. Within each cluster, vectors are compressed using **PQ**, reducing memory.
* **Parameters:**

  * `nlist`: Number of clusters (affects speed vs. accuracy).
  * `m`: Number of PQ sub-vectors (affects compression and accuracy).
  * `nbits`: Bits per sub-vector (affects memory usage and precision).
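
The effect of `m` and `nbits` on memory can be estimated with simple arithmetic: a PQ code takes `m * nbits` bits per vector, against `4 * d` bytes for raw float32. The sketch below works through one configuration; the helper names are hypothetical, and the estimate ignores per-vector ids and the centroid tables.

```python
import math


def raw_bytes_per_vector(d):
    """Size of one uncompressed float32 vector."""
    return 4 * d


def pq_bytes_per_vector(m, nbits):
    """Size of one PQ code: m sub-vector codes of nbits each, rounded up to bytes."""
    return math.ceil(m * nbits / 8)


d, m, nbits = 128, 16, 8
assert d % m == 0, "d must be divisible by m so sub-vectors have equal length"

raw = raw_bytes_per_vector(d)                  # 4 * 128 = 512 bytes
code = pq_bytes_per_vector(m, nbits)           # 16 * 8 / 8 = 16 bytes
compression = raw / code                       # 32x smaller
million_vectors_mb = 1_000_000 * code / 1e6    # code storage for 1M vectors, in MB
```

Doubling `m` doubles the code size but lets each sub-quantizer cover a shorter slice of the vector, which typically improves accuracy.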

**Trade-offs:**

| Factor        | High Accuracy Setting              | High Speed / Low Memory Setting |
| ------------- | ---------------------------------- | ------------------------------- |
| **nlist**     | Smaller, at fixed `nprobe` (each probe covers more vectors) | Larger, at fixed `nprobe` (each probe scans fewer vectors) |
| **m / nbits** | Larger m, more bits per sub-vector | Smaller m, fewer bits           |
| **nprobe**    | High (search more clusters)        | Low (search fewer clusters)     |

**Advantages:**

* Memory-efficient for very large datasets.
* Can handle **hundreds of millions of vectors** on commodity hardware.

**Disadvantages:**

* Approximate search may slightly reduce accuracy.
* Requires **training the PQ and cluster centroids** before adding vectors.

---

**Evaluating Accuracy vs. Performance**

When using ANN algorithms, it is important to **measure the trade-off between speed and accuracy**:

1. **Accuracy Metrics:**

   * **Recall@k:** Fraction of true nearest neighbors retrieved in top-k results.
   * **Precision:** Fraction of retrieved vectors that are truly nearest neighbors.

2. **Performance Metrics:**

   * **Query latency:** Time per search.
   * **Throughput:** Number of queries per second.
   * **Memory usage:** RAM or GPU VRAM consumed by the index.
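
Recall@k can be computed directly from two id matrices: one from an exact index and one from the approximate index. The function below is a small sketch with hand-made ids; `recall_at_k` is a hypothetical helper name.

```python
import numpy as np


def recall_at_k(approx_ids, exact_ids):
    """Mean fraction of exact top-k ids that also appear in the approximate top-k."""
    hits = [len(set(a) & set(e)) for a, e in zip(approx_ids, exact_ids)]
    return float(np.sum(hits)) / exact_ids.size


# Two queries, k = 4. Rows are the ids returned for each query.
exact = np.array([[1, 2, 3, 4], [10, 11, 12, 13]])
approx = np.array([[1, 2, 3, 9], [10, 11, 7, 8]])

score = recall_at_k(approx, exact)  # (3 + 2) / 8 = 0.625
```

When both result lists have the same length k, precision and recall at k coincide, so a single number is usually reported.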

**Evaluation Approach:**

* Run **benchmark tests** on representative datasets.
* Adjust **index parameters** (`nprobe`, `efSearch`, `M`) to find an **optimal balance**.
* For real-time applications, prioritize latency and throughput; for batch offline processing, prioritize accuracy.

**Example Benchmark Table:**

| Index   | Recall@10 | Query Time (ms) | Memory (MB) | Use Case                              |
| ------- | --------- | --------------- | ----------- | ------------------------------------- |
| Flat    | 1.0       | 120             | 2000        | Small dataset, exact search           |
| HNSW    | 0.98      | 5               | 1200        | Real-time semantic search             |
| IVF-PQ  | 0.92      | 10              | 500         | Large-scale approximate search        |
| IVFFlat | 0.95      | 8               | 800         | Medium-scale, balanced speed/accuracy |



## **Custom Metrics and Distance Functions in FAISS**

> FAISS allows customization of **distance functions and similarity metrics** to match the requirements of your data and application. Using the appropriate metric is critical for **accurate nearest neighbor retrieval**, whether you are working with text embeddings, image features, or hybrid representations.

---

**Implementing Common Metrics**

**1. L2 (Euclidean Distance)**

* **Definition:** Measures straight-line distance between vectors.
* **Use Cases:** General-purpose similarity search, image embeddings, numeric feature spaces.
* **FAISS Implementation:** Default for `IndexFlatL2` and many other indexes.

```python
import faiss
import numpy as np

d = 128
index = faiss.IndexFlatL2(d)
index.add(np.random.random((1000, d)).astype('float32'))
query = np.random.random((1, d)).astype('float32')
D, I = index.search(query, k=5)
```

---

**2. Inner Product**

* **Definition:** Measures similarity by dot product of vectors.
* **Use Cases:** When vector magnitude carries signal as well as direction, e.g., recommendation scores (maximum inner product search); normalize the vectors first if only direction should matter.
* **FAISS Implementation:** Use `IndexFlatIP`

```python
index = faiss.IndexFlatIP(d)
faiss.normalize_L2(vectors)  # Optional for cosine similarity equivalence
index.add(vectors)
```

---

**3. Cosine Similarity**

* **Definition:** Measures angle between vectors, independent of magnitude.
* **Use Cases:** Semantic similarity for NLP embeddings, text search.
* **Implementation in FAISS:** Normalize vectors and use `IndexFlatIP`

```python
faiss.normalize_L2(vectors)
faiss.normalize_L2(query)
index = faiss.IndexFlatIP(d)
index.add(vectors)
D, I = index.search(query, k=5)
```

---

**Custom Similarity Functions**

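* FAISS ships a fixed set of built-in metrics (L2, inner product, and a few others such as L1 on some index types); a **user-defined metric** generally means either C++ changes or scoring candidates yourself in Python.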
* Custom metrics can incorporate **domain-specific weights, hybrid features, or multi-modal embeddings**.

**Example:** Weighted combination of text and image embeddings:

```python
def custom_similarity(query, vectors, alpha=0.7):
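    # query: shape (n_queries, d); vectors: shape (n_vectors, d)
    # The first 128 dimensions hold the text embedding, the rest the image embedding.
    # Returns a (n_queries, n_vectors) score matrix.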
    text_sim = np.dot(query[:, :128], vectors[:, :128].T)
    image_sim = np.dot(query[:, 128:], vectors[:, 128:].T)
    return alpha * text_sim + (1-alpha) * image_sim
```

* The output can then be used to **rank vectors manually or in a custom FAISS index wrapper**.
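
A common pattern for manual ranking is two-stage retrieval: fetch a generous candidate set with a built-in metric, then re-score only those candidates with the custom function. The sketch below shows the second stage in NumPy; the `rerank` helper, the 128/64 text/image split, and the stand-in candidate ids are assumptions for illustration.

```python
import numpy as np


def rerank(query, vectors, candidate_ids, text_dim, alpha=0.7, k=5):
    """Re-score candidates with a weighted text/image dot product; return top-k."""
    cands = vectors[candidate_ids]
    text_sim = cands[:, :text_dim] @ query[:text_dim]
    image_sim = cands[:, text_dim:] @ query[text_dim:]
    scores = alpha * text_sim + (1 - alpha) * image_sim
    order = np.argsort(-scores)[:k]  # highest combined score first
    return candidate_ids[order], scores[order]


rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 192)).astype("float32")  # 128 text + 64 image dims
query = rng.normal(size=192).astype("float32")

# Stand-in for the ids returned by a first-stage index.search call.
candidate_ids = np.arange(0, 1000, 10)

top_ids, top_scores = rerank(query, vectors, candidate_ids, text_dim=128)
```

Because only the candidates are re-scored, the cost of the custom function stays small even when the full collection is large.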

---

**Case Studies**

**1. Semantic Search in Text**

* **Scenario:** Search through thousands of documents for queries like "AI in healthcare".
* **Approach:**

  * Generate embeddings using BERT or OpenAI embeddings.
  * Normalize vectors for cosine similarity.
  * Store in FAISS `IndexFlatIP` or `HNSW` for fast retrieval.
* **Outcome:** Retrieves semantically relevant documents even if keywords do not match exactly.

---

**2. Image Retrieval Systems**

* **Scenario:** Find visually similar images from a large image database.
* **Approach:**

  * Use CNN or CLIP embeddings.
  * Index embeddings in FAISS using `IndexIVFPQ` for large datasets.
  * Query by image or text (in case of CLIP cross-modal search).
* **Outcome:** Fast, accurate retrieval of visually or semantically similar images.

---

**3. Recommendation Engines**

* **Scenario:** Recommend products or content to users based on embeddings of past interactions.
* **Approach:**

  * Represent users and items as vectors (e.g., collaborative filtering + content embeddings).
  * Index item embeddings in FAISS using `HNSW` or `IVFPQ`.
  * Perform nearest neighbor search to recommend top-k items.
* **Outcome:** Personalized recommendations that balance semantic similarity and user preferences.

---

**Summary**

With **L2, inner product, cosine, or custom similarity scoring**, FAISS can handle **varied data types and similarity requirements**. By combining **domain-specific metrics with ANN algorithms**, FAISS powers real-world applications such as:

* **Semantic text search**: Fast retrieval of relevant documents.
* **Image retrieval**: Find visually similar images in large datasets.
* **Recommendation systems**: Personalized suggestions with vector-based similarity.