# **VectorStore**

> A vectorstore is a storage backend for embeddings, allowing you to index, store, and query them efficiently. It pairs embeddings with metadata to support searches for semantically similar content.

**Vectorstores in LangChain are specialized tools designed to store and retrieve vector embeddings for text or other data, which are essential for tasks like similarity search, question answering, and RAG (Retrieval-Augmented Generation) systems.**
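To make the idea concrete, here is a minimal sketch of what a vectorstore does internally: it keeps each embedding next to its text and metadata, and answers a query by ranking stored vectors by similarity. The class `TinyVectorStore` and the 3-dimensional vectors are hypothetical and hand-made for illustration; real stores such as FAISS or Qdrant use optimized indexes rather than this brute-force scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """Hypothetical in-memory store holding (embedding, text, metadata) records."""

    def __init__(self):
        self.records = []

    def add(self, embedding, text, metadata=None):
        self.records.append((embedding, text, metadata or {}))

    def similarity_search(self, query_embedding, k=2):
        # Score every stored vector against the query, then keep the top k.
        scored = [
            (cosine_similarity(query_embedding, emb), text, meta)
            for emb, text, meta in self.records
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:k]


# Hand-made "embeddings" (assumed values, not produced by a real model).
store = TinyVectorStore()
store.add([1.0, 0.0, 0.0], "Hybrid RAG", {"source": "a.docx"})
store.add([0.0, 1.0, 0.0], "Bernoulli distribution", {"source": "b.pdf"})
store.add([0.9, 0.1, 0.0], "Generative RAG", {"source": "a.docx"})

results = store.similarity_search([1.0, 0.0, 0.0], k=2)
top_texts = [text for _, text, _ in results]
print(top_texts)
```

The two RAG entries rank above the unrelated one because their vectors point in nearly the same direction as the query.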

## Popular Vectorstore Options in LangChain

1. FAISS (Facebook AI Similarity Search)
2. Pinecone
3. Chroma
4. Qdrant



### For embeddings, we will use three different sources

1. GoogleGenerativeAIEmbeddings
2. CohereEmbeddings
3. HuggingFaceEmbeddings

---

## **FAISS**

Facebook AI Similarity Search (FAISS) is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also includes supporting code for evaluation and parameter tuning.

In [1]:
!pip install -qU langchain-community faiss-cpu

In [2]:
!pip install langchain-google-genai langchain-cohere -q

In [3]:
from langchain_community.document_loaders.word_document import Docx2txtLoader

file_path = './Data/RAG_Types_Table.docx'

loader = Docx2txtLoader(file_path)

docs = loader.load()

In [8]:
docs

[Document(metadata={'source': './Data/RAG_Types_Table.docx'}, page_content='RAG Types: Advantages, Disadvantages, Use Cases, and Additional Information\n\nRAG Type\n\nAdvantages\n\nDisadvantages\n\nWhen to Use\n\nAdditional Information\n\nHybrid RAG\n\n- High accuracy by combining multiple information sources\n- Handles diverse types of data (structured, unstructured) well\n- Robust in challenging scenarios\n\n- Complexity in implementation\n- Higher computational resources required\n- Increased latency\n\n- When accuracy is paramount, and there are multiple data types\n\nCombines retrieval-based techniques (like search engines or databases) and generation-based techniques (like GPT-based models) to provide comprehensive responses.\n\nGenerative RAG\n\n- Provides flexible and creative responses\n- Can generate human-like content\n- Capable of handling open-domain questions\n\n- Risk of generating hallucinated information\n- Requires more extensive training data\n\n- For open-ended or c

In [None]:
# !pip install sentence_transformers -q

In [None]:
# !pip install langchain-huggingface -q

In [12]:
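import os

from langchain_community.vectorstores import FAISS
from langchain_google_genai import GoogleGenerativeAIEmbeddings

# The Google client reads the key from GOOGLE_API_KEY; environment variable
# names are case-sensitive on Linux and macOS.
os.environ['GOOGLE_API_KEY'] = "Your-API-KEY"

embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

# Embed every document and build an in-memory FAISS index from the vectors.
db = FAISS.from_documents(docs, embeddings)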

In [13]:
db

<langchain_community.vectorstores.faiss.FAISS at 0x25b38251460>

In [14]:

ques = 'Types of RAG?'

answer = db.similarity_search(ques)

In [15]:
answer

[Document(id='e11a7630-8931-45c9-98a3-3447355fb529', metadata={'source': './Data/RAG_Types_Table.docx'}, page_content='RAG Types: Advantages, Disadvantages, Use Cases, and Additional Information\n\nRAG Type\n\nAdvantages\n\nDisadvantages\n\nWhen to Use\n\nAdditional Information\n\nHybrid RAG\n\n- High accuracy by combining multiple information sources\n- Handles diverse types of data (structured, unstructured) well\n- Robust in challenging scenarios\n\n- Complexity in implementation\n- Higher computational resources required\n- Increased latency\n\n- When accuracy is paramount, and there are multiple data types\n\nCombines retrieval-based techniques (like search engines or databases) and generation-based techniques (like GPT-based models) to provide comprehensive responses.\n\nGenerative RAG\n\n- Provides flexible and creative responses\n- Can generate human-like content\n- Capable of handling open-domain questions\n\n- Risk of generating hallucinated information\n- Requires more exten

----

## **Qdrant**

Qdrant (read: quadrant) is a vector similarity search engine. It provides a production-ready service with a convenient API to store, search, and manage vectors with additional payload and extended filtering support. This makes it useful for all sorts of neural network or semantic-based matching, faceted search, and other applications.
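As a rough illustration of "vectors with additional payload and filtering", the sketch below first keeps only the points whose payload matches a condition and then ranks the survivors by cosine similarity. The `points` list, the `filtered_search` helper, and the 2-dimensional vectors are hypothetical; this plain-Python scan only mirrors the idea and is not how Qdrant implements filtering.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical points: each has an id, a vector, and a payload dictionary.
points = [
    {"id": 1, "vector": [1.0, 0.0], "payload": {"page": 3, "page_content": "Sampling distribution"}},
    {"id": 2, "vector": [0.9, 0.1], "payload": {"page": 6, "page_content": "Sample mean"}},
    {"id": 3, "vector": [0.0, 1.0], "payload": {"page": 3, "page_content": "Bernoulli distribution"}},
]


def filtered_search(query_vector, points, payload_filter, limit=5):
    """Keep points whose payload matches every key in payload_filter, then rank them."""
    matches = [
        p for p in points
        if all(p["payload"].get(key) == value for key, value in payload_filter.items())
    ]
    matches.sort(key=lambda p: cosine(query_vector, p["vector"]), reverse=True)
    return matches[:limit]


# Only points stored with page == 3 are considered, however similar the others are.
hits = filtered_search([1.0, 0.0], points, {"page": 3})
hit_ids = [p["id"] for p in hits]
print(hit_ids)
```

Point 2 is close to the query but is excluded because its payload does not satisfy the filter.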

In [4]:
docs

[Document(metadata={'source': './Data/RAG_Types_Table.docx'}, page_content='RAG Types: Advantages, Disadvantages, Use Cases, and Additional Information\n\nRAG Type\n\nAdvantages\n\nDisadvantages\n\nWhen to Use\n\nAdditional Information\n\nHybrid RAG\n\n- High accuracy by combining multiple information sources\n- Handles diverse types of data (structured, unstructured) well\n- Robust in challenging scenarios\n\n- Complexity in implementation\n- Higher computational resources required\n- Increased latency\n\n- When accuracy is paramount, and there are multiple data types\n\nCombines retrieval-based techniques (like search engines or databases) and generation-based techniques (like GPT-based models) to provide comprehensive responses.\n\nGenerative RAG\n\n- Provides flexible and creative responses\n- Can generate human-like content\n- Capable of handling open-domain questions\n\n- Risk of generating hallucinated information\n- Requires more extensive training data\n\n- For open-ended or c

In [2]:
!pip install -qU langchain-qdrant

In [5]:
qdrant_key = "Your-API-KEY"
qdrant_url = "Your-Qdrant-URL"

In [6]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('./Data/Central_Limit_Theorem.pdf')

docs = loader.load()

In [7]:
docs

[Document(metadata={'source': './Data/Central_Limit_Theorem.pdf', 'page': 0}, page_content='Bernoulli distribution is a probability distribution that models a binary outcome, where the \noutcome can be either success (represented by the value 1) or failure (represented by the \nvalue 0). The Bernoulli distribution is named after the Swiss mathematician Jacob Bernoulli, \nwho first introduced it in the late 1600s.\nThe Bernoulli distribution is characterized by a single parameter, which is the probability of \nsuccess, denoted by p. The probability mass function (PMF) of the Bernoulli distribution is:\nThe Bernoulli distribution is commonly used in machine learning for modelling \nbinary outcomes, such as whether a customer will make a purchase or not, \nwhether an email is spam or not, or whether a patient will have a certain disease \nor not.\nBernoulli Distribution\n27 March 2023 16:06\n   Session on Central Limit Theorem Page 1    '),
 Document(metadata={'source': './Data/Central_Li

In [9]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=150, chunk_overlap=50)

chunks = splitter.split_documents(docs)

In [11]:
len(chunks)

70

In [13]:
len(chunks[0].page_content)

92
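The effect of `chunk_size` and `chunk_overlap` can be sketched with a simple sliding window over characters. This is a simplified illustration, not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter prefers to break on separators such as paragraphs, newlines, and spaces, which is why the first chunk above is 92 characters rather than the full 150. The helper name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(pieces)
```

Each chunk repeats the last four characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk when the overlap is large enough.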

In [18]:
from langchain_cohere.embeddings import CohereEmbeddings
from langchain_qdrant import Qdrant

embeddings = CohereEmbeddings(cohere_api_key="Your-API-KEY", model="embed-english-v3.0")

# Upload the chunks in batches; every call writes to the same 'statistics' collection.
batch_size = 100  # Adjust based on your dataset size
for i in range(0, len(chunks), batch_size):
    batch = chunks[i:i+batch_size]
    Qdrant.from_documents(
        batch, embeddings, url=qdrant_url, api_key=qdrant_key, collection_name='statistics'
    )


In [19]:
from qdrant_client import QdrantClient

# Connect to the Qdrant server
client = QdrantClient(url=qdrant_url, api_key=qdrant_key)

# Check collection info
collection_info = client.get_collection(collection_name="statistics")
print(collection_info)

status=<CollectionStatus.GREEN: 'green'> optimizer_status=<OptimizersStatusOneOf.OK: 'ok'> vectors_count=None indexed_vectors_count=0 points_count=70 segments_count=2 config=CollectionConfig(params=CollectionParams(vectors=VectorParams(size=1024, distance=<Distance.COSINE: 'Cosine'>, hnsw_config=None, quantization_config=None, on_disk=None, datatype=None, multivector_config=None), shard_number=1, sharding_method=None, replication_factor=1, write_consistency_factor=1, read_fan_out_factor=None, on_disk_payload=True, sparse_vectors=None), hnsw_config=HnswConfig(m=16, ef_construct=100, full_scan_threshold=10000, max_indexing_threads=0, on_disk=False, payload_m=None), optimizer_config=OptimizersConfig(deleted_threshold=0.2, vacuum_min_vector_number=1000, default_segment_number=0, max_segment_size=None, memmap_threshold=None, indexing_threshold=20000, flush_interval_sec=5, max_optimization_threads=None), wal_config=WalConfig(wal_capacity_mb=32, wal_segments_ahead=0), quantization_config=None

In [20]:
# Example search query
query = "Find statistics related documents"

# Generate the embedding for the query
query_embedding = embeddings.embed_query(query)

In [21]:
# Perform the similarity search; query_points replaces the deprecated client.search
search_results = client.query_points(
    collection_name="statistics",
    query=query_embedding,  # The query embedding
    limit=5,  # Number of top results to return
).points

# Display search results
for result in search_results:
    print(f"Document ID: {result.id}, Score: {result.score}")


In [34]:
# Loop through the search results
for result in search_results:
    # Extract document ID
    document_id = result.id
    
    # Extract page content and metadata
    page_content = result.payload.get("page_content", "No page content available")
    metadata = result.payload.get("metadata", {})
    
    # Display the information
    print(f"Document ID: {document_id}")
    print(f"Page Content: {page_content}")
    print(f"Metadata: {metadata}")
    print("-" * 50)


Document ID: b038f7ff-308f-4bac-ae5b-3262c71b0f4c
Page Content: avoid skewed results.
1.
Calculate the sample mean (average salary) and sample standard 
deviation for each sample.
2.
Metadata: {'source': './Data/Central_Limit_Theorem.pdf', 'page': 6}
--------------------------------------------------
Document ID: b6ef766e-1f7d-4649-bcba-cca95f2cc4d0
Page Content: Why Sampling Distribution is important?
Sampling distribution is important in statistics and machine learning because it allows us to
Metadata: {'source': './Data/Central_Limit_Theorem.pdf', 'page': 3}
--------------------------------------------------
Document ID: 8e070619-6116-478c-99c4-58db4fb9ea61
Page Content: sample statistic (such as the sample mean or sample proportion) computed from multiple 
independent samples of the same size from a population.
Metadata: {'source': './Data/Central_Limit_Theorem.pdf', 'page': 3}
--------------------------------------------------
Document ID: 4819ccd9-d110-4647-be8f-b3ccd2bb6d87
Page

### Saving the vector store on the local machine

In [32]:
# Passing path= runs Qdrant in local mode and stores the collection in that folder
vectore_db = Qdrant.from_documents(
    chunks, embeddings, path='vectore_store', collection_name='Statistics'
)

In [33]:
search = vectore_db.similarity_search('Find statistics related documents')
search

[Document(metadata={'source': './Data/Central_Limit_Theorem.pdf', 'page': 6, '_id': '83d11b7152db41a79a2ca1bb57f41c4a', '_collection_name': 'Statistics'}, page_content='avoid skewed results.\n1.\nCalculate the sample mean (average salary) and sample standard \ndeviation for each sample.\n2.'),
 Document(metadata={'source': './Data/Central_Limit_Theorem.pdf', 'page': 3, '_id': '795db2b04ffb4905bbb5399d2c6a446f', '_collection_name': 'Statistics'}, page_content='Why Sampling Distribution is important?\nSampling distribution is important in statistics and machine learning because it allows us to'),
 Document(metadata={'source': './Data/Central_Limit_Theorem.pdf', 'page': 3, '_id': '51eef85e4e464ee2aac6ecd17b693971', '_collection_name': 'Statistics'}, page_content='sample statistic (such as the sample mean or sample proportion) computed from multiple \nindependent samples of the same size from a population.'),
 Document(metadata={'source': './Data/Central_Limit_Theorem.pdf', 'page': 3, '_id

In [35]:
retriver = vectore_db.as_retriever(
    search_type='mmr',
    search_kwargs= {'k': 2}
)

In [36]:
retriver.invoke('Find statistics related documents')


[Document(metadata={'source': './Data/Central_Limit_Theorem.pdf', 'page': 6, '_id': '83d11b7152db41a79a2ca1bb57f41c4a', '_collection_name': 'Statistics'}, page_content='avoid skewed results.\n1.\nCalculate the sample mean (average salary) and sample standard \ndeviation for each sample.\n2.'),
 Document(metadata={'source': './Data/Central_Limit_Theorem.pdf', 'page': 4, '_id': '8dff57da01264b7bb0b798037092fb55', '_collection_name': 'Statistics'}, page_content='The CLT is important in statistics and machine learning because it allows us to')]
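
The `mmr` search type stands for maximal marginal relevance: it picks results that are relevant to the query but not redundant with each other, which is why the retriever returns chunks from pages 6 and 4 instead of the near-identical page 3 chunks seen in the plain similarity search. The following is a simplified sketch of that idea, assuming cosine similarity and a trade-off weight named `lambda_mult`; the helper `mmr_select` and the toy 2-dimensional vectors are hypothetical, and the library's actual implementation may differ in detail.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query, candidates, k=2, lambda_mult=0.5):
    """Return indices of k candidates chosen by maximal marginal relevance."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in remaining:
            relevance = cosine(query, candidates[i])
            # Redundancy: similarity to the closest already-selected candidate.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


query = [1.0, 0.0]
candidates = [
    [1.0, 0.1],   # index 0: most relevant
    [1.0, 0.12],  # index 1: near-duplicate of index 0
    [1.0, -0.5],  # index 2: less relevant, but different from index 0
]
picked = mmr_select(query, candidates, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query, candidates, k=2, lambda_mult=1.0)
print(picked, relevance_only)
```

With `lambda_mult=1.0` the redundancy term vanishes and the selection falls back to plain relevance ranking, so the near-duplicate is returned; with `0.5` it is skipped in favour of the more diverse candidate.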