# **Vector Stores and Retrievers**

<image src="Images\langchain_rag.jpg">

### Overview

A vector database (VectorDB) stores unstructured data, such as text, as embedding vectors. At query time, it embeds the query with the same model and returns the stored vectors most similar to it. Comparing vectors rather than raw text is what makes similarity search efficient.

**Key Features:**  
- Stores high-dimensional vectors with associated text.  
- Supports efficient similarity searches (for example, by cosine similarity or Euclidean distance).  
- Allows easy addition, updating, and deletion of vectors.  

Popular VectorDB options include **Chroma** and **FAISS**.
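
As a rough illustration of the idea described in the overview, the sketch below ranks a few stored vectors by cosine similarity to a query vector. The three-dimensional vectors, the `toy_store` list, and the `top_k` helper are made up for this example; a real vector database uses model-generated embeddings and an index rather than a linear scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical 3-dimensional "embeddings", chosen by hand for illustration.
toy_store = [
    ("cats purr", [0.9, 0.1, 0.0]),
    ("dogs bark", [0.8, 0.3, 0.1]),
    ("stock prices fell", [0.0, 0.2, 0.9]),
]

results = top_k([1.0, 0.0, 0.0], toy_store, k=2)
print(results)
```

The linear scan is only practical for tiny collections; the steps below hand this work to ChromaDB.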

### **Steps**
> 1. Initialize an Embedding Model
> 2. Set Up a Connection to ChromaDB
> 3. Load a document
> 4. Split the document into chunks
> 5. Add Chunks to ChromaDB
> 6. Apply Similarity Search

### **Step 1: Initialize an Embedding Model**

In [6]:
# !pip install langchain-huggingface

In [2]:
from langchain_huggingface import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

### **Step 2: Set Up a Connection to ChromaDB**

In [7]:
# !pip install langchain-chroma

In [3]:
from langchain_chroma import Chroma

db = Chroma(collection_name="vector_database",
            embedding_function=embedding_model,
            persist_directory='./chroma_db')

In [4]:
db.get()

{'ids': [],
 'embeddings': None,
 'metadatas': [],
 'documents': [],
 'uris': None,
 'data': None}

`Note:` Initially, the database is empty.

### **Step 3: Load a document**

In [10]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import TextLoader

loader = DirectoryLoader(path="example_data/subtitles", glob="*.srt", show_progress=True, loader_cls=TextLoader)

data = loader.load()

100%|██████████| 10/10 [00:00<00:00, 69.69it/s]


### **Step 4: Split the document into chunks**

In [11]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=300)

chunks = text_splitter.split_documents(data)

In [21]:
print(len(chunks))
print()
print(type(chunks[0]))
print()
print(chunks[0].page_content[:75])

1004

<class 'langchain_core.documents.base.Document'>

1
00:00:01,435 --> 00:00:04,082
This is pretty much
what's happened so far.
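
To give a feel for what recursive splitting does, here is a simplified sketch. It is not the `RecursiveCharacterTextSplitter` implementation: it measures length in characters rather than tiktoken tokens, ignores chunk overlap, and the `recursive_split` function and the second subtitle in `sample` are made up for illustration. The idea it shows is the same: try coarse separators first and fall back to finer ones only when a piece is still too long.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split text into pieces of at most chunk_size characters,
    preferring the coarsest separator that works."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging neighbouring parts
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # The part alone is too long: retry with finer separators.
            chunks.extend(recursive_split(part, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks

# The first subtitle is from the output above; the second is hypothetical.
sample = (
    "1\n00:00:01,435 --> 00:00:04,082\nThis is pretty much\nwhat's happened so far.\n\n"
    "2\n00:00:05,000 --> 00:00:07,000\nA hypothetical second subtitle."
)

pieces = recursive_split(sample, chunk_size=80)
for piece in pieces:
    print(len(piece), repr(piece[:30]))
```

With subtitle files, splitting on blank lines first tends to keep each numbered subtitle block intact inside a chunk.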


### **Step 5: Add Chunks to ChromaDB**

In [None]:
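# Uncomment and run once to embed and store the chunks.
# Running it again adds the same chunks a second time.
# db.add_documents(chunks)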

In [24]:
# db.get()
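
Because the collection is persisted in `./chroma_db`, calling `add_documents` again on a later run stores the same chunks a second time under new IDs. One possible way around this, sketched below, is to derive a stable ID for each chunk and pass the IDs when adding. The `chunk_id` helper and the `toy_chunks` records are hypothetical; the final commented call assumes that `add_documents` accepts an `ids` argument and that entries with an existing ID are overwritten rather than duplicated, which is worth checking against the installed `langchain-chroma` version.

```python
import hashlib

def chunk_id(source, index, content):
    """Derive a stable ID from a chunk's source path, position, and text."""
    raw = f"{source}\n{index}\n{content}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:32]

# Stand-in records; with real chunks, use doc.metadata["source"]
# and doc.page_content instead.
toy_chunks = [
    {"source": "a.srt", "text": "Rachel first."},
    {"source": "a.srt", "text": "I don't know."},
]

ids = [chunk_id(c["source"], i, c["text"]) for i, c in enumerate(toy_chunks)]
print(ids)

# Assumed usage (not run here): db.add_documents(chunks, ids=ids)
```

Including the chunk's position in the hash keeps IDs distinct even if two chunks from the same file have identical text.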

### **Step 6: Similarity Search**

In [39]:
query = "What is their on Julie vs Rachels List?"

relevant_chunks = db.similarity_search(query=query, k=5)

metadatas = [doc.metadata for doc in relevant_chunks]

metadatas

[{'source': 'example_data\\subtitles\\Friends_2x08.srt'},
 {'source': 'example_data\\subtitles\\Friends_2x08.srt'},
 {'source': 'example_data\\subtitles\\Friends_2x08.srt'},
 {'source': 'example_data\\subtitles\\Friends_2x08.srt'},
 {'source': 'example_data\\subtitles\\Friends_2x08.srt'}]

In [37]:
print("Type of output:", type(relevant_chunks))
print()
print("Type of each item in output:", type(relevant_chunks[0]))
print()
print("Number of output docs:", len(relevant_chunks))
print()
print("PRINTING THE DOCUMENT:\n", relevant_chunks[0].page_content)

Type of output: <class 'list'>

Type of each item in output: <class 'langchain_core.documents.base.Document'>

Number of output docs: 5

PRINTING THE DOCUMENT:
 145
00:08:42,594 --> 00:08:44,425
No, Amish boy.

146
00:08:46,398 --> 00:08:50,061
Let's start with the cons
because they're more fun.

147
00:08:50,335 --> 00:08:51,165
Rachel first.

148
00:08:52,171 --> 00:08:53,331
I don't know.

149
00:08:53,839 --> 00:08:55,067
I mean....

150
00:08:55,274 --> 00:08:59,802
All right, I guess you can say
she's a little spoiled sometimes.

151
00:09:00,245 --> 00:09:01,940
You could say that.

152
00:09:03,816 --> 00:09:07,775
I guess, sometimes
she's a little ditzy, you know?

153
00:09:08,153 --> 00:09:11,088
And I've seen her be a little
too into her looks.

154
00:09:11,757 --> 00:09:13,816
And Julie and I have
a lot in common...


### **Similarity Search With Score**

In [52]:
query = "What is their on Julie vs Rachels List?"

relevant_chunks = db.similarity_search_with_score(query=query, k=5)

print(len(relevant_chunks))

5


In [53]:
# Distance scores (lower means more similar)

[doc[1] for doc in relevant_chunks]

[0.6429129838943481,
 0.6612476110458374,
 0.798521101474762,
 0.8262538313865662,
 0.8489298820495605]
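
The scores above increase from the best match to the worst, which is what one would expect from a distance rather than a similarity. As a hedged worked example, the sketch below converts them to cosine similarities under two assumptions that are not confirmed by this notebook: that the collection uses a squared Euclidean (L2) distance, and that the embeddings are unit-normalized. Under those assumptions, `||a - b||^2 = 2 - 2*cos(a, b)`. The helper name `squared_l2_to_cosine` is made up for this example.

```python
def squared_l2_to_cosine(distance):
    """For unit-length vectors, ||a - b||^2 = 2 - 2*cos(a, b),
    so cos(a, b) = 1 - distance / 2."""
    return 1.0 - distance / 2.0

# Scores copied from the output above.
distances = [
    0.6429129838943481,
    0.6612476110458374,
    0.798521101474762,
    0.8262538313865662,
    0.8489298820495605,
]

cosines = [round(squared_l2_to_cosine(d), 4) for d in distances]
print(cosines)
```

If the collection were configured with a different distance metric, this conversion would not apply.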

### **Similarity Search By Vector**

In [55]:
query = "What is their on Julie vs Rachels List?"

vector_query = embedding_model.embed_query(query)

relevant_chunks = db.similarity_search_by_vector(embedding=vector_query, k=5)

print(len(relevant_chunks))

5


In [56]:
[doc.metadata for doc in relevant_chunks]

[{'source': 'example_data\\subtitles\\Friends_2x08.srt'},
 {'source': 'example_data\\subtitles\\Friends_2x08.srt'},
 {'source': 'example_data\\subtitles\\Friends_2x08.srt'},
 {'source': 'example_data\\subtitles\\Friends_2x08.srt'},
 {'source': 'example_data\\subtitles\\Friends_2x08.srt'}]