# VectorStoreDB

* A vector store database stores high-dimensional vector embeddings for tasks like similarity search and clustering.
  
* It uses indexing techniques for fast retrieval of similar vectors based on proximity in the vector space.

* A few popular vector stores are FAISS, ChromaDB, and Astra DB.
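As a minimal sketch of what "similarity search" means, the snippet below ranks a handful of toy vectors by their Euclidean distance to a query vector. The three-dimensional vectors and the `nearest` helper are made up for illustration. This is a brute-force scan over every stored vector, which is exactly the cost that the indexing techniques in a real vector store try to avoid.

```python
import math

# Toy 3-dimensional "embeddings" (hypothetical values); real embeddings
# usually have hundreds or thousands of dimensions.
vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

def nearest(query_vector, store, k=2):
    """Return the k stored keys closest to query_vector by Euclidean distance."""
    ranked = sorted(store, key=lambda name: math.dist(query_vector, store[name]))
    return ranked[:k]

result = nearest([1.0, 0.0, 0.0], vectors)
print(result)  # ['cat', 'dog']
```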

## Functionalities: 

#### Similarity Search with Score

* It returns not only the relevant documents but also the distance between the query and each of them.

* The returned score is the L2 (Euclidean) distance, so a lower score means a closer match.
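To make the metric concrete, the sketch below computes the Euclidean (L2) distance for a toy pair of vectors, with the Manhattan (L1) distance alongside for contrast. The two vectors are made-up values. The squared L2 distance is included because some index types report it instead of the plain L2 distance; the ranking is the same either way, since squaring preserves the order of non-negative numbers.

```python
import math

query = [0.0, 0.0]
doc = [3.0, 4.0]

l2 = math.dist(query, doc)                                   # Euclidean (L2): sqrt(3^2 + 4^2)
l1 = sum(abs(q - d) for q, d in zip(query, doc))             # Manhattan (L1): |3| + |4|
squared_l2 = sum((q - d) ** 2 for q, d in zip(query, doc))   # L2 without the square root

print(l2, l1, squared_l2)  # 5.0 7.0 25.0
```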

--------------------------------------------------------------------------------------------------------

## 1. FAISS (Facebook AI Similarity Search):

* It is a library for efficient similarity search and clustering of dense vectors.

* It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM.

* It also contains supporting code for evaluation and parameter tuning.
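One family of algorithms of this kind is the inverted-file (IVF) index: vectors are grouped into buckets around centroids, and a query scans only the nearest bucket instead of the whole collection. The sketch below is a toy illustration of that idea in plain Python, not FAISS's implementation. The centroids are fixed by hand (an assumption; a real index would learn them, typically with k-means), and the result is approximate because the true nearest neighbour can sit in a bucket that is never scanned.

```python
import math

# Hypothetical centroids, chosen by hand instead of being learned.
centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = [[0.5, 0.2], [1.0, 1.5], [9.0, 9.5], [10.5, 10.0], [0.1, 0.9]]

def nearest_index(point, candidates):
    """Index of the candidate closest to point (Euclidean distance)."""
    return min(range(len(candidates)), key=lambda i: math.dist(point, candidates[i]))

# Build step: assign every vector id to the bucket of its nearest centroid.
buckets = {i: [] for i in range(len(centroids))}
for vec_id, vec in enumerate(vectors):
    buckets[nearest_index(vec, centroids)].append(vec_id)

def search(query, k=1):
    """Scan only the bucket whose centroid is nearest to the query."""
    bucket = buckets[nearest_index(query, centroids)]
    return sorted(bucket, key=lambda vec_id: math.dist(query, vectors[vec_id]))[:k]

print(buckets)              # {0: [0, 1, 4], 1: [2, 3]}
print(search([9.2, 9.4]))   # [2]
```

Here the query is compared against two vectors instead of five; with millions of vectors and many buckets, that reduction is what makes the search fast.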

#### Prerequisite

Install the `faiss-cpu` package, for example with `pip install faiss-cpu`.

#### Implementation

In [3]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [4]:
# Importing necessary Libraries
from langchain_community.document_loaders import TextLoader # To load a txt file
from langchain.text_splitter import RecursiveCharacterTextSplitter # To create chunks
from langchain_community.embeddings import OllamaEmbeddings # embedding from Ollama
from langchain_community.vectorstores import FAISS # store in FAISS Database

# Data Ingestion
loader = TextLoader("Data/sample_text.txt", encoding="utf-8")
documents = loader.load()

# Split the text into chunks using RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=20)
docs = text_splitter.split_documents(documents)

# Create the embedding model with OllamaEmbeddings (at the time of writing, it uses llama2 by default)
embeddings = OllamaEmbeddings()

# Embed the chunks and store the vectors in a FAISS vector store
db = FAISS.from_documents(docs, embeddings)

# Querying
query = "who created earth?"

# Run a similarity search to retrieve the most relevant chunks from the FAISS vector store
docs = db.similarity_search(query)

# Convert the vector store into a retriever: a common interface that takes a query
# and returns the matching documents, with more search options than a plain similarity search
retriever = db.as_retriever()
docs1 = retriever.invoke(query)

# Note: in this case, the similarity search and the retriever return the same documents

# Get the similarity scores as well, which is one of the features of the FAISS vector store
docs_and_score = db.similarity_search_with_score(query)
print(docs_and_score)

[(Document(id='2453b437-821e-4ff3-a7bf-67e0b0694772', metadata={'source': 'Data/sample_text.txt'}, page_content='the Spirit of God was hovering over the waters'), 15665.869), (Document(id='e15efb4a-a491-41c8-9132-040a4bc5d641', metadata={'source': 'Data/sample_text.txt'}, page_content='1. In the beginning God created the heavens and'), 16669.348), (Document(id='75ea1ddd-bb56-44df-9a9d-e0f85b7f8a37', metadata={'source': 'Data/sample_text.txt'}, page_content='and empty, darkness was over the surface of the'), 22797.424), (Document(id='af38e18c-a18a-4572-94d8-338079d88866', metadata={'source': 'Data/sample_text.txt'}, page_content='3. And God said, “Let there be light,” and there'), 23770.254)]
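The short, partly repeated chunks in the output come from the `chunk_size=50` and `chunk_overlap=20` settings passed to the splitter. The sketch below illustrates how overlap works with a simplified fixed-width splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as newlines and spaces), and the `sliding_chunks` helper is a hypothetical name introduced only for this example.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-width windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
print(chunks)  # ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk starts with the last four characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk when the overlap is large enough.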


##### Instead of passing the query text, we can embed the query ourselves and search by vector.

In [5]:
# Querying
query = "who created earth?"

# Convert the query into a vector using the Ollama embeddings
embedding_vector = embeddings.embed_query(query)

# Find the similarity scores by searching with the vector instead of the query text
docs_and_score_2 = db.similarity_search_with_score_by_vector(embedding_vector)
print(docs_and_score_2)

[(Document(id='2453b437-821e-4ff3-a7bf-67e0b0694772', metadata={'source': 'Data/sample_text.txt'}, page_content='the Spirit of God was hovering over the waters'), 15665.869), (Document(id='e15efb4a-a491-41c8-9132-040a4bc5d641', metadata={'source': 'Data/sample_text.txt'}, page_content='1. In the beginning God created the heavens and'), 16669.348), (Document(id='75ea1ddd-bb56-44df-9a9d-e0f85b7f8a37', metadata={'source': 'Data/sample_text.txt'}, page_content='and empty, darkness was over the surface of the'), 22797.424), (Document(id='af38e18c-a18a-4572-94d8-338079d88866', metadata={'source': 'Data/sample_text.txt'}, page_content='3. And God said, “Let there be light,” and there'), 23770.254)]


#### Saving and Loading the FAISS VectorDB

In [6]:
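# Save the vector store to a local folder
db.save_local("faiss_index")

This creates a `faiss_index` folder containing two files: `index.faiss` (the vector index) and `index.pkl` (the pickled document store and ID mapping).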

In [9]:
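# Load the vector store saved in the local folder.
# allow_dangerous_deserialization=True is required because part of the index is a pickle file;
# only enable it for files you created yourself.
new_db = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)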
docs = new_db.similarity_search_with_score(query)
docs

[(Document(id='2453b437-821e-4ff3-a7bf-67e0b0694772', metadata={'source': 'Data/sample_text.txt'}, page_content='the Spirit of God was hovering over the waters'),
  15665.869),
 (Document(id='e15efb4a-a491-41c8-9132-040a4bc5d641', metadata={'source': 'Data/sample_text.txt'}, page_content='1. In the beginning God created the heavens and'),
  16669.348),
 (Document(id='75ea1ddd-bb56-44df-9a9d-e0f85b7f8a37', metadata={'source': 'Data/sample_text.txt'}, page_content='and empty, darkness was over the surface of the'),
  22797.424),
 (Document(id='af38e18c-a18a-4572-94d8-338079d88866', metadata={'source': 'Data/sample_text.txt'}, page_content='3. And God said, “Let there be light,” and there'),
  23770.254)]
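The `allow_dangerous_deserialization=True` flag is there because part of the saved store is a pickle file, and unpickling a file from an untrusted source can execute arbitrary code. The sketch below shows the same save-and-load round trip with Python's standard `pickle` module. The `docstore` dictionary is a hypothetical stand-in for the document data a vector store keeps next to its index, not the actual structure LangChain writes.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for the documents stored alongside a vector index.
docstore = {"doc-1": "In the beginning", "doc-2": "Let there be light"}

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "index.pkl"
    path.write_bytes(pickle.dumps(docstore))      # save to disk
    restored = pickle.loads(path.read_bytes())    # load: only safe for files you created

print(restored == docstore)  # True
```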

------------------------------------------------------------------------------------------------------

## 2. ChromaDB

* ChromaDB is an open-source vector database designed for storing and retrieving embeddings efficiently, often used in AI applications like semantic search and recommendation systems. 

* It helps in managing high-dimensional data for tasks like similarity search and retrieval-augmented generation (RAG).

* Chroma is licensed under Apache 2.0.

#### Prerequisite

Install the `langchain-chroma` package, for example with `pip install langchain-chroma`.

#### Implementation

In [11]:
from langchain_chroma import Chroma #importing Chromadb to store vectors
from langchain_community.document_loaders import TextLoader # To load a txt file
from langchain.text_splitter import RecursiveCharacterTextSplitter # To create chunks
from langchain_community.embeddings import OllamaEmbeddings # embedding from Ollama

# Data Ingestion
loader = TextLoader("Data/sample_text.txt", encoding="utf-8")
documents = loader.load()

# Split the text into chunks using RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=20)
docs = text_splitter.split_documents(documents)

# Create the embedding model with OllamaEmbeddings (at the time of writing, it uses llama2 by default)
embeddings = OllamaEmbeddings()

# Embed the chunks and store the vectors in a Chroma vector store
db = Chroma.from_documents(docs, embeddings)

# Querying
query = "who created earth?"

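# Run a similarity search to retrieve the most relevant chunks from the Chroma vector store.
# The results go into a separate variable so that `docs` still holds all the chunks.
results = db.similarity_search(query)
print(results[0].page_content)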

the Spirit of God was hovering over the waters


#### Saving and Loading the Chroma VectorDB

In [14]:
# Create a Chroma vector store that is persisted to a local directory
chroma_vectordb = Chroma.from_documents(docs, embeddings, persist_directory="./chroma_db")

In [18]:
# Load the Chroma vector store that was persisted to the local directory
db2 = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)

# Similarity search from the vectordb
docs = db2.similarity_search_with_score(query)
docs

[(Document(id='f5302f96-790e-4342-945d-b536ee81ead4', metadata={'source': 'Data/sample_text.txt'}, page_content='the Spirit of God was hovering over the waters'),
  15665.86765243212),
 (Document(id='0bd52894-bf1e-4076-be82-f7386e14ef07', metadata={'source': 'Data/sample_text.txt'}, page_content='1. In the beginning God created the heavens and'),
  16669.3484730196),
 (Document(id='068613f8-b466-4236-a110-8f9677cfb137', metadata={'source': 'Data/sample_text.txt'}, page_content='and empty, darkness was over the surface of the'),
  22797.422525673806),
 (Document(id='40bdde05-04b6-40b9-8a35-60444b31e697', metadata={'source': 'Data/sample_text.txt'}, page_content='3. And God said, “Let there be light,” and there'),
  23770.25397816902)]

#### Retriever Option

In [19]:
# Convert the vector store into a retriever: a common interface that takes a query
# and returns the matching documents, with more search options than a plain similarity search
retriever = db.as_retriever()
docs2 = retriever.invoke(query)
docs2

[Document(id='f5302f96-790e-4342-945d-b536ee81ead4', metadata={'source': 'Data/sample_text.txt'}, page_content='the Spirit of God was hovering over the waters'),
 Document(id='0bd52894-bf1e-4076-be82-f7386e14ef07', metadata={'source': 'Data/sample_text.txt'}, page_content='1. In the beginning God created the heavens and'),
 Document(id='068613f8-b466-4236-a110-8f9677cfb137', metadata={'source': 'Data/sample_text.txt'}, page_content='and empty, darkness was over the surface of the'),
 Document(id='40bdde05-04b6-40b9-8a35-60444b31e697', metadata={'source': 'Data/sample_text.txt'}, page_content='3. And God said, “Let there be light,” and there')]
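Compared with `similarity_search_with_score`, the retriever output above contains only the documents, without scores. The sketch below illustrates that relationship with two toy classes: a store that returns scored results, and a retriever that wraps it with fixed search settings and exposes a single `invoke` method. `ToyStore` and `ToyRetriever` are hypothetical names, not LangChain classes, and for simplicity the query is passed as a vector rather than as text.

```python
import math

class ToyStore:
    """Hypothetical in-memory store holding (text, vector) pairs."""

    def __init__(self, items):
        self.items = items

    def similarity_search_with_score(self, query_vector, k=4):
        """Return the k closest (text, distance) pairs, closest first."""
        scored = [(text, math.dist(query_vector, vec)) for text, vec in self.items]
        return sorted(scored, key=lambda pair: pair[1])[:k]

class ToyRetriever:
    """Wraps a store with fixed search settings and returns documents only."""

    def __init__(self, store, k):
        self.store = store
        self.k = k

    def invoke(self, query_vector):
        hits = self.store.similarity_search_with_score(query_vector, self.k)
        return [text for text, _score in hits]

store = ToyStore([("alpha", [0.0, 0.0]), ("beta", [1.0, 1.0]), ("gamma", [5.0, 5.0])])
retriever = ToyRetriever(store, k=2)
top = retriever.invoke([0.9, 0.9])
print(top)  # ['beta', 'alpha']
```

Because the search settings live inside the retriever, code that consumes it only needs to call `invoke`, regardless of which vector store sits behind it.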

------------------------------------------------------------------------------------------------------