A vector store holds the embedding vectors that an embedding model produces from the input text, so that they can later be searched by similarity. Common options include FAISS, Pinecone, ChromaDB and Astra DB.

Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning.
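
To make "similarity search" concrete, here is a minimal pure-Python sketch of an exhaustive nearest-neighbour search over a handful of toy vectors. It only illustrates the idea of ranking stored vectors by distance to a query; it is not how Faiss is implemented, and the function names and toy data are made up for this example.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_search(vectors, query, k=2):
    """Return the k (index, distance) pairs closest to the query, nearest first."""
    scored = [(i, squared_l2(v, query)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy 2-dimensional "embeddings"; real embeddings have far more dimensions.
vectors = [[0.0, 0.0], [2.0, 2.0], [5.0, 5.0], [1.0, 0.0]]
nearest = brute_force_search(vectors, query=[1.0, 0.5], k=2)
print(nearest)
```

Comparing the query against every stored vector is fine for a few chunks, but the cost grows with the size of the collection, which is the problem a dedicated library is meant to address.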

Importing Required Libraries

In [1]:
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import OllamaEmbeddings
from langchain_text_splitters import CharacterTextSplitter

Data Ingestion

In [2]:
loader = TextLoader('Sample-Transformers.txt')
doc = loader.load()

Text Splitting into Chunks

In [15]:
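# CharacterTextSplitter splits only on its separator (default "\n\n"), so a
# paragraph longer than chunk_size is kept whole and triggers the warnings below.
splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=10)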
docs = splitter.split_documents(doc)

Created a chunk of size 369, which is longer than the specified 200
Created a chunk of size 335, which is longer than the specified 200
Created a chunk of size 312, which is longer than the specified 200
Created a chunk of size 318, which is longer than the specified 200
Created a chunk of size 698, which is longer than the specified 200
Created a chunk of size 260, which is longer than the specified 200
Created a chunk of size 265, which is longer than the specified 200
Created a chunk of size 364, which is longer than the specified 200
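
The warnings above appear because the splitter cuts only at its separator and then merges neighbouring pieces up to the size limit. The sketch below is a simplified stand-in for that behaviour, not the library's code: it assumes a blank-line separator, ignores chunk_overlap, and uses a hypothetical function name.

```python
def split_on_separator(text, separator="\n\n", chunk_size=200):
    """Split on the separator, then greedily merge neighbouring pieces up to chunk_size."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece longer than chunk_size is kept whole: there is
            # no separator inside it to split on.
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Four "paragraphs" of 50, 300, 40 and 30 characters.
sample = "\n\n".join(["a" * 50, "b" * 300, "c" * 40, "d" * 30])
chunks = split_on_separator(sample, chunk_size=200)
print([len(c) for c in chunks])
```

The 300-character paragraph survives as one oversized chunk, while the two short paragraphs are merged. If chunks must stay under the limit, a splitter that falls back to finer separators, such as RecursiveCharacterTextSplitter, is the usual alternative.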


Creating Embeddings of the Text Chunks

In [16]:
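embeddings = OllamaEmbeddings(model='gemma2')
# embed_documents expects a list of strings, so pass each chunk's text
# rather than the Document objects themselves.
doc_embeddings = embeddings.embed_documents([d.page_content for d in docs])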

Creating the Vector DB from the Chunks

In [17]:
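# from_documents embeds the chunks itself using the given embedding model,
# so the precomputed doc_embeddings list is not needed here.
db = FAISS.from_documents(docs, embeddings)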

In [18]:
db.similarity_search(query='what is a transformer')

[Document(metadata={'source': 'Sample-Transformers.txt'}, page_content='### Key Components of the Transformer'),
 Document(metadata={'source': 'Sample-Transformers.txt'}, page_content='By enabling more efficient parallelization and capturing complex dependencies within text, Transformers have set new benchmarks and opened up new possibilities for language understanding and generation.'),
 Document(metadata={'source': 'Sample-Transformers.txt'}, page_content="4. **Encoder-Decoder Structure**:\n   The Transformer consists of two main parts: the encoder and the decoder. \n   - **Encoder**: The encoder is composed of a stack of identical layers, each containing two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. The encoder processes the input sequence and generates a continuous representation.\n   - **Decoder**: The decoder is also composed of a stack of identical layers, but each layer has an additional sub-layer to perform mult

In [19]:
db.similarity_search_with_score(query='What is a transformer')

[(Document(metadata={'source': 'Sample-Transformers.txt'}, page_content='### Key Components of the Transformer'),
  8133.539),
 (Document(metadata={'source': 'Sample-Transformers.txt'}, page_content='By enabling more efficient parallelization and capturing complex dependencies within text, Transformers have set new benchmarks and opened up new possibilities for language understanding and generation.'),
  10672.725),
 (Document(metadata={'source': 'Sample-Transformers.txt'}, page_content="4. **Encoder-Decoder Structure**:\n   The Transformer consists of two main parts: the encoder and the decoder. \n   - **Encoder**: The encoder is composed of a stack of identical layers, each containing two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. The encoder processes the input sequence and generates a continuous representation.\n   - **Decoder**: The decoder is also composed of a stack of identical layers, but each layer has an additi
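
A note on reading these scores: assuming the store uses its default flat L2 index, each score is a squared Euclidean distance, so a lower score means a closer match. Values in the thousands suggest the embedding vectors are not unit length. The sketch below uses made-up 2-dimensional vectors to show how vector length inflates the distance, and that after normalization the squared distance equals 2 - 2 * cosine similarity. The helper names are illustrative only.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.hypot(*v)
    return [x / norm for x in v]


# Two toy vectors, each of length 50.
a, b = [30.0, 40.0], [50.0, 0.0]

raw_distance = squared_l2(a, b)                         # large, scale-dependent
unit_distance = squared_l2(normalize(a), normalize(b))  # bounded between 0 and 4
cos = cosine_similarity(a, b)

print(raw_distance, unit_distance, 2 - 2 * cos)
```

Because the raw distance depends on the scale of the embeddings, scores are best compared within one query's results rather than against a fixed threshold.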

Saving and Loading FAISS DB

In [11]:
db.save_local("FAISS_DB")
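
As far as I know, save_local writes two files into the target folder, index.faiss (the vector index) and index.pkl (the pickled documents and id mapping), with "index" being the default index name. Under that assumption, a small check like the following can confirm the folder is complete before loading it. The helper is hypothetical and not part of LangChain.

```python
from pathlib import Path


def missing_index_files(folder, index_name="index"):
    """Return the expected FAISS store files that are absent from the folder."""
    # Assumption: the store consists of <index_name>.faiss and <index_name>.pkl.
    expected = [f"{index_name}.faiss", f"{index_name}.pkl"]
    return [name for name in expected if not (Path(folder) / name).is_file()]


# An empty list means both files were found.
print(missing_index_files("FAISS_DB"))
```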

In [13]:
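# load_local unpickles part of the saved store, so enable this flag only for
# index folders you created yourself or otherwise trust.
new_db = FAISS.load_local('FAISS_DB', embeddings, allow_dangerous_deserialization=True)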

new_docs = new_db.similarity_search(query='What is a transformer')

In [14]:
new_docs

[Document(metadata={'source': 'Sample-Transformers.txt'}, page_content='### Key Components of the Transformer'),
 Document(metadata={'source': 'Sample-Transformers.txt'}, page_content='By enabling more efficient parallelization and capturing complex dependencies within text, Transformers have set new benchmarks and opened up new possibilities for language understanding and generation.'),
 Document(metadata={'source': 'Sample-Transformers.txt'}, page_content="4. **Encoder-Decoder Structure**:\n   The Transformer consists of two main parts: the encoder and the decoder. \n   - **Encoder**: The encoder is composed of a stack of identical layers, each containing two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. The encoder processes the input sequence and generates a continuous representation.\n   - **Decoder**: The decoder is also composed of a stack of identical layers, but each layer has an additional sub-layer to perform mult

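The same from_documents and similarity_search calls also work with ChromaDB. Only saving and loading differ: Chroma persists to a directory instead of using save_local and load_local.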
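
A rough sketch of the ChromaDB counterpart follows, assuming the Chroma class from langchain_community.vectorstores and an installed chromadb package. The function names and the "CHROMA_DB" directory are placeholders chosen for this example, and the imports are kept inside the functions so the definitions load even where chromadb is missing.

```python
def build_chroma_store(docs, embeddings, persist_directory="CHROMA_DB"):
    """Embed the chunks and store them in a Chroma collection persisted on disk."""
    # Assumption: langchain_community and chromadb are installed.
    from langchain_community.vectorstores import Chroma

    return Chroma.from_documents(docs, embeddings, persist_directory=persist_directory)


def load_chroma_store(embeddings, persist_directory="CHROMA_DB"):
    """Reopen a previously persisted Chroma collection with the same embedding model."""
    from langchain_community.vectorstores import Chroma

    return Chroma(persist_directory=persist_directory, embedding_function=embeddings)
```

With the docs and embeddings objects from the cells above, build_chroma_store(docs, embeddings).similarity_search(query='What is a transformer') would mirror the FAISS query.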