FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning.
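
To make "similarity search over dense vectors" concrete, here is a minimal brute-force sketch in plain Python. It is not how FAISS is implemented (FAISS uses optimized and approximate index structures); the function names and toy vectors are hypothetical and exist only for illustration.

```python
# Toy brute-force nearest-neighbour search over dense vectors.
# Hypothetical names and data; FAISS solves the same problem far more efficiently.

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k(query, vectors, k=2):
    """Return (index, distance) pairs for the k stored vectors closest to query."""
    scored = [(i, squared_l2(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]

vectors = [
    [0, 0],
    [1, 1],
    [5, 5],
]
nearest = top_k([1, 2], vectors, k=2)
print(nearest)
```

Every stored vector is compared against the query, so the cost grows linearly with the collection size; avoiding that cost on large collections is what index libraries such as FAISS are for.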

In [20]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
import os
from dotenv import load_dotenv

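load_dotenv()
hf_token = os.getenv("HF_TOKEN")
if hf_token:  # os.environ only accepts strings, so skip this when the token is unset
    os.environ["HF_TOKEN"] = hf_token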

loader = TextLoader('text.txt')
documents = loader.load()
# Splits on "\n\n" by default; chunk_size and chunk_overlap are measured in characters
text_splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=50)
docs = text_splitter.split_documents(documents)


Created a chunk of size 309, which is longer than the specified 300
Created a chunk of size 338, which is longer than the specified 300
Created a chunk of size 410, which is longer than the specified 300
Created a chunk of size 302, which is longer than the specified 300
Created a chunk of size 359, which is longer than the specified 300
Created a chunk of size 888, which is longer than the specified 300
Created a chunk of size 341, which is longer than the specified 300
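
The warnings above appear because CharacterTextSplitter only cuts at its separator ("\n\n" by default), so a single paragraph longer than chunk_size is kept whole. The sketch below is a simplified model of that behavior, not LangChain's actual implementation: it ignores chunk_overlap, and the function name and sample text are hypothetical.

```python
# Simplified model of separator-based splitting (hypothetical helper, no overlap handling).

def split_on_separator(text, chunk_size, separator="\n\n"):
    """Greedily merge separator-delimited pieces; oversized pieces stay whole."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits, keep merging
        else:
            if current:
                chunks.append(current)
            current = piece  # may exceed chunk_size on its own
    if current:
        chunks.append(current)
    return chunks

# Two short paragraphs followed by one long paragraph.
text = "a" * 10 + "\n\n" + "b" * 10 + "\n\n" + "c" * 40
chunks = split_on_separator(text, chunk_size=25)
sizes = [len(c) for c in chunks]
print(sizes)
```

The two short paragraphs merge into one chunk under the limit, while the 40-character paragraph is emitted as a single chunk above it. If strict size limits matter, a splitter that falls back to smaller separators (for example LangChain's RecursiveCharacterTextSplitter) is one option to consider.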


In [21]:
embeddings = HuggingFaceEmbeddings(model="all-MiniLM-L6-v2")
db = FAISS.from_documents(docs, embeddings)

'(ReadTimeoutError("HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10)"), '(Request ID: 96e0d75f-3895-43e7-9bf9-ee31809ccde2)')' thrown while requesting HEAD https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/resolve/main/./modules.json
Retrying in 1s [Retry 1/5].


In [23]:
query = "When did Datopic is established?"
# Use a separate name so the list of chunks in `docs` is not overwritten
results = db.similarity_search(query)

# print(results[0].page_content)
print(results)

[Document(id='762f9366-b71b-472c-8d67-64150e93ffa1', metadata={'source': 'text.txt'}, page_content='Datopic has been an early adopter of blockchain, working on supply chain transparency, digital identity, and smart contract implementation. The company provides blockchain consulting, DApp development, and system integration with traditional architectures to create trustless, decentralized ecosystems.'), Document(id='07f508ab-2031-4be6-baba-4738d439ab05', metadata={'source': 'text.txt'}, page_content='The company’s registered office is located at RZ-77A, Dabri Extension, Main Palam Road, New Delhi – 110045, India. Operationally, Datopic has also established a significant presence in Noida, Uttar Pradesh, serving as its primary technology and development hub.'), Document(id='b9c35d2a-655f-442e-a6b5-83281139126c', metadata={'source': 'text.txt'}, page_content='Datopic Technologies Pvt. Ltd. represents a promising new-age digital solutions company focused on innovation, data intelligence, a

Similarity search with score: there are some FAISS-specific methods. One of them is similarity_search_with_score, which returns not only the documents but also the distance score between the query and each of them. The returned score is an L2 (Euclidean) distance, not a Manhattan distance. Therefore, a lower score is better.
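
As a quick illustration of the metric, this sketch computes L2 (Euclidean) and Manhattan (L1) distance for the same pair of toy vectors, using only the standard library. The vectors and helper names are hypothetical. Note also that FAISS's exact L2 index (IndexFlatL2) reports squared distances, so the scores printed in this notebook are likely squared values; the ranking is the same either way.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """Manhattan (L1) distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

a = [0.0, 0.0]
b = [3.0, 4.0]

l2 = l2_distance(a, b)
l1 = manhattan_distance(a, b)
squared_l2 = l2 ** 2

print(l2, l1, squared_l2)
```

For these vectors the L2 distance is 5.0, the Manhattan distance is 7.0, and the squared L2 distance is 25.0, which shows that the two metrics are not interchangeable.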

In [24]:
docs_with_score = db.similarity_search_with_score(query)
print(docs_with_score)
# Each item is a (Document, score) tuple, so index into the tuple first:
# print(docs_with_score[0][0].page_content)

[(Document(id='762f9366-b71b-472c-8d67-64150e93ffa1', metadata={'source': 'text.txt'}, page_content='Datopic has been an early adopter of blockchain, working on supply chain transparency, digital identity, and smart contract implementation. The company provides blockchain consulting, DApp development, and system integration with traditional architectures to create trustless, decentralized ecosystems.'), np.float32(0.8544675)), (Document(id='07f508ab-2031-4be6-baba-4738d439ab05', metadata={'source': 'text.txt'}, page_content='The company’s registered office is located at RZ-77A, Dabri Extension, Main Palam Road, New Delhi – 110045, India. Operationally, Datopic has also established a significant presence in Noida, Uttar Pradesh, serving as its primary technology and development hub.'), np.float32(0.9200494)), (Document(id='b9c35d2a-655f-442e-a6b5-83281139126c', metadata={'source': 'text.txt'}, page_content='Datopic Technologies Pvt. Ltd. represents a promising new-age digital solutions 

In [25]:
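# Embed the query once, then search with the raw vector instead of the text
embedding_vector = embeddings.embed_query(query)
docs_by_vector = db.similarity_search_by_vector(embedding_vector)  # returns documents only, no scores
print(docs_by_vector)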

[Document(id='762f9366-b71b-472c-8d67-64150e93ffa1', metadata={'source': 'text.txt'}, page_content='Datopic has been an early adopter of blockchain, working on supply chain transparency, digital identity, and smart contract implementation. The company provides blockchain consulting, DApp development, and system integration with traditional architectures to create trustless, decentralized ecosystems.'), Document(id='07f508ab-2031-4be6-baba-4738d439ab05', metadata={'source': 'text.txt'}, page_content='The company’s registered office is located at RZ-77A, Dabri Extension, Main Palam Road, New Delhi – 110045, India. Operationally, Datopic has also established a significant presence in Noida, Uttar Pradesh, serving as its primary technology and development hub.'), Document(id='b9c35d2a-655f-442e-a6b5-83281139126c', metadata={'source': 'text.txt'}, page_content='Datopic Technologies Pvt. Ltd. represents a promising new-age digital solutions company focused on innovation, data intelligence, a

In [27]:
# saving and loading
db.save_local("faiss_index")

In [29]:
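# The saved index includes a pickled docstore; only enable this flag for files you created or trust
new_db = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)

In [33]:
result = new_db.similarity_search("When was Datopic founded?")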
print(result[0].page_content)

Datopic has been an early adopter of blockchain, working on supply chain transparency, digital identity, and smart contract implementation. The company provides blockchain consulting, DApp development, and system integration with traditional architectures to create trustless, decentralized ecosystems.


In [34]:
# retriever option

retriever = db.as_retriever()
retriever.invoke(query)[0].page_content

'Datopic has been an early adopter of blockchain, working on supply chain transparency, digital identity, and smart contract implementation. The company provides blockchain consulting, DApp development, and system integration with traditional architectures to create trustless, decentralized ecosystems.'