### Faiss
Facebook AI Similarity Search (Faiss) is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning.

In [1]:
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import OllamaEmbeddings
from langchain_text_splitters import CharacterTextSplitter

# Load the speech and split it on newlines into small chunks
loader = TextLoader("./data/speech.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(separator="\n", chunk_size=50, chunk_overlap=20)
docs = text_splitter.split_documents(documents)

Created a chunk of size 70, which is longer than the specified 50
Created a chunk of size 412, which is longer than the specified 50
Created a chunk of size 168, which is longer than the specified 50
Created a chunk of size 118, which is longer than the specified 50
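
The warnings above appear because CharacterTextSplitter only cuts at the separator. A single piece between two newlines that is longer than chunk_size cannot be cut any further, so it is kept whole. The following is a simplified sketch of that split-then-merge idea in plain Python. The `split_and_merge` helper is hypothetical, is not LangChain's implementation, and ignores chunk_overlap.

```python
def split_and_merge(text, separator="\n", chunk_size=50):
    """Split on `separator`, then greedily pack pieces into chunks.

    Returns (chunks, oversized), where `oversized` lists the lengths of
    pieces that are longer than `chunk_size` and had to be kept whole.
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, oversized, current = [], [], ""
    for piece in pieces:
        if len(piece) > chunk_size:
            oversized.append(len(piece))  # cannot be cut: no separator inside
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            # Adding this piece would overflow, so close the current chunk
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks, oversized


sample = "Hello!\n" + "x" * 70 + "\nBye."
chunks, oversized = split_and_merge(sample)
print(len(chunks), oversized)
```

Choosing a larger chunk_size, or a splitter that can fall back to finer separators, would avoid the oversized chunks.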


In [2]:
docs

[Document(metadata={'source': './data/speech.txt'}, page_content='Hello!'),
 Document(metadata={'source': './data/speech.txt'}, page_content='My name is Eniola, and I am excited to be your tutor for this program.'),
 Document(metadata={'source': './data/speech.txt'}, page_content='I am a Data Scientist, Open-source Contributor, and Frontend Engineer with years of experience in the tech industry. Throughout my career, I’ve been privileged to build innovative products, solve real-world problems, and empower over 5,000 individuals to grow into better versions of themselves. My work and insights have also connected me with an audience of over 10,000 followers across social media platforms.'),
 Document(metadata={'source': './data/speech.txt'}, page_content='Looking ahead, my vision is to create groundbreaking projects and support millions—if not billions—of people in achieving their goals and thriving in the tech industry.'),
 Document(metadata={'source': './data/speech.txt'}, page_content

In [4]:
embeddings = OllamaEmbeddings(model="gemma2:2b")  # the default model is llama2
database = FAISS.from_documents(docs, embeddings)

In [5]:
database

<langchain_community.vectorstores.faiss.FAISS at 0x127403c10>

In [6]:
query = "WHat's the speaker future ambition?"
retrieved_result = database.similarity_search(query)
print(retrieved_result)

[Document(id='483d7af3-c715-4787-a3cf-0f601eb16647', metadata={'source': './data/speech.txt'}, page_content='Looking ahead, my vision is to create groundbreaking projects and support millions—if not billions—of people in achieving their goals and thriving in the tech industry.'), Document(id='0c9c3689-0912-4e06-8684-6b2ea809a018', metadata={'source': './data/speech.txt'}, page_content='In this program, I am here to guide you every step of the way, ensuring you have an unforgettable learning experience.'), Document(id='2a6f56d5-f064-4155-a53d-19e9010d805d', metadata={'source': './data/speech.txt'}, page_content='I am a Data Scientist, Open-source Contributor, and Frontend Engineer with years of experience in the tech industry. Throughout my career, I’ve been privileged to build innovative products, solve real-world problems, and empower over 5,000 individuals to grow into better versions of themselves. My work and insights have also connected me with an audience of over 10,000 followers

#### As a Retriever
We can also convert the vector store into a retriever. Most other LangChain components work with retrievers, so this makes the store easy to plug into them.

In [7]:
retrieval = database.as_retriever()
docs = retrieval.invoke(query)
docs[0].page_content

'Looking ahead, my vision is to create groundbreaking projects and support millions—if not billions—of people in achieving their goals and thriving in the tech industry.'

#### Similarity Search with score
There are some FAISS-specific methods. One of them is similarity_search_with_score, which returns not only the documents but also the distance between the query and each document. The score is an L2 distance, so a lower score means a closer match.
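
To make "lower is better" concrete, the sketch below computes L2 distances by hand for a few made-up 2-D vectors. The vectors and the `l2_squared` helper are illustrative assumptions, not part of the notebook. Faiss's flat L2 index reports squared Euclidean distances, which is one reason scores for unnormalized embeddings can run into the thousands.

```python
import math


def l2_squared(a, b):
    """Squared Euclidean (L2) distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up vectors, for illustration only
query_vec = [1.0, 2.0]
doc_vecs = {"close": [1.0, 3.0], "far": [4.0, 6.0]}

scores = {name: l2_squared(query_vec, vec) for name, vec in doc_vecs.items()}
ranked = sorted(scores, key=scores.get)  # smallest distance first

for name in ranked:
    print(name, scores[name], math.sqrt(scores[name]))
```

Sorting by the squared distance gives the same ranking as sorting by the true distance, since the square root is monotonic.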

In [8]:
docs_and_score = database.similarity_search_with_score(query)
docs_and_score

[(Document(id='483d7af3-c715-4787-a3cf-0f601eb16647', metadata={'source': './data/speech.txt'}, page_content='Looking ahead, my vision is to create groundbreaking projects and support millions—if not billions—of people in achieving their goals and thriving in the tech industry.'),
  4657.192),
 (Document(id='0c9c3689-0912-4e06-8684-6b2ea809a018', metadata={'source': './data/speech.txt'}, page_content='In this program, I am here to guide you every step of the way, ensuring you have an unforgettable learning experience.'),
  5404.864),
 (Document(id='2a6f56d5-f064-4155-a53d-19e9010d805d', metadata={'source': './data/speech.txt'}, page_content='I am a Data Scientist, Open-source Contributor, and Frontend Engineer with years of experience in the tech industry. Throughout my career, I’ve been privileged to build innovative products, solve real-world problems, and empower over 5,000 individuals to grow into better versions of themselves. My work and insights have also connected me with an au

In [9]:
embedding_vector = embeddings.embed_query(query)
embedding_vector

[-0.6218839287757874,
 1.9015374183654785,
 -1.4655261039733887,
 0.6944152116775513,
 -1.2071727514266968,
 -0.5505391359329224,
 0.9263079762458801,
 -0.9349704384803772,
 0.6580371260643005,
 1.05746328830719,
 1.0083682537078857,
 -0.8687774538993835,
 -0.6997878551483154,
 -1.4026693105697632,
 0.38171881437301636,
 -3.6128616333007812,
 0.2681864798069,
 0.6873461008071899,
 -4.685661792755127,
 0.5125055909156799,
 1.6587244272232056,
 -0.6581463813781738,
 -1.401424527168274,
 0.433203786611557,
 -0.1228959783911705,
 0.051090583205223083,
 -0.4813109040260315,
 -1.561653733253479,
 -1.026887059211731,
 1.1503727436065674,
 0.15982654690742493,
 -1.6763666868209839,
 -0.1743612140417099,
 0.06150215491652489,
 0.4587891697883606,
 1.637152910232544,
 0.4506804347038269,
 0.15144476294517517,
 1.752927541732788,
 -2.004190444946289,
 -0.23241868615150452,
 2.483579635620117,
 0.619808554649353,
 -0.40016889572143555,
 -3.1416680812835693,
 -1.881230354309082,
 0.6439006924629211

In [10]:
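# Caution: similarity_search_with_score expects a query string. Passing the
# embedding list makes the store embed it again as if it were text, which is
# why these scores differ from In [8]. To search with a precomputed vector,
# use database.similarity_search_with_score_by_vector(embedding_vector).
docs_score = database.similarity_search_with_score(embedding_vector)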

In [11]:
print(docs_score)

[(Document(id='2a6f56d5-f064-4155-a53d-19e9010d805d', metadata={'source': './data/speech.txt'}, page_content='I am a Data Scientist, Open-source Contributor, and Frontend Engineer with years of experience in the tech industry. Throughout my career, I’ve been privileged to build innovative products, solve real-world problems, and empower over 5,000 individuals to grow into better versions of themselves. My work and insights have also connected me with an audience of over 10,000 followers across social media platforms.'), 14094.66), (Document(id='483d7af3-c715-4787-a3cf-0f601eb16647', metadata={'source': './data/speech.txt'}, page_content='Looking ahead, my vision is to create groundbreaking projects and support millions—if not billions—of people in achieving their goals and thriving in the tech industry.'), 14572.62), (Document(id='0c9c3689-0912-4e06-8684-6b2ea809a018', metadata={'source': './data/speech.txt'}, page_content='In this program, I am here to guide you every step of the way,
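
A precomputed embedding is normally passed to a by-vector method. LangChain's FAISS wrapper provides similarity_search_by_vector and similarity_search_with_score_by_vector for this. A text-search method that is handed a vector will typically embed it again as if it were text, which changes both scores and ranking. The toy store below sketches that difference in plain Python; `ToyStore` and `toy_embed` are hypothetical stand-ins and do not use Faiss or a real embedding model.

```python
def toy_embed(text):
    """Hypothetical embedding: [letters, digits, spaces] counted in `text`."""
    text = str(text)  # mimics an embedder that formats its input into a prompt
    return [
        float(sum(c.isalpha() for c in text)),
        float(sum(c.isdigit() for c in text)),
        float(sum(c.isspace() for c in text)),
    ]


class ToyStore:
    def __init__(self, texts):
        self.texts = texts
        self.vectors = [toy_embed(t) for t in texts]

    def search_by_vector(self, vector):
        """Return (text, squared L2 distance) pairs, nearest first."""
        scored = [
            (t, sum((a - b) ** 2 for a, b in zip(vector, v)))
            for t, v in zip(self.texts, self.vectors)
        ]
        return sorted(scored, key=lambda pair: pair[1])

    def search(self, query):
        """Text search: embed the query, then search by vector."""
        return self.search_by_vector(toy_embed(query))


store = ToyStore(["future plans", "room 101 2024"])
query = "my future goal"
query_vec = toy_embed(query)

by_text = store.search(query)
by_vector = store.search_by_vector(query_vec)  # same result as by_text
misused = store.search(query_vec)  # embeds the text "[12.0, 0.0, 2.0]" instead

print(by_text[0], by_vector[0], misused[0])
```

In this toy setup the misuse changes the top hit, which mirrors the change in ranking between the two score listings in this notebook.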

In [12]:
# saving and loading the database
database.save_local("faiss_db")

In [13]:
new_database = FAISS.load_local("faiss_db", embeddings, allow_dangerous_deserialization=True)
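
The allow_dangerous_deserialization=True flag is needed because LangChain's FAISS wrapper stores the document store and ID mapping with pickle, and unpickling can run code chosen by whoever wrote the file. The stdlib-only sketch below shows that behaviour with a harmless callable; `Sneaky` is a hypothetical class used only for illustration.

```python
import pickle


class Sneaky:
    """On unpickling, asks pickle to call len("abc") instead of rebuilding a Sneaky."""

    def __reduce__(self):
        # (callable, args): pickle will call callable(*args) while loading
        return (len, ("abc",))


payload = pickle.dumps(Sneaky())
loaded = pickle.loads(payload)  # runs len("abc") during loading
print(type(loaded), loaded)
```

A malicious file could name a far less harmless callable, so set the flag only for index folders you created yourself or otherwise trust.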

In [14]:
docs = new_database.similarity_search(query)

In [15]:
docs[0].page_content

'Looking ahead, my vision is to create groundbreaking projects and support millions—if not billions—of people in achieving their goals and thriving in the tech industry.'