# Building a Vector Store DB with FAISS
* This notebook builds a vector store database (DB) with FAISS (Facebook AI Similarity Search) and uses it to run similarity searches over text embeddings.
* The embeddings are generated from the text documents with an Ollama model (`llama3.2:1b`).

## 1. FAISS

`FAISS` **(Facebook AI Similarity Search)** is an open-source library for efficient similarity search and clustering of dense vectors.
* It supports several index types (Flat, IVF, HNSW, PQ) that trade off speed, memory usage, and accuracy.
* `FAISS` is widely used for large-scale semantic search, recommendation systems, and nearest-neighbor search, scaling to billions of vectors.
* It also contains supporting code for evaluation and parameter tuning.

#### Importing necessary libraries

In [32]:
"""
1. TextLoader: Loads your text documents.
2. FAISS: Vector store used for similarity search.
3. OllamaEmbeddings: Generates vector embeddings using the llama3.2:1b model.
4. CharacterTextSplitter: Splits long text documents into smaller chunks.
"""

from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import OllamaEmbeddings
from langchain_text_splitters import CharacterTextSplitter

### 2. Load and Split Documents

* We load a text file (`speech.txt`) and then split it into smaller chunks using the `CharacterTextSplitter`.

In [16]:
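# Load the document from disk
loader = TextLoader("speech.txt")
documents = loader.load()

# Split into chunks of about 200 characters with a 30-character overlap.
# The splitter only cuts at its separator, so some chunks can exceed chunk_size.
text_splitter = CharacterTextSplitter(chunk_size=200, chunk_overlap=30)
docs = text_splitter.split_documents(documents)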


Created a chunk of size 942, which is longer than the specified 200
Created a chunk of size 617, which is longer than the specified 200
Created a chunk of size 744, which is longer than the specified 200
Created a chunk of size 302, which is longer than the specified 200
Created a chunk of size 1481, which is longer than the specified 200
Created a chunk of size 225, which is longer than the specified 200
Created a chunk of size 284, which is longer than the specified 200
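
The warnings above appear because `CharacterTextSplitter` splits only on its separator (a blank line, `"\n\n"`, by default) and never cuts inside a piece, so a paragraph longer than `chunk_size` is kept whole. The sketch below is a simplified stand-in for that behaviour, not LangChain's actual implementation: `split_on_separator` is a hypothetical helper, it ignores `chunk_overlap`, and the input text is synthetic.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Greedy splitter: merge separator-delimited pieces up to chunk_size.

    Pieces are never cut internally, so a single piece longer than
    chunk_size is emitted as one oversized chunk.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the merged text still fits, or we are starting a new chunk.
            current = candidate
        else:
            # Adding this piece would overflow: close the current chunk.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Synthetic text: three paragraphs of 50, 300 and 40 characters.
text = "a" * 50 + "\n\n" + "b" * 300 + "\n\n" + "c" * 40
chunks = split_on_separator(text, chunk_size=200)
sizes = [len(c) for c in chunks]
print(sizes)  # the 300-character paragraph stays in one chunk
```

If every chunk must respect the limit, `RecursiveCharacterTextSplitter` from `langchain_text_splitters` is one option, since it falls back to finer separators when a piece is still too long.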


In [17]:
docs

[Document(metadata={'source': 'speech.txt'}, page_content="or the process of speaking to a group of people, see Public speaking. For other uses, see Speech (disambiguation).\nDuration: 15 seconds.0:15Subtitles available.CC\nSpeech production visualized by real-time MRI\nPart of a series on\nLinguistics\nOutlineHistoryIndex\nGeneral linguistics\nApplied linguistics\nTheoretical frameworks\nTopics\n Portal\nvte\nSpeech is the use of the human voice as a medium for language. Spoken language combines vowel and consonant sounds to form units of meaning like words, which belong to a language's lexicon. There are many different intentional speech acts, such as informing, declaring, asking, persuading, directing; acts may vary in various aspects like enunciation, intonation, loudness, and tempo to convey meaning. Individuals may also unintentionally communicate aspects of their social position through speech, such as sex, age, place of origin, physiological and mental condition, education, and

### 3. Generating Embeddings (Using the OllamaEmbeddings Model) and the FAISS DB
* This creates a vector representation (embedding) of each text chunk using the `llama3.2:1b` model and stores the vectors in a FAISS index.

In [33]:
# Already imported above; repeated so this cell can run on its own
from langchain_community.embeddings import OllamaEmbeddings

embeddings = OllamaEmbeddings(
    model="llama3.2:1b"  # or whichever model you have installed in Ollama
)

#embeddings

# Creating a FAISS Vector Store (DB)
db = FAISS.from_documents(docs, embeddings)
db

<langchain_community.vectorstores.faiss.FAISS at 0x116321050>

### 4. Querying the Database (Natural Language Search)
* We perform a semantic search to find the text chunks most relevant to the query.
* FAISS compares the query's embedding vector with the stored document vectors and returns the closest matches.
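
To make the comparison step concrete, here is a minimal brute-force sketch of nearest-neighbor search under L2 distance. It only illustrates the idea and is not how FAISS is implemented internally; the helper names (`l2_distance`, `nearest`) are hypothetical, and the 2-D vectors are toy values standing in for real embeddings.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query_vec, stored, k=2):
    """Return the k texts whose vectors are closest to query_vec."""
    ranked = sorted(stored, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Toy (text, vector) pairs; illustrative values only.
stored = [
    ("table etiquette", [1.0, 0.0]),
    ("speech production", [0.0, 1.0]),
    ("dining manners", [0.9, 0.1]),
]

top = nearest([1.0, 0.05], stored, k=2)
print(top)
```

A real index avoids scanning every vector when the collection is large, which is where the IVF, HNSW, and PQ index types mentioned earlier come in.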

In [19]:
### Querying 
query = "What does Anthony does? and what are some table etiquette ?"
docs = db.similarity_search(query)
docs

[Document(id='e43ae176-a477-494e-9198-cc2582972ef5', metadata={'source': 'speech.txt'}, page_content='Great question! Maintaining proper table etiquette at formal or semi-formal events like weddings can make a strong, positive impression. Here are some key points to keep in mind:'),
 Document(id='16289864-e95c-400e-9f8c-04a0b8ed119b', metadata={'source': 'speech.txt'}, page_content='Don’t Lick Fingers or Pick Teeth: Use napkins to clean your fingers and a toothpick discreetly if necessary (preferably after leaving the table).'),
 Document(id='1841da0b-c5a4-42c7-a017-9980441e2408', metadata={'source': 'speech.txt'}, page_content='3. Table Interaction:\nWait for Everyone Before Starting: In formal settings, it’s polite to wait until everyone has been served before you start eating.'),
 Document(id='52dd4647-83a2-4d46-a423-8a3073dba042', metadata={'source': 'speech.txt'}, page_content="Avoid Eating with Your Hands: Unless it's a finger food event or explicitly allowed (like BBQ or African

In [20]:
# displaying the responses
docs[0].page_content

'Great question! Maintaining proper table etiquette at formal or semi-formal events like weddings can make a strong, positive impression. Here are some key points to keep in mind:'

### 5. As a Retriever
* We can also convert the vector store into a Retriever class. This makes it easy to use in other LangChain methods, which largely work with retrievers.

In [25]:
query = "how to eat in public"

retriever = db.as_retriever()
docs = retriever.invoke(query)
docs[0].page_content

'Don’t Lick Fingers or Pick Teeth: Use napkins to clean your fingers and a toothpick discreetly if necessary (preferably after leaving the table).'

### 6. Similarity Search with score

* There are some FAISS-specific methods. One of them is `similarity_search_with_score`, which returns not only the documents but also the distance between the query and each of them.
* The returned score is an L2 distance, so a lower score means a closer match.
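
Because `similarity_search_with_score` returns `(Document, score)` tuples, the scores can be used to discard weak matches. The sketch below shows one way to do that on plain tuples. The helper `filter_by_distance`, the threshold, and the sample scores are all hypothetical and chosen only for illustration; a sensible cutoff depends on the embedding model.

```python
def filter_by_distance(scored, max_distance):
    """Keep (item, distance) pairs at or below max_distance, closest first."""
    kept = [(item, dist) for item, dist in scored if dist <= max_distance]
    return sorted(kept, key=lambda pair: pair[1])


# Made-up scores standing in for the tuples a scored search would return.
scored = [("chunk A", 0.82), ("chunk B", 0.31), ("chunk C", 1.47)]

close = filter_by_distance(scored, max_distance=1.0)
print(close)
```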

In [26]:
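# Note: similarity_search returns documents only. To also get the distances,
# call db.similarity_search_with_score(query), which returns (Document, score) tuples.
docs_and_score = db.similarity_search(query)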
docs_and_score

[Document(id='16289864-e95c-400e-9f8c-04a0b8ed119b', metadata={'source': 'speech.txt'}, page_content='Don’t Lick Fingers or Pick Teeth: Use napkins to clean your fingers and a toothpick discreetly if necessary (preferably after leaving the table).'),
 Document(id='aab346fb-4766-4808-a78b-a004742d3f96', metadata={'source': 'speech.txt'}, page_content='Anthony Ekle is a PhD candidate in Computer Science at Tennessee Tech University,\n specializing in graph-based machine learning, anomaly detection in dynamic graphs, \n and real-world AI applications. With over two years of research experience, Anthony has published in top venues, including IEEE, ACM TKDD, and AAAI. His work includes developing Adaptive DecayRank, an anomaly detection model leveraging Bayesian PageRank updates, and contributions to graph neural networks (GNNs) for coordinated sensor attack detection in autonomous vehicles. He has hands-on expertise in NLP, generative AI, and cyber risk evaluation, demonstrated through var

### 7. Querying the Database (By Vector)
* Instead of querying by text, we query directly with an embedding vector.
* This approach is useful if you already have a vector representation of your query (e.g., a precomputed embedding).

In [27]:
embeddings_vector = embeddings.embed_query(query)
embeddings_vector

[-0.9270519614219666,
 4.469832897186279,
 1.7618122100830078,
 0.40265709161758423,
 2.0088579654693604,
 0.4227581024169922,
 1.0448734760284424,
 1.0283260345458984,
 -1.128825306892395,
 0.46850448846817017,
 -2.047393321990967,
 -0.5958850383758545,
 -2.9498541355133057,
 -0.44426390528678894,
 1.6121407747268677,
 -1.1918613910675049,
 0.053721871227025986,
 0.3366502821445465,
 1.9542479515075684,
 4.106456279754639,
 -0.3982490003108978,
 0.5903869271278381,
 0.48642802238464355,
 2.243391275405884,
 -1.3617596626281738,
 -2.1715164184570312,
 -5.5376877784729,
 1.269402027130127,
 -0.30988937616348267,
 1.9214322566986084,
 0.9841897487640381,
 -0.6293820738792419,
 -0.0643991231918335,
 1.8185293674468994,
 1.1561007499694824,
 -0.10601756721735,
 1.7110618352890015,
 0.2773783504962921,
 0.18936070799827576,
 -2.9806408882141113,
 0.9325524568557739,
 -2.091839551925659,
 0.63725745677948,
 0.4528833031654358,
 -1.1809234619140625,
 1.7720048427581787,
 -0.3199886381626129,


In [28]:
docs_score = db.similarity_search_by_vector(embeddings_vector)
docs_score 

[Document(id='16289864-e95c-400e-9f8c-04a0b8ed119b', metadata={'source': 'speech.txt'}, page_content='Don’t Lick Fingers or Pick Teeth: Use napkins to clean your fingers and a toothpick discreetly if necessary (preferably after leaving the table).'),
 Document(id='aab346fb-4766-4808-a78b-a004742d3f96', metadata={'source': 'speech.txt'}, page_content='Anthony Ekle is a PhD candidate in Computer Science at Tennessee Tech University,\n specializing in graph-based machine learning, anomaly detection in dynamic graphs, \n and real-world AI applications. With over two years of research experience, Anthony has published in top venues, including IEEE, ACM TKDD, and AAAI. His work includes developing Adaptive DecayRank, an anomaly detection model leveraging Bayesian PageRank updates, and contributions to graph neural networks (GNNs) for coordinated sensor attack detection in autonomous vehicles. He has hands-on expertise in NLP, generative AI, and cyber risk evaluation, demonstrated through var

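### 8. How to Save the Vector DB to Local Disk
* Saving: The `db.save_local()` command saves your FAISS index to disk for future use.
* Loading: The `FAISS.load_local()` command reloads your saved index so you can continue using it without re-computing the embeddings.
* `allow_dangerous_deserialization=True` is required because part of the saved store is a pickle file. Only enable it for index files you created yourself or otherwise trust.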
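
A common pattern built on these two calls is to load the index when it already exists on disk and build it only otherwise. The sketch below shows that pattern with the FAISS-specific calls passed in as functions, so it runs here with stand-ins. `load_or_build` and the `fake_*` functions are hypothetical, and the check for `index.faiss` assumes `save_local` is used with its default index name.

```python
import tempfile
from pathlib import Path


def load_or_build(folder, load_fn, build_fn, save_fn):
    """Load a saved store from folder if present, else build and save it."""
    folder = Path(folder)
    if (folder / "index.faiss").exists():
        return load_fn(folder)
    store = build_fn()
    save_fn(store, folder)
    return store


# Stand-ins for the real FAISS calls, used only to exercise the helper.
calls = []

def fake_build():
    calls.append("build")
    return {"vectors": [1, 2, 3]}

def fake_save(store, folder):
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "index.faiss").write_text("stub")

def fake_load(folder):
    calls.append("load")
    return {"vectors": [1, 2, 3]}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "faiss_index"
    first = load_or_build(target, fake_load, fake_build, fake_save)   # builds
    second = load_or_build(target, fake_load, fake_build, fake_save)  # loads

print(calls)
```

With the objects from this notebook, the three functions could be `lambda p: FAISS.load_local(str(p), embeddings, allow_dangerous_deserialization=True)`, `lambda: FAISS.from_documents(docs, embeddings)`, and `lambda store, p: store.save_local(str(p))`.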

In [29]:
# saving the index to disk
db.save_local("faiss_index")

In [31]:
# loading the saved index (only deserialize index files you trust)
new_db = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)
docs = new_db.similarity_search("more on Anthony ekle")
docs

[Document(id='1f887f87-3b29-43a4-b546-9d4218834681', metadata={'source': 'speech.txt'}, page_content="The evolutionary origin of speech is subject to debate and speculation. \nWhile animals also communicate using vocalizations, \nand trained apes such as Washoe and Kanzi can use simple sign language, no animals' \nvocalizations are articulated phonemically and syntactically, and do not constitute speech."),
 Document(id='16289864-e95c-400e-9f8c-04a0b8ed119b', metadata={'source': 'speech.txt'}, page_content='Don’t Lick Fingers or Pick Teeth: Use napkins to clean your fingers and a toothpick discreetly if necessary (preferably after leaving the table).'),
 Document(id='aab346fb-4766-4808-a78b-a004742d3f96', metadata={'source': 'speech.txt'}, page_content='Anthony Ekle is a PhD candidate in Computer Science at Tennessee Tech University,\n specializing in graph-based machine learning, anomaly detection in dynamic graphs, \n and real-world AI applications. With over two years of research ex