# Vector Store Implementation + RAG

The following code is an implementation made for the final exam of the Information Retrieval course.

**Author:** *Jacopo Zacchigna*

<img src="https://images.contentstack.io/v3/assets/bltefdd0b53724fa2ce/blt185ef72de6dc0e43/6466a9a1f21a3540facf75ac/vector-search-diagram-cropped-white-space.png" width="75%" height="75%">

---

The notebook is an implementation of a vector store.
The code is structured into two main classes:

- Index
- VectorStore

Furthermore, I implemented some additional classes to test the vector store and to build the RAG demo:

- TextLoader
- DirectoryReader
- RecursiveTextSplitter

---

##### The text for the RAG demo is taken from:

- Grokking Paper: https://arxiv.org/pdf/2201.02177.pdf
- Attention Is All You Need: https://arxiv.org/pdf/1706.03762.pdf

### Imports external libraries for the Vector Store

In [1]:
from sentence_transformers import SentenceTransformer
import numpy as np
import pandas as pd
import pickle
import os

## Index (Helper class)

Implementation of a helper class, `Index`, which is used by the vector store.

- **add_items:** adds the vectors, together with their ids, to the `stored_vectors` dictionary
- **knn_query:** returns the ids and similarity scores of the `top_n` stored vectors most similar to a query vector
- **_cosine_similarity:** helper function to compute the cosine similarity

In [2]:
class Index:
    def __init__(self, dim=None):
        self.dim = dim
        
        # Dictionary to store the vectors
        self.stored_vectors = {}

    def add_items(self, vectors, vectors_id: list):
        """
        Update the indexing structure for the vector store
        """
        for vector_id, vector in zip(vectors_id, vectors):
            if vector.shape != (self.dim,):
                raise ValueError("Vectors must have shape (dim,)")
            self.stored_vectors[vector_id] = vector

    def knn_query(self, query_vector: np.ndarray, top_n: int = 5):
        """
        Find the top n similar vectors to the query vector using cosine similarity.

        Args:
            query_vector (numpy.ndarray): The query vector.
            top_n (int): The number of top similar vectors to return.

        Returns:
            A tuple of two tuples: the first contains the ids of the top n similar vectors,
            and the second contains the corresponding cosine similarity scores.
        """
        # For every vector in the vector store compute the similarity and create a tuple with the relative index (int)
        similarities = [(index, self._cosine_similarity(query_vector, vector)) for index, vector in self.stored_vectors.items()]

        # Sort by similarity (second element of each tuple) and keep the top_n most similar vectors
        # Then zip: [(index, similarity), (index, similarity) ...] -> ([index, index, ...], [similarity, similarity, ...])
        top_n_indices, top_n_similarities = zip(*sorted(similarities, key=lambda x: x[1], reverse=True)[:top_n])

        return top_n_indices, top_n_similarities
        
    def _cosine_similarity(self, query_vector, vector) -> float:
        """
        Compute the similarity between two vectors

        Args:
            query_vector (numpy.ndarray): The query vector
            vector (numpy.ndarray): The vector to compare

        Returns:
            The dot product of the vectors, normalized by the product of their norms
        """

        # Compute the dot product between the two vectors
        dot_product = np.dot(query_vector, vector)

        # Normalization values
        query_vector_norm = np.linalg.norm(query_vector)
        vector_norm = np.linalg.norm(vector)

        # Return the similarity
        return dot_product / (query_vector_norm * vector_norm)
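
`knn_query` above scores the stored vectors one at a time in a Python loop. As a rough sketch of an alternative, and not part of the notebook's `Index` class, the same ranking can be computed with a single matrix product once the vectors are stacked into one array. The function name `top_n_cosine` and the toy vectors are assumptions made for illustration.

```python
import numpy as np


def top_n_cosine(query_vector: np.ndarray, matrix: np.ndarray, top_n: int = 5):
    """Return (row indices, scores) of the rows of `matrix` most similar to `query_vector`."""
    # Normalize the query and every stored row to unit length
    query_unit = query_vector / np.linalg.norm(query_vector)
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

    # One matrix-vector product yields all cosine similarities at once
    scores = matrix_unit @ query_unit

    # Sort by descending similarity and keep the first top_n rows
    order = np.argsort(-scores)[:top_n]
    return order, scores[order]


# Toy data (hypothetical): three 2-dimensional vectors, one per row
stored = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
indices, scores = top_n_cosine(np.array([1.0, 0.0]), stored, top_n=2)
print(indices, scores)
```

Note that this sketch returns row positions, so a separate list would be needed to map rows back to the integer ids used by the vector store.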

## Vector Store class

This is the main part of the code which implements the vector store.

*It focuses on implementing the following functions, with some additional functionality and some basic error handling:*

- **_load_vector_store:** loads the index and sentences

- **save_vector_store:** saves the index and sentences to the specified directory

- **create_vector_store:** adds vectors to the vector store

- **update_vector_store:** updates the existing vector store with new vectors

- **delete_vector_store:** deletes a persistent vector store

- **query_similar_vectors:** finds similar vectors to the query vector based on cosine similarity

In [3]:
class VectorStore:
    def __init__(self, model_name="nomic-ai/nomic-embed-text-v1.5", persist=True, persist_path="vector_store"):
        self.persist = persist
        self.persist_path = persist_path
        
        # Load the embedding model used to encode both the text chunks and the queries
        self.model = SentenceTransformer(model_name, trust_remote_code=True)
        
        # Counter to then store the ids of vectors
        self.id_counter = 0

        # Dictionary to store chunk corresponding to vectors
        self.text_chunks = {}

    def save_vector_store(self):
        # In the case the vector store was created without persistence
        if not self.persist:
            # Set it to be persistent
            self.persist = True
    
        # Create the directory if it doesn't exist
        os.makedirs(self.persist_path, exist_ok=True)
    
        # Serialize and save the index
        with open(os.path.join(self.persist_path, "index.pkl"), "wb") as f:
            pickle.dump(self.index, f)
    
        # Serialize and save the text_chunks
        with open(os.path.join(self.persist_path, "text_chunks.pkl"), "wb") as f:
            pickle.dump(self.text_chunks, f)

    def create_vector_store(self, text_chunks):
        # Get the embeddings
        embeddings = self.model.encode(text_chunks)
        self.embeddings_dimension = len(embeddings[0])

        # Create the index with the dimensionality of the vector I got
        self.index = Index(dim=self.embeddings_dimension)

        # Create a dictionary with the documents and the relative embeddings
        chunks_embeddings = {text_chunks[i]: embeddings[i] for i in range(len(text_chunks))}
        
        try:
            vectors = []
            ids = []
            
            for chunk, vector in chunks_embeddings.items():
                # Append the new vector
                vectors.append(vector)
                # Assign a unique integer id to every vector
                ids.append(self.id_counter)
                # Store the text chunks
                self.text_chunks[self.id_counter] = chunk
                # Increment the counter for the next vector
                self.id_counter += 1
                
            # Adding the items to the index
            self.index.add_items(vectors, ids)

            if self.persist:
                self.save_vector_store()

            print("\033[32mVector store created successfully\033[0m", end="\n\n")
        except Exception as e:
            raise e

    def update_vector_store(self, text_chunks):
        """
        Update the existing vector store with new text chunks

        Args:
            text_chunks (list[str]): The text chunks to embed and add to the vector store
        """
        embeddings = self.model.encode(text_chunks)
        chunks_embeddings = {text_chunks[i]: embeddings[i] for i in range(len(text_chunks))}

        try:
            if self.persist:
                # Load existing index and text_chunks
                self.index, self.text_chunks = self._load_vector_store()

            # Get the max for the counter and start from the next one
            self.id_counter = max(self.text_chunks.keys()) + 1

            # Add new vectors to the index and text_chunks
            vectors = []
            ids = []
            for chunk, vector in chunks_embeddings.items():
                vectors.append(vector)
                ids.append(self.id_counter)
                self.text_chunks[self.id_counter] = chunk
                self.id_counter += 1

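            # Add the new vectors and their ids to the index
            self.index.add_items(vectors, ids)

            # Persist the updated store: queries reload it from disk when persist is True
            if self.persist:
                self.save_vector_store()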

            print("\033[32mVector store updated successfully\033[0m", end="\n\n")
        except Exception as e:
            raise e

    def delete_vector_store(self) -> None:
        """
        Delete a persistent vector store that was created
        """
        try:
            # Check if the directory exists
            if os.path.exists(self.persist_path):
                # Delete index and text_chunks files
                os.remove(os.path.join(self.persist_path, "index.pkl"))
                os.remove(os.path.join(self.persist_path, "text_chunks.pkl"))
                os.rmdir(self.persist_path)
                print("\033[32mVector store deleted successfully\033[0m", end="\n\n")
            else:
                print("Vector store does not exist", end="\n\n")
        except Exception as e:
            raise e

    def query_similar_vectors(self, query: str, top_n=5):
        """
        Find similar vectors to the query

        Args:
            query (str): The query that is going to be searched for inside my vector store
            top_n (int): The number of similar vectors to return

        Returns:
            A list of tuples, each containing a document and its similarity to the query vector
        """
        if self.persist:
            # Load existing index and text_chunks
            self._load_vector_store()

        # Use the same model to encode the query
        query_vector = self.model.encode(query)
        
        # Query for the top_n most similar vectors to the query vector
        top_n_indices, top_n_similarities = self.index.knn_query(query_vector, top_n=top_n)

        # Return the most similar documents in a list of tuples with (text_chunks, similarity_score)
        return [(self.text_chunks[index], similarities) for index, similarities in zip(top_n_indices, top_n_similarities)]
        
    def _load_vector_store(self):
        index_file = os.path.join(self.persist_path, "index.pkl")
        text_chunks_file = os.path.join(self.persist_path, "text_chunks.pkl")

        if not os.path.exists(index_file) or not os.path.exists(text_chunks_file):
            raise FileNotFoundError("Index and text_chunks files not found in the specified directory.")

        with open(index_file, "rb") as f:
            self.index = pickle.load(f)
        with open(text_chunks_file, "rb") as f:
            self.text_chunks = pickle.load(f)

        return self.index, self.text_chunks
        
    def print_similar_vectors(self, similar_vectors) -> None:
        """
        Helper function to print the most similar vector with the relative similarity score in a nice way
        """
        print("\033[1mSimilar Text Retrived:\033[0m")
        print("___________________________________\n")
        for chunk, similarity_score in similar_vectors:
            print("\033[1m- Retrieved Text:\033[0m", chunk)
            print("\033[1m    Similarity Score:\033[0m", similarity_score)
            print()
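
The persistence used by `save_vector_store` and `_load_vector_store` boils down to a pickle round trip over two files. The following is a minimal sketch of that round trip, assuming plain dictionaries as stand-ins for the real `Index` object and the text chunks. The helper names `save_store` and `load_store` are hypothetical and do not exist in the class above.

```python
import os
import pickle
import tempfile


def save_store(path: str, index: dict, text_chunks: dict) -> None:
    """Write both objects to `path`, one pickle file each."""
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "index.pkl"), "wb") as f:
        pickle.dump(index, f)
    with open(os.path.join(path, "text_chunks.pkl"), "wb") as f:
        pickle.dump(text_chunks, f)


def load_store(path: str):
    """Read both objects back, in the same order they were saved."""
    with open(os.path.join(path, "index.pkl"), "rb") as f:
        index = pickle.load(f)
    with open(os.path.join(path, "text_chunks.pkl"), "rb") as f:
        text_chunks = pickle.load(f)
    return index, text_chunks


# Round trip inside a temporary directory so nothing is left on disk
with tempfile.TemporaryDirectory() as tmp:
    store_path = os.path.join(tmp, "demo")
    save_store(store_path, {0: [0.1, 0.2]}, {0: "first chunk"})
    loaded_index, loaded_chunks = load_store(store_path)

print(loaded_index, loaded_chunks)
```

Since unpickling can execute arbitrary code, files saved this way should only be loaded from a trusted location.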

## Demo of The vector store

Using nomic embed for the demo and a custom index. The demo showcases how to create, query, update and delete a vector store.

In [4]:
class TextLoader:
    """
    Class to load a csv file and split it into two splits
    """
    def __init__(self, data_path):
        self.data_path = data_path

    def load_data(self):
        # Load the data from a CSV file
        return pd.read_csv(self.data_path, delimiter=";")

    def split_data(self, split_ratio=0.8, random_state=1337):
        # Load the data
        data = self.load_data()

        # Split them in two splits
        data_1 = data.sample(frac=split_ratio, random_state=random_state)
        data_2 = data.drop(data_1.index)

        # Return them as a tuple of text lists
        return data_1["text"].tolist(), data_2["text"].tolist()

### Vector Store example

Creation of the vector store, and simple query

In [5]:
# Get the raw text
# Split to then use it to update the vector store
data_1, data_2 = TextLoader('data/sample.csv').split_data()

# Create an instance of a vector store from documents
db = VectorStore(model_name="nomic-ai/nomic-embed-text-v1.5", persist=True, persist_path="demo")

# Create the vector store with some of the data
db.create_vector_store(data_1)

# Define a query
query = "I want to buy a car"

# Search for it in my vector store and return the 3 most similar results
similar_vectors = db.query_similar_vectors(query, top_n=3)

# Pretty print the most similar vectors with their similarity scores
db.print_similar_vectors(similar_vectors)

<All keys matched successfully>


Vector store created successfully

Similar Text Retrived:
___________________________________

- Retrieved Text: A vintage car gleamed in the sunlight, its polished chrome catching the eye of passersby.
    Similarity Score: 0.5708912

- Retrieved Text: The sleek, silver sports car raced down the winding mountain road, its engine roaring with power.
    Similarity Score: 0.52294475

- Retrieved Text: A family sedan cruised along the highway, its occupants singing along to their favorite songs on the radio.
    Similarity Score: 0.5017636



#### Updating the vector store

In [6]:
# Update the vector store
db.update_vector_store(data_2)

# Query the vector store
query = "I want to buy a cycle"

similar_vectors = db.query_similar_vectors(query, top_n=3)
db.print_similar_vectors(similar_vectors)

Vector store updated successfully

Similar Text Retrived:
___________________________________

- Retrieved Text: And as the sun rises once again, the cycle begins anew, a testament to the beauty and resilience of life.
    Similarity Score: 0.53241175

- Retrieved Text: The Harley-Davidson motorcycle rumbled to life, its deep, throaty growl announcing its presence on the road.
    Similarity Score: 0.5212596

- Retrieved Text: A vintage car gleamed in the sunlight, its polished chrome catching the eye of passersby.
    Similarity Score: 0.4940513



#### Deleting the Vector Store

In [7]:
# Delete saved vector store
db.delete_vector_store()

Vector store deleted successfully



## RAG (Retrieval-Augmented Generation)

#### Additional Imports

- **pypdf:** to read PDF files
- **ollama:** to run LLMs such as Mistral 7B (the dense model, not the MoE variant) and Llama 3 locally

In [8]:
from pypdf import PdfReader
import ollama

### DirectoryReader

Simple implementation of a class that loads all of the text present in a directory from `pdf` and `txt` files.

In [9]:
class DirectoryReader:
    """
    Class to load all of the text from a directory with different filetypes
    """
    def __init__(self, data_path):
        self.data_path = data_path

    # This supports pdf and txt files but can be extended if needed
    # Furthermore, the metadata of the source documents could be stored instead of concatenating all of the text
    def load_data(self):
        # List all files in the data directory
        files = os.listdir(self.data_path)
        self.text = ''
    
        # Read the contents of each file
        for file in files:
            # Get the file path
            file_path = os.path.join(self.data_path, file)
            
            if file.endswith('.pdf'):
                # load the pdf inside the text attribute
                self._load_pfd(file_path)
            elif file.endswith('.txt'):
                # load the txt inside the text attribute
                self._load_txt(file_path)
            else:
                print(f"File type not supported for: {file}")
    
        return self.text

    def _load_pfd(self, file_path):
        with open(file_path, 'rb') as f:  # Open the file in binary mode
            pdf = PdfReader(f)  # Create a PdfReader object
            for page in pdf.pages:
                self.text += page.extract_text()
        
    def _load_txt(self, file_path):
        with open(file_path, 'r') as txt_file:  # Open the file in text mode
            self.text += txt_file.read()
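
The comment in `load_data` mentions that source metadata could be kept instead of concatenating everything into one string. One possible shape for that idea is sketched below, restricted to `.txt` files to stay dependency-free. The function name `load_txt_with_sources` and the record keys `source` and `text` are assumptions, not part of `DirectoryReader`.

```python
import os
import tempfile


def load_txt_with_sources(data_path: str):
    """Return one {"source", "text"} record per .txt file in `data_path`."""
    documents = []
    # Sort the file names so the result order is deterministic
    for file_name in sorted(os.listdir(data_path)):
        if not file_name.endswith(".txt"):
            continue  # other file types are ignored in this sketch
        file_path = os.path.join(data_path, file_name)
        with open(file_path, "r", encoding="utf-8") as txt_file:
            documents.append({"source": file_name, "text": txt_file.read()})
    return documents


# Small demonstration on throwaway files (hypothetical names and contents)
with tempfile.TemporaryDirectory() as tmp:
    for name, body in [("a.txt", "alpha"), ("b.txt", "beta"), ("c.csv", "x;y")]:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write(body)
    documents = load_txt_with_sources(tmp)

print(documents)
```

Keeping the file name next to each text would let retrieved chunks be traced back to the document they came from.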
        

### RecursiveTextSplitter

Implementation of a class that recursively splits the text, trying `\n\n` first as the separator and falling back through finer separators down to `''` until every chunk is no larger than the configured chunk size. It does not support overlap between chunks, although that could be added in the future.

In [10]:
class RecursiveTextSplitter:
    """
    Class that splits text into chunks that can be used to create a vector store
    """
    def __init__(self, chunk_size=100):
        self.chunk_size = chunk_size
        self._separators = ["\n\n", '\n', " ", ".", ",", ""]

    def split_text(self, text: str):
        """Split incoming text and return chunks."""
        chunks = []

        separator = self._get_separator(text)

        # Split the text on the separator; if the separator is empty, split into single characters
        splits = text.split(separator) if separator else list(text)

        # Find good splits
        good_splits = []
        for split in splits:
            # The split is good if it is no larger than the chunk size
            if len(split) <= self.chunk_size:
                good_splits.append(split)
            else:
                # Recursively split the chunk if it was larger with another separator
                merged_chunks = self.split_text(split)
                chunks.extend(merged_chunks)

        # Try to merge splits together while staying within the chunk size
        if good_splits:
            merged_chunks = self._merge_splits(good_splits)
            chunks.extend(merged_chunks)

        return chunks

    def _get_separator(self, text: str) -> str:
        """Return the first separator in the list that occurs in the text, or "" if none does"""
        return next((s for s in self._separators if s in text), "")

    def _merge_splits(self, splits):
        """Merge the splits if the cumulative length is less than the chunk size."""
        merged_chunks = []
        current_chunk = ''

        for split in splits:
            if len(current_chunk) + len(split) > self.chunk_size:
                merged_chunks.append(current_chunk)
                current_chunk = split
            else:
                current_chunk += self._get_separator(current_chunk) + split

        # Append the last chunk
        merged_chunks.append(current_chunk)
        return merged_chunks
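
The splitter above produces chunks that do not overlap. One simple way overlap could be layered on top, sketched here as a post-processing step rather than a change to `RecursiveTextSplitter`, is to prefix each chunk with the tail of the previous one. The function name `add_overlap` and its `overlap` parameter are hypothetical.

```python
def add_overlap(chunks, overlap=20):
    """Prefix every chunk except the first with the last `overlap` characters of the previous chunk."""
    if overlap <= 0:
        return list(chunks)

    overlapped = []
    previous = ""
    for chunk in chunks:
        # previous[-overlap:] is "" for the first chunk, so it is left unchanged
        overlapped.append(previous[-overlap:] + chunk)
        previous = chunk
    return overlapped


# Toy chunks (hypothetical) with a two-character overlap
overlapped_chunks = add_overlap(["abcdef", "ghijkl", "mnop"], overlap=2)
print(overlapped_chunks)
```

With this approach a chunk can exceed the configured chunk size by up to `overlap` characters, so the chunk size would need to be reduced accordingly if a hard limit matters.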

### Creation of the vector store and showcase of the query

In [11]:
# Get all of the raw text 
raw_text = DirectoryReader('data/').load_data()

# Character splitter initialization
text_splitter = RecursiveTextSplitter(chunk_size=1000)

# Split the documents into chunks
data = text_splitter.split_text(raw_text)

# Initialize an instance of a vector store without persistence
db = VectorStore(model_name="nomic-ai/nomic-embed-text-v1.5", persist=False)

# Create the vector store
db.create_vector_store(data)

# Define a query and search for it in the vector store
query = "When does grooking happen ?"
similar_vectors = db.query_similar_vectors(query, top_n = 3)
db.print_similar_vectors(similar_vectors)

File type not supported for: sample.csv


<All keys matched successfully>


Vector store created successfully

Similar Text Retrived:
___________________________________

- Retrieved Text: approximately half of them achieved high validation accuracy. We then used the method described in Keskar et al. (2016) to calculate the sharpness approximation value, φ. We found that the validation accuracy and the φscore across our trained networks had Spearman correlation coefﬁcient of −0.79548 (signiﬁcant with p<0.000014 ). This is suggestive that grokking may only happen after the network’s parameters are in ﬂatter regions of the loss landscape. It would be valuable for future work to explore this hypothesis, as well as test other generalization measures. Figure 7: Networks trained on the S5composition objective appear to only grok in relatively ﬂat regions of the loss landscape. 10
    Similarity Score: 0.5271473

- Retrieved Text: binary op tables. •We show that, long after severely overﬁtting, validation accuracy sometimes su

### Llama 3 without RAG

Running the model on the query alone gives poor results that are not grounded in our data.

In [12]:
stream = ollama.chat(
    model='llama3',
    messages=[{'role': 'user', 'content': f"{query}"}],
    stream=True,
)

for chunk in stream:
  print(chunk['message']['content'], end='', flush=True)

I think there may be a bit of confusion here!

Grooking is not actually a real thing that happens. It seems you might have made a typo or created a made-up word.

If you meant to ask about something else, feel free to clarify or rephrase your question, and I'll do my best to help!

### Llama 3 and Mistral with RAG

Passing the retrieved context to the model gives much more accurate results, which is what we were looking for. The same prompt is run with both `llama3` and `mistral`.

In [13]:
# Keep only the retrieved text, label each chunk as "Source i" and join the chunks with blank lines
context = '\n\n'.join([f"Source {i}: {text}" for i, (text, _) in enumerate(similar_vectors)])

stream = ollama.chat(
    model='llama3',
    messages=[{'role': 'user', 'content': f"Question: {query}\n\n Context: {context}"}],
    stream=True,
)

for chunk in stream:
  print(chunk['message']['content'], end='', flush=True)

Grokking happens when a neural network suddenly begins to increase its validation accuracy from chance level towards perfect generalization, even after severe overfitting. This phenomenon was observed in the studies referenced, particularly in Figure 1, where an example of this sudden improvement is shown.

In [14]:
# Keep only the retrieved text, label each chunk as "Source i" and join the chunks with blank lines
context = '\n\n'.join([f"Source {i}: {text}" for i, (text, _) in enumerate(similar_vectors)])

stream = ollama.chat(
    model='mistral',
    messages=[{'role': 'user', 'content': f"Question: {query}\n\n Context: {context}"}],
    stream=True,
)

for chunk in stream:
  print(chunk['message']['content'], end='', flush=True)

 Based on the context provided in the sources you have given, it appears that "grokking" is a phenomenon where validation accuracy suddenly increases toward perfect generalization after severely overfitting the data. The exact timing of when this happens is not explicitly stated in the sources, but they suggest that it may occur long after the network's parameters are in flatter regions of the loss landscape and require optimization for generalization to become effective. However, further research is needed to explore this hypothesis and test other generalization measures as suggested by the authors.
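
The two RAG cells above build the same prompt inline. As a small sketch, the prompt assembly could be factored into a helper so that it can be inspected and tested without calling a model. The name `build_rag_prompt` is hypothetical, and the exact whitespace of the template is an assumption that differs slightly from the cells above.

```python
def build_rag_prompt(query, retrieved):
    """Format a question and a list of (text, score) pairs into one prompt string."""
    # Label each retrieved chunk and separate the chunks with blank lines
    context = "\n\n".join(
        f"Source {i}: {text}" for i, (text, _score) in enumerate(retrieved)
    )
    return f"Question: {query}\n\nContext: {context}"


# Toy retrieval results (hypothetical) in the same (text, score) shape
# returned by query_similar_vectors
prompt = build_rag_prompt("q", [("a", 0.9), ("b", 0.5)])
print(prompt)
```

The returned string would then be passed as the `content` of the user message, in the same way the cells above pass their f-string.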