# Vector Store Implementation

The following code is an implementation written for the final exam of the Information Retrieval course.

**Author:** *Jacopo Zacchigna*

---

This notebook implements a vector store.
The code is structured around two main classes:

- Index
- VectorStore

Two additional classes help load the data and retrieve relevant information:

- TextSplitter
- Retriever

---

##### The text for the demo is from:

http://ir.dcs.gla.ac.uk/resources/test_collections/time/

### Import external libraries

In [1]:
from sentence_transformers import SentenceTransformer
import numpy as np
import pandas as pd
import pickle
import os

# Pretty print
import termcolor

## Index (Helper class)

Implementation of a helper class, `Index`, which is used by the vector store.

- **add_items:** adds the vectors, keyed by their ids, to the `stored_vectors` dictionary
- **knn_query:** returns the ids of the `top_n` most similar vectors in the store, together with their similarity scores
- **_cosine_similarity:** helper function to compute the cosine similarity

In [2]:
class Index:
    def __init__(self, dim=None):
        self.dim = dim
        
        # Dictionary to store the vectors
        self.stored_vectors = {}

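    def add_items(self, vectors, vectors_id: list):
        """
        Add the vectors to the index, keyed by their ids

        Args:
            vectors: iterable of numpy arrays, each of shape (dim,)
            vectors_id: iterable of unique ids, one per vector
        """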
        for vector_id, vector in zip(vectors_id, vectors):
            if vector.shape != (self.dim,):
                raise ValueError("Vectors must have shape (dim,)")
            self.stored_vectors[vector_id] = vector

    def knn_query(self, query_vector: np.ndarray, top_n: int = 5):
        """
        Find the top n similar vectors to the query vector using cosine similarity.

        Args:
            query_vector (numpy.ndarray): The query vector.
            top_n (int): The number of top similar vectors to return.

        Returns:
            A pair of tuples: the first contains the ids of the top n similar vectors,
            and the second contains the corresponding cosine similarity scores.
        """
        # An empty index has no neighbours; returning early avoids unpacking an empty zip
        if not self.stored_vectors:
            return (), ()

        similarities = [(index, self._cosine_similarity(query_vector, vector)) for index, vector in self.stored_vectors.items()]

        # Sort by similarity (second element of each pair) and keep the first top_n elements,
        # then unpack them into ids and similarity scores
        top_n_indices, top_n_similarities = zip(*sorted(similarities, key=lambda x: x[1], reverse=True)[:top_n])

        return top_n_indices, top_n_similarities
        
    def _cosine_similarity(self, query_vector, vector) -> float:
        """
        Compute the similarity between two vectors

        Args:
            query_vector (numpy.ndarray): The query vector
            vector (numpy.ndarray): The vector to compare

        Returns:
            The dot product of the vectors, normalized by the product of their norms
        """

        dot_product = np.dot(query_vector, vector)
        
        query_vector_norm = np.linalg.norm(query_vector)
        vector_norm = np.linalg.norm(vector)

        # Return the similarity
        return dot_product / (query_vector_norm * vector_norm)
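
The query above sorts every similarity score before keeping the first `top_n`. As a hedged alternative, the sketch below selects the best candidates with `heapq.nlargest`, which avoids the full sort. It is a standalone illustration on plain Python lists, not part of the `Index` class; the function names and the toy data are assumptions made for the example.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_query(stored_vectors, query_vector, top_n=5):
    """Return (ids, similarities) of the top_n stored vectors most similar to the query."""
    scored = ((vector_id, cosine_similarity(query_vector, vector))
              for vector_id, vector in stored_vectors.items())
    # nlargest keeps only top_n candidates at a time instead of sorting every score
    best = heapq.nlargest(top_n, scored, key=lambda pair: pair[1])
    ids = tuple(vector_id for vector_id, _ in best)
    similarities = tuple(score for _, score in best)
    return ids, similarities


# Hypothetical toy index with three 2-dimensional vectors
toy_vectors = {10: [1.0, 0.0], 11: [0.0, 1.0], 12: [1.0, 1.0]}
top_ids, top_scores = knn_query(toy_vectors, [1.0, 0.0], top_n=2)
print(top_ids)
```

Both versions are brute-force scans over all stored vectors; only the selection of the best `top_n` differs.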

## Vector Store class

This is the main part of the code which implements the vector store.

*It implements the following methods, with some additional functionality and basic error handling:*

- **_load_vector_store:** loads the index and sentences

- **save_vector_store:** saves the index and sentences to the specified directory

- **create_vector_store:** adds vectors to the vector store

- **update_vector_store:** updates the existing vector store with new vectors

- **delete_vector_store:** deletes a persistent vector store

- **get_similar_vectors:** finds similar vectors to the query vector based on cosine similarity

In [3]:
class VectorStore:
    def __init__(self, vector_dimension=None, persist=True, persist_path="vector_store"):

        if vector_dimension is None:
            raise ValueError("You should pass the vector size")

        # Initialize the index
        self.index = Index(dim=vector_dimension)
        self.persist = persist
        self.persist_path = persist_path
        self.vector_dimension = vector_dimension
    
        # Counter to then store the ids of vectors
        self.id_counter = 0
        
        # Dictionary to store sentences corresponding to vectors
        self.sentences = {}

    def _load_vector_store(self):
        index_file = os.path.join(self.persist_path, "index.pkl")
        sentences_file = os.path.join(self.persist_path, "sentences.pkl")
        
        if not os.path.exists(index_file) or not os.path.exists(sentences_file):
            raise FileNotFoundError("Index and sentences files not found in the specified directory.")

        with open(index_file, "rb") as f:
            self.index = pickle.load(f)
        with open(sentences_file, "rb") as f:
            self.sentences = pickle.load(f)

        return self.index, self.sentences

    def save_vector_store(self):
        """
        Save the index and corresponding sentences
        """
        
        # Create the directory if it doesn't exist
        os.makedirs(self.persist_path, exist_ok=True)
        
        # Serialize and save the index
        with open(os.path.join(self.persist_path, "index.pkl"), "wb") as f:
            pickle.dump(self.index, f)

        # Serialize and save the sentences
        with open(os.path.join(self.persist_path, "sentences.pkl"), "wb") as f:
            pickle.dump(self.sentences, f)
            
    def create_vector_store(self, new_sentence_vectors):
        """
        Add vectors to the vector store

        new_sentence_vectors: dictionary mapping each sentence to its vector;
        every vector is assigned a unique integer id
        """
        try:
            vectors = []
            ids = []
            for sentence, vector in new_sentence_vectors.items():
                # Append the new vector
                vectors.append(vector)
                # Assign a unique integer id to every vector
                ids.append(self.id_counter)
                # Store the sentence
                self.sentences[self.id_counter] = sentence
                # Increment the counter for the next vector
                self.id_counter += 1

            # Adding the items to the index
            self.index.add_items(vectors, ids)
    
            if self.persist:                
                self.save_vector_store()
                
            print("Vector store created successfully", end="\n\n")
            
        except Exception as e:
            raise e

    def update_vector_store(self, new_sentence_vectors):
        """
        Update the existing vector store with new vectors

        new_sentence_vectors: dictionary mapping each new sentence to its vector
        """
        try:
            # Load the existing index and sentences when the store is persisted
            if self.persist:
                self._load_vector_store()

            # Continue the ids after the largest one already in use
            self.id_counter = max(self.sentences.keys(), default=-1) + 1

            # Add new vectors to the index and sentences
            vectors = []
            ids = []
            for sentence, vector in new_sentence_vectors.items():
                vectors.append(vector)
                ids.append(self.id_counter)
                self.sentences[self.id_counter] = sentence
                self.id_counter += 1

            # Add the vectors and their ids to the index
            self.index.add_items(vectors, ids)

            # Save the updated store so that later loads see the new vectors
            if self.persist:
                self.save_vector_store()

            print("Vector store updated successfully", end="\n\n")
        except Exception as e:
            raise e

    def delete_vector_store(self) -> None:
        """
        Delete a persistent vector store that was created
        """
        
        try:
            # Check if the directory exists
            if os.path.exists(self.persist_path):
                # Delete index and sentences files
                os.remove(os.path.join(self.persist_path, "index.pkl"))
                os.remove(os.path.join(self.persist_path, "sentences.pkl"))
                print("Vector store deleted successfully", end="\n\n")
            else:
                print("Vector store does not exist", end="\n\n")
        except Exception as e:
            raise e

    def get_similar_vectors(self, query_vector, top_n=5) -> list:
        """
        Find similar vectors to the query vector

        Args:
            query_vector (numpy.ndarray): The query vector to compare with the vectors in the store
            top_n (int): The number of similar vectors to return

        Returns:
            A list of tuples, each containing a sentence and its similarity to the query vector
        """
        if self.persist:
            # Load existing index and sentences
            self._load_vector_store()

        labels, similarities = self.index.knn_query(query_vector, top_n=top_n)

        similar_vectors = [(self.sentences[label], similarity) for label, similarity in zip(labels, similarities)]

        return similar_vectors
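
The update step depends on ids continuing from the largest id already stored after the pickled files are reloaded. The sketch below isolates that idea for the sentences mapping only. The helper name `append_sentences` and the temporary directory are assumptions made for illustration, not part of the class above.

```python
import os
import pickle
import tempfile


def append_sentences(path, new_sentences):
    """Load the pickled id -> sentence mapping, append new sentences, and save it again."""
    sentences = {}
    if os.path.exists(path):
        with open(path, "rb") as f:
            sentences = pickle.load(f)

    # Continue numbering after the largest id already stored (0 for an empty store)
    next_id = max(sentences, default=-1) + 1
    for sentence in new_sentences:
        sentences[next_id] = sentence
        next_id += 1

    with open(path, "wb") as f:
        pickle.dump(sentences, f)
    return sentences


# Hypothetical round trip in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    store_file = os.path.join(tmp_dir, "sentences.pkl")
    append_sentences(store_file, ["first", "second"])
    reloaded = append_sentences(store_file, ["third"])

print(reloaded)
```

Note that `pickle.load` can execute arbitrary code, so a persisted store should only be loaded from a trusted location.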

## Demo of the vector store

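The demo uses the `nomic-ai/nomic-embed-text-v1.5` embedding model together with the custom index defined above.

#### TextSplitter and Retriever classes

`TextSplitter` reads a semicolon-delimited CSV file and splits its rows into two random subsets (80/20 by default). `Retriever` wraps the embedding model and the vector store.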

In [4]:
class TextSplitter:
    def __init__(self, data_path, split_ratio=0.8, random_state=200):
        self.data = pd.read_csv(data_path, delimiter=";")
        self.data_1 = self.data.sample(frac=split_ratio, random_state=random_state)
        self.data_2 = self.data.drop(self.data_1.index)

    # text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=10)
    # Function to split text into chunks...

    # Function to split in 80/20
    def get_data_split(self):
        return self.data_1["text"].tolist(), self.data_2["text"].tolist()

    """
    # Split the text into chunks
    texts = text_splitter.split_text(pdf_text)

    # Create a metadata for each chunk
    metadatas = [{"source": f"{i}-pl"} for i in range(len(texts))]

    # Create a Chroma vector store
    embeddings = OllamaEmbeddings(model="nomic-embed-text")
    docsearch = await cl.make_async(Chroma.from_texts)(
        texts, embeddings, metadatas=metadatas
    )
    """

class Retriever:
    def __init__(self, model_name="nomic-ai/nomic-embed-text-v1.5", persist=True):
        self.model = SentenceTransformer(model_name, trust_remote_code=True)
        self.persist = persist

    def create_vector_store(self, data):
        vectors = self.model.encode(data)
        vector_dimension = len(vectors[0])
        
        # You can also change the persist path
        self.vector_store = VectorStore(vector_dimension=vector_dimension, persist=self.persist)
        new_sentence_vectors = {data[i]: vectors[i] for i in range(len(data))}
        self.vector_store.create_vector_store(new_sentence_vectors)

    def update_vector_store_data(self, data):
        vectors = self.model.encode(data)
        new_sentence_vectors = {data[i]: vectors[i] for i in range(len(data))}
        self.vector_store.update_vector_store(new_sentence_vectors)
        return new_sentence_vectors

    def query_similar_vectors(self, query, top_n=5):
        query_vector = self.model.encode(query)
        similar_vectors = self.vector_store.get_similar_vectors(query_vector, top_n=top_n)
        return similar_vectors

    def print_similar_vectors(self, similar_vectors):
        print("Similarity Vectors:")
        for sentence, similarity_score in similar_vectors:
            print(termcolor.colored(f"- Sentence: {sentence}", "green", "on_grey", ["bold"]))
            print(termcolor.colored(f"  Similarity Score: {similarity_score}", "yellow", "on_grey", ["bold"]))
            print()
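
The commented-out lines in `TextSplitter` hint at splitting text into chunks with `chunk_size=100` and `chunk_overlap=10`, but that step is not implemented in the notebook. A minimal sketch of one way to do it is shown below, assuming a plain fixed-size character window. This is a simpler technique than a recursive, separator-aware splitter, and the function name `split_text` is hypothetical.

```python
def split_text(text, chunk_size=100, chunk_overlap=10):
    """Split text into chunks of at most chunk_size characters, overlapping by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    # Each new chunk starts chunk_size - chunk_overlap characters after the previous one
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Small values so the overlap is easy to see
chunks = split_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=2)
print(chunks)
```

Each chunk could then be embedded and stored like the sentences in the demo below.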

### Showcase

In [5]:
# Load the text from the CSV file and split it into two parts

text_splitter = TextSplitter("data.csv")
data_1, data_2 = text_splitter.get_data_split()

retriever = Retriever()
retriever.create_vector_store(data_1)

query = "I want to buy a car"
similar_vectors = retriever.query_similar_vectors(query)

retriever.print_similar_vectors(similar_vectors)

<All keys matched successfully>


Vector store created successfully

Similarity Vectors:
- Sentence: The sleek, silver sports car raced down the winding mountain road, its engine roaring with power.
  Similarity Score: 0.5229447484016418

- Sentence: A family sedan cruised along the highway, its occupants singing along to their favorite songs on the radio.
  Similarity Score: 0.5017635822296143

- Sentence: A group of friends revved their engines, ready to hit the open road and leave the city behind.
  Similarity Score: 0.4948986768722534

- Sentence: The Harley-Davidson motorcycle rumbled to life, its deep, throaty growl announcing its presence on the road.
  Similarity Score: 0.4918859899044037

- Sentence: The sound of engines filled the air, a symphony of power and speed that echoed through the streets.
  Similarity Score: 0.4788406491279602

### Updating the vector store

In [6]:
# Update the vector store
new_sentence_vectors_2 = retriever.update_vector_store_data(data_2)

# Query the vector store
query = "I want to buy a cycle"

similar_vectors = retriever.query_similar_vectors(query, top_n=3)
retriever.print_similar_vectors(similar_vectors)

Vector store updated successfully

Similarity Vectors:
- Sentence: And as the sun rises once again, the cycle begins anew, a testament to the beauty and resilience of life.
  Similarity Score: 0.5324117541313171

- Sentence: The Harley-Davidson motorcycle rumbled to life, its deep, throaty growl announcing its presence on the road.
  Similarity Score: 0.521259605884552

- Sentence: In the city, motorcyclists weaved through traffic with ease, their nimble machines darting between cars and buses.
  Similarity Score: 0.48292413353919983



### Adding persistence

In [7]:
# Delete saved vector store
retriever.vector_store.delete_vector_store()

Vector store deleted successfully



### Evaluate the system on a set of test queries.
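
The notebook does not include this evaluation. As a hedged starting point, the sketch below computes precision@k and recall@k averaged over a set of queries. It assumes each query comes with a ranked list of retrieved document ids and a set of relevant ids. The function names and the toy judgments are illustrative assumptions, not results from the test collection.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the first k retrieved ids that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear among the first k retrieved ids."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def evaluate(run, judgments, k):
    """Average precision@k and recall@k over all queries in `run`."""
    precisions = [precision_at_k(docs, judgments.get(q, set()), k) for q, docs in run.items()]
    recalls = [recall_at_k(docs, judgments.get(q, set()), k) for q, docs in run.items()]
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)


# Hypothetical ranked results and relevance judgments for two queries
run = {"q1": ["d3", "d1", "d7", "d2"], "q2": ["d4", "d9"]}
judgments = {"q1": {"d1", "d2", "d5"}, "q2": {"d4"}}

mean_precision, mean_recall = evaluate(run, judgments, k=2)
print(mean_precision, mean_recall)
```

To apply this to the retriever, the ranked lists would come from `query_similar_vectors`, with the stored sentences mapped back to document ids.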