# Vector Store Implementation

The following code is an implementation made for the final exam of the Information Retrieval course.

**Author:** *Jacopo Zacchigna*

---

The notebook is an implementation of a vector store.
The code is structured in multiple classes:

- Index
- VectorStore

There are also additional classes that help to load the data and retrieve interesting information:

- TextLoader
- CharacterTextSplitter

---

##### The text for the demo is from:

http://ir.dcs.gla.ac.uk/resources/test_collections/time/

### Import external libraries

In [1]:
from sentence_transformers import SentenceTransformer
import numpy as np
import pandas as pd
import pickle
import os

# Pretty print
import termcolor

## Index (Helper class)

Implementation of a helper class, `Index`, which is used by the vector store:

- **add_items:** adds the vectors, with their ids, to the `stored_vectors` dictionary
- **knn_query:** gets the `top_n` most similar vectors inside the vector store, with the relative similarity scores
- **_cosine_similarity:** helper function to compute the cosine similarity

In [2]:
class Index:
    def __init__(self, dim=None):
        self.dim = dim
        
        # Dictionary to store the vectors
        self.stored_vectors = {}

    def add_items(self, vectors, vectors_id: list):
        """
        Update the indexing structure for the vector store
        """
        for vector_id, vector in zip(vectors_id, vectors):
            if vector.shape != (self.dim,):
                raise ValueError("Vectors must have shape (dim,)")
            self.stored_vectors[vector_id] = vector

    def knn_query(self, query_vector: np.ndarray, top_n: int = 5):
        """
        Find the top n similar vectors to the query vector using cosine similarity.

        Args:
            query_vector (numpy.ndarray): The query vector.
            top_n (int): The number of top similar vectors to return.

        Returns:
            A tuple of two numpy arrays: the first array contains the indices of the top n similar vectors,
            and the second array contains the corresponding cosine similarity scores.
        """
        similarities = [(index, self._cosine_similarity(query_vector, vector)) for index, vector in self.stored_vectors.items()]

        # Sort based on the similarity (second element of each tuple) and take the first top_n elements
        # Then unpack it into indices and similarities
        top_n_indices, top_n_similarities = zip(*sorted(similarities, key=lambda x: x[1], reverse=True)[:top_n])

        return top_n_indices, top_n_similarities
        
    def _cosine_similarity(self, query_vector, vector) -> float:
        """
        Compute the similarity between two vectors

        Args:
            query_vector (numpy.ndarray): The query vector
            vector (numpy.ndarray): The vector to compare

        Returns:
            The dot product of the vectors, normalized by the product of their norms
        """

        dot_product = np.dot(query_vector, vector)
        
        query_vector_norm = np.linalg.norm(query_vector)
        vector_norm = np.linalg.norm(vector)

        # Return the similarity
        return dot_product / (query_vector_norm * vector_norm)
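
The `knn_query` method above scores one stored vector at a time in a Python loop. As a point of comparison, the same brute-force search can be sketched with NumPy matrix operations. This is a minimal sketch, not part of the notebook: the function name `knn_cosine` and the assumption that all vectors are stacked as rows of one 2-D array are hypothetical, and zero-norm vectors are not handled.

```python
import numpy as np


def knn_cosine(query_vector, matrix, top_n=5):
    """Return (row indices, cosine similarities) of the top_n rows of `matrix`.

    Hypothetical helper: assumes `matrix` has shape (n_vectors, dim)
    and that no row (nor the query) has zero norm.
    """
    # Normalize the rows and the query to unit length
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_unit = query_vector / np.linalg.norm(query_vector)

    # For unit vectors the cosine similarity reduces to a dot product
    similarities = matrix_unit @ query_unit

    # Sort by decreasing similarity and keep the first top_n rows
    order = np.argsort(-similarities)[:top_n]
    return order, similarities[order]


# Toy data: three 2-dimensional vectors
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
indices, scores = knn_cosine(np.array([1.0, 0.0]), vectors, top_n=2)
print(indices, scores)
```

The dictionary-based `Index` keeps arbitrary ids, while this sketch returns row positions, so a mapping from row position to id would still be needed to use it inside the vector store.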

## Vector Store class

This is the main part of the code which implements the vector store.

*It focuses on implementing the following functions, with some additional functionality and basic error handling:*

- **_load_vector_store:** loads the index and sentences

- **_save_vector_store:** saves the index and sentences to the specified directory

- **create_vector_store:** adds vectors to the vector store

- **update_vector_store:** updates the existing vector store with new vectors

- **delete_vector_store:** deletes a persistent vector store

- **query_similar_vectors:** finds similar vectors to the query vector based on cosine similarity

In [3]:
class VectorStore:
    def __init__(self, model_name="nomic-ai/nomic-embed-text-v1.5", persist=True, persist_path="vector_store"):
        self.persist = persist
        self.persist_path = persist_path
        
        # Load the embedding model
        self.model = SentenceTransformer(model_name, trust_remote_code=True)
        
        # Counter to then store the ids of vectors
        self.id_counter = 0

        # Dictionary to store sentences corresponding to vectors
        self.sentences = {}

    # Add type hinting for documents
    @classmethod
    def from_documents(cls, documents, model_name="nomic-ai/nomic-embed-text-v1.5", persist=True, persist_path="vector_store"):
        """
        From documents create the vector store using a defined model, with optional persistence
        """
        # Create an instance of the class
        vector_store = cls(model_name, persist, persist_path)

        # Get the embeddings
        embeddings = vector_store.model.encode(documents)
        vector_store.embeddings_dimension = len(embeddings[0])

        # Create the index
        vector_store.index = Index(dim=vector_store.embeddings_dimension)

        
        # Create a dictionary with the documents and the relative embeddings
        new_documents_embeddings = {documents[i]: embeddings[i] for i in range(len(documents))}

        # Create the vector store
        vector_store.create_vector_store(new_documents_embeddings)

        return vector_store

    def _load_vector_store(self):
        index_file = os.path.join(self.persist_path, "index.pkl")
        sentences_file = os.path.join(self.persist_path, "sentences.pkl")

        if not os.path.exists(index_file) or not os.path.exists(sentences_file):
            raise FileNotFoundError("Index and sentences files not found in the specified directory.")

        with open(index_file, "rb") as f:
            self.index = pickle.load(f)
        with open(sentences_file, "rb") as f:
            self.sentences = pickle.load(f)

        return self.index, self.sentences

    def _save_vector_store(self):
        # Create the directory if it doesn't exist
        os.makedirs(self.persist_path, exist_ok=True)

        # Serialize and save the index
        with open(os.path.join(self.persist_path, "index.pkl"), "wb") as f:
            pickle.dump(self.index, f)

        # Serialize and save the sentences
        with open(os.path.join(self.persist_path, "sentences.pkl"), "wb") as f:
            pickle.dump(self.sentences, f)

    def create_vector_store(self, new_documents_embeddings):
        try:
            vectors = []
            ids = []
            for sentence, vector in new_documents_embeddings.items():
                # Append the new vector
                vectors.append(vector)
                # Assign a unique integer id to every vector
                ids.append(self.id_counter)
                # Store the sentence
                self.sentences[self.id_counter] = sentence
                # Increment the counter for the next vector
                self.id_counter += 1

            # Adding the items to the index
            self.index.add_items(vectors, ids)

            if self.persist:
                self._save_vector_store()

            print("Vector store created successfully", end="\n\n")
        except Exception as e:
            raise e

    def update_vector_store(self, documents):
        """
        Update the existing vector store with new documents

        documents: List of documents to add to my vector store
        """
        embeddings = self.model.encode(documents)
        new_documents_embeddings = {documents[i]: embeddings[i] for i in range(len(documents))}

        try:
            if self.persist:
                # Load existing index and sentences from disk
                self.index, self.sentences = self._load_vector_store()

            # Continue numbering after the largest id already in use
            self.id_counter = max(self.sentences.keys(), default=-1) + 1

            # Add new vectors to the index and sentences
            vectors = []
            ids = []
            for sentence, vector in new_documents_embeddings.items():
                vectors.append(vector)
                ids.append(self.id_counter)
                self.sentences[self.id_counter] = sentence
                self.id_counter += 1

            # Add the new vectors and their ids to the index
            self.index.add_items(vectors, ids)

            if self.persist:
                # Save the updated store: queries reload it from disk
                self._save_vector_store()

            print("Vector store updated successfully", end="\n\n")
        except Exception as e:
            raise e

    def delete_vector_store(self) -> None:
        """
        Delete a persistent vector store that was created
        """
        try:
            # Check if the directory exists
            if os.path.exists(self.persist_path):
                # Delete index and sentences files
                os.remove(os.path.join(self.persist_path, "index.pkl"))
                os.remove(os.path.join(self.persist_path, "sentences.pkl"))
                print("Vector store deleted successfully", end="\n\n")
            else:
                print("Vector store does not exist", end="\n\n")
        except Exception as e:
            raise e

    def query_similar_vectors(self, query: str, top_n=5):
        """
        Find similar vectors to the query

        Args:
            query (str): The query that is going to be searched for inside my vector store
            top_n (int): The number of similar vectors to return

        Returns:
            A list of tuples, each containing a document and its similarity to the query vector
        """
        if self.persist:
            # Load existing index and sentences
            self._load_vector_store()

        # Use the same model to encode the query
        query_vector = self.model.encode(query)
        
        # Query for the top_n most similar vectors to the query vector
        labels, distances = self.index.knn_query(query_vector, top_n=top_n)

        # Return the most similar documents in a list of tuples with (sentence, similarity_score)
        return [(self.sentences[label], distance) for label, distance in zip(labels, distances)]
        
    def print_similar_vectors(self, similar_vectors) -> None:
        """
        Helper function to print the most similar vector with the relative similarity score in a nice way
        """
        
        print("Similarity Vectors:")
        for sentence, similarity_score in similar_vectors:
            print(termcolor.colored(f"- Sentence: {sentence}", "green", "on_grey", ["bold"]))
            print(termcolor.colored(f"  Similarity Score: {similarity_score}", "yellow", "on_grey", ["bold"]))
            print()
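
The persistence logic in `_save_vector_store` and `_load_vector_store` boils down to a pickle round trip of two objects into one directory. The following is a minimal standalone sketch of that idea, assuming plain dictionaries in place of the `Index` object and a temporary directory in place of `persist_path`; the function names `save_store` and `load_store` are hypothetical.

```python
import os
import pickle
import tempfile


def save_store(path, index, sentences):
    # Create the directory if needed, then write one pickle file per object
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "index.pkl"), "wb") as f:
        pickle.dump(index, f)
    with open(os.path.join(path, "sentences.pkl"), "wb") as f:
        pickle.dump(sentences, f)


def load_store(path):
    # Read both objects back in the same order they were written
    with open(os.path.join(path, "index.pkl"), "rb") as f:
        index = pickle.load(f)
    with open(os.path.join(path, "sentences.pkl"), "rb") as f:
        sentences = pickle.load(f)
    return index, sentences


# Round trip with toy data inside a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vector_store")
    save_store(path, {0: [0.1, 0.2]}, {0: "a toy sentence"})
    index, sentences = load_store(path)

print(index, sentences)
```

Note that `pickle.load` can execute arbitrary code, so a store saved this way should only be loaded from a trusted location.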

## Demo of the vector store

Using nomic embed for the demo and a custom index

#### TextLoader and CharacterTextSplitter classes

Implementation of the data loader and of the text splitter with different options

In [4]:
class TextLoader:
    def __init__(self, data_path):
        self.data_path = data_path

    def load_data(self):
        # Load the data from a CSV file
        return pd.read_csv(self.data_path, delimiter=";")

    def split_data(self, split_ratio=0.8, random_state=1337):
        # Load the data
        data = self.load_data()

        # Split them in two splits
        data_1 = data.sample(frac=split_ratio, random_state=random_state)
        data_2 = data.drop(data_1.index)
        
        return data_1["text"].tolist(), data_2["text"].tolist()

class CharacterTextSplitter:
    def __init__(self, chunk_size=100, chunk_overlap=0):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

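    def split_documents(self, raw_documents):
        # Split every document and return the flat list of chunks
        self.data = raw_documents
        return self.split_text()

    def split_text(self):
        # Distance between the starts of two consecutive chunks
        step = self.chunk_size - self.chunk_overlap
        if step <= 0:
            raise ValueError("chunk_overlap must be smaller than chunk_size")

        # Slide a window of chunk_size characters over every document
        self.chunks = [
            text[i:i + self.chunk_size]
            for text in self.data
            for i in range(0, len(text), step)
        ]
        return self.chunks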
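
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal standalone sketch of character-based chunking on a single string. The function `chunk_text` is hypothetical and only illustrates the sliding-window idea; it is not the splitter used in the demo below.

```python
def chunk_text(text, chunk_size=100, chunk_overlap=0):
    """Split `text` into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # Each new chunk starts `step` characters after the previous one
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
# ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

With a non-zero overlap the last chunk can be fully contained in the previous one, as `'ij'` is here; whether to drop such tails is a design choice.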

### Showcase

In [5]:
# Get the raw text
# Split to then use it to update the vector store
data_1, data_2 = TextLoader('data.csv').split_data()

# Create a text splitter
# text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

# Split the documents recursively
# documents_1 = text_splitter.split_documents(data_1)

# Create the vector store from documents
db = VectorStore.from_documents(data_1, model_name="nomic-ai/nomic-embed-text-v1.5")

# Define a query and search for it in my vector store
query = "I want to buy a car"
similar_vectors = db.query_similar_vectors(query)

db.print_similar_vectors(similar_vectors)

<All keys matched successfully>


Vector store created successfully

Similarity Vectors:
- Sentence: A vintage car gleamed in the sunlight, its polished chrome catching the eye of passersby.
  Similarity Score: 0.5708912014961243

- Sentence: The sleek, silver sports car raced down the winding mountain road, its engine roaring with power.
  Similarity Score: 0.5229447484016418

- Sentence: A family sedan cruised along the highway, its occupants singing along to their favorite songs on the radio.
  Similarity Score: 0.5017635822296143

- Sentence: The Harley-Davidson motorcycle rumbled to life, its deep, throaty growl announcing its presence on the road.
  Similarity Score: 0.4918859899044037

- Sentence: The sound of engines filled the air, a symphony of power and speed that echoed through the streets.
  Similarity Score: 0.4788406491279602



### Adding update of the vector store

In [8]:
# Update the vector store
db.update_vector_store(data_2)

# Query the vector store
query = "I want to buy a cycle"

similar_vectors = db.query_similar_vectors(query, top_n=3)
db.print_similar_vectors(similar_vectors)

Vector store updated successfully

Similarity Vectors:
- Sentence: And as the sun rises once again, the cycle begins anew, a testament to the beauty and resilience of life.
  Similarity Score: 0.5324117541313171

- Sentence: The Harley-Davidson motorcycle rumbled to life, its deep, throaty growl announcing its presence on the road.
  Similarity Score: 0.521259605884552

- Sentence: A vintage car gleamed in the sunlight, its polished chrome catching the eye of passersby.
  Similarity Score: 0.4940513074398041



### Adding persistence

In [9]:
# Delete saved vector store
db.delete_vector_store()

Vector store deleted successfully



### Evaluate the system on a set of test queries.
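
One possible way to carry out such an evaluation is to compare the ids returned for each test query against a set of relevance judgments. The sketch below computes precision@k and recall@k for a single query. The function names, the document ids and the relevance set are all hypothetical placeholders, not results from this notebook.

```python
def precision_at_k(retrieved, relevant, k):
    # Fraction of the first k retrieved documents that are relevant
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved, relevant, k):
    # Fraction of all relevant documents found in the first k results
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


# Hypothetical ranking returned for one query, best match first
retrieved = ["d3", "d1", "d7", "d2"]
# Hypothetical relevance judgments for the same query
relevant = {"d1", "d2", "d5"}

p3 = precision_at_k(retrieved, relevant, k=3)
r3 = recall_at_k(retrieved, relevant, k=3)
p4 = precision_at_k(retrieved, relevant, k=4)
r4 = recall_at_k(retrieved, relevant, k=4)
print(p3, r3, p4, r4)
```

Averaging these values over all test queries would give a single score for the system. Here precision@k divides by `k` even when fewer than `k` documents are returned, which is one of several conventions.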