## Demo of implementing RAG
This Jupyter notebook is for an initial implementation of RAG, following the tutorial from [Hugging Face](https://huggingface.co/blog/ngxson/make-your-own-rag).

First, we load our dataset.

In [1]:
dataset = []
with open('cat-facts.txt', 'r') as file:
  dataset = file.readlines()
  print(f'Loaded {len(dataset)} entries')

dataset[0:5]  # Display the first 5 entries

Loaded 151 entries


['On average, cats spend 2/3 of every day sleeping. That means a nine-year-old cat has been awake for only three years of its life.\n',
 'Unlike dogs, cats do not have a sweet tooth. Scientists believe this is due to a mutation in a key taste receptor.\n',
 'When a cat chases its prey, it keeps its head level. Dogs and humans bob their heads up and down.\n',
 'The technical term for a cat’s hairball is a “bezoar.”\n',
 'A group of cats is called a “clowder.”\n']

## Implementing vector database

To use vector similarity search, we first have to convert our plain text into embedding vectors. Keyword search is not a viable alternative here, because it only matches chunks that share exact words with the query and misses paraphrases.

To do this, we need to select the models we are going to work with. For convenience, we will run them through Ollama, which is free and can be run locally with minimal setup.
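As a small illustration of why keyword search falls short, the sketch below counts the words a query shares with a chunk. The helper `keyword_overlap` and both example queries are hypothetical, introduced only for this demonstration; the chunk is shortened from the first entry of the dataset.

```python
import re

def keyword_overlap(query, chunk):
    """Count the distinct lowercase word tokens shared by query and chunk."""
    def tokenize(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(tokenize(query) & tokenize(chunk))

chunk = 'On average, cats spend 2/3 of every day sleeping.'

# A paraphrased query shares no words with the chunk, so keyword search misses it.
print(keyword_overlap('How long do felines nap?', chunk))
# Only a query that happens to reuse the chunk's wording gets a match.
print(keyword_overlap('How long do cats spend sleeping?', chunk))
```

An embedding model is intended to place both queries close to the chunk, since they ask nearly the same thing.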

In [2]:
import ollama

EMBEDDING_MODEL = 'hf.co/CompendiumLabs/bge-base-en-v1.5-gguf'
LANGUAGE_MODEL = 'hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF'

In [3]:
# Each element in the VECTOR_DB will be a tuple (chunk, embedding)
VECTOR_DB = []

def add_chunk_to_database(chunk):
  embedding = ollama.embed(model=EMBEDDING_MODEL, input=chunk)['embeddings'][0]
  VECTOR_DB.append((chunk, embedding))

We will treat each line of our dataset as one chunk. Let us now calculate the embeddings and add them to the `VECTOR_DB` list. If a previously saved `vector_db.txt` exists, we load it instead of recomputing the embeddings.

In [4]:
# Let us check if the vectorization has already been done
try:
  with open('vector_db.txt', 'r') as file:
    for line in file:
      chunk, embedding_str = line.strip().rsplit('\t', 1)  # split on the last tab only, in case the chunk itself contains a tab
      embedding = list(map(float, embedding_str.split(',')))
      VECTOR_DB.append((chunk, embedding))
  print(f'Loaded {len(VECTOR_DB)} entries from vector_db.txt')
  vectorized_dataset_loaded = True
except FileNotFoundError:
  print('vector_db.txt not found, proceeding to vectorize the dataset')
  for i, chunk in enumerate(dataset):
    add_chunk_to_database(chunk)
    print(f'Added chunk {i+1}/{len(dataset)} to the database')
  vectorized_dataset_loaded = False
    
  print(f'Added {len(VECTOR_DB)} chunks to the database')

Loaded 151 entries from vector_db.txt


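Let us save our vectorized database, so that later runs can load it instead of recomputing the embeddings. First, a quick check of how the vectors are stored: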

In [5]:
print(f"The vectors in this list of tuples are {(type(VECTOR_DB[0][1]))}")

The vectors in this list of tuples are <class 'list'>


In [6]:
# Save the vector database to a txt file
if not vectorized_dataset_loaded:
  print('Saving the vector database to vector_db.txt')
  with open('vector_db.txt', 'w') as file:
    for chunk, embedding in VECTOR_DB:
      file.write(f"{chunk.strip()}\t{','.join(map(str, embedding))}\n")
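The plain-text format above needs hand-written parsing of the embedding string when loading. As a possible alternative, here is a minimal sketch that stores one JSON object per line and lets the `json` module handle escaping and float parsing. The function names `save_vector_db` and `load_vector_db` and the file name `vector_db.jsonl` are assumptions made for this example, not part of the tutorial.

```python
import json
import os
import tempfile

def save_vector_db(vector_db, path):
    """Write each (chunk, embedding) pair as one JSON object per line."""
    with open(path, 'w', encoding='utf-8') as file:
        for chunk, embedding in vector_db:
            file.write(json.dumps({'chunk': chunk, 'embedding': embedding}) + '\n')

def load_vector_db(path):
    """Read the pairs back as a list of (chunk, embedding) tuples."""
    vector_db = []
    with open(path, 'r', encoding='utf-8') as file:
        for line in file:
            record = json.loads(line)
            vector_db.append((record['chunk'], record['embedding']))
    return vector_db

# Round trip with toy data (real embeddings have hundreds of dimensions).
toy_db = [('A group of cats is called a clowder.', [0.1, -0.2, 0.3])]
path = os.path.join(tempfile.gettempdir(), 'vector_db.jsonl')
save_vector_db(toy_db, path)
restored = load_vector_db(path)
print(restored == toy_db)
```

Because `json.dumps` escapes tabs and newlines inside strings, each record stays on a single line regardless of what the chunk contains.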


## Information Retrieval
Now we want to implement a function that retrieves the chunks closest to the query, so that they can be passed to the LLM.<br>
First, we need a function that calculates cosine similarity.<br>
Let's use scikit-learn's optimized implementation (even though, for this demo, speed does not really matter).

In [7]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def cosine_sim_sklearn_vecs(a, b):
    """
    Calculate cosine similarity between two vectors using sklearn.
    """
    a = np.array(a).reshape(1, -1)
    b = np.array(b).reshape(1, -1)
    return cosine_similarity(a, b)[0, 0]
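
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths: $\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$. The pure-Python sketch below spells this out. The name `cosine_sim_plain` is made up for this example, and the function is meant as a sanity check for the scikit-learn version, not as a replacement.

```python
import math

def cosine_sim_plain(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|).

    Assumes both vectors have the same length and neither is all zeros.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Orthogonal vectors have nothing in common.
print(cosine_sim_plain([1.0, 0.0], [0.0, 1.0]))
# Vectors pointing in the same direction are maximally similar, whatever their length.
print(cosine_sim_plain([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Only the direction of the vectors matters, not their magnitude, which is why cosine similarity is a common choice for comparing embeddings.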


Now let's implement retrieval.

In [8]:
def retrieve(query, top_n=3):
    # Calculate the embedding for the query
    query_emb = ollama.embed(model=EMBEDDING_MODEL, input=query)['embeddings'][0]
    # Calculate cosine similarity for each chunk in the vector database
    similarities = [(chunk, cosine_sim_sklearn_vecs(query_emb, emb)) for chunk, emb in VECTOR_DB]
    # Sort the similarities in descending order
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:top_n]


In [9]:
# input_query = input('Ask me a question: ')
input_query = "I'm George. I have a pet called Muezza. What is my favorite type of cat?"
print(f'Input query: {input_query}')
retrieved_knowledge = retrieve(input_query)

print('Retrieved knowledge:')
for chunk, similarity in retrieved_knowledge:
    print(f' - (similarity: {similarity:.2f}) {chunk}')

instruction_prompt = (
    "You are a helpful chatbot.\n"
    "Use only the following pieces of context to answer the question. "
    "Don't make up any new information:\n"
    + "\n".join([f' - {chunk}' for chunk, _ in retrieved_knowledge])
)

Input query: I'm George. I have a pet called Muezza. What is my favorite type of cat?
Retrieved knowledge:
 - (similarity: 0.79) Mohammed loved cats and reportedly his favorite cat, Muezza, was a tabby. Legend says that tabby cats have an “M” for Mohammed on top of their heads because Mohammad would often rest his hand on the cat’s head.
 - (similarity: 0.76) If you name is George, you are more likely to have parrots as pets. However, nonetheless, your favorite type of cat is probably going to be a persian cat.
 - (similarity: 0.67) The most popular pedigreed cat is the Persian cat, followed by the Main Coon cat and the Siamese cat.

As you can see, this simple retrieval approach works relatively well. However, it can fail to surface the most relevant information, especially when chunks contain several similar pieces of information.

Notice that the input query states that the user is George, and the 'database' contains information about George's favorite cat. However, the top-ranked chunk is about Mohammed, not George.

Let us see how the chatbot responds to the prompt.

In [10]:
stream = ollama.chat(
  model=LANGUAGE_MODEL,
  messages=[
    {'role': 'system', 'content': instruction_prompt},
    {'role': 'user', 'content': input_query},
  ],
  stream=True,
)

response = ""
for chunk in stream:
    response += chunk['message']['content']
print("Chatbot response:")
print(response)


Chatbot response:
You are correct that your name is George, not Mohammed.

Since you mentioned that Muezza is a tabby cat, it's likely that your favorite type of cat is a Persian cat. That's in line with the popular pedigreed cats mentioned: Persians, Main Coon cats, and Siamese cats.


## Improving Retrieval with Reranking
While cosine similarity is a good first step, as mentioned, it can struggle with more information-dense chunks, even though in this case we still got the correct answer (most of the time).

To keep the efficiency of the cosine similarity approach, we will still use it to select the top 5 chunks, and then rerank those candidates and keep the 3 most relevant ones. Because the reranker only has to score a handful of candidates, we can afford to use a more computationally demanding model for this step.

We will use a cross-encoder reranker (e.g., from Hugging Face Transformers) to score each (query, chunk) pair and sort the results accordingly.

In [11]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch


In [12]:
# Load a cross-encoder reranker model.
# Note: 'cross-encoder/ms-marco-MiniLM-L-6-v2' specifically did not work for us on an M2 machine,
# so we use 'cross-encoder/ms-marco-MiniLM-L-12-v2' instead (this took a while to find out).
reranker_model_name = 'cross-encoder/ms-marco-MiniLM-L-12-v2'
reranker_tokenizer = AutoTokenizer.from_pretrained(reranker_model_name)
reranker_model = AutoModelForSequenceClassification.from_pretrained(reranker_model_name)

def rerank(query, retrieved_chunks, top_k=3):
    pairs = [(query, chunk) for chunk, _ in retrieved_chunks]
    inputs = reranker_tokenizer.batch_encode_plus(pairs, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        scores = reranker_model(**inputs).logits.squeeze(-1).tolist()
    reranked = sorted(zip([chunk for chunk, _ in retrieved_chunks], scores), key=lambda x: x[1], reverse=True)
    return reranked[:top_k]

In [13]:
# Retrieve and rerank
input_query = "I'm George. I have a pet called Muezza. What is my favorite type of cat?"
retrieved_knowledge = retrieve(input_query, top_n=5)  # Retrieve more chunks to rerank
reranked_knowledge = rerank(input_query, retrieved_knowledge)

print('Reranked knowledge:')
for chunk, score in reranked_knowledge:
    print(f' - (score: {score:.2f}) {chunk}')

instruction_prompt = (
    "You are a helpful chatbot.\n"
    "Use only the following pieces of context to answer the question. "
    "Don't make up any new information:\n"
    + "\n".join([f' - {chunk}' for chunk, _ in reranked_knowledge])
)

Reranked knowledge:
 - (score: 5.13) If you name is George, you are more likely to have parrots as pets. However, nonetheless, your favorite type of cat is probably going to be a persian cat.
 - (score: 4.26) Mohammed loved cats and reportedly his favorite cat, Muezza, was a tabby. Legend says that tabby cats have an “M” for Mohammed on top of their heads because Mohammad would often rest his hand on the cat’s head.
 - (score: -5.20) The most popular pedigreed cat is the Persian cat, followed by the Main Coon cat and the Siamese cat.
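
The reranker scores are raw logits, so unlike cosine similarities they are unbounded and can be negative. If a value between 0 and 1 is easier to read, one option is to pass each logit through a sigmoid, which leaves the ranking unchanged because the function is strictly increasing. The helper `sigmoid` below is an illustrative addition, and reading its output as a relevance probability is an assumption about how this model was trained.

```python
import math

def sigmoid(logit):
    """Map a real-valued logit to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

# Scores copied from the reranked output above.
scores = [5.13, 4.26, -5.20]
for score in scores:
    print(f'logit {score:+.2f} -> {sigmoid(score):.3f}')
```

On this scale the first two chunks land close to 1 and the third close to 0, which makes the gap between them easier to see than in the raw logits.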


In [14]:
stream = ollama.chat(
  model=LANGUAGE_MODEL,
  messages=[
    {'role': 'system', 'content': instruction_prompt},
    {'role': 'user', 'content': input_query},
  ],
  stream=True,
)

response = ""
for chunk in stream:
    response += chunk['message']['content']
print("Chatbot response:")
print(response)


Chatbot response:
As George, your favorite type of cat is probably a Persian cat! That's what legend says, anyway...
