## Demo of implementing RAG
This Jupyter notebook contains an initial implementation of retrieval-augmented generation (RAG), following the tutorial from [Hugging Face](https://huggingface.co/blog/ngxson/make-your-own-rag).

We begin by loading our dataset, a plain-text file with one cat fact per line.

In [1]:
dataset = []
with open('cat-facts.txt', 'r') as file:
  dataset = file.readlines()
  print(f'Loaded {len(dataset)} entries')

dataset[0:5]  # Display the first 5 entries

Loaded 150 entries


['On average, cats spend 2/3 of every day sleeping. That means a nine-year-old cat has been awake for only three years of its life.\n',
 'Unlike dogs, cats do not have a sweet tooth. Scientists believe this is due to a mutation in a key taste receptor.\n',
 'When a cat chases its prey, it keeps its head level. Dogs and humans bob their heads up and down.\n',
 'The technical term for a cat’s hairball is a “bezoar.”\n',
 'A group of cats is called a “clowder.”\n']

## Implementing vector database

We have to convert our plain text to vectors so that we can use vector similarity search. Keyword search is not a viable alternative here, because a query rarely shares its exact words with the relevant fact.
<br><br>
To do this, we first need to select the models we are going to work with. For convenience, we will use models served through Ollama, as they are free and can be run locally with minimal setup.

In [2]:
import ollama

EMBEDDING_MODEL = 'hf.co/CompendiumLabs/bge-base-en-v1.5-gguf'
LANGUAGE_MODEL = 'hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF'
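To illustrate why keyword search falls short, here is a minimal sketch that counts shared words between a query and one of the facts from the dataset. The helper `keyword_overlap` is hypothetical and not part of the tutorial; it only shows that a paraphrased query shares almost no words with the relevant fact, which is the gap that embeddings are meant to close.

```python
import re

def keyword_overlap(query: str, text: str) -> int:
    """Count the distinct lowercase words shared by query and text (hypothetical helper)."""
    def tokenize(s: str) -> set:
        return set(re.findall(r"[a-z0-9]+", s.lower()))
    return len(tokenize(query) & tokenize(text))

fact = "Most cats give birth to a litter of between one and nine kittens."

# A paraphrased query shares only the word "a" with the fact
paraphrased = keyword_overlap("how many babies does a feline have", fact)
# A query that reuses the fact's wording shares three words
literal = keyword_overlap("litter of kittens", fact)
print(paraphrased, literal)
```

Both queries ask about the same fact, yet only the second one would rank well under keyword matching.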

In [3]:
# Each element in the VECTOR_DB will be a tuple (chunk, embedding)
VECTOR_DB = []

def add_chunk_to_database(chunk):
  embedding = ollama.embed(model=EMBEDDING_MODEL, input=chunk)['embeddings'][0]
  VECTOR_DB.append((chunk, embedding))

We will treat each line of our dataset as one chunk. Let us now calculate the embeddings and add them to the `VECTOR_DB` list. If a cached copy already exists on disk, we load it instead of recomputing.

In [5]:
# Check whether the vectorization has already been done and cached on disk
VECTOR_DB.clear()  # start empty so that re-running this cell does not append duplicates
try:
  with open('vector_db.txt', 'r') as file:
    for line in file:
      # Split on the last tab only, in case a chunk itself contains a tab
      chunk, embedding_str = line.strip().rsplit('\t', 1)
      embedding = list(map(float, embedding_str.split(',')))
      VECTOR_DB.append((chunk, embedding))
  print(f'Loaded {len(VECTOR_DB)} entries from vector_db.txt')
  vectorized_dataset_loaded = True
except FileNotFoundError:
  print('vector_db.txt not found, proceeding to vectorize the dataset')
  for i, chunk in enumerate(dataset):
    add_chunk_to_database(chunk)
    print(f'Added chunk {i+1}/{len(dataset)} to the database')
  vectorized_dataset_loaded = False
  print(f'Added {len(VECTOR_DB)} chunks to the database')

Loaded 300 entries from vector_db.txt


Let us save our vectorized database. First, we check how the embeddings are stored:

In [6]:
print(f"The vectors in this list of tuples are {(type(VECTOR_DB[0][1]))}")

The vectors in this list of tuples are <class 'list'>


In [7]:
# Save the vector database to a txt file
if not vectorized_dataset_loaded:
  print('Saving the vector database to vector_db.txt')
  with open('vector_db.txt', 'w') as file:
    for chunk, embedding in VECTOR_DB:
      file.write(f"{chunk.strip()}\t{','.join(map(str, embedding))}\n")
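The on-disk format used above is one entry per line: the chunk, a tab, then the embedding as comma-separated floats. The following is a minimal sketch of a round trip through that format, using an in-memory buffer instead of `vector_db.txt`. The helper names `serialize_entry` and `parse_entry` and the three-dimensional toy embedding are assumptions made for illustration.

```python
import io

def serialize_entry(chunk, embedding):
    """Format one (chunk, embedding) tuple as 'chunk<TAB>v1,v2,...' plus a newline."""
    return f"{chunk.strip()}\t{','.join(map(str, embedding))}\n"

def parse_entry(line):
    """Invert serialize_entry; split on the last tab in case the chunk contains one."""
    chunk, embedding_str = line.rstrip("\n").rsplit("\t", 1)
    return chunk, [float(x) for x in embedding_str.split(",")]

# Toy entry: a real embedding has hundreds of dimensions, not three
entries = [("A group of cats is called a “clowder.”\n", [0.25, -0.5, 1.0])]

buffer = io.StringIO()
for chunk, embedding in entries:
    buffer.write(serialize_entry(chunk, embedding))

buffer.seek(0)
restored = [parse_entry(line) for line in buffer]
print(restored)
```

Note that the chunk comes back without its trailing newline, because it is stripped before writing.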


## Information Retrieval
Now we want to implement a function that retrieves the chunks closest to the query, so that they can be passed to the LLM.<br>
First we need a function that calculates cosine similarity.<br>
Let's do it in a way that is fast, using scikit-learn (even though this demo does not really need the speed).

In [36]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def cosine_sim_sklearn_vecs(a, b):
    """
    Calculate cosine similarity between two vectors using sklearn.
    """
    a = np.array(a).reshape(1, -1)
    b = np.array(b).reshape(1, -1)
    return cosine_similarity(a, b)[0, 0]
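If speed did matter, most of the gain would come from scoring all chunks at once rather than one pair at a time. As a rough sketch, the cosine similarity between a query and every row of an embedding matrix can be computed with a single matrix product in NumPy. The function name `cosine_sim_batch` and the two-dimensional toy vectors are assumptions, and the sketch assumes that no vector is all zeros.

```python
import numpy as np

def cosine_sim_batch(query_vec, matrix):
    """Cosine similarity between one query vector and every row of matrix.

    Assumes all vectors are non-zero (a zero vector would cause a division by zero).
    """
    query = np.asarray(query_vec, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    dots = matrix @ query  # one dot product per row
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    return dots / norms

# Toy 2-D "embeddings": same direction, orthogonal, and 45 degrees apart
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sims = cosine_sim_batch([1.0, 0.0], toy_embeddings)
print(sims)
```

With the notebook's data, the matrix would be built once from the embeddings stored in `VECTOR_DB`.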


Now let's implement retrieval

In [42]:
def retrieve(query, top_n=3):
    # Calculate the embedding for the query
    query_emb = ollama.embed(model=EMBEDDING_MODEL, input=query)['embeddings'][0]
    # Calculate cosine similarity for each chunk in the vector database
    similarities = [(chunk, cosine_sim_sklearn_vecs(query_emb, emb)) for chunk, emb in VECTOR_DB]
    # Sort the similarities in descending order
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:top_n]
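The ranking step can be tried without a running Ollama server by passing in a precomputed query embedding. The sketch below does that with toy two-dimensional vectors and uses `heapq.nlargest` to pick the best matches without sorting the whole list. The names `top_n_chunks` and `toy_db` are hypothetical, and the vectors are made up for illustration.

```python
import heapq
import numpy as np

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_n_chunks(query_emb, vector_db, top_n=3):
    """Return the top_n (chunk, similarity) pairs, most similar first."""
    scored = ((chunk, cosine(query_emb, emb)) for chunk, emb in vector_db)
    return heapq.nlargest(top_n, scored, key=lambda pair: pair[1])

# Toy database of (chunk, embedding) tuples, in the same shape as VECTOR_DB
toy_db = [
    ("litter fact", [1.0, 0.0]),
    ("sleep fact", [0.0, 1.0]),
    ("kitten record fact", [1.0, 1.0]),
]

top = top_n_chunks([1.0, 0.2], toy_db, top_n=2)
for chunk, similarity in top:
    print(f"{similarity:.2f} {chunk}")
```

For 150 chunks the difference from a full sort is negligible; the main benefit here is that the ranking logic can be checked in isolation.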


In [53]:
input_query = input('Ask me a question: ')
print(f'Input query: {input_query}')
retrieved_knowledge = retrieve(input_query)

print('Retrieved knowledge:')
for chunk, similarity in retrieved_knowledge:
    print(f' - (similarity: {similarity:.2f}) {chunk}')

instruction_prompt = (
    "You are a helpful chatbot.\n"
    "Use only the following pieces of context to answer the question. "
    "Don't make up any new information:\n"
    + "\n".join([f' - {chunk}' for chunk, _ in retrieved_knowledge])
)

Input query: what was the largest ever produced cat litter
Retrieved knowledge:
 - (similarity: 0.72) Most cats give birth to a litter of between one and nine kittens. The largest known litter ever produced was 19 kittens, of which 15 survived.
 - (similarity: 0.72) Most cats give birth to a litter of between one and nine kittens. The largest known litter ever produced was 19 kittens, of which 15 survived.
 - (similarity: 0.68) A cat called Dusty has the known record for the most kittens. She had more than 420 kittens in her lifetime.
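The retrieved list above contains the same chunk twice, so the same fact would also appear twice in the prompt. One possible way to handle this, sketched under the assumption that exact duplicates can simply be dropped, is to build the prompt in a small helper that strips whitespace and keeps each chunk once. The function name `build_instruction_prompt` and the placeholder chunks are hypothetical.

```python
def build_instruction_prompt(retrieved):
    """Build the system prompt from (chunk, similarity) pairs, skipping exact duplicates."""
    unique_chunks = []
    for chunk, _ in retrieved:
        text = chunk.strip()  # chunks may still carry a trailing newline
        if text not in unique_chunks:
            unique_chunks.append(text)
    context = "\n".join(f" - {text}" for text in unique_chunks)
    return (
        "You are a helpful chatbot.\n"
        "Use only the following pieces of context to answer the question. "
        "Don't make up any new information:\n"
        + context
    )

# Placeholder chunks standing in for real retrieval results
retrieved_example = [("fact A\n", 0.72), ("fact A\n", 0.72), ("fact B", 0.68)]
prompt = build_instruction_prompt(retrieved_example)
print(prompt)
```

The order of the chunks is preserved, so the most similar one still comes first in the context.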


In [54]:
stream = ollama.chat(
  model=LANGUAGE_MODEL,
  messages=[
    {'role': 'system', 'content': instruction_prompt},
    {'role': 'user', 'content': input_query},
  ],
  stream=True,
)

# print the response from the chatbot in real-time
# This might not work in all environments, but should work in most terminals
print('Chatbot response:')
for chunk in stream:
  print(chunk['message']['content'], end='', flush=True)


Chatbot response:
According to the text, there is no information about a "cat litter". The text only mentions that cats typically give birth to litters of between one and nine kittens, with some records including a larger litter such as 19 kittens.
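To keep the answer for later use (for example, to compare different prompts), the streamed pieces can be collected while they are printed. This is a minimal sketch that relies only on the `chunk['message']['content']` access pattern already used above. The helper `collect_stream` is hypothetical, and the fake stream is a plain list of dictionaries standing in for the real response stream.

```python
def collect_stream(stream):
    """Print streamed pieces as they arrive and return the full response text."""
    parts = []
    for chunk in stream:
        piece = chunk['message']['content']
        print(piece, end='', flush=True)
        parts.append(piece)
    print()  # finish the line once the stream ends
    return ''.join(parts)

# Stand-in for the stream returned by ollama.chat(..., stream=True)
fake_stream = [
    {'message': {'content': 'Hello'}},
    {'message': {'content': ', world'}},
]
answer = collect_stream(fake_stream)
```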

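We can see that the pipeline works: the retriever surfaces the relevant fact (the largest known litter was 19 kittens), although the model reads "cat litter" in the query too literally and hedges its answer. Note also that the first two retrieved chunks are identical, which suggests that the vector database contains duplicate entries (300 entries were loaded for a 150-line dataset). The prompts could be tuned further for a better response.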