### 2023.11.30 - Introduction to Transformers | Homework 4
In this exercise, you will implement the key components of Retrieval-Augmented Generation (RAG): data ingestion, retrieval, and augmentation.
RAG significantly enhances the capabilities of language models by allowing them to incorporate external knowledge.

In case you are interested in diving deeper into RAG, check out the following resources:
- Original Paper on RAG: [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/pdf/2005.11401.pdf)
- LlamaIndex Tutorial Series: [Building RAG from Scratch (Lower-Level)](https://docs.llamaindex.ai/en/stable/optimizing/building_rag_from_scratch.html)

Base your code on the following skeleton code that we provide:

In [1]:
!/opt/conda/envs/pytorch/bin/python -m pip install scikit-learn

In [2]:
import requests
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [3]:
# Import any additional dependencies
# In case you don't need any, just remove the error raise below
# YOUR CODE HERE

### Embedding Model
The embedding model transforms textual data into a numerical format (embeddings) that can be easily stored and processed.

In our exercise we will use the free Inference API from Hugging Face together with an open-source model.
To use this API, you need to create an account and obtain an access token at https://huggingface.co/settings/tokens.

In [4]:
token = "YOUR_HF_ACCESS_TOKEN"  # paste your own access token here; never publish a real token

In [5]:
API_URL = "https://api-inference.huggingface.co/models/BAAI/bge-small-en-v1.5"
headers = {"Authorization": f"Bearer {token}"}

def query(payload):
    # Send the payload to the hosted model and return the decoded JSON response
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()
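
Because a hosted API can occasionally answer with an error instead of embeddings, it may help to retry automatically rather than rerun a cell by hand. The following is a minimal sketch, not part of the provided skeleton: `query_with_retry`, `send`, `max_retries`, and `wait_seconds` are hypothetical names, and the check for a dict with an `"error"` key is an assumption about how failures are reported.

```python
import time

def query_with_retry(payload, send, max_retries=3, wait_seconds=1.0):
    """Call send(payload) until it returns something other than an error dict.

    `send` is any callable that takes the payload and returns the decoded
    JSON response, for example the `query` function defined above.
    """
    result = None
    for attempt in range(max_retries):
        result = send(payload)
        # Assumption: failures come back as a dict with an "error" key,
        # while successful embedding responses are lists.
        if not (isinstance(result, dict) and "error" in result):
            return result
        # Wait before the next attempt, but not after the last one.
        if attempt < max_retries - 1:
            time.sleep(wait_seconds)
    raise RuntimeError(f"API still failing after {max_retries} attempts: {result}")
```

With the skeleton's function, a call would look like `query_with_retry({"inputs": ["some text"]}, query)`.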

To keep our example simple, we will use a small set of short, predefined sentences as our knowledge base. Keep in mind that in a real-life scenario, pre-processing is an important step.

In [6]:
knowledge_base = [
    "on the 23th december i ate a lovely cheesecake for dinner and a carrotte as a breakfast",
    "the second name of my ants second chicken is miranda",
    "the eiffel tower is located in south tirol."
]
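
The sentences above are short enough to embed directly. For longer documents, one common pre-processing step is to split the text into overlapping chunks before embedding. The sketch below illustrates the idea with a simple word-based splitter; `chunk_text`, `max_words`, and `overlap` are hypothetical names chosen for this example and are not required for the homework.

```python
def chunk_text(text, max_words=50, overlap=10):
    """Split text into chunks of at most max_words words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    chunk boundary still appears with some of its context.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current chunk reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks

document = "one two three four five six seven eight nine ten"
print(chunk_text(document, max_words=4, overlap=1))
# prints ['one two three four', 'four five six seven', 'seven eight nine ten']
```

Each chunk would then become its own entry in the knowledge base.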

In [7]:
embeddings = query({"inputs": knowledge_base})
embeddings  # NOTE: The API sometimes returns an error; if so, just run this cell again

[[0.005976018495857716,
  0.04743701219558716,
  0.046013496816158295,
  -0.02352082170546055,
  0.002019573003053665,
  -0.02387968823313713,
  0.0548194944858551,
  0.060507241636514664,
  0.00896780751645565,
  0.022642672061920166,
  0.013118397444486618,
  -0.007841447368264198,
  0.06415598094463348,
  0.030713792890310287,
  0.008385112509131432,
  -0.013428348116576672,
  0.06005748733878136,
  -0.05913832411170006,
  -0.11512435227632523,
  -0.007879646494984627,
  -0.009034581482410431,
  0.01553868968039751,
  -0.028244825080037117,
  -0.006699493154883385,
  0.007302557118237019,
  0.11347273737192154,
  0.0118391253054142,
  -0.029667159542441368,
  -0.059424713253974915,
  -0.09644351154565811,
  0.0461956262588501,
  -0.015238747000694275,
  0.06134472414851189,
  -0.05341811850667,
  -0.06515027582645416,
  0.014987241476774216,
  -0.0018718718783929944,
  0.029459916055202484,
  -0.03854014351963997,
  0.029086750000715256,
  0.08560528606176376,
  0.006443233694881201

After encoding our knowledge base into embeddings, we need to store them together with the original text, since most embedding models don't provide a decoder to turn an embedding back into text.

<b>Task:</b> Create an array of nodes, where each node has the form {"embd": THE EMBEDDING, "text": THE HUMAN READABLE TEXT}. Each element of the knowledge base should have exactly one node, so your db should look something like [{"embd": [0.321, ...], "text": "on the 23th ..."}, ...]

In [8]:
db = [{"embd": embeddings[i], "text": knowledge_base[i]} for i in range(len(knowledge_base))]

To be able to query our db, we need to transform a given prompt into the same vector space, using the same embedding model.

In [9]:
prompt = "What is the second name of my ants second chicken?"

In [10]:
prompt_embd = query({"inputs": prompt})
prompt_embd  # NOTE: The API sometimes returns an error; if so, just run this cell again

[-0.04123039171099663,
 -0.07210378348827362,
 0.00310553633607924,
 -0.02681468427181244,
 0.010677281767129898,
 0.01596728339791298,
 0.04665118083357811,
 0.04646291211247444,
 0.07410270720720291,
 -0.015378935262560844,
 -0.004554321523755789,
 -0.0879453644156456,
 0.00013415655121207237,
 -0.024630388244986534,
 0.0031998520717024803,
 -0.013545660302042961,
 -0.06153857707977295,
 0.04205211251974106,
 -0.07401518523693085,
 0.0025465115904808044,
 -0.040876470506191254,
 -0.04882597178220749,
 0.009514120407402515,
 -0.07900737971067429,
 -0.015235225670039654,
 0.07530120015144348,
 -0.021136516705155373,
 0.05697115510702133,
 -0.07360400259494781,
 -0.11288599669933319,
 -0.04130445793271065,
 -0.0017673190450295806,
 -0.027058053761720657,
 -0.008186215534806252,
 0.0028259598184376955,
 0.0031193732284009457,
 -0.016712041571736336,
 -0.011016673408448696,
 -0.0032114824280142784,
 0.012778617441654205,
 0.035399481654167175,
 -0.03541136905550957,
 0.021044807508587837,

<b>Task:</b> Implement a function named calculate_similarity which takes two arguments, vec1 and vec2. These arguments are text embeddings that should be compared semantically. The function should return a single similarity value between 0 and 1, where 1 indicates identical vectors and 0 indicates orthogonal vectors.

In [11]:
def calculate_similarity(vec1, vec2):
    """
    Calculate the cosine similarity between two vectors.

    Args:
    vec1 (list or array): The first vector.
    vec2 (list or array): The second vector.

    Returns:
    float: The cosine similarity, where 1 means identical direction and 0 means orthogonal.
    """
    # cosine_similarity expects 2D inputs (one row per sample) and normalizes
    # internally, so no manual normalization is needed.
    v1 = np.asarray(vec1).reshape(1, -1)
    v2 = np.asarray(vec2).reshape(1, -1)
    # The result is a 1x1 matrix; extract the scalar so a plain float is returned.
    return float(cosine_similarity(v1, v2)[0, 0])
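
To build intuition for the values such a function returns, cosine similarity can also be written out directly as the dot product divided by the product of the two vector norms. The sketch below uses only NumPy on toy 2D vectors; the helper name `cosine` is hypothetical. Note that for arbitrary vectors the value can be negative (opposite directions give -1), so the 0 to 1 range in the task is a property of the data rather than of the formula.

```python
import numpy as np

def cosine(vec1, vec2):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    a = np.asarray(vec1, dtype=float)
    b = np.asarray(vec2, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same = cosine([1.0, 2.0], [2.0, 4.0])        # same direction, different length
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])  # perpendicular vectors
opposite = cosine([1.0, 0.0], [-2.0, 0.0])   # opposite direction
print(same, orthogonal, opposite)  # approximately 1, 0 and -1
```

The first case shows that vector length does not matter, only direction.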

<b>Task:</b> Calculate the cosine similarity between a given prompt embedding and each embedding in your database (db).
Identify the database entry (node) that has the highest similarity to the prompt and retrieve the text associated with this most similar node as your augmentation data. (_hint:_ you might want to use np.argmax on an array of similarities)

In [12]:
# TODO: Implement similarity search below
# YOUR CODE HERE
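# Score every node in the db against the prompt, then pick the best match
similarities = [calculate_similarity(prompt_embd, node["embd"]) for node in db]
augemntation_data = db[int(np.argmax(similarities))]["text"]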
print(augemntation_data)

the second name of my ants second chicken is miranda
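
The search above returns only the single best match. A natural extension is to return the k most similar nodes, which can be sketched with np.argsort instead of np.argmax. This is an illustration on made-up 2D embeddings, not part of the exercise; `top_k_texts` and `toy_db` are hypothetical names.

```python
import numpy as np

def top_k_texts(query_embd, nodes, k=2):
    """Return the k (similarity, text) pairs most similar to query_embd."""
    matrix = np.asarray([node["embd"] for node in nodes], dtype=float)
    q = np.asarray(query_embd, dtype=float)
    # Cosine similarity of the query against every row at once.
    sims = matrix @ q / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q))
    # argsort is ascending, so reverse it and keep the first k indices.
    best = np.argsort(sims)[::-1][:k]
    return [(float(sims[i]), nodes[i]["text"]) for i in best]

# Toy nodes with made-up 2D embeddings, in the same shape as db.
toy_db = [
    {"embd": [1.0, 0.0], "text": "about cheesecake"},
    {"embd": [0.0, 1.0], "text": "about chickens"},
    {"embd": [1.0, 1.0], "text": "about both"},
]
for score, text in top_k_texts([0.1, 1.0], toy_db, k=2):
    print(f"{score:.3f}  {text}")
```

Here the query points mostly along the second axis, so "about chickens" ranks first and "about both" second.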


In [13]:
def get_augmented_promp(prompt, augmentation):
    return f"""
Context information: "{augmentation}".
Given the context information and not prior knowledge, answer the query.
Query: {prompt}
Answer: \
"""

In [14]:
"""
Expected Output:
'\nContext information: "the second name of my ants second chicken is miranda".\nGiven the context information and not prior knowledge, answer the query.\nQuery: What is the second name of my ants second chicken?\nAnswer: '
"""
augmented_prompt = get_augmented_promp(prompt, augemntation_data)
augmented_prompt

assert "miranda" in augmented_prompt