### Simple RAG Example
Retrieval Augmented Generation


https://parlance-labs.com/education/rag/ben.html

![rag-tree.webp](rag-tree.webp)


- RAG is Retrieval Augmented Generation.
- It just means 'provide relevant context'.
- It works by:
  1. creating an embedding from a prompt
  2. creating embeddings from sections of a document
  3. finding the cosine similarity between the prompt and each section of the document.

- Once the paragraphs with the highest cosine similarity to the prompt are found, the top 3 paragraphs are fed into a generative model to generate an answer.
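The ranking step can be sketched on its own, without any embedding model. This is a minimal illustration that assumes made-up 2-D vectors in place of real embeddings; `top_k_by_cosine`, `toy_docs` and `toy_query` are hypothetical names, not part of the notebook.

```python
import numpy as np

def top_k_by_cosine(query_vec, doc_vecs, k=3):
    """Return (indices of the k most similar rows, all similarity scores)."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms.
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    # argsort is ascending, so reverse it and keep the first k.
    return np.argsort(sims)[::-1][:k], sims

# Toy 2-D "embeddings", made up for illustration only.
toy_docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
toy_query = np.array([1.0, 0.1])

top_idx, sims = top_k_by_cosine(toy_query, toy_docs, k=2)
print(top_idx)  # indices of the two vectors pointing closest to the query
```

The vector pointing in nearly the same direction as the query ranks first, and the one pointing the opposite way gets a negative score.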

In [1]:
# %pip install -U sentence-transformers
# %pip install wikipedia-api
# %pip install claudette

In [2]:
from sentence_transformers import SentenceTransformer
from wikipediaapi import Wikipedia
from claudette import Chat, models


I got the model list from here
https://www.sbert.net/docs/sentence_transformer/pretrained_models.html

In [3]:
# model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
model = SentenceTransformer('Alibaba-NLP/gte-base-en-v1.5', trust_remote_code=True)

### Fetch some text content and embed it


In [4]:
wiki = Wikipedia('RAGBot/0.0', 'en')
doc = wiki.page('Albert Einstein').text
paragraphs = doc.split('\n\n')
# Embed each paragraph; normalized vectors let a plain dot product act as cosine similarity
docs_embed = model.encode(paragraphs, normalize_embeddings=True)


### Make an embedding of the prompt (query)

In [14]:
query = "What gave Einstein the chance to meet his fellow neuroanatomist Ramon?"
query_embed = model.encode(query, normalize_embeddings=True)
query_embed.shape

(768,)

In [15]:
import numpy as np
similarities = np.dot(docs_embed, query_embed)
similarities

array([0.5728152 , 0.47459427, 0.44192564, 0.47382304, 0.5413194 ,
       0.43490496, 0.4756278 , 0.5182402 , 0.48718232, 0.41080484,
       0.37724665, 0.57194126, 0.50365496, 0.46164155, 0.57233626,
       0.5127679 , 0.52641445, 0.5398744 , 0.6040658 , 0.4396679 ,
       0.47936997, 0.5163653 , 0.5112858 , 0.5028564 , 0.48975816,
       0.40509236, 0.55507207, 0.5173719 , 0.47807235, 0.40626654,
       0.3443527 , 0.39776003, 0.49057645, 0.5369898 , 0.33842805,
       0.39533782, 0.31863862, 0.4034021 , 0.43526107, 0.42657036,
       0.41576654, 0.44364488, 0.3890611 , 0.40116668, 0.50536174,
       0.44676283, 0.3784415 , 0.44369176, 0.43291909, 0.42396727,
       0.46925837, 0.37816215, 0.44824862, 0.48065543, 0.42301148,
       0.37530896, 0.46815017, 0.4325635 , 0.3740057 , 0.53968686,
       0.40272278, 0.37232846, 0.40951195, 0.42262912, 0.47417068,
       0.46981564, 0.39782795, 0.31879896, 0.33717418], dtype=float32)
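A plain `np.dot` is enough here because both sets of embeddings were encoded with `normalize_embeddings=True`: for unit-length vectors the dot product and the cosine similarity are the same number. The check below is a small sketch on random vectors (not the notebook's embeddings); `l2_normalize` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
raw_docs = rng.normal(size=(5, 8))   # 5 made-up "document" vectors
raw_query = rng.normal(size=8)       # 1 made-up "query" vector

def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

unit_docs = l2_normalize(raw_docs)
unit_query = l2_normalize(raw_query)

# Dot product of the normalized vectors...
dot_scores = unit_docs @ unit_query
# ...versus the full cosine formula on the raw vectors.
cosine_scores = raw_docs @ raw_query / (
    np.linalg.norm(raw_docs, axis=1) * np.linalg.norm(raw_query)
)
print(np.allclose(dot_scores, cosine_scores))
```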

In [16]:
top_3_idx = np.argsort(similarities)[::-1][:3]

In [17]:
top_3_idx

array([18,  0, 14])

In [18]:
most_similar_documents = [paragraphs[idx] for idx in top_3_idx]

In [19]:
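# models[2] resolved to claude-3-haiku-20240307 when this ran; the index depends on the installed claudette version
llm = models[2]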
chat = Chat(llm, sp=f"Here is some information from Wikipedia, it will help you to answer a question. Wikipedia information: {str(most_similar_documents)}")

In [20]:
chat(query)

According to the information provided, Einstein's 1923 visit to Spain gave him the chance to meet the fellow Nobel laureate, the neuroanatomist Santiago Ramón y Cajal. The passage states:

"(His Spanish trip also gave him a chance to meet a fellow Nobel laureate, the neuroanatomist Santiago Ramón y Cajal.)"

So Einstein's 1923 trip to Spain provided him the opportunity to meet and interact with the Nobel laureate neuroanatomist Santiago Ramón y Cajal.

<details>

- id: `msg_01Wu1ME2c2Chmg5x9Ej2Tohm`
- content: `[{'text': 'According to the information provided, Einstein\'s 1923 visit to Spain gave him the chance to meet the fellow Nobel laureate, the neuroanatomist Santiago Ramón y Cajal. The passage states:\n\n"(His Spanish trip also gave him a chance to meet a fellow Nobel laureate, the neuroanatomist Santiago Ramón y Cajal.)"\n\nSo Einstein\'s 1923 trip to Spain provided him the opportunity to meet and interact with the Nobel laureate neuroanatomist Santiago Ramón y Cajal.', 'type': 'text'}]`
- model: `claude-3-haiku-20240307`
- role: `assistant`
- stop_reason: `end_turn`
- stop_sequence: `None`
- type: `message`
- usage: `{'cache_creation_input_tokens': 0, 'cache_read_input_tokens': 0, 'input_tokens': 1957, 'output_tokens': 119}`

</details>

# Improvements
__Cross Encoder__
- This method uses a bi-encoder model. It is very efficient: the document embeddings can be computed in parallel ahead of time and stored in a database, so at query time all you need to compute is the similarity between the prompt embedding and the stored embeddings.

In a bi-encoder, the document and query representations are computed entirely separately, so they aren't aware of each other.
Improvements can be made by using a cross-encoder model, which takes the query and a document together as one input. It is essentially a binary classifier, where p(positive class) is taken as the similarity score. Cross-encoders are slower than bi-encoders, but more accurate.


<img src="cross-encoder.png">

__Reranking__
Cross-encoders are computationally expensive to run, so using one on the entire set of documents, for every prompt, would take a long time.
One solution is to retrieve a shortlist of documents using a computationally efficient approach, such as a bi-encoder, and then re-rank the shortlist with a cross-encoder. There's a library for this: github.com/answerdotai/rerankers
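The two-stage idea can be sketched without loading any model. In the sketch below, `retrieve_then_rerank` and `toy_pair_scorer` are hypothetical names, the 2-D vectors are made up, and the word-overlap scorer is only a stand-in for a cross-encoder. In a real pipeline the scorer might wrap something like `CrossEncoder.predict` from sentence-transformers, or the rerankers library mentioned above; that wiring is an assumption and is not shown here.

```python
import numpy as np

def retrieve_then_rerank(query, docs, doc_vecs, query_vec, pair_scorer,
                         shortlist_size=10, final_k=3):
    """Two-stage retrieval: cheap vector shortlist, then pairwise re-rank."""
    # Stage 1: shortlist by dot product (assumes normalized embeddings).
    sims = np.asarray(doc_vecs) @ np.asarray(query_vec)
    shortlist = np.argsort(sims)[::-1][:shortlist_size]
    # Stage 2: score each (query, doc) pair jointly; only the shortlist is scored.
    pair_scores = [pair_scorer(query, docs[i]) for i in shortlist]
    order = np.argsort(pair_scores)[::-1][:final_k]
    return [int(shortlist[j]) for j in order]

def toy_pair_scorer(query, doc):
    """Hypothetical stand-in for a cross-encoder: share of query words found in doc."""
    q_words = set(query.lower().split())
    return len(q_words & set(doc.lower().split())) / len(q_words)

toy_docs = [
    "einstein visited spain in 1923",
    "einstein played the violin",
    "spain has many cities",
    "the photoelectric effect",
]
# Made-up unit-length 2-D "embeddings", for illustration only.
toy_vecs = np.array([[0.6, 0.8], [0.8, 0.6], [0.0, 1.0], [1.0, 0.0]])
toy_query_vec = np.array([1.0, 0.0])

ranked = retrieve_then_rerank(
    "einstein spain visit", toy_docs, toy_vecs, toy_query_vec,
    toy_pair_scorer, shortlist_size=3, final_k=2,
)
print(ranked)
```

The expensive scorer only ever sees `shortlist_size` documents, and it is free to reorder them: here the document that the vector stage ranked last in the shortlist ends up first.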



__Keyword Search__
Always have keyword search and full-text search in the pipeline.

__tf-idf__ (term frequency-inverse document frequency) down-weights common words and up-weights rare words.
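As a rough illustration of that weighting, here is a minimal sketch using one common variant, `tf * log(N / df)`, with whitespace tokenization. The function name `tf_idf` and the three-sentence corpus are made up for the example; real libraries differ in smoothing and normalization.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # Relative term frequency times inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

toy_corpus = ["the cat sat", "the dog sat", "the bird flew"]
weights = tf_idf(toy_corpus)
print(weights[0])
```

In the first document, "the" appears in every document and gets a weight of zero, "sat" appears in two of three and gets a small weight, and "cat" appears in only one and gets the largest weight.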

__BM25__ is a ranking function that builds on tf-idf, adding term-frequency saturation and document-length normalization.
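A compact sketch of BM25 scoring follows, assuming whitespace tokenization, the commonly used defaults `k1=1.5` and `b=0.75`, and the "+ 1" idf variant that keeps weights non-negative. `bm25_scores` and the tiny corpus are hypothetical and only meant to show the shape of the formula.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in docs against query with a basic BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: the number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in counts:
                continue
            # Rare terms get a larger idf; the "+ 1" keeps it non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            tf = counts[term]
            # k1 makes repeated terms saturate; b penalizes longer documents.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

bm25_corpus = [
    "einstein visited spain",
    "einstein played violin often at home",
    "spain has many cities",
]
scores = bm25_scores("einstein spain", bm25_corpus)
print(scores)
```

The first document matches both query terms and scores highest. The other two each match one equally rare term, so the shorter one scores higher because of the length normalization.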
