# RAG

RAG stands for Retrieval-Augmented Generation, a technique that is used to improve the accuracy of text generated by large language models (LLMs). RAG combines the strengths of generative AI and retrieval AI to produce more relevant, up-to-date, and accurate responses.

With RAG (retrieval-augmented generation), we first split all documents into chunks of text. Chunking is needed because embedding models have a limited context length, and RAG tends to perform better with smaller chunks. We then embed the chunks with an embedding model, which converts text into vectors of numbers, and store those vectors in a vector database. When a user asks a question, the question is converted to a vector with the same embedding model. Semantic search then finds the n most relevant chunks, and those chunks are passed to the LLM as reference material along with the question.
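The steps above can be sketched end to end in a few lines. This is only a minimal illustration under stated assumptions: `toy_embed` is a hypothetical hashed bag-of-words stand-in for a real embedding model, the "vector database" is a plain Python list, and the final prompt would be sent to an LLM of your choice (not shown).

```python
import math
import re
import zlib

DIM = 256  # toy embedding size (assumption; real models define their own dimension)

def chunk_text(text, max_words=20):
    """Split text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(token.encode()) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    # Both vectors are unit length, so the dot product equals the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

def build_index(docs, max_words=20):
    """The 'vector database' here is just a list of (chunk, vector) pairs."""
    chunks = [c for doc in docs for c in chunk_text(doc, max_words)]
    return [(c, toy_embed(c)) for c in chunks]

def retrieve(index, question, n=2):
    """Embed the question with the same model and return the n most similar chunks."""
    q_vec = toy_embed(question)
    ranked = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:n]]

def build_prompt(question, chunks):
    """Combine the retrieved chunks and the question into one prompt for the LLM."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

docs = [
    "BM25 is a ranking function used for keyword search.",
    "Cosine similarity compares the angle between two embedding vectors.",
    "Chunking splits long documents into smaller pieces of text.",
]
index = build_index(docs)
top_chunks = retrieve(index, "What is BM25 used for?", n=1)
prompt = build_prompt("What is BM25 used for?", top_chunks)
print(prompt)
```

In a real system, `toy_embed` would be replaced by a call to an actual embedding model and the list by a vector database, while the overall flow stays the same.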

## Distance functions

The four most popular distance functions for finding nearest neighbours are:

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)
![image-3.png](attachment:image-3.png)
![image-4.png](attachment:image-4.png)
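The figures are not reproduced as text here, so the sketch below assumes the four functions are dot product, cosine distance, Euclidean (L2) distance, and Manhattan (L1) distance, which are common choices in vector databases. Treat the selection as an assumption rather than a transcription of the images.

```python
import math

def dot_product(a, b):
    """Larger values mean more similar (it is a similarity, not a distance)."""
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 2 means opposite direction."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return 1.0 - dot_product(a, b) / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line (L2) distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

a = [1.0, 0.0]
b = [0.0, 1.0]
for fn in (dot_product, cosine_distance, euclidean_distance, manhattan_distance):
    print(fn.__name__, fn(a, b))
```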

## Common embedding models

| Model | Vocabulary size |
| --- | --- |
| OpenAI | ~100k |
| English v3 | ~30k |
| Multilingual v3 | ~200k |
| all-MiniLM-L6-v2 | ~30k |

Vocabulary size is a tokenizer hyperparameter and must be selected upfront.

## Search Types

- **Dense Search (Semantic Search)**
  - Semantic search focuses on the meaning of the content being searched.
  - It uses the vector embedding representation of the data to perform the search.
  - This type of search captures and returns semantically similar objects.

- **Sparse Search**  
  - Also known as lexical search, it looks for literal or pattern matches in the text.
  - **Bag of Words**  
    - The simplest way to do keyword matching is Bag of Words: count how many times each word occurs in the query and in each stored object, then return the objects with the highest matching word frequencies.

  - **Why it is called Sparse Search**  
    - The text is embedded into vectors by counting how many times every unique word in your vocabulary occurs in the query and in the stored sentences. Since any given sentence is very unlikely to contain every word in your vocabulary, the embeddings are mostly zeroes and are therefore known as sparse embeddings.
  - **Sparse vector using BM25**

    - **Best Matching 25 (BM25)**: In practice, keyword search uses a refinement of simple word frequencies called Best Matching 25. BM25 is a ranking function that retrieves text by estimating the relative importance of each query term in a document. It is computed from how often the term occurs in the document, how many documents in the corpus contain the term, and the document's length relative to the average document length.


- **Hybrid Search**

  - **What is Hybrid Search?**
    - Hybrid search is the process of performing both vector/dense search and keyword/sparse search and then combining the results.

  - **Combination based on a Scoring System**
    - This combination can be done based on a scoring system that measures how well each object matches the query using both dense and sparse searches.
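The sparse and hybrid ideas in this list can be sketched as follows. The BM25 function uses the common Okapi form with the usual default-style parameters `k1=1.5` and `b=0.75` (assumed values). The fusion rule, a weighted sum of min-max normalised scores controlled by a hypothetical `alpha` parameter, is only one possible scoring system; rank-based schemes such as reciprocal rank fusion are also used. The dense scores are made-up numbers standing in for embedding similarities.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)  # bag-of-words counts for this document
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            tf = term_freq[term]
            length_norm = 1.0 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

def min_max(scores):
    """Rescale scores to [0, 1] so dense and sparse scores are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(dense, sparse, alpha=0.5):
    """alpha=1.0 is pure dense search, alpha=0.0 is pure sparse search."""
    return [alpha * d + (1.0 - alpha) * s
            for d, s in zip(min_max(dense), min_max(sparse))]

docs = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock markets fell sharply today",
]
sparse = bm25_scores("cat on mat", docs)
dense = [0.82, 0.74, 0.05]  # made-up embedding similarities, for illustration only
combined = hybrid_scores(dense, sparse, alpha=0.5)
best = max(range(len(docs)), key=lambda i: combined[i])
print(docs[best])
```

Note how the second document gets a sparse score of zero because "cats" does not literally match "cat", while its dense score stays high. That difference is what hybrid search tries to balance.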


## Sentence-window retrieval

We split the document into small chunks, usually one sentence each. When chunks are retrieved, the context around each chunk (its neighbouring sentences) is added to the retrieved chunk.
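A minimal sketch of this idea, assuming a simple regex sentence splitter and a toy word-overlap score in place of embedding similarity. The `window` parameter (number of neighbouring sentences on each side) is a hypothetical name chosen for illustration.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def overlap_score(query, sentence):
    """Toy relevance score: number of shared words (stand-in for embedding similarity)."""
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    s = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    return len(q & s)

def sentence_window_retrieve(text, query, window=1):
    """Match on single sentences, but return the hit plus its neighbours as context."""
    sentences = split_sentences(text)
    best = max(range(len(sentences)), key=lambda i: overlap_score(query, sentences[i]))
    start = max(0, best - window)
    end = min(len(sentences), best + window + 1)
    return sentences[best], " ".join(sentences[start:end])

text = (
    "RAG retrieves text before generating. "
    "Chunks are embedded as vectors. "
    "BM25 ranks documents by keyword overlap. "
    "Hybrid search combines both signals. "
    "Rerankers reorder the final candidates."
)
hit, context = sentence_window_retrieve(text, "How does BM25 rank documents?", window=1)
print(hit)
print(context)
```

Matching on a single sentence keeps the search precise, while the wider window gives the LLM enough surrounding text to answer.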

## Auto-merging retrieval


- Define a hierarchy of smaller chunks linked to parent chunks.
- If the number of retrieved smaller chunks linking to the same parent chunk exceeds some threshold, then “merge” those smaller chunks into the bigger parent chunk.

![image.png](attachment:image.png)
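The merge rule can be sketched as below. The chunk IDs, the `parent_of` / `children_of` mappings, and the choice to express the threshold as a fraction of a parent's children are all assumptions made for illustration.

```python
from collections import defaultdict

def auto_merge(retrieved_ids, parent_of, children_of, threshold=0.5):
    """Replace retrieved child chunks with their parent when enough siblings were hit."""
    hits_per_parent = defaultdict(list)
    for chunk_id in retrieved_ids:
        hits_per_parent[parent_of[chunk_id]].append(chunk_id)

    merged = []
    for chunk_id in retrieved_ids:
        parent = parent_of[chunk_id]
        ratio = len(hits_per_parent[parent]) / len(children_of[parent])
        result = parent if ratio > threshold else chunk_id
        if result not in merged:  # keep the first occurrence to preserve ranking order
            merged.append(result)
    return merged

# Hypothetical two-level hierarchy: two parents with four child chunks each.
children_of = {
    "P1": ["c1", "c2", "c3", "c4"],
    "P2": ["c5", "c6", "c7", "c8"],
}
parent_of = {child: parent for parent, kids in children_of.items() for child in kids}

retrieved = ["c1", "c5", "c2", "c3"]
result = auto_merge(retrieved, parent_of, children_of, threshold=0.5)
print(result)
```

Here three of the four children of `P1` were retrieved, so they are replaced by `P1`, while the single hit under `P2` stays as a small chunk.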

## Expansion

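Expansion with generated answers: ask the LLM for a sample answer to the question, combine that answer with the original question, and use the combined text to retrieve chunks. This can give better results than searching with the question alone.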
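A minimal sketch of this flow. The `generate` and `search` callables are hypothetical placeholders for an LLM call and a vector DB query. The fake versions below exist only so that the example runs without any external service.

```python
def expand_query_with_answer(question, generate):
    """Ask the LLM for a plausible (possibly wrong) answer and append it to the question."""
    prompt = f"Write a short, plausible answer to this question:\n{question}"
    sample_answer = generate(prompt)
    return f"{question} {sample_answer}"

def retrieve_with_expansion(question, generate, search, n=5):
    """Retrieve chunks using the question combined with the generated sample answer."""
    expanded = expand_query_with_answer(question, generate)
    return search(expanded, n)

# Stand-ins (assumptions) so the sketch runs without a model or a vector DB.
def fake_generate(prompt):
    return "BM25 ranks documents using term frequency"

def fake_search(query, n):
    corpus = ["term frequency and BM25", "image embeddings", "audio models"]
    words = set(query.lower().split())
    ranked = sorted(corpus, key=lambda doc: len(words & set(doc.lower().split())), reverse=True)
    return ranked[:n]

results = retrieve_with_expansion(
    "How are keyword results ranked?", fake_generate, fake_search, n=1
)
print(results)
```

The question alone shares no words with the best chunk in this toy corpus. The generated answer adds the vocabulary that makes the match possible, which is the intuition behind the technique.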


## Reranking

Rerank the results of the (expanded) query using cross-encoders. Query the vector DB and request more results than you need, re-rank the output so that the most relevant results have the highest rank, and then select the top-ranking results.



- **Bi-Encoder vs. Cross-Encoder**  
    - Bi-Encoders produce a sentence embedding for a given sentence. We pass sentences A and B to a BERT model independently, which results in the sentence embeddings u and v. Each input is fed into its own encoder, producing two independent embeddings. These sentence embeddings can then be compared using cosine similarity.

    - In contrast, for a Cross-Encoder, we pass both sentences simultaneously to the Transformer network. It then produces an output value between 0 and 1 indicating the similarity of the input sentence pair. Cross-Encoders process the two input sequences together as a single input. This allows the model to directly compare and contrast the inputs, understanding their relationship in a more integrated and nuanced way.


- **Combining Bi- and Cross-Encoders**

    - Cross-Encoders achieve higher performance than Bi-Encoders; however, they do not scale well to large datasets. Here, it can make sense to combine Cross- and Bi-Encoders, for example in Information Retrieval / Semantic Search scenarios: first, you use an efficient Bi-Encoder to retrieve, e.g., the top-100 most similar sentences for a query. Then, you use a Cross-Encoder to re-rank these 100 hits by computing the score for every (query, hit) combination.

    - SentenceTransformers also supports loading Cross-Encoders for sentence pair scoring and sentence pair classification tasks.






![image.png](attachment:image.png)
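The retrieve-then-rerank step can be sketched as follows. To keep the example runnable without downloading a model, the scorer is a toy word-overlap function; `rerank` and `toy_score_pairs` are hypothetical names. The comment shows how a real cross-encoder from `sentence-transformers` could be swapped in, and the model name there is just one commonly used example.

```python
def rerank(query, candidates, score_pairs, top_k=3):
    """Re-order candidates by a pairwise relevance score and keep the top_k."""
    pairs = [(query, doc) for doc in candidates]
    scores = score_pairs(pairs)  # one score per (query, candidate) pair
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

def toy_score_pairs(pairs):
    """Stand-in scorer: fraction of query words that also appear in the document."""
    scores = []
    for query, doc in pairs:
        q_words = set(query.lower().split())
        d_words = set(doc.lower().split())
        scores.append(len(q_words & d_words) / len(q_words))
    return scores

# With sentence-transformers installed, a real cross-encoder could be used instead:
#   from sentence_transformers import CrossEncoder
#   model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
#   top = rerank(query, candidates, model.predict, top_k=2)

# Pretend these candidates came back from a first-stage bi-encoder search.
candidates = [
    "vector databases store embeddings",
    "cross encoders score query document pairs",
    "bi encoders embed each sentence independently",
]
top = rerank("how do cross encoders score pairs", candidates, toy_score_pairs, top_k=2)
print(top)
```

Passing the scorer in as a function keeps the expensive pairwise model confined to the small candidate set, which is the point of combining the two encoder types.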

## The RAG Triad

- **Relevance**: Is the response relevant to the query?
- **Context Relevance**: Is the retrieved context relevant to the query?
- **Groundedness**: Is the response supported by the context?

## Multimodal

- multi2vec-palm
- wav2clip

![image.png](attachment:image.png)

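```python
import tiktoken
from sentence_transformers import SentenceTransformer

device = "cpu"  # assumption: set to "cuda" if a GPU is available

# Text-only sentence embedding model
text_model = SentenceTransformer("all-MiniLM-L6-v2", device=device)

# CLIP model for multimodal (image + text) embeddings
clip_model = SentenceTransformer("sentence-transformers/clip-ViT-B-32")

# Vocabulary size of the tokenizer used by an OpenAI embedding model
openai_tokenizer = tiktoken.encoding_for_model("text-embedding-3-large")
print(openai_tokenizer.n_vocab)
```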