# Advanced RAG with additional information about the quality of embeddings

# Advanced RAG with embeddings evaluation

![image.png](attachment:bea2e4c6-9a32-4206-8866-32e61a5d5239.png)

#### Table of contents

1. [Install dependencies](#section-id1).
2. [Loading knowledge base](#section-id2).
3. [Recursive chunking](#section-id3).
4. [FAISS](#section-id4).
5. [Calinski–Harabasz index](#section-id5).
6. [RAGAS evaluation](#section-id6).

<a id='section-id1'></a>
### Install dependencies

In [1]:
%pip install -q torch transformers accelerate bitsandbytes langchain sentence-transformers faiss-gpu openpyxl pacmap datasets langchain-community ragatouille
%pip install fsspec==2023.6.0
%pip install --upgrade pandas
%pip install --upgrade typing_extensions
%pip install numpy==1.22
%pip install --upgrade h5py
%pip install protobuf==3.20.3
!python3 -m pip install --upgrade pip

Collecting pip
  Using cached pip-24.2-py3-none-any.whl.metadata (3.6 kB)
Using cached pip-24.2-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed pip-24.2


In [2]:
from tqdm.notebook import tqdm
import pandas as pd
from typing import Optional, List, Tuple
from datasets import Dataset
import matplotlib.pyplot as plt

<a id='section-id2'></a>
### Loading knowledge base

Data info:

TriviaQA is a reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high-quality distant supervision for answering the questions.

The data fields are the same among all splits.

rc
* question: a string feature.
* question_id: a string feature.
* question_source: a string feature.
* entity_pages: a dictionary feature containing:
  * doc_source: a string feature.
  * filename: a string feature.
  * title: a string feature.
  * wiki_context: a string feature.
* search_results: a dictionary feature containing:
  * description: a string feature.
  * filename: a string feature.
  * rank: an int32 feature.
  * title: a string feature.
  * url: a string feature.
  * search_context: a string feature.
* aliases: a list of string features.
* normalized_aliases: a list of string features.
* matched_wiki_entity_name: a string feature.
* normalized_matched_wiki_entity_name: a string feature.
* normalized_value: a string feature.
* type: a string feature.
* value: a string feature.

In [3]:
from datasets import load_dataset

ds = load_dataset("mandarjoshi/trivia_qa", "rc.wikipedia")

Generating train split:   0%|          | 0/61888 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/7993 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/7701 [00:00<?, ? examples/s]

In [4]:
ds_train = ds['train']
ds_test = ds['test']

In [5]:
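def extract_wiki_context(entity_pages):
    """Return the text of the first Wikipedia evidence page ('' if there is none)."""
    contexts = entity_pages['wiki_context']
    return contexts[0] if contexts else ''

# Add a flat `wiki_context` column holding the first evidence document of each question
ds_train = ds_train.map(lambda x: {'wiki_context': extract_wiki_context(x['entity_pages'])})
ds_test = ds_test.map(lambda x: {'wiki_context': extract_wiki_context(x['entity_pages'])})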

Map:   0%|          | 0/61888 [00:00<?, ? examples/s]

Map:   0%|          | 0/7701 [00:00<?, ? examples/s]

In [6]:
dd = ds_test.shuffle(seed=42).select(range(1000))

In [7]:
from langchain.docstore.document import Document as LangchainDocument

RAW_KNOWLEDGE_BASE = [
    LangchainDocument(page_content=doc["wiki_context"])
    for doc in tqdm(dd)
]

  0%|          | 0/1000 [00:00<?, ?it/s]

<a id='section-id3'></a>
### Recursive chunking

Recursive chunking breaks the text into smaller parts step by step, using a list of separators ordered from the most important to the least important. If the first split does not produce chunks of the right size, the method is applied again to the resulting chunks with the next separator. For instance, with the list of separators ["\n\n", "\n", ".", ""]:

* The method first splits the document wherever there is a double line break "\n\n".
* The resulting pieces are split again on single line breaks "\n", then on sentence ends ".".
* Finally, any chunk that is still too big is split wherever it exceeds the maximum size.

With this method, the global structure is well preserved, at the expense of slight variations in chunk size.
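The steps above can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: the name `recursive_split` is hypothetical, sizes are measured in characters rather than tokens, separators are dropped, and small neighbouring pieces are not merged back together as `RecursiveCharacterTextSplitter` does.

```python
from typing import List


def recursive_split(text: str, separators: List[str], max_len: int) -> List[str]:
    """Split `text` into pieces of at most `max_len` characters (hypothetical helper)."""
    if len(text) <= max_len:
        # Small enough already; drop empty pieces produced by adjacent separators
        return [text] if text else []
    if not separators or separators[0] == "":
        # No separator left: cut wherever the text overflows the maximum size
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, remaining = separators[0], separators[1:]
    chunks: List[str] = []
    for piece in text.split(sep):
        # Pieces that are still too long are split again with the next separator
        chunks.extend(recursive_split(piece, remaining, max_len))
    return chunks


sample = "First paragraph.\n\nSecond paragraph, line one.\nSecond paragraph, line two."
for chunk in recursive_split(sample, ["\n\n", "\n", ".", ""], max_len=30):
    print(repr(chunk))
```

Ordering the separators from coarse to fine is what keeps paragraphs and lines intact whenever they already fit into a chunk.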

In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# We use a hierarchical list of separators specifically tailored for splitting Markdown documents
# This list is taken from LangChain's MarkdownTextSplitter class
MARKDOWN_SEPARATORS = [
    "\n#{1,6} ",
    "```\n",
    "\n\\*\\*\\*+\n",
    "\n---+\n",
    "\n___+\n",
    "\n\n",
    "\n",
    " ",
    "",
]

In [9]:
from transformers import AutoTokenizer

EMBEDDING_MODEL_NAME = "thenlper/gte-small"


def split_documents(
    chunk_size: int,
    knowledge_base: List[LangchainDocument],
    tokenizer_name: Optional[str] = EMBEDDING_MODEL_NAME,
) -> List[LangchainDocument]:
    """
    Split documents into chunks of maximum size `chunk_size` tokens and return a list of documents.
    """
    text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
        AutoTokenizer.from_pretrained(tokenizer_name),
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size / 10),
        add_start_index=True,
        strip_whitespace=True,
        separators=MARKDOWN_SEPARATORS,
    )

    docs_processed = []
    for doc in tqdm(knowledge_base, desc="docs_processed ready..."):
        docs_processed += text_splitter.split_documents([doc])

    # Remove duplicate chunks while preserving their original order
    unique_texts = set()
    docs_processed_unique = []
    for doc in tqdm(docs_processed, desc="docs_processed_unique ready..."):
        if doc.page_content not in unique_texts:
            unique_texts.add(doc.page_content)
            docs_processed_unique.append(doc)

    return docs_processed_unique


docs_processed = split_documents(
    512,  # We choose a chunk size adapted to our model
    RAW_KNOWLEDGE_BASE,
    tokenizer_name=EMBEDDING_MODEL_NAME,
)

docs_processed ready...:   0%|          | 0/1000 [00:00<?, ?it/s]

docs_processed_unique ready...:   0%|          | 0/15670 [00:00<?, ?it/s]

<a id='section-id4'></a>
### FAISS

FAISS (Facebook AI Similarity Search) is a library developed by Facebook AI that is designed for efficient similarity search and clustering of dense vectors. It enables fast and scalable nearest neighbor searches, particularly useful for large datasets, by employing various techniques such as quantization and indexing to optimize both memory usage and search speed.
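To make the idea of similarity search concrete, here is a minimal brute-force sketch in NumPy. It is not FAISS itself and uses none of its indexing or quantization; the helper name `cosine_top_k` and the toy vectors are assumptions made for illustration. It relies on the fact that, for L2-normalized vectors, the inner product equals the cosine similarity.

```python
import numpy as np


def cosine_top_k(query: np.ndarray, corpus: np.ndarray, k: int):
    """Return indices and cosine similarities of the k corpus vectors closest to `query`."""
    # L2-normalize so that a plain inner product equals cosine similarity
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q
    # Sort by decreasing similarity and keep the k best
    top = np.argsort(-scores)[:k]
    return top, scores[top]


toy_corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_query = np.array([2.0, 0.1])
indices, similarities = cosine_top_k(toy_query, toy_corpus, k=2)
print(indices, similarities)
```

An exhaustive scan like this costs one dot product per stored vector for every query, which is the cost that FAISS index structures are designed to reduce on large collections.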

In [10]:
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores.utils import DistanceStrategy

embedding_model = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL_NAME,
    multi_process=True,
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True},  # Set `True` for cosine similarity
)

In [11]:
KNOWLEDGE_VECTOR_DATABASE = FAISS.from_documents(
    docs_processed, embedding_model, distance_strategy=DistanceStrategy.COSINE
)

<a id='section-id5'></a>
### Calinski–Harabasz index

The Calinski–Harabasz index (CHI), also known as the Variance Ratio Criterion (VRC), is a metric for evaluating clustering algorithms. Higher values indicate denser, better-separated clusters.

Given a data set of $n$ points $\{x_1, ..., x_n\}$ and an assignment of these points to $k$ clusters $\{C_1, ..., C_k\}$, the Calinski–Harabasz (CH) index is defined as the ratio of the between-cluster separation (BCSS) to the within-cluster dispersion (WCSS), each normalized by its number of degrees of freedom:

$$
CH = \frac{BCSS / (k - 1)}{WCSS / (n - k)}
$$

$$BCSS = \sum_{i=1}^{k} n_i \|c_i - c\|^2$$

$$WCSS = \sum_{i=1}^{k} \sum_{x \in C_i} \|x - c_i\|^2$$

where $n_i$ is the number of points in cluster $C_i$, $c_i$ is the centroid of $C_i$, and $c$ is the centroid of the whole data set.
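The definition can be checked numerically with a short NumPy sketch. The function name `calinski_harabasz` and the toy data are assumptions for illustration; in practice `sklearn.metrics.calinski_harabasz_score` computes the same quantity.

```python
import numpy as np


def calinski_harabasz(X, labels) -> float:
    """Compute CH = (BCSS / (k - 1)) / (WCSS / (n - k)) directly from the definition."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = X.shape[0]
    clusters = np.unique(labels)
    k = clusters.size
    overall_centroid = X.mean(axis=0)

    bcss = 0.0  # between-cluster sum of squares
    wcss = 0.0  # within-cluster sum of squares
    for cluster in clusters:
        members = X[labels == cluster]
        centroid = members.mean(axis=0)
        bcss += members.shape[0] * np.sum((centroid - overall_centroid) ** 2)
        wcss += np.sum((members - centroid) ** 2)

    return (bcss / (k - 1)) / (wcss / (n - k))


# Two tight, well-separated clusters of two points each
toy_points = [[0, 0], [0, 2], [10, 0], [10, 2]]
toy_labels = [0, 0, 1, 1]
print(calinski_harabasz(toy_points, toy_labels))
```

For this toy data, BCSS is 100 and WCSS is 4, so the index is (100 / 1) / (4 / 2) = 50.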

In [12]:
faiss_index = KNOWLEDGE_VECTOR_DATABASE.index
vectors = faiss_index.reconstruct_n(0, faiss_index.ntotal)

In [32]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(vectors)

In [42]:
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import calinski_harabasz_score

In [56]:
# Reuse the embeddings already reconstructed from the FAISS index
X = vectors


In [75]:
k = 3  # Number of clusters
minibatch_kmeans = MiniBatchKMeans(n_clusters=k, batch_size=100, random_state=0)
minibatch_kmeans.fit(X)





In [76]:
print(calinski_harabasz_score(vectors, minibatch_kmeans.labels_))

406.8930940026027


In [83]:
labels = minibatch_kmeans.labels_

In [55]:
import pacmap
import numpy as np
import plotly.express as px

embedding_projector = pacmap.PaCMAP(n_components=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0, random_state=1)

# Full-dimensional chunk embeddings, reconstructed one by one from the FAISS index
embeddings = np.array(
    [KNOWLEDGE_VECTOR_DATABASE.index.reconstruct_n(idx, 1)[0] for idx in range(len(docs_processed))]
)

# Fit the data (the index of transformed data corresponds to the index of the original data)
documents_projected = embedding_projector.fit_transform(embeddings, init="pca")





In [77]:
df = pd.DataFrame.from_dict(
    [
        {
            "x": documents_projected[i, 0],
            "y": documents_projected[i, 1],
            "source": minibatch_kmeans.labels_[i],
            "extract": docs_processed[i].page_content[:100] + "...",
            "symbol": "circle",
            "size_col": 4,
        }
        for i in range(len(docs_processed))
    ]
)


In [78]:

# Visualize the embedding
fig = px.scatter(
    df,
    x="x",
    y="y",
    color="source",
    size="size_col",
    symbol="symbol",
    width=1000,
    height=700,
)
fig.update_traces(
    marker=dict(opacity=1, line=dict(width=0, color="DarkSlateGrey")),
    selector=dict(mode="markers"),
)
fig.update_layout(
    legend_title_text="<b>Cluster</b>",
    title="<b>2D Projection of Chunk Embeddings via PaCMAP</b>",
)
fig.show()

In [80]:
user_query = 'Who is Elbert'

In [81]:
retrieved_docs = KNOWLEDGE_VECTOR_DATABASE.similarity_search(query=user_query, k=5)

In [82]:
embedded_docs = embedding_model.embed_documents([doc.page_content for doc in retrieved_docs[:5]])
found_indices = np.array(embedded_docs)

In [92]:
import numpy as np
from sklearn.metrics import pairwise_distances
X = vectors
sample_vectors = found_indices  # Embeddings of the 5 retrieved texts

def dist(X, sample_texts, labels):
    """Return, for each text, the distance from its embedding to the nearest cluster centre."""
    # Embed the texts that were passed in
    sample_vectors = np.array(embedding_model.embed_documents(list(sample_texts)))

    # Find the unique cluster labels
    unique_labels = np.unique(labels)

    # Compute the cluster centres as the mean vector of each cluster
    cluster_centers = []
    for label in unique_labels:
        if label != -1:  # ignore noise points
            cluster_data = X[labels == label]
            cluster_center = np.mean(cluster_data, axis=0)
            cluster_centers.append(cluster_center)

    cluster_centers = np.array(cluster_centers)

    # Distances from every sample to every cluster centre
    distances = pairwise_distances(sample_vectors, cluster_centers)

    # Keep the smallest distance for each text in the sample
    nearest_distances = np.min(distances, axis=1)
    return nearest_distances
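The nearest-centre computation can be tried in isolation on toy data. This is a minimal NumPy-only sketch: the helper name `nearest_centroid_distances` and the 2D points are assumptions, and the embedding step is left out so that the samples are passed in as vectors.

```python
import numpy as np


def nearest_centroid_distances(X, labels, samples) -> np.ndarray:
    """Euclidean distance from each sample to the closest cluster centroid."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    samples = np.asarray(samples, dtype=float)
    # One centroid per cluster: the mean of the vectors assigned to it
    centroids = np.stack([X[labels == c].mean(axis=0) for c in np.unique(labels)])
    # Pairwise differences via broadcasting, shape (n_samples, n_clusters, dim)
    diffs = samples[:, None, :] - centroids[None, :, :]
    distances = np.linalg.norm(diffs, axis=2)
    return distances.min(axis=1)


toy_X = [[0, 0], [2, 0], [10, 0], [12, 0]]
toy_cluster_labels = [0, 0, 1, 1]
toy_samples = [[1, 3], [11, -4]]
print(nearest_centroid_distances(toy_X, toy_cluster_labels, toy_samples))
```

Here the centroids are (1, 0) and (11, 0), so the two samples lie at distances 3 and 4 from their nearest centre. A small distance means the chunk sits close to the core of a cluster.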


<a id='section-id6'></a>
### RAGAS evaluation

RAGAS defines the following metrics for evaluating a RAG pipeline (GT denotes the ground-truth answer):

* $\text{Faithfulness} = \frac{|\text{claims in the generated answer that can be inferred from the given context}|}{|\text{claims in the generated answer}|}$

* $\text{Context Recall} = \frac{|\text{GT sentences that can be attributed to the context}|}{|\text{sentences in GT}|}$

* $\text{Context Precision@}K = \frac{\sum_{k=1}^{K} \left(\text{Precision@}k \times v_k\right)}{\text{total number of relevant items in the top } K \text{ results}}$, where $v_k \in \{0, 1\}$ indicates whether the item at rank $k$ is relevant

* $\text{Precision@}k = \frac{\text{true positives@}k}{\text{true positives@}k + \text{false positives@}k}$
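As a worked example of the last two formulas, the sketch below computes Context Precision@K from a list of binary relevance judgments, one per retrieved chunk in rank order. The function names and the input format are assumptions for illustration; the RAGAS library derives these judgments with an LLM rather than taking them as input.

```python
from typing import Sequence


def precision_at_k(relevance: Sequence[int], k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(relevance[:k]) / k


def context_precision_at_k(relevance: Sequence[int]) -> float:
    """Average of Precision@k over the ranks k that hold a relevant item."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0  # nothing relevant was retrieved
    weighted = sum(
        precision_at_k(relevance, k) * relevance[k - 1]
        for k in range(1, len(relevance) + 1)
    )
    return weighted / total_relevant


# Relevant chunks at ranks 1 and 3, an irrelevant one at rank 2
print(context_precision_at_k([1, 0, 1]))
```

With relevant items at ranks 1 and 3, Precision@1 = 1 and Precision@3 = 2/3, so the score is (1 + 2/3) / 2 = 5/6. Moving the irrelevant chunk to the last rank would raise the score to 1, which is why the metric rewards rankings that put relevant chunks first.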

In [87]:
from transformers import pipeline
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

READER_MODEL_NAME = "HuggingFaceH4/zephyr-7b-beta"  # TODO: replace with Llama

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(READER_MODEL_NAME, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(READER_MODEL_NAME)

READER_LLM = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    do_sample=True,
    temperature=0.2,
    repetition_penalty=1.1,
    return_full_text=False,
    max_new_tokens=500,
)

In [88]:
# TODO: fix the prompt template

prompt_in_chat_format = [
    {
        "role": "system",
        "content": """Using the information contained in the context,
give a comprehensive answer to the question.
Respond only to the question asked, response should be concise and relevant to the question.
Provide the number of the source document when relevant.
If the answer cannot be deduced from the context, do not give an answer.""",
    },
    {
        "role": "user",
        "content": """Context:
{context}
---
Now here is the question you need to answer.

Question: {question}""",
    },
]
RAG_PROMPT_TEMPLATE = tokenizer.apply_chat_template(
    prompt_in_chat_format, tokenize=False, add_generation_prompt=True
)
print(RAG_PROMPT_TEMPLATE)

<|system|>
Using the information contained in the context,
give a comprehensive answer to the question.
Respond only to the question asked, response should be concise and relevant to the question.
Provide the number of the source document when relevant.
If the answer cannot be deduced from the context, do not give an answer.</s>
<|user|>
Context:
{context}
---
Now here is the question you need to answer.

Question: {question}</s>
<|assistant|>



In [90]:
from ragatouille import RAGPretrainedModel

RERANKER = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

In [99]:
from transformers import Pipeline


def answer_with_rag(
    question: str,
    llm: Pipeline,
    knowledge_index: FAISS,
    reranker: Optional[RAGPretrainedModel] = None,
    num_retrieved_docs: int = 30,
    num_docs_final: int = 5,
) -> Tuple[str, List[LangchainDocument]]:
    # Gather documents with retriever
    # print("=> Retrieving documents...")
    relevant_docs = knowledge_index.similarity_search(query=question, k=num_retrieved_docs)
    relevant_docs = [doc.page_content for doc in relevant_docs]  # Keep only the text

    # Optionally rerank results
    if reranker:
        # print("=> Reranking documents...")
        relevant_docs = reranker.rerank(question, relevant_docs, k=num_docs_final)
        relevant_docs = [doc["content"] for doc in relevant_docs]

    relevant_docs = relevant_docs[:num_docs_final]
    # Distance from each retained chunk to its nearest cluster centre (embedding-quality signal)
    dist_val = dist(X, relevant_docs, labels)

    # Build the final prompt: one block per document, followed by its distance
    context = "\nExtracted documents:\n"
    context += "".join(
        [
            f"Document {i}:::\n{doc}\nDistance to nearest cluster centre: {dist_val[i]:.4f}\n"
            for i, doc in enumerate(relevant_docs)
        ]
    )

    final_prompt = RAG_PROMPT_TEMPLATE.format(question=question, context=context)

    # Generate an answer
    # print("=> Generating answer...")
    answer = llm(final_prompt)[0]["generated_text"]

    return answer, relevant_docs

In [96]:
question = "who is albert"

answer, relevant_docs = answer_with_rag(question, READER_LLM, KNOWLEDGE_VECTOR_DATABASE, reranker=RERANKER)

=> Retrieving documents...
=> Reranking documents...


100%|██████████| 1/1 [00:00<00:00,  3.66it/s]


=> Generating answer...


In [97]:
answer

'Answer: Albert can refer to either Albert Einstein (0:::) or King Albert (mentioned in Document 2 in relation to the abdication of King Edward VIII). Without further context, it is impossible to determine which Albert is being referred to. Please provide additional information to clarify which Albert is being discussed.'

In [None]:
ds_test = ds_test.map(lambda x: {'rag_answer': answer_with_rag(x['question'], READER_LLM, KNOWLEDGE_VECTOR_DATABASE, reranker=RERANKER)[0]})