<a href="https://colab.research.google.com/github/Alishaw99/Evaluating-LLMs-And-Rags/blob/main/RAGAS_Synthetic_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Evaluating RAG (Retrieval-Augmented Generation) pipelines is crucial for assessing their performance. However, manually creating hundreds of QCA (Question-Context-Answer) samples from documents is time-consuming and labor-intensive. Human-written questions may also fall short of the complexity needed for a thorough evaluation, which lowers the quality of the assessment. Synthetic data generation can reduce the developer time spent on data aggregation by 90%.

In [14]:
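# ragas is pinned below 0.2 because the ragas.testset.generator and
# ragas.testset.evolutions modules imported in the next cell belong to the 0.1.x API.
!pip install "ragas<0.2" langchain-community langchain-openai sentence_transformers xmltodict python-dotenv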



In [15]:
import os
from google.colab import userdata
import pandas as pd
from langchain_community.document_loaders import PubMedLoader
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_openai import ChatOpenAI
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

In [16]:
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

In [17]:
data_generation_model = ChatOpenAI(model='gpt-4o-mini')

In [18]:
critic_model = ChatOpenAI(model='gpt-4o')

In [19]:
model_name = "BAAI/bge-small-en"
model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": True}
embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
)

In [20]:
loader = PubMedLoader("cancer", load_max_docs=5)

In [21]:
loader

<langchain_community.document_loaders.pubmed.PubMedLoader at 0x7db1780e7dc0>

In [22]:
documents = loader.load()

In [23]:
documents

[Document(metadata={'uid': '39146560', 'Title': 'RNF43 in cancer: Molecular understanding and clinical significance in immunotherapy.', 'Published': '--', 'Copyright Information': '© 2024 John Wiley & Sons Ltd.'}, page_content='Identifying biomarkers to predict immune checkpoint inhibitor (ICI) efficacy is warranted. Considering that somatic mutation-derived neoantigens induce strong immune responses, patients with a high tumor mutational burden reportedly tend to respond to ICIs. Therefore, the original function of neoantigenic mutations and their impact on the tumor microenvironment (TME) require attention. RNF43 is a type of RING E3 ubiquitin ligase, and long-term survivors in most cancers had conserved patterns of mutations of RNF43. Also, high microsatellite instability patients had a higher RNF43 mutation rate compared with microsatellite stability tumor patients, who were more sensitive to ICI treatment. Therefore, RNF43 has become a promising biomarker of immunotherapy in a wid

In [24]:
generator = TestsetGenerator.from_langchain(
    data_generation_model,
    critic_model,
    embeddings
)

In [25]:
distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}
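
The weights in `distributions` describe the intended mix of question types, so they should sum to 1. Below is a minimal sketch of a sanity check one might run before generation. It uses plain strings as stand-ins for the Ragas evolution objects, and `expected_counts` is a hypothetical helper, not part of Ragas. How Ragas itself rounds these shares into whole questions is not shown in this notebook, so the sketch only reports the unrounded expected counts.

```python
import math


def expected_counts(distribution, test_size):
    """Return the unrounded number of questions each evolution type should get."""
    total = sum(distribution.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"weights must sum to 1, got {total}")
    return {name: weight * test_size for name, weight in distribution.items()}


# String keys stand in for the simple, multi_context and reasoning objects.
demo_distribution = {"simple": 0.5, "multi_context": 0.4, "reasoning": 0.1}
demo_counts = expected_counts(demo_distribution, test_size=5)
print(demo_counts)
```

With five questions, the 0.1 share for `reasoning` corresponds to half a question, so a test set this small cannot match the requested mix exactly.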

In [26]:
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=5,  # number of question/context/ground-truth samples to generate
    distributions=distributions,
)

In [27]:
test_df = testset.to_pandas()

In [28]:
print(test_df)

                                            question  \
0  What role do clinician response assessments pl...   
1  What is the significance of understanding the ...   
2  How does the RNF43 mutation affect treatment r...   
3  What role do clinician assessments play in mNS...   
4  How does RNF43 mutation affect tumor microenvi...   

                                            contexts  \
0  [PURPOSE: Real-world data (RWD) holds promise ...   
1  [PURPOSE: Financial hardship (FH) is a complex...   
2  [Identifying biomarkers to predict immune chec...   
3  [PURPOSE: Real-world data (RWD) holds promise ...   
4  [Identifying biomarkers to predict immune chec...   

                                        ground_truth evolution_type  \
0  Clinician response assessments play a crucial ...         simple   
1  The significance of understanding the intercon...         simple   
2  The answer to given question is not present in...  multi_context   
3  Clinician assessments play a crucial ro

In [29]:
test_df

|  | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | What role do clinician response assessments pl... | [PURPOSE: Real-world data (RWD) holds promise ... | Clinician response assessments play a crucial ... | simple | [{'uid': '39146509', 'Title': 'Evaluation of R... | True |
| 1 | What is the significance of understanding the ... | [PURPOSE: Financial hardship (FH) is a complex... | The significance of understanding the intercon... | simple | [{'uid': '39146505', 'Title': 'Exploring the R... | True |
| 2 | How does the RNF43 mutation affect treatment r... | [Identifying biomarkers to predict immune chec... | The answer to given question is not present in... | multi_context | [{'uid': '39146560', 'Title': 'RNF43 in cancer... | True |
| 3 | What role do clinician assessments play in mNS... | [PURPOSE: Real-world data (RWD) holds promise ... | Clinician assessments play a crucial role in e... | multi_context | [{'uid': '39146509', 'Title': 'Evaluation of R... | True |
| 4 | How does RNF43 mutation affect tumor microenvi... | [Identifying biomarkers to predict immune chec... | The RNF43 mutation affects the tumor microenvi... | reasoning | [{'uid': '39146560', 'Title': 'RNF43 in cancer... | True |
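
Row 2 above has a ground truth that begins "The answer to given question is not present in", which suggests the generator could not answer its own question from the supplied context. A minimal sketch of filtering out such rows follows. It works on plain dictionaries, such as those `test_df.to_dict("records")` would return; the marker string and the `drop_unanswerable` helper are assumptions made for illustration, not Ragas features.

```python
UNANSWERABLE_MARKER = "the answer to given question is not present"


def drop_unanswerable(records, marker=UNANSWERABLE_MARKER):
    """Keep only records whose ground truth does not start with the marker."""
    return [
        record
        for record in records
        if not record["ground_truth"].strip().lower().startswith(marker)
    ]


# Shortened, illustrative records shaped like rows of the generated test set.
demo_records = [
    {"question": "Q1", "ground_truth": "Clinician response assessments play a crucial ..."},
    {"question": "Q2", "ground_truth": "The answer to given question is not present in..."},
]
kept = drop_unanswerable(demo_records)
print([record["question"] for record in kept])
```

Matching on a fixed prefix is brittle; reviewing the flagged rows by hand before dropping them is a safer default.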


In [30]:
from google.colab import sheets
sheet = sheets.InteractiveSheet(df=test_df)

https://docs.google.com/spreadsheets/d/19vat6-NVEvWvfBMYu61fMZoODJ9RUyqVQWMnTOUxK7E#gid=0




Ragas takes a novel approach to generating evaluation data. An ideal evaluation dataset should cover the types of questions encountered in production, including questions of varying difficulty. By default, LLMs are not good at creating diverse samples because they tend to follow common paths. Inspired by works like Evol-Instruct, Ragas addresses this with an evolutionary generation paradigm, in which questions with different characteristics, such as reasoning, conditioning, and multi-context, are systematically crafted from the provided set of documents.
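
The idea of evolving a seed question into harder variants can be illustrated with a toy sketch. This is not how Ragas is implemented: string templates stand in for the LLM prompts a real system would use, and every name here (`EVOLUTIONS`, `evolve_questions`, the template wording) is hypothetical.

```python
import random


def evolve_reasoning(question):
    # Stand-in for a prompt that makes the question require multi-step reasoning.
    return f"{question} Explain the chain of reasoning that leads to the answer."


def evolve_conditioning(question):
    # Stand-in for a prompt that adds a condition or constraint to the question.
    return f"{question} Under which conditions does this hold?"


def evolve_multi_context(question):
    # Stand-in for a prompt that forces the answer to draw on several passages.
    return f"{question} Combine evidence from at least two documents."


EVOLUTIONS = {
    "simple": lambda question: question,  # seed question is kept unchanged
    "reasoning": evolve_reasoning,
    "conditioning": evolve_conditioning,
    "multi_context": evolve_multi_context,
}


def evolve_questions(seeds, distribution, rng):
    """Pick one evolution per seed according to the weights and apply it."""
    names = list(distribution)
    weights = [distribution[name] for name in names]
    evolved = []
    for seed in seeds:
        name = rng.choices(names, weights=weights, k=1)[0]
        evolved.append((name, EVOLUTIONS[name](seed)))
    return evolved


demo_seeds = [
    "What does the document say about topic A?",
    "How is topic B measured?",
    "Why does topic C matter?",
]
demo_mix = {"simple": 0.5, "multi_context": 0.4, "reasoning": 0.1}
demo_evolved = evolve_questions(demo_seeds, demo_mix, random.Random(0))
for name, question in demo_evolved:
    print(f"{name}: {question}")
```

Each evolved question keeps its seed as a prefix, so the variants remain traceable to the text they were derived from.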