Evaluating Retrieval-Augmented Generation (RAG) pipelines is crucial for assessing their performance. However, manually creating hundreds of question-context-answer samples from documents is time-consuming and labor-intensive. Human-written questions may also fail to reach the level of complexity required for a thorough evaluation, which lowers the quality of the assessment. Using synthetic data generation, the developer time spent on the data aggregation process can be reduced by 90%.

In [36]:
!pip install ragas langchain-openai sentence_transformers xmltodict python-dotenv


In [37]:
import os
from google.colab import userdata
import pandas as pd
from langchain_community.document_loaders import PubMedLoader
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_openai import ChatOpenAI
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

In [38]:
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

In [39]:
data_generation_model = ChatOpenAI(model='gpt-4o-mini')

In [40]:
critic_model = ChatOpenAI(model='gpt-4o')

In [46]:
model_name = "BAAI/bge-small-en"
model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": True}
embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
)
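
The `normalize_embeddings: True` setting scales each embedding to unit length, so a plain dot product between two embeddings equals their cosine similarity. The following is a minimal sketch of that property using hand-made two-dimensional vectors. The vectors and helper names are illustrative assumptions and are not part of the notebook or of any library.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for real embeddings (hypothetical values).
a = [3.0, 4.0]
b = [4.0, 3.0]

a_unit = l2_normalize(a)
b_unit = l2_normalize(b)

# After normalization, the dot product alone gives the cosine similarity.
print(dot(a_unit, b_unit), cosine_similarity(a, b))
```

This is why a retrieval step can rank normalized embeddings with a dot product without dividing by vector lengths for every comparison.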

In [42]:
loader = PubMedLoader("cancer", load_max_docs=5)

In [43]:
loader

<langchain_community.document_loaders.pubmed.PubMedLoader at 0x7dcae911eb90>

In [44]:
documents = loader.load()

In [45]:
documents

[Document(metadata={'uid': '39137027', 'Title': 'Standardization of scan protocols for RT CT simulator from different vendors using quantitative image quality technique.', 'Published': '2024-08-13', 'Copyright Information': '© 2024 The Author(s). Journal of Applied Clinical Medical Physics published by Wiley Periodicals, LLC on behalf of The American Association of Physicists in Medicine.'}, page_content="OBJECTIVE: To investigate the feasibility of standardizing RT simulation CT scanner protocols between vendors using target-based image quality (IQ) metrics.\nMETHOD AND MATERIALS: A systematic assessment process in phantom was developed to standardize clinical scan protocols for scanners from different vendors following these steps: (a) images were acquired by varying CTDI and using an iterative reconstruction (IR) method (IR: iDose and model-based iterative reconstruction [IMR] of CT-Philips Big Bore scanner, SAFIRE of CT-Siemens biograph PETCT scanner), (b) CT exams were classified 

In [47]:
generator = TestsetGenerator.from_langchain(
    data_generation_model,
    critic_model,
    embeddings
)

In [48]:
distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}
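
The weights in `distributions` are shares of the test set, so they should sum to 1, and with a small test size they cannot always be honored exactly. The sketch below shows one way to check the weights and turn them into whole-number question counts using largest-remainder rounding. It is an illustration only: `allocate_counts` is a hypothetical helper, string keys stand in for the Ragas evolution objects, and this is not a description of how Ragas allocates questions internally.

```python
import math


def allocate_counts(weights, test_size):
    """Split test_size into integer counts proportional to weights.

    Uses largest-remainder rounding; ties go to the key listed first.
    """
    total = sum(weights.values())
    if not math.isclose(total, 1.0):
        raise ValueError(f"weights must sum to 1, got {total}")

    exact = {name: share * test_size for name, share in weights.items()}
    counts = {name: math.floor(value) for name, value in exact.items()}

    # Hand out the questions lost to rounding, largest fractional part first.
    leftover = test_size - sum(counts.values())
    by_remainder = sorted(
        exact, key=lambda name: exact[name] - counts[name], reverse=True
    )
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Same shares as the notebook, with strings in place of evolution objects.
weights = {"simple": 0.5, "multi_context": 0.4, "reasoning": 0.1}
counts = allocate_counts(weights, 5)
print(counts)
```

With only five questions, a 10% share is less than one question, so a small test set will generally not reproduce the requested mix exactly.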

In [49]:
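# Generate 5 test samples, mixing question types according to `distributions`
testset = generator.generate_with_langchain_docs(
    documents, test_size=5, distributions=distributions
)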

embedding nodes:   0%|          | 0/10 [00:00<?, ?it/s]



Generating:   0%|          | 0/5 [00:00<?, ?it/s]

In [50]:
test_df = testset.to_pandas()

In [51]:
print(test_df)

                                            question  \
0  How does the level of T-cell exhaustion affect...   
1  What is the role of liquid-liquid phase separa...   
2  How does LLPS affect PML/RARα microspeckles in...   
3  How does LLPS affect PML/RARα microspeckles in...   
4  How did the development of the TAA burden (TAB...   

                                            contexts  \
0  [Tumor-associated antigens (TAAs) are importan...   
1  [In acute promyelocytic leukemia (APL), the pr...   
2  [Tumor-associated antigens (TAAs) are importan...   
3  [Tumor-associated antigens (TAAs) are importan...   
4  [Tumor-associated antigens (TAAs) are importan...   

                                        ground_truth evolution_type  \
0  The level of T-cell exhaustion affects the ass...         simple   
1  Liquid-liquid phase separation (LLPS) is a key...         simple   
2  The answer to given question is not present in...  multi_context   
3  The answer to given question is not pre

In [52]:
test_df

Unnamed: 0,question,contexts,ground_truth,evolution_type,metadata,episode_done
0,How does the level of T-cell exhaustion affect...,[Tumor-associated antigens (TAAs) are importan...,The level of T-cell exhaustion affects the ass...,simple,"[{'uid': '39137006', 'Title': 'Tumor-Associate...",True
1,What is the role of liquid-liquid phase separa...,"[In acute promyelocytic leukemia (APL), the pr...",Liquid-liquid phase separation (LLPS) is a key...,simple,"[{'uid': '39136995', 'Title': 'Phase separatio...",True
2,How does LLPS affect PML/RARα microspeckles in...,[Tumor-associated antigens (TAAs) are importan...,The answer to given question is not present in...,multi_context,"[{'uid': '39137006', 'Title': 'Tumor-Associate...",True
3,How does LLPS affect PML/RARα microspeckles in...,[Tumor-associated antigens (TAAs) are importan...,The answer to given question is not present in...,multi_context,"[{'uid': '39137006', 'Title': 'Tumor-Associate...",True
4,How did the development of the TAA burden (TAB...,[Tumor-associated antigens (TAAs) are importan...,The development of the TAA burden (TAB) algori...,simple,"[{'uid': '39137006', 'Title': 'Tumor-Associate...",True
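
In the table above, rows 2 and 3 share the same question, and their `ground_truth` begins with "The answer to given question is not present in". Rows like these are of little use as evaluation samples, so it can help to filter them out before running an evaluation. The sketch below is one possible cleanup step, written against plain dictionaries so it runs on its own. `clean_testset` and `NO_ANSWER_PREFIX` are hypothetical names, the prefix is taken from the truncated output above, and the sample rows are shortened stand-ins rather than the real generated text.

```python
# Assumed marker for unanswerable samples, copied from the truncated output.
NO_ANSWER_PREFIX = "The answer to given question is not present"


def clean_testset(records):
    """Drop rows without a usable ground truth and repeated questions."""
    seen_questions = set()
    kept = []
    for row in records:
        if str(row["ground_truth"]).startswith(NO_ANSWER_PREFIX):
            continue  # the critic could not answer from the contexts
        if row["question"] in seen_questions:
            continue  # keep only the first copy of a question
        seen_questions.add(row["question"])
        kept.append(row)
    return kept


# Shortened, made-up stand-ins that mirror the shape of the rows above.
sample = [
    {"question": "Q1", "ground_truth": "Answer 1", "evolution_type": "simple"},
    {"question": "Q2", "ground_truth": "Answer 2", "evolution_type": "simple"},
    {"question": "Q3", "ground_truth": NO_ANSWER_PREFIX + " in context",
     "evolution_type": "multi_context"},
    {"question": "Q3", "ground_truth": NO_ANSWER_PREFIX + " in context",
     "evolution_type": "multi_context"},
    {"question": "Q1", "ground_truth": "Answer 1", "evolution_type": "simple"},
]

cleaned = clean_testset(sample)
print([row["question"] for row in cleaned])
```

With the real data, the same function could be applied to `test_df.to_dict("records")`, and the result passed to `pd.DataFrame` to get a table back.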


In [53]:
from google.colab import sheets
sheet = sheets.InteractiveSheet(df=test_df)

https://docs.google.com/spreadsheets/d/18mP55CFzzr7Tmj-OcGsmCw5_nKT6yPjskDpasLesKqs#gid=0




Ragas takes a novel approach to generating evaluation data. An ideal evaluation dataset should cover the types of questions encountered in production, including questions of varying difficulty. By default, LLMs are not good at creating diverse samples because they tend to follow common paths. Inspired by works like Evol-Instruct, Ragas addresses this with an evolutionary generation paradigm, in which questions with different characteristics, such as reasoning, conditioning, and multi-context, are systematically crafted from the provided set of documents.
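
Since the goal is a dataset that covers several question types, it is worth checking whether the generated set actually matches the requested mix. The sketch below compares the `evolution_type` values shown in the table earlier in this notebook with the requested shares. `evolution_mix` is a hypothetical helper, and the labels are typed in by hand rather than read from `test_df`.

```python
from collections import Counter


def evolution_mix(evolution_types):
    """Return the share of each evolution type in a list of labels."""
    if not evolution_types:
        return {}
    counts = Counter(evolution_types)
    total = len(evolution_types)
    return {name: count / total for name, count in counts.items()}


# Labels copied by hand from the evolution_type column of the five rows.
observed = ["simple", "simple", "multi_context", "multi_context", "simple"]

# Shares requested in the notebook, with strings in place of evolution objects.
requested = {"simple": 0.5, "multi_context": 0.4, "reasoning": 0.1}

mix = evolution_mix(observed)

# Positive gap: over-represented type; negative gap: under-represented type.
gaps = {name: mix.get(name, 0.0) - share for name, share in requested.items()}
for name, gap in gaps.items():
    print(f"{name}: requested {requested[name]:.0%}, gap {gap:+.0%}")
```

In this run no `reasoning` question was produced, which is expected for a test size this small. A larger test size gives each type a realistic chance to appear.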