# Synthetic Test Set Generation using RAGAS
This notebook demonstrates how to generate a synthetic test set of questions that can be used to evaluate a RAG pipeline using the RAGAS library.

In [2]:
import nest_asyncio
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import OllamaEmbeddings
from ragas.run_config import RunConfig
from ragas.testset.evolutions import multi_context, reasoning, simple
from ragas.testset.generator import TestsetGenerator
from langchain.docstore.document import Document
import json

In [3]:
nest_asyncio.apply()  # patch asyncio so nested event loops work inside Jupyter

First we create the objects needed to access the local LLM and embeddings, then build the generator that will create the test set. We also define a distribution over the types of questions we want in the test set.

In [4]:
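llm = ChatOllama(model="mistral-nemo", num_ctx=16384)
embeddings = OllamaEmbeddings(model="mistral-nemo", num_ctx=16384)
# The same local model serves as both the generator LLM and the critic LLM.
gen = TestsetGenerator.from_langchain(
    generator_llm=llm,
    critic_llm=llm,
    embeddings=embeddings,
    run_config=RunConfig(max_workers=1, max_retries=1),
)
# Share of each question type in the generated test set (weights sum to 1).
dist = {simple: 0.6, multi_context: 0.2, reasoning: 0.2}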
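
As a rough illustration of what the distribution implies, the sketch below checks that the weights sum to 1 and converts them into approximate per-type question counts for a given test size. The string keys and the `expected_counts` helper are hypothetical stand-ins for the evolution objects, and the way RAGAS allocates questions internally may differ.

```python
import math


def expected_counts(dist, test_size):
    """Approximate number of questions per type for a given test size."""
    total = sum(dist.values())
    if not math.isclose(total, 1.0):
        raise ValueError(f"weights should sum to 1, got {total}")
    # Simple rounding; the rounded counts may not always add up to test_size.
    return {name: round(weight * test_size) for name, weight in dist.items()}


# Stand-in string keys mirror the distribution used in this notebook.
example_dist = {"simple": 0.6, "multi_context": 0.2, "reasoning": 0.2}
counts = expected_counts(example_dist, 5)
print(counts)
```

With a test size of 5, the weights above correspond to roughly three simple questions and one each of the other two types.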

Next we load the documents that will be used to create the synthetic test set. They can be loaded in any way you see fit, but each must be of type `langchain.docstore.document.Document`.

In [5]:
with open("../data/extracted_metadata.json") as f:
    json_data = json.load(f)
    docs = [
        Document(
            page_content=metadata["value"],
            metadata={"id": metadata["id"], "field": metadata["field"]},
        )
        for metadata in json_data
    ]
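
The loading cell above assumes that every JSON entry has `id`, `field` and `value` keys. A minimal sketch of checking that assumption before building the documents could look like the following; the `find_invalid_records` helper and the sample records are hypothetical and only illustrate the expected shape.

```python
REQUIRED_KEYS = {"id", "field", "value"}


def find_invalid_records(records):
    """Return (index, missing_keys) pairs for entries lacking a required key."""
    problems = []
    for index, record in enumerate(records):
        missing = REQUIRED_KEYS - set(record)
        if missing:
            problems.append((index, sorted(missing)))
    return problems


# Made-up records: the second one has no "value" key.
sample = [
    {"id": "a1", "field": "title", "value": "Example title"},
    {"id": "a2", "field": "description"},
]
print(find_invalid_records(sample))
```

Running a check like this on `json_data` first turns a mid-comprehension `KeyError` into a list of the entries that need attention.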

In [6]:
docs

[Document(metadata={'id': 'b77ce981-d038-4774-a620-f50da5dd3d31', 'field': 'title'}, page_content='Land Cover Map 2017 (land parcels, GB)'),
 Document(metadata={'id': 'b77ce981-d038-4774-a620-f50da5dd3d31', 'field': 'description'}, page_content="This is the land parcels (polygon) dataset for the UKCEH Land Cover Map of 2017 (LCM2017) representing Great Britain. It describes Great Britain's land cover in 2017 using UKCEH Land Cover Classes, which are based on UK Biodiversity Action Plan broad habitats.  This dataset was derived from the corresponding LCM2017 20m classified pixels dataset.  All further LCM2017 datasets for Great Britain are derived from this land parcel product.  A range of land parcel attributes are provided.  These include the dominant UKCEH Land Cover Class given as an integer value, and a range of per-parcel pixel statistics to help assessing classification confidence and accuracy; for a full explanation please refer to the dataset documentation.\n\nThis work was sup

Then we generate the test set. We generate only `5` questions here for speed; increase the count to whatever your evaluation needs.

In [7]:
testset = gen.generate_with_langchain_docs(
    docs, test_size=5, distributions=dist, is_async=False
)

Filename and doc_id are the same for all nodes.                 
Generating: 100%|██████████| 5/5 [01:10<00:00, 14.03s/it]


Finally, convert the test set to a pandas DataFrame so that it can be inspected and saved.

In [8]:
df = testset.to_pandas()
df

|   | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | What are the total number of land parcels iden... | [Land Cover Map 2020 (land parcels, N. Ireland)] | The answer to given question is not present in... | simple | [{'id': '36343ace-d56a-43ea-9d48-2f434dafcb26'... | True |
| 1 | What are the land parcels included in the Land... | [Land Cover Map 2017 (land parcels, GB)] | The answer to given question is not present in... | simple | [{'id': 'b77ce981-d038-4774-a620-f50da5dd3d31'... | True |
| 2 | What are the main land cover classes used in t... | [This is the land parcels (polygon) dataset fo... | The main land cover classes used in the LCM201... | simple | [{'id': 'b77ce981-d038-4774-a620-f50da5dd3d31'... | True |
| 3 | What did RF ID in NI (LCM2020) vs UK (LCM2021)? | [UKCEH’s automated land cover algorithms gener... | The answer to given question is not present in... | multi_context | [{'id': '36343ace-d56a-43ea-9d48-2f434dafcb26'... | True |
| 4 | What's NI's top two land covers from '21 & '20... | [Land Cover Map 2021 (25m rasterised land parc... | The top two land covers in Northern Ireland fo... | reasoning | [{'id': 'f3310fe1-a6ea-4cdd-b9f6-f7fc66e4652e'... | True |
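
The cells above stop at displaying the DataFrame. One possible way to save it is sketched below with a small stand-in frame; the output path and file name are assumptions, not something this notebook defines.

```python
import os
import tempfile

import pandas as pd

# Small stand-in frame with the same kind of columns as the test set.
df_example = pd.DataFrame(
    {
        "question": ["What does the dataset describe?"],
        "evolution_type": ["simple"],
    }
)

# Hypothetical output location; replace with a path in your project.
out_path = os.path.join(tempfile.mkdtemp(), "synthetic_testset.csv")

# index=False avoids an extra "Unnamed: 0" column when the file is read back.
df_example.to_csv(out_path, index=False)
reloaded = pd.read_csv(out_path)
print(list(reloaded.columns))
```

For the real test set, the same `to_csv` call would be applied to `df`. Note that list-valued columns such as `contexts` are written as strings in CSV, so a format like JSON may be preferable if they need to round-trip unchanged.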
