# Synthetic Test Set Generation using RAGAS
This notebook demonstrates how to use the RAGAS library to generate a synthetic test set of questions for evaluating a RAG pipeline.

In [None]:
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.chat_models import ChatOllama
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from ragas.run_config import RunConfig
import nest_asyncio

In [None]:
nest_asyncio.apply()  # allow nested event loops so async code can run inside Jupyter

First, we create the objects needed to access the local LLM and embeddings, then build the generator object that will produce the test set. The same LLM is passed twice, so it serves as both the generator and the critic. We then define a distribution over the types of questions we want in the test set.

In [None]:
llm = ChatOllama(model="mistral-nemo", num_ctx=16384)
embeddings = OllamaEmbeddings(model="mistral-nemo", num_ctx=16384)
gen = TestsetGenerator.from_langchain(
    llm, llm, embeddings, run_config=RunConfig(max_workers=1, max_retries=1)
)
dist = {simple: 0.6, multi_context: 0.2, reasoning: 0.2}
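
The distribution above assigns a weight to each question type. As a rough illustration of what those weights mean for a given test size, here is a minimal sketch that treats them as proportions summing to 1. The string keys and the `expected_counts` helper are hypothetical stand-ins (not part of RAGAS) so the example runs on its own, and the counts the library actually produces may differ.

```python
from math import isclose

# Hypothetical stand-ins: string keys replace the ragas evolution objects
# (simple, multi_context, reasoning) so this sketch runs without ragas.
example_dist = {"simple": 0.6, "multi_context": 0.2, "reasoning": 0.2}


def expected_counts(dist, test_size):
    """Return the approximate number of questions per type for a test size."""
    # Assumption: weights are proportions and should sum to 1.
    if not isclose(sum(dist.values()), 1.0):
        raise ValueError("distribution weights should sum to 1")
    return {name: round(weight * test_size) for name, weight in dist.items()}


counts = expected_counts(example_dist, 5)
print(counts)
```

With a test size of `5`, most of the questions fall into the `simple` category and the remainder is split between the other two types.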

Next, we load the documents that the synthetic test set will be based on. These can be loaded in any way you see fit, but they should be of type `langchain.docstore.document.Document`.

In [None]:
docs = []  # load a set of langchain documents to base the synthetic test set generation on
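
One possible way to fill `docs` is to read plain-text files from a folder. The sketch below uses only the standard library and builds dictionaries shaped like a `Document` (`page_content` plus `metadata`). The `load_text_records` helper, the `data/raw` folder, and the choice of metadata keys are assumptions made for illustration, not something this notebook or RAGAS prescribes.

```python
from pathlib import Path


def load_text_records(folder):
    """Read every .txt file in `folder` into a Document-shaped dict."""
    records = []
    for path in sorted(Path(folder).glob("*.txt")):
        records.append(
            {
                "page_content": path.read_text(encoding="utf-8"),
                # Assumption: keeping the source path and file name in the
                # metadata makes generated questions traceable to a file.
                "metadata": {"source": str(path), "filename": path.name},
            }
        )
    return records


# With langchain installed, the records could be converted like this
# ("data/raw" is a hypothetical folder name):
# from langchain_core.documents import Document
# docs = [Document(**record) for record in load_text_records("data/raw")]
```

LangChain's own document loaders are an alternative if your sources are PDFs, web pages, or other formats.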

Then we generate the test set. Here we generate only `5` test questions for speed; generate as many as you need.

In [None]:
testset = gen.generate_with_langchain_docs(docs, 5, dist, is_async=False)

Finally, convert the test set to a pandas DataFrame and save it.

In [None]:
df = testset.to_pandas()
df.to_csv("data/synthetic-datasets/test-set.csv", index=False)
df
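
Note that `to_csv` does not create missing directories, so the save step fails if `data/synthetic-datasets/` does not exist yet. A minimal sketch of one way to guard against that is shown below. The `prepare_output_path` helper is hypothetical and simply creates the parent folder before returning the path.

```python
from pathlib import Path


def prepare_output_path(path):
    """Create the parent directory of `path` if needed and return a Path."""
    out_path = Path(path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    return out_path


# Usage with the DataFrame from the cell above:
# df.to_csv(prepare_output_path("data/synthetic-datasets/test-set.csv"), index=False)
```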