# Generate synthetic test dataset (with RAGAS)

- Author: [Yoonji](https://github.com/samdaseuss)
- Design: 
- Peer Review: 
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/99-TEMPLATE/00-BASE-TEMPLATE-EXAMPLE.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/99-TEMPLATE/00-BASE-TEMPLATE-EXAMPLE.ipynb)

## Overview

### Welcome Back!
Hi everyone! Welcome to our first lecture in the evaluation section. We're going to try something special today! While we've been building RAG systems, we haven't really talked about how to test if they're working well. To properly evaluate a RAG system, we need good test data - and that's exactly what we'll be creating in this tutorial! We'll learn how to build datasets that will help us measure our RAG pipeline's performance.

### What We'll Cover Today
We'll be using RAGAS to generate evaluation datasets. Specifically, we'll dive into:
* How to preprocess documents for evaluation
* How to define various evaluation objects
* How to generate different types of test questions by configuring data distributions

Through hands-on practice, you'll learn all these techniques and be able to create your own evaluation datasets!

### Goal of This Tutorial
The main goal of this section is to create test datasets that can objectively evaluate our RAG system. Think of it as building a really good test that can tell us exactly how well our RAG system is performing on different types of questions and scenarios.

By the end of this tutorial, you'll have all the tools you need to create comprehensive test datasets that will help you understand your RAG system's strengths and areas for improvement. Ready to get started? Let's dive in!

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Looking Back at What We've Learned](#looking-back-at-what-weve-learned)
    * [We Have Learned About RAG](#we-have-learned-about-rag)
    * [Is Our RAG Design Effective?](#is-our-rag-design-effective)
    * [Why Use Synthetic Test Dataset?](#why-use-synthetic-test-dataset)
- [Installation](#installation)
- [What is RAGAS](#what-is-ragas)
- [RAGAS in Python](#ragas-in-python)
- [Document](#document)
- [Document Preprocessing](#document-preprocessing)
- [Dataset Generation](#dataset-generation)
- [Distribution of Question Types](#distribution-of-question-types)

### References

- [Testset Generation for RAG](https://docs.ragas.io/en/stable/getstarted/rag_testset_generation/)
- [Testset Generation for RAG : 📚 Core Concepts > Test Data Generation > RAG](https://docs.ragas.io/en/stable/concepts/test_data_generation/rag/)

----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. 
- You can check out the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) repository for more details.

In [1]:
%%capture --no-stderr
!pip install langchain-opentutorial

In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_core",
        "langchain-anthropic",
        "langchain_community",
        "langchain_text_splitters",
        "langchain_openai",
    ],
    verbose=False,
    upgrade=False,
)

In [3]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "Generate synthetic test dataset (with RAGAS)",
    }
)

Environment variables have been set successfully.


You can alternatively set API keys such as `OPENAI_API_KEY` in a `.env` file and load them.

[Note] This is not necessary if you've already set the required API keys in previous steps.

In [4]:
# Load API keys from .env file
from dotenv import load_dotenv

load_dotenv(override=True)

True
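
Before moving on, it can help to confirm that the keys you need are actually set. The following is a minimal sketch, assuming a hypothetical helper named `missing_keys` (it is not part of `langchain-opentutorial` or `python-dotenv`); the example uses an explicit dictionary with placeholder values so the result is reproducible.

```python
import os


def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Placeholder values for illustration only; pass no `env` to check os.environ
example_env = {"OPENAI_API_KEY": "sk-example", "LANGCHAIN_API_KEY": ""}
print(missing_keys(["OPENAI_API_KEY", "LANGCHAIN_API_KEY"], example_env))
```

In a notebook, calling `missing_keys(["OPENAI_API_KEY"])` right after `load_dotenv()` would report whether the key was picked up.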

## Looking Back at What We've Learned

### We Have Learned About RAG

LLMs are powerful, but they cannot reflect real-time information because their knowledge is limited to their training data.

For example, let's say NASA discovered a new planet yesterday, making the total number of planets in the solar system nine. What would happen if we asked an LLM about the number of planets in the solar system? Because an LLM responds based on its training data, it would say there are eight planets. The answer is stated confidently but is outdated, a failure often grouped under 'hallucination.' Without access to external knowledge, the only fix is to wait for a newer model version trained on more recent data.

RAG emerged to overcome these limitations. Instead of immediately responding to user questions, the RAG pipeline first searches for the latest information from external knowledge repositories and then generates responses based on this information. This enables the system to provide answers that reflect the most up-to-date information.
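
The retrieve-then-generate flow described above can be sketched in a few lines. This is a toy illustration under stated assumptions: `KNOWLEDGE_BASE`, `retrieve`, and `build_prompt` are hypothetical names, word overlap stands in for a real embedding search, and no LLM is actually called.

```python
# Toy knowledge base; the "News" entry plays the role of yesterday's update
# from the hypothetical planet example.
KNOWLEDGE_BASE = [
    "Textbook: the solar system has eight planets.",
    "News: NASA announced a ninth planet in the solar system yesterday.",
    "Recipe: how to bake sourdough bread.",
]


def retrieve(question, documents, top_k=2):
    """Rank documents by the number of words they share with the question."""
    q_words = set(question.lower().replace("?", "").split())
    scored = []
    for doc in documents:
        d_words = set(doc.lower().replace(".", "").replace(":", "").split())
        scored.append((len(q_words & d_words), doc))
    # Stable sort: documents with equal scores keep their original order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]


def build_prompt(question, contexts):
    """Place the retrieved passages ahead of the question for the LLM to read."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {question}"


question = "How many planets are in the solar system?"
contexts = retrieve(question, KNOWLEDGE_BASE)
print(build_prompt(question, contexts))
```

Note that the toy retriever returns both the outdated textbook entry and the news entry. Deciding whether a real pipeline retrieves and uses the right one is exactly what an evaluation dataset is for.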

### Is Our RAG Design Effective?

You have learned various techniques for implementing RAG. Some of you may have already built your own RAG systems and applied them to your work.

However, we need to ask an important question: Is our RAG system truly a 'good' RAG? How can we judge the quality of RAG?

Simply saying "this RAG doesn't perform well" is not enough. We need to be able to measure and verify RAG's performance through objective evaluation metrics.

### Why Use Synthetic Test Dataset?

Evaluating the performance of RAG systems is a crucial process. However, manually creating hundreds of question-answer pairs requires enormous time and effort.

Moreover, manually written questions often remain at a simple and superficial level, making it difficult to thoroughly evaluate the performance of RAG systems.

By utilizing synthetic data to solve these problems, we can reduce developer time spent on building test datasets by up to 90%. Additionally, it enables more thorough performance evaluation by automatically generating test cases of various difficulty levels and types.

## Installation

To proceed with this tutorial, you need to install the `ragas` package. The command below installs it. Right after that, we'll explore the concept behind RAGAS and look at the Python package in more detail.

In [120]:
%pip install -qU ragas

Note: you may need to restart the kernel to use updated packages.


## What is RAGAS?
RAGAS (Retrieval Augmented Generation Assessment) is an evaluation framework designed to assess the performance of RAG systems. It helps developers and researchers measure how well their RAG implementations are working through various metrics and evaluation methods.

Let's revisit the example we saw earlier.

Let's say NASA discovered a new planet yesterday, making the total number of planets in our solar system nine. To evaluate the performance of a RAG system, let's ask the test question "How many planets are in our solar system?" RAGAS evaluates the system's response using these key metrics:

1. `Answer Relevancy`: Checks if the answer directly addresses the question about the number of planets
2. `Context Relevancy`: Checks if the system retrieved the recent NASA announcement instead of old astronomy textbooks
3. `Faithfulness`: Checks if the answer about nine planets is based on the NASA announcement and not on outdated data
4. `Context Precision`: Checks if the system used the NASA announcement efficiently without including unnecessary space information

For example, if the RAG system retrieves outdated sources and responds with **outdated information** saying there are eight planets, RAGAS will give it a low context relevancy score. Or if it makes claims about the new planet that aren't in the NASA announcement, it will receive a low faithfulness score.
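
To make the faithfulness idea concrete, here is a minimal sketch that treats the score as the fraction of answer claims supported by the retrieved context. It is an illustration, not the RAGAS implementation: in RAGAS an LLM extracts and verifies the claims, whereas here the claims and their labels are written by hand, and `faithfulness_score` is a hypothetical helper name.

```python
def faithfulness_score(claim_supported):
    """Fraction of answer claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported.values()) / len(claim_supported)


# Hand-labeled claims from a hypothetical answer about the new planet.
# True means the claim can be found in the retrieved NASA announcement.
claims = {
    "The solar system has nine planets.": True,
    "NASA announced the discovery.": True,
    "The new planet has three moons.": False,
}
print(round(faithfulness_score(claims), 2))  # 0.67
```

Two of the three claims are supported, so the unsupported claim about moons pulls the score down to about 0.67.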

## RAGAS in Python
You can easily use `RAGAS` with Python libraries.

Ragas is a library that provides tools to supercharge the evaluation of Large Language Model (LLM) applications. It is designed to help you evaluate your LLM applications with ease and confidence.

## Document
While the official RAGAS package website demonstrates tutorials using `markdown`, in this tutorial, we'll be working with `pdf` files. Please use the files located in the `data` folder.

In [5]:
file_path = 'data/'

## Document Preprocessing

In [6]:
from langchain_community.document_loaders import DirectoryLoader

# Create a document loader
loader = DirectoryLoader(file_path, glob="**/*.pdf")

# Load documents
docs = loader.load()

In [7]:
docs



Each document object includes a metadata dictionary that can be used to store additional information about the document, which can be accessed through `metadata`.

Please check if the metadata dictionary contains a key called `filename`.

This key is used during test dataset generation: the `filename` attribute in metadata identifies chunks that belong to the same document.

In [8]:
# Set metadata ('filename' must exist)
for doc in docs:
    doc.metadata["filename"] = doc.metadata["source"]
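
A slightly more defensive version of this step could be sketched as follows. It assumes a hypothetical helper named `ensure_filename` and uses `SimpleNamespace` objects as stand-ins for LangChain `Document` objects (only the `.metadata` attribute matters here); the file paths are made up for illustration.

```python
from types import SimpleNamespace


def ensure_filename(documents):
    """Copy metadata['source'] into metadata['filename'] where it is missing.

    Returns the documents that have neither key so they can be inspected.
    """
    unresolved = []
    for doc in documents:
        if "filename" in doc.metadata:
            continue  # keep an existing value untouched
        if "source" in doc.metadata:
            doc.metadata["filename"] = doc.metadata["source"]
        else:
            unresolved.append(doc)
    return unresolved


# Stand-ins for loaded documents (hypothetical paths)
sample_docs = [
    SimpleNamespace(metadata={"source": "data/a.pdf"}),
    SimpleNamespace(metadata={"source": "data/b.pdf", "filename": "b.pdf"}),
    SimpleNamespace(metadata={}),
]
unresolved = ensure_filename(sample_docs)
print([d.metadata.get("filename") for d in sample_docs])
print(len(unresolved))
```

Any document returned in `unresolved` has no usable identifier and would need a `filename` assigned by hand before generation.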

In [9]:
docs



## Dataset Generation
We'll create datasets using ChatOpenAI. Before writing the code, let's define the roles of our objects:
- Dataset Generator: `generator_llm`
- Dataset Critic: `critic_llm`
- Document Embeddings: `embeddings`

In [10]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from ragas.testset.transforms import KeyphrasesExtractor
from ragas.testset.graph import KnowledgeGraph
from ragas.testset.graph import Node, NodeType


# Dataset Generator
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

# Dataset Critic
critic_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

# Document Embeddings
embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

First, let's build a `KnowledgeGraph` from the loaded documents and prepare the LLM and embeddings wrappers that RAGAS will use.

In [11]:
# Configure the text splitter
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

# Wrap LangChain's ChatOpenAI model with LangchainLLMWrapper to make it compatible with Ragas
langchain_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

# Initialize the key phrase extractor using the LLM defined above
keyphrase_extractor = KeyphrasesExtractor(llm=langchain_llm)

# Reuse the embeddings defined above (already wrapped with LangchainEmbeddingsWrapper)
ragas_embeddings = embeddings

# Add every loaded document to the knowledge graph as a DOCUMENT node
kg = KnowledgeGraph()
for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={
                "page_content": doc.page_content,
                "document_metadata": doc.metadata,
            },
        )
    )
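
The `chunk_size` and `chunk_overlap` settings used above are easier to picture with a small example. The sketch below is a deliberate simplification: `sliding_chunks` is a hypothetical function that cuts fixed-size character windows, whereas `RecursiveCharacterTextSplitter` prefers to break on natural separators such as paragraphs and newlines. Small numbers are used so the overlap is visible.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


text = "".join(str(i % 10) for i in range(25))  # 25-character toy text
chunks = sliding_chunks(text, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.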

### Self Check

```python
print(len(kg.nodes))
```
Run this code to verify that knowledge graph nodes have been created. If no nodes were created, subsequent code may fail.

```python
for node in kg.nodes:
    print(node.properties)
```

In [12]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(
    llm=generator_llm,
    embedding_model=ragas_embeddings,
    knowledge_graph=kg,
)

## Distribution of Question Types
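Before we begin generating questions, let's first define the distribution (frequency) of questions by type. We aim to create a test set with the following distribution of question types:

- `simple`: Basic questions (40%)
- `reasoning`: Questions requiring reasoning (20%)
- `multi_context`: Questions requiring consideration of multiple contexts (20%)
- `conditional`: Conditional questions (20%)

Note that all four synthesizers created below are instances of `SingleHopSpecificQuerySynthesizer`, so these labels only control the sampling weights. Every generated question is a single-hop question, as the `synthesizer_name` column in the results shows. To actually vary the question type, assign a different synthesizer class to each entry.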

### Role of the synthesizers Module
The synthesizers module in Ragas is a core module responsible for Query Synthesis. It provides functionality to generate various types of questions based on documents stored in the Knowledge Graph. This module is used to automatically generate test sets for evaluating RAG (Retrieval-Augmented Generation) systems.

In [13]:
from ragas.testset.synthesizers.single_hop.specific import (
    SingleHopSpecificQuerySynthesizer,
)

# Create Synthesizer instances for each question type
# (all four share the same class, so they differ only by weight)
simple_synthesizer = SingleHopSpecificQuerySynthesizer(llm=generator_llm)
reasoning_synthesizer = SingleHopSpecificQuerySynthesizer(llm=generator_llm)
multi_context_synthesizer = SingleHopSpecificQuerySynthesizer(llm=generator_llm)
conditional_synthesizer = SingleHopSpecificQuerySynthesizer(llm=generator_llm)

# Set distribution by question type (weights sum to 1.0)
distribution = [
    (simple_synthesizer, 0.4),         # simple: 40%
    (reasoning_synthesizer, 0.2),      # reasoning: 20%
    (multi_context_synthesizer, 0.2),  # multi_context: 20%
    (conditional_synthesizer, 0.2),    # conditional: 20%
]
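
To see what these weights mean for a concrete test set size, the sketch below converts fractions into integer counts. It is only an illustration of the arithmetic: `allocate_counts` is a hypothetical helper, and RAGAS may allocate samples differently internally.

```python
def allocate_counts(weights, total):
    """Turn fractional weights into integer counts that sum to `total`."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    raw = {name: w * total for name, w in weights.items()}
    counts = {name: int(value) for name, value in raw.items()}
    # Hand any leftover items to the largest fractional remainders
    leftover = total - sum(counts.values())
    by_remainder = sorted(raw, key=lambda n: raw[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


weights = {
    "simple": 0.4,
    "reasoning": 0.2,
    "multi_context": 0.2,
    "conditional": 0.2,
}
print(allocate_counts(weights, total=10))
```

With a test set of 10 questions, the 40/20/20/20 split corresponds to 4, 2, 2, and 2 questions.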

In [14]:
dataset = generator.generate_with_langchain_docs(
    documents=docs,  # document data
    testset_size=10,  # number of questions to generate
    query_distribution=distribution,  # distribution by question type
    with_debugging_logs=True,  # output debugging logs
)

Applying HeadlinesExtractor:   0%|          | 0/1 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/1 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/1 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/8 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/15 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/1 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/4 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/10 [00:00<?, ?it/s]

In [15]:
dataset.to_pandas()

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | what vertex ai do? | [Cognitive architectures: How agents operate T... | Vertex AI is mentioned in the context of produ... | single_hop_specifc_query_synthesizer |
| 1 | How does the ReAct framework utilize the 'Flig... | [their training data. Knowledge is extended th... | The ReAct framework allows AI agents to choose... | single_hop_specifc_query_synthesizer |
| 2 | Howw doo I book a flight to Zurich using an AI... | [data. This means that, in a sense, a language... | To book a flight to Zurich using an AI agent, ... | single_hop_specifc_query_synthesizer |
| 3 | Wht is SCaNN used for in AI devlopment? | [that they are meant to offer the developer mu... | SCaNN is used as a matching algorithm to match... | single_hop_specifc_query_synthesizer |
| 4 | what langchain do? | [Cognitive architectures: How agents operate T... | LangChain is mentioned in the context of agent... | single_hop_specifc_query_synthesizer |
| 5 | what langchain do? | [their training data. Knowledge is extended th... | LangChain is a pre-built agent framework used ... | single_hop_specifc_query_synthesizer |
| 6 | How does Vertex AI facilitate the deployment o... | [Cognitive architectures: How agents operate T... | Vertex AI agents are utilized in production ap... | single_hop_specifc_query_synthesizer |
| 7 | Wht is LangChain and how does it fit into agen... | [their training data. Knowledge is extended th... | LangChain is a pre-built agent framework that ... | single_hop_specifc_query_synthesizer |
| 8 | what vertex ai do? | [Cognitive architectures: How agents operate T... | The context mentions 'Production applications ... | single_hop_specifc_query_synthesizer |
| 9 | what is ReAct do? | [their training data. Knowledge is extended th... | ReAct is a prompt engineering framework that p... | single_hop_specifc_query_synthesizer |
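
The output above contains repeated questions, which is worth checking before using the dataset for evaluation. The sketch below counts duplicates with the standard library. It uses only the untruncated `user_input` values from rows 0, 4, 5, 8, and 9; with the real dataset you could build the list from `dataset.to_pandas()["user_input"].tolist()` instead. The variable names are illustrative.

```python
from collections import Counter

# Untruncated questions copied from rows 0, 4, 5, 8, and 9 of the output
user_inputs = [
    "what vertex ai do?",
    "what langchain do?",
    "what langchain do?",
    "what vertex ai do?",
    "what is ReAct do?",
]

counts = Counter(user_inputs)
duplicates = {q: n for q, n in counts.items() if n > 1}
unique_questions = list(dict.fromkeys(user_inputs))  # keeps first-seen order

print(duplicates)
print(len(unique_questions))
```

Duplicate questions can still be paired with different reference contexts, as rows 4 and 5 show, so whether to drop them depends on what you want the test set to measure.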
