# Synthetic Evaluation Data Generation


## Quickstart

### Install required libraries

```
$ pip install -r requirements.txt
```

Please also see [README.md](../README.md) for environment setup including necessary library installation.


### Prepare input data

The synthetic data generation framework supports two input formats: `rawdoc` and `squad`.

- `input_format=rawdoc`

The input file must be in JSONL format. Each line holds one document in the form `{"text": <document>, "title": <title>}`.

```
{"text": "The quick brown fox jumps over the lazy dog.", "title": "Classic Pangram" }
{"text": "The Eiffel Tower is an iron lattice tower on the Champ de Mars in Paris.", "title": "Iconic Landmark" }
...
```
If the documents already have ids, each line can also carry an `_id` field, in the form `{"_id": <document_id>, "text": <document>, "title": <title>}`. These ids are preserved in the generated data.
```
{"_id": "5", "text": "The quick brown fox jumps over the lazy dog.", "title": "Classic Pangram" }
{"_id": "doc3", "text": "The Eiffel Tower is an iron lattice tower on the Champ de Mars in Paris.", "title": "Iconic Landmark" }
...
```
This repository contains a sample JSONL file `data/sample_data.jsonl`.
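
Before running the pipeline, it can help to check that every line of a `rawdoc` file parses and carries the expected keys. The sketch below is illustrative only: `validate_rawdoc_lines` is a hypothetical helper, not part of this repository, and it assumes `text` and `title` are required while `_id` is optional, as described above.

```python
import json

REQUIRED_KEYS = {"text", "title"}  # assumed required; "_id" is treated as optional


def validate_rawdoc_lines(lines):
    """Return (line_number, problem) tuples for lines that do not fit the rawdoc format.

    Hypothetical helper for illustration; not part of the repository.
    """
    problems = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((line_number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((line_number, "line is not a JSON object"))
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            problems.append((line_number, f"missing keys: {sorted(missing)}"))
    return problems


sample = [
    '{"text": "The quick brown fox jumps over the lazy dog.", "title": "Classic Pangram"}',
    '{"_id": "doc3", "text": "The Eiffel Tower is an iron lattice tower."}',
]
print(validate_rawdoc_lines(sample))
```

To check a real file, pass an open file handle instead of the `sample` list, since iterating over a file yields its lines.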


- `input_format=squad`

If you already have manually written questions and want to compare them with the synthetic ones (for example, to measure the correlation between the two sets), provide the input in SQuAD format.

```
{
    "data": [
        {
            "paragraphs": [
                {
                    "context": "The quick brown fox jumps over the lazy dog.",
                    "qas": [
                        {
                            "question": "What does the fox jump over?",
                            "id": "q1",
                            "synthetic": true,
                            "answers": [
                                {
                                    "text": "The fox jumps over the lazy dog",
                                    "answer_start": -1,
                                    "synthetic": true
                                }
                            ]
                        }
                    ]
                }
            ],
            "title": "Example"
        }
    ],
    "version": "2.0"
}
```

Set `answer_start` to `-1` for generative answers.
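
To make the nesting of the SQuAD layout concrete, the following sketch walks a SQuAD-style dictionary and yields one row per question. It is a minimal illustration, not the loader used by this repository; `iter_squad_questions` is a hypothetical name.

```python
def iter_squad_questions(squad):
    """Yield (title, context, question_id, question) for every question.

    Hypothetical helper for illustration; it only reads the keys shown
    in the SQuAD example above.
    """
    for article in squad.get("data", []):
        title = article.get("title", "")
        for paragraph in article.get("paragraphs", []):
            context = paragraph["context"]
            for qa in paragraph.get("qas", []):
                yield title, context, qa["id"], qa["question"]


squad = {
    "data": [
        {
            "title": "Example",
            "paragraphs": [
                {
                    "context": "The quick brown fox jumps over the lazy dog.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "What does the fox jump over?",
                            "answers": [
                                {
                                    "text": "The fox jumps over the lazy dog",
                                    "answer_start": -1,
                                }
                            ],
                        }
                    ],
                }
            ],
        }
    ],
    "version": "2.0",
}

rows = list(iter_squad_questions(squad))
print(rows)
```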

```
import os
import sys
import warnings

from omegaconf import OmegaConf

# Make the parent directories importable so the local modules resolve.
sys.path.append("../")
sys.path.append("../../../")

from retriever_evalset_generator import RetrieverEvalSetGenerator
from filters import EasinessFilter, AnswerabilityFilter
from nemo_curator.modules.filter import ScoreFilter, Score
from nemo_curator.datasets import DocumentDataset
from nemo_curator.modules.config import RetrieverEvalSDGConfig

warnings.filterwarnings('ignore')
```


## Generating API key

- The SDG pipeline uses NIM models. To use them, you need to generate an API key.

- Visit [this page](https://build.nvidia.com/mistralai/mixtral-8x7b-instruct) and click "Get API Key" to generate one.

![NVIDIA API Catalog](../figures/api_key.png) 
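
Rather than pasting the key into the notebook, one option is to read it from an environment variable. The sketch below assumes a variable named `NVIDIA_API_KEY`; that name and the `load_api_key` helper are illustrative choices, not something this repository defines.

```python
import os


def load_api_key(env_var="NVIDIA_API_KEY"):
    """Return the API key stored in `env_var`, or raise if it is unset or empty.

    The variable name is an assumption made for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key.")
    return key
```

The returned value can then be assigned to the config in place of a hard-coded string, for example `cfg.api_key = load_api_key()`.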

### Loading datasets
We now load a sample dataset from our data folder.

```
import pandas as pd

df = pd.read_json("../data/sample_data_rawdoc.jsonl", lines=True)
df.head()
```

| | text | title |
|---|---|---|
| 0 | The Eiffel Tower is an iconic landmark of Pari... | Eiffel Tower - A French Icon |
| 1 | The Great Wall of China is a series of fortifi... | The Great Wall of China - Ancient Protection |
| 2 | The Taj Mahal is an ivory-white marble mausole... | Taj Mahal - A Symbol of Love |
| 3 | Machu Picchu is a 15th-century Inca citadel si... | Machu Picchu - Lost City of the Incas |
| 4 | The Colosseum, also known as the Flavian Amphi... | The Colosseum - Ancient Roman Architecture |


### Read pipeline config file, instantiate Generator Object

```
cfg = RetrieverEvalSDGConfig.from_yaml("../config/config.yaml")
cfg.api_key = "your api key here"
retrieval_evalset_generator = RetrieverEvalSetGenerator(cfg)
```

### Running the Synthetic Data Generator
We first create a dataset object from the pandas DataFrame, then pass it through the generator and the filters. Each stage of the pipeline (the generator and each filter) returns a transformed dataset object.

```
dataset = DocumentDataset.from_pandas(df)
generated_dataset = retrieval_evalset_generator(dataset)
generated_df = generated_dataset.df.compute()
```

### Probing the generated Data
For those documents that do not have a document id, the pipeline generates a random hash as document id. For those that have an existing document id, the pipeline persists the same ids in the generated data.

```
generated_df.head()
```

| | _id | text | title | question-id | question | answer | score |
|---|---|---|---|---|---|---|---|
| 0 | f95b082088c0bec2d0dc98a3799f31b3ebc050517cbfd3... | The Eiffel Tower is an iconic landmark of Pari... | Eiffel Tower - A French Icon | ff9de122e00206eebb733e02e7d38ba50e9e9c640e5128... | What is the significance of the Eiffel Tower i... | The Eiffel Tower is an iconic landmark in Pari... | 1 |
| 1 | a8b7a30bebd2693a4f797455bb06d720613f72a11d3aaf... | The Eiffel Tower is an iconic landmark of Pari... | Eiffel Tower - A French Icon | d1cff3a0aaefc5ee61462131ac196f81b725f41458d3b6... | Who was responsible for designing the Eiffel T... | The Eiffel Tower was designed by the engineer ... | 1 |
| 2 | 24f82d3e8cc1227c8b4b50f0912192582724a754b48601... | The Eiffel Tower is an iconic landmark of Pari... | Eiffel Tower - A French Icon | 651eb12e56f0f40473acd24abb4199760ae746b16bdd48... | What was the occasion for building the Eiffel ... | The Eiffel Tower was built for the 1889 Exposi... | 1 |
| 3 | b4c53eb52d1463892cc90f45be86de7e2bc8b34d349e21... | The Great Wall of China is a series of fortifi... | The Great Wall of China - Ancient Protection | ef457b9db10d45b1f5dc22c9751029576e41cbc09c9eb6... | What is the purpose of the Great Wall of China? | The purpose of the Great Wall of China is to p... | 1 |
| 4 | 35f021445b7dd578f0e1ae1fd6e21f8bd2b76cb159a5a1... | The Great Wall of China is a series of fortifi... | The Great Wall of China - Ancient Protection | 3f180fd92f1aadef054b1481eeb5bbbaaa98d58992f3e7... | What materials were used to build the Great Wa... | The Great Wall of China was built using materi... | 1 |
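
The id rule described in this section (keep an existing `_id`, otherwise assign a random hash) can be sketched as follows. This is not the pipeline's actual code: hashing a random UUID with SHA-256 is an assumption made for the example, and `ensure_document_id` is a hypothetical name.

```python
import hashlib
import uuid


def ensure_document_id(record):
    """Return a copy of `record` that always has an `_id`.

    An existing id is kept as-is. Otherwise a random hash is assigned;
    SHA-256 over a random UUID is an assumption for this sketch and may
    differ from what the pipeline does internally.
    """
    if record.get("_id"):
        return dict(record)
    random_hash = hashlib.sha256(uuid.uuid4().bytes).hexdigest()
    return {"_id": random_hash, **record}


with_id = ensure_document_id({"_id": "doc3", "text": "...", "title": "Iconic Landmark"})
without_id = ensure_document_id({"text": "...", "title": "Classic Pangram"})
print(with_id["_id"], len(without_id["_id"]))
```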


### Data Quality Assessment
We apply two filters:

*Answerability filter* uses an LLM as a judge to determine whether each question can be answered from the content of the passage. It weeds out questions that are invalid or not relevant to the document chunk that was used to generate them.

*Easiness filter* removes questions that are too easy, meaning a retriever model can readily find the positive passage for the generated question. It uses an embedding model as the judge. You must provide a threshold (a number between 0 and 1) for this filter. The lower the threshold, the harder the questions that remain in the dataset; a higher threshold lets more easy questions through.

The filters can be applied in any order. 
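
To illustrate how a threshold on an embedding similarity separates easy from hard questions, here is a toy sketch. It is not the `EasinessFilter` implementation: the two-dimensional vectors stand in for real embeddings, and keeping items whose cosine similarity is at or below the threshold is an assumption about how the comparison works.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keep_hard_questions(items, threshold):
    """Keep names of items whose question/passage similarity is <= threshold.

    `items` is a list of (name, question_vector, passage_vector) tuples.
    Toy stand-in for an embedding-based filter, not the real one.
    """
    return [
        name
        for name, question_vec, passage_vec in items
        if cosine_similarity(question_vec, passage_vec) <= threshold
    ]


passage = [1.0, 0.0]
items = [
    ("easy", [1.0, 0.0], passage),    # similarity 1.0
    ("medium", [1.0, 1.0], passage),  # similarity about 0.707
    ("hard", [0.0, 1.0], passage),    # similarity 0.0
]

print(keep_hard_questions(items, threshold=0.8))
print(keep_hard_questions(items, threshold=0.5))
```

Lowering the threshold from 0.8 to 0.5 drops the "medium" item as well, which matches the rule that a lower threshold leaves only harder questions.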

```
ef = EasinessFilter(cfg)
easiness_filter = ScoreFilter(
    ef,
    text_field=["text", "question"],
    score_field="easiness_scores",
)
af = AnswerabilityFilter(cfg)
answerability_filter = ScoreFilter(
    af,
    text_field=["text", "question"],
    score_field="answerability_scores",
)
```

### Easiness filter
The output has an additional column, `easiness_scores`. This filter removes questions that retriever models can retrieve too easily.

```
filtered_dataset = easiness_filter(generated_dataset)
filtered_df_1 = filtered_dataset.df.compute()
filtered_df_1.head()
```

| | _id | text | title | question-id | question | answer | score | easiness_scores |
|---|---|---|---|---|---|---|---|---|
| 1 | 7c5a346e2859af30eae493b5df189716db5355d75760a1... | The Eiffel Tower is an iconic landmark of Pari... | Eiffel Tower - A French Icon | fb39bc1a988e291cb8ac36a1acad08ed24d7969ab08ff0... | Who was responsible for designing the Eiffel T... | The Eiffel Tower was designed by the engineer ... | 1 | 0.570407 |
| 3 | e94a0488a45accbf70a59ca6d66180a5c771499dbd08a5... | The Great Wall of China is a series of fortifi... | The Great Wall of China - Ancient Protection | 24d74a34810676af68b0a4314d58733b10f79f6f75eb6c... | What is the purpose of the Great Wall of China? | The purpose of the Great Wall of China is to p... | 1 | 0.527854 |
| 4 | 0b16e4f1e3f3a027072b98284623f4d951870f12f46c95... | The Great Wall of China is a series of fortifi... | The Great Wall of China - Ancient Protection | 778edd3d72d428075464e7501fae1da599a46f83d09f99... | What materials were used to build the Great Wa... | The Great Wall of China was built using materi... | 1 | 0.55047 |
| 5 | f187bbe2592f4751233bba08aaac5a7321530de04dd19e... | The Great Wall of China is a series of fortifi... | The Great Wall of China - Ancient Protection | becf01b49eb0599e980b569d669b422882eededdcf6d5e... | What direction does the Great Wall of China ge... | The Great Wall of China generally follows an e... | 1 | 0.506733 |
| 6 | 7350f3fb9a7c31e8c62a3be39f3ba1e53ff08cbd6d0a00... | The Taj Mahal is an ivory-white marble mausole... | Taj Mahal - A Symbol of Love | eb0c1e84bd7c176e5a00eaa7c03af4db330fa97a3e8e2b... | What is the Taj Mahal and who built it? | The Taj Mahal is an ivory-white marble mausole... | 1 | 0.521735 |


```
print(f"Total number of generated data points = {generated_df.shape[0]}")
print(f"Total number of data points after application of easiness filter = {filtered_df_1.shape[0]}")
```

```
Total number of generated data points = 30
Total number of data points after application of easiness filter = 21
```


### Answerability filter
The output has an additional column, `answerability_scores`, which holds the ratings that the LLM judge assigned for each criterion used to assess the question. The criteria are defined in the config.

```
filtered_dataset_2 = answerability_filter(filtered_dataset)
filtered_df_2 = filtered_dataset_2.df.compute()
filtered_df_2.head()
```

| | _id | text | title | question-id | question | answer | score | easiness_scores | answerability_scores |
|---|---|---|---|---|---|---|---|---|---|
| 3 | f6f3fece8d3797c71da36869fe0aabcf764f6ffa4694da... | The Great Wall of China is a series of fortifi... | The Great Wall of China - Ancient Protection | d29d16da19288242b409d2216d2d9b6aba261936fdd6fa... | What is the purpose of the Great Wall of China? | The purpose of the Great Wall of China is to p... | 1 | 0.527854 | `{\n"criterion_1_explanation": "The question is...` |
| 4 | b70486ff18e0a6870b167b5c23c24080c75687db219a05... | The Great Wall of China is a series of fortifi... | The Great Wall of China - Ancient Protection | 2cb46db5fd94ccc60980ff357a0f08292760a29f554c0a... | What materials were used to build the Great Wa... | The Great Wall of China was built using materi... | 1 | 0.55047 | `{\n"criterion_1_explanation": "The question is...` |
| 5 | d736edbfbde871f0064bef695a91dd663d62eacf4cff17... | The Great Wall of China is a series of fortifi... | The Great Wall of China - Ancient Protection | 44c7792e88cc656c440f0b9ffe3fd729dbdd05ee4594ee... | In which direction was the Great Wall of China... | The Great Wall of China was generally built al... | 1 | 0.553109 | `{\n"criterion_1_explanation": "The question is...` |
| 6 | bdfe150a3a62112c42a1145e98b5068e70303c28de303e... | The Taj Mahal is an ivory-white marble mausole... | Taj Mahal - A Symbol of Love | b0b0d6fbe55151b16bf4b8e6ce355ff9b9c9f07ba129dd... | What is the Taj Mahal primarily used for? | The Taj Mahal is primarily used as a mausoleum... | 1 | 0.444493 | `{\n"criterion_1_explanation": "The question is...` |
| 7 | b5eee1a84b594db452d8a69ed00231a43f7b93f76e54e6... | The Taj Mahal is an ivory-white marble mausole... | Taj Mahal - A Symbol of Love | 83ef0cdb990227b9399ab24b0b3074c3842315b14f9ec4... | Who commissioned the construction of the Taj M... | The Taj Mahal was commissioned by the Mughal e... | 1 | 0.555006 | `{\n"criterion_1_explanation": "The question is...` |


```
print(f"Total number of data points after application of answerability filter = {filtered_df_2.shape[0]}")
```

```
Total number of data points after application of answerability filter = 18
```
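
Each answerability score shown above is a JSON string that starts with a `criterion_1_explanation` key. To inspect the judge's reasoning, the string can be parsed and the explanation fields pulled out, as in the sketch below. Only that one key name comes from the output above; the placeholder value and the `extract_explanations` helper are illustrative, and the full set of keys depends on the criteria in the config.

```python
import json


def extract_explanations(raw_scores):
    """Parse a judge response and return its `*_explanation` fields.

    Returns an empty dict when the input is not a JSON object.
    Hypothetical helper for illustration.
    """
    try:
        parsed = json.loads(raw_scores)
    except (TypeError, json.JSONDecodeError):
        return {}
    if not isinstance(parsed, dict):
        return {}
    return {key: value for key, value in parsed.items() if key.endswith("_explanation")}


# Placeholder value; the real text is truncated in the table above.
raw = '{\n"criterion_1_explanation": "<explanation text>"}'
print(extract_explanations(raw))
```

On a DataFrame, the same function could be applied per row, for example with `filtered_df_2["answerability_scores"].apply(extract_explanations)`.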


Adding the answerability filter reduces the number of data points further, from 21 to 18. It removes unanswerable questions, i.e. questions that cannot be answered solely from the content of the context document.
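
The counts printed in this walkthrough (30 generated, 21 after the easiness filter, 18 after the answerability filter) can be turned into retention rates with a few lines of arithmetic. The helper name `retention_report` is hypothetical.

```python
def retention_report(stage_counts):
    """Return (stage, count, percent_of_first_stage) for each pipeline stage."""
    total = stage_counts[0][1]
    return [
        (name, count, round(100 * count / total, 1) if total else 0.0)
        for name, count in stage_counts
    ]


# Counts taken from the outputs shown in this walkthrough.
stage_counts = [
    ("generated", 30),
    ("after easiness filter", 21),
    ("after answerability filter", 18),
]

for name, count, percent in retention_report(stage_counts):
    print(f"{name}: {count} ({percent}%)")
```

For this sample run, 70% of the generated questions pass the easiness filter and 60% remain after both filters.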