# Synthetic Evaluation Data Generation Using NeMo Retriever SDG


## Quickstart

### Install required libraries

```
$ pip install -r requirements.txt
```

Please also see [README.md](../README.md) for environment setup including necessary library installation.


### Prepare input data

The synthetic data generation framework supports two input formats: `rawdoc` and `squad`.

- `input_format=rawdoc`

The file should be stored in a JSONL format. Each line contains a document in the format of `{"text": <document>, "title": <title>}`.

```
{"text": "The quick brown fox jumps over the lazy dog.", "title": "Classic Pangram" }
{"text": "The Eiffel Tower is an iron lattice tower on the Champ de Mars in Paris.", "title": "Iconic Landmark" }
...
```

This repository contains a sample JSONL file `data/sample_data.jsonl`.
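If your documents are not yet in this shape, the following minimal sketch writes and reads a `rawdoc` JSONL file using only the standard library. The helper names `write_rawdoc_jsonl` and `read_rawdoc_jsonl` are hypothetical and not part of this repository.

```python
import json
from pathlib import Path


def write_rawdoc_jsonl(docs, path):
    """Write (title, text) pairs as one JSON object per line."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for title, text in docs:
            record = {"text": text, "title": title}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def read_rawdoc_jsonl(path):
    """Read the file back, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Each record is serialized with `json.dumps`, so quotes and newlines inside a document are escaped and every document stays on a single line.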


- `input_format=squad`

If you have manually created questions and would like to conduct further analysis (correlation between synthetic questions and original questions), the input data should follow the SQuAD format.

```
       {
            "data": [
                {
                    "paragraphs": [
                        {
                            "context": "The quick brown fox jumps over the lazy dog.",
                            "qas": [
                                {
                                    "question": "What does the fox jump over?",
                                    "id": "q1",
                                    "synthetic": true,
                                    "answers": [
                                        {
                                            "text": "The fox jumps over the lazy dog",
                                            "answer_start": -1,  # For generative answers
                                            "synthetic": true
                                        }
                                    ]
                                }
                            ]
                        }
                    ],
                    "title": "Example"
                }
            ],
            "version": "2.0"
        }        
```
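As an illustration, manually written question/answer pairs can be wrapped into this structure with a few lines of Python. This is a sketch under assumptions: the helper `to_squad` is hypothetical, the ids are generated as `q1`, `q2`, ..., and whether the pipeline requires the `synthetic` field on input is not stated here, so it is omitted.

```python
import json


def to_squad(title, context, qa_pairs):
    """Wrap (question, answer) pairs for one passage in a SQuAD-style dict."""
    qas = [
        {
            "question": question,
            "id": f"q{i}",  # hypothetical id scheme
            # str.find returns -1 when the answer is not a literal span,
            # which matches the convention for generative answers.
            "answers": [{"text": answer, "answer_start": context.find(answer)}],
        }
        for i, (question, answer) in enumerate(qa_pairs, start=1)
    ]
    return {
        "data": [{"title": title, "paragraphs": [{"context": context, "qas": qas}]}],
        "version": "2.0",
    }


context = "The quick brown fox jumps over the lazy dog."
squad = to_squad("Example", context, [("What does the fox jump over?", "the lazy dog")])
squad_json = json.dumps(squad, indent=2)
```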


### Run pipeline

- Visit [this page](https://build.nvidia.com/mistralai/mixtral-8x7b-instruct) and click "Get API Key" to generate an API key

![NVIDIA API Catalog](../figures/api_key.png)

- Run the following command. It takes roughly 5-10 minutes.
    - Add `PYTHONPATH=.` if you get an error message `ModuleNotFoundError: No module named 'nemo_retriever_sdg'`

```
HYDRA_FULL_ERROR=1 PYTHONPATH=. python scripts/run_pipeline.py \
  api_key="<API KEY>" \
  input_file=$(pwd)/data/sample_data_rawdoc.jsonl \
  input_format="rawdoc" \
  output_dir=$(pwd)/outputs/sample_synthetic_data
```
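To keep the API key out of your shell history, the same command can be assembled from Python. The sketch below only builds the argument list from the overrides shown above. The environment variable name `NVIDIA_API_KEY` and the helper `build_command` are assumptions made for illustration, not something the pipeline defines.

```python
import os
import sys
from pathlib import Path


def build_command(input_file, output_dir, input_format="rawdoc", api_key=None):
    """Return the run_pipeline.py invocation as a list of arguments."""
    # NVIDIA_API_KEY is a hypothetical variable name chosen for this sketch.
    api_key = api_key or os.environ.get("NVIDIA_API_KEY", "")
    if not api_key:
        raise ValueError("no API key given and NVIDIA_API_KEY is not set")
    return [
        sys.executable,
        "scripts/run_pipeline.py",
        f"api_key={api_key}",
        f"input_file={Path(input_file).resolve()}",
        f"input_format={input_format}",
        f"output_dir={Path(output_dir).resolve()}",
    ]


# To launch it from the repository root:
# import subprocess
# env = {**os.environ, "PYTHONPATH": ".", "HYDRA_FULL_ERROR": "1"}
# subprocess.run(build_command("data/sample_data_rawdoc.jsonl", "outputs/run1"), check=True, env=env)
```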


## Output

This creates synthetic eval datasets in the SQuAD and BEIR formats.
If you use `input_format=squad` and `evaluate=True`, you will also see `eval` and `beir/original` directories.

```
outputs/sample_synthetic_data
├── beir
│   ├── all
│   │   └── synthetic
│   │       ├── corpus.jsonl
│   │       ├── qrels
│   │       │   └── test.tsv
│   │       └── queries.jsonl
│   └── filtered
│       └── synthetic
│           ├── corpus.jsonl
│           ├── qrels
│           │   └── test.tsv
│           └── queries.jsonl
├── eval
│   ├── all
│   │   ├── beir_evaluator__recall5.csv
│   │   ├── beir_evaluator__synthetic_topk_rel_doc_flags.csv
│   │   ├── beir_evaluator__synthetic_topk_rel_doc_flags_counts.csv
│   │   └── beir_evaluator__type_model_eval_dict.json
│   └── filtered
│       ├── beir_evaluator__recall5.csv
│       ├── beir_evaluator__synthetic_topk_rel_doc_flags.csv
│       ├── beir_evaluator__synthetic_topk_rel_doc_flags_counts.csv
│       └── beir_evaluator__type_model_eval_dict.json
├── report__all.json
├── report__filtered.json
└── squad
    ├── synthetic_data__all.json
    └── synthetic_data__filtered.json

```

### SQuAD format

The command generates a `.json` file in a modified version of the SQuAD v2 format. The differences from the original SQuAD v2 format are:
- Set `answer_start: -1` for generative answers.
    - Note: the current version of the SDG pipeline can create only generative answers (not extractive ones), so `answer_start` is set to the dummy value `-1` for synthetic answers.
- Use of `synthetic: true` for synthetic questions and answers.

```
{
    "data": [
        {
            "paragraphs": [
                {
                    "context": "The quick brown fox jumps over the lazy dog.",
                    "document_id": "Example",
                    "qas": [
                        {
                            "question": "What does the fox jump over?",
                            "id": "q1",
                            "synthetic": true,
                            "answers": [
                                {
                                    "text": "The fox jumps over the lazy dog",
                                    "answer_start": -1,  # For generative answers
                                    "synthetic": true
                                }
                            ]
                        }
                    ]
                }
            ]
        }
    ],
    "version": "2.0"
}
```
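For downstream analysis it is often convenient to flatten this nested structure into one row per question/answer pair. The following is a minimal sketch that relies only on the fields shown in the example above. The function name `iter_qa` is hypothetical.

```python
def iter_qa(squad):
    """Yield one flat record per answer in a SQuAD-style dictionary."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield {
                        "document_id": paragraph.get("document_id"),
                        "question": qa["question"],
                        "answer": answer["text"],
                        # answer_start == -1 marks a generative answer
                        "generative": answer.get("answer_start") == -1,
                        "synthetic": qa.get("synthetic", False),
                    }


example = {
    "data": [{"paragraphs": [{
        "context": "The quick brown fox jumps over the lazy dog.",
        "document_id": "Example",
        "qas": [{
            "question": "What does the fox jump over?",
            "id": "q1",
            "synthetic": True,
            "answers": [{"text": "The fox jumps over the lazy dog",
                         "answer_start": -1, "synthetic": True}],
        }],
    }]}],
    "version": "2.0",
}
rows = list(iter_qa(example))
```

To use it on a real output file, load `squad/synthetic_data__all.json` with `json.load` and pass the result to `iter_qa`.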

### BEIR format

The directory structure follows the BEIR format. 

```
synthetic
├── corpus.jsonl
├── qrels
│   └── test.tsv
└── queries.jsonl
```

You can load the directory as is with the BEIR framework for evaluation. For example:

```
from beir.datasets.data_loader import GenericDataLoader
corpus, queries, qrels = GenericDataLoader(data_folder="synthetic").load(split="test")
```
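If the `beir` package is not installed, the same three files can be read with the standard library. The sketch below assumes the usual BEIR layout (`_id`, `title`, `text` in `corpus.jsonl`; `_id`, `text` in `queries.jsonl`; a tab-separated qrels file with a header row followed by query id, document id, and score). The files written by this pipeline may carry additional fields, and `load_beir` is a hypothetical helper.

```python
import csv
import json
from pathlib import Path


def load_beir(folder, split="test"):
    """Load corpus, queries, and qrels from a BEIR-style directory."""
    folder = Path(folder)
    corpus, queries, qrels = {}, {}, {}

    with (folder / "corpus.jsonl").open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                doc = json.loads(line)
                corpus[doc["_id"]] = {"title": doc.get("title", ""), "text": doc["text"]}

    with (folder / "queries.jsonl").open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                query = json.loads(line)
                queries[query["_id"]] = query["text"]

    with (folder / "qrels" / f"{split}.tsv").open(encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader, None)  # skip the header row
        for row in reader:
            if len(row) != 3:
                continue  # ignore blank or malformed lines
            query_id, doc_id, score = row
            qrels.setdefault(query_id, {})[doc_id] = int(score)

    return corpus, queries, qrels
```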


### Report (`report.json`)

```
{
  "synthetic_question_length": {
    "count": 1500,
    "mean": 83.68066666666667,
    "std": 22.751082774243716,
    "min": 31,
    "25%": 67,
    "50%": 82,
    "75%": 97,
    "max": 267
  },
  "original_question_length": {
    "count": 137,
    "mean": 53.613138686131386,
    "std": 21.75709761649885,
    "min": 12,
    "25%": 37,
    "50%": 50,
    "75%": 70,
    "max": 107
  },
  "synthetic_lexical_divergence": {
    "count": 1500,
    "mean": 0.057704510678110436,
    "std": 0.05364795698080602,
    "min": 0,
    "25%": 0,
    "50%": 0.04347826086956519,
    "75%": 0.08890374331550804,
    "max": 0.2777777777777778
  },
  "original_lexical_divergence": {
    "count": 137,
    "mean": 0.028657367331165418,
    "std": 0.04839056633930506,
    "min": 0,
    "25%": 0,
    "50%": 0,
    "75%": 0.045454545454545414,
    "max": 0.2272727272727273
  }
}
```
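The keys in each entry (`count`, `mean`, `std`, `min`, quartiles, `max`) resemble the summary produced by pandas' `Series.describe()`. As a rough sketch of how such an entry could be computed, the snippet below uses only the standard library. It assumes that question length is measured in characters, which this document does not state, and `describe` here is a hypothetical helper rather than the pipeline's own analyzer.

```python
import statistics


def describe(values):
    """Summarize numbers using the same keys as the report entries."""
    values = sorted(values)
    if len(values) < 2:
        raise ValueError("need at least two values")
    # method="inclusive" interpolates linearly between data points
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "min": values[0],
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": values[-1],
    }


questions = [
    "What does the fox jump over?",
    "Where is the Eiffel Tower located?",
    "What material is the Eiffel Tower made of?",
]
# Assumption: length is the number of characters in the question.
summary = describe([len(q) for q in questions])
```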



### Synthetic Data Generation (SDG) Pipeline

![Overall architecture of the SDG Pipeline](../figures/sdg_pipeline.png)
<p style="text-align: center;">Figure 1. Overall architecture of the SDG Pipeline.</p>

The first step in running the SDG pipeline is to set up the configurations for:

1. LLM generator model
2. Easiness filter (embedding-model-as-a-judge)
3. Answerability filter (LLM-as-a-judge)

*Answerability filter* uses an LLM as a judge to assess question quality, specifically whether each question can be answered from the content of the passage. The filter weeds out questions that are invalid or not relevant to the document chunk that was used to generate them.

*Easiness filter* removes questions that are too easy, that is, questions for which retriever models can readily retrieve the positive passage. It uses an embedding model as the judge. The user must provide a threshold (a number between 0 and 1) for this filter. The lower the threshold, the harder the questions that remain in the dataset; a higher threshold leaves more easy questions in the dataset.
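The thresholding logic of the easiness filter can be illustrated with plain Python. This is a sketch, not the pipeline's implementation: the vectors stand in for embeddings that a real embedding model would produce, the function names are hypothetical, and whether the comparison is `>=` or `>` is an assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_too_easy(question_vec, passage_vec, threshold=0.8):
    """Flag a question whose embedding is very close to its source passage."""
    return cosine_similarity(question_vec, passage_vec) >= threshold


# Toy vectors standing in for real embeddings.
passage = [0.9, 0.1, 0.0]
near_copy = [0.8, 0.2, 0.0]   # almost the same direction as the passage
paraphrase = [0.2, 0.1, 0.9]  # points in a different direction

flags = {
    "near_copy": is_too_easy(near_copy, passage),
    "paraphrase": is_too_easy(paraphrase, passage),
}
```

Lowering `threshold` flags more questions as too easy, so fewer and harder questions remain after filtering.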

The filters can be applied in any order. 

Additionally, we need to specify the configuration for:
1. Evaluators, if the `evaluate` flag is set to true.
2. Analyzers (query length and lexical divergence between context and query).

Let's see what the config file looks like.

```
!cat ../scripts/conf/config.yaml
```

As shown above, the prompts for the generator model and the LLM-as-judge model need to be specified in the `config.yaml` file.

For shorter test runs, the user can also specify the `max_examples` parameter. It sets the number of input document chunks (from the input file) used for synthetic data generation.

The data containing the passages needs to be placed in the data directory. It has to be in jsonl format as mentioned before.

```
!ls ../data
```

### Running the SDG pipeline

The pipeline is run with `run_pipeline.py`. It needs an `api_key`, which can be obtained as described in the Quickstart above. Here we show a run using input documents in the `rawdoc` format. The run also displays the progress of the data generation pipeline.

```
!HYDRA_FULL_ERROR=1 PYTHONPATH=.. python ../scripts/run_pipeline.py \
  api_key="<API KEY>" \
  input_file=../data/sample_data_rawdoc.jsonl \
  input_format=rawdoc \
  output_dir=../outputs/sample_data_synthetic_w_evals
```

### Output/Results

After the run, an 'outputs' directory is created with 'beir', 'squad', and 'eval' sub-directories; 'beir' and 'squad' contain the results in the BEIR and SQuAD formats, respectively. The output directory structure should look like this:

- beir
    - all
    - filtered
- squad
    - all
    - filtered
- eval
    - all
        - synthetic
        - original (if `use_original=true`)
    - filtered
        - synthetic
        - original (if `use_original=true`)

```
!ls ../outputs/sample_data_synthetic_w_evals/
```

Here is a snapshot of sample output in the SQuAD format:

![Sample output in the SQuAD format](../figures/sample_output.png)
<p style="text-align: center;">Figure 2. Sample output in the SQuAD format.</p>

As Figure 2 shows, in addition to the question and answer generated for the given passage, the output includes other metadata such as filter-by-easiness, filter-by-answerability, and llm-as-judge-score.

Filter-by-easiness takes the value Y or N, indicating whether the easiness filter (embedding-model-as-judge) would remove the question given the threshold set in the config file. With a cosine-similarity threshold of 0.8, filter-by-easiness is 'Y' for this example: the question is deemed too easy for retrieval and does not pass the easiness filter.

Filter-by-answerability also takes the value Y or N. Here all the LLM-as-judge criteria are satisfied, so the question is of good quality and passes the answerability filter.

Now let's take a look at the generated queries, shown here in the BEIR format.

```
# Take a look at generated questions
!cat ../outputs/sample_data_synthetic_w_evals/beir/all/synthetic/queries.jsonl
```

### Evaluation 

We showcase the BEIR evaluation of the synthetically generated data. All the results can be found in the outputs directory.

```
!ls ../outputs/sample_data_synthetic_w_evals/eval/filtered/
```

```
import json
import pandas as pd
```

```
df = pd.read_csv("../outputs/sample_data_synthetic_w_evals/eval/all/beir_evaluator__recall5.csv")
display(df)
```

```
df = pd.read_csv("../outputs/sample_data_synthetic_w_evals/eval/filtered/beir_evaluator__recall5.csv")
display(df)
```

The tables show Recall@5 values for three different embedding models. All values are 1 because the sample dataset is very small.
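For reference, Recall@k is the fraction of a query's relevant documents that appear among the top k retrieved documents, averaged over queries. The following is a plain-Python sketch of that standard definition. The evaluator used by the pipeline may compute it differently in detail, and the `results` structure (query id mapped to document scores) is an assumption for illustration.

```python
def recall_at_k(qrels, results, k=5):
    """Mean Recall@k over all queries that have at least one relevant document."""
    scores = []
    for query_id, judged in qrels.items():
        relevant = {doc_id for doc_id, rel in judged.items() if rel > 0}
        if not relevant:
            continue
        # Rank retrieved documents by descending score and keep the top k.
        ranked = sorted(results.get(query_id, {}).items(),
                        key=lambda item: item[1], reverse=True)[:k]
        hits = sum(1 for doc_id, _ in ranked if doc_id in relevant)
        scores.append(hits / len(relevant))
    return sum(scores) / len(scores) if scores else 0.0


qrels = {"q1": {"d1": 1}, "q2": {"d2": 1, "d3": 1}}
results = {
    "q1": {"d1": 0.9, "d4": 0.5},
    "q2": {"d2": 0.8, "d5": 0.7, "d3": 0.1},
}
recall_top2 = recall_at_k(qrels, results, k=2)
recall_top5 = recall_at_k(qrels, results, k=5)
```

With only a handful of passages in the corpus, the single positive passage almost always lands in the top 5, which is why small samples give a recall of 1.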

### Analysis Report

We also report other statistics, such as query length and the lexical divergence (unigram) between query and passage.

```
pd.DataFrame(json.load(open("../outputs/sample_data_synthetic_w_evals/report__all.json")))
```

```
pd.DataFrame(json.load(open("../outputs/sample_data_synthetic_w_evals/report__filtered.json")))
```
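The exact formula behind the lexical divergence statistic is not given in this document. As one plausible reading of "unigram lexical divergence", the sketch below measures the fraction of distinct query words that do not occur in the passage. Both the definition and the simple lowercase word tokenizer are assumptions made for illustration, and the pipeline's analyzer may differ.

```python
import re


def unigrams(text):
    """Lowercased set of word tokens (a deliberately simple tokenizer)."""
    return set(re.findall(r"\w+", text.lower()))


def lexical_divergence(query, passage):
    """Fraction of distinct query unigrams that are absent from the passage.

    Hypothetical definition: 0 means every query word appears in the passage,
    1 means none of them do.
    """
    query_tokens = unigrams(query)
    if not query_tokens:
        return 0.0
    missing = query_tokens - unigrams(passage)
    return len(missing) / len(query_tokens)


passage = "The quick brown fox jumps over the lazy dog."
divergence = lexical_divergence("What does the fox jump over?", passage)
```

Under this reading, a question copied word for word from the passage scores 0, while a paraphrase with new vocabulary scores closer to 1.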