# Working with LUQ Datasets

Welcome to the LUQ Datasets tutorial!
In this guide, you'll learn how to:
- Create your own LUQ datasets
- Access and use existing LUQ datasets

The tutorial follows one dataset, CoQA, through four steps: preprocessing, adding LLM generations, scoring accuracy, and (optionally) uploading the result to Hugging Face.

## Step 1: Preprocess the Dataset

To ensure consistency across various QA datasets, **LUQ** requires all datasets to be converted into a unified format. The required format is simple:

```json
{
  "data": [
    {
      "question": "Is this an example of a question?",
      "answer": "Yes, it is"
    }
    // ...
  ]
}
```

LUQ provides a helper script to preprocess several commonly used QA datasets such as CoQA, Natural Questions (NQ), and more.

📄 You can find the script here: `scripts/process_dataset.py`.

The example below runs the script on the CoQA dataset.

```shell
!python scripts/process_dataset.py \
    --dataset=coqa \
    --output=data/coqa/processed.json
```
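
As a rough illustration of the target layout (not the logic of `scripts/process_dataset.py`), the sketch below wraps question/answer pairs in the unified format and checks that a dataset follows it. `to_unified` and `validate_unified` are hypothetical helpers introduced only for this example.

```python
import json


def to_unified(pairs):
    """Wrap (question, answer) pairs in the unified {"data": [...]} layout."""
    return {"data": [{"question": q, "answer": a} for q, a in pairs]}


def validate_unified(dataset):
    """Raise ValueError if the dataset does not follow the unified layout."""
    if not isinstance(dataset, dict) or not isinstance(dataset.get("data"), list):
        raise ValueError('expected a top-level "data" list')
    for i, entry in enumerate(dataset["data"]):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {i} is not an object")
        for key in ("question", "answer"):
            if not isinstance(entry.get(key), str):
                raise ValueError(f"entry {i}: missing or non-string {key!r}")
    return True


# Hypothetical input: the example pair from the format description above.
example = to_unified([("Is this an example of a question?", "Yes, it is")])
validate_unified(example)
print(json.dumps(example, indent=2))
```

A check like this can be run on any file before passing it to the later steps, since they all assume the same top-level `data` list.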

## Step 2: Add LLM Responses to the Dataset

In this step, we enhance the dataset by generating multiple responses using a Large Language Model (LLM). This is done with the script: `scripts/add_generations_to_dataset.py`.


This script:
- Generates multiple responses (**samples**) per question.
- Produces a final answer (typically sampled with a low temperature).
- Adds token-level log probabilities for downstream analysis or training tasks.

---

### 💾 Output Format

The output is a `.json` file structured as follows:

```json
{
  "llm": "llm_name",
  "raw_dataset": "name_or_url_of_raw_dataset",
  "data": [
    {
      "question": "Is this a question?",
      "samples": ["Yes", "No", "Maybe so"],
      "answer": "Yes",
      "gt_answer": "Yes",
      "samples_temp": 1,
      "answer_temp": 0.1,
      "top-p": 0.95
    }
    // ...
  ]
}
```

### 📘 Field Descriptions

- `llm`: Name of the LLM used (e.g., gpt-4, llama-2)
- `raw_dataset`: Name or URL of the original dataset
- `data`: List of entries, each containing:
    - `question`: The input question
    - `samples`: Multiple responses generated by the LLM
    - `answer`: Final selected answer (usually low-temperature)
    - `gt_answer`: Ground truth answer (if available)
    - `samples_temp`: Temperature used for generating samples
    - `answer_temp`: Temperature used for generating the answer
    - `top-p`: Top-p value (nucleus sampling)
    - (Optional) Other generation parameters used
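
To make these fields concrete, the sketch below shows one way an entry could be assembled. It is not the code in `scripts/add_generations_to_dataset.py`: `fake_generate` is a hypothetical stand-in for a real LLM call, `add_generations` and its parameters are illustrative names, and copying the Step 1 `answer` into `gt_answer` is an assumption about how the two formats relate. Token-level log probabilities are left out because the format above does not name the field that holds them.

```python
import random


def fake_generate(question, temperature, top_p, rng):
    """Hypothetical stand-in for an LLM call; returns a canned reply."""
    candidates = ["Yes", "No", "Maybe so"]
    if temperature < 0.5:
        # Low temperature: behave (almost) deterministically.
        return candidates[0]
    # High temperature: replies vary from call to call.
    return rng.choice(candidates)


def add_generations(entry, generate, n_samples=3, samples_temp=1.0,
                    answer_temp=0.1, top_p=0.95, seed=0):
    """Build one output entry from a preprocessed {"question", "answer"} entry."""
    rng = random.Random(seed)
    question = entry["question"]
    samples = [generate(question, samples_temp, top_p, rng) for _ in range(n_samples)]
    answer = generate(question, answer_temp, top_p, rng)
    return {
        "question": question,
        "samples": samples,
        "answer": answer,
        "gt_answer": entry["answer"],  # assumption: Step 1 "answer" is the ground truth
        "samples_temp": samples_temp,
        "answer_temp": answer_temp,
        "top-p": top_p,
    }


result = add_generations(
    {"question": "Is this a question?", "answer": "Yes"}, fake_generate
)
print(result)
```

Swapping `fake_generate` for a function that calls a real model would keep the rest of the structure unchanged, which is the main point of separating the two.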

```shell
!python scripts/add_generations_to_dataset.py \
    --input-file=./data/coqa/processed_short.json \
    --output-file=./data/coqa/processed_gen_short.json
```

## Step 3: Assess the Accuracy of the Predictions

In this step, we evaluate the quality of the generated answers by using an **LLM-as-a-judge** approach. This means using a language model to assess whether each prediction is correct.

To do this, use the following script: `scripts/add_accuracy_to_dataset.py`.


This script adds an **accuracy score** to each response by comparing the LLM-generated answers to the ground truth (`gt_answer`). The evaluation is typically done by prompting an LLM to judge whether each response is correct, based on the context and expected answer.
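
A minimal sketch of the LLM-as-a-judge idea follows. The prompt wording, the `CORRECT`/`INCORRECT` reply convention, and all function names are assumptions made for illustration, not the script's actual implementation. `stub_judge` replaces a real model with a case-insensitive string comparison so the example runs offline.

```python
def build_judge_prompt(question, gt_answer, prediction):
    """Assemble a grading prompt (illustrative wording)."""
    return (
        "You are grading a question-answering system.\n"
        f"Question: {question}\n"
        f"Reference answer: {gt_answer}\n"
        f"Predicted answer: {prediction}\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )


def parse_verdict(reply):
    """Map the judge's reply to 1 (correct) or 0 (anything else)."""
    words = reply.strip().upper().split()
    first = words[0].strip(".,!:") if words else ""
    return 1 if first == "CORRECT" else 0


def stub_judge(prompt):
    """Hypothetical judge: compares the two answers instead of calling an LLM."""
    fields = dict(
        line.split(": ", 1) for line in prompt.splitlines() if ": " in line
    )
    same = (
        fields["Reference answer"].strip().lower()
        == fields["Predicted answer"].strip().lower()
    )
    return "CORRECT" if same else "INCORRECT"


def add_accuracy(entry, judge):
    """Return a copy of the entry with an added "accuracy" field."""
    prompt = build_judge_prompt(entry["question"], entry["gt_answer"], entry["answer"])
    return {**entry, "accuracy": parse_verdict(judge(prompt))}


scored = add_accuracy(
    {"question": "Is this a question?", "answer": "Yes", "gt_answer": "Yes"},
    stub_judge,
)
print(scored["accuracy"])
```

Treating any reply other than `CORRECT` as incorrect is a conservative choice: a malformed judge response lowers the score rather than inflating it.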

---

After running this step, the dataset will include additional fields such as:

- `accuracy`: A score or flag indicating whether the answer is correct
- `judgment_explanation` *(optional)*: The LLM's reasoning or justification

This automatic evaluation enables large-scale analysis of model performance without requiring human annotations.
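
Once the `accuracy` field is present, aggregate statistics are straightforward. The sketch below assumes one numeric or boolean `accuracy` value per entry (the exact type is not specified above) and uses a hypothetical helper, `mean_accuracy`, on toy data.

```python
def mean_accuracy(dataset):
    """Average the "accuracy" field over all entries that carry one."""
    scores = [float(e["accuracy"]) for e in dataset["data"] if "accuracy" in e]
    if not scores:
        raise ValueError('no entries carry an "accuracy" field')
    return sum(scores) / len(scores)


# Toy data: three scored entries and one that was never judged.
toy = {
    "data": [
        {"question": "q1", "accuracy": 1},
        {"question": "q2", "accuracy": 0},
        {"question": "q3", "accuracy": True},
        {"question": "q4"},
    ]
}
print(f"mean accuracy: {mean_accuracy(toy):.3f}")
```

Entries without a score are skipped rather than counted as wrong, so the result reflects only the answers the judge actually assessed.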


```shell
!python scripts/eval_accuracy.py \
    --input-file=data/coqa/processed_gen_short.json \
    --output-file=data/coqa/processed_gen_acc_short.json \
    --model-name=gpt2 \
    --model-type=huggingface
```

## Step 4 (Optional): Upload the Dataset to Hugging Face

As an optional final step, you can upload the generated dataset to the **Hugging Face** Hub.

To do this, use the script: `scripts/upload_dataset.py`.


Uploading your dataset to Hugging Face allows others to explore, download, and use your data through the Hugging Face Hub, making collaboration and reproducibility easier.

> 🔒 Make sure you have a Hugging Face account and the appropriate API token configured before uploading.


```shell
!python scripts/upload_dataset.py \
    --path=data/coqa/processed_gen_acc_short.json \
    --repo-id=your-username/dataset-name \
    --token=your-huggingface-token
```