# Preprocessing of a dataset

To preprocess a dataset, you need a set of human question-answer pairs and a matching knowledge base. Both should be saved in the format described below so that the rest of the code works as expected.


## Paths
A new dataset should be structured as follows, with SQuAD shown as an example:

```
📂 data 
├── 📂 SQuAD
|   ├── 📂source_documents 
|   |   ├── dev-v2.0.json
|   |   ├── train-v2.0.json
|   ├── 📂human
|   |   ├── questions.json 
|   ├── corpus.json 
├── 📂 {MY_DATASET_NAME}
|   ├── 📂source_documents 
|   |   ├── {YOUR_SOURCE_FILE_1}.csv 
|   |   ├── ...
|   ├── 📂human
```

The `corpus.json` and `questions.json` files will then be created by a preprocessing notebook.

The code below creates the correct folder structure for you, so you only have to move your source files into place.


In [1]:
name = "example_dataset"  # folder name of the new dataset under data/
create_folders = True     # set to False to skip creating the folder tree

In [None]:
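if create_folders:
    import os

    # PROJECT_ROOT must be set in the environment; a KeyError is raised otherwise.
    root = os.environ['PROJECT_ROOT']
    # exist_ok=True makes this cell safe to re-run once the folders exist.
    os.makedirs(os.path.join(root, "data", name, "source_documents"), exist_ok=True)
    os.makedirs(os.path.join(root, "data", name, "human"), exist_ok=True)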

## Saving the corpus
The corpus should be saved to `data/{MY_DATASET_NAME}/corpus.json` as a single JSON object that maps each document title to its content:

```json
{
    "{document_title_a}" : "{document_content_a}",
    "{document_title_b}" : "{document_content_b}",
    ...
    "{document_title_z}" : "{document_content_z}"
}
```
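As a rough illustration, the corpus format above could be written with Python's standard `json` module. The helper name `save_corpus` and its arguments are hypothetical and not part of the project's code; the sketch assumes your documents are already available as a dictionary of title to content.

```python
import json
from pathlib import Path


def save_corpus(documents, dataset_dir):
    """Write a {title: content} mapping to <dataset_dir>/corpus.json.

    Hypothetical helper for illustration; not part of the project's code.
    """
    dataset_dir = Path(dataset_dir)
    dataset_dir.mkdir(parents=True, exist_ok=True)
    corpus_path = dataset_dir / "corpus.json"
    with corpus_path.open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII characters readable in the file.
        json.dump(documents, f, ensure_ascii=False, indent=4)
    return corpus_path
```

With the variables from the cells above, a call could look like `save_corpus(my_documents, os.path.join(root, "data", name))`, where `my_documents` is whatever dictionary your own preprocessing produces.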



## Saving the human questions and reference answers
The human questions and reference answers should be saved to `data/{MY_DATASET_NAME}/human/questions.json`.

There are two options:
1. Option A: use this when you do not know which source document in the corpus a question's reference answer is based on.
2. Option B: use this when the source document is known and you want to keep that information.

### Option A - no source document known
Questions should be saved as a list of dictionaries with keys `question` and `reference`:

```json
[
    {
        "question" : "{question_a}",
        "reference" : "{reference_answer_a}"
    },
    {
        "question" : "{question_b}",
        "reference" : "{reference_answer_b}"
    }
]
```
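A minimal sketch of producing this file, assuming your question-answer pairs are available as `(question, reference)` tuples. The function name `save_questions_flat` is hypothetical and only serves as an illustration.

```python
import json
from pathlib import Path


def save_questions_flat(qa_pairs, dataset_dir):
    """Write (question, reference) pairs to <dataset_dir>/human/questions.json.

    Hypothetical helper for illustration; not part of the project's code.
    """
    human_dir = Path(dataset_dir) / "human"
    human_dir.mkdir(parents=True, exist_ok=True)
    # Build the list of dictionaries with the two expected keys.
    records = [{"question": q, "reference": r} for q, r in qa_pairs]
    questions_path = human_dir / "questions.json"
    with questions_path.open("w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=4)
    return questions_path
```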

### Option B - source document known
If the source document of each question is known, as is the case in the SQuAD dataset, you may want to keep this information. In that case, save the questions in the structure outlined below, grouped by document title. If you run experiments with such a dataset, make sure to set `use_doc_ids` to `True` in your experiment config file.

```json
{
    "{doc_title_a}" :
    [
        {
            "question" : "{doc_a_question_a}",
            "reference" : "{doc_a_reference_answer_a}"
        },
        {
            "question" : "{doc_a_question_b}",
            "reference" : "{doc_a_reference_answer_b}"
        }
    ],
    "{doc_title_b}" :
    [
        {
            "question" : "{doc_b_question_a}",
            "reference" : "{doc_b_reference_answer_a}"
        },
        {
            "question" : "{doc_b_question_b}",
            "reference" : "{doc_b_reference_answer_b}"
        }
    ]
}
```
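The grouped structure above could be built along the following lines, assuming your data is available as `(doc_title, question, reference)` triples. All function names here are hypothetical. The check against the corpus rests on the assumption that the document titles in `questions.json` are meant to match the keys of `corpus.json`.

```python
import json
from collections import defaultdict
from pathlib import Path


def group_by_document(qa_triples):
    """Group (doc_title, question, reference) triples by document title."""
    grouped = defaultdict(list)
    for doc_title, question, reference in qa_triples:
        grouped[doc_title].append({"question": question, "reference": reference})
    return dict(grouped)


def titles_missing_from_corpus(grouped, corpus):
    """Return titles used in the questions but absent from the corpus.

    Assumption: question titles are expected to match the corpus keys.
    """
    return sorted(set(grouped) - set(corpus))


def save_questions_by_document(grouped, dataset_dir):
    """Write the grouped questions to <dataset_dir>/human/questions.json."""
    human_dir = Path(dataset_dir) / "human"
    human_dir.mkdir(parents=True, exist_ok=True)
    questions_path = human_dir / "questions.json"
    with questions_path.open("w", encoding="utf-8") as f:
        json.dump(grouped, f, ensure_ascii=False, indent=4)
    return questions_path
```

Running `titles_missing_from_corpus` before saving gives an early warning about questions that point to a document not present in the corpus.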