# Preprocessing Notebook

This notebook shows how to use our `preprocessing` library.

We will show how to:
* Convert the original `SQuAD` 1.1 and 2.0 datasets to a `Question Generation` format
* Download our modified `SQuAD` dataset from the hub
* Preprocess and save our modified dataset to use it for training

---

**Table of Contents**

0. [Install Dependencies](#install-dependencies)
1. [Create custom SQuAD dataset in json format](#create-custom-squad-dataset-in-json-format)
2. [Download our dataset from Huggingface](#download-our-dataset-from-huggingface)
3. [Preprocess our dataset to be used for training](#preprocess-our-dataset-to-be-used-for-training)

---

## 0. Install Dependencies

First, install the required libraries:

In [None]:
%pip install pandas -q
%pip install datasets -q
%pip install transformers -q

## 1. Create custom SQuAD dataset in json format

We will create 4 files:
* `squad_v1_train`
* `squad_v1_validation`
* `squad_v2_train`
* `squad_v2_validation`

These are also available in our dataset repository on `huggingface.co` under `the-coorporation/the_squad`.
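The output schema of the converter is not shown in this notebook. As a rough illustration of what a question generation format could look like, the sketch below groups all questions of a raw SQuAD JSON file under their context paragraph. The function name `group_questions_by_context`, the output keys `context` and `questions`, and the choice to skip unanswerable questions are assumptions made for this example, not the behaviour of `convert_squad`.

```python
from typing import Dict, List


def group_questions_by_context(squad: Dict) -> List[Dict]:
    """Flatten raw SQuAD JSON into one record per context paragraph.

    Hypothetical helper for illustration only; it is not part of the
    `preprocessing` library.
    """
    grouped: Dict[str, List[str]] = {}
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            questions = grouped.setdefault(paragraph["context"], [])
            for qa in paragraph["qas"]:
                # SQuAD 2.0 flags unanswerable questions with `is_impossible`.
                # Skipping them is an assumption of this sketch.
                if qa.get("is_impossible", False):
                    continue
                questions.append(qa["question"])
    # Drop contexts that ended up without any question.
    return [
        {"context": context, "questions": questions}
        for context, questions in grouped.items()
        if questions
    ]


# A tiny hand-written sample in the raw SQuAD layout.
sample = {
    "data": [
        {
            "title": "Example",
            "paragraphs": [
                {
                    "context": "Paris is the capital of France.",
                    "qas": [
                        {"question": "What is the capital of France?", "is_impossible": False},
                        {"question": "Which country is Paris the capital of?"},
                        {"question": "Who founded Paris?", "is_impossible": True},
                    ],
                }
            ],
        }
    ]
}

records = group_questions_by_context(sample)
print(records)
```

Grouping by context gives one training example per paragraph, which would match a dataset whose number of rows is close to the number of SQuAD paragraphs rather than the number of questions.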

In [None]:
%load_ext autoreload
%autoreload 2

from preprocessing.squad_converter import SquadVersion, convert_and_save_squad
convert_and_save_squad(SquadVersion.V1, "./data/squad")
convert_and_save_squad(SquadVersion.V2, "./data/squad")

They can also be created and returned in memory:

In [None]:
%load_ext autoreload
%autoreload 2

from preprocessing.squad_converter import SquadVersion, convert_squad
squad_v1_train, squad_v1_validation = convert_squad(SquadVersion.V1)
squad_v2_train, squad_v2_validation = convert_squad(SquadVersion.V2)

## 2. Download our dataset from Huggingface

We can now download either SQuAD 1.1 or 2.0 from our organization.

The version is selected with the `name` argument. If it is omitted, `v2` is loaded.

In [None]:
%load_ext autoreload
%autoreload 2

from datasets import load_dataset
squad_v1 = load_dataset("the-coorporation/the_squad", name="v1")
squad_v2 = load_dataset("the-coorporation/the_squad")

In [None]:
print(f"V1: {squad_v1}")
print(f"V2: {squad_v2}")

## 3. Preprocess our dataset to be used for training

To preprocess and save our modified `SQuAD` dataset locally, we use the `SquadPreprocessor`.

It needs a `tokenizer`, which we take from the `QG` model. Setting `padding` to `False` means no padding tokens are added to the dataset entries. Instead, we rely on `Dynamic Padding`: a `Data Collator` pads each batch to the length of its longest entry at training time.
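To make the idea of dynamic padding concrete, here is a minimal plain-Python sketch of what a collator does with unpadded entries. The helper `pad_batch` and the token ids are made up for illustration, and the pad id of `0` is an assumption of this sketch; in practice a collator from `transformers`, such as `DataCollatorForSeq2Seq`, would typically play this role.

```python
from typing import Dict, List


def pad_batch(
    batch: List[Dict[str, List[int]]],
    pad_token_id: int = 0,
) -> Dict[str, List[List[int]]]:
    """Pad every example to the length of the longest example in this batch."""
    max_len = max(len(example["input_ids"]) for example in batch)
    input_ids, attention_mask = [], []
    for example in batch:
        ids = example["input_ids"]
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two unpadded entries, as they would be stored with padding=False.
# The token ids are invented for this example.
batch = [
    {"input_ids": [363, 19, 8, 1784, 1]},
    {"input_ids": [2645, 1]},
]

padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Because the target length is computed per batch, a batch of short entries stays short, whereas padding the whole dataset up front would stretch every entry to one global maximum length.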

The `SquadPreprocessor` will save two files on disk:
* `training_data.pt`
* `validation_data.pt`

We will load these files when we want to train the `QG` model.

In [1]:
%load_ext autoreload
%autoreload 2

from models.qg import QG
from preprocessing.squad_preprocessor import SquadPreprocessor

qg = QG("t5-small", "t5-small")
processor = SquadPreprocessor(qg._tokenizer, padding=False)
processor.preprocess_and_save("the-coorporation/the_squad", "./data/")


No config specified, defaulting to: the_squad/v2


Downloading and preparing dataset the_squad/v2 (download: 23.04 MiB, generated: 20.74 MiB, post-processed: Unknown size, total: 43.78 MiB) to /home/laugu/.cache/huggingface/datasets/the-coorporation___the_squad/v2/2.0.0/5abe37fe976461f1fb98ecbcf3d45a991e2ec721fa49eb97da04d30718432bc5...


Generating train split:   0%|          | 0/18877 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1204 [00:00<?, ? examples/s]

Dataset the_squad downloaded and prepared to /home/laugu/.cache/huggingface/datasets/the-coorporation___the_squad/v2/2.0.0/5abe37fe976461f1fb98ecbcf3d45a991e2ec721fa49eb97da04d30718432bc5. Subsequent calls will reuse this data.

