# 3. Scores Generator (A)

We have divided this notebook into the following parts:

1. Load **matrix**: We load a CSV file with the preprocessed matrix.
2. Load **model**: Using the Hugging Face API, we load a pre-trained or fine-tuned model that will be applied to the matrix to obtain the corresponding predictions.
3. **Model-specific preprocessing**: We apply model-specific preprocessing that mirrors how the model was trained to encode its input strings.
4. Create **preds**: We create a CSV file with the predictions of the model under evaluation.

**Note**: We assume that all matrices have a set of `ID_COLS` that uniquely identifies each row. Additionally, for multi-way (or multi-annotated) datasets, we assume a row-wise format, that is, all the necessary data has already been unrolled along the first dimension. For example, consider a __source dataset__ with $200$ examples, each comprising two different annotations. This notebook __assumes that the dataset was previously preprocessed__ and is __now flattened__, totalling $400$ rows (one per example and annotation) when loaded. While this duplicates data in memory, it avoids complex pipelines with hand-tailored routines for each dataset (i.e., _bye bye spaghetti code_).
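
As an illustration of the row-wise format described in the note, here is a minimal sketch of the unrolling step. The record layout, the `flatten` helper and the way `answer_id` is built are assumptions made for this example; the actual preprocessing happens in an earlier notebook and is not shown here.

```python
# Hypothetical source dataset: one record per example, two annotations each.
source = [
    {"example_id": f"ex{i}", "question": f"q{i}", "answers": [f"a{i}-0", f"a{i}-1"]}
    for i in range(200)
]


def flatten(records):
    """Unroll annotations along the first dimension: one row per (example, annotation)."""
    rows = []
    for record in records:
        for answer_idx, answer in enumerate(record["answers"]):
            rows.append({
                "example_id": record["example_id"],
                # Assumed id scheme, only for this illustration.
                "answer_id": f'{record["example_id"]}-{answer_idx}',
                "question": record["question"],
                "label": answer,
            })
    return rows


rows = flatten(source)
print(len(source), "examples ->", len(rows), "rows")
```

Each row is self-contained, so downstream code can treat single-annotation and multi-annotation datasets the same way.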

In [1]:
OUTPUT_DIR = "../outputs"

# TODO - Come up with some uuid (model_name + dataset + split)
BASE_FILENAME = "dev4-uqa-t5-small"
ROOT_DIR = f"{OUTPUT_DIR}/results/mocha/narrativeqa/dev4"

MATRIX_DIR = f"{ROOT_DIR}/matrix"
MATRIX_FILEPATH = f"{MATRIX_DIR}/{BASE_FILENAME}.csv.gz"

# Outputs
PREDS_DIR = f"{ROOT_DIR}/preds"
!mkdir -p {PREDS_DIR}


SEED = 42
# Arguments used to read the files from disk
csv_kwargs = {
   "compression": "gzip",
   "encoding": "utf-8",
}

# ----------------------------------------
# Column names
# ----------------------------------------
ID_COLS = ["example_id", "answer_id"]

UNIQUE_ID_COL = ID_COLS[0]
NON_UNIQUE_ID_COL = ID_COLS[1]
print("Using", UNIQUE_ID_COL, "as the unique column to de-duplicate the data")

Using example_id as the unique column to de-duplicate the data


## Load matrix 

This is the preprocessed matrix that every model uses when creating predictions. We expect it to have the following columns:
- `ID_COLS: List[str]`, one or more columns that together uniquely identify each row.
- `TOPIC: str`, optional, provides a high-level categorization of the different examples.
- Dataset-specific columns, such as `CONTEXT`, `QUESTION`, `ANSWER` for open-book (closed-domain) QA tasks. Amongst these we usually define the `TARGET_LABEL` and the `FEATURES`, i.e., the columns that will be encoded together for generation.


By default we will assume the following columns:
- `TARGET_LABEL = 'label'`
- `FEATURES = ['question', 'context']`


**Note**: ~~We may have to reconsider the use of pandas for larger datasets, since it won't be feasible to hold them in memory. Instead, we may consider HuggingFace `datasets` or `pyspark`.~~ Consider writing a dataset [loading script](https://huggingface.co/docs/datasets/loading_datasets.html#from-local-or-remote-files) in case more demanding needs arise.

In [2]:
import pandas as pd
import numpy as np

import datasets

In [3]:
TARGET_LABEL = "label"
FEATURES = ["question", "context"]

In [4]:
import datasets
matrix = datasets.load_dataset('csv', data_files=MATRIX_FILEPATH)["train"]
print("Loaded", len(matrix), "datapoints from", MATRIX_FILEPATH)

Using custom data configuration default-10dc393789c9a3ef
Reusing dataset csv (/home/kat/.cache/huggingface/datasets/csv/default-10dc393789c9a3ef/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e)


Loaded 445 datapoints from ../outputs/results/mocha/narrativeqa/dev4/matrix/dev4-uqa-t5-small.csv.gz


In [5]:
matrix

Dataset({
    features: ['example_id', 'answer_id', 'title', 'context', 'question', 'label', 'multi_way_labels'],
    num_rows: 445
})

### Remove duplicate entries when generating predictions

In [6]:
from utils.datasets import drop_duplicates

In [7]:
matrix = drop_duplicates(matrix, UNIQUE_ID_COL)
print("Remaining", len(matrix), "datapoints after dropping duplicates")

Loading cached processed dataset at /home/kat/.cache/huggingface/datasets/csv/default-10dc393789c9a3ef/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e/cache-4250be564768902f.arrow


Remaining 277 datapoints after dropping duplicates
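
The `drop_duplicates` helper comes from the project's `utils.datasets` module and its implementation is not shown in this notebook. As a rough sketch of the behaviour it appears to have (keeping one row per value of `UNIQUE_ID_COL`), the function below works on plain lists of dicts. The name `drop_duplicates_by` and the keep-first policy are assumptions.

```python
def drop_duplicates_by(rows, unique_col):
    """Keep the first row seen for each distinct value of `unique_col`."""
    seen = set()
    kept = []
    for row in rows:
        key = row[unique_col]
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Toy rows: example "e1" has two annotations, "e2" has one.
toy_rows = [
    {"example_id": "e1", "answer_id": "a1"},
    {"example_id": "e1", "answer_id": "a2"},
    {"example_id": "e2", "answer_id": "a3"},
]
deduped = drop_duplicates_by(toy_rows, "example_id")
print("Remaining", len(deduped), "datapoints after dropping duplicates")
```

Presumably the motivation is that rows sharing an `example_id` feed the same input to the model, so generating once per example avoids redundant work.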


## Load model

Using the Hugging Face API, we load a pre-trained or fine-tuned model, which is later applied to the matrix to obtain the corresponding predictions.


In [8]:
from models.model import T5Model

In [9]:
# model_name = "allenai/unifiedqa-t5-small"
model_name = "t5-small"
model_hf_kwargs = {
    # Path to directory to store the pretrained models
    # (may make ensuing analysis faster)
    "cache_dir": f"{OUTPUT_DIR}/model/cache",
    # Specific version of the model to use (defaults to main)
    # "revision": "main",
}

model_hyperparameters = {
    "padding": "max_length",
    "max_length": 512,
    
    "truncation": True,
    "add_special_tokens": True,
    "return_attention_mask": True,
    # All generate-specific kwargs should start with the prefix "generate__"
    "generate__max_length": 100,
    "generate__batch_size": 200,
}

model = T5Model(model_name, model_hyperparameters, model_hf_kwargs)
model.load()

## Generate predictions
Using the model and the preprocessed matrix, we generate the predictions. The columns of the predictions file are listed under *Creating Greedy Predictions*.

Useful resources:
- [dataset and Pytorch](https://huggingface.co/docs/datasets/use_dataset.html)
- [fine-tuning a pretrained model](https://huggingface.co/course/chapter3/4?fw=pt)
- [generator](https://huggingface.co/docs/transformers/v4.16.2/en/internal/generation_utils#transformers.generation_utils.GreedySearchDecoderOnlyOutput)

### Model-tailored Preprocessing

We apply model-specific preprocessing that mirrors how the model was trained to encode its input strings. We apply it on a per-batch basis to avoid the additional overhead of iterating over the dataset. We use [`datasets.Dataset.set_format`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.set_format) as a more efficient way to cast the necessary columns to PyTorch structures.
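
The internals of `model._format_row` and `model.encode` are not shown in this notebook. The sketch below only illustrates the general idea of a per-row formatting step that joins the `FEATURES` columns into a single input string. The function name `format_row`, the `" | "` separator and the `encoded` output column are placeholders chosen for this example, not the project's actual convention.

```python
def format_row(row, features, sep=" | ", out_col="encoded"):
    """Join the selected feature columns into one input string.

    `sep` and `out_col` are illustrative defaults; a real model would use
    whatever input convention it was trained with.
    """
    text = sep.join(str(row[name]).strip() for name in features)
    return {**row, out_col: text}


# Toy row, not taken from the dataset.
example = {"question": "what color is the sky?", "context": "the sky is blue."}
formatted = format_row(example, features=["question", "context"])
print(formatted["encoded"])
```

A function with this shape returns a dict, which is the kind of callable `datasets.Dataset.map` accepts for row-wise transformations.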


In [10]:
matrix_fmt = matrix.map(model._format_row, fn_kwargs={"features": FEATURES})
matrix_fmt = matrix_fmt.map(lambda examples: model.encode(examples, 'encoded'), batched=True)
matrix_fmt.set_format(type="torch", columns=["input_ids", "attention_mask"], output_all_columns=False)



### Creating Greedy Predictions

We want to be able to create predictions both for __beam search__ and for __greedy search__. For now, we focus on the case where a single sequence is returned (even though multiple beams or multiple paths may be explored).
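
To make the greedy case concrete, here is a toy sketch of greedy search: at every step the most probable next token is selected until the end-of-sequence token or the maximum length is reached. The lookup-table "model" and all names (`TOY_MODEL`, `greedy_decode`, `toy_next_token_probs`) are invented for illustration and are unrelated to the `GreedyGenerator` used in this notebook.

```python
# Toy next-token distributions, keyed by the prefix generated so far.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"cat": 0.7, "dog": 0.3},
    ("a", "cat"): {"</s>": 0.9, "sat": 0.1},
}


def toy_next_token_probs(prefix):
    return TOY_MODEL[prefix]


def greedy_decode(next_token_probs, eos_token, max_length):
    """Pick the most probable token at each step (a single path, no beams)."""
    tokens, probs = [], []
    prefix = ()
    for _ in range(max_length):
        dist = next_token_probs(prefix)
        token = max(dist, key=dist.get)
        tokens.append(token)
        probs.append(dist[token])
        prefix = prefix + (token,)
        if token == eos_token:
            break
    return tokens, probs


tokens, probs = greedy_decode(toy_next_token_probs, eos_token="</s>", max_length=10)
print(tokens, probs)
```

Beam search differs in that it keeps several candidate prefixes alive at each step instead of committing to the single best token.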

A predictions matrix will have the following attributes/columns:
- `ID_COLS`: ideally comprised of the unique identifiers specified at the beginning of the notebook.
- `preds_id`: unique identifier for each prediction (computed for each instance based on the model_uuid and the generated tokens).
- `score_proba`: score associated with the generated sentence, computed as the product of the individual `preds_raw_scores`. The score lies within $[0, 1]$.
- `preds`: textual representation of the generated instance.
- `preds_raw_int`: ids of the generated tokens.
- `preds_raw_str`: string form of the generated tokens.
- `preds_raw_scores`: score of each generated token; each lies in the range $[0, 1]$.
- `preds_raw_count`: length of the generated sentence, in tokens.
- `truncated`: flag indicating whether the generated sequence actually contains the EOS token (it is `1` for the EOS-terminated sequences in the sample output at the end of this notebook).
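
As a sanity check on how these columns relate, the sketch below recomputes the sentence-level score from per-token scores. The token ids and scores are those of the `patience` prediction in the sample output at the end of this notebook, and the EOS id `1` is the T5 tokenizer's EOS id. The function `summarize_generation` and its field names are hypothetical, and how `has_eos` maps onto the `truncated` column is an assumption.

```python
import math


def summarize_generation(token_ids, token_probs, eos_token_id=1):
    """Derive sentence-level statistics from per-token ids and probabilities."""
    return {
        # Product of the per-token probabilities (the EOS step is included).
        "score_proba": math.prod(token_probs),
        # Same quantity in log space, which is safer for long sequences.
        "log_score": sum(math.log(p) for p in token_probs),
        # Number of generated tokens, not counting EOS.
        "n_tokens": sum(1 for t in token_ids if t != eos_token_id),
        # Whether the sequence terminated with an EOS token.
        "has_eos": eos_token_id in token_ids,
    }


summary = summarize_generation(
    token_ids=[11998, 1],
    token_probs=[0.8764204978942871, 0.8947873711585999],
)
print(round(summary["score_proba"], 6), summary["n_tokens"], summary["has_eos"])
```

The product is approximately $0.7842$, in line with the `score_proba` stored for that row. Working with the sum of log-probabilities instead avoids underflow when many small probabilities are multiplied.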

In [11]:
from models.predictions import GreedyGenerator
from utils_generic import filter_params_by_prefix

In [12]:
GENERATE_PREFIX = "generate__"

model_generate_hyperparams = filter_params_by_prefix(model_hyperparameters, GENERATE_PREFIX)
model_generate_hyperparams = {param_name[len(GENERATE_PREFIX):]: param_val for param_name, param_val in model_generate_hyperparams.items()}
print("Generator kwargs:", model_generate_hyperparams)


print("Creating **Greedy Generator**")
generator = GreedyGenerator()

print("Generating...")
batches = generator.generate(
    data=matrix_fmt,
    id_cols=ID_COLS,
    model=model._model,
    tokenizer=model._tokenizer,
    **model_generate_hyperparams,
)

Generator kwargs: {'max_length': 100, 'batch_size': 200}
Creating **Greedy Generator**
Generating...
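
The `generate__` prefix convention can be reproduced with two small dictionary comprehensions. `filter_params_by_prefix` lives in the project's `utils_generic` module and its source is not shown, so `filter_by_prefix` and `strip_prefix` below are hypothetical stand-ins that are assumed to behave similarly.

```python
def filter_by_prefix(params, prefix):
    """Keep only the entries whose key starts with `prefix`."""
    return {name: value for name, value in params.items() if name.startswith(prefix)}


def strip_prefix(params, prefix):
    """Remove `prefix` from every key (all keys are assumed to start with it)."""
    return {name[len(prefix):]: value for name, value in params.items()}


hyperparameters = {
    "padding": "max_length",
    "max_length": 512,
    "generate__max_length": 100,
    "generate__batch_size": 200,
}
generate_kwargs = strip_prefix(
    filter_by_prefix(hyperparameters, "generate__"), "generate__"
)
print("Generator kwargs:", generate_kwargs)
```

Namespacing the keys this way lets tokenizer arguments (`max_length: 512`) and generation arguments (`generate__max_length: 100`) coexist in one dictionary without clashing.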


## Dump prediction file

In [13]:
from utils.output import OutputResult

In [14]:
# Sanity check (:
out_result = OutputResult(
    filename=BASE_FILENAME,
    output_dir=PREDS_DIR,
    out_extension=".csv.gz",
)

print("Writing predictions at:", out_result.filename)
out_result.write(batches)

Writing predictions at: dev4-uqa-t5-small
Processing examples 0-200
Processing examples 200-277
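
The two `Processing examples` lines are consistent with a batch size of $200$ applied to the $277$ de-duplicated rows. A minimal sketch of how such index ranges could be derived follows; `batch_ranges` is a hypothetical helper, not the code used by `GreedyGenerator` or `OutputResult`.

```python
def batch_ranges(n_examples, batch_size):
    """Return (start, end) index pairs covering `n_examples` in order."""
    return [
        (start, min(start + batch_size, n_examples))
        for start in range(0, n_examples, batch_size)
    ]


ranges = batch_ranges(n_examples=277, batch_size=200)
for start, end in ranges:
    print(f"Processing examples {start}-{end}")
```

The last batch is simply shorter than the others, so no padding rows are needed.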


In [15]:
pd.read_csv(out_result.filepath, compression="gzip", encoding="utf-8").head()

Unnamed: 0,example_id,answer_id,preds,preds_id,preds_raw_int,preds_raw_str,preds_raw_count,truncated,score_proba,preds_raw_scores
0,2e6b688ad84546d681f1339df539a0b2,618baa6d3449a6d9171cd39bf204e8c9,doctor pascal rougon,abe46eacce04f58879f56150c248564d,"[2472, 330, 1489, 3, 3964, 5307, 1]","['▁doctor', '▁pas', 'cal', '▁', 'rou', 'gon']",6,1,0.444278,"[0.4561927914619446, 0.9993246793746948, 0.999..."
1,2aea58b6733f531e127a96752d9db1a4,52e62d271a8dbbc887e25f68681794b4,converts,88a6dc645c7d491d08aece45c8eaca71,"[5755, 7, 1]","['▁convert', 's']",2,1,0.685860,"[0.8886100649833679, 0.9971410036087036, 0.774..."
2,a9bf0ee9eac598663c3f2c9a9908b1e6,2093f3df3379e16364141294a8c9a062,"into a ""swimming tank""",4e6598f9b608b3f532f4f1e537e461a8,"[139, 3, 9, 96, 7, 210, 23, 635, 53, 5040, 121...","['▁into', '▁', 'a', '▁""', 's', 'w', 'i', 'mm',...",11,1,0.549538,"[0.7651848196983337, 0.9803006052970886, 0.999..."
3,f485d3509a0606a7b570cc5f2edbd083,e2fd235719de9140057fdb4e61e930e9,"a competition of ""court compliment""",36bbc8a0cef5d357cacd366611cc23cf,"[3, 9, 2259, 13, 96, 14492, 12064, 121, 1]","['▁', 'a', '▁competition', '▁of', '▁""', 'court...",8,1,0.397013,"[0.45407742261886597, 0.9586191773414612, 0.97..."
4,921d4e8c8c14e3b385bb80546b792154,8266b87c9f4982b8d7ed33efba54537b,"a gypsy boy, pablo",6a2bcdc69999a8d9394ff25f86d6580e,"[3, 9, 3, 122, 63, 19819, 4940, 6, 2576, 4672, 1]","['▁', 'a', '▁', 'g', 'y', 'psy', '▁boy', ',', ...",10,1,0.159552,"[0.60820072889328, 0.5329287648200989, 0.94543..."
...,...,...,...,...,...,...,...,...,...,...
272,cb25a066e09c116f9519a5a77e9a0b5f,b0b95faf20f3d7991d8bb09a1e08a4ae,patience,3b5bfb437430411c9990d029972d497b,"[11998, 1]",['▁patience'],1,1,0.784210,"[0.8764204978942871, 0.8947873711585999]"
273,92cf35db62a8b8b885ae6bd88b74796e,83eeb53928af196b2521ec80c13d3594,alexandria,ddc42a3937dc82ea8f2a094e82c6b1eb,"[1240, 226, 232, 52, 23, 9, 1]","['▁ale', 'x', 'and', 'r', 'i', 'a']",6,1,0.911054,"[0.9166272878646851, 0.999991774559021, 0.9999..."
274,3215cbad5038466d27701023ed9b1425,b07f2a24dd9d08626ce39fbc31afa97c,lady mabel grex,c95fd0ced49c47fe215d16c97d51d870,"[9360, 954, 2370, 3542, 994, 1]","['▁lady', '▁ma', 'bel', '▁gr', 'ex']",5,1,0.593904,"[0.8748897314071655, 0.6942368745803833, 0.999..."
275,69fb15c137edf9c225875ffa7112906a,e23c2e8728e615c361a843ec13823480,frank tregear,4e5be9eac0b8795072ddc8f213d9a44e,"[3, 89, 6254, 3, 929, 397, 291, 1]","['▁', 'f', 'rank', '▁', 'tre', 'ge', 'ar']",7,1,0.137433,"[0.2087908536195755, 0.6718899011611938, 0.999..."
