# 3. Scores Generator (A)

We have divided this notebook into the following parts:

1. Load **matrix**: We load a CSV file with the preprocessed matrix.
2. Load **model**: Using the Hugging Face API, we load a pre-trained or fine-tuned model, which is later applied to the matrix to obtain the corresponding predictions.
3. **Model-specific preprocessing**: We apply model-specific formatting and encoding that mirrors how the model was trained to encode its input strings.
4. Create **preds**: We create a CSV file with the predictions of the model under evaluation.

**Note**: We assume that all matrices have a set of `ID_COLS` that uniquely identifies each row. Additionally, for multi-way (or multi-annotated) datasets, we assume a row-wise format, that is, all the necessary data has already been unrolled along the first dimension. For example, consider a __source dataset__ with $200$ examples, each with two different annotations. This notebook __assumes that the dataset was previously preprocessed__ and is __now flattened__, totalling $400$ rows (one per example and annotation) when loaded. While this duplicates the shared fields in memory, it avoids complex pipelines with hand-tailored routines for each dataset (i.e., _bye bye spaghetti_ code).
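To make the row-wise format concrete, here is a minimal sketch of the unrolling step that the preprocessing is assumed to have performed already. The function name `unroll_annotations`, the `answers` field, and the positional `answer_id` are hypothetical and only serve as illustration; the actual preprocessing notebook may build its identifiers differently.

```python
def unroll_annotations(examples, answers_key="answers"):
    """Return one row per (example, annotation) pair."""
    rows = []
    for example in examples:
        # Fields shared by every annotation of the same example
        shared = {k: v for k, v in example.items() if k != answers_key}
        for position, answer in enumerate(example[answers_key]):
            # Assumption: answer_id is a positional index here; the real
            # pipeline may use another scheme (e.g., a hash of the answer).
            rows.append({**shared, "answer_id": position, "label": answer})
    return rows


source = [
    {"example_id": "q1", "question": "Who wrote it?", "answers": ["Ann", "Ann Lee"]},
    {"example_id": "q2", "question": "Where is it set?", "answers": ["Rome", "Italy"]},
]
rows = unroll_annotations(source)
print(len(rows))  # 4 rows: 2 examples x 2 annotations
```

Each pair `(example_id, answer_id)` is unique in the result, which is the property the `ID_COLS` assumption relies on.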

In [1]:
OUTPUT_DIR = "../outputs"

MODEL_NAME = "allenai/unifiedqa-t5-small"
# MODEL_NAME = "t5-small"

# name of the dataset to preprocess
# DATASET_NAME, SPLIT_NAME = "squad", "validation"
# DATASET_NAME, SPLIT_NAME = "newsqa", "dev"
# DATASET_NAME, SPLIT_NAME = ('squadshifts', 'new_wiki'), "test"
# DATASET_NAME, SPLIT_NAME = ('squadshifts', 'nyt'), "test"
# DATASET_NAME, SPLIT_NAME = ('squadshifts', 'amazon'), "test"
# DATASET_NAME, SPLIT_NAME = ('squadshifts', 'reddit'), "test"
# DATASET_NAME, SPLIT_NAME = "narrativeqa", "test_5k_sample_seed_2022"
DATASET_NAME, SPLIT_NAME = "narrativeqa", "test_len_10"


# True when the dataset/split has to be read from the local file system
IS_LOCAL_FS_DATASET = (
    DATASET_NAME in ("newsqa",)
    or SPLIT_NAME in ("test_5k_sample_seed_2022",)
)

if isinstance(DATASET_NAME, tuple):
    NORMALIZED_DATASET_NAME = "".join(DATASET_NAME)
else:
    NORMALIZED_DATASET_NAME = DATASET_NAME

BASE_FILENAME = f"{NORMALIZED_DATASET_NAME}_{SPLIT_NAME}"


ROOT_DIR = f"{OUTPUT_DIR}/results/{NORMALIZED_DATASET_NAME}/{SPLIT_NAME}"

MATRIX_DIR = f"{ROOT_DIR}/matrix"
MATRIX_FILEPATH = f"{MATRIX_DIR}/{BASE_FILENAME}_preprocessed.csv"

# Outputs
PREDS_DIR = f"{ROOT_DIR}/preds"
!mkdir -p {PREDS_DIR}

SEED = 42
# Arguments used to read the files from disk
csv_kwargs = {
   "compression": "gzip",
   "encoding": "utf-8",
}

# ----------------------------------------
# Column names
# ----------------------------------------
ID_COLS = ["example_id", "answer_id"]

UNIQUE_ID_COL = ID_COLS[0]
NON_UNIQUE_ID_COL = ID_COLS[1]
print("Using", UNIQUE_ID_COL, "as the unique column to de-duplicate the data")

Using example_id as the unique column to de-duplicate the data


## Load matrix 

This is the preprocessed matrix that every model uses when creating predictions. We expect it to have the following columns:
- `ID_COLS: List[str]`: one or more identifier columns that together uniquely identify each row.
- `TOPIC: str` (optional): a high-level categorization of the examples.
- Dataset-specific columns, such as `CONTEXT`, `QUESTION`, `ANSWER` for open-book (closed-domain) QA tasks. Among these, we usually define the `TARGET_LABEL` and the `FEATURES`, i.e., the columns that are encoded together as the model input for generation.


By default we will assume the following columns:
- `TARGET_LABEL = 'label'`
- `FEATURES = ['question', 'context']`


**Note**: ~~We may have to reconsider the use of pandas for larger datasets, since it won't be feasible to hold them in memory. Instead, we may consider Hugging Face `datasets` or `pyspark`.~~ Consider writing a [dataset loading script](https://huggingface.co/docs/datasets/loading_datasets.html#from-local-or-remote-files) in case more demanding needs arise.

In [2]:
import pandas as pd
import numpy as np

import datasets



In [3]:
TARGET_LABEL = "label"
FEATURES = ["question", "context"]

In [4]:
matrix = datasets.load_dataset('csv', data_files=MATRIX_FILEPATH)["train"]
print("Loaded", len(matrix), "datapoints from", MATRIX_FILEPATH)

Using custom data configuration default-c08f75ec88dbd2c1


Downloading and preparing dataset csv/default to /home/kat/.cache/huggingface/datasets/csv/default-c08f75ec88dbd2c1/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...




Dataset csv downloaded and prepared to /home/kat/.cache/huggingface/datasets/csv/default-c08f75ec88dbd2c1/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.



Loaded 3205 datapoints from ../outputs/results/narrativeqa/test_len_10/matrix/narrativeqa_test_len_10_preprocessed.csv





### Remove duplicate entries when generating predictions

In [5]:
from utils.datasets import drop_duplicates

In [6]:
matrix = drop_duplicates(matrix, UNIQUE_ID_COL)
print("Remaining", len(matrix), "datapoints after dropping duplicates")


Remaining 3205 datapoints after dropping duplicates
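The implementation of `utils.datasets.drop_duplicates` is not shown in this notebook. As a rough sketch of the idea, assuming it keeps the first row found for each value of the unique column, the de-duplication could look like the following. The helper name `keep_first_per_id` and the toy rows are hypothetical.

```python
def keep_first_per_id(rows, unique_col):
    """Keep only the first row seen for each value of `unique_col`."""
    seen = set()
    kept = []
    for row in rows:
        key = row[unique_col]
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


toy_rows = [
    {"example_id": "q1", "answer_id": "a1", "question": "Who wrote it?"},
    {"example_id": "q1", "answer_id": "a2", "question": "Who wrote it?"},
    {"example_id": "q2", "answer_id": "a3", "question": "Where is it set?"},
]
deduplicated = keep_first_per_id(toy_rows, "example_id")
print(len(deduplicated))  # 2
```

Rows that share an `example_id` share the same question and context, so generating once per example avoids redundant model calls.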





In [7]:
matrix

Dataset({
    features: ['example_id', 'title', 'question', 'context', 'labels', 'multi_way_labels', 'answer_id', 'question_len', 'context_len', 'labels_len'],
    num_rows: 3205
})

## Load model

Using the Hugging Face API, we load a pre-trained or fine-tuned model. It is applied to the matrix in the following sections to obtain the corresponding predictions.


In [7]:
from models.model import T5Model, UnifiedQAT5Model

if "unified" in MODEL_NAME:
    print("Using UnifiedQA:", MODEL_NAME)
    MODEL = UnifiedQAT5Model
elif "t5" in MODEL_NAME:
    print("Using T5 model:", MODEL_NAME)
    MODEL = T5Model
else:
    raise NotImplementedError(f"Unsupported model: {MODEL_NAME}")

ModuleNotFoundError: No module named 'torch.nn'

In [8]:
model_hf_kwargs = {
    # Path to directory to store the pretrained models
    # (may make ensuing analysis faster)
    "cache_dir": f"{OUTPUT_DIR}/model/cache",
    # Specific version of the model to use (defaults to main)
    # "revision": "main",
}

model_hyperparameters = {
    "padding": "max_length",
    "max_length": 512,
    
    "truncation": True,
    "add_special_tokens": True,
    "return_attention_mask": True,
    # All generate-specific kwargs should start with the prefix "generate__"
    "generate__max_length": 100,
    "generate__batch_size": 700,
}

model = MODEL(MODEL_NAME, model_hyperparameters, model_hf_kwargs)
model.load()

## Generate predictions
Using the model and the preprocessed matrix, we generate the predictions.
The columns of the resulting prediction files are listed under *Creating Greedy Predictions*.

Useful resources:
- [`datasets` and PyTorch](https://huggingface.co/docs/datasets/use_dataset.html)
- [fine-tuning a pretrained model](https://huggingface.co/course/chapter3/4?fw=pt)
- [generator](https://huggingface.co/docs/transformers/v4.16.2/en/internal/generation_utils#transformers.generation_utils.GreedySearchDecoderOnlyOutput)

### Model-tailored Preprocessing

We apply model-specific preprocessing that mirrors how each model was trained to encode its input strings. We apply the encoding on a per-batch basis to avoid the overhead of iterating over the dataset row by row. We use [`datasets.Dataset.set_format`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.set_format) as a more efficient way to cast the necessary columns to PyTorch tensors.
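The per-row formatting done by `model._format_row` is not shown in this notebook. As a hedged illustration, assuming it joins the `FEATURES` columns into one string stored under an `encoded` key, a sketch could look like this. The function name `format_row`, the lowercasing, and the literal `\n` separator (which follows the input convention shown in UnifiedQA's published examples) are assumptions, not the project's confirmed behavior.

```python
def format_row(row, features, sep=" \\n "):
    """Join the feature columns into a single lowercase input string.

    Assumption: the separator is the two characters backslash + n,
    not an actual newline.
    """
    text = sep.join(str(row[name]) for name in features)
    return {"encoded": text.lower()}


example = {"question": "Who?", "context": "Bob did.", "label": "Bob"}
formatted = format_row(example, features=["question", "context"])
print(formatted["encoded"])
```

A function with this shape can be passed to `datasets.Dataset.map` together with `fn_kwargs={"features": FEATURES}`, since `map` merges the returned dictionary into each row.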


In [9]:
matrix_fmt = matrix.map(model._format_row, fn_kwargs={"features": FEATURES})
matrix_fmt = matrix_fmt.map(lambda examples: model.encode(examples, 'encoded'), batched=True)
matrix_fmt.set_format(type="torch", columns=["input_ids", "attention_mask"], output_all_columns=False)




### Creating Greedy Predictions

We want to be able to create predictions both with __beam search__ and with __greedy search__. For now, we focus on the case of a single returned sequence (even though multiple beams or multiple paths may be explored).
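As a reminder of what greedy search does, the toy sketch below picks the most probable token at every step and stops at the EOS token or at `max_length`. The `toy_next_token_probs` function stands in for a real model and is purely hypothetical; this is not the `GreedyGenerator` implementation.

```python
def greedy_decode(next_token_probs, eos_id, max_length):
    """Pick the arg-max token at each step until EOS or max_length."""
    tokens, scores = [], []
    while len(tokens) < max_length:
        probs = next_token_probs(tokens)
        token = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(token)
        scores.append(probs[token])
        if token == eos_id:
            break
    return tokens, scores


def toy_next_token_probs(prefix):
    """Hypothetical 3-token vocabulary; token 0 plays the role of EOS."""
    table = {0: [0.1, 0.2, 0.7], 1: [0.1, 0.6, 0.3]}
    return table.get(len(prefix), [0.9, 0.05, 0.05])


tokens, scores = greedy_decode(toy_next_token_probs, eos_id=0, max_length=10)
print(tokens, scores)  # [2, 1, 0] [0.7, 0.6, 0.9]
```

With `max_length=2` the loop stops before EOS is produced, which is the situation a truncation flag is meant to capture.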

A predictions matrix has the following columns:
- `ID_COLS`: the unique identifier columns specified at the beginning of the notebook.
- `preds_id`: unique identifier of each prediction (computed for each instance from the model_uuid and the generated tokens).
- `score_proba`: score of the generated sequence, computed as the product of the per-token `preds_raw_scores`; it lies within $[0, 1]$.
- `preds`: textual representation of the generated sequence.
- `preds_raw_int`: ids of the generated tokens.
- `preds_raw_str`: string form of the generated tokens.
- `preds_raw_scores`: score of each generated token; each lies in the range $[0, 1]$.
- `preds_raw_count`: length of the generated sequence in tokens (in the outputs at the end of this notebook, the final EOS token is not counted).
- `truncated`: flag tied to the EOS token. Despite the name, in the outputs at the end of this notebook it is `1` for sequences that end with the EOS token (id `1`).
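The two derived columns can be sketched as follows. The sequence score is the product of the per-token probabilities, computed here in log space to limit underflow on long sequences. The identifier is built by hashing the model identifier together with the token ids; the exact hashing scheme used by the project is not shown, so `prediction_id` and its MD5 choice are assumptions for illustration only.

```python
import hashlib
import math


def sequence_probability(token_probs):
    """Product of per-token probabilities, accumulated in log space."""
    if any(p <= 0.0 for p in token_probs):
        return 0.0
    return math.exp(sum(math.log(p) for p in token_probs))


def prediction_id(model_uuid, token_ids):
    """Hypothetical id: hash of the model identifier and the token ids."""
    payload = f"{model_uuid}:{','.join(str(t) for t in token_ids)}"
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


score = sequence_probability([0.5, 0.5])
pred_id = prediction_id("some-model-uuid", [919, 9650, 11, 10685, 1])
print(score, len(pred_id))
```

Because every factor lies in $[0, 1]$, the product also lies in $[0, 1]$ and tends to shrink as sequences get longer.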

In [10]:
from models.predictions import GreedyGenerator
from utils_generic import filter_params_by_prefix

In [11]:
GENERATE_PREFIX = "generate__"

model_generate_hyperparams = filter_params_by_prefix(model_hyperparameters, GENERATE_PREFIX)
model_generate_hyperparams = {param_name[len(GENERATE_PREFIX):]: param_val for param_name, param_val in model_generate_hyperparams.items()}
print("Generator kwargs:", model_generate_hyperparams)


print("Creating **Greedy Generator**")
generator = GreedyGenerator()

print("Generating...")
batches = generator.generate(
    data=matrix_fmt,
    id_cols=ID_COLS,
    model=model._model,
    tokenizer=model._tokenizer,
    **model_generate_hyperparams,
)

Generator kwargs: {'max_length': 100, 'batch_size': 700}
Creating **Greedy Generator**
Generating...
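The prefix convention can be illustrated without the project helper. The sketch below is a minimal stand-in for `filter_params_by_prefix` followed by the prefix stripping done in the cell; the function name `select_and_strip_prefix` is hypothetical.

```python
def select_and_strip_prefix(params, prefix):
    """Keep the entries whose key starts with `prefix`, then drop the prefix."""
    return {
        name[len(prefix):]: value
        for name, value in params.items()
        if name.startswith(prefix)
    }


hyperparameters = {
    "padding": "max_length",
    "max_length": 512,
    "generate__max_length": 100,
    "generate__batch_size": 700,
}
generate_kwargs = select_and_strip_prefix(hyperparameters, "generate__")
print(generate_kwargs)  # {'max_length': 100, 'batch_size': 700}
```

Keeping tokenizer and generation settings in one dictionary means that both `max_length` values can coexist: the unprefixed one applies to encoding, the prefixed one to generation.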


## Dump prediction file

In [12]:
from utils.output import OutputResult

In [13]:
# Write the generated batches to a gzip-compressed CSV file
out_result = OutputResult(
    filename=BASE_FILENAME + f"_{NORMALIZED_DATASET_NAME}_{SPLIT_NAME}",
    output_dir=PREDS_DIR,
    out_extension="csv.gz",
)

print("Writing predictions at:", out_result.filename)
out_result.write(batches)

Writing predictions at: squadshiftsreddit_test_squadshiftsreddit_test
Processing examples 0-700 (out of 9803)
Processing examples 700-1400 (out of 9803)
Processing examples 1400-2100 (out of 9803)
Processing examples 2100-2800 (out of 9803)
Processing examples 2800-3500 (out of 9803)
Processing examples 3500-4200 (out of 9803)
Processing examples 4200-4900 (out of 9803)
Processing examples 4900-5600 (out of 9803)
Processing examples 5600-6300 (out of 9803)
Processing examples 6300-7000 (out of 9803)
Processing examples 7000-7700 (out of 9803)
Processing examples 7700-8400 (out of 9803)
Processing examples 8400-9100 (out of 9803)
Processing examples 9100-9800 (out of 9803)
Processing examples 9800-9803 (out of 9803)
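The log above reflects a simple batching scheme: consecutive index ranges of `batch_size` examples, with a shorter final batch. A minimal sketch, assuming this is all the generator does to split the data (the helper name `batch_ranges` is hypothetical):

```python
def batch_ranges(n_examples, batch_size):
    """Yield (start, end) index pairs covering range(n_examples)."""
    for start in range(0, n_examples, batch_size):
        yield start, min(start + batch_size, n_examples)


ranges = list(batch_ranges(9803, 700))
print(len(ranges), ranges[0], ranges[-1])  # 15 (0, 700) (9800, 9803)
```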


In [14]:
df = pd.read_csv(out_result.filepath, compression="gzip", encoding="utf-8")
df.isna().any()

example_id          False
answer_id           False
preds                True
preds_id            False
preds_raw_int       False
preds_raw_str       False
preds_raw_count     False
truncated           False
score_proba         False
preds_raw_scores    False
dtype: bool
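The `True` for `preds` means that at least one prediction is read back as missing. One plausible cause (an assumption, not verified against this file) is that `pandas.read_csv` converts empty fields and strings such as `n/a` or `null` to `NaN` by default, so an empty or NA-looking generated answer ends up as missing. The toy CSV below is made up to show the effect and how `keep_default_na=False` preserves the literal strings.

```python
import io

import pandas as pd

# Made-up file content: one empty prediction and one that reads "n/a"
csv_text = "example_id,preds\na,turn signals\nb,\nc,n/a\n"

default = pd.read_csv(io.StringIO(csv_text))
literal = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default["preds"].isna().tolist())  # [False, True, True]
print(literal["preds"].tolist())         # ['turn signals', '', 'n/a']
```

If the predictions are later compared against reference answers as strings, reading them with `keep_default_na=False` (or filling missing values with an empty string) avoids mixing `float` `NaN` values into a text column.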

In [15]:
df.head()

| | example_id | answer_id | preds | preds_id | preds_raw_int | preds_raw_str | preds_raw_count | truncated | score_proba | preds_raw_scores |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5d9b75b18ae5305bc982a86b | 3b77386b86e94b981a76f05a5f43da45 | turn signals and exhaust | 27551083ee6365ec2b6561cbafa8aa33 | [919, 9650, 11, 10685, 1] | ['▁turn', '▁signals', '▁and', '▁exhaust'] | 4 | 1 | 0.387073 | [0.9219509959220886, 0.9995096921920776, 0.526... |
| 1 | 5d9b75b18ae5305bc982a869 | ff4c857dccfa1633340e74559b52bcb1 | a 2010 510 SMR | e1bee9c3f921c59a3b290e94be3ea880 | [3, 9, 2735, 3, 25926, 180, 9320, 1] | ['▁', 'a', '▁2010', '▁', '510', '▁S', 'MR'] | 7 | 1 | 0.42179 | [0.8206230401992798, 0.9121021628379822, 0.994... |
| 2 | 5d9b75b18ae5305bc982a86a | 69670401970f3c9ba88b04d9f7dd1ebf | turn signals, and mirrors | 986a63653ca5d44930340e91eac69cb5 | [919, 9650, 6, 11, 5432, 7, 1] | ['▁turn', '▁signals', ',', '▁and', '▁mirror', ... | 6 | 1 | 0.355219 | [0.8474096655845642, 0.9997405409812927, 0.637... |
| 3 | 5d9b75b18ae5305bc982a86c | 7449ba52d1b5ecda71d9adb1749d4819 | the poster and her boyfriend like the SXV mirr... | 75419bc260e7ffc02db952288ebb7a4d | [8, 10836, 11, 160, 18124, 114, 8, 180, 4, 553... | ['▁the', '▁poster', '▁and', '▁her', '▁boyfrien... | 13 | 1 | 0.047296 | [0.6735487580299377, 0.5055192708969116, 0.946... |
| 4 | 5d9bb1b18ae5305bc982d357 | 3a855fd1b5e0f5ee64350ba22adbfd56 | my boyfriend's husky | f32251d4fed562bfc752df4a43baf5c7 | [82, 18124, 31, 7, 3, 11823, 3781, 1] | ['▁my', '▁boyfriend', "'", 's', '▁', 'hus', 'ky'] | 7 | 1 | 0.378949 | [0.8585028052330017, 0.9997809529304504, 0.723... |


In [16]:
df.tail()

| | example_id | answer_id | preds | preds_id | preds_raw_int | preds_raw_str | preds_raw_count | truncated | score_proba | preds_raw_scores |
|---|---|---|---|---|---|---|---|---|---|---|
| 9798 | 5d9cb4f22358f20614262617 | 513e64e8c9fda01d49792dc7a25f1810 | dota | 9373295a57fc51bab0afc62ee2c763e3 | [103, 17, 9, 1] | ['▁do', 't', 'a'] | 3 | 1 | 0.495587 | [0.5015457272529602, 0.9999641180038452, 0.999... |
| 9799 | 5d9cb4f22358f20614262618 | fd9de0cc46acc434d7ac9db53f7dff25 | some of the most memorable SpongeBob SquarePan... | e6c7ab712ea812ce5b5b69b60611e3bf | [128, 13, 8, 167, 10080, 2526, 2444, 15, 279, ... | ['▁some', '▁of', '▁the', '▁most', '▁memorable'... | 15 | 1 | 0.541317 | [0.671987771987915, 0.9991242289543152, 0.9999... |
| 9800 | 5d9cb4f22358f20614262619 | 82fd14bdbacb142f370d4c9f46bd00dc | commenting on towers, Roshan and hero picks | e3e1f28da0456db79291e84d77a5ae0a | [1670, 53, 30, 7293, 7, 6, 7963, 2618, 11, 160... | ['▁comment', 'ing', '▁on', '▁tower', 's', ',',... | 13 | 1 | 0.254643 | [0.42295578122138977, 0.9977735877037048, 0.98... |
| 9801 | 5d9cb4f22358f2061426261a | ff7435b4b80a6d4ea5f9c384248a4997 | make a comment on his love of money | adbeb42e531f3760095e017d026c0df8 | [143, 3, 9, 1670, 30, 112, 333, 13, 540, 1] | ['▁make', '▁', 'a', '▁comment', '▁on', '▁his',... | 9 | 1 | 0.557354 | [0.7199561595916748, 0.9998311996459961, 0.999... |
| 9802 | 5d9cb4f22358f2061426261b | 01afc3ca99156fdf7da380154483e7dc | adding some of the most memorable SpongeBob Sq... | 1f53d0d56c593ba266643310dc867821 | [2651, 128, 13, 8, 167, 10080, 2526, 2444, 15,... | ['▁adding', '▁some', '▁of', '▁the', '▁most', '... | 16 | 1 | 0.692029 | [0.9828467965126038, 0.9836350679397583, 0.999... |
