# 3. Scores Generator (A)

We have divided this notebook into the following parts:

1. Load **matrix**: We load a CSV file with the preprocessed matrix.
2. Load **model**: Using the Hugging Face API, we load a pre-trained or fine-tuned model that will be applied to the matrix to obtain the corresponding predictions.
3. **Model-specific preprocessing**: We apply model-specific preprocessing that mirrors how the model was trained to encode its input strings.
4. Create **preds**: We create a CSV file with the predictions of the model under evaluation.

**Note**: We assume that all matrices have a set of `ID_COLS` that uniquely identifies each row. Additionally, for multi-way (or multi-annotated) datasets, we assume a row-wise format, that is, all the necessary data has already been unrolled along the first dimension. For example, consider a __source dataset__ with $200$ examples, each comprising two different annotations. This notebook __assumes that the dataset was previously preprocessed__ and is __now flattened__, totalling $400$ rows (one per example and annotation) when loaded from disk. While this duplicates data in memory, it avoids complex pipelines with hand-tailored routines for each dataset (i.e., _bye bye spaghetti_ code).
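
To make the row-wise assumption concrete, here is a minimal sketch of the unrolling step that is expected to have happened upstream. The nested `annotations` field, the `flatten` helper and the way `answer_id` is built are hypothetical and only serve the illustration; the actual preprocessing lives outside this notebook.

```python
# Hypothetical source rows: one row per example, annotations nested in a list.
source = [
    {"example_id": f"ex{i}", "question": f"q{i}", "annotations": [f"a{i}-0", f"a{i}-1"]}
    for i in range(200)
]


def flatten(rows):
    """Unroll nested annotations so each (example, annotation) pair is one row."""
    flat = []
    for row in rows:
        for answer_idx, answer in enumerate(row["annotations"]):
            flat.append({
                "example_id": row["example_id"],
                # Assumed id scheme, for illustration only
                "answer_id": f"{row['example_id']}-{answer_idx}",
                "question": row["question"],
                "answer": answer,
            })
    return flat


flat_rows = flatten(source)
print(len(source), "source examples ->", len(flat_rows), "rows")
```

Each flattened row is identified by the pair (`example_id`, `answer_id`), while `example_id` alone repeats once per annotation.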

In [1]:
OUTPUT_DIR = "../outputs"
!mkdir -p {OUTPUT_DIR}

ROOT_DIR = f"{OUTPUT_DIR}/results/mocha/narrativeqa/dev4"
!mkdir -p {ROOT_DIR}

# TODO - Come up with some uuid (model_name + dataset + split)
MATRIX_FILEPATH = f"{ROOT_DIR}/matrix/dev4-uqa-t5-small_preds.csv.gz"

# Outputs
PREDS_FILEPATH = f"{ROOT_DIR}/preds/dev4-uqa-t5-small_preds.csv.gz"
!mkdir -p {ROOT_DIR}/preds


SEED = 42
# Arguments used to read the files from disk
csv_kwargs = {
   "compression": "gzip"
}

# ----------------------------------------
## Columns names
# ----------------------------------------
ID_COLS = ["example_id", "answer_id"]

UNIQUE_ID_COL = ID_COLS[0]
NON_UNIQUE_ID_COL = ID_COLS[1]
print("Using", UNIQUE_ID_COL, "as the unique column to de-duplicate the data")

Using example_id as the unique column to de-duplicate the data


## Load matrix 

This is the preprocessed matrix that every model uses when creating predictions. We expect it to have the following columns:
- `ID_COLS: List[str]`, one or more columns that together uniquely identify each row.
- `TOPIC: str`, optional, provides a high-level categorization of the different examples.
- Dataset-specific columns, such as `CONTEXT`, `QUESTION`, `ANSWER` for open-book (closed-domain) QA tasks. Amongst these we usually define the `TARGET_LABEL` and the `FEATURES`, i.e., the columns that will be encoded together for generative purposes.


By default we will assume the following columns:
- `TARGET_LABEL = 'label'`
- `FEATURES = ['question', 'context']`


**Note**: ~~May have to reconsider the use of pandas for larger datasets, since it won't be feasible to hold them in memory. Instead, may consider HuggingFace `datasets` or `pyspark`.~~ Consider [building a loading script](https://huggingface.co/docs/datasets/loading_datasets.html#from-local-or-remote-files) in case more demanding needs arise.

In [2]:
import pandas as pd
import numpy as np

import datasets

In [3]:
TARGET_LABEL = "label"
FEATURES = ["question", "context"]

In [4]:
import datasets
matrix = datasets.load_dataset('csv', data_files=MATRIX_FILEPATH)["train"]
print("Loaded", len(matrix), "datapoints from", MATRIX_FILEPATH)

Using custom data configuration default-93e3039272822da3
Reusing dataset csv (/home/kat/.cache/huggingface/datasets/csv/default-93e3039272822da3/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e)



Loaded 445 datapoints from ../outputs/results/mocha/narrativeqa/dev4/matrix/dev4-uqa-t5-small_preds.csv.gz


### Remove duplicate entries when generating predictions

In [5]:
def drop_duplicates(data, col):
    unique_col_values = {k: True for k in data.unique(col)}
    return data.filter(lambda example: unique_col_values.pop(example[col], False))

matrix = drop_duplicates(matrix, UNIQUE_ID_COL)
print("Remaining", len(matrix), "datapoints after dropping duplicates")

Loading cached processed dataset at /home/kat/.cache/huggingface/datasets/csv/default-93e3039272822da3/0.0.0/6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e/cache-c622453c434803da.arrow


Remaining 277 datapoints after dropping duplicates
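
The `drop_duplicates` helper above keeps the first row seen for each value of `col`: the first lookup pops the key and returns `True`, and every later lookup falls back to `False`. The sketch below shows the same keep-first behaviour on plain Python dictionaries; `keep_first` and the toy rows are hypothetical and not part of the notebook.

```python
def keep_first(rows, col):
    """Keep only the first row seen for each distinct value of `col`."""
    seen = set()
    kept = []
    for row in rows:
        if row[col] not in seen:
            seen.add(row[col])
            kept.append(row)
    return kept


# Toy rows: example "e1" has two annotations, "e2" has one.
rows = [
    {"example_id": "e1", "answer_id": "a1"},
    {"example_id": "e1", "answer_id": "a2"},
    {"example_id": "e2", "answer_id": "a3"},
]
deduped = keep_first(rows, "example_id")
print([r["answer_id"] for r in deduped])
```

De-duplicating on `example_id` means the model generates one prediction per example, regardless of how many annotations that example has.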


## Load model

Using the Hugging Face API, we load a pre-trained or fine-tuned model that is later applied to the matrix to obtain the corresponding predictions.


In [6]:
# model_name = "allenai/unifiedqa-t5-small"
model_name = "t5-small"
model_hf_kwargs = {
    # Path to directory to store the pretrained models
    # (may make ensuing analysis faster)
    "cache_dir": f"{OUTPUT_DIR}/model/cache",
    # Specific version of the model to use (defaults to main)
    # "revision": "main",
}

model_hyperparameters = {
    "padding": "max_length",
    "max_length": 512,
    
    "truncation": True,
    "add_special_tokens": True,
    "return_attention_mask": True,
    "target_max_length": 100,
}

In [7]:
from dataclasses import dataclass, field
from utils_generic import filter_params, filter_params_by_prefix

import logging
import transformers


@dataclass
class T5Model:
    model_name: str
    model_hyperparameters: dict
        
    model_hf_kwargs: dict = field(default_factory=dict)
    _tokenizer = None
    _model = None
        
    def _format_row(self, row, features):
        prefixes = [f"{f}: {row[f]}" for f in features]
        sep = f" {self._tokenizer.eos_token} "
        return {"encoded": sep.join(prefixes)}

    def encode(self, data, target_label, prefix: str = None):
        if prefix is None:
            hyperparams = self.model_hyperparameters
        else:
            hyperparams = filter_params_by_prefix(self.model_hyperparameters, prefix)
        
        hyperparams = filter_params(hyperparams, self._tokenizer)
        logging.warning(f"Encoding with hyperparameters (target={target_label}, prefix={prefix}): {hyperparams}")
        return self._tokenizer(data[target_label], **hyperparams)
        
    def load(self):
        # Configuration (defines vocab size, model dimensions, ...)
        # config = transformers.T5Config.from_pretrained(
        #    model_name, **kwargs)
        # ^ Note:
        # Changes in T5 configurations should be done here:
        # ...
        tokenizer_fn = transformers.T5TokenizerFast.from_pretrained
        tokenizer_params = filter_params(self.model_hf_kwargs, tokenizer_fn)
        self._tokenizer = tokenizer_fn(self.model_name, **tokenizer_params)

        model_fn = transformers.T5ForConditionalGeneration.from_pretrained
        model_params = filter_params(self.model_hf_kwargs, model_fn)
        self._model = model_fn(self.model_name, **model_params)

In [8]:
model = T5Model(model_name, model_hyperparameters, model_hf_kwargs)
model.load()

## Generate predictions
Using the model and the preprocessed matrix, we generate the predictions.
The prediction files will contain the following information:

- `example_id`: the id of the example used when making the prediction.
- `score_proba: float`, a probability value that corresponds to the generated answer up until the `EOS` (end-of-sequence) token.
- `preds: str`, a textual description of the decoded model answers (after prediction).
- `preds_token_ids: List[int]`: a list with the generated tokens ids.
- `preds_token_score_proba: List[float]`: a list with the (conditional) probability generated for each token in the sequence.


Useful resources:
- [dataset and Pytorch](https://huggingface.co/docs/datasets/use_dataset.html)
- [fine-tuning a pretrained model](https://huggingface.co/course/chapter3/4?fw=pt)
- [generator](https://huggingface.co/docs/transformers/v4.16.2/en/internal/generation_utils#transformers.generation_utils.GreedySearchDecoderOnlyOutput)

### Model-tailored Preprocessing

We apply model-specific preprocessing that mirrors how the model was trained to encode its input strings. We apply it on a per-batch basis to avoid the additional overhead of iterating over the dataset. We use [`datasets.Dataset.set_format`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.set_format) as a more efficient way to cast the necessary columns to PyTorch tensors.
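
As an illustration of the string format produced by `T5Model._format_row`, the following sketch joins the feature columns using plain Python. It assumes the tokenizer's EOS token is `</s>` (the T5 default); `format_row` and the example row are hypothetical.

```python
EOS_TOKEN = "</s>"  # assumption: EOS token of the T5 tokenizer


def format_row(row, features, eos_token=EOS_TOKEN):
    """Prefix each feature with its name and join them with the EOS token."""
    parts = [f"{name}: {row[name]}" for name in features]
    return f" {eos_token} ".join(parts)


# Hypothetical row with the default FEATURES
row = {"question": "Who wrote it?", "context": "It was written by Ada."}
encoded = format_row(row, ["question", "context"])
print(encoded)
```

The resulting string is what gets tokenized into `input_ids` and `attention_mask`.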


### Creating Predictions

We want to be able to create predictions both for beam search and for greedy search. For now, we focus on the case of a single returned sequence (even though multiple beams or multiple paths may be explored).
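
Greedy search corresponds to a single beam without sampling (`num_beams=1`, `do_sample=False`): at each step the most likely next token is selected. Below is a minimal, framework-free sketch of that loop. The toy distributions and the `toy_model` function are hypothetical stand-ins for a real model's next-token probabilities.

```python
def greedy_decode(next_token_probs, eos_token, max_length):
    """Pick the most likely token at every step until EOS or max_length."""
    tokens, probs = [], []
    for _ in range(max_length):
        dist = next_token_probs(tokens)  # dict: token -> probability
        token = max(dist, key=dist.get)
        tokens.append(token)
        probs.append(dist[token])
        if token == eos_token:
            break
    return tokens, probs


# Hypothetical toy "model": the distribution depends only on the step index.
TOY_STEPS = [
    {"a": 0.6, "b": 0.3, "</s>": 0.1},
    {"a": 0.2, "b": 0.7, "</s>": 0.1},
    {"a": 0.1, "b": 0.1, "</s>": 0.8},
]


def toy_model(prefix):
    return TOY_STEPS[min(len(prefix), len(TOY_STEPS) - 1)]


tokens, probs = greedy_decode(toy_model, "</s>", max_length=5)
print(tokens, probs)
```

Beam search would instead keep several candidate prefixes alive at each step and return the best-scoring one.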

A predictions matrix will have the following attributes/columns:
- `ID_COLS`: the identifier columns, ideally comprising the unique identifier of each example.
- `preds_id`: unique identifier of each prediction (ideally based on the model's uuid and the generated tokens; for now computed from the generated text only).
- `score_proba`: score associated with the generated sentence, computed as the product of the individual token probabilities in `preds_raw_scores`. The score lies within $[0, 1]$.
- `preds`: textual representation of the generated instance.
- `preds_raw_int`: ids of the generated tokens.
- `preds_raw_str`: string form of the generated tokens.
- `preds_raw_scores`: probability of each generated token, in the range $[0, 1]$.
- `preds_raw_count`: number of generated tokens, excluding special tokens such as EOS or padding.
- `truncated`: flag set to 1 when the generated sequence contains the EOS token.

Similarly to the implementation of [lm-calibration](https://github.com/jzbjyb/lm-calibration/blob/887e3e13df0462842ce288fffe588e549a3360ee/model/gpt2.py#L67), we apply log-softmax to the logits to obtain per-token log-probabilities before summing them.
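
The score computation can be illustrated without PyTorch. The sketch below normalizes hypothetical per-step logits with a log-softmax, picks the log-probability of each generated token, and exponentiates the sum to obtain the sequence probability. The logits and token ids are made up for the example.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


# Hypothetical logits for 2 decoding steps over a 3-token vocabulary
step_logits = [[2.0, 1.0, 0.0], [0.0, 3.0, 0.0]]
generated = [0, 1]  # token id selected at each step

# Log-probability of the selected token at each step
token_logprobs = [log_softmax(l)[t] for l, t in zip(step_logits, generated)]

# Per-token probabilities (analogous to `preds_raw_scores`)
token_probs = [math.exp(lp) for lp in token_logprobs]

# Sequence probability (analogous to `score_proba`): exp of the summed log-probs
score_proba = math.exp(sum(token_logprobs))
print(token_probs, score_proba)
```

Summing log-probabilities and exponentiating once is equivalent to multiplying the per-token probabilities, but is less prone to underflow on long sequences.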

In [9]:
GREEDY_KWARGS = {
    "num_beams": 1,
    "do_sample": False,
}

In [10]:
from collections import defaultdict

batch_preds = defaultdict(list)

batch = matrix
batch = batch.map(model._format_row, fn_kwargs={"features": FEATURES})
batch = batch.map(lambda examples: model.encode(examples, 'encoded'), batched=True)
batch.set_format(type="torch", columns=["input_ids", "attention_mask"], output_all_columns=False)



### Greedy results 

If you are using a T5-like or BART-like model, the output of the call below will be a [GreedySearchEncoderDecoderOutput](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.generation_utils.GreedySearchEncoderDecoderOutput). We rely on two of its fields: `sequences`, with the generated token ids, and `scores`, with the per-step prediction scores (returned because we pass `output_scores=True`).

In [11]:
from itertools import filterfalse 
from utils_generic import generate_uuid

import torch
import torch.nn.functional as F


class GreedyGenerator:
    def __init__(self):
        super().__init__()
        self._num_beams = 1
        self._do_sample = False
        self._clean_up_tokenization_spaces = True
        self._skip_special_tokens = True
        
    @property
    def generate_hyperparams(self) -> dict:
        return {
            'num_beams': self._num_beams,
            'do_sample': self._do_sample,
        }

    @property
    def decoding_hyperparams(self) -> dict:
        return {
            "clean_up_tokenization_spaces": self._clean_up_tokenization_spaces,
            "skip_special_tokens": self._skip_special_tokens,
        }
    
    def generate(self, data, id_cols, tokenizer, model, batch_size: int=None, **kwargs) -> dict:
        """
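        Yield one `pandas.DataFrame` per batch with the `id_cols` and the following columns:

            - `score_proba`: score associated with the generated sentence, computed as the product of the individual `preds_raw_scores`. The score lies within $[0, 1]$.
            - `preds`: textual representation of the generated instance.
            - `preds_id`: unique identifier of the prediction (based on the generated text).
            - `preds_raw_int`: ids of the generated tokens.
            - `preds_raw_str`: string form of the generated tokens.
            - `preds_raw_scores`: probability of each generated token, in the range $[0, 1]$.
            - `preds_raw_count`: number of generated tokens (excluding special tokens).
            - `truncated`: 1 if the generated sequence contains the EOS token, 0 otherwise.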
        """
        if batch_size is None:
            batch_size = len(data)
        else:
            batch_size = min(batch_size, len(data))
        
        n = len(data)
        logging.info(f"Processing {n} examples in total")
        for b_start in range(0, n, batch_size):
            metadata = { }
            # Batch indexing
            # ---------------------------------------------------------------
            b_end = b_start + batch_size
            b_end = min(b_end, n) 
            batch = data.select(range(b_start, b_end))
            logging.info(f"Processing examples {b_start}-{b_end}")
            
            metadata.update({id_col: batch[id_col] for id_col in id_cols})
            
            # Generate
            # ---------------------------------------------------------------
            results = model.generate(
                input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                # We're interested in returning information about the scores
                output_scores=True,
                return_dict_in_generate=True,
                # Force truncation (ensure the last token is always the EOS)
                forced_eos_token_id=tokenizer.eos_token_id,
                **self.generate_hyperparams,            
                **kwargs, # max_length
            )

            # Textual representation of the predicted sequence
            metadata["preds"] = tokenizer.batch_decode(results.sequences, **self.decoding_hyperparams)

            # Compute unique identifiers for each prediction
            # Ideally the identifier will depend on the model's,
            # the tokenizer's and the matrix's uuid but for now
            # we will simplify and only consider the generated text.
            #
            # Note: This assumes the name of the prediction file is being
            # handled by some component that has access to all this information
            # and is, therefore, able to avoid name clashes.
            uuid_metadata = { }
            uuid = lambda pred: generate_uuid(dict(text=pred, **uuid_metadata))
            metadata["preds_id"] = [uuid(pred) for pred in metadata["preds"]]
            
            # Individual tokens raw representation
            def skip_tokens(seq, token):
                predicate = filterfalse(lambda t: t == token, seq)
                return list(predicate)
            
            metadata["preds_raw_int"] = results.sequences.tolist()
            metadata["preds_raw_int"] = [skip_tokens(s, tokenizer.pad_token_id)
                                         for s in metadata["preds_raw_int"]]
            
            # Individual tokens raw textual representation
            #metadata["preds_raw_str"] = [[tokenizer.decode(t, skip_special_tokens=True) for t in seq]
            metadata["preds_raw_str"] = [tokenizer.convert_ids_to_tokens(seq, skip_special_tokens=True)
                                        for seq in metadata["preds_raw_int"]]

            # Individual tokens count (does not include special tokens like EOS or pad)
            metadata["preds_raw_count"] = list(map(len, metadata["preds_raw_str"]))

            # Whether the sentence was truncated or not (i.e., it has an EOS token)
            is_truncated = lambda s: int(any(s == tokenizer.eos_token_id))
            metadata["truncated"] = [is_truncated(seq) for seq in results.sequences] 

            # ---------------------------------------------------------------------------------
            # Compute score_proba
            # ---------------------------------------------------------------------------------
            # Pair the logits of each timestep `score_t` with the corresponding generated token.
            # *Note*: `results.scores` is a tuple of size T holding one B x V matrix per timestep
            # (the logits of every instance in the batch over the vocabulary), so we can
            # couple the logit at each timestep with the token generated at that timestep.
            scores, seq_tokens = results.scores, results.sequences[:,1:]
            # ^Note: The sequences are considering an initial pad token whose score is not outputted
            # by the greedy decoder.
            assert len(scores) == seq_tokens.shape[-1], "Dimension mismatch: Sequences vs scores"

            n_pred_timesteps = len(scores)
            scores_tokens = [(F.log_softmax(scores[t], dim=-1), seq_tokens[:,t]) for t in range(n_pred_timesteps)]
            #^Note: 
            # - `scores` is a |B| X |V| matrix with all the logits per batch per vocabulary
            # at prediction timestep t. Like Jiang et. al 
            # (https://github.com/jzbjyb/lm-calibration/blob/887e3e13df0462842ce288fffe588e549a3360ee/model/gpt2.py#L67)
            # we apply F.log_softmax to ensure the logprobabilities are comparable amongst
            # the different batches
            # - `seq_tokens[:, t]` is a |B| X 1 matrix with the predicted token types at
            # timestep t
            greedy_scores = [scores_t.gather(-1, token_t.unsqueeze(-1)) for scores_t, token_t in scores_tokens]
            greedy_scores = torch.cat(greedy_scores, dim=1)
            # Must mask the greedy scores corresponding to the padding
            pad_mask = (seq_tokens == tokenizer.pad_token_id)
            greedy_scores[pad_mask] = 0

            metadata["score_proba"] = torch.exp(torch.sum(greedy_scores, dim=1)).tolist()
            metadata["preds_raw_scores"] = torch.exp(greedy_scores).tolist()
            
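            # Drop the scores at padded positions. Filtering on the pad mask (rather than
            # on the masked value 1.0) keeps genuine probabilities that equal exactly 1.
            metadata["preds_raw_scores"] = [
                [p for p, is_pad in zip(row, mask) if not is_pad]
                for row, mask in zip(metadata["preds_raw_scores"], pad_mask.tolist())
            ]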
            yield pd.DataFrame(metadata)
            
            
# Sanity check (:
batches = iter(GreedyGenerator().generate(
    batch, id_cols=ID_COLS, model=model._model, tokenizer=model._tokenizer, max_length=14, batch_size=50))
b = next(batches)
b

Unnamed: 0,example_id,answer_id,preds,preds_id,preds_raw_int,preds_raw_str,preds_raw_count,truncated,score_proba,preds_raw_scores
0,2e6b688ad84546d681f1339df539a0b2,618baa6d3449a6d9171cd39bf204e8c9,doctor pascal rougon,abe46eacce04f58879f56150c248564d,"[2472, 330, 1489, 3, 3964, 5307, 1]","[▁doctor, ▁pas, cal, ▁, rou, gon]",6,1,0.444278,"[0.45619308948516846, 0.9993246793746948, 0.99..."
1,2aea58b6733f531e127a96752d9db1a4,52e62d271a8dbbc887e25f68681794b4,converts,88a6dc645c7d491d08aece45c8eaca71,"[5755, 7, 1]","[▁convert, s]",2,1,0.68586,"[0.8886101841926575, 0.9971410036087036, 0.774..."
2,a9bf0ee9eac598663c3f2c9a9908b1e6,2093f3df3379e16364141294a8c9a062,"into a ""swimming tank""",4e6598f9b608b3f532f4f1e537e461a8,"[139, 3, 9, 96, 7, 210, 23, 635, 53, 5040, 121...","[▁into, ▁, a, ▁"", s, w, i, mm, ing, ▁tank, ""]",11,1,0.549538,"[0.7651849389076233, 0.9803006052970886, 0.999..."
3,f485d3509a0606a7b570cc5f2edbd083,e2fd235719de9140057fdb4e61e930e9,"a competition of ""court compliment""",36bbc8a0cef5d357cacd366611cc23cf,"[3, 9, 2259, 13, 96, 14492, 12064, 121, 1]","[▁, a, ▁competition, ▁of, ▁"", court, ▁complime...",8,1,0.397013,"[0.45407775044441223, 0.9586190581321716, 0.97..."
4,921d4e8c8c14e3b385bb80546b792154,8266b87c9f4982b8d7ed33efba54537b,"a gypsy boy, pablo",6a2bcdc69999a8d9394ff25f86d6580e,"[3, 9, 3, 122, 63, 19819, 4940, 6, 2576, 4672, 1]","[▁, a, ▁, g, y, psy, ▁boy, ,, ▁pa, blo]",10,1,0.159552,"[0.6082004308700562, 0.5329281687736511, 0.945..."
5,bdd61f9bb3f8781f37190f4c671320a3,5e82884a83521fa2ebd1b6cc95a4b14a,wounding herself mortally with a rifle,8c11d545467e2d0b65e00c013ca80cb9,"[9699, 53, 6257, 24301, 120, 28, 3, 9, 18371, 1]","[▁wound, ing, ▁herself, ▁mortal, ly, ▁with, ▁,...",9,1,0.128462,"[0.1338125765323639, 0.986093282699585, 0.9979..."
6,5c7966bb9f9aea98baf2b9523776f82b,9b983e2910f05fe7597787f8e0faf4f2,suicide,24af5c17e5ba8954f17a98806458889a,"[12259, 1]",[▁suicide],1,1,0.231917,"[0.2435421347618103, 0.9522663950920105]"
7,ad4563df309a58f9affa2cd854a0ce0e,2e85cb5ca4286f3cbfd589cf587cf9bd,he takes his orders and becomes the parish pri...,aa8433cd5f28980726dab95cccd6f010,"[3, 88, 1217, 112, 5022, 11, 2992, 8, 14961, 1...","[▁, he, ▁takes, ▁his, ▁orders, ▁and, ▁becomes,...",12,1,0.031928,"[0.1293245106935501, 0.5078274011611938, 0.674..."
8,b9756df13b1b7963dc0b507dfa5297a2,772a657a23d1613e49036fa98549a39f,gets himself expelled from cambridge after att...,7e1c3d68eb5c86149f40adcbac0f70da,"[2347, 2448, 1215, 14528, 45, 5511, 9818, 227,...","[▁gets, ▁himself, ▁ex, pelled, ▁from, ▁cam, br...",12,1,0.580548,"[0.671929121017456, 0.996823787689209, 0.99861..."
9,a0491b138fb1f9b4dbc1c4eee651c922,57bdce5ad3ce730e7d074358eb971e1a,attending the derby without permission,c24a004b9c7ed44b15bf356cd9252ea3,"[7078, 8, 74, 969, 406, 6059, 1]","[▁attending, ▁the, ▁der, by, ▁without, ▁permis...",6,1,0.144703,"[0.19296632707118988, 0.9960280656814575, 0.99..."


In [12]:
from typing import Callable

import importlib


def import_method(fullpath: str) -> Callable:
    """Import a specific method given the corresponding path.
    
    Parameters
    ----------
    fullpath: str
        Path of method to import.
    
    Returns
    -------
    method: callable
        The method object.
    """
    if fullpath is None:
        raise ValueError(f"Cannot import method {fullpath}")
    
    paths = fullpath.rsplit('.', 1)
    
    if len(paths) == 1:
        module, method = "__main__", paths[0]
    else:
        module, method = paths
    
    try:
        module = importlib.import_module(module)
    except ImportError:
        # `module` may itself be an attribute path (e.g., "pandas.DataFrame")
        module = import_method(module)
    return getattr(module, method)


def method_name(method: Callable) -> str:
    module = method.__module__
    qualname = method.__qualname__
    
    return f"{module}.{qualname}"
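
To illustrate how a dotted path such as `"pandas.DataFrame.to_csv"` gets resolved, here is a small self-contained sketch. It is an iterative alternative to the recursive `import_method` above, not the notebook's implementation: it imports the longest importable module prefix and then walks the remaining attributes. The `resolve` helper is hypothetical, and the examples use only standard-library paths.

```python
import importlib
import os


def resolve(dotted_path):
    """Resolve 'pkg.module.attr' or 'pkg.module.Class.method' to an object."""
    parts = dotted_path.split(".")
    # Try the longest module prefix first, then progressively shorter ones
    for i in range(len(parts), 0, -1):
        try:
            obj = importlib.import_module(".".join(parts[:i]))
        except ImportError:
            continue
        # Walk the remaining names as attributes of the imported module
        for name in parts[i:]:
            obj = getattr(obj, name)
        return obj
    raise ImportError(f"Cannot resolve {dotted_path!r}")


join = resolve("os.path.join")
fromkeys = resolve("collections.OrderedDict.fromkeys")
print(join("a", "b"), list(fromkeys("ab")))
```

Storing the writer as a dotted path keeps the output configuration serializable, since a string can be dumped to YAML while a function object cannot.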


In [13]:
from dataclasses import dataclass, field
from typing import Union, Iterable

import os
import yaml


@dataclass
class OutputResult:
    filename: str
        
    output_fn_classpath: str = field(default="pandas.DataFrame.to_csv")
    output_fn_kwargs: dict = field(default_factory=dict)
    output_dir: str = field(default="./outputs")
    out_extension: str = None
    
    def __post_init__(self):
        # Default writer arguments; user-provided values take precedence
        self.output_fn_kwargs = {
            "compression": "gzip",
            "index": False,
            "header": True,
            "encoding": "utf-8",
            **self.output_fn_kwargs,
        }
        os.makedirs(self.output_dir, exist_ok=True)
        
        # Dynamically load function
        if isinstance(self.output_fn_classpath, str):
            self.output_fn = import_method(self.output_fn_classpath)
    
        elif callable(self.output_fn_classpath):
            self.output_fn = self.output_fn_classpath
            self.output_fn_classpath = method_name(self.output_fn_classpath)

    @property
    def filepath(self):
        filepath = f"{self.output_dir}/{self.filename}"
        if self.out_extension:
            filepath = f"{filepath}.{self.out_extension}"
        return filepath
    
    @property
    def configpath(self):
        return f"{self.filepath}.out_config"
        
    def dump_configs(self):
        with open(self.configpath , "w") as f:
            yaml.dump(self.output_fn_kwargs, f)
            
    def write(self, batches: Iterable, exists_new: bool=True):
        batches = iter(batches)
        
        out_kwargs = self.output_fn_kwargs.copy()
        
        if exists_new is True:
            first_batch = next(batches)
            out_kwargs["mode"] = "w"
            logging.info(f"Creating file {self.filepath} w/ {self.output_fn_classpath} and arguments: {out_kwargs}")
            self.output_fn(first_batch, self.filepath, **out_kwargs)
            out_kwargs["header"] = False

        out_kwargs["mode"] = "a"
        for batch in batches:
            self.output_fn(batch, self.filepath, **out_kwargs)
        
        self.dump_configs()
        

# Sanity check (:
OutputResult("test.txt").write(batches)
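
The write pattern used by `OutputResult.write` (first batch in write mode with a header, remaining batches appended without one) can be sketched with the standard library alone. The example below uses `gzip` and `csv` instead of `pandas.DataFrame.to_csv`; `write_batches`, the toy batches and the temporary path are hypothetical.

```python
import csv
import gzip
import os
import tempfile


def write_batches(path, batches, fieldnames):
    """Write the first batch with a header, then append the rest without one."""
    for i, batch in enumerate(batches):
        mode = "wt" if i == 0 else "at"
        with gzip.open(path, mode, newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            if i == 0:
                writer.writeheader()
            writer.writerows(batch)


# Toy batches standing in for the DataFrames yielded by the generator
batches = [
    [{"example_id": "e1", "preds": "first answer"}],
    [{"example_id": "e2", "preds": "second answer"},
     {"example_id": "e3", "preds": "third answer"}],
]
path = os.path.join(tempfile.mkdtemp(), "preds.csv.gz")
write_batches(path, batches, ["example_id", "preds"])

# Appended gzip members are read back as a single stream
with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
print(len(rows), "rows read back")
```

Writing batch by batch keeps memory usage bounded by the batch size, at the cost of producing a multi-member gzip file.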

In [15]:
pd.read_csv("./outputs/test.txt", compression="gzip", encoding="utf-8")

Unnamed: 0,example_id,answer_id,preds,preds_id,preds_raw_int,preds_raw_str,preds_raw_count,truncated,score_proba,preds_raw_scores
0,e4470b4cf3b0e1ac33a7df420f3330c5,92efeb81c61ddc7329ff790463e7b571,pre-hyborian empire of acher,2b6e48fa493821a89e8f8890c31ad02e,"[554, 18, 107, 63, 115, 32, 5288, 21039, 13, 3...","['▁pre', '-', 'h', 'y', 'b', 'o', 'rian', '▁em...",12,1,0.539075,"[0.587063729763031, 0.9998001456260681, 0.9990..."
1,3ec8bc0b55fd924732cb1ee6a2be87c7,9d686cf84b96ded14795fe2c10fd7a5b,frank strawn-hamilton,a0bc0382c1322576cb76af418746b042,"[3, 89, 6254, 21920, 29, 18, 1483, 23, 7377, 1]","['▁', 'f', 'rank', '▁straw', 'n', '-', 'ham', ...",9,1,0.385791,"[0.550155520439148, 0.706671953201294, 0.99998..."
2,b85207cc18922156da06749514ae7ae4,4ef87189b2550c086d08819c7b85ebdd,abbot of his monastery,592a416c6bbfc98f207f566a66d00a04,"[703, 4045, 13, 112, 29592, 1]","['▁ab', 'bot', '▁of', '▁his', '▁monastery']",5,1,0.612940,"[0.9626514315605164, 0.9999915361404419, 0.647..."
3,2260ae74efff1f474eeb0ea2ab38f89e,dffb85d46b8a8030635734a62624eb5a,make zenobia his queen,5eb23cf32e10d12e1098d6cd95a98d51,"[143, 3, 1847, 6690, 9, 112, 14915, 1]","['▁make', '▁', 'zen', 'obi', 'a', '▁his', '▁qu...",7,1,0.179402,"[0.18902887403964996, 0.9879463315010071, 0.99..."
4,fcc79b6e8bb0c6221e81cc489a5eaaf8,70d027b8da1225649c15c951b7f2929d,serge is portrayed giving several wildly enthu...,20f038e8e8c3b331c02451d9e0375d80,"[7637, 397, 19, 3, 27486, 1517, 633, 3, 28890,...","['▁ser', 'ge', '▁is', '▁', 'portrayed', '▁givi...",12,1,0.035970,"[0.1573231816291809, 0.9991739392280579, 0.262..."
...,...,...,...,...,...,...,...,...,...,...
222,cb25a066e09c116f9519a5a77e9a0b5f,b0b95faf20f3d7991d8bb09a1e08a4ae,patience,3b5bfb437430411c9990d029972d497b,"[11998, 1]",['▁patience'],1,1,0.784210,"[0.8764203190803528, 0.8947872519493103]"
223,92cf35db62a8b8b885ae6bd88b74796e,83eeb53928af196b2521ec80c13d3594,alexandria,ddc42a3937dc82ea8f2a094e82c6b1eb,"[1240, 226, 232, 52, 23, 9, 1]","['▁ale', 'x', 'and', 'r', 'i', 'a']",6,1,0.911054,"[0.9166275858879089, 0.999991774559021, 0.9999..."
224,3215cbad5038466d27701023ed9b1425,b07f2a24dd9d08626ce39fbc31afa97c,lady mabel grex,c95fd0ced49c47fe215d16c97d51d870,"[9360, 954, 2370, 3542, 994, 1]","['▁lady', '▁ma', 'bel', '▁gr', 'ex']",5,1,0.593906,"[0.8748897314071655, 0.6942394375801086, 0.999..."
225,69fb15c137edf9c225875ffa7112906a,e23c2e8728e615c361a843ec13823480,frank tregear,4e5be9eac0b8795072ddc8f213d9a44e,"[3, 89, 6254, 3, 929, 397, 291, 1]","['▁', 'f', 'rank', '▁', 'tre', 'ge', 'ar']",7,1,0.137433,"[0.20879089832305908, 0.6718905568122864, 0.99..."
