## Paper 2 Data Workflow for Data Extraction - CUADv1 - Final Model

#### Sources of information, code and discussions


1. The foundation workflow is from Hugging Face's Token Classification example hosted on Colab [here][1]
2. The models are base models, each fine-tuned on a downstream token classification task, example [here][2]

[1]: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb
[2]: https://huggingface.co/roberta-base

# Initialize Environment

In [1]:
# Use if running in Kaggle environment or if the libraries need updating

#!pip install -U torch
#!pip install -U transformers
#!pip install -U wandb

In [2]:
import os, re, math, random, json, string

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML
import wandb

import transformers
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer
from transformers import TrainerCallback, AdamW, get_cosine_schedule_with_warmup
from transformers import DataCollatorForTokenClassification, PreTrainedModel, RobertaTokenizerFast

from datasets import load_dataset, ClassLabel, Sequence, load_metric

In [3]:
# Need to log in to weights and biases in the command line using: wandb login
wandb.login()

[34m[1mwandb[0m: Currently logged in as: [33mnative[0m (use `wandb login --relogin` to force relogin)


True

## Configuration

* BATCH_SIZES - These are batch sizes for each fold. For maximum speed, it is best to use the largest batch size your GPU or TPU memory allows.
* EPOCHS - The maximum number of epochs. Note that for each fold the model from the best epoch is saved and used, so an overly large value won't matter. Consider early stopping in Callbacks.
* MODEL_CHECKPOINT - The name of the transformer model.

In [4]:
# Hugging Face model references for Transformer library
models = dict(
    ROBERTA = "roberta-base",
    DISTILBERT_U = "distilbert-base-uncased",
    DISTILBERT_C = "distilbert-base-cased",
    DEBERTA_V2_XL = "microsoft/deberta-v2-xlarge")

In [5]:
# Logging date for w&b
from datetime import date
today = date.today()
log_date = today.strftime("%d-%m-%Y")

In [6]:
# RANDOM SEED FOR REPRODUCIBILITY
RANDOM_SEED = 42

# BATCH SIZE
# TRY 4, 8, 16, 32, 64, 128, 256. REDUCE IF OOM ERROR, HIGHER FOR TPUS
BATCH_SIZES = 8

# EPOCHS - TRANSFORMERS ARE TYPICALLY FINE-TUNED BETWEEN 1 AND 3 EPOCHS 
EPOCHS = 8

# WHICH PRE-TRAINED TRANSFORMER TO FINE-TUNE?
MODEL_CHECKPOINT = models['ROBERTA']

# SPECIFY THE WEIGHTS AND BIASES PROJECT NAME
%env WANDB_PROJECT = 'P2D-NER-2021' 

# DETERMINE WHETHER TO SAVE THE MODEL IN THE 100GB OF FREE W&B STORAGE
%env WANDB_LOG_MODEL = false 

env: WANDB_PROJECT='P2D-NER-2021'
env: WANDB_LOG_MODEL=false


### Step1: File and dataset handling
Data cleaning, annotation and formatting have already been done: the text was tokenized into separate words, tagged using the IOB format, and serialized to a JSON file with the Pandas df.to_json() function using the orient="table" parameter.

Here we load in the dataset with this JSON format.
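As a rough illustration of why the loader below is pointed at field='data': a file written with orient="table" is a single JSON object holding a "schema" entry and a "data" entry, and the rows live under "data". The snippet is a minimal sketch that parses a hand-written, abbreviated stand-in for such a file; the two rows and their values are made up for illustration.

```python
import json

# Hand-written, abbreviated stand-in for a file written with orient="table".
# The rows are invented for illustration; only the overall layout matters.
raw = """
{
  "schema": {
    "fields": [
      {"name": "id", "type": "integer"},
      {"name": "ner_tags", "type": "string"},
      {"name": "split_tokens", "type": "string"}
    ]
  },
  "data": [
    {"id": 1, "ner_tags": [1, 4, 6], "split_tokens": ["SUPPLY", "AGREEMENT", "This"]},
    {"id": 2, "ner_tags": [6, 2, 5], "split_tokens": ["between", "Acme", "Inc."]}
  ]
}
"""

document = json.loads(raw)
records = document["data"]  # the part of the file that field='data' selects

for record in records:
    # Every word must carry exactly one tag
    assert len(record["ner_tags"]) == len(record["split_tokens"])
    print(record["id"], list(zip(record["split_tokens"], record["ner_tags"])))
```

Each record pairs a list of words with a list of integer tags of the same length, which is the shape the rest of the workflow relies on.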

In [7]:
FEATURE_CLASS_LABELS = "feature_class_labels.json"
DATA_FILE = 'cuad-v1-annotated.json'
TEMP_MODEL_OUTPUT_DIR = 'temp_model_output_dir'
SAVED_MODEL = f"p2d-NER-Fine-Tune-Transformer-Final-{MODEL_CHECKPOINT}" # Change for notebook version

In [8]:
data_files = DATA_FILE
datasets = load_dataset('json', data_files=data_files, field='data')
print(datasets)

Using custom data configuration default-075146409b74ff61
Reusing dataset json (/home/phil/.cache/huggingface/datasets/json/default-075146409b74ff61/0.0.0/83d5b3a2f62630efc6b5315f00f20209b4ad91a00ac586597caee3a4da0bef02)


DatasetDict({
    train: Dataset({
        features: ['id', 'ner_tags', 'split_tokens'],
        num_rows: 314
    })
})


In [9]:
# # Create train and validation datasets
# datasets = datasets['train'].train_test_split(test_size=1-TRAIN_SPLIT, seed=RANDOM_SEED)
# print(datasets)

In [10]:
# Check the ner_tags to ensure that these are integers
datasets["train"].features["ner_tags"]

Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)

In [11]:
# Open the label list created in pre-processing corresponding to the ner_tag indices
with open(FEATURE_CLASS_LABELS, 'r') as f:
    label_list = json.load(f)

for n in range(len(label_list)):
    print(n, label_list[n])

0 B-AGMT_DATE
1 B-DOC_NAME
2 B-PARTY
3 I-AGMT_DATE
4 I-DOC_NAME
5 I-PARTY
6 O
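
To make the IOB scheme behind these labels concrete, the sketch below groups tagged tokens back into entity spans. The helper `iob_to_spans` is hypothetical (it is not part of the notebook), and the tokens and tag ids are the first few items of one of the dataset samples.

```python
label_list = ["B-AGMT_DATE", "B-DOC_NAME", "B-PARTY",
              "I-AGMT_DATE", "I-DOC_NAME", "I-PARTY", "O"]

def iob_to_spans(tokens, tag_ids, labels):
    """Group IOB-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag_id in zip(tokens, tag_ids):
        tag = labels[tag_id]
        if tag.startswith("B-"):
            # A B- tag always opens a new span, closing any open one first
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open span
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

tokens = ["EXHIBIT", "10.24", "ENDORSEMENT", "AGREEMENT", "This", "Endorsement", "Agreement", "("]
tag_ids = [6, 6, 1, 4, 6, 1, 4, 6]
spans = iob_to_spans(tokens, tag_ids, label_list)
print(spans)
```

Here the two document-name mentions come out as separate DOC_NAME spans, because each starts with its own B- tag.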


In [12]:
# Check some random samples to ensure data loaded as expected:
def show_random_elements(dataset, num_examples=1):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))

show_random_elements(datasets["train"], num_examples=3)

Unnamed: 0,id,ner_tags,split_tokens
0,15564,"[6, 6, 1, 4, 6, 1, 4, 6, 6, 6, 6, 6, 6, 6, 6, 0, 3, 3, 3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 2, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...]","[EXHIBIT, 10.24, ENDORSEMENT, AGREEMENT, This, Endorsement, Agreement, (, "", Agreement, "", ), is, made, this, 14th, day, of, March, ,, 2016, (, "", Effective, Date, "", ), ,, by, and, between, Lifeway, Foods, ,, Inc., (, "", Lifeway, "", ), with, a, principal, business, address, of, 6431, West, Oakton, Street, ,, Morton, Grove, ,, IL, 60053, and, Ludmila, Smolyansky(""Individual, "", ), on, her, own, behalf, with, an, address, of, 182, N., Harbor, Drive, ,, Chicago, ,, IL, 60602, ., Lifeway, and, Individual, are, collectively, referred, to, as, the, "", parties, ,, "", or, individually, as, a, "", party, ., "", ...]"
1,15785,"[6, 6, 1, 4, 4, 6, 1, 4, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 0, 3, 3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 2, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 2, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 2, 5, 5, 5, 6, ...]","[Exhibit, 10.4, INTELLECTUAL, PROPERTY, AGREEMENT, This, INTELLECTUAL, PROPERTY, AGREEMENT, (, this, “, Agreement, ”, or, “, IPA, ”, ), ,, effective, as, of, this, 30, day, of, June, 2016, (, the, “, Effective, Date, ”, ), among, THE, HERTZ, CORPORATION, ,, a, Delaware, corporation, ,, with, an, address, of, 8501, Williams, Road, ,, Estero, ,, Florida, 33928, (, hereinafter, “, THC”);, HERTZ, SYSTEM, ,, INC, ., ,, a, Delaware, corporation, ,, with, an, address, of, 8501, Williams, Road, ,, Estero, ,, Florida, 33928, ,, United, States, of, America, (, hereinafter, “, HSI, ”, ), and, HERC, RENTALS, INC, ., ,, ...]"
2,15831,"[1, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 0, 3, 3, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, 2, 5, 5, 5, 6, 6, 6, 6, 6, 6, 2, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 2, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 2, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...]","[Consulting, and, Product, Development, Agreement, ARTICLE, 1, PREAMBLE, This, Consulting, and, Licensing, Agreement, (, "", Agreement, "", ), is, entered, into, this, 1st, day, of, September, 2016, (, “, Effective, Date, ”, ), by, and, between, Emerald, Health, Sciences, Inc., (, “, EHS, ”, ), ,, Emerald, Health, Nutraceuticals, Inc., (, “, EHN, ”, ), ,, and, Michael, T., Murray, ,, N.D., (, “, Dr., Murray, ”, ), ., This, Agreement, sets, forth, a, description, of, those, responsibilities, of, EHS, ,, EHN, ,, and, Dr., Murray, ,, of, certain, rights, granted, to, EHS, and, EHN, ,, and, of, certain, other, ...]"


### Step 2: Preprocessing Data - Tokenization
Before we can feed those texts to our model, we need to preprocess them specifically for the pre-trained model that we are using. Even though we have already tokenized and split our words into a list, each model has its own further tokenization step to match its specific vocabulary.

This is done by a 🤗 Transformers Tokenizer which will (as the name indicates) tokenize the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary) and put it in a format the model expects, as well as generate the other inputs that model requires.

To do all of this, we instantiate our tokenizer with the AutoTokenizer.from_pretrained method, which will ensure:
 - we get a tokenizer that corresponds to the model architecture we want to use,
 - we download the vocabulary used when pretraining this specific checkpoint.
 
The exception here is RoBERTa base, where we instantiate RobertaTokenizerFast directly with add_prefix_space=True, which is needed to use this tokenizer with pre-tokenized inputs.

The vocabulary will be cached, so it's not downloaded again the next time we run the cell.

If, as is the case here, inputs have already been split into words, we pass the list of words to the tokenizer with the argument "is_split_into_words=True".

Some tokens will be split into subtokens if the token is not in the model dictionary. This means that we need to do some processing on our labels, as the input ids returned by the tokenizer are longer than the lists of labels our dataset contains: first because some special tokens might be added (e.g. a [CLS] and a [SEP]), and then because of those possible splits of words into multiple tokens.

Some tokenizers return outputs that have a word_ids method, which can help us. Otherwise we have to build our own, as is the case for DeBERTa, for example.

We will return a list with the same number of elements as our processed input ids, mapping special tokens to None and all other tokens to their respective word. This way, we can align the labels with the processed input ids.
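
As a minimal sketch of building such a list by hand, the snippet below derives word ids from SentencePiece-style tokens. It assumes that every new word starts with the "▁" marker and that the special tokens are "[CLS]", "[SEP]" and "[PAD]". The helper name `word_ids_from_tokens` and the example tokens are hypothetical, and unlike the notebook's own helper it uses 0-based ids and None for special tokens.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}  # assumed special tokens

def word_ids_from_tokens(tokens, word_start_marker="▁"):
    """Map each token to the index of the word it came from (None for special tokens)."""
    word_ids = []
    current = -1
    for token in tokens:
        if token in SPECIAL_TOKENS:
            word_ids.append(None)
            continue
        # A marker (or the very first real token) opens a new word
        if token.startswith(word_start_marker) or current < 0:
            current += 1
        word_ids.append(current)
    return word_ids

# Made-up tokenization: the third word is split into three sub-tokens
tokens = ["[CLS]", "▁Hertz", "▁System", "▁Nutra", "ceutical", "s", "[SEP]"]
word_ids = word_ids_from_tokens(tokens)
print(word_ids)
```

The result has one entry per token, so it can be zipped directly against the processed input ids.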

In both strategies the labels of all special tokens are set to -100 (the index that is ignored by PyTorch loss functions). One strategy then gives every other token the label of the word it comes from (label_all_tokens=True).

The other strategy is to set the label only on the first token obtained from a given word, and give a label of -100 to the other subtokens from the same word (label_all_tokens=False, the default in the functions below). Switch between the two by changing the flag "label_all_tokens = True/False".
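
The two labelling strategies can be compared on a toy example. This is a simplified sketch of the alignment logic with a hypothetical helper `align_labels`; the word ids and word-level labels are made up, with None standing for special tokens.

```python
IGNORE_INDEX = -100  # label value skipped by the loss function

def align_labels(word_ids, word_labels, label_all_tokens=False):
    """Expand word-level labels to token level using the token-to-word mapping."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special token
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First sub-token of a word always gets the word's label
            aligned.append(word_labels[word_id])
        else:
            # Later sub-tokens: copy the label or ignore, depending on the flag
            aligned.append(word_labels[word_id] if label_all_tokens else IGNORE_INDEX)
        previous = word_id
    return aligned

word_ids = [None, 0, 1, 1, 1, 2, None]   # word 1 was split into three sub-tokens
word_labels = [6, 2, 5]                  # O, B-PARTY, I-PARTY

first_only = align_labels(word_ids, word_labels)
all_tokens = align_labels(word_ids, word_labels, label_all_tokens=True)
print(first_only)
print(all_tokens)
```

Note that with label_all_tokens=True a B- label is copied onto every sub-token of the word, so the "begin" tag is repeated within a single word.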

In [13]:
# Instantiate the tokenizer
#For RoBERTa-base, need to use RobertaTokenizerFast with add_prefix_space=True to use it with pretokenized inputs.

if MODEL_CHECKPOINT == models['ROBERTA']:
    tokenizer = RobertaTokenizerFast.from_pretrained(models["ROBERTA"], add_prefix_space=True)
else:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)
        

In [14]:
def word_id_func(input_ids, print_labs=False):
    """Build 1-based word ids from SentencePiece tokens; special tokens map to -100."""
    tokens = tokenizer.convert_ids_to_tokens(input_ids)

    word_ids = []
    i = 0
    spec_toks = ['[CLS]', '[SEP]', '[PAD]']
    for t in tokens:
        if t in spec_toks:
            word_ids.append(-100)
        else:
            # A leading '▁' marks the first sub-token of a new word
            if t.startswith('▁'):
                i += 1
            word_ids.append(i)
        if print_labs:
            print(t, i)
    if print_labs:
        print("Total:", i)
    return word_ids

def tokenize_and_align_labels(examples, label_all_tokens=False):
    tokenized_inputs = tokenizer(examples["split_tokens"],
                                 truncation=True,
                                 is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # For the other tokens in a word, we set the label to either the current label or -100, depending on
            # the label_all_tokens flag.
            else:
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

def tokenize_and_align_labels_deberta(examples, label_all_tokens=False):
    tokenized_inputs = tokenizer(examples["split_tokens"],
                                 truncation=True,
                                 is_split_into_words=True)
    labels = []
    word_ids_list = []
    for input_ids in tokenized_inputs["input_ids"]:
        wids = word_id_func(input_ids, print_labs=False)
        word_ids_list.append(wids)
    
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = word_ids_list[i]
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id of -100 (set by word_id_func). We set the label to -100 so they
            # are automatically ignored in the loss function.
            if word_idx == -100:
                label_ids.append(-100)
            # We set the label for the first token of each word (word ids from word_id_func are 1-based).
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx-1])
            # For the other tokens in a word, we set the label to either the current label or -100, depending on
            # the label_all_tokens flag.
            else:
                label_ids.append(label[word_idx-1] if label_all_tokens else -100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

In [15]:
# To apply this function on all the words and labels in our dataset,
# we just use the map method of our dataset object we created earlier.
# This will apply the function on all the elements of all the splits in dataset, so our training, 
# validation and testing data will be preprocessed in one single command.

# 🤗 Datasets warns you when it uses cached files, you can pass load_from_cache_file=False in the
# call to map to not use the cached files and force the preprocessing to be applied again.
if MODEL_CHECKPOINT == models['DEBERTA_V2_XL']:
    tokenize_and_align_labels = tokenize_and_align_labels_deberta

tokenized_datasets = datasets.map(tokenize_and_align_labels, batched=True, load_from_cache_file=True)





### Step 3: Build Model
Since all our tasks are about token classification, we use the AutoModelForTokenClassification class. Like with the tokenizer, the from_pretrained method will download and cache the model for us. The only thing we have to specify is the number of labels for our problem (here the length of the label list loaded earlier).

The warning is telling us we are throwing away some weights (the lm_head layers) and randomly initializing some others (the classifier layer). This is absolutely normal in this case, because we are removing the head used to pretrain the model on a masked language modeling objective and replacing it with a new head for which we don't have pretrained weights, so the library warns us we should fine-tune this model before using it for inference, which is exactly what we are going to do.

In [16]:
model = AutoModelForTokenClassification.from_pretrained(MODEL_CHECKPOINT, num_labels=len(label_list))

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForTokenClassification: ['lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.dense.bias']
- This IS expected if you are initializing RobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able

#### Training Schedule
This is a common training schedule for transfer learning. The learning rate starts at zero, to initially preserve the pre-trained weights, increases linearly to a maximum during warmup, and then decays along a cosine curve to help the optimizer settle into a good minimum.

Changing the schedule and/or learning rates is a popular way to experiment to find good model performance. Note how the maximum learning rate scales with the batch size (lr_max = learning_rate * BATCH_SIZES). This is a good practice to follow.

Weight decay is the amount of L2 regularization applied through the model's optimizer to offset any tendency for the model to overfit.
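
To see the shape of this schedule in numbers, here is a minimal pure-Python sketch of the multiplier that a cosine schedule with linear warmup applies to the maximum learning rate. The function name `cosine_with_warmup_factor` is hypothetical; the formula is written to mirror the behaviour of get_cosine_schedule_with_warmup as I understand it, and the numeric settings are the ones used in the optimizer cell of this notebook.

```python
import math

def cosine_with_warmup_factor(step, num_warmup_steps, num_training_steps, num_cycles=0.5):
    """Multiplier applied to the peak learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Fraction of the post-warmup phase completed so far
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * num_cycles * 2.0 * progress)))

# Settings from this notebook: 314 samples, 8 epochs, batch size 8
lr_max = 0.0000075 * 8
total_steps = 314 * 8 / 8
warmup_steps = total_steps * 0.2

for step in (0, 40, 80, 120, 160):
    factor = cosine_with_warmup_factor(step, warmup_steps, total_steps, num_cycles=0.8)
    print(step, f"{lr_max * factor:.2e}")
```

With the default num_cycles=0.5 the multiplier falls from 1 to 0 over the post-warmup steps; other values change how much of the cosine wave is traversed.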

In [17]:
#Optimizer
learning_rate = 0.0000075
lr_max = learning_rate * BATCH_SIZES
weight_decay = 0.05

optimizer = AdamW(
    model.parameters(),
    lr=lr_max,
    weight_decay=weight_decay)

print("The maximum learning rate is: ",lr_max)

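# Learning Rate Schedule
num_train_samples = len(datasets["train"])
warmup_ratio = 0.2 # Fraction of total steps used to go from zero to max learning rate
num_cycles=0.8 # Number of cosine waves in the schedule (0.5 decays from max to zero)

# Note: with the values in this notebook this estimate is 314 steps, whereas the Trainer
# runs ceil(314/8) = 40 steps per epoch (160 in total), so training stops part-way through the schedule
num_training_steps = num_train_samples*EPOCHS/BATCH_SIZES
num_warmup_steps = num_training_steps*warmup_ratio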

lr_sched = get_cosine_schedule_with_warmup(optimizer=optimizer,
                                           num_warmup_steps=num_warmup_steps,
                                           num_training_steps = num_training_steps,
                                           num_cycles=num_cycles)

The maximum learning rate is:  6e-05



To instantiate a Trainer, we will need to define three more things. The most important is the TrainingArguments, which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional.

In [18]:
args = TrainingArguments(output_dir = TEMP_MODEL_OUTPUT_DIR,
                         learning_rate=lr_max,
                         per_device_train_batch_size=BATCH_SIZES,
                         num_train_epochs=EPOCHS,
                         weight_decay=weight_decay,
                         lr_scheduler_type = 'cosine',
                         warmup_ratio=warmup_ratio,
                         logging_strategy="epoch",
                         save_strategy="epoch",
                         seed=RANDOM_SEED,
                         report_to = 'wandb', # enable logging to W&B
                         run_name = MODEL_CHECKPOINT+"-"+log_date
                        )

Then we will need a data collator that will batch our processed examples together while applying padding to make them all the same size (each batch will be padded to the length of its longest example). There is a data collator for this task in the Transformers library that pads not only the inputs, but also the labels.
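
What this padding amounts to can be sketched in plain Python. The helper `pad_batch` below is hypothetical and only imitates the idea: it assumes right-side padding, a pad token id of 1, and -100 as the padding value for labels so that padded positions are ignored by the loss. The token ids are made up.

```python
def pad_batch(features, pad_token_id, label_pad_id=-100):
    """Pad every example in a batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        # Padded label positions use the ignore index
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch

features = [
    {"input_ids": [0, 11, 12, 13, 2], "labels": [-100, 6, 2, 5, -100]},
    {"input_ids": [0, 21, 2], "labels": [-100, 6, -100]},
]
batch = pad_batch(features, pad_token_id=1)
print(batch["input_ids"])
print(batch["labels"])
```

Because padding is applied per batch, short batches stay short instead of being padded to the longest example in the whole dataset.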

In [19]:
data_collator = DataCollatorForTokenClassification(tokenizer)

### Step 4: Training

The last thing to define for our Trainer is how to compute the metrics from the predictions. Here we use the seqeval metrics (which are commonly used to evaluate results on the CoNLL benchmark dataset): https://github.com/chakki-works/seqeval

Note - Either BILOU or IOB tags can be used. Whilst BILOU provides for more features, research suggests using the simpler IOB for token classification shouldn't impact accuracy. 

So we will need to do a bit of post-processing on our predictions:
 - select the predicted index (with the maximum logit) for each token
 - convert it to its string label
 - ignore everywhere we set a label of -100

The following function does all this post-processing on the result of Trainer.evaluate (which is a namedtuple containing predictions and labels) before applying the metric:

In [20]:
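# Metric functions from the seqeval package referenced above (assumed source; they are not imported elsewhere)
from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score

def compute_metrics(p):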
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)]

    # Define the metric parameters
    overall_precision = precision_score(true_labels, true_predictions, zero_division=1)
    overall_recall = recall_score(true_labels, true_predictions, zero_division=1)
    overall_f1 = f1_score(true_labels, true_predictions, zero_division=1)
    overall_accuracy = accuracy_score(true_labels, true_predictions)
    
    # Return a dictionary with the calculated metrics
    return {
        "precision": overall_precision,
        "recall": overall_recall,
        "f1": overall_f1,
        "accuracy": overall_accuracy,}
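
The post-processing steps can be traced on a toy batch without NumPy. This is a simplified sketch with a hypothetical helper `decode_predictions`; the logits are made up so that one token is predicted correctly and one is not.

```python
label_list = ["B-AGMT_DATE", "B-DOC_NAME", "B-PARTY",
              "I-AGMT_DATE", "I-DOC_NAME", "I-PARTY", "O"]

def decode_predictions(logits, labels, label_names, ignore_index=-100):
    """Take the argmax per token, drop ignored positions, and map ids to label strings."""
    true_predictions, true_labels = [], []
    for seq_logits, seq_labels in zip(logits, labels):
        pred_row, label_row = [], []
        for token_logits, label_id in zip(seq_logits, seq_labels):
            if label_id == ignore_index:
                continue  # special token or ignored sub-token
            pred_id = max(range(len(token_logits)), key=token_logits.__getitem__)
            pred_row.append(label_names[pred_id])
            label_row.append(label_names[label_id])
        true_predictions.append(pred_row)
        true_labels.append(label_row)
    return true_predictions, true_labels

def peak(index, size=7):
    """Made-up logits with the largest value at the given index."""
    return [1.0 if j == index else 0.0 for j in range(size)]

logits = [[peak(0), peak(2), peak(6), peak(0)]]   # one sequence of four tokens
labels = [[-100, 2, 5, -100]]                     # first and last are special tokens

true_predictions, true_labels = decode_predictions(logits, labels, label_list)
print(true_predictions)
print(true_labels)
```

The two lists of label strings are in the form that sequence-level metrics such as those in seqeval take as input.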

In [21]:
# Define and instantiate the Trainer...
trainer = Trainer(
                model=model,
                args=args,
                train_dataset=tokenized_datasets["train"],
                data_collator=data_collator,
                tokenizer=tokenizer,
                optimizers=(optimizer, lr_sched)
                )

In [22]:
# Train
trainer.train()



Step,Training Loss
20,1.6113
40,0.4497
60,0.1689
80,0.0619
100,0.041
120,0.029
140,0.0195
160,0.0145




TrainOutput(global_step=160, training_loss=0.2994753928855062, metrics={'train_runtime': 57.3977, 'train_samples_per_second': 2.788, 'total_flos': 0, 'epoch': 8.0, 'init_mem_cpu_alloc_delta': 1910181888, 'init_mem_gpu_alloc_delta': 497540096, 'init_mem_cpu_peaked_delta': 380674048, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 1164800000, 'train_mem_gpu_alloc_delta': 1501276160, 'train_mem_cpu_peaked_delta': 309084160, 'train_mem_gpu_peaked_delta': 3436821504})

In [23]:
# Finish Weights & Biases logging for this run
wandb.finish() 


0,1
train/loss,0.0145
train/learning_rate,2e-05
train/epoch,8.0
train/global_step,160.0
_runtime,58.0
_timestamp,1621420341.0
_step,8.0
train/train_runtime,57.3977
train/train_samples_per_second,2.788
train/total_flos,0.0


0,1
train/loss,█▃▂▁▁▁▁▁
train/learning_rate,▁▄██▇▅▃▁
train/epoch,▁▂▃▄▅▆▇██
train/global_step,▁▂▃▄▅▆▇██
_runtime,▁▂▃▄▅▆▇██
_timestamp,▁▂▃▄▅▆▇██
_step,▁▂▃▄▅▅▆▇█
train/train_runtime,▁
train/train_samples_per_second,▁
train/total_flos,▁


In [24]:
# Save the model: good practice given the work required to train it,
# and the saved model can be reloaded later for inference on new data
trainer.save_model(SAVED_MODEL)