# Fine Tuning

In [1]:
import sys
sys.path.append("..")
from mangoes.modeling import BERTForSequenceClassification, BERTForQuestionAnswering, BERTForCoreferenceResolution, \
    BERTForMultipleChoice

Users can fine-tune a pretrained model on downstream tasks using the same interface as the pretraining and feature extraction classes. Below are some examples:

## Text Classification

A common fine-tuning task is text classification, which covers both sequence classification (e.g., sentiment analysis) and token classification (e.g., named entity recognition). Here we show an example of sentiment analysis, but the process is the same for token classification.
The first step is to load the pretrained base BERT model. We can load the pretrained model we saved in the pretraining demo:

In [2]:
saved_model_dir = "./model_output/"

# Since we saved the trained tokenizer with the model, we can pass the same directory for the tokenizer argument.
# We pass the task labels when loading, as these are needed when instantiating the model.
loaded_model = BERTForSequenceClassification.load(saved_model_dir, saved_model_dir, labels=["pos", "neg"],
                                                  label2id={'neg': 0, 'pos': 1})

# Alternatively, we could have loaded a pretrained BERT model from Hugging Face:
# loaded_model = BERTForSequenceClassification.load("bert-base-uncased", "bert-base-uncased", 
#                                                   labels=["pos", "neg"], label2id={'neg': 0, 'pos': 1})

Some weights of the model checkpoint at ./model_output/ were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ./model_output/ and are newly
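
The `label2id` argument maps label names to class indices. As a minimal sketch in plain Python (this is not mangoes internals, and `id2label` and `decode` are hypothetical names used only for illustration), the inverse mapping is what turns a predicted class index back into a label name:

```python
# Same mapping as passed to BERTForSequenceClassification.load above.
label2id = {'neg': 0, 'pos': 1}

# Invert it so a class index can be decoded into a label name.
id2label = {idx: label for label, idx in label2id.items()}

def decode(class_index):
    """Return the label name for a predicted class index (illustrative helper)."""
    return id2label[class_index]

print(decode(1))
```

Keeping the mapping explicit avoids relying on the order of the `labels` list to decide which index means which class.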

### Prepare Data

We'll use the IMDb sentiment analysis dataset. Here we load it with the nlp package (since renamed to datasets), then extract the raw texts and labels:

In [3]:
from nlp import load_dataset

train_dataset, test_dataset = load_dataset('imdb', split=['train', 'test'])

train_texts = [x['text'] for x in train_dataset]
train_targets = [x['label'] for x in train_dataset]
test_texts = [x['text'] for x in test_dataset]
test_targets = [x['label'] for x in test_dataset]

# For the sake of the demo, we truncate the datasets so that training takes less time.
# We take examples from the beginning and end of each dataset to get a mix of pos and neg examples.

train_texts = train_texts[:150] + train_texts[-150:]
train_targets = train_targets[:150] + train_targets[-150:]
test_texts = test_texts[:150] + test_texts[-150:]
test_targets = test_targets[:150] + test_targets[-150:]
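
The slicing above only yields a balanced subset if each split is ordered by label, with one class at the start and the other at the end. The comment in the cell suggests this is the case, but it is worth checking. A minimal sketch on toy labels (the `take_head_and_tail` helper and the toy data are illustrative assumptions, not part of mangoes or the dataset):

```python
from collections import Counter

def take_head_and_tail(items, n):
    """Keep the first n and the last n items, as the cell above does with n=150."""
    return items[:n] + items[-n:]

# Toy stand-in for a label list sorted by class (assumption: all 0s, then all 1s).
toy_targets = [0] * 1000 + [1] * 1000

subset = take_head_and_tail(toy_targets, 150)
print(Counter(subset))
```

Running `Counter(train_targets)` on the real truncated lists is a quick way to confirm the class mix before training.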

### Training

We can compute metrics during training by creating a compute_metrics function and passing it in, as described in the Hugging Face Trainer documentation (https://huggingface.co/transformers/training.html#trainer):

In [4]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

For the train method, we can use the raw data as input:

In [5]:
loaded_model.train(train_text=train_texts, train_targets=train_targets,
                   eval_text=test_texts, eval_targets=test_targets,
                   output_dir="./model_ckpts/", max_len=512, num_train_epochs=1, 
                   per_device_train_batch_size=32, per_device_eval_batch_size=16,
                   logging_steps=10, learning_rate=0.0001, evaluation_strategy="epoch",
                   compute_metrics=compute_metrics)

| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall | Runtime | Samples Per Second |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.7018 | 0.695033 | 0.5 | 0.0 | 0.0 | 0.0 | 9.5842 | 31.302 |




...or pass already-initialized torch Datasets:

In [6]:
from mangoes.modeling import MangoesTextClassificationDataset

train_dataset = MangoesTextClassificationDataset(train_texts, train_targets, loaded_model.tokenizer, 
                                                 max_len=512, label2id={'neg': 0, 'pos': 1})
eval_dataset = MangoesTextClassificationDataset(test_texts, test_targets, loaded_model.tokenizer, 
                                                max_len=512, label2id={'neg': 0, 'pos': 1})

loaded_model.train(train_dataset=train_dataset, eval_dataset=eval_dataset,   
                   output_dir="./model_ckpts/", max_len=512, num_train_epochs=1, 
                   per_device_train_batch_size=32, per_device_eval_batch_size=16,
                   logging_steps=10, learning_rate=0.0001, evaluation_strategy="epoch",
                   compute_metrics=compute_metrics, seed=7)

| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall | Runtime | Samples Per Second |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.6961 | 0.694875 | 0.5 | 0.666667 | 0.5 | 1.0 | 9.6059 | 31.231 |
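
Both runs above report an accuracy of exactly 0.5, which is what a model that predicts the same class for every example would score on a balanced evaluation set. The sketch below re-implements the binary metrics in plain Python (it is not the scikit-learn implementation) and, assuming 150 negative and 150 positive evaluation examples, shows that always predicting one class reproduces the reported numbers. This is an interpretation of the logged values, not something the notebook states.

```python
def binary_metrics(labels, preds, positive=1):
    """Accuracy, precision, recall and F1 for a binary task, in plain Python."""
    tp = sum(1 for y, p in zip(labels, preds) if y == positive and p == positive)
    fp = sum(1 for y, p in zip(labels, preds) if y != positive and p == positive)
    fn = sum(1 for y, p in zip(labels, preds) if y == positive and p != positive)
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    # Ratios with a zero denominator are defined as 0.0 here
    # (scikit-learn does the same by default and emits a warning).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {'accuracy': correct / len(labels), 'f1': f1,
            'precision': precision, 'recall': recall}

# Assumed balanced evaluation set: 150 negative (0) and 150 positive (1) examples.
labels = [0] * 150 + [1] * 150

all_negative = binary_metrics(labels, [0] * 300)  # compare with the first run
all_positive = binary_metrics(labels, [1] * 300)  # compare with the second run
print(all_negative)
print(all_positive)
```

With only 300 training examples and a single epoch, metrics like these mainly confirm that the pipeline runs end to end; they say little about what the model could reach with a full training run.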


### Inference

We can use the predict function for direct sentiment prediction, or generate_outputs for more detailed outputs:

In [7]:
predictions = loaded_model.predict("This is a good movie")
print(predictions)

[{'label': 'pos', 'score': 0.5291166305541992}]


In [8]:
outputs = loaded_model.generate_outputs("This is a good movie", output_hidden_states=True, output_attentions=True)
print(outputs.keys())

dict_keys(['logits', 'hidden_states', 'attentions', 'offset_mappings'])


### Saving and Loading
Users can save and load fine-tuned models using the same save and load methods. Here we load a fine-tuned model that has been uploaded to the Hugging Face model hub, then use it to classify text:

In [9]:
loaded_model = BERTForSequenceClassification.load("textattack/bert-base-uncased-imdb", 
                                                  "textattack/bert-base-uncased-imdb", 
                                                  labels=["pos", "neg"], label2id={'neg': 0, 'pos': 1})

predictions = loaded_model.predict("This is a good movie")
print(predictions)

[{'label': 'pos', 'score': 0.9922362565994263}]


## Question Answering

Question answering is another common fine-tuning task.
The first step is again to load the pretrained base BERT model:

In [10]:
saved_model_dir = "./model_output/"

# Since we saved the trained tokenizer with the model, we can pass the same directory for the tokenizer argument.
# No task labels are passed here: question answering predicts answer spans rather than classes.
loaded_model = BERTForQuestionAnswering.load(saved_model_dir, saved_model_dir)

# Alternatively, we could have loaded a pretrained BERT model from Hugging Face:
# loaded_model = BERTForQuestionAnswering.load("bert-base-uncased", "bert-base-uncased")

Some weights of the model checkpoint at ./model_output/ were not used when initializing BertForQuestionAnswering: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at ./model_output/ and are newly initialized: ['qa_o

### Data preparation
We'll use the SQuAD dataset, keeping only the first 250 training examples:

In [11]:
from nlp import load_dataset

train_dataset, eval_dataset = load_dataset('squad', split=['train', 'validation'])
train_contexts = [x['context'] for x in train_dataset][:250]
train_questions = [x['question'] for x in train_dataset][:250]
train_starts = [x['answers']['answer_start'][0] for x in train_dataset][:250]
train_answers = [x['answers']['text'][0] for x in train_dataset][:250]
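
In SQuAD, each `answer_start` value is a character offset into the context string, so the answer text can be recovered by slicing. A minimal sketch of that relationship on a toy example (the `answer_span` helper and the example sentence are illustrative, not part of mangoes or SQuAD):

```python
def answer_span(context, answer_text, answer_start):
    """Return (start, end) character offsets, checking they match the answer text."""
    end = answer_start + len(answer_text)
    if context[answer_start:end] != answer_text:
        raise ValueError("answer_start does not point at answer_text")
    return answer_start, end

# Toy example, not taken from SQuAD.
context = "Mangoes are grown in tropical climates."
answer = "tropical climates"
start = context.index(answer)

span = answer_span(context, answer, start)
print(span, context[span[0]:span[1]])
```

A check like this on `train_contexts`, `train_answers` and `train_starts` can catch misaligned offsets before they reach the training step.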

### Training
We can call the train function with either the raw data or torch Datasets. Here we'll pass the raw data:

In [12]:
loaded_model.train(train_question_texts=train_questions, train_context_texts=train_contexts,
                   train_answer_texts=train_answers, train_start_indices=train_starts,
                   output_dir="./model_ckpts/", num_train_epochs=1, learning_rate=0.00005,
                   per_device_train_batch_size=32, logging_steps=4)

convert squad examples to features: 100%|██████████| 250/250 [00:03<00:00, 71.26it/s]
add example index and unique id: 100%|██████████| 250/250 [00:00<00:00, 414620.80it/s]


| Step | Training Loss |
|---|---|
| 4 | 5.9612 |
| 8 | 5.9025 |
| 12 | 5.8788 |


Alternatively, we could instantiate a transformers.Trainer object directly and pass it to the train function. Additionally, the base BERT model can be frozen so that only the task heads are trained:

In [13]:
from transformers import Trainer, TrainingArguments, PrinterCallback
from mangoes.modeling import MangoesQuestionAnsweringDataset

train_dataset = MangoesQuestionAnsweringDataset(loaded_model.tokenizer, train_questions,
                                                train_contexts, train_answers, train_starts)

train_args = TrainingArguments(output_dir="./model_ckpts/", num_train_epochs=1, learning_rate=0.00005,
                               per_device_train_batch_size=32, logging_steps=4)
trainer = Trainer(loaded_model.model, args=train_args, train_dataset=train_dataset,
                  tokenizer=loaded_model.tokenizer, callbacks=[PrinterCallback])

loaded_model.train(trainer=trainer, freeze_base=True)

convert squad examples to features: 100%|██████████| 250/250 [00:03<00:00, 70.83it/s]
add example index and unique id: 100%|██████████| 250/250 [00:00<00:00, 696728.24it/s]


| Step | Training Loss |
|---|---|
| 4 | 5.8467 |
| 8 | 5.8461 |
| 12 | 5.8491 |


{'loss': 5.8467, 'learning_rate': 3.571428571428572e-05, 'epoch': 0.29}
{'loss': 5.8461, 'learning_rate': 2.1428571428571428e-05, 'epoch': 0.57}
{'loss': 5.8491, 'learning_rate': 7.142857142857143e-06, 'epoch': 0.86}
{'train_runtime': 14.7859, 'train_samples_per_second': 0.947, 'epoch': 1.0}
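
The learning rates logged above are consistent with a linear decay from the initial value of 0.00005 to zero with no warmup, which is the default schedule of the transformers Trainer. The sketch below reproduces the three logged values; the total of 14 optimizer steps is an assumption inferred from the log (step 4 is reported as epoch 0.29, roughly 4/14), and `linear_decay_lr` is an illustrative helper, not a library function.

```python
def linear_decay_lr(step, initial_lr, total_steps):
    """Learning rate after `step` optimizer steps under linear decay to zero (no warmup)."""
    return initial_lr * max(0.0, (total_steps - step) / total_steps)

INITIAL_LR = 0.00005   # learning_rate passed to TrainingArguments above
TOTAL_STEPS = 14       # assumption inferred from the log: step 4 is epoch 0.29

for step in (4, 8, 12):
    print(step, linear_decay_lr(step, INITIAL_LR, TOTAL_STEPS))
```

Because the rate is already small by step 12, the last logged loss values change very little, which matches the nearly flat losses in the table.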


### Inference
We can use the predict or generate_outputs functions for model inference:

In [14]:
predictions = loaded_model.predict(question=train_questions[0], context=train_contexts[0])
print(predictions)
outputs = loaded_model.generate_outputs(question=train_questions[0], context=train_contexts[0], output_hidden_states=True)
print(outputs.keys())

[{'score': 3.868541170959361e-05, 'start': 260, 'end': 287, 'answer': 'Me Omnes". Next to the Main'}]
dict_keys(['start_logits', 'end_logits', 'hidden_states', 'offset_mappings'])


### Saving and Loading
Users can save and load fine-tuned models using the same save and load methods. Here we load a fine-tuned model that has been uploaded to the Hugging Face model hub, then use it to answer a question:

In [15]:
loaded_model = BERTForQuestionAnswering.load("csarron/bert-base-uncased-squad-v1", 
                                             "csarron/bert-base-uncased-squad-v1")

predictions = loaded_model.predict(question="Where does the answer reside?", context="The answer resides in the context")
print(predictions)

[{'score': 0.7772578001022339, 'start': 19, 'end': 32, 'answer': 'in the context'}]


## Multiple Choice
Another fine-tuning task is training a model to answer multiple choice questions. We'll start by loading the base model we pretrained:

In [16]:
saved_model_dir = "./model_output/"

loaded_model = BERTForMultipleChoice.load(saved_model_dir, saved_model_dir)

# Alternatively, we could have loaded a pretrained BERT model from Hugging Face:
# loaded_model = BERTForMultipleChoice.load("bert-base-cased", "SpanBERT/spanbert-base-cased")

Some weights of the model checkpoint at ./model_output/ were not used when initializing BertForMultipleChoice: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertForMultipleChoice from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMultipleChoice from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForMultipleChoice were not initialized from the model checkpoint at ./model_output/ and are newly initialized: ['bert.pooler.dens

### Data Preparation
We'll use a subset of the HellaSwag dataset (an extension of SWAG):

In [17]:
from nlp import load_dataset

train_dataset, eval_dataset = load_dataset('hellaswag', split=['train', 'validation'])
train_contexts = [x['ctx_a'] for x in train_dataset][:65]
train_choices = [[x['ctx_b'] + " " + ending for ending in x['endings']] for x in train_dataset][:65]
train_labels = [x['label'] for x in train_dataset][:65]
eval_contexts = [x['ctx_a'] for x in eval_dataset][:100]
eval_choices = [[x['ctx_b'] + " " + ending for ending in x['endings']] for x in eval_dataset][:100]
eval_labels = [x['label'] for x in eval_dataset][:100]

print(train_contexts[1])
print(train_choices[1])
print(train_labels[1])

Using custom data configuration default


A female chef in white uniform shows a stack of baking pans in a large kitchen presenting them.
['the pans contain egg yolks and baking soda.', 'the pans are then sprinkled with brown sugar.', 'the pans are placed in a strainer on the counter.', 'the pans are filled with pastries and loaded into the oven.']
3
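
To make the data layout concrete, the sketch below rebuilds the printed example as a single record and selects the correct choice. The record is a toy reconstruction from the output above: the split between `ctx_b` and `endings` is an assumption, as is storing the label as a string.

```python
# Toy record shaped like a HellaSwag example, reconstructed from the printed output
# above (the split into 'ctx_b' and 'endings' is an assumption).
record = {
    'ctx_a': "A female chef in white uniform shows a stack of baking pans "
             "in a large kitchen presenting them.",
    'ctx_b': "the pans",
    'endings': [
        "contain egg yolks and baking soda.",
        "are then sprinkled with brown sugar.",
        "are placed in a strainer on the counter.",
        "are filled with pastries and loaded into the oven.",
    ],
    'label': '3',
}

# Same construction as in the cell above: prefix every ending with ctx_b.
choices = [record['ctx_b'] + " " + ending for ending in record['endings']]

# int() accepts both '3' and 3, so this works whether the label is a string or an integer.
correct_choice = choices[int(record['label'])]
print(correct_choice)
```

The label is a zero-based index into the list of choices, so a label of 3 selects the fourth ending.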


### Training
Next, we can pass this raw data into the train function along with any training parameters:

In [18]:
loaded_model.train(train_question_texts=train_contexts, eval_question_texts=eval_contexts,
                   train_choices_texts=train_choices, eval_choices_texts=eval_choices,
                   train_labels=train_labels, eval_labels=eval_labels, learning_rate=0.0005,
                   per_device_train_batch_size=8, per_device_eval_batch_size=8, logging_steps=4,
                   max_len=384, output_dir="./model_ckpts/", num_train_epochs=1, task_learn_rate=0.005)

| Epoch | Training Loss | Validation Loss | Runtime | Samples Per Second |
|---|---|---|---|---|
| 1 | 1.3781 | 1.38625 | 7.3312 | 13.64 |
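
The Runtime and Samples Per Second columns appear to describe the evaluation pass rather than training: dividing the 100 evaluation examples by the reported runtime gives the reported throughput. A minimal sketch of that arithmetic (`samples_per_second` is an illustrative helper, and the reading of the columns is an inference from the numbers):

```python
def samples_per_second(num_samples, runtime_seconds):
    """Throughput as samples divided by wall-clock seconds."""
    return num_samples / runtime_seconds

# 100 evaluation examples (eval_contexts[:100]) and the Runtime value from the table.
print(round(samples_per_second(100, 7.3312), 2))
```

The same relationship holds for the sentiment analysis runs, where 300 evaluation examples and a runtime of 9.5842 seconds give roughly 31.3 samples per second.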


Alternatively, you could instantiate torch Datasets and pass them as data arguments:

In [19]:
from mangoes.modeling import MangoesMultipleChoiceDataset

train_dataset = MangoesMultipleChoiceDataset(loaded_model.tokenizer, train_contexts, 
                                             train_choices, train_labels, 384)

eval_dataset = MangoesMultipleChoiceDataset(loaded_model.tokenizer, eval_contexts,
                                            eval_choices, eval_labels, 384)

loaded_model.train(train_dataset=train_dataset, eval_dataset=eval_dataset, per_device_train_batch_size=8, 
                   per_device_eval_batch_size=8, logging_steps=4, output_dir="./model_ckpts/", 
                   learning_rate=0.005, num_train_epochs=1)

| Epoch | Training Loss | Validation Loss | Runtime | Samples Per Second |
|---|---|---|---|---|
| 1 | 1.326 | 1.386535 | 7.6374 | 13.093 |


### Inference
We can use the predict or generate_outputs functions for model inference:

In [20]:
questions = "What did the cat say to the dog?"
choices = ["It said meow", "it said bark"]


predictions = loaded_model.predict(questions, choices)
print(predictions)
outputs = loaded_model.generate_outputs(questions, choices)
print(outputs.keys())
print(outputs['logits'])   # pre-softmax scores

[{'answer_index': 0, 'score': 0.5091835856437683, 'answer_text': 'It said meow'}]
odict_keys(['logits', 'offset_mappings'])
tensor([[1.2972, 1.2604]])
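
The comment in the cell describes the logits as pre-softmax scores. Assuming predict applies a softmax over the choices (an assumption, though the numbers are consistent with it), the score can be recovered from the printed logits. The sketch below uses plain Python rather than torch, and `softmax` is an illustrative helper:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.2972, 1.2604]   # values printed by outputs['logits'] above
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)

print(best, probs[best])
```

The result is approximately 0.5092 for index 0, which agrees with the reported score of 0.5091835856437683 up to the rounding of the printed logits. A score this close to 0.5 means the model barely prefers one choice over the other.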


### Saving

In [21]:
loaded_model.save("output_directory/")

## Co-reference Resolution
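Another fine-tuning task is co-reference resolution. We again start by loading a base model. In this cell, pretrained weights are loaded from Hugging Face, and the line that loads the model we pretrained is left commented out: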

In [31]:
saved_model_dir = "./model_output/"

# To load the model we pretrained instead, use:
# loaded_model = BERTForCoreferenceResolution.load(saved_model_dir, saved_model_dir, use_metadata=True)

# Here we load a pretrained BERT model from Hugging Face:
loaded_model = BERTForCoreferenceResolution.load("bert-base-cased", "SpanBERT/spanbert-base-cased")

Some weights of BertForCoreferenceResolutionBase were not initialized from the model checkpoint at SpanBERT/spanbert-base-cased and are newly initialized: ['span_attend_projection.weight', 'span_attend_projection.bias', 'mention_scorer.0.weight', 'mention_scorer.0.bias', 'mention_scorer.3.weight', 'mention_scorer.3.bias', 'width_scores.0.weight', 'width_scores.0.bias', 'width_scores.3.weight', 'width_scores.3.bias', 'fast_antecedent_projection.weight', 'fast_antecedent_projection.bias', 'slow_antecedent_scorer.0.weight', 'slow_antecedent_scorer.0.bias', 'slow_antecedent_scorer.3.weight', 'slow_antecedent_scorer.3.bias', 'slow_antecedent_projection.weight', 'slow_antecedent_projection.bias', 'genre_embeddings.weight', 'distance_embeddings.weight', 'slow_distance_embeddings.weight', 'distance_projection.weight', 'distance_projection.bias', 'same_speaker_embeddings.weight', 'span_width_embeddings.weight', 'span_width_prior_embeddings.weight', 'segment_dist_embeddings.weight']
You should p

### Data Preparation
First, we load a small subset of the OntoNotes dataset for demo purposes:

In [23]:
import json

with open('data/coref_data.json') as json_file:
    data_dict = json.load(json_file)
print(data_dict.keys())


dict_keys(['sentences', 'clusters', 'speakers', 'genres'])


### Training
Next, we can pass this raw data into the train function along with any training parameters:

In [24]:
loaded_model.train(output_dir="./model_ckpts/", train_documents=data_dict["sentences"][:6], 
                   train_cluster_ids=data_dict["clusters"][:6], train_speaker_ids=data_dict["speakers"][:6],
                   train_genres=data_dict["genres"][:6], 
                   eval_documents=data_dict["sentences"][6:12],
                   eval_cluster_ids=data_dict["clusters"][6:12], eval_speaker_ids=data_dict["speakers"][6:12],
                   eval_genres=data_dict["genres"][6:12],
                   num_train_epochs=1, learning_rate=0.0005,
                   logging_steps=2, task_learn_rate=0.001, evaluation_strategy="epoch")



| Epoch | Training Loss | Validation Loss | Runtime | Samples Per Second |
|---|---|---|---|---|
| 1 | 0.0 | 361.961548 | 9.7831 | 0.613 |


### Saving/Loading
The save and load methods can be used to save a trained model, and load either a pretrained base model for fine-tuning or an already fine-tuned model for inference. Here we'll load an already fine-tuned model (skip this cell if this presaved model is not available):

In [25]:
loaded_model = BERTForCoreferenceResolution.load("./coref_model/", "./coref_model/", use_metadata=True)


### Inference
The predict and generate_outputs functions can be called to use the model for inference. The predict function gives direct co-reference predictions, while the generate_outputs function returns all the antecedent and mention scores and indices, as well as the hidden states and attention matrices if requested:

In [26]:
# pre-tokenized
document = data_dict["sentences"][50][7:12]
speakers = data_dict["speakers"][50][7:12]
genre = data_dict["genres"][50]

# not pre-tokenized
input_doc = [' '.join(sent) for sent in document]
input_speaker = [sent[0] for sent in speakers]
print(input_doc)

['As the year 2006 ended , so did a big hush - hush " deal . "', 'Caijing Magazine described it as " a quiet change of ownership for this vast business empire . "', 'These were the final few steps as Shandong Luneng Group " completed " its privatization .', 'In the past , when I heard the big name of Luneng Group , I usually would link it with soccer .', 'I could never expect it is a 73.805 billion yuan giant .']


In [29]:
predictions = loaded_model.predict(document, pre_tokenized=True, speaker_ids=speakers, genre=genre)

for coref in predictions:
    print(coref["cluster_tokens"])

[['a', 'big', 'h', '##ush', '-', 'h', '##ush', '"', 'deal'], ['it']]
[['this', 'vast', 'business', 'empire'], ['Shan', '##dong', 'Lu', '##nen', '##g', 'Group'], ['its'], ['Lu', '##nen', '##g', 'Group'], ['it']]
[['I'], ['I'], ['I']]
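
The cluster tokens above are WordPiece tokens, where a leading `##` marks a piece that continues the previous token. A minimal sketch for turning each mention back into readable text (`merge_wordpieces` is an illustrative helper, not part of mangoes):

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens into a string; '##' marks a continuation of the previous token."""
    words = []
    for token in tokens:
        if token.startswith('##') and words:
            words[-1] += token[2:]   # glue the piece onto the previous word
        else:
            words.append(token)
    return ' '.join(words)

# Mentions copied from the second cluster printed above.
cluster = [['this', 'vast', 'business', 'empire'],
           ['Shan', '##dong', 'Lu', '##nen', '##g', 'Group'],
           ['its'], ['Lu', '##nen', '##g', 'Group'], ['it']]

print([merge_wordpieces(mention) for mention in cluster])
```

This makes it easier to see that the cluster links "Shandong Luneng Group", "Luneng Group" and the pronouns that refer to it.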


In [30]:
outputs = loaded_model.generate_outputs(input_doc, pre_tokenized=False, speaker_ids=input_speaker, genre=genre)
print(outputs.keys())

dict_keys(['loss', 'candidate_starts', 'candidate_ends', 'candidate_mention_scores', 'top_span_starts', 'top_span_ends', 'top_antecedents', 'top_antecedent_scores', 'flattened_ids', 'flattened_text'])
