<a id = 'top'></a>

#  A quick-start guide to fine-tune BERT with PyTorch and the Trainer Class
  * A. [Fine-tuned PyTorch Example](#fineTuned)
  * B. [Datasets](#datasetClass)    
      * 1. [GLUE](#glueData)
      * 2. [Exploratory Data Analysis](#EDA)
      * 3. [Model Selection](#modelSelection)
      * 4. [Tokenizer Selection](#tokenizerSelect)
      * 5. [Encode Data](#encodeData)
  * C. [Fine-Tuning](#fineTuning)
      * 1. [Trainer](#trainer)
      * 2. [Train and Evaluate](#trainEval)

Hugging Face is a company that offers a library of "transformers" as well as pre-trained language models.  We are going to explore several ways of working with these models at a very high level.  In later classes, once we have covered how a transformer works, we'll come back and look at them in more depth.  This tutorial looks at the Hugging Face library at the same level as Keras rather than at the lower level of TensorFlow.


---

This directory includes three different uses of the Hugging Face library.  These uses are incompatible with each other, so you should run only one at a time and then stop and restart your notebook.  Alternatively, you can copy this notebook to your Google Drive and then open and run it in Colab.


In [None]:
!pip install -q transformers
#!pip install transformers


[Return to Top](#top)
 <a id = 'fineTuned'></a>
 # Fine-tuned PyTorch Example

Now let's look at how to train a model.  We'll use a simple data set called COLA that looks at a sentence and says whether it is grammatically correct or not.  We'll use abstract classes that simplify the process of training by consolidating a number of pieces under one class.  We'll also use PyTorch, the framework in which Hugging Face models are natively implemented.  TensorFlow is Google's counterpart.

There are two reasons to look at the PyTorch version.  First, many models are first published on Hugging Face in PyTorch and only later ported to TensorFlow, so depending on the model you want to use, you may have to run the PyTorch version.  Second, it's important to always be aware of what you're using.  The good news is that Hugging Face has built these models so that the underlying weight parameters can be used across PyTorch and TensorFlow.  Only the commands you use to run and manipulate the model are specific to PyTorch or TensorFlow.



---


This notebook borrows liberally from https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb which shows how to train a model on a number of GLUE tasks.

[Return to Top](#top)
 <a id = 'datasetClass'></a>
# Datasets



Hugging Face provides a class for managing datasets.  It also provides a library of actual data that is accessible through the datasets class.  We'll take advantage of the datasets object to access some well-known corpora, specifically GLUE, which contains a set of classification tasks.  It is good for learning how to work with the Hugging Face Transformers library and also good for baselines.

In [None]:
!pip install datasets


In [None]:
from datasets import load_dataset, load_metric

[Return to Top](#top)
 <a id = 'glueData'></a>
### GLUE

GLUE contains a set of nine different language-understanding tasks (mostly classification) and a method for aggregating the scores.  Its purpose is to measure how well large multi-task models perform across those tasks.  The example below focuses on the COLA task, which classifies a sentence as acceptable or unacceptable.  We'll use the dataset object to look at some of the features of the COLA data.

In [None]:
GLUE_TASKS = ["cola", "mnli", "mnli-mm", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]

In [None]:
task = "cola"
actual_task = "mnli" if task == "mnli-mm" else task
dataset = load_dataset("glue", actual_task)
metric = load_metric('glue', actual_task)


[Return to Top](#top)
 <a id = 'EDA'></a>
### Exploratory Data Analysis

Let's look inside the COLA dataset and see what it contains.  We see it has train, validation, and test records.  

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})

In [None]:
dataset["train"][0]

{'idx': 0,
 'label': 1,
 'sentence': "Our friends won't buy this analysis, let alone the next one we propose."}

In [None]:
#from https://github.com/huggingface/notebooks/blob/master/examples/text_classification.ipynb
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

We can see a random selection of 10 records from the training set and the label associated with each one.  It is a good idea to get to know your data.

In [None]:
show_random_elements(dataset["train"])

| | sentence | label | idx |
|---|---|---|---|
| 0 | I dried the clothes in the sun. | acceptable | 2393 |
| 1 | The dentist is eager to examine Pat. | acceptable | 4292 |
| 2 | I loaned my binoculars a man who was watching the race. | unacceptable | 1135 |
| 3 | The rope coiled around the post. | acceptable | 2586 |
| 4 | He left. | acceptable | 3573 |
| 5 | Mike expected Greg incorrectly to take out the trash. | acceptable | 6265 |
| 6 | What is likely to have been bought at the supermarket? | acceptable | 6216 |
| 7 | Traci gave the whale a lollipop. | acceptable | 5874 |
| 8 | Adam asked if Hyacinth likes pineapples. | acceptable | 5922 |
| 9 | I inquired if we could leave early. | acceptable | 7888 |


Each dataset object has a metric object associated with it.  Here you can see the metric object for GLUE.  It indicates the accepted type(s) of evaluation that can be used with the dataset.  Note that our COLA test uses [Matthews Correlation](https://en.wikipedia.org/wiki/Matthews_correlation_coefficient) rather than accuracy or an F1 score.

In [None]:
metric  #show the metric object contents

Metric(name: "glue", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}, usage: """
Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation
Examples:

    >>> glue_metric = datasets.load_metric('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(res

In [None]:
#quick illustration of having the metric class compute the metric with some fake data
import numpy as np

fake_preds = np.random.randint(0, 2, size=(64,))
fake_labels = np.random.randint(0, 2, size=(64,))
metric.compute(predictions=fake_preds, references=fake_labels)

{'matthews_correlation': -0.3110917000380287}
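To make the Matthews correlation less of a black box, here is a minimal sketch that computes it by hand from the four confusion-matrix counts for binary labels.  The function name `matthews_corrcoef_binary` is our own for illustration and is not part of the metric object; returning 0 when the denominator is zero is a common convention that we assume here.

```python
import numpy as np

def matthews_corrcoef_binary(predictions, references):
    """Matthews correlation for binary labels (0/1), from confusion-matrix counts."""
    predictions = np.asarray(predictions)
    references = np.asarray(references)
    tp = int(np.sum((predictions == 1) & (references == 1)))  # true positives
    tn = int(np.sum((predictions == 0) & (references == 0)))  # true negatives
    fp = int(np.sum((predictions == 1) & (references == 0)))  # false positives
    fn = int(np.sum((predictions == 0) & (references == 1)))  # false negatives
    denominator = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column of the confusion matrix is empty
    return float((tp * tn - fp * fn) / denominator)

perfect = matthews_corrcoef_binary([0, 1, 1, 0], [0, 1, 1, 0])
inverted = matthews_corrcoef_binary([1, 0, 0, 1], [0, 1, 1, 0])
always_one = matthews_corrcoef_binary([1, 1, 1, 1], [0, 1, 1, 0])
print(perfect, inverted, always_one)  # 1.0 -1.0 0.0
```

The score runs from -1 (every prediction wrong) through 0 (no better than chance) to 1 (every prediction right).  A classifier that always predicts the same class scores 0, even though its accuracy can look respectable when the classes are imbalanced.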

[Return to Top](#top)
 <a id = 'modelSelection'></a>
### Model Selection


In order to do anything with Hugging Face we need to select a model.  Recall from before that the model name indicates a particular architecture trained on a particular language, in some cases with task-specific capabilities grafted on.  In this case we want the DistilBERT model.  This is a variant that runs a smaller architecture and therefore tends to operate faster at the expense of some accuracy in prediction.  We also specify that we want the base size (as opposed to a large size).  Finally, we want the uncased variant, meaning that all words will be forced to lowercase as part of the tokenization.  If you're looking for proper nouns, then all-lowercase text will be problematic.  If you don't care about distinguishing proper nouns from common nouns, then uncased will most likely help you.

In [None]:
model_checkpoint = "distilbert-base-uncased"
batch_size = 16
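As a rough illustration of what "uncased" means in practice, the sketch below lowercases text and strips accents, which approximates the normalization an uncased BERT-style tokenizer applies before splitting text into tokens.  The function `uncased_normalize` is a hypothetical stand-in written for this example; it is not the tokenizer's own code.

```python
import unicodedata

def uncased_normalize(text):
    """Lowercase the text and drop accent marks (an approximation of uncased normalization)."""
    text = text.lower()
    # NFD splits an accented character into its base letter plus a combining mark
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" is a non-spacing mark, i.e. the accent itself
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(uncased_normalize("Apple released a new phone"))  # apple released a new phone
print(uncased_normalize("Café Müller"))                 # cafe muller
```

After this step "Apple" the company and "apple" the fruit are the same string, which is why an uncased model is a poor fit when capitalization carries meaning.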

[Return to Top](#top)
 <a id = 'tokenizerSelect'></a>
### Tokenizer Selection 

We'll use the AutoTokenizer object because it ensures that we get the correct tokenizer for our pre-trained model.  Each model has its own tokenizer, with its own vocabulary that maps each token to the id of one of the model's learned embeddings.  You need to instantiate the correct tokenizer for your model.  If we accidentally use the FlauBERT tokenizer with DistilBERT, the code may still run, but the predictions will be poor.  The AutoTokenizer object makes sure we use the tokenizer for the model we specified in the model_checkpoint variable above.

In [None]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)


What's the tokenizer doing?  It's taking care of breaking down a sentence into the parts the model can understand and was trained on, as well as a bunch of housekeeping that's needed by the model in order to work properly.  Once we've covered how a transformer works in live session, we'll come back (in week 9) and discuss its various components.  For now, you don't need to understand it in order to make use of it.

In [None]:
#tokenizer("Hello, this one sentence!", "And this second sentence goes with it.")
tokenizer("Hello, we only need this one sentence for our task!")

{'input_ids': [101, 7592, 1010, 2057, 2069, 2342, 2023, 2028, 6251, 2005, 2256, 4708, 999, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

The tokenizer converts the incoming words to integer ids that are used to retrieve input word embeddings for the model.  All tokenizers convert words to input ids.  The wrong tokenizer will produce the wrong set of token ids and result in very poor predictions.
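The toy sketch below shows the idea on a made-up scale.  Both vocabularies and the `toy_encode` function are hypothetical and exist only for this example; a real tokenizer also splits rare words into sub-word pieces, which we skip here.  In the real output above, ids 101 and 102 are the special `[CLS]` and `[SEP]` tokens that mark the start and end of the input.

```python
# Hypothetical toy vocabularies; a real checkpoint ships a vocabulary with tens of thousands of entries.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "hello": 4, "this": 5, "one": 6, "sentence": 7}
OTHER_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
               "sentence": 4, "one": 5, "this": 6, "hello": 7}

def toy_encode(text, vocab):
    """Whitespace-tokenize, add special tokens, and look each token up in the vocabulary."""
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    input_ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}

right = toy_encode("Hello this one sentence", TOY_VOCAB)
wrong = toy_encode("Hello this one sentence", OTHER_VOCAB)
print(right["input_ids"])  # [2, 4, 5, 6, 7, 3]
print(wrong["input_ids"])  # [2, 7, 6, 5, 4, 3]
```

The same sentence yields different ids under the second vocabulary, so a model trained with the first one would look up the wrong embeddings for every word.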

---

Below we see some internal scaffolding used to help support all of the GLUE tests and to be able to process single sentence and multi-sentence inputs.

In [None]:
task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

In [None]:
sentence1_key, sentence2_key = task_to_keys[task]
if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")

Sentence: Our friends won't buy this analysis, let alone the next one we propose.


In [None]:
sentence1_key

'sentence'

In [None]:
sentence2_key

[Return to Top](#top)
 <a id = 'encodeData'></a>
### Encode Data


Let's create all of the encoded data for training.  Since we encode it all ahead of time, we can leverage some abstract classes to help us with the training process.

In [None]:
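def preprocess_function(examples):
    # Single-sentence tasks (e.g. COLA): tokenize one text column and pad to the longest sequence in the batch
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True, padding=True)
    # Sentence-pair tasks: tokenize both text columns together as one input
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)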


In [None]:
preprocess_function(dataset['train'][:5])

{'input_ids': [[101, 2256, 2814, 2180, 1005, 1056, 4965, 2023, 4106, 1010, 2292, 2894, 1996, 2279, 2028, 2057, 16599, 1012, 102], [101, 2028, 2062, 18404, 2236, 3989, 1998, 1045, 1005, 1049, 3228, 2039, 1012, 102, 0, 0, 0, 0, 0], [101, 2028, 2062, 18404, 2236, 3989, 2030, 1045, 1005, 1049, 3228, 2039, 1012, 102, 0, 0, 0, 0, 0], [101, 1996, 2062, 2057, 2817, 16025, 1010, 1996, 13675, 16103, 2121, 2027, 2131, 1012, 102, 0, 0, 0, 0], [101, 2154, 2011, 2154, 1996, 8866, 2024, 2893, 14163, 8024, 3771, 1012, 102, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]]}
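The trailing zeros in the output above come from padding: every sequence in a batch is extended to the length of the longest one, and the attention mask records which positions hold real tokens.  Here is a minimal sketch of that step in plain Python.  The helper `pad_batch` is our own illustration, not the tokenizer's implementation, and it assumes a pad id of 0 as seen in the output.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad each id list to the longest length and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

padded = pad_batch([[101, 2256, 2814, 102], [101, 2002, 102]])
print(padded["input_ids"])       # [[101, 2256, 2814, 102], [101, 2002, 102, 0]]
print(padded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The model uses the mask to ignore the padded positions, so padding changes the shape of the batch without changing what the sentence says.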

In [None]:
encoded_dataset = dataset.map(preprocess_function, batched=True)


[Return to Top](#top)
 <a id = 'fineTuning'></a>
# Fine-Tuning

Fine-tuning is the process of taking an already trained model and teaching it to make a new set of predictions.  Hugging Face hosts a large number of models that have been trained to make predictions about language.  These models are then trained a second time, on a new task.  We're going to take a model that was trained on some language tasks and fine-tune it on the COLA data to predict whether a sentence is acceptable or unacceptable.

---

In order to do the training we are going to take advantage of some abstract classes from HuggingFace such as the Trainer and the TrainingArguments objects.

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

num_labels = 2
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi
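The warning above says that the `pre_classifier` and `classifier` weights are newly initialized: they form the classification head that fine-tuning has to learn from scratch on top of the pre-trained encoder.  The NumPy sketch below shows the shape of that computation under our own assumptions: a hidden size of 768, a linear layer followed by ReLU and a second linear layer, random stand-in values for the encoder output, and no dropout.  It is an illustration of the idea, not the library's code.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, num_labels, batch_rows = 768, 2, 4  # hidden_size of 768 is an assumption

# Stand-in for the pre-trained encoder's output at the first token of each sentence
first_token_hidden = rng.normal(size=(batch_rows, hidden_size))

# Newly initialized head weights (small random values, zero biases)
pre_classifier_w = rng.normal(scale=0.02, size=(hidden_size, hidden_size))
pre_classifier_b = np.zeros(hidden_size)
classifier_w = rng.normal(scale=0.02, size=(hidden_size, num_labels))
classifier_b = np.zeros(num_labels)

# Linear layer + ReLU, then a linear layer down to one score (logit) per label
hidden = np.maximum(first_token_hidden @ pre_classifier_w + pre_classifier_b, 0.0)
logits = hidden @ classifier_w + classifier_b
print(logits.shape)  # (4, 2)
```

Each row of `logits` holds one score per label.  Before fine-tuning these scores are essentially random, which is why the library suggests training the model on a downstream task before using it for predictions.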

[Return to Top](#top)
 <a id = 'trainer'></a>
### Trainer

We'll use the Trainer object provided by Huggingface to manage the training process for us.  We need to create a structure to hold the arguments for the Trainer class.

In [None]:
metric_name = "matthews_correlation"

args = TrainingArguments(
    "test-glue",
    evaluation_strategy = "steps",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
)

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels)
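Here is a small sketch of what `compute_metrics` does to the model output before scoring it, using made-up numbers.  For classification tasks the model returns one score per label and we take the index of the largest; for the STS-B regression task it returns a single column that is used as-is.  All values below are invented for the example.

```python
import numpy as np

# Made-up classification logits: one row per sentence, one column per label
logits = np.array([[ 2.1, -1.3],
                   [-0.4,  0.9],
                   [ 0.2,  0.1]])
labels = np.array([0, 1, 1])

predicted = np.argmax(logits, axis=1)            # index of the highest score in each row
accuracy = float(np.mean(predicted == labels))   # fraction of rows predicted correctly
print(predicted.tolist(), round(accuracy, 3))    # [0, 1, 0] 0.667

# Made-up regression output: a single column, flattened to a 1-D array of scores
regression_output = np.array([[3.2], [0.5]])
scores = regression_output[:, 0]
print(scores.tolist())  # [3.2, 0.5]
```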

In [None]:
#validation_key = "validation"
validation_key = "validation_mismatched" if task == "mnli-mm" else "validation_matched" if task == "mnli" else "validation"
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

[Return to Top](#top)
 <a id = 'trainEval'></a>
### Train and Evaluate

Once we've instantiated the TrainingArguments and Trainer objects, we can easily train the model on our new task.

In [None]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running training *****
  Num examples = 8551
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 2675


| Step | Training Loss | Validation Loss | Matthews Correlation |
|---|---|---|---|
| 500 | 0.523 | 0.464664 | 0.459811 |
| 1000 | 0.3511 | 0.447707 | 0.523757 |
| 1500 | 0.2393 | 0.611639 | 0.543622 |
| 2000 | 0.1671 | 0.797166 | 0.532507 |
| 2500 | 0.1291 | 0.832661 | 0.548133 |


The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
  Num examples = 1043
  Batch size = 16
Saving model checkpoint to test-glue/checkpoint-500
Configuration saved in test-glue/checkpoint-500/config.json
Model weights saved in test-glue/checkpoint-500/pytorch_model.bin
tokenizer config file saved in test-glue/checkpoint-500/tokenizer_config.json
Special tokens file saved in test-glue/checkpoint-500/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
  Num examples = 1043
  Batch size = 16
Saving model checkpoint to test-glue/checkpoint-1000
Configuration saved in test-glue/checkpoint-1000/config.json
Model weights saved in test-glue/checkpoint-1000/pytorch_model.bin
tokeniz

TrainOutput(global_step=2675, training_loss=0.27132948331743756, metrics={'train_runtime': 472.9739, 'train_samples_per_second': 90.396, 'train_steps_per_second': 5.656, 'total_flos': 501564261000636.0, 'train_loss': 0.27132948331743756, 'epoch': 5.0})
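The step counts in the training log follow directly from the dataset size and the arguments we chose.  The short calculation below reproduces them.  The evaluation interval of 500 steps is an assumption read off the results table, since we never set it ourselves.

```python
import math

num_examples = 8551   # training rows, from the log above
batch_size = 16
num_epochs = 5
eval_every = 500      # assumed evaluation interval in steps, inferred from the results table

steps_per_epoch = math.ceil(num_examples / batch_size)  # the last, partial batch still counts as a step
total_steps = steps_per_epoch * num_epochs
eval_steps = list(range(eval_every, total_steps + 1, eval_every))
print(steps_per_epoch, total_steps, eval_steps)  # 535 2675 [500, 1000, 1500, 2000, 2500]
```

This matches the 2675 total optimization steps reported in the log and the five evaluation rows in the table.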

In [None]:
trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: idx, sentence.
***** Running Evaluation *****
  Num examples = 1043
  Batch size = 16


{'epoch': 5.0,
 'eval_loss': 0.8326605558395386,
 'eval_matthews_correlation': 0.5481326292844919,
 'eval_runtime': 2.8723,
 'eval_samples_per_second': 363.12,
 'eval_steps_per_second': 22.978}

### Future Discussion
Trained models can be saved and re-used.  This provides an opportunity to train the model on a variety of tasks that may be related to your ultimate predictive task, save the results (basically the parameter values), and then re-use that saved model.  This process is called transfer learning, and we'll discuss it later in live session.

In [None]:
trainer.save_model("path/to/awesome-name-you-picked")
tokenizer.save_pretrained("path/to/repo/clone/your-model-name")

In [None]:
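from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("namespace/awesome-name-you-picked")
# AutoModel loads the base model without the classification head
model = AutoModel.from_pretrained("namespace/awesome-name-you-picked")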