
# Setup

Before we start running our own Python code, install the required Python packages using [pip](https://en.wikipedia.org/wiki/Pip):

* [`transformers`](https://huggingface.co/docs/transformers/index) is a popular deep learning package
* [`datasets`](https://huggingface.co/docs/datasets/) provides support for loading, creating, and manipulating datasets

Both of these packages are used extensively in this course.

In [1]:
!pip3 install -q transformers datasets


(Above, the `!` at the start of the line tells the notebook to run the line as an operating system command rather than Python code, and the `-q` argument to `pip` runs the command in "quiet" mode, with less output.)

We'll also use the [`pprint`](https://docs.python.org/3/library/pprint.html) ("pretty-print") module to format output more readably below. The only difference from plain `print` is that some data structures will be easier to read and interpret.

In [2]:
from pprint import PrettyPrinter

pprint = PrettyPrinter(compact=True).pprint
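
To see what the pretty-printer does with a nested structure, here is a minimal sketch. The `record` dictionary is a made-up example, not taken from the data; `pformat` returns the same text that `pprint` would print.

```python
from pprint import PrettyPrinter

# Hypothetical record, similar in shape to the examples used later
record = {
    "label": 0,
    "text": "An example sentence that is long enough to be wrapped "
            "over several lines by the pretty-printer.",
    "input_ids": list(range(30)),
}

printer = PrettyPrinter(compact=True)

# pformat() returns the formatted text instead of printing it
formatted = printer.pformat(record)
print(formatted)
```

With `compact=True`, long lists are packed onto as few lines as fit the width instead of being printed one item per line.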

Finally, we'll reduce logging output. The `transformers` library by default produces fairly verbose logging. Running the following code will disable low-priority output (`INFO` logging level and below). You can skip running this or comment out the following lines to see more complete logs.

In [3]:
import logging

logging.disable(logging.INFO)
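
As a small illustration of what `logging.disable(logging.INFO)` does, the sketch below logs two messages to an in-memory stream. The logger name `demo` is an arbitrary choice for this example.

```python
import io
import logging

# Collect log output in memory so that we can inspect it
stream = io.StringIO()
logger = logging.getLogger("demo")   # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(stream))
logger.propagate = False             # keep the demo output out of other handlers

# Suppress INFO and everything below it, for all loggers
logging.disable(logging.INFO)

logger.info("this message is suppressed")
logger.warning("this message is still shown")

captured = stream.getvalue()
print(captured)
```

Only the warning ends up in `captured`, since `WARNING` is above the disabled `INFO` level.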

---

# Download and prepare data

The data used in this notebook is available online as a single file in [tab-separated values](https://en.wikipedia.org/wiki/Tab-separated_values) (TSV) format. We first use the [`wget`](https://en.wikipedia.org/wiki/Wget) operating system command to download the data to the computer running this notebook.

In [4]:
!wget -q -nc https://raw.githubusercontent.com/TurkuNLP/sentiment-target-corpus/main/sentiment-target-fi.tsv

(Above, the `-q` and `-nc` arguments tell `wget` to run in "quiet" mode and not to download the file again if it is already found on the computer. The latter option is useful in particular if running a notebook more than once.)

Let's have a look at the first few lines of the file to get an idea of its contents.

In [5]:
lines = []
with open("sentiment-target-fi.tsv") as f:    # open file
    for line in f:    # iterate over lines in file
        line = line.rstrip("\n")    # remove newline character
        lines.append(line)    # store line in list

for line in lines[:10]:    # iterate over first ten lines
    print(line)    # print each line

#ID	text	majority	agreement	annotations
comments2015d-45552-1.1	<TARGET>Kuopionlahden</TARGET> vierasvenelaituri (kuten nimi oikeasti on) on mielestämme tosi kiva.	positive	0.6666666666666666	positive,positive,neither,positive,neither,positive
comments2014a-469816-3.1	Käytetyssä <TARGET>Meganessa</TARGET> yksi etu verrattuna vaikkapa Skoda Octaviaan on se että Meganen voi ostaa jonkun vuoden uudempana.	positive	1.0	positive,positive,positive,positive,positive,positive
comments2007d-182055-6.1	Monet heistä ovat olleet kymmeniä vuosia <TARGET>Finnairin</TARGET> palveluksessa ja nyt tulee kenkää.	negative	0.8	negative,negative,negative,neither,negative
threads2013a-835251-2.1	<TARGET>Keskusta</TARGET> on aina ollut (ja on oleva) muun Suomen tuki ja turva.	positive	1.0	positive,positive,positive
comments2009a-483243-7.1	Itselläni on vuosimallin -93 <TARGET>LS-400</TARGET> ja olen tyytyväinen.	positive	1.0	positive,positive,positive,positive,positive
comments2003a-590164-4.1	<TARGET>The Bea

We see above that the first line is a "header" starting with the `#` character (the same character that marks comments in Python code), and that the remaining lines contain five fields separated by TAB characters:

* `ID`: unique identifier (not required here)
* `text`: the text of the example, with sentiment target marked with `<TARGET>...</TARGET>`
* `majority`: the label assigned to the example by the majority of annotators
* `agreement`: the ratio of all annotators that assigned the majority label
* `annotations`: a comma-separated list of all of the annotations for the example
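
The field layout can be made concrete with a short sketch that splits one line into a dictionary. The helper `parse_line` is a name introduced here for illustration only; the line itself is copied from the listing above.

```python
def parse_line(line):
    """Split one TSV line into a dict keyed by the five field names."""
    ident, text, majority, agreement, annotations = line.rstrip("\n").split("\t")
    return {
        "ID": ident,
        "text": text,
        "majority": majority,
        "agreement": float(agreement),
        "annotations": annotations.split(","),
    }

line = (
    "threads2013a-835251-2.1\t"
    "<TARGET>Keskusta</TARGET> on aina ollut (ja on oleva) muun Suomen tuki ja turva.\t"
    "positive\t1.0\tpositive,positive,positive"
)
example = parse_line(line)

# Recompute the agreement: share of annotators that chose the majority label
votes = example["annotations"]
share = votes.count(example["majority"]) / len(votes)
print(example["majority"], example["agreement"], share)
```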

## Reformat data for `datasets` package

* You really only need to do this once
* Make a [*json lines*](https://en.wikipedia.org/wiki/JSON_streaming#Line-delimited_JSON) (JSONL) file with one [JSON](https://en.wikipedia.org/wiki/JSON)-encoded example per line
* Each example has the `text` and the `label` as an integer
* We have four different labels in this particular data

In [6]:
import json

label_names=["positive","negative","reject","neither"]
data=[]
with open("sentiment-target-fi.tsv") as f:
    for line in f:
        line=line.rstrip("\n")
        if not line or line.startswith("#"): #skip empty and comments (incl. header)
            continue
        cols=line.split("\t")
        if len(cols)!=5: #skip weird lines that don't have the right number of columns
            continue
        data.append(cols)

We'll randomly shuffle the data to ensure that our split into training, validation and test data is random.

Note that here we call [`random.seed`](https://docs.python.org/3/library/random.html#bookkeeping-functions) with an arbitrary integer value so that the shuffle always results in the same split. This is not necessary (and must be avoided if you e.g. need different shuffles!), but it is useful here as it makes the results in this notebook more stable.

In [7]:
import random

random.seed(1234)    # seed with arbitrary integer value
random.shuffle(data) # produces arbitrary but fixed order
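
The effect of seeding can be checked with a small sketch. It uses a separate `random.Random` instance rather than the global `random.seed`, so that running it does not disturb the shuffle above; the helper name `shuffled` is introduced here for illustration.

```python
import random

def shuffled(items, seed):
    """Return a shuffled copy of `items` using a private, seeded generator."""
    rng = random.Random(seed)   # independent of the global random state
    copy = list(items)
    rng.shuffle(copy)
    return copy

items = list(range(10))
first = shuffled(items, seed=1234)
second = shuffled(items, seed=1234)

print(first)
print(first == second)   # the same seed gives the same order
```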

Now, let's cast the data into simple Python dictionaries and save these as JSONL.

In [8]:
with open("sentiment-data.jsonl","wt") as f:
    for cols in data:
        item = {
            "text": cols[1],
            "label": label_names.index(cols[2]),    # translate from label strings to integers
        }
        print(json.dumps(item,ensure_ascii=False,sort_keys=True),file=f)

Let's read back the file we just created and print one example to make sure it looks as intended:

In [9]:
examples = []
with open("sentiment-data.jsonl") as f:
    for line in f:
        example = json.loads(line)
        examples.append(example)

pprint(examples[0])

{'label': 0,
 'text': 'Mä pidän eniten <TARGET>Blessed Be</TARGET>-levystä, joten varsinkin '
         'ne biisit olivat mun mieleen!'}


## Load data with `datasets`

Every popular framework has its own preferred idea of how to represent data. We will here use the Hugging Face [datasets](https://huggingface.co/docs/datasets/) library, which is very popular, so it makes sense to be acquainted with it.

The `datasets` library provides broad support for [loading datasets](https://huggingface.co/docs/datasets/loading_datasets.html) through the `load_dataset` function, which we will use here to load the JSONL-formatted data we prepared just above.

In [10]:
import datasets

file_name = "sentiment-data.jsonl"
dataset = datasets.load_dataset(
    'json',                             # Format of the data
    data_files={"everything":file_name},    # All data files, here we only have one
    split={
        "train":"everything[:80%]",  # First 80% is the train set
        "validation":"everything[80%:90%]",   # Next 10% is the validation/dev set
        "test":"everything[90%:]"    # last 10% is the test set
    },
    features=datasets.Features({    # Here we tell how to interpret the attributes
        "label":datasets.ClassLabel(names=label_names),
        "text":datasets.Value("string")
    })
)


Let's see what's in the dataset that we just loaded:

In [11]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 1864
    })
    validation: Dataset({
        features: ['label', 'text'],
        num_rows: 233
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 233
    })
})


As intended, we see here a dataset in three parts, `train`, `validation`, and `test`, where each has the `text` and `label` values.
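
The row counts follow from the 80/10/10 percentages. Below is a plain-Python sketch of the same slicing, assuming the boundaries are computed with integer arithmetic; the library's own rounding rules may differ for sizes that do not divide evenly. The helper `split_bounds` is hypothetical.

```python
def split_bounds(n, train_pct=80, validation_pct=10):
    """Return the end indices of the train and validation slices."""
    train_end = n * train_pct // 100
    validation_end = n * (train_pct + validation_pct) // 100
    return train_end, validation_end

n = 1864 + 233 + 233   # total number of rows reported above
train_end, validation_end = split_bounds(n)

sizes = {
    "train": train_end,
    "validation": validation_end - train_end,
    "test": n - validation_end,
}
print(sizes)
```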

We can look at individual values e.g. as follows:

In [12]:
print(dataset['train']['text'][0])
print(dataset['train']['label'][0])

Mä pidän eniten <TARGET>Blessed Be</TARGET>-levystä, joten varsinkin ne biisit olivat mun mieleen!
0


## Tokenize and vectorize data

Each text in our dataset is a character string. As you know, machine learning methods do not generally operate directly on character strings. Instead, text is typically _tokenized_ and _vectorized_ for input to machine learning methods.

Here, we will use an existing, previously created tokenizer through the [`AutoTokenizer`](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer) class. The `AutoTokenizer.from_pretrained` function can load the tokenizer associated with any of the large number of models found in the [Hugging Face models repository](https://huggingface.co/models).

In [13]:
import transformers

model_name = "TurkuNLP/bert-base-finnish-cased-v1"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)


The tokenizers in the `transformers` library take a text string as argument and both split the text into tokens (tokenization) and map those tokens into integer identifiers (vectorization).

The values returned by the tokenizer depend on the model that the tokenizer was created for, but normally include at least `input_ids`, which are the result of tokenization and vectorization:

In [14]:
pprint(tokenizer("Tämä on esimerkkilause"))

{'attention_mask': [1, 1, 1, 1, 1, 1, 1],
 'input_ids': [102, 1131, 145, 6755, 6917, 50010, 103],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0]}


We can ignore the values other than `input_ids` returned by the tokenizer (`token_type_ids` and `attention_mask`) for now. We will come back to these later in the course.
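
To make the two steps concrete, here is a toy sketch that separates tokenization from vectorization. The vocabulary and the whitespace splitting are assumptions made for illustration; real tokenizers such as the one loaded above use learned subword vocabularies.

```python
# Hypothetical toy vocabulary mapping tokens to integer ids
vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "this": 3, "is": 4, "an": 5, "example": 6}

def toy_tokenize(text):
    """Tokenization: split on whitespace and add start and end markers."""
    return ["[CLS]"] + text.lower().split() + ["[SEP]"]

def toy_vectorize(tokens):
    """Vectorization: map each token to its id, unknown tokens to [UNK]."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]

tokens = toy_tokenize("This is an unusual example")
input_ids = toy_vectorize(tokens)

print(tokens)
print(input_ids)
```

The word missing from the toy vocabulary is mapped to the `[UNK]` id, which is one reason why real tokenizers split rare words into smaller known pieces instead.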

We can now tokenize and vectorize our whole dataset with the tokenizer by calling the [`Dataset.map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function:

In [15]:
# Define a simple function that applies the tokenizer
def tokenize(example):
    return tokenizer(example["text"])

# Apply the tokenizer to the whole dataset using .map()
dataset = dataset.map(tokenize)


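**Note**: unlike the built-in Python `map` function, which returns only the values produced by the mapped function, the [`Dataset.map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function returns a dataset in which each example is _updated_ with those values while keeping the existing ones. Here, the call adds the values returned by the `tokenizer()` call (`input_ids` etc.) to each example while also keeping the original `text` and `label` values. The original dataset object is not modified in place, which is why the result is assigned back to `dataset` above.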

In [16]:
pprint(dataset["train"][0])

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'input_ids': [102, 2175, 6623, 3592, 5571, 16307, 50051, 50073, 12355, 2377,
               22145, 1296, 369, 9627, 5571, 499, 16307, 50051, 50073, 12355,
               2377, 166, 5390, 492, 119, 861, 3213, 475, 33571, 1141, 1662,
               3063, 380, 103],
 'label': 0,
 'text': 'Mä pidän eniten <TARGET>Blessed Be</TARGET>-levystä, joten varsinkin '
         'ne biisit olivat mun mieleen!',
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
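
The difference from the built-in `map` can be sketched in plain Python. This only imitates the merging behaviour on a list of dictionaries and is not how the library is implemented; `fake_tokenize` and `map_like_dataset` are hypothetical names.

```python
def fake_tokenize(example):
    """Stand-in for a tokenizer: returns one new field for the example."""
    return {"input_ids": [len(word) for word in example["text"].split()]}

def map_like_dataset(function, examples):
    """Return new examples: each original dict merged with the function's output."""
    return [{**example, **function(example)} for example in examples]

examples = [{"text": "a short text", "label": 0}]

builtin_result = list(map(fake_tokenize, examples))      # only the new values
merged_result = map_like_dataset(fake_tokenize, examples)  # old and new values

print(builtin_result)
print(merged_result)
```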


## Input encoding for MLP

Our `input_ids` are currently represented as a sparse vector containing the indices of the tokens found in the text (see e.g. [Torch documentation](https://pytorch.org/docs/stable/sparse.html)).

The simplest way to present this input to the MLP model is to create a dense "bag-of-tokens" representation where the index for every token that is present in the input is 1, and the rest are 0.

For example, if our input has a vocabulary size of 5 and tokens `[0,3]` are present, we would like to get `[1,0,0,1,0]` as the input. The simple code below does just that; for the documentation, see [`torch.Tensor.scatter()`](https://pytorch.org/docs/stable/generated/torch.Tensor.scatter.html).

In [18]:
import torch

# Create a matrix of two examples (vectors) with
# the ids (indices) which we want to set to 1
input_ids = torch.tensor(
  [
     [0,0,1],
     [0,2,3]
  ]
)

vocab_size = 5
example_num = len(input_ids)

# These are the 1s we will be copying over
ones = torch.ones_like(input_ids, dtype=torch.float)

# This is the target, initialized to zeros
zeros = torch.zeros((example_num, vocab_size))

# Scatter says: 
#   work on dimension 1
#   `input_ids` are the indices to set
#   `ones` are the values to set
input = zeros.scatter(1, input_ids, ones)

print(input)
# see how in the first row indices 0 and 1 are set to 1
# and in the second row indices 0,2,3 are set to 1
# exactly as it was supposed to be!

tensor([[1., 1., 0., 0., 0.],
        [1., 0., 1., 1., 0.]])
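
For comparison, the same encoding can be written without PyTorch. This is a plain-Python sketch of the idea rather than a replacement for `scatter`; the helper `bag_of_tokens` is a name introduced here.

```python
def bag_of_tokens(ids, vocab_size):
    """Return a dense 0/1 vector with a 1 at every index listed in `ids`."""
    row = [0.0] * vocab_size
    for index in ids:
        row[index] = 1.0   # repeated ids simply set the same position again
    return row

batch = [
    [0, 0, 1],
    [0, 2, 3],
]
dense = [bag_of_tokens(ids, vocab_size=5) for ids in batch]

for row in dense:
    print(row)
```

Note that the representation is binary: the first example contains token 0 twice, but the vector only records that it is present, not how often.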


# Build model

A PyTorch model class in its simplest form has `__init__()`, which instantiates the layers, and `forward()`, which implements the actual computation. For more information on these, please see the [PyTorch tutorial](https://pytorch.org/tutorials/beginner/introyt/modelsyt_tutorial.html).

In [35]:
import torch

# A model wants a config, I can simply inherit from the base
# class for pretrained configs
class MLPConfig(transformers.PretrainedConfig):
    pass

# This is the model
class MLP(transformers.PreTrainedModel):

    config_class=MLPConfig

    # In the initialization method, one instantiates the layers
    # these will be the parameters of the model
    def __init__(self,config):
        super().__init__(config)
        self.vocab_size=config.vocab_size
        # Hidden layer: input size x hidden size
        self.hidden=torch.nn.Linear(in_features=self.vocab_size,out_features=config.hidden_size)
        # Output layer: hidden size x output size
        self.output=torch.nn.Linear(in_features=config.hidden_size,out_features=config.nlabels)
        
    # The computation of the model is put into the forward() function
    # it receives a batch of data and optionally the correct `labels`
    #
    # If given `labels` it returns (loss,output)
    # if not, then it returns (output,)
    #
    # `attention_mask` is accepted but not used here: the data collator
    # adds it to every batch, and its time will come later
    def forward(self,input_ids,labels=None,attention_mask=None):
        # The batch is in input_ids
        batch_size=input_ids.shape[0] #this is how many examples we have
        # The following block converts the input ids into a suitable input for
        # the input layer, it is adapted from above
        input=torch.zeros((batch_size,self.vocab_size),dtype=torch.float,device=input_ids.device)
        ones=torch.ones_like(input_ids,dtype=torch.float)
        input=input.scatter(1,input_ids,ones)
        projected=torch.tanh(self.hidden(input)) #Note how non-linearity is applied here and not when configuring the layer in __init__()
        logits=self.output(projected)
        
        # We have labels, so we ought to calculate the loss
        if labels is not None:
            loss=torch.nn.CrossEntropyLoss() #This loss is useful for classification
            # You run it as loss(model_output,correct_labels)
            return (loss(logits,labels),logits)
        else:
            # No labels, so just return the logits
            return (logits,)

# Configure the model:
#   these parameters are used in the model's __init__()
mlp_config=MLPConfig(vocab_size=tokenizer.vocab_size,hidden_size=100,nlabels=4)
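
To get a feel for the size of this model, the number of trainable parameters can be worked out from the layer shapes: each linear layer has one weight per input-output pair plus one bias per output. The sketch below assumes a round, hypothetical vocabulary size of 50,000; the actual value comes from `tokenizer.vocab_size`.

```python
def mlp_parameter_count(vocab_size, hidden_size, nlabels):
    """Count weights and biases of the hidden and output linear layers."""
    hidden = vocab_size * hidden_size + hidden_size   # weights + biases
    output = hidden_size * nlabels + nlabels          # weights + biases
    return hidden + output

# vocab_size is a hypothetical round number used only for illustration
count = mlp_parameter_count(vocab_size=50_000, hidden_size=100, nlabels=4)
print(count)
```

Almost all of the parameters sit in the hidden layer, because its input is as wide as the whole vocabulary.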


# Train model

We will use the Hugging Face [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) class for training. In brief:

* Loads of arguments that control the training
* Configurable metrics to evaluate performance
* Data collator builds the batches
* Early stopping callback stops when eval loss no longer improves
  

First, let's create a [`TrainingArguments`](https://huggingface.co/docs/transformers/v4.17.0/en/main_classes/trainer#transformers.TrainingArguments) object to specify hyperparameters and various other settings for training. 

Printing this simple dataclass object will show not only the values we set, but also the defaults for all other arguments. Don't worry if you don't understand what all of these do! Many are not relevant to us here, and you can find the details in [`Trainer` documentation](https://huggingface.co/docs/transformers/main_classes/trainer) if you are interested.

In [36]:
# Set training arguments
trainer_args = transformers.TrainingArguments(
    "mlp_checkpoints",
    evaluation_strategy="steps",
    logging_strategy="steps",
    eval_steps=100,
    logging_steps=100,
    learning_rate=5e-4,
    max_steps=5000,
    load_best_model_at_end=True
)

print(trainer_args)

TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
bf16=False,
bf16_full_eval=False,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_steps=100,
evaluation_strategy=IntervalStrategy.STEPS,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=False,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=0.0005,
length_column_name=length,
load_best_model_at_end=True,
local_rank=-1,
log_level=-1,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=mlp_checkpoints/runs

Next, let's create a metric for evaluating performance during and after training. We can use the convenience function [`load_metric`](https://huggingface.co/docs/datasets/about_metrics) to load one of many pre-made metrics and wrap this for use by the trainer.

As the task is simple multi-class classification, we'll use the basic `accuracy` metric, defined as the proportion of correctly predicted labels out of all labels.

In [37]:
import numpy as np

accuracy = datasets.load_metric("accuracy")

def compute_accuracy(outputs_and_labels):
    outputs, labels = outputs_and_labels
    predictions = np.argmax(outputs, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)
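
What the metric computes can be spelled out in a few lines of plain Python. This is a sketch of the definition given above, not the implementation used by the loaded metric; the logits and labels are made-up values.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy_score(logits, labels):
    """Proportion of examples whose highest-scoring label is the correct one."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical model outputs for four examples and four labels
logits = [
    [2.0, 0.1, -1.0, 0.3],
    [0.2, 1.5, 0.0, 0.1],
    [0.0, 0.1, 0.2, 3.0],
    [1.0, 0.9, 0.0, 0.0],
]
labels = [0, 1, 3, 1]

score = accuracy_score(logits, labels)
print(score)
```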

We can then create the `Trainer` and train the model by invoking the [`Trainer.train`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.train) function.

In addition to the model, the settings passed in through the `TrainingArguments` object created above (`trainer_args`), the data, and the metric defined above, we create and pass the following to the `Trainer`:

* [data collator](https://huggingface.co/docs/transformers/main_classes/data_collator): groups input into batches
* [`EarlyStoppingCallback`](https://huggingface.co/docs/transformers/main_classes/callback#transformers.EarlyStoppingCallback): stops training when performance stops improving

In [38]:
# Instantiate the model  
mlp = MLP(mlp_config)

data_collator = transformers.DataCollatorWithPadding(tokenizer)

# Argument gives the patience: the number of evaluations without
# improvement that are tolerated before training is stopped
early_stopping = transformers.EarlyStoppingCallback(3)

trainer = transformers.Trainer(
    model=mlp,
    args=trainer_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    compute_metrics=compute_accuracy,
    data_collator=data_collator,
    callbacks=[early_stopping]
)

trainer.train()



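| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 100  | 1.1408        | 1.044309        | 0.403433 |
| 200  | 1.018         | 0.991139        | 0.60515  |
| 300  | 0.822         | 0.968134        | 0.515021 |
| 400  | 0.6875        | 0.938572        | 0.592275 |
| 500  | 0.6093        | 0.963259        | 0.583691 |
| 600  | 0.4185        | 1.017672        | 0.553648 |
| 700  | 0.4068        | 1.042851        | 0.579399 |
| 800  | 0.2391        | 1.087399        | 0.566524 |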


TrainOutput(global_step=800, training_loss=0.6677587938308716, metrics={'train_runtime': 6.2589, 'train_samples_per_second': 6390.922, 'train_steps_per_second': 798.865, 'total_flos': 9188898518976.0, 'train_loss': 0.6677587938308716, 'epoch': 3.43})
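
The idea behind early stopping with patience can be sketched as a simple counter. This is a simplified illustration with made-up loss values: the real callback compares against the best value recorded by the trainer, so the exact step at which an actual run stops may differ from what this rule would give.

```python
def stopping_point(eval_losses, patience):
    """Return the index of the evaluation at which training would stop,
    or None if the patience is never exhausted."""
    best = float("inf")
    bad_evals = 0
    for index, loss in enumerate(eval_losses):
        if loss < best:
            best = loss        # improvement: remember it and reset the counter
            bad_evals = 0
        else:
            bad_evals += 1     # no improvement at this evaluation
            if bad_evals >= patience:
                return index
    return None

# Hypothetical validation losses, one per evaluation
losses = [1.00, 0.90, 0.85, 0.88, 0.91, 0.95, 0.99]
stop_index = stopping_point(losses, patience=3)
print(stop_index)
```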

We can then evaluate the trained model on a given dataset (here our test subset) by calling [`Trainer.evaluate`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.evaluate):

In [39]:
eval_results = trainer.evaluate(dataset["test"])

pprint(eval_results)

{'epoch': 3.43,
 'eval_accuracy': 0.6609442060085837,
 'eval_loss': 0.801636815071106,
 'eval_runtime': 0.1673,
 'eval_samples_per_second': 1393.105,
 'eval_steps_per_second': 179.37}


The `eval_results` value shown above is a dictionary, so we can pick out the accuracy by its key, `eval_accuracy`:

In [40]:
print('Accuracy:', eval_results['eval_accuracy'])

Accuracy: 0.6609442060085837


# Save the model for later use

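1.   `.save_pretrained(path)` saves the model configuration and weights to the directory `path`
2.   `.from_pretrained(path)` loads a model back from that directory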



In [25]:
mlp.save_pretrained("mlp-sent")

Now let us load the model from its saved state and check that it works as expected.

In [31]:

mlp2=MLP.from_pretrained("mlp-sent")

In [32]:
trainer = transformers.Trainer(
    model=mlp2,
    args=trainer_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    compute_metrics=compute_accuracy,
    data_collator=data_collator,
    callbacks=[early_stopping]
)

In [34]:
eval_results = trainer.evaluate(dataset["test"])

pprint(eval_results)

print('Accuracy:', eval_results['eval_accuracy'])

{'eval_accuracy': 0.6652360515021459,
 'eval_loss': 0.8015390634536743,
 'eval_runtime': 0.2806,
 'eval_samples_per_second': 830.289,
 'eval_steps_per_second': 106.904}
Accuracy: 0.6652360515021459
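
Since accuracy is a proportion of the 233 test examples, the two accuracy values printed in this notebook can be converted back into counts of correct predictions. The arithmetic below is a small sketch using only the numbers shown in the outputs above.

```python
test_size = 233                    # number of rows in the test split
first_run = 0.6609442060085837     # accuracy reported right after training
reloaded = 0.6652360515021459      # accuracy reported for the reloaded model

# Accuracy = correct / total, so correct = accuracy * total
correct_first = round(first_run * test_size)
correct_reloaded = round(reloaded * test_size)

print(correct_first, correct_reloaded)
print(correct_reloaded - correct_first)
```

The two reported figures differ by a single test example.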
