<a href="https://colab.research.google.com/github/TurkuNLP/Deep_Learning_in_LangTech_course/blob/master/hf_trainer_mlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!wget -nc https://raw.githubusercontent.com/TurkuNLP/sentiment-target-corpus/main/sentiment-target-fi.tsv
!pip3 install transformers datasets

--2022-03-03 20:20:38--  https://raw.githubusercontent.com/TurkuNLP/sentiment-target-corpus/main/sentiment-target-fi.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 454531 (444K) [text/plain]
Saving to: ‘sentiment-target-fi.tsv’


2022-03-03 20:20:39 (12.6 MB/s) - ‘sentiment-target-fi.tsv’ saved [454531/454531]

Collecting transformers
  Downloading transformers-4.17.0-py3-none-any.whl (3.8 MB)
Collecting datasets
  Downloading datasets-1.18.3-py3-none-any.whl (311 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)

# Prep data to a suitable format

* You really only need to do this once
* Make a *json lines* file with one json-encoded example per line
* Each example has a `text` string and a `label` stored as an integer
* This particular dataset has four different labels
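
As a minimal sketch of the format described above, the snippet below writes two examples as JSON lines and reads them back. The two example texts and the in-memory buffer are illustrative assumptions; the label names are the ones used in this notebook.

```python
import io
import json

label_names = ["positive", "negative", "reject", "neither"]

# Two made-up (text, label) pairs, for illustration only
examples = [("Great service at <TARGET>X</TARGET>.", "positive"),
            ("<TARGET>Y</TARGET> never answered.", "negative")]

buf = io.StringIO()  # stands in for a file opened with open(..., "wt")
for text, label in examples:
    item = {"label": label_names.index(label), "text": text}
    # one JSON object per line; ensure_ascii=False keeps characters like ä readable
    print(json.dumps(item, ensure_ascii=False, sort_keys=True), file=buf)

# Reading back: every non-empty line is parsed independently
records = [json.loads(line) for line in buf.getvalue().splitlines() if line]
for r in records:
    print(r["label"], label_names[r["label"]], r["text"])
```

Because every line is a complete JSON object, such a file can be read, shuffled, or split line by line without parsing it as a whole.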

In [2]:
import json
import random

label_names=["positive","negative","reject","neither"]
data=[]
with open("sentiment-target-fi.tsv") as f:
    for line in f:
        line=line.rstrip("\n")
        if not line or line.startswith("#"): #skip empty and comments
            continue
        cols=line.split("\t")
        if len(cols)!=5: #skip weird lines that don't have the right number of columns
            continue
        data.append(cols)
random.shuffle(data) #shake well
with open("sentiment-data.jsonl","wt") as f: #write out as jsonl
    for cols in data:
        txt=cols[1]
        item={"label":label_names.index(cols[2]),"text":txt} #note here we translate from label strings to integers
        print(json.dumps(item,ensure_ascii=False,sort_keys=True),file=f)

#One line looks like this:
# {"label": 0, "text": "En tiedä mitä kuvanvalmistamoa käytät, mutta ainakin <TARGET>Fotoyksillä</TARGET> onnistuu helposti."}


# Datasets

Every popular framework has its own preferred way of representing data. Here we look at the Hugging Face `datasets` library, which is widely used and therefore worth getting acquainted with.



In [3]:
import datasets

fname="sentiment-data.jsonl"
dset=datasets.load_dataset('json',                             # Format of the data
                           data_files={"everything":fname},    # All data files, here we only have one
                           split={"train":"everything[:80%]",  # First 80% is the train set
                                  "validation":"everything[80%:90%]",   # Next 10% is the validation/dev set
                                  "test":"everything[90%:]"},           # last 10% is the test set
                           features=datasets.Features({ #And here we tell how to interpret the data attributes
                               "label":datasets.ClassLabel(names=["positive","negative","reject","neither"]), #same order as label_names used when writing the file
                               "text":datasets.Value("string")})
                           )                           


Using custom data configuration default-af54bb440075cbc3


Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-af54bb440075cbc3/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b...


  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-af54bb440075cbc3/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]
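
To make the percent slicing concrete, here is a rough sketch of how the three split specifications above carve up a dataset of `n` examples. The helper `percent_slice` is hypothetical and uses plain integer arithmetic; the `datasets` library applies its own rounding, which may differ by an example at the boundaries when the percentages do not divide `n` evenly. The value `n = 2330` is the total implied by the example counts in the training log later in this notebook.

```python
def percent_slice(n, start_pct, end_pct):
    """Return (start, end) indices for the slice [start_pct%:end_pct%] of n items."""
    return n * start_pct // 100, n * end_pct // 100

n = 2330
splits = {"train": percent_slice(n, 0, 80),
          "validation": percent_slice(n, 80, 90),
          "test": percent_slice(n, 90, 100)}
sizes = {name: end - start for name, (start, end) in splits.items()}
print(splits)
print(sizes)
```

Since the examples were shuffled before being written to the file, taking consecutive slices like this amounts to a random split.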

# Tokenize and translate into integers

* One can use a pre-existing tokenizer
* By default, it produces `input_ids`, which map each token of the text to its integer id in the vocabulary
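
The following is a toy sketch of the greedy longest-match-first subword splitting that WordPiece-style tokenizers perform, followed by the lookup of integer ids. The mini-vocabulary and its ids are hypothetical and do not correspond to the real model; the real tokenizer also normalizes the text and handles punctuation, which is left out here.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical mini-vocabulary; the ids are made up for this illustration
vocab = {"[UNK]": 0, "Minulla": 1, "on": 2, "simp": 3, "##ukka": 4, "##koira": 5}

tokens = [p for w in "Minulla on simpukkakoira".split() for p in wordpiece(w, vocab)]
input_ids = [vocab[t] for t in tokens]
print(tokens)
print(input_ids)
```

The `##` prefix marks a piece that continues the previous one, which is how a long compound word ends up as several vocabulary items.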
               

In [4]:
import transformers
tokenizer=transformers.AutoTokenizer.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1")

tokenized=tokenizer("Minulla on simpukkakoira",add_special_tokens=False) #never mind special tokens, their time will come :)
print(tokenized)

#never mind token_type_ids and attention_mask, their time will come :)

print(tokenizer.convert_ids_to_tokens(tokenized["input_ids"]))

Downloading:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/433 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/414k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/796k [00:00<?, ?B/s]

{'input_ids': [3668, 145, 22966, 1233, 16323], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}
['Minulla', 'on', 'simp', '##ukka', '##koira']


In [5]:
# Apply the tokenizer to the whole dataset using .map()

dset=dset.map(lambda x: tokenizer(x["text"],add_special_tokens=True))

0ex [00:00, ?ex/s]

0ex [00:00, ?ex/s]

0ex [00:00, ?ex/s]

In [6]:
print(dset["train"][0])

{'label': 0, 'text': 'Olimme <TARGET>Finnmatkoilla</TARGET> oppaat olivat ystävällisiä ja olivat mukana alusta loppuun.', 'input_ids': [102, 15491, 5571, 16307, 50051, 50073, 12355, 2377, 7937, 38667, 5571, 499, 16307, 50051, 50073, 12355, 2377, 27588, 129, 1141, 36579, 142, 1141, 1454, 2915, 2872, 111, 103], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


# Input encoding for MLP

* The simplest encoding sets the input to 1 for every vocabulary item present in the text and to 0 for all others
* E.g. with a vocabulary of size 5 and tokens `[0,3]` present, the input vector should be `[1,0,0,1,0]`
* The simple code below does just that:

In [7]:
import torch
# These are the ids which we want to set to 1
input_ids=torch.tensor([[0,0,1],[0,2,3]])
# These are the 1s we will be copying over
ones=torch.ones_like(input_ids,dtype=torch.float)
# This is the target, initialized to zeros
zeros=torch.zeros((2,5))
# Scatter says: 
#   work on dimension 1
#   `input_ids` are the indices to set
#   `ones` are the values to set
zeros=zeros.scatter(1,input_ids,ones)
print(zeros)
# see how in the first row indices 0 and 1 are set to 1
# and in the second row indices 0,2,3 are set to 1
# exactly as it was supposed to be!

tensor([[1., 1., 0., 0., 0.],
        [1., 0., 1., 1., 0.]])
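
As a cross-check of what `scatter` computes here, the same encoding can be written with plain Python lists. This is only an illustrative sketch; the helper name `multi_hot` is an assumption and not part of any library.

```python
def multi_hot(token_ids, vocab_size):
    """Return a list with 1.0 at every index that occurs in token_ids, else 0.0."""
    vec = [0.0] * vocab_size
    for i in token_ids:
        vec[i] = 1.0  # repeated ids simply overwrite 1.0 with 1.0
    return vec

batch = [[0, 0, 1], [0, 2, 3]]
encoded = [multi_hot(ids, vocab_size=5) for ids in batch]
print(encoded)
print(multi_hot([0, 3], vocab_size=5))
```

Note that this encoding discards both the order of the tokens and how often each one occurs.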


# Build the model

* A model in its simplest form has `__init__()`, which instantiates the layers, and `forward()`, which implements the actual computation

In [54]:
import torch

# A model wants a config, I can simply inherit from the base
# class for pretrained configs
class MLPConfig(transformers.PretrainedConfig):
    pass

# This is the model
class MLP(transformers.PreTrainedModel):

    # In the initialization method, one instantiates the layers
    # these will be the parameters of the model
    def __init__(self,config):
        super().__init__(config)
        self.vocab_size=config.vocab_size
        # Hidden layer: input size x hidden size
        self.hidden=torch.nn.Linear(in_features=self.vocab_size,out_features=config.hidden_size)
        # Output layer: hidden size x output size
        self.output=torch.nn.Linear(in_features=config.hidden_size,out_features=config.nlabels)
        
    # The computation of the model is put into the forward() function
    # it receives a batch of data and optionally the correct `labels`
    #
    # If given `labels` it returns (loss,output)
    # if not, then it returns (output,)
    def forward(self,input_ids,labels=None,attention_mask=None): #never mind the attention_mask, its time will come; the data collator insists on adding it
        # The batch is in input_ids
        batch_size=input_ids.shape[0] #this is how many examples we have
        # The following block converts the input ids into a suitable input for
        # the input layer, it is adapted from above
        # Note: padded positions are not masked out, so the padding token id is also set to 1
        features=torch.zeros((batch_size,self.vocab_size),dtype=torch.float,device=input_ids.device)
        ones=torch.ones_like(input_ids,dtype=torch.float)
        features=features.scatter(1,input_ids,ones)
        projected=torch.tanh(self.hidden(features)) #Note how non-linearity is applied here and not when configuring the layer in __init__()
        logits=self.output(projected)
        
        # We have labels, so we ought to calculate the loss
        if labels is not None:
            loss=torch.nn.CrossEntropyLoss() #This loss is useful for classification
            # You run it as loss(model_output,correct_labels)
            return (loss(logits,labels),logits)
        else:
            # No labels, so just return the logits
            return (logits,)

# Configure the model:
#   these parameters are used in the model's __init__()
mlp_config=MLPConfig(vocab_size=tokenizer.vocab_size,hidden_size=100,nlabels=4)
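
To illustrate what the cross-entropy loss in `forward()` measures, the sketch below computes it by hand for a tiny batch: the negative log-probability of the correct class under a softmax over the logits, averaged over the batch. The logits and labels are made up, and the helper `cross_entropy` is a plain-Python stand-in, not the PyTorch implementation.

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of `label` under softmax(logits), for one example."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

# Made-up logits for a batch of two examples with four classes
batch_logits = [[2.0, 0.5, -1.0, 0.0],
                [0.0, 0.0, 0.0, 0.0]]
batch_labels = [0, 3]

losses = [cross_entropy(l, y) for l, y in zip(batch_logits, batch_labels)]
mean_loss = sum(losses) / len(losses)
print(losses, mean_loss)
```

The second example has equal logits for all four classes, so its loss is ln 4 (about 1.386). That value is a handy reference point: a four-class model whose loss is near it is doing no better than uniform guessing.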



# Model training

* Hugging Face `Trainer`
  * Loads of arguments that control the training
  * The data collator builds the batches, padding the examples to equal length
  * The early stopping callback stops training when the evaluation loss no longer improves
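
As a rough sketch of what a padding data collator does when it builds a batch, the snippet below pads variable-length id lists to the length of the longest one and builds the matching attention masks. The helper `pad_batch`, the example ids, and the pad id of 0 are assumptions for illustration; the real collator takes the pad id from the tokenizer and returns tensors.

```python
def pad_batch(sequences, pad_id=0):
    """Pad variable-length id lists to the longest one and build attention masks."""
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two made-up tokenized examples of different length
batch = pad_batch([[102, 3668, 145, 103], [102, 145, 103]])
print(batch["input_ids"])
print(batch["attention_mask"])

# Token ids that are real (mask == 1) in the second, shorter example
real_ids = [i for i, m in zip(batch["input_ids"][1], batch["attention_mask"][1]) if m == 1]
print(real_ids)
```

The attention mask is what lets a model tell real tokens from padding. A model that ignores it, like the MLP in this notebook, treats the pad id as just another token that is present in the input; one possible refinement would be to use the mask values in place of the all-ones values when building the input vector.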
  

In [55]:
# Instantiate the model  
mlp=MLP(mlp_config)

# Now it's ready to train

trainer_args=transformers.TrainingArguments("mlp_checkpoints",
                                            evaluation_strategy="steps",
                                            logging_strategy="steps",
                                            eval_steps=100,
                                            logging_steps=100,
                                            learning_rate=5e-4,
                                            max_steps=5000,
                                            load_best_model_at_end=True)

data_collator=transformers.DataCollatorWithPadding(tokenizer)

early_stopping=transformers.EarlyStoppingCallback(3) #patience of 3: stop after 3 evaluations in a row without improvement
trainer=transformers.Trainer(model=mlp,
                             args=trainer_args,
                             train_dataset=dset["train"],
                             eval_dataset=dset["validation"],
                             data_collator=data_collator,
                             callbacks=[early_stopping])
trainer.train()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
max_steps is given, it will override any value given in num_train_epochs
The following columns in the training set  don't have a corresponding argument in `MLP.forward` and have been ignored: token_type_ids, text. If token_type_ids, text are not expected by `MLP.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1864
  Num Epochs = 22
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 5000


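| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100 | 1.1296 | 1.033874 |
| 200 | 1.0216 | 0.981771 |
| 300 | 0.7946 | 0.920572 |
| 400 | 0.6572 | 0.920484 |
| 500 | 0.5288 | 0.951628 |
| 600 | 0.3928 | 0.996256 |
| 700 | 0.3603 | 1.041534 |
| 800 | 0.2256 | 1.081407 |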


The following columns in the evaluation set  don't have a corresponding argument in `MLP.forward` and have been ignored: token_type_ids, text. If token_type_ids, text are not expected by `MLP.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 233
  Batch size = 8

TrainOutput(global_step=800, training_loss=0.6388063192367553, metrics={'train_runtime': 6.6147, 'train_samples_per_second': 6047.119, 'train_steps_per_second': 755.89, 'total_flos': 9199481759424.0, 'train_loss': 0.6388063192367553, 'epoch': 3.43})
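
The idea behind patience-based early stopping can be sketched in a few lines. This is a simplified stand-in, not the `EarlyStoppingCallback` implementation: the helper `early_stop_step` is hypothetical, and the Trainer's own bookkeeping of the best metric may differ in detail.

```python
def early_stop_step(steps, losses, patience):
    """Return the step at which training stops, or None if patience is never exhausted."""
    best = float("inf")
    bad_evals = 0
    for step, loss in zip(steps, losses):
        if loss < best:
            best, bad_evals = loss, 0   # new best: reset the counter
        else:
            bad_evals += 1              # no improvement at this evaluation
            if bad_evals >= patience:
                return step
    return None

# Validation losses copied from the training log above
steps = [100, 200, 300, 400, 500, 600, 700, 800]
val_losses = [1.033874, 0.981771, 0.920572, 0.920484, 0.951628, 0.996256, 1.041534, 1.081407]

best_step = steps[val_losses.index(min(val_losses))]
print("best step:", best_step)
print("stop step:", early_stop_step(steps, val_losses, patience=3))
```

On these numbers the sketch finds the lowest validation loss at step 400 and stops at step 700, whereas the logged run stopped at step 800, so the Trainer evidently does not track the best value in exactly this way. The overall pattern is the same: the training loss keeps falling while the validation loss rises, which is the overfitting that early stopping is meant to catch.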

In [56]:
p=trainer.predict(dset["test"])


The following columns in the test set  don't have a corresponding argument in `MLP.forward` and have been ignored: token_type_ids, text. If token_type_ids, text are not expected by `MLP.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 233
  Batch size = 8


In [35]:
print(p.predictions)

[[ 1.08090138e+00 -5.47813326e-02 -1.99721789e+00 -7.83405676e-02]
 [ 3.69524449e-01  7.14528739e-01 -1.95008755e+00  3.53959501e-01]
 [ 9.83934045e-01  9.41887319e-01 -2.31117988e+00 -3.83542389e-01]
 [-3.50395858e-01  4.18187559e-01 -1.43816924e+00  1.15239048e+00]
 [ 1.30936575e+00  2.70491242e-01 -2.14849186e+00 -4.87339199e-01]
 [ 1.01546860e+00 -8.97850841e-02 -1.94118166e+00  1.84918940e-03]
 [ 3.14074129e-01  1.15226364e+00 -2.03308868e+00  1.41028270e-01]
 [ 1.62899971e-01  1.10502172e+00 -1.84038258e+00  2.26572528e-01]
 [ 8.95214677e-01  7.03578472e-01 -2.14888072e+00 -2.55215108e-01]
 [ 1.09011209e+00  4.82480824e-01 -2.17213988e+00 -3.56718361e-01]
 [ 7.13776469e-01  5.65811217e-01 -2.07716441e+00  6.46461993e-02]
 [ 1.16213596e+00  5.31934679e-01 -2.36070848e+00 -2.75549412e-01]
 [ 1.67490810e-01  6.71655715e-01 -1.76645350e+00  4.93649900e-01]
 [ 7.49501824e-01  1.71850294e-01 -1.99872386e+00  2.66508996e-01]
 [-8.16112399e-01  1.45339394e+00 -1.34600854e+00  8.85945559e

In [57]:
predictions=p.predictions.argmax(-1)
print("Predicted",predictions)
print(p.label_ids)
print(sum(p.label_ids==predictions)/len(predictions))


Predicted [0 1 1 3 0 0 1 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0 1 0 0
 1 1 0 1 0 0 0 1 1 0 0 1 1 0 0 1 3 0 0 0 1 0 0 0 0 1 1 1 0 0 1 0 1 3 0 1 0
 0 1 0 1 0 0 1 1 0 0 0 1 0 0 1 1 1 1 0 0 1 0 1 1 0 3 0 1 3 3 0 1 0 1 0 0 1
 0 0 1 1 1 0 0 0 0 0 0 1 3 0 1 0 0 3 0 1 0 0 1 0 0 0 0 0 1 1 1 0 1 0 0 0 1
 1 0 0 0 0 0 1 1 0 0 0 1 1 1 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 1 1 1 0 3 0
 0 3 0 0 0 0 0 1 1 0 0 0 0 0 1 3 0 1 0 0 0 1 0 0 0 0 0 1 3 0 1 1 0 0 1 0 0
 0 1 0 1 1 0 1 0 0 1 0]
[0 1 1 1 1 3 3 1 0 0 0 0 0 3 3 0 1 0 0 0 0 0 0 0 3 3 3 1 3 0 1 3 0 0 1 0 0
 1 1 0 1 1 0 3 3 1 0 0 3 3 3 0 1 0 0 3 0 3 3 1 0 0 1 1 0 1 0 3 0 1 1 0 1 1
 0 3 0 0 0 0 3 1 3 3 1 1 1 1 1 3 1 3 1 0 1 1 1 1 0 0 0 1 3 1 1 3 0 1 1 3 3
 0 0 0 1 1 0 0 3 0 1 1 1 0 0 3 1 0 3 1 1 0 0 3 0 0 0 0 1 1 3 0 1 3 0 0 0 1
 1 3 0 0 1 0 1 3 3 0 0 1 1 1 0 0 3 0 3 1 1 1 0 1 0 1 1 1 3 0 1 3 1 0 0 1 0
 1 1 0 1 0 1 0 1 0 0 1 3 0 3 0 3 0 0 1 0 1 1 0 0 0 0 0 3 1 0 0 0 0 0 3 0 3
 1 3 0 1 3 1 1 0 0 0 0]
0.5622317596566524
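
A single accuracy number hides which classes get confused with which. The sketch below, written in plain Python for illustration, computes accuracy, a majority-class baseline, and (gold, predicted) pair counts on the first ten predictions and gold labels copied from the output above; ten examples are far too few to draw conclusions from, so this only shows the mechanics.

```python
from collections import Counter

# First ten predictions and gold labels copied from the output above
predictions = [0, 1, 1, 3, 0, 0, 1, 1, 1, 0]
gold = [0, 1, 1, 1, 1, 3, 3, 1, 0, 0]

accuracy = sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# Always predicting the most frequent gold label is a simple baseline
majority_label, majority_count = Counter(gold).most_common(1)[0]
baseline = majority_count / len(gold)

# (gold, predicted) pair counts, i.e. a sparse confusion matrix
confusion = Counter(zip(gold, predictions))

print(accuracy, majority_label, baseline)
for (g, p), n in sorted(confusion.items()):
    print(f"gold={g} predicted={p}: {n}")
```

Running the same computation over all of `p.label_ids` and `predictions` would show how the overall accuracy compares with the majority baseline on the full test set.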
