# Setup

Before we start running our own Python code, install the required Python packages using [pip](https://en.wikipedia.org/wiki/Pip):

* [`transformers`](https://huggingface.co/docs/transformers/index) is a popular deep learning package
* [`datasets`](https://huggingface.co/docs/datasets/) provides support for loading, creating, and manipulating datasets

Both of these packages are used extensively on this course.

In [1]:
!pip3 install -q transformers datasets

(Above, the `!` at the start of the line tells the notebook to run the line as an operating system command rather than Python code, and the `-q` argument to `pip` runs the command in "quiet" mode, with less output.)

We'll also use the [`pprint`](https://docs.python.org/3/library/pprint.html) ("pretty-print") module to format output more readably below. The only difference from using plain `print` is that some data structures will be easier to read and interpret.

In [2]:
from pprint import PrettyPrinter

pprint = PrettyPrinter(compact=True).pprint
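To illustrate the difference, the following small sketch formats a made-up record (the `record` dictionary and its contents are assumptions for illustration only) with both plain `print` and a compact `PrettyPrinter`:

```python
from pprint import PrettyPrinter

printer = PrettyPrinter(compact=True)

# A hypothetical record, too wide to fit on one 80-character line
record = {
    "label": 0,
    "text": "An example sentence that is long enough to need wrapping when printed.",
    "input_ids": list(range(100, 130)),
}

formatted = printer.pformat(record)  # the same text that printer.pprint() would print

print(record)      # plain print: everything on a single long line
print(formatted)   # pretty-printed: keys sorted, long values wrapped over several lines
```

The `compact=True` option packs as many list items as fit on each line instead of printing one item per line.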

# Download and prepare data

The data used in this notebook is available online as a single file in [tab-separated values](https://en.wikipedia.org/wiki/Tab-separated_values) (TSV) format. We first use the [`wget`](https://en.wikipedia.org/wiki/Wget) operating system command to download the data to the computer running this notebook.

In [3]:
!wget -q -nc https://raw.githubusercontent.com/TurkuNLP/sentiment-target-corpus/main/sentiment-target-fi.tsv

(Above, the `-q` and `-nc` arguments tell `wget` to run in "quiet" mode and not to download the file again if it is already found on the computer. The latter option is useful in particular if running a notebook more than once.)

Let's have a look at the first few lines of the file to get an idea of its contents.

In [4]:
lines = []
with open("sentiment-target-fi.tsv") as f:    # open file
    for line in f:    # iterate over lines in file
        line = line.rstrip("\n")    # remove newline character
        lines.append(line)    # store line in list

for line in lines[:10]:    # iterate over first ten lines
    print(line)    # print each line

#ID	text	majority	agreement	annotations
comments2015d-45552-1.1	<TARGET>Kuopionlahden</TARGET> vierasvenelaituri (kuten nimi oikeasti on) on mielestämme tosi kiva.	positive	0.6666666666666666	positive,positive,neither,positive,neither,positive
comments2014a-469816-3.1	Käytetyssä <TARGET>Meganessa</TARGET> yksi etu verrattuna vaikkapa Skoda Octaviaan on se että Meganen voi ostaa jonkun vuoden uudempana.	positive	1.0	positive,positive,positive,positive,positive,positive
comments2007d-182055-6.1	Monet heistä ovat olleet kymmeniä vuosia <TARGET>Finnairin</TARGET> palveluksessa ja nyt tulee kenkää.	negative	0.8	negative,negative,negative,neither,negative
threads2013a-835251-2.1	<TARGET>Keskusta</TARGET> on aina ollut (ja on oleva) muun Suomen tuki ja turva.	positive	1.0	positive,positive,positive
comments2009a-483243-7.1	Itselläni on vuosimallin -93 <TARGET>LS-400</TARGET> ja olen tyytyväinen.	positive	1.0	positive,positive,positive,positive,positive
comments2003a-590164-4.1	<TARGET>The Bea

We see above that the first line is a "header" that starts with `#` (the character used for comments in Python code), and that each line contains five fields separated by TAB characters:

* `ID`: unique identifier (not required here)
* `text`: the text of the example, with sentiment target marked with `<TARGET>...</TARGET>`
* `majority`: the label assigned to the example by the majority of annotators
* `agreement`: the fraction of annotators that assigned the majority label
* `annotations`: a comma-separated list of all of the annotations for the example
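As a quick check of how these fields relate, here is a minimal sketch that splits one data line (copied from the output above) into its five fields and recomputes `agreement` from `annotations`. The variable names are chosen for illustration only.

```python
from collections import Counter

# One data line copied from the output above (fields separated by TAB characters)
line = (
    "comments2007d-182055-6.1\t"
    "Monet heistä ovat olleet kymmeniä vuosia <TARGET>Finnairin</TARGET> "
    "palveluksessa ja nyt tulee kenkää.\t"
    "negative\t0.8\tnegative,negative,negative,neither,negative"
)

example_id, text, majority, agreement, annotations = line.split("\t")
votes = annotations.split(",")    # individual annotator labels
counts = Counter(votes)           # label -> number of annotators who chose it

# Fraction of annotators who chose the majority label
recomputed = counts[majority] / len(votes)
print(majority, dict(counts), recomputed)
```

Four of the five annotators chose `negative`, which gives the agreement value 0.8 stored in the file.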

## Reformat data for `datasets` package

* You really only need to do this once
* Make a [*json lines*](https://en.wikipedia.org/wiki/JSON_streaming#Line-delimited_JSON) (JSONL) file with one [JSON](https://en.wikipedia.org/wiki/JSON)-encoded example per line
* Each example has the `text` and the `label` as an integer
* We have four different labels in this particular data

In [5]:
import json

label_names=["positive","negative","reject","neither"]
data=[]
with open("sentiment-target-fi.tsv") as f:
    for line in f:
        line=line.rstrip("\n")
        if not line or line.startswith("#"): #skip empty and comments (incl. header)
            continue
        cols=line.split("\t")
        if len(cols)!=5: #skip weird lines that don't have the right number of columns
            continue
        data.append(cols)

We'll randomly shuffle the data to ensure that our split into training, validation and test data is random.

Note that here we call [`random.seed`](https://docs.python.org/3/library/random.html#bookkeeping-functions) with an arbitrary integer value so that the shuffle always results in the same split. This is not necessary (and must be avoided if you e.g. need different shuffles!), but it is useful here as it makes the results in this notebook more stable.

In [6]:
import random

random.seed(1234)    # seed with arbitrary integer value
random.shuffle(data) # produces arbitrary but fixed order
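The effect of seeding can be checked on a small list. This is a minimal sketch; the helper `seeded_shuffle` is a hypothetical name introduced only for this illustration.

```python
import random

items = list(range(10))

def seeded_shuffle(values, seed):
    """Return a shuffled copy of `values`, seeding the generator first."""
    shuffled = list(values)
    random.seed(seed)
    random.shuffle(shuffled)
    return shuffled

first = seeded_shuffle(items, 1234)
second = seeded_shuffle(items, 1234)

print(first == second)          # same seed, same order
print(sorted(first) == items)   # shuffling only reorders, nothing is lost
```

Calling `random.seed` again with the same value resets the generator, so the second shuffle repeats the first one exactly.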

Now, let's cast the data into simple Python dictionaries and save these as JSONL.

In [7]:
with open("sentiment-data.jsonl","wt") as f:
    for cols in data:
        item = {
            "text": cols[1],
            "label": label_names.index(cols[2]),    # translate from label strings to integers
        }
        print(json.dumps(item,ensure_ascii=False,sort_keys=True),file=f)
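The `ensure_ascii=False` argument used above matters for Finnish text. The sketch below, which uses a made-up `item`, compares the default escaped output with the readable form. Both decode to the same data.

```python
import json

# Hypothetical example with the same keys as the items written above
item = {"text": "Mä pidän <TARGET>tästä</TARGET>", "label": 0}

escaped = json.dumps(item, sort_keys=True)                       # default: ensure_ascii=True
readable = json.dumps(item, ensure_ascii=False, sort_keys=True)  # as in the cell above

print(escaped)    # non-ASCII characters written as \uXXXX escapes
print(readable)   # non-ASCII characters kept as-is
```

In both cases each example is serialized onto a single line, which is what makes the one-example-per-line JSONL layout work.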

Let's read back the file we just created and print one example to make sure it looks as intended:

In [8]:
examples = []
with open("sentiment-data.jsonl") as f:
    for line in f:
        example = json.loads(line)
        examples.append(example)

pprint(examples[0])

{'label': 0,
 'text': 'Mä pidän eniten <TARGET>Blessed Be</TARGET>-levystä, joten varsinkin '
         'ne biisit olivat mun mieleen!'}


## Load data with `datasets`

Every popular framework has its own preferred way of representing data. Here we use the Hugging Face [`datasets`](https://huggingface.co/docs/datasets/) library; it is widely used, so it is worth getting acquainted with.

The `datasets` library provides broad support for [loading datasets](https://huggingface.co/docs/datasets/loading_datasets.html) through the `load_dataset` function, which we will use here to load the JSONL-formatted data we prepared just above.

In [9]:
import datasets

file_name = "sentiment-data.jsonl"
dataset = datasets.load_dataset(
    'json',                             # Format of the data
    data_files={"everything":file_name},    # All data files, here we only have one
    split={
        "train":"everything[:80%]",  # First 80% is the train set
        "validation":"everything[80%:90%]",   # Next 10% is the validation/dev set
        "test":"everything[90%:]"    # last 10% is the test set
    },
    features=datasets.Features({    # Here we tell how to interpret the attributes
        "label":datasets.ClassLabel(names=label_names),
        "text":datasets.Value("string")
    })
)

Using custom data configuration default-a44919187b39c1e6


Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-a44919187b39c1e6/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b...


Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-a44919187b39c1e6/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b. Subsequent calls will reuse this data.


Let's see what's in the dataset that we just loaded:

In [10]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 1864
    })
    validation: Dataset({
        features: ['label', 'text'],
        num_rows: 233
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 233
    })
})


As intended, we see here a dataset in three parts, `train`, `validation`, and `test`, where each has the `text` and `label` values.
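The split sizes follow from the 80/10/10 percentages. The sketch below redoes the arithmetic with plain integers; it is an illustration only, and the library's own rounding for percent slices may differ when the sizes do not divide evenly.

```python
# Total number of examples, taken from the split sizes printed above
total = 1864 + 233 + 233

# Boundaries corresponding to "everything[:80%]", "[80%:90%]" and "[90%:]"
train_end = total * 80 // 100
validation_end = total * 90 // 100

sizes = {
    "train": train_end,
    "validation": validation_end - train_end,
    "test": total - validation_end,
}
print(sizes)
```

Because the data was shuffled before it was written to the JSONL file, taking consecutive slices in this way still gives random splits.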

We can look at individual values e.g. as follows:

In [11]:
print(dataset['train']['text'][0])
print(dataset['train']['label'][0])

Mä pidän eniten <TARGET>Blessed Be</TARGET>-levystä, joten varsinkin ne biisit olivat mun mieleen!
0


---

**Side note**: reducing logging output

The `transformers` library by default produces fairly verbose logging output. Running the following will disable low-priority output (`INFO` logging level and below). You can skip running this or comment out the following lines to see more complete logs.

In [12]:
import logging

logging.disable(logging.INFO)
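What `logging.disable` does can be checked with the standard library alone. In this sketch the logger name `"example"` is an arbitrary choice for illustration.

```python
import logging

logger = logging.getLogger("example")  # hypothetical logger name

logging.disable(logging.INFO)          # suppress INFO and anything less severe

info_enabled = logger.isEnabledFor(logging.INFO)
warning_enabled = logger.isEnabledFor(logging.WARNING)
print(info_enabled, warning_enabled)

logging.disable(logging.NOTSET)        # undo: restore normal logging behaviour
```

Warnings and errors still get through, so disabling `INFO` hides routine progress messages without hiding problems.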

---

## Tokenize and vectorize data

Each text in our dataset is a character string. As you know, machine learning methods do not generally operate directly on character strings. Instead, text is typically _tokenized_ and _vectorized_ for input to machine learning methods.

Here, we will use previously created tokenizers through the simple [`AutoTokenizer`](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer) class. The `AutoTokenizer.from_pretrained` method can load the tokenizer associated with any of the large number of models found in the [Hugging Face models repository](https://huggingface.co/models).

In [13]:
import transformers

model_name = "TurkuNLP/bert-base-finnish-cased-v1"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

The tokenizers in the `transformers` library take a text string as argument and both split the text into tokens (tokenization) and map those tokens into integer identifiers (vectorization).

The values returned by the tokenizer depend on the model that the tokenizer was created for, but normally include at least `input_ids`, which are the result of tokenization and vectorization:

In [14]:
pprint(tokenizer("Tämä on esimerkkilause"))

{'attention_mask': [1, 1, 1, 1, 1, 1, 1],
 'input_ids': [102, 1131, 145, 6755, 6917, 50010, 103],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0]}


We can ignore the values other than `input_ids` returned by the tokenizer (`token_type_ids` and `attention_mask`) for now. We will come back to these later in the course.
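To make the two steps concrete, here is a toy sketch of tokenization followed by vectorization. The vocabulary, the whitespace splitting and the function name `toy_tokenize` are all assumptions made for illustration. The real tokenizer loaded above has a far larger vocabulary and splits words into subword pieces, which is why the three-word sentence above produced more than three ids. The ids 102 and 103 that begin and end its output appear to play the same role as the special tokens here.

```python
# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of entries
toy_vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "tämä": 3, "on": 4, "esimerkki": 5}

def toy_tokenize(text):
    """Split on whitespace, then map each token to its integer id."""
    # Tokenization: split into tokens and add special start/end tokens
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    # Vectorization: look up each token, falling back to the unknown-token id
    input_ids = [toy_vocab.get(token, toy_vocab["[UNK]"]) for token in tokens]
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}

encoded = toy_tokenize("Tämä on esimerkki")
unknown = toy_tokenize("Tämä on outo")   # "outo" is not in the toy vocabulary
print(encoded)
print(unknown)
```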

We can now tokenize and vectorize our whole dataset with the tokenizer by calling the [`Dataset.map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function:

In [15]:
# Define a simple function that applies the tokenizer
def tokenize(example):
    return tokenizer(example["text"])

# Apply the tokenizer to the whole dataset using .map()
dataset = dataset.map(tokenize)

**Note**: unlike Python's built-in `map` function, [`Dataset.map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) merges the values returned by the function into each example and keeps the existing values. It returns the result as a new dataset, which is why we assign it back to `dataset` above. Here, the call adds the values returned by the `tokenizer()` call (`input_ids` etc.) to each example while also keeping the original `text` and `label` values.

In [16]:
pprint(dataset["train"][0])

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'input_ids': [102, 2175, 6623, 3592, 5571, 16307, 50051, 50073, 12355, 2377,
               22145, 1296, 369, 9627, 5571, 499, 16307, 50051, 50073, 12355,
               2377, 166, 5390, 492, 119, 861, 3213, 475, 33571, 1141, 1662,
               3063, 380, 103],
 'label': 0,
 'text': 'Mä pidän eniten <TARGET>Blessed Be</TARGET>-levystä, joten varsinkin '
         'ne biisit olivat mun mieleen!',
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
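The way `Dataset.map` combines existing and new values can be imitated with plain dictionaries. The following is only a sketch of the idea, not the library's implementation; `toy_map` and `add_length` are hypothetical names.

```python
def toy_map(examples, function):
    """Return new examples: each original dict merged with the function's output."""
    return [{**example, **function(example)} for example in examples]

def add_length(example):
    # Hypothetical function that returns one new value per example
    return {"n_chars": len(example["text"])}

original = [{"text": "kiva", "label": 0}, {"text": "huono", "label": 1}]
mapped = toy_map(original, add_length)

print(mapped)     # each example now has text, label and n_chars
print(original)   # the input list is left unchanged
```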


# Input encoding for MLP

* The simplest way is to set every input that is present to 1 and leave the rest at 0
* E.g. if our vocabulary size is 5 and tokens `[0,3]` are present, we want `[1,0,0,1,0]` as the input
* The simple code below does just that:

In [17]:
import torch

# These are the ids which we want to set to 1
input_ids=torch.tensor([[0,0,1],[0,2,3]])

# These are the 1s we will be copying over
ones=torch.ones_like(input_ids,dtype=torch.float)

# This is the target, initialized to zeros
zeros=torch.zeros((2,5))

# Scatter says: 
#   work on dimension 1
#   `input_ids` are the indices to set
#   `ones` are the values to set
zeros=zeros.scatter(1,input_ids,ones)

print(zeros)
# see how in the first row indices 0 and 1 are set to 1
# and in the second row indices 0,2,3 are set to 1
# exactly as it was supposed to be!

tensor([[1., 1., 0., 0., 0.],
        [1., 0., 1., 1., 0.]])
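For comparison, the same encoding can be written with plain Python lists. This sketch (the function name `multi_hot` is chosen for illustration) does one example at a time, whereas `scatter` handles the whole batch in a single tensor operation.

```python
def multi_hot(token_ids, vocab_size):
    """Return a list of length vocab_size with 1.0 at every id in token_ids."""
    vector = [0.0] * vocab_size
    for token_id in token_ids:
        vector[token_id] = 1.0    # a repeated id just sets the same position again
    return vector

batch = [[0, 0, 1], [0, 2, 3]]    # same ids as in the scatter example above
encoded = [multi_hot(ids, vocab_size=5) for ids in batch]
print(encoded)
```

Note that this representation discards both the order of the tokens and how many times each one occurs.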


# Build the model

* A model in its simplest form has an `__init__()` method, which instantiates the layers, and a `forward()` method, which implements the actual computation

In [18]:
import torch

# A model wants a config, I can simply inherit from the base
# class for pretrained configs
class MLPConfig(transformers.PretrainedConfig):
    pass

# This is the model
class MLP(transformers.PreTrainedModel):

    # In the initialization method, one instantiates the layers
    # these will be the parameters of the model
    def __init__(self,config):
        super().__init__(config)
        self.vocab_size=config.vocab_size
        # Hidden layer: input size x hidden size
        self.hidden=torch.nn.Linear(in_features=self.vocab_size,out_features=config.hidden_size)
        # Output layer: hidden size x output size
        self.output=torch.nn.Linear(in_features=config.hidden_size,out_features=config.nlabels)
        
    # The computation of the model is put into the forward() function
    # it receives a batch of data and optionally the correct `labels`
    #
    # If given `labels` it returns (loss,output)
    # if not, then it returns (output,)
    #
    # `attention_mask` is accepted but not used here (its time will come):
    # the data collator insists on adding it to every batch
    def forward(self,input_ids,labels=None,attention_mask=None):
        # The batch is in input_ids
        batch_size=input_ids.shape[0] #this is how many examples we have
        # The following block converts the input ids into a suitable input for
        # the input layer, it is adapted from above
        input=torch.zeros((batch_size,self.vocab_size),dtype=torch.float,device=input_ids.device)
        ones=torch.ones_like(input_ids,dtype=torch.float)
        input=input.scatter(1,input_ids,ones)
        projected=torch.tanh(self.hidden(input)) #Note how non-linearity is applied here and not when configuring the layer in __init__()
        logits=self.output(projected)
        
        # We have labels, so we ought to calculate the loss
        if labels is not None:
            loss=torch.nn.CrossEntropyLoss() #This loss is useful for classification
            # You run it as loss(model_output,correct_labels)
            return (loss(logits,labels),logits)
        else:
            # No labels, so just return the logits
            return (logits,)

# Configure the model:
#   these parameters are used in the model's __init__()
mlp_config=MLPConfig(vocab_size=tokenizer.vocab_size,hidden_size=100,nlabels=4)
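To give a feel for the loss used in `forward()`, here is a standard-library sketch of softmax cross-entropy for a single example. It is meant as an illustration of the formula rather than a replacement for `torch.nn.CrossEntropyLoss`, which by default also averages the per-example values over the batch.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example (natural logarithm)."""
    highest = max(logits)    # subtracting the maximum keeps exp() numerically stable
    log_sum_exp = highest + math.log(sum(math.exp(x - highest) for x in logits))
    return log_sum_exp - logits[label]    # equals -log(softmax(logits)[label])

uniform = cross_entropy([0.0, 0.0, 0.0, 0.0], label=2)     # no preference among 4 labels
confident = cross_entropy([5.0, 0.0, 0.0, 0.0], label=0)   # strongly favours the correct label
print(uniform, math.log(4), confident)
```

With four labels, a model that has no preference at all scores ln 4 (about 1.386), which is a useful reference point when reading training and validation losses.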



# Model training

* We train with the Hugging Face `Trainer`
  * Loads of arguments control the training
  * The data collator builds the batches
  * The early stopping callback stops training when the evaluation loss no longer improves
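The idea behind early stopping with a patience value can be sketched in a few lines. The losses below are made up, and the function `early_stopping_index` is a hypothetical illustration. The actual callback in `transformers` may differ in detail, for example in when it records the best value so far.

```python
def early_stopping_index(eval_losses, patience):
    """Return the index of the evaluation at which training would stop, or None."""
    best = float("inf")
    evaluations_without_improvement = 0
    for index, loss in enumerate(eval_losses):
        if loss < best:
            best = loss                              # new best: reset the counter
            evaluations_without_improvement = 0
        else:
            evaluations_without_improvement += 1     # no improvement this time
            if evaluations_without_improvement >= patience:
                return index
    return None

# Hypothetical validation losses, one per evaluation
losses = [1.00, 0.90, 0.85, 0.87, 0.86, 0.88, 0.80]
stop_at = early_stopping_index(losses, patience=3)
print(stop_at)
```

In this made-up sequence training stops before the final, lower loss is ever reached, which shows the trade-off in choosing a small patience value.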
  

In [19]:
import numpy as np

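# Note: `datasets.load_metric` was deprecated and later removed from `datasets`;
# with recent versions, use `evaluate.load("accuracy")` from the separate `evaluate` package
accuracy = datasets.load_metric("accuracy")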

def compute_accuracy(outputs_and_labels):
    outputs, labels = outputs_and_labels
    predictions = np.argmax(outputs, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)
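What this metric computes can be spelled out without any libraries. In the sketch below the model outputs and labels are made up, and `simple_accuracy` is a hypothetical stand-in for the argmax-then-compare logic above.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def simple_accuracy(outputs, labels):
    """Fraction of examples whose highest-scoring label matches the reference."""
    predictions = [argmax(row) for row in outputs]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical model outputs (logits) for three examples and four labels
outputs = [[2.0, 0.1, 0.0, 0.3], [0.2, 1.5, 0.1, 0.0], [0.9, 0.1, 0.2, 1.1]]
labels = [0, 1, 0]

result = simple_accuracy(outputs, labels)
print(result)   # two of the three predictions match their labels
```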

In [20]:
# Instantiate the model  
mlp = MLP(mlp_config)

# Set training arguments
trainer_args=transformers.TrainingArguments(
    "mlp_checkpoints",
    evaluation_strategy="steps",
    logging_strategy="steps",
    eval_steps=100,
    logging_steps=100,
    learning_rate=5e-4,
    max_steps=5000,
    load_best_model_at_end=True
)

data_collator=transformers.DataCollatorWithPadding(tokenizer)

early_stopping=transformers.EarlyStoppingCallback(3) # patience: number of evaluations without improvement before stopping

trainer=transformers.Trainer(
    model=mlp,
    args=trainer_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    data_collator=data_collator,
    compute_metrics=compute_accuracy,
    callbacks=[early_stopping]
)

trainer.train()



| Step | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 100 | 1.1311 | 1.043762 | 0.429185 |
| 200 | 1.0175 | 0.988612 | 0.600858 |
| 300 | 0.8192 | 0.962071 | 0.515021 |
| 400 | 0.6861 | 0.935318 | 0.587983 |
| 500 | 0.6133 | 0.959736 | 0.587983 |
| 600 | 0.4237 | 1.011624 | 0.562232 |
| 700 | 0.4122 | 1.035669 | 0.587983 |
| 800 | 0.2429 | 1.081599 | 0.575107 |


TrainOutput(global_step=800, training_loss=0.6682539749145507, metrics={'train_runtime': 4.9043, 'train_samples_per_second': 8156.158, 'train_steps_per_second': 1019.52, 'total_flos': 9188898518976.0, 'train_loss': 0.6682539749145507, 'epoch': 3.43})
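The `epoch` value in this output can be reproduced by hand. The sketch below assumes the `Trainer`'s default batch size of 8 per device and a single device, since no batch size was set in the training arguments above.

```python
train_rows = 1864      # size of the train split printed earlier
batch_size = 8         # assumption: default per-device batch size, one device
global_step = 800      # from the TrainOutput above

steps_per_epoch = train_rows / batch_size   # batches needed for one pass over the data
epochs = global_step / steps_per_epoch
print(steps_per_epoch, round(epochs, 2))
```

So early stopping ended training after roughly three and a half passes over the training data, far short of the `max_steps=5000` limit.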

In [21]:
eval_results = trainer.evaluate(dataset["test"])

pprint(eval_results)

{'epoch': 3.43,
 'eval_accuracy': 0.6609442060085837,
 'eval_loss': 0.800762951374054,
 'eval_runtime': 0.1132,
 'eval_samples_per_second': 2058.777,
 'eval_steps_per_second': 265.079}


In [22]:
print('Accuracy:', eval_results['eval_accuracy'])

Accuracy: 0.6609442060085837
