# Fine-Tuning & a bit of Probing

In [1]:
# dependencies
# !pip install torch "transformers[torch]" datasets evaluate seqeval

In [2]:
import numpy as np
import pandas as pd
import torch
import datasets
import evaluate
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForTokenClassification, AutoModel
from transformers import TrainingArguments, Trainer
from transformers import DataCollatorForTokenClassification

## Recap: architecture of a Transformer model

In [3]:
model_name = "distilbert/distilbert-base-multilingual-cased"

In [4]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

In [5]:
# what do we have here?
model

DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(119547, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0-5): 6 x TransformerBlock(
        (attention): DistilBertSdpaAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): 

Explanation of some of the dimensions:

In [6]:
# number of tokens in the vocabulary
len(tokenizer.vocab)

119547

In [7]:
# maximum length of input (in tokens)
tokenizer.model_max_length

512

In [8]:
# dimensionality of representations (in neurons)
model.config.dim

768

In [9]:
# dimensionality of the feed-forward hidden layer (in neurons)
# inside each transformer block (see `lin1` in the printout above)
model.config.hidden_dim

3072

In [10]:
# total number of parameters (all of them are trainable here)
model.num_parameters()

134734080

**Question**:  
Is 134M parameters a lot?  
Compared to current state-of-the-art models, and compared to the original BERT model?

**If you want to find out at home**:   
How did we get to this number of parameters?  
In other words, which hyperparameters do we have to add and multiply to get to this number?
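
One way to check the arithmetic is to rebuild the count from the layer shapes in the printout above. This is a sketch, not the library's own bookkeeping: the printout is cut off after `lin1`, so the shape of `lin2` and the second layer norm in each block are assumptions based on the usual DistilBERT block layout, and all variable names are illustrative.

```python
# Rebuild the parameter count of the base model from its layer shapes.
vocab_size = 119547   # rows of word_embeddings
max_positions = 512   # rows of position_embeddings
dim = 768             # model.config.dim
hidden_dim = 3072     # model.config.hidden_dim
n_layers = 6          # number of transformer blocks


def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


layer_norm = 2 * dim  # one scale and one shift value per dimension

# word embeddings + position embeddings + embedding layer norm
embeddings = vocab_size * dim + max_positions * dim + layer_norm

attention = 4 * linear(dim, dim)  # q_lin, k_lin, v_lin, out_lin
# lin1 as printed; lin2 is assumed to map hidden_dim back to dim
ffn = linear(dim, hidden_dim) + linear(hidden_dim, dim)
# attention, its layer norm, the FFN, and an assumed second layer norm
block = attention + layer_norm + ffn + layer_norm

total = embeddings + n_layers * block
print(f"embeddings: {embeddings:,}")
print(f"one block:  {block:,}")
print(f"total:      {total:,}")
```

Under these assumptions the total comes out at 134,734,080, the same number `model.num_parameters()` reports. Roughly two thirds of it sits in the word-embedding matrix, a side effect of the large multilingual vocabulary.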

<br>

Now we're going to fine-tune and probe on our use case for today:
## Named Entity Recognition

Brief intro:  
Named entity recognition (NER) aims to find the tokens in unstructured text that refer to a real-world object, such as a person's name, a location, an organization, a time, a currency and so on. In practice, NER often comes down to finding proper nouns.

A popular benchmark for NER is the [CoNLL-2003](https://www.clips.uantwerpen.be/conll2003/ner/) task, which uses the categories person (PER), organization (ORG), location (LOC) and miscellaneous (MISC). For example:

*$Franz$ $Kafka_{PER}$ studied at $Charles$ $University_{ORG}$ in $Prague_{LOC}$*

Let's get the corrected version of CoNLL-2003 (`conllpp`):

In [11]:
ds = datasets.load_dataset(
    "conllpp",
    trust_remote_code=True #this time we'll trust remote code
)

In [12]:
# note, this is a DatasetDict
# which consists of three Datasets
ds

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

In [13]:
# one datapoint looks like this
ds["train"][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

NER tags are displayed as numbers when showing a single data point, but behind the scenes the column is a `ClassLabel` feature.   
This means there is a mapping from integers to label names: `0` corresponds to `O`, `1` corresponds to `B-PER`, and so on.

In [14]:
ds["train"].features["ner_tags"]

Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)

How to understand these labels?

There are different tag standards for NER.  
Probably the most used one is the IOB-format which frames the task as token classification denoting the **B**eginning, **I**nside, or **O**utside of a named entity.

- Words marked with O are not part of a named entity.
- B-* marks a single-word entity (e.g. B-LOC for Aarhus) or the first word of a multiword entity (e.g. B-ORG for Aarhus in Aarhus University).
- I-* marks the continuation of an entity (e.g. I-ORG for University in Aarhus University).
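
To make the scheme concrete, here is a minimal sketch that goes the other way: it starts from entity spans and writes out one IOB tag per token. The function name `spans_to_iob` and the `(start, end, label)` span format are made up for this illustration and are not part of any library.

```python
def spans_to_iob(n_tokens, spans):
    """Turn entity spans into one IOB tag per token.

    spans: list of (start, end, label) tuples, with `end` exclusive.
    """
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


tokens = ["Franz", "Kafka", "studied", "at", "Charles", "University", "in", "Prague"]
spans = [(0, 2, "PER"), (4, 6, "ORG"), (7, 8, "LOC")]

for token, tag in zip(tokens, spans_to_iob(len(tokens), spans)):
    print(f"{token:<12}{tag}")
```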

In [15]:
label_list = ds["train"].features["ner_tags"].feature.names
label_list

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

Let's create a dictionary to translate integers to labels.

In [16]:
id2label = {i: lab for i, lab in enumerate(label_list)}
id2label

{0: 'O',
 1: 'B-PER',
 2: 'I-PER',
 3: 'B-ORG',
 4: 'I-ORG',
 5: 'B-LOC',
 6: 'I-LOC',
 7: 'B-MISC',
 8: 'I-MISC'}
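
As a quick sanity check, the mapping can be used to decode the first training example shown earlier. The sketch below re-declares the label list so that it runs on its own; the reverse dictionary `label2id` is an addition for illustration (model configs usually store both directions).

```python
label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
id2label = dict(enumerate(label_list))
# reverse mapping, from label name back to integer id
label2id = {label: i for i, label in id2label.items()}

# first training example, copied from the output above
tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]

decoded = [id2label[t] for t in ner_tags]
for token, tag in zip(tokens, decoded):
    print(f"{token:<10}{tag}")
```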

And then we'll initialize a new model for the task:

In [17]:
del model

tokenizer = AutoTokenizer.from_pretrained(model_name)

# NB: we're changing the model to a token classification one
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(label_list))
# and adding a label dictionary
model.config.id2label = id2label

# why AutoModelForTokenClassification?
# it puts a token-level classification head on top of the base model;
# see the Auto Classes documentation for the other kinds of auto-models

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert/distilbert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [18]:
# the model looks different now
# notice the last two parts that were added
model

DistilBertForTokenClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(119547, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
  

Let's give it a go:

In [19]:
inputs = tokenizer(
    "Franz Kafka studied at Charles University in Prague", return_tensors="pt"
)

with torch.no_grad():
    logits = model(**inputs).logits

# for each token, which label index has the highest logit?
predicted_token_class_ids = logits.argmax(-1)

predicted_tokens_classes = [id2label[t.item()] for t in predicted_token_class_ids[0]]
predicted_tokens_classes

['B-LOC',
 'B-LOC',
 'I-LOC',
 'B-MISC',
 'B-MISC',
 'B-MISC',
 'B-LOC',
 'B-MISC',
 'I-ORG',
 'B-MISC']

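This is complete nonsense: the classification head was just initialized with random weights (see the warning above) and has not been trained yet. That is why we need: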

<br>

## Fine-tuning

Remember the Semantic Textual Similarity experiment from class 5?  
That was a case of zero-shot learning: we tested the capabilities of the model without preparing it for the task.

This time we want to teach the model to solve a concrete task.   
Much of this is adapted from Hugging Face's documentation.  
[Here](https://github.com/huggingface/transformers/tree/main/notebooks) you will find tutorials on how to fine-tune for many different tasks.

There is an important pre-processing step at the start:

In [20]:
example = ds["train"][0]

tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
tokens

['[CLS]',
 'EU',
 're',
 '##jects',
 'German',
 'call',
 'to',
 'boy',
 '##cott',
 'British',
 'la',
 '##mb',
 '.',
 '[SEP]']

The two great inventions of BERT-style tokenization just turned into two problems:
- Subword tokenization can split one word into multiple tokens. The tokens no longer align with the NER labels.
- We now have special tokens! They don't have NER labels.

In [21]:
example["ner_tags"]

[3, 0, 7, 0, 0, 0, 7, 0, 0]

In [22]:
# problem
len(tokens) == len(example["ner_tags"])

False

We have to:
- Tell the model to ignore special tokens. In PyTorch, the label index `-100` is ignored by the cross-entropy loss.
- Find the words that were split
- Give all pieces of a split word the same NER label
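
Worked by hand on the example above, the three steps look like this. The `word_ids` list is written out manually to mirror what the tokenizer reports for this sentence, and the `label_all_subwords` switch is a hypothetical addition that shows a common alternative: label only the first piece of each word and ignore the rest.

```python
def align_labels(word_ids, word_labels, label_all_subwords=True):
    """Expand word-level labels to token level.

    word_ids: one entry per token, giving the index of the word the token
              came from, or None for special tokens.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)                  # special token: ignored by the loss
        elif word_id != previous or label_all_subwords:
            aligned.append(word_labels[word_id])  # first piece, or every piece
        else:
            aligned.append(-100)                  # later piece: ignored by the loss
        previous = word_id
    return aligned


# [CLS] EU re ##jects German call to boy ##cott British la ##mb . [SEP]
word_ids = [None, 0, 1, 1, 2, 3, 4, 5, 5, 6, 7, 7, 8, None]
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]

print(align_labels(word_ids, ner_tags))
print(align_labels(word_ids, ner_tags, label_all_subwords=False))
```

With the default setting, the result has one label per token, which is exactly what the loss function needs.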

In [23]:
def tokenize_and_align_labels(examples):
    # tokenize pre-split words; padding=True pads to the longest example in the batch
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True, padding=True)

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        # tokens that come from the same word share a word_id
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                # special tokens and padding belong to no word: ignore them in the loss
                label_ids.append(-100)
            else:
                # every piece of a word gets the label of that word
                label_ids.append(label[word_idx])

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

In [24]:
# the function returns a dict
tokenize_and_align_labels(ds["train"][0:2])

{'input_ids': [[101, 17751, 11639, 93376, 12026, 20575, 10114, 26905, 48426, 11160, 10109, 27012, 119, 102], [101, 10979, 46006, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'labels': [[-100, 3, 0, 0, 7, 0, 0, 0, 0, 7, 0, 0, 0, -100], [-100, 1, 2, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100]]}

Btw, it's very easy to process a whole dataset with `.map()`:  
The map function adds the new columns to all the splits of the existing dataset.

In [25]:
ds_aligned = ds.map(tokenize_and_align_labels, batched=True)

The data is ready.  

Now, how do we evaluate the model?  
[Seqeval](https://github.com/chakki-works/seqeval) is designed to evaluate sequence labels in IOB notation, such as NER tags.

We'll just load it and make some format conversions.

In [26]:
metric = evaluate.load("seqeval")

In [27]:
# this is how it works
metric.compute(
    predictions=[["O", "O", "B-PER"]],
    references=[["O", "O", "B-PER"]]
)

{'PER': {'precision': np.float64(1.0),
  'recall': np.float64(1.0),
  'f1': np.float64(1.0),
  'number': np.int64(1)},
 'overall_precision': np.float64(1.0),
 'overall_recall': np.float64(1.0),
 'overall_f1': np.float64(1.0),
 'overall_accuracy': 1.0}
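
Note that precision, recall and F1 are counted per entity, not per token: a predicted entity only counts if its label and both of its boundaries match. The sketch below is a simplified re-implementation of that idea, not seqeval's actual code (for instance, it ignores an `I-` tag that does not continue an open entity); the function names are made up for this example.

```python
def extract_entities(tags):
    """Return the set of (label, start, end) entities in one IOB sequence."""
    entities, start, label = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing entity
        continues = tag.startswith("I-") and tag[2:] == label
        if start is not None and not continues:
            entities.add((label, start, i))  # close the open entity
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]        # open a new entity
    return entities


def entity_f1(references, predictions):
    """Entity-level precision, recall and F1 over a list of sentences."""
    gold, pred = set(), set()
    for n, (ref_tags, pred_tags) in enumerate(zip(references, predictions)):
        gold |= {(n, *e) for e in extract_entities(ref_tags)}
        pred |= {(n, *e) for e in extract_entities(pred_tags)}
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


references = [["B-PER", "I-PER", "O", "B-LOC"]]
predictions = [["B-PER", "O", "O", "B-LOC"]]

precision, recall, f1 = entity_f1(references, predictions)
token_accuracy = sum(r == p for r, p in zip(references[0], predictions[0])) / len(references[0])
print(precision, recall, f1, token_accuracy)
```

Here three of four tokens are right, but only one of the two entities is recovered exactly, so the entity-level scores are lower than the token accuracy.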

Handling data formats (optional):  
We will need to do a bit of post-processing on our predictions.
- select the predicted index (with the maximum logit) for each token
- convert it to its string label
- ignore everywhere we set a label of -100
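
On a toy input, the three steps above look like this. The logits and the shortened three-label list are invented for the illustration.

```python
import numpy as np

label_list = ["O", "B-PER", "I-PER"]  # shortened label set for the example

# logits for one sentence: 4 tokens x 3 labels
logits = np.array([[
    [2.0, 0.1, 0.3],   # [CLS]
    [0.2, 3.1, 0.4],   # "Franz"
    [0.1, 0.5, 2.7],   # "Kafka"
    [1.9, 0.2, 0.1],   # [SEP]
]])
labels = np.array([[-100, 1, 2, -100]])  # -100 marks the special tokens

# 1. index of the highest logit for each token
predicted_ids = logits.argmax(axis=2)

# 2. + 3. convert to strings, skipping positions labelled -100
true_predictions = [
    [label_list[p] for p, l in zip(pred_row, label_row) if l != -100]
    for pred_row, label_row in zip(predicted_ids, labels)
]
true_labels = [
    [label_list[l] for l in label_row if l != -100]
    for label_row in labels
]
print(true_predictions, true_labels)
```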

The following function does all this post-processing on the `EvalPrediction` object that the `Trainer` passes to it during evaluation 
(it unpacks into predictions and labels) before applying the metric.

In [28]:
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }


We will use `transformers.Trainer`. To instantiate a `Trainer`, we will need to define three more things. The most important is the `TrainingArguments`, which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional.

For the full overview of possible training arguments, see [documentation](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments).

In [29]:
# reload tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
# "data collator" to take care of the padding & truncation problems
data_collator = DataCollatorForTokenClassification(tokenizer)

args = TrainingArguments(
    f"{model_name}-finetuned-ner", # name of this new fine-tuned model
    eval_strategy="epoch", # evaluation at the end of each epoch, instead of none
    learning_rate=2e-5, # slightly lower than default
    num_train_epochs=3, # default
    weight_decay=0.01, #regularization to prevent overfitting
    )

trainer = Trainer(
    model,
    args,
    train_dataset=ds_aligned["train"],
    eval_dataset=ds_aligned["validation"],
    data_collator=data_collator,
    processing_class=tokenizer,
    compute_metrics=compute_metrics
    )


Notice how we haven't selected any columns?  
That's because the `Trainer` automatically drops the columns that the model's `forward` method does not accept and keeps the ones it needs:
- `input_ids` and `attention_mask` for input 
- `labels` for output
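
The data collator defined in the cell above then turns a list of such examples into one batch. The sketch below shows the core idea in plain Python under simplifying assumptions: it pads on the right, uses `0` as the padding token id and `-100` as the padding label, and returns lists instead of tensors. It illustrates the idea and is not the library's implementation.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad every sequence in a batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 for real tokens, 0 for padding
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
        # padded positions must not contribute to the loss
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


# toy batch of two sequences with different lengths
features = [
    {"input_ids": [101, 17751, 11639, 93376, 102], "labels": [-100, 3, 0, 0, -100]},
    {"input_ids": [101, 10979, 46006, 102], "labels": [-100, 1, 2, -100]},
]
batch = pad_batch(features)
for key, rows in batch.items():
    print(key, rows)
```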

But let's train (will take ~3 min on a GPU, ~10 hrs on a CPU)

In [30]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.0847 | 0.068274 | 0.915576 | 0.924828 | 0.920179 | 0.981419 |
| 2 | 0.0418 | 0.069891 | 0.939521 | 0.930116 | 0.934795 | 0.984285 |
| 3 | 0.0259 | 0.067315 | 0.935928 | 0.938049 | 0.936987 | 0.985001 |


TrainOutput(global_step=5268, training_loss=0.07295154547637035, metrics={'train_runtime': 488.2483, 'train_samples_per_second': 86.274, 'train_steps_per_second': 10.79, 'total_flos': 1327458291095034.0, 'train_loss': 0.07295154547637035, 'epoch': 3.0})
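
The `global_step` value can be reconstructed from the dataset size. The sketch assumes the run used the `Trainer`'s default per-device batch size of 8 on a single device, since no batch size was set in `TrainingArguments` above.

```python
import math

n_train = 14041          # rows in ds["train"]
batch_size = 8           # assumption: default batch size, one device
num_epochs = 3           # num_train_epochs
train_runtime = 488.2483 # seconds, from the TrainOutput above

steps_per_epoch = math.ceil(n_train / batch_size)  # the last batch is smaller
total_steps = steps_per_epoch * num_epochs
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps, round(steps_per_second, 2))
```

Under that assumption the numbers agree with the output: 1756 steps per epoch, 5268 steps in total, and about 10.79 steps per second.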

In [31]:
# Let's make a proper per-entity report as a table
predictions, labels, _ = trainer.predict(ds_aligned["test"])
predictions = np.argmax(predictions, axis=2)

# Remove ignored index (special tokens)
true_predictions = [
    [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]
true_labels = [
    [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]

results = metric.compute(predictions=true_predictions, references=true_labels)
pd.DataFrame(results)


| | LOC | MISC | ORG | PER | overall_precision | overall_recall | overall_f1 | overall_accuracy |
|---|---|---|---|---|---|---|---|---|
| precision | 0.917406 | 0.785657 | 0.895665 | 0.939371 | 0.899695 | 0.900315 | 0.900005 | 0.973204 |
| recall | 0.933982 | 0.77394 | 0.895397 | 0.930556 | 0.899695 | 0.900315 | 0.900005 | 0.973204 |
| f1 | 0.92562 | 0.779755 | 0.895531 | 0.934942 | 0.899695 | 0.900315 | 0.900005 | 0.973204 |
| number | 2878.0 | 1274.0 | 3346.0 | 2664.0 | 0.899695 | 0.900315 | 0.900005 | 0.973204 |


If you have successfully run the fine-tuning, your model should be good enough to have won the original [CoNLL-2003](https://www.clips.uantwerpen.be/conll2003/ner/) shared task.

In [32]:
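# save the final model to the output folder, then load it back
# (on its own, the Trainer only writes checkpoints to subfolders of that folder)
trainer.save_model("distilbert/distilbert-base-multilingual-cased-finetuned-ner")
model = AutoModelForTokenClassification.from_pretrained("distilbert/distilbert-base-multilingual-cased-finetuned-ner")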

inputs = tokenizer(
    "Franz Kafka studied at Charles University in Prague", return_tensors="pt"
)

with torch.no_grad():
    logits = model(**inputs).logits

# for each token, which label index has the highest logit?
predicted_token_class_ids = logits.argmax(-1)

predicted_tokens_classes = [id2label[t.item()] for t in predicted_token_class_ids[0]]
predicted_tokens_classes

['O', 'B-PER', 'I-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'O', 'B-LOC', 'O']

This time we got it right!  

<br>

## Probing

With probing we are trying to answer the question:  
Has the model simply learned the task, or has it encoded genuine information?  

There are many ways to approach this. Let's pick one:  
- Does the model over-rely on capital letters? How does it perform if you make everything uppercase or lowercase?
- How does it generalize to a new NER test set?
- How does it generalize if we translate some of the examples to a different language?
- Does it make any mistakes in [world-universities](https://github.com/endSly/world-universities-csv/blob/master/world-universities.csv)?
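
For the first idea, the perturbed data can be built directly from the word lists. This is a minimal sketch; the helper name `change_case` is made up here, and it only prepares the inputs. The perturbed examples would still need to go through the same tokenization and evaluation steps as the original test set.

```python
def change_case(tokens, mode):
    """Return a copy of the tokens in upper, lower or title case."""
    transforms = {"upper": str.upper, "lower": str.lower, "title": str.title}
    return [transforms[mode](token) for token in tokens]


# first training example, copied from the output earlier in the notebook
tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]

for mode in ("upper", "lower"):
    # the labels are per word, so they stay aligned with the changed tokens
    variant = {"tokens": change_case(tokens, mode), "ner_tags": ner_tags}
    print(mode, variant["tokens"])
```

Changing the case does not change the number of words, so the original `ner_tags` can be reused as they are.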

In [None]:
# Load the tokenizer and the model trained using the script above

# `padding` and `is_split_into_words` are arguments of the tokenizer call,
# not of `from_pretrained`, so set them when you tokenize:
#   data which is "like this":        tokenizer("like this", padding=True)
#   data which is ["like", "this"]:   tokenizer(["like", "this"], padding=True, is_split_into_words=True)
tokenizer = AutoTokenizer.from_pretrained(
    "janko/distilbert-base-multilingual-cased-finetuned-ner"
    )

model = AutoModelForTokenClassification.from_pretrained(
    "janko/distilbert-base-multilingual-cased-finetuned-ner",
    output_hidden_states=True
    )

# Check if a GPU is available; if not, fall back to the CPU
# (CUDA is NVIDIA's platform that lets PyTorch run computations on the GPU)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Initialize pipeline on the chosen device
nlp = pipeline("ner", tokenizer=tokenizer, model=model, device=device)