<a href="https://colab.research.google.com/github/Andrea4-sr/mlnlp_ex5/blob/andrea2/ex05_ner.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
"""Exercise 5: Sequence and Anger Regression using Transformers"""


'Exercise 5: Sequence and Anger Regression using Transformers'

In [2]:
!pip3 install transformers datasets sklearn
!pip3 install datasets
!pip3 install sklearn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
[K     |████████████████████████████████| 5.8 MB 11.9 MB/s 
[?25hCollecting datasets
  Downloading datasets-2.7.1-py3-none-any.whl (451 kB)
[K     |████████████████████████████████| 451 kB 43.6 MB/s 
[?25hCollecting sklearn
  Downloading sklearn-0.0.post1.tar.gz (3.6 kB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
[K     |████████████████████████████████| 182 kB 47.7 MB/s 
[?25hCollecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
[K     |████████████████████████████████| 7.6 MB 1.9 MB/s 
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting xxhash
  Downloading xxhash-3.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux

In [3]:
import pandas as pd
import datasets
from datasets import load_dataset
from transformers import BertForTokenClassification, AdamW, TrainingArguments, Trainer
from random import shuffle
from sklearn.metrics import f1_score
import torch

In [4]:
# choose 7000 items as instructions say to make sure dataset of chosen
# language has length of at least 7000 items
polyglotner = datasets.load_dataset("polyglot_ner", 'fr', split="train[:7000]") 
dataset = [d for d in polyglotner]

Downloading builder script:   0%|          | 0.00/6.01k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/86.1k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/22.5k [00:00<?, ?B/s]

Downloading and preparing dataset polyglot_ner/fr to /root/.cache/huggingface/datasets/polyglot_ner/fr/1.0.0/bb2e45c90cd345c87dfd757c8e2b808b78b0094543b511ac49bc0129699609c1...


Downloading data:   0%|          | 0.00/1.11G [00:00<?, ?B/s]

Generating train split:   0%|          | 0/418411 [00:00<?, ? examples/s]

Dataset polyglot_ner downloaded and prepared to /root/.cache/huggingface/datasets/polyglot_ner/fr/1.0.0/bb2e45c90cd345c87dfd757c8e2b808b78b0094543b511ac49bc0129699609c1. Subsequent calls will reuse this data.


In [None]:
# for verification purposes
#dataset

In [5]:
# function to create training sets, and eval set
def create_dataset(dataset):
  shuffle(dataset)
  train3000=dataset[:3000]
  train1000=dataset[3001:4001]
  eval2000=dataset[4001:6001]
  return train3000, train1000, eval2000

In [6]:
train3000, train1000, eval2000 = create_dataset(dataset)

In [7]:
# check for length of each set
len(train3000), len(train1000), len(eval2000)


(3000, 1000, 2000)

In [8]:
# function to add numerical labels manually (instead of strings!)

def manual_labels(label2ix={"O": 0, "LOC": 1, "PER": 2, "ORG": 3}, labels=[], padding=0):
  # note: label2ix is specific to the dataset we are working with! but it can be swapped with a different dictionary
  if labels:
    new_labels = [label2ix[label] for label in labels]
    if len(new_labels) < padding:  # check if padding is even needed
      missing = [0 for i in range(padding-len(new_labels))]
      return new_labels + missing
  else:
    # might be a useful feature to have
    return label2ix 

In [9]:
from transformers import BertTokenizer, BertForTokenClassification, Trainer, TrainingArguments

# download model and its corresponding tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased', is_split_into_tokens=True)
model_fr = BertForTokenClassification.from_pretrained('bert-base-multilingual-cased', num_labels=4)


Downloading:   0%|          | 0.00/996k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/625 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/714M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForTokenClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at 

In [10]:
# encoding dataset and returning training set 
# only return training set since we don't use test sets in this exercise
def encoding(dataset):

  encoded_dataset = [tokenizer(' '.join(item['words']), return_tensors="pt", padding='max_length', truncation=True, max_length=300) for item in dataset] 

  # add num labels and padding labels to encoded dataset
  for enc_item, item in zip(encoded_dataset, dataset):
    padding = len(enc_item['attention_mask'][0])  # use the length of the attention mask as a reference for how big the padding should be
    enc_item['labels'] = torch.LongTensor([manual_labels(labels=item['ner'], padding=padding)])
  
  # from notebook
  # we don't need the batch dimension when using the trainer
  # because the trainer does batching for us 
  for item in encoded_dataset:
    for key in item:
        item[key] = torch.squeeze(item[key])

  train_set = encoded_dataset[:round(len(encoded_dataset)*1)]
  #test_set = encoded_dataset[round(len(encoded_dataset)*1):]  # we don't need a test set

  return train_set

In [11]:
# generate encoded training set of size 3000

encodedtrain3000 = encoding(train3000)

In [12]:
# generate encoded training set of size 1000

encodedtrain1000 = encoding(train1000)

In [13]:
# generate encoded evaluation set of size 2000

encodedeval2000 = encoding(eval2000)

In [14]:
# checking that torch sizes are equal
print(len(encodedtrain3000))
for key, val in encodedtrain3000[0].items():
  print(f'key: {key}, dimensions: {val.size()}')

3000
key: input_ids, dimensions: torch.Size([300])
key: token_type_ids, dimensions: torch.Size([300])
key: attention_mask, dimensions: torch.Size([300])
key: labels, dimensions: torch.Size([300])


In [17]:
# training different models with chosen parameters

def model_training(model, trainingdata, requires_grad=True):

  training_args = TrainingArguments(
    num_train_epochs=1,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    output_dir='results',
    logging_dir='logs',
    no_cuda=False,  # defaults to false anyway, just to be explicit
    ) 

  trainer = Trainer(
    model=model_fr,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=trainingdata,
    )
  
  # freeze embeddings
  if not requires_grad:
    for param in model_fr.base_model.parameters():
      param.requires_grad = False
  
  trainer.train()

  return model, trainer

In [24]:
# funtion to calculate evaluation metrics
# f1-macro, f1-micro 

def eval_metrics(trainer, test_set):

  preds = trainer[1].predict(test_set)

  total_mac=0
  total_mic=0
  count=0

  for i in zip(preds.label_ids, preds.predictions.argmax(-1)):
    f1_mac = f1_score(i[0], i[1], average='macro')
    f1_mic = f1_score(i[0], i[1], average='micro')
    total_mac += f1_mac
    total_mic += f1_mic
    count += 1

  return {'macro': total_mac/count, 'micro': total_mic/count}

In [33]:
model1 = model_training(model='Model1000',
                     trainingdata=encodedtrain1000,
                     )

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running training *****
  Num examples = 1000
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 250
  Number of trainable parameters = 3076


Step,Training Loss




Training completed. Do not forget to share your model on huggingface.co/models =)




In [34]:
metricsmodel1000 = eval_metrics(model1, encodedeval2000)

***** Running Prediction *****
  Num examples = 2000
  Batch size = 4


In [35]:
# Metrics for Model trained on 1000 items
metricsmodel1000

{'macro': 0.792672857827045, 'micro': 0.99564666666667}

In [36]:
model2 = model_training(model='Model3000',
                        trainingdata=encodedtrain3000)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running training *****
  Num examples = 3000
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 750
  Number of trainable parameters = 3076


Step,Training Loss
500,0.0111


Saving model checkpoint to results/checkpoint-500
Configuration saved in results/checkpoint-500/config.json
Model weights saved in results/checkpoint-500/pytorch_model.bin
tokenizer config file saved in results/checkpoint-500/tokenizer_config.json
Special tokens file saved in results/checkpoint-500/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




In [37]:
metricsmodel3000 = eval_metrics(model2, encodedeval2000)

***** Running Prediction *****
  Num examples = 2000
  Batch size = 4


In [38]:
# Metrics for Model trained on 3000 items
metricsmodel3000

{'macro': 0.7927132815992657, 'micro': 0.9956200000000034}

In [39]:
model3 = model_training(model='Modelfrozenembeddings',
                        trainingdata=encodedtrain3000,
                        requires_grad=False)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running training *****
  Num examples = 3000
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 750
  Number of trainable parameters = 3076


Step,Training Loss
500,0.0111


Saving model checkpoint to results/checkpoint-500
Configuration saved in results/checkpoint-500/config.json
Model weights saved in results/checkpoint-500/pytorch_model.bin
tokenizer config file saved in results/checkpoint-500/tokenizer_config.json
Special tokens file saved in results/checkpoint-500/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




In [40]:
metricsmodel3000frozen = eval_metrics(model3, encodedeval2000)

***** Running Prediction *****
  Num examples = 2000
  Batch size = 4


In [41]:
# Metrics for Model trained on 3000 items - frozen embeddings
metricsmodel3000frozen

{'macro': 0.7931849252145734, 'micro': 0.9955650000000035}