Soft deadline: `30.03.2022 23:59`

In this homework you will learn how the fine-tuning procedure works and get acquainted with the Hugging Face Datasets library.

In [None]:
! pip install datasets
! pip install transformers

For this task we will use the [Datasets](https://huggingface.co/docs/datasets/) library and the `yahoo_answers_topics` dataset. The task is to classify documents into 10 topic categories. More detailed information can be found on the dataset [page](https://huggingface.co/datasets/viewer/).


In [None]:
from datasets import load_dataset
from datasets import load_metric

In [None]:
dataset = load_dataset('yahoo_answers_topics')

In [None]:
dataset = dataset.remove_columns(['question_title','question_content','id'])
metric = load_metric('f1')

In [None]:
ds_train=dataset['train']#.shard(224,index=1)
ds_test=dataset['test']#.shard(20,index=1)

In [None]:
ds_train

In [None]:
ds_test
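The commented-out `.shard(...)` calls above suggest working on a subset of the data. As a minimal sketch of a reproducible subsample, the helper below draws a fixed set of row indices from a seeded random generator. The function name `sample_indices` and the sizes used are hypothetical, chosen only for illustration.

```python
import random
from typing import List


def sample_indices(num_rows: int, sample_size: int, seed: int = 42) -> List[int]:
    """Return `sample_size` distinct row indices, reproducible for a fixed seed."""
    if sample_size > num_rows:
        raise ValueError("sample_size cannot exceed num_rows")
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    return sorted(rng.sample(range(num_rows), sample_size))


# Hypothetical sizes for illustration only; in the notebook use len(ds_train).
indices = sample_indices(num_rows=100_000, sample_size=5_000, seed=42)
print(len(indices), indices[:3])
```

In the notebook, such a list could presumably be passed to `ds_train.select(indices)`; `ds_train.shuffle(seed=42).select(range(5000))` is a common alternative with the same intent.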

# Fine-tuning the model (20 points)

In [None]:
from transformers import (ElectraTokenizer, ElectraForSequenceClassification, ElectraTokenizerFast,
                          get_scheduler, pipeline, ElectraForMaskedLM, ElectraModel, AdamW, TrainingArguments, Trainer)

import torch
from torch.utils.data import DataLoader
from datasets import load_metric
import numpy as np

Fine-tuning on a downstream task consists of adding extra layers on top of the pre-trained model. The resulting model can be tuned fully (gradients are passed through the whole model) or partially (some of the pre-trained layers are frozen).

**Task**: 
- load tokenizer and model
- look at the predictions of the model as-is before any fine-tuning


```
- Why don't you ask [MASK]?
- What is [MASK]?
- Let's talk about [MASK] physics
```

- convert `best_answer` to the input tokens (supporting function for dataset is provided below) 

```
def tokenize_function(examples):
    return tokenizer(examples["best_answer"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
```

- define the optimizer and, optionally, a scheduler
- fine-tune the model (write the training loop), plot the loss over training, and measure the results in terms of weighted F1 score
- get the masked word predictions (sample sentences above) from the fine-tuned model; explain why the results look the way they do and what should be done to change that (write down your answer)
- tune the training hyperparameters (and write down your results)
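To make the weighted F1 score mentioned in the task list concrete, here is a small pure-Python sketch: the F1 of each class is computed separately and then averaged with weights proportional to the number of true examples of that class. The function name `weighted_f1` and the toy labels are illustrative only; in practice a library implementation such as `sklearn.metrics.f1_score(..., average='weighted')` would be used.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to each class's support."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for cls, n_cls in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n_cls - tp
        f1 = 2 * tp / (2 * tp + fp + fn)  # denominator > 0 because n_cls > 0
        score += (n_cls / total) * f1
    return score


# Toy labels for illustration (three classes).
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
print(round(weighted_f1(y_true, y_pred), 4))  # 0.8333
```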

**Tips**:
- The easiest way to get predictions is to use the transformers `pipeline` function
- Do not forget to set the `num_labels` parameter when initializing the model
- To convert data to batches, use `DataLoader`
- Even the `small` version of Electra can take a long time to train, so you can take a sample of the data (>= 5000 examples; set a seed for reproducibility)
- You may want to try freezing all the layers except the classification ones (that is, not updating the pretrained model weights); in that case use:


```
for param in model.electra.parameters():
    param.requires_grad = False
```
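The effect of freezing can be checked by counting trainable parameters. The sketch below uses a tiny hypothetical module, `TinyClassifier`, whose `encoder` attribute stands in for `model.electra`; it is not the Electra architecture, only an illustration of the `requires_grad` mechanism.

```python
import torch
from torch import nn


class TinyClassifier(nn.Module):
    """Hypothetical stand-in: `encoder` plays the role of `model.electra`."""

    def __init__(self, hidden=16, num_labels=10):
        super().__init__()
        self.encoder = nn.Linear(hidden, hidden)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, x):
        return self.classifier(torch.tanh(self.encoder(x)))


model = TinyClassifier()

# Freeze the "pretrained" part so only the classification head receives updates.
for param in model.encoder.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable} / {total}")  # trainable parameters: 170 / 442

# Hand only the trainable parameters to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```

Frozen parameters receive no gradients during `backward()`, so a frozen model is cheaper to train, at the cost of a less flexible model.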


Pretrained model performance

In [None]:
MODEL_NAME = "google/electra-small-generator"
TOKENIZER_NAME = "google/electra-small-generator"

In [None]:
model_initial=ElectraForMaskedLM.from_pretrained(MODEL_NAME)
tokenizer=ElectraTokenizerFast.from_pretrained(TOKENIZER_NAME)

In [None]:
unmasking = pipeline(
    task = "fill-mask",
    model = MODEL_NAME,
    tokenizer=TOKENIZER_NAME,
    top_k=3,
)

In [None]:
speech1 = f"- Why don't you ask {unmasking.tokenizer.mask_token}?"
speech2 = f"- What is {unmasking.tokenizer.mask_token}?"
speech3 = f"- Let's talk about {unmasking.tokenizer.mask_token} physics"

In [None]:
from pprint import pprint
pprint(unmasking(speech1))

Classifier training

In [None]:
# Tokenize the answers in the train and test splits
def tokenize_function(examples):
    # pad every answer to the maximum length and truncate longer ones
    return tokenizer(examples["best_answer"], padding="max_length", truncation=True)

tokenized_datasets_train = ds_train.map(tokenize_function, batched=True)
tokenized_datasets_train = tokenized_datasets_train.rename_column('topic', 'label')
tokenized_datasets_test = ds_test.map(tokenize_function, batched=True)
tokenized_datasets_test = tokenized_datasets_test.rename_column('topic', 'label')

In [None]:
# Full fine-tuning: pretrained encoder plus a new, untrained 10-label classification head;
# the optimizer updates all weights
electra_classifier = ElectraForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=10)
loss = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(electra_classifier.parameters(), lr=1e-3, weight_decay=0.000)
num_epoch = 5

In [None]:
# Frozen encoder: re-create the classifier (this overrides the previous cell),
# freeze the pretrained encoder and train only the classification head
electra_classifier = ElectraForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=10)
loss = torch.nn.CrossEntropyLoss()
# Optional smaller head (disabled):
# electra_classifier.classifier.dense=torch.nn.Linear(256, 64)
# electra_classifier.classifier.out_proj=torch.nn.Sequential(
#     torch.nn.LeakyReLU(),
#     torch.nn.Linear(64, 10))
for param in electra_classifier.electra.parameters():
    param.requires_grad = False
optimizer = torch.optim.AdamW(electra_classifier.classifier.parameters(), lr=1e-3, weight_decay=0.0005)
num_epoch = 10

In [None]:
electra_classifier.classifier

ElectraClassificationHead(
  (dense): Linear(in_features=256, out_features=256, bias=True)
  (dropout): Dropout(p=0.1, inplace=False)
  (out_proj): Linear(in_features=256, out_features=10, bias=True)
)

In [None]:
electra_classifier.to('cuda:0')

In [None]:
from sklearn.metrics import f1_score

# NOTE: run the cell that defines `testify` (below) before this one.
devic = 'cuda:0'
for epoch in range(num_epoch):
    losses = []
    electra_classifier.train()  # enable dropout for training
    for i, data in enumerate(tokenized_datasets_train):
        optimizer.zero_grad()
        # one example per step: wrapping each field in a list adds the batch dimension
        input_ids = torch.tensor([data['input_ids']], device=devic)
        attention_mask = torch.tensor([data['attention_mask']], device=devic)
        label = torch.tensor([data['label']], device=devic)

        out = electra_classifier(input_ids, attention_mask)
        loss_value = loss(out[0], label)  # out[0] holds the logits
        loss_value.backward()
        optimizer.step()
        # store a plain float so the autograd graph is not kept alive
        losses.append(loss_value.item())
    print(f"Epoch {epoch}\n Current loss {torch.mean(torch.tensor(losses))}\n")
    preds = testify(tokenized_datasets_test)
    print("F1:", f1_score(tokenized_datasets_test['label'], preds, average='weighted'))
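The loop above feeds one example per optimizer step. The tips recommend `DataLoader` for batching, so here is a minimal sketch of a mini-batch loop. The random tensors and the `nn.Linear` model are toy stand-ins (assumptions) for the tokenized dataset and the Electra classifier; only the loop structure is the point.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Toy stand-ins (assumptions): 64 "examples" with 8 features and 10 classes.
features = torch.randn(64, 8)
labels = torch.randint(0, 10, (64,))
loader = DataLoader(TensorDataset(features, labels), batch_size=16, shuffle=True)

model = nn.Linear(8, 10)  # stand-in for the classifier
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)

epoch_losses = []
for epoch in range(3):
    model.train()
    batch_losses = []
    for batch_features, batch_labels in loader:
        optimizer.zero_grad()
        logits = model(batch_features)
        loss_value = loss_fn(logits, batch_labels)
        loss_value.backward()
        optimizer.step()
        batch_losses.append(loss_value.item())  # keep a float, not the graph
    epoch_losses.append(sum(batch_losses) / len(batch_losses))
    print(f"epoch {epoch}: mean loss {epoch_losses[-1]:.4f}")
```

With the tokenized dataset, calling `set_format("torch", columns=["input_ids", "attention_mask", "label"])` on it should allow passing it to `DataLoader` directly; the per-epoch losses collected this way can then be plotted.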

In [None]:
def testify(dataset):
    """Predict a label for every example in `dataset`, one example at a time."""
    electra_classifier.eval()  # disable dropout during evaluation
    pred_labels = []
    with torch.no_grad():
        for data in dataset:
            input_ids = torch.tensor([data['input_ids']], device=devic)
            attention_mask = torch.tensor([data['attention_mask']], device=devic)
            out = electra_classifier(input_ids, attention_mask)
            pred_labels.append(out[0].argmax().item())  # index of the largest logit
    return pred_labels

In [None]:
preds=testify(tokenized_datasets_test)

In [None]:
from sklearn.metrics import f1_score
f1_score(tokenized_datasets_test['label'], preds, average='weighted')

0.3360615863978749

In [None]:
electra_classifier.save_pretrained('./drive/MyDrive/NLP/mymodel')

In [None]:
model = ElectraForMaskedLM.from_pretrained('./drive/MyDrive/NLP/mymodel')

Some weights of the model checkpoint at ./drive/MyDrive/NLP/mymodel were not used when initializing ElectraForMaskedLM: ['classifier.out_proj.weight', 'classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.bias']
- This IS expected if you are initializing ElectraForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForMaskedLM were not initialized from the model checkpoint at ./drive/MyDrive/NLP/mymodel and are newly initialized: ['generator_predictions.dense.bias', 'generator_lm_head.bias', 'generator_lm_head.weight', 'generator_predictions.LayerNorm.bias', 'generator_predictions.Laye

In [None]:
unmasking1 = pipeline(
    task = "fill-mask",
    model = './drive/MyDrive/NLP/mymodel',
    tokenizer=TOKENIZER_NAME,
    top_k=3,
)

In [None]:
speech1 = f"- Why don't you ask {unmasking1.tokenizer.mask_token}?"
speech2 = f"- What is {unmasking1.tokenizer.mask_token}?"
speech3 = f"- Let's talk about {unmasking1.tokenizer.mask_token} physics"

In [None]:
from pprint import pprint
pprint(unmasking1(speech1))

[{'score': 0.0004469767736736685,
  'sequence': "- why don't you ask peeling?",
  'token': 28241,
  'token_str': 'peeling'},
 {'score': 0.00037571616121567786,
  'sequence': "- why don't you askutter?",
  'token': 26878,
  'token_str': '##utter'},
 {'score': 0.0003629341081250459,
  'sequence': "- why don't you ask roaming?",
  'token': 24430,
  'token_str': 'roaming'}]
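The scores above are tiny and the predicted tokens look arbitrary. One plausible reading, in line with the warning that the `generator_lm_head` weights were newly initialized, is that the masked-LM head of the saved classifier is untrained, so its output is close to noise. The sketch below illustrates this under stated assumptions: a vocabulary of 30522 tokens (check `len(tokenizer)`) and logits modeled as unit Gaussian noise, which is a simplification and not what the real head computes.

```python
import math
import random

VOCAB_SIZE = 30522  # assumed vocabulary size; verify with len(tokenizer)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
# Assumption: an untrained head produces logits that behave like random noise.
random_logits = [rng.gauss(0.0, 1.0) for _ in range(VOCAB_SIZE)]
probs = softmax(random_logits)

print(f"uniform baseline: {1 / VOCAB_SIZE:.6f}")
print(f"top random score: {max(probs):.6f}")
```

Under this reading, useful masked-word predictions would require either keeping the original pretrained LM head or training the head on a masked language modeling objective after fine-tuning.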
