Soft deadline: `30.03.2022 23:59`

In this homework you will learn how the fine-tuning procedure works and get acquainted with the Hugging Face Datasets library.

In [1]:
%%capture
! pip install datasets
! pip install transformers

In [2]:
from transformers import (ElectraTokenizer, ElectraForSequenceClassification,
                          get_scheduler, pipeline, ElectraForMaskedLM, ElectraModel,
                          get_linear_schedule_with_warmup)

import torch
from torch.utils.data import DataLoader, TensorDataset
from datasets import load_metric
import numpy as np
import pandas as pd
import plotly.express as px
from torch.nn import functional as F
from tqdm.notebook import tqdm
from torch.optim import AdamW
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

In [3]:
from datasets import load_dataset, DatasetDict
train_and_test_dataset = load_dataset('yahoo_answers_topics')

Downloading and preparing dataset yahoo_answers_topics/yahoo_answers_topics (download: 304.68 MiB, generated: 756.21 MiB, post-processed: Unknown size, total: 1.04 GiB) to /root/.cache/huggingface/datasets/yahoo_answers_topics/yahoo_answers_topics/1.0.0/b2712a72fde278f1d6e96cc4f485fd89ed2f79ecb231441e13645b53da021902...


Dataset yahoo_answers_topics downloaded and prepared to /root/.cache/huggingface/datasets/yahoo_answers_topics/yahoo_answers_topics/1.0.0/b2712a72fde278f1d6e96cc4f485fd89ed2f79ecb231441e13645b53da021902. Subsequent calls will reuse this data.


In [4]:
train_test = DatasetDict()
train_test['train'] = train_and_test_dataset['train'].select([i for i in range(train_and_test_dataset['train'].num_rows // 100)])
train_test['test'] = train_and_test_dataset['test'].select([i for i in range(train_and_test_dataset['test'].num_rows // 10)])
train_test

DatasetDict({
    train: Dataset({
        features: ['id', 'topic', 'question_title', 'question_content', 'best_answer'],
        num_rows: 14000
    })
    test: Dataset({
        features: ['id', 'topic', 'question_title', 'question_content', 'best_answer'],
        num_rows: 6000
    })
})
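
The cell above keeps the first rows of each split. A seeded random sample is an alternative that stays reproducible. The sketch below uses only the standard library; the helper name `sample_indices` and the seed value are assumptions for illustration, and the row count of 1,400,000 is inferred from the 14,000 rows shown above (`num_rows // 100`).

```python
import random


def sample_indices(num_rows, sample_size, seed=42):
    """Return `sample_size` distinct row indices in [0, num_rows), reproducibly."""
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    return sorted(rng.sample(range(num_rows), sample_size))


# Hypothetical sizes: 1% of a 1,400,000-row training split.
train_idx = sample_indices(num_rows=1_400_000, sample_size=14_000, seed=42)
print(len(train_idx), train_idx[:3])

# With the dataset loaded, the indices could be passed to `select`:
#   train_test['train'] = train_and_test_dataset['train'].select(train_idx)
```

Calling the helper again with the same seed returns the same indices, so the subset can be rebuilt in a later session.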

For our goals we will use the [Datasets](https://huggingface.co/docs/datasets/) library and take the `yahoo_answers_topics` dataset; the task is to classify documents into 10 topic categories. More detailed information can be found on the dataset [page](https://huggingface.co/datasets/viewer/).


# Fine-tuning the model (20 points)

Fine-tuning on the end task consists of adding extra layers on top of the pre-trained model. The resulting model can be tuned fully (passing gradients through the whole model) or partially (updating only some layers, e.g. the new classification head).

**Tips**:
- The easiest way to get predictions is to use the transformers `pipeline` function
- Do not forget to set the `num_labels` parameter when initializing the model
- To convert data to batches, use `DataLoader`
- Even the `small` version of Electra can take a long time to train, so you can take a sample of the data (>= 5000 examples; set a seed for reproducibility)
- You may want to try freezing (not updating the pretrained model weights) all the layers except the ones for classification; in that case use:


```
for param in model.electra.parameters():
    param.requires_grad = False
```
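
To check that freezing has the intended effect, it can help to count trainable parameters and confirm that frozen layers receive no gradients. The sketch below uses a small stand-in module rather than Electra; `ToyClassifier`, `count_parameters`, and the layer sizes are assumptions made for illustration only.

```python
import torch
from torch import nn


class ToyClassifier(nn.Module):
    """Stand-in for a pretrained encoder plus a classification head."""

    def __init__(self, hidden=16, num_labels=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(8, hidden), nn.ReLU())
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, x):
        return self.classifier(self.encoder(x))


def count_parameters(model):
    """Return (trainable, total) parameter counts."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total


toy = ToyClassifier()
for param in toy.encoder.parameters():
    param.requires_grad = False  # freeze the "pretrained" part

trainable, total = count_parameters(toy)
print(f"trainable: {trainable} / {total}")

# Passing only the trainable parameters to the optimizer is optional but explicit.
optimizer = torch.optim.AdamW(
    (p for p in toy.parameters() if p.requires_grad), lr=1e-3
)

loss = toy(torch.randn(4, 8)).sum()
loss.backward()
# Frozen layers accumulate no gradient; the head does.
print(toy.encoder[0].weight.grad is None, toy.classifier.weight.grad is not None)
```

The same two checks apply to `model.electra` and `model.classifier` after running the loop from the tip above.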


In [5]:
MODEL_NAME = "google/electra-small-generator"
TOKENIZER_NAME = "google/electra-small-generator"

In [6]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

## 1. load tokenizer and model

In [7]:
tokenizer = ElectraTokenizer.from_pretrained(TOKENIZER_NAME)
masked_model = ElectraForMaskedLM.from_pretrained(MODEL_NAME)

In [8]:
fill_mask = pipeline(
    "fill-mask",
    model=masked_model,
    tokenizer=tokenizer
)

## 2. look at the predictions of the model as-is before any fine-tuning

```
- Why don't you ask [MASK]?
- What is [MASK]?
- Let's talk about [MASK] physics
```

In [9]:
fill_mask("Why don't you ask [MASK]?")

[{'score': 0.5343002080917358,
  'token': 2033,
  'token_str': 'm e',
  'sequence': "why don't you ask me?"},
 {'score': 0.0819597840309143,
  'token': 3980,
  'token_str': 'q u e s t i o n s',
  'sequence': "why don't you ask questions?"},
 {'score': 0.043953415006399155,
  'token': 2068,
  'token_str': 't h e m',
  'sequence': "why don't you ask them?"},
 {'score': 0.04017249867320061,
  'token': 2339,
  'token_str': 'w h y',
  'sequence': "why don't you ask why?"},
 {'score': 0.030024176463484764,
  'token': 4426,
  'token_str': 'y o u r s e l f',
  'sequence': "why don't you ask yourself?"}]

In [10]:
fill_mask("What is [MASK]?")

[{'score': 0.4963121712207794,
  'token': 2009,
  'token_str': 'i t',
  'sequence': 'what is it?'},
 {'score': 0.2050580382347107,
  'token': 2023,
  'token_str': 't h i s',
  'sequence': 'what is this?'},
 {'score': 0.12763574719429016,
  'token': 2008,
  'token_str': 't h a t',
  'sequence': 'what is that?'},
 {'score': 0.015324683859944344,
  'token': 3308,
  'token_str': 'w r o n g',
  'sequence': 'what is wrong?'},
 {'score': 0.015023304149508476,
  'token': 2054,
  'token_str': 'w h a t',
  'sequence': 'what is what?'}]

In [11]:
fill_mask("Let's talk about [MASK] physics")

[{'score': 0.24027419090270996,
  'token': 8559,
  'token_str': 'q u a n t u m',
  'sequence': "let's talk about quantum physics"},
 {'score': 0.2125832438468933,
  'token': 9373,
  'token_str': 't h e o r e t i c a l',
  'sequence': "let's talk about theoretical physics"},
 {'score': 0.05639351159334183,
  'token': 10811,
  'token_str': 'p a r t i c l e',
  'sequence': "let's talk about particle physics"},
 {'score': 0.03320807218551636,
  'token': 2613,
  'token_str': 'r e a l',
  'sequence': "let's talk about real physics"},
 {'score': 0.022627977654337883,
  'token': 8045,
  'token_str': 'm a t h e m a t i c a l',
  'sequence': "let's talk about mathematical physics"}]

## 3. convert `best_answer` to the input tokens (supporting function for dataset is provided below) 


In [12]:
feature_column_names = list(train_test['train'].features.keys())
feature_column_names.remove('topic')
feature_column_names

['id', 'question_title', 'question_content', 'best_answer']

In [13]:
def tokenize_function(examples):
    return tokenizer(examples["best_answer"], padding="max_length", truncation=True)

tokenized_dataset = train_test.map(tokenize_function, batched=True, remove_columns=feature_column_names)
tokenized_dataset = tokenized_dataset.rename_column("topic", "labels")
tokenized_dataset.set_format("torch")

## 4. define optimizer, scheduler (optional)
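
This notebook pairs `AdamW` with `get_linear_schedule_with_warmup`. As a rough illustration of what such a schedule does to the learning rate, the sketch below computes the multiplier in plain Python. It is written from the documented behavior (linear ramp-up during warmup, then linear decay to zero), not copied from the library; the function name `linear_warmup_decay` is an assumption.

```python
def linear_warmup_decay(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier: linear warmup to 1.0, then linear decay to 0.0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 1e-3
total_steps = 438 * 4  # batches per epoch * epochs, as in this notebook's run

for step in (0, total_steps // 2, total_steps):
    lr = base_lr * linear_warmup_decay(step, num_warmup_steps=0,
                                       num_training_steps=total_steps)
    print(f"step {step:4d}: lr = {lr:.6f}")
```

With `num_warmup_steps = 0`, as used later in this notebook, the rate starts at its full value and decays linearly, so the last steps of training make very small updates.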



## 5. fine-tune the model (write the training loop), plot the loss changes and measure results in terms of weighted F1 score

In [14]:
label_encoder = LabelEncoder()
label_encoder.fit_transform(tokenized_dataset['train']['labels'])
metric_f1 = load_metric("f1")
metric_accuracy = load_metric("accuracy")

def compute_metrics(logits, labels, print_label=False):
    predictions = np.argmax(logits, axis=-1)
    if print_label:
        print("labels: ", labels)
        print("predict:", predictions)

    f1 = metric_f1.compute(predictions=predictions, references=labels, average='weighted')["f1"]
    accuracy = metric_accuracy.compute(predictions=predictions, references=labels)["accuracy"]
    return {"accuracy": accuracy, "f1": f1}

In [15]:
def dataset_processing(dataset, batch_size):
    print(f"Dataset train length: {len(dataset['train'])}, test length: {len(dataset['test'])}")
    train_loader = DataLoader(dataset['train'], shuffle=True, batch_size=batch_size)
    test_loader = DataLoader(dataset['test'], batch_size=batch_size)
    
    return train_loader, test_loader

In [16]:
def train(model, train_dataloader, test_dataloader, optimizer, scheduler):
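    model.to(device)

    f1 = []          # mean per-batch validation F1, one value per epoch
    total_loss = []  # training loss, one value per step
    for epoch in range(epochs):  # `epochs` is a global defined in a later cell
        # Switch back to training mode every epoch; the model.eval() call in the
        # validation phase would otherwise keep dropout disabled from epoch 2 on.
        model.train()
        step = 0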
        with tqdm(total=len(train_dataloader), unit='batch') as t:
            for batch in train_dataloader:
                t.set_description(f"Epoch {epoch}, train")
                step += 1
                b_input_ids = batch['input_ids'].to(device)
                b_input_mask = batch['attention_mask'].to(device)
                b_labels = batch['labels'].to(device)
                
                model.zero_grad()
                outputs = model(
                    input_ids=b_input_ids,
                    attention_mask=b_input_mask,
                    labels=b_labels
                )
                loss, logits = outputs.loss, outputs.logits

                loss.backward()
                optimizer.step()
                scheduler.step()
                
                total_loss += [loss.item()]
                t.set_postfix(loss='{:.3f}'.format(loss.item()))
                t.update(1)
        
        model.eval()
        val_step = 0
        total_eval_loss = 0
        total_eval_f1 = 0
        total_eval_acc = 0
        with tqdm(total=len(test_dataloader), unit='batch') as t:
            for batch in test_dataloader:
                t.set_description(f"Epoch {epoch}, valid")
                with torch.no_grad():        
                    val_step += 1
                    b_input_ids = batch['input_ids'].to(device)
                    b_input_mask = batch['attention_mask'].to(device)
                    b_labels = batch['labels'].to(device)
                
                    output = model(b_input_ids, attention_mask=b_input_mask)
                    logits = output.logits

                    logits = logits.detach().cpu().numpy()
                    label_ids = b_labels.detach().cpu().numpy()
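                    # Print labels and predictions for the first validation batch only.
                    # (`val_step - 1 % 20 == 0` parses as `val_step - (1 % 20) == 0`,
                    # so it was only ever true for val_step == 1.)
                    print_res = val_step == 1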
                    metrics_f1_acc = compute_metrics(logits, label_ids, print_res)
                    total_eval_f1 += metrics_f1_acc['f1']
                    total_eval_acc += metrics_f1_acc['accuracy']
                    t.set_postfix(
                        f1='{:.2f}'.format(total_eval_f1 / val_step),
                        acc='{:.2f}'.format(total_eval_acc / val_step)
                    )
                    t.update(1)
            f1 += [total_eval_f1 / val_step]
    return total_loss, f1, model
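
The validation loop above averages the weighted F1 of individual batches. That average is generally not equal to the weighted F1 computed once over all validation predictions, because class supports differ from batch to batch. The pure-Python sketch below shows the difference on made-up labels; `weighted_f1` is a hypothetical helper that follows the usual definition (per-class F1 weighted by class support).

```python
from collections import Counter


def weighted_f1(labels, predictions):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(labels)
    score = 0.0
    for cls, n_cls in support.items():
        tp = sum(1 for y, p in zip(labels, predictions) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, predictions) if y != cls and p == cls)
        fn = n_cls - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * n_cls / len(labels)
    return score


# Two made-up validation batches: (labels, predictions).
batches = [
    ([0, 0, 1, 1], [0, 0, 1, 1]),
    ([0, 1, 1, 1], [1, 1, 1, 1]),
]

mean_of_batches = sum(weighted_f1(y, p) for y, p in batches) / len(batches)

all_labels = [y for labels, _ in batches for y in labels]
all_preds = [p for _, preds in batches for p in preds]
over_whole_set = weighted_f1(all_labels, all_preds)

print(f"mean of per-batch F1: {mean_of_batches:.4f}")
print(f"F1 over the whole set: {over_whole_set:.4f}")
```

To report a dataset-level score, one option is to collect predictions and labels across batches and call the metric once per epoch.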

In [17]:
classification_model = ElectraForSequenceClassification.from_pretrained(
            MODEL_NAME, 
            num_labels = 10)

Some weights of the model checkpoint at google/electra-small-generator were not used when initializing ElectraForSequenceClassification: ['generator_predictions.dense.bias', 'generator_predictions.LayerNorm.bias', 'generator_predictions.LayerNorm.weight', 'generator_predictions.dense.weight', 'generator_lm_head.bias', 'generator_lm_head.weight']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-generator and are newly initializ

In [18]:
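# Swap in new layers to shrink the head: assigning `out_features` on an existing
# nn.Linear would only change its repr, not the weight shapes.
hidden_size = classification_model.config.hidden_size  # 256 for electra-small-generator
classification_model.classifier.dense = torch.nn.Linear(hidden_size, 128)
classification_model.classifier.out_proj = torch.nn.Linear(128, classification_model.config.num_labels)
classification_model.classifier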

ElectraClassificationHead(
  (dense): Linear(in_features=256, out_features=128, bias=True)
  (dropout): Dropout(p=0.1, inplace=False)
  (out_proj): Linear(in_features=128, out_features=10, bias=True)
)
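
A printed head like the one above only reflects the `in_features` / `out_features` attributes, which are not guaranteed to match the actual weight shapes. The small sketch below, using standalone layers rather than the Electra head, shows that assigning to `out_features` changes the repr but leaves the weight matrix untouched, whereas constructing a new `nn.Linear` really resizes the layer.

```python
import torch
from torch import nn

dense = nn.Linear(256, 256)
dense.out_features = 128          # only edits the attribute used by repr()
print(dense)                      # repr now claims out_features=128
print(tuple(dense.weight.shape))  # weight is still (256, 256)

resized = nn.Linear(256, 128)     # a new module actually has the smaller weight

x = torch.randn(4, 256)
print(tuple(dense(x).shape), tuple(resized(x).shape))
```

Checking `layer.weight.shape` is therefore a more reliable way to confirm a head's size than reading the printed module.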

In [19]:
epochs = 4
lr_list = [1e-3]
batch_list = [32]
# test_size_list = [0.1, 0.2, 0.3]

total_step = len(lr_list) * len(batch_list)
step = 0

find_best_f1 = {}
for lr in lr_list:
    for batch in batch_list:
        step += 1
        print(f"Step {step}/{total_step}, lr: {lr}, batch_size: {batch}")
        print("Starting preparing dataset and tokenization")
        train_loader, test_loader = dataset_processing(tokenized_dataset, batch)
        classification_model = ElectraForSequenceClassification.from_pretrained(
            MODEL_NAME, 
            num_labels = 10)
        for param in classification_model.electra.parameters():
            param.requires_grad = False
        classification_model.classifier.dropout.p = 0.5
        # Assigning `out_features` / `in_features` on an existing nn.Linear only changes
        # its repr, not its weights, so the head keeps its original 256-unit layers here.
        optimizer = AdamW(classification_model.parameters(), lr=lr, eps=1e-7)
        scheduler = get_linear_schedule_with_warmup(optimizer, 
                                                    num_warmup_steps = 0,
                                                    num_training_steps = len(train_loader) * epochs)
        print("Starting training")
        losses, f1_saved, trained_model = train(classification_model, train_loader, test_loader, optimizer, scheduler)
        params = {"lr": lr, "batch_size": batch}
        find_best_f1[f"step_{step}"] = {"max_f1": max(f1_saved), "params": params}
        df = pd.DataFrame.from_dict({"loss": losses, "step": [i for i in range(len(losses))]})
        fig = px.line(df, x="step", y="loss", title=f"Losses with params: {params}, Step: {step}")
        fig.show()
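        trained_model.save_pretrained(f'/kaggle/working/step_{step}')
        # The saved classification checkpoint has no generator (LM) head, so
        # ElectraForMaskedLM initializes that head randomly; the fill-mask
        # outputs below therefore do not reflect what the fine-tuned model learned.
        trained_model_masked = ElectraForMaskedLM.from_pretrained(f'/kaggle/working/step_{step}')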
        trained_fill_mask = pipeline(
            "fill-mask",
            model=trained_model_masked,
            tokenizer=tokenizer
        )
        for pred in trained_fill_mask("Why don't you ask [MASK]?"):
            print(pred['sequence'])
        for pred in trained_fill_mask("What is [MASK]?"):
            print(pred['sequence'])
        for pred in trained_fill_mask("Let's talk about [MASK] physics"):
            print(pred['sequence'])
        torch.cuda.empty_cache()
print(find_best_f1)
# Pick the configuration with the highest validation F1.
best_f1_key = max(find_best_f1, key=lambda key: find_best_f1[key]["max_f1"])
max_f1 = find_best_f1[best_f1_key]["max_f1"]
print(f"Best f1: {max_f1} with params: {find_best_f1[best_f1_key]['params']} and {best_f1_key}")

Step 1/1, lr: 0.001, batch_size: 32
Starting preparing dataset and tokenization
Dataset train length: 14000, test length: 6000


Some weights of the model checkpoint at google/electra-small-generator were not used when initializing ElectraForSequenceClassification: ['generator_predictions.dense.bias', 'generator_predictions.LayerNorm.bias', 'generator_predictions.LayerNorm.weight', 'generator_predictions.dense.weight', 'generator_lm_head.bias', 'generator_lm_head.weight']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-small-generator and are newly initializ

Starting training


labels:  [8 1 3 3 2 2 4 2 3 7 1 1 7 8 6 9 7 3 9 8 0 1 4 5 1 6 2 5 6 0 9 6]
predict: [6 6 6 6 2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6]


labels:  [8 1 3 3 2 2 4 2 3 7 1 1 7 8 6 9 7 3 9 8 0 1 4 5 1 6 2 5 6 0 9 6]
predict: [6 1 1 6 2 6 4 6 5 1 0 1 6 6 4 6 4 0 6 6 9 1 4 6 1 6 1 5 6 1 9 4]


labels:  [8 1 3 3 2 2 4 2 3 7 1 1 7 8 6 9 7 3 9 8 0 1 4 5 1 6 2 5 6 0 9 6]
predict: [6 1 5 6 2 6 4 6 5 7 0 6 6 6 4 6 4 3 5 6 6 2 4 7 1 6 2 5 4 6 6 4]


labels:  [8 1 3 3 2 2 4 2 3 7 1 1 7 8 6 9 7 3 9 8 0 1 4 5 1 6 2 5 6 0 9 6]
predict: [6 1 6 6 2 6 4 6 5 7 0 1 6 6 4 6 4 3 5 6 6 2 4 6 1 6 2 5 4 6 6 4]


Some weights of the model checkpoint at /kaggle/working/step_1 were not used when initializing ElectraForMaskedLM: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
- This IS expected if you are initializing ElectraForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForMaskedLM were not initialized from the model checkpoint at /kaggle/working/step_1 and are newly initialized: ['generator_predictions.dense.bias', 'generator_predictions.LayerNorm.bias', 'generator_predictions.LayerNorm.weight', 'generator_predictions.dense.weight', 'generator_

why don't you asknable?
why don't you asksily?
why don't you ask outlaws?
why don't you asktius?
why don't you ask inevitable?
what is regents?
what is giants?
what is emeritus?
what is aforementioned?
what is legends?
let's talk about custer physics
let's talk about macleod physics
let's talk about iroquois physics
let's talk about eminent physics
let's talk about outlaws physics
{'step_1': {'max_f1': 0.44977231336858825, 'params': {'lr': 0.001, 'batch_size': 32}}}
Best f1: 0.44977231336858825 with params: {'lr': 0.001, 'batch_size': 32} and step_1
