Soft deadline: `30.03.2022 23:59`

In this homework you will learn how the fine-tuning procedure works and get acquainted with the Hugging Face Datasets library.

In [None]:
! pip install datasets
! pip install transformers

For this purpose we will use the [Datasets](https://huggingface.co/docs/datasets/) library and the `yahoo_answers_topics` dataset, whose task is to classify documents into 10 topic categories. More detailed information can be found on the dataset [page](https://huggingface.co/datasets/viewer/).


In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Fine-tuning the model (20 points)

In [3]:
from transformers import (ElectraTokenizer, ElectraForSequenceClassification,
                          get_scheduler, pipeline, ElectraForMaskedLM, ElectraModel)

import torch
from torch.utils.data import DataLoader
from datasets import load_metric

Fine-tuning on the end task consists of adding extra layers on top of the pre-trained model. The resulting model can be tuned fully (passing gradients through the whole model) or partially (updating only some of the layers).

**Task**: 
- load tokenizer and model
- look at the predictions of the model as-is before any fine-tuning


```
- Why don't you ask [MASK]?
- What is [MASK]
- Let's talk about [MASK] physics
```

- convert `best_answer` to input tokens (a helper function for the dataset is provided below)

```
def tokenize_function(examples):
    return tokenizer(examples["best_answer"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
```

- define the optimizer and, optionally, a scheduler
- fine-tune the model (write the training loop), plot how the loss changes and measure the results in terms of weighted F1 score
- get the masked word predictions (for the sample sentences above) from the fine-tuned model; explain why the results look the way they do and what should be done to change that (write down your answer)
- tune the training hyperparameters (and write down your results)
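
For reference, the weighted F1 score mentioned in the task can be sketched in plain Python as below. This is only an illustration of the definition (per-class F1 averaged with weights equal to each class's support); the function name `weighted_f1` and the toy labels are assumptions, not part of the assignment. In practice one would more likely call `sklearn.metrics.f1_score(y_true, y_pred, average="weighted")` or the `f1` metric from `load_metric`.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)  # number of true examples per class
    total = 0.0
    for label, n_true in support.items():
        # True positives: predicted `label` and the true class is `label`.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += f1 * n_true
    return total / len(y_true)


# Toy labels (hypothetical), three classes with supports 3, 2 and 1.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
score = weighted_f1(y_true, y_pred)
print(f"weighted F1 = {score:.4f}")
```

Because the weights are the class supports, frequent topics dominate the score, which matters if the sampled subset of the dataset is not balanced.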

**Tips**:
- The easiest way to get predictions is to use the transformers `pipeline` function
- Do not forget to set the `num_labels` parameter when initializing the model
- To convert data to batches use `DataLoader`
- Even the `small` version of ELECTRA can take a long time to train, so you can use a sample of the data (at least 5000 examples; set a seed for reproducibility)
- You may want to try freezing all the layers except the classification ones (i.e. not updating the pre-trained model weights); in that case use:


```
for param in model.electra.parameters():
    param.requires_grad = False
```
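
Putting these tips together, a training loop might look roughly like the sketch below. To stay self-contained and fast it does not load ELECTRA or the dataset: `ToyClassifier`, the random token ids, the learning rate and the number of epochs are all hypothetical stand-ins, and `model.encoder` plays the role of `model.electra`. The linear decay is written with PyTorch's `LambdaLR`; the `get_scheduler` helper imported earlier could be used instead.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)  # fixed seed for reproducibility

NUM_LABELS = 10  # yahoo_answers_topics has 10 topic categories


class ToyClassifier(nn.Module):
    """Hypothetical stand-in: a small 'encoder' plus a classification head."""

    def __init__(self, vocab_size=100, hidden=16, num_labels=NUM_LABELS):
        super().__init__()
        self.encoder = nn.EmbeddingBag(vocab_size, hidden)  # role of model.electra
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, input_ids):
        return self.classifier(self.encoder(input_ids))


# Random token ids and labels stand in for the tokenized dataset.
input_ids = torch.randint(0, 100, (64, 12))
labels = torch.randint(0, NUM_LABELS, (64,))
loader = DataLoader(TensorDataset(input_ids, labels), batch_size=16, shuffle=True)

model = ToyClassifier()
for param in model.encoder.parameters():  # optional: freeze the "pre-trained" part
    param.requires_grad = False

num_epochs = 3
total_steps = num_epochs * len(loader)
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-2
)
# Learning rate decays linearly from its initial value to zero.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: 1.0 - step / total_steps
)
loss_fn = nn.CrossEntropyLoss()

losses = []  # one value per step, ready to be plotted afterwards
model.train()
for epoch in range(num_epochs):
    for batch_ids, batch_labels in loader:
        optimizer.zero_grad()
        logits = model(batch_ids)
        loss = loss_fn(logits, batch_labels)
        loss.backward()
        optimizer.step()
        scheduler.step()
        losses.append(loss.item())
```

With the real model, the batches would come from the tokenized dataset, the model would be moved to the GPU, and the loss could be read from the model output when labels are passed to it.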


In [4]:
MODEL_NAME = "google/electra-small-generator"
TOKENIZER_NAME = "google/electra-base-generator"

* load tokenizer and model

In [None]:
tokenizer = ElectraTokenizer.from_pretrained(TOKENIZER_NAME)
model = ElectraForMaskedLM.from_pretrained(MODEL_NAME)

In [None]:
model

* look at the predictions of the model as-is before any fine-tuning
* The easiest way to get predictions is to use transformers pipeline function

In [7]:
masked = pipeline('fill-mask', model=model, tokenizer=tokenizer)

In [8]:
masked("Why don't you ask [MASK]?")

[{'score': 0.5342992544174194,
  'sequence': "why don't you ask me?",
  'token': 2033,
  'token_str': 'm e'},
 {'score': 0.08196018636226654,
  'sequence': "why don't you ask questions?",
  'token': 3980,
  'token_str': 'q u e s t i o n s'},
 {'score': 0.04395333677530289,
  'sequence': "why don't you ask them?",
  'token': 2068,
  'token_str': 't h e m'},
 {'score': 0.04017288610339165,
  'sequence': "why don't you ask why?",
  'token': 2339,
  'token_str': 'w h y'},
 {'score': 0.030024440959095955,
  'sequence': "why don't you ask yourself?",
  'token': 4426,
  'token_str': 'y o u r s e l f'}]

In [9]:
masked("What is [MASK]")

[{'score': 0.9262322783470154,
  'sequence': 'what is?',
  'token': 1029,
  'token_str': '?'},
 {'score': 0.05156780779361725,
  'sequence': 'what is.',
  'token': 1012,
  'token_str': '.'},
 {'score': 0.021510401740670204,
  'sequence': 'what is!',
  'token': 999,
  'token_str': '!'},
 {'score': 0.0001196492012240924,
  'sequence': 'what is -',
  'token': 1011,
  'token_str': '-'},
 {'score': 0.00010928419214906171,
  'sequence': 'what is "',
  'token': 1000,
  'token_str': '"'}]

In [10]:
masked("Let's talk about [MASK] physics")

[{'score': 0.24027501046657562,
  'sequence': "let's talk about quantum physics",
  'token': 8559,
  'token_str': 'q u a n t u m'},
 {'score': 0.21258601546287537,
  'sequence': "let's talk about theoretical physics",
  'token': 9373,
  'token_str': 't h e o r e t i c a l'},
 {'score': 0.056394025683403015,
  'sequence': "let's talk about particle physics",
  'token': 10811,
  'token_str': 'p a r t i c l e'},
 {'score': 0.0332079641520977,
  'sequence': "let's talk about real physics",
  'token': 2613,
  'token_str': 'r e a l'},
 {'score': 0.022627945989370346,
  'sequence': "let's talk about mathematical physics",
  'token': 8045,
  'token_str': 'm a t h e m a t i c a l'}]

* get the masked word predictions (for the sample sentences above) from the fine-tuned model; explain why the results look the way they do and what should be done to change that (write down your answer)

In [None]:
model = ElectraForMaskedLM.from_pretrained('/content/drive/MyDrive/Colab Notebooks/University/Advanced NLP/model')

In [None]:
model

In [12]:
masked = pipeline('fill-mask', model=model, tokenizer=tokenizer)

In [13]:
masked("Why don't you ask [MASK]?")

[{'score': 0.007874231785535812,
  'sequence': "why don't you ask?",
  'token': 0,
  'token_str': '[ P A D ]'},
 {'score': 0.006127915345132351,
  'sequence': "why don't you ask horn?",
  'token': 7109,
  'token_str': 'h o r n'},
 {'score': 0.005303030833601952,
  'sequence': "why don't you ask flap?",
  'token': 20916,
  'token_str': 'f l a p'},
 {'score': 0.004756000358611345,
  'sequence': "why don't you ask felipe?",
  'token': 17095,
  'token_str': 'f e l i p e'},
 {'score': 0.004418663680553436,
  'sequence': "why don't you askrix?",
  'token': 17682,
  'token_str': '# # r i x'}]

In [14]:
masked("What is [MASK]")

[{'score': 0.005034726113080978,
  'sequence': 'what israße',
  'token': 27807,
  'token_str': '# # r a ß e'},
 {'score': 0.004568912088871002,
  'sequence': 'what is schwarz',
  'token': 29058,
  'token_str': 's c h w a r z'},
 {'score': 0.004546832758933306,
  'sequence': 'what is headline',
  'token': 17653,
  'token_str': 'h e a d l i n e'},
 {'score': 0.004321066662669182,
  'sequence': 'what is worrying',
  'token': 15366,
  'token_str': 'w o r r y i n g'},
 {'score': 0.004065139684826136,
  'sequence': 'what isriding',
  'token': 21930,
  'token_str': '# # r i d i n g'}]

In [15]:
masked("Let's talk about [MASK] physics")

[{'score': 0.009592956863343716,
  'sequence': "let's talk aboutasa physics",
  'token': 16782,
  'token_str': '# # a s a'},
 {'score': 0.008506628684699535,
  'sequence': "let's talk about lateral physics",
  'token': 11457,
  'token_str': 'l a t e r a l'},
 {'score': 0.00676969438791275,
  'sequence': "let's talk about leyte physics",
  'token': 27214,
  'token_str': 'l e y t e'},
 {'score': 0.0054626683704555035,
  'sequence': "let's talk aboutriding physics",
  'token': 21930,
  'token_str': '# # r i d i n g'},
 {'score': 0.0044946749694645405,
  'sequence': "let's talk about statewide physics",
  'token': 13486,
  'token_str': 's t a t e w i d e'}]

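**Conclusion:** ELECTRA was pre-trained on a large corpus of about 3.3 billion tokens from Wikipedia and BooksCorpus. The pre-training objective of the generator is masked language modeling, so the pre-trained model gives confident (high-probability) predictions that read naturally to a human.  
The fine-tuned model, in contrast, was trained by me on a classification task with all layers trainable, so the pre-trained weights were updated for topic classification on the dataset rather than for predicting masked tokens. A likely additional factor is that a checkpoint saved from a classification model carries no trained masked-LM head, so that head is newly initialized when the checkpoint is loaded into `ElectraForMaskedLM`. Both effects are consistent with the nearly flat scores (all below 0.01) and the unrelated tokens in the output. To get meaningful masked-word predictions again, the model would need further training with the masked language modeling objective.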
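
The difference in confidence can be put into numbers with a quick check like the one below. It reuses the top-5 scores printed above for "Why don't you ask [MASK]?" (rounded to six decimals) and compares them with a uniform distribution over the vocabulary. The vocabulary size of 30522 is an assumption based on the standard uncased WordPiece vocabulary, not something printed in this notebook.

```python
# Top-5 scores for "Why don't you ask [MASK]?", copied from the outputs above.
pretrained_top5 = [0.534299, 0.081960, 0.043953, 0.040173, 0.030024]
finetuned_top5 = [0.007874, 0.006128, 0.005303, 0.004756, 0.004419]

VOCAB_SIZE = 30522  # assumption: standard uncased WordPiece vocabulary
uniform = 1 / VOCAB_SIZE  # probability of each token under a flat distribution

pretrained_mass = sum(pretrained_top5)  # probability covered by the top 5
finetuned_mass = sum(finetuned_top5)

print(f"top-5 mass, pre-trained: {pretrained_mass:.3f}")
print(f"top-5 mass, fine-tuned:  {finetuned_mass:.3f}")
print(f"top-1 vs uniform, pre-trained: {pretrained_top5[0] / uniform:.0f}x")
print(f"top-1 vs uniform, fine-tuned:  {finetuned_top5[0] / uniform:.0f}x")
```

The pre-trained model puts most of its probability on five plausible words, while the fine-tuned checkpoint spreads almost all of it over the rest of the vocabulary.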