# Week 9: Sentence Level Classification with BERT

Your goal this week is to train a classifier that predicts the CEFR level of any given sentence. This notebook guides you through using 🤗[Hugging Face](https://huggingface.co/) and its `transformers` library as the training framework, with [PyTorch](https://pytorch.org/) as the deep learning backend, but feel free to use [TensorFlow](https://www.tensorflow.org) if that's what you are more familiar with.

For this assignment we provide a dataset of sentences labeled with their CEFR levels; your task is to fine-tune BERT on it to build a sentence classifier.

## Prepare your environment

As always, we highly recommend that you install all packages with a virtual environment manager, like [venv](https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/) or [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html), to prevent version conflicts of different packages.  

### Install CUDA
Deep learning is computationally intensive. Training on a CPU alone takes a long time, especially with a large dataset, which is why using a GPU is generally recommended.  
To use a GPU for computation, you have to install the [CUDA toolkit](https://developer.nvidia.com/cuda-toolkit) as well as the [cuDNN library](https://developer.nvidia.com/cudnn) provided by NVIDIA.  

If you already have CUDA installed on your machine, then great! You're done here.  
If you don't, refer to the [Appendix](#Appendix-1-Install-CUDA) to see how to install it.


### Install Python packages
The following Python packages will be used in this tutorial:

1. `numpy`: for matrix operations
2. `scikit-learn`: for label encoding
3. `datasets`: for data preparation
4. `transformers`: for model loading and finetuning
5. `pytorch`: the backend DL framework
  - Note that the PyTorch version must support the CUDA version you've installed if you want to use a GPU.

### Select GPU(s) for your backend

Skip this section if you have no intention of using a GPU with TensorFlow/PyTorch.

In [1]:
import os

# select your GPU. Note that this should be set before you load tensorflow or pytorch.
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

# To use multiple GPUs, join the GPU IDs with commas
# e.g. >>> os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,3'

In [2]:
import torch
# Check whether a GPU is available

torch.cuda.is_available()



True

## Prepare the dataset

Before starting the training, we need to load and process our dataset - but wait, let's decide which model we want to use first.  

In the unlikely event that you've never heard of it, [BERT](https://arxiv.org/abs/1810.04805) (**B**idirectional **E**ncoder **R**epresentations from **T**ransformers) is a language model proposed by Google AI in 2018, and it remains one of the most popular models in NLP.  
You can learn more about it here:
- [BERT Explained: A Complete Guide with Theory and Tutorial](https://towardsml.com/2019/09/17/bert-explained-a-complete-guide-with-theory-and-tutorial/) by Samia, 2019.


However, we will not use BERT directly in this tutorial, because it is large and takes a long time to train. Instead, we'll use [DistilBERT](https://medium.com/huggingface/distilbert-8cf3380435b5), a lightweight version of BERT that retains roughly 95% of the original model's performance.




In [3]:
# the model you want to use. Available models can be found here: https://huggingface.co/models
MODEL_NAME = 'distilbert-base-uncased'

### Load data

Like the `transformers` library, `datasets` is a package by Hugging Face. It provides access to many public datasets and helps with data processing.  
We can use its `load_dataset` function to read the `.csv` files provided for this assignment.

Reference:
 - [Official datasets document](https://huggingface.co/docs/datasets)
 - [datasets.load_dataset](https://huggingface.co/docs/datasets/loading.html)

In [4]:
# [ TODO ] load the data using the load_dataset function
from datasets import load_dataset
dataset = load_dataset("data")

Using custom data configuration data-2aa0c8c78d4e2ea2
Found cached dataset csv (C:/Users/love4/.cache/huggingface/datasets/csv/data-2aa0c8c78d4e2ea2/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317)
100%|██████████| 2/2 [00:00<00:00, 154.91it/s]


In [5]:
dataset.map(batched=False, with_indices=False)

Loading cached processed dataset at C:/Users/love4/.cache/huggingface/datasets/csv/data-2aa0c8c78d4e2ea2/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317\cache-74188df31858a08d.arrow
Loading cached processed dataset at C:/Users/love4/.cache/huggingface/datasets/csv/data-2aa0c8c78d4e2ea2/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317\cache-6ac0574c087480a3.arrow


DatasetDict({
    train: Dataset({
        features: ['text', 'level'],
        num_rows: 20720
    })
    test: Dataset({
        features: ['text', 'level'],
        num_rows: 2300
    })
})

In [6]:
print(dataset['train'])
print(dataset['train'][1])
print(dataset['train']['text'][:5])

Dataset({
    features: ['text', 'level'],
    num_rows: 20720
})
{'text': 'You can contact me by e-mail.', 'level': 'A1'}
['My mother is having her car repaired.', 'You can contact me by e-mail.', 'He had a break for the weekend, and he called me: "I am in London, so, if you want to see me, it\'s the time!"', "Research shows that 40 percent of the program's viewers are aged over 55.", "I'd guess she's about my age."]


### Preprocessing

As always, texts must be tokenized, converted to token IDs, and padded before being fed to the model.  
But not to worry, Hugging Face provides libraries to help with this, too.

#### Sentence processing

Different pre-trained language models may use different preprocessing schemes, which is why we should use the tokenizer that was trained along with the model. In our case, we are using DistilBERT, so we should use the DistilBERT tokenizer.  

With Hugging Face, loading a tokenizer is easy: just import `AutoTokenizer` from `transformers` and tell it which model you plan to use, and it will handle the rest.

Reference:
 - [transformers.AutoTokenizer](https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoTokenizer)

In [6]:
import transformers

In [7]:
# [ TODO ] load the distilBERT tokenizer using AutoTokenizer

tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_NAME)

#### Label processing

Our labels also need to be processed, so let's do that next.

For this tutorial, we'll use the OneHotEncoder provided by scikit-learn.

For now, just declare a new encoder and use `fit` to learn the data. Hint: you should still end up with 6 labels.

Documents:
 - [sklearn.preprocessing.OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder)

In [8]:
from sklearn.preprocessing import OneHotEncoder

In [9]:
# [ TODO ] declare a new encoder and let it learn from the dataset

encoder = OneHotEncoder()

In [10]:
import numpy as np
# OneHotEncoder expects a 2-D array of shape (n_samples, 1)
encoder.fit(np.array(dataset['train']['level']).reshape(-1, 1))

In [11]:
# check if you still have 6 labels
LABEL_COUNT = len(encoder.categories_[0])
print(LABEL_COUNT)

6
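
To see what the fitted encoder produces, here is a minimal sketch on a handful of hand-written labels rather than the assignment data. The name `demo_encoder` and the sample labels are illustrative assumptions; the calls themselves (`fit`, `transform`, `inverse_transform`, `categories_`) are part of scikit-learn's `OneHotEncoder`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hand-written sample labels (illustrative, not the assignment data).
levels = np.array(['A1', 'A2', 'B1', 'B2', 'C1', 'C2', 'B1']).reshape(-1, 1)

demo_encoder = OneHotEncoder()
demo_encoder.fit(levels)

# Categories are stored in sorted order, one array per input column.
categories = list(demo_encoder.categories_[0])

# transform() returns a sparse matrix by default; toarray() makes it dense.
one_hot = demo_encoder.transform(np.array([['B1']])).toarray()

# inverse_transform() maps a one-hot row back to the original label.
decoded = demo_encoder.inverse_transform(one_hot)

print(categories)
print(one_hot)
print(decoded)
```

Because the categories are sorted, `B1` ends up at index 2, which is the position of the `1.0` in the encoded vector.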


#### Process the data

To make things easier, we can write a function to process our dataset in batches. 

In [12]:
def preprocess(dataslice):
    """ Input: a batch of your dataset
        Example: { 'text': ['sentence1', 'sentence2', ...],
                   'level': ['label1', 'label2', ...] }

        Output: a batch of processed dataset
        Example: { 'input_ids': ...,
                   'attention_mask': ...,
                   'label': ... }
    """

    # [ TODO ] use your tokenizer and encoder to get tokenized sentences and encoded labels
    output = tokenizer(dataslice['text'], truncation=True, padding=True, return_tensors="pt")

    # one-hot encode the CEFR levels; the encoder expects shape (n_samples, 1)
    levels = np.array(dataslice['level']).reshape(-1, 1)
    output.update({'label': encoder.transform(levels).toarray()})
    return output

In [13]:
# map the function to the whole dataset
processed_data = dataset.map(preprocess,    # your processing function
                             batched = True # Process in batches so it can be faster
                            )

Loading cached processed dataset at C:/Users/love4/.cache/huggingface/datasets/csv/data-2aa0c8c78d4e2ea2/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317\cache-6bd6e1a4adb69a45.arrow
Loading cached processed dataset at C:/Users/love4/.cache/huggingface/datasets/csv/data-2aa0c8c78d4e2ea2/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317\cache-2f00b6dbd2146cb4.arrow
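
To make the `batched=True` contract concrete, the sketch below imitates it in plain Python: the processing function receives a dict of column lists and returns a dict of new column lists. `toy_batched_map` and `add_length` are hypothetical helpers written for this illustration and are not part of `datasets`; the two sentences and their levels are taken from the outputs shown earlier.

```python
def toy_batched_map(columns, fn, batch_size=1000):
    """Simplified stand-in for Dataset.map(fn, batched=True)."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        # Each batch is a dict mapping column names to lists of values.
        batch = {name: values[start:start + batch_size]
                 for name, values in columns.items()}
        # The function returns new columns for this batch only.
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Original columns are kept and the new ones are added.
    return {**columns, **new_columns}


def add_length(batch):
    # A stand-in for preprocess(): one output value per input row.
    return {'n_chars': [len(text) for text in batch['text']]}


toy = {
    'text': ['My mother is having her car repaired.',
             'You can contact me by e-mail.'],
    'level': ['B1', 'A1'],
}
mapped = toy_batched_map(toy, add_length, batch_size=1)
print(mapped['n_chars'])
```

The real `map` works on Arrow tables and caches its results, as the log above shows, but the shape of the input and output of the processing function is the same.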


In [14]:
print(processed_data)
processed_data['train'][0]

DatasetDict({
    train: Dataset({
        features: ['text', 'level', 'input_ids', 'attention_mask', 'label'],
        num_rows: 20720
    })
    test: Dataset({
        features: ['text', 'level', 'input_ids', 'attention_mask', 'label'],
        num_rows: 2300
    })
})


{'text': 'My mother is having her car repaired.',
 'level': 'B1',
 'input_ids': [101, 2026, 2388, 2003, 2383, 2014, 2482, 13671, 1012, 102,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'label': [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]}
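
The structure of this output can be reproduced with a toy encoder. This is only a sketch: `TOY_VOCAB` and `toy_encode` are hypothetical, and the word-to-ID pairs are read off the output above under the assumption that the IDs line up with the words of the sentence in order. The real tokenizer also lowercases text and splits unknown words into subwords, which this sketch does not attempt.

```python
# Toy vocabulary: IDs copied from the output above (assumed word order).
TOY_VOCAB = {
    '[PAD]': 0, '[CLS]': 101, '[SEP]': 102,
    'my': 2026, 'mother': 2388, 'is': 2003, 'having': 2383,
    'her': 2014, 'car': 2482, 'repaired': 13671, '.': 1012,
}


def toy_encode(tokens, max_length):
    """Wrap tokens in [CLS] ... [SEP], then pad up to max_length."""
    ids = [TOY_VOCAB['[CLS]']] + [TOY_VOCAB[t] for t in tokens] + [TOY_VOCAB['[SEP]']]
    n_pad = max(0, max_length - len(ids))
    return {
        'input_ids': ids + [TOY_VOCAB['[PAD]']] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore.
        'attention_mask': [1] * len(ids) + [0] * n_pad,
    }


tokens = ['my', 'mother', 'is', 'having', 'her', 'car', 'repaired', '.']
encoded = toy_encode(tokens, max_length=12)
print(encoded['input_ids'])
print(encoded['attention_mask'])
```

The ten non-zero IDs match the start of `input_ids` above, and the attention mask switches from 1 to 0 exactly where the padding begins.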

### DataCollator

Padding does not have to be fixed during preprocessing: it can also be applied at training time, batch by batch, so that each batch is only padded to the length of its own longest sentence.  

To do training-time processing, we can use the data collator classes provided by `transformers`. And guess what - `transformers` has one that handles padding for us, too!

 - [transformers.DataCollatorWithPadding](https://huggingface.co/docs/transformers/master/en/main_classes/data_collator#transformers.DataCollatorWithPadding)

In [15]:
# [ TODO ] declare a collator to do padding during training
from transformers import DataCollatorWithPadding
data_collator =  DataCollatorWithPadding(tokenizer=tokenizer)
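
As a rough picture of what padding at training time means, the sketch below pads each batch only up to its own longest sequence. `pad_batch` is a hypothetical helper written for this illustration; it is a simplified stand-in for the idea, not the implementation used by `DataCollatorWithPadding`, which also converts the result to tensors.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in the batch to the longest one in that batch."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {'input_ids': input_ids, 'attention_mask': attention_mask}


# Two already-tokenized sequences of different lengths (toy IDs).
batch = [
    [101, 2026, 102],
    [101, 2026, 2388, 2003, 102],
]
padded = pad_batch(batch)
print(padded['input_ids'])
print(padded['attention_mask'])
```

Padding per batch instead of per dataset keeps short batches short, which saves computation when sentence lengths vary a lot.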

## Training

Finally, we can move on to training.

### Preparation

We can load the pretrained model from `transformers`.  
Generally, you need to build your own model on top of BERT if you want to use it for a downstream task, but sequence classification is common enough that the `transformers` library supports it out of the box. It can be done in two lines of code: 

1. Load `AutoModelForSequenceClassification` Class.
2. Load the pretrained model.

In [16]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME,
                                                           num_labels = LABEL_COUNT)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'classifier.w

#### Split train/val data

The `Dataset` class we prepared before has a `train_test_split` method. You can use it to split your (processed) dataset.

Document:
 - [datasets.Dataset - Sort, shuffle, select, split, and shard](https://huggingface.co/docs/datasets/process.html#sort-shuffle-select-split-and-shard)

In [17]:
# [ TODO ] choose a validation size and split your data
train_val_dataset = processed_data['train'].train_test_split(test_size=0.1)

In [18]:
print(train_val_dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'level', 'input_ids', 'attention_mask', 'label'],
        num_rows: 18648
    })
    test: Dataset({
        features: ['text', 'level', 'input_ids', 'attention_mask', 'label'],
        num_rows: 2072
    })
})
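
The row counts above follow directly from `test_size=0.1`. The sketch below reproduces them with a plain shuffled index split; `split_indices` is a hypothetical helper, and its rounding and shuffling are simplifying assumptions rather than the library's exact procedure. If you want the same split on every run, `train_test_split` also accepts a `seed` argument.

```python
import random


def split_indices(n_rows, test_size, seed=42):
    """Shuffle row indices and cut off a validation share (simplified)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_val = round(n_rows * test_size)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(20720, 0.1)
print(len(train_idx), len(val_idx))
```

With 20720 rows, a 10% validation share gives 2072 validation rows and leaves 18648 for training, matching the `DatasetDict` printed above.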


#### Setup training parameters

We are using the Trainer API to do the training. `Trainer` is yet another utility provided by Hugging Face that helps you train the model with ease.  

Document:
- [transformers.TrainingArguments](https://huggingface.co/docs/transformers/master/en/main_classes/trainer#transformers.TrainingArguments)
- [transformers.Trainer](https://huggingface.co/docs/transformers/master/en/main_classes/trainer#transformers.Trainer)

In [19]:
from transformers import TrainingArguments, Trainer

In [20]:
# [ TODO ] set and tune your training properties
OUTPUT_DIR = './model/'
LEARNING_RATE = 0.00001
BATCH_SIZE = 64
EPOCH = 50
training_args = TrainingArguments(
    output_dir = OUTPUT_DIR,
    learning_rate = LEARNING_RATE,
    per_device_train_batch_size = BATCH_SIZE,
    per_device_eval_batch_size = BATCH_SIZE,
    num_train_epochs = EPOCH,
    # you can set more parameters here if you want
)

# now give all the information to a trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_val_dataset['train'],
    eval_dataset=train_val_dataset['test'],
    tokenizer=tokenizer,
    # set your parameters here

)

### Training

This is the easy part. Simply ask the trainer to train the model for you!

In [21]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: level, text. If level, text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 18648
  Num Epochs = 50
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 14600
  Number of trainable parameters = 66958086
  0%|          | 0/14600 [00:00<?, ?it/s]You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
  3%|▎         | 500/14600 [01:35<41:39,  5.64it/s] Saving model checkpoint to ./model/checkpoint-500
Configuration saved in ./model/checkpoint-500

{'loss': 0.3944, 'learning_rate': 9.657534246575343e-06, 'epoch': 1.71}


Model weights saved in ./model/checkpoint-500\pytorch_model.bin
tokenizer config file saved in ./model/checkpoint-500\tokenizer_config.json
Special tokens file saved in ./model/checkpoint-500\special_tokens_map.json
  7%|▋         | 1000/14600 [03:08<43:11,  5.25it/s] Saving model checkpoint to ./model/checkpoint-1000
Configuration saved in ./model/checkpoint-1000\config.json


{'loss': 0.3178, 'learning_rate': 9.315068493150685e-06, 'epoch': 3.42}


Model weights saved in ./model/checkpoint-1000\pytorch_model.bin
tokenizer config file saved in ./model/checkpoint-1000\tokenizer_config.json
Special tokens file saved in ./model/checkpoint-1000\special_tokens_map.json
 10%|█         | 1500/14600 [04:42<41:03,  5.32it/s]  Saving model checkpoint to ./model/checkpoint-1500
Configuration saved in ./model/checkpoint-1500\config.json


{'loss': 0.2765, 'learning_rate': 8.972602739726028e-06, 'epoch': 5.14}


Model weights saved in ./model/checkpoint-1500\pytorch_model.bin
tokenizer config file saved in ./model/checkpoint-1500\tokenizer_config.json
Special tokens file saved in ./model/checkpoint-1500\special_tokens_map.json
 14%|█▎        | 2000/14600 [06:16<39:28,  5.32it/s]  Saving model checkpoint to ./model/checkpoint-2000
Configuration saved in ./model/checkpoint-2000\config.json


{'loss': 0.2362, 'learning_rate': 8.63013698630137e-06, 'epoch': 6.85}


Model weights saved in ./model/checkpoint-2000\pytorch_model.bin
tokenizer config file saved in ./model/checkpoint-2000\tokenizer_config.json
Special tokens file saved in ./model/checkpoint-2000\special_tokens_map.json
 17%|█▋        | 2500/14600 [07:49<38:02,  5.30it/s]  Saving model checkpoint to ./model/checkpoint-2500
Configuration saved in ./model/checkpoint-2500\config.json


{'loss': 0.1995, 'learning_rate': 8.287671232876712e-06, 'epoch': 8.56}


Model weights saved in ./model/checkpoint-2500\pytorch_model.bin
tokenizer config file saved in ./model/checkpoint-2500\tokenizer_config.json
Special tokens file saved in ./model/checkpoint-2500\special_tokens_map.json
 21%|██        | 3000/14600 [09:23<36:21,  5.32it/s]  Saving model checkpoint to ./model/checkpoint-3000
Configuration saved in ./model/checkpoint-3000\config.json


{'loss': 0.1697, 'learning_rate': 7.945205479452055e-06, 'epoch': 10.27}


Model weights saved in ./model/checkpoint-3000\pytorch_model.bin
tokenizer config file saved in ./model/checkpoint-3000\tokenizer_config.json
Special tokens file saved in ./model/checkpoint-3000\special_tokens_map.json
 24%|██▍       | 3500/14600 [10:57<34:52,  5.31it/s]  Saving model checkpoint to ./model/checkpoint-3500
Configuration saved in ./model/checkpoint-3500\config.json


{'loss': 0.1456, 'learning_rate': 7.6027397260273985e-06, 'epoch': 11.99}


Model weights saved in ./model/checkpoint-3500\pytorch_model.bin
tokenizer config file saved in ./model/checkpoint-3500\tokenizer_config.json
Special tokens file saved in ./model/checkpoint-3500\special_tokens_map.json
 27%|██▋       | 4000/14600 [12:30<33:26,  5.28it/s]  Saving model checkpoint to ./model/checkpoint-4000
Configuration saved in ./model/checkpoint-4000\config.json


{'loss': 0.1234, 'learning_rate': 7.260273972602741e-06, 'epoch': 13.7}


Model weights saved in ./model/checkpoint-4000\pytorch_model.bin
tokenizer config file saved in ./model/checkpoint-4000\tokenizer_config.json
Special tokens file saved in ./model/checkpoint-4000\special_tokens_map.json
 31%|███       | 4500/14600 [14:04<30:40,  5.49it/s]  Saving model checkpoint to ./model/checkpoint-4500
Configuration saved in ./model/checkpoint-4500\config.json


{'loss': 0.1063, 'learning_rate': 6.917808219178082e-06, 'epoch': 15.41}


Model weights saved in ./model/checkpoint-4500\pytorch_model.bin
tokenizer config file saved in ./model/checkpoint-4500\tokenizer_config.json
Special tokens file saved in ./model/checkpoint-4500\special_tokens_map.json
 34%|███▍      | 5000/14600 [15:38<29:14,  5.47it/s]  Saving model checkpoint to ./model/checkpoint-5000
Configuration saved in ./model/checkpoint-5000\config.json


{'loss': 0.0924, 'learning_rate': 6.5753424657534245e-06, 'epoch': 17.12}


Model weights saved in ./model/checkpoint-5000\pytorch_model.bin
tokenizer config file saved in ./model/checkpoint-5000\tokenizer_config.json
Special tokens file saved in ./model/checkpoint-5000\special_tokens_map.json
 38%|███▊      | 5500/14600 [17:08<27:01,  5.61it/s]  Saving model checkpoint to ./model/checkpoint-5500
Configuration saved in ./model/checkpoint-5500\config.json


{'loss': 0.0792, 'learning_rate': 6.2328767123287685e-06, 'epoch': 18.84}


Model weights saved in ./model/checkpoint-5500\pytorch_model.bin
tokenizer config file saved in ./model/checkpoint-5500\tokenizer_config.json
Special tokens file saved in ./model/checkpoint-5500\special_tokens_map.json
 41%|████      | 6000/14600 [18:39<25:58,  5.52it/s]  Saving model checkpoint to ./model/checkpoint-6000
Configuration saved in ./model/checkpoint-6000\config.json


{'loss': 0.0665, 'learning_rate': 5.89041095890411e-06, 'epoch': 20.55}


Model weights saved in ./model/checkpoint-6000\pytorch_model.bin
tokenizer config file saved in ./model/checkpoint-6000\tokenizer_config.json
Special tokens file saved in ./model/checkpoint-6000\special_tokens_map.json
 45%|████▍     | 6500/14600 [20:13<23:44,  5.69it/s]  Saving model checkpoint to ./model/checkpoint-6500
Configuration saved in ./model/checkpoint-6500\config.json


{'loss': 0.0609, 'learning_rate': 5.547945205479452e-06, 'epoch': 22.26}


Model weights saved in ./model/checkpoint-6500\pytorch_model.bin
tokenizer config file saved in ./model/checkpoint-6500\tokenizer_config.json
Special tokens file saved in ./model/checkpoint-6500\special_tokens_map.json
 48%|████▊     | 7000/14600 [21:47<24:00,  5.28it/s]  Saving model checkpoint to ./model/checkpoint-7000
Configuration saved in ./model/checkpoint-7000\config.json


{'loss': 0.0537, 'learning_rate': 5.2054794520547945e-06, 'epoch': 23.97}


Model weights saved in ./model/checkpoint-7000\pytorch_model.bin
tokenizer config file saved in ./model/checkpoint-7000\tokenizer_config.json
Special tokens file saved in ./model/checkpoint-7000\special_tokens_map.json
 51%|█████▏    | 7500/14600 [23:20<21:47,  5.43it/s]  Saving model checkpoint to ./model/checkpoint-7500
Configuration saved in ./model/checkpoint-7500\config.json


{'loss': 0.0471, 'learning_rate': 4.863013698630138e-06, 'epoch': 25.68}


Model weights saved in ./model/checkpoint-7500\pytorch_model.bin
tokenizer config file saved in ./model/checkpoint-7500\tokenizer_config.json
Special tokens file saved in ./model/checkpoint-7500\special_tokens_map.json
 55%|█████▍    | 8000/14600 [24:54<20:35,  5.34it/s]Saving model checkpoint to ./model/checkpoint-8000
Configuration saved in ./model/checkpoint-8000\config.json


{'loss': 0.041, 'learning_rate': 4.52054794520548e-06, 'epoch': 27.4}


Model weights saved in ./model/checkpoint-8000\pytorch_model.bin
tokenizer config file saved in ./model/checkpoint-8000\tokenizer_config.json
Special tokens file saved in ./model/checkpoint-8000\special_tokens_map.json
 58%|█████▊    | 8500/14600 [26:28<19:02,  5.34it/s]Saving model checkpoint to ./model/checkpoint-8500
Configuration saved in ./model/checkpoint-8500\config.json


{'loss': 0.0366, 'learning_rate': 4.178082191780822e-06, 'epoch': 29.11}


Model weights saved in ./model/checkpoint-8500\pytorch_model.bin
tokenizer config file saved in ./model/checkpoint-8500\tokenizer_config.json
Special tokens file saved in ./model/checkpoint-8500\special_tokens_map.json
 62%|██████▏   | 9000/14600 [28:01<17:42,  5.27it/s]Saving model checkpoint to ./model/checkpoint-9000
Configuration saved in ./model/checkpoint-9000\config.json


{'loss': 0.0345, 'learning_rate': 3.8356164383561645e-06, 'epoch': 30.82}


Model weights saved in ./model/checkpoint-9000\pytorch_model.bin
tokenizer config file saved in ./model/checkpoint-9000\tokenizer_config.json
Special tokens file saved in ./model/checkpoint-9000\special_tokens_map.json
 65%|██████▌   | 9500/14600 [29:29<15:02,  5.65it/s]Saving model checkpoint to ./model/checkpoint-9500
Configuration saved in ./model/checkpoint-9500\config.json


{'loss': 0.0292, 'learning_rate': 3.4931506849315072e-06, 'epoch': 32.53}


Model weights saved in ./model/checkpoint-9500\pytorch_model.bin
tokenizer config file saved in ./model/checkpoint-9500\tokenizer_config.json
Special tokens file saved in ./model/checkpoint-9500\special_tokens_map.json
 68%|██████▊   | 10000/14600 [30:56<13:27,  5.69it/s]Saving model checkpoint to ./model/checkpoint-10000
Configuration saved in ./model/checkpoint-10000\config.json


{'loss': 0.0274, 'learning_rate': 3.1506849315068495e-06, 'epoch': 34.25}


Model weights saved in ./model/checkpoint-10000\pytorch_model.bin
tokenizer config file saved in ./model/checkpoint-10000\tokenizer_config.json
Special tokens file saved in ./model/checkpoint-10000\special_tokens_map.json
 72%|███████▏  | 10500/14600 [32:25<10:04,  6.78it/s]Saving model checkpoint to ./model/checkpoint-10500
Configuration saved in ./model/checkpoint-10500\config.json


{'loss': 0.0248, 'learning_rate': 2.8082191780821922e-06, 'epoch': 35.96}


Model weights saved in ./model/checkpoint-10500\pytorch_model.bin
tokenizer config file saved in ./model/checkpoint-10500\tokenizer_config.json
Special tokens file saved in ./model/checkpoint-10500\special_tokens_map.json
 75%|███████▌  | 11000/14600 [33:52<10:21,  5.80it/s]Saving model checkpoint to ./model/checkpoint-11000
Configuration saved in ./model/checkpoint-11000\config.json


{'loss': 0.0225, 'learning_rate': 2.4657534246575345e-06, 'epoch': 37.67}


Model weights saved in ./model/checkpoint-11000\pytorch_model.bin
tokenizer config file saved in ./model/checkpoint-11000\tokenizer_config.json
Special tokens file saved in ./model/checkpoint-11000\special_tokens_map.json
 79%|███████▉  | 11500/14600 [35:20<09:00,  5.73it/s]Saving model checkpoint to ./model/checkpoint-11500
Configuration saved in ./model/checkpoint-11500\config.json


{'loss': 0.0207, 'learning_rate': 2.123287671232877e-06, 'epoch': 39.38}


Model weights saved in ./model/checkpoint-11500\pytorch_model.bin
tokenizer config file saved in ./model/checkpoint-11500\tokenizer_config.json
Special tokens file saved in ./model/checkpoint-11500\special_tokens_map.json
 82%|████████▏ | 12000/14600 [36:48<07:16,  5.95it/s]Saving model checkpoint to ./model/checkpoint-12000
Configuration saved in ./model/checkpoint-12000\config.json


{'loss': 0.0204, 'learning_rate': 1.7808219178082193e-06, 'epoch': 41.1}


Model weights saved in ./model/checkpoint-12000\pytorch_model.bin
tokenizer config file saved in ./model/checkpoint-12000\tokenizer_config.json
Special tokens file saved in ./model/checkpoint-12000\special_tokens_map.json
 86%|████████▌ | 12500/14600 [38:15<06:11,  5.65it/s]Saving model checkpoint to ./model/checkpoint-12500
Configuration saved in ./model/checkpoint-12500\config.json


{'loss': 0.0175, 'learning_rate': 1.4383561643835616e-06, 'epoch': 42.81}


Model weights saved in ./model/checkpoint-12500\pytorch_model.bin
tokenizer config file saved in ./model/checkpoint-12500\tokenizer_config.json
Special tokens file saved in ./model/checkpoint-12500\special_tokens_map.json
 89%|████████▉ | 13000/14600 [39:43<04:21,  6.13it/s]Saving model checkpoint to ./model/checkpoint-13000
Configuration saved in ./model/checkpoint-13000\config.json


{'loss': 0.0169, 'learning_rate': 1.095890410958904e-06, 'epoch': 44.52}


Model weights saved in ./model/checkpoint-13000\pytorch_model.bin
tokenizer config file saved in ./model/checkpoint-13000\tokenizer_config.json
Special tokens file saved in ./model/checkpoint-13000\special_tokens_map.json
 92%|█████████▏| 13500/14600 [41:10<03:15,  5.63it/s]Saving model checkpoint to ./model/checkpoint-13500
Configuration saved in ./model/checkpoint-13500\config.json


{'loss': 0.0152, 'learning_rate': 7.534246575342466e-07, 'epoch': 46.23}


Model weights saved in ./model/checkpoint-13500\pytorch_model.bin
tokenizer config file saved in ./model/checkpoint-13500\tokenizer_config.json
Special tokens file saved in ./model/checkpoint-13500\special_tokens_map.json
 96%|█████████▌| 14000/14600 [42:38<01:45,  5.71it/s]Saving model checkpoint to ./model/checkpoint-14000
Configuration saved in ./model/checkpoint-14000\config.json


{'loss': 0.0152, 'learning_rate': 4.1095890410958903e-07, 'epoch': 47.95}


Model weights saved in ./model/checkpoint-14000\pytorch_model.bin
tokenizer config file saved in ./model/checkpoint-14000\tokenizer_config.json
Special tokens file saved in ./model/checkpoint-14000\special_tokens_map.json
 99%|█████████▉| 14500/14600 [44:08<00:18,  5.29it/s]Saving model checkpoint to ./model/checkpoint-14500
Configuration saved in ./model/checkpoint-14500\config.json


{'loss': 0.014, 'learning_rate': 6.84931506849315e-08, 'epoch': 49.66}


Model weights saved in ./model/checkpoint-14500\pytorch_model.bin
tokenizer config file saved in ./model/checkpoint-14500\tokenizer_config.json
Special tokens file saved in ./model/checkpoint-14500\special_tokens_map.json
100%|█████████▉| 14599/14600 [44:28<00:00,  5.19it/s]

Training completed. Do not forget to share your model on huggingface.co/models =)


100%|██████████| 14600/14600 [44:28<00:00,  5.47it/s]

{'train_runtime': 2668.3688, 'train_samples_per_second': 349.427, 'train_steps_per_second': 5.472, 'train_loss': 0.09273023164435609, 'epoch': 50.0}





TrainOutput(global_step=14600, training_loss=0.09273023164435609, metrics={'train_runtime': 2668.3688, 'train_samples_per_second': 349.427, 'train_steps_per_second': 5.472, 'train_loss': 0.09273023164435609, 'epoch': 50.0})
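
The step counts and learning rates in this log can be checked by hand. The sketch below assumes a linear decay to zero with no warmup, which is consistent with the logged values; the variable and function names are illustrative.

```python
import math

num_examples = 18648   # "Num examples" in the log
batch_size = 64
epochs = 50
learning_rate = 1e-5

# The last, partially filled batch of each epoch still counts as a step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs


def linear_lr(step):
    """Learning rate after `step` updates under linear decay to zero."""
    return learning_rate * (1 - step / total_steps)


print(steps_per_epoch, total_steps)
print(linear_lr(500))
print(round(500 / steps_per_epoch, 2))  # epoch reached at step 500
```

This gives 292 steps per epoch and 14600 steps in total, and the value at step 500 agrees with the first logged `learning_rate` and `epoch`.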

### Save for future use

Hint: try using `save_pretrained`

In [33]:
# [ TODO ] practice saving your model for future use

model.save_pretrained('./model/finetuned')

Configuration saved in ./model/finetuned\config.json
Model weights saved in ./model/finetuned\pytorch_model.bin


## Prediction

Now we know exactly how to train a model, but how do we use it for predicting results?

### Load finetuned model

In [23]:
# [ TODO ] load the model that you saved
from transformers import AutoModelForSequenceClassification
mymodel = AutoModelForSequenceClassification.from_pretrained('./model/finetuned/', num_labels = LABEL_COUNT)

loading configuration file ./model/finetuned1/config.json
Model config DistilBertConfig {
  "_name_or_path": "./model/finetuned1/",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "multi_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.24.0",
  "vocab_size": 30522
}

loading weights file ./model/fine

### Get the prediction

Here are a few example sentences:

In [24]:
examples = [
    # A2
    "Remember to write me a letter.",
    # B2
    "Strawberries and cream - a perfect combination.",
    "This so-called \"Perfect Evening\" was so disappointing, as well as discouraging us from coming to your Circle Theatre again.",
    # C1
    "Some may altogether give up their studies, which I think is a disastrous move.",
]

All we need to do is tokenize them, and then we can get predictions by calling the finetuned model.  

Since there is no DataCollator to pad the sentences and convert them to tensors this time, we have to pad and convert them ourselves.

In [25]:
# Tokenize the sentences, pad them, and return PyTorch tensors
inputs = tokenizer(examples, truncation=True, padding=True, return_tensors="pt")
# Get the output
logits = mymodel(**inputs).logits
logits

tensor([[ -0.9517,  -0.0914,  -9.9327,  -5.6061,  -7.9281, -10.1516],
        [ -5.2187,  -7.7887,  -3.6023,   1.1966,  -3.9121,  -5.7949],
        [ -9.3922,  -6.8770,  -7.9751,  -0.8796,   0.7909,  -9.3858],
        [ -8.9794,  -6.9766,  -8.5034,   3.1735,  -4.0929,  -7.8373]],
       grad_fn=<AddmmBackward0>)
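
To make the padding step less of a black box, here is a minimal sketch of what `padding=True` does conceptually, assuming right-padding with the `pad_token_id` of `0` shown in the config above. The token IDs are made up for illustration and `pad_batch` is a hypothetical helper, not part of the transformers library; the real tokenizer also handles truncation, special tokens, and the conversion to tensors.

```python
PAD_TOKEN_ID = 0  # assumption: matches "pad_token_id": 0 in the config printed above


def pad_batch(batch, pad_id=PAD_TOKEN_ID):
    """Right-pad every sequence to the length of the longest one (hypothetical helper)."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return input_ids, attention_mask


# Made-up token IDs for two sentences of different lengths
toy_batch = [[101, 2023, 102], [101, 2023, 2003, 1037, 102]]
input_ids, attention_mask = pad_batch(toy_batch)
print(input_ids)
print(attention_mask)
```

The attention mask is what lets the model tell real tokens apart from padding, which is why the tokenizer returns it alongside the input IDs.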

Logits aren't very readable for us. Let's apply a softmax activation to turn them into more probability-like numbers.

In [26]:
from torch import nn

predicts = nn.functional.softmax(logits, dim = -1)
predicts

tensor([[2.9632e-01, 7.0052e-01, 3.7273e-05, 2.8209e-03, 2.7667e-04, 2.9944e-05],
        [1.6089e-03, 1.2314e-04, 8.1011e-03, 9.8332e-01, 5.9428e-03, 9.0429e-04],
        [3.1800e-05, 3.9335e-04, 1.3117e-04, 1.5826e-01, 8.4115e-01, 3.2004e-05],
        [5.2691e-06, 3.9043e-05, 8.4811e-06, 9.9923e-01, 6.9807e-04, 1.6510e-05]],
       grad_fn=<SoftmaxBackward0>)
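
As a sanity check, the softmax can be reproduced by hand. The sketch below uses only the standard library and the first row of the logits printed above; `softmax` here is an illustrative helper, not the PyTorch implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of logits."""
    peak = max(row)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# First row of the logits printed above
first_row = [-0.9517, -0.0914, -9.9327, -5.6061, -7.9281, -10.1516]
probs = softmax(first_row)
# Index of the largest probability, i.e. the predicted class
best = max(range(len(probs)), key=probs.__getitem__)
print([round(p, 4) for p in probs], best)
```

The values should sum to 1 and agree with the first row of the `predicts` tensor up to rounding, with the largest probability at index 1.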

#### Transform logits back to labels

Now you've got the output. Write a function to map it back into labels!

In [27]:
label = ['A1', 'A2', 'B1', 'B2', 'C1', 'C2']
def res2label(e):
    return label[e]
res = np.vectorize(res2label)(np.array(torch.argmax(predicts, dim=1)))
res

array(['A2', 'B2', 'C1', 'B2'], dtype='<U2')

## Evaluation

Let's see how you did!  
Load the testing data and calculate your accuracy.

We want you to calculate the three kinds of accuracy mentioned in the lecture, which will also be explained in the following section.

In [28]:
# [ TODO ] 
# load test data
# preprocess
inputs = tokenizer(processed_data['test']['text'], truncation=True, padding=True, return_tensors="pt")
# get predictions
logits = mymodel(**inputs).logits
predicts = nn.functional.softmax(logits, dim = -1)
predicts = np.array(torch.argmax(predicts, dim=1))
# transform predictions back into labels
predict_label = np.vectorize(res2label)(predicts)
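
If the whole test set does not fit in memory at once, one option is to predict in smaller batches. The sketch below shows only the chunking logic on placeholder data; `chunks` is a hypothetical helper, and the commented lines are an untested outline of how it might be combined with the `tokenizer` and `mymodel` objects from this notebook, with a batch size of 32 chosen arbitrarily.

```python
def chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


sentences = [f"sentence {i}" for i in range(10)]  # placeholder data
batches = list(chunks(sentences, 4))
print([len(b) for b in batches])

# Hypothetical use with the objects from this notebook (not run here):
# all_preds = []
# with torch.no_grad():  # no gradients are needed at inference time
#     for batch in chunks(processed_data['test']['text'], 32):
#         enc = tokenizer(batch, truncation=True, padding=True, return_tensors="pt")
#         all_preds.extend(mymodel(**enc).logits.argmax(dim=-1).tolist())
```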

In [29]:
# Try printing out some predictions to check whether the outputs are reasonable and whether you need to adjust your model.

for idx, (sent, level) in enumerate(zip(processed_data['test']['text'], predict_label)):
    if idx >= 10: break
    print(f'{level}: {sent}') 

C2: No longer a remote, backward, unimportant country, it became a force to be reckoned with in Europe.
B2: Unfortunately he was too fast and I couldn't keep up with him.
B2: Most mushrooms are totally harmless, but some are poisonous.
B2: This provided solid evidence that he committed the crime.
C1: You can't just accept everything you read in the newspapers at face value.
A2: Remember to write me a letter.
B1: She has long blond hair and blue eyes. She has a good figure.
B2: Nowadays the aim in clothing is not just for covering and protecting ourselves.
A2: Take two tablets, three times a day.
C2: Well, you will be if you saw our slide show and talk - members can hardly forget that relaxing afternoon when we unfolded the sails on the lake and enjoyed the tranquility of the area.


### Six Level Accuracy

Exact accuracy is probably what you're most familiar with:

$
accuracy = \frac{\#exactly\:the\:same\:levels}{\#total}
$

Example:
```
Prediction:   A1 A2 B1 B2 C1 C2
Ground truth: A2 B1 B1 B2 B2 C2
                    ^  ^     ^
```

The six level accuracy is $\frac{3}{6} = 0.5$
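
The toy example above can be checked with a few lines of Python. This is only a sketch on the six example labels; `exact_accuracy` is an illustrative name, not a function from any library.

```python
def exact_accuracy(predictions, truths):
    """Fraction of predictions that match the ground truth exactly."""
    matches = sum(p == t for p, t in zip(predictions, truths))
    return matches / len(truths)


pred = ["A1", "A2", "B1", "B2", "C1", "C2"]
truth = ["A2", "B1", "B1", "B2", "B2", "C2"]
six_level = exact_accuracy(pred, truth)
print(six_level)
```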

As a requirement, <u>your exact accuracy should be higher than $0.5$</u>.

In [30]:
# [ TODO ] calculate accuracy
t = 0
for i in range(len(predict_label)):
    if(processed_data['test']['level'][i] == predict_label[i]):
        t += 1
print(round(t/len(predict_label), 16))

0.5547826086956522


### Three Level Accuracy

Three Level Accuracy is used when you only want a more general sense of right or wrong.

$
accuracy = \frac{\#the\:same\:ABC\:levels}{\#total}
$

Example:
```
Prediction:   A1 A2 B1 B2 C1 C2
Ground truth: A2 B1 B1 B2 B2 C2
              ^     ^  ^     ^
```

The three level accuracy is $\frac{4}{6} = 0.667$
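
Because each CEFR label starts with its broad band (A, B, or C), comparing only the first character is enough. The following sketch reproduces the toy example; `three_level_accuracy` is an illustrative name.

```python
def three_level_accuracy(predictions, truths):
    """Fraction of predictions whose broad band (A, B, or C) matches the ground truth."""
    matches = sum(p[0] == t[0] for p, t in zip(predictions, truths))
    return matches / len(truths)


pred = ["A1", "A2", "B1", "B2", "C1", "C2"]
truth = ["A2", "B1", "B1", "B2", "B2", "C2"]
three_level = three_level_accuracy(pred, truth)
print(round(three_level, 3))
```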

As a requirement, <u>your three level accuracy should be higher than $0.6$</u>.

In [31]:
# [ TODO ] calculate accuracy
t = 0
for i in range(len(predict_label)):
    if(processed_data['test']['level'][i][0] == predict_label[i][0]):
        t += 1
print(round(t/len(predict_label), 16))

0.7378260869565217


### Fuzzy accuracy

However, the level of a sentence is relatively subjective. Generally speaking, linguistic evaluations in practice tolerate errors of $\pm1$ level.  

For example, if the actual label is 'B1', we'll also consider the prediction 'right' if the model predicts 'B2' or 'A2'.

Hence, the fuzzy accuracy is

$
accuracy = \frac{\#good\:enough\:answers}{\#total}
$

Example:
```
Prediction:   0 1 2 3 4 5
Ground truth: 0 1 1 3 3 3
              ^ ^ ^ ^ ^
```

The fuzzy accuracy is $\frac{5}{6} = 0.833$
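
A small sketch of the same calculation on the toy indices above, where indices 0 to 5 stand for A1 to C2. The `tolerance` parameter and the function name are illustrative choices.

```python
def fuzzy_accuracy(pred_indices, truth_indices, tolerance=1):
    """Fraction of predictions within `tolerance` levels of the ground truth."""
    close_enough = sum(
        abs(p - t) <= tolerance for p, t in zip(pred_indices, truth_indices)
    )
    return close_enough / len(truth_indices)


pred = [0, 1, 2, 3, 4, 5]
truth = [0, 1, 1, 3, 3, 3]
fuzzy = fuzzy_accuracy(pred, truth)
print(round(fuzzy, 3))
```

Setting `tolerance=0` reduces this to the exact accuracy, which is a handy way to cross-check both numbers.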

As a requirement, <u>your fuzzy accuracy should be higher than $0.8$</u>.

In [32]:
# [ TODO ] calculate accuracy
index = {'A1': 0, 'A2': 1, 'B1': 2, 'B2': 3, 'C1': 4, 'C2': 5}
def label2index(l):
    return index.get(l)

test_index = list(map(label2index, list(processed_data['test']['level'])))

t = 0
for i in range(len(predict_label)):
    if(abs(test_index[i] - predicts[i]) <= 1):
        t += 1
print(round(t/len(predict_label), 16))

0.8578260869565217


## TA's Note

Congratulations, you made it to the end of the tutorial! Make sure you make an appointment to show your work and turn in your finished assignment before next week's lesson. We will ask you to run your code, so double check that everything is working and that your model is saved. Don't worry if you didn't pass the evaluation requirements; you'll still get partial points for trying.

## Appendix 


<a name="Appendix-1-Install-CUDA"></a>

### Appendix 1 - Install CUDA

1. Check your GPU vs. CUDA compatibility:
   - [NVIDIA -> Your GPU Compute Capability](https://developer.nvidia.com/cuda-gpus) -> GeForce and TITAN Products
2. Check library vs. CUDA compatibility: 
   - PyTorch: [Previous PyTorch Versions](https://pytorch.org/get-started/previous-versions/)
   - TensorFlow: [Linux/macOS](https://www.tensorflow.org/install/source#tested_build_configurations) or [Windows](https://www.tensorflow.org/install/source_windows#tested_build_configurations)
3. Note the highest CUDA version that fits your system.

#### >> for conda/mamba users

You can directly install CUDA library with the selected CUDA version.
1. Get [the driver for NVIDIA GPU](https://www.nvidia.com/download/index.aspx)
2. `conda/mamba install -c conda-forge cudatoolkit=${VERSION}`

#### >> for non-conda users

1. Get [the driver for NVIDIA GPU](https://www.nvidia.com/download/index.aspx)
2. Download and install [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit-archive)
3. Download and install [cuDNN Library](https://developer.nvidia.com/rdp/cudnn-archive)
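
Whichever route you took, it can help to confirm that PyTorch actually sees the GPU before training. The sketch below assumes PyTorch is the backend; `describe_device` is a hypothetical helper that also reports the case where PyTorch is not installed at all.

```python
def describe_device():
    """Return a short message about which device PyTorch would use."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if torch.cuda.is_available():
        # Name of the first visible GPU
        return f"CUDA is available: {torch.cuda.get_device_name(0)}"
    return "CUDA is not available; PyTorch will fall back to the CPU"


message = describe_device()
print(message)
```

If this reports that CUDA is not available even though a GPU is present, the installed PyTorch build and the CUDA version are a likely mismatch to check first.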

### Appendix 2 - Further Readings

1. [Huggingface Official Tutorials](https://github.com/huggingface/notebooks/tree/master/examples)
2. How to use BERT with other downstream tasks: [How to use BERT from the Hugging Face transformer library](https://towardsdatascience.com/how-to-use-bert-from-the-hugging-face-transformer-library-d373a22b0209)
3. Training with PyTorch backend: [transformers-tutorials](https://github.com/abhimishra91/transformers-tutorials)
4. A more complicated example that includes manual data/training processing with PyTorch: [Transformers for Multi-Label Classification made simple](https://towardsdatascience.com/transformers-for-multilabel-classification-71a1a0daf5e1)
5. [Text Classification with tensorflow](https://github.com/huggingface/notebooks/blob/master/examples/text_classification-tf.ipynb): TensorFlow example