# Training and fine-tuning

https://huggingface.co/transformers/training.html

In [1]:
import torch
from torch.nn import functional as F

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
device

torch.cuda.empty_cache()

## Fine-tuning in native PyTorch

Before beginning, we load the model and tokenizer.

In [2]:
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# `return_dict=True` makes the model return an output object whose fields
# (e.g. `loss`, `logits`) can be retrieved by name after a forward pass
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased', return_dict = True
).to(device)

# Set model in train mode 
# Same syntax as PyTorch
model.train();


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

We can use any optimizer from `PyTorch` or `transformers`. 

We can also use learning rate scheduling tools from `transformers`. 
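
As a rough illustration of what such a schedule does, the sketch below builds a linear warmup followed by a linear decay with PyTorch's `LambdaLR`. This is a stand-in written by hand, not the `transformers` helper itself (`get_linear_schedule_with_warmup` is the library function for this purpose). The toy model, the step counts, and the name `linear_warmup_decay` are assumptions made only so the example runs without downloading BERT.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

# Hypothetical stand-in model so the sketch runs without downloading BERT
toy_model = torch.nn.Linear(4, 2)
toy_optimizer = torch.optim.AdamW(toy_model.parameters(), lr=1e-5)

num_warmup_steps = 2      # assumed value, for illustration only
num_training_steps = 10   # assumed value, for illustration only

def linear_warmup_decay(step):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Ramp up linearly from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Then decay linearly from 1 down to 0 at the last step
    return max(0.0, (num_training_steps - step) / max(1, num_training_steps - num_warmup_steps))

scheduler = LambdaLR(toy_optimizer, lr_lambda=linear_warmup_decay)

lrs = []
for _ in range(num_training_steps):
    toy_optimizer.step()      # normally preceded by loss.backward()
    scheduler.step()          # advance the schedule once per optimizer step
    lrs.append(scheduler.get_last_lr()[0])
```

In a real training loop, `scheduler.step()` would be called right after `optimizer.step()` on every batch.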

In [3]:
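# Note: `transformers.AdamW` was deprecated in later releases of the library;
# `torch.optim.AdamW` accepts the same arguments used here
from transformers import AdamW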

optimizer = AdamW(model.parameters(), lr=1e-5)


The following code cell shows a variant that applies weight decay selectively: 0.01 for most parameters and none for biases and `LayerNorm` weights. Run either the previous cell or the cell below.

In [4]:
no_decay = ['bias', 'LayerNorm.weight']

optimizer_grouped_parameters = [
    {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01}, 
    {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]

optimizer = AdamW(optimizer_grouped_parameters, lr = 1e-5)

Now we set up a simple dummy training batch by calling the tokenizer directly (its `__call__()` method). This returns a `BatchEncoding` instance that holds everything we need to pass to the model.

In [5]:
text_batch = ['The team is excited', 'They could not care less']

# A BatchEncoding() instance
encoding = tokenizer(
    text_batch, 
    return_tensors = 'pt', 
    padding = True, 
    truncation = True
).to(device)

input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask']
# Training labels: [positive, negative], one class index per example
labels = torch.tensor([1, 0]).to(device)

In [6]:
# tokenizer.tokenize('The team is excited')
# Result: ['the', 'team', 'is', 'excited']

Run a single training step: forward pass, loss computation, backward pass, and optimizer update.

In [7]:
optimizer.zero_grad()

outputs = model(
    input_ids, 
    attention_mask = attention_mask, 
    labels = labels
)


# Compute loss
# The model computes the loss internally when `labels` is passed.
# This may not be the appropriate loss for every task; for illustration purposes only
loss = outputs.loss
# Alternatively, we can compute the loss ourselves
# loss = F.cross_entropy(outputs.logits, labels.view(-1))

# Backprop 
loss.backward()
optimizer.step()
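
To make the alternative loss computation concrete, here is a small sketch of what `F.cross_entropy` expects and returns. The logits are made-up numbers, not model output; the point is only the shapes (logits of shape `(batch_size, num_labels)`, labels of shape `(batch_size,)`) and the fact that the result is the mean negative log-probability of the true class.

```python
import torch
from torch.nn import functional as F

# Made-up logits for a batch of 2 examples and 2 classes
logits = torch.tensor([[0.2, 1.5], [2.0, -1.0]])
labels = torch.tensor([1, 0])            # one class index per example, shape (2,)

loss = F.cross_entropy(logits, labels)

# Same quantity written out: mean negative log-probability of the true class
log_probs = F.log_softmax(logits, dim=-1)
manual = -log_probs[torch.arange(len(labels)), labels].mean()
```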

For more information about the learning rate <b>scheduler</b>, see the tutorial linked at the top of this notebook.

### Freezing the encoder

In some cases, we might need to keep the weights of the pre-trained encoder frozen and optimize only the weights of the head layers. The following code cell does so.

<span style="color:red;"><b>Question!</b></span> `Trainer` class only has `train_dataset` argument. How do we differentiate between 

In [8]:
for param in model.base_model.parameters():
    param.requires_grad = False
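
One way to check that freezing had the intended effect is to list the parameters that still require gradients and to confirm that an optimizer step leaves the frozen weights untouched. The sketch below does this on a hypothetical toy module, where `encoder` plays the role of `model.base_model` and `classifier` plays the role of the head; none of these names come from `transformers`.

```python
import torch
from torch import nn

class ToyClassifier(nn.Module):
    """Hypothetical stand-in: `encoder` plays the role of `model.base_model`."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 8)
        self.classifier = nn.Linear(8, 2)

    def forward(self, x):
        return self.classifier(torch.relu(self.encoder(x)))

toy = ToyClassifier()

# Freeze the encoder, following the same pattern as the cell above
for param in toy.encoder.parameters():
    param.requires_grad = False

# Names of parameters that will still receive gradients
trainable_names = [n for n, p in toy.named_parameters() if p.requires_grad]

# Hand only the trainable parameters to the optimizer
head_optimizer = torch.optim.AdamW(
    (p for p in toy.parameters() if p.requires_grad), lr=1e-5
)

# One dummy step: the encoder weights should not change
before = toy.encoder.weight.clone()
dummy_loss = toy(torch.randn(4, 8)).sum()
dummy_loss.backward()
head_optimizer.step()
encoder_unchanged = torch.equal(before, toy.encoder.weight)
```

Passing only the trainable parameters to the optimizer is a design choice rather than a requirement; it keeps the optimizer state small when most of the network is frozen.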

## Fine-tuning with `Trainer` class

The `Trainer` and `TrainingArguments` classes provide a routine for training, fine-tuning, and evaluation, along with other features (logging, gradient accumulation, mixed precision, etc.).

The following code cell serves as an example skeleton. 

In [9]:
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

model = BertForSequenceClassification.from_pretrained("bert-large-uncased")

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total # of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=test_dataset            # evaluation dataset
)

Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint a

NameError: name 'train_dataset' is not defined
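
The `NameError` above arises because the skeleton never defines `train_dataset` (or `test_dataset`). A minimal sketch of one way to build such an object is a `torch.utils.data.Dataset` that returns one dictionary per example. The class name `ToyTextDataset` and the token ids below are made up for illustration; in practice the encodings would come from the tokenizer.

```python
import torch

class ToyTextDataset(torch.utils.data.Dataset):
    """Hypothetical dataset wrapping tokenizer output and labels."""
    def __init__(self, encodings, labels):
        self.encodings = encodings   # dict of lists, e.g. input_ids, attention_mask
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # One example as a dict of tensors, keyed by model argument name
        item = {key: torch.tensor(values[idx]) for key, values in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

# Hand-written stand-in for tokenizer output (made-up token ids, already padded)
toy_encodings = {
    'input_ids': [[101, 1996, 2136, 102], [101, 2027, 2071, 102]],
    'attention_mask': [[1, 1, 1, 1], [1, 1, 1, 1]],
}
train_dataset = ToyTextDataset(toy_encodings, labels=[1, 0])
```

A second instance built from held-out examples could serve as `test_dataset` in the same way.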

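Once `train_dataset` and `test_dataset` are defined, call `trainer.train()` to fine-tune the model and `trainer.evaluate()` to evaluate it.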

We can define our own `compute_metrics` function and pass it to `Trainer`.
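
As a minimal sketch, a `compute_metrics` function for classification might compute accuracy from the logits and labels gathered during evaluation. It assumes the object it receives exposes `predictions` and `label_ids` attributes; the `FakeEvalPrediction` namedtuple is a hypothetical stand-in used only to try the function without running a `Trainer`.

```python
import numpy as np
from collections import namedtuple

def compute_metrics(eval_pred):
    """Return accuracy from the logits and labels collected during evaluation."""
    logits = np.asarray(eval_pred.predictions)
    labels = np.asarray(eval_pred.label_ids)
    preds = logits.argmax(axis=-1)           # predicted class per example
    return {'accuracy': float((preds == labels).mean())}

# Stand-in for the object passed in during evaluation (hypothetical, for a quick check)
FakeEvalPrediction = namedtuple('FakeEvalPrediction', ['predictions', 'label_ids'])
fake = FakeEvalPrediction(
    predictions=np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]),
    label_ids=np.array([1, 0, 0]),
)
metrics = compute_metrics(fake)
```

The function would then be supplied through the `compute_metrics` argument when constructing the `Trainer`, so that `trainer.evaluate()` reports the returned values alongside the loss.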