# Finetuning

- Using pretrained model (MLM) weights as the base for the finetuned model.
- Here finetuning, or supervised BERT based model training, involves the building of a model to classify samples of text data into several groups.
- This notebook has been run in "debug" mode, so the lack of data is reflected in the results. Preferably use much more data! 

## 1. Deciding on GPU

In [24]:
#check gpu(s)
!nvidia-smi

In [2]:
#pick gpu
import os
os.environ["CUDA_VISIBLE_DEVICES"]="4"

## 2. Essentials (libraries, config etc.)

In [3]:
#hide warnings
import warnings
warnings.filterwarnings('ignore')

In [4]:
import pandas as pd
import numpy as np
from transformers import AutoModel, AutoTokenizer, get_cosine_schedule_with_warmup, AutoConfig
import torch
import torch.nn as nn
from torch.utils.data import Sampler, Dataset, DataLoader
from IPython.display import display
from accelerate import Accelerator
from tqdm.notebook import tqdm
import random
import os
import multiprocessing
from sklearn.model_selection import StratifiedKFold
#import more_itertools
from sklearn.metrics import f1_score, recall_score, precision_score, accuracy_score

In [5]:
#configuration, change to fit your use case
class cfg():
    max_len = 100 #max token length
    data_folder = "./data/"#"/path/to/data/"
    model_name = "TurkuNLP/bert-base-finnish-cased-v1"
    pt_model_path='./mlm_output_folder/pytorch_model.bin'
    train_batch_size = 32
    valid_batch_size = 64
    test_batch_size = 64
    
    device = "cuda" if torch.cuda.is_available() else "cpu" #in case no GPU is available, we run with CPU
    debug = True
    seed = 2023

    #testing the code with small numbers
    epochs = 2 #switch to e.g. 5 or 10
    n_folds = 2 #switch to e.g. 5 

In [6]:
#set seeds
def seed_everything(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    
seed_everything(seed=cfg.seed)

## 2. Data

In [7]:
#load training and testing data as pandas dataframes
train =  pd.read_csv(cfg.data_folder+'finetune_trainset.csv')
train = train[['label','text']]
test = pd.read_csv(cfg.data_folder+'finetune_testset.csv')
test = test[['label','text']]

#check if the model runs with fewer data samples
if cfg.debug:
    cfg.train_batch_size=4
    cfg.valid_batch_size=8
    train = train[:100]
    test = test[:10]

In [8]:
#update data properties in config
cfg.labels = train.label.unique()
cfg.num_labels = len(cfg.labels)

In [9]:
display(train)
display(test)

Unnamed: 0,label,text
0,2,– Ei...
1,0,"Kurkist kehtoon, kuinka kultaa Lapsi paljon tu..."
2,0,Entä jos miehellä ei enää seiso?
3,0,Eihän ala-ikäiset saa muutakaan tehdä ilman va...
4,0,"""Mies on naisen pää, koska Allah on toisia suo..."
...,...,...
95,0,Satuin paikalle kun hän oli ihan p.a. ja tarvi...
96,0,Ei kuitenkaan simasalapim menetelmällä.
97,0,Miksi silti olen aviossa?
98,2,"Hänen olemuksestaan huomaa heti, ettei hän pid..."


Unnamed: 0,label,text
0,0,"Ensinnäkin, korvikkeen saa lämmittää mikrossa,..."
1,0,"En tiedä, miksi trollaat asialla, joka on help..."
2,0,Nyt todella tiedän mitä glamour elämä on...
3,0,Kissa oli saanut olla itse valitsemansa ajan e...
4,2,Mietippä rehellisesti tiedätkö oikeasti narsis...
5,1,Erittäin hyvä....
6,0,Johan se tuli selväksi sullekkin että osittain...
7,1,Kiitos kuitenkin vastauksestasi.
8,1,Jos viittitte kattoo ton keskipitkän ruotsin k...
9,0,"Eipä vaivuta synkkyyteen, mielestäni nämä pals..."


In [10]:
#stratified kfold for creating training and validation datasets
mskf = StratifiedKFold(n_splits=cfg.n_folds, shuffle=True, random_state=2023)

for fold, (trn_, val_) in enumerate(mskf.split(train, train["label"])):
    print(len(trn_), len(val_))
    train.loc[val_, "kfold"] = fold
    
train["kfold"] = train["kfold"].astype(int)

50 50
50 50


In [11]:
class ClassificationDataset(Dataset):
    def __init__(self, df):
        self.texts = df["text"].values
        self.is_train = False
        if "label" in df.columns:
            self.labels = df["label"].values
            self.is_train = True
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        if self.is_train:
              text, label = self.texts[idx], self.labels[idx]
        else:
              text = self.texts[idx]
        
        example = tokenizer(text, max_length=cfg.max_len, 
                                 padding="max_length", add_special_tokens=True, truncation=True, return_attention_mask=True,
        return_token_type_ids=True)
        example["input_ids"] = torch.tensor(example["input_ids"])
        example["token_type_ids"] = torch.tensor(example["token_type_ids"])
        example["attention_mask"] = torch.tensor(example["attention_mask"])
        if self.is_train:
            return example, torch.tensor(label)
        else:
            return example

## 3. Model training

### 3.1 Building the model

In [12]:
#https://www.kaggle.com/code/rhtsingh/utilizing-transformer-representations-efficiently
class WeightedLayerPooling(torch.nn.Module):
    def __init__(self, num_hidden_layers, layer_start: int = 4, layer_weights = None):
        super(WeightedLayerPooling, self).__init__()
        self.layer_start = layer_start
        self.num_hidden_layers = num_hidden_layers
        self.layer_weights = layer_weights if layer_weights is not None \
            else torch.nn.Parameter(
                torch.tensor([1] * (num_hidden_layers+1 - layer_start), dtype=torch.float)
            )

    def forward(self, all_hidden_states):
        all_layer_embedding = all_hidden_states[self.layer_start:, :, :, :]
        weight_factor = self.layer_weights.unsqueeze(-1).unsqueeze(-1).unsqueeze(-1).expand(all_layer_embedding.size())
        weighted_average = (weight_factor*all_layer_embedding).sum(dim=0) / self.layer_weights.sum()
        return weighted_average

In [13]:
class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        tconfig = AutoConfig.from_pretrained(cfg.model_name)
        tconfig.update({'output_hidden_states': True})
        tconfig.update({'num_labels': cfg.num_labels})
        self.model = AutoModel.from_pretrained(cfg.pt_model_path, config=tconfig)
        self.model.base_model.embeddings.requires_grad_(False)
        self.fc = torch.nn.Linear(tconfig.hidden_size, cfg.num_labels)
        self.pooler = WeightedLayerPooling(tconfig.num_hidden_layers, layer_start=9, layer_weights=None)
        self.ms_dropout = [torch.nn.Dropout(x/10) for x in range(5)]
        self.dp = torch.nn.Dropout(0.2)

    def forward(self, inputs):        
        out_e = self.model(**inputs)
        out = torch.stack(out_e["hidden_states"])
        out = self.pooler(out)
        for i, fc_dp in enumerate(self.ms_dropout):
            if i == 0:
                outputs = self.fc(fc_dp(out[:, 0]))
            else:
                outputs += self.fc(fc_dp(out[:, 0]))
        outputs = self.fc(self.dp(out[:,0]))
        return outputs

### 3.2 Logging information during model training

In [14]:
class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

In [15]:
#evaluation metrics (accuracy, f1-score, precision, recall) calculated
def get_eval_metrics(labels, preds, avg_type='weighted', print_metrics=True):
    if isinstance(labels, list):
        labels = torch.cat(labels).cpu()
    if isinstance(preds, list):
        preds = torch.cat(preds).cpu()
    acc_score = accuracy_score(labels, preds)
    ff1_score = f1_score(labels, preds, average=avg_type, labels=cfg.labels)
    rec_score = recall_score(labels, preds, average=avg_type, labels=cfg.labels)
    prec_score = precision_score(labels, preds, average=avg_type, labels=cfg.labels)
    
    if print_metrics:
        print(f"accuracy score: {acc_score}")
        print(f"f1 score: {ff1_score}")
        print(f"recall score: {rec_score}")
        print(f"precision score: {prec_score}")
    
    return [acc_score, ff1_score, rec_score, prec_score]

### 3.3 Training and validation loops

In [16]:
def train_epoch(dataloader, model, optimizer, loss_fn, scheduler, epoch, fold, valid_dataloader):
    model.train()
    print("="*15, ">" f"Fold {fold+1} Epoch {epoch}", "<", "*"*15, "\n\n")
    
    losses = AverageMeter()
    for batch_idx, (example, labels) in tqdm(enumerate(dataloader), total=len(dataloader)):
        optimizer.zero_grad()
        inputs = {k : v.to(cfg.device) for (k, v) in example.items()}
        labels = torch.tensor(labels)
        with torch.cuda.amp.autocast(enabled=True):
            out = model(inputs).squeeze()
        
        loss = loss_fn(out.cpu().float(), labels.long())
        loss.backward()
        optimizer.step()
        scheduler.step()
        
        losses.update(loss.item(), cfg.train_batch_size)
        
        if (batch_idx+1) % 100 == 0:
            print(f"Epoch [{epoch}] | Batch Number: [{batch_idx+1}/{len(dataloader)}] | Loss: [{losses.avg}]\n")
            
    return losses.avg

In [17]:
def validate_fn(dataloader, model, loss_fn):
    model.eval()
    losses = AverageMeter()
    metrics = AverageMeter()
    val_accuracy = []
    val_preds = []
    val_f1 = []
    val_precision=[]
    val_recall=[]
    val_labels=[]
    for batch_idx, (example, labels) in tqdm(enumerate(dataloader), total=len(dataloader)):
        inputs = {k : v.to(cfg.device) for (k, v) in example.items()}
        
        with torch.no_grad():
            out = model(inputs).squeeze()
        loss = loss_fn(out.cpu(), labels.long())
        losses.update(loss.item(), cfg.train_batch_size)

        # Get the predictions
        preds = torch.argmax(out.cpu(), dim=1).flatten()
        val_preds.append(preds)
        val_labels.append(labels)

    return losses.avg, val_labels, val_preds   

In [18]:
def train_fold(fold): 
    train_df = train[train["kfold"] != fold]
    valid_df = train[train["kfold"] == fold]
    
    global tokenizer 
    tokenizer = AutoTokenizer.from_pretrained(cfg.model_name)
    train_dataset = ClassificationDataset(train_df)
    train_dataloader = DataLoader(train_dataset, shuffle=True, num_workers=2, batch_size=cfg.train_batch_size)
    valid_dataset = ClassificationDataset(valid_df)
    valid_dataloader = DataLoader(valid_dataset, shuffle=True, num_workers=2, batch_size=cfg.valid_batch_size)
    
    model = Model()
    model.to(cfg.device)
    
    optimizer = torch.optim.AdamW([
        {"params": model.fc.parameters(), "lr": 3e-5},
        {"params": model.pooler.parameters(), "lr": 3e-5},
        {"params": model.model.parameters(), "lr": 1e-5},
    ],
    lr=5e-4)
    scheduler = get_cosine_schedule_with_warmup(optimizer, 
                                                num_warmup_steps=0, 
                                                num_cycles=0.5, 
                                                num_training_steps=int(len(train_dataset) / cfg.train_batch_size * cfg.epochs))

    
    #best_val_loss=np.inf #for saving fewer checkpoints
    for epoch in range(cfg.epochs):
        #training
        train_loss = train_epoch(train_dataloader, model, optimizer, nn.CrossEntropyLoss(), scheduler, epoch+1, fold,
                                 valid_dataloader)

        #validation
        valid_loss, valid_labels, valid_pred  = validate_fn(valid_dataloader, model, nn.CrossEntropyLoss())
        print("="*15, ">" f"Fold {fold+1} Epoch {epoch+1} Results:", "<", "*"*15, "\n\n")
        print(f"Training Loss: {train_loss}")
        print(f"Validation Loss: {valid_loss}")
        _ = get_eval_metrics(valid_labels, valid_pred)
        
        #saving model
        #if valid_loss < best_val_loss: #for saving fewer checkpoints
        print("SAVING MODEL: {} fold, {} epoch, valid_loss: {: >4.5f}".format(fold+1,epoch+1, valid_loss))
        #best_val_loss = valid_loss #for saving fewer checkpoints
        torch.save(model.state_dict(), f"finbert_base_fold_{fold+1}_epoch_{epoch+1}.pth")


In [19]:
print(f"Training for {cfg.n_folds} folds")
for fold in range(cfg.n_folds):
    train_fold(fold)

Training for 2 folds


Some weights of the model checkpoint at ./mlm_output_folder/pytorch_model.bin were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at ./mlm_output_folder/pytorch_model.bin and are newly initialized: ['bert.pooler.dense.we





  0%|          | 0/13 [00:00<?, ?it/s]

  0%|          | 0/7 [00:00<?, ?it/s]



Training Loss: 0.7918790659079185
Validation Loss: 0.7287997603416443
accuracy score: 0.74
f1 score: 0.6294252873563217
recall score: 0.74
precision score: 0.5476
SAVING MODEL: 1 fold, 1 epoch, valid_loss: 0.72880




  0%|          | 0/13 [00:00<?, ?it/s]

  0%|          | 0/7 [00:00<?, ?it/s]



Training Loss: 0.5804865045043138
Validation Loss: 0.9009665080479213
accuracy score: 0.74
f1 score: 0.6294252873563217
recall score: 0.74
precision score: 0.5476
SAVING MODEL: 1 fold, 2 epoch, valid_loss: 0.90097


Some weights of the model checkpoint at ./mlm_output_folder/pytorch_model.bin were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at ./mlm_output_folder/pytorch_model.bin and are newly initialized: ['bert.pooler.dense.we





  0%|          | 0/13 [00:00<?, ?it/s]

  0%|          | 0/7 [00:00<?, ?it/s]



Training Loss: 0.7921924843237951
Validation Loss: 0.6542312247412545
accuracy score: 0.76
f1 score: 0.6563636363636364
recall score: 0.76
precision score: 0.5776
SAVING MODEL: 2 fold, 1 epoch, valid_loss: 0.65423




  0%|          | 0/13 [00:00<?, ?it/s]

  0%|          | 0/7 [00:00<?, ?it/s]



Training Loss: 0.7197250643601785
Validation Loss: 0.6447714396885463
accuracy score: 0.76
f1 score: 0.6563636363636364
recall score: 0.76
precision score: 0.5776
SAVING MODEL: 2 fold, 2 epoch, valid_loss: 0.64477


## 4. Model testing

In [20]:
def test_fn(dataloader, model, loss_fn):
    model.eval()
    losses = AverageMeter()
    val_probs = []
    val_preds = []
    val_labels = []
    for batch_idx, (example, labels) in tqdm(enumerate(dataloader), total=len(dataloader)):
        inputs = {k : v.to(cfg.device) for (k, v) in example.items()}
        
        with torch.no_grad():
            out = model(inputs).squeeze()
        loss = loss_fn(out.cpu(), labels.long())
        losses.update(loss.item(), cfg.train_batch_size)

        # Get the predictions
        preds = torch.argmax(out.cpu(), dim=1).flatten()
        probs = out
        val_probs.append(probs)
        val_preds.append(preds)
        val_labels.append(labels)
               
    return val_labels, val_preds, val_probs, losses.avg 

In [21]:
test_dataset = ClassificationDataset(test)
test_dataloader = DataLoader(test_dataset, shuffle=False, num_workers=2, batch_size=cfg.test_batch_size)
preds_in_all_folds_val=[]
probs_in_all_folds_val=[]

for fold_num in range(cfg.n_folds):
    pth = f"finbert_base_fold_{fold_num+1}_epoch_2.pth"
    model = Model().to(cfg.device)
    model.load_state_dict(torch.load(pth))
    
    labels, preds, probs1, test_loss = test_fn(test_dataloader, model, nn.CrossEntropyLoss())
    preds_in_all_folds_val.append(preds)
    probs_in_all_folds_val.append(probs1)
    print("Testing fold ",fold_num+1)
    print(f"Test loss: {test_loss}")
    _ = get_eval_metrics(labels, preds)

Some weights of the model checkpoint at ./mlm_output_folder/pytorch_model.bin were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at ./mlm_output_folder/pytorch_model.bin and are newly initialized: ['bert.pooler.dense.we

  0%|          | 0/1 [00:00<?, ?it/s]

Testing fold  1
Test loss: 1.1715978384017944
accuracy score: 0.6
f1 score: 0.4499999999999999
recall score: 0.6
precision score: 0.36


Some weights of the model checkpoint at ./mlm_output_folder/pytorch_model.bin were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at ./mlm_output_folder/pytorch_model.bin and are newly initialized: ['bert.pooler.dense.we

  0%|          | 0/1 [00:00<?, ?it/s]

Testing fold  2
Test loss: 1.0782538652420044
accuracy score: 0.6
f1 score: 0.4499999999999999
recall score: 0.6
precision score: 0.36


In [22]:
#tensor sum 
def sum_of_tensors(probs_in_all_folds_val):
    probs_sum = torch.cat(probs_in_all_folds_val[0]).cpu()
    for i in range(1,len(probs_in_all_folds_val)):
        probs_sum = probs_sum + torch.cat(probs_in_all_folds_val[i]).cpu()
    return probs_sum

Averaging out folds.

In [23]:
preds_all_folds = torch.argmax((sum_of_tensors(probs_in_all_folds_val)/len(probs_in_all_folds_val)), dim=1)
print("Testing..")
_ = get_eval_metrics(labels, preds_all_folds)

Testing..
accuracy score: 0.6
f1 score: 0.4499999999999999
recall score: 0.6
precision score: 0.36
