# AI4Code PyTorch Training Starter + AMP + 🤗 + W&B 📉 - Code BERT Large

This training kernel is a fork of my original training kernel [[TRAIN] AI4Code PyTorch - 🤗 BERT Large + W&B 📉](https://www.kaggle.com/code/heyytanay/train-ai4code-pytorch-bert-large-w-b) (check it out and upvote if you found it helpful!)

This is a training script written in vanilla PyTorch with a custom Trainer class. I hope it can help you develop more sophisticated models.

In this notebook, I use Microsoft's CodeBERT model (the `microsoft/codebert-base` checkpoint set in `Config.MODEL_NAME`). For more details on it, check out its [HuggingFace docs](https://huggingface.co/microsoft/codebert-base) and [GitHub repo](https://github.com/microsoft/CodeBERT).

Think of this notebook as a skeleton for all similar models (in fact, for any PyTorch Hugging Face model). You can change chunks of code to suit your needs and it should still work in most cases.

I've borrowed chunks of code from Ahmet Erdem's notebook [here](https://www.kaggle.com/code/aerdem4/ai4code-pytorch-distilbert-baseline). Please check that out, it's very informative to get started!

**Feel free to fork and change the models and do some preprocessing, but if you do please leave an upvote :)**

In [None]:
%%sh
pip install -q --upgrade transformers

In [None]:
import platform
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt

import gc
import os
import wandb
import json
import glob
from scipy import sparse
from pathlib import Path

import torch
import transformers
import torch.nn as nn
import torch.nn.functional as F
from torch.cuda.amp import GradScaler, autocast
from torch.utils.data import Dataset, DataLoader

from sklearn.model_selection import GroupShuffleSplit
from sklearn.metrics import mean_squared_error
import warnings
warnings.simplefilter('ignore')

We define a Config class to store variables and functions that are to be used globally inside our training script.
This makes the code more modular and easy to approach at the same time.

I keep the number of epochs at 2 since a single epoch on 15K samples takes ~1 hour and 13 minutes. You can change it to however many you need.

Increasing the number of epochs should not cause an OOM error, since the per-batch tensors are deleted and the CUDA cache is emptied at the end of every training and validation epoch.

In [None]:
class Config:
    NB_EPOCHS = 2
    LR = 3e-4
    T_0 = 20
    η_min = 1e-4
    MAX_LEN = 120
    TRAIN_BS = 16
    VALID_BS = 16
    MODEL_NAME = 'microsoft/codebert-base'
    data_dir = Path('../input/AI4Code')
    TOKENIZER = transformers.RobertaTokenizer.from_pretrained(MODEL_NAME)
    scaler = GradScaler()

## About W&B:
<center><img src="https://i.imgur.com/gb6B4ig.png" width="400" alt="Weights & Biases"/></center><br>
<p style="text-align:center">WandB is a developer tool that helps companies turn deep learning research projects into deployed software by helping teams track their models, visualize model performance, and easily automate training and improving models.
We will use their tools to log hyperparameters and output metrics from our runs, then visualize and compare results and quickly share findings with colleagues.<br><br></p>

To log in to W&B, you can use the snippet below.

```python
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
wb_key = user_secrets.get_secret("WANDB_API_KEY")

wandb.login(key=wb_key)
```
Make sure you have your W&B key stored as `WANDB_API_KEY` under Add-ons -> Secrets.

You can view [this](https://www.kaggle.com/ayuraj/experiment-tracking-with-weights-and-biases) notebook to learn more about W&B tracking.

If you don't want to log in to W&B, the kernel will still work and log everything to W&B in anonymous mode.

In [None]:
WANDB_CONFIG = {
    'TRAIN_BS': Config.TRAIN_BS,
    'VALID_BS': Config.VALID_BS,
    'N_EPOCHS': Config.NB_EPOCHS,
    'ARCH': Config.MODEL_NAME,
    'MAX_LEN': Config.MAX_LEN,
    'LR': Config.LR,
    'NUM_WORKERS': 8,
    'OPTIM': "AdamW",
    'LOSS': "MSELoss",
    'DEVICE': "cuda",
    'T_0': Config.T_0,
    'η_min': Config.η_min,
    'infra': "Kaggle",
    'competition': 'ai4code',
    '_wandb_kernel': 'tanaym'
}

# Log in to W&B and start the logging run
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
wb_key = user_secrets.get_secret("WANDB_API_KEY")

wandb.login(key=wb_key)

run = wandb.init(
    project='pytorch',
    config=WANDB_CONFIG,
    group='nlp',
    job_type='train',
)

Below are some utility functions that we will be using.

In [None]:
from bisect import bisect

def wandb_log(**kwargs):
    """
    Logs the given key-value pairs to W&B in a single step
    """
    wandb.log(kwargs)

def count_inversions(a):
    """Counts pairs (i, j) with i < j and a[i] > a[j] via binary insertion."""
    inversions = 0
    sorted_so_far = []
    for i, u in enumerate(a):
        j = bisect(sorted_so_far, u)
        inversions += i - j
        sorted_so_far.insert(j, u)
    return inversions

def kendall_tau(ground_truth, predictions):
    """Kendall tau over all notebooks: 1 for perfect orderings, -1 for fully reversed ones."""
    total_inversions = 0
    total_2max = 0  # twice the maximum possible inversions across all instances
    for gt, pred in zip(ground_truth, predictions):
        ranks = [gt.index(x) for x in pred]  # rank predicted order in terms of ground truth
        total_inversions += count_inversions(ranks)
        n = len(gt)
        total_2max += n * (n - 1)
    return 1 - 4 * total_inversions / total_2max

def read_notebook(path):
    return (
        pd.read_json(
            path,
            dtype={'cell_type': 'category', 'source': 'str'})
        .assign(id=path.stem)
        .rename_axis('cell_id')
    )

def get_ranks(base, derived):
    return [base.index(d) for d in derived]
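
To make the metric concrete, here is a small worked example of `kendall_tau` on a single three-cell notebook. The cell ids `"a"`, `"b"`, `"c"` are made up for illustration; the two functions are copied from the cell above so the snippet runs on its own.

```python
from bisect import bisect

def count_inversions(a):
    # Count pairs that are out of order, using a sorted prefix and binary search
    inversions = 0
    sorted_so_far = []
    for i, u in enumerate(a):
        j = bisect(sorted_so_far, u)
        inversions += i - j
        sorted_so_far.insert(j, u)
    return inversions

def kendall_tau(ground_truth, predictions):
    total_inversions = 0
    total_2max = 0  # twice the maximum possible inversions across all instances
    for gt, pred in zip(ground_truth, predictions):
        ranks = [gt.index(x) for x in pred]
        total_inversions += count_inversions(ranks)
        n = len(gt)
        total_2max += n * (n - 1)
    return 1 - 4 * total_inversions / total_2max

gt = [["a", "b", "c"]]                           # hypothetical correct order
perfect = kendall_tau(gt, [["a", "b", "c"]])     # no inversions
one_swap = kendall_tau(gt, [["b", "a", "c"]])    # one inversion out of three possible
reversed_ = kendall_tau(gt, [["c", "b", "a"]])   # all three pairs inverted
print(perfect, round(one_swap, 4), reversed_)    # 1.0 0.3333 -1.0
```

Each swapped pair of cells costs the same amount, so the score drops linearly with the number of inversions.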

The data preprocessing below is taken from the starter notebook for this competition.

In [None]:
NUM_TRAIN = 15000

paths_train = list((Config.data_dir / 'train').glob('*.json'))[:NUM_TRAIN]
notebooks_train = [
    read_notebook(path) for path in tqdm(paths_train, desc='Train NBs')
]
df = (
    pd.concat(notebooks_train)
    .set_index('id', append=True)
    .swaplevel()
    .sort_index(level='id', sort_remaining=False)
)

df_orders = pd.read_csv(
    Config.data_dir / 'train_orders.csv',
    index_col='id',
).squeeze('columns').str.split()  # `squeeze=True` was removed from read_csv in pandas 2.0

df_orders_ = df_orders.to_frame().join(
    df.reset_index('cell_id').groupby('id')['cell_id'].apply(list),
    how='right',
)

ranks = {}
for id_, cell_order, cell_id in df_orders_.itertuples():
    ranks[id_] = {'cell_id': cell_id, 'rank': get_ranks(cell_order, cell_id)}

df_ranks = (
    pd.DataFrame
    .from_dict(ranks, orient='index')
    .rename_axis('id')
    .apply(pd.Series.explode)
    .set_index('cell_id', append=True)
)

df_ancestors = pd.read_csv(Config.data_dir / 'train_ancestors.csv', index_col='id')
df = df.reset_index().merge(df_ranks, on=["id", "cell_id"]).merge(df_ancestors, on=["id"])
df["pct_rank"] = df["rank"] / df.groupby("id")["cell_id"].transform("count")
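
The regression target `pct_rank` is a cell's rank in the correct order divided by the number of cells in its notebook, so it always lies in `[0, 1)`. A minimal sketch on one toy notebook, using plain lists instead of the DataFrame above (the cell ids are hypothetical):

```python
def get_ranks(base, derived):
    # Position of each derived cell inside the correct order
    return [base.index(d) for d in derived]

cell_order = ["m1", "c1", "m2", "c2"]   # hypothetical correct order of the notebook
cell_ids = ["c1", "c2", "m1", "m2"]     # hypothetical order in which the cells are stored

ranks = get_ranks(cell_order, cell_ids)
pct_ranks = [r / len(cell_ids) for r in ranks]

print(ranks)      # [1, 3, 0, 2]
print(pct_ranks)  # [0.25, 0.75, 0.0, 0.5]
```

Normalizing by the notebook length keeps the targets on the same scale for short and long notebooks.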

In [None]:
NVALID = 0.1  # size of validation set

splitter = GroupShuffleSplit(n_splits=1, test_size=NVALID, random_state=0)

train_ind, val_ind = next(splitter.split(df, groups=df["ancestor_id"]))

train_df = df.loc[train_ind].reset_index(drop=True)
val_df = df.loc[val_ind].reset_index(drop=True)

train_df_mark = train_df[train_df["cell_type"] == "code"].reset_index(drop=True)
val_df_mark = val_df[val_df["cell_type"] == "code"].reset_index(drop=True)
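
Splitting by `ancestor_id` keeps notebooks that share an ancestor on the same side of the split, which avoids leaking near-duplicate notebooks into validation. The sketch below shows the idea in plain Python; it is an illustration only, not scikit-learn's `GroupShuffleSplit` implementation, and the helper name `group_split` and the group labels are made up.

```python
import random

def group_split(groups, test_size=0.1, seed=0):
    """Hypothetical helper: assigns whole groups to either train or validation."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, round(len(unique) * test_size))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx

# One group label per row, standing in for df["ancestor_id"]
groups = ["a", "a", "b", "c", "c", "c", "d", "e"]
train_idx, val_idx = group_split(groups, test_size=0.4)

train_groups = {groups[i] for i in train_idx}
val_groups = {groups[i] for i in val_idx}
print(train_groups & val_groups)  # set(): no group appears on both sides
```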

**Model class for the CodeBERT model (`microsoft/codebert-base`)**

In [None]:
class CodeBertModel(nn.Module):
    def __init__(self):
        super(CodeBertModel, self).__init__()
        self.bert = transformers.RobertaModel.from_pretrained(Config.MODEL_NAME)
        self.drop = nn.Dropout(0.3)
        self.fc = nn.Linear(768, 1)
    
    def forward(self, ids, mask, token_type_ids):
        _, output = self.bert(ids, attention_mask=mask, token_type_ids=token_type_ids, return_dict=False)
        output = self.drop(output)
        output = self.fc(output)
        return output

Custom dataset that tokenizes each cell's source and returns its `pct_rank` target

In [None]:
class AI4CodeDataset(Dataset):
    def __init__(self, df, is_test=False):
        self.df = df.reset_index(drop=True)
        self.is_test = is_test

    def __getitem__(self, idx):
        sample = self.df.iloc[idx]
        
        inputs = Config.TOKENIZER.encode_plus(
            sample['source'],
            None,
            add_special_tokens=True,
            max_length=Config.MAX_LEN,
            padding="max_length",
            return_token_type_ids=True,
            truncation=True
        )
        ids = torch.tensor(inputs['input_ids'], dtype=torch.long)
        mask = torch.tensor(inputs['attention_mask'], dtype=torch.long)
        token_type_ids = torch.tensor(inputs['token_type_ids'], dtype=torch.long)

        if self.is_test:
            return (ids, mask, token_type_ids)
        else:    
            targets = torch.tensor([sample.pct_rank], dtype=torch.float)
            return (ids, mask, token_type_ids, targets)

    def __len__(self):
        return len(self.df)
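
The dataset asks the tokenizer for `padding="max_length"` and `truncation=True`, so every sample comes back with exactly `MAX_LEN` ids plus a matching attention mask. The sketch below mimics that behaviour on plain integer lists. The helper `pad_or_truncate` is hypothetical, and the special-token ids (`0` for `<s>`, `2` for `</s>`, `1` for `<pad>`) are assumed to follow the usual RoBERTa vocabulary; the real tokenizer handles all of this internally.

```python
def pad_or_truncate(token_ids, max_len, bos_id=0, eos_id=2, pad_id=1):
    """Hypothetical helper: fixed-length ids and attention mask for one sequence."""
    # Leave room for the two special tokens, then truncate the content
    content = token_ids[: max_len - 2]
    ids = [bos_id] + content + [eos_id]
    mask = [1] * len(ids)
    # Pad on the right up to max_len; padded positions get mask 0
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

short_ids, short_mask = pad_or_truncate([10, 11, 12], max_len=8)
long_ids, long_mask = pad_or_truncate(list(range(100, 120)), max_len=8)

print(short_ids, short_mask)  # [0, 10, 11, 12, 2, 1, 1, 1] [1, 1, 1, 1, 1, 0, 0, 0]
print(long_ids, long_mask)    # [0, 100, 101, 102, 103, 104, 105, 2] [1, 1, 1, 1, 1, 1, 1, 1]
```

With `MAX_LEN = 120`, any cell longer than 118 content tokens would be cut off under these assumptions, which is worth keeping in mind for long code cells.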

Below is a custom `Trainer` class that I wrote from scratch to facilitate my training and validation sub-routines.

In [None]:
class Trainer:
    def __init__(self, config, dataloaders, optimizer, model, loss_fns, scheduler, device="cuda:0"):
        self.train_loader, self.valid_loader = dataloaders
        self.train_loss_fn, self.valid_loss_fn = loss_fns
        self.scheduler = scheduler
        self.optimizer = optimizer
        self.model = model
        self.device = torch.device(device)
        self.config = config

    def train_one_epoch(self):
        """
        Trains the model for 1 epoch
        """
        self.model.train()
        train_pbar = tqdm(enumerate(self.train_loader), total=len(self.train_loader))
        train_preds, train_targets = [], []

        for bnum, cache in train_pbar:
            ids = self._convert_if_not_tensor(cache[0], dtype=torch.long)
            mask = self._convert_if_not_tensor(cache[1], dtype=torch.long)
            ttis = self._convert_if_not_tensor(cache[2], dtype=torch.long)
            targets = self._convert_if_not_tensor(cache[3], dtype=torch.float)
            
            # Run only the forward pass and the loss computation under autocast
            with autocast(enabled=True):
                outputs = self.model(ids=ids, mask=mask, token_type_ids=ttis).view(-1)
                loss = self.train_loss_fn(outputs, targets)

            loss_itm = loss.item()
            wandb_log(train_batch_loss=loss_itm)
            train_pbar.set_description('loss: {:.2f}'.format(loss_itm))

            # The scaled backward pass and the optimizer step happen outside autocast
            Config.scaler.scale(loss).backward()
            Config.scaler.step(self.optimizer)
            Config.scaler.update()
            self.optimizer.zero_grad()
            self.scheduler.step()
                            
            train_targets.extend(targets.cpu().detach().numpy().tolist())
            train_preds.extend(outputs.cpu().detach().numpy().tolist())
        
        # Tidy
        del outputs, targets, ids, mask, ttis, loss_itm, loss
        gc.collect()
        torch.cuda.empty_cache()
        
        return train_preds, train_targets

    @torch.no_grad()
    def valid_one_epoch(self):
        """
        Validates the model for 1 epoch
        """
        self.model.eval()
        valid_pbar = tqdm(enumerate(self.valid_loader), total=len(self.valid_loader))
        valid_preds, valid_targets = [], []

        for idx, cache in valid_pbar:
            ids = self._convert_if_not_tensor(cache[0], dtype=torch.long)
            mask = self._convert_if_not_tensor(cache[1], dtype=torch.long)
            ttis = self._convert_if_not_tensor(cache[2], dtype=torch.long)
            targets = self._convert_if_not_tensor(cache[3], dtype=torch.float)

            outputs = self.model(ids=ids, mask=mask, token_type_ids=ttis).view(-1)
            valid_loss = self.valid_loss_fn(outputs, targets)
            
            wandb_log(
                valid_batch_loss = valid_loss.item()
            )
            
            valid_pbar.set_description(desc=f"val_loss: {valid_loss.item():.4f}")

            valid_targets.extend(targets.cpu().detach().numpy().tolist())
            valid_preds.extend(outputs.cpu().detach().numpy().tolist())

        # Tidy
        del outputs, targets, ids, mask, ttis, valid_loss
        gc.collect()
        torch.cuda.empty_cache()
        
        return valid_preds, valid_targets

    def fit(self, epochs: int = 10, output_dir: str = "/kaggle/working/", custom_name: str = 'model.pth'):
        """
        Low-effort alternative for doing the complete training and validation process
        """
        best_loss = float("inf")
        best_preds = None
        for epx in range(epochs):
            print(f"{'='*20} Epoch: {epx+1} / {epochs} {'='*20}")

            train_preds, train_targets = self.train_one_epoch()
            train_mse = mean_squared_error(train_targets, train_preds)
            print(f"Training loss: {train_mse:.4f}")

            valid_preds, valid_targets = self.valid_one_epoch()
            valid_mse = mean_squared_error(valid_targets, valid_preds)
            print(f"Validation loss: {valid_mse:.4f}")
            
            wandb_log(
                train_mse = train_mse,
                valid_mse = valid_mse
            )
            
            if valid_mse < best_loss:
                best_loss = valid_mse
                best_preds = valid_preds
                self.save_model(output_dir, custom_name)
                print(f"Saved model with val_loss: {best_loss:.4f}")

        # Return the validation predictions of the best epoch
        return best_preds
            
    def save_model(self, path, name, verbose=False):
        """
        Saves the model at the provided destination
        """
        # Create the output directory if it does not exist yet
        os.makedirs(path, exist_ok=True)

        torch.save(self.model.state_dict(), os.path.join(path, name))
        if verbose:
            print(f"Model Saved at: {os.path.join(path, name)}")

    def _convert_if_not_tensor(self, x, dtype):
        if self._tensor_check(x):
            return x.to(self.device, dtype=dtype)
        else:
            return torch.tensor(x, dtype=dtype, device=self.device)

    def _tensor_check(self, x):
        return isinstance(x, torch.Tensor)

The optimizer updates all model parameters, but applies weight decay only to those that are not biases or LayerNorm weights.

In [None]:
def yield_optimizer(model):
    """
    Returns optimizer for specific parameters
    """
    param_optimizer = list(model.named_parameters())
    no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
    optimizer_parameters = [
        {
            "params": [
                p for n, p in param_optimizer if not any(nd in n for nd in no_decay)
            ],
            "weight_decay": 0.003,
        },
        {
            "params": [
                p for n, p in param_optimizer if any(nd in n for nd in no_decay)
            ],
            "weight_decay": 0.0,
        },
    ]
    # torch.optim.AdamW is used here in place of the deprecated transformers.AdamW
    return torch.optim.AdamW(optimizer_parameters, lr=Config.LR)
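
To see which parameters land in which group, the same substring filter can be run on a handful of parameter names. The names below are made up to resemble what `model.named_parameters()` yields for a RoBERTa-style encoder with an `fc` head; treat them as an assumption, not the exact names in this model.

```python
no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]

# Hypothetical parameter names for illustration
names = [
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.0.attention.self.query.bias",
    "bert.encoder.layer.0.attention.output.LayerNorm.weight",
    "bert.encoder.layer.0.attention.output.LayerNorm.bias",
    "fc.weight",
    "fc.bias",
]

# Same rule as in yield_optimizer: a name matching any pattern gets no weight decay
decay_names = [n for n in names if not any(nd in n for nd in no_decay)]
no_decay_names = [n for n in names if any(nd in n for nd in no_decay)]

print(decay_names)
print(no_decay_names)
```

Only the two plain weight matrices end up in the decayed group; every bias and both LayerNorm parameters are excluded.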

Main training code. I will be adding KFolds support soon!

In [None]:
# Training Code
if __name__ == '__main__':
    if torch.cuda.is_available():
        print("[INFO] Using GPU: {}\n".format(torch.cuda.get_device_name()))
        DEVICE = torch.device('cuda:0')
    else:
        print("\n[INFO] GPU not found. Using CPU: {}\n".format(platform.processor()))
        DEVICE = torch.device('cpu')

    train_set = AI4CodeDataset(train_df_mark)
    valid_set = AI4CodeDataset(val_df_mark)

    train_loader = DataLoader(
        train_set,
        batch_size = Config.TRAIN_BS,
        shuffle = True,
        num_workers = 8
    )

    valid_loader = DataLoader(
        valid_set,
        batch_size = Config.VALID_BS,
        shuffle = False,
        num_workers = 8
    )

    model = CodeBertModel().to(DEVICE)
    nb_train_steps = int(len(train_df_mark) / Config.TRAIN_BS * Config.NB_EPOCHS)
    optimizer = yield_optimizer(model)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
        optimizer, 
        T_0=Config.T_0, 
        eta_min=Config.η_min
    )
    train_loss_fn, valid_loss_fn = nn.MSELoss(), nn.MSELoss()
    
    wandb.watch(model, criterion=train_loss_fn)
    
    trainer = Trainer(
        config = Config,
        dataloaders = (train_loader, valid_loader),
        loss_fns = (train_loss_fn, valid_loss_fn),
        optimizer = optimizer,
        model = model,
        scheduler = scheduler,
        device = DEVICE,  # keep the trainer on the same device as the model
    )

    best_pred = trainer.fit(
        epochs = Config.NB_EPOCHS,
        custom_name = f"ai4code_codebert_small.bin"
    )
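
Because `Trainer.train_one_epoch` calls `scheduler.step()` after every batch, `T_0 = 20` means the learning rate restarts every 20 batches, not every 20 epochs. The sketch below reproduces the cosine warm-restart formula from the PyTorch documentation for the default `T_mult = 1`, using the values from `Config`; the function name `cosine_warm_restart_lr` is made up for this example.

```python
import math

def cosine_warm_restart_lr(step, base_lr=3e-4, eta_min=1e-4, t_0=20):
    """Learning rate after `step` scheduler steps, assuming T_mult = 1."""
    t_cur = step % t_0  # position inside the current cycle
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / t_0)) / 2

lr_start = cosine_warm_restart_lr(0)     # start of the first cycle: base LR
lr_mid = cosine_warm_restart_lr(10)      # halfway: midpoint between base LR and eta_min
lr_restart = cosine_warm_restart_lr(20)  # first restart: back to base LR

for step in (0, 5, 10, 15, 19, 20):
    print(step, f"{cosine_warm_restart_lr(step):.6f}")
```

If you would rather have one cycle span a whole epoch, one option is to set `T_0` to the number of batches per epoch.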

In [None]:
# Finish the logging run
run.finish()

<center>
    <img src="https://img.shields.io/badge/Upvote-If%20you%20like%20my%20work-07b3c8?style=for-the-badge&logo=kaggle">
</center>