# Customizable pytorch LLM training pipeline

## Why this notebook ?

Like many, I'm trying to improve my skills and understanding of LLMs.
There are new things popping out everyday and it is hard to keep up.
There are many great resources to finetune an LLM in a few lines of code (transformers, trl, auto-train, ...).
However, I don't feel like learning much when using these ready-to-train libraries.
So I am here proposing a simple self-contained notebook where I try as much as possible to make all the subtelties visible and customizable to the final users, without of course reinventing the wheel.

As I said, I am currently learning all this and there might be errors or easy improvements to my pipeline: if you find some do not hesitate to let me know so that we can all improve!

## Ideas for Customization

Here is a few ideas to go beyond this notebook:
- try other LLM backbones
- add a more complex prediction head
- explore different loss functions
- try other pooling method

## Things that still don't work
Don't hesitate to let me know how this could work!

- loading AND training the LLM in float16, bfloat16, or int8, int4: I still don't know what needs to be done inside the pipeline to make this work (it would change everything in terms of memory needed)
- Save only the peft weights + the custom layers to save memory when saving the model

## Next steps to explore
I'll try to find the time to explore and share similar notebooks with:
- qlora
- torchtune
- what else? share your ideas in comments!

**Happy Kaggling!**

In [1]:
# install latest libraries
! pip install -q /kaggle/input/lal-scoring-wheels/tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl --no-deps
! pip install -q /kaggle/input/lal-scoring-wheels/transformers-4.40.0-py3-none-any.whl --no-deps
! pip install -q /kaggle/input/lal-scoring-wheels/peft-0.10.0-py3-none-any.whl --no-deps
! pip install -q /kaggle/input/lal-scoring-wheels/accelerate-0.29.3-py3-none-any.whl --no-deps

In [2]:
import numpy as np
import pandas as pd
import torch
import time
import datetime

# Experiment configuration
Once your set up is finalized you only need to play with hyperparameters to find the best model

In [3]:
class Config():
    def __init__(self):
        # Parameters related to the problem
        self.num_classes = 1 # 1 class for regression
        
        # Parameters related to network
        self.architecture = {"backbone": "/kaggle/input/gemma/transformers/1.1-2b-it/1",
                             "params": {}}
        self.remove_layers = 8 # number of layer to remove to make the model smaller
        self.freeze_layers = None # number of layers to freeze to reduce number of training parameters
        self.use_lora = True
        self.lora_config = {"r": 16, # rank of the decomposed matrix (higher means less memory saving)
                            "lora_alpha": 32, # scaling factor, should be 2xr according to https://www.entrypointai.com/blog/lora-fine-tuning/
                            "lora_dropout": 0.1, # usual dropout
                            # make sure that you name correctly your modules according to your backbone
                            # you should spot the linear layers in the attention blocks
                            "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                                               "gate_proj", "up_proj", "down_proj",],
                           }
        
        self.token_info = {"padding" :"longest", # batch are going to be the length of longest sequence
                           "max_length" : 256, # trained with 1024 locally, lowered to train within kaggle notebook
                           "truncation": True,
                           "pad_to_multiple_of" : 512 # I heard that modern GPUs are fastest with multiple of 512? is that True?
                          }

        # Parameters related to training
        self.max_epochs = 1 # number of epochs

        self.initial_lr =1e-4
        self.optimizer_name = "AdamW" #"AdamW" # try 8 bit adam AdamW8bit     
        self.optimizer_params = {"lr": self.initial_lr, 
                                 "weight_decay":1e-2
                                }
        self.loss_config = {"loss_name" : "MSELoss",
                            "reduction":"mean",
                           }
        
        self.scheduler_name = "OneCycleLR"
        self.steps_per_epochs = -1 # this is automatically overwritten
        self.scheduler_params={
                              "max_lr":self.optimizer_params["lr"] if type(self.optimizer_params)==dict else self.optimizer_params[-1]["lr"],
                               "div_factor":10,
                              "steps_per_epoch": self.steps_per_epochs,
                              "final_div_factor":1e2, #1e2
                               "anneal_strategy":"cos", #"cos"
                               "three_phase" : False,
                              "pct_start":0.1, #0.3
                              "epochs": self.max_epochs}
        
        
        self.eval_on_train = False # You might want to compute the exact metric on training set to monitor overfitting
        self.batch_size = 1 # Let's start small
        self.gradient_accumulation = 16 // self.batch_size # this allows you to train with low batch size but compute gradients on more that a few samples
        self.mixed_precision = True
        self.num_workers = 2 # I think num_workers for kaggle environment should be kept low
        self.pin_memory = True
        self.clip_value = 10.0

        # parameters related to logs
        self.verbose = 1 # how often do you want to compute the competition metric?
        self.save_path = "/kaggle/working/"

PATH_TO_DATA = "/kaggle/input/learning-agency-lab-automated-essay-scoring-2"
exp_config = Config()

# Define datasets

In [4]:
from dataclasses import dataclass
from torch.utils.data import DataLoader, Dataset
from typing import Optional, Union, Any
from transformers import DataCollatorWithPadding


from transformers import AutoTokenizer

def define_tokenizer(cfg):
    """
    Let's use basic AutoTokenizer
    """

    tokenizer = AutoTokenizer.from_pretrained(cfg.architecture["backbone"])    
    return tokenizer
    
class LALDataset(Dataset):
    """
    There are simpler ways of creating a dataset nowadays (using datasets library for example).
    But I prefer to define it that way as I feel more in control of what is actually happening.
    Here the dataset is very simple, but more customization could be done.
    
    If there is a good reason not to do that and use more recent methods please let me know!
    """
    def __init__(self, df, config, inference, remove=True):
        """
        df: pandas dataframe
        config: experiment config
        inference (bool): are we in inference mode ?
        remove (bool): should we remove unecessary columns that might not colate correctly?
        """
        self.df = df
        # tokenizer needs to be defined as it's used by datacollator
        self.tokenizer = define_tokenizer(config)
        self.inference = inference
        self.config = config
        self.remove = remove
        
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, row_idx):
        
        full_text = self.df.loc[row_idx, "full_text"]
        
        tokenized_text = self.tokenizer(full_text,
                                        return_offsets_mapping=False, # mostly needed for entity recognition
                                        truncation=self.config.token_info["truncation"],
                                        max_length=self.config.token_info["max_length"])
        
        labels = self.df.loc[row_idx, "score"]
            
        # here we append eos token at the end which will work as a CLS token
        # note that this must be the last token as GPT like models have a causal attention
        tokenized_text.input_ids.append(self.tokenizer.eos_token_id)
        tokenized_text.attention_mask.append(1)
       
        out_dict = {
                "input_ids": tokenized_text.input_ids,
                "attention_mask": tokenized_text.attention_mask,
                "labels": torch.Tensor([labels])
            }
        return out_dict


def define_loader(dataset,
                  config,
                  inference,
                 ):
    """
    Use config and inference mode to create dataloader for train and test.
    """
    num_workers = config.num_workers
    pin_memory = config.pin_memory
    
    # collate_fn = None
    # we use here a basic data collator
    collate_fn = DataCollatorWithPadding(tokenizer=dataset.tokenizer,
                                         padding=config.token_info["padding"],
                                         max_length=config.token_info["max_length"],
                                         pad_to_multiple_of=config.token_info["pad_to_multiple_of"]
                                    )


    loader = DataLoader(
                dataset,
                batch_size=config.batch_size,
                shuffle=not inference,
                drop_last=not inference,
                num_workers=num_workers,
                pin_memory=pin_memory, 
                collate_fn=collate_fn,
                # worker_init_fn=worker_init_fn,
            )
    return loader


def get_dataset_and_loader(df, config, inference, remove=True):
    """
    Returns both dataset and dataloader
    """
    dataset = LALDataset(df, config, inference, remove=remove)
    loader = define_loader(dataset, config, inference)
    return dataset, loader

def create_loaders(df, train_idx, valid_idx, config, eval_on_train):
    
    # You can set larger max length for inference
    valid_config = copy.deepcopy(config)
    valid_config.token_info['max_length'] = config.token_info['max_length']
    
    _, train_dl = get_dataset_and_loader(df=df.iloc[train_idx].reset_index(drop=True),
                                        config=config,
                                        inference=False,
                                        )

    _, valid_dl = get_dataset_and_loader(df=df.iloc[valid_idx].reset_index(drop=True),
                                        config=valid_config,
                                        inference=True)

    if eval_on_train:
        _, train_aux_dl = get_dataset_and_loader(df=df.iloc[train_idx].reset_index(drop=True),
                                                config=valid_config,
                                                inference=True)

        eval_loaders = [train_aux_dl, valid_dl]
        eval_names = ["train", "valid"]
    else:
        eval_loaders = [valid_dl]
        eval_names = ["valid"]
    return train_dl, valid_dl, eval_loaders, eval_names

2024-04-23 16:17:37.039701: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-23 16:17:37.039830: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-23 16:17:37.173532: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


# Define network's architecture

Here we use a simple architecture composed of an LLM backbone finetuned with LORA and a linear head.
We only use the last eos_token to predict the final score.

You can easily customize this architecture as you would do with any torch.nn.Module:
- add metadata as inputs: tf-idf, num_words, engineered features etc...
- make the final head more complicated (MLP with activations etc...)
- try pooling methods insread of eos_oken pooling etc...

In [5]:
from transformers import AutoModelForSequenceClassification, AutoModelForCausalLM
from peft import get_peft_model, LoraConfig, TaskType
from transformers import AutoConfig

class CustomLLM(torch.nn.Module):
    """
    Here is where you can customize your architecture
    """
    def __init__(self, cfg, eos_token_id):
        super().__init__()
        self.num_classes = cfg.num_classes
        self.eos_token_id = eos_token_id
        self.model_config = AutoConfig.from_pretrained(
                cfg.architecture["backbone"],
            )

        self.activation = torch.nn.Identity() # Activation could be differnt
        # get a backbone for our network
        # Here let's go with AutoModelForCausalLM, it does not matter as we take remove the final layers
        self.backbone  = AutoModelForCausalLM.from_pretrained(cfg.architecture["backbone"],
                                                                    device_map="cpu", #"cuda",
                                                                    load_in_4bit=False, #cfg.load_in_4bit,
                                                                    torch_dtype=torch.float32, # let's leave bfloat16 for later (not working)
                                                                    **cfg.architecture["params"])
        # remove the head as we are going to use a custom head
        self.backbone.lm_head = torch.nn.Identity()

        
        if cfg.remove_layers is not None:
            # we only remove the last layers as they can be superfluous: https://arxiv.org/html/2403.17887v1
            self.backbone.layers = self.backbone.model.layers[:-cfg.remove_layers]          
        
        if hasattr(cfg, "num_layers_to_freeze"):
            print(f"freezing {cfg.num_layers_to_freez} layers.")
            if cfg.num_layers_to_freeze > 0:
                if cfg.freeze_embeddings:
                    # should you train embeddings?
                    for param in self.backbone.embed_tokens.parameters():
                        param.requires_grad = False
                # Here the first layers are frozen: only remaining last layers will be trained
                for layer in self.backbone.model.layers[:cfg.num_layers_to_freeze]:
                    for param in layer.parameters():
                        param.requires_grad = False
                
        if cfg.use_lora:
            # Here we apply lora from peft library
            peft_config = LoraConfig(
                task_type=TaskType.CAUSAL_LM,
                inference_mode=False, # this does not seem to change anything -> how are we suppose to use it properly?
                r=cfg.lora_config["r"],
                lora_alpha=cfg.lora_config["lora_alpha"],
                lora_dropout=cfg.lora_config["lora_dropout"],
                target_modules=cfg.lora_config["target_modules"],
                # modules_to_save=["lstm_head", "final_linear"]
                # bnb_4bit_compute_dtype=torch.bfloat16, # leave this for later       
            )
            
            self.backbone = get_peft_model(self.backbone, peft_config)
        else:
            print("NOT USING LORA")
            
        # this is for gradient checkpoint, left for later
        # self.transformers_model.gradient_checkpointing_enable()
                    
        self.final_linear = torch.nn.Linear(self.model_config.hidden_size, cfg.num_classes)
        
        if cfg.use_lora:
            self.backbone.print_trainable_parameters()
        
    def forward(self, batch):
        x = batch["input_ids"] # (bs, num_tokens)
        # this assumes that you only have one eos_token per example
        eos_positions = torch.argwhere(x == self.eos_token_id)[:, 1]
        x = self.backbone(
            input_ids=x,
            attention_mask=batch["attention_mask"],
        )["logits"] # (bs, num_tokens, hidden_size)
        
        # we are only interested in the eos_token
        x = x[torch.arange(x.shape[0]), eos_positions] # (bs, hidden_size)
        
        logits = self.final_linear(x) # (bs, num_classes)

        return {"logits": logits}

# Training recipe

You may need to change this if you make significant changes in your modelling apporach

In [6]:
from dataclasses import dataclass
from typing import List, Any, Dict
from torch.nn.utils import clip_grad_norm_
from abc import abstractmethod
from sklearn.base import BaseEstimator
import json
from pathlib import Path
from tqdm.notebook import tqdm
import copy

# Layers to which we do not want to apply weight decay with AdamW
ALL_LAYERNORM_LAYERS = [torch.nn.LayerNorm, torch.nn.Embedding]


def get_parameter_names(network, forbidden_layer_types):
        """
        Returns the names of the model parameters that are not inside a forbidden layer.
        """
        result = []
        for name, child in network.named_children():
            result += [
                f"{name}.{n}"
                for n in get_parameter_names(child, forbidden_layer_types)
                if not isinstance(child, tuple(forbidden_layer_types))
            ]
        # Add model specific parameters (defined with nn.Parameter) since they are not in any child.
        result += list(network._parameters.keys())
        return result
    
def define_loss_function(loss_config):
    """
    Basic torch loss functions or locally defined loss
    """
    copy_config = copy.copy(loss_config)
    loss_name = copy_config.pop('loss_name')
    try:
        loss_fn = getattr(torch.nn, loss_name)(**copy_config)
    except AttributeError:
        try:
            loss_fn = globals().get(loss_name)(copy_config)
        except:
            raise NotImplementedError("Unkown loss function :", loss_name)
    return loss_fn

def prepare_log_folder(log_path):
    """
    Utility function to create experiment folder
    Creates the directory for logging.
    Logs will be saved at log_path/date_of_day/exp_id

    Args:
        log_path (str): Directory

    Returns:
        str: Path to the created log folder
    """
    today = str(datetime.date.today())
    log_today = os.path.join(log_path, today)

    if not os.path.exists(log_today):
        Path(log_today).mkdir(parents=True)

    exp_id = (
        np.max([int(f) if str(f).isdigit() else -1 for f in os.listdir(log_today)]) + 1
        if len(os.listdir(log_today))
        else 0
    )
    log_folder = os.path.join(log_today, f"{exp_id}")

    assert not os.path.exists(log_folder), "Experiment already exists"
    os.mkdir(log_folder)
    print("Saving logs at :", log_folder)
    return log_folder

def save_config(config, folder):
    """
    Saves a config as a json, copies data and model configs.

    Args:
        config (Config): Config.
        folder (str): Folder to save at.
    """
    with open(os.path.join(folder, "config.json"), "w") as f:
        json.dump(config.__dict__.copy(), f)

@dataclass
class AbstractBaseModel(BaseEstimator):
    """ Abstract class for scikit-like model.
        Allows to build upon to train, infer, save, load etc..
    """

    network: torch.nn.Module = None
    device: str = "cuda" if torch.cuda.is_available() else "cpu"
    seed: int = 42
    mixed_precision: bool = False

    def __post_init__(self):
        torch.manual_seed(self.seed)
        self.network.to(self.device)

    def fit(
        self,
        train_dataloader,
        eval_loaders=None,
        eval_names=None,
        eval_metric=None,
        loss_config=None,
        max_epochs=100,
        callbacks=None,
        optimizer_name="Adam",
        optimizer_params={"lr": 1e-3},
        gradient_accumulation=None,
        scheduler_name=None,
        scheduler_params=None,
        mixed_precision=False,
        clip_value=None,
        log_path=None,
        verbose=1,
    ):
        """
        Train a neural network stored in self.network
        Using train_dataloader for training data and
        eval_loaders for validation.
        Parameters
        ----------
        train_dataloader : Dataloader
            Train set
        eval_loader : list of dataloaders
            The last one is used for early stopping
        eval_name : list of str
            List of eval set names.
        eval_metric : list of str
            List of evaluation metrics.
            The last metric is used for early stopping.
        loss_name : Name
            a PyTorch loss function name
        max_epochs : int
            Maximum number of epochs during training
        num_workers : int
            Number of workers used in torch.utils.data.DataLoader
        drop_last : bool
            Whether to drop last batch during training
        pin_memory: bool
            Whether to set pin_memory to True or False during training
        from_unsupervised: unsupervised trained model
            Use a previously self supervised model as starting weights
        clip_value: float (default to None)
            Gradient clipping
        """
        # update model name

        self.max_epochs = max_epochs
        self._stop_training = False
        self.optimizer_name = optimizer_name
        self.optimizer_params = optimizer_params
        eval_loaders = eval_loaders if eval_loaders else []       
        self.mixed_precision = mixed_precision
        self.clip_value = clip_value
        self.verbose = verbose
        self.gradient_accumulation = gradient_accumulation
        self.metrics = eval_metric
        
        if loss_config is None:
            raise(NotImplementedError, "Please specifify a loss")
        else:
            self.loss_fn = define_loss_function(loss_config)
        
        self._set_optimizer()
        
        # scheduler
        self.scheduler_fn = getattr(torch.optim.lr_scheduler, scheduler_name) # this will only accept torch schedulers
        self.scheduler_params = copy.copy(scheduler_params)
        self.scheduler = self.scheduler_fn(self._optimizer, **self.scheduler_params)
        
        
        # Training loop over epochs
        start_time = time.time()
        for epoch_idx in range(self.max_epochs):
            self.epoch_idx = epoch_idx
            epoch_loss, epoch_lr = self._train_epoch(train_dataloader)
            msg = f"epoch {epoch_idx:<3} | lr: {epoch_lr:.2e} | loss: {epoch_loss:.4f} "
            # Apply predict epoch to all eval sets
            if ((self.verbose != 0) and (epoch_idx % self.verbose == 0)) or (epoch_idx==self.max_epochs-1):
                for eval_name, valid_dataloader in zip(eval_names, eval_loaders):
                    with torch.no_grad():
                        prob_pred, prob_true, scores = self._predict_epoch(eval_name, valid_dataloader)
                    for metric_name, metric_score in scores:
                        msg += f"| {metric_name:<3} ({eval_name}): {metric_score:.4f} "
            total_time = int(time.time() - start_time)
            msg += f"|  {str(datetime.timedelta(seconds=total_time)) + 's':<6}"
            print(msg)
        print("End of training!")
        self.network.eval()
        return prob_pred, prob_true
        
    def predict_proba(self, dataloader, return_target=False):
        """
        Make predictions on a batch (valid)
        Parameters
        ----------
        X : a :tensor: `torch.Tensor`
            Input data
        Returns
        -------
        predictions : np.array
            Predictions of the regression problem
        """
        self.network.eval()
        results_prob = []
        results_targets = []
        pbar = tqdm(dataloader,
                     leave=False,
                     total=len(dataloader),
                     desc=f'Inference')
        
        with torch.no_grad():
            for batch in pbar:
                out_probs = self._predict_batch(batch).cpu()
                results_prob.append(out_probs)

                if return_target:
                    targets = batch["labels"]
                    targets = targets.to("cpu").detach()
                    results_targets.append(targets)

        res_prob = self.stack_preds(results_prob)                
        if return_target:
            res_target = self.stack_targets(results_targets)
            return res_prob, res_target
        else:
            return res_prob

    def save_model(self, path, model_name):
        """
        Save the model somewhere

        Users can specify both the path and model_name
        If no model_name given an automatic one will be creted
        """
        Path(path).mkdir(parents=True, exist_ok=True)
        # Save state_dict with half precision for less gpu usage during inference: same result if trained with mixed precision
        torch.save(self.network.half().state_dict(), Path(path).joinpath(f"{model_name}.pt"))
        # torch.save(self.network.state_dict(), Path(path).joinpath(f"{model_name}.pt"))
        return


    def _train_epoch(self, train_loader):
        """
        Trains one epoch of the network in self.network
        Parameters
        ----------
        train_loader : a :class: `torch.utils.data.Dataloader`
            DataLoader with train set
        """
        self.network.train()
        num_iter_epoch = len(train_loader)
        pbar = tqdm(enumerate(train_loader),
                                     leave=False,
                                     total=len(train_loader),
                                     desc=f'train epoch {self.epoch_idx}')
        
        epoch_loss = 0
        for batch_idx, batch in pbar:
            batch_loss = self._train_batch(batch, batch_idx, num_iter_epoch)
            epoch_loss = (train_loader.batch_size*batch_idx*epoch_loss + train_loader.batch_size*batch_loss) / (train_loader.batch_size*(batch_idx+1))            
            pbar.set_description(f'train epoch {self.epoch_idx}: loss {epoch_loss:.3f}', refresh=True)
            # update scheduler
            self.scheduler.step()

        epoch_lr = self._optimizer.param_groups[-1]["lr"]
        return epoch_loss, epoch_lr

    def _train_batch(self, batch, batch_idx, num_iter_epoch):
        """
        Trains one batch of data
        Parameters
        ----------
        batch_logs : dict
            Dictionnary with "batch_size" and "loss".
        """
        self._send_batch_to_device(batch)
                                   
        with torch.cuda.amp.autocast(enabled=self.mixed_precision):
            # use mixed precision for float16 training
            y = batch["labels"]
            batch_logs = {"batch_size": y.shape[0]}
            out_probs = self.network(batch)
            # computing loss with division by gradient accumulation
            loss = self.loss_fn(out_probs["logits"], y.unsqueeze(-1)) / self.gradient_accumulation
            self.scaler.scale(loss).backward()
            
        if ((batch_idx + 1) % self.gradient_accumulation == 0) or ((batch_idx + 1)==num_iter_epoch):
            # Perform backward pass and optimization
            if self.clip_value is not None:
                self.scaler.unscale_(self._optimizer)
                clip_grad_norm_(self.network.parameters(), max_norm=self.clip_value)

            self.scaler.step(self._optimizer)
            self.scaler.update()
            # set the gradients to 0 only when calling step
            self._optimizer.zero_grad(set_to_none=True)
        return loss.detach().item()

    def _predict_epoch(self, name, loader):
        """
        Predict an epoch and update metrics.
        Parameters
        ----------
        name : str
            Name of the validation set
        loader : torch.utils.data.Dataloader
                DataLoader with validation set
        """
        prob_pred, prob_true = self.predict_proba(loader, return_target=True)
        
        scores = []
        for metric_fn in self.metrics:
            metric_score = metric_fn(prob_true, prob_pred)
            scores.append((metric_fn._name, metric_score))
        # need to compute metrics here
        return prob_pred, prob_true, scores

    def stack_preds(self, list_prob):
        return torch.vstack(list_prob)

    def stack_targets(self, list_prob):
        return torch.hstack(list_prob)

    def _send_batch_to_device(self, batch):
        for key, value in batch.items():
            batch[key] = value.to(self.device)
            
    def _predict_batch(self, batch):
        """
        Predict one batch of data.
        """
        with torch.cuda.amp.autocast(enabled=self.mixed_precision):
            self._send_batch_to_device(batch)
            # compute model output
            out_probs = self.network(batch)["logits"]
            # apply activation
            if isinstance(self.network, torch.nn.DataParallel):
                # deal with data parrallel
                out_probs = self.network.module.activation(out_probs)
            else:
                out_probs = self.network.activation(out_probs)
            
        return out_probs.detach()

    def _set_optimizer(self):
        """Setup optimizer."""
        
        name = self.optimizer_name

        # disable decay for layer norm
        decay_parameters = get_parameter_names(self.network, ALL_LAYERNORM_LAYERS)
        decay_parameters = [name for name in decay_parameters if "bias" not in name]
        optimizer_grouped_parameters = [
            {
                "params": [
                    p for n, p in self.network.named_parameters() if (n in decay_parameters and p.requires_grad)
                ],
                "weight_decay": self.optimizer_params["weight_decay"],
            },
            {
                "params": [
                    p for n, p in self.network.named_parameters() if (n not in decay_parameters and p.requires_grad)
                ],
                "weight_decay": 0.0,
            },
        ]
        other_params = self.optimizer_params.copy()
        _ = other_params.pop("weight_decay")
                
        self._optimizer = getattr(torch.optim, name)(optimizer_grouped_parameters, **other_params)        
        self.scaler = torch.cuda.amp.GradScaler(enabled=self.mixed_precision)
        return


# Metrics to track

Here you can define metrics you want to track during model training (every epoch)

In [7]:
from sklearn.metrics import (
    mean_squared_error
)
class RMSE:
    """
    Root Mean Squared Error.
    """

    def __init__(self):
        self._name = "rmse"


    def __call__(self, y_true, y_score):
        """
        Compute MSE (Mean Squared Error) of predictions.

        Parameters
        ----------
        y_true : np.ndarray
            Target matrix or vector
        y_score : np.ndarray
            Score matrix or vector

        Returns
        -------
        float
            MSE of predictions vs targets.
        """
        return mean_squared_error(y_true.numpy(), y_score.numpy(), squared=False)


import numpy as np
from numba import jit 

# @jit
def qwk6(a1, a2, max_rat=6):
    """
    Comp metric adapted from CPMP: https://www.kaggle.com/c/prostate-cancer-grade-assessment/discussion/145105
    """
    assert(len(a1) == len(a2))
    
    a1 = a1.astype(np.int64).reshape(-1)
    # take closest integer for continuous predictions
    a2 = np.rint(a2).astype(np.int64).reshape(-1)
    # a2 = np.asarray(a2, dtype=int)

    hist1 = np.zeros((max_rat + 1, ))
    hist2 = np.zeros((max_rat + 1, ))

    o = 0
    for k in range(a1.shape[0]):
        i, j = a1[k], a2[k]
        hist1[i] += 1
        hist2[j] += 1
        o +=  (i - j) * (i - j)

    e = 0
    for i in range(max_rat + 1):
        for j in range(max_rat + 1):
            e += hist1[i] * hist2[j] * (i - j) * (i - j)

    e = e / a1.shape[0]
    return (1 - o / e)

class QWK:
    def __init__(self):
        self._name = "qwk"
    def __call__(self, y_true, y_pred, max_rat=6):
        return qwk6(y_true.numpy(), y_pred.numpy())


# from sklearn.metrics import cohen_kappa_score


# class ScikitQWK:
#     """
#     Competition metric with scikit
#     """

#     def __init__(self):
#         self._name = "scikit_qwk"
#     def __call__(self, y_true, y_pred):
#         y_pred = np.rint(y_pred) # convert predictions to closest integer
#         return cohen_kappa_score(y_true, y_pred, weights="quadratic")

# Puting everything together for training one fold

This is just a simple function that will allow you to train one fold and save the corresponding configs and model checkpoint.

In [8]:
def update_sched_params(config, train_loader):
    """
    This helper function allows to define steps per epoch dynamically

    Parameters
    ----------
    - config : experiment config
    - train_loader : train data loader used for this fold
    """
    nb_epochs = config.max_epochs
    is_per_epoch = config.scheduler_params.get("steps_per_epoch", None)

    if is_per_epoch is not None:
        if is_per_epoch <= 0:
            # this means automatic number of steps
            config.scheduler_params["steps_per_epoch"] = len(train_loader)
        # else use the defined value
    
    
    # for get_cosine_schedule_with_warmup
    warmup_ratio = config.scheduler_params.pop("warmup_ratio", None)

    if warmup_ratio is not None:
        num_train_steps = int(len(train_loader) * nb_epochs)
        num_warmup_steps = int(num_train_steps * warmup_ratio)
        config.scheduler_params["num_warmup_steps"] = num_warmup_steps
        config.scheduler_params["num_training_steps"] = num_train_steps
        # else use the defined value
    return config

def train_fold(df,
               train_idx,
               valid_idx,
               config,
               fold_nb):

    print("Num train and valid samples:", train_idx.shape[0], valid_idx.shape[0])
    config = copy.deepcopy(config)
    train_dl, valid_dl, eval_loaders, eval_names =  create_loaders(df,
                                                                   train_idx,
                                                                   valid_idx,
                                                                   config,
                                                                   eval_on_train=config.eval_on_train
                                                                   )
    log_folder = prepare_log_folder(config.save_path)
    # add the eos_token_id to config
    config.eos_token_id = train_dl.dataset.tokenizer.eos_token_id
    save_config(config, log_folder)

    
    network = CustomLLM(config, train_dl.dataset.tokenizer.eos_token_id)
    model = AbstractBaseModel(network=network)
        
    # update scheduler
    config = update_sched_params(config, train_dl)

    prob_pred, prob_true = model.fit(train_dl,
                                      eval_loaders= eval_loaders,
                                      eval_names=eval_names,
                                      eval_metric=[RMSE(), QWK()], #  , ScikitQWK()
                                      loss_config=config.loss_config, 
                                      max_epochs=config.max_epochs,
                                      callbacks=None,
                                      optimizer_name=config.optimizer_name,
                                      optimizer_params=config.optimizer_params,
                                      scheduler_name=config.scheduler_name,
                                      scheduler_params=config.scheduler_params,
                                      gradient_accumulation=config.gradient_accumulation,
                                      mixed_precision=config.mixed_precision,
                                      clip_value=config.clip_value,
                                      verbose=config.verbose,
             )

    # prob_pred, prob_true = model.predict_proba(loader, return_target=True)
    
    model.save_model(path=log_folder, model_name=f"fold_{fold_nb}")
    torch.cuda.empty_cache()
        
    return prob_pred, prob_true

# Training: 5 fold cross validation

In [9]:
# use stratified kfold
from sklearn.model_selection import StratifiedKFold
import os

# download training data
df_train = pd.read_csv(os.path.join(PATH_TO_DATA, "train.csv"))

TRAIN = False # switch to False for inference and True for training
INFERENCE = True
DEBUG = False

if TRAIN:
    if DEBUG:
        df_train = df_train[:50]
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    
    for fold_nb, (train_idx, valid_idx) in enumerate(skf.split(df_train, df_train.score)):
        prob_pred, prob_true = train_fold(df_train,
                                           train_idx,
                                           valid_idx,
                                           exp_config,
                                           fold_nb=fold_nb,
                                           )
        break

In [10]:
# Example of what you should see (trained on my personal setup)

# Num train and valid samples: 13845 3462
# Saving logs at : ../../logs/essay_scoring/2024-04-23/2
# trainable params: 9,805,824 || all params: 2,515,978,240 || trainable%: 0.3897420034920493
# epoch 0   | lr: 5.87e-05 | loss: 0.0662 | rmse (valid): 0.5429 | qwk (valid): 0.8105 |  0:45:41s
# epoch 1   | lr: 1.00e-07 | loss: 0.0301 | rmse (valid): 0.5167 | qwk (valid): 0.8358 |  1:31:20s
# End of training!

# Inference

Here is where you can perform simple inference from a previously trained checkpoint.

In [11]:
class SavedConfig:
    """
    Placeholder to load a config from a saved json
    """
    def __init__(self, dic):
        for k, v in dic.items():
            setattr(self, k, v)

def load_model(path, model_name, override_backbone=None):
    """
    Load a previsouly trained model
    """
    # get saved configurations
    with open(os.path.join(path, "config.json"), "r") as f:
        saved_configs = json.load(f)

    saved_configs = SavedConfig(saved_configs)
    
    if override_backbone is not None:
        saved_configs.architecture = {"backbone": override_backbone,
                             "params": {}}
    # create network
    network = CustomLLM(saved_configs, saved_configs.eos_token_id)
    # load trained weights
    state_dict = torch.load(os.path.join(path, f"{model_name}.pt"))

    network.load_state_dict(state_dict)     

    # create a model
    clf = AbstractBaseModel(network=network,
                            mixed_precision=saved_configs.mixed_precision)
    clf.network.eval()
    return saved_configs, clf

In [12]:
MODEL_PATH = "/kaggle/input/simple-model-training/2/" #"your/path/to/pretrained/here"
MODEL_NAME = "fold_0"

if INFERENCE:
    # this loads saved configs and model
    # my model was trained with MAX_LENGTH=1024 on my local machine
    # I need to override the backbone as I did not train on kaggle notebooks
    saved_configs, saved_model = load_model(MODEL_PATH, MODEL_NAME, override_backbone="/kaggle/input/gemma/transformers/1.1-2b-it/1")
    # make sure to use batch size of 1 to limit memory consumption
    saved_configs.batch_size=1
    
    df_test = pd.read_csv(os.path.join(PATH_TO_DATA, "test.csv"))
    # let's generated a dummy 'score' column to match the train.csv
    df_test["score"] = -1
    # generate the test dataset and datalaoder
    ds_test, dl_test = get_dataset_and_loader(df_test, saved_configs, inference=True)
    
    test_preds = saved_model.predict_proba(dl_test).numpy()
    df_sub = pd.DataFrame()
    df_sub["essay_id"] = df_test["essay_id"]
    # convert prediction to integers
    df_sub["score"] = np.rint(test_preds).astype(int)
    # save submission
    df_sub.to_csv("submission.csv", index=None)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

trainable params: 9,805,824 || all params: 2,515,978,240 || trainable%: 0.3897420034920493


Inference:   0%|          | 0/3 [00:00<?, ?it/s]