# 8-bit Adam Optimization 👾

#### The optimizer keeps running statistics of the gradients and uses them to update the model's weights after back propagation. These statistics are typically stored as 32-bit values, but this notebook demonstrates how to use an 8-bit optimizer that saves memory and can slightly increase speed.

#### The problem with reducing the number of bits is that the precision of each value decreases. Tim Dettmers ([@timdettmers](https://www.kaggle.com/timdettmers)) did research at Facebook to figure out how to keep optimization stable with only 8 bits by using a clever quantization trick. For a more detailed look at his research, please [read his paper](https://arxiv.org/abs/2110.02861) or [view his humorous video](https://www.youtube.com/watch?v=IxrlHAJtqKE). The GitHub repo, which contains installation instructions for your specific GPU, can be found here: https://github.com/facebookresearch/bitsandbytes
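
#### To give a rough feel for the idea, here is a minimal sketch of block-wise quantization, the technique named in the paper's title. It is a simplified linear absmax scheme written for illustration, not the library's actual implementation; the function names and the tiny `block_size=4` are assumptions chosen for readability.

```python
def quantize_blockwise(values, block_size=4):
    """Quantize floats to signed 8-bit codes, with one absmax scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = max(abs(v) for v in block) or 1.0  # avoid dividing by zero
        scales.append(scale)
        codes.extend(round(v / scale * 127) for v in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=4):
    """Map 8-bit codes back to floats using the scale of their block."""
    return [c / 127 * scales[i // block_size] for i, c in enumerate(codes)]


# Hypothetical optimizer state: small values plus one large outlier (50.0)
state = [0.01, -0.02, 0.015, 0.005, 50.0, 0.3, -0.1, 0.2]
codes, scales = quantize_blockwise(state)
restored = dequantize_blockwise(codes, scales)
errors = [abs(a - b) for a, b in zip(state, restored)]

print("scales per block:", scales)
print("max error in block 0:", max(errors[:4]))
print("max error in block 1:", max(errors[4:]))
```

#### Because each block has its own scale, the outlier only hurts the precision of the values that share its block; the first block is still reconstructed almost exactly.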

#### It was found that this allows for slightly faster training and for slightly larger models to be loaded into memory without sacrificing performance. 

#### In this notebook, I compare the training times between the regular 32-bit Adam and the 8-bit Adam optimizer when training longformer-large for 1 epoch using a maximum sequence length of 2048. To use 8-bit Adam, you need to install the library and then change the one [line where the optimizer gets created](#Optimizer). In some cases using 8-bit Adam allows for larger batch sizes. Not this time, though 😅

#### If you want to jump straight to the results, [click here](#Weights-and-Biases-Report-✨)

#### If you want to see the notebook during the actual runs, version 1 has 8-bit Adam and version 3 has 32-bit Adam.

#### I think 8-bit Adam is mostly useful for training large language models from scratch, and less useful for finetuning models with < 1B parameters. Perhaps the best use-case on Kaggle is for users who don't have any other compute and for whom a batch size of 1 just barely doesn't fit with 32-bit Adam. In that instance, 8-bit Adam would allow people to use Kaggle GPUs to train models that wouldn't fit otherwise.
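
#### A back-of-the-envelope calculation shows where the savings come from. Adam stores two running averages per parameter, so the sketch below simply multiplies that out. The 1B parameter count is a hypothetical round number, and the small overhead of the per-block scaling constants is ignored.

```python
def adam_state_bytes(num_params, bytes_per_value):
    """Adam keeps two running averages (first and second moment) per parameter."""
    return 2 * num_params * bytes_per_value


num_params = 1_000_000_000  # hypothetical 1B-parameter model
state_32bit = adam_state_bytes(num_params, 4)  # 32-bit = 4 bytes per value
state_8bit = adam_state_bytes(num_params, 1)   # 8-bit = 1 byte per value
saved_gib = (state_32bit - state_8bit) / 2**30

print(f"32-bit state: {state_32bit / 2**30:.2f} GiB")
print(f"8-bit state:  {state_8bit / 2**30:.2f} GiB")
print(f"saved:        {saved_gib:.2f} GiB")
```

#### This only counts optimizer state. Weights, gradients, and activations are unchanged, which is why the benefit shrinks for smaller models and long sequences where activations dominate.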

#### The speed-ups are modest, as seen in the image below from the paper.
![8bit speed table](https://pbs.twimg.com/media/FBLeOZnVEAkt-Ij?format=png&name=small)

#### 8-bit optimization also enables fitting bigger models on smaller GPUs:
![8bit fit big models](https://pbs.twimg.com/media/FBLeMS_VIBEettR?format=png&name=900x900)


#### [Tim's announcement tweet](https://twitter.com/Tim_Dettmers/status/1446472128979562499?s=20) 

# Install necessary libraries 📚

You must install the right version of `bitsandbytes` according to the GPU's CUDA version.
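
As a small convenience, the mapping from a CUDA version string to the package name can be sketched as below. The helper `bnb_package_name` is hypothetical (it is not part of `bitsandbytes`), and the supported suffixes are copied from the choices listed in the install cell. If PyTorch is already installed, `torch.version.cuda` is one way to get a version string such as `"11.0"`, though it reports the version PyTorch was built with, which may differ from what `nvidia-smi` shows.

```python
# Suffixes taken from the list of choices in the install cell
SUPPORTED_SUFFIXES = {"92", "100", "101", "102", "110", "111", "113"}


def bnb_package_name(cuda_version):
    """Map a CUDA version string such as '11.0' to a pip package name."""
    major, minor = cuda_version.split(".")[:2]
    suffix = major + minor
    if suffix not in SUPPORTED_SUFFIXES:
        raise ValueError(f"No package listed for CUDA {cuda_version}")
    return "bitsandbytes-cuda" + suffix


print(bnb_package_name("11.0"))
```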

In [None]:
# Take note of what cuda version you have by running either of the following commands
# !conda list | grep cudatoolkit
!nvidia-smi

# choices: {cuda92, cuda100, cuda101, cuda102, cuda110, cuda111, cuda113}
# replace XXX with the respective number
# pip install bitsandbytes-cudaXXX
!pip install bitsandbytes-cuda110 -q
!pip install -U wandb -q
!pip install seqeval git+https://github.com/huggingface/transformers.git -q

In [None]:
# This tests if the installation was successful
!wget https://gist.githubusercontent.com/TimDettmers/1f5188c6ee6ed69d211b7fe4e381e713/raw/4d17c3d09ccdb57e9ab7eca0171f2ace6e4d2858/check_bnb_install.py && python check_bnb_install.py

In [None]:
import os
import json
import math
from pathlib import Path
from functools import partial

import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
import datasets
from transformers import (
    TrainingArguments, 
    Trainer, 
    AutoConfig,
    AutoModelForTokenClassification,
    AutoTokenizer,
    get_scheduler,
    DataCollatorForTokenClassification,
)
import bitsandbytes as bnb

# Config 

In [None]:
class CFG:
    
    fold = 0
    
    model_name = "allenai/longformer-large-4096"
    
    max_seq_length = 2048
    text_column = "text"
    label_column = "labels"
    word_id_column = "word_ids"

    training_args = TrainingArguments(
        output_dir="lf_2k",
        overwrite_output_dir=True,
        do_train=True,
        do_eval=True,
        evaluation_strategy="epoch",
        per_device_train_batch_size=1,
        per_device_eval_batch_size=16,
        gradient_accumulation_steps=8,
        learning_rate=3e-5,
        weight_decay=0.01,
        num_train_epochs=1,
        max_steps=-1, # set >0 to limit
        lr_scheduler_type="linear",
        warmup_ratio=0.1,
        logging_strategy="steps",
        logging_steps=100,
        save_strategy="epoch",
        save_steps=None,
        seed=18,
        fp16=True, 
        eval_steps=None, # change evaluation_strategy to steps to use this
        dataloader_num_workers=2,
        run_name="longformer-2k-8bit-test",
        group_by_length=True, # This can also help speed training
        report_to="wandb",
        resume_from_checkpoint=None,
    )

# for convenience
args = CFG.training_args

In [None]:
# I use weights and biases to track training.
# The following code requires attaching a secret to the notebook.
if "wandb" in args.report_to:
    import wandb
    from kaggle_secrets import UserSecretsClient
    user_secrets = UserSecretsClient()
    key = user_secrets.get_secret("wandb")
    
    wandb.login(key=key)
    os.environ["WANDB_PROJECT"] = "feedback-prize"

# Create dataset

This is a little slow, so it would probably be wise to create your dataset in another notebook and then load it into the training notebook.
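
The labeling step in the next cell turns each `predictionstring` (a space-separated list of word indices) into BIO tags: the first word of a span gets a `B-` prefix and the rest get `I-`. Here is a minimal standalone sketch of that logic on made-up data; the helper name `bio_labels` and the example spans are hypothetical.

```python
def bio_labels(num_words, spans):
    """Build BIO tags from (discourse_type, predictionstring) pairs."""
    labels = ["O"] * num_words
    for discourse_type, predictionstring in spans:
        word_ids = [int(w) for w in predictionstring.split()]
        for position, word_id in enumerate(word_ids):
            prefix = "B-" if position == 0 else "I-"
            labels[word_id] = prefix + discourse_type
    return labels


# A hypothetical 6-word essay with two annotated spans; word 3 is unannotated
labels = bio_labels(6, [("Lead", "0 1 2"), ("Claim", "4 5")])
print(labels)
# ['B-Lead', 'I-Lead', 'I-Lead', 'O', 'B-Claim', 'I-Claim']
```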

In [None]:
%%time

if not Path("full_dataset.dataset").exists():
    texts, ids = [], []
    for file in tqdm(Path("../input/feedback-prize-2021/train").glob("*.txt"), total=15594, desc="Reading train texts"):
        ids.append(file.stem)

        with open(file) as fp:
            texts.append(fp.read())
        
    
def add_label_information(examples):
    
    texts = examples[CFG.text_column]
    ids = examples["id"]
    all_labels, folds, words = [], [], []
    
    for text, id_ in zip(texts, ids):
    
        df = train_df[train_df["id"]==id_]

        text = text.split()
        num_words = len(text)

        labels = ["O"]*num_words

        for discourse_type, predictionstring in df[["discourse_type", "predictionstring"]].values:

            first = True
            for word_id in map(int, predictionstring.split()):
                prefix = "I-"
                if first:
                    prefix = "B-"
                    first = False
                labels[word_id] = prefix+discourse_type
                
        all_labels.append(labels)
        folds.append(df["kfold"].values[0])
        words.append(text)

    examples[CFG.label_column] = all_labels
    examples["fold"] = folds
    examples[CFG.text_column] = words
    return examples
    

# Using fold strategy shown by Abhishek https://www.kaggle.com/abhishek/creating-folds-properly-hopefully-p/
train_df = pd.read_csv("../input/creating-folds-properly-hopefully-p/train_folds.csv", usecols=["id", "discourse_type", "predictionstring", "kfold"])


# This step can take a few minutes
if not Path("full_dataset.dataset").exists():
    temp_dataset = datasets.Dataset.from_dict({"id": ids, CFG.text_column: texts})
    temp_dataset = temp_dataset.map(add_label_information, batched=True, num_proc=args.dataloader_num_workers)

    full_dataset = datasets.DatasetDict()
    full_dataset["train"] =  temp_dataset.filter(lambda x: x["fold"]!=CFG.fold)
    full_dataset["validation"] =  temp_dataset.filter(lambda x: x["fold"]==CFG.fold)
    full_dataset.save_to_disk("full_dataset.dataset")
else:
    full_dataset = datasets.DatasetDict.load_from_disk("full_dataset.dataset")
full_dataset

# Tokenizing data

Don't pad to max length unless you are on a TPU or you really want to extend your training time.
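
To see why, the sketch below counts how many token positions the model has to process under three padding strategies. The sequence lengths and the batch size of 4 are made up for illustration, and the helper `padded_tokens` is hypothetical. The last strategy mimics what `group_by_length=True` aims for by sorting similar lengths together.

```python
def padded_tokens(lengths, batch_size, max_length=None):
    """Total token positions after padding each batch.

    If max_length is given, every sequence is padded to it;
    otherwise each batch is padded to its own longest sequence.
    """
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        target = max_length if max_length is not None else max(batch)
        total += target * len(batch)
    return total


lengths = [312, 845, 1290, 2048, 530, 97, 1711, 640]  # hypothetical token counts

fixed = padded_tokens(lengths, batch_size=4, max_length=2048)
dynamic = padded_tokens(lengths, batch_size=4)
grouped = padded_tokens(sorted(lengths), batch_size=4)

print("pad to max length:       ", fixed)
print("pad to longest in batch: ", dynamic)
print("sorted by length first:  ", grouped)
```

With the training batch size of 1 used in this notebook, dynamic padding adds almost nothing beyond the rounding done by the collator, so the effect is mostly visible during evaluation where the batch size is 16.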

In [None]:
# https://github.com/huggingface/transformers/blob/669e3c50c98ad5b506555a551d2ecbf72ceb3c99/examples/pytorch/token-classification/run_ner.py#L371
def tokenize_and_align_labels(examples, label2id, return_word_ids=False):
    tokenized_inputs = tokenizer(
        examples[CFG.text_column],
        truncation=True,
        max_length=CFG.max_seq_length,
        # We use this argument because the texts in our dataset are lists of words (with a label for each word).
        is_split_into_words=True,
    )
    labels = []
    all_word_ids = []
    for i, label in enumerate(examples[CFG.label_column]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            else:
                label_ids.append(label2id[label[word_idx]])
            previous_word_idx = word_idx
            
        if return_word_ids:
            all_word_ids.append(word_ids)

        labels.append(label_ids)
    
    tokenized_inputs[CFG.label_column] = labels
    
    if return_word_ids:
        tokenized_inputs[CFG.word_id_column] = all_word_ids
    
    return tokenized_inputs

In [None]:
%%time

label_list = ['O', 'B-Claim', 'I-Claim', 'B-Concluding Statement', 'I-Concluding Statement', 
              'B-Counterclaim', 'I-Counterclaim', 'B-Evidence', 'I-Evidence','B-Lead', 'I-Lead', 
              'B-Position', 'I-Position', 'B-Rebuttal', 'I-Rebuttal']

label2id = {label:id_ for id_, label in enumerate(label_list)}
id2label = {id_:label for id_, label in enumerate(label_list)}

tokenizer = AutoTokenizer.from_pretrained(CFG.model_name, add_prefix_space=True)

train_dataset = full_dataset["train"].map(
    partial(
        tokenize_and_align_labels,
        label2id=label2id,
        return_word_ids=False,
    ),
    batched=True,
    num_proc=args.dataloader_num_workers,
    remove_columns=["fold", "text", "id"],
)


validation_dataset = full_dataset["validation"].map(
    partial(
        tokenize_and_align_labels,
        label2id=label2id,
        return_word_ids=True,
    ),
    batched=True,
    num_proc=args.dataloader_num_workers,
    remove_columns=["fold"],
)

# bonus points if you can explain why it says
# Ignored unknown kwarg option direction

In [None]:
model_config = AutoConfig.from_pretrained(
    CFG.model_name,
    num_labels=len(label_list),
    label2id=label2id,
    id2label=id2label,
    finetuning_task="ner",
)

model = AutoModelForTokenClassification.from_pretrained(CFG.model_name, config=model_config)

# Optimizer

Here is the key cell where the 8-bit Adam optimizer gets set. It's pretty much trivially easy...
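
Besides the optimizer line itself, the cell below splits the parameters into two groups so that weight decay is not applied to biases and LayerNorm weights. The substring matching it relies on can be sketched with plain strings; the parameter names here are hypothetical examples in the style of a Hugging Face encoder, not taken from the actual model.

```python
no_decay = ["bias", "LayerNorm.weight"]

# Hypothetical parameter names for illustration only
param_names = [
    "encoder.layer.0.attention.self.query.weight",
    "encoder.layer.0.attention.self.query.bias",
    "encoder.layer.0.output.LayerNorm.weight",
    "encoder.layer.0.output.LayerNorm.bias",
    "classifier.weight",
]

# A parameter is exempt from weight decay if any pattern appears in its name
with_decay = [n for n in param_names if not any(nd in n for nd in no_decay)]
without_decay = [n for n in param_names if any(nd in n for nd in no_decay)]

print("weight decay applied:", with_decay)
print("weight decay skipped:", without_decay)
```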

In [None]:
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
        "weight_decay": args.weight_decay,
    },
    {
        "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]
# This one line is all that is needed to run 8-bit Adam
optimizer = bnb.optim.Adam8bit(optimizer_grouped_parameters, lr=args.learning_rate)


num_update_steps_per_epoch = len(train_dataset) // args.per_device_train_batch_size // args.gradient_accumulation_steps
if args.max_steps == -1 or args.max_steps is None:
    args.max_steps = args.num_train_epochs * num_update_steps_per_epoch
else:
    args.num_train_epochs = math.ceil(args.max_steps / num_update_steps_per_epoch)
    
if args.warmup_ratio is not None:
    args.num_warmup_steps = int(args.warmup_ratio * args.max_steps)

lr_scheduler = get_scheduler(
    name=args.lr_scheduler_type,
    optimizer=optimizer,
    num_warmup_steps=args.num_warmup_steps,
    num_training_steps=args.max_steps,
)

# Data Collator and Metrics 📏 

This collator is really handy because I can tell it to pad to a multiple of a number. Longformer likes to have inputs in multiples of 512, so it will handle the padding for me!
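
The rounding itself is simple ceiling arithmetic. A minimal sketch of it, using a hypothetical helper `padded_length` rather than the collator's own code:

```python
def padded_length(num_tokens, multiple=512):
    """Round num_tokens up to the nearest multiple (ceiling division)."""
    return -(-num_tokens // multiple) * multiple


for n in (1, 512, 700, 2048):
    print(n, "->", padded_length(n))
```

So a 700-token essay is padded to 1024 tokens instead of the full 2048.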

In [None]:
# Data collator
pad_to_multiple_of = 512 # this is for longformer, use 1024 for bigbird
    
data_collator = DataCollatorForTokenClassification(tokenizer, pad_to_multiple_of=pad_to_multiple_of)

# Metrics
metric = datasets.load_metric("seqeval")

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

# CV calculation functions
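
The scoring code in the next cell treats a prediction as a potential true positive only when it covers at least half of the ground truth span and the ground truth covers at least half of the prediction. Below is a simplified sketch of that rule; the helper `is_match` is hypothetical and assumes both strings contain at least one word index.

```python
def is_match(pred, ground_truth, threshold=0.5):
    """Check the two-way overlap rule on space-separated word indices."""
    pred_ids = set(pred.split())
    gt_ids = set(ground_truth.split())
    inter = len(pred_ids & gt_ids)
    gt_covered = inter / len(gt_ids)      # share of the ground truth that was predicted
    pred_covered = inter / len(pred_ids)  # share of the prediction that is correct
    return gt_covered >= threshold and pred_covered >= threshold


# Two of four words overlap in each direction: exactly at the threshold
print(is_match("1 2 3 4", "3 4 5 6"))
# The prediction covers the whole ground truth but is mostly wrong
print(is_match("1 2 3 4 5 6 7 8 9 10", "1 2"))
```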

In [None]:
# Rob Mulla @robikscube
# https://www.kaggle.com/robikscube/student-writing-competition-twitch
def calc_overlap(pred, ground_truth):
    """
    Calculates the overlap between prediction and
    ground truth and overlap percentages used for determining
    true positives.
    """
    set_pred = set(pred.split(' '))
    set_gt = set(ground_truth.split(' '))
    # Length of each and intersection
    len_gt = len(set_gt)
    len_pred = len(set_pred)
    inter = len(set_gt.intersection(set_pred))
    overlap_1 = inter / len_gt
    overlap_2 = inter/ len_pred
    return [overlap_1, overlap_2]


def score_feedback_comp(pred_df, gt_df):
    """
    A function that scores for the kaggle
        Student Writing Competition
        
    Uses the steps in the evaluation page here:
        https://www.kaggle.com/c/feedback-prize-2021/overview/evaluation
    """
    gt_df = gt_df[['id','discourse_type','predictionstring']] \
        .reset_index(drop=True).copy()
    pred_df = pred_df[['id','discourse_type','predictionstring']] \
        .reset_index(drop=True).copy()
    pred_df['pred_id'] = pred_df.index
    gt_df['gt_id'] = gt_df.index
    # Step 1. all ground truths and predictions for a given discourse_type are compared.
    joined = pred_df.merge(gt_df,
                           left_on=['id','discourse_type'],
                           right_on=['id','discourse_type'],
                           how='outer',
                           suffixes=('_pred','_gt')
                          )
    joined['predictionstring_gt'] = joined['predictionstring_gt'].fillna(' ')
    joined['predictionstring_pred'] = joined['predictionstring_pred'].fillna(' ')

    joined['overlaps'] = [calc_overlap(pred, gt) for pred, gt in joined[['predictionstring_pred', 'predictionstring_gt']].values]

    # 2. If the overlap between the ground truth and prediction is >= 0.5, 
    # and the overlap between the prediction and the ground truth >= 0.5,
    # the prediction is a match and considered a true positive.
    # If multiple matches exist, the match with the highest pair of overlaps is taken.
    # Each entry of 'overlaps' is already a two-element list, so index it directly
    joined['overlap1'] = joined['overlaps'].apply(lambda x: x[0])
    joined['overlap2'] = joined['overlaps'].apply(lambda x: x[1])


    joined['potential_TP'] = (joined['overlap1'] >= 0.5) & (joined['overlap2'] >= 0.5)
    joined['max_overlap'] = joined[['overlap1','overlap2']].max(axis=1)
    tp_pred_ids = joined.query('potential_TP') \
        .sort_values('max_overlap', ascending=False) \
        .groupby(['id','predictionstring_gt']).first()['pred_id'].values

    # 3. Any unmatched ground truths are false negatives
    # and any unmatched predictions are false positives.
    fp_pred_ids = [p for p in joined['pred_id'].unique() if p not in tp_pred_ids]

    matched_gt_ids = joined.query('potential_TP')['gt_id'].unique()
    unmatched_gt_ids = [c for c in joined['gt_id'].unique() if c not in matched_gt_ids]

    # Get numbers of each type
    TP = len(tp_pred_ids)
    FP = len(fp_pred_ids)
    FN = len(unmatched_gt_ids)
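    # calc micro F1
    denominator = TP + 0.5 * (FP + FN)
    if denominator == 0:
        # Return the same dict shape as the normal path so callers can iterate over it
        return {"F1": 0.0, "Precision": 0.0, "Recall": 0.0}
    my_f1_score = TP / denominator
    return {
        "F1": round(my_f1_score, 4),
        "Precision": TP / (TP + FP) if (TP + FP) > 0 else 0.0,
        "Recall": TP / (TP + FN) if (TP + FN) > 0 else 0.0,  # This was calculated incorrectly in the runs
    }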
        

id2label={i: l for l, i in label2id.items()}
# https://www.kaggle.com/zzy990106/pytorch-ner-infer?scriptVersionId=82677278&cellId=13
def get_label_predictions(dataset, preds):

    ids = dataset["id"]
    word_ids = dataset[CFG.word_id_column]
    words = dataset[CFG.text_column]
    
    all_preds = []

    # Use a separate loop variable so the outer `words` list is not shadowed
    for id_, sample_preds, sample_word_ids, sample_words in zip(ids, preds, word_ids, words):
        label_preds = [""]*len(sample_words)

        for pred, w_id in zip(sample_preds, sample_word_ids):
            if w_id is None:
                continue
            if label_preds[w_id] == "":
                label_preds[w_id] = id2label[pred]

        j = 0
        while j < len(label_preds):
            label = label_preds[j]

            if label.startswith("B"):
                label = label.replace("B", "I")
                end = j + 1
                while end < len(label_preds) and label_preds[end] == label:
                    end += 1

                if end - j > 7:
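                    # label is "I-<type>" here; slice off the 2-character prefix
                    # (lstrip("BI-") strips a set of characters, not a prefix)
                    all_preds.append((id_, label[2:], ' '.join(map(str, list(range(j, end))))))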

                j = end
            else:
                j += 1
                
    return all_preds

# Create custom trainer

In [None]:
class FeedbackPrizeTrainer(Trainer):
    
    def __init__(self, *args, **kwargs):
        # The Trainer will remove the important columns needed for cv from the eval_dataset,
        # so we'll just store it like this
        if "cv_dataset" in kwargs:
            self.cv_dataset = kwargs.pop("cv_dataset")
        super().__init__(*args, **kwargs)
        
        
    def evaluation_loop(
        self, 
        dataloader,
        description,
        prediction_loss_only = None,
        ignore_keys = None,
        metric_key_prefix = "eval",
    ):
        
        eval_output =  super().evaluation_loop(
            dataloader,
            description,
            prediction_loss_only,
            ignore_keys,
            metric_key_prefix
        )
        
        # Custom CV F1 calculation
        # This same loop gets called during predict, and we can't do CV when predicting
        if metric_key_prefix == "eval":
            
            eval_id_preds = eval_output.predictions.argmax(-1)
            eval_label_preds = get_label_predictions(self.cv_dataset, eval_id_preds)
            
            eval_pred_df = pd.DataFrame(eval_label_preds, columns=["id", "discourse_type", "predictionstring"])
            
            eval_gt_df = train_df[train_df["id"].isin(self.cv_dataset["id"])].reset_index(drop=True).copy()
            
            classes = ['Lead', 'Position', 'Evidence', 'Claim', 'Concluding Statement', 'Counterclaim', 'Rebuttal']
            f1_scores = []
            for class_ in classes:
                gt_df = eval_gt_df.loc[eval_gt_df['discourse_type'] == class_].copy()
                pred_df = eval_pred_df.loc[eval_pred_df['discourse_type'] == class_].copy()
                eval_scores = score_feedback_comp(pred_df, gt_df)
                for score_name, score in eval_scores.items():
                    eval_output.metrics[f"{metric_key_prefix}_{class_}_CV_{score_name}"] = score
                f1_scores.append(eval_scores["F1"])
                
            eval_output.metrics[f"{metric_key_prefix}_Overall_CV_F1"] = np.mean(f1_scores)
        
        return eval_output

# Initialize our Trainer
trainer = FeedbackPrizeTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    cv_dataset=validation_dataset, 
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    optimizers=(optimizer, lr_scheduler)
)

# Train! 🚆

In [None]:
%env TOKENIZERS_PARALLELISM=true


train_result = trainer.train()
metrics = train_result.metrics
trainer.save_model()  # Saves the tokenizer too

metrics["train_samples"] = len(train_dataset)

trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()

# Weights and Biases Report ✨

#### You can see the lower memory usage and marginally faster training time. The loss curves are nearly identical and CV F1 scores are pretty much the same as well. 

<iframe src="https://wandb.ai/nbroad/feedback-prize/reports/8-bit-Adam-vs-32-bit-Adam--VmlldzoxNDQ5Nzg3" style="border:none;height:1024px;width:100%"></iframe>