#### Hugging Face Trainer - A classification workflow
by Marsh [ @vbookshelf ]<br>
9 April 2022

## Overview

This task can be framed as a classification problem. We are given an anchor, target and a context. Based on the context we need to predict if the anchor and target are:
- 1.0 - A very close match
- 0.75 - Close synonyms
- 0.5 - Synonyms which don’t have the same meaning
- 0.25 - Somewhat related
- 0.0 - Unrelated

The scores (labels) in the training data are made up of 5 increments [0.0, 0.25, 0.5, 0.75, 1.0]. These increments can be mapped to five classes: [0, 1, 2, 3, 4]. In this notebook we will train a Bert for Patents classifier to predict these classes. After inference the predicted classes will be converted back to float increments.
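
Here is a minimal sketch of that mapping in both directions. The helper names `score_to_class` and `class_to_score` are made up for illustration; the notebook itself uses explicit if-statements further down.

```python
# Hypothetical helpers (not part of the notebook) that show the
# score <-> class mapping described above.
SCORES = [0.0, 0.25, 0.5, 0.75, 1.0]

def score_to_class(score: float) -> int:
    """Map a score increment to a class index, e.g. 0.25 -> 1."""
    return int(round(score * 4))

def class_to_score(label: int) -> float:
    """Map a class index back to its score increment, e.g. 3 -> 0.75."""
    return label / 4

classes = [score_to_class(s) for s in SCORES]
restored = [class_to_score(c) for c in classes]

print(classes)
print(restored)
```

Because the increments are evenly spaced, the mapping is just a multiplication or division by 4.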

This solution uses the Hugging Face trainer. One of the downsides of using this trainer on Kaggle is that it uses a lot of disk space. This causes notebooks to crash during training. I've included some notes that explain how to set TrainingArguments to reduce the amount of disk space that gets used.

This notebook includes both training and inference. We will train five folds for one epoch each. Then we will take a simple average of the fold predictions.
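
To make the averaging step concrete, here is a small sketch that uses made-up logits from three hypothetical fold models (the notebook uses five). The array values are invented purely for illustration.

```python
import numpy as np

# Made-up logits for 2 test rows and 5 classes,
# one array per hypothetical fold model.
fold_logits = [
    np.array([[2.0, 0.5, 0.1, 0.0, -1.0], [0.0, 0.2, 0.3, 1.5, 1.0]]),
    np.array([[1.5, 1.0, 0.0, 0.0, -0.5], [0.1, 0.0, 0.2, 1.0, 1.4]]),
    np.array([[1.0, 1.2, 0.2, 0.1, -1.0], [0.0, 0.1, 0.1, 1.2, 0.9]]),
]

# Simple average over the fold axis -> shape (num_rows, num_classes)
avg_logits = np.mean(np.stack(fold_logits), axis=0)

# Predicted class per row, then back to the float increments
pred_classes = np.argmax(avg_logits, axis=1)
pred_scores = pred_classes / 4

print(avg_logits.shape)
print(pred_classes)
print(pred_scores)
```

Note that the second fold on its own would pick class 4 for the second row, but the average over all folds picks class 3.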

The top scoring public notebooks on this competition use a regression approach. I've also included a quick side note that explains how to set up the Hugging Face trainer for regression. The regression setup uses MSE loss.

At the end of this notebook there are links to helpful resources that explain concepts like fp16, gradient accumulation and weight decay.

Let's get started.

In [None]:
import pandas as pd
import numpy as np
import os

import gc

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader


#from tqdm import tqdm
# tqdm doesn't work well in colab.
# This is the solution:
# https://stackoverflow.com/questions/41707229/tqdm-printing-to-newline
import tqdm.notebook as tq
#for i in tq.tqdm(...):


import string

from sklearn import model_selection
from sklearn.utils import shuffle
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold

import transformers
from transformers import AdamW
from transformers import get_linear_schedule_with_warmup

# Don't Show Warning Messages
import warnings
warnings.filterwarnings('ignore')

os.environ["WANDB_DISABLED"] = "true"


print(torch.__version__)
#print(torchvision.__version__)

In [None]:
# Set the seed values

import random

seed = 1024

random.seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True

In [None]:
os.listdir('../input/')

In [None]:
base_path = '../input/us-patent-phrase-to-phrase-matching/'


## Config

In [None]:
# The model is stored in a Kaggle dataset.
# The internet connection in this notebook is off.
MODEL_PATH = '../input/bert-for-patents/bert-for-patents/'

# Set the max token length.
# Determine this by looking at max token lengths 
# in the train set. Process is shown below.
MAX_LEN = 64

NUM_EPOCHS = 1

NUM_FOLDS = 5

# Specify which folds should be used in training.
# This is helpful when you have to train the folds in 
# separate notebooks.
START_FOLD = 0
STOP_FOLD = 5 # this number is not included

NUM_CLASSES = 5 # [0, 1, 2, 3, 4]

L_RATE = 2e-5

# 1. Setting fp16=True (TrainingArguments) allows us to use larger batch sizes. This speeds up training.
# 2. Also, because the gradient accumulation parameter is
# set to 2 (TrainingArguments), the equivalent batch size is actually 40 i.e. 2*20 = 40.
# Gradient accumulation is helpful when you have to use very small batch sizes.
BATCH_SIZE = 20

NUM_CORES = os.cpu_count()

# When training with multiple GPUs, if the number
# of workers (CPU cores) is set too high that can slow down training.
# Not applicable on Kaggle because there's only one GPU.
if torch.cuda.device_count() > 1:
    NUM_CORES = 4

NUM_CORES
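
The batch size arithmetic from the config comments can be sketched like this. The number of training rows is a made-up value for illustration, and the single GPU is an assumption based on the Kaggle setup mentioned above.

```python
import math

per_device_batch_size = 20  # BATCH_SIZE in the config
accumulation_steps = 2      # gradient_accumulation_steps in TrainingArguments
num_gpus = 1                # assumption: a single Kaggle GPU

# Gradients from 2 batches of 20 are accumulated before each optimizer step.
effective_batch_size = per_device_batch_size * accumulation_steps * num_gpus

# Hypothetical number of training rows, for illustration only.
num_train_rows = 1000
batches_per_epoch = math.ceil(num_train_rows / (per_device_batch_size * num_gpus))

# Rough number of optimizer updates per epoch.
updates_per_epoch = batches_per_epoch // accumulation_steps

print(effective_batch_size, batches_per_epoch, updates_per_epoch)
```

So the model sees batches of 20 rows in memory, but the weights are only updated once for every 40 rows.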

## Check the device

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

print(device)

if torch.cuda.is_available():
    print('Num GPUs:', torch.cuda.device_count())
    print('GPU Type:', torch.cuda.get_device_name(0))

## Load the data

In [None]:
# Train data

path = base_path + 'train.csv'
df_data = pd.read_csv(path)

print(df_data.shape)

df_data.head()

In [None]:
# Check the distribution of the train labels

df_data['score'].value_counts()

In [None]:
# Test data

path = base_path + 'test.csv'
df_test = pd.read_csv(path)

print(df_test.shape)

df_test.head()

## Add the context meanings

Here we will add the context meanings to the train and test data. We will create a new column called 'title'.

In [None]:
# Ref: https://en.wikipedia.org/wiki/Cooperative_Patent_Classification

# The letters (keys) in this dictionary are the
# first letters of the context. Refer to the column called 'context'.

context_mapping_dict = {
                        "A": "Human Necessities",
                        "B": "Operations and Transport",
                        "C": "Chemistry and Metallurgy",
                        "D": "Textiles",
                        "E": "Fixed Constructions",
                        "F": "Mechanical Engineering",
                        "G": "Physics",
                        "H": "Electricity",
                        "Y": "Emerging Cross-Sectional Technologies",
                        }

In [None]:
def map_context(x):
    
    # get the first letter
    letter = x[0]
    
    # extract the meaning from the dictionary
    meaning = context_mapping_dict[letter]
    
    return meaning


# Create a new column
df_data['title'] = df_data['context'].apply(map_context)

df_data.head()

In [None]:
# Test data.
# Create a new column.
df_test['title'] = df_test['context'].apply(map_context)

df_test.head()

## Create the label column

In [None]:
def create_label(x):
    
    if x == 0:
        return 0

    if x == 0.25:
        return 1
    
    if x == 0.5:
        return 2

    if x == 0.75:
        return 3

    if x == 1.0:
        return 4

# Note: This column must be called 'labels'. The Hugging Face trainer
# automatically detects the column that contains the training labels.
df_data['labels'] = df_data['score'].apply(create_label)

# Create a dummy label column so that the dataloader works on the test set.
df_test['labels'] = 0

print(df_data.shape)

df_data.head()

In [None]:
df_data['labels'].value_counts()

## Combine the anchor and target

In [None]:
df_data['combined_sentence'] = df_data['anchor'] + ' vs ' + df_data['target']

df_data.head()

In [None]:
df_test['combined_sentence'] = df_test['anchor'] + ' vs ' + df_test['target']

df_test.head()

In [None]:
df_data.head()

## Check the token lengths

Here we want to see what the max token length is in the train set. This will help us to set the MAX_LEN parameter. Setting a shorter MAX_LEN will use less RAM and help the model train faster.

In [None]:
# Instantiate the tokenizer

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

In [None]:
# Example 

# The parameters for tokenizer.encode can be found here:
# https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizerBase.encode

text = "Hello. How are you?"

# collapse any extra whitespace into single spaces
text = " ".join(text.split())

encoded_text = tokenizer.encode(text)

print(text)
print(encoded_text)
print(len(encoded_text))

In [None]:
def get_num_tokens(x):
    
    # convert to type string
    x = str(x)
    # collapse any extra whitespace into single spaces
    x = " ".join(x.split())
    
    # get a list of tokens
    token_list = tokenizer.encode(x)
    
    # get the number of tokens
    num_tokens = len(token_list)
    
    return num_tokens

# Create new columns containing the token lengths
df_data['num_tokens_combined_text'] = df_data['combined_sentence'].apply(get_num_tokens)
df_data['num_tokens_title'] = df_data['title'].apply(get_num_tokens)

df_data.head()

In [None]:
# Get the max token lengths
print(df_data['num_tokens_combined_text'].max())
print(df_data['num_tokens_title'].max())

In [None]:
# Based on these lengths I've set MAX_LEN = 64
# This is set in the CONFIG above.
# When choosing the MAX_LEN we need to consider the possibility that 
# the private test set could have text with a token length greater than 64.
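
The two token counts above were computed separately, so each one already includes the special tokens. Here is a rough sketch of how they combine when both texts are tokenized as a pair. It assumes a BERT-style tokenizer that produces [CLS] text_a [SEP] for a single text and [CLS] text_a [SEP] text_b [SEP] for a pair. The function names and the example counts are made up.

```python
def pair_length(num_tokens_a: int, num_tokens_b: int) -> int:
    """
    Estimate the token length of a sentence pair from two single-text counts.
    Each single count includes 2 special tokens ([CLS] and [SEP]).
    A pair uses 3 special tokens: [CLS] a [SEP] b [SEP].
    """
    return (num_tokens_a - 2) + (num_tokens_b - 2) + 3

def fits(num_tokens_a: int, num_tokens_b: int, max_len: int) -> bool:
    """Return True if the pair fits within max_len without truncation."""
    return pair_length(num_tokens_a, num_tokens_b) <= max_len

# Made-up counts for illustration
print(pair_length(30, 6))
print(fits(30, 6, max_len=64))
print(fits(50, 20, max_len=64))
```

In other words, the pair length is the sum of the two separate counts minus one. Any headroom between that number and MAX_LEN is what protects against longer texts in the private test set.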

## Create the 5 folds

In [None]:
# Filter out only the columns we need.

cols = ['title', 'labels', 'combined_sentence', 'anchor']

df_data = df_data[cols]

In [None]:
# Use the anchor column for stratification.
# Ref: https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/discussion/315220

skf = StratifiedKFold(n_splits=NUM_FOLDS, shuffle=True, random_state=42)

for fold, (t_, val_) in enumerate(skf.split(X=df_data, y=df_data['anchor'])):
      df_data.loc[val_ , "fold"] = fold
        
df_data['fold'].value_counts()

In [None]:
df_data.head()

## Explore how the Hugging Face Dataset works

We will need to convert the Pandas dataframe to a Hugging Face dataset before the data can be fed into the Trainer.

In [None]:
# Example

# The column containing the labels you want to predict should be named: labels

# This is the dataset docs:
# https://huggingface.co/docs/datasets/v1.2.0/exploring.html

from datasets import Dataset
import pandas as pd

# Create a sample dataframe
df = pd.DataFrame({"col-a": [1, 2, 3],
                  "col-b": [4, 5, 6]})

# Convert the dataframe to a HuggingFace dataset.
# Imagine that it's the same as a pandas dataframe with labeled columns and rows.
dataset = Dataset.from_pandas(df)

len(dataset)

In [None]:
# Check the items in the first row

dataset[0]

In [None]:
print(dataset.shape)
print(dataset.num_columns)
print(dataset.num_rows)

## Set up the tokenize function and the metric function

In [None]:
def tokenize_data_fn(hf_dataset):
    
    """
    This function will tokenize all text in a specified column.
    We use it in the same way that we use 'apply' in Pandas.
    
    """
    
    tokenized_examples = tokenizer(
                            hf_dataset['combined_sentence'], # sentence1
                            hf_dataset['title'], # sentence2 - context
                            truncation="only_second", # only truncate sentence2
                            max_length=MAX_LEN,
                            padding="max_length",
                            )
    
    return tokenized_examples




def compute_metrics(eval_pred):
    
    """    
    This function is used to calculate the metric during training.
    We will save the best model based on this metric.
    
    """
    
    # Declare as global so we can calculate the cv score for all folds and 
    # then print it when training is complete.
    global corr
    
    from scipy.stats import pearsonr
    
    logits, labels = eval_pred
    
    # logits shape: (num_rows, num_cols)
    # labels shape: (num_rows,)
    
    # take the argmax
    preds = np.argmax(logits, axis=1)
    
    # Calculate the correlation.
    # preds and labels should have the same length.
    # corr is a scalar.
    corr, _ = pearsonr(preds, labels)
    
    print(f'Pearson: {corr}')
    
    return {
            'pearson': corr
            }
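
The metric above is computed on class indices rather than on the original float scores. This small sketch shows why that doesn't matter: Pearson correlation is unchanged when both inputs are rescaled by a positive constant. The `pearson` helper and the example arrays are made up for illustration; the notebook itself uses scipy's pearsonr.

```python
import numpy as np

def pearson(x, y) -> float:
    """Pearson correlation written out from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Made-up predictions and labels as class indices
pred_classes = np.array([0, 1, 2, 4, 3, 1])
true_classes = np.array([0, 2, 2, 4, 4, 1])

# Correlation on class indices vs. on the float increments
corr_classes = pearson(pred_classes, true_classes)
corr_scores = pearson(pred_classes / 4, true_classes / 4)

print(corr_classes, corr_scores)
```

One thing to keep in mind: if every prediction is the same class, the denominator is zero and the correlation is undefined.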
    


## How to reduce the disk space that gets used during training

By default the Trainer saves logs and every checkpoint during training. On Kaggle this quickly uses up the available disk space. 

The files are saved in a folder called "runs" and in another folder that we name. In this notebook that folder is named "comp_folder". We set the name as a training argument.

The info below is based on this thread:<br>
https://discuss.huggingface.co/t/save-only-best-model-in-trainer/8442

This is a full list of training arguments:<br>
(Click the link then scroll down until you get to TrainingArguments)<br>
https://huggingface.co/transformers/main_classes/trainer.html


1. To make the trainer overwrite old log files we can set: overwrite_output_dir=True. 

2. Another way to reduce the amount of disk space is to set save_strategy="no" and load_best_model_at_end=False. In this case nothing will be saved during training. When training finishes you will need to save the model by using trainer.save_model("model_name"). This will save the last model, not the best model. 

3. Another option is to set save_total_limit=2. This is the option that I'm using in the training loop below. In this case only two models will be saved at any given time - the most recent model and the best model (based on the metric that's being monitored). Note that even if save_total_limit=1 the Trainer will still save two models, the best one and the last one. 

4. Also, you will note that I delete the "runs" folder and the "comp_folder" at the end of the training loop for each fold. This also reduces disk space and ensures that these folders don't appear in the output data when the notebook is committed.
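
The clean-up from point 4 can be wrapped in a small helper. This is only a sketch. The function name `remove_dirs` is made up, and the demo runs inside a temporary directory so that nothing real gets deleted.

```python
import os
import shutil
import tempfile

def remove_dirs(dir_names, base_dir="."):
    """Delete each named folder under base_dir if it exists.
    Returns the names of the folders that were removed."""
    removed = []
    for name in dir_names:
        path = os.path.join(base_dir, name)
        if os.path.isdir(path):
            shutil.rmtree(path)
            removed.append(name)
    return removed

# Demo inside a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "runs"))
    os.makedirs(os.path.join(tmp, "comp_folder", "checkpoint-1"))

    removed = remove_dirs(["runs", "comp_folder", "missing_folder"], base_dir=tmp)
    remaining = os.listdir(tmp)

print(removed)
print(remaining)
```

In the training loop this would be called once at the end of each fold, after the fold model has been saved.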

## Train

In [None]:
# This is where the scores for each fold
# will be stored.
score_list = []

for i in range(START_FOLD, STOP_FOLD):

    # Choose the fold
    df_train = df_data[df_data['fold'] != i]
    df_val = df_data[df_data['fold'] == i]
    
    ####################################################
    # FOR TESTING ONLY
    # Comment out these two lines during training.
    #df_train = df_train[0:1000]
    #df_val = df_val[0:1000]
    ####################################################

    # Reset the indices
    df_train = df_train.reset_index(drop=True)
    df_val = df_val.reset_index(drop=True)
    #df_test = df_test.reset_index(drop=True)

    # Register the data
    train_dataset = Dataset.from_pandas(df_train)
    val_dataset = Dataset.from_pandas(df_val)
    #test_dataset = Dataset.from_pandas(df_test)

    # Tokenize the data
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    pad_on_right = tokenizer.padding_side == "right"

    # Don't remove the 'labels' column.
    # The trainer automatically detects this column and uses the labels to
    # calculate the loss during training. If the labels column can't be detected then
    # you will get a Keyerror: loss.
    cols = ['title', 'combined_sentence', 'anchor', 'fold']
    
    # Here "map" is similar to "apply" in Pandas

    tokenized_train = train_dataset.map(tokenize_data_fn, batched=True, 
                                    remove_columns=cols
                                       )

    tokenized_val = val_dataset.map(tokenize_data_fn, batched=True, 
                                        remove_columns=cols
                                   )



    # Initialize the model

    from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

    model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH, 
                                                               num_labels=NUM_CLASSES)


    # Initialize the data collator.
    # It collates the per-example features (already padded to MAX_LEN) into batched tensors.

    from transformers import default_data_collator

    data_collator = default_data_collator


    # Initialize the trainer
    # Full list of training arguments:
    # (Click the link then scroll down until you get to TrainingArguments)
    # https://huggingface.co/transformers/main_classes/trainer.html

    args = TrainingArguments(
        f"comp_folder",
        overwrite_output_dir=True, # This reduces the amt of disk space that gets used.
        fp16=True,  # fp16 training to allow larger batch sizes to be used
        evaluation_strategy = "epoch",
        save_strategy = "epoch",
        learning_rate=L_RATE,
        warmup_ratio=0.1,
        gradient_accumulation_steps=2, #8
        per_device_train_batch_size=BATCH_SIZE,
        per_device_eval_batch_size=BATCH_SIZE,
        num_train_epochs=NUM_EPOCHS,
        weight_decay=0.01,
        dataloader_num_workers=NUM_CORES,
        
        # Two options to address the exceeding disk space problem:
        # Ref: https://discuss.huggingface.co/t/save-only-best-model-in-trainer/8442/7
        
        # Option 1:
        # The disk still fills up but not as fast.
        # Two models are always saved - best model and last model.
        # Changing the 2 to 1 doesn't make a difference because 2 models are always saved.
        # The saved fold models as well as the two models that are saved during training
        # also use up the disk space.
        # You may need to train 3 folds in one notebook and 2 folds in another.
        save_total_limit = 2, 
        load_best_model_at_end=True, # load the best model and then save it manually.
        
        # Option 2
        # Don't save anything during training. 
        # Manually save the model later.
        # Train for a specified number of epochs.
        # Can't use a metric to get the best model.
        # Remember that the saved fold models also use up the disk space.
        # May need to train 3 folds in one notebook and 2 folds in another.
        #save_strategy = "no", 
        
        metric_for_best_model="pearson" # choose the best model based on this metric
        )
    
    

    trainer = Trainer(
            model,
            args,
            train_dataset=tokenized_train,
            eval_dataset=tokenized_val,
            tokenizer=tokenizer,
            data_collator=data_collator,
            compute_metrics=compute_metrics
        )
    
    
    #-------------------------------------------------
    
    # Test that the eval metric is being calculated.
    # Comment this step out when test is complete.
    
    #eval_results_dict = trainer.evaluate()
    #print('\n Check that eval is working:')
    #print(eval_results_dict)
    print('\n')
    
    #-------------------------------------------------

    
    # Train the model
    print(f"fold_{i}")
    print('\nTraining...')
    trainer.train()
    
    # Save the score for the metric we are monitoring.
    # corr is the global that compute_metrics set during the most recent evaluation.
    score_list.append(corr)

    # Save the model.
    # The best model gets loaded when training is completed.
    trainer.save_model(f"model_{i}")
    
    
    
    # Delete folders to save disk space
    
    import shutil

    if os.path.isdir('runs') == True:
        shutil.rmtree('runs')

    if os.path.isdir('comp_folder') == True:
        shutil.rmtree('comp_folder')
        

# Print the CV score

print('\n')
print('===================')
cv = sum(score_list)/len(score_list)
print(f'cv: {cv}')
print('===================')

In [None]:
!ls

## Inference

In [None]:
# Create a list of fold model paths

model_0 = 'model_0'
model_1 = 'model_1'
model_2 = 'model_2'
model_3 = 'model_3'
model_4 = 'model_4'

MODEL_LIST = [model_0, model_1, model_2, model_3, model_4]

In [None]:
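# Make predictions using all fold models

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from transformers import default_data_collator

raw_predictions_list = []

# Create the test dataset (it is the same for every fold model)
test_dataset = Dataset.from_pandas(df_test)

test_features = test_dataset.map(
                tokenize_data_fn,
                batched=True,
                remove_columns=test_dataset.column_names
                )

# Arguments used for inference only
infer_args = TrainingArguments(
    "comp_folder",
    per_device_eval_batch_size=BATCH_SIZE,
    fp16=True,
    )

# Make a prediction using each fold model
for i, model_path in enumerate(MODEL_LIST):
    
    # Load the saved fold model. Without this step every prediction would
    # come from the model that was left in memory by the last training fold.
    fold_model = AutoModelForSequenceClassification.from_pretrained(model_path)

    fold_trainer = Trainer(
            fold_model,
            infer_args,
            data_collator=default_data_collator
        )

    # Make a prediction for one model
    raw_predictions = fold_trainer.predict(test_features)

    # Save the predictions from each fold in a list
    raw_predictions_list.append(raw_predictions)


print(len(raw_predictions_list))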

In [None]:
# Average the predictions for all folds

for i, raw_preds in enumerate(raw_predictions_list):
    
    np_preds_logits = raw_preds.predictions

    if i == 0:     
        fin_logits = np_preds_logits
        
    else:
        fin_logits = fin_logits + np_preds_logits

        
# Average the predictions
avg_logits = fin_logits/len(MODEL_LIST)

avg_logits.shape

In [None]:
# Take the argmax

preds = np.argmax(avg_logits, axis=1)

preds.shape

In [None]:
# Add the preds to df_test

df_test['preds'] = preds

In [None]:
# Change the preds to the corresponding float values

def change_preds(x):
    
    if x == 0:
        return 0

    if x == 1:
        return 0.25
    
    if x == 2:
        return 0.5

    if x == 3:
        return 0.75

    if x == 4:
        return 1.0
    
df_test['modified_preds'] = df_test['preds'].apply(change_preds)

# filter out the columns we don't need
cols = ['id', 'modified_preds']
df = df_test[cols]

print(df_test.shape)

df_test.head()

## Create a submission csv file

Here we will ensure that the submission csv has the same order as the sample submission. We do this by performing a merge. I'm doing this to ensure that we don't get any submission errors.

In [None]:
# Load the sample submission

path = base_path + 'sample_submission.csv'
df_sample = pd.read_csv(path)

print(df_sample.shape)

df_sample.head()

In [None]:
# Add the preds to df_sample
# The order is changed to match df_sample

df_sample = pd.merge(df_sample, df, on='id', how='left')

print(df_sample.shape)

df_sample.head()

In [None]:
# Overwrite the score column
df_sample['score'] = list(df_sample['modified_preds'])

# drop the modified_preds column
df_sample = df_sample.drop('modified_preds', axis=1)

df_sample.head()

In [None]:
# Create a submission csv file

path = 'submission.csv'
df_sample.to_csv(path, index=False)

In [None]:
# Create a requirements.txt file
# This is a list of all packages and their versions that were 
# used to create this solution.

!pip freeze > requirements.txt

In [None]:
!ls

## How to use the Hugging Face trainer for regression

For a regression problem we still use AutoModelForSequenceClassification but we set num_labels=1.

When num_labels=1 the model automatically treats this as a regression problem and uses MSE loss. In that case the 'labels' column should contain the float scores instead of the class indices.
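
With a single output there is no argmax step, so the metric function would also need to change. Here is a rough sketch of what that could look like. The name `compute_metrics_regression` is made up, the input values are invented, and I'm assuming that eval_pred unpacks into (predictions, labels) in the same way as in the compute_metrics function above. It uses np.corrcoef instead of scipy to keep the example self-contained.

```python
import numpy as np

def compute_metrics_regression(eval_pred):
    """Hypothetical metric function for the num_labels=1 setup."""
    predictions, labels = eval_pred

    # predictions shape: (num_rows, 1) -> flatten to (num_rows,)
    preds = np.asarray(predictions, dtype=float).reshape(-1)
    labels = np.asarray(labels, dtype=float).reshape(-1)

    # MSE is the loss that gets used during training
    mse = float(np.mean((preds - labels) ** 2))

    # Pearson is the competition metric
    corr = float(np.corrcoef(preds, labels)[0, 1])

    return {"pearson": corr, "mse": mse}

# Made-up model outputs and float scores
fake_outputs = np.array([[0.1], [0.3], [0.45], [0.8], [0.9]])
fake_scores = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

result = compute_metrics_regression((fake_outputs, fake_scores))
print(result)
```

The predictions are used as they are. There's no need to convert them back to the five increments because Pearson correlation works on continuous values.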

In [None]:
# Regression setup:
# model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH, num_labels=1)

Refer to this discussion:<br>
https://discuss.huggingface.co/t/how-to-set-up-trainer-for-a-regression/12994/2

Also refer to this link to the source code referenced in the above discussion:<br>
Click the link and scroll down. The relevant section of the source code is highlighted. This confirms that MSE loss gets used when num_labels=1.<br>
https://github.com/huggingface/transformers/blob/7ae6f070044b0171a71f3269613bf02fd9fca6f2/src/transformers/models/bert/modeling_bert.py#L1564-L1575

## Resources



- Gradient Accumulation explanation<br>
https://colab.research.google.com/github/kozodoi/website/blob/master/_notebooks/2021-02-19-gradient-accumulation.ipynb#:~:text=Simply%20speaking%2C%20gradient%20accumulation%20means,might%20find%20this%20tutorial%20useful.

- fp16 explanation<br>
https://www.youtube.com/watch?v=ks3oZ7Va8HU

- Weight Decay explanation<br>
https://colab.research.google.com/github/d2l-ai/d2l-pytorch-colab/blob/master/chapter_multilayer-perceptrons/weight-decay.ipynb

- Docs for the Hugging Face Trainer API<br>
https://huggingface.co/docs/transformers/training

- Full list of training arguments<br>
(Click the link then scroll down until you get to TrainingArguments)<br>
https://huggingface.co/transformers/main_classes/trainer.html

- Explanation of Pearson correlation<br>
https://machinelearningmastery.com/how-to-use-correlation-to-understand-the-relationship-between-variables/

- Dataset docs<br>
https://huggingface.co/docs/datasets/v1.2.0/exploring.html

- The parameters for tokenizer.encode can be found here:<br>
https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizerBase.encode

## Reference Notebooks

- US Patent Phrase Matching: Adding meaning of code<br>
https://www.kaggle.com/code/xhlulu/us-patent-phrase-matching-adding-meaning-of-code

- [USPPPM] BERT for Patents Baseline [train]<br>
https://www.kaggle.com/code/ksork6s4/uspppm-bert-for-patents-baseline-train/notebook

- USPPPM-Huggingface Train & Inference Baseline<br>
https://www.kaggle.com/code/phantivia/uspppm-huggingface-train-inference-baseline

- USPPPM-Huggingface patent-bert<br>
https://www.kaggle.com/code/danofer/uspppm-huggingface-patent-bert


Thank you for reading.