# **Fine-tuning Tutorial with RoBERTa**

Shiyu Ji

March 8, 2024

## Preparations

### Load and Install Necessary Packages

Here we mainly use [Hugging Face's `transformers` package](https://huggingface.co/), which was developed to assist with the training, inference, and hosting of various transformer-based models.

We also use the `datasets` package, likewise provided by Hugging Face, to load the data.


In [1]:
### Install all packages
!pip install transformers
!pip install datasets
!pip install accelerate -U


In [2]:
# Load packages in python
import datasets
import sys

from transformers import AutoTokenizer, TextClassificationPipeline
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer, DefaultDataCollator

import numpy as np
import pandas as pd

import os

### Connect to Google Drive

In [3]:
from google.colab import drive

## Prompt to link to google drive
drive.mount('/content/drive')

## Change this to your own google drive directory

tutorial_path = os.path.join(os.getcwd(), "drive", "My Drive", "fine_tuning_tutorial")

Mounted at /content/drive


## Task Description: News Category Classification

Given news headlines from the Huffington Post, we would like to identify the news category of each headline.

### Step 0. Dataset description and pre-processing considerations

The dataset for this task comes from [Kaggle](https://www.kaggle.com/datasets/rmisra/news-category-dataset) and contains about 210K news headlines from the Huffington Post. Examining the dataset before training shows that some categories are redundant and others do not fit a general notion of a news category. I preprocessed the dataset by merging the redundant categories and deleting the ones that are not useful.

This results in the following categories:

`'COMEDY', 'SPORTS', 'RELIGION', 'FAMILY', 'STYLE', 'FOOD & DRINK', 'GREEN', 'SCIENCE', 'WORLD NEWS', 'ENVIRONMENT', 'HOME & LIVING', 'MONEY', 'CULTURE & ARTS', 'WELLNESS', 'U.S. NEWS', 'TRAVEL', 'WOMEN', 'CRIME', 'ENTERTAINMENT', 'BUSINESS', 'MEDIA', 'WEIRD NEWS', 'POLITICS', 'TECH', 'EDUCATION'`
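
The merging and deleting step could look roughly like the sketch below. The tutorial does not list which categories were merged or dropped, so `MERGE_MAP` and `DROP` are hypothetical examples, as are the placeholder headlines; the `headline` and `category` field names are assumed from the Kaggle dataset.

```python
# Hypothetical merge map: NOT the mapping actually used for this tutorial.
MERGE_MAP = {
    "ARTS": "CULTURE & ARTS",
    "ARTS & CULTURE": "CULTURE & ARTS",
    "THE WORLDPOST": "WORLD NEWS",
    "WORLDPOST": "WORLD NEWS",
}
# Hypothetical set of categories to discard entirely.
DROP = {"IMPACT", "FIFTY"}


def clean_categories(rows):
    """Merge redundant categories, drop unwanted ones, and rename the fields."""
    cleaned = []
    for row in rows:
        category = MERGE_MAP.get(row["category"], row["category"])
        if category in DROP:
            continue
        cleaned.append({"text": row["headline"], "label": category})
    return cleaned


# Placeholder rows standing in for the raw dataset.
raw_rows = [
    {"headline": "placeholder headline A", "category": "ARTS"},
    {"headline": "placeholder headline B", "category": "WORLDPOST"},
    {"headline": "placeholder headline C", "category": "FIFTY"},
    {"headline": "placeholder headline D", "category": "SPORTS"},
]
cleaned_rows = clean_categories(raw_rows)
```

The same idea carries over to a `pandas` column if the raw data is held in a DataFrame.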




### Step 1. Split dataset and Load Datasets

I split the dataset into training, eval, and test sets at an 8:1:1 ratio, stratified by category, so that each split contains every category being classified.
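
As an illustration of what a stratified 8:1:1 split does, here is a minimal pure-Python sketch. It is not the code used to build the tutorial's parquet files; the function name `stratified_split`, the rounding rule, and the toy rows are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key="label", eval_frac=0.1, test_frac=0.1, seed=0):
    """Split rows into train/eval/test so every label appears in every split."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, eval_rows, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        # Keep at least one example per label in the eval and test splits.
        n_eval = max(1, round(len(group) * eval_frac))
        n_test = max(1, round(len(group) * test_frac))
        eval_rows.extend(group[:n_eval])
        test.extend(group[n_eval:n_eval + n_test])
        train.extend(group[n_eval + n_test:])
    return train, eval_rows, test


# Toy data: 20 placeholder headlines for each of three categories.
rows = [
    {"text": f"placeholder headline {i}", "label": label}
    for label in ("SPORTS", "TECH", "FOOD & DRINK")
    for i in range(20)
]
train, eval_rows, test = stratified_split(rows)
```

Splitting within each category, rather than over the whole dataset at once, is what guarantees that rare categories are not accidentally left out of the smaller eval and test sets.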

**Note:**

Hugging Face's `AutoModel` classification classes require a mapping between category names and integers. I created this mapping when building the dataset, so that every category label maps to a fixed numeric id.
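
A fixed mapping of this kind can be built in a couple of lines. The sketch below assumes the ids are assigned in sorted label order, which may differ from how the tutorial's `label_dict.parquet` was actually generated; the label list is a toy example.

```python
# Toy label column; in the tutorial this would be the "label" column of the data.
labels_in_data = ["SPORTS", "TECH", "SPORTS", "FOOD & DRINK", "TECH"]

# Sorting makes the assignment deterministic across runs (an assumption here).
topic_list = sorted(set(labels_in_data))

id2label = dict(enumerate(topic_list))
label2id = {label: idx for idx, label in id2label.items()}
```

Storing the mapping alongside the data, as done here with the label dictionary file, avoids the ids silently changing if the dataset is rebuilt.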

In [4]:
## Set up training/eval file name and label mappings
training_file = os.path.join(tutorial_path, "data", "headline_train.parquet")
eval_file = os.path.join(tutorial_path, "data", "headline_eval.parquet")
label_dict_file = os.path.join(tutorial_path, "data", "label_dict.parquet")

test_file = os.path.join(tutorial_path, "data", "headline_test.parquet")

In [5]:
## OPTIONAL: Look at the data
pd.read_parquet(training_file)

| | text | label |
|---|---|---|
| 0 | The Mysteries of Inequality Are Only Mysteriou... | BUSINESS |
| 1 | Bernie Sanders Proposes Taking Marijuana Off T... | POLITICS |
| 2 | We Tasted It: Coffee Bean's Birthday Cake Ice ... | FOOD & DRINK |
| 3 | Restoration Hardware Sees Itself As 'Critical ... | BUSINESS |
| 4 | If You Like To Blow S\*\*\* Up, Check Out These C... | SCIENCE |
| ... | ... | ... |
| 147949 | 12 Moments That Restored Our Faith In Fashion ... | STYLE |
| 147950 | Color Matters: Choosing the Correct Eye Shadow | STYLE |
| 147951 | Bernie Sanders: GOP Lawmakers Dodging Town Hal... | POLITICS |
| 147952 | Death With Dignity Advocates Say Most Catholic... | RELIGION |




In [6]:
## Load the data with the Hugging Face datasets package
## the streaming option lets us train the model without loading the full dataset into memory
full_dataset = datasets.load_dataset("parquet", data_files = {"train": training_file, "eval":eval_file}, streaming = True)

## shuffle training split, randomization helps the performance of stochastic gradient descent
full_dataset["train"] = full_dataset["train"].shuffle(buffer_size=1000000)

## Create the label-name-to-id mappings that are passed to the model
label_dict = pd.read_parquet(label_dict_file)

numeric_list = label_dict["id"].tolist()
topic_list = label_dict["label"].tolist()
topic_num = len(numeric_list)

id2label = dict(zip(numeric_list, topic_list))
label2id = dict(zip(topic_list, numeric_list))
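
To give a feel for what shuffling a streamed dataset with a `buffer_size` means, here is a simplified pure-Python sketch of buffer-based shuffling. It is an approximation for illustration, not the `datasets` implementation, and `buffered_shuffle` is a hypothetical name.

```python
import random


def buffered_shuffle(stream, buffer_size, seed=0):
    """Yield items from a stream in approximately random order using a fixed buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered item and put the incoming item in its place.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # The stream is exhausted: emit what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(100), buffer_size=10))
```

Under this scheme, a buffer larger than the dataset (as with `buffer_size=1000000` for roughly 148K training rows) gives a full shuffle, at the cost of holding every row in the buffer.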

### Step 3: Setting up tokenization

We choose a pre-trained model and set up tokenization for the dataset. The tokenizer must match the pre-trained model, since it has to split each sentence into tokens that the model recognizes. Here we use `RoBERTa-base` by [Liu et al. (2019)](https://arxiv.org/abs/1907.11692).

To tokenize, we use the `map` function in the `datasets` package, which works like `apply` in `pandas`. We define helper functions for that purpose.

In [7]:
model_name = "roberta-base"

In [8]:
## Helper Functions
def tokenize_general_text_data(examples, tokenizer):
  return tokenizer(examples["text"], padding='max_length', truncation=True, max_length=512)

def label_data(examples, label2id):
  return {"label":label2id[examples["label"]]}
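
The `padding='max_length'` and `truncation=True` options bring every example to the same length. The toy function below mimics only that behavior on plain lists of token ids; `pad_or_truncate` and the pad id of 0 are illustrative assumptions, and the real tokenizer also handles special tokens, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut a token-id list to max_length, or right-pad it up to max_length."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    return {
        "input_ids": kept + [pad_id] * n_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


short_example = pad_or_truncate([5, 6, 7], max_length=5)
long_example = pad_or_truncate(list(range(10, 18)), max_length=5)
```

Since headlines are short, padding everything to 512 tokens means most positions are padding; a smaller `max_length` would likely reduce compute, though that trade-off is not explored in this tutorial.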

In [9]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

## Preprocess data: with streaming on, map is applied lazily, so nothing is tokenized until the data is iterated
full_dataset["train"] = full_dataset["train"].map(tokenize_general_text_data, fn_kwargs = {"tokenizer": tokenizer}, batched = True)
full_dataset["train"] = full_dataset["train"].map(label_data, fn_kwargs = {"label2id": label2id})
full_dataset["eval"] = full_dataset["eval"].map(tokenize_general_text_data, fn_kwargs = {"tokenizer": tokenizer}, batched = True)
full_dataset["eval"] = full_dataset["eval"].map(label_data, fn_kwargs = {"label2id": label2id})




### Step 4. Load the model

While there are other ways to train the model, `transformers` provides `AutoModel` classes suited to a variety of common tasks, which helps users who are not familiar with `PyTorch` and other neural-network tooling.

Here we have a sequence classification task, so we use `AutoModelForSequenceClassification`. Its `from_pretrained` method sets up a model with a classification head on top of the pooled embedding output of the encoded sentence.

What the pooled output refers to depends on the base model you select. For `BERT` and `RoBERTa`, it is the embedding associated with the first token of the sequence (`[CLS]` in `BERT`, `<s>` in `RoBERTa`).
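
As a rough picture of what a classification head on the first token computes, the NumPy sketch below takes the first token's hidden state and passes it through a dense layer, a `tanh`, and an output projection. The weights are random stand-ins, the sizes are shrunk for readability, and dropout is left out, so this is a shape-level illustration rather than the library's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; roberta-base uses a much larger hidden size.
batch_size, seq_len, hidden_size, num_labels = 2, 6, 8, 25

# Stand-in for the encoder output: one vector per token.
hidden_states = rng.normal(size=(batch_size, seq_len, hidden_size))

# Randomly initialized head parameters (the part that fine-tuning has to learn).
dense_w = rng.normal(size=(hidden_size, hidden_size))
dense_b = np.zeros(hidden_size)
out_w = rng.normal(size=(hidden_size, num_labels))
out_b = np.zeros(num_labels)

# Use only the first token's vector as the sentence representation.
first_token = hidden_states[:, 0, :]
logits = np.tanh(first_token @ dense_w + dense_b) @ out_w + out_b
```

The result has one score per category for each sentence in the batch, here shape `(2, 25)`.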

In [10]:
topic_model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=topic_num, id2label=id2label, label2id=label2id)


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Step 5: Train the model

A classical training pipeline in `torch` involves "manually" computing the objective function, backpropagating to get the gradients, and stepping the optimizer to update the parameters. This means writing your own loops over epochs and batches.


Hugging Face makes your life easier by providing the `TrainingArguments` and `Trainer` classes, which pack all of this into a couple of statements. In addition, the same training arguments and trainer work across `torch`-based `transformers` models.
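
To make concrete what such a manual loop involves, here is a toy version for a small softmax classifier. It is written in NumPy rather than `torch`, with the gradient derived by hand and plain SGD instead of AdamW, so it only illustrates the forward / loss / gradient / update cycle; the data, sizes, and learning rate are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
num_examples, num_features, num_classes = 200, 5, 3

# Synthetic data: labels come from a random linear rule.
features = rng.normal(size=(num_examples, num_features))
labels = np.argmax(features @ rng.normal(size=(num_features, num_classes)), axis=1)


def forward(x, weights, bias):
    """Return softmax probabilities for each row of x."""
    logits = x @ weights + bias
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


def cross_entropy(probs, y):
    """Mean negative log-probability of the correct class."""
    return float(-np.mean(np.log(probs[np.arange(len(y)), y])))


weights = np.zeros((num_features, num_classes))
bias = np.zeros(num_classes)
initial_loss = cross_entropy(forward(features, weights, bias), labels)

learning_rate, batch_size = 0.5, 16
for step in range(100):
    # 1. Draw a mini-batch.
    batch = rng.choice(num_examples, size=batch_size, replace=False)
    x, y = features[batch], labels[batch]
    # 2. Forward pass.
    probs = forward(x, weights, bias)
    # 3. Gradient of the mean cross-entropy with respect to the logits.
    grad_logits = probs.copy()
    grad_logits[np.arange(batch_size), y] -= 1.0
    grad_logits /= batch_size
    # 4. Parameter update (plain SGD).
    weights -= learning_rate * (x.T @ grad_logits)
    bias -= learning_rate * grad_logits.sum(axis=0)

final_loss = cross_entropy(forward(features, weights, bias), labels)
```

`Trainer` takes care of the equivalent of steps 1 to 4, plus evaluation, logging, checkpointing, and mixed precision.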

In [11]:
## Here you set up the parameters for training, the default optimizer is AdamW

## you need to specify steps since you stream the dataset
## There are 147954 training cases, so it means ~9248 steps per epoch if each batch is 16 examples
## For illustrative purpose, we train for 1000 steps, which takes about 3 minutes on V100
training_steps = 1000
output_temp = os.path.join(tutorial_path, "temp_model") # temporary folder to store the intermediate model
output_model = os.path.join(tutorial_path, "model" , "1000_step_model_demo")


training_arguments = TrainingArguments(
    output_dir = output_temp,
    evaluation_strategy="steps",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    max_steps=training_steps,
    eval_steps = 100,
    logging_steps = 100,
    save_steps = 5000,
    fp16_full_eval = True,
    fp16 = True
  )


## A data collator can be thought of as a function that gathers all tensors in a batch into the matrices and vectors that the model takes as input

data_collator = DefaultDataCollator()


## Helper function to evaluate the predictions.
def compute_metrics(model_output):
  predicted_values, label = model_output
  prediction = np.argmax(predicted_values, axis = 1).flatten()
  label = label.flatten()
  accuracy = np.mean(label ==  prediction)
  argsort_matrix = predicted_values.argsort()
  top_2 = accuracy + np.mean(argsort_matrix[:,-2] == label)
  top_3 = top_2 + np.mean(argsort_matrix[:,-3] == label)
  return {"Accuracy": accuracy,
          "Accuracy-top2": top_2,
          "Accuracy-top3": top_3}

## The Trainer ties together the model, the training arguments, and the training and eval sets
## Provide a compute_metrics function to report any extra metrics you would like to evaluate (accuracy, F1, ...)
trainer = Trainer(
    model= topic_model,
    args= training_arguments,
    train_dataset= full_dataset["train"],
    eval_dataset= full_dataset["eval"],
    data_collator = data_collator,
    compute_metrics=compute_metrics
    )

trainer.train()




TrainOutput(global_step=1000, training_loss=1.5607593994140625, metrics={'train_runtime': 208.2384, 'train_samples_per_second': 76.835, 'train_steps_per_second': 4.802, 'total_flos': 4210646237184000.0, 'train_loss': 1.5607593994140625, 'epoch': 1.0})
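
The step counts quoted in the training cell can be checked with a little arithmetic. The sketch below assumes a single device, so that the effective batch size equals `per_device_train_batch_size`; the runtime and throughput figures are copied from the `TrainOutput` above.

```python
import math

num_train_examples = 147954
batch_size = 16          # per_device_train_batch_size, assuming one GPU
training_steps = 1000    # max_steps

# Number of optimizer steps needed to see every training example once.
steps_per_epoch = math.ceil(num_train_examples / batch_size)

# How much data the 1000-step demo run actually sees.
examples_seen = training_steps * batch_size
fraction_of_epoch = examples_seen / num_train_examples

# Cross-check against the reported runtime and throughput.
reported_examples = 208.2384 * 76.835  # train_runtime * train_samples_per_second
```

So the demo run covers 16,000 examples, a bit over a tenth of one pass through the training data, which is worth keeping in mind when reading its accuracy.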

In [None]:
## save model
trainer.save_model(output_model)

### Step 6: Evaluate the model we just trained on the test set

Do this only once, after you have settled on a final model, so that the test set remains an unbiased estimate of performance.

In [None]:
test_dataset = datasets.load_dataset("parquet", data_files = {"train": test_file}, streaming = True)

test_dataset["train"] = test_dataset["train"].map(tokenize_general_text_data, fn_kwargs = {"tokenizer": tokenizer}, batched = True)
test_dataset["train"] = test_dataset["train"].map(label_data, fn_kwargs = {"label2id": label2id})

In [None]:
## Here we try the newly trained model
results = trainer.evaluate(test_dataset["train"])

print(results)

{'eval_loss': 1.2231824398040771, 'eval_Accuracy': 0.6841678382178004, 'eval_Accuracy-top2': 0.8016113334054288, 'eval_Accuracy-top3': 0.8566562128257813, 'eval_runtime': 66.4598, 'eval_samples_per_second': 278.274, 'eval_steps_per_second': 8.697, 'epoch': 1.0}

In [None]:
## Here we load a model that I previously trained for 6 epochs
model_directory = os.path.join(tutorial_path, "model", "huff_headline_model")
topic_model = AutoModelForSequenceClassification.from_pretrained(model_directory, local_files_only=True)

## The Trainer is used only for evaluation here, so the train/eval datasets are placeholders (a dummy dataset would also do)
trainer = Trainer(
    model= topic_model,
    args= training_arguments,
    train_dataset= full_dataset["train"],
    eval_dataset= full_dataset["eval"],
    data_collator = data_collator,
    compute_metrics=compute_metrics
    )

results = trainer.evaluate(test_dataset["train"])

print(results)
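
The `Accuracy-top2` and `Accuracy-top3` metrics defined in `compute_metrics` count a prediction as correct when the true label is among the two or three highest-scoring categories. The sketch below computes the same quantity in a slightly different way on made-up logits; `top_k_accuracy` is a hypothetical helper, not part of the tutorial's code.

```python
import numpy as np


def top_k_accuracy(logits, labels, k):
    """Share of rows whose true label is among the k highest-scoring classes."""
    top_k = np.argsort(logits, axis=1)[:, -k:]
    return float(np.mean(np.any(top_k == labels[:, None], axis=1)))


# Made-up scores for 4 examples and 4 classes.
logits = np.array([
    [2.0, 1.0, 0.1, -1.0],   # true label 0 is ranked 1st
    [0.2, 0.1, 3.0, 1.5],    # true label 3 is ranked 2nd
    [0.5, 2.5, 0.3, 1.0],    # true label 0 is ranked 3rd
    [1.0, 0.9, 0.8, 2.0],    # true label 2 is ranked 4th
])
labels = np.array([0, 3, 0, 2])

top_1 = top_k_accuracy(logits, labels, k=1)  # 0.25
top_2 = top_k_accuracy(logits, labels, k=2)  # 0.5
top_3 = top_k_accuracy(logits, labels, k=3)  # 0.75
```

Top-k accuracy can only stay equal or increase as k grows, which matches the cumulative way `compute_metrics` builds top-2 and top-3 from top-1.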

### Step 7: Inference and Transfer Learning

Imagine you have a locally saved model and want to apply it to another dataset. Here I demonstrate how to load the model and predict with it.

In [12]:
base_model_type = "roberta-base"
model_directory = os.path.join(tutorial_path, "model", "huff_headline_model")


tokenizer = AutoTokenizer.from_pretrained(base_model_type)

### load model
topic_model = AutoModelForSequenceClassification.from_pretrained(model_directory, local_files_only=True)

## Pipeline that wraps the tokenizer and the model for inference; device = 0 runs it on the first GPU
classification_pipeline = TextClassificationPipeline(model=topic_model, tokenizer=tokenizer, device = 0)



In [13]:
example_sentences = ["Pope Francis was discharged from Gemelli Hospital in Rome on Saturday after a three-day treatment for bronchitis, and he also interacted with a couple and a boy at the hospital.",
                     "Sega released a free-to-play murder mystery game titled 'The Murder of Sonic the Hedgehog' on PC via Steam for April Fool's Day.",
                     "Chelsea fell to the lower half of the Premier League table following a 2-0 home defeat to Aston Villa, with goals scored by Ollie Watkins and John McGinn, intensifying pressure on manager Graham Potter."]

prediction = classification_pipeline(example_sentences, top_k = 3)

In [14]:
print(prediction[0])
print(prediction[1])
print(prediction[2])

[{'label': 'WORLD NEWS', 'score': 0.5764216184616089}, {'label': 'RELIGION', 'score': 0.39822524785995483}, {'label': 'POLITICS', 'score': 0.0069416058249771595}]
[{'label': 'ENTERTAINMENT', 'score': 0.6623727679252625}, {'label': 'TECH', 'score': 0.25391462445259094}, {'label': 'HOME & LIVING', 'score': 0.013801216147840023}]
[{'label': 'SPORTS', 'score': 0.9792113304138184}, {'label': 'WORLD NEWS', 'score': 0.00973131787031889}, {'label': 'ENTERTAINMENT', 'score': 0.003638233756646514}]
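
The scores in the output above behave like probabilities over the categories. As far as I understand, for a single-label model like this one the pipeline obtains them by applying a softmax to the model's logits and then keeping the `top_k` largest. The pure-Python sketch below reproduces that post-processing on made-up logits with a hypothetical four-label subset; it is not the pipeline's own code.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_labels(logits, id2label, k=3):
    """Return the k most probable labels with their scores, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [{"label": id2label[i], "score": probs[i]} for i in ranked[:k]]


# Hypothetical label subset and made-up logits for one sentence.
id2label = {0: "SPORTS", 1: "WORLD NEWS", 2: "ENTERTAINMENT", 3: "TECH"}
logits = [4.0, 0.5, -0.5, -2.0]

prediction = top_k_labels(logits, id2label, k=3)
```

Because the scores are normalized over all categories, a runner-up with a sizeable score (such as `RELIGION` for the first example sentence) signals that the model sees the headline as genuinely ambiguous.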
