[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ZGObhOKJCQhJJZFakc-v2ykj-hXm7K2o?usp=sharing)


# Fine-tuning RoBERTa for Topic Classification with Hugging Face Transformers and Datasets Library


This is the code for the Medium post [Fine-tuning RoBERTa for Topic Classification with Hugging Face Transformers and Datasets Library](https://medium.com/@achillesmoraites/fine-tuning-roberta-for-topic-classification-with-hugging-face-transformers-and-datasets-library-c6f8432d0820).

**The code and the post assume that**:
- You have a Hugging Face 🤗 account and are familiar with the platform (at least with creating a model repo and access tokens).
- You are experienced with Machine Learning (ML), Deep Learning, and NLP.
- You have some experience with deep learning frameworks like PyTorch or TensorFlow.
- You have coding experience with Python.
- You have access to a Jupyter Environment with a GPU that can support the training process, and you are proficient in using it.

## ⚠️Warning
The post and the accompanying code do not intend to teach ML, Deep Learning, or NLP!

The aim of the post and the code is to illustrate the process of fine-tuning a RoBERTa model and publishing it to the Hugging Face 🤗 platform.

Building a production-level ML model involves steps and processes not covered by the post and the code.


In [12]:
!pip install transformers datasets huggingface_hub tensorboard==2.14
!sudo apt-get install git-lfs --yes

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 18 not upgraded.


In [1]:
!pip install accelerate -U

Collecting accelerate
  Downloading accelerate-0.23.0-py3-none-any.whl (258 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.23.0


In [13]:
import torch
from datasets import load_dataset
from transformers import (
    RobertaTokenizerFast,
    RobertaForSequenceClassification,
    TrainingArguments,
    Trainer,
    AutoConfig,
)
from huggingface_hub import HfFolder, notebook_login

In [3]:
notebook_login()

In [14]:
model_id = "roberta-base"
dataset_id = "ag_news"
# Replace this with your own repository, in the form <username>/<model-name>.
# Before you start, create an empty model repository on Hugging Face 🤗 at https://huggingface.co/new
repository_id = "DaymonQu/roberta-base_ag_news_202310232117"

In [15]:
# Load dataset
dataset = load_dataset(dataset_id)
train_dataset = dataset["train"]

# ag_news ships only train and test splits, so the test split is divided in two:
# shard 0 is kept for testing and shard 1 is used for validation
test_dataset = dataset["test"].shard(num_shards=2, index=0)
val_dataset = dataset["test"].shard(num_shards=2, index=1)

# Preprocessing
tokenizer = RobertaTokenizerFast.from_pretrained(model_id)

def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True, max_length=256)

train_dataset = train_dataset.map(tokenize, batched=True, batch_size=len(train_dataset))
val_dataset = val_dataset.map(tokenize, batched=True, batch_size=len(val_dataset))
test_dataset = test_dataset.map(tokenize, batched=True, batch_size=len(test_dataset))

train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])
val_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])
test_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])

# Extract the number of classes and their names
num_labels = dataset["train"].features["label"].num_classes
class_names = dataset["train"].features["label"].names
print(f"number of labels: {num_labels}")
print(f"the labels: {class_names}")

# Create an id2label mapping so that the pipeline outputs class names directly,
# with no need to map label ids to names afterwards.
id2label = {i: label for i, label in enumerate(class_names)}

# Update the model's configuration with the id2label mapping
config = AutoConfig.from_pretrained(model_id)
config.update({"id2label": id2label})

number of labels: 4
the labels: ['World', 'Sports', 'Business', 'Sci/Tech']
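
The label mapping built in the cell above can be looked at in isolation. This is a minimal sketch that hard-codes the class names printed by the notebook instead of loading the dataset. The reverse mapping `label2id` is an assumption added for illustration; the original notebook only sets `id2label`.

```python
# Class names as printed by the notebook for the ag_news dataset
class_names = ["World", "Sports", "Business", "Sci/Tech"]

# Integer class id -> readable name, built the same way as in the notebook
id2label = {i: label for i, label in enumerate(class_names)}

# Reverse mapping (assumption: not set in the original notebook, shown for illustration)
label2id = {label: i for i, label in id2label.items()}

print(id2label)
print(label2id)
```

If you also want the reverse mapping stored in the model configuration, it could be passed next to `id2label` in the same `config.update(...)` call.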


In [16]:
# Model
model = RobertaForSequenceClassification.from_pretrained(
    model_id, config=config, ignore_mismatched_sizes=True
)

# TrainingArguments
training_args = TrainingArguments(
    output_dir=repository_id,
    num_train_epochs=5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    evaluation_strategy="epoch",
    logging_dir=f"{repository_id}/logs",
    logging_strategy="steps",
    logging_steps=10,
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_steps=500,
    save_strategy="epoch",
    load_best_model_at_end=True,
    save_total_limit=2,
    report_to="tensorboard",
    push_to_hub=True,
    hub_strategy="every_save",
    hub_model_id=repository_id,
    hub_token=HfFolder.get_token(),
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
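
The `Trainer` configured above reports only the loss during evaluation. If an accuracy figure is also wanted, one possible approach is sketched here. `compute_metrics` is a hypothetical helper that is not part of the original notebook; it assumes the evaluation predictions arrive as a pair of NumPy arrays (logits and integer labels), and the example arrays are made up for illustration.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Return the accuracy for a (logits, labels) pair."""
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row
    predictions = np.argmax(logits, axis=-1)
    accuracy = float((predictions == labels).mean())
    return {"accuracy": accuracy}

# Hand-made example: 3 samples, 4 classes (not real model output)
example_logits = np.array([
    [0.1, 2.0, 0.3, 0.0],
    [1.5, 0.2, 0.1, 0.0],
    [0.0, 0.1, 0.2, 3.0],
])
example_labels = np.array([1, 0, 2])

metrics = compute_metrics((example_logits, example_labels))
print(metrics)
```

To use it, the function would be passed as `compute_metrics=compute_metrics` when constructing the `Trainer`; the evaluation output would then be expected to include an `eval_accuracy` entry next to `eval_loss`.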


In [17]:
# Fine-tune the model
trainer.train()

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.1716        | 0.214238        |
| 2     | 0.1584        | 0.17676         |
| 3     | 0.1115        | 0.181394        |
| 4     | 0.1085        | 0.190863        |
| 5     | 0.0313        | 0.237505        |


TrainOutput(global_step=18750, training_loss=0.13822619705438613, metrics={'train_runtime': 7934.848, 'train_samples_per_second': 75.616, 'train_steps_per_second': 2.363, 'total_flos': 7.89347340288e+16, 'train_loss': 0.13822619705438613, 'epoch': 5.0})
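
The `global_step` value in the output above can be checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, which is consistent with the `TrainingArguments` shown (neither is configured there). The size of the `ag_news` train split, 120,000 examples, is not printed in the notebook and is supplied here as a known property of the dataset.

```python
import math

train_examples = 120_000  # size of the ag_news train split (not printed in the notebook)
batch_size = 32           # per_device_train_batch_size
epochs = 5                # num_train_epochs
train_runtime = 7934.848  # seconds, from the TrainOutput above

# One optimizer step per batch; the last batch of an epoch may be smaller
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 3750 18750

# Throughput: examples processed over all epochs divided by wall-clock time
samples_per_second = train_examples * epochs / train_runtime
print(round(samples_per_second, 2))
```

The computed total matches `global_step=18750`, and the throughput is close to the reported `train_samples_per_second`.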

In [18]:
trainer.evaluate()

{'eval_loss': 0.17675994336605072,
 'eval_runtime': 15.5929,
 'eval_samples_per_second': 243.701,
 'eval_steps_per_second': 7.632,
 'epoch': 5.0}
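
The `eval_loss` reported here equals the validation loss logged for epoch 2, not epoch 5. That is consistent with `load_best_model_at_end=True`, which reloads the checkpoint with the best evaluation metric (the loss, when no other metric is specified) once training finishes. A minimal sketch of that selection, using the validation losses from the training log above:

```python
# Validation loss per epoch, copied from the training log
validation_loss = {
    1: 0.214238,
    2: 0.17676,
    3: 0.181394,
    4: 0.190863,
    5: 0.237505,
}

# Lower loss is better, so the best epoch is the one with the minimum value
best_epoch = min(validation_loss, key=validation_loss.get)
print(best_epoch, validation_loss[best_epoch])  # 2 0.17676
```

The training loss keeps falling after epoch 2 while the validation loss rises, which suggests the later epochs are overfitting the training set.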

In [19]:
# Save our tokenizer and create model card
tokenizer.save_pretrained(repository_id)
trainer.create_model_card()
# Push the results to the hub
trainer.push_to_hub()

'https://huggingface.co/DaymonQu/roberta-base_ag_news_202310232117/tree/main/'

In [20]:
# Test the model

from transformers import pipeline

# Load the fine-tuned model from the Hub.
# The variable is named `classifier` to avoid confusion with the pip tool.
classifier = pipeline("text-classification", model=repository_id)

text = "Kederis proclaims innocence Olympic champion Kostas Kederis today left hospital ahead of his date with IOC inquisitors claiming his innocence and vowing: quot;After the crucifixion comes the resurrection. quot; .."
result = classifier(text)

predicted_label = result[0]["label"]
print(f"Predicted label: {predicted_label}")

Predicted label: Sports
