# Fine-tune a transformer with Hugging Face

For this part of the project I explored the Hugging Face library and the datasets and models available there. While many datasets are readily available on the site, I decided to use the same dataset of airline tweets that I used with the fastai library. That way I could experience the differences in loading and preparing the dataset, and also compare the resulting accuracies. I had no previous experience with the library, nor any solid experience with transformers, so I started by working my way through the [Introduction](https://huggingface.co/course/chapter1/1) course.

After considering several different models, I found [Twitter-roBERTa-base for Sentiment Analysis](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest) to be a good candidate.

> This is a roBERTa-base model trained on ~124M tweets from January 2018 to December 2021, and finetuned for sentiment analysis with the TweetEval benchmark.

Just like the airline dataset, it has three possible classes: negative, neutral and positive.

# Prepare dataset

The Hugging Face library has its own dataset format, which is different from the pandas dataframe, and this made for a rather tricky start.

The model's creators also provide a function that replaces usernames and links with placeholders, which seemed sensible to apply to the tweets. However, applying a function to a column in this dataset format was nowhere near as easy as doing it with pandas. I believe there is some support for turning a dataframe into a dataset, but I did not find an easy way to reverse it. Furthermore, most labels in the other datasets I explored on the site were typed as a ClassLabel, which holds the information needed to map integers to the correct label names.

Dividing the dataset into training, validation and test sets was rather smooth, as the library offers a train_test_split method that resembles the scikit-learn function of the same name. However, all parts of the dataset were still contained in a single DatasetDict.

In [1]:
#collapse 
import transformers
from datasets import load_dataset, Features, Value, ClassLabel, load_metric


In [2]:
#collapse 
transformers.logging.set_verbosity_warning()  # Only log warnings and errors

In [3]:
airline_dataset = load_dataset("csv", data_files=r"data_tweets_airline.csv")

Using custom data configuration default-f31bc029a11dc270
Reusing dataset csv (C:\Users\silje\.cache\huggingface\datasets\csv\default-f31bc029a11dc270\0.0.0\433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519)
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 99.98it/s]


In [4]:
airline_dataset

DatasetDict({
    train: Dataset({
        features: ['tweet_id', 'airline_sentiment', 'airline_sentiment_confidence', 'negativereason', 'negativereason_confidence', 'airline', 'airline_sentiment_gold', 'name', 'negativereason_gold', 'retweet_count', 'text', 'tweet_coord', 'tweet_created', 'tweet_location', 'user_timezone'],
        num_rows: 14640
    })
})

In [5]:
#collapse 
airline_dataset = airline_dataset.rename_column("airline_sentiment", "label")

In [6]:
#collapse 
columns_to_remove = ['tweet_id', 'airline_sentiment_confidence', 'negativereason', 'negativereason_confidence', 'airline', 'airline_sentiment_gold', 'name', 'negativereason_gold', 'retweet_count',  'tweet_coord', 'tweet_created', 'tweet_location', 'user_timezone']
# remove_columns accepts a list, so all unused columns can be dropped in one call
airline_dataset = airline_dataset.remove_columns(columns_to_remove)

In [7]:
airline_dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 14640
    })
})

In [8]:
airline_dataset["train"][0]

{'label': 'neutral', 'text': '@VirginAmerica What @dhepburn said.'}

In [9]:
#collapse 
features = airline_dataset["train"].features.copy()

In [10]:
def preprocess(text):
    """Replace usernames and links in a tweet with placeholders.

    This is the preprocessing suggested by the model's creators.
    """
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

def adjust_preprocess(batch):
    all_texts = []
    for text in batch["text"]:
        pre = preprocess(text)
        all_texts.append(pre)
    batch["text"] = all_texts
    return batch

In [11]:
airline_dataset["train"] = airline_dataset["train"].map(adjust_preprocess, batched=True, features=features)

Loading cached processed dataset at C:\Users\silje\.cache\huggingface\datasets\csv\default-f31bc029a11dc270\0.0.0\433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519\cache-b4164b8f269460ee.arrow


In [12]:
airline_dataset["train"][0]

{'label': 'neutral', 'text': '@user What @user said.'}
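To make the behaviour of the placeholder function concrete, here is a small self-contained check on made-up tweets (the example strings are illustrative, not from the dataset). It also shows a limitation of splitting on spaces: punctuation attached to a username or link is swallowed by the placeholder.

```python
def preprocess(text):
    """Same logic as above: replace usernames and links with placeholders."""
    new_text = []
    for t in text.split(" "):
        t = "@user" if t.startswith("@") and len(t) > 1 else t
        t = "http" if t.startswith("http") else t
        new_text.append(t)
    return " ".join(new_text)


plain = preprocess("@VirginAmerica see https://t.co/abc123 for details")
attached = preprocess("Thanks @JetBlue! (via @bob)")
lone_at = preprocess("meet @ gate 5")

print(plain)     # @user see http for details
print(attached)  # Thanks @user (via @user
print(lone_at)   # meet @ gate 5
```

A lone `@` is left untouched because of the `len(t) > 1` check, while `@JetBlue!` and `@bob)` lose their trailing punctuation.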

In [13]:
features["label"] = ClassLabel(names=["negative", "neutral", "positive"])

In [14]:
label_dict = {
    "negative" : 0,
    "neutral" : 1,
    "positive" : 2
}

def adjust_labels(batch):
    batch["label"] = [label_dict[sentiment] for sentiment in batch["label"]]
    return batch

In [15]:
airline_dataset["train"] = airline_dataset["train"].map(adjust_labels, batched=True, features=features)

Loading cached processed dataset at C:\Users\silje\.cache\huggingface\datasets\csv\default-f31bc029a11dc270\0.0.0\433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519\cache-2a82fa99d330842d.arrow


In [16]:
airline_dataset["train"].features

{'label': ClassLabel(num_classes=3, names=['negative', 'neutral', 'positive'], id=None),
 'text': Value(dtype='string', id=None)}

In [17]:
#collapse 
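# Hold out 20% of all rows as the test set (no seed is set here, so this split differs between runs)
airline_dataset = airline_dataset["train"].train_test_split(test_size=0.2)
# Split the remaining rows into training (80%) and validation (20%)
airline_dataset_clean = airline_dataset["train"].train_test_split(train_size=0.8, seed=42)
# train_test_split names the held-out part "test", so rename it to "validation"
airline_dataset_clean["validation"] = airline_dataset_clean.pop("test")
# Attach the test set from the first split
airline_dataset_clean["test"] = airline_dataset["test"]
airline_dataset = airline_dataset_clean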

In [18]:
airline_dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 9369
    })
    validation: Dataset({
        features: ['label', 'text'],
        num_rows: 2343
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 2928
    })
})
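The row counts above follow from applying the two fractions in sequence. The arithmetic can be checked with the sketch below. It assumes the held-out share is rounded up and the training share is rounded down. That assumption matches the counts shown, but the library's exact rounding rules are not verified here.

```python
import math

total_rows = 14640

# First split: 20% of all rows are held out as the test set.
test_rows = math.ceil(total_rows * 20 / 100)
remaining_rows = total_rows - test_rows

# Second split: 80% of the remainder for training, the rest for validation.
train_rows = math.floor(remaining_rows * 80 / 100)
validation_rows = remaining_rows - train_rows

print(train_rows, validation_rows, test_rows)  # 9369 2343 2928
```

In other words, roughly 64% of the tweets end up in training, 16% in validation and 20% in test.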

# Pre-trained model

In [19]:
#collapse 
from transformers import AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding
from transformers import Trainer, TrainingArguments

In [20]:
MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"

In [21]:
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [22]:
#collapse 
# model.save_pretrained(MODEL)

# Fine-tune model 

It would have been possible to fine-tune the model with a custom training loop written in PyTorch, but I decided to use the library's Trainer API instead.


As with other NLP models, the text has to be tokenized before it is given to the model. Since I was fine-tuning a pre-trained model, I could use the AutoTokenizer to get the tokenizer class that matches it. This is a subword tokenizer, which is the approach the library argues for.
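The idea behind subword tokenization can be illustrated with a toy example. The sketch below uses greedy longest-match splitting against a tiny made-up vocabulary. Both the vocabulary and the function `greedy_subword_split` are hypothetical, and this is not the byte-level BPE algorithm that RoBERTa tokenizers use; it only shows how a word missing from the vocabulary can still be represented by known pieces.

```python
def greedy_subword_split(word, vocab):
    """Split a word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece matches at this position: give up on the whole word.
            return ["<unk>"]
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical vocabulary, chosen only for this illustration.
toy_vocab = {"fly", "ing", "delay", "ed", "un", "s"}

print(greedy_subword_split("delayed", toy_vocab))  # ['delay', 'ed']
print(greedy_subword_split("flying", toy_vocab))   # ['fly', 'ing']
print(greedy_subword_split("xyz", toy_vocab))      # ['<unk>']
```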


A major headache was discovering why the model refused to give any indication of accuracy, both during training and afterwards. Even with a custom function to calculate metrics, nothing would show. I spent the better part of a day trying different approaches before realizing it was related to the dataset having three labels.

In [23]:
tokenizer = AutoTokenizer.from_pretrained(MODEL)

In [24]:
def tokenize_function(dataset):
    return tokenizer(dataset["text"],  padding=True)

tokenized_datasets = airline_dataset.map(tokenize_function, batched=True)

100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 20.57ba/s]
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 30.30ba/s]
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 23.99ba/s]


In [25]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 9369
    })
    validation: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 2343
    })
    test: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 2928
    })
})

In [26]:
#collapse 
import numpy as np

In [27]:
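def compute_metrics(eval_preds):
    # The Trainer passes the logits and the true labels for the evaluation set
    metric = load_metric("accuracy")
    logits, labels = eval_preds
    # The predicted class is the index of the largest logit in each row
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)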
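What the metric function computes can be reproduced with plain NumPy. The logits and labels below are made up for illustration, with one row per example and one column per class (negative, neutral, positive):

```python
import numpy as np

# Made-up model outputs for four examples and three classes.
logits = np.array([
    [2.0, 0.1, -1.0],
    [0.2, 1.5, 0.3],
    [-0.5, 0.0, 3.0],
    [1.0, 0.9, 0.8],
])
labels = np.array([0, 1, 2, 1])

# The predicted class is the column with the largest logit.
predictions = np.argmax(logits, axis=-1)

# Accuracy is the share of predictions that match the labels.
accuracy = float((predictions == labels).mean())

print(predictions.tolist(), accuracy)  # [0, 1, 2, 0] 0.75
```

Only the last example is misclassified, so three out of four predictions are correct.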

In [28]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
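Conceptually, a padding collator pads every sequence in a batch to the length of the longest one and marks the padded positions in the attention mask. The following is a plain-Python sketch of that idea, not the library's implementation. The function `pad_batch` and the token ids are hypothetical.

```python
def pad_batch(batch_input_ids, pad_token_id):
    """Pad each sequence to the longest length in the batch."""
    longest = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up token id sequences of different lengths.
padded = pad_batch([[0, 11, 12, 2], [0, 21, 2]], pad_token_id=1)

print(padded["input_ids"])       # [[0, 11, 12, 2], [0, 21, 2, 1]]
print(padded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Since the tokenizer above is already called with padding=True inside map, the sequences are presumably padded per map batch before the collator sees them, so the collator likely has little left to do here.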

In [29]:
training_args = TrainingArguments(
    r"models/hug_twitter_airline",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=10
)

In [30]:
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [31]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: text. If text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 9369
  Num Epochs = 10
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 11720


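| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 0.4809 | 0.498291 | 0.842083 |
| 2 | 0.3719 | 0.607548 | 0.849765 |
| 3 | 0.3065 | 0.703529 | 0.855314 |
| 4 | 0.2051 | 0.79904 | 0.854033 |
| 5 | 0.1378 | 0.868735 | 0.848912 |
| 6 | 0.1001 | 1.013278 | 0.855741 |
| 7 | 0.0644 | 1.048996 | 0.856594 |
| 8 | 0.0579 | 1.099649 | 0.855314 |
| 9 | 0.0236 | 1.13457 | 0.857021 |
| 10 | 0.0228 | 1.176464 | 0.860009 |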


The following columns in the evaluation set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: text. If text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 2343
  Batch size = 8
Saving model checkpoint to models/hug_twitter_airline\checkpoint-1172
Configuration saved in models/hug_twitter_airline\checkpoint-1172\config.json
Model weights saved in models/hug_twitter_airline\checkpoint-1172\pytorch_model.bin
tokenizer config file saved in models/hug_twitter_airline\checkpoint-1172\tokenizer_config.json
Special tokens file saved in models/hug_twitter_airline\checkpoint-1172\special_tokens_map.json

TrainOutput(global_step=11720, training_loss=0.1764612882210533, metrics={'train_runtime': 1341.5117, 'train_samples_per_second': 69.839, 'train_steps_per_second': 8.736, 'total_flos': 5025201623506890.0, 'train_loss': 0.1764612882210533, 'epoch': 10.0})
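The step counts in the log can be checked by hand. The sketch below assumes a single device and no gradient accumulation, as the log reports, and that the last, partially filled batch of each epoch still counts as a step:

```python
import math

num_examples = 9369  # rows in the training split
batch_size = 8       # batch size reported in the log
num_epochs = 10

# The last batch of an epoch is smaller but still counts as one step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)  # 1172 11720
```

This also explains the name of the first checkpoint, checkpoint-1172, which is saved after the first epoch.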

In [38]:
trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: text. If text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 2343
  Batch size = 8
Using the latest cached version of the module from C:\Users\silje\.cache\huggingface\modules\datasets_modules\metrics\accuracy\bbddc2dafac9b46b0aeeb39c145af710c55e03b223eae89dfe86388f40d9d157 (last modified on Tue Apr 26 19:27:52 2022) since it couldn't be found locally at accuracy, or remotely on the Hugging Face Hub.


{'eval_loss': 1.1764642000198364,
 'eval_accuracy': 0.860008536064874,
 'eval_runtime': 9.3111,
 'eval_samples_per_second': 251.635,
 'eval_steps_per_second': 31.468,
 'epoch': 10.0}
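One thing worth noting in the training log is that the training loss keeps falling while the validation loss rises from about 0.50 to 1.18, even though accuracy improves slightly. That pattern is often read as a sign of overfitting, although I have not investigated it further. The sketch below copies the logged values and picks out the best epoch by each criterion:

```python
# (epoch, training loss, validation loss, accuracy) copied from the training log
history = [
    (1, 0.4809, 0.498291, 0.842083),
    (2, 0.3719, 0.607548, 0.849765),
    (3, 0.3065, 0.703529, 0.855314),
    (4, 0.2051, 0.79904, 0.854033),
    (5, 0.1378, 0.868735, 0.848912),
    (6, 0.1001, 1.013278, 0.855741),
    (7, 0.0644, 1.048996, 0.856594),
    (8, 0.0579, 1.099649, 0.855314),
    (9, 0.0236, 1.13457, 0.857021),
    (10, 0.0228, 1.176464, 0.860009),
]

best_by_loss = min(history, key=lambda row: row[2])
best_by_accuracy = max(history, key=lambda row: row[3])

print(best_by_loss[0], best_by_accuracy[0])  # 1 10
```

The two criteria disagree: validation loss is lowest after the first epoch, while accuracy is highest after the last. Since accuracy is already at 0.855 after three epochs, a shorter run might have given a similar result. TrainingArguments also has a load_best_model_at_end option for keeping the best checkpoint, which I did not use here.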

In [39]:
trainer.save_model("models/")

Saving model checkpoint to models/
Configuration saved in models/config.json
Model weights saved in models/pytorch_model.bin
tokenizer config file saved in models/tokenizer_config.json
Special tokens file saved in models/special_tokens_map.json


# Evaluate  

In the end the model reached an accuracy of about 0.860 on the validation data and about 0.856 on the test data. This is better than the result I got with the model fine-tuned with fastai, which supports the idea that transformers are the next step for NLP.

In [32]:
predictions = trainer.predict(tokenized_datasets["test"])
preds = np.argmax(predictions.predictions, axis=-1)
metric = load_metric("accuracy")
metric.compute(predictions=preds, references=predictions.label_ids)

The following columns in the test set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: text. If text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 2928
  Batch size = 8


{'accuracy': 0.8562158469945356}