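This notebook fine-tunes pretrained transformer models for sentiment analysis on the English and Hausa comments in the provided dataset.
Two models (bert-base-uncased and cardiffnlp/twitter-roberta-base-sentiment-latest) are fine-tuned and compared, and a third (Davlan/afro-xlmr-large) is tried at the end.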

First, as always, the necessary libraries are imported. I prefer to wrap the imports of libraries I'm not sure are installed in my notebook environment in a try-except block that installs them if the import fails. If you get a pip dependency error, restart the runtime and run the code again.

In [None]:
import pandas as pd
import re
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer, AutoTokenizer
from sklearn.utils import resample
try:
    from datasets import Dataset
except ImportError:
    !pip install -q datasets
    from datasets import Dataset
import torch

# Install and import additional necessary libraries
try:
    import evaluate
except ImportError:
    !pip install -q evaluate
    import evaluate

# Install Hugging Face hub if not already installed
try:
    from huggingface_hub import login
except ImportError:
    !pip install -q huggingface_hub
    from huggingface_hub import login
from google.colab import userdata
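
The try-except pattern above relies on the `!pip` notebook magic, so it only works inside a notebook cell. Outside a notebook, the same idea could be sketched as a small helper. The name `import_or_install` is hypothetical and not part of the notebook; it assumes `pip` is available for the running interpreter.

```python
import importlib
import subprocess
import sys


def import_or_install(module_name, pip_name=None):
    """Import module_name, installing it with pip first if the import fails.

    pip_name is only needed when the package name on PyPI differs from the
    module name (hypothetical helper, not part of the notebook).
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "-q", pip_name or module_name]
        )
        importlib.invalidate_caches()  # make the fresh install visible to the import system
        return importlib.import_module(module_name)


# A standard-library module is used here so the demo never triggers an install.
json_module = import_or_install("json")
print(json_module.dumps({"installed": True}))
```

In the notebook this would be used as, for example, `evaluate = import_or_install("evaluate")`.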

I log into Hugging Face with my API token so that I can upload the trained model to the Hugging Face Hub. This is done in the next cell, which reads the token from Colab secrets. Also, make sure you upload the CSV file to the runtime.

In [None]:
# logging into huggingface to upload fine-tuned model for safekeeping

hf_token = userdata.get('HF_TOKEN')
login(token=hf_token)
# Load dataset
# Make sure the dataset is uploaded to the runtime
df = pd.read_csv('Dataset_comments_seriallly.csv')

# Check the dataset structure
df.head()


The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Unnamed: 0,S/N,Video_ID,Comment,Annotation,Language
0,1,FXNoo4BjNOE,Well it&#39;s allowed,Neutral,English
1,2,FXNoo4BjNOE,Asalamu alekum barka da sallah wai yaushe za a...,Neutral,Hausa
2,3,FXNoo4BjNOE,"Pls I don’t understand, I thought labarina end...",Neutral,English
3,4,FXNoo4BjNOE,Yarayu waka daha<br>Akashe taman ahuta,Negetive,Hausa
4,5,FXNoo4BjNOE,Ah Abu yayikyau sosai❤❤,Positive,Hausa


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   S/N         5000 non-null   int64 
 1   Video_ID    5000 non-null   object
 2   Comment     5000 non-null   object
 3   Annotation  5000 non-null   object
 4   Language    5000 non-null   object
dtypes: int64(1), object(4)
memory usage: 195.4+ KB


In [None]:
df['Annotation'].value_counts(normalize=True)

Annotation,proportion
Positive,0.568
Negetive,0.2328
Neutral,0.199
negetive,0.0002
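
The counts above show the same class under two spellings that differ only in case. A minimal sketch of a normalizer that also tolerates stray whitespace and unexpected capitalization is shown below. `CANONICAL_LABELS` and `normalize_label` are hypothetical names, and the mapping only covers the spellings visible in the output above.

```python
# Lower-cased raw spelling -> canonical label. Only spellings seen in the
# value counts above are listed; anything else is treated as an error.
CANONICAL_LABELS = {
    "positive": "Positive",
    "negative": "Negative",
    "negetive": "Negative",  # misspelling present in the dataset
    "neutral": "Neutral",
}


def normalize_label(raw):
    """Map a raw annotation string to its canonical label."""
    key = raw.strip().casefold()
    if key not in CANONICAL_LABELS:
        raise ValueError(f"Unexpected label: {raw!r}")
    return CANONICAL_LABELS[key]


print([normalize_label(value) for value in ["Negetive", "negetive", " Positive "]])
```

On a DataFrame column this could be applied with `.map(normalize_label)`; raising on unknown values makes a new misspelling visible instead of silently creating a fourth class.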


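The negative class appears under two spellings ('Negetive' and 'negetive'), so both need to be mapped to a single label.

There are 5000 entries, and the value counts above show that the labels are imbalanced: more than half of the comments are Positive. Before dealing with that, some preprocessing is done in the cell below: the columns are renamed, the labels are corrected, and the text is cleaned to make training easier.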

In [None]:
# Rename columns for convenience
df = df.rename(columns={'Comment': 'text', 'Annotation': 'label'})
data = df[['text', 'label']].dropna(subset=['text', 'label']).reset_index(drop=True)

# Correct misspelled labels
data['label'] = data['label'].replace({'Negetive': 'Negative', 'negetive': 'Negative'})

# Display value counts after correction
print(f"\nValue counts after correction: \n {data['label'].value_counts(normalize=True)}")

# Convert text to lowercase and clean text
data['text'] = data['text'].str.lower()
def clean_text(text):
    text = re.sub(r'<.*?>', ' ', text)  # Replace HTML tags (e.g. <br>) with a space so adjacent words are not joined
    text = re.sub(r'http\S+|www\.\S+', ' ', text)  # Remove URLs
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove everything except ASCII letters and whitespace
    text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace
    return text.strip()

data['text'] = data['text'].apply(clean_text)


Value counts after correction: 
 label
Positive    0.568
Negative    0.233
Neutral     0.199
Name: proportion, dtype: float64
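
One thing to keep in mind about `clean_text`: the pattern `[^a-zA-Z\s]` removes every character that is not an ASCII letter, which includes emoji and non-ASCII letters such as the Hausa ɓ, ɗ and ƙ. If that turns out to matter, a Unicode-aware variant could look like the sketch below. The function name `clean_text_unicode` is hypothetical, and this is an alternative to the notebook's cleaning, not what was used for the results reported here.

```python
import html
import re


def clean_text_unicode(text: str) -> str:
    """Hypothetical alternative to clean_text that keeps non-ASCII letters."""
    text = html.unescape(text)                          # "&#39;" becomes "'"
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags such as <br> become a space
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop URLs
    text = re.sub(r"['’]", "", text)                    # drop apostrophes so "it's" becomes "its"
    text = re.sub(r"[^\w\s]|[\d_]", " ", text)          # drop punctuation, emoji, digits, underscores
    text = re.sub(r"\s+", " ", text)                    # collapse runs of whitespace
    return text.strip().lower()


# Sample comments taken from the df.head() output above.
for comment in [
    "Well it&#39;s allowed",
    "Yarayu waka daha<br>Akashe taman ahuta",
    "Ah Abu yayikyau sosai❤❤",
]:
    print(clean_text_unicode(comment))
```

In Python 3, `\w` matches Unicode letters by default, which is what lets the hooked Hausa letters survive.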


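The imbalance is addressed by randomly duplicating rows of the under-represented labels (Negative and Neutral). This is done on the training split only, so the test set keeps its original distribution. In addition, the labels are encoded as integers, since the model expects numeric class ids rather than strings. Everything is commented for clarity.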

In [None]:
# Encode sentiment labels to numerical values
label_encoder = LabelEncoder()
data['label'] = label_encoder.fit_transform(data['label'])

# Train-test split
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

# Separate majority and minority classes
df_positive = train_data[train_data['label'] == label_encoder.transform(['Positive'])[0]]
df_negative = train_data[train_data['label'] == label_encoder.transform(['Negative'])[0]]
df_neutral = train_data[train_data['label'] == label_encoder.transform(['Neutral'])[0]]

# Find the size of the majority class
majority_size = max(len(df_positive), len(df_negative), len(df_neutral))

# Oversample minority classes to match the majority class size
df_negative_upsampled = resample(df_negative, replace=True, n_samples=majority_size, random_state=42)
df_neutral_upsampled = resample(df_neutral, replace=True, n_samples=majority_size, random_state=42)

# Combine the majority class with upsampled minority classes
train_data_balanced = pd.concat([df_positive, df_negative_upsampled, df_neutral_upsampled])

# Display new class distribution after oversampling
train_data_balanced['label'].value_counts()


label,count
2,2319
0,2319
1,2319
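
To make the oversampling step concrete, here is a minimal sketch of the same idea in plain Python on toy data. It is not the notebook's implementation: `sklearn.utils.resample` with `replace=True` draws all `n_samples` rows with replacement, so some original minority rows may be left out, whereas this sketch keeps every original row and only adds random duplicates. The function name `oversample` and the toy rows are made up for illustration.

```python
import random
from collections import Counter, defaultdict


def oversample(rows, seed=42):
    """Duplicate random rows of the smaller classes until every class matches the largest."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for text, label in rows:
        groups[label].append((text, label))
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)                                      # keep every original row
        balanced.extend(rng.choices(group, k=target - len(group)))  # add duplicates
    rng.shuffle(balanced)
    return balanced


# Toy (text, label) rows: three of class 2, one each of classes 0 and 1.
toy_rows = [("nice", 2), ("god bless", 2), ("great", 2), ("bad", 0), ("ok", 1)]
balanced = oversample(toy_rows)
print(Counter(label for _, label in balanced))
```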


In [None]:
train_data_balanced.head()

Unnamed: 0,text,label
4227,up yakubu m kumo the story writer,2
800,the brilliant and intelligent director which i...,2
3671,i really appreciate you aminu saira,2
4193,god bless best director,2
4793,nice,2
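
The integer labels in this output come from `LabelEncoder`, which assigns ids in sorted order of the class names. The sketch below reproduces that mapping in plain Python so the ids can be read back as names; in the notebook itself, `label_encoder.classes_` gives the same ordering. The names `label2id` and `id2label` are chosen here for illustration.

```python
# The three corrected class names from the dataset.
class_names = ["Positive", "Negative", "Neutral"]

# LabelEncoder sorts the unique classes, so ids follow alphabetical order.
label2id = {name: index for index, name in enumerate(sorted(set(class_names)))}
id2label = {index: name for name, index in label2id.items()}

print(label2id)
print(id2label)
```

Passing these two dictionaries as `id2label` and `label2id` when loading the model with `from_pretrained` is one way to have the saved config report readable names instead of `LABEL_0`, `LABEL_1`, `LABEL_2`.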


In [None]:
train_data_balanced.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6957 entries, 4227 to 664
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    6957 non-null   object
 1   label   6957 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 163.1+ KB


## BERT-BASE-UNCASED

Now the training set is balanced and training can begin. Balancing by duplication has disadvantages (the model sees the same minority examples repeatedly, which can encourage overfitting), but it is preferable here to training on the imbalanced data.

In the remainder of the notebook, the Hugging Face transformers Trainer class is used to fine-tune the models, because it offers a higher-level interface than writing the training loop directly in PyTorch.

First, bert-base-uncased is fine-tuned, and then another model is fine-tuned for comparison. Accuracy and macro F1 are computed for both models.

The same process can be used for other models if needed.

Another thing to note is that the model is trained on the complete dataset, which contains both English and Hausa comments. Training on a single language may give different results. Checking this would require filtering the dataset to one language before the train-test split.
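
A minimal sketch of the language filter mentioned above, written on plain Python rows so it runs without the dataset. The helper name `filter_language` is hypothetical. Note that in the notebook the `Language` column is dropped when `data` is built, so the filter would have to be applied to `df` first.

```python
def filter_language(rows, language):
    """Keep only the rows whose Language field matches, ignoring case and spaces."""
    wanted = language.strip().casefold()
    return [row for row in rows if row["Language"].strip().casefold() == wanted]


# A few rows shaped like the df.head() output shown earlier.
rows = [
    {"Comment": "Well it's allowed", "Annotation": "Neutral", "Language": "English"},
    {"Comment": "Yarayu waka daha Akashe taman ahuta", "Annotation": "Negetive", "Language": "Hausa"},
    {"Comment": "Ah Abu yayikyau sosai", "Annotation": "Positive", "Language": "Hausa"},
]

hausa_rows = filter_language(rows, "Hausa")
print(len(hausa_rows))

# With pandas, the equivalent would be a boolean mask on the original frame:
#   df_hausa = df[df["Language"] == "Hausa"]
```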

In [None]:
# Tokenization using Hugging Face tokenizer
model_checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Tokenize the datasets
tokenized_train_dataset = Dataset.from_pandas(train_data_balanced).map(tokenize_function, batched=True)
tokenized_test_dataset = Dataset.from_pandas(test_data).map(tokenize_function, batched=True)

# Load model for training
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=3)




Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [None]:
# Define evaluation metrics
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)
    acc = accuracy.compute(predictions=predictions, references=labels)
    f1_macro = f1.compute(predictions=predictions, references=labels, average='macro')
    return {"accuracy": acc['accuracy'], "f1": f1_macro['f1']}
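
Because the test set is imbalanced, accuracy and macro F1 can disagree a lot, which is why both are reported. The sketch below computes macro F1 by hand on toy data to show the effect: a model that always predicts the majority class gets a decent accuracy but a poor macro F1. The function `macro_f1` is a plain-Python illustration, not the `evaluate` implementation, and the toy labels are made up.

```python
def macro_f1(predictions, references):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(references) | set(predictions))
    scores = []
    for label in labels:
        pairs = list(zip(predictions, references))
        tp = sum(p == label and r == label for p, r in pairs)
        fp = sum(p == label and r != label for p, r in pairs)
        fn = sum(p != label and r == label for p, r in pairs)
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Toy test set: six Positive (2), two Negative (0), two Neutral (1).
references = [2, 2, 2, 2, 2, 2, 0, 0, 1, 1]
always_positive = [2] * len(references)

accuracy_value = sum(p == r for p, r in zip(always_positive, references)) / len(references)
print(accuracy_value, macro_f1(always_positive, references))
```

Here the majority class alone scores F1 = 0.75, but the two classes that are never predicted score 0, pulling the macro average down to 0.25 while accuracy stays at 0.6.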


In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir=f"./results/{model_checkpoint}",
    evaluation_strategy="steps",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    load_best_model_at_end=True,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()

# Evaluate the model
eval_results = trainer.evaluate()
print(f"Evaluation results for {model_checkpoint}:")
print(eval_results)

# Save the model to Hugging Face Hub
model.push_to_hub(f"{model_checkpoint}-sentiment-analysis")
tokenizer.push_to_hub(f"{model_checkpoint}-sentiment-analysis")


Step,Training Loss,Validation Loss,Accuracy,F1
10,1.0896,1.002722,0.548,0.322337
20,1.0831,1.026122,0.491,0.439299
30,0.985,1.050096,0.479,0.472708
40,1.0214,0.934423,0.589,0.527517
50,0.9353,0.903833,0.607,0.556892
60,0.944,0.99048,0.458,0.383058
70,0.9241,0.89474,0.604,0.561808
80,0.937,0.903944,0.585,0.535763
90,0.8661,0.844327,0.621,0.574525
100,0.8106,0.857852,0.591,0.503526


Evaluation results for bert-base-uncased:
{'eval_loss': 0.5905988812446594, 'eval_accuracy': 0.797, 'eval_f1': 0.7562039900218958, 'eval_runtime': 16.7338, 'eval_samples_per_second': 59.759, 'eval_steps_per_second': 3.765, 'epoch': 3.0}
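
The training log evaluates every 10 steps, which is worth a quick back-of-the-envelope check. With `evaluation_strategy="steps"` and no explicit `eval_steps`, the Trainer falls back to `logging_steps` as the evaluation interval. The sketch below estimates how many evaluation passes that implies, assuming a single device, no gradient accumulation, and that the last partial batch is kept. The helper `eval_schedule` is hypothetical.

```python
import math


def eval_schedule(num_examples, batch_size, epochs, eval_every):
    """Return (steps per epoch, total steps, number of evaluation passes)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    return steps_per_epoch, total_steps, total_steps // eval_every


# 6957 balanced training rows, batch size 16, 3 epochs, evaluation every 10 steps.
steps_per_epoch, total_steps, eval_passes = eval_schedule(6957, 16, 3, 10)

# Rough cost, assuming every pass takes as long as the final one (eval_runtime above).
eval_minutes = eval_passes * 16.7338 / 60
print(steps_per_epoch, total_steps, eval_passes, round(eval_minutes))
```

Under these assumptions a large share of the run is spent evaluating, so a larger `eval_steps` (for example once or twice per epoch) may shorten training considerably.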



CommitInfo(commit_url='https://huggingface.co/lukmanaj/bert-base-uncased-sentiment-analysis/commit/b8917ca8a0d7b0ba11e997bb2092a6c10d68f939', commit_message='Upload tokenizer', commit_description='', oid='b8917ca8a0d7b0ba11e997bb2092a6c10d68f939', pr_url=None, pr_revision=None, pr_num=None)

## ANOTHER MODEL: cardiffnlp/twitter-roberta-base-sentiment-latest




This time cardiffnlp/twitter-roberta-base-sentiment-latest is fine-tuned with the same procedure.

In [None]:
# Tokenization using Hugging Face tokenizer
model_checkpoint = "cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
# Load model for training
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=3)

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Tokenize the datasets
tokenized_train_dataset = Dataset.from_pandas(train_data_balanced).map(tokenize_function, batched=True)
tokenized_test_dataset = Dataset.from_pandas(test_data).map(tokenize_function, batched=True)




Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
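
The two warnings above mean that, for this checkpoint, `padding="max_length"` and `truncation=True` had no effect because the tokenizer does not define a maximum length. One way to make the behaviour explicit and identical across models is to pass `max_length` in the tokenizer call. The sketch below wraps that in a small factory; the name `make_tokenize_function` and the value 128 are assumptions for illustration, not settings taken from the notebook.

```python
def make_tokenize_function(tokenizer, max_length=128):
    """Build a function for Dataset.map that pads and truncates to a fixed length.

    tokenizer is any callable with the Hugging Face tokenizer call signature;
    max_length=128 is an assumed value, not one used in the notebook.
    """
    def tokenize_function(examples):
        return tokenizer(
            examples["text"],
            padding="max_length",
            truncation=True,
            max_length=max_length,
        )
    return tokenize_function


# In the notebook this would replace the per-model tokenize_function definitions:
#   tokenize_function = make_tokenize_function(tokenizer, max_length=128)
#   tokenized_train_dataset = Dataset.from_pandas(train_data_balanced).map(tokenize_function, batched=True)
```

Short comments like these rarely need the 512 tokens that `bert-base-uncased` pads to by default, so a smaller fixed length may also speed up training.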



In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir=f"./results/{model_checkpoint}",
    evaluation_strategy="steps",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    load_best_model_at_end=True,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()

# Evaluate the model
eval_results = trainer.evaluate()
print(f"Evaluation results for {model_checkpoint}:")
print(eval_results)

# Save the model to Hugging Face Hub
model.push_to_hub(f"finetuned-twitter-roberta-base-sentiment-latest")
tokenizer.push_to_hub(f"finetuned-twitter-roberta-base-sentiment-latest")



Step,Training Loss,Validation Loss,Accuracy,F1
10,0.9986,0.795043,0.64,0.600545
20,0.8704,0.812083,0.558,0.473346
30,0.7354,0.861676,0.661,0.632205
40,0.9181,0.709229,0.69,0.620486
50,0.7941,0.743679,0.639,0.618617
60,0.7965,0.834132,0.55,0.486768
70,0.7693,0.741998,0.661,0.615514
80,0.7442,0.704464,0.693,0.65192
90,0.6427,0.704407,0.703,0.646266
100,0.706,0.970408,0.587,0.510166


Evaluation results for cardiffnlp/twitter-roberta-base-sentiment-latest:
{'eval_loss': 0.6855161786079407, 'eval_accuracy': 0.766, 'eval_f1': 0.7291761104488804, 'eval_runtime': 2.2712, 'eval_samples_per_second': 440.305, 'eval_steps_per_second': 27.739, 'epoch': 3.0}



CommitInfo(commit_url='https://huggingface.co/lukmanaj/finetuned-twitter-roberta-base-sentiment-latest/commit/16d2d3dcfcff8f2e261c23f0b77dd6431f117265', commit_message='Upload tokenizer', commit_description='', oid='16d2d3dcfcff8f2e261c23f0b77dd6431f117265', pr_url=None, pr_revision=None, pr_num=None)

bert-base-uncased took longer to train but gave better accuracy (0.797 vs 0.766) and macro F1 (0.756 vs 0.729) on the test set.

Your training output may look different because I ran mine on Kaggle, but the code should also work here and the guide should be easy to follow. I think the only difference is the wandb login prompt that appears when training starts. You can train without logging to wandb, for example by passing `report_to="none"` to TrainingArguments.

Finally, the fine-tuned models are saved to the Hugging Face Hub. This requires a Hugging Face account; if you don't have one, you can comment out the lines below:


```
# Save the model to Hugging Face Hub
model.push_to_hub(f"{model_checkpoint}-sentiment-analysis")
tokenizer.push_to_hub(f"{model_checkpoint}-sentiment-analysis")

```

```
# Save the model to Hugging Face Hub
model.push_to_hub(f"finetuned-twitter-roberta-base-sentiment-latest")
tokenizer.push_to_hub(f"finetuned-twitter-roberta-base-sentiment-latest")

```

Those are the upload calls for the two fine-tuned models.

Saving the model online simply makes it easier to run inference and evaluation afterwards, unless you have significant resources on your own computer.
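
As an illustration of running inference from the Hub afterwards, here is a minimal sketch using the transformers `pipeline` API. The repository id is the one shown in the upload output earlier and is assumed to be publicly accessible. Since no label names were set when the BERT model was loaded, its predictions are assumed to come back as `LABEL_0`, `LABEL_1`, `LABEL_2`, which the hypothetical helper `to_sentiment` maps back to class names in `LabelEncoder` order.

```python
# Assumed mapping: LabelEncoder sorted the classes alphabetically.
LABEL_NAMES = {"LABEL_0": "Negative", "LABEL_1": "Neutral", "LABEL_2": "Positive"}


def to_sentiment(prediction):
    """Turn one pipeline prediction dict into a (class name, score) pair."""
    label = prediction["label"]
    return LABEL_NAMES.get(label, label), prediction["score"]


def classify(texts, repo_id="lukmanaj/bert-base-uncased-sentiment-analysis"):
    """Download the fine-tuned model from the Hub and classify a list of texts."""
    from transformers import pipeline  # imported lazily; needs network access

    classifier = pipeline("text-classification", model=repo_id)
    return [to_sentiment(prediction) for prediction in classifier(texts)]


# Example call (downloads the model, so it is left commented out):
#   classify(["god bless best director"])
```

Texts passed to `classify` should be cleaned the same way as the training data (lowercased and passed through `clean_text`) for the predictions to be comparable.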


## Afro-XLMR_Large

In [None]:
# Tokenization using Hugging Face tokenizer
model_checkpoint = "Davlan/afro-xlmr-large"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
# Load model for training
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=3)

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Tokenize the datasets
tokenized_train_dataset = Dataset.from_pandas(train_data_balanced).map(tokenize_function, batched=True)
tokenized_test_dataset = Dataset.from_pandas(test_data).map(tokenize_function, batched=True)



Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at Davlan/afro-xlmr-large and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir=f"./results/{model_checkpoint}",
    evaluation_strategy="steps",
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps = 10,
    fp16 = True,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    load_best_model_at_end=True,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()

# Evaluate the model
eval_results = trainer.evaluate()
print(f"Evaluation results for {model_checkpoint}:")
print(eval_results)

# Save the model to Hugging Face Hub
model.push_to_hub(f"finetuned-afro-xlmr-large-sent")
tokenizer.push_to_hub(f"finetuned-afro-xlmr-large-sent")



Step,Training Loss,Validation Loss,Accuracy,F1
10,1.1565,1.028024,0.523,0.248321
20,1.106,1.201589,0.3,0.286839
30,1.144,1.145824,0.372,0.359984
40,1.1129,1.025278,0.528,0.356661
50,1.1097,1.106459,0.348,0.33835
60,1.0698,1.124247,0.369,0.275942
70,1.1096,1.085776,0.337,0.335534
80,1.1202,1.077097,0.53,0.342587
90,1.0984,1.130034,0.253,0.207386
100,1.1097,1.074089,0.372,0.271041
