# Finetuning a Transformer Model

The purpose of this project is to finetune a language model on a dataset that I built from scratch.

To determine whether a news headline conveys positive, neutral, or negative sentiment, I scraped news titles from the BBC Igbo website and annotated each one with the corresponding sentiment. I avoided Twitter data because an Igbo sentiment dataset from that source already exists.

In [None]:
!pip install transformers datasets


In [None]:
# Import libraries

import requests
from bs4 import BeautifulSoup
from pprint import pprint
import pandas as pd
import torch

# Note: sklearn's train_test_split is not needed; the split below uses Dataset.train_test_split
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, logging
from datasets import load_dataset, Dataset

## 1. Build the Dataset 

In [None]:
# Scrape news titles from the BBC Igbo topic pages (pages 1 to 10)

general_news_title = []
for i in range(1, 11):
  response = requests.get('https://www.bbc.com/igbo/topics/c3l19z3qjmyt?page=' + str(i), timeout=30)
  response.raise_for_status()  # stop on an HTTP error instead of parsing an error page
  doc = BeautifulSoup(response.text, 'html.parser')
  # These class names are generated by the site and may change over time
  links = doc.find_all('a', { 'class': 'focusIndicatorDisplayBlock bbc-uk8dsi e1d658bg0' })

  for link in links:
    general_news_title.append(link.text)
    print(link.text)

print(len(general_news_title))

Onyeisi ndị uweojii na Kenya agbaala arụkwaghịm
Okwu Nnamdi Kanu bụ okwu nwereonwe, ọ bụghị okwu nchekwa - Aloy Ejimakor
Africa Eye: Etu e si ebo ndị agadị ebubo amosu iji napụ ha ala ha
Ọrịa ndị a na-enweta site na mmiri ọjọọ
'Ndị mmadụ na-akpachapụ anyị n'ihi na anyị nwere ọrịa 'Sickle Cell anaemia'
Vidio, Lee ihe bụ Menopause ụmụnwoke a kpọrọ 'Andropause'Duration, 2,58
Vidio, Ụgwọ opekatampe: Ihe mere gọọmentị agaghị akwụnwu ndị ọrụ ego ole ha chọrọ - OnyejeochaDuration, 6,00
Vidio, Etu ụlọelu ogogo ise si daa n'ụlọakwụkwọ DMGS dị n'Onitsha, Anambra SteetiDuration, 2,06
Vidio, "Junior Pope anwụchaala tupu a gụpụta ya na mmiri" Duration, 7,09
Vidio, Ihe mere m jiri buru ngwa ahịa m site n'ala Igbo gawa Legọs - Cubana ChiefpriestDuration, 4,15
Vidio, Ihe mere m ga saa ara m - Eniola, nwaagbọghọ sara ara yaDuration, 5,43
Vidio, Agụrụ, mmanụ ụgbọala, dọla na ụzọ ndị ọzo ọchịchị Tinubu n'otu afọ siri metụta ndị NaịjirịaDuration, 1,13
Vidio, "Ọchịchị onyekwuouche ya ka dị ndụ n'Afrịka" - 

In [None]:
# Clean the dataset: video entries carry a "Vidio, " prefix and a trailing "Duration, m,ss" that need to be removed

cleaned_titles = []
for title in general_news_title:
  if title.startswith('Vidio, '):
    title = title[len('Vidio,'):].strip()  # Remove "Vidio," from the beginning
    duration_index = title.find('Duration')
    if duration_index != -1:
      title = title[:duration_index].strip()  # Remove everything from "Duration" onwards
  cleaned_titles.append(title)
  print(title)



Onyeisi ndị uweojii na Kenya agbaala arụkwaghịm
Okwu Nnamdi Kanu bụ okwu nwereonwe, ọ bụghị okwu nchekwa - Aloy Ejimakor
Africa Eye: Etu e si ebo ndị agadị ebubo amosu iji napụ ha ala ha
Ọrịa ndị a na-enweta site na mmiri ọjọọ
'Ndị mmadụ na-akpachapụ anyị n'ihi na anyị nwere ọrịa 'Sickle Cell anaemia'
Lee ihe bụ Menopause ụmụnwoke a kpọrọ 'Andropause'
Ụgwọ opekatampe: Ihe mere gọọmentị agaghị akwụnwu ndị ọrụ ego ole ha chọrọ - Onyejeocha
Etu ụlọelu ogogo ise si daa n'ụlọakwụkwọ DMGS dị n'Onitsha, Anambra Steeti
"Junior Pope anwụchaala tupu a gụpụta ya na mmiri"
Ihe mere m jiri buru ngwa ahịa m site n'ala Igbo gawa Legọs - Cubana Chiefpriest
Ihe mere m ga saa ara m - Eniola, nwaagbọghọ sara ara ya
Agụrụ, mmanụ ụgbọala, dọla na ụzọ ndị ọzo ọchịchị Tinubu n'otu afọ siri metụta ndị Naịjirịa
"Ọchịchị onyekwuouche ya ka dị ndụ n'Afrịka" - Goodluck Jonathan
Akwamozu nne Obi Cubana gbanwere ndụ m - Anyidons
Etu m si tụọ ime mụọ nwa na-agbanyeghi ọrịa PCOS - Stephanie Coker Aderinokun
‘Egwurueg
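The cleaning step above can also be written as a small reusable function, which makes it easier to check on individual titles. This is a minimal sketch, not the code that produced the dataset: the name `clean_title` is hypothetical, and it uses `rfind` instead of `find` on the assumption that the runtime text always comes last in the title.

```python
def clean_title(title: str) -> str:
    """Strip the 'Vidio, ' prefix and the trailing 'Duration, m,ss' text from a scraped title."""
    prefix = "Vidio, "
    if title.startswith(prefix):
        title = title[len(prefix):]
        # Assumption: the runtime is the last thing in the title, so cut at the last marker
        duration_index = title.rfind("Duration")
        if duration_index != -1:
            title = title[:duration_index]
    return title.strip()


# Two titles taken from the scraped output above
samples = [
    "Vidio, Lee ihe bụ Menopause ụmụnwoke a kpọrọ 'Andropause'Duration, 2,58",
    "Ọrịa ndị a na-enweta site na mmiri ọjọọ",
]
cleaned = [clean_title(s) for s in samples]
for title in cleaned:
    print(title)
```

Titles without the video prefix pass through unchanged, apart from whitespace trimming.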

In [None]:
dataset = pd.DataFrame({'Sentence': cleaned_titles})

In [None]:
dataset.to_csv('dataset.csv')

**I downloaded the CSV and manually annotated each title with its sentiment before uploading the dataset to Hugging Face.**

**Link to the dataset on Hugging Face Hub:** https://huggingface.co/datasets/Ifyokoh/IgboSenti-BBC


## 2. Finetune a Foundation Model

Now that the dataset is ready, it's time to pick a base model to finetune.

I used AfriBERTa (`castorini/afriberta_large`), loaded through the `transformers` library.


In [None]:
# Load dataset from huggingface

dataset = load_dataset("Ifyokoh/IgboSenti-BBC")

dataset

DatasetDict({
    train: Dataset({
        features: ['Sentence', 'label'],
        num_rows: 239
    })
})

In [None]:
df = dataset['train'].to_pandas()

df.head()

| | Sentence | label |
|---|---|---|
| 0 | Onyeisi ndị uweojii na Kenya agbaala arụkwaghịm | Negative |
| 1 | Okwu Nnamdi Kanu bụ okwu nwereonwe, ọ bụghị ok... | Neutral |
| 2 | Africa Eye: Etu e si ebo ndị agadị ebubo amosu... | Negative |
| 3 | Ọrịa ndị a na-enweta site na mmiri ọjọọ | Negative |
| 4 | Ndị mmadụ na-akpachapụ anyị n'ihi na anyị nwer... | Negative |


In [None]:
df.groupby('label').count()

| label | Sentence |
|---|---|
| Negative | 93 |
| Neutral | 41 |
| Positive | 105 |
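The counts above show a class imbalance, with Neutral being the smallest class. As a rough reference point, the short sketch below works out each class's share and the accuracy that a classifier would get by always predicting the most frequent label. The numbers are copied from the table; the variable names are only for illustration.

```python
# Label counts copied from the groupby output above
label_counts = {"Negative": 93, "Neutral": 41, "Positive": 105}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)
majority_baseline = label_counts[majority_label] / total

for label, count in label_counts.items():
    print(f"{label}: {count} ({count / total:.1%})")
print(f"Majority label: {majority_label}")
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any accuracy reported for the fine-tuned model is easier to interpret when compared against this baseline.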


In [None]:
# Map categorical labels to numerical values
# The model requires integer labels for classification tasks.

label_mapping = {'Positive': 0, 'Neutral': 1, 'Negative': 2}
df['label'] = df['label'].map(label_mapping)
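Because the model only sees integers, it helps to keep the inverse mapping around so predictions can be turned back into label names. The sketch below builds it and adds a small check for label strings that are missing from the mapping, since `Series.map` silently turns those into `NaN`. The helper `check_labels` is hypothetical and not part of the original notebook.

```python
label_mapping = {"Positive": 0, "Neutral": 1, "Negative": 2}

# Invert the mapping so predicted ids can be turned back into names
id2label = {idx: name for name, idx in label_mapping.items()}


def check_labels(raw_labels):
    """Return label strings that are missing from label_mapping (these would map to NaN)."""
    return sorted(set(raw_labels) - set(label_mapping))


print(id2label[2])
# A lowercase typo such as "negative" is not in the mapping and would be flagged
print(check_labels(["Positive", "negative", "Neutral"]))
```

These two dictionaries can also be passed as `id2label` and `label2id` when loading the model with `from_pretrained`, so that the saved config stores the real names instead of `LABEL_0`, `LABEL_1`, and `LABEL_2`.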


In [None]:
# Create a Hugging Face dataset

dataset = Dataset.from_pandas(df)

In [None]:
# Load the base model and tokenizer

model_name = "castorini/afriberta_large"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.model_max_length = 512


# Tokenization
# Note: padding='max_length' pads every title to 512 tokens, which slows training on short texts
def tokenize_function(examples):
    return tokenizer(examples['Sentence'], padding='max_length', truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)


# Split the dataset into training and validation sets (80/20)
# The result is named split_datasets to avoid clashing with sklearn's train_test_split
split_datasets = tokenized_datasets.train_test_split(test_size=0.2)
train_dataset = split_datasets['train']
eval_dataset = split_datasets['test']

Some weights of XLMRobertaForSequenceClassification were not initialized from the model checkpoint at castorini/afriberta_large and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [None]:
# Set up logging

logging.set_verbosity_info()
logging.enable_default_handler()
logging.enable_explicit_format()


In [None]:
# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    logging_strategy="steps",
    logging_steps=10,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)

# train the model
trainer.train()

[INFO|training_args.py:2048] 2024-07-20 01:28:56,665 >> PyTorch: setting up devices
[INFO|training_args.py:1751] 2024-07-20 01:28:56,668 >> The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
[INFO|trainer.py:805] 2024-07-20 01:28:57,373 >> The following columns in the training set don't have a corresponding argument in `XLMRobertaForSequenceClassification.forward` and have been ignored: Sentence. If Sentence are not expected by `XLMRobertaForSequenceClassification.forward`,  you can safely ignore this message.
[INFO|trainer.py:2128] 2024-07-20 01:28:57,397 >> ***** Running training *****
[INFO|trainer.py:2129] 2024-07-20 01:28:57,399 >>   Num examples = 191
[INFO|trainer.py:2130] 2024-07-20 01:28:57,401 >>   Num Epochs = 3
[INFO|trainer.py:2131] 2024-07-20 01:28:57,402 >> 

| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.6949 | 0.990783 |
| 2 | 0.598 | 1.009145 |
| 3 | 0.4454 | 1.05066 |


[INFO|trainer.py:805] 2024-07-20 01:46:21,618 >> The following columns in the evaluation set don't have a corresponding argument in `XLMRobertaForSequenceClassification.forward` and have been ignored: Sentence. If Sentence are not expected by `XLMRobertaForSequenceClassification.forward`,  you can safely ignore this message.
[INFO|trainer.py:3788] 2024-07-20 01:46:21,624 >> 
***** Running Evaluation *****
[INFO|trainer.py:3790] 2024-07-20 01:46:21,627 >>   Num examples = 48
[INFO|trainer.py:3793] 2024-07-20 01:46:21,629 >>   Batch size = 8
[INFO|trainer.py:805] 2024-07-20 02:04:55,573 >> The following columns in the evaluation set don't have a corresponding argument in `XLMRobertaForSequenceClassification.forward` and have been ignored: Sentence. If Sentence are not expected by `XLMRobertaForSequenceClassification.forward`,  you can safely ignore this message.
[INFO|trainer.py:3788] 2024-07-20 02:04:55,581 >> 
***** Running Evaluation *****
[INFO|trainer.py:3790] 2024-07-20 02:04:55,58

TrainOutput(global_step=72, training_loss=0.6114468706978692, metrics={'train_runtime': 3360.8857, 'train_samples_per_second': 0.17, 'train_steps_per_second': 0.021, 'total_flos': 125811049927680.0, 'train_loss': 0.6114468706978692, 'epoch': 3.0})
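The `global_step=72` in the output above follows from the dataset size and the training arguments. The sketch below reproduces the arithmetic, assuming a single device, no gradient accumulation, and a test split that is rounded up (which matches the 191 and 48 examples reported in the log).

```python
import math

num_rows = 239        # rows in the dataset
test_size = 0.2       # fraction held out for validation
batch_size = 8        # per_device_train_batch_size
num_epochs = 3        # num_train_epochs
train_runtime = 3360.8857  # seconds, from TrainOutput

eval_examples = math.ceil(num_rows * test_size)
train_examples = num_rows - eval_examples
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(f"train examples: {train_examples}, eval examples: {eval_examples}")
print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
print(f"seconds per step: {train_runtime / total_steps:.1f}")
```

The time per step is high for 8 short titles, which is consistent with every input being padded to 512 tokens.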

In [None]:
# Evaluate the model
results = trainer.evaluate()
print(results)

# Save the fine-tuned model
model.save_pretrained("./finetuned_model")
tokenizer.save_pretrained("./finetuned_model")

[INFO|trainer.py:805] 2024-07-20 02:27:43,315 >> The following columns in the evaluation set don't have a corresponding argument in `XLMRobertaForSequenceClassification.forward` and have been ignored: Sentence. If Sentence are not expected by `XLMRobertaForSequenceClassification.forward`,  you can safely ignore this message.
[INFO|trainer.py:3788] 2024-07-20 02:27:43,324 >> 
***** Running Evaluation *****
[INFO|trainer.py:3790] 2024-07-20 02:27:43,326 >>   Num examples = 48
[INFO|trainer.py:3793] 2024-07-20 02:27:43,332 >>   Batch size = 8


Non-default generation parameters: {'max_length': 512}
[INFO|configuration_utils.py:472] 2024-07-20 02:29:04,302 >> Configuration saved in ./finetuned_model/config.json


{'eval_loss': 1.0506597757339478, 'eval_runtime': 80.9658, 'eval_samples_per_second': 0.593, 'eval_steps_per_second': 0.074, 'epoch': 3.0}


[INFO|modeling_utils.py:2690] 2024-07-20 02:29:19,698 >> Model weights saved in ./finetuned_model/model.safetensors
[INFO|tokenization_utils_base.py:2574] 2024-07-20 02:29:19,709 >> tokenizer config file saved in ./finetuned_model/tokenizer_config.json
[INFO|tokenization_utils_base.py:2583] 2024-07-20 02:29:19,713 >> Special tokens file saved in ./finetuned_model/special_tokens_map.json


('./finetuned_model/tokenizer_config.json',
 './finetuned_model/special_tokens_map.json',
 './finetuned_model/sentencepiece.bpe.model',
 './finetuned_model/added_tokens.json',
 './finetuned_model/tokenizer.json')
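The evaluation output above reports only the loss, which makes it hard to judge classification quality. One option is a metrics function that computes accuracy and macro-F1 from the logits. The version below is a sketch using NumPy only; the toy logits are made up for the check and are not results from this notebook.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Return accuracy and macro-F1 given a (logits, labels) pair."""
    logits, labels = eval_pred
    logits = np.asarray(logits)
    labels = np.asarray(labels)
    preds = np.argmax(logits, axis=-1)
    accuracy = float((preds == labels).mean())

    f1_scores = []
    for cls in range(logits.shape[-1]):
        tp = np.sum((preds == cls) & (labels == cls))
        fp = np.sum((preds == cls) & (labels != cls))
        fn = np.sum((preds != cls) & (labels == cls))
        denom = 2 * tp + fp + fn
        # A class with no predictions and no true examples gets an F1 of 0
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return {"accuracy": accuracy, "macro_f1": float(np.mean(f1_scores))}


# Toy check with made-up logits (illustration only)
toy_logits = np.array([[2.0, 0.1, 0.1], [0.1, 2.0, 0.1], [0.1, 0.1, 2.0], [2.0, 0.1, 0.1]])
toy_labels = np.array([0, 1, 2, 2])
metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)
```

Passing `compute_metrics=compute_metrics` when constructing the `Trainer` would add these values to each evaluation report alongside `eval_loss`.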

In [None]:
# Load the fine-tuned model
fine_tuned_model = AutoModelForSequenceClassification.from_pretrained("./finetuned_model")
fine_tuned_tokenizer = AutoTokenizer.from_pretrained("./finetuned_model")


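# Function to generate predictions (returns the predicted class id)
def predict(text, model, tokenizer):
    model.eval()  # disable dropout so predictions are deterministic
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}  # match the model's device
    with torch.no_grad():  # no gradients are needed for inference
        outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)
    return predictions.item()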


# Sample texts
texts = ["Enwere m olileanya na ụgbọelu m rụrụ ga-efeli otu ụbọchị",
         "Ana m agbaburu egwuregwu ọsọ, mana ejirila m ya ritere Naịjiria ọlaedo",
         "Eburu m ọtụtụ a gbara egbe na 'Lekki Toll Gate'gawa ụlọ ọgwụ"]


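# Predictions using the "base" model
# NOTE: `model` is the same object that Trainer updated in place above, so these are
# not predictions from the untouched castorini/afriberta_large checkpoint.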
print("Base Model Predictions:")
for text in texts:
    prediction = predict(text, model, tokenizer)
    print(f"Text: {text} - Prediction: {prediction}")


# Predictions using fine-tuned model
print("Fine-tuned Model Predictions:")
for text in texts:
    prediction = predict(text, fine_tuned_model, fine_tuned_tokenizer)
    print(f"Text: {text} - Prediction: {prediction}")

[INFO|configuration_utils.py:731] 2024-07-20 02:38:44,151 >> loading configuration file ./finetuned_model/config.json
[INFO|configuration_utils.py:800] 2024-07-20 02:38:44,154 >> Model config XLMRobertaConfig {
  "_name_or_path": "./finetuned_model",
  "architectures": [
    "XLMRobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "layer_norm_eps": 1e-05,
  "max_length": 512,
  "max_position_embeddings": 514,
  "model_type": "xlm-roberta",
  "num_attention_heads": 6,
  "num_hidden_layers": 10,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absol

Base Model Predictions:
Text: Enwere m olileanya na ụgbọelu m rụrụ ga-efeli otu ụbọchị - Prediction: 0
Text: Ana m agbaburu egwuregwu ọsọ, mana ejirila m ya ritere Naịjiria ọlaedo - Prediction: 0
Text: Eburu m ọtụtụ a gbara egbe na 'Lekki Toll Gate'gawa ụlọ ọgwụ - Prediction: 2
Fine-tuned Model Predictions:
Text: Enwere m olileanya na ụgbọelu m rụrụ ga-efeli otu ụbọchị - Prediction: 0
Text: Ana m agbaburu egwuregwu ọsọ, mana ejirila m ya ritere Naịjiria ọlaedo - Prediction: 0
Text: Eburu m ọtụtụ a gbara egbe na 'Lekki Toll Gate'gawa ụlọ ọgwụ - Prediction: 2


In [None]:
# Test with texts that are not from the BBC to see if the result changes
texts = ["A mara m mma",
         "Nne m toro ogologo",
         "Oyi na ama m"]

# Predictions using base model
print("Base Model Predictions:")
for text in texts:
    prediction = predict(text, model, tokenizer)
    print(f"Text: {text} - Prediction: {prediction}")

# Predictions using fine-tuned model
print("Fine-tuned Model Predictions:")
for text in texts:
    prediction = predict(text, fine_tuned_model, fine_tuned_tokenizer)
    print(f"Text: {text} - Prediction: {prediction}")

Base Model Predictions:
Text: A mara m mma - Prediction: 0
Text: Nne m toro ogologo - Prediction: 0
Text: Oyi na ama m - Prediction: 0
Fine-tuned Model Predictions:
Text: A mara m mma - Prediction: 0
Text: Nne m toro ogologo - Prediction: 0
Text: Oyi na ama m - Prediction: 0
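Comparing only the predicted class can hide differences between two models, because two sets of logits can share the same argmax while expressing very different confidence. The sketch below illustrates this with a plain-Python softmax and made-up logits; none of these numbers come from the models in this notebook.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for illustration only: both pick class 0, with different confidence
logits_a = [0.30, 0.20, 0.10]
logits_b = [2.50, 0.20, -1.00]

for name, logits in [("a", logits_a), ("b", logits_b)]:
    probs = softmax(logits)
    predicted = probs.index(max(probs))
    print(name, predicted, [round(p, 3) for p in probs])
```

In the notebook, the same information is available by applying `torch.softmax(outputs.logits, dim=-1)` inside `predict` and returning the probabilities along with the class id.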


In [None]:
# Load the model from the Hugging Face Hub after uploading it, to test that it works

model = AutoModelForSequenceClassification.from_pretrained('Ifyokoh/Igbo-sentiment-bbc', num_labels=3)
tokenizer = AutoTokenizer.from_pretrained('Ifyokoh/Igbo-sentiment-bbc')


In [None]:
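# Function to generate predictions (returns the predicted class id)
def predict(text, model, tokenizer):
    model.eval()  # disable dropout so predictions are deterministic
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}  # match the model's device
    with torch.no_grad():  # no gradients are needed for inference
        outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)
    return predictions.item()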

# Sample texts
texts = ["Enwere m olileanya na ụgbọelu m rụrụ ga-efeli otu ụbọchị",
         "Ana m agbaburu egwuregwu ọsọ, mana ejirila m ya ritere Naịjiria ọlaedo",
         "Eburu m ọtụtụ a gbara egbe na 'Lekki Toll Gate'gawa ụlọ ọgwụ"]


# Predictions using model
print("Model Predictions:")
for text in texts:
    prediction = predict(text, model, tokenizer)
    print(f"Text: {text} - Prediction: {prediction}")


Model Predictions:
Text: Enwere m olileanya na ụgbọelu m rụrụ ga-efeli otu ụbọchị - Prediction: 0
Text: Ana m agbaburu egwuregwu ọsọ, mana ejirila m ya ritere Naịjiria ọlaedo - Prediction: 0
Text: Eburu m ọtụtụ a gbara egbe na 'Lekki Toll Gate'gawa ụlọ ọgwụ - Prediction: 2


**Write up**:
### Finetuning strategy used and why

- The model `castorini/afriberta_large` is a pretrained multilingual language model. I chose it because it was trained on an aggregation of datasets from the BBC news website and includes the Igbo language. It has been shown to obtain competitive downstream performance on text classification, which makes it a good fit for this task. Other models such as `nlptown/bert-base-multilingual-uncased-sentiment` could have been an option, but that model was trained on European languages, so AfriBERTa is the better choice.
- For the training configuration, I set the number of epochs to 3 to limit overfitting, since the dataset is small. The learning rate (2e-5) and batch size (8) were chosen to balance performance against computational resources, and the logging configuration gives visibility into the training process.
- Fine-tuning adapts the model specifically for sentiment analysis. This strategy is computationally efficient because it only updates the pretrained model's weights on the new dataset rather than training a model from scratch.
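
The choice of epoch count can be checked against the training log recorded earlier in this notebook. The sketch below copies the three logged rows and looks for the epoch with the lowest validation loss; the variable names are only for illustration.

```python
# (epoch, training loss, validation loss) copied from the training log in this notebook
history = [
    (1, 0.6949, 0.990783),
    (2, 0.5980, 1.009145),
    (3, 0.4454, 1.050660),
]

best_epoch, _, best_val_loss = min(history, key=lambda row: row[2])

# Validation loss rising while training loss falls is a common sign of overfitting
train_falling = all(a[1] > b[1] for a, b in zip(history, history[1:]))
val_rising = all(a[2] < b[2] for a, b in zip(history, history[1:]))

print(f"best epoch by validation loss: {best_epoch} ({best_val_loss})")
print(f"training loss falling: {train_falling}, validation loss rising: {val_rising}")
```

With validation loss lowest after the first epoch, options such as `load_best_model_at_end` in `TrainingArguments` or the `EarlyStoppingCallback` from `transformers` may be worth trying; whether they help on a dataset this small is untested here.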


### Some samples from the base model and from the final finetuned model. How do they compare?

**Predictions**:

Positive: 0, Neutral: 1, Negative: 2

**Base Model Predictions:**

- Igbo text: Enwere m olileanya na ụgbọelu m rụrụ ga-efeli otu ụbọchị

  English meaning: I have hope that the aeroplane I built will fly one day.

  Prediction: 0
- Igbo text: Ana m agbaburu egwuregwu ọsọ, mana ejirila m ya ritere Naịjiria ọlaedo

  English meaning: I have used running to win a gold medal for Nigeria.

  Prediction: 0
- Igbo text: Eburu m ọtụtụ a gbara egbe na 'Lekki Toll Gate'gawa ụlọ ọgwụ

  English meaning: I carried many people who were shot at Lekki Toll Gate to the hospital.

  Prediction: 2

**Fine-tuned Model Predictions:**
- Text: Enwere m olileanya na ụgbọelu m rụrụ ga-efeli otu ụbọchị
  
  Prediction: 0
- Text: Ana m agbaburu egwuregwu ọsọ, mana ejirila m ya ritere Naịjiria ọlaedo
  
  Prediction: 0
- Text: Eburu m ọtụtụ a gbara egbe na 'Lekki Toll Gate'gawa ụlọ ọgwụ
  
  Prediction: 2

**Analysis:**
- The base model and the fine-tuned model give identical predictions on these samples. One likely reason is that the "base" predictions were generated with `model`, the same object that `Trainer` updated in place during training, so both sets of predictions come from fine-tuned weights rather than from the untouched `castorini/afriberta_large` checkpoint. A true baseline would need a freshly loaded copy of the base model, whose classification head is randomly initialized. I also considered that the two models share the same source data, the BBC, so I tested them on entirely different texts that do not come from the BBC, and they gave the same result.

- The size of the training data is also a key factor: the dataset has only 239 examples, so the fine-tuning process might not lead to significant improvements.

**Things to do to improve the model:**
- Ensure that the training data used for fine-tuning is large enough.


**Link to the model on Hugging Face Hub:** https://huggingface.co/Ifyokoh/Igbo-sentiment-bbc