Fine-Tuning TunBERT on the T-HSAB Dataset

This notebook walks through fine-tuning TunBERT (or, optionally, AraBERT) on the T-HSAB dataset to detect hate speech in the Tunisian dialect.

Step 1: Install Required Libraries

Run the following command to install the necessary Python libraries:



In [1]:
!pip install transformers datasets pandas openpyxl torch scikit-learn


Step 2: Load the Dataset

Load your T-HSAB dataset from an XLSX file. Update the file path to point to your dataset.

In [None]:
import pandas as pd

# Load the dataset from XLSX
file_path = '../data/T-HSAB.xlsx'  # Replace with the actual path to your file
# header=None treats the first row as data; the column names are assigned here
data = pd.read_excel(file_path, header=None, names=['text', 'label'])

# Inspect the first few rows to confirm the structure matches your file
print(data.head())

                                                text   label
0  اسغي ياشعب تونس تدعوا بالاسلام كفار الحمدلله ن...    hate
1  قطع يد السارق توفرت الشروط شرط الحد الأدنى قيم...  normal
2                             تلوموش لطفي لعبدلي شرف  normal
3  مستغرب شعب يسمع تفاهة شانو لى الدرجة الشعب تاف...  normal
4  هههخ غزلتني مافهمتش شمدخلها الموضوع تتنطر وحده...  normal


In [None]:
print(data.columns)

Index(['text', 'label'], dtype='object')


Note: The code above reads the file without a header row and names the two columns 'text' (the comment) and 'label' (one of 'normal', 'abusive', 'hate'). If your file has a header row or a different column order, adjust the read_excel arguments accordingly.



Step 3: Preprocess the Data

Map the string labels to integers and split the data into training (80%) and validation (20%) sets.

In [None]:
from sklearn.model_selection import train_test_split

# Map labels to integers (modify based on your dataset's labels)
label_mapping = {'normal': 0, 'abusive': 1, 'hate': 2}
data['label'] = data['label'].map(label_mapping)

# Split the data into training and validation sets
train_data, val_data = train_test_split(data, test_size=0.2, random_state=42)

Note: Update label_mapping if your dataset uses different labels (e.g., binary classification).
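
One thing worth checking before the split: pandas' .map() silently turns any label that is missing from label_mapping into NaN, which would only surface later as a training error. The sketch below shows one way to catch this early. The small DataFrame and its messy labels (' Hate ', 'spam') are hypothetical stand-ins, not values taken from T-HSAB.

```python
import pandas as pd

label_mapping = {'normal': 0, 'abusive': 1, 'hate': 2}

# Hypothetical sample; ' Hate ' and 'spam' stand in for messy real-world labels
sample = pd.DataFrame({
    'text': ['a', 'b', 'c', 'd'],
    'label': ['normal', ' Hate ', 'abusive', 'spam'],
})

# Normalize whitespace and case before mapping
cleaned = sample['label'].astype(str).str.strip().str.lower()
mapped = cleaned.map(label_mapping)

# Labels missing from label_mapping become NaN after .map()
unmapped = sorted(cleaned[mapped.isna()].unique())
print("Unmapped labels:", unmapped)

# Keep only rows with a known label and store the labels as integers
known = mapped.notna()
clean_sample = sample.loc[known].assign(label=mapped[known].astype(int))
print(clean_sample['label'].value_counts().to_dict())
```

If the classes turn out to be imbalanced, passing stratify=data['label'] to train_test_split keeps the class proportions the same in both splits.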



Step 4: Convert to Hugging Face Dataset Format

Convert the pandas DataFrames into Hugging Face's Dataset format for use with the Trainer API.

In [None]:
from datasets import Dataset

# Convert to Hugging Face Dataset format
train_dataset = Dataset.from_pandas(train_data)
val_dataset = Dataset.from_pandas(val_data)



Step 5: Choose the Model and Load Tokenizer

We’ll use TunBERT for its specialization in Tunisian dialect. You can switch to AraBERT for general Arabic.

In [None]:
%pip install transformers datasets pandas openpyxl scikit-learn

ImportError: Failed to load PyTorch C extensions:
    It appears that PyTorch has loaded the `torch/_C` folder
    of the PyTorch repository rather than the C extensions which
    are expected in the `torch._C` namespace. This can occur when
    using the `install` workflow. e.g.
        $ python setup.py install && python -c "import torch"

    This error can generally be solved using the `develop` workflow
        $ python setup.py develop && python -c "import torch"  # This should succeed
    or by running Python from a different directory.

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Use TunBERT (recommended) or AraBERT
model_name = "tunis-ai/TunBERT"  # Or "aubmindlab/bert-base-arabertv2" for AraBERT
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(label_mapping))

Note: Change model_name to use AraBERT if preferred.



Step 6: Tokenize the Text

Tokenize the text using the model’s tokenizer, with a maximum sequence length of 128 tokens.

In [None]:
def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True, max_length=128)

train_dataset = train_dataset.map(tokenize_function, batched=True)
val_dataset = val_dataset.map(tokenize_function, batched=True)

# Set format for PyTorch
train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])
val_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])

Tip: Increase max_length (e.g., to 256) if many of your texts get truncated, or lower it to save memory and time when the texts are short.
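
A rough way to pick max_length is to look at how long the tokenized texts actually are and choose a value that covers most of them. The helper below is a minimal sketch of that idea; length_at_coverage and the example token counts are hypothetical and not part of the original notebook.

```python
import math

def length_at_coverage(token_lengths, coverage=0.95):
    """Return the smallest length that fits at least `coverage` of the texts."""
    if not token_lengths:
        raise ValueError("token_lengths must not be empty")
    ordered = sorted(token_lengths)
    # Position of the text at which the requested share is covered
    index = math.ceil(coverage * len(ordered)) - 1
    return ordered[max(index, 0)]

# Hypothetical token counts. In this notebook they could be computed with
# [len(ids) for ids in tokenizer(list(train_data['text']))['input_ids']]
example_lengths = [12, 18, 25, 31, 40, 44, 57, 63, 90, 210]
print(length_at_coverage(example_lengths, coverage=0.9))
```

Choosing a coverage below 100% keeps a few unusually long comments from inflating the padded length of every example.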



Step 7: Define Training Arguments

Set up training parameters like learning rate, batch size, and number of epochs.

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",  # called evaluation_strategy in older transformers releases
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

Note: Modify hyperparameters (e.g., num_train_epochs) based on your dataset and hardware.
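
To judge whether settings such as logging_steps=10 are sensible, it helps to estimate how many optimizer steps a run will take. The arithmetic below is a sketch that assumes a single device and no gradient accumulation. The figure of 800 training examples is hypothetical, not the size of T-HSAB.

```python
import math

def total_optimizer_steps(num_examples, batch_size, epochs):
    """Estimate training steps for one device without gradient accumulation."""
    # The last, smaller batch of each epoch still counts as one step
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

# Hypothetical training-set size with the batch size and epochs used above
steps = total_optimizer_steps(num_examples=800, batch_size=8, epochs=3)
print("Total steps:", steps)
print("Log lines at logging_steps=10:", steps // 10)
```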



Step 8: Define Evaluation Metrics

Define a function to compute accuracy, precision, recall, and F1-score during evaluation.

In [None]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "precision": precision, "recall": recall, "f1": f1}
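
Weighted averages can hide weak performance on a minority class, so a per-class breakdown is often a useful complement to the metrics above. The following sketch uses scikit-learn's classification_report. The y_true and y_pred values are made up for illustration, and label_names assumes the label_mapping order from Step 3.

```python
from sklearn.metrics import classification_report

label_names = ['normal', 'abusive', 'hate']  # index i corresponds to class id i

# Made-up labels and predictions for illustration only
y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]

report = classification_report(
    y_true,
    y_pred,
    labels=[0, 1, 2],
    target_names=label_names,
    output_dict=True,
    zero_division=0,  # report 0 instead of warning when a class is never predicted
)

for name in label_names:
    row = report[name]
    print(f"{name}: precision={row['precision']:.2f} "
          f"recall={row['recall']:.2f} f1={row['f1-score']:.2f}")
print("macro F1:", round(report['macro avg']['f1-score'], 2))
```

After training, the same call could be applied to the outputs of trainer.predict(val_dataset), using label_ids as y_true and predictions.argmax(-1) as y_pred.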

Step 9: Initialize the Trainer

Initialize the Trainer with the model, arguments, datasets, and metrics.

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

Step 10: Train the Model

Fine-tune the model on the training data.

In [None]:
trainer.train()
# Tip: Use a GPU (e.g., via Google Colab) for faster training.





Step 11: Evaluate the Model

Evaluate the model on the validation set and display the results.

In [None]:
eval_results = trainer.evaluate()
print("Evaluation results:", eval_results)

Step 12: Save the Fine-Tuned Model

Save the model and tokenizer for future use.

In [None]:
model.save_pretrained("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")

Using the Fine-Tuned Model for Inference

Load the saved model to classify new text.

In [None]:
from transformers import pipeline

# Load the fine-tuned model and tokenizer
classifier = pipeline("text-classification", model="./fine_tuned_model", tokenizer="./fine_tuned_model")

# Example usage
result = classifier("Your Tunisian text here")
print(result)

Note: The output will show a predicted label (e.g., LABEL_0 for 'normal') and confidence score. Map labels back to names if needed.
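
Mapping the generic LABEL_n names back to the original classes can be done by inverting label_mapping. The helper below is a minimal sketch. readable_label and the example prediction are hypothetical, and the parsing assumes the default LABEL_n naming described in the note above.

```python
label_mapping = {'normal': 0, 'abusive': 1, 'hate': 2}

# Invert the mapping: class id -> class name
id_to_label = {idx: name for name, idx in label_mapping.items()}

def readable_label(prediction):
    """Turn a result such as {'label': 'LABEL_2', 'score': 0.91} into a named label."""
    class_id = int(prediction['label'].rsplit('_', 1)[-1])
    return {'label': id_to_label[class_id], 'score': prediction['score']}

# Hypothetical pipeline output for a single text
example_prediction = {'label': 'LABEL_2', 'score': 0.91}
print(readable_label(example_prediction))
```

An alternative is to pass id2label and label2id dictionaries to from_pretrained when loading the model in Step 5, so that the names are stored in the saved config and the pipeline reports them directly.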