# Fine-tuning a Sequence Classification Model Exam

In this exam, you will preprocess a dataset and fine-tune a model for sequence classification. Complete each step carefully according to the instructions provided.

### Model and Dataset Information

For this task, you will be working with the following:

- **Model Checkpoint**: Use the pre-trained model checkpoint `aubmindlab/bert-base-arabertv02` for both the model and tokenizer.
- **Dataset**: You will be using the `CUTD/sanad_df` dataset. Be sure to load and preprocess the dataset correctly for training and evaluation.

**Note:**
- Any additional steps or methods you include that improve or enhance the results will be rewarded with bonus points if they are justified.
- The steps outlined here are suggestions. You are free to implement alternative methods or approaches to achieve the task, as long as you explain the reasoning and the process at the bottom of the notebook.
- You can use either TensorFlow or PyTorch with Hugging Face Transformers for this task.
- The number of data samples you choose to work with is flexible. However, if you select a very low number of samples and the training time is too short, this could affect the evaluation of your work.

## Step 1: Load the Dataset

Load the dataset and split it into training and test sets. Use 20% of the data for testing.

In [39]:
import pandas as pd

df = pd.read_csv("hf://datasets/CUTD/sanad_df/sanad_df.csv")

df = df.head(5000)
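
Note that `head(5000)` keeps only the first rows of the file. If the CSV happens to be ordered by label, such a subset could end up containing a single class, which would leave the classifier nothing to learn. The sketch below illustrates one alternative, drawing the same number of rows from every class. It uses plain Python on toy records; the function name `stratified_sample` and the toy rows are hypothetical and only stand in for the real DataFrame.

```python
import random
from collections import defaultdict


def stratified_sample(rows, label_key, per_class, seed=42):
    """Draw up to `per_class` rows from every label, then shuffle the result."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        # Take the whole group when it is smaller than the requested size.
        k = min(per_class, len(group))
        sample.extend(rng.sample(group, k))

    rng.shuffle(sample)
    return sample


# Toy rows standing in for the real DataFrame records (hypothetical data).
rows = (
    [{"text": f"c{i}", "label": "Culture"} for i in range(50)]
    + [{"text": f"f{i}", "label": "Finance"} for i in range(50)]
    + [{"text": f"m{i}", "label": "Medical"} for i in range(20)]
)
subset = stratified_sample(rows, "label", per_class=10)

counts = {}
for row in subset:
    counts[row["label"]] = counts.get(row["label"], 0) + 1
print(counts)
```

With pandas, `df.groupby('label').sample(n=..., random_state=42)` should give a comparable result on the full DataFrame, provided every class has at least `n` rows.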

In [40]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  5000 non-null   int64 
 1   text        5000 non-null   object
 2   label       5000 non-null   object
dtypes: int64(1), object(2)
memory usage: 117.3+ KB
None


In [41]:
print("Missing values:")
print(df.isnull().sum())

Missing values:
Unnamed: 0    0
text          0
label         0
dtype: int64


In [42]:
df = df.drop_duplicates()

## Step 2: Clean Unnecessary Columns

Remove any columns from the dataset that are not needed for training.

In [43]:
!pip install PyArabic



In [44]:
import re

import nltk
import pyarabic.araby as araby
from nltk.corpus import stopwords
from nltk.stem.isri import ISRIStemmer

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [45]:
arabic_stopwords = set(stopwords.words('arabic'))


def clean_text(text):
    # Remove URLs
    cleaned_text = re.sub(r'http\S+|www\S+', '', text)

    # Remove punctuation and other non-word symbols
    cleaned_text = re.sub(r'[^\w\s]', '', cleaned_text)

    # Strip diacritics (tashkeel) and elongation (tatweel)
    cleaned_text = araby.strip_tashkeel(cleaned_text)
    cleaned_text = araby.strip_tatweel(cleaned_text)
    cleaned_text = araby.strip_lastharaka(cleaned_text)

    # Drop Arabic stopwords
    words = cleaned_text.split()
    filtered_words = [word for word in words if word not in arabic_stopwords]

    # No stemming is applied, so the model sees full word forms.
    # Normalize hamza variants in the remaining text.
    return araby.normalize_hamza(' '.join(filtered_words))


df['clean_text'] = df['text'].apply(clean_text)

In [46]:
df['clean_text']

Unnamed: 0,clean_text
0,الشارقة محمد ولد محمد سالمعرضت الءول خشبة مسرح...
1,عبدالحكيم الزبيدي شاعر وقاص وناقد جاءت نصوصه م...
2,انطلقت الءيام العام الفاءت فعاليات مهرجان دبي ...
3,ءقيمت الءول ءكسبو الشارقة ندوة حوارية حول ءهمي...
4,باسمة يونس حينما قال صاحب السمو الشيخ الدكتور ...
...,...
4995,ءبوظبي الخليجءكد الدكتور فالح حنظل خلال الجلسة...
4996,ءعلنت هيءة ءبوظبي للسياحة والثقافة اختيارها شخ...
4997,القاهرةالخليج طرءت الشعر بامتداد تاريخ الءدب ت...
4998,وحده الكتاب يتحول هيولى خفية تتجاوز الربط ءفرا...


In [47]:
df = df.drop('Unnamed: 0', axis=1)

In [48]:
df = df.drop('text', axis=1)

In [16]:
df['label'].value_counts()

label,count
Culture,6500
Finance,6500
Medical,2000


In [49]:
df.head()

Unnamed: 0,label,clean_text
0,Culture,الشارقة محمد ولد محمد سالمعرضت الءول خشبة مسرح...
1,Culture,عبدالحكيم الزبيدي شاعر وقاص وناقد جاءت نصوصه م...
2,Culture,انطلقت الءيام العام الفاءت فعاليات مهرجان دبي ...
3,Culture,ءقيمت الءول ءكسبو الشارقة ندوة حوارية حول ءهمي...
4,Culture,باسمة يونس حينما قال صاحب السمو الشيخ الدكتور ...


## Step 3: Splitting the Dataset

Split the dataset into training and testing sets, ensuring that 20% of the data is used for testing.

In [53]:
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

## Step 4: Tokenizing the Data

Initialize a tokenizer for the model.

In [54]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("aubmindlab/bert-base-arabertv02")

# Tokenize (PyTorch tensors, to match the PyTorch model and Trainer used below)
train_encodings = tokenizer(train_df['clean_text'].tolist(), truncation=True, padding=True, max_length=128, return_tensors='pt')
val_encodings = tokenizer(val_df['clean_text'].tolist(), truncation=True, padding=True, max_length=128, return_tensors='pt')



## Step 5: Preprocessing the Text

Map the tokenization function to the dataset. Ensure the text data is processed using truncation to handle sequences that exceed the model's input size. Apply any further preprocessing you consider useful.

**Bonus**: Extra points are awarded for more comprehensive preprocessing, such as removing links, converting text to lowercase, or applying additional preprocessing techniques.
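
As a rough illustration of what "mapping the tokenization function" can look like, the sketch below builds a batch-level function of the shape that `Dataset.map(..., batched=True)` expects. The helper name `make_tokenize_fn` is hypothetical, and `toy_tokenizer` is a stand-in so the example runs without downloading a model; it is not how the real tokenizer computes ids. Padding is deliberately left out so that a data collator can pad each batch later.

```python
def make_tokenize_fn(tokenizer, text_column="clean_text", max_length=128):
    """Return a function suitable for `Dataset.map(..., batched=True)`."""
    def tokenize_batch(batch):
        # Truncate long texts; padding is left to the data collator.
        return tokenizer(batch[text_column], truncation=True, max_length=max_length)
    return tokenize_batch


# Stand-in tokenizer (hypothetical): one fake id per word, truncated to max_length.
def toy_tokenizer(texts, truncation=True, max_length=128):
    ids = [[len(word) for word in text.split()][:max_length] for text in texts]
    return {"input_ids": ids, "attention_mask": [[1] * len(seq) for seq in ids]}


tokenize_batch = make_tokenize_fn(toy_tokenizer, max_length=3)
encoded = tokenize_batch({"clean_text": ["one two three four", "five six"]})
print(encoded["input_ids"])       # [[3, 3, 5], [4, 3]]
print(encoded["attention_mask"])  # [[1, 1, 1], [1, 1]]
```

With the real tokenizer, a call along the lines of `train_datasets.map(make_tokenize_fn(tokenizer), batched=True)` would add `input_ids` and `attention_mask` columns once the `Dataset` objects have been created.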

## Step 6: Label Encoding

Convert the categorical labels into numerical format using a label encoder if needed.

In [99]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

train_df['label'] = label_encoder.fit_transform(train_df['label'])
val_df['label'] = label_encoder.transform(val_df['label'])
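
The cell above fits the encoder on the training labels and reuses it for the validation labels, which keeps the label ids consistent across splits. The following plain-Python sketch shows the same idea and why it matters; the helper names and the toy labels are hypothetical. Like `LabelEncoder`, it assigns ids in sorted class order.

```python
def fit_label_mapping(labels):
    """Build label->id and id->label maps from the training labels only."""
    classes = sorted(set(labels))
    label2id = {name: idx for idx, name in enumerate(classes)}
    id2label = {idx: name for name, idx in label2id.items()}
    return label2id, id2label


def encode_labels(labels, label2id):
    """Encode labels with an existing mapping; unknown labels raise KeyError."""
    return [label2id[name] for name in labels]


# Toy labels (hypothetical data).
toy_train = ["Finance", "Culture", "Medical", "Culture"]
toy_val = ["Medical", "Finance"]

label2id, id2label = fit_label_mapping(toy_train)
train_ids = encode_labels(toy_train, label2id)
val_ids = encode_labels(toy_val, label2id)
print(train_ids)  # [1, 0, 2, 0]
print(val_ids)    # [2, 1]

# Fitting separately on the validation labels would produce different ids.
val_only_map, _ = fit_label_mapping(toy_val)
print(encode_labels(toy_val, val_only_map))  # [1, 0]
```

The resulting `id2label` and `label2id` dictionaries can also be passed to `from_pretrained(...)` so that predictions are reported with class names instead of integers.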

In [100]:
type(train_df)

In [102]:
import tensorflow as tf
train_labels = tf.convert_to_tensor(train_df['label'])
val_labels = tf.convert_to_tensor(val_df['label'])

In [103]:
from datasets import Dataset
train_datasets = Dataset.from_pandas(train_df)
val_datasets = Dataset.from_pandas(val_df)

In [104]:
train_datasets

Dataset({
    features: ['label', 'clean_text', '__index_level_0__'],
    num_rows: 4000
})

In [105]:
val_datasets

Dataset({
    features: ['label', 'clean_text', '__index_level_0__'],
    num_rows: 1000
})

In [112]:
label_encoder = LabelEncoder()


encoded_labels = label_encoder.fit_transform(train_datasets['label'])


train_datasets = train_datasets.remove_columns(['label'])
train_datasets = train_datasets.add_column('label', encoded_labels)


print(train_datasets)

Dataset({
    features: ['clean_text', '__index_level_0__', 'label'],
    num_rows: 4000
})


In [113]:
# Reuse the encoder fitted on the training labels so both splits share one mapping
encoded_labels = label_encoder.transform(val_datasets['label'])


val_datasets = val_datasets.remove_columns(['label'])
val_datasets = val_datasets.add_column('label', encoded_labels)


print(val_datasets)

Dataset({
    features: ['clean_text', '__index_level_0__', 'label'],
    num_rows: 1000
})


In [120]:
train_datasets['label'][:10]

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

## Step 7: Data Collation for Padding

Prepare the data for training by ensuring all sequences in a batch are padded to the same length. Use a data collator to handle dynamic padding.

In [121]:
from transformers import DataCollatorWithPadding

# The Trainer works with PyTorch tensors, so keep the collator's default tensor type
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
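
To make the idea of dynamic padding concrete, here is a small plain-Python sketch of what happens to a single batch: every sequence is padded only up to the longest sequence in that batch rather than to a global maximum. This is an illustration, not the `DataCollatorWithPadding` implementation; the function name `pad_batch`, the token ids, and `pad_token_id=0` are assumptions.

```python
def pad_batch(features, pad_token_id=0):
    """Pad every sequence in one batch to the longest sequence in that batch."""
    longest = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        gap = longest - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * gap)
        # Real tokens get 1, padding positions get 0.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * gap)
    return batch


# Made-up token ids for two sequences of different length.
features = [{"input_ids": [2, 15, 27, 3]}, {"input_ids": [2, 9, 3]}]
batch = pad_batch(features)
print(batch["input_ids"])       # [[2, 15, 27, 3], [2, 9, 3, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Because the padded length depends on each batch, short batches stay short, which usually saves computation compared with padding everything to `max_length`.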

## Step 8: Model Initialization

Initialize a sequence classification model using the BERT-based architecture. Set the correct number of output labels.

In [136]:
!pip install transformers



In [142]:
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

num_labels = 3
model = AutoModelForSequenceClassification.from_pretrained("aubmindlab/bert-base-arabertv02", num_labels=num_labels)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at aubmindlab/bert-base-arabertv02 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Step 9: Training Arguments

Define the training arguments, including parameters such as learning rate, batch size, number of epochs, and weight decay.

In [143]:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    learning_rate=5e-5,
)
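
Batch size and epoch count together determine how many optimizer updates the run performs. The sketch below works through that arithmetic for the 4,000 training rows used in this notebook. It assumes a single device and no gradient accumulation; the function name is hypothetical, and the batch sizes 8 and 16 are illustrative values.

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer updates, assuming one device and no gradient accumulation."""
    # The last batch of an epoch may be smaller, hence the ceiling.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


steps_bs8 = total_training_steps(4000, batch_size=8, num_epochs=3)
steps_bs16 = total_training_steps(4000, batch_size=16, num_epochs=3)
print(steps_bs8)   # 1500
print(steps_bs16)  # 750
```

Doubling the batch size halves the number of updates, which is one reason the learning rate and batch size are usually tuned together.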


## Step 10: Trainer Initialization and Training

Set up the trainer with the model, training arguments, dataset, tokenizer, and data collator. Train the model using the dataset you processed earlier.

In [144]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_datasets,
    eval_dataset=val_datasets,
    tokenizer=tokenizer,
    data_collator=data_collator
)

In [145]:
trainer.train()

ValueError: You should supply an encoding or a list of encodings to this method that includes input_ids, but you provided ['label']
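
The `ValueError` above suggests that the datasets passed to the `Trainer` were never tokenized: they contain `clean_text`, `__index_level_0__`, and `label`, but no `input_ids`. By default the `Trainer` drops dataset columns that the model's `forward` method does not accept, which would leave only `label` for the collator. A small pre-flight check along the following lines can catch this before training starts. The function name and the set of required columns are assumptions made for illustration.

```python
# Columns a BERT-style classifier typically needs from the tokenizer (assumed set).
REQUIRED_COLUMNS = {"input_ids", "attention_mask"}


def missing_model_inputs(column_names, required=REQUIRED_COLUMNS):
    """Return the required tokenizer outputs that the dataset does not contain."""
    return sorted(set(required) - set(column_names))


# Column names as reported for the datasets in this notebook.
columns = ["clean_text", "__index_level_0__", "label"]
print(missing_model_inputs(columns))  # ['attention_mask', 'input_ids']

# After tokenization the check should come back empty.
tokenized_columns = columns + ["input_ids", "attention_mask"]
print(missing_model_inputs(tokenized_columns))  # []
```

In the notebook, the check could be run as `missing_model_inputs(train_datasets.column_names)`. Tokenizing the `clean_text` column with `Dataset.map` before building the `Trainer` is one way to add the missing columns.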

## Step 11: Inference

Once the model is trained, perform inference on a sample text to evaluate the model's prediction capabilities. Use the tokenizer to process the text, and then feed it into the model to get the predicted label.
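
The last part of inference, turning the model's raw scores into a predicted label, can be sketched without any deep learning library. The example below applies a softmax to a list of logits and picks the most probable class. The logits are made up, and the `id2label` mapping assumes the alphabetical class order that a label encoder would produce for `Culture`, `Finance`, and `Medical`.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits, id2label):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


id2label = {0: "Culture", 1: "Finance", 2: "Medical"}  # assumed mapping
logits = [0.3, 2.1, -1.0]                              # made-up model output

label, confidence = predict_label(logits, id2label)
print(label, round(confidence, 3))
```

With a trained PyTorch model, the logits for one text would typically come from `model(**tokenizer(text, return_tensors="pt", truncation=True)).logits`, evaluated under `torch.no_grad()`. The sample text should first go through the same `clean_text` function that was applied to the training data.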