
# Fine-tuning a Sequence Classification Model Exam

In this exam, you will be tasked with performing dataset preprocessing and fine-tuning a model for sequence classification. Complete each step carefully according to the instructions provided.

### Model and Dataset Information

For this task, you will be working with the following:

- **Model Checkpoint**: Use the pre-trained model checkpoint `aubmindlab/bert-base-arabertv02` for both the model and tokenizer.
- **Dataset**: You will be using the `CUTD/sanad_df` dataset. Make sure to load and preprocess the dataset correctly for training and evaluation.

**Note:**
- Any additional steps or methods you include that improve or enhance the results will be rewarded with bonus points if they are justified.
- The steps outlined here are suggestions. You are free to implement alternative methods or approaches to achieve the task, as long as you explain the reasoning and the process at the bottom of the notebook.
- You can use either TensorFlow or PyTorch for this task. If you prefer TensorFlow, feel free to use it when working with Hugging Face Transformers.
- The number of data samples you choose to work with is flexible. However, if you select a very low number of samples and the training time is too short, this could affect the evaluation of your work.

## Step 1: Load the Dataset

Load the dataset and split it into training and test sets. Use 20% of the data for testing.

In [4]:
from transformers import pipeline
import pandas as pd
from sklearn.model_selection import train_test_split


In [5]:


!pip install datasets




In [6]:
from datasets import load_dataset

ds = load_dataset("CUTD/sanad_df")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [9]:
df = pd.DataFrame(ds['train'])

In [10]:
df.head()

,Unnamed: 0,text,label
0,0,الشارقة - محمد ولد محمد سالمعرضت مساء أمس الأو...,Culture
1,1,عبدالحكيم الزبيدي شاعر وقاص وناقد، جاءت نصوصه ...,Culture
2,2,انطلقت في مثل هذه الأيام من العام الفائت فعالي...,Culture
3,3,أقيمت مساء أمس الأول في إكسبو الشارقة ندوة حوا...,Culture
4,4,باسمة يونس حينما قال صاحب السموّ الشيخ الدكتور...,Culture


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  15000 non-null  int64 
 1   text        15000 non-null  object
 2   label       15000 non-null  object
dtypes: int64(1), object(2)
memory usage: 351.7+ KB


In [12]:
df.describe()

,Unnamed: 0
count,15000.0
mean,7499.5
std,4330.271354
min,0.0
25%,3749.75
50%,7499.5
75%,11249.25
max,14999.0


In [13]:
df.isnull().sum()

,0
Unnamed: 0,0
text,0
label,0


In [16]:
df.duplicated().sum()

6

In [17]:
df.drop_duplicates(inplace=True)

In [18]:
df.duplicated().sum()

0

In [14]:
df['label'].value_counts()

label,count
Culture,6500
Finance,6500
Medical,2000


## Step 2: Clean Unnecessary Columns

Remove any columns from the dataset that are not needed for training.

In [15]:
df = df[['text', 'label']]

## Step 3: Splitting the Dataset

Split the dataset into training and testing sets, ensuring that 20% of the data is used for testing.

In [45]:
# Hold out 20% of the rows for testing; the fixed seed keeps the split reproducible.
train, test = train_test_split(df, test_size=0.2, random_state=42)

print(f"Training set size: {len(train)}")
print(f"Testing set size: {len(test)}")
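The label counts in Step 1 are uneven (6500 Culture, 6500 Finance, 2000 Medical), so a stratified split may be worth considering: it keeps each label's share the same in the training and test sets. The following is a minimal standard-library sketch of that idea, not the notebook's own method; `stratified_split` is a hypothetical helper and the toy label list only mirrors the class ratio.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), holding out test_size of every label."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):          # sorted for a deterministic order
        idxs = by_label[label]
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_size)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx

# Toy labels in the same 6500 : 6500 : 2000 ratio as the dataset.
labels = ["Culture"] * 65 + ["Finance"] * 65 + ["Medical"] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With scikit-learn, the same effect is usually obtained by passing `stratify=df['label']` to `train_test_split`.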


## Step 4: Tokenizing the Data

Initialize a tokenizer for the model.

In [18]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv02")
# Load the checkpoint with a classification head rather than the masked-LM head;
# num_labels=3 matches the dataset's classes: Culture, Finance, Medical.
model = AutoModelForSequenceClassification.from_pretrained("aubmindlab/bert-base-arabertv02", num_labels=3)


## Step 5: Preprocessing the Text

Map the tokenization function to the dataset. Ensure the text data is processed using truncation to handle sequences that exceed the model's input size. Apply any further preprocessing you consider useful.

**Bonus**: Extra credit is given for more comprehensive preprocessing, such as removing links, converting text to lowercase, or applying additional preprocessing techniques.
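As an illustration of the bonus preprocessing, here is a minimal sketch of a text-cleaning function that could be applied to the `text` column before tokenization. The function name `clean_text` and the specific cleaning choices (dropping URLs, Arabic diacritics and the tatweel character, lowercasing any Latin letters, collapsing whitespace) are assumptions for this example, not requirements of the exam.

```python
import re

# Links starting with http(s):// or www.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Arabic diacritics (U+064B to U+0652) and the tatweel elongation mark (U+0640).
DIACRITICS_RE = re.compile(r"[\u064B-\u0652\u0640]")

def clean_text(text):
    """Remove links and diacritics, lowercase Latin letters, tidy whitespace."""
    text = URL_RE.sub(" ", text)
    text = DIACRITICS_RE.sub("", text)
    text = text.lower()                      # only affects non-Arabic letters
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("Visit   https://example.com   NOW"))
```

On the DataFrame from Step 2 this could be applied with `df['text'] = df['text'].apply(clean_text)`.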

In [44]:
# Fixed label-to-id mapping, defined once so every batch encodes labels identically.
label_map = {"Culture": 0, "Finance": 1, "Medical": 2}

def preprocess_data(sentences):
    # Truncate and pad every text to the model's 512-token limit.
    inputs = tokenizer(sentences['text'], truncation=True, padding='max_length', max_length=512)

    # Encode string labels as integer class ids; leave numeric labels unchanged.
    if isinstance(sentences['label'][0], str):
        inputs['label'] = [label_map[l] for l in sentences['label']]
    else:
        inputs['label'] = sentences['label']
    return inputs

In [40]:
tokenizer_data = ds.map(preprocess_data, batched=True, remove_columns=["Unnamed: 0"])

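## Step 6: Label Encoding

Convert the categorical labels into numerical format using a label encoder if needed.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Map the string labels (Culture, Finance, Medical) to integer class ids.
encoder = LabelEncoder()
df['label'] = encoder.fit_transform(df['label'])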

## Step 7: Data Collation for Padding

Prepare the data for training by ensuring all sequences in a batch are padded to the same length. Use a data collator to handle dynamic padding.
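To make the idea of dynamic padding concrete, the sketch below pads each batch only up to the longest sequence in that batch instead of a fixed global length. `pad_batch` is a hypothetical pure-Python stand-in written for illustration, and the token ids are made up; in Hugging Face Transformers this job is normally done by `DataCollatorWithPadding`.

```python
def pad_batch(features, pad_token_id=0):
    """Pad every sequence in one batch to that batch's longest length."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
        batch["labels"].append(f["label"])
    return batch

# Two tokenized examples of different lengths (ids are illustrative only).
example = pad_batch([
    {"input_ids": [2, 15, 27, 3], "label": 0},
    {"input_ids": [2, 15, 3], "label": 2},
])
print(example["input_ids"])
print(example["attention_mask"])

# Library equivalent, assuming the tokenizer from Step 4:
# from transformers import DataCollatorWithPadding
# data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```

Dynamic padding only saves computation when the tokenizer does not already pad every example to `max_length`.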

## Step 8: Model Initialization

Initialize a sequence classification model using the BERT-based architecture. Set the correct number of output labels.
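One way to do this is sketched below, assuming the three labels seen in Step 1 and an alphabetical label order (which is what a label encoder would produce). `build_classifier` is a hypothetical helper; it wraps `AutoModelForSequenceClassification.from_pretrained` and passes label-name mappings so that predictions can be reported as names rather than indices.

```python
# The three classes found in the dataset, in alphabetical order (assumed).
labels = ["Culture", "Finance", "Medical"]
id2label = dict(enumerate(labels))
label2id = {name: idx for idx, name in id2label.items()}

def build_classifier(checkpoint="aubmindlab/bert-base-arabertv02"):
    """Load the checkpoint with a freshly initialized classification head."""
    # Imported here so the label maps above can be inspected without the library.
    from transformers import AutoModelForSequenceClassification

    return AutoModelForSequenceClassification.from_pretrained(
        checkpoint,
        num_labels=len(labels),   # one output logit per class
        id2label=id2label,
        label2id=label2id,
    )

print(label2id)
```

Calling `model = build_classifier()` downloads the weights; the warning that the classifier weights are newly initialized is expected, since that head is what fine-tuning trains.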

## Step 9: Training Arguments

Define the training arguments, including parameters such as learning rate, batch size, number of epochs, and weight decay.

In [47]:
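!pip install transformers
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',  # where checkpoints are written
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    num_train_epochs=1,
    # No evaluation runs during training; that is the default strategy.
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenizer_data['train'],
    # Pads each batch to a common length, as Step 7 asks.
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)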





## Step 10: Trainer Initialization and Training

Set up the trainer with the model, training arguments, dataset, tokenizer, and data collator. Train the model using the dataset you processed earlier.

In [None]:
trainer.train()

Step,Training Loss


## Step 11: Inference

Once the model is trained, perform inference on a sample text to evaluate the model's prediction capabilities. Use the tokenizer to process the text, and then feed it into the model to get the predicted label.

In [None]:
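from transformers import pipeline
# Use the fine-tuned model held by the trainer, not the base checkpoint,
# with the task name that matches topic classification.
classifier = pipeline("text-classification", model=trainer.model, tokenizer=tokenizer)

In [None]:
# Sample medical text: "In many cases, genetic mutations in the gene that produces the protein in question
# cause Charcot's disease. But these mutations can also occur without any family history of the disease."
text='وفي كثير من الحالات، تكون الطفرات الوراثية في الجين الذي ينتج البروتين المعني هي العامل المسبب لمرض شاركو. لكن هذه الطفرات يمكن أن تحدث أيضا حتى من دون وجود تاريخ عائلي مرتبط بالمرض'
result = classifier(text)
print(result)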
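The manual route described in this step (tokenize the text, run the model, pick the predicted label) could look roughly like the sketch below. Both function names are hypothetical, and `predict` assumes a PyTorch model and an `id2label` dictionary mapping class indices to names; the softmax and argmax are written in plain Python so the scoring logic is easy to follow.

```python
import math

def logits_to_prediction(logits, id2label):
    """Apply softmax to raw logits and return (label, probability) of the top class."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]   # shift for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]

def predict(text, model, tokenizer, id2label):
    """Tokenize one text, run a forward pass, and decode the top label."""
    import torch  # imported lazily; only needed when a real model is used

    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    inputs = {name: tensor.to(model.device) for name, tensor in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return logits_to_prediction(logits, id2label)

# Scoring logic on made-up logits, no model required.
label, prob = logits_to_prediction(
    [0.1, 0.2, 3.0], {0: "Culture", 1: "Finance", 2: "Medical"}
)
print(label, round(prob, 3))
```

After training, something like `predict(text, trainer.model, tokenizer, id2label)` would return the predicted topic and its probability for the sample above.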