### REFERENCE

@inproceedings{inoue-etal-2021-interplay,
    title = "The Interplay of Variant, Size, and Task Type in {A}rabic Pre-trained Language Models",
    author = "Inoue, Go  and
      Alhafni, Bashar  and
      Baimukan, Nurpeiis  and
      Bouamor, Houda  and
      Habash, Nizar",
    booktitle = "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
    month = apr,
    year = "2021",
    address = "Kyiv, Ukraine (Online)",
    publisher = "Association for Computational Linguistics",
    abstract = "In this paper, we explore the effects of language variants, data sizes, and fine-tuning task types in Arabic pre-trained language models. To do so, we build three pre-trained language models across three variants of Arabic: Modern Standard Arabic (MSA), dialectal Arabic, and classical Arabic, in addition to a fourth language model which is pre-trained on a mix of the three. We also examine the importance of pre-training data size by building additional models that are pre-trained on a scaled-down set of the MSA variant. We compare our different models to each other, as well as to eight publicly available models by fine-tuning them on five NLP tasks spanning 12 datasets. Our results suggest that the variant proximity of pre-training data to fine-tuning data is more important than the pre-training data size. We exploit this insight in defining an optimized system selection model for the studied tasks.",
}

Khalid, S. (2020) Bert explained: A Complete Guide with Theory and tutorial, Medium. Available at: https://medium.com/@samia.khalid/bert-explained-a-complete-guide-with-theory-and-tutorial-3ac9ebc8fa7c (Accessed: 17 September 2023). 

Classify text with Bert&nbsp; :&nbsp;  text&nbsp; :&nbsp;  tensorflow (no date) TensorFlow. Available at: https://www.tensorflow.org/text/tutorials/classify_text_with_bert (Accessed: 17 September 2023). 

Text classification using Bert &amp; Tensorflow | Deep Learning Tutorial 47 (Tensorflow, Keras &amp; Python) (2021) YouTube. Available at: https://www.youtube.com/watch?v=hOCDJyZ6quA (Accessed: 17 September 2023). 

#### CAMeLBERT

In [22]:
!pip install transformers[torch]
!pip install accelerate -U
import accelerate
import transformers
print(accelerate.__file__)
print(transformers.__file__)


/usr/local/lib/python3.10/dist-packages/accelerate/__init__.py
/usr/local/lib/python3.10/dist-packages/transformers/__init__.py


In [23]:
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load dataset
df = pd.read_csv('UseThisClean.csv')
comments = df['clean_text'].tolist()
labels = df['Class'].tolist()

# Split data: 80% training, 20% validation
train_texts, val_texts, train_labels, val_labels = train_test_split(comments, labels, test_size=0.2, random_state=42, stratify=labels)


In [24]:
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-mix')
model = AutoModelForSequenceClassification.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-mix', num_labels=3) # 3 classes


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at CAMeL-Lab/bert-base-arabic-camelbert-mix and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [25]:
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer

# Load your specific tokenizer
tokenizer = AutoTokenizer.from_pretrained('CAMeL-Lab/bert-base-arabic-camelbert-mix')

# Load the dataset
data = pd.read_csv('UseThisClean.csv')

# Drop NaN values
data = data.dropna(subset=['clean_text'])

# Convert the 'clean_text' column to string
data['clean_text'] = data['clean_text'].astype(str)

# Filter out empty strings
data = data[data['clean_text'].str.strip() != ""]

# Extract the 'clean_text' column and the 'Class' column
texts = data['clean_text'].tolist()
labels = data['Class'].tolist()

# Tokenize the data
inputs = tokenizer(texts, padding=True, truncation=True, max_length=25, return_tensors='pt')

# Extract input_ids and attention masks
input_ids = inputs["input_ids"]
attention_masks = inputs["attention_mask"]

# Convert labels to tensors
label_tensors = torch.tensor(labels)

# Split data into training and validation sets
train_inputs, val_inputs, train_masks, val_masks, train_labels, val_labels = train_test_split(
    input_ids, attention_masks, label_tensors, test_size=0.2
)


In [26]:
from transformers import TrainingArguments, Trainer
from sklearn.metrics import accuracy_score, classification_report
import torch

# Define the custom dataset
class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, inputs, labels):
        self.inputs = inputs
        self.labels = labels

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return {'input_ids': self.inputs[idx], 'labels': self.labels[idx]}

# Instantiate the datasets
train_dataset = CustomDataset(train_inputs, train_labels)
eval_dataset = CustomDataset(val_inputs, val_labels)


import torch

# Create a tensor with float32 data type
labels = torch.tensor([0.0, 1.0, 2.0], dtype=torch.float32)

# Convert to int64 (Long tensor)
labels = labels.to(dtype=torch.int64)

print(labels)
print(labels.dtype)

# Convert train_labels to int64 (Long tensor)
train_labels = train_labels.to(dtype=torch.int64)



# Now print the dtype to confirm
print(train_labels.dtype)



tensor([0, 1, 2])
torch.int64
torch.int64


In [27]:
# Define training arguments
training_args = TrainingArguments(
    output_dir='./work',  # <-- This line is added. Adjust the path as needed.
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=8,
    fp16=True,
    save_total_limit=2,
    num_train_epochs=3,
    logging_dir='./logs',
    logging_steps=200,
    do_train=True,
    do_eval=True,
    dataloader_pin_memory=False,  # Add this line to disable memory pinning
    evaluation_strategy="epoch"
)

from sklearn.metrics import accuracy_score, classification_report

# Define the metric computation
def compute_metrics(p):
    pred_labels = p.predictions.argmax(-1)
    labels = p.label_ids
    accuracy = accuracy_score(labels, pred_labels)
    report = classification_report(labels, pred_labels)
    return {
        'accuracy': accuracy,
        'classification_report': report
    }

# Move input tensors to the same device as the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
train_inputs = train_inputs.to(device)
val_inputs = val_inputs.to(device)
train_labels = train_labels.to(device)
val_labels = val_labels.to(device)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics
)

train_inputs = train_inputs.to(device)
val_inputs = val_inputs.to(device)
train_labels = train_labels.to(device)
val_labels = val_labels.to(device)




In [28]:
trainer.train()


Epoch,Training Loss,Validation Loss,Accuracy,Classification Report
1,0.6601,0.513936,0.785836,precision recall f1-score support  0 0.77 0.79 0.78 834  1 0.70 0.72 0.71 786  2 0.91 0.85 0.88 724  accuracy 0.79 2344  macro avg 0.79 0.79 0.79 2344 weighted avg 0.79 0.79 0.79 2344
2,0.4577,0.552561,0.781143,precision recall f1-score support  0 0.79 0.73 0.76 834  1 0.69 0.74 0.71 786  2 0.87 0.88 0.88 724  accuracy 0.78 2344  macro avg 0.79 0.78 0.78 2344 weighted avg 0.78 0.78 0.78 2344
3,0.1512,0.712607,0.787116,precision recall f1-score support  0 0.78 0.78 0.78 834  1 0.71 0.70 0.71 786  2 0.88 0.89 0.88 724  accuracy 0.79 2344  macro avg 0.79 0.79 0.79 2344 weighted avg 0.79 0.79 0.79 2344


Trainer is attempting to log a value of "              precision    recall  f1-score   support

           0       0.77      0.79      0.78       834
           1       0.70      0.72      0.71       786
           2       0.91      0.85      0.88       724

    accuracy                           0.79      2344
   macro avg       0.79      0.79      0.79      2344
weighted avg       0.79      0.79      0.79      2344
" of type <class 'str'> for key "eval/classification_report" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "              precision    recall  f1-score   support

           0       0.79      0.73      0.76       834
           1       0.69      0.74      0.71       786
           2       0.87      0.88      0.88       724

    accuracy                           0.78      2344
   macro avg       0.79      0.78      0.78      2344
weighted avg       0.78      0.78      0.7

TrainOutput(global_step=879, training_loss=0.37661943012536997, metrics={'train_runtime': 392.993, 'train_samples_per_second': 71.566, 'train_steps_per_second': 2.237, 'total_flos': 361331292656250.0, 'train_loss': 0.37661943012536997, 'epoch': 3.0})

In [29]:
results = trainer.evaluate()
print(results)


Trainer is attempting to log a value of "              precision    recall  f1-score   support

           0       0.78      0.78      0.78       834
           1       0.71      0.70      0.71       786
           2       0.88      0.89      0.88       724

    accuracy                           0.79      2344
   macro avg       0.79      0.79      0.79      2344
weighted avg       0.79      0.79      0.79      2344
" of type <class 'str'> for key "eval/classification_report" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.


{'eval_loss': 0.7126073241233826, 'eval_accuracy': 0.7871160409556314, 'eval_classification_report': '              precision    recall  f1-score   support\n\n           0       0.78      0.78      0.78       834\n           1       0.71      0.70      0.71       786\n           2       0.88      0.89      0.88       724\n\n    accuracy                           0.79      2344\n   macro avg       0.79      0.79      0.79      2344\nweighted avg       0.79      0.79      0.79      2344\n', 'eval_runtime': 9.8104, 'eval_samples_per_second': 238.93, 'eval_steps_per_second': 59.732, 'epoch': 3.0}
