 
<img width="200px" height="200px" src='logo-en.png'/>

<br/>
<div style="text-align: center; font-size:20px; font-weight:bold; color: #212F3D">King Abdullah I School of Graduate Studies and Scientific Research</div><br/>
<div style="text-align: center; font-size:20px; font-weight:bold; color: #212F3D;">Data Augmentation using Transformers and Similarity Measures for Improving Arabic Text Classification</div><br/>
<div style="text-align: center; font-size:14px; font-weight:bold; color: #212F3D">Dania Refai<sup>1</sup>, Saleh Abu-Soud<sup>2</sup>, Mohammad Abdel-Rahman<sup>3</sup></div>
<br/>
<div style="text-align: left; font-size:14px; font-weight:normal; color: #212F3D">
    <sup>1</sup> Department of Computer Science, Princess Sumaya University for Technology (PSUT), Amman, Jordan</div>
<br/>
<div style="text-align: left; font-size:14px; font-weight:normal; color: #212F3D">
    <sup>2</sup> Department of Data Science, Princess Sumaya University for Technology (PSUT), Amman, Jordan</div>
<br/>
<div style="text-align: left; font-size:14px; font-weight:normal; color: #212F3D">
    <sup>3</sup> Department of Data Science, Princess Sumaya University for Technology (PSUT), Amman, Jordan</div>
<br/>

<div style="text-align: left; font-size:14px; font-weight:bold; color: #212F3D">
        Corresponding author: Dania Refai (<span style="text-align: left; font-size:16px; font-weight:bold; color: #6495ED">Dania.Refai@hotmail.com</span>).
</div>
<br/>
<hr/>

### <span style="text-align: left; font-size:20px; font-weight:bold; color: #C70039">General Notes and Directions</span> ###
<hr/>

> <li style="text-align: left; font-size:14px; font-weight:bold; color: #212F3D">&nbsp;Make sure PyTorch is installed on your machine. For installation instructions, see <a href="https://pytorch.org/">Install PyTorch</a> on the official website.</li>
> <li style="text-align: left; font-size:14px; font-weight:bold; color: #212F3D">&nbsp;Make sure your installed Python version is 3.8.</li>
> <li style="text-align: left; font-size:14px; font-weight:bold; color: #212F3D">&nbsp;Make sure you run the commands inside the source code directory (<span style="color: #C70039">.\Implementation\</span>).</li>
> <li style="text-align: left; font-size:14px; font-weight:bold; color: #212F3D">&nbsp;Run the following commands in your command shell to create a virtualenv (<span style="color: #C70039">Windows-based systems</span>):</li>
> <ol>    
> <li style="text-align: left; font-family:console; font-size:14px; font-weight:bold; color: #212F3D; list-style-type: none;">
       <span style="color: #C70039">cmd&gt;</span> set PATH=C:\Users\(<span style="text-align: left; font-size:14px; font-weight:bold; color: #C70039">-windows_user-</span>)\AppData\Local\Programs\Python\Python38\
    </li>
> <li style="text-align: left; font-family:console; font-size:14px; font-weight:bold; color: #212F3D; list-style-type: none;">
       <span style="color: #C70039">cmd&gt;</span> %PATH%\python.exe -m pip install --upgrade pip
    </li>   
> <li style="text-align: left; font-family:console; font-size:14px; font-weight:bold; color: #212F3D; list-style-type: none;">
       <span style="color: #C70039">cmd&gt;</span> %PATH%\python.exe -m pip install virtualenv 
    </li>    
> <li style="text-align: left; font-family:console; font-size:14px; font-weight:bold; color: #212F3D; list-style-type: none;">
       <span style="color: #C70039">cmd&gt;</span> %PATH%\python.exe -m virtualenv venv 
    </li>
> </ol>
> <li style="text-align: left; font-size:14px; font-weight:bold; color: #212F3D">&nbsp; Activate the virtual environment: </li>
> <ol>    
> <li style="text-align: left; font-family:console; font-size:14px; font-weight:bold; color: #212F3D; list-style-type: none;">
       <span style="color: #C70039">cmd&gt;</span> .\venv\Scripts\activate
    </li>  
> </ol>
> <li style="text-align: left; font-size:14px; font-weight:bold; color: #212F3D">&nbsp; Install requirements:</li>
> <ol>    
> <li style="text-align: left; font-family:console; font-size:14px; font-weight:bold; color: #212F3D; list-style-type: none;">
       <span style="color: #C70039">cmd&gt;</span> .\venv\Scripts\pip3 install python-dotenv
    </li>
> <li style="text-align: left; font-family:console; font-size:14px; font-weight:bold; color: #212F3D; list-style-type: none;">
       <span style="color: #C70039">cmd&gt;</span> .\venv\Scripts\pip3 install -r requirements.txt
    </li>   
> </ol>

> <li style="text-align: left; font-size:14px; font-weight:bold; color: #212F3D">&nbsp;Notebook Purpose: <span style="color: #C70039">Sentiment Analysis for ASTD dataset using Model: </span>aubmindlab/bert-base-arabertv02-twitter</li>

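Before running the notebook, the prerequisites listed above can be checked from Python itself. The snippet below is a minimal sketch using only the standard library; the helper name `check_environment` is hypothetical and not part of the project.

```python
import importlib.util
import sys


def check_environment(expected=(3, 8)):
    """Report whether the interpreter version and PyTorch match the notes above."""
    return {
        "python_version": f"{sys.version_info.major}.{sys.version_info.minor}",
        # True only when the running interpreter is exactly the expected major.minor
        "python_ok": tuple(sys.version_info[:2]) == tuple(expected),
        # find_spec returns None when the package cannot be imported
        "torch_installed": importlib.util.find_spec("torch") is not None,
    }


print(check_environment())
```

Running it inside the activated virtual environment shows whether the kernel uses the intended interpreter.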


### Imports

In [1]:
# `!set` only affects a temporary subshell; `%env` sets the variable for the notebook kernel
%env PYTORCH_NO_CUDA_MEMORY_CACHING=1

In [2]:
import torch, os
import datetime  # used to timestamp saved models in compute_metrics
import pandas as pd
import numpy as np
from typing import List
from tqdm.notebook import tqdm
from sklearn.model_selection import train_test_split
import random
import matplotlib.pyplot as plt
import copy
from preprocess import ArabertPreprocessor
from sklearn.metrics import precision_recall_curve, roc_curve, auc
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score, precision_score,
                             recall_score)
from torch.utils.data import DataLoader, Dataset
from transformers import (AutoConfig, AutoModelForSequenceClassification,
                          AutoTokenizer, BertTokenizer, Trainer,
                          TrainingArguments)
from transformers.data.processors.utils import InputFeatures
from sklearn.model_selection import StratifiedKFold
from statistics import mean
from transformers import pipeline
import more_itertools
import GPUtil as GPU
import gc
from GPUtil import showUtilization as gpu_usage
import seaborn as sns
from math import sqrt
import warnings
warnings.filterwarnings('ignore')
plt.style.use('classic')
%matplotlib inline
sns.set()


### Utils

In [3]:
class ClassificationDataset(Dataset):
    def __init__(self, text, target, model_name, max_len, label_map):
        super().__init__()
        self.text = text
        self.target = target
        self.tokenizer_name = model_name
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.max_len = max_len
        self.label_map = label_map
      

    def __len__(self):
        return len(self.text)

    def __getitem__(self,item):
        text = str(self.text[item])
        text = " ".join(text.split())
        inputs = self.tokenizer(
          text,
          max_length=self.max_len,
          padding='max_length',
          truncation=True
          )      
        return InputFeatures(**inputs,label=self.label_map[self.target[item]])
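
To make the `__getitem__` steps above concrete, here is a small sketch of the whitespace normalization and of what `padding='max_length'` with `truncation=True` does to a token sequence. It is a pure-Python stand-in, not the Hugging Face tokenizer: the token ids, `pad_id=0`, and the helper names are assumptions for illustration, and a real tokenizer also keeps its special tokens when truncating.

```python
def normalize_whitespace(text):
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return " ".join(str(text).split())


def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Clip a list of ids to max_len, then right-pad it to exactly max_len."""
    clipped = token_ids[:max_len]
    n_pad = max_len - len(clipped)
    return {
        "input_ids": clipped + [pad_id] * n_pad,
        # 1 marks real tokens, 0 marks padding positions
        "attention_mask": [1] * len(clipped) + [0] * n_pad,
    }


# Hypothetical label map, built the same way as in the training loop
label_map = {label: index for index, label in enumerate(["NEG", "POS"])}

features = pad_or_truncate([101, 7, 8, 102], max_len=6)
print(normalize_whitespace("  some   text \n here "), features, label_map["POS"])
```

Every item therefore has the same length, which lets the `Trainer` batch examples without a custom collator.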

In [4]:
'''
	This custom dataset class holds our datasets in a structured manner.	
'''
class CustomDataset:
    def __init__(
        self,
        name: str,
        train: List[pd.DataFrame],
        test: List[pd.DataFrame],
        label_list: List[str],
    ):
        self.name = name
        self.train = train
        self.test = test
        self.label_list = label_list

### Loading Training Dataset (Already Augmented)

In [5]:
datasetname = 'ASTD'
datasetpath = "Augmented-Dataset/xls/ASTD-Unbalanced-Augmented-aragpt2-base.xlsx"
df = pd.read_excel( datasetpath)
df.columns = ['text', 'label', 'new_text', 'all_text', 'original_embbedding', 'new_embbedding', 'ecu_similarity', 'cos_similarity', 'jacc_similarity','text_split', 'all_text_split', 'new_text_split', 'bleu_sim_1','bleu_sim_2', 'bleu_sim_3', 'bleu_sim_4'] 
df.head()

Unnamed: 0,text,label,new_text,all_text,original_embbedding,new_embbedding,ecu_similarity,cos_similarity,jacc_similarity,text_split,all_text_split,new_text_split,bleu_sim_1,bleu_sim_2,bleu_sim_3,bleu_sim_4
0,5 هاتلي اخوان أي حاجة مش تنوين ومش ضمير اخوان ...,NEG,!!.,5 هاتلي اخوان أي حاجة مش تنوين ومش ضمير اخوان ...,"0.014882844,-0.051557414,-0.028316082,0.014168...","0.01946623,-0.010952667,-0.039843258,-0.057320...",0.772223,0.446,0.037037,"['5', 'هاتلي', 'اخوان', 'أي', 'حاجة', 'مش', 'ت...","['5', 'هاتلي', 'اخوان', 'أي', 'حاجة', 'مش', 'ت...",['!!.'],0.89,0.89,0.88,0.88
1,دباسم يوسف عمل برنامج البرنامج و #فسسسسسس,NEG,لر على # الفيس _ بوك [رابط]بسم الله الرحمن الر...,دباسم يوسف عمل برنامج البرنامج و # فسسلر على #...,"0.016909812,0.015640503,-0.02446039,-0.0235670...","0.017838204,0.007064947,-0.03709342,-0.0264731...",0.205765,0.929,0.4375,"['دباسم', 'يوسف', 'عمل', 'برنامج', 'البرنامج',...","['دباسم', 'يوسف', 'عمل', 'برنامج', 'البرنامج',...","['لر', 'على', '#', 'الفيس', '_', 'بوك', '[رابط...",0.14,0.13,0.12,0.11
2,منذ عامين وحتى الآن كل ما قدمه أنصار تيارات ال...,NEG,.,منذ عامين وحتى الآن كل ما قدمه أنصار تيارات ال...,"0.026780926,0.009709039,-0.030822175,-0.033138...","0.022658788,-0.0036188036,-0.033782676,-0.0437...",0.237325,0.904,0.0,"['منذ', 'عامين', 'وحتى', 'الآن', 'كل', 'ما', '...","['منذ', 'عامين', 'وحتى', 'الآن', 'كل', 'ما', '...",['.'],0.95,0.95,0.95,0.95
3,#السعاده ان يكون من نحب بخير وعافيه فنحن نشعر ...,POS,.,# السعاده ان يكون من نحب بخير وعافيه فنحن نشعر...,"0.011575715,-0.0191376,-0.041333534,-0.0137087...","0.022658788,-0.0036188036,-0.033782676,-0.0437...",0.352307,0.829,0.0,"['#السعاده', 'ان', 'يكون', 'من', 'نحب', 'بخير'...","['#', 'السعاده', 'ان', 'يكون', 'من', 'نحب', 'ب...",['.'],0.79,0.78,0.78,0.77
4,درية شرف الدين امرأة على الوشين لا مهنية ولا ا...,NEG,في الشوارع.,درية شرف الدين امرأة على الوشين لا مهنية ولا ا...,"0.016909812,0.015640503,-0.02446039,-0.0235670...","0.017580768,-0.0027376027,-0.03825421,-0.04189...",0.390541,0.834,0.36,"['درية', 'شرف', 'الدين', 'امرأة', 'على', 'الوش...","['درية', 'شرف', 'الدين', 'امرأة', 'على', 'الوش...","['في', 'الشوارع.']",0.92,0.92,0.92,0.92

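The `ecu_similarity`, `cos_similarity`, and `jacc_similarity` columns above are precomputed in the Excel file, and this notebook does not show how. As a rough sketch under that caveat, the textbook definitions of Euclidean distance and cosine similarity (on embedding vectors) and Jaccard similarity (on token sets) look like this; the function names are illustrative and the exact formulas used to build the file may differ.

```python
from math import sqrt


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors (lower = closer)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def jaccard_similarity(tokens_a, tokens_b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


print(euclidean_distance([0, 0], [3, 4]))
print(jaccard_similarity(["a", "b"], ["b", "c"]))
```

Note that Euclidean distance decreases as texts become more similar, while cosine and Jaccard increase.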

In [6]:
# Parameters
all_datasets= []
SIM_COFFICIENTS_THRESHOLDS = {'ECU': df["ecu_similarity"].mean(), 'COS':df["cos_similarity"].mean(), 'JAC':df["jacc_similarity"].mean(), 'BLEU':df["bleu_sim_1"].mean()}
LABEL_TO_AUGMENT = ['NEG', 'NEUTRAL']
DATA_COLUMN = "text"
LABEL_COLUMN = "label"
SIM_COFFICIENTS_THRESHOLDS

{'ECU': 0.33158923031772564,
 'COS': 0.8526818791946309,
 'JAC': 0.3624915367325659,
 'BLEU': 0.3949177274138466}
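
Each threshold above is simply the mean of one similarity column. The following sketch shows one plausible way such a threshold could be applied, keeping rows whose cosine similarity is at or above the mean. The rows are toy values, and the filtering rule and its direction are assumptions; the notebook itself loads files that were already filtered.

```python
from statistics import mean

# Toy rows standing in for the augmented DataFrame (not real data)
rows = [
    {"text": "t1", "cos_similarity": 0.45},
    {"text": "t2", "cos_similarity": 0.93},
    {"text": "t3", "cos_similarity": 0.90},
]


def mean_threshold(rows, column):
    """Threshold defined as the column mean, as in SIM_COFFICIENTS_THRESHOLDS."""
    return mean(row[column] for row in rows)


def keep_at_or_above(rows, column, threshold):
    """Assumed rule: keep rows whose similarity reaches the threshold."""
    return [row for row in rows if row[column] >= threshold]


cos_threshold = mean_threshold(rows, "cos_similarity")
kept = keep_at_or_above(rows, "cos_similarity", cos_threshold)
print(cos_threshold, [row["text"] for row in kept])
```

For a distance such as the Euclidean measure, the comparison would presumably be reversed, since smaller values indicate closer texts.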

### Train: Augmented, Test: Augmented, Text: All-Text

In [7]:
EcuDF = pd.read_excel( "Augmented-Dataset/All/"+datasetname+"-Augmented-ECU-ALL-Text-Final.xlsx")
CosDF = pd.read_excel( "Augmented-Dataset/All/"+datasetname+"-Augmented-COS-ALL-Text-Final.xlsx")
JacDF = pd.read_excel( "Augmented-Dataset/All/"+datasetname+"-Augmented-JAC-ALL-Text-Final.xlsx")
BleDF = pd.read_excel( "Augmented-Dataset/All/"+datasetname+"-Augmented-BLE-ALL-Text-Final.xlsx")
EcuDF.columns = [DATA_COLUMN, LABEL_COLUMN]
CosDF.columns = [DATA_COLUMN, LABEL_COLUMN]
JacDF.columns = [DATA_COLUMN, LABEL_COLUMN]
BleDF.columns = [DATA_COLUMN, LABEL_COLUMN]

In [8]:
# Original Dataset - all text
df = df[[DATA_COLUMN, LABEL_COLUMN]]
train, test = train_test_split(df, test_size=0.2, random_state=42)
label_list = list(df[LABEL_COLUMN].unique())
data = CustomDataset(datasetname+"-Not-Augmented-all-text", train, test, label_list)
all_datasets.append(data)

# Augmented-ECU-Final - all text 
train_ECU, test_ECU = train_test_split(EcuDF, test_size=0.2, random_state=42)
label_list_ECU = list(EcuDF[LABEL_COLUMN].unique())
data_ECU = CustomDataset("ECU-"+datasetname+"-Augmented-Test-all-text", train_ECU, test_ECU, label_list_ECU)
all_datasets.append(data_ECU)

# Augmented-COS-Final - all text
train_COS, test_COS = train_test_split(CosDF, test_size=0.2, random_state=42)
label_list_COS = list(CosDF[LABEL_COLUMN].unique())
data_COS = CustomDataset("COS-"+datasetname+"-Augmented-Test-all-text", train_COS, test_COS, label_list_COS)
all_datasets.append(data_COS)

# Augmented-JACC-Final - all text
train_JACC, test_JACC = train_test_split(JacDF, test_size=0.2, random_state=42)
label_list_JACC = list(JacDF[LABEL_COLUMN].unique())
data_JACC = CustomDataset("JAC-"+datasetname+"-Augmented-Test-all-text", train_JACC, test_JACC, label_list_JACC)
all_datasets.append(data_JACC)

# Augmented-BLEU-Final - all text
train_BLEU, test_BLEU = train_test_split(BleDF, test_size=0.2, random_state=42)
label_list_BLEU = list(BleDF[LABEL_COLUMN].unique())
data_BLEU = CustomDataset("BLE-"+datasetname+"-Augmented-Test-all-text", train_BLEU, test_BLEU, label_list_BLEU)
all_datasets.append(data_BLEU)

### Train: Augmented, Test: Augmented, Text: New-Text

In [9]:
EcuDF = pd.read_excel( "Augmented-Dataset/new/"+datasetname+"-Augmented-ECU-new-Text-Final.xlsx")
CosDF = pd.read_excel( "Augmented-Dataset/new/"+datasetname+"-Augmented-COS-new-Text-Final.xlsx")
JacDF = pd.read_excel( "Augmented-Dataset/new/"+datasetname+"-Augmented-JAC-new-Text-Final.xlsx")
BleDF = pd.read_excel( "Augmented-Dataset/new/"+datasetname+"-Augmented-BLE-new-Text-Final.xlsx")
EcuDF.columns = [DATA_COLUMN, LABEL_COLUMN]
CosDF.columns = [DATA_COLUMN, LABEL_COLUMN]
JacDF.columns = [DATA_COLUMN, LABEL_COLUMN]
BleDF.columns = [DATA_COLUMN, LABEL_COLUMN]

In [10]:
## Augmented-ECU-Final - new text 
train_ECU, test_ECU = train_test_split(EcuDF, test_size=0.2, random_state=42)
label_list_ECU = list(EcuDF[LABEL_COLUMN].unique())
data_ECU = CustomDataset("ECU-"+datasetname+"-Augmented-Test-new-text", train_ECU, test_ECU, label_list_ECU)
all_datasets.append(data_ECU)

# Augmented-COS-Final - new text
train_COS, test_COS = train_test_split(CosDF, test_size=0.2, random_state=42)
label_list_COS = list(CosDF[LABEL_COLUMN].unique())
data_COS = CustomDataset("COS-"+datasetname+"-Augmented-Test-new-text", train_COS, test_COS, label_list_COS)
all_datasets.append(data_COS)

# Augmented-JACC-Final - new text
train_JACC, test_JACC = train_test_split(JacDF, test_size=0.2, random_state=42)
label_list_JACC = list(JacDF[LABEL_COLUMN].unique())
data_JACC = CustomDataset("JAC-"+datasetname+"-Augmented-Test-new-text", train_JACC, test_JACC, label_list_JACC)
all_datasets.append(data_JACC)

# Augmented-BLEU-Final - new text
train_BLEU, test_BLEU = train_test_split(BleDF, test_size=0.2, random_state=42)
label_list_BLEU = list(BleDF[LABEL_COLUMN].unique())
data_BLEU = CustomDataset("BLE-"+datasetname+"-Augmented-Test-new-text", train_BLEU, test_BLEU, label_list_BLEU)
all_datasets.append(data_BLEU)

### Train: Augmented, Test: Not-Augmented, Text: All-Text

In [11]:
EcuDF = pd.read_excel( "Augmented-Dataset/All/"+datasetname+"-Augmented-ECU-ALL-Text-Final.xlsx")
CosDF = pd.read_excel( "Augmented-Dataset/All/"+datasetname+"-Augmented-COS-ALL-Text-Final.xlsx")
JacDF = pd.read_excel( "Augmented-Dataset/All/"+datasetname+"-Augmented-JAC-ALL-Text-Final.xlsx")
BleDF = pd.read_excel( "Augmented-Dataset/All/"+datasetname+"-Augmented-BLE-ALL-Text-Final.xlsx")
EcuDF.columns = [DATA_COLUMN, LABEL_COLUMN]
CosDF.columns = [DATA_COLUMN, LABEL_COLUMN]
JacDF.columns = [DATA_COLUMN, LABEL_COLUMN]
BleDF.columns = [DATA_COLUMN, LABEL_COLUMN]

In [12]:
## Augmented-ECU-Final - all text 
train_ECU, test_ECU = train_test_split(EcuDF, test_size=0.2, random_state=42)
label_list_ECU = list(EcuDF[LABEL_COLUMN].unique())
data_ECU = CustomDataset("ECU-"+datasetname+"-Not-Augmented-Test-all-text", train_ECU, test, label_list_ECU)
all_datasets.append(data_ECU)

# Augmented-COS-Final - all text
train_COS, test_COS = train_test_split(CosDF, test_size=0.2, random_state=42)
label_list_COS = list(CosDF[LABEL_COLUMN].unique())
data_COS = CustomDataset("COS-"+datasetname+"-Not-Augmented-Test-all-text", train_COS, test, label_list_COS)
all_datasets.append(data_COS)

# Augmented-JACC-Final - all text
train_JACC, test_JACC = train_test_split(JacDF, test_size=0.2, random_state=42)
label_list_JACC = list(JacDF[LABEL_COLUMN].unique())
data_JACC = CustomDataset("JAC-"+datasetname+"-Not-Augmented-Test-all-text", train_JACC, test, label_list_JACC)
all_datasets.append(data_JACC)

# Augmented-BLEU-Final - all text
train_BLEU, test_BLEU = train_test_split(BleDF, test_size=0.2, random_state=42)
label_list_BLEU = list(BleDF[LABEL_COLUMN].unique())
data_BLEU = CustomDataset("BLE-"+datasetname+"-Not-Augmented-Test-all-text", train_BLEU, test, label_list_BLEU)
all_datasets.append(data_BLEU)

### Train: Augmented, Test: Not-Augmented, Text: New-Text

In [13]:
EcuDF = pd.read_excel( "Augmented-Dataset/new/"+datasetname+"-Augmented-ECU-new-Text-Final.xlsx")
CosDF = pd.read_excel( "Augmented-Dataset/new/"+datasetname+"-Augmented-COS-new-Text-Final.xlsx")
JacDF = pd.read_excel( "Augmented-Dataset/new/"+datasetname+"-Augmented-JAC-new-Text-Final.xlsx")
BleDF = pd.read_excel( "Augmented-Dataset/new/"+datasetname+"-Augmented-BLE-new-Text-Final.xlsx")
EcuDF.columns = [DATA_COLUMN, LABEL_COLUMN]
CosDF.columns = [DATA_COLUMN, LABEL_COLUMN]
JacDF.columns = [DATA_COLUMN, LABEL_COLUMN]
BleDF.columns = [DATA_COLUMN, LABEL_COLUMN]

In [14]:
# Augmented-ECU-Final - new text 
train_ECU, test_ECU = train_test_split(EcuDF, test_size=0.2, random_state=42)
label_list_ECU = list(EcuDF[LABEL_COLUMN].unique())
data_ECU = CustomDataset("ECU-"+datasetname+"-Not-Augmented-Test-new-text", train_ECU, test, label_list_ECU)
all_datasets.append(data_ECU)

# Augmented-COS-Final - new text
train_COS, test_COS = train_test_split(CosDF, test_size=0.2, random_state=42)
label_list_COS = list(CosDF[LABEL_COLUMN].unique())
data_COS = CustomDataset("COS-"+datasetname+"-Not-Augmented-Test-new-text", train_COS, test, label_list_COS)
all_datasets.append(data_COS)

# Augmented-JACC-Final - new text
train_JACC, test_JACC = train_test_split(JacDF, test_size=0.2, random_state=42)
label_list_JACC = list(JacDF[LABEL_COLUMN].unique())
data_JACC = CustomDataset("JAC-"+datasetname+"-Not-Augmented-Test-new-text", train_JACC, test, label_list_JACC)
all_datasets.append(data_JACC)

# Augmented-BLEU-Final - new text
train_BLEU, test_BLEU = train_test_split(BleDF, test_size=0.2, random_state=42)
label_list_BLEU = list(BleDF[LABEL_COLUMN].unique())
data_BLEU = CustomDataset("BLE-"+datasetname+"-Not-Augmented-Test-new-text", train_BLEU, test, label_list_BLEU)
all_datasets.append(data_BLEU)

### Printing All Datasets Names

In [15]:
for d in all_datasets:
    print(d.name)

ASTD-Not-Augmented-all-text
ECU-ASTD-Augmented-Test-all-text
COS-ASTD-Augmented-Test-all-text
JAC-ASTD-Augmented-Test-all-text
BLE-ASTD-Augmented-Test-all-text
ECU-ASTD-Augmented-Test-new-text
COS-ASTD-Augmented-Test-new-text
JAC-ASTD-Augmented-Test-new-text
BLE-ASTD-Augmented-Test-new-text
ECU-ASTD-Not-Augmented-Test-all-text
COS-ASTD-Not-Augmented-Test-all-text
JAC-ASTD-Not-Augmented-Test-all-text
BLE-ASTD-Not-Augmented-Test-all-text
ECU-ASTD-Not-Augmented-Test-new-text
COS-ASTD-Not-Augmented-Test-new-text
JAC-ASTD-Not-Augmented-Test-new-text
BLE-ASTD-Not-Augmented-Test-new-text

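The 17 names printed above follow a regular pattern: one baseline, then every combination of test-set kind, text variant, and similarity measure. As a sketch, the same list can be generated programmatically; the helper `dataset_names` is hypothetical and only mirrors the naming scheme, not the data loading.

```python
MEASURES = ["ECU", "COS", "JAC", "BLE"]
TEST_KINDS = ["Augmented-Test", "Not-Augmented-Test"]
TEXT_KINDS = ["all-text", "new-text"]


def dataset_names(dataset="ASTD"):
    """Rebuild the experiment names in the order the cells above append them."""
    names = [f"{dataset}-Not-Augmented-all-text"]  # baseline on the original split
    for test_kind in TEST_KINDS:
        for text_kind in TEXT_KINDS:
            for measure in MEASURES:
                names.append(f"{measure}-{dataset}-{test_kind}-{text_kind}")
    return names


for name in dataset_names():
    print(name)
```

Generating the names this way makes it easy to confirm that no configuration was skipped or appended twice.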

### Training and Modeling (Model=aubmindlab/bert-base-arabertv02-twitter)

In [16]:
model_name = 'aubmindlab/bert-base-arabertv02-twitter' 
all_results = []

for d in all_datasets:
    models_files = []
    print ("*********** Dataset Name: " + d.name)
    selected_dataset = copy.deepcopy(d)
    
    ## ---> Training
    # Preprocessing using the AraBERT preprocessor
    arabic_prep = ArabertPreprocessor(model_name)
    selected_dataset.train[DATA_COLUMN] = selected_dataset.train[DATA_COLUMN].apply(arabic_prep.preprocess)
    selected_dataset.test[DATA_COLUMN] = selected_dataset.test[DATA_COLUMN].apply(arabic_prep.preprocess)
    
    # Check the tokenized sentence length to decide on the maximum sentence length value
    tok = AutoTokenizer.from_pretrained(model_name)
    
    
    #print("Training Sentence Lengths: ")
    #plt.hist([ len(tok.tokenize(sentence)) for sentence in selected_dataset.train[DATA_COLUMN].to_list()],bins=range(0,200,2))
    #plt.show()
    #
    #print("Testing Sentence Lengths: ")
    #plt.hist([ len(tok.tokenize(sentence)) for sentence in selected_dataset.test[DATA_COLUMN].to_list()],bins=range(0,200,2))
    #plt.show()
    #
    ## Deciding the maximum length
    max_len = 200
    
    ## Check how many sequences will be truncated
    #print("Truncated training sequences: ", sum([len(tok.tokenize(sentence)) > max_len for sentence in selected_dataset.train[DATA_COLUMN].to_list()]))
    #print("Truncated testing sequences: ", sum([len(tok.tokenize(sentence)) > max_len for sentence in selected_dataset.test[DATA_COLUMN].to_list()]))
    #
    ## Creating the Classification Dataset Splits
    label_map = { v:index for index, v in enumerate(selected_dataset.label_list) }
    #print(label_map)
    
    train_dataset = ClassificationDataset(
        selected_dataset.train[DATA_COLUMN].to_list(),
        selected_dataset.train[LABEL_COLUMN].to_list(),
        model_name,
        max_len,
        label_map
      )
    test_dataset = ClassificationDataset(
        selected_dataset.test[DATA_COLUMN].to_list(),
        selected_dataset.test[LABEL_COLUMN].to_list(),
        model_name,
        max_len,
        label_map
      )
    
    
    # Return a pretrained model ready to do classification
    def model_init():
        return AutoModelForSequenceClassification.from_pretrained(model_name, return_dict=True, num_labels=len(label_map))
    
    # Defining Evaluation Metric
    # p should be of type EvalPrediction
    def compute_metrics(p): 
        preds = np.argmax(p.predictions, axis=1)
        assert len(preds) == len(p.label_ids)
        macro_f1 = f1_score(p.label_ids,preds,average='macro')
        macro_precision = precision_score(p.label_ids,preds,average='macro')
        macro_recall = recall_score(p.label_ids,preds,average='macro')
        acc = accuracy_score(p.label_ids,preds)
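        # ROC and PR curves use the class-1 scores (logits), so they assume a binary task
        probas = p.predictions[:,1]        
        fpr, tpr, _ = roc_curve(p.label_ids, probas)
        precision, recall, _ = precision_recall_curve(p.label_ids, probas)
        roc_auc = auc(fpr, tpr)
        pr_auc = auc(recall, precision)
        
        # Save the current model. `trainer` is the Trainer created later in this loop,
        # and the local name `saved_name` avoids shadowing the outer `model_name`.
        timestamp = datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
        saved_name = f"{d.name}_{macro_f1:.2f}_{timestamp}"
        model_path = os.path.join("models", saved_name)
        trainer.save_model(model_path)  # writes a model directory, not a single .h5 file
        models_files.append(saved_name)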
        return {       
            'macro_f1' : macro_f1,
            'accuracy': acc,
            'precision': macro_precision,
            'recall':macro_recall,
            'fpr':fpr,
            'tpr':tpr,
            'precision_crv':precision,
            'recall_crv':recall,
            'roc_auc': roc_auc,
            'pr_auc': pr_auc
        }
    
    # Defining the Seeding Setter
    def set_seed(seed=42):
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic=True
        torch.backends.cudnn.benchmark = False
    
    # binomial confidence interval
    def get_ci_97_percent (acc, n):
        return 2.17 * sqrt( (acc * (1 - acc)) / n)
    
    ## ---> Modeling (Regular Training)
    # Training parameters | Parameters Reference: https://huggingface.co/docs/transformers/main_classes/trainer#trainingarguments
    training_args = TrainingArguments( 
        output_dir= "./train",    
        adam_epsilon = 1e-8,
        learning_rate = 2e-5,
        fp16 = False, # enable this when using V100 or T4 GPU
        per_device_train_batch_size = 64, # up to 64 on 16GB with max len of 128
        per_device_eval_batch_size = 128,
        gradient_accumulation_steps = 2, # use this to scale batch size without needing more memory
        num_train_epochs= 2,
        warmup_steps = 0,
        do_eval = True,
        evaluation_strategy = 'steps',
        load_best_model_at_end = True, # this allows to automatically get the best model at the end based on whatever metric we want
        metric_for_best_model = 'macro_f1',
        greater_is_better = True,
        seed = 25
      )
    
    set_seed(training_args.seed)
    
    # Trainer Creation
    trainer = Trainer(
        model = model_init(),
        args = training_args,
        train_dataset = train_dataset,
        eval_dataset=test_dataset,
        compute_metrics=compute_metrics,
    )
    
    # start the training
    trainer.train()
    
    # Artifacts Saving (Model, Tokenizer, and Configurations)
    inv_label_map = { v:k for k, v in label_map.items()}
    trainer.model.config.label2id = label_map
    trainer.model.config.id2label = inv_label_map

    #print("####################### Start GPU Monitoring ###########################")
    #GPUs = GPU.getGPUs()        
    #gpu = GPUs[0]    
    #torch.cuda.empty_cache()
    #del training_args, trainer
    #gc.collect()
    #
    ##print("Initial GPU Usage")
    #gpu_usage() 
    #torch.cuda.empty_cache()
    #gpu_usage()
    
 
    # Run k-fold cross-validation on the training split and check performance on each held-out fold
    kfold_dataset = selected_dataset.train
    kfold_dataset.reset_index(inplace=True,drop=True)
    
    # Defining the number of stratified k-fold splits
    kf = StratifiedKFold(
        n_splits=5,
        shuffle=True,
        random_state=123
      )
    
    # Train using cross validation and save the best model at each fold    
    fold_best_f1 = 0
    best_fold = None
    
    for fold_num, (train_idx, dev_idx) in enumerate(kf.split(kfold_dataset, kfold_dataset[LABEL_COLUMN])):
       
        print("**************************Starting Fold Num: ", fold_num," **************************")
        
        # Distinct index names keep the fold loop from overwriting the global `train` DataFrame
        train_dataset = ClassificationDataset(list(kfold_dataset[DATA_COLUMN].iloc[train_idx]),
                                    list(kfold_dataset[LABEL_COLUMN].iloc[train_idx]),
                                    model_name,
                                    max_len,
                                    label_map)
        
        val_dataset = ClassificationDataset(list(kfold_dataset[DATA_COLUMN].iloc[dev_idx]),
                                    list(kfold_dataset[LABEL_COLUMN].iloc[dev_idx]),
                                    model_name,
                                    max_len,
                                    label_map)
        
        training_args = TrainingArguments( 
          output_dir= "./train",    
          adam_epsilon = 1e-8,
          learning_rate = 2e-5,
          fp16 = False,
          per_device_train_batch_size = 64,
          per_device_eval_batch_size = 128,
          gradient_accumulation_steps = 2,
          num_train_epochs= 2,
          warmup_steps = 0,
          do_eval = True,
          evaluation_strategy = 'steps',
          load_best_model_at_end = True,
          metric_for_best_model = 'macro_f1',
          greater_is_better = True,
          seed = 25
        )

        set_seed(training_args.seed)
    
        trainer = Trainer(
          model = model_init(),
          args = training_args,
          train_dataset = train_dataset,
          eval_dataset=val_dataset,
          compute_metrics=compute_metrics,
        )
        
        trainer.model.config.label2id = label_map
        trainer.model.config.id2label = inv_label_map
        
        
        trainer.train()
        results = trainer.evaluate()
        results['Dataset_Name'] = d.name
        
        if results['eval_macro_f1'] > fold_best_f1:
            print('* New Best Model Found!')
            fold_best_f1 = results['eval_macro_f1']
            best_fold = fold_num
        
        # Add a 97% confidence-interval half-width for each measure.
        # n is the size of the validation fold that trainer.evaluate() scored.
        n_eval = len(val_dataset)
        results['Fold_No' ] = fold_num
        results['ci_macro_f1' ] = get_ci_97_percent (results['eval_macro_f1'], n_eval)
        results['ci_accuracy' ] = get_ci_97_percent (results['eval_accuracy'], n_eval)
        results['ci_precision'] = get_ci_97_percent (results['eval_precision'], n_eval)
        results['ci_recall']    = get_ci_97_percent (results['eval_recall'], n_eval)
        all_results.append(results)
              
    # Mode "w" creates the file if it does not exist, so no fallback branch is needed
    with open(f"{d.name}.txt", "w") as file:
        for item in models_files:
            file.write(str(item) + "\n")
    print("List exported successfully!")
        
     
        #print("####################### Start GPU Monitoring ###########################")
        #GPUs = GPU.getGPUs()        
        #gpu = GPUs[0]         
        #gpu_usage() 
        #torch.cuda.empty_cache()    
        #del train_dataset, val_dataset, training_args, trainer
        #gc.collect()

   

*********** Dataset Name: ASTD-Not-Augmented-all-text


Some weights of the model checkpoint at aubmindlab/bert-base-arabertv02-twitter were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at aubmi

Step,Training Loss,Validation Loss


KeyboardInterrupt: 
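
The `get_ci_97_percent` helper in the training cell uses the normal-approximation interval for a proportion, with half-width z * sqrt(p * (1 - p) / n) and z = 2.17 for 97% coverage. The sketch below works one example with made-up numbers (p = 0.8, n = 400); the `ci_bounds` helper is an illustrative addition, and applying this interval to macro-averaged scores rather than plain accuracy is itself an approximation.

```python
from math import sqrt

Z_97 = 2.17  # two-sided z-value for 97% coverage, as used in the notebook


def ci_half_width(p, n, z=Z_97):
    """Half-width of the normal-approximation interval for a proportion p over n samples."""
    if n <= 0:
        raise ValueError("n must be positive")
    return z * sqrt(p * (1 - p) / n)


def ci_bounds(p, n, z=Z_97):
    """Lower and upper bounds, clipped to the valid range [0, 1]."""
    half_width = ci_half_width(p, n, z)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Illustrative values only: accuracy 0.8 measured on 400 evaluation samples
print(ci_half_width(0.8, 400))
print(ci_bounds(0.8, 400))
```

Because the half-width shrinks with the square root of n, quadrupling the evaluation set halves the interval.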

In [None]:
#ASTD-Not-Augmented-all-text
#ECU-ASTD-Augmented-Test-all-text
#COS-ASTD-Augmented-Test-all-text
#JAC-ASTD-Augmented-Test-all-text
#BLE-ASTD-Augmented-Test-all-text
#ECU-ASTD-Augmented-Test-new-text
#COS-ASTD-Augmented-Test-new-text
#JAC-ASTD-Augmented-Test-new-text
#BLE-ASTD-Augmented-Test-new-text
#ECU-ASTD-Not-Augmented-Test-all-text
#COS-ASTD-Not-Augmented-Test-all-text
#JAC-ASTD-Not-Augmented-Test-all-text
#BLE-ASTD-Not-Augmented-Test-all-text
#ECU-ASTD-Not-Augmented-Test-new-text
#COS-ASTD-Not-Augmented-Test-new-text
#JAC-ASTD-Not-Augmented-Test-new-text
#BLE-ASTD-Not-Augmented-Test-new-text

### Export Sentiment Analysis Training Results

In [None]:
trainResults = pd.DataFrame.from_dict(all_results, orient='columns')   

In [None]:
trainResults.head()

In [None]:
trainResults.to_excel("LatestResults/ASTD/ASTD-Results-1.xlsx")
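
Because `all_results` holds one row per fold for every dataset, a per-dataset summary is a natural next step. The following is a minimal sketch that averages a metric across folds using plain dictionaries; the rows are toy values rather than actual results, and `mean_by_dataset` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean


def mean_by_dataset(results, metric):
    """Average one metric over all folds that share the same Dataset_Name."""
    grouped = defaultdict(list)
    for row in results:
        grouped[row["Dataset_Name"]].append(row[metric])
    return {name: mean(values) for name, values in grouped.items()}


# Toy rows with the same keys the training loop stores (values are made up)
toy_results = [
    {"Dataset_Name": "A", "Fold_No": 0, "eval_macro_f1": 0.5},
    {"Dataset_Name": "A", "Fold_No": 1, "eval_macro_f1": 0.7},
    {"Dataset_Name": "B", "Fold_No": 0, "eval_macro_f1": 0.9},
]

print(mean_by_dataset(toy_results, "eval_macro_f1"))
```

With the real results loaded in pandas, an equivalent group-by on `Dataset_Name` would give the same summary.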