# Fine-tuning AraBERT for text classification

As we saw in the previous notebooks, neither the classical ML models nor the simple LSTM model gave good accuracy. In this notebook I'll try to fine-tune AraBERT, a BERT model already pretrained on Arabic text: [AraBERT](https://github.com/aub-mind/arabert)

### Importing libraries

In [1]:
# !pip install transformers
!pip install datasets


In [2]:
from datasets import load_dataset, Dataset, concatenate_datasets
import os
import time
import datetime
import random
import pandas as pd
import numpy as np
import torch  # used in the evaluation loop below

from transformers import AutoConfig, AutoTokenizer, AutoModelForSequenceClassification, BertForSequenceClassification
from transformers import Trainer, TrainingArguments, get_linear_schedule_with_warmup
from sklearn.metrics import classification_report, accuracy_score, f1_score, confusion_matrix, precision_score, recall_score
from sklearn.preprocessing import LabelEncoder

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

## 1- Formatting the data and making dataset using HuggingFace library datasets

The data is not quite in the format that the `datasets` loader and the model expect (a `text` column and an integer `label` column), so I'll reformat it first.

### a- Re-formatting the data

In [3]:
train = pd.read_csv("../input/clean-dialect-text/train.csv",lineterminator='\n')
val = pd.read_csv("../input/clean-dialect-text/validation.csv",lineterminator='\n')
test = pd.read_csv("../input/clean-dialect-text/test.csv",lineterminator='\n')

In [4]:
encoder = LabelEncoder()
classes = encoder.fit_transform(train['dialect'])
train['label'] = classes
val['label'] = encoder.transform(val['dialect'])

train.drop(columns=['id','length','dialect'],inplace=True)
train.rename(columns = {'clean_text':'text'}, inplace = True)

val.drop(columns=['id','length','dialect'],inplace=True)
val.rename(columns = {'clean_text':'text'}, inplace = True)

train.to_csv("new_train.csv",index=False)
val.to_csv("new_val.csv",index=False)

I saved the files to CSV again because that makes it easy to load them as a single dataset with train and test splits.
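As a small illustration of the label-encoding step above, here is a sketch in plain Python of the mapping that `LabelEncoder` builds: it assigns integer ids to the sorted unique class names. The dialect codes below are toy values, and `label2id` / `id2label` are names chosen for this example only.

```python
# Toy dialect labels; the real values come from train['dialect'].
dialects = ["EG", "LB", "EG", "SA", "LB"]

# LabelEncoder assigns ids to the sorted unique classes; we mimic that here.
classes = sorted(set(dialects))
label2id = {name: idx for idx, name in enumerate(classes)}
id2label = {idx: name for name, idx in label2id.items()}

# Encode the names to integers, then decode them back.
encoded = [label2id[d] for d in dialects]
decoded = [id2label[i] for i in encoded]

print(encoded)              # [0, 1, 0, 2, 1]
print(decoded == dialects)  # True
```

With the fitted `encoder` from the cell above, `encoder.inverse_transform(...)` performs the same decoding, which is handy for turning predicted ids back into dialect names.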

In [5]:
dataset = load_dataset("csv", data_files={"train":"./new_train.csv","test":"./new_val.csv"}, delimiter=",", lineterminator='\n')

Now that we have our dataset, we need to tokenize it before feeding it to the model.

## 2- Getting the tokenizer and the model, and tokenizing the data

In [6]:
model_name = 'aubmindlab/bert-base-arabert'

config = AutoConfig.from_pretrained(model_name,num_labels=18, output_attentions=True) 
tokenizer = AutoTokenizer.from_pretrained(model_name, do_lower_case=False, do_basic_tokenize=True)
model = BertForSequenceClassification.from_pretrained(model_name,config=config)


As the warning printed when loading the model says, the classification head is newly initialized, so we have to fine-tune the model on the downstream task before using it.

In [19]:
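def tokenize_function(examples):
    # Pad or truncate every text to the tokenizer's maximum length.
    # No return_tensors here: on a single example it adds an extra leading
    # dimension, and the dataset stores the values as lists anyway.
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# batched=True tokenizes many rows per call; the Trainer's data collator
# converts the stored lists to PyTorch tensors when it builds each batch.
tokenized_datasets = dataset.map(tokenize_function, batched=True)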
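To make the padding and truncation options concrete, here is a toy sketch in plain Python. `pad_or_truncate` and the pad id `0` are hypothetical stand-ins, not part of the tokenizer API, and the sketch ignores the special tokens that a real BERT tokenizer keeps when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each with exactly max_length entries."""
    ids = token_ids[:max_length]          # truncation: drop tokens past the limit
    mask = [1] * len(ids)                 # 1 marks a real token
    padding = max_length - len(ids)       # number of pad positions to append
    return ids + [pad_id] * padding, mask + [0] * padding

short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)

print(short_ids, short_mask)  # [5, 6, 7, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4, 5] [1, 1, 1, 1, 1]
```

Because every example ends up with the same length, examples can be stacked into a batch directly. The cost is that short texts carry many pad positions.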

The data is very large: when I used all of it, the notebook crashed. So I'll select a subset that fits into the workspace RAM.

In [42]:
train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(10000))

eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(2000))
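A random subset can under-represent some of the classes, so it may be worth checking the label distribution after sampling. The helper below is a minimal sketch; `label_shares` is a hypothetical name and the labels are toy values.

```python
from collections import Counter

def label_shares(labels):
    """Return the fraction of examples per label, ordered by label id."""
    counts = Counter(labels)
    total = len(labels)
    return {label: counts[label] / total for label in sorted(counts)}

toy_labels = [0, 0, 1, 2, 2, 2, 1, 0]
shares = label_shares(toy_labels)
print(shares)  # {0: 0.375, 1: 0.25, 2: 0.375}
```

On the real data, something like `label_shares(train_dataset["label"])` should give the same kind of summary, assuming the column can be read as a sequence of integers.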

In [43]:
# Spot-check one example to make sure sequences in both datasets are padded to the same length
len(eval_dataset[5]['input_ids'])

## 3- Choosing training arguments

In [9]:
training_args = TrainingArguments("./train")
training_args.do_train = True
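# Note: `evaluate_during_training` only exists in older transformers releases.
# Later releases replaced it with an evaluation-strategy argument, so this
# assignment may have no effect there.
training_args.evaluate_during_training = True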
training_args.adam_epsilon = 1e-5
training_args.learning_rate = 2e-5
training_args.warmup_steps = 0
training_args.per_device_train_batch_size = 16
training_args.per_device_eval_batch_size = 16
training_args.num_train_epochs= 5
training_args.seed = 42
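To see what these settings imply, the sketch below works out the number of optimization steps and the learning rate under a linear decay schedule, which is the Trainer's default scheduler type. It assumes a single device, no gradient accumulation, and the 10,000-example training subset selected above; `linear_lr` is a hypothetical helper written for this illustration.

```python
import math

num_examples = 10_000   # size of the training subset
batch_size = 16         # per_device_train_batch_size, single device assumed
num_epochs = 5
base_lr = 2e-5
warmup_steps = 0

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

def linear_lr(step):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * remaining

print(steps_per_epoch, total_steps)  # 625 3125
print(linear_lr(0))                  # 2e-05
print(linear_lr(total_steps))        # 0.0
```

With no warmup, training starts at the full learning rate of 2e-5 and decays to zero over 3125 steps.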

The Trainer doesn't report the F1 score by default, so I wrote a custom `compute_metrics` function, but it didn't work in the end!

In [10]:

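# https://github.com/huggingface/transformers/blob/master/src/transformers/trainer_utils.py
def compute_metrics(p):
    # With output_attentions=True the model returns (logits, attentions),
    # so the predictions may arrive as a tuple; keep only the logits.
    logits = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.argmax(logits, axis=1)
    assert len(preds) == len(p.label_ids)
    macro_f1 = f1_score(p.label_ids, preds, average='macro')
    acc = accuracy_score(p.label_ids, preds)
    return {
        'macro_f1': macro_f1,
        'accuracy': acc
    }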
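For reference, macro F1 is the unweighted mean of the per-class F1 scores, so rare dialects count as much as frequent ones. The sketch below computes it by hand on toy labels; this `macro_f1` function is a hypothetical re-implementation for illustration, not the scikit-learn code.

```python
def macro_f1(labels, preds):
    """Unweighted mean of per-class F1, with F1 = 2*TP / (2*TP + FP + FN)."""
    classes = sorted(set(labels) | set(preds))
    scores = []
    for c in classes:
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

toy_labels = [0, 0, 1, 1, 2, 2]
toy_preds = [0, 1, 1, 1, 2, 0]

# Per-class F1 here is 0.5, 0.8 and 2/3, so the macro average is about 0.656.
score = macro_f1(toy_labels, toy_preds)
accuracy = sum(y == p for y, p in zip(toy_labels, toy_preds)) / len(toy_labels)
print(round(score, 4), round(accuracy, 4))  # 0.6556 0.6667
```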



In [11]:
trainer = Trainer(model=model,
                  args = training_args,
                  train_dataset = train_dataset,
                  eval_dataset = eval_dataset,
                  compute_metrics = compute_metrics)

In [12]:
trainer.train()

In [51]:
#model.eval()

## 4 - Evaluation

As mentioned above, the custom metrics function did not work, so I wrote another one. It is very basic and not optimized, but it does the job.

In [119]:
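def eval(data):
    labels = []
    preds = []
    model.eval()  # switch off dropout so predictions are deterministic
    # cap at 20000 examples, also because of RAM
    n = min(20000, len(data))
    with torch.no_grad():  # gradients are not needed for inference
        for i in range(n):
            example = data[i]
            labels.append(example['label'])
            # truncation=True is required for max_length to actually cut long texts
            tokens = tokenizer([example['text']], max_length=512, truncation=True, return_tensors='pt').to('cuda')
            out = model(**tokens)
            preds.append(torch.argmax(out['logits'][0]).item())
    macro_f1 = f1_score(labels, preds, average='macro')
    return macro_f1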

In [120]:
eval(dataset['test'])
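Running one example per forward pass is slow. One possible improvement is to group the texts into batches first. The generator below is a minimal sketch of that grouping; `batches` is a hypothetical helper and the sentences are toy values.

```python
def batches(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = [f"sentence {i}" for i in range(10)]
sizes = [len(batch) for batch in batches(texts, batch_size=4)]
print(sizes)  # [4, 4, 2]
```

Each batch of texts could then be passed to the tokenizer in a single call with `padding=True, truncation=True, max_length=512, return_tensors='pt'`, and the predictions read with `argmax` over the last dimension of the logits.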

As we can see, it's worse than the machine learning models and the basic LSTM, but I'm sure the problem lies in my approach. This notebook needs more work.

## 5- Saving the model

In [114]:
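# save_pretrained expects a directory, so this creates a folder named "model.bin"
# that holds the weights and the config
model.save_pretrained("model.bin")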