# Arabic Dialect Prediction

This notebook builds a model that predicts the Arabic dialect of a given text. It first tries classical ML models, then moves to a deep learning approach by fine-tuning [Multi-dialect-Arabic-BERT](https://github.com/mawdoo3/Multi-dialect-Arabic-BERT), which was pretrained on 10M Arabic tweets.

# Table of Contents:
* [Classic ML Approach](#1)
    * [Preprocessing](#1.1)
    * [Evaluate best classifier's performance on other datasets](#1.2)
    * [Summary of classic ML results](#1.3)
* [Deep Learning Approach](#2)
    * [Multi-dialect-Arabic-BERT](#2.1)
        * [Preprocessing](#2.1.1)
        * [Create data loaders for training and validation sets](#2.1.2)
        * [Define model initialization class and functions](#2.1.3)
        * [Define model train and evaluate functions](#2.1.4)
        * [Initialize and train model](#2.1.5)
        * [Save model](#2.1.6)
        * [Define prediction and test set evaluation functions](#2.1.7)
        * [Predict and evaluate validation subset](#2.1.8)
        * [Predict and evaluate test subset](#2.1.9)
        * [Summary of performance on test datasets](#2.1.10)

* [Summary](#3)


In [2]:
import os
import re
from tqdm import tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import csv

%matplotlib inline

#### Load dataset 

In [3]:
import pandas as pd
NEW_DF = pd.read_csv('../input/dataset/Data.csv')


> <a id="1"></a>
# Classical ML approach


**Exploratory data analysis (EDA)**

In [4]:
NEW_DF.shape

In [5]:
NEW_DF.dialect.unique()

In [6]:
dialect_count = NEW_DF.groupby('dialect', as_index=False).count()
dialect_count.sort_values(['text'],ascending=False,)

In [7]:
import plotly.express as px

Fig1 =px.bar(x=dialect_count.dialect ,y=dialect_count.text, template='plotly_dark')

Fig1.update_traces(  texttemplate="%{y}",textposition='outside')
Fig1.show()

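* The dataset is imbalanced: some dialects have far more samples than others, which can bias a classifier toward the majority classes.
* One remedy is to resample the dataset so that every dialect has the same number of samples.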
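
The resampling idea can be sketched on a toy frame. This is a minimal illustration rather than the notebook's exact procedure: the helper name `resample_per_class`, the toy data and the target size are all assumptions, and sampling uses replacement only when a class has fewer rows than the target.

```python
import pandas as pd

def resample_per_class(df, label_col, n_per_class, seed=0):
    """Return a frame with exactly n_per_class rows per label (hypothetical helper)."""
    parts = []
    for _, group in df.groupby(label_col):
        # Over-sample (with replacement) small classes, under-sample large ones.
        need_replacement = len(group) < n_per_class
        parts.append(group.sample(n_per_class, replace=need_replacement, random_state=seed))
    return pd.concat(parts, ignore_index=True)

# Toy data: 'EG' is over-represented, 'OM' is under-represented.
toy = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(12)],
    "dialect": ["EG"] * 9 + ["OM"] * 3,
})
balanced_toy = resample_per_class(toy, "dialect", n_per_class=5)
print(balanced_toy["dialect"].value_counts())
```

Over-sampling only duplicates existing rows, so it balances the class counts without adding new information about the minority dialects.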

In [8]:
# Over-sample the minority dialects: sampling with replacement adds
# duplicate rows until each of these classes has 25,000 samples.
under_represented_labels = ['OM','SY','DZ','IQ','SD','MA','YE','TN']

# DataFrame.append was removed in pandas 2.0, so collect the samples and concatenate once.
under_balanced_df = pd.concat(
    [NEW_DF[NEW_DF.dialect == label].sample(25000, replace=True)
     for label in under_represented_labels]
).reset_index(drop=True)
under_balanced_df.shape

In [9]:
# Under-sample the over-represented dialects down to 25,000 samples each.
# Note: replace=True can draw the same row more than once; replace=False avoids
# that when every listed dialect has at least 25,000 rows.
over_represented_labels = ['LY', 'QA', 'PL', 'JO','SA','EG','LB','KW','AE','BH']

over_balanced_df = pd.concat(
    [NEW_DF[NEW_DF.dialect == label].sample(25000, replace=True)
     for label in over_represented_labels]
).reset_index(drop=True)
over_balanced_df.shape

In [10]:
balanced_df = pd.concat([over_balanced_df, under_balanced_df], axis=0).reset_index(drop=True)
balanced_df.shape

In [11]:
import plotly.express as px
dialect_count_balanced = balanced_df.groupby('dialect', as_index=False).count().sort_values(['text'],ascending=False,)
Fig1 =px.bar(x=dialect_count_balanced.dialect ,y=dialect_count_balanced.text, template='plotly_dark')

Fig1.update_traces(  texttemplate="%{y}",textposition='outside')
Fig1.show()

# Text preprocessing

In [12]:
import re
import pandas as pd

def text_preprocessing(text):
    """
    Clean a tweet for the classical ML models.
    @param    text (str): a string to be processed.
    @return   text (str): the processed string.
    """
    # Remove '@name' mentions
    text = re.sub(r"@\S+", '', text)
    # Remove URLs anywhere in the text (a pattern anchored with ^ only matches at the start)
    text = re.sub(r'https?://\S+', '', text)
    # Remove punctuation
    text = re.sub(r'[!"#$%&\'()*+,\-./:;<=>?\[\]^_`{|}~]', '', text)
    # Replace newline characters with a space so adjacent words are not glued together
    text = re.sub(r'\n', ' ', text)
    # Remove emoji and other symbols. The broad U+24C2-U+1F251 range also removes
    # CJK characters and Arabic presentation forms; ordinary Arabic letters
    # (U+0600-U+06FF) are untouched.
    text = re.sub("["
                  "\U0001F600-\U0001F64F"  # emoticons
                  "\U0001F300-\U0001F5FF"  # symbols & pictographs
                  "\U0001F680-\U0001F6FF"  # transport & map symbols
                  "\U0001F1E0-\U0001F1FF"  # regional indicator symbols (flags)
                  "\U0001F926-\U0001F937"  # gesture emoji
                  "\U00010000-\U0010FFFF"  # all supplementary-plane characters
                  "\U000024C2-\U0001F251"  # enclosed characters, dingbats, misc symbols, CJK, ...
                  "\u200d"                 # zero-width joiner
                  "\ufe0f"                 # variation selector-16
                  "\u231a\u23cf\u23e9\u3030"  # watch, eject, fast-forward, wavy dash
                  "]+", '', text)
    # Replace (أ,آ,إ) with (ا)
    text = re.sub('[أإآ]', 'ا', text)
    # Replace (ة) with (ه)
    text = re.sub('[ة]', 'ه', text)
    # Collapse doubled letters
    text = text.replace('هه', 'ه')
    text = text.replace('وو', 'و')
    text = text.replace('يي', 'ي')
    text = text.replace('اا', 'ا')
    # Collapse multiple spaces
    text = re.sub(' +', ' ', text)
    return text
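
The letter normalization at the end of the function can be tried in isolation. This is a minimal sketch, assuming a hypothetical helper `normalize_arabic` and made-up sample words; code points are written as escapes so the mapping is unambiguous.

```python
import re

ALEF_VARIANTS = "[\u0623\u0625\u0622]"  # alef with hamza above, hamza below, madda above
ALEF = "\u0627"                          # bare alef
TA_MARBUTA = "\u0629"
HA = "\u0647"

def normalize_arabic(text):
    """Map alef variants to bare alef and ta marbuta to ha (hypothetical helper)."""
    text = re.sub(ALEF_VARIANTS, ALEF, text)
    text = text.replace(TA_MARBUTA, HA)
    return text

# Two sample words: one starting with alef-hamza, one ending with ta marbuta.
sample = "\u0623\u062d\u0645\u062f \u0645\u062f\u0631\u0633\u0629"
print(normalize_arabic(sample))
```

Normalizing these variants shrinks the vocabulary, since spellings that differ only in hamza placement or the final letter map to the same token.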

In [13]:
balanced_df['Clean_text']=balanced_df['text'].apply(text_preprocessing)


In [14]:
balanced_df.head()

In [15]:
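from sklearn.model_selection import train_test_split

# Note: the models below are trained on the raw `text` column;
# use `balanced_df.Clean_text` instead to train on the preprocessed text.
X = balanced_df.text
y = balanced_df.dialect
# Hold out 30% of the data as the test set, stratified by dialect
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=42,stratify=y,test_size=0.3)
# Split the remaining 70% into training (80%) and validation (20%) subsets
X_train, X_val, y_train1, y_val = train_test_split(x_train, y_train, random_state=42,stratify=y_train,test_size=0.2)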
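
Two chained splits with `test_size=0.3` and then `test_size=0.2` leave roughly 56% of the data for training, 14% for validation and 30% for testing. A small sketch on synthetic labels (the data is made up) shows the proportions:

```python
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the balanced dataset: 1000 samples, 4 equally sized classes.
X_toy = [f"sample {i}" for i in range(1000)]
y_toy = [i % 4 for i in range(1000)]

x_tr_full, x_te, y_tr_full, y_te = train_test_split(
    X_toy, y_toy, random_state=42, stratify=y_toy, test_size=0.3)
x_tr, x_va, y_tr, y_va = train_test_split(
    x_tr_full, y_tr_full, random_state=42, stratify=y_tr_full, test_size=0.2)

total = len(X_toy)
for name, part in [("train", x_tr), ("validation", x_va), ("test", x_te)]:
    print(f"{name}: {len(part)} samples ({len(part) / total:.0%})")
```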

# RandomForest Classifier

In [16]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer

rf_classifier = RandomForestClassifier(n_estimators=400,max_depth=35,random_state=128,max_features='sqrt',bootstrap=False)

# TF-IDF features over word uni-, bi- and tri-grams, followed by the random forest
pipe_reg = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 3))),
    ('model', rf_classifier)])

pipe_reg.fit(X_train, y_train1)


y_pred_pipe_reg_tr = pipe_reg.predict(X_train)
y_pred_pipe_reg_val = pipe_reg.predict(X_val)
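
With `ngram_range=(1, 3)` the vectorizer builds features from single words, word pairs and word triples, so the vocabulary grows quickly. A minimal sketch on one made-up sentence:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# One toy document with three words yields 3 unigrams, 2 bigrams and 1 trigram.
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
vectorizer.fit(["good morning friend"])

ngrams = sorted(vectorizer.vocabulary_)
print(ngrams)
```

On hundreds of thousands of tweets this produces a very large sparse matrix; `max_features` and `min_df` are the usual parameters for bounding it.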

**For validation dataset**

In [17]:
target = NEW_DF['dialect'].astype('category')
print('categories: {}'.format(target.cat.categories))
from sklearn import metrics
print(metrics.classification_report(y_val, y_pred_pipe_reg_val))

**For test dataset**

In [18]:
y_pred_pipe_reg_test= pipe_reg.predict(x_test)

In [26]:
y_test

In [19]:
target = NEW_DF['dialect'].astype('category')
print('categories: {}'.format(target.cat.categories))
from sklearn import metrics
print(metrics.classification_report(y_test, y_pred_pipe_reg_test))

<a id="2"></a>
# Deep Learning Approach
- The Random Forest classifier did not generalize well to other datasets (possibly because of overfitting), so I decided to try a DL approach that fine-tunes a pretrained model. Pretraining effectively enlarges the data the model has seen, which is one way to counter overfitting. For that I chose [Multi-dialect-Arabic-BERT](https://github.com/mawdoo3/Multi-dialect-Arabic-BERT).
> The model was pretrained on 10M Arabic tweets.


In [11]:
import torch

if torch.cuda.is_available():       
    device = torch.device("cuda")
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))

else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

In [12]:
from transformers import AutoTokenizer, AutoModel

<a id="2.1"> </a>
### Multi-dialect-Arabic-BERT
Code adapted from https://skimai.com/fine-tuning-bert-for-sentiment-analysis/

In [13]:
tokenizer = AutoTokenizer.from_pretrained("bashar-talafha/multi-dialect-bert-base-arabic")

<a id="2.1.1"> </a>
##### Preprocessing

In [14]:
import re
import unicodedata

# Define preprocessing util function
def text_preprocessing(text):
    """
    - Normalize unicode to NFC
    - Remove entity mentions (eg. '@united')
    - Correct errors (eg. '&amp;' to '&')
    - Replace URLs with a '<URL>' placeholder
    @param    text (str): a string to be processed.
    @return   text (str): the processed string.
    """
    # Normalize unicode encoding
    text = unicodedata.normalize('NFC', text)

    # Remove '@name' mentions, including one at the very end of the text
    text = re.sub(r'@\S+', ' ', text)

    # Replace '&amp;' with '&'
    text = re.sub(r'&amp;', '&', text)

    # Replace URLs anywhere in the text with a placeholder token
    text = re.sub(r'https?://\S+', '<URL>', text)

    # Collapse runs of whitespace and strip leading/trailing whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text
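
When matching URLs inside tweets, a pattern anchored with `^` only fires if the tweet starts with the URL, and its `.*` then swallows the rest of the line. A sketch of the difference on two made-up tweets:

```python
import re

ANCHORED = r'^https?:\/\/.*[\r\n]*'   # matches only when the text starts with a URL
UNANCHORED = r'https?://\S+'          # matches a URL anywhere, up to the next whitespace

mid_url = "see https://example.com/a now"
leading_url = "https://example.com/a great news"

print(re.sub(ANCHORED, '<URL>', mid_url))        # URL in the middle is left untouched
print(re.sub(UNANCHORED, '<URL>', mid_url))      # only the URL is replaced
print(re.sub(ANCHORED, '<URL>', leading_url))    # the whole tweet collapses to the placeholder
print(re.sub(UNANCHORED, '<URL>', leading_url))  # the words after the URL survive
```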

In [15]:
# Create a function to tokenize a set of texts
import unicodedata

def preprocessing_for_bert(data, text_preprocessing_fn=text_preprocessing):
    """Perform required preprocessing steps for pretrained BERT.
    @param    data (np.array): Array of texts to be processed.
    @param    text_preprocessing_fn (callable): Cleaning function applied to each text.
    @return   input_ids (torch.Tensor): Tensor of token ids to be fed to a model.
    @return   attention_masks (torch.Tensor): Tensor of indices specifying which
                  tokens should be attended to by the model.
    """
    # Create empty lists to store outputs
    input_ids = []
    attention_masks = []

    # For every sentence (reusing the `tokenizer` loaded above rather than
    # reloading it on every call)...
    for sent in data:
        # `encode_plus` will:
        #    (1) Tokenize the sentence
        #    (2) Add the `[CLS]` and `[SEP]` token to the start and end
        #    (3) Truncate/Pad sentence to max length
        #    (4) Map tokens to their IDs
        #    (5) Create attention mask
        #    (6) Return a dictionary of outputs
        encoded_sent = tokenizer.encode_plus(
            text=text_preprocessing_fn(sent),  # Preprocess sentence
            add_special_tokens=True,           # Add `[CLS]` and `[SEP]`
            max_length=MAX_LEN,                # Max length to truncate/pad
            padding='max_length',              # Pad sentence to max length
            return_attention_mask=True,        # Return attention mask
            truncation=True                    # Truncate sentences longer than max length
            )

        # Add the outputs to the lists
        input_ids.append(encoded_sent.get('input_ids'))
        attention_masks.append(encoded_sent.get('attention_mask'))

    # Convert lists to tensors
    input_ids = torch.tensor(input_ids)
    attention_masks = torch.tensor(attention_masks)

    return input_ids, attention_masks
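
What padding, truncation and the attention mask mean can be shown without a tokenizer. The sketch below is a simplified pure-Python illustration: the helper `pad_and_mask`, the token ids and the pad id of 0 are assumptions, and unlike a real tokenizer this plain slice does not preserve a closing `[SEP]` token when truncating.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Truncate or pad token_ids to max_len and build the matching attention mask."""
    ids = token_ids[:max_len]
    mask = [1] * len(ids)           # 1 = real token the model should attend to
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 = padding to ignore

ids, mask = pad_and_mask([101, 7, 8, 102], max_len=6)
print(ids)
print(mask)
```

Every sequence ends up with the same length, which is what lets the encoded tweets be stacked into a single tensor.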

In [16]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(balanced_df.dialect)

y_train_labeled = le.transform(balanced_df.dialect)
balanced_df['labeled']=y_train_labeled
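
`LabelEncoder` assigns integer ids to the sorted unique labels, and `inverse_transform` maps ids back to dialect codes. A minimal sketch on a few made-up labels:

```python
from sklearn.preprocessing import LabelEncoder

toy_encoder = LabelEncoder()
toy_encoder.fit(["SA", "EG", "LB", "EG"])

# Classes are stored in sorted order; the position is the integer id.
print(list(toy_encoder.classes_))
encoded = toy_encoder.transform(["SA", "EG"])
decoded = toy_encoder.inverse_transform([1])
print(list(encoded), list(decoded))
```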

In [17]:
from sklearn.model_selection import train_test_split

X = balanced_df.text
y = balanced_df.labeled
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=42,stratify=y,test_size=0.3)
X_train, X_val, y_train1, y_val = train_test_split(x_train, y_train, random_state=42,stratify=y_train,test_size=0.2)

In [18]:
import re
# Specify `MAX_LEN`
MAX_LEN =  280

# Print sentence 0 and its encoded token ids
token_ids = list(preprocessing_for_bert([X[0]])[0].squeeze().numpy())
print('Original: ', X[0])
print('Token IDs: ', token_ids)

# Run function `preprocessing_for_bert` on the train set and the validation set
print('Tokenizing data...')
train_inputs, train_masks = preprocessing_for_bert(X_train)
val_inputs, val_masks = preprocessing_for_bert(X_val)

<a id="2.1.2"> </a>
##### Create data loaders for training and validation sets

In [19]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

# Convert other data types to torch.Tensor
train_labels = torch.tensor(y_train1.tolist())
val_labels = torch.tensor(y_val.tolist())

# For fine-tuning BERT, the authors recommend a batch size of 16 or 32.
batch_size = 32

# Create the DataLoader for our training set
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# Create the DataLoader for our validation set
val_data = TensorDataset(val_inputs, val_masks, val_labels)
val_sampler = SequentialSampler(val_data)
val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)
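
How a `DataLoader` slices a `TensorDataset` into batches can be checked on small random tensors. The shapes below are toy values chosen for illustration, not the notebook's real sizes:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, SequentialSampler

# 10 fake "sentences" of 6 token ids each, with masks and labels.
toy_inputs = torch.randint(0, 100, (10, 6))
toy_masks = torch.ones(10, 6, dtype=torch.long)
toy_labels = torch.randint(0, 18, (10,))

toy_data = TensorDataset(toy_inputs, toy_masks, toy_labels)
toy_loader = DataLoader(toy_data, sampler=SequentialSampler(toy_data), batch_size=4)

# Each batch is a tuple (input_ids, attention_mask, labels); the last batch may be smaller.
batch_shapes = [tuple(ids.shape) for ids, masks, labels in toy_loader]
print(len(toy_loader), batch_shapes)
```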

<a id="2.1.3"> </a>
##### Define model initialization class and functions

In [20]:
%%time
import torch
import torch.nn as nn
from transformers import BertModel

# Create the BertClassifier class
class BertClassifier(nn.Module):
    """Bert Model for Classification Tasks.
    """
    def __init__(self, freeze_bert=False):
        """
        @param    freeze_bert (bool): Set `True` to freeze the BERT weights and train
                      only the classifier head; `False` fine-tunes the whole model.
        """
        super(BertClassifier, self).__init__()
        # Specify hidden size of BERT, hidden size of our classifier, and number of labels
        # (18 = number of dialects in the balanced dataset)
        D_in = 768
        H, D_out = 50, 18

        # Instantiate BERT model
        self.bert = AutoModel.from_pretrained("bashar-talafha/multi-dialect-bert-base-arabic")
        # Instantiate a feed-forward classifier with one hidden layer
        self.classifier = nn.Sequential(
            nn.Linear(D_in, H),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(H, D_out)
        )

        # Freeze the BERT model
        if freeze_bert:
            for param in self.bert.parameters():
                param.requires_grad = False
        
    def forward(self, input_ids, attention_mask):
        """
        Feed input to BERT and the classifier to compute logits.
        @param    input_ids (torch.Tensor): an input tensor with shape (batch_size,
                      max_length)
        @param    attention_mask (torch.Tensor): a tensor that hold attention mask
                      information with shape (batch_size, max_length)
        @return   logits (torch.Tensor): an output tensor with shape (batch_size,
                      num_labels)
        """
        # Feed input to BERT
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask)
        
        # Extract the last hidden state of the token `[CLS]` for classification task
        last_hidden_state_cls = outputs[0][:, 0, :]

        # Feed input to classifier to compute logits
        logits = self.classifier(last_hidden_state_cls)

        return logits
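
The shape flow through the classification head can be checked without downloading BERT. In this sketch a random tensor stands in for `outputs[0]`; the batch size and sequence length are arbitrary toy values:

```python
import torch
import torch.nn as nn

# Same layer sizes as the classifier head: 768 -> 50 -> 18
head = nn.Sequential(
    nn.Linear(768, 50),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(50, 18),
)
head.eval()  # disable dropout so the output is deterministic

# Stand-in for outputs[0]: (batch_size, sequence_length, hidden_size)
fake_hidden_states = torch.randn(4, 10, 768)
cls_vectors = fake_hidden_states[:, 0, :]   # hidden state of the first ([CLS]) token
with torch.no_grad():
    logits = head(cls_vectors)
print(cls_vectors.shape, logits.shape)
```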

In [21]:
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

def initialize_model(epochs=4):
    """Initialize the Bert Classifier, the optimizer and the learning rate scheduler.
    """
    # Instantiate Bert Classifier
    bert_classifier = BertClassifier(freeze_bert=False)
    # Move the model to the selected device (GPU if available, otherwise CPU)
    bert_classifier.to(device)

    # Create the optimizer. `torch.optim.AdamW` replaces the deprecated `transformers.AdamW`;
    # weight_decay=0.0 matches the default of the transformers implementation.
    optimizer = AdamW(params=bert_classifier.parameters(),
                      lr=5e-5,           # Learning rate
                      eps=1e-8,          # Epsilon for numerical stability
                      weight_decay=0.0
                      )

    # Total number of training steps
    total_steps = len(train_dataloader) * epochs

    # Set up the learning rate scheduler
    scheduler = get_linear_schedule_with_warmup(optimizer,
                                                num_warmup_steps=0, # No warmup
                                                num_training_steps=total_steps)
    return bert_classifier, optimizer, scheduler
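
With zero warmup steps, a linear schedule simply decays the learning rate from its initial value to zero over the training steps. The function below is a pure-Python re-statement of that formula as I understand it, not the library code; `linear_schedule_lr` is a hypothetical name.

```python
def linear_schedule_lr(step, total_steps, base_lr, warmup_steps=0):
    """Learning rate after `step` updates: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Start, midpoint and end of a 100-step run with base learning rate 5e-5.
for step in (0, 50, 100):
    print(step, linear_schedule_lr(step, total_steps=100, base_lr=5e-5))
```

Because `total_steps` depends on `epochs`, the value passed to `initialize_model` should match the number of epochs actually used for training.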

<a id="2.1.4"> </a>
##### Define model train and evaluate functions

In [22]:
import random
import time
import torch
import torch.nn as nn
# Specify loss function
loss_fn = nn.CrossEntropyLoss()

def set_seed(seed_value=42):
    """Set seed for reproducibility.
    """
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)

def train(model, train_dataloader, val_dataloader=None, epochs=4, evaluation=False):
    """Train the BertClassifier model.

    Relies on the module-level `optimizer` and `scheduler` returned by `initialize_model`.
    """
    # Start training loop
    print("Start training...\n")
    for epoch_i in range(epochs):
        # =======================================
        #               Training
        # =======================================
        # Print the header of the result table
        print(f"{'Epoch':^7} | {'Batch':^7} | {'Train Loss':^12} | {'Val Loss':^10} | {'Val Acc':^9} | {'Elapsed':^9}")
        print("-"*70)

        # Measure the elapsed time of each epoch
        t0_epoch, t0_batch = time.time(), time.time()

        # Reset tracking variables at the beginning of each epoch
        total_loss, batch_loss, batch_counts = 0, 0, 0

        # Put the model into the training mode
        model.train()

        # For each batch of training data...
        for step, batch in enumerate(train_dataloader):
            batch_counts +=1
            # Load batch to GPU
            b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)

            # Zero out any previously calculated gradients
            model.zero_grad()

            # Perform a forward pass. This will return logits.
            logits = model(b_input_ids, b_attn_mask)

            # Compute loss and accumulate the loss values
            loss = loss_fn(logits, b_labels)
            batch_loss += loss.item()
            total_loss += loss.item()

            # Perform a backward pass to calculate gradients
            loss.backward()

            # Clip the norm of the gradients to 1.0 to prevent "exploding gradients"
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

            # Update parameters and the learning rate
            optimizer.step()
            scheduler.step()

            # Print the loss values and time elapsed for every 20 batches
            if (step % 20 == 0 and step != 0) or (step == len(train_dataloader) - 1):
                # Calculate time elapsed for 20 batches
                time_elapsed = time.time() - t0_batch

                # Print training results
                print(f"{epoch_i + 1:^7} | {step:^7} | {batch_loss / batch_counts:^12.6f} | {'-':^10} | {'-':^9} | {time_elapsed:^9.2f}")

                # Reset batch tracking variables
                batch_loss, batch_counts = 0, 0
                t0_batch = time.time()

        # Calculate the average loss over the entire training data
        avg_train_loss = total_loss / len(train_dataloader)

        print("-"*70)
        # =======================================
        #               Evaluation
        # =======================================
        if evaluation:
            # After the completion of each training epoch, measure the model's performance
            # on our validation set.
            val_loss, val_accuracy = evaluate(model, val_dataloader)

            # Print the epoch summary: average training loss plus validation metrics
            time_elapsed = time.time() - t0_epoch
            
            print(f"{epoch_i + 1:^7} | {'-':^7} | {avg_train_loss:^12.6f} | {val_loss:^10.6f} | {val_accuracy:^9.2f} | {time_elapsed:^9.2f}")
            print("-"*70)
        print("\n")
    
    print("Training complete!")


def evaluate(model, val_dataloader):
    """After the completion of each training epoch, measure the model's performance
    on our validation set.
    """
    # Put the model into the evaluation mode. The dropout layers are disabled during
    # the test time.
    model.eval()

    # Tracking variables
    val_accuracy = []
    val_loss = []

    # For each batch in our validation set...
    for batch in val_dataloader:
        # Load batch to GPU
        b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)

        # Compute logits
        with torch.no_grad():
            logits = model(b_input_ids, b_attn_mask)

        # Compute loss
        loss = loss_fn(logits, b_labels)
        val_loss.append(loss.item())

        # Get the predictions
        preds = torch.argmax(logits, dim=1).flatten()

        # Calculate the accuracy rate
        accuracy = (preds == b_labels).cpu().numpy().mean() * 100
        val_accuracy.append(accuracy)

    # Compute the average accuracy and loss over the validation set.
    val_loss = np.mean(val_loss)
    val_accuracy = np.mean(val_accuracy)

    return val_loss, val_accuracy
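
`evaluate` averages per-batch accuracies, which equals the overall accuracy only when every batch has the same size. With a smaller final batch the two can differ slightly. A toy NumPy sketch with made-up logits and a deliberately extreme batch-size gap shows the effect:

```python
import numpy as np

def batch_accuracy(logits, labels):
    """Percentage of rows whose arg-max logit equals the label."""
    preds = np.argmax(logits, axis=1)
    return float((preds == labels).mean() * 100)

# Two toy batches of unequal size: 4 samples (all correct) and 1 sample (wrong).
batch_a = (np.array([[2., 0.], [0., 3.], [1., 0.], [0., 1.]]), np.array([0, 1, 0, 1]))
batch_b = (np.array([[5., 1.]]), np.array([1]))

# Mean of per-batch accuracies vs. accuracy over all samples pooled together.
mean_of_batches = np.mean([batch_accuracy(*batch_a), batch_accuracy(*batch_b)])
all_logits = np.vstack([batch_a[0], batch_b[0]])
all_labels = np.concatenate([batch_a[1], batch_b[1]])
overall = batch_accuracy(all_logits, all_labels)
print(mean_of_batches, overall)
```

With a batch size of 32 and a large validation set the gap is negligible, but pooling predictions before scoring avoids it entirely.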

<a id="2.1.7"> </a>
##### Define prediction and test set evaluation functions

In [23]:
import torch.nn.functional as F

def bert_predict(model, test_dataloader):
    """Perform a forward pass on the trained BERT model to predict probabilities
    on the test set.
    """
    # Put the model into the evaluation mode. The dropout layers are disabled during
    # the test time.
    model.eval()

    all_logits = []

    # For each batch in our test set...
    for batch in test_dataloader:
        # Load batch to GPU
        b_input_ids, b_attn_mask = tuple(t.to(device) for t in batch)[:2]

        # Compute logits
        with torch.no_grad():
            logits = model(b_input_ids, b_attn_mask)
        all_logits.append(logits)
    
    # Concatenate logits from each batch
    all_logits = torch.cat(all_logits, dim=0)

    # Apply softmax to calculate probabilities
    probs = F.softmax(all_logits, dim=1).cpu().numpy()

    return probs
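
Softmax turns each row of logits into probabilities that sum to 1, and the predicted class is the index of the largest value. A minimal sketch on made-up logits:

```python
import torch
import torch.nn.functional as F

toy_logits = torch.tensor([[1.0, 2.0, 3.0],
                           [0.0, 0.0, 0.0]])
toy_probs = F.softmax(toy_logits, dim=1)   # normalize across the class dimension

print(toy_probs)
print(toy_probs.sum(dim=1))       # each row sums to 1
print(toy_probs.argmax(dim=1))    # predicted class index per row
```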

In [24]:
from sklearn import metrics


def evaluate_roc(probs, y_true):
    """
    Print a per-dialect classification report (precision, recall, F1 and accuracy).
    Despite its name, this function does not compute AUC or plot a ROC curve.
    @params    probs (np.array): an array of predicted probabilities with shape (len(y_true), 18)
    @params    y_true (np.array): an array of the true values with shape (len(y_true),)
    """
    # Predicted class = index of the highest probability in each row
    y_pred = np.argmax(probs, axis=1)

    # Map integer labels back to dialect codes
    y_true_inverse = list(le.inverse_transform(y_true))
    y_pred_inverse = list(le.inverse_transform(y_pred))
    print(metrics.classification_report(y_true_inverse, y_pred_inverse))
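
For reading the report, a toy example with made-up labels helps: each row gives precision, recall and F1 for one dialect, and the `accuracy` line is the fraction of correct predictions overall.

```python
from sklearn import metrics

y_true_toy = ["EG", "EG", "SA", "SA"]
y_pred_toy = ["EG", "SA", "SA", "SA"]   # one 'EG' tweet misclassified as 'SA'

print(metrics.classification_report(y_true_toy, y_pred_toy))

# The same numbers as a dictionary, convenient for building summary tables.
report = metrics.classification_report(y_true_toy, y_pred_toy, output_dict=True)
print(report["accuracy"])
```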


<a id="2.1.5"> </a>
##### Initialize and train model

In [25]:
import numpy as np
set_seed(42) 
bert_classifier, optimizer, scheduler = initialize_model(epochs=2)
train(bert_classifier, train_dataloader, val_dataloader, epochs=2, evaluation=True)

<a id="2.1.6"> </a>
##### Save and load the model

In [50]:
torch.save(bert_classifier, 'model_DL.pt')
model_DL= torch.load('model_DL.pt')
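
`torch.save(model, ...)` pickles the whole object, which ties the file to the exact class definition and import paths. A commonly used alternative is to save only the `state_dict`. The sketch below uses a toy `nn.Linear` as a stand-in for the classifier; the file name and model are placeholders.

```python
import os
import tempfile
import torch
import torch.nn as nn

toy_model = nn.Linear(4, 2)   # stand-in for the fine-tuned classifier

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_state.pt")
    torch.save(toy_model.state_dict(), path)   # save parameters only

    restored = nn.Linear(4, 2)                 # re-create the architecture first
    restored.load_state_dict(torch.load(path, map_location="cpu"))
restored.eval()   # switch to inference mode before predicting

# Every restored parameter tensor should match the original exactly.
same = all(torch.equal(a, b) for a, b in zip(toy_model.state_dict().values(),
                                             restored.state_dict().values()))
print(same)
```

For the notebook's model, the same pattern would mean instantiating `BertClassifier()` before calling `load_state_dict`.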

<a id="2.1.8"> </a>
##### Predict and evaluate validation subset

In [49]:
probs_bert_classifier = bert_predict(bert_classifier, val_dataloader)
evaluate_roc(probs_bert_classifier, y_val)

<a id="2.1.9"> </a>
##### Predict and evaluate test subset

In [51]:
# Run `preprocessing_for_bert` on the test set
test_inputs, test_masks = preprocessing_for_bert(x_test)
# x_test,  y_test
# Create the DataLoader for our test set
test_dataset = TensorDataset(test_inputs, test_masks)
test_sampler = SequentialSampler(test_dataset)
test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=32)

In [49]:
# Compute predicted probabilities on the test set
probs = bert_predict(bert_classifier, test_dataloader)
evaluate_roc(probs, y_test)

<a id="2.1.10"> </a>
##### Summary of performance on test datasets

| Model | Accuracy (%) |
| :---: | :---: |
| RandomForestClassifier | 51 |
| Multi-dialect-Arabic-BERT | 64 |


