# ELMo/AraVec/fastText Baseline for MADAR Task 1

In this notebook, we walk through reproducing the ELMo/AraVec/fastText baseline for MADAR task 1.

## Loading Required Modules

We start by loading the required Python libraries.

In [1]:
import os
import numpy as np
import tensorflow as tf
from tensorflow import keras
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.preprocessing import LabelEncoder
from embed_classer import embed

## Loading Data

Using pandas, we can load and inspect the training, validation, and testing datasets as follows:

In [2]:
df_train = pd.read_csv("../../data/madar-1/MADAR-Corpus-26-train.tsv", sep="\t", header=None, names=["Text", "label"])
df_dev = pd.read_csv("../../data/madar-1/MADAR-Corpus-26-dev.tsv", sep="\t", header=None, names=["Text", "label"])
df_test = pd.read_csv("../../data/madar-1/MADAR-Corpus-26-test.tsv", sep="\t", header=None, names=["Text", "label"])
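
As a quick sanity check of the expected file layout, the sketch below parses a small in-memory sample with the same `read_csv` arguments. It assumes each line holds a sentence and a dialect label separated by a tab; the two sample rows reuse sentences and labels shown in the previews in this notebook, and `sample_tsv` and `df_sample` are names introduced only for this illustration.

```python
import io

import pandas as pd

# Two tab-separated rows in the assumed layout: sentence, then dialect label.
sample_tsv = "كم تكلفة الإفطار ؟\tMSA\nرايح عالمدرسة هون ؟\tDAM\n"

# Same arguments as the real loading calls: no header row, explicit column names.
df_sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", header=None, names=["Text", "label"])

# Each row should yield exactly two columns.
print(df_sample.shape)
print(df_sample["label"].tolist())
```

If the shape reports more or fewer than two columns on the real files, the separator or quoting assumptions probably do not hold.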

Below we list the first 5 entries in the training data.

In [3]:
df_train.head()

| | Text | label |
|---|---|---|
| 0 | هناك ، أمام بيانات السائح تماما . | MSA |
| 1 | لم اسمع بهذا العنوان من قبل بالقرب من هنا . | MSA |
| 2 | استمر في السير في هذا الطريق حتى تجد صيدلية . | MSA |
| 3 | كم تكلفة الإفطار ؟ | MSA |
| 4 | كيف أستطيع مساعدتك ؟ | MSA |


Below we list the first 5 entries in the development data.

In [4]:
df_dev.head()

| | Text | label |
|---|---|---|
| 0 | بالمناسبة ، اسمي هيروش إيجيما . | MSA |
| 1 | هذا القطار يتوقف في لاك فورست , أليس كذلك ؟ | MSA |
| 2 | هذا الكارت , حسناً ؟ | MSA |
| 3 | لم يخرج من الماكينة شيء . | MSA |
| 4 | عندك أية شيء يمكن أن أتعاطه للطفح الجلدي ؟ | MSA |


Below we list the first 5 entries in the test data.

In [5]:
df_test.head()

| | Text | label |
|---|---|---|
| 0 | لا أعرف كثيراً عن النبيذ ؟ ماذا يناسب هذا الطبق ؟ | MSA |
| 1 | رايح عالمدرسة هون ؟ | DAM |
| 2 | قهوه مع كريمة و سكر ، لوسمحت . | SAN |
| 3 | بأي محطة لازم أنزل عشان أروح على امباير ستيت ب... | AMM |
| 4 | اسمي ميتشيكو تاناكا ، ورقم الرحلة خمسة صفر واح... | JED |


## Model Preparation

We start by setting the randomisation seed and the maximum sentence length:

In [6]:
tf.random.set_seed(123)
max_sentence_len = 20

We then choose the embedding model; its type determines the path to the pretrained weights and the embedding size:

In [7]:
model_type = "fasttext"

if model_type == "aravec":
    model_path = '../pretrained/full_uni_sg_300_twitter.mdl'
    size = 300
elif model_type == "fasttext":
    model_path = '../pretrained/cc.ar.300.bin'
    size = 300
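elif model_type == "elmo":
    model_path = '../pretrained'
    size = 1024
else:
    # Fail early instead of leaving model_path and size undefined.
    raise ValueError("Unknown model_type: {}".format(model_type))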

Next we load our model of choice:

In [8]:
embedder = embed(model_type, model_path)
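
The `embed` class comes from the local `embed_classer` module, which is not shown in this notebook. Judging from the model input shape used later, `embed_batch` presumably returns an array of shape `(n_sentences, max_sentence_len, size)`. The sketch below shows one plausible way to get a fixed-length representation, assuming shorter sentences are zero-padded and longer ones truncated; `pad_or_truncate` is a hypothetical helper, not the module's actual code.

```python
import numpy as np


def pad_or_truncate(token_vectors, max_len, size):
    """Return a (max_len, size) array: truncate long inputs, zero-pad short ones."""
    out = np.zeros((max_len, size), dtype=np.float32)
    n = min(len(token_vectors), max_len)
    if n > 0:
        out[:n] = np.asarray(token_vectors[:n], dtype=np.float32)
    return out


# Hypothetical 3-token sentence with 4-dimensional token vectors.
sentence_vectors = [np.ones(4), np.ones(4), np.ones(4)]
padded = pad_or_truncate(sentence_vectors, max_len=5, size=4)

# Stacking fixed-length sentences gives the 3-D batch a Keras model expects.
batch = np.stack([padded, padded])
print(batch.shape)
```

With `max_sentence_len = 20` and `size = 300`, a batch of 32 sentences would have shape `(32, 20, 300)` under this assumption.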



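Then we define the model's input tensors. Each sentence is a `max_sentence_len` x `size` matrix of embeddings; the `label` input is declared here but is not wired into the model built below.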

In [9]:
sentence = keras.Input(shape=(max_sentence_len, size), name='sentence')
label = keras.Input(shape=(26,), name='label')

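Next we define the network: a bidirectional LSTM over the embedded sentence, followed by a 26-way softmax layer (one output per dialect label):

In [10]:
# Forward and backward LSTMs, each with `size` hidden units.
forward_layer = tf.keras.layers.LSTM(size)
backward_layer = tf.keras.layers.LSTM(size, go_backwards=True)
# Note: this masking layer is created but never applied to the input below.
masking_layer = tf.keras.layers.Masking()
rnn = tf.keras.layers.Bidirectional(forward_layer, backward_layer=backward_layer)
# The softmax layer outputs class probabilities, despite the variable name `logits`.
logits = rnn(sentence)
logits = keras.layers.Dense(26, activation=tf.nn.softmax)(logits)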

Then we construct and compile the model:

In [11]:
model = keras.Model(sentence, outputs=logits)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

## Model Training

First we prepare the inputs and outputs to be fed to the model during training:

In [12]:
le = LabelEncoder()
le.fit(df_train["label"])
Y_train = le.transform(df_train["label"])
X_dev = embedder.embed_batch(df_dev["Text"].tolist(), max_sentence_len)
Y_dev = le.transform(df_dev["label"])
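
`LabelEncoder` maps each distinct label to an integer index, with the classes kept in sorted order; integer targets of this kind are what the `sparse_categorical_crossentropy` loss expects. The pure-Python sketch below mimics that mapping on a handful of labels taken from the data previews above. It is an illustration of the idea, not scikit-learn's implementation.

```python
# A few dialect labels as they appear in the data previews.
labels = ["MSA", "DAM", "SAN", "AMM", "JED", "MSA"]

# Sorted unique labels play the role of `le.classes_`.
classes = sorted(set(labels))
label_to_id = {label: i for i, label in enumerate(classes)}

# Equivalent of `le.transform(...)`: label -> integer index.
encoded = [label_to_id[label] for label in labels]

# Equivalent of `le.inverse_transform(...)`: integer index -> label.
decoded = [classes[i] for i in encoded]

print(classes)   # ['AMM', 'DAM', 'JED', 'MSA', 'SAN']
print(encoded)   # [3, 1, 4, 0, 2, 3]
```

Because the encoder is fitted on the training labels only, transforming a label that never occurs in the training set would raise an error in scikit-learn.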

Given the size of the training data, we embed it batch by batch rather than all at once, which requires a generator such as the following:

In [13]:
class DataGenerator(keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(self, list_IDs, text, labels, max_sentence_len, batch_size=32, shuffle=True):
        'Initialization'
        
        self.batch_size = batch_size
        self.labels = labels
        self.text = text
        self.shuffle = shuffle
        self.max_sentence_len = max_sentence_len
        self.list_IDs = list_IDs
        self.on_epoch_end()

    def __len__(self):
        'Denotes the number of batches per epoch'
        return len(self.list_IDs) // self.batch_size

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

        # Find list of IDs
        list_text_temp = [self.list_IDs[k] for k in indexes]

        # Generate data
        X, y = self.__data_generation(list_text_temp)

        return X, y

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.list_IDs))
        if self.shuffle:
            np.random.shuffle(self.indexes)

    def __data_generation(self, list_text_temp):
        'Embeds the selected sentences and returns them with their labels'
        # Initialization
        X = []
        y = []

        for i in list_text_temp:
            y.append(self.labels[i])
            X.append(self.text[i])
        return embedder.embed_batch(X, self.max_sentence_len), np.array(y)
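
The batching arithmetic used by `__len__` and `__getitem__` can be sketched on its own. The example below uses made-up sizes (100 samples, batches of 32) and two hypothetical helper functions; it shows that rounding down drops the final partial batch in each epoch.

```python
def batches_per_epoch(n_samples, batch_size):
    """Number of full batches; a trailing partial batch is dropped."""
    return n_samples // batch_size


def batch_bounds(index, batch_size):
    """Start (inclusive) and end (exclusive) positions of batch `index`."""
    return index * batch_size, (index + 1) * batch_size


n_samples, batch_size = 100, 32
n_batches = batches_per_epoch(n_samples, batch_size)

# Samples that are not seen in a given epoch because they do not fill a batch.
dropped = n_samples - n_batches * batch_size

print(n_batches)                    # 3
print(batch_bounds(2, batch_size))  # (64, 96)
print(dropped)                      # 4
```

Since the indexes are reshuffled at the end of every epoch when `shuffle=True`, the dropped samples differ from epoch to epoch.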

Next we fit the model for 5 epochs, validating on the development set after each epoch:

In [14]:
training_generator = DataGenerator(df_train.reset_index()["index"].tolist(), df_train['Text'].tolist(), Y_train, max_sentence_len)
model.fit(training_generator, epochs=5, validation_data = (X_dev, Y_dev))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fe87f88f2b0>

## Submission Preparation

We prepare the features for each test set instance as follows:

In [15]:
X_test = embedder.embed_batch(df_test["Text"].tolist(), max_sentence_len)
Y_test = le.transform(df_test["label"])

Then we predict the label of each test instance and compute the macro-averaged F1 score:

In [16]:
predictions = np.argmax(model.predict(X_test), 1)
f1_score(Y_test, predictions, average="macro")

0.5073711072964296
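
For reference, macro averaging computes an F1 score per class and then takes the unweighted mean, so every dialect counts equally regardless of how many test sentences it has. The pure-Python sketch below works through that definition on a small made-up example; `macro_f1` is a hypothetical helper written for illustration, and the toy labels are not taken from the corpus.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all classes that occur."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Per-class F1 scores are 0.5, 0.8 and 2/3, so the macro average is about 0.656.
score = macro_f1(y_true, y_pred)
print(round(score, 3))
```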

We prepare the predictions as a pandas dataframe, mapping the predicted class indices back to their dialect labels.

In [17]:
df_preds = pd.DataFrame(data=le.inverse_transform(predictions))
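
The decoding path from model output to label string can be sketched in isolation: `argmax` picks the most probable class index in each row, and the index is then looked up in the sorted class list, which is what `le.inverse_transform` does. The probabilities and the three-class list below are made up for illustration.

```python
import numpy as np

# Hypothetical sorted class list, standing in for `le.classes_`.
classes = np.array(["AMM", "DAM", "MSA"])

# Hypothetical softmax outputs for two sentences (each row sums to 1).
probs = np.array([
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
])

# Most probable class index per row, then index -> label.
pred_ids = np.argmax(probs, axis=1)
pred_labels = classes[pred_ids]

print(pred_ids.tolist())     # [1, 0]
print(pred_labels.tolist())  # ['DAM', 'AMM']
```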

In the final step, we save the predictions as required by the competition guidelines.

In [18]:
os.makedirs("./predictions/{}".format(model_type), exist_ok=True)
df_preds.to_csv("./predictions/{}/madar.tsv".format(model_type), index=False, header=False, sep="\t")