# Universal Sentence Encoder Baseline for Madar task 1

In this notebook, we will walk you through the process of reproducing the Universal Sentence Encoder baseline for Madar task 1.

## Loading Required Modules

We start by loading the needed python libraries.

In [1]:
import os
import numpy as np
import tensorflow as tf
from tensorflow import keras
import tensorflow_hub as hub
import pandas as pd
import tensorflow_text
from tensorflow import keras
from sklearn.metrics import f1_score
from sklearn.preprocessing import LabelEncoder

## Loading Data

Using pandas, we can load and inspect the training, validation, and testing datasets as follows:

In [2]:
df_train = pd.read_csv("../../data/madar-1/MADAR-Corpus-26-train.tsv", sep="\t", header=None, names=["Text", "label"])
df_dev = pd.read_csv("../../data/madar-1/MADAR-Corpus-26-dev.tsv", sep="\t", header=None, names=["Text", "label"])
df_test = pd.read_csv("../../data/madar-1/MADAR-Corpus-26-test.tsv", sep="\t", header=None, names=["Text", "label"])

Below we list the 5 first entries in the training data.

In [3]:
df_train.head()

Unnamed: 0,Text,label
0,هناك ، أمام بيانات السائح تماما .,MSA
1,لم اسمع بهذا العنوان من قبل بالقرب من هنا .,MSA
2,استمر في السير في هذا الطريق حتى تجد صيدلية .,MSA
3,كم تكلفة الإفطار ؟,MSA
4,كيف أستطيع مساعدتك ؟,MSA


Below we list the 5 first entries in the development data.

In [4]:
df_dev.head()

Unnamed: 0,Text,label
0,بالمناسبة ، اسمي هيروش إيجيما .,MSA
1,"هذا القطار يتوقف في لاك فورست , أليس كذلك ؟",MSA
2,"هذا الكارت , حسناً ؟",MSA
3,لم يخرج من الماكينة شيء .,MSA
4,عندك أية شيء يمكن أن أتعاطه للطفح الجلدي ؟,MSA


Below we list the 5 first entries in the test data.

In [5]:
df_test.head()

Unnamed: 0,Text,label
0,لا أعرف كثيراً عن النبيذ ؟ ماذا يناسب هذا الطبق ؟,MSA
1,رايح عالمدرسة هون ؟,DAM
2,قهوه مع كريمة و سكر ، لوسمحت .,SAN
3,بأي محطة لازم أنزل عشان أروح على امباير ستيت ب...,AMM
4,اسمي ميتشيكو تاناكا ، ورقم الرحلة خمسة صفر واح...,JED


## Model Preparation

We start by setting the randomisation seed:

In [6]:
tf.random.set_seed(123)

Next we load the Universal Sentence Encoder (WARNING: This will download and cache a huge model of around 1 GB in size)

In [7]:
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

Then we define the input and output to the model:

In [8]:
sentence = keras.Input(shape=512, name='sentence')
label = keras.Input(shape=(1,), name='label')

This is followed by defining the structure of the network:

In [9]:
logits = keras.layers.Dense(512)(sentence)
logits = keras.layers.Dense(26, activation=tf.nn.softmax)(logits)

Then we construct and compile the model:

In [10]:
model = keras.Model(sentence, outputs=logits)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

## Model Training

First we perpare the inputs and outputs to be fed to the model during training:

In [11]:
le = LabelEncoder()
le.fit(df_train["label"])
Y_train = le.transform(df_train["label"])
X_dev = embed(df_dev["Text"])
Y_dev = le.transform(df_dev["label"])

Given the size of the input we need to constructor a generator as follows:

In [12]:
class DataGenerator(keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(self, text, labels, batch_size=32, shuffle=True):
        'Initialization'
        
        self.batch_size = batch_size
        self.labels = labels
        self.text = text
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.floor(len(self.text) / self.batch_size))

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

        # Find list of IDs
        list_text_temp = [self.text[k] for k in indexes]

        # Generate data
        X, y = self.__data_generation(list_text_temp)

        return X, y

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.text))
        if self.shuffle == True:
            np.random.shuffle(self.indexes)

    def __data_generation(self, list_text_temp):
        'Generates data containing batch_size samples' # X : (n_samples, *dim, n_channels)
        # Initialization
        X = []
        y = []

        for txt, lab in zip(list_text_temp, self.labels):
            # Store sample
            X.append(embed(txt))

            # Store class
            y.append(lab)

        return np.vstack(X), np.array(y)

Next we fit the data:

In [13]:
training_generator = DataGenerator(df_train['Text'], Y_train)
model.fit(training_generator, epochs=1, validation_data = (X_dev, Y_dev))



<tensorflow.python.keras.callbacks.History at 0x7f942c144b80>

## Submission Preperation

We perpare the features for each testset instance as follows:

In [14]:
X_test = embed(df_test["Text"])
Y_test = le.transform(df_test["label"])

Then we predict the labels for each and evaluate the f1 score:

In [15]:
predictions = np.argmax(model.predict(X_test), 1)
f1_score(Y_test, predictions, average="macro")

0.0028490028490028487

We perpare the predictions as a pandas dataframe.

In [16]:
df_preds = pd.DataFrame(data=le.inverse_transform(predictions))

In the final step, we save the predictions as required by the competition guidelines.

In [17]:
if not os.path.exists("predictions"):
    os.mkdir("predictions")
df_preds.to_csv("./predictions/madar.tsv", index=False, header=False, sep="\t")