# ELMO/ARAVEC/FASTTEXT Baseline for the XNLI Task

In this notebook, we walk through the process of reproducing the ELMO/ARAVEC/FASTTEXT baseline for the XNLI task.

## Loading Required Modules

We start by loading the required Python libraries.

In [1]:
import os
import numpy as np
import tensorflow as tf
from tensorflow import keras
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import matthews_corrcoef
from embed_classer import embed



ModuleNotFoundError: No module named 'fasttext'

In [2]:
pip install fasttext

Collecting fasttext
  Using cached fasttext-0.9.2.tar.gz (68 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /home/cadmus/PycharmProjects/alue_baselines/env/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-x3gug0_3/fasttext_b937466207874c4aa691daac03d97b55/setup.py'"'"'; __file__='"'"'/tmp/pip-install-x3gug0_3/fasttext_b937466207874c4aa691daac03d97b55/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-wgn_tx7p
       cwd: /tmp/pip-install-x3gug0_3/fasttext_b937466207874c4aa691daac03d97b55/
  Complete output (40 lines):
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.linux-x86_64-3.8
  creating build/lib.linux-x86_64-3.8/fas

Note: you may need to restart the kernel to use updated packages.


## Loading Data

Using pandas, we load the training set and the development set, which serves as the test set in this notebook:

In [None]:
df_train = pd.read_csv("../../data/xnli/arabic_train.tsv", sep="\t")
df_test = pd.read_csv("../../data/xnli/arabic_dev.tsv", sep="\t")
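
Before modelling, it can help to confirm that the columns used later in this notebook (`sentence1`, `sentence2`, `gold_label`, `pairID`) are actually present. The sketch below is only an illustration: it reads a made-up two-row TSV from memory instead of the real files, and `check_columns` is a hypothetical helper, not part of the baseline code.

```python
import io
import pandas as pd

# Made-up two-row sample standing in for the real TSV files (illustration only).
sample_tsv = (
    "pairID\tsentence1\tsentence2\tgold_label\n"
    "1\tpremise one\thypothesis one\tentailment\n"
    "2\tpremise two\thypothesis two\tneutral\n"
)

# Columns that the later cells of this notebook read.
REQUIRED_COLUMNS = ["sentence1", "sentence2", "gold_label", "pairID"]


def check_columns(df, required=REQUIRED_COLUMNS):
    """Return the required columns missing from df (hypothetical helper)."""
    return [col for col in required if col not in df.columns]


df_sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
missing = check_columns(df_sample)
print("missing columns:", missing)
```

The same check could be applied to `df_train` and `df_test` once they are loaded; whether the training file carries a `pairID` column is not shown here.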

Below we list the first 5 entries in the training data.

In [None]:
df_train.head()

Finally, we list the first 5 entries in the test data.

In [None]:
df_test.head()

## Model Preparation

We start by setting the randomisation seed and the maximum sentence length:

In [None]:
tf.random.set_seed(123)
max_sentence_len = 20

In [None]:
model_type = "fasttext"

if model_type == "aravec":
    model_path = '/code/haitham/use_multi_bench/elmo_aravec_fasttext/aravec/full_uni_sg_300_twitter.mdl'
    size = 300
elif model_type == "fasttext":
    model_path = '/code/haitham/use_multi_bench/elmo_aravec_fasttext/fasttext/cc.ar.300.bin'
    size = 300
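elif model_type == "elmo":
    model_path = '/code/haitham/use_multi_bench/elmo_aravec_fasttext/arabic_elmo'
    size = 1024
else:
    # Fail early instead of hitting a NameError on model_path later.
    raise ValueError("Unknown model_type: {}".format(model_type))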

Next we load our model of choice:

In [None]:
embedder = embed(model_type, model_path)

Then we define the input and output to the model:

In [None]:
sentence1 = keras.Input(shape=(max_sentence_len, size), name='q1')
sentence2 = keras.Input(shape=(max_sentence_len, size), name='q2')
label = keras.Input(shape=(1,), name='label')
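
Each sentence input has shape `(max_sentence_len, size)`, so every sentence must be brought to a fixed number of token vectors. How `embed_batch` does this is not shown in this notebook; the sketch below assumes zero-padding for short sentences and truncation for long ones, and `pad_or_truncate` is a hypothetical helper used only to illustrate the resulting array shape.

```python
import numpy as np


def pad_or_truncate(vectors, max_len, dim):
    """Hypothetical helper: fit a (n_tokens, dim) array into (max_len, dim)."""
    out = np.zeros((max_len, dim), dtype=np.float32)
    n = min(len(vectors), max_len)
    if n > 0:
        # Copy the first n token vectors; remaining rows stay zero.
        out[:n] = np.asarray(vectors, dtype=np.float32)[:n]
    return out


max_sentence_len, size = 20, 300
short = np.ones((7, size))   # 7 tokens: padded with 13 zero rows
long = np.ones((35, size))   # 35 tokens: truncated to the first 20
batch = np.stack(
    [pad_or_truncate(s, max_sentence_len, size) for s in (short, long)]
)
print(batch.shape)  # (2, 20, 300)
```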

This is followed by defining the structure of the network:

In [None]:
# Encode both sentences with one shared bidirectional LSTM.
forward_layer = tf.keras.layers.LSTM(size)
backward_layer = tf.keras.layers.LSTM(size, go_backwards=True)
rnn = tf.keras.layers.Bidirectional(forward_layer, backward_layer=backward_layer)
sentence1_encoded = rnn(sentence1)
sentence2_encoded = rnn(sentence2)
# Matching features: element-wise absolute difference and product.
feat_1 = tf.abs(sentence1_encoded - sentence2_encoded)
feat_2 = sentence1_encoded * sentence2_encoded
features = tf.keras.layers.concatenate([sentence1_encoded, sentence2_encoded, feat_1, feat_2])
# Hidden layer followed by a 3-way softmax over the labels.
hidden = keras.layers.Dense(size*2, activation=tf.nn.sigmoid)(features)
logits = keras.layers.Dense(3, activation=tf.nn.softmax)(hidden)
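
To make the feature combination concrete, the following NumPy sketch reproduces only the concatenation step on random stand-in vectors. It assumes the default behaviour of `Bidirectional`, which concatenates the forward and backward outputs, so each sentence encoding has width `2 * size`; the batch size and the random values are arbitrary.

```python
import numpy as np

size = 300
batch = 4
rng = np.random.default_rng(0)

# Stand-ins for the two sentence encodings, each of width 2 * size.
u = rng.normal(size=(batch, 2 * size))
v = rng.normal(size=(batch, 2 * size))

# Same four blocks as in the network: u, v, |u - v|, u * v.
features = np.concatenate([u, v, np.abs(u - v), u * v], axis=-1)
print(features.shape)  # (4, 2400)
```

The dense layer therefore receives `8 * size` values per sentence pair and projects them down to `2 * size`.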

Then we construct and compile the model:

In [None]:
model = keras.Model(inputs=[sentence1, sentence2], outputs=logits)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
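
The `sparse_categorical_crossentropy` loss expects integer class labels and softmax probabilities, and averages the negative log-probability assigned to the true class. The NumPy sketch below illustrates that definition on made-up numbers; it is not the Keras implementation, which additionally clips probabilities for numerical stability.

```python
import numpy as np


def sparse_categorical_crossentropy(probs, labels):
    """Mean negative log-probability of the true class (illustrative only)."""
    probs = np.asarray(probs, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.int64)
    # Pick, for every row, the probability of its true class.
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked)))


# Made-up softmax outputs for two examples over three classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])
labels = np.array([0, 2])
loss = sparse_categorical_crossentropy(probs, labels)
print(loss)
```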

## Model Training

First we prepare the inputs and outputs to be fed to the model during training:

In [None]:
X1_train = embedder.embed_batch(df_train["sentence1"].tolist(), max_sentence_len)
X2_train = embedder.embed_batch(df_train["sentence2"].tolist(), max_sentence_len)

le = LabelEncoder()
le.fit(df_train["gold_label"])
Y_train = le.transform(df_train["gold_label"])
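
`LabelEncoder` maps each distinct label string to an integer, assigning indices in sorted order of the strings. The sketch below shows this on a toy list; the label strings are assumed for illustration and may differ from the values in the actual `gold_label` column.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed label strings, for illustration only.
labels = ["neutral", "entailment", "contradiction", "neutral"]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

print(list(le_demo.classes_))  # ['contradiction', 'entailment', 'neutral']
print(encoded.tolist())        # [2, 1, 0, 2]

# inverse_transform recovers the original strings from the integer ids.
decoded = le_demo.inverse_transform(encoded)
print(decoded.tolist())
```

Because the encoder is fitted on the training labels only, `transform` raises a `ValueError` if another split contains a label string that was not seen during fitting.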

Next we fit the data:

In [None]:
model.fit([X1_train, X2_train],
          Y_train,
          epochs=5,
          batch_size=32)

## Submission Preparation

Below we prepare the features for the test set as follows:

In [None]:
X1_test = embedder.embed_batch(df_test["sentence1"].tolist(), max_sentence_len)
X2_test = embedder.embed_batch(df_test["sentence2"].tolist(), max_sentence_len)
Y_test = le.transform(df_test["gold_label"])

Next we prepare the predictions:

In [None]:
predictions_test = np.argmax(model.predict([X1_test, X2_test]),1)
df_test_sub = pd.DataFrame(data=le.inverse_transform(predictions_test), columns=["prediction"], index=df_test["pairID"])
df_test_sub.reset_index(inplace=True)
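
The steps above (arg-max over class probabilities, mapping indices back to label strings, and indexing by `pairID`) can be traced on a toy example. The probabilities, pair ids, and class names below are made up; the class-name array stands in for what `le.inverse_transform` would return.

```python
import numpy as np
import pandas as pd

# Made-up softmax outputs for two sentence pairs.
probs = np.array([[0.1, 0.7, 0.2],
                  [0.6, 0.3, 0.1]])
# Assumed class names in encoder order (illustration only).
class_names = np.array(["contradiction", "entailment", "neutral"])
pair_ids = pd.Series([101, 102], name="pairID")

# Index of the most probable class for each row.
pred_idx = np.argmax(probs, axis=1)

sub = pd.DataFrame(data=class_names[pred_idx],
                   columns=["prediction"],
                   index=pair_ids)
# Moving the index into a column yields a pairID/prediction table.
sub = sub.reset_index()
print(sub)
```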

Then we save the prediction file:

In [None]:
# exist_ok=True makes a separate existence check unnecessary.
os.makedirs("./predictions/{}".format(model_type), exist_ok=True)
# Save the test-set predictions built above (df_dia is only defined later).
df_test_sub.to_csv("./predictions/{}/xnli.tsv".format(model_type), index=False)

Next we prepare the diagnostic task predictions.

In [None]:
diagnostic_data = pd.read_csv("../../private_datasets/diagnostic.tsv", sep="\t")

We prepare the features for each diagnostic instance as follows:

In [None]:
X1_dia = embedder.embed_batch(diagnostic_data["sentence1"].tolist(), max_sentence_len)
X2_dia = embedder.embed_batch(diagnostic_data["sentence2"].tolist(), max_sentence_len)
Y_dia = le.transform(diagnostic_data["gold_label"])

Then we predict the label for each instance:

In [None]:
predictions = np.argmax(model.predict([X1_dia, X2_dia]),1)
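
The notebook imports `f1_score` and `matthews_corrcoef` and encodes the gold labels, but does not call the metrics in the cells shown. One possible way to score integer predictions against integer gold labels is sketched below on toy arrays; the choice of macro averaging for F1 is an assumption, not something stated in the notebook.

```python
import numpy as np
from sklearn.metrics import f1_score, matthews_corrcoef

# Toy gold labels and predictions (illustration only).
y_true = np.array([0, 1, 2, 1, 0, 2])
y_pred = np.array([0, 1, 1, 1, 0, 2])

# Macro averaging weights the three classes equally (assumed choice).
macro_f1 = f1_score(y_true, y_pred, average="macro")
mcc = matthews_corrcoef(y_true, y_pred)

print("macro F1:", macro_f1)
print("MCC:", mcc)
```

With the notebook's variables, the same calls would take `Y_dia` and `predictions` as arguments.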

We prepare the predictions as a pandas dataframe.

In [None]:
df_dia = pd.DataFrame(data=le.inverse_transform(predictions), columns=["prediction"], index=diagnostic_data["pairID"])
df_dia.reset_index(inplace=True)

In the final step, we save the predictions as required by the competition guidelines.

In [None]:
df_dia.to_csv("./predictions/{}/diagnostic.tsv".format(model_type), index=False)