# ELMo/AraVec/fastText Baseline for NSURL-2019 Shared Task 8

This notebook walks through reproducing the ELMo/AraVec/fastText baseline for NSURL-2019 Shared Task 8.

## Loading Required Modules

We start by importing the required Python libraries.

In [1]:
import os
import tensorflow as tf
import pandas as pd
from tensorflow import keras
from sklearn.metrics import f1_score
import gensim
import numpy as np
import fasttext
from embed_classer import embed

## Loading Data

Using pandas, we load the training and test datasets as follows:

In [2]:
df_train = pd.read_csv("../../data/nsurl/q2q_similarity_workshop_v2.1.tsv", sep="\t")
df_test = pd.read_csv("../../private_datasets/q2q/q2q_no_labels_v1.0.tsv", sep="\t")

Below are the first five entries in the training data.

In [3]:
df_train.head()

Unnamed: 0,question1,question2,label
0,ما هي الطرق الصحيحة للاعتناء بالحامل؟,كيف اهتم بطفلي؟,0
1,ما هي وسائل الاتصالات الحديثة؟,ماذا نعني بوسائل الاتصال الحديثة؟,1
2,ما طريقة تحضير محشي الكوسا ؟,من طرق تحضير محشي الكوسا؟,1
3,ما طريقة تحضير حلى الطبقات؟,من طرق تحضير طبقات الكيك؟,0
4,من الآيات القرآنية عن الراعي والرعية ؟,ما هو تعريف الراعي والرعية ؟,0


And here are the first five entries in the test data, which has a `QuestionPairID` column and no labels.

In [4]:
df_test.head()

Unnamed: 0,QuestionPairID,question1,question2
0,1,كم عدد حروف الفاتحة؟,كيف تكون فقيهاً؟
1,2,هل حلال أكل الضبع؟,هل أكل الضبع حلال أم حرام؟
2,3,كم عدد الركعات في كل صلاة؟,كم عدد ركعات الصلوات المفروضة؟
3,4,كيف أؤمن بالله؟,كيف أكون مؤمناً؟
4,5,لماذا سميت حواء بهذا الاسم؟,كيف عذب الله قوم ثمود؟


## Model Preparation

We start by setting the randomisation seed and the maximum sentence length:

In [5]:
tf.random.set_seed(123)
max_sentence_len = 20
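
`tf.random.set_seed` fixes only TensorFlow's global seed. If other parts of the pipeline draw random numbers through Python's `random` module or NumPy (an assumption, since the embedding helper is not shown here), those generators may need seeding as well. A minimal sketch, where `seed_python_and_numpy` is a hypothetical helper that is not part of the notebook:

```python
import random

import numpy as np


def seed_python_and_numpy(seed: int) -> None:
    """Seed the Python and NumPy generators (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)


# Re-seeding with the same value reproduces the same draws.
seed_python_and_numpy(123)
first_draw = (random.random(), np.random.rand(3).tolist())

seed_python_and_numpy(123)
second_draw = (random.random(), np.random.rand(3).tolist())
```

Even with all seeds fixed, some GPU operations can remain non-deterministic, so small run-to-run differences are still possible.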

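Next we choose the embedding model, together with the path to its pretrained weights and its vector size:

In [6]:
model_type = "fasttext"

if model_type == "aravec":
    model_path = '../pretrained/full_uni_sg_300_twitter.mdl'
    size = 300
elif model_type == "fasttext":
    model_path = '../pretrained/cc.ar.300.bin'
    size = 300
elif model_type == "elmo":
    model_path = '../pretrained'
    size = 1024
else:
    # Fail early instead of leaving model_path and size undefined.
    raise ValueError("Unknown model_type: {}".format(model_type))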

Next we load our model of choice:

In [7]:
embedder = embed(model_type, model_path)



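Then we define the model inputs. The `label` input is declared here but is not passed to the model built below; the labels are supplied directly to `fit`.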

In [8]:
q1_input = keras.Input(shape=(max_sentence_len, size), name='q1')
q2_input = keras.Input(shape=(max_sentence_len, size), name='q2')
label = keras.Input(shape=(1,), name='label')

This is followed by defining the structure of the network:

In [9]:
# Shared bidirectional LSTM encoder, applied to both questions.
forward_layer = tf.keras.layers.LSTM(size)
backward_layer = tf.keras.layers.LSTM(size, go_backwards=True)
rnn = tf.keras.layers.Bidirectional(forward_layer, backward_layer=backward_layer)
q1_logits = rnn(q1_input)
q2_logits = rnn(q2_input)
# Pair features: element-wise absolute difference and product of the encodings.
feat_1 = tf.abs(q1_logits - q2_logits)
feat_2 = q1_logits*q2_logits
# Hidden layer over the concatenated features, then a single sigmoid output unit.
logits = keras.layers.Dense(size*2, activation=tf.nn.sigmoid)(tf.keras.layers.concatenate([q1_logits, q2_logits, feat_1, feat_2]))
logits = keras.layers.Dense(1, activation=tf.nn.sigmoid)(logits)
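
To make the shapes concrete: with the default `merge_mode` of `Bidirectional` (concatenation), each question encoding has `2 * size` values, so the concatenated pair representation has `8 * size`. The NumPy sketch below mirrors the feature construction on toy data; `u`, `v`, and `enc_dim` are illustrative names that do not appear in the notebook.

```python
import numpy as np

batch, enc_dim = 2, 6  # toy sizes; in the notebook enc_dim would be 2 * size

rng = np.random.default_rng(0)
u = rng.normal(size=(batch, enc_dim))  # stands in for q1_logits
v = rng.normal(size=(batch, enc_dim))  # stands in for q2_logits

abs_diff = np.abs(u - v)  # same role as feat_1
product = u * v           # same role as feat_2

# Concatenate along the feature axis, as keras.layers.concatenate does by default.
pair_features = np.concatenate([u, v, abs_diff, product], axis=-1)
```

Both derived features are symmetric in the two questions, whereas the raw encodings `u` and `v` are concatenated in a fixed order, so the classifier as a whole is not guaranteed to be order-invariant.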

Then we construct and compile the model:

In [10]:
model = keras.Model(inputs=[q1_input, q2_input], outputs=logits)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

## Model Training

First we prepare the inputs and outputs to be fed to the model during training:

In [11]:
q1 = df_train["question1"].tolist()
q2 = df_train["question2"].tolist()
X1_train = embedder.embed_batch(q1, max_sentence_len)
X2_train = embedder.embed_batch(q2, max_sentence_len)
Y_train = df_train["label"]
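
`embed_batch` comes from the local `embed_classer` module, which is not shown here. Since the Keras inputs expect shape `(max_sentence_len, size)`, it presumably returns an array of shape `(n, max_sentence_len, size)`. The sketch below shows one way such fixed-length batching could work; the helper `pad_or_truncate` and the choice of zero-padding at the end are assumptions, not the module's actual behaviour.

```python
import numpy as np


def pad_or_truncate(vectors: np.ndarray, max_len: int) -> np.ndarray:
    """Return a (max_len, dim) array: cut long inputs, zero-pad short ones."""
    dim = vectors.shape[1]
    out = np.zeros((max_len, dim), dtype=vectors.dtype)
    n = min(len(vectors), max_len)
    out[:n] = vectors[:n]
    return out


short = np.ones((3, 4), dtype=np.float32)  # 3 tokens, 4-dim toy embeddings
long = np.ones((25, 4), dtype=np.float32)  # 25 tokens, more than the limit

# Stack the per-sentence arrays into one (n, max_len, dim) batch.
batch = np.stack([pad_or_truncate(s, 20) for s in (short, long)])
```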

Next we fit the data:

In [12]:
model.fit([X1_train, X2_train],
          Y_train,
          epochs=10,
          batch_size=32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f9178a41198>
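
The notebook imports `f1_score` from scikit-learn but never calls it, and `fit` runs on all of the training data, so no held-out score is reported. If a validation split were added, F1 would be one natural metric to report. As a sketch of what that metric computes for the positive class, here is a pure-Python stand-in; `binary_f1` is a hypothetical helper and the labels are made up.

```python
def binary_f1(y_true, y_pred):
    """F1 for the positive class (label 1) from two equal-length 0/1 sequences."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels, for illustration only.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
score = binary_f1(y_true, y_pred)
```

Here there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and so is the F1 score.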

## Submission Preparation

We prepare the features for each test set instance as follows:

In [13]:
x1_test = embedder.embed_batch(df_test["question1"].tolist(), max_sentence_len)
x2_test = embedder.embed_batch(df_test["question2"].tolist(), max_sentence_len)

Then we predict a label for each question pair by thresholding the sigmoid output at 0.5:

In [15]:
predictions = (model.predict([x1_test, x2_test])>0.5).astype(int)
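
The thresholding step can be checked in isolation. The sketch below applies the same expression to made-up probabilities; note that the comparison is strict, so a value of exactly 0.5 maps to label 0, and that the `(n, 1)` shape is preserved.

```python
import numpy as np

# Made-up sigmoid outputs with shape (n, 1), as a one-unit Dense layer returns.
probs = np.array([[0.91], [0.50], [0.12], [0.5001]])

# Strictly greater than 0.5 becomes 1, everything else becomes 0.
labels = (probs > 0.5).astype(int)
```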

We prepare the predictions as a pandas DataFrame, keyed by `QuestionPairID`.

In [16]:
df_preds = pd.DataFrame(data=predictions, columns=["prediction"], index=df_test["QuestionPairID"])
df_preds.reset_index(inplace=True)

In the final step, we save the predictions as required by the competition guidelines.

In [17]:
# exist_ok=True makes a separate existence check unnecessary.
out_dir = "./predictions/{}".format(model_type)
os.makedirs(out_dir, exist_ok=True)
df_preds.to_csv("{}/q2q.tsv".format(out_dir), index=False, sep="\t")
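
With `index=False` and `sep="\t"`, the saved file should contain a header line followed by one row per question pair, with the two columns `QuestionPairID` and `prediction` separated by a tab. The sketch below reproduces that layout in memory using only the standard library; the IDs and predictions are made up.

```python
import csv
import io

# Made-up IDs and predictions, for illustration only.
rows = [(1, 0), (2, 1), (3, 1)]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["QuestionPairID", "prediction"])
writer.writerows(rows)

tsv_text = buffer.getvalue()
```

Inspecting the first few lines of the real `q2q.tsv` against this layout is a quick sanity check before submitting.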