# ELMO/ARAVEC/FASTTEXT Baseline for IDAT

In this notebook, we walk through reproducing the ELMO/ARAVEC/FASTTEXT baseline for the IDAT irony detection task.

## Loading Required Modules

We start by loading the required Python libraries.

In [1]:
import os
import tensorflow as tf
from tensorflow import keras
import pandas as pd
from sklearn.metrics import f1_score
from embed_classer import embed

## Loading Data

Using pandas, we can load and inspect the training and testing datasets as follows:

In [2]:
df_train = pd.read_csv("../../data/idat/IDAT_training_text.csv")
df_test = pd.read_csv("../../data/idat/IDAT_test_text.csv")

Below we list the first five entries in the training data.

In [3]:
df_train.head()

Unnamed: 0,id,text,label,type_,tweet_id
0,0,ايمان عز الدين:الجراد طلع علي المقطم وبعدين بي...,1,training,'308488170838831104'
1,1,@AymanNour الى المدعو أيمن نور الحرامى من معك ...,0,training,'955724773216129024'
2,2,#بوتين ٦٥ سنه و بيغطس في بحيره متجمده و انا خا...,0,training,'954792171521048576'
3,3,#قال أيه أنهاردة 20 مليون واحد في الشوارع عشان...,1,training,'363321598431862784'
4,4,@EmmanuelMacron وفي كل مره يرفض إيمانويل دعوة ...,0,training,'939204686632103936'


Below we list the first five entries in the testing data.

In [4]:
df_test.head()

Unnamed: 0,id,text,label,type_,tweet_id
0,0,#يناير_حلم_ومكملينه فاستبشروا خيرا واستكملوا ث...,0,test,'955879051872350209'
1,1,#الشيخه_موزا_مصدر_فخرنا موزه ويسبّــق اسمــها ...,0,test,'953563403368452096'
2,2,معلش سؤال بس. هو حد علق من جبهة الانقاذ عن احد...,1,test,'322085724235132928'
3,3,ههههههههههههههههههههه. اه يادماغي هو الاخوان ا...,1,test,'367053834235183104'
4,4,ايمن نور فى حوار #مرسي اللى كان مذاع والعالم ك...,1,test,'341890733990633473'


## Model Preparation

We start by setting the random seed and the maximum sentence length, and then select the embedding model together with the path to its pretrained weights and its vector size:

In [5]:
tf.random.set_seed(123)
max_sentence_len = 200

In [6]:
model_type = "fasttext"

if model_type == "aravec":
    model_path = '../pretrained/full_uni_sg_300_twitter.mdl'
    size = 300
elif model_type == "fasttext":
    model_path = '../pretrained/cc.ar.300.bin'
    size = 300
elif model_type == "elmo":
    model_path = '../pretrained'
    size = 1024
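
The `if`/`elif` chain above leaves `model_path` and `size` undefined when `model_type` is misspelled, which would only surface later as a `NameError`. One possible alternative, shown here as a minimal sketch, is a lookup table that fails early. The names `EMBEDDING_CONFIGS` and `resolve_embedding` are hypothetical and not part of the notebook; the paths and sizes are copied from the cell above.

```python
# Hypothetical lookup table mirroring the if/elif chain in the cell above.
EMBEDDING_CONFIGS = {
    "aravec": ("../pretrained/full_uni_sg_300_twitter.mdl", 300),
    "fasttext": ("../pretrained/cc.ar.300.bin", 300),
    "elmo": ("../pretrained", 1024),
}


def resolve_embedding(model_type):
    """Return (model_path, size) or raise a clear error for unknown types."""
    try:
        return EMBEDDING_CONFIGS[model_type]
    except KeyError:
        known = ", ".join(sorted(EMBEDDING_CONFIGS))
        raise ValueError(
            "Unknown model_type {!r}; expected one of: {}".format(model_type, known)
        ) from None


model_path, size = resolve_embedding("fasttext")
print(model_path, size)
```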

Next we load our model of choice:

In [7]:
embedder = embed(model_type, model_path)



Then we define the input and output to the model:

In [8]:
sentence = keras.Input(shape=(max_sentence_len, size), name='sentence')
label = keras.Input(shape=(1,), name='label')

This is followed by defining the structure of the network:

In [9]:
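# One LSTM per direction; Bidirectional concatenates their outputs (2 * size features).
forward_layer = tf.keras.layers.LSTM(size)
backward_layer = tf.keras.layers.LSTM(size, go_backwards=True)
# Note: masking_layer is defined but never applied, so padded time steps are not masked.
masking_layer = tf.keras.layers.Masking()
rnn = tf.keras.layers.Bidirectional(forward_layer, backward_layer=backward_layer)
logits = rnn(sentence)
# Sigmoid output: despite the name, `logits` holds the predicted probability of label 1.
logits = keras.layers.Dense(1, activation=tf.nn.sigmoid)(logits)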
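
To get a feel for the size of this network, the sketch below counts its trainable parameters by hand. It assumes the standard Keras LSTM layout (four gates, each with input weights, recurrent weights and a bias) and the default `concat` merge mode of `Bidirectional`, so the dense layer receives `2 * size` features. The helper names are illustrative only.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias vector.
    return 4 * ((input_dim + units) * units + units)


def bilstm_classifier_params(size):
    # Forward and backward LSTMs have identical shapes.
    recurrent = 2 * lstm_params(size, size)
    # Dense(1) on the concatenated 2 * size features, plus one bias.
    dense = 2 * size + 1
    return recurrent, dense


recurrent, dense = bilstm_classifier_params(300)
print(recurrent, dense)  # 1442400 601
```

With `size = 300` (AraVec or fastText), almost all of the parameters sit in the two LSTMs; the classifier head itself is tiny.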

Then we construct and compile the model:

In [10]:
model = keras.Model(inputs=sentence, outputs=logits)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

## Model Training

First we prepare the inputs and outputs to be fed to the model during training:

In [11]:
X_train = embedder.embed_batch(df_train["text"].tolist(), max_sentence_len)
Y_train = df_train["label"]
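
The model expects an input of shape `(batch, max_sentence_len, size)`. The internals of `embed_batch` are not shown in this notebook, so the following is only a sketch of how such a batch could be assembled, assuming each sentence has already been turned into a list of token vectors and that short sentences are padded with zeros. `pad_embeddings` is a hypothetical helper, not the `embed_classer` API.

```python
import numpy as np


def pad_embeddings(token_vectors, max_len, size):
    """Stack per-sentence token vectors into one zero-padded float array."""
    batch = np.zeros((len(token_vectors), max_len, size), dtype=np.float32)
    for i, vectors in enumerate(token_vectors):
        # Truncate sentences that are longer than max_len.
        vectors = np.asarray(vectors, dtype=np.float32)[:max_len]
        if len(vectors):
            batch[i, :len(vectors)] = vectors
    return batch


# Two toy "sentences" with 3 and 1 tokens and an embedding size of 4.
toy = [np.ones((3, 4)), np.ones((1, 4))]
X_toy = pad_embeddings(toy, max_len=5, size=4)
print(X_toy.shape)  # (2, 5, 4)
```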

Next we fit the model to the training data:

In [12]:
model.fit(X_train, Y_train, epochs=5, batch_size=32)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7fc3e39ff400>

## Submission Preparation

We prepare the features for each test set instance as follows:

In [13]:
X_test = embedder.embed_batch(df_test["text"].tolist(), max_sentence_len)
Y_test = df_test["label"]

We threshold the predicted probabilities at 0.5 and evaluate the resulting labels with the macro-averaged F1 score:

In [14]:
predictions = (model.predict(X_test)>0.5).astype(int)
f1_score(Y_test, predictions, average="macro")

0.7959902659371421
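
For readers unfamiliar with the metric, macro F1 is the unweighted mean of the per-class F1 scores, so both classes count equally regardless of their frequency. The sketch below recomputes it in plain Python on made-up toy values (not IDAT data), using the same 0.5 threshold as above. The helper names are illustrative.

```python
def f1_for_class(y_true, y_pred, cls):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    # Return 0.0 when the class is neither present nor predicted.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, classes=(0, 1)):
    # Unweighted mean of the per-class F1 scores.
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


probs = [0.9, 0.2, 0.4, 0.7, 0.6]        # toy sigmoid outputs
y_true = [1, 0, 1, 1, 0]                 # toy gold labels
y_pred = [int(p > 0.5) for p in probs]   # same 0.5 threshold as above
score = macro_f1(y_true, y_pred)
print(round(score, 4))  # 0.5833
```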

We prepare the predictions as a pandas DataFrame and write them to a tab-separated file:

In [15]:
df_preds = pd.DataFrame(data=predictions, columns=["prediction"], index=df_test["id"])
df_preds.reset_index(inplace=True)

In [16]:
output_dir = "./predictions/{}".format(model_type)
os.makedirs(output_dir, exist_ok=True)  # no-op if the directory already exists
df_preds.to_csv(os.path.join(output_dir, "irony.tsv"), index=False, sep="\t")
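
Before submitting, it can be worth reading the file back to confirm its layout. The sketch below assumes the file has an `id` and a `prediction` column, as produced by the cell above, and that predictions are 0 or 1. `check_submission` is a hypothetical helper, and the demo runs on a toy file in a temporary directory rather than on the real output.

```python
import csv
import os
import tempfile


def check_submission(path, expected_ids):
    """Return True if ids match expected_ids in order and labels are 0 or 1."""
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    ids = [int(row["id"]) for row in rows]
    labels = {int(row["prediction"]) for row in rows}
    return ids == list(expected_ids) and labels <= {0, 1}


# Toy file with the same layout as irony.tsv.
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "irony.tsv")
    with open(toy_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerows([["id", "prediction"], [0, 1], [1, 0], [2, 1]])
    submission_ok = check_submission(toy_path, range(3))
    wrong_ids_ok = check_submission(toy_path, range(4))

print(submission_ok, wrong_ids_ok)  # True False
```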