# Universal Sentence Encoder Baseline for IDAT

This notebook walks through reproducing the Universal Sentence Encoder baseline for the IDAT irony detection task.

## Loading Required Modules

We start by loading the required Python libraries.

In [1]:
import os
import tensorflow as tf
from tensorflow import keras
import tensorflow_hub as hub
import pandas as pd
# Imported for its side effect: it registers the text ops the multilingual encoder needs.
import tensorflow_text
from sklearn.metrics import f1_score

## Loading Data

Using pandas, we can load and inspect the training and testing datasets as follows:

In [2]:
df_train = pd.read_csv("../../data/idat/IDAT_training_text.csv")
df_test = pd.read_csv("../../data/idat/IDAT_test_text.csv")

Below we list the first five entries in the training data.

In [3]:
df_train.head()

Unnamed: 0,id,text,label,type_,tweet_id
0,0,ايمان عز الدين:الجراد طلع علي المقطم وبعدين بي...,1,training,'308488170838831104'
1,1,@AymanNour الى المدعو أيمن نور الحرامى من معك ...,0,training,'955724773216129024'
2,2,#بوتين ٦٥ سنه و بيغطس في بحيره متجمده و انا خا...,0,training,'954792171521048576'
3,3,#قال أيه أنهاردة 20 مليون واحد في الشوارع عشان...,1,training,'363321598431862784'
4,4,@EmmanuelMacron وفي كل مره يرفض إيمانويل دعوة ...,0,training,'939204686632103936'


Below we list the first five entries in the test data.

In [4]:
df_test.head()

Unnamed: 0,id,text,label,type_,tweet_id
0,0,#يناير_حلم_ومكملينه فاستبشروا خيرا واستكملوا ث...,0,test,'955879051872350209'
1,1,#الشيخه_موزا_مصدر_فخرنا موزه ويسبّــق اسمــها ...,0,test,'953563403368452096'
2,2,معلش سؤال بس. هو حد علق من جبهة الانقاذ عن احد...,1,test,'322085724235132928'
3,3,ههههههههههههههههههههه. اه يادماغي هو الاخوان ا...,1,test,'367053834235183104'
4,4,ايمن نور فى حوار #مرسي اللى كان مذاع والعالم ك...,1,test,'341890733990633473'


## Model Preparation

We start by setting the TensorFlow random seed:

In [5]:
tf.random.set_seed(123)

Next, we load the multilingual Universal Sentence Encoder. Warning: this downloads and caches a large model of around 1 GB.

In [6]:
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

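Then we define the model input, a 512-dimensional sentence embedding, along with a placeholder for the label:

In [7]:
sentence_input = keras.Input(shape=(512,), name='sentence')
# Note: `label` is not connected to the model below; the labels are passed directly to model.fit.
label = keras.Input(shape=(1,), name='label')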

Next we define the network: three fully connected tanh layers of 512 units each, followed by a single sigmoid output unit:

In [8]:
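# Three hidden layers applied to the sentence embedding.
logits = keras.layers.Dense(512, activation=tf.nn.tanh)(sentence_input)
logits = keras.layers.Dense(512, activation=tf.nn.tanh)(logits)
logits = keras.layers.Dense(512, activation=tf.nn.tanh)(logits)
# Despite the variable name, the sigmoid makes the final output a probability, not a raw logit.
logits = keras.layers.Dense(1, activation=tf.nn.sigmoid)(logits)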
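
As a rough sanity check on the size of this network, the number of trainable parameters can be worked out by hand. The sketch below assumes each dense layer has a weight matrix plus a bias vector (the Keras default); `dense_param_count` is a hypothetical helper written for this illustration, not part of Keras.

```python
def dense_param_count(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# Input width followed by the number of units in each layer of the network above.
layer_sizes = [512, 512, 512, 512, 1]

# Pair each layer's input width with its unit count.
per_layer = [dense_param_count(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)

print(per_layer)     # [262656, 262656, 262656, 513]
print(total_params)  # 788481
```

Once the model has been constructed, `model.summary()` should report the same total, which is a quick way to confirm the layers are wired as intended.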

Then we construct and compile the model:

In [9]:
model = keras.Model(inputs=sentence_input, outputs=logits)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

## Model Training

First we prepare the inputs and outputs to be fed to the model during training:

In [10]:
X_train = embed(df_train["text"])
Y_train = df_train["label"]
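
Calling the encoder on the whole column at once keeps the notebook short, but it may use a lot of memory on larger datasets. One possible alternative is to embed the texts in chunks. The sketch below is an assumption-laden illustration: `iter_batches`, `embed_in_batches`, the batch size of 256, and the stand-in `fake_embed` function are all hypothetical and not part of the original baseline.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(texts: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `texts` holding at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


def embed_in_batches(texts: Sequence[str], embed_fn: Callable, batch_size: int = 256) -> List:
    """Apply `embed_fn` to each batch and return the per-batch results in order."""
    return [embed_fn(batch) for batch in iter_batches(texts, batch_size)]


def fake_embed(batch: Sequence[str]) -> List[List[float]]:
    """Stand-in for the real encoder: maps each text to a 1-element vector."""
    return [[float(len(text))] for text in batch]


chunks = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2)
print([len(chunk) for chunk in chunks])  # [2, 2, 1]
```

With the real encoder, the per-batch tensors could then be joined into a single tensor, for example with `tf.concat(embed_in_batches(list(df_train["text"]), embed), axis=0)`.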

Next we fit the model to the training data:

In [11]:
model.fit(X_train, Y_train, epochs=5, batch_size=32)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7fc9641f1430>

## Submission Preparation

We prepare the features for each test set instance as follows:

In [12]:
X_test = embed(df_test["text"])
Y_test = df_test["label"]

We threshold the predicted probabilities at 0.5 to obtain binary labels and evaluate them with the macro-averaged F1 score:

In [13]:
predictions = (model.predict(X_test) > 0.5).astype(int)
f1_score(Y_test, predictions, average="macro")

0.7660851694969757
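
To make the metric concrete, here is a minimal pure-Python sketch of thresholding followed by macro-averaged F1: the F1 score is computed separately for each class and the per-class scores are averaged with equal weight. The helper names and the toy probabilities and labels are made up for illustration; they are not taken from the IDAT data, and the notebook itself relies on `sklearn.metrics.f1_score`.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], positive: int) -> float:
    """F1 score treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denominator = 2 * tp + fp + fn
    # Return 0.0 when the class is never predicted and never present.
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy example: made-up probabilities and gold labels.
probabilities = [0.9, 0.2, 0.6, 0.4, 0.1, 0.8]
y_true = [1, 0, 0, 1, 0, 1]
y_pred = [int(p > 0.5) for p in probabilities]  # same 0.5 threshold as above

score = macro_f1(y_true, y_pred)
print(y_pred)  # [1, 0, 1, 0, 0, 1]
```

Because both classes contribute equally to the average, macro F1 penalises a model that does well only on the majority class, which is presumably why it is preferred over plain accuracy here.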

We prepare the predictions as a pandas DataFrame with one row per test instance:

In [14]:
df_preds = pd.DataFrame(data=predictions, columns=["prediction"], index=df_test["id"])
df_preds.reset_index(inplace=True)

In [15]:
# Create the output directory if it does not already exist.
os.makedirs("predictions", exist_ok=True)
df_preds.to_csv("./predictions/irony.tsv", index=False, sep="\t")
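
Before submitting, it can be worth reading the file back to confirm its shape. The sketch below assumes the layout produced by the cells above (a tab-separated file with an `id` column and a `prediction` column holding 0 or 1); `validate_submission` is a hypothetical helper, the toy file is written to a temporary directory purely for the demonstration, and the official submission requirements may differ.

```python
import csv
import os
import tempfile


def validate_submission(path: str) -> int:
    """Check the header and the 0/1 predictions; return the number of data rows."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if reader.fieldnames != ["id", "prediction"]:
            raise ValueError(f"unexpected header: {reader.fieldnames}")
        count = 0
        for row in reader:
            if row["prediction"] not in {"0", "1"}:
                raise ValueError(f"invalid prediction for id {row['id']}: {row['prediction']!r}")
            count += 1
    return count


# Demonstration on a toy file with three made-up rows.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "irony.tsv")
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        handle.write("id\tprediction\n0\t1\n1\t0\n2\t1\n")
    n_rows = validate_submission(demo_path)

print(n_rows)  # 3
```

For the real file, calling `validate_submission("./predictions/irony.tsv")` should return the same number of rows as `len(df_test)`.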