# Universal Sentence Encoder Baseline for E-c Task

In this notebook, we will walk you through the process of reproducing the Universal Sentence Encoder baseline for the E-c task.

## Loading Required Modules

First, we load the required Python libraries.

In [1]:
import os

import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # not called directly; importing it registers ops the multilingual encoder needs
from sklearn.metrics import jaccard_score
from tensorflow import keras

## Loading Data

Using pandas, we can load and inspect the training, validation, and testing datasets as follows:

In [2]:
df_train = pd.read_csv("../../data/affect-in-tweets/emotion-c/2018-E-c-Ar-train.txt", sep="\t")
df_dev = pd.read_csv("../../data/affect-in-tweets/emotion-c/2018-E-c-Ar-dev.txt", sep="\t")
df_test = pd.read_csv("../../private_datasets/emotion/emotion_no_labels_v1.0.tsv", sep="\t")

Below we list the first five entries in the training data.

In [3]:
df_train.head()

Unnamed: 0,ID,Tweet,anger,anticipation,disgust,fear,joy,love,optimism,pessimism,sadness,surprise,trust
0,2018-Ar-00259,ظلينا نتكلم ساعات ساعات رتبت فيها نفسي وبكيت ف...,1,0,0,1,0,0,0,1,1,0,0
1,2018-Ar-02696,كل سنه وانتي بخير ياقلبي وكل سنه وانتي سعيده ي...,0,0,0,0,1,1,1,0,0,0,0
2,2018-Ar-03596,البسطاء يمتلكون أرواح نادره جداتجدهم بمظهر متو...,0,0,0,0,0,1,1,0,0,0,0
3,2018-Ar-02999,مومعقول اللي قاعد يصير فيني هالايام يارب ماينت...,0,0,0,1,0,0,0,0,0,0,0
4,2018-Ar-02716,انا اكثر شخص متناقض بداخلي حب وكره وامل وقنوط ...,1,0,0,0,0,0,0,0,1,0,0


And the first five entries in the development data.

In [4]:
df_dev.head()

Unnamed: 0,ID,Tweet,anger,anticipation,disgust,fear,joy,love,optimism,pessimism,sadness,surprise,trust
0,2018-Ar-00289,باقي ١٠ دقايق واخلص حلقة سكول الثانيه بس النت ...,1,0,1,0,1,0,0,1,0,0,0
1,2018-Ar-02519,معاناة لما يكون دايماً إحساسك بمحله .....,0,0,0,0,0,0,0,0,1,0,0
2,2018-Ar-01952,لو فيه جائزه اكثر موسوسه تخاف من الامراض فزت ب...,0,0,0,1,0,0,0,1,0,0,0
3,2018-Ar-02912,ما يستفز راحة البال إلا الذكريات ☹️ 💔,1,0,0,0,0,0,0,0,1,0,0
4,2018-Ar-02756,القلب ياوقت والنفس ماهي مرتاحه ونبض القلب ماهو...,0,0,0,1,0,0,0,0,1,0,0


And last but not least, the first 5 entries in the test data.

In [5]:
df_test.head()

Unnamed: 0,ID,Tweet,anger,anticipation,disgust,fear,joy,love,optimism,pessimism,sadness,surprise,trust
0,17439,عيونك تهبل عيونك هي الآلئ اللي تظوي حياتي ما ا...,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE
1,10196,كم هو موجع في بعض الاحيان ان يعمل الانسان با ط...,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE
2,17470,انا هولندي عندي مناعة عخيبات الامل... ما في شي...,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE
3,16262,احا المانيا خرجت,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE
4,13597,وش الي صار متى سجلت الارجنتين,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE,NONE


## Model Preparation

We start by setting the random seed so that weight initialisation is repeatable across runs:

In [6]:
tf.random.set_seed(123)

Next, we load the multilingual Universal Sentence Encoder. Warning: this downloads and caches a large model of around 1 GB in size.

In [7]:
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

Then we define the input and output to the model:

In [8]:
sentence = keras.Input(shape=(512,), name='sentence')
label = keras.Input(shape=(11,), name='label')

This is followed by defining the structure of the network:

In [9]:
logits = keras.layers.Dense(512, activation=tf.nn.sigmoid)(sentence)
logits = keras.layers.Dense(512, activation=tf.nn.sigmoid)(logits)
logits = keras.layers.Dense(11, activation=tf.nn.sigmoid)(logits)
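
To get a feel for the size of this network, the number of trainable parameters can be worked out by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The short sketch below applies that formula to the three layers above; the helper name `dense_params` is ours and is only for illustration.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


# Input width followed by the width of each Dense layer defined above.
layer_sizes = [512, 512, 512, 11]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [262656, 262656, 5643]
print(total)      # 530955
```

The same total should be reported by `model.summary()` once the model has been constructed.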

Then we construct and compile the model:

In [10]:
model = keras.Model(sentence, outputs=logits)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
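
Since a tweet can carry several emotions at once, each of the 11 outputs is treated as an independent yes/no decision, and the binary cross-entropy loss is averaged over the labels. The following NumPy sketch illustrates that computation on made-up numbers; it is meant as an illustration of the formula, not a copy of the Keras implementation, and the clipping constant `eps` is an assumption.

```python
import numpy as np


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over the label axis, one value per sample."""
    # Clip to avoid log(0) for probabilities that are exactly 0 or 1.
    y_prob = np.clip(y_prob, eps, 1 - eps)
    per_label = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return per_label.mean(axis=-1)


# One sample with three labels; an undecided model outputs 0.5 everywhere.
y_true = np.array([[1.0, 0.0, 1.0]])
y_prob = np.array([[0.5, 0.5, 0.5]])
loss = binary_crossentropy(y_true, y_prob)
print(loss)  # one value per sample, here ln(2), roughly 0.693
```

A loss close to ln(2) during training therefore suggests that the model is doing little better than guessing on each label.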

## Model Training

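First, we prepare the inputs and outputs to be fed to the model during training. Each tweet is mapped to a 512-dimensional sentence embedding, and the 11 emotion columns serve as the targets: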

In [11]:
X_train = embed(df_train["Tweet"])
Y_train = df_train[df_train.columns[2:]]
X_dev = embed(df_dev["Tweet"])
Y_dev = df_dev[df_dev.columns[2:]]
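
Embedding a whole column in a single call works for datasets of this size, but it may exhaust memory on larger ones. A possible workaround, sketched below, is to embed the texts in chunks and concatenate the results. The helper `embed_in_batches` and its default batch size are hypothetical and not part of the baseline; a stand-in encoder is used so that the sketch runs without downloading the model.

```python
import numpy as np


def embed_in_batches(embed_fn, texts, batch_size=256):
    """Apply embed_fn to texts in fixed-size chunks and stack the results.

    embed_fn is any callable mapping a list of strings to a 2-D array-like
    (in this notebook that would be `embed`). The helper name and the batch
    size are illustrative assumptions.
    """
    texts = list(texts)
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # np.asarray converts the encoder output to a plain NumPy array.
        chunks.append(np.asarray(embed_fn(batch)))
    return np.concatenate(chunks, axis=0)


# Stand-in encoder: maps each string to the 2-dimensional vector [length, 1].
def fake_embed(batch):
    return np.array([[len(text), 1.0] for text in batch])


vectors = embed_in_batches(fake_embed, ["a", "bb", "ccc"], batch_size=2)
print(vectors.shape)  # (3, 2)
```

With the real encoder, the call would presumably look like `embed_in_batches(embed, df_train["Tweet"])`.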

Next we fit the data:

In [12]:
model.fit(X_train,
          Y_train,
          epochs=5,
          batch_size=32,
          validation_data = (X_dev, Y_dev))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7fce34ca9130>

We calculate the Jaccard similarity score for the development set as follows:

In [13]:
jaccard_score(Y_dev, model.predict(X_dev) > 0.5, average="macro")

0.21584681510804068
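
The `average="macro"` setting computes one Jaccard score per emotion and averages over the 11 emotions. An alternative is to compute one score per tweet and average over tweets (`average="samples"` in scikit-learn); which of the two the competition uses should be checked against its guidelines. The pure-Python sketch below shows how the two can differ on a toy example. The function names are ours, and scoring an empty union as 0.0 is a convention chosen for this sketch.

```python
def jaccard(a, b):
    """Jaccard similarity of two equal-length binary vectors."""
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    # Convention for this sketch: an empty union scores 0.0.
    return intersection / union if union else 0.0


def macro_jaccard(y_true, y_pred):
    """Average of per-label (column-wise) Jaccard scores."""
    scores = [jaccard(t, p) for t, p in zip(zip(*y_true), zip(*y_pred))]
    return sum(scores) / len(scores)


def samples_jaccard(y_true, y_pred):
    """Average of per-sample (row-wise) Jaccard scores."""
    scores = [jaccard(t, p) for t, p in zip(y_true, y_pred)]
    return sum(scores) / len(scores)


# Two tweets, three labels; the third label is missed for the first tweet.
y_true = [[1, 0, 1],
          [0, 1, 0]]
y_pred = [[1, 0, 0],
          [0, 1, 0]]

macro = macro_jaccard(y_true, y_pred)      # (1 + 1 + 0) / 3
samples = samples_jaccard(y_true, y_pred)  # (0.5 + 1) / 2
print(macro, samples)
```

Here the macro average penalises the missed label more heavily, because that label's score drops to zero, whereas the sample average only halves the score of the affected tweet.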

## Submission Preparation

We prepare the features for each test set instance as follows:

In [14]:
X_test = embed(df_test["Tweet"])

Then we predict the labels for each:

In [15]:
predictions = (model.predict(X_test)>0.5).astype(int)

We store the predictions in a pandas dataframe, indexed by tweet ID:

In [16]:
df_preds = pd.DataFrame(data=predictions, columns=df_train.columns[2:].tolist(), index=df_test["ID"])
df_preds.reset_index(inplace=True)

We explore the prediction dataframe:

In [17]:
df_preds.head()

Unnamed: 0,ID,anger,anticipation,disgust,fear,joy,love,optimism,pessimism,sadness,surprise,trust
0,17439,0,0,0,0,0,0,0,0,1,0,0
1,10196,1,0,0,0,0,0,0,0,0,0,0
2,17470,1,0,0,0,0,0,0,0,1,0,0
3,16262,1,0,0,0,0,0,0,0,1,0,0
4,13597,1,0,1,0,0,0,0,0,1,0,0


In the final step, we save the predictions as required by the competition guidelines.

In [18]:
# Create the output directory if it does not exist yet.
os.makedirs("predictions", exist_ok=True)
df_preds.to_csv("./predictions/E_c.tsv", index=False, sep="\t")
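
Before submitting, it can be worth checking that the saved file has the expected layout. The sketch below assumes the layout written above (an `ID` column followed by the 11 emotion columns, tab-separated, with values 0 or 1); the helper `check_submission` is hypothetical, and the actual competition guidelines remain the authority. A small temporary file with one deliberately invalid row is used for demonstration.

```python
import csv
import os
import tempfile

EMOTIONS = ["anger", "anticipation", "disgust", "fear", "joy", "love",
            "optimism", "pessimism", "sadness", "surprise", "trust"]


def check_submission(path, label_columns=EMOTIONS):
    """Return a list of problems found in a tab-separated prediction file."""
    problems = []
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        expected = ["ID"] + list(label_columns)
        if reader.fieldnames != expected:
            problems.append(f"unexpected header: {reader.fieldnames}")
            return problems
        # The header is line 1, so data rows start at line 2.
        for line_number, row in enumerate(reader, start=2):
            bad = [c for c in label_columns if row[c] not in ("0", "1")]
            if bad:
                problems.append(f"line {line_number}: non-binary values in {bad}")
    return problems


# Demonstration on a temporary file whose second data row is invalid.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "E_c.tsv")
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["ID"] + EMOTIONS)
        writer.writerow(["17439"] + ["0"] * 10 + ["1"])
        writer.writerow(["10196"] + ["2"] + ["0"] * 10)
    issues = check_submission(demo_path)

print(issues)  # ["line 3: non-binary values in ['anger']"]
```

For the file produced by this notebook, `check_submission("./predictions/E_c.tsv")` would be expected to return an empty list.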