# ELMo/AraVec/fastText Baseline for V-reg Task

In this notebook, we walk through reproducing the ELMo/AraVec/fastText baseline for the V-reg (valence regression) task.

## Loading Required Modules

We start by importing the required Python libraries.

In [1]:
import os
import tensorflow as tf
from tensorflow import keras
import pandas as pd
from scipy.stats import pearsonr
from embed_classer import embed

## Loading Data

Using pandas, we can load and inspect the training, development, and test datasets as follows:

In [2]:
df_train = pd.read_csv("../../data/affect-in-tweets/V-reg/2018-Valence-reg-Ar-train.txt", sep="\t")
df_dev = pd.read_csv("../../data/affect-in-tweets/V-reg/2018-Valence-reg-Ar-dev.txt", sep="\t")
df_test = pd.read_csv("../../private_datasets/vreg/vreg_no_labels_v1.0.tsv", sep="\t")

Below we list the first five entries in the training data.

In [3]:
df_train.head()

Unnamed: 0,ID,Tweet,Affect Dimension,Intensity Score
0,2018-Ar-01961,إلىٰ متىٰ الألم يغلب على الفرح,valence,0.097
1,2018-Ar-03289,@Al3mriRami @Holyliviuss كل مافي الأمر أني غاض...,valence,0.219
2,2018-Ar-04349,يحذركم ويخوفكم من نفسه اذا ارتكبتم ذنب او معصي...,valence,0.313
3,2018-Ar-03640,💞 💞 صباحكم سعادة في اليوم المبارك تقبل الله صي...,valence,0.828
4,2018-Ar-01176,@sjalmulla شفته قبل اسبوع ومتشوقه عليه وايد ال...,valence,0.719


And the first five entries in the development data.

In [4]:
df_dev.head()

Unnamed: 0,ID,Tweet,Affect Dimension,Intensity Score
0,2018-Ar-00297,لؤي عرفك من زماان طيب ومحترم وجدع ومحبوب ربنا ...,valence,0.613
1,2018-Ar-03228,مدمن العزلة يخاف الاهتمام الزائد يتوتر لا يحسن...,valence,0.328
2,2018-Ar-00857,تذكر أن بعد الشقاء سعادة وبعد دموعك #إبتسامة,valence,0.625
3,2018-Ar-02764,ماف واحد متزوج اسأله عن الزواج الا يسب ويلعن و...,valence,0.422
4,2018-Ar-00582,٢٥ للاسف ما بعرفك بس باين انك حد منيح ودمك خفي...,valence,0.547


Finally, the first five entries in the test data.

In [5]:
df_test.head()

Unnamed: 0,ID,Tweet,Affect Dimension,Intensity Score
0,ID-923,للاسف اتى علينا زمن اصبح بعض الآباء ليس حضناً ...,valence,NONE
1,ID-280,ايه الفرص اللي بتضيع من البرازيل دي حراام بجد,valence,NONE
2,ID-406,جات لى ريادة أطفال .. ف الاسبوع السادس,valence,NONE
3,ID-423,الحمد لله انه ما في خاصيه بتبين كم مره حضرت ال...,valence,NONE
4,ID-965,اب همي وهم بي أحبابي همهم ما بهم وهمي ما بي ...,valence,NONE
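
The test file appears to store the withheld scores as the literal string `NONE`, in which case pandas reads that column as text. If a numeric column with missing values is preferred, `read_csv` can be told to treat the placeholder as missing. The sketch below shows this on a small in-memory sample with made-up rows rather than the real file.

```python
import io
import pandas as pd

# Made-up rows in the same tab-separated layout as the test file.
sample_tsv = (
    "ID\tTweet\tAffect Dimension\tIntensity Score\n"
    "ID-1\tplaceholder tweet one\tvalence\tNONE\n"
    "ID-2\tplaceholder tweet two\tvalence\tNONE\n"
)

# Default parsing keeps the placeholder as a string.
df_raw = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
# na_values adds "NONE" to the strings that pandas treats as missing.
df_nan = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values=["NONE"])

print(df_raw["Intensity Score"].tolist())
print(df_nan["Intensity Score"].isna().all())
```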


## Model Preparation

We start by setting the random seed and the maximum sentence length:

In [6]:
tf.random.set_seed(123)
max_sentence_len = 100

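We then select the embedding type, which determines the path to the pretrained model and the embedding dimension:

In [7]: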
model_type = "aravec"

if model_type == "aravec":
    model_path = '../pretrained/full_uni_sg_300_twitter.mdl'
    size = 300
elif model_type == "fasttext":
    model_path = '../pretrained/cc.ar.300.bin'
    size = 300
elif model_type == "elmo":
    model_path= '../pretrained'
    size = 1024
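
The same selection could also be written as a lookup table, which has the side benefit of failing early on an unsupported `model_type` (with the `if`/`elif` chain, an unknown value leaves `model_path` and `size` undefined). This is only a sketch: the names `EMBEDDING_CONFIGS` and `resolve_embedding` are hypothetical, while the paths and sizes are copied from the cell above.

```python
# Hypothetical lookup table; paths and dimensions mirror the cell above.
EMBEDDING_CONFIGS = {
    "aravec": ("../pretrained/full_uni_sg_300_twitter.mdl", 300),
    "fasttext": ("../pretrained/cc.ar.300.bin", 300),
    "elmo": ("../pretrained", 1024),
}


def resolve_embedding(model_type):
    """Return (model_path, size), or raise on an unknown embedding type."""
    try:
        return EMBEDDING_CONFIGS[model_type]
    except KeyError:
        raise ValueError(
            "Unknown model_type {!r}; expected one of {}".format(
                model_type, sorted(EMBEDDING_CONFIGS)
            )
        ) from None


model_path, size = resolve_embedding("aravec")
print(model_path, size)
```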

Next we load our model of choice:

In [8]:
embedder = embed(model_type, model_path)

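Then we define the model's input tensor, along with a placeholder for the label:

In [9]:
# One embedding vector of dimension `size` per token position
sentence = keras.Input(shape=(max_sentence_len, size), name='sentence')
# Note: this placeholder is not passed to keras.Model; targets are supplied to fit()
label = keras.Input(shape=(1,), name='label')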

This is followed by defining the structure of the network:

In [10]:
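# Forward and backward LSTMs, each with `size` hidden units
forward_layer = tf.keras.layers.LSTM(size)
backward_layer = tf.keras.layers.LSTM(size, go_backwards=True)
# Note: this masking layer is created but never applied to `sentence` below
masking_layer = tf.keras.layers.Masking()
rnn = tf.keras.layers.Bidirectional(forward_layer, backward_layer=backward_layer)
# Concatenated BiLSTM output, followed by a single linear unit for the regression score
logits = rnn(sentence)
logits = keras.layers.Dense(1)(logits)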
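
The cell above creates a `Masking` layer but does not connect it to the graph. If the embedder zero-pads short tweets (an assumption, since `embed_classer` is not shown), one way to make the LSTMs skip padded timesteps is to pass the input through the masking layer first. The sketch below uses small illustrative dimensions and random data, and it is not the configuration that produced the results reported in this notebook.

```python
import numpy as np
from tensorflow import keras

max_sentence_len, size = 6, 8  # small illustrative values, not the notebook's

sentence = keras.Input(shape=(max_sentence_len, size), name="sentence")
# Timesteps whose features all equal mask_value are skipped downstream.
masked = keras.layers.Masking(mask_value=0.0)(sentence)
# Bidirectional creates the backward copy of the LSTM automatically.
encoded = keras.layers.Bidirectional(keras.layers.LSTM(size))(masked)
score = keras.layers.Dense(1)(encoded)

model = keras.Model(sentence, outputs=score)
model.compile(optimizer="adam", loss="mean_squared_error", metrics=["mae"])

# Two fake tweets: the second has only 2 real tokens, the rest is zero padding.
x = np.random.rand(2, max_sentence_len, size).astype("float32")
x[1, 2:] = 0.0
preds = model.predict(x, verbose=0)
print(preds.shape)
```

Whether masking helps here depends on how `embed_batch` pads its output, so any change should be checked against the development-set correlation.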

Then we construct and compile the model:

In [11]:
model = keras.Model(sentence, outputs=logits)
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mae'])

## Model Training

First we prepare the inputs and outputs to be fed to the model during training:

In [12]:
tweet_train = df_train["Tweet"].tolist()
tweet_dev = df_dev["Tweet"].tolist()
X_train = embedder.embed_batch(tweet_train, max_sentence_len)
Y_train = df_train["Intensity Score"]
X_dev = embedder.embed_batch(tweet_dev, max_sentence_len)
Y_dev = df_dev["Intensity Score"]
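
`embed_batch` comes from the local `embed_classer` module, whose implementation is not shown here. As a rough illustration only, the sketch below assumes it produces one fixed-size array per tweet by truncating or zero-padding the token vectors to `max_sentence_len`, which would match the `(max_sentence_len, size)` input shape defined earlier. The helper `pad_or_truncate` and the random stand-in vectors are hypothetical.

```python
import numpy as np


def pad_or_truncate(token_vectors, max_len, size):
    """Return a (max_len, size) array: truncate long inputs, zero-pad short ones."""
    out = np.zeros((max_len, size), dtype=np.float32)
    n = min(len(token_vectors), max_len)
    if n:
        out[:n] = np.asarray(token_vectors[:n], dtype=np.float32)
    return out


rng = np.random.default_rng(0)
size, max_len = 4, 5                      # tiny values for illustration
short_tweet = rng.normal(size=(3, size))  # 3 tokens, will be padded
long_tweet = rng.normal(size=(8, size))   # 8 tokens, will be truncated

batch = np.stack(
    [pad_or_truncate(t, max_len, size) for t in (short_tweet, long_tweet)]
)
print(batch.shape)  # (2, 5, 4)
```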

Next we fit the model to the training data, evaluating on the development set after each epoch:

In [13]:
model.fit(X_train,
          Y_train,
          epochs=10,
          batch_size=32,
          validation_data = (X_dev, Y_dev))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f2b94a8a9b0>

We calculate the Pearson correlation coefficient for the development set as follows:

In [14]:
pearsonr(Y_dev, model.predict(X_dev).reshape(-1))

(0.6141037368772433, 1.1492500774076655e-15)
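
`pearsonr` returns a pair: the correlation coefficient r followed by a two-sided p-value, so the output above corresponds to r ≈ 0.614. As a reminder of what r measures, the sketch below computes it directly from its definition with NumPy on made-up scores. The function name `pearson_r` is hypothetical.

```python
import numpy as np


def pearson_r(y_true, y_pred):
    """Pearson correlation: covariance divided by the product of the spreads."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    dt = y_true - y_true.mean()
    dp = y_pred - y_pred.mean()
    return float((dt * dp).sum() / np.sqrt((dt ** 2).sum() * (dp ** 2).sum()))


gold = [0.1, 0.4, 0.5, 0.9]           # made-up gold scores
pred_scaled = [0.2, 0.8, 1.0, 1.8]    # a perfect linear function of gold
pred_reversed = [0.9, 0.5, 0.4, 0.1]  # gold in reverse order

print(pearson_r(gold, pred_scaled))
print(pearson_r(gold, pred_reversed))
```

Because r is invariant to shifting and positive rescaling of the predictions, it rewards getting the relative ordering and spacing of scores right rather than their absolute values, which is why the mean absolute error is tracked separately during training.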

## Submission Preparation

We prepare the features for each test set instance as follows:

In [15]:
tweet_test = df_test["Tweet"].tolist()
X_test = embedder.embed_batch(tweet_test, max_sentence_len)

Then we predict the valence score for each instance:

In [19]:
predictions = model.predict(X_test)

We collect the predictions in a pandas DataFrame, keyed by tweet ID:

In [20]:
df_preds = pd.DataFrame(data=predictions, columns=["prediction"], index=df_test["ID"])
df_preds.reset_index(inplace=True)

In the final step, we save the predictions as required by the competition guidelines.

In [21]:
# exist_ok=True makes a separate existence check unnecessary
os.makedirs("./predictions/{}".format(model_type), exist_ok=True)
df_preds.to_csv("./predictions/{}/v_reg.tsv".format(model_type), index=False, sep="\t")
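
To make the resulting file layout concrete, the sketch below repeats the last two steps on made-up IDs and predictions and writes to a temporary directory instead of `./predictions`. Using the `ID` series as the index and then calling `reset_index` is what turns the IDs into the first column of the output.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up stand-ins for df_test["ID"] and model.predict(X_test).
ids = pd.Series(["ID-1", "ID-2", "ID-3"], name="ID")
predictions = np.array([[0.25], [0.5], [0.75]])  # shape (n, 1), like Dense(1) output

df_preds = pd.DataFrame(data=predictions, columns=["prediction"], index=ids)
df_preds = df_preds.reset_index()  # moves "ID" from the index into a column

out_dir = os.path.join(tempfile.mkdtemp(), "predictions", "aravec")
os.makedirs(out_dir, exist_ok=True)
out_path = os.path.join(out_dir, "v_reg.tsv")
df_preds.to_csv(out_path, index=False, sep="\t")

with open(out_path, encoding="utf-8") as handle:
    header = handle.readline().rstrip("\n")
print(header)
```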