# ELMO/ARAVEC/FASTTEXT Baseline for OSACT4 - Task B

In this notebook, we walk through reproducing the ELMO/ARAVEC/FASTTEXT baseline for OSACT4 Task B (hate speech detection).

## Loading Required Modules

We start by loading the required Python libraries.

In [1]:
import os
import numpy as np
import tensorflow as tf
from tensorflow import keras
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.preprocessing import LabelEncoder
from embed_classer import embed

## Loading Data

Using pandas, we can load and inspect the training, development and testing datasets as follows:

In [2]:
df_train = pd.read_csv("../../data/osact4/OSACT2020-sharedTask-train.txt", sep="\t", quotechar='▁', header=None, names=["text", "offensive", "hate"])
df_dev = pd.read_csv("../../data/osact4/OSACT2020-sharedTask-dev.txt", sep="\t", quotechar='▁', header=None, names=["text", "offensive", "hate"])
df_test = pd.read_csv("../../private_datasets/offensive/tweets_v1.0.txt", sep="\t", quotechar='▁', header=None, names=["text"])
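
The `quotechar='▁'` argument above sets the quote character to a symbol that is unlikely to occur in tweets, so stray `"` characters are read literally. A minimal sketch of an alternative that disables quoting altogether is shown below; the sample text is made up for illustration, and whether this reads the real files identically is an assumption that should be checked.

```python
import csv
import io

import pandas as pd

# Made-up sample: the first tweet contains a stray double quote.
sample = 'tweet with a stray " quote\tNOT_OFF\tNOT_HS\nanother tweet\tOFF\tHS\n'

# QUOTE_NONE tells the parser to treat quote characters as ordinary text.
df_demo = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    quoting=csv.QUOTE_NONE,
    header=None,
    names=["text", "offensive", "hate"],
)
```

With quoting disabled, the stray quote no longer swallows the tab and newline that follow it, so both rows are parsed.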

Below we list the first 5 entries in the training data.

In [3]:
df_train.head()

| | text | offensive | hate |
|---|---|---|---|
| 0 | الحمدلله يارب فوز مهم يا زمالك.. كل الدعم ليكم... | NOT_OFF | NOT_HS |
| 1 | فدوه يا بخت فدوه يا زمن واحد منكم يجيبه | NOT_OFF | NOT_HS |
| 2 | RT @USER: يا رب يا واحد يا أحد بحق يوم الاحد ا... | OFF | HS |
| 3 | RT @USER: #هوا_الحرية يا وجع قلبي عليكي يا امي... | NOT_OFF | NOT_HS |
| 4 | يا بكون بحياتك الأهم يا إما ما بدي أكون 🎼 | NOT_OFF | NOT_HS |


Below we list the first 5 entries in the development data.

In [4]:
df_dev.head()

| | text | offensive | hate |
|---|---|---|---|
| 0 | فى حاجات مينفعش نلفت نظركوا ليها زى الاصول كده... | NOT_OFF | NOT_HS |
| 1 | RT @USER: وعيون تنادينا تحايل فينا و نقول يا ع... | NOT_OFF | NOT_HS |
| 2 | يا بلادي يا أم البلاد يا بلادي بحبك يا مصر بحب... | NOT_OFF | NOT_HS |
| 3 | RT @USER: يا رب يا قوي يا معين مدّني بالقوة و ... | NOT_OFF | NOT_HS |
| 4 | RT @USER: رحمك الله يا صدام يا بطل ومقدام. URL | NOT_OFF | NOT_HS |


Below we list the first 5 entries in the testing data.

In [5]:
df_test.head()

| | text |
|---|---|
| 0 | أود أن أعلمكم أن التعليق المنشور هنا باسم نور ... |
| 1 | مافيه فرق بين احمد جبريل والعاهره المستأجره |
| 2 | اذا نطق السفية فلا تجبة لانة سفية وقليل الادب ... |
| 3 | اعتقد حضرتك تدعو لمؤتمر دولى للحوار للسلمي مع ... |
| 4 | يسرني في المركز الموريتاني لقياس الراي العام ا... |


## Model Preparation

We start by setting the randomisation seed and the maximum sentence length:

In [6]:
tf.random.set_seed(123)
max_sentence_len = 20

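We then choose which embedding to use, along with the path to its pretrained model and its vector size:

In [7]: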
model_type = "fasttext"

if model_type == "aravec":
    model_path = '../pretrained/full_uni_sg_300_twitter.mdl'
    size = 300
elif model_type == "fasttext":
    model_path = '../pretrained/cc.ar.300.bin'
    size = 300
elif model_type == "elmo":
    model_path = '../pretrained'
    size = 1024

Next we load our model of choice:

In [8]:
embedder = embed(model_type, model_path)



Then we define the model input (the `label` input is also declared here, but the model built below does not use it):

In [9]:
sentence = keras.Input(shape=(max_sentence_len, size), name='sentence')
label = keras.Input(shape=(1,), name='label')
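
The input shape `(max_sentence_len, size)` implies that every tweet must be turned into a fixed-size matrix of token vectors. How `embed_batch` in the local `embed_classer` module does this is not shown in the notebook; the sketch below is one plausible approach (zero-padding short inputs and truncating long ones), and the helper name `pad_or_truncate` is hypothetical.

```python
import numpy as np


def pad_or_truncate(token_vectors, max_len, size):
    """Hypothetical helper: build a (max_len, size) matrix from per-token vectors."""
    out = np.zeros((max_len, size), dtype=np.float32)
    n = min(len(token_vectors), max_len)
    if n > 0:
        # Keep the first n token vectors; remaining rows stay zero (padding).
        out[:n] = np.asarray(token_vectors[:n], dtype=np.float32)
    return out


# Toy example with 4-dimensional vectors and a maximum length of 5 tokens.
short_sentence = [np.ones(4), np.ones(4) * 2]          # 2 tokens -> padded
long_sentence = [np.full(4, i) for i in range(10)]     # 10 tokens -> truncated
batch = np.stack([
    pad_or_truncate(short_sentence, 5, 4),
    pad_or_truncate(long_sentence, 5, 4),
])
```

The resulting `batch` has shape `(2, 5, 4)`, matching the `(batch, max_sentence_len, size)` layout the Keras input expects.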

This is followed by defining the structure of the network:

In [10]:
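# Forward and backward LSTMs, each returning a `size`-dimensional final state.
forward_layer = tf.keras.layers.LSTM(size)
backward_layer = tf.keras.layers.LSTM(size, go_backwards=True)
# Note: masking_layer is created here but is not applied to the input below.
masking_layer = tf.keras.layers.Masking()
rnn = tf.keras.layers.Bidirectional(forward_layer, backward_layer=backward_layer)
# Despite its name, `logits` ends up holding the sigmoid probability.
logits = rnn(sentence)
logits = keras.layers.Dense(1, activation=tf.nn.sigmoid)(logits)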
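
As a reading aid, the computation defined in this cell can be written out as follows. This is a sketch assuming Keras defaults (the `Bidirectional` wrapper concatenates the two final LSTM states, and neither LSTM returns sequences). The symbols $h$, $w$, $b$, $d$ and $T$ are notation introduced here, with $d$ = `size` and $T$ = `max_sentence_len`.

```latex
\begin{aligned}
\overrightarrow{h} &= \mathrm{LSTM}_{\mathrm{fwd}}(x_1, \dots, x_T) \in \mathbb{R}^{d}, \\
\overleftarrow{h} &= \mathrm{LSTM}_{\mathrm{bwd}}(x_T, \dots, x_1) \in \mathbb{R}^{d}, \\
h &= [\overrightarrow{h}; \overleftarrow{h}] \in \mathbb{R}^{2d}, \\
p &= \sigma(w^\top h + b) \in (0, 1).
\end{aligned}
```

Under these assumptions, with `size = 300` the dense layer receives a 600-dimensional vector and outputs a single probability per tweet.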

Then we construct and compile the model:

In [11]:
model = keras.Model(inputs=sentence, outputs=logits)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
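
To make the `binary_crossentropy` loss concrete, here is a small NumPy sketch of the quantity being minimised, averaged over a batch. The function name and the clipping constant `eps` are choices made for this illustration and are not taken from the notebook.

```python
import numpy as np


def binary_crossentropy(y_true, p, eps=1e-7):
    """Mean binary cross-entropy between 0/1 targets and predicted probabilities."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    losses = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return float(np.mean(losses))


y_true = np.array([1.0, 0.0, 1.0])
p = np.array([0.9, 0.2, 0.6])
loss = binary_crossentropy(y_true, p)
```

Confident correct predictions (0.9 for a positive, 0.2 for a negative) contribute little to the loss, while the hesitant 0.6 contributes the most.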

## Model Training

First we prepare the inputs and outputs to be fed to the model during training:

In [12]:
X_train = embedder.embed_batch(df_train["text"].tolist(), max_sentence_len)
X_dev = embedder.embed_batch(df_dev["text"].tolist(), max_sentence_len)
le = LabelEncoder()
le.fit(df_train["hate"])
Y_train = le.transform(df_train["hate"])
Y_dev = le.transform(df_dev["hate"])
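
It is worth checking which integer each label receives, because that determines what the sigmoid output means. `LabelEncoder` assigns integers in sorted order of the labels, so with the two label strings seen in the data, `HS` becomes 0 and `NOT_HS` becomes 1. The sketch below illustrates this on a toy list; `le_demo` is a name used only here.

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
le_demo.fit(["NOT_HS", "HS", "NOT_HS"])

# Classes are stored in sorted order; the index is the encoded value.
classes = list(le_demo.classes_)
encoded = le_demo.transform(["NOT_HS", "HS"]).tolist()
```

With this encoding, a model output close to 1 corresponds to `NOT_HS` rather than to `HS`.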

Next we fit the model to the training data, validating on the development set after each epoch:

In [13]:
model.fit(X_train, Y_train, epochs=5, batch_size=32, validation_data=(X_dev, Y_dev))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7fa9dca13400>
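
The notebook imports `f1_score` but only tracks accuracy during training. A minimal sketch of scoring development-set predictions with macro-averaged F1 is given below, assuming that metric is the one of interest. The arrays are toy stand-ins; in the notebook, `y_true` would be `Y_dev` and `probs` would be `model.predict(X_dev).ravel()`.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy stand-ins for the encoded gold labels and the sigmoid outputs.
y_true = np.array([0, 0, 0, 1, 1, 0])
probs = np.array([0.1, 0.4, 0.7, 0.8, 0.3, 0.2])

# Threshold at 0.5, then average the per-class F1 scores with equal weight.
y_pred = (probs > 0.5).astype(int)
macro_f1 = f1_score(y_true, y_pred, average="macro")
```

Macro averaging weights both classes equally, which matters when one class is much rarer than the other.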

## Submission Preparation

We prepare the features for each test set instance as follows:

In [14]:
X_test = embedder.embed_batch(df_test["text"].tolist(), max_sentence_len)

We predict the test labels by thresholding the sigmoid output at 0.5:

In [15]:
# ravel() yields a 1-D array, which avoids a column-vector warning in inverse_transform.
predictions = (model.predict(X_test) > 0.5).astype(int).ravel()
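
The thresholding step and the mapping back to label strings can be traced on a toy array. This sketch uses plain NumPy indexing in place of `inverse_transform`, and assumes the sorted class order `["HS", "NOT_HS"]` that `LabelEncoder` would produce for these two labels; the probabilities are made up.

```python
import numpy as np

# Made-up model outputs with shape (n, 1), as a Dense(1) layer would return.
probs = np.array([[0.91], [0.12], [0.50]])

# Strictly greater than 0.5 maps to class 1; ravel() flattens to shape (n,).
pred_ids = (probs > 0.5).astype(int).ravel()

# Index into the sorted class names to recover the label strings.
classes = np.array(["HS", "NOT_HS"])
pred_labels = classes[pred_ids].tolist()
```

Note that an output of exactly 0.5 falls on the class-0 side because the comparison is strict.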

We prepare the predictions as a pandas dataframe, mapping the integer predictions back to their label strings:

In [16]:
df_preds = pd.DataFrame(data=le.inverse_transform(predictions), columns=["prediction"])

Finally, we write the predictions to a tab-separated file, one label per line:

In [17]:
out_dir = "./predictions/{}".format(model_type)
os.makedirs(out_dir, exist_ok=True)  # exist_ok=True makes a separate existence check unnecessary
df_preds.to_csv("{}/hate.tsv".format(out_dir), index=False, header=False, sep="\t")