# Universal Sentence Encoder Baseline for OSACT4 - Task B

In this notebook, we walk through reproducing the Universal Sentence Encoder baseline for OSACT4 Task B, which uses the `hate` labels of the shared-task data.

## Loading Required Modules

We start by loading the required Python libraries.

In [1]:
import os

import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # imported for its side effect: registers ops used by the multilingual encoder
from sklearn.metrics import f1_score
from sklearn.preprocessing import LabelEncoder
from tensorflow import keras

## Loading Data

Using pandas, we can load and inspect the training, development and testing datasets as follows:

In [2]:
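# quotechar is set to a character that does not occur in the tweets, so quotes inside them stay literal.
# engine="python" is requested explicitly because the multi-byte quotechar is not supported by the C parser.
df_train = pd.read_csv("../../data/osact4/OSACT2020-sharedTask-train.txt", sep="\t", quotechar='▁', header=None, names=["text", "offensive", "hate"], engine="python")
df_dev = pd.read_csv("../../data/osact4/OSACT2020-sharedTask-dev.txt", sep="\t", quotechar='▁', header=None, names=["text", "offensive", "hate"], engine="python")
df_test = pd.read_csv("../../private_datasets/offensive/tweets_v1.0.txt", sep="\t", quotechar='▁', header=None, names=["text"], engine="python")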

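The `quotechar` setting above appears intended to keep double quotes inside tweets from being interpreted as field quoting. The sketch below reproduces the idea on two made-up rows and compares it with `quoting=csv.QUOTE_NONE`, an alternative that is not used in the original notebook and is shown here only as an assumption about what would behave equivalently.

```python
import csv
import io

import pandas as pd

# Two made-up rows in the same tab-separated, three-column layout.
# Both text fields contain double quotes that must stay literal.
raw = 'he said "hello\tNOT_OFF\tNOT_HS\n"quoted" start\tOFF\tHS\n'
columns = ["text", "offensive", "hate"]

# Approach used in the notebook: a quote character that never occurs in the data.
df_rare = pd.read_csv(io.StringIO(raw), sep="\t", quotechar="▁",
                      header=None, names=columns, engine="python")

# Alternative (assumption, not the notebook's method): switch quote handling off.
df_none = pd.read_csv(io.StringIO(raw), sep="\t", quoting=csv.QUOTE_NONE,
                      header=None, names=columns)

print(df_rare["text"].tolist())
print(df_none["text"].tolist())
```

Both calls should leave the quote characters inside the `text` column untouched.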

Below we list the first five entries in the training data.

In [3]:
df_train.head()

| | text | offensive | hate |
|---|---|---|---|
| 0 | الحمدلله يارب فوز مهم يا زمالك.. كل الدعم ليكم... | NOT_OFF | NOT_HS |
| 1 | فدوه يا بخت فدوه يا زمن واحد منكم يجيبه | NOT_OFF | NOT_HS |
| 2 | RT @USER: يا رب يا واحد يا أحد بحق يوم الاحد ا... | OFF | HS |
| 3 | RT @USER: #هوا_الحرية يا وجع قلبي عليكي يا امي... | NOT_OFF | NOT_HS |
| 4 | يا بكون بحياتك الأهم يا إما ما بدي أكون 🎼 | NOT_OFF | NOT_HS |


Below we list the first five entries in the development data.

In [4]:
df_dev.head()

| | text | offensive | hate |
|---|---|---|---|
| 0 | فى حاجات مينفعش نلفت نظركوا ليها زى الاصول كده... | NOT_OFF | NOT_HS |
| 1 | RT @USER: وعيون تنادينا تحايل فينا و نقول يا ع... | NOT_OFF | NOT_HS |
| 2 | يا بلادي يا أم البلاد يا بلادي بحبك يا مصر بحب... | NOT_OFF | NOT_HS |
| 3 | RT @USER: يا رب يا قوي يا معين مدّني بالقوة و ... | NOT_OFF | NOT_HS |
| 4 | RT @USER: رحمك الله يا صدام يا بطل ومقدام. URL | NOT_OFF | NOT_HS |

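The previews above show only a handful of rows, so they say little about how the Task B labels are distributed. A minimal sketch of a label-balance check follows; `df_toy` is a made-up stand-in for `df_train`, and its counts say nothing about the real data.

```python
import pandas as pd

# Toy stand-in for df_train; the real frame is loaded from the shared-task file.
df_toy = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e"],
    "hate": ["NOT_HS", "NOT_HS", "HS", "NOT_HS", "NOT_HS"],
})

# Absolute and relative frequency of each Task B label.
counts = df_toy["hate"].value_counts()
shares = df_toy["hate"].value_counts(normalize=True)

print(counts.to_dict())
print(shares.round(2).to_dict())
```

Running the same two `value_counts` calls on `df_train["hate"]` and `df_dev["hate"]` would show whether plain accuracy is a meaningful metric for this task.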

Below we list the first five entries in the testing data.

In [5]:
df_test.head()

| | text |
|---|---|
| 0 | أود أن أعلمكم أن التعليق المنشور هنا باسم نور ... |
| 1 | مافيه فرق بين احمد جبريل والعاهره المستأجره |
| 2 | اذا نطق السفية فلا تجبة لانة سفية وقليل الادب ... |
| 3 | اعتقد حضرتك تدعو لمؤتمر دولى للحوار للسلمي مع ... |
| 4 | يسرني في المركز الموريتاني لقياس الراي العام ا... |


## Model Preparation

We start by setting TensorFlow's random seed, so that steps such as weight initialisation are repeatable:

In [6]:
tf.random.set_seed(123)

Next, we load the multilingual Universal Sentence Encoder. (Warning: this downloads and caches a large model of around 1 GB.)

In [7]:
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

Then we define the model input, a 512-dimensional sentence embedding, together with a placeholder for the label:

In [8]:
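# Each sentence is represented by its 512-dimensional encoder embedding.
sentence_input = keras.Input(shape=(512,), name='sentence')
# Placeholder for the binary label; it is not wired into the model built below.
label = keras.Input(shape=(1,), name='label')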

This is followed by defining the structure of the network:

In [9]:
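# Three fully connected tanh layers on top of the sentence embedding.
logits = keras.layers.Dense(512, activation=tf.nn.tanh)(sentence_input)
logits = keras.layers.Dense(512, activation=tf.nn.tanh)(logits)
logits = keras.layers.Dense(512, activation=tf.nn.tanh)(logits)
# Single sigmoid unit: despite the variable name, the result is a probability, not a raw logit.
logits = keras.layers.Dense(1, activation=tf.nn.sigmoid)(logits)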

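To make the shapes and size of this network concrete, the following NumPy sketch mirrors the same layer layout (512 inputs, three 512-unit tanh layers, one sigmoid output). The weights are random and untrained, and the helper names `dense` and `sigmoid` are illustrative only; this is not how Keras implements the layers internally.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(z):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def dense(x, w, b, activation):
    """One fully connected layer: activation(x @ w + b)."""
    return activation(x @ w + b)


# Layer sizes matching the cell above: 512 -> 512 -> 512 -> 512 -> 1.
sizes = [512, 512, 512, 512, 1]
# Random, untrained weights; used for shape illustration only.
params = [(rng.normal(scale=0.05, size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=(4, 512))      # batch of 4 fake sentence embeddings
h = x
for w, b in params[:-1]:
    h = dense(h, w, b, np.tanh)    # three tanh hidden layers
w_out, b_out = params[-1]
probs = dense(h, w_out, b_out, sigmoid)   # one sigmoid output per sentence

# Trainable parameters: 3 * (512*512 + 512) + (512 + 1) = 788481
n_params = sum(w.size + b.size for w, b in params)
print(probs.shape, n_params)
```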
Then we construct and compile the model:

In [10]:
model = keras.Model(inputs=sentence_input, outputs=logits)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

## Model Training

First we prepare the inputs and outputs to be fed to the model during training:

In [11]:
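# Embed each tweet separately and stack the per-tweet results into one
# (n_samples, 512) array.
X_train = np.concatenate(df_train["text"].apply(lambda x: embed(x)).to_numpy())
X_dev = np.concatenate(df_dev["text"].apply(lambda x: embed(x)).to_numpy())
# Encode the string labels as integers. LabelEncoder sorts the classes,
# so HS maps to 0 and NOT_HS maps to 1.
le = LabelEncoder()
le.fit(df_train["hate"])
Y_train = le.transform(df_train["hate"])
Y_dev = le.transform(df_dev["hate"])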

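The cell above calls the encoder once per tweet. Since the encoder also accepts a list of strings, embedding in batches is usually faster. The sketch below shows one way to do that; `embed_in_batches` is a hypothetical helper and `fake_embed` is a stand-in for `embed` so the example runs without downloading the model. With the real encoder, one would pass `df_train["text"].tolist()` and `embed` instead.

```python
import numpy as np


def embed_in_batches(texts, embed_fn, batch_size=64):
    """Embed `texts` chunk by chunk and stack the results into one 2-D array."""
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        chunks.append(np.asarray(embed_fn(batch)))
    return np.concatenate(chunks, axis=0)


def fake_embed(batch):
    """Hypothetical stand-in for `embed`: one 512-d vector per input string."""
    return np.array([[float(len(s))] * 512 for s in batch])


texts = ["one", "three", "seven", "eleven", "x"]
X = embed_in_batches(texts, fake_embed, batch_size=2)
print(X.shape)
```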
Next we fit the model, using the development set for validation:

In [12]:
model.fit(X_train, Y_train, epochs=5, batch_size=32, validation_data=(X_dev, Y_dev))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5



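The notebook imports `f1_score` but only reports accuracy during training. A minimal sketch of scoring development-set predictions with macro-averaged F1 follows, assuming that is the metric of interest; the arrays are made-up stand-ins for `Y_dev` and `model.predict(X_dev)`.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up gold labels and sigmoid outputs standing in for Y_dev and
# model.predict(X_dev); the (n, 1) shape mirrors the Keras output.
y_true = np.array([0, 0, 1, 1, 1, 1])
probs = np.array([[0.2], [0.7], [0.9], [0.8], [0.6], [0.95]])

# Threshold at 0.5, then flatten to the 1-D shape scikit-learn expects.
y_pred = (probs > 0.5).astype(int).ravel()

# Macro averaging weights both classes equally, regardless of their frequency.
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(round(macro_f1, 4))
```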
## Submission Preparation

We prepare the features for each test-set instance as follows:

In [13]:
X_test = embed(df_test["text"])

We predict the test labels by thresholding the sigmoid output at 0.5 (the test set has no gold labels, so no evaluation happens here):

In [14]:
predictions = (model.predict(X_test)>0.5).astype(int)

We prepare the predictions as a pandas DataFrame, mapping the integer predictions back to their string labels:

In [15]:
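# model.predict returns shape (n, 1); ravel() flattens it to the 1-D array LabelEncoder expects.
df_preds = pd.DataFrame(data=le.inverse_transform(predictions.ravel()), columns=["prediction"])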

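The path from sigmoid outputs back to string labels depends on how `LabelEncoder` ordered the classes. The sketch below, using made-up probabilities, shows that the classes are sorted alphabetically, so an output above 0.5 maps to `NOT_HS` rather than `HS`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Fit on the two Task B labels; LabelEncoder stores them in sorted order.
le = LabelEncoder().fit(["NOT_HS", "HS", "NOT_HS"])
print(le.classes_.tolist())   # index 0 is HS, index 1 is NOT_HS

# Made-up sigmoid outputs with the (n, 1) shape that model.predict returns.
probs = np.array([[0.91], [0.08], [0.55]])
predictions = (probs > 0.5).astype(int)

# Flatten before decoding so that a 1-D array is passed to inverse_transform.
labels = le.inverse_transform(predictions.ravel())
print(labels.tolist())
```

In other words, the model's output can be read as the estimated probability of the `NOT_HS` class.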


In [16]:
# Create the output directory if it does not exist yet, then write one label per line.
os.makedirs("predictions", exist_ok=True)
df_preds.to_csv("./predictions/hate.tsv", index=False, header=False, sep="\t")
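
Before submitting, it can help to read the file back and confirm it contains exactly one label per test instance. The sketch below does this in a temporary directory with a made-up `df_preds`; the one-label-per-line layout is inferred from the `to_csv` arguments above rather than from the shared-task instructions.

```python
import os
import tempfile

import pandas as pd

# Made-up predictions standing in for the real df_preds.
df_preds = pd.DataFrame({"prediction": ["NOT_HS", "HS", "NOT_HS"]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "predictions")
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, "hate.tsv")

    # Same arguments as the cell above: no index column and no header row.
    df_preds.to_csv(path, index=False, header=False, sep="\t")

    # Read the file back as plain text to inspect what was actually written.
    with open(path, encoding="utf-8") as fh:
        lines = fh.read().splitlines()

print(lines)
```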