In this notebook we will implement a Long Short-term Memory (LSTM) network to classify texts.

The dataset we will use is Offensive Language Identification (OLID), where short texts in English are labeled for offensiveness. We focus on subtask A: binary classification of offensiveness.

In [1]:
!wget https://sites.google.com/site/offensevalsharedtask/olid/OLIDv1.0.zip
!unzip OLIDv1.0.zip

--2022-10-21 15:22:18--  https://sites.google.com/site/offensevalsharedtask/olid/OLIDv1.0.zip
Resolving sites.google.com (sites.google.com)... 209.85.145.139, 209.85.145.101, 209.85.145.102, ...
Connecting to sites.google.com (sites.google.com)|209.85.145.139|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://sites.google.com/site/offensevalsharedtask/olid/OLIDv1.0.zip?attredirects=0 [following]
--2022-10-21 15:22:18--  https://sites.google.com/site/offensevalsharedtask/olid/OLIDv1.0.zip?attredirects=0
Reusing existing connection to sites.google.com:443.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://ef80a887-a-62cb3a1a-s-sites.googlegroups.com/site/offensevalsharedtask/olid/OLIDv1.0.zip?attachauth=ANoY7cqEQR0eJ5TLdEsK4S1HzQw5bFGFDWy-0jht98Ph-6Q8mGDuWzno3ilDDXtWlo5Evm7FTuUwxwnaoPtcYIEaqxaqZj1BEJtqIPnfqS07Be3tEBGitu-JEuLBcEelaRVgKD6AQbJHTqClaV1TYKCyEd-vQmm7aDz66FGYfal_tS623Ld5T92DKYesJ21oZcOKVMBPe8rNZysr3On

In [2]:
import csv
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

data_train = []
labels_train = []

with open("olid-training-v1.0.tsv") as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        words = [word.lower() for word in word_tokenize(row["tweet"])]
        data_train.append(words)
        labels_train.append(row["subtask_a"])

data_test = []
labels_test = []
with open("testset-levela.tsv") as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        words = [word.lower() for word in word_tokenize(row["tweet"])]
        data_test.append(words)

with open("labels-levela.csv") as f:
    reader = csv.DictReader(f, fieldnames=["id", "label"])
    for row in reader:
        labels_test.append(row["label"])


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


We use Keras' tokenizer only to compute the vocabulary on the training set. Sentences are truncated at 100 tokens and padding is added for shortes sentences.

In [5]:
from keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import LabelEncoder

# transform the sentences into vectors
tokenizer = Tokenizer(filters='', lower=True, split=' ')
tokenizer.fit_on_texts(data_train)
word_index = tokenizer.word_index
X_train = tokenizer.texts_to_matrix(data_train)
X_train = pad_sequences(X_train, 100, padding='post', truncating='post')

# encode the labels
encoder = LabelEncoder()
encoder.fit(labels_train)
y_train = encoder.transform(labels_train)

# vectorize the test set
X_test = tokenizer.texts_to_matrix(data_test)
X_test = pad_sequences(X_test, 100, padding='post', truncating='post')
y_test = encoder.transform(labels_test)


Let's try to initialize the weights of the first layer with pre-trained embeddings from GloVe.

In [6]:
!wget https://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
import numpy as np

glove2word2vec("glove.6B.300d.txt", "glove_gensim.6B.300d.txt")
embedding_model=KeyedVectors.load_word2vec_format("glove_gensim.6B.300d.txt",binary=False)

--2022-10-21 15:28:56--  https://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2022-10-21 15:28:56--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2022-10-21 15:31:35 (5.18 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


In [7]:
embedding_matrix = np.zeros((len(word_index) + 1, 300))

for word, i in word_index.items():
    try:
        embedding_vector = embedding_model[word]
        embedding_matrix[i] = embedding_vector
    except:
        # words not found in embedding index will be all-zeros.
        continue


The neural network has a first layer where the embeddings are input. They are then concatenated by the Flatten layer and passed on a smaller fully connected hidden layer. The output layer is one neuron with sigmoid activation for binary classification (offensive/not offensive).

The embedding layer can be set to be trainable, or the weights can be kept frozen.

In [8]:
from keras.models import Sequential
from keras.layers import Embedding, Dense, LSTM, Dropout, Activation, Bidirectional
import tensorflow as tf
from sklearn.utils import class_weight

model = Sequential()
model.add(Embedding(len(word_index)+1, 300, input_shape=(100,), weights=[embedding_matrix], trainable=True))
model.add(Bidirectional(LSTM(16)))
model.add(Dropout(0.25))
model.add(Dense(1, activation="sigmoid"))

model.compile(loss="binary_crossentropy",
                  optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  metrics=["accuracy"])
model.summary()

class_weights = class_weight.compute_class_weight(
        'balanced',classes=np.unique(y_train),y= y_train)
class_weights = dict(enumerate(class_weights))

history = model.fit(X_train, y_train,
                        batch_size=16,
                        epochs=5,
                        shuffle=True,
                        validation_split=0.1,
                        class_weight=class_weights,
                        verbose=1
                        )

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 100, 300)          6444300   
                                                                 
 bidirectional (Bidirectiona  (None, 32)               40576     
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 32)                0         
                                                                 
 dense (Dense)               (None, 1)                 33        
                                                                 
Total params: 6,484,909
Trainable params: 6,484,909
Non-trainable params: 0
_________________________________________________________________
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [9]:
from sklearn.metrics import classification_report

pred = [int(x>=0.5) for x in model.predict(X_test)]
print (classification_report(y_test, pred))

              precision    recall  f1-score   support

           0       0.70      0.73      0.72       620
           1       0.23      0.21      0.22       240

    accuracy                           0.58       860
   macro avg       0.47      0.47      0.47       860
weighted avg       0.57      0.58      0.58       860

