# Hate Speech Analysis with LSTMs

In this notebook, we'll apply LSTM (long short-term memory) networks, a deep learning technique, to the task of hate speech analysis. 

# Framing Hate Speech Analysis as a Deep Learning Problem

Hate speech analysis involves taking in a tweet as a sequence of words and determining whether it is hate speech, offensive language, or neither. We can separate this into 5 different components.

    1) Training a word vector generation model (such as Word2Vec) or loading pretrained word vectors
    2) Creating an IDs matrix for our training set
    3) RNN (With LSTM units)
    4) Training 
    5) Testing

# Loading Data 

First, we want to create our word vectors. For simplicity, we're going to be using a pretrained model. 

As one of the biggest players in the ML game, Google was able to train a Word2Vec model on a massive Google News dataset that contained about 100 billion words! From that model, Google [was able to create 3 million word and phrase vectors](https://code.google.com/archive/p/word2vec/#Pre-trained_word_and_phrase_vectors), each with a dimensionality of 300. 

In an ideal scenario, we'd use those vectors, but since the word vectors matrix is quite large (3.6 GB!), we'll be using a much more manageable matrix that is trained using [GloVe](http://nlp.stanford.edu/projects/glove/), a similar word vector generation model. The matrix will contain 400,000 word vectors, each with a dimensionality of 50. 

We're going to be importing two different data structures: a Python list with the 400,000 words, and a 400,000 x 50 embedding matrix that holds all of the word vector values. 

In [83]:
import numpy as np
wordsList = np.load('./data/wordsList.npy')
print('Loaded the word list!')
wordsList = wordsList.tolist() #Originally loaded as numpy array
wordsList = [word.decode('UTF-8') for word in wordsList] #Decode bytes into UTF-8 strings
wordVectors = np.load('./data/wordVectors.npy')
print('Loaded the word vectors!')

Loaded the word list!
Loaded the word vectors!


Just to make sure everything has been loaded in correctly, we can look at the dimensions of the vocabulary list and the embedding matrix. 

In [84]:
print(len(wordsList))
print(wordVectors.shape)

400000
(400000, 50)


We can also search our word list for a word like "baseball", and then access its corresponding vector through the embedding matrix.

In [85]:
baseballIndex = wordsList.index('baseball')
wordVectors[baseballIndex]

array([-1.9327  ,  1.0421  , -0.78515 ,  0.91033 ,  0.22711 , -0.62158 ,
       -1.6493  ,  0.07686 , -0.5868  ,  0.058831,  0.35628 ,  0.68916 ,
       -0.50598 ,  0.70473 ,  1.2664  , -0.40031 , -0.020687,  0.80863 ,
       -0.90566 , -0.074054, -0.87675 , -0.6291  , -0.12685 ,  0.11524 ,
       -0.55685 , -1.6826  , -0.26291 ,  0.22632 ,  0.713   , -1.0828  ,
        2.1231  ,  0.49869 ,  0.066711, -0.48226 , -0.17897 ,  0.47699 ,
        0.16384 ,  0.16537 , -0.11506 , -0.15962 , -0.94926 , -0.42833 ,
       -0.59457 ,  1.3566  , -0.27506 ,  0.19918 , -0.36008 ,  0.55667 ,
       -0.70315 ,  0.17157 ], dtype=float32)

Now that we have our vectors, our first step is taking an input sentence and then constructing its vector representation. Let's say that we have the input sentence "I thought the movie was incredible and inspiring". In order to get the word vectors, we can use TensorFlow's embedding lookup function. This function takes in two arguments: the embedding matrix (the wordVectors matrix in our case) and the ids of each of the words. The ids vector can be thought of as the integerized representation of the training set: each id is simply the row index of the corresponding word in the embedding matrix. Let's look at a quick example to make this concrete. 

In [86]:
import tensorflow as tf

maxSeqLength = 10 #Maximum length of sentence
numDimensions = 50 #Dimensions for each word vector (matches the 50-dimensional GloVe matrix)
firstSentence = np.zeros((maxSeqLength), dtype='int32')
firstSentence[0] = wordsList.index("i")
firstSentence[1] = wordsList.index("thought")
firstSentence[2] = wordsList.index("the")
firstSentence[3] = wordsList.index("movie")
firstSentence[4] = wordsList.index("was")
firstSentence[5] = wordsList.index("incredible")
firstSentence[6] = wordsList.index("and")
firstSentence[7] = wordsList.index("inspiring")
#firstSentence[8] and firstSentence[9] are going to be 0
print(firstSentence.shape)
print(firstSentence) #Shows the row index for each word

(10,)
[    41    804 201534   1005     15   7446      5  13767      0      0]


The 10 x 50 output should contain the 50 dimensional word vectors for each of the 10 words in the sequence. 

In [87]:
with tf.compat.v1.Session() as sess:
    print(tf.nn.embedding_lookup(wordVectors,firstSentence).eval().shape)

(10, 50)
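
For a dense embedding matrix like this one, the lookup amounts to selecting rows by integer index. The following is a minimal NumPy sketch of that idea, not the TensorFlow implementation; `toy_vectors` and `toy_ids` are hypothetical stand-ins for `wordVectors` and `firstSentence`.

```python
import numpy as np

# Toy embedding matrix: 5 "words", 3 dimensions each (stand-in for wordVectors).
toy_vectors = np.arange(15, dtype=np.float32).reshape(5, 3)

# Integer ids for a sentence padded to length 4; unused positions stay at 0.
toy_ids = np.array([2, 4, 0, 0], dtype=np.int32)

# Integer-array indexing returns one embedding row per id, in order.
looked_up = toy_vectors[toy_ids]

print(looked_up.shape)  # (4, 3)
print(looked_up[0])     # [6. 7. 8.]
```

Note that padding positions with id 0 are not skipped: they receive whatever vector is stored in row 0 of the matrix.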


Let's create the ids matrix now. The code that generates it is commented out; instead, we load a pregenerated matrix from disk.

In [90]:
import re
import random
import datetime

import pandas as pd
import numpy as np
import tensorflow as tf

# Load data
df = pd.read_csv('./data/cleanedData.csv')
tweets = df.values

max_tweet_length = 30 # Max word count of tweet

#Words to tokens
ids = np.zeros((tweets.shape[0], max_tweet_length), dtype='int32')
#
#for i, tweet in enumerate(tweets):
#    text = tweet[6]
#
#    # Text cleaning
#    text = re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '', text)
#
#    split_text = text.split()
#
#    # Tokenize text
#    for j, word in enumerate(split_text):
#        try:
#           ids[i][j] = wordsList.index(word)
#        except ValueError:
#            ids[i][j] = 399999 # Vector for unknown words
#        if j == max_tweet_length - 1:
#            break
            
#np.save('./data/ids_matrix', ids)

ids = np.load('./data/ids_matrix.npy')


# Load GloVe's 50-dimensional pretrained word vectors; the model uses them
# in the next step to convert tokens to word vectors
word_vec_dimension = 50
word_vectors = np.load('./data/wordVectors.npy')

output_labels = []
for tweet in tweets:
    label = tweet[5]
    # One-hot encode the class id
    if label == 0: # Hate speech
        label = [1, 0, 0]
    elif label == 1: # Offensive language
        label = [0, 1, 0]
    elif label == 2: # Neither
        label = [0, 0, 1]
    output_labels.append(label)
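
The commented-out tokenization loop in the cell above calls `wordsList.index(word)` for every token, which scans the 400,000-entry list each time. As a hedged sketch of one alternative, the same logic can be written with a dictionary built once up front. The toy vocabulary and the names `toy_words`, `word_to_id`, `unk_id`, `max_len`, and `tokenize` are hypothetical and only illustrate the idea.

```python
import numpy as np

# Toy vocabulary standing in for wordsList; the last entry plays the role
# of the unknown-word slot (index 399999 in the notebook).
toy_words = ["rt", "mention", "you", "should", "unk"]
unk_id = len(toy_words) - 1
max_len = 6  # stand-in for max_tweet_length

# Build the word -> row index mapping once.
word_to_id = {}
for i, word in enumerate(toy_words):
    word_to_id.setdefault(word, i)  # keep the first occurrence, like list.index

def tokenize(text):
    """Return a fixed-length int32 row of ids, zero-padded and truncated."""
    row = np.zeros(max_len, dtype="int32")
    for j, word in enumerate(text.split()[:max_len]):
        row[j] = word_to_id.get(word, unk_id)
    return row

print(tokenize("rt mention you should always"))  # [0 1 2 3 4 0]
```

Each dictionary lookup takes roughly constant time, so tokenizing the whole dataset no longer depends on the vocabulary size per word.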

# RNN Model

Now, we’re ready to start creating our TensorFlow graph. We’ll first need to define some hyperparameters, such as batch size, number of LSTM units, number of output classes, and number of training iterations. 

In [92]:
#batchSize = 24
#lstmUnits = 64
#numClasses = 2
#iterations = 100000

# Model hyperparameters
output_classes = 3
lstm_units = 64

data_size = 19999
batch_size = 33
iterations = 100000

training_ids, training_labels = ids[:16000], output_labels[:16000]
testing_ids, testing_labels = ids[16000:], output_labels[16000:]

def get_train_batch():
    # Sample batch_size random training tweets (with replacement)
    arr = np.zeros([batch_size, max_tweet_length])
    labels = []

    for tweet in range(batch_size):
        num = random.randint(0, training_ids.shape[0]-1)

        arr[tweet] = training_ids[num]

        label = tweets[num][5]
        if label == 0: # Hate speech
            label = [1, 0, 0]
        if label == 1: # Offensive language
            label = [0, 1, 0]
        if label == 2: # Neither
            label = [0, 0, 1]
        labels.append(label)

    return arr, labels

#Check training data
print(len(training_ids))
print(training_ids.shape)
df.head(10)
print(tweets[0][5])
print(tweets[0][6])

16000
(16000, 30)
2
rt mention a a woman you shouldnt complain about cleaning up your house a a man you should always take the trash out
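
`get_train_batch` draws one random row at a time and rebuilds the one-hot label for each draw. A vectorized sketch of the same sampling is shown here on toy data; `sample_batch`, `toy_ids`, and `toy_classes` are hypothetical names, and the seed is there only to make the sketch repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# Toy stand-ins: 8 "tweets" of length 4, with class ids in {0, 1, 2}.
toy_ids = rng.integers(0, 100, size=(8, 4)).astype("int32")
toy_classes = np.array([0, 1, 2, 1, 1, 2, 0, 1])

def sample_batch(ids, classes, batch_size, num_classes=3):
    """Draw a random batch (with replacement) and one-hot encode its labels."""
    picks = rng.integers(0, ids.shape[0], size=batch_size)
    # Row k of the identity matrix is the one-hot vector for class k.
    one_hot = np.eye(num_classes, dtype="float32")[classes[picks]]
    return ids[picks], one_hot

batch, batch_labels = sample_batch(toy_ids, toy_classes, batch_size=5)
print(batch.shape, batch_labels.shape)  # (5, 4) (5, 3)
```

Indexing the identity matrix with the class ids yields the same `[1, 0, 0]`, `[0, 1, 0]`, `[0, 0, 1]` rows as the `if` chain.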


Set up the model

In [93]:
# Model
tf.compat.v1.reset_default_graph()
tf.compat.v1.disable_eager_execution()

input_data = tf.compat.v1.placeholder(tf.int32, [batch_size, max_tweet_length], name="Tweets")
labels = tf.compat.v1.placeholder(tf.float32, [batch_size, output_classes], name="Labels")

data = tf.Variable(tf.zeros([batch_size, max_tweet_length, word_vec_dimension]), dtype=tf.float32, name="InputData")
data = tf.nn.embedding_lookup(word_vectors, input_data)

lstm_cell = tf.compat.v1.nn.rnn_cell.BasicLSTMCell(lstm_units)
lstm_cell = tf.compat.v1.nn.rnn_cell.DropoutWrapper(cell=lstm_cell, output_keep_prob=0.75)
value, _ = tf.compat.v1.nn.dynamic_rnn(lstm_cell, data, dtype=tf.float32)

weight = tf.Variable(tf.compat.v1.truncated_normal([lstm_units, output_classes]), name="Weights")
bias = tf.Variable(tf.constant(0.1, shape=[output_classes]), name="Bias")
value = tf.transpose(value, [1, 0, 2])
last = tf.gather(value, int(value.get_shape()[0]) - 1)
prediction = (tf.matmul(last, weight) + bias)

correctPred = tf.equal(tf.argmax(prediction, 1), tf.argmax(labels, 1))
accuracy = tf.reduce_mean(tf.cast(correctPred, tf.float32))

loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=labels))
optimizer = tf.compat.v1.train.AdamOptimizer().minimize(loss)

sess = tf.compat.v1.InteractiveSession()
saver = tf.compat.v1.train.Saver()
sess.run(tf.compat.v1.global_variables_initializer())
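
Two steps in the model cell above can be checked on small arrays: the transpose-and-gather that picks the LSTM output at the final time step, and the softmax cross-entropy used as the loss. The NumPy sketch below illustrates both under toy inputs. It is not TensorFlow's implementation, and `softmax_cross_entropy` is a hypothetical helper written for this example.

```python
import numpy as np

# Toy RNN output with shape (batch, time, units), like `value` before the transpose.
value = np.arange(2 * 3 * 4, dtype=np.float32).reshape(2, 3, 4)

# Transpose to (time, batch, units) and take the final time step...
last_via_transpose = np.transpose(value, (1, 0, 2))[-1]
# ...which matches slicing the last step directly.
last_via_slice = value[:, -1, :]
assert np.array_equal(last_via_transpose, last_via_slice)

def softmax_cross_entropy(logits, one_hot_labels):
    """Mean cross-entropy between softmax(logits) and one-hot labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(one_hot_labels * log_probs).sum(axis=1).mean())

logits = np.array([[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]])
labels_one_hot = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
print(softmax_cross_entropy(logits, labels_one_hot))
```

Because tweets are zero-padded to `max_tweet_length`, the final time step can be a padding position for short tweets. `tf.compat.v1.nn.dynamic_rnn` accepts a `sequence_length` argument for that case, which this notebook does not use.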

Set up TensorBoard


In [94]:
# Tensorboard setup
tf.compat.v1.summary.scalar('Loss', loss)
tf.compat.v1.summary.scalar('Accuracy', accuracy)
merged = tf.compat.v1.summary.merge_all()
logdir = "./tensorboard/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S") + "/"
writer = tf.compat.v1.summary.FileWriter(logdir, sess.graph)

# Training

Train the model

In [95]:
# Model training
for i in range(iterations + 1):
    batch, batch_labels = get_train_batch()
    #print(len(batch))
    #print(batch.shape)
    sess.run(optimizer, {input_data: batch, labels: batch_labels})

    # Write summary to TensorBoard every 50 iterations
    if (i % 50 == 0):
        summary = sess.run(merged, {input_data: batch, labels: batch_labels})
        writer.add_summary(summary, i)

    # Save the network every 10,000 iterations
    if (i % 10000 == 0 and i != 0):
        save_path = saver.save(sess, "models/pretrained_lstm.ckpt", global_step=i)
        print("Saved to", save_path, "-", int(i/iterations*100), "%")
writer.close()

Saved to models/pretrained_lstm.ckpt-10000 - 10 %
Saved to models/pretrained_lstm.ckpt-20000 - 20 %
Saved to models/pretrained_lstm.ckpt-30000 - 30 %
Saved to models/pretrained_lstm.ckpt-40000 - 40 %
Saved to models/pretrained_lstm.ckpt-50000 - 50 %
Saved to models/pretrained_lstm.ckpt-60000 - 60 %
Saved to models/pretrained_lstm.ckpt-70000 - 70 %
Saved to models/pretrained_lstm.ckpt-80000 - 80 %
Saved to models/pretrained_lstm.ckpt-90000 - 90 %
Saved to models/pretrained_lstm.ckpt-100000 - 100 %


# Prediction

Check predictions 

In [96]:
# Model prediction
saver.restore(sess, tf.train.latest_checkpoint('models'))

def get_sentence_matrix(sentence):
    # Only row 0 carries the sentence; the other rows stay zero so the
    # matrix matches the fixed [batch_size, max_tweet_length] placeholder
    sentence_matrix = np.zeros([batch_size, max_tweet_length], dtype='int32')
    split = sentence.split()
    for i, word in enumerate(split):
        if (i < max_tweet_length):
            try:
                sentence_matrix[0, i] = wordsList.index(word)
            except ValueError:
                sentence_matrix[0, i] = 399999 #Vector for unknown words
    return sentence_matrix

correct = 0
for i in range(3998):
    input_matrix = get_sentence_matrix(tweets[16000 + i][6])
    predicted = sess.run(prediction, {input_data: input_matrix})[0]
    if (np.argmax(predicted) == tweets[16000 + i][5]):
        correct += 1

print("Accuracy on validation data:", correct/3998)

INFO:tensorflow:Restoring parameters from models/pretrained_lstm.ckpt-100000
Accuracy on validation data: 0.8881940970485243
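
A single accuracy figure does not show how each of the three classes fares. One way to break it down is a confusion matrix with per-class recall, sketched here on made-up labels. The label values are purely illustrative and are not results from this model, and `confusion_matrix` is a hypothetical helper.

```python
import numpy as np

def confusion_matrix(true_classes, predicted_classes, num_classes=3):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(true_classes, predicted_classes):
        matrix[t, p] += 1
    return matrix

# Made-up labels for illustration (0 = hate speech, 1 = offensive, 2 = neither).
y_true = [0, 1, 1, 1, 2, 2, 1, 0]
y_pred = [1, 1, 1, 1, 2, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
# Fraction of each true class that was predicted correctly; this assumes
# every class appears at least once in y_true.
recall = cm.diagonal() / cm.sum(axis=1)
print(cm)
print(recall)
```

In the prediction loop above, `tweets[16000 + i][5]` and `np.argmax(predicted)` would play the roles of `y_true` and `y_pred`.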


# Conclusion

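In this notebook, we went over a deep learning approach to hate speech analysis: we loaded pretrained GloVe word vectors, converted tweets into an ids matrix, trained an LSTM classifier over three classes (hate speech, offensive language, neither), and measured about 88.8% accuracy on the held-out tweets.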