# Soft Replication of Hemker (2018)

The goal of this notebook is to follow the methodology explained in Hemker (2018) to perform a replication of his results. Note that the source code is not available, rendering this task a bit harder.

### Data Retrieval

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("./data/labeled_data.csv", index_col=0)
raw_tweets = df.tweet
raw_labels = df["class"].values

In [2]:
df.head()

Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet
0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...
1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...


## Data Preprocessing
---

### Data Cleaning

In [3]:
import re
import html

def preprocess(text_string):
    
    # Casing should not make a difference in our case
    text_string = text_string.lower()
    
    # Regex
    html_pattern = r'(&(?:\#(?:(?:[0-9]+)|[Xx](?:[0-9A-Fa-f]+))|(?:[A-Za-z0-9]+));)'    
    space_pattern = '\s+'
    giant_url_regex = ('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|'
        '[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    mention_regex = '@[\w\-]+'
    hashtag_regex = '#[\w\-]+'
    
    # First, add space surrounding HTML entities
    text_string = re.sub(html_pattern, r' \1 ', text_string)
    
    # Now, if we wish to find hashtags, we have to unescape HTML entities
    text_string = html.unescape(text_string)
    
    # From Udacity TV script generation project
    # Replace some punctuation by dedicated tokens
    symbol_to_token = {
        '.' : '||Period||',
        ',' : '||Comma||',
        '"' : '||Quotation_Mark||',
        ';' : '||Semicolon||',
        '!' : '||Exclamation_Mark||',
        '?' : '||Question_Mark||',
        '(' : '||Left_Parenthesis||',
        ')' : '||Right_Parenthesis||',
        '-' : '||Dash||',
        '\n' : '||Return||'
    }
    
    # Next, find URLs
    text_string = re.sub(giant_url_regex, ' URLHERE ', text_string)
    
    # Then, tokenize punctuation
    for key, token in symbol_to_token.items():
        text_string = text_string.replace(key, ' {} '.format(token))

    # Finally, remove spaces and find mentions and hashtags
    text_string = re.sub(hashtag_regex, ' HASHTAGHERE ', text_string)
    text_string = re.sub(mention_regex, ' MENTIONHERE ', text_string)
    text_string = re.sub(space_pattern, ' ', text_string)
    
    return text_string

def _test_preprocess():
    
    assert " HASHTAGHERE " == preprocess("#iam1hashtag")
    assert " URLHERE " == preprocess("https://seminar.minerva.kgi.edu")
    assert " MENTIONHERE " == preprocess("@vinimiranda")
    assert ' ' == preprocess("        ")
    assert " & MENTIONHERE URLHERE HASHTAGHERE " == \
        preprocess("&amp;@vinimiranda    https://seminar.minerva.kgi.edu     #minerva    ")
    
_test_preprocess()

print("Example of a raw tweet:\n{}".format(raw_tweets[68]))
print("\nIts cleaned version is:\n{}".format(preprocess(raw_tweets[68])))

Example of a raw tweet:
"@Almightywayne__: @JetsAndASwisher @Gook____ bitch fuck u http://t.co/pXmGA68NC1" maybe you'll get better. Just http://t.co/TPreVwfq0S

Its cleaned version is:
 ||Quotation_Mark|| MENTIONHERE : MENTIONHERE MENTIONHERE bitch fuck u URLHERE ||Quotation_Mark|| maybe you'll get better ||Period|| just URLHERE 


In [4]:
tweets = raw_tweets.map(preprocess)

### Sentiment Analysis

In [5]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer as VS

sentiment_analyzer = VS()

# Example
sentiment_analyzer.polarity_scores(tweets[68])

{'neg': 0.329, 'neu': 0.541, 'pos': 0.131, 'compound': -0.6597}

### Checking for outliers

In [6]:
# Get cleaned tweets
df["clean_tweet"] = tweets

# Get their word count
df["word_count"] = df.clean_tweet.apply(lambda x : len(x.split()))

df.word_count.describe()

count    24783.000000
mean        16.729936
std          8.445555
min          1.000000
25%         10.000000
50%         16.000000
75%         23.000000
max         95.000000
Name: word_count, dtype: float64

In [7]:
# Check tweets with the minimum word count
df.loc[df.word_count == df.word_count.min(),]

Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet,clean_tweet,word_count
821,3,0,0,3,2,#Yankees,HASHTAGHERE,1
24147,3,0,3,0,1,bitches,bitches,1
24218,3,3,0,0,0,coons,coons,1
24869,3,0,3,0,1,pussy,pussy,1


Looks good. Let's check the tweet(s) with the maximum word count.

In [8]:
# Check tweets with the maximum length
df.loc[df.word_count == df.word_count.max(),]

Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet,clean_tweet,word_count
22953,3,0,0,3,2,Was finna slit my eyebrows up in the shop but ...,was finna slit my eyebrows up in the shop but ...,95


There's something strange going on. Let's check the tweet again.

In [9]:
df.loc[df.word_count == df.word_count.max(),].tweet.values

array(['Was finna slit my eyebrows up in the shop but nahhhhhh.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.\r\n.'],
      dtype=object)

The tweet contains a lot of new lines. It's hard to know why, but I'll choose to remove them.

In [10]:
old_tweet = df.loc[df.word_count == df.word_count.max(),].tweet.values[0]
new_tweet = old_tweet[:old_tweet.find("\r")]
df.loc[df.word_count == df.word_count.max(), "tweet"] = new_tweet
df.loc[df.word_count == df.word_count.max(), "clean_tweet"] = preprocess(new_tweet)
df.loc[df.word_count == df.word_count.max(), "word_count"] = len(preprocess(new_tweet).split())

Let's check it again.

In [11]:
# Check tweets with the maximum length
df.loc[df.word_count == df.word_count.max(),]

Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet,clean_tweet,word_count
18267,3,0,3,0,1,RT @TrxllLegend: One good girl is worth a thou...,rt MENTIONHERE : one good girl is worth a thou...,91


In [12]:
df.loc[df.word_count == df.word_count.max(),].clean_tweet.values[0]

'rt MENTIONHERE : one good girl is worth a thousand bitches ||Return|| ||Return|| 👰 = 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 👭 … '

Sighes. Well, format-wise it is okay.

### Lookup Tables



In [31]:
def create_lookup_tables(text):
    """
    Create lookup tables for vocabulary
    :param text: Tweets
    :return: A tuple of dicts (vocab_to_int, int_to_vocab)
    """
    
    # Generate vocabulary
    vocab = set()
    text.str.split().apply(vocab.update)
    
    # Generate lookup tables
    vocab_to_int = {word : ii for ii, word in enumerate(vocab, 1)}    
    int_to_vocab = {ii : word for word, ii in vocab_to_int.items()}
    
    # Add padding special word
    vocab_to_int['<PAD>'] = 0
    int_to_vocab[0] = '<PAD>'
    
    # return tuple
    return (vocab_to_int, int_to_vocab)

def _test_lookup_tables():
    
    text = pd.Series(["this is a toy", "I mean not really a toy", "I mean a toy vocabulary"])
    vocab_to_int, int_to_vocab = create_lookup_tables(text)
    
    # Make sure the dicts make the same lookup
    missmatches = [(word, id, id, int_to_vocab[id]) for word, id in vocab_to_int.items() if int_to_vocab[id] != word]
    
    assert not missmatches,\
        'Found {} missmatche(s). First missmatch: vocab_to_int[{}] = {} and int_to_vocab[{}] = {}'.format(len(missmatches),
                                                                                                          *missmatches[0])
    
_test_lookup_tables()

In [14]:
vocab_to_int, int_to_vocab = create_lookup_tables(tweets)

In [15]:
print("The size of the vocabulary is: {} tokens.".format(len(vocab_to_int)))
vocab = list(vocab_to_int.keys())
np.random.shuffle(vocab)
print("These are 10 randomly sample words in the vocabulary:\n{}".format(vocab[:10]))
del vocab

The size of the vocabulary is: 21134 tokens.
These are 10 randomly sample words in the vocabulary:
['fine', 'tht', 'bois', 'caleb', 'tigers', 'plasma', 'moanin', 'ducanville', 'lucricus', 'schoo']


###  Padding the Data

In [36]:
def create_pad_fn(max_length):
    
    def pad_tweets(tweet, max_length=max_length):
        # Do not cut tweet short if it's too long

        # Retrieve tweet word count
        word_count = len(tweet.split())

        # Check how much padding will be needed
        n = max_length - word_count if word_count < max_length else 0

        # Pad tweet
        padded_tweet = ''.join(['<PAD> '] * n + [tweet])

        return padded_tweet

    return pad_tweets

def pad_tweets(tweet, max_length=10):
    # Do not cut tweet short if it's too long

    # Retrieve tweet word count
    word_count = len(tweet.split())
    
    # Check how much padding will be needed
    n = max_length - word_count if word_count < max_length else 0

    # Pad tweet
    padded_tweet = ''.join(['<PAD> '] * n + [tweet])
   
    return padded_tweet

def _test_pad_tweets():
    
    assert pad_tweets('hi', 0) == 'hi'
    assert pad_tweets('hi', 1) == 'hi'
    assert pad_tweets('hi', 2) == '<PAD> hi'
    assert len(pad_tweets('hi', 10).split()) == 10
    assert len(pad_tweets('hi', 100).split()) == 100
    assert pad_tweets('this sentence is a bit longer', 1) == 'this sentence is a bit longer'
    
_test_pad_tweets()

In [38]:
MAX_LENGTH = df.word_count.max()
pad_tweets = create_pad_fn(MAX_LENGTH)
df["padded_tweets"] = df.clean_tweet.map(pad_tweets)
print(df.padded_tweets[10])

<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD>  ||Quotation_Mark|| keeks is a bitch she curves everyone ||Quotation_Mark|| lol i walked into a conversation like this ||Period|| smh


### Tokenizing the Data

In [18]:
tweets_ints = np.array([[vocab_to_int[word] for word in tweet.split()] for tweet in df.padded_tweets.values])
print(tweets_ints[10])

[    0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
  4186  6191 19261 19729 15454  2584 16026 20818  4186 14489 14937 13904
  4989 19729  8343 11852 17110 10422  3627]


### Hate Subclass Extraction

In [19]:
from nltk import sent_tokenize, word_tokenize, pos_tag, ne_chunk
from nltk.tokenize import SpaceTokenizer

def hate_classification(hate_tweet):
    '''Receives a hateful tweet. 
       Return 3 for directed hate speech and 4 otherwise.'''
    
    if bool(hate_tweet.count("MENTIONHERE")): return(3)
    
    # Remove tokens since they will oncused the POS tagger
    token_regex = '\|\|\w+\|\|'
    hate_tweet = re.sub(token_regex, "", hate_tweet)
    
    # URLHERE is considered a proper noun by the pos tagger.
    # Remove them before checking for proper nouns
    no_punct_hate = ''.join([char for char in hate_tweet if char not in punctuation])
    no_URL_hate = ' '.join([token for token in no_punct_hate.split() if token != "URLHERE"])
    has_NE = False
    for sent in sent_tokenize(no_URL_hate):
        for chunk in ne_chunk(pos_tag(word_tokenize(sent))):
            if hasattr(chunk, 'label'):
                return(3)  # Named Entity found    

    return(4)
        
def _test_hate_classification():
    assert hate_classification("MENTIONHERE") == 3
    assert hate_classification("Karen is absolutely crazy") == 3
    assert hate_classification("Karen is his sister. She's absolutely crazy") == 3
    assert hate_classification("They should all be sent to Mexico") == 3
    assert hate_classification("They should all leave the country") == 4
    assert hate_classification("some hate speech stuff") == 4
    assert hate_classification("") == 4

_test_hate_classification()

In [20]:
hate_tweets = tweets[df["class"] == 0].values
_hate_prnt = lambda x : "Generalized" if hate_classification(x) == 4 else "Directed"

print("Example of a hateful tweet: \n{}".format(hate_tweets[20]))
print("Its type of hate speech is: {}\n".format(_hate_prnt(hate_tweets[20])))

print("Example of a hateful tweet:\n{}".format(hate_tweets[10]))
print("Its type of hate speech is: {}\n".format(_hate_prnt(hate_tweets[10])))

Example of a hateful tweet: 
 ||Quotation_Mark|| we're out here ||Comma|| and we're queer ||Exclamation_Mark|| ||Quotation_Mark|| ||Return|| ||Quotation_Mark|| 2 ||Comma|| 4 ||Comma|| 6 ||Comma|| hut ||Exclamation_Mark|| we like it in our butt ||Exclamation_Mark|| ||Quotation_Mark|| 
Its type of hate speech is: Generalized

Example of a hateful tweet:
 ||Quotation_Mark|| MENTIONHERE : jackies a retard HASHTAGHERE ||Quotation_Mark|| at least i can make a grilled cheese ||Exclamation_Mark|| 
Its type of hate speech is: Directed



### Change hate labels

In [46]:
def change_hate_labels(tweets, raw_labels):
    ''' Change hate speech labels (0) to directed (3) / generalized labels (4) 
        Shifts class numbers to the left so that class labels start from zero.
        Returned labels:
        
            (0) : Offensive
            (1) : Neither
            (2) : Directed hate speech
            (3) : Generalized hate speech
    
    '''
    labels = raw_labels.copy()

    for i, (tweet, label) in enumerate(zip(tweets, raw_labels)):

        if label == 0:  # If hate speech
            labels[i] = hate_classification(tweet)
            
    return labels - 1 

def _test_hate_labels(tweets, raw_labels):
    labels = change_hate_labels(tweets, raw_labels)
    
    assert 4 not in pd.Series(labels).value_counts().index
    assert 2 in pd.Series(labels).value_counts().index
    assert 3 in pd.Series(labels).value_counts().index
    
_test_hate_labels(tweets, raw_labels)

In [47]:
# Getting the counts for each class
labels = change_hate_labels(tweets, raw_labels)
pd.Series(labels).value_counts()

0    19190
1     4163
2      954
3      476
dtype: int64

## Build the Neural Network
---
### Check Access to GPU

In [23]:
import torch

# Check for a GPU
train_on_gpu = torch.cuda.is_available()
if not train_on_gpu:
    print('No GPU found. Please use a GPU to train your neural network.')

### Creating the Training, Validation, and Test Sets

all from Udacity script generator project

In [24]:
from sklearn.utils import shuffle

tweets_ints, labels = shuffle(tweets_ints, labels)
split_frac = 0.8

## split data into training, validation, and test data (features and labels, x and y)
split_idx = int(tweets_ints.shape[0]*split_frac)
train_x, remaining_x = tweets_ints[:split_idx], tweets_ints[split_idx:]
train_y, remaining_y = labels[:split_idx], labels[split_idx:]

test_idx = int(len(remaining_x)*0.5)
val_x, test_x = remaining_x[:test_idx], remaining_x[test_idx:]
val_y, test_y = remaining_y[:test_idx], remaining_y[test_idx:]

## print out the shapes of your resultant feature data
print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape), 
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))

			Feature Shapes:
Train set: 		(19826, 91) 
Validation set: 	(2478, 91) 
Test set: 		(2479, 91)


In [25]:
import torch
from torch.utils.data import TensorDataset, DataLoader

# create Tensor datasets
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
valid_data = TensorDataset(torch.from_numpy(val_x), torch.from_numpy(val_y))
test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))

# dataloaders
batch_size = 64

# make sure the SHUFFLE your training data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)

In [26]:
# obtain one batch of training data
dataiter = iter(train_loader)
sample_x, sample_y = dataiter.next()

print('Sample input size: ', sample_x.size()) # batch_size, seq_length
print('Sample input: \n', sample_x)
print()
print('Sample label size: ', sample_y.size()) # batch_size
print('Sample label: \n', sample_y)

Sample input size:  torch.Size([64, 91])
Sample input: 
 tensor([[    0,     0,     0,  ..., 13463, 13936,  1626],
        [    0,     0,     0,  ...,  3130, 17484, 15454],
        [    0,     0,     0,  ..., 11999,  5941,  2691],
        ...,
        [    0,     0,     0,  ..., 15671, 18156,  6511],
        [    0,     0,     0,  ..., 18397, 18397, 18397],
        [    0,     0,     0,  ..., 10690,   742, 15454]], dtype=torch.int32)

Sample label size:  torch.Size([64])
Sample label: 
 tensor([0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 1, 2, 1, 0, 1, 0, 0, 0,
        1, 0, 2, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0])


### Define the Architecture

In [48]:
import gensim
import torch.nn as nn
import torch.nn.functional as F

class HateSpeechClassifier(nn.Module):

    def __init__(self, vocab_size, output_size, embedding_dim, cnn_params, pool_params,
                 hidden_dim, n_layers, dropout=0.5, pretrained_embed=False, vocab_to_int=None):
        """
        TO BE RESTATED
        Initialize the PyTorch RNN Module
        :param vocab_size: The number of input dimensions of the neural network (the size of the vocabulary)
        :param output_size: The number of output dimensions of the neural network
        :param embedding_dim: The size of embeddings, should you choose to use them
        :param cnn_params: A 4-element tuple containing the number 
            of feature maps, kernel size, stride and padding of a Conv1D layer. 
        :param pool_params: A 3-element tuple containing the kernel size, stride and padding of a MaxPool1D layer. 
        :param hidden_dim: The size of the hidden layer outputs
        :param dropout: dropout to add in between LSTM/GRU layers
        """
        super(HateSpeechClassifier, self).__init__()
       
        # set class variables
        self.output_size = output_size
        self.n_layers = n_layers
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.vocab_to_int = vocab_to_int
        
        # define model layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        if pretrained_embed:
            self.set_pretrained_weights()
            
        self.conv = nn.Conv1d(embedding_dim, *cnn_params)
        
        self.pool = nn.MaxPool1d(*pool_params)
        
        n_maps, _, _, _ = cnn_params
        self.lstm = nn.LSTM(n_maps, hidden_dim, n_layers, 
                            dropout=dropout, batch_first=True)
        
        self.dropout = nn.Dropout(0.2)
        
        self.fc = nn.Linear(hidden_dim, output_size)
    
    
    def forward(self, nn_input, hidden, test_print=False):
        """
        Forward propagation of the neural network
        :param nn_input: The input to the neural network
        :param hidden: The hidden state        
        :return: Two Tensors, the output of the neural network and the latest hidden state
        """
        # TODO: Implement function   
        batch_size = nn_input.size(0)

        # embeddings
        nn_input = nn_input.long()
        embeds = self.embedding(nn_input)
        
        # Change axes. embedding_dim (in_channels) should be in the middle
        # [batch_size, seq_length, embedding_dim] -> [batch_size, embedding_dim, seq_length]
        embeds_t = embeds.permute(0, 2, 1)
        
        # conv
        conv_out = self.conv(embeds_t)
        
        # pool
        pool_out = self.pool(F.relu(conv_out))
        
        # Change axes. lstm expects features to be the last channel
        # [batch_size, n_maps, down_sampled_seq] -> [batch_size, down_sampled_seq, n_maps]
        pool_out_t = pool_out.permute(0, 2, 1)
        
        # lstm
        lstm_out, hidden = self.lstm(pool_out_t, hidden)
    
        # stack up lstm outputs
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        
        # dropout and fully-connected layer
        # out = self.dropout(lstm_out)
        fc_out = self.fc(lstm_out)
        
        # reshape to be batch_size first
        fc_out_t = fc_out.view(batch_size, -1, self.output_size)  
        
        out = fc_out_t[:, -1] # get last batch of labels
        
        if test_print:
            print("nn_input.\nexpected : [batch_size, seq_length].\nshape: {}\n".format(nn_input.shape))
            print("embeds.\nexpected : [batch_size, seq_length, embedding_dim].\nshape: {}\n".format(embeds.shape))
            print("embeds_t.\nexpected : [batch_size, embedding_dim, seq_length].\nshape: {}\n".format(embeds_t.shape))
            print("conv_out.\nexpected : [batch_size, n_maps, seq_length].\nshape: {}\n".format(conv_out.shape))
            print("pool_out.\nexpected : [batch_size, n_maps, down_sampled_seq].\nshape: {}\n".format(pool_out.shape))
            print("pool_out_t.\nexpected : [batch_size, down_sampled_seq, n_maps].\nshape: {}\n".format(pool_out_t.shape))
            print("lstm_out.\nexpected : [batch_size, down_sampled_seq, hidden_dim].\nshape: {}\n".format(lstm_out.shape))
            print("lstm_out.\nexpected : [batch_size * down_sampled_seq, hidden_dim].\nshape: {}\n".format(lstm_out.shape))
            print("fc_out.\nexpected : [batch_size * down_sampled_seq, output_dim].\nshape: {}\n".format(fc_out.shape))
            print("fc_out_t.\nexpected : [batch_size, down_sampled_seq, output_dim].\nshape: {}\n".format(fc_out_t.shape))
            print("out.\nexpected : [batch_size, output_dim].\nshape: {}\n".format(out.shape))
                  
        # return one batch of output word scores and the hidden state
        return out, hidden
    
    def set_pretrained_weights(self, model_path="glove/glove.twitter.27B.200d.txt", pnt=True):
        
        if not hasattr(self, 'word2vec_model'):
            self.word2vec_model = gensim.models.KeyedVectors.load_word2vec_format(model_path)

        # Check whether the pretrained model has the correct dimensionality
        assert len(self.word2vec_model["human"]) == self.embedding_dim
        
        # Create the lookup table
        embedding_weights = np.zeros((len(self.vocab_to_int), self.embedding_dim))

        n = 0  # For each word in the dictionary
        for word, value in vocab_to_int.items():

            try:
                # Find its embeddings
                embedding_weights[value] = self.word2vec_model[word]

            except:
                # Or report that it's missing
                n += 1
            
        if pnt: print("{} words in the vocabulary have no pre-trained embedding.".format(n))

        device = "cuda:0" if train_on_gpu else "cpu"
        embedding_weights = torch.Tensor(embedding_weights).type(torch.FloatTensor).to(device)
        self.embedding.weight = nn.Parameter(embedding_weights)
    
    def init_hidden(self, batch_size):
        '''
        Initialize the hidden state of an LSTM/GRU
        :param batch_size: The batch_size of the hidden state
        :return: hidden state of dims (n_layers, batch_size, hidden_dim)
        '''
        # Implement function
        
        # initialize hidden state with zero weights, and move to GPU if available
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                  weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        
        return hidden

In [49]:
def _test_HateSpeechClassifier():
    batch_size = 20
    sequence_length = 14
    vocab_size = 3
    output_size= 4
    embedding_dim= 200
    hidden_dim = 12
    n_layers = 2
    cnn_params = (5, 3, 1, 1)
    pool_params = (2, 2, 0)
    vocab_to_int = {'banana' : 0, 'apple' : 1, 'orange' : 2}
    
    # Initialize model
    test_classifier = HateSpeechClassifier(vocab_size, output_size, embedding_dim, 
                                           cnn_params, pool_params, hidden_dim, n_layers,
                                           pretrained_embed=True, vocab_to_int=vocab_to_int)
    
    # create test input
    X_npy = np.random.randint(vocab_size, size=(batch_size, sequence_length))
    X = torch.from_numpy(X_npy)
    
    # Move to GPU if available
    if(train_on_gpu):
        test_classifier.cuda()
        X = X.cuda()
    
    # Compute
    hidden = test_classifier.init_hidden(batch_size)
    out, hidden_out = test_classifier(X, hidden)
    
    # Test output and hidden state shapes
    assert out.shape == (batch_size, output_size)
    assert hidden_out[0].size() == (n_layers, batch_size, hidden_dim)
    assert len(test_classifier.embedding.weight.data.shape) == 2
    assert test_classifier.embedding.weight.data.shape[0] == len(vocab_to_int)

    
_test_HateSpeechClassifier()

21132 words in the vocabulary have no pre-trained embedding.


### Implement Forward Pass and Back Propagation

In [None]:
def forward_back_prop(model, optimizer, criterion, inp, target, hidden, clip=5):
    """
    Forward and backward propagation on the neural network
    :param model: The PyTorch Module that holds the neural network
    :param optimizer: The PyTorch optimizer for the neural network
    :param criterion: The PyTorch loss function
    :param inp: A batch of input to the neural network
    :param target: The target output for the batch of input
    :return: The loss and the latest hidden state Tensor
    """
    
    batch_size = inp.size(0)
    target = target.type(torch.LongTensor)
    
    # move data to GPU, if available
    if train_on_gpu:
        inp, target = inp.cuda(), target.cuda()
    
    # Creating new variables for the hidden state, otherwise
    # we'd backprop through the entire training history
    hidden = tuple([each.data for each in hidden])
    
    # zero accumulated gradients
    model.zero_grad()
    
    # get the output from the model
    output, hidden = model(inp, hidden)
    
    # perform backpropagation and optimization
    # calculate the loss and perform backprop
    loss = criterion(output, target)
    
    try:
        loss.backward()
    
    except RuntimeError:
        fn = lambda x, y : print('{} : {}'.format(x, y.shape))
        fn('output', output)
        fn('target', target)
        fn('loss', loss.item())
    
    # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
    nn.utils.clip_grad_norm_(model.parameters(), clip)
    optimizer.step()
    
    # return the loss over a batch and the hidden state produced by our model
    return loss.item(), hidden



I need to understand this code better but I mean it works


In [None]:
from unittest.mock import MagicMock, patch

class _TestNN(torch.nn.Module):
    def __init__(self, input_size, output_size):
        super(_TestNN, self).__init__()
        self.decoder = torch.nn.Linear(input_size, output_size)
        self.forward_called = False
    
    def forward(self, nn_input, hidden):
        self.forward_called = True
        output = self.decoder(nn_input)
        
        return output, hidden

def _test_forward_back_prop(classifierNN, forward_back_prop, train_on_gpu):
    batch_size = 20
    sequence_length = 14
    input_size = 20
    output_size= 4
    embedding_dim= 16
    hidden_dim = 12
    n_layers = 2
    cnn_params = (5, 3, 1, 1)
    pool_params = (2, 2, 0)
    learning_rate = 0.01
    
    model = classifierNN(input_size, output_size, embedding_dim, 
                         cnn_params, pool_params, hidden_dim, n_layers)
    
    mock_decoder = MagicMock(wraps=_TestNN(input_size, output_size))
    if train_on_gpu:
        mock_decoder.cuda()
    
    mock_decoder_optimizer = MagicMock(wraps=torch.optim.Adam(mock_decoder.parameters(), lr=learning_rate))
    mock_criterion = MagicMock(wraps=torch.nn.CrossEntropyLoss())
    
    with patch.object(torch.autograd, 'backward', wraps=torch.autograd.backward) as mock_autograd_backward:
        inp = torch.FloatTensor(np.random.rand(batch_size, input_size))
        target = torch.LongTensor(np.random.randint(output_size, size=batch_size))
        
        hidden = model.init_hidden(batch_size)
        
        loss, hidden_out = forward_back_prop(mock_decoder, mock_decoder_optimizer, mock_criterion, inp, target, hidden)
        
    assert (hidden_out[0][0]==hidden[0][0]).sum()==batch_size*hidden_dim
    assert mock_decoder.zero_grad.called or mock_decoder_optimizer.zero_grad.called, 'Didn\'t set the gradients to 0.'
    assert mock_decoder.forward_called, 'Forward propagation not called.'
    assert mock_autograd_backward.called, 'Backward propagation not called'
    assert mock_decoder_optimizer.step.called, 'Optimization step not performed'
    assert type(loss) == float, 'Wrong return type. Expected {}, got {}'.format(float, type(loss))
    
_test_forward_back_prop(HateSpeechClassifier, forward_back_prop, train_on_gpu)

### Training Process 

In [None]:
def train_classifier(model, batch_size, optimizer, criterion, n_epochs, train_loader, valid_loader,
                     show_every_n_batches=10, try_load = False, save_path="model.pt"):
    
    # Load model previously trained if availabale
    if try_load:
        try:
            model.load_state_dict(torch.load(save_path))
            return model
        except:
            pass
    
    # n steps
    steps = 0
    
    # initialize tracker for minimum validation loss
    valid_loss_min = np.Inf  

    print("Training for %d epoch(s)..." % n_epochs)
    for epoch_i in range(1, n_epochs + 1):
        
        # initialize variables to monitor training loss
        train_loss = 0.0
        
        ###################
        # train the model #
        ###################
        
        # initialize hidden state
        hidden = model.init_hidden(batch_size)
        
        # Set model for training
        model.train()
        
        for batch_i, (inputs, labels) in enumerate(train_loader, 1):
            
            # make sure you iterate over completely full batches, only
            n_batches = len(train_loader.dataset)//batch_size
            if(batch_i > n_batches):
                break
            
            # forward, back prop
            loss, hidden = forward_back_prop(model, optimizer, criterion, inputs, labels, hidden)          
            
            # record loss
            train_loss += loss

            if batch_i % show_every_n_batches == 0:
                print("Epoch: {}/{}. \tBatch: {}/{}.\t Avg. Training Loss: {}".format(epoch_i,
                                                                                      n_epochs,
                                                                                      batch_i, 
                                                                                      len(train_loader), 
                                                                                      train_loss/batch_i))

        ######################    
        # validate the model #
        ######################
        
        valid_loss = 0.0
        correct = 0.0
        total = 0.0
        
        # Initialize hidden state
        valid_hidden = model.init_hidden(batch_size)
        
        # Set model for evaluation
        model.eval()
        
        for batch_i, (inputs, labels) in enumerate(valid_loader, 1):
            
            # make sure you iterate over completely full batches, only
            n_batches = len(valid_loader.dataset)//batch_size
            if(batch_i > n_batches):
                break
            
            labels = labels.type(torch.LongTensor)

            # move data to GPU, if available
            if train_on_gpu:
                inputs, labels = inputs.cuda(), labels.cuda()

            # Creating new variables for the hidden state
            valid_hidden = tuple([each.data for each in valid_hidden])

            # get the output from the model
            output, valid_hidden = model(inputs, valid_hidden)

            # calculate the loss 
            loss = criterion(output, labels)

            # update running validation loss 
            valid_loss += loss
            
            # convert output probabilities to predicted class
            pred = output.data.max(1, keepdim=True)[1]
            
            # compare predictions to true label
            correct += np.sum(np.squeeze(pred.eq(labels.data.view_as(pred))).cpu().numpy())
            total += inputs.size(0)

        # print training/validation statistics 
        # calculate average loss over an epoch
        train_loss = train_loss/len(train_loader)
        valid_loss = valid_loss/len(valid_loader)
        acc = 100. * correct / total
        
        # print validation statistics 
        print('Epoch: {} \tTraining Loss: {:.6f} \tValidation Loss: {:.6f} \t Accuracy: {:.6f}\n'.format(
            epoch_i, 
            train_loss,
            valid_loss,
            acc
            ))
        
        # save model if validation loss has decreased
        if valid_loss <= valid_loss_min:
            print('Validation loss decreased ({:.6f} --> {:.6f}).  Saving model ...\n'.format(
            valid_loss_min,
            valid_loss))
            torch.save(model.state_dict(), save_path)
            valid_loss_min = valid_loss
                
    # returns a trained classifier
    return model

### Hyperparameters

In [None]:
sequence_length = tweets_ints.shape[1]  # number of words in a sequence
num_epochs = 5
learning_rate = 0.0005
vocab_size = len(vocab_to_int)
output_size = pd.Series(labels).nunique()
embedding_dim = 200
hidden_dim = 256
batch_size = 64
n_layers = 2
show_every_n_batches = 50
cnn_params = (32, 25, 1, 4)
pool_params = (4, 4, 0)

### Instantiate the Model and Train the Network 

In [None]:
model = HateSpeechClassifier(vocab_size, output_size, embedding_dim, cnn_params, pool_params,
                             hidden_dim, n_layers, dropout=0.5, pretrained_embed=1, vocab_to_int=vocab_to_int)

In [None]:
# move to gpu if available    
if train_on_gpu:
    model.cuda()

In [None]:
# defining loss and optimization functions for training
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()

In [None]:
model.set_pretrained_weights()
model = train_classifier(model, batch_size, optimizer, criterion, num_epochs, train_loader, valid_loader,
                         show_every_n_batches=show_every_n_batches, save_path="model.pt")

### Testing

In [None]:
# Get test data loss and accuracy

test_loss = 0 # track loss
num_correct = 0
total = 0
y_pred, y_true = [], []

# init hidden state
test_hidden = model.init_hidden(batch_size)

model.eval()
# iterate over test data
for batch_i, (inputs, labels) in enumerate(test_loader, 1):

    # make sure you iterate over completely full batches, only
    n_batches = len(test_loader.dataset)//batch_size
    if(batch_i > n_batches):
        break
                
    # Creating new variables for the hidden state, otherwise
    # we'd backprop through the entire training history
    test_hidden = tuple([each.data for each in test_hidden])

    if(train_on_gpu):
        inputs, labels = inputs.cuda(), labels.cuda()
    
    # get predicted outputs
    output, test_hidden = model(inputs, test_hidden)
    
    # Accumulate loss
    test_loss += criterion(output, labels)
    
    # convert output probabilities to predicted class
    pred = output.data.max(1, keepdim=True)[1]
    
    # compare predictions to true label
    num_correct += np.sum(np.squeeze(pred.eq(labels.data.view_as(pred))).cpu().numpy())
    total += inputs.size(0)
    
    # Save prediction and labels
    y_pred += list(pred.squeeze().cpu().numpy())
    y_true += list(labels.data.cpu().numpy())


# -- stats! -- ##
# avg test loss
print("Test loss: {:.3f}".format(test_loss/len(test_loader)))

# accuracy over all test data
test_acc = num_correct/len(test_loader.dataset)
print("Test accuracy: {:.1f}%".format(100*test_acc))

In [None]:
# https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.utils.multiclass import unique_labels

class_names = np.array(["Offensive","Neither","Dir. Hate","Gen. Hate"])

def plot_confusion_matrix(y_true, y_pred, classes,
                          title=None, normalize=False,
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Only use the labels that appear in the data
    classes = classes[unique_labels(y_true, y_pred)]
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')
    
    bottom, top = ax.get_ylim()
    ax.set_ylim(bottom + 0.5, top - 0.5)

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax


np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plot_confusion_matrix(y_true, y_pred, classes=class_names,
                      title='Confusion matrix, without normalization')

# Plot normalized confusion matrix
plot_confusion_matrix(y_true, y_pred, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')

plt.show()