# Soft Replication of Hemker (2018)

The goal of this notebook is to follow the methodology described in Hemker (2018) and replicate its results. The original source code is not available, which makes the task somewhat harder.

## Data Retrieval

In [215]:
# Source: Davidson et al. (2017)

import pandas as pd

df = pd.read_csv("./data/labeled_data.csv", index_col=0)
raw_tweets = df.tweet
raw_labels = df["class"].values

In [216]:
df.head()

Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet
0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...
1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...


## Data Preprocessing
---

### Noise Removal

In [277]:
raw_tweets[25291]

"you's a muthaf***in lie &#8220;@LifeAsKing: @20_Pearls @corey_emanuel right! His TL is trash &#8230;. Now, mine? Bible scriptures and hymns&#8221;"

In [284]:
# Source: Davidson et al. (2017)

import re
import html
from string import punctuation

def preprocess(text_string):
    
    # Casing should not make a difference in our case
    text_string = text_string.lower()
    
    # Regex
    html_pattern = r'(&(?:\#(?:(?:[0-9]+)|[Xx](?:[0-9A-Fa-f]+))|(?:[A-Za-z0-9]+));)'    
    # Raw strings avoid invalid-escape warnings for \s, \w and \(
    space_pattern = r'\s+'
    giant_url_regex = (r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|'
        r'[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    mention_regex = r'@[\w\-]+'
    hashtag_regex = r'#[\w\-]+'
    
    # First, add space surrounding HTML entities
    text_string = re.sub(html_pattern, r' \1 ', text_string)
    
    # Now, if we wish to find hashtags, we have to unescape HTML entities
    text_string = html.unescape(text_string)
    
    # From Udacity TV script generation project
    # Replace some punctuation by dedicated tokens
    symbol_to_token = {
        '.' : '||Period||',
        ',' : '||Comma||',
        '"' : '||Quotation_Mark||',
        ';' : '||Semicolon||',
        '!' : '||Exclamation_Mark||',
        '?' : '||Question_Mark||',
        '(' : '||Left_Parenthesis||',
        ')' : '||Right_Parenthesis||',
        '-' : '||Dash||',
        '\n' : '||Return||'
    }
    
    # Next, find URLs
    text_string = re.sub(giant_url_regex, ' URLHERE ', text_string)
    
    # Then, tokenize punctuation
    for key, token in symbol_to_token.items():
        text_string = text_string.replace(key, ' {} '.format(token))

    # Finally, replace hashtags and mentions, then collapse repeated whitespace
    text_string = re.sub(hashtag_regex, ' HASHTAGHERE ', text_string)
    text_string = re.sub(mention_regex, ' MENTIONHERE ', text_string)
    text_string = re.sub(space_pattern, ' ', text_string)
    
    return text_string

def _test_preprocess():
    
    assert " HASHTAGHERE " == preprocess("#iam1hashtag")
    assert " URLHERE " == preprocess("https://seminar.minerva.kgi.edu")
    assert " MENTIONHERE " == preprocess("@vinimiranda")
    assert ' ' == preprocess("        ")
    assert " & MENTIONHERE URLHERE HASHTAGHERE " == \
        preprocess("&amp;@vinimiranda    https://seminar.minerva.kgi.edu     #minerva    ")
    
_test_preprocess()

print("Example of a raw tweet:\n{}".format(raw_tweets[68]))
print("\nIts cleaned version is:\n{}".format(preprocess(raw_tweets[68])))

Example of a raw tweet:
"@Almightywayne__: @JetsAndASwisher @Gook____ bitch fuck u http://t.co/pXmGA68NC1" maybe you'll get better. Just http://t.co/TPreVwfq0S

Its cleaned version is:
 ||Quotation_Mark|| MENTIONHERE : MENTIONHERE MENTIONHERE bitch fuck u URLHERE ||Quotation_Mark|| maybe you'll get better ||Period|| just URLHERE 
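
The order of the steps in `preprocess` matters: URLs are replaced before punctuation is tokenized, otherwise the periods inside a URL would be turned into tokens and the URL pattern would only match a fragment. The sketch below illustrates this with simplified stand-ins (`URL_REGEX`, `SYMBOL_TO_TOKEN`, and both helper functions are assumptions for illustration, not the notebook's actual patterns).

```python
import re

# Simplified stand-ins for the notebook's patterns (assumptions for illustration)
URL_REGEX = r"https?://\S+"
SYMBOL_TO_TOKEN = {".": "||Period||", "!": "||Exclamation_Mark||"}

def tokenize_punctuation(text):
    """Replace each punctuation symbol by its dedicated token."""
    for symbol, token in SYMBOL_TO_TOKEN.items():
        text = text.replace(symbol, " {} ".format(token))
    return text

def urls_first(text):
    """Replace URLs, then tokenize punctuation (the order used in preprocess)."""
    text = re.sub(URL_REGEX, " URLHERE ", text)
    return tokenize_punctuation(text).split()

def punctuation_first(text):
    """Tokenize punctuation, then replace URLs (the problematic order)."""
    text = tokenize_punctuation(text)
    return re.sub(URL_REGEX, " URLHERE ", text).split()

sample = "look http://t.co/abc !"
print(urls_first(sample))
print(punctuation_first(sample))
```

With the punctuation-first order, the URL is split at its period and leaves a stray `co/abc` fragment in the token list.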


In [218]:
tweets = raw_tweets.map(preprocess)

### Sentiment Analysis

In [219]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer as VS

sentiment_analyzer = VS()

# Example
sentiment_analyzer.polarity_scores(tweets[68])

{'neg': 0.329, 'neu': 0.541, 'pos': 0.131, 'compound': -0.6597}

### Lookup Table



In [None]:
# From Udacity script project

def create_lookup_tables(text):
    """
    Create lookup tables for the vocabulary
    :param text: Iterable of word tokens (e.g. all tweets split into words)
    :return: A tuple of dicts (vocab_to_int, int_to_vocab)
    """
    # Assign one integer id to each unique word
    vocab_to_int = {word : ii for ii, word in enumerate(set(text))}
    # Invert the mapping to recover a word from its id
    int_to_vocab = {ii : word for word, ii in vocab_to_int.items()}
    
    return (vocab_to_int, int_to_vocab)

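def _test_lookup_tables():
    
    # Build the tables from a small sample so the test does not rely on globals
    vocab_to_int, int_to_vocab = create_lookup_tables("a small test a".split())
    
    # Make sure the dicts make the same lookup
    mismatches = [(word, idx, idx, int_to_vocab[idx]) for word, idx in vocab_to_int.items() if int_to_vocab[idx] != word]
    
    assert not mismatches,\
        'Found {} mismatch(es). First mismatch: vocab_to_int[{}] = {} and int_to_vocab[{}] = {}'.format(len(mismatches),
                                                                                                          *mismatches[0])

_test_lookup_tables()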
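
As a minimal sketch of how the two tables might be used, the snippet below encodes a tokenized tweet as integer ids and decodes it back. The `encode` and `decode` helpers are hypothetical, and the table builder is restated with `sorted` (an assumption, not part of the function above) so that the ids are reproducible across runs.

```python
def create_lookup_tables(words):
    """Same idea as above; sorted() is added here only for deterministic ids."""
    vocab_to_int = {word: ii for ii, word in enumerate(sorted(set(words)))}
    int_to_vocab = {ii: word for word, ii in vocab_to_int.items()}
    return vocab_to_int, int_to_vocab

def encode(tokens, vocab_to_int):
    """Map each token to its integer id (hypothetical helper)."""
    return [vocab_to_int[token] for token in tokens]

def decode(ids, int_to_vocab):
    """Map each integer id back to its token (hypothetical helper)."""
    return [int_to_vocab[ii] for ii in ids]

tokens = "maybe you'll get better ||Period|| just URLHERE".split()
vocab_to_int, int_to_vocab = create_lookup_tables(tokens)

ids = encode(tokens, vocab_to_int)
print(ids)
print(decode(ids, int_to_vocab))
```

Decoding the encoded ids returns the original token list, which is the property `_test_lookup_tables` is meant to check.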

In [275]:
vocab = set()
tweets.str.split().apply(vocab.update)
vocab

{'attacks',
 'mango',
 'platter',
 'bittersweet',
 "'lawlessness'",
 '😒💯',
 "95's",
 'muchyour',
 'ari',
 'drugssss',
 '48',
 'lahhh',
 'thotting',
 '😡😤👿🔪',
 'lt',
 'ninooo',
 'jd',
 'gloves',
 'angelou',
 'likin',
 'loudmouth',
 'dislikes',
 'gasoline',
 'fink',
 'describes',
 'nicely',
 'hugs',
 'provide',
 'convoluted',
 'cry😂😂😂',
 'i’d',
 'liberals',
 'bourne',
 '😭”',
 '180k',
 'bezel',
 'natalie',
 'youtube/vine/ig',
 '😭😂😂”',
 '8pm',
 'falcons',
 'dickheads',
 'tornado',
 'magician',
 '→',
 'niggass',
 'liver',
 'trippen',
 'mallett',
 'yahweh',
 'whoooooaaa',
 'tears',
 'HASHTAGHERE”😂😂',
 'uchida',
 'controllin',
 'taxpayer',
 'birkin',
 'fuckinHASHTAGHERE',
 'paranoia:',
 'engels',
 'marry',
 'secondly',
 'yuck',
 'feature',
 'benton',
 'openwrt',
 '😂😂😂😭😭😭',
 'trash',
 'no”i',
 'sub',
 'ratchetness',
 'daughters',
 'frozed',
 "barry's",
 'lifee',
 'she',
 'bait',
 'agrees',
 'first',
 'alll',
 'ny',
 'URLHERE”flawless',
 'hampshire',
 'watermelon',
 'standards',
 '83%',
 'oop',


### Hate Subclass Extraction

In [236]:
# Partly from https://stackoverflow.com/questions/31836058/nltk-named-entity-recognition-to-a-python-list
# I do not implement co-reference resolution since a single NE is sufficient for directed hate speech.
from nltk import sent_tokenize, word_tokenize, pos_tag, ne_chunk

def hate_classification(hate_tweet):
    '''Receives a hateful tweet. 
       Returns 3 for directed hate speech and 4 otherwise.'''
    
    # A mention means the tweet targets a specific user
    if "MENTIONHERE" in hate_tweet:
        return 3
    
    # Remove punctuation tokens since they would confuse the POS tagger
    token_regex = r'\|\|\w+\|\|'
    hate_tweet = re.sub(token_regex, "", hate_tweet)
    
    # URLHERE is considered a proper noun by the POS tagger.
    # Remove it before checking for named entities
    no_punct_hate = ''.join([char for char in hate_tweet if char not in punctuation])
    no_URL_hate = ' '.join([token for token in no_punct_hate.split() if token != "URLHERE"])
    for sent in sent_tokenize(no_URL_hate):
        for chunk in ne_chunk(pos_tag(word_tokenize(sent))):
            if hasattr(chunk, 'label'):
                return 3  # Named Entity found

    return 4
        
def _test_hate_classification():
    assert hate_classification("MENTIONHERE") == 3
    assert hate_classification("Karen is absolutely crazy") == 3
    assert hate_classification("Karen is his sister. She's absolutely crazy") == 3
    assert hate_classification("They should all be sent to Mexico") == 3
    assert hate_classification("They should all leave the country") == 4
    assert hate_classification("some hate speech stuff") == 4
    assert hate_classification("") == 4

_test_hate_classification()

In [237]:
hate_tweets = tweets[df["class"] == 0].values
_hate_prnt = lambda x : "Generalized" if hate_classification(x) == 4 else "Directed"

print("Example of a hateful tweet: \n{}".format(hate_tweets[20]))
print("Its type of hate speech is: {}\n".format(_hate_prnt(hate_tweets[20])))

print("Example of a hateful tweet:\n{}".format(hate_tweets[10]))
print("Its type of hate speech is: {}\n".format(_hate_prnt(hate_tweets[10])))

Example of a hateful tweet: 
 ||Quotation_Mark|| we're out here ||Comma|| and we're queer ||Exclamation_Mark|| ||Quotation_Mark|| ||Return|| ||Quotation_Mark|| 2 ||Comma|| 4 ||Comma|| 6 ||Comma|| hut ||Exclamation_Mark|| we like it in our butt ||Exclamation_Mark|| ||Quotation_Mark|| 
Its type of hate speech is: Generalized

Example of a hateful tweet:
 ||Quotation_Mark|| MENTIONHERE: jackies a retard HASHTAGHERE ||Quotation_Mark|| at least i can make a grilled cheese ||Exclamation_Mark|| 
Its type of hate speech is: Directed



In [195]:
# Change hate speech labels (0) to directed (3) / generalized labels (4) 
labels = raw_labels.copy()
for i, (tweet, label) in enumerate(zip(tweets, raw_labels)):
    
    if label == 0:  # If hate speech
        labels[i] = hate_classification(tweet)

def _test_labels():
    # Every hate speech label (0) must have been replaced by 3 or 4
    assert 0 not in pd.Series(labels).value_counts().index
    assert 3 in pd.Series(labels).value_counts().index
    assert 4 in pd.Series(labels).value_counts().index

_test_labels()

In [196]:
pd.Series(labels).value_counts()

1    19190
2     4163
3     1183
4      247
dtype: int64
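
The counts above are heavily imbalanced. The short sketch below recomputes each class's share of the dataset from those counts. The `label_names` mapping is an assumption based on the labels used in this notebook (1 and 2 from Davidson et al., 3 and 4 from `hate_classification`).

```python
# Counts copied from the value_counts() output above
counts = {1: 19190, 2: 4163, 3: 1183, 4: 247}

# Assumed meaning of each label in this notebook
label_names = {
    1: "offensive language",
    2: "neither",
    3: "directed hate speech",
    4: "generalized hate speech",
}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print("{:<25} {:>6.2%}".format(label_names[label], share))
```

Generalized hate speech accounts for roughly 1% of the tweets, while offensive language accounts for roughly 77%, which is worth keeping in mind when training and evaluating the network.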

## Build the Neural Network
---
### Check Access to GPU

In [198]:
import torch

# Check for a GPU
train_on_gpu = torch.cuda.is_available()
if not train_on_gpu:
    print('No GPU found. Please use a GPU to train your neural network.')

No GPU found. Please use a GPU to train your neural network.
