# Cleaning and preparing the data

All data for this project is sourced from: https://datahack.analyticsvidhya.com/contest/practice-problem-twitter-sentiment-analysis/. Inspiration and some code snippets for this project have been taken from:
- 'Deep Learning tutorials in jupyter notebooks.' - https://github.com/sachinruk/deepschool.io
- 'Stemming - Natural Language Processing With Python and NLTK p.3'- https://www.youtube.com/watch?v=yGKTphqxR9Q

To begin, we read our training data into a dataframe and briefly explore the data set.

In [1]:
import numpy as np
import pandas as pd
import re

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_csv('./train_E6oV3lV.csv')
df.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


In [2]:
df.shape

(31962, 3)

This is labelled training data. Each row is a 3-tuple (id, label, tweet). A label of 1 means the tweet is hate speech; a label of 0 means it is a normal tweet.

In [3]:
df.tail()

Unnamed: 0,id,label,tweet
31957,31958,0,ate @user isz that youuu?ðððððð...
31958,31959,0,to see nina turner on the airwaves trying to...
31959,31960,0,listening to sad songs on a monday morning otw...
31960,31961,1,"@user #sikh #temple vandalised in in #calgary,..."
31961,31962,0,thank you @user for you follow


#### The goal for this project is to build a model that will allow us to accurately classify an unlabelled test data set into the binary categories: 1 -> hate speech or 0 -> normal tweet.

In [4]:
# test data
test_df = pd.read_csv('./test_tweets_anuFYb8.csv')
test_df.head()

Unnamed: 0,id,tweet
0,31963,#studiolife #aislife #requires #passion #dedic...
1,31964,@user #white #supremacists want everyone to s...
2,31965,safe ways to heal your #acne!! #altwaystohe...
3,31966,is the hp and the cursed child book up for res...
4,31967,"3rd #bihday to my amazing, hilarious #nephew..."


In [5]:
test_df.shape

(17197, 2)

The first feature we wish to extract for building our model is the frequency of every word as it occurs in each tweet. This is known as the 'bag-of-words' method in Natural Language Processing (NLP). We can think of the bag-of-words as representing each tweet as a multiset of words, or alternatively as a vector of counts with one entry per word in the vocabulary. The sklearn.feature_extraction.text submodule provides an easy means of vectorizing our tweets. http://scikit-learn.org/stable/modules/feature_extraction.html#common-vectorizer-usage
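
To make the multiset and vector views concrete, here is a minimal sketch using only the standard library. The two tweets are made up for illustration, and the whitespace tokenisation is a simplification (CountVectorizer uses its own regex-based tokeniser), so treat this as a picture of the idea rather than of what sklearn does internally.

```python
from collections import Counter

# Two made-up tweets, used only for illustration.
tweets = ["love the land love the cavs", "the land of sad songs"]

# Each tweet as a multiset of words (naive whitespace tokenisation).
bags = [Counter(tweet.split()) for tweet in tweets]

# The vocabulary is the sorted set of all words seen in any tweet.
vocabulary = sorted(set(word for bag in bags for word in bag))

# Each tweet becomes a vector of counts, one entry per vocabulary word.
# A Counter returns 0 for words it has never seen.
vectors = [[bag[word] for word in vocabulary] for bag in bags]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Stacking these vectors row by row gives the kind of tweets-by-words count matrix that we build for the real data next.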

In [6]:
# convert collection of tweets to a matrix of frequency counts for each word
tf_vectorizer = CountVectorizer()
tf = tf_vectorizer.fit_transform(df.tweet.values)

print(type(tf))
tf.shape

<class 'scipy.sparse.csr.csr_matrix'>


(31962, 41392)

In [7]:
print('Number of unique words: ',len(tf_vectorizer.get_feature_names()))
print('First 10 word labels: ', tf_vectorizer.get_feature_names()[:10])

Number of unique words:  41392
First 10 word labels:  ['00', '000', '000001', '001', '0099', '00am', '00h30', '00pm', '01', '0115']


Looking at the first 10 words in our 'bag' we can see that there is a lot of noise in the data: many of the tokens are numbers or rare strings that carry little meaning. We would like to filter these out, along with very common words (known as stop words in NLP). We use the sklearn CountVectorizer() with stop_words='english' to filter out common English language words ('the', 'a', 'to', etc.) and with min_df=5 to drop any word that appears in fewer than 5 tweets.
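
The sketch below illustrates what a document-frequency cutoff means, using a made-up three-tweet corpus and a tiny hypothetical stop-word list. It is plain Python rather than sklearn, so it only mirrors the idea behind `min_df` and `stop_words`. Note that a word is counted once per tweet, however many times it occurs inside that tweet.

```python
from collections import Counter

# Made-up corpus and stop-word list, for illustration only.
corpus = [
    "the cavs won the game",
    "the cavs lost",
    "a sad game",
]
stop_words = {"the", "a"}  # hypothetical; sklearn's 'english' list is much longer
min_df = 2                 # keep words found in at least 2 documents

# Document frequency: count each word at most once per document.
doc_freq = Counter()
for doc in corpus:
    doc_freq.update(set(doc.split()))

kept = sorted(
    word for word, n_docs in doc_freq.items()
    if n_docs >= min_df and word not in stop_words
)
print(kept)
```

Here 'the' is frequent enough to pass the cutoff but is removed as a stop word, while 'won', 'lost' and 'sad' each occur in a single tweet and are dropped by the cutoff.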

In [8]:
tf_vectorizer = CountVectorizer(min_df=5,stop_words='english')
tf = tf_vectorizer.fit_transform(df.tweet.values)
print('New number of unique words: ',len(tf_vectorizer.get_feature_names()))
print(type(tf))
tf.shape

New number of unique words:  6019
<class 'scipy.sparse.csr.csr_matrix'>


(31962, 6019)

We have reduced the number of words in our vocabulary (the words whose frequencies we count for each tweet) from 41392 to 6019. Removing many of these uninformative words from the data will hopefully improve the accuracy of our model.

In [17]:
df

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation
5,6,0,[2/2] huge fan fare and big talking before the...
6,7,0,@user camping tomorrow @user @user @user @use...
7,8,0,the next school year is the year for exams.ð...
8,9,0,we won!!! love the land!!! #allin #cavs #champ...
9,10,0,@user @user welcome here ! i'm it's so #gr...


In [9]:
# shuffle the data, then split it into two halves: one for training, one held out for validation
idx = np.random.permutation(len(df))
X_train = tf[idx][:15981].todense()
X_test = tf[idx][15981:].todense()
y_train = df.label.values[idx][:15981]
y_test = df.label.values[idx][15981:]
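
The split above gives a different result on every run because the permutation is not seeded. As a hedged sketch of one way to make it repeatable, the example below uses a seeded NumPy generator on small stand-in arrays. The seed value and the toy data are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # the seed value 0 is an arbitrary choice

# Stand-in data: 10 rows of features and 10 labels (hypothetical values).
# Row i of X is [2*i, 2*i + 1], so rows and labels are easy to match up.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

idx = rng.permutation(len(X))   # one shuffled ordering of the row indices
half = len(X) // 2

# Index features and labels with the same idx so each row keeps its label.
X_train, X_test = X[idx[:half]], X[idx[half:]]
y_train, y_test = y[idx[:half]], y[idx[half:]]

print(X_train.shape, X_test.shape)
```

Slicing the index array first (`idx[:half]`) and then indexing once also avoids reordering the full matrix twice.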

In [15]:
X_train.shape

(15981, 6019)

In [16]:
X_train

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [1, 0, 0, ..., 0, 0, 0]])

We will use the Keras neural network API, with the TensorFlow backend, to train our model.

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.regularizers import l2, l1

In [None]:
model = Sequential()
model.add(Dense(units=100, activation='relu', input_dim=tf.shape[1]))
model.add(Dense(units=1, activation='sigmoid'))
# model.add(Activation("sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='adagrad', metrics=["binary_accuracy"])
model.summary()
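
As a rough check on what `model.summary()` should report, the number of trainable parameters in a fully connected layer is one weight per input-unit pair plus one bias per unit. The short sketch below works this out for the two layers above, assuming the 6019-word vocabulary obtained earlier. The helper name `dense_params` is made up for this example.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

vocab_size = 6019                        # tf.shape[1] after filtering
hidden = dense_params(vocab_size, 100)   # first Dense layer (relu)
output = dense_params(100, 1)            # second Dense layer (sigmoid)

print(hidden, output, hidden + output)   # 602000 101 602101
```

Almost all of the parameters sit in the first layer, because every vocabulary word has its own weight to each of the 100 hidden units.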

In [None]:
model.fit(X_train, y_train, epochs=2, batch_size=128)

In [None]:
y_test_pred = model.predict(X_test)

In [None]:
print(y_test_pred.shape)
y_test_pred

In [None]:
y_test_pred[y_test_pred<0.5] = 0
y_test_pred[y_test_pred>=0.5] = 1
np.count_nonzero(y_test_pred==y_test[:,None])*1.0/len(y_test)
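
The `y_test[:,None]` in the accuracy calculation matters because the predictions have shape (n, 1) while the labels have shape (n,). The sketch below, on made-up numbers, shows what goes wrong without it and an equivalent way to threshold without overwriting the predicted probabilities. The array values are hypothetical.

```python
import numpy as np

# Hypothetical predicted probabilities with shape (4, 1), one column for the
# single output unit, and true labels with shape (4,).
probs = np.array([[0.9], [0.2], [0.6], [0.1]])
labels = np.array([1, 0, 0, 0])

preds = (probs >= 0.5).astype(int)   # threshold without mutating probs

# Comparing (4, 1) with (4,) broadcasts to a (4, 4) grid, which is not wanted.
print((preds == labels).shape)       # (4, 4)

# Giving labels a trailing axis makes the comparison element-wise.
accuracy = np.mean(preds == labels[:, None])
print(accuracy)                      # 0.75
```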

In [None]:
test_case = tf_vectorizer.transform(["trump"])
model.predict(test_case.todense())

In [None]:
test_case = tf_vectorizer.transform(["fuck trump"])
model.predict(test_case.todense())

In [None]:
test_case = tf_vectorizer.transform(["I like pies"])
model.predict(test_case.todense())

In [None]:
test_case = tf_vectorizer.transform(["kill all men"])
model.predict(test_case.todense())

In [None]:
test_case = tf_vectorizer.transform(["kill all women"])
model.predict(test_case.todense())

In [None]:
plt.hist(model.get_weights()[0].ravel(),100)
plt.show()

In future stages we will try to improve the quality of our model.

Let's try cleaning the data set further with stemming (a data pre-processing step) to further reduce our vocabulary.

In [None]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()

example_words = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]

for w in example_words:
    print(ps.stem(w))