# Cleaning and preparing the data

To begin, we read our training data into a dataframe and briefly explore the data set.

In [5]:
import numpy as np
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

df = pd.read_csv('./train_E6oV3lV.csv')
df.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


In [6]:
df.shape

(31962, 3)

In [7]:
df.tail()

Unnamed: 0,id,label,tweet
31957,31958,0,ate @user isz that youuu?ðððððð...
31958,31959,0,to see nina turner on the airwaves trying to...
31959,31960,0,listening to sad songs on a monday morning otw...
31960,31961,1,"@user #sikh #temple vandalised in in #calgary,..."
31961,31962,0,thank you @user for you follow


The first feature we extract for our model is the frequency of each word in each tweet. In Natural Language Processing (NLP) this is known as the bag-of-words model. We can think of it as representing each tweet as a multiset of words or, equivalently, as a vector of word counts over a fixed vocabulary. The sklearn.feature_extraction.text submodule provides an easy way to vectorize our tweets: http://scikit-learn.org/stable/modules/feature_extraction.html#common-vectorizer-usage
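To make the multiset and vector views concrete, here is a minimal hand-rolled sketch using only the standard library. The two toy tweets are made up for illustration and are not taken from the training data; the whitespace tokenizer is also a simplifying assumption.

```python
from collections import Counter

# Toy corpus (hypothetical, not from the training data)
tweets = ["good morning good people", "morning people"]

# Multiset view: one Counter of word frequencies per tweet
bags = [Counter(tweet.split()) for tweet in tweets]

# Vector view: fix a sorted vocabulary, then read off the counts in that order
vocab = sorted(set(word for bag in bags for word in bag))
vectors = [[bag[word] for word in vocab] for bag in bags]

print(vocab)    # ['good', 'morning', 'people']
print(vectors)  # [[2, 1, 1], [0, 1, 1]]
```

CountVectorizer follows the same idea but, with its default settings, also lowercases the text, keeps only tokens of two or more alphanumeric characters, and returns a sparse matrix rather than nested lists.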

In [8]:
# convert collection of tweets to a matrix of frequency counts for each word
tf_vectorizer = CountVectorizer()
tf = tf_vectorizer.fit_transform(df.tweet.values)

print(type(tf))
tf.shape

<class 'scipy.sparse.csr.csr_matrix'>


(31962, 41392)

In [14]:
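# note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
feature_names = tf_vectorizer.get_feature_names()
print('Number of unique words: ', len(feature_names))
print('First 10 words: ', feature_names[:10])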

Number of unique words:  41392
First 10 words:  ['00', '000', '000001', '001', '0099', '00am', '00h30', '00pm', '01', '0115']


Looking at the first 10 words in our 'bag', we can see that there is a lot of noise in the data: tokens such as these numeric strings are unlikely to carry much information. At the other extreme are very common words that carry little meaning on their own (known as stop words in NLP). We use the sklearn CountVectorizer() to filter out common English words ('the', 'a', 'to' etc.) with stop_words='english' and, with min_df=5, any word that appears in fewer than 5 tweets.
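One detail that is easy to miss is that min_df counts documents (tweets), not total occurrences. The sketch below illustrates this on a made-up three-document corpus with min_df=2; the corpus and the threshold are assumptions chosen for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical): 'dog' occurs three times, but only in one document
docs = [
    "the cat sat",
    "the cat ran",
    "the dog dog dog",
]

# Drop English stop words and any word found in fewer than 2 documents
vec = CountVectorizer(min_df=2, stop_words="english")
X = vec.fit_transform(docs)

print(sorted(vec.vocabulary_))  # ['cat']
print(X.toarray().tolist())     # [[1], [1], [0]]
```

Here 'the' is removed as a stop word, 'dog' is dropped despite its three occurrences because it appears in a single document, and only 'cat' survives.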

In [35]:
tf_vectorizer = CountVectorizer(min_df=5,stop_words='english')
tf = tf_vectorizer.fit_transform(df.tweet.values)
print('New number of unique words: ',len(tf_vectorizer.get_feature_names()))
print(type(tf))
tf.shape

New number of unique words:  6019
<class 'scipy.sparse.csr.csr_matrix'>


(31962, 6019)

In [46]:
# shuffle the rows, then hold out the second half of the data as a test set
idx = np.random.permutation(len(df))
X_train = tf[idx][:15981].todense()
X_test = tf[idx][15981:].todense()
y_train = df.label.values[idx][:15981]
y_test = df.label.values[idx][15981:]

In [37]:
X_train.shape

(15981, 6019)
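As an alternative to shuffling by hand, scikit-learn's train_test_split can produce the same kind of 50/50 hold-out split. This is a sketch on synthetic arrays, not what the notebook itself does; the random_state and stratify arguments are additions that make the split reproducible and keep the label proportions equal in both halves.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix and labels (hypothetical data)
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)

# 50/50 split, reproducible via random_state, stratified on the labels
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)

print(X_tr.shape, X_te.shape)  # (10, 2) (10, 2)
print(y_tr.sum(), y_te.sum())  # 2 2
```

train_test_split also accepts scipy sparse matrices, so it could in principle be applied to tf directly before converting to dense arrays.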

In [13]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.regularizers import l2, l1

In [38]:
model = Sequential()
model.add(Dense(units=100, activation='relu', input_dim=tf.shape[1]))
model.add(Dense(units=1, activation='sigmoid'))
# model.add(Activation("sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='adagrad', metrics=["binary_accuracy"])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_3 (Dense)              (None, 100)               602000    
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 101       
Total params: 602,101
Trainable params: 602,101
Non-trainable params: 0
_________________________________________________________________


In [39]:
model.fit(X_train, y_train, epochs=2, batch_size=128)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f9955977550>

In [40]:
y_test_pred = model.predict(X_test)

In [41]:
print(y_test_pred.shape)
y_test_pred

(15981, 1)


array([[0.05348511],
       [0.10702812],
       [0.00030212],
       ...,
       [0.18077154],
       [0.00171096],
       [0.0020906 ]], dtype=float32)

In [42]:
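# threshold the predicted probabilities at 0.5 to obtain class labels
y_test_pred[y_test_pred<0.5] = 0
y_test_pred[y_test_pred>=0.5] = 1
# accuracy: fraction of test tweets whose predicted label matches the true label
np.count_nonzero(y_test_pred==y_test[:,None])*1.0/len(y_test)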

0.9565734309492523
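Accuracy on its own can be hard to interpret when one label is much more frequent than the other, so it may be worth comparing it with a majority-class baseline and looking at precision and recall as well. The sketch below uses small made-up arrays (not the notebook's predictions) and flat 1-D arrays rather than the (n, 1) shape returned by model.predict.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.05, 0.3, 0.6, 0.1, 0.2, 0.4, 0.9, 0.3])

# Threshold at 0.5 without modifying the probabilities in place
y_pred = (y_prob >= 0.5).astype(int)

accuracy = np.mean(y_pred == y_true)
# Accuracy of always predicting the most frequent label
baseline = max(np.mean(y_true == 0), np.mean(y_true == 1))

true_pos = np.sum((y_pred == 1) & (y_true == 1))
precision = true_pos / np.sum(y_pred == 1)
recall = true_pos / np.sum(y_true == 1)

print(accuracy, baseline)  # 0.8 0.8
print(precision, recall)   # 0.5 0.5
```

In this toy case the classifier is no more accurate than always predicting 0, which only becomes visible once the baseline, precision, and recall are computed alongside accuracy.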

In [43]:
test_case = tf_vectorizer.transform(["trump"])
model.predict(test_case.todense())

array([[0.6355145]], dtype=float32)

In [44]:
test_case = tf_vectorizer.transform(["fuck trump"])
model.predict(test_case.todense())

array([[0.7424171]], dtype=float32)

In [45]:
test_case = tf_vectorizer.transform(["I like pies"])
model.predict(test_case.todense())

array([[0.25506637]], dtype=float32)

In [47]:
test_case = tf_vectorizer.transform(["kill all men"])
model.predict(test_case.todense())

array([[0.3853892]], dtype=float32)

In [48]:
test_case = tf_vectorizer.transform(["kill all women"])
model.predict(test_case.todense())

array([[0.59832096]], dtype=float32)
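When probing the model with short phrases like these, it can help to check how many words actually survive vectorization: stop words and words outside the fitted vocabulary are dropped, and a phrase with no surviving words becomes an all-zero input vector. The sketch below shows this with a small vectorizer fitted on a made-up corpus, not the notebook's tf_vectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fit on a toy corpus (hypothetical); vocabulary is cats, chase, dogs, mice
vec = CountVectorizer(stop_words="english")
vec.fit(["cats chase mice", "dogs chase cats"])

known = vec.transform(["cats and dogs"])  # 'and' is a stop word
unknown = vec.transform(["I like pies"])  # no word is in the vocabulary

print(known.nnz)    # 2
print(unknown.nnz)  # 0
```

Running the same check on tf_vectorizer.transform([...]).nnz for each test phrase would show which predictions are based on recognized words and which reflect only the model's output for an empty input.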