## Analyzing insults with Naive Bayes: pandas and sklearn

In [1]:
import numpy as np
import pandas as pd
import sklearn
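# sklearn.cross_validation and sklearn.grid_search were merged into
# sklearn.model_selection in scikit-learn 0.18 and later removed.
import sklearn.model_selection as cv
import sklearn.model_selection as gs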
import sklearn.feature_extraction.text as text
import sklearn.naive_bayes as nb
import matplotlib.pyplot as plt
%matplotlib inline




## Loading and preparing the data

Let's open the CSV file with `pandas`.

In [3]:
df = pd.read_csv("data/troll.csv")

In [4]:
df.head()

Unnamed: 0,Insult,Date,Comment
0,1,20120618192155Z,"""You fuck your dad."""
1,0,20120528192215Z,"""i really don't understand your point.\xa0 It ..."
2,0,,"""A\\xc2\\xa0majority of Canadians can and has ..."
3,0,,"""listen if you dont wanna get married to a man..."
4,0,20120619094753Z,"""C\xe1c b\u1ea1n xu\u1ed1ng \u0111\u01b0\u1edd..."


Each row is a comment. There are three columns: whether the comment is insulting (1) or not (0), the date, and the unicode-encoded contents of the comment.

In [5]:
df[['Insult', 'Comment']].tail()

Unnamed: 0,Insult,Comment
3942,1,"""you are both morons and that is never happening"""
3943,0,"""Many toolbars include spell check, like Yahoo..."
3944,0,"""@LambeauOrWrigley\xa0\xa0@K.Moss\xa0\nSioux F..."
3945,0,"""How about Felix? He is sure turning into one ..."
3946,0,"""You're all upset, defending this hipster band..."


Now we define the feature matrix $\mathbf{X}$ and the labels $\mathbf{y}$.

In [8]:
y = df['Insult']

We want to use one of the linear classifiers in `sklearn`,
but the learners in `sklearn` only work with numerical arrays. How do we convert text into a matrix of numbers?
As discussed in lecture and in our text,
obtaining the feature matrix from the text is not trivial.

The classical solution is to first extract a **vocabulary**: a list of words used throughout the corpus. Then, we can count, for each document in the sample, the frequency of each word. We end up with a **sparse matrix**: a huge matrix containing mostly zeros. Here, `sklearn` and `pandas` make it possible to do this in two lines. 

In [9]:
tf = text.TfidfVectorizer()
X = tf.fit_transform(df['Comment'])
print(X.shape)

(3947, 16469)


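The TFIDF vectorizer weights the count of each vocabulary item in each document
(the term frequency, TF) by how rare that item is across the whole corpus
(the inverse document frequency, IDF), so words that occur in almost every
comment receive a low significance score.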

This is a very popular significance measure first proved useful in
document retrieval.  It has some competitors in classification, but
let's try it out here because it's the easiest **feature weighting scheme**
to use in `sklearn`.
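
To make the weighting concrete, here is a minimal standard-library sketch of the formula that, as far as I know, `TfidfVectorizer` applies with its default settings: raw counts multiplied by a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. The function name `tfidf`, the whitespace tokenizer, and the three toy documents are assumptions made for illustration; `sklearn`'s own tokenizer differs (for example, it drops one-letter tokens).

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocab, rows); rows[i][j] is the TF-IDF weight of vocab[j] in docs[i]."""
    # Simplified tokenizer (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted(set(word for words in tokenized for word in words))
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each word.
    df = Counter(word for words in tokenized for word in set(words))
    # Smoothed inverse document frequency.
    idf = {word: math.log((1.0 + n_docs) / (1 + df[word])) + 1 for word in vocab}
    rows = []
    for words in tokenized:
        counts = Counter(words)
        raw = [counts[word] * idf[word] for word in vocab]
        # L2-normalise each row so document length does not dominate.
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

# Toy corpus, invented for illustration only.
docs = ["you are a moron", "you are right", "are you sure"]
vocab, rows = tfidf(docs)
for word, weight in zip(vocab, rows[0]):
    print('{0:>6}: {1:.3f}'.format(word, weight))
```

In the first toy document, *moron* ends up with a larger weight than *you*, because *you* appears in every document and *moron* in only one.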

There are 3947 comments and 16469 different words. Let's estimate the sparsity of this feature matrix.

In [10]:
print("Each sample has ~{0:.2%} non-zero features.".format(
          X.nnz / float(X.shape[0] * X.shape[1])))

Each sample has ~0.15% non-zero features.


A `TfidfVectorizer` instance stores its vocabulary, a dictionary mapping each word to its
column index in the feature matrix, in the attribute `vocabulary_` (note
the trailing underscore!):

In [11]:
tf.vocabulary_['moron']

8704

Our TFIDF matrix was stored in `X`:

In [12]:
X.shape

(3947, 16469)

The `sklearn` module stores many of its internally computed arrays as **sparse matrices**.  A sparse
matrix stores only the non-zero entries and their positions, so it does not waste memory on the
zeros that make up most of a very sparse matrix.  Natural language representations are often
**quite** sparse.  The 0.15% non-zero features figure we just looked at is typical.  Sparse matrices
come at a cost, however; although some computations can be done while the matrix is in sparse form,
some cannot, and to do those you have to convert the matrix to a non-sparse (dense) matrix, do what
you need to do, and then, probably, convert it back.  This is costly.  We're going to do it now,
but only because we're goofing around. Conversion to non-sparse format should in general be avoided
whenever possible.

In [13]:
XA = X.toarray()

Ok, now we can look at an arbitrary individual value, which is really all we wanted to do:

In [14]:
XA[3942][8704]

0.0

Didn't we just learn that the word *moron* occurs in this comment?  What's wrong?

In [15]:
df.iloc[3942]['Comment']

'"you are both morons and that is never happening"'

Oh, maybe we didn't learn that:

In [17]:
tf.vocabulary_['morons']

8707

In [18]:
XA[3942][8707]

0.5139224706716653
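
The single-value look-ups above do not strictly need a dense copy. The sketch below shows, in plain Python, the idea behind the compressed sparse row (CSR) layout that `scipy` sparse matrices commonly use: only non-zero values are kept, together with their column indices and a pointer to where each row starts. The helpers `to_csr` and `csr_get` and the small matrix are hypothetical and only illustrate the layout; they are not `scipy`'s implementation.

```python
def to_csr(dense_rows):
    """Convert a list of equal-length rows into CSR-style (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense_rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the non-zero value itself
                indices.append(col)   # the column it lives in
        indptr.append(len(data))      # where the next row starts in `data`
    return data, indices, indptr

def csr_get(csr, row, col):
    """Look up one entry by scanning only the non-zeros stored for that row."""
    data, indices, indptr = csr
    for k in range(indptr[row], indptr[row + 1]):
        if indices[k] == col:
            return data[k]
    return 0.0

# Toy matrix, invented for illustration only.
dense = [[0.0, 0.0, 0.5, 0.0],
         [0.0, 0.0, 0.0, 0.0],
         [0.7, 0.0, 0.0, 0.3]]
csr = to_csr(dense)
print(csr)
print(csr_get(csr, 2, 3), csr_get(csr, 1, 2))
```

With a real `scipy` CSR matrix, indexing such as `X[3942, 8707]` works directly on the sparse form, which would avoid the `toarray()` call for this kind of spot check.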

## Training

Now, we are going to train a classifier as usual. We first split the data into a train and test set.

In [19]:
(X_train, X_test,
 y_train, y_test) = cv.train_test_split(X, y,
                                        test_size=.2)

We use a **Bernoulli Naive Bayes classifier**.

In [20]:
bnb = nb.BernoulliNB()

bnb.fit(X_train, y_train);

In [21]:
bnb.score(X_test, y_test)

0.7569620253164557
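
For intuition about what a Bernoulli naive Bayes model does, here is a minimal from-scratch sketch. It is not `sklearn`'s implementation, and the function names and toy data are invented for illustration. It assumes, as `BernoulliNB` does by default (`binarize=0.0`), that any value above zero only means "the word is present", so the TFIDF weights themselves are ignored. Each class gets a smoothed probability that each word is present, and prediction adds up log probabilities for present and absent words.

```python
import math

def fit_bernoulli_nb(rows, labels, alpha=1.0):
    """Estimate a log prior and per-feature presence probabilities for each class."""
    n_features = len(rows[0])
    model = {}
    for cls in sorted(set(labels)):
        members = [row for row, label in zip(rows, labels) if label == cls]
        # Smoothed fraction of this class's documents that contain each feature.
        probs = [(sum(1 for row in members if row[j] > 0) + alpha)
                 / (len(members) + 2.0 * alpha)
                 for j in range(n_features)]
        model[cls] = (math.log(len(members) / float(len(rows))), probs)
    return model

def predict_bernoulli_nb(model, row):
    """Return the class with the highest log posterior for one feature row."""
    best_cls, best_score = None, float('-inf')
    for cls, (log_prior, probs) in model.items():
        score = log_prior
        for value, p in zip(row, probs):
            # Only presence or absence matters; the weight itself is ignored.
            score += math.log(p) if value > 0 else math.log(1.0 - p)
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls

# Toy data, invented for illustration. Columns: "idiot", "agree", "you".
rows = [[0.9, 0.0, 0.4], [0.7, 0.0, 0.7], [0.0, 0.8, 0.6], [0.0, 0.5, 0.0]]
labels = [1, 1, 0, 0]
model = fit_bernoulli_nb(rows, labels)
print(predict_bernoulli_nb(model, [0.2, 0.0, 0.3]))
print(predict_bernoulli_nb(model, [0.0, 0.1, 0.0]))
```

Note that absent words contribute to the score too, which is the main difference from a multinomial model.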

Now try re-executing the previous cells.  The results should be the same, right?

Well, are they?  

Ok, re-execute the same three cells again.  Now one more time.  Now try the following
piece of code:

In [23]:
# Use a name other than X so the feature matrix is not overwritten.
ratio = 7/29.
print('Hi, {0:.2}'.format(ratio))

Hi, 0.24


In [26]:
num_runs = 10
for test_run in range(num_runs):
    (X_train, X_test,
     y_train, y_test) = cv.train_test_split(X, y,
                                            test_size=.2)
    bnb = nb.BernoulliNB()
    bnb.fit(X_train, y_train)
    print('{0}'.format(bnb.score(X_test, y_test)))

What's happening?  How should we deal with this when we report our evaluations?

Explain the purpose of the code in the next cell.

In [28]:
num_runs = 100
total = 0
for test_run in range(num_runs):
    (X_train, X_test,
     y_train, y_test) = cv.train_test_split(X, y,
                                            test_size=0.2)
    bnb = nb.BernoulliNB()
    bnb.fit(X_train, y_train)
    score = bnb.score(X_test, y_test)
    total += score
print('{:.2%}'.format(total/num_runs))
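
One possible way to report a score that changes from split to split is to give the spread alongside the mean. The sketch below uses only the standard library. The helper `summarize` is hypothetical, and the `scores` list contains made-up placeholder values, not results from this data set.

```python
import statistics

def summarize(scores):
    """Return (mean, sample standard deviation) of a list of accuracy scores."""
    return statistics.mean(scores), statistics.stdev(scores)

# Placeholder accuracies, invented purely for illustration.
scores = [0.75, 0.77, 0.76, 0.74, 0.78]
mean, spread = summarize(scores)
print('{:.2%} +/- {:.2%}'.format(mean, spread))
```

If a single reproducible split is needed instead, `train_test_split` accepts a `random_state` argument that fixes the shuffle.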

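Let's take a look at the words corresponding to the largest coefficients. For a two-class naive Bayes model these coefficients are the log probabilities of each word appearing in an insulting comment, so the top of the list is dominated by words that occur in many insulting comments, including very common ones.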

In [27]:
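# We first get the words corresponding to each feature.
# (scikit-learn versions before 1.0 use tf.get_feature_names() instead.)
names = np.asarray(tf.get_feature_names_out())
# Next, we display the 50 words with the largest
# coefficients. For a two-class naive Bayes model, the older `coef_`
# attribute held the log probability of each feature given class 1 (insult).
coefficient_matrix = bnb.feature_log_prob_[1, :]
print(coefficient_matrix.shape)
# Sorting gives us smallest first, we reverse the order and take top 50
top_fifty_feat_indices = np.argsort(coefficient_matrix)[::-1][:50]
print(','.join(names[top_fifty_feat_indices]))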

(16469,)
you,are,your,the,to,and,of,that,is,it,in,like,on,have,not,for,re,just,xa0,so,all,an,idiot,what,this,with,be,fuck,get,do,don,go,up,or,as,can,stupid,but,about,know,who,no,if,ass,me,little,bitch,my,because,we


Finally, let's test our estimator on a few test sentences.


In [38]:
print(bnb.predict(tf.transform([
    "I totally agree with you.",
    "You are so stupid.",
    "I love you."
    ])))

[0 0 0]


Not very impressive.  The classifier did not label the sentence containing *stupid* as an insult.

> [IPython Cookbook](http://ipython-books.github.io/), by [Cyrille Rossant](http://cyrille.rossant.net), Packt Publishing, 2014 (500 pages).


## Homework

Read the online book draft chapter about the movie review data,
and try the classifier used there, an SVM, on this data.

Show your code, and print out the results.  Which classifier does better?