This notebook follows the training example given by Kaggle (https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words) to represent each document as a bag of words, and then uses scikit-learn's logistic regression to train a model.

The data in this initial test comes from Kaggle's labeled dataset of 50,000 IMDB movie reviews for sentiment analysis (https://www.kaggle.com/c/word2vec-nlp-tutorial/data). Only the labeled training file, labeledTrainData.tsv (25,000 reviews), is used here; it was split into 90% train and 10% test.

While Pythia will not be doing sentiment analysis, using this data was a quick and easy way to see how a bag-of-words + logistic regression could work for novelty detection.  All that is really needed for this type of analysis is a labels column (in this dataset that is sentiment) and text (which is review in this dataset).

In the future we will look at whether spaCy could substitute for BeautifulSoup and nltk.

In [5]:
import pandas as pd
from bs4 import BeautifulSoup
import nltk
import numpy as np
from sklearn import linear_model

In [6]:
data = pd.read_csv("labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)

In [7]:
# Simple split: ids with a 3 immediately before the "_" form the test set

is_test = data["id"].str.contains("3_")
train = data[~is_test]
test = data[is_test]

In [8]:
print train.shape, test.shape

(22500, 3) (2500, 3)


In [9]:
2500/25000.0 *100 #so split is perfect 10%

10.0
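
The split above keys on the substring `3_`, which occurs only when the part of the id before the underscore ends in 3. Here is a plain-Python sketch of the same rule. The first six ids are copied from the `head()` outputs in this notebook; `33_7` is a made-up id added for illustration.

```python
# Ids in the "<number>_<number>" shape of the dataset's id column.
# "33_7" is hypothetical; the others appear in the head() outputs.
ids = ["5814_8", "2381_9", "7759_3", "10633_1", "8713_10", "2203_3", "33_7"]

# A row goes to the test set when "3_" occurs in its id, i.e. when the
# digit immediately before the underscore is a 3.
test_ids = [i for i in ids if "3_" in i]
train_ids = [i for i in ids if "3_" not in i]

print(test_ids)
print(train_ids)
```

Note that `7759_3` stays in the training set: its 3 comes after the underscore, not before it. If the last digit before the underscore is spread evenly over 0 to 9, this rule selects about one row in ten, which matches the 10% measured above.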

In [10]:
train.head()

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."


In [11]:
test.head()

Unnamed: 0,id,sentiment,review
7,"""10633_1""",0,"""I watched this video at a friend's house. I'm..."
9,"""8713_10""",1,"""<br /><br />This movie is full of references...."
26,"""2203_3""",0,"""Note to all mad scientists everywhere: if you..."
41,"""4613_4""",0,"""Well then, what is it?! I found Nicholson's c..."
46,"""9983_3""",0,"""This film is a massive Yawn proving that Amer..."


In [12]:
sum(train["sentiment"])/22500.0 #training set is nicely half positive and half negative as well

0.5

In [None]:
# The tutorial walks through each step of the cleaning function to show what it does. The steps are...

In [14]:
# Initialize the BeautifulSoup object on a single movie review     
example1 = BeautifulSoup(train["review"][0], "html.parser")  # name the parser explicitly

# Print the raw review and then the output of get_text(), for 
# comparison
print train["review"][0]
print example1.get_text()

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

In [16]:
import re
# Use regular expressions to do a find-and-replace
letters_only = re.sub("[^a-zA-Z]",           # The pattern to search for
                      " ",                   # The pattern to replace it with
                      example1.get_text() )  # The text to search
print letters_only

 With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him The actual feature film bit when it finally starts is only on for    mi
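
A small sketch of what the `[^a-zA-Z]` substitution does to punctuation and digits, using a made-up sentence instead of a real review. Every character that is not a letter becomes a space, so contractions and possessives split into fragments.

```python
import re

# Hypothetical sample text, not taken from the dataset.
sample = "MJ's film, i've seen it 20 times!"

# Replace every non-letter character (apostrophes, digits, punctuation) with a space.
letters_only = re.sub("[^a-zA-Z]", " ", sample)

# Lower-case and split on whitespace; runs of spaces collapse in split().
tokens = letters_only.lower().split()
print(tokens)
```

This is why fragments such as `s` and `t` are worth having in a stopword list, and why a leftover `ve` shows up in the cleaned review further down.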

In [17]:
lower_case = letters_only.lower()        # Convert to lower case
words = lower_case.split()               # Split into words

In [18]:
# Some of the most common stopwords; normally these would come from a package such as nltk
stopwords  = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']

In [19]:
# The main function is review_to_words, which does all of the text cleaning and splitting

In [20]:
def review_to_words( raw_review ):
    # Function to convert a raw review to a string of words
    # The input is a single string (a raw movie review), and 
    # the output is a single string (a preprocessed movie review)
    #
    # 1. Remove HTML
    review_text = BeautifulSoup(raw_review, "html.parser").get_text() 
    #
    # 2. Remove non-letters        
    letters_only = re.sub("[^a-zA-Z]", " ", review_text) 
    #
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()                             
    #
    # 4. In Python, searching a set is much faster than searching
    #   a list, so convert the stop words to a set
    stops = set(stopwords)
    # 
    # 5. Remove stop words
    meaningful_words = [w for w in words if w not in stops]   
    #
    # 6. Join the words back into one string separated by space, 
    # and return the result.
    return( " ".join( meaningful_words ))   


In [21]:
clean_review = review_to_words( train["review"][0] )
print clean_review

stuff going moment mj ve started listening music watching odd documentary watched wiz watched moonwalker maybe want get certain insight guy thought really cool eighties maybe make mind whether guilty innocent moonwalker part biography part feature film remember going see cinema originally released subtle messages mj feeling towards press also obvious message drugs bad m kay visually impressive course michael jackson unless remotely like mj anyway going hate find boring may call mj egotist consenting making movie mj fans would say made fans true really nice actual feature film bit finally starts minutes excluding smooth criminal sequence joe pesci convincing psychopathic powerful drug lord wants mj dead bad beyond mj overheard plans nah joe pesci character ranted wanted people know supplying drugs etc dunno maybe hates mj music lots cool things like mj turning car robot whole speed demon sequence also director must patience saint came filming kiddy bad sequence usually directors hate wo

In [22]:
train_labels = train["sentiment"]
test_labels = test["sentiment"]

In [23]:
print "Cleaning and parsing the training set movie reviews...\n"
# Get the number of reviews based on the dataframe column size
num_reviews = data["review"].size

# Initialize empty lists to hold the clean reviews.
# The train and test frames kept the original row index when the data was split,
# so each index i is found in exactly one of them.
clean_train_reviews = []
clean_test_reviews = []
for i in xrange( 0, num_reviews ):
    # Print a progress message every 1000 reviews
    if( (i+1)%1000 == 0 ):
        print "Review %d of %d\n" % ( i+1, num_reviews )
    # Route the review by index membership rather than a bare except,
    # which would also hide real errors raised while cleaning
    if i in train.index:
        clean_train_reviews.append(review_to_words(train["review"][i] ))
    else:
        clean_test_reviews.append(review_to_words(test["review"][i] ))

Cleaning and parsing the training set movie reviews...

Review 1000 of 25000

Review 2000 of 25000

Review 3000 of 25000

Review 4000 of 25000

Review 5000 of 25000

Review 6000 of 25000

Review 7000 of 25000

Review 8000 of 25000

Review 9000 of 25000

Review 10000 of 25000

Review 11000 of 25000

Review 12000 of 25000

Review 13000 of 25000

Review 14000 of 25000

Review 15000 of 25000

Review 16000 of 25000

Review 17000 of 25000

Review 18000 of 25000

Review 19000 of 25000

Review 20000 of 25000

Review 21000 of 25000

Review 22000 of 25000

Review 23000 of 25000

Review 24000 of 25000

Review 25000 of 25000



In [24]:
print "Creating the bag of words...\n"
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.  
vectorizer = CountVectorizer(analyzer = "word",   \
                             tokenizer = None,    \
                             preprocessor = None, \
                             stop_words = None,   \
                             max_features = 5000) 

# fit_transform() does two functions: First, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of 
# strings.
train_data_features = vectorizer.fit_transform(clean_train_reviews)

# Numpy arrays are easy to work with, so convert the result to an 
# array
train_data_features = train_data_features.toarray()

Creating the bag of words...
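
To make the bag-of-words idea concrete, here is a plain-Python stand-in for the counting step. It is a sketch, not scikit-learn's implementation: the helper names `build_vocabulary` and `count_vector` and the toy documents are made up, and `CountVectorizer` handles details such as tokenization and tie-breaking differently.

```python
from collections import Counter

def build_vocabulary(documents, max_features):
    """Keep the max_features most frequent words, returned in sorted order."""
    totals = Counter(word for doc in documents for word in doc.split())
    kept = [word for word, _ in totals.most_common(max_features)]
    return sorted(kept)

def count_vector(document, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]

# Hypothetical cleaned reviews.
docs = ["good film good acting", "bad film", "good plot bad acting bad ending"]

vocab = build_vocabulary(docs, max_features=4)
vectors = [count_vector(doc, vocab) for doc in docs]

print(vocab)
for vector in vectors:
    print(vector)
```

Each document becomes a fixed-length row of counts. Words outside the capped vocabulary (`plot` and `ending` here) are dropped, which is what `max_features = 5000` does at a larger scale.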



In [25]:
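# Also put the test data into the same format.
# Use transform(), not fit_transform(): the test reviews must be counted against
# the vocabulary learned from the training set so that the columns line up with
# the features the model is trained on. (The outputs recorded below came from a
# run that called fit_transform() here, which refits the vocabulary on the test set.)
test_data_features = vectorizer.transform(clean_test_reviews)

# Numpy arrays are easy to work with, so convert the result to an 
# array
test_data_features = test_data_features.toarray()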
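
It matters a great deal whether the test reviews are counted against the vocabulary learned from the training set or against a vocabulary refit on the test set. The following plain-Python stand-in illustrates the difference. It is a sketch, not scikit-learn's code: `fit_vocabulary`, `transform`, and the toy documents are made up for illustration.

```python
def fit_vocabulary(documents):
    """Learn a sorted vocabulary from the given documents."""
    return sorted({word for doc in documents for word in doc.split()})

def transform(documents, vocabulary):
    """Count each vocabulary word per document; unseen words are ignored."""
    return [[doc.split().count(word) for word in vocabulary] for doc in documents]

# Hypothetical cleaned reviews.
train_docs = ["great film", "awful film"]
test_docs = ["great acting", "awful plot"]

train_vocab = fit_vocabulary(train_docs)
refit_vocab = fit_vocabulary(test_docs)

# Correct: test documents counted against the training vocabulary.
aligned = transform(test_docs, train_vocab)
# Incorrect: vocabulary refit on the test documents.
misaligned = transform(test_docs, refit_vocab)

print(train_vocab)
print(aligned)
print(refit_vocab)
print(misaligned)
```

With the refit vocabulary, column 0 means `acting` instead of `awful`, so any weights learned on the training columns would be applied to the wrong words. In this notebook both matrices have 5000 columns because of `max_features`, so a mismatch of this kind raises no shape error and can go unnoticed.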

Let us look at what the data now looks like

In [27]:
print train_data_features.shape

(22500, 5000)


In [28]:
print sum(train_data_features[0]), max(train_data_features[0]), train_data_features[0]

189 5 [0 0 0 ..., 0 0 0]


In [29]:
print sum(test_data_features[0]), max(test_data_features[0]), test_data_features[0]

54 6 [0 0 0 ..., 0 0 0]


Finally, use logistic regression to train a model and test it.

In [30]:
#try using the BagOfWords with the logistic regression
logreg = linear_model.LogisticRegression(C=1e5)

In [31]:
# Fit the logistic regression model to the training features and labels.
logreg.fit(train_data_features, train_labels)

LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
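
As a rough picture of what the fitted model does with a count vector, the sketch below scores a review with a weighted sum of its word counts passed through the logistic function. The weights, bias, and three-word vocabulary are made up for illustration and are not values learned in this notebook. In scikit-learn, `C` is the inverse of the regularization strength, so `C=1e5` applies very little regularization.

```python
import math

def predict_proba(counts, weights, bias):
    """Probability of the positive class for one bag-of-words count vector."""
    z = bias + sum(w * c for w, c in zip(weights, counts))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for the vocabulary ['awful', 'film', 'great'].
weights = [-2.0, 0.0, 2.0]
bias = 0.0

p_great_film = predict_proba([0, 1, 1], weights, bias)
p_awful_film = predict_proba([1, 1, 0], weights, bias)

print(p_great_film)
print(p_awful_film)
```

A probability above 0.5 is predicted as sentiment 1, and one below 0.5 as sentiment 0.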

In [32]:
#Now that we have something trained we can check if it is accurate with the test set

In [34]:
preds = logreg.predict(test_data_features)

In [35]:
# Because each label is 0 or 1, the mean absolute difference between predicted and
# actual labels is the fraction of test reviews misclassified (the error rate, not an RMSE)
error_rate = sum(abs(preds-test_labels))/float(len(test_labels)) 
print error_rate

0.494
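
For 0/1 labels, the mean absolute difference computed above equals the misclassification rate, and the root-mean-square error is its square root. A short check of that arithmetic, using made-up predictions and labels:

```python
import math

# Hypothetical predictions and true labels, both 0 or 1.
preds = [1, 0, 1, 1, 0, 0, 1, 0]
labels = [1, 0, 0, 1, 1, 0, 1, 1]
n = len(labels)

# Mean absolute difference: each wrong prediction contributes exactly 1.
error_rate = sum(abs(p - y) for p, y in zip(preds, labels)) / float(n)
accuracy = 1.0 - error_rate

# Squared differences are also 0 or 1, so the mean squared error equals the
# error rate and the RMSE is its square root.
rmse = math.sqrt(sum((p - y) ** 2 for p, y in zip(preds, labels)) / float(n))

print(error_rate)
print(accuracy)
print(rmse)
```

Read this way, the 0.494 printed above is an error rate, which corresponds to an accuracy of 0.506.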


In [36]:
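# Not a very good model: a 0.494 error rate is only ever so slightly better than random guessing. A result this close to chance is what you would expect if the test features were counted against a different vocabulary than the one the model was trained on (for example, by refitting the vectorizer on the test set), so the feature construction is the first thing to check.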

In [None]:
#Some additional data analysis of the vocabulary and model...

In [38]:
# Take a look at the words in the vocabulary
vocab = vectorizer.get_feature_names()
print vocab[:10]

[u'abandoned', u'abbott', u'abilities', u'ability', u'able', u'abraham', u'abruptly', u'absence', u'absolute', u'absolutely']


In [36]:
import numpy as np

# Sum up the counts of each vocabulary word
dist = np.sum(train_data_features, axis=0)

# For each, print the vocabulary word and the number of times it 
# appears in the training set
for tag, count in zip(vocab, dist):
    print count, tag

187 abandoned
125 abc
108 abilities
454 ability
1259 able
85 abraham
116 absence
83 absent
352 absolute
1485 absolutely
306 absurd
192 abuse
91 abusive
98 abysmal
297 academy
485 accent
203 accents
300 accept
130 acceptable
144 accepted
92 access
318 accident
200 accidentally
88 accompanied
124 accomplished
296 according
186 account
81 accuracy
284 accurate
123 accused
179 achieve
139 achieved
124 achievement
90 acid
971 across
1251 act
658 acted
6490 acting
3354 action
311 actions
83 activities
2389 actor
4486 actors
1219 actress
369 actresses
394 acts
793 actual
4237 actually
148 ad
302 adam
98 adams
453 adaptation
80 adaptations
154 adapted
810 add
439 added
166 adding
347 addition
337 adds
113 adequate
124 admire
621 admit
134 admittedly
101 adorable
510 adult
376 adults
100 advance
90 advanced
153 advantage
510 adventure
204 adventures
91 advertising
259 advice
90 advise
346 affair
93 affect
113 affected
104 afford
126 aforementioned
343 afraid
212 africa
255 african
187 afternoon