### Tutorial #1: Bag of Words

In [73]:
import pandas as pd
from bs4 import BeautifulSoup
import re
import nltk
import numpy as np
from collections import Counter

path="/home/sophie/projects/kaggleBOW/"

# quoting=3 (csv.QUOTE_NONE) tells pandas to treat double quotes as ordinary characters
train = pd.read_csv("%slabeledTrainData.tsv"%path, header = 0, delimiter = "\t", quoting = 3)
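The `quoting = 3` argument in the cell above corresponds to the standard library constant `csv.QUOTE_NONE`. As a minimal sketch of its effect, using only the `csv` module and a made-up two-line TSV string (the sample text is hypothetical, not taken from the data set), quote characters are kept as literal text rather than being interpreted as field delimiters:

```python
import csv
import io

# Hypothetical two-line TSV; the review field contains literal double quotes.
sample = 'id\tsentiment\treview\n"1_8"\t1\t"He said "wow" twice"\n'

# csv.QUOTE_NONE is the named constant behind the magic number 3.
quote_none_value = csv.QUOTE_NONE

# With QUOTE_NONE the reader does no special processing of quote characters.
rows = list(csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE))
header, first_row = rows

print(quote_none_value)
print(header)
print(first_row)
```

This is why the raw review text printed later still begins with a double quote character.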

In [74]:
train.shape

(25000, 3)

In [75]:
train.columns.values

array(['id', 'sentiment', 'review'], dtype=object)

### Data Cleaning and Text Preprocessing

Remove HTML Markup with BeautifulSoup

In [76]:
# Initialize a BeautifulSoup object on a single movie review.
# Name the parser explicitly ("lxml"), as bs4 recommends.
example1 = BeautifulSoup(train["review"][0], "lxml")

# get_text() gives you the text of the review, with no tags or markup.
# Removing markup with regular expressions is not considered reliable practice,
# which is why BeautifulSoup is used here instead.
print(example1.get_text()[0:100])  # first 100 characters

"With all this stuff going down at the moment with MJ i've started listening to his music, watching 






We will use the `re` package to remove punctuation and numbers.

In [77]:
# Inside [], a leading ^ means "not": match any character that is not a letter
letters_only = re.sub("[^a-zA-Z]", " ", example1.get_text())
print(letters_only[0:100])

 With all this stuff going down at the moment with MJ i ve started listening to his music  watching 


Now we will convert the review to lower case and split it into individual words ("tokenization").

In [78]:
lower_case = letters_only.lower() 
words = lower_case.split() 
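The letter filtering and tokenization steps can be tried in isolation. The following is a small sketch on a made-up sentence (the sample text is an assumption, not a review from the data set); it also shows why contractions such as "i've" end up split into two tokens:

```python
import re

# Hypothetical snippet standing in for one review after HTML removal.
text = "It's a 10/10 film, with GREAT acting!"

# Replace every character that is not a letter with a space.
letters_only = re.sub("[^a-zA-Z]", " ", text)

# Lowercase, then split on runs of whitespace to get the tokens.
tokens = letters_only.lower().split()

print(tokens)
```

Because `split()` with no arguments collapses consecutive whitespace, the runs of spaces left behind by removed digits and punctuation do not produce empty tokens.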

Use the NLTK library to remove stop words ("a", "is", "and", etc.).

In [79]:
# nltk.download()  # uncomment on first run to download text data sets, including stop words

In [80]:
from nltk.corpus import stopwords # import the stop word list
print(stopwords.words("english")[0:10])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your']


In [81]:
# remove stop words from "words"; build the set once so each lookup is fast
stops = set(stopwords.words("english"))
words = [w for w in words if w not in stops]
print(words[0:10])

['stuff', 'going', 'moment', 'mj', 'started', 'listening', 'music', 'watching', 'odd', 'documentary']


There is more we could do here. For example, NLTK supports "stemming" (such as the Porter stemmer), which lets us treat "message", "messaging" and "messages" as the same word.
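As an illustration only (stemming is not used in the rest of this notebook), here is a minimal sketch with NLTK's `PorterStemmer`, which needs no extra corpus download. The variable names are chosen for this example:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# The three word forms mentioned above.
forms = ["message", "messaging", "messages"]

# Reduce each form to its stem.
stems = [stemmer.stem(word) for word in forms]

print(stems)
```

All three forms should reduce to one shared stem. Note that a Porter stem is not necessarily a dictionary word, which matters if the vocabulary is meant to be read by people.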

### Pulling it all together

Now to make some reusable code to clean all 25,000 training reviews.

In [82]:
def review_to_words(raw_review):
    """Convert a raw review into a string of words.
    The input is a single string (a raw movie review)
    and the output is a single string (a preprocessed
    movie review)."""
    # remove HTML; the parser is named explicitly, as bs4 recommends
    review_text = BeautifulSoup(raw_review, "lxml").get_text()

    # remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)

    # convert to lower case and split into words
    words = letters_only.lower().split()

    # convert stopwords into a set, since set lookups are much faster than list lookups
    stops = set(stopwords.words("english"))

    # remove stop words
    meaningful_words = [w for w in words if w not in stops]

    # join the words back into one string, separated by spaces
    return " ".join(meaningful_words)

In [83]:
# try a call on a single review
print(review_to_words(train["review"][0])[0:100])

stuff going moment mj started listening music watching odd documentary watched wiz watched moonwalke






Now we will loop through and clean all of the training set at once:

In [84]:
# Get the number of reviews based on the dataframe column size
num_reviews = train["review"].size

# Initialize an empty list to hold the clean reviews
clean_train_reviews = []

# Loop over each review; the index i goes from 0 to num_reviews - 1
print("Cleaning and parsing the training set movie reviews...")
for i in range(0, num_reviews):

    # if the review number is evenly divisible by 1000, print a progress message
    if (i + 1) % 1000 == 0:
        print("Review %d of %d\n" % (i + 1, num_reviews))

    # Call our function for each one, and add the result to the list of clean reviews
    clean_train_reviews.append(review_to_words(train["review"][i]))

Cleaning and parsing the training set movie reviews...






Review 1000 of 25000

Review 2000 of 25000

Review 3000 of 25000

Review 4000 of 25000

Review 5000 of 25000

Review 6000 of 25000

Review 7000 of 25000

Review 8000 of 25000

Review 9000 of 25000

Review 10000 of 25000

Review 11000 of 25000

Review 12000 of 25000

Review 13000 of 25000

Review 14000 of 25000

Review 15000 of 25000

Review 16000 of 25000

Review 17000 of 25000

Review 18000 of 25000

Review 19000 of 25000

Review 20000 of 25000

Review 21000 of 25000

Review 22000 of 25000

Review 23000 of 25000

Review 24000 of 25000

Review 25000 of 25000
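The same clean-and-report pattern can be written without manual index arithmetic. This is a sketch on toy data, where `clean` is a hypothetical stand-in for `review_to_words` and the reporting interval is shrunk to suit four items:

```python
def clean(text):
    """Hypothetical stand-in for review_to_words: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())

# Toy data, not from the training set.
raw_reviews = ["Great   FILM", "Bad plot", "Loved it", "Too  long"]

cleaned = []
# enumerate(..., start=1) yields a 1-based counter alongside each review.
for n, review in enumerate(raw_reviews, start=1):
    if n % 2 == 0:  # small interval chosen for the toy data
        print("Review %d of %d" % (n, len(raw_reviews)))
    cleaned.append(clean(review))

print(cleaned)
```

Iterating over the values directly also avoids label-based lookups such as `train["review"][i]`, which depend on the dataframe having a default integer index.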



### Creating Features from a Bag of Words (Using scikit-learn)

The bag-of-words model builds a *vocabulary* from all the words in all reviews (documents).   
It then models each review by counting the number of times each vocabulary word appears in it.

We will cap the vocabulary at the 5000 most frequent words.    
We will use the `feature_extraction` module from scikit-learn to create the bag-of-words features.
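To make the idea concrete before handing it to scikit-learn, here is a minimal pure-Python sketch on two toy sentences (hypothetical data, not from the training set). The helper `to_vector` and the `toy_` names are introduced for this illustration only:

```python
from collections import Counter

# Two toy "reviews".
toy_docs = ["the cat sat on the hat", "the dog ate the cat and the hat"]

# The vocabulary is the sorted set of all words across all documents.
toy_vocab = sorted({word for doc in toy_docs for word in doc.split()})

def to_vector(doc, vocab):
    """Count how often each vocabulary word occurs in one document."""
    word_counts = Counter(doc.split())
    return [word_counts[word] for word in vocab]

toy_vectors = [to_vector(doc, toy_vocab) for doc in toy_docs]

print(toy_vocab)
for vec in toy_vectors:
    print(vec)
```

Each document becomes a vector with one position per vocabulary word, so word order is discarded and only the counts remain.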

In [85]:
print ("Creating the bag of words....")
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the "CountVectorizer" object, which is scikit-learn's bag of words tool.
vectorizer = CountVectorizer(analyzer = "word",
                             tokenizer = None, 
                             preprocessor = None,
                             stop_words = None,
                             max_features = 5000)


Creating the bag of words....


`fit_transform()` does two things: first, it fits the model and learns the vocabulary; second, it transforms our training data into feature vectors. The input to `fit_transform()` should be a list of strings.
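The two steps can be seen separately on a toy corpus. In this sketch (the documents and the `toy_vectorizer` name are assumptions for illustration), `fit_transform()` learns the vocabulary from the training strings, while a later `transform()` call reuses that vocabulary and silently ignores words it has never seen:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy data, not from the training set.
train_docs = ["good movie", "bad movie", "good good acting"]
new_docs = ["good plot"]  # "plot" never appears in train_docs

toy_vectorizer = CountVectorizer(analyzer="word")

# Learn the vocabulary and build the count matrix in one call.
train_matrix = toy_vectorizer.fit_transform(train_docs)

# Reuse the learned vocabulary on unseen text; no refitting happens here.
new_matrix = toy_vectorizer.transform(new_docs)

learned_words = sorted(toy_vectorizer.vocabulary_)
print(learned_words)
print(train_matrix.toarray())
print(new_matrix.toarray())
```

This is the pattern a held-out test set would follow: fit on the training reviews only, then call `transform()` on everything else.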

In [86]:
train_data_features = vectorizer.fit_transform(clean_train_reviews)

# fit_transform returns a sparse matrix; NumPy arrays are easy to work with, so convert the result to a dense array
train_data_features = train_data_features.toarray()

In [87]:
print(train_data_features.shape)

(25000, 5000)


In [88]:
# take a look at the words in the vocabulary
# (get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out())
vocab = vectorizer.get_feature_names()
print(vocab[0:10])

['abandoned', 'abc', 'abilities', 'ability', 'able', 'abraham', 'absence', 'absent', 'absolute', 'absolutely']


In [123]:
# tally the total count of each vocabulary word across all reviews
counts = Counter(dict(zip(vocab, train_data_features.sum(axis=0))))

In [130]:
# Sum up the counts of each vocabulary word. axis=0 sums down the rows (reviews), leaving one total per column (word).
dist = np.sum(train_data_features, axis = 0)

print(dist.shape)
print(dist[0:100])

(5000,)
[ 187  125  108  454 1259   85  116   83  352 1485  306  192   91   98  297
  485  203  300  130  144   92  318  200   88  124  296  186   81  284  123
  179  139  124   90  971 1251  658 6490 3354  311   83 2389 4486 1219  369
  394  793 4237  148  302   98  453   80  154  810  439  166  347  337  113
  124  621  134  101  510  376  100   90  153  510  204   91  259   90  346
   93  113  104  126  343  212  255  187  128 1121  233  361   94  249  111
 1033  572   88   95  119  396  106   96   81  120]
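The meaning of the `axis` argument is easy to mix up, so here is a small sketch on a made-up 3-review by 4-word count matrix (the numbers are hypothetical):

```python
import numpy as np

# Rows are reviews, columns are vocabulary words.
toy_counts = np.array([[0, 0, 1, 1],
                       [0, 1, 0, 1],
                       [1, 0, 2, 0]])

# axis=0 collapses the rows: one total per column (word).
per_word = toy_counts.sum(axis=0)

# axis=1 collapses the columns: one total per row (review).
per_review = toy_counts.sum(axis=1)

print(per_word)
print(per_review)
```

With `axis=0` the result has one entry per word, which matches the `(5000,)` shape printed above for the 5000-word vocabulary.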


In [129]:
# For each word, print the vocabulary word and the number of times it appears in the training set
# vocab and dist have the same length, so we can zip them together
for tag, count in zip(vocab, dist):
    if count > 6000:  # just look at the most popular words.
        print (count, tag)

6490 acting
9155 also
9301 bad
6414 best
7022 character
7154 characters
7921 could
12646 even
40146 film
6887 films
9061 first
9310 get
15140 good
9058 great
6166 know
6628 life
20274 like
6435 little
6453 love
8362 made
8021 make
6675 many
44030 movie
7663 movies
9765 much
6484 never
26788 one
9285 people
6585 plot
11736 really
11474 see
6679 seen
6294 show
11983 story
7296 think
12723 time
6906 two
6972 watch
8026 way
10661 well
12436 would
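Instead of filtering with a fixed threshold, the words can also be ranked by count. This sketch uses five (word, total) pairs copied from the output above; the `toy_` names are introduced for this example:

```python
# A handful of words and their totals, taken from the printed output.
toy_vocab = ["acting", "bad", "film", "good", "movie"]
toy_dist = [6490, 9301, 40146, 15140, 44030]

# Pair each count with its word, then sort from most to least frequent.
ranked = sorted(zip(toy_dist, toy_vocab), reverse=True)

# Keep the three most frequent words.
top3 = ranked[:3]

for count, tag in top3:
    print(count, tag)
```

Placing the count first in each pair lets the default tuple ordering sort by frequency without a custom key function.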
