## Sentiment Analysis of Movie Reviews

#### Load the training data files from the local folder. Show the data type and length of the training data. Then, show an instance of the training dataset. 

In [1]:
from sklearn.datasets import load_files
reviews_train = load_files("data/aclImdb/train/")

In [2]:
text_train, y_train = reviews_train.data, reviews_train.target
print("type of text_train: {}".format(type(text_train)))
print("length of text_train: {}".format(len(text_train)))
print("text_train[0]:\n{}".format(text_train[0]))

type of text_train: <class 'list'>
length of text_train: 75000
text_train[0]:
b'Full of (then) unknown actors TSF is a great big cuddly romp of a film.<br /><br />The idea of a bunch of bored teenagers ripping off the local sink factory is odd enough, but add in the black humour that Forsyth & Co are so good at and your in for a real treat.<br /><br />The comatose van driver by itself worth seeing, and the canal side chase is just too real to be anything but funny.<br /><br />And for anyone who lived in Glasgow it\'s a great "Oh I know where that is" film.'


In [3]:
text_train = [doc.replace(b"<br/>",b" ") for doc in text_train]

#### The dataset was collected such that the positive class and the negative class are balanced.

In [4]:
import numpy as np
print("Samples per class (training): {}".format(np.bincount(y_train)))

Samples per class (training): [12500 12500 50000]


In [5]:
reviews_test = load_files("data/aclImdb/test/")
text_test, y_test = reviews_test.data, reviews_test.target
print("Number of documents in test data: {}".format(len(text_test)))
print("Samples per class (test): {}".format(np.bincount(y_test)))
text_test = [doc.replace(b"<br />", b" ") for doc in text_test]

Number of documents in test data: 25000
Samples per class (test): [12500 12500]


#### Applying Bag-of-words to a toy dataset
##### Declare an array of 2 phrases from a quote in Shakespeare's "As you like it"

In [5]:
bards_words = ["The fool doth think he is wise,", "but the wise man knows himself to be a fool"]

##### Import and instantiate the bag-of-words implementaiton in CountVectorizer and fit it to our toy dataset

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer().fit(bards_words)

##### Fitting the CountVectorizer consists of tokenization and vocabulary building, which can be accessed in the vocabulary_ attribute.

In [7]:
vect.get_feature_names()

['be',
 'but',
 'doth',
 'fool',
 'he',
 'himself',
 'is',
 'knows',
 'man',
 'the',
 'think',
 'to',
 'wise']

In [8]:
print("Vocabulary size: {}".format(len(vect.vocabulary_)))
print("Vocabulary content: \n{}".format(vect.vocabulary_))

Vocabulary size: 13
Vocabulary content: 
{'the': 9, 'fool': 3, 'doth': 2, 'think': 10, 'he': 4, 'is': 6, 'wise': 12, 'but': 1, 'man': 8, 'knows': 7, 'himself': 5, 'to': 11, 'be': 0}


##### Create the bag-of-words representation by calling the transform method.

In [9]:
bag_of_words = vect.transform(bards_words)
print("Bag of words: {}".format(repr(bag_of_words)))

Bag of words: <2x13 sparse matrix of type '<class 'numpy.int64'>'
	with 16 stored elements in Compressed Sparse Row format>


##### Convert the sparse matrix into a "dense" NumPy array using the toarray method.

In [10]:
print("Dense representation of bag_of_words:\n{}".format(bag_of_words.toarray()))

Dense representation of bag_of_words:
[[0 0 1 1 1 0 1 0 0 1 1 0 1]
 [1 1 0 1 0 1 0 1 1 1 0 1 1]]


##### Examine the vocubulary and document-term matrix together using pandas

In [11]:
import pandas as pd
pd.DataFrame(bag_of_words.toarray(),columns=vect.get_feature_names())

Unnamed: 0,be,but,doth,fool,he,himself,is,knows,man,the,think,to,wise
0,0,0,1,1,1,0,1,0,0,1,1,0,1
1,1,1,0,1,0,1,0,1,1,1,0,1,1


### Bag-of-words for Movie Reviews
#### Import and instantiate the CountVectorizer into our Movie Reviews dataset. 

In [12]:
vect = CountVectorizer().fit(text_train)
X_train = vect.transform(text_train)
print("X_train:\n{}".format(repr(X_train)))

X_train:
<75000x124255 sparse matrix of type '<class 'numpy.int64'>'
	with 10359806 stored elements in Compressed Sparse Row format>


#### Access the vocabulary using the get_feature_name method of the vectorizer.

In [13]:
feature_names = vect.get_feature_names()
print("Number of features: {}".format(len(feature_names)))
print("First 20 features:\n{}".format(feature_names[:20]))
print("Features 20010 to 20030:\n{}".format(feature_names[20010:20030]))
print("Every 2000th feature:\n{}".format(feature_names[::2000]))

Number of features: 124255
First 20 features:
['00', '000', '0000', '0000000000000000000000000000000001', '0000000000001', '000000001', '000000003', '00000001', '000001745', '00001', '0001', '00015', '0002', '0007', '00083', '000ft', '000s', '000th', '001', '002']
Features 20010 to 20030:
['cheapen', 'cheapened', 'cheapening', 'cheapens', 'cheaper', 'cheapest', 'cheapie', 'cheapies', 'cheapjack', 'cheaply', 'cheapness', 'cheapo', 'cheapozoid', 'cheapquels', 'cheapskate', 'cheapskates', 'cheapy', 'chearator', 'cheat', 'cheata']
Every 2000th feature:
['00', '_require_', 'aideed', 'announcement', 'asteroid', 'banquière', 'besieged', 'bollwood', 'btvs', 'carboni', 'chcialbym', 'clotheth', 'consecration', 'cringeful', 'deadness', 'devagan', 'doberman', 'duvall', 'endocrine', 'existent', 'fetiches', 'formatted', 'garard', 'godlie', 'gumshoe', 'heathen', 'honoré', 'immatured', 'interested', 'jewelry', 'kerchner', 'köln', 'leydon', 'lulu', 'mardjono', 'meistersinger', 'misspells', 'mumblecore'

##### Build a LogisticRegression classifier and evaluate using cross-validation.

In [14]:
from sklearn.model_selection import cross_val_score 
from sklearn.linear_model import LogisticRegression
scores = cross_val_score(LogisticRegression(),X_train,y_train,cv=5)
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))

Mean cross-validation accuracy: 0.71


### Naive Bayes
#### Build a Multinomial Naive Bayes classifier and evaluate using cross-validation

In [15]:
from sklearn.naive_bayes import MultinomialNB
scores = cross_val_score(MultinomialNB(),X_train,y_train,cv=5)
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))

Mean cross-validation accuracy: 0.66


#### Build a Bernoulli Naive Bayes classifier and evaluate using cross-validation

In [16]:
from sklearn.naive_bayes import BernoulliNB
scores = cross_val_score(BernoulliNB(),X_train,y_train,cv=5)
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))

Mean cross-validation accuracy: 0.65


#### Set to 5 the minimum number of documents that a token needs to appear.

In [17]:
vect = CountVectorizer(min_df=5).fit(text_train)
X_train = vect.transform(text_train)
print("X_train with min_df: {}".format(repr(X_train)))

X_train with min_df: <75000x44532 sparse matrix of type '<class 'numpy.int64'>'
	with 10235504 stored elements in Compressed Sparse Row format>


#### Show the first 50 features, the 20010-20030th features and every 700th feature.

In [18]:
feature_names = vect.get_feature_names()
print("First 50 features: \n{}".format(feature_names[:50]))
print("Features 20010 to 20030:\n{}".format(feature_names[20010:20030]))
print("Every 700th feature: \n{}".format(feature_names[::700]))

First 50 features: 
['00', '000', '001', '007', '00am', '00pm', '00s', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '100', '1000', '1001', '100k', '100th', '100x', '101', '101st', '102', '103', '104', '105', '106', '107', '108', '109', '10am', '10pm', '10s', '10th', '10x', '11', '110', '1100', '110th', '111', '112', '1138', '115', '116', '117', '11pm', '11th']
Features 20010 to 20030:
['inert', 'inertia', 'inescapable', 'inescapably', 'inevitability', 'inevitable', 'inevitably', 'inexcusable', 'inexcusably', 'inexhaustible', 'inexistent', 'inexorable', 'inexorably', 'inexpensive', 'inexperience', 'inexperienced', 'inexplicable', 'inexplicably', 'inexpressive', 'inextricably']
Every 700th feature: 
['00', 'accountability', 'alienate', 'appetite', 'austen', 'battleground', 'bitten', 'bowel', 'burton', 'cat', 'choreographing', 'collide', 'constipation', 'creatively', 'dashes', 'descended', 'dishing', 'dramatist', 'ejaculation', 'epitomize', 'extinguished', 'figment', 'forgo

#### Build a Multinomial Naive Bayes classifier again and evaluate using cross-validation.

In [19]:
scores = cross_val_score(MultinomialNB(),X_train,y_train,cv=5)
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))

Mean cross-validation accuracy: 0.60


#### Show the stop words from the built_in English list in the feature_extraction.text module.

In [20]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
print("Number of stop words: {}".format(len(ENGLISH_STOP_WORDS)))
print("Stopwords: \n{}".format(list(ENGLISH_STOP_WORDS)))

Number of stop words: 318
Stopwords: 
['below', 'moreover', 'keep', 'none', 'fifty', 'has', 'without', 'towards', 'yet', 'whoever', 'rather', 'me', 'last', 'each', 'name', 'once', 'made', 'seemed', 'my', 'else', 'same', 'serious', 'several', 'something', 'who', 'also', 'was', 'were', 'themselves', 'though', 'twelve', 'but', 'ie', 'what', 'which', 'from', 'along', 'two', 'formerly', 'whereafter', 'give', 'itself', 'through', 'whereupon', 'afterwards', 'must', 'or', 'sixty', 'off', 'somehow', 'hasnt', 'that', 'less', 'hereby', 'whence', 'get', 'yours', 'under', 'mine', 'this', 'hundred', 'nobody', 'either', 'an', 'had', 'nine', 'ours', 'any', 'whither', 'otherwise', 'across', 'in', 'those', 'about', 'whole', 'enough', 'next', 'seem', 'is', 'namely', 'whether', 'thru', 'it', 'put', 'you', 'ltd', 'never', 'their', 'yourselves', 'de', 'if', 'anyway', 'for', 'someone', 'whenever', 'always', 'least', 'over', 'herein', 'besides', 'become', 'ever', 'mostly', 'eleven', 'please', 'can', 'both', '

#### Remove English stop words from the dataset.

In [21]:
vect = CountVectorizer(min_df=5,stop_words="english").fit(text_train)
X_train = vect.transform(text_train)
print("X_train with stop words:\n{}".format(repr(X_train)))

X_train with stop words:
<75000x44223 sparse matrix of type '<class 'numpy.int64'>'
	with 6621682 stored elements in Compressed Sparse Row format>


#### Then, build a Multinomial Naive Bayes classifier again and evaluate using cross-validation.

In [22]:
scores = cross_val_score(MultinomialNB(),X_train,y_train,cv=5)
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))

Mean cross-validation accuracy: 0.62


### Term Frequency-Inverse Document Frequency
#### Import and instantiate the TfidfVectorizer and setting the min_df=5.

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=5).fit(text_train)
X_train = vectorizer.transform(text_train)
print("X_train:\n{}".format(repr(X_train)))

X_train:
<75000x44532 sparse matrix of type '<class 'numpy.float64'>'
	with 10235504 stored elements in Compressed Sparse Row format>


#### Then, evaluate the Multinomial Naive Bayes classifier again  using cross-validation.

In [24]:
scores = cross_val_score(MultinomialNB(),X_train,y_train,cv=5)
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))

Mean cross-validation accuracy: 0.67


#### Inspect some of the words with the lowest and highest tfidf.

In [25]:
max_value = X_train.max(axis=0).toarray().ravel()
sorted_by_tfidf = max_value.argsort()
feature_names = np.array(vectorizer.get_feature_names())
print("Features with lowest tfidf:\n{}".format(feature_names[sorted_by_tfidf[:20]]))
print("Features with highest tfidf:\n{}".format(feature_names[sorted_by_tfidf[-20:]]))

Features with lowest tfidf:
['suplexes' 'acquiesce' 'poffysmoviemania' 'avenged' 'invocus' 'bookcase'
 'zantara' 'boozed' 'brutes' 'simpatico' 'pragmatism' 'unrivalled'
 'enrages' 'trumpeted' 'flowering' 'fearfully' 'reveries' 'authorial'
 'décor' 'clobber']
Features with highest tfidf:
['grease' 'halloweentown' 'smallville' 'frasier' 'stinks' 'lupin' 'gta'
 'tomie' 'lucy' 'skulls' 'scanners' 'doodlebops' 'ling' 'nr' 'yadda' 'ha'
 'pokemon' 'freakazoid' 'click' 'wicked']


#### Inspect some of the words with the lowest idf.

In [26]:
sorted_by_idf = np.argsort(vectorizer.idf_)
print("Features with lowest idf:\n{}".format(feature_names[sorted_by_idf[:100]]))

Features with lowest idf:
['the' 'and' 'of' 'to' 'this' 'is' 'it' 'in' 'that' 'but' 'for' 'with'
 'was' 'as' 'on' 'movie' 'not' 'br' 'one' 'be' 'have' 'are' 'film' 'you'
 'all' 'at' 'an' 'by' 'from' 'so' 'like' 'who' 'there' 'they' 'his' 'if'
 'out' 'just' 'about' 'he' 'or' 'has' 'what' 'some' 'can' 'good' 'when'
 'more' 'up' 'time' 'very' 'even' 'only' 'no' 'see' 'would' 'my' 'story'
 'really' 'which' 'well' 'had' 'me' 'than' 'their' 'much' 'were' 'get'
 'other' 'do' 'been' 'most' 'also' 'into' 'don' 'her' 'first' 'great' 'how'
 'made' 'people' 'will' 'make' 'because' 'way' 'could' 'bad' 'we' 'after'
 'them' 'too' 'any' 'then' 'movies' 'watch' 'she' 'think' 'seen' 'acting'
 'its']


### Bag-of-words with More Than One Word (n-Grams)
##### Applying it again to the toy dataset

In [27]:
print("bards_words;\n{}".format(bards_words))

bards_words;
['The fool doth think he is wise,', 'but the wise man knows himself to be a fool']


##### Show the unigrams in the vocabulary

In [28]:
cv = CountVectorizer(ngram_range=(1,1)).fit(bards_words)
print("Vocabulary size: {}".format(len(cv.vocabulary_)))
print("Vocabulary:\n{}".format(cv.get_feature_names()))

Vocabulary size: 13
Vocabulary:
['be', 'but', 'doth', 'fool', 'he', 'himself', 'is', 'knows', 'man', 'the', 'think', 'to', 'wise']


##### Show the unigrams and bigrams in the vocabulary

In [29]:
cv = CountVectorizer(ngram_range=(1,2)).fit(bards_words)
print("Vocabulary size: {}".format(len(cv.vocabulary_)))
print("Vocabulary:\n{}".format(cv.get_feature_names()))

Vocabulary size: 27
Vocabulary:
['be', 'be fool', 'but', 'but the', 'doth', 'doth think', 'fool', 'fool doth', 'he', 'he is', 'himself', 'himself to', 'is', 'is wise', 'knows', 'knows himself', 'man', 'man knows', 'the', 'the fool', 'the wise', 'think', 'think he', 'to', 'to be', 'wise', 'wise man']


##### Check the "dense" NumPy array using the toarray method of the transformed training data.

In [30]:
print("Transformed data(dense):\n{}".format(cv.transform(bards_words).toarray()))

Transformed data(dense):
[[0 0 0 0 1 1 1 1 1 1 0 0 1 1 0 0 0 0 1 1 0 1 1 0 0 1 0]
 [1 1 1 1 0 0 1 0 0 0 1 1 0 0 1 1 1 1 1 0 1 0 0 1 1 1 1]]


##### Show the unigrams, bigrams and trigrams in the vocabulary. 

In [31]:
cv = CountVectorizer(ngram_range=(1,3)).fit(bards_words)
print("Vocabulary size: {}".format(len(cv.vocabulary_)))
print("Vocabulary:\n{}".format(cv.get_feature_names()))

Vocabulary size: 39
Vocabulary:
['be', 'be fool', 'but', 'but the', 'but the wise', 'doth', 'doth think', 'doth think he', 'fool', 'fool doth', 'fool doth think', 'he', 'he is', 'he is wise', 'himself', 'himself to', 'himself to be', 'is', 'is wise', 'knows', 'knows himself', 'knows himself to', 'man', 'man knows', 'man knows himself', 'the', 'the fool', 'the fool doth', 'the wise', 'the wise man', 'think', 'think he', 'think he is', 'to', 'to be', 'to be fool', 'wise', 'wise man', 'wise man knows']


#### Now, try out the TfidfVectorizer on the Movie Reviews dataset and find the best setting of the n-gram range
(1) Unigrams and Bigrams

In [32]:
vectorizer = TfidfVectorizer(min_df=5, ngram_range=(1,2)).fit(text_train)
X_train = vectorizer.transform(text_train)
print("X_train:\n{}".format(repr(X_train)))

X_train:
<75000x377417 sparse matrix of type '<class 'numpy.float64'>'
	with 22396054 stored elements in Compressed Sparse Row format>


In [33]:
scores = cross_val_score(MultinomialNB(),X_train,y_train,cv=5)
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))

Mean cross-validation accuracy: 0.67


(2) Unigrams, Bigrams and Trigrams

In [34]:
vectorizer = TfidfVectorizer(min_df=5, ngram_range=(1,3)).fit(text_train)
X_train = vectorizer.transform(text_train)
print("X_train:\n{}".format(repr(X_train)))

X_train:
<75000x710857 sparse matrix of type '<class 'numpy.float64'>'
	with 28414792 stored elements in Compressed Sparse Row format>


In [35]:
scores = cross_val_score(MultinomialNB(),X_train,y_train,cv=5)
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))

Mean cross-validation accuracy: 0.67
