# Bag of Words Model on Amazon Dataset

This notebook runs three experiments and reports test-set accuracy for each:

1: BoW with a linear SVM on the unbalanced dataset. Result: 0.92 (bigram: 0.81)

2: BoW with Multinomial NB and TF-IDF. Result: 0.85

3: BoW with a linear SVM on a balanced dataset. Result: 0.88

Load packages and datasets

In [1]:
import os
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from Word2VecUtility import Word2VecUtility
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv('Reviews.csv', sep=',', index_col=False, encoding='utf-8')

In [3]:
print('A quick look at the reviews:')
print(data.head())

A quick look at the reviews:
   Score                                               Text
0      5  I have bought several of the Vitality canned d...
1      1  Product arrived labeled as Jumbo Salted Peanut...
2      4  This is a confection that has been around a fe...
3      2  If you are looking for the secret ingredient i...
4      5  Great taffy at a great price.  There was a wid...


Reformat data

In [4]:
#Create a sample dataset to speed up training times for the moment. 
size = 100000 
subdata = data.sample(n = size, random_state=520)

#subdata = subdata[pd.notnull(subdata['Text'])] - to get rid of null values
print(subdata.index)
subdata.to_csv('review_sub100k.csv', index=False, sep=',', encoding='utf-8')

Int64Index([ 86986, 223984, 365727, 109874, 424481, 426869,  21208, 369621,
            325219, 209866,
            ...
            545834, 226170, 498791, 219611,  92919,  21690, 350418, 561044,
            278642, 220337],
           dtype='int64', length=100000)


In [5]:
del(data)
data = subdata
del(subdata)

In [6]:
#Load in the sample data
data = pd.read_csv('review_sub100k.csv', index_col=False)
print(data.iloc[:5])

   Score                                               Text
0      2  Greenies tries to position itself as a healthy...
1      5  Love the flavor!  Not too strong, but not weak...
2      1  I can't comment on the other flavors of Silk s...
3      4  I've enjoyed the Dark Magic coffee and re-orde...
4      5  I first found Primo Pasta at a local restauran...


In [7]:
# Remove rows with a score of 3 (neutral, so excluded from this analysis)
data = data[data.Score != 3]
print(data.head())
data['Score'].value_counts() 

   Score                                               Text
0      2  Greenies tries to position itself as a healthy...
1      5  Love the flavor!  Not too strong, but not weak...
2      1  I can't comment on the other flavors of Silk s...
3      4  I've enjoyed the Dark Magic coffee and re-orde...
4      5  I first found Primo Pasta at a local restauran...


5    63770
4    14133
1     9219
2     5280
Name: Score, dtype: int64

In [8]:
# Binarize the labels: scores 1-2 become 0 (negative), scores 4-5 become 1 (positive)
data.loc[data.Score <= 2, 'Score'] = 0
data.loc[data.Score >= 4, 'Score'] = 1

data['Score'].value_counts()

1    77903
0    14499
Name: Score, dtype: int64
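
With 77,903 positive and 14,499 negative reviews, the classes are heavily skewed, so raw accuracy needs a reference point. A minimal sketch of the majority-class baseline, using only the counts printed above (the helper name `majority_baseline` is introduced here for illustration and is not part of the notebook):

```python
def majority_baseline(class_counts):
    """Accuracy of a classifier that always predicts the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / float(total)

# Counts taken from the value_counts() output above
counts = {1: 77903, 0: 14499}
baseline = majority_baseline(counts)
print("Majority-class baseline accuracy: %.3f" % baseline)
```

Always predicting "positive" would already score about 0.84 on data with this class mix, so accuracies on the unbalanced data are best read against that figure. The test split's class mix may differ slightly from the full sample.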

In [9]:
#make sure the reviews were labelled correctly (can compare to the previous header)
print(data.iloc[:5])

   Score                                               Text
0      0  Greenies tries to position itself as a healthy...
1      1  Love the flavor!  Not too strong, but not weak...
2      0  I can't comment on the other flavors of Silk s...
3      1  I've enjoyed the Dark Magic coffee and re-orde...
4      1  I first found Primo Pasta at a local restauran...


In [10]:
from sklearn.model_selection import train_test_split
#train, test = train_test_split(data, test_size = 0.3)

#split dataset into train/test sets
train_data = data.sample(frac=0.7,random_state=200)
test_data = data.drop(train_data.index)

train_data.to_csv('train.csv', index=False, sep=',', encoding='utf-8')
test_data.to_csv('test.csv', index=False, sep=',', encoding='utf-8')

In [11]:
#load train/test sets
train = pd.read_csv('train.csv', index_col=False)
test = pd.read_csv('test.csv', index_col=False)

print("The number of training samples are: %r" % len(train))
print("The number of testing samples are: %r \n" % len(test))

# Make sure the train/test sets are formatted correctly.
print(train.iloc[:2])
print(test.iloc[:2])

The number of training samples are: 64681
The number of testing samples are: 27721 

   Score                                               Text
0      1  Honey Maid Graham Cracker Crust are the best c...
1      1  I have only been eating more healthy foods for...
   Score                                               Text
0      0  Greenies tries to position itself as a healthy...
1      1  I've enjoyed the Dark Magic coffee and re-orde...


# Text Processing & Bag of Words Model

In [12]:
# Initialize an empty list to hold the clean reviews
clean_train_reviews = []

# Loop over each review by position, from 0 to the length of the
# Amazon review list. Word2VecUtility is a text-processing helper imported from another file.

print("Cleaning and parsing the Amazon reviews...\n")
for i in range( 0, len(train["Text"])):
    clean_train_reviews.append(" ".join(Word2VecUtility.review_to_wordlist(train["Text"][i], True)))

Cleaning and parsing the Amazon reviews...
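
The `Word2VecUtility` module is not shown in this notebook, so what `review_to_wordlist` does exactly is unknown here. As a rough, hypothetical sketch of what a cleaner of this kind typically does (strip markup, keep letters, lowercase, optionally drop stop words), assuming a tiny made-up stop-word list and a regex in place of a real HTML parser:

```python
import re

# Hypothetical stop-word list; the real helper may use a much larger one
STOP_WORDS = frozenset(["a", "an", "the", "is", "it", "this", "and", "of", "to"])

def review_to_wordlist_sketch(review, remove_stopwords=False):
    """Illustrative stand-in, NOT the actual Word2VecUtility implementation."""
    text = re.sub(r"<[^>]+>", " ", review)   # drop HTML tags such as <br />
    text = re.sub(r"[^a-zA-Z]", " ", text)   # keep letters only
    words = text.lower().split()             # lowercase, split on whitespace
    if remove_stopwords:
        words = [w for w in words if w not in STOP_WORDS]
    return words

sample = "Great taffy at a great price.<br />This is the BEST!"
cleaned = " ".join(review_to_wordlist_sketch(sample, remove_stopwords=True))
print(cleaned)
```

Joining the word list back into one string, as the loop above does, gives the vectorizer the list-of-strings input it expects.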





(BeautifulSoup warning, truncated: no parser was specified explicitly. The suggested fix is to change BeautifulSoup(YOUR_MARKUP) to BeautifulSoup(YOUR_MARKUP, "lxml").)


Create Bag of Words

In [13]:
# ****** Create a bag of words from the training set
#
print("Creating the bag of words...\n")


# Initialize the "CountVectorizer" object, which is scikit-learn's bag of words tool.
vectorizer = CountVectorizer(analyzer = "word",   \
                         tokenizer = None,    \
                         preprocessor = None, \
                         stop_words = None,   \
                         max_features = 5000)

Creating the bag of words...
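
To make the bag-of-words idea concrete, here is a pure-Python sketch of counting words against a capped vocabulary. It is an illustration of the concept, not scikit-learn's implementation: the helper names are made up, tokenization is a plain whitespace split, and the tie-breaking rule for equally frequent words is an assumption.

```python
from collections import Counter

def build_vocabulary(docs, max_features=None):
    """Keep the most frequent words and map them to column indices."""
    totals = Counter(word for doc in docs for word in doc.split())
    # Most frequent first; ties broken alphabetically so the result is deterministic
    ranked = sorted(totals, key=lambda w: (-totals[w], w))
    kept = sorted(ranked[:max_features])
    return {word: idx for idx, word in enumerate(kept)}

def to_count_vector(doc, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    vector = [0] * len(vocabulary)
    for word in doc.split():
        if word in vocabulary:   # out-of-vocabulary words are ignored
            vector[vocabulary[word]] += 1
    return vector

docs = ["great taffy great price", "bad taffy", "great coffee"]
vocab = build_vocabulary(docs, max_features=3)
vectors = [to_count_vector(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

Each review becomes a fixed-length vector of counts, one column per kept word. With `max_features = 5000`, the notebook's vectorizer likewise keeps only the 5000 most frequent terms.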



# Train SVM

In [17]:
# fit_transform() does two things: first, it fits the vectorizer
# and learns the vocabulary; second, it transforms the training data
# into feature vectors. The input to fit_transform should be a list of
# strings.
train_data_features = vectorizer.fit_transform(clean_train_reviews)

# The result is a SciPy sparse matrix, which SGDClassifier accepts
# directly, so no conversion to a dense NumPy array is needed.

# ******* Train an SVM using the bag of words
#
print("Training the SVM (this may take a while)...")

# Fit the SVM to the training set, using the bag of words as
# features and the sentiment labels as the response variable
#
# Initialize an SVM classifier with chosen parameters.

from sklearn.linear_model import SGDClassifier

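# loss='hinge' with an L2 penalty gives a linear SVM trained by stochastic gradient descent.
# n_iter (passes over the training data) is the older scikit-learn name; newer releases use max_iter.
SVM = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, n_iter=5, random_state=42)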
SVM = SVM.fit( train_data_features, train["Score"] )

Training the SVM (this may take a while)...
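
The classifier above is a linear SVM fitted by stochastic gradient descent on the hinge loss. The sketch below shows that update rule in plain Python on toy data. It is a simplified illustration under stated assumptions: a fixed learning rate of 0.1 and the function names are choices made here, whereas scikit-learn's `SGDClassifier` uses its own learning-rate schedule and other refinements.

```python
import random

def train_linear_svm(X, y, alpha=1e-3, epochs=5, learning_rate=0.1, seed=42):
    """SGD on the hinge loss with an L2 penalty. Labels must be 0 or 1."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            target = 1.0 if y[i] == 1 else -1.0
            margin = target * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # The L2 penalty shrinks every weight slightly on each step
            w = [wj * (1.0 - learning_rate * alpha) for wj in w]
            if margin < 1.0:   # hinge loss is active: move toward this example
                w = [wj + learning_rate * target * xj for wj, xj in zip(w, X[i])]
                b += learning_rate * target
    return w, b

def predict(w, b, x):
    """Return 1 if the example falls on the positive side of the hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

# Toy count vectors over the vocabulary [bad, great]
X = [[0, 2], [0, 1], [2, 0], [1, 0]]
y = [1, 1, 0, 0]
w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
print(predictions)
```

The weight for "great" ends up positive and the weight for "bad" negative, which is how a linear model over word counts expresses sentiment.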


# Testing Stage

In [18]:
# Create an empty list and append the clean reviews one by one
clean_test_reviews = []

print("Cleaning and parsing the test set movie reviews...\n")
for i in range(0,len(test["Text"])):
    clean_test_reviews.append(" ".join(Word2VecUtility.review_to_wordlist(test["Text"][i], True)))

# Get a bag of words for the test set using the vocabulary learned on
# the training set (the result is a sparse matrix)
test_data_features = vectorizer.transform(clean_test_reviews)

# Use the trained SVM to make sentiment label predictions
print("Predicting test labels...\n")
result = SVM.predict(test_data_features)

# Copy the results to a pandas dataframe. Note that the "id" column
# holds the true label (test["Score"]), not a review id; "Sentiment"
# holds the predicted label.
output = pd.DataFrame( data={"id":test["Score"], "Sentiment":result} )

# Use pandas to write the comma-separated output file
output.to_csv(os.path.join('Bag_of_Words_modelSVM100k.csv'), index=False, quoting=3)
print("Wrote results to Bag_of_Words_modelSVM100k.csv")

Cleaning and parsing the test set movie reviews...

Predicting test labels...

Wrote results to Bag_of_Words_modelSVM100k.csv


In [19]:
accuracy = np.mean(result == test['Score'])

print(accuracy)

0.915731755709
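
Accuracy alone can hide weak performance on the minority class when the data is unbalanced. A small sketch of per-class recall computed from a confusion table follows; the function names are illustrative, and the labels are made up for the example rather than taken from this notebook's results.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn

def recall_per_class(y_true, y_pred):
    """Fraction of each class that was predicted correctly."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {"positive": tp / float(tp + fn), "negative": tn / float(tn + fp)}

# Made-up labels for illustration only: 8 positive, 2 negative
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
toy_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / float(len(y_true))
recalls = recall_per_class(y_true, y_pred)
print(toy_accuracy)
print(recalls)
```

In the toy data the accuracy is 0.9 even though only half of the negative reviews are caught. In this notebook the same check could be run as `recall_per_class(test['Score'], result)`.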


# Bag of Words Bigram

In [20]:
# ****** Create a bag of bigrams (2-grams) from the training set
#
print("Creating the bag of words...\n")


# Initialize the "CountVectorizer" object, which is scikit-learn's bag of words tool.
vectorizerb = CountVectorizer(analyzer = "word",
                         ngram_range = (2, 2),
                         tokenizer = None,
                         preprocessor = None,
                         stop_words = None,
                         max_features = 5000)

Creating the bag of words...
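
With `ngram_range=(2,2)` the vocabulary consists of word pairs only, with no single words. A pure-Python sketch of n-gram extraction shows the difference between ranges. It assumes plain whitespace tokenization, whereas `CountVectorizer`'s default tokenizer also drops one-letter tokens; the function name is made up for this example.

```python
def word_ngrams(text, n_min, n_max):
    """Return all word n-grams with n_min <= n <= n_max, shortest first."""
    words = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(words) - n + 1):
            grams.append(" ".join(words[start:start + n]))
    return grams

review = "not very good taffy"
bigrams_only = word_ngrams(review, 2, 2)   # analogous to ngram_range=(2, 2)
uni_and_bi = word_ngrams(review, 1, 2)     # analogous to ngram_range=(1, 2)
print(bigrams_only)
print(uni_and_bi)
```

Bigrams capture local context such as negation, but each pair is rarer than its individual words, so a 5000-feature cap covers less of the text. A range of (1, 2) keeps both kinds of feature.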



In [21]:
# fit_transform() does two things: first, it fits the vectorizer
# and learns the bigram vocabulary; second, it transforms the training
# data into feature vectors. The input to fit_transform should be a
# list of strings.
train_data_features = vectorizerb.fit_transform(clean_train_reviews)

# The result is a SciPy sparse matrix, which SGDClassifier accepts
# directly, so no conversion to a dense NumPy array is needed.

# ******* Train an SVM using the bag of words
#
print("Training the SVM (this may take a while)...")

# Fit the SVM to the training set, using the bag of words as
# features and the sentiment labels as the response variable
#
# Initialize an SVM classifier with chosen parameters.

from sklearn.linear_model import SGDClassifier

SVM = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, n_iter=5, random_state=42)
SVM = SVM.fit( train_data_features, train["Score"] )

Training the SVM (this may take a while)...


In [23]:
# Create an empty list and append the clean reviews one by one
clean_test_reviews = []

print("Cleaning and parsing the test set movie reviews...\n")
for i in range(0,len(test["Text"])):
    clean_test_reviews.append(" ".join(Word2VecUtility.review_to_wordlist(test["Text"][i], True)))

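# Get a bag of bigrams for the test set. The bigram model must be scored
# with vectorizerb; the original run used the unigram `vectorizer` here,
# so the accuracy recorded below comes from mismatched features.
test_data_features = vectorizerb.transform(clean_test_reviews)

# Use the trained SVM to make sentiment label predictions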
print("Predicting test labels...\n")
result = SVM.predict(test_data_features)

# Copy the results to a pandas dataframe. Note that the "id" column
# holds the true label (test["Score"]), not a review id; "Sentiment"
# holds the predicted label.
output = pd.DataFrame( data={"id":test["Score"], "Sentiment":result} )

# Use pandas to write the comma-separated output file
output.to_csv(os.path.join('Bag_of_Words_modelSVMbigram100k.csv'), index=False, quoting=3)
print("Wrote results to Bag_of_Words_modelSVMbigram100k.csv")

Cleaning and parsing the test set movie reviews...

Predicting test labels...

Wrote results to Bag_of_Words_modelSVMbigram100k.csv


In [24]:
accuracy = np.mean(result == test['Score'])

print(accuracy)

0.814184192489


# Trying some other BOW models...

In [25]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
train_counts = count_vect.fit_transform(clean_train_reviews)
# Shape is (number of reviews, vocabulary size)
train_counts.shape

In [26]:
count_vect.vocabulary_.get(u'algorithm')

In [27]:
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(train_counts)
train_tf = tf_transformer.transform(train_counts)
# Same shape as train_counts; only the values change
train_tf.shape

In [28]:
tfidf_transformer = TfidfTransformer()
train_tfidf = tfidf_transformer.fit_transform(train_counts)
# Same shape as train_counts; only the values change
train_tfidf.shape

In [29]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(train_tfidf, train["Score"])
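
The classifier above is trained on TF-IDF weights rather than raw counts. The sketch below works through that weighting in plain Python, using the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 row normalization, which is what scikit-learn documents for `TfidfTransformer` with default settings. The function name and toy counts are made up for illustration.

```python
import math

def tfidf_rows(count_rows):
    """TF-IDF with smoothed idf and L2 normalization of each row."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: in how many documents each term appears
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1.0 + n_docs) / (1.0 + df[j])) + 1.0 for j in range(n_terms)]
    result = []
    for row in count_rows:
        weighted = [row[j] * idf[j] for j in range(n_terms)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result

# Toy counts over the vocabulary [bad, great, taffy]
counts = [[0, 2, 1],
          [1, 0, 1],
          [0, 1, 0]]
weights = tfidf_rows(counts)
for row in weights:
    print([round(v, 3) for v in row])
```

In the second row, "bad" receives a larger weight than "taffy" even though both occur once, because "bad" appears in fewer documents.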

In [30]:
#Use Testing Data 

# Create an empty list and append the clean reviews one by one
clean_test_reviews = []

print("Cleaning and parsing the test set movie reviews...\n")
for i in range(0,len(test["Text"])):
    clean_test_reviews.append(" ".join(Word2VecUtility.review_to_wordlist(test["Text"][i], True)))

Cleaning and parsing the test set movie reviews...



In [32]:
new_counts = count_vect.transform(clean_test_reviews)
new_tfidf = tfidf_transformer.transform(new_counts)

result = clf.predict(new_tfidf)
a = np.mean(result == test['Score'])  
print(a)


output = pd.DataFrame( data={"Score":test["Score"], "Sentiment":result} )


# Use pandas to write the comma-separated output file. Despite "SVM" in the
# file name, these are the Multinomial NB predictions.
output.to_csv( "Bag_of_Words_model100kSVMTFIDF.csv", index=False, quoting=3 )

0.847840986977


# Try SVM BoW with Balanced Dataset

In [33]:
df = pd.read_csv('AFF_Evenly_Sampled.csv', sep=',', index_col=False)

In [34]:
#make sure the data is evenly distributed
print(df.iloc[:5])
df['Score'].value_counts()

   Score                                               Text
0      1  I like this brand. I didn't realize I was orde...
1      1  Being my wife is a licensed cosmetologist and ...
2      1  If you are looking for an upgrade from the sta...
3      1  I am so allergic to too many artificial sweete...
4      1  I have not been able to find this locally and ...


1    82000
0    82000
Name: Score, dtype: int64
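
The notebook does not show how `AFF_Evenly_Sampled.csv` was built. One common way to obtain an evenly distributed dataset is to downsample the larger class at random, sketched here in plain Python. The function name, the seed, and the toy rows are all assumptions for illustration and do not describe how the file was actually produced.

```python
import random

def downsample_to_balance(rows, label_of, seed=520):
    """Randomly drop rows from larger classes until all classes are equal in size."""
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], smallest))
    rng.shuffle(balanced)   # avoid leaving the rows grouped by class
    return balanced

# Made-up (score, text) rows: 6 positive, 2 negative
rows = [(1, "pos %d" % i) for i in range(6)] + [(0, "neg %d" % i) for i in range(2)]
balanced = downsample_to_balance(rows, label_of=lambda r: r[0])
n_positive = sum(1 for r in balanced if r[0] == 1)
print("%d rows, %d positive" % (len(balanced), n_positive))
```

On a balanced set the majority-class baseline is 0.5, so accuracy on it is not directly comparable with accuracy on the unbalanced data.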

Split the sampled data into training and testing sets

In [35]:
#split dataset into train/test sets
# Use new variable names so the unbalanced train/test data is not overwritten
train_balanced_data = df.sample(frac=0.7,random_state=200)
test_balanced_data = df.drop(train_balanced_data.index)

train_balanced_data.to_csv('train_balanced.csv', index=False, sep=',', encoding='utf-8')
test_balanced_data.to_csv('test_balanced.csv', index=False, sep=',', encoding='utf-8')

In [36]:
#load train/test sets
train1 = pd.read_csv('train_balanced.csv', index_col=False)
test1 = pd.read_csv('test_balanced.csv', index_col=False)

print("The number of training samples are: %r" % len(train1))
print("The number of testing samples are: %r \n" % len(test1))

# Make sure the train/test sets are formatted correctly.
print(train1.iloc[:2])
print(test1.iloc[:2])
#print(train1['Text'][0])

The number of training samples are: 114800
The number of testing samples are: 49200 

   Score                                               Text
0      1  I had been frustrated trying to figure out how...
1      0  I tried this product in hopes that it would in...
   Score                                               Text
0      1  I like this brand. I didn't realize I was orde...
1      1  If you are looking for an upgrade from the sta...


In [37]:
# Initialize an empty list to hold the clean reviews
clean_train_reviews1 = []

# Loop over each review by position, from 0 to the length of the
# Amazon review list. Word2VecUtility is a text-processing helper imported from another file.

print("Cleaning and parsing the Amazon reviews...\n")
for i in range( 0, len(train1["Text"])):
    clean_train_reviews1.append(" ".join(Word2VecUtility.review_to_wordlist(train1["Text"][i], True)))

Cleaning and parsing the Amazon reviews...



In [38]:
# ****** Create a bag of words from the training set
#
print("Creating the bag of words...\n")


# Initialize the "CountVectorizer" object, which is scikit-learn's bag of words tool.
vectorizer1 = CountVectorizer(analyzer = "word",   \
                         tokenizer = None,    \
                         preprocessor = None, \
                         stop_words = None,   \
                         max_features = 5000)

Creating the bag of words...



In [39]:
# fit_transform() does two things: first, it fits the vectorizer
# and learns the vocabulary; second, it transforms the training data
# into feature vectors. The input to fit_transform should be a list of
# strings.
train_data_features1 = vectorizer1.fit_transform(clean_train_reviews1)

# The result is a SciPy sparse matrix, which SGDClassifier accepts
# directly, so no conversion to a dense NumPy array is needed.

# ******* Train an SVM using the bag of words
#
print("Training the SVM (this may take a while)...")

# Fit the SVM to the training set, using the bag of words as
# features and the sentiment labels as the response variable
#
# Initialize an SVM classifier with chosen parameters.

from sklearn.linear_model import SGDClassifier

SVM1 = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, n_iter=5, random_state=42)
SVM1 = SVM1.fit( train_data_features1, train1["Score"] )

Training the SVM (this may take a while)...


In [40]:
# Create an empty list and append the clean reviews one by one
clean_test_reviews1 = []

print("Cleaning and parsing the test set movie reviews...\n")
for i in range(0,len(test1["Text"])):
    clean_test_reviews1.append(" ".join(Word2VecUtility.review_to_wordlist(test1["Text"][i], True)))

# Get a bag of words for the test set using the vocabulary learned on
# the balanced training set (the result is a sparse matrix)
test_data_features1 = vectorizer1.transform(clean_test_reviews1)

# Use the trained SVM to make sentiment label predictions
print("Predicting test labels...\n")
result1 = SVM1.predict(test_data_features1)

# Copy the results to a pandas dataframe. Note that the "id" column
# holds the true label (test1["Score"]), not a review id; "Sentiment"
# holds the predicted label.
output1 = pd.DataFrame( data={"id":test1["Score"], "Sentiment":result1} )

# Use pandas to write the comma-separated output file
output1.to_csv(os.path.join('Bag_of_Words_Model_BalancedSVM82k.csv'), index=False, quoting=3)
print("Wrote results to Bag_of_Words_Model_BalancedSVM82k.csv")

Cleaning and parsing the test set movie reviews...

Predicting test labels...

Wrote results to Bag_of_Words_Model_BalancedSVM82k.csv


In [42]:
accuracy1 = np.mean(result1 == test1['Score'])

print(accuracy1)

0.884552845528
