# Project 02 - Brian Bell

## Experiment Objective

The dataset used in this project was a colleciton of approximately 231,000 video game reviews taken from Amazon. The objective of this experiment is to use the contents of a users review to be able to accurately predict the rating. The ratings range from 1-5. To improve model performance, the ratings were changed from 1-3 (1-2=Unfavorable, 3=Neutral, 4-5=Favorable).

The dataset was created by Julian McAuley at UCSD. McAuley made the dataset availble to anyone who wishes to use it, as long as the following citations are present:

Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering
R. He, J. McAuley
WWW, 2016
http://cseweb.ucsd.edu/~jmcauley/pdfs/www16a.pdf

Image-based recommendations on styles and substitutes
J. McAuley, C. Targett, J. Shi, A. van den Hengel
SIGIR, 2015
http://cseweb.ucsd.edu/~jmcauley/pdfs/sigir15.pdf

## Data Collection

In [1]:
import os
import gzip
import requests
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
url = r'http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Video_Games_5.json.gz'
file = 'reviews_Video_Games_5.json.gz'

if not os.path.isfile(file):
    r = requests.get(url)
    open('reviews_Video_Games_5.json.gz', 'wb').write(r.content)

In [3]:
# Code to convert .gz to dataframe from the creator of the dataset at http://jmcauley.ucsd.edu/data/amazon/
def parse(path):
    g = gzip.open(path, 'rb')
    for l in g:
        yield eval(l)

def getDF(path):
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')

df = getDF('reviews_Video_Games_5.json.gz')

In [4]:
df.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A2HD75EMZR8QLN,700099867,123,"[8, 12]",Installing the game was a struggle (because of...,1.0,Pay to unlock content? I don't think so.,1341792000,"07 9, 2012"
1,A3UR8NLLY1ZHCX,700099867,"Alejandro Henao ""Electronic Junky""","[0, 0]",If you like rally cars get this game you will ...,4.0,Good rally game,1372550400,"06 30, 2013"
2,A1INA0F5CWW3J4,700099867,"Amazon Shopper ""Mr.Repsol""","[0, 0]",1st shipment received a book instead of the ga...,1.0,Wrong key,1403913600,"06 28, 2014"
3,A1DLMTOTHQ4AST,700099867,ampgreen,"[7, 10]","I got this version instead of the PS3 version,...",3.0,"awesome game, if it did not crash frequently !!",1315958400,"09 14, 2011"
4,A361M14PU2GUEG,700099867,"Angry Ryan ""Ryan A. Forrest""","[2, 2]",I had Dirt 2 on Xbox 360 and it was an okay ga...,4.0,DIRT 3,1308009600,"06 14, 2011"


In [5]:
# For testing on smaller dataframe
# df = df[:500]

## Data Preprocessing

In [6]:
df.overall = df.overall.map({1.0:1, 2.0:1, 3.0:2, 4.0:3, 5.0:3})

In [7]:
X, y = (df.reviewText.values, df.overall.values)

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [9]:
import nltk

nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [10]:
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.base import BaseEstimator, TransformerMixin

In [11]:
class TextTransformer(BaseEstimator, TransformerMixin):
    '''
    This transformer removes numbers, stop words, punctuation, and stemming (with the provided stemmer).
    
    Expects a column in the dataframe named "reviewText" and puts the cleaned text in a column "reviewTextClean". 
    '''
    def __init__(self, stemmer, remove_stop=True):
        '''
        Args: 
            stemmer: A stemmer with a "stem" method
            remove_stop: True if stop words should be removed
        '''
        self.stemmer = stemmer
        self.remove_stop=remove_stop      

        
    def transform(self, X, **transform_params):
        '''
        Cleans the text
        '''
        return [self.stem_text(x) for x in X]
    
    def fit(self, X, y=None, **fit_params):
        '''
        Fits the transformer
        '''
        return self
    
    def stem_text(self, text):
        '''
        Utility function for transform method to clean the text
        '''
        text = re.sub(r'http://.*', '', text) # Remove any links
        
        if self.remove_stop:
            stop = self.stop = stopwords.words('english')             
        else:
            stop = []
        
        
        return ' '.join([self.stemmer.stem(word) for word in word_tokenize(text) if word.isalpha() and word not in stop])

## Model Optimization and Serialization

In [12]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import SGDClassifier

### Find Optimal Hyperparameters

Originally, transforming the text was part of the pipeline. However, on such a large dataset this is a very expensive task. It takes close to two minutes to transform the text on this computer. Moving it outside of the pipeline, and fitting a separate pipeline for each stemmer, dramatically reduced the run time since it only has to be ran three times instead of hundreds of times. It also also the grid search to run on all available CPU cores, since the tokenize function cannot be ran on all cores in Windows.

In [13]:
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")

X_porter = TextTransformer(porter).transform(X_train)
X_lancaster = TextTransformer(lancaster).transform(X_train)
X_snowball = TextTransformer(snowball).transform(X_train)

In [14]:
tfidf = TfidfVectorizer()

param_grid = [{'vect__max_df' : [.5, .8],
               'vect__min_df' : [100, 1000, 10000],
               'clf__alpha' : [.0001, .001, .01, .1, 1],
               'clf__penalty' : ['l2', 'l1', 'elasticnet'],
               'clf__loss' : ['hinge', 'modified_huber', 'log']}]

pipe = Pipeline([('vect', tfidf),
                ('clf', SGDClassifier(random_state=0))])

In [15]:
gs_porter = GridSearchCV(pipe, param_grid,
                  scoring='accuracy',
                  cv=5, verbose=1,
                  n_jobs=-1)
gs_porter.fit(X_porter, y_train)

Fitting 5 folds for each of 270 candidates, totalling 1350 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:   50.2s
[Parallel(n_jobs=-1)]: Done 168 tasks      | elapsed:  4.8min
[Parallel(n_jobs=-1)]: Done 418 tasks      | elapsed: 11.0min
[Parallel(n_jobs=-1)]: Done 768 tasks      | elapsed: 19.8min
[Parallel(n_jobs=-1)]: Done 1218 tasks      | elapsed: 32.5min
[Parallel(n_jobs=-1)]: Done 1350 out of 1350 | elapsed: 36.0min finished


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('vect', TfidfVectorizer()),
                                       ('clf', SGDClassifier(random_state=0))]),
             n_jobs=-1,
             param_grid=[{'clf__alpha': [0.0001, 0.001, 0.01, 0.1, 1],
                          'clf__loss': ['hinge', 'modified_huber', 'log'],
                          'clf__penalty': ['l2', 'l1', 'elasticnet'],
                          'vect__max_df': [0.5, 0.8],
                          'vect__min_df': [100, 1000, 10000]}],
             scoring='accuracy', verbose=1)

In [16]:
print(gs_porter.best_params_)
clf_porter = gs_porter.best_estimator_

{'clf__alpha': 0.0001, 'clf__loss': 'modified_huber', 'clf__penalty': 'l2', 'vect__max_df': 0.8, 'vect__min_df': 100}


In [17]:
gs_lancaster = GridSearchCV(pipe, param_grid,
                  scoring='accuracy',
                  cv=5, verbose=1,
                  n_jobs=-1)
gs_lancaster.fit(X_lancaster, y_train)

Fitting 5 folds for each of 270 candidates, totalling 1350 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:  2.7min
[Parallel(n_jobs=-1)]: Done 168 tasks      | elapsed:  6.6min
[Parallel(n_jobs=-1)]: Done 418 tasks      | elapsed: 12.7min
[Parallel(n_jobs=-1)]: Done 768 tasks      | elapsed: 21.3min
[Parallel(n_jobs=-1)]: Done 1218 tasks      | elapsed: 33.7min
[Parallel(n_jobs=-1)]: Done 1350 out of 1350 | elapsed: 37.0min finished


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('vect', TfidfVectorizer()),
                                       ('clf', SGDClassifier(random_state=0))]),
             n_jobs=-1,
             param_grid=[{'clf__alpha': [0.0001, 0.001, 0.01, 0.1, 1],
                          'clf__loss': ['hinge', 'modified_huber', 'log'],
                          'clf__penalty': ['l2', 'l1', 'elasticnet'],
                          'vect__max_df': [0.5, 0.8],
                          'vect__min_df': [100, 1000, 10000]}],
             scoring='accuracy', verbose=1)

In [18]:
print(gs_lancaster.best_params_)
clf_lancaster = gs_porter.best_estimator_

{'clf__alpha': 0.0001, 'clf__loss': 'modified_huber', 'clf__penalty': 'l2', 'vect__max_df': 0.5, 'vect__min_df': 100}


In [19]:
gs_snowball = GridSearchCV(pipe, param_grid,
                  scoring='accuracy',
                  cv=5, verbose=1,
                  n_jobs=-1)
gs_snowball.fit(X_lancaster, y_train)

Fitting 5 folds for each of 270 candidates, totalling 1350 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:   47.1s
[Parallel(n_jobs=-1)]: Done 168 tasks      | elapsed:  4.6min
[Parallel(n_jobs=-1)]: Done 418 tasks      | elapsed: 10.8min
[Parallel(n_jobs=-1)]: Done 768 tasks      | elapsed: 19.5min
[Parallel(n_jobs=-1)]: Done 1218 tasks      | elapsed: 31.8min
[Parallel(n_jobs=-1)]: Done 1350 out of 1350 | elapsed: 35.3min finished


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('vect', TfidfVectorizer()),
                                       ('clf', SGDClassifier(random_state=0))]),
             n_jobs=-1,
             param_grid=[{'clf__alpha': [0.0001, 0.001, 0.01, 0.1, 1],
                          'clf__loss': ['hinge', 'modified_huber', 'log'],
                          'clf__penalty': ['l2', 'l1', 'elasticnet'],
                          'vect__max_df': [0.5, 0.8],
                          'vect__min_df': [100, 1000, 10000]}],
             scoring='accuracy', verbose=1)

In [20]:
print(gs_snowball.best_params_)
clf_snowball = gs_snowball.best_estimator_

{'clf__alpha': 0.0001, 'clf__loss': 'modified_huber', 'clf__penalty': 'l2', 'vect__max_df': 0.5, 'vect__min_df': 100}


Save all the classifiers, in case I need to restart the Jupyter session.

In [21]:
import pickle

pickle.dump(clf_porter, open('classifier_porter.pkl', 'wb'), protocol=4)
pickle.dump(clf_lancaster, open('classifier_lancaster.pkl', 'wb'), protocol=4)
pickle.dump(clf_snowball, open('classifier_snowball.pkl', 'wb'), protocol=4)

In [22]:
clf_porter.score(X_porter, y_train)

0.8245232548105963

In [23]:
clf_lancaster.score(X_lancaster, y_train)

0.8051945810682544

In [24]:
clf_snowball.score(X_snowball, y_train)

0.8090545632352518

Porter stemmer had the best accuracy on the training data. Get accuracy on test dataset.

In [None]:
X_test_transform = TextTransformer(porter).transform(X_test)

In [26]:
clf_porter.score(X_test_transform, y_test)

0.8172059711795668

The training accuracy was only slightly better that the test accuracy, so the model there was not significant overfitting.

### Train Best Model on full dataset

In [31]:
X_transform_full = TextTransformer(porter).transform(X)

In [32]:
final_pipe = Pipeline([('vect', tfidf),
                       ('clf', SGDClassifier(random_state=0))])

final_pipe.set_params(**gs_porter.best_params_)
final_pipe.fit(X_transform_full, y)

In [34]:
pickle.dump(final_pipe, open('website/pkl/review_classifier.pkl', 'wb'), protocol=4)

In [40]:
pickle.dump(stopwords.words('english') , open('website/pkl/stop.pkl', 'wb'), protocol=4)

In [38]:
pickle.dump(porter, open('website/pkl/porter.pkl', 'wb'), protocol=4)

In [39]:
pickle.dump(TextTransformer(porter), open('website/pkl/transformer.pkl', 'wb'), protocol=4)

## Website Creation and Publishing

Make more portable tokenizer.

In [62]:
import string

stop = pickle.load(open('website/pkl/stop.pkl', 'rb'))
stemmer = pickle.load(open('website/pkl/porter.pkl', 'rb'))

def word_tokenize(text):
    '''
    Removes punctuation and splits into a list.
    '''
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'\n', '', text)
    
    return text.split(" ")
    


def stem_text(text):
    '''
    Stem words in a list.
    '''
    text = re.sub(r'http://.*', '', text) # Remove any links

    return ' '.join([stemmer.stem(word) for word in word_tokenize(text) if word.isalpha() and word not in stop])

Train HashVectorizer and pickle.

In [50]:
from sklearn.feature_extraction.text import HashingVectorizer

vect = HashingVectorizer(decode_error='ignore',
                        n_features=2**21,
                        preprocessor=None,
                        tokenizer=stem_text)

clf = SGDClassifier(loss='modified_huber', alpha=.0001, penalty='l2', random_state=1)

In [57]:
def data_stream(X, y, batch_size=1000):
    length = len(X)
    current = 0 
    while current < length:
        if current + batch_size < length:
            yield (X[current:current+batch_size], y[current:current+batch_size])
        else:
            yield (X[current:], y[current:])
        current += batch_size

In [63]:
classes = np.unique(y)
for X_train, y_train in data_stream(X, y):
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)

In [69]:
clf.score(vect.transform(X), y)

0.7550090603158167

In [67]:
pickle.dump(clf, open('website/pkl/hashing_classifier.pkl', 'wb'))

## Flask Website

http://bell0x2a.pythonanywhere.com/