### Introduction
We are looking to predict the star rating (1-5 stars) associated with a particular Amazon review. 
To do this we are going to use Machine Learning (ML). An ML model learns to make predictions from training data. Once trained, it can predict the output for a new input based on the patterns in the data it has seen. 

We will try out a number of ML models to see which one is best at predicting Amazon review star-ratings. 


### Installing, Importing, and Loading Data

We can start by installing and importing some libraries that will help us with our analysis. 

In [1]:
%%capture
#Install stuff (%%capture is a cell magic and must be the first line of the cell)
!pip install -U gensim
#urllib2 is a Python 2 standard-library module (urllib.request in Python 3), so there is nothing to install


In [2]:
#Import stuff
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm

from gensim import corpora
from gensim.models import LsiModel, KeyedVectors
from gensim.models.tfidfmodel import TfidfModel
from gensim.models.nmf import Nmf

import sklearn.model_selection as ms
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

from datetime import *
from operator import itemgetter


Now we can load the data. The data are Amazon reviews and their corresponding star ratings from 1 to 5. The reviews are stored as "bags of words" (BoW), meaning each review is reduced to word counts and word order is not retained. We are assuming that word order will not be critical to making good estimates. Let's see if that assumption is valid!
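
To see what a bag of words throws away, here is a minimal sketch using only the Python standard library. The two example reviews and the `to_bag` helper are made up for illustration; they are not part of the dataset or of gensim.

```python
from collections import Counter

def to_bag(text):
    """Count word occurrences, ignoring word order (hypothetical helper)."""
    return Counter(text.lower().split())

# Two made-up reviews with opposite meanings but identical words
review_a = "good not bad"
review_b = "bad not good"

bag_a = to_bag(review_a)
bag_b = to_bag(review_b)

print(bag_a)            # word -> count mapping
print(bag_a == bag_b)   # True: the two bags are indistinguishable
```

Cases like this one are exactly where the word-order assumption could hurt us.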

In [0]:
%%capture
#Access data
!wget https://cis.upenn.edu/~cis545/data/reviews.dict
!wget https://cis.upenn.edu/~cis545/data/train_reviews.mm
!wget https://cis.upenn.edu/~cis545/data/train_times.npy

In [0]:
#Store data in variables
reviews_dict = corpora.Dictionary.load("reviews.dict")
reviews_bow = corpora.MmCorpus('train_reviews.mm')
reviews_times  = np.load('train_times.npy')
reviews_times.shape = (len(reviews_bow),1)
#Labels: 20,000 reviews for each rating from 1 to 5, in order of rating
y = np.repeat(np.arange(1, 6), 20000)

### PCA
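We want to start training machine learning models on our data, but right now the number of features (one per vocabulary word) is simply too large. We can use Principal Component Analysis (PCA) to re-express the data along orthogonal directions, ordered by how much of the variation each one captures, and then trim the directions that have basically no impact. This is analogous to truncating a Taylor series: we keep the leading terms and drop the rest. In the code below the decomposition is done with gensim's `LsiModel`, which computes a truncated singular value decomposition (SVD), a close relative of PCA.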
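
The idea of keeping only the strongest components can be sketched with NumPy on a small made-up matrix. This is an illustration under stated assumptions, not the notebook's pipeline: the data is random, and, like LSI, the sketch skips the mean-centering step that textbook PCA applies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 50 samples, 10 features, driven by 2 underlying factors plus small noise
factors = rng.normal(size=(50, 2))
mixing = rng.normal(size=(2, 10))
X = factors @ mixing + 0.01 * rng.normal(size=(50, 10))

# Singular value decomposition: X = U @ diag(s) @ Vt, with s sorted in decreasing order
U, s, Vt = np.linalg.svd(X, full_matrices=False)

def reconstruct(k):
    """Rebuild X from only the k strongest components."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Relative reconstruction error for a few choices of k
errors = {}
for k in (1, 2, 5):
    errors[k] = np.linalg.norm(X - reconstruct(k)) / np.linalg.norm(X)
    print(k, "components -> relative error", round(float(errors[k]), 4))

# Reduced representation: 50 x 2 instead of 50 x 10
X_reduced = U[:, :2] * s[:2]
print(X_reduced.shape)
```

Because this toy data was built from two factors, the error should drop sharply once two components are kept and barely improve after that.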


We don't yet know how many of these orthogonal components we want. Let's narrow down exactly where we stop gaining predictive power by training some ML models with different numbers of components. 

In order to do this, we will need a function that converts sparse data into dense data. 

**Densify a Sparse Matrix**

Sparse data stores only the words that actually appear in a review, each with its frequency, and does not mention any other word. Dense data stores a frequency for every word in the vocabulary, with a 0 for each word that does not appear. The sparse form avoids storing and modifying overly large variables, but it is not the format we want in this case. 
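
As a small illustration, here is the same made-up review in both forms. The six-word vocabulary and the counts are assumptions chosen for the example.

```python
# Made-up vocabulary of 6 word ids (0..5)
vocab_size = 6

# Sparse form: only the words that occur, as (word_id, count) pairs
sparse_review = [(1, 2), (4, 1)]

# Dense form: one slot per vocabulary word, zero where the word is absent
dense_review = [0] * vocab_size
for word_id, count in sparse_review:
    dense_review[word_id] = count

print(dense_review)  # [0, 2, 0, 0, 1, 0]
```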

In [0]:
#Convert a sparse corpus (rows of (index, value) pairs) into a dense NumPy array
def densify(sparse, columns):
    dense = np.zeros((len(sparse), columns)) #Fill up the array with zeros

    #Fill in the value for each entry that is present
    for i, sentence in enumerate(sparse):
        for index, value in sentence:
            dense[i, index] += value
    return dense

### How we will evaluate models

For whatever model we try, we want to know how many components are ideal for our analysis, so we will evaluate the model several times on reduced matrices of up to 200 components, using accuracy as the metric. 

The function below trains a given model (passed in as `eval_model`) on 80% of the data and returns its accuracy on the held-out 20%. We won't use this function by itself, though. Another function will call it several times with different matrices so we can compare them. 

In [0]:
#Returns the accuracy score of a model given data
def evaluate_model(X, y, eval_model):
    X_train, X_test, y_train, y_test = ms.train_test_split(X, y, test_size=0.2, random_state = 1911)    
    eval_model.fit(X_train, y_train)
    return eval_model.score(X_test, y_test)
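
To make the hold-out idea concrete, here is a rough NumPy sketch of what a train/test split and an accuracy score amount to. The `holdout_split` and `accuracy` helpers and the toy data are hypothetical; this is not how `train_test_split` is implemented and it will not reproduce the same shuffle.

```python
import numpy as np

def holdout_split(X, y, test_size=0.2, seed=1911):
    """Shuffle row indices, then reserve a fraction of them for testing."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(test_size * len(X)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(y_true == y_pred))

# Made-up data: 10 samples, 3 features, labels 1..5 repeated twice each
X_toy = np.arange(30).reshape(10, 3)
y_toy = np.repeat(np.arange(1, 6), 2)

X_train, X_test, y_train, y_test = holdout_split(X_toy, y_toy)
print(X_train.shape, X_test.shape)   # (8, 3) (2, 3)
print(accuracy(np.array([1, 2, 3, 4]), np.array([1, 2, 3, 5])))  # 0.75
```

Fixing the seed, as `random_state = 1911` does above, means every model is scored on the same held-out rows.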

The next function compares the accuracy obtained from input matrices with different numbers of components.


In [0]:
#Compares the performance of different input matrices 
def evaluate_cutoffs(X_orig, X_dict, y, cutoffs, eval_model):
    results = []
    
    #Create a new model for each cutoff
    for i, cutoff in enumerate(cutoffs):
        print(i+1, "of", len(cutoffs),"...")
        np.random.seed(1911)
    
        model = LsiModel(X_orig, num_topics=cutoff, id2word=X_dict)
        V = densify(model[X_orig], len(model.projection.s))
        
        #Store the results for the model
        result = evaluate_model(V, y, eval_model)
        results.append(result)
    
    #Plot the results
    plt.ylabel('Accuracy')
    plt.xlabel('# of Components')
    plt.title('PCA Analysis')
    plt.plot(cutoffs, results)

    #Return the results
    return results

### Let's evaluate our first model: Random Forest!

Now we can get an idea of how the number of principal components affects our model's accuracy. To do this, we need to pick a model. Let's start with a random forest! This will be the first of several models that we test.

A random forest consists of many decision trees, each trained on a different random sample of the data, whose predictions are averaged. This can solve some of the overfitting issues that come with individual decision trees. 
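
The averaging step can be sketched with NumPy. The per-tree probabilities below are invented for illustration; in scikit-learn, which averages the trees' probabilistic predictions rather than counting single votes, they would come from the fitted trees.

```python
import numpy as np

# Made-up class probabilities from three hypothetical trees for one review.
# Columns correspond to star ratings 1..5.
tree_probs = np.array([
    [0.1, 0.1, 0.2, 0.5, 0.1],   # tree 1 favors 4 stars
    [0.0, 0.1, 0.5, 0.3, 0.1],   # tree 2 favors 3 stars
    [0.1, 0.0, 0.1, 0.6, 0.2],   # tree 3 favors 4 stars
])

# The forest averages the per-tree probabilities...
forest_probs = tree_probs.mean(axis=0)

# ...and predicts the rating with the highest average
ratings = np.arange(1, 6)
prediction = ratings[np.argmax(forest_probs)]

print(forest_probs)
print(prediction)  # 4
```

A single tree that disagrees (tree 2 here) gets outvoted, which is how the ensemble smooths over the quirks of individual trees.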

In [0]:
eval_model = RandomForestClassifier(n_estimators=70, random_state=1911)
results = evaluate_cutoffs(reviews_bow, reviews_dict, y, range(20,220,20), eval_model)

It looks like the model accuracy already started plateauing even before 20 components. Let's see what was happening before... 

In [0]:
results = evaluate_cutoffs(reviews_bow, reviews_dict, y, range(2,22,2), eval_model)

The plateau begins at just 10-12 components! This model correctly predicts the star rating of a review ~65% of the time. Not bad! Let's see how other models compare. 

### Decision Tree!

Are we being overly complicated here? Would a simple decision tree model do the trick, or would it overfit the data, as it is prone to do?

In [0]:
from sklearn import tree

eval_model = tree.DecisionTreeClassifier()
results = evaluate_cutoffs(reviews_bow, reviews_dict, y, range(20,420,20), eval_model)


Certainly not the way to go. It predicts the correct rating about half of the time, a serious degradation from the success rate of the random forest. 

### Decision Stumps

There are other ways to avoid overfitting, however. What if we make a decision tree that only goes so far, say to a depth of only 3? (Strictly speaking, a decision stump has a depth of 1, so this is really a shallow tree.) Sometimes a more general decision is better. 

In [0]:
from sklearn import tree

eval_model = tree.DecisionTreeClassifier(max_depth=3)
results = evaluate_cutoffs(reviews_bow, reviews_dict, y, range(20,420,20), eval_model)

But not always a better result! We are still getting very low accuracy, right around 50%. Forget that! The random forest is still the best model so far. 
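
For intuition about why very shallow trees can fall short, here is a sketch of a true depth-1 stump: a single threshold on a single feature. The `fit_stump` helper and the toy data are hypothetical and cover only the two-class case, not scikit-learn's tree-building algorithm.

```python
import numpy as np

def fit_stump(x, y):
    """Find the threshold on a single feature that best separates two classes.

    Predicts class 1 when x > threshold, else class 0.
    """
    best_threshold, best_accuracy = None, -1.0
    values = np.unique(x)
    # Candidate thresholds: midpoints between consecutive distinct values
    for threshold in (values[:-1] + values[1:]) / 2:
        acc = np.mean((x > threshold).astype(int) == y)
        if acc > best_accuracy:
            best_threshold, best_accuracy = threshold, acc
    return best_threshold, best_accuracy

# Made-up feature (e.g. a count of positive words) and binary labels
x = np.array([0, 1, 1, 2, 4, 5, 6, 7])
y = np.array([0, 0, 0, 0, 1, 1, 1, 0])

threshold, acc = fit_stump(x, y)
print(threshold, acc)  # 3.0 0.875
```

No single cut can get the last point right, so the stump tops out below 100% even on its own training data. That is the price of a very general rule.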

### Naive Bayes

Enough with the trees! Our next model, Naive Bayes, uses Bayes' theorem to pick the most probable rating for an input given the training data, under the "naive" assumption that the features are independent of one another within each class. This is a good general-purpose model, but it may have limited success on our bag-of-words features. 
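
A hand-rolled sketch of the Gaussian variant may help show what the model computes. The helper names and the toy data are assumptions made for this example, and the sketch leaves out refinements that `GaussianNB` includes.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate a prior, per-feature mean and per-feature variance for each class."""
    params = {}
    for label in np.unique(y):
        rows = X[y == label]
        params[label] = (
            len(rows) / len(X),          # prior P(class)
            rows.mean(axis=0),           # feature means
            rows.var(axis=0) + 1e-9,     # feature variances (smoothed to avoid division by zero)
        )
    return params

def predict_gaussian_nb(params, X):
    """Pick the class with the highest log posterior, treating features as independent."""
    labels = list(params)
    scores = []
    for label in labels:
        prior, mean, var = params[label]
        log_likelihood = -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mean) ** 2 / var, axis=1)
        scores.append(np.log(prior) + log_likelihood)
    return np.array(labels)[np.argmax(np.array(scores), axis=0)]

# Made-up 2-feature data: class 1 clusters near (0, 0), class 5 near (4, 4)
X_toy = np.array([[0.0, 0.2], [0.3, -0.1], [-0.2, 0.1], [4.0, 4.1], [3.8, 4.2], [4.3, 3.9]])
y_toy = np.array([1, 1, 1, 5, 5, 5])

params = fit_gaussian_nb(X_toy, y_toy)
toy_predictions = predict_gaussian_nb(params, np.array([[0.1, 0.0], [4.1, 4.0]]))
print(toy_predictions)  # [1 5]
```

The sum over features inside `log_likelihood` is where the independence assumption enters: each feature contributes on its own, with no interaction terms.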

In [0]:
from sklearn.naive_bayes import GaussianNB

eval_model = GaussianNB()
results = evaluate_cutoffs(reviews_bow, reviews_dict, y, range(20,420,20), eval_model)

The results just got worse and worse with more components, never doing better than around 38%. I'd say this is definitely not the direction to go. 

### Perceptron

A perceptron is like a two-layer Neural Network (NN) with just input nodes and an output node, and no hidden layers in between. It is a simple yet powerful ML model. Let's see how it does!
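
The learning rule behind a perceptron is short enough to sketch in NumPy. This covers the binary case only, on made-up data, with hypothetical helper names. It is not scikit-learn's implementation.

```python
import numpy as np

def train_perceptron(X, y, epochs=20):
    """Classic perceptron rule for labels in {-1, +1}: update only on mistakes."""
    weights = np.zeros(X.shape[1])
    bias = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (weights @ x_i + bias) <= 0:   # misclassified (or on the boundary)
                weights += y_i * x_i
                bias += y_i
    return weights, bias

def predict_perceptron(weights, bias, X):
    """Output +1 on the positive side of the learned boundary, -1 otherwise."""
    return np.where(X @ weights + bias > 0, 1, -1)

# Made-up, linearly separable data: logical OR with labels -1 / +1
X_or = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_or = np.array([-1, 1, 1, 1])

weights, bias = train_perceptron(X_or, y_or)
or_predictions = predict_perceptron(weights, bias, X_or)
print(or_predictions)  # reproduces y_or
```

Each input node corresponds to one entry of `weights`, and the single output node is the sign of the weighted sum.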

In [0]:
from sklearn.linear_model import Perceptron

eval_model = Perceptron(tol=1e-3, random_state=0)
results = evaluate_cutoffs(reviews_bow, reviews_dict, y, range(20,420,20), eval_model)

The perceptron takes longer to converge, leveling off at around 100 components. However, the accuracy improves to 70+%. Best so far! Can we do even better? 

### Neural Network

Let's try a neural network trained with stochastic gradient descent, with two hidden layers of 20 neurons each. A neural network is comparable in architecture to a perceptron, but with additional hidden layers that allow for more nuanced hypotheses. Maybe the additional nuance will improve our model further. 
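
To picture the architecture, here is a sketch of the forward pass through a network with two hidden layers of 20 neurons. The weights are random and untrained, the input size of 100 is an assumption, and ReLU is used because it is the default activation of `MLPClassifier`. Training by gradient descent is not shown.

```python
import numpy as np

rng = np.random.default_rng(1)

n_features, n_hidden, n_classes = 100, 20, 5

# Randomly initialised (untrained) weights and biases for each layer
W1, b1 = rng.normal(scale=0.1, size=(n_features, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(scale=0.1, size=(n_hidden, n_hidden)), np.zeros(n_hidden)
W3, b3 = rng.normal(scale=0.1, size=(n_hidden, n_classes)), np.zeros(n_classes)

def relu(z):
    return np.maximum(z, 0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract the row max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=1, keepdims=True)

def forward(X):
    """Input -> hidden layer 1 -> hidden layer 2 -> class probabilities."""
    h1 = relu(X @ W1 + b1)
    h2 = relu(h1 @ W2 + b2)
    return softmax(h2 @ W3 + b3)

# Made-up batch of 3 reviews, each described by 100 components
X_batch = rng.normal(size=(3, n_features))
probs = forward(X_batch)
print(probs.shape)         # (3, 5)
print(probs.sum(axis=1))   # each row sums to 1
```

Dropping the two hidden layers would leave a single weighted sum per class, which is essentially the perceptron's structure.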

In [0]:
from sklearn.neural_network import MLPClassifier

eval_model = MLPClassifier(solver='sgd', alpha=1e-5, hidden_layer_sizes=(20, 20), random_state=1)
results = evaluate_cutoffs(reviews_bow, reviews_dict, y, range(20,420,20), eval_model)

Here our accuracy is even better than with the previous model! There is a plateau starting around 100 components, with an accuracy close to 80% at around 200 components.

### Conclusion

Of the models presented, the most accurate was the Neural Network. The accuracy was close to 80%, and converged at around 100 components. 

**Is Accuracy all that matters?**

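Probably not. Accuracy is a single number and says nothing about which ratings get confused with which. In an imbalanced dataset, say one where 80% of the ratings were 5 stars, a model that always predicts 5 stars would reach 80% accuracy without learning anything. The labels we built above are balanced, with the same number of reviews per rating, so that particular failure is unlikely here. Still, we can get a better idea of the model's performance by looking at its confusion matrix. 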
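
The always-predict-5-stars scenario is easy to reproduce with made-up labels. The label counts below are invented for illustration and do not describe our dataset.

```python
from collections import Counter

# Made-up, imbalanced labels: 80 five-star reviews and 5 of each other rating
y_true = [5] * 80 + [1] * 5 + [2] * 5 + [3] * 5 + [4] * 5

# A "model" that ignores its input and always predicts the most common rating
majority_rating = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_rating] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(majority_rating, accuracy)  # 5 0.8
```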

In [0]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix

#Train model
eval_model = MLPClassifier(solver='sgd', alpha=1e-5, hidden_layer_sizes=(20, 20), random_state=1)
model = LsiModel(reviews_bow, num_topics=100, id2word=reviews_dict)
V = densify(model[reviews_bow], len(model.projection.s))
X_train, X_test, y_train, y_test = ms.train_test_split(V, y, test_size=0.2, random_state = 1911)    
eval_model.fit(X_train, y_train)

#Predict the outcomes
y_pred = eval_model.predict(X_test)

#Print confusion matrix
print(confusion_matrix(y_test, y_pred))

How can this be interpreted? Entry [i,j] is the number of times actual rating i was predicted to be rating j, with rows and columns ordered from 1 star to 5 stars. This means the diagonal holds the correct predictions. We have much higher values on our diagonal than elsewhere, meaning we can be more confident that our results are indeed good!
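
To check this reading of the matrix, here is a sketch that builds one by hand for six made-up predictions. The `confusion` helper is hypothetical and only mirrors the row/column convention described above.

```python
import numpy as np

def confusion(y_true, y_pred, labels):
    """Rows are actual ratings, columns are predicted ratings."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = np.zeros((len(labels), len(labels)), dtype=int)
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual], index[predicted]] += 1
    return matrix

# Made-up labels for six reviews
actual    = [1, 1, 3, 5, 5, 5]
predicted = [1, 2, 3, 5, 5, 4]

cm = confusion(actual, predicted, labels=[1, 2, 3, 4, 5])
print(cm)
print(np.trace(cm) / cm.sum())   # share of predictions on the diagonal, i.e. accuracy
```

The two off-diagonal entries are the 1-star review predicted as 2 stars and the 5-star review predicted as 4 stars. Both are near misses, a detail that accuracy alone cannot show.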