The following is the final report written by H. Passmore for the Springboard Career Track Capstone 1: Amazon Book Reviews & Ratings Predictor. 

# Final Report for Capstone 1:  Rating Predictor
_Amazon Book Reviews & Ratings Predictor_

_March 2018_
***

### Table of Contents
1. Introduction and Objectives
2. Client Profile and Motivation
3. Data Acquisition  
    3.1. Consumer reviews and ratings  
    3.2. Genre-specific ISBN codes
4. Data Wrangling
5. Exploratory Data Analysis and Inferential Statistics  
    5.1. Exploring patterns and content of reviews and ratings  
    5.2. Grouping reviews into low and high ratings
6. Machine Learning  
    6.1. Text Preprocessing  
    6.2. Vectorizer Selection  
    6.3. Model Comparison
7. Results Summary
8. Recommendations for review platforms and future directions
9. Sources

### 1. Introduction and Objectives

When consumers consider purchasing a product, they often turn to reviews and ratings submitted by other customers to determine if the purchase is worthwhile. Conversely, retailers depend on honest and accurate reviews and ratings to ensure subsequent buyers can make informed purchases. Like business ratings, product ratings and reviews also affect sales. Therefore, accurate and error-free reviews and ratings are extremely valuable to retailers. The sentiment captured in the text of a review should be reflected in the star rating. One-star ratings potentially have a big negative effect on sales, so retailers need tools to flag incongruous reviews and ratings that may indicate user error. Similarly, high ratings paired with scathing review text may indicate errors or other issues with the product or review system. Can we predict whether ratings are high or low based on review features? I used Natural Language Processing methods and fit machine learning algorithms to training data to predict high and low ratings in testing data. With scikit-learn's feature selection tools I identified text features most associated with high and low ratings.

Both consumers and vendors depend on reviews and ratings to make informed decisions about purchases and to help with sales. Positive and negative ratings and reviews help buyers and sellers know what to spend money on and what products to avoid. Errors and inconsistencies in these assessments can directly affect sales and customer satisfaction. Here I use features of consumer book review text to determine if reviews can predict ratings. Being able to predict ratings based on review features has multiple benefits for potential clients: 1) catch errors by reviewers where they accidentally selected the wrong number of stars, 2) suggest ratings when reviewers do not provide a star rating along with their review, 3) flag confusing/incongruous review-rating pairs for revision (by reviewer) or so that they are not featured first in review lists, and potentially 4) identify and flag reviews and ratings that are ‘fake’ or jokes based on the text of the review.

![AmazonChemistryReviews](AmazonChemistryReviews.png)

__Figure 1.__ Amazon reviews for books may address the content of the book, or just the purchasing experience of the buyer. My goal through natural language processing and machine learning is to tune a classification algorithm to predict, based on the content of a review, whether the customer rates the book with a high rating (5-star) or a low rating (4 or fewer stars).

### 2. Client Profile and Motivation

My preliminary clients are retailers. From bookstores to toy stores and to large online retailers of many categories, retailers depend on consumer reviews to 1) make decisions on what products to purchase for resale and 2) to promote sales from their platform. The machine learning algorithm can be used by retailers internally or as part of their review platform used by consumers. I envision a review platform that facilitates consumer review writing. This platform could incorporate a text editor (like Grammarly) to help reviewers craft clear and effective reviews in addition to suggesting a rating level based on the specific rating system of a given platform. Together these features will help reviewers communicate more clearly and select corresponding ratings more consistently. The algorithm parameters would be tuned and adjusted based on the product categories and also based on prior customer input.

Ultimately, my machine-learning algorithm that predicts high and low ratings from review text features can be utilized for any product-category or business. Further, interpreting the sentiment of consumer input has value beyond rating systems. Businesses benefit from understanding customer responses to products, interactions with customer service, assessments of online resources, and many other customer-business interactions. A system that identifies positive and negative feedback from potential or actual customers can give businesses the power to intercede and to improve customer engagement and satisfaction.

During my Capstone project I acquired a large dataset of reviews and ratings, subsetted a specific genre of book reviews, explored quantitative and qualitative patterns within reviews and ratings, preprocessed and cleaned review text, grouped reviews into binary ratings categories, and fit machine learning algorithms to subsetted training data in order to test algorithms and parameters on testing data. These initial steps are the foundation for further development of tools to facilitate effective review writing and consistent rating assignments by consumers.

### 3. Data Acquisition

__3.1. Consumer reviews and ratings.__ My source dataset has over 22 million book reviews from Amazon.com from May 1996 to July 2014. These reviews are made available by Julian McAuley, UCSD professor of Computer Science (McAuley et al. 2015; He & McAuley 2016). For this project, I accessed a subset of all book reviews within a specific genre to train and test the algorithm.

* J. McAuley’s main page: http://cseweb.ucsd.edu/~jmcauley/
* Amazon Review Data links: http://jmcauley.ucsd.edu/data/amazon/ Data files with all reviews are only available from Julian McAuley by request.
* Count of reviews_Books records: 22,507,155
* Count of 5.0 rated reviews: 13,886,788
* For books with 10-digit International Standard Book Number (ISBN), the ASIN and the ISBN are the same.

__3.2. Genre-specific ISBN codes.__ I queried the Google Books API using a variety of science topic query terms to build a list of ISBN codes to match the ASIN codes in the Amazon review database. Google Books Developers resources: https://developers.google.com/books/docs/v1/getting_started
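A minimal sketch of how one such query could be built and how ISBN-10 codes could be pulled out of a response. The endpoint and the `volumeInfo.industryIdentifiers` structure follow the public Google Books API; the helper names, the paging parameters chosen, and the sample response values are illustrative assumptions, not the code used for this project.

```python
from urllib.parse import urlencode

BOOKS_API = "https://www.googleapis.com/books/v1/volumes"


def build_query_url(topic, start_index=0, max_results=40):
    """Build a volumes search URL for one science topic (hypothetical helper)."""
    params = {
        "q": f"science {topic} nonfiction",  # urlencode turns spaces into '+'
        "startIndex": start_index,
        "maxResults": max_results,  # the API allows at most 40 per request
    }
    return f"{BOOKS_API}?{urlencode(params)}"


def extract_isbn10(response):
    """Collect (title, ISBN_10) pairs from a parsed volumes response."""
    rows = []
    for item in response.get("items", []):
        info = item.get("volumeInfo", {})
        for ident in info.get("industryIdentifiers", []):
            if ident.get("type") == "ISBN_10":
                rows.append((info.get("title"), ident["identifier"]))
    return rows


# Hand-written response fragment with placeholder values (not real API output).
sample = {
    "items": [
        {"volumeInfo": {
            "title": "Example Title A",
            "industryIdentifiers": [
                {"type": "ISBN_13", "identifier": "9780000000002"},
                {"type": "ISBN_10", "identifier": "0000000000"},
            ],
        }},
        {"volumeInfo": {"title": "Example Title B"}},  # no identifiers listed
    ]
}

print(build_query_url("biology"))
print(extract_isbn10(sample))
```

Volumes without an ISBN-10 are skipped, which matters here because only 10-digit codes can be matched to ASIN values.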

### 4. Data Wrangling
Accessing and subsetting the large file of Amazon Review data required several data wrangling steps before the data were ready for exploratory data analysis in Python.

1. Use the Google Books API to query for science textbooks and non-fiction science books and their 10-digit ISBN codes. Standard ISBN-10 codes are the same as 'ASIN' ('Amazon Standard Identification Number') codes in the Book Review dataset. Query terms: 'q=science+[x]+nonfiction' where x on separate API requests was: science, biology, chemistry, physics, astronomy, invertebrate, biochemistry, zoology, math, geology, climate, and cellular. Ultimately my deduplicated, indexed DataFrame contained 'Title', 'Subtitle', 'description', and 'ISBN_10' columns for 3950 science textbooks and non-fiction books. I pickled this DataFrame for subsequent processing with PyMongo.

2. Install MongoDB and Studio 3T to access the JSON as a database file. Then, in Jupyter Notebook with PyMongo use the .find() function to match the review documents I want to use with the list of ISBN codes for science textbooks and non-fiction. 

3. Matching the list of science book ISBN codes with the large Amazon review database resulted in 729 individual titles in the genre "Non-fiction Science and Textbooks". Many books have multiple reviews by different Amazon customers resulting in a final dataset of 11546 reviews.  Some titles were reviewed by over 100 different users. The highest review count per book is 382 with an average of 16 reviews per book.
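The matching step can be sketched as a single `$in` query against the review collection. This is an illustration under assumptions: the field name `asin`, the helper names, and the in-memory `FakeCollection` (a stand-in so the example runs without a MongoDB server) are hypothetical. With PyMongo, a real collection object would be passed in its place.

```python
from collections import Counter


def find_genre_reviews(collection, isbn_codes):
    """Return review documents whose 'asin' appears in the ISBN list."""
    query = {"asin": {"$in": list(isbn_codes)}}  # MongoDB membership filter
    return list(collection.find(query))


def reviews_per_book(reviews):
    """Count how many reviews each matched title received."""
    return Counter(doc["asin"] for doc in reviews)


class FakeCollection:
    """Stand-in for a PyMongo collection; supports only the $in filter above."""

    def __init__(self, docs):
        self.docs = docs

    def find(self, query):
        wanted = set(query["asin"]["$in"])
        return [d for d in self.docs if d["asin"] in wanted]


# Placeholder documents, not records from the Amazon dataset.
fake_reviews = FakeCollection([
    {"asin": "0000000001", "overall": 5.0, "reviewText": "Great"},
    {"asin": "0000000001", "overall": 3.0, "reviewText": "Okay"},
    {"asin": "0000000002", "overall": 4.0, "reviewText": "Good"},
    {"asin": "B00EXAMPLE", "overall": 1.0, "reviewText": "Not a book"},
])

matched = find_genre_reviews(fake_reviews, ["0000000001", "0000000002"])
counts = reviews_per_book(matched)
print(len(matched), counts.most_common(1))
```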


### 5. Exploratory Data Analysis and Inferential Statistics

__5.1. Exploring patterns and content of reviews and ratings.__ For the purposes of estimating ratings from reviews the most important data fields from the review data set are the ratings assigned by reviewers ('overall') and the text of the associated reviews ('reviewText'). Amazon ratings range from 1 to 5, where 5 is the highest rating. Like the full Amazon book review dataset, the non-fiction science book reviews are dominated by 5-star reviews (66%). 

To explore quantitative features of the review text I estimated the number of words per review. Below are the mean word counts per rating group, where 'count' is the number of reviews per group (Table 1). The highest rating group (overall = 5) has the shortest reviews by word count on average, but the longest single review is also in rating level 5 (max number of words = 5364).

|overall |count     |mean	    |std	    |min	|25%	|50%	|75%	|max   |
| ------ |:---------|:----------|:--------- |:------|:------|:------|:------|:---- |
|1.0	 |605.0	    |193.79  	|311.48 	|2.0	|50.0	|107.0	|226.0	|4027.0|
|2.0	 |405.0	    |183.94 	|253.04 	|5.0	|47.0	|105.0	|221.0	|2703.0|
|3.0	 |843.0	    |161.24 	|209.89 	|4.0	|38.5	|90.0	|188.5	|2326.0|
|4.0	 |2031.0	|146.83 	|185.37 	|1.0	|33.5	|78.0	|190.0	|2041.0|
|5.0	 |7662.0    |112.21 	|169.75 	|1.0	|30.0	|58.0	|126.0	|5364.0|

__Table 1.__ Reviews grouped and tallied by rating category 'overall' and summary statistics for the number of words per review (mean, standard deviation, minimum, 25th percentile, 50th percentile, 75th percentile, and the maximum number of words per review).
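A dependency-free sketch of the kind of per-rating word count summary shown in Table 1. Counting words by splitting on whitespace is an assumption (the exact counting rule is not stated in this report), and the three reviews are made-up placeholders.

```python
from collections import defaultdict
from statistics import mean, median

# Placeholder rows using the dataset's 'overall' and 'reviewText' fields.
reviews = [
    {"overall": 5.0, "reviewText": "Great book"},
    {"overall": 5.0, "reviewText": "Loved every chapter of this one"},
    {"overall": 1.0, "reviewText": "Confusing and poorly organized with many errors throughout"},
]


def word_count(text):
    """Approximate the number of words by splitting on whitespace."""
    return len(text.split())


def summarize_by_rating(rows):
    """Group word counts by rating and compute simple summary statistics."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["overall"]].append(word_count(row["reviewText"]))
    return {
        rating: {
            "count": len(counts),
            "mean": mean(counts),
            "median": median(counts),
            "min": min(counts),
            "max": max(counts),
        }
        for rating, counts in sorted(groups.items())
    }


summary = summarize_by_rating(reviews)
for rating, stats in summary.items():
    print(rating, stats)
```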

The majority of books rated and reviewed in this genre are given the highest rating of five. Additionally, for books in the data set with more than ten reviews the distribution of ratings is also biased towards higher ratings. Books with low average ratings (3-stars) are reviewed by fewer individuals than books with average ratings of 4.5 stars and higher (Figure 2). This pattern is predictable - popular books are generally reviewed by more users and garner higher ratings.

![averagerate](AvgRateBookHist.png)
__Figure 2.__ For books reviewed by more than 10 reviewers the mean ratings are most frequently greater than 4.5. Books with lower ratings are less likely to prompt additional purchases and reviews.

The corpus of book review text contains few surprises in the top 25 most common words (Figure 3). Words about books and reading are among the most common terms (e.g., book, books, read, reading, author). Positive words are also frequent, as expected from a dataset of 66% five-star reviews (e.g., like, great, well, good). Some of the top 25 words are also genre-specific terms related to science, information and the real world. This word frequency graph also reveals that normalization of tokens is not complete at this stage. In later steps of pre-processing the data, I will utilize a text stemmer to keep redundant words like 'book' and 'books' from appearing separately. Although I removed stopwords before creating this word frequency chart, later I will add additional stop words to the list to remove words that do not add meaning or are very frequent across all reviews (e.g., book, books, read, reading). Word frequencies were calculated with NLTK's word_tokenize with English stopwords removed (Loper & Bird 2002).

![frequencygraph](frequencies.png)
__Figure 3.__ The top 25 most frequent words in the book review corpus (stop words removed) include words about books and reading, positive words, and scientific terms. The most frequent word used in 11546 Science Textbook reviews was 'book' which was more than three times more frequent than the next word, 'read' in these reviews. 

__5.2. Grouping reviews into low and high ratings.__ Based on my exploration of the distribution of ratings among the Science Textbook reviews I grouped reviews into two rating categories for all subsequent analysis and modeling. All reviews with ratings of 1 to 4 stars are grouped together as 'low' ratings (n = 3884). Reviews associated with the highest Amazon rating are in the 'high' rating group (n = 7662). To compare and contrast the characteristics of the two rating categories I performed several hypothesis tests.

Review length as measured by word count was lowest for the 5-star rated books and highest for the 1-star rated books. Does this pattern continue for the new binary rating groups? I tested the null hypothesis that word counts are equal between the 'high' and 'low' rating categories. To make comparisons between the rating groups I first log-transformed the word count data to address non-normal distributions and performed an independent two-sample t-test for unequal variances. Results of this test on the log-transformed word counts indicate that the word count of all 1-star through 4-star rated reviews is significantly higher than that of 'five-star' rated reviews (t = 15.5, p < 0.001).

![wordcountboxplot](boxplot.png)
__Figure 4.__ Word counts for reviews with 1 to 4-stars ('not5') are higher than word counts for 5-star ('five') reviews (t = 15.5, p < 0.001). Count values are log transformed.
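For readers who want to see the arithmetic behind a two-sample t-test with unequal variances (Welch's test), here is a standard-library sketch. The word counts are invented placeholders and the natural logarithm is an assumption. SciPy's `scipy.stats.ttest_ind(a, b, equal_var=False)` computes the same statistic and also returns a p-value.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    se2_a = variance(sample_a) / len(sample_a)  # squared standard error of mean a
    se2_b = variance(sample_b) / len(sample_b)  # squared standard error of mean b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(sample_a) - 1) + se2_b ** 2 / (len(sample_b) - 1)
    )
    return t, df


# Placeholder word counts for the 'low' (1 to 4 stars) and 'high' (5 stars) groups.
low_counts = [107, 226, 50, 90, 188]
high_counts = [58, 30, 126, 45, 60]

log_low = [math.log(c) for c in low_counts]
log_high = [math.log(c) for c in high_counts]

t_stat, dof = welch_t(log_low, log_high)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

A positive t here means the first group (low ratings) has the larger mean log word count, which is the direction reported above.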

Other quantifiable differences between low and high ratings include the percentage of review text entered in uppercase. Text written in all capital characters in reviews may reflect strong negative or positive sentiment by the reviewer. I explored the use of uppercase words in reviews using regular expressions to find and then count all words written in all uppercase.  For comparison between rating categories, I calculated the percentage of words in uppercase from the total number of words per review. There are many small values of percent uppercase words for both rating groups - my methods captured words like 'I' and 'A' and individual words written in uppercase for emphasis. When I considered reviews with 25% or more uppercase words there were 32 5-star reviews and only seven reviews of 4-stars or fewer. The statistical comparison between these two groups indicates there is no difference between the five-star and 1 to 4-star percentages of uppercase words. This analysis was indicative of other sub-groups I considered.

![uppercase](uppercase.png)

__Figure 5.__ The percent of words written by reviewers in uppercase is low for most reviews in our genre data. Low percentages are expected for normal text entry because I did not eliminate words like 'I' and 'A' for this calculation. For a small subset of reviews percent of uppercase words was higher, possibly for emphasis of either strongly positive or negative sentiment. The average percentage of uppercase words did not differ significantly for five-star vs. lower rated review categories in this analysis.
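One way to compute the percentage of all-uppercase words with regular expressions is sketched below. The two patterns are assumptions about how a "word" is defined. As in the analysis above, single-letter words such as 'I' and 'A' count as uppercase.

```python
import re

UPPER_WORD = re.compile(r"\b[A-Z]+\b")   # runs of capital letters only
ANY_WORD = re.compile(r"\b[A-Za-z]+\b")  # any purely alphabetical word


def percent_uppercase(text):
    """Percentage of alphabetical words written entirely in uppercase."""
    words = ANY_WORD.findall(text)
    if not words:
        return 0.0
    upper = UPPER_WORD.findall(text)
    return 100.0 * len(upper) / len(words)


print(percent_uppercase("This is a MUST READ"))  # 2 of 5 words are uppercase
print(percent_uppercase("I liked it"))           # 'I' alone counts as uppercase
```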

### 6. Machine Learning
The motivation for machine learning (ML) on the dataset of consumer book ratings and book review text is to classify new review text as either a high rated book (5-star) or low rated book (4 or fewer stars). Following exploratory data analysis and inferential statistical comparisons of ratings and reviews, the first step towards machine learning is to pre-process the text data through normalization methods. Following pre-processing I divided the review data into training and testing datasets. In order for the computer to process and compare elements of high and low rated reviews the text data must be vectorized. Vectorization tools transform the corpus of review text into a document-term matrix with one row per review and one column per vocabulary element. For each review, the count or frequency of each vocabulary element is encoded. Many cells in this matrix will be zero - the rows are sparse vectors. Prior to fitting and comparing different ML algorithms I applied and compared two vectorization methods: CountVectorizer and TfidfVectorizer from the scikit-learn library (Pedregosa et al. 2011). I selected the best performing vectorization method and proceeded to fit, tune and compare algorithms for predicting book ratings on unseen reviews (the test dataset).

#### 6.1. Text Preprocessing
My exploratory data analysis of word frequencies revealed several elements of review text that needed attention before the data were ready for machine learning. Even after removing stop words the most common word in all Science textbook reviews was 'book', which was used about 16,000 more times than the next most common word, 'read'. Neither word has a negative or positive meaning in the context of book reviews, so they could be added to the stop words list. My first preprocessing step was to join 'english' stopwords from the NLTK library with my own short list of stop words ('book', 'books') into a set called 'stopwords'.

To pre-process the raw review input I defined preprocessing functions. The first function cleaned up noise by selecting alphabetical text with regular expressions, converting all remaining words to lowercase, and removing every word found in my user-defined set, stopwords. I applied this function to the DataFrame column of raw review text ('reviewText'); the new preprocessed column ('clean_revs') is still readable, but much cleaner.

|index	|reviewText	                                        |rating_cat	|clean_revs|
|:------|:------------------------------------------------- |:--------- |:---------|
|11541	|Definitely a MUST-READ if you are a home cooki...	|1	|definitely must home cooking enthusiast want g...|
|11542	|Pros: Scientifically informative and solid. Kn...	|0	|pros scientifically informative solid knowing ...|
|11543	|Real fun to read. For everybody that is inters...	|0	|real fun everybody intersted cooking certain p...|
|11544	|This book will teach you the chemical secrets ...	|0	|teach chemical secrets techniques usually used...|
|11545	|I paid more than $30 to buy such a superficial...	|0	|paid buy superficial trivial made big mistake ...|

__Table 2.__ Pandas DataFrame of raw review text, rating category, and review text processed by the first preprocessing function to remove stop words, lowercase all letters, and remove numerals.
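A minimal sketch of a cleaning function along these lines. The short `BASIC_STOPWORDS` set is a stand-in for NLTK's English stop word list (used here so the example runs without downloading NLTK data), and the function name is hypothetical.

```python
import re

# Stand-in for NLTK's English stop words (assumption), plus the custom additions.
BASIC_STOPWORDS = {"a", "an", "and", "are", "i", "if", "is", "it",
                   "more", "such", "than", "the", "this", "to", "you"}
STOPWORDS = BASIC_STOPWORDS | {"book", "books"}


def clean_review(text, stopwords=STOPWORDS):
    """Keep alphabetical text only, lowercase it, and drop stop words."""
    letters_only = re.sub(r"[^A-Za-z]+", " ", text)  # numerals/punctuation -> space
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in stopwords)


raw = "I paid more than $30 to buy such a superficial book!"
print(clean_review(raw))
```

Applied to a pandas column, the same function could be passed to `Series.apply` to produce a cleaned column like 'clean_revs'.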

Final steps towards normalization involved text tokenization and stemming. My second preprocessing function first tokenized the 'clean_revs' column using word_tokenize from NLTK so that the review text becomes a list of strings where each word or token is a string. From NLTK I also imported the SnowballStemmer with 'english' vocabulary. The second preprocessor applied the stemmer to the tokenized words to reduce the dimensions of each review document. The stemmer keeps the 'stem' of words with multiple forms: science, scientific, and scientifically all become 'scienc'. The product is a less readable, but ready for machine learning DataFrame column, 'clean_revs'.
```
0    ['good', 'scienc', 'nerd', 'non', 'scienc', 'a...
1    ['biolog', 'genet', 'enthusiast', 'great', 'of...
2    ['bought', 'daughter', 'borrow', 'frank', 'mcc...
3    ['recommend', 'tour', 'guid', 'ireland', 'extr...
4    ['school', 'recent', 'upgrad', 'chemistri', 'h...
5    ['confus', 'explain', 'poor', 'overal', 'graph...
6    ['lot', 'cool', 'experi', 'complet', 'fun', 'p...
7    ['great', 'condit', 'better', 'expect', 'ship'...
Name: clean_revs, dtype: object
```
__Figure 6.__ Once preprocessing is complete the raw review text is reduced to tokenized, stemmed, lowercase, alphabetical lists of words with no stop words.
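The tokenize-then-stem step can be sketched with NLTK's `SnowballStemmer`. As a simplification, `str.split` stands in for `word_tokenize` here (an assumption that avoids downloading the tokenizer models, and that is reasonable only because the cleaned text is already lowercase and alphabetical). The function name is hypothetical.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")


def tokenize_and_stem(clean_text):
    """Split cleaned review text into tokens and reduce each to its stem."""
    tokens = clean_text.split()  # stand-in for nltk.word_tokenize
    return [stemmer.stem(token) for token in tokens]


stems = tokenize_and_stem("recommend reading chemistry books")
print(stems)
```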

In the final steps before vectorization, I stored the tokenized text in a feature matrix (m x n) called 'X' and the response vector (where 0 is ratings 1 through 4 and 1 is rating level 5) in the m x 1 vector 'y'. I then divided X and y into training and test datasets (X_train, X_test, y_train, y_test) using scikit-learn's train_test_split function. Because the number of reviews in the '0' rating category is much lower than the '1' rating category I stratified the division into training and testing data to ensure that both categories are fairly represented in all subsets.
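A small illustration of a stratified split with `train_test_split`. The 25% test size and the `random_state` value are assumptions for the example (the split proportion is not stated in this section), and the twelve placeholder documents mimic the roughly 2:1 ratio of high to low ratings.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Placeholder documents and labels: 8 'high' (1) and 4 'low' (0) reviews.
X = [f"review text {i}" for i in range(12)]
y = [1] * 8 + [0] * 4

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,    # assumed proportion
    stratify=y,        # keep the class ratio the same in both subsets
    random_state=42,   # arbitrary seed for reproducibility
)

print(Counter(y_train), Counter(y_test))
```

With `stratify=y`, both subsets keep the same two-to-one class ratio as the full data, so the minority 'low' class is not left out of the test set by chance.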

#### 6.2. Vectorizer Selection
Different vectorizers encode text data using different methods. Scikit-learn's CountVectorizer builds a vocabulary based on the tokens in the training data and counts the occurrence of each token, encoded in a sparse vector for each observation (review). CountVectorizer allows us to take a "bag-of-words" approach to text analysis. The vectorized text is analyzed based on word counts, but cannot take text structure into account. A vectorizer that takes into account more information than simple token counts is the TfidfVectorizer, which applies Term Frequency - Inverse Document Frequency weighting to the token counts, with weights learned from the training data tokens. Specifically, Term Frequency is the frequency of a given token within a document while Inverse Document Frequency is used to reduce the effect of tokens that are very common across all documents. The TfidfVectorizer learns both the vocabulary and the inverse document frequency weights. 
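The difference between the two vectorizers can be seen on a toy corpus. This is a sketch with made-up, already-stemmed documents and default parameters. Note that each vectorizer is fit on the training documents only and then reused to transform the test documents, so unseen test tokens are simply ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Placeholder, already-stemmed review text.
train_docs = ["great scienc read", "bore read", "great great experi"]
test_docs = ["great bore unseen"]

# Bag-of-words counts
count_vec = CountVectorizer()
X_train_counts = count_vec.fit_transform(train_docs)  # learn vocabulary + count
X_test_counts = count_vec.transform(test_docs)        # reuse learned vocabulary

# Tf-idf weights (rows are L2-normalized by default)
tfidf_vec = TfidfVectorizer()
X_train_tfidf = tfidf_vec.fit_transform(train_docs)   # learn vocabulary + idf
X_test_tfidf = tfidf_vec.transform(test_docs)

print(sorted(count_vec.vocabulary_))
print(X_train_counts.toarray())
print(X_train_tfidf.toarray().round(2))
```

In the count matrix 'great' scores 2 in the third document. In the tf-idf matrix the same token is down-weighted relative to rarer tokens because it appears in two of the three training documents.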

To compare the effect of these two vectorizers on the book review document vocabularies I instantiated both CountVectorizer and TfidfVectorizer separately and saved the resulting document-term matrices from the fit and transform steps of vectorization on the X_train and X_test matrices.  For the initial comparison, I used the default parameters for both vectorizers. Next, I instantiated a Naive Bayes classifier, MultinomialNB(), and trained it on the X_train document-term matrix and y_train (the corresponding binary rating vector) from both vectorizers. I made class predictions with the Multinomial Naive Bayes classifier using the X_test document-term matrices, calculated scores to compare models, and used GridSearchCV to select the best parameters for the simple Naive Bayes model. For the CountVectorizer features the best MultinomialNB parameters were the defaults, while for the TfidfVectorizer features the best alpha value was 0.1. Updating alpha for the TfidfVectorizer model and re-running the classifier resulted in a higher ROC-AUC score as well as improved prediction counts in the confusion matrix.

|Vectorizer/Classifier: |ngram_range |ROC-AUC |GridSearchCV best params|
|:--------------------- |:------ |:------ |:-----------------------|
|CountVectorizer/MultinomialNB | (1,1) | 0.74 | alpha=1, fit_prior=True|
|TfidfVectorizer/MultinomialNB | (1,1) | 0.75 | alpha=0.1, fit_prior=True|
|CountVectorizer/MultinomialNB | (1,2) | 0.73 | alpha=1, fit_prior=True|
|TfidfVectorizer/MultinomialNB | (1,2) | 0.76 | alpha=0.1, fit_prior=True|

__Table 3.__ Comparison of the predictive performance of Multinomial Naive Bayes models fit with the document-term matrices from two separate vectorizers applied to the vocabulary of book review data. The two vectorizers are also compared with ngram_range set to (1,1) and (1,2). ROC-AUC scores, which measure the area under the Receiver Operating Characteristic curve, are very similar after tuning using the best parameters from GridSearchCV. Overall, the Term Frequency - Inverse Document Frequency vectorization with features from single words and bigrams and a Multinomial Naive Bayes alpha of 0.1 resulted in the highest area under the ROC curve.
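A sketch of tuning `MultinomialNB` with `GridSearchCV` scored by ROC-AUC. The twelve stemmed documents, the parameter grid, and `cv=3` are illustrative assumptions, not the settings behind Table 3.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Placeholder stemmed reviews: six 'high' (1) and six 'low' (0).
docs = [
    "love great scienc read", "great excel read", "love wonder experi",
    "great love learn", "excel wonder scienc", "love great inform",
    "bore disappoint read", "unfortun bore scienc", "disappoint poor explain",
    "bore poor read", "confus disappoint scienc", "poor bore confus",
]
labels = [1] * 6 + [0] * 6

# Unigrams and bigrams, as in the (1,2) rows of the comparison.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

grid = GridSearchCV(
    MultinomialNB(),
    param_grid={"alpha": [0.01, 0.1, 1.0], "fit_prior": [True, False]},
    scoring="roc_auc",  # area under the ROC curve
    cv=3,               # stratified 3-fold cross-validation for classifiers
)
grid.fit(X, labels)

print(grid.best_params_, round(grid.best_score_, 3))
```

The best parameters can then be used to refit the classifier before scoring it on held-out test data.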

The area under the ROC curve is slightly greater using tuned Multinomial Naive Bayes parameters and including bigrams as features. Predictions made with TfidfVectorizer and MultinomialNB improved after GridSearchCV tuning of alpha. In the Model Comparison section below I compare models fit with the document-terms matrix from vectorization with TfidfVectorizer tuned with the following parameters:
```
TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=False, max_df=0.7, max_features=None, min_df=1,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)
```

By viewing feature names with the largest and smallest coefficients resulting from the best Multinomial Naive Bayes model with TfidfVectorization we can interpret term-frequencies and inverse-document-frequencies. The largest tf-idf values are words that were frequent in a review but were not common across all reviews:
```
Largest Coefs: 
['read' 'love' 'great' 'year' 'inform' 'use' 'scienc' 'learn' 'like' 'good']
```
In contrast, features with small coefficients, or small tf-idf values occurred in book reviews often or occurred rarely in long reviews:
```
Smallest Coefs: 
['theolog student' 'escap wild' 'escape' 'escape main' 'live kennedi'
 'eschatolog premillennialist' 'provis articl' 'escap unrel' 'scope pretti'
 'escher draw']
```

The current tuned Multinomial Naive Bayes model over-predicts reviews to be 'high' rated reviews. A test of classification accuracy by the current model on review text including negation terms (e.g., 'not good' or 'not bad') reveals that the model tends to classify almost any text to the 'high' rating category. All of the following simple test reviews are assigned to the 'high' review rating category. With the current model, I could not write a review that was classified in the 'low' rating category (I tried strings of negative words like 'yuck', 'horrible', 'terrible', etc.).

|Sample Review Text: |intended rating |predicted rating |
|:--------------------- |:------ |:------ |
|'This book is not good. I do not recommend that you ever read it.' | 0 | 1 | 
|'This book is not bad at all. I highly recommend this for anyone.' | 1 | 1 |
|'This is a big disappointing bore. I expected more. Unfortunately I was wrong.' | 0 | 1 | 
|'This book is not bad at all. You should read it.' | 1 | 1 |

__Table 4.__ Test sentences for review classification with intended rating (0=low, 1=high) and rating predicted by the current model.

#### 6.3. Model Comparison
Using primarily the document-terms matrix from the TF-IDF vectorization (with the ngram_range set to (1,2)) I fit and tuned four classification algorithms. To compare models I report the area under the ROC curve (scoring = ROC_AUC) for all models (Table 5). For one model, SGDClassifier, I used the document-terms matrix from the CountVectorizer (with ngram_range = (1,2)) in place of TfidfVectorization because the tfidf training vector was not compatible with the algorithm. The SGDClassifier is recommended in lieu of Support Vector Machines for fitting large-scale linear classifiers without having to copy dense numpy C-contiguous double precision arrays as input (Pedregosa et al. 2011). With the exception of LogisticRegressionCV, I implemented GridSearchCV for cross-validation and hyperparameter tuning for all models. The ROC_AUC values in Table 5 are the scores following parameter tuning based on the best parameter output from GridSearchCV. The highest scoring model is LogisticRegressionCV, which tunes the regularization parameter C automatically: the StratifiedKFold cross-validator selects the best value from a grid of Cs values, set by default to ten values on a logarithmic scale between 1e-4 and 1e4 (Pedregosa et al. 2011). The area under the curve of true positive rate vs. false positive rate (0.80) is highest for the LogisticRegressionCV model (Figure 7).


|Vectorizer: |Classifier: |ROC-AUC |GridSearchCV best params|
|:-------------|:--------------------- |:------ |:-----------------------|
|TfidfVectorizer|MultinomialNB | 0.74 | alpha=1, fit_prior=True|
|TfidfVectorizer|RandomForestClassifier | 0.70 | max_features=400, min_samples_leaf=4|
|TfidfVectorizer|LogisticRegressionCV | 0.80 | Cs=10, max_iter=100, tol=0.0001|
|CountVectorizer|SGDClassifier | 0.68 | alpha=0.001|

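__Table 5.__ Comparison of four classifier algorithms. Three were fit with the document-terms matrix from TfidfVectorizer with ngram_range=(1,2); SGDClassifier was fit with the matrix from CountVectorizer. With the exception of LogisticRegressionCV, each model was tuned using GridSearchCV and the best parameters were re-entered.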

![logregroc](LogRegROC8013.png)

__Figure 7.__ Receiver Operating Characteristic (ROC) curve, with an area under the curve of 0.8013, from the tuned LogisticRegressionCV classifier (Cs=10, max_iter=100, tol=0.0001). This model is the highest scoring from my set of comparison models.
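A sketch of fitting `LogisticRegressionCV` on tf-idf features and scoring it by ROC-AUC on held-out text. The documents, labels, `cv=3`, and `scoring="roc_auc"` are assumptions for illustration; `Cs=10`, `max_iter=100`, and `tol=0.0001` mirror the values listed in Table 5.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score

# Placeholder stemmed reviews (1 = 'high', 0 = 'low').
train_docs = [
    "love great scienc read", "great excel read", "love wonder experi",
    "great love learn", "excel wonder scienc", "love great inform",
    "bore disappoint read", "unfortun bore scienc", "disappoint poor explain",
    "bore poor read", "confus disappoint scienc", "poor bore confus",
]
train_y = [1] * 6 + [0] * 6
test_docs = ["love great experi", "bore poor explain",
             "wonder excel read", "confus disappoint read"]
test_y = [1, 0, 1, 0]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

# Cs=10 asks for ten candidate C values on a log scale from 1e-4 to 1e4;
# stratified cross-validation then picks the best one.
clf = LogisticRegressionCV(Cs=10, cv=3, max_iter=100, tol=1e-4, scoring="roc_auc")
clf.fit(X_train, train_y)

# ROC-AUC is computed from predicted probabilities for the 'high' class.
high_proba = clf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(test_y, high_proba)
print(len(clf.Cs_), round(auc, 3))
```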

To explore features of the LogisticRegressionCV model I used the Python package [ELI5](https://eli5.readthedocs.io/en/latest/index.html) to explain weights and predictions of classifiers within scikit-learn (Figure 8). Stemmed tokens with the biggest weights from this model include 'love', 'excel', 'wonder', and 'high recommend'. Some tokens with low weights were obviously negative and may be associated with reviews in the 'low' category: 'unfortun', 'bore', and 'disappoint'. Other tokens with low weights such as 'like', 'good', and 'ok' may be terms that are common in both categories of reviews and therefore are not useful for classification in the high rating class.

![logregFeatures](tfidfLogRegFeatures.png)
__Figure 8.__ For the target class '1', or 'high' rated reviews, the features with the highest weights (in green) and lowest weights (in red) and the associated weight values. This method, from package ELI5, works well with 'bag-of-words' vectorizers and linear classifiers because there is direct mapping between individual words and classifier coefficients.
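The direct mapping between tokens and coefficients can also be inspected without ELI5. This sketch uses a plain `LogisticRegression` on a six-document toy corpus (both are assumptions for illustration) and ranks tokens by their coefficient for the 'high' class.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder stemmed reviews (1 = 'high', 0 = 'low').
docs = ["love great read", "love excel", "wonder love",
        "bore disappoint", "bore unfortun read", "disappoint bore"]
labels = [1, 1, 1, 0, 0, 0]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)

# vocabulary_ maps token -> column index; invert it to look tokens up by column.
index_to_token = {idx: tok for tok, idx in vectorizer.vocabulary_.items()}
weights = clf.coef_[0]  # one weight per column, for class 1 ('high')

ranked = sorted(range(len(weights)), key=lambda i: weights[i])
lowest = [(index_to_token[i], round(float(weights[i]), 3)) for i in ranked[:3]]
highest = [(index_to_token[i], round(float(weights[i]), 3)) for i in ranked[-3:][::-1]]

print("highest:", highest)
print("lowest:", lowest)
```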

To test the LogisticRegressionCV classification model on sample review text I used the same set of four reviews (two negative and two positive) that I tested on with MultinomialNaiveBayes classifier above. In the initial test, all four sample reviews were classified as 'high' rated reviews. With the LogisticRegressionCV classifier, the two ambiguous reviews with negation phrases ('not good' and 'not bad') are both classified as 'high' rated reviews. However, this improved classification model correctly places the very negative review (i.e., 'a big disappointing bore') in the intended 'low' rated review class (Table 6).

|Sample Review Text: |intended rating |predicted rating |
|:--------------------- |:------ |:------ |
|'This book is not good. I do not recommend that you ever read it.' | 0 | 1 | 
|'This book is not bad at all. I highly recommend this for anyone.' | 1 | 1 |
|'This is a big disappointing bore. I expected more. Unfortunately I was wrong.' | 0 | 0 | 
|'This book is not bad at all. You should read it.' | 1 | 1 |

__Table 6.__ Test sentences for review classification with intended rating (0=low, 1=high) and rating predicted by the LogisticRegressionCV model. Unlike Multinomial Naive Bayes and RandomForestClassifier, the LogisticRegressionCV model classified the (to a human) obviously negative review ('This is a big disappointing bore...') as 'low'. The LogisticRegressionCV model still did not distinguish between the more subtle meanings of 'This book is not good' and 'This book is not bad at all' - both were classified as 'high' rated reviews.

__Pipeline Models__
I also wanted to attempt rating classification using a pipeline of vectorizers and classifiers in order to be able to tune parameters in all steps simultaneously. Here I compared Multinomial Naive Bayes, SVC (Support Vector Machines), and Random Forest classifiers with CountVectorizer and TfidfTransformer (Table 7). For some parameter types, I could not get GridSearchCV to run so I tuned over a limited number of parameter ranges. The highest area under the ROC curve (0.80) was for the Support Vector Machines model (SVC).

|Vectorizer/Transformer: |Classifier: |ROC-AUC |GridSearchCV best params|
|:-------------|:--------------------- |:------ |:-----------------------|
|CountVectorizer/TfidfTransformer|MultinomialNB | 0.78 | min_df=4|
|CountVectorizer/TfidfTransformer|SVC (Support Vector Machines) | 0.80 | min_df=4|
|CountVectorizer/TfidfTransformer|RandomForestClassifier | 0.77 | clf_RFC_n_estimators=500|
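
A sketch of the pipeline approach, in which vectorizer and classifier parameters are tuned together by one grid search. The step names (`vect`, `tfidf`, `clf`), the parameter grid, and the toy documents are assumptions for illustration, not the configuration behind the table above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Placeholder stemmed reviews: six 'high' (1) and six 'low' (0).
docs = [
    "love great scienc read", "great excel read", "love wonder experi",
    "great love learn", "excel wonder scienc", "love great inform",
    "bore disappoint read", "unfortun bore scienc", "disappoint poor explain",
    "bore poor read", "confus disappoint scienc", "poor bore confus",
]
labels = [1] * 6 + [0] * 6

pipe = Pipeline([
    ("vect", CountVectorizer()),    # raw text -> token counts
    ("tfidf", TfidfTransformer()),  # counts -> tf-idf weights
    ("clf", MultinomialNB()),       # classifier on tf-idf features
])

# '<step>__<parameter>' addresses a parameter of one pipeline step.
param_grid = {
    "vect__min_df": [1, 2],
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.1, 1.0],
}

grid = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=3)
grid.fit(docs, labels)  # raw text goes in; each fold refits the vectorizer

print(grid.best_params_, round(grid.best_score_, 3))
```

Because the vectorizer sits inside the pipeline, its vocabulary and idf weights are learned only from the training part of each cross-validation fold.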


### 7. Results Summary
The most successful approach to review text classification for this genre-specific subset of Amazon reviews involved Logistic Regression classification with a built-in cross-validation tool. The normalization steps leading to the best model included preprocessing by lowercasing all words, removing numbers and punctuation, removing stop words, word tokenization and stemming all tokens with SnowballStemmer. Document-term matrices were produced with tf-idf vectorization of words and bigrams. 

Applied to the test data, the LogisticRegressionCV classifier (with the regularization parameter C tuned automatically by cross-validation over a grid of Cs=10 candidate values) correctly classified 1743 reviews in the 'high' rating category and 440 reviews in the 'low' rating category. Another 704 reviews were misclassified as false negatives (n=173) or false positives (n=531). These counts were better than those of the other models and gave this model a higher area under the ROC curve (true positive rate vs. false positive rate).
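
As a worked check on these counts, the short calculation below derives the usual summary rates. It assumes 'high' is the positive class (consistent with the 0=low, 1=high coding), so the 440 correctly classified 'low' reviews are the true negatives.

```python
# Counts reported for the test set, taking 'high' as the positive class.
tp, tn = 1743, 440   # correctly classified high / low reviews
fn, fp = 173, 531    # high predicted as low / low predicted as high

total = tp + tn + fn + fp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)               # true positive rate
specificity = tn / (tn + fp)          # true negative rate
false_positive_rate = fp / (fp + tn)

for name, value in [
    ("accuracy", accuracy),
    ("precision", precision),
    ("recall", recall),
    ("specificity", specificity),
    ("false positive rate", false_positive_rate),
]:
    print(f"{name}: {value:.3f}")
```

Under that assumption, accuracy is about 0.756 over 2887 test reviews, recall for 'high' reviews is about 0.910, and specificity is about 0.453. This matches the tendency to classify ambiguous text as 'high'.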

Testing the various models on sample positive and negative review text illustrates that many of the tested algorithms classify the majority of novel review text as 'high' rated reviews. However, the tuned LogisticRegressionCV model appropriately classified very negative review text in the 'low' rating category. This indicates success towards building an accurate classifier for this genre of book reviews. Further adjustments would likely improve the predictive model. In addition, work with separate sets of genre-specific reviews could help me tune and improve this classifier.

The pipeline with Support Vector Machines also performed well at classifying review data. This is a promising model and, with additional parameter tuning, could outperform the LogisticRegressionCV model. However, GridSearchCV ran very slowly, and I was only able to search a limited number of parameters and parameter values at a time.

### 8. Recommendations for review platforms and future directions
My final classification model, LogisticRegression with cross-validation on bag-of-words and bigram features, classifies many positive reviews correctly as 'high' rated and some very negatively worded reviews as 'low' rated. This level of classification accuracy is a useful step towards building tools for product review writing and rating. Many further steps could improve this project and extend it towards practical tools for product reviews. With more time and resources I would address some of these limitations in my next steps.

Next steps & future directions:
* My Amazon review database also includes short 'summary' text along with reviews. These short phrases may be more explicitly focused on the gist of the review. 
    * Test product summary text with the best model from review analysis. Summaries are short and to-the-point. Can we predict rating category with the summary text based on the review vocabulary? Do we improve our prediction success?
* Access another subset of reviews from a different genre. Perform similar steps to build a machine learning algorithm to predict rating category. How different is the new model?
* Cross-test the best models for each genre of reviews on the review text for the opposite set of reviews. Do the models perform better or worse on unique review vocabularies?
* Explore word embeddings with Word2vec to model reviews with two-layer neural networks.
* Re-assign 5-star Amazon review ratings to different binary rating categories. 
    * In an analysis of fine food reviews from Amazon, data scientist Susan Li divided reviews into high (4 & 5-star) and low (1 & 2-star) and removed the 3-star reviews based on the assumption that they are neutral (Li 2017).
    * Building a model based on only very negative (1-star) and very positive (5-star) reviews may lead to more accurate predictions. However, in the large Amazon book review database and in my Science Textbook subset, 1-star reviews are very uncommon. Getting enough of a vocabulary for low-rated reviews would be difficult.
* Explore other modeling methods to increase the predictive power of my rating classifier.
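
The alternative binary groupings mentioned in the list above could be set up along these lines. This is only a sketch: the function names and the sample reviews are hypothetical, and the two schemes shown are the 4 & 5 vs. 1 & 2 split described by Li (2017) and the 5-star vs. 1-star split.

```python
def to_binary_rating(stars, low=(1, 2), high=(4, 5)):
    """Map a star rating to 0 (low) or 1 (high); None means drop the review."""
    if stars in high:
        return 1
    if stars in low:
        return 0
    return None


def relabel(reviews, low=(1, 2), high=(4, 5)):
    """Keep only (text, label) pairs whose star rating is in low or high."""
    labeled = []
    for text, stars in reviews:
        label = to_binary_rating(stars, low, high)
        if label is not None:
            labeled.append((text, label))
    return labeled


# Hypothetical (review text, star rating) pairs.
sample = [
    ("Loved it", 5),
    ("Pretty good", 4),
    ("It was fine", 3),
    ("Weak index", 2),
    ("Full of errors", 1),
]

# High = 4 & 5 stars, low = 1 & 2 stars, 3-star reviews removed.
li_scheme = relabel(sample)

# Only the extremes: 5-star vs. 1-star reviews.
extremes = relabel(sample, low=(1,), high=(5,))

print(li_scheme)
print(extremes)
```

Counting the labels returned by each scheme before modeling would show quickly whether enough low-rated reviews remain to train on.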

### Word Clouds
(coming soon)

### 9. Sources

He, Ruining and Julian McAuley. 2016. Ups and Downs: Modeling the Visual Evolution of Fashion Trends with One-Class Collaborative Filtering. In Proceedings of the 25th International Conference on World Wide Web (WWW '16). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 507-517. DOI: https://doi.org/10.1145/2872427.2883037

Li, Susan. 2017. Scikit-Learn for Text Analysis of Amazon Fine Food Reviews. In 'datascience+', An online community for showcasing R & Python tutorials. https://datascienceplus.com/scikit-learn-for-text-analysis-of-amazon-fine-food-reviews/

Loper, Edward and Steven Bird. 2002. NLTK: the Natural Language Toolkit. In Proceedings of the ACL-02 Workshop on Effective tools and methodologies for teaching natural language processing and computational linguistics - Volume 1 (ETMTNLP '02), Vol. 1. Association for Computational Linguistics, Stroudsburg, PA, USA, 63-70. DOI: https://doi.org/10.3115/1118108.1118117

McAuley, Julian, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. 2015. Image-Based Recommendations on Styles and Substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '15). ACM, New York, NY, USA, 43-52. DOI: http://dx.doi.org/10.1145/2766462.2767755

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. 2011. Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, 12, pp. 2825-2830.