The following is the final report written by H. Passmore for the Springboard Career Track Capstone 1: Amazon Book Reviews & Ratings Predictor. 

# Final Report for Capstone 1:  Rating Predictor
_Amazon Book Reviews & Ratings Predictor_

_May 2018_
***

![high_rate_cloud](Graphics/high_rate_cloud.png)
_Word Cloud of 1000 stemmed words from vocabulary of Amazon Science Textbook reviews with __'high'__ ratings._

### Table of Contents
1. Introduction and Objectives
2. Client Profile and Motivation
3. Data Acquisition  
    3.1. Consumer reviews and ratings  
    3.2. Genre-specific ISBN codes
4. Data Wrangling
5. Exploratory Data Analysis and Inferential Statistics  
    5.1. Exploring patterns and content of reviews and ratings  
    5.2. Grouping reviews into low and high ratings
6. Machine Learning  
    6.1. Text Preprocessing  
    6.2. Vectorizer Selection  
    6.3. Model Comparison
7. Results Summary
8. Recommendations for review platforms and future directions
9. Sources

### 1. Introduction and Objectives

When consumers consider purchasing a product, they often turn to reviews and ratings submitted by other customers to determine if the purchase is worthwhile. Retailers, in turn, depend on honest and accurate reviews and ratings to ensure subsequent buyers can make informed purchases. Like business ratings, product ratings and reviews also affect sales. Therefore, accurate and error-free reviews and ratings are extremely valuable to retailers. The sentiment captured in the text of a review should be reflected in the star rating. One-star ratings potentially have a big negative effect on sales, so retailers need tools to flag incongruous reviews and ratings that may indicate user error. Similarly, high ratings paired with scathing review text may indicate errors or other issues with the product or review system. Can we predict whether ratings are high or low based on review features? I used Natural Language Processing methods and fit machine learning algorithms to training data to predict high and low ratings in testing data. With scikit-learn's feature selection tools I identified text features most associated with high and low ratings.

Both consumers and vendors depend on reviews and ratings to make informed decisions about purchases and to help with sales. Positive and negative ratings and reviews help buyers and sellers know what to spend money on and what products to avoid. Errors and inconsistencies in these assessments can directly affect sales and customer satisfaction. Here I use features of consumer book review text to determine if reviews can predict ratings. Being able to predict ratings based on review features has multiple benefits for potential clients: 1) catch errors by reviewers where they accidentally selected the wrong number of stars, 2) suggest ratings when reviewers do not provide a star rating along with their review, 3) flag confusing/incongruous review-rating pairs for revision (by reviewer) or so that they are not featured first in review lists, and potentially 4) identify and flag reviews and ratings that are ‘fake’ or jokes based on the text of the review.

![AmazonChemistryReviews](Graphics/AmazonChemistryReviews.png)

__Figure 1.__ Amazon reviews for books may address the content of the book, or just the purchasing experience of the buyer. My goal through natural language processing and machine learning is to tune a classification algorithm to predict, based on the content of a review, whether the customer rates the book with a high rating (5-star) or a low rating (4 or fewer stars).

### 2. Client Profile and Motivation

My preliminary clients are retailers. From bookstores to toy stores and to large online retailers of many categories, retailers depend on consumer reviews to 1) make decisions on what products to purchase for resale and 2) to promote sales from their platform. The machine learning algorithm can be used by retailers internally or as part of their review platform used by consumers. I envision a review platform that facilitates consumer review writing. This platform could incorporate a text editor (like Grammarly) to help reviewers craft clear and effective reviews in addition to suggesting a rating level based on the specific rating system of a given platform. Together these features will help reviewers communicate more clearly and select corresponding ratings more consistently. The algorithm parameters would be tuned and adjusted based on the product categories and also based on prior customer input.

Ultimately, my machine-learning algorithm that predicts high and low ratings from review text features can be utilized for any product-category or business. Further, interpreting the sentiment of consumer input has value beyond rating systems. Businesses benefit from understanding customer responses to products, interactions with customer service, assessments of online resources, and many other customer-business interactions. A system that identifies positive and negative feedback from potential or actual customers can give businesses the power to intercede and to improve customer engagement and satisfaction.

For my Capstone project I acquired a large dataset of reviews and ratings, subsetted a specific genre of book reviews, explored quantitative and qualitative patterns within reviews and ratings, preprocessed and cleaned review text, grouped reviews into binary ratings categories, and fit machine learning algorithms to subsetted training data in order to test algorithms and parameters on testing data. These initial steps are the foundation for further development of tools to facilitate effective review writing and consistent rating assignments by consumers.

### 3. Data Acquisition

__3.1. Consumer reviews and ratings.__ My source dataset has over 22 million book reviews from Amazon.com spanning May 1996 to July 2014. These reviews are made available by Julian McAuley, UCSD professor of Computer Science (McAuley et al. 2015; He & McAuley 2016). For this project, I accessed a subset of all book reviews within a specific genre to train and test the algorithm.

* J. McAuley’s main page: http://cseweb.ucsd.edu/~jmcauley/
* Amazon Review Data links: http://jmcauley.ucsd.edu/data/amazon/ Data files with all reviews are only available from Julian McAuley by request.
* Count of reviews_Books records: 22,507,155
* Count of 5.0 rated reviews: 13,886,788
* For books with 10-digit International Standard Book Number (ISBN), the ASIN and the ISBN are the same.

__3.2. Genre-specific ISBN codes.__ I queried the Google Books API using a variety of science topic query terms to build a list of ISBN codes to match the ASIN codes in the Amazon review database. Google Books Developers resources: https://developers.google.com/books/docs/v1/getting_started

### 4. Data Wrangling
Accessing and subsetting the large file of Amazon Review data required several data wrangling steps before the data were ready for exploratory data analysis in Python.

1. Use the Google Books API to query for science textbooks and non-fiction science books and their 10-digit ISBN codes. Standard ISBN-10 codes are the same as 'ASIN' ('Amazon Standard Identification Number') codes in the Book Review dataset. Query terms: 'q=science+[x]+nonfiction' where x on separate API requests was: science, biology, chemistry, physics, astronomy, invertebrate, biochemistry, zoology, math, geology, climate, and cellular. Ultimately my deduplicated, indexed DataFrame contained 'Title', 'Subtitle', 'description', and 'ISBN_10' columns for 3950 Science Textbook and non-fiction books. I pickled this DataFrame for subsequent processing with PyMongo.

2. Install MongoDB and Studio 3T to access the JSON as a database file. Then, in Jupyter Notebook with PyMongo use the .find() function to match the review documents I want to use with the list of ISBN codes for science textbooks and non-fiction. 

3. Matching the list of science book ISBN codes with the large Amazon review database resulted in 729 individual titles in the genre "Non-fiction Science and Textbooks". Many books have multiple reviews by different Amazon customers resulting in a final dataset of 11546 reviews.  Some titles were reviewed by over 100 different users. The highest review count per book is 382 with an average of 16 reviews per book.
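
The wrangling steps above can be sketched in plain Python. This is a minimal illustration, not the code used for the project: the `startIndex` and `maxResults` values, the helper names, and the toy records are assumptions, and the lowercase `asin` field name is assumed from the review dataset's JSON layout. No network request is made here.

```python
from urllib.parse import urlencode

# Public Google Books volumes endpoint; paging values below are illustrative.
BASE_URL = "https://www.googleapis.com/books/v1/volumes"
TOPICS = ["science", "biology", "chemistry", "physics", "astronomy",
          "invertebrate", "biochemistry", "zoology", "math", "geology",
          "climate", "cellular"]


def build_query_url(topic, start_index=0, max_results=40):
    """Return a search URL of the form q=science+<topic>+nonfiction."""
    params = {
        "q": f"science {topic} nonfiction",  # spaces are encoded as '+'
        "startIndex": start_index,
        "maxResults": max_results,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def extract_isbn10(item):
    """Pull the ISBN-10 out of one volume item, or None if it has none."""
    identifiers = item.get("volumeInfo", {}).get("industryIdentifiers", [])
    for ident in identifiers:
        if ident.get("type") == "ISBN_10":
            return ident.get("identifier")
    return None


def match_reviews(reviews, isbn_codes):
    """Keep only reviews whose 'asin' appears in the ISBN-10 list."""
    wanted = set(isbn_codes)
    return [r for r in reviews if r.get("asin") in wanted]


# Toy records with made-up identifiers, standing in for API and review data.
toy_item = {"volumeInfo": {"title": "Toy Chemistry",
                           "industryIdentifiers": [
                               {"type": "ISBN_13", "identifier": "9780000000002"},
                               {"type": "ISBN_10", "identifier": "0000000002"}]}}
toy_reviews = [{"asin": "0000000002", "overall": 5.0},
               {"asin": "B00XXXXXXX", "overall": 3.0}]

urls = [build_query_url(topic) for topic in TOPICS]
isbn = extract_isbn10(toy_item)
matched = match_reviews(toy_reviews, [isbn])
```

With the reviews loaded into MongoDB, the same membership filter could be expressed as a PyMongo query along the lines of `collection.find({"asin": {"$in": isbn_list}})`, where `collection` and `isbn_list` are placeholder names.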


### 5. Exploratory Data Analysis and Inferential Statistics

__5.1. Exploring patterns and content of reviews and ratings.__ For the purposes of estimating ratings from reviews the most important data fields from the review data set are the ratings assigned by reviewers ('overall') and the text of the associated reviews ('reviewText'). Amazon ratings range from 1 to 5, where 5 is the highest rating. Like the full Amazon book review dataset, the non-fiction science book reviews are dominated by 5-star reviews (66%). 

To explore quantitative features of the review text I estimated the number of words per review. Table 1 shows the mean number of words per rating group, where 'count' is the number of reviews per group. The highest rating group (overall = 5) has the shortest reviews by word count on average, but the longest single review is also in rating level 5 (max number of words = 5364).

|overall |count     |mean	    |std	    |min	|25%	|50%	|75%	|max   |
| ------ |:---------|:----------|:--------- |:------|:------|:------|:------|:---- |
|1.0	 |605.0	    |193.79  	|311.48 	|2.0	|50.0	|107.0	|226.0	|4027.0|
|2.0	 |405.0	    |183.94 	|253.04 	|5.0	|47.0	|105.0	|221.0	|2703.0|
|3.0	 |843.0	    |161.24 	|209.89 	|4.0	|38.5	|90.0	|188.5	|2326.0|
|4.0	 |2031.0	|146.83 	|185.37 	|1.0	|33.5	|78.0	|190.0	|2041.0|
|5.0	 |7662.0    |112.21 	|169.75 	|1.0	|30.0	|58.0	|126.0	|5364.0|

__Table 1.__ Reviews grouped and tallied by rating category 'overall' and summary statistics for the number of words per review (mean, standard deviation, minimum, 25th percentile, 50th percentile, 75th percentile, and the maximum number of words per review).
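
A grouped word-count summary like Table 1 can be approximated with a few lines of standard-library Python. This sketch assumes words are counted by splitting on whitespace, which may differ from the counting used for the table, and the three toy reviews are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def word_count(text):
    """Estimate the number of words by splitting on whitespace."""
    return len(text.split())


def mean_words_by_rating(reviews):
    """Map each 'overall' rating to the mean word count of its reviews."""
    groups = defaultdict(list)
    for review in reviews:
        groups[review["overall"]].append(word_count(review["reviewText"]))
    return {rating: mean(counts) for rating, counts in sorted(groups.items())}


# Invented reviews using the dataset's field names.
toy_reviews = [
    {"overall": 5.0, "reviewText": "Great book"},
    {"overall": 5.0, "reviewText": "Loved every chapter of it"},
    {"overall": 1.0, "reviewText": "Confusing text and poor graphics throughout the book"},
]

summary = mean_words_by_rating(toy_reviews)
```

In pandas, a comparable table would typically come from computing a word-count column and calling `groupby('overall')` followed by `describe()`.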

The majority of books rated and reviewed in this genre are given the highest rating of five. Additionally, for books in the data set with more than ten reviews the distribution of ratings is also skewed towards higher ratings. Books with low average ratings (3-stars) are reviewed by fewer individuals than books with average ratings of 4.5 stars and higher (Figure 2). This pattern is predictable - popular books are generally reviewed by more users and garner higher ratings.

![averagerate](Graphics/AvgRateBookHist.png)
__Figure 2.__ For books reviewed by more than 10 reviewers the mean ratings are most frequently greater than 4.5. Books with lower ratings are less likely to prompt additional purchases and reviews.

The corpus of book review text contains few surprises in the top 25 most common words (Figure 3). Words about books and reading are among the most common terms (e.g., book, books, read, reading, author). Positive words are also frequent, as expected from a dataset of 66% five-star reviews (e.g., like, great, well, good). Some of the top 25 words are also genre-specific terms related to science, information and the real world. This word frequency graph also reveals that normalization of tokens is not complete at this stage. In later pre-processing steps, I use a text stemmer so that redundant forms like 'book' and 'books' no longer appear separately. Although I removed stop words before creating this word frequency chart, I later add stop words to the list to remove words that do not add meaning or are very frequent across all reviews (e.g., book, books, read, reading). Word frequencies were calculated with NLTK's word_tokenize with English stop words removed (Loper & Bird 2002).

![frequencygraph](Graphics/frequencies.png)
__Figure 3.__ The top 25 most frequent words in the book review corpus ('english' stop words removed) include words about books and reading, positive words, and scientific terms. The most frequent word used in 11546 Science Textbook reviews was 'book' which was more than three times more frequent than the next word, 'read' in these reviews. 

__5.2. Grouping reviews into low and high ratings.__ Based on my exploration of the distribution of ratings among the Science Textbook reviews I grouped reviews into two rating categories for all subsequent analysis and modeling. All reviews with ratings of 1 to 4 stars are grouped together as 'low' ratings (n = 3884). Reviews associated with the highest Amazon rating are in the 'high' rating group (n = 7662). To compare and contrast the characteristics of the two rating categories I performed several hypothesis tests.

Review length as measured by word count was lowest for 5-star reviews and highest for 1-star reviews. Does this pattern continue for the new binary rating groups? I tested the null hypothesis that word counts are equal between the 'high' and 'low' rating categories. To compare the rating groups I first log-transformed the word count data to address non-normal distributions and then performed an independent two-sample t-test for unequal variances (Welch's t-test). Results of this test on the log-transformed word counts indicate that the word count of 1-star through 4-star reviews is significantly higher than that of 5-star reviews (t = 15.5, p < 0.001).

![wordcountboxplot](Graphics/boxplot.png)
__Figure 4.__ Word counts for reviews with 1 to 4-stars ('not5') are higher than word counts for 5-star ('five') reviews (t = 15.5, p < 0.001). Count values are log transformed.
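
For readers who want to see the mechanics of the unequal-variance t-test on log-transformed counts, here is a small standard-library sketch. The word counts are toy values chosen for illustration, not the project data, and the function name is my own.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors, using the sample (n - 1) variance.
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(se_a + se_b)
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t_stat, dof


# Toy word counts (illustrative only); zero-length reviews would need
# handling before taking logs.
low_counts = [107, 226, 90, 188, 78, 190]
high_counts = [30, 58, 126, 45, 60, 52]

log_low = [math.log(c) for c in low_counts]
log_high = [math.log(c) for c in high_counts]

t_stat, dof = welch_t(log_low, log_high)
```

A positive `t_stat` here means the first group has the larger mean log word count. With SciPy available, `scipy.stats.ttest_ind(log_low, log_high, equal_var=False)` performs the same test and also returns the p-value.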

Other quantifiable differences between low and high ratings include the percentage of review text entered in uppercase. Text written in all capital characters in reviews may reflect strong negative or positive sentiment by the reviewer. I explored the use of uppercase words in reviews using regular expressions to find and then count all words written in all uppercase.  For comparison between rating categories, I calculated the percentage of words in uppercase from the total number of words per review. There are many small values of percent uppercase words for both rating groups - my methods captured words like 'I' and 'A' and individual words written in uppercase for emphasis. When I considered reviews with 25% or more uppercase words there were 32 5-star reviews and only seven reviews of 4-stars or fewer. The statistical comparison between these two groups indicates there is no difference between the five-star and 1 to 4-star percentages of uppercase words. This analysis was indicative of other sub-groups I considered.

![uppercase](Graphics/uppercase.png)

__Figure 5.__ The percent of words written by reviewers in uppercase is low for most reviews in our genre data. Low percentages are expected for normal text entry because I did not eliminate words like 'I' and 'A' for this calculation. For a small subset of reviews percent of uppercase words was higher, possibly for emphasis of either strongly positive or negative sentiment. The average percentage of uppercase words did not differ significantly for five-star vs. lower rated review categories in this analysis.
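
The uppercase calculation can be sketched with two regular expressions. The exact patterns used in the project are not shown in this report, so the ones below are assumptions: a word is any run of ASCII letters, and an uppercase word is a run made only of capitals (which, as noted above, includes 'I' and 'A').

```python
import re

# Assumed patterns: runs of letters bounded by word boundaries.
UPPER_WORD = re.compile(r"\b[A-Z]+\b")
ANY_WORD = re.compile(r"\b[A-Za-z]+\b")


def percent_uppercase(text):
    """Percentage of alphabetic words written entirely in capitals."""
    words = ANY_WORD.findall(text)
    if not words:
        return 0.0  # avoid dividing by zero for empty or non-alphabetic text
    upper_words = UPPER_WORD.findall(text)
    return 100.0 * len(upper_words) / len(words)


shouty = percent_uppercase("This is a MUST READ")
quiet = percent_uppercase("this was fine")
```

Applying a threshold such as `percent_uppercase(text) >= 25` would then pick out the small subset of heavily capitalized reviews discussed above.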

### 6. Machine Learning
The motivation for machine learning (ML) on the dataset of consumer book ratings and book review text is to classify new review text as belonging to either a high rated book (5-star) or a low rated book (4 or fewer stars). Following exploratory data analysis and inferential statistical comparisons of ratings and reviews, the first step towards machine learning is to pre-process the text data through normalization methods. Following pre-processing I divided the review data into training and testing datasets. In order for the computer to process and compare elements of high and low rated reviews the text data must be vectorized. Vectorization tools transform the corpus of review text into a document-term matrix: each review becomes a row vector in which the count or frequency of each vocabulary element is encoded. Most cells in this matrix are zero, so the rows are stored as sparse vectors. Prior to fitting and comparing different ML algorithms I applied and compared two vectorization methods: CountVectorizer and TfidfVectorizer from the scikit-learn library (Pedregosa et al. 2011). I selected the best performing vectorization method and proceeded to fit, tune and compare algorithms for predicting book ratings on unseen reviews (the test dataset).

#### 6.1. Text Preprocessing
My exploratory data analysis of word frequencies revealed several elements of review text that needed attention before the data were ready for machine learning. Even after removing stop words the most common word in all Science textbook reviews was 'book', which occurred about 16,000 more times than the next most common word, 'read'. Neither word has a negative or positive meaning in the context of book reviews, so they could be added to the stop words list. Ultimately, I removed 'book' and 'books' along with other stop words (see details below) but left 'read' since bigrams using 'read' are potentially meaningful within reviews (Table 2; e.g., 'this is a good read' or 'she is a well-read author').

###### _Create project-specific set of stopwords_
Before preprocessing the review text data I created a specific set of stopwords. Initial machine learning tests revealed that when I removed the word 'not' along with the other 178 'english' stopwords from NLTK (Loper & Bird 2002), my models tended to classify most sample reviews as 'high' rated reviews. This is the result of losing bigram features with the negation word 'not'. While 'not' is common in both high and low rated reviews, the meanings of the bigrams 'not good' and 'not bad' are distinct. First, I put NLTK's 'english' stopwords into a set and removed 'not'. Next, I joined my modified 'english' stopwords with my own short list of project-specific stop words ('book', 'books') into a set called 'stopwords'.
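
The set operations described above can be sketched as follows. To keep the example self-contained, `base_english` is a tiny hand-written stand-in for NLTK's English list (which would normally come from `nltk.corpus.stopwords.words('english')`); the `bigrams` helper is my own illustration of why keeping 'not' matters.

```python
# Tiny stand-in for NLTK's English stop word list (illustration only).
base_english = {"i", "a", "the", "is", "this", "not", "do", "that", "you", "it"}
project_specific = {"book", "books"}

# Drop 'not' from the base list, then add the project-specific words.
stopwords = (base_english - {"not"}) | project_specific


def bigrams(tokens):
    """Join each pair of adjacent tokens into a two-word feature."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


tokens = "this book is not good".split()

kept = [t for t in tokens if t not in stopwords]
kept_if_not_removed = [t for t in tokens if t not in stopwords | {"not"}]

with_negation = bigrams(kept)
without_negation = bigrams(kept_if_not_removed)
```

Keeping 'not' preserves the bigram 'not good'; filtering it out leaves only 'good', which reads as a positive signal.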

###### _Preprocessing Step 1: Reduce noise, lowercase, remove stopwords_
To pre-process the raw review input I defined preprocessing functions. The first function reduced noise by using regular expressions to keep only alphabetical text, converting the remaining words to lowercase, and removing every word found in my user-defined set, 'stopwords'. I applied this function to the DataFrame column of raw review text ('reviewText'); the new preprocessed column ('clean_revs') is still readable, but much cleaner.

|index	|reviewText	                                        |rating_cat	|clean_revs|
|:------|:------------------------------------------------- |:--------- |:---------|
|11541	|Definitely a MUST-READ if you are a home cooki...	|1	|definitely must read home cooking enthusiast w...|
|11542	|Pros: Scientifically informative and solid. Kn...	|0	|pros scientifically informative solid knowing ...|
|11543	|Real fun to read. For everybody that is inters...	|0	|real fun read everybody intersted cooking cert...|
|11544	|This book will teach you the chemical secrets ...	|0	|teach chemical secrets techniques usually used...|
|11545	|I paid more than $30 to buy such a superficial...	|0	|paid buy superficial trivial made big mistake ...|

__Table 2.__ Pandas DataFrame of raw review text, rating category, and review text processed by the first preprocessing function to remove stop words, lowercase all letters, and remove numerals.
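
A function in the spirit of this first preprocessing step might look like the sketch below. The regular expression, the function name, and the abbreviated stop word set are assumptions made for illustration; the project used the full modified NLTK list.

```python
import re

# Abbreviated stop word set for illustration; 'not' is deliberately absent.
STOPWORDS = {"i", "a", "more", "than", "to", "such", "is", "this", "book", "books"}


def clean_review(text, stopwords=STOPWORDS):
    """Keep alphabetical text, lowercase it, and drop stop words."""
    letters_only = re.sub(r"[^A-Za-z]+", " ", text)  # numerals and punctuation become spaces
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in stopwords)


cleaned = clean_review("I paid more than $30 to buy such a superficial BOOK!")
negated = clean_review("This book is NOT good.")
```

With a pandas DataFrame, the function would typically be applied column-wise, for example `df['clean_revs'] = df['reviewText'].apply(clean_review)`, where `df` is a placeholder name.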

###### _Preprocessing Step 2: Word tokenize, Snowball stemmer_
Final steps towards normalization involved text tokenization and stemming. My second preprocessing function first tokenized the 'clean_revs' column using word_tokenize from NLTK so that the review text becomes a list of strings where each word or token is a string. From NLTK I also imported the SnowballStemmer for the 'english' language. The second preprocessor applied the stemmer to the tokenized words to reduce the dimensions of each review document. The stemmer keeps the 'stem' of words with multiple forms: 'science' becomes 'scienc', while 'scientific' and 'scientifically' both become 'scientif' (Figure 6). The product is a less readable, but ready for machine learning, DataFrame column, 'clean_revs'.
```
0    ['good', 'scienc', 'nerd', 'non', 'scienc', 'a...
1    ['biolog', 'genet', 'enthusiast', 'great', 'of...
2    ['bought', 'daughter', 'borrow', 'frank', 'mcc...
3    ['recommend', 'tour', 'guid', 'ireland', 'extr...
4    ['school', 'recent', 'upgrad', 'chemistri', 'h...
5    ['confus', 'explain', 'poor', 'overal', 'graph...
6    ['lot', 'cool', 'experi', 'complet', 'fun', 'p...
7    ['great', 'condit', 'better', 'expect', 'ship'...
Name: clean_revs, dtype: object
```
__Figure 6.__ Once preprocessing is complete the raw review text is reduced to tokenized, stemmed, lowercase, alphabetical lists of words with no stop words.

###### _Final step: Create feature matrix and response vector_
In the final steps before vectorization, I stored the tokenized text in a feature matrix (m x n) called 'X' and the response vector (where 0 is ratings 1 through 4 and 1 is rating level 5) in the m x 1 vector 'y'. I then divided X and y into training and test datasets (X_train, X_test, y_train, y_test) using scikit-learn's train_test_split function. Because the number of reviews in the '0' rating category is much lower than the '1' rating category I stratified the division into training and testing data to ensure that both categories are fairly represented in all subsets.
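
To show what stratification does, here is a hand-rolled sketch of a stratified split using only the standard library. It is not scikit-learn's implementation; the `test_size` and `seed` values and the toy data are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(X, y, test_size=0.25, seed=42):
    """Split X and y so each class keeps the same share in train and test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    rng.shuffle(train_idx)
    rng.shuffle(test_idx)

    def pick(seq, idxs):
        return [seq[i] for i in idxs]

    return pick(X, train_idx), pick(X, test_idx), pick(y, train_idx), pick(y, test_idx)


# Toy data: eight 'high' (1) and four 'low' (0) reviews.
y = [1] * 8 + [0] * 4
X = [f"review {i}" for i in range(len(y))]

X_train, X_test, y_train, y_test = stratified_split(X, y, test_size=0.25)
```

Both subsets keep the 2:1 ratio of 'high' to 'low' labels. In scikit-learn the same effect comes from passing `stratify=y` to `train_test_split`.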

#### 6.2. Vectorizer Selection
###### _Background_
Different vectorizers encode text data using different methods. Scikit-learn's CountVectorizer builds a vocabulary based on the tokens in the training data and counts the occurrence of each token, encoded in a sparse vector for each observation (review). CountVectorizer allows us to take a "bag-of-words" approach to text analysis. The vectorized text is analyzed based on word counts, but cannot take text structure into account. A vectorizer that takes into account more information than simple token counts is the TfidfVectorizer, which applies Term Frequency - Inverse Document Frequency weighting to the token counts using the training data tokens. Specifically, Term Frequency is the frequency of a given token within a document, while Inverse Document Frequency is used to reduce the effect of tokens that are very common across all documents. The TfidfVectorizer learns both the vocabulary and the inverse document frequency weights.
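
The difference between the two encodings can be made concrete with a small hand computation. The sketch below is not scikit-learn's code; it follows the smoothed formula that scikit-learn documents for its default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The three toy documents are invented.

```python
import math
from collections import Counter


def fit_vocab_and_idf(train_docs):
    """Learn a sorted vocabulary and smoothed idf weights from training docs."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc))  # count each token once per document
    vocab = sorted(doc_freq)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    return vocab, idf


def count_vector(doc, vocab):
    """Bag-of-words counts; tokens outside the vocabulary are ignored."""
    counts = Counter(doc)
    return [counts[t] for t in vocab]


def tfidf_vector(doc, vocab, idf):
    """Counts weighted by idf, then scaled to unit length (L2 norm)."""
    raw = [c * idf[t] for c, t in zip(count_vector(doc, vocab), vocab)]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]


# Toy training documents of stemmed tokens.
train_docs = [["great", "read"], ["great", "scienc"], ["bore", "read"]]
vocab, idf = fit_vocab_and_idf(train_docs)

new_doc = ["great", "bore", "unseen"]
counts = count_vector(new_doc, vocab)
weights = tfidf_vector(new_doc, vocab, idf)
```

Both 'great' and 'bore' appear once in `new_doc`, so their counts are equal, but 'bore' receives the larger tf-idf weight because it occurs in fewer training documents.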

###### _Vectorizer comparison_
To compare the effect of these two vectorizers on the book review document vocabularies I instantiated both CountVectorizer and TfidfVectorizer separately and saved the resulting document-term matrices from the fit and transform steps of vectorization on the X_train and X_test matrices. For the initial comparison, I used the default parameters for both vectorizers. Next, I instantiated a Naive Bayes classifier, MultinomialNB(), and trained it on the X_train document-term matrix and y_train (the corresponding binary rating vector) from both vectorizers. I made class predictions with the Multinomial Naive Bayes classifier using the X_test document-term matrices, calculated scores to compare models, and used GridSearchCV to select the best parameters for the simple Naive Bayes model. For CountVectorizer the apparent best parameters were the default parameters, while for TfidfVectorizer the best alpha value was 0.1 (Table 3). Updating alpha for TfidfVectorizer and re-running the classifier resulted in a higher ROC-AUC score as well as improved prediction counts in the confusion matrix. Next, I re-parameterized the two vectorizers to include bigrams (ngram_range = (1,2)) and increased the minimum document frequency (min_df) from its default of one to three. These parameters increased the area under the ROC curve, and grid search indicated that the default Multinomial Naive Bayes parameters were best (Table 3).

|Vectorizer/Classifier: |ngram_range |min_df |ROC-AUC |GridSearchCV best params|
|:--------------------- |:------ |:------ |:------ |:-----------------------|
|CountVectorizer/MultinomialNB | (1,1) | 1  |0.74 | alpha=1, fit_prior=True|
|TfidfVectorizer/MultinomialNB | (1,1) | 1  |0.75 | alpha=0.1, fit_prior=True|
|CountVectorizer/MultinomialNB | (1,2) | 3  |0.77 | alpha=1, fit_prior=True|
|TfidfVectorizer/MultinomialNB | (1,2) | 3  |0.78 | alpha=1, fit_prior=True|

__Table 3.__ Comparison of the predictive performance of Multinomial Naive Bayes models fit with the document-term matrices from two separate vectorizers applied to the vocabulary of book review data. The two vectorizers are also compared with ngram_range set to (1,1) and (1,2) and min_df set to 1 and 3. ROC-AUC scores (the area under the Receiver Operating Characteristic curve) are very similar after tuning with the best parameters from GridSearchCV. Overall, Term Frequency - Inverse Document Frequency vectorization with features from single words and bigrams and default Multinomial Naive Bayes parameters resulted in the highest area under the ROC curve.
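
To illustrate what ngram_range = (1,2) and min_df = 3 do to the vocabulary, here is a simplified standard-library sketch. It mimics the idea rather than scikit-learn's implementation, and the four toy documents are invented.

```python
from collections import Counter


def ngrams(tokens, ngram_range=(1, 2)):
    """Return all n-grams for each n in the inclusive range."""
    lowest, highest = ngram_range
    grams = []
    for n in range(lowest, highest + 1):
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams


def build_vocabulary(docs, ngram_range=(1, 2), min_df=3):
    """Keep n-grams that occur in at least min_df documents."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(ngrams(tokens, ngram_range)))  # once per document
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


# Toy stemmed documents.
docs = [
    ["high", "recommend", "read"],
    ["high", "recommend"],
    ["not", "high", "recommend"],
    ["not", "good"],
]

vocab = build_vocabulary(docs)
unfiltered = build_vocabulary(docs, min_df=1)
```

Raising min_df trims rare unigrams and bigrams (here everything except 'high', 'recommend', and 'high recommend'), which keeps the feature space manageable once bigrams are added.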

The area under the ROC curve is greatest with TfidfVectorizer when including bigrams as features and increasing min_df from the default value. In the Model Comparison section below (6.3) I compare models fit with the document-term matrix from vectorization with TfidfVectorizer tuned with the following parameters:
```
TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=False, max_df=0.7, max_features=None, min_df=3,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)
```

###### _Explore features and large and small tf-idf values_
By viewing feature names with the largest and smallest coefficients resulting from the best Multinomial Naive Bayes model with TfidfVectorization we can interpret term-frequencies and inverse-document-frequencies. The largest tf-idf values are words that were frequent in a review but were not common across all reviews. Notice that there are no bigrams in the top 25 tf-idf values:
```
Largest Coefs: 
['read' 'love' 'great' 'not' 'one' 'year' 'inform' 'well' 'use' 'interest'
 'scienc' 'learn' 'like' 'good' 'recommend' 'old' 'understand' 'time'
 'make' 'would' 'way' 'kid' 'help' 'realli']
```
In contrast, features with small coefficients, or small tf-idf values occurred in book reviews often or occurred rarely in long reviews. The smallest tf-idf values are almost exclusively bigrams:
```
Smallest Coefs: 
['err' 'whatev problem' 'put writer' 'believ zero' 'follow fact'
 'follow first' 'loos leaf' 'loos end' 'loos connect' 'look went'
 'threat bioterror' 'look scientif' 'follow scientif' 'disappoint seem'
 'look promis' 'disappoint see' 'follow seem' 'disappoint overal'
 'believ mani' 'look experi' 'lot serious' 'lot stori' 'behavior like'
 'whether lomborg' 'publish trade']
```

###### _Test classification with made-up review text_
The current tuned Multinomial Naive Bayes model over-predicts reviews to be 'high' rated reviews. A test of classification accuracy by the current model on made-up review text including negation terms (e.g., 'not good' or 'not bad') reveals that this simple model tends to classify text into the 'high' rating category (Table 4). All of the following simple made-up reviews are assigned to the 'high' review rating category. With the current model, I had to work hard to write a review that was classified in the 'low' rating category. The classification only turned to 'low' when I entered this string of negative words: 'bore unfortunately lack reject bad sad annoying avoid boring'.

|Sample Review Text: |intended rating |predicted probabilities of [0, 1] |
|:--------------------- |:------ |:------ |
|'This book is not good. I do not recommend that you ever read it.' | 0 | [0.27, __0.73__] | 
|'This book is not bad at all. I highly recommend this for anyone.' | 1 | [0.48, __0.51__] |
|'This is a big disappointing bore. I expected more. Unfortunately I was wrong.' | 0 | [0.43, __0.57__] | 
|'This book awesome. You should totally read it.' | 1 | [0.21, __0.79__] |
|'This book is bad. Avoid reading it.' | 0 | [0.48, __0.51__] | 
|'This book is good. I highly recommend this for everyone.' | 1 | [0.21, __0.79__] |

__Table 4.__ Test sentences for review classification with intended rating (0=low, 1=high) and predicted probabilities of 0 and 1 by the current model. The default threshold setting for predicted probabilities is 0.50, thus the class prediction is for the probability greater than 0.50 (the higher predicted probability is in bold).
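
One way to see why short reviews drift towards the 'high' class is to look at how a multinomial Naive Bayes model combines class priors with smoothed token likelihoods. The sketch below is a minimal hand-written classifier on raw token counts, not the tuned scikit-learn model from this report; the class name and the six toy training documents are assumptions, and `alpha` plays the same additive-smoothing role as in MultinomialNB.

```python
import math
from collections import Counter


class TinyMultinomialNB:
    """Minimal multinomial Naive Bayes on token lists (illustration only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive (Laplace) smoothing

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = sorted({tok for doc in docs for tok in doc})
        self.log_prior, self.log_likelihood = {}, {}
        for c in self.classes:
            class_docs = [d for d, lab in zip(docs, labels) if lab == c]
            self.log_prior[c] = math.log(len(class_docs) / len(docs))
            counts = Counter(tok for d in class_docs for tok in d)
            total = sum(counts.values()) + self.alpha * len(self.vocab)
            self.log_likelihood[c] = {
                t: math.log((counts[t] + self.alpha) / total) for t in self.vocab
            }
        return self

    def predict_proba(self, doc):
        """Return posterior probabilities in the order of self.classes."""
        scores = []
        for c in self.classes:
            known = (t for t in doc if t in self.log_likelihood[c])
            scores.append(self.log_prior[c] + sum(self.log_likelihood[c][t] for t in known))
        top = max(scores)  # subtract the max for numerical stability
        exp_scores = [math.exp(s - top) for s in scores]
        total = sum(exp_scores)
        return [e / total for e in exp_scores]


# Toy training set with twice as many 'high' (1) as 'low' (0) documents.
docs = [["great", "read"], ["love", "read"], ["great", "recommend"],
        ["love", "recommend"], ["bore", "read"], ["not", "recommend"]]
labels = [1, 1, 1, 1, 0, 0]

model = TinyMultinomialNB(alpha=1.0).fit(docs, labels)
p_unseen = model.predict_proba(["unfamiliar"])
p_bore = model.predict_proba(["bore"])
```

When a review contains no informative tokens, the prediction falls back to the class priors, so the majority class wins by default; a clearly negative token such as 'bore' is needed to tip the posterior towards 'low'.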

###### _Explore classification with False Negative and False Positive review text_
In addition to testing predicted probabilities from the Multinomial Naive Bayes classifier using made-up review text, I identified actual reviews from the test data that were either false negatives or false positives using Naive Bayes. The text of three false positives and two false negatives is printed in the machine learning code (["Machine Learning Amazon Reviews Working Code"](https://github.com/PassMoreHeat/springboard/blob/master/Capstone_1/Machine_Learning_Amazon_Reviews_Working_Code.ipynb)) and is too long to reproduce here. I calculated the predicted probabilities for both classes (Table 4.1) and recalculated the probabilities later with the tuned best classifier.

|False Positive or Negative: |Index identifier |actual rating |predicted probabilities of [0, 1] |
|:--------------------- |:------ |:------ |:------ |
|False Positive | 691  | 0 | [0.27, __0.73__] | 
|False Positive | 7518 | 0 | [0.24, __0.75__] |
|False Positive | 1779 | 0 | [0.38, __0.62__] | 
|False Negative | 4040 | 1 | [__0.65__, 0.35] |
|False Negative | 4181 | 1 | [__0.52__, 0.48] | 

__Table 4.1__ False positive and false negative reviews (identified by index of X_test), actual classification, and predicted probabilities of 0 and 1 with the Multinomial Naive Bayes model. The default threshold setting for predicted probabilities is 0.50, thus the class prediction is for the probability greater than 0.50 (the higher predicted probability is in bold).

###### _Address target class imbalance_
The above test sentences reveal that the current model (Multinomial Naive Bayes with TfidfVectorization, bigrams, and minimum document frequency of three) has an extreme classification bias towards the 'high' rating class. In part, this is likely a result of the target class imbalance of the input data. The dataset of Science Textbook reviews consists of 66% 'high' rated reviews. To address this imbalance in subsequent model fitting I will include the parameter class_weight = 'balanced'. This mode uses the y values to automatically adjust weights inversely proportional to class frequencies in the input data as:
```
n_samples / (n_classes * np.bincount(y))
```
The class_weight parameter is available for my comparison models below, but is not available for Multinomial Naive Bayes. By setting the class_weight to 'balanced' I hope to reduce classification biases.
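
As a worked example of the 'balanced' formula, the sketch below applies it to the full-dataset class counts reported in section 5.2 (3884 'low' and 7662 'high' reviews). In practice the weights are computed from the y values passed to the classifier, so the numbers for the training subset would differ slightly; the function name is my own.

```python
from collections import Counter


def balanced_class_weights(y):
    """n_samples / (n_classes * count) for each class label."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in sorted(counts.items())}


# Full-dataset counts from this report: 0 = 'low', 1 = 'high'.
y = [0] * 3884 + [1] * 7662
weights = balanced_class_weights(y)

# Weighted class totals are equal after balancing.
weighted_low = weights[0] * 3884
weighted_high = weights[1] * 7662
```

The minority 'low' class gets a weight of roughly 1.49 and the majority 'high' class roughly 0.75, so each class contributes the same total weight to the loss.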

#### 6.3. Model Comparison
Using primarily the document-term matrix from the TF-IDF vectorization (with ngram_range = (1,2) and min_df = 3) I fit and tuned four classification algorithms. With the exception of Multinomial Naive Bayes, for each classifier I included the parameter class_weight='balanced' to address target class imbalance in the input data. To compare models I report the area under the ROC curve (scoring='roc_auc') for all models (Table 5). For one model, SGDClassifier, I used the document-term matrix from the CountVectorizer (with ngram_range = (1,2), min_df = 3) in place of TfidfVectorization because the tfidf training vector was not compatible with the algorithm. The SGDClassifier is recommended in lieu of Support Vector Machines for fitting large-scale linear classifiers without having to copy complex arrays as input (Pedregosa et al. 2011). With the exception of LogisticRegressionCV, I implemented GridSearchCV for cross-validation and hyperparameter tuning for all models. The ROC-AUC values in Table 5 are the scores following parameter tuning based on the best parameter output from GridSearchCV. The highest scoring model is LogisticRegressionCV, which selects the regularization parameter C automatically using the StratifiedKFold cross-validator over a grid of Cs values that defaults to ten values on a logarithmic scale between 1e-4 and 1e4 (Pedregosa et al. 2011). The area under the ROC curve, which plots the true positive rate against the false positive rate, is highest for the LogisticRegressionCV model (0.81; Figure 7).


|Vectorizer: |Classifier: |ROC-AUC |GridSearchCV best params|
|:-------------|:--------------------- |:------ |:-----------------------|
|TfidfVectorizer|MultinomialNB | 0.78 | alpha=1, fit_prior=True|
|TfidfVectorizer|RandomForestClassifier | 0.75 | max_features=750, min_samples_leaf=6|
|TfidfVectorizer|LogisticRegressionCV | 0.81 | Cs=10, max_iter=100, tol=0.0001|
|CountVectorizer|SGDClassifier | 0.75 | alpha=0.1|

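__Table 5.__ Comparison of four classifier algorithms. MultinomialNB, RandomForestClassifier, and LogisticRegressionCV were fit with the document-terms matrix from TfidfVectorizer, and SGDClassifier with the matrix from CountVectorizer, all with ngram_range=(1,2), min_df=3. Each model except Multinomial Naive Bayes has parameter class_weight='balanced' to address imbalance in target classes. ROC-AUC scores are from the tuned models, with the tuned parameters listed under 'best params'; all models except LogisticRegressionCV were tuned with GridSearchCV.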
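
The Cs=10 entry for LogisticRegressionCV in Table 5 is the number of candidate values of C, not the selected value itself. A minimal sketch of that candidate grid, assuming the default range of 1e-4 to 1e4 described in the text (the helper name `log_spaced_grid` is hypothetical):

```python
def log_spaced_grid(n_values=10, low_exp=-4.0, high_exp=4.0):
    """Return n_values points evenly spaced on a log10 scale
    from 10**low_exp to 10**high_exp."""
    step = (high_exp - low_exp) / (n_values - 1)
    return [10.0 ** (low_exp + i * step) for i in range(n_values)]


# Ten candidate regularization strengths between 1e-4 and 1e4.
cs_grid = log_spaced_grid()
for c in cs_grid:
    print(f"{c:.3g}")
```

Cross-validation then scores each candidate, and the best-scoring C is the one used in the final fit.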

![logregroc](Graphics/LogRegROC8134.png)

__Figure 7.__ Receiver Operating Characteristic (ROC) curve, with an area under the curve of 0.8134, from the tuned LogisticRegressionCV classifier (Cs=10, max_iter=100, tol=0.0001). This model is the highest scoring from my set of comparison models.
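
One way to read the area under the ROC curve is as the probability that a randomly chosen 'high' review receives a higher predicted score than a randomly chosen 'low' review. The sketch below computes that pairwise form directly. The labels and scores are made up for illustration, and the function is a plain-Python stand-in rather than the scikit-learn implementation.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Tied scores count as half a correct pair."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = 'high', 0 = 'low') and predicted P('high') values.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.95, 0.80, 0.43, 0.48, 0.12, 0.70, 0.55, 0.09]
print(roc_auc(y_true, scores))  # 14 of 16 pairs ranked correctly -> 0.875
```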

To explore features of the LogisticRegressionCV model I used the Python package [ELI5](https://eli5.readthedocs.io/en/latest/index.html), which explains the weights and predictions of scikit-learn classifiers (Figure 8). Stemmed tokens with the largest weights in this model include 'love', 'excel', 'wonder', and 'high recommend'. Some tokens with low weights are obviously negative and may be associated with reviews in the 'low' category: 'unfortun', 'bore', and 'disappoint'. Other tokens with low weights, such as 'like', 'good', and 'ok', may be common in both categories of reviews and therefore not useful for classification in the high rating class.

![logregFeatures2](Graphics/tfidfLogRegFeat2.png)
__Figure 8.__ For the target class '1', or 'high' rated reviews, the features with the highest weights (in green) and lowest weights (in red) and the associated weight values. This method, from package ELI5, works well with 'bag-of-words' vectorizers and linear classifiers because there is direct mapping between individual words and classifier coefficients. The weights are calculated given the specific classifier (my 'logreg_clf' model and vectorizer, 'tfidf').

###### _Comparing Predictive Probabilities in Test Reviews_
To test the LogisticRegressionCV classification model on sample review text I used the same set of six reviews (three pairs of negative and positive reviews) that I tested with the Multinomial Naive Bayes classifier above. In the initial test, all four sample reviews were classified as 'high' rated reviews. With the LogisticRegressionCV classifier and class imbalance addressed with class_weight = 'balanced', the reviews intended as 'low' ratings are all correctly classified. However, this improved classification model incorrectly places two made-up 'high' reviews in the 'low' class based on predicted probabilities and the default classification threshold of 0.50 (Table 6).

|Sample Review Text: |intended rating |predicted probabilities of [0, 1] |
|:--------------------- |:------ |:------ |
|'This book is not good. I do not recommend that you ever read it.' | 0 | [__0.52__, 0.48] | 
|'This book is not bad at all. I highly recommend this for anyone.' | 1 | [__0.88__, 0.12] |
|'This is a big disappointing bore. I expected more. Unfortunately I was wrong.' | 0 | [__0.82__, 0.17] | 
|'This book awesome. You should totally read it' | 1 | [0.04, __0.95__] |
|'This book is bad. Avoid reading it.' | 0 | [__0.91__, 0.09] | 
|'This book is good. I highly recommend this for everyone.' | 1 | [__0.57__, 0.43] |
__Table 6.__ Test sentences for review classification with intended rating (0=low, 1=high) and the predicted probabilities of the two classes by the LogisticRegressionCV model. The LogisticRegressionCV model correctly classified all of the made-up reviews intended as 'low' rated reviews. However, it classified two of the 'high' reviews into class 0 based on predicted probabilities, in contrast to their classification under Multinomial Naive Bayes. These mixed changes in the classification patterns of the models occurred after 1) I removed 'not' from the list of stopwords and 2) I set the class_weight parameter to 'balanced'.
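
The classifications described for Table 6 follow from comparing the predicted probability of class 1 to the 0.50 threshold. The sketch below applies that rule to the probabilities copied from Table 6. The helper `classify` is an illustrative stand-in, not the model's own predict method.

```python
# (intended label, [P(low), P(high)]) copied from Table 6, in row order.
table6 = [
    (0, [0.52, 0.48]),
    (1, [0.88, 0.12]),
    (0, [0.82, 0.17]),
    (1, [0.04, 0.95]),
    (0, [0.91, 0.09]),
    (1, [0.57, 0.43]),
]


def classify(p_high, threshold=0.5):
    """Return 1 ('high') when P(high) is above the threshold, else 0 ('low')."""
    return 1 if p_high > threshold else 0


predicted = [classify(probs[1]) for _, probs in table6]
misclassified = [
    row for row, ((intended, _), pred) in enumerate(zip(table6, predicted))
    if intended != pred
]

print(predicted)      # predicted class per row
print(misclassified)  # zero-based row positions where intended != predicted
```

Only the two 'high' reviews in the second and sixth rows end up in the wrong class, matching the description above.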

|False Positive or Negative: |Index identifier |actual rating |predicted probabilities of [0, 1] |
|:--------------------- |:------ |:------ |:------ |
|False Positive | 691  | 0 | [__0.71__, 0.29] | 
|False Positive | 7518 | 0 | [__0.86__, 0.14] |
|False Positive | 1779 | 0 | [__0.92__, 0.08] | 
|False Negative | 4040 | 1 | [__0.73__, 0.27] |
|False Negative | 4181 | 1 | [__0.89__, 0.11] | 
__Table 6.1__ Re-calculation of predicted probabilities for set of false positive and false negative reviews (identified by index of X_test using Multinomial Naive Bayes, above). The Logistic Regression classifier with parameters to address imbalanced classes now predicts the correct class for the 'False Positive' set of reviews, but incorrectly classifies the two 'high' reviews as 'low'. The two 'False Negative' reviews are very long and contain a lot of uncommon words from the corpus. These features likely affect the predicted probabilities and the resulting classification.

###### _Adjusting Threshold of Logistic Regression Classifier_
__Identify important metrics to maximize.__ We can measure classifier performance in several ways. The confusion matrix gives a complete picture of performance by quantifying the number of true positives, true negatives, false positives, and false negatives for the classifier under the current hyperparameters (Ng 2017). Many classifier scoring metrics are calculated from these values. Beyond the accuracy score (correct classifications over all classifications), several other metrics from the confusion matrix are useful:

* Classification error (false classifications over all classifications) measures the rate of misclassification.
* Sensitivity, or recall (true positives over true positives plus false negatives), measures how often the classification is correct when the actual value is positive.
* Specificity (true negatives over true negatives plus false positives) indicates how often the prediction is correct when the actual value is negative.
* Precision (true positives over true positives plus false positives) tells you how often the classifier is correct when a positive value is predicted.

For the specific problem of classifying Amazon review text into the binary categories 'high' rating and 'low' rating, my goal is to minimize the number of false positive classifications. My reasoning is that there are many positive reviews and a limited number of negative reviews. There is not a lot of harm in accidentally classifying a 'high' review as 'low' because there are many examples of 'high' reviews that are correctly classified. However, there is good reason to protect the True Negative reviews and aim to classify them as 'low' as frequently as possible. Both customers and retailers benefit from accurate information in review classification, especially if it may inform their purchases (e.g., purchasing a positively reviewed product instead of an inaccurately classified negatively reviewed product).
Thus, my goal in classification is to maximize Precision, that is, to minimize how often actual 'low' reviews are classified as 'high'.
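
The metric definitions above translate directly into a few lines of arithmetic. As a sketch, the function below (a hypothetical helper, not a scikit-learn call) evaluates them on the default-threshold confusion matrix counts reported later in this section (Figure 11).

```python
def confusion_metrics(tn, fp, fn, tp):
    """Compute the metrics defined above from the four confusion matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,       # correct over all classifications
        "error": (fp + fn) / total,          # false over all classifications
        "recall": tp / (tp + fn),            # sensitivity
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


# Counts from the confusion matrix at the default threshold of 0.50 (Figure 11).
metrics = confusion_metrics(tn=651, fp=320, fn=376, tp=1540)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Precision and recall come out at roughly 0.83 and 0.80, the values quoted for the default threshold.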

__Adjust threshold to maximize Precision.__ By default, the Logistic Regression classifier decides a class based on whether the predicted probability of a given observation is above or below 0.5. For the Amazon Review classifier, if the predicted probability of a review based on its text is over 0.5, then that review is classified as a 'high' rated review. However, the default decision threshold may not be the best threshold for our specific classification problem. Our review dataset is heavily skewed towards positive ratings. We took steps to account for the imbalance of classes (see discussion of stratification during train_test_split and class_weight settings above), but 'high' rated reviews still outnumber 'low' reviews by far (Figure 9). To account for the high frequency of positive reviews we can increase the threshold for predicting 'high' ratings, which would, in turn, decrease the sensitivity of the classifier. This would also increase the specificity by increasing the number of True Negatives.

![HistPredProb](Graphics/HistPredictProb.png)
__Figure 9.__ The majority of observations have high probability of a 'high' rating.

Adjusting the decision threshold of the classifier is a tradeoff between the precision and recall metrics. The total number of observations in the confusion matrix is always the same, but the numbers of false negatives and false positives change depending on whether the threshold is above or below the default. For the logistic regression classifier of Amazon book reviews, precision and recall move in opposite directions as the threshold changes (Figure 10). My goal is to decrease the number of false positives ('low' rated reviews that are categorized as 'high') produced by the classifier. Since we have more limited access to 'low' reviews, I prefer that the classifier identifies as many of them as possible correctly so that users of the classifier (customers and vendors) do not lose information from negative reviews.

![PrecRecall](Graphics/PrecRecallThreshold.png)
__Figure 10.__ Increases in threshold, the predicted probability level above which observations are classified as 'high' rated reviews, result in higher precision (not labeling 'low' reviews falsely as 'high') but decrease recall, or sensitivity.
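
The tradeoff in Figure 10 can be reproduced in miniature. The sketch below recomputes the confusion matrix, precision, and recall at several thresholds for a small set of hypothetical labels and predicted probabilities. These values are invented for illustration and are not taken from the review data.

```python
def confusion_at(y_true, p_high, threshold):
    """Return (tn, fp, fn, tp) when P(high) > threshold is classified as 'high'."""
    tn = fp = fn = tp = 0
    for y, p in zip(y_true, p_high):
        pred = 1 if p > threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def precision_recall(tn, fp, fn, tp):
    """Precision and recall, with NaN when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


# Hypothetical data: 1 = 'high', 0 = 'low'.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
p_high = [0.95, 0.90, 0.80, 0.70, 0.60, 0.30, 0.75, 0.55, 0.40, 0.10]

for threshold in (0.20, 0.50, 0.65):
    counts = confusion_at(y_true, p_high, threshold)
    precision, recall = precision_recall(*counts)
    print(threshold, counts, round(precision, 3), round(recall, 3))
```

On this toy data, raising the threshold from 0.20 to 0.65 removes false positives and raises precision while recall falls, the same direction of change shown in Figure 10.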

To explore the effect of adjusting the decision threshold I modeled precision and recall for different threshold values. At the default threshold of 0.5, precision and recall are 0.83 and 0.80, respectively, and the confusion matrix has nearly equal numbers of False Negatives and False Positives (Figure 11).

|n = 2887: |Predicted 'low': |Predicted 'high' |    |
|:-------------|:--------------------- |:------ |:-----------------------|
|__Actual 'low'__|TN = 651 | FP = 320 | (971)|
|__Actual 'high'__|FN = 376| TP = 1540| (1916)|
|             |(1027) | (1860) |    |

![Threshold0.50](Graphics/threshold_0.50.png)
__Figure 11.__ Confusion matrix for default threshold 0.50 (above) and relationship between precision and recall (below) for the logistic regression classifier where ^ indicates the precision and recall values for the current threshold setting (0.50).

Lowering the threshold to an extreme value of 0.20, so that all predicted probabilities above 0.20 are classified as 'high', produces many fewer False Negatives but more False Positives (Figure 12). As a result, precision decreases while recall increases.

|n = 2887: |Predicted 'low': |Predicted 'high' |    |
|:-------------|:--------------------- |:------ |:-----------------------|
|__Actual 'low'__|TN = 219 | FP = 752 | (971)|
|__Actual 'high'__|FN = 34| TP = 1882| (1916)|
|             |(253) | (2634) |    |

![Threshold0.20](Graphics/threshold_0.20.png)
__Figure 12.__ Confusion matrix for low threshold set to 0.20 (above) and relationship between precision and recall (below) for the logistic regression classifier where ^ indicates the precision and recall values for the current threshold setting (0.20).

To address the goal of minimizing the 'low' reviews falsely classified as 'high' reviews (the False Positives), I incrementally increased the threshold (see ["Machine Learning Amazon Reviews Working Code"](https://github.com/PassMoreHeat/springboard/blob/master/Capstone_1/Machine_Learning_Amazon_Reviews_Working_Code.ipynb)) until the count of False Positives decreased substantially but before the number of False Negatives exceeded the number of True Negatives (Figure 13). The decision threshold of 0.65 meets these criteria. Of the 971 actual 'low' rated reviews, only 167 are falsely classified as 'high' rated reviews.

|n = 2887: |Predicted 'low': |Predicted 'high' |    |
|:-------------|:--------------------- |:------ |:-----------------------|
|__Actual 'low'__|TN = 804 | FP = 167 | (971)|
|__Actual 'high'__|FN = 723| TP = 1193| (1916)|
|             |(1527) | (1360) |    |

![Threshold0.65](Graphics/threshold_0.65.png)
__Figure 13.__ Confusion matrix for purposefully adjusted threshold 0.65 (above) and relationship between precision and recall (below) for the logistic regression classifier where ^ indicates the precision and recall values for the current threshold setting (0.65).
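
To compare the three thresholds side by side, the sketch below recomputes precision, recall, and specificity from the cell counts reported in Figures 11, 12, and 13. Only the arithmetic is new; the helper name `summarize` is hypothetical.

```python
# Cell counts copied from the confusion matrices for thresholds 0.20, 0.50, and 0.65.
reported = {
    0.20: {"tn": 219, "fp": 752, "fn": 34, "tp": 1882},
    0.50: {"tn": 651, "fp": 320, "fn": 376, "tp": 1540},
    0.65: {"tn": 804, "fp": 167, "fn": 723, "tp": 1193},
}


def summarize(m):
    """Return precision, recall, specificity, and predicted-class totals."""
    return {
        "precision": m["tp"] / (m["tp"] + m["fp"]),
        "recall": m["tp"] / (m["tp"] + m["fn"]),
        "specificity": m["tn"] / (m["tn"] + m["fp"]),
        "predicted_low": m["tn"] + m["fn"],
        "predicted_high": m["fp"] + m["tp"],
    }


summary = {threshold: summarize(m) for threshold, m in reported.items()}
for threshold in sorted(summary):
    s = summary[threshold]
    print(
        f"{threshold:.2f}  precision={s['precision']:.3f}  "
        f"recall={s['recall']:.3f}  specificity={s['specificity']:.3f}  "
        f"predicted low/high={s['predicted_low']}/{s['predicted_high']}"
    )
```

Moving from 0.50 to 0.65 raises precision from about 0.83 to about 0.88 and specificity from about 0.67 to about 0.83, at the cost of recall dropping from about 0.80 to about 0.62.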

###### _Pipeline Models_
I also wanted to attempt rating classification using a pipeline of vectorizers and classifiers in order to tune parameters in all steps simultaneously. Here I compared Multinomial Naive Bayes, SVC (Support Vector Machines), and Random Forest classifiers with CountVectorizer and TfidfTransformer (Table 7). For some parameter types, I could not get GridSearchCV to run, so I tuned over a limited number of parameter ranges. The highest area under the ROC curve (0.7922) was for the Random Forest Classifier model.

|Vectorizer/Transformer: |Classifier: |ROC-AUC |GridSearchCV best params|
|:-------------|:--------------------- |:------ |:-----------------------|
|CountVectorizer/TfidfTransformer|MultinomialNB | 0.7898 | min_df=5|
|CountVectorizer/TfidfTransformer|SVC (Support Vector Machines) | 0.7701 | min_df=4|
|CountVectorizer/TfidfTransformer|RandomForestClassifier | 0.7922 | n_estimators=700, min_df=3|
__Table 7.__ Comparison of pipelines for CountVectorizer, TfidfTransformer and three classifier algorithms (MultinomialNB, SVC, and RandomForestClassifier respectively). Each model was fit using GridSearchCV to determine the best parameters.
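
A minimal sketch of the pipeline approach in Table 7, assuming scikit-learn is installed. The corpus, the step names ('vect', 'tfidf', 'clf'), and the parameter grid are illustrative assumptions and not the settings used for the reported scores.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, repeated so that 3-fold CV and min_df have enough documents.
docs = [
    "love this book highly recommend",
    "excellent wonderful clear text",
    "boring disappointing book",
    "unfortunately not good at all",
] * 6
labels = [1, 1, 0, 0] * 6  # 1 = 'high', 0 = 'low'

pipeline = Pipeline([
    ("vect", CountVectorizer(ngram_range=(1, 2))),  # words and bigrams
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# Parameters are addressed as <step name>__<parameter name>, so the
# vectorizer and the classifier are tuned in the same search.
param_grid = {
    "vect__min_df": [1, 2, 3],
    "clf__alpha": [0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=3)
search.fit(docs, labels)
print(search.best_params_, search.best_score_)
```

Because the vectorizer sits inside the pipeline, it is refit on each training fold, which keeps vocabulary from the held-out fold out of the fitted features.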

### 7. Results Summary
The most successful approach to review text classification for this genre-specific subset of Amazon reviews involved Logistic Regression classification with a built-in cross-validation tool. The normalization steps leading to the best model included preprocessing by lowercasing all words, removing numbers and punctuation, removing a user-determined set of stopwords which did not include 'not', tokenizing words, and stemming all tokens with SnowballStemmer. Document-terms were produced with vectorization and Tf-idf of words and bigrams.

Applied to the test data, the LogisticRegressionCV classifier (with the regularization parameter C selected automatically from a grid of Cs=10 candidate values) correctly classified 1540 reviews in the 'high' rating category and 651 reviews in the 'low' rating category. Another 696 reviews were misclassified as false negatives (n=376) or false positives (n=320). These counts were better than those of the other models and resulted in the highest area under the ROC curve (true positive rate vs. false positive rate) for this model.

Testing the various models on made-up positive and negative review text illustrates that many of the tested algorithms classify the majority of novel review text as 'low' rated reviews. However, the tuned LogisticRegressionCV model (as well as RandomForestClassifier and SGDClassifier) appropriately classified very negative review text in the 'low' rating category and the paired positive text as 'high'. This indicates some success towards building an accurate classifier for this genre of book reviews. One curious change that resulted from adjustments to pre-processing (keeping the word 'not' in the document terms) and model parameterization (setting class_weight to 'balanced' to address imbalanced target classes) was that the models shifted from a tendency to classify text into the 'high' class to more frequently classifying text as the 'low' class. Further adjustments would likely improve the predictive models. In addition, work with separate sets of genre-specific reviews could help me tune and improve this classifier.

One way to explore patterns of misclassification is to adjust the predicted probability threshold between the two classes. The default threshold is 0.5, so a given observation is assigned to whichever class has a predicted probability greater than 0.5. My case-specific goal for rating classification with Amazon book review text was to protect the classification of 'low' rated reviews because they were less common in the dataset than 'high' rated reviews. My goal was thus to reduce the number of False Positive classifications. Because 'high' rated classification was much more frequent (Figure 9), I increased the decision threshold from 0.5 to 0.65 in order to reduce the occurrence of 'low' rated reviews being falsely classified. The secondary effect of adjusting the decision threshold was that classification precision increased but recall (or sensitivity) decreased for the logistic regression classifier.

The pipeline with Random Forest Classifier also performed well classifying review data. This is a promising model and with additional parameter tuning could out-perform the LogisticRegressionCV model. However, GridSearchCV ran very slowly and I was only able to search with limited numbers of parameters and parameter values at a time.

### 8. Recommendations for review platforms and future directions
My final classification model, LogisticRegression with cross-validation on bag-of-words and bigrams, classifies many Amazon reviews with 5-star ratings correctly as 'high' rated and many Amazon book reviews associated with lower ratings as 'low' rated. This accuracy of classification is a great step towards building tools for product review writing and rating. Many further steps could improve this project or extend it towards tools for product reviews. With more time and resources I would address some of these limitations in my next steps.

Next steps & future directions:
* Rating classes are imbalanced, with more 'high' ratings than 'low' ratings. In addition to, or in place of, setting the class_weight parameter to 'balanced', explore other methods to deal with imbalanced classes. For example, SMOTE (Synthetic Minority Over-sampling Technique, imblearn.over_sampling.SMOTE) balances classes by returning resampled X and y.
* My Amazon review database also includes short 'summary' text along with reviews. These short phrases may be more explicitly focused on the gist of the review. 
    * Test product summary text with the best model from review analysis. Summaries are short and to-the-point. Can we predict rating category with the summary text based on the review vocabulary? Do we improve our prediction success?
* Access another subset of reviews from a different genre. Perform similar steps to build a machine learning algorithm to predict rating category. How different is the new model?
* Cross-test the best models for each genre of reviews on the review text for the opposite set of reviews. Do the models perform better or worse on unique review vocabularies?
* Explore word embeddings with Word2vec to model reviews with two-layer neural networks.
* Re-assign 5-star Amazon review ratings to different binary rating categories. 
    * In an analysis of fine food reviews from Amazon, Data Scientist Susan Li divided reviews into high (4 & 5-star) and low (1 & 2-star) and removed the 3-star reviews based on the assumption that they are neutral (Li 2017).
    * Building a model based on only very negative (1-star) and very positive (5-star) reviews may lead to more accurate predictions. However, in the large Amazon book review database and in my Science Textbook subset 1-star reviews are very uncommon. Getting enough of a vocabulary for low rated reviews would be difficult.
* Explore other modeling methods to increase the predictive power of my rating classifier.

![low_rate_cloud](Graphics/low_rate_cloud.png)
_Word Cloud of 1000 stemmed words from vocabulary of Amazon Science Textbook reviews with __'low'__ ratings._

### 9. Sources:

Arvai, Kevin. 2018. Fine tuning a classifier in scikit-learn. Towards Data Science. Accessed May 2018. https://towardsdatascience.com/fine-tuning-a-classifier-in-scikit-learn-66e048c21e65

He, Ruining and Julian McAuley. 2016. Ups and Downs: Modeling the Visual Evolution of Fashion Trends with One-Class Collaborative Filtering. In Proceedings of the 25th International Conference on World Wide Web (WWW '16). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 507-517. DOI: https://doi.org/10.1145/2872427.2883037

Li, Susan. 2017. Scikit-Learn for Text Analysis of Amazon Fine Food Reviews. In 'datascience+', An online community for showcasing R & Python tutorials. https://datascienceplus.com/scikit-learn-for-text-analysis-of-amazon-fine-food-reviews/

Loper, Edward and Steven Bird. 2002. NLTK: the Natural Language Toolkit. In Proceedings of the ACL-02 Workshop on Effective tools and methodologies for teaching natural language processing and computational linguistics - Volume 1 (ETMTNLP '02), Vol. 1. Association for Computational Linguistics, Stroudsburg, PA, USA, 63-70. DOI: https://doi.org/10.3115/1118108.1118117

McAuley, Julian, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. 2015. Image-Based Recommendations on Styles and Substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '15). ACM, New York, NY, USA, 43-52. DOI: http://dx.doi.org/10.1145/2766462.2767755

Ng, Ritchie. 2017. Evaluating a Classification Model. http://www.ritchieng.com/machine-learning-evaluate-classification-model/ Accessed April 2018.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. 2011. Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, 12, pp. 2825-2830.