# Predicting reader’s book rating from their written reviews: A test by genre

__Capstone 1 Milestone Report__
***
Outline of Report
* [Define the Problem](#Define-the-problem)
* [Identify client](#Identify-client)
* [Description of data sets](#Description-of-data-sets)
    * [Methods for cleaning & wrangling](#Methods-for-cleaning-&-wrangling)
    * [Data section conclusions](#Data-section-conclusions)
* [List of other potential data sets](#List-of-other-potential-data-sets)
* [Explanation of initial findings](#Explaination-of-initial-findings)
* [Sources](#Sources)

## Define the problem <a name="Define-the-problem"></a>
When consumers consider purchasing a product, they often turn to reviews and ratings submitted by other customers to determine if the purchase is worthwhile. Retailers, in turn, depend on honest and accurate reviews and ratings so that subsequent buyers can make informed purchases. On average, customers read more than two Yelp reviews before deciding to use a business (Rudolph 2015). Further, a one-star increase in a Yelp rating leads to a 5-9% increase in revenue for a business (Rudolph 2015). Like business ratings, product ratings and reviews also affect sales, so accurate and error-free reviews and ratings are extremely valuable to retailers. The sentiment captured in the text of a review should be reflected in the star rating. One-star ratings can have a large negative effect on sales, so retailers need tools to flag incongruous review-rating pairs that may indicate user error. Similarly, high ratings paired with scathing review text may indicate errors or other issues with the product or review system. Can we predict whether ratings are __high__ or __low__ based on review features? For my Capstone project I am building a machine-learning model that uses Natural Language Processing to predict high and low ratings for distinct sub-genres of books reviewed on Amazon.com. In my models I use review text features to predict ratings and compare the predictions to the actual ratings.

Both consumers and booksellers depend on book reviews and ratings to make informed decisions about purchases and to help with sales. Positive and negative ratings and reviews help buyers and sellers know what to spend money on and what products to avoid. Errors and inconsistencies in these assessments can directly affect sales and customer satisfaction. Here I use features of consumer book review text to determine if reviews can predict ratings. Being able to predict ratings based on review features has multiple benefits: 1) catch errors by reviewers where they accidentally selected the wrong number of stars, 2) suggest ratings when reviewers do not provide a star rating along with their review, 3) flag confusing/incongruous review-rating pairs for revision (by reviewer) or so that they are not featured first in review lists, and potentially 4) identify and flag reviews and ratings that are ‘fake’ or jokes based on the text of the review.

## Identify client <a name="Identify-client"></a>
My preliminary clients are booksellers. From small brick-and-mortar bookstores to large online book retailers, booksellers depend on consumer reviews to 1) decide which books to purchase for resale and 2) promote book sales on their platform. The machine learning algorithm can be used by retailers internally or as part of the review platform used by consumers.
 
Ultimately, my machine-learning algorithm that predicts high and low ratings from review text features can be utilized for any product-category or business. I envision a review platform that facilitates consumer review writing. This platform could incorporate a text editor (like Grammarly) to help reviewers craft clear and effective reviews in addition to suggesting a rating level based on the specific rating system of a given platform. Together these features will help reviewers communicate more clearly and select corresponding ratings more consistently.

## Description of data sets <a name="Description-of-data-sets"></a>
__Consumer reviews and ratings.__ My source dataset has over 22 million book reviews from Amazon.com, written from May 1996 to July 2014. These reviews are made available by Julian McAuley, a UCSD professor of Computer Science (McAuley et al. 2015; He & McAuley 2016). For this project, I accessed a subset of all book reviews within a specific genre to train and test the algorithm.
  * J. McAuley’s main page: http://cseweb.ucsd.edu/~jmcauley/
  * Amazon Review Data links: http://jmcauley.ucsd.edu/data/amazon/  Data files with all reviews are only available from Julian McAuley by request.
  * Metadata file from J. McAuley, with permission.
  * Count of reviews_Books records: 22507155
  * Count of 5.0 rated reviews: 13886788
  * For books with 10-digit International Standard Book Number (ISBN), the ASIN and the ISBN are the same.
  * See McAuley et al. 2015, He & McAuley 2016. Full citations below.
  
__Google Books API for ISBN codes within genre.__ To subset the large Amazon book review dataset, I used the Google Books API to query for book titles and ISBN codes within a specific genre of books. My query focused on non-fiction books and textbooks with keywords including: science, biology, chemistry, physics, astronomy, invertebrate, biochemistry, zoology, math, geology, climate, and cellular. After running the search query on these terms and removing duplicate titles, I had a list of book titles and ISBN-10 codes for 3950 books. Not all of these ISBN-10 codes matched ASIN codes in the large Amazon database.

### Methods for cleaning & wrangling <a name="Methods-for-cleaning-&-wrangling"></a>
__Problem:__ The book reviews json file is almost 20 GB. Therefore, I cannot open it with Jupyter Notebook. 

__Solution:__ Install MongoDB and Studio 3T to access the json as a database file. Then, in Jupyter Notebook with PyMongo use the .find() function to match the review documents I want to use with a list of book codes.
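
For comparison, a file this large can also be filtered without a database by reading it one record at a time. The sketch below is an alternative to the MongoDB approach used in this report, not a description of it, and it assumes the review file stores one JSON object per line. The function name `iter_matching_reviews` and the sample records are hypothetical.

```python
import json


def iter_matching_reviews(lines, wanted_asins):
    """Yield parsed review dicts whose 'asin' is in wanted_asins.

    `lines` can be any iterable of strings (for example an open file
    handle), so the whole file never has to fit in memory at once.
    Assumption: each line holds exactly one JSON object.
    """
    wanted = set(wanted_asins)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        review = json.loads(line)
        if review.get('asin') in wanted:
            yield review


# Tiny in-memory sample standing in for the real 20 GB file
sample_lines = [
    '{"asin": "0306406152", "overall": 5.0, "reviewText": "Clear and thorough."}',
    '{"asin": "0000000000", "overall": 2.0, "reviewText": "Not for me."}',
]
matches = list(iter_matching_reviews(sample_lines, ['0306406152']))
print(len(matches))
```

With a real file, the call would look like `iter_matching_reviews(open(path), codes)`. The database approach remains more convenient for repeated queries, because the file only has to be imported once.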
***
__Problem:__ The Amazon book review database does not include book titles; individual books are identified only by their 'ASIN' code ('Amazon Standard Identification Number'). Outside of Amazon, books are identified by ISBN codes (either 10-digit or 13-digit). To access the book review documents from the MongoDB database, I need a list of 'ASIN' codes in the non-fiction science textbook genre.

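__Solution:__ For books with a standard ISBN-10 code, the 'ASIN' code is the same as the ISBN-10, so a list of ISBN-10 codes can be matched directly against 'ASIN' codes.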
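
Because matching relies on well-formed ISBN-10 codes, it can help to screen out malformed codes before querying. The following is a minimal sketch of the standard ISBN-10 check-digit rule (weights 10 down to 1, sum divisible by 11, with 'X' standing for 10 in the last position). This validation step was not part of the original workflow, and the function name `is_valid_isbn10` is hypothetical.

```python
def is_valid_isbn10(code):
    """Return True if `code` is a 10-character ISBN-10 with a correct check digit."""
    code = code.replace('-', '').strip().upper()
    if len(code) != 10:
        return False
    total = 0
    for position, char in enumerate(code):
        if char == 'X' and position == 9:
            value = 10  # 'X' means 10 and is only allowed as the check digit
        elif char in '0123456789':
            value = int(char)
        else:
            return False
        # Weights run from 10 (first character) down to 1 (check digit)
        total += (10 - position) * value
    return total % 11 == 0


print(is_valid_isbn10('0-306-40615-2'))
```

A code that fails this check cannot be a valid ISBN-10, so it can be dropped before searching the review collection.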
***
__Problem:__ In order to have a large enough dataset of book reviews to apply Machine Learning models I need many ISBN-10 codes within my genre because many of the codes I find will not match 'ASIN' codes in the large Amazon Review database.

__Solution:__ 
1. Use Google Books API to query for Invertebrate Biology textbooks. Query, with API key masked: https://www.googleapis.com/books/v1/volumes?[query-terms-go-here]&maxResults=40&startIndex=0&printType=books&subject:textbook&key=xxxxxx

    * Max results per query run was set to 40 (meaning 40 book title results)
    * Start index was set to '0' initially and incremented by 40 each run so that each request resulted in 40 new results.
    * Query terms: 'q=science+[x]+nonfiction' where x on separate API requests was:
        * science, biology, chemistry, physics, astronomy, invertebrate, biochemistry, zoology, math, geology, climate, and cellular.
    * For each topic, x, I ran the query approximately ten times resulting in ~400 book records per query topic. 

2. Each API query returns nested JSON containing nested objects and arrays. To access the key:value pairs of interest, I ran the API queries in Jupyter Notebook with 'requests.get' and used 'json_normalize' from pandas.io.json to flatten the nested JSON.

3. I concatenated all request results and renamed the column headings. Then I removed rows where column 'ISBN_10' contained the string 'None', dropped duplicate rows, and reset the index.

```python
# rename the flattened Google Books columns to shorter names
isbn_more.rename(columns={'volumeInfo.title': 'title', 'volumeInfo.subtitle': 'subtitle',
                          'volumeInfo.description': 'descrip', 'id_isbn10': 'isbn10'}, inplace=True)

# remove the rows where isbn10 is the string 'None'
isbn_many = isbn_more[isbn_more.isbn10 != 'None']

# drop duplicate rows (same title and isbn10) from the concatenated results
shorter = isbn_many.drop_duplicates(subset=['title', 'isbn10'], keep='first')

# reset index of deduplicated output
shorter2 = shorter.reset_index(drop=True)
```

Ultimately, my deduplicated, re-indexed DataFrame contained 'title', 'subtitle', 'descrip', and 'isbn10' columns for 3950 science textbooks and non-fiction books. I pickled this DataFrame for subsequent processing with PyMongo.
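
The paged requests and the ISBN lookup described in steps 1 and 2 above can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook code: the helper names `build_page_urls` and `extract_isbn10` are hypothetical, the `subject:textbook` term from the original query string is left out for brevity, and the response layout (`volumeInfo.industryIdentifiers` entries with `type` and `identifier` keys) is assumed from the Google Books volume format. No network request is made here.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.googleapis.com/books/v1/volumes'


def build_page_urls(topic, pages=10, page_size=40, api_key='xxxxxx'):
    """Return one request URL per page of results for a single query topic."""
    urls = []
    for page in range(pages):
        params = {
            'q': f'science {topic} nonfiction',  # spaces are encoded as '+'
            'maxResults': page_size,             # 40 results per request
            'startIndex': page * page_size,      # 0, 40, 80, ...
            'printType': 'books',
            'key': api_key,                      # masked placeholder key
        }
        urls.append(f'{BASE_URL}?{urlencode(params)}')
    return urls


def extract_isbn10(item):
    """Return the ISBN_10 identifier from one result item, or None if absent."""
    identifiers = item.get('volumeInfo', {}).get('industryIdentifiers', [])
    for entry in identifiers:
        if entry.get('type') == 'ISBN_10':
            return entry.get('identifier')
    return None


# Usage: three pages of URLs for one topic, and one hand-written sample item
urls = build_page_urls('biology', pages=3)
sample_item = {
    'volumeInfo': {
        'title': 'Example Title',
        'industryIdentifiers': [
            {'type': 'ISBN_13', 'identifier': '9780306406157'},
            {'type': 'ISBN_10', 'identifier': '0306406152'},
        ],
    }
}
print(urls[1])
print(extract_isbn10(sample_item))
```

Each URL could then be passed to `requests.get`, and items without an ISBN_10 entry yield `None`, which corresponds to the rows removed in step 3.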
***
__Problem:__ The large Amazon book review database, accessed through PyMongo, is a collection of documents in which each review record is one document. Running a search that matched ISBN codes from Google Books against the 'asin' field in the database was tricky.

__Solution:__ To use the .find() function on the book review collection with PyMongo, I had to convert the ISBN-10 column from Google Books to a list and use the '$in' operator in the query filter to match any code in the ISBN list 'out':

```python
from pandas import DataFrame

# reset the index of the ISBN DataFrame and turn just the 'isbn10' column into a list
ans = isbn_all.reset_index()['isbn10']
out = ans.values.tolist()

# build the DataFrame 'many_revs' from every review whose 'asin' is in the list 'out'
# ('db' is the PyMongo database handle that holds the reviews_Books collection)
many_revs = DataFrame(list(db.reviews_Books.find({'asin': {'$in': out}})))
```
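
The query document can also be built and checked separately from the database call. The sketch below is illustrative only: `build_review_query`, `PROJECTION`, and `unmatched_codes` are hypothetical names, and the projected field names ('overall', 'reviewText', 'summary', 'unixReviewTime') are assumed from the Amazon review record format. It needs no MongoDB connection to run.

```python
def build_review_query(isbn_codes):
    """Return a MongoDB filter matching reviews whose 'asin' is in isbn_codes."""
    # Strip whitespace, drop empty values, and remove duplicates
    unique_codes = sorted({code.strip() for code in isbn_codes if code})
    return {'asin': {'$in': unique_codes}}


# Fields to return: 1 includes a field, and '_id': 0 drops MongoDB's own id
PROJECTION = {'_id': 0, 'asin': 1, 'overall': 1, 'reviewText': 1,
              'summary': 1, 'unixReviewTime': 1}


def unmatched_codes(isbn_codes, reviews):
    """Return the ISBN codes that have no review among `reviews`."""
    found = {review['asin'] for review in reviews}
    return sorted(set(isbn_codes) - found)


query = build_review_query(['0306406152', ' 0306406152', '080442957X', None])
print(query)

# With a live connection this could be used as (not run here):
# cursor = db.reviews_Books.find(query, PROJECTION)
```

Passing a projection keeps the resulting DataFrame small, and `unmatched_codes` gives a quick view of which Google Books codes have no reviews in the Amazon collection.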

### Data section conclusions <a name="Data-section-conclusions"></a>
Working Genre Review Collection:

After completing the data wrangling to find matching book codes and subsetting the collection of Amazon book reviews within the specified genre, my working DataFrame of reviews has the following characteristics:
* The science textbook genre subset includes reviews for 729 different books.
* The total number of reviews is 11546.
* The number of reviews per book ranges from 1 to 382, with an average of 16 per book.
* The average review is 754 characters (128 words) long.
* Reviewers awarded 4 stars on average.
* The longest review in this genre subset is 5,364 words long.
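
Summary figures like those above can be computed along the following lines. This is a plain-Python sketch rather than the pandas code used for the report. It assumes each review is a dict with 'asin', 'overall', and 'reviewText' keys and that words are counted by splitting on whitespace. The function name and the sample records are hypothetical.

```python
from collections import Counter
from statistics import mean


def summarize_reviews(reviews):
    """Return basic counts and averages for a list of review dicts."""
    per_book = list(Counter(r['asin'] for r in reviews).values())
    char_counts = [len(r['reviewText']) for r in reviews]
    # Assumption: a "word" is any whitespace-separated token
    word_counts = [len(r['reviewText'].split()) for r in reviews]
    return {
        'n_books': len(per_book),
        'n_reviews': len(reviews),
        'min_reviews_per_book': min(per_book),
        'max_reviews_per_book': max(per_book),
        'mean_reviews_per_book': mean(per_book),
        'mean_chars': mean(char_counts),
        'mean_words': mean(word_counts),
        'max_words': max(word_counts),
        'mean_rating': mean([r['overall'] for r in reviews]),
    }


# Made-up sample records, only to show the shape of the input
sample_reviews = [
    {'asin': 'A', 'overall': 5.0, 'reviewText': 'Great clear book'},
    {'asin': 'A', 'overall': 2.0, 'reviewText': 'Too dense'},
    {'asin': 'B', 'overall': 5.0, 'reviewText': 'Helpful diagrams and good examples'},
]
summary = summarize_reviews(sample_reviews)
print(summary)
```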

## List of other potential data sets <a name="List-of-other-potential-data-sets"></a>
* Metadata from McAuley: I also have access to an Amazon metadata file which includes descriptions, price, sales-rank, brand info, and co-purchasing links for 9.4 million products (not just books).
* For the books in my genre subset I have title, subtitle and description text that I am not currently utilizing.

## Explanation of initial findings <a name="Explaination-of-initial-findings"></a>
Through exploratory data analysis and inferential statistics I find several clear patterns. First, the book review data are dominated by 5-star, highly rated books. Reviews at all rating levels range from short to very long, but 'low' rated books have significantly longer reviews (by word count) than 'high' rated books. There may be patterns in rating assignments over time, but a superficial check of several frequently reviewed books indicates that 1) books receive high and low ratings distributed over time, so there is likely no peer effect causing reviews to turn either all high or all low over time, and 2) many books have a peak in review frequency near the beginning of their review record on Amazon, followed by a less frequent but continuous rate of reviews. A few books have infrequent reviews at first, and the rate of reviews picks up over time. These patterns may warrant time series analysis of a larger subset of frequently reviewed books (>100 reviews). Finally, the quantitative variables available for these reviews are limited and do not reveal strong patterns of characteristics associated with high or low rated reviews. More informative analyses will result from applying classification models.
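
One way to check the review-length comparison is to label each rating as high or low and compare word counts between the two groups. The sketch below makes two loud assumptions: the cutoff of 4 stars for 'high' is illustrative, since the threshold is not stated here, and Welch's t statistic is used as an example test, which is not necessarily the one used for this report. All names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

HIGH_THRESHOLD = 4.0  # assumption: ratings of 4 stars or more count as 'high'


def label_rating(overall, threshold=HIGH_THRESHOLD):
    """Map a star rating to the label 'high' or 'low'."""
    return 'high' if overall >= threshold else 'low'


def word_counts_by_label(reviews):
    """Group whitespace-split word counts of review text by rating label."""
    groups = {'high': [], 'low': []}
    for review in reviews:
        label = label_rating(review['overall'])
        groups[label].append(len(review['reviewText'].split()))
    return groups


def welch_t(sample_a, sample_b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    standard_error = sqrt(variance(sample_a) / len(sample_a)
                          + variance(sample_b) / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error


# Made-up word counts, only to show the call pattern
low_counts = [140, 180, 160]
high_counts = [100, 120, 110]
print(welch_t(low_counts, high_counts))
```

A positive statistic here means the first group (low-rated reviews) is longer on average. A p-value would additionally require the Welch degrees of freedom and a t distribution, which are left out of this sketch.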

## Sources: <a name="Sources"></a>
Rudolph, Stacey. (2015, July 25). The Impact of Online Reviews on Customers’ Buying Decisions [Infographic]. Retrieved from: https://www.business2community.com/infographics/impact-online-reviews-customers-buying-decisions-infographic-01280945#etm1uliB3CDhGtdP.99
 
Ruining He and Julian McAuley. 2016. Ups and Downs: Modeling the Visual Evolution of Fashion Trends with One-Class Collaborative Filtering. In Proceedings of the 25th International Conference on World Wide Web (WWW '16). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 507-517. DOI: https://doi.org/10.1145/2872427.2883037
 
Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. 2015. Image-Based Recommendations on Styles and Substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '15). ACM, New York, NY, USA, 43-52. DOI: http://dx.doi.org/10.1145/2766462.2767755