# Finding the Next Bestseller

## Project Goal

Using publicly available data from Goodreads, Wikipedia, and Amazon, this project aims to acquire, explore, and analyze information about books - their popularity via online reviews and ratings, as well as keywords, author name, publisher, and more - to programmatically determine which factors lead to a book landing on the New York Times Best Seller list.

## Project Creators:

- [Brandon Navarrete](https://github.com/brandontnavarrete)
- [Magdalena Rahn](https://github.com/MagdalenaRahn)
- [Manuel Parra](https://github.com/manuelparra1)
- [Shawn Brown](https://github.com/shawn-brown12)


## Setting up the Environment

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from bs4 import BeautifulSoup
import requests
import unicodedata
import re
import os
import json

import sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from xgboost import XGBClassifier as xgb

from scipy import stats

import prepare as prep
import explore as ex
import model as m

seed = 42

import warnings
warnings.filterwarnings("ignore")

## Acquisition

In [None]:
# This function sequentially runs each function from within the prepare.py file 
# in order to gather and clean the data, as well as creating our target variable and getting
# the sentiment analysis of the book summaries
df = prep.prep_data('all_books.csv')

In [None]:
# a quick peek at our dataframe
df.head()

In [None]:
# saving the above df into a new csv file, so that we don't have to run it through again unless we add to our dataset.
df.to_csv('final_df.csv')

In [None]:
# pulling the data from the csv saved above
df = pd.read_csv('final_df.csv', index_col=0)

In [None]:
# a peek to compare with the dataframe above and confirm they are the same
df.head()
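
The save-and-reload cells above amount to a simple cache: build the dataframe once, write it to disk, and read it back on later runs. A minimal sketch of that pattern follows. The helper `load_or_build` and the toy builder are hypothetical and are not part of `prepare.py`; the builder stands in for an expensive step such as `prep.prep_data('all_books.csv')`.

```python
import os
import tempfile

import pandas as pd


def load_or_build(cache_path, build_fn):
    """Return the cached dataframe if it exists; otherwise build and cache it.

    Hypothetical helper for illustration only.
    """
    if os.path.exists(cache_path):
        # index_col=0 restores the index that to_csv wrote as the first column
        return pd.read_csv(cache_path, index_col=0)
    df = build_fn()
    df.to_csv(cache_path)
    return df


def build_toy_df():
    # made-up rows, only to exercise the helper
    return pd.DataFrame({"title": ["Book A", "Book B"], "length": [320, 210]})


cache_file = os.path.join(tempfile.mkdtemp(), "final_df.csv")
first = load_or_build(cache_file, build_toy_df)   # builds and writes the csv
second = load_or_build(cache_file, build_toy_df)  # reads the csv back
print(second)
```

Deleting the csv file forces a rebuild, which is what we would want after adding to the dataset.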

-----------------------------------------

### Data Summary

In [None]:
# our rows and columns
df.shape

In [None]:
# some basic information about our data
df.info()

In [None]:
# a look at what genres we have
df['genre'].unique()

## Preparation

In [None]:
# splitting our data into train and test subsets
train, test = ex.split(df, 'successful')

In [None]:
# checking the size of our subsets
train.shape, test.shape

In [None]:
train.head()
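
Because bestsellers make up only a small share of the books, a split that ignores the target could leave very few of them in the test subset. A minimal sketch of a stratified split on made-up data is shown here; the helper name `stratified_split`, the 80/20 proportion, and the toy counts are assumptions for illustration, not the body of `ex.split`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_split(df, target, test_size=0.2, seed=42):
    """Split df into train/test, keeping the share of `target` the same in both."""
    train, test = train_test_split(
        df,
        test_size=test_size,
        random_state=seed,
        stratify=df[target],
    )
    return train, test


# made-up data: 3800 rows, 160 of them flagged as bestsellers
toy = pd.DataFrame({
    "length": range(3800),
    "successful": [True] * 160 + [False] * 3640,
})
toy_train, toy_test = stratified_split(toy, "successful")
print(toy_train.shape, toy_test.shape)
# the share of bestsellers should be close to identical in both subsets
print(toy_train["successful"].mean(), toy_test["successful"].mean())
```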

<div class="alert alert-block alert-success">
<b>Acquisition and Preparation Takeaways</b>
    
- Initially, we had over 4000 books in our book list, as well as a dataset of NYT bestsellers comprising over 1000 books, with 11 features for each book. After gathering the data, we had around 3800 books, around 160 of which were bestsellers.
    
- For any null values in our data, we either imputed or dropped them, depending on which feature was null. We ended up dropping a number of rows where the summary was empty, while we manually imputed missing book titles, lengths, and publishing years, as those gaps affected several of our bestsellers.
    
- We dropped any books not in English, as well as any duplicated books. We also used the Goodreads data on the first available hardcover edition, where possible.
    
- During our cleaning phase, we engineered a number of new columns for our dataframe, including our target column, a cleaned and lemmatized version of the summary, and several values created during our sentiment analysis of the summary.
    
- Our final dataframe had the following columns:
    - `title`, `summary`, `year_published`, `author`, `review_count`, `number_of_ratings`, `length`, `genre`, `rating`, `reviews`, `cleaned_title`, `cleaned_summary`, `target`, `lemmatized_summary`, `neg`, `neutral`, `pos`, `compound`, `sentiment`.

</div>


## Exploration

### Which words/ngrams appear more often in summaries with a positive sentiment?

In [None]:
# function to find the most common single words in bestsellers
best_words = ex.uni_id_best_seller(train)

In [None]:
# function to show most common bigrams in bestsellers
ex.best_bigrams(best_words)

<div class="alert alert-block alert-success">
<b>Takeaways:</b> 
        
Looking at bigrams, we see:

----------------------------------------    

- 'bestselling author': Either the summary references a past bestseller or notes that the book itself *is* a bestseller.
    
- 'bounty hunter': Perhaps books with bounty hunter characters are popular?

- There were a lot of character names, such as 'eve dallas' (In Death series) and 'armand gamache' (Still Life), along with the location 'three pine'.

- Sûreté du Québec is the provincial police service for the province of Quebec, in Canada.
    
</div>
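
Counting bigrams comes down to pairing each word with the word that follows it and tallying the pairs. A minimal sketch using only the standard library, on made-up summaries; `top_bigrams` is a hypothetical helper and not the body of `ex.best_bigrams`.

```python
from collections import Counter


def top_bigrams(texts, n=3):
    """Count adjacent word pairs across a list of whitespace-tokenized strings."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        # zip the word list against itself shifted by one to form pairs
        counts.update(zip(words, words[1:]))
    return counts.most_common(n)


# made-up, already-cleaned summaries
toy_summaries = [
    "bestselling author returns with bounty hunter thriller",
    "bounty hunter chases fugitive",
    "bestselling author delivers mystery",
]
print(top_bigrams(toy_summaries, n=2))
```

In this toy input, 'bestselling author' and 'bounty hunter' each appear twice, so they come out on top.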

### Which words/ngrams appear more often in summaries with a negative sentiment?

In [None]:
# function to check the frequency of top words, bigrams, and trigrams in summaries with negative sentiment
ex.explore_question_2(train)

<div class="alert alert-block alert-success">
<b>Takeaways</b>
    
Unigrams:

</div>


### Is there a relationship between the length of a book and its appearing on the NYT Best Seller list?

Exploring length and success:

$H_0$ : There is no relationship between the length of a book and its landing on the NYT Best Seller list.  
$H_a$ : There is a relationship between the length of a book and its landing on the NYT Best Seller list.

In [None]:
# function to visualize success vs book length
ex.book_len_success(train)

In [None]:
# defining two groups for chi squared function
a = train['length']
b = train['year_published']
#calling the chi squared function
ex.chi_sq(a, b)

In [None]:
# same as above, defining groups for the chi squared test
r = train['length']
s = train['successful']
#calling the function for the chi squared test
ex.chi_sq(r, s)

<div class="alert alert-block alert-success">
<b>Takeaways</b>
    
In the train dataset, book length and year published are positively correlated, particularly for books not on the NYT Best Seller list. For NYT Best Sellers, the length of the book and the year that it was published did not have a relationship.

</div>
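
A chi-squared test of independence works on counts of categorical values, so a continuous feature such as page count is typically binned before the counts are tabulated. A minimal sketch on made-up data follows; `chi_sq_sketch`, the bin edges, and the bin labels are illustrative assumptions and are not the body of `ex.chi_sq`.

```python
import pandas as pd
from scipy import stats


def chi_sq_sketch(cat_a, cat_b, alpha=0.05):
    """Chi-squared test of independence on two categorical series."""
    observed = pd.crosstab(cat_a, cat_b)
    chi2, p, dof, expected = stats.chi2_contingency(observed)
    return {"chi2": chi2, "p": p, "dof": dof, "reject_null": p < alpha}


# made-up data: page counts and a bestseller flag
toy = pd.DataFrame({
    "length": [120, 180, 150, 320, 350, 410, 380, 620, 700, 650, 300, 200],
    "successful": [False, False, False, True, False, True,
                   False, True, True, False, False, True],
})
# bin the continuous page counts so each book falls into one category
toy["length_bin"] = pd.cut(
    toy["length"], bins=[0, 250, 500, 1000], labels=["short", "medium", "long"]
)
result = chi_sq_sketch(toy["length_bin"], toy["successful"])
print(result)
```

With three length bins and two outcome classes, the test has two degrees of freedom.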

### What is the relationship between summary sentiment score and book length?

$H_0$ : There is no relationship between the book's length and the summary's sentiment score.  
$H_a$ : There is some kind of relationship between the book length and the summary's sentiment score.

In [None]:
# function to plot summary sentiment score against book length
ex.sent_vs_len(train)

In [None]:
# function to run a statistical test
ex.pearsonr_report(train['length'], train['compound'])

<div class="alert alert-block alert-success">
<b>Takeaways</b>
    
Going by the visual, any relationship here looks weak, and the Pearson's r test on the two features confirms that. We are able to reject the null hypothesis that there isn't a relationship, but it **is** a weak relationship.

</div>
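
The two numbers from a Pearson test answer different questions: the p-value says whether the correlation is distinguishable from zero, while r itself says how strong it is. With thousands of rows, even a small r can produce a p-value below the significance level. A minimal sketch of such a report, assuming a hypothetical helper `pearson_summary` (not the body of `ex.pearsonr_report`) and made-up values:

```python
from scipy import stats


def pearson_summary(x, y, alpha=0.05):
    """Report Pearson's r, its p-value, and whether the null is rejected."""
    r, p = stats.pearsonr(x, y)
    return {"r": r, "p": p, "reject_null": p < alpha}


# made-up values: an exactly linear pair, so r is numerically 1
# and the null hypothesis of no relationship is rejected
toy_length = [100, 200, 300, 400, 500]
toy_score = [0.1, 0.2, 0.3, 0.4, 0.5]
print(pearson_summary(toy_length, toy_score))
```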

## Exploration Key Takeaways

<div class="alert alert-block alert-success">
<b>Key Takeaways</b>
    
- Many of the most-used words, bigrams, and trigrams included 'new', 'york', 'times', and 'bestseller', so in our next iteration we plan to create a more robust set of stopwords.
    
- There is a weak positive correlation between book length and year published.
    
- Neither book length nor year published had a significant relationship with the success of a book when compared directly.
    
- There is a weak negative relationship between the length of a book and the sentiment score of its summary.

</div>

## Modeling

### Preparing the data for modeling

In [None]:
# function to prep df for scaling and splitting by making dummies and removing unneeded categorical columns
df = m.ready_df(df)

In [None]:
df.head(1)
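
Making dummies means replacing each categorical column with one indicator column per category, so that the model only sees numeric inputs. A minimal sketch on made-up rows; `ready_df_sketch` is a hypothetical helper, and which columns `m.ready_df` actually drops or encodes is an assumption here.

```python
import pandas as pd


def ready_df_sketch(df, categorical_cols, drop_cols):
    """Drop unused columns and one-hot encode the categorical ones."""
    out = df.drop(columns=drop_cols)
    return pd.get_dummies(out, columns=categorical_cols)


# made-up rows using a few of the column names from our dataframe
toy = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "genre": ["Mystery", "Romance", "Mystery"],
    "length": [320, 210, 450],
    "successful": [True, False, False],
})
encoded = ready_df_sketch(toy, categorical_cols=["genre"], drop_cols=["title"])
print(encoded.columns.tolist())
```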

In [None]:
# splitting df into train and test
train, test = ex.split(df, 'successful')

In [None]:
# a quick shape to check the sizes
train.shape, test.shape

In [None]:
# function to create our x/y subsets
X_train, y_train, X_test, y_test = m.Xy_set(train, test)

In [None]:
# a quick peek at the shapes
X_train.shape, y_train.shape, X_test.shape, y_test.shape

In [None]:
# a function to scale our numerical data
X_train_scaled, X_test_scaled = m.scaling(X_train, X_test)

In [None]:
X_train_scaled.head()
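
Scaling is usually fit on the train subset only and then applied to both subsets, so that no information from the test set leaks into training. A minimal sketch of that pattern on made-up values, assuming a min-max scaler (the scaler inside `m.scaling` may differ) and a hypothetical helper `scale_sketch`:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def scale_sketch(X_train, X_test, cols):
    """Fit a scaler on the train columns only, then transform both subsets."""
    scaler = MinMaxScaler()
    X_train_scaled = X_train.copy()
    X_test_scaled = X_test.copy()
    X_train_scaled[cols] = scaler.fit_transform(X_train[cols])
    # the test subset reuses the min and max learned from train
    X_test_scaled[cols] = scaler.transform(X_test[cols])
    return X_train_scaled, X_test_scaled


# made-up numeric features
toy_train = pd.DataFrame({"length": [100.0, 300.0, 500.0], "rating": [3.0, 4.0, 5.0]})
toy_test = pd.DataFrame({"length": [200.0, 600.0], "rating": [3.5, 4.5]})
scaled_train, scaled_test = scale_sketch(toy_train, toy_test, ["length", "rating"])
print(scaled_train)
print(scaled_test)
```

Test values can land outside the 0 to 1 range (a 600-page book here), which is expected when the scaler has only seen the train subset.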

### Model Evaluation

In [None]:
# function to fit our XGBoost classifier and return its predictions on the test set
y_pred = m.XGBclf(X_train_scaled, X_test_scaled, y_train, y_test)

In [None]:
# function to plot the ROC curve for our predictions
m.roc(y_test, y_pred)

<div class="alert alert-block alert-success">
<b>Takeaways</b>
    
- Our baseline recall was 0% and our baseline accuracy was 95%.
    
- Through many iterations of XGBoost models, our best model gave us a recall of 34% and an accuracy of 96%.
    
- Overall, while there's room for improvement, we have beaten both of our baseline metrics.

</div>
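
A baseline that always predicts the majority class ("not a bestseller") scores high on accuracy when bestsellers are rare, yet finds none of them, which is why its recall is 0%. A minimal sketch of both metrics on made-up labels, using only the standard library; the label counts are hypothetical and chosen only to mirror the imbalance.

```python
def accuracy(y_true, y_pred):
    """Share of all predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Share of actual positives that were predicted positive."""
    true_pos = sum(t and p for t, p in zip(y_true, y_pred))
    actual_pos = sum(y_true)
    return true_pos / actual_pos if actual_pos else 0.0


# made-up labels: 5 bestsellers among 100 books
y_true = [True] * 5 + [False] * 95
# baseline: always predict the majority class ("not a bestseller")
y_baseline = [False] * 100

print(accuracy(y_true, y_baseline))  # 0.95
print(recall(y_true, y_baseline))    # 0.0
```

This is also why recall, rather than accuracy alone, is the more informative metric for judging our model.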

## Conclusions

### Summary

<div class="alert alert-block alert-success">
<b></b>
    
- Our text data for the book summaries was not helpful in this iteration of the project.

- We correctly identified 11 of the 32 bestsellers in our test dataframe, giving us a recall score of 34.4%.

- Our accuracy score was 96%, only missing 8 out of over 700 books in our test set.

</div>


### Recommendations

<div class="alert alert-block alert-success">
<b></b>
    
- Pay attention to the style of books written by authors whose books frequently appear on the New York Times Best Seller list.

- As a publisher, make efforts to get as many Goodreads ratings as possible: in our data, books with more reader ratings on Goodreads tended to have a higher overall star rating and were more likely to be on the New York Times Best Seller list.

</div>

### Next Steps

<div class="alert alert-block alert-success">
<b></b>
    
For future iterations of this project:
- Obtain the publishers of each book and multiple Goodreads user reviews for each book.

    - The reviews would be used for natural language processing (NLP) modeling on their text. Feature engineering review sentiment scores would be another option.

    - Information on publishers would, likewise, be used as a feature in determining what contributes to a book being a NYT Best Seller title.
    
- Add a selection of new stopwords to try while cleaning the text data.

- The model *seemed* to work better when a small selection of children's books was included; we would like to add those back in and find out why.

</div>