# Goal

To create a simple model that predicts a book's user rating given the number of reviews it has, its price in dollars, the year it was released, and its genre.



In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# 1. Data exploration

In [None]:
dataset_path = '/kaggle/input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv'
amazon_dataset = pd.read_csv(dataset_path)
amazon_dataset.head()

In [None]:
amazon_dataset.info()

In [None]:
amazon_dataset.hist(bins=50, figsize=(20,15))

User rating, price, and reviews all look heavily skewed, roughly exponential in shape, with far outliers in the tails.
Those will all have to be standardized, but for fun let's look at the extreme ends of the distributions first.
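A quick numeric check can back up the visual impression from the histograms. The sketch below uses a small synthetic right-skewed sample (not the real dataset; the column name `Reviews` is only borrowed for illustration) and measures skewness with pandas' `skew()`. The `log1p` step is an assumption added here, as one common way to tame a long right tail.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a right-skewed column such as "Reviews"
# (hypothetical values, not taken from the real dataset).
rng = np.random.default_rng(0)
toy = pd.DataFrame({"Reviews": rng.lognormal(mean=8.0, sigma=1.0, size=500)})

# Sample skewness: roughly 0 for a symmetric distribution,
# clearly positive when there is a long right tail.
raw_skew = toy["Reviews"].skew()

# log1p compresses large values, which usually reduces right skew.
log_skew = np.log1p(toy["Reviews"]).skew()

print(f"raw skew: {raw_skew:.2f}, skew after log1p: {log_skew:.2f}")
```

Running the same two `skew()` calls on the real columns would show how far each one is from symmetric.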



In [None]:
amazon_dataset.describe()

The highest number of reviews a book got in this dataset was 87841. I wonder which book could have led to everybody discussing it so much.

In [None]:
amazon_dataset.loc[amazon_dataset["Reviews"] == 87841]

Somebody more into fiction might have guessed this book. A quick Google search revealed it used to be the top Amazon bestseller and is now getting a movie, so it's very popular.

What's the most expensive book in the list?

In [None]:
amazon_dataset.loc[amazon_dataset["Price"] == 105]

The DSM, a manual for psychiatric diagnosis. No wonder it's so expensive.

In [None]:
amazon_dataset.loc[amazon_dataset["Price"] == 0]

Looking at the entries where the price is zero, there is a clear problem. Not only are some books erroneously entered with a price of zero, but there are also duplicates of books in the dataset. That will need to be taken care of, but for now I'm just going to keep exploring the data.

In [None]:
amazon_dataset.loc[amazon_dataset["User Rating"] == 3.3]

In [None]:
amazon_dataset.loc[amazon_dataset["User Rating"] == 4.9]

Other than the interesting fact that J.K. Rowling appears in both the best-rated and worst-rated spots, everything is normal here.

It also seems the curator of the dataset made sure to pick an equal number of books from each year. Let’s make sure with one example.

In [None]:
year2019 = amazon_dataset.loc[amazon_dataset["Year"] == 2019]

np.round(year2019.shape[0] / len(amazon_dataset), decimals=2)  # share of all rows that are books from 2019


At first glance that doesn't look right: with 10 years of books, each year would get 10% of the dataset, so the number would be 0.1. But 2009 through 2019 inclusive is actually 11 years, so an equal share is 1/11, or about 0.09 after rounding.
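The per-year share can also be worked out directly. The sketch below assumes the dataset covers every year from 2009 through 2019 inclusive, as its name suggests, with the same number of books in each year.

```python
import numpy as np

first_year, last_year = 2009, 2019

# Both endpoints count, so the span is 11 years, not 10.
n_years = last_year - first_year + 1

# With equal representation, each year contributes 1 / n_years of the rows.
expected_share = np.round(1 / n_years, decimals=2)

# 550 rows spread evenly over the years gives the per-year book count.
books_per_year = 550 / n_years

print(n_years, expected_share, books_per_year)
```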

To be totally sure let's check that every year has equal weighting.

In [None]:
def yearRatios(dataset):
    """
    Computes each year's share of the rows in a dataset.

    Parameters:
        dataset (pandas DataFrame): a dataset with a "Year" column

    Returns:
        dict: maps each year to its share of all rows, rounded to 2 decimals
    """
    years_dict = {}
    year_series = dataset["Year"]
    total_instances = year_series.shape[0]
    # Iterate over the distinct years in ascending order
    for year in np.sort(year_series.unique()):
        year_count = (year_series == year).sum()
        years_dict[year] = np.round(year_count / total_instances, decimals=2)
    return years_dict
        
    
yearRatios(amazon_dataset)
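For comparison, pandas can produce the same kind of ratios without an explicit loop. This is a sketch on a tiny made-up frame (the `toy` data is hypothetical); `value_counts(normalize=True)` returns each year's share of the rows.

```python
import pandas as pd

# Tiny made-up frame with the same "Year" column name as the dataset.
toy = pd.DataFrame({"Year": [2009, 2009, 2010, 2010, 2011, 2011]})

# normalize=True turns counts into fractions of the total row count;
# sort_index orders the result by year instead of by frequency.
year_ratios = (
    toy["Year"]
    .value_counts(normalize=True)
    .sort_index()
    .round(2)
    .to_dict()
)

print(year_ratios)
```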

Thankfully it looks like all years are given equal weight, so there is nothing to worry about for stratification there. But let’s see if stratification is needed for the number of fiction books vs. non-fiction books.

In [None]:
Fiction_fraction = amazon_dataset.loc[amazon_dataset["Genre"] == "Fiction"]

np.round(Fiction_fraction.shape[0] / 550, decimals=2) * 100

Seems like there is a bias in favor of non-fiction books in the dataset, so the training and test sets will need to be stratified based on that.
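To illustrate what stratifying buys, here is a minimal sketch on a made-up frame with a 60/40 genre imbalance (the data and the `random_state` value are assumptions for the example). Passing the genre column to `stratify` keeps the same proportions in both splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame with an imbalanced "Genre" column (60% / 40%).
toy = pd.DataFrame({
    "Genre": ["Non Fiction"] * 60 + ["Fiction"] * 40,
    "Price": range(100),
})

# stratify keeps the genre proportions (nearly) identical in both splits.
train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["Genre"], random_state=42
)

train_share = train["Genre"].value_counts(normalize=True)
test_share = test["Genre"].value_counts(normalize=True)
print(train_share, test_share, sep="\n")
```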

# 2. Data cleaning function set up

First let's take out all books with a zero price. Sadly, my modeling assumption is that Amazon is not offering free products.

In [None]:
def noZeroPrice(dataset):
    zero_indexes = dataset.loc[dataset["Price"] == 0].index
    return dataset.drop(zero_indexes)

price_transformer = FunctionTransformer(noZeroPrice)

amazon_no_zeros = price_transformer.transform(amazon_dataset)

amazon_no_zeros.shape

Now dropping duplicate rows based on the book name. It looks like books were entered multiple times in different years. That might mess up model predictions.

In [None]:
def dropNameDupes(dataframe):
    """
    Drops rows whose book name already appeared earlier in the dataset,
    keeping the first occurrence of each name.

    Parameters:
        dataframe (pandas DataFrame): a dataset with a "Name" column

    Returns:
        pandas DataFrame: a new dataframe with the duplicate rows removed
    """
    return dataframe.drop_duplicates(subset=["Name"])

dupes_transformer = FunctionTransformer(dropNameDupes)

amazon_nodupes_nozero = dupes_transformer.transform(amazon_no_zeros)

amazon_nodupes_nozero.shape[0]

Ouch, only 343 viable books to work with now, but it's what we've got.
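To see how much each cleaning step removes, the row counts can be compared before and after every step. This is a sketch on a few made-up rows (the `toy` frame is hypothetical), using a boolean mask and `drop_duplicates` in place of the transformers above.

```python
import pandas as pd

# Made-up rows: one zero-price entry and one title repeated across years.
toy = pd.DataFrame({
    "Name": ["Book A", "Book A", "Book B", "Book C", "Book D"],
    "Price": [12, 12, 0, 8, 20],
    "Year": [2010, 2011, 2012, 2012, 2013],
})

# Step 1: drop rows whose price is zero.
no_zero = toy[toy["Price"] != 0]

# Step 2: keep only the first occurrence of each book name.
no_dupes = no_zero.drop_duplicates(subset=["Name"])

removed_zero = len(toy) - len(no_zero)
removed_dupes = len(no_zero) - len(no_dupes)
print(f"zero-price rows removed: {removed_zero}")
print(f"duplicate names removed: {removed_dupes}")
print(f"rows left: {len(no_dupes)} of {len(toy)}")
```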

Let’s put it all into a pipeline that cleans and standardizes the data and also turns the categorical genre feature into a one-hot encoded representation.

In [None]:
clean_pipline =  Pipeline([
        ('no_zero_price', price_transformer),
        ('no_dupes', dupes_transformer),
    ])

cleaned_set = clean_pipline.transform(amazon_dataset)

amazon_nums = amazon_dataset.select_dtypes(include=np.number).columns.tolist()
amazon_nums.remove("User Rating")
genre_attrib = ["Genre"]

full_pipeline = ColumnTransformer([
        ('std_scaler', StandardScaler(),amazon_nums),
        ("cat", OneHotEncoder(), genre_attrib),
    ])


amazon_ready = full_pipeline.fit_transform(cleaned_set)

Since the pipelines now work, let's actually make the model.

# 3. Data preparation and model creation

In [None]:

clean_set = clean_pipline.transform(amazon_dataset)

amazon_train,amazon_test = train_test_split( clean_set , test_size = 0.2 , stratify= clean_set["Genre"] )

amazon = amazon_train.drop("User Rating", axis=1)
amazon_labels = amazon_train["User Rating"].copy()

amazon_testset = amazon_test.drop("User Rating", axis=1)
amazon_test_labels = amazon_test["User Rating"].copy()

amazon_ready = full_pipeline.fit_transform(amazon)

amazon_test_ready = full_pipeline.transform(amazon_testset)


amazon_ready

In [None]:
lin_reg = LinearRegression()

lin_scores = cross_val_score(lin_reg, amazon_ready,amazon_labels,scoring="neg_mean_squared_error",cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)

pd.Series(lin_rmse_scores).describe()

The mean error there isn't great, especially since the exploratory part of the project showed most ratings falling in the 4.5 to 4.8 range. Being off by roughly 0.17 at a minimum is very large on that scale. Let’s try another model.
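One way to put an RMSE in context is to compare it with a baseline that always predicts the mean rating. The sketch below is not part of the original analysis: it uses synthetic data (hypothetical ratings clustered near 4.6) and scikit-learn's `DummyRegressor` as that baseline. A model is only adding value to the extent that it beats the baseline.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (hypothetical): ratings clustered near 4.6
# with only a weak dependence on the features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 4.6 + 0.05 * X[:, 0] + rng.normal(scale=0.2, size=300)

def cv_rmse(model):
    """Mean RMSE across 10 cross-validation folds."""
    scores = cross_val_score(model, X, y,
                             scoring="neg_mean_squared_error", cv=10)
    return np.sqrt(-scores).mean()

baseline_rmse = cv_rmse(DummyRegressor(strategy="mean"))
linear_rmse = cv_rmse(LinearRegression())

print(f"baseline RMSE: {baseline_rmse:.3f}")
print(f"linear RMSE:   {linear_rmse:.3f}")
```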

In [None]:
tree_reg = DecisionTreeRegressor()

tree_scores = cross_val_score(tree_reg, amazon_ready,amazon_labels,scoring="neg_mean_squared_error",cv=10)

tree_rmse_scores = np.sqrt(-tree_scores)

pd.Series(tree_rmse_scores).describe()

I'm surprised to see the decision tree do worse than the linear regression model from before. An unrestrained tree tends to overfit, though, and cross-validation scores each model on held-out folds, so overfitting shows up here as a higher error. Let’s see if it can be made better by changing some of the tree's hyperparameters.
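Overfitting can be checked by comparing the error on the training folds with the error on the held-out folds. This is a sketch on synthetic noisy data (hypothetical, standing in for the prepared features), using `cross_validate` with `return_train_score=True`.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (hypothetical), standing in for the prepared features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 4.6 + 0.05 * X[:, 0] + rng.normal(scale=0.2, size=300)

# return_train_score=True reports the error on the training folds as well.
results = cross_validate(
    DecisionTreeRegressor(random_state=0), X, y,
    scoring="neg_mean_squared_error", cv=10, return_train_score=True,
)

train_rmse = np.sqrt(-results["train_score"]).mean()
valid_rmse = np.sqrt(-results["test_score"]).mean()

# A training error near zero next to a much larger validation error
# is the usual signature of overfitting.
print(f"train RMSE: {train_rmse:.3f}, validation RMSE: {valid_rmse:.3f}")
```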

In [None]:
parameter_search_grid = [
    {'max_depth': [5, 15, 25], 'criterion': ['squared_error', 'friedman_mse', 'absolute_error', 'poisson']},
    {'max_depth': [5, 15, 25], 'max_features': [2, 3, 4]},
  ]
search_grid = GridSearchCV(tree_reg, parameter_search_grid , cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
search_grid.fit(amazon_ready, amazon_labels)

In [None]:
print(search_grid.best_estimator_)
print(search_grid.best_estimator_.feature_importances_)

A nice thing about grid search is that we now have the best hyperparameters for the decision tree model (of those we tested). The fitted tree also tells us which features are most important: looking above, user ratings are best predicted by reviews and year released. That surprises me, since I would have thought genre would have a greater impact than year.

In [None]:
best_tree = DecisionTreeRegressor(max_features=2, max_depth=5)
best_tree_scores = cross_val_score(best_tree,amazon_ready,amazon_labels,scoring="neg_mean_squared_error",cv=5)

best_tree_rmse_scores = np.sqrt(-best_tree_scores)

pd.Series(best_tree_rmse_scores).describe()

The mean error went down, but a mean error around 0.27 is still far too high to be useful. One last thing I will try is gradient boosting, to see if that does better.

In [None]:
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt_score = cross_val_score(gbrt,amazon_ready,amazon_labels,scoring="neg_mean_squared_error",cv=5)
gbrt_rmse_scores = np.sqrt(-gbrt_score)

pd.Series(gbrt_rmse_scores).describe()

The mean error is better than both decision trees above, as expected from an ensemble model, but linear regression still gives the lowest mean error, so let’s use that on the test set.

In [None]:
lin_reg.fit(amazon_ready,amazon_labels)
labels_pred = lin_reg.predict(amazon_test_ready)
mse = mean_squared_error(amazon_test_labels,labels_pred)
test_rmse = np.sqrt(mse)

test_rmse

The test Root Mean Squared Error wasn't as high as I thought it was going to be. Looking at the RMSE statistics from the cross-validation folds, the test RMSE sits around the 50% quantile. So if you were to give the model a book’s reviews, price, year released, and genre (standardized on the backend), you would get a predicted user rating that is typically about 0.19 off the real user rating. But again, considering the range of the user rating variable, 0.19 is relatively large.
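The "standardized on the backend" part can be handled by putting the preprocessing and the model in a single `Pipeline`, so that raw book attributes go in and a rating comes out. This is a sketch under assumptions: the training frame and ratings are synthetic, the `new_book` values are invented, and `handle_unknown="ignore"` is an added safeguard that the original pipeline does not use.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic training frame reusing the dataset's column names.
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "Reviews": rng.integers(100, 50000, size=200),
    "Price": rng.integers(1, 60, size=200),
    "Year": rng.integers(2009, 2020, size=200),
    "Genre": rng.choice(["Fiction", "Non Fiction"], size=200),
})
ratings = 4.6 + rng.normal(scale=0.2, size=200)

# Preprocessing and model chained together: fit once, predict on raw rows.
model = Pipeline([
    ("prep", ColumnTransformer([
        ("std_scaler", StandardScaler(), ["Reviews", "Price", "Year"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Genre"]),
    ])),
    ("lin_reg", LinearRegression()),
])
model.fit(train, ratings)

# A new, unscaled book: the pipeline applies the fitted scaler and encoder.
new_book = pd.DataFrame([{
    "Reviews": 12000, "Price": 15, "Year": 2018, "Genre": "Fiction",
}])
predicted_rating = model.predict(new_book)[0]
print(f"predicted rating: {predicted_rating:.2f}")
```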

# 4. Project Review

## Why I chose this dataset

It was always going to be an uphill battle. The dataset is limited, to say the least, but that is also why I chose it as my starter project. Amazon book reviews are an easy-to-understand and approachable subject. The dataset was small enough that I had no fear of running multiple different test functions on the whole thing, since there were only 550 instances. It also had a clear and easy target variable: I thought it would be nice for authors to have a model that could predict their book’s user score from a few input features. Small dataset + defined goal + easily understandable data made this a nice dataset to start off on.

## Limitations

Having only 550 instances and a few features to work with will put a hard limit on pretty much any model, I think. Exploring the data and finding so many duplicates and zero-price rows that removing them took out roughly 38% of the dataset (550 rows down to 343) was a big blow.

I really didn't have high hopes for any model to perform well on the dataset, though. A lot of the reason books are rated as they are is due to the broader culture surrounding them, which would be hard to model for any dataset, especially with such limited features. The highest-rated books cover young adult novels, religious works, biographies, and more. All those genres are crudely pigeonholed into "fiction" or "non-fiction". But you work with the data you have, and I chose this to start with.

## Errors

Multiple times I made functions that ended up not helping, and I cut them out. One was a standardizer I wrote before I figured out how Sklearn's pipelining worked. Another was a function that standardized the numerical variables and then put them back in the dataframe, which wasn't really needed either, since I was going to drop the book name and author features anyway: one-hot encoding them would blow up the dimensionality.

Don't get me wrong, authors and book titles will impact the user rating of a book. But it goes back to the structural problem of the limited data. If the model had enough data to encode pop culture, author name and book title would be helpful features. With so few features and instances already, I thought including them here would be futile.

But these errors did help me learn more about the tools of Sklearn and Pandas, so I'm not upset about having made them.

## For future projects

* Try out imputing missing data instead of dropping it. Or make a Sklearn class to test whether imputing or dropping performs better for the model.

* Instead of creating Python functions and then putting them through a function transformer to make them work with pipelining, just make a Sklearn class for it.

* Create a more detailed project outline next time so you don't make pointless functions.

* Try to make new features by combining features and seeing if that helps.

* Test out more models/model combinations.
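As a starting point for the first two ideas, here is a minimal sketch of a custom Sklearn transformer class. `ZeroPriceImputer` is a hypothetical name invented for this example; it treats a zero price as missing and replaces it with the median of the non-zero prices seen during `fit`, instead of dropping the row.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ZeroPriceImputer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: replace zero prices with the median
    of the non-zero prices seen during fit."""

    def __init__(self, column="Price"):
        self.column = column

    def fit(self, X, y=None):
        # Learn the median from the rows that have a real price.
        nonzero = X.loc[X[self.column] != 0, self.column]
        self.median_ = nonzero.median()
        return self

    def transform(self, X):
        # Work on a copy so the caller's dataframe is left untouched.
        X = X.copy()
        X[self.column] = X[self.column].astype(float)
        X.loc[X[self.column] == 0, self.column] = self.median_
        return X

# Made-up prices with two zero entries.
toy = pd.DataFrame({"Price": [10, 0, 20, 30, 0]})
imputed = ZeroPriceImputer().fit_transform(toy)
print(imputed["Price"].tolist())
```

Because this version keeps the number of rows unchanged, it should be able to sit inside a `Pipeline` next to the scaler and the model, which makes it straightforward to cross-validate against the row-dropping approach.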

Overall I'm happy with how the project went. I just wanted to get my hands dirty and go through a whole project, and I did. I'm putting this out as my starting place, and I hope to only get better from here.


If you have any comments on what I could do better, feel free to add them!