This notebook explores the Amazon Top 50 Bestselling Books dataset for 2009 to 2019. The dataset contains 550 records (50 bestsellers for each of the 11 years), categorized as fiction or non-fiction using Goodreads. Analysing it gives us a picture of book market trends over the past decade.


The dataset includes seven columns: the name of the book, its author, the user rating, the number of reviews, the price, the year it ranked on the bestseller list, and the genre (fiction or non-fiction).

__Content:__
1. Data Exploration
1. Data Visualisation
1. Data Preprocessing
1. Building Random Forest Model

__Features:__ <br>
1. __Name:__ Name of the Book <br>
1. __Author:__ The Author of the Book <br>
1. __User Rating:__ Amazon User Rating <br>
1. __Reviews:__ Number of Reviews on Amazon <br>
1. __Price:__ The Price of the Book <br>
1. __Year:__ The Year(s) It Ranked on the Bestseller List <br>
1. __Genre:__ Whether Fiction or Non-fiction <br>

# Questions to Be Answered with the Dataset

1. What Is the Rating Distribution for the Books?
1. Which Category Has a Wider Range and Distribution?
1. What Is Price Distribution? 
1. What Is Reviews Distribution?
1. Who Has Written the Most Books?
1. What Is the Number of Books Per Rating?
1. Which Year Has the Highest User Rating?
1. Which Year Has the Highest Reviews?
1. What is the Price Variation Through Time?
1. What Are the Highest Reviewed Books?
1. What Are the Lowest Reviewed Books?
1. What Are the Worst Rated Books?
1. What Are the Most Expensive Books?
1. What Are the Cheapest Books?
1. What Are the Best 10 Free Books?
1. Who Are the Most Popular Authors?

In [None]:
# import necessary libraries
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn import metrics
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [None]:
# set the plot style, color palette and default figure size in one call
sns.set_theme(style = 'whitegrid', palette = 'deep', rc = {'figure.figsize': (9, 5)})

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
# read the csv file
df = pd.read_csv('../input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv')

# Data Exploration

In [None]:
# show first few records
df.head()

In [None]:
# get the number of records and columns
df.shape

In [None]:
# get concise summary of the dataframe
df.info()

In [None]:
# check missing values
df.isnull().sum()

In [None]:
# check duplicate values
df.duplicated().sum()

In [None]:
df.describe()

In [None]:
# range of prices
np.sort(df['Price'].unique())

A price of zero suggests that Amazon distributed those books for free. Let's list the books with a zero price.

In [None]:
df[df['Price'] == 0]

# Data Visualisation

## What Is the Rating Distribution for the Books?

In [None]:
# ratings distribution (kernel density estimate; `fill` replaces the deprecated `shade` argument)
sns.kdeplot(df['User Rating'], fill = True)
plt.title('Rating Distribution')
plt.xlabel('Rating')
plt.ylabel('Density');

## Which Category Has a Wider Range and Distribution?

In [None]:
# genre distribution
plt.pie(df['Genre'].value_counts(), autopct = '%1.2f%%', labels = df['Genre'].value_counts().index)
plt.title('Genre Distribution');

## What Is Price Distribution? 

In [None]:
# price distribution
plt.title('Price Distribution')
sns.histplot(x = 'Price', hue = 'Genre', data = df);

## What Is Reviews Distribution?

In [None]:
# reviews distribution
plt.title('Reviews Distribution')
sns.histplot(x = 'Reviews', hue = 'Genre', data = df);

## Who Has Written the Most Books?

In [None]:
# authors with most books
most_books = df['Author'].value_counts().head(10)
most_books

In [None]:
# visualise authors with most books
most_books.plot(kind = 'pie', autopct = '%1.1f%%', figsize = (7, 7));

## What Is the Number of Books Per Rating?

In [None]:
# number of books per rating (recent seaborn versions require x and y as keyword arguments)
rating_counts = df['User Rating'].value_counts()
sns.barplot(x = rating_counts.index, y = rating_counts.values)
plt.title('Number of Books Each Rating Received')
plt.xlabel('Ratings')
plt.ylabel('Counts')
plt.xticks(rotation = 45);

## Which Year Has the Highest User Rating?

In [None]:
# average user rating per year
df.groupby('Year')['User Rating'].mean().plot(marker = 'o', c = 'g')
plt.title('Year vs Average Rating')
plt.xlabel('Year')
plt.ylabel('Average Rating');

## Which Year Has the Highest Reviews?

In [None]:
# total number of reviews per year
df.groupby('Year')['Reviews'].sum().plot(marker = 'o', c = 'g')
plt.title('Year vs Total Reviews')
plt.xlabel('Year')
plt.ylabel('No. of Reviews');

## What is the Price Variation Through Time?

In [None]:
# average price per year
df.groupby('Year')['Price'].mean().plot(marker = 'o', c = 'g')
plt.title('Variation of Average Price Over the Years')
plt.xlabel('Year')
plt.ylabel('Average Price');

## What Are the Highest Reviewed Books?

In [None]:
# top reviewed books (x and y passed as keyword arguments)
top_reviews = df.nlargest(20, ['Reviews'])
sns.barplot(x = top_reviews['Reviews'], y = top_reviews['Name']);

## What Are the Lowest Reviewed Books?

In [None]:
# lowest reviewed books (x and y passed as keyword arguments)
lowest_reviews = df.nsmallest(10, ['Reviews'])
sns.barplot(x = lowest_reviews['Reviews'], y = lowest_reviews['Name']);

## What Are the Worst Rated Books?

In [None]:
# worst rated books
worst = df.sort_values('User Rating').head(10)
worst

In [None]:
# visualise worst rated books
plt.title('Worst Rated Books')
sns.barplot(y = worst['Name'], x = worst['User Rating']);

## What Are the Most Expensive Books?

In [None]:
# top expensive books
plt.title('Expensive books in Amazon bestseller list')
top_expensive = df.drop(df[df['Price'] < 1].index).sort_values('Price', ascending = False).head(10)
sns.barplot(y = top_expensive['Name'], x = top_expensive['Price']);

## What Are the Cheapest Books?

In [None]:
# cheapest books, excluding free ones (boolean masks are negated with ~, not unary minus)
plt.title('Cheapest books in Amazon bestseller list')
cheapest = df[~df['Price'].isin([0])].sort_values('Price').head(10)
sns.barplot(y = cheapest['Name'], x = cheapest['Price']);

## What Are the Best 10 Free Books?

In [None]:
# top free books: keep only zero-price rows, then sort by rating
df[df['Price'] == 0].sort_values('User Rating', ascending = False).head(10)

## Who Are the Most Popular Authors?

In [None]:
# most popular authors
authors = df.groupby('Author').agg({'User Rating':'mean', 'Reviews':'sum', 'Name': 'count'}).rename({'Name': 'Total Books'}, axis = 1)
authors.sort_values(['User Rating', 'Reviews'], ascending = False).head(10)

# Data Preprocessing

In [None]:
# encode genre and author columns
le = preprocessing.LabelEncoder()
df['Genre'] = le.fit_transform(df['Genre'])
df['Author'] = le.fit_transform(df['Author'])
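Because the same `LabelEncoder` is refitted on `Author`, the mapping learned for `Genre` is overwritten and can no longer be inverted. A minimal sketch of one alternative, keeping a separate encoder per column, is shown below. The `toy` frame and the `encoders` dictionary are hypothetical names used only for illustration, and the values are made up rather than taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# toy stand-in for the real dataframe (values are illustrative only)
toy = pd.DataFrame({
    'Genre': ['Fiction', 'Non Fiction', 'Fiction'],
    'Author': ['B', 'A', 'B'],
})

# keep one fitted encoder per column so each mapping stays available
encoders = {}
for col in ['Genre', 'Author']:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# recover the original genre labels from the integer codes
genre_labels = encoders['Genre'].inverse_transform(toy['Genre'])
print(toy)
print(list(genre_labels))
```

Keeping the encoders around is mainly useful when the integer codes need to be translated back into readable labels later, for example when inspecting predictions.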

In [None]:
# clean the name column for sentiment analysis: lowercase it and strip punctuation
df['Name'] = df['Name'].str.lower()
df['Name'] = df['Name'].str.replace(r'[^\w\s]', '', regex = True)

In [None]:
# download the VADER lexicon (assumption: it is not already installed in this environment)
nltk.download('vader_lexicon', quiet = True)

# calculate negative, positive, neutral and compound values
score = SentimentIntensityAnalyzer()
df['Sentiment'] = df['Name'].apply(lambda x : score.polarity_scores(x))
df['Neutral'] = df['Sentiment'].apply(lambda x : x['neu'])
df['Positive'] = df['Sentiment'].apply(lambda x : x['pos'])
df['Negative'] = df['Sentiment'].apply(lambda x : x['neg'])
df['Compound'] = df['Sentiment'].apply(lambda x : x['compound'])

# drop the raw score dictionary and the text column, which the model cannot use directly
df = df.drop(columns = ['Sentiment', 'Name'])

The features are not yet scaled. We scale them only after splitting to prevent data leakage: if the scaler were fitted before the split, the mean and standard deviation used to standardise the data would be computed from the full dataset rather than the training subset, leaking information about the test set into the training set.
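One way to make this ordering hard to get wrong is to wrap the scaler and the model in a scikit-learn `Pipeline`, so the scaler is only ever fitted on whatever data is passed to `fit`. The sketch below uses synthetic data rather than the book dataset; `X_demo`, `y_demo` and the smaller forest size are assumptions made purely for illustration, not the notebook's actual setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic features and target (illustrative only, not the book data)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc = 10.0, scale = 3.0, size = (200, 3))
y_demo = 0.1 * X_demo[:, 0] + rng.normal(scale = 0.05, size = 200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size = 0.33, random_state = 42)

# the pipeline fits the scaler on the training rows only, then trains the forest
pipe = make_pipeline(StandardScaler(), RandomForestRegressor(n_estimators = 50, random_state = 42))
pipe.fit(X_tr, y_tr)

# the stored mean comes from the training subset, not the full dataset
scaler = pipe.named_steps['standardscaler']
print('scaler mean:    ', scaler.mean_)
print('train-only mean:', X_tr.mean(axis = 0))
print('full-data mean: ', X_demo.mean(axis = 0))

# at prediction time the test rows are transformed with the training statistics
preds = pipe.predict(X_te)
```

The same guarantee carries over to cross-validation, where the pipeline refits the scaler inside each training fold.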

# Modelling

In [None]:
# split the dataset into features and target
X = df.drop(columns = 'User Rating')
y = df['User Rating']

In [None]:
# split features and target into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 42)

In [None]:
# standardise the features: fit the scaler on the training set only, then apply it to both sets
scale = StandardScaler()
X_train = scale.fit_transform(X_train)
X_test = scale.transform(X_test)

# Into the Woods 😄

In [None]:
# instantiate model with 1000 decision trees
model = RandomForestRegressor(n_estimators = 1000, random_state = 42)

In [None]:
# train the model on training data
model.fit(X_train, y_train)

In [None]:
# use the forest's predict method on the testset
y_pred = model.predict(X_test)

In [None]:
# show actual values vs predicted values
predictions = pd.DataFrame({'Actual' : y_test, 'Predicted' : y_pred})
predictions.head()

In [None]:
print('Mean Absolute Error (MAE):', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error (MSE):', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error (RMSE):', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

In [None]:
# calculate mean absolute percentage error (MAPE) as a fraction, then report it in percent
mape = np.mean(np.abs((y_test - y_pred) / np.abs(y_test)))
print('Mean Absolute Percentage Error (MAPE):', round(mape * 100, 2), '%')

In [None]:
# report 100% minus MAPE as an accuracy-style score (this is not a classification accuracy)
print('Accuracy:', round(100*(1 - mape), 2), '%')
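The accuracy figure above is simply 100 minus MAPE, so it is worth seeing how it behaves when the target values sit in a narrow band, as ratings do. The following worked example uses three made-up ratings (`y_true`, `y_hat` and the constant baseline are hypothetical, not results from the model) and compares a predictor against one that always outputs the mean.

```python
import numpy as np

# made-up ratings and predictions, for illustration only
y_true = np.array([4.0, 4.5, 5.0])
y_hat = np.array([4.2, 4.5, 4.5])

# MAPE as a fraction: mean of |error| / |actual|
mape = np.mean(np.abs((y_true - y_hat) / y_true))
score = round(100 * (1 - mape), 2)

# baseline that always predicts the mean of the actual values
baseline = np.full_like(y_true, y_true.mean())
baseline_mape = np.mean(np.abs((y_true - baseline) / y_true))
baseline_score = round(100 * (1 - baseline_mape), 2)

print('model    MAPE:', round(mape * 100, 2), '% -> score', score)
print('baseline MAPE:', round(baseline_mape * 100, 2), '% -> score', baseline_score)
```

Here the constant baseline scores 92.5 against 95.0 for the predictor, which suggests that a high value of this score should be read relative to a simple baseline rather than as the share of correct predictions.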