# Top Selling Amazon Books (2009-2019)

About the dataset:
This dataset contains information about Amazon's top-selling books (550 entries), each classified as either fiction or non-fiction.



# 1. Importing Libraries 😁



In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# 2. Loading the Dataset 🌠

Let's begin by pointing to the dataset's CSV file and loading it into a dataframe.

In [None]:
dataset_url = '../input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv'

In [None]:
df = pd.read_csv(dataset_url)
df.head()

# 3. Data Preparation and Cleaning 🤠

Here we explore the dataset and check whether it contains any null values.


In [None]:
df.info()

In [None]:
# Get summary statistics for the numeric columns
df.describe()

- The table above shows that the maximum price of a book is $105 and the maximum rating is 4.9.

In [None]:
r,c = df.shape
print(f"The dataset has {r} rows and {c} columns.")

### 3.A. Renaming Columns

In [None]:
# Rename the columns to short snake_case names that are easier to use
df.columns=['name','author','user_rating','reviews','price','year','genre']

In [None]:
df.head()
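Assigning a list to `df.columns` depends on the columns arriving in exactly the expected order. A mapping-based rename is a slightly safer alternative. The sketch below uses a one-row toy frame, and the original column names (`Name`, `User Rating`, and so on) are an assumption about the raw CSV header, so check them against `df.columns` before reusing this.

```python
import pandas as pd

# Toy frame; the original column names here are an assumption about the CSV header.
raw = pd.DataFrame({
    "Name": ["Book A"],
    "Author": ["Author X"],
    "User Rating": [4.7],
    "Reviews": [1000],
    "Price": [8],
    "Year": [2016],
    "Genre": ["Fiction"],
})

# Map each original name to its new name explicitly, independent of column order.
column_map = {
    "Name": "name",
    "Author": "author",
    "User Rating": "user_rating",
    "Reviews": "reviews",
    "Price": "price",
    "Year": "year",
    "Genre": "genre",
}
renamed = raw.rename(columns=column_map)
print(list(renamed.columns))
```

With `rename`, a name that is missing from the frame is simply left untouched rather than silently shifting every label by one position.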

### 3.B. Checking for null values

In [None]:
#To check if there is any null value in the dataframe
df.isnull().sum()

In [None]:
# Number of distinct book titles
df.name.nunique()

**There are 351 unique books, but the dataframe has 550 rows.**

The difference arises because some books appear among the top sellers in more than one year, so they occupy several rows.
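To see which titles account for the gap between row count and unique count, one option is to count how often each name appears. This is a minimal sketch on made-up data; the toy frame and its values are purely illustrative.

```python
import pandas as pd

# Made-up sample: "Book A" is a top seller in three different years.
toy = pd.DataFrame({
    "name": ["Book A", "Book A", "Book B", "Book C", "Book A"],
    "year": [2009, 2010, 2010, 2011, 2011],
})

n_rows = len(toy)                    # total rows
n_unique = toy["name"].nunique()     # distinct titles

# Count appearances per title and keep only the titles listed more than once.
appearances = toy["name"].value_counts()
repeated = appearances[appearances > 1]

print(n_rows, n_unique)
print(repeated)
```

Running the same two lines on the real `df.name` would list every book that made the chart in multiple years.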

### 3.C. Adding a feature

**Let's add another column to the dataframe holding an estimated profit, computed as the number of reviews multiplied by the price. This is no doubt less than the total amount earned, but it gives a rough overview of how much each book has earned at a minimum.**

In [None]:
df['estimated_profit']=df.reviews*df.price
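As a quick sanity check of the formula, here is the same calculation on three made-up rows. The numbers are illustrative only; the last row shows that any book with a price of 0 would get an estimated profit of 0 regardless of its review count.

```python
import pandas as pd

# Made-up rows, not taken from the dataset.
toy = pd.DataFrame({
    "name": ["Book A", "Book B", "Book C"],
    "reviews": [1000, 250, 5000],
    "price": [8, 20, 0],
})

# Element-wise product, exactly as in the cell above.
toy["estimated_profit"] = toy["reviews"] * toy["price"]
print(toy)
```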

# 4. Exploratory Analysis and Visualization

Here we look at trends in the individual columns and draw insights from the dataset.



Let's begin by configuring the plot style for `matplotlib.pyplot` and `seaborn`, which were imported earlier.

In [None]:
# setting the background to be dark, it looks cool with this ;)
sns.set_style('darkgrid')
plt.rcParams['font.size'] = 14
plt.rcParams['figure.figsize'] = (9, 5)
plt.rcParams['figure.facecolor'] = '#00000000'

# Let's put forward some questions.

Note that this analysis applies only to the top-selling category.
- Which genre has the most books in this category?
- What is the average rating of each genre?
- How popular is each genre, and how does that change over time?
- Which year had the most books sold in this category?
- What effect has time had on the price of books?
- How has the price of books in each genre changed over the years?
- How have customer reviews changed over the years?
- Which genres have the highest and lowest ratings?
- Which author is the most popular, and which has earned the most in this category?
- Who is the most popular author in each genre?
- Which books have earned the most money in each genre?
- Which books made the most money overall?
- What is the relationship between sales and ratings?


# 1 - Which genre has the most books in this category, and how are the genres distributed?

In [None]:
fiction_df_values=df[df.genre=='Fiction']
len(fiction_df_values)

In [None]:
Nfiction_df=df[df.genre=='Non Fiction']
len(Nfiction_df)

In [None]:
genre_dist=df.genre.value_counts()
genre_dist

In [None]:
sns.barplot(x=genre_dist.index,y=genre_dist);

In [None]:
# Use the computed counts instead of hard-coded values so labels and sizes always match
plt.pie(genre_dist,labels=genre_dist.index,autopct='%.0f%%');

- Non Fiction books are in the majority in the top-selling category.
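The exact shares behind the pie chart can be computed directly from the two counts shown above (240 Fiction and 310 Non Fiction). A small sketch, with the counts typed in by hand for illustration:

```python
import pandas as pd

# Genre counts as reported above.
genre_counts = pd.Series({"Non Fiction": 310, "Fiction": 240})

# Convert counts to percentages of all 550 rows.
genre_share = (genre_counts / genre_counts.sum() * 100).round(1)
print(genre_share)
```

On the real dataframe, `df.genre.value_counts(normalize=True)` gives the same proportions as fractions without typing the counts in.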

### User Ratings Overview:

- What is the average rating of each genre?

In [None]:
df.groupby('genre')['user_rating'].mean()

In [None]:
#Distribution of ratings
sns.histplot(data=df.user_rating,bins=10)
plt.xlabel("Ratings");

In [None]:
# Relationship of ratings with time.
sns.lineplot(y=df.user_rating,x=df.year,hue=df.genre);
plt.ylabel("Ratings")
plt.xlabel("Years");

- The rating histogram above shows that most of the books received ratings between 4.5 and 4.9.


### Relationship between ratings and price

In [None]:
sns.lmplot(y='user_rating',x='price',data=df)
plt.ylabel('Ratings')
plt.xlabel('Price');

**The fitted line slopes downward: as price increases, the rating tends to go down.**

In [None]:
# Fit one regression line per genre to compare the price/rating trend
sns.lmplot(y='user_rating',x='price',hue='genre',data=df);

- This graph shows that there is no significant relationship between price and ratings, although ratings fall slightly as price increases for both Fiction and Non Fiction.
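The strength of a price/rating trend can also be summarised with a single number, the Pearson correlation coefficient, instead of being read off a plot. The sketch below uses made-up values chosen to show a negative trend; it says nothing about the actual size of the correlation in the dataset.

```python
import pandas as pd

# Made-up prices and ratings, for illustration only.
toy = pd.DataFrame({
    "price": [5, 10, 15, 20, 40],
    "user_rating": [4.8, 4.7, 4.7, 4.6, 4.4],
})

# Pearson correlation: values near 0 mean a weak linear relationship,
# negative values mean rating tends to fall as price rises.
r = toy["price"].corr(toy["user_rating"])
print(f"Pearson r = {r:.2f}")
```

Applied to the real data, `df.price.corr(df.user_rating)` (optionally inside a `groupby('genre')`) would put a number on how weak or strong the relationship is.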

### An overview of how much books have earned each year

In [None]:
sns.barplot(x=df.year,y=df.estimated_profit,hue=df.genre)
plt.xlabel('Years')
plt.ylabel("Earned(millions)")
plt.title('Money earned each year');

- The graph above shows that the earnings of these top books are, in most cases, below 0.8 million.
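One detail worth keeping in mind when reading this chart: `sns.barplot` aggregates with the mean by default, so each bar is the mean estimated profit per book for that year and genre, not the yearly total. The toy example below (made-up numbers) shows how the two aggregations differ.

```python
import pandas as pd

# Made-up rows: two books per year.
toy = pd.DataFrame({
    "year": [2009, 2009, 2010, 2010],
    "estimated_profit": [100, 300, 200, 600],
})

# What the bar heights represent (mean per group) ...
mean_per_year = toy.groupby("year")["estimated_profit"].mean()
# ... versus the total earned in each year.
sum_per_year = toy.groupby("year")["estimated_profit"].sum()

print(mean_per_year)
print(sum_per_year)
```

If yearly totals are wanted in the plot, passing `estimator=sum` to `sns.barplot` is one way to get them.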

### Average profit earned per book, by genre

In [None]:
genre_average=df.groupby(['genre'])['estimated_profit'].mean()

In [None]:
sns.barplot(x=genre_average.index,y=genre_average);

# Q1: Which 10 books earned the most?

In [None]:
rich_df=df.groupby('name')['estimated_profit'].max()
rich_df=rich_df.sort_values(ascending=False).head(10)
rich_df

In [None]:
sns.barplot(x=rich_df,y=rich_df.index)
plt.xlabel("Earned(millions)")
plt.ylabel("Books")
plt.title('Earning by Books');

# Q2: Books which earned the most per year (2009-19)

In [None]:
# Keep the rows whose estimated profit equals the maximum for their year
most_earning_book_per_year=df[df.groupby('year')['estimated_profit'].transform('max') == df['estimated_profit']]
most_earning_book_per_year

In [None]:
most_earning_book_per_year=most_earning_book_per_year.sort_values('year').set_index('year')

In [None]:
most_earning_book_per_year
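An alternative way to pick the top earner per year is `idxmax`, which returns the row label of the maximum within each group. This sketch uses a made-up frame. Note the difference in behaviour: `idxmax` keeps only the first row when two books tie, whereas the `transform` comparison above keeps every tied row.

```python
import pandas as pd

# Made-up data: two books per year.
toy = pd.DataFrame({
    "year": [2009, 2009, 2010, 2010],
    "name": ["A", "B", "C", "D"],
    "estimated_profit": [100, 300, 600, 200],
})

# Row label of the highest-earning book within each year.
top_idx = toy.groupby("year")["estimated_profit"].idxmax()

# Select those rows and index the result by year.
top_per_year = toy.loc[top_idx].set_index("year")
print(top_per_year)
```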

### What is the average earning of each genre on a per-year basis?

In [None]:
genres_per_year_mean=df.groupby(['year','genre'])['estimated_profit'].mean().round(2)

In [None]:
pd.DataFrame(genres_per_year_mean)

#### Q4: Trend of earnings from top-selling books over the years.

In [None]:
Earning_Graph=df.groupby('year')['estimated_profit'].sum()

In [None]:
# Create the figure first so that its size applies to the plot drawn below
plt.figure(figsize=(12,12))
sns.lineplot(data=Earning_Graph)
plt.xlabel('Year')
plt.ylabel("Earned")
plt.title("EARNING PER YEAR");

#### Q5: Top 10 authors who earned the most.

In [None]:
authors=df.groupby('author')['estimated_profit'].sum()

In [None]:
authors=authors.sort_values(ascending=False).head(10)
authors

In [None]:
sns.barplot(y=authors.index,x=authors)
plt.title('The Money Makers ');
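Because a book can appear in several years, summing `estimated_profit` over all rows counts such a book once per year it was listed. Whether that is desirable depends on how the review counts were recorded, which the dataset description here does not say. The sketch below, on made-up data, compares the row-wise sum with a hypothetical alternative that counts each title once.

```python
import pandas as pd

# Made-up data: "Book A" is listed in two years with the same figures.
toy = pd.DataFrame({
    "author": ["X", "X", "Y"],
    "name": ["Book A", "Book A", "Book B"],
    "year": [2009, 2010, 2009],
    "estimated_profit": [100, 100, 150],
})

# Sum over every row, as in the cells above: repeated listings add up.
by_rows = toy.groupby("author")["estimated_profit"].sum()

# Alternative (an assumption, not the notebook's method): count each title once.
by_book = (
    toy.drop_duplicates(subset="name")
       .groupby("author")["estimated_profit"]
       .sum()
)

print(by_rows.sort_values(ascending=False))
print(by_book.sort_values(ascending=False))
```

In this toy case the ranking flips between the two approaches, so it is worth deciding which definition of "earned the most" the chart is meant to show.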