# Amazon Best-Selling Books Analysis

<ul>
<li>Name: Book name</li>
<li>Author: Book author</li>
<li>User Rating: Amazon user rating (0.0 - 5.0)</li>
<li>Reviews: Number of user reviews</li>
<li>Price: Book price (as of 2020)</li>
<li>Year: The year(s) it ranked</li>
<li>Genre: Fiction or non-fiction</li>
</ul>

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('bestsellers.csv')  # read the bestseller file and store it in df (a DataFrame)

## Explore the Data


In [6]:
print(df.head()) # print first 5 rows


                                                Name  \
0                      10-Day Green Smoothie Cleanse   
1                                  11/22/63: A Novel   
2            12 Rules for Life: An Antidote to Chaos   
3                             1984 (Signet Classics)   
4  5,000 Awesome Facts (About Everything!) (Natio...   

                     Author  User Rating  Reviews  Price  Year        Genre  
0                  JJ Smith          4.7    17350      8  2016  Non Fiction  
1              Stephen King          4.6     2052     22  2011      Fiction  
2        Jordan B. Peterson          4.7    18979     15  2018  Non Fiction  
3             George Orwell          4.7    21424      6  2017      Fiction  
4  National Geographic Kids          4.8     7665     12  2019  Non Fiction  


In [7]:
print(df.shape)  # shape of the DataFrame as (rows, columns)

(550, 7)


In [8]:
print(df.columns) # displays the columns


Index(['Name', 'Author', 'User Rating', 'Reviews', 'Price', 'Year', 'Genre'], dtype='object')


In [9]:
print(df.describe()) # describe the data 


       User Rating       Reviews       Price         Year
count   550.000000    550.000000  550.000000   550.000000
mean      4.618364  11953.281818   13.100000  2014.000000
std       0.226980  11731.132017   10.842262     3.165156
min       3.300000     37.000000    0.000000  2009.000000
25%       4.500000   4058.000000    7.000000  2011.000000
50%       4.700000   8580.000000   11.000000  2014.000000
75%       4.800000  17253.250000   16.000000  2017.000000
max       4.900000  87841.000000  105.000000  2019.000000


## Cleaning the Data


In [10]:
# remove any duplicate rows

df.drop_duplicates(inplace=True)  # inplace=True modifies df directly instead of returning a new DataFrame
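
As a rough check of what this step removes, the sketch below counts fully identical rows before dropping them. The `sample` frame and its values are hypothetical stand-ins for `df`, not rows from `bestsellers.csv`. Note that `drop_duplicates()` only removes rows that match in every column, so a book that ranked in two different years is kept twice.

```python
import pandas as pd

# Hypothetical miniature sample (illustrative values, not taken from bestsellers.csv)
sample = pd.DataFrame({
    "Name": ["Book A", "Book A", "Book A", "Book B"],
    "Author": ["Author X", "Author X", "Author X", "Author Y"],
    "Year": [2015, 2015, 2016, 2016],
})

# duplicated() marks rows that are identical to an earlier row in every column
n_duplicates = int(sample.duplicated().sum())

# Without inplace=True, drop_duplicates() returns a new DataFrame
deduped = sample.drop_duplicates()

print(n_duplicates)  # number of exact duplicate rows
print(len(deduped))  # rows remaining; "Book A" still appears once per year
```

Running the same two calls on `df` before the cleaning cell would show whether the dataset contains any exact duplicates at all.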

In [11]:
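# rename the columns of the DataFrame to make them more descriptive
# note: the "Year" column holds the year(s) the book ranked, so "Publication Year" means the ranking year here

df.rename(columns={
    "Name": "Book Title",
    "Year": "Publication Year",
    "User Rating": "Rating",
}, inplace=True)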

### The data after renaming

<ul>
<li>Book Title: Book name</li>
<li>Author: Book author</li>
<li>Rating: Amazon user rating (0.0 - 5.0)</li>
<li>Reviews: Number of user reviews</li>
<li>Price: Book price (as of 2020)</li>
<li>Publication Year: The year(s) it ranked</li>
<li>Genre: Fiction or non-fiction</li>
</ul>



In [12]:
# convert the "Price" column from integer to float so prices are stored as decimal numbers

df["Price"] = df["Price"].astype(float)
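
A quick way to confirm the conversion is to inspect the column's dtype afterwards. This is a minimal sketch on a hypothetical `sample` frame. The integer prices mirror the style of the values shown in `df.head()`, but the frame itself is an assumption.

```python
import pandas as pd

# Hypothetical sample with integer prices (illustrative, not read from bestsellers.csv)
sample = pd.DataFrame({"Price": [8, 22, 15]})

dtype_before = sample["Price"].dtype  # integer dtype inferred by pandas
sample["Price"] = sample["Price"].astype(float)
dtype_after = sample["Price"].dtype   # float64 after the conversion

print(dtype_before, dtype_after)
```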

## Analysis


In [13]:
# Analyzing author popularity: count how many rows (list appearances) each author has

authorCounts = df["Author"].value_counts()
print(authorCounts)

Author
Jeff Kinney                           12
Gary Chapman                          11
Rick Riordan                          11
Suzanne Collins                       11
American Psychological Association    10
                                      ..
Keith Richards                         1
Chris Cleave                           1
Alice Schertle                         1
Celeste Ng                             1
Adam Gasiewski                         1
Name: count, Length: 248, dtype: int64
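
Because `value_counts()` counts rows, an author whose single book ranked in several years scores as high as an author with several different books. The sketch below contrasts the two measures on a hypothetical `sample` frame. The column names follow the renamed DataFrame, and the titles and authors are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: author "A" has one title listed twice plus a second title
sample = pd.DataFrame({
    "Book Title": ["T1", "T1", "T2", "T3"],
    "Author": ["A", "A", "A", "B"],
})

# Appearances on the list (one per row), as in the cell above
appearances = sample["Author"].value_counts()

# Number of distinct titles per author
unique_titles = sample.groupby("Author")["Book Title"].nunique()

print(appearances)
print(unique_titles)
```

Comparing both series for the real data would show which authors rank highly through repeated appearances of the same book.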


In [14]:
# Average Rating by Genre
avg_rating_by_genre = df.groupby("Genre")["Rating"].mean()
print(avg_rating_by_genre)

Genre
Fiction        4.648333
Non Fiction    4.595161
Name: Rating, dtype: float64
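
The two means differ by only about 0.05, so it can help to look at the median and the number of rows per genre alongside the mean. The following is a sketch using `agg` on a hypothetical `sample` frame with made-up ratings. Applied to `df`, the same call would use the real "Genre" and "Rating" columns.

```python
import pandas as pd

# Hypothetical sample ratings (illustrative values only)
sample = pd.DataFrame({
    "Genre": ["Fiction", "Fiction", "Non Fiction", "Non Fiction"],
    "Rating": [4.8, 4.6, 4.5, 4.7],
})

# One row per genre, one column per statistic
summary = sample.groupby("Genre")["Rating"].agg(["mean", "median", "count"])

print(summary)
```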


In [None]:
# optional: uncomment the next line to save the 10 most frequent authors to a CSV file
# authorCounts.head(10).to_csv("topAuthors.csv")