# EDA: Amazon Top 50 Bestselling Books 📗


![](https://upload.wikimedia.org/wikipedia/commons/6/62/Amazon.com-Logo.svg)

* [Image Source](https://en.wikipedia.org/wiki/Amazon_(company))
* Amazon.com is one of the largest online marketplaces, and many people around the world purchase products there.
* In this notebook, we explore data on the top 50 bestselling books on Amazon from 2009 to 2019.
* It is not important, but I used some **emojis** on a trial basis.

### If you like this notebook, please feel free to **upvote**👍!

# Data Source
* This dataset is very interesting, and I really appreciate it.
* [Amazon Top 50 Bestselling Books 2009 - 2019](https://www.kaggle.com/sootersaalu/amazon-top-50-bestselling-books-2009-2019)

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

## Read the data

In [None]:
books = pd.read_csv('../input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv')
books.head()

<h1 style='background:#FFFFFF; border:0; color:black'><center>Data Visualization and Analysis📊</center></h1>

* From here, we visualize the data with Plotly and analyze it.

## User Rating 💯: How popular is the book?

* The column "User Rating" is the most important because it shows how popular the book is.
* The maximum rating is 5 points, and a higher value means readers rated the book more highly.

In [None]:
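fig = px.histogram(books, x="User Rating", title="User Rating Histogram")
# Label the y-axis explicitly; an empty-string key in `labels` matches no column.
fig.update_layout(yaxis_title="The Number of Books")
fig.show()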

* According to the graph above, there is little variation in User Rating.
* About 180 books have a rating of 4.8 or 4.9.
* Meanwhile, only 9 books have a rating below 4.0.
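
The counts quoted above can be checked directly from the rating column. The following is a minimal sketch and not part of the original analysis: `count_ratings` is a hypothetical helper, and the toy ratings are invented so that the block runs without the dataset.

```python
import pandas as pd


def count_ratings(ratings: pd.Series) -> dict:
    """Count ratings at or above 4.8 and ratings below 4.0."""
    return {
        "4.8 or higher": int((ratings >= 4.8).sum()),
        "below 4.0": int((ratings < 4.0).sum()),
    }


# Toy ratings, invented for illustration (not the real dataset).
toy_ratings = pd.Series([4.9, 4.8, 4.7, 4.5, 4.0, 3.9, 3.3])
rating_counts = count_ratings(toy_ratings)
print(rating_counts)
```

With the notebook's data, the same check would be `count_ratings(books["User Rating"])`.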

* Based on this information, we divide the **"Popularity"** of books into four levels.

| Popularity Level | Ratings |
| --- | --- |
| Extremely Popular | 4.8 or 4.9 |
| Very Popular | 4.5 - 4.7 |
| Fairly Popular | 4.0 - 4.4 |
| Popular | 3.3 - 3.9 |
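
As an alternative to counting in a loop, the four levels in the table above could be assigned with `pd.cut`. This is only a sketch under assumptions: the names `LABELS`, `EDGES`, and `popularity_level` are hypothetical, the bin edges are read off the table, and the toy ratings are invented for illustration.

```python
import pandas as pd

# Levels from lowest to highest, matching the table above.
LABELS = ["Popular", "Fairly Popular", "Very Popular", "Extremely Popular"]
# Left-closed bins: [0, 4.0), [4.0, 4.5), [4.5, 4.8), [4.8, inf)
EDGES = [0.0, 4.0, 4.5, 4.8, float("inf")]


def popularity_level(ratings: pd.Series) -> pd.Series:
    """Map each rating to its popularity level."""
    return pd.cut(ratings, bins=EDGES, labels=LABELS, right=False)


# Toy ratings, invented for illustration (not the real dataset).
toy_ratings = pd.Series([4.9, 4.8, 4.7, 4.5, 4.4, 4.0, 3.9])
levels = popularity_level(toy_ratings)
level_counts = levels.value_counts(sort=False)
print(level_counts)
```

Using `right=False` makes each bin include its lower edge, so a rating of exactly 4.8 lands in "Extremely Popular", which mirrors the `>=` comparisons used in the notebook.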

In [None]:
popularity = [0,0,0,0]
for i in books["User Rating"]:
    if i >= 4.8:
        popularity[0] += 1
    elif i >= 4.5:
        popularity[1] += 1
    elif i >= 4.0:
        popularity[2] += 1
    else:
        popularity[3] += 1

In [None]:
x = ['Extremely Popular (4.8, 4.9)', 'Very Popular (4.5 - 4.7)', 'Fairly Popular (4.0 - 4.4)', 'Popular (- 3.9)']
fig = go.Figure([go.Bar(x=x, y=popularity)])
fig.update_layout(title_text='Popularity of Books')
fig.show()

## Bestselling Books Rated 4.9 ⭐⭐⭐⭐⭐

In [None]:
fig = go.Figure(data=[go.Table(
    header=dict(values=['Book Title','Author'],
                fill_color='paleturquoise',
                line_color='blue',
                align='left'),
    cells=dict(values=[books[books["User Rating"] == 4.9]["Name"],books[books["User Rating"] == 4.9]["Author"]],
               fill_color='lavender',
               align='left'))
])
fig.update_layout(title='Books Rated 4.9 (You can scroll the table)')
fig.show()

## "Name"📖: The Name of Books

* The column "Name" shows the name of the bestselling book.
* The name of a book is very important because it gives readers their first impression.

## Does the name of books affect User Rating?

* Of course, a book with a short title is not always popular, and neither is a book with a long title.
* However, a short title has the advantage of being simple and easy to understand, while a long title can give readers an accurate idea of what the book is about.
* So, we try to find out whether there is a relationship between the length of the name and the rating.

In [None]:
name_len = []
for i in books["Name"]:
    name_len.append(len(i))

In [None]:
fig = go.Figure(data=[go.Histogram(x=name_len)])
fig.update_layout(title_text='The Length of Name Histogram')
fig.show()

* It is interesting that the length of book names varies widely.
* Many books have short titles, but some have very long ones.

## Relationship between Popularity and Length of Name

In [None]:
books["Name Length"] = name_len
fig = px.scatter_matrix(books, dimensions=["User Rating","Name Length"], color="User Rating",title="Scatter Plot shows the relationship between Rating and the Length of Book Title")
fig.show()

* There appears to be no correlation between popularity and the length of the name.
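
A scatter plot gives a visual impression only, so a correlation coefficient can back it up with a number. The sketch below computes the Pearson correlation between title length and rating. The helper name `name_length_rating_corr` is hypothetical, and the toy rows are invented so the block runs on its own.

```python
import pandas as pd


def name_length_rating_corr(df: pd.DataFrame) -> float:
    """Pearson correlation between title length and user rating."""
    name_length = df["Name"].str.len()
    return float(name_length.corr(df["User Rating"]))


# Toy rows, invented for illustration (not the real dataset).
toy_books = pd.DataFrame({
    "Name": ["A", "Bee", "Short Title", "A Much Longer Book Title"],
    "User Rating": [4.6, 4.8, 4.5, 4.7],
})
corr = name_length_rating_corr(toy_books)
print(round(corr, 3))
```

A value close to 0 on the real data would be consistent with the observation above, while values near 1 or -1 would contradict it.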

## Author: Who Wrote the Bestselling Book?📕

* The column "Author" shows the name of the author who wrote the bestselling book.
* There are very few authors who can write bestselling books.

In [None]:
print("We have " + str(len(books["Author"].unique())) + " best-selling authors.")

In [None]:
authors = books["Author"].value_counts()
# value_counts() is indexed by author name, so read the names from the index
# and the counts from the values instead of using integer keys like authors[i].
author_names = authors.index.tolist()
author_times = authors.tolist()

## The author who wrote 🖊️ the most bestselling books is...
### "Jeff Kinney" (<span style="color:red">12</span> Works)



In [None]:
popular_authors = pd.DataFrame({"Author":author_names,"Number of Times":author_times})
popular_authors

* The column "Number of Times" is the number of times the author's work appears in the Top 50.

In [None]:
fig = px.bar(popular_authors, y='Author', x='Number of Times',title="Best-Selling Authors")
fig.show()

* The bar graph above shows that some of the bestselling authors have received high praise over several years, and some have written several bestselling books.
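
The two cases mentioned above (one book staying in the Top 50 for several years versus several different books) can be separated by counting distinct titles per author. This is a sketch under assumptions: `author_summary` is a hypothetical helper, and the toy rows are invented for illustration.

```python
import pandas as pd


def author_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per author: total Top 50 appearances and number of distinct titles."""
    summary = df.groupby("Author").agg(
        appearances=("Name", "size"),
        distinct_titles=("Name", "nunique"),
    )
    return summary.sort_values("appearances", ascending=False)


# Toy rows, invented for illustration (not the real dataset).
toy_books = pd.DataFrame({
    "Author": ["Author A", "Author A", "Author A", "Author B", "Author B"],
    "Name": ["Book X", "Book X", "Book X", "Book Y", "Book Z"],
})
summary = author_summary(toy_books)
print(summary)
```

In the toy data, "Author A" has three appearances from a single title, while "Author B" has two appearances from two different titles.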

## Reviews: The Number of Reviews 📝 by Readers

* The column "Reviews" shows the number of reviews.
* On Amazon.com, reviews are very reliable and essential information.
* Many consumers decide whether or not to buy something by referring to others' reviews.
* A large number of reviews indicates that many readers wanted to evaluate the book, so we can guess that there is a relationship between popularity and the number of reviews.

In [None]:
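fig = px.histogram(books, x="Reviews", title="Reviews Histogram")
# Label the y-axis explicitly; an empty-string key in `labels` matches no column.
fig.update_layout(yaxis_title="The Number of Books")
fig.show()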

* There is some variation in the number of reviews.
* Only a few books have more than 50k reviews.

## The Bestselling Book which has the most reviews 🖋️ is...

### "Where the Crawdads Sing" (<span style="color:red">87841</span> Reviews)
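
A result like this can be looked up with `idxmax`, which returns the index label of the row holding the maximum value. The sketch below assumes a hypothetical helper `most_reviewed` and uses invented toy rows so it runs without the dataset.

```python
import pandas as pd


def most_reviewed(df: pd.DataFrame) -> pd.Series:
    """Return the name and review count of the row with the most reviews."""
    return df.loc[df["Reviews"].idxmax(), ["Name", "Reviews"]]


# Toy rows, invented for illustration (not the real dataset).
toy_books = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book C"],
    "Reviews": [120, 900, 45],
})
top = most_reviewed(toy_books)
print(top["Name"], top["Reviews"])
```

If several rows share the maximum, `idxmax` returns the first one.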

In [None]:
fig = px.scatter_matrix(books, dimensions=["User Rating","Reviews"], color="User Rating",title="Scatter Plot shows the relationship between Rating and Reviews")
fig.show()

* According to the graph above, there is no clear correlation between User Rating and the number of reviews.
* However, the graph also shows that books with a relatively low rating do not have many reviews.
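
The second observation can be checked by comparing the largest review count among low-rated books with the largest among the rest. This is a rough sketch: `max_reviews_by_band` is a hypothetical helper, the 4.0 threshold is an assumption taken from the popularity levels defined earlier, and the toy rows are invented.

```python
import pandas as pd


def max_reviews_by_band(df: pd.DataFrame, threshold: float = 4.0) -> dict:
    """Largest review count below the rating threshold and at or above it."""
    low = df["User Rating"] < threshold
    return {
        "below": int(df.loc[low, "Reviews"].max()),
        "at_or_above": int(df.loc[~low, "Reviews"].max()),
    }


# Toy rows, invented for illustration (not the real dataset).
toy_books = pd.DataFrame({
    "User Rating": [3.3, 3.9, 4.5, 4.8],
    "Reviews": [900, 1500, 20000, 8000],
})
max_reviews = max_reviews_by_band(toy_books)
print(max_reviews)
```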

## Price 💴: How much is the book?

* The column "Price" shows the price of books (US dollars).
* Unlike groceries 🍅🍇, books do not sell well just because they are inexpensive.
* However, it is also true that inexpensive books are easier to buy for children and for people who do not usually read.

In [None]:
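fig = px.histogram(books, x="Price", title="Price Histogram")
# Label the y-axis explicitly; an empty-string key in `labels` matches no column.
fig.update_layout(yaxis_title="The Number of Books")
fig.show()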

* Most bestselling books are priced at less than 20 dollars.
* Except for specialized books, most books can be bought for less than 20 dollars, so this is no surprise.

## The most Expensive💰 Bestselling Book is...

### "Diagnostic and Statistical Manual of Mental Disorders, 5th Edition: DSM-5" (<span style="color:red">$105</span>)


In [None]:
fig = px.scatter_matrix(books, dimensions=["User Rating","Price"], color="User Rating",title="Scatter Plot shows the relationship between Rating and Price")
fig.show()

* We can see that there is no clear correlation between Rating and Price.
* However, books with particularly high prices (more than 50 dollars) are generally highly rated.
* This is probably because expensive books are bought only by those who really need them.
* (e.g. specialized books are bought mainly by scholars and researchers, not by people who are unfamiliar with the field.)
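
To put a number on the second point, the mean rating of books above the 50 dollar mark can be compared with the mean of the rest. The sketch below is only illustrative: `mean_rating_by_price` is a hypothetical helper, the band labels are made up, and the toy rows are invented.

```python
import pandas as pd


def mean_rating_by_price(df: pd.DataFrame, threshold: float = 50) -> pd.Series:
    """Mean user rating for books priced above the threshold and for the rest."""
    band = df["Price"].gt(threshold).map({True: "above", False: "at or below"})
    return df.groupby(band)["User Rating"].mean()


# Toy rows, invented for illustration (not the real dataset).
toy_books = pd.DataFrame({
    "Price": [105, 82, 15, 8],
    "User Rating": [4.5, 4.7, 4.8, 4.0],
})
mean_rating = mean_rating_by_price(toy_books)
print(mean_rating)
```

Since only a few bestsellers cost more than 50 dollars, a mean over that group rests on a small sample and should be read with care.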

## Genre 💭: Is the Book Fiction or Non-fiction?

* The column "Genre" shows whether the book is fiction or non-fiction.

In [None]:
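fig = px.histogram(books, x="Genre", title="Book Genre Histogram")
# Label the y-axis explicitly; an empty-string key in `labels` matches no column.
fig.update_layout(yaxis_title="The Number of Books")
fig.show()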

* We can see that there are more non-fiction books than fiction.

In [None]:
fig = px.scatter_matrix(books, dimensions=["User Rating","Genre"], color="User Rating",title="Scatter Plot shows the relationship between Rating and Genre")
fig.show()

* We can see that all non-fiction books have a rating of 4.0 or higher, while some fiction books have a relatively low rating.
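
The lowest rating in each genre can be read off with a `groupby`, which makes the comparison above easy to verify. This is a minimal sketch: `min_rating_by_genre` is a hypothetical helper, and both the toy rows and their genre labels are invented for illustration.

```python
import pandas as pd


def min_rating_by_genre(df: pd.DataFrame) -> pd.Series:
    """Lowest user rating observed within each genre."""
    return df.groupby("Genre")["User Rating"].min()


# Toy rows, invented for illustration (not the real dataset).
toy_books = pd.DataFrame({
    "Genre": ["Fiction", "Fiction", "Non Fiction", "Non Fiction"],
    "User Rating": [3.3, 4.8, 4.0, 4.6],
})
min_rating = min_rating_by_genre(toy_books)
print(min_rating)
```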

<h1 style='background:#FFFFFF; border:0; color:black'><center>Conclusion</center></h1>

* In this notebook, we explored data on the top 50 bestselling books on Amazon.
* No clear correlation between user rating and the other features was observed, but we found some weak correlations.

# Thank you for reading this notebook to the end!
## Feel free to upvote👍 and leave a comment🔖!