# Exploratory Data Analysis (EDA)
- Exploratory Data Analysis refers to the critical process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

This notebook uses a dataset of Amazon's Top 50 bestselling books from 2009 to 2019. It contains 550 entries (50 per year), each categorized as fiction or non-fiction.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv("bestsellers with categories.csv")

# Observe data in the dataframe

In [None]:
df.head()

# Number of rows and columns

In [None]:
print("Shape of Dataset")
print(df.shape)

# Unique elements in columns

In [None]:
print("Unique elements in Features")
df.nunique()

# Duplicated Rows

In [None]:
print("Number of duplicated rows")
print(df.duplicated().sum())

# Genres Feature

In [None]:
df['Genre'].value_counts()

In [None]:
fig = plt.figure(figsize=(2,2), dpi=300, facecolor="w")
genres = df['Genre'].value_counts()
plt.pie(genres, labels=genres.index, autopct="%.2f%%")
plt.title("Pie Chart Showing Distribution of Genres")
plt.savefig("genres_pie.png", dpi=300, bbox_inches="tight")
plt.show()

- Observation: Almost 56% of the best-selling books are Non Fiction.
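The percentages shown on the pie chart can also be read directly as numbers. The sketch below uses a small made-up frame named `toy` (an assumption for illustration, not the real dataset) to show how `value_counts(normalize=True)` turns counts into shares.

```python
import pandas as pd

# Toy stand-in for the notebook's df; these rows are made up for illustration.
toy = pd.DataFrame(
    {"Genre": ["Non Fiction", "Fiction", "Non Fiction", "Non Fiction", "Fiction"]}
)

# normalize=True returns fractions instead of raw counts; scale to percent.
genre_share = toy["Genre"].value_counts(normalize=True).mul(100).round(2)
print(genre_share)
```

In the notebook, the same call on `df['Genre']` should give the shares that the pie chart labels display.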

In [None]:
sns.set_theme(style="darkgrid")

In [None]:
# Count plot of fiction vs. non-fiction best sellers for each year.
plt.figure(figsize=(12,7),dpi = 300)
sns.countplot(x=df['Year'],hue=df['Genre'])
plt.show()

- Observation: In every year except 2014, non-fiction best sellers outnumbered fiction best sellers, which is consistent with the overall non-fiction majority seen in the pie chart.
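A year-by-genre table makes the comparison in the count plot checkable number by number. This is a minimal sketch on a made-up frame named `toy` (hypothetical rows, not the real dataset), using `pd.crosstab` to count books per year and genre and then listing the years in which fiction leads.

```python
import pandas as pd

# Toy stand-in for the notebook's df; these rows are made up for illustration.
toy = pd.DataFrame({
    "Year": [2009, 2009, 2009, 2010, 2010, 2010],
    "Genre": ["Fiction", "Non Fiction", "Non Fiction",
              "Fiction", "Fiction", "Non Fiction"],
})

# One row per year, one column per genre, cell = number of books.
per_year = pd.crosstab(toy["Year"], toy["Genre"])

# Years in which fiction titles outnumber non-fiction titles.
fiction_lead = per_year[per_year["Fiction"] > per_year["Non Fiction"]]
fiction_lead_years = fiction_lead.index.tolist()

print(per_year)
print(fiction_lead_years)
```

Replacing `toy` with `df` would produce the table behind the count plot for the actual data.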

# User Rating

In [None]:
print("Max User Rating")
print(df['User Rating'].max())
print()
print("Avg User Rating")
print(df['User Rating'].mean())
print()
print("Most Frequent User Rating")
print(df['User Rating'].mode())
print()

In [None]:
plt.figure(figsize=(12,6), dpi=300)
# plt.style.use("seaborn")
# plt.figure(figsize=(20,20))
plt.subplot(221)
fund= sns.countplot(x=df["User Rating"], palette="magma",edgecolor='black',saturation=0.50)
fund.set_xticklabels(fund.get_xticklabels(),fontsize=12)
plt.title("COUNT OF RATINGS",fontsize=15)
# countplot puts the ratings on the x-axis and their counts on the y-axis
fund.set_xlabel("User Rating", fontsize=12)
fund.set_ylabel("Count", fontsize=12)
plt.show()

# Authors

In [None]:
df['Author'].value_counts() # Number of entries per author in this dataset

In [None]:
# Books with a user rating of 4.9 (defined first so the cells run top to bottom)
maxrating = df[df['User Rating']==4.9]
maxrating.groupby(['Author']).size()

In [None]:
# Authors ranked by their number of 4.9-rated entries
aumax = maxrating.groupby(['Author']).size().reset_index(name="Count")
aumax.sort_values(by='Count',ascending=False).head(20)

In [None]:
# 'Where the Crawdads Sing' by Delia Owens has the most user reviews (87841).
df[df['Reviews']==df['Reviews'].max()]

In [None]:
maxrating[maxrating['Reviews']==maxrating['Reviews'].max()]

# Price

In [None]:
# Most books rated 4.9 are priced at 8
plt.figure(figsize=(12,6),dpi=300)
sns.histplot(maxrating['Price'])
plt.title('Price Distribution Plot',fontsize=20)
plt.show()
maxrating['Price'].mode()

# General Trend & Outliers

In [None]:
fig, axes = plt.subplots(2, 2,figsize=(10,6),dpi=300)
fig.suptitle('Box Plots')
sns.boxplot(x=df["User Rating"], ax=axes[0,0],color="lightgreen")
sns.boxplot(x=df["Reviews"],ax=axes[0,1])
sns.boxplot(x=df["Price"],ax=axes[1,0])
sns.boxplot(x=df["Year"],ax=axes[1,1], color="lightgreen")
# Call tight_layout after drawing so the spacing accounts for the axis labels and the title
fig.tight_layout()
plt.show()

- In a box plot, an outlier is a data point that lies beyond the whiskers: below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, where Q1 and Q3 are the lower and upper quartiles and IQR = Q3 - Q1 is the interquartile range.
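The 1.5 * IQR rule can also be applied directly to list the outlying values instead of reading them off the plot. The helper `iqr_outliers` and the series `toy_reviews` below are hypothetical names introduced for illustration, and the numbers are made up; this is a sketch of the rule, not a method taken from the notebook.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Made-up review counts with one obviously extreme value.
toy_reviews = pd.Series([10, 12, 13, 14, 15, 16, 18, 100])
outliers = iqr_outliers(toy_reviews)
print(outliers.tolist())  # [100]
```

In the notebook, a call such as `iqr_outliers(df["Reviews"])` would list the points that the corresponding box plot draws beyond its whiskers, assuming the plot uses the default whisker length of 1.5 * IQR.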