## Sections In Report:

- Brief description of the data set and a summary of its attributes
- Initial plan for data exploration
- Actions taken for data cleaning and feature engineering
- Key Findings and Insights, which synthesizes the results of Exploratory Data Analysis in an insightful and actionable manner
- Formulating hypotheses about this data
- Conducting a formal significance test for one of the hypotheses and discussing the results
- Suggestions for next steps in analyzing this data
- A paragraph that summarizes the quality of this data set and a request for additional data if needed

### Import necessary libraries

pandas, numpy, matplotlib, seaborn and plotly (visualization), fuzzywuzzy (value replacement), missingno (missing values), scipy (hypothesis testing)


In [None]:
# Import libraries -> Statistics
import pandas as pd
import numpy as np
import missingno
import fuzzywuzzy
from fuzzywuzzy import process
import collections

# Import libraries -> Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.figure_factory as ff

# Import libraries -> Hypothesis testing
from scipy.stats import shapiro
from scipy.stats import mannwhitneyu

%matplotlib inline

### Load data

In [None]:
# Reading data
df = pd.read_csv('/kaggle/input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv')
df.head(7)

In [None]:
df.tail(7)

> Note that the dataset contains duplicate books listed under different years; we will deal with them in the data cleaning step.

### Dataset Description & Summary of attributes
This section summarizes the dataset's shape, columns, and data types.

In [None]:
df.describe().T

In [None]:
df.info()

In [None]:
# Let's look at the dimension of the data
print(f'This dataset contains {df.shape[0]} records and {df.shape[1]} columns.')

In [None]:
print(f'The columns in this dataset are: {df.columns.tolist()}')

In [None]:
# Let's inspect the data types
print(df.dtypes)

### Data Cleaning and Feature Engineering

The data contains 3 categorical columns and 4 numeric columns. Let's convert Genre to the category data type. Year is numeric and is dropped once the duplicates are removed.

In [None]:
# Deal with duplicate data
df.tail(7)

> We will remove all duplicated rows except the last one for each book, since it is the most recent.

In [None]:
# Remove duplicates and check how many books are left in the data
df = df.drop_duplicates(subset='Name', keep='last')
df
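To see what `keep='last'` does, here is a minimal sketch on a made-up three-row frame (the titles and years are illustrative, not taken from the dataset). It assumes the rows are ordered so that the most recent year of each book comes last.

```python
import pandas as pd

# Hypothetical toy data: "Book A" appears in two different years
toy = pd.DataFrame({
    "Name": ["Book A", "Book A", "Book B"],
    "Year": [2015, 2016, 2016],
})

# Keep only the last occurrence of each Name
deduped = toy.drop_duplicates(subset="Name", keep="last")
print(deduped)
```

If the rows were not sorted by year within each title, sorting first with `sort_values` would be needed for "last" to mean "most recent".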

In [None]:
# Delete the year column
del df['Year']

In [None]:
df.Genre.value_counts()

In [None]:
# Visualize

df["Genre"].value_counts().plot(kind="bar", color=["pink", "blue"])

In [None]:
# Change the data type
df.Genre = df.Genre.astype('category')

In [None]:
# Checking for missing values
df.isnull().sum()

> The data has no missing values, so no further transformations are required.

In [None]:
# Forming categorical columns
cat_col = list(df.select_dtypes(exclude='number').columns)
print(f'Categorical Columns: {", ".join(cat_col)}.')

In [None]:
# Check for duplicate values in each categorical column
for col in cat_col:
    if df[col].duplicated().any():
        print(f'Column {col} has duplicate data.')
    else:
        print(f'Column {col} does not have duplicate data.')

In [None]:
# Check for spelling errors in entire dataset
for col in cat_col:
    print(f'Actual {col}: "{len(set(df[col]))}" - After Spell Check {col}: "{len(set(df[col].str.title().str.strip()))}"')

> Fix the spelling inconsistencies found above.

In [None]:
# Correct the errors
df.Name = df.Name.str.title().str.strip()
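The check above works because normalizing case and whitespace collapses variants of the same title into one value. A minimal sketch with made-up titles, using the plain Python string methods that the pandas `.str` accessor mirrors:

```python
# Hypothetical raw titles with inconsistent casing and stray whitespace
raw_titles = ["the book thief", "The Book Thief ", "The Book Thief"]

cleaned_titles = [t.title().strip() for t in raw_titles]

print(len(set(raw_titles)))      # distinct spellings before cleaning
print(len(set(cleaned_titles)))  # distinct spellings after cleaning

# Caveat: title() also capitalizes the letter after an apostrophe
print("don't stop".title())
```

The last line shows a side effect worth keeping in mind: title-casing can alter titles that contain apostrophes, so it is best treated as a normalization key rather than a display format.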

In [None]:
# Check that the changes took effect
for col in cat_col:
    print(f'Actual {col}: "{len(set(df[col]))}" - After Spell Check {col}: "{len(set(df[col].str.title().str.strip()))}"')

<p style="font-family: Arials, sans-serif; font-size: 14px; color: rgba(0,0,0,.7)">Let's check if there are the same author names but with different spellings.</p>

In [None]:
# Check for spelling errors in Author column
authors = df.Author.sort_values().unique()
authors

> Observe that names with initials tend to appear in different variations. <br>
<b>George R. R. Martin</b> and <b>J. K. Rowling</b> fall into this category.

In [None]:
# Let's build a list of the most similar spellings for the first Author with this error
matches_author_name = fuzzywuzzy.process.extract('George R.R. Martin', authors, limit=4, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
matches_author_name

In [None]:
# Let's build a list of the most similar spellings for the second Author with this error
matches_author_name = fuzzywuzzy.process.extract('J. K. Rowling', authors, limit=4, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
matches_author_name

In [None]:
# Replace the names of the Authors with the correct ones
df = df.replace('George R. R. Martin', 'George R.R. Martin')
df = df.replace('J. K. Rowling', 'J.K. Rowling')
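The idea behind the fuzzy lookup can be sketched without third-party packages. The example below swaps in `difflib.SequenceMatcher` from the standard library as a stand-in for fuzzywuzzy, so its scores will not be identical to `token_sort_ratio`; the candidate list and the `similarity` helper are assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical author spellings for illustration
candidates = ["George R. R. Martin", "George R.R. Martin", "J.K. Rowling", "Jeff Kinney"]
target = "George R.R. Martin"

def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score based on matching character blocks."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Rank candidates from most to least similar to the target spelling
ranked = sorted(candidates, key=lambda name: similarity(target, name), reverse=True)
for name in ranked[:2]:
    print(name, round(similarity(target, name), 1))
```

Near-identical spellings rank at the top, which is the signal used to decide which variants to merge.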

In [None]:
# Check that the changes took effect
for col in cat_col:
    print(f'Before {col}: {len(set(df[col]))} After {col}: {len(set(df[col].str.title().str.strip()))}')

### Exploratory Data Analysis
The dataset contains 350 different books written by 246 authors. All books are presented in two categories (Non Fiction, Fiction).

In [None]:
# Author with the most entries (count rows per author, then take the most frequent)
print(f'Author with the most entries for different books: `{df.Author.value_counts().idxmax()}`')
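Counting entries per author and taking the maximum of the names are different operations: `max()` on strings compares them alphabetically. A minimal sketch on made-up names, using `collections.Counter` as a plain-Python analogue of pandas `value_counts()`:

```python
from collections import Counter

# Hypothetical author column, one entry per book
toy_authors = ["Author B", "Author Z", "Author B", "Author M"]

# String comparison: returns the alphabetically last name
alphabetical_last = max(toy_authors)

# Count-based: returns the name that occurs most often
most_frequent = Counter(toy_authors).most_common(1)[0][0]

print(alphabetical_last)
print(most_frequent)
```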

In [None]:
# Analysis -> Best Authors by User Rating
best_sellers = df[df['User Rating'].values == 4.9]
best_sellers = best_sellers.groupby('Author')[['User Rating']].mean().sort_values('User Rating', ascending=False).reset_index()

# Visualize 
sns.set_theme(style="darkgrid")
sns.set(rc = {'figure.figsize': (12, 8)})
ax = sns.barplot(x="User Rating", y="Author", data=best_sellers, color="teal")\
                .set(title="Authors with a Rating equal to 4.9(Highest Rating)", ylabel=None)

>Alice Schertle, Jill Twiss, Sarah Young, Nathan W. Pyle, Patrick Thorpe, Eric Carle, Emily Winfield Martin, Chip Gaines, Rush Limbaugh, Sherri Duskey Rinker, Pete Souza, Lin-Manuel Miranda, Bill Martin Jr., and Dav Pilkey each have at least one bestseller rated 4.9.

In [None]:
# Analysis -> Best Books by User Rating
best_books = df[df['User Rating'].values == 4.9]
best_books = best_books.groupby('Name')[['User Rating']].mean().sort_values('User Rating', ascending=False).reset_index()

# Visualize
sns.set(rc = {'figure.figsize': (12, 8)})
ax = sns.barplot(x="User Rating", y="Name", data=best_books, color="salmon")\
                .set(title="Books with a Rating equal to 4.9(Highest Rating)", ylabel=None)

In [None]:
# Analysis -> Books by most reviews
most_reviews = df[df['Reviews'].values > 40000]
most_reviews = most_reviews.groupby('Name')[['Reviews']].sum().sort_values('Reviews', ascending=False).reset_index()

# Visualize
sns.set(rc = {'figure.figsize': (8, 4)})
ax = sns.barplot(x="Reviews", y="Name", data=most_reviews, color="teal")\
                .set(title="Books with more than 40000 reviews", ylabel=None)

In [None]:
# Analysis -> Books by expense
most_worth = df.groupby('Name')[['Price']].sum().sort_values('Price', ascending=False).head(10).reset_index()

# Visualize
sns.set(rc = {'figure.figsize': (8, 4)})
ax = sns.barplot(x="Price", y="Name", data=most_worth, color="salmon")\
                .set(title="10 most Valuable Books", ylabel=None)

    Diagnostic And Statistical Manual Of Mental Disorders, 5th Edition is the most expensive bestseller.

In [None]:
# Analysis -> Books by Genres
books_by_genre = df.groupby('Genre')[['Name']].count()\
                                             .sort_values('Name', ascending=False)\
                                             .head(10)\
                                             .reset_index()

# Visualize

df["Genre"].value_counts().plot(kind="bar", color=["pink", "blue"]).set_title("Genre Distribution")

    Non Fiction titles make up the larger share of the bestsellers in this dataset.

In [None]:
# Display summary statistics (central tendency and spread) of the Price
df["Price"].describe()

It is observed that:
- Some books cost far more than the mean price.
- Some books are listed at a price of zero.

In [None]:
# Build a correlation matrix of the numeric columns
df.corr(numeric_only=True)

In [None]:
# Correlation Matrix
corr = df.corr(numeric_only=True)
fig, ax = plt.subplots(figsize = (8, 6))
ax = sns.heatmap(corr,
                annot=True,
                linewidths=0.5,
                fmt=".2f",
                cmap="YlGnBu");
bottom, top = ax.get_ylim()

### Observations:
    - The highest positive correlation is between the number of reviews and the Year.
    - There is no notable positive or negative linear relationship between the rating, the reviews and the price of books.
    - A negative relationship exists between the price of a bestseller and the Year.
    - Year was dropped during cleaning, so the Year observations require computing the matrix before that column is deleted.

In [None]:
sns.scatterplot(data=df, x="Price", y="Reviews", hue="Genre")

In [None]:
sns.scatterplot(data=df, x="Price", y="User Rating", hue="Genre")

In [None]:
sns.scatterplot(data=df, x="Reviews", y="User Rating", hue="Genre")

### Hypothesis Testing - Does the Genre of the Book drive its Price?

    Null Hypothesis :- The distribution of Price is the same for Fiction and Non Fiction books (Genre is not related to Price)

    Alternative Hypothesis :- The distribution of Price differs between Fiction and Non Fiction books (Genre is related to Price)


In [None]:
# Encode Genre column to perform hypothesis testing
df["Genre"] = df["Genre"].replace({"Fiction": 1, "Non Fiction": 0})
df.head(3)

In [None]:
# Generate samples to test
non_fiction = df[df['Genre'] == 0]['Price']
fiction = df[df['Genre'] == 1]['Price']

In [None]:
# Set the significance level
alpha = 0.05

# H0: the Price distribution is the same for both genres.
# Compare the two groups with the nonparametric Mann-Whitney U test (two-sided).
stat, pval = mannwhitneyu(non_fiction, fiction, alternative='two-sided')

print('Statistic:', f'{stat:.3f}')
print('P-Value:', f'{pval:.20f}')

# Decide whether to reject H0
if pval < alpha:
    print('Reject the null hypothesis - Price differs between genres.')
else:
    print('Fail to reject the null hypothesis - no evidence that Price differs between genres.')
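For intuition about what the Mann-Whitney statistic measures, here is a minimal pure-Python sketch of the pair-counting definition of U. The `mann_whitney_u` helper and the price lists are hypothetical and only illustrate the statistic; scipy additionally derives the p-value from it.

```python
def mann_whitney_u(x, y):
    """U statistic for x: number of (xi, yj) pairs with xi > yj; ties count 0.5."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

# Hypothetical prices, not taken from the dataset
group_a = [8, 12, 15, 20]
group_b = [5, 9, 12]

u_a = mann_whitney_u(group_a, group_b)
u_b = mann_whitney_u(group_b, group_a)
print(u_a, u_b)
```

The two statistics always sum to the number of pairs, `len(group_a) * len(group_b)`. A U close to either extreme means one group tends to have higher values, while a U near half the number of pairs suggests similar distributions.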

<b>INSIGHT:</b>

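    If the p-value is below alpha, the test indicates that Price differs between Fiction and Non Fiction; otherwise there is no evidence that Genre is related to Price. In either case the test measures association and cannot show that Genre determines the Price.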

### Summary

    - During the EDA, the authors with the highest reader ratings were identified, along with the author with the most bestsellers, the most expensive books (I see you, Stats 😊), and the books with the most reviews and the highest ratings.
    - The analysis also showed that Non Fiction accounts for more of Amazon's bestsellers than Fiction over the years (this is your chance, creatives).
    - Whether Genre is related to Price is decided by the significance test above; the test cannot establish that Genre drives the Price.
    
    More data could be gathered to extend this analysis, although it would likely not change the results significantly.