In [54]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)
sns.set(style="white")


import plotly
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

from plotly.offline import plot, iplot, init_notebook_mode
init_notebook_mode(connected=True)


df = pd.read_csv('bestsellers with categories.csv')

This is a really interesting dataset containing information about books sold on Amazon in the years 2009 to 2019. It contains 550 book entries, categorized into fiction and non-fiction using Goodreads.

Basic information about this dataset:

550 rows

7 columns

Column names:

* Name: name of the book,
* Author: author of the book,
* User Rating: Amazon user rating,
* Reviews: number of written reviews on Amazon,
* Price: price of the book (as of 13/10/2020),
* Year: the year(s) the book ranked on the bestseller list,
* Genre: whether the book is fiction or non-fiction

Questions that come to my mind before going any further:

* How did the distribution of fiction and non-fiction bestsellers change over the years?

My hypothesis: I think bestsellers are more likely to be non-fiction.

* Which authors have the most bestsellers? (let's say TOP 10)

My hypothesis: I don't know many authors from the period between 2009 and 2019.


* Is the book price correlated with the book rating?

I think not.

* What percentage of all books is free?

I think it will be less than 5%.

## 1.1 Data preprocessing

In [55]:
df.head()

In [56]:
df.info()

Numeric columns:

In [57]:
for column in df.select_dtypes(include=[np.number]).columns:
    print(column)

In [58]:
df.describe()

Conclusions:

* the lowest bestseller rating is 3.3
* some books are free (the minimum price is 0)
* the earliest year is 2009

In [59]:
df.isnull().sum()

There are no missing values, so the dataset is clean and ready for EDA.

## 2. EDA

Distributions of numeric columns:

In [60]:
def plot_histogram(data_frame: pd.DataFrame = df) -> None:
    """Plot a histogram for every numeric column of data_frame.

    data_frame: the data frame to plot (defaults to df)
    """

    # select the names of the numeric columns
    columns_to_plot = data_frame.select_dtypes(include=[np.number]).columns

    # draw one histogram per numeric column
    for column in columns_to_plot:
        fig = px.histogram(data_frame=data_frame, x=column)

        # x=0.5 centers the title horizontally
        fig.update_layout(title=dict(text=column, x=0.5))
        fig.show()

In [61]:
plot_histogram()

The following conclusions may be drawn from the plots above:

* User Rating is fairly left-skewed (negative skewness) if we ignore the 4.9 rating
* most of the books have a rating between 4 and 4.8


* the distribution of Reviews is fairly right-skewed if we ignore the 0 to 1980 bin


* in the Price column one bin (8-9) stands out; apart from this bin the situation is the same as for Reviews, so the distribution is right-skewed

* the distribution of Year is uniform
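The skewness read off the histograms can also be quantified. Below is a minimal sketch using pandas' `skew()`; the helper name `numeric_skewness` and the tiny `toy` frame are made up for illustration and are not part of the bestsellers data. A negative value indicates a tail to the left, a positive value a tail to the right.

```python
import pandas as pd


def numeric_skewness(data_frame: pd.DataFrame) -> pd.Series:
    """Return the sample skewness of every numeric column."""
    return data_frame.select_dtypes(include="number").skew()


# Made-up values standing in for the real data (assumption, illustration only)
toy = pd.DataFrame({
    "User Rating": [4.9, 4.8, 4.8, 4.7, 4.6, 3.3],  # long tail to the left
    "Reviews": [100, 200, 300, 400, 500, 90000],    # long tail to the right
    "Name": list("abcdef"),                         # non-numeric, ignored
})

skew = numeric_skewness(toy)
print(skew)
```

On the real data this would presumably be called as `numeric_skewness(df)`.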

### Let's start answering the questions I asked at the beginning:

#### How did the distribution of fiction and non-fiction bestsellers change over the years?

In [62]:
# building an aggregated view of the data frame: number of books per year and genre
grouped_by_year_and_genre = (df
                             .groupby(['Year', 'Genre'])
                             .size()
                             .reset_index(name='count'))

# visualization
fig = px.bar(data_frame=grouped_by_year_and_genre,
             x='Year', y='count',
             color='Genre',
             barmode='group')

fig.update_layout(title=dict(text='<b>Amazon bestsellers over the years by genre</b>',
                             x=0.5,
                             font=dict(size=24)),
                  font=dict(family='Lato', size=16))

fig

In almost every year non-fiction has more bestsellers than fiction; the one exception to this pattern is 2014. So my initial hypothesis was mostly right, apart from 2014, as the plot above shows. Let's inspect this year.
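The exceptional year can also be found without reading it off the chart. The sketch below uses `pd.crosstab` with `normalize="index"` to get the share of each genre per year; the function name `genre_share_by_year` and the `toy` frame are assumptions made for illustration, not the real data.

```python
import pandas as pd


def genre_share_by_year(data_frame: pd.DataFrame) -> pd.DataFrame:
    """Share of each genre within every year (each row sums to 1)."""
    return pd.crosstab(data_frame["Year"], data_frame["Genre"], normalize="index")


# Made-up rows standing in for the real data (assumption, illustration only)
toy = pd.DataFrame({
    "Year": [2013, 2013, 2013, 2013, 2014, 2014, 2014, 2014],
    "Genre": ["Fiction", "Non Fiction", "Non Fiction", "Non Fiction",
              "Fiction", "Fiction", "Fiction", "Non Fiction"],
})

share = genre_share_by_year(toy)

# years in which fiction makes up more than half of the bestsellers
fiction_majority_years = share.index[share["Fiction"] > 0.5].tolist()

print(share)
print(fiction_majority_years)
```

Applied to `df`, the same filter should list the years where fiction dominates.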

In [63]:
year_2014 = df[df['Year'] == 2014] # selecting only year 2014

In [64]:
g = sns.FacetGrid(year_2014, col='Genre', xlim=(0, 5e4)) # creating a FacetGrid object, one panel per genre
g.map(sns.kdeplot, 'Reviews') # mapping a KDE plot onto each panel

In [65]:
g = sns.FacetGrid(year_2014, col='Genre')
g.map(sns.kdeplot, 'Price')

In [66]:
g = sns.FacetGrid(year_2014, col='Genre')
g.map(sns.kdeplot, 'User Rating')

There's no clear pattern visible in the KDE plots. The next idea that comes to my mind is to look at the most common authors in this year for fiction and non-fiction.

In [67]:
fiction_2014 = year_2014[year_2014['Genre'] == 'Fiction']

non_fiction_2014 = year_2014[year_2014['Genre'] == 'Non Fiction']

In [68]:
fig = make_subplots(rows=1, cols=2, column_titles=['Fiction', 'Non-fiction'],
                   shared_yaxes=True)

# adding trace with fiction authors
fig.add_trace(go.Bar(x=fiction_2014['Author'].value_counts().index[:5],
                    y=fiction_2014['Author'].value_counts().values[:5],
                    showlegend=False),
              row=1,
              col=1
             )

# adding non-fiction authors
fig.add_trace(go.Bar(x=non_fiction_2014['Author'].value_counts().index[:5],
                    y=non_fiction_2014['Author'].value_counts().values[:5],
                    showlegend=False),
              row=1,
              col=2
             )



# some styling
fig.update_layout(yaxis=dict(range=[0.1, 3], dtick=1),
                 title=dict(text='<b>Amazon bestselling authors in 2014 by genre</b>',
                            font=dict(size=24),
                           x=0.5),
                 font=dict(family='Lato', size=16))

The author with the most bestsellers in this year is John Green.

The movie "The Fault in Our Stars" premiered in 2014 and became very popular at the time.
Maybe it helped John Green sell his books that year.

In [69]:
df[df['Author'] == 'John Green']

#### Which author has the biggest amount of bestsellers? (let's say TOP 10)

In [70]:
fig = px.bar(x=df['Author'].value_counts().index[:10],
            y=df['Author'].value_counts().values[:10])

fig.update_layout(xaxis=dict(title='author'),
                 yaxis=dict(title='amount of books'),
                 font=dict(family='Lato', size=16),
                 title=dict(text='<b>Number of bestsellers per author</b>',
                            font=dict(size=24),
                           x=.5))

Jeff Kinney is the winner with 12 bestsellers, followed by Gary Chapman (11), Rick Riordan (11) and Suzanne Collins (11).
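Since a book gets one row for every year it ranked, counting rows per author counts appearances on the list rather than distinct titles. A minimal sketch of the alternative count is below; the helper `top_authors_by_unique_titles` and the `toy` frame are hypothetical and only illustrate the idea.

```python
import pandas as pd


def top_authors_by_unique_titles(data_frame: pd.DataFrame, n: int = 10) -> pd.Series:
    """Count distinct titles per author, ignoring repeated yearly appearances."""
    unique_books = data_frame.drop_duplicates(subset=["Name", "Author"])
    return unique_books["Author"].value_counts().head(n)


# Made-up rows standing in for the real data (assumption, illustration only)
toy = pd.DataFrame({
    "Name": ["Book A", "Book A", "Book A", "Book B", "Book C"],
    "Author": ["Author X", "Author X", "Author X", "Author Y", "Author Y"],
    "Year": [2009, 2010, 2011, 2012, 2013],
})

by_rows = toy["Author"].value_counts()           # appearances on the list
by_titles = top_authors_by_unique_titles(toy)    # distinct books

print(by_rows)
print(by_titles)
```

In the toy data Author X leads by appearances, while Author Y leads by distinct titles; the two rankings may differ in the real data as well.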

#### Is the book price correlated with book rating?

In [71]:
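# numeric_only=True skips the text columns (Name, Author, Genre);
# recent pandas versions raise an error without it
df.corr(numeric_only=True)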

In [72]:
fig = px.scatter(data_frame=df, x='Price',
                y='User Rating')

fig

There is a weak negative correlation of -0.13, but it's not enough to draw any conclusions from.
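Pearson correlation only measures linear relationships and is sensitive to extreme prices, so a rank-based check may be a useful complement. The sketch below computes a Spearman correlation as the Pearson correlation of ranks; this is a swapped-in technique, not something the analysis above used, and the helper `rank_correlation` and the `toy` values are made up for illustration.

```python
import pandas as pd


def rank_correlation(x: pd.Series, y: pd.Series) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return x.rank().corr(y.rank())


# Made-up values standing in for the real data (assumption, illustration only)
toy = pd.DataFrame({
    "Price": [0, 5, 8, 12, 20, 105],                # one extreme price
    "User Rating": [4.9, 4.8, 4.7, 4.6, 4.5, 4.4],  # falls as price rises
})

pearson = toy["Price"].corr(toy["User Rating"])
spearman = rank_correlation(toy["Price"], toy["User Rating"])

print(pearson, spearman)
```

On the real data the call would presumably be `rank_correlation(df['Price'], df['User Rating'])`.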

#### What percentage of all books is free?

In [73]:
df['free_or_not_free'] = df['Price'].apply(lambda price:  'free' if not price else 'not free')

In [74]:
fig = go.Figure()

colors = ['lime', 'grey']

fig = fig.add_trace(
                        go.Pie(labels=df['free_or_not_free'].
                           value_counts().index, 
                           values=df['free_or_not_free'].
                           value_counts().values,
                        hoverinfo='label+percent',
                        textinfo='percent',
                        textfont=dict(size=14, color='black'),
                        marker=dict(colors=colors,
                                    line=dict(width=2)),
                        hole=0.7)
                   )

fig.update_layout(
    font=dict(family='Lato', size=16, color='black'),
    title=dict(text='<b>FREE BESTSELLERS AMAZON</b>', 
               font=dict(size=24),x=0.5),
    plot_bgcolor='white'
)

The share of free books is less than 5%, as I assumed at the beginning.
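The percentage can also be computed directly, without a helper column or a chart. A minimal sketch is below: the mean of a boolean mask is the fraction of `True` values. The function name `percent_free` and the `toy_prices` values are assumptions for illustration only.

```python
import pandas as pd


def percent_free(prices: pd.Series) -> float:
    """Percentage of entries whose price is exactly 0."""
    return float((prices == 0).mean() * 100)


# Made-up prices standing in for the real data (assumption, illustration only)
toy_prices = pd.Series([0, 0, 8, 9, 15, 20, 11, 4, 13, 6])

print(percent_free(toy_prices))
```

On the real data this would presumably be `percent_free(df['Price'])`.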

I have answered all the questions I planned to. But one more came to my mind while I was doing this analysis, namely:

* Which books have a higher average price: fiction or non-fiction?

Fiction:

In [75]:
np.mean(df.loc[df['Genre'] == 'Fiction', 'Price'])

Non-fiction:

In [76]:
np.mean(df.loc[df['Genre'] == 'Non Fiction', 'Price'])

Fiction books are on average cheaper than non-fiction books.
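Both averages can be obtained in one step with a `groupby`, and adding the median helps check whether a few expensive books drive the difference. This is only a sketch; the helper `price_summary_by_genre` and the `toy` frame are made up for illustration.

```python
import pandas as pd


def price_summary_by_genre(data_frame: pd.DataFrame) -> pd.DataFrame:
    """Mean, median and number of rows of Price for each genre."""
    return data_frame.groupby("Genre")["Price"].agg(["mean", "median", "count"])


# Made-up rows standing in for the real data (assumption, illustration only)
toy = pd.DataFrame({
    "Genre": ["Fiction", "Fiction", "Fiction",
              "Non Fiction", "Non Fiction", "Non Fiction"],
    "Price": [5, 10, 15, 10, 12, 50],
})

summary = price_summary_by_genre(toy)
print(summary)
```

In the toy data the non-fiction mean is pulled up by a single expensive book, while the medians are much closer; the same comparison could be run on `df`.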

## 3. Conclusions

All the questions I wanted to answer have been answered.


If you want to see my other kernels, they are available [here](https://www.kaggle.com/cloudy17/code)

Peace ✌️