In [1]:
import os, sys, inspect
currentdir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parentdir = os.path.dirname(currentdir)
sys.path.insert(0, parentdir)
from database_utils import get_query_results

In [2]:
from IPython.display import Image
from IPython.core.display import HTML 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.graph_objs as go
import plotly.plotly as py
from plotly import tools
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

## Get and prepare the data

Retrieve all the books for all book clubs along with their book ids and first publication year. 

In [3]:
q2 = """
SELECT bc.club_name, bc.book_club_id, b.pub_year, b.book_id
FROM book b
JOIN book_club_book bcb
ON bcb.book_id=b.book_id
JOIN book_club bc
ON bc.book_club_id=bcb.book_club_id
"""

In [52]:
source_data = get_query_results(q2)

In [53]:
print("Minimum year: {}, maximum year: {}".format(source_data.pub_year.min(), source_data.pub_year.max()))

Minimum year: -720, maximum year: 2020


In [54]:
source_data.pub_year.describe()

count     606.000000
mean     1968.542904
std       207.325314
min      -720.000000
25%      1975.500000
50%      2004.000000
75%      2012.000000
max      2020.000000
Name: pub_year, dtype: float64

As we can see, the data contains outliers that lie far more than 3 standard deviations (SD) from the mean. The mean year is about 1968, while the minimum year is -720, which is roughly 13 SD below the mean. We therefore exclude these outliers and keep only the data that is within 3 SD of the mean. 

In [79]:
# keep only the books published no earlier than 3 SD below the mean
# (the maximum year, 2020, is already less than 3 SD above the mean, so no upper cutoff is needed)
within_3_sd = source_data.pub_year.mean() - 3*source_data.pub_year.std()

df = source_data[source_data.pub_year >= within_3_sd]
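
The cutoff can also be checked by hand. The sketch below simply plugs in the summary statistics printed by `describe()` above; the variable names are ours and are not used elsewhere in the notebook.

```python
# Summary statistics copied from the describe() output above
mean_year = 1968.542904
std_year = 207.325314
min_year = -720.0
max_year = 2020.0

# Distance of the oldest book from the mean, in standard deviations
z_min = (mean_year - min_year) / std_year

# Lower cutoff: three standard deviations below the mean
lower_cutoff = mean_year - 3 * std_year

# Upper bound of the 3 SD band; the newest book is well inside it
upper_cutoff = mean_year + 3 * std_year

print(round(z_min, 2))         # 12.97
print(round(lower_cutoff, 1))  # 1346.6
print(max_year < upper_cutoff) # True
```

So the filter keeps books first published from roughly 1347 onwards, and a one-sided cutoff is enough because no book lies more than 3 SD above the mean.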

Store the club names in a separate array so that we have them at hand throughout the notebook.

In [80]:
club_names = df.club_name.unique()
club_names

array(['Bertelsmann Data Science book readers', 'Gone with a Book',
       "Pop Sugar's Annual Ultimate Reading Challenge",
       'Reading with Style'], dtype=object)

## Visualisation 1: Histograms

An intuitive first step is to plot a histogram of the books' first publication years for each club and compare them visually. Here is how we do that with Plot.ly:

In [81]:
# create histograms 
data = [go.Histogram(x=df[df.club_name==club_names[i]]['pub_year'], 
                     name = club_names[i],
                     xbins = dict(
                         start = within_3_sd,
                         end = df.pub_year.max(),
                         size = 5
                     ),
                    showlegend = False) for i in range(4)]

# suppress warnings due to Plotly dependencies
import warnings
warnings.filterwarnings("ignore")

# arrange subplots
fig = tools.make_subplots(rows=2, cols=2, 
                          subplot_titles = tuple(club_names), 
                          print_grid=False)
fig.append_trace(data[0], 1, 1)
fig.append_trace(data[1], 1, 2)
fig.append_trace(data[2], 2, 1)
fig.append_trace(data[3], 2, 2)

fig['layout'].update(height=600, width=900, title='Year distribution by club')
iplot(fig, config={'showLink': False})
# show image result because GitHub doesn't render plotly figures 
Image(url= "by_year.png")

We can see that the distributions are rather similar and all of them are negatively skewed: people mostly prefer to read modern books rather than historical ones. However, we should also take into account that the number of books published increases every year, so there might be a bias. Judging by the histograms, we can cut the data at approximately the year 1920, and this is the cutoff we use in the next visualisation.
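
Negative skew can also be checked numerically rather than by eye. The following is a minimal sketch on a small, made-up list of years (not the notebook's data), using the sample skewness from pandas' `Series.skew()`.

```python
import pandas as pd

# Hypothetical publication years with a long left tail (illustration only)
years = pd.Series([1925, 1960, 1985, 1999, 2005, 2010, 2012, 2014, 2015, 2016])

# Sample skewness: a negative value means the left tail is longer
skewness = years.skew()

# For a left-skewed sample the mean is pulled below the median
mean_below_median = years.mean() < years.median()

print(skewness < 0)        # True
print(mean_below_median)   # True
```

On the real data, something like `df.groupby('club_name')['pub_year'].skew()` should give one value per club, assuming `df` is the filtered frame built earlier.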

## Visualisation 2: Box plots

Although histograms give us some idea of what the data looks like, box plots <a href="https://www.forbes.com/sites/naomirobbins/2012/01/10/comparing-distributions-with-box-plots/#5f76f8432c2c">are better</a> when it comes to comparing distributions. Here we use only the books first published after 1920 (as explained in the previous step). If we used the raw data (without cutting off the outliers), the plot would be hard to read because the outliers lie too far from the mean. 

In [82]:
df = source_data[source_data.pub_year > 1920]
data = [go.Box(y=df[df.club_name==club_names[i]]['pub_year'], name = club_names[i]) for i in range(4)]
layout = go.Layout(
    margin = dict(b=150),
    xaxis = dict(tickangle = 25)
)
fig = go.Figure(data=data, layout=layout)
iplot(fig,config={'showLink': False})
# show image result because GitHub doesn't render plotly figures 
Image(url= "by_year_box.png")

Indeed, the box plots give a much better idea of what's going on in our data, and the picture even differs from what we saw in the histograms.  
Here are a few observations that we can make:  
-  Our group ("Bertelsmann Data Science book readers") prefers older books with the median of 200 while the closest to us is 'Reading with Style' group with the median of 2005.
- Our group has a broader interquartile range: 1972 to 2017 (50% of the books are in that range). Second in terms of breadth is 'Reading with Style', with a range of 1982 to 2018. 
- Our group doesn't seem to be obsessed with getting the most recent books. The maximum value for us is 2017 and the third quartile is at 2008. This is the lowest third quartile of all the groups; the others are at 2015, 2014 and 2012. 
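
The values read off the box plots can be cross-checked by computing the quartiles directly. Below is a minimal sketch on a toy frame; the club labels `A` and `B` and their years are hypothetical stand-ins for `df`, and the column names `q1`, `median`, `q3` and `iqr` are our own.

```python
import pandas as pd

# Hypothetical stand-in for df: two clubs with five books each
toy = pd.DataFrame({
    "club_name": ["A"] * 5 + ["B"] * 5,
    "pub_year": [1950, 1972, 2000, 2008, 2017,
                 1990, 2001, 2010, 2015, 2019],
})

# Quartiles per club; unstack turns the quantile level into columns
summary = (toy.groupby("club_name")["pub_year"]
              .quantile([0.25, 0.5, 0.75])
              .unstack())
summary.columns = ["q1", "median", "q3"]

# Width of the box in the box plot
summary["iqr"] = summary["q3"] - summary["q1"]

print(summary)
```

Running the same `groupby` on the real `df` would give one row per club. Plotly may use a slightly different interpolation rule for quartiles than pandas, so small differences from the hover values are possible.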