# Set up the environment

For this tutorial, the following packages need to be installed:

* pandas >= 0.25
* bokeh >= 1.2
* pandas-bokeh >= 0.3.1
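
One way to check which versions are present is to query the package metadata from the standard library. This is only a sketch: the helper name `installed_version` is made up for this example, and the package names are assumed to match the distribution names on PyPI.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names as assumed to be published on PyPI.
for package in ("pandas", "bokeh", "pandas-bokeh"):
    print(f"{package}: {installed_version(package) or 'not installed'}")
```

The printed versions can then be compared by eye against the minimum versions listed above.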

Let us first import **pandas** and set the plotting backend to **Pandas-Bokeh**:

In [2]:
import pandas as pd
pd.set_option('plotting.backend', 'pandas_bokeh')

**Pandas-Bokeh** can output plots either to a file (static *HTML*) or as *embedded interactive plots* in the notebook. The first option is selected by calling:

In [3]:
pd.plotting.output_file("path-to-file.html")

In this tutorial we focus on embedded plots, which are more convenient for exploratory data analysis:

In [4]:
pd.plotting.output_notebook()

When you execute this code, you should see a small "Bokeh" icon that indicates that Bokeh has been successfully loaded.

# Load the Titanic Dataset

After we have set up our environment, let us load our dataset:

In [5]:
df = pd.read_csv("data/titanic.csv")
df.head()

| Unnamed: 0 | Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Mr. Owen Harris Braund | male | 22.0 | 1 | 0 | 7.25 |
| 1 | 1 | 1 | Mrs. John Bradley (Florence Briggs Thayer) Cum... | female | 38.0 | 1 | 0 | 71.2833 |
| 2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.925 |
| 3 | 1 | 1 | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.1 |
| 4 | 0 | 3 | Mr. William Henry Allen | male | 35.0 | 0 | 0 | 8.05 |
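
The `Unnamed: 0` column in the output above is what pandas produces when the first CSV column has an empty header, which usually means a row index was saved along with the data. The following sketch shows how `index_col=0` would turn that column into the index instead. It assumes this is how `data/titanic.csv` is laid out, and it uses a small inline stand-in (three of the rows above, with a subset of the columns) rather than the real file.

```python
import io

import pandas as pd

# Stand-in for data/titanic.csv: three of the rows shown above, with an
# unnamed leading index column (an assumption about the file layout).
csv_text = (
    ",Survived,Pclass,Name,Sex,Age,Fare\n"
    "0,0,3,Mr. Owen Harris Braund,male,22.0,7.25\n"
    "2,1,3,Miss. Laina Heikkinen,female,26.0,7.925\n"
    "4,0,3,Mr. William Henry Allen,male,35.0,8.05\n"
)

# Default behaviour: the unnamed column is kept as data and labelled "Unnamed: 0".
df_default = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the same column becomes the row index instead.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_default.columns.tolist())
print(df_indexed.columns.tolist())
```

Either form works for the plots in this tutorial, since they only use the `Sex` and `Age` columns.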


# Analysis of the Titanic data

## Age distribution

Let us first look at the age distribution for male and female passengers:

In [20]:
df[df.Sex == "male"].plot.hist(
    y="Age",
    bins=range(0, 101, 10),
    line_color="black",
    ylabel="# Passengers",
    figsize=(800, 300),
)
df[df.Sex == "female"].plot.hist(
    y="Age",
    bins=range(0, 101, 10),
    line_color="black",
    ylabel="# Passengers",
    figsize=(800, 300),
    color="pink",
)
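
To see the numbers behind such a histogram, the same bin edges can be passed to `numpy.histogram`. This is a sketch that assumes the plot follows the usual NumPy binning convention (each bin includes its left edge, and the last bin also includes its right edge). The ages below are hypothetical and only illustrate the binning.

```python
import numpy as np

# Hypothetical ages, used only to illustrate the binning; not the Titanic data.
ages = [22.0, 38.0, 26.0, 35.0, 35.0, 4.0, 58.0]

# Same bin edges as in the plots above: 0, 10, ..., 100.
bin_edges = list(range(0, 101, 10))

# counts[i] is the number of ages falling between edges[i] and edges[i + 1].
counts, edges = np.histogram(ages, bins=bin_edges)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>3}-{right:<3}: {count}")
```

With `range(0, 101, 10)` there are eleven edges and therefore ten bins of width 10 years.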

Of course, to compare the distributions it makes more sense to draw them in a single figure. This can be done by passing the first figure to the second plot (the **show_figure** argument suppresses the immediate display of a figure):

In [54]:
# Create first plot (suppress plotting):
p_hist_1 = df[df.Sex == "male"].plot.hist(
    y="Age",
    bins=range(0, 101, 10),
    line_color="black",
    ylabel="# Passengers",
    title="Age distributions separated by sex",
    figsize=(800, 300),
    alpha=0.5,
    show_figure=False,
)

# Plot second plot on first figure:
p_hist_2 = df[df.Sex == "female"].plot.hist(
    figure=p_hist_1,
    y="Age",
    bins=range(0, 101, 10),
    line_color="black",
    figsize=(800, 300),
    color="pink",
    alpha=0.5,
    show_figure=False,
)

# Finally show the plot using show method:
pd.plotting.show(p_hist_2)

Now we can clearly see that the dataset contains far more male than female passengers, and that the male passengers tend to be slightly older.
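
A visual impression like this can be cross-checked numerically with a `groupby`. The sketch below uses only the five rows displayed by `df.head()` earlier, so its numbers say nothing about the full dataset. With the real data one would apply the same call to `df`.

```python
import pandas as pd

# Only the five rows displayed by df.head() above; the real analysis would
# use the full DataFrame loaded from data/titanic.csv.
sample = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "female", "male"],
        "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    }
)

# Number of passengers and mean age per sex.
summary = sample.groupby("Sex")["Age"].agg(["count", "mean"])
print(summary)
```

The `count` column corresponds to the total height of each histogram, and the `mean` column to the center of mass of each distribution.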

Let us go one step further and:

* normalize the distributions
* mark the average of each distribution

We can also use the plot_grid method to easily create dashboard layouts as shown below:

In [55]:
p_hist_3 = df[df.Sex == "male"].plot.hist(
    y="Age",
    bins=range(0, 101, 10),
    line_color="black",
    ylabel="Passengers (%)",
    title="Normed age distributions separated by sex",
    figsize=(800, 300),
    alpha=0.5,
    normed=100,
    show_average=True,
    show_figure=False,
)

p_hist_4 = df[df.Sex == "female"].plot.hist(
    figure=p_hist_3,
    y="Age",
    bins=range(0, 101, 10),
    line_color="black",
    figsize=(800, 300),
    color="pink",
    alpha=0.5,
    normed=100,
    show_average=True,
    show_figure=False,
)

pd.plotting.plot_grid([[p_hist_2, p_hist_4]], plot_width=450)
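
What the normalization does to the bar heights can be sketched with NumPy. This assumes that `normed=100` rescales the bin counts so that they sum to 100, and that `show_average=True` marks the arithmetic mean of the plotted column. The ages are hypothetical and serve only as an illustration.

```python
import numpy as np

# Hypothetical ages for illustration only; not the Titanic data.
ages = np.array([22.0, 38.0, 26.0, 35.0, 35.0, 4.0, 58.0, 27.0])
bin_edges = list(range(0, 101, 10))

counts, _ = np.histogram(ages, bins=bin_edges)

# Scale the counts so that the bars sum to 100 (percent of passengers per bin).
percentages = counts / counts.sum() * 100

# The value that show_average=True is assumed to mark in the plot.
average_age = ages.mean()

print(percentages)
print(average_age)
```

Because each group is scaled by its own total, the normalized histograms for male and female passengers become comparable even though the groups differ in size.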