# Data Visualisation on the Titanic Dataset with Seaborn

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

This time we load the dataset directly from seaborn rather than from a CSV file. As you will notice, the data attributes are a bit different here.

In [None]:
titanic_df = sns.load_dataset('titanic')

### Dataset overview

In [None]:
titanic_df.info()

In [None]:
titanic_df.sample(10)

## Add a categorical variable: age groups

In [None]:
titanic_df.age.value_counts(dropna=False, ascending=False)

In [None]:
bins = [0, 12, 20, 60, np.inf]
labels = ['child', 'teenager', 'adult', 'elder']
titanic_df['age_group'] = pd.cut(titanic_df.age, bins, labels=labels)
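
To check how `pd.cut` assigns the labels, here is a minimal sketch on made-up ages (the `toy_ages` series is purely illustrative and not part of the dataset). By default the bins are closed on the right, so an age of exactly 12 falls into `child`, and missing ages stay missing.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, chosen to sit on and around the bin edges.
toy_ages = pd.Series([0.5, 12, 12.5, 20, 35, 60, 61, np.nan])

bins = [0, 12, 20, 60, np.inf]
labels = ['child', 'teenager', 'adult', 'elder']

# Intervals are (0, 12], (12, 20], (20, 60], (60, inf]; NaN is left as NaN.
age_groups = pd.cut(toy_ages, bins, labels=labels)
print(age_groups.tolist())
```

Because the first bin is `(0, 12]`, an age of exactly 0 would not be assigned to any group; passing `include_lowest=True` would cover that case.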

In [None]:
titanic_df

In [None]:
groups = titanic_df.groupby(['age_group', 'sex'])
groups.size()

### Reshaping the dataset with respect to age group: pivoting and aggregating with `df.pivot_table()`


We want to calculate summary statistics of the fare paid (minimum, median, mean and maximum) per embarkation town and age group. We can do this using `pd.DataFrame.pivot_table()`.

More details on reshaping data frames here: https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html

In [None]:
titanic_df.pivot_table(
    index="embark_town",
    columns="age_group",
    values="fare",
    aggfunc=["min", "median", "mean", "max"]
)
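
As a rough illustration of what `pivot_table` does, the sketch below builds a tiny made-up frame (`toy_df` and its values are assumptions for demonstration only) and shows that pivoting with a single aggregation gives the same numbers as a `groupby` followed by `unstack`.

```python
import pandas as pd

# Hypothetical mini dataset with the same column names as the Titanic frame.
toy_df = pd.DataFrame({
    "embark_town": ["Cherbourg", "Cherbourg", "Cherbourg", "Southampton", "Southampton"],
    "age_group": ["adult", "adult", "child", "adult", "child"],
    "fare": [10.0, 30.0, 8.0, 7.0, 5.0],
})

# One row per town, one column per age group, median fare in each cell.
pivoted = toy_df.pivot_table(
    index="embark_town",
    columns="age_group",
    values="fare",
    aggfunc="median",
)

# The same aggregation written as groupby + unstack.
grouped = toy_df.groupby(["embark_town", "age_group"])["fare"].median().unstack()

print(pivoted)
```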

## Visualization with Seaborn

### Catplot

Docs: https://seaborn.pydata.org/generated/seaborn.catplot.html

Figure-level interface for drawing categorical plots onto a FacetGrid.

This function provides access to several axes-level functions that show the relationship between a numerical and one or more categorical variables using one of several visual representations. The `kind` parameter selects the underlying axes-level function to use.
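
To make the figure-level versus axes-level distinction concrete, here is a small sketch on a made-up frame (`toy_df` is an assumption, not the Titanic data). It draws the same count plot twice: once through `catplot`, which builds its own figure, and once through `countplot`, which draws onto an Axes we supply.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data, just enough to draw two bars.
toy_df = pd.DataFrame({"sex": ["male", "female", "male", "male", "female"]})

# Figure-level: catplot creates its own figure and returns a FacetGrid.
g = sns.catplot(data=toy_df, x="sex", kind="count")

# Axes-level: countplot draws onto an existing matplotlib Axes and returns it.
fig, ax = plt.subplots()
returned_ax = sns.countplot(data=toy_df, x="sex", ax=ax)

print(type(g).__name__, type(returned_ax).__name__)
```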

In [None]:
sns.catplot(x="sex", kind="count", data=titanic_df)

To plot an additional variable (e.g. the passenger class) we can use `hue`.

In [None]:
sns.catplot(x="sex", kind="count", hue="pclass", data=titanic_df)

In [None]:
sns.catplot(x="pclass", kind="count", hue="sex", data=titanic_df)
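
The bar heights in a count plot with `hue` are simply the counts of each combination of the two variables. A minimal sketch with `pd.crosstab` on made-up rows (`toy_df` is an assumption for illustration) shows the table behind such a plot.

```python
import pandas as pd

# Hypothetical passengers; column names mirror the Titanic frame.
toy_df = pd.DataFrame({
    "sex": ["male", "male", "female", "female", "male", "female"],
    "pclass": [1, 3, 1, 3, 3, 3],
})

# Rows are the x variable, columns the hue variable, cells the bar heights.
counts = pd.crosstab(toy_df["sex"], toy_df["pclass"])
print(counts)
```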

In [None]:
sns.catplot(x="deck", kind="count", hue="sex", data=titanic_df)

In [None]:
sns.catplot(
    x='fare',
    y='embarked',
    hue="sex",
    data=titanic_df,
    kind="violin"
)

In [None]:
sns.catplot(
    x='fare',
    y='embarked',
    hue="sex",
    data=titanic_df,
    kind="violin",
    col="pclass",
    col_wrap=3,
    height=4,
    aspect=1,
    dodge=True,
    palette="Set3",
    order=["C", "Q", "S"]
)

##### Passengers' survival rate

Let's see how many passengers survived.

In [None]:
titanic_df['survivor'] = titanic_df['survived'].map({0: 'no', 1: 'yes'})

In [None]:
titanic_df.survivor.value_counts()

In [None]:
sns.catplot(x='survivor', data=titanic_df, kind='count', palette='Set1')

A majority of passengers did not survive. Let's see how this breaks down by class:

In [None]:
sns.catplot(x='pclass', y='survived', kind="point", data=titanic_df)


Third class definitely has the lowest survival rate. Let's see how this splits by sex:

In [None]:
sns.catplot(x='pclass', y='survived', hue="sex", kind="point", data=titanic_df)
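
With its default settings, a point plot marks the mean of `y` for each category and adds an error bar around it. Since `survived` is coded 0/1, that mean is the survival rate. The sketch below reproduces the point estimates with a plain `groupby` on made-up rows (`toy_df` is an assumption, not the real data).

```python
import pandas as pd

# Hypothetical passengers; survived is 0 (no) or 1 (yes).
toy_df = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3, 3],
    "sex": ["female", "female", "male", "female", "male", "male", "male"],
    "survived": [1, 1, 0, 1, 0, 0, 1],
})

# Mean of a 0/1 column = share of survivors in each group.
rate_by_class = toy_df.groupby("pclass")["survived"].mean()
rate_by_class_sex = toy_df.groupby(["pclass", "sex"])["survived"].mean()

print(rate_by_class)
print(rate_by_class_sex)
```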

Female passengers had a higher survival rate in every class, which is consistent with women having been prioritized during the evacuation.

##### Passengers' survival with respect to their deck

In [None]:
sns.catplot(
    x='survivor',
    col='deck',
    col_wrap=4,
    data=titanic_df[titanic_df.deck.notnull()],
    kind="count",
    height=3.5,
    aspect=.9,
    palette='rocket'
)

We don't have many data points, but it seems that passengers on decks A and G had a higher mortality rate than passengers on decks B or D.

We'd need a statistical test to confirm whether this difference is statistically significant. More on this in week 10.

##### Did having family members onboard affect your survival chances?

In [None]:
sns.catplot(x='alone', kind="count", hue='survivor', data=titanic_df, palette='rocket')
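
Raw counts can be hard to compare when the groups differ in size. One option, sketched here on made-up rows (`toy_df` and its values are assumptions), is to normalise each row of a cross-tabulation so that it shows the share of survivors within each group.

```python
import pandas as pd

# Hypothetical passengers; values are invented for illustration only.
toy_df = pd.DataFrame({
    "alone": [True, True, True, True, False, False, False, False],
    "survivor": ["no", "no", "no", "yes", "yes", "yes", "no", "yes"],
})

# normalize="index" divides each row by its total, so rows sum to 1.
rates = pd.crosstab(toy_df["alone"], toy_df["survivor"], normalize="index")
print(rates)
```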

It did: passengers travelling alone were less likely to survive than those with family members onboard.

##### Age histogram (with kde fit)

In [None]:
sns.histplot(x='age', kde=True, data=titanic_df)

#### FacetGrid 

If we want to break down a plot (e.g. the last one) by some categories, we don't need boolean queries or groupbys: we can use `FacetGrid`.

In [None]:
g = sns.FacetGrid(titanic_df, row='survivor', col='class')
g.map(sns.histplot, "age")

##### Jointplot

This function displays data points for two variables, along with each variable's distribution, and optionally kernel density estimates or a regression fit. With `kind='reg'` we indicate that we want a regression fit to the data.

In [None]:
sns.jointplot(data=titanic_df, x='age', y='fare', kind='reg', color='g')
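
To get a feel for what a straight-line regression fit represents, here is a rough sketch using `np.polyfit` on made-up values (`toy_df` is an assumption). This is not how seaborn computes the fit internally; it only illustrates the idea of fitting a line through the points after dropping rows with missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical ages and fares; one age is missing, as often happens in the real data.
toy_df = pd.DataFrame({
    "age": [10.0, 20.0, np.nan, 30.0, 40.0],
    "fare": [15.0, 25.0, 50.0, 35.0, 45.0],
})

# Keep only rows where both variables are present.
clean = toy_df.dropna(subset=["age", "fare"])

# Degree-1 polynomial = straight line: fare ~ slope * age + intercept.
slope, intercept = np.polyfit(clean["age"], clean["fare"], deg=1)
print(slope, intercept)
```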

##### Heatmap 

Heatmaps are ideal for plotting "rectangular data" such as matrices.

They make it easy to spot where values, or aggregates such as averages and counts, are more extreme.

In [None]:
titanic_pivoted_df = titanic_df.pivot_table(
    index="embark_town",
    columns="age_group",
    values="fare",
    aggfunc="median"
)

In [None]:
sns.heatmap(titanic_pivoted_df, annot=True, fmt=".1f")

You can also use a heatmap to visualise a correlation matrix.

In [None]:
sns.heatmap(titanic_df.corr(numeric_only=True), annot=True, fmt=".2f")
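
Correlations are only defined for numeric columns, so text and categorical columns have to be left out. As a sketch of one explicit way to do this, the example below selects the numeric columns first on a made-up frame (`toy_df` is an assumption) and then computes the correlation matrix that a heatmap would display.

```python
import pandas as pd

# Hypothetical frame mixing numeric and text columns.
toy_df = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1],
    "fare": [7.0, 70.0, 30.0, 8.0, 50.0],
    "sex": ["male", "female", "female", "male", "female"],
})

# Keep numeric columns only, then compute pairwise Pearson correlations.
corr = toy_df.select_dtypes(include="number").corr()
print(corr)
```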