# Exercise 2.3 seaborn
prepared by M.Hauser

Seaborn is a library for statistical visualisation; it tries to 'make a well-defined set of hard things easy to do'.

It has a beautiful [gallery](https://seaborn.pydata.org/examples/index.html) illustrating its capabilities.

In [None]:
import matplotlib.pyplot as plt

import numpy as np
import seaborn as sns
import xarray as xr

%matplotlib inline

Let's use a seaborn style:

In [None]:
sns.set(style="white")

### Load Data

We again use time series of station data for Switzerland: temperature and precipitation.

The data is available from [MeteoSwiss](http://www.meteoswiss.admin.ch/home/climate/past/homogenous-monthly-data.html).

The data has already been [retrieved and postprocessed](../data/prepare_data_MCH.ipynb).

In [None]:
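def load_mch(station, annual=True):
    """Open the MeteoSwiss series for one station code (e.g. 'BAS')."""
    # NOTE: `annual` is accepted but not used in this function
    fN = '../data/MCH_HOM_{}.nc'.format(station)
    return xr.open_dataset(fN)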

BAS = load_mch('BAS')
BER = load_mch('BER')
GSB = load_mch('GSB')
DAV = load_mch('DAV')

## Distributions

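While it is easy to create histograms with matplotlib, plotting a kernel density estimate (KDE) takes considerably more work. With seaborn's `distplot` function this is easy (seaborn 0.11 and later deprecate `distplot` in favour of `histplot`, `kdeplot` and `displot`):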

In [None]:
# create random data
d = np.random.randn(100)

# ======================

# Set up the matplotlib figure
f, axes = plt.subplots(2, 2, figsize=(7, 7), sharex=True, sharey=True)
axes = axes.flatten()


# get rid of the left axis
sns.despine(left=True)

ax = axes[0]

# Plot a simple histogram with bin size determined automatically
# (norm_hist=True normalises it; matplotlib no longer accepts `normed`)
sns.distplot(d, kde=False, norm_hist=True, color="b", ax=ax)

ax = axes[1]

# Plot a kernel density estimate and rug plot
sns.distplot(d, hist=False, rug=True, color="r", ax=ax)

ax = axes[2]

# Plot a filled kernel density estimate
sns.distplot(d, hist=False, color="g", kde_kws={"shade": True}, ax=ax)

ax = axes[3]

# Plot a histogram and kernel density estimate
sns.distplot(d, color="m", ax=ax)

plt.setp(axes, yticks=[])
plt.tight_layout()
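
To make clear what a kernel density estimate computes, here is a minimal sketch in plain Python. It is not how seaborn implements it: the helper name `gaussian_kde_1d` and the bandwidth rule (Scott's rule of thumb) are assumptions chosen for illustration, and seaborn's own bandwidth choice may differ. Each sample contributes a small Gaussian bump, and the estimate is the average of all bumps.

```python
import math
import random
import statistics


def gaussian_kde_1d(samples, bandwidth):
    """Return a function that evaluates a Gaussian KDE of `samples` at x."""
    n = len(samples)
    # normalisation so that the estimate integrates to one
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # sum of one Gaussian bump per sample, centred on that sample
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# random data, comparable to np.random.randn(100)
random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(100)]

# Scott's rule of thumb for the bandwidth (an assumption, see text)
bandwidth = statistics.stdev(samples) * len(samples) ** (-1.0 / 5.0)

density = gaussian_kde_1d(samples, bandwidth)
print(density(0.0), density(3.0))
```

A larger bandwidth gives a smoother curve, a smaller one follows the individual samples more closely.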

### Exercise

 * Plot a KDE of `BAS.Temperature` and `DAV.Temperature`.
 * Can you add a legend?

In [None]:
# code here



### Solution

In [None]:
sns.distplot(BAS.Temperature, hist=False, kde_kws={"shade": True, 'label': 'Basel'})
sns.distplot(DAV.Temperature, hist=False, kde_kws={"shade": True, 'label': 'Davos'})

plt.legend();

## Joint plot

`jointplot` shows the distributions of two individual data sets as well as their joint distribution:

In [None]:
# pass the data as keywords; recent seaborn versions require x= and y=
sns.jointplot(x=BAS.Temperature, y=BAS.Precipitation, kind='kde');

### Exercise
 * Is there a correlation between precipitation and temperature in Davos (`DAV`)?
 * Choose another `kind`.

In [None]:
sns.jointplot?

### Solution

In [None]:
# pass the data as keywords; recent seaborn versions require x= and y=
sns.jointplot(x=DAV.Temperature, y=DAV.Precipitation, kind='hex');
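
A joint plot shows the relationship visually. To put a number on it, one option is the Pearson correlation coefficient. The sketch below computes it in plain Python; the helper `pearson_r` and the temperature and precipitation values are made up for illustration and are not the station data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # covariance and variances (the common 1/n factor cancels)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# made-up monthly values, NOT the MeteoSwiss data
temperature = [-5.1, -3.2, 0.4, 4.0, 8.9, 12.3]
precipitation = [60.0, 55.0, 58.0, 70.0, 95.0, 120.0]

r = pearson_r(temperature, precipitation)
print(round(r, 2))
```

Values close to +1 or -1 indicate a strong linear relationship, values near 0 a weak one.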

## Pandas

Seaborn works especially well with pandas dataframes. We can illustrate this with an example from the [seaborn gallery](https://seaborn.pydata.org/examples/factorplot_bars.html).

The example data is a passenger list from the Titanic:

In [None]:
# Load the example Titanic dataset
titanic = sns.load_dataset("titanic")

titanic.head()

Then we can use a `factorplot` (named `catplot` since seaborn 0.9) to show the survival probability as a function of the passenger's class and sex:

In [None]:
with sns.axes_style('whitegrid'):
    # Draw a nested barplot to show survival for class and sex
    # (factorplot was renamed to catplot, and size to height, in seaborn 0.9)
    g = sns.catplot(x="class", y="survived",
                    hue="sex",
                    data=titanic,
                    height=6, kind="bar", palette="muted")

    g.despine(left=True)
    g.set_ylabels("survival probability")

Pandas DataFrames do not work very well with lat/lon data (that is what xarray and the like are for), so let's use an example with a time series. First we need to convert `BAS` from an xarray Dataset to a pandas DataFrame.

In [None]:
import pandas as pd

In [None]:

def to_dataframe(data):
    # STEP 1
    # calculate monthly temperature and precipitation anomalies
    d = data.groupby('time.month') - data.groupby('time.month').mean('time')

    # STEP 2
    # convert to a dataframe
    d = d.to_dataframe()[['Temperature', 'Precipitation']]

    # STEP 3
    # create a new categorical variable 'month'
    d['month'] = d.index.month.values
    d['month'] = d['month'].astype('category')

    # STEP 4
    # create a wet/dry category: 'wet' if the precipitation anomaly is
    # positive (more rain than average for that month), 'dry' otherwise
    bins = [-np.inf, 0, np.inf]
    d['prec_cat'] = pd.cut(d['Precipitation'], bins, labels=['dry', 'wet'])

    return d

BAS_df = to_dataframe(BAS)
DAV_df = to_dataframe(DAV)
BAS_df.head()
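
To see what STEP 1 and STEP 4 compute, here is a plain-Python sketch of the same idea on a handful of made-up precipitation values (not the station data). It is only an illustration, not a replacement for the xarray and pandas calls above: group the values by calendar month, subtract each month's mean to get anomalies, then label each anomaly. With the bins `[-inf, 0, inf]`, `pd.cut` uses right-closed intervals by default, so an anomaly of exactly 0 falls in the 'dry' bin.

```python
from collections import defaultdict

# (year, month, precipitation in mm); made-up values for illustration
records = [
    (2000, 1, 40.0), (2000, 7, 90.0),
    (2001, 1, 60.0), (2001, 7, 110.0),
    (2002, 1, 50.0), (2002, 7, 70.0),
]

# STEP 1a: mean per calendar month (the monthly climatology)
by_month = defaultdict(list)
for _, month, value in records:
    by_month[month].append(value)
climatology = {m: sum(v) / len(v) for m, v in by_month.items()}

# STEP 1b: anomaly = value minus the mean of its calendar month
anomalies = [(y, m, v - climatology[m]) for y, m, v in records]

# STEP 4: anomalies in (-inf, 0] are 'dry', anomalies in (0, inf] are 'wet'
categories = ['dry' if a <= 0 else 'wet' for _, _, a in anomalies]

for (year, month, anomaly), category in zip(anomalies, categories):
    print(year, month, anomaly, category)
```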

### Exercise

 * Create a `factorplot` (`catplot` in seaborn 0.9 and later) showing monthly temperature anomalies as a function of `month` and precipitation category.

In [None]:
# code here

# sns.catplot(...)  # named factorplot before seaborn 0.9

### Solution

In [None]:
# catplot is the seaborn >= 0.9 name for factorplot (size is now height)
g = sns.catplot(x="month", y="Temperature", hue="prec_cat", data=BAS_df,
                height=6, kind="bar", palette="BrBG")