# <center><b>Hands-on</b></center>

<div style="text-align:center">
    <img src="../images/seaborn.png" width="600px">
    <div>
       Bertrand Néron, François Laurent, Etienne Kornobis, Vincent Guillemot
       <br />
       <a href="https://research.pasteur.fr/en/team/bioinformatics-and-biostatistics-hub/">Bioinformatics and Biostatistics HUB</a>
       <br />
       © Institut Pasteur, 2024
    </div>    
</div>

Practice your plotting skills with the [happiness 2016](https://www.kaggle.com/datasets/unsdsn/world-happiness?select=2016.csv) dataset.

(The data are already in the `data` directory as `happiness_2016.csv`.)

## Import the data and have a look at them

1. import the pandas and seaborn modules
2. import the data

In [None]:
import pandas as pd
import seaborn as sns

In [None]:
ha_df = pd.read_csv("../data/happiness_2016.csv")

3. have a look at them

In [None]:
ha_df.shape

In [None]:
ha_df.head()

## Draw a boxplot showing how `Happiness Score` differs between `Region`s

In [None]:
sns.boxplot(data=ha_df, x="Happiness Score", y="Region")

## Using a histogram and a continuous probability density curve, display the distribution of `Freedom` in the dataset

In [None]:
sns.histplot(data=ha_df, x="Freedom")

In [None]:
sns.histplot(data=ha_df, x="Freedom", kde=True)

- Use a barplot to show the number of countries per Region (see the documentation for `countplot`)

In [None]:
sns.countplot(data=ha_df, x="Region")

As you can see, the labels overlap each other and are not readable.

One possibility is to rotate the X-labels. In this case it is better to provide the labels explicitly.

In [None]:
# extract the unique regions from the data; they are used as tick labels in the figures below
regions = ha_df.loc[:, 'Region'].drop_duplicates()

In [None]:
ax = sns.countplot(data=ha_df, x="Region")
ax.set_xticks(regions)
ax.set_xticklabels(regions, rotation=45, ha='right', rotation_mode='anchor')
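
As an alternative sketch, the tick labels that seaborn has already placed can be restyled in place with `plt.setp`, without passing the labels again. The small `toy_df` DataFrame below is made up for illustration and is not the happiness dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up data for illustration only (not the happiness dataset)
toy_df = pd.DataFrame({
    "Region": ["Western Europe", "Western Europe", "North America", "Southern Asia"],
})

fig, ax = plt.subplots()
sns.countplot(data=toy_df, x="Region", ax=ax)

# Rotate the labels that are already on the X axis
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
```

Another option is to draw the bars horizontally with `y="Region"`, which leaves room for long labels.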

## On the same data (`Happiness Score` and `Region`), do a boxplot and a swarmplot to display the structure of the data

In [None]:
ax = sns.swarmplot(data=ha_df, x="Region", y="Happiness Score")
ax.set_xticks(regions)
ax.set_xticklabels(regions, rotation=45, ha='right', rotation_mode='anchor')

In [None]:
ax = sns.boxplot(data=ha_df, x="Region", y="Happiness Score", hue='Region') # see the result of the option hue
ax.set_xticks(regions)
ax.set_xticklabels(regions, rotation=45, ha='right', rotation_mode='anchor')

## Plot the distribution of `Happiness Score` for the people living in `Western Europe`

In [None]:
sns.histplot(data=ha_df.query("Region == 'Western Europe'"), x="Happiness Score", kde=True)

## Plot `Health (Life Expectancy)` vs `Happiness Score` and color the dots according to the region. Specify a size for the figure (9 x 7 inches)

1. import pyplot
2. then create a new figure and axis at the right size
3. create the plot

In [None]:
import matplotlib.pyplot as plt

In [None]:
fig, ax = plt.subplots(figsize=(9, 7))
sns.scatterplot(data=ha_df, x="Health (Life Expectancy)", y="Happiness Score", hue="Region", ax=ax)

- Feel free to explore more of [seaborn](https://seaborn.pydata.org/examples/index.html) !

## Do a barplot of the Happiness Score for each Region

In [None]:
sns.barplot(data=ha_df, y="Region", x="Happiness Score", hue='Region', orient='h')

## From this point on, we will focus on the Regions

### Clean the dataset: remove the irrelevant columns

In [None]:
ha_df.columns

1. keep only columns 'Region', 'Happiness Score', 'Economy (GDP per Capita)', 'Family', 'Freedom','Trust (Government Corruption)', 'Generosity'
2. set the index to the Region
3. have a look at your new data

In [None]:
region_df = ha_df.loc[:, ['Region', 'Happiness Score', 'Economy (GDP per Capita)', 'Family', 'Freedom','Trust (Government Corruption)', 'Generosity']]
region_df.set_index('Region', inplace=True)
region_df.head()

## Aggregate the new data region by region. For each Region, compute the mean over its countries and use it as the value for that Region

In [None]:
reg_agg = region_df.groupby('Region').agg('mean')
reg_agg
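
To see what this aggregation does, here is a minimal sketch on a made-up table (the values and the `toy_df` name are illustrative, not taken from the happiness data). Each Region row in the result holds the mean of its countries.

```python
import pandas as pd

# Made-up values, indexed by Region like `region_df`
toy_df = pd.DataFrame(
    {
        "Happiness Score": [7.0, 6.0, 4.0],
        "Freedom": [0.6, 0.4, 0.3],
    },
    index=pd.Index(["Western Europe", "Western Europe", "Southern Asia"], name="Region"),
)

# One row per Region, holding the column-wise mean over its countries
toy_agg = toy_df.groupby("Region").mean()
print(toy_agg)
```

Here the two "Western Europe" rows collapse into a single row with a `Happiness Score` of 6.5.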

## Do a hierarchically-clustered heatmap 

In [None]:
sns.clustermap(data=reg_agg)

Check the data.

In [None]:
reg_agg.describe()

The columns are not in the same range, so it is better to standardize the data before clustering.

In [None]:
normalized_reg = (reg_agg - reg_agg.mean()) / reg_agg.std()
normalized_reg
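
A quick way to check the standardization is to verify that each column ends up with mean 0 and standard deviation 1. The sketch below does this on made-up numbers (`toy_agg` is an illustrative name). Note that pandas `std()` uses the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Made-up aggregated values for three hypothetical regions
toy_agg = pd.DataFrame(
    {"Happiness Score": [7.0, 5.0, 3.0], "Freedom": [0.6, 0.4, 0.5]},
    index=["Region A", "Region B", "Region C"],
)

# Column-wise z-score: subtract the column mean, divide by the column std
toy_norm = (toy_agg - toy_agg.mean()) / toy_agg.std()

print(toy_norm.mean().round(6))  # each column mean is ~0
print(toy_norm.std().round(6))   # each column std is ~1
```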

In [None]:
sns.clustermap(data=normalized_reg, annot=True) # see the results of the annot option 

It is possible to do this directly in seaborn with the `z_score` option (https://seaborn.pydata.org/generated/seaborn.clustermap.html).

In [None]:
sns.clustermap(data=reg_agg, z_score=1, annot=True)

- Explore the clustermap documentation to get a more readable heatmap by standardizing the data within each column.
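
One option listed in the clustermap documentation is `standard_scale`, which rescales the data instead of computing z-scores. As a rough sketch, assuming it amounts to subtracting the minimum and then dividing by the maximum of each column, the pandas equivalent on made-up numbers (`toy_agg` is an illustrative name) would look like this:

```python
import pandas as pd

# Made-up aggregated values for three hypothetical regions
toy_agg = pd.DataFrame(
    {"Happiness Score": [7.0, 5.0, 3.0], "Freedom": [0.6, 0.4, 0.5]},
    index=["Region A", "Region B", "Region C"],
)

# Min-max rescaling per column: every column ends up in the range [0, 1]
shifted = toy_agg - toy_agg.min()
toy_scaled = shifted / shifted.max()
print(toy_scaled)
```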

## Create a function that produces a single image with four different plots of your choice, and save it to a PDF file.

Like the image below.

<img src="../images/multiple_figure.png" width="50%" />

In [None]:
import matplotlib.pyplot as plt

In [None]:
def expression_graph():
    # constrained_layout=True prevents the axes titles from overlapping the X-labels of the plots above them
    fig, axs = plt.subplots(2, 2, figsize=(9, 7), constrained_layout=True)
    sns.boxplot(data=ha_df, x="Happiness Score", y="Region", hue='Region', ax=axs[0,0])
    axs[0,0].set_title("happiness data structure")
    
    sns.scatterplot(data=ha_df, x="Health (Life Expectancy)", y="Happiness Score", hue="Region", legend=False, ax=axs[0,1])
    axs[0,1].set_title("Happiness vs Health")
    
    sns.barplot(data=ha_df, y="Region", x="Happiness Score", hue='Region', orient='h', ax=axs[1,0])
    axs[1,0].set_title("happiness through the world")
    
    sns.histplot(data=ha_df, x="Happiness Score", kde=True, ax=axs[1,1])
    axs[1,1].set_title("Happiness data distribution")
    
    return fig
    

In [None]:
my_fig = expression_graph()
my_fig.suptitle("Happiness Report")
# bbox_inches="tight" prevents the Y-labels on the left from being truncated in the PDF
my_fig.savefig("happiness_visualization.pdf", bbox_inches="tight")
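
To check that a figure really ends up as a PDF, one possible sketch is to save it to an in-memory buffer and inspect the first bytes, since `savefig` accepts a file-like object as well as a path. The figure here is an empty placeholder, not the four-panel figure built above.

```python
import io

import matplotlib.pyplot as plt

# Empty 2 x 2 placeholder figure with the same size as in the exercise
fig, axs = plt.subplots(2, 2, figsize=(9, 7), constrained_layout=True)
fig.suptitle("Placeholder figure")

# Write the PDF into memory instead of a file on disk
buffer = io.BytesIO()
fig.savefig(buffer, format="pdf", bbox_inches="tight")

# A PDF file starts with the bytes "%PDF"
print(buffer.getvalue()[:4])
```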

# Extras

- Using ipywidgets, write a function that displays a barplot of `Happiness Score` by country for a region selected by the user (using a Dropdown widget)

Import the needed modules:
- `widgets` and `interact` from the `ipywidgets` package


In [None]:
from ipywidgets import widgets
from ipywidgets import interact

Create a Series containing the regions (without duplicate values)

In [None]:
regions = ha_df.loc[:, 'Region'].drop_duplicates()

1. Use this Series to populate your dropdown list
2. Use the region selected in the dropdown list as the parameter of your function
3. Select from the whole DataFrame the data corresponding to this region
4. Display the barplot

Below is the code skeleton of your function:

```python
@interact(region=widgets.Dropdown(options=regions))
def plot_counts(region):
    data = ha_df.loc[ha_df['Region'] == region]
    ax = sns.barplot(data= ....
```

In [None]:

@interact(region=widgets.Dropdown(options=regions))
def plot_counts(region):
    data = ha_df.loc[ha_df['Region'] == region]
    ax = sns.barplot(data=data, y='Happiness Score', x='Country')
    ax.set_xticks(data.Country)
    ax.set_xticklabels(data.Country, rotation=45, ha='right', rotation_mode='anchor')
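
One design option, sketched below, is to keep the row selection in a small helper so it can be checked without any widget. The helper name `select_region` and the toy data are illustrative assumptions, not part of the course material.

```python
import pandas as pd


def select_region(df, region):
    """Return the rows of `df` whose 'Region' column equals `region`."""
    return df.loc[df["Region"] == region]


# Made-up rows for illustration only
toy_df = pd.DataFrame({
    "Country": ["Denmark", "Bhutan", "Norway"],
    "Region": ["Western Europe", "Southern Asia", "Western Europe"],
    "Happiness Score": [7.5, 5.2, 7.4],
})

selected = select_region(toy_df, "Western Europe")
print(selected["Country"].tolist())  # ['Denmark', 'Norway']
```

The function decorated with `interact` can then call such a helper and only take care of the plotting.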
    

You can customize your figure like any regular seaborn/matplotlib figure.

For instance, to display the value above each bar:

In [None]:

@interact(region=widgets.Dropdown(options=regions))
def plot_counts(region):
    data = ha_df.loc[ha_df['Region'] == region]
    ax = sns.barplot(data=data, y='Happiness Score', x='Country')
    for i in ax.containers:
        # add label on each bar https://www.geeksforgeeks.org/how-to-show-values-on-seaborn-barplot/
        ax.bar_label(i, fmt="{:.2f}", rotation='vertical', padding=3)
        
    ax.set_xticks(data.Country)
    ax.set_xticklabels(data.Country, rotation=45, ha='right', rotation_mode='anchor')
    # add a margin so the labels stay inside the plot boundaries; here, 10% of white space vertically
    # https://stackoverflow.com/questions/72662991/how-can-i-prevent-bar-labels-from-going-outside-the-barplot-boundaries-range
    ax.margins(y=0.1)
    