# Seaborn Workshop

Seaborn is a Python data visualization library based on matplotlib. 
It provides a high-level interface for drawing attractive and informative statistical graphics.

___

Installing Seaborn (conda installation recommended)

https://seaborn.pydata.org/installing.html

___

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns # seaborn library

For this session, you will need the data set named ```heart.csv```, which can be downloaded from our [GitHub repository](https://github.com/IC-Computational-Biology-Society/Pandas_Matplotlib_session.git) dedicated to today's workshop. Make sure you save it in the same directory as this Jupyter notebook. 

___

## Getting Started

In [None]:
df = pd.read_csv('heart.csv')
display(df.head())

**Dataset description** 

- ```age```: The patient's age
- ```gender```: 0 = female and 1 = male
- ```cp```: The chest pain experienced (Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic)
- ```trestbps```: The patient's resting blood pressure (mm Hg on admission to the hospital)
- ```chol```: The patient's cholesterol measurement in mg/dl
- ```fbs```: The patient's fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)
- ```restecg```: Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)
- ```thalach```: The patient's maximum heart rate achieved
- ```exang```: Exercise induced angina (0 = no, 1 = yes)
- ```oldpeak```: ST depression induced by exercise relative to rest ('ST' relates to positions on the ECG plot)
- ```slope```: The slope of the peak exercise ST segment (Value 1: upsloping, Value 2: flat, Value 3: downsloping)
- ```ca```: The number of major vessels (0-3)
- ```thal```: A blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect)
- ```target```: Heart disease (0 = no, 1 = yes)

In [None]:
# get the number of patients (number of rows) and columns of the dataset
print("number of patients :", len(df))
print("number of columns :", len(df.columns))

In [None]:
# check if any values are missing (number of missing values per column)
print(df.isnull().sum())


## Task 1

Plot a histogram of the patients' age distribution using the ```sns.histplot``` function.
Set the parameter ```kde``` to ```True``` to include the kernel density estimate.

Don't forget to include the plot's title.
___

In [None]:
### Enter code below

Create a new histogram, again using the ```sns.histplot``` function, but showing patients with disease (target = 1) and patients without disease (target = 0) in different colours.
____

In [None]:
### Enter code below

## Task 2 

Kernel density estimate (KDE) plots are similar to histograms and can be used to visualise the distribution of observations in a dataset. A KDE plot represents the data using a continuous probability density curve.

___

Use the ```sns.kdeplot``` function to visualise the distribution of resting blood pressure ```trestbps``` by patient ```gender``` (0 = female, 1 = male).

Rename the legend labels to 'female' and 'male'.
___

In [None]:
### Enter code below

Create a new figure that contains two subplots, one showing the distribution of resting blood pressure ```trestbps``` by gender and the other showing the distribution of cholesterol ```chol``` by gender.

Add a title to each of the subplots and rename the legend labels to 'female' and 'male'.
___

In [None]:
### Enter code below

## Task 3

Use the ```sns.countplot``` function to visualise the number of patients with and without the disease for each gender.

Rename the x-tick labels to 'female' and 'male' and the legend labels to 'disease' and 'no disease'.
___
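A rough sketch of a grouped count plot with relabelled ticks and legend is shown below. The data and the columns ```group``` and ```outcome``` are made up, and the code assumes the categories are drawn at x positions 0 and 1 and that the legend lists the hue values in sorted order.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (not heart.csv): both columns are coded 0/1
toy = pd.DataFrame({
    "group":   [0, 0, 0, 1, 1, 1, 1, 1],
    "outcome": [0, 1, 1, 0, 0, 1, 1, 1],
})

fig, ax = plt.subplots()
# One bar per (group, outcome) combination; bar height is the number of rows
sns.countplot(data=toy, x="group", hue="outcome", ax=ax)

# Assumption: the two categories sit at x positions 0 and 1
ax.set_xticks([0, 1])
ax.set_xticklabels(["group A", "group B"])

# Assumption: legend entries are ordered by hue value (0, then 1)
for text, label in zip(ax.get_legend().get_texts(), ["negative", "positive"]):
    text.set_text(label)
```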

In [None]:
### Enter code below

## Task 4

Correlation indicates how strongly the features are related to each other or to the target variable.
The correlation may be positive (as the value of the feature increases, the value of the target variable tends to increase) or negative (as the value of the feature increases, the value of the target variable tends to decrease).
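To make the sign of a correlation concrete, here is a small sketch with invented numbers: one series that rises with ```x``` and one that falls as ```x``` rises. It uses the pandas ```Series.corr``` method, which computes the Pearson correlation by default.

```python
import pandas as pd

# Made-up values for illustration only
x = pd.Series([1, 2, 3, 4, 5])
rising = 2 * x + 1     # increases whenever x increases
falling = 10 - 3 * x   # decreases whenever x increases

corr_rising = x.corr(rising)    # perfectly linear and increasing: close to +1
corr_falling = x.corr(falling)  # perfectly linear and decreasing: close to -1

print("x vs rising :", round(corr_rising, 3))
print("x vs falling:", round(corr_falling, 3))
```

Real data are rarely perfectly linear, so correlations in practice usually fall somewhere between these two extremes.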

Plot a correlation matrix using ```sns.heatmap```, showing the correlation of the features with each other and with the target variable.

**Hint**

The correlations between the variables in the data can be calculated using ```df.corr()```; pass the result as the ```data``` parameter of the heatmap function.
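The hint can be sketched on a small synthetic table as follows. The columns ```a```, ```b``` and ```c``` are hypothetical, and the ```annot```, ```cmap```, ```vmin``` and ```vmax``` settings are optional styling choices rather than requirements of the task.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (not heart.csv)
rng = np.random.default_rng(3)
a = rng.normal(size=100)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.5, size=100),   # tends to rise with a
    "c": -a + rng.normal(scale=0.5, size=100),  # tends to fall as a rises
})

corr = toy.corr()  # square table of pairwise correlations

fig, ax = plt.subplots()
# annot=True writes each coefficient into its cell;
# vmin/vmax fix the colour scale to the full range from -1 to 1
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation matrix (synthetic data)")
```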

In [None]:
### Enter code below