<img src="images/inmas.png" width="130" align='right' />

# Notebook 21 - Visualizing with Seaborn
This notebook is a short introduction to Seaborn. It covers:

<small>

- Reading a data set
- Relational plots such as scatterplots, line plots, and regression plots
- Categorical plots such as bar plots and box plots
- Distribution plots such as histograms
- Grid plots and pair plots

### Prerequisite
Notebook 20
</small>
----

### Housekeeping

In [None]:
import seaborn as sbn

### What is Seaborn?
<small>

- Seaborn is a Python data visualization library based on Matplotlib providing a high-level interface for drawing attractive and informative statistical graphics
- Seaborn offers a richer set of plotting functions such as:
    - Relational plots: Scatterplots, line plots, and regression plots
    - Categorical plots: Bar plots, count plots, box plots, and violin plots
    - Distribution plots: Histograms, kernel density estimation (KDE) plots, and rug plots
    - Grid plots: FacetGrid and PairGrid for creating multiple plots in a grid layout
    - Matrix plots: Heatmaps and clustermaps
</small>

### Loading sample data from the Seaborn package
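We'll use the [Palmer penguins](https://allisonhorst.github.io/palmerpenguins/) data set, one of the example data sets that Seaborn's `load_dataset` can fetch by name (the file is downloaded from Seaborn's online example repository, so an internet connection is needed the first time)

The data set contains anatomical measurements for three species of penguins: Adelie, Gentoo, and Chinstrap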

In [None]:
penguinsDf = sbn.load_dataset('penguins')
type(penguinsDf)

- Seaborn returns the data as a Pandas `DataFrame` (notice that `pandas` was not imported directly)
- Because the data is a data frame, we can use the commands we already know:

In [None]:
penguinsDf.head(3)

### Exploring the 'penguins' data set
We can use additional Pandas commands to explore the contents of the data set we just loaded:

In [None]:
penguinsDf.info()

Notice the few missing entries: some columns have a non-null count lower than the total number of rows

In [None]:
penguinsDf.describe()
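
The missing entries noted above can also be counted explicitly. The sketch below uses a tiny made-up data frame (`toyDf` and its values are hypothetical, not taken from 'penguins') to show two standard Pandas calls: `isna().sum()` to count gaps per column and `dropna()` to keep only complete rows.

```python
import pandas as pd

# Hypothetical miniature data frame with the same kind of gaps as 'penguins'
toyDf = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "bill_length_mm": [39.1, None, 46.1, 46.5],
    "sex": ["Male", None, "Female", None],
})

missing_per_column = toyDf.isna().sum()    # number of missing values in each column
completeDf = toyDf.dropna()                # keep only rows with no missing values

print(missing_per_column)
print(len(toyDf), "rows before,", len(completeDf), "rows after dropna()")
```

The same two calls should work unchanged on `penguinsDf`. Whether dropping incomplete rows is appropriate depends on the analysis.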

### Using Seaborn for generating scatterplots

We'll make some scatterplots using the 'penguins' data set to look for possible correlations:

In [None]:
sbn.scatterplot(data=penguinsDf, x="flipper_length_mm", y="bill_length_mm"); 

We can color the points by species

In [None]:
sbn.scatterplot(data=penguinsDf, x="flipper_length_mm", y="bill_length_mm", hue="species");

As Seaborn is based on Matplotlib, we can also add details to each plot by using `matplotlib`:

In [None]:
import matplotlib.pyplot as plt

sbn.scatterplot(data=penguinsDf, x="flipper_length_mm", y="bill_length_mm", hue="species")

plt.title('Flipper length (mm) vs. Bill length (mm)')
plt.xlabel('Flipper length (mm)')
plt.ylabel('Bill length (mm)');
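
An alternative to calling `plt.title` and friends is to work on the Matplotlib `Axes` object that `scatterplot` returns. The sketch below uses a small made-up data frame (`toyDf` is hypothetical) so that it runs on its own.

```python
import pandas as pd
import seaborn as sbn

# Made-up measurements, only to have something to plot
toyDf = pd.DataFrame({
    "flipper_length_mm": [181, 195, 210, 217],
    "bill_length_mm": [39.1, 38.6, 46.1, 49.3],
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
})

# scatterplot returns the Matplotlib Axes it drew on
ax = sbn.scatterplot(data=toyDf, x="flipper_length_mm", y="bill_length_mm", hue="species")

# Set title and axis labels in a single call on that Axes
ax.set(title="Flipper length vs. bill length",
       xlabel="Flipper length (mm)",
       ylabel="Bill length (mm)");
```

Keeping a handle on `ax` can be convenient when a figure holds several plots, because each call then targets one specific plot.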

### Automatic linear regression (lm = linear model)
As linear regressions are common, the command `lmplot` superimposes a best-fit line over the scatterplot:

In [None]:
sbn.lmplot(data=penguinsDf, x="flipper_length_mm", y="bill_length_mm");

Regressions can also be done separately for each species:

In [None]:
sbn.lmplot(data=penguinsDf, x="flipper_length_mm", y="bill_length_mm", hue="species");

Notice the light shaded band around each line: it indicates the uncertainty of the fitted model (a 95% confidence interval by default)
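
To see the numbers behind a best-fit line, one option is to compute the fit separately. The sketch below uses NumPy's `polyfit` on made-up data; this is a stand-in chosen for illustration, not necessarily the routine Seaborn calls internally.

```python
import numpy as np

# Made-up, perfectly linear data so the result is easy to check by hand
flipper = np.array([180, 190, 200, 210, 220])
bill = np.array([38, 41, 44, 47, 50])

# A degree-1 polynomial fit is an ordinary least-squares straight line
slope, intercept = np.polyfit(flipper, bill, deg=1)

print(f"bill = {slope:.2f} * flipper + {intercept:.2f}")
```

Each step of 10 in `flipper` raises `bill` by 3, so the fitted slope should come out at about 0.3.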

### Facet plots

<small>

- Facet grids are a means to partition the data set by groups into subplots that can be combined together to form a single, larger graph
    - similar to `subplots` in Matplotlib, but automated in Seaborn
- Facets are useful when there are multiple discrete or categorical variables present
- Making a facet plot requires two separate steps: 
    1. Using `FacetGrid()`, create a grid with the desired row and/or column variables
    2. Use the `map()` method to map the desired graphs of selected variables onto the grid
</small>

    
For example:
```python
grid = sbn.FacetGrid(dataset_name, col="col_name")
grid.map(sbn.function_name, "variable_1", "variable_2");
```


The following cell creates a facet grid with one column per island:

In [None]:
grid = sbn.FacetGrid(penguinsDf, col="island")
grid.map(sbn.scatterplot, "flipper_length_mm", "bill_length_mm");

### Boxplot

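- A boxplot shows how a continuous variable is distributed within each level of a categorical variable
- It displays the first quartile (25%), the median (50%), the third quartile (75%), whiskers covering the rest of the typical range, and outliers as individual points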

In [None]:
sbn.boxplot(data=penguinsDf, x="sex", y="bill_length_mm", hue="species");
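
The quantities a boxplot displays can be reproduced by hand. The sketch below uses made-up numbers and assumes the common 1.5 x IQR rule for flagging outliers, which matches the default value of the `whis` parameter of `boxplot`.

```python
import pandas as pd

# Made-up sample with one unusually large value
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1                          # interquartile range (height of the box)

# Assumed rule: points farther than 1.5 * IQR from the box count as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)
print(outliers.tolist())
```

Here only the value 30 lies beyond the upper fence, so it is the single point a boxplot would draw separately.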

### Scatterplots with a categorical variable

If we use a scatterplot when one variable is categorical, the points pile up on a few vertical lines and hide each other:

In [None]:
sbn.scatterplot(data=penguinsDf, x="sex", y="bill_length_mm", hue="species");

Fortunately, Seaborn has the commands `stripplot` and `swarmplot` for that purpose

### The `stripplot` command
This command can group the points by `hue` and slightly offset each group within a category when `dodge=True` is set:

In [None]:
sbn.stripplot(data=penguinsDf, x="sex", y="bill_length_mm", hue="species", dodge=True);

### The `swarmplot` command
The swarmplot is similar, except that the points are shifted sideways so that they never overlap:

In [None]:
sbn.swarmplot(data=penguinsDf, x="sex", y="bill_length_mm", hue="species", dodge=True);

### Plotting histograms

- Histograms are used for describing how continuous variables are distributed
- They're great for identifying the *center* and *spread* of a distribution
- They can also be used to see frequent and infrequent (outlier) values

In [None]:
sbn.histplot(data=penguinsDf, x="flipper_length_mm");

There are many settings in `histplot` that can be adjusted. Here is an example:

In [None]:
sbn.histplot(data=penguinsDf, x="flipper_length_mm", hue="species", element="step", stat="density", alpha=0.15);

### Bar plots of categorical data

The command `catplot` draws categorical plots. With `kind="bar"`, it shows the mean of a numeric variable for each category (here, the mean body mass per species and sex):

In [None]:
sbn.catplot(data=penguinsDf, kind="bar", x="species", y="body_mass_g", hue="sex", palette="dark", errorbar=None);
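
With `kind="bar"`, the default estimator is the mean, so each bar height is the mean of `y` within one group. The sketch below reproduces those numbers with a Pandas `groupby` on a small made-up data frame (`toyDf` and its values are hypothetical).

```python
import pandas as pd

# Made-up body masses for two species and both sexes
toyDf = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
    "sex": ["Male", "Male", "Female", "Male", "Female", "Female"],
    "body_mass_g": [4000, 4200, 3400, 5500, 4600, 4800],
})

# One mean per (species, sex) group corresponds to one bar in the plot
group_means = toyDf.groupby(["species", "sex"])["body_mass_g"].mean()
print(group_means)
```

Computing the table this way is a quick check that the bars show what we expect.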

### Standard line plots

- Line graphs are best suited to data that changes over time
- Let's use Seaborn's 'flights' data set to demonstrate this capability:

In [None]:
flightsDf = sbn.load_dataset("flights")
flightsDf.head(3)

We explore the data set with Pandas' `info()` method

The data set contains the number of airline passengers per month

In [None]:
flightsDf.info()

We plot the number of passengers in May over the years:

In [None]:
may_flightsDf = flightsDf.query("month == 'May'")               # Notice the string within the string
sbn.lineplot(data=may_flightsDf, x="year", y="passengers");
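
The string passed to `query` is one of two common ways to filter rows. The sketch below, on a hypothetical miniature table shaped like 'flights', shows that a boolean mask selects the same rows.

```python
import pandas as pd

# Hypothetical long-form table shaped like 'flights'
toyDf = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["Apr", "May", "Apr", "May"],
    "passengers": [129, 121, 135, 125],
})

by_query = toyDf.query("month == 'May'")     # filter written as a string
by_mask = toyDf[toyDf["month"] == "May"]     # same filter as a boolean mask

print(by_query.equals(by_mask))
```

Which form to use is mostly a matter of taste: `query` reads compactly, while the mask avoids quoting a string inside a string.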

### Pair-wise plots

- A `pairplot` is a grid with a scatterplot for every pair of variables in the off-diagonal cells
- The diagonal cells show a histogram of each variable's distribution

In [None]:
subpenguinsDf = penguinsDf[['body_mass_g', 'flipper_length_mm']]
sbn.pairplot(subpenguinsDf);
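
A pair plot shows relationships visually; the strength of a linear relationship can be put into a number with Pandas' `corr()` and displayed as a matrix plot with `heatmap`. The sketch below uses made-up values (`toyDf` is hypothetical).

```python
import pandas as pd
import seaborn as sbn

# Made-up measurements for two numeric variables
toyDf = pd.DataFrame({
    "flipper_length_mm": [180, 190, 200, 210, 220],
    "body_mass_g": [3000, 4000, 4500, 4000, 4500],
})

# Pearson correlation between every pair of columns
corr = toyDf.corr()
print(corr.round(3))

# Matrix plot: one colored, annotated cell per pair of variables
sbn.heatmap(corr, annot=True, vmin=-1, vmax=1);
```

Fixing `vmin` and `vmax` to -1 and 1 keeps the color scale comparable between different correlation matrices.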

### Key Points
- Seaborn is a powerful visualization tool that extends the capabilities of Matplotlib
- It offers many capabilities including representing categorical data and performing linear regressions
    - categorical data displayed with `catplot`, `stripplot`, and `swarmplot`
- Seaborn pair plots can be used to display correlations between the variables of a data set


### Further Reading
- Seaborn reference manual is [here](https://seaborn.pydata.org/)

### What's Next?
- Complete the exercises in the associated exercise notebook [X-21-Seaborn.ipynb](X-21-Seaborn.ipynb)
- Next notebook is [N-22-VirtualEnv.ipynb](N-22-VirtualEnv.ipynb)