# Python data visualization with Seaborn and Matplotlib  

Most of this tutorial is cherry-picked and lightly modified from Seaborn's own tutorial page, which can be found here:  

https://seaborn.pydata.org/tutorial.html

Their tutorial is very good. For the sake of time we'll only look at a few important aspects of seaborn with basic examples, but the full tutorial is a great resource for more depth and extra examples.  

## Python visualization libraries
<img src='https://matplotlib.org/_static/images/logo2.svg' width=200>  
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. Matplotlib makes easy things easy and hard things possible. This is the predominant plotting library in Python and the basis for Seaborn.


<img src='https://seaborn.pydata.org/_static/logo-wide-lightbg.svg' width=200>  
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.  


<img src='https://static.bokeh.org/branding/logos/bokeh-logo.svg' width=200>  
Bokeh is a Python library for creating interactive visualizations for modern web browsers. It helps you build beautiful graphics, ranging from simple plots to complex dashboards with streaming datasets. With Bokeh, you can create JavaScript-powered visualizations without writing any JavaScript yourself.  


<img src='https://upload.wikimedia.org/wikipedia/commons/3/37/Plotly-logo-01-square.png' width=200>  
Plotly's Python graphing library makes interactive, publication-quality graphs. Examples of how to make line plots, scatter plots, area charts, bar charts, error bars, box plots, histograms, heatmaps, subplots, multiple-axes, polar charts, and bubble charts.  

<img src='https://altair-viz.github.io/_static/altair-logo-light.png' width=200>  
Altair is a declarative statistical visualization library for Python, based on Vega and Vega-Lite, and the source is available on GitHub. With Altair, you can spend more time understanding your data and its meaning. Altair’s API is simple, friendly and consistent and built on top of the powerful Vega-Lite visualization grammar. This elegant simplicity produces beautiful and effective visualizations with a minimal amount of code.


<img src='https://plotnine.readthedocs.io/en/stable/_images/logo-180.png' width=200>  
plotnine is an implementation of a grammar of graphics in Python, based on ggplot2. The grammar allows users to compose plots by explicitly mapping data to the visual objects that make up the plot. Plotting with a grammar is powerful: it makes custom (and otherwise complex) plots easy to think about and then create, while the simple plots remain simple.

## Why Seaborn?  
**Seaborn makes some hard things easy** - Mainly semantics, faceting, nice defaults, and easy customization.  
**Seaborn is mature** - Been around many years. Still actively developed.   
**Seaborn is popular** - Many people use it (and matplotlib) so you can easily find help on stackexchange and other resources.  
**Seaborn is flexible** - Since it's built on matplotlib you can do almost anything.    

## Seaborn and Matplotlib  
Seaborn is built on top of Matplotlib. A figure created by seaborn is a matplotlib figure. All the elements of the figure are matplotlib elements.  
If you have trouble tweaking some component of the figure with seaborn then try looking for help with "matplotlib" instead of "seaborn".  
<img src='https://matplotlib.org/3.5.0/_images/sphx_glr_anatomy_001.png' width=500>
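Because seaborn returns ordinary matplotlib objects, any matplotlib method can be applied to a seaborn plot. The sketch below illustrates this with a tiny made-up DataFrame (the column names and values are placeholders, not one of the tutorial datasets).

```python
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 3, 5]})

ax = sns.scatterplot(data=toy, x="x", y="y")

# The returned object is a plain matplotlib Axes living in a matplotlib Figure
is_mpl_axes = isinstance(ax, matplotlib.axes.Axes)
is_mpl_figure = isinstance(ax.figure, matplotlib.figure.Figure)
print(is_mpl_axes, is_mpl_figure)

# So matplotlib methods work directly on it
ax.set_title("Tweaked with matplotlib")
ax.axhline(3, color="gray", linestyle="--")
```

This is why searching for the matplotlib way of changing a title, tick, or legend usually works on a seaborn plot too.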

## Getting Started

**Installation**  
In your existing python environment from the command line -   
pip install seaborn

Or with conda -  
conda install seaborn

If you're running this notebook in Google Colab then you don't need to install anything.  

**Importing**

In [None]:
import seaborn as sns

Since seaborn is based on matplotlib, it's sometimes useful to import the matplotlib scripting interface, "pyplot".

In [None]:
import matplotlib.pyplot as plt

Occasionally it can be useful to import the full matplotlib package to access a few more tools and controls, but this is rarely needed.

In [None]:
import matplotlib as mpl

It's also useful to import the data and math libraries, pandas and NumPy, explicitly.

In [None]:
import pandas as pd
import numpy as np

## Data  
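Seaborn provides a few example datasets for testing. The load_dataset function downloads them from seaborn's online data repository (so it needs an internet connection) and returns pandas DataFrames. We'll be using these for the tutorial.  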

<img src='https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Pandas_logo.svg/768px-Pandas_logo.svg.png?20200209204934' width=100> is the primary data manipulation tool for Python. It's kind of like R data frames, dplyr, and some other tidyverse tools all rolled into one. It also includes plotting methods, but we won't use those.

In [None]:
penguins = sns.load_dataset("penguins")

In [None]:
type(penguins)

In [None]:
penguins.head()

In [None]:
# grabbing a column returns a pandas Series
penguins['body_mass_g']

In [None]:
# grabbing a row also returns a Series
penguins.loc[1]

In [None]:
# grabbing an element
penguins.loc[1, 'body_mass_g']

In [None]:
# summarizing data
penguins.describe()

## Plot types - quick overview
Seaborn can make a wide variety of plots with its standard plotting functions, which are oriented around five types of plots.   

**Relational** - statistical relationship between two, usually continuous, variables as line and scatter plots  
**Distributional** -  univariate and bivariate distributions as histograms, kernel density estimates, and empirical cumulative distribution functions  
**Categorical** - relationships between a continuous and a categorical variable as bar plots, boxplots, violinplots, stripplots, and swarmplots  
**Regression** - plot linear regression estimates with uncertainty   
**Matrix** - plot rectangular data as a color-encoded matrix with and without clustering

In [None]:
# Relational - scatter plot
sns.relplot(data=penguins, x='bill_length_mm', y='bill_depth_mm', hue='species')

In [None]:
# Distributional - kernel density estimate
sns.displot(data=penguins, x='bill_length_mm', hue='species', kind='kde')

In [None]:
# Categorical - violinplots
sns.catplot(data=penguins, x='bill_length_mm', y='species', kind='violin')

In [None]:
# Regression
sns.lmplot(data=penguins, x='body_mass_g', y='flipper_length_mm', hue='sex')

In [None]:
# Matrix - clustermap
sub = penguins[['bill_length_mm', 'bill_depth_mm',
                'flipper_length_mm', 'body_mass_g']].dropna()
sns.clustermap(sub, standard_scale=1)

The seaborn gallery has lots of examples of the types of plots you can make.

If the primary seaborn functions don't have the type of plot you want, you may still be able to implement it along with some of the nice seaborn features like semantics and aggregation by mapping it to a FacetGrid, which I'll show later.

In [None]:
%%html
<iframe src="https://seaborn.pydata.org/examples/index.html" width="1000" height="600"></iframe>

# Seaborn structure and features

## Data structures - Wide-form vs Long-form  
Many plotting functions are capable of handling both long-form data and wide-form data. If a function can't handle one, then it's usually not too difficult to massage the data into the form you need.

In the examples below we'll make the same relational plot with different data structures. "relplot" is a generalized plotting function capable of drawing both line plots and scatterplots.



Long-form data - every row is an observation, every column is an attribute. In the R world this is called tidy data.

In [None]:
flights = sns.load_dataset("flights")
flights.head()

In [None]:
# Long-form data with defined x, y, and hue vectors
sns.relplot(data=flights, x="year", y="passengers", hue="month", kind="line")

Wide-form data - The columns and rows define two dimensions of a single attribute.   
For example, a measurement taken at different months (columns) and years (rows).

In [None]:
# Make it wide-form
flights_wide = flights.pivot(index="year", columns="month", values="passengers")
flights_wide.head()

In [None]:
# Plot wide-form data
# Seaborn treats the argument to data as wide form when neither x nor y are assigned.
sns.relplot(data=flights_wide, kind="line")

Generally it's better to use long-form data and be more explicit about the variables you want used for the different axes, hue and style of plot. 

There are various ways to munge data into long-form. That is best left to a pandas tutorial but here is one example.

In [None]:
# Make the wide data long again
flights_long = flights_wide.stack().reset_index().rename(columns={0:'passengers'})
flights_long.head()
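Another common route from wide-form to long-form is the pandas melt method. Here is a minimal sketch on a small hand-made table (the values are invented for illustration and are not the flights dataset):

```python
import pandas as pd

# Hypothetical wide-form table: rows are years, columns are months
wide = pd.DataFrame(
    {"Jan": [112, 115], "Feb": [118, 126]},
    index=pd.Index([1949, 1950], name="year"),
)

# Move the index into a column, then melt the month columns into rows
long = wide.reset_index().melt(
    id_vars="year", var_name="month", value_name="passengers"
)
print(long)
```

Compared with stack, melt lets you name the new columns directly, so no rename step is needed afterwards.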

Seaborn can also accept a number of other data formats as vectors for the plotting functions. But notice only the Series retains the name and puts it on the axis label.

In [None]:
# Convert columns to other data formats
year = flights['year']                       # pandas Series
month = flights['month'].tolist()            # list
passengers = flights['passengers'].values    # numpy array

sns.relplot(x=year, y=passengers, hue=month, kind="line")

## Figure-level vs. axes-level functions

Seaborn has several modules that contain the base plotting functions - relational, distributional, categorical, regression, matrix, and multiplot grids.   

<img src='https://seaborn.pydata.org/_images/function_overview_8_0.png'>


There is some overlap between functions so you can achieve the same thing (or something very similar) with different functions. Some functions are specific to a kind of plot like "sns.histplot" and others are flexible to create any kind of distributional plot like "sns.displot".

One difference between the above *histplot* and *displot* is that *histplot* is an axes-level plot and can be drawn into an existing axes, whereas *displot* is a figure-level plot that creates its own figure and has some additional faceting options. It actually returns a FacetGrid object that we'll get into later.

In [None]:
# Histogram
subplot = sns.histplot(data=penguins, x="flipper_length_mm", hue="species", multiple="stack")

In [None]:
type(subplot)

In [None]:
# Same as above histplot with displot returning a FacetGrid
# Try changing to kind="kde" to make a kernel density estimate
# Or kind="ecdf" and remove multiple="stack" to make an empirical cumulative distribution function
g = sns.displot(data=penguins, x="flipper_length_mm", hue="species", multiple="stack", kind="hist")

In [None]:
print(type(g))
print(type(g.fig))

In [None]:
# Distribution plot with column facets
sns.displot(data=penguins, x="flipper_length_mm", hue="species", col="species")

Axes-level plots draw into an axes object rather than a figure, so different plots can be combined in a single figure.

In [None]:
# Draw two different seaborn plots into separate axes of the same figure
f, axs = plt.subplots(1, 2, figsize=(8, 4), gridspec_kw=dict(width_ratios=[4, 3]))
sns.scatterplot(data=penguins, x="flipper_length_mm", y="bill_length_mm", hue="species", ax=axs[0])
sns.histplot(data=penguins, x="species", hue="species", shrink=.8, alpha=.8, legend=False, ax=axs[1])
f.tight_layout()

## Semantics
In addition to x and y, vectors can be passed to the hue, style, and size keywords to illustrate other dimensions of the data. This kind of semantic styling is available in most seaborn plotting functions. Seaborn makes using semantics much easier than base matplotlib.

In [None]:
# Load the data
tips = sns.load_dataset("tips")
tips.head()

In [None]:
# Semantic options with seaborn
g = sns.relplot(data=tips, x='total_bill', y='tip', hue="smoker", style="time", size="size", kind='scatter')

In [None]:
# Semantics with matplotlib
fig, ax = plt.subplots(figsize=(5,5))
x, y = tips['total_bill'], tips['tip']

# Color vector for smoker
c = tips['smoker'].map({'No':'orange', 'Yes':'blue'})
# Size vector for size
s = tips['size']*8
# Marker vector for time
time_marker = {'Lunch':'o', 'Dinner':'x'}

# Plot
for time, marker in time_marker.items():
  mask = tips['time']==time
  sc = ax.scatter(x[mask], y[mask], c=c[mask], s=s[mask], marker=marker, label=time)
  
  # get size handles and labels for legend
  if time=='Lunch':
    shandles, slabels = sc.legend_elements(prop="sizes")

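# Make legend
# time legend (get markers automatically but change color)
legend1 = ax.legend(loc=(1.05, 0.12), title='time')
# Legend.legendHandles was renamed to legend_handles in matplotlib 3.7
handles = getattr(legend1, 'legend_handles', None) or legend1.legendHandles
for handle in handles:
  handle.set_color('black')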
ax.add_artist(legend1)

# size legend (modify labels because they're scaled by 8)
newslabels = range(1,7)
legend2 = ax.legend(shandles, newslabels, loc=(1.05, 0.3), title="size")
ax.add_artist(legend2)

# color legend (make manually)
chandles = [mpl.lines.Line2D([], [], marker='o', linestyle='None', color='orange'),
            mpl.lines.Line2D([], [], marker='o', linestyle='None', color='blue')]
legend3 = ax.legend(chandles, ['No', 'Yes'], title='smoker', loc=(1.05,0.7))

ax.set(xlabel='total_bill', ylabel='tip')

## Aggregation
For datasets with multiple observations, like repeated measures data, you can easily plot an aggregation (e.g. mean) and the variability (e.g. confidence interval). For some plotting functions this aggregation is performed by default if there are multiple measures for a given x and y.

In [None]:
fmri = sns.load_dataset("fmri")
fmri.head()

In [None]:
# Show the mean signal at each time point and its bootstrapped 95% confidence
# interval aggregated across subjects, events and regions 
sns.relplot(x="timepoint", y="signal", kind="line", data=fmri);

You can use different kinds of estimators and confidence interval functions, including custom functions. You can also change the confidence interval size or use the standard deviation.

In [None]:
# Use a median estimator and standard deviation
# Try changing ci to an integer and estimator to a different function (np.sum, np.prod, etc)
# Note: seaborn 0.12+ deprecates ci in favor of errorbar (e.g. errorbar='sd')
sns.relplot(x="timepoint", y="signal", estimator=np.median, ci='sd', kind="line", data=fmri);

You can also set estimator=None to plot every observation, with one line per unit. If you don't set units, seaborn connects all observations in a single line that zigzags between the multiple y values at each x, which is misleading.

In [None]:
# No estimator. One line for each subject.
# try removing the units to see what happens
df = fmri[(fmri['region']=='parietal') & (fmri['event']=='stim')]
sns.relplot(x="timepoint", y="signal", estimator=None, units='subject', kind="line", data=df);

## Multi-plot grids
Figure-level plots like relplot, displot, catplot, and lmplot are convenient functions for faceting with some standard plot types. They in fact return a FacetGrid object. Using FacetGrid itself offers the ability to perform faceting with some additional flexibility.

### FacetGrid

In [None]:
tips = sns.load_dataset("tips")
tips.head()

In [None]:
# Create a FacetGrid and map a plotting function to it
# Note how you can pass both positional and keyword arguments to the mapped function
g = sns.FacetGrid(tips, col="sex", row='time', hue="smoker", margin_titles=True)
g.map(sns.scatterplot, "total_bill", "tip", alpha=.7)
g.add_legend()

You can control the size of the subplots, the order of various elements, and many other parts of the plot.

In [None]:
# FacetGrid with specified size, order, and wrapping # of columns
attend = sns.load_dataset("attention").query("subject <= 12")
g = sns.FacetGrid(attend, col="subject", col_wrap=4, height=2, ylim=(0, 10))
g.map(sns.pointplot, "solutions", "score", order=[1, 2, 3], color=".3", ci=None)

Certain components of the plot can be set using specialized FacetGrid methods, or they can be changed by accessing the underlying matplotlib figure and axes.

In [None]:
g = sns.FacetGrid(tips, col="smoker", margin_titles=True, height=4)
g.map(plt.scatter, "total_bill", "tip", color="#338844", edgecolor="white", s=50, lw=1)

# Modify the axes labels (specialized FacetGrid method)
g.set_axis_labels("Total bill (US Dollars)", "Tip")
g.set(xlim=(0, 60), ylim=(0, 12))

# Add another plot to each axes
for ax in g.axes_dict.values():
    ax.plot((0, 60), (0,10), c=".2", ls="--", zorder=0)

Probably the best thing about FacetGrid is that you can use it to perform semantic mapping and faceting with custom plotting functions.   
**But** the function must...  
1) plot to the current axes  
2) accept the data as positional arguments  
3) accept color and label arguments. You can just use **kwargs for this.  

In [None]:
def hexbin(x, y, color, **kwargs):
    cmap = sns.light_palette(color, as_cmap=True)
    plt.hexbin(x, y, gridsize=15, cmap=cmap, **kwargs)

with sns.axes_style("dark"):
    g = sns.FacetGrid(tips, hue="time", col="time", height=4)
g.map(hexbin, "total_bill", "tip", extent=[0, 50, 0, 10]);

### Pairgrid and pairplot
*PairGrid* shows the pairwise relationships between multiple variables. It's a very handy tool for exploratory analysis. The plot can be customized to show different types of plots in the upper and lower triangles and on the diagonal. *Pairplot* is a higher-level interface to some of the prime functionality of PairGrid. 

In [None]:
iris = sns.load_dataset("iris")
iris.head()

In [None]:
# Pairwise relationship between flower attributes by species
# Try swapping upper&lower for map_offdiag
# Note this is a bit slow to draw
g = sns.PairGrid(iris, hue="species")
g.map_diag(sns.histplot)
g.map_lower(sns.kdeplot)
g.map_upper(sns.scatterplot)
g.add_legend()

In [None]:
# Pairplot version (can't specify upper/lower kind yet)
g = sns.pairplot(iris, diag_kind="hist", hue="species", height=2.5)
# But can still map a function to add it
g.map_lower(sns.kdeplot, levels=4, color=".2")

# Aesthetics
Matplotlib offers a lot of customizability for plot appearance, but it can often be hard to tweak and typically uses ugly defaults. Seaborn has some nicer themes and offers a high-level interface for controlling most aesthetics.

In [None]:
# Using seaborn aesthetic features doesn't even require a seaborn plot
# Use a standard matplotlib plotting function
def sinplot(flip=1):
    x = np.linspace(0, 14, 100)
    for i in range(1, 7):
        plt.plot(x, np.sin(x + i * .5) * (7 - i) * flip)

## Style

In [None]:
# Default plot style
sinplot()

In [None]:
# Change to one of five preset seaborn themes: darkgrid, whitegrid, dark, white, and ticks
sns.set_style("darkgrid")
sinplot()

In [None]:
# Temporarily set the style for the current plot
f = plt.figure(figsize=(6, 3))
gs = f.add_gridspec(1, 2)

with sns.axes_style("dark"):
    ax = f.add_subplot(gs[0, 0])
    sinplot()

with sns.axes_style("white"):
    ax = f.add_subplot(gs[0, 1])
    sinplot()

If you want to customize the seaborn styles, you can pass a dictionary of parameters to the rc argument of axes_style() and set_style(). Note that you can only override the parameters that are part of the style definition through this method. (However, the higher-level set_theme() function takes a dictionary of any matplotlib parameters).

If you want to see what parameters are included, you can just call the function with no arguments, which will return the current settings:

In [None]:
sns.axes_style()

In [None]:
# Customize the style
sns.set_style("darkgrid", {"axes.facecolor": ".7"})
sinplot()
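The restriction on which parameters the rc argument can override is easy to check, because axes_style() returns a plain dictionary. In this sketch the two overrides are arbitrary choices picked for illustration:

```python
import seaborn as sns

# axes_style() returns a dict of matplotlib rc parameters for the style
style = sns.axes_style(
    "darkgrid",
    rc={
        "axes.facecolor": ".7",  # part of the style definition, so it is applied
        "lines.linewidth": 5,    # not a style parameter, so it is dropped here
    },
)

print(style["axes.facecolor"])
print("lines.linewidth" in style)
```

Parameters such as line width belong to the plotting context rather than the style, which is why they are filtered out here.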

## Context  
A separate set of parameters controls the scale of plot elements, which should let you use the same code to make plots that are suited for use in settings where larger or smaller plots are appropriate.

In [None]:
# reset style first
sns.set_theme()

In [None]:
# Try setting different contexts: paper, poster, talk, notebook
sns.set_context("poster")
sinplot()

In [None]:
# Independently rescale some elements
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 1.5})
sinplot()
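To see which parameters a context changes, you can inspect the dictionary returned by plotting_context(), which does not alter any global settings. A short sketch comparing the four presets:

```python
import seaborn as sns

# plotting_context(name) returns the rc parameters that context would set
for name in ["paper", "notebook", "talk", "poster"]:
    ctx = sns.plotting_context(name)
    print(name, ctx["font.size"], ctx["lines.linewidth"])

paper = sns.plotting_context("paper")
poster = sns.plotting_context("poster")
```

Font sizes and line widths grow from paper to poster, while colors and grid styling are left to the style settings.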

## Colors  
Seaborn uses some nicer color defaults than base matplotlib, and it offers some tools for choosing good color palettes.  

Broadly, palettes fall into one of three categories:  
qualitative palettes, good for representing categorical data  
sequential palettes, good for representing numeric data  
diverging palettes, good for representing numeric data with a categorical boundary

In [None]:
# return a qualitative color palette
pal = sns.color_palette("tab10")
pal
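As a rough illustration of the three categories, the sketch below requests one ColorBrewer palette of each kind by name. The particular palettes chosen (Set2, Blues, RdBu) are just examples.

```python
import seaborn as sns

# One example of each palette category
qualitative = sns.color_palette("Set2", 5)  # distinct hues for categories
sequential = sns.color_palette("Blues", 5)  # light-to-dark ramp for magnitudes
diverging = sns.color_palette("RdBu", 5)    # two hues meeting at a pale midpoint

# Each palette is a list of RGB tuples
print(len(sequential), sequential[0])

# A continuous colormap version, e.g. for heatmaps
cmap = sns.color_palette("RdBu", as_cmap=True)
```

In a notebook, evaluating any of these palette objects on the last line of a cell displays the colors as swatches.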

In [None]:
# Tool to help choose a color brewer palette
# Try the other sns.choose_ tools for creating a customized palette
sns.choose_colorbrewer_palette('qualitative')

Both matplotlib and seaborn have pages with advice and additional tools for choosing a color palette.  
https://matplotlib.org/stable/tutorials/colors/colormaps.html  
https://seaborn.pydata.org/tutorial/color_palettes.html  

There is also the excellent ColorBrewer tool https://colorbrewer2.org/

# Matplotlib tips  
Below are a couple of small things to be aware of when using matplotlib. There is a whole lot more to know, and you can find this by searching through [Matplotlib's docs](https://matplotlib.org/3.5.0/index.html), one of [the cheatsheets](https://matplotlib.org/cheatsheets/), or looking through [their tutorials](https://matplotlib.org/3.5.0/tutorials/index.html). 

## Pyplot vs object-oriented interface
Matplotlib offers both a functional, state-based interface (pyplot) and an object-oriented interface. You can choose to use one or a combination of them when plotting. Pyplot operates on the currently active figure and axes, whereas the object-oriented interface operates on the object you're calling a method on.

In [None]:
sns.relplot(data=penguins, x='bill_length_mm', y='bill_depth_mm', hue='species')

In [None]:
# Scatterplot with state-based pyplot
x, y = penguins["flipper_length_mm"], penguins["body_mass_g"]
# Create the current plot
plt.scatter(x, y)
# Then make some changes to it
plt.xlabel("Flipper length (mm)")
plt.ylabel("Body Mass (g)")
plt.grid(True)

In [None]:
# Scatterplot with OO-interface
# Set up a figure and axes
fig, ax = plt.subplots()
# Modify the axes object with methods on the object
ax.scatter(x, y)
ax.set(xlabel="Flipper length (mm)", ylabel="Body Mass (g)")
ax.grid()
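The idea of a "current" axes is what separates the two interfaces. The following sketch, which uses only matplotlib, creates two figures and shows which one each style of call ends up modifying:

```python
import matplotlib.pyplot as plt

fig1, ax1 = plt.subplots()
fig2, ax2 = plt.subplots()  # the most recently created figure becomes "current"

# pyplot functions act on the current axes, here ax2
plt.title("set through pyplot")

# Axes methods act on whichever object they are called on
ax1.set_title("set through the Axes object")

print(plt.gca() is ax2)
print(ax1.get_title(), "|", ax2.get_title())
```

When several figures or subplots are in play, the object-oriented calls are usually easier to follow because the target is explicit.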

## Annotations and transformations
You can add text and draw symbols on plots. One handy way of positioning annotations is to use a transformation so that the coordinates are normalized on a 0-1 scale within the extent of the axis or figure.

In [None]:
# Add correlation to a regplot in specific position
xcol, ycol = "flipper_length_mm", "body_mass_g"
fig, ax = plt.subplots()
sns.regplot(data=penguins, x=xcol, y=ycol, ax=ax)

corr =  penguins[[xcol, ycol]].corr().iloc[0,1]
ax.text(0.1, 0.9, "r = "+str(round(corr, 2)), transform=ax.transAxes)
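To illustrate why axes coordinates are convenient, this sketch (matplotlib only, with arbitrary example values) shows that a position given through ax.transAxes maps to the same place on the canvas even after the data limits change:

```python
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 100], [0, 1000])

# Axes coordinates: (0, 0) is the lower-left corner of the axes, (1, 1) the upper-right
label = ax.text(0.05, 0.9, "upper left", transform=ax.transAxes)

# Changing the data limits does not move a point given in axes coordinates
before = ax.transAxes.transform((0.05, 0.9))
ax.set_xlim(-500, 500)
after = ax.transAxes.transform((0.05, 0.9))
print(np.allclose(before, after))

# Figure coordinates work the same way, relative to the whole figure
fig.text(0.5, 0.01, "centered under the plot", ha="center",
         transform=fig.transFigure)
```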

## Tight layout  
Sometimes plot elements overlap. The tight_layout() method on a figure can often resolve these conflicts automatically by shifting and scaling subplot elements.

In [None]:
sns.set_context('paper')

In [None]:
# Make a plot with overlapping elements
fig, axes = plt.subplots(nrows=4, ncols=4, figsize=(6,6))

for i, ax in enumerate(axes.flat, start=1):
    ax.set_title('Test Axes {}'.format(i))
    ax.set_xlabel('X axis')
    ax.set_ylabel('Y axis')

In [None]:
# Adjust with tight_layout
fig, axes = plt.subplots(nrows=4, ncols=4, figsize=(6,6))

for i, ax in enumerate(axes.flat, start=1):
    ax.set_title('Test Axes {}'.format(i))
    ax.set_xlabel('X axis')
    ax.set_ylabel('Y axis')

fig.tight_layout()
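One way to see what tight_layout() actually changed is to compare the subplot spacing parameters before and after the call. A small sketch using the same kind of crowded grid:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=4, ncols=4, figsize=(6, 6))
for ax in axes.flat:
    ax.set_title("Title")
    ax.set_xlabel("X axis")
    ax.set_ylabel("Y axis")

# Spacing between subplots, expressed as a fraction of the average axes size
before = (fig.subplotpars.wspace, fig.subplotpars.hspace)
fig.tight_layout()
after = (fig.subplotpars.wspace, fig.subplotpars.hspace)
print(before, after)
```

The same values can be set by hand with fig.subplots_adjust() when the automatic result needs fine-tuning.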

## Other types of plots  
Seaborn covers the majority of plots that most people use regularly, but Matplotlib is capable of some other plot types such as images and 3D plots. It might be possible to map these to a seaborn grid, but it's likely best to implement these with base matplotlib.  

In [None]:
# Draw an image with matplotlib
with mpl.cbook.get_sample_data('logo2.png') as image_file:
    image = plt.imread(image_file)
plt.imshow(image)

In [None]:
# Simple 3D projection scatterplot
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
c = penguins["species"].map({'Adelie':'blue', 'Gentoo':'orange', 'Chinstrap':'green'})
ax.scatter(penguins["flipper_length_mm"], penguins["body_mass_g"],
           penguins["bill_length_mm"], c=c)

Explore the other kinds of matplotlib plots and plotting features:


In [None]:
%%html
<iframe src="https://matplotlib.org/stable/gallery/index.html" width="1000" height="600"></iframe>