ARTI308 - Machine Learning
# Seaborn Overview

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.




## Distribution Plots

Let's discuss some plots that allow us to visualize the distribution of a data set. These plots are:

* distplot
* jointplot
* pairplot
* rugplot
* kdeplot

## Imports

In [None]:
# Import seaborn for statistical data visualization
import seaborn as sns
# Enable inline display of matplotlib plots in Jupyter notebooks
%matplotlib inline

## Data
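Seaborn comes with built-in data sets! The load_dataset function downloads them from seaborn's online data repository on first use, so an internet connection is needed.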

In [None]:
# Load the 'tips' dataset from seaborn's built-in datasets
tips = sns.load_dataset('tips')

In [None]:
# Display the first 5 rows of the tips dataset
tips.head()

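|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |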


## distplot

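The distplot shows the distribution of a univariate set of observations. Note that distplot has been deprecated since seaborn 0.11; histplot (axes-level) and displot (figure-level) are its replacements.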

In [None]:
# Create a distribution plot showing histogram and KDE for total_bill column
sns.distplot(tips['total_bill'])
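# distplot is deprecated, so a warning is expected here and is safe to ignore;
# a close modern replacement is sns.histplot(tips['total_bill'], kde=True)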

To remove the KDE layer and show only the histogram, pass kde=False:

In [None]:
# Create a distribution plot with histogram only (no KDE) with 30 bins
sns.distplot(tips['total_bill'],kde=False,bins=30)
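The bins argument controls how the observed range is divided before counting. As a rough sketch of that counting step (using NumPy rather than seaborn, and only the five total_bill values shown in the tips.head() output, so the bin count of 3 is an arbitrary choice for this tiny sample):

```python
import numpy as np

# First five total_bill values from the tips.head() output
total_bill = np.array([16.99, 10.34, 21.01, 23.68, 24.59])

# Split the observed range into equal-width bins and count the values in each
counts, edges = np.histogram(total_bill, bins=3)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.2f} to {right:.2f}: {count}")
```

More bins give a finer but noisier picture; fewer bins give a smoother but coarser one.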

## jointplot

jointplot() combines a bivariate plot of two variables with a univariate distribution plot of each variable on the margins. The **kind** parameter selects how the joint relationship is drawn:
* “scatter” 
* “reg” 
* “resid” 
* “kde” 
* “hex”

In [None]:
# Create a joint plot showing bivariate distribution with scatter plot
sns.jointplot(x='total_bill',y='tip',data=tips,kind='scatter')

In [None]:
# Create a joint plot with hexagonal bins to show density
sns.jointplot(x='total_bill',y='tip',data=tips,kind='hex')

In [None]:
# Create a joint plot with linear regression line fit to the data
sns.jointplot(x='total_bill',y='tip',data=tips,kind='reg')

## pairplot

pairplot will plot pairwise relationships across an entire dataframe (for the numerical columns) and supports a color hue argument (for categorical columns). 

In [None]:
# Create a pair plot showing relationships between all numeric columns
sns.pairplot(tips)

In [None]:
# Create a pair plot with hue based on sex and coolwarm color palette
sns.pairplot(tips,hue='sex',palette='coolwarm')
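To see why the pair plot of tips is a 3 by 3 grid, it can help to enumerate the panels by hand. This is only a sketch of the layout logic, not seaborn's implementation; the column list is the set of numeric columns in tips:

```python
from itertools import product

# Numeric columns of the tips dataset
numeric_columns = ["total_bill", "tip", "size"]

# One panel per ordered pair of columns, including each column paired with itself
panels = list(product(numeric_columns, repeat=2))
diagonal = [(row, col) for row, col in panels if row == col]

print(f"{len(panels)} panels, {len(diagonal)} on the diagonal")
```

The diagonal panels show each variable's own distribution, while the off-diagonal panels show scatter plots of one variable against another.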

## rugplot

rugplots are a very simple concept: they draw a dash mark for every point in a univariate distribution. They are the building block of a KDE plot:

In [None]:
# Create a rug plot showing individual data points as dashes
sns.rugplot(tips['total_bill'])

## kdeplot

kdeplots are [Kernel Density Estimation plots](http://en.wikipedia.org/wiki/Kernel_density_estimation#Practical_estimation_of_the_bandwidth). These KDE plots replace every single observation with a Gaussian (Normal) distribution centered on that value, then sum those curves into one smooth estimate. For example:

In [None]:
# This cell illustrates how a KDE is built from one kernel per data point
# Import numpy for numerical operations
import numpy as np
# Import matplotlib for plotting
import matplotlib.pyplot as plt
# Import scipy.stats for statistical distributions
from scipy import stats
# Create a dataset of 25 random normally distributed values
dataset = np.random.randn(25)
# Create a rug plot to show individual data points
sns.rugplot(dataset);
# Calculate the minimum value of the dataset minus 2 (for x-axis range)
x_min = dataset.min() - 2
# Calculate the maximum value of the dataset plus 2 (for x-axis range)
x_max = dataset.max() + 2
# Create 100 equally spaced points between x_min and x_max
x_axis = np.linspace(x_min,x_max,100)
# Set up the URL reference for kernel density estimation bandwidth info
url = 'http://en.wikipedia.org/wiki/Kernel_density_estimation#Practical_estimation_of_the_bandwidth'
# Calculate the optimal bandwidth using Silverman's rule of thumb
bandwidth = ((4*dataset.std()**5)/(3*len(dataset)))**.2
# Create an empty list to store kernel functions
kernel_list = []
# Loop through each data point in the dataset
for data_point in dataset:
    # Create a normal distribution kernel centered at each data point
    kernel = stats.norm(data_point,bandwidth).pdf(x_axis)
    # Append the kernel to the kernel list
    kernel_list.append(kernel)
    # Normalize the kernel for visualization
    kernel = kernel / kernel.max()
    # Scale the kernel for plotting visibility
    kernel = kernel * .4
    # Plot each kernel with grey color and 50% transparency
    plt.plot(x_axis,kernel,color = 'grey',alpha=0.5)
# Set the y-axis limit from 0 to 1
plt.ylim(0,1)

In [None]:
# Sum all the basis functions (kernels) to get the final KDE plot
sum_of_kde = np.sum(kernel_list,axis=0)
# Plot the sum of the basis functions as the KDE curve
fig = plt.plot(x_axis,sum_of_kde,color='indianred')
# Add the original rug plot with indianred color
sns.rugplot(dataset,color='indianred')
# Remove y-axis tick marks
plt.yticks([])
# Set the title for the plot
plt.suptitle("Sum of the Basis Functions")
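The summed curve above is not yet a probability density, because its area grows with the number of points. A minimal sketch of the normalized version, assuming Gaussian kernels and the same bandwidth formula as the cell above (the function name gaussian_kde_sketch is made up for illustration):

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average of Gaussian bumps, one centered on each sample."""
    samples = np.asarray(samples, dtype=float)
    # Shape (n_samples, n_grid): standardized distance from each sample to each grid point
    z = (grid[None, :] - samples[:, None]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    # Taking the mean (sum divided by n) makes the result integrate to 1
    return kernels.mean(axis=0)

rng = np.random.default_rng(0)
samples = rng.standard_normal(25)
# Same rule-of-thumb bandwidth as in the notebook cell
bandwidth = ((4 * samples.std() ** 5) / (3 * len(samples))) ** 0.2
grid = np.linspace(samples.min() - 5, samples.max() + 5, 2001)

density = gaussian_kde_sketch(samples, grid, bandwidth)
# Riemann-sum approximation of the area under the curve
area = float(np.sum(density) * (grid[1] - grid[0]))
print(f"area under the curve: {area:.4f}")
```

Dividing by the number of samples is the only difference from the plain sum, and it is what lets the curve be read as a density.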

So with our tips dataset:

In [None]:
# Create a KDE plot for total_bill column
sns.kdeplot(tips['total_bill'])
# Overlay a rug plot to show individual data points
sns.rugplot(tips['total_bill'])

In [None]:
# Create a KDE plot for tip column
sns.kdeplot(tips['tip'])
# Overlay a rug plot to show individual data points
sns.rugplot(tips['tip'])

# Categorical Data Plots

Now let's discuss using seaborn to plot categorical data! There are a few main plot types for this:

* catplot (named factorplot before seaborn 0.9)
* boxplot
* violinplot
* stripplot
* swarmplot
* barplot
* countplot

Let's go through examples of each!

In [None]:
# Import seaborn for statistical data visualization
import seaborn as sns
# Enable inline display of matplotlib plots in Jupyter notebooks
%matplotlib inline

In [None]:
# Load the 'tips' dataset from seaborn's built-in datasets
tips = sns.load_dataset('tips')
# Display the first 5 rows of the tips dataset
tips.head()

## barplot and countplot

These two very similar plots summarize data for each level of a categorical feature. **barplot** is a general plot that aggregates a numeric variable within each category using some function, by default the mean:

In [None]:
# Create a bar plot showing average total_bill by sex
sns.barplot(x='sex',y='total_bill',data=tips)

In [None]:
# Import numpy for numerical operations
import numpy as np

You can set the estimator argument to any function that converts a vector to a scalar:

In [None]:
# Create a bar plot showing standard deviation of total_bill by sex
sns.barplot(x='sex',y='total_bill',data=tips,estimator=np.std)
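The bar heights are just the estimator applied to each category's values. As a sketch of that reduction outside seaborn, using only the five rows from tips.head() (the helper bar_heights is hypothetical, not a seaborn function):

```python
import numpy as np

# total_bill values from the tips.head() rows, grouped by the sex column
groups = {
    "Female": np.array([16.99, 24.59]),
    "Male": np.array([10.34, 21.01, 23.68]),
}

def bar_heights(groups, estimator=np.mean):
    """Reduce each group's values to one number, as an estimator would."""
    return {label: float(estimator(values)) for label, values in groups.items()}

means = bar_heights(groups)            # default: mean per category
spreads = bar_heights(groups, np.std)  # np.std uses ddof=0 (population std)

print(means)
print(spreads)
```

Note that np.std computes the population standard deviation by default, which differs slightly from the sample standard deviation that pandas reports.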

### countplot

This is essentially the same as barplot, except that the estimator explicitly counts the number of occurrences, which is why we only pass the x value:

In [None]:
# Create a count plot showing the count of observations by sex
sns.countplot(x='sex',data=tips)
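The counting itself needs nothing more than a tally per category. A small sketch with the standard library, again using only the sex column of the five tips.head() rows:

```python
from collections import Counter

# sex column from the tips.head() rows
sex = ["Female", "Male", "Male", "Male", "Female"]

# Tally how many times each category occurs; these would be the bar heights
counts = Counter(sex)
print(counts)
```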

## boxplot and violinplot

boxplots and violinplots are used to show the distribution of a quantitative variable across the levels of a categorical variable. A box plot (or box-and-whisker plot) shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable. The box shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution, except for points that are determined to be “outliers” using a method that is a function of the inter-quartile range.

In [None]:
# Create a box plot showing distribution of total_bill by day with rainbow palette
sns.boxplot(x="day", y="total_bill", data=tips,palette='rainbow')

In [None]:
# Create a horizontal box plot for all numeric columns in the tips dataset with rainbow palette
sns.boxplot(data=tips,palette='rainbow',orient='h')

In [None]:
# Create a box plot with hue based on smoker status and coolwarm palette
sns.boxplot(x="day", y="total_bill", hue="smoker",data=tips, palette="coolwarm")
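The quartiles, whiskers, and outliers described above can be computed by hand. The sketch below assumes the common 1.5 × IQR convention (seaborn's whis parameter defaults to 1.5); the bill amounts are hypothetical and include one deliberately extreme value:

```python
import numpy as np

def box_stats(values, whis=1.5):
    """Quartiles, whisker ends, and outliers under the whis * IQR rule."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return {
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        # Whiskers stop at the most extreme points still inside the fences
        "whisker_low": float(inside.min()),
        "whisker_high": float(inside.max()),
        "outliers": outliers.tolist(),
    }

# Hypothetical bill amounts (not taken from the tips dataset)
bills = [10.0, 12.0, 13.0, 15.0, 16.0, 18.0, 20.0, 50.0]
stats = box_stats(bills)
print(stats)
```

Here the value 50.0 lies beyond the upper fence, so it would be drawn as an individual point rather than being covered by the whisker.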

### violinplot
A violin plot plays a similar role as a box and whisker plot. It shows the distribution of quantitative data across several levels of one (or more) categorical variables such that those distributions can be compared. Unlike a box plot, in which all of the plot components correspond to actual datapoints, the violin plot features a kernel density estimation of the underlying distribution.

In [None]:
# Create a violin plot showing distribution of total_bill by day with rainbow palette
sns.violinplot(x="day", y="total_bill", data=tips,palette='rainbow')

In [None]:
# Create a violin plot with hue based on sex and Set1 palette
sns.violinplot(x="day", y="total_bill", data=tips,hue='sex',palette='Set1')

In [None]:
# Create a split violin plot with hue based on sex and Set1 palette
sns.violinplot(x="day", y="total_bill", data=tips,hue='sex',split=True,palette='Set1')

## stripplot and swarmplot
The stripplot will draw a scatterplot where one variable is categorical. A strip plot can be drawn on its own, but it is also a good complement to a box or violin plot in cases where you want to show all observations along with some representation of the underlying distribution.

The swarmplot is similar to stripplot(), but the points are adjusted (only along the categorical axis) so that they don’t overlap. This gives a better representation of the distribution of values, although it does not scale as well to large numbers of observations (both in terms of the ability to show all the points and in terms of the computation needed to arrange them).

In [None]:
# Create a strip plot showing scatter of total_bill by day with rainbow palette
sns.stripplot(x="day", y="total_bill", data=tips, palette='rainbow')

In [None]:
# Create a strip plot with jitter to avoid overlapping points
sns.stripplot(x="day", y="total_bill", data=tips,jitter=True, palette='rainbow')

In [None]:
# Create a strip plot with jitter and hue based on sex with Set1 palette
sns.stripplot(x="day", y="total_bill", data=tips,jitter=True,hue='sex',palette='Set1')

In [None]:
# Create a strip plot with jitter, hue, palette, and dodge to separate by sex
sns.stripplot(x="day", y="total_bill", data=tips,jitter=True,hue='sex',palette='Set1',dodge=True)
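Jitter simply nudges each point a small random distance along the categorical axis so that points with similar values do not sit on top of each other. A rough sketch of the idea, where the positions and the jitter width of 0.1 are assumptions for illustration rather than values read from seaborn:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical category positions: 0 = first day, 1 = second day
positions = np.array([0, 0, 0, 1, 1, 1], dtype=float)

# Assumed half-width of the uniform noise added along the categorical axis
jitter_width = 0.1
jittered = positions + rng.uniform(-jitter_width, jitter_width, size=positions.size)

print(np.round(jittered, 3))
```

The quantitative axis is left untouched, so the values being compared are not distorted.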

In [None]:
# Create a swarm plot to avoid overlapping points naturally
sns.swarmplot(x="day", y="total_bill", data=tips)

### Combining Categorical Plots

In [None]:
# Create a violin plot with rainbow palette
sns.violinplot(x="tip", y="day", data=tips,palette='rainbow')
# Overlay a swarm plot with black color and size 3 to show individual points
sns.swarmplot(x="tip", y="day", data=tips,color='black',size=3)

## catplot

catplot (called factorplot before seaborn 0.9) is the most general form of a categorical plot. It can take in a kind parameter to adjust the plot type:

In [None]:
# Create a categorical plot with bar plot kind showing total_bill by sex
sns.catplot(x='sex',y='total_bill',data=tips,kind='bar')

# Matrix Plots

Matrix plots allow you to plot data as color-encoded matrices and can also be used to indicate clusters within the data (later in the machine learning section we will learn how to formally cluster data).

Let's begin by exploring seaborn's heatmap and clustermap:

In [None]:
# Import seaborn for statistical data visualization
import seaborn as sns
# Enable inline display of matplotlib plots in Jupyter notebooks
%matplotlib inline

In [None]:
# Load the 'flights' dataset from seaborn's built-in datasets
flights = sns.load_dataset('flights')

In [None]:
# Load the 'tips' dataset from seaborn's built-in datasets
tips = sns.load_dataset('tips')

In [None]:
# Display the first 5 rows of the tips dataset
tips.head()

In [None]:
# Display the first 5 rows of the flights dataset
flights.head()

## Heatmap

In order for a heatmap to work properly, your data should already be in matrix form; the sns.heatmap function basically just colors it in for you. For example:

In [None]:
# Display the first 5 rows of the tips dataset
tips.head()

In [None]:
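# Calculate correlation matrix for numeric columns in tips dataset
# (numeric_only=True is needed on pandas 2.0+, where text columns otherwise raise an error)
tips.corr(numeric_only=True)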

In [None]:
# Create a heatmap of the correlation matrix of the numeric columns
sns.heatmap(tips.corr(numeric_only=True))

In [None]:
# Create a heatmap with coolwarm palette and annotated correlation values
sns.heatmap(tips.corr(numeric_only=True),cmap='coolwarm',annot=True)
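A correlation matrix is already in the matrix form a heatmap expects: square, symmetric, with ones on the diagonal. As a sketch, the same kind of matrix can be built with NumPy from the five tips.head() rows (far too few rows to say anything about the real dataset, so treat the numbers as illustrative only):

```python
import numpy as np

# Numeric columns from the tips.head() rows
total_bill = np.array([16.99, 10.34, 21.01, 23.68, 24.59])
tip = np.array([1.01, 1.66, 3.50, 3.31, 3.61])
size = np.array([2, 3, 3, 2, 4], dtype=float)

# Each row passed to corrcoef is treated as one variable
corr = np.corrcoef(np.vstack([total_bill, tip, size]))
print(np.round(corr, 2))
```

Each cell holds the Pearson correlation between the row variable and the column variable, which is what the annotated heatmap displays.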

Or for the flights data:

In [None]:
# Create a pivot table with passengers as values, month as rows, year as columns
flights.pivot_table(values='passengers',index='month',columns='year')

In [None]:
# Create a pivot table and store in variable
pvflights = flights.pivot_table(values='passengers',index='month',columns='year')
# Create a heatmap of the pivot table
sns.heatmap(pvflights)
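The pivot step is what turns long-form rows into the matrix a heatmap needs. A small sketch on a hypothetical frame with the same column names as flights (the passenger numbers and years are made up for illustration):

```python
import pandas as pd

# Hypothetical long-form data: one row per (year, month) observation
long_form = pd.DataFrame({
    "year": [2001, 2001, 2002, 2002],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [100, 110, 120, 135],
})

# Months become the rows, years become the columns, passengers fill the cells
wide = long_form.pivot_table(values="passengers", index="month", columns="year")
print(wide)
```

Each cell of the result corresponds to one colored square in the heatmap.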

In [None]:
# Create a heatmap with magma palette, white lines, and line widths of 1
sns.heatmap(pvflights,cmap='magma',linecolor='white',linewidths=1)

## clustermap

The clustermap uses hierarchical clustering to produce a clustered version of the heatmap. For example:

In [None]:
# Create a cluster map using hierarchical clustering on the pivot table
sns.clustermap(pvflights)

Notice how the years and months are no longer in order; instead they are grouped by similarity in value (passenger count). That means we can begin to infer things from this plot, such as August and July being similar (which makes sense, since both are summer travel months).

In [None]:
# Create a cluster map with each column rescaled to the 0-1 range (standard_scale=1 means columns)
sns.clustermap(pvflights,standard_scale=1)
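With standard_scale=1, each column has its minimum subtracted and is then divided by its maximum, so every year is compared on the same 0 to 1 scale. A sketch of that rescaling in NumPy, with hypothetical counts and a made-up helper name:

```python
import numpy as np

def min_max_scale_columns(matrix):
    """Subtract each column's minimum, then divide by the column's new maximum."""
    matrix = np.asarray(matrix, dtype=float)
    shifted = matrix - matrix.min(axis=0)
    return shifted / shifted.max(axis=0)

# Hypothetical passenger counts: rows are months, columns are years
counts = np.array([[100, 200],
                   [110, 260],
                   [150, 300]])

scaled = min_max_scale_columns(counts)
print(scaled)
```

Without this step, later years with larger totals would dominate the clustering simply because their numbers are bigger.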

# Regression Plots

Seaborn has many built-in capabilities for regression plots. However, we won't really discuss regression until the machine learning section of the course, so we will only cover the **lmplot()** function for now.

**lmplot** allows you to display linear models, and it also conveniently lets you split those plots into separate panels and color them (via hue) based on features in the data.

Let's explore how this works:

In [None]:
# Import seaborn for statistical data visualization
import seaborn as sns
# Enable inline display of matplotlib plots in Jupyter notebooks
%matplotlib inline

In [None]:
# Load the 'tips' dataset from seaborn's built-in datasets
tips = sns.load_dataset('tips')

In [None]:
# Display the first 5 rows of the tips dataset
tips.head()

## lmplot()

In [None]:
# Create a linear model plot showing relationship between total_bill and tip
sns.lmplot(x='total_bill',y='tip',data=tips)
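The line drawn by lmplot is an ordinary least-squares fit of tip on total_bill. As a sketch of what such a fit computes, here is the same kind of line fitted with NumPy to the five tips.head() rows only (so the coefficients will not match the full-dataset plot):

```python
import numpy as np

# total_bill and tip from the tips.head() rows
total_bill = np.array([16.99, 10.34, 21.01, 23.68, 24.59])
tip = np.array([1.01, 1.66, 3.50, 3.31, 3.61])

# Degree-1 polynomial fit returns the slope first, then the intercept
slope, intercept = np.polyfit(total_bill, tip, deg=1)

predicted = slope * total_bill + intercept
residuals = tip - predicted

print(f"tip ~ {slope:.3f} * total_bill + {intercept:.3f}")
```

A least-squares line always passes through the point of means, and its residuals sum to zero, which is a quick sanity check on any fit.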

In [None]:
# Create a linear model plot with hue based on sex
sns.lmplot(x='total_bill',y='tip',data=tips,hue='sex')

In [None]:
# Create a linear model plot with hue based on sex and coolwarm palette
sns.lmplot(x='total_bill',y='tip',data=tips,hue='sex',palette='coolwarm')

## Using a Grid

We can add more variable separation through columns and rows with the use of a grid. Just indicate this with the col or row arguments:

In [None]:
# Create separate linear model plots for each sex in columns
sns.lmplot(x='total_bill',y='tip',data=tips,col='sex')

In [None]:
# Create linear model plots in a grid with sex as rows and time as columns
sns.lmplot(x="total_bill", y="tip", row="sex", col="time",data=tips)

In [None]:
# Create linear model plots with day as columns, hue by sex, and coolwarm palette
sns.lmplot(x='total_bill',y='tip',data=tips,col='day',hue='sex',palette='coolwarm')

## Aspect and Size

Seaborn figures can have their size and aspect ratio adjusted with the **height** and **aspect** parameters:

In [None]:
# Create linear model plots with adjusted aspect ratio (0.6) and height (8)
sns.lmplot(x='total_bill',y='tip',data=tips,col='day',hue='sex',palette='coolwarm',
          aspect=0.6,height=8)
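In seaborn's grid-based plots, height is the height of each facet in inches and aspect times height gives each facet's width. A small sketch of the resulting arithmetic, where the helper name is hypothetical and the result ignores any extra space seaborn adds for a legend:

```python
def facet_grid_size(n_cols, n_rows=1, height=5.0, aspect=1.0):
    """Approximate (width, height) in inches of a grid of equally sized facets."""
    facet_width = height * aspect
    return n_cols * facet_width, n_rows * height

# Four day columns (Thur, Fri, Sat, Sun), each 8 inches tall and 0.6 * 8 inches wide
grid_width, grid_height = facet_grid_size(n_cols=4, height=8, aspect=0.6)
print(f"about {grid_width:.1f} x {grid_height:.1f} inches")
```

Lowering aspect makes each panel narrower, which keeps a four-column grid from becoming too wide.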

### Reference:

* https://seaborn.pydata.org/ - Seaborn: statistical data visualization
* https://seaborn.pydata.org/tutorial/color_palettes.html - Color palettes