ARTI308 - Machine Learning
# Seaborn Overview

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.




## Distribution Plots

Let's discuss some plots that allow us to visualize the distribution of a data set. These plots are:

* distplot
* jointplot
* pairplot
* rugplot
* kdeplot

## Imports

In [None]:
import seaborn as sns # we import seaborn and use sns to call it
# this makes plots show right under the cell (a trailing comment on the magic line itself raises a UsageError)
%matplotlib inline


## Data
Seaborn comes with built-in data sets!

In [None]:
tips = sns.load_dataset('tips') # load seaborn's tips dataset into a variable


In [None]:
tips.head() # show the first 5 rows so we can preview the data


## distplot

The distplot shows the distribution of a univariate set of observations.

In [None]:
sns.distplot(tips['total_bill']) # draw distribution of total_bill
# distplot is deprecated in seaborn 0.11+ (histplot and displot replace it), so a deprecation warning here is safe to ignore


To remove the kde layer and just have the histogram use:

In [None]:
sns.distplot(tips['total_bill'],kde=False,bins=30) # same distribution but only histogram with 30 bins
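
As a rough sketch of what the histogram layer computes, the snippet below bins a synthetic sample into 30 equal-width bins with NumPy. The `bills` array is made-up stand-in data, not the tips dataset, and the code only mirrors the counting step rather than seaborn's own drawing code.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical bill amounts (assumption: 244 right-skewed positive values)
bills = rng.gamma(shape=4.0, scale=5.0, size=244)

# Split the observed range into 30 equal-width bins and count the values in each
counts, edges = np.histogram(bills, bins=30)

print(counts.sum())  # every observation lands in exactly one bin
print(len(edges))    # 30 bins are delimited by 31 edges
```

More bins give a finer but noisier picture, while fewer bins give a smoother but coarser one.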


## jointplot

jointplot() combines a bivariate plot with two univariate distribution plots: the distribution of each variable is drawn along its own axis, and the **kind** parameter chooses how the relationship between the two variables is drawn:
* “scatter” 
* “reg” 
* “resid” 
* “kde” 
* “hex”

In [None]:
sns.jointplot(x='total_bill',y='tip',data=tips,kind='scatter') # scatter plot of bill vs tip with side distributions


In [None]:
sns.jointplot(x='total_bill',y='tip',data=tips,kind='hex') # hexbin version to show dense areas better


In [None]:
sns.jointplot(x='total_bill',y='tip',data=tips,kind='reg') # scatter + regression line
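
To illustrate how the central panel and the side distributions relate, here is a minimal NumPy sketch on synthetic data. The `total_bill` and `tip` arrays are invented for the example. It shows that summing a 2D histogram along one axis recovers the 1D histogram of the other variable, which is the idea behind the marginal plots.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data (assumption: tips are roughly 15% of the bill plus noise)
total_bill = rng.uniform(5, 50, size=200)
tip = 0.15 * total_bill + rng.normal(0, 1, size=200)

# Joint distribution: counts on a 10 x 10 grid of (total_bill, tip) cells
joint, x_edges, y_edges = np.histogram2d(total_bill, tip, bins=10)

# Marginal distributions: collapse the grid along one axis
x_marginal = joint.sum(axis=1)  # distribution of total_bill
y_marginal = joint.sum(axis=0)  # distribution of tip

# The marginal matches a plain 1D histogram on the same bin edges
x_hist, _ = np.histogram(total_bill, bins=x_edges)
print(np.array_equal(x_marginal, x_hist))
```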


## pairplot

pairplot will plot pairwise relationships across an entire dataframe (for the numerical columns) and supports a color hue argument (for categorical columns). 

In [None]:
sns.pairplot(tips) # pairwise plots for all numeric columns


In [None]:
sns.pairplot(tips,hue='sex',palette='coolwarm') # same pairplot but color points by sex


## rugplot

A rugplot is a very simple concept: it draws a dash mark for every point in a univariate distribution. These marks are the building blocks of a KDE plot:

In [None]:
sns.rugplot(tips['total_bill']) # draw small tick marks for each total_bill value


## kdeplot

kdeplots are [Kernel Density Estimation plots](http://en.wikipedia.org/wiki/Kernel_density_estimation#Practical_estimation_of_the_bandwidth). A KDE plot replaces every single observation with a Gaussian (Normal) distribution centered around that value and then sums those curves into one smooth estimate. For example:

In [None]:
# Don't worry about understanding this code!
# It's just for the diagram below
import numpy as np # we import numpy for random data and numeric ops
import matplotlib.pyplot as plt # we import pyplot for line plotting options
from scipy import stats # we import stats to build gaussian kernels

#Create dataset
dataset = np.random.randn(25) # create 25 random normal values

# Create another rugplot
sns.rugplot(dataset); # show where the random values sit on the x-axis

# Set up the x-axis for the plot
x_min = dataset.min() - 2 # start a bit left of min value
x_max = dataset.max() + 2 # end a bit right of max value

# 100 equally spaced points from x_min to x_max
x_axis = np.linspace(x_min,x_max,100) # create smooth x points for drawing curves

# Set up the bandwidth, for info on this:
url = 'http://en.wikipedia.org/wiki/Kernel_density_estimation#Practical_estimation_of_the_bandwidth' # extra reference link

bandwidth = ((4*dataset.std()**5)/(3*len(dataset)))**.2 # estimate kde bandwidth


# Create an empty kernel list
kernel_list = [] # we will store one kernel per data point

# Plot each basis function
for data_point in dataset: # loop through each value in dataset
    
    # Create a kernel for each point and append to list
    kernel = stats.norm(data_point,bandwidth).pdf(x_axis) # build normal curve centered at this data point
    kernel_list.append(kernel) # save this curve into list
    
    #Scale for plotting
    kernel = kernel / kernel.max() # normalize the peak to 1
    kernel = kernel * .4 # shrink height so all kernels are visible
    plt.plot(x_axis,kernel,color = 'grey',alpha=0.5) # draw each kernel in light grey

plt.ylim(0,1) # lock y-axis from 0 to 1


In [None]:
# To get the kde plot we can sum these basis functions.

# Plot the sum of the basis function
sum_of_kde = np.sum(kernel_list,axis=0) # add all kernels together point by point

# Plot figure
fig = plt.plot(x_axis,sum_of_kde,color='indianred') # draw final summed kde curve

# Add the initial rugplot
sns.rugplot(dataset,color='indianred') # add rug ticks under the kde (use color=; the c= shorthand is not accepted by newer seaborn versions)

# Get rid of y-tick marks
plt.yticks([]) # hide y tick labels for cleaner look

# Set title
plt.suptitle("Sum of the Basis Functions") # add title above the plot
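
The summed curve above is not yet a probability density, because its area grows with the number of points. As a hedged sketch, the helper below (`gaussian_kde_curve` is a name invented for this example) averages the kernels instead of summing them, which should make the area under the curve come out close to 1. It reuses the rule-of-thumb bandwidth formula from the earlier cell and is not seaborn's internal implementation.

```python
import numpy as np

def gaussian_kde_curve(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at each grid point."""
    # z[i, j] = distance from grid point i to sample j, in bandwidth units
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

rng = np.random.default_rng(2)
samples = rng.standard_normal(25)  # 25 random normal values, as in the diagram

# Same rule-of-thumb bandwidth formula as in the diagram code
bandwidth = ((4 * samples.std() ** 5) / (3 * len(samples))) ** 0.2

grid = np.linspace(samples.min() - 5, samples.max() + 5, 1000)
density = gaussian_kde_curve(samples, grid, bandwidth)

# Riemann-sum approximation of the area under the curve
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```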


So with our tips dataset:

In [None]:
sns.kdeplot(tips['total_bill']) # draw kde for total bill
sns.rugplot(tips['total_bill']) # add rug marks for total bill values


In [None]:
sns.kdeplot(tips['tip']) # draw kde for tip values
sns.rugplot(tips['tip']) # add rug marks for tip values


# Categorical Data Plots

Now let's discuss using seaborn to plot categorical data! There are a few main plot types for this:

* catplot (called factorplot in older seaborn versions)
* boxplot
* violinplot
* stripplot
* swarmplot
* barplot
* countplot

Let's go through examples of each!

In [None]:
import seaborn as sns # import seaborn again for this section
# show plots inline in notebook
%matplotlib inline


In [None]:
tips = sns.load_dataset('tips') # load tips dataset again
tips.head() # preview first rows


## barplot and countplot

These two closely related plots summarize your data for each level of a categorical feature. **barplot** is the general form: it aggregates the values within each category using a function of your choice, by default the mean:

In [None]:
sns.barplot(x='sex',y='total_bill',data=tips) # show average total bill by sex


In [None]:
import numpy as np # import numpy so we can pass np.std as estimator


You can set the estimator to any function of your own that converts a vector to a scalar:

In [None]:
sns.barplot(x='sex',y='total_bill',data=tips,estimator=np.std) # use standard deviation instead of mean


### countplot

This is essentially the same as barplot, except that the estimator counts the number of occurrences in each category, which is why we only pass the x value:

In [None]:
sns.countplot(x='sex',data=tips) # count how many entries for each sex
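
The numbers behind these bars can be reproduced with pandas. The sketch below uses a tiny made-up DataFrame (not the real tips data) to compute the per-category mean, the per-category standard deviation, and the per-category count. Note that `np.std` computes the population standard deviation (`ddof=0`), whereas the pandas `.std()` method defaults to the sample version (`ddof=1`).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature dataset with the same column names as tips
df = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "total_bill": [20.0, 10.0, 30.0, 14.0, 40.0, 18.0],
})

# Bar heights for the default estimator (mean)
mean_by_sex = df.groupby("sex")["total_bill"].mean()

# Bar heights for estimator=np.std
std_by_sex = df.groupby("sex")["total_bill"].apply(lambda v: np.std(v.to_numpy()))

# Bar heights for a count plot: number of rows per category
counts = df["sex"].value_counts()

print(mean_by_sex)
print(std_by_sex)
print(counts)
```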


## boxplot and violinplot

boxplots and violinplots are used to show the distribution of quantitative data within categories. A box plot (or box-and-whisker plot) shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable. The box shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution, except for points that are determined to be “outliers” using a method that is a function of the inter-quartile range.

In [None]:
sns.boxplot(x="day", y="total_bill", data=tips,palette='rainbow') # boxplot of bills grouped by day


In [None]:
# Can do entire dataframe with orient='h'
sns.boxplot(data=tips,palette='rainbow',orient='h') # horizontal boxplots for numeric columns


In [None]:
sns.boxplot(x="day", y="total_bill", hue="smoker",data=tips, palette="coolwarm") # split each day by smoker or not
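
To make the quartile and outlier description concrete, here is a small NumPy sketch. It assumes the common rule that flags points lying more than 1.5 times the inter-quartile range beyond the box, which matches the default `whis=1.5` setting. The `values` array is invented for the example.

```python
import numpy as np

# Hypothetical sample with one obviously extreme value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# The box spans the first to third quartile, with a line at the median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

# Whiskers reach to the most extreme points that are still inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

print(q1, median, q3)             # 3.25 5.5 7.75
print(whisker_low, whisker_high)  # 1.0 9.0
print(outliers)                   # [100.]
```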


### violinplot
A violin plot plays a similar role to a box and whisker plot. It shows the distribution of quantitative data across several levels of one (or more) categorical variables such that those distributions can be compared. Unlike a box plot, in which all of the plot components correspond to actual datapoints, the violin plot features a kernel density estimation of the underlying distribution.

In [None]:
sns.violinplot(x="day", y="total_bill", data=tips,palette='rainbow') # violinplot of bill distribution by day


In [None]:
sns.violinplot(x="day", y="total_bill", data=tips,hue='sex',palette='Set1') # add sex grouping inside violins


In [None]:
sns.violinplot(x="day", y="total_bill", data=tips,hue='sex',split=True,palette='Set1') # split male/female into one violin per day


## stripplot and swarmplot
The stripplot will draw a scatterplot where one variable is categorical. A strip plot can be drawn on its own, but it is also a good complement to a box or violin plot in cases where you want to show all observations along with some representation of the underlying distribution.

The swarmplot is similar to stripplot(), but the points are adjusted (only along the categorical axis) so that they don’t overlap. This gives a better representation of the distribution of values, although it does not scale as well to large numbers of observations (both in terms of the ability to show all the points and in terms of the computation needed to arrange them).

In [None]:
sns.stripplot(x="day", y="total_bill", data=tips, palette='rainbow') # show each data point as a strip plot


In [None]:
sns.stripplot(x="day", y="total_bill", data=tips,jitter=True, palette='rainbow') # jitter points so overlaps are easier to see


In [None]:
sns.stripplot(x="day", y="total_bill", data=tips,jitter=True,hue='sex',palette='Set1') # color points by sex


In [None]:
sns.stripplot(x="day", y="total_bill", data=tips,jitter=True,hue='sex',palette='Set1',dodge=True) # separate male/female points side by side
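
The idea behind jitter can be sketched in a few lines of NumPy: each category sits at an integer position on the categorical axis, and a small random offset is added so that points with similar values do not sit on top of each other. The `jitter_width` value and the data here are assumptions for illustration, not seaborn's exact defaults.

```python
import numpy as np

rng = np.random.default_rng(3)
days = np.array(["Thur", "Fri", "Sat", "Sun"])
day_of_row = rng.choice(days, size=50)  # hypothetical category for each row

# Each category is placed at an integer position: Thur=0, Fri=1, Sat=2, Sun=3
positions = {day: i for i, day in enumerate(days)}
x_center = np.array([positions[d] for d in day_of_row], dtype=float)

# Jitter: shift every point sideways by a small uniform random amount
jitter_width = 0.1  # assumed half-width of the strip
x_jittered = x_center + rng.uniform(-jitter_width, jitter_width, size=x_center.size)

print(np.abs(x_jittered - x_center).max() <= jitter_width)
```

The offsets are random and carry no meaning, whereas a swarmplot chooses them deliberately so that no two points overlap.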


In [None]:
sns.swarmplot(x="day", y="total_bill", data=tips) # swarmplot packs points without overlap


### Combining Categorical Plots

In [None]:
sns.violinplot(x="tip", y="day", data=tips,palette='rainbow') # violinplot for tips by day
sns.swarmplot(x="tip", y="day", data=tips,color='black',size=3) # overlay individual points on top


## catplot

catplot (called factorplot in older seaborn versions) is the most general form of a categorical plot. It can take in a kind parameter to adjust the plot type:

In [None]:
sns.catplot(x='sex',y='total_bill',data=tips,kind='bar') # higher-level categorical bar plot


# Matrix Plots

Matrix plots allow you to plot data as color-encoded matrices and can also be used to indicate clusters within the data (later in the machine learning section we will learn how to formally cluster data).

Let's begin by exploring seaborn's heatmap and clustermap:

In [None]:
import seaborn as sns # import seaborn for matrix plots section
# keep plots inline
%matplotlib inline


In [None]:
flights = sns.load_dataset('flights') # load flights dataset


In [None]:
tips = sns.load_dataset('tips') # load tips dataset again


In [None]:
tips.head() # quick look at tips data


In [None]:
flights.head() # quick look at flights data


## Heatmap

For a heatmap to work properly, your data should already be in matrix form; the sns.heatmap function basically just colors it in for you. For example:

In [None]:
tips.head() # show tips again before correlation


In [None]:
# Matrix form for correlation data
tips.corr(numeric_only=True) # compute correlation matrix for numeric columns only (pandas 2.0+ raises an error on the non-numeric columns otherwise)


In [None]:
sns.heatmap(tips.corr(numeric_only=True)) # basic heatmap of correlations


In [None]:
sns.heatmap(tips.corr(numeric_only=True),cmap='coolwarm',annot=True) # heatmap with colors and printed values
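
As a sketch of what "matrix form" means here, the snippet below builds a correlation matrix from a synthetic DataFrame whose column names imitate tips (the values are made up). Selecting the numeric columns first is one way to keep text columns out of the calculation. The result is a square, symmetric table with ones on the diagonal, which is the kind of input a heatmap colors in.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
total_bill = rng.uniform(5, 50, size=100)

# Hypothetical data: three numeric columns and one text column
df = pd.DataFrame({
    "total_bill": total_bill,
    "tip": 0.15 * total_bill + rng.normal(0, 1, size=100),
    "size": rng.integers(1, 7, size=100),
    "day": rng.choice(["Thur", "Fri", "Sat", "Sun"], size=100),
})

# Keep only numeric columns, then compute pairwise Pearson correlations
corr = df.select_dtypes(include="number").corr()

print(corr.shape)          # one row and one column per numeric variable
print(list(corr.columns))  # the text column "day" is not included
```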


Or for the flights data:

In [None]:
flights.pivot_table(values='passengers',index='month',columns='year') # reshape flights into month x year matrix


In [None]:
pvflights = flights.pivot_table(values='passengers',index='month',columns='year') # save pivot table into variable
sns.heatmap(pvflights) # heatmap of passengers by month and year


In [None]:
sns.heatmap(pvflights,cmap='magma',linecolor='white',linewidths=1) # heatmap with custom colormap and grid lines
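
The reshaping step is worth seeing on a small scale. The sketch below applies the same `pivot_table` call to a tiny illustrative table with the same column names as flights (only four rows, chosen for the example). It turns long-form rows, one per year and month, into a matrix with months as rows and years as columns.

```python
import pandas as pd

# Small long-form table: one row per (year, month) observation
long_form = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [112, 118, 115, 126],
})

# Rows become months, columns become years, cells hold passenger counts
matrix = long_form.pivot_table(values="passengers", index="month", columns="year")

print(matrix)
print(matrix.loc["Jan", 1950])
```

With plain string months the rows are sorted alphabetically, while the real flights dataset stores month as an ordered categorical, so its rows stay in calendar order.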


## clustermap

The clustermap uses hierarchical clustering to produce a clustered version of the heatmap. For example:

In [None]:
sns.clustermap(pvflights) # cluster rows/columns by similarity


Notice how the years and months are no longer in order; instead, they are grouped by similarity in value (passenger count). That means we can begin to infer things from this plot, such as August and July being similar (which makes sense, since they are both summer travel months).

In [None]:
# Extra options such as normalization can make the patterns a little clearer
sns.clustermap(pvflights,standard_scale=1) # normalize each column before clustering
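
A minimal sketch of the rescaling that `standard_scale=1` is documented to apply: for each column, subtract the minimum and divide by the resulting maximum, so every column runs from 0 to 1. The small matrix below is invented for illustration, and the code mirrors the documented behavior rather than seaborn's source.

```python
import numpy as np

# Hypothetical month x year matrix where later years have larger values
data = np.array([[112.0, 115.0, 145.0],
                 [118.0, 126.0, 150.0],
                 [148.0, 170.0, 199.0]])

# Column-wise scaling (axis=0 walks down each column)
shifted = data - data.min(axis=0)
scaled = shifted / shifted.max(axis=0)

print(scaled.min(axis=0))  # every column now starts at 0
print(scaled.max(axis=0))  # and ends at 1
```

After this step the clustering compares the shape of each year's seasonal pattern instead of being dominated by the overall growth in passenger numbers.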


# Regression Plots

Seaborn has many built-in capabilities for regression plots. However, we won't really discuss regression until the machine learning section of the course, so we will only cover the **lmplot()** function for now.

**lmplot** allows you to display linear models, and it also conveniently lets you split those plots into separate panels or color them by hue according to features in the data.

Let's explore how this works:

In [None]:
import seaborn as sns # import seaborn for regression plot section
# show plots in notebook
%matplotlib inline


In [None]:
tips = sns.load_dataset('tips') # load tips dataset


In [None]:
tips.head() # preview data


## lmplot()

In [None]:
sns.lmplot(x='total_bill',y='tip',data=tips) # scatter + linear fit for bill vs tip


In [None]:
sns.lmplot(x='total_bill',y='tip',data=tips,hue='sex') # same fit but color by sex


In [None]:
sns.lmplot(x='total_bill',y='tip',data=tips,hue='sex',palette='coolwarm') # use custom colors for sex groups
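
The straight line in these plots is a least-squares linear fit. As a rough illustration of the kind of line involved, the sketch below fits one with `np.polyfit` on a few invented points that lie exactly on tip = 0.15 * total_bill + 0.5. This is not lmplot's own code path, only the same underlying idea.

```python
import numpy as np

# Hypothetical points chosen to lie exactly on a straight line
total_bill = np.array([10.0, 20.0, 30.0, 40.0])
tip = np.array([2.0, 3.5, 5.0, 6.5])

# Degree-1 polynomial fit returns the slope first, then the intercept
slope, intercept = np.polyfit(total_bill, tip, deg=1)

# Use the fitted line to predict the tip for a new bill
predicted = slope * 25.0 + intercept

print(round(slope, 3), round(intercept, 3))
print(round(predicted, 3))
```

With `hue='sex'`, a separate line of this kind is fitted to each group.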


## Using a Grid

We can add more variable separation through columns and rows with the use of a grid. Just indicate this with the col or row arguments:

In [None]:
sns.lmplot(x='total_bill',y='tip',data=tips,col='sex') # split into separate columns by sex


In [None]:
sns.lmplot(x="total_bill", y="tip", row="sex", col="time",data=tips) # facet by sex rows and time columns


In [None]:
sns.lmplot(x='total_bill',y='tip',data=tips,col='day',hue='sex',palette='coolwarm') # one plot per day and color by sex


## Aspect and Size

Seaborn figures can have their size and aspect ratio adjusted with the **height** and **aspect** parameters:

In [None]:
sns.lmplot(x='total_bill',y='tip',data=tips,col='day',hue='sex',palette='coolwarm', # set faceted lmplot by day and sex
          aspect=0.6,height=8) # control subplot width/height ratio and size
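
A quick arithmetic sketch of how these two parameters interact: `height` is the height of each facet in inches, and `aspect * height` gives the width of each facet. The total width below is only an approximation, since the legend and margins add some space.

```python
height = 8    # height of each facet in inches
aspect = 0.6  # width-to-height ratio of each facet
n_cols = 4    # tips has four distinct days: Thur, Fri, Sat, Sun

# Width of one facet in inches
facet_width = aspect * height

# Rough width of the whole figure, ignoring legend and margins
approx_total_width = facet_width * n_cols

print(facet_width)
print(approx_total_width)
```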


### Reference:

* https://seaborn.pydata.org/ - Seaborn: statistical data visualization
* https://seaborn.pydata.org/tutorial/color_palettes.html - Color palettes