# Seaborn

***NOTE: PLEASE DO NOT Restart & Clear Output, OTHERWISE YOU WILL LOSE SOME PLOTS THAT YOU HAVE TO REPLICATE IN THE EXERCISE PART AT THE END OF THE NOTEBOOK.***

# Distribution Plots

We will discuss the following plots:

* distplot
* jointplot
* pairplot
* rugplot
* kdeplot

In [None]:
import seaborn as sns
%matplotlib inline

## Data
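Seaborn provides built-in data sets that are loaded by name with `load_dataset`. The data is downloaded on first use, so an internet connection is needed.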

In [None]:
tips = sns.load_dataset('tips')

In [None]:
tips.head()

## distplot

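`distplot` shows the distribution of a univariate set of observations. In seaborn 0.11 and later it is deprecated in favor of `histplot` and `displot`, so expect a warning on recent versions.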

In [None]:
sns.distplot(tips['total_bill'])

Without the KDE (Gaussian kernel density estimate) layer:

In [None]:
sns.distplot(tips['total_bill'],kde=False,bins=30)

## jointplot

`jointplot` pairs a bivariate plot of two variables with a univariate distribution plot for each variable on the margins. The `kind` parameter selects how the joint relationship is drawn:
* `"scatter"`
* `"reg"`
* `"resid"`
* `"kde"`
* `"hex"`

In [None]:
sns.jointplot(x='total_bill',y='tip',data=tips,kind='scatter')

In [None]:
sns.jointplot(x='total_bill',y='tip',data=tips,kind='hex')

In [None]:
# Add regression and kernel density
sns.jointplot(x='total_bill',y='tip',data=tips,kind='reg')

## pairplot

`pairplot` will plot pairwise relationships across an entire dataframe (for the numerical columns) and supports a color hue argument (for categorical columns). 

In [None]:
sns.pairplot(tips)

In [None]:
sns.pairplot(tips,hue='sex',palette='coolwarm')

## rugplot

`rugplot` draws every data point in an array as a small tick (a "stick") on an axis.

In [None]:
sns.rugplot(tips['total_bill'])

## kdeplot

`kdeplot` draws a Kernel Density Estimation (KDE) plot. A KDE replaces every single observation with a Gaussian (Normal) distribution centered on that value and then sums these curves to obtain a smooth estimate of the distribution.

In [None]:
# Don't worry about understanding this code!
# It's just for the diagram below
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

#Create dataset
dataset = np.random.randn(25)

# Create another rugplot
sns.rugplot(dataset);

# Set up the x-axis for the plot
x_min = dataset.min() - 2
x_max = dataset.max() + 2

# 100 equally spaced points from x_min to x_max
x_axis = np.linspace(x_min,x_max,100)

# Set up the bandwidth, for info on this:
url = 'http://en.wikipedia.org/wiki/Kernel_density_estimation#Practical_estimation_of_the_bandwidth'

bandwidth = ((4*dataset.std()**5)/(3*len(dataset)))**.2


# Create an empty kernel list
kernel_list = []

# Plot each basis function
for data_point in dataset:
    
    # Create a kernel for each point and append to list
    kernel = stats.norm(data_point,bandwidth).pdf(x_axis)
    kernel_list.append(kernel)
    
    #Scale for plotting
    kernel = kernel / kernel.max()
    kernel = kernel * .4
    plt.plot(x_axis,kernel,color = 'grey',alpha=0.5)

plt.ylim(0,1)

In [None]:
# To get the kde plot we can sum these basis functions.

# Plot the sum of the basis function
sum_of_kde = np.sum(kernel_list,axis=0)

# Plot figure
fig = plt.plot(x_axis,sum_of_kde,color='indianred')

# Add the initial rugplot
sns.rugplot(dataset,color='indianred')

# Get rid of y-tick marks
plt.yticks([])

# Set title
plt.suptitle("Sum of the Basis Functions")

So with the tips dataset:

In [None]:
sns.kdeplot(tips['total_bill'])
sns.rugplot(tips['total_bill'])

In [None]:
sns.kdeplot(tips['tip'])
sns.rugplot(tips['tip'])
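The construction from the diagram cells can be condensed into a few lines of NumPy. This is an illustrative sketch, not seaborn's implementation: it averages one Gaussian kernel per observation (the diagram sums them without normalizing) and reuses the same rule-of-thumb bandwidth. `gaussian_kde_sketch` is a hypothetical helper name.

```python
import numpy as np

def gaussian_kde_sketch(data, grid, bandwidth):
    """Average of Gaussian kernels centred on each observation (hypothetical helper)."""
    data = np.asarray(data, dtype=float)
    # Shape (n_points, n_grid): standardized distance from each observation to each grid point
    z = (grid[None, :] - data[:, None]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    # Averaging (rather than summing) makes the curve integrate to about 1
    return kernels.mean(axis=0)

rng = np.random.default_rng(42)
dataset = rng.standard_normal(25)

# Same rule-of-thumb bandwidth as in the diagram code
bandwidth = ((4 * dataset.std() ** 5) / (3 * len(dataset))) ** 0.2

grid = np.linspace(dataset.min() - 4, dataset.max() + 4, 400)
density = gaussian_kde_sketch(dataset, grid, bandwidth)

# Riemann-sum check that the estimate behaves like a density
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

A smaller bandwidth gives narrower kernels and a bumpier curve; a larger one smooths more aggressively.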

# Categorical Data Plots

* barplot
* countplot
* boxplot
* violinplot
* stripplot
* swarmplot

In [None]:
import seaborn as sns
%matplotlib inline

In [None]:
tips = sns.load_dataset('tips')
tips.head()

## barplot and countplot

These plots show aggregate data for a categorical feature. `barplot` is a general plot that aggregates a numeric variable within each category using some function, by default the mean:

In [None]:
sns.barplot(x='sex',y='total_bill',data=tips)

In [None]:
import numpy as np

You can change the estimator to your own function, as long as it converts a vector to a scalar:

In [None]:
sns.barplot(x='sex',y='total_bill',data=tips,estimator=np.std)
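The bar heights correspond to a per-category aggregate, which can be reproduced with a pandas `groupby`. The sketch below uses a small made-up frame (an assumption, not the real tips data) and only mirrors the bar heights; `barplot` additionally draws error bars.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the tips data (illustrative values only)
df = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "total_bill": [20.0, 10.0, 30.0, 14.0, 25.0, 18.0],
})

# Default estimator: the mean of y within each x category
mean_by_sex = df.groupby("sex")["total_bill"].mean()

# estimator=np.std: population standard deviation (ddof=0) within each category
std_by_sex = df.groupby("sex")["total_bill"].agg(lambda s: np.std(s.to_numpy()))

print(mean_by_sex)
print(std_by_sex)
```

Note that `np.std` divides by `n`, whereas pandas' own `.std()` divides by `n - 1`, so the two give slightly different bar heights.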

### countplot

`countplot` is essentially a `barplot` whose estimator counts the number of occurrences in each category, which is why only the `x` variable is passed:

In [None]:
sns.countplot(x='sex',data=tips)
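The numbers behind a count plot are the same ones `value_counts` returns. A tiny sketch with made-up values (not the real tips data):

```python
import pandas as pd

# Made-up categorical column for illustration
df = pd.DataFrame({"sex": ["Male", "Female", "Male", "Male", "Female"]})

# One count per category, i.e. the bar heights of a count plot
counts = df["sex"].value_counts()
print(counts)
```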

## boxplot and violinplot

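`boxplot` and `violinplot` show the distribution of a quantitative variable within each level of a categorical variable. A box plot facilitates comparisons between variables or across levels of a categorical variable: the box shows the quartiles of the data, the whiskers extend to cover the rest of the distribution, and points judged to be outliers by an inter-quartile-range rule are drawn individually.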

In [None]:
sns.boxplot(x="day", y="total_bill", data=tips,palette='rainbow')

In [None]:
# Can do entire dataframe with orient='h'
sns.boxplot(data=tips,palette='rainbow',orient='h')

In [None]:
sns.boxplot(x="day", y="total_bill", hue="smoker",data=tips, palette="coolwarm")
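The elements of a single box can be computed by hand. The sketch below assumes the conventional 1.5 × IQR rule (the default `whis=1.5`) and uses made-up values, so it illustrates the idea rather than reproducing seaborn's internals.

```python
import numpy as np

# Made-up bill amounts; the last value is deliberately extreme
values = np.array([10.0, 12.0, 13.0, 15.0, 16.0, 18.0, 20.0, 22.0, 60.0])

# The box spans the first to third quartile, with a line at the median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Conventional 1.5 * IQR fences
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations still inside the fences
whisker_low = values[values >= lower_fence].min()
whisker_high = values[values <= upper_fence].max()

# Anything beyond the fences is drawn as an individual point
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, whisker_low, whisker_high, outliers)
```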

### violinplot
A violin plot plays a similar role to a box plot. It shows the distribution of quantitative data across several levels of one (or more) categorical variables such that those distributions can be compared. Unlike a box plot, in which all of the plot components correspond to actual datapoints, the violin plot features a kernel density estimation of the underlying distribution.

In [None]:
sns.violinplot(x="day", y="total_bill", data=tips,palette='rainbow')

In [None]:
sns.violinplot(x="day", y="total_bill", data=tips,hue='sex',palette='Set1')

In [None]:
sns.violinplot(x="day", y="total_bill", data=tips,hue='sex',split=True,palette='Set1')

## stripplot and swarmplot
`stripplot` will draw a scatterplot where one variable is categorical. A strip plot can be drawn on its own, but it is also a good complement to a box or violin plot in cases where you want to show all observations along with some representation of the underlying distribution.

`swarmplot` is similar to `stripplot`, but the points are adjusted (only along the categorical axis) so that they don’t overlap. This gives a better representation of the distribution of values.

In [None]:
sns.stripplot(x="day", y="total_bill", data=tips)

In [None]:
sns.stripplot(x="day", y="total_bill", data=tips,jitter=True)

In [None]:
sns.stripplot(x="day", y="total_bill", data=tips,jitter=True,hue='sex',palette='Set1')

In [None]:
# dodge=True separates the hue levels (older seaborn versions called this parameter split)
sns.stripplot(x="day", y="total_bill", data=tips,jitter=True,hue='sex',palette='Set1',dodge=True)

In [None]:
sns.swarmplot(x="day", y="total_bill", data=tips)

In [None]:
# dodge=True separates the hue levels (older seaborn versions called this parameter split)
sns.swarmplot(x="day", y="total_bill",hue='sex',data=tips, palette="Set1", dodge=True)

### Combining Categorical Plots

In [None]:
sns.violinplot(x="tip", y="day", data=tips,palette='rainbow')
sns.swarmplot(x="tip", y="day", data=tips,color='black',size=3)

# Matrix Plots

Matrix plots are used to plot data as color-encoded matrices and can also be used to indicate clusters within the data.

In [None]:
flights = sns.load_dataset('flights')

In [None]:
tips = sns.load_dataset('tips')

In [None]:
tips.head()

In [None]:
flights.head()

## Heatmap

In order for a `heatmap` to work properly, your data should already be in a matrix form.

In [None]:
tips.head()

In [None]:
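# Matrix form for correlation data: Compute pairwise correlation of columns
# Keep only the numeric columns; recent pandas versions raise an error on text/categorical columns
tips.select_dtypes('number').corr()

In [None]:
sns.heatmap(tips.select_dtypes('number').corr())

In [None]:
sns.heatmap(tips.select_dtypes('number').corr(),cmap='coolwarm',annot=True)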
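A correlation matrix is only defined for numeric columns, and recent pandas releases may raise an error when text or categorical columns are present. The sketch below uses a made-up frame (an assumption, not the real tips data) to show the matrix form a heatmap expects.

```python
import pandas as pd

# Made-up frame with three numeric columns and one text column
df = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.0, 2.0, 3.0, 4.0],
    "size": [4, 3, 2, 1],
    "day": ["Thur", "Fri", "Sat", "Sun"],
})

# Drop non-numeric columns before computing pairwise correlations
numeric = df.select_dtypes("number")
corr = numeric.corr()

# Square matrix: the same labels on the rows and on the columns
print(corr)
```

Here `tip` rises in lockstep with `total_bill` (correlation 1) while `size` falls (correlation -1), which a diverging colormap such as `coolwarm` would show as opposite colors.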

Or for the flights data:

In [None]:
flights.pivot_table(values='passengers',index='month',columns='year')

In [None]:
pvflights = flights.pivot_table(values='passengers',index='month',columns='year')
sns.heatmap(pvflights)

In [None]:
sns.heatmap(pvflights,cmap='magma',linecolor='white',linewidths=1)
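`pivot_table` is what turns the long flights table (one row per month and year) into the matrix form a heatmap needs. A minimal sketch on a four-row frame, whose values are for illustration only:

```python
import pandas as pd

# Long form: one row per (year, month) observation
long_df = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [112, 118, 115, 126],
})

# Wide (matrix) form: months become the rows, years become the columns
wide = long_df.pivot_table(values="passengers", index="month", columns="year")
print(wide)
```

With plain strings the months are sorted alphabetically; in the real flights data `month` is an ordered categorical column, so the calendar order is kept.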

## clustermap

`clustermap` uses hierarchical clustering to produce a clustered version of the heatmap: rows and columns are reordered so that similar ones sit next to each other.

In [None]:
sns.clustermap(pvflights)
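The reordering can be illustrated directly with SciPy. The sketch below clusters only the rows of a made-up matrix, using average linkage on Euclidean distances (the documented `clustermap` defaults); it is a simplified illustration, not seaborn's code path.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage

# Made-up matrix with two obvious row groups (rows 0 and 2 vs rows 1 and 3)
matrix = np.array([
    [1.0, 1.1, 0.9],
    [9.0, 9.2, 8.8],
    [1.2, 1.0, 1.1],
    [8.9, 9.1, 9.0],
])

# Build the row dendrogram
row_linkage = linkage(matrix, method="average", metric="euclidean")

# Leaf order of the dendrogram: similar rows become neighbours
row_order = leaves_list(row_linkage)
reordered = matrix[row_order]

print(row_order)
print(reordered)
```

`clustermap` applies the same idea to the columns as well and draws the dendrograms along the margins of the heatmap.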

# Grids

Grids are general types of plots that let you map plot types to the rows and columns of a grid. This helps you create similar plots separated by features.

In [None]:
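import seaborn as sns
import matplotlib.pyplot as plt  # plt.scatter and plt.hist are mapped onto the grids below
%matplotlib inline
iris = sns.load_dataset('iris')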

In [None]:
iris.head()

## PairGrid

`PairGrid` is a subplot grid for plotting pairwise relationships in a dataset.

In [None]:
# Just the Grid
sns.PairGrid(iris)

In [None]:
# Then you map to the grid
g = sns.PairGrid(iris)
g.map(plt.scatter) # for all plots

In [None]:
# Map to upper,lower, and diagonal
g = sns.PairGrid(iris)
g.map_diag(plt.hist)
g.map_upper(plt.scatter)
g.map_lower(sns.kdeplot)

## pairplot

`pairplot` is a simpler, higher-level interface to `PairGrid` that picks sensible default plots for you.

In [None]:
sns.pairplot(iris)

In [None]:
sns.pairplot(iris,hue='species',palette='rainbow')

## FacetGrid

`FacetGrid` is the general way to create grids of plots based on a feature:

In [None]:
tips = sns.load_dataset('tips')

In [None]:
tips.head()

In [None]:
# Just the Grid
g = sns.FacetGrid(tips, col="time", row="smoker")

In [None]:
g = sns.FacetGrid(tips, col="time",  row="smoker")
g = g.map(plt.hist, "total_bill")

In [None]:
g = sns.FacetGrid(tips, col="time",  row="smoker",hue='sex')
g = g.map(plt.scatter, "total_bill", "tip").add_legend()

## JointGrid

`JointGrid` is the general version of `jointplot`-type grids: you choose which function draws the joint plot and which draws the marginal plots.

In [None]:
g = sns.JointGrid(x="total_bill", y="tip", data=tips)

In [None]:
g = sns.JointGrid(x="total_bill", y="tip", data=tips)
g = g.plot(sns.regplot, sns.distplot)

# Style and Color

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
tips = sns.load_dataset('tips')

## Styles

In [None]:
sns.countplot(x='sex',data=tips)

In [None]:
sns.set_style('ticks')
sns.countplot(x='sex',data=tips,palette='deep')

## Spine Removal

In [None]:
sns.countplot(x='sex',data=tips)
sns.despine()

In [None]:
sns.countplot(x='sex',data=tips)
sns.despine(left=True)

## Size and Aspect

Use matplotlib's `plt.figure(figsize=(width, height))` to change the size of most seaborn plots.

In [None]:
# Non Grid Plot
plt.figure(figsize=(12,3))
sns.countplot(x='sex',data=tips)
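Grid-type plots create their own figure, so `plt.figure` does not affect them; they are sized through `height` and `aspect` instead. A minimal sketch, assuming a recent seaborn (releases before 0.9 named the parameter `size`, and older releases expose the figure as `g.fig`); the frame is made up for illustration.

```python
import pandas as pd
import seaborn as sns

# Made-up frame with two levels of "time", so the grid gets two columns
df = pd.DataFrame({
    "time": ["Lunch", "Dinner", "Lunch", "Dinner"],
    "total_bill": [12.0, 25.0, 15.0, 30.0],
})

# Each facet is `height` inches tall and `height * aspect` inches wide
g = sns.FacetGrid(df, col="time", height=2, aspect=1.5)

# Two facets side by side: 2 * (2 * 1.5) inches wide, 2 inches tall
fig_width, fig_height = g.figure.get_size_inches()
print(fig_width, fig_height)
```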

## Scale and Context

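Use `set_context()` to scale plot elements for a target medium (`paper`, `notebook`, `talk`, or `poster`) and to override default parameters such as the font scale: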

In [None]:
sns.set_context('poster',font_scale=1) # change font_scale
sns.countplot(x='sex',data=tips,palette='coolwarm')

# Exercises

**Recreate the plots below using the titanic dataframe.**

***NOTE: IN ORDER NOT TO LOSE THE PLOT IMAGE, MAKE SURE YOU DO NOT CODE IN THE CELL THAT IS DIRECTLY ABOVE THE PLOT. THERE IS AN EXTRA CELL ABOVE THAT ONE WHICH WILL NOT OVERWRITE THE PLOT!***

## The Data

We will be working with the famous Titanic data set for these exercises.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
sns.set_style('whitegrid')

In [None]:
titanic = sns.load_dataset('titanic')

In [None]:
titanic.head()

In [None]:
# CODE HERE
# REPLICATE EXERCISE PLOT IMAGE BELOW
# BE CAREFUL NOT TO OVERWRITE CELL BELOW
# THAT WOULD REMOVE THE EXERCISE PLOT IMAGE!

In [None]:
sns.jointplot(x='fare',y='age',data=titanic)

In [None]:
# CODE HERE
# REPLICATE EXERCISE PLOT IMAGE BELOW
# BE CAREFUL NOT TO OVERWRITE CELL BELOW
# THAT WOULD REMOVE THE EXERCISE PLOT IMAGE!

In [None]:
sns.distplot(titanic['fare'],bins=30,kde=False,color='red')

In [None]:
# CODE HERE
# REPLICATE EXERCISE PLOT IMAGE BELOW
# BE CAREFUL NOT TO OVERWRITE CELL BELOW
# THAT WOULD REMOVE THE EXERCISE PLOT IMAGE!

In [None]:
sns.boxplot(x='class',y='age',data=titanic,palette='rainbow')

In [None]:
# CODE HERE
# REPLICATE EXERCISE PLOT IMAGE BELOW
# BE CAREFUL NOT TO OVERWRITE CELL BELOW
# THAT WOULD REMOVE THE EXERCISE PLOT IMAGE!

In [None]:
sns.swarmplot(x='class',y='age',data=titanic,palette='Set2')

In [None]:
# CODE HERE
# REPLICATE EXERCISE PLOT IMAGE BELOW
# BE CAREFUL NOT TO OVERWRITE CELL BELOW
# THAT WOULD REMOVE THE EXERCISE PLOT IMAGE!

In [None]:
sns.countplot(x='sex',data=titanic)

In [None]:
# CODE HERE
# REPLICATE EXERCISE PLOT IMAGE BELOW
# BE CAREFUL NOT TO OVERWRITE CELL BELOW
# THAT WOULD REMOVE THE EXERCISE PLOT IMAGE!

In [None]:
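# Restrict to numeric and boolean columns; recent pandas versions raise an error on text columns
sns.heatmap(titanic.select_dtypes(include=['number','bool']).corr(),cmap='coolwarm')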
plt.title('titanic.corr()')

In [None]:
# CODE HERE
# REPLICATE EXERCISE PLOT IMAGE BELOW
# BE CAREFUL NOT TO OVERWRITE CELL BELOW
# THAT WOULD REMOVE THE EXERCISE PLOT IMAGE!

In [None]:
g = sns.FacetGrid(data=titanic,col='sex')
g.map(plt.hist,'age')