# Plotting with pandas and seaborn 

The goal of this notebook is to quickly cover the plotting functions built into pandas and to introduce the types of plots available in seaborn.

# Dependencies

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
print(pd.__version__, sns.__version__)

0.23.4 0.9.0


In [3]:
# Some aesthetics

sns.set_context("talk")
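If you are curious what a context actually changes, here is a minimal sketch that compares the base font size under each of seaborn's built-in contexts. It relies on `sns.plotting_context`, which only returns the settings and leaves the global style alone.

```python
import seaborn as sns

# Compare the base font size seaborn would apply under each built-in context.
# plotting_context() returns a dict of settings without changing global state.
for name in ["paper", "notebook", "talk", "poster"]:
    params = sns.plotting_context(name)
    print(name, params["font.size"])
```

The sizes grow from "paper" to "poster", so "talk" is a reasonable choice for slides.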

# Important references 

https://seaborn.pydata.org/examples/index.html

Spend time scrolling through seaborn's tutorial; it's really useful: https://seaborn.pydata.org/tutorial.html

https://pandas.pydata.org/pandas-docs/stable/visualization.html

# Importing some data sets

In [4]:
data_frame = sns.load_dataset('iris')

# Exploratory plots (pandas)

In [5]:
data_frame.head()

       sepal_length  sepal_width  petal_length  petal_width species
    0           5.1          3.5           1.4          0.2  setosa
    1           4.9          3.0           1.4          0.2  setosa
    2           4.7          3.2           1.3          0.2  setosa
    3           4.6          3.1           1.5          0.2  setosa
    4           5.0          3.6           1.4          0.2  setosa


In [0]:
# describe() summarizes each numeric column (count, mean, std, min, quartiles, max)

data_frame.describe()

# Pivot Tables!

Pivot tables are a very useful tool for generating quick views of what's going on with your data.

They work much like GROUP BY with aggregate functions in SQL databases. Here is a list of built-in descriptive functions. The ones that reduce a group to a single value (count, sum, mean, median, min, max, std, var and so on) can be passed by name as `aggfunc`.

    Function    Description
    ========================================================
    count       Number of non-NA observations
    sum         Sum of values
    mean        Mean of values
    mad         Mean absolute deviation (removed in pandas 2.0)
    median      Arithmetic median of values
    min         Minimum
    max         Maximum
    mode        Mode
    abs         Absolute value
    prod        Product of values
    std         Bessel-corrected sample standard deviation
    var         Unbiased variance
    sem         Standard error of the mean
    skew        Sample skewness (3rd moment)
    kurt        Sample kurtosis (4th moment)
    quantile    Sample quantile (value at %)
    cumsum      Cumulative sum
    cumprod     Cumulative product
    cummax      Cumulative maximum
    cummin      Cumulative minimum
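To make the comparison with SQL-style grouping concrete, here is a minimal sketch on a tiny hand-made frame (the `toy` values and column names are hypothetical, not the iris data). It builds the same table of group means twice, once with `pd.pivot_table` and once with `groupby`.

```python
import pandas as pd

# Tiny hand-made frame (hypothetical values, not the iris data).
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "length": [1.0, 3.0, 2.0, 6.0],
    "width": [0.5, 1.5, 1.0, 2.0],
})

# Roughly: SELECT "group", AVG(length), AVG(width) FROM toy GROUP BY "group"
by_pivot = pd.pivot_table(toy, values=["length", "width"], index="group", aggfunc="mean")
by_groupby = toy.groupby("group")[["length", "width"]].mean()

print(by_pivot)
print(by_groupby)
```

Both calls give one row per group and one column per measurement. The pivot table form becomes more convenient once you also want to spread a second key across the columns.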

In [0]:
pd.pivot_table(data=data_frame, values=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], 
               index='species', columns=None, 
               aggfunc='mean', fill_value=None, margins=False, 
               dropna=True, margins_name='All')

In [0]:
mean_std_df = pd.pivot_table(data=data_frame, values=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], 
               index='species', columns=None, 
               aggfunc=['mean', 'std'], fill_value=None, margins=False, 
               dropna=True, margins_name='All')
mean_std_df

In [0]:
# Indexing into hierarchical columns

mean_std_df['mean']
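Passing a list to `aggfunc` produces two-level column labels of the form (aggfunc, value column). The sketch below shows three ways to index into them, again on a hypothetical `toy` frame rather than the iris data.

```python
import pandas as pd

# Tiny hand-made frame (hypothetical values, not the iris data).
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "length": [1.0, 3.0, 2.0, 6.0],
    "width": [0.5, 1.5, 1.0, 2.0],
})

table = pd.pivot_table(toy, values=["length", "width"], index="group",
                       aggfunc=["mean", "std"])

means = table["mean"]                          # DataFrame: every column under 'mean'
one = table[("mean", "length")]                # Series: a single (aggfunc, value) column
lengths = table.xs("length", axis=1, level=1)  # DataFrame: 'length' under every aggfunc

print(table.columns.tolist())
```

Selecting the outer level first (`table["mean"]`) is what lets the mean and std tables line up column by column later on.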

## Plotting with pandas

Pandas is designed to plot **structured (wide) dataframes**, meaning tables that look like pivot tables: each column becomes one plotted series and the index becomes the x-axis.

In [0]:
mean_std_df['mean'].plot(kind='bar')

In [0]:
mean_std_df['mean'].plot(kind='bar', yerr=mean_std_df['std'])
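Here is a minimal sketch of how pandas maps a wide table onto a bar chart, using two small hand-made frames (`means` and `stds` are hypothetical stand-ins for the pivot table above). Each row becomes a group on the x-axis, each column becomes one bar per group, and `yerr` is matched to the bars by index and column label.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs outside Jupyter; drop it in a notebook
import pandas as pd

# Hypothetical group means and standard deviations with matching labels.
means = pd.DataFrame({"length": [5.0, 5.9, 6.6], "width": [3.4, 2.8, 3.0]},
                     index=["a", "b", "c"])
stds = pd.DataFrame({"length": [0.3, 0.5, 0.6], "width": [0.4, 0.3, 0.3]},
                    index=["a", "b", "c"])

ax = means.plot(kind="bar", yerr=stds)

# One bar per (row, column) pair: 3 rows x 2 columns.
n_bars = len(ax.patches)
print(n_bars)
```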

In [0]:
mean_std_df['mean'].boxplot()

In [0]:
data_frame.head()

# Plotting with seaborn

Seaborn is a statistical plotting package designed to plot **long dataframes**. In the longest form, each row holds a single numerical value plus the labels that describe it.

Do the values in your label columns repeat a lot? If so, your dataframe is probably long.

Seaborn can natively make some very impressive summary images.

In [0]:
sns.set_context('talk')

In [0]:
sns.pairplot(data=data_frame, hue="species")

Now let's try to replicate the pandas bar plots from above.

In [0]:
data_frame.columns

In [0]:
sns.barplot(x='species', y='petal_width', data=data_frame)

In [0]:
sns.barplot(x='petal_width', y='species', data=data_frame)
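Unlike the pandas version, `sns.barplot` does the aggregation for you: by default the bar height is the mean of `y` within each category. Here is a minimal sketch that checks this against `groupby` on a hypothetical `toy` frame.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs outside Jupyter; drop it in a notebook
import pandas as pd
import seaborn as sns

# Tiny hand-made frame (hypothetical values, not the iris data).
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 2.0, 6.0],
})

ax = sns.barplot(x="group", y="value", data=toy)

# Compare the drawn bar heights with the per-group means.
bar_heights = sorted(patch.get_height() for patch in ax.patches)
group_means = sorted(toy.groupby("group")["value"].mean())
print(bar_heights, group_means)
```

So there is no need to build a pivot table first; seaborn works from the raw observations and also draws an error bar for each mean.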

To recreate the grouped bar chart from the pandas section with data_frame structured as it is now, we need to build it with a for loop.

In [0]:
# Making subplots using a for loop

fig, ax = plt.subplots(ncols=4, nrows=1, sharey=True, sharex=True, figsize=(16, 4))

for fig_ind, category in enumerate(['sepal_length', 'sepal_width', 'petal_length', 'petal_width']):
    sns.barplot(x='species', y=category,
                data=data_frame, ax=ax[fig_ind])
    ax[fig_ind].set_title(category)
    ax[fig_ind].set_ylabel('')
    ax[fig_ind].set_xticklabels(ax[fig_ind].get_xticklabels(), rotation=30)

Let's say we want to do this without using a for loop.

We need to make our data **even longer!**

In [0]:
melted_data_frame = data_frame.melt(id_vars='species')

In [0]:
melted_data_frame.head()
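To see what `melt` does to the shape, here is a minimal sketch on a two-row frame (`wide` is a hypothetical example). The `id_vars` column is repeated, the remaining column names move into `variable`, and their contents move into `value`.

```python
import pandas as pd

# Two rows, two measurement columns (hypothetical values).
wide = pd.DataFrame({
    "species": ["a", "b"],
    "length": [1.0, 2.0],
    "width": [0.5, 1.0],
})

# 2 rows x 2 measurement columns become 4 rows of (species, variable, value).
long = wide.melt(id_vars="species")
print(long)
```

Every row now carries exactly one number, which is what lets seaborn map `variable` onto `hue` in the plots that follow.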

In [0]:
sns.barplot(x='species', y='value', hue='variable', data=melted_data_frame)
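The `hue` version puts everything on one axes. If you would rather keep one panel per measurement, as in the for-loop figure, one option is `sns.catplot` with `col=`. This is an alternative sketch, not something the cells above use, and the `long` frame here is a hypothetical stand-in for the melted data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs outside Jupyter; drop it in a notebook
import pandas as pd
import seaborn as sns

# Hypothetical long-form data shaped like the output of melt().
long = pd.DataFrame({
    "species": ["a", "a", "b", "b"] * 2,
    "variable": ["length"] * 4 + ["width"] * 4,
    "value": [1.0, 3.0, 2.0, 6.0, 0.5, 1.5, 1.0, 2.0],
})

# One bar-plot panel per level of 'variable', sharing the y-axis by default.
grid = sns.catplot(x="species", y="value", col="variable", kind="bar", data=long)
print(grid.axes.shape)
```

`catplot` returns a grid object rather than a single axes, so titles and labels are adjusted through the grid (for example `grid.set_axis_labels`).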

In [0]:
sns.boxplot(x='species', y='value', hue='variable', data=melted_data_frame)

In [0]:
sns.violinplot(x='species', y='value', hue='variable', data=melted_data_frame)

In [0]:
sns.swarmplot(x='species', y='value', hue='variable', data=melted_data_frame)

In [0]:
sns.boxenplot(x='species', y='value', hue='variable', data=melted_data_frame)

In [0]:
ax = sns.boxenplot(x='value', y='species', hue='variable', data=melted_data_frame)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))

# Looking at dense scatters? Use kde plots.

In [0]:
# Alpha controls the opacity of a point

data_frame.plot(x='sepal_length', y='petal_width', kind='scatter', alpha=0.2)

In [0]:
sns.kdeplot(data=data_frame[['sepal_length', 'petal_width']], shade_lowest=True)

In [0]:
ax = data_frame.plot(x='sepal_length', y='petal_width', kind='scatter', alpha=1, c='k')
sns.kdeplot(data=data_frame[['sepal_length', 'petal_width']], shade_lowest=True, ax=ax)


# Summary plots using both pandas and seaborn

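## Correlation plots

In [0]:
# You can generate a correlation matrix using pandas.
# Drop the non-numeric 'species' column first; pandas 2.0+ raises on it in corr().

data_frame.drop(columns='species').corr()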
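Here is a minimal sketch of what `corr()` returns, on a hypothetical `toy` frame. The default is the Pearson correlation, the result is symmetric with ones on the diagonal, and only numeric columns belong in the input.

```python
import pandas as pd

# Hypothetical data: y rises linearly with x, z falls linearly with x.
toy = pd.DataFrame({
    "label": ["a", "b", "c", "d"],
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],
    "z": [4.0, 3.0, 2.0, 1.0],
})

# Keep only the numeric columns before computing correlations.
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()
print(corr)
```

A perfectly increasing linear relationship gives a correlation of 1 and a perfectly decreasing one gives -1, which is why diverging colormaps are a common choice for correlation heatmaps.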

In [0]:
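# Drop the non-numeric 'species' column; pandas 2.0+ raises on it in corr()
sns.heatmap(data_frame.drop(columns='species').corr(), annot=True, cmap="YlGnBu", cbar=False)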

In [0]:
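# Drop the non-numeric 'species' column; pandas 2.0+ raises on it in corr()
sns.clustermap(data_frame.drop(columns='species').corr(), annot=True, cmap="YlGnBu")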