# Pandas Visualization
- pandas uses matplotlib under the hood to provide some convenient functions for visualization. 

In [27]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib notebook

In [28]:
plt.style.available

['Solarize_Light2',
 '_classic_test_patch',
 'bmh',
 'classic',
 'dark_background',
 'fast',
 'fivethirtyeight',
 'ggplot',
 'grayscale',
 'seaborn',
 'seaborn-bright',
 'seaborn-colorblind',
 'seaborn-dark',
 'seaborn-dark-palette',
 'seaborn-darkgrid',
 'seaborn-deep',
 'seaborn-muted',
 'seaborn-notebook',
 'seaborn-paper',
 'seaborn-pastel',
 'seaborn-poster',
 'seaborn-talk',
 'seaborn-ticks',
 'seaborn-white',
 'seaborn-whitegrid',
 'tableau-colorblind10']

In [29]:
# Tell matplotlib to use the colorblind-friendly style for all plots.
plt.style.use('seaborn-colorblind')
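The style names listed above depend on the installed matplotlib version: from matplotlib 3.6 onward the seaborn styles are named `seaborn-v0_8-<name>`, so `'seaborn-colorblind'` may not be found on newer installs. A minimal sketch that picks the first colorblind-friendly style that is actually available (the `candidates` list and the `'default'` fallback are illustrative choices, not part of the original notebook):

```python
import matplotlib.pyplot as plt

# Candidate names, oldest naming first. Which ones exist depends on the
# installed matplotlib version, so we check before using any of them.
candidates = ['seaborn-colorblind', 'seaborn-v0_8-colorblind',
              'tableau-colorblind10']

# Pick the first available candidate, falling back to matplotlib's default.
style = next((name for name in candidates if name in plt.style.available),
             'default')
plt.style.use(style)
print(style)
```

The lookup keeps the rest of the notebook unchanged: only the style name varies between installations.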

# DataFrame.plot

In [30]:
# Seed the random number generator so that the data can be reproduced.
np.random.seed(123)

# Next, let's add three columns of random time series data.
# Cumulatively summing random numbers produces a random walk.
df = pd.DataFrame({'A': np.random.randn(365).cumsum(0),
                   'B': np.random.randn(365).cumsum(0) + 20,
                   'C': np.random.randn(365).cumsum(0) - 20},
                  index=pd.date_range('1/1/2017', periods=365))
df

| | A | B | C |
|---|---|---|---|
| 2017-01-01 | -1.085631 | 20.059291 | -20.230904 |
| 2017-01-02 | -0.088285 | 21.803332 | -16.659325 |
| 2017-01-03 | 0.194693 | 20.835588 | -17.055481 |
| 2017-01-04 | -1.311601 | 21.255156 | -17.093802 |
| 2017-01-05 | -1.890202 | 21.462083 | -19.518638 |
| ... | ... | ... | ... |
| 2017-12-27 | -17.039852 | 36.468465 | -61.792064 |
| 2017-12-28 | -16.366361 | 36.860543 | -59.518959 |
| 2017-12-29 | -16.780118 | 37.607936 | -59.615350 |
| 2017-12-30 | -16.104155 | 37.880671 | -61.557482 |
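
The columns above are random walks: each value is the running total of all random steps drawn so far. A tiny worked example with hand-picked steps (hypothetical values, chosen only so the arithmetic is easy to follow):

```python
import numpy as np

# Four hand-picked "daily changes" (hypothetical values, for illustration).
steps = np.array([1.0, -2.0, 0.5, 3.0])

# Each entry of the walk is the sum of all steps up to that point.
walk = steps.cumsum()
print(walk)        # values: 1.0, -1.0, -0.5, 2.5

# Adding a constant shifts the whole walk, as done for columns B and C.
print(walk + 20)   # values: 21.0, 19.0, 19.5, 22.5
```

Differencing the walk with `np.diff(walk)` recovers every step after the first, which is a quick way to check that a series was built this way.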


In [31]:
# When we call .plot() we get a line graph with labels.
# Notice how the colors are slightly different from the default matplotlib
# colors because of the style we used.
df.plot();


In [32]:
# .plot() lets us choose the type of plot through the `kind` parameter.
df.plot(x='A', y='B', kind='scatter', color='r');


In [33]:
# We can draw the same kind of plot by calling the matching method on .plot,
# here .plot.scatter().
# We can also make the color `c` and the size `s` vary with the values of
# column B, so a single figure represents three variables.

df.plot.scatter(x='A', y='C', c='B', s=df['B'], colormap='viridis');

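The two calling styles used so far, the `kind` parameter and the methods on the `.plot` accessor, are interchangeable. A small sketch on a made-up frame (`demo` and its columns are hypothetical, not the notebook's data) that draws the same scatter plot both ways:

```python
import numpy as np
import pandas as pd

# Small hypothetical frame, just to compare the two calling styles.
demo = pd.DataFrame({'A': np.arange(10.0), 'B': np.arange(10.0) ** 2})

# Style 1: name the plot type with the `kind` parameter.
ax_kind = demo.plot(x='A', y='B', kind='scatter')

# Style 2: call the method of the same name on the .plot accessor.
ax_method = demo.plot.scatter(x='A', y='B')

# Both calls return a matplotlib Axes holding one scatter collection.
print(type(ax_kind).__name__, len(ax_kind.collections))
print(type(ax_method).__name__, len(ax_method.collections))
```

The accessor form tends to be easier to discover through tab completion, while `kind=` is convenient when the plot type is stored in a variable.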

In [34]:
# Because df.plot.scatter returns a matplotlib Axes object, we can modify
# it just like the objects returned by matplotlib's own plotting functions.

ax = df.plot.scatter(x='A', y='C', c='B', s=df['B'], colormap='viridis')
ax.set_aspect('equal');

# Setting the aspect ratio to `equal` lets the viewer easily see
# that the range of `A` is much smaller than the range of `C`.

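The same pattern works for any other Axes method. A minimal sketch on made-up data (the `demo` frame, labels, and title are hypothetical) showing a few common adjustments alongside `set_aspect`:

```python
import numpy as np
import pandas as pd

# Hypothetical data: x spans about 10 units, y about 100 units.
demo = pd.DataFrame({'x': np.linspace(0, 10, 50),
                     'y': np.linspace(0, 100, 50)})

ax = demo.plot.scatter(x='x', y='y')

# The returned Axes accepts the usual matplotlib calls.
ax.set_title('Equal aspect ratio')
ax.set_xlabel('x (narrow range)')
ax.set_ylabel('y (wide range)')
ax.set_aspect('equal')  # one data unit is the same length on both axes
```

With equal scaling, the plot becomes tall and narrow, which makes the difference between the two ranges visible at a glance.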

### We can also use pandas to plot `histograms`.


In [35]:
df.plot.hist(alpha = 0.6);


### We can also use pandas to plot `boxplots`.

In [36]:
df.plot.box();

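Each box summarizes a column through its quartiles. A short worked example on nine hand-picked values (hypothetical data, chosen so the quartiles are easy to verify by eye), assuming matplotlib's default whisker length of 1.5 times the interquartile range:

```python
import pandas as pd

# Nine evenly spaced values, chosen so the quartiles are easy to verify.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# The box spans q1 to q3, with a line at the median.
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# With the default settings, whiskers reach the most extreme data points
# within 1.5 * IQR of the box; points beyond that are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(q1, median, q3)            # 3.0 5.0 7.0
print(lower_fence, upper_fence)  # -3.0 13.0
```

Here every value lies inside the fences, so a box plot of `values` would show no outlier markers.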

### We can also use pandas to plot `KDE`.
- KDE stands for kernel density estimate. KDE plots are useful for visualizing an estimate of a variable's probability density function.

- Kernel density estimation plots come in handy in data science applications where you want to derive a smooth continuous function from a given sample.

In [38]:
df.plot.kde();

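To make the idea concrete, here is a minimal NumPy sketch of a Gaussian kernel density estimate. It is not what pandas runs internally (`df.plot.kde()` relies on SciPy, which chooses the bandwidth automatically); the function name `gaussian_kde_1d` and the hand-picked bandwidth of 0.4 are assumptions made for illustration:

```python
import numpy as np

def gaussian_kde_1d(sample, grid, bandwidth):
    """Average of Gaussian bumps, one centered on each observation."""
    sample = np.asarray(sample, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j] = signed distance from grid point i to observation j,
    # measured in bandwidths.
    z = (grid[:, None] - sample[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return bumps.sum(axis=1) / (sample.size * bandwidth)

rng = np.random.default_rng(0)
sample = rng.normal(0, 1, 200)
grid = np.linspace(-6, 6, 601)
density = gaussian_kde_1d(sample, grid, bandwidth=0.4)

# A density must integrate to (approximately) 1 over the grid.
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

A smaller bandwidth follows the sample more closely and looks bumpier; a larger one gives a smoother but flatter curve.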

- pandas also has plotting tools that help with visualizing large amounts of
  data or high-dimensional data. Let's explore a couple of these:
  
# pandas.plotting

In [41]:
iris = pd.read_csv('data/iris.csv')
iris      

| | SepalLength | SepalWidth | PetalLength | PetalWidth | Name |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |


In [48]:
# pandas has a plotting tool that allows us to create a scatter matrix 
# from a DataFrame.
pd.plotting.scatter_matrix(iris);
# this allows us to quickly see the more obvious patterns in the dataset.

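The scatter matrix only uses the numeric columns, which is why the `Name` column does not appear in the grid. A small sketch on made-up data (the `demo` frame and its column names are hypothetical) that shows the shape of the returned grid of Axes:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical frame: two related numeric columns, one unrelated numeric
# column, and one text label column.
x = rng.normal(size=100)
demo = pd.DataFrame({'x': x,
                     'y': 2 * x + rng.normal(scale=0.5, size=100),
                     'z': rng.normal(size=100),
                     'label': ['a', 'b'] * 50})

# Histograms on the diagonal, pairwise scatter plots everywhere else.
axes = pd.plotting.scatter_matrix(demo, diagonal='hist', alpha=0.5)
print(axes.shape)  # (3, 3): only the three numeric columns are plotted
```

In this sketch the `x` versus `y` panels show a clear linear trend, while any panel involving `z` looks like a shapeless cloud.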

Let's look at one more plotting tool in pandas which will help us visualize multivariate data. pandas includes a plotting tool for creating parallel coordinates plots.

- Parallel coordinate plots are a common way of visualizing high-dimensional multivariate data. Each variable in the data set corresponds to an equally spaced parallel vertical line. For each individual observation, the values of the variables are then connected by lines.

In [49]:
plt.figure()
pd.plotting.parallel_coordinates(iris,'Name')


<matplotlib.axes._subplots.AxesSubplot at 0x22d7b435b80>

This allows the viewer to more easily see any patterns or clustering:
- For instance, looking at our iris data set, we can see that petal length and petal width are two variables that split the different species fairly clearly, with Iris virginica having the longest and widest petals and Iris setosa having the shortest and narrowest petals.
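
The same reading can be practiced on a much smaller table. A minimal sketch with made-up measurements (the `demo` frame, its column names, and the class labels are all hypothetical), where two variables separate the classes and one does not:

```python
import pandas as pd

# Hypothetical measurements for two classes, three variables each.
demo = pd.DataFrame({'height': [1.0, 1.2, 3.0, 3.1],
                     'width':  [0.5, 0.4, 2.0, 2.2],
                     'depth':  [2.0, 2.1, 2.0, 1.9],
                     'kind':   ['small', 'small', 'large', 'large']})

# One vertical axis per numeric column, one polyline per row,
# colored by the class column named in the second argument.
ax = pd.plotting.parallel_coordinates(demo, 'kind')

# The legend lists each class once.
labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(labels)
```

On the `height` and `width` axes the two colors stay apart, while on the `depth` axis the lines cross and overlap, which is the visual sign of a variable that does not separate the classes.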

# Seaborn

In [139]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib notebook

sns.set_style('darkgrid')
plt.rcParams['font.size'] = 9
plt.rcParams['figure.figsize'] = (5, 6)
plt.rcParams['figure.facecolor'] = '#00000000'

In [140]:
np.random.seed(123)

# create 1000 random numbers with a mean of 0 and std of 10.
v1 = pd.Series(np.random.normal(0,10,1000), name = 'v1')
v2 = pd.Series(2*v1 + np.random.normal(60,15,1000), name = 'v2')

In [141]:
# Let's plot them.
plt.figure()
plt.hist(v1, alpha=0.7, bins=np.arange(-50, 150, 5), label='v1')
plt.hist(v2, alpha=0.7, bins=np.arange(-50, 150, 5), label='v2')

# Add x and y gridlines. The flag is passed positionally because the
# keyword `b` was renamed to `visible` in newer matplotlib releases.
plt.grid(True, color='grey',
         linestyle='-.', linewidth=0.5,
         alpha=0.3)

plt.legend()


<matplotlib.legend.Legend at 0x22d04b90790>

### - Let's draw the two histograms again, this time stacked and overlaid with a KDE plot.

In [142]:
plt.figure()
plt.hist([v1, v2], density=True, histtype='barstacked',
         bins=np.arange(-50, 150, 5))

# Add x and y gridlines (flag passed positionally, see above).
plt.grid(True, color='grey',
         linestyle='-.', linewidth=0.5,
         alpha=0.3)

# Create v3 by combining v1 and v2, then overlay its KDE.
v3 = np.concatenate((v1, v2))
sns.kdeplot(v3)


<matplotlib.axes._subplots.AxesSubplot at 0x22d102d86d0>
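
The `density` flag is what puts the bars on the same vertical scale as the KDE curve: bar heights are rescaled so that the total bar area is 1. A short NumPy check on made-up data (the sample and the bin edges are illustrative, not the notebook's `v1` and `v2`):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(0, 10, 1000)
edges = np.arange(-50, 55, 5)   # bin edges from -50 to 50 in steps of 5

# Raw counts per bin versus density-normalized heights.
counts, _ = np.histogram(sample, bins=edges)
heights, _ = np.histogram(sample, bins=edges, density=True)

# Area of the normalized histogram = sum of height * bin width.
widths = np.diff(edges)
area = (heights * widths).sum()
print(counts.sum(), round(area, 6))
```

Without normalization the bars would be counts in the tens or hundreds, and a density curve drawn on the same axes would look like a flat line near zero.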

In [143]:
plt.figure()
plt.hist([v1, v2], histtype='barstacked', density=True);
v3 = np.concatenate((v1, v2))
sns.kdeplot(v3)

# Add x and y gridlines (flag passed positionally, see above).
plt.grid(True, color='grey',
         linestyle='-.', linewidth=0.5,
         alpha=0.3);


### - Let's plot the v3 histogram again using seaborn `distplot()`.

In [144]:
plt.figure()
# Note: distplot() is deprecated as of seaborn 0.11.
sns.distplot(v3, hist_kws={'color': 'Teal'}, kde_kws={'color': 'navy'})


<matplotlib.axes._subplots.AxesSubplot at 0x22d101e79d0>

Now let's look at an example of one of the types of complex plots that Seaborn provides a convenient interface for, the joint plot.
- The jointplot creates a scatterplot along with histograms for each individual variable on each axis.
- You've actually seen jointplots in module two and created them manually yourself.

In [145]:
sns.jointplot(x=v1, y=v2, alpha=0.4);


**Using jointplot we can see that v1 and v2 appear to be normally distributed variables that are positively correlated.**
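
The visual impression can be checked numerically. The sketch below rebuilds `v1` and `v2` exactly as defined earlier and compares the sample correlation with the value implied by that construction:

```python
import numpy as np
import pandas as pd

np.random.seed(123)
v1 = pd.Series(np.random.normal(0, 10, 1000), name='v1')
v2 = pd.Series(2 * v1 + np.random.normal(60, 15, 1000), name='v2')

# Pearson correlation estimated from the sample.
r = v1.corr(v2)

# From the construction: cov(v1, v2) = 2 * 10**2 = 200 and
# std(v2) = sqrt(2**2 * 10**2 + 15**2) = 25, so r should be near
# 200 / (10 * 25) = 0.8.
expected = 200 / (10 * 25)
print(round(r, 2), expected)
```

A correlation close to 0.8 matches the elongated, upward-sloping cloud in the joint plot.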
____
Because Seaborn uses matplotlib, we can tweak the plots using matplotlib's tools:
- Some of the plotting functions in Seaborn return a matplotlib Axes object, while others operate on an entire figure and produce plots with several panels, returning a Seaborn grid object.

- For example, sns.jointplot returns a Seaborn grid object. From this we can get the matplotlib Axes of the joint plot using grid.ax_joint. Then we can set the aspect ratio to be equal using set_aspect('equal').

In [146]:
grid = sns.jointplot(x=v1, y=v2, alpha=0.4)
grid.ax_joint.set_aspect('equal')


# Hexbin plots
Hexbin plots are the bivariate counterpart to histograms. They show the number of observations that fall within hexagonal bins.

In [147]:
sns.jointplot(x=v1, y=v2, kind='hex');

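The counting behind a hexbin plot can be inspected directly with matplotlib's own `hexbin`. A minimal sketch on made-up data shaped like `v1` and `v2` (the sample, `gridsize=20`, and the colorbar label are illustrative choices):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(0, 10, 1000)
y = 2 * x + rng.normal(60, 15, 1000)

fig, ax = plt.subplots()
# Each hexagon is colored by the number of points that fall inside it.
hexes = ax.hexbin(x, y, gridsize=20, cmap='viridis')
fig.colorbar(hexes, ax=ax, label='observations per bin')

# The per-hexagon counts are stored on the returned collection.
counts = hexes.get_array()
print(int(counts.sum()), len(x))
```

Because every observation falls into exactly one hexagon, the counts add up to the sample size, just as bar counts do in a one-dimensional histogram.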

# KDE plot

In [148]:
# First, we'll change the style.
sns.set_style('white')
sns.jointplot(x=v1, y=v2, kind='kde', space=0);


## Finally, let's see how seaborn visualizes categorical data, using the `iris` data set
- Similar to pandas, seaborn has a function that creates a scatterplot matrix: pairplot().

In [149]:
sns.pairplot(iris, hue = 'Name' , diag_kind='kde');


**Looking at the pair plot, it's clear there are some clusters in the data set. It looks like petal length and petal width are good options for separating the observations, whereas sepal width is not a strong separator.**

# Violin plot = informative boxplot

In [150]:
plt.figure(figsize=(8,6))
plt.subplot(121)
sns.swarmplot(x='Name', y='PetalLength', data=iris);
plt.subplot(122)
sns.violinplot(x='Name', y='PetalLength', data=iris);


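Looking at the swarmplot, each species has its own column and each observation's petal length is shown, with more common values appearing as the wide parts of the cluster, much like a histogram. The violin plot shows the same distributions as a box plot combined with a mirrored kernel density estimate on each side.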