<DIV ALIGN=CENTER>

# Data Exploration
## Professor Robert J. Brunner
  
</DIV>  
-----
-----

## Introduction

To demonstrate visual data exploration, we first need a suitable data
set. We can make use of the flight data for this task, for which we
generate a clean, compact data representation in the following code
block (this task is described in more detail in the [Introduction to
Data Exploration](https://github.com/ProfessorBrunner/rp-pds15/blob/master/Week7/intro2de.ipynb)
Notebook from the Spring 2015 Practical Data Science course held at the
University of Illinois Research Park).

Once we have a clean data set, the next step in the data exploration
process is to gain a better understanding of our data set. The easiest
way to do this is to visually explore the data. In this Notebook, we use
several visualization techniques in the Seaborn library to learn more
about the flights data. We quickly check the data over and then
generate a trimmed data set (by keeping every 5000th row) before we
begin to make visualizations.
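
As a quick sanity check on the trimming step, the sketch below counts how many rows a stride of 5000 keeps. It uses the row count of 5,723,673 that `data.describe()` reports for this data set and a plain Python `range`, which follows the same start/stop/step rule as the DataFrame slice `data[::5000]`.

```python
# Row count reported by data.describe() for the cleaned flight data.
n_rows = 5723673
step = 5000

# A slice [::step] keeps positions 0, step, 2*step, ... below n_rows.
kept_positions = range(0, n_rows, step)

print(len(kept_positions))       # 1145 rows in the trimmed data set
print(list(kept_positions[:3]))  # [0, 5000, 10000]
```

These are positions, not index labels. The labels printed by `tdata.head()` (0, 5232, 10406, ...) are larger because `dropna` removed rows while keeping the original labels.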

-----

In [1]:
import numpy as np
import pandas as pd

ucs=(1, 3, 4, 14, 15, 16, 17, 18)

cnms = ['month', 'Day', 'dTime', 'aDelay', 'dDelay', 'depart', 'arrive', 'distance']

newdata = pd.read_csv('/home/data_scientist/data/2001.csv', #dtype=np.float32, 
                      header=0, na_values=['NA'], usecols=ucs, names = cnms)

newdata = newdata.dropna()

dts = [np.uint8, np.uint8, np.uint16, np.int16, np.int16, object, object, np.uint16]

data = pd.DataFrame()

for i in range(len(cnms)):
    data[[cnms[i]]] = newdata[[cnms[i]]].astype(dts[i])

In [2]:
data.describe()

| | month | Day | dTime | aDelay | dDelay | distance |
|---|---|---|---|---|---|---|
| count | 5723673.0 | 5723673.0 | 5723673.0 | 5723673.0 | 5723673.0 | 5723673.0 |
| mean | 6.291581 | 3.949829 | 1348.688044 | 5.528249 | 8.115272 | 735.173682 |
| std | 3.381755 | 1.997942 | 482.638757 | 31.429291 | 28.234083 | 574.815182 |
| min | 1.0 | 1.0 | 1.0 | -1116.0 | -204.0 | 21.0 |
| 25% | 3.0 | 2.0 | 930.0 | -9.0 | -3.0 | 314.0 |
| 50% | 6.0 | 4.0 | 1333.0 | -2.0 | 0.0 | 575.0 |
| 75% | 9.0 | 6.0 | 1740.0 | 10.0 | 6.0 | 983.0 |
| max | 12.0 | 7.0 | 2400.0 | 1688.0 | 1692.0 | 4962.0 |


In [3]:
# Let's generate a trimmed data set to speed up exploration

tdata = data[::5000]
print(len(tdata))
print(tdata.head())

1145
       month  Day  dTime  aDelay  dDelay depart arrive  distance
0          1    3   1806      -3      -4    BWI    CLT       361
5232       1    3    706       2      -4    BUF    PIT       186
10406      1    4   1930      -2      -5    GSP    LGA       610
15577      1    5   2035      55      50    PHL    BWI        90
20790      1    1    624     -17      -6    BOS    CLT       728


-----

We can check which version of Seaborn is installed by using the built-in
`help` function to display basic information about the Seaborn module.
You can do this in your own Notebook; my container has Seaborn version
0.6.0 installed. This is expected, since according to the class
Dockerfile, Seaborn was installed by using `pip3`, which retrieves the
latest version of the module from the Python Package Index.
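
A more direct check than reading the `help` output is to ask the packaging metadata for the installed version. The sketch below is a minimal helper under stated assumptions: `installed_version` is a hypothetical name introduced here, and it relies on `importlib.metadata`, which assumes Python 3.8 or later.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the Seaborn version in this environment, or None if it is not installed.
print(installed_version("seaborn"))
```

Inside a Notebook where Seaborn is already imported as `sns`, printing `sns.__version__` gives the same information.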

The next step is to make a pair plot that allows an easy visual
comparison of the different dimensions in our data set. We do this in
Seaborn by using the `PairGrid` object to make scatter plots between
each pair of dimensions.

-----

In [4]:
%matplotlib notebook

In [5]:
import matplotlib as mpl
import matplotlib.pyplot as plt

import seaborn as sns

sns.set(style="white")

pg = sns.PairGrid(tdata)

pg.map(plt.scatter)


<seaborn.axisgrid.PairGrid at 0x7f68aa49fa58>

-----

### Pair Plotting

We can select specific columns to use to make pair plots by using
Seaborn. For example, in the previous set of plots, the `month` and `Day`
columns do not visually provide significant information. We can remove
those and plot the data of interest. In this case, we also explicitly
set the axis limits of our plot to help highlight visual trends.


-----

In [6]:
pg = sns.PairGrid(tdata[['dTime', 'aDelay', 'dDelay', 'distance']])

pg.map_diag(plt.hist)
pg.map_offdiag(plt.scatter)

# Let's explicitly set the axes limits
axes = pg.axes

xlim = [(0, 2400), (-30, 70), (-30, 70), (0, 3000)]
ylim = [(0, 2400), (-30, 70), (-30, 70), (0, 3000)]

for i in range(len(xlim)):
    for j in range(len(ylim)):
        axes[i, j].set_xlim(xlim[j])
        axes[i, j].set_ylim(ylim[i])




-----

The previous plots showed the aggregate data, but we can also
differentiate the data by day of the week. To do this, we simply add a
new column that contains the name of the day as a string (as opposed to
the numerical value). We can add this column to our small data set by
using a lambda function that uses the numerical day as an index into our
list of strings. Note that we should verify this linear mapping, since
assumptions like this, if incorrect, can cause significant problems later.
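
One way to verify such a mapping is to compare it against a date whose weekday is known; for example, 9/11/01 was a Tuesday. The sketch below is only an illustration: `check_mapping` is a hypothetical helper, and the coded day values passed to it are made-up examples, since the real value would have to be looked up in the raw flight data for that date.

```python
from datetime import date

# date.weekday() returns 0 for Monday through 6 for Sunday.
ISO_NAMES = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']


def check_mapping(names, known_date, coded_day):
    """Return True if names[coded_day - 1] is the true weekday of known_date."""
    actual = ISO_NAMES[known_date.weekday()]
    return names[coded_day - 1] == actual


sunday_first = ['Sunday', 'Monday', 'Tuesday', 'Wednesday',
                'Thursday', 'Friday', 'Saturday']

known = date(2001, 9, 11)  # a Tuesday

# Hypothetical coded values: a Sunday-first list is only right if the
# data set stores this date as day 3; if it stores 2, the list is off by one.
print(check_mapping(sunday_first, known, 3))  # True
print(check_mapping(sunday_first, known, 2))  # False
```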

-----

In [7]:
# First we create a simple list of the week day names

dow = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']

# Now we add a column by converting from the int value to an index into our list.
# tdata was created by slicing data, so we first take an explicit copy;
# this avoids the pandas SettingWithCopyWarning when assigning a new column.

tdata = tdata.copy()
tdata['DoW'] = tdata['Day'].apply(lambda x: dow[int(x - 1)])


In [8]:
# We can use the shorthand pairplot function

pp = sns.pairplot(tdata[['dTime', 'aDelay', 'dDelay', 'distance', 'DoW']], 
                  hue='DoW', palette="Blues_d", diag_kind="kde")

axes = pp.axes

xlim = [(0, 2400), (-30, 70), (-30, 70), (0, 3000)]
ylim = [(0, 2400), (-30, 70), (-30, 70), (0, 3000)]

for i in range(len(xlim)):
    for j in range(len(ylim)):
        axes[i, j].set_xlim(xlim[j])
        axes[i, j].set_ylim(ylim[i])


-----

In this case, the additional colored day information didn't produce any
new insights. Sometimes this is due to the small size of the plots, so
we can try to make a larger plot that might gain clarity by
distinguishing data from different days of the week. In the following
plot, we focus on the relationship between departure and arrival delays,
colored by day of the week. While no obvious differences are noted
here, the basic idea is an important one to learn.

-----

In [9]:
# Let's specify a color palette by using a runtime context.
with sns.color_palette("Blues_d", 7):
    
    # Make a scatter plot with no regression fit
    lviz = sns.lmplot('dDelay', 'aDelay', tdata, hue='Day', size = 8, fit_reg=False)
    
    # Change the axes limits

    ltmp = lviz.set(xlim=(-30, 70), ylim=(-45, 70))

# Make plot visually less cluttered
sns.despine(offset=0, trim=True)
sns.set(style="ticks", font_scale=2.0)


-----

### Box Plots

In some cases, we simply want to compare aggregate distributions, for
example, the typical departure delay as a function of day. Here we can
use a box plot to summarize the distribution of the data for each day.
The box plot displays a box whose lower and upper edges mark the first
and third quartiles, so the box encloses the middle half of the data. A
box plot also has a line through the box that indicates the median
value, and whiskers that extend from the box to indicate the typical
extent of the majority of the data. Points that are considered extreme
outliers are displayed outside the whiskers. The plot below demonstrates
the variation in departure delay as a function of the day of the week.
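
The quantities a box plot draws can be computed directly. The sketch below is a minimal illustration on made-up delay values: `box_stats` is a hypothetical helper, and the whisker rule (1.5 times the interquartile range, the common matplotlib default) is an assumption about how the whiskers are placed rather than something this Notebook states. It uses `statistics.quantiles`, which assumes Python 3.8 or later.

```python
from statistics import quantiles


def box_stats(values, whisker=1.5):
    """Return quartiles, whisker ends, and outliers for a box plot."""
    data = sorted(values)
    # The "inclusive" method interpolates linearly between data points.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    # Whiskers end at the most extreme points that lie inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {"q1": q1, "median": median, "q3": q3,
            "whiskers": (inside[0], inside[-1]), "outliers": outliers}


delays = [-5, -3, -2, 0, 1, 4, 6, 10, 55]  # made-up delays in minutes
stats = box_stats(delays)
print(stats)
```

For these values the box spans -2 to 6 with the median line at 1, the whiskers reach -5 and 10, and the single value 55 is drawn as an outlier.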

-----

In [10]:
sns.boxplot(tdata.Day, tdata.dDelay)#, tdata.Day)

<matplotlib.axes._subplots.AxesSubplot at 0x7f6899e230f0>

-----

### Violin Plot

Another useful visualization tool is the violin plot, which is similar to
the box plot, but rather than a box that indicates the quartiles and a
median value, the violin plot changes shape based on the actual
distribution of the data. A squatter shape indicates a more compact data
distribution, while a longer shape indicates a more varied distribution.
The violin plot also displays the full range of the data, as shown below
(where we see that Thursdays have long departure delays on average).
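
The outline of a violin is typically a kernel density estimate of the data. The sketch below is a minimal Gaussian kernel density estimate on made-up delay values, meant only to show where the shape comes from: `gaussian_kde` is a hypothetical helper written here, and the fixed bandwidth of 3 is an assumption (Seaborn chooses a bandwidth automatically).

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, centered on that sample.
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)

    return density


delays = [-4, -2, -1, 0, 0, 1, 3, 5, 40]  # made-up delays in minutes
density = gaussian_kde(delays, bandwidth=3.0)

# The violin is widest where the density is highest.
for x in (0, 20, 40):
    print(x, density(x))
```

The density is largest near 0, where most values cluster, nearly vanishes around 20, and shows a small bump at 40 from the single late flight.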

-----

In [11]:
sns.violinplot(tdata.Day, tdata.dDelay)

<matplotlib.axes._subplots.AxesSubplot at 0x7f6899e230f0>

-----

### Histograms

Histograms are a great way to visually summarize information. But rather
than always overplotting similar histograms, we can use the Seaborn
`FacetGrid` object to compare histograms. We demonstrate this below, by
first comparing the distances of different flights for different days,
and second by comparing the departure delays for different days.

-----

In [12]:
# Plot a histogram of flight distance (below 3000) for each day of the week.

viz = sns.FacetGrid(tdata[tdata.distance < 3000], row="DoW", hue="DoW", palette="Blues_d",
                  size=1.7, aspect=4) #, hue_order=days, row_order=days)
viz.map(plt.hist, "distance", bins=30)


<seaborn.axisgrid.FacetGrid at 0x7f68984de4e0>

In [13]:
# Plot a histogram of departure delays between 0 and 30 for each day of the week.

viz = sns.FacetGrid(tdata[(tdata.dDelay < 30) & (tdata.dDelay > 0)], row="DoW", hue="DoW", palette="Blues_d",
                  size=1.7, aspect=4)
viz.map(plt.hist, "dDelay", bins=10)


<seaborn.axisgrid.FacetGrid at 0x7f6898039da0>

-----

Sometimes small changes can improve the visual clarity of a result. Here
we first select only flights departing from the Atlanta airport in
January and February. We also increase the number of bins and the data
range, which highlights the differences and similarities between
departure delays for different days of the week. We could also use these
distributions to model the data, using regression to quantify these
differences statistically, for example with the Gamma distribution.
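
The text does not spell out the modeling step, so the sketch below shows only one simple way to attach Gamma parameters to a set of positive delays. It uses the method of moments, a swapped-in technique rather than the regression mentioned above; `gamma_moments_fit` is a hypothetical helper and the delay values are made up.

```python
from statistics import mean, pvariance


def gamma_moments_fit(values):
    """Estimate Gamma shape and scale by matching the mean and variance."""
    m = mean(values)
    v = pvariance(values)
    if m <= 0 or v <= 0:
        raise ValueError("need positive mean and non-zero variance")
    # For a Gamma distribution: mean = shape * scale, variance = shape * scale**2.
    shape = m * m / v
    scale = v / m
    return shape, scale


delays = [2, 3, 5, 8, 12, 20, 34]  # made-up positive delays in minutes
shape, scale = gamma_moments_fit(delays)
print(shape, scale)
```

Fitting each day of the week separately and comparing the resulting shape and scale values would give a rough numerical summary of how the histograms differ.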

-----

In [14]:
# Use the full data set, restricted to Atlanta departures in January and February.

viz = sns.FacetGrid(data[(data.dDelay < 90) & (data.aDelay > 0) & data.depart.str.contains('ATL') & 
                         (data.month < 3)], row="Day", hue="Day", palette="Blues_d",
                  size=1.7, aspect=4)
viz.map(plt.hist, "dDelay", bins=30, normed=False)


<seaborn.axisgrid.FacetGrid at 0x7f6897f1e5f8>

-----

### Density Plots

For large data sets, it is often more illustrative to see the density of
points. We demonstrate this below by using the Seaborn `jointplot`
function. This creates a binned version of the data; in the example
below, this is done using hexagonal binning. The `jointplot` also
displays the marginal distributions along each dimension for further
clarity.
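
The idea behind a binned density plot is to count points per cell rather than draw each point. The sketch below uses square cells as a simpler stand-in for the hexagonal cells that `kind="hex"` produces; `bin_counts` is a hypothetical helper and the points are made-up (departure delay, arrival delay) pairs.

```python
from collections import Counter


def bin_counts(points, bin_width):
    """Count how many (x, y) points fall into each square cell."""
    counts = Counter()
    for x, y in points:
        # Floor division assigns each coordinate to the cell that contains it.
        cell = (int(x // bin_width), int(y // bin_width))
        counts[cell] += 1
    return counts


points = [(6, 7), (8, 9), (12, 14), (13, 11), (14, 19), (31, 35)]
counts = bin_counts(points, bin_width=10)

for cell, n in sorted(counts.items()):
    print(cell, n)
```

A density plot then colors each cell by its count, so crowded regions stand out even when individual points would overlap.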

-----

In [15]:
jp = sns.jointplot('dDelay', 'aDelay', tdata[tdata.aDelay > 5], kind="hex", 
                   stat_func=None, color="#8855AA", xlim=(5, 50), ylim=(5, 50))


-----

Earlier we tried to distinguish the data for different days of the week
on the same plot. In some cases, it is easier to simply plot the data in
separate subplots that are side-by-side. We can do this in Seaborn by
using a `FacetGrid` and indicating which DataFrame column should be used
to assign each row to the correct subplot. We also can wrap the subplots
to limit the width. Below we plot the departure versus arrival delays
for each day separately, wrapping the subplots to three columns. We also
color each subplot differently.

-----


In [16]:
# Use every 100th row of the full data set, with one subplot per day.

viz = sns.FacetGrid(data[::100], col="Day", col_wrap=3, hue="Day", palette="Blues_d", size=4)
viz.map(sns.regplot, 'aDelay', 'dDelay')
viz.set(xlim = (-20, 50), ylim=(-20, 50))



<seaborn.axisgrid.FacetGrid at 0x7f6897620940>

-----

### Heatmaps

We also can compare aggregate statistics, for example, mean departure
delay, as a function of two attributes. To create this visualization, we
first need to create a pivot table from our DataFrame. To do this, we
group our data by `month` and `Day`, compute the mean on the grouped
data, and display the result as shown in the next code cell.

-----

In [17]:
# Let's make a Pivot Table

# First we group the data by Month and Day
df = tdata.groupby(['month', 'Day'])
dd = df.mean()
dd.head()

| month | Day | dTime | aDelay | dDelay | distance |
|---|---|---|---|---|---|
| 1 | 1 | 1323.277778 | -1.888889 | 1.666667 | 629.555556 |
| 1 | 2 | 1217.666667 | 6.066667 | 8.933333 | 594.533333 |
| 1 | 3 | 1304.888889 | -6.166667 | -1.277778 | 772.277778 |
| 1 | 4 | 1565.571429 | 7.071429 | 9.571429 | 553.428571 |
| 1 | 5 | 1538.6 | 7.133333 | 10.0 | 821.4 |

-----

This new DataFrame has a multi-index, where two columns have been used
to index rows, as opposed to the standard single column index. Before we
can make a pivot table, we first need to reset the index. We next change
the departure delay to an integer value, which will display more nicely
than a floating point value in our heatmap. We finally pivot the data to
make the pivot table before displaying the result as a heatmap.
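
The group, reset, convert, and pivot steps can be followed on a tiny table where the means are easy to check by hand. The sketch below is an illustration only: the `toy` DataFrame holds made-up values, and the keyword form of `pivot` is used because it works across pandas versions.

```python
import pandas as pd

# Made-up data: two months, two days, two flights for each combination.
toy = pd.DataFrame({
    'month':  [1, 1, 1, 1, 2, 2, 2, 2],
    'Day':    [1, 1, 2, 2, 1, 1, 2, 2],
    'dDelay': [4, 6, 10, 12, -2, 0, 7, 8],
})

# Group by both columns; the result is indexed by (month, Day).
means = toy.groupby(['month', 'Day']).mean()

# Turn the multi-index back into ordinary columns.
flat = means.reset_index()

# astype(int) truncates toward zero, so a mean of 7.5 becomes 7.
flat['dDelay'] = flat['dDelay'].astype(int)

# Rows are months, columns are days, cells are the integer mean delay.
table = flat.pivot(index='month', columns='Day', values='dDelay')
print(table)
```

Because the conversion truncates rather than rounds, applying `round()` before `astype(int)` is an option when the nearest integer is wanted.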

-----

In [18]:
# Now we reset the Index (to make a regular DataFrame)
# and convert the dDelay column to integer

dd.reset_index(inplace=True)  
dd['dDelay'] = dd['dDelay'].astype(int)

# Now we pivot the DataFrame to make a Matrix with values encoded.
dp = dd.pivot('month', 'Day', 'dDelay')

In [19]:
# Now we can plot the heatmap with values in place

f = plt.figure(figsize=(8, 10))

sns.heatmap(dp, annot=True, fmt='d')


#sns.despine(offset=0, trim=True)
sns.set(style="ticks", font_scale=2.0)


-----

We can also use Seaborn to make a scatter plot of data that is colored
by a third attribute, in this case day of the week. We also can have
Seaborn calculate a linear regression to the data to indicate potential
differences. We do this below to compare departure and arrival delays
for Thursday and Friday.

-----

In [20]:
# Make a scatter plot that is colored by day.
# Also fit a linear regression to each day.

lviz = sns.lmplot('dDelay', 'aDelay', 
                  tdata[(tdata.aDelay > 1) & ((tdata.Day == 5) | (tdata.Day == 6))], 
                  hue='Day', size = 8, ci=68)

ltmp = lviz.set(xlim=(-30, 70), ylim=(-30, 70))

sns.despine(offset=0, trim=True)
sns.set(style="ticks", font_scale=2.0)


-----

One issue to be wary of is extrapolating too much from a limited data
set. We can demonstrate this by making a heatmap that compares the mean
departure delay as a function of day of the week and month for the
entire flights data. We do this below, by first grouping the full data
set by `month` and `Day`, calculating the mean, and pivoting the
DataFrame. We use this new pivot table to make the Seaborn heatmap,
which does look different than the similar heatmap derived from a
limited amount of data.

-----

In [21]:
# First we group the full data set by month and Day; below we reset the
# Index (to make a regular DataFrame) and convert the dDelay column to integer

dff = data.groupby(['month', 'Day'])

# Can change the statistical function to one of min, max, sum, mean, std
# 9/11/01 was a Tuesday.

ddf = dff.mean()

ddf.reset_index(inplace=True)  
ddf['dDelay'] = ddf['dDelay'].astype(int)

# Now we pivot the DataFrame to make a Matrix with values encoded.
dpf = ddf.pivot('month', 'Day', 'dDelay')

In [22]:
# Now we can plot the heatmap with values in place

f = plt.figure(figsize=(8, 10))

sns.heatmap(dpf, annot=True, fmt='d')


#sns.despine(offset=0, trim=True)
sns.set(style="ticks", font_scale=2.0)


