# CSCI 303
# Introduction to Data Science
<p/>
### 10 - Exploratory Data Analysis

![Exploratory data analysis](eda.png)

## This Lecture
---
- Explore the Boston Housing data set

The obligatory setup code...

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn as sk
import sklearn.datasets

from pandas import Series, DataFrame

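# Matplotlib 3.6 renamed the seaborn styles; fall back to the old name on older versions
style = 'seaborn-v0_8-whitegrid'
plt.style.use(style if style in plt.style.available else 'seaborn-whitegrid')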

%matplotlib inline

## The Boston Housing Dataset
---
A well-known and heavily studied dataset for statistical inference.

Available in the scikit-learn package (via `load_boston`, which was removed in version 1.2) and from many sources online.

In [None]:
raw = sk.datasets.load_boston()
boston = DataFrame(raw.data, columns=raw.feature_names)
boston['MEDV'] = raw.target
boston.head()

## Basic Statistics
---
pandas provides the `describe` method (similar to R's `summary`):

In [None]:
boston.describe()

## What Shall We Explore?
---
Some ideas:

- distributions of individual inputs
- correlations between pairs of inputs and/or the target
- your suggestion here

## Distributions
---
Often best explored via histogram.

A histogram divides the range of the data into (usually) equal-width *bins*, then counts how many samples fall into each bin.
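
To make the binning step concrete, here is a minimal sketch that computes the counts directly with `np.histogram` instead of plotting them. The data is synthetic (an assumption for illustration, not the Boston data), and the names `samples`, `counts`, `edges`, and `widths` are hypothetical.

```python
import numpy as np

# Synthetic stand-in data (an assumption; NOT the Boston data)
rng = np.random.default_rng(0)
samples = rng.normal(loc=6.0, scale=0.7, size=500)

# np.histogram returns the per-bin counts and the bin edges
counts, edges = np.histogram(samples, bins=10)

# With an integer `bins`, the edges are evenly spaced between min and max
widths = np.diff(edges)

# Each bin is half-open [left, right), except the last,
# which also includes its right edge
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.2f} to {right:5.2f}: {n}")
```

Ten bins need eleven edges, and every sample lands in exactly one bin, so the counts add up to the number of samples.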

For example, let's look at average number of rooms per dwelling.

In [None]:
plt.hist(boston['RM'])
plt.show()

Looks very close to a normal distribution, doesn't it?  We can vary the number of bins for a finer or coarser view.

In [None]:
plt.hist(boston['RM'], bins=20)
plt.show()

How about crime?

In [None]:
plt.hist(boston['CRIM'], bins=20)
plt.show()

## Correlations
---
Often best explored via a scatter plot.
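
A scatter plot shows the shape of a relationship; a correlation coefficient condenses its strength into one number. The sketch below is only an illustration on synthetic data (an assumption, not the Boston data), with hypothetical column names `x`, `y_related`, and `y_unrelated`. It uses the pandas `Series.corr` method, which computes the Pearson coefficient by default.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption; NOT the Boston data)
rng = np.random.default_rng(1)
x = rng.uniform(0, 25, size=200)
demo = pd.DataFrame({
    "x": x,
    # depends linearly on x, plus noise
    "y_related": 0.02 * x + rng.normal(0, 0.05, size=200),
    # pure noise, independent of x
    "y_unrelated": rng.normal(0, 0.05, size=200),
})

# Pearson correlation: near +1 or -1 means strongly linear, near 0 means not
r_related = demo["x"].corr(demo["y_related"])
r_unrelated = demo["x"].corr(demo["y_unrelated"])
print(f"related: {r_related:.2f}, unrelated: {r_unrelated:.2f}")
```

The coefficient only measures linear association, so it complements the scatter plot rather than replacing it.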

I theorize that there will be a correlation between percentage of industrial zoning and nitric oxide concentrations.  Let's take a look:

In [None]:
plt.scatter(boston['INDUS'], boston['NOX'])
plt.xlabel('INDUS')
plt.ylabel('NOX')
plt.show()

There seems to be an odd artifact at around 18% on the INDUS axis.

Let's take a closer look at the INDUS data.

In [None]:
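# One-unit-wide bins that span the full INDUS range, so no values are dropped
plt.hist(boston['INDUS'], bins=range(int(boston['INDUS'].max()) + 2))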
plt.show()

In [None]:
boston['INDUS'].value_counts().head()

This spike at 18.10 seems suspicious.  Some kind of default?

In [None]:
b2 = boston[boston['INDUS'] == 18.10]
b2.describe()

Four of the other columns (ZN, RAD, TAX, and PTRATIO) have a standard deviation of 0 when filtered on this value.
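
Rather than reading the zeros off the `describe` output, the constant columns can be listed programmatically. This is a minimal sketch on a tiny made-up table (the columns `zone`, `tax`, and `rooms` and their values are hypothetical, not the Boston data). A column is treated as constant within the subset when it has exactly one distinct value.

```python
import pandas as pd

# Tiny made-up table (hypothetical columns and values; NOT the Boston data)
demo = pd.DataFrame({
    "zone":  [18.1, 18.1, 18.1, 5.0, 7.2],
    "tax":   [666, 666, 666, 300, 310],
    "rooms": [5.9, 6.4, 6.1, 6.8, 7.0],
})

# Filter on the suspicious value, as in the cell above
subset = demo[demo["zone"] == 18.1]

# Count distinct values per column; exactly one means the column is constant
distinct = subset.nunique()
constant_cols = distinct[distinct == 1].index.tolist()
print(constant_cols)  # ['zone', 'tax']
```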

In [None]:
b2

What are the chances that 132 sequential entries have the same data for ZN, INDUS, RAD, TAX, and PTRATIO?
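
The "sequential" part of that question can be checked rather than eyeballed. Below is a rough sketch with a hypothetical helper, `is_contiguous_run`, applied to a small made-up column (not the Boston data): it takes the row positions where a boolean mask is true and tests whether they form one unbroken run.

```python
import numpy as np
import pandas as pd

def is_contiguous_run(mask):
    """Return True if the True entries of a boolean Series form one unbroken run."""
    positions = np.flatnonzero(mask.to_numpy())
    # Consecutive row positions differ by exactly 1
    return positions.size > 0 and bool(np.all(np.diff(positions) == 1))

# Made-up column (hypothetical values; NOT the Boston data)
demo = pd.DataFrame({"zone": [2.0, 18.1, 18.1, 18.1, 5.0, 18.1]})

broken = is_contiguous_run(demo["zone"] == 18.1)               # rows 1, 2, 3, 5
unbroken = is_contiguous_run(demo["zone"].iloc[:5] == 18.1)    # rows 1, 2, 3
print(broken, unbroken)  # False True
```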

Let's set this aside for a moment and explore some other correlations.

We can drive plots directly from pandas, too, which provides some extra benefits, such as automatic axis labeling.

Let's look at number of rooms versus median value:

In [None]:
boston.plot(kind='scatter', x='RM', y='MEDV')
plt.show()

Not too surprisingly, there seems to be a strong correlation between number of rooms and median value.

Now, though, we seem to have some other "suspicious" data - look at all those houses at the top!

In [None]:
boston['MEDV'].plot(kind='hist', bins=15)

In [None]:
boston['MEDV'].value_counts().iloc[:10]

In [None]:
b3 = boston[boston['MEDV']==50]
b3

I'm quite suspicious that this value is some kind of data-entry default.

1. It's a round number
2. It's the maximum
3. It explains at least some of the big outliers: tracts where the average number of rooms per dwelling is below 5 yet the median value is $50,000 (with nothing else obviously exceptional going on)
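
The first two clues can be checked mechanically: find the column maximum and count how many rows sit exactly on it. The sketch below uses a short made-up series (hypothetical values, not the Boston data); a pile-up at the maximum is only a hint of a cap or default, not proof.

```python
import pandas as pd

# Made-up values (hypothetical; NOT the Boston data)
values = pd.Series([21.6, 50.0, 34.7, 50.0, 33.4, 50.0, 28.7, 22.9])

top = values.max()
n_at_top = int((values == top).sum())   # rows sitting exactly on the maximum
share_at_top = n_at_top / len(values)

print(top, n_at_top, share_at_top)  # 50.0 3 0.375
```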

For now, let's remove that data. It might not be justified, but without access to the original data collection info, it makes the most sense to me.

In [None]:
bfix1 = boston[boston['MEDV'] != 50.0]
bfix1.plot(kind='scatter', x='RM', y='MEDV')

I'm curious about some of these other outliers.  I'm going to add in some other variables using color cues, just to see if they highlight the outliers.

In [None]:
bfix1.plot(kind='scatter', x='RM', y='MEDV', c='CHAS', colormap='Accent')

That wasn't helpful.  What about our industrial zoning variable?

In [None]:
bfix1.plot(kind='scatter', x='RM', y='MEDV', c='INDUS', colormap='Blues_r')

Hm.  I have a theory... not much of one, though.

In [None]:
bfix2 = bfix1[bfix1['INDUS'] != 18.10]
bfix2.plot(kind='scatter', x='RM', y='MEDV', c='INDUS', colormap='Blues_r')

So this plot now actually makes sense; all the outliers vanished when we removed some suspicious data.

On the other hand, we almost certainly lost some good data.

Was removing data the right thing to do?

In [None]:
bfix2.shape, boston.shape

Other questions we could explore:
    
- what is the deal with houses on the Charles River?
- how do each of the remaining variables correlate with median value?
- how do DIS, RAD, and INDUS relate to each other?
- is there a relationship between crime and the age of the neighborhood?
- is PTRATIO relevant to anything?
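
For the second question, one possible starting point is to rank the inputs by their correlation with the target. This is a sketch under stated assumptions: the data is synthetic and the column names `rooms`, `pollution`, `noise`, and `value` are hypothetical stand-ins, not the Boston columns. `DataFrame.corr` computes pairwise Pearson correlations, which only capture linear relationships.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption; NOT the Boston data)
rng = np.random.default_rng(2)
n = 300
rooms = rng.normal(6, 0.7, n)
pollution = rng.normal(0.5, 0.1, n)
demo = pd.DataFrame({
    "rooms": rooms,
    "pollution": pollution,
    "noise": rng.normal(0, 1, n),   # unrelated to the target
    "value": 8 * rooms - 20 * pollution + rng.normal(0, 2, n),
})

# Correlation of every input with the target, sorted from most negative
# to most positive
corr_with_target = demo.corr()["value"].drop("value").sort_values()
print(corr_with_target)
```

On the real data the same pattern would apply with `MEDV` as the target column, followed by a scatter plot for any input that stands out.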

In [None]:
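# Overlay the two groups; transparency and a legend keep them distinguishable
plt.hist(bfix2[bfix2['CHAS']==0]['MEDV'], bins=7, alpha=0.5, label='CHAS = 0')
plt.hist(bfix2[bfix2['CHAS']==1]['MEDV'], bins=7, alpha=0.5, label='CHAS = 1')
plt.legend()
plt.show()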

In [None]:
# Bin edges run up to 50 so that values above 45 are not silently dropped
plt.subplot(2,1,1)
plt.hist(bfix2[bfix2['CHAS']==0]['MEDV'], bins=range(5,55,5))
plt.subplot(2,1,2)
plt.hist(bfix2[bfix2['CHAS']==1]['MEDV'], bins=range(5,55,5), color='red')
plt.show()

In [None]:
bfix2['CHAS'].value_counts()

In [None]:
plt.subplot(2,1,1)
plt.hist(boston[boston['CHAS']==0]['MEDV'], bins=range(5,55,5))
plt.subplot(2,1,2)
plt.hist(boston[boston['CHAS']==1]['MEDV'], bins=range(5,55,5), color='red')
plt.show()

In [None]:
# For each input: its distribution (left) and its relationship to MEDV (right)
for f in raw.feature_names:
    plt.subplot(1,2,1)
    plt.hist(bfix2[f])
    plt.xlabel(f)
    plt.subplot(1,2,2)
    plt.scatter(bfix2[f], bfix2['MEDV'])
    plt.xlabel(f)
    plt.ylabel('MEDV')
    plt.show()

In [None]:
bfix2.plot(kind='scatter', x='INDUS', y='DIS', c='RAD', colormap='Blues')

In [None]:
bfix2.plot(kind='scatter', x='AGE', y='CRIM')