In [None]:
%matplotlib inline
import matplotlib
import seaborn as sns
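# Double the resolution of saved figures. savefig.dpi may be the string 'figure',
# so scale the numeric figure.dpi instead.
matplotlib.rcParams['savefig.dpi'] = 2 * matplotlib.rcParams['figure.dpi']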

In [None]:
import pandas as pd
import numpy as np

## Describing a distribution
* Mean
* Median
* Variance
* Standard Deviation

Summary statistics often provide important insight into the data and can reveal information that is not visually obvious. However, it's important to consider their limitations as well, and to think about what visual exploration adds.
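
As a minimal sketch of the four statistics listed above, the following computes them with NumPy on a small made-up sample (the values are illustrative and not taken from any dataset in this notebook). It also appends a single extreme value to show how the mean and standard deviation react while the median stays put.

```python
import numpy as np

# Hypothetical sample: ten values clustered around 50
sample = np.array([48, 49, 50, 50, 51, 51, 52, 49, 50, 50], dtype=float)
# The same sample with one extreme value appended
with_outlier = np.append(sample, 500.0)

def summarize(values):
    """Return mean, median, sample variance, and sample standard deviation."""
    return {
        "mean": np.mean(values),
        "median": np.median(values),
        "variance": np.var(values, ddof=1),  # ddof=1 gives the sample variance
        "std": np.std(values, ddof=1),
    }

clean_stats = summarize(sample)
outlier_stats = summarize(with_outlier)

# Print each statistic without and with the outlier
for name in ("mean", "median", "variance", "std"):
    print("{:>8}: {:10.2f} -> {:10.2f}".format(
        name, clean_stats[name], outlier_stats[name]))
```

The median is the only one of the four that is unaffected here, which is one reason it is often reported alongside the mean.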

Outliers are a good place to start: they are easy to spot visually, but they can have a deceptively large influence on summary statistics. Consider [Anscombe's Quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet), a set of four datasets with nearly identical aggregate properties:

In [None]:
aq = sns.load_dataset("anscombe")

In [None]:
print(aq[aq['dataset'] == 'I'].describe())
print(aq[aq['dataset'] == 'II'].describe())

In [None]:
# `height` replaces the older `size` argument in recent seaborn releases
sns.lmplot(x="x", y="y", col="dataset", hue="dataset", data=aq,
           col_wrap=2, ci=None, palette="muted", height=4,
           scatter_kws={"s": 50, "alpha": 1})
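
To make "nearly identical aggregate properties" concrete without relying on a download, here is a small sketch that hard-codes datasets I and II of the quartet (the published Anscombe values, which should match what `sns.load_dataset("anscombe")` returns) and compares a few aggregates with NumPy. The helper name `aggregate` is just for illustration.

```python
import numpy as np

# Datasets I and II share the same x values
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
               7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
               6.13, 3.10, 9.13, 7.26, 4.74])

def aggregate(x, y):
    """Return (mean of y, sample variance of y, Pearson correlation of x and y)."""
    return (np.mean(y), np.var(y, ddof=1), np.corrcoef(x, y)[0, 1])

stats_I = aggregate(x, y1)
stats_II = aggregate(x, y2)

print("dataset I : mean={:.3f} var={:.3f} r={:.3f}".format(*stats_I))
print("dataset II: mean={:.3f} var={:.3f} r={:.3f}".format(*stats_II))
```

The two lines of output agree to roughly two decimal places even though the scatter plots above look nothing alike.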

*Question*: You're told that the mean starting salary for a Data Scientist is \$110,000. What are two **non-visual** methods of determining whether the distribution is normal or bimodal (with many positions at ~\$140k and many at ~\$80k)?

## Histograms

In [None]:
tips = sns.load_dataset("tips")

In [None]:
print(tips.shape)
tips[:5]

In [None]:
tips_hist = tips.hist()

In [None]:
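# Side note on saving figures to disk
# DataFrame.hist() returns an array of Axes, so this won't work:
#   tips_fig = tips_hist.get_figure()
# Series.hist() returns a single Axes, which does have get_figure()
tip_hist = tips['tip'].hist()
tips_fig = tip_hist.get_figure()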

In [None]:
# these will work
# tips_fig.savefig('tiphist.png')
# tips_fig.savefig('tiphist.pdf')

In [None]:
sns.jointplot(x='total_bill', y='tip', data=tips)

## Relationships between variables

### Linear correlation
The most common metric is [Pearson's](https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient) correlation coefficient (the covariance normalized by the product of the standard deviations), which ranges from -1 (total negative linear correlation) through 0 (no linear correlation) to 1 (total positive linear correlation).
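
The definition above can be sketched directly in NumPy. The `pearson` helper and the bill and tip values below are made up for illustration (they are not rows from the `tips` dataset); the result is compared against `np.corrcoef` as a sanity check.

```python
import numpy as np

def pearson(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    covariance = np.mean((x - x.mean()) * (y - y.mean()))
    # np.std defaults to the population form, matching the covariance above
    return covariance / (x.std() * y.std())

# Hypothetical bills and tips
bill = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
tip = np.array([1.5, 3.0, 4.0, 6.5, 7.0])

r_manual = pearson(bill, tip)
r_numpy = np.corrcoef(bill, tip)[0, 1]
r_perfect_neg = pearson(bill, -bill)  # a perfectly decreasing relationship

print("manual r:", r_manual)
print("numpy r: ", r_numpy)
print("r for y = -x:", r_perfect_neg)
```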

In [None]:
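# Restrict to numeric columns; recent pandas versions raise on categorical columns
tips.select_dtypes(include='number').corr(method='pearson')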

In [None]:
# scatter_matrix lives in pandas.plotting in current pandas releases
pd.plotting.scatter_matrix(tips, alpha=0.2, figsize=(6, 6), diagonal='kde')
# a similar plot is available in seaborn as pairplot()

### Indirect Influence / constraints
- e.g. speed is highly correlated with accidents only if driving on the highway
- In practice this mostly comes down to looking carefully at subsets of the data, edge cases, etc.
- Leave one out for predictive models
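
The highway example in the list above can be sketched with synthetic data. Everything here is an assumption made for illustration: the toy model (accidents rise with speed only on the highway), the variable names, and the noise levels. The point is only that a correlation computed on the full data can hide a much stronger relationship inside a subset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical observations: half on the highway, half elsewhere
on_highway = rng.random(n) < 0.5
speed = rng.uniform(30, 80, size=n)

# Assumed toy model: the accident score depends on speed only on the highway
accidents = np.where(on_highway, 0.1 * speed, 2.0) + rng.normal(0, 0.5, size=n)

def corr(mask):
    """Pearson correlation of speed and accidents restricted to `mask`."""
    return np.corrcoef(speed[mask], accidents[mask])[0, 1]

r_all = corr(np.ones(n, dtype=bool))
r_highway = corr(on_highway)
r_other = corr(~on_highway)

print("all rows:     r = {:.2f}".format(r_all))
print("highway only: r = {:.2f}".format(r_highway))
print("other roads:  r = {:.2f}".format(r_other))
```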

In [None]:
print(tips['tip'].mean())
print(tips[tips['size'] > 1]['tip'].mean())
print(tips[tips['size'] == 1]['tip'].mean())

*Question*: How meaningful is the above? What else do we need to consider?

In [None]:
sns.lmplot(x='total_bill', y='tip', hue='time', data=tips, palette="Set2")

## Nonobvious patterns in the data
### Autocorrelation

In [None]:
from pandas.plotting import autocorrelation_plot, lag_plot

In [None]:
# Get temperature data
# Timestamps look like "%Y-%m-%d %H:%M:%S", which parse_dates handles without
# a custom date_parser (pd.datetime is no longer available in current pandas)
temps_df = pd.read_csv("small_data/temperatures.csv",
                       index_col=0,
                       names=["Temperature"],
                       parse_dates=True)

# Get GOOG data
import json

# Parse the JSON file once, straight from the file handle
with open('small_data/goog.json') as raw_f:
    json_data = json.load(raw_f)

goog_df = pd.DataFrame(json_data['data'], columns=json_data['column_names'])

In [None]:
autocorrelation_plot(goog_df['Open'])

In [None]:
autocorrelation_plot(temps_df['Temperature'])
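
For intuition about what such a plot measures, here is a minimal sketch of a lag-k sample autocorrelation computed by hand. The series is synthetic (an assumed daily temperature cycle with noise, not the contents of `small_data/temperatures.csv`), and the `autocorr` helper is a common textbook estimator rather than a claim about pandas internals.

```python
import numpy as np

def autocorr(series, lag):
    """Sample autocorrelation at `lag`, normalized by the lag-0 sum of squares."""
    values = np.asarray(series, dtype=float)
    centered = values - values.mean()
    c0 = np.sum(centered ** 2)
    return np.sum(centered[:len(values) - lag] * centered[lag:]) / c0

# Hypothetical hourly temperatures over 20 days with a 24-hour cycle
rng = np.random.default_rng(1)
hours = np.arange(24 * 20)
temperature = (15 + 5 * np.sin(2 * np.pi * hours / 24)
               + rng.normal(0, 0.5, size=hours.size))

r_day = autocorr(temperature, 24)       # one full cycle later
r_half_day = autocorr(temperature, 12)  # half a cycle later

print("lag 24 h: {:.2f}".format(r_day))
print("lag 12 h: {:.2f}".format(r_half_day))
```

A strongly positive value at the cycle length and a strongly negative value at half the cycle length are the signature of a periodic signal.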

### FFT
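The fast Fourier transform (FFT) exposes periodic structure in a series as peaks in its frequency spectrum. Check out some time series analyses for FFT examples.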
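
As one small, hedged example, the sketch below uses `numpy.fft` to recover the dominant period of a synthetic hourly signal. The signal, its 24-hour cycle, and the noise level are assumptions chosen for illustration; real data would usually need detrending and more care.

```python
import numpy as np

# Hypothetical hourly signal over 10 days: a 24-hour cycle plus noise
rng = np.random.default_rng(2)
hours = np.arange(24 * 10)
signal = (15 + 5 * np.sin(2 * np.pi * hours / 24)
          + rng.normal(0, 1.0, size=hours.size))

# Real-input FFT of the mean-removed signal
spectrum = np.fft.rfft(signal - signal.mean())
freqs = np.fft.rfftfreq(hours.size, d=1.0)  # in cycles per hour
power = np.abs(spectrum) ** 2

# Skip the zero-frequency bin when looking for the peak
peak = np.argmax(power[1:]) + 1
dominant_period = 1.0 / freqs[peak]

print("dominant period: {:.1f} hours".format(dominant_period))
```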

## Python visualization tools
* matplotlib (a [thorough rundown](http://www.randalolson.com/2014/06/28/how-to-make-beautiful-data-visualizations-in-python-with-matplotlib/) of its potential)
* [pandas](http://pandas.pydata.org/pandas-docs/stable/visualization.html) has its own useful plotting interface around matplotlib
* [seaborn](http://stanford.edu/~mwaskom/software/seaborn/) (focuses on statistics, easier to customize than matplotlib)
* [bokeh](http://bokeh.pydata.org/en/latest/) (focus on interactivity, browser delivery)
* [ggplot](http://ggplot.yhathq.com/) for Python (attempt at porting R's beloved functionality)

*Copyright &copy; 2015 The Data Incubator.  All rights reserved.*