# Table of Contents
* [Learning Objectives](#Learning-Objectives)
* [Pandas and Computation](#Pandas-and-Computation)
	* [Set-Up](#Set-Up)
* [Overview](#Overview)
* [Scikit-Learn](#Scikit-Learn)
* [Statsmodels](#Statsmodels)


# Learning Objectives

After this notebook, the learner will be able to:
* Use pandas with other Python libraries for computational tasks
* Use pandas containers with the machine learning library scikit-learn
* Use pandas containers with the statistics library statsmodels

# Pandas and Computation

Pandas can interact with many other 'compute' engines by passing data to them and receiving results back. We often find ourselves using pandas in a pipeline where we:

- clean data
- pass to compute engines
- receive back in pandas and iterate
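
The three steps above can be sketched with NumPy's least-squares solver standing in for the compute engine. The toy data and the names `raw`, `clean`, `fitted`, and `params` are assumptions made for illustration; they do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with one missing value (illustration only)
raw = pd.DataFrame(
    {"x": [1.0, 2.0, np.nan, 4.0], "y": [2.0, 4.0, 6.0, 8.0]},
    index=["a", "b", "c", "d"],
)

# 1. Clean the data in pandas
clean = raw.dropna()

# 2. Pass plain arrays to the compute engine (here, NumPy least squares)
A = np.column_stack([clean["x"].to_numpy(), np.ones(len(clean))])
coef, *_ = np.linalg.lstsq(A, clean["y"].to_numpy(), rcond=None)

# 3. Receive the results back in pandas, with the labels restored
fitted = pd.Series(A @ coef, index=clean.index, name="fitted")
params = pd.Series(coef, index=["slope", "intercept"])
```

Re-attaching the index in the last step is what makes it easy to join `fitted` back onto the original frame and iterate.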

![PyData ecosystem](img/pydata-ecosystem.png)

## Set-Up

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

pd.options.display.max_rows = 8
pd.options.display.max_columns = 8

# Overview

We often want to interact with other libraries and have them handle pandas objects.

We are going to look at some interactions with:

- ``scikit-learn``
- ``statsmodels``

# Scikit-Learn

http://scikit-learn.org/stable/documentation.html

Scikit-learn's algorithms all work with NumPy arrays. A typical workflow is:

- data munging in pandas
- pass numpy array to an Estimator
- wrap result in a DataFrame or Series
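
A minimal sketch of this workflow, assuming a small made-up DataFrame and `LinearRegression` as the estimator (neither is used elsewhere in this notebook). The column names `rooms`, `age`, and `price` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: price = 3 * rooms + 0.1 * age + 1
df_demo = pd.DataFrame(
    {
        "rooms": [3, 4, 5, 6],
        "age": [10, 20, 30, 5],
        "price": [11.0, 15.0, 19.0, 19.5],
    },
    index=["h1", "h2", "h3", "h4"],
)
features = ["rooms", "age"]

# Pass NumPy arrays to the estimator
model = LinearRegression().fit(
    df_demo[features].to_numpy(), df_demo["price"].to_numpy()
)

# Wrap the array results back in pandas objects, restoring the labels
predicted = pd.Series(
    model.predict(df_demo[features].to_numpy()),
    index=df_demo.index,
    name="predicted",
)
coefs = pd.Series(model.coef_, index=features)
```

The estimator returns bare arrays, so the row index and feature names have to be supplied again when wrapping the output.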

In [None]:
import sklearn
sklearn.__version__

In [None]:
# fetch_california_housing is imported directly; the california_housing module is no longer public
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing()

In [None]:
X = pd.DataFrame(data.data, columns=data.feature_names)
X

In [None]:
y = pd.Series(data.target)
y

In [None]:
from sklearn.ensemble import RandomForestRegressor
# sklearn.grid_search was removed in scikit-learn 0.20; GridSearchCV lives in model_selection
from sklearn.model_selection import GridSearchCV

In [None]:
%%time
param_grid = dict(
    max_features=np.arange(2, 8),
    max_depth=[2, 4],
    min_samples_split=[5, 10, 15, 20],
)
rfc = RandomForestRegressor(n_estimators=10)
gs = GridSearchCV(rfc, param_grid, cv=5, n_jobs=-1)
gs.fit(X.values, y.values)

In [None]:
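# grid_scores_ was removed in scikit-learn 0.20; cv_results_ is its replacement
cv_results = gs.cv_results_
cv_results['params'][:10]

In [None]:
def unpack_grid_scores(cv_results):
    # One row per parameter combination: mean and std of the CV score, then the parameter values
    params = sorted(cv_results['params'][0])
    rows = []
    for mean, std, p in zip(cv_results['mean_test_score'],
                            cv_results['std_test_score'],
                            cv_results['params']):
        rows.append([mean, std] + [p[k] for k in params])
    return pd.DataFrame(rows, columns=['mean_', 'std_'] + params)

In [None]:
scores = unpack_grid_scores(gs.cv_results_)
scores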
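
As an alternative to unpacking the scores by hand, scikit-learn 0.18 and later expose `cv_results_` on a fitted search, a dict of arrays that `pd.DataFrame` accepts directly. The sketch below uses random data and a small decision-tree grid so that it runs quickly; `X_demo`, `y_demo`, and `search` are illustrative names, not part of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Hypothetical random data, only to give the search something to fit
rng = np.random.RandomState(0)
X_demo = rng.rand(60, 3)
y_demo = X_demo[:, 0] + 0.1 * rng.rand(60)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {"max_depth": [2, 4], "min_samples_split": [5, 10]},
    cv=3,
)
search.fit(X_demo, y_demo)

# cv_results_ is a dict of equal-length arrays, one entry per parameter combination
results = pd.DataFrame(search.cv_results_)
summary = results[
    ["param_max_depth", "param_min_samples_split", "mean_test_score", "std_test_score"]
]
```

Each searched parameter appears as a `param_<name>` column, so the frame is already in a shape suitable for plotting or grouping.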

In [None]:
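# factorplot was renamed to catplot in seaborn 0.9; kind='point' keeps the point-plot style
(scores
       .pipe((sns.catplot, 'data'), kind='point', x='max_features', y='mean_', hue='max_depth', col='min_samples_split')
 )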

In [None]:
s = pd.Series(gs.best_estimator_.feature_importances_,index=X.columns)
(s.sort_values()
  .plot
  .barh(figsize=(5,8))
)

# Statsmodels

http://statsmodels.sourceforge.net/

In [None]:
import statsmodels
import statsmodels.api as sm
statsmodels.__version__

In [None]:
# created in 4. Tidy Data
df = pd.read_hdf('tmp/games.hdf','df')
df

In [None]:
df.info()

``home_win`` is a boolean variable; ``patsy/statsmodels`` wants this as an int.

In [None]:
df['home_win'] = df.home_win.astype(int)
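
The conversion can be seen on a tiny made-up frame. Deriving `home_win` from the point columns is an assumption made for this illustration; `games` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical scores for three games
games = pd.DataFrame(
    {"home_points": [101, 95, 110], "away_points": [99, 102, 104]}
)

# A comparison produces a boolean column
games["home_win"] = games["home_points"] > games["away_points"]

# astype(int) maps True -> 1 and False -> 0
as_int = games["home_win"].astype(int)
```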

We use formulas from ``patsy`` to describe the structure of our regression.

In [None]:
f = 'home_win ~ home_strength + away_strength + home_rest + away_rest'
res = (sm
         .Logit
         .from_formula(f, df)
         .fit()
)

In [None]:
res.summary()
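
To see what a formula expands into, `patsy.dmatrices` can be called directly on a small frame. This is a sketch on made-up data; `toy` and its values are hypothetical, and only two of the predictors are used to keep it short.

```python
import pandas as pd
from patsy import dmatrices

# Hypothetical data with the same column names as the games table
toy = pd.DataFrame(
    {
        "home_win": [1, 0, 1, 0],
        "home_rest": [1, 2, 3, 1],
        "away_rest": [2, 1, 1, 3],
    }
)

# The left side of ~ becomes the response matrix, the right side the design matrix
y_mat, X_mat = dmatrices(
    "home_win ~ home_rest + away_rest", toy, return_type="dataframe"
)
```

patsy adds an `Intercept` column of ones to the design matrix unless the formula removes it, which is why the fitted model reports an intercept that was never written in the formula.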

In [None]:
df2 = df.assign(rest_difference=df.home_rest - df.away_rest,
                spread=df.home_points - df.away_points)

f = 'spread ~ home_strength + away_strength + rest_difference'
res = (sm
         .OLS
         .from_formula(f, df2)
         .fit()
       )

In [None]:
res.summary()