In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

# set defaults
plt.style.use('seaborn-v0_8-white')   # seaborn-like plot style (named 'seaborn-white' before matplotlib 3.6)
plt.rc('figure', dpi=100, figsize=(7, 5))   # set default size/resolution
plt.rc('font', size=12)   # font size

## The modeling pipeline



<img src="imgs/image_0.png" width="100%">

### The steps of the modeling pipeline

1. Create features that best reflect the meaning behind the data
2. Choose a model appropriate for capturing the relationships between features
    - e.g. linear, non-linear
3. Select a loss function and fit the model (determine $\hat{\theta}$).
4. Evaluate model (e.g. using RMSE)

After these steps, use the model for prediction and/or inference.
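
The four steps above can be sketched end to end with plain `numpy`. This is only an illustration under stated assumptions: the data are synthetic (a hypothetical bill amount, party size, and noisy tip), the model is linear, and the loss is squared error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical raw data: a bill amount, a party size, and a noisy tip.
bill = rng.uniform(5, 50, size=200)
size = rng.integers(1, 7, size=200)
tip = 1.0 + 0.1 * bill + 0.2 * size + rng.normal(0, 0.5, size=200)

# 1. Create features: the raw columns plus a constant column for the intercept.
X = np.column_stack([np.ones_like(bill), bill, size])

# 2. Choose a model: linear in the features, y_hat = X @ theta.
# 3. Choose a loss (squared error) and fit: least squares gives theta_hat.
theta_hat, *_ = np.linalg.lstsq(X, tip, rcond=None)

# 4. Evaluate the fit model with the root-mean-square error.
rmse = np.sqrt(np.mean((tip - X @ theta_hat) ** 2))
print(theta_hat, rmse)
```

The fitted `theta_hat` can then be used for prediction (`X_new @ theta_hat`) or inspected for inference.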

### Software development and the modeling pipeline 

* Each step may contain complicated transformations and logic
* The pipeline above represents a single attempt at a model
    - May have thousands of feature/model/parameter combinations to choose from!
    - Remember the Data Science Life Cycle!
* ML pipelines: [the high interest credit card of technical debt](https://ai.google/research/pubs/pub43146)

### Features and Models using `Scikit-Learn`

* Scikit-Learn implements many common steps in the feature/model creation pipeline.
* It accepts Pandas dataframes as input, but works with (and returns) `numpy` arrays :(
    - Some work is required to keep track of columns in scikit

### Scikit-Learn feature transformers


<img src="imgs/image_1.png" width="50%">


### Scikit-Learn (linear) models

<img src="imgs/image_2.png" width="50%">


## Scikit-Learn Transformer Classes

* Initialize a feature transformer with parameters:
    - e.g. `binar = Binarizer(threshold=thresh)`
* Transform data using `.transform` method
    - e.g. `binar.transform(data)` creates binarized features from `data`.

In [None]:
from sklearn.preprocessing import Binarizer

tips = sns.load_dataset('tips')
tips.head()

In [None]:
bi = Binarizer(threshold=20)
binarized = bi.transform(tips[['total_bill']])
binarized[:5]

In [None]:
(
    pd.concat([tips.total_bill, pd.DataFrame(binarized, columns=['binarized'])], axis=1)
    .sort_values('total_bill')
    .plot(x='total_bill', y='binarized')
);

## Scikit-Learn Model Classes

* Initialize a model with (perhaps zero) parameters:
    - e.g. `lr = LinearRegression()`
* Fit model to given dataset using `.fit`
    - e.g. `lr.fit(data, outcomes)` fits the model weights using `data` and `outcomes`.
* Use the model to predict using `.predict` method
    - e.g. `lr.predict(newdata)` predicts outcomes for `newdata`.
* Inspect model attributes, like model weights.

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lr = LinearRegression()
lr.fit(tips[['total_bill', 'size']], tips.tip)

In [None]:
lr.predict(tips[['total_bill', 'size']])[:10]

In [None]:
# regression coefficients
lr.coef_

In [None]:
lr.intercept_

## Putting it together: Scikit-Learn Pipelines

* Put together feature transformers and models using `sklearn.pipeline.Pipeline` objects
* Create a pipeline from a list of `(name, step)` pairs: `pl = Pipeline([('feat', feat), ('mdl', mdl)])`
* Fit the model(s) in the pipeline using `pl.fit(data, target)`
* Predict from *raw* input data through the pipeline using `pl.predict`

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.feature_extraction import DictVectorizer

In [None]:
pl = Pipeline([
    ('one-hot', DictVectorizer()),
    ('lin-reg', LinearRegression())
])

In [None]:
d = tips[['sex', 'smoker', 'day', 'time']].to_dict(orient='records')
d[:10]

In [None]:
pl.fit(d, tips.tip)

In [None]:
pl.named_steps['one-hot'].transform(d).toarray()

In [None]:
pl.named_steps['one-hot'].vocabulary_

In [None]:
pl.predict(d)

In [None]:
pl.score(d, tips.tip)

### (Realistic) Sklearn Pipelines
<div class="image-txt-container">
    
* `ColumnTransformer` (added in scikit-learn 0.20) is a Pipeline-like object
* Transforms using multiple transformers, each on different columns.


<img src="imgs/image_3.png">

</div>

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
import sklearn.preprocessing as pp

In [None]:
tips.drop(['tip', 'total_bill', 'size'], axis=1).head()

In [None]:
# Numeric columns and associated transformers
num_feat = ['total_bill', 'size']
num_transformer = Pipeline(steps=[
    ('scaler', pp.StandardScaler())
])

# Categorical columns and associated transformers
cat_feat = ['sex', 'smoker', 'day', 'time']
cat_transformer = Pipeline(steps=[
    ('intenc', pp.OrdinalEncoder()),
    ('onehot', pp.OneHotEncoder())
])

# preprocessing pipeline (put them together)
preproc = ColumnTransformer(transformers=[('num', num_transformer, num_feat), ('cat', cat_transformer, cat_feat)])

pl = Pipeline(steps=[('preprocessor', preproc), ('regressor', LinearRegression())])

In [None]:
pl.fit(tips.drop('tip', axis=1), tips.tip)

In [None]:
pl.predict(tips.drop('tip', axis=1))

In [None]:
pl.score(tips.drop('tip', axis=1), tips.tip)

## Evaluating the fit model



<img src="imgs/image_4.png" width="100%">

## Evaluating the quality of a model

* Given a model fit on a dataset, calculate e.g. the root-mean-square error (RMSE).
* If the error is low, do you think it's a good model?
    - It fits the given *data* well, but is it a good model? (Is the sample representative?)
    - E.g. will it give good predictions on similar, unknown, data?

## Fundamental Concepts of the quality of a 'fit model'

* **Bias**: the expected deviation between the predicted value and true value
* **Variance**:
    - **Observation Variance**: the variability of the random noise in the process we are trying to model. 
    - **Estimated Model Variance**: the variability in the predicted value across different datasets. 
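
These three quantities can be estimated by simulation when the process is known. The sketch below is an assumption-laden illustration, not part of the original material: the true function (`true_f`), the noise level `sigma`, and the evaluation point `x0` are all made up. It fits a straight line to many fresh datasets drawn from a quadratic process and looks at the predictions at `x0`.

```python
import numpy as np

rng = np.random.default_rng(42)

def true_f(x):
    """Hypothetical true process: quadratic in x."""
    return x ** 2

sigma = 0.3                      # std. dev. of the observation noise (assumed)
x0 = 0.9                         # point at which predictions are compared
n_datasets, n_points = 500, 30

preds = np.empty(n_datasets)
for i in range(n_datasets):
    # Draw a fresh dataset from the same process and fit a line to it.
    x = rng.uniform(-1, 1, size=n_points)
    y = true_f(x) + rng.normal(0, sigma, size=n_points)
    slope, intercept = np.polyfit(x, y, deg=1)
    preds[i] = slope * x0 + intercept

bias = preds.mean() - true_f(x0)       # expected deviation from the true value
model_variance = preds.var()           # variability of predictions across datasets
observation_variance = sigma ** 2      # variability of the noise itself
print(bias, model_variance, observation_variance)
```

A straight line cannot follow the curvature, so the bias at `x0` stays far from zero no matter how many datasets are averaged.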

### Model Quality: Bias and Variance

<div class="image-txt-container">
    
* The red bulls-eye: the true behavior of the data generating process (DGP)
* Each dart: a specific function that models/predicts the DGP
* The model parameters $\theta$ select these functions.
* Credit: Scott Fortmann-Roe
    
<img src="imgs/image_5.png" width="100%">

</div>


## Evaluating the quality of a linear model

Given a dataset on which to fit the regression coefficients:
1. Calculate the RMSE to test for bias.
2. To test for variance, bootstrap the regression coefficients:
    - Resample the data with replacement, many times.
    - For each resample, calculate the linear predictor.
    - For each input value, calculate a confidence interval (CI) from the distribution of predictions.
    - Wide "prediction intervals" imply the model is susceptible to noise (e.g. outliers).
    
Still, this relies on a "representative sample" for generalization to new data!
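
The bootstrap procedure above might look like the following sketch. The data are synthetic and the names (`x_grid`, `n_boot`, `boot_preds`) are illustrative choices. The intervals are percentile intervals for the fitted line, which is one common way to carry out the idea.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: one input feature and a noisy linear outcome.
n = 100
x = rng.uniform(0, 50, size=n)
y = 1.0 + 0.1 * x + rng.normal(0, 1.0, size=n)

x_grid = np.linspace(0, 50, 11)          # inputs at which to compare predictions
n_boot = 1000
boot_preds = np.empty((n_boot, len(x_grid)))

for b in range(n_boot):
    idx = rng.integers(0, n, size=n)     # resample rows with replacement
    slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
    boot_preds[b] = intercept + slope * x_grid

# 95% percentile interval of the predictions at each grid point.
lower, upper = np.percentile(boot_preds, [2.5, 97.5], axis=0)
width = upper - lower
print(np.round(width, 2))
```

The intervals are narrowest near the middle of the observed inputs and widen toward the edges, where the fitted line is most sensitive to which points were resampled.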

In [None]:
sns.lmplot(data=tips, x='total_bill', y='tip');

## Evaluating the quality of a (general) model

* Given a fit (non-linear) model, there are three possibilities for quality:
    - The model doesn't fit the given data well (high bias; underfit)
    - The model reflects the process of interest (good fit; robust)
    - The model just fits the given data, noise and all (high variance; overfit)

* How can we ascertain the quality on similar, out-of-sample data?

## Evaluating the quality of a (general) model

* Given a quadratic process, a linear model has high bias.
* "Connecting-the-dots" will fail to generalize (high variance).

![overfit](imgs/under-over-fit.png)
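
The picture can be reproduced numerically. The sketch below assumes a synthetic quadratic process and uses polynomial degree as a stand-in for model complexity. Degree 1 plays the role of the linear model, and degree 10 on 15 points approximates "connecting the dots".

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Draw n points from a hypothetical quadratic process with noise."""
    x = rng.uniform(-3, 3, size=n)
    y = x ** 2 + rng.normal(0, 1.0, size=n)
    return x, y

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

x_train, y_train = sample(15)      # small dataset used for fitting
x_test, y_test = sample(1000)      # fresh data from the same process

results = {}
for deg in [1, 2, 10]:
    coefs = np.polyfit(x_train, y_train, deg=deg)
    results[deg] = (
        rmse(y_train, np.polyval(coefs, x_train)),   # error on the data used to fit
        rmse(y_test, np.polyval(coefs, x_test)),     # error on unseen data
    )
    print(deg, results[deg])
```

Error on the fitting data keeps falling as the degree grows, while error on fresh data is what reveals whether the extra flexibility helped.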

### Example: predicting survival on the Titanic with Decision Trees

<div class="image-txt-container">

* Did a given passenger survive the Titanic disaster?
* The (simple) tree below has mediocre accuracy

<img src="imgs/image_6.png" width="50%">

</div>

### Reducing Bias with more complicated models


<div class="image-txt-container">

* Improve performance by "growing" the decision tree model.
* Increase the depth of the tree.
* Decrease the number of passengers required in leaf nodes.
* Effect: "Learn" individual passengers?

<img src="imgs/image_7.png" width="100%">

</div>
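
The effect of growing a tree can be sketched with `DecisionTreeClassifier`. The Titanic data is not loaded here. The "passengers" below are a synthetic stand-in (two numeric features, with 20% of outcomes flipped at random), and `make_passengers` is a hypothetical helper.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def make_passengers(n):
    """Hypothetical stand-in for passenger data: 2 features, noisy binary outcome."""
    X = rng.normal(size=(n, 2))
    signal = X[:, 0] + X[:, 1] > 0
    flipped = rng.random(n) < 0.2          # 20% of outcomes are pure noise
    return X, (signal ^ flipped).astype(int)

X_seen, y_seen = make_passengers(500)      # data used to grow the trees
X_new, y_new = make_passengers(500)        # fresh data from the same process

trees = {
    # Shallow tree: limited depth, many passengers required per leaf.
    'shallow': DecisionTreeClassifier(max_depth=2, min_samples_leaf=20, random_state=0),
    # Fully grown tree: unlimited depth, a single passenger allowed per leaf.
    'deep': DecisionTreeClassifier(max_depth=None, min_samples_leaf=1, random_state=0),
}

scores = {}
for name, tree in trees.items():
    tree.fit(X_seen, y_seen)
    scores[name] = (tree.score(X_seen, y_seen), tree.score(X_new, y_new))
    print(name, scores[name])
```

The fully grown tree classifies every passenger it has seen correctly, noise included, which is what "learning individual passengers" amounts to. Its accuracy on the fresh sample is the number to compare against the shallow tree.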

## Train-Test Split

To assess your model for overfitting to the data, randomly split the data into a "training set" and a "test set".

* The training set is used to fit the model (train the predictor).
* The test set is used to test the goodness-of-fit of the fit model.

Leaving out a sample for evaluation is *similar* to bootstrap estimating a regression model.

## The machine learning training pipeline:

<img src="imgs/train-test.png" width="50%">

Scikit-Learn has functions that help us do this.

### Using Scikit-Learn for train-test split

* Splitting a dataset using `sklearn.model_selection.train_test_split` 
* Given features `X` and a target array `y`,
```
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
```
randomly splits the features and target into training and test sets.

In [None]:
from sklearn.model_selection import train_test_split

X = tips.drop('tip', axis=1)
y = tips.tip
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

In [None]:
print(
    len(X_train)/len(X),
    len(X_test)/len(X)
)

### Example Prediction Pipeline

* Train a simple linear regression model on the tips data
* Split the data into a training and test set:
    - fit the model on the training set
    - compute the error on the test set

In [None]:
X = tips.drop(['tip', 'sex', 'smoker', 'day', 'time' ], axis=1)
y = tips.tip

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)


pl = Pipeline([
   ('lin-reg', LinearRegression())
])



pl.fit(X_train, y_train)   # fit on the training set only
print("R^2 on test set: %s" % pl.score(X_test, y_test))   # .score returns R^2 for regressors
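
The bullets above call for fitting on the training set and computing the error on the test set. A minimal sketch of that workflow follows, using a synthetic stand-in for the numeric tips columns so that it runs without downloading data. The coefficients and noise level are made up, and `random_state` is fixed only for reproducibility.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)

# Synthetic stand-in for the numeric tips columns (not the real data).
n = 244
X = pd.DataFrame({
    'total_bill': rng.uniform(5, 50, size=n),
    'size': rng.integers(1, 7, size=n),
})
y = 1.0 + 0.1 * X['total_bill'] + 0.2 * X['size'] + rng.normal(0, 1.0, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

pl = Pipeline([('lin-reg', LinearRegression())])
pl.fit(X_train, y_train)                   # the model only ever sees training rows

r2_train = pl.score(X_train, y_train)      # R^2 on the data used for fitting
r2_test = pl.score(X_test, y_test)         # R^2 on held-out data
rmse_test = np.sqrt(mean_squared_error(y_test, pl.predict(X_test)))
print(r2_train, r2_test, rmse_test)
```

A test score far below the training score would suggest overfitting. For a model this simple the two are usually close.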