Scikit-Learn is a Python machine learning library that implements most mainstream classification, regression, and clustering algorithms, along with supporting utilities for preprocessing, model selection, and evaluation.

*** Need to add much more fluff about what it is.

# Scikit-Learn is a comprehensive Machine Learning toolkit

Scikit-Learn has dozens of estimators for clustering, classification, and regression.

*** Need to figure out a beautiful way to show how comprehensive the algorithm set is in sklearn

### Scikit-Learn provides utilities to manage key ML issues.

** Showing train/test split because it is a requisite part of training a model to data, do more beautifully

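import numpy as np
import pandas as pd
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split

# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release.
boston = datasets.load_boston()

# Hold out 33% of the rows for evaluation; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(boston['data'], boston['target'], test_size=0.33, random_state=42)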


### Scikit-Learn makes fitting models simple

Simple Linear Regression example

Showing the basics of fitting a model using OLS.

In [486]:
import sklearn.linear_model as lm

#Fit a linear model
linear_regression = lm.LinearRegression().fit(X_train, y_train)
# Score the holdout set
print('Model R_Square on holdout: ' + str(linear_regression.score(X_test,y_test)))
# Review coefficient set
pd.DataFrame(np.array([boston['feature_names'], linear_regression.coef_]).T,columns=['Feature','Coefficient'])

Model R_Square on holdout: 0.725851581823


Unnamed: 0,Feature,Coefficient
0,CRIM,-0.1280603983022774
1,ZN,0.0377955692757803
2,INDUS,0.0586107796779193
3,CHAS,3.240070073436742
4,NOX,-16.222267596613126
5,RM,3.8935224412488014
6,AGE,-0.0127879943522895
7,DIS,-1.423268640106174
8,RAD,0.2345130817919454
9,TAX,-0.0082026112672348


# Scikit-Learn has a standardized ML syntax

Each predictive estimator in `sklearn` has `fit()`, `predict()`, and `score()` methods. This makes swapping techniques nearly effortless.
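As a minimal sketch of how the three methods relate, the snippet below fits a regressor on synthetic data (an assumption for illustration, not the Boston data) and checks that `score()` on a regressor matches the R² computed from its own `predict()` output. The names `X_demo`, `y_demo`, and `r2_from_score` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Small synthetic regression problem (illustrative data only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X_demo, y_demo)   # learn the coefficients
y_pred = model.predict(X_demo)                   # generate predictions
r2_from_score = model.score(X_demo, y_demo)      # R^2 for regressors

# score() on a regressor is the R^2 of its own predictions.
print(np.isclose(r2_from_score, r2_score(y_demo, y_pred)))
```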

*** Want to show here that using fit, predict, and score across a multitude of regressors will produce a multitude of scores with minimal code.

### Swapping out various regression estimators

In [447]:
import sklearn.linear_model as lm
import sklearn.ensemble as ens
import sklearn.tree as tree
max_iter = 100000
random_state = 42

# Let's pick a bunch of regression estimators
linear_models = [lm.LinearRegression(), 
                 lm.Ridge(max_iter=max_iter),  
                 lm.Lasso(max_iter=max_iter), 
                 lm.ElasticNet(max_iter=max_iter), 
                 ens.RandomForestRegressor(random_state=random_state), 
                 ens.GradientBoostingRegressor(random_state=random_state),
                 tree.DecisionTreeRegressor(random_state=random_state)]
# Capture model types
model_names = [type(model).__name__ for model in linear_models]
# Fit the estimators
fitted_models = [model.fit(X_train, y_train) for model in linear_models]
# Evaluate models on hold-out
model_r2 = [model.score(X_test,y_test) for model in fitted_models]
# Summarize results
results = pd.DataFrame([model_names, model_r2], index=['Model','R2 Initial']).T
# Print results
results

Unnamed: 0,Model,R2 Initial
0,LinearRegression,0.725852
1,Ridge,0.720131
2,Lasso,0.664381
3,ElasticNet,0.668769
4,RandomForestRegressor,0.812672
5,GradientBoostingRegressor,0.893198
6,DecisionTreeRegressor,0.740223


The linear models do not fare as well as the ensemble techniques. This is likely due to non-linear relationships in the underlying feature set.
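The effect of non-linearity can be illustrated in isolation. The sketch below is not a diagnosis of the Boston data. It uses synthetic data with a purely quadratic relationship (an assumption chosen for illustration) and compares a linear model with a random forest on a holdout set. All variable names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends on the square of a single feature.
rng = np.random.default_rng(42)
X_sim = rng.uniform(-3, 3, size=(500, 1))
y_sim = X_sim[:, 0] ** 2 + rng.normal(scale=0.2, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_sim, y_sim, test_size=0.33, random_state=42)

# Same fit/score calls, two very different model families.
linear_r2 = LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)
forest_r2 = RandomForestRegressor(random_state=42).fit(X_tr, y_tr).score(X_te, y_te)

print(f"LinearRegression R2:      {linear_r2:.3f}")
print(f"RandomForestRegressor R2: {forest_r2:.3f}")
```

On data like this, a straight line cannot follow the curve, while the tree ensemble can approximate it piecewise.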

# Scikit-Learn is a feature engineering toolkit
Scikit-learn has a variety of transformers that take raw data and generate transformations better suited to certain algorithms.
These transformers include `StandardScaler()`, `PCA()`, `Imputer()` (replaced by `SimpleImputer()` in newer releases), `LabelBinarizer()`, `PolynomialFeatures()`, and many more.

Transformers are estimators too: they have a `fit()` method to set up the transformation and a `transform()` method to apply it to new data.
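The fit-then-transform pattern can be sketched with `StandardScaler`. The data here is synthetic and the names `X_fit` and `X_new` are illustrative stand-ins for training data and data that arrives later.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_fit = rng.normal(loc=10.0, scale=5.0, size=(100, 2))   # stand-in "training" data
X_new = rng.normal(loc=10.0, scale=5.0, size=(20, 2))    # stand-in "new" data

scaler = StandardScaler()
scaler.fit(X_fit)                        # learns per-column mean and scale from X_fit only
X_fit_scaled = scaler.transform(X_fit)
X_new_scaled = scaler.transform(X_new)   # reuses the statistics learned from X_fit

print(scaler.mean_)                      # the learned column means
print(X_fit_scaled.mean(axis=0))         # close to zero by construction
```

Because the statistics come from `X_fit` alone, `X_new_scaled` is not guaranteed to have zero mean, which is the intended behavior for holdout data.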

*** figure out a sexy way to show how comprehensive the toolkit is in feature engineering/data preprocessing capabilities.

In [448]:
from sklearn.preprocessing import PolynomialFeatures

# Let's add second order variables to our featureset.
poly = PolynomialFeatures(degree=2)
poly.fit(X_train)

# Transform our train and test datasets with the new polynomial features.
X_train_transform = poly.transform(X_train)
X_test_transform = poly.transform(X_test)

print('By adding 2nd degree terms to our feature set we expand our featureset from ' + str(len(boston['feature_names'])) + \
      ' features to ' + str(len(poly.get_feature_names())) + '.')

# Fit the estimators
fitted_models = [model.fit(X_train_transform,y_train) for model in linear_models]
# Evaluate models on hold-out
model_r2 = [model.score(X_test_transform,y_test) for model in fitted_models]
# Summarize results (direct column assignment; DataFrame.append was removed in pandas 2.0)
results['R2 Poly'] = model_r2
# Print results
results

By adding 2nd degree terms to our feature set we expand our featureset from 13 features to 105.


Unnamed: 0,Model,R2 Initial,R2 Poly
0,LinearRegression,0.725852,0.486611
1,Ridge,0.720131,0.662576
2,Lasso,0.664381,0.838244
3,ElasticNet,0.668769,0.842125
4,RandomForestRegressor,0.812672,0.825092
5,GradientBoostingRegressor,0.893198,0.887188
6,DecisionTreeRegressor,0.740223,0.74719


### Scikit-learn is customizable

You can even work your own transformers into the workflow. Let's explore adding log-transformed copies of our features, using `np.log1p` (the natural log of one plus each value).

In [484]:
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(np.log1p)

# Append log1p-transformed copies of the features to our train and test datasets.
X_train_transform = np.append(X_train, transformer.fit_transform(X_train), axis=1)
X_test_transform = np.append(X_test, transformer.transform(X_test), axis=1)

#print('By adding log terms to our feature set we expand our featureset from ' + str(len(boston['feature_names'])) + \
#      ' features to ' + str(X_train_transform.shape[1]) + '.')

# Fit the estimators
fitted_models = [model.fit(X_train_transform,y_train) for model in linear_models]
# Evaluate models on hold-out
model_r2 = [model.score(X_test_transform,y_test) for model in fitted_models]
# Summarize results (direct column assignment; DataFrame.append was removed in pandas 2.0)
results['R2 Log'] = model_r2
# Print results
results

Unnamed: 0,Model,R2 Initial,R2 Poly,R2 Log
0,LinearRegression,0.725852,0.486611,0.822088
1,Ridge,0.720131,0.662576,0.786281
2,Lasso,0.664381,0.838244,0.664378
3,ElasticNet,0.668769,0.842125,0.669214
4,RandomForestRegressor,0.812672,0.825092,0.824317
5,GradientBoostingRegressor,0.893198,0.887188,0.895237
6,DecisionTreeRegressor,0.740223,0.74719,0.727383
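When a transformation needs its own name or extra logic, another option is a small transformer class. The sketch below is one possible shape for this, not the approach used above. `AppendLog1p` is a hypothetical class, and the data is synthetic and non-negative so that `log1p` is well defined.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


class AppendLog1p(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: appends log1p of every column to the input."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so the object works inside a Pipeline.
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.hstack([X, np.log1p(X)])


# Illustrative non-negative data whose target depends on log1p of one feature.
rng = np.random.default_rng(0)
X_raw = rng.uniform(0, 10, size=(150, 2))
y_raw = 3.0 * np.log1p(X_raw[:, 0]) + X_raw[:, 1]

log_pipe = Pipeline(steps=[("add_logs", AppendLog1p()), ("ols", LinearRegression())])
log_pipe.fit(X_raw, y_raw)

X_expanded = AppendLog1p().fit_transform(X_raw)
print(X_expanded.shape)               # twice as many columns as X_raw
print(log_pipe.score(X_raw, y_raw))
```

Inheriting from `TransformerMixin` supplies `fit_transform()`, and `BaseEstimator` supplies `get_params()` and `set_params()`, which is what lets the class take part in pipelines and grid searches.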


# Scikit-learn is a scalable workflow

### Pipelining
The purpose of a pipeline is to assemble several steps that can be cross-validated together while setting different parameters. Pipelines are estimators too, with `fit()`, `predict()`, and `score()` methods.

### Take a look at a simple example
An example of a simple pipeline chaining polynomial transform and lasso regression together.

In [455]:
from sklearn.pipeline import Pipeline

#Let's create our estimators
poly = PolynomialFeatures(2)
lasso = lm.Lasso(max_iter=100000)

# Let's add them to a pipeline in the sequence we want to apply them
pipe = Pipeline(steps=[('poly', poly), ('lasso', lasso)])

# Fitting the pipeline fits each estimator in sequence
pipe.fit(X_train, y_train)
# Evaluate estimator on holdout
print('A pipeline for our Lasso model produces a holdout score of: ' + str(pipe.score(X_test, y_test)))

A pipeline for our Lasso model produces a holdout score of: 0.838244117471
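Because a pipeline behaves like a single estimator, it can be passed to cross-validation helpers as one unit. The sketch below uses `cross_val_score` on synthetic data from `make_regression` (an assumption for illustration, not the Boston data). The names `cv_pipe` and `fold_scores` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (illustrative only).
X_cv, y_cv = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

cv_pipe = Pipeline(steps=[("poly", PolynomialFeatures(2)),
                          ("lasso", Lasso(max_iter=100000))])

# Each of the 5 folds refits both steps on that fold's training portion only.
fold_scores = cross_val_score(cv_pipe, X_cv, y_cv, cv=5)

print(fold_scores)          # one R^2 value per fold
print(fold_scores.mean())
```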


### Pipelining with GridSearchCV
The real power of pipelines is realized when they are combined with `GridSearchCV`. Within each cross-validation fold, the transformers and the model are fit only on that fold's training portion and then applied to its validation portion. This prevents data leakage when evaluating estimator performance, while still letting us tune the hyperparameters of every step in the pipeline.

### Optimize our DecisionTreeRegressor

In [485]:
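from sklearn.model_selection import GridSearchCV

dtree = tree.DecisionTreeRegressor(random_state=random_state)
pipe = Pipeline(steps=[('poly', poly), ('dtree', dtree)])

# Keys use the <step name>__<parameter> convention to reach into the pipeline.
# Note: 'mse' was renamed 'squared_error' in scikit-learn 1.0 and removed in 1.2.
param_grid = dict(dtree__criterion=['mse', 'friedman_mse'],
                  dtree__max_depth=[25, 50, 75, 100],
                  dtree__min_samples_leaf=[2, 3, 5, 10],
                  dtree__min_samples_split=[5, 10, 20],
                  poly__degree=[1, 2])

# 5-fold cross-validated search over every combination in the grid
estimator = GridSearchCV(pipe, param_grid, return_train_score=True, cv=5)
estimator.fit(X_train, y_train)

# Show the five best parameter combinations by mean cross-validated score
parameters = pd.DataFrame(estimator.cv_results_['params'])
cv_test_scores = pd.Series(estimator.cv_results_['mean_test_score'], name='mean_test_score')
parameters.assign(mean_test_score=cv_test_scores).sort_values('mean_test_score', ascending=False).head(5)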

Unnamed: 0,dtree__criterion,dtree__max_depth,dtree__min_samples_leaf,dtree__min_samples_split,poly__degree,mean_test_score
131,friedman_mse,50,3,20,2,0.755944
35,mse,50,3,20,2,0.755944
179,friedman_mse,100,3,20,2,0.755944
155,friedman_mse,75,3,20,2,0.755944
83,mse,100,3,20,2,0.755944


Let's retrain the decision tree on our training set with the optimal hyper-parameters from the GridSearch space and verify holdout score.

In [474]:
poly = PolynomialFeatures(2)
dtree = tree.DecisionTreeRegressor(criterion='friedman_mse',max_depth=50, min_samples_leaf=3, min_samples_split=20, random_state = random_state)

pipe = Pipeline(steps=[('poly', poly), ('dtree', dtree)])

pipe.fit(X_train, y_train)
print('An optimized pipeline for our Decision Tree Regressor produces a holdout score of ' + str(pipe.score(X_test, y_test)))

An optimized pipeline for our Decision Tree Regressor produces a holdout score of 0.819812808935
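The manual retraining step above can also be skipped. With its default `refit=True`, `GridSearchCV` refits the best parameter combination on the full training set and exposes it as `best_estimator_`. The sketch below shows this on synthetic data with a deliberately small grid. Both are assumptions for illustration, and names such as `search` and `small_grid` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data and a small grid (both illustrative).
X_gs, y_gs = make_regression(n_samples=300, n_features=4, noise=5.0, random_state=42)
X_gs_train, X_gs_test, y_gs_train, y_gs_test = train_test_split(
    X_gs, y_gs, test_size=0.33, random_state=42)

gs_pipe = Pipeline(steps=[("poly", PolynomialFeatures()),
                          ("dtree", DecisionTreeRegressor(random_state=42))])
small_grid = {"poly__degree": [1, 2], "dtree__min_samples_leaf": [3, 10]}

search = GridSearchCV(gs_pipe, small_grid, cv=5)   # refit=True by default
search.fit(X_gs_train, y_gs_train)

print(search.best_params_)                         # winning combination
# best_estimator_ is the pipeline already refit on all of the training data.
best_holdout = search.best_estimator_.score(X_gs_test, y_gs_test)
# search.score delegates to best_estimator_, so the two values agree.
search_holdout = search.score(X_gs_test, y_gs_test)
print(best_holdout, search_holdout)
```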
