<H2>Pipelines</H2>
Use a pipeline to chain preprocessing steps with a model, so the data is transformed before the model uses it


In [1]:
from sklearn.datasets import load_digits
data = load_digits()

from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target)

from sklearn.svm import SVC
SVC().fit(X_train, y_train).score(X_test, y_test)

0.41555555555555557

As we can see, the SVC does not work very well on the raw features

We have to scale the data first...


In [2]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max on the training set only
X_test_scaled = scaler.transform(X_test)  # reuse the training min/max, do not refit on the test set

In [3]:
SVC().fit(X_train_scaled, y_train).score(X_test_scaled, y_test)

0.9555555555555556
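
To make the scaling step less of a black box, here is a minimal hand-rolled sketch of min-max scaling in plain Python. The helper names `fit_min_max` and `apply_min_max` and the toy data are made up for this illustration and are not part of scikit-learn. The point it shows: the minimum and maximum are learned from the training rows only and then reused for the test rows.

```python
# Hand-rolled min-max scaling, for illustration only (hypothetical helper names).

def fit_min_max(rows):
    """Return the per-column (min, max) pairs learned from `rows`."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]

def apply_min_max(rows, stats):
    """Rescale `rows` with previously learned (min, max) pairs."""
    scaled = []
    for row in rows:
        scaled.append([
            # a constant column is mapped to 0.0 in this sketch
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, (lo, hi) in zip(row, stats)
        ])
    return scaled

toy_train = [[0.0, 10.0], [4.0, 20.0], [8.0, 30.0]]
toy_test = [[2.0, 40.0]]

stats = fit_min_max(toy_train)                      # learned from the training rows only
toy_train_scaled = apply_min_max(toy_train, stats)  # every value ends up in [0, 1]
toy_test_scaled = apply_min_max(toy_test, stats)    # may fall outside [0, 1]

print(toy_train_scaled)
print(toy_test_scaled)
```

The test row ends up outside the [0, 1] range in its second column because 40.0 is larger than anything seen during training. That is expected: the test set must not influence the scaling.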

In [5]:
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1, 1]}
grid = GridSearchCV(SVC(), param_grid=param_grid)
grid.fit(X_train_scaled, y_train)
grid.score(X_test_scaled, y_test)

0.99111111111111116

This code is wrong, because we are kind of cheating: the scaler was fit on the whole training set before the grid search ran.
The grid search then does cross validation on data that is already scaled, so inside each cross-validation split the held-out part has already contributed to the minimum and maximum used for scaling.

So preprocessing before grid search is WRONG: the preprocessing has to be refit inside every cross-validation split
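
To see the leak concretely, here is a toy sketch in plain Python. The six values and the 3-fold split are invented for this illustration. It compares the range a scaler would learn when fit once up front with the range it would learn when refit inside each split on the non-held-out samples only.

```python
# Toy illustration of why scaling before cross-validation leaks information.
# The values and the 3-fold split are made up for this sketch.

values = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]   # one feature, six training samples
folds = [[0, 1], [2, 3], [4, 5]]            # indices held out in each split

# what a scaler fit before the cross-validation would learn
global_range = (min(values), max(values))

# what a scaler refit inside each split would learn
fold_ranges = []
for held_out in folds:
    fit_part = [v for i, v in enumerate(values) if i not in held_out]
    fold_ranges.append((min(fit_part), max(fit_part)))

print('fit before CV :', global_range)
for held_out, fold_range in zip(folds, fold_ranges):
    print('held out', held_out, '-> fit inside CV :', fold_range)
```

In the last split the value 100.0 is held out, so a scaler refit inside the split never sees it. The scaler fit up front has already used it, which means the held-out samples influenced the preprocessing of the data the model is trained on.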


In [7]:
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(MinMaxScaler(), SVC())

# the pipeline applies the scaler, then the SVC
print(pipe)

pipe.fit(X_train, y_train).score(X_test, y_test)


Pipeline(steps=[('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])


0.9555555555555556

This is the same result as the manually scaled SVC above (without grid search, of course).

Now we can use the grid search correctly:

In [11]:
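from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
# parameters of a pipeline step are addressed as '<step name>__<parameter>'
param_grid = {'svc__C': [0.001, 0.01, 0.1, 1, 10], 'svc__gamma': [0.001, 0.01, 0.1, 1]}
grid = GridSearchCV(pipe, param_grid=param_grid)
# fit on the raw data: the pipeline rescales inside every cross-validation split
grid.fit(X_train, y_train)
print('score : ', grid.score(X_test, y_test))
print('best params : ', grid.best_params_)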

score :  0.991111111111
best params :  {'svc__C': 10, 'svc__gamma': 0.1}
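
The `svc__C` and `svc__gamma` keys in the grid above follow the pipeline naming convention: step name, two underscores, parameter name. The short sketch below shows where those names come from. It builds a separate `demo_pipe` (a name chosen for this illustration) so it does not interfere with the `pipe` used above.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

demo_pipe = make_pipeline(MinMaxScaler(), SVC())

# make_pipeline names each step after its lowercased class name
print([name for name, _ in demo_pipe.steps])

# every step parameter is exposed as '<step name>__<parameter name>'
print(sorted(key for key in demo_pipe.get_params() if key.startswith('svc__')))

# set_params accepts the same keys that GridSearchCV takes in param_grid
demo_pipe.set_params(svc__C=10, svc__gamma=0.1)
print(demo_pipe.named_steps['svc'].C, demo_pipe.named_steps['svc'].gamma)
```

If a step name is unclear, printing `demo_pipe.get_params().keys()` lists every key that can be put in a parameter grid.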


Now, let's apply this to the <h4>Boston housing dataset</h4>

In [17]:
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_boston
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

boston = load_boston()

X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size = 0.25, random_state=123)

pipe = make_pipeline(StandardScaler(), PolynomialFeatures(), Ridge())

grid = GridSearchCV(pipe, param_grid={'polynomialfeatures__degree': [1,2,3]}, cv=5)

grid.fit(X_train, y_train)

print('Best parameters:', grid.best_params_)
print('best score:', grid.best_score_)
print('test score:', grid.score(X_test, y_test))



Best parameters: {'polynomialfeatures__degree': 2}
best score: 0.817638941497
test score: 0.83131201386
