There is an enormous number of possible workflows for any given data set once we account for transforms, feature engineering, model selection and model tuning. This means that we need a systematic way to compare these workflow variants. This is where pipelines become so useful: it is the consistency of the three interfaces that allows steps to be chained together, and it makes pipelines like this one a necessary part of the iterative workflow.

In [1]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.metrics import median_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
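# NOTE: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import requires an older scikit-learn release
from sklearn.datasets import load_boston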

In [2]:
## load the Boston housing dataset and hold out a test split
boston = load_boston()
X, y = boston['data'], boston['target']
features = boston['feature_names']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

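## build the pipeline: scale, select features, then fit a random forest
## NOTE: SelectKBest defaults to score_func=f_classif, which is designed for
## classification targets; f_regression is the scorer intended for regression
pipe = Pipeline([("scaler", StandardScaler()),
                 ("featsel", SelectKBest(k=10)),
                 ("rf", RandomForestRegressor(n_estimators=20))])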
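
To make concrete why a consistent interface lets steps be chained, here is a minimal dependency-free sketch. The classes `ToyCenterer`, `ToyLinearModel` and `ToyPipeline` are hypothetical names invented for this illustration; they are not scikit-learn's implementation, only a toy that mimics the `fit` / `transform` / `predict` convention on one-dimensional data.

```python
class ToyCenterer:
    """Hypothetical transformer: learns a mean in fit, subtracts it in transform."""

    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class ToyLinearModel:
    """Hypothetical predictor: one-dimensional least squares with an intercept."""

    def fit(self, X, y):
        mx = sum(X) / len(X)
        my = sum(y) / len(y)
        sxy = sum((a - mx) * (b - my) for a, b in zip(X, y))
        sxx = sum((a - mx) ** 2 for a in X)
        self.slope_ = sxy / sxx
        self.intercept_ = my - self.slope_ * mx
        return self

    def predict(self, X):
        return [self.intercept_ + self.slope_ * x for x in X]


class ToyPipeline:
    """Hypothetical pipeline: every step but the last transforms, the last predicts."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        # Fit each transformer and pass its output on to the next step
        for _, step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        # Reuse the parameters learned during fit; nothing is refit here
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)


toy = ToyPipeline([("center", ToyCenterer()), ("model", ToyLinearModel())])
toy.fit([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
toy_pred = toy.predict([4.0, 5.0])
print(toy_pred)
```

Because the pipeline only relies on the method names, any step could be swapped for another object that honors the same convention, which is what makes comparing workflow variants mechanical rather than ad hoc.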

In [3]:
## fit the pipeline on the training data
pipe.fit(X_train, y_train)

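## evaluate the model on the held-out test set
## (MAE here denotes the median absolute error, not the mean absolute error)
y_pred = pipe.predict(X_test)
print(r'R^2=%.2f, MAE=%.2f' % (r2_score(y_test, y_pred),
                               median_absolute_error(y_test, y_pred)))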

R^2=0.74, MAE=1.56
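
One way to compare workflow variants systematically is to search over the parameters of the pipeline steps with cross-validation. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_regression` as a stand-in for the Boston data, it swaps in `f_regression` as the feature-selection scorer, and the grid values and the names `demo_pipe`, `param_grid` and `search` are arbitrary choices, not part of the original notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the data set used above)
X_demo, y_demo = make_regression(n_samples=200, n_features=13,
                                 n_informative=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Same three-step structure as the pipeline above
demo_pipe = Pipeline([("scaler", StandardScaler()),
                      ("featsel", SelectKBest(score_func=f_regression)),
                      ("rf", RandomForestRegressor(random_state=0))])

# Step parameters are addressed as <step name>__<parameter name>
param_grid = {"featsel__k": [5, 10],
              "rf__n_estimators": [10, 20]}

# Each of the 2 x 2 variants is scored with 3-fold cross-validation
search = GridSearchCV(demo_pipe, param_grid, cv=3, scoring="r2")
search.fit(X_tr, y_tr)

print("best variant:", search.best_params_)
print("cross-validated R^2: %.2f" % search.best_score_)
print("held-out R^2: %.2f" % search.score(X_te, y_te))
```

Because the whole pipeline is refit inside each fold, the scaler and the feature selector never see the validation portion of the data, so the variants are compared on equal footing.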
