# Preprocessing - scaling

In this notebook I'm going to show how to use a scaler to preprocess data. Scaling can improve results with some algorithms and/or datasets, for example with regularized regression models and neural networks.

In [1]:
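# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston

# Load the feature matrix X and the target vector y as NumPy arrays
X, y = load_boston(return_X_y=True)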

First we'll train and evaluate a simple linear regression as a baseline, without scaling.

In [12]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold

model = LinearRegression()

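# 3-fold cross-validation; shuffle=True without random_state gives different folds on every run
scores = cross_val_score(model, X, y, cv=KFold(n_splits=3, shuffle=True))
# Mean and standard deviation of the per-fold R^2 scores
print("Score: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std()))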

Score: 0.68 (+/- 0.06)


Next we train the same model, but with a RobustScaler in front of it to scale the data.

In [14]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

model = make_pipeline(
    RobustScaler(),
    LinearRegression()
)

# Same evaluation as the baseline; the folds are reshuffled, so they differ from the baseline run
scores = cross_val_score(model, X, y, cv=KFold(n_splits=3, shuffle=True))
print("Score: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std()))

Score: 0.72 (+/- 0.02)
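
To make it clearer what the RobustScaler in the pipeline does, here is a minimal sketch on a toy column (the `X_toy` values are made up for illustration). With its default settings the scaler subtracts the median of each feature and divides by the interquartile range, which the manual computation below reproduces.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical toy feature with one large outlier (not from the dataset above)
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Scale with the default settings (center on the median, scale by the 25-75 quantile range)
scaled = RobustScaler().fit_transform(X_toy)

# The same transformation computed by hand
median = np.median(X_toy, axis=0)
q25, q75 = np.percentile(X_toy, [25, 75], axis=0)
manual = (X_toy - median) / (q75 - q25)

print(scaled.ravel())
print(manual.ravel())
```

The four ordinary values end up between -1 and 0.5, while the outlier maps to 48.5. The outlier barely influences the median and the interquartile range, which is why this scaler is called robust.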


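In the above example the scaler seems to improve the score slightly, but I would not read much into it. The difference is about the size of the standard deviation across the folds, and because the folds are shuffled without a fixed `random_state`, the two models were evaluated on different splits. On top of that, an ordinary least-squares regression makes the same predictions whether or not its features are linearly rescaled, so the gap here most likely comes from the random splits and not from the scaler ;-)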
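
One way to check whether the gap comes from the scaler or from the random folds is to evaluate both models on exactly the same splits. The sketch below is not the experiment above: it uses synthetic data from `make_regression` as a stand-in (an assumption, so that it runs on any scikit-learn version), and the names `X_demo`, `y_demo`, `plain_scores` and `scaled_scores` are only illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in data (assumption: not the dataset used in this notebook)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, noise=20.0, random_state=0
)

# A fixed random_state makes KFold produce the same folds for both models
cv = KFold(n_splits=3, shuffle=True, random_state=0)

plain_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv)
scaled_scores = cross_val_score(
    make_pipeline(RobustScaler(), LinearRegression()), X_demo, y_demo, cv=cv
)

print("Plain:  %0.4f (+/- %0.4f)" % (plain_scores.mean(), plain_scores.std()))
print("Scaled: %0.4f (+/- %0.4f)" % (scaled_scores.mean(), scaled_scores.std()))
```

With identical folds the two sets of scores should agree up to floating-point precision, since plain least squares is unaffected by per-feature linear rescaling. A scaler is more likely to matter once the model is sensitive to feature scale, for example a regularized model such as `Ridge`.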