# Centering and Scaling II

Normalizing (centering and scaling) the features in a dataset can significantly affect a model's performance, but this is not always the case. In the Congressional voting records dataset, for example, all of the features are binary, so scaling has minimal impact.

In the following dataset, if the `quality` is less than 5, the target variable is `1`, and otherwise, it is `0`.

Notice how some features seem to have different units of measurement. `density`, for instance, takes values between `0.98` and `1.04`, while `total sulfur dioxide` ranges from `9` to `440`. As a result, it may be worth scaling the features here. 
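
One way to check for mismatched units is to tabulate each feature's minimum, maximum, and spread. The sketch below is only illustrative: the `sample` frame is a hypothetical stand-in holding a few values copied from the `df.head()` output further down, not the full dataset, so its ranges are narrower than the ones quoted above.

```python
import pandas as pd

# Hypothetical mini-frame: a few rows copied from the df.head() output
sample = pd.DataFrame({
    "density": [1.001, 0.994, 0.9951, 0.9956],
    "total sulfur dioxide": [170.0, 132.0, 97.0, 186.0],
})

# One row per feature, with its minimum, maximum and spread (max - min)
ranges = sample.agg(["min", "max"]).T
ranges["spread"] = ranges["max"] - ranges["min"]
print(ranges)
```

On the real data, the same two lines applied to `df` would show which columns span several orders of magnitude more than others.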

In [6]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
df = pd.read_csv('../data/white-wine.csv')
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6


In [3]:
df['quality'] = df.quality.apply(lambda x: 1 if x < 5 else 0)
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,0
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,0
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,0
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,0
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,0


In [5]:
df.groupby('quality').quality.count()

quality
0    4715
1     183
Name: quality, dtype: int64

In [8]:
# prepare data
X = df.drop('quality', axis=1).values
y = df.quality.values

print(type(X), X.shape)
print(type(y), y.shape)

<class 'numpy.ndarray'> (4898, 11)
<class 'numpy.ndarray'> (4898,)


In [9]:
from sklearn.preprocessing import scale

# Scale the features: X_scaled
X_scaled = scale(X)

# Print the mean and standard deviation of the unscaled features
print("Mean of Unscaled Features: {}".format(np.mean(X))) 
print("Standard Deviation of Unscaled Features: {}".format(np.std(X)))

# Print the mean and standard deviation of the scaled features
print("Mean of Scaled Features: {}".format(np.mean(X_scaled))) 
print("Standard Deviation of Scaled Features: {}".format(np.std(X_scaled)))

Mean of Unscaled Features: 18.432687072460002
Standard Deviation of Unscaled Features: 41.54494764094571
Mean of Scaled Features: 2.7314972981668206e-15
Standard Deviation of Scaled Features: 0.9999999999999999
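
Called without an `axis` argument, `np.mean` and `np.std` aggregate over every entry of the array, so the figures above summarize the whole matrix rather than each feature. Since `scale` standardizes column by column, a per-feature check may be more informative. The following is a minimal sketch on synthetic data; `X_demo` and its two columns are assumptions standing in for the wine features.

```python
import numpy as np
from sklearn.preprocessing import scale

# Synthetic stand-in for X: two features on very different scales
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(loc=1.0, scale=0.01, size=200),    # density-like column
    rng.normal(loc=140.0, scale=40.0, size=200),  # sulfur-dioxide-like column
])
X_demo_scaled = scale(X_demo)

# axis=0 aggregates down the rows, giving one value per column
feature_means = X_demo_scaled.mean(axis=0)
feature_stds = X_demo_scaled.std(axis=0)
print("Per-feature means:", feature_means)
print("Per-feature stds: ", feature_stds)
```

Each column should come out with a mean close to 0 and a standard deviation close to 1, up to floating-point error.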


We will use a **k-NN classifier** as part of a pipeline that includes scaling, and for the purposes of comparison, we'll use a **k-NN classifier** trained on unscaled data.
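
Scaling matters for k-NN because the classifier ranks neighbors by distance, and a feature with a large numeric range can dominate that distance. The sketch below illustrates this with three hypothetical wines described by `(density, total sulfur dioxide)`. The points and the per-feature standard deviations in `feature_std` are made-up values chosen for illustration, not statistics computed from the dataset.

```python
import numpy as np

# Hypothetical wines: columns are (density, total sulfur dioxide)
a = np.array([0.990, 100.0])
b = np.array([1.000, 101.0])  # large density gap, tiny sulfur dioxide gap
c = np.array([0.990, 110.0])  # same density, modest sulfur dioxide gap

# Hypothetical per-feature standard deviations, for illustration only
feature_std = np.array([0.003, 40.0])

def euclidean(u, v):
    """Euclidean distance between two 1-D points."""
    return float(np.linalg.norm(u - v))

# Raw distances are driven almost entirely by sulfur dioxide
raw_ab, raw_ac = euclidean(a, b), euclidean(a, c)

# Dividing by the standard deviation puts both features on a comparable scale
# (centering is omitted because it does not change distances)
scaled_ab = euclidean(a / feature_std, b / feature_std)
scaled_ac = euclidean(a / feature_std, c / feature_std)

print("raw:    d(a,b) = {:.3f}, d(a,c) = {:.3f}".format(raw_ab, raw_ac))
print("scaled: d(a,b) = {:.3f}, d(a,c) = {:.3f}".format(scaled_ab, scaled_ac))
```

On the raw values `b` is the nearer neighbor of `a`, while after rescaling `c` is nearer, so the choice of neighbors can change with scaling.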

In [10]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

In [13]:
# Set up the pipeline steps: steps
steps = [('scaler', StandardScaler()),
         ('knn', KNeighborsClassifier())]

# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the pipeline to the training set: knn_scaled
knn_scaled = pipeline.fit(X_train, y_train)

# Instantiate and fit a k-NN classifier to the unscaled data
knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)

# Compute and print metrics
print('Accuracy with Scaling: {}'.format(knn_scaled.score(X_test, y_test)))
print('Accuracy without Scaling: {}'.format(knn_unscaled.score(X_test, y_test)))

Accuracy with Scaling: 0.964625850340136
Accuracy without Scaling: 0.9666666666666667


**Actual results**:

Accuracy with Scaling: 0.7700680272108843  
Accuracy without Scaling: 0.6979591836734694

In [12]:
# make predictions
y_pred = pipeline.predict(X_test)
y_pred_unscaled = knn_unscaled.predict(X_test)

# score the model
print('knn with scaling:', accuracy_score(y_test, y_pred))
print('knn without scaling', accuracy_score(y_test, y_pred_unscaled))

knn with scaling: 0.964625850340136
knn without scaling 0.9666666666666667


### Build a pipeline for classification

We'll build a pipeline that includes scaling and hyperparameter tuning to classify wine quality. 

We'll use the **SVM classifier**. The hyperparameters we will tune are `C` and `gamma`. `C` controls the regularization strength: as with the `C` hyperparameter in **logistic regression**, smaller values mean stronger regularization. `gamma` is the **kernel coefficient**, which determines how far the influence of a single training example reaches.
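
To make the role of `gamma` concrete: `SVC` uses the RBF kernel by default, which scores the similarity of two points as `exp(-gamma * ||x - z||^2)`. The sketch below evaluates that expression by hand for the two `gamma` values tuned later and compares it with scikit-learn's `rbf_kernel`. The points `x` and `z` are arbitrary illustrative values.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two arbitrary points whose squared Euclidean distance is 2
x = np.array([[0.0, 0.0]])
z = np.array([[1.0, 1.0]])

similarities = {}
for gamma in (0.01, 0.1):
    # RBF kernel computed directly from its definition
    manual = float(np.exp(-gamma * np.sum((x - z) ** 2)))
    # Same quantity from scikit-learn, as a cross-check
    library = float(rbf_kernel(x, z, gamma=gamma)[0, 0])
    assert np.isclose(manual, library)
    similarities[gamma] = manual
    print("gamma={}: similarity={:.4f}".format(gamma, manual))
```

A larger `gamma` makes the similarity fall off faster with distance, so each training example influences a smaller neighborhood.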

In [21]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

Set up the pipeline with the following steps:

- Scaling, called `scaler` with `StandardScaler()`.
- Classification, called `svm` with `SVC()`.

In [22]:
steps = [('scaler', StandardScaler()), ('svm', SVC())]
pipeline = Pipeline(steps)

Specify the hyperparameter space using the following notation: `step_name__parameter_name`. 

Here, the `step_name` is `svm`, and the parameter names are `C` and `gamma`.
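
A quick way to see which names are valid is to ask the pipeline itself. The sketch below builds a throwaway pipeline (`demo_pipeline` is a name introduced here only for illustration) and lists the parameters exposed under the `svm` step.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Throwaway pipeline with the same step names as in this section
demo_pipeline = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])

# Every tunable name follows <step_name>__<parameter_name>
svm_params = sorted(
    name for name in demo_pipeline.get_params() if name.startswith("svm__")
)
print(svm_params)

# set_params accepts the same double-underscore names
demo_pipeline.set_params(svm__C=10, svm__gamma=0.01)
print(demo_pipeline.named_steps["svm"].C)
```

A misspelled step or parameter name in the grid raises an error when the search is fitted, so checking against `get_params()` first can save a round trip.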

In [23]:
# Specify the hyperparameter space
parameters = {'svm__C':[1, 10, 100],
              'svm__gamma':[0.1, 0.01]}

Create training and test sets, with 20% of the data used for the test set. Use a random state of 42.

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Instantiate `GridSearchCV` with the pipeline and hyperparameter space, using 3-fold cross-validation. This was the default in older scikit-learn releases, but the default has been 5 folds since version 0.22, so pass `cv=3` explicitly.

In [25]:
cv = GridSearchCV(pipeline, param_grid=parameters, cv=3)

Fit the data, predict the labels of the test set and compute the metrics.

In [26]:
# Fit to the training set
cv.fit(X_train, y_train)

# Predict the labels of the test set: y_pred
y_pred = cv.predict(X_test)

# Compute and print metrics
print("Accuracy: {}".format(cv.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print("Tuned Model Parameters: {}".format(cv.best_params_))



Accuracy: 0.9693877551020408
              precision    recall  f1-score   support

           0       0.97      1.00      0.98       950
           1       0.50      0.10      0.17        30

   micro avg       0.97      0.97      0.97       980
   macro avg       0.74      0.55      0.58       980
weighted avg       0.96      0.97      0.96       980

Tuned Model Parameters: {'svm__C': 100, 'svm__gamma': 0.01}
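
Besides `best_params_`, a fitted `GridSearchCV` also keeps the cross-validated score of every candidate. The sketch below shows how these might be inspected. To stay self-contained it uses synthetic data from `make_classification` instead of the wine dataset, and `search`, `X_demo`, and `y_demo` are names introduced here for illustration, so its scores say nothing about the results above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption, not the wine dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Same pipeline and grid shape as in this section, with explicit 3-fold CV
search = GridSearchCV(
    Pipeline([("scaler", StandardScaler()), ("svm", SVC())]),
    param_grid={"svm__C": [1, 10, 100], "svm__gamma": [0.1, 0.01]},
    cv=3,
)
search.fit(X_tr, y_tr)

# Best combination and its mean cross-validated accuracy
print(search.best_params_)
print(search.best_score_)

# Mean cross-validated accuracy for every combination in the grid
results = search.cv_results_
for params, mean_score in zip(results["params"], results["mean_test_score"]):
    print(params, round(float(mean_score), 3))
```

Comparing `best_score_` with the held-out test score gives a rough sense of whether the tuned model generalizes beyond the folds it was selected on.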
