In [206]:
import numpy as np
from numpy import random
import pandas as pd
import sklearn as skl
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn import datasets

### Scaling Columns

When a data scientist refers to *Scaling a column* in a dataset, they are reffering to any technique that changes each value in the column so that the distribution of the column has desired properties.  The most common scaling technique is *standardization*.  

A column is standardized when we subtract the mean (of the column) and divide by the standard deviation (of the column) so that the column becomes approximately normally distributed with mean 0 and standard deviation 1.  

##### Example 1

In [16]:
column = np.random.choice(np.array(range(1000)), size = (10**2,))
mean = column.mean()
std = column.std()
print(f'Mean: {mean}, Standard Deviation: {std}')

Mean: 506.31, Standard Deviation: 300.0584174789969


In [17]:
column_standardized = (column - mean)/std
print(f'Mean: {column_standardized.mean()}, Standard Deviation: {column_standardized.std()}')

Mean: -3.219646771412954e-17, Standard Deviation: 1.0


$\Box$

According to [this](https://medium.com/data-science-in-your-pocket/different-methods-to-scale-numerical-features-in-datasets-with-examples-and-codes-93f5d7e60877) article and [this](https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114#3abe) article, it is a good idea to scale columns because:
+ Unscaled columns can adversely effect the training process.
+ Models that depend on a distance measure (KNN, K-Means) assume that the data is standardized.
+ Scaled columns can be compared better by a ML algorithm.

#### Sci-Kit Learn Standard Scaler

The sci-kit learn standard scaler implements *standardization* as defined above. This scaler is a *transform* in that it transforms the data.  The basic steps of a transform are:
1. Instantiate the transformer object.
2. Fit the transformer.
3. Perform the transformation.

##### Example 2

In [23]:
data = pd.DataFrame({'f1': 14*np.random.normal(size = (10,))+100, 'f2': 0.3*np.random.normal(size = (10,)), 'f3': 3.3*np.random.normal(size = (10,))+20, 'y': np.random.choice(np.array([0,1]), size = (10,))})

The feature columns (f1, f2, f3) do not have mean 0 and standard deviation 1.

In [25]:
data.mean(axis = 0)

f1    101.568146
f2     -0.030843
f3     21.908736
y       0.600000
dtype: float64

In [26]:
data.std(axis = 0)

f1    16.232565
f2     0.470830
f3     3.804731
y      0.516398
dtype: float64

Specify the columns to scale.

In [27]:
cols_to_scale = list(data.columns)
cols_to_scale.remove('y')
cols_to_scale

['f1', 'f2', 'f3']

Instantiate the transformer object.

In [28]:
scaler=StandardScaler()

Fit and transform in one step using the *fit_transform* method.

In [29]:
data[cols_to_scale] = scaler.fit_transform(data[cols_to_scale])

Now the feature columns have approximately mean 0 and standard deviation 1.

In [31]:
data[cols_to_scale].mean(axis = 0)

f1   -1.998401e-16
f2    1.110223e-17
f3   -1.387779e-16
dtype: float64

In [32]:
data[cols_to_scale].std(axis = 0)

f1    1.054093
f2    1.054093
f3    1.054093
dtype: float64

$\Box$

##### Example 3

In this example, we use the wine dataset.

In [192]:
wine = datasets.load_wine() 
wine = pd.DataFrame(
    data=np.c_[wine['data'], wine['target']], 
    columns=wine['feature_names'] + ['target'] 
)
wine = wine[['magnesium', 'ash', 'alcalinity_of_ash', 'target']]
wine.head()

Unnamed: 0,magnesium,ash,alcalinity_of_ash,target
0,127.0,2.43,15.6,0.0
1,100.0,2.14,11.2,0.0
2,101.0,2.67,18.6,0.0
3,113.0,2.5,16.8,0.0
4,118.0,2.87,21.0,0.0


The goal with this dataset is to predict which kind of wine a particular row contains the data for.  There are three classes: 0, 1, and 2.

Notice that the means and standard deviations are wildly different between columns.

In [193]:
wine.mean(axis = 0)

magnesium            99.741573
ash                   2.366517
alcalinity_of_ash    19.494944
target                0.938202
dtype: float64

In [194]:
wine.std(axis = 0)

magnesium            14.282484
ash                   0.274344
alcalinity_of_ash     3.339564
target                0.775035
dtype: float64

We train a K-Nearest Neighbors classifier and then check its performance on the unscaled data.

To know that scaling is actually working, we need to perform the exact same train-test split for the unscaled data and the scaled data.  To accomplish this, we randomly select 25% of the indices in the wine dataset to select the test set. 

In [195]:
test_indices = np.random.choice(np.arange(178), size = (45,), replace = False)

Create train and test sets.

In [196]:
train = wine.loc[[x for x in range(178) if x not in test_indices]]
test = wine.loc[test_indices]

In [197]:
X_train = train.iloc[:, :-1]
y_train = train.iloc[:, -1:].to_numpy().ravel()
X_test = train.iloc[:, :-1]
y_test = train.iloc[:, -1:].to_numpy().ravel()

Train the classifier.

In [198]:
knn = KNeighborsClassifier() 
knn.fit(X_train, y_train) 

KNeighborsClassifier()

We now find the predicted y values for our test set.

In [199]:
y_pred = knn.predict(X_test)

And the accuracy is ...

In [200]:
(y_pred == y_test).sum()/len(y_pred)

0.7443609022556391

Lets see what happens when we scale the columns first.

In [201]:
cols_to_scale = list(wine.columns)
cols_to_scale.remove('target')
cols_to_scale

['magnesium', 'ash', 'alcalinity_of_ash']

In [202]:
scaler=StandardScaler()
wine[cols_to_scale] = scaler.fit_transform(wine[cols_to_scale])

In [205]:
wine[cols_to_scale].mean(axis = 0)

magnesium           -2.494883e-17
ash                 -4.059175e-15
alcalinity_of_ash   -7.110417e-17
dtype: float64

In [204]:
wine[cols_to_scale].std(axis = 0)

magnesium            1.002821
ash                  1.002821
alcalinity_of_ash    1.002821
dtype: float64

We have the same train test split because we use the same test_indices. 

In [187]:
train = wine.loc[[x for x in range(178) if x not in test_indices]]
test = wine.loc[test_indices]

In [188]:
X_train = train.iloc[:, :-1]
y_train = train.iloc[:, -1:].to_numpy().ravel()
X_test = train.iloc[:, :-1]
y_test = train.iloc[:, -1:].to_numpy().ravel()

In [189]:
knn = KNeighborsClassifier() 
knn.fit(X_train, y_train)

KNeighborsClassifier()

In [190]:
y_pred = knn.predict(X_test)

The model performs better.

In [191]:
(y_pred == y_test).sum()/len(y_pred)

0.8045112781954887

$\Box$