## Feature Scaling

What is feature scaling? Feature scaling transforms your data so that every feature shares a common range of values. Two scalings are common:

* **Standardizing**
* **Normalizing**

**Standardizing**<br>
To standardize a column, subtract the column's mean from each value and then divide by the column's standard deviation. In Python, suppose `df` has a column called `height`. You could create a standardized height as:
```
df["height_standard"] = (df["height"] - df["height"].mean()) / df["height"].std()
```
This creates a new "standardized" column in which each value can be interpreted as the number of standard deviations the original height lies from the column mean. Standardizing is one of the most widely used forms of feature scaling.
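
As a quick sanity check, the sketch below applies the same formula to a small, made-up `height` column (the values are hypothetical and chosen only for illustration). The standardized column should have a mean of 0 and a standard deviation of 1.

```python
import pandas as pd

# Hypothetical heights in centimeters, used only for illustration.
df = pd.DataFrame({"height": [150.0, 160.0, 170.0, 180.0, 190.0]})

# Note: pandas' .std() uses the sample standard deviation (ddof=1) by default.
df["height_standard"] = (df["height"] - df["height"].mean()) / df["height"].std()

# After standardizing, the column is centered at 0 with unit spread.
standard_mean = df["height_standard"].mean()
standard_std = df["height_standard"].std()
print(df)
print(standard_mean, standard_std)
```

The middle value (170) sits exactly at the mean, so its standardized value is 0; values below the mean become negative and values above it become positive.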

**Normalizing**<br>
A second popular type of feature scaling is normalizing. Normalizing rescales the data so that the smallest value becomes 0 and the largest becomes 1. Using the same example as above, we could normalize in Python as follows:
```
df["height_normal"] = (df["height"] - df["height"].min()) / (
    df["height"].max() - df["height"].min()
)
```
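
To see the effect on concrete numbers, here is a minimal sketch using the same made-up `height` values as an assumption (they are not from any real dataset). The minimum maps to 0, the maximum maps to 1, and everything else lands proportionally in between.

```python
import pandas as pd

# Hypothetical heights in centimeters, used only for illustration.
df = pd.DataFrame({"height": [150.0, 160.0, 170.0, 180.0, 190.0]})

# Min-max normalization: shift so the minimum is 0, then divide by the range.
height_range = df["height"].max() - df["height"].min()
df["height_normal"] = (df["height"] - df["height"].min()) / height_range

normalized = df["height_normal"].tolist()
print(normalized)
```

If a column is constant, its range is 0 and this formula divides by zero, so such columns need separate handling.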

### When Should I Use Feature Scaling?
In many machine learning algorithms, the result will change depending on the units of your data. This is especially true in two specific cases:

* When your algorithm uses a distance-based metric to predict.
* When you incorporate regularization.

**Distance Based Metrics**<br>
Some supervised learning techniques make predictions based on the distances between points; **Support Vector Machines (SVMs)** are one example. Another technique that relies on distances to determine a prediction is **k-nearest neighbors (k-NN)**. With either of these techniques, choosing not to scale your data may lead to drastically different (and likely misleading) predictions.

For this reason, you should apply some form of feature scaling whenever you use these distance-based techniques.
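
The sketch below illustrates the problem on a handful of made-up people described by height (meters) and income (dollars). The data and the helper `nearest_to_first` are hypothetical and exist only for this example. In raw units, income differences swamp height differences, so the "nearest neighbor" of the first person is decided almost entirely by income.

```python
import numpy as np

# Hypothetical people: (height in meters, income in dollars).
people = np.array([
    [1.60, 50_000.0],   # person A (the query point)
    [1.95, 50_100.0],   # person B: very different height, almost the same income
    [1.61, 52_000.0],   # person C: almost the same height, slightly different income
    [1.75, 30_000.0],   # person D
    [1.80, 90_000.0],   # person E
])

def nearest_to_first(points):
    """Return the row index of the point closest to row 0 (Euclidean distance)."""
    distances = np.linalg.norm(points[1:] - points[0], axis=0 + 1)
    return int(np.argmin(distances)) + 1

# Raw units: the income column dominates the distance.
nearest_raw = nearest_to_first(people)

# Standardize each column (population standard deviation, ddof=0), then repeat.
people_scaled = (people - people.mean(axis=0)) / people.std(axis=0)
nearest_scaled = nearest_to_first(people_scaled)

print(nearest_raw, nearest_scaled)
```

Before scaling, person B is the nearest neighbor despite being 35 cm taller; after scaling, person C, who is similar on both features, becomes the nearest neighbor.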

**Regularization**<br>
When you introduce regularization, you will again want to scale the features of your model. The penalty on a particular coefficient in regularized linear regression depends heavily on the scale of the corresponding feature. When one feature has a small range, say from 0 to 10, and another has a large range, say from 0 to 1,000,000, regularization unfairly punishes the feature with the small range. A small-range feature needs a larger coefficient than a large-range feature to have the same effect on the prediction. (A feature's contribution is the product of its coefficient $a$ and its value $b$, so dividing the values by some factor $c$ requires multiplying the coefficient by $c$ to keep $ab$ unchanged.) Therefore, if regularization could remove either of those two features at the same net increase in error, it would remove the small-range feature with the large coefficient, since that reduces the regularization term the most.

Again, this means you will want to scale features any time you are applying regularization.
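
The following sketch reproduces this effect on synthetic data. Everything here is an assumption made for illustration: the two features, their scales, and the true coefficients are invented so that both features contribute equally to the target. `Lasso(alpha=1.0)` is scikit-learn's default penalty strength.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples = 500

# Two independent synthetic features that contribute equally to the target:
# a small-range feature with a large true coefficient, and
# a large-range feature with a small true coefficient.
x_small = rng.normal(0.0, 0.1, n_samples)
x_large = rng.normal(0.0, 1000.0, n_samples)
X_demo = np.column_stack([x_small, x_large])
y_demo = 20.0 * x_small + 0.002 * x_large + rng.normal(0.0, 0.1, n_samples)

# Lasso on the raw features: the L1 penalty sees the large coefficient
# needed by the small-range feature as expensive.
coef_raw = Lasso(alpha=1.0).fit(X_demo, y_demo).coef_

# Lasso on standardized features: both features are penalized on equal terms.
X_demo_scaled = StandardScaler().fit_transform(X_demo)
coef_scaled = Lasso(alpha=1.0).fit(X_demo_scaled, y_demo).coef_

print("raw:   ", coef_raw)
print("scaled:", coef_scaled)
```

On the raw data the small-range feature is dropped (its coefficient is set to 0) even though it matters as much as the other one; after standardizing, both features keep nonzero coefficients.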

Note that feature scaling can speed up convergence of your machine learning algorithms, which is an important consideration when you scale machine learning applications.
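
One way to get a feel for the convergence point is the sketch below, which uses synthetic data and a hypothetical helper, `gram_condition_number`. For least-squares problems, gradient descent tends to converge more slowly as the condition number of $X^TX$ grows, and features on very different scales inflate that number.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic design matrix with two independent features on very different scales.
X_demo = np.column_stack([
    rng.normal(0.0, 1.0, 200),      # small-scale feature
    rng.normal(0.0, 1000.0, 200),   # large-scale feature
])

def gram_condition_number(features):
    """Condition number of X^T X (ratio of its largest to smallest singular value)."""
    return np.linalg.cond(features.T @ features)

# Standardize each column (population standard deviation, ddof=0).
X_standardized = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

cond_raw = gram_condition_number(X_demo)
cond_scaled = gram_condition_number(X_standardized)
print(cond_raw, cond_scaled)
```

The raw matrix is badly conditioned because of the scale mismatch, while the standardized one is close to the ideal value of 1, which is the setting where gradient-based optimizers take well-balanced steps in every direction.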

In [11]:
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler

In [12]:
df = pd.read_csv('datasets/regularisation_dataset2.csv')
df.head()

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 1.25664 | 2.04978 | -6.2364 | 4.71926 | -4.26931 | 0.2059 | 12.31798 |
| 1 | -3.89012 | -0.37511 | 6.14979 | 4.94585 | -3.57844 | 0.0064 | 23.67628 |
| 2 | 5.09784 | 0.9812 | -0.29939 | 5.85805 | 0.28297 | -0.20626 | -1.53459 |
| 3 | 0.39034 | -3.06861 | -5.63488 | 6.43941 | 0.39256 | -0.07084 | -24.6867 |
| 4 | 5.84727 | -0.15922 | 11.41246 | 7.52165 | 1.69886 | 0.29022 | 17.54122 |


In [13]:
# define features and target

X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

In [14]:
# define scaler and fit-transform X

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [15]:
# Define and fit Lasso Reg

lasso_reg = Lasso().fit(X_scaled, y)

In [16]:
# Inspect coefficients

lasso_reg.coef_

array([  0.        ,   3.90753617,   9.02575748,  -0.        ,
       -11.78303187,   0.45340137])

With the data scaled, the first coefficient is still regularized to 0, but now it is the fourth coefficient (rather than the sixth) that is also set to 0. You might want to explore descriptive statistics for the original data to see how the standardization changed each column.

**Training data BEFORE Standardization**...

In [18]:
df.iloc[:,:-1].describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 100.0 | -0.076288 | 5.560089 | -12.39824 | -3.880408 | -0.00994 | 4.065705 | 13.37454 |
| 1 | 100.0 | -0.181381 | 1.737693 | -5.28025 | -1.222918 | -0.278235 | 1.083133 | 4.3012 |
| 2 | 100.0 | 0.339573 | 4.982072 | -11.23591 | -2.833323 | -0.07267 | 3.85592 | 11.9465 |
| 3 | 100.0 | 1.772602 | 8.163906 | -23.82024 | -3.3831 | 0.71186 | 6.704855 | 22.88008 |
| 4 | 100.0 | -0.168269 | 3.184054 | -6.86533 | -2.731047 | -0.12052 | 2.173942 | 7.35129 |
| 5 | 100.0 | 0.009754 | 0.183237 | -0.63444 | -0.09991 | 0.002385 | 0.119822 | 0.52328 |


**Training data AFTER Standardization**...

In [17]:
pd.DataFrame(X_scaled).describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 100.0 | 1.942890e-17 | 1.005038 | -2.227307 | -0.68763 | 0.011993 | 0.748704 | 2.431362 |
| 1 | 100.0 | 6.661338e-18 | 1.005038 | -2.949057 | -0.602398 | -0.056018 | 0.731363 | 2.592612 |
| 2 | 100.0 | 6.772360e-17 | 1.005038 | -2.335133 | -0.640071 | -0.083162 | 0.709356 | 2.341476 |
| 3 | 100.0 | 5.329071e-17 | 1.005038 | -3.15067 | -0.634705 | -0.130585 | 0.607197 | 2.598488 |
| 4 | 100.0 | -6.716849e-17 | 1.005038 | -2.113909 | -0.808934 | 0.015072 | 0.739313 | 2.373528 |
| 5 | 100.0 | -1.776357e-17 | 1.005038 | -3.533335 | -0.601493 | -0.040416 | 0.603717 | 2.816639 |
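
One detail in the table above is worth a note: the `std` column reads 1.005038 rather than exactly 1. The likely explanation is that `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `describe()` reports the sample standard deviation (`ddof=1`). With 100 rows, the ratio between the two is $\sqrt{100/99} \approx 1.005038$. The sketch below checks this on a synthetic 100-row column, which stands in for the real dataset as an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for one feature column with 100 rows.
values = rng.normal(size=(100, 1))

# StandardScaler centers each column and divides by its population std (ddof=0).
scaled = StandardScaler().fit_transform(values)

# NumPy's default std is the population std; pandas' default is the sample std.
std_population = float(np.std(scaled[:, 0]))
std_sample = float(pd.DataFrame(scaled)[0].std())

print(std_population, std_sample)
```

The population standard deviation of the scaled column is 1, and the sample standard deviation matches the 1.005038 that `describe()` reports for every column.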
