# Feature Scaling for Machine Learning 🤖

As you go along you'll get to know about the two most important feature scaling techniques💥 `StandardScaler`, also known as **Standardization**, and `MinMaxScaler`, also known as **Normalization**. And also about `MaxAbsScaler` and `RobustScaler`🔥.

The first thing we need to do is to import some relevant libraries.

## Import Libraries 📦

In [220]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

## Import Dataset 📄
We'll be using the `Ames_Housing_Sales_RO.csv` Dataset. This dataset doesn't contain any outliers.

The Ames housing dataset examines features of houses sold in Ames during the 2006–10 timeframe. The goal is to use the training data to predict the sale prices of the houses in the testing data.

In [221]:
data = pd.read_csv('../datasets/Ames_Housing_Sales_RO.csv')
data.head() #this returns top 5 rows of the dataset

Unnamed: 0,1stFlrSF,2ndFlrSF,3SsnPorch,Alley,BedroomAbvGr,BldgType,BsmtCond,BsmtExposure,BsmtFinSF1,BsmtFinSF2,...,ScreenPorch,Street,TotRmsAbvGrd,TotalBsmtSF,Utilities,WoodDeckSF,YearBuilt,YearRemodAdd,YrSold,SalePrice
0,856.0,854.0,0.0,,3,1Fam,TA,No,706.0,0.0,...,0.0,Pave,8,856.0,AllPub,0.0,2003,2003,2008,208500.0
1,920.0,866.0,0.0,,3,1Fam,TA,Mn,486.0,0.0,...,0.0,Pave,6,920.0,AllPub,0.0,2001,2002,2008,223500.0
2,1145.0,1053.0,0.0,,4,1Fam,TA,Av,655.0,0.0,...,0.0,Pave,9,1145.0,AllPub,192.0,2000,2000,2008,250000.0
3,1694.0,0.0,0.0,,3,1Fam,TA,Av,1369.0,0.0,...,0.0,Pave,7,1686.0,AllPub,255.0,2004,2005,2007,307000.0
4,1040.0,0.0,0.0,,3,1Fam,TA,No,906.0,0.0,...,0.0,Pave,5,1040.0,AllPub,0.0,1965,1965,2008,129500.0


In [222]:
# examine each of the columns
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 546 entries, 0 to 545
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   1stFlrSF       546 non-null    float64
 1   2ndFlrSF       546 non-null    float64
 2   3SsnPorch      546 non-null    float64
 3   Alley          546 non-null    object 
 4   BedroomAbvGr   546 non-null    int64  
 5   BldgType       546 non-null    object 
 6   BsmtCond       546 non-null    object 
 7   BsmtExposure   546 non-null    object 
 8   BsmtFinSF1     546 non-null    float64
 9   BsmtFinSF2     546 non-null    float64
 10  BsmtFinType1   546 non-null    object 
 11  BsmtFinType2   546 non-null    object 
 12  BsmtFullBath   546 non-null    int64  
 13  BsmtHalfBath   546 non-null    int64  
 14  BsmtQual       546 non-null    object 
 15  BsmtUnfSF      546 non-null    float64
 16  CentralAir     546 non-null    object 
 17  Condition1     546 non-null    object 
 18  Condition2

## Separate features and target 🖖

In [223]:
target_col = "SalePrice"

X = data.drop(target_col, axis=1)
y = data[target_col]

In [224]:
# dropping categorical variables
X = X.drop(X.columns[X.dtypes == np.object], axis=1)

## Create train and test splits ⚔️

In [225]:
# import train_test_split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) # splits train and test in 7:3 ratio

## Feature Scaling ⚖️

It is a good practice to fit the scaler on the training data and then use it to transform the testing data. This would avoid any data leakage during the model testing process. Also, the scaling of target values is generally not required.

### Standard Scaler, also known as Standardization

It is a scaling technique that standardized the data by subtracting it from its mean and then dividing it by the standard deviation.. The values are centered around the mean. It doesn't have a bounding range.

Standardization can be useful where the data follows a Gaussian distribution i.e., Normally Distributed.

##### Import the class containing the `StandardScaler` from `sklearn.preprocessing` and create an instance of the class.

In [226]:
from sklearn.preprocessing import StandardScaler
SS = StandardScaler()

##### Fit StandardScaler on X_train

In [227]:
trainingset = X_train.copy() # copy because we'll use this set again
X_train_ss = SS.fit_transform(trainingset)

Now all the variables are on the same scale.
##### Fit Regression

In [228]:
# import linear regression and create an instance of it
from sklearn.linear_model import LinearRegression
LR = LinearRegression()
LR.fit(X_train_ss, y_train)

LinearRegression()

##### Predicting

Before predicting, we need to transform the test-set as well according to how we transformed the train-set.

We have to use the mean and standard deviation that was originally defined for the training set. That's why we use `SS.transform` on test-set instead of `SS.fit_fransform`

In [229]:
testset = X_test.copy() # copy because we'll use this set again
X_test_ss = SS.transform(testset) 
y_pred_train_ss = LR.predict(X_train_ss) # using train data
y_pred_test_ss = LR.predict(X_test_ss) # using test data

##### Mean Squared Error (MSE)
 It is the simplest and most common loss function, it takes the difference between your model's predictions and the ground truth, squares it, and averages it out across the whole dataset.

In [230]:
# Storage for error values
error_df = list()

In [231]:
# import the method mean_squared_error from the sklearn.metrics
from sklearn.metrics import mean_squared_error

In [232]:
mse_train_ss =  mean_squared_error(y_train, y_pred_train_ss)
mse_test_ss =  mean_squared_error(y_test, y_pred_test_ss)
print("Train : ", mse_train_ss)
print("Test : ", mse_test_ss)

error_df.append(pd.Series({'train': str(mse_train_ss),
                           'test' : str(mse_test_ss)},
                          name='standardscaling'))

Train :  315636113.76629907
Test :  300641635.0900279


### Min-Max Scaler, also known as Normalization

It is a scaling technique that converts variables to continuous variables in the 0-1 interval by mapping minimum values to 0 and maximum to 1.

Normalization is good to use when you know that the distribution of your data does not follow a Gaussian distribution. It is sensitive to Outliers.

This can be useful in algorithms that do not assume any distribution of the data like **K-Nearest Neighbors** and **Neural Networks**.

##### Import the class containing the `MinMaxScaler` from `sklearn.preprocessing` and create an instance of the class.

In [233]:
from sklearn.preprocessing import MinMaxScaler
MMS = MinMaxScaler()

##### Fit MinMaxScaler on X_train

In [234]:
trainingset = X_train.copy() # copy because we'll use this set again
X_train_mms = MMS.fit_transform(trainingset)

##### Fit Regression

In [235]:
LR.fit(X_train_mms, y_train)

LinearRegression()

##### Predicting

In [236]:
# transform the test set
testset = X_test.copy() # copy because we'll use this set again
X_test_mms = MMS.transform(testset)

# predicting
y_pred_train_mms = LR.predict(X_train_mms) # using train data
y_pred_test_mms = LR.predict(X_test_mms) # using test data

##### Mean Squared Error (MSE)

In [237]:
mse_train_mms =  mean_squared_error(y_train, y_pred_train_mms)
mse_test_mms =  mean_squared_error(y_test, y_pred_test_mms)
print("Train : ", mse_train_mms)
print("Test : ", mse_test_mms)

error_df.append(pd.Series({'train': str(mse_train_mms),
                           'test' : str(mse_test_mms)},
                          name='minmaxscaling'))

Train :  315636113.7662993
Test :  300641635.09002805


### Max Abs Scaler

It is a scaling technique that scales variables by maximum absolute value. It is similar to Min Max Scaler, therefore, it is also sensitive to outliers.

##### Import the class containing the `MaxAbsScaler` from `sklearn.preprocessing` and create an instance of the class.

In [238]:
from sklearn.preprocessing import MaxAbsScaler

### Robust Scaler

It is a scaling technique that is similar to Min-Max Scaler but instead maps the interquartile range, 75th percentile value minus the 25th percentile value, to 0-1.

It is robust to outliers.

##### Import the class containing the `RobustScaler` from `sklearn.preprocessing` and create an instance of the class.

In [239]:
from sklearn.preprocessing import RobustScaler

In [240]:
scalers = {'maxabsscaling': MaxAbsScaler(),
            'robustscaling': RobustScaler()}

In [241]:
# iterate over scalers and calculate errors
for scaler_label, scaler in scalers.items():
    trainingset = X_train.copy() # copy because we'll use this set again
    # scale trainset
    X_train_s = scaler.fit_transform(trainingset)
    # fit regression
    LR.fit(trainingset, y_train)
    # transform the test set
    testset = X_test.copy() # copy because we'll use this set again
    X_test_s = scaler.transform(testset)
    # predicting
    y_pred_train_s = LR.predict(X_train_s) # using train data
    y_pred_test_s = LR.predict(X_test_s) # using test data
    error_df.append(pd.Series({'train': str(mean_squared_error(y_train, y_pred_train_s)),
                                'test' : str(mean_squared_error(y_test, y_pred_test_s))},
                                name=scaler_label))

In [242]:
# Assemble the results
error_df = pd.concat(error_df, axis=1)
error_df

Unnamed: 0,standardscaling,minmaxscaling,maxabsscaling,robustscaling
train,315636113.76629907,315636113.7662993,3349773986825.721,3353991317143.2114
test,300641635.0900279,300641635.09002805,3347205594248.1323,3349556724291.42


Most of the time in linear regression, the scaling won't affect the outcome. However, this is not the case with **Ridge** and **Lasso** Regression. And with distance-based algorithms like **KNN**, **K-means**, and **SVM**.