# Feature Scaling

Feature scaling is a technique for bringing the independent features in the data into a comparable, fixed range. It is performed during data pre-processing to handle features whose magnitudes, values, or units vary widely. Without feature scaling, many machine learning algorithms (for example, distance-based or gradient-based ones) tend to give more weight to features with larger values and less weight to features with smaller values, regardless of the units in which they are measured.
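As a rough illustration of this effect, the sketch below computes the Euclidean distance between two houses described by `LotArea` and `MSSubClass`. The two rows and the choice of Euclidean distance are assumptions made only for this example. They show how one large-magnitude feature can dominate a computation.

```python
import numpy as np

# Two houses described by (LotArea, MSSubClass); illustrative values only.
house_a = np.array([8450.0, 60.0])
house_b = np.array([9600.0, 20.0])

# Squared difference per feature, and the resulting Euclidean distance.
squared_diff = (house_a - house_b) ** 2
distance = np.sqrt(squared_diff.sum())

# Fraction of the squared distance contributed by each feature.
share = squared_diff / squared_diff.sum()

print("squared differences:", squared_diff)
print("distance:", distance)
print("share per feature:", share)
```

Almost all of the distance comes from `LotArea`, simply because its values are numerically larger.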

# Types of Feature Scaling

# 1) Absolute Maximum Scaling

This method of scaling requires two steps:

1) First, select the maximum absolute value among all the entries of a particular feature.

For demonstration purposes, a dataset is provided. It is a simplified version of the original house price prediction dataset, containing only two of its columns. The first five rows of the data are shown below:

In [1]:
import pandas as pd
df = pd.read_csv('feature scaling data set/price.csv')
df.head()

|   | LotArea | MSSubClass |
|---|---------|------------|
| 0 | 8450 | 60 |
| 1 | 9600 | 20 |
| 2 | 11250 | 60 |
| 3 | 9550 | 70 |
| 4 | 14260 | 60 |


In [2]:
import numpy as np

In [17]:
max_vals_col1 = np.max(np.abs(df['LotArea']))
max_vals_col1

215245

In [18]:
max_vals_col2 = np.max(np.abs(df['MSSubClass']))
max_vals_col2

190

2) Then divide each entry of the column by this maximum value. The scaled values lie between -1 and 1.

In [23]:
X_scaled_col1 = df['LotArea'] / max_vals_col1

In [24]:
X_scaled_col1

0       0.039258
1       0.044600
2       0.052266
3       0.044368
4       0.066250
          ...   
1455    0.036781
1456    0.061209
1457    0.042008
1458    0.045144
1459    0.046166
Name: LotArea, Length: 1460, dtype: float64

In [25]:
X_scaled_col2 = df['MSSubClass'] / max_vals_col2

In [26]:
X_scaled_col2

0       0.315789
1       0.105263
2       0.315789
3       0.368421
4       0.315789
          ...   
1455    0.315789
1456    0.105263
1457    0.368421
1458    0.105263
1459    0.105263
Name: MSSubClass, Length: 1460, dtype: float64

In [31]:
# Combine the two scaled columns into a single DataFrame
dataframes = [X_scaled_col1, X_scaled_col2]
Absolute_Maximum_Scaling = pd.concat(dataframes, axis=1)
Absolute_Maximum_Scaling

|      | LotArea | MSSubClass |
|------|---------|------------|
| 0 | 0.039258 | 0.315789 |
| 1 | 0.044600 | 0.105263 |
| 2 | 0.052266 | 0.315789 |
| 3 | 0.044368 | 0.368421 |
| 4 | 0.066250 | 0.315789 |
| ... | ... | ... |
| 1455 | 0.036781 | 0.315789 |
| 1456 | 0.061209 | 0.105263 |
| 1457 | 0.042008 | 0.368421 |
| 1458 | 0.045144 | 0.105263 |
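The two steps described in this section (find the maximum absolute value of a column, then divide each entry by it) can also be applied to all columns at once. The sketch below is a minimal example under stated assumptions: `sample` is a small hypothetical frame that mirrors the first five rows shown earlier, so its maxima, and therefore its scaled values, differ from those of the full dataset. scikit-learn's `MaxAbsScaler` is included for comparison, since it performs the same column-wise division.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical five-row sample (not the full dataset).
sample = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "MSSubClass": [60, 20, 60, 70, 60],
})

# Step 1: maximum absolute value of each column.
max_abs = sample.abs().max()

# Step 2: divide every entry by its column's maximum absolute value.
manual = sample / max_abs

# The same operation using scikit-learn.
sk = MaxAbsScaler().fit_transform(sample)

print(manual)
print("matches MaxAbsScaler:", np.allclose(manual.to_numpy(), sk))
```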


# 2) Min-Max Scaling

Min-max scaling rescales each feature to a specific range, such as 0 to 1, by subtracting the minimum value and dividing by the range.

This method of scaling requires two steps:

First, find the minimum and maximum values of the column. Then subtract the minimum value from each entry and divide the result by the difference between the maximum and minimum values.

In [12]:
from sklearn.preprocessing import MinMaxScaler

In [13]:
scaler = MinMaxScaler()

In [14]:
Scaled_X = scaler.fit_transform(df)

In [16]:
Scaled_X_df = pd.DataFrame(Scaled_X, columns=['LotArea', 'MSSubClass'])

In [18]:
Scaled_X_df.head()

|   | LotArea | MSSubClass |
|---|---------|------------|
| 0 | 0.03342 | 0.235294 |
| 1 | 0.038795 | 0.0 |
| 2 | 0.046507 | 0.235294 |
| 3 | 0.038561 | 0.294118 |
| 4 | 0.060576 | 0.235294 |


Because this method relies on the minimum and maximum values, it is sensitive to outliers: a single extreme value stretches the range and compresses the remaining values. After these two steps, the scaled data lies between 0 and 1.
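The subtract-the-minimum, divide-by-the-range computation can be written out by hand and compared with `MinMaxScaler`. This is a minimal sketch: `sample` is a hypothetical five-row frame mirroring the first rows of the data, so the numbers differ from the full-dataset output above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical five-row sample (not the full dataset).
sample = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "MSSubClass": [60, 20, 60, 70, 60],
})

# Step 1: per-column minimum and maximum.
col_min = sample.min()
col_max = sample.max()

# Step 2: subtract the minimum and divide by the range.
manual = (sample - col_min) / (col_max - col_min)

# scikit-learn equivalent with its default feature_range of (0, 1).
sk = MinMaxScaler().fit_transform(sample)

print(manual)
print("matches MinMaxScaler:", np.allclose(manual.to_numpy(), sk))
```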

# 3) Normalization

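This method is similar to the previous one, but instead of subtracting the minimum value, we subtract the mean of the column from each entry and then divide the result by the difference between the maximum and minimum values. This is often called mean normalization, and the resulting values lie within [-1, 1]. Note that scikit-learn's `Normalizer`, used in the cells that follow, performs a different operation: it rescales each row (sample) independently so that its Euclidean (L2) norm equals 1. This is why the `LotArea` values in its output are all close to 1.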

In [19]:
from sklearn.preprocessing import Normalizer

In [20]:
normalizer = Normalizer()

In [21]:
Normalized_X = normalizer.fit_transform(df)

In [22]:
Normalized_X

array([[0.99997479, 0.00710041],
       [0.99999783, 0.00208333],
       [0.99998578, 0.00533326],
       ...,
       [0.99997003, 0.00774142],
       [0.99999788, 0.00205824],
       [0.99999797, 0.00201268]])

In [23]:
Normalized_X_df = pd.DataFrame(Normalized_X, columns=['LotArea', 'MSSubClass'])

In [24]:
Normalized_X_df

|      | LotArea | MSSubClass |
|------|---------|------------|
| 0 | 0.999975 | 0.007100 |
| 1 | 0.999998 | 0.002083 |
| 2 | 0.999986 | 0.005333 |
| 3 | 0.999973 | 0.007330 |
| 4 | 0.999991 | 0.004208 |
| ... | ... | ... |
| 1455 | 0.999971 | 0.007578 |
| 1456 | 0.999999 | 0.001518 |
| 1457 | 0.999970 | 0.007741 |
| 1458 | 0.999998 | 0.002058 |
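The text of this section describes a column-wise computation (subtract the mean, divide by the range), while `Normalizer` works row by row. The sketch below places the two side by side to make the difference explicit. It is an illustration under stated assumptions: `sample` is a hypothetical five-row frame mirroring the first rows of the data, and the variable names are chosen for this example only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import Normalizer

# Hypothetical five-row sample (not the full dataset).
sample = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "MSSubClass": [60, 20, 60, 70, 60],
})

# Column-wise mean normalization: (x - mean) / (max - min).
mean_normalized = (sample - sample.mean()) / (sample.max() - sample.min())

# Row-wise unit-norm scaling: divide each row by its own L2 norm.
row_norms = np.sqrt((sample ** 2).sum(axis=1))
manual_unit_rows = sample.div(row_norms, axis=0)

# Normalizer (default norm='l2') performs the row-wise operation.
sk_unit_rows = Normalizer().fit_transform(sample)

print(mean_normalized)
print(manual_unit_rows)
print("matches Normalizer:", np.allclose(manual_unit_rows.to_numpy(), sk_unit_rows))
```

Because each row is scaled independently, the row-wise result for a given house does not depend on the rest of the dataset, which is why the first rows here agree with the `Normalizer` output shown above.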


# 4) Standardization

This method of scaling is based on the central tendency and variance of the data. First, calculate the mean and standard deviation of the feature to be standardized. Then subtract the mean from each entry and divide the result by the standard deviation. Standardization is used when we want each feature to have zero mean and unit standard deviation. The result is not bounded to a fixed range. It is less affected by outliers than min-max scaling, although the mean and standard deviation are themselves influenced by extreme values.

In [25]:
from sklearn.preprocessing import StandardScaler

In [26]:
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)
scaled_df_X = pd.DataFrame(scaled_data,
                         columns=df.columns)
print(scaled_df_X.head())

    LotArea  MSSubClass
0 -0.207142    0.073375
1 -0.091886   -0.872563
2  0.073480    0.073375
3 -0.096897    0.309859
4  0.375148    0.073375
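The same computation can be written by hand and checked against `StandardScaler`. One detail worth noting is that `StandardScaler` uses the population standard deviation (`ddof=0`), whereas pandas' `.std()` defaults to the sample standard deviation (`ddof=1`). This is a minimal sketch on a hypothetical five-row `sample`, so the values differ from the full-dataset output above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical five-row sample (not the full dataset).
sample = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "MSSubClass": [60, 20, 60, 70, 60],
})

# Subtract the column mean and divide by the population standard deviation.
manual = (sample - sample.mean()) / sample.std(ddof=0)

# scikit-learn equivalent.
sk = StandardScaler().fit_transform(sample)

print(manual)
print("column means:", manual.mean().round(10).tolist())
print("matches StandardScaler:", np.allclose(manual.to_numpy(), sk))
```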


# 5) Robust Scaling

In this method of scaling, we use two statistical measures of the data: the median and the interquartile range (IQR).

After calculating these two values, subtract the median from each entry and divide the result by the interquartile range. Because both measures are insensitive to extreme values, this method is robust to outliers.

In [27]:
from sklearn.preprocessing import RobustScaler
 
scaler = RobustScaler()
scaled_data = scaler.fit_transform(df)
scaled_df_X = pd.DataFrame(scaled_data,
                         columns=df.columns)
print(scaled_df_X.head())

    LotArea  MSSubClass
0 -0.254076         0.2
1  0.030015        -0.6
2  0.437624         0.2
3  0.017663         0.4
4  1.181201         0.2
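The median and interquartile range computation can be sketched by hand and compared with `RobustScaler`, assuming its default quantile range of the 25th to 75th percentile. The `sample` frame below is hypothetical. Its `MSSubClass` values were chosen so that the interquartile range is not zero, which keeps the manual division well defined.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Hypothetical five-row sample; values are illustrative only.
sample = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "MSSubClass": [60, 20, 50, 70, 120],
})

# Median and interquartile range (75th minus 25th percentile) per column.
median = sample.median()
iqr = sample.quantile(0.75) - sample.quantile(0.25)

# Subtract the median and divide by the IQR.
manual = (sample - median) / iqr

# scikit-learn equivalent with default settings.
sk = RobustScaler().fit_transform(sample)

print(manual)
print("matches RobustScaler:", np.allclose(manual.to_numpy(), sk))
```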
