# Feature Scaling

<hr>

- **Feature Scaling:** transforms feature values into a similar range so that machine learning algorithms can behave optimally.

## Feature Scaling Techniques
- **Normalization:** a special case of min-max scaling (`MinMaxScaler` in sklearn)
    - Normalization rescales values to the range `0 - 1`.
    
    $$\dfrac{values - values.min()}{values.max() - values.min()}$$
    - **MinMaxScaler:** rescales to any chosen range through its `feature_range` parameter (default `(0, 1)`).
    
    
- **Standardization** (`StandardScaler` from sklearn)
    - Mean: 0; StdDev: 1
    
    $$\dfrac{values - values.mean()}{values.std()}$$
    - Less sensitive to outliers than min-max normalization, although extreme values still shift the mean and standard deviation.
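
The two formulas above can be sketched directly in NumPy. This is a minimal illustration rather than sklearn's implementation: the helper names `min_max_normalize` and `standardize` are made up for this example, and the sample values are the five `MinTemp` readings shown by `dataset.head()` later in the notebook.

```python
import numpy as np

def min_max_normalize(values, feature_range=(0.0, 1.0)):
    """Rescale values linearly so they span feature_range (0 to 1 by default)."""
    values = np.asarray(values, dtype=float)
    low, high = feature_range
    # (x - min) / (max - min) maps the data onto [0, 1]
    unit = (values - values.min()) / (values.max() - values.min())
    # Stretch and shift [0, 1] onto [low, high]
    return unit * (high - low) + low

def standardize(values):
    """Shift to mean 0 and scale to standard deviation 1 (population std, ddof=0)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

temps = np.array([12.9, 13.3, 15.3, 12.9, 14.8])

normalized = min_max_normalize(temps)
standardized = standardize(temps)
print(normalized)
print(standardized)
```

Both helpers divide by zero when every value is identical, so a constant column needs separate handling. Note also that `np.std` defaults to the population standard deviation (`ddof=0`), which is what sklearn's `StandardScaler` uses, whereas pandas' `Series.std()` defaults to the sample standard deviation (`ddof=1`), so the two give slightly different results.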

<hr>

## Machine Learning Algorithm
- Some algorithms are more sensitive to feature scale than others
- **Distance-based** algorithms are most affected by the range of features.
    * Examples include: `SVM`, `KNN`, `K-means`
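
To see why distance-based methods react to feature ranges, the sketch below compares nearest neighbours before and after min-max scaling. The four rows are made-up (pressure in hPa, cloud cover in oktas) pairs, not values from the dataset, and `nearest_to_first` is a hypothetical helper written for this illustration.

```python
import numpy as np

# Columns: pressure (hPa), cloud cover (oktas). Illustrative values only.
points = np.array([
    [1010.0, 1.0],  # A: the query point
    [1012.0, 8.0],  # B: similar pressure, opposite cloud cover
    [1018.0, 1.0],  # C: same cloud cover, moderately different pressure
    [1040.0, 4.0],  # D: far away in pressure
])

def nearest_to_first(rows):
    """Index of the row closest (Euclidean) to row 0, excluding row 0 itself."""
    distances = np.linalg.norm(rows[1:] - rows[0], axis=1)
    return int(np.argmin(distances)) + 1

# Min-max scale each column to the 0-1 range
col_min = points.min(axis=0)
col_max = points.max(axis=0)
scaled = (points - col_min) / (col_max - col_min)

raw_nearest = nearest_to_first(points)
scaled_nearest = nearest_to_first(scaled)
print(raw_nearest, scaled_nearest)
```

On the raw values, B is the nearest neighbour of A: an 8 hPa gap to C counts for more than a cloud-cover gap that spans the whole observed range. After scaling, each feature contributes in proportion to its own range and C becomes the nearest neighbour.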

In [1]:
import pandas as pd

In [4]:
data = pd.read_csv('data/weather.csv', index_col=0, parse_dates=True)
data.head()

Date,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,...,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RISK_MM,RainTomorrow
2008-02-01,19.5,22.4,15.6,6.2,0.0,,,S,SSW,17.0,...,84.0,1017.6,1017.4,8.0,8.0,20.7,20.9,Yes,6.0,Yes
2008-02-02,19.5,25.6,6.0,3.4,2.7,,,W,E,9.0,...,73.0,1017.9,1016.4,7.0,7.0,22.4,24.8,Yes,6.6,Yes
2008-02-03,21.6,24.5,6.6,2.4,0.1,,,ESE,ESE,17.0,...,86.0,1016.7,1015.6,7.0,8.0,23.5,23.0,Yes,18.8,Yes
2008-02-04,20.2,22.8,18.8,2.2,0.0,,,NNE,E,22.0,...,90.0,1014.2,1011.8,8.0,8.0,21.4,20.9,Yes,77.4,Yes
2008-02-05,19.7,25.7,77.4,,0.0,,,NNE,W,11.0,...,74.0,1008.3,1004.8,8.0,8.0,22.5,25.5,Yes,1.6,Yes


In [5]:
data.dtypes

MinTemp          float64
MaxTemp          float64
Rainfall         float64
Evaporation      float64
Sunshine         float64
WindGustDir       object
WindGustSpeed    float64
WindDir9am        object
WindDir3pm        object
WindSpeed9am     float64
WindSpeed3pm     float64
Humidity9am      float64
Humidity3pm      float64
Pressure9am      float64
Pressure3pm      float64
Cloud9am         float64
Cloud3pm         float64
Temp9am          float64
Temp3pm          float64
RainToday         object
RISK_MM          float64
RainTomorrow      object
dtype: object

In [6]:
# keep only the numeric features
dataset = data.select_dtypes(include='number')
dataset.head()

Date,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RISK_MM
2008-02-01,19.5,22.4,15.6,6.2,0.0,,17.0,20.0,92.0,84.0,1017.6,1017.4,8.0,8.0,20.7,20.9,6.0
2008-02-02,19.5,25.6,6.0,3.4,2.7,,9.0,13.0,83.0,73.0,1017.9,1016.4,7.0,7.0,22.4,24.8,6.6
2008-02-03,21.6,24.5,6.6,2.4,0.1,,17.0,2.0,88.0,86.0,1016.7,1015.6,7.0,8.0,23.5,23.0,18.8
2008-02-04,20.2,22.8,18.8,2.2,0.0,,22.0,20.0,83.0,90.0,1014.2,1011.8,8.0,8.0,21.4,20.9,77.4
2008-02-05,19.7,25.7,77.4,,0.0,,11.0,6.0,88.0,74.0,1008.3,1004.8,8.0,8.0,22.5,25.5,1.6


In [7]:
data.shape, dataset.shape

((3337, 22), (3337, 17))

In [9]:
# count the missing values in each column
dataset.isna().sum()

MinTemp             3
MaxTemp             2
Rainfall            6
Evaporation        51
Sunshine           16
WindGustSpeed    1036
WindSpeed9am       26
WindSpeed3pm       25
Humidity9am        14
Humidity3pm        13
Pressure9am        20
Pressure3pm        19
Cloud9am          566
Cloud3pm          561
Temp9am             4
Temp3pm             4
RISK_MM             0
dtype: int64

In [10]:
# drop every row that has at least one missing value
dataset = dataset.dropna(axis=0)

In [15]:
dataset = dataset.drop(columns=['RISK_MM'])

In [16]:
dataset.shape

(1696, 16)

In [17]:
dataset.head()

Date,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm
2010-10-20,12.9,20.3,0.2,3.0,10.9,37.0,11.0,26.0,70.0,57.0,1028.8,1025.6,3.0,1.0,16.9,19.8
2010-10-21,13.3,21.5,0.0,6.6,11.0,41.0,11.0,28.0,75.0,58.0,1025.9,1022.4,2.0,5.0,17.6,21.3
2010-10-22,15.3,23.0,0.0,5.6,11.0,41.0,6.0,19.0,70.0,63.0,1021.4,1017.8,1.0,4.0,19.0,22.2
2010-10-26,12.9,26.7,0.2,3.8,12.1,33.0,13.0,24.0,73.0,56.0,1018.0,1015.0,1.0,5.0,17.8,22.5
2010-10-27,14.8,23.8,0.0,6.8,9.6,54.0,13.0,26.0,76.0,69.0,1016.0,1014.7,2.0,7.0,20.2,20.6


### Explore Normalization and Standardization

In [18]:
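# features: the cleaned numeric columns
X = dataset
# labels: taken from the full `data`, so not yet aligned with the rows kept in X
y = data['RainToday'].dropna()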

In [19]:
len(X),len(y)

(1696, 3331)
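
The lengths differ because `X` lost every row with a missing feature, while `y` only lost rows with a missing label. One possible way to line them up is to keep just the dates present in both. The sketch below uses a small made-up frame in place of the weather data and assumes the date index is unique.

```python
import numpy as np
import pandas as pd

# Small stand-in for the weather data (made-up values, unique date index)
idx = pd.date_range('2020-01-01', periods=5, freq='D')
data = pd.DataFrame({
    'MinTemp': [12.9, np.nan, 15.3, 12.9, 14.8],
    'Humidity9am': [70.0, 75.0, np.nan, 73.0, 76.0],
    'RainToday': ['No', 'Yes', 'No', None, 'Yes'],
}, index=idx)

# Same steps as in the notebook: numeric features without NaN, labels without NaN
X = data.select_dtypes(include='number').dropna(axis=0)
y = data['RainToday'].dropna()

# Keep only the dates that survive in both X and y
common = X.index.intersection(y.index)
X, y = X.loc[common], y.loc[common]

print(len(X), len(y))
```

After this step `X` and `y` share the same index, so each feature row is paired with the label from the same date.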