# Feature scaling
- Feature scaling refers to the techniques used to normalize the range of the independent variables in our data; in other words, to bring the feature values onto a similar scale.
- Variables with a larger magnitude or value range can dominate those with a smaller one in models that are sensitive to scale.
- The scale of the features is therefore an important consideration when building machine learning models.
- Feature scaling is generally the last step in the data preprocessing pipeline, performed just before training the machine learning algorithm.
- A linear rescaling such as standardisation:
    - preserves the shape of the original distribution
    - leaves the minimum and maximum values different from one variable to another
    - preserves outliers
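
To make the point about larger-magnitude variables concrete, here is a minimal sketch using Euclidean distance. The three rows and the feature names (age, annual income) are made up for illustration and are not part of the Titanic data used below.

```python
import numpy as np

# Hypothetical rows: [age in years, annual income]. Values are invented.
a = np.array([25.0, 50_000.0])
b = np.array([60.0, 51_000.0])   # very different age, similar income
c = np.array([26.0, 52_000.0])   # similar age, larger income gap


def euclidean(u, v):
    """Plain Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(np.sum((u - v) ** 2)))


# On the raw scale the income column decides almost everything:
# the 35-year age gap between a and b barely changes the distance.
d_ab_raw = euclidean(a, b)
d_ac_raw = euclidean(a, c)

# Rescale each column to [0, 1] using the min and max of these three rows.
X = np.vstack([a, b, c])
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
d_ab_scaled = euclidean(X_scaled[0], X_scaled[1])
d_ac_scaled = euclidean(X_scaled[0], X_scaled[2])

print("raw:   ", d_ab_raw, d_ac_raw)
print("scaled:", d_ab_scaled, d_ac_scaled)
```

On the raw values, `a` looks closer to `b` simply because their incomes are closer. After rescaling, the age gap counts as well, and `a` ends up closer to `c`.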

# ML algorithms where feature scaling matters
- Gradient descent based models (they converge faster when features are on similar scales)
- Support Vector Machines
- K-means clustering
- Principal Component Analysis (PCA)
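
For scale-sensitive algorithms like these, one common arrangement is to chain the scaler and the model in a scikit-learn pipeline, so that scaling really is the last step before training and is learned from the training data only. The sketch below is an illustration under assumptions: the data is synthetic (`make_classification`), one column is inflated on purpose, and `SVC` is just one possible model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X[:, 0] *= 1000  # inflate one feature so the columns sit on very different scales

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The scaler is fitted on the training split only; the same learned
# mean and standard deviation are then applied to the test split.
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(accuracy)
```

Fitting the scaler inside the pipeline avoids leaking statistics from the test split into the training step.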

# Various Feature Scaling Techniques
- Standardisation
- Normalisation (Min Max Scaling)

In [1]:
import pandas as pd

In [2]:
df=pd.read_csv('titanic.csv',usecols=['Age'])
df.head()

Unnamed: 0,Age
0,22.0
1,38.0
2,26.0
3,35.0
4,35.0


In [3]:
df.isnull().sum()

Age    177
dtype: int64

In [4]:
# Fill missing ages with the median; assign back instead of using inplace on a column selection
df['Age']=df['Age'].fillna(df['Age'].median())

In [5]:
df['Age']

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888    28.0
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

# Standardisation
- Standardisation centres each variable so that its mean is 0 and rescales it so that its variance is 1.
    - z = (x - x_mean) / std
    - x_mean and std are the mean and standard deviation of the variable

In [6]:
# Standardisation using StandardScaler from scikit-learn
from sklearn.preprocessing import StandardScaler
# Create the scaler
sc=StandardScaler()
# Learn the mean and standard deviation of Age, then transform the column
df['Age_sc']=sc.fit_transform(df[['Age']])
df

Unnamed: 0,Age,Age_sc
0,22.0,-0.565736
1,38.0,0.663861
2,26.0,-0.258337
3,35.0,0.433312
4,35.0,0.433312
...,...,...
886,27.0,-0.181487
887,19.0,-0.796286
888,28.0,-0.104637
889,26.0,-0.258337


# Min Max Scaling
- Min Max Scaling rescales the values to the range 0 to 1.
- It is commonly used in deep learning, for example with CNNs.
- X_scaled = (X - X.min) / (X.max - X.min)
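
As a quick check of the formula, the sketch below applies it by hand and compares the result with `MinMaxScaler`. The small `ages` frame is a hypothetical sample chosen for illustration, not the full Titanic column.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical sample of ages, for illustration only.
ages = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0, 80.0, 0.42]})

# Manual min-max scaling: subtract the minimum, divide by the range.
col = ages["Age"]
manual = (col - col.min()) / (col.max() - col.min())

# The same transformation with scikit-learn (default feature_range is (0, 1)).
sk = MinMaxScaler().fit_transform(ages[["Age"]]).ravel()

print(np.allclose(manual.to_numpy(), sk))
```

The smallest value maps to 0 and the largest to 1, so a single extreme value compresses all the others into a narrow band.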

In [7]:
from sklearn.preprocessing import MinMaxScaler
min_max=MinMaxScaler()
df['Age_mm']=min_max.fit_transform(df[['Age']])
df

Unnamed: 0,Age,Age_sc,Age_mm
0,22.0,-0.565736,0.271174
1,38.0,0.663861,0.472229
2,26.0,-0.258337,0.321438
3,35.0,0.433312,0.434531
4,35.0,0.433312,0.434531
...,...,...,...
886,27.0,-0.181487,0.334004
887,19.0,-0.796286,0.233476
888,28.0,-0.104637,0.346569
889,26.0,-0.258337,0.321438


In [8]:
df=pd.DataFrame({'X':[1,2,3,4,5]})
df

Unnamed: 0,X
0,1
1,2
2,3
3,4
4,5


In [9]:
df['X'].mean()

3.0

In [10]:
df['X'].std()

1.5811388300841898

In [11]:
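# Subtract the mean first, then divide by the standard deviation
# pandas .std() returns the sample standard deviation (ddof=1)
df['X_sc']=(df['X']-df['X'].mean())/df['X'].std()
df

Unnamed: 0,X,X_sc
0,1,-1.264911
1,2,-0.632456
2,3,0.0
3,4,0.632456
4,5,1.264911


In [12]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
# StandardScaler divides by the population standard deviation (ddof=0),
# so its values differ slightly from the pandas-based X_sc column
df['X_sc_sk']=sc.fit_transform(df[['X']])
df

Unnamed: 0,X,X_sc,X_sc_sk
0,1,-1.264911,-1.414214
1,2,-0.632456,-0.707107
2,3,0.0,0.0
3,4,0.632456,0.707107
4,5,1.264911,1.414214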
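
When a hand-computed z-score is compared with `StandardScaler`, it is worth checking which standard deviation each side uses. The sketch below assumes the same five values as above, held in a separate frame named `df_demo` (a name introduced here). It shows that pandas' default `std()` (sample, `ddof=1`) does not reproduce the scikit-learn result, while `std(ddof=0)` does.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df_demo = pd.DataFrame({"X": [1, 2, 3, 4, 5]})
x = df_demo["X"]

# Manual z-scores with the sample standard deviation (pandas default, ddof=1).
z_sample = (x - x.mean()) / x.std()

# Manual z-scores with the population standard deviation (ddof=0).
z_population = (x - x.mean()) / x.std(ddof=0)

# StandardScaler uses the population standard deviation.
z_sklearn = StandardScaler().fit_transform(df_demo[["X"]]).ravel()

print(np.allclose(z_population, z_sklearn))  # population std matches sklearn
print(np.allclose(z_sample, z_sklearn))      # sample std does not
```

For large datasets the two versions are nearly identical. The gap is only noticeable on tiny examples such as this one.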
