
# Feature scaling
The next step in our preprocessing pipeline is to scale our features.

Before applying any scaling transformation, it is very important to split your data into a train set and a test set. If you scale first, the scaler's statistics (such as the mean) are computed over all rows, so the training data ends up centered around a value that is not the training mean, and information from the test set leaks into training. That defeats the purpose of holding out a test set. Fit the scaler on the training set only, then apply the same fitted transformation to the test set.
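As a minimal sketch of the split-then-scale order described above, the snippet below fits a `StandardScaler` on the training rows only and reuses it on the test rows. The ten-row array `X` and the `test_size` and `random_state` values are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 10 rows, 2 features (age, income).
X = np.array([
    [22, 23000], [25, 35000], [47, 67000], [80, 50000], [66, 45000],
    [31, 41000], [54, 72000], [38, 39000], [61, 58000], [29, 30000],
], dtype=float)

# Split first, so the test rows never influence the scaler.
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics

print(X_train_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_test_scaled.mean(axis=0))   # generally not exactly 0
```

The test set is transformed with the training mean and standard deviation, so its own mean will usually not be exactly zero. That is expected.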

In [None]:
import numpy as np
import pandas as pd

In [None]:
age=[22,25,47,80,66]
income=[23000,35000,67000,50000,45000]
d={'age':age,'income':income}

In [None]:
df=pd.DataFrame(d)
df

|   | age | income |
|---|-----|--------|
| 0 | 22 | 23000 |
| 1 | 25 | 35000 |
| 2 | 47 | 67000 |
| 3 | 80 | 50000 |
| 4 | 66 | 45000 |


**STANDARDISATION** 

Standardization is a transformation that centers the data by removing the mean value of each feature and then scales it by dividing (non-constant) features by their standard deviation. After standardization, each feature has a mean of zero and a standard deviation of one.
Depending on your needs and data, sklearn provides a bunch of scalers: StandardScaler, MinMaxScaler, MaxAbsScaler and RobustScaler.

1. Standard Scaler:
Sklearn's main scaler, the StandardScaler, uses a strict definition of standardization. It centers each feature and scales it to unit variance using the following formula, where u is the mean and s is the standard deviation of the feature (StandardScaler uses the population standard deviation, i.e. it divides by n rather than n - 1).
x_scaled = (x - u) / s
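The formula above can be checked by hand. The following sketch applies it with plain NumPy to the same `age` and `income` values used in this notebook; the variable names `u`, `s` and `X_scaled` simply mirror the symbols in the formula.

```python
import numpy as np

# Same toy values as the `age` and `income` columns used in this notebook.
X = np.array([
    [22, 23000],
    [25, 35000],
    [47, 67000],
    [80, 50000],
    [66, 45000],
], dtype=float)

u = X.mean(axis=0)  # per-feature mean
s = X.std(axis=0)   # per-feature population standard deviation (ddof=0)

X_scaled = (X - u) / s
print(X_scaled)
```

NumPy's `std` defaults to `ddof=0`, which is why this should agree with StandardScaler. pandas' `DataFrame.std` defaults to `ddof=1` and would give slightly different numbers.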

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit_transform(df.values)

array([[-1.15039743, -1.42360613],
       [-1.01765927, -0.61011691],
       [-0.04424606,  1.55918767],
       [ 1.41587376,  0.40674461],
       [ 0.79642899,  0.06779077]])

In [None]:
df_ss=pd.DataFrame(scaler.fit_transform(df.values),columns=df.columns)
df_ss

|   | age | income |
|---|-----|--------|
| 0 | -1.150397 | -1.423606 |
| 1 | -1.017659 | -0.610117 |
| 2 | -0.044246 | 1.559188 |
| 3 | 1.415874 | 0.406745 |
| 4 | 0.796429 | 0.067791 |


2. MinMax Scaler: 
The MinMaxScaler transforms features by scaling each feature to a given range. This range can be set with the feature_range parameter (default (0, 1)). This scaler works better for cases where the distribution is not Gaussian or the standard deviation is very small. However, it is sensitive to outliers, so if there are outliers in the data, you might want to consider another scaler. For the default range the formula is:
x_scaled = (x - min(x)) / (max(x) - min(x))
For a custom range (a, b), this result is then multiplied by (b - a) and shifted by a.

**Note:** Importing and using the MinMaxScaler works in exactly the same way as the StandardScaler, and the same holds for the other scalers. The only difference is in the parameters passed when creating a new instance.

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(-3,3))
scaler.fit_transform(df.values)
# Here we scale features to a scale between -3 and 3.

array([[-3.        , -3.        ],
       [-2.68965517, -1.36363636],
       [-0.4137931 ,  3.        ],
       [ 3.        ,  0.68181818],
       [ 1.55172414,  0.        ]])
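The output above can be reproduced without sklearn. This sketch applies the min-max formula with NumPy and then stretches the result to the range (-3, 3). The names `a`, `b` and `X_unit` are chosen here for illustration only.

```python
import numpy as np

# Same toy values as the `age` and `income` columns used in this notebook.
X = np.array([
    [22, 23000],
    [25, 35000],
    [47, 67000],
    [80, 50000],
    [66, 45000],
], dtype=float)

a, b = -3, 3  # target range, matching feature_range=(-3, 3)

x_min = X.min(axis=0)  # per-feature minimum
x_max = X.max(axis=0)  # per-feature maximum

X_unit = (X - x_min) / (x_max - x_min)  # scaled to [0, 1]
X_scaled = X_unit * (b - a) + a         # stretched and shifted to [a, b]
print(X_scaled)
```

In each column the smallest value maps to -3 and the largest to 3, which is why both endpoints appear in every column of the output.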

In [None]:
df_ms=pd.DataFrame(scaler.fit_transform(df.values),columns=df.columns)
df_ms

|   | age | income |
|---|-----|--------|
| 0 | -3.0 | -3.0 |
| 1 | -2.689655 | -1.363636 |
| 2 | -0.413793 | 3.0 |
| 3 | 3.0 | 0.681818 |
| 4 | 1.551724 | 0.0 |


# CONCLUSION

### Normalization
With normalization, every attribute is rescaled to a fixed range, usually (0, 1). Its weakness is outliers: a single extreme value stretches the range and squeezes all the other values toward zero.

To understand this, suppose students' marks range from 35 to 45 out of 100. Then 35 maps to 0, 45 maps to 1, and the other students are spread between 0 and 1 according to their marks. If one student scores 90, that mark acts as an outlier: now 35 maps to 0 and 90 maps to 1, so a mark of 45 maps to only about 0.18 and most values are squeezed toward zero.

In this scenario, standardization is usually the better choice.
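The marks example can be worked through in a few lines of plain Python. The specific marks in the list and the helper `min_max` are hypothetical, chosen only to match the 35 to 45 range described above.

```python
def min_max(values):
    """Rescale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

marks = [35, 38, 40, 42, 45]       # hypothetical marks without an outlier
marks_with_outlier = marks + [90]  # one student scores 90

scaled = min_max(marks)
scaled_with_outlier = min_max(marks_with_outlier)

print([round(v, 2) for v in scaled])               # [0.0, 0.3, 0.5, 0.7, 1.0]
print([round(v, 2) for v in scaled_with_outlier])  # [0.0, 0.05, 0.09, 0.13, 0.18, 1.0]
```

With the outlier present, the five original marks occupy less than a fifth of the (0, 1) range.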
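
### Standardization
With standardization, the values of each attribute are centered on a mean of 0 and rescaled to a standard deviation of 1, without being forced into a fixed range.

In general, there is no fixed rule for when to use normalization versus standardization. As a rule of thumb, if your data has outliers, prefer standardization; otherwise, use normalization. Standardization tends to put the values of all attributes into similar ranges, since every attribute ends up with the same standard deviation of 1. Keep in mind that the mean and standard deviation are themselves affected by outliers, so for heavily contaminated data the RobustScaler listed earlier, which scales using the median and interquartile range, may be worth considering.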
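For comparison, here is a small sketch that standardizes a set of marks containing one outlier, using only the standard library. The marks and the helper `standardize` are hypothetical. It uses the population standard deviation, as StandardScaler does.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

marks_with_outlier = [35, 38, 40, 42, 45, 90]  # hypothetical marks, one outlier
z = standardize(marks_with_outlier)

print([round(v, 2) for v in z])
print(round(fmean(z), 6), round(pstdev(z), 6))  # mean close to 0, std close to 1
```

The result has mean 0 and standard deviation 1 and is not confined to a fixed interval. The outlier still pulls the mean and standard deviation, though, so this is an illustration of the transformation rather than a guarantee of robustness.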