<a href="https://www.kaggle.com/code/iamarunkumar/1-data-preprocessing?scriptVersionId=178198074" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# **Feature Scaling**

Feature scaling puts all our features on **the same scale**, so that no single feature dominates the others simply because its values are larger. For example, a salary column (values around $10,000) is on a very different scale from an age column (values around 45). Left unscaled, the salary column can swamp the age column, and the future machine learning model may effectively neglect age. Many machine learning algorithms work poorly on unscaled data, so we scale the data before training the model.

**Feature scaling is always applied column by column (each feature on its own), never across the values of a row.**

There are two main types of feature scaling: normalization and standardization.

Normalization: $x' = \dfrac{x - x_{\min}}{x_{\max} - x_{\min}}$. The scaled values lie between 0 and 1.

Standardization: $x' = \dfrac{x - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ is the standard deviation of the column. Most scaled values fall roughly between -3 and +3.
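
To make the two formulas concrete, here is a minimal NumPy sketch. The salary values are made up for illustration and are not taken from the dataset used later.

```python
import numpy as np

# Hypothetical salary values, chosen only for illustration
salary = np.array([10_000.0, 20_000.0, 40_000.0, 50_000.0])

# Normalization (min-max): (x - min) / (max - min)
salary_norm = (salary - salary.min()) / (salary.max() - salary.min())

# Standardization (z-score): (x - mean) / standard deviation
salary_std = (salary - salary.mean()) / salary.std()

print(salary_norm)  # values 0, 0.25, 0.75 and 1
print(salary_std)   # values centred on 0 with standard deviation 1
```

After either transformation, a salary column and an age column would sit on comparable scales.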

**Always remember: feature scaling is applied after splitting the data into the training set and test set.**

There are **6 steps in data preprocessing**, followed in this order:

1. Import the required libraries
2. Import the data set
3. Taking care of missing data
4. Encoding categorical data
5. Splitting the dataset into training set and test set
6. Feature scaling

# Import the required libraries

In [None]:
# Import the required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Import the dataset

In [None]:
df = pd.read_csv('/kaggle/input/product-purchase/Data.csv')

**Create two new entities:**

1. **Matrix of features**: the columns we use to predict the dependent variable. They are also called independent variables.

2. **Dependent variable vector**: the column we want to predict from the information in the feature columns. It is usually the last column of the dataset.

In [None]:
# Creating the matrix of features X.
# iloc is the pandas indexer for integer-location based selection: it takes the
# positions of the rows and columns we want to extract, rows first, then columns.
#
# ':' on its own is a range with no lower or upper bound, so it selects everything.
# ':-1' selects every column except the last one.
X = df.iloc[:, :-1].values

# Creating the dependent variable vector y (the last column)
y = df.iloc[:, -1].values

In [None]:
print(X)

In [None]:
print(y)
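
The `iloc` slicing above can be tried in isolation on a tiny frame. This is only a sketch: the column names and values below are assumptions made up for illustration, not the contents of `Data.csv`.

```python
import pandas as pd

# Hypothetical toy frame; names and values are illustrative only
toy = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44, 27, 30],
    "Salary": [72000, 48000, 54000],
    "Purchased": ["No", "Yes", "No"],
})

X_toy = toy.iloc[:, :-1].values  # all rows, every column except the last
y_toy = toy.iloc[:, -1].values   # all rows, last column only

print(X_toy.shape)  # (3, 3): three rows, three feature columns
print(y_toy.shape)  # (3,): one label per row
```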

# Taking care of missing values

Our dataset should be clean before we train our models; in particular, it must be free of missing values.
There are several ways to handle missing values. One is to remove the rows that contain them, which is reasonable only when very few rows are affected.
The classic way is to replace each missing value with the average (mean) of its column.
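
The two options can be compared quickly with plain pandas. This is a sketch on made-up values; `dropna` and `fillna` are used here only for illustration and are not what the notebook itself uses.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "Age": [44.0, np.nan, 30.0],
    "Salary": [72000.0, 48000.0, np.nan],
})

dropped = toy.dropna()           # option 1: remove rows with missing values
filled = toy.fillna(toy.mean())  # option 2: replace them with the column mean

print(len(dropped))  # only 1 of the 3 rows survives
print(filled)
```

On a small dataset, dropping rows throws away a large share of the observations, which is why mean replacement is the usual default.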

In [None]:
# We do this with the well-known data science library sklearn (scikit-learn).
# SimpleImputer is a class from the 'impute' module of sklearn that replaces missing values.

from sklearn.impute import SimpleImputer
# Create an object of the SimpleImputer class called imputer.
# We set 2 arguments: 'missing_values' is np.nan and 'strategy' is 'mean'.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Connect the object to the matrix of features with the fit method.
# fit computes the mean of the non-missing values in each column it is given.
# Only numerical columns can be averaged, so we pass all rows (':') and only the
# numerical columns ('1:3', i.e. columns 1 and 2), not the string column.
imputer.fit(X[:, 1:3])

# Replace the missing values with the column means using the transform method.
# transform takes the columns in which missing values should be replaced, so we
# pass the same columns as in fit and assign the result back to them.
X[:, 1:3] = imputer.transform(X[:, 1:3])

In [None]:
print(X)
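
To see what `fit` actually learned, the fitted imputer exposes the per-column means in its `statistics_` attribute. A small self-contained sketch on made-up numbers:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns (age, salary) with missing entries
toy = np.array([
    [44.0, 72000.0],
    [np.nan, 48000.0],
    [30.0, np.nan],
])

toy_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
toy_filled = toy_imputer.fit_transform(toy)  # fit and transform in one call

print(toy_imputer.statistics_)  # means of the non-missing values: 37 and 60000
print(toy_filled)
```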

# Encoding categorical data

Machine learning models work with numbers, so they cannot compute with a mix of string columns and numerical columns. We therefore need to convert all the strings to numbers.

For the country column we achieve this with **one hot encoding**. One hot encoding turns the country column of our dataset into **3 columns**, one per country. If there were 5 different countries, we would get 5 columns. Plain integer codes (0, 1, 2) would suggest an order between the countries that does not exist, which is why we use separate columns.

**Remember: for a categorical feature column we use OneHotEncoder(), and for the label column (here with 2 classes, yes/no) we use LabelEncoder().**

One hot encoding creates a binary vector for each of these countries.
Two kinds of encoding are done: one for the independent variables (the categorical feature column) and one for the dependent variable (the non-numerical labels).
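
The binary vectors can be inspected on a toy column. This sketch uses made-up country values and calls `OneHotEncoder` directly, without the `ColumnTransformer` wrapper used below.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single categorical column (2D, as sklearn expects)
countries = np.array([["France"], ["Spain"], ["Germany"], ["France"]])

encoder = OneHotEncoder()
# The encoder returns a sparse matrix by default, so convert it for printing
one_hot = encoder.fit_transform(countries).toarray()

print(encoder.categories_[0])  # categories in sorted order: France, Germany, Spain
print(one_hot)                 # one row per observation, one column per country
```

Each row contains a single 1 in the column of its country, so France becomes (1, 0, 0), Germany (0, 1, 0) and Spain (0, 0, 1).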

**Encoding independent variable**

We use 2 classes to encode the independent variable:
1. ColumnTransformer class from compose module in sklearn library
2. OneHotEncoder class from preprocessing module in sklearn library

In [None]:
# Let's start importing the ColumnTransformer class and OneHotEncoder class

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

In [None]:
# Create an object 'ct' of the ColumnTransformer class.
# We pass 2 arguments: transformers (which transformation to apply to which columns)
# and remainder (what to do with the columns that are not transformed).
# transformers is a list of tuples, each with 3 elements:
#   1. a name for the transformation ('encoding')
#   2. the transformer to apply (OneHotEncoder())
#   3. the indexes of the columns to encode ([0], the country column)
# Note the nesting: a list [] containing a tuple (), whose last element is the list [0].
#
# remainder='passthrough' keeps the other two columns (Age & Salary) in the matrix
# of features without encoding them; by default they would be dropped.
ct = ColumnTransformer(transformers=[('encoding', OneHotEncoder(), [0])], remainder='passthrough')

# Connect the object to the matrix of features. ColumnTransformer has a fit_transform
# method that fits and transforms in one step. It takes X (the matrix of features) and
# returns it with the country column replaced by three one hot encoded columns.
# The future machine learning model expects X to be a NumPy array, so we wrap the
# result in np.array().
X = np.array(ct.fit_transform(X))

In [None]:
print(X)

**Encoding dependent variable**

We use a class called LabelEncoder from the preprocessing module, which encodes the labels 'No' and 'Yes' as 0 and 1.
LabelEncoder is meant for the label column, such as our two-class dependent variable.

In [None]:
# Import the LabelEncoder class from the preprocessing module of sklearn to encode
# the labels as 0s and 1s. Unlike the one hot encoded features, the result is already
# a NumPy array of integers, so we don't need to wrap it in np.array().

from sklearn.preprocessing import LabelEncoder

# Remember, we don't need to pass any arguments to the LabelEncoder() class

le = LabelEncoder()
y = le.fit_transform(y)
print(y)
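
To check which label was mapped to which number, the fitted encoder keeps the classes in `classes_`, and `inverse_transform` maps the numbers back. A sketch on made-up labels:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical label vector
labels = np.array(["No", "Yes", "No", "Yes"])

toy_le = LabelEncoder()
encoded = toy_le.fit_transform(labels)

print(toy_le.classes_)                    # sorted classes: index 0 is 'No', index 1 is 'Yes'
print(encoded)                            # 0 for 'No', 1 for 'Yes'
print(toy_le.inverse_transform(encoded))  # back to the original strings
```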

# Splitting the dataset into Training set and Test set
Splitting the dataset means creating 2 separate sets: **1 training set**, on which we **train the machine learning model using existing observations**, and **1 test set**, on which we **evaluate the performance of our machine learning model on new observations**.

**Remember the most important point: we do feature scaling after splitting the dataset into the training set and test set.**

The reason is that the test set stands for brand new data on which we evaluate the performance of the ML model after training it. This means we are not supposed to use the test set in any way during training.

Feature scaling uses the mean and standard deviation of each feature to scale the data. If we apply feature scaling before the split, the mean and standard deviation are computed from all the data, including the test set. This is called information leakage from the test set, and it is wrong. We mustn't touch the test set; it must stay fresh.

So, to prevent information leakage from the test set, we do feature scaling after the split. **Hence, we apply feature scaling after splitting the data into the training set and test set.**
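
The leakage can be made visible on a toy column. In this sketch the values are made up and the split is done by hand (last row as the test set) to keep the numbers easy to follow; the notebook itself splits randomly.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature; the last row plays the role of the test set
values = np.array([[30.0], [35.0], [40.0], [45.0], [200.0]])
train, test = values[:4], values[4:]

leaky = StandardScaler().fit(values)  # fitted before the split: sees the test row
clean = StandardScaler().fit(train)   # fitted after the split: training rows only

print(leaky.mean_)  # mean of all five values, pulled up by the test row
print(clean.mean_)  # mean of the four training values
```

The scaler fitted on everything has a mean of 70 instead of 37.5, so information about the test row has already shaped how the training data is scaled.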

**The split is done with a function called `train_test_split` from the `model_selection` module of the sklearn library.**
It gives us one pair (matrix of features and dependent variable vector) for the training set and another pair for the test set.

**Remember, we create 4 separate sets:**

1. X_train - matrix of features of the training set
2. X_test - matrix of features of the test set
3. y_train - dependent variable vector of the training set
4. y_test - dependent variable vector of the test set

The future machine learning model expects these 4 sets as inputs. For training, its fit method takes X_train and y_train. For predictions (also called inference), the model predicts from X_test, and we compare those predictions with y_test.


In [None]:
# Let's import the function 'train_test_split' from the 'model_selection' module of the sklearn library

from sklearn.model_selection import train_test_split

# train_test_split returns the sets in this order: X_train, X_test, y_train, y_test,
# so we must unpack them in the same order.
# We pass 4 arguments to train_test_split:
# 1. the matrix of features X
# 2. the dependent variable vector y
# 3. test_size, the share of the dataset that goes to the test set. We pass 0.2,
#    meaning 20% of the observations for the test set and 80% for training the model
# 4. random_state, the seed of the random shuffling done before the split. We fix it
#    to 1 so that we get the same training and test sets on every run.
# Remember, the rows are not split in order; they are shuffled randomly first,
# hence random_state.


X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1)


In [None]:
# Let's print each of the 4 sets in its own cell. X_train has only 8 rows (8 customers
# picked at random from the dataset) because we passed test_size=0.2.
#
# The first 3 columns are the one hot encoded country (categorical) column,
# followed by the Age and Salary columns. Only features appear here since this is X_train.

print(X_train)

In [None]:
# Here you can see only two rows of the matrix of features, picked at random, since we passed test_size=0.2
print(X_test)

In [None]:
# Very important point: y_train holds 8 label encoded values that correspond to the rows of X_train.
# That is, each label (dependent variable) comes from the same row of the dataset
# as the matching row of one hot encoded features in X_train.

print(y_train)

In [None]:
# Again, these 2 values correspond to the 2 rows of the matrix of features in the test set
print(y_test)
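
The points made in the cells above (an 80/20 split, rows and labels staying paired, and a reproducible split for a fixed seed) can be checked on synthetic data. The arrays below are made up so that the label of each row is easy to recognise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: row i holds the value i, and its label is 10 * i
X_toy = np.arange(10).reshape(10, 1)
y_toy = np.arange(10) * 10

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)

print(X_tr.shape, X_te.shape)                  # 8 training rows, 2 test rows
print(np.array_equal(X_tr[:, 0] * 10, y_tr))  # each label still matches its row

# The same random_state gives the same split again
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)
print(np.array_equal(X_tr, X_tr2))
```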

# Feature Scaling

Feature scaling was introduced in the first section; refer to it for the basics.

Remember, standardisation and normalisation are the two main feature scaling techniques, and both put all features on the same scale.

**Standardisation puts most of our values roughly between -3 and +3.**

**Normalisation puts all our values between 0 and +1.**

**Normalization**: $x' = \dfrac{x - x_{\min}}{x_{\max} - x_{\min}}$

**Standardization**: $x' = \dfrac{x - \mu}{\sigma}$

Now, which type of feature scaling should we use for our ML model: normalisation or standardisation? Below is the answer.

**Normalisation** is recommended when most of the features are normally distributed.
**Standardisation** works well in all cases. Since normalisation is tied to that specific condition, the **recommendation is to use standardisation**.

Feature scaling is done using a **class** called **StandardScaler()** in preprocessing module of sklearn library.

Yet **another question to answer**: do we need to apply feature scaling (standardisation) to the dummy variables (the one hot encoded category columns) in the matrix of features? Below is the answer.

The answer is **No**. The goal of feature scaling is to bring all features into the same range. Standardisation gives values roughly between -3 and +3, and the dummy variables only take the values 0 and 1, which already lie in that range. So **standardising the dummy variables brings no benefit and only makes them worse**.

Also, we would lose the information about which dummy vector (1.0 0.0 0.0) belongs to which country (France) if we applied standardisation to those variables.

**Finally, to keep the data interpretable, we don't apply standardisation to the dummy variables.**
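
A small sketch of scaling only the numerical columns, assuming the column layout described above (three dummy columns first, then age and salary) and using made-up values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: columns 0-2 are country dummies, 3 is age, 4 is salary
X_toy = np.array([
    [1.0, 0.0, 0.0, 44.0, 72000.0],
    [0.0, 0.0, 1.0, 27.0, 48000.0],
    [0.0, 1.0, 0.0, 30.0, 54000.0],
])

X_scaled = X_toy.copy()
# Scale age and salary only; the dummy columns are left untouched
X_scaled[:, 3:] = StandardScaler().fit_transform(X_toy[:, 3:])

print(X_scaled[:, :3])  # still readable 0/1 vectors, one per country
print(X_scaled[:, 3:])  # age and salary, now on a comparable scale
```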


In [None]:
# Let's import the StandardScaler class from the preprocessing module of the sklearn library

from sklearn.preprocessing import StandardScaler
# We don't need to pass any argument to this class: it computes the mean and
# standard deviation itself and applies them in the formula.

sc = StandardScaler()
# Apply standardisation to the training set, skipping the dummy variables.
# To find the column range, look at the output of X rather than the original data:
# one hot encoding split the country column into 3, so the column indexes have shifted.
# Indexes 0 to 2 hold the dummy variables, which we skip, so we start from index 3.
# We always take all rows, hence ':' for the rows.
#
# fit computes the mean and standard deviation of each selected column.
# transform applies the formula (x - mean) / standard deviation to every value.
#
# So we pass 'X_train[:, 3:]', since we scale only those columns.

X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])

# For X_test we use only the transform method: the test set must be scaled with the
# same scaler (same mean and standard deviation) that was fitted on the training set,
# because X_test stands for brand new, production-like data.
#
# Calling fit_transform() on X_test would fit a new scaler on the test data, which
# defeats that purpose.

X_test[:, 3:] = sc.transform(X_test[:, 3:])

In [None]:
print(X_train)

In [None]:
print(X_test)
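
To confirm that the test set is scaled with the training statistics, the fitted scaler's `mean_` and `scale_` attributes can be inspected. A sketch on a made-up single column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test values for one feature
train = np.array([[30.0], [40.0], [50.0]])
test = np.array([[60.0]])

toy_sc = StandardScaler()
train_scaled = toy_sc.fit_transform(train)  # learns mean and std from train only
test_scaled = toy_sc.transform(test)        # reuses those training statistics

print(toy_sc.mean_, toy_sc.scale_)  # training mean (40) and standard deviation
print(test_scaled)                  # (60 - 40) / training standard deviation
```

The scaled training column has mean 0, while the test value is simply expressed in units of the training standard deviation.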