# Data Preprocessing Tools

## Importing the Libraries


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Importing the dataset


In [None]:
dataset = pd.read_csv('Data.csv')
dataset.head()

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes


In machine learning, a dataset has two parts:

1.   Features
2.   Dependent variable

Features are the columns used to predict the dependent variable. They usually occupy the first columns of the dataset, and the dependent variable is most often the last column.



In [None]:
# df.iloc[row_range, column_range] selects rows and columns by integer position.
# In a slice such as 2:3, the lower bound is included and the upper bound is excluded.
# "-1" is the index of the last column.
X = dataset.iloc[:, :-1].values # store the features (every column except the last) in a separate variable.
y = dataset.iloc[:, -1].values # store the dependent variable (the last column) in a separate variable.

In [None]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [None]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Taking care of Missing Data

Way 1: Ignore the missing values (for example, by dropping the rows that contain them).  
Way 2: Replace each missing value with the average of the values present in that column.
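
The two options can be compared side by side on a small example. This is a minimal sketch using pandas on a hypothetical `toy` frame (not the tutorial's `Data.csv`); the names `toy`, `dropped` and `filled` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in each numeric column.
toy = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, np.nan],
})

# Way 1: drop every row that contains at least one missing value.
dropped = toy.dropna()

# Way 2: replace each missing value with the mean of its own column.
filled = toy.fillna(toy.mean())

print(dropped)
print(filled)
```

Dropping rows is simple but discards data (here half of the rows), which is why the rest of this section uses the mean replacement instead.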

In [None]:
from sklearn.impute import SimpleImputer
# The 'SimpleImputer' class replaces missing values.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean') # "imputer" is an object of that class.
imputer.fit(X[:, 1:3]) # with strategy='mean', fit() accepts only numerical columns, so we pass Age and Salary.
X[:, 1:3] = imputer.transform(X[:, 1:3]) # transform() replaces the missing values; we have to specify the columns in which to replace them.
# transform() returns the updated columns of the feature matrix X, so we assign the result back to X.

`imputer.fit()` calculates the mean of each selected column. These means will be used to replace the missing values.  
`imputer.transform()` applies the learned mean values to replace the missing values in the selected columns of `X`.  
The result is then reassigned to the same columns in `X`, effectively replacing the missing values in the dataset.
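
To make the two steps visible, the sketch below refits an imputer on the `Age` and `Salary` values printed for `X` earlier and inspects `statistics_`, the attribute in which `SimpleImputer` stores the learned per-column values. The array name `numeric` is an assumption introduced for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns as printed for X earlier (nan marks the missing entries).
numeric = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
    [40.0, np.nan],
    [35.0, 58000.0],
    [np.nan, 52000.0],
    [48.0, 79000.0],
    [50.0, 83000.0],
    [37.0, 67000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(numeric)

# After fit(), the mean of each column (ignoring nan) is stored in statistics_.
print(imputer.statistics_)

# transform() returns a new array in which every nan is replaced by its column mean.
filled = imputer.transform(numeric)
print(filled[4], filled[6])
```

The two means are 349/9 for `Age` and 574000/9 for `Salary`, which are the values `38.777...` and `63777.777...` that appear in the printed `X`.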


In [None]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


## Encoding Categorical Data

In [None]:
dataset.head()

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes


We will encode the `Country` column. One way to do this is:

*   Replace `France` with `1`.
*   Replace `Spain` with `2`.
*   Replace `Germany` with `3`.

However, with this numbering the model could interpret the countries as having an order (for example, `Germany` > `Spain` > `France`), even though no such order exists.  

So, encoding this way is not a good option.  

Another way is to use `One Hot encoding`. Since there are `3` different classes in the `Country` column, One Hot encoding will convert this column into `3` different columns.  
`One Hot encoding` consists of creating a `binary` vector for each value present in the respective column.  
Example (scikit-learn's `OneHotEncoder` assigns the columns in alphabetical order of the categories):  


*   `France` will be converted to `100`.
*   `Germany` will be converted to `010`.
*   `Spain` will be converted to `001`.

This removes the confusion of `numerical` order.
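
The idea can be tried on the country names alone. This is a minimal sketch, assuming scikit-learn's `OneHotEncoder` with default settings; `countries` is a hand-typed array introduced for illustration. The encoder orders its output columns by sorted category, so the position of the `1` for each country follows alphabetical order.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A single categorical column, shaped (n_samples, 1) as the encoder expects.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder()
# fit_transform() returns a sparse matrix by default; toarray() makes it dense for printing.
encoded = encoder.fit_transform(countries).toarray()

# categories_ shows which country each output column stands for.
print(encoder.categories_)
print(encoded)
```

Each row contains exactly one `1`, so no country is numerically "larger" than another.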







### Encoding the Independent Variable


In [None]:
# Importing the required libraries.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

In [None]:
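# Step 1: Create a ColumnTransformer that one-hot encodes column 0 (Country) and passes the remaining columns through unchanged.
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
# Step 2: Fit it on X, apply it, and convert the result to a NumPy array.
X = np.array(ct.fit_transform(X))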

In [None]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


### Encoding the Dependent Variable

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

In [None]:
print(y)

[0 1 0 0 1 1 0 1 0 1]
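
To check which integer stands for which label, the fitted encoder can be inspected. The sketch below uses a short hand-typed `labels` array (an assumption for illustration, not the full `y`) together with the `classes_` attribute and `inverse_transform()` of `LabelEncoder`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["No", "Yes", "No", "No", "Yes"])

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ lists the original labels in sorted order; a label's position is its integer code.
print(le.classes_)
print(encoded)

# inverse_transform() maps the integer codes back to the original labels.
print(le.inverse_transform(encoded))
```

Here `No` becomes `0` and `Yes` becomes `1`, matching the printed `y`.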


## Splitting the dataset into the Training set and Test set

In [None]:
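# Import the splitting helper.
from sklearn.model_selection import train_test_split
# test_size=0.2 holds out 20% of the rows for testing; random_state=1 makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)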

In [None]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [None]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [None]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [None]:
print(y_test)

[0 1]
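
The effect of `test_size` and `random_state` can be checked on placeholder data. In this sketch `X_demo` and `y_demo` are made-up arrays with the same number of rows (10) as the dataset; they are not the tutorial's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 rows, 5 feature columns, and one label per row.
X_demo = np.arange(50).reshape(10, 5)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# With 10 rows and test_size=0.2, 8 rows go to training and 2 to testing.
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
print(np.array_equal(X_te, X_te2))
```

Fixing `random_state` is what makes results comparable between runs; without it the rows are shuffled differently each time.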


## Feature Scaling

There are two feature scaling techniques:


*   Standardization
*   Normalization

### Standardization (Z-score normalization):

Z = (X - μ) / σ

Where:
- X = original data point
- μ = mean of the data
- σ = standard deviation of the data
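
The formula can be applied by hand with NumPy and compared against `StandardScaler`. This is a minimal sketch on a made-up `ages` column; the variable names are assumptions for illustration. Note that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One feature column, shaped (n_samples, 1).
ages = np.array([[27.0], [30.0], [38.0], [44.0]])

mu = ages.mean(axis=0)
sigma = ages.std(axis=0)  # population standard deviation (ddof=0)
z_manual = (ages - mu) / sigma

z_sklearn = StandardScaler().fit_transform(ages)

# Both approaches give the same values, with mean 0 and standard deviation 1.
print(np.allclose(z_manual, z_sklearn))
print(z_manual.mean(), z_manual.std())
```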

---

### Normalization (Min-Max scaling):

X_norm = (X - X_min) / (X_max - X_min)

Where:
- X = original data point
- X_min = minimum value of the data
- X_max = maximum value of the data
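
The min-max formula can be checked the same way. This sketch applies it by hand to a made-up `salaries` column and compares the result with scikit-learn's `MinMaxScaler` (a class this notebook does not otherwise use); the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One feature column, shaped (n_samples, 1).
salaries = np.array([[48000.0], [54000.0], [61000.0], [72000.0]])

x_min = salaries.min(axis=0)
x_max = salaries.max(axis=0)
x_norm_manual = (salaries - x_min) / (x_max - x_min)

x_norm_sklearn = MinMaxScaler().fit_transform(salaries)

# The smallest value maps to 0, the largest to 1, and everything else falls in between.
print(x_norm_manual.ravel())
print(np.allclose(x_norm_manual, x_norm_sklearn))
```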




In [None]:
from sklearn.preprocessing import StandardScaler
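# StandardScaler applies standardization.
sc = StandardScaler()
# Columns 0-2 are the one-hot encoded dummy variables, so only columns 3 onward (Age, Salary) are scaled.
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])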


We apply `fit_transform` to `X_train` but only `transform` to `X_test` because of how the StandardScaler works.

1. **fit_transform on X_train**:  
`Fit`: When you call `fit`, the StandardScaler calculates the `mean` and `standard deviation` of each feature in the training data.  
`Transform`: `transform` then standardizes the data with these statistics, i.e., it subtracts the `mean` and divides by the `standard deviation` for each feature in `X_train`, so that each scaled feature has a mean of 0 and a standard deviation of 1.
2. **transform on X_test**:  
We apply only `transform` to the test data because we want to use the same `mean` and `standard deviation` that were calculated from the training data (`X_train`). This ensures that the test data is standardized in exactly the same way as the training data.  
If we used `fit_transform` on `X_test`, it would calculate a new mean and standard deviation from the test set, so the test data would be scaled differently from the training data. It would also break the principle that test data should be treated as unseen data: nothing learned from it should influence the preprocessing.
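
The difference can be seen on a tiny example. This is a minimal sketch with made-up one-column arrays `train` and `test`; `mean_` and `scale_` are the attributes in which `StandardScaler` stores the statistics it learned during `fit`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[20.0], [40.0]])

sc = StandardScaler()
train_scaled = sc.fit_transform(train)  # learns mean_ and scale_ from the training data only
test_scaled = sc.transform(test)        # reuses the training statistics

print(sc.mean_, sc.scale_)
print(test_scaled.ravel())

# For contrast: refitting on the test set yields a different, inconsistent scaling.
test_refit = StandardScaler().fit_transform(test)
print(test_refit.ravel())
```

With the training statistics, the test value `20.0` maps to `0.0`, exactly as it does in the training set. After refitting on the test set, the same value maps to `-1.0`, so identical inputs would no longer mean the same thing to the model.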

In [None]:
print(X_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [None]:
print(X_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]


We don't need to apply feature scaling to the dummy (one-hot encoded) columns, because they already take only the values 0 and 1.
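
A quick check that slicing with `3:` leaves the dummy columns untouched, sketched on three rows taken from the encoded `X` printed earlier. `X_demo` and `scaled` are names introduced for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns 0-2 are one-hot dummies; columns 3-4 are Age and Salary.
X_demo = np.array([
    [1.0, 0.0, 0.0, 44.0, 72000.0],
    [0.0, 0.0, 1.0, 27.0, 48000.0],
    [0.0, 1.0, 0.0, 50.0, 83000.0],
])

scaled = X_demo.copy()
# Scale only the numeric columns; the dummy columns are never passed to the scaler.
scaled[:, 3:] = StandardScaler().fit_transform(X_demo[:, 3:])

print(scaled)
```

The first three columns stay exactly `0.0` or `1.0`, so it remains easy to read off which country each row belongs to.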