### 1. Import Dependencies

In [9]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler

### 2. Important Concepts

#### 2.1 Normalization vs Standardization

##### 2.1.1 What is Normalization?
Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1.  
It is also known as **Min-Max scaling**.
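
As a minimal sketch of the rescaling described above, each value `x` is mapped to `(x - min) / (max - min)`. The example applies that formula with NumPy to three `Age` values from this notebook's data preview. The column minimum of 18 and maximum of 92 are assumptions, chosen because they are consistent with the scaled `Age` values shown in the notebook's output, and `min_max_scale` is a hypothetical helper name.

```python
import numpy as np

def min_max_scale(values, lo=None, hi=None):
    """Rescale values linearly so that lo maps to 0 and hi maps to 1."""
    values = np.asarray(values, dtype=float)
    # Fall back to the observed extremes when no bounds are supplied
    lo = values.min() if lo is None else lo
    hi = values.max() if hi is None else hi
    return (values - lo) / (hi - lo)

# Sample Age values; 18 and 92 are assumed column extremes, not read from the data file
ages = [42.0, 41.0, 38.91]
scaled_ages = min_max_scale(ages, lo=18.0, hi=92.0)
print(np.round(scaled_ages, 6))
```

Under these assumptions, 42 maps to 24/74, which is about 0.324324.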

##### 2.1.2 What is Standardization?
Standardization is another scaling technique: the mean is subtracted from each value, and the result is divided by the standard deviation.  
This means that the mean of the attribute becomes **zero**, and the resultant distribution has a **unit standard deviation**.
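
The z-score computation can be sketched in a few lines of NumPy. This is an illustration rather than the notebook's own method: `standardize` is a hypothetical helper, and the input is just the first five `Balance` values from the data preview.

```python
import numpy as np

def standardize(values):
    """Center values on zero and scale them to unit standard deviation (z-score)."""
    values = np.asarray(values, dtype=float)
    # np.std defaults to ddof=0 (population standard deviation),
    # the same convention sklearn's StandardScaler uses
    return (values - values.mean()) / values.std()

# First five Balance values from the preview, used only for illustration
balances = [0.0, 83807.86, 159660.8, 0.0, 125510.82]
z_scores = standardize(balances)

# Mean should be ~0 and standard deviation ~1, up to floating-point error
print(z_scores.mean(), z_scores.std())
```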


### 3. Basic Processing

In [4]:
df = pd.read_csv('data/processed/CEHHbInToW_Encoded.csv')
df.head(10)

Unnamed: 0,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,CreditScoreBins
0,France,Female,42.0,2,0.0,1,1,1,101348.88,1,Fair
1,Spain,Female,41.0,1,83807.86,1,0,1,112542.58,0,Fair
2,France,Female,42.0,8,159660.8,3,1,0,113931.57,1,Poor
3,France,Female,38.91,1,0.0,2,0,0,93826.63,0,Good
4,Spain,Female,43.0,2,125510.82,1,1,1,79084.1,0,
5,Spain,Male,44.0,8,113755.78,2,1,0,149756.71,1,Fair
6,France,Male,50.0,7,0.0,2,1,1,10062.8,0,Excellent
7,Germany,Female,29.0,4,115046.74,4,1,0,119346.88,1,Poor
8,France,Male,44.0,4,142051.07,2,0,1,74940.5,0,Poor
9,France,Male,27.0,2,134603.88,1,1,1,71725.73,0,Good


### Comparison: Min-Max Scaling vs Standardization (Z-score)

| **Condition** | **Min-Max Scaling** | **Standardization (Z-score)** |
|----------------|---------------------|-------------------------------|
| Data has a known, fixed range | ✅ Yes | ❌ Not ideal |
| Data contains outliers | ❌ Sensitive to outliers | ✅ Less sensitive, though the mean and standard deviation still shift |
| Data is normally distributed | ❌ Not necessary | ✅ Preferred |
| Data is not normally distributed (e.g., skewed) | ✅ If shape needs to be preserved | ✅ Often works well after log-transform |
| Model is distance-based (KNN, SVM) | ✅ Recommended | ✅ Also acceptable |
| Model is neural network | ✅ Recommended | ✅ Also works well |
| Model is linear or uses regularization | ❌ Not ideal | ✅ Helps with convergence |
| Input features need bounded values (0–1) | ✅ Required | ❌ Not bounded |
| Applying PCA or LDA | ❌ May distort variance | ✅ Required (centering needed) |
| Want to preserve original distribution shape | ✅ Maintains feature shape | ✅ Maintains shape but centers data |
| Working with tree-based models | ❌ Not needed | ❌ Not needed |
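
To make the outlier row of the table concrete, the sketch below shows how a single extreme value affects both transformations. The salary figures are made up purely for illustration and do not come from the dataset.

```python
import numpy as np

# Made-up salaries for illustration; the last value is a deliberate outlier
salaries = np.array([40_000.0, 45_000.0, 50_000.0, 55_000.0, 1_000_000.0])

min_max = (salaries - salaries.min()) / (salaries.max() - salaries.min())
z_score = (salaries - salaries.mean()) / salaries.std()

# Spread of the four typical values after each transformation
typical_min_max_spread = min_max[:4].max() - min_max[:4].min()
typical_z_spread = z_score[:4].max() - z_score[:4].min()

print(np.round(min_max, 4))
print(np.round(z_score, 4))
print(typical_min_max_spread, typical_z_spread)
```

With Min-Max scaling the outlier is pinned at 1 and the four typical values are squeezed into a narrow band near 0. The z-scores are not confined to a fixed range, but the outlier still inflates the mean and standard deviation, so the typical values end up close together there as well.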


In [10]:
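columns_need_to_be_scaled = ['Age', 'Tenure', 'Balance', 'EstimatedSalary']

# Min-Max scale the selected columns to [0, 1].
# MinMaxScaler treats each column independently, so a single fit covers all of them
# and no hard-coded row count is needed.
min_max_scaler = MinMaxScaler()
df[columns_need_to_be_scaled] = min_max_scaler.fit_transform(df[columns_need_to_be_scaled])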

df

Unnamed: 0,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,CreditScoreBins
0,France,Female,0.324324,0.2,0.000000,1,1,1,0.506735,1,Fair
1,Spain,Female,0.310811,0.1,0.334031,1,0,1,0.562709,0,Fair
2,France,Female,0.324324,0.8,0.636357,3,1,0,0.569654,1,Poor
3,France,Female,0.282568,0.1,0.000000,2,0,0,0.469120,0,Good
4,Spain,Female,0.337838,0.2,0.500246,1,1,1,0.395400,0,
...,...,...,...,...,...,...,...,...,...,...,...
9995,France,Male,0.283784,0.5,0.000000,2,1,0,0.481341,0,Very Good
9996,France,Male,0.229730,1.0,0.228657,1,1,1,0.508490,0,Poor
9997,France,Female,0.243243,0.7,0.000000,1,0,1,0.210390,1,Good
9998,Germany,Male,0.324324,0.3,0.299226,2,1,0,0.464429,1,Very Good
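
A quick sanity check on a scaling step like this one is to confirm that every scaled column spans 0 to 1 and that the fitted scaler can recover the original values. The sketch below does so on five rows copied from the data preview. Because the scaler is fitted on those five rows only, the scaled numbers differ from the full-dataset output above. The variable names are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Five rows copied from the notebook's preview; the full dataset is not reproduced here
sample = pd.DataFrame({
    'Age': [42.0, 41.0, 42.0, 38.91, 43.0],
    'Tenure': [2, 1, 8, 1, 2],
    'Balance': [0.0, 83807.86, 159660.8, 0.0, 125510.82],
    'EstimatedSalary': [101348.88, 112542.58, 113931.57, 93826.63, 79084.1],
})
cols = list(sample.columns)

scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(sample[cols]), columns=cols)

# Each scaled column should have a minimum of 0 and a maximum of 1
print(scaled.min().tolist(), scaled.max().tolist())

# The fitted scaler stores each column's min and max, so the originals can be recovered
restored = pd.DataFrame(scaler.inverse_transform(scaled), columns=cols)
print((restored - sample).abs().max().max())
```

Keeping the fitted scaler around matters if the original units are needed later, since the saved CSV contains only the scaled values.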


In [11]:
df.to_csv('data/processed/CEHHbInToW_Final.csv',index=False)