**Feature Normalization**  

Learn how to scale and adjust the statistical distribution of feature values to improve the performance and accuracy of your models.

**Min-Max Scaling:** Scale features to a fixed range, typically 0 to 1.  
**Z-Score Standardization:** Transform features to have a mean of 0 and a standard deviation of 1.  
**Robust Scaling:** Scale features using statistics that are robust to outliers, such as the median and interquartile range.  
**Yeo-Johnson Transformation:** Apply a transformation that can handle both positive and negative values to achieve normality.  
**Box-Cox Transformation:** Apply a transformation that works with positive values to achieve normality and reduce skewness.  

In Python, several libraries can scale data for machine learning and data preprocessing. Putting features on a similar scale matters for many algorithms, particularly distance-based and gradient-based ones. The most commonly used libraries are:  

1. **NumPy**  
While NumPy doesn't have specialized functions for scaling, it allows you to perform custom scaling using simple array operations.  

2. **Pandas**  
Although Pandas does not provide dedicated scaling functions, you can scale data using basic operations on DataFrames.  

3. **scikit-learn (sklearn.preprocessing)**  
The scikit-learn library provides several utilities for scaling and normalizing data. It is one of the most widely used libraries for machine learning in Python.  

**StandardScaler:** Standardizes features by removing the mean and scaling to unit variance (Z-score normalization).  
**MinMaxScaler:** Scales features to a given range, usually between 0 and 1.  
**MaxAbsScaler:** Scales each feature by its maximum absolute value (useful for data that is already centered at zero).  
**RobustScaler:** Scales features using statistics that are robust to outliers.  
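
As a minimal sketch of the "custom scaling" that NumPy and pandas allow, the block below min-max scales a 2-D array with NumPy and z-scores a DataFrame with pandas. The toy columns `height_cm` and `weight_kg` and their values are made up for illustration and do not come from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for this illustration only.
toy = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "weight_kg": [50.0, 60.0, 80.0, 90.0],
})

# NumPy: min-max scale each column of a 2-D array to [0, 1].
values = toy.to_numpy()
col_min = values.min(axis=0)
col_max = values.max(axis=0)
minmax_np = (values - col_min) / (col_max - col_min)

# pandas: z-score each column. ddof=0 uses the population standard
# deviation, which is what scikit-learn's StandardScaler divides by.
zscore_pd = (toy - toy.mean()) / toy.std(ddof=0)

print(minmax_np)
print(zscore_pd.round(3))
```

The scikit-learn scalers add one thing these one-liners lack: they store the fitted statistics, so the same transformation can be reapplied to new data with `transform`.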
 



In [2]:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, PowerTransformer

# Load Titanic dataset
df = pd.read_csv("C:/Users/Hp/Downloads/Titanic.csv")

# Fill missing values in Age with median
df['Age'] = df['Age'].fillna(df['Age'].median())

# Select numerical columns for transformation
columns_to_transform = ['Age', 'Fare', 'SibSp', 'Parch']

# MinMax Scaling
minmax_scaler = MinMaxScaler()
df_minmax = df.copy()
df_minmax[columns_to_transform] = minmax_scaler.fit_transform(df_minmax[columns_to_transform])

# Standard Scaling
standard_scaler = StandardScaler()
df_standard = df.copy()
df_standard[columns_to_transform] = standard_scaler.fit_transform(df_standard[columns_to_transform])

# Robust Scaling
robust_scaler = RobustScaler()
df_robust = df.copy()
df_robust[columns_to_transform] = robust_scaler.fit_transform(df_robust[columns_to_transform])

# Power Transformation (Yeo-Johnson also accepts zero and negative values)
power_transformer = PowerTransformer(method='yeo-johnson')
df_power = df.copy()
df_power[columns_to_transform] = power_transformer.fit_transform(df_power[columns_to_transform])

# View transformed dataset summary (MinMax example)
df_minmax.describe()


| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 0.363679 | 0.065376 | 0.063599 | 0.062858 |
| std | 257.353842 | 0.486592 | 0.836071 | 0.163605 | 0.137843 | 0.134343 | 0.096995 |
| min | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 0.271174 | 0.0 | 0.0 | 0.01544 |
| 50% | 446.0 | 0.0 | 3.0 | 0.346569 | 0.0 | 0.0 | 0.028213 |
| 75% | 668.5 | 1.0 | 3.0 | 0.434531 | 0.125 | 0.0 | 0.060508 |
| max | 891.0 | 1.0 | 3.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [4]:
df.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.361582 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 13.019697 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 22.0 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 35.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


**Dataset Overview**  

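The column recommendations in the sections below refer to a dataset of loan applicants with the following columns. The code cells apply the same techniques to the numerical columns of the Titanic dataset loaded above (`Age`, `Fare`, `SibSp`, `Parch`).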

**age:** The age of the applicant in years.  
Range: 20 - 56  
Mean: 34.75  
Skewness: Slightly skewed to the right (positive skew).  
**employ:** The number of years the applicant has been employed, which can indicate their job stability and experience.  
Range: 0 - 31  
Mean: 8.38  
Skewness: Right-skewed (positive skew).  
**address:** The number of years the applicant has lived at their current address, providing insights into their residential stability.  
Range: 0 - 28  
Mean: 5.58  
Skewness: Right-skewed (positive skew).  
**income:** The annual income of the applicant (in thousands), representing their earning capacity.    
Range: 10 - 330  
Mean: 55.50  
Skewness: Right-skewed (positive skew).  
**debtinc:** The debt-to-income ratio of the applicant, calculated as the percentage of their income that goes towards paying debts. This ratio helps assess their financial burden.  
Range: 0.00 - 37.30  
Mean: 10.27  
Skewness: Right-skewed (positive skew).  
**creddebt:** The amount of credit card debt the applicant has (in thousands), showing their reliance on credit and their debt levels.  
Range: 0.00 - 22.12  
Mean: 3.51  
Skewness: Right-skewed (positive skew).  
**othdebt:** The amount of other debt the applicant has (in thousands), which includes all other forms of debt apart from credit card debt.  
Range: 0.00 - 57.03  
Mean: 5.05  
Skewness: Right-skewed (positive skew).  
**ed:** The education level of the applicant (encoded numerically), where higher numbers may represent higher levels of education.  
Unique Values: 1.0, 2.0, 3.0, 4.0, 5.0  
Most Frequent Value (Mode): 1.0  
**default:** A binary indicator of whether the applicant defaulted on the loan (1 for default, 0 for no default), indicating their credit risk.  
Unique Values: 0, 1  
Most Frequent Value (Mode): 0 (majority did not default)  
**TIP**  
If a summary reports no skewness statistic, you can still infer the likely direction of skew from the mean, minimum, maximum, and quartiles (Q1, Q2 = median, Q3):  

If (Max − Mean) > (Mean − Min) or Q3 − Q2 > Q2 − Q1, it's likely positively skewed.  

If (Mean − Min) > (Max − Mean) or Q2 − Q1 > Q3 − Q2, it's likely negatively skewed.  

If both sides are roughly equal, the distribution is likely symmetric.  

This relies on shape asymmetry and should be interpreted cautiously without full distribution data.  
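
The quartile comparison in the tip can be sketched as a small helper. The function name `quartile_skew_direction` and the tolerance of 0.1 are assumptions made for this illustration, and normalising the difference by the interquartile range (Bowley's quartile skewness coefficient) is one possible way to decide what "roughly equal" means. The quartiles passed in are the `Fare` and `Age` values from the `df.describe()` output above.

```python
def quartile_skew_direction(q1, q2, q3, tol=0.1):
    """Classify skew direction from three quartiles.

    Uses Bowley's coefficient ((Q3 - Q2) - (Q2 - Q1)) / (Q3 - Q1).
    The tolerance is an arbitrary choice for this sketch.
    """
    iqr = q3 - q1
    if iqr == 0:
        return "undetermined"
    bowley = ((q3 - q2) - (q2 - q1)) / iqr
    if bowley > tol:
        return "positive"
    if bowley < -tol:
        return "negative"
    return "roughly symmetric"


# Quartiles (25%, 50%, 75%) copied from the df.describe() output.
fare_direction = quartile_skew_direction(7.9104, 14.4542, 31.0)
age_direction = quartile_skew_direction(22.0, 28.0, 35.0)
print("Fare:", fare_direction)
print("Age:", age_direction)
```

With this tolerance, `Fare` is classified as positively skewed and `Age` as roughly symmetric. A different tolerance would move the boundary, so the result is a hint rather than a test.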

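**1. Min-Max Scaling**  
Min-Max Scaling rescales each feature to a given range, usually [0, 1], by subtracting the feature's minimum and dividing by its range (max − min). It is useful when features have different ranges and you want them to contribute comparably to the analysis. Because the minimum and maximum define the scale, a single extreme value can compress the remaining values into a narrow band.  
For the loan dataset, the following columns are suitable for Min-Max Scaling:  
age  
debtinc  
creddebt  
Applying Min-Max Scaling  
The cell below applies Min-Max Scaling to the Titanic columns `Age`, `Fare`, and `SibSp`.  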
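
The arithmetic behind Min-Max Scaling is x' = (x − min) / (max − min). The sketch below implements it with NumPy; the helper name `min_max_scale` is an assumption for this illustration and is not part of scikit-learn. The input values are the minimum, quartiles, and maximum of `Age` from the `df.describe()` output above.

```python
import numpy as np


def min_max_scale(x, feature_range=(0.0, 1.0)):
    """Rescale a 1-D array linearly so its min and max hit feature_range."""
    x = np.asarray(x, dtype=float)
    lo, hi = feature_range
    span = x.max() - x.min()
    if span == 0:
        # A constant feature has no spread; map every value to the lower bound.
        return np.full_like(x, lo)
    return lo + (x - x.min()) * (hi - lo) / span


# min, 25%, 50%, 75%, max of Age, copied from df.describe().
ages = np.array([0.42, 22.0, 28.0, 35.0, 80.0])
scaled_ages = min_max_scale(ages)
print(scaled_ages.round(6))
```

The three middle values agree with the 25%, 50%, and 75% rows of `Age` in the Min-Max summary table (0.271174, 0.346569, 0.434531), which is what a linear rescaling should produce.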

In [9]:
from sklearn.preprocessing import MinMaxScaler

# Print column names for confirmation
print("Available columns:", df.columns.tolist())

# Numerical Titanic columns to scale (chosen from the list printed above)
columns_to_scale = ['Age', 'Fare', 'SibSp']

scaler = MinMaxScaler()
df_min_max_scaled = df.copy()
df_min_max_scaled[columns_to_scale] = scaler.fit_transform(df[columns_to_scale])

print("Min-Max Scaled Data:")
display(df_min_max_scaled.head(20))



Available columns: ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
Min-Max Scaled Data:


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 0.271174 | 0.125 | 0 | A/5 21171 | 0.014151 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 0.472229 | 0.125 | 0 | PC 17599 | 0.139136 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 0.321438 | 0.0 | 0 | STON/O2. 3101282 | 0.015469 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 0.434531 | 0.125 | 0 | 113803 | 0.103644 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 0.434531 | 0.0 | 0 | 373450 | 0.015713 | | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | 0.346569 | 0.0 | 0 | 330877 | 0.01651 | | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 0.673285 | 0.0 | 0 | 17463 | 0.101229 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 0.019854 | 0.375 | 1 | 349909 | 0.041136 | | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 0.334004 | 0.0 | 2 | 347742 | 0.021731 | | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 0.170646 | 0.125 | 0 | 237736 | 0.058694 | | C |


**2. Z-Score Standardization**  
Z-Score Standardization transforms each feature to have a mean of 0 and a standard deviation of 1 by subtracting the feature's mean and dividing by its standard deviation. Unlike Min-Max Scaling, it does not bound values to a fixed range. The mean and standard deviation are themselves pulled by outliers, so it is not an outlier-robust method.  

Columns  
For the loan dataset, the following columns are suitable for Z-Score Standardization:  

age  
debtinc  
creddebt  
Applying Z-Score Standardization  
The cell below applies Z-Score Standardization to the Titanic columns `Age`, `Fare`, `SibSp`, and `Parch`.  
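
The formula behind standardization is z = (x − mean) / std. One detail is easy to miss: `StandardScaler` divides by the population standard deviation (`ddof=0`), while `DataFrame.describe()` reports the sample standard deviation (`ddof=1`). The sketch below shows the difference on a small made-up array; the helper name `z_score` and the sample values are assumptions for this illustration.

```python
import numpy as np


def z_score(x, ddof=0):
    """Standardize a 1-D array; ddof=0 mirrors StandardScaler's convention."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=ddof)


# Hypothetical sample with mean 5 and population standard deviation 2.
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

z_population = z_score(sample, ddof=0)  # what StandardScaler would return
z_sample = z_score(sample, ddof=1)      # based on the std that describe() reports

print(z_population)
print(z_sample.round(4))
```

On 891 rows the two conventions differ only in the fourth decimal place or so, which is why recomputing a z-score by hand from the `describe()` output gives a value close to, but not exactly equal to, the scaler's result.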

In [10]:
from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler
scaler = StandardScaler()

# Select the columns to scale (Make sure these columns exist in your dataset)
columns_to_scale = ['Age', 'Fare', 'SibSp', 'Parch']  # Use Titanic dataset numerical columns

# Create a copy of the original dataset
df_z_score_scaled = df.copy()

# Apply the scaler to the selected columns
df_z_score_scaled[columns_to_scale] = scaler.fit_transform(df[columns_to_scale])

# Display the first few rows of the transformed dataset
print("Z-Score Standardized Data:")
display(df_z_score_scaled.head(20))



Z-Score Standardized Data:


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | -0.565736 | 0.432793 | -0.473674 | A/5 21171 | -0.502445 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 0.663861 | 0.432793 | -0.473674 | PC 17599 | 0.786845 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | -0.258337 | -0.474545 | -0.473674 | STON/O2. 3101282 | -0.488854 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 0.433312 | 0.432793 | -0.473674 | 113803 | 0.42073 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 0.433312 | -0.474545 | -0.473674 | 373450 | -0.486337 | | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | -0.104637 | -0.474545 | -0.473674 | 330877 | -0.478116 | | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 1.893459 | -0.474545 | -0.473674 | 17463 | 0.395814 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | -2.102733 | 2.24747 | 0.76763 | 349909 | -0.224083 | | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | -0.181487 | -0.474545 | 2.008933 | 347742 | -0.424256 | | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | -1.180535 | 0.432793 | -0.473674 | 237736 | -0.042956 | | C |


**3. Robust Scaling**  
Robust Scaling centers each feature on its median and divides by its interquartile range (IQR), statistics that are robust to outliers. It is useful when the dataset contains outliers that would distort the mean, standard deviation, minimum, or maximum used by the other scalers.  

Columns  
For the loan dataset, the following columns are suitable for Robust Scaling:  

income  
othdebt  

Applying Robust Scaling  
The cell below applies Robust Scaling to the Titanic columns `Age`, `Fare`, `SibSp`, and `Parch`.  
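
With its default settings, `RobustScaler` computes (x − median) / (Q3 − Q1). The sketch below reproduces that arithmetic with NumPy on a small made-up array containing one outlier and compares it with a plain min-max rescaling. The helper name `robust_scale` and the data are assumptions for this illustration, and the helper does not handle a zero interquartile range.

```python
import numpy as np


def robust_scale(x):
    """Center on the median and divide by the interquartile range."""
    x = np.asarray(x, dtype=float)
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    return (x - median) / (q3 - q1)


# Hypothetical data: four ordinary values and one extreme outlier.
data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

robust = robust_scale(data)
minmax = (data - data.min()) / (data.max() - data.min())

print("robust :", robust)
print("min-max:", minmax.round(4))
```

Under min-max scaling the four ordinary values are squeezed into the bottom few percent of the range, while under robust scaling they keep a spread of about one IQR and only the outlier lands far from zero.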

In [11]:
from sklearn.preprocessing import RobustScaler
import pandas as pd

# Load the Titanic dataset
df = pd.read_csv("C:/Users/Hp/Downloads/Titanic.csv")

# Fill missing values in 'Age' column (if any)
df['Age'] = df['Age'].fillna(df['Age'].median())

# Columns to scale (must be numerical)
columns_to_scale = ['Age', 'Fare', 'SibSp', 'Parch']

# Make a copy of the original DataFrame
df_robust_scaled = df.copy()

# Initialize the RobustScaler
scaler = RobustScaler()

# Apply the scaler
df_robust_scaled[columns_to_scale] = scaler.fit_transform(df[columns_to_scale])

# Display the first few rows of the transformed DataFrame
print("✅ Robust Scaled Titanic Data:")
display(df_robust_scaled.head(20))


✅ Robust Scaled Titanic Data:


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | -0.461538 | 1.0 | 0.0 | A/5 21171 | -0.312011 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 0.769231 | 1.0 | 0.0 | PC 17599 | 2.461242 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | -0.153846 | 0.0 | 0.0 | STON/O2. 3101282 | -0.282777 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 0.538462 | 1.0 | 0.0 | 113803 | 1.673732 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 0.538462 | 0.0 | 0.0 | 373450 | -0.277363 | | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | 0.0 | 0.0 | 0.0 | 330877 | -0.25968 | | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 2.0 | 0.0 | 0.0 | 17463 | 1.620136 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | -2.0 | 3.0 | 1.0 | 349909 | 0.286744 | | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | -0.076923 | 0.0 | 2.0 | 347742 | -0.143827 | | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | -1.076923 | 1.0 | 0.0 | 237736 | 0.676348 | | C |


**4. Yeo-Johnson and Box-Cox Transformations**  
The Yeo-Johnson and Box-Cox transformations are power transformations used to stabilize variance and make the data more normally distributed. Yeo-Johnson can handle zero and negative values, whereas Box-Cox is only applicable to strictly positive values.  

Columns  
For the loan dataset, the following columns would be checked for non-positive values before choosing a transformation:  

age  
income  
debtinc  
creddebt  
othdebt  
Applying Transformations  
The cell below checks the Titanic columns `Age`, `Fare`, `SibSp`, and `Parch`, applies the Yeo-Johnson Transformation to columns that contain zero or negative values, and applies the Box-Cox Transformation to columns with strictly positive values. In this data `Fare`, `SibSp`, and `Parch` contain zeros, so they go through Yeo-Johnson, while `Age` goes through Box-Cox.  
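
For reference, the two transformations can be written out for a fixed exponent λ. The sketch below follows the standard textbook definitions; the function names `box_cox` and `yeo_johnson` are assumptions for this illustration. `PowerTransformer` differs in two ways: it estimates λ for each column by maximum likelihood, and by default it also standardizes the result to zero mean and unit variance, which is why the transformed columns in the output look like z-scores.

```python
import numpy as np


def box_cox(x, lmbda):
    """Box-Cox transform for a fixed lambda; needs strictly positive input."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lmbda == 0:
        return np.log(x)
    return (x ** lmbda - 1.0) / lmbda


def yeo_johnson(x, lmbda):
    """Yeo-Johnson transform for a fixed lambda; accepts any real input."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    # Non-negative branch: Box-Cox applied to x + 1.
    if lmbda != 0:
        out[pos] = ((x[pos] + 1.0) ** lmbda - 1.0) / lmbda
    else:
        out[pos] = np.log1p(x[pos])
    # Negative branch: mirrored transform with exponent 2 - lambda.
    if lmbda != 2:
        out[~pos] = -((1.0 - x[~pos]) ** (2.0 - lmbda) - 1.0) / (2.0 - lmbda)
    else:
        out[~pos] = -np.log1p(-x[~pos])
    return out


print(box_cox([1.0, 4.0, 9.0], 0.5))       # square-root style compression
print(yeo_johnson([-2.0, 0.0, 3.0], 1.0))  # lambda = 1 leaves values unchanged
print(yeo_johnson([-1.0, 0.0, 1.0], 0.0))  # log-style compression for x >= 0
```

The zero check in the notebook cell exists because `box_cox` is undefined at 0 (the logarithm and negative powers diverge), whereas `yeo_johnson` shifts non-negative values by 1 before transforming them.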

In [12]:
from sklearn.preprocessing import PowerTransformer
import numpy as np

# Columns from Titanic dataset to consider for transformation
columns_to_check = ['Age', 'Fare', 'SibSp', 'Parch']

# Check for negative values in each column
contains_negative = {col: np.any(df[col] < 0) for col in columns_to_check}

# Initialize the transformers
yeo_johnson_transformer = PowerTransformer(method='yeo-johnson')
box_cox_transformer = PowerTransformer(method='box-cox')

# Make copies of the dataset for transformation
df_yeo_johnson = df.copy()
df_box_cox = df.copy()

# Columns with negative or zero values (use Yeo-Johnson)
columns_to_transform_yeo_johnson = [col for col, has_negative in contains_negative.items() if has_negative or np.any(df[col] == 0)]
if columns_to_transform_yeo_johnson:
    df_yeo_johnson[columns_to_transform_yeo_johnson] = yeo_johnson_transformer.fit_transform(df[columns_to_transform_yeo_johnson])
    print("✅ Yeo-Johnson Transformed Data:")
    display(df_yeo_johnson.head(20))
else:
    print("ℹ️ No columns need Yeo-Johnson transformation.")

# Columns with strictly positive values (use Box-Cox)
columns_to_transform_box_cox = [col for col in columns_to_check if col not in columns_to_transform_yeo_johnson]
if columns_to_transform_box_cox:
    df_box_cox[columns_to_transform_box_cox] = box_cox_transformer.fit_transform(df[columns_to_transform_box_cox])
    print("✅ Box-Cox Transformed Data:")
    display(df_box_cox.head(20))
else:
    print("ℹ️ No columns eligible for Box-Cox transformation.")


✅ Yeo-Johnson Transformed Data:


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1.373636 | -0.560253 | A/5 21171 | -0.87882 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1.373636 | -0.560253 | PC 17599 | 1.336651 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | -0.67985 | -0.560253 | STON/O2. 3101282 | -0.790065 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1.373636 | -0.560253 | 113803 | 1.067352 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | -0.67985 | -0.560253 | 373450 | -0.774439 | | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | 28.0 | -0.67985 | -0.560253 | 330877 | -0.725002 | | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | -0.67985 | -0.560253 | 17463 | 1.045516 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 1.718889 | 1.729206 | 349909 | 0.184264 | | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | -0.67985 | 1.846856 | 347742 | -0.449944 | | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1.373636 | -0.560253 | 237736 | 0.530176 | | C |


✅ Box-Cox Transformed Data:


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | -0.521013 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 0.68424 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | -0.206546 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 0.467728 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 0.467728 | 0 | 0 | 373450 | 8.05 | | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | -0.053072 | 0 | 0 | 330877 | 8.4583 | | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 1.787156 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | -2.403494 | 3 | 1 | 349909 | 21.075 | | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | -0.129519 | 0 | 2 | 347742 | 11.1333 | | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | -1.189054 | 1 | 0 | 237736 | 30.0708 | | C |
