<div style ="font-family:Trebuchet MS; background-color : #f8f0fa; border-left: 5px solid #1b4332; padding: 12px; border-radius: 50px 50px;">
    <h2 style="color: #1b4332; font-size: 48px; text-align: center;">
        <b>Step 4 in Feature Engineering: Feature Scaling</b>
        <hr style="border-top: 2px solid #264653;">
    </h2>
    <h3 style="font-size: 14px; color: #264653; text-align: left; "><strong> I hope this is helpful. Let's get started. </strong></h3>
</div>

Feature scaling is a crucial step in data preprocessing, especially when dealing with machine learning algorithms that are sensitive to the scale of the data, such as gradient descent-based algorithms, k-nearest neighbors, and principal component analysis. This article explores various feature scaling techniques and applies them to the Titanic dataset.

Different features in a dataset may have different units and ranges. For instance, in the Titanic dataset, the Age column might range from 0 to 80, while the Fare column could range from 0 to 512. Algorithms like logistic regression, SVM, and neural networks are sensitive to feature scale: regularization penalties, distance or kernel computations, and gradient updates all behave differently when features have very different ranges. If features are not scaled, those with larger ranges can dominate the others, leading to slow convergence or skewed results.
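
To make the "larger range dominates" point concrete, here is a minimal sketch using Euclidean distance, the measure behind k-nearest neighbors. The three passengers are made-up values, and dividing by an assumed range (Age 0 to 80, Fare 0 to 512) is only a stand-in for a proper scaler.

```python
import numpy as np

# Hypothetical passengers as [Age, Fare]; values are made up for illustration
a = np.array([22.0, 7.25])
b = np.array([60.0, 8.05])   # very different age, similar fare
c = np.array([23.0, 80.0])   # similar age, very different fare


def euclidean(u, v):
    """Plain Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))


# On raw values, the Fare difference decides which passenger is "closer" to a
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# Divide each feature by its assumed range so both live on a similar scale
ranges = np.array([80.0, 512.0])
scaled_ab = euclidean(a / ranges, b / ranges)
scaled_ac = euclidean(a / ranges, c / ranges)

print(f"raw:    d(a,b)={raw_ab:.2f}  d(a,c)={raw_ac:.2f}")
print(f"scaled: d(a,b)={scaled_ab:.3f}  d(a,c)={scaled_ac:.3f}")
```

On the raw values, passenger `b` looks closer to `a` than `c` does, purely because Fare spans a wider range. After rescaling, the ordering flips and the large age gap counts again.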

- We will practice with the [Titanic dataset](https://www.kaggle.com/datasets/brendan45774/test-file/data).

# 1. Standardization (Z-Score Normalization)

Standardization scales features so that they have a mean of 0 and a standard deviation of 1. This method is beneficial when the data follows a Gaussian distribution.

**Formula**

$$z = \frac{x - \mu}{\sigma}$$

where $\mu$ is the mean of the feature and $\sigma$ is its standard deviation.

In [2]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

In [4]:
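from sklearn.preprocessing import StandardScaler

# Forward slashes avoid invalid escape sequences such as '\D' and '\T'
df = pd.read_csv('../Data/Titanic.csv')

# Numeric columns to standardize
col_to_scale = ['Age', 'Fare']

scaler = StandardScaler()

# Learn each column's mean and standard deviation, then transform
scaled_features = scaler.fit_transform(df[col_to_scale])

df_scaled = pd.DataFrame(scaled_features, columns=col_to_scale)

print(df_scaled.head())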

        Age      Fare
0  0.298549 -0.497811
1  1.181328 -0.512660
2  2.240662 -0.464532
3 -0.231118 -0.482888
4 -0.584229 -0.417971
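
As a sanity check on the formula, the sketch below computes the z-score by hand and compares it with `StandardScaler`. The ages are illustrative values, not read from the CSV. One detail worth knowing: `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative ages as a single-column 2D array (scikit-learn expects 2D input)
ages = np.array([[22.0], [27.0], [34.5], [47.0], [62.0]])

# z = (x - mean) / std, with the population standard deviation (ddof=0)
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)

# The same transformation through scikit-learn
sk = StandardScaler().fit_transform(ages)

print(np.allclose(manual, sk))
print("mean:", manual.mean().round(6), "std:", manual.std().round(6))
```

If you compare against pandas instead, remember that `Series.std()` defaults to `ddof=1`, so the numbers will differ slightly unless you pass `ddof=0`.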


# 2. Min-Max Scaling (Normalization)

Min-Max scaling transforms features to a fixed range, typically [0, 1]. This is particularly useful when the features don't follow a normal distribution and you want to retain the distribution's original shape.

**Formula**

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

In [5]:
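from sklearn.preprocessing import MinMaxScaler

# Forward slashes avoid invalid escape sequences such as '\D' and '\T'
df = pd.read_csv('../Data/Titanic.csv')
# Initialize the MinMaxScaler (default output range is [0, 1])
scaler = MinMaxScaler()

min_max_col = ['Age', 'Fare']

# Learn each column's minimum and maximum, then rescale
scaled_features = scaler.fit_transform(df[min_max_col])

df_min_max = pd.DataFrame(scaled_features, columns=min_max_col)

print(df_min_max.head())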

        Age      Fare
0  0.452723  0.015282
1  0.617566  0.013663
2  0.815377  0.018909
3  0.353818  0.016908
4  0.287881  0.023984
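
One practical consequence of the formula is that the minimum and maximum are learned from whatever data the scaler was fitted on. The sketch below uses made-up fares and assumes the common pattern of fitting on training data and reusing the fitted scaler on new data. It shows that unseen values can land outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up fares: the scaler only ever sees the training values
train_fares = np.array([[0.0], [10.0], [50.0], [100.0]])
new_fares = np.array([[25.0], [200.0]])

scaler = MinMaxScaler()
scaler.fit(train_fares)            # learns min=0 and max=100 from training data

train_scaled = scaler.transform(train_fares)
new_scaled = scaler.transform(new_fares)   # reuses the training min and max

print(train_scaled.ravel())
print(new_scaled.ravel())
```

The fare of 200 maps to 2.0 because it exceeds the training maximum. Whether to clip such values or leave them as they are depends on the model. Either way, the same effect makes Min-Max scaling sensitive to outliers in the fitting data.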


# 3. Robust Scaling

Robust scaling is used when the dataset contains outliers. Unlike standardization, which uses the mean and standard deviation, robust scaling uses the median and interquartile range (IQR), making it less sensitive to outliers.

**Formula**

$$x' = \frac{x - \text{median}(x)}{\text{IQR}(x)}, \qquad \text{IQR}(x) = Q_3 - Q_1$$

In [6]:
from sklearn.preprocessing import RobustScaler

# Forward slashes avoid invalid escape sequences such as '\D' and '\T'
df = pd.read_csv('../Data/Titanic.csv')

# Initialize the RobustScaler
scaler = RobustScaler()

robust_col = ['Age', 'Fare']

# Learn each column's median and IQR, then center and scale
scaled_features = scaler.fit_transform(df[robust_col])

df_robust = pd.DataFrame(scaled_features, columns=robust_col)

print(df_robust.head())

        Age      Fare
0  0.416667 -0.280670
1  1.111111 -0.315800
2  1.944444 -0.201943
3  0.000000 -0.245367
4 -0.277778 -0.091793
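
To see why the median and IQR help with outliers, here is a small sketch on made-up fares containing one extreme value. It computes robust scaling by hand, checks it against `RobustScaler` with its default 25th to 75th percentile range, and compares the result with `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Made-up fares with one extreme outlier (500)
fares = np.array([[7.0], [8.0], [9.0], [10.0], [11.0], [500.0]])

# Manual robust scaling: (x - median) / (Q3 - Q1)
median = np.median(fares, axis=0)
q1, q3 = np.percentile(fares, [25, 75], axis=0)
manual = (fares - median) / (q3 - q1)

robust = RobustScaler().fit_transform(fares)
standard = StandardScaler().fit_transform(fares)

print(np.allclose(manual, robust))
print("robust:  ", robust.ravel().round(3))
print("standard:", standard.ravel().round(3))
```

With standardization, the outlier inflates the mean and standard deviation, so the five ordinary fares are squeezed into almost the same value. Robust scaling keeps them spread out and leaves the outlier as a large value for you to handle separately.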


# 4. Logarithmic Scaling

Logarithmic scaling is a transformation that can help reduce the impact of large values and skewed distributions. It is particularly useful when the feature has a long tail.

**Formula:**

$$x' = \log(x + 1)$$

Adding 1 keeps the transformation defined when $x = 0$.

In [7]:
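# Forward slashes avoid invalid escape sequences such as '\D' and '\T'
df = pd.read_csv('../Data/Titanic.csv')

# log1p computes log(1 + x), so an Age of 0 maps to 0 instead of -inf
df['log_Age'] = np.log1p(df['Age'])

print(df[['Age', 'log_Age']].head())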

    Age   log_Age
0  34.5  3.569533
1  47.0  3.871201
2  62.0  4.143135
3  27.0  3.332205
4  22.0  3.135494
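
The effect on a long tail is easier to see on a fare-like feature. The following sketch uses made-up fares (not read from the CSV) to show how `np.log1p` compresses large values, and how `np.expm1` undoes the transformation when you need the original units back.

```python
import numpy as np

# Made-up, right-skewed fares for illustration
fares = np.array([0.0, 7.25, 8.05, 13.0, 26.0, 80.0, 512.33])

# log(1 + x): zero stays zero, large values are pulled in sharply
log_fares = np.log1p(fares)

# expm1 is the exact inverse of log1p: exp(y) - 1
restored = np.expm1(log_fares)

print(log_fares.round(3))
print(np.allclose(restored, fares))
```

On the raw scale, the largest fare is about 64 times the median of 13.0. After the transformation, the gap shrinks to a factor of less than 3.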


# Conclusion

Feature scaling is a vital preprocessing step that puts features on a comparable scale, so that no feature dominates the model merely because of its units or range. Each scaling method has its advantages depending on the data's distribution and the presence of outliers: standardization suits roughly Gaussian features, Min-Max scaling gives a fixed range, robust scaling tolerates outliers, and logarithmic scaling tames long tails. By applying these techniques to the Titanic dataset, you can prepare your data effectively for various machine learning algorithms.
