<div style ="font-family:Trebuchet MS; background-color : #f8f0fa; border-left: 5px solid #1b4332; padding: 12px; border-radius: 50px 50px;">
    <h2 style="color: #1b4332; font-size: 48px; text-align: center;">
        <b>Step 5 in Feature Engineering: Feature Transformation</b>
        <hr style="border-top: 2px solid #264653;">
    </h2>
    <h3 style="font-size: 14px; color: #264653; text-align: left; "><strong> I hope you find this helpful. Let's get started. </strong></h3>
</div>

### **Introduction**

Feature transformation changes how the data is represented in order to improve the performance of machine learning models. By creating polynomial or interaction features, binning continuous values, or applying transformations such as log, square root, and Box-Cox, you can make the data more suitable for a given algorithm. This article covers these feature transformation techniques and applies them to the Titanic dataset.


- We will practice along with the [Titanic dataset](https://www.kaggle.com/datasets/brendan45774/test-file/data).

### **1. Polynomial Features**

Polynomial features are created by generating new features that are polynomial combinations of the original features. This can capture more complex relationships in the data.


In [6]:
# Create polynomial features
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import PolynomialFeatures

# Raw string so the backslashes are not treated as escape sequences
df = pd.read_csv(r'..\Data\Titanic.csv')

# Select numerical columns to generate polynomial features
num_cols = ['Age', 'Fare']

# PolynomialFeatures does not accept missing values, so impute with the column mean
for col in num_cols:
    df[col] = df[col].fillna(df[col].mean())

# Initialize the polynomial transformer
poly = PolynomialFeatures(degree=2, include_bias=False)

# Fit the transformer and transform the data in one step
df_poly = poly.fit_transform(df[num_cols])

# Convert the resulting array to a DataFrame with readable column names
df_poly = pd.DataFrame(df_poly, columns=poly.get_feature_names_out(num_cols))
df_poly.head()

    Age     Fare    Age^2  Age Fare      Fare^2
0  34.5   7.8292  1190.25  270.1074   61.296373
1  47.0   7.0000  2209.00  329.0000   49.000000
2  62.0   9.6875  3844.00  600.6250   93.847656
3  27.0   8.6625   729.00  233.8875   75.038906
4  22.0  12.2875   484.00  270.3250  150.982656


### **2. Interaction Features**

Interaction features are created by combining two or more features, most commonly by multiplying them. This captures relationships between features that a model might miss when it considers each feature independently.

In [8]:
from sklearn.preprocessing import FunctionTransformer

# Raw string so the backslashes are not treated as escape sequences
df = pd.read_csv(r'..\Data\Titanic.csv')

# Define a custom function to create interaction features
def create_interaction_features(data):
    data['Age_Fare'] = data['Age'] * data['Fare']
    return data

df = create_interaction_features(df)

print(df[['Age', 'Fare', 'Age_Fare']].head())


    Age     Fare  Age_Fare
0  34.5   7.8292  270.1074
1  47.0   7.0000  329.0000
2  62.0   9.6875  600.6250
3  27.0   8.6625  233.8875
4  22.0  12.2875  270.3250
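
If the same step needs to slot into a scikit-learn pipeline, one option is to wrap the function in `FunctionTransformer`. The sketch below is an illustration rather than the only way to do it: the helper name `add_age_fare` and the toy rows are made up, and the helper works on a copy so the caller's frame is left unchanged.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def add_age_fare(data: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `data` with an Age * Fare interaction column."""
    out = data.copy()  # avoid mutating the caller's frame
    out["Age_Fare"] = out["Age"] * out["Fare"]
    return out


# Toy rows (made up for illustration, not Titanic data)
toy = pd.DataFrame({"Age": [2.0, 3.0], "Fare": [10.0, 20.0]})

# FunctionTransformer passes the DataFrame straight to the function
interaction = FunctionTransformer(add_age_fare)
toy_out = interaction.fit_transform(toy)

print(toy_out)
```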


### **3. Binning**
Binning is the process of converting continuous features into discrete intervals or bins. This can make the model more interpretable and robust, especially when dealing with outliers.

In [9]:
# binning the Age column

# Raw string so the backslashes are not treated as escape sequences
df = pd.read_csv(r'..\Data\Titanic.csv')

# Bin the numerical column 'Age' into 5 bins
bins = [0, 12, 19, 30, 50, 100]
labels = ['Child', 'Teenager', 'Young Adult', 'Adult', 'Senior']

df['AgeGroup'] = pd.cut(df['Age'], bins=bins, labels=labels)

print(df[['Age', 'AgeGroup']].head())

    Age     AgeGroup
0  34.5        Adult
1  47.0        Adult
2  62.0       Senior
3  27.0  Young Adult
4  22.0  Young Adult
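
One detail worth checking is how `pd.cut` treats values that sit exactly on a bin edge. By default the intervals are closed on the right, so with the bins above an age of exactly 12 is a `Child`, an age of exactly 0 falls outside the first bin, and missing ages stay missing. The sketch below uses made-up ages chosen to land on the edges.

```python
import numpy as np
import pandas as pd

bins = [0, 12, 19, 30, 50, 100]
labels = ["Child", "Teenager", "Young Adult", "Adult", "Senior"]

# Made-up ages that sit on or next to bin edges, plus one missing value
ages = pd.Series([0.0, 12.0, 12.5, 19.0, 30.0, 50.0, np.nan])

# Default right=True gives intervals (0, 12], (12, 19], (19, 30], ...
groups = pd.cut(ages, bins=bins, labels=labels)

print(pd.DataFrame({"Age": ages, "AgeGroup": groups}))
```

Passing `include_lowest=True` makes the first interval include its left edge, which matters if an age of 0 should count as `Child`.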


### **4. Log, Square Root, and Box-Cox Transformations**

These transformations are applied to make data more Gaussian-like, stabilize variance, and handle skewness in the data.

#### **a. Log Transformation**

In [11]:
df['log_Fare'] = np.log1p(df['Fare'])
print(df['log_Fare'].min(), df['log_Fare'].max())
print(df['Fare'].min(), df['Fare'].max())
print(df[['Fare', 'log_Fare']].head())

0.0 6.240917354759096
0.0 512.3292
      Fare  log_Fare
0   7.8292  2.178064
1   7.0000  2.079442
2   9.6875  2.369075
3   8.6625  2.268252
4  12.2875  2.586824
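
`np.log1p(x)` computes log(1 + x), which is why a fare of 0 maps to 0 rather than to negative infinity. Its inverse is `np.expm1`, which is useful when a value has to be reported back on the original scale. A minimal sketch using three fare values that appear in the output above:

```python
import numpy as np

# Minimum, one ordinary value, and maximum fare from the output above
fares = np.array([0.0, 7.0, 512.3292])

log_fares = np.log1p(fares)     # log(1 + x), defined at x = 0
restored = np.expm1(log_fares)  # exp(y) - 1, the inverse of log1p

print(log_fares)
print(np.allclose(restored, fares))
```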


#### **b. Square Root Transformation**

In [13]:
df['Sqrt_Fare'] = np.sqrt(df['Fare'])

print(df['Sqrt_Fare'].min(), df['Sqrt_Fare'].max())
print(df['Fare'].min(), df['Fare'].max())

print(df[['Fare', 'Sqrt_Fare']].head())


0.0 22.634690190060034
0.0 512.3292
      Fare  Sqrt_Fare
0   7.8292   2.798071
1   7.0000   2.645751
2   9.6875   3.112475
3   8.6625   2.943213
4  12.2875   3.505353


The square root transformation compresses large values less aggressively than the log transformation, but it can still help reduce right skew.
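
One way to check how strongly each transformation acts is to compare skewness before and after. The sketch below does this on a synthetic right-skewed sample (a log-normal draw standing in for a fare-like column, not the Titanic data), so the exact numbers are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed sample standing in for a fare-like column
sample = pd.Series(rng.lognormal(mean=2.0, sigma=1.0, size=5000))

skews = {
    "raw": sample.skew(),
    "sqrt": np.sqrt(sample).skew(),
    "log1p": np.log1p(sample).skew(),
}

for name, value in skews.items():
    print(f"{name:>5}: {value:.2f}")
```

On this sample the square root lowers the skewness and the log lowers it further. Running the same comparison on `df['Fare']` shows which transformation suits the real column.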

#### **c. Box-Cox Transformation**

In [15]:
from scipy.stats import boxcox

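# Handle missing values in the Fare column first
df['Fare'] = df['Fare'].fillna(df['Fare'].mean())

# Box-Cox requires strictly positive input, so add 1 to handle zero fares.
# The second return value (discarded here) is the fitted lambda.
df['Fare_BoxCox'], _ = boxcox(df['Fare'] + 1)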


print(df['Fare_BoxCox'].min(), df['Fare_BoxCox'].max())
print(df['Fare'].min(), df['Fare'].max())

print(df[['Fare', 'Fare_BoxCox']].head())


0.0 2.9406506490471425
0.0 512.3292
      Fare  Fare_BoxCox
0   7.8292     1.628553
1   7.0000     1.574361
2   9.6875     1.729330
3   8.6625     1.676811
4  12.2875     1.837799


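The Box-Cox transformation estimates a power parameter (lambda) from the data, so it can adapt to a wider range of skewed distributions than a fixed log or square root. It requires strictly positive input, which is why 1 is added to `Fare` above. The result is more Gaussian-like, which benefits models that are sensitive to feature distributions, such as linear models.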
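
The fitted lambda is worth keeping rather than discarding: it lets you apply the same transformation to new data and invert it later. The sketch below shows this on synthetic, strictly positive data (not the Titanic fares). It uses `scipy.stats.boxcox` with and without the `lmbda` argument, and `scipy.special.inv_boxcox` for the inverse.

```python
import numpy as np
from scipy.special import inv_boxcox
from scipy.stats import boxcox

rng = np.random.default_rng(0)
# Synthetic strictly positive, right-skewed data (not the Titanic fares)
x = rng.lognormal(mean=2.0, sigma=1.0, size=1000)

# With no lmbda argument, boxcox estimates lambda by maximum likelihood
x_bc, lam = boxcox(x)
print(f"estimated lambda: {lam:.3f}")

# Reuse the same lambda on new data instead of re-estimating it;
# with lmbda given, boxcox returns only the transformed array
x_new = np.array([1.0, 5.0, 50.0])
x_new_bc = boxcox(x_new, lmbda=lam)

# inv_boxcox undoes the transform for a given lambda
print(np.allclose(inv_boxcox(x_new_bc, lam), x_new))
```

A lambda close to 0 means the fitted transform behaves much like a log, which is expected for log-normal data like this sample.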

### **Conclusion**

Feature transformation is a powerful tool in your data preprocessing toolkit. By applying polynomial features, interaction terms, binning, and various transformations, you can enhance the quality of your features and ultimately improve model performance. 