# Feature Engineering

Full Feature Engineering Guide with Imbalanced Data Handling

What is Feature Engineering?

    Feature engineering is the process of creating new features or modifying existing ones to improve the performance of
    machine learning models. It involves techniques like feature extraction, transformation, encoding,
    and scaling to make data more useful for predictions.
    
Why Do We Need Feature Engineering?
    
1. Improves Model Performance: Good features help models make better predictions.

2. Reduces Overfitting: Helps eliminate noise and irrelevant data.

3. Handles Missing Data: Creates meaningful replacements for missing values.

4. Enables Better Interpretability: Makes features more understandable and useful.

5. Reduces Dimensionality: Helps remove redundant or uninformative features, making the model more efficient.


In [3]:
import pandas as pd
# Sample dataset with 'TransactionDate'
data = {'TransactionDate': ['2025-02-01 14:30:00', '2025-02-02 18:45:00', '2025-02-03 08:15:00']}
df = pd.DataFrame(data)

# Convert 'TransactionDate' to datetime
df['TransactionDate'] = pd.to_datetime(df['TransactionDate'])

# Extract day of the week (Monday=0, Sunday=6)
df['DayOfWeek'] = df['TransactionDate'].dt.dayofweek

# Extract the hour of the day
df['Hour'] = df['TransactionDate'].dt.hour

# Create a Weekend flag (1 for weekend, 0 for weekdays)
df['IsWeekend'] = (df['DayOfWeek'] >= 5).astype(int)  # 5 and 6 represent Saturday and Sunday

print(df)

      TransactionDate  DayOfWeek  Hour  IsWeekend
0 2025-02-01 14:30:00          5    14          1
1 2025-02-02 18:45:00          6    18          1
2 2025-02-03 08:15:00          0     8          0


In [6]:
# Aggregate features
# Find the average transaction amount per user
# Sample data
df_trans = pd.DataFrame({
    'UserId': [101, 102, 102, 103, 104],
    'TransAmt': [500, 300, 700, 1000, 400]
})
df_user_avg = df_trans.groupby('UserId')['TransAmt'].mean().reset_index()
df_user_avg.rename(columns={'TransAmt': 'AvgTransAmt'}, inplace=True)
print(df_user_avg)

   UserId  AvgTransAmt
0     101        500.0
1     102        500.0
2     103       1000.0
3     104        400.0
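
The per-user average above lives in its own table. To use it as a model feature it usually has to sit next to each transaction row. A minimal sketch of one way to do that with `groupby(...).transform('mean')`, reusing the same sample data; the `AmtVsUserAvg` column is a hypothetical derived feature added only for illustration:

```python
import pandas as pd

# Same sample transactions as in the cell above
df_trans = pd.DataFrame({
    'UserId': [101, 102, 102, 103, 104],
    'TransAmt': [500, 300, 700, 1000, 400]
})

# transform('mean') returns one value per original row, so the per-user
# average lines up with each transaction without a separate merge
df_trans['AvgTransAmt'] = df_trans.groupby('UserId')['TransAmt'].transform('mean')

# Hypothetical derived feature: how far each transaction is from the user's average
df_trans['AmtVsUserAvg'] = df_trans['TransAmt'] - df_trans['AvgTransAmt']

print(df_trans)
```

Merging `df_user_avg` back onto `df_trans` with `pd.merge` on `UserId` would give the same `AvgTransAmt` column; `transform` simply skips the intermediate table.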


In [11]:
# Encoding categorical variables
# Convert ProductCategory into numerical form
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'ProductCategory': ['Electronics', 'Clothing', 'Clothing', 'Electronics', 'Grocery']
})

encoder = OneHotEncoder(sparse_output=False)

encoded_features = encoder.fit_transform(df[['ProductCategory']])

df_encoded = pd.DataFrame(encoded_features, columns=encoder.get_feature_names_out(['ProductCategory']))

df_encoded


   ProductCategory_Clothing  ProductCategory_Electronics  ProductCategory_Grocery
0                       0.0                          1.0                      0.0
1                       1.0                          0.0                      0.0
2                       1.0                          0.0                      0.0
3                       0.0                          1.0                      0.0
4                       0.0                          0.0                      1.0
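
One practical question with one-hot encoding is what happens when new data contains a category that was not present when the encoding was built. The sketch below uses `pd.get_dummies` as an alternative to `OneHotEncoder` and aligns the new data to the training columns. The `train` and `new` frames and the `'Toys'` category are made-up examples, not part of the data above:

```python
import pandas as pd

# Hypothetical training data and new data; 'Toys' never appears in training
train = pd.DataFrame({'ProductCategory': ['Electronics', 'Clothing', 'Grocery']})
new = pd.DataFrame({'ProductCategory': ['Clothing', 'Toys']})

train_encoded = pd.get_dummies(train, columns=['ProductCategory'], dtype=float)
new_encoded = pd.get_dummies(new, columns=['ProductCategory'], dtype=float)

# Align the new data to the training columns:
# categories missing from the new data become 0.0,
# categories unseen in training ('Toys') are dropped
new_encoded = new_encoded.reindex(columns=train_encoded.columns, fill_value=0.0)

print(new_encoded)
```

With this alignment the unseen category ends up as a row of zeros. This is the same behavior `OneHotEncoder` gives when it is created with `handle_unknown='ignore'`.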


In [18]:
# Log transformation for skewed data
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'TransactionAmount': [100, 200, 5000, 10000, 500000, 1200000]
})
# Apply log transformation
df['LogTransactionAmount'] = np.log(df['TransactionAmount'])
df

   TransactionAmount  LogTransactionAmount
0                100              4.605170
1                200              5.298317
2               5000              8.517193
3              10000              9.210340
4             500000             13.122363
5            1200000             13.997832
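
`np.log` is undefined at zero, so a column that can contain zero amounts needs a small variation. A minimal sketch, assuming a zero amount is possible (the `0` is added to the sample values purely for illustration): `np.log1p` computes `log(1 + x)`, `np.expm1` reverses it, and `Series.skew()` gives a rough check that the skew actually went down.

```python
import numpy as np
import pandas as pd

# Same amounts as above, plus a hypothetical zero-value transaction
amounts = pd.Series([0, 100, 200, 5000, 10000, 500000, 1200000],
                    name='TransactionAmount')

# log1p(x) = log(1 + x), which is defined at x = 0
log_amounts = np.log1p(amounts)

# expm1 is the inverse, useful for converting predictions back to amounts
restored = np.expm1(log_amounts)

# Sample skewness before and after the transformation
print('skew before:', amounts.skew())
print('skew after: ', log_amounts.skew())
```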


In [17]:
#Feature scaling
from sklearn.preprocessing import MinMaxScaler, StandardScaler
# Min-Max Scaling (Normalization)
scaler = MinMaxScaler()
df['NormalizedTransactionAmount'] = scaler.fit_transform(df[['TransactionAmount']])
# Standard Scaling (Standardization)
standard_scaler = StandardScaler()
df['StandardizedTransactionAmount'] = standard_scaler.fit_transform(df[['TransactionAmount']])
df

   TransactionAmount  LogTransactionAmount  NormalizedTransactionAmount  StandardizedTransactionAmount
0                100              4.605170                     0.000000                      -0.639098
1                200              5.298317                     0.000083                      -0.638874
2               5000              8.517193                     0.004084                      -0.628140
3              10000              9.210340                     0.008251                      -0.616958
4             500000             13.122363                     0.416618                       0.478829
5            1200000             13.997832                     1.000000                       2.044240
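
To make the two scalers less of a black box, the sketch below recomputes both columns by hand with NumPy. It assumes the usual definitions: min-max scaling is `(x - min) / (max - min)`, and standardization is `(x - mean) / std` with the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

x = np.array([100, 200, 5000, 10000, 500000, 1200000], dtype=float)

# Min-max scaling: maps the smallest value to 0 and the largest to 1
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: zero mean, unit variance
# (np.std defaults to the population standard deviation, ddof=0)
x_standard = (x - x.mean()) / x.std()

print(np.round(x_minmax, 6))
print(np.round(x_standard, 6))
```

The hand-computed values match the table above. They also show why skewed data is often log-transformed before scaling: the single largest amount pushes four of the six min-max values below 0.01.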


# Final Summary of Feature Engineering & Imbalanced Data Handling

Feature Extraction: Extract new insights from raw data (e.g., Hour, DayOfWeek)

Aggregated Features: Calculate meaningful statistics (e.g., AvgTransAmt per user)

Encoding: Convert categorical variables into numerical form (One-Hot Encoding)

Log Transformation: Reduce skewness in the data distribution

Feature Scaling: Normalize or standardize numerical features so they share a comparable scale

Downsampling: Reduce the size of the majority class

Upsampling: Increase the size of the minority class

SMOTE (Synthetic Minority Over-sampling Technique): Generate synthetic samples for the minority class
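
The two random resampling items in the summary can be sketched with plain pandas. The dataset, the `IsFraud` label, and the class sizes below are hypothetical and exist only to show the mechanics:

```python
import pandas as pd

# Hypothetical imbalanced dataset: 8 legitimate (0) vs 2 fraudulent (1) transactions
df = pd.DataFrame({
    'TransAmt': [500, 300, 700, 1000, 400, 250, 900, 650, 12000, 15000],
    'IsFraud':  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
})

majority = df[df['IsFraud'] == 0]
minority = df[df['IsFraud'] == 1]

# Downsampling: keep a random subset of the majority class,
# the same size as the minority class
majority_down = majority.sample(n=len(minority), random_state=42)
df_down = pd.concat([majority_down, minority]).reset_index(drop=True)

# Upsampling: draw minority rows with replacement
# until they match the majority class size
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
df_up = pd.concat([majority, minority_up]).reset_index(drop=True)

print(df_down['IsFraud'].value_counts())
print(df_up['IsFraud'].value_counts())
```

Resampling is normally applied to the training split only, so that evaluation data keeps the real class balance. SMOTE is not shown here. Assuming the imbalanced-learn package is installed, it is available as `imblearn.over_sampling.SMOTE` and is applied with `fit_resample(X, y)`.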