## Full Feature Engineering Guide with Imbalanced Data Handling

**What is Feature Engineering?**
 
Feature engineering is the process of creating new features or modifying existing ones to improve the performance of machine learning models. It involves techniques like feature extraction, transformation, encoding, and scaling to make data more useful for predictions.

**Why Do We Need Feature Engineering?**

1. **Improves Model Performance** – Good features help models make better predictions.
2. **Reduces Overfitting** – Helps eliminate noise and irrelevant data.
3. **Handles Missing Data** – Creates meaningful replacements for missing values.
4. **Enables Better Interpretability** – Makes features more understandable and useful.
5. **Reduces Dimensionality** – Helps remove unnecessary features, making the model more efficient.

In [2]:
# Extract date and time features
import pandas as pd
# Sample dataset
df = pd.DataFrame({'TransactionDate': pd.to_datetime(['2025-02-05 14:30:00', '2025-02-06 18:45:00'])})
# Extract date-related features
df['DayOfWeek'] = df['TransactionDate'].dt.dayofweek  # Monday=0, Sunday=6
df['Hour'] = df['TransactionDate'].dt.hour  # Hour of day (0-23)
df['IsWeekend'] = (df['DayOfWeek'] >= 5).astype(int)  # Weekend flag: 1 for Saturday/Sunday
print(df)
# Why? Helps capture behavioral trends (e.g., shopping habits on weekends vs. weekdays).

      TransactionDate  DayOfWeek  Hour  IsWeekend
0 2025-02-05 14:30:00          2    14          0
1 2025-02-06 18:45:00          3    18          0


In [3]:
# Aggregated features
# Find the average transaction amount per user.
df_transactions = pd.DataFrame({
    'UserID': [101, 102, 101, 103, 102],
    'TransactionAmount': [500, 300, 700, 1000, 400]
})
# Group by user, average the amounts, and give the result a descriptive column name
df_user_avg = df_transactions.groupby('UserID')['TransactionAmount'].mean().reset_index()
df_user_avg = df_user_avg.rename(columns={'TransactionAmount': 'AvgTransactionAmount'})
print(df_user_avg)
# Why? Helps identify high-value customers and spending patterns.

   UserID  AvgTransactionAmount
0     101                 600.0
1     102                 350.0
2     103                1000.0
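
The per-user average above lives in a separate table. To use it as a model feature, it usually has to sit on each transaction row. The sketch below shows one possible way to do that with `groupby(...).transform('mean')`. The column names `UserAvgAmount` and `AmountVsUserAvg` are assumptions introduced here for illustration and do not appear in the original notebook.

```python
import pandas as pd

df_transactions = pd.DataFrame({
    'UserID': [101, 102, 101, 103, 102],
    'TransactionAmount': [500, 300, 700, 1000, 400],
})

# transform('mean') returns one value per original row, aligned to the index,
# so the per-user average can be assigned directly as a new column
df_transactions['UserAvgAmount'] = (
    df_transactions.groupby('UserID')['TransactionAmount'].transform('mean')
)

# Hypothetical derived feature: transaction size relative to the user's own average
df_transactions['AmountVsUserAvg'] = (
    df_transactions['TransactionAmount'] / df_transactions['UserAvgAmount']
)
print(df_transactions)
```

A ratio like this can flag transactions that are unusual for a specific user even when the raw amount looks ordinary across the whole dataset.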


In [6]:
# Encoding categorical variables
# Convert ProductCategory (Electronics, Clothing, Grocery) into numerical form:
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({'ProductCategory': ['Electronics', 'Clothing', 'Clothing', 'Grocery']})
encoder = OneHotEncoder(sparse_output=False)  # Return a dense array instead of a sparse matrix
encoded_features = encoder.fit_transform(df[['ProductCategory']])
df_encoded = pd.DataFrame(encoded_features, columns=encoder.get_feature_names_out())
print(df_encoded)
# Why? Converts non-numeric categories into a format suitable for ML models.

   ProductCategory_Clothing  ProductCategory_Electronics  \
0                       0.0                          1.0   
1                       1.0                          0.0   
2                       1.0                          0.0   
3                       0.0                          0.0   

   ProductCategory_Grocery  
0                      0.0  
1                      0.0  
2                      0.0  
3                      1.0  
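
One practical question the encoding cell leaves open is what happens when new data contains a category the encoder never saw. The sketch below assumes a hypothetical unseen category `'Toys'` and uses `handle_unknown='ignore'`, which encodes unknown categories as an all-zero row instead of raising an error. It calls `.toarray()` on the default sparse output rather than passing `sparse_output=False`, purely so the example does not depend on that parameter.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'ProductCategory': ['Electronics', 'Clothing', 'Clothing', 'Grocery']})
# Hypothetical new data: 'Toys' was not present when the encoder was fitted
new_data = pd.DataFrame({'ProductCategory': ['Clothing', 'Toys']})

# Learn the categories from the training data only
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train[['ProductCategory']])

# The default output is a sparse matrix; .toarray() converts it to a dense array
encoded_new = encoder.transform(new_data[['ProductCategory']]).toarray()
print(encoder.categories_[0])  # categories learned during fit, in sorted order
print(encoded_new)             # the 'Toys' row is all zeros
```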


In [9]:
# Log transformation for skewed data
# If TransactionAmount is right-skewed or has large outliers, apply a log transformation
import numpy as np
df = pd.DataFrame({'TransactionAmount': [100, 200, 5000, 10000, 20000]})
df['LogTransactionAmount'] = np.log1p(df['TransactionAmount'])  # log1p computes log(1 + x), so zero values map to 0 instead of -inf
print(df)
# Why? Reduces skewness and the impact of outliers.

   TransactionAmount  LogTransactionAmount
0                100              4.615121
1                200              5.303305
2               5000              8.517393
3              10000              9.210440
4              20000              9.903538


In [10]:
# Feature scaling
from sklearn.preprocessing import MinMaxScaler, StandardScaler
scaler = MinMaxScaler()  # Rescales values to the range [0, 1]
df['NormalizedTransactionAmount'] = scaler.fit_transform(df[['TransactionAmount']])
standard_scaler = StandardScaler()  # Rescales values to zero mean and unit variance
df['StandardScalerTransactionAmount'] = standard_scaler.fit_transform(df[['TransactionAmount']])
print(df)
# Why? Puts features on a comparable scale so that large-valued features do not dominate scale-sensitive models.

   TransactionAmount  LogTransactionAmount  NormalizedTransactionAmount  \
0                100              4.615121                     0.000000   
1                200              5.303305                     0.005025   
2               5000              8.517393                     0.246231   
3              10000              9.210440                     0.497487   
4              20000              9.903538                     1.000000   

   StandardScalerTransactionAmount  
0                        -0.937070  
1                        -0.923606  
2                        -0.277351  
3                         0.395831  
4                         1.742196  
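
The scaling cell fits the scaler on the whole column, which is fine for a demonstration. In a modeling workflow, a common practice is to fit the scaler on the training split only and reuse its statistics on the test split, so that no information from the test data leaks into preprocessing. The sketch below illustrates this under stated assumptions: the three extra amounts and the 75/25 split are made-up values chosen only to have enough rows to split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# The first five amounts come from the cell above; the last three are illustrative additions
df = pd.DataFrame({'TransactionAmount': [100, 200, 5000, 10000, 20000, 300, 750, 1200]})
train, test = train_test_split(df, test_size=0.25, random_state=0)

scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only
train_scaled = scaler.fit_transform(train[['TransactionAmount']])
# Apply the same training statistics to the test rows (no refitting)
test_scaled = scaler.transform(test[['TransactionAmount']])

print(train_scaled.ravel())
print(test_scaled.ravel())
```

The scaled test values will generally not have zero mean or unit variance, which is expected because they are expressed in terms of the training distribution.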


**Final Summary of Feature Engineering & Imbalanced Data Handling**
 
Feature Extraction: Derive new features from raw data (e.g., Hour, DayOfWeek).

Aggregated Features: Calculate meaningful statistics per group (e.g., AvgTransactionAmount per user).

Encoding: Convert categorical variables into numerical form (One-Hot Encoding).

Log Transformation: Reduce skewness in the data distribution.

Feature Scaling: Put numerical features on a comparable scale for better model performance.

Downsampling: Reduce the size of the majority class by randomly removing some of its samples.

Upsampling: Increase the size of the minority class by resampling its rows with replacement.

SMOTE (Synthetic Minority Over-sampling Technique): Generate synthetic minority-class samples by interpolating between existing minority samples and their nearest neighbors.
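
The three imbalance-handling techniques in the summary can be sketched with pandas and NumPy alone. Everything in the example is an assumption made for illustration: the tiny fraud dataset, the column name `IsFraud`, and the variable names. The SMOTE part is a deliberately simplified stand-in that interpolates toward a randomly chosen other minority point. The real algorithm picks among the k nearest minority neighbors, and the imbalanced-learn package offers an implementation as `imblearn.over_sampling.SMOTE`.

```python
import numpy as np
import pandas as pd

# Hypothetical imbalanced dataset: 8 legitimate (0) vs. 2 fraudulent (1) transactions
df = pd.DataFrame({
    'TransactionAmount': [100, 120, 90, 110, 95, 105, 130, 115, 5000, 5200],
    'Hour': [10, 11, 12, 13, 14, 15, 16, 17, 2, 3],
    'IsFraud': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})
majority = df[df['IsFraud'] == 0]
minority = df[df['IsFraud'] == 1]

# Downsampling: draw majority rows without replacement to match the minority count
majority_down = majority.sample(n=len(minority), random_state=42)
df_down = pd.concat([majority_down, minority])

# Upsampling: draw minority rows with replacement to match the majority count
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
df_up = pd.concat([majority, minority_up])

# Simplified SMOTE-style step: new point = x + u * (other - x), with u drawn from [0, 1)
rng = np.random.default_rng(42)
features = ['TransactionAmount', 'Hour']
X_min = minority[features].to_numpy(dtype=float)
n_new = len(majority) - len(minority)
synthetic = []
for _ in range(n_new):
    # Pick a minority point and a different minority point to interpolate toward
    i, j = rng.choice(len(X_min), size=2, replace=False)
    u = rng.random()
    synthetic.append(X_min[i] + u * (X_min[j] - X_min[i]))
df_synth = pd.DataFrame(synthetic, columns=features)
df_synth['IsFraud'] = 1
df_smote = pd.concat([df, df_synth], ignore_index=True)

print(df_down['IsFraud'].value_counts())
print(df_up['IsFraud'].value_counts())
print(df_smote['IsFraud'].value_counts())
```

Resampling of any kind is normally applied to the training split only, so that evaluation still reflects the original class distribution.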