## Feature Engineering Steps

1. **Split the Data**
   - Create a train-test split first to avoid data leakage

2. **Handle Missing Values**
   - Impute missing values (mean, median, mode, predictive models)
   - Drop rows or columns with excessive missing data

3. **Outlier Detection and Handling**
   - Identify outliers using Z-scores, IQR, or visual methods
   - Remove or transform outliers as needed

4. **Transform Features**
   - Log, square root, or exponential transformations for skewed data
   - Standardize or normalize features for algorithms sensitive to scaling
   - Apply binning for continuous variables when necessary

5. **Feature Encoding**
   - Label Encoding or One-Hot Encoding for categorical data
   - Target Encoding and Frequency Encoding for certain cases

6. **Feature Creation**
   - Create interaction features or polynomial features
   - Extract meaningful datetime features (day, month, year, etc.)
   - Aggregate features like sum, mean, and max

7. **Feature Selection**
   - Analyze correlations and remove highly correlated features
   - Use feature importance from models (Random Forest, XGBoost)
   - Apply dimensionality reduction techniques (PCA)

8. **Handle Class Imbalance (if applicable)**
   - Resample using oversampling or undersampling techniques
   - Adjust class weights in models

9. **Verify Feature Engineering**
   - Use cross-validation to validate transformations
   - Ensure no data leakage
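
One way to cover the last step in the list is to wrap preprocessing and the model in a single scikit-learn `Pipeline` and cross-validate the whole thing. The sketch below is illustrative only: it uses synthetic data, and `StandardScaler` and `LinearRegression` are stand-ins chosen for the example, not necessarily the steps used later in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (hypothetical; stands in for the real features).
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Because the scaler lives inside the pipeline, it is re-fit on the
# training folds only within each cross-validation split.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])

scores = cross_val_score(pipeline, X, y, cv=5, scoring="r2")
print(scores.mean())
```

Fitting the scaler on the full dataset before calling `cross_val_score` would let validation-fold statistics leak into training, which is the failure mode the pipeline avoids.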


## **1. Split the Data**
First, split the data into training and testing sets so that every feature engineering step is fit on the training set only and then applied unchanged to the test set.

In [5]:
# Train-Test split
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

df = pd.read_csv('../../data/data.csv')
# Hold out 20% of the rows for testing; the fixed seed keeps the split reproducible
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
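
The "fit on train, apply to test" pattern described above can be sketched with any scikit-learn transformer. The example below uses `StandardScaler` on a small made-up frame. The `toy_*` names and all values are invented for illustration; the column names are borrowed from this dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy frame with invented values (not the real data).
toy_df = pd.DataFrame({
    "engine_volume": [1.6, 2.0, 2.5, 3.0, 1.8, 2.2, 3.5, 1.4, 2.4, 4.0],
    "airbags": [4, 6, 8, 12, 4, 6, 10, 2, 8, 12],
})
toy_train, toy_test = train_test_split(toy_df, test_size=0.2, random_state=42)

scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the training rows only.
train_scaled = scaler.fit_transform(toy_train)
# transform reuses those training statistics; nothing is learned from the test rows.
test_scaled = scaler.transform(toy_test)
```

Calling `fit` or `fit_transform` on the test set would compute statistics from data the model should never see during training.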

## **2. Handle Missing Values** ##

In [None]:
# Checking for missing values in training dataset
train_df.isnull().sum()

id                    0
price                 0
levy                  0
manufacturer          0
model                 0
manufacturing_year    0
category              0
leather_interior      0
fuel_type             0
engine_volume         0
distance_travelled    0
cylinders             0
gear_box_type         0
drive_wheels          0
doors                 0
drive_type            0
color                 0
airbags               0
dtype: int64

In [7]:
# Checking for missing values in test dataset
test_df.isnull().sum()

id                    0
price                 0
levy                  0
manufacturer          0
model                 0
manufacturing_year    0
category              0
leather_interior      0
fuel_type             0
engine_volume         0
distance_travelled    0
cylinders             0
gear_box_type         0
drive_wheels          0
doors                 0
drive_type            0
color                 0
airbags               0
dtype: int64
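
Both checks report zero missing values in every column, so no imputation is needed for this dataset. If a numeric column did contain gaps, a minimal sketch of median imputation might look like the following. The toy frames and their values are invented, and the imputer is fit on the training data only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented example frames with deliberate gaps (not the real data).
toy_train = pd.DataFrame({"levy": [500.0, np.nan, 700.0, 900.0]})
toy_test = pd.DataFrame({"levy": [np.nan, 650.0]})

imputer = SimpleImputer(strategy="median")
# Learn the median from the training column, then fill its gaps.
toy_train["levy"] = imputer.fit_transform(toy_train[["levy"]]).ravel()
# Fill test gaps with the training median rather than a test-set statistic.
toy_test["levy"] = imputer.transform(toy_test[["levy"]]).ravel()
```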

## **3. Outlier Detection and Handling** ##

Based on EDA, outliers in features like **price**, **cylinders**, and **doors** should be handled carefully. We’ll derive the outlier rules from `train_df` only and carry the same logic over to `test_df`, so that nothing learned from the test set influences the thresholds.
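
As a rough illustration of that idea, the sketch below computes IQR bounds from a training column and reuses them on a test column. The helper `iqr_bounds`, the multiplier `k=1.5`, and the toy prices are assumptions made for this example, not the notebook's actual thresholds.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple:
    """Return (lower, upper) bounds at k * IQR beyond the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Invented prices; the last training value is an obvious outlier.
toy_train = pd.DataFrame(
    {"price": [10000, 12000, 11000, 13000, 12500, 11500, 250000]}
)
toy_test = pd.DataFrame({"price": [11800, 300000]})

# Bounds come from the training data only.
lower, upper = iqr_bounds(toy_train["price"])

# Drop training rows that fall outside the bounds.
train_clean = toy_train[toy_train["price"].between(lower, upper)]

# Reuse the same bounds to flag test rows; no test statistics are computed.
test_in_range = toy_test["price"].between(lower, upper)
```

Whether flagged test rows are dropped, clipped, or kept is a separate design choice. The sketch only shows that the thresholds themselves are never recomputed on the test set.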