# Day 15: ML Feature Engineering & Data Prep

We've successfully built a comprehensive analytics application, and now it's time to add the final, most advanced component. Welcome to **Day 15: ML Feature Engineering & Data Prep**.

## Overview

Today marks our transition from data analytics (describing the past) to machine learning (predicting the future). Our goal is to take the clean dataset we've been using and meticulously transform it into a format that a machine learning model can understand. This process, known as **feature engineering**, is the absolute foundation of a successful predictive model.

We will create a new Jupyter Notebook for this entire machine learning workflow to keep it organized and separate from our earlier exploratory analysis.

## Task 2: Save and Run the Notebook

### Action Steps

1. Create a new Jupyter Notebook file in your `notebooks` folder
2. Name it `05_ml_feature_engineering.ipynb`
3. Copy the feature engineering code shown below into this new notebook
4. Run all the cells from top to bottom

### What to Expect

The script will print out status messages at each step of the process. The final output will be the first five rows of your fully processed training data, `X_train_scaled`. It should look like a table of numbers, where:

- All categorical text has been converted to numeric columns
- All values have been standardized, so they sit close to zero

## Completion

**Congratulations** on completing Day 15! You have successfully navigated the most complex and critical part of the machine learning lifecycle: data preparation and feature engineering.

You now have clean, robust training and testing datasets that are perfectly formatted for our next step. Tomorrow, on **Day 16**, we will use this prepared data to train our very first predictive models.

```python
import os

import joblib
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

print("--- Starting ML Feature Engineering Process ---")

# --- 1. Load the Enhanced Dataset ---
try:
    df = pd.read_csv('../data/ola_data_enhanced.csv')
    df['booking_timestamp'] = pd.to_datetime(df['booking_timestamp'])
    print("✅ Successfully loaded the enhanced dataset.")
except FileNotFoundError:
    print("❌ Error: 'ola_data_enhanced.csv' not found. Please ensure the file exists.")
    raise  # Re-raise so the notebook stops here instead of failing later

# --- 2. Define the Prediction Target (y) ---
# Our goal is to predict if a ride will be Canceled by the Customer.
df['is_cancelled'] = (df['booking_status'] == 'Canceled by Customer').astype(int)
print("\nTarget variable 'is_cancelled' created.")
print(df['is_cancelled'].value_counts(normalize=True))


# --- 3. Select Predictive Features (X) ---
# We select columns that we hypothesize will influence a customer's decision to cancel.
# We exclude identifiers, timestamps, and post-booking info like ratings.
features = [
    'v_tat',
    'c_tat',
    'booking_value',
    'ride_distance',
    'hour_of_day',
    'vehicle_type',
    'payment_method'
]
# .copy() gives X its own data, so the imputation below modifies X
# itself rather than a view of df (avoids SettingWithCopyWarning).
X = df[features].copy()
y = df['is_cancelled']

print(f"\nSelected {len(features)} features for the model.")


# --- 4. Preprocessing Step 1: Handle Missing Values ---
# Most ML models cannot handle missing (NaN) values. We'll use a simple median imputation strategy.
# Note: the median and mode are computed on the full dataset here; a stricter
# pipeline would compute them on the training split only.
for col in ['v_tat', 'c_tat']:
    X[col] = X[col].fillna(X[col].median())

# For categorical columns, we'll fill with the most frequent value (mode).
X['payment_method'] = X['payment_method'].fillna(X['payment_method'].mode()[0])
print("\n✅ Handled missing values using median and mode imputation.")


# --- 5. Preprocessing Step 2: One-Hot Encode Categorical Features ---
# Models only understand numbers, so we convert text categories into numeric format.
# drop_first=True drops the first category of each column; that category is
# represented by zeros in all of the remaining dummy columns.
categorical_cols = ['vehicle_type', 'payment_method']
X = pd.get_dummies(X, columns=categorical_cols, drop_first=True)
print("✅ Converted categorical features to numeric using one-hot encoding.")
print("New shape of X:", X.shape)


# --- 6. Preprocessing Step 3: Train-Test Split ---
# We split the data BEFORE scaling to prevent data leakage from the test set.
# This is a critical step for an unbiased evaluation of the model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 20% of the data will be used for testing
    random_state=42,  # Ensures the split is the same every time we run the script
    stratify=y        # Ensures the proportion of cancellations is the same in both sets
)
print("\n✅ Split data into training (80%) and testing (20%) sets.")


# --- 7. Preprocessing Step 4: Feature Scaling ---
# We scale the features so that variables with large values (like booking_value)
# don't unfairly dominate variables with small values (like hour_of_day).
# The scaler is fit on the training set only and is applied to every column,
# including the one-hot encoded ones.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert scaled arrays back to DataFrames for easier inspection
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)
print("✅ Applied standard scaling to numeric features.")

print("\n--- Feature Engineering Complete ---")
print("Data is now ready for model training.")

# Display the first few rows of the final prepared training data
print("\n--- Head of Processed Training Data (X_train_scaled) ---")
print(X_train_scaled.head())

# --- 8. Save the Processed Data for Model Training ---
# We save the processed datasets to be used in the next notebook.
output_dir = '../data/processed/'
os.makedirs(output_dir, exist_ok=True)

X_train_scaled.to_csv(os.path.join(output_dir, 'X_train_scaled.csv'), index=False)
X_test_scaled.to_csv(os.path.join(output_dir, 'X_test_scaled.csv'), index=False)
y_train.to_csv(os.path.join(output_dir, 'y_train.csv'), index=False)
y_test.to_csv(os.path.join(output_dir, 'y_test.csv'), index=False)

print("\n✅ All processed data files have been saved to the 'data/processed/' directory.")

# Save the fitted scaler so the same transformation can be reused later
model_dir = '../models/'
os.makedirs(model_dir, exist_ok=True)
joblib.dump(scaler, os.path.join(model_dir, 'scaler.joblib'))

print("✅ The fitted scaler has been saved to the 'models/' directory.")
```

```
--- Starting ML Feature Engineering Process ---
✅ Successfully loaded the enhanced dataset.

Target variable 'is_cancelled' created.
is_cancelled
0    0.898092
1    0.101908
Name: proportion, dtype: float64

Selected 7 features for the model.

✅ Handled missing values using median and mode imputation.
✅ Converted categorical features to numeric using one-hot encoding.
New shape of X: (103024, 14)

✅ Split data into training (80%) and testing (20%) sets.
✅ Applied standard scaling to numeric features.

--- Feature Engineering Complete ---
Data is now ready for model training.

--- Head of Processed Training Data (X_train_scaled) ---
      v_tat     c_tat  booking_value  ride_distance  hour_of_day  \
0  0.301422  0.003208      -0.516226      -0.011883    -0.507131   
1 -0.028459  0.003208      -0.151763      -0.898328     1.084996   
2 -0.248380  1.590529       0.928609       0.747926    -1.665042   
3 -1.677864 -0.878637      -0.640813      -0.391788     0.795518   
4  0.631303 -1.76048
```

# Detailed Analysis of Feature Engineering Output

The feature engineering script has run, and its output is the result of transforming our clean, human-readable data into a fully numeric format that a machine learning model can work with. This is a critical, multi-step process, so let's break down exactly what we did and what each part of the output means.

**This is the story of how we prepared our data for a predictive model**.[1]

## Target Variable Creation & Class Imbalance

```
Target variable 'is_cancelled' created.
is_cancelled
0    0.898092  (Not Cancelled)
1    0.101908  (Cancelled by Customer)
Name: proportion, dtype: float64
```

### What We Did
We created our **"target"** variable—the thing we want to predict. The `is_cancelled` column now contains a 1 if the ride was cancelled by the customer and a 0 otherwise.[1]

### What This Means
The output shows that only about 10.2% of the bookings in our dataset were cancelled by customers. This is a classic **imbalanced dataset**. This is a very important finding because it tells us that a lazy model could achieve ~90% accuracy by simply guessing "not cancelled" every time. That's why we will need to use more advanced metrics (like the F1-score) to evaluate our model's true performance tomorrow.[5]
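To make the accuracy trap concrete, here is a minimal sketch using a made-up label vector with roughly the same 90/10 split as our data. The 1,000 rows are an assumption for illustration, not the real dataset. It scores a "model" that predicts "not cancelled" for every ride.

```python
# Hypothetical labels: 900 non-cancelled (0) and 100 cancelled (1) rides.
y_true = [0] * 900 + [1] * 100

# A "lazy" model that predicts "not cancelled" for every ride.
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Precision, recall, and F1 for the positive class (cancelled = 1).
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = (
    2 * precision * recall / (precision + recall)
    if (precision + recall)
    else 0.0
)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}, f1={f1:.2f}")
# accuracy=0.90, recall=0.00, f1=0.00
```

The lazy model looks strong on accuracy but never finds a single cancellation, which is exactly what recall and F1 expose.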

## Feature Selection

```
Selected 7 features for the model.
```

### What We Did
We chose 7 columns from our original dataset that we believe have the power to predict a cancellation before it happens.

### What This Means
We specifically excluded columns like `booking_id` (just an identifier), `customer_rating` (this happens after the ride, so we can't use it to predict something beforehand), and `cancellation_reason` (this is a result of the cancellation, not a predictor of it).
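The same exclusion logic can be sketched on a toy table. The rows below are invented for illustration; only the column names come from our dataset.

```python
import pandas as pd

# Toy bookings table; the values are invented for illustration.
bookings = pd.DataFrame({
    "booking_id": ["B1", "B2", "B3"],
    "ride_distance": [4.2, 11.0, 7.5],
    "hour_of_day": [8, 19, 23],
    "customer_rating": [4.5, None, 3.0],                 # only known after the ride
    "cancellation_reason": [None, "Driver late", None],  # a consequence of the target
})

# Columns that merely identify a row, or that are only known after the outcome.
leaky_or_id_cols = ["booking_id", "customer_rating", "cancellation_reason"]

X_toy = bookings.drop(columns=leaky_or_id_cols)
print(list(X_toy.columns))
# ['ride_distance', 'hour_of_day']
```

A useful check for any candidate feature is to ask whether its value would be known at the moment the prediction has to be made.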

## Handling Missing Values

```
✅ Handled missing values using median and mode imputation.
```

### What We Did
Most machine learning algorithms cannot work with empty (NaN) cells. We filled the missing numerical values (`v_tat`, `c_tat`) with their respective median values and the categorical `payment_method` with its most common value.[2]

### What This Means
Our dataset is now complete, with no missing information, making it ready for the next steps.
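Here is a small sketch of median and mode imputation on a toy frame. The four rows are made up; the column names match ours.

```python
import numpy as np
import pandas as pd

# Toy data with one missing number and one missing category.
toy = pd.DataFrame({
    "v_tat": [10.0, np.nan, 30.0, 50.0],
    "payment_method": ["Cash", "UPI", None, "Cash"],
})

# Numeric column: fill with the median of the observed values (10, 30, 50 -> 30).
toy["v_tat"] = toy["v_tat"].fillna(toy["v_tat"].median())

# Categorical column: fill with the most frequent value ("Cash").
toy["payment_method"] = toy["payment_method"].fillna(toy["payment_method"].mode()[0])

print(toy)
print("Missing cells left:", int(toy.isna().sum().sum()))
```

The median is preferred over the mean here because it is not pulled around by a few extreme values.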

## One-Hot Encoding

```
✅ Converted categorical features to numeric using one-hot encoding.
New shape of X: (103024, 14)
```

### What We Did
This is one of the most significant transformations. Models don't understand text like "Prime Sedan" or "Cash". **One-hot encoding** converts these text columns into new numeric columns.[3][1][5]

### What This Means
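Our feature set X expanded from 7 columns to 14. For example, the single `vehicle_type` column was replaced by several new columns like `vehicle_type_Bike`, `vehicle_type_Mini`, etc. If a ride was a "Bike", it will have a 1 in the `vehicle_type_Bike` column and a 0 in all the others. Because we passed `drop_first=True`, the first category of each encoded column gets no column of its own; a ride in that category has a 0 in every dummy column for that feature.[1][3]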
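The effect of `drop_first=True` is easiest to see on a tiny example. The three category names below are assumptions for illustration; `dtype=int` is added only so the result prints as 0/1 rather than True/False.

```python
import pandas as pd

# Toy column with three hypothetical vehicle types.
toy = pd.DataFrame({"vehicle_type": ["Auto", "Bike", "Mini", "Bike"]})

encoded = pd.get_dummies(toy, columns=["vehicle_type"], drop_first=True, dtype=int)
print(encoded)
#    vehicle_type_Bike  vehicle_type_Mini
# 0                  0                  0
# 1                  1                  0
# 2                  0                  1
# 3                  1                  0
```

The "Auto" row is all zeros: the dropped category is implied by the absence of every other one, so no information is lost while one redundant column is avoided.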

## Train-Test Split & Scaling

```
✅ Split data into training (80%) and testing (20%) sets.
✅ Applied standard scaling to numeric features.
```

### What We Did
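We divided our data into a training set (which the model learns from) and a test set (which we use for an unbiased evaluation), keeping the cancellation rate the same in both via `stratify=y`. We then fit `StandardScaler` on the training set only and applied the same transformation to the test set, which puts every column on a common scale without leaking test-set statistics into training.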
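As a rough illustration of what stratification buys us, the sketch below splits 100 hypothetical rides, 10% of them cancelled, with the same `test_size`, `random_state`, and `stratify` arguments used in the script. The toy data is an assumption.

```python
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rides, 10% cancelled, one placeholder feature per ride.
y_toy = [0] * 90 + [1] * 10
X_toy = [[i] for i in range(100)]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both splits should keep the original 10% cancellation rate.
print("train:", len(y_tr), "rows, cancel rate", sum(y_tr) / len(y_tr))
print("test: ", len(y_te), "rows, cancel rate", sum(y_te) / len(y_te))
```

Without `stratify`, a random 20-row test set could easily end up with zero or four cancellations, which would make evaluation on the rare class unreliable.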

### What This Means
This ensures that a feature with large numbers (like `booking_value`) doesn't have an unfair influence on the model over a feature with small numbers (like `hour_of_day`).[5]
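What standard scaling computes can be sketched by hand with NumPy. This is an illustrative equivalent on made-up numbers, not the script's code: each column has its training mean subtracted and is divided by its training standard deviation (population form, `ddof=0`, the convention `StandardScaler` uses).

```python
import numpy as np

# Made-up rows with columns [booking_value, hour_of_day].
train = np.array([[100.0, 1.0], [200.0, 2.0], [300.0, 3.0]])
test = np.array([[400.0, 4.0]])

# Statistics come from the training rows only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # ddof=0, i.e. population standard deviation

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training mean and std

print(train_scaled.round(4))
print(test_scaled.round(4))
```

After scaling, both columns take identical values even though `booking_value` was originally 100 times larger, and the test row is measured against the training distribution rather than its own.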

## The Final Processed Data

```
--- Head of Processed Training Data (X_train_scaled) ---
      v_tat     c_tat  booking_value  ...  payment_method_UPI
0  0.301422  0.003208      -0.516226  ...            1.732093
1 -0.028459  0.003208      -0.151763  ...           -0.577336
2 -0.248380  1.590529       0.928609  ...           -0.577336
```

### What This Is
This table is the final product of today's work. It is a completely numeric, scaled, and processed version of your data, ready to be fed into a machine learning algorithm.

### What It Means

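- **No Text**: All categorical information is now represented by numeric columns (e.g., `payment_method_UPI`). Because the scaler was applied to every column, these columns no longer show 0 and 1 but two fixed values instead (here -0.577336 for 0 and 1.732093 for 1)[1]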
- **Scaled Values**: Notice that all the numbers are now small and centered around zero. This is the result of the `StandardScaler`. A positive number means the original value was above the average for that feature, and a negative number means it was below average
- **Ready for Learning**: This numeric representation allows the machine learning models to find mathematical patterns between the features and the `is_cancelled` target we created
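The two values shown for `payment_method_UPI` can be read backwards to estimate how common UPI is in the training set. The sketch below relies only on the standard-scaling formula for a 0/1 column; treating 1.732093 and -0.577336 as the scaled forms of 1 and 0 is an assumption based on the displayed rows.

```python
import math

hi, lo = 1.732093, -0.577336  # scaled values displayed for payment_method_UPI

# For a 0/1 column where a share p of rows equals 1, standard scaling maps
#   1 -> (1 - p) / sqrt(p * (1 - p))   and   0 -> -p / sqrt(p * (1 - p)),
# so hi / -lo = (1 - p) / p, which we can solve for p.
ratio = hi / -lo
p = 1 / (1 + ratio)
print(round(p, 3))

# Sanity check: plugging p back in should reproduce the displayed values.
sd = math.sqrt(p * (1 - p))
print(round((1 - p) / sd, 4), round(-p / sd, 4))
```

Under that assumption, about a quarter of the training rides were paid by UPI, which shows that scaled dummy columns still carry the original 0/1 information in a shifted form.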

