<div style=" background: url('https://www.transparenttextures.com/patterns/marble.png'), linear-gradient(135deg, #e0d6c9, #cfc5b6); padding: 35px; text-align: center; color: #6a5d50; position: relative; border-radius: 50px 50px 20px 20px; box-shadow: 0px 10px 20px rgba(0, 0, 0, 0.2), inset 0px 2px 10px rgba(255, 255, 255, 0.4); font-family: 'Poppins', sans-serif; filter: brightness(1.05); overflow: hidden;"> <h1 style="font-size: 36px; font-weight: 700; margin: 0; text-shadow: 1px 1px 4px rgba(0, 0, 0, 0.1);"> 1. PROJECT KICKOFF & DATA INGESTION </h1></div>

🚀 Objective: Predicting Delivery Delays in Brazilian E-Commerce
Welcome to the Olist Delivery Analysis. In e-commerce, on-time delivery is a key driver of customer retention. Our goal is to use the Olist dataset to predict whether an order will be delayed (is_delayed) based on product dimensions, seller and customer location, seasonality, and carrier performance.

In this first section, we will load the relational tables and construct a master dataset.

In [1]:
import pandas as pd
import numpy as np
import warnings
import os
warnings.filterwarnings('ignore')

base_path = './data/' 

try:
    df_orders = pd.read_csv(os.path.join(base_path, 'olist_orders_dataset.csv'))
    df_items = pd.read_csv(os.path.join(base_path, 'olist_order_items_dataset.csv'))
    df_products = pd.read_csv(os.path.join(base_path, 'olist_products_dataset.csv'))
    df_sellers = pd.read_csv(os.path.join(base_path, 'olist_sellers_dataset.csv'))
    df_customers = pd.read_csv(os.path.join(base_path, 'olist_customers_dataset.csv'))
    df_geolocation = pd.read_csv(os.path.join(base_path, 'olist_geolocation_dataset.csv'))
    
    print("✅ Datasets loaded successfully using relative paths.")

except FileNotFoundError as e:
    # Report which file is missing instead of discarding the exception
    print(f"❌ Error: Could not find a file: {e}. Check that the CSV files are in {base_path}")

✅ Datasets loaded successfully using relative paths.


# 1.1 🔗 Data Merging & Denormalization
The Olist data is normalized across several relational tables. To build a predictive model, we join them into a single table with one row per order item, linking Orders $\to$ Items $\to$ Products and then attaching Customers and Sellers.

In [2]:
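# Merge relevant datasets.
# Left joins keep the result anchored on df_orders; an outer join would also add
# customers or sellers that match no order (rows the dropna below would discard anyway).
df = df_orders.merge(df_items, on='order_id', how='left')
df = df.merge(df_products, on='product_id', how='left')
df = df.merge(df_customers, on='customer_id', how='left')
df = df.merge(df_sellers, on='seller_id', how='left')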

# Drop rows where delay cannot be calculated (essential dates are missing)
df = df.dropna(subset=['order_delivered_customer_date', 'order_estimated_delivery_date'])

print(f"Merged dataset shape: {df.shape}")

Merged dataset shape: (110196, 29)
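One way to guard joins like these against accidental row duplication is the `validate` argument of pandas' `merge`, which raises an error when the key relationship is not the one you expect. The sketch below uses small made-up frames (`orders`, `customers`, `bad_customers`), not the Olist tables.

```python
import pandas as pd

# Toy stand-ins for the real tables (hypothetical data)
orders = pd.DataFrame({"order_id": ["o1", "o2", "o3"],
                       "customer_id": ["c1", "c2", "c1"]})
customers = pd.DataFrame({"customer_id": ["c1", "c2"],
                          "customer_state": ["SP", "RJ"]})

# many_to_one: each order matches at most one customer row,
# so the left join cannot add rows
merged = orders.merge(customers, on="customer_id", how="left",
                      validate="many_to_one")

# A duplicated key on the right side violates many_to_one and raises MergeError
bad_customers = pd.concat([customers, customers.iloc[[0]]], ignore_index=True)
try:
    orders.merge(bad_customers, on="customer_id", how="left",
                 validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
```

Adding `validate=` to each merge turns a silent change in row count into an immediate error.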



<div style=" background: url('https://www.transparenttextures.com/patterns/marble.png'), linear-gradient(135deg, #e0d6c9, #cfc5b6); padding: 35px; text-align: center; color: #6a5d50; position: relative; border-radius: 50px 50px 20px 20px; box-shadow: 0px 10px 20px rgba(0, 0, 0, 0.2), inset 0px 2px 10px rgba(255, 255, 255, 0.4); font-family: 'Poppins', sans-serif; filter: brightness(1.05); overflow: hidden;"> <h1 style="font-size: 36px; font-weight: 700; margin: 0; text-shadow: 1px 1px 4px rgba(0, 0, 0, 0.1);"> 2. FEATURE ENGINEERING </h1></div>





Raw data rarely tells the whole story. We now derive features such as product volume, compute the delay in days, and classify goods by price point.

# 2.1 📦 Product & Order Metrics
Here we calculate the physical volume of the package (which affects logistics) and aggregate price metrics per order.

In [3]:
# (1) Calculate Product Volume
df['product_volume_cm3'] = df['product_height_cm'] * df['product_length_cm'] * df['product_width_cm']

# (2) Create number of products per order
df['no_of_products_in_order'] = df.groupby('order_id')['order_item_id'].transform('count')

# (3) Calculate price-based features
df['total_price_per_order'] = df.groupby('order_id')['price'].transform('sum')
df['avg_price_per_order'] = df.groupby('order_id')['price'].transform('mean')
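To see why `transform` is used here rather than a plain aggregation, the toy example below (hypothetical `items` frame) shows that it returns one value per original row, so each item row carries its order-level count, total, and average.

```python
import pandas as pd

# Hypothetical order items: order "a" has two items, order "b" has one
items = pd.DataFrame({
    "order_id": ["a", "a", "b"],
    "order_item_id": [1, 2, 1],
    "price": [10.0, 30.0, 5.0],
})

# transform broadcasts each per-order aggregate back onto every row of that order
items["no_of_products_in_order"] = items.groupby("order_id")["order_item_id"].transform("count")
items["total_price_per_order"] = items.groupby("order_id")["price"].transform("sum")
items["avg_price_per_order"] = items.groupby("order_id")["price"].transform("mean")
```

Both rows of order "a" end up with a count of 2, a total of 40.0, and an average of 20.0.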

# 2.2 ⏱️ Target Variable Definition
We need to define exactly what counts as a delay. We compare the actual delivery timestamp against the estimated delivery date and flag an order as delayed when the difference is at least one full day.

In [4]:
# (4) Create is_delayed and delay_in_days
df['order_delivered_customer_date'] = pd.to_datetime(df['order_delivered_customer_date'])
df['order_estimated_delivery_date'] = pd.to_datetime(df['order_estimated_delivery_date'])

df['delay_in_days'] = (df['order_delivered_customer_date'] - df['order_estimated_delivery_date']).dt.days

# Binary target variable: True if delayed, False otherwise (cast to int before modeling)
df['is_delayed'] = df['delay_in_days'] > 0
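A detail worth checking in this definition is that `.dt.days` floors the time difference, so a delivery that arrives a few hours past the estimate still has `delay_in_days == 0`. The sketch below illustrates this with three made-up timestamps; the estimated date is assumed to be stored at midnight.

```python
import pandas as pd

# Hypothetical orders, all with an estimated date of 2018-01-10 (midnight)
estimated = pd.to_datetime(["2018-01-10", "2018-01-10", "2018-01-10"])
delivered = pd.to_datetime([
    "2018-01-09 15:00:00",  # 9 hours early
    "2018-01-10 14:00:00",  # 14 hours past the estimate
    "2018-01-11 09:00:00",  # 33 hours past the estimate
])

delta = pd.Series(delivered - estimated)
delay_in_days = delta.dt.days      # whole-day component, floored toward negative infinity
is_delayed = delay_in_days > 0
```

Here only the third order is flagged: the second one is 14 hours late but still has a whole-day delay of 0.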

# 2.3 🏷️ Behavioral & Product Classification
We categorize products into "Convenience," "Shopping," or "Specialty" goods based on price, and extract timing information to see if the approval process impacts delivery.

In [5]:
# (5) Create good type based on price
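def classify_product(x):
    # Price bands: <= 100 convenience, (100, 500] shopping, > 500 specialty
    if x <= 100:
        return "Convenience Good"
    elif x <= 500:
        return "Shopping Good"
    else:
        # Note: a missing (NaN) price also lands here, because NaN comparisons are False
        return "Specialty Good"
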
df['good_type'] = df['price'].apply(classify_product)

# (6) Create purchase_day_of_week and order_approval_time_hr
df['order_purchase_timestamp'] = pd.to_datetime(df['order_purchase_timestamp'],format='%Y-%m-%d %H:%M:%S')
df['order_approved_at'] = pd.to_datetime(df['order_approved_at'], format='%Y-%m-%d %H:%M:%S')

df["purchase_day_of_week"] = df["order_purchase_timestamp"].dt.dayofweek
df["order_approval_time_hr"] = (df["order_approved_at"] - df["order_purchase_timestamp"]).dt.total_seconds() / 3600
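The price bands used for good_type can also be expressed with `pd.cut`, which is vectorized and leaves missing prices as missing instead of assigning them to a band. This is a sketch on made-up prices, not a drop-in replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, including the band edges and a missing value
prices = pd.Series([50.0, 100.0, 100.01, 500.0, 750.0, np.nan])

good_type = pd.cut(
    prices,
    bins=[-np.inf, 100, 500, np.inf],   # right-closed: (-inf, 100], (100, 500], (500, inf)
    labels=["Convenience Good", "Shopping Good", "Specialty Good"],
)
```

The edges 100 and 500 fall into the lower band, matching the `<=` comparisons in `classify_product`, while the NaN price stays NaN.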

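# 2.4 📊 Initial Data Check
Let's pause to examine the distribution of our target variable (is_delayed) and the number of items per order. Both tables count rows (one row per order item), so an order with several items is counted once per item.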

In [6]:
print("--- Delay Distribution (Normalized) ---")
print(df["is_delayed"].value_counts(normalize=True))

print("\n--- Products per Order Count ---")
print(df["no_of_products_in_order"].value_counts(normalize=False))

--- Delay Distribution (Normalized) ---
is_delayed
False    0.934072
True     0.065928
Name: proportion, dtype: float64

--- Products per Order Count ---
no_of_products_in_order
1     86840
2     14786
3      3918
4      1980
6      1146
5       965
7       154
10       80
8        64
12       60
11       44
20       40
15       30
14       28
9        27
21       21
13       13
Name: count, dtype: int64


<div style="background: url('https://www.transparenttextures.com/patterns/marble.png'), linear-gradient(135deg, #e0d6c9, #cfc5b6); padding: 25px; color: #6a5d50; position: relative; border-radius: 20px 20px 40px 40px; box-shadow: 0px 8px 16px rgba(0, 0, 0, 0.12), inset 0px 2px 6px rgba(255, 255, 255, 0.4); filter: brightness(1.08); overflow: hidden;"> <h3 style="font-family: 'Poppins', sans-serif; font-size: 24px; font-weight: 600; margin: 0 0 10px 0; text-shadow: 1px 1px 3px rgba(0, 0, 0, 0.08);"> Summary / Key Insight </h3> <p style="font-family: 'Poppins', sans-serif; font-size: 18px; font-weight: 400; margin: 0;"> The class distribution shows that delays are the minority class: about 6.6% of rows are delayed and 93.4% arrive on time. This means we will need to handle <strong>class imbalance</strong> during the modeling phase (e.g., using SMOTE, random oversampling, or class weights). </p></div>

# 2.5 📅 Seasonality: The "Holiday" Factor
Logistics networks often clog during high-volume periods like Christmas or Black Friday. We define a custom function to flag orders placed in the weeks around major Brazilian retail dates: Christmas, Black Friday, Mother's Day, Father's Day, and Children's Day.

In [7]:
# (7) Create is_holiday_season
# Convert order_purchase_timestamp to datetime format
df['order_purchase_timestamp'] = pd.to_datetime(df['order_purchase_timestamp'],format='%Y-%m-%d %H:%M:%S')

# Extract month and day separately
df['order_purchase_month'] = df['order_purchase_timestamp'].dt.month
df['order_purchase_day'] = df['order_purchase_timestamp'].dt.day

# Define holiday season intervals
holiday_ranges = [
    ((12, 1), (12, 31)),   # Christmas (Dec 1 - Dec 31)
    ((11, 24), (12, 4)),   # Black Friday (Nov 24 - Dec 4)
    ((4, 25), (5, 14)),    # Mother's Day (Apr 25 - May 14)
    ((8, 1), (8, 15)),     # Father's Day (Aug 1 - Aug 15)
    ((10, 1), (10, 12))    # Children's Day (Oct 1 - Oct 12)
]

# Function to check if a date falls within any holiday range
def is_holiday(month, day):
    for (start_m, start_d), (end_m, end_d) in holiday_ranges:
        if start_m == end_m:  # Range within a single month
            if start_m == month and start_d <= day <= end_d:
                return 1
        else:  # Range spans two adjacent months (months in between are not handled)
            if (month == start_m and day >= start_d) or (month == end_m and day <= end_d):
                return 1
    return 0

# Apply function row by row to create the binary is_holiday_season column
df['is_holiday_season'] = df.apply(lambda x: is_holiday(x['order_purchase_month'], x['order_purchase_day']), axis=1)

# Drop temporary columns for cleanliness
df = df.drop(columns=['order_purchase_month', 'order_purchase_day'])

# Display result
print(df[['order_purchase_timestamp', 'is_holiday_season']].head(20))

   order_purchase_timestamp  is_holiday_season
0       2017-09-26 22:17:05                  0
1       2017-10-18 08:16:34                  0
2       2017-10-12 13:33:22                  1
3       2017-09-03 08:06:30                  0
4       2017-09-03 08:06:30                  0
5       2017-10-22 16:39:09                  0
6       2017-10-09 08:16:17                  1
7       2017-08-11 14:16:43                  1
8       2017-09-10 20:19:02                  0
9       2018-07-12 21:38:26                  0
10      2017-07-05 13:27:40                  0
11      2017-12-27 23:09:37                  1
12      2017-05-05 22:12:04                  1
13      2017-12-06 16:52:25                  1
14      2017-12-06 16:52:25                  1
15      2017-03-17 14:22:50                  0
16      2017-03-17 14:22:50                  0
17      2017-10-07 11:52:52                  1
18      2017-10-08 18:04:57                  1
19      2017-10-08 18:04:57                  1
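The row-wise `apply` above can be slow on roughly 110,000 rows. A vectorized alternative, sketched below, encodes each date as month * 100 + day and tests it against each range. The helper name `holiday_flag` is hypothetical, and the approach assumes no range wraps past December 31, which holds for the five ranges defined above.

```python
import pandas as pd

# Same (month, day) ranges as in the cell above
holiday_ranges = [
    ((12, 1), (12, 31)),
    ((11, 24), (12, 4)),
    ((4, 25), (5, 14)),
    ((8, 1), (8, 15)),
    ((10, 1), (10, 12)),
]

def holiday_flag(timestamps: pd.Series) -> pd.Series:
    """Vectorized holiday flag; assumes no range wraps past Dec 31."""
    mmdd = timestamps.dt.month * 100 + timestamps.dt.day   # e.g. Oct 12 -> 1012
    flag = pd.Series(False, index=timestamps.index)
    for (start_m, start_d), (end_m, end_d) in holiday_ranges:
        # between() is inclusive on both ends
        flag |= mmdd.between(start_m * 100 + start_d, end_m * 100 + end_d)
    return flag.astype(int)

# Four purchase timestamps taken from the output above
sample = pd.Series(pd.to_datetime([
    "2017-09-26 22:17:05", "2017-10-12 13:33:22",
    "2017-08-11 14:16:43", "2017-12-06 16:52:25",
]))
flags = holiday_flag(sample)
```

For these four timestamps the flags agree with the is_holiday_season values printed above (0, 1, 1, 1).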


<div style=" background: url('https://www.transparenttextures.com/patterns/marble.png'), linear-gradient(135deg, #e0d6c9, #cfc5b6); padding: 35px; text-align: center; color: #6a5d50; position: relative; border-radius: 50px 50px 20px 20px; box-shadow: 0px 10px 20px rgba(0, 0, 0, 0.2), inset 0px 2px 10px rgba(255, 255, 255, 0.4); font-family: 'Poppins', sans-serif; filter: brightness(1.05); overflow: hidden;"> <h1 style="font-size: 36px; font-weight: 700; margin: 0; text-shadow: 1px 1px 4px rgba(0, 0, 0, 0.1);"> 3. GEOSPATIAL ANALYSIS </h1></div>

Distance is the enemy of speed. To calculate the physical distance between sellers and customers, we attach latitude/longitude from the geolocation table, joined on zip code prefix.

# 3.1 🗺️ Merging Coordinates

In [8]:
# (8.1) Add geolocation information to df
# Convert all zip code columns to the same nullable integer type so the join keys match
df['customer_zip_code_prefix'] = df['customer_zip_code_prefix'].astype('Int64')
df['seller_zip_code_prefix'] = df['seller_zip_code_prefix'].astype('Int64')
df_geolocation['geolocation_zip_code_prefix'] = df_geolocation['geolocation_zip_code_prefix'].astype('Int64')

# Keep one coordinate pair per zip prefix (the first occurrence) so the merges below cannot duplicate rows
df_geolocation = df_geolocation.drop_duplicates(subset=['geolocation_zip_code_prefix'])

# Merge geolocation data for sellers
df = df.merge(
    df_geolocation.rename(
        columns={'geolocation_lat': 'geolocation_lat_seller', 'geolocation_lng': 'geolocation_lng_seller'}
    ),
    left_on='seller_zip_code_prefix',
    right_on='geolocation_zip_code_prefix',
    how='left'
).drop(columns=['geolocation_zip_code_prefix'])

# Merge geolocation data for customers
df = df.merge(
    df_geolocation.rename(
        columns={'geolocation_lat': 'geolocation_lat_customer','geolocation_lng': 'geolocation_lng_customer'}
    ),
    left_on='customer_zip_code_prefix',
    right_on='geolocation_zip_code_prefix',
    how='left'
).drop(columns=['geolocation_zip_code_prefix'])

print(f"Final number of rows after Geo Merge: {len(df)}")

Final number of rows after Geo Merge: 110196
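Keeping the first coordinate per zip prefix picks an arbitrary point within that prefix. One alternative, sketched here on made-up rows, is to average the coordinates per prefix so that each prefix maps to a rough centroid. The frame `geo` and the name `geo_centroids` are illustrative only.

```python
import pandas as pd

# Hypothetical geolocation rows: prefix 1037 appears twice with different coordinates
geo = pd.DataFrame({
    "geolocation_zip_code_prefix": [1037, 1037, 1046],
    "geolocation_lat": [-23.54, -23.56, -23.55],
    "geolocation_lng": [-46.64, -46.62, -46.64],
})

# One row per prefix, holding the mean latitude and longitude
geo_centroids = (
    geo.groupby("geolocation_zip_code_prefix", as_index=False)[
        ["geolocation_lat", "geolocation_lng"]
    ].mean()
)
```

Like `drop_duplicates`, this leaves exactly one row per prefix, so the later left merges still preserve the row count.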


# 3.2 📐 Haversine Distance Calculation
We use the Haversine formula to calculate the great-circle ("as-the-crow-flies") distance between the seller and the customer, treating the Earth as a sphere of radius 6,371 km.

In [9]:
# (8.2) Create seller_customer_dist in km
import math

# Function to calculate the Haversine distance
def haversine(lat1, lon1, lat2, lon2):
    R = 6371.0 # Radius of Earth in kilometers
    lat1_rad = math.radians(lat1)
    lon1_rad = math.radians(lon1)
    lat2_rad = math.radians(lat2)
    lon2_rad = math.radians(lon2)
    
    dlat = lat2_rad - lat1_rad
    dlon = lon2_rad - lon1_rad
    
    a = math.sin(dlat / 2)**2 + math.cos(lat1_rad) * math.cos(lat2_rad) * math.sin(dlon / 2)**2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    
    distance = R * c
    return distance

# Now calculate the distance between seller and customer based on their latitude and longitude
df['seller_customer_dist_km'] = df.apply(
    lambda row: haversine(
        row['geolocation_lat_seller'], row['geolocation_lng_seller'],
        row['geolocation_lat_customer'], row['geolocation_lng_customer']
    ), axis=1
)

# Print the relevant columns to check the result
print(df[['geolocation_lat_seller', 'geolocation_lat_customer','seller_zip_code_prefix', 
          'customer_zip_code_prefix', 'seller_customer_dist_km']].head())

   geolocation_lat_seller  geolocation_lat_customer  seller_zip_code_prefix  \
0              -23.644439                -23.757179                    9080   
1              -23.644439                -20.740762                    9080   
2              -23.644439                -17.211338                    9080   
3              -20.297537                -22.826646                   29156   
4              -20.297537                -22.826646                   29156   

   customer_zip_code_prefix  seller_customer_dist_km  
0                     87502               687.431565  
1                     35490               411.866870  
2                     38600               716.237761  
3                     24710               388.981787  
4                     24710               388.981787  
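The same formula can be evaluated on whole columns at once with NumPy, which avoids the row-wise `apply`. The sketch below uses the equivalent form $d = 2R \arcsin\sqrt{a}$, where $a$ is the same quantity as in the function above. The name `haversine_km` and the test coordinates are illustrative.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km for scalars or equal-length arrays of degrees."""
    lat1, lon1, lat2, lon2 = (np.radians(np.asarray(v, dtype=float))
                              for v in (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Three point pairs: one degree of longitude on the equator,
# one degree of latitude, and a pair with a missing coordinate
dist = haversine_km([0.0, 0.0, np.nan], [0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0], [1.0, 0.0, 0.0])
```

One degree along the equator or along a meridian comes out at about 111.19 km, and a missing coordinate propagates as NaN, so the later imputation step still applies. On the real data the call would take the four coordinate columns, e.g. `df['geolocation_lat_seller'].to_numpy()`.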


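# 3.3 🚚 Delivery Speed (km/h)
Finally, we calculate the average speed of the shipment: the straight-line distance divided by the time between carrier handover and customer delivery. This acts as a proxy for carrier efficiency. Note that both delivery_duration_hr and delivery_speed_kmph depend on the actual delivery date, so they are only known once the order has arrived; a model meant to predict delays before delivery could not use them as inputs.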

In [10]:
# (9) Create delivery_speed in km/h
# Ensure timestamps are in datetime format
df['order_delivered_carrier_date'] = pd.to_datetime(df['order_delivered_carrier_date'])

# Calculate delivery time in hours
df['delivery_duration_hr'] = (
    df['order_delivered_customer_date'] - df['order_delivered_carrier_date']
).dt.total_seconds() / 3600

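# Keep only positive durations to avoid division errors; this also drops rows
# with a missing carrier date (NaN > 0 is False). copy() avoids SettingWithCopyWarning below.
df = df[df['delivery_duration_hr'] > 0].copy()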

# Calculate delivery speed in km/h
df['delivery_speed_kmph'] = df['seller_customer_dist_km'] / df['delivery_duration_hr']

print(df[['seller_customer_dist_km', 'delivery_duration_hr','delivery_speed_kmph']].head())

   seller_customer_dist_km  delivery_duration_hr  delivery_speed_kmph
0               687.431565            192.328889             3.574250
1               411.866870            170.284444             2.418699
2               716.237761            172.583889             4.150085
3               388.981787             90.396944             4.303041
4               388.981787             90.396944             4.303041


<div style=" background: url('https://www.transparenttextures.com/patterns/marble.png'), linear-gradient(135deg, #e0d6c9, #cfc5b6); padding: 35px; text-align: center; color: #6a5d50; position: relative; border-radius: 50px 50px 20px 20px; box-shadow: 0px 10px 20px rgba(0, 0, 0, 0.2), inset 0px 2px 10px rgba(255, 255, 255, 0.4); font-family: 'Poppins', sans-serif; filter: brightness(1.05); overflow: hidden;"> <h1 style="font-size: 36px; font-weight: 700; margin: 0; text-shadow: 1px 1px 4px rgba(0, 0, 0, 0.1);"> 4. PREPROCESSING & IMPUTATION </h1></div>

Before modeling, we must handle missing values. We use medians for skewed physical measurements (weight and volume), the mean for approval time, and per-state medians for missing distances and delivery speeds.

In [11]:
columns_to_keep = [
    'is_delayed', 'product_weight_g', 'freight_value', 'product_volume_cm3',
    'no_of_products_in_order', 'order_delivered_customer_date',
    'total_price_per_order', 'avg_price_per_order', 'delay_in_days',
    'good_type', 'purchase_day_of_week', 'customer_state',
    'order_delivered_carrier_date', 'order_approval_time_hr', 
    'is_holiday_season', 'seller_customer_dist_km',
    'delivery_duration_hr', 'delivery_speed_kmph'
]
df_model = df[columns_to_keep].copy()

# === IMPUTATION STRATEGY ===
# 1. Impute missing weight and volume with median (robust for skewed data)
# Assign the result back: fillna(inplace=True) on a selected column is chained assignment
df_model['product_weight_g'] = df_model['product_weight_g'].fillna(df_model['product_weight_g'].median())
df_model['product_volume_cm3'] = df_model['product_volume_cm3'].fillna(df_model['product_volume_cm3'].median())

# 2. Impute missing order_approval_time_hr with mean
df_model['order_approval_time_hr'] = df_model['order_approval_time_hr'].fillna(df_model['order_approval_time_hr'].mean())

# 3. Impute missing seller_customer_dist_km using median per customer_state
df_model['seller_customer_dist_km'] = df_model.groupby('customer_state')['seller_customer_dist_km'].transform(
    lambda x: x.fillna(x.median())
)

# 4. Impute remaining missing delivery_speed_kmph using median per customer_state
df_model['delivery_speed_kmph'] = df_model.groupby('customer_state')['delivery_speed_kmph'].transform(
    lambda x: x.fillna(x.median())
)

print("✅ Imputation complete.")

✅ Imputation complete.
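The fill values above are computed on the full dataset, before the train/test split. A stricter variant learns each fill value from the training rows only and reuses it on the test rows. The sketch below shows the idea on two tiny made-up frames (`train`, `test`).

```python
import numpy as np
import pandas as pd

# Hypothetical train and test rows with missing weights
train = pd.DataFrame({"product_weight_g": [100.0, 300.0, np.nan, 10000.0]})
test = pd.DataFrame({"product_weight_g": [np.nan, 250.0]})

# Learn the fill value from the training rows only (median skips NaN)
train_median = train["product_weight_g"].median()

# Apply the same value to both splits
train_filled = train["product_weight_g"].fillna(train_median)
test_filled = test["product_weight_g"].fillna(train_median)
```

The median (300.0) is unaffected by the 10,000 g outlier, which is why it suits skewed columns like weight and volume.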


# 4.1 🔢 Encoding & Resampling
We encode categorical variables and use Random OverSampling (ROS) to balance the training split, ensuring our models don't just predict "No Delay" every time. The test split keeps its natural class ratio.

In [12]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
df_model['good_type_class'] = label_encoder.fit_transform(df_model['good_type'])

# Store the encoded flags as categorical dtype (used by XGBoost via enable_categorical=True; no dummy columns are created)
df_model["good_type_class"] = df_model["good_type_class"].astype("category")
df_model["is_holiday_season"] = df_model["is_holiday_season"].astype("category")

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

features = [
    'product_weight_g', 'freight_value', 'product_volume_cm3', 'no_of_products_in_order',
    'total_price_per_order', 'avg_price_per_order', 'good_type_class', 'purchase_day_of_week',
    'order_approval_time_hr', 'is_holiday_season', 'seller_customer_dist_km', 
    'delivery_duration_hr', 'delivery_speed_kmph'
]

X = df_model[features]
y = df_model['is_delayed'].astype(int)

# Split data 70/30 (no stratify=y, so class ratios may differ slightly between the splits)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Use Random Oversampling to handle class imbalance
ros = RandomOverSampler(sampling_strategy="auto", random_state=557)
X_train_resampled, Y_train_resampled = ros.fit_resample(X_train, y_train)

print(f"Resampled Train Shape: {X_train_resampled.shape}")

Resampled Train Shape: (144114, 13)
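The resampled shape follows from how random oversampling works: minority rows are drawn with replacement until both classes have the same count, so 144,114 rows means 72,057 per class. The sketch below reproduces the idea in plain NumPy on made-up labels. It illustrates the technique and is not imbalanced-learn's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced labels: 9 on-time (0), 3 delayed (1)
y = np.array([0] * 9 + [1] * 3)
X = np.arange(len(y)).reshape(-1, 1)   # one dummy feature per row

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Draw extra minority rows with replacement until both classes have equal counts
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

resampled_idx = np.concatenate([majority_idx, minority_idx, extra_idx])
X_res, y_res = X[resampled_idx], y[resampled_idx]
```

No new information is created: every added row is an exact copy of an existing delayed row, which is why resampling is applied to the training split only.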


<div style=" background: url('https://www.transparenttextures.com/patterns/marble.png'), linear-gradient(135deg, #e0d6c9, #cfc5b6); padding: 35px; text-align: center; color: #6a5d50; position: relative; border-radius: 50px 50px 20px 20px; box-shadow: 0px 10px 20px rgba(0, 0, 0, 0.2), inset 0px 2px 10px rgba(255, 255, 255, 0.4); font-family: 'Poppins', sans-serif; filter: brightness(1.05); overflow: hidden;"> <h1 style="font-size: 36px; font-weight: 700; margin: 0; text-shadow: 1px 1px 4px rgba(0, 0, 0, 0.1);"> 5. MODEL SELECTION & EVALUATION </h1></div>

We will now train a suite of classifiers: Logistic Regression, Decision Tree, Random Forest, and XGBoost. The scikit-learn models use class_weight='balanced' on top of the oversampled training data, and we report ROC-AUC alongside per-class precision and recall to evaluate performance on the minority "Delayed" class.

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, class_weight='balanced'),
    "Decision Tree": DecisionTreeClassifier(random_state=42, class_weight='balanced'),
    "Random Forest": RandomForestClassifier(random_state=42, class_weight='balanced'),
    # "Support Vector Machine": SVC(probability=True, random_state=42, class_weight='balanced'),  # computationally intensive
    # use_label_encoder is deprecated in recent XGBoost releases, so it is omitted here
    "XGBoost": XGBClassifier(eval_metric='logloss', enable_categorical=True)
}

# Training Loop
for name, model in models.items():
    model.fit(X_train_resampled, Y_train_resampled)
    print(f"{name} trained.")

# Evaluation Loop
for name, model in models.items():
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1] if hasattr(model, "predict_proba") else y_pred
    
    auc = roc_auc_score(y_test, y_prob)
    
    print(f"\n=== {name} ===")
    print("ROC AUC Score:", round(auc, 3))
    print(classification_report(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Logistic Regression trained.
Decision Tree trained.
Random Forest trained.
XGBoost trained.

=== Logistic Regression ===
ROC AUC Score: 0.904
              precision    recall  f1-score   support

           0       0.99      0.85      0.91     30814
           1       0.29      0.83      0.43      2227

    accuracy                           0.85     33041
   macro avg       0.64      0.84      0.67     33041
weighted avg       0.94      0.85      0.88     33041

Confusion Matrix:
 [[26196  4618]
 [  370  1857]]

=== Decision Tree ===
ROC AUC Score: 0.794
              precision    recall  f1-score   support

           0       0.97      0.97      0.97     30814
           1       0.63      0.61      0.62      2227

    accuracy                           0.95     33041
   macro avg       0.80      0.79      0.80     33041
weighted avg       0.95      0.95      0.95     33041

Confusion Matrix:
 [[30005   809]
 [  861  1366]]

=== Random Forest ===
ROC AUC Score: 0.953
              pr

<div style="background: url('https://www.transparenttextures.com/patterns/marble.png'), linear-gradient(135deg, #e0d6c9, #cfc5b6); padding: 25px; color: #6a5d50; position: relative; border-radius: 20px 20px 40px 40px; box-shadow: 0px 8px 16px rgba(0, 0, 0, 0.12), inset 0px 2px 6px rgba(255, 255, 255, 0.4); filter: brightness(1.08); overflow: hidden;"> <h3 style="font-family: 'Poppins', sans-serif; font-size: 24px; font-weight: 600; margin: 0 0 10px 0; text-shadow: 1px 1px 3px rgba(0, 0, 0, 0.08);"> Summary / Key Insight </h3> <p style="font-family: 'Poppins', sans-serif; font-size: 18px; font-weight: 400; margin: 0;"> In the reports shown above, Random Forest has the highest ROC-AUC (0.953), ahead of Logistic Regression (0.904) and the single Decision Tree (0.794). Logistic Regression catches most delays (recall 0.83) but at low precision (0.29), while the Decision Tree is more balanced (precision 0.63, recall 0.61). While Logistic Regression is interpretable, tree ensembles such as <strong>Random Forest and XGBoost</strong> can capture the complex, non-linear relationships (like geolocation vs. delivery speed) more effectively. </p></div>
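The precision, recall, and F1 values in a classification report follow directly from the confusion matrix. As a worked check, the snippet below recomputes the "Delayed" class metrics for Logistic Regression from the matrix printed above, where rows are actual classes and columns are predicted classes.

```python
# Logistic Regression confusion matrix from the output above
tn, fp = 26196, 4618   # actual on-time: predicted on-time / predicted delayed
fn, tp = 370, 1857     # actual delayed: predicted on-time / predicted delayed

precision = tp / (tp + fp)   # share of predicted delays that were real
recall = tp / (tp + fn)      # share of real delays that were caught
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))   # 0.29 0.83 0.43
```

The low precision comes from the 4,618 on-time rows flagged as delayed, more than twice the 1,857 true delays caught.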

<div style=" background: url('https://www.transparenttextures.com/patterns/marble.png'), linear-gradient(135deg, #e0d6c9, #cfc5b6); padding: 35px; text-align: center; color: #6a5d50; position: relative; border-radius: 50px 50px 20px 20px; box-shadow: 0px 10px 20px rgba(0, 0, 0, 0.2), inset 0px 2px 10px rgba(255, 255, 255, 0.4); font-family: 'Poppins', sans-serif; filter: brightness(1.05); overflow: hidden;"> <h1 style="font-size: 36px; font-weight: 700; margin: 0; text-shadow: 1px 1px 4px rgba(0, 0, 0, 0.1);"> 6. HYPERPARAMETER OPTIMIZATION </h1></div>

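To squeeze the final performance out of our model, we use RandomizedSearchCV on the Random Forest classifier. It samples 10 candidate combinations of tree count, depth, split and leaf sizes, and feature subsampling, and scores each with 3-fold cross-validation on F1 for the "Delayed" class. Note that the search is fit on the original training split, not the oversampled one.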

In [14]:
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [None, 10, 20, 30, 40],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']  # 'auto' (same as 'sqrt' for classifiers) was removed in scikit-learn 1.3
}

rf = RandomForestClassifier(random_state=42)

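random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_dist,
    n_iter=10,         # number of sampled parameter combinations
    cv=3,              # 3-fold cross-validation
    verbose=2,
    n_jobs=-1,         # use all available cores
    scoring='f1',      # F1 of the positive (delayed) class
    random_state=42    # make the sampled candidates reproducible
)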

random_search.fit(X_train, y_train)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
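To put "10 candidates" in perspective, the sketch below enumerates the full grid implied by the parameter lists and draws 10 combinations at random. It is a conceptual illustration in plain Python, not scikit-learn's internal sampling code, and it assumes the grid without the 'auto' option for max_features.

```python
import itertools
import random

# Parameter lists as in the search above ('auto' left out, see lead-in)
param_dist = {
    "n_estimators": [100, 200, 300, 400, 500],
    "max_depth": [None, 10, 20, 30, 40],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
}

# Every possible combination: 5 * 5 * 3 * 3 * 2 = 450
keys = list(param_dist)
grid = [dict(zip(keys, values))
        for values in itertools.product(*(param_dist[k] for k in keys))]

# Randomized search evaluates only a small random subset of that grid
random.seed(42)
candidates = random.sample(grid, k=10)
```

With 3 folds per candidate, the search runs 30 fits instead of the 1,350 an exhaustive grid search over these lists would need.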


# 6.1 🏆 Final Results

In [15]:
print("Best parameters:", random_search.best_params_)
best_model = random_search.best_estimator_

y_pred = best_model.predict(X_test)
y_prob = best_model.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))

roc_auc = roc_auc_score(y_test, y_prob)
print("ROC AUC Score:", roc_auc)

Best parameters: {'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'log2', 'max_depth': None}
              precision    recall  f1-score   support

           0       0.97      0.99      0.98     30814
           1       0.84      0.63      0.72      2227

    accuracy                           0.97     33041
   macro avg       0.91      0.81      0.85     33041
weighted avg       0.96      0.97      0.96     33041

ROC AUC Score: 0.9545361964215439
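The report above uses the default decision threshold of 0.5 on the predicted probability. Because `predict_proba` is available, recall on delayed orders can be traded against precision by lowering that threshold. The sketch below shows the mechanics on made-up labels and probabilities; the numbers are illustrative and are not taken from the model above.

```python
import numpy as np

# Hypothetical true labels and predicted delay probabilities
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.45, 0.40, 0.60, 0.70, 0.90])

def precision_recall_at(threshold):
    """Precision and recall of the positive class at a given probability cutoff."""
    y_pred = y_prob >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn)
    return precision, recall

p_default, r_default = precision_recall_at(0.5)
p_low, r_low = precision_recall_at(0.3)
```

In this toy case, lowering the cutoff from 0.5 to 0.3 raises recall from 2/3 to 1.0 while precision drops from 2/3 to 0.6. Any threshold chosen this way should be selected on validation data rather than on the test split.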
