# **Fraud Detection Project - CA2**

## **1. Introduction**

### Project Overview
This project is the second phase (CA2) of a comprehensive fraud detection system for **SP-Buy**, an e-commerce platform. Building on the work done in **CA1**, which involved data ingestion, exploratory data analysis (EDA), and the creation of an interactive dashboard, this phase focuses on developing a machine learning model to predict fraudulent orders. The goal is to deploy this model in a user-friendly application that the operations team can use to proactively identify and mitigate fraud.

### Problem Statement
Fraudulent activities pose a significant threat to e-commerce platforms, leading to financial losses, reputational damage, and customer dissatisfaction. Detecting fraud in real-time is challenging due to the imbalanced nature of the data (fraud cases are rare compared to legitimate transactions) and the evolving tactics of fraudsters. In this project, we aim to build a robust machine learning model that can accurately identify fraudulent orders based on historical data.

### CA1 Recap
In **CA1**, we performed extensive exploratory data analysis (EDA) and data cleaning on the provided datasets:
- **Customer Features**: Information about customers, such as their order history and verification status.
- **Order Features**: Details about each order, including payment method and order value.
- **Labels**: Fraud labels indicating whether an order was fraudulent.

The cleaned and preprocessed data was used to create an interactive dashboard for monitoring fraud trends and patterns. This dashboard provided valuable insights into the dataset, enabling stakeholders to understand the nature of fraud on the platform.

### CA2 Objectives
In **CA2**, we shift our focus to **model development** and **deployment**. The key objectives are:
1. **Advanced Data Analysis**:
   - Perform additional EDA to identify feature importance and detect outliers, which are critical for model creation.
2. **Model Development**:
   - Train, evaluate, and optimize machine learning models to predict fraudulent orders.
3. **Experiment Tracking**:
   - Use **MLflow** to track experiments, log parameters, and compare model performance.
4. **Deployment**:
   - Develop a **Tkinter-based GUI application** to allow the operations team to make predictions on new orders.
5. **Automation**:
   - Design the workflow to be modular and scalable, enabling future integration with **Airflow** for automation.

### Key Challenges
- **Imbalanced Data**: Fraud cases are rare, making it challenging to train a model that can accurately detect them.
- **Feature Engineering**: Identifying and creating meaningful features that improve model performance.
- **Deciding how to proceed with experiments**: Balancing the need for thorough experimentation with time constraints was a key challenge. We addressed this by prioritizing techniques likely to have the most impact (e.g., handling imbalanced data, feature engineering).

### Structure of the Report
This report documents the entire process of **CA2**, from advanced data analysis and model development to deployment and automation. The following sections provide a detailed breakdown of each step:
- **Exploratory Data Analysis (EDA)**: Additional analysis focusing on feature importance and outlier detection.
- **Feature Engineering**: Creation of new features and their impact on model performance.
- **Data Preprocessing**: Techniques used to prepare the data for modeling.
- **Model Training and Evaluation**: Development and comparison of machine learning models.
- **Deployment**: Development of the GUI application and integration of the final model.
- **Conclusion**: Summary of findings, challenges, and future work.


---
## **2. Preparing the Dataset and Libraries**

---

#### Import Necessary Libraries

In [8]:
import pandas as pd
import numpy as np
import os
import json
import time
from urllib.parse import urlparse
import warnings
import itertools
import matplotlib.pyplot as plt
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
from dask.distributed import Client
from ExperimentTracker2 import PhaseOneExperimentTracker

# Data processing and analysis
from pandas_profiling import ProfileReport
from sklearn.model_selection import train_test_split, learning_curve
from sklearn.preprocessing import (
    StandardScaler,
    MinMaxScaler,
    RobustScaler,
    OneHotEncoder,
)
from sklearn.compose import ColumnTransformer
from imblearn.over_sampling import SMOTENC, RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler
from dask_ml.model_selection import train_test_split as dask_train_test_split

# Machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from lightgbm import LGBMClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Model evaluation and pipeline
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    average_precision_score,
)

# MLflow for experiment tracking
import mlflow

# Initialize Dask client for parallel processing
client = Client()

# Show all columns
pd.set_option("display.max_columns", None)
warnings.filterwarnings("ignore")

#### Importing the datasets

In [9]:
# %%time

# ## SQL Server Name and Database Name
# server = 'yj\SQLEXPRESS'
# database = 'PAI_CA1'

# ## Create a connection to the SQL Server
# engine = create_engine('mssql+pyodbc://{}/{}?driver=ODBC Driver 17 for SQL Server'.format(server, database))

# customer_df = pd.read_sql('SELECT * FROM [dbo].[clean-data-customer_v1.0]', engine)
# order_df = pd.read_sql('SELECT * FROM [dbo].[clean-data-order_v1.0]', engine)
# label_df = pd.read_sql('SELECT * FROM [dbo].[clean-data-label_v1.0]', engine)


# # Drop index column
# customer_df.drop(columns=['index'], inplace=True)
# order_df.drop(columns=['index'], inplace=True)
# label_df.drop(columns=['index'], inplace=True)

# # merge the data
# merged_df = pd.merge(label_df, customer_df, on=['customer_id', 'country_code'], how='inner')
# merged_df = pd.merge(merged_df, order_df, on=['order_id', 'country_code'], how='inner')

# merged_df.to_csv('merged_data.csv', index=False)

data = pd.read_csv('./data/merged_data.csv')

merged_df=data.copy()

In [10]:
merged_df.head()

Unnamed: 0,country_code,order_id,customer_id,is_fraud,mobile_verified,num_orders_last_50days,num_cancelled_orders_last_50days,num_refund_orders_last_50days,total_payment_last_50days,num_associated_customers,first_order_datetime,collect_type,payment_method,order_value,num_items_ordered,refund_value,order_date
0,BD,w2lx-myz3,bdpr8uva,0,True,0,0,0,0.0,3,2022-08-13 03:53:52,delivery,PayOnDelivery,8.664062,9,0.870117,2023-04-08
1,BD,ta7z-r91q,bd59rlzo,0,True,7,0,0,228.042468,4,2022-05-08 14:29:19,delivery,CreditCard,21.859375,4,2.279297,2023-02-13
2,BD,t5af-wgb2,bd6zhjvq,0,True,4,1,0,45.674685,2,2021-08-25 07:47:00,delivery,AFbKash,7.125,1,2.349609,2023-03-06
3,BD,sibu-9lm4,bd4fv4rb,0,True,19,0,3,279.805231,5,2021-12-06 13:53:22,delivery,CreditCard,4.535156,5,0.150024,2023-01-29
4,BD,we61-omtr,bdzeepq7,0,True,30,6,4,107.06761,5,2020-07-04 11:45:39,delivery,PayOnDelivery,3.011719,1,3.75,2023-01-16


#### Convert to appropriate dtypes after importing dataset

In [18]:
def convert_dtypes(df):
    # Convert 'order_value' and 'refund_value' to float32 for memory efficiency
    df['order_value'] = df['order_value'].astype('float32')
    df['refund_value'] = df['refund_value'].astype('float32')
    
    # Convert 'num_items_ordered' to uint8 after rounding
    df['num_items_ordered'] = df['num_items_ordered'].astype(float).round().astype('uint8')
    
    # Convert 'order_date' and 'first_order_datetime' to datetime
    df['order_date'] = pd.to_datetime(df['order_date'])
    df['first_order_datetime'] = pd.to_datetime(df['first_order_datetime'])
    
    # Convert categorical columns to category dtype for efficiency
    df[['country_code', 'collect_type', 'payment_method']] = df[['country_code', 'collect_type', 'payment_method']].astype('category')
    
    # Convert numerical columns (those that represent counts or numeric features) to uint16
    df[['num_orders_last_50days', 'num_cancelled_orders_last_50days', 'num_refund_orders_last_50days']] = df[['num_orders_last_50days', 'num_cancelled_orders_last_50days', 'num_refund_orders_last_50days']].astype('uint16')
    
    # Convert 'num_associated_customers' to uint8 for efficient memory usage
    df['num_associated_customers'] = df['num_associated_customers'].astype('uint8')
    
    # Convert 'total_payment_last_50days' to float32 for memory efficiency
    df['total_payment_last_50days'] = df['total_payment_last_50days'].astype('float32')
    
    # Convert 'mobile_verified' and 'is_fraud' columns to boolean (mapping string values)
    # df['mobile_verified'] = df['mobile_verified'].map({'True': True, 'False': False})
    # df['is_fraud'] = df['is_fraud'].map({'1': True, '0': False})
    
    return df

# Memory before
print(f'Memory usage before conversion: {merged_df.memory_usage().sum() / 1e6} MB')
merged_df = convert_dtypes(merged_df)
# Memory after
print(f'Memory usage after conversion: {merged_df.memory_usage().sum() / 1e6} MB')

Memory usage before conversion: 140.349328 MB
Memory usage after conversion: 144.876688 MB


In [12]:
merged_df.dtypes

country_code                              category
order_id                                    object
customer_id                                 object
is_fraud                                     int64
mobile_verified                               bool
num_orders_last_50days                      uint16
num_cancelled_orders_last_50days            uint16
num_refund_orders_last_50days               uint16
total_payment_last_50days                  float16
num_associated_customers                     uint8
first_order_datetime                datetime64[ns]
collect_type                              category
payment_method                            category
order_value                                float16
num_items_ordered                            uint8
refund_value                               float16
order_date                          datetime64[ns]
dtype: object

---

## **3. Exploratory Data Analysis (EDA)**

---


### Overview
In **CA1**, we performed initial exploratory data analysis (EDA) to understand the structure and distribution of the dataset. This included:

- Merging the datasets (customer-features.csv, order-features.csv, labels.csv).

- Cleaning the data (e.g., handling missing values, removing duplicates).

- Visualizing the distribution of fraud vs. non-fraud cases.

In **CA2**, we focus on additional EDA tasks that are essential for model creation:
- Imbalanced Data: Checking how imbalanced the dataset is.

- Feature Importance: Identifying which features have the most impact on predicting fraud.

- Outlier Detection: Detecting and handling outliers that could skew model performance.

In [19]:
# Generate a pandas-profiling report for the merged dataset and export it to HTML
merged_profile = merged_df.profile_report(title='Merged Data Profiling Report', explorative=True)
merged_profile.to_file('merged_data_profiling_report.html')

### **Imbalanced Data**
1. Imbalanced data refers to a situation where the classes in the dataset are not equally represented. According to our pandas profiling report, `Collect Type` is one of the columns with a high imbalance.

<img class="text-center" src="images/collect_type_imbalance.png" width="1000">

2. `Mobile Verified` is also another column that has a high imbalance.

<img class="text-center" src="images/mobile_verified_imbalance.png" width="1000">

3. These columns are severely imbalanced, which can pose a challenge when training machine learning models. We will test different sampling methods during the model experimentation phase.
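
As a minimal sketch of how the degree of imbalance in a column can be quantified, the snippet below computes each category's share and the majority-to-minority ratio. The helper name `imbalance_summary` and the sample values (including the `pickup` category) are assumptions made for illustration; they are not taken from the profiling report.

```python
from collections import Counter


def imbalance_summary(values):
    """Return (share per category, majority-to-minority ratio)."""
    counts = Counter(values)
    total = sum(counts.values())
    shares = {label: count / total for label, count in counts.items()}
    ratio = max(counts.values()) / min(counts.values())
    return shares, ratio


# Hypothetical sample: 95 delivery orders and 5 pickup orders
sample = ["delivery"] * 95 + ["pickup"] * 5
shares, ratio = imbalance_summary(sample)
print(shares)
print(ratio)
```

A ratio far above 1 signals that one category dominates. The same check applies to the `is_fraud` target itself.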

### **Outlier Detection**

1. Outliers can significantly impact the performance of machine learning models, especially those sensitive to outliers (e.g., linear models). We look at some of the histograms of the numerical columns to identify potential outliers. For the sake of keeping the documentation short, we will not look at all the histograms of the numerical columns.

<img class="text-center" src="images/histograms/num_canc_orders_last_50.png" width="500">
<img class="text-center" src="images/histograms/num_orders_last_50.png" width="500">
<img class="text-center" src="images/histograms/num_refund_orders_last_50.png" width="500">

2. We can see that based on the histograms, the data is heavily skewed, indicating the presence of outliers. We will address this issue and test the different outlier handling techniques during the model experimentation phase.
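
To make the link between skew and outliers concrete, here is a small sketch that flags values outside the interquartile-range (IQR) fences. The IQR rule is a simpler stand-in used only for illustration; it is not one of the techniques (LOF, ISO) tested in the experiments, and the sample counts are hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical order counts: mostly small, with one very large value
order_counts = [0, 1, 1, 2, 2, 3, 3, 4, 5, 60]
print(iqr_outliers(order_counts))
```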

---

## **4. Model Workflow**

---

`Phase 1`: Baseline Experiments

`Preprocessing`: Testing encoding (one-hot encoding, or no encoding, in which case the categorical columns are dropped) and scaler types (StandardScaler, MinMaxScaler, RobustScaler, None).

`Models`: Logistic Regression, Random Forest, GaussianNB, LightGBM, GradientBoosting, DecisionTree

`Goal`: Establish baseline performance metrics for scalers and encoding.

`Phase 2`: Advanced Preprocessing

`Preprocessing`: Testing different imbalanced data handling techniques (SMOTE, RUS, ROS) and outlier handling techniques (LOF, ISO).

`Models`: Logistic Regression, Random Forest, GaussianNB, LightGBM, GradientBoosting, DecisionTree

`Goal`: Identify which preprocessing steps improve performance.

`Phase 3`: Feature Engineering, Hyperparameter Tuning, and Selection

`First set of Models`: Train the two best-performing model types in two configurations: with the best scaler, encoding choice, imbalanced data handling technique, and outlier handling technique found in Phases 1 and 2, and without them.

`First Goal`: To decide on the best preprocessing steps

`Second set of Models`: Apply feature selection techniques (e.g., PCA, feature importance from tree-based models).

`Second Goal`: Determine the best model and the best feature selection technique (or no feature selection), and reduce overfitting where it occurs.

`Third set of Models`: Apply hyperparameter tuning to the best model.

`Third Goal`: Determine the best hyperparameters for the best model, reduce overfitting, and improve performance.

`Note`: All these sets of models are trained in the same experiment.

`Preprocessing Data Before Training`

In [None]:
# Load large dataset with Dask
df = dd.read_csv("merged_data.csv")


# Convert data types for memory efficiency
def convert_dtypes(df):
    df["order_value"] = df["order_value"].astype("float32")
    df["refund_value"] = df["refund_value"].astype("float32")
    df["num_items_ordered"] = (
        df["num_items_ordered"].astype(float).round().astype("uint8")
    )
    df["order_date"] = dd.to_datetime(df["order_date"])
    df["first_order_datetime"] = dd.to_datetime(df["first_order_datetime"])
    df[["country_code", "collect_type", "payment_method"]] = df[
        ["country_code", "collect_type", "payment_method"]
    ].astype("category")
    return df


df = convert_dtypes(df)


# Payment method grouping
def group_payment_methods(payment_method):
    mapping = {
        "CreditCard": [
            "GenericCreditCard",
            "CybersourceCreditCard",
            "CybersourceApplePay",
            "CreditCard",
        ],
        "DigitalWallet": ["GCash", "AFbKash", "JazzCashWallet", "AdyenBoost", "PayPal"],
        "BankTransfer": ["XenditDirectDebit", "RazerOnlineBanking"],
        "PaymentOnDelivery": ["Invoice", "PayOnDelivery"],
    }
    for key, values in mapping.items():
        if payment_method in values:
            return key
    return "Others"


df["payment_method"] = df["payment_method"].map(group_payment_methods)


# Date transformations
def date_transformations(df):
    df["days_since_first_order"] = (
        df["order_date"] - df["first_order_datetime"]
    ).dt.days
    df = df.drop(columns=["first_order_datetime"])
    df["order_date_day_of_week"] = df["order_date"].dt.dayofweek
    df["order_date_day"] = df["order_date"].dt.day
    df["order_date_month"] = df["order_date"].dt.month
    df["order_date_year"] = df["order_date"].dt.year
    df = df.drop(columns=["order_date"])
    return df


df = date_transformations(df)
df = df.drop(columns=["order_id", "customer_id"])

# Split data
X = df.drop(columns=["is_fraud"])
y = df["is_fraud"]
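# Use the Dask-aware splitter: X and y are Dask collections, and the
# experiment trackers later call .compute() on each split
X_train, X_test, y_train, y_test = dask_train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True
)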
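
To illustrate what the date transformations produce, the sketch below derives the same features for a single order in plain Python. The helper `date_features` is hypothetical and stands in for the pandas `.dt` accessors used above; the input values come from the first row of `merged_df.head()`.

```python
from datetime import datetime


def date_features(order_date, first_order_datetime):
    """Derive the same date features as date_transformations for one order."""
    return {
        "days_since_first_order": (order_date - first_order_datetime).days,
        "order_date_day_of_week": order_date.weekday(),  # Monday = 0
        "order_date_day": order_date.day,
        "order_date_month": order_date.month,
        "order_date_year": order_date.year,
    }


# Values from the first row of merged_df.head()
features = date_features(
    datetime(2023, 4, 8), datetime(2022, 8, 13, 3, 53, 52)
)
print(features)
```

Because `first_order_datetime` carries a time of day while `order_date` does not, the day difference is truncated toward the whole number of elapsed days.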

`Preparation`: We first create a base experiment tracker class. It bundles reusable helper functions (checkpointing, run naming, metric evaluation, and learning curves) that the tracker classes for the different phases inherit.

In [None]:
class BaseExperimentTracker:
    def __init__(self, experiment_name, checkpoint_file="experiment_checkpoint.json"):
        self.experiment_name = experiment_name
        self.checkpoint_file = checkpoint_file
        self.completed_runs = self.load_checkpoint()

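        # Set the environment variables for MLflow.
        # The access token must not be hard-coded in the notebook: export
        # MLFLOW_TRACKING_PASSWORD in the shell before starting the kernel.
        os.environ["MLFLOW_TRACKING_URI"] = "https://dagshub.com/REHXZ/PAI_CA2.mlflow"
        os.environ["MLFLOW_TRACKING_USERNAME"] = "REHXZ"
        if not os.environ.get("MLFLOW_TRACKING_PASSWORD"):
            raise RuntimeError(
                "MLFLOW_TRACKING_PASSWORD is not set; export the DagsHub token first."
            )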

        # Set MLflow tracking URI to DagsHub
        mlflow.set_tracking_uri("https://dagshub.com/REHXZ/PAI_CA2.mlflow")

        # Create or get experiment
        try:
            self.experiment_id = mlflow.create_experiment(experiment_name)
        except Exception:
            self.experiment_id = mlflow.get_experiment_by_name(
                experiment_name
            ).experiment_id
        mlflow.set_experiment(experiment_name)

    def load_checkpoint(self):
        """Load the checkpoint file if it exists."""
        if os.path.exists(self.checkpoint_file):
            with open(self.checkpoint_file, "r") as f:
                return set(json.load(f))
        return set()

    def save_checkpoint(self):
        """Save the current progress to checkpoint file."""
        with open(self.checkpoint_file, "w") as f:
            json.dump(list(self.completed_runs), f)

    def generate_run_id(self, config):
        """Generate a concise, unique identifier for a run configuration."""
        model_abbr = config["models"]["name"][:2].upper()  # Abbreviate model name
        scaler_abbr = config["scaler"].__class__.__name__.replace(
            "Scaler", ""
        )  # Simplify scaler name
        encoding_status = "Enc" if config["encode"]["apply"] else "NoEnc"
        timestamp = time.strftime("%Y%m%d%H%M")  # Add timestamp for uniqueness
        return f"{model_abbr}_{scaler_abbr}_{encoding_status}_{timestamp}"

    def evaluate_metrics(
        self, y_train, y_train_pred, y_train_prob, y_test, y_test_pred, y_test_prob
    ):
        """
        Calculate and return a dictionary of evaluation metrics for both train and test sets.
        """
        metrics = {
            "train_accuracy": accuracy_score(y_train, y_train_pred),
            "train_precision": precision_score(y_train, y_train_pred),
            "train_recall": recall_score(y_train, y_train_pred),
            "train_f1_score": f1_score(y_train, y_train_pred),
            "train_roc_auc": roc_auc_score(y_train, y_train_prob),
            "train_pr_auc": average_precision_score(y_train, y_train_prob),
            "test_accuracy": accuracy_score(y_test, y_test_pred),
            "test_precision": precision_score(y_test, y_test_pred),
            "test_recall": recall_score(y_test, y_test_pred),
            "test_f1_score": f1_score(y_test, y_test_pred),
            "test_roc_auc": roc_auc_score(y_test, y_test_prob),
            "test_pr_auc": average_precision_score(y_test, y_test_prob),
        }
        return metrics

    # def log_mean_fit_time(self, total_fit_time, num_runs):
    #     """Calculate and log the mean fit time for the model."""
    #     mlflow.log_metric("Total_Fit_Time", total_fit_time)
    #     mlflow.log_metric("Num_Runs", total_fit_time / num_runs)

    def plot_learning_curves(self, pipeline, X, y, cv=5):
        """
        Generate and save learning curves.
        """
        train_sizes, train_scores, test_scores = learning_curve(
            pipeline, X, y, cv=cv, scoring="f1", n_jobs=-1
        )
        train_scores_mean = np.mean(train_scores, axis=1)
        test_scores_mean = np.mean(test_scores, axis=1)

        # Plot learning curves
        plt.figure(figsize=(8, 6))
        plt.plot(train_sizes, train_scores_mean, label="Training score")
        plt.plot(train_sizes, test_scores_mean, label="Cross-validation score")
        plt.xlabel("Training examples")
        plt.ylabel("F1 Score")
        plt.title("Learning Curves")
        plt.legend()

        # Save the plot
        learning_curve_path = f"learning_curves_{time.strftime('%Y%m%d_%H%M')}.png"
        plt.savefig(learning_curve_path)
        plt.close()
        return learning_curve_path

    def run_experiments(self):
        # To be implemented by subclasses
        raise NotImplementedError
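
The checkpoint mechanism can be sketched in isolation as follows. This is a simplified, standalone version of `load_checkpoint` and `save_checkpoint`, written as free functions for illustration; the run IDs are hypothetical and a temporary directory replaces the real checkpoint file.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the set of completed run IDs, or an empty set if no file exists."""
    if os.path.exists(path):
        with open(path, "r") as f:
            return set(json.load(f))
    return set()


def save_checkpoint(path, completed_runs):
    """Persist run IDs as a JSON list (sets are not JSON-serializable)."""
    with open(path, "w") as f:
        json.dump(sorted(completed_runs), f)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "experiment_checkpoint.json")
    first_load = load_checkpoint(path)  # no file yet, so an empty set
    save_checkpoint(path, {"RA_MinMax_Enc", "LI_Standard_Enc"})  # hypothetical IDs
    restored = load_checkpoint(path)

print(first_load, restored)
```

A run is skipped only when its ID matches a stored one exactly. Since `generate_run_id` appends a minute-level timestamp, IDs generated in a later session will generally differ from the stored ones.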

`Phase 1`: We start by training the baseline models to test the different scalers and whether encoding is necessary. (If it is not, we drop the categorical columns.)

`Metrics`: We track the following metrics to determine the best model: `Accuracy, Precision, Recall, F1-score, ROC-AUC and PR-AUC`. Our main metric is the `F1-score`, the harmonic mean of precision and recall, because it balances the two. In fraud detection, we want to minimize both false positives (predicting a transaction as fraudulent when it is not) and false negatives (predicting a transaction as legitimate when it is fraudulent), so a metric that captures the trade-off between precision and recall is a good fit.
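
As a worked illustration of why accuracy alone is misleading on imbalanced data, the sketch below computes precision, recall, and F1 from confusion-matrix counts. The counts are hypothetical and the helper `precision_recall_f1` is an assumption for this example; the experiments themselves use the scikit-learn metric functions.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for 1,000 orders, 50 of which are fraudulent
tp, fp, fn, tn = 30, 10, 20, 940
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(accuracy, precision, recall, f1)

# A model that never predicts fraud on the same data
naive_accuracy = (0 + 950) / 1000
naive_f1 = precision_recall_f1(tp=0, fp=0, fn=50)[2]
print(naive_accuracy, naive_f1)
```

The model that never flags fraud still reaches 95% accuracy, while its F1-score is zero.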

In [None]:
class PhaseOneExperimentTracker(BaseExperimentTracker):
    def run_experiments(
        self,
        experiment_combinations,
        X_train,
        y_train,
        X_test,
        y_test,
        numeric_columns,
        categorical_cols,
    ):
        """Run experiments for Phase 1."""
        client = Client()  # Initialize Dask client for parallel processing

        total_time = 0
        for config in experiment_combinations:
            run_id = self.generate_run_id(config)
            if run_id in self.completed_runs:
                print(f"Skipping completed run: {run_id}")
                continue
            print(f"Starting run: {run_id}")

            try:
                with mlflow.start_run(run_name=run_id):
                    # Add descriptive run tags
                    mlflow.set_tag("model_type", config["models"]["name"])
                    mlflow.set_tag("scaler_type", config["scaler"].__class__.__name__)
                    mlflow.set_tag("encoding_applied", str(config["encode"]["apply"]))
                    mlflow.set_tag("dataset", "X_train")

                    # Build preprocessing steps
                    transformers = []
                    if config["scaler"]:
                        transformers.append(
                            ("scaler", config["scaler"], numeric_columns)
                        )
                    if config["encode"]["apply"]:
                        transformers.append(
                            (
                                "encoder",
                                OneHotEncoder(
                                    handle_unknown="ignore", sparse_output=True
                                ),
                                categorical_cols,
                            )
                        )
                    else:
                        transformers.append(
                            ("drop_categorical", "drop", categorical_cols)
                        )

                    preprocessor = ColumnTransformer(
                        transformers=transformers, remainder="passthrough"
                    )
                    pipeline = Pipeline(
                        steps=[
                            ("preprocessor", preprocessor),
                            ("model", config["models"]["instance"]),
                        ]
                    )

                    # Materialize the Dask collections once; the scikit-learn
                    # pipeline operates on in-memory pandas objects
                    X_train_pd = X_train.compute()
                    y_train_pd = y_train.compute()
                    X_test_pd = X_test.compute()
                    y_test_pd = y_test.compute()

                    # Train the pipeline
                    pipeline.fit(X_train_pd, y_train_pd.to_numpy().ravel())

                    # Predictions and probabilities for train and test sets
                    train_predictions = pipeline.predict(X_train_pd)
                    train_probabilities = pipeline.predict_proba(X_train_pd)[:, 1]

                    # Time only the test-set predictions of this run
                    start_time = time.time()
                    test_predictions = pipeline.predict(X_test_pd)
                    test_probabilities = pipeline.predict_proba(X_test_pd)[:, 1]
                    prediction_time = time.time() - start_time
                    total_time += prediction_time

                    # Evaluate metrics
                    metrics = self.evaluate_metrics(
                        y_train_pd,
                        train_predictions,
                        train_probabilities,
                        y_test_pd,
                        test_predictions,
                        test_probabilities,
                    )

                    # Log parameters, metrics, and model
                    mlflow.log_params(config)
                    mlflow.log_metrics(metrics)
                    # Average prediction time per test row for this run
                    average_time = prediction_time / len(X_test_pd)
                    mlflow.log_metric("average_prediction_time", average_time)

                    # Log the model with an input example
                    mlflow.sklearn.log_model(
                        pipeline,
                        "model",
                        input_example=X_train.compute().iloc[
                            0:1
                        ],  # Convert to Pandas DataFrame before slicing
                    )
                    # Save learning curves
                    learning_curve_path = self.plot_learning_curves(
                        pipeline, X_train.compute(), y_train.compute()
                    )
                    mlflow.log_artifact(
                        learning_curve_path, artifact_path="learning_curves"
                    )

                    # Mark this run as completed
                    self.completed_runs.add(run_id)
                    self.save_checkpoint()
                    print(f"Completed run: {run_id}")

            except Exception as e:
                print(f"Error in run {run_id}: {str(e)}")
                continue

            finally:
                print("end run")
                mlflow.end_run()  # Ensure the run is properly ended

        client.close()

`Defining Search Space for Training`

In [None]:
# Define experiment configurations
search_space = {
    "scaler": [None, StandardScaler(), MinMaxScaler(), RobustScaler()],
    "encode": [{"apply": True, "columns": ["categorical_col"]}, {"apply": False}],
    "models": [
        {"name": "LogisticRegression", "instance": LogisticRegression()},
        {"name": "RandomForest", "instance": RandomForestClassifier()},
        {"name": "LightGBM", "instance": LGBMClassifier()},
        {"name": "GaussianNB", "instance": GaussianNB()},
        {"name": "DecisionTree", "instance": DecisionTreeClassifier()},
        {"name": "GradientBoosting", "instance": GradientBoostingClassifier()},
    ],
}

# Generate all combinations
keys, values = zip(*search_space.items())
experiment_combinations = [dict(zip(keys, v)) for v in itertools.product(*values)]

categorical_cols = ["payment_method", "country_code", "collect_type"]
numeric_columns = [
    "order_value",
    "refund_value",
    "num_items_ordered",
    "days_since_first_order",
    "order_date_day_of_week",
    "order_date_day",
    "order_date_month",
    "order_date_year",
]
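
The size of the resulting grid can be checked with a short sketch. Plain strings stand in for the scaler and model instances here (an assumption made so the snippet runs without scikit-learn or LightGBM); the expansion logic is the same as above.

```python
import itertools

# String stand-ins for the scaler and model instances
search_space = {
    "scaler": ["None", "StandardScaler", "MinMaxScaler", "RobustScaler"],
    "encode": [True, False],
    "models": [
        "LogisticRegression",
        "RandomForest",
        "LightGBM",
        "GaussianNB",
        "DecisionTree",
        "GradientBoosting",
    ],
}

keys, values = zip(*search_space.items())
combinations = [dict(zip(keys, v)) for v in itertools.product(*values)]

print(len(combinations))  # 4 scalers x 2 encoding options x 6 models
print(combinations[0])
```

Each dictionary in the list is one run configuration, so Phase 1 trains 48 pipelines in total.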

`Training`

In [None]:
# Initialize tracker and run experiments
tracker = PhaseOneExperimentTracker("Phase1")
tracker.run_experiments(
    experiment_combinations,
    X_train,
    y_train,
    X_test,
    y_test,
    numeric_columns,
    categorical_cols,
)

`Results`: Our two best baseline models are Random Forest and LightGBM, with MinMaxScaler and StandardScaler as their best scalers, respectively. Both models perform better with encoding. In the next phase, we test the different imbalanced data handling and outlier handling techniques.

<img class="text-center" src="images/results/phase1.png" width="1500">


`Phase 2`: We will test the baseline models with different imbalanced data handling techniques (SMOTE, RUS, ROS) and outlier handling techniques (LOF, ISO).
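
To show the idea behind random oversampling (ROS), here is a simplified pure-Python sketch. It is a stand-in for illustration only and not the `imblearn` implementation used in the experiments; the helper `random_oversample` and the toy data are hypothetical.

```python
import random


def random_oversample(rows, labels, seed=42):
    """Duplicate minority-class rows at random until all classes are the same size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Sample with replacement to fill the gap to the majority size
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Hypothetical training data: 8 legitimate orders and 2 fraudulent ones
rows = [[float(i)] for i in range(10)]
labels = [0] * 8 + [1] * 2
X_res, y_res = random_oversample(rows, labels)
print(y_res.count(0), y_res.count(1))
```

Resampling of this kind is applied to the training split only, so that the test set keeps the original class distribution.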

In [None]:
class PhaseTwoExperimentTracker(BaseExperimentTracker):
    def run_experiments(
        self,
        datasets,
        X_test,
        y_test,
        experiment_combinations,
        numeric_columns,
        categorical_cols,
    ):
        """Run experiments for Phase 2 on multiple datasets."""
        client = Client()  # Initialize Dask client for parallel processing
        total_time = 0
        for dataset_name, X_train, y_train in datasets:
            for config in experiment_combinations:
                run_id = self.generate_run_id(config)
                if run_id in self.completed_runs:
                    print(f"Skipping completed run: {run_id}")
                    continue
                print(f"Starting run: {run_id}")
                try:
                    with mlflow.start_run(run_name=run_id):
                        # Add descriptive run tags
                        mlflow.set_tag("dataset", dataset_name)
                        mlflow.set_tag("model_type", config["models"]["name"])
                        mlflow.set_tag(
                            "scaler_type", config["scaler"].__class__.__name__
                        )
                        mlflow.set_tag(
                            "encoding_applied", str(config["encode"]["apply"])
                        )

                        # Split dataset name
                        dataset_names = dataset_name.split("_")
                        mlflow.set_tag(
                            "outlier_technique",
                            (
                                "LOF"
                                if "LOF" in dataset_names
                                else "ISO" if "ISO" in dataset_names else "None"
                            ),
                        )
                        mlflow.set_tag(
                            "resampling_method",
                            (
                                "SMOTE"
                                if "SMOTE" in dataset_names
                                else (
                                    "ROS"
                                    if "ROS" in dataset_names
                                    else "RUS" if "RUS" in dataset_names else "None"
                                )
                            ),
                        )

                        # Build preprocessing steps
                        transformers = []
                        if config["scaler"]:
                            transformers.append(
                                ("scaler", config["scaler"], numeric_columns)
                            )
                        if config["encode"]["apply"]:
                            transformers.append(
                                (
                                    "encoder",
                                    OneHotEncoder(
                                        handle_unknown="ignore", sparse_output=True
                                    ),
                                    categorical_cols,
                                )
                            )
                        else:
                            transformers.append(
                                ("drop_categorical", "drop", categorical_cols)
                            )
                        preprocessor = ColumnTransformer(
                            transformers=transformers, remainder="passthrough"
                        )
                        pipeline = Pipeline(
                            steps=[
                                ("preprocessor", preprocessor),
                                ("model", config["models"]["instance"]),
                            ]
                        )

                        # Train the pipeline
                        y_train_series = (
                            y_train.compute().squeeze()
                        )  # Convert to Pandas Series

                        pipeline.fit(X_train.compute(), y_train_series)

                        # Predictions and probabilities for train and test sets
                        train_predictions = pipeline.predict(X_train.compute())
                        train_probabilities = pipeline.predict_proba(X_train.compute())[
                            :, 1
                        ]
                        start_time = time.time()
                        test_predictions = pipeline.predict(X_test.compute())
                        test_probabilities = pipeline.predict_proba(X_test.compute())[
                            :, 1
                        ]
                        elapsed = time.time() - start_time
                        total_time += elapsed
                        # Per-row time for this run only; total_time spans all runs
                        average_time = elapsed / len(X_test)

                        # Evaluate metrics
                        metrics = self.evaluate_metrics(
                            y_train.compute(),
                            train_predictions,
                            train_probabilities,
                            y_test.compute(),
                            test_predictions,
                            test_probabilities,
                        )

                        # Log parameters, metrics, and model
                        mlflow.log_params(config)
                        mlflow.log_metrics(metrics)
                        mlflow.sklearn.log_model(
                            pipeline, "model", input_example=X_train.compute().iloc[0:1]
                        )
                        mlflow.log_metric("average_prediction_time", average_time)

                        # Save learning curves
                        learning_curve_path = self.plot_learning_curves(
                            pipeline, X_train.compute(), y_train.compute()
                        )
                        mlflow.log_artifact(
                            learning_curve_path, artifact_path="learning_curves"
                        )

                        # Mark this run as completed
                        self.completed_runs.add(run_id)
                        self.save_checkpoint()
                        print(f"Completed run: {run_id}")
                except Exception as e:
                    print(f"Error in run {run_id}: {str(e)}")
                    continue

                finally:
                    print("end run")
                    mlflow.end_run()  # Ensure the run is properly ended

        client.close()

`Load the different datasets to be used later`

In [None]:
# Load datasets
X_train = dd.read_csv("./data/X_train.csv")
y_train = dd.read_csv("./data/y_train.csv")

X_test = dd.read_csv("./data/X_test.csv")
y_test = dd.read_csv("./data/y_test.csv")

X_train_ISO = dd.read_csv("./data/X_train_ISO.csv")
y_train_ISO = dd.read_csv("./data/y_train_ISO.csv")

X_train_ISO_SMOTE = dd.read_csv("./data/X_train_ISO_smote.csv")
y_train_ISO_SMOTE = dd.read_csv("./data/y_train_ISO_smote.csv")

X_train_ISO_ROS = dd.read_csv("./data/X_train_ISO_ros.csv")
y_train_ISO_ROS = dd.read_csv("./data/y_train_ISO_ros.csv")

X_train_ISO_RUS = dd.read_csv("./data/X_train_ISO_rus.csv")
y_train_ISO_RUS = dd.read_csv("./data/y_train_ISO_rus.csv")

X_train_LOF = dd.read_csv("./data/X_train_LOF.csv")
y_train_LOF = dd.read_csv("./data/y_train_LOF.csv")

X_train_LOF_SMOTE = dd.read_csv("./data/X_train_LOF_smote.csv")
y_train_LOF_SMOTE = dd.read_csv("./data/y_train_LOF_smote.csv")

X_train_LOF_ROS = dd.read_csv("./data/X_train_LOF_ros.csv")
y_train_LOF_ROS = dd.read_csv("./data/y_train_LOF_ros.csv")

X_train_LOF_RUS = dd.read_csv("./data/X_train_LOF_rus.csv")
y_train_LOF_RUS = dd.read_csv("./data/y_train_LOF_rus.csv")

X_train_smote = dd.read_csv("./data/X_train_smote.csv")
y_train_smote = dd.read_csv("./data/y_train_smote.csv")

X_train_ros = dd.read_csv("./data/X_train_ros.csv")
y_train_ros = dd.read_csv("./data/y_train_ros.csv")

X_train_rus = dd.read_csv("./data/X_train_rus.csv")
y_train_rus = dd.read_csv("./data/y_train_rus.csv")

datasets = [
    ("dataset_default", X_train, y_train),
    ("dataset_ISO", X_train_ISO, y_train_ISO),
    ("dataset_ISO_SMOTE", X_train_ISO_SMOTE, y_train_ISO_SMOTE),
    ("dataset_ISO_ROS", X_train_ISO_ROS, y_train_ISO_ROS),
    ("dataset_ISO_RUS", X_train_ISO_RUS, y_train_ISO_RUS),
    ("dataset_LOF", X_train_LOF, y_train_LOF),
    ("dataset_LOF_SMOTE", X_train_LOF_SMOTE, y_train_LOF_SMOTE),
    ("dataset_LOF_ROS", X_train_LOF_ROS, y_train_LOF_ROS),
    ("dataset_LOF_RUS", X_train_LOF_RUS, y_train_LOF_RUS),
    ("dataset_SMOTE", X_train_smote, y_train_smote),
    ("dataset_ROS", X_train_ros, y_train_ros),
    ("dataset_RUS", X_train_rus, y_train_rus),
]
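
The dataset names above double as metadata: the tracker splits each name on underscores to derive the `outlier_technique` and `resampling_method` MLflow tags. A minimal sketch of that parsing as a standalone function follows; `parse_dataset_name` is a hypothetical helper used for illustration and is not part of the tracker classes.

```python
def parse_dataset_name(dataset_name):
    """Derive tag values from a name such as "dataset_LOF_ROS".

    Hypothetical helper mirroring the tag logic in run_experiments.
    """
    parts = dataset_name.split("_")
    # First matching token wins; "None" means the step was not applied.
    outlier = next((t for t in ("LOF", "ISO") if t in parts), "None")
    resampling = next((t for t in ("SMOTE", "ROS", "RUS") if t in parts), "None")
    return {"outlier_technique": outlier, "resampling_method": resampling}


print(parse_dataset_name("dataset_LOF_ROS"))
print(parse_dataset_name("dataset_default"))
```

The match is on whole, case-sensitive tokens, which is why the names in the `datasets` list use uppercase `SMOTE`, `ROS`, and `RUS` even though the CSV file names use lowercase.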

`Defining Search Space for Training`

In [None]:
# Define experiment configurations
search_space = {
    "scaler": [None],
    "encode": [{"apply": True, "columns": ["categorical_col"]}],
    "models": [
        {"name": "LogisticRegression", "instance": LogisticRegression()},
        {"name": "RandomForest", "instance": RandomForestClassifier()},
        {"name": "LightGBM", "instance": LGBMClassifier()},
        {"name": "GaussianNB", "instance": GaussianNB()},
        {"name": "DecisionTree", "instance": DecisionTreeClassifier()},
        {"name": "GradientBoosting", "instance": GradientBoostingClassifier()},
    ],
}

# Generate all combinations
keys, values = zip(*search_space.items())
experiment_combinations = [dict(zip(keys, v)) for v in itertools.product(*values)]

categorical_cols = ["payment_method", "country_code", "collect_type"]
numeric_columns = [
    "order_value",
    "refund_value",
    "num_items_ordered",
    "days_since_first_order",
    "order_date_day_of_week",
    "order_date_day",
    "order_date_month",
    "order_date_year",
]
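
The combination step can be easier to follow with plain values. The sketch below uses string placeholders instead of estimator instances (an assumption made purely for illustration) to show how `itertools.product` expands a search space into one configuration dictionary per combination.

```python
import itertools

# Placeholder values stand in for the scaler and model objects.
toy_space = {
    "scaler": [None, "MinMaxScaler"],
    "encode": [{"apply": True}],
    "models": ["LogisticRegression", "RandomForest", "LightGBM"],
}

keys, values = zip(*toy_space.items())
toy_combinations = [dict(zip(keys, v)) for v in itertools.product(*values)]

# 2 scalers x 1 encoding option x 3 models = 6 configurations
print(len(toy_combinations))
print(toy_combinations[0])
```

With the search space defined in the cell above (one scaler option, one encoding option, six models), the same expansion yields six configurations, each of which is run against all twelve datasets.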

`Training`

In [None]:
# Initialize the tracker
from ExperimentTrackers2 import PhaseTwoExperimentTracker
tracker = PhaseTwoExperimentTracker("Phase2")

# Load checkpoint file
tracker.completed_runs

# Run experiments with checkpointing
tracker.run_experiments(
    datasets=datasets,
    experiment_combinations=experiment_combinations,
    X_test=X_test,
    y_test=y_test,
    numeric_columns=numeric_columns,
    categorical_cols=categorical_cols
)
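
Checkpointing here means that a run identifier recorded as completed is skipped on the next invocation. The sketch below illustrates that idea in isolation; the `CheckpointStore` class, its JSON file format, and the file name are assumptions for illustration and may differ from what `BaseExperimentTracker` actually does.

```python
import json
import os
import tempfile


class CheckpointStore:
    """Hypothetical store that persists completed run IDs as a JSON list."""

    def __init__(self, path):
        self.path = path
        self.completed_runs = self._load()

    def _load(self):
        # Start with an empty set when no checkpoint file exists yet.
        if not os.path.exists(self.path):
            return set()
        with open(self.path) as f:
            return set(json.load(f))

    def mark_completed(self, run_id):
        self.completed_runs.add(run_id)
        with open(self.path, "w") as f:
            json.dump(sorted(self.completed_runs), f)


checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
store = CheckpointStore(checkpoint_path)
store.mark_completed("dataset_LOF_RandomForest")

# A fresh instance, as after a kernel restart, sees the earlier run.
resumed = CheckpointStore(checkpoint_path)
for run_id in ["dataset_LOF_RandomForest", "dataset_LOF_LightGBM"]:
    if run_id in resumed.completed_runs:
        print(f"Skipping completed run: {run_id}")
    else:
        print(f"Starting run: {run_id}")
```

Skipping only works when the identifier is deterministic. An identifier that embeds a timestamp changes on every invocation, so it will not match an earlier checkpoint entry.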

`Results`: The two best model types are RandomForest and LightGBM. RandomForest performed best with LOF outlier handling and ROS resampling, while LightGBM performed best with LOF outlier handling and no resampling. Next, we combine the best preprocessing steps and test these models again.

<img class="text-center" src="images/results/phase2.png" width="1500">

`Phase 3`: We will test the best models from the previous phases with feature selection techniques (e.g., PCA, feature importance from tree-based models) and hyperparameter tuning.

`First Set of Models`: We will test the best models with the best preprocessing steps.

In [None]:
class PhaseThreeExperimentTracker(BaseExperimentTracker):
    def run_experiments(
        self,
        datasets,
        X_test,
        y_test,
        experiment_combinations,
        numeric_columns,
        categorical_cols,
    ):
        """Run experiments for Phase 2 on multiple datasets."""
        client = Client()  # Initialize Dask client for parallel processing
        total_time = 0
        for dataset_name, X_train, y_train in datasets:
            for config in experiment_combinations:
                # config["models"] is a dict, so compare against its "name" entry
                if (
                    dataset_name == "dataset_LOF_ROS"
                    and config["models"]["name"] == "RandomForest"
                ) or (
                    dataset_name == "dataset_LOF"
                    and config["models"]["name"] == "LightGBM"
                ):
                    run_id = self.generate_run_id(config)
                    if run_id in self.completed_runs:
                        print(f"Skipping completed run: {run_id}")
                        continue
                    print(f"Starting run: {run_id}")
                    try:
                        with mlflow.start_run(run_name=run_id):
                            # Add descriptive run tags
                            mlflow.set_tag("dataset", dataset_name)
                            mlflow.set_tag("model_type", config["models"]["name"])
                            mlflow.set_tag(
                                "scaler_type", config["scaler"].__class__.__name__
                            )
                            mlflow.set_tag(
                                "encoding_applied", str(config["encode"]["apply"])
                            )

                            # Split dataset name
                            dataset_names = dataset_name.split("_")
                            mlflow.set_tag(
                                "outlier_technique",
                                (
                                    "LOF"
                                    if "LOF" in dataset_names
                                    else "ISO" if "ISO" in dataset_names else "None"
                                ),
                            )
                            mlflow.set_tag(
                                "resampling_method",
                                (
                                    "SMOTE"
                                    if "SMOTE" in dataset_names
                                    else (
                                        "ROS"
                                        if "ROS" in dataset_names
                                        else "RUS" if "RUS" in dataset_names else "None"
                                    )
                                ),
                            )

                            # Build preprocessing steps
                            transformers = []
                            if config["scaler"]:
                                transformers.append(
                                    ("scaler", config["scaler"], numeric_columns)
                                )
                            if config["encode"]["apply"]:
                                transformers.append(
                                    (
                                        "encoder",
                                        OneHotEncoder(
                                            handle_unknown="ignore", sparse_output=True
                                        ),
                                        categorical_cols,
                                    )
                                )
                            else:
                                transformers.append(
                                    ("drop_categorical", "drop", categorical_cols)
                                )
                            preprocessor = ColumnTransformer(
                                transformers=transformers, remainder="passthrough"
                            )
                            pipeline = Pipeline(
                                steps=[
                                    ("preprocessor", preprocessor),
                                    ("model", config["models"]["instance"]),
                                ]
                            )

                            # Train the pipeline
                            y_train_series = (
                                y_train.compute().squeeze()
                            )  # Convert to Pandas Series

                            pipeline.fit(X_train.compute(), y_train_series)

                            # Predictions and probabilities for train and test sets
                            train_predictions = pipeline.predict(X_train.compute())
                            train_probabilities = pipeline.predict_proba(
                                X_train.compute()
                            )[:, 1]
                            start_time = time.time()
                            test_predictions = pipeline.predict(X_test.compute())
                            test_probabilities = pipeline.predict_proba(
                                X_test.compute()
                            )[:, 1]
                            elapsed = time.time() - start_time
                            total_time += elapsed
                            # Per-row time for this run only; total_time spans all runs
                            average_time = elapsed / len(X_test)

                            # Evaluate metrics
                            metrics = self.evaluate_metrics(
                                y_train.compute(),
                                train_predictions,
                                train_probabilities,
                                y_test.compute(),
                                test_predictions,
                                test_probabilities,
                            )

                            # Log parameters, metrics, and model
                            mlflow.log_params(config)
                            mlflow.log_metrics(metrics)
                            mlflow.sklearn.log_model(
                                pipeline,
                                "model",
                                input_example=X_train.compute().iloc[0:1],
                            )
                            mlflow.log_metric("average_prediction_time", average_time)

                            # Save learning curves
                            learning_curve_path = self.plot_learning_curves(
                                pipeline, X_train.compute(), y_train.compute()
                            )
                            mlflow.log_artifact(
                                learning_curve_path, artifact_path="learning_curves"
                            )

                            # Mark this run as completed
                            self.completed_runs.add(run_id)
                            self.save_checkpoint()
                            print(f"Completed run: {run_id}")
                    except Exception as e:
                        print(f"Error in run {run_id}: {str(e)}")
                        continue

                    finally:
                        print("end run")
                        mlflow.end_run()  # Ensure the run is properly ended

        client.close()

`Filter the datasets to include only the best outlier-handling and resampling methods`

In [None]:
datasets = [
    # ("dataset_default", X_train, y_train),
    # ("dataset_ISO", X_train_ISO, y_train_ISO),
    # ("dataset_ISO_SMOTE", X_train_ISO_SMOTE, y_train_ISO_SMOTE),
    # ("dataset_ISO_ROS", X_train_ISO_ROS, y_train_ISO_ROS),
    # ("dataset_ISO_RUS", X_train_ISO_RUS, y_train_ISO_RUS),
    ("dataset_LOF", X_train_LOF, y_train_LOF),
    # ("dataset_LOF_SMOTE", X_train_LOF_SMOTE, y_train_LOF_SMOTE),
    ("dataset_LOF_ROS", X_train_LOF_ROS, y_train_LOF_ROS),
    # ("dataset_LOF_RUS", X_train_LOF_RUS, y_train_LOF_RUS),
    # ("dataset_SMOTE", X_train_smote, y_train_smote),
    # ("dataset_ROS", X_train_ros, y_train_ros),
    # ("dataset_RUS", X_train_rus, y_train_rus)
]

`Defining Search Space for Training`

In [None]:
defined_experiment_combinations = [
    {
        "scaler": MinMaxScaler(),
        "encode": {"apply": True, "columns": ["categorical_col"]},
        "models": {"name": "RandomForest", "instance": RandomForestClassifier()},
    },
    {
        "scaler": None,
        "encode": {"apply": True, "columns": ["categorical_col"]},
        "models": {"name": "RandomForest", "instance": RandomForestClassifier()},
    },
    {
        "scaler": StandardScaler(),
        "encode": {"apply": True, "columns": ["categorical_col"]},
        "models": {"name": "LightGBM", "instance": LGBMClassifier()},
    },
    {
        "scaler": None,
        "encode": {"apply": True, "columns": ["categorical_col"]},
        "models": {"name": "LightGBM", "instance": LGBMClassifier()},
    },
]

`Training`

In [None]:
from ExperimentTrackers2 import PhaseThreeExperimentTracker

tracker = PhaseThreeExperimentTracker("Final Experiment")

# Load checkpoint file
tracker.completed_runs

# Run experiments with checkpointing
tracker.run_experiments(
    datasets=datasets,
    experiment_combinations=defined_experiment_combinations,
    X_test=X_test,
    y_test=y_test,
    numeric_columns=numeric_columns,
    categorical_cols=categorical_cols,
)

`Second Set of Models`: We will apply feature selection techniques (e.g., PCA, feature importance from tree-based models).

In [None]:
class PhaseFourExperimentTracker(BaseExperimentTracker):
    def generate_run_id(self, config):
        """Generate a concise, unique identifier for a run configuration."""
        model_abbr = config["models"]["name"][:2].upper()  # Abbreviate model name
        scaler_abbr = config["scaler"].__class__.__name__.replace(
            "Scaler", ""
        )  # Simplify scaler name
        encoding_status = "Enc" if config["encode"]["apply"] else "NoEnc"
        timestamp = time.strftime("%Y%m%d%H%M")  # Add timestamp for uniqueness
        return f"{model_abbr}_{scaler_abbr}_{encoding_status}_{timestamp}_reduced"

    def run_experiments(
        self,
        datasets,
        X_test,
        y_test,
        experiment_combinations,
        numeric_columns,
        categorical_cols,
        drop_columns,
    ):
        """Run experiments for Phase 4 on multiple datasets."""
        client = Client()  # Initialize Dask client for parallel processing
        total_time = 0
        for dataset_name, X_train, y_train in datasets:
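            if drop_columns:
                # errors="ignore" tolerates absent columns and the repeated X_test drop
                X_train = X_train.drop(columns=drop_columns, errors="ignore")
                X_test = X_test.drop(columns=drop_columns, errors="ignore")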
            for config in experiment_combinations:
                # config["models"] is a dict, so compare against its "name" entry
                if (
                    dataset_name == "dataset_LOF_ROS"
                    and config["models"]["name"] == "RandomForest"
                ) or (
                    dataset_name == "dataset_LOF"
                    and config["models"]["name"] == "LightGBM"
                ):
                    run_id = self.generate_run_id(config)
                    if run_id in self.completed_runs:
                        print(f"Skipping completed run: {run_id}")
                        continue
                    print(f"Starting run: {run_id}")
                    try:
                        with mlflow.start_run(run_name=run_id):
                            # Add descriptive run tags
                            mlflow.set_tag("dataset", dataset_name)
                            mlflow.set_tag("model_type", config["models"]["name"])
                            mlflow.set_tag(
                                "scaler_type", config["scaler"].__class__.__name__
                            )
                            mlflow.set_tag(
                                "encoding_applied", str(config["encode"]["apply"])
                            )

                            # Split dataset name
                            dataset_names = dataset_name.split("_")
                            mlflow.set_tag(
                                "outlier_technique",
                                (
                                    "LOF"
                                    if "LOF" in dataset_names
                                    else "ISO" if "ISO" in dataset_names else "None"
                                ),
                            )
                            mlflow.set_tag(
                                "resampling_method",
                                (
                                    "SMOTE"
                                    if "SMOTE" in dataset_names
                                    else (
                                        "ROS"
                                        if "ROS" in dataset_names
                                        else "RUS" if "RUS" in dataset_names else "None"
                                    )
                                ),
                            )

                            # Build preprocessing steps
                            transformers = []
                            if config["scaler"]:
                                transformers.append(
                                    ("scaler", config["scaler"], numeric_columns)
                                )
                            if config["encode"]["apply"]:
                                transformers.append(
                                    (
                                        "encoder",
                                        OneHotEncoder(
                                            handle_unknown="ignore", sparse_output=True
                                        ),
                                        categorical_cols,
                                    )
                                )
                            else:
                                transformers.append(
                                    ("drop_categorical", "drop", categorical_cols)
                                )
                            preprocessor = ColumnTransformer(
                                transformers=transformers, remainder="passthrough"
                            )
                            pipeline = Pipeline(
                                steps=[
                                    ("preprocessor", preprocessor),
                                    ("model", config["models"]["instance"]),
                                ]
                            )

                            # Train the pipeline
                            y_train_series = (
                                y_train.compute().squeeze()
                            )  # Convert to Pandas Series

                            pipeline.fit(X_train.compute(), y_train_series)

                            # Predictions and probabilities for train and test sets
                            train_predictions = pipeline.predict(X_train.compute())
                            train_probabilities = pipeline.predict_proba(
                                X_train.compute()
                            )[:, 1]
                            start_time = time.time()
                            test_predictions = pipeline.predict(X_test.compute())
                            test_probabilities = pipeline.predict_proba(
                                X_test.compute()
                            )[:, 1]
                            elapsed = time.time() - start_time
                            total_time += elapsed
                            # Per-row time for this run only; total_time spans all runs
                            average_time = elapsed / len(X_test)

                            # Evaluate metrics
                            metrics = self.evaluate_metrics(
                                y_train.compute(),
                                train_predictions,
                                train_probabilities,
                                y_test.compute(),
                                test_predictions,
                                test_probabilities,
                            )

                            # Log parameters, metrics, and model
                            mlflow.log_params(config)
                            mlflow.log_metrics(metrics)
                            mlflow.sklearn.log_model(
                                pipeline,
                                "model",
                                input_example=X_train.compute().iloc[0:1],
                            )
                            mlflow.log_metric("average_prediction_time", average_time)

                            # Save learning curves
                            learning_curve_path = self.plot_learning_curves(
                                pipeline, X_train.compute(), y_train.compute()
                            )
                            mlflow.log_artifact(
                                learning_curve_path, artifact_path="learning_curves"
                            )

                            # Mark this run as completed
                            self.completed_runs.add(run_id)
                            self.save_checkpoint()
                            print(f"Completed run: {run_id}")
                    except Exception as e:
                        print(f"Error in run {run_id}: {str(e)}")
                        continue

                    finally:
                        print("end run")
                        mlflow.end_run()  # Ensure the run is properly ended

        client.close()

`Defining Search Space for Training`

In [None]:
defined_experiment_combinations = [
    {
        "scaler": MinMaxScaler(),
        "encode": {"apply": True, "columns": ["categorical_col"]},
        "models": {"name": "RandomForest", "instance": RandomForestClassifier()},
    },
    {
        "scaler": StandardScaler(),
        "encode": {"apply": True, "columns": ["categorical_col"]},
        "models": {"name": "LightGBM", "instance": LGBMClassifier()},
    },
]

`Training`

In [None]:
from ExperimentTrackers2 import PhaseFourExperimentTracker

tracker = PhaseFourExperimentTracker("Final Experiment")

# Load checkpoint file
tracker.completed_runs

categorical_cols_reduced = [
    "country_code",
]
numeric_columns_reduced = [
    "order_value",
    "refund_value",
    "num_items_ordered",
    "days_since_first_order",
    "order_date_day_of_week",
    "order_date_day",
    "order_date_month",
    "order_date_year",
]

# Run experiments with checkpointing
tracker.run_experiments(
    datasets=datasets,
    experiment_combinations=defined_experiment_combinations,
    X_test=X_test,
    y_test=y_test,
    numeric_columns=numeric_columns_reduced,
    categorical_cols=categorical_cols_reduced,
    drop_columns=["payment_method", "collect_type", "mobile_verified"],
)
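
One way a `drop_columns` list like the one above could be derived is from tree-based feature importances. The sketch below is an illustration on synthetic data, not the analysis that produced the list above: the column names `signal_a`, `signal_b`, and `noise_c`, the data itself, and the 0.05 threshold are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
# Synthetic features: the label depends on the first two columns only.
X = rng.normal(size=(n, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
feature_names = ["signal_a", "signal_b", "noise_c"]

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances sum to 1 across features.
importances = dict(zip(feature_names, forest.feature_importances_))
threshold = 0.05  # assumed cut-off, to be tuned per dataset
low_importance = [name for name, score in importances.items() if score < threshold]

for name, score in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
print("Candidates to drop:", low_importance)
```

Impurity-based importances can favor high-cardinality features, so permutation importance on held-out data is a common cross-check before dropping a column.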

`For PCA`
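
`ColumnTransformer` applies its transformers side by side to the original columns and concatenates the outputs, so a scaler and a PCA listed as separate entries do not feed into each other. To run PCA on scaled values, the two steps can be chained in a nested `Pipeline`. A minimal sketch on synthetic data, where the column names and `n_components=2` are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "num_a": rng.normal(size=50),
        "num_b": rng.normal(size=50),
        "num_c": rng.normal(size=50),
        "cat": rng.choice(["x", "y"], size=50),
    }
)
numeric = ["num_a", "num_b", "num_c"]

# Separate entries: 3 scaled columns plus 2 PCA components of the unscaled data.
side_by_side = ColumnTransformer(
    [
        ("scaler", StandardScaler(), numeric),
        ("pca", PCA(n_components=2), numeric),
        ("encoder", OneHotEncoder(handle_unknown="ignore"), ["cat"]),
    ]
)

# Nested pipeline: scale first, then reduce the scaled columns to 2 components.
numeric_pipeline = Pipeline([("scaler", StandardScaler()), ("pca", PCA(n_components=2))])
chained = ColumnTransformer(
    [
        ("numeric", numeric_pipeline, numeric),
        ("encoder", OneHotEncoder(handle_unknown="ignore"), ["cat"]),
    ]
)

print(side_by_side.fit_transform(df).shape)
print(chained.fit_transform(df).shape)
```

The first layout keeps the original numeric columns alongside the components, so it widens the feature matrix instead of reducing it; the second replaces the numeric columns with the components.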

In [None]:
class PhaseFiveExperimentTracker(BaseExperimentTracker):
    def generate_run_id(self, config):
        """Generate a concise, unique identifier for a run configuration."""
        model_abbr = config["models"]["name"][:2].upper()  # Abbreviate model name
        scaler_abbr = config["scaler"].__class__.__name__.replace(
            "Scaler", ""
        )  # Simplify scaler name
        encoding_status = "Enc" if config["encode"]["apply"] else "NoEnc"
        timestamp = time.strftime("%Y%m%d%H%M")  # Add timestamp for uniqueness
        return f"{model_abbr}_{scaler_abbr}_{encoding_status}_{timestamp}_PCA"

    def run_experiments(
        self,
        datasets,
        X_test,
        y_test,
        experiment_combinations,
        numeric_columns,
        categorical_cols,
        drop_columns,
    ):
        """Run experiments for Phase 5 (PCA) on multiple datasets."""
        from sklearn.decomposition import PCA

        client = Client(
            dashboard_address=":0", local_directory="/tmp/dask-worker-space"
        )  # Initialize Dask client
        total_time = 0
        for dataset_name, X_train, y_train in datasets:

            def safe_drop_columns(df, drop_columns):
                """Safely drop columns from a DataFrame if they exist."""
                valid_columns = [col for col in drop_columns if col in df.columns]
                if valid_columns:
                    return df.drop(columns=valid_columns)
                return df

            if drop_columns:
                X_train = safe_drop_columns(X_train, drop_columns)
                X_test = safe_drop_columns(X_test, drop_columns)

            for config in experiment_combinations:
                if (
                    dataset_name == "dataset_LOF_ROS"
                    and config["models"]["name"] == "RandomForest"
                ) or (
                    dataset_name == "dataset_LOF"
                    and config["models"]["name"] == "LightGBM"
                ):
                    run_id = self.generate_run_id(config)
                    if run_id in self.completed_runs:
                        print(f"Skipping completed run: {run_id}")
                        continue
                    print(f"Starting run: {run_id}")
                    try:
                        with mlflow.start_run(run_name=run_id):
                            # Add descriptive run tags
                            mlflow.set_tag("dataset", dataset_name)
                            mlflow.set_tag("model_type", config["models"]["name"])
                            mlflow.set_tag(
                                "scaler_type", config["scaler"].__class__.__name__
                            )
                            mlflow.set_tag(
                                "encoding_applied", str(config["encode"]["apply"])
                            )
                            print(config)
                            mlflow.set_tag(
                                "pca_applied", str(config["pca"]["apply"])
                            )  # New tag for PCA
                            # Split dataset name
                            dataset_names = dataset_name.split("_")
                            mlflow.set_tag(
                                "outlier_technique",
                                (
                                    "LOF"
                                    if "LOF" in dataset_names
                                    else "ISO" if "ISO" in dataset_names else "None"
                                ),
                            )
                            mlflow.set_tag(
                                "resampling_method",
                                (
                                    "SMOTE"
                                    if "SMOTE" in dataset_names
                                    else (
                                        "ROS"
                                        if "ROS" in dataset_names
                                        else "RUS" if "RUS" in dataset_names else "None"
                                    )
                                ),
                            )
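                            # Build preprocessing steps
                            transformers = []
                            # Chain scaler and PCA so that PCA sees the scaled values
                            numeric_steps = []
                            if config["scaler"]:
                                numeric_steps.append(("scaler", config["scaler"]))
                            if config["pca"]["apply"]:
                                pca = PCA(n_components=config["pca"]["n_components"])
                                numeric_steps.append(("pca", pca))
                            if numeric_steps:
                                transformers.append(
                                    (
                                        "numeric",
                                        Pipeline(steps=numeric_steps),
                                        numeric_columns,
                                    )
                                )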
                            if config["encode"]["apply"]:
                                transformers.append(
                                    (
                                        "encoder",
                                        OneHotEncoder(
                                            handle_unknown="ignore", sparse_output=True
                                        ),
                                        categorical_cols,
                                    )
                                )
                            else:
                                transformers.append(
                                    ("drop_categorical", "drop", categorical_cols)
                                )
                            preprocessor = ColumnTransformer(
                                transformers=transformers, remainder="passthrough"
                            )
                            pipeline = Pipeline(
                                steps=[
                                    ("preprocessor", preprocessor),
                                    ("model", config["models"]["instance"]),
                                ]
                            )
                            # Train the pipeline
                            y_train_series = (
                                y_train.compute().squeeze()
                            )  # Convert to Pandas Series
                            pipeline.fit(X_train.compute(), y_train_series)
                            # Predictions and probabilities for train and test sets
                            train_predictions = pipeline.predict(X_train.compute())
                            train_probabilities = pipeline.predict_proba(
                                X_train.compute()
                            )[:, 1]
                            start_time = time.time()
                            test_predictions = pipeline.predict(X_test.compute())
                            test_probabilities = pipeline.predict_proba(
                                X_test.compute()
                            )[:, 1]
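                            prediction_time = time.time() - start_time
                            total_time += prediction_time
                            # Per-sample latency for this run only; total_time
                            # keeps accumulating across runs.
                            average_time = prediction_time / len(X_test)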
                            # Evaluate metrics
                            metrics = self.evaluate_metrics(
                                y_train.compute(),
                                train_predictions,
                                train_probabilities,
                                y_test.compute(),
                                test_predictions,
                                test_probabilities,
                            )
                            # Log parameters, metrics, and model
                            mlflow.log_params(config)
                            mlflow.log_metrics(metrics)
                            mlflow.sklearn.log_model(
                                pipeline,
                                "model",
                                input_example=X_train.compute().iloc[0:1],
                            )
                            mlflow.log_metric("average_prediction_time", average_time)
                            # Save learning curves
                            learning_curve_path = self.plot_learning_curves(
                                pipeline, X_train.compute(), y_train.compute()
                            )
                            mlflow.log_artifact(
                                learning_curve_path, artifact_path="learning_curves"
                            )
                            # Mark this run as completed
                            self.completed_runs.add(run_id)
                            self.save_checkpoint()
                            print(f"Completed run: {run_id}")
                    except Exception as e:
                        print(f"Error in run {run_id}: {str(e)}")
                        continue
                    finally:
                        print("end run")
                        mlflow.end_run()  # Ensure the run is properly ended
        client.close()

`Defining Search Space for Training`

In [None]:
defined_experiment_combinations = [
    {
        "scaler": MinMaxScaler(),
        "encode": {"apply": True, "columns": ["categorical_col"]},
        "models": {"name": "RandomForest", "instance": RandomForestClassifier()},
        "pca": {"apply": True, "n_components": 0.95},
    },
    {
        "scaler": StandardScaler(),
        "encode": {"apply": True, "columns": ["categorical_col"]},
        "models": {"name": "LightGBM", "instance": LGBMClassifier()},
        "pca": {"apply": True, "n_components": 0.95},
    },
]

categorical_cols_reduced = ["country_code"]
numeric_columns_reduced = [
    "order_value",
    "refund_value",
    "num_items_ordered",
    "days_since_first_order",
    "order_date_day_of_week",
    "order_date_day",
    "order_date_month",
    "order_date_year",
]
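
Both configurations above set `n_components` to 0.95. In scikit-learn, a float between 0 and 1 asks PCA to keep the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below illustrates this on synthetic data; the array is a made-up stand-in for the eight numeric columns, not the SP-Buy data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in (assumption): 8 numeric columns driven by 3 latent factors.
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 8)) + 0.05 * rng.normal(size=(500, 8))

# Scale first so that no single column dominates the variance.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components selects components by explained variance.
pca = PCA(n_components=0.95).fit(X_scaled)

print("components kept:", pca.n_components_)
print("variance explained:", pca.explained_variance_ratio_.sum())
```

The number of components kept depends on the data, so it is worth checking `pca.n_components_` after fitting rather than assuming a fixed reduction.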

`Training`

In [None]:
from ExperimentTracker2 import PhaseFiveExperimentTracker
tracker = PhaseFiveExperimentTracker("Final Experiment")

# Inspect the runs already recorded in the checkpoint
tracker.completed_runs

# Run experiments with checkpointing
tracker.run_experiments(
    datasets=datasets,
    experiment_combinations=defined_experiment_combinations,
    X_test=X_test,
    y_test=y_test,
    numeric_columns=numeric_columns_reduced,
    categorical_cols=categorical_cols_reduced,
    drop_columns=['payment_method', 'collect_type', 'mobile_verified']
)

`Fourth Set of Models`: We will apply hyperparameter tuning to the best models.

`Note`: RandomForest and LightGBM have different hyperparameters, so we tune them separately: Phase Six tunes RandomForest and Phase Seven tunes LightGBM.
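
Because each model sits inside a `Pipeline` under the step name `model`, its hyperparameters are addressed as `model__<name>` in the search space. The sketch below shows this routing on a small stand-alone pipeline; the two-step pipeline here is a simplified assumption, not the full preprocessing used in the trackers.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Simplified pipeline (assumption): scaler followed by the model step.
pipeline = Pipeline(
    steps=[
        ("scaler", MinMaxScaler()),
        ("model", RandomForestClassifier()),
    ]
)

# The "model__" prefix routes the value to the step named "model".
pipeline.set_params(model__n_estimators=200)
print(pipeline.named_steps["model"].n_estimators)

# Without the prefix, the Pipeline has no such parameter and rejects it.
try:
    pipeline.set_params(n_estimators=200)
    rejected = False
except ValueError:
    rejected = True
print("unprefixed key rejected:", rejected)
```

`RandomizedSearchCV` sets parameters through the same mechanism, so every key in a `params` dictionary needs the step prefix when the estimator being searched is a pipeline.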

In [None]:
class PhaseSixExperimentTracker(BaseExperimentTracker):
    def generate_run_id(self, config):
        """Generate a concise, unique identifier for a run configuration."""
        model_abbr = config["models"]["name"][:2].upper()  # Abbreviate model name
        scaler_abbr = config["scaler"].__class__.__name__.replace(
            "Scaler", ""
        )  # Simplify scaler name
        encoding_status = "Enc" if config["encode"]["apply"] else "NoEnc"
        timestamp = time.strftime("%Y%m%d%H%M")  # Add timestamp for uniqueness
        return f"{model_abbr}_{scaler_abbr}_{encoding_status}_{timestamp}_hypertuned"

    def run_experiments(
        self,
        datasets,
        X_test,
        y_test,
        experiment_combinations,
        numeric_columns,
        categorical_cols,
        drop_columns,
        n_iter=10,
    ):
        """Run experiments for Phase 6 (RandomForest tuning) on multiple datasets."""
        client = Client(
            n_workers=6,
            threads_per_worker=2,
            memory_limit="4GB",
            local_directory="/tmp/dask-worker-space",
            dashboard_address=":0",
        )
        total_time = 0
        for dataset_name, X_train, y_train in datasets:

            def safe_drop_columns(df, drop_columns):
                """Safely drop columns from a DataFrame if they exist."""
                valid_columns = [col for col in drop_columns if col in df.columns]
                if valid_columns:
                    return df.drop(columns=valid_columns)
                return df

            if drop_columns:
                X_train = safe_drop_columns(X_train, drop_columns)
                X_test = safe_drop_columns(X_test, drop_columns)
            for config in experiment_combinations:
                if (
                    dataset_name == "dataset_LOF_ROS"
                    and config["models"]["name"] == "RandomForest"
                ) or (
                    dataset_name == "dataset_LOF"
                    and config["models"]["name"] == "LightGBM"
                ):
                    run_id = self.generate_run_id(config)
                    if run_id in self.completed_runs:
                        print(f"Skipping completed run: {run_id}")
                        continue
                    print(f"Starting run: {run_id}")
                    try:
                        with mlflow.start_run(run_name=run_id):
                            # Add descriptive run tags
                            mlflow.set_tag("dataset", dataset_name)
                            mlflow.set_tag("model_type", config["models"]["name"])
                            mlflow.set_tag(
                                "scaler_type", config["scaler"].__class__.__name__
                            )
                            mlflow.set_tag(
                                "encoding_applied", str(config["encode"]["apply"])
                            )
                            # Split dataset name
                            dataset_names = dataset_name.split("_")
                            mlflow.set_tag(
                                "outlier_technique",
                                (
                                    "LOF"
                                    if "LOF" in dataset_names
                                    else "ISO" if "ISO" in dataset_names else "None"
                                ),
                            )
                            mlflow.set_tag(
                                "resampling_method",
                                (
                                    "SMOTE"
                                    if "SMOTE" in dataset_names
                                    else (
                                        "ROS"
                                        if "ROS" in dataset_names
                                        else "RUS" if "RUS" in dataset_names else "None"
                                    )
                                ),
                            )
                            # Build preprocessing steps
                            transformers = []
                            if config["scaler"]:
                                transformers.append(
                                    ("scaler", config["scaler"], numeric_columns)
                                )
                            if config["encode"]["apply"]:
                                transformers.append(
                                    (
                                        "encoder",
                                        OneHotEncoder(
                                            handle_unknown="ignore", sparse_output=True
                                        ),
                                        categorical_cols,
                                    )
                                )
                            else:
                                transformers.append(
                                    ("drop_categorical", "drop", categorical_cols)
                                )
                            preprocessor = ColumnTransformer(
                                transformers=transformers, remainder="passthrough"
                            )
                            pipeline = Pipeline(
                                steps=[
                                    ("preprocessor", preprocessor),
                                    ("model", config["models"]["instance"]),
                                ]
                            )

                            # Perform hyperparameter tuning
                            param_distributions = config.get("params", {})
                            random_search = RandomizedSearchCV(
                                pipeline,
                                param_distributions=param_distributions,
                                n_iter=n_iter,  # Number of parameter settings that are sampled
                                cv=5,  # Number of cross-validation folds
                                scoring="f1",  # Scoring metric
                                random_state=42,
                            )

                            # Train the pipeline with hyperparameter tuning
                            y_train_series = (
                                y_train.compute().squeeze()
                            )  # Convert to Pandas Series
                            random_search.fit(X_train.compute(), y_train_series)

                            best_pipeline = random_search.best_estimator_

                            # Predictions and probabilities for train and test sets
                            train_predictions = best_pipeline.predict(X_train.compute())
                            train_probabilities = best_pipeline.predict_proba(
                                X_train.compute()
                            )[:, 1]
                            start_time = time.time()
                            test_predictions = best_pipeline.predict(X_test.compute())
                            test_probabilities = best_pipeline.predict_proba(
                                X_test.compute()
                            )[:, 1]
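                            prediction_time = time.time() - start_time
                            total_time += prediction_time
                            # Per-sample latency for this run only; total_time
                            # keeps accumulating across runs.
                            average_time = prediction_time / len(X_test)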

                            # Evaluate metrics
                            metrics = self.evaluate_metrics(
                                y_train.compute(),
                                train_predictions,
                                train_probabilities,
                                y_test.compute(),
                                test_predictions,
                                test_probabilities,
                            )

                            # Log parameters, metrics, and model
                            mlflow.log_params(random_search.best_params_)
                            mlflow.log_metrics(metrics)
                            mlflow.sklearn.log_model(
                                best_pipeline,
                                "model",
                                input_example=X_train.compute().iloc[0:1],
                            )
                            mlflow.log_metric("average_prediction_time", average_time)

                            # Save learning curves
                            learning_curve_path = self.plot_learning_curves(
                                best_pipeline, X_train.compute(), y_train.compute()
                            )
                            mlflow.log_artifact(
                                learning_curve_path, artifact_path="learning_curves"
                            )

                            # Mark this run as completed
                            self.completed_runs.add(run_id)
                            self.save_checkpoint()
                            print(f"Completed run: {run_id}")
                    except Exception as e:
                        print(f"Error in run {run_id}: {str(e)}")
                        continue
                    finally:
                        print("end run")
                        mlflow.end_run()  # Ensure the run is properly ended
        client.close()  # Close the Dask client once, after all datasets are processed

`Defining Search Space for Training Random Forest`

In [None]:
defined_experiment_combinations = [
    {
        "scaler": MinMaxScaler(),
        "encode": {"apply": True, "columns": ["categorical_col"]},
        "models": {"name": "RandomForest", "instance": RandomForestClassifier()},
        "params": {
            "model__n_estimators": [100, 200, 300],
            "model__max_depth": [10, 20, None],
            "model__min_samples_split": [2, 5, 10],
            "model__min_samples_leaf": [1, 2, 4],
        },
    }
]


categorical_cols_reduced = ["country_code"]
numeric_columns_reduced = [
    "order_value",
    "refund_value",
    "num_items_ordered",
    "days_since_first_order",
    "order_date_day_of_week",
    "order_date_day",
    "order_date_month",
    "order_date_year",
]
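
To get a sense of how much of this space a random search covers, the sketch below counts the combinations in the RandomForest grid and compares that with the `n_iter=10` and `cv=5` settings used in the tracker. It uses only the standard library.

```python
from itertools import product

# Search space copied from the RandomForest configuration above.
search_space = {
    "model__n_estimators": [100, 200, 300],
    "model__max_depth": [10, 20, None],
    "model__min_samples_split": [2, 5, 10],
    "model__min_samples_leaf": [1, 2, 4],
}

total_combinations = len(list(product(*search_space.values())))
n_iter = 10   # value passed to run_experiments
cv_folds = 5  # cv=5 in RandomizedSearchCV

print("total combinations:", total_combinations)
print("fraction sampled:", n_iter / total_combinations)
print("cross-validation fits:", n_iter * cv_folds)
```

Random search therefore evaluates only a subset of the grid, which keeps training time manageable but can miss the best combination.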

`Training Random Forest`

In [None]:
from ExperimentTracker2 import PhaseSixExperimentTracker
tracker = PhaseSixExperimentTracker("Final Experiment")

tracker.completed_runs

# Run experiments with checkpointing
tracker.run_experiments(
    datasets=datasets,
    experiment_combinations=defined_experiment_combinations,
    X_test=X_test,
    y_test=y_test,
    numeric_columns=numeric_columns_reduced,
    categorical_cols=categorical_cols_reduced,
    drop_columns=['payment_method', 'collect_type', 'mobile_verified'],
    n_iter=10
)

In [None]:
class PhaseSevenExperimentTracker(BaseExperimentTracker):
    def generate_run_id(self, config):
        """Generate a concise, unique identifier for a run configuration."""
        model_abbr = config["models"]["name"][:2].upper()  # Abbreviate model name
        scaler_abbr = config["scaler"].__class__.__name__.replace(
            "Scaler", ""
        )  # Simplify scaler name
        encoding_status = "Enc" if config["encode"]["apply"] else "NoEnc"
        timestamp = time.strftime("%Y%m%d%H%M")  # Add timestamp for uniqueness
        return f"{model_abbr}_{scaler_abbr}_{encoding_status}_{timestamp}_hypertuned"

    def run_experiments(
        self,
        datasets,
        X_test,
        y_test,
        experiment_combinations,
        numeric_columns,
        categorical_cols,
        n_iter=10,
    ):
        """Run experiments for Phase 7 (LightGBM tuning) on multiple datasets."""
        client = Client(
            n_workers=6,
            threads_per_worker=2,
            memory_limit="4GB",
            local_directory="/tmp/dask-worker-space",
            dashboard_address=":0",
        )
        total_time = 0
        for dataset_name, X_train, y_train in datasets:
            for config in experiment_combinations:
                if (
                    dataset_name == "dataset_LOF_ROS"
                    and config["models"]["name"] == "RandomForest"
                ) or (
                    dataset_name == "dataset_LOF"
                    and config["models"]["name"] == "LightGBM"
                ):
                    run_id = self.generate_run_id(config)
                    if run_id in self.completed_runs:
                        print(f"Skipping completed run: {run_id}")
                        continue
                    print(f"Starting run: {run_id}")
                    try:
                        with mlflow.start_run(run_name=run_id):
                            # Add descriptive run tags
                            mlflow.set_tag("dataset", dataset_name)
                            mlflow.set_tag("model_type", config["models"]["name"])
                            mlflow.set_tag(
                                "scaler_type", config["scaler"].__class__.__name__
                            )
                            mlflow.set_tag(
                                "encoding_applied", str(config["encode"]["apply"])
                            )
                            # Split dataset name
                            dataset_names = dataset_name.split("_")
                            mlflow.set_tag(
                                "outlier_technique",
                                (
                                    "LOF"
                                    if "LOF" in dataset_names
                                    else "ISO" if "ISO" in dataset_names else "None"
                                ),
                            )
                            mlflow.set_tag(
                                "resampling_method",
                                (
                                    "SMOTE"
                                    if "SMOTE" in dataset_names
                                    else (
                                        "ROS"
                                        if "ROS" in dataset_names
                                        else "RUS" if "RUS" in dataset_names else "None"
                                    )
                                ),
                            )
                            # Build preprocessing steps
                            transformers = []
                            if config["scaler"]:
                                transformers.append(
                                    ("scaler", config["scaler"], numeric_columns)
                                )
                            if config["encode"]["apply"]:
                                transformers.append(
                                    (
                                        "encoder",
                                        OneHotEncoder(
                                            handle_unknown="ignore", sparse_output=True
                                        ),
                                        categorical_cols,
                                    )
                                )
                            else:
                                transformers.append(
                                    ("drop_categorical", "drop", categorical_cols)
                                )
                            preprocessor = ColumnTransformer(
                                transformers=transformers, remainder="passthrough"
                            )
                            pipeline = Pipeline(
                                steps=[
                                    ("preprocessor", preprocessor),
                                    ("model", config["models"]["instance"]),
                                ]
                            )

                            # Perform hyperparameter tuning
                            param_distributions = config.get("params", {})
                            random_search = RandomizedSearchCV(
                                pipeline,
                                param_distributions=param_distributions,
                                n_iter=n_iter,  # Number of parameter settings that are sampled
                                cv=5,  # Number of cross-validation folds
                                scoring="f1",  # Scoring metric
                                random_state=42,
                            )

                            # Train the pipeline with hyperparameter tuning
                            y_train_series = (
                                y_train.compute().squeeze()
                            )  # Convert to Pandas Series
                            random_search.fit(X_train.compute(), y_train_series)

                            best_pipeline = random_search.best_estimator_

                            # Predictions and probabilities for train and test sets
                            train_predictions = best_pipeline.predict(X_train.compute())
                            train_probabilities = best_pipeline.predict_proba(
                                X_train.compute()
                            )[:, 1]
                            start_time = time.time()
                            test_predictions = best_pipeline.predict(X_test.compute())
                            test_probabilities = best_pipeline.predict_proba(
                                X_test.compute()
                            )[:, 1]
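                            prediction_time = time.time() - start_time
                            total_time += prediction_time
                            # Per-sample latency for this run only; total_time
                            # keeps accumulating across runs.
                            average_time = prediction_time / len(X_test)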

                            # Evaluate metrics
                            metrics = self.evaluate_metrics(
                                y_train.compute(),
                                train_predictions,
                                train_probabilities,
                                y_test.compute(),
                                test_predictions,
                                test_probabilities,
                            )

                            # Log parameters, metrics, and model
                            mlflow.log_params(random_search.best_params_)
                            mlflow.log_metrics(metrics)
                            mlflow.sklearn.log_model(
                                best_pipeline,
                                "model",
                                input_example=X_train.compute().iloc[0:1],
                            )
                            mlflow.log_metric("average_prediction_time", average_time)

                            # Save learning curves
                            learning_curve_path = self.plot_learning_curves(
                                best_pipeline, X_train.compute(), y_train.compute()
                            )
                            mlflow.log_artifact(
                                learning_curve_path, artifact_path="learning_curves"
                            )

                            # Mark this run as completed
                            self.completed_runs.add(run_id)
                            self.save_checkpoint()
                            print(f"Completed run: {run_id}")
                    except Exception as e:
                        print(f"Error in run {run_id}: {str(e)}")
                        continue
                    finally:
                        print("end run")
                        mlflow.end_run()  # Ensure the run is properly ended
        client.close()  # Close the Dask client once, after all datasets are processed

`Defining Search Space for Training LightGBM`

In [None]:
defined_experiment_combinations = [
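    {
        "scaler": StandardScaler(),
        "encode": {"apply": True, "columns": ["categorical_col"]},
        "models": {"name": "LightGBM", "instance": LGBMClassifier()},
        # Keys carry the "model__" prefix so that RandomizedSearchCV routes them
        # to the pipeline step named "model".
        "params": {
            "model__learning_rate": [0.01, 0.03, 0.05, 1],
            "model__max_depth": [3, 5, 7, 10, -1],
            # min_samples_split / min_samples_leaf are scikit-learn tree names;
            # LightGBM's leaf-size control is min_child_samples.
            "model__min_child_samples": [1, 5, 10, 20],
        },
    },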
]

categorical_cols = ["payment_method", "country_code", "collect_type"]
numeric_columns = [
    "order_value",
    "refund_value",
    "num_items_ordered",
    "days_since_first_order",
    "order_date_day_of_week",
    "order_date_day",
    "order_date_month",
    "order_date_year",
]

# Datasets to evaluate, as (name, X_train, y_train) tuples
datasets = [
    ("dataset_default", X_train, y_train),
    ("dataset_LOF", X_train_LOF, y_train_LOF),
    ("dataset_LOF_ROS", X_train_LOF_ROS, y_train_LOF_ROS),
]

`Training LightGBM`

In [None]:
from ExperimentTracker2 import PhaseSevenExperimentTracker

tracker = PhaseSevenExperimentTracker("Final Experiment")

tracker.completed_runs

# Run experiments with checkpointing
tracker.run_experiments(
    datasets=datasets,
    experiment_combinations=defined_experiment_combinations,
    X_test=X_test,
    y_test=y_test,
    numeric_columns=numeric_columns,
    categorical_cols=categorical_cols,
    n_iter=100,
)

`Results`

`First Set of Models`: Adding the best scaler on top of the best resampling and outlier-handling techniques improves model performance, so we move forward with these models.

`Second Set of Models`: With feature selection, RandomForest performs better while LightGBM does not. We therefore move ahead with RandomForest with feature selection and LightGBM without it.

`Third Set of Models`: For RandomForest, the model with PCA performs worse than the model without PCA, so we move ahead with RandomForest without PCA. For LightGBM, the model with PCA performs better, but the difference is marginal and PCA could take longer to run, so we move ahead with LightGBM without PCA.

`Fourth Set of Models`: After hyperparameter tuning, both RandomForest and LightGBM perform worse than their untuned counterparts. However, tuning reduces overfitting in RandomForest, so we take the tuned models as our best models.

<img class="text-center" src="images/results/finalexperiment.png" width="1500">

---

## **4. Choosing Best Model**

---

Comparing the tuned RandomForest and LightGBM models, we find that RandomForest performs better, so we move forward with the RandomForest model.

`Best Parameters for RandomForest`: {'n_estimators': 200, 'min_samples_split': 10, 'min_samples_leaf': 2, 'max_depth': None}

`Learning Curve for RandomForest`:

<img class="text-center" src="images/final_model/curve.png" width="1000">

`Observation`:
The training and validation curves converge and stay close to each other, which suggests the model is neither overfitting nor underfitting and generalizes well to unseen data. The validation curve also levels off, so additional data is unlikely to improve performance much, and the model appears to be near its best performance for the available data. To improve further, we could explore additional feature engineering or run a grid search for hyperparameter tuning, since tuning so far has used random search.
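
The two signals read off the plot (a small gap between the curves and a flat validation curve) can also be checked numerically. The sketch below is a minimal illustration using scikit-learn's `learning_curve` on synthetic data. The dataset, `n_estimators=50`, and `cv=3` are assumptions chosen to keep it fast; only `min_samples_split` and `min_samples_leaf` come from the best parameters above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic, imbalanced stand-in data (assumption, not the SP-Buy dataset).
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=42
)

model = RandomForestClassifier(
    n_estimators=50,  # reduced from 200 for speed (assumption)
    min_samples_split=10,
    min_samples_leaf=2,
    random_state=42,
)

train_sizes, train_scores, val_scores = learning_curve(
    model, X, y, cv=3, scoring="f1", train_sizes=np.linspace(0.25, 1.0, 4)
)

# Average over the cross-validation folds at each training size.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

# Gap at the largest training size: a small gap suggests little overfitting.
final_gap = train_mean[-1] - val_mean[-1]
# Validation gain over the last step: a value near zero suggests a plateau.
last_step_gain = val_mean[-1] - val_mean[-2]

print("train sizes:", train_sizes)
print("final train/validation gap:", round(float(final_gap), 3))
print("validation gain over last step:", round(float(last_step_gain), 3))
```

What counts as a "small" gap or a "flat" curve is a judgment call that depends on the metric and the cost of errors, so these numbers complement the plot rather than replace it.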