# Model devepment

In this document we develop and compare different models for our model devepment. We have the following sections:

1. Model creation
2. Model evaluation
3. Model implementation on test data

Note that for model creation instead of running the code each time one can load the best model.



### Import libraries

In [None]:
from flask_gui.preprocessing import preprocessor
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import classification_report, roc_auc_score
from scipy.stats import uniform, randint
import joblib

### Data collection

In [9]:
# Data collection
total_df = pd.read_csv('./Data/Base.csv')

# Define features (X) and target (y)
X = total_df.drop(columns=['fraud_bool'])
y = total_df['fraud_bool']

# Split the data into training and test sets using stratified sampling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

print("Data has been loaded")

Data has been loaded


In [10]:
preprocessed_data = preprocessor.fit_transform(X)  # Use transform, not fit_transform


## 1. Model Creation

### 1a. Hyperparameter specification

We decided to use Logistic Regression, Random Forests, Support Vector Classifier, KNN, Gradient Boosting, XGBoost, LightGBM and Naive Bayes. The following hyperparameters were what was decided to be best.

In [11]:
# Pipeline with placeholder classifier
pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', LogisticRegression())])

# Define models and their hyperparameters
models = {
    'Logistic Regression': (
        LogisticRegression(class_weight='balanced', random_state=42, max_iter=1000),
        {'classifier__C': uniform(0.01, 10)}
     ), 
    'Random Forest': (
        RandomForestClassifier(class_weight='balanced', random_state=42),
        {
            'classifier__n_estimators': randint(50, 150),
            'classifier__max_depth': randint(3, 10)
        }
    ),
    'SVC': (
        SVC(class_weight='balanced', probability=True, random_state=42),
        {
            'classifier__C': uniform(0.01, 10),
            'classifier__kernel': ['linear', 'rbf']
        }
    ),
    'KNN': (
        KNeighborsClassifier(),
        {'classifier__n_neighbors': randint(3, 10)}
    ),
    'Gradient Boosting': (
        GradientBoostingClassifier(random_state=42),
        {
            'classifier__n_estimators': randint(50, 150),
            'classifier__learning_rate': uniform(0.01, 0.2)
        }
    ),
    'XGBoost': (
        XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='logloss'),
        {
            'classifier__n_estimators': randint(50, 150),
            'classifier__learning_rate': uniform(0.01, 0.2)
        }
    ),
    'LightGBM': (
        LGBMClassifier(random_state=42),
        {
            'classifier__n_estimators': randint(50, 150),
            'classifier__learning_rate': uniform(0.01, 0.2)
        }
    ),
    'Naive Bayes': (
        GaussianNB(),
        {}  # No hyperparameters for Naive Bayes
    )
}

### 1b. Hyperparameter searching 

The next step is to do the hyperparameter search and we decided to do 3 random searches per model to keep the time complexity low. This code takes a lot of time to run but is a sacrifice our computers are willing to take.

### (WARNING: DON'T RUN CELL, LOAD SEARCH INSTEAD!!!)

In [12]:
# Stratified K-Fold Cross-Validation
stratified_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Number of random searches per model
n_iter_per_model = 3
best_models = []

import joblib

# Dictionary to store all RandomizedSearchCV objects
search_results = {}

# Iterate through each model
for name, (model, params) in models.items():
    print(f"Running RandomizedSearchCV for {name}...")

    # Create pipeline
    pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', model)])

    # Create RandomizedSearchCV
    search = RandomizedSearchCV(
        pipeline, 
        param_distributions=params,
        n_iter=n_iter_per_model,
        cv=stratified_cv,
        n_jobs=-1,
        random_state=42,
        scoring='roc_auc'
    )

    # Fit the model
    search.fit(X_train, y_train)

    # Store the search object in the dictionary
    search_results[name] = search
    

Running RandomizedSearchCV for Logistic Regression...


TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.


### 1c. Search saving and loading

Save search:

In [22]:
# Save the RandomizedSearchCV object for the best model
search_filename = f"search_results.joblib"
joblib.dump(search_results, "search_results.joblib")
print(f"RandomizedSearchCV object saved as {search_filename}")

RandomizedSearchCV object saved as search_results.joblib


Load search:

In [23]:
search_results = joblib.load(f"search_results.joblib")

## 2. Model evaluation

### 2a. Random search evaluation

Due to the biased data set we use the auc roc score to evaluate different models. We start by printing the best model of each model class for our search. 

In [24]:
# Create a list for storing model information
results_summary = []

# Collect the best model, score, and parameters
for model_name, search in search_results.items():
    best_score = search.best_score_
    best_params = search.best_params_
    results_summary.append({
        'Model': model_name,
        'Best Score (AUC)': f"{best_score:.4f}",
        'Best Parameters': best_params
    })

# Convert to a DataFrame and sort by AUC score
results_df = pd.DataFrame(results_summary).sort_values(by='Best Score (AUC)', ascending=False)

# Display the DataFrame in Jupyter
from IPython.display import display

display(results_df)

KeyError: 'Best Score (AUC)'

We can see that the best performing model on the trainig data is "TODO" with auc_roc score being "TODO". Now we will evaluate this model closer in the following section

### 2b. Best model evaluation

We begin the evaluation of the best model by extracting it from search_results.

In [25]:
# Initialize variables to track the best model
best_model_name = None
best_model_score = -float('inf')
best_model_params = None
best_model_object = None
best_classifier = None

# Iterate through the search results to find the best model
for model_name, search in search_results.items():
    if search.best_score_ > best_model_score:
        best_model_name = model_name
        best_model_score = search.best_score_
        best_model_params = search.best_params_
        best_model_object = search.best_estimator_

        # Extract the classifier from the pipeline
        best_classifier = best_model_object.named_steps['classifier']

# Print the best model details
print(f"Best Model Name: {best_model_name}")
print(f"Best Model Score (AUC): {best_model_score:.4f}")
print(f"Best Model Parameters: {best_model_params}")

# Print the best classifier object
print(f"Best Classifier Object: {best_classifier}")
joblib.dump(best_classifier, "best_model.joblib")


Best Model Name: None
Best Model Score (AUC): -inf
Best Model Parameters: None
Best Classifier Object: None


['best_model.joblib']

Next we evaluate the model on the training data. This gives:

### Ruibin: TODO

## 3. Test data evaluation

In [None]:
# Evaluate the best model
y_pred = best_classifier.predict(X_test)
y_pred_proba = best_classifier.predict_proba(X_test)[:, 1] if hasattr(best_model_name.named_steps['classifier'], 'predict_proba') else y_pred

# Print the best model and its parameters
print(f"\nBest Model: {best_model_name}")
print(f"Best Cross-Validation AUC Score: {best_score:.4f}")

# Print classification report
print(classification_report(y_test, y_pred))

# Calculate and print AUC score on the test set
auc_score = roc_auc_score(y_test, y_pred_proba)
print(f"AUC Score on Test Set: {auc_score:.4f}")

NameError: name 'X_test' is not defined

# Michael's code

In [None]:
# Data collection
total_df = pd.read_csv('./Data/Base.csv')

# Split the data into training and test sets using stratified sampling
train_df, test_df = train_test_split(total_df, test_size=0.2, stratify=total_df['fraud_bool'], random_state=42)

In [None]:
# Create numerical dataframe
num_df = train_df.select_dtypes(include=['float64']).drop(columns=['income', 'proposed_credit_limit'])

# Create categorical dataframe
cat_df = train_df.select_dtypes(include=['int64', 'object']).copy()
cat_df[['income', 'proposed_credit_limit']] = train_df[['income', 'proposed_credit_limit']]
cat_df['income'] = cat_df['income'].round(1)
cat_df['proposed_credit_limit'] = cat_df['proposed_credit_limit'].round(0).astype('int64')


highcard_df = cat_df[[col for col in cat_df.columns if cat_df[col].nunique() > 12]]
lowcard_df = cat_df[[col for col in cat_df.columns if (cat_df[col].nunique() <= 12) & (cat_df[col].nunique() > 2)]]
bool_df = cat_df[[col for col in cat_df.columns if cat_df[col].nunique() <= 2]]


In [None]:
from preprocessingwip import numerical_pipeline
from preprocessingwip import low_card_pipeline
from preprocessingwip import high_card_pipeline
from preprocessingwip import boolean_pipeline
from preprocessingwip import resampling_pipeline
from preprocessingwip import variance_threshold_test
from preprocessingwip import scaling_and_selection_pipeline
from preprocessingwip import feature_selection

transformed_num = numerical_pipeline.fit_transform(num_df)
transformed_highcard = high_card_pipeline.fit_transform(highcard_df)
transformed_lowcard = low_card_pipeline.fit_transform(lowcard_df)
# transformed_boolean = boolean_pipeline.fit_transform(bool_df)

In [None]:
# SMOTE RESAMPLING
t_num_X_resampled, t_num_y_resampled = resampling_pipeline(transformed_num, train_df.iloc[:, 0])
t_highcard_X_resampled, t_highcard_y_resampled = resampling_pipeline(transformed_highcard, train_df.iloc[:, 0])
t_lowcard_X_resampled, t_lowcard_y_resampled = resampling_pipeline(transformed_lowcard, train_df.iloc[:, 0])
# t_boolean_X_resampled, t_boolean_y_resampled = resampling_pipeline(transformed_boolean, train_df.iloc[:, 0]) 

Test dataset samples per class Counter({0: 791177, 1: 8823})
After SMOTE resampling dataset shape Counter({0: 791177, 1: 791177})
After NearMiss resampling dataset shape Counter({0: 791177, 1: 791177})
Test dataset samples per class Counter({0: 791177, 1: 8823})
After SMOTE resampling dataset shape Counter({0: 791177, 1: 791177})
After NearMiss resampling dataset shape Counter({0: 791177, 1: 791177})
Test dataset samples per class Counter({0: 791177, 1: 8823})
After SMOTE resampling dataset shape Counter({0: 791177, 1: 791177})


KeyboardInterrupt: 

In [None]:
scaling_and_selection_pipeline(t_num_X_resampled, t_num_y_resampled)
scaling_and_selection_pipeline(t_highcard_X_resampled, t_highcard_y_resampled)
scaling_and_selection_pipeline(t_lowcard_X_resampled, t_lowcard_y_resampled)

TypeError: 'Pipeline' object is not callable

In [None]:
feature_selection(t_num_X_resampled, t_num_y_resampled)
feature_selection(t_highcard_X_resampled, t_highcard_y_resampled)
feature_selection(t_lowcard_X_resampled, t_lowcard_y_resampled)
feature_selection(t_boolean_X_resampled, t_boolean_y_resampled)