# 3.0.0. Building a Baseline Classifier Model

### Methodology

In this section, we train three baseline classifiers on an imbalanced dataset that contains null values and nonlinear relationships among features. The models selected for this initial analysis are Logistic Regression, LightGBM (LGBM), and XGBoost (XGBM), for the following reasons:
1. Logistic Regression
   Logistic regression is a linear model that estimates probabilities using a logistic function. 
    - Pros:
        - Simplicity and Interpretability: Easy to implement and results are interpretable.
        - Efficient Training: Less computationally intensive compared to tree-based models.
    - Cons:
        - Handling of Non-Linear Features: Performs poorly if relationships between features and the target are non-linear unless extensive feature engineering is done.
        - Requirement for Feature Scaling: With regularization, coefficients are penalized according to feature magnitude, so features should be scaled to comparable ranges for the model to perform well.
        - Cannot Handle Missing Values Directly: Requires complete data or imputation of missing values before training.
        
2. LightGBM (LGBM)
    LightGBM is a gradient boosting framework that uses tree-based learning algorithms.
    - Pros:
        - Scalability: Works well with large datasets and supports GPU learning.
        - Performance: Generally provides high performance, especially on datasets where the relationship between variables is complex and non-linear.
        - Efficiency with Large Datasets: Optimized to run faster and use less memory compared to other gradient boosting frameworks.
        - Handling of Missing Values: Natively handles missing values without requiring imputation.
        - Robust to Feature Scaling: Tree splits depend on the ordering of feature values rather than their magnitude, so feature normalization is generally unnecessary.
    - Cons:
        - Overfitting: Prone to overfitting, especially with small data sets.
        - Parameter Tuning: Requires careful tuning of parameters and sometimes extensive hyperparameter optimization.

3. XGBoost (XGBM)
   XGBoost is also a gradient boosting framework; it is known for parallel processing, tree pruning, native handling of missing values, and built-in regularization to limit overfitting.
    - Pros:
        - Handling Irregularities: Good at handling missing values and various data irregularities.
        - Model Strength: Regularization helps to prevent overfitting and provides robust performance across various types of data.
    - Cons:
        - Computational Intensity: Can be resource-intensive, especially with large datasets and many deep trees.
        - Complexity: More parameters to tune compared to other models, which can make it harder to find the right model configuration.


These are the key parameters configured for each model:
1. Logistic Regression Parameters:
    - penalty: 'l2' (L2 regularization to prevent overfitting)
    - C: 1.0 (Inverse of regularization strength; smaller values specify stronger regularization)
2. LightGBM Parameters:
    - num_leaves: 31 (Maximum number of leaves in one tree; controls model complexity)
3. XGBoost Parameters:
    - max_depth: 6 (Maximum depth of a tree; deeper trees can capture more complex relationships)
    - subsample: 0.8 (Subsample ratio of the training instances; helps prevent overfitting)
    - colsample_bytree: 0.8 (Subsample ratio of columns when constructing each tree; provides feature sampling)

We evaluated the models using ROC AUC Score, PR AUC Score, and the Kolmogorov-Smirnov (KS) statistic to gauge their ability to distinguish between classes and predict the minority class in an unbalanced dataset.
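
To make the KS statistic concrete, the following is a minimal NumPy sketch that computes it as the largest gap between the empirical CDFs of the scores for the two classes. The function name `ks_statistic` and the toy scores are illustrative assumptions, not part of the notebook; the notebook itself relies on `scipy.stats.ks_2samp`.

```python
import numpy as np


def ks_statistic(y_true: np.ndarray, scores: np.ndarray) -> float:
    """Largest vertical gap between the empirical score CDFs of the two classes."""
    neg = np.sort(scores[y_true == 0])
    pos = np.sort(scores[y_true == 1])
    # Evaluate both empirical CDFs at every distinct observed score.
    grid = np.unique(scores)
    cdf_neg = np.searchsorted(neg, grid, side="right") / neg.size
    cdf_pos = np.searchsorted(pos, grid, side="right") / pos.size
    return float(np.max(np.abs(cdf_neg - cdf_pos)))


# Toy data (hypothetical): positives tend to receive higher scores.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9])

ks = ks_statistic(y_true, scores)
ks_constant = ks_statistic(y_true, np.zeros(8))  # identical scores carry no separation
print(ks, ks_constant)
```

A KS value near 0 means the score distributions of the two classes overlap almost completely, while a value near 1 means they are almost fully separated.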

### Conclusion
The following results were observed across the models, showing the pros and cons of each model:

| Model                | ROC AUC Score | PR AUC Score | KS Statistic | Handles Nulls | Sensitive to Scale | Overfitting Risk | Computational Intensity |
|----------------------|---------------|--------------|--------------|---------------|--------------------|------------------|-------------------------|
| Logistic Regression  | 0.499407      | 0.19943      | 0.001186     | No            | Yes                | Low              | Low                     |
| LightGBM (LGBM)      | 0.573764      | 0.246792     | 0.155211     | Yes           | No                 | Medium           | Medium                  |
| XGBoost (XGBM)       | 0.599619      | 0.269709     | 0.20405      | Yes           | No                 | High             | High                    |


Based on the evaluation metrics, XGBM outperforms the other models with the highest ROC AUC, PR AUC, and KS scores, making it the most suitable baseline for this dataset, which combines class imbalance with non-linear relationships.

In [1]:
import pandas as pd
import numpy as np
import yaml
from pathlib import Path
from scipy.stats import ks_2samp
from sklearn import metrics
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import MinMaxScaler
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

In [2]:
def calculate_metrics(y_test: np.ndarray, preds: np.ndarray) -> dict:
    """
    Calculates several key performance metrics for evaluating a classification model.

    This function computes the following metrics:
    - ROC AUC Score: The area under the Receiver Operating Characteristic curve, useful for assessing the 
      overall effectiveness of the predictions with respect to the true outcomes.
    - PR AUC Score: The area under the Precision-Recall curve, useful for datasets with a significant imbalance.
    - Kolmogorov-Smirnov Statistic (KS): Measures the degree to which the distributions of the predicted 
      probabilities of the positive and negative classes differ.

    Parameters:
    - y_test (np.ndarray): An array containing the actual binary labels of the test data (0 or 1).
    - preds (np.ndarray): An array containing the predicted probabilities corresponding to the likelihood 
      of the positive class (class label 1).

    Returns:
    - dict: A dictionary containing the calculated metrics:
        * 'roc_auc_score': float, representing the ROC AUC score.
        * 'pr_auc': float, representing the precision-recall AUC score.
        * 'ks': float, representing the Kolmogorov-Smirnov statistic.
    """
    
    metrics_dict = {
        'roc_auc_score': metrics.roc_auc_score(y_true=y_test, y_score=preds),
        'pr_auc': metrics.average_precision_score(y_true=y_test, y_score=preds),
        'ks': ks_2samp(preds[y_test == 0], preds[y_test == 1])[0],
    }

    return metrics_dict
    
     
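def get_logistic_regression_pipeline(numeric_features, **model_parameters):
    """
    Builds a pipeline that imputes missing numeric values with 0, scales them to [0, 1],
    and fits a logistic regression on the result.

    Note: relies on the module-level `split_seed`, which is defined in the data-loading cell.
    """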
    numeric_transformer = make_pipeline(
        SimpleImputer(strategy="constant", fill_value=0), 
        MinMaxScaler()
    )
    
    preprocessor = ColumnTransformer(
        transformers=[
            ("num", numeric_transformer, numeric_features)
        ]
    )
    
    pipe = Pipeline(
        steps=[
            ("preprocessor", preprocessor),
            ("classifier", LogisticRegression(**model_parameters, random_state=split_seed))
        ]
    )
    return pipe
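
As a rough illustration of what the numeric preprocessing in this pipeline does, the NumPy sketch below fills missing values with 0 and then rescales each column to [0, 1]. It is meant to mirror `SimpleImputer(strategy="constant", fill_value=0)` followed by `MinMaxScaler()` on a toy matrix, under the assumption that no column is constant; the function name `impute_and_scale` is illustrative.

```python
import numpy as np


def impute_and_scale(X: np.ndarray) -> np.ndarray:
    """Fill NaNs with 0, then min-max scale each column to the range [0, 1]."""
    filled = np.where(np.isnan(X), 0.0, X)
    col_min = filled.min(axis=0)
    col_max = filled.max(axis=0)
    # Assumes every column has max > min; a constant column would divide by zero.
    return (filled - col_min) / (col_max - col_min)


# Toy matrix (hypothetical) with one missing value per column.
X = np.array([
    [10.0, np.nan],
    [20.0, 4.0],
    [np.nan, 8.0],
])

X_scaled = impute_and_scale(X)
print(X_scaled)
```

In this toy case the imputed zeros become the column minimum, so missing entries are mapped to 0 after scaling, below every observed value.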


### 1. Load Data

In [3]:
with open("config.yaml", "r") as f:
    config = yaml.safe_load(f)
    
numeric_features = config["features"]["numerical"]
features = numeric_features
target = config["main"]["target"]
data_train_path = Path.cwd().parent / config["main"]["data_train_path"]
train_validation_path = Path.cwd().parent / config["main"]["data_validation_path"]

train_df = pd.read_pickle(data_train_path)
validation_df = pd.read_pickle(train_validation_path)
split_seed = config["main"]["random_seed"]
model_parameters = config["model_parameters"]

model_parameters

{'logistic_regression': {'penalty': 'l2', 'C': 1.0},
 'lgbm': {'objective': 'binary',
  'boosting_type': 'gbdt',
  'metric': 'auc',
  'num_leaves': 31,
  'learning_rate': 0.05,
  'feature_fraction': 0.8,
  'bagging_fraction': 0.8,
  'bagging_freq': 5,
  'is_unbalance': True},
 'xgbm': {'objective"': 'binary:logistic',
  'booster"': 'gbtree',
  'eval_metric"': 'auc',
  'eta': 0.01,
  'gamma': 0.1,
  'max_depth': 6,
  'min_child_weight': 3,
  'subsample': 0.8,
  'colsample_bytree': 0.8,
  'scale_pos_weight': 4,
  'lambda': 1,
  'alpha': 0.1,
  'max_delta_step': 1,
  'n_estimators': 100}}

In [4]:
X_train, Y_train = train_df[features], train_df[target]
X_val, Y_val = validation_df[features], validation_df[target]
X_train.shape

(9479, 142)

### 2. Train logistic regression model

In [5]:
model_results = {}

In [6]:
params = config["model_parameters"]['logistic_regression']

pipe = get_logistic_regression_pipeline(numeric_features, **params) 
pipe.fit(X_train, Y_train)

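# NOTE: predict() returns hard 0/1 labels, so the metrics below are computed on labels, not probabilities.
# pipe.predict_proba(X_val)[:, 1] would give scores comparable to those used for the tree-based models.
logist_preds = pipe.predict(X_val[features])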

model_results["logist"] = calculate_metrics(Y_val, logist_preds)

{'logist': {'roc_auc_score': 0.49940688018979834,
  'pr_auc': 0.19943019943019943,
  'ks': 0.0011862396204033216}}
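
The near-chance scores above are consistent with scoring hard class labels rather than probabilities: `predict` returns 0/1 labels, whereas ranking metrics such as ROC AUC need continuous scores. The sketch below illustrates the effect with a rank-based AUC written in plain NumPy; the helper `rank_auc`, the toy labels, and the 0.5 threshold are assumptions for illustration, not the notebook's code.

```python
import numpy as np


def rank_auc(y_true: np.ndarray, scores: np.ndarray) -> float:
    """ROC AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


# Toy validation set (hypothetical): 2 positives, 8 negatives.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# Probabilities that rank the positives well but never cross 0.5.
probs = np.array([0.05, 0.10, 0.12, 0.15, 0.18, 0.20, 0.22, 0.30, 0.28, 0.45])

# What a label-returning predict would produce at a 0.5 threshold: all zeros.
hard_labels = (probs >= 0.5).astype(int)

auc_from_probs = rank_auc(y_true, probs)
auc_from_labels = rank_auc(y_true, hard_labels)
print(auc_from_probs, auc_from_labels)
```

When every prediction collapses to the majority class, all positive/negative pairs are tied and the AUC falls to 0.5 regardless of how well the underlying probabilities rank the samples.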

### 3. Train LightGBM model

In [7]:
params = config["model_parameters"]["lgbm"]
lgbm_model = LGBMClassifier(**params, random_state=split_seed)
lgbm_model.fit(X_train, Y_train)
lgbm_preds = lgbm_model.predict_proba(X_val)[:, 1]

model_results["lgbm"] = calculate_metrics(Y_val, lgbm_preds)

[LightGBM] [Info] Number of positive: 1839, number of negative: 7640
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004577 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 22286
[LightGBM] [Info] Number of data points in the train set: 9479, number of used features: 124
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.194008 -> initscore=-1.424176
[LightGBM] [Info] Start training from score -1.424176


{'logist': {'roc_auc_score': 0.49940688018979834,
  'pr_auc': 0.19943019943019943,
  'ks': 0.0011862396204033216},
 'lgbm': {'roc_auc_score': 0.5737643337287466,
  'pr_auc': 0.24679150727049412,
  'ks': 0.15521098118962887}}
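
The `pavg` and `initscore` values in the LightGBM log can be reproduced from the class counts it reports: `pavg` matches the positive rate, and the initial score matches its log-odds. The short check below uses only the counts printed in the log; the variable names are illustrative.

```python
import math

# Class counts as reported in the LightGBM log.
n_pos, n_neg = 1839, 7640

pavg = n_pos / (n_pos + n_neg)             # share of positives in the training set
initscore = math.log(pavg / (1.0 - pavg))  # log-odds of the positive rate
neg_pos_ratio = n_neg / n_pos              # imbalance ratio, roughly 4.15

print(f"pavg={pavg:.6f} initscore={initscore:.6f} neg/pos={neg_pos_ratio:.3f}")
```

The negative-to-positive ratio of roughly 4.15 is close to the `scale_pos_weight` of 4 set in the XGBoost configuration.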

### 4. Train XGBoost model

In [10]:
params = config["model_parameters"]["xgbm"]
xgbm_model = XGBClassifier(missing=np.nan, **params, random_state=split_seed)

xgbm_model.fit(X_train, Y_train)
xgbm_preds = xgbm_model.predict_proba(X_val)[:, 1]

model_results["xgbm"] = calculate_metrics(Y_val, xgbm_preds)

Parameters: { "booster"", "eval_metric"", "objective"" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.
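
The warning names `booster"`, `eval_metric"`, and `objective"`, each with a trailing double quote, which matches the keys printed in `model_parameters` earlier, so these three settings are probably not being recognized as parameters. One possible way to normalize such keys before building the model is sketched below; the helper `strip_stray_quotes` is hypothetical, and correcting the keys in `config.yaml` would be the more direct fix.

```python
def strip_stray_quotes(params: dict) -> dict:
    """Return a copy of params with stray quote characters removed from the keys."""
    return {key.strip("\"'"): value for key, value in params.items()}


# Subset of the 'xgbm' entry as printed by the notebook, including the malformed keys.
raw_params = {
    'objective"': "binary:logistic",
    'booster"': "gbtree",
    'eval_metric"': "auc",
    "eta": 0.01,
    "max_depth": 6,
}

clean_params = strip_stray_quotes(raw_params)
print(sorted(clean_params))
```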




In [9]:
pd.DataFrame(model_results)

| Metric        | logist   | lgbm     | xgbm     |
|---------------|----------|----------|----------|
| roc_auc_score | 0.499407 | 0.573764 | 0.599619 |
| pr_auc        | 0.19943  | 0.246792 | 0.269709 |
| ks            | 0.001186 | 0.155211 | 0.20405  |
