# Basic and Baseline Model Analysis

In this notebook, we will:

- Create a Random Model: This model will predict the target label based solely on the average percentage distribution of classes in the entire dataset.
- Build Baseline Models: Using the initially pre-processed data with mean imputation of missing values, we will construct three baseline models:
  - Logistic Regression
  - Decision Tree
  - LightGBM

## Objective

The goal of this exercise is to establish baseline performance metrics, allowing us to compare these initial models with future models that incorporate more advanced transformations and methods. This comparison will help us determine if further enhancements lead to statistically significant improvements in accuracy and model stability.

In [83]:
import pandas as pd
import polars as pl
import numpy as np

import os

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier

from sklearn.model_selection import GroupKFold, cross_val_score, GridSearchCV, RandomizedSearchCV, train_test_split, KFold
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, roc_auc_score, precision_score, recall_score, accuracy_score, f1_score
from sklearn.impute import SimpleImputer

import warnings
warnings.filterwarnings('ignore', category=UserWarning)

In [4]:
#  Set the working directory
os.chdir('c:/Users/laura/OneDrive/Documentos/Personal Documents/Universidad/DSE CCNY/Courses Semester 2/Applied ML/Final Project/machine-learning-dse-i210-final-project-credit-risk/notebooks')
# Set data paths
data_dir = 'c:/Users/laura/OneDrive/Documentos/Personal Documents/Universidad/DSE CCNY/Courses Semester 2/Applied ML/Final Project/new_aggs/new_aggs/'
data_base = 'C:/Users/laura/OneDrive/Documentos/Personal Documents/Universidad/DSE CCNY/Courses Semester 2/Applied ML/Project_final/home-credit-credit-risk-model-stability/csv_files/train/train_base.csv'

# Load the data

For this baseline model we will use the union of the personal and non-personal data after preprocessing. At this stage we use only a subset of it; the same code will later be run on the complete dataset to obtain the final scores. For the percentage-based model, we take the share of positive cases from the entire dataset, which is 3%.


In [49]:
# Load the .pkl file for the personal data
file1 = data_dir + 'df1.pkl'
df = pd.read_pickle(file1)

# Load the .pkl file for the non personal data
file2 = data_dir + 'df2.pkl'
df2 = pd.read_pickle(file2)

In [50]:
# Join the applicant's personal data with the non-personal data to get the complete dataset.
df_full = df.merge(df2, on=['case_id', 'date_decision', 'WEEK_NUM'], how='left')

In [15]:
# Confirm the dataset was properly merged by checking the shape
df_full.shape

(10000, 2640)

In [27]:
df_full.head()

Unnamed: 0,case_id,date_decision,MONTH_x,WEEK_NUM,target_x,empls_employedfrom_796D_distinct_x,empls_employedfrom_796D_min_year_x,empls_employedfrom_796D_min_month_x,empls_employedfrom_796D_min_day_x,empls_employedfrom_796D_max_year_x,...,pmts_pmtsoverdue_635A_median_y,pmts_dpdvalue_108P_sum_y,pmts_pmtsoverdue_635A_sum_y,pmts_date_1107D_distinct_y,pmts_date_1107D_min_year_y,pmts_date_1107D_min_month_y,pmts_date_1107D_min_day_y,pmts_date_1107D_max_year_y,pmts_date_1107D_max_month_y,pmts_date_1107D_max_day_y
0,1488310,2019-08-14,201908,32,0,1.0,,,,,...,,,,,,,,,,
1,13904,2019-05-06,201905,17,0,1.0,,,,,...,,,,,,,,,,
2,783503,2019-08-28,201908,34,1,1.0,,,,,...,,,,,,,,,,
3,17986,2019-06-09,201906,22,1,1.0,,,,,...,,,,,,,,,,
4,1400855,2019-06-13,201906,23,0,1.0,,,,,...,,,,,,,,,,


In [52]:
# Rename target_x to target; the join duplicated the target column
df_full.rename(columns={'target_x': 'target'}, inplace=True)

# Drop target_y, which is just a duplicate of the target
df_full.drop(columns=['target_y'], inplace=True)

# Extra Data Cleaning

We further clean the data so that it can be fed to the models listed above.
- Boolean columns: `None` values are filled with `False`, and the columns are then converted to boolean type using `.astype(bool)`.
- Object columns: these contain `None`, `True`, or `False`, which are replaced with `np.nan`, `1.0`, and `0.0` respectively and converted to float.

In [53]:
# Convert date_decision to timestamp
df_full["date_decision"] = pd.to_datetime(df_full["date_decision"]).astype('int64') / 10**9

# Get boolean columns from df_full
bool_columns = df_full.select_dtypes(include=['bool']).columns.tolist()

for col in bool_columns:
    df_full[col] = df_full[col].fillna(False).astype(bool)

In [54]:
# Get object columns from df_full
object_columns = df_full.select_dtypes(include=['object']).columns.tolist()

for col in object_columns:
    df_full[col] = df_full[col].replace({None: np.nan, True: 1.0, False: 0.0})


## Train and Validation Split

We separate the target from the features and then split the dataset into train and validation sets. The split is stratified on the target (`stratify=y`), so both sets keep the same share of positive cases. No other class balancing is applied, because the baseline is not meant to address the imbalance.

In [74]:
# Train and Validation split
base = df_full[["case_id", "WEEK_NUM", "target"]]
X = df_full.drop(columns=["case_id", "WEEK_NUM", "target"])
y = df_full["target"]

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.7, random_state=1, stratify=y)

# Prepare base_train and base_valid (label-based lookup; copied so "score" can be added later)
base_train = base.loc[X_train.index].copy()
base_valid = base.loc[X_valid.index].copy()

print(f"Train: {X_train.shape}, y_train: {y_train.shape}")
print(f"Valid: {X_valid.shape}, y_valid: {y_valid.shape}")

Train: (7000, 2635), y_train: (7000,)
Valid: (3000, 2635), y_valid: (3000,)


# Basic Model Predictor based on target percentage - Weighted Random Chance

In [64]:
# Percentage of target = 1 in the complete dataset (hardcoded)
percentage_target_1 = 3
print(f"Percentage of target = 1: {percentage_target_1:.2f}%")


def model_percentage(data, percentage):
    random_numbers = np.random.rand(len(data))
    predictions = (random_numbers < (percentage / 100)).astype(int)
    return predictions

Percentage of target = 1: 3.00%


## Score Metrics Basic Model

We hardcode the target positive class average to 3% which comes from the complete dataset. 

In [75]:
# Apply the model to base_valid
y_valid_pred = model_percentage(base_valid, percentage_target_1)

# Evaluate the model
auc_score = roc_auc_score(y_valid, y_valid_pred)
precision = precision_score(y_valid, y_valid_pred)
recall = recall_score(y_valid, y_valid_pred)
accuracy = accuracy_score(y_valid, y_valid_pred)
f1 = f1_score(y_valid, y_valid_pred)

# Print the evaluation metrics
print(f"AUC: {auc_score:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"Accuracy: {accuracy:.4f}")
print(f"F1 Score: {f1:.4f}")

AUC: 0.5032
Precision: 0.1553
Recall: 0.0399
Accuracy: 0.8427
F1 Score: 0.0635


The AUC score reflects the model's ability to distinguish between classes. A score closer to 0.5 indicates that the model performs no better than random guessing. Given that this is a random model based on the target's percentage, an AUC score near 0.5 is expected and confirms the model's lack of discriminatory power.
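
To make the AUC reading concrete, here is a minimal sketch that uses only the standard library. `pairwise_auc` is a hypothetical helper written for this illustration, not the scikit-learn implementation: it computes AUC as the share of positive/negative pairs in which the positive case receives the higher score, counting ties as half. Because the random model emits hard 0/1 labels rather than probabilities, its AUC reduces to the average of the true positive rate and the true negative rate. The toy labels below are made up.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (made up): 4 positives, 6 negatives, hard 0/1 predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]

# True positive rate and true negative rate of the hard predictions.
tpr = sum(p for t, p in zip(y_true, y_pred) if t == 1) / y_true.count(1)
tnr = sum(1 - p for t, p in zip(y_true, y_pred) if t == 0) / y_true.count(0)

auc = pairwise_auc(y_true, y_pred)
balanced = (tpr + tnr) / 2
print(f"pairwise AUC = {auc:.4f}, (TPR + TNR) / 2 = {balanced:.4f}")
```

Both numbers agree, which is why an AUC computed on 0/1 predictions tends to sit close to 0.5 whenever the predictions carry no information about the target.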

Precision measures the proportion of true positives among all positive predictions. Because a random model's positive predictions are independent of the true labels, its expected precision equals the prevalence of the positive class in the evaluated data, which is why it is low here.

Recall measures the proportion of actual positives that are correctly identified by the model. For a random model, the expected recall equals the rate at which it predicts the positive class (3% here), regardless of the actual prevalence of the positive class in the dataset.

Accuracy measures the proportion of all correct predictions (both true positives and true negatives) out of the total predictions. For a random model, the accuracy is driven by the class distribution in the dataset. Here the model predicts the majority class about 97% of the time, so its accuracy ends up close to the proportion of the majority class.

The F1 score is the harmonic mean of precision and recall, providing a balance between the two. For this model it is low, as expected when the predictions follow no pattern related to the target.
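
The expected behaviour of these metrics can be checked with a small simulation. This is a sketch under stated assumptions, using only the standard library: `weighted_random_metrics` is a hypothetical helper, the prevalence of 0.1337 is taken from the `pavg=0.133714` line that LightGBM logs for the training split later in this notebook, and the positive prediction rate of 0.03 is the hardcoded 3%. Labels and predictions are drawn independently, as in the weighted random model.

```python
import random


def weighted_random_metrics(prevalence, positive_rate, n=200_000, seed=0):
    """Simulate a predictor that says 'positive' with a fixed probability,
    independently of the true label, and return (precision, recall, accuracy)."""
    rng = random.Random(seed)
    tp = fp = fn = tn = 0
    for _ in range(n):
        actual = rng.random() < prevalence        # true label
        predicted = rng.random() < positive_rate  # random guess
        if actual and predicted:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / n
    return precision, recall, accuracy


precision, recall, accuracy = weighted_random_metrics(prevalence=0.1337, positive_rate=0.03)
print(f"precision ~ {precision:.3f}, recall ~ {recall:.3f}, accuracy ~ {accuracy:.3f}")
```

In this simulation precision settles near the prevalence, recall near the prediction rate, and accuracy near the share of the majority class, which is in line with the values reported above for the validation set.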

In [87]:
def gini_stability(base, w_fallingrate=88.0, w_resstd=-0.5):
    def safe_roc_auc_score(y_true, y_score):
        """ Compute ROC AUC score only if there are two unique values in y_true """
        if len(set(y_true)) < 2:
            return 0 
        else:
            return roc_auc_score(y_true, y_score)

    gini_in_time = base.loc[:, ["WEEK_NUM", "target", "score"]]\
        .sort_values("WEEK_NUM")\
        .groupby("WEEK_NUM")[["target", "score"]]\
        .apply(lambda x: 2 * safe_roc_auc_score(x["target"], x["score"]) - 1).tolist()

    x = np.arange(len(gini_in_time))
    y = gini_in_time
    a, b = np.polyfit(x, y, 1)
    y_hat = a * x + b
    residuals = y - y_hat
    res_std = np.std(residuals)
    avg_gini = np.mean(gini_in_time)
    return avg_gini + w_fallingrate * min(0, a) + w_resstd * res_std
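
Read as a formula, the function above computes the following. This is a restatement of the code with its default weights; the symbol names are chosen here for illustration.

$$
\text{stability} = \frac{1}{n}\sum_{i=0}^{n-1} G_i \;+\; 88.0 \cdot \min(0, a) \;-\; 0.5 \cdot \operatorname{std}(r),
\qquad G_i = 2\,\mathrm{AUC}_i - 1
$$

Here $G_i$ is the Gini score of the $i$-th week (weeks sorted by `WEEK_NUM`), $a$ is the slope of the least-squares line $\hat{G}_i = a\,i + b$ fitted to the weekly Gini scores, $r_i = G_i - \hat{G}_i$ are the residuals of that fit, and $\operatorname{std}$ is the population standard deviation (the default of `np.std`). A falling trend ($a < 0$) and a noisy weekly Gini both lower the score. Weeks that contain only one class get $\mathrm{AUC}_i = 0$ from `safe_roc_auc_score`, which corresponds to $G_i = -1$.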

## Stability Basic Model

In [93]:
for base, X in [(base_train, X_train), (base_valid, X_valid)]:
    y_pred = model_percentage(X, percentage_target_1)
    base["score"] = y_pred

stability_score_train = gini_stability(base_train)
stability_score_valid = gini_stability(base_valid)

print(f'The stability score on the train set is: {stability_score_train}') 
print(f'The stability score on the valid set is: {stability_score_valid}') 

The stability score on the train set is: -0.06003275294704073
The stability score on the valid set is: -0.2564073926740489


# Baseline Model

A more sophisticated baseline is to train a Logistic Regression, Decision Tree Classifier, or LGBM Classifier with their respective default parameters, which are chosen to work reasonably well for a wide range of datasets. Moreover, we will use a simple mean imputer to handle the missing data, which in our use case amounts to 92%.
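
As a rough illustration of what mean imputation does, the sketch below learns the column means on the training rows only and reuses them for the validation rows, so nothing from the validation data leaks into the fill values. It uses only the standard library; `fit_column_means` and `apply_means` are hypothetical helpers written for this example, the toy rows are made up, and returning NaN for an all-missing column is a choice made here, not a statement about `SimpleImputer`.

```python
import math
from statistics import fmean


def fit_column_means(rows):
    """Learn one mean per column, ignoring missing (NaN) entries."""
    means = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if not math.isnan(row[j])]
        # Toy choice: a column with no observed values keeps NaN as its fill value.
        means.append(fmean(observed) if observed else math.nan)
    return means


def apply_means(rows, means):
    """Replace each NaN with the mean learned for its column."""
    return [[means[j] if math.isnan(v) else v for j, v in enumerate(row)] for row in rows]


# Toy data (made up): two features, some values missing.
train = [[1.0, math.nan], [3.0, 10.0], [math.nan, 20.0]]
valid = [[math.nan, math.nan]]

means = fit_column_means(train)            # learned on the training rows only
train_imputed = apply_means(train, means)
valid_imputed = apply_means(valid, means)  # validation reuses the training means
print(means, valid_imputed)
```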

In [79]:
# Impute missing values
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
X_valid_imputed = imputer.transform(X_valid)

In [94]:
# Define the models
models = {
    "Logistic Regression": LogisticRegression(random_state=1),
    "Decision Tree": DecisionTreeClassifier(random_state=1),
    "LightGBM": LGBMClassifier(random_state=1)
}

# Function to evaluate and print metrics
def evaluate_model(name, y_true, y_pred, y_pred_probs):
    auc_score = roc_auc_score(y_true, y_pred_probs)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    accuracy = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)

    print(f"--- {name} ---")
    print(f"AUC: {auc_score:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"F1 Score: {f1:.4f}\n")

# Iterate through models
for name, model in models.items():
    # Fit the model
    model.fit(X_train_imputed, y_train)
    
    # Predict probabilities and classes
    y_valid_pred_probs = model.predict_proba(X_valid_imputed)[:, 1]
    y_valid_pred = model.predict(X_valid_imputed)
    
    # Evaluate the model
    evaluate_model(name, y_valid, y_valid_pred, y_valid_pred_probs)

    for base, X in [(base_train, X_train_imputed), (base_valid, X_valid_imputed)]:
        y_pred = model.predict_proba(X)[:, 1]
        base["score"] = y_pred
    
    stability_score_train = gini_stability(base_train)
    stability_score_valid = gini_stability(base_valid)

    print(f'The stability score on the train set is: {stability_score_train}') 
    print(f'The stability score on the valid set is: {stability_score_valid}') 

--- Logistic Regression ---
AUC: 0.5498
Precision: 0.0000
Recall: 0.0000
Accuracy: 0.8663
F1 Score: 0.0000

The stability score on the train set is: -0.02598690127861006
The stability score on the valid set is: -0.1658323244940394
--- Decision Tree ---
AUC: 0.5664
Precision: 0.2440
Recall: 0.2544
Accuracy: 0.7950
F1 Score: 0.2491

The stability score on the train set is: 0.811273401138279
The stability score on the valid set is: -0.26950080972741763
[LightGBM] [Info] Number of positive: 936, number of negative: 6064
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.170689 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 130494
[LightGBM] [Info] Number of data points in the train set: 7000, number of used features: 2264
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.133714 -> initscore=-1.868509
[LightGBM] [Info] Start training from score -1.868509
--- LightGBM ---
AUC: 0.7817
Precision: 0.5347
Recall: 0.1347
Accuracy: 0.8687
F1 Score: 0.2151

## Baseline Model Performance Overview
- Logistic Regression
  - AUC: 0.5498: Slightly better than random guessing (AUC of 0.5). Indicates poor discriminatory power.
  - Precision: 0.0000: The model did not predict any positives, which reflects its difficulty in handling class imbalance.
  - Recall: 0.0000: The model failed to capture any true positives, which indicates a bias towards predicting the majority class (negative).
  - Accuracy: 0.8663: High accuracy, but misleading due to class imbalance.
  - Stability score (valid): -0.1658: Indicates that the model's predictions are no better than random chance.

**Logistic Regression struggles with the imbalanced dataset, failing to predict the minority class, resulting in poor performance across all metrics except accuracy (misleading due to class imbalance).**

- Decision Tree
  - AUC: 0.5664: Slightly better than Logistic Regression, but discriminatory power is still weak.
  - Precision: 0.2440: Indicates some ability to correctly predict positive cases. Affected by class imbalance.
  - Recall: 0.2544: Better at capturing true positives than Logistic Regression. Reflects some sensitivity to the minority class.
  - Accuracy: 0.7950: Lower than Logistic Regression, but this is due to better handling of positive cases.
  - F1 Score: 0.2491: Better balance between precision and recall.
  - Stability score (valid): -0.2695: Indicates that the model's predictions are no better than random chance. The gap to the train score (0.8113) points to overfitting.

**The Decision Tree shows improved handling of the minority class compared to Logistic Regression, with better precision, recall, and F1 score, though it still struggles with the class imbalance.**

- LightGBM
  - AUC: 0.7817: Significantly better than Logistic Regression and Decision Tree.
  - Precision: 0.5347: Highest precision among the models. Better at correctly predicting positive cases.
  - Recall: 0.1347: Lower recall, indicating some challenges in capturing all true positives. Reflects trade-off between precision and recall.
  - Accuracy: 0.8687: High accuracy, though influenced by class imbalance.
  - F1 Score: 0.2151: Slightly below the Decision Tree (0.2491), held back by the low recall.
  - Stability score (valid): 0.095: Close to random chance, but a slight improvement over the other two models.

**LightGBM outperforms the other models, especially in terms of AUC and precision, demonstrating better handling of the imbalanced dataset. It achieves this with its default parameters, without any tuning.**
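
One reason a model such as the Logistic Regression above can have an AUC above 0.5 and still report zero precision and recall is that `predict` on a binary classifier labels a case positive only when its predicted probability exceeds 0.5. On an imbalanced target, predicted probabilities can all stay below that cut-off. The sketch below illustrates this with made-up scores; `classify` and `precision_recall` are hypothetical helpers, and the threshold of 0.2 is an arbitrary value chosen for the example, not a recommendation.

```python
def classify(probabilities, threshold=0.5):
    """Turn predicted probabilities into hard 0/1 labels."""
    return [int(p >= threshold) for p in probabilities]


def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels; 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (made up): 2 positives out of 10, all scores below 0.5.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]
scores = [0.05, 0.08, 0.10, 0.12, 0.15, 0.18, 0.22, 0.25, 0.30, 0.11]

default = precision_recall(y_true, classify(scores))                 # cut-off 0.5
lowered = precision_recall(y_true, classify(scores, threshold=0.2))  # cut-off 0.2
print(f"threshold 0.5: {default}, threshold 0.2: {lowered}")
```

With the default cut-off no case is labelled positive, so precision and recall are both zero even though the positives tend to receive higher scores than the negatives.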

## Comparison of LightGBM Performance: Mean Imputation vs. Handling Missing Values Directly

Finally, let's check how LightGBM performs on the preprocessed data with the missing values left in place, instead of mean-imputed.

LightGBM has built-in handling of missing values, which it takes into account when choosing splits:

- Split finding with missing values: LightGBM treats missing values as their own group when finding splits. During training, it can decide whether to assign missing values to the left or right side of a split.
- Optimal split decision: LightGBM evaluates the best way to handle missing values by considering them during the split-finding process. This means it decides where to place missing values so as to minimize the loss function.

Reference: LightGBM's official documentation: LightGBM Handling Missing Values. https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html#missing-value-handle
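
The idea of choosing a side for missing values can be sketched with a toy example. This is not LightGBM's implementation: LightGBM scores splits with a gradient-based gain, whereas this sketch uses Gini impurity as a stand-in loss, and `gini_impurity`, `split_loss`, and `best_missing_direction` are hypothetical helpers. For one feature and one threshold, the rows with a missing value are tried on the left child and then on the right child, and the cheaper option is kept.

```python
import math


def gini_impurity(labels):
    """Gini impurity of a list of binary labels (0.0 for an empty or pure node)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def split_loss(left, right):
    """Size-weighted impurity of the two children of a split."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n


def best_missing_direction(values, labels, threshold):
    """Try sending missing values left, then right; keep the cheaper option."""
    left = [y for x, y in zip(values, labels) if not math.isnan(x) and x <= threshold]
    right = [y for x, y in zip(values, labels) if not math.isnan(x) and x > threshold]
    missing = [y for x, y in zip(values, labels) if math.isnan(x)]
    loss_if_left = split_loss(left + missing, right)
    loss_if_right = split_loss(left, right + missing)
    if loss_if_left <= loss_if_right:
        return "left", loss_if_left
    return "right", loss_if_right


# Toy data (made up): the rows with a missing value are all positives.
values = [1.0, 2.0, math.nan, 8.0, 9.0, math.nan]
labels = [0, 0, 1, 1, 1, 1]

direction, loss = best_missing_direction(values, labels, threshold=5.0)
print(direction, loss)
```

In this toy case the missing rows look like the high-value rows, so sending them right gives two pure children. Mean imputation would instead place them at a single fixed value, regardless of the labels.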

In [85]:
model = LGBMClassifier(random_state=1)
# Fit the model
model.fit(X_train, y_train)

# Predict probabilities and classes
y_valid_pred_probs = model.predict_proba(X_valid)[:, 1]
y_valid_pred = model.predict(X_valid)

# Evaluate the model
evaluate_model("LightGBM", y_valid, y_valid_pred, y_valid_pred_probs)

[LightGBM] [Info] Number of positive: 936, number of negative: 6064
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.218191 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 124213
[LightGBM] [Info] Number of data points in the train set: 7000, number of used features: 2478
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.133714 -> initscore=-1.868509
[LightGBM] [Info] Start training from score -1.868509
--- LightGBM ---
AUC: 0.7884
Precision: 0.5196
Recall: 0.1322
Accuracy: 0.8677
F1 Score: 0.2107



Recall is marginally higher with mean imputation (0.1347 vs. 0.1322), while AUC is marginally higher when LightGBM handles the missing values itself (0.7884 vs. 0.7817). We could not find any significant difference between the two approaches: both struggle equally with identifying all true positives.

# Conclusions

- Establishing Baselines:

  - We utilized a random weighted model as the simplest baseline to understand the expected performance based purely on the dataset's class distribution.
  - Additionally, we developed a more sophisticated baseline using the LightGBM model with its default parameters.


- Insights Gained:

  - This exercise provided a clear understanding of the expected baseline performance given the current data distribution.
  - The LightGBM model serves as a benchmark, demonstrating the performance achievable with minimal parameter tuning and preprocessing.

- Next Steps:
  - Although the current LightGBM model reaches a validation stability score (0.095) that is only marginally better than random chance, it sets a foundation for further improvement.
  - Future efforts will focus on enhancing this baseline through advanced data preprocessing and transformation techniques aimed at significantly improving accuracy and model stability.

- Objective:
The goal is to surpass the performance of the LightGBM baseline by implementing more sophisticated data handling and model tuning strategies, aiming for a statistically significant improvement in the model's predictive capabilities. We will focus our effort on dealing with the target class imbalance, the missing values, and the large number of features. Moreover, we will investigate the relationship between regularization and the stability score.