## Data Loading

In [1]:
import os
import warnings

os.chdir("C:/Users/samue/Downloads/Maybank/")
warnings.filterwarnings("ignore", category=UserWarning)

In [2]:
from maybank.src.data_processing.data_loading import (
    load_catalog,
    load_raw_data,
    load_meta_data,
)

catalog_config = load_catalog("maybank/conf/base/catalog.yaml")

raw_data = load_raw_data(catalog_config)

raw_data.head(3)

Unnamed: 0,C_ID,C_AGE,C_EDU,C_HSE,PC,INCM_TYP,gn_occ,NUM_PRD,CASATD_CNT,MTHCASA,...,MAXUT,N_FUNDS,CC_AVE,MAX_MTH_TRN_AMT,MIN_MTH_TRN_AMT,AVG_TRN_AMT,ANN_TRN_AMT,ANN_N_TRX,CC_LMT,C_seg
0,1443,65,Masters,EXECUTIVE CONDOMINIUM,19250.0,6.0,PMEB,3,8.0,6896.91,...,,,13.233333,,,,,,34500.0,AFFLUENT
1,1559,86,O-Levels,PRIVATE CONDOMINIUM,99018.0,2.0,PMEB,4,13.0,51714.78,...,,,727.629167,8530.88,273.44,2296.713333,27560.56,88.0,4000.0,AFFLUENT
2,1913,69,A-Levels,,10155.0,3.0,PMEB,4,1.0,5420.09,...,59600.88,1.0,367.389167,523.35,122.13,283.580833,3402.97,78.0,5000.0,AFFLUENT


In [3]:
meta_data = load_meta_data(catalog_config)

meta_data.head(3)

Unnamed: 0,Feature,Definition,Remarks
0,C_ID,Dummy customer ID,
1,C_AGE,customer Age,
2,C_EDU,customer Education,


## Exploratory Data Analysis

I use Sweetviz for exploratory data analysis (EDA) because a single call produces an interactive HTML report covering the whole dataset.

For each feature the report shows a histogram, value counts, the share of missing and zero values, and its association with the other features.

That makes it quick to check distributions, correlations, and potential data issues before any deeper analysis or modeling.

Please open `maybank/data/raw/EDA.html` to view the interactive HTML EDA report.

In [4]:
from maybank.src.data_processing.data_preprocessing import (
    convert_c_seg_to_binary,
    perform_eda_with_sweetviz,
)

raw_data = convert_c_seg_to_binary(raw_data)

perform_eda_with_sweetviz(
    raw_data, target_feat="C_seg", html_file_path="maybank/data/raw/EDA.html"
)

Done! Use 'show' commands to display/save.   |██████████| [100%]   00:01 -> (00:00 left)


Report maybank/data/raw/EDA.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


### Imputations for Missing Values

The metadata definition of each feature tells us what a missing value most plausibly means (for example, that the customer holds no account of that type), and that meaning drives the imputation choice.

The table below lists the strategy chosen for each feature and the reasoning behind it.


| No. | Variable           | Missing % | Imputation Strategy                                | Explanation                                                                                                                                                             |
|-----|--------------------|-----------|----------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 1   | C_ID               | 0%        | No missing values                                  | -                                                                                                                                                                       |
| 2   | C_AGE              | 0%        | No missing values                                  | -                                                                                                                                                                       |
| 3   | C_EDU              | 58%       | Constant Imputation: "Not Provided"                | Since it is the customer education, missing values are imputed with "Not Provided", cannot use "Others" as it is already specified.                                     |
| 4   | C_HSE              | 66%       | Constant Imputation: "Not Provided"                | Since it is the customer house type, missing values are imputed with "Not Provided".                                                                                    |
| 5   | PC                 | <1%       | Zero Imputation                                    | Distribution is not skewed and only <1% missing. PC is postal code, hence does not make sense to use mean/mode imputation. Hence use zero imputation instead.           |
| 6   | INCM_TYP           | 45%       | Mode Imputation: 2.0                               | Drop at bins 1, 7, and 8. Impute with 2, assuming no income would already be recorded under either 1 or 8, depending on which is the lowest income type bin.           |
| 7   | gn_occ             | 1%        | Constant Imputation: "Not Provided"                | Since it is occupation, missing values are imputed with "Not Provided", cannot use "Others" as it is already specified.                                                 |
| 8   | NUM_PRD            | 0%        | No missing values                                  | -                                                                                                                                                                       |
| 9   | CASATD_CNT         | 38%       | Zero Imputation: 0                                 | Impute with 0, indicating no CASA or TD accounts, verified with below CASA and TD analysis.                                                                             |
| 10  | MTHCASA            | 41%       | Zero Imputation: 0                                 | Impute with 0, indicating no CASA account.                                                                                                                              |
| 11  | MAXCASA            | 41%       | Zero Imputation: 0                                 | Impute with 0, indicating no CASA account.                                                                                                                              |
| 12  | MINCASA            | 41%       | Zero Imputation: 0                                 | Impute with 0, indicating no CASA account.                                                                                                                              |
| 13  | OWN_CASA (NEW)     | -         | New Binary Column: 1 for owned, 0 for not owned    | Created a new binary column to differentiate not owning a CASA and an empty CASA account, since MTH/MAX/MINCASA already have 0s, which implies an empty CASA account.   |
| 14  | DRvCR              | 55%       | Zero Imputation: 0                                 | Impute with 0, indicating either zero debit or absence of credit (0 division).                                                                                          |
| 15  | MTHTD              | -         | Zero Imputation: 0                                 | Impute with 0, indicating no TD account                                                                                                                                 |
| 16  | MAXTD              | -         | Zero Imputation: 0                                 | Impute with 0, indicating no TD account                                                                                                                                 |
| 17  | OWN_TD (NEW)       | -         | New Binary Column: 1 for owned, 0 for not owned    | Created a new binary column. The CASA and TD analysis below shows that a missing CASATD_CNT implies 0 CASA and TD accounts.                                             |
| 18  | Asset_value        | 0%        | No missing values                                  | -                                                                                                                                                                       |
| 19  | HL_tag             | 96%       | Constant Imputation: 0                             | Supposed to be either 1 or 0 according to metadata, and only 1 exists at the moment. Impute with 0.                                                                     |
| 20  | AL_tag             | 92%       | Constant Imputation: 0                             | Supposed to be either 1 or 0 according to metadata, and only 1 exists at the moment. Impute with 0.                                                                     |
| 21  | pur_price_avg      | 92%       | Zero Imputation: 0                                 | Impute with 0. No 0 exists in the column, so 0 can safely stand for no property purchase. Also, _avg could imply an average, hence 0 division if no property is owned.  |
| 22  | UT_AVE             | 96%       | Zero Imputation: 0                                 | Impute with 0, indicating no UT transaction, no 0 exists and verified with below Unit Trust Analysis.                                                                   |
| 23  | MAXUT              | 96%       | Zero Imputation: 0                                 | Impute with 0, indicating no UT transaction, no 0 exists and verified with below Unit Trust Analysis.                                                                   |
| 24  | N_FUNDS            | 96%       | Zero Imputation: 0                                 | Impute with 0, indicating no funds owned, no 0 exists and verified with Unit Trust Analysis.                                                                            |
| 25  | MAX_MTH_TRN_AMT    | 82%       | Zero Imputation: 0                                 | Impute with 0, verified with below Credit Card TRN Analysis, follow imputation of ANN_N_TRX.                                                                            |
| 26  | MIN_MTH_TRN_AMT    | 82%       | Zero Imputation: 0                                 | Impute with 0, verified with below Credit Card TRN Analysis, follow imputation of ANN_N_TRX.                                                                            |
| 27  | AVG_TRN_AMT        | 82%       | Zero Imputation: 0                                 | Impute with 0, verified with below Credit Card TRN Analysis, follow imputation of ANN_N_TRX.                                                                            |
| 28  | ANN_TRN_AMT        | 82%       | Zero Imputation: 0                                 | Impute with 0, verified with below Credit Card TRN Analysis, follow imputation of ANN_N_TRX.                                                                            |
| 29  | ANN_N_TRX          | 82%       | Zero Imputation: 0                                 | Impute with 0. 0 does not exist hence 82% either no credit transaction or no credit card owned.                                                                         |
| 30  | CC_AVE             | 74%       | Zero Imputation: 0                                 | Impute with 0, indicating no credit card owned in the past, hence safe to impute with 0.                                                                                |
| 31  | CC_LMT             | 28%       | Zero Imputation: 0                                 | Impute with 0, indicating no credit card owned at the moment. The assumption holds since the 28% fall within the 82% missing in rows 25-29.                             |
| 32  | OWN_CC (NEW)       | -         | New Binary Column: 1 for owned, 0 for not owned    | Created a new binary column to distinguish a 0 credit limit assigned by the bank from a 0 imputed because no credit card is owned.                                      |
| 33  | OWN_PREV_CC (NEW)  | -         | New Binary Column: 1 for owned, 0 for not owned    | Created a new binary column to distinguish a genuine CC_AVE of 0 from a 0 imputed because no credit card was owned in the past.                                         |
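
The `OWN_*` rows of the table depend on ordering: the flag has to be derived from the missingness pattern before the zeros are filled in, otherwise "no account" and "empty account" become indistinguishable. Below is a minimal sketch of that pattern on a toy frame with hypothetical values. The function name `impute_with_ownership_flag` is an assumption for illustration; the project's actual `impute_missing_data` is not shown here and may differ.

```python
import numpy as np
import pandas as pd

# Toy rows with hypothetical values; column names follow the table above.
toy = pd.DataFrame(
    {
        "C_EDU": ["Masters", None, "O-Levels"],
        "MTHCASA": [6896.91, np.nan, 0.0],
        "MAXCASA": [4899.08, np.nan, 0.0],
        "MINCASA": [910.88, np.nan, 0.0],
    }
)


def impute_with_ownership_flag(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch only: flag ownership first, then fill the gaps."""
    out = df.copy()
    # 1 where a CASA balance was recorded (even 0.0), 0 where it was missing.
    out["OWN_CASA"] = out["MTHCASA"].notna().astype(int)
    # Missing balances become 0; the flag keeps "no account" apart
    # from "empty account".
    casa_cols = ["MTHCASA", "MAXCASA", "MINCASA"]
    out[casa_cols] = out[casa_cols].fillna(0)
    # Categorical gaps get an explicit label instead of a guessed category.
    out["C_EDU"] = out["C_EDU"].fillna("Not Provided")
    return out


imputed = impute_with_ownership_flag(toy)
print(imputed)
```

In the toy frame, the second row (missing balances) and the third row (recorded balances of 0) end up with identical CASA values, and only `OWN_CASA` tells them apart.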



#### CASA and TD Analysis

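This check verifies three things: the CASA columns (monthly, max, min) are missing on the same rows, the TD columns (monthly, max) are missing on the same rows, and `CASATD_CNT` is missing exactly where both the CASA and the TD data are missing. Together these support reading a missing `CASATD_CNT` as zero CASA and TD accounts owned.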

In [5]:
casa_td_missing_indices = set(raw_data[raw_data["CASATD_CNT"].isna()].index)
missing_monthly_casa_indices = set(raw_data[raw_data["MTHCASA"].isna()].index)
missing_max_casa_indices = set(raw_data[raw_data["MAXCASA"].isna()].index)
missing_min_casa_indices = set(raw_data[raw_data["MINCASA"].isna()].index)
missing_monthly_td_indices = set(raw_data[raw_data["MTHTD"].isna()].index)
missing_max_td_indices = set(raw_data[raw_data["MAXTD"].isna()].index)

# Ensure missing monthly, max, and min CASA indices are the same
assert missing_monthly_casa_indices == missing_max_casa_indices == missing_min_casa_indices

# Ensure missing monthly and max TD indices are the same
assert missing_monthly_td_indices == missing_max_td_indices

# Check intersection of indices of missing monthly CASA and TD, which imply 0 CASA and TD accounts owned
intersection_casa_td = missing_monthly_casa_indices.intersection(missing_monthly_td_indices)

# Validate assumption that missing monthly CASA and TD data imply 0 CASA and TD accounts
assert casa_td_missing_indices == intersection_casa_td

#### Unit Trust Analysis

This check confirms that `UT_AVE`, `MAXUT`, and `N_FUNDS` are missing on exactly the same rows, which is consistent with those customers owning no funds. It is therefore reasonable to impute all three with 0.

In [6]:
missing_ut_ave_indices = set(raw_data[raw_data["UT_AVE"].isna()].index)
missing_max_ut_indices = set(raw_data[raw_data["MAXUT"].isna()].index)
missing_n_funds_indices = set(raw_data[raw_data["N_FUNDS"].isna()].index)

# Ensure missing indices for average UT, maximum UT, and number of UT funds are the same
assert missing_ut_ave_indices == missing_max_ut_indices == missing_n_funds_indices

# Safe to impute these missing values with 0, indicating the customer owns 0 funds
# This implies no average UT and MAXUT values.

#### Credit Card TRN Analysis

This check confirms that the maximum, minimum, average, and annual transaction amounts, as well as the annual number of transactions, are missing on exactly the same rows. This suggests that these customers either own no credit card or made no credit transactions.

In [7]:
missing_max_monthly_trn_amt_indices = set(
    raw_data[raw_data["MAX_MTH_TRN_AMT"].isna()].index
)
missing_min_monthly_trn_amt_indices = set(
    raw_data[raw_data["MIN_MTH_TRN_AMT"].isna()].index
)
missing_avg_trn_amt_indices = set(raw_data[raw_data["AVG_TRN_AMT"].isna()].index)
missing_annual_trn_amt_indices = set(raw_data[raw_data["ANN_TRN_AMT"].isna()].index)
missing_annual_n_trx_indices = set(raw_data[raw_data["ANN_N_TRX"].isna()].index)

# Ensure missing indices for maximum, minimum, average, and annual transaction amount, 
# as well as annual number of transactions, are the same
assert (
    missing_max_monthly_trn_amt_indices
    == missing_min_monthly_trn_amt_indices
    == missing_avg_trn_amt_indices
    == missing_annual_trn_amt_indices
    == missing_annual_n_trx_indices
)

# Assume missing values imply either no credit card ownership or no credit transactions for these rows
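
The three analyses above repeat one pattern: collect the row labels where a column is missing and compare the resulting sets. As a hedged sketch, the pattern could be factored into two small helpers. The names `missing_index_set` and `share_missing_rows` are hypothetical and are not part of the project code.

```python
import numpy as np
import pandas as pd


def missing_index_set(df: pd.DataFrame, column: str) -> set:
    """Return the set of row labels where `column` is missing."""
    return set(df.index[df[column].isna()])


def share_missing_rows(df: pd.DataFrame, columns: list) -> bool:
    """True when every listed column is missing on exactly the same rows."""
    first = missing_index_set(df, columns[0])
    return all(missing_index_set(df, col) == first for col in columns[1:])


# Toy frame with hypothetical values, used only to exercise the helpers.
toy = pd.DataFrame(
    {
        "UT_AVE": [np.nan, 120.5, np.nan],
        "MAXUT": [np.nan, 300.0, np.nan],
        "N_FUNDS": [np.nan, 2.0, 1.0],
    }
)

same_two = share_missing_rows(toy, ["UT_AVE", "MAXUT"])
same_three = share_missing_rows(toy, ["UT_AVE", "MAXUT", "N_FUNDS"])
print(same_two, same_three)
```

With such helpers, each `assert` in the cells above would reduce to one call, for example `assert share_missing_rows(raw_data, ["UT_AVE", "MAXUT", "N_FUNDS"])`.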

## Data Cleaning

- Impute missing data based on above analysis
- Optimize efficiency by converting float64 to float32
<!-- - Drop outliers using the z-score method.
  - Using the z-score method with a threshold of 3 allows us to identify data points that are significantly distant from the mean in terms of standard deviations. This approach is particularly useful when dealing with heavily left-skewed features, as it provides a standardized way to detect and remove extreme values that might otherwise distort our analysis or modeling efforts. -->
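
The float32 conversion halves the memory used by each float column at the cost of precision (roughly seven significant digits). A minimal sketch of what such a conversion might look like, assuming it simply casts every float64 column; the function name `downcast_float64` is hypothetical and the project's `convert_float64_to_float32` may be implemented differently.

```python
import numpy as np
import pandas as pd


def downcast_float64(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch only: cast every float64 column to float32, leave others alone."""
    float_cols = df.select_dtypes(include="float64").columns
    # astype with a mapping returns a new frame; the input is not modified.
    return df.astype({col: "float32" for col in float_cols})


# Toy frame with hypothetical values.
toy = pd.DataFrame({"C_AGE": [65, 86, 69], "MTHCASA": [6896.91, 51714.78, 5420.09]})
small = downcast_float64(toy)

print(small.dtypes)
print(toy["MTHCASA"].memory_usage(index=False), small["MTHCASA"].memory_usage(index=False))
```

The precision loss is visible in the cleaned output of this notebook, where a raw `MTHCASA` of 6896.91 is displayed as 6896.910156 after conversion.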

In [8]:
from maybank.src.data_processing.data_preprocessing import (
    impute_missing_data,
    convert_float64_to_float32,
    drop_outliers,  
    # This approach hasn't been utilized; it employs z-score outlier detection implementation. 
    # Other techniques, such as isolation forest, could be considered. 
    # Initially included for clustering modeling, which is sensitive to outliers.
)

In [9]:
cleaned_data = raw_data.copy()

cleaned_data = impute_missing_data(cleaned_data)

cleaned_data = convert_float64_to_float32(cleaned_data)

# Drop C_ID: per the metadata it is a dummy customer ID, so it carries no predictive signal and could only introduce bias.
# Drop PC: the imputation table treats it as a postal code, so its raw numeric value is not a meaningful feature.
# If location turns out to matter, consider a geographical encoding instead.
cleaned_data = cleaned_data.drop(columns=["C_ID", "PC"]).reset_index(drop=True)

cleaned_data.head(3)

Unnamed: 0,C_AGE,C_EDU,C_HSE,INCM_TYP,gn_occ,NUM_PRD,CASATD_CNT,MTHCASA,MAXCASA,MINCASA,...,CC_AVE,MAX_MTH_TRN_AMT,MIN_MTH_TRN_AMT,AVG_TRN_AMT,ANN_TRN_AMT,ANN_N_TRX,CC_LMT,C_seg,CC_AVE_copy,CC_LMT_copy
0,65,Masters,EXECUTIVE CONDOMINIUM,6.0,PMEB,3,8.0,6896.910156,4899.080078,910.880005,...,13.233334,0.0,0.0,0.0,0.0,0.0,34500.0,1,13.233334,34500.0
1,86,O-Levels,PRIVATE CONDOMINIUM,2.0,PMEB,4,13.0,51714.78125,35740.550781,1318.25,...,727.62915,8530.879883,273.440002,2296.713379,27560.560547,88.0,4000.0,1,727.62915,4000.0
2,69,A-Levels,Not Provided,3.0,PMEB,4,1.0,5420.089844,5420.089844,5420.089844,...,367.38916,523.349976,122.129997,283.580841,3402.969971,78.0,5000.0,1,367.38916,5000.0


## Feature Engineering

I will be utilizing tree-based models like XGBoost or Random Forests, where there is generally no need to standardize or normalize features. 

**Reasoning:**

1. **Invariance to Monotonic Transformations:**
   - Tree-based models base decisions on feature comparisons, making them insensitive to monotonic transformations like normalization or standardization.
2. **Natural Handling of Different Scales:**
   - These models partition the feature space based on relative feature values, not their absolute magnitude, making them robust to varying scales among features.

Therefore, training the models on raw features keeps the data in its original form and lets the tree-based models learn from the structure and relationships within the features. Furthermore, XGBoost and Random Forests are less sensitive to outliers than margin-based models such as SVMs, which are sensitive to points near the decision boundary, or regression-based models.
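
The invariance claim can be checked directly. The sketch below is an illustration, not the project's code: it picks the best single split on one feature by weighted Gini impurity, the criterion a CART-style tree uses, and compares the resulting partition for raw, log-transformed, and standardized values. The data is synthetic and the function name `best_gini_split` is hypothetical.

```python
import numpy as np


def best_gini_split(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Return a boolean mask of the rows sent left by the best split on x.

    Candidate thresholds lie between consecutive distinct sorted values and
    are scored by weighted Gini impurity for a binary target.
    """
    order = np.argsort(x, kind="stable")
    xs, ys = x[order], y[order]
    n = len(xs)
    best_score, best_pos = np.inf, None
    for pos in range(1, n):
        if xs[pos] == xs[pos - 1]:
            continue  # no threshold fits between equal values
        p_left, p_right = ys[:pos].mean(), ys[pos:].mean()
        gini_left = 2 * p_left * (1 - p_left)
        gini_right = 2 * p_right * (1 - p_right)
        score = (pos * gini_left + (n - pos) * gini_right) / n
        if score < best_score:
            best_score, best_pos = score, pos
    mask = np.zeros(n, dtype=bool)
    mask[order[:best_pos]] = True
    return mask


rng = np.random.default_rng(0)
x = rng.lognormal(mean=8.0, sigma=1.5, size=200)  # skewed, balance-like values
y = (x > np.median(x)).astype(int)
y = np.where(rng.random(200) < 0.1, 1 - y, y)     # flip about 10% of labels

raw_mask = best_gini_split(x, y)
log_mask = best_gini_split(np.log(x), y)
std_mask = best_gini_split((x - x.mean()) / x.std(), y)
print(np.array_equal(raw_mask, log_mask), np.array_equal(raw_mask, std_mask))
```

The threshold value changes under each transformation, but the rows on either side of it do not, because the split only depends on the ordering of the values.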

For categorical variables such as occupation, I first apply **Stratified K-Fold Target Encoding** to the specified categorical columns.

1. **Prevent Data Leakage**
   - In K-fold target encoding, each row is encoded with target statistics computed from the other folds, so its own label does not leak into its feature. Stratification keeps each fold representative of the whole dataset, which matters given our imbalanced classes.
2. **Tracking Purpose**
   - I keep track of all the encoders used for target encoding so they can be reapplied later. This assumes that the test set, or future observations used for inference in production, comes from a distribution similar to the training set.
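
A minimal sketch of out-of-fold target encoding for a binary target, assuming scikit-learn's `StratifiedKFold`. The function name `out_of_fold_target_encode`, the fallback to the global mean, and the toy category labels other than `PMEB` are assumptions for illustration; the project's `stratified_kfold_target_encoding` may differ, for example in smoothing or in how it saves encoders.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold


def out_of_fold_target_encode(
    df: pd.DataFrame, column: str, target: str, n_splits: int = 5, seed: int = 0
) -> pd.Series:
    """Encode `column` with the target mean learned on the other folds."""
    encoded = pd.Series(np.nan, index=df.index, dtype="float64")
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in skf.split(df, df[target]):
        # Category means come only from rows outside the fold being encoded.
        means = df.iloc[fit_idx].groupby(column)[target].mean()
        encoded.iloc[enc_idx] = df[column].iloc[enc_idx].map(means).to_numpy()
    # Categories unseen in the fitting folds fall back to the global mean.
    return encoded.fillna(df[target].mean())


# Toy data with hypothetical values: 40 rows, 10 of them positive.
toy = pd.DataFrame(
    {
        "gn_occ": ["PMEB", "OTHERS", "PMEB", "RETIREE"] * 10,
        "C_seg": [1, 0, 0, 0] * 10,
    }
)
toy_encoded = out_of_fold_target_encode(toy, "gn_occ", "C_seg")
print(toy_encoded.groupby(toy["gn_occ"]).mean())
```

Because each row's encoding excludes its own fold, rows of the same category can receive slightly different values, which is visible in the `*_encoded` columns of the processed data.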

**Example of loading back used encoders/scalers**
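```{python}
import joblib

encoder_config = load_catalog("maybank/conf/base/encoder.yaml")

# Reapply each saved encoder to its column.
# `encoded_df` is the DataFrame to transform (e.g. new data for inference).
for col, encoder_file in encoder_config["encoders"].items():
    encoder = joblib.load(encoder_file)
    encoded_df[col] = encoder.transform(encoded_df[col])
```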

In [10]:
from maybank.src.feature_engineering.feature_engineering import (
    add_features,
    stratified_kfold_target_encoding,
    standardize_columns,
    # Although not utilized, StandardScaler was tested out.
    # This was initially implemented due to the initial approach of using clustering.
)
from maybank.src.utils.utils import save_processed_data_as_csv

processed_data = cleaned_data.copy()

processed_data = add_features(processed_data)

processed_data = stratified_kfold_target_encoding(
    processed_data,
    ["INCM_TYP", "C_EDU", "C_HSE", "gn_occ"],
    "C_seg",
    encoder_folder="maybank/data/models/",
)

save_processed_data_as_csv(processed_data, "maybank/data/preprocessed/processed_data.csv")

processed_data.head(3)

DataFrame saved to maybank/data/preprocessed/processed_data.csv


Unnamed: 0,C_AGE,NUM_PRD,CASATD_CNT,MTHCASA,MAXCASA,MINCASA,DRvCR,MTHTD,MAXTD,Asset value,...,CC_LMT,C_seg,OWN_CASA,OWN_TD,OWN_CC,OWN_PREV_CC,INCM_TYP_encoded,C_EDU_encoded,C_HSE_encoded,gn_occ_encoded
0,65,3,8.0,6896.910156,4899.080078,910.880005,1020768.0,105000.0,25000.0,111896.90625,...,34500.0,1,1,1,1,1,0.155051,0.181734,0.22639,0.151867
1,86,4,13.0,51714.78125,35740.550781,1318.25,8.32642,575572.0,135026.15625,627286.75,...,4000.0,1,1,1,1,1,0.178821,0.165399,0.182339,0.152018
2,69,4,1.0,5420.089844,5420.089844,5420.089844,0.41066,0.0,0.0,64161.738281,...,5000.0,1,1,0,1,1,0.132776,0.199774,0.172529,0.152039


Please open `maybank/data/preprocessed/EDA.html` to view the interactive HTML EDA report.

In [11]:
perform_eda_with_sweetviz(
    processed_data,
    target_feat="C_seg",
    html_file_path="maybank/data/preprocessed/EDA.html",
)

Done! Use 'show' commands to display/save.   |██████████| [100%]   00:02 -> (00:00 left)


Report maybank/data/preprocessed/EDA.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


## Data Sampling   

I plan to incorporate a combined under-sampling and over-sampling strategy to ensure a more balanced training fold for each iteration. Below are some common methods from [imbalanced-learn](https://imbalanced-learn.org/stable/index.html).

**Under-sampling methods:**

| Method                      | Advantages                               | Disadvantages                              |
|-----------------------------|------------------------------------------|--------------------------------------------|
| ClusterCentroids            | Preserves information while reducing majority class. | May not accurately capture underlying distribution. |
| CondensedNearestNeighbour   | Selects subset representing majority class. | May inadvertently remove informative instances.         |
| EditedNearestNeighbours     | Removes noisy majority class instances.  | Sensitive to noise, may not fully address imbalance. |
| RandomUnderSampler          | Simple and fast.                         | May discard useful instances, effectiveness in solving imbalance may vary. |

Considering that we have a large majority class, I will opt for the fast and straightforward **RandomUnderSampler**.

**Over-sampling methods:**

| Method                      | Advantages                               | Disadvantages                              |
|-----------------------------|------------------------------------------|--------------------------------------------|
| RandomOverSampler           | Simple and effective.                    | May lead to overfitting, reduction in diversity. |
| SMOTENC                     | Handles both numerical and categorical features. | Requires parameter tuning, can be computationally intensive. |
| ADASYN                      | Focuses on low-density regions.          | Sensitive to noise, requires careful parameter adjustment. |

Since our aim is to uncover hidden affluent customers, I prefer a more sophisticated approach to over-sampling than random duplication.

Comparing **SMOTENC** with **ADASYN**: **ADASYN** is attractive because it generates more synthetic samples in regions where the classifier is likely to make errors, but it treats every variable as continuous. **SMOTENC** is tailored for datasets with a mix of categorical and continuous features and respects the categorical features during synthetic sample generation. Our data includes binary flags such as `HL_tag` and `OWN_CC`, which should not be interpolated.

Therefore, I will utilize **SMOTENC**.

In [12]:
from maybank.src.data_processing.data_sampling import resample_data
from collections import Counter

# For testing purposes
resampled_X, resampled_y = resample_data(
    processed_data,
    "C_seg",
    [
        "HL_tag",
        "AL_tag",
        "OWN_CASA",
        "OWN_TD",
        "OWN_CC",
        "OWN_PREV_CC",
    ],
)

print(Counter(resampled_y))

resampled_X.head(3)

Counter({0: 55157, 1: 55157})


Unnamed: 0,C_AGE,NUM_PRD,CASATD_CNT,MTHCASA,MAXCASA,MINCASA,DRvCR,MTHTD,MAXTD,Asset value,...,ANN_N_TRX,CC_LMT,OWN_CASA,OWN_TD,OWN_CC,OWN_PREV_CC,INCM_TYP_encoded,C_EDU_encoded,C_HSE_encoded,gn_occ_encoded
55695,46,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0,0,0,0,0.180085,0.165917,0.137042,0.152039
30952,48,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0,0,0,0,0.130531,0.199774,0.172529,0.152039
63353,79,1,1.0,19358.650391,19358.650391,19358.650391,0.0,0.0,0.0,19358.650391,...,0.0,0.0,1,0,0,0,0.178664,0.16842,0.171835,0.140593


## Modeling

Two common classification approaches are compared on mean precision and recall across 10-fold cross-validation to decide which one to proceed with:

- **Random Forest**
  - **n_estimators:** 100
  
- **XGBoost Classifier**
  - **objective:** `binary:logistic`
  - **learning_rate:** 0.001
  - **n_estimators:** 100
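
A minimal sketch of how per-fold precision, recall, and F1 can be averaged with scikit-learn's `cross_validate`. It uses synthetic imbalanced data and only the Random Forest configuration; the use of `StratifiedKFold` is an assumption, and the per-fold resampling done in this project is left out. The actual `perform_cross_validation_over_models` may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Hypothetical imbalanced toy data standing in for the processed features.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.85, 0.15], random_state=0
)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)

# Each metric is computed on every held-out fold, then averaged.
scores = cross_validate(model, X, y, cv=cv, scoring=["precision", "recall", "f1"])
summary = {
    name: float(scores[f"test_{name}"].mean())
    for name in ("precision", "recall", "f1")
}
print(summary)
```

Swapping `model` for an XGBoost classifier with the settings listed above would give the second row of the comparison.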

In [13]:
from maybank.src.modeling.modeling import perform_cross_validation_over_models

perform_cross_validation_over_models(
    processed_data,
    label_column="C_seg",
    binary_columns=["HL_tag", "AL_tag", "OWN_CASA", "OWN_TD", "OWN_CC", "OWN_PREV_CC"],
    number_folds=10,
)

10it [05:00, 30.03s/it]

Random Forest - Avg Precision: 0.58
Random Forest - Avg Recall: 0.63
Random Forest - Avg F1 Score: 0.6
XGBoost - Avg Precision: 0.48
XGBoost - Avg Recall: 0.78
XGBoost - Avg F1 Score: 0.59





## Building the XGBoost Classifier

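From the results above, **XGBoost** has the highest average recall (0.78 versus 0.63 for Random Forest). Recall is our objective: we want to capture as many affluent customers as possible. The false positives that come with its lower precision are then our targeted potential affluent customers, that is, customers labelled non-affluent who look affluent to the model.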
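
To make the roles of recall, precision, and false positives concrete, here is a small worked example on hypothetical labels and predictions for eight customers. The numbers are invented for illustration and are not results from this dataset.

```python
import numpy as np

# Hypothetical labels and predictions for eight customers (1 = affluent).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 1, 0, 0, 0])

tp = int(np.sum((y_pred == 1) & (y_true == 1)))
fp = int(np.sum((y_pred == 1) & (y_true == 0)))
fn = int(np.sum((y_pred == 0) & (y_true == 1)))

recall = tp / (tp + fn)     # share of labelled-affluent customers recovered
precision = tp / (tp + fp)  # share of predicted-affluent that are labelled so

# Rows predicted affluent but labelled otherwise: the candidate leads.
lead_rows = np.flatnonzero((y_pred == 1) & (y_true == 0))

print(recall, precision, lead_rows)
```

Here two of the three affluent customers are recovered, and the two false positives (rows 3 and 4) are the customers one would follow up as potentially affluent.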

I will be using **Optuna**, a hyperparameter optimization library, to search for the best hyperparameters for the XGBoost classifier on this dataset.

Given a range of values for each parameter, **Optuna** proposes a new combination on every trial, guided by the results of earlier trials, and scores it by recall under k-fold cross-validation. The configuration with the highest recall is kept, which is crucial for reducing false negatives in my predictions.

Additionally, I have implemented scikit-learn's **SelectFromModel**, a meta-transformer that selects features based on importance weights.

After the best hyperparameters are found and the XGBoost model is trained, **SelectFromModel** reviews the model's feature importances and retains only the features that meet a specified threshold. Since the number of features is already small, I have commented this step out. To rerun with it, uncomment it in `tune_xgboost_model()` in `src/modeling/modeling.py`.

When enabled, this step focuses the model on the most relevant features, reduces complexity, and improves interpretability. Running feature selection inside the cross-validation loop keeps the selection from seeing the validation fold, which guards against overfitting and makes the model more generalizable to unseen data.
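
A minimal sketch of `SelectFromModel` on synthetic data. A Random Forest stands in for the tuned XGBoost model here to keep the example dependency-light, and the `"median"` threshold is an assumption; the commented-out step in `tune_xgboost_model()` may use a different estimator setup and threshold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Hypothetical toy data: 10 features, of which 3 are informative.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, random_state=0
)

# Fit the estimator, then keep features whose importance is at least
# the median importance across all features.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0),
    threshold="median",
)
X_selected = selector.fit_transform(X, y)
kept = selector.get_support()

print(int(kept.sum()), "of", X.shape[1], "features kept")
```

Inside a cross-validation loop, the selector would be fitted on the training fold only and then applied to the validation fold with `selector.transform`.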

In [14]:
from maybank.src.modeling.modeling import tune_xgboost_model

# Testing function out, hence only 2 trials
# main_pipeline.py will be running it fully with parameters loaded from parameters.yaml
tune_xgboost_model(
    processed_data,
    label_column="C_seg",
    binary_columns=["HL_tag", "AL_tag", "OWN_CASA", "OWN_TD", "OWN_CC", "OWN_PREV_CC"],
    n_trials=2, # For optuna number of trials
    n_splits=10, # For k_fold
)

[I 2024-03-31 14:09:02,817] A new study created in memory with name: no-name-ec01eb00-646d-4c24-92a8-7b2879e739c9
[I 2024-03-31 14:10:20,491] Trial 0 finished with value: 0.5326139540221855 and parameters: {'n_estimators': 113, 'max_depth': 10, 'learning_rate': 0.3294209024773819, 'subsample': 0.7334155879588973, 'colsample_bytree': 0.9228784287980911, 'gamma': 3.260515446607697, 'min_child_weight': 10}. Best is trial 0 with value: 0.5326139540221855.
[I 2024-03-31 14:11:51,435] Trial 1 finished with value: 0.5400248453343304 and parameters: {'n_estimators': 255, 'max_depth': 10, 'learning_rate': 0.12184039324682568, 'subsample': 0.7959468912328407, 'colsample_bytree': 0.92544620412886, 'gamma': 1.7938336175218594, 'min_child_weight': 5}. Best is trial 1 with value: 0.5400248453343304.


Best Hyperparameters for Maximum Recall: {'n_estimators': 255, 'max_depth': 10, 'learning_rate': 0.12184039324682568, 'subsample': 0.7959468912328407, 'colsample_bytree': 0.92544620412886, 'gamma': 1.7938336175218594, 'min_child_weight': 5}
Model saved to maybank/data/models/xgboost_model_20240331_141152.pkl
