A supervised machine learning experiment was conducted to classify customers into high-level age segments using a combination of demographic, financial, behavioural, and RFM-derived features. The objective was to assess how accurately age cohorts can be inferred from these features, including spending intensity, account tenure, portfolio engagement, transaction frequency, and product diversity, once age itself is withheld. Establishing this behaviour-lifecycle relationship shows how strongly observable customer actions encode age-related characteristics. This, in turn, supports applications such as privacy-preserving personalization, data-driven marketing segmentation, and early identification of product needs across age groups within the customer base.

In [None]:
!pip install scikit-learn-extra --no-deps

Collecting scikit-learn-extra
  Using cached scikit_learn_extra-0.3.0-cp312-cp312-linux_x86_64.whl
Installing collected packages: scikit-learn-extra
Successfully installed scikit-learn-extra-0.3.0


In [None]:
!pip uninstall -y numpy scikit-learn-extra
!pip install "numpy<2.0" --no-deps


Found existing installation: numpy 1.26.4
Uninstalling numpy-1.26.4:
  Successfully uninstalled numpy-1.26.4
Found existing installation: scikit-learn-extra 0.3.0
Uninstalling scikit-learn-extra-0.3.0:
  Successfully uninstalled scikit-learn-extra-0.3.0
Collecting numpy<2.0
  Using cached numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Using cached numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.0 MB)
Installing collected packages: numpy
Successfully installed numpy-1.26.4


In [None]:
import numpy as np
print(f"NumPy version: {np.__version__}")

from sklearn_extra.cluster import KMedoids
print(" KMedoids imported successfully!")

NumPy version: 1.26.4
 KMedoids imported successfully!


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Verify scikit-learn version (will use the default compatible version for Python 3.12)
import sklearn
print(f"Scikit-learn version: {sklearn.__version__}")

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

from mpl_toolkits.mplot3d import Axes3D
from scipy.spatial.distance import cdist

import warnings
warnings.filterwarnings('ignore')

Scikit-learn version: 1.6.1


In [None]:
from google.colab import drive
import os

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
full_df = pd.read_parquet('/content/drive/My Drive/Colab Notebooks/DSC678-Capstone/Banking_Project/RFM/rfm_clusters.parquet')
print("Final features loaded.")
print(f"Shape: {full_df.shape}")
print("\nColumns:")
print(full_df.columns.tolist())

full_df.head(10)

Final features loaded.
Shape: (104898, 54)

Columns:
['customer_id', 'residence_country', 'gender', 'age', 'first_join_date', 'residence_index', 'channel_entrance', 'activity_status', 'household_gross_income', 'saving_account', 'guarantees', 'junior_account', 'loans', 'credit_card', 'pensions', 'direct_debit', 'mortgage', 'employment_status', 'employment_status_int', 'personal_income', 'current_loan_amount', 'credit_score', 'customer_segment_model', 'years_calc', 'total_products_owned', 'junior_guarantee', 'customer_tenure_months', 'current_products_owned', 'total_adoptions', 'portfolio_value', 'avg_adoption_value', 'adoption_value_std', 'total_cancellations', 'net_product_growth', 'product_churn_rate', 'adoption_value_cv', 'category_diversity', 'product_diversity', 'active_months', 'adoption_frequency', 'avg_days_between_adoptions', 'norm_adoptions', 'norm_portfolio', 'norm_growth', 'norm_diversity', 'norm_frequency', 'engagement_score', 'engagement_category', 'recency_proxy', 'cluste

Unnamed: 0,customer_id,residence_country,gender,age,first_join_date,residence_index,channel_entrance,activity_status,household_gross_income,saving_account,...,norm_diversity,norm_frequency,engagement_score,engagement_category,recency_proxy,cluster,cluster_name,recency,frequency,monetary
0,15891,ES,0,59,2020-07-28,Y,KAT,1,122813.94,0,...,0.066667,0.071429,0.029698,Very Low,10.133333,3,New Joiners,10.133333,1.0,341.33
1,15899,ES,1,57,2000-01-16,Y,KAT,1,130835.64,0,...,0.4,0.428571,0.271758,Low,260.1,4,Hibernating,260.1,6.0,14823.15
2,15900,ES,1,48,2000-01-16,Y,KAT,1,105327.03,0,...,0.266667,0.035294,0.148321,Very Low,260.1,4,Hibernating,260.1,7.0,2524.94
3,15902,ES,0,57,2000-01-16,Y,KAT,1,230408.25,0,...,0.133333,0.142857,0.08286,Very Low,260.1,4,Hibernating,260.1,2.0,10218.54
4,15906,ES,0,55,2006-02-16,Y,KAT,1,81005.49,0,...,0.533333,0.070169,0.444838,Medium,186.0,2,VIP Champions,186.0,13.0,152711.31
5,15916,ES,0,54,2000-01-16,Y,KAT,1,465589.68,0,...,0.533333,0.045378,0.287463,Low,260.1,0,Champions,260.1,9.0,15439.3
6,15918,ES,0,50,2000-01-16,Y,KAT,1,298795.08,0,...,0.533333,0.571429,0.465549,Medium,260.1,2,VIP Champions,260.1,8.0,116904.57
7,15919,ES,1,55,2000-01-16,Y,KAT,1,318796.59,0,...,0.333333,0.357143,0.211526,Low,260.1,4,Hibernating,260.1,5.0,10841.5
8,15923,ES,1,49,2000-01-16,Y,KAT,1,279663.69,0,...,0.466667,0.052693,0.256707,Low,260.1,0,Champions,260.1,9.0,7420.53
9,15924,ES,1,52,2000-01-16,Y,KAT,1,130903.68,0,...,0.2,0.047096,0.100978,Very Low,260.1,4,Hibernating,260.1,4.0,2601.89


To maintain a strictly numerical feature space for modelling, identifiers and non-numeric attributes (country codes, channel identifiers, categorical customer-segment labels, and other string-based variables) were excluded from the dataset. A continuous temporal feature, years_since_join, was engineered from the original first_join_date field to represent customer tenure in years. The remaining non-numeric columns, first_join_date and cluster_name, are removed when the feature matrix is built, so the final modelling inputs are exclusively numerical. No records were dropped during preprocessing, as no customer had missing values in the retained columns.
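The tenure feature described above can be sketched on toy data as follows. This is a minimal illustration, not the notebook's own cell: `SNAPSHOT_DATE` and the `demo` frame are hypothetical, and a fixed reference date is assumed here so that the result does not change from run to run.

```python
import pandas as pd

# Hypothetical fixed reference date; the notebook itself measures against today's date.
SNAPSHOT_DATE = pd.Timestamp("2020-01-16")

# Toy join dates, including one value that cannot be parsed.
demo = pd.DataFrame({"first_join_date": ["2000-01-16", "2006-02-16", "not a date"]})

# errors="coerce" turns unparseable strings into NaT instead of raising.
demo["first_join_date"] = pd.to_datetime(demo["first_join_date"], errors="coerce")

# Tenure in years: elapsed days divided by the mean length of a calendar year.
demo["years_since_join"] = (SNAPSHOT_DATE - demo["first_join_date"]).dt.days / 365.25

print(demo)
```

Rows whose date could not be parsed end up with a missing tenure value, which is why a missing-value check after this step is useful.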

In [None]:
# Parse first_join_date and derive a numeric tenure feature from it
full_df['first_join_date'] = pd.to_datetime(full_df['first_join_date'], errors='coerce')

# Years since joining, measured against the date on which the notebook is run
full_df['years_since_join'] = (pd.to_datetime('today') - full_df['first_join_date']).dt.days / 365.25

In [None]:
# Drop identifiers and non-numeric columns; keep the numeric features
drop_cols = [
    'customer_id',
    'residence_country',
    'residence_index',
    'channel_entrance',
    'employment_status',
    'engagement_category',
    'customer_segment_model'
]

model_df = full_df.drop(columns=drop_cols)

In [None]:
print("\nBefore dropping missing values:")
print("Total rows:", len(model_df))

# Drop rows with ANY missing values
model_df_clean = model_df.dropna()

print("\nAfter dropping missing values:")
print("Total rows:", len(model_df_clean))

# Count how many rows were removed
print("\nRows removed:", len(model_df) - len(model_df_clean))



Before dropping missing values:
Total rows: 104898

After dropping missing values:
Total rows: 104898

Rows removed: 0


The continuous age variable was then discretized into three lifecycle segments: Young (18–35), Mid (36–50), and Older (over 50, labelled "Older (50+)" in the code and outputs). This transformation reflects lifecycle-based segmentation practices commonly applied in retail banking analytics. It lets the model capture non-linear differences in behaviour across age groups while keeping the target interpretable for downstream classification tasks.

In [None]:
import pandas as pd

# Copy the clean dataset; despite its name, df_2class holds the three-class target
df_2class = model_df_clean.copy()

# Define bins
age_bins_3 = [17, 35, 50, 100]   # Young, Mid, Older
age_labels_3 = ["Young (18–35)", "Mid (36–50)", "Older (50+)"]

df_2class["age_bucket_3"] = pd.cut(
    df_2class["age"],
    bins=age_bins_3,
    labels=age_labels_3,
    right=True
)

print(df_2class["age_bucket_3"].value_counts())

age_bucket_3
Mid (36–50)      67639
Older (50+)      33959
Young (18–35)     3300
Name: count, dtype: int64
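The bin edges used above can be probed with a few boundary ages. The sketch below reuses the same edges and labels on a hypothetical `probe` series to show where each boundary value lands when `right=True`.

```python
import pandas as pd

age_bins = [17, 35, 50, 100]
age_labels = ["Young (18–35)", "Mid (36–50)", "Older (50+)"]

# Boundary ages chosen to probe each bin edge (illustrative values only).
probe = pd.Series([17, 18, 35, 36, 50, 51, 100, 101], name="age")

# right=True makes every interval closed on the right: (17, 35], (35, 50], (50, 100].
buckets = pd.cut(probe, bins=age_bins, labels=age_labels, right=True)

print(pd.DataFrame({"age": probe, "bucket": buckets}))
```

Age 50 falls in the Mid bucket and age 51 in the Older bucket, while values outside (17, 100] are left unassigned. In the notebook the three bucket counts sum to 104,898, the full row count, so no customer falls outside these edges.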


A substantial class imbalance was observed after constructing the age buckets, with the Mid (36–50) segment comprising the majority of records, followed by the Older (50+) group, and a relatively small proportion of Young (18–35) customers. Training a model on this distribution would bias predictions toward the dominant class, resulting in inflated accuracy and reduced sensitivity to minority groups. To mitigate this issue, Synthetic Minority Oversampling Technique (SMOTE) was applied exclusively to the training set. SMOTE generates synthetic minority-class samples through interpolation, enabling the model to learn representative patterns for under-represented groups without duplicating existing observations. After oversampling, all three age segments were evenly represented in the training data, ensuring balanced learning, improved recall for minority classes, and reduced risk of majority-class bias.
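The interpolation idea behind SMOTE can be illustrated with a few lines of NumPy. This is a simplified sketch on toy points, not the imbalanced-learn implementation: the helper `smote_like_sample` and the `minority` array are hypothetical names introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority-class points in a 2-D feature space (illustrative values only).
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])

def smote_like_sample(points, k, rng):
    """Create one synthetic point between a random sample and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    # Euclidean distance from the chosen point to every minority point.
    dists = np.linalg.norm(points - points[i], axis=1)
    # Position 0 of the sorted order is the point itself (distance 0), so skip it.
    neighbours = np.argsort(dists)[1:k + 1]
    j = rng.choice(neighbours)
    lam = rng.random()  # interpolation weight in [0, 1)
    return points[i] + lam * (points[j] - points[i])

synthetic = np.array([smote_like_sample(minority, k=2, rng=rng) for _ in range(5)])
print(synthetic)
```

Each synthetic point lies on the line segment between two real minority samples, so it is new rather than a copy, yet it stays inside the region the minority class already occupies.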

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTE

# Define X and y
X_2 = df_2class.drop(columns=["age", "age_bucket_3", "first_join_date", "cluster_name"])
y_2 = df_2class["age_bucket_3"]

# Encode age labels into integers
le2 = LabelEncoder()
y_2_encoded = le2.fit_transform(y_2)

# Train-test split
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X_2, y_2_encoded,
    test_size=0.2,
    random_state=42,
    stratify=y_2_encoded
)

print("Before SMOTE:", pd.Series(y_train2).value_counts())

# Apply SMOTE only to training data
sm = SMOTE(random_state=42, sampling_strategy='auto')
X_train2_res, y_train2_res = sm.fit_resample(X_train2, y_train2)

print("After SMOTE:", pd.Series(y_train2_res).value_counts())

Before SMOTE: 0    54111
1    27167
2     2640
Name: count, dtype: int64
After SMOTE: 1    54111
0    54111
2    54111
Name: count, dtype: int64
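The integer codes printed above follow the sorted order of the label strings, not the order in which the labels were passed to `pd.cut`. The sketch below mimics that sorted encoding with `np.unique` as a stand-in for `LabelEncoder`; the variable names are illustrative.

```python
import numpy as np

# Labels in the order they were passed to pd.cut (lifecycle order).
cut_order = ["Young (18–35)", "Mid (36–50)", "Older (50+)"]

# np.unique returns the sorted unique values plus an integer code per element,
# which mirrors the sorted class order that LabelEncoder exposes in classes_.
classes, codes = np.unique(np.array(cut_order), return_inverse=True)

for code, label in enumerate(classes):
    print(code, label)
```

Under this ordering, class 0 is Mid, class 1 is Older, and class 2 is Young. That agrees with the pre-SMOTE counts above (54,111, 27,167, and 2,640). Any report that names the classes therefore needs labels in this sorted order.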


In [None]:
!pip install lightgbm




A comparative evaluation was conducted using three supervised learning algorithms (Random Forest, XGBoost, and LightGBM) to assess their ability to classify customers into the three engineered age segments; the cells shown below configure XGBoost and LightGBM. These models were selected for their strong performance on structured financial data and their capacity to model non-linear relationships among behavioural, demographic, and RFM-based predictors. All models were trained on the SMOTE-balanced training set and evaluated on the original, imbalanced held-out test set to measure real-world generalization.

Across all experiments, the models achieved similar overall accuracy (61%–65%). However, class-level performance demonstrated clear patterns. The Mid (36–50) group consistently achieved the highest recall due to its large representation and more stable behavioural profile. The Young (18–35) segment achieved moderate recall, reflecting their smaller sample size and more dynamic financial behaviour. The Older (50+) segment remained the most difficult to classify, likely due to behavioural overlap with middle-aged customers, particularly in income, tenure, and product-holding patterns.

Overall, these results indicate that customer behaviour contains meaningful, but not fully distinct, signals related to age segmentation. Mid-life customer patterns are the most learnable, while younger and older groups exhibit higher behavioural variability. This suggests that age prediction from behavioural data is feasible but would benefit from additional features or temporal data to further separate overlapping lifecycle patterns.
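When reading the accuracy figures above, it helps to compare them with a trivial baseline that always predicts the majority class. The sketch below computes that baseline from the test-set class supports reported in this notebook's classification reports; the dictionary is simply a restatement of those counts.

```python
# Test-set class supports, taken from the classification reports in this notebook.
test_support = {"Mid (36–50)": 13528, "Older (50+)": 6792, "Young (18–35)": 660}

total = sum(test_support.values())
majority_class = max(test_support, key=test_support.get)

# Accuracy of a trivial model that always predicts the majority class.
baseline_acc = test_support[majority_class] / total

# Macro recall of that trivial model: 1.0 for the majority class, 0.0 for the others.
baseline_macro_recall = 1.0 / len(test_support)

print(f"Majority class       : {majority_class}")
print(f"Baseline accuracy    : {baseline_acc:.3f}")
print(f"Baseline macro recall: {baseline_macro_recall:.3f}")
```

The baseline accuracy is about 0.645, so accuracy alone says little under this class distribution. Macro-averaged metrics and per-class recall are the more informative comparison.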

In [None]:
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix

cls_models_2 = {

    "XGBoost": XGBClassifier(
        n_estimators=300,
        learning_rate=0.05,
        max_depth=5,
        subsample=0.8,
        colsample_bytree=0.8,
        objective="multi:softprob",
        eval_metric="mlogloss",
        random_state=42,
        n_jobs=-1
    ),
    "LightGBM": LGBMClassifier(
        n_estimators=300,
        learning_rate=0.05,
        max_depth=-1,        # let LightGBM decide tree depth
        num_leaves=31,
        subsample=0.8,
        colsample_bytree=0.8,
        objective="multiclass",
        random_state=42,
        n_jobs=-1
    )
}


In [None]:
results_final = {}

for name, clf in cls_models_2.items():
    print("\n" + "="*60)
    print(f" TRAINING MODEL: {name}")
    print("="*60)

    # -----------------------------
    # Fit on SMOTE-resampled train
    # -----------------------------
    clf.fit(X_train2_res, y_train2_res)

    # =========================================================================
    #  TRAIN SET EVALUATION
    # =========================================================================
    print("\n" + "-"*60)
    print(" TRAIN SET EVALUATION")
    print("-"*60)
    y_train_pred = clf.predict(X_train2_res)

    train_acc = accuracy_score(y_train2_res, y_train_pred)
    train_f1  = f1_score(y_train2_res, y_train_pred, average="macro")
    print("\nClassification Report:")
    print(classification_report(y_train2_res, y_train_pred,
                                target_names=le2.classes_))
    print(f"Train Accuracy : {train_acc:.3f}")
    print(f"Train Macro F1 : {train_f1:.3f}")

    # =========================================================================
    #  TEST SET EVALUATION
    # =========================================================================
    print("\n" + "-"*60)
    print(" TEST SET EVALUATION")
    print("-"*60)
    y_test_pred = clf.predict(X_test2)

    test_acc = accuracy_score(y_test2, y_test_pred)
    test_f1  = f1_score(y_test2, y_test_pred, average="macro")
    print("\nClassification Report:")

    print(classification_report(y_test2, y_test_pred,
                                target_names=le2.classes_))
    print(f"Test Accuracy  : {test_acc:.3f}")
    print(f"Test Macro F1  : {test_f1:.3f}")

    # Confusion matrix on test
    print("\nConfusion Matrix (Test):")
    print(confusion_matrix(y_test2, y_test_pred))


    # Store metrics for comparison table
    results_final[name] = {
        "train_acc": train_acc,
        "test_acc":  test_acc,
        "train_macro_f1": train_f1,
        "test_macro_f1":  test_f1,

    }



 TRAINING MODEL: XGBoost

------------------------------------------------------------
 TRAIN SET EVALUATION
------------------------------------------------------------

Classification Report:
               precision    recall  f1-score   support

  Mid (36–50)       0.65      0.89      0.75     54111
  Older (50+)       0.80      0.47      0.59     54111
Young (18–35)       0.86      0.90      0.88     54111

     accuracy                           0.75    162333
    macro avg       0.77      0.75      0.74    162333
 weighted avg       0.77      0.75      0.74    162333

Train Accuracy : 0.753
Train Macro F1 : 0.742

------------------------------------------------------------
 TEST SET EVALUATION
------------------------------------------------------------

Classification Report:
               precision    recall  f1-score   support

  Mid (36–50)       0.67      0.88      0.76     13528
  Older (50+)       0.53      0.18      0.27      6792
Young (18–35)       0.26      0.37   

Given the presence of more than 40 numerical features, many of which exhibited moderate to high correlation, Principal Component Analysis (PCA) was evaluated as a dimensionality-reduction technique. PCA was selected to identify whether compressing the feature space could reduce redundancy, improve generalization, or enhance class separation for minority age groups.

Before applying PCA, all features were standardized using z-score normalization so that each variable contributes on the same scale. PCA was then fitted on the standardized, SMOTE-resampled training set and configured to retain 95% of total variance, yielding 18 principal components derived from the original 45 features. This transformation reduced the feature space by more than half while preserving the dominant information structure.

In [None]:
from sklearn.decomposition import PCA

# --- STANDARDIZE FEATURES BEFORE PCA ---
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train2_res)
X_test_scaled  = scaler.transform(X_test2)

print("Train scaled shape:", X_train_scaled.shape)
print("Test scaled shape :", X_test_scaled.shape)

# --- PCA: KEEP 95% OF VARIANCE ---
pca = PCA(n_components=0.95, random_state=42)

X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca  = pca.transform(X_test_scaled)

print("PCA components:", pca.n_components_)
print("X_train_pca shape:", X_train_pca.shape)
print("X_test_pca shape :", X_test_pca.shape)

Train scaled shape: (162333, 45)
Test scaled shape : (20980, 45)
PCA components: 18
X_train_pca shape: (162333, 18)
X_test_pca shape : (20980, 18)
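How a variance threshold such as 95% translates into a number of components can be sketched with NumPy alone. The example below uses synthetic data with deliberately redundant columns. It is an illustration of the idea under those assumptions, not a reproduction of scikit-learn's internals or of the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 3 independent signals, each duplicated with small noise,
# mimicking redundant columns in a wide feature set.
base = rng.normal(size=(500, 3))
X = np.hstack([base, base + 0.05 * rng.normal(size=(500, 3))])

# z-score standardization, as done before PCA in the notebook.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigenvalues of the covariance matrix are the variances along each principal axis.
eigvals = np.linalg.eigvalsh(np.cov(X_std, rowvar=False))[::-1]
cum_ratio = np.cumsum(eigvals) / eigvals.sum()

# Smallest number of components whose cumulative variance ratio reaches 95%.
n_keep = int(np.searchsorted(cum_ratio, 0.95) + 1)

print(np.round(cum_ratio, 3))
print("Components kept:", n_keep)
```

Six columns collapse to three components here because half of them carry almost no independent variance. The same mechanism is behind the reduction from 45 features to 18 components above.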


In [None]:
# PCA loadings
feature_names = X_train2_res.columns
loadings = pd.DataFrame(
    pca.components_.T,
    index=feature_names,
    columns=[f"PC{i+1}" for i in range(pca.n_components_)]
)

# Show top contributing features for each component
for pc in loadings.columns:
    print(f"\n=== Top Features for {pc} ===")
    display(loadings[pc].abs().sort_values(ascending=False).head(10))


=== Top Features for PC1 ===


Unnamed: 0,PC1
engagement_score,0.254958
product_diversity,0.247749
norm_diversity,0.247749
category_diversity,0.244256
norm_adoptions,0.241883
frequency,0.241883
total_adoptions,0.241883
net_product_growth,0.234907
norm_growth,0.234907
adoption_value_cv,0.215472



=== Top Features for PC2 ===


Unnamed: 0,PC2
customer_tenure_months,0.335766
recency,0.335766
recency_proxy,0.335766
years_since_join,0.335766
years_calc,0.333796
cluster,0.242037
active_months,0.185016
product_churn_rate,0.180028
total_cancellations,0.168661
mortgage,0.168009



=== Top Features for PC3 ===


Unnamed: 0,PC3
mortgage,0.336252
adoption_value_std,0.313348
monetary,0.299617
norm_portfolio,0.299617
portfolio_value,0.299617
avg_adoption_value,0.273092
recency,0.24658
recency_proxy,0.24658
customer_tenure_months,0.24658
years_since_join,0.24658



=== Top Features for PC4 ===


Unnamed: 0,PC4
adoption_frequency,0.451074
norm_frequency,0.451074
active_months,0.296143
avg_days_between_adoptions,0.294471
product_churn_rate,0.262531
total_cancellations,0.21206
norm_growth,0.152153
net_product_growth,0.152153
total_products_owned,0.131483
mortgage,0.125805



=== Top Features for PC5 ===


Unnamed: 0,PC5
junior_guarantee,0.68935
junior_account,0.681792
guarantees,0.101568
total_products_owned,0.095756
avg_days_between_adoptions,0.087456
total_cancellations,0.073823
adoption_frequency,0.065121
norm_frequency,0.065121
personal_income,0.042847
household_gross_income,0.042594



=== Top Features for PC6 ===


Unnamed: 0,PC6
personal_income,0.694582
household_gross_income,0.693688
current_loan_amount,0.111668
credit_score,0.096012
activity_status,0.048036
gender,0.043787
junior_guarantee,0.042411
junior_account,0.042239
total_products_owned,0.038967
credit_card,0.037582



=== Top Features for PC7 ===


Unnamed: 0,PC7
credit_score,0.434019
activity_status,0.377719
credit_card,0.373618
total_products_owned,0.352072
current_products_owned,0.331438
loans,0.243662
adoption_value_cv,0.209947
direct_debit,0.207892
pensions,0.185501
adoption_frequency,0.16656



=== Top Features for PC8 ===


Unnamed: 0,PC8
credit_score,0.398399
activity_status,0.390788
pensions,0.345311
avg_days_between_adoptions,0.252962
total_products_owned,0.25085
current_products_owned,0.245567
credit_card,0.240885
cluster,0.231241
loans,0.165022
direct_debit,0.151557



=== Top Features for PC9 ===


Unnamed: 0,PC9
loans,0.8486
current_loan_amount,0.285642
activity_status,0.243229
gender,0.192227
avg_days_between_adoptions,0.133484
adoption_value_cv,0.123791
guarantees,0.099608
credit_card,0.093026
mortgage,0.07919
norm_frequency,0.070092



=== Top Features for PC10 ===


Unnamed: 0,PC10
employment_status_int,0.721823
guarantees,0.535307
gender,0.322506
saving_account,0.240115
pensions,0.12722
junior_account,0.075271
credit_card,0.042874
activity_status,0.037056
direct_debit,0.03271
avg_days_between_adoptions,0.025546



=== Top Features for PC11 ===


Unnamed: 0,PC11
saving_account,0.865169
guarantees,0.476768
employment_status_int,0.100177
junior_account,0.072244
credit_card,0.059147
activity_status,0.030822
gender,0.030059
pensions,0.025967
credit_score,0.02331
direct_debit,0.022306



=== Top Features for PC12 ===


Unnamed: 0,PC12
guarantees,0.650858
employment_status_int,0.599459
saving_account,0.433958
junior_account,0.096672
loans,0.081983
gender,0.079551
credit_card,0.036605
activity_status,0.033496
product_churn_rate,0.02227
cluster,0.021718



=== Top Features for PC13 ===


Unnamed: 0,PC13
gender,0.738422
pensions,0.457168
employment_status_int,0.318181
loans,0.172111
credit_card,0.170791
guarantees,0.163941
direct_debit,0.09942
product_churn_rate,0.085451
avg_days_between_adoptions,0.082726
total_cancellations,0.076341



=== Top Features for PC14 ===


Unnamed: 0,PC14
pensions,0.732101
gender,0.48017
credit_card,0.302329
direct_debit,0.194915
avg_days_between_adoptions,0.170415
total_cancellations,0.112673
current_products_owned,0.086443
total_products_owned,0.08565
total_adoptions,0.066401
norm_adoptions,0.066401



=== Top Features for PC15 ===


Unnamed: 0,PC15
product_churn_rate,0.468111
total_cancellations,0.422181
avg_days_between_adoptions,0.396211
norm_growth,0.306443
net_product_growth,0.306443
gender,0.220759
adoption_frequency,0.171424
norm_frequency,0.171424
active_months,0.164238
activity_status,0.13554



=== Top Features for PC16 ===


Unnamed: 0,PC16
avg_days_between_adoptions,0.510423
adoption_frequency,0.31892
norm_frequency,0.31892
cluster,0.304598
activity_status,0.296297
product_churn_rate,0.285237
adoption_value_cv,0.275151
credit_score,0.245878
current_loan_amount,0.189843
avg_adoption_value,0.165359



=== Top Features for PC17 ===


Unnamed: 0,PC17
current_loan_amount,0.847437
loans,0.32441
cluster,0.207004
avg_days_between_adoptions,0.166659
avg_adoption_value,0.146406
activity_status,0.098735
monetary,0.09227
norm_portfolio,0.09227
portfolio_value,0.09227
norm_frequency,0.091518



=== Top Features for PC18 ===


Unnamed: 0,PC18
cluster,0.597779
credit_card,0.411552
direct_debit,0.382814
avg_adoption_value,0.372111
avg_days_between_adoptions,0.241832
current_loan_amount,0.17797
norm_frequency,0.137267
adoption_frequency,0.137267
product_churn_rate,0.085877
adoption_value_cv,0.067703
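Several features in the listings above share identical loadings (for example product_diversity and norm_diversity, or frequency, total_adoptions, and norm_adoptions). This suggests that some columns are rescaled copies of others. A minimal way to flag such pairs is sketched below on a toy frame; the column names and the 0.999 cutoff are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Toy frame: "norm_x" is a min-max rescaled copy of "x", "y" is independent.
x = rng.normal(size=200)
toy = pd.DataFrame({
    "x": x,
    "norm_x": (x - x.min()) / (x.max() - x.min()),
    "y": rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair is reported once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Pairs whose absolute correlation is (numerically) 1 are treated as redundant.
redundant_pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > 0.999
]
print(redundant_pairs)
```

Dropping one column from each flagged pair before modelling removes duplicated information without losing any signal.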


In [None]:
threshold = 0.20

strong_features = (loadings.abs() > threshold).any(axis=1)
selected_features = loadings[strong_features]

print("\nFeatures strongly contributing to PCA components:")
display(selected_features)


Features strongly contributing to PCA components:


Unnamed: 0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,PC11,PC12,PC13,PC14,PC15,PC16,PC17,PC18
gender,0.023926,0.085395,-0.042079,-0.019431,0.000387,-0.043787,-0.03148,0.100615,0.192227,0.322506,-0.030059,-0.079551,0.738422,0.48017,-0.220759,0.012693,0.047615,-0.011347
activity_status,0.149829,-0.047416,-0.056211,0.048672,0.029287,0.048036,0.377719,0.390788,0.243229,0.037056,-0.030822,0.033496,0.000183,-0.049893,0.13554,-0.296297,-0.098735,0.005818
household_gross_income,0.011911,0.004625,-0.011141,0.016423,0.042594,0.693688,-0.074492,-0.028604,-0.029931,0.011301,0.00033,-0.013523,0.031824,0.017508,-0.019981,0.030854,-0.082179,-0.013599
saving_account,0.003741,0.001531,-0.0017,0.010182,0.003922,0.002343,-0.024753,0.040788,-0.031169,0.240115,0.865169,0.433958,-0.000808,-0.03296,-0.003416,-0.001784,0.003664,-0.013227
guarantees,0.004385,-6.1e-05,-0.004291,0.006714,0.101568,-0.004235,-0.010719,-0.058639,-0.099608,0.535307,-0.476768,0.650858,-0.163941,0.006829,-0.028733,0.017053,0.005917,-0.005904
junior_account,-0.004123,-0.031188,-0.000961,-0.079549,0.681792,-0.042239,0.019324,-0.080542,-0.007958,-0.075271,0.072244,-0.096672,0.035553,0.027209,0.039904,-0.054073,-0.018501,0.02396
loans,0.010302,0.025286,0.024171,-0.004749,0.003709,-0.017108,-0.243662,-0.165022,0.8486,-0.019564,-0.006295,0.081983,-0.172111,-0.008852,0.00917,-0.042746,-0.32441,-0.041699
credit_card,0.136916,0.003955,-0.043229,0.109416,0.038725,-0.037582,-0.373618,0.240885,-0.093026,0.042874,-0.059147,-0.036605,0.170791,-0.302329,0.081299,-0.045453,0.01008,-0.411552
pensions,0.052582,0.028,-0.018973,0.084709,0.036452,-0.00747,-0.185501,0.345311,-0.066068,-0.12722,0.025967,0.008103,-0.457168,0.732101,0.018486,0.092509,0.070054,-0.03978
direct_debit,0.188392,0.013793,-0.036283,0.110081,0.025268,-0.032339,-0.207892,0.151557,-0.031684,0.03271,-0.022306,-0.01862,0.09942,-0.194915,0.033262,0.018968,-0.019181,0.382814
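The thresholding step above yields a table of loadings, not a reduced feature matrix. If the aim were to train on a subset of the original features chosen this way, the selected names could be used to index the untransformed data, as in the hypothetical sketch below (toy loadings and toy feature names throughout).

```python
import numpy as np
import pandas as pd

# Toy loadings table (features x components); values are illustrative only.
toy_loadings = pd.DataFrame(
    {"PC1": [0.70, 0.05, -0.65], "PC2": [0.10, 0.15, 0.30]},
    index=["feat_a", "feat_b", "feat_c"],
)
threshold = 0.20

# A feature is kept if its absolute loading exceeds the threshold on any component.
keep_mask = (toy_loadings.abs() > threshold).any(axis=1)
selected_names = keep_mask[keep_mask].index.tolist()

# Subset the original (untransformed) feature matrix by the selected names.
toy_X = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["feat_a", "feat_b", "feat_c"])
toy_X_selected = toy_X[selected_names]

print(selected_names, toy_X_selected.shape)
```

With many retained components, an "any component" rule tends to keep most features, so a stricter rule (for instance, restricting the check to the leading components) may be needed for the selection to be meaningful.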


In [None]:
cls_models_pca = {
    "XGBoost_FS": XGBClassifier(
        n_estimators=300, learning_rate=0.05, max_depth=5,
        subsample=0.8, colsample_bytree=0.8,
        objective="multi:softprob", eval_metric="mlogloss",
        random_state=42, n_jobs=-1),

    "LightGBM_FS": LGBMClassifier(
        n_estimators=300, learning_rate=0.05,
        num_leaves=31, subsample=0.8, colsample_bytree=0.8,
        objective="multiclass", random_state=42, n_jobs=-1)
}


In [None]:
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix

print("\n==============================================")
print("  USING PCA-SELECTED ORIGINAL FEATURES")
print("==============================================")

results_pca_selected = {}

for name, clf in cls_models_pca.items():

    print(f"\n\n================ {name} (PCA-Selected Features) ================")

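    # ---------------------------------------------------------------
    # 1) TRAIN MODEL
    # The models are fitted on the 18 PCA component scores
    # (X_train_pca), not on a subset of the original features.
    # ---------------------------------------------------------------
    clf.fit(X_train_pca, y_train2_res)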

    # ---------------------------------------------------------------
    # 2) TRAIN SET EVALUATION
    # ---------------------------------------------------------------
    print("\n---------------- TRAIN SET PERFORMANCE ----------------")
    y_train_pred = clf.predict(X_train_pca)

    train_acc = accuracy_score(y_train2_res, y_train_pred)
    train_f1  = f1_score(y_train2_res, y_train_pred, average="macro")

    print(f"Train Accuracy: {train_acc:.3f}")
    print(f"Train Macro F1: {train_f1:.3f}")
    print("\nTrain Classification Report:")
    print(classification_report(
        y_train2_res,
        y_train_pred,
        target_names=le2.classes_  # sorted order used by the encoded labels
    ))

    # ---------------------------------------------------------------
    # 3) TEST SET EVALUATION
    # ---------------------------------------------------------------
    print("\n---------------- TEST SET PERFORMANCE ----------------")
    y_test_pred = clf.predict(X_test_pca)

    test_acc = accuracy_score(y_test2, y_test_pred)
    test_f1  = f1_score(y_test2, y_test_pred, average="macro")

    print(f"Test Accuracy: {test_acc:.3f}")
    print(f"Test Macro F1: {test_f1:.3f}")
    print("\nTest Classification Report:")
    print(classification_report(
        y_test2,
        y_test_pred,
        target_names=le2.classes_  # sorted order used by the encoded labels
    ))

    # ---------------------------------------------------------------
    # CONFUSION MATRIX
    # ---------------------------------------------------------------
    print("\nConfusion Matrix:")
    print(confusion_matrix(y_test2, y_test_pred))


    # Store results
    results_pca_selected[name] = {
        "train_acc": train_acc,
        "train_f1": train_f1,
        "test_acc": test_acc,
        "test_f1": test_f1,

    }


  USING PCA-SELECTED ORIGINAL FEATURES



---------------- TRAIN SET PERFORMANCE ----------------
Train Accuracy: 0.687
Train Macro F1: 0.679

Train Classification Report:
               precision    recall  f1-score   support

  Mid (36–50)       0.61      0.67      0.64     54111
  Older (50+)       0.67      0.50      0.57     54111
Young (18–35)       0.78      0.89      0.83     54111

     accuracy                           0.69    162333
    macro avg       0.68      0.69      0.68    162333
 weighted avg       0.68      0.69      0.68    162333


---------------- TEST SET PERFORMANCE ----------------
Test Accuracy: 0.543
Test Macro F1: 0.422

Test Classification Report:
               precision    recall  f1-score   support

  Mid (36–50)       0.68      0.64      0.66     13528
  Older (50+)       0.42      0.34      0.38      6792
Young (18–35)       0.14      0.57      0.22       660

     accuracy                           0.54     20980
    macro avg       0.41      0.52 