# Processed Dataset loading

In [89]:
import sys
import os
import importlib
import pathlib
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd


# Add project to path so we can import our modules
sys.path.append(os.path.abspath(".."))

# Import functionality from our source code
import src.data_loading.data_loader as data_loader
importlib.reload(data_loader)
from src.data_loading.data_loader import load_data, get_numerical_features, get_categorical_features

# Loading preprocessed dataset
script_dir  = pathlib.Path.cwd()              
project_root = script_dir.parent

file_path_full_df = project_root / 'data' / 'processed' / 'preprocessed_full_dataset.csv'
file_path_first_10k_rows = project_root / 'data' / 'processed' / 'preprocessed_dataset_first_10k_rows.csv'

df_full = load_data(str(file_path_full_df))
df_first_10k_rows = load_data(str(file_path_first_10k_rows))

Loading data from c:\Users\Utente\Desktop\STUDIO\LUISS\ANNO_3\Advanced_Coding\Credit_Score_Classification\data\processed\preprocessed_full_dataset.csv
Loaded dataset with 100000 rows and 32 columns
Loading data from c:\Users\Utente\Desktop\STUDIO\LUISS\ANNO_3\Advanced_Coding\Credit_Score_Classification\data\processed\preprocessed_dataset_first_10k_rows.csv
Loaded dataset with 10000 rows and 32 columns


In [90]:
# Choose here if to use the df full or 10k rows (faster computations but less reliable results)

#df = df_full 

df = df_first_10k_rows 

# Credit Score Classification Pipeline (End-to-End)

This script executes a modular, reproducible, and configurable machine learning pipeline for multi-class credit score classification, using XGBoost and an extended preprocessing + tuning architecture. The pipeline is implemented using a custom CreditScorePipeline class, which encapsulates all steps from data preparation to evaluation.

Pipeline Overview and Key Responsibilities

The pipeline performs the following key tasks:

  - Data Splitting

    - Splits the dataset into training, validation, and test sets using stratified sampling to preserve class distributions.

    - Ensures independence between hyperparameter tuning (via validation set) and final model evaluation (via test set).

  - Preprocessing Setup

    - Loads configurations (e.g., encoding strategies, SMOTE settings, outlier handling) from a central YAML config file.

    - Defines hyperparameter search space using values specified in the config.

  - Pipeline Construction (imblearn pipeline composed of)

    - OutlierHandler (custom component)

    - FeatureEncoder (with support for different encoding strategies)

    - SMOTE to address class imbalance

    - XGBoost Classifier as the final model

    - Each component is parameterized for tuning.

  - Hyperparameter Tuning

    - Uses RandomizedSearchCV with cross-validation (cv=5) to explore a subset of combinations from a large parameter space.

    - Randomized search allows balancing thoroughness and computational efficiency (n_iter=30 here).

    - Includes tuning of model, preprocessing, and sampling strategies—making this a full pipeline optimization.

  - Model Evaluation

    - Evaluates the best-found model on both validation and test sets using classification metrics and confusion matrix.

    - Returns a dictionary of performance metrics for downstream reporting.

  - Feature Importance Visualization

    - Extracts feature importances from the trained XGBoost model and displays the top features in a bar plot.

    - Supports recovery of feature names even after transformation via pipeline components.


Key Design Highlights

- Configuration-driven design: All hyperparameter ranges, encoding strategies, and SMOTE settings are externally defined in config.yaml, allowing flexibility and separation of concerns.

- Modular structure: Each stage of the pipeline is independently testable, extensible, and reusable.

- Reproducibility: The random_state parameter ensures consistent results across runs.

- Balanced tuning: Incorporates not just model hyperparameters, but also preprocessing decisions (e.g., outlier removal thresholds, encoding methods), providing a more realistic view of model performance.


Default Parameters Used:

- Test Set Size: 20%

- Validation Set Size: 20% (of remaining training data)

- Cross-Validation: 5-fold

- Random Search Iterations: 30 combinations

- Model: XGBoost (multi:softprob objective for multiclass classification)

In [91]:
import importlib
import src.models.credit_score_pipeline as pipeline_module
importlib.reload(pipeline_module)
from src.models.credit_score_pipeline import CreditScorePipeline

In [92]:
'''

df_copy = df.drop(columns=["Customer_ID"]) # Dropping the identifier column Customer_ID

# Relative path from notebooks directory to config file
config_path = "../config.yaml"


"""Run the credit score classification pipeline."""


# Separate features and target
X = df_copy.drop("Credit_Score", axis=1)
y = df_copy["Credit_Score"]

# Initialize pipeline
pipeline = CreditScorePipeline(config_path=config_path, random_state=42)

# Run full pipeline
results = pipeline.run_full_pipeline(
    X, y, 
    test_size=0.2,
    val_size=0.2,
    n_iter=30,  # Number of hyperparameter combinations to try
    cv=5        # Number of cross-validation folds
)

'''


'\n\ndf_copy = df.drop(columns=["Customer_ID"]) # Dropping the identifier column Customer_ID\n\n# Relative path from notebooks directory to config file\nconfig_path = "../config.yaml"\n\n\n"""Run the credit score classification pipeline."""\n\n\n# Separate features and target\nX = df_copy.drop("Credit_Score", axis=1)\ny = df_copy["Credit_Score"]\n\n# Initialize pipeline\npipeline = CreditScorePipeline(config_path=config_path, random_state=42)\n\n# Run full pipeline\nresults = pipeline.run_full_pipeline(\n    X, y, \n    test_size=0.2,\n    val_size=0.2,\n    n_iter=30,  # Number of hyperparameter combinations to try\n    cv=5        # Number of cross-validation folds\n)\n\n'

Analysis of results:

- The model performs consistently well across all data splits (cross-validation, validation, and test), indicating good generalization and no evident overfitting.

- High precision and recall across majority classes (especially for class 1 and class 2), with F1-scores ~0.98.

- The confusion matrix confirms this, showing very few misclassifications and strong separation across classes.


Potential Concerns & Biases:
 - Over-reliance on City feature

   - The top four most important features are all related to the City variable (e.g., City_ZeroVille, City_Standhampton).

   - Given prior observations that City categories showed perfect correlation with the target, this likely reflects data leakage or labeling artifacts.

   - This inflates performance artificially and compromises model fairness and generalizability. It may also mask the true influence of more general financial behavior features.

 - Underweighting of key financial predictors

   - Important behavioral features like Num_of_Delayed_Payment, Credit_Mix, and Loan_Count are ranked very low in importance.

   - This may be a consequence of the model prioritizing easier-to-exploit categorical artifacts (City, Street), rather than genuinely predictive patterns.



### Drop column City

In [93]:
'''
df_copy = df.drop(columns=["Customer_ID","City"]) 

# Relative path from notebooks directory to config file
config_path = "../config.yaml"


"""Run the credit score classification pipeline."""


# Separate features and target
X = df_copy.drop("Credit_Score", axis=1)
y = df_copy["Credit_Score"]

# Initialize pipeline
pipeline = CreditScorePipeline(config_path=config_path, random_state=42)

# Run full pipeline
results = pipeline.run_full_pipeline(
    X, y, 
    test_size=0.2,
    val_size=0.2,
    n_iter=30,  # Number of hyperparameter combinations to try
    cv=5        # Number of cross-validation folds
)

'''

'\ndf_copy = df.drop(columns=["Customer_ID","City"]) \n\n# Relative path from notebooks directory to config file\nconfig_path = "../config.yaml"\n\n\n"""Run the credit score classification pipeline."""\n\n\n# Separate features and target\nX = df_copy.drop("Credit_Score", axis=1)\ny = df_copy["Credit_Score"]\n\n# Initialize pipeline\npipeline = CreditScorePipeline(config_path=config_path, random_state=42)\n\n# Run full pipeline\nresults = pipeline.run_full_pipeline(\n    X, y, \n    test_size=0.2,\n    val_size=0.2,\n    n_iter=30,  # Number of hyperparameter combinations to try\n    cv=5        # Number of cross-validation folds\n)\n\n'

Results:

### Drop both columns City and Street

In [94]:
'''
df_copy = df.drop(columns=["Customer_ID","City","Street"]) 

# Relative path from notebooks directory to config file
config_path = "../config.yaml"


"""Run the credit score classification pipeline."""


# Separate features and target
X = df_copy.drop("Credit_Score", axis=1)
y = df_copy["Credit_Score"]

# Initialize pipeline
pipeline = CreditScorePipeline(config_path=config_path, random_state=42)

# Run full pipeline
results = pipeline.run_full_pipeline(
    X, y, 
    test_size=0.2,
    val_size=0.2,
    n_iter=30,  # Number of hyperparameter combinations to try
    cv=5        # Number of cross-validation folds
)
'''

'\ndf_copy = df.drop(columns=["Customer_ID","City","Street"]) \n\n# Relative path from notebooks directory to config file\nconfig_path = "../config.yaml"\n\n\n"""Run the credit score classification pipeline."""\n\n\n# Separate features and target\nX = df_copy.drop("Credit_Score", axis=1)\ny = df_copy["Credit_Score"]\n\n# Initialize pipeline\npipeline = CreditScorePipeline(config_path=config_path, random_state=42)\n\n# Run full pipeline\nresults = pipeline.run_full_pipeline(\n    X, y, \n    test_size=0.2,\n    val_size=0.2,\n    n_iter=30,  # Number of hyperparameter combinations to try\n    cv=5        # Number of cross-validation folds\n)\n'

Results:

### Customer-based data split to avoid data leakage

In [95]:
df_copy = df.drop(columns=["City","Street"]) 

# Relative path from notebooks directory to config file
config_path = "../config.yaml"


"""Run the credit score classification pipeline."""


# Separate features and target
X = df_copy.drop("Credit_Score", axis=1)
y = df_copy["Credit_Score"]

# Initialize pipeline
pipeline = CreditScorePipeline(config_path=config_path, random_state=42)

# Run full pipeline
results = pipeline.run_full_pipeline(
    X, y, 
    test_size=0.2,
    val_size=0.2,
    n_iter=30,  # Number of hyperparameter combinations to try
    cv=5,        # Number of cross-validation folds
    custom_split=True,
    customer_id_col="Customer_ID"
)

Feature-based split using Customer_ID (then dropped) into train (800 customers, 6400 records), validation (200 customers, 1600 records), and test (250 customers, 2000 records) sets.
Fitting 5 folds for each of 30 candidates, totalling 150 fits


KeyboardInterrupt: 

Results:

### Aggregated dataset with unique customers records

In [96]:
script_dir  = pathlib.Path.cwd()              
project_root = script_dir.parent

file_path_aggregated_df = project_root / 'data' / 'processed' / 'aggregated_dataset.csv'

df_aggregated = load_data(str(file_path_aggregated_df))


Loading data from c:\Users\Utente\Desktop\STUDIO\LUISS\ANNO_3\Advanced_Coding\Credit_Score_Classification\data\processed\aggregated_dataset.csv
Loaded dataset with 12500 rows and 32 columns


In [97]:
df_copy = df.drop(columns=["Customer_ID", "Month", "City", "Street"]) # Dropping the identifier column Customer_ID

# Relative path from notebooks directory to config file
config_path = "../config.yaml"


"""Run the credit score classification pipeline."""


# Separate features and target
X = df_copy.drop("Credit_Score", axis=1)
y = df_copy["Credit_Score"]

# Initialize pipeline
pipeline = CreditScorePipeline(config_path=config_path, random_state=42)

# Run full pipeline
results = pipeline.run_full_pipeline(
    X, y, 
    test_size=0.2,
    val_size=0.2,
    n_iter=30,  # Number of hyperparameter combinations to try
    cv=5        # Number of cross-validation folds
)

Random split into train (6400 records), validation (1600 records), and test (2000 records) sets.
Fitting 5 folds for each of 30 candidates, totalling 150 fits


KeyboardInterrupt: 