In [1]:
%pip install missingno MissForest lazypredict

Note: you may need to restart the kernel to use updated packages.


In [None]:
import sqlite3
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier
from sklearn.inspection import permutation_importance
import shap
import lightgbm as lgb

# NOTE
The DataFrame df is created from the table basa_datos_pripal in the credit_scoring.db database. The file is not included in the repository because of its size; how the database is generated from the original CSV and how the table is created is explained in detail in the notebook data-collection.ipynb.

Descriptive analysis:
In this part of the project, we begin exploring the dataset created from the initial information obtained from the LendingClub dataset (Kaggle). The objective of this stage is to describe and understand the structure of the data, the variables and their types, their distributions, skewness, and the presence of missing values.

In [3]:
conn = sqlite3.connect("/workspaces/final_project_creditscoring/Data/credit_scoring.db")
df = pd.read_sql("SELECT * FROM basa_datos_pripal", conn)
conn.close()



In [4]:
conn = sqlite3.connect("/workspaces/final_project_creditscoring/Data/credit_scoring.db")
df = pd.read_sql("SELECT * FROM basa_datos_pripal", conn)
conn.close()
n_rows,n_cols = df.shape

print(f'This df has {n_rows} rows and {n_cols} columns')

This df has 192309 rows and 151 columns


The analysis below shows that the dataset is dominated by numerical variables (112 float columns and 1 integer column), alongside 38 categorical features stored as object types. This initial inspection highlights the need for feature selection and type handling in later stages.

In [9]:
object_cols = df.select_dtypes(include="object").columns.tolist()
int_cols = df.select_dtypes(include="int64").columns.tolist()
float_cols = df.select_dtypes(include="float64").columns.tolist()

print(f"Objects: {len(object_cols)}")
print(f"Ints: {len(int_cols)}")
print(f"Floats: {len(float_cols)}")

Objects: 38
Ints: 1
Floats: 112


The code below checks whether any object column is really numeric data stored as text: a column is flagged when more than 90% of its values can be converted to numbers. The resulting list is empty, which tells us that no object column needs to be retyped as numeric.

In [10]:
numeric_like = []

for col in object_cols:
    converted = pd.to_numeric(df[col], errors="coerce")
    ratio = converted.notna().mean()

    if ratio > 0.9:  # 90% convertible
        numeric_like.append(col)

numeric_like

[]

In [11]:
for col in object_cols:
    print(f"\n{col}")
    print(df[col].value_counts().head(5))


member_id
Series([], Name: count, dtype: int64)

term
term
36 months    137472
60 months     54837
Name: count, dtype: int64

grade
grade
B    57542
C    51332
A    37335
D    27757
E    12718
Name: count, dtype: int64

sub_grade
sub_grade
B3    12706
B4    12370
C1    12100
B5    11572
C2    11040
Name: count, dtype: int64

emp_title
emp_title
Teacher             2338
Manager             1952
Owner               1249
Supervisor           933
Registered Nurse     921
Name: count, dtype: int64

emp_length
emp_length
10+ years    60263
2 years      17674
< 1 year     16144
3 years      15639
5 years      13163
Name: count, dtype: int64

home_ownership
home_ownership
MORTGAGE    93952
RENT        79309
OWN         18968
ANY            52
OTHER          15
Name: count, dtype: int64

verification_status
verification_status
Source Verified    67530
Not Verified       62455
Verified           62324
Name: count, dtype: int64

issue_d
issue_d
2011-10-01 00:00:00    1941
2011-11-01 00:00:00    

Column format review – findings:
1. No numeric-like object columns detected. All object variables represent categorical or textual information.
2. Multiple date-related columns are stored as object and will require conversion to datetime.
3. Several high-cardinality or free-text columns (emp_title, url, desc) were identified as non-informative for modeling and will be considered for removal in later steps.
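
As a rough illustration of finding 2, the conversion of date-like object columns could look like the sketch below. The toy frame and the `date_cols` list are assumptions made for the example; only the column name issue_d and its string format come from the output above.

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
sample = pd.DataFrame({
    "issue_d": ["2011-10-01 00:00:00", "2011-11-01 00:00:00", None],
    "grade": ["A", "B", "C"],
})

# Hypothetical list of date-like columns; extend it after reviewing the data dictionary.
date_cols = ["issue_d"]

for col in date_cols:
    # errors="coerce" turns unparseable entries into NaT instead of raising.
    sample[col] = pd.to_datetime(sample[col], errors="coerce")

print(sample.dtypes)
```

Coercing errors keeps the conversion from failing on odd entries, at the cost of silently producing NaT, so it is worth comparing the missing count before and after.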

Revision of constant columns: 

An earlier version of the following code did not specify dropna=False. Because nunique() ignores NaN by default, columns with a single distinct non-null value were reported as constants, which led us to check whether we were working with the right dataframe. The mistake was still useful: it helped us identify possible *data leakage variables, such as: hardship_type, deferral_term, and hardship_length.*

In [12]:
pd.set_option('display.max_rows', None)
pd.set_option('display.width', None)

# Unique values per column
uniq = df.nunique(dropna=False)

# Show results
print(uniq)


id                                            192309
member_id                                          1
loan_amnt                                       1447
funded_amnt                                     1447
funded_amnt_inv                                 4951
term                                               2
int_rate                                         499
installment                                    42629
grade                                              7
sub_grade                                         35
emp_title                                      92975
emp_length                                        12
home_ownership                                     6
annual_inc                                     15656
verification_status                                3
issue_d                                          103
loan_status                                        9
pymnt_plan                                         2
url                                           

In [13]:
uniq = df.nunique(dropna=False)
uniq[uniq == 1]

member_id      1
policy_code    1
dtype: int64
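
To illustrate why the dropna argument matters for this check, here is a minimal sketch on a made-up frame. The column names `mostly_missing` and `truly_constant` are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd

# Made-up data: one column with a single non-null value, one truly constant column.
toy = pd.DataFrame({
    "mostly_missing": ["some_value", np.nan, np.nan, np.nan],
    "truly_constant": [1.0, 1.0, 1.0, 1.0],
})

default_counts = toy.nunique()            # NaN is ignored (dropna=True is the default)
full_counts = toy.nunique(dropna=False)   # NaN is counted as its own value

constant_default = default_counts[default_counts == 1].index.tolist()
constant_full = full_counts[full_counts == 1].index.tolist()

print(constant_default)  # ['mostly_missing', 'truly_constant']
print(constant_full)     # ['truly_constant']
```

With the default setting, a sparsely populated column looks constant; counting NaN as a value separates it from columns that really hold one value in every row.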

After this correction, the only two constant variables are member_id and policy_code. A single value for member_id makes little sense in a dataset with almost 200k entries, so we checked the exact values contained in this column.

In [14]:
cols = ['member_id', 'policy_code']

for col in cols:
    print(df[col].value_counts(dropna=False))

member_id
None    192309
Name: count, dtype: int64
policy_code
1.0    192309
Name: count, dtype: int64


Below, we proceed to drop from the dataset the columns that aren’t relevant to the analysis or that were identified as redundant.

In [15]:
df = df.drop(columns=['member_id', 'policy_code','id','url','emp_title','desc','title'])

Revision of Duplicated Rows: No duplicated rows were identified.

In [16]:
df.duplicated().sum()

np.int64(0)

Revision of Duplicated Columns: Two variables (deferral_term and hardship_length) were found to be exact duplicates, containing identical values across all observations. Both variables are related to post-loan hardship events (we previously identified them as potential data leakers) and will therefore be excluded from the modeling stage. 

In [17]:
df.T.duplicated().sum()
df.T.duplicated(keep=False)

loan_amnt                                     False
funded_amnt                                   False
funded_amnt_inv                               False
term                                          False
int_rate                                      False
installment                                   False
grade                                         False
sub_grade                                     False
emp_length                                    False
home_ownership                                False
annual_inc                                    False
verification_status                           False
issue_d                                       False
loan_status                                   False
pymnt_plan                                    False
purpose                                       False
zip_code                                      False
addr_state                                    False
dti                                           False
delinq_2yrs 
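
The boolean output above marks which columns are duplicated, but not which column each one duplicates. One possible way to list the matching pairs is sketched below on a made-up frame; the values are invented, and `duplicate_pairs` is a name introduced only for this example.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Made-up values; only the column names come from the notebook.
toy = pd.DataFrame({
    "deferral_term": [np.nan, 3.0, np.nan],
    "hardship_length": [np.nan, 3.0, np.nan],
    "loan_amnt": [1000.0, 2000.0, 3000.0],
})

# Series.equals treats NaNs in the same positions as equal,
# but it also requires both columns to share the same dtype.
duplicate_pairs = [
    (a, b)
    for a, b in combinations(toy.columns, 2)
    if toy[a].equals(toy[b])
]

print(duplicate_pairs)  # [('deferral_term', 'hardship_length')]
```

The pairwise comparison grows quadratically with the number of columns, which should still be manageable for a frame of this width.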

Missing values: 
We identified columns with a high percentage of missing values, so we proceeded to define a missing threshold of 50%, where variables with more than 50% missing values will be considered for exclusion from the modeling stage. However, first we must evaluate them on a case-by-case basis to understand if any of those variables are conceptually important.

In [18]:
missing = df.isna().mean()*100
missing[missing>50]
print(missing)

loan_amnt                                      0.000000
funded_amnt                                    0.000000
funded_amnt_inv                                0.000000
term                                           0.000000
int_rate                                       0.000000
installment                                    0.000000
grade                                          0.000000
sub_grade                                      0.000000
emp_length                                     5.422523
home_ownership                                 0.000000
annual_inc                                     0.000000
verification_status                            0.000000
issue_d                                        0.000000
loan_status                                    0.000000
pymnt_plan                                     0.000000
purpose                                        0.000000
zip_code                                       0.000000
addr_state                                     0

In [19]:
missing_threshold = 50

high_missing_cols = missing[missing >= missing_threshold]
print(high_missing_cols)

mths_since_last_delinq                        54.782147
mths_since_last_record                        87.175327
next_pymnt_d                                  75.705245
mths_since_last_major_derog                   80.238054
annual_inc_joint                              96.907581
dti_joint                                     96.907581
verification_status_joint                     97.014180
open_acc_6m                                   63.197250
open_act_il                                   63.197250
open_il_12m                                   63.197250
open_il_24m                                   63.197250
mths_since_rcnt_il                            64.383882
total_bal_il                                  63.197250
il_util                                       68.575574
open_rv_12m                                   63.197250
open_rv_24m                                   63.197250
max_bal_bc                                    63.197250
all_util                                      63

In [20]:
(missing > 0).sum()

np.int64(95)

Now we have to identify other missing values and audit them to understand how we should treat each case.

In [21]:
#Identify other missing values
cat_col = df.select_dtypes(include=['object']).columns

for col in cat_col: 
    print(df[col].value_counts())

term
36 months    137472
60 months     54837
Name: count, dtype: int64
grade
B    57542
C    51332
A    37335
D    27757
E    12718
F     4417
G     1208
Name: count, dtype: int64
sub_grade
B3    12706
B4    12370
C1    12100
B5    11572
C2    11040
B2    10845
B1    10049
C3     9871
C4     9583
A5     9521
A4     9127
C5     8738
D1     6617
A1     6535
D2     6314
A3     6311
A2     5841
D3     5667
D4     4927
D5     4232
E1     3185
E2     2938
E3     2421
E4     2171
E5     2003
F1     1390
F2     1026
F3      861
F4      636
F5      504
G1      404
G2      288
G3      200
G4      184
G5      132
Name: count, dtype: int64
emp_length
10+ years    60263
2 years      17674
< 1 year     16144
3 years      15639
5 years      13163
1 year       12722
4 years      12500
6 years       9969
7 years       8823
8 years       8077
9 years       6907
Name: count, dtype: int64


home_ownership
MORTGAGE    93952
RENT        79309
OWN         18968
ANY            52
OTHER          15
NONE           13
Name: count, dtype: int64
verification_status
Source Verified    67530
Not Verified       62455
Verified           62324
Name: count, dtype: int64
issue_d
2011-10-01 00:00:00    1941
2011-11-01 00:00:00    1941
2011-12-01 00:00:00    1941
2012-01-01 00:00:00    1941
2012-02-01 00:00:00    1941
2012-03-01 00:00:00    1941
2012-04-01 00:00:00    1941
2013-01-01 00:00:00    1941
2013-02-01 00:00:00    1941
2013-03-01 00:00:00    1941
2013-04-01 00:00:00    1941
2013-05-01 00:00:00    1941
2013-06-01 00:00:00    1941
2013-07-01 00:00:00    1941
2011-09-01 00:00:00    1941
2012-12-01 00:00:00    1941
2012-11-01 00:00:00    1941
2012-10-01 00:00:00    1941
2012-09-01 00:00:00    1941
2012-08-01 00:00:00    1941
2012-07-01 00:00:00    1941
2012-06-01 00:00:00    1941
2012-05-01 00:00:00    1941
2013-08-01 00:00:00    1941
2014-11-01 00:00:00    1941
2014-10-01 00:00:00   

In [22]:
# replace(..., inplace=True) returns None, so assign the result instead of using inplace
df = df.replace(['None'], np.nan)

Columns with more than 50% missing values were manually reviewed and classified into post-loan variables, second-applicant features, structurally missing variables, and late-reported behavioral features based on domain knowledge and data documentation. No features were removed at this stage; the analysis documents decisions to be applied during model preparation.

In [23]:
def classify_column(col):
    # --- POST-LOAN / LEAKAGE ---
    if (
        col.startswith(("hardship", "settlement", "deferral")) or
        col in [
            'recoveries',
            'collection_recovery_fee',
            'total_rec_late_fee',
            'out_prncp',
            'out_prncp_inv',
            'loan_status'
        ]
    ):
        return "post_loan"

    # --- SECOND APPLICANT ---
    if col.startswith("sec_app") or col.endswith("_joint"):
        return "second_applicant"

    # --- STRUCTURAL MISSING ---
    if col.startswith("mths_since"):
        return "structural_missing"

    # --- EVERYTHING ELSE ---
    return "other"


# Explicit categories (including OTHER)
categories = [
    "post_loan",
    "second_applicant",
    "structural_missing",
    "other"
]

# Build the dictionary of column lists
analysis_dict = {
    cat: [col for col in df.columns if classify_column(col) == cat]
    for cat in categories
}

print("🔍 --- STARTING COLUMN CATEGORIZATION ANALYSIS --- 🔍\n")

for category, cols in analysis_dict.items():
    print(f"📁 CATEGORY: {category.upper()}")
    if not cols:
        print("   ❌ No columns found in this category.\n")
    else:
        print(f"   ✅ Found {len(cols)} columns.")
        missing_stats = df[cols].isnull().mean() * 100
        print(missing_stats.sort_values(ascending=False).to_string())
        print("-" * 40 + "\n")

print("🚀 --- ANALYSIS COMPLETE --- 🚀")


🔍 --- STARTING COLUMN CATEGORIZATION ANALYSIS --- 🔍

📁 CATEGORY: POST_LOAN
   ✅ Found 24 columns.
hardship_type                     99.685922
hardship_length                   99.685922
hardship_end_date                 99.685922
hardship_start_date               99.685922
hardship_amount                   99.685922
deferral_term                     99.685922
hardship_status                   99.685922
hardship_reason                   99.685922
hardship_loan_status              99.685922
hardship_payoff_balance_amount    99.685922
hardship_last_payment_amount      99.685922
hardship_dpd                      99.685922
settlement_date                   98.681809
settlement_amount                 98.681809
settlement_percentage             98.681809
settlement_status                 98.681809
settlement_term                   98.681809
hardship_flag                      0.000000
recoveries                         0.000000
collection_recovery_fee            0.000000
out_prncp             

In [24]:
# def classify_column(col):
#     if col.startswith(("hardship", "settlement", "deferral")):
#         return "post_loan"
#     if col.startswith("sec_app") or col.endswith("_joint"):
#         return "second_applicant"
#     if col.startswith("mths_since"):
#         return "structural_missing"
#     return "other"

# categories = ["post_loan", "second_applicant", "structural_missing"]
# analysis_dict = {cat: [col for col in df.columns if classify_column(col) == cat] for cat in categories}

# print("🔍 --- STARTING COLUMN CATEGORIZATION ANALYSIS --- 🔍\n")

# for category, cols in analysis_dict.items():
#     print(f"📁 CATEGORY: {category.upper()}")
#     if not cols:
#         print("   ❌ No columns found in this category.\n")
#     else:
#         print(f"   ✅ Found {len(cols)} columns.")
#         missing_stats = df[cols].isnull().mean() * 100
#         print(missing_stats.sort_values(ascending=False).to_string())
#         print("-" * 40 + "\n")

# print("🚀 --- ANALYSIS COMPLETE --- 🚀")

In [25]:
#Column checker: to be able to quickly check the characteristics of a column and its type

target_col = 'loan_status'  

col_type = df[target_col].dtype

summary = pd.DataFrame({
    'Count': df[target_col].value_counts(dropna=False),
    'Percentage (%)': df[target_col].value_counts(dropna=False, normalize=True) * 100
})

print(f"Content Analysis for: {target_col.upper()}")
print(summary)
print('---'*30)
print(f"Data Type: {col_type}")


Content Analysis for: LOAN_STATUS
                                                     Count  Percentage (%)
loan_status                                                               
Fully Paid                                          119522       62.151017
Current                                              44223       22.995804
Charged Off                                          26066       13.554228
Late (31-120 days)                                    1180        0.613596
In Grace Period                                        468        0.243358
Does not meet the credit policy. Status:Fully Paid     442        0.229838
Late (16-30 days)                                      219        0.113879
Does not meet the credit policy. Status:Charged...     186        0.096719
Default                                                  3        0.001560
------------------------------------------------------------------------------------------
Data Type: object


In [26]:
all_classified_cols = (analysis_dict['post_loan'] + 
                      analysis_dict['second_applicant'] + 
                      analysis_dict['structural_missing'])

other_cols = [col for col in df.columns if col not in all_classified_cols]

print("🔍 --- AUDITING 'OTHER' COLUMNS WITH HIGH MISSING RATIO (>40%) ---")
high_missing_other = df[other_cols].isnull().mean()
high_missing_other = high_missing_other[high_missing_other > 0.4].sort_values(ascending=False)

if high_missing_other.empty:
    print("✅ No additional critical missing values found outside defined categories.")
else:
    print("⚠️ Attention: The following columns also have a high missing ratio:")
    print(high_missing_other.to_string())

🔍 --- AUDITING 'OTHER' COLUMNS WITH HIGH MISSING RATIO (>40%) ---
⚠️ Attention: The following columns also have a high missing ratio:
orig_projected_additional_accrued_interest    0.997566
payment_plan_start_date                       0.996859
debt_settlement_flag_date                     0.986818
next_pymnt_d                                  0.757052
il_util                                       0.685756
all_util                                      0.632004
open_acc_6m                                   0.631973
open_il_12m                                   0.631973
open_act_il                                   0.631973
open_rv_24m                                   0.631973
open_rv_12m                                   0.631973
total_bal_il                                  0.631973
open_il_24m                                   0.631973
total_cu_tl                                   0.631973
inq_fi                                        0.631973
max_bal_bc                               

1. *Post-loan variables:*

These features contain information generated after loan origination, such as hardship or settlement events. Their high missingness reflects the fact that most loans do not enter these processes. Because these variables include future information relative to the credit decision, they were identified as potential sources of data leakage.

Planned decision: Exclude.

2. *Second-applicant variables:*

These variables describe characteristics of a co-borrower in joint loan applications. The high proportion of missing values reflects that most loans involve a single applicant, meaning missing values indicate the absence of a second applicant rather than missing information.

Rather than modeling the full co-borrower profile, the presence of a second applicant is captured through a binary indicator. This approach preserves potentially relevant information while avoiding additional complexity and extensive imputation.

Planned decision: Create a binary flag indicating whether a loan includes a second applicant, and exclude detailed second-applicant features during model preparation.
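
A minimal sketch of this planned step is shown below. It assumes that any non-null value in a second-applicant column means a co-borrower exists; that rule, the toy values, and the flag name is_joint_application are illustrative, and an explicit application-type field, if the dataset has one, would be a more direct source.

```python
import numpy as np
import pandas as pd

# Made-up rows; the second loan is the only joint application.
toy = pd.DataFrame({
    "loan_amnt": [10000.0, 5000.0, 12000.0],
    "annual_inc_joint": [np.nan, 120000.0, np.nan],
    "dti_joint": [np.nan, 18.5, np.nan],
    "sec_app_fico_range_low": [np.nan, np.nan, np.nan],
})

# Same naming rule used for the second_applicant category.
second_app_cols = [
    c for c in toy.columns
    if c.startswith("sec_app") or c.endswith("_joint")
]

# Assumption: at least one populated second-applicant field implies a co-borrower.
toy["is_joint_application"] = toy[second_app_cols].notna().any(axis=1).astype(int)

# Drop the detailed co-borrower features once the flag exists.
toy = toy.drop(columns=second_app_cols)

print(toy)
```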

3. *Structurally missing variables:*

These features represent the time since the last occurrence of negative credit events. Missing values indicate that the event has never occurred, making the missingness itself informative.

Planned decision: Retain for modeling and apply a dedicated imputation strategy at a later stage.
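
One common way to handle this kind of informative missingness is to pair a "never occurred" indicator with a sentinel fill. The sketch below is only one option, not necessarily the strategy adopted later; the `_never` suffix and the sentinel rule are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Made-up values for one structurally missing feature.
toy = pd.DataFrame({"mths_since_last_delinq": [12.0, np.nan, 40.0, np.nan]})
col = "mths_since_last_delinq"

# Indicator: 1 when the event never occurred (value is missing).
toy[f"{col}_never"] = toy[col].isna().astype(int)

# Hypothetical sentinel: one month beyond the largest observed value,
# so "never" sorts after every real observation.
sentinel = toy[col].max() + 1
toy[col] = toy[col].fillna(sentinel)

print(toy)
```

In a real pipeline the sentinel would be derived from the training split only, so that no information from the test period leaks into the imputation.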

4. *Late-reported features:*

These variables were introduced into the dataset at later periods and are unavailable for older loans. Missingness is driven by historical reporting limitations rather than borrower behavior.

Planned decision: Evaluate after defining the temporal train-test split.
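
To make the idea of evaluating these features after a temporal split concrete, the sketch below compares missingness before and after a cutoff date. The cutoff, the dates, and the values are all invented for illustration; the real split point has not been defined at this stage.

```python
import pandas as pd

# Made-up loans: the feature is only populated for the more recent ones.
toy = pd.DataFrame({
    "issue_d": pd.to_datetime(
        ["2012-01-01", "2013-06-01", "2015-03-01", "2016-09-01"]
    ),
    "open_acc_6m": [None, None, 1.0, 0.0],
})

cutoff = pd.Timestamp("2015-01-01")  # hypothetical split date

train = toy[toy["issue_d"] < cutoff]
test = toy[toy["issue_d"] >= cutoff]

# Percentage of missing values per feature in each period.
missing_by_split = pd.DataFrame({
    "train_missing_pct": train.drop(columns="issue_d").isna().mean() * 100,
    "test_missing_pct": test.drop(columns="issue_d").isna().mean() * 100,
})

print(missing_by_split)
```

A feature that is almost entirely missing in the training period but populated in the test period cannot be learned from, which is the situation this check is meant to surface.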

DATA CLEANING & PREPROCESSING STRATEGY

Now that we have a clearer understanding of the data, we can proceed with data cleaning and processing.

1. DF Backup: Create a full copy of the raw dataset to ensure data integrity and allow for easy rollbacks during the experimentation phase.

2. Target Definition & Filtering: Refine the loan_status variable. We exclude ongoing loans and focus only on definitive outcomes.

Default (1): Charged off, default, or late (31–120 days).
Charged Off
Late (31-120 days)
Default
Does not meet the credit policy. Status:Charged Off

Non-Default (0): Fully paid.
Fully Paid

3. Leakage Removal: Drop all Post-loan variables. These features contain information only available after the credit decision has been made, which would lead to Data Leakage.

4. Structural Simplification (Joint Apps): Consolidate the 16+ second-applicant features into a single Binary Flag (is_joint_application). This reduces dimensionality while preserving the fact that a co-borrower exists.

5. Zero Ratio & Variance Analysis: Identify features with excessive sparsity. We decide whether to drop columns with near-zero variance or binarize features where the simple presence of an event (0 vs >0) is more predictive than its frequency.

6. Missingness Audit (Missingno): Visualize the remaining missing values to determine the mechanism of missingness (Random vs. Structural). This dictates the final decision: drop the column (if >50% NaN) or keep it for Imputation after the Train-Test Split.
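
As a sketch of step 3, the leakage removal could reuse the same naming rules as the post_loan category, applied once the target has been derived. The toy frame and the name `model_df` are assumptions for the example; columns that surfaced in the "other" audit, such as debt_settlement_flag_date or payment_plan_start_date, may also warrant review before being kept.

```python
import pandas as pd

# Made-up rows containing a mix of application-time and post-loan columns.
toy = pd.DataFrame({
    "loan_amnt": [10000.0, 5000.0],
    "hardship_flag": ["N", "N"],
    "settlement_term": [None, None],
    "recoveries": [0.0, 250.0],
    "loan_status": ["Fully Paid", "Charged Off"],
    "target_default": [0, 1],
})

# Same prefixes and names used to define the post_loan category.
leakage_prefixes = ("hardship", "settlement", "deferral")
leakage_names = {
    "recoveries", "collection_recovery_fee", "total_rec_late_fee",
    "out_prncp", "out_prncp_inv", "loan_status",
}

leakage_cols = [
    c for c in toy.columns
    if c.startswith(leakage_prefixes) or c in leakage_names
]

# loan_status is dropped too: the target has already been derived from it.
model_df = toy.drop(columns=leakage_cols)

print(model_df.columns.tolist())  # ['loan_amnt', 'target_default']
```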

In [27]:
df_backup_raw = df.copy()

Target Definition & Filtering:

In [28]:
df["loan_status"].value_counts()

loan_status
Fully Paid                                             119522
Current                                                 44223
Charged Off                                             26066
Late (31-120 days)                                       1180
In Grace Period                                           468
Does not meet the credit policy. Status:Fully Paid        442
Late (16-30 days)                                         219
Does not meet the credit policy. Status:Charged Off       186
Default                                                     3
Name: count, dtype: int64

In [29]:
default_statuses = [
    "Charged Off",
    "Default",
    "Late (31-120 days)",
    "Does not meet the credit policy. Status:Charged Off"
]

non_default_statuses = [
    "Fully Paid"
]

In [30]:
df_target = df[
    df["loan_status"].isin(default_statuses + non_default_statuses)
].copy()

In [31]:
# Vectorized equivalent of the row-wise apply: 1 for default statuses, 0 otherwise
df_target["target_default"] = (
    df_target["loan_status"].isin(default_statuses).astype(int)
)

In [32]:
df_target["target_default"].value_counts(normalize=True)

target_default
0    0.813313
1    0.186687
Name: proportion, dtype: float64

In [33]:
pd.crosstab(df_target["loan_status"], df_target["target_default"])

target_default                                            0      1
loan_status
Charged Off                                               0  26066
Default                                                   0      3
Does not meet the credit policy. Status:Charged Off       0    186
Fully Paid                                           119522      0
Late (31-120 days)                                        0   1180


In [34]:
print(df_target["loan_status"].value_counts())

loan_status
Fully Paid                                             119522
Charged Off                                             26066
Late (31-120 days)                                       1180
Does not meet the credit policy. Status:Charged Off       186
Default                                                     3
Name: count, dtype: int64


In [35]:
df_target.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_length,home_ownership,...,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,target_default
0,21000.0,13225.0,12822.854854,60 months,13.61,305.06,C,C2,1 year,MORTGAGE,...,,Cash,N,,,,,,,0
1,4500.0,4500.0,4500.0,60 months,16.82,111.41,E,E2,4 years,RENT,...,,Cash,N,,,,,,,0
2,8000.0,8000.0,7540.076422,36 months,7.88,250.25,A,A5,< 1 year,RENT,...,,Cash,N,,,,,,,0
3,18200.0,18200.0,17054.542187,36 months,7.51,566.22,A,A4,2 years,MORTGAGE,...,,Cash,N,,,,,,,0
4,20000.0,20000.0,19811.515921,60 months,15.7,483.18,D,D4,10+ years,MORTGAGE,...,,Cash,N,,,,,,,0


In [36]:
# # Function to inspect unique values and their prevalence
# def inspect_categories(dataframe, column_list):
#     """
#     Prints frequency and percentage distribution for categorical features.
#     """
#     for col in column_list:
#         print(f"\n--- Feature: {col.upper()} ---")
        
#         counts = dataframe[col].value_counts(dropna=False)
#         percentages = dataframe[col].value_counts(dropna=False, normalize=True) * 100
        
#         summary = pd.DataFrame({
#             'Count': counts,
#             'Percentage (%)': percentages.round(2)
#         })
        
#         print(summary)
#         print("-" * 30)

# target_cols = analysis_dict.get('loan_status', [])
# inspect_categories(df, target_cols)
# print(target_cols)

Next, we run a zero-ratio analysis to help us decide which variables don't carry enough information to support the model and which ones contain 0s that should actually be NaNs.

In [37]:
#Zero Ratio Analysis
zero_ratio = (df == 0).mean() * 100
zero_ratio_all = zero_ratio[zero_ratio > 0].sort_values(ascending=False)

print("ALL COLUMNS WITH ZEROS")
print(zero_ratio_all.to_string()) 

ALL COLUMNS WITH ZEROS
delinq_amnt                            99.771202
acc_now_delinq                         99.702042
chargeoff_within_12_mths               99.412924
collections_12_mths_ex_med             98.901247
tax_liens                              97.929374
total_rec_late_fee                     96.115106
collection_recovery_fee                90.665023
pub_rec_bankruptcies                   90.167387
recoveries                             89.107634
pub_rec                                87.175327
delinq_2yrs                            83.199954
num_tl_30dpd                           76.676079
out_prncp                              76.144122
out_prncp_inv                          76.144122
num_tl_120dpd_2m                       74.385494
num_tl_90g_dpd_24m                     72.702786
tot_coll_amt                           65.989631
num_accts_ever_120_pd                  59.765274
inq_last_6mths                         56.788814
mort_acc                               34.5246

Having identified the columns with higher amounts of 0s, we decided to audit them one by one and review their unique values to understand whether the 0s are informative or if they represent null/NaN values. This audit is half based on the code shown below and half on a manual review of the dataset dictionary.

In [38]:
# 1. Calculate Zero Ratio again to get the target columns
zero_ratio = (df == 0).mean() * 100
# Define a threshold: audit columns with more than 30% zeros
high_zero_threshold = 30.0
high_zero_cols = zero_ratio[zero_ratio > high_zero_threshold].sort_values(ascending=False).index.tolist()

def audit_high_zero_columns(dataframe, column_list):
    """
    Audits columns with high zero ratios to see value distribution 
    and help decide between dropping, keeping, or binarizing.
    """
    print(f"🔍 --- AUDITING {len(column_list)} COLUMNS WITH > {high_zero_threshold}% ZEROS --- \n")
    
    for col in column_list:
        print(f"📊 Feature: {col.upper()}")
        print(f"Zero Ratio: {zero_ratio[col]:.2f}%")
        
        # Count unique values excluding zero
        non_zero_values = dataframe[dataframe[col] != 0][col]
        unique_counts = non_zero_values.nunique()
        
        print(f"Unique values (excluding zero): {unique_counts}")
        
        if unique_counts < 15:
            # If few unique values, show frequency
            print("Distribution (Top Values):")
            print(dataframe[col].value_counts().head(10))
        else:
            # If many unique values, show basic stats for non-zero data
            print("Non-zero stats:")
            print(non_zero_values.describe()[['mean', 'min', 'max']])
        
        print("-" * 40)

# Execute the audit
audit_high_zero_columns(df, high_zero_cols)

🔍 --- AUDITING 20 COLUMNS WITH > 30.0% ZEROS --- 

📊 Feature: DELINQ_AMNT
Zero Ratio: 99.77%
Unique values (excluding zero): 335
Non-zero stats:
mean     4089.927273
min         2.000000
max     65000.000000
Name: delinq_amnt, dtype: float64
----------------------------------------
📊 Feature: ACC_NOW_DELINQ
Zero Ratio: 99.70%
Unique values (excluding zero): 3
Distribution (Top Values):
acc_now_delinq
0.0    191736
1.0       545
2.0        24
3.0         4
Name: count, dtype: int64
----------------------------------------
📊 Feature: CHARGEOFF_WITHIN_12_MTHS
Zero Ratio: 99.41%
Unique values (excluding zero): 6
Distribution (Top Values):
chargeoff_within_12_mths
0.0    191180
1.0      1027
2.0        86
3.0         7
4.0         6
5.0         2
9.0         1
Name: count, dtype: int64
----------------------------------------
📊 Feature: COLLECTIONS_12_MTHS_EX_MED
Zero Ratio: 98.90%
Unique values (excluding zero): 9
Distribution (Top Values):
collections_12_mths_ex_med
0.0     190196
1.0    
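
For sparse count features like the ones audited above, one option mentioned in the strategy is to binarize them (0 vs >0). The sketch below shows a possible way to do that while keeping missing values missing; the `_flag` suffix and the toy values are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Made-up values for a sparse count feature, including one missing entry.
toy = pd.DataFrame({"acc_now_delinq": [0.0, 1.0, 0.0, 3.0, np.nan]})
col = "acc_now_delinq"

# 1.0 when at least one event is recorded, 0.0 otherwise.
# where(...) restores NaN so that "unknown" is not silently turned into 0.
toy[f"{col}_flag"] = (toy[col] > 0).astype(float).where(toy[col].notna())

print(toy)
```

Whether the flag replaces the original count or is kept alongside it is a modeling choice that the audit output can inform.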

The function below is intended to serve as a filter to help us quickly verify whether a column was correctly classified in the categories defined above.

In [39]:
def get_column_category(column_name, mapping_dict):
    """
    Checks which category a specific column belongs to based on the analysis_dict.
    """
    for category, columns in mapping_dict.items():
        if column_name in columns:
            return category
    return "other (or not found)"

test_col = 'loan_status' # You can change this name to any column
result = get_column_category(test_col, analysis_dict)
print(f"Verification: The column '{test_col}' is categorized as: {result.upper()}")

# Batch verification (optional)
# List of columns to verify in one pass
verify_list = ['sec_app_fico_range_low', 'mths_since_last_delinq', 'loan_amnt', 'settlement_term']

print("BATCH VERIFICATION:")
for col in verify_list:
    cat = get_column_category(col, analysis_dict)
    print(f"- {col:30} -> Category: {cat}")

Verification: The column 'loan_status' is categorized as: POST_LOAN
BATCH VERIFICATION:
- sec_app_fico_range_low         -> Category: second_applicant
- mths_since_last_delinq         -> Category: structural_missing
- loan_amnt                      -> Category: other
- settlement_term                -> Category: post_loan


In [40]:
low_variance_cols = []

for col in df.columns:
    vc = df[col].value_counts(dropna=False, normalize=True)
    if vc.iloc[0] > 0.99:   # more than 99% of rows share one value (NaN included)
        low_variance_cols.append(col)

low_variance_cols

['pymnt_plan',
 'acc_now_delinq',
 'chargeoff_within_12_mths',
 'delinq_amnt',
 'sec_app_mths_since_last_major_derog',
 'hardship_flag',
 'hardship_type',
 'hardship_reason',
 'hardship_status',
 'deferral_term',
 'hardship_amount',
 'hardship_start_date',
 'hardship_end_date',
 'payment_plan_start_date',
 'hardship_length',
 'hardship_dpd',
 'hardship_loan_status',
 'orig_projected_additional_accrued_interest',
 'hardship_payoff_balance_amount',
 'hardship_last_payment_amount']
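
Because the check above uses value_counts(dropna=False), a column that is more than 99% missing is flagged in the same way as a column that is more than 99% one real value. The following is a minimal sketch, assuming it is useful to tell the two cases apart before dropping anything. The helper split_low_variance and the toy data are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def split_low_variance(frame, threshold=0.99):
    """Split near-constant columns by whether the dominant value is NaN."""
    mostly_missing, mostly_constant = [], []
    for col in frame.columns:
        # Share of each distinct value, counting NaN as a value
        shares = frame[col].value_counts(dropna=False, normalize=True)
        if shares.iloc[0] <= threshold:
            continue  # not near-constant
        dominant = shares.index[0]
        if pd.isna(dominant):
            mostly_missing.append(col)
        else:
            mostly_constant.append(col)
    return mostly_missing, mostly_constant

# Toy data (hypothetical values, not taken from the real table)
toy = pd.DataFrame({
    "hardship_type": [np.nan] * 199 + ["INTEREST ONLY"],
    "acc_now_delinq": [0.0] * 199 + [1.0],
    "loan_amnt": np.arange(200, dtype=float),
})
mostly_missing, mostly_constant = split_low_variance(toy)
print("Dominated by NaN:", mostly_missing)
print("Dominated by a real value:", mostly_constant)
```

On the real data this would be called as split_low_variance(df). Columns dominated by NaN may be better handled by the missing-value strategy than by a variance filter.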

In [41]:
desc = df.describe().T
desc.sort_values(by='max', ascending=False).head(20)

feature,count,mean,std,min,25%,50%,75%,max
annual_inc,192309.0,75169.592025,69239.420623,0.0,45000.0,64000.0,90000.0,8700000.0
tot_hi_cred_lim,147812.0,174407.636396,175198.12996,0.0,49067.75,112098.5,254223.0,5850873.0
tot_cur_bal,147812.0,140579.788461,157130.293045,0.0,28571.75,78569.0,211994.75,5327039.0
annual_inc_joint,5947.0,122752.267765,65794.037105,13464.0,82772.0,110000.0,146000.0,1500000.0
total_rev_hi_lim,147812.0,33396.409128,33326.34261,0.0,14400.0,24700.0,41500.0,1417100.0
total_bal_ex_mort,157627.0,48858.278112,47215.87623,0.0,20217.0,36431.0,61609.5,1276247.0
total_il_high_credit_limit,147812.0,41767.308371,43336.431399,0.0,14006.75,31084.0,56414.25,1214546.0
revol_bal,192309.0,16045.894254,20507.664434,0.0,5860.0,11215.0,19809.0,1190046.0
total_bal_il,70775.0,35223.405016,43461.041253,0.0,8607.5,22890.0,46015.5,801779.0
total_bc_limit,157627.0,22276.557677,21968.833712,0.0,8000.0,15700.0,29000.0,520500.0


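The maximum values reveal extreme observations in several financial variables. For example, annual_inc peaks at 8,700,000 against a median of 64,000, and tot_hi_cred_lim reaches 5,850,873 against a median of 112,098.5. These heavy right tails suggest applying transformations or outlier treatment techniques in later stages.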
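
One common way to tame heavy right tails is to cap values at an upper quantile and then apply a log transform. The sketch below illustrates that idea under stated assumptions: the helper cap_and_log, the quantile used, and the toy values are hypothetical, and the notebook may settle on a different treatment later.

```python
import numpy as np
import pandas as pd

def cap_and_log(series, upper_q=0.99):
    """Clip values above the upper_q quantile, then apply log1p."""
    cap = series.quantile(upper_q)      # threshold estimated from the data passed in
    capped = series.clip(upper=cap)     # values above the cap are set to the cap
    return np.log1p(capped)             # log(1 + x), safe for zeros

# Toy series (hypothetical rows; the values echo the quartiles and max shown above)
toy_income = pd.Series([45000.0, 64000.0, 90000.0, 8700000.0], name="annual_inc")
transformed = cap_and_log(toy_income, upper_q=0.75)
print(transformed)
```

If this approach is adopted, the cap would normally be estimated on the training split only and then reused on the test split, so that no information flows from test to train.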

In [42]:
desc.assign(
    mean_median_ratio = desc['mean'] / desc['50%']
).sort_values('mean_median_ratio', ascending=False).head(10)

feature,count,mean,std,min,25%,50%,75%,max,mean_median_ratio
pub_rec,192309.0,0.157767,0.502036,0.0,0.0,0.0,0.0,40.0,inf
inq_last_6mths,192309.0,0.681424,0.977064,0.0,0.0,0.0,1.0,8.0,inf
delinq_2yrs,192309.0,0.26953,0.798124,0.0,0.0,0.0,0.0,26.0,inf
total_rec_late_fee,192309.0,1.415083,11.190303,0.0,0.0,0.0,0.0,1098.360001,inf
recoveries,192309.0,153.452068,774.136498,0.0,0.0,0.0,0.0,35581.88,inf
out_prncp_inv,192309.0,2404.686422,5862.737964,0.0,0.0,0.0,0.0,39091.64,inf
open_il_12m,70775.0,0.670816,0.925402,0.0,0.0,0.0,1.0,20.0,inf
acc_now_delinq,192309.0,0.003146,0.059249,0.0,0.0,0.0,0.0,3.0,inf
tot_coll_amt,147812.0,201.409906,1907.319784,0.0,0.0,0.0,0.0,380757.0,inf
collection_recovery_fee,192309.0,22.341847,128.942794,0.0,0.0,0.0,0.0,7002.19,inf


All ten variables listed above have a median of zero, so the mean-to-median ratio is infinite (inf) and only shows that the distribution is concentrated at zero with a minority of non-zero observations. This pattern is expected for count-based credit history variables such as pub_rec and delinq_2yrs. However, some of these highly skewed variables (for example recoveries and collection_recovery_fee) correspond to post-loan information and will therefore be excluded from the modeling process to prevent data leakage.
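
Since the ratio degenerates to inf whenever the median is zero, it cannot rank those variables against each other. A possible complement, sketched here as an assumption rather than as the notebook's method, is to report the share of zeros together with the sample skewness. The helper zero_inflation_summary and the toy data are hypothetical.

```python
import pandas as pd

def zero_inflation_summary(frame):
    """Share of exact zeros and sample skewness for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    summary = pd.DataFrame({
        # NaN == 0 is False, so missing values count as non-zero rows here
        "zero_share": (numeric == 0).mean(),
        "skew": numeric.skew(),
    })
    return summary.sort_values("zero_share", ascending=False)

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "pub_rec": [0, 0, 0, 0, 0, 0, 0, 1, 2, 40],
    "loan_amnt": [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000],
})
summary = zero_inflation_summary(toy)
print(summary)
```

On the real data this would be called as zero_inflation_summary(df). Both columns stay finite, so the variables can be sorted and compared.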

EDA (wrapping up)
1. Handle columns with a high % of zeros ⏳
2. Missing-value matrix (missingno) ⏳
3. Explicit missing values ✅
4. Hidden missing values ✅
5. Duplicate rows ✅
6. Drop policy_code ✅

Feature decisions
7. Identify data-leakage columns ⏳ (already identified; they still need to be dropped)
8. Define second-applicant strategy (flag + drop cols) ⏳
9. Identify ID / non-predictive columns ⏳ (already identified; they still need to be dropped)
10. Define target ⏳ (already identified)

Modeling
11. Create df_model
12. Temporal split
13. Imputation / preprocessing
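
Items 8 and 12 of the plan could be sketched as follows. This is only an illustration under stated assumptions: the sec_app_ prefix is inferred from column names seen above, the date column issue_d with a "Dec-2015" style format is assumed and has not been verified against this table, and the helper names, the flag name has_second_applicant, and the cutoff date are hypothetical.

```python
import pandas as pd

def add_second_applicant_flag(frame, prefix="sec_app_"):
    """Replace second-applicant columns with a single 0/1 indicator."""
    sec_cols = [c for c in frame.columns if c.startswith(prefix)]
    out = frame.drop(columns=sec_cols)
    # 1 if any second-applicant field is filled in for that row
    out["has_second_applicant"] = frame[sec_cols].notna().any(axis=1).astype(int)
    return out

def temporal_split(frame, date_col, cutoff):
    """Train on rows dated before the cutoff, test on rows dated on or after it."""
    dates = pd.to_datetime(frame[date_col], format="%b-%Y")
    return frame[dates < cutoff], frame[dates >= cutoff]

# Toy data (hypothetical values; issue_d is an assumed column name)
toy = pd.DataFrame({
    "issue_d": ["Jan-2015", "Jun-2015", "Dec-2015", "Mar-2016"],
    "loan_amnt": [5000, 12000, 8000, 20000],
    "sec_app_fico_range_low": [None, 680.0, None, None],
})
flagged = add_second_applicant_flag(toy)
train, test = temporal_split(flagged, "issue_d", pd.Timestamp("2016-01-01"))
print(len(train), "train rows,", len(test), "test rows")
```

Splitting by date rather than at random keeps the evaluation closer to how a scoring model is used in practice, where it is trained on past loans and applied to later ones.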