## Feature Engineering Pipeline

### Workflow
1. Missing-value treatment with audit logs.
2. Categorical grouping and domain-aware feature synthesis.
3. Interaction and utilization features.
4. Numeric transformation for skewed distributions.
5. Selective feature retention before one-hot encoding.
6. Persist encoded artifacts for modeling.


## Import Libraries


In [43]:
import os
import sys
import importlib
from pathlib import Path

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

PROJECT_ROOT = Path.cwd().resolve().parent
SRC_DIR = PROJECT_ROOT / "src"
if str(SRC_DIR) not in sys.path:
    sys.path.insert(0, str(SRC_DIR))

import notebook_checks as notebook_checks_mod
import styling.styling as styling_mod
import feature_pipeline as feature_pipeline_mod

importlib.reload(notebook_checks_mod)
importlib.reload(styling_mod)
importlib.reload(feature_pipeline_mod)

from config import get_config
from feature_pipeline import (
    EncodeCategorialFeatures,
    add_a1c_med_interaction,
    add_comorbidity_interaction,
    add_diag_combination,
    add_domain_specific_features,
    add_intensity_of_care,
    add_interactive_features,
    add_lab_med_ratio,
    add_procedure_diversification,
    apply_age_specialty_interaction,
    apply_train_fitted_category_mappings,
    audit_missing,
    clean_missing_notebook,
    one_hot_encode,
    plot_a1c_med_impact,
    plot_age_specialty_impact,
    plot_comorbidity_triad_impact,
    plot_intensity_of_care_effect,
    plot_lab_med_ratio_effect,
    run_chi2_analysis,
    summarize_ohe_cardinality,
    save_final_feature_outputs,
    run_chi2_feature_selection,
    display_chi2_selection_dashboard,
    transform_skewed_numerical_features,
)

run_basic_data_quality_gates = notebook_checks_mod.run_basic_data_quality_gates
assert_target_mapping_complete = notebook_checks_mod.assert_target_mapping_complete
assert_no_missing_values = notebook_checks_mod.assert_no_missing_values
assert_feature_alignment = notebook_checks_mod.assert_feature_alignment
build_reproducibility_footer = notebook_checks_mod.build_reproducibility_footer

apply_notebook_style = styling_mod.apply_notebook_style
build_eda_output_paths = styling_mod.build_eda_output_paths
build_artifact_path = styling_mod.build_artifact_path
save_table_snapshot = styling_mod.save_table_snapshot

cfg = get_config(PROJECT_ROOT)
apply_notebook_style()
output_path, table_output_path = build_eda_output_paths(PROJECT_ROOT)

NOTEBOOK_ID = '02'
def fig_path(section_id: str, slug: str):
    return build_artifact_path(output_path, NOTEBOOK_ID, section_id, slug, 'png')

def table_path(section_id: str, slug: str, extension: str = 'png'):
    return build_artifact_path(table_output_path, NOTEBOOK_ID, section_id, slug, extension)


## Import Data

In [44]:
train_df = pd.read_csv(cfg.interim_train_path)
test_df = pd.read_csv(cfg.interim_test_path)

quality_report = run_basic_data_quality_gates(
    train_df=train_df,
    test_df=test_df,
    target_col='readmitted',
    required_cols=['readmitted', 'encounter_id', 'patient_nbr'],
    leakage_id_cols=['encounter_id'],
)
display(quality_report)

save_table_snapshot(
    quality_report,
    table_path('01', 'quality_gates_report'),
    title='Feature Pipeline - Data Quality Gates (Schema, Leakage, Class Checks)',
    index=False,
)
quality_report.to_csv(table_path('01', 'quality_gates_report', 'csv'), index=False)


Unnamed: 0,readmitted,count,percentage,dataset
0,NO,43891,53.91,train
1,>30,28436,34.93,train
2,<30,9085,11.16,train
3,NO,10973,53.91,test
4,>30,7109,34.93,test
5,<30,2272,11.16,test
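
The quality report above lists class counts and percentages per split. The sketch below shows one way such a table and a simple identifier-overlap check could be produced with pandas. The helper names `class_distribution` and `count_shared_ids` are hypothetical and are not the project's `run_basic_data_quality_gates`; the toy frames stand in for the interim files.

```python
import pandas as pd


def class_distribution(y: pd.Series, dataset: str) -> pd.DataFrame:
    """Count and percentage per target class for one split."""
    counts = y.value_counts()
    return pd.DataFrame({
        "readmitted": counts.index.to_numpy(),
        "count": counts.to_numpy(),
        "percentage": (100 * counts / counts.sum()).round(2).to_numpy(),
        "dataset": dataset,
    })


def count_shared_ids(train_df: pd.DataFrame, test_df: pd.DataFrame, id_col: str) -> int:
    """Number of identifier values present in both splits (0 means no overlap)."""
    return len(set(train_df[id_col]) & set(test_df[id_col]))


# Toy data standing in for the interim train/test files.
train_demo = pd.DataFrame({
    "encounter_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "readmitted": ["NO"] * 5 + [">30"] * 3 + ["<30"] * 2,
})
test_demo = pd.DataFrame({
    "encounter_id": [11, 12, 13, 14],
    "readmitted": ["NO", "NO", ">30", "<30"],
})

train_report = class_distribution(train_demo["readmitted"], "train")
overlap = count_shared_ids(train_demo, test_demo, "encounter_id")
print(train_report)
print(f"Shared encounter_id values: {overlap}")
```

Identical class percentages in train and test, as in the report above, are what a stratified split would be expected to produce.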


## Missing Values

Cleaning policy:
1. Drop near-empty columns (`weight`, `max_glu_serum`).
2. Fill operational unknowns for specialty/payer.
3. Handle diagnosis missingness with justified clinical logic.
4. Remove rows with unjustified or structurally invalid missing records.


`weight` and `max_glu_serum` are dropped due to extreme sparsity.

`medical_specialty`, `payer_code`, and `A1Cresult` are imputed with explicit categories so models can learn missingness behavior without hidden NaN leakage.
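
As a rough illustration of the explicit-category imputation described above, the sketch below fills missing values with a named label per column. The labels in `FILL_LABELS`, the `"?"` placeholder handling, and the helper name `fill_explicit_missing` are assumptions for illustration; the actual rules live in `clean_missing_notebook`.

```python
import pandas as pd

# Hypothetical fill labels; the project's real labels may differ.
FILL_LABELS = {
    "medical_specialty": "Unknown",
    "payer_code": "Unknown",
    "A1Cresult": "Not_Tested",
}


def fill_explicit_missing(df: pd.DataFrame, fill_labels: dict) -> pd.DataFrame:
    """Replace NaN and '?' placeholders with an explicit category per column."""
    out = df.copy()
    for col, label in fill_labels.items():
        if col in out.columns:
            # Treat '?' as missing (assumed placeholder), then fill with the label.
            out[col] = out[col].mask(out[col].eq("?")).fillna(label)
    return out


demo = pd.DataFrame({
    "medical_specialty": ["Cardiology", "?", None],
    "payer_code": ["MC", None, "?"],
    "A1Cresult": [None, ">8", "Norm"],
})
filled = fill_explicit_missing(demo, FILL_LABELS)
print(filled)
```

Keeping missingness as its own category lets a downstream encoder produce a dedicated indicator column instead of silently dropping or mean-filling those rows.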


In [45]:
# Cell 1 - Clean missing values
train_df = clean_missing_notebook(train_df)
test_df = clean_missing_notebook(test_df)

print("Rows with unjustified missing data have been deleted.")


Rows with unjustified missing data have been deleted.


### Verify That No Missing Values Remain

In [46]:
# Verify cleaning worked and persist audit tables
train_missing_report = audit_missing(train_df)
test_missing_report = audit_missing(test_df)

if train_missing_report.empty:
    print('No missing values detected in train_df')
else:
    display(train_missing_report)
    save_table_snapshot(
        train_missing_report.reset_index().rename(columns={'index': 'feature'}),
        table_path('02', 'train_missing_audit'),
        title='Feature Pipeline - Train Missingness Audit',
        index=False,
    )
    train_missing_report.to_csv(table_path('02', 'train_missing_audit', 'csv'))

if test_missing_report.empty:
    print('No missing values detected in test_df')
else:
    display(test_missing_report)
    save_table_snapshot(
        test_missing_report.reset_index().rename(columns={'index': 'feature'}),
        table_path('02', 'test_missing_audit'),
        title='Feature Pipeline - Test Missingness Audit',
        index=False,
    )
    test_missing_report.to_csv(table_path('02', 'test_missing_audit', 'csv'))


No missing values detected in train_df
No missing values detected in test_df
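
For readers without access to `feature_pipeline`, a missingness audit of this kind can be sketched as follows. This is an assumed shape (feature names on the index, an empty frame when nothing is missing), inferred from how the report is used above; the helper name `audit_missing_sketch` and its column names are hypothetical.

```python
import pandas as pd


def audit_missing_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column that still has missing values; empty if none do."""
    missing_count = df.isna().sum()
    report = pd.DataFrame({
        "missing_count": missing_count,
        "missing_pct": (100 * missing_count / len(df)).round(2),
    })
    report = report[report["missing_count"] > 0]
    return report.sort_values("missing_count", ascending=False)


clean_demo = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
dirty_demo = pd.DataFrame({"a": [1, None, 3], "b": ["x", "y", "z"]})

clean_report = audit_missing_sketch(clean_demo)
dirty_report = audit_missing_sketch(dirty_demo)
print(clean_report.empty)
print(dirty_report)
```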


## Drop Columns

Drop leakage-prone or non-informative identifiers before encoding.


In [47]:
# Drop ID columns that don't help prediction
train_df = train_df.drop(columns=['encounter_id', 'patient_nbr'])
test_df = test_df.drop(columns=['encounter_id', 'patient_nbr'])
y_train = train_df.pop('readmitted')
y_test = test_df.pop('readmitted')


## Transform Targets


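Multiclass target encoding:

- 0 (`NO`): baseline; the patient was not readmitted.
- 1 (`>30`): the patient was readmitted more than 30 days after discharge.
- 2 (`<30`): highest-priority group; the patient was readmitted within 30 days.

A binary target is derived as well, mapping both readmission classes (`>30` and `<30`) to 1.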

In [48]:
multiclass_map = {'NO': 0, '>30': 1, '<30': 2}
binary_map     = {'NO': 0, '>30': 1, '<30': 1}

assert_target_mapping_complete(y_train, multiclass_map, 'y_train_raw')
assert_target_mapping_complete(y_test, multiclass_map, 'y_test_raw')
assert_target_mapping_complete(y_train, binary_map, 'y_train_raw_binary')
assert_target_mapping_complete(y_test, binary_map, 'y_test_raw_binary')

y_train_binary = y_train.map(binary_map)
y_test_binary  = y_test.map(binary_map)

y_train = y_train.map(multiclass_map)
y_test  = y_test.map(multiclass_map)

assert y_train.notna().all() and y_test.notna().all(), 'Target mapping produced NaN.'
assert y_train_binary.notna().all() and y_test_binary.notna().all(), 'Binary mapping produced NaN.'
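
The completeness checks in the cell above guard against labels that `.map()` would silently turn into NaN. A minimal sketch of such a check is shown below; `check_target_mapping_complete` is a hypothetical stand-in, not the project's `assert_target_mapping_complete`.

```python
import pandas as pd


def check_target_mapping_complete(y: pd.Series, mapping: dict, name: str) -> None:
    """Raise if any observed label has no entry in the mapping."""
    unmapped = sorted(set(y.dropna().unique()) - set(mapping))
    if unmapped:
        raise ValueError(f"{name}: labels without a mapping: {unmapped}")


multiclass_map = {"NO": 0, ">30": 1, "<30": 2}
y_demo = pd.Series(["NO", ">30", "<30", "NO"])

check_target_mapping_complete(y_demo, multiclass_map, "y_demo")  # passes silently
encoded = y_demo.map(multiclass_map)

# A label outside the mapping is caught before .map() can turn it into NaN.
try:
    check_target_mapping_complete(pd.Series(["NO", "UNKNOWN"]), multiclass_map, "y_bad")
    caught = False
except ValueError:
    caught = True

print(encoded.tolist(), caught)
```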


## Encode Categorical Features

Design choice: collapse raw medication columns into stable summary features to avoid noisy high-dimensional one-hot expansions.


In [49]:
# ============================================================== 
# EXECUTION
# ============================================================== 

engineer = EncodeCategorialFeatures(drop_raw_medication_columns=True)
train_df = engineer.fit_transform(train_df)
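# Reuse the train-fitted state on test; refitting on test would leak test statistics.
# Assumes the class exposes a scikit-learn style transform().
test_df  = engineer.transform(test_df)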

print('Feature engineering complete')
print(f'Shape Train: {train_df.shape}')
print(f'Shape Test: {test_df.shape}')


Feature engineering complete
Shape Train: (79394, 29)
Shape Test: (19855, 29)
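
To make the design choice above concrete, the sketch below collapses a few raw medication columns into summary flags. The flag names `med_was_adjusted`, `any_diabetes_medication`, and `insulin_active` appear later in this notebook, but their definitions here are assumptions, as are the column subset and the value set (`No`, `Steady`, `Up`, `Down`). It is not the implementation of `EncodeCategorialFeatures`.

```python
import pandas as pd

MED_COLS = ["metformin", "insulin", "glipizide"]  # illustrative subset only


def summarize_medications(df: pd.DataFrame, med_cols: list, drop_raw: bool = True) -> pd.DataFrame:
    """Collapse per-drug status columns into a few binary summary features."""
    out = df.copy()
    meds = out[med_cols]
    # 1 if any drug is prescribed at all (assumed definition).
    out["any_diabetes_medication"] = meds.ne("No").any(axis=1).astype(int)
    # 1 if any dosage went up or down during the encounter (assumed definition).
    out["med_was_adjusted"] = meds.isin(["Up", "Down"]).any(axis=1).astype(int)
    # 1 if insulin is prescribed in any form (assumed definition).
    out["insulin_active"] = out["insulin"].ne("No").astype(int)
    if drop_raw:
        out = out.drop(columns=med_cols)
    return out


demo = pd.DataFrame({
    "metformin": ["No", "No", "Up"],
    "insulin": ["Steady", "No", "No"],
    "glipizide": ["No", "No", "No"],
})
summary = summarize_medications(demo, MED_COLS)
print(summary)
```

Three binary flags replace what would otherwise be up to four one-hot columns per drug, which is the dimensionality saving the design choice is after.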


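## Create Interaction Features


### Age x Specialty Interaction

Combine `medical_specialty` and `age` into a single categorical key so that models can capture age effects that differ by specialty. The skewness check and `log1p` transform for heavily right-skewed numeric features are applied later, together with the domain-specific features.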


In [50]:
# Create age-specialty interaction feature
train_df['age_specialty_interaction'] = apply_age_specialty_interaction(train_df)
test_df['age_specialty_interaction'] = apply_age_specialty_interaction(test_df)

print(f"New feature created. Unique interactions in training: {train_df['age_specialty_interaction'].nunique()}")
print(test_df[['medical_specialty', 'age', 'age_specialty_interaction']].head())


New feature created. Unique interactions in training: 94
   medical_specialty       age  age_specialty_interaction
0  Family_Outpatient   [60-70)  Family_Outpatient_[60-70)
1  Family_Outpatient   [80-90)  Family_Outpatient_[80-90)
2      Other_Unknown  [90-100)     Other_Unknown_[90-100)
3  Internal_Medicine   [40-50)  Internal_Medicine_[40-50)
4  Internal_Medicine   [50-60)  Internal_Medicine_[50-60)
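
The printed sample suggests the interaction key has the form `<medical_specialty>_<age>`. A minimal sketch that reproduces that format is shown below; `age_specialty_key` is a hypothetical helper, and the real `apply_age_specialty_interaction` may do more (for example, validation).

```python
import pandas as pd


def age_specialty_key(df: pd.DataFrame) -> pd.Series:
    """Concatenate specialty and age bracket into one categorical key."""
    return df["medical_specialty"].astype(str) + "_" + df["age"].astype(str)


demo = pd.DataFrame({
    "medical_specialty": ["Family_Outpatient", "Other_Unknown", "Internal_Medicine"],
    "age": ["[60-70)", "[90-100)", "[40-50)"],
})
demo["age_specialty_interaction"] = age_specialty_key(demo)
print(demo)
```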


In [51]:
fig = plot_age_specialty_impact(
    train_df=train_df,
    y_train=y_train,
    benchmark=0.112,
    min_count=50,
    top_n=15,
)
fig.show()


### A1C Result x Diabetes Medication Change

In [52]:
fig = plot_a1c_med_impact(
    train_df=train_df,
    y_train=y_train,
    benchmark=0.112,
)
fig.show()


In [53]:
# Apply to both splits
train_df['a1c_med_interaction'] = add_a1c_med_interaction(train_df)
test_df['a1c_med_interaction'] = add_a1c_med_interaction(test_df)

print("New feature sample (A1C x Med Adjustment):")
print(train_df[['A1Cresult', 'med_was_adjusted', 'a1c_med_interaction']].head())


New feature sample (A1C x Med Adjustment):
    A1Cresult  med_was_adjusted     a1c_med_interaction
0  Not_Tested                 1  Not_Tested_MedAdjusted
1  Not_Tested                 0  Not_Tested_NoMedChange
2  Not_Tested                 0  Not_Tested_NoMedChange
3  Not_Tested                 0  Not_Tested_NoMedChange
4  Not_Tested                 1  Not_Tested_MedAdjusted
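
The sample output pairs each `A1Cresult` value with a suffix derived from `med_was_adjusted`. The sketch below reproduces that labeling under the assumption that `1` maps to `MedAdjusted` and `0` to `NoMedChange`, as the printed rows indicate; `a1c_med_key` is a hypothetical helper name.

```python
import pandas as pd


def a1c_med_key(df: pd.DataFrame) -> pd.Series:
    """Combine the A1C result with a medication-adjustment suffix."""
    suffix = df["med_was_adjusted"].map({1: "MedAdjusted", 0: "NoMedChange"})
    return df["A1Cresult"].astype(str) + "_" + suffix


demo = pd.DataFrame({
    "A1Cresult": ["Not_Tested", "Not_Tested", ">8"],
    "med_was_adjusted": [1, 0, 1],
})
demo["a1c_med_interaction"] = a1c_med_key(demo)
print(demo)
```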


### Diag1 x Diag2 x Diag3

In [54]:
fig = plot_comorbidity_triad_impact(
    train_df=train_df,
    y_train=y_train,
    benchmark=0.112,
    min_count=100,
)
fig.show()


### Comorbidity Interaction (Order-Agnostic)
Create a stable diagnosis interaction key (`comorbidity_interaction`) and a capped high-signal combination feature (`diag_combination`).
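
An order-agnostic key can be obtained by sorting the three diagnosis groups before joining them, so that the same set of conditions always yields the same string regardless of which slot each diagnosis occupies. The sketch below is an illustration under that assumption; `comorbidity_key` is a hypothetical helper, not the project's `add_comorbidity_interaction`.

```python
import pandas as pd


def comorbidity_key(df: pd.DataFrame, cols=("diag_1", "diag_2", "diag_3")) -> pd.Series:
    """Join the diagnosis groups in sorted order so slot order does not matter."""
    return df[list(cols)].astype(str).apply(lambda row: "_".join(sorted(row)), axis=1)


demo = pd.DataFrame({
    "diag_1": ["Respiratory", "Other", "Circulatory"],
    "diag_2": ["Mental_Disorders", "Circulatory", "Neoplasms"],
    "diag_3": ["Other", "External_Supplemental", "Circulatory"],
})
keys = comorbidity_key(demo)

# Reordering the diagnosis slots leaves the key unchanged.
shuffled = demo.rename(columns={"diag_1": "diag_3", "diag_3": "diag_1"})
keys_shuffled = comorbidity_key(shuffled)
print(keys.tolist())
```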


In [55]:
# Create diagnosis interaction features
train_df['comorbidity_interaction'] = add_comorbidity_interaction(train_df)
test_df['comorbidity_interaction'] = add_comorbidity_interaction(test_df)

# Initial combination creation
train_df = add_diag_combination(train_df, top_n=40)
test_df = add_diag_combination(test_df, top_n=40)

# Fit train category support once and apply consistently to test
mapping_columns = [
    'age_specialty_interaction',
    'comorbidity_interaction',
    'diag_combination',
]

train_df, test_df, category_mapping_report = apply_train_fitted_category_mappings(
    train_df=train_df,
    test_df=test_df,
    columns=mapping_columns,
    min_count_map={
        'age_specialty_interaction': 30,
        'comorbidity_interaction': 25,
        'diag_combination': 25,
    },
    top_n_map={
        'age_specialty_interaction': 140,
        'comorbidity_interaction': 180,
        'diag_combination': 50,
    },
    other_label_map={
        'age_specialty_interaction': 'AgeSpecialty_Other',
        'comorbidity_interaction': 'Comorbidity_Other',
        'diag_combination': 'DiagCombo_Other',
    },
)

print('New comorbidity feature sample:')
print(train_df[['diag_1', 'diag_2', 'diag_3', 'comorbidity_interaction']].head())
print(f"\nUnique diag combinations (train): {train_df['diag_combination'].nunique()}")
print(f"\nTop combinations:\n{train_df['diag_combination'].value_counts().head(15)}")

display(category_mapping_report)
save_table_snapshot(
    category_mapping_report,
    table_path('04', 'train_fitted_category_mapping_report'),
    title='Feature Pipeline - Train-Fitted Interaction Category Mapping',
    index=False,
)
category_mapping_report.to_csv(
    table_path('04', 'train_fitted_category_mapping_report', 'csv'),
    index=False,
)


New comorbidity feature sample:
                  diag_1            diag_2                 diag_3  \
0            Respiratory  Mental_Disorders                  Other   
1                  Other       Circulatory  External_Supplemental   
2            Circulatory         Neoplasms            Circulatory   
3            Circulatory          Diabetes            Circulatory   
4  External_Supplemental       Circulatory            Circulatory   

                         comorbidity_interaction  
0             Mental_Disorders_Other_Respiratory  
1        Circulatory_External_Supplemental_Other  
2              Circulatory_Circulatory_Neoplasms  
3               Circulatory_Circulatory_Diabetes  
4  Circulatory_Circulatory_External_Supplemental  

Unique diag combinations (train): 42

Top combinations:
diag_combination
Other_Combination                                  25468
Circulatory + Other                                 7230
Circulatory                                         5266
Ci

Unnamed: 0,feature,min_count,top_n,train_unique_before,train_unique_after,test_unique_before,test_unique_after,unknown_test_rows_before_mapping,train_rows_mapped_to_other,test_rows_mapped_to_other,other_label
0,age_specialty_interaction,30,140,94,72,89,72,1,228,58,AgeSpecialty_Other
1,comorbidity_interaction,25,180,461,181,439,181,6,6190,1547,Comorbidity_Other
2,diag_combination,25,50,41,42,41,40,196,1,196,DiagCombo_Other
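
The mapping report above reflects a fit-on-train, apply-to-both pattern: the set of supported categories is learned from train only, and anything outside it is folded into an "other" label in both splits. The sketch below illustrates the idea with `min_count` and `top_n` thresholds; the helper names are hypothetical and the project's `apply_train_fitted_category_mappings` may differ in detail.

```python
import pandas as pd


def fit_category_support(train_col: pd.Series, min_count: int, top_n: int) -> set:
    """Categories kept: at least min_count train rows, limited to the top_n most frequent."""
    counts = train_col.value_counts()
    kept = counts[counts >= min_count].head(top_n)
    return set(kept.index)


def apply_category_support(col: pd.Series, support: set, other_label: str) -> pd.Series:
    """Replace every category outside the train-fitted support with other_label."""
    return col.where(col.isin(support), other_label)


train_col = pd.Series(["A"] * 5 + ["B"] * 3 + ["C"])
test_col = pd.Series(["A", "C", "Z"])  # "Z" never appears in train

support = fit_category_support(train_col, min_count=2, top_n=5)
train_mapped = apply_category_support(train_col, support, "Other")
test_mapped = apply_category_support(test_col, support, "Other")
print(sorted(support), test_mapped.tolist())
```

Because the support comes from train alone, categories that are rare or unseen in train cannot create columns that exist in only one split after encoding.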


### Healthcare Utilization Scores (Intensity of Care)

In [56]:
fig = plot_intensity_of_care_effect(train_df=train_df, y_train=y_train)
fig.show()


In [57]:
# Apply intensity and diversification features
train_df['intensity_of_care'] = add_intensity_of_care(train_df)
test_df['intensity_of_care'] = add_intensity_of_care(test_df)

train_df['procedure_diversification'] = add_procedure_diversification(train_df)
test_df['procedure_diversification'] = add_procedure_diversification(test_df)

print("New numerical features added: intensity_of_care, procedure_diversification")
print(train_df[['num_lab_procedures', 'num_procedures', 'time_in_hospital', 'number_diagnoses', 'intensity_of_care', 'procedure_diversification']].head())


New numerical features added: intensity_of_care, procedure_diversification
   num_lab_procedures  num_procedures  time_in_hospital  number_diagnoses  \
0                  57               0                 6                 9   
1                  39               0                 3                 7   
2                  38               0                 1                 4   
3                  57               0                 2                 3   
4                  54               0                14                 9   

   intensity_of_care  procedure_diversification  
0           9.500000                        0.0  
1          13.000000                        0.0  
2          38.000000                        0.0  
3          28.500000                        0.0  
4           3.857143                        0.0  


### Lab & Procedure Intensity

Lab-to-medication ratio: `num_lab_procedures / num_medications`.

- High ratio: clinicians are still searching for a diagnosis (diagnostic intensity).
- Low ratio: the treatment plan is established and care focuses on managing medications (maintenance intensity).

Procedure diversification: `num_procedures / number_diagnoses`. It measures how much intervention was required per diagnosis.
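
The ratio features can be sketched with a small safe-division helper. The printed sample above is consistent with `intensity_of_care` being `num_lab_procedures / time_in_hospital` (for example 57 / 6 = 9.5), which is what the sketch uses. Returning 0.0 for a zero denominator is an assumed convention, and `safe_ratio` is a hypothetical helper rather than the project's code.

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Element-wise ratio that yields 0.0 where the denominator is 0 (assumed convention)."""
    denominator = denominator.astype(float).replace(0.0, np.nan)
    return (numerator / denominator).fillna(0.0)


demo = pd.DataFrame({
    "num_lab_procedures": [57, 39, 54],
    "num_medications": [19, 0, 12],
    "num_procedures": [0, 3, 1],
    "number_diagnoses": [9, 6, 4],
    "time_in_hospital": [6, 3, 14],
})
demo["intensity_of_care"] = safe_ratio(demo["num_lab_procedures"], demo["time_in_hospital"])
demo["lab_med_ratio"] = safe_ratio(demo["num_lab_procedures"], demo["num_medications"])
demo["procedure_diversification"] = safe_ratio(demo["num_procedures"], demo["number_diagnoses"])
print(demo[["intensity_of_care", "lab_med_ratio", "procedure_diversification"]])
```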

## Generate Domain-Specific Features

### Lab-to-Medication Ratio: Diagnostic vs. Maintenance Intensity

In [58]:
fig = plot_lab_med_ratio_effect(train_df=train_df)
fig.show()


In [59]:
# Apply domain-specific feature block (keeps feature creation standardized)
train_df = add_domain_specific_features(train_df)
test_df = add_domain_specific_features(test_df)

# Transform highly skewed numerical features
exclude_from_log = ['med_was_adjusted', 'any_diabetes_medication', 'insulin_active']
train_df, test_df, skew_report = transform_skewed_numerical_features(
    train_df=train_df,
    test_df=test_df,
    skew_threshold=1.0,
    exclude_cols=exclude_from_log,
)

print('Domain-specific features present:')
print([c for c in ['intensity_of_care', 'lab_med_ratio', 'procedure_diversification'] if c in train_df.columns])

display(skew_report.head(20))
save_table_snapshot(
    skew_report,
    table_path('05', 'numeric_skew_transform_audit'),
    title='Feature Pipeline - Numeric Skewness Transform Audit',
    index=False,
)
skew_report.to_csv(table_path('05', 'numeric_skew_transform_audit', 'csv'), index=False)


Domain-specific features present:
['intensity_of_care', 'lab_med_ratio', 'procedure_diversification']


Unnamed: 0,feature,train_skew_before,train_skew_after,transform_applied
0,number_emergency,16.4736,3.5785,log1p
1,number_outpatient,8.6988,2.7216,log1p
2,number_inpatient,3.6422,1.436,log1p
3,num_medications,1.3252,-0.473,log1p
4,num_procedures,1.3113,0.5124,log1p
5,time_in_hospital,1.1319,0.1017,log1p
6,number_diagnoses,-0.9016,-0.9016,none
7,num_lab_procedures,-0.2377,-0.2377,none
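
The audit table shows `log1p` applied only where train skewness exceeds the threshold of 1.0, while left-skewed or near-symmetric columns are left untouched. A minimal sketch of that decision rule follows. It decides on train skewness only and applies the same transform to both splits; the helper name `log1p_right_skewed` and the non-negativity guard are assumptions.

```python
import numpy as np
import pandas as pd


def log1p_right_skewed(train_df, test_df, cols, skew_threshold=1.0):
    """Apply log1p to columns whose train skewness exceeds the threshold."""
    train_out, test_out = train_df.copy(), test_df.copy()
    rows = []
    for col in cols:
        skew_before = float(train_out[col].skew())
        # log1p is only applied to non-negative data in this sketch.
        non_negative = train_out[col].min() >= 0 and test_out[col].min() >= 0
        apply_log = bool(skew_before > skew_threshold and non_negative)
        if apply_log:
            train_out[col] = np.log1p(train_out[col])
            test_out[col] = np.log1p(test_out[col])
        rows.append({
            "feature": col,
            "train_skew_before": round(skew_before, 4),
            "train_skew_after": round(float(train_out[col].skew()), 4),
            "transform_applied": "log1p" if apply_log else "none",
        })
    return train_out, test_out, pd.DataFrame(rows)


train_demo = pd.DataFrame({"visits": [0, 0, 0, 0, 1, 1, 2, 40], "days": [1, 2, 3, 4, 5, 6, 7, 8]})
test_demo = pd.DataFrame({"visits": [0, 1, 3], "days": [2, 4, 6]})
train_t, test_t, report = log1p_right_skewed(train_demo, test_demo, ["visits", "days"])
print(report)
```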


## Feature Selection

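Retain only statistically relevant categorical features before one-hot encoding. The selection uses a chi-square test against the target with `alpha=0.02` and a minimum Cramér's V of 0.02; features on the protected list are retained regardless of their score.


### Selective One-Hot Encoding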


In [60]:
# ============================================================== 
# CHI-SQUARE FEATURE SELECTION
# ============================================================== 

# Features that must survive selection regardless of their chi-square score.
protected_features = [
    'age',
    'admission_source_id',
    'admission_type_id',
    'diag_1',
    'diag_2',
    'diag_3',
    'medical_specialty',
    'payer_group',
    'age_specialty_interaction',
    'comorbidity_interaction',
    'diag_combination',
    'a1c_med_interaction',
]

selection_artifacts = run_chi2_feature_selection(
    train_df=train_df,
    test_df=test_df,
    y_train=y_train,
    target_col='readmitted',
    alpha=0.02,
    min_cramers_v=0.02,
    max_features_to_keep=None,
    protected_features=protected_features,
)

chi2_results = selection_artifacts.chi2_results
features_to_keep = selection_artifacts.features_to_keep
features_to_drop = selection_artifacts.features_to_drop
train_df_dropped = selection_artifacts.train_df_selected
test_df_dropped = selection_artifacts.test_df_selected

print('Chi-square analysis complete')
print(f'Features to KEEP : {len(features_to_keep)}')
print(f'Features to DROP : {len(features_to_drop)}')

# Guard: every protected feature must still be present after selection.
missing_protected = sorted(f for f in protected_features if f not in train_df_dropped.columns)
assert not missing_protected, f"Protected features dropped unexpectedly: {missing_protected}"

display(chi2_results)

save_table_snapshot(
    chi2_results,
    table_path('06', 'chi2_feature_selection'),
    title='Feature Pipeline - Chi-Square Feature Selection Results',
    index=False,
)
chi2_results.to_csv(table_path('06', 'chi2_feature_selection', 'csv'), index=False)


Chi-square analysis complete
Features to KEEP : 16
Features to DROP : 2


Unnamed: 0,Feature,Chi2,p_value,Degrees_of_Freedom,CramersV,Significant,Keep
0,comorbidity_interaction,1595.37,0.0,360,0.1002,True,True
1,age_specialty_interaction,947.35,0.0,142,0.0772,True,True
2,diag_combination,870.36,0.0,82,0.074,True,True
3,admission_source_id,725.76,0.0,6,0.0676,True,True
4,diabetesMed,322.85,0.0,2,0.0638,True,True
5,diag_3,585.72,0.0,26,0.0607,True,True
6,medical_specialty,561.23,0.0,18,0.0595,True,True
7,diag_2,500.34,0.0,26,0.0561,True,True
8,diag_1,395.2,0.0,24,0.0499,True,True
9,a1c_med_interaction,392.17,0.0,10,0.0497,True,True
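
The `CramersV` column above is consistent with the standard formula V = sqrt(chi2 / (n * min(r - 1, c - 1))), where n is the number of rows and r, c are the dimensions of the contingency table. For example, `diabetesMed` (2 levels, 3 target classes, n = 79394) gives sqrt(322.85 / 79394) ≈ 0.0638. The sketch below computes the statistic from scratch with NumPy; it omits the p-value and is not the project's `run_chi2_feature_selection`.

```python
import math

import numpy as np
import pandas as pd


def chi2_and_cramers_v(feature: pd.Series, target: pd.Series):
    """Pearson chi-square statistic and Cramér's V for two categorical series."""
    observed = pd.crosstab(feature, target).to_numpy(dtype=float)
    n = observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = float(((observed - expected) ** 2 / expected).sum())
    k = min(observed.shape) - 1
    v = math.sqrt(chi2 / (n * k)) if k > 0 else 0.0
    return chi2, v


def cramers_v_from_chi2(chi2: float, n: int, n_rows: int, n_cols: int) -> float:
    """Cramér's V from a known chi-square statistic and table dimensions."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


# Perfect association on toy data gives V = 1; independence gives V = 0.
chi2_perfect, v_perfect = chi2_and_cramers_v(pd.Series(["a", "a", "b", "b"]), pd.Series([0, 0, 1, 1]))
chi2_indep, v_indep = chi2_and_cramers_v(pd.Series(["a", "a", "b", "b"]), pd.Series([0, 1, 0, 1]))

# Reproduce two rows of the results table from their chi-square values.
v_diabetes_med = cramers_v_from_chi2(322.85, 79394, 2, 3)
v_comorbidity = cramers_v_from_chi2(1595.37, 79394, 181, 3)
print(round(v_diabetes_med, 4), round(v_comorbidity, 4))
```

With roughly 79,000 rows, almost any feature reaches statistical significance, so the effect-size threshold on V is what separates weak from useful features here.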


In [61]:
# Quick audit of cardinality before one-hot encoding
summary_before, cardinality_before = summarize_ohe_cardinality(train_df)
summary_after_selection, cardinality_after_selection = summarize_ohe_cardinality(train_df_dropped)

print('Before chi-square selection:')
print(f"Total columns     : {summary_before['total_columns']}")
print(f"OHE-like columns  : {summary_before['ohe_like_columns']}")
print(f"Non-OHE columns   : {summary_before['non_ohe_columns']}")

print('')
print('After chi-square selection:')
print(f"Total columns     : {summary_after_selection['total_columns']}")
print(f"OHE-like columns  : {summary_after_selection['ohe_like_columns']}")
print(f"Non-OHE columns   : {summary_after_selection['non_ohe_columns']}")

cardinality_preview = cardinality_after_selection.head(20)
display(cardinality_preview)

save_table_snapshot(
    cardinality_preview,
    table_path('06', 'post_selection_cardinality'),
    title='Feature Pipeline - Post-Selection Categorical Cardinality',
    index=False,
)
cardinality_preview.to_csv(table_path('06', 'post_selection_cardinality', 'csv'), index=False)


Before chi-square selection:
Total columns     : 36
OHE-like columns  : 30
Non-OHE columns   : 6

After chi-square selection:
Total columns     : 34
OHE-like columns  : 30
Non-OHE columns   : 4


Unnamed: 0,feature,n_unique
14,comorbidity_interaction,181
12,age_specialty_interaction,72
15,diag_combination,42
4,payer_code,17
7,diag_2,14
8,diag_3,14
6,diag_1,13
1,age,10
5,medical_specialty,10
11,payer_group,6


In [62]:
# ============================================================== 
# EXECUTION
# ============================================================== 

assert_feature_alignment(train_df_dropped, test_df_dropped)
assert_no_missing_values(train_df_dropped, 'train_df_before_ohe')
assert_no_missing_values(test_df_dropped, 'test_df_before_ohe')

# About 0.9% of training rows, clamped to the range [120, 680]
min_frequency_count = min(680, max(120, int(len(train_df_dropped) * 0.009)))
print(f"Using OHE min_frequency={min_frequency_count} (rows)")

train_df_encoded, test_df_encoded, encoder = one_hot_encode(
    train_df_dropped,
    test_df_dropped,
    drop='first',
    min_frequency=min_frequency_count,
    max_categories=None,
    verbose=True,
)

assert_feature_alignment(train_df_encoded, test_df_encoded)
assert_no_missing_values(train_df_encoded, 'train_df_encoded')
assert_no_missing_values(test_df_encoded, 'test_df_encoded')

print(f'Encoded train shape: {train_df_encoded.shape}')
print(f'Encoded test shape : {test_df_encoded.shape}')


Using OHE min_frequency=680 (rows)
Encoding complete
train: (79394, 34) -> (79394, 185)
test : (19855, 34) -> (19855, 185)
Encoded train shape: (79394, 185)
Encoded test shape : (19855, 185)
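
The `min_frequency` value printed above follows from the clamp in the cell: `int(79394 * 0.009)` is 714, which exceeds the cap, so 680 is used. The sketch below wraps the same arithmetic in a small function; the name `ohe_min_frequency` and its parameter names are illustrative only.

```python
def ohe_min_frequency(n_rows: int, fraction: float = 0.009, floor: int = 120, cap: int = 680) -> int:
    """Row-count threshold for infrequent categories, clamped to [floor, cap]."""
    return min(cap, max(floor, int(n_rows * fraction)))


full_train = ohe_min_frequency(79394)   # 714 before the cap
mid_sized = ohe_min_frequency(45678)    # 411, inside the range
small = ohe_min_frequency(10000)        # below the floor, so 120
print(full_train, mid_sized, small)
```

Categories with fewer training rows than this threshold are grouped into a single infrequent bucket by the encoder, which keeps the encoded width at 185 columns in the run above.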


## Save Files

All final artifacts are persisted to `data/final/` and table diagnostics to `src/eda/tables/`.


In [63]:
# ============================================================== 
# SAVE FINAL FEATURE ARTIFACTS
# ============================================================== 

saved_paths = save_final_feature_outputs(
    train_df_encoded=train_df_encoded,
    test_df_encoded=test_df_encoded,
    y_train=y_train,
    y_test=y_test,
    y_train_binary=y_train_binary,
    y_test_binary=y_test_binary,
    encoder=encoder,
    config=cfg,
)

print('Saved to data/final/')
print(f"   train_encoded    : {train_df_encoded.shape}")
print(f"   test_encoded     : {test_df_encoded.shape}")
print(f"   y_train          : {y_train.shape}")
print(f"   y_test           : {y_test.shape}")
print(f"   y_train_binary   : {y_train_binary.shape}")
print(f"   y_test_binary    : {y_test_binary.shape}")
print(f"   encoder          : {encoder}")
print('')
print('Saved files:')
for key, path in saved_paths.items():
    print(f"   {key:16s} -> {path}")

repro_footer = build_reproducibility_footer(cfg.random_state)
display(repro_footer)
save_table_snapshot(
    repro_footer,
    table_path('99', 'reproducibility_footer'),
    title='Feature Pipeline - Reproducibility Footer',
    index=False,
)
repro_footer.to_csv(table_path('99', 'reproducibility_footer', 'csv'), index=False)

print('')
print(f"Table diagnostics saved to: {table_output_path}")


Saved to data/final/
   train_encoded    : (79394, 185)
   test_encoded     : (19855, 185)
   y_train          : (79394,)
   y_test           : (19855,)
   y_train_binary   : (79394,)
   y_test_binary    : (19855,)
   encoder          : ColumnTransformer(remainder='passthrough',
                  transformers=[('cat',
                                 OneHotEncoder(drop='first',
                                               handle_unknown='ignore',
                                               min_frequency=680,
                                               sparse_output=False),
                                 ['race', 'age', 'admission_type_id',
                                  'admission_source_id', 'payer_code',
                                  'medical_specialty', 'diag_1', 'diag_2',
                                  'diag_3', 'change', 'diabetesMed',
                                  'payer_group', 'age_specialty_interaction',
                                  'a1c_med_intera

Unnamed: 0,timestamp_utc,python_version,platform,random_seed,numpy,pandas,sklearn,matplotlib,seaborn
0,2026-02-18T12:28:13+00:00,3.11.0,macOS-26.2-arm64-arm-64bit,42,1.26.4,2.2.0,1.7.2,3.10.6,0.13.2



Table diagnostics saved to: /Users/casimircasparuhlig/Desktop/classification-project-casimiruhlig/src/eda/tables


## Expert To-Do

- Add leakage tests for interaction features that depend on outcome-adjacent signals.
- Add monotonic-bin checks for transformed numeric features.
- Add stability checks for category drift between train and test after grouping.
