# Collinearity and Leakage Test

__Split Distribution__  
Step 1 - Perform a distribution test on the train/test split: Kolmogorov-Smirnov for continuous variables, Chi-square for categorical variables  

__Collinearity (Moved to later part)__  
Step 2 - Run the VIF -> drop the variable with the highest VIF -> repeat until every VIF < 5  
Note:
* Unlike p-value-based selection, where the choice of which variable to drop is arbitrary, the VIF is computed for each variable against all remaining variables, which gives a clear value to rank by
* Modern ML models can handle multicollinearity, but GLMs struggle with it  
* The VIF will therefore be run after variable selection, and only for the GLM, not for the tree models
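
The drop-and-repeat loop in step 2 could look roughly like the sketch below. It is an illustration on synthetic data, not the notebook's actual implementation: the helper names `vif_table` and `drop_high_vif` and the demo columns are hypothetical. The VIF is computed here as the diagonal of the inverse correlation matrix, which is the usual centered definition; `variance_inflation_factor` from statsmodels is expected to agree when the design matrix includes an intercept column.

```python
import numpy as np
import pandas as pd


def vif_table(X: pd.DataFrame) -> pd.Series:
    """Centered VIF per column: the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X.to_numpy(dtype=float), rowvar=False)
    return pd.Series(np.diag(np.linalg.inv(corr)), index=X.columns)


def drop_high_vif(X: pd.DataFrame, threshold: float = 5.0) -> pd.DataFrame:
    """Drop the highest-VIF column, recompute, and repeat until every VIF < threshold."""
    X = X.copy()
    while X.shape[1] > 1:
        vif = vif_table(X)
        if vif.max() < threshold:
            break
        X = X.drop(columns=vif.idxmax())
    return X


# Synthetic demo data (hypothetical): "c" is almost exactly "a" + "b"
rng = np.random.default_rng(0)
demo = pd.DataFrame({"a": rng.normal(size=500), "b": rng.normal(size=500)})
demo["c"] = demo["a"] + demo["b"] + rng.normal(scale=0.05, size=500)
demo["d"] = rng.normal(size=500)

kept = drop_high_vif(demo)
print(vif_table(demo).round(1))
print("Kept columns:", list(kept.columns))
```

Recomputing the VIF after every single drop matters: removing one variable changes the VIF of every remaining variable, so dropping everything above the threshold in one pass would tend to remove more than necessary.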

__Leakage Test__  
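Step 3 - For regression data, check each predictor's association with the target (for example the Pearson correlation and its p-value). A suspiciously strong association, i.e. a correlation near ±1 with a p-value near zero, suggests leakage.  
Step 4 - Run a random forest on the data (default parameters are fine). Inspect the top 10 feature importances and manually check each feature for potential leakage.  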
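
Step 3 has no code cell in this notebook, so the following is only a sketch of how it might be done with `pearsonr`, assuming numeric predictors without missing values. The helper name `target_association`, the synthetic columns, and the 0.95 cut-off are all hypothetical choices made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr


def target_association(X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    """Pearson r and p-value of every column against the target, strongest first."""
    rows = []
    for col in X.columns:
        r, p = pearsonr(X[col], y)
        rows.append({"feature": col, "r": r, "p_value": p})
    out = pd.DataFrame(rows).set_index("feature")
    order = out["r"].abs().sort_values(ascending=False).index
    return out.loc[order]


# Synthetic demo data (hypothetical): "leaky" is the target plus a little noise
rng = np.random.default_rng(0)
y_demo = pd.Series(rng.integers(0, 2, size=1000), name="target")
X_demo = pd.DataFrame({
    "noise": rng.normal(size=1000),
    "weak": y_demo + rng.normal(scale=3.0, size=1000),
    "leaky": y_demo + rng.normal(scale=0.05, size=1000),
})

report = target_association(X_demo, y_demo)
suspects = report[report["r"].abs() > 0.95]  # illustrative cut-off, not a standard
print(report)
print("Review manually:", list(suspects.index))
```

A flagged feature is only a candidate: whether it is genuine leakage still has to be decided by checking how and when the variable was collected relative to the target.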

In [1]:
import pandas as pd
import numpy as np
import json
import os

from sklearn.ensemble import RandomForestRegressor

# stats
from scipy.stats import ks_2samp, chi2_contingency, pearsonr
from statsmodels.stats.outliers_influence import variance_inflation_factor

__Split Distribution__  
Step 1 - Perform a distribution test on the train/test split: Kolmogorov-Smirnov for continuous variables, Chi-square for categorical variables  

> KS test: 2 variables (1.60%) have p < 0.05
> Chi-square test: 4 variables (3.05%) have p < 0.05
>
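> Conclusion: Both rates are below the roughly 5% of variables expected to fall under p < 0.05 by chance alone, so the train-test split preserved the distributions well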
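
The code that produced these counts is not shown in the notebook. A minimal sketch of such a check, assuming `ks_2samp` for the numeric columns and `chi2_contingency` on a split-by-level count table for the categorical ones, is given below. The function name `split_distribution_tests` and the synthetic frames are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp, chi2_contingency


def split_distribution_tests(train, test, num_cols, cat_cols):
    """Return one p-value per column comparing the train and test distributions."""
    pvals = {}
    for col in num_cols:
        pvals[col] = ks_2samp(train[col].dropna(), test[col].dropna()).pvalue
    for col in cat_cols:
        # Contingency table: rows = split (train, test), columns = category levels
        table = pd.concat(
            [train[col].value_counts(), test[col].value_counts()], axis=1
        ).fillna(0).T
        table = table.loc[:, table.sum(axis=0) > 0]  # drop levels unseen in both splits
        pvals[col] = chi2_contingency(table.to_numpy())[1]
    return pd.Series(pvals)


# Synthetic demo data (hypothetical): "shifted" has a different mean in the test split
rng = np.random.default_rng(0)
n_train, n_test = 800, 200
train = pd.DataFrame({
    "x": rng.normal(size=n_train),
    "shifted": rng.normal(size=n_train),
    "g": rng.choice(["a", "b", "c"], size=n_train),
})
test = pd.DataFrame({
    "x": rng.normal(size=n_test),
    "shifted": rng.normal(loc=1.0, size=n_test),
    "g": rng.choice(["a", "b", "c"], size=n_test),
})

pvals = split_distribution_tests(train, test, ["x", "shifted"], ["g"])
flagged = pvals[pvals < 0.05]
print(f"{len(flagged)} of {len(pvals)} variables ({len(flagged) / len(pvals):.2%}) have p < 0.05")
```

With the notebook's own data, `train`, `test`, `num_cols`, and `cat_cols` would presumably be `X_train`, `X_test`, and the column lists built in the next cell.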

In [2]:
with open("PROCESSED/DATA/merged_and_dropped.cat_cols.json") as f:
    cat_cols = json.load(f)

X_train = pd.read_parquet("INPUTS/TRAIN/X_train.parquet")
X_test = pd.read_parquet("INPUTS/TEST/X_test.parquet")
y_train = pd.read_parquet("INPUTS/TRAIN/y_train.parquet")
y_test = pd.read_parquet("INPUTS/TEST/y_test.parquet")

X_train[cat_cols] = X_train[cat_cols].astype("category")
X_test[cat_cols] = X_test[cat_cols].astype("category")

num_cols = [c for c in X_train.columns if c not in cat_cols]

In [3]:
# one-hot encode categorical variables
X_encoded = pd.get_dummies(X_train, drop_first=True)

rf = RandomForestRegressor(random_state=42, n_jobs=-1)
rf.fit(X_encoded, y_train.iloc[:, 0])
importances = pd.Series(rf.feature_importances_, index=X_encoded.columns)
top10 = importances.sort_values(ascending=False).head(10)
print(top10)
# top10.to_csv("LOG/rf_leakage_test.csv")

P_DEMO__RIDAGEYR_Age_in_years_at_screening                  0.107723
P_LUX__LUXCAPM_Median_CAP_decibels_per_meter_dB_m           0.056829
P_ALB_CR__URDACT_Albumin_creatinine_ratio_mg_g              0.048834
P_MCQ__MCQ366D                                              0.021322
P_MCQ__MCQ300C                                              0.020041
P_TCHOL__LBDTCSI_Total_Cholesterol_mmol_L                   0.019221
P_ALB_CR__URXUMS_Albumin_urine_mg_L                         0.018319
P_BPQ__BPQ020_Ever_told_you_had_high_blood_pressure_2.0     0.016682
P_BIOPRO__LBDSCHSI_Cholesterol_refrigerated_serum_mmol_L    0.016425
P_BIOPRO__LBXSOSSI_Osmolality_mmol_Kg                       0.014955
dtype: float64


#### Initial test

In [11]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, f1_score, classification_report, make_scorer, roc_auc_score
import numpy as np
import pandas as pd

# encode
X_encoded = pd.get_dummies(X_train, drop_first=True)
X_test_encoded = pd.get_dummies(X_test, drop_first=True)
X_test_encoded = X_test_encoded.reindex(columns=X_encoded.columns, fill_value=0)

# ensure target is categorical
y_train_cat = y_train.iloc[:, 0].astype("category")
y_test_cat  = y_test.iloc[:, 0].astype("category")

# parameter grid
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
}

# model
rf = RandomForestClassifier(random_state=42, n_jobs=-1)

scorers = {
    'auc': 'roc_auc_ovr',
    'accuracy': make_scorer(accuracy_score),
    'f1': make_scorer(f1_score, average='macro')
}

grid = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    scoring=scorers,
    refit='f1',
    cv=3,
    n_jobs=-1,
    verbose=1
)

# fit
grid.fit(X_encoded, y_train_cat)

print("Best parameters:", grid.best_params_)
print("Best CV score:", grid.best_score_)
# refit='f1', so best_index_ points at the candidate with the best mean CV F1
print("CV AUC at best-F1 params:", grid.cv_results_['mean_test_auc'][grid.best_index_])
print("CV F1 at best-F1 params:", grid.cv_results_['mean_test_f1'][grid.best_index_])
print("CV Accuracy at best-F1 params:", grid.cv_results_['mean_test_accuracy'][grid.best_index_])

# evaluate on train/test
best_rf = grid.best_estimator_

y_pred_train = best_rf.predict(X_encoded)
y_pred_test  = best_rf.predict(X_test_encoded)

acc_train = accuracy_score(y_train_cat, y_pred_train)
acc_test  = accuracy_score(y_test_cat, y_pred_test)

f1_train = f1_score(y_train_cat, y_pred_train, average='macro')
f1_test  = f1_score(y_test_cat, y_pred_test, average='macro')

auc_train = roc_auc_score(y_train_cat, best_rf.predict_proba(X_encoded)[:, 1])
auc_test  = roc_auc_score(y_test_cat, best_rf.predict_proba(X_test_encoded)[:, 1])

print(f"Train accuracy: {acc_train:.3f},  F1: {f1_train:.3f}, AUC: {auc_train:.3f}")
print(f"Test  accuracy: {acc_test:.3f},  F1: {f1_test:.3f}, AUC: {auc_test:.3f}")

print("\nClassification report:\n")
print(classification_report(y_test_cat, y_pred_test))

Fitting 3 folds for each of 8 candidates, totalling 24 fits
Best parameters: {'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 100}
Best CV score: 0.6054338112411181
CV AUC at best-F1 params: 0.8777209115251665
CV F1 at best-F1 params: 0.6054338112411181
CV Accuracy at best-F1 params: 0.8506862843401745
Train accuracy: 0.936,  F1: 0.868, AUC: 0.995
Test  accuracy: 0.872,  F1: 0.649, AUC: 0.896

Classification report:

              precision    recall  f1-score   support

         0.0       0.87      0.99      0.93      1642
         1.0       0.81      0.24      0.37       306

    accuracy                           0.87      1948
   macro avg       0.84      0.61      0.65      1948
weighted avg       0.86      0.87      0.84      1948



In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, classification_report, roc_auc_score
import pandas as pd

# encode
X_encoded = pd.get_dummies(X_train, drop_first=True)
X_test_encoded = pd.get_dummies(X_test, drop_first=True)
X_test_encoded = X_test_encoded.reindex(columns=X_encoded.columns, fill_value=0)

# target as categorical
y_train_cat = y_train.iloc[:, 0].astype("category")
y_test_cat  = y_test.iloc[:, 0].astype("category")

# unregularized logistic regression (GLM)
model = LogisticRegression(
    penalty=None,        # no regularization (penalty=None requires scikit-learn >= 1.2)
    solver="lbfgs",
    max_iter=500,
    n_jobs=-1
)

# fit
model.fit(X_encoded, y_train_cat)

# predictions
y_pred_train = model.predict(X_encoded)
y_pred_test  = model.predict(X_test_encoded)

# metrics
acc_train = accuracy_score(y_train_cat, y_pred_train)
acc_test  = accuracy_score(y_test_cat, y_pred_test)

f1_train = f1_score(y_train_cat, y_pred_train, average="macro")
f1_test  = f1_score(y_test_cat, y_pred_test, average="macro")

print(f"Train accuracy: {acc_train:.3f},  F1: {f1_train:.3f}")
print(f"Test  accuracy: {acc_test:.3f},  F1: {f1_test:.3f}")
print(f"Train AUC: {roc_auc_score(y_train_cat, model.predict_proba(X_encoded)[:, 1]):.3f}")
print(f"Test  AUC: {roc_auc_score(y_test_cat, model.predict_proba(X_test_encoded)[:, 1]):.3f}")

print("\nClassification report:\n")
print(classification_report(y_test_cat, y_pred_test))


Train accuracy: 0.856,  F1: 0.691
Test  accuracy: 0.868,  F1: 0.698
Train AUC: 0.860
Test  AUC: 0.877

Classification report:

              precision    recall  f1-score   support

         0.0       0.89      0.96      0.92      1642
         1.0       0.64      0.38      0.47       306

    accuracy                           0.87      1948
   macro avg       0.76      0.67      0.70      1948
weighted avg       0.85      0.87      0.85      1948



In [12]:
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, classification_report, make_scorer
import pandas as pd

# encode
X_encoded = pd.get_dummies(X_train, drop_first=True)
X_test_encoded = pd.get_dummies(X_test, drop_first=True)
X_test_encoded = X_test_encoded.reindex(columns=X_encoded.columns, fill_value=0)

# target
y_train_cat = y_train.iloc[:, 0].astype(int)
y_test_cat  = y_test.iloc[:, 0].astype(int)

# parameter grid (XGB version)
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.05, 0.1],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# model
xgb = XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    tree_method='hist',
    random_state=42,
    n_jobs=-1
)

# scorers
scorers = {
    'auc': 'roc_auc',
    'accuracy': make_scorer(accuracy_score),
    'f1': make_scorer(f1_score, average='macro')
}

# grid search
grid = GridSearchCV(
    estimator=xgb,
    param_grid=param_grid,
    scoring=scorers,
    refit='f1',
    cv=3,
    verbose=1,
    n_jobs=-1
)

# fit
grid.fit(X_encoded, y_train_cat)

print("Best parameters:", grid.best_params_)
print("Best CV score:", grid.best_score_)
# refit='f1', so best_index_ points at the candidate with the best mean CV F1
print("CV AUC at best-F1 params:", grid.cv_results_['mean_test_auc'][grid.best_index_])
print("CV F1 at best-F1 params:", grid.cv_results_['mean_test_f1'][grid.best_index_])
print("CV Accuracy at best-F1 params:", grid.cv_results_['mean_test_accuracy'][grid.best_index_])

# evaluate
best_xgb = grid.best_estimator_

y_pred_train = best_xgb.predict(X_encoded)
y_pred_test  = best_xgb.predict(X_test_encoded)

acc_train = accuracy_score(y_train_cat, y_pred_train)
acc_test  = accuracy_score(y_test_cat, y_pred_test)

f1_train = f1_score(y_train_cat, y_pred_train, average='macro')
f1_test  = f1_score(y_test_cat, y_pred_test, average='macro')

auc_train = roc_auc_score(y_train_cat, best_xgb.predict_proba(X_encoded)[:, 1])
auc_test  = roc_auc_score(y_test_cat, best_xgb.predict_proba(X_test_encoded)[:, 1])

print(f"Train accuracy: {acc_train:.3f},  F1: {f1_train:.3f}, AUC: {auc_train:.3f}")
print(f"Test  accuracy: {acc_test:.3f},  F1: {f1_test:.3f}, AUC: {auc_test:.3f}")

print("\nClassification report:\n")
print(classification_report(y_test_cat, y_pred_test))


Fitting 3 folds for each of 48 candidates, totalling 144 fits
Best parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 200, 'subsample': 1.0}
Best CV score: 0.7775390646263148
CV AUC at best-F1 params: 0.9157619049539963
CV F1 at best-F1 params: 0.7775390646263148
CV Accuracy at best-F1 params: 0.8886887283913188
Train accuracy: 0.944,  F1: 0.891, AUC: 0.980
Test  accuracy: 0.914,  F1: 0.819, AUC: 0.934

Classification report:

              precision    recall  f1-score   support

           0       0.93      0.97      0.95      1642
           1       0.80      0.60      0.69       306

    accuracy                           0.91      1948
   macro avg       0.86      0.79      0.82      1948
weighted avg       0.91      0.91      0.91      1948

