# Ensemble Modeling with Voting and Stacking

To improve model performance and stability, we adopt an ensemble learning approach. This notebook uses a soft-voting ensemble; stacking, a related technique, trains a meta-model on the base models' predictions instead of averaging them. Rather than relying on a single algorithm, the ensemble combines several diverse learners to produce a more robust model that generalizes better.

The process follows three key steps:

1. Define Diverse Base Learners: We select models with different learning biases, such as XGBoost, LightGBM, and Random Forest, so that their errors are less correlated and their decision patterns complement each other.

2. Individual Model Evaluation: Each model is trained independently and evaluated on a validation set to measure its standalone ROC-AUC before ensembling.

3. Model Ensembling: The models are then combined into a single predictor using a soft-voting VotingClassifier, which averages their predicted probabilities to make final decisions.
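To make the aggregation in the last step concrete, here is a minimal sketch on synthetic data. The dataset and the two small base models are illustrative assumptions, not the models used in this notebook. It checks that a soft-voting `VotingClassifier` returns the unweighted mean of its fitted members' class probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: any binary classification set works here)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ],
    voting="soft",
)
voter.fit(X, y)

# Soft voting: average the per-class probabilities of the fitted members
manual_mean = np.mean([est.predict_proba(X) for est in voter.estimators_], axis=0)
same = np.allclose(manual_mean, voter.predict_proba(X))
print("Manual mean matches VotingClassifier:", same)
```

With `voting='hard'` the ensemble would instead take a majority vote over predicted labels, which discards the probability information that ROC-AUC relies on.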

Averaging several models tends to reduce variance and can mitigate overfitting, so the ensemble often matches or exceeds the best individual model, although this is not guaranteed.
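The variance argument can be illustrated numerically. If n models have the same error variance σ² and pairwise error correlation ρ, the variance of their average is ρσ² + (1 − ρ)σ²/n. The sketch below simulates this with made-up values (n = 5, ρ = 0.6, σ² = 1); none of these numbers are measured from the models in this notebook.

```python
import numpy as np

n_models, rho, sigma2 = 5, 0.6, 1.0  # illustrative values only
rng = np.random.default_rng(0)

# Covariance matrix with equal variances and equal pairwise correlation
cov = sigma2 * (rho * np.ones((n_models, n_models)) + (1 - rho) * np.eye(n_models))
errors = rng.multivariate_normal(np.zeros(n_models), cov, size=200_000)

# Variance of the averaged error versus the closed-form expression
empirical = errors.mean(axis=1).var()
theoretical = rho * sigma2 + (1 - rho) * sigma2 / n_models

print(f"single-model variance:         {sigma2:.3f}")
print(f"ensemble variance (simulated): {empirical:.3f}")
print(f"ensemble variance (formula):   {theoretical:.3f}")
```

The gain shrinks as ρ approaches 1, which is why the first step stresses diversity among the base learners.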
Below is the complete implementation demonstrating this ensemble strategy in practice.

# Import Libraries

In [2]:
import pandas as pd
import numpy as np
import xgboost as xgb
import lightgbm as lgb
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, VotingClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Load the Data

In [3]:
# Load the Processed Data
df_train = pd.read_csv('../data/cleaned/processed_train_remove.csv')
df_val = pd.read_csv('../data/cleaned/processed_validation_remove.csv')
df_kaggle_test = pd.read_csv('../data/cleaned/processed_kaggle_test_remove.csv')

# Preprocessing

In [4]:
# Define Target and ID columns
target_col = "diagnosed_diabetes"
id_col = "id"

In [5]:
X_train = df_train.drop(columns=[target_col])
y_train = df_train[target_col]
X_val = df_val.drop(columns=[target_col])
y_val = df_val[target_col]

In [6]:
submission_ids = df_kaggle_test[id_col]
X_kaggle_test = df_kaggle_test.drop(columns=[id_col])
X_kaggle_test = X_kaggle_test[X_train.columns]
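Reindexing the test frame by the training columns fails with a bare `KeyError` if a feature is missing. A small guard can report the mismatch more clearly. The helper name `align_features` and the toy column names below are hypothetical and only serve the illustration.

```python
import pandas as pd

def align_features(train_X: pd.DataFrame, test_X: pd.DataFrame) -> pd.DataFrame:
    """Return test_X with train_X's columns in train_X's order (hypothetical helper)."""
    missing = [c for c in train_X.columns if c not in test_X.columns]
    if missing:
        raise ValueError(f"Test data is missing training features: {missing}")
    extra = [c for c in test_X.columns if c not in train_X.columns]
    if extra:
        print(f"Dropping columns not seen in training: {extra}")
    return test_X[train_X.columns]

# Toy frames standing in for X_train and X_kaggle_test (made-up columns)
toy_train = pd.DataFrame({"age": [50, 61], "bmi": [27.1, 31.4]})
toy_test = pd.DataFrame({"bmi": [24.9], "age": [45], "notes": ["n/a"]})

aligned = align_features(toy_train, toy_test)
print(list(aligned.columns))
```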

In [7]:
# Define Multiple Learners 
# We use a dictionary so we can loop through them easily
models = {
    'XGBoost': xgb.XGBClassifier(
        n_estimators=1000, learning_rate=0.05, max_depth=6, 
        subsample=0.8, colsample_bytree=0.8, n_jobs=-1, random_state=42
    ),
    'LightGBM': lgb.LGBMClassifier(
        n_estimators=1000, learning_rate=0.05, num_leaves=31, 
        metric='auc', n_jobs=-1, verbose=-1, random_state=42
    ),
    'AdaBoost': AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=3),
        n_estimators=200, learning_rate=1.0, random_state=42
    ),
    'RandomForest': RandomForestClassifier(
        n_estimators=500, max_depth=10, min_samples_split=10, 
        n_jobs=-1, random_state=42
    ),
    'ExtraTrees': ExtraTreesClassifier(
        n_estimators=500, max_depth=10, min_samples_split=10, 
        n_jobs=-1, random_state=42
    )
}

In [29]:
# Train & Cross-Check Learners
print("--- Cross-Checking Individual Models ---")
trained_models = []

for name, model in models.items():
    # Train
    print(f"Training {name}...")
    model.fit(X_train, y_train)
    
    # Check Performance on Validation Set
    val_probs = model.predict_proba(X_val)[:, 1]
    score = roc_auc_score(y_val, val_probs)
    print(f"  -> {name} Validation ROC-AUC: {score:.5f}")
    
    # Keep the (name, model) pair for the ensemble step
    trained_models.append((name, model))

--- Cross-Checking Individual Models ---
Training XGBoost...
  -> XGBoost Validation ROC-AUC: 0.72332
Training LightGBM...
  -> LightGBM Validation ROC-AUC: 0.72444
Training AdaBoost...
  -> AdaBoost Validation ROC-AUC: 0.71559
Training RandomForest...
  -> RandomForest Validation ROC-AUC: 0.69720
Training ExtraTrees...
  -> ExtraTrees Validation ROC-AUC: 0.68896
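Standalone scores do not show how much the models disagree, and disagreement is what gives averaging room to help. One rough diagnostic is the correlation between the models' validation probabilities. The sketch below uses synthetic data and three small scikit-learn models as stand-ins; the names `demo_models` and `val_probs` are illustrative, not taken from the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the train/validation split
X, y = make_classification(n_samples=1000, n_features=15, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

demo_models = {
    "rf": RandomForestClassifier(n_estimators=50, random_state=0),
    "et": ExtraTreesClassifier(n_estimators=50, random_state=0),
    "lr": LogisticRegression(max_iter=1000),
}

# Positive-class probabilities on the validation split, one vector per model
val_probs = {
    name: model.fit(X_tr, y_tr).predict_proba(X_va)[:, 1]
    for name, model in demo_models.items()
}

# Pairwise Pearson correlation; rows and columns follow the dict order
corr = np.corrcoef(np.vstack(list(val_probs.values())))
print(list(val_probs))
print(np.round(corr, 3))
```

Pairs with correlation close to 1 contribute little extra information to the vote.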


In [None]:
# Build the Ensemble
# We use Soft Voting to average the probabilities of all models
print("\n--- Building Ensemble ---")

ensemble = VotingClassifier(
    estimators=trained_models,
    voting='soft',
    n_jobs=-1
)

In [None]:
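# Fit the ensemble (VotingClassifier clones and refits every base model on X_train,
# so the fits from the loop above are not reused)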
ensemble.fit(X_train, y_train)

In [None]:
# Evaluate Ensemble
ensemble_probs = ensemble.predict_proba(X_val)[:, 1]
ensemble_score = roc_auc_score(y_val, ensemble_probs)

In [None]:
print(f"Ensemble Validation ROC-AUC: {ensemble_score:.5f}")
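When the base models are already fitted, a comparable soft vote can be computed by averaging their probabilities directly, which avoids a second round of training. The sketch below shows this on synthetic data, along with a weighted variant that mirrors the `weights` parameter of `VotingClassifier`. The models and the weights `[1.0, 2.0]` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the train/validation split
X, y = make_classification(n_samples=1000, n_features=15, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=1)

fitted = [
    LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
    RandomForestClassifier(n_estimators=50, random_state=1).fit(X_tr, y_tr),
]

# One row of positive-class probabilities per already-fitted model
probs = np.vstack([m.predict_proba(X_va)[:, 1] for m in fitted])

equal_vote = probs.mean(axis=0)                                # plain soft vote
weighted_vote = np.average(probs, axis=0, weights=[1.0, 2.0])  # arbitrary weights

equal_auc = roc_auc_score(y_va, equal_vote)
weighted_auc = roc_auc_score(y_va, weighted_vote)
print(f"Equal-weight ROC-AUC: {equal_auc:.5f}")
print(f"Weighted ROC-AUC:     {weighted_auc:.5f}")
```

Weights tuned on the validation set will look optimistic on that same set, so they are best chosen with cross-validation or a separate split.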

In [None]:
# Generate Submission
test_probs = ensemble.predict_proba(X_kaggle_test)[:, 1]

submission = pd.DataFrame({
    id_col: submission_ids,
    target_col: test_probs
})
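Before writing the file, a few cheap checks can catch a malformed submission. The helper `validate_submission` below is hypothetical, and the exact requirements of the competition are an assumption: one row per test id, no missing values, and probabilities in [0, 1].

```python
import pandas as pd

def validate_submission(sub, id_col, target_col, expected_rows):
    """Raise ValueError if the submission frame looks malformed (hypothetical helper)."""
    if len(sub) != expected_rows:
        raise ValueError(f"Expected {expected_rows} rows, found {len(sub)}")
    if sub[id_col].duplicated().any():
        raise ValueError("Duplicate ids in submission")
    if sub[target_col].isna().any():
        raise ValueError("Missing predictions in submission")
    if not sub[target_col].between(0.0, 1.0).all():
        raise ValueError("Predictions fall outside [0, 1]")

# Toy frame standing in for the real submission
demo_sub = pd.DataFrame({"id": [1, 2, 3], "diagnosed_diabetes": [0.12, 0.87, 0.50]})
validate_submission(demo_sub, "id", "diagnosed_diabetes", expected_rows=3)
print("Submission checks passed.")
```

In the notebook this would be called as `validate_submission(submission, id_col, target_col, len(df_kaggle_test))`.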

In [None]:
submission.to_csv('../data/submission/submission_ensemble_final_remove.csv', index=False)
print("Success! 'submission_ensemble_final_remove.csv' saved.")

# Combining Train and Validation Sets to Boost Performance

In [8]:
# Combine Train + Validation into One FULL Dataset 
print("Combining Train and Validation sets...")
df_full_train = pd.concat([df_train, df_val], axis=0).reset_index(drop=True)

Combining Train and Validation sets...


In [9]:
# Separate X and y
X_full = df_full_train.drop(columns=[target_col])
y_full = df_full_train[target_col]

In [10]:
print(f"Full Training Data Shape: {X_full.shape}")
print(f"Test Data Shape: {X_kaggle_test.shape}")

Full Training Data Shape: (700000, 34)
Test Data Shape: (300000, 34)


In [11]:
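# Define the Models
# Note: RandomForest and ExtraTrees use max_depth=12 and the default min_samples_split here,
# unlike the max_depth=10, min_samples_split=10 versions validated earlier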
clf_xgb = xgb.XGBClassifier(
    n_estimators=1000, 
    learning_rate=0.05, 
    max_depth=6, 
    subsample=0.8, 
    colsample_bytree=0.8, 
    n_jobs=-1, 
    random_state=42
)

clf_lgb = lgb.LGBMClassifier(
    n_estimators=1000, 
    learning_rate=0.05, 
    num_leaves=31, 
    metric='auc', 
    n_jobs=-1, 
    verbose=-1, 
    random_state=42
)

clf_ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=3),
    n_estimators=200, 
    learning_rate=1.0, 
    random_state=42
)

clf_rf = RandomForestClassifier(
    n_estimators=500, 
    max_depth=12, 
    n_jobs=-1, 
    random_state=42
)

clf_et = ExtraTreesClassifier(
    n_estimators=500, 
    max_depth=12, 
    n_jobs=-1, 
    random_state=42
)

In [12]:
# Build and Train the Final Ensemble 
ensemble = VotingClassifier(
    estimators=[
        ('xgb', clf_xgb),
        ('lgb', clf_lgb),
        ('ada', clf_ada),
        ('rf', clf_rf),
        ('et', clf_et)
    ],
    voting='soft',
    n_jobs=-1
)

In [13]:
print("\nTraining Final Ensemble on FULL Dataset (this may take a while)...")
ensemble.fit(X_full, y_full)


Training Final Ensemble on FULL Dataset (this may take a while)...
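Once the validation set is folded into the training data, no held-out score remains for the final ensemble. One common way to still estimate its performance is stratified k-fold cross-validation. The sketch below uses synthetic data and a two-model ensemble as stand-ins; on the full dataset with five large models this would be considerably slower.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for X_full / y_full
X, y = make_classification(n_samples=600, n_features=12, random_state=2)

demo_ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=2)),
    ],
    voting="soft",
)

# Each fold refits the whole ensemble and scores it on the held-out part
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)
cv_scores = cross_val_score(demo_ensemble, X, y, cv=cv, scoring="roc_auc")
print(f"CV ROC-AUC: {cv_scores.mean():.4f} +/- {cv_scores.std():.4f}")
```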



In [14]:
# Predict and Submit 
print("Generating predictions...")
final_probs = ensemble.predict_proba(X_kaggle_test)[:, 1]

submission = pd.DataFrame({
    id_col: submission_ids,
    target_col: final_probs
})

Generating predictions...


In [15]:
submission.to_csv('../data/submission/submission_full_data_remove.csv', index=False)
print("Success! 'submission_full_data_remove.csv' saved.")

Success! 'submission_full_data_remove.csv' saved.
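The title also mentions stacking, which this notebook does not implement. For comparison, here is a minimal sketch of how stacking differs from soft voting: a meta-model is trained on out-of-fold predictions of the base models instead of taking their plain average. The data is synthetic, and the choice of `LogisticRegression` as the meta-model is an assumption made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the train/validation split
X, y = make_classification(n_samples=1000, n_features=15, random_state=3)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=3)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=3)),
        ("et", ExtraTreesClassifier(n_estimators=50, random_state=3)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # meta-model (assumed choice)
    stack_method="predict_proba",  # feed base-model probabilities to the meta-model
    cv=5,                          # out-of-fold predictions for training the meta-model
)
stack.fit(X_tr, y_tr)

stack_auc = roc_auc_score(y_va, stack.predict_proba(X_va)[:, 1])
print(f"Stacking validation ROC-AUC: {stack_auc:.5f}")
```

Stacking can learn unequal weights for the base models, at the cost of extra training time from the internal cross-validation.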
