<a href="https://colab.research.google.com/github/ChiNonsoHenry16/Comparative-Analysis-of-Classical-ML-and-Neural-Network-Models-for-Pig-Weight-Prediction/blob/main/Kfold_Validation_of_Ensemble_Models_for_Pig_Weight_Estimation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This Colab notebook runs K-fold cross-validation with several values of K (3, 5, 7, and 10) for ensemble models: voting, bagging, and stacking regressors. K-fold cross-validation is a model evaluation technique used in machine learning to assess how well a model generalizes to an independent dataset. It helps detect overfitting and gives a more reliable estimate of model performance than a single train/test split.

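Compared with a single random train/test split, K-fold cross-validation improves the estimate of model performance, uses all of the data for both training and testing, and reduces the variance of the estimate. However, it is computationally expensive, and with shuffled folds it is not suitable for time series.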

🔹 How It Works: The data are split into K roughly equal-sized folds. The procedure then iterates K times, each time training on K-1 folds and testing on the remaining fold. The performance metrics (for regression, e.g. MSE or MAE; for classification, e.g. accuracy or F1-score) are averaged over the K iterations. For example, with 100 samples, 5-fold cross-validation splits the data into 5 parts of 20 samples, runs 5 iterations that each train on 80 samples and test on 20, and averages the 5 validation results.
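The 100-sample, 5-fold example can be sketched as follows. This is a minimal illustration on placeholder data (the array `X_demo` is an assumption made for the example, not the pig dataset); it only shows how `KFold` partitions the indices.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder data: 100 samples with a single feature (not the pig dataset).
X_demo = np.arange(100).reshape(-1, 1)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_sizes = []
seen_test_indices = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X_demo), start=1):
    # Each iteration trains on K-1 folds and tests on the remaining fold.
    fold_sizes.append((len(train_idx), len(test_idx)))
    seen_test_indices.extend(test_idx.tolist())
    print(f"Fold {fold}: train={len(train_idx)} samples, test={len(test_idx)} samples")

# Across the 5 folds, every sample is used for testing exactly once.
print(sorted(seen_test_indices) == list(range(100)))
```

Each fold holds out 20 samples and trains on the other 80, so all 100 samples contribute to both training and testing over the full run.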

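Evaluation Metrics: R squared (coefficient of determination), MAE (mean absolute error), RMSE (root mean squared error) and MAPE (mean absolute percentage error).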
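As a rough sketch of how these four metrics can be computed, the helper below mirrors the formulas used in the code cells of this notebook (RMSE as the square root of the MSE, MAPE as the mean absolute relative error times 100). The function name `regression_metrics` and the three demo values are hypothetical and chosen only for illustration. Note that MAPE is undefined when a true value is zero.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_metrics(y_true, y_pred):
    """Return R2, MAE, RMSE and MAPE (in percent) for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    # MAPE divides by the true values, so they must all be non-zero.
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    return {
        "R2": r2_score(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        "RMSE": rmse,
        "MAPE": mape,
    }


# Hypothetical true and predicted weights, for illustration only.
demo = regression_metrics([100.0, 110.0, 120.0], [98.0, 113.0, 120.0])
for name, value in demo.items():
    print(f"{name}: {value:.4f}")
```

R squared is unitless, MAE and RMSE are in the units of the target (weight), and MAPE is a percentage, which is why the notebook reports them side by side.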




Installation of requisite libraries

In [None]:
!pip install lightgbm catboost xgboost openpyxl scikit-optimize
!pip install -q memory_profiler
!pip install --upgrade scikit-learn

import numpy as np
import pandas as pd
import time
from memory_profiler import memory_usage
from scipy import stats

from sklearn.model_selection import train_test_split, GridSearchCV, KFold, RandomizedSearchCV, cross_val_score
from sklearn.metrics import make_scorer, mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import VotingRegressor, BaggingRegressor, AdaBoostRegressor, RandomForestRegressor, HistGradientBoostingRegressor, StackingRegressor
from sklearn.ensemble import IsolationForest

from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import Ridge

from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

# --- LOAD DATA ---
import io
from google.colab import files

# Upload dataset.xlsx from the local machine when prompted
uploaded = files.upload()
data = pd.read_excel(io.BytesIO(uploaded['dataset.xlsx']))

Collecting catboost
  Downloading catboost-1.2.8-cp311-cp311-manylinux2014_x86_64.whl.metadata (1.2 kB)
Collecting scikit-optimize
  Downloading scikit_optimize-0.10.2-py2.py3-none-any.whl.metadata (9.7 kB)
Collecting pyaml>=16.9 (from scikit-optimize)
  Downloading pyaml-25.1.0-py3-none-any.whl.metadata (12 kB)
Downloading catboost-1.2.8-cp311-cp311-manylinux2014_x86_64.whl (99.2 MB)
Downloading scikit_optimize-0.10.2-py2.py3-none-any.whl (107 kB)
Downloading pyaml-25.1.0-py3-none-any.whl (26 kB)
Installing collected packages: pyaml, scikit-optimize, catboost
Successfully installed catboost-1.2.8 pyaml-25.1.0 scikit-optimize-0.10.2


Saving dataset.xlsx to dataset.xlsx


I initially ran the voting, bagging and stacking regressors together using the code below (a single pipeline), but it took a long time, so I edited the code to run each model individually in the sections that follow. I also used cross_val_score and set n_jobs to 1.
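For reference, a minimal sketch of `cross_val_score` with `n_jobs=1` over the same values of K is shown below. It runs on synthetic data from `make_regression` with a small random forest; the names `X_demo`, `y_demo` and `model_demo` are assumptions made for this illustration and are not the notebook's dataset or ensemble models.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the real features and target.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# A small model so the sketch runs quickly; n_jobs=1 keeps everything in one process.
model_demo = RandomForestRegressor(n_estimators=20, random_state=42, n_jobs=1)

cv_scores = {}
for k in (3, 5, 7, 10):
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    # cross_val_score clones and refits the model on each of the k training splits.
    scores = cross_val_score(model_demo, X_demo, y_demo, cv=kf, scoring="r2", n_jobs=1)
    cv_scores[k] = scores
    print(f"k={k}: mean R2={scores.mean():.3f}, std={scores.std(ddof=1):.3f}")
```

Running with `n_jobs=1` is slower than parallel execution but avoids nested parallelism when the wrapped estimators already use several cores.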

In [None]:
#Preprocessing of the dataset
data.rename(columns={'male/female':'gender'}, inplace=True)
data['Date of determination'] = pd.to_datetime(data['Date of determination'])
data['date of birth'] = pd.to_datetime(data['date of birth'])
data = pd.concat([data, pd.get_dummies(data['gender'])], axis=1)
data = pd.concat([data, pd.get_dummies(data['breed'])], axis=1)
data['age_in_days'] = (data['Date of determination'] - data['date of birth']).dt.days
median_height = data[data['The height of a pig'] != 0]['The height of a pig'].median()
data['The height of a pig'] = data['The height of a pig'].replace(0, median_height)
drop_cols = ['Serial number', 'breed', 'gender', 'Date of determination', 'date of birth']
data = data.drop(columns=drop_cols)

main_features = [
    'Chest circumference of pig',
    'Abdominal circumference of pigs',
    'Waist circumference of pig',
    'Length of pig',
    'The height of a pig',
    'female',
    'male',
    'S21',
    'S23',
    'age_in_days'
]

# --- Outlier removal: keep only rows that IsolationForest labels as inliers (+1) ---
iso = IsolationForest(contamination=0.05, random_state=42)
mask = iso.fit_predict(data[main_features]) == 1
data = data.loc[mask]

X = data[main_features].values
y = data['Weight measurement']

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# --- Define ensemble base models ---
hgb = HistGradientBoostingRegressor(random_state=42)
lgbm = LGBMRegressor(random_state=42, n_jobs=-1, verbose=-1)
cat = CatBoostRegressor(verbose=0, random_state=42)
xgb = XGBRegressor(random_state=42, eval_metric='rmse', use_label_encoder=False)
rf = RandomForestRegressor(random_state=42, n_jobs=-1)

# Stacking regressor with Ridge meta-learner
stacking_regressor = StackingRegressor(
    estimators=[('hgb', hgb), ('lgbm', lgbm), ('cat', cat), ('xgb', xgb), ('rf', rf)],
    final_estimator=Ridge(),
    cv=3,
    n_jobs=-1,
    passthrough=False
)

ensemble_configs = {
    "Voting Regressor": {
        "model": VotingRegressor(
            estimators=[('hgb', hgb), ('lgbm', lgbm), ('cat', cat), ('xgb', xgb), ('rf', rf)]
        ),
        "param_grid": {}
    },
    "Bagging Regressor": {
        "model": BaggingRegressor(
            estimator=RandomForestRegressor(random_state=42),
            random_state=42
        ),
        "param_grid": {
            'n_estimators': [10, 30, 50],
            'max_samples': [0.6, 0.8, 1.0],
            'max_features': [0.6, 0.8, 1.0]
        }
    },
    "Stacking Regressor": {
        "model": stacking_regressor,
        "param_grid": {
            'final_estimator__alpha': [0.01, 0.1, 1, 10]
        }
    }
}

# --- Randomized Search with memory/time profiling ---
from sklearn.exceptions import FitFailedWarning
import warnings

def randomized_search_with_profile(model, param_grid, X_train, y_train, n_iter=20):
    if param_grid:
        rs = RandomizedSearchCV(
            model, param_grid, n_iter=n_iter, cv=3,
            n_jobs=-1, scoring='r2', verbose=0, error_score='raise'
        )
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", category=FitFailedWarning)
            start_time = time.time()
            mem_usage = memory_usage((rs.fit, (X_train, y_train)), max_usage=True, interval=0.1)
            train_time = time.time() - start_time
            peak_mem = mem_usage
        best_model = rs.best_estimator_
    else:
        # No hyperparameters to search: fit the model once while profiling time and memory
        start_time = time.time()
        mem_usage = memory_usage((model.fit, (X_train, y_train)), max_usage=True, interval=0.1)
        train_time = time.time() - start_time
        peak_mem = mem_usage
        best_model = model  # already fitted in place by the profiled call above
    return best_model, train_time, peak_mem

# --- K-Fold Cross Validation for Ensembles ---
def kfold_cv_ensemble(X, y, model, param_grid, ks=(3, 5, 7, 10)):
    results = {}

    for k in ks:
        print(f"\nRunning {k}-Fold Cross Validation")
        kf = KFold(n_splits=k, shuffle=True, random_state=42)

        metrics = {
            'Train_R2': [], 'Test_R2': [],
            'Train_RMSE': [], 'Test_RMSE': [],
            'Train_MAE': [], 'Test_MAE': [],
            'Train_MAPE': [], 'Test_MAPE': [],
            'Training_Time': [], 'Peak_Memory': []
        }

        for train_idx, test_idx in kf.split(X):
            X_train_fold, X_test_fold = X[train_idx], X[test_idx]
            y_train_fold, y_test_fold = y.iloc[train_idx], y.iloc[test_idx]

            best_model, train_time, peak_mem = randomized_search_with_profile(model, param_grid, X_train_fold, y_train_fold)

            y_train_pred = best_model.predict(X_train_fold)
            y_test_pred = best_model.predict(X_test_fold)

            metrics['Train_R2'].append(r2_score(y_train_fold, y_train_pred))
            metrics['Test_R2'].append(r2_score(y_test_fold, y_test_pred))
            metrics['Train_RMSE'].append(np.sqrt(mean_squared_error(y_train_fold, y_train_pred)))
            metrics['Test_RMSE'].append(np.sqrt(mean_squared_error(y_test_fold, y_test_pred)))
            metrics['Train_MAE'].append(mean_absolute_error(y_train_fold, y_train_pred))
            metrics['Test_MAE'].append(mean_absolute_error(y_test_fold, y_test_pred))
            metrics['Train_MAPE'].append(np.mean(np.abs((y_train_fold - y_train_pred) / y_train_fold)) * 100)
            metrics['Test_MAPE'].append(np.mean(np.abs((y_test_fold - y_test_pred) / y_test_fold)) * 100)
            metrics['Training_Time'].append(train_time)
            metrics['Peak_Memory'].append(peak_mem)

        # Summarize per k using confidence intervals
        summary = {}
        for key, values in metrics.items():
            vals = np.array(values)
            mean_val = vals.mean()
            std_val = vals.std(ddof=1)
            var_val = vals.var(ddof=1)
            ci_low, ci_high = 0, 0
            if len(vals) > 1:
                ci = stats.t.interval(0.95, len(vals)-1, loc=mean_val, scale=stats.sem(vals))
                ci_low, ci_high = ci
            summary[key] = (mean_val, std_val, var_val, (ci_low, ci_high))

        results[k] = summary

    # Print results
    print("\n=== Cross-validation Summary ===\n")
    for k, sumstats in results.items():
        print(f"k={k}: R2={sumstats['Test_R2'][0]:.3f}  RMSE={sumstats['Test_RMSE'][0]:.3f}  MAE={sumstats['Test_MAE'][0]:.3f}  "
              f"MAPE={sumstats['Test_MAPE'][0]:.3f}  Time={sumstats['Training_Time'][0]:.3f}s Mem={sumstats['Peak_Memory'][0]:.2f}MB")

    for metric in ['Test_R2', 'Test_RMSE', 'Test_MAE', 'Test_MAPE']:
        means = [results[k][metric][0] for k in results]
        stds = [results[k][metric][1] for k in results]
        vars_ = [results[k][metric][2] for k in results]
        cis = [results[k][metric][3] for k in results]

        overall_mean = np.mean(means)
        overall_std = np.mean(stds)
        overall_var = np.mean(vars_)
        overall_ci_low = min(ci[0] for ci in cis)
        overall_ci_high = max(ci[1] for ci in cis)

        print(f"\n{metric.replace('Test_', '')}: Mean={overall_mean:.3f}, Std={overall_std:.3f}, Var={overall_var:.5f}, "
              f"CI=({overall_ci_low:.3f}, {overall_ci_high:.3f})")

    # Find the best k based on mean Test R2
    mean_r2_per_k = {k: results[k]['Test_R2'][0] for k in results}
    best_k = max(mean_r2_per_k, key=mean_r2_per_k.get)
    best_r2 = mean_r2_per_k[best_k]
    print(f"\nBest k for Mean R2: {best_k} with Mean R2 = {best_r2:.6f}")

    return results


# === Example: Running cross-validation for all your ensemble models ===

results_all_models = {}

for name, config in ensemble_configs.items():
    print(f"\n\n***** Cross-validating {name} *****")
    model = config['model']
    param_grid = config['param_grid']

    res = kfold_cv_ensemble(X_scaled, y, model, param_grid, ks=[3,5,7,10])
    results_all_models[name] = res

# If needed, you can now process `results_all_models` further or save it.



***** Cross-validating Voting Regressor *****

Running 3-Fold Cross Validation


Parameters: { "use_label_encoder" } are not used.





Running 5-Fold Cross Validation






Running 7-Fold Cross Validation






Running 10-Fold Cross Validation




=== Cross-validation Summary ===

k=3: R2=0.693  RMSE=3.644  MAE=2.757  MAPE=2.477  Time=4.204s Mem=526.71MB
k=5: R2=0.697  RMSE=3.619  MAE=2.735  MAPE=2.458  Time=5.151s Mem=631.10MB
k=7: R2=0.697  RMSE=3.616  MAE=2.732  MAPE=2.454  Time=4.883s Mem=634.60MB
k=10: R2=0.697  RMSE=3.615  MAE=2.732  MAPE=2.455  Time=5.356s Mem=640.66MB

R2: Mean=0.696, Std=0.010, Var=0.00012, CI=(0.682, 0.711)

RMSE: Mean=3.623, Std=0.080, Var=0.00731, CI=(3.499, 3.788)

MAE: Mean=2.739, Std=0.055, Var=0.00323, CI=(2.637, 2.878)

MAPE: Mean=2.461, Std=0.044, Var=0.00213, CI=(2.364, 2.591)

Best k for Mean R2: 10 with Mean R2 = 0.697451


***** Cross-validating Bagging Regressor *****

Running 3-Fold Cross Validation





Running 5-Fold Cross Validation




Running the ensemble models individually

VOTING REGRESSOR

In [None]:
data.rename(columns={'male/female':'gender'}, inplace=True)
data['Date of determination'] = pd.to_datetime(data['Date of determination'])
data['date of birth'] = pd.to_datetime(data['date of birth'])
data = pd.concat([data, pd.get_dummies(data['gender'])], axis=1)
data = pd.concat([data, pd.get_dummies(data['breed'])], axis=1)
data['age_in_days'] = (data['Date of determination'] - data['date of birth']).dt.days
median_height = data[data['The height of a pig'] != 0]['The height of a pig'].median()
data['The height of a pig'] = data['The height of a pig'].replace(0, median_height)
drop_cols = ['Serial number', 'breed', 'gender', 'Date of determination', 'date of birth']
data = data.drop(columns=drop_cols)
main_features = [
    'Chest circumference of pig',
    'Abdominal circumference of pigs',
    'Waist circumference of pig',
    'Length of pig',
    'The height of a pig',
    'female',
    'male',
    'S21',
    'S23',
    'age_in_days'
]
iso = IsolationForest(contamination=0.05, random_state=42)
mask = iso.fit_predict(data[main_features]) == 1
data = data.loc[mask]
X = data[main_features].values
y = data['Weight measurement']
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)


# SETUP FOR THE VOTING REGRESSOR
hgb = HistGradientBoostingRegressor(random_state=42)
lgbm = LGBMRegressor(random_state=42, n_jobs=-1, verbose=-1)
cat = CatBoostRegressor(verbose=0, random_state=42)
xgb = XGBRegressor(random_state=42, eval_metric='rmse', use_label_encoder=False)
rf = RandomForestRegressor(random_state=42, n_jobs=-1)
voting_reg = VotingRegressor([
    ('hgb', hgb),
    ('lgbm', lgbm),
    ('cat', cat),
    ('xgb', xgb),
    ('rf', rf)
])

def mean_abs_pct_error(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def confidence_interval(data, confidence=0.95):
    import scipy.stats as stats
    a = np.array(data)
    n = len(a)
    mean_val = np.mean(a)
    if n > 1:
        se = np.std(a, ddof=1) / np.sqrt(n)
        h = se * stats.t.ppf((1 + confidence) / 2, n - 1)
    else:
        h = 0
    return mean_val, h, np.var(a) if n > 1 else 0, (mean_val - h, mean_val + h)

def kfold_eval(X, y, model, ks=[3,5,7,10], print_header="VotingRegressor"):
    results = {}
    print("\nThe Original Model")
    print(f"=== {print_header} ===")
    for k in ks:
        kf = KFold(n_splits=k, shuffle=True, random_state=42)
        scores_r2, scores_rmse, scores_mae, scores_mape = [], [], [], []
        train_times, peak_mems = [], []
        for train_idx, test_idx in kf.split(X):
            X_train, X_test = X[train_idx], X[test_idx]
            y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
            def train_model():
                model.fit(X_train, y_train)
            start = time.time()
            mem_usage = memory_usage((train_model, ()), interval=0.1, max_usage=True)
            elapsed = time.time() - start
            y_pred = model.predict(X_test)
            scores_r2.append(r2_score(y_test, y_pred))
            scores_rmse.append(np.sqrt(mean_squared_error(y_test, y_pred)))
            scores_mae.append(mean_absolute_error(y_test, y_pred))
            scores_mape.append(mean_abs_pct_error(y_test, y_pred))
            train_times.append(elapsed)
            peak_mems.append(mem_usage)
        mean_r2, h_r2, var_r2, ci_r2 = confidence_interval(scores_r2)
        mean_rmse, h_rmse, var_rmse, ci_rmse = confidence_interval(scores_rmse)
        mean_mae, h_mae, var_mae, ci_mae = confidence_interval(scores_mae)
        mean_mape, h_mape, var_mape, ci_mape = confidence_interval(scores_mape)
        mean_time = np.mean(train_times)
        mean_mem = np.mean(peak_mems)
        print(f"k={k}: R2={mean_r2:.3f}  RMSE={mean_rmse:.3f}  MAE={mean_mae:.3f} MAPE={mean_mape:.3f}  Time={mean_time:.3f}s Mem={mean_mem:.2f}MB")
        results[k] = {
            'r2': (mean_r2, h_r2, var_r2, ci_r2),
            'rmse': (mean_rmse, h_rmse, var_rmse, ci_rmse),
            'mae': (mean_mae, h_mae, var_mae, ci_mae),
            'mape': (mean_mape, h_mape, var_mape, ci_mape),
            'time': mean_time,
            'mem': mean_mem
        }
    metrics_to_aggregate = ['r2', 'rmse', 'mae', 'mape']
    for metric in metrics_to_aggregate:
        means = [results[k][metric][0] for k in results]
        # NOTE: index 1 holds the 95% CI half-width h returned by confidence_interval(),
        # not a standard deviation, so the value printed as "Std" is the mean half-width.
        stds  = [results[k][metric][1] for k in results]
        vars_ = [results[k][metric][2] for k in results]
        cis   = [results[k][metric][3] for k in results]
        overall_mean = np.mean(means)
        overall_std  = np.mean(stds)
        overall_var  = np.mean(vars_)
        # The overall interval is the envelope (min lower bound, max upper bound) across all k
        overall_ci_low = min(ci[0] for ci in cis)
        overall_ci_high = max(ci[1] for ci in cis)
        metric_name = metric.upper()
        print(f" {metric_name}: Mean={overall_mean:.3f}, Std={overall_std:.3f}, Var={overall_var:.5f}, CI=({overall_ci_low:.3f}, {overall_ci_high:.3f})")
    mean_r2_per_k = {k: results[k]['r2'][0] for k in results}
    best_k = max(mean_r2_per_k, key=mean_r2_per_k.get)
    best_r2 = mean_r2_per_k[best_k]
    print(f" Best k for Mean R2: {best_k} with Mean R2 = {best_r2}")


# RUN THE VOTING REGRESSOR EVALUATION
kfold_eval(X_scaled, y, voting_reg, ks=[3,5,7,10], print_header="VotingRegressor")


The Original Model
=== VotingRegressor ===


Parameters: { "use_label_encoder" } are not used.




k=3: R2=0.693  RMSE=3.644  MAE=2.757 MAPE=2.477  Time=3.556s Mem=610.45MB





k=5: R2=0.697  RMSE=3.619  MAE=2.735 MAPE=2.458  Time=4.158s Mem=688.98MB





k=7: R2=0.697  RMSE=3.616  MAE=2.732 MAPE=2.454  Time=4.172s Mem=692.85MB





k=10: R2=0.697  RMSE=3.615  MAE=2.732 MAPE=2.455  Time=4.571s Mem=730.83MB
 R2: Mean=0.696, Std=0.012, Var=0.00010, CI=(0.682, 0.711)
 RMSE: Mean=3.623, Std=0.094, Var=0.00620, CI=(3.499, 3.788)
 MAE: Mean=2.739, Std=0.069, Var=0.00269, CI=(2.637, 2.878)
 MAPE: Mean=2.461, Std=0.058, Var=0.00174, CI=(2.364, 2.591)
 Best k for Mean R2: 10 with Mean R2 = 0.6974510885275741




Voting without the Errors

In [None]:
# --- Data preprocessing ---
data.rename(columns={'male/female': 'gender'}, inplace=True)
data['Date of determination'] = pd.to_datetime(data['Date of determination'])
data['date of birth'] = pd.to_datetime(data['date of birth'])

data = pd.concat([data, pd.get_dummies(data['gender'])], axis=1)
data = pd.concat([data, pd.get_dummies(data['breed'])], axis=1)
data['age_in_days'] = (data['Date of determination'] - data['date of birth']).dt.days

median_height = data[data['The height of a pig'] != 0]['The height of a pig'].median()
data['The height of a pig'] = data['The height of a pig'].replace(0, median_height)

drop_cols = ['Serial number', 'breed', 'gender', 'Date of determination', 'date of birth']
data = data.drop(columns=drop_cols)

main_features = [
    'Chest circumference of pig',
    'Abdominal circumference of pigs',
    'Waist circumference of pig',
    'Length of pig',
    'The height of a pig',
    'female',
    'male',
    'S21',
    'S23',
    'age_in_days'
]

# --- Outlier Removal ---
iso = IsolationForest(contamination=0.05, random_state=42)
mask = iso.fit_predict(data[main_features]) == 1
data = data.loc[mask]

X = data[main_features].values
y = data['Weight measurement']

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)


# --- Define base regressors ---
hgb = HistGradientBoostingRegressor(random_state=42)
lgbm = LGBMRegressor(random_state=42, n_jobs=-1, verbose=-1)
cat = CatBoostRegressor(verbose=0, random_state=42)
xgb = XGBRegressor(random_state=42, eval_metric='rmse', use_label_encoder=False, n_jobs=-1)
rf = RandomForestRegressor(random_state=42, n_jobs=-1)

voting_reg = VotingRegressor(
    estimators=[
        ('hgb', hgb),
        ('lgbm', lgbm),
        ('cat', cat),
        ('xgb', xgb),
        ('rf', rf)
    ],
    n_jobs=-1
)

# --- Parameter grid ---
param_grid = {
    'hgb__max_iter': [100, 150],
    'hgb__max_depth': [None, 10, 20],

    'lgbm__n_estimators': [50, 100],
    'lgbm__max_depth': [-1, 10, 20],
    'lgbm__learning_rate': [0.05, 0.1],

    'cat__iterations': [100, 150],
    'cat__depth': [6, 8],

    'xgb__n_estimators': [50, 100],
    'xgb__max_depth': [3, 6],
    'xgb__learning_rate': [0.05, 0.1],

    'rf__n_estimators': [50, 100],
    'rf__max_depth': [None, 10, 20]
}

random_search = RandomizedSearchCV(
    voting_reg, param_distributions=param_grid,
    n_iter=15, cv=3, n_jobs=-1, scoring='r2',
    random_state=42, verbose=0
)

random_search.fit(X_scaled, y)
best_model = random_search.best_estimator_


# --- Metric functions same as before ---
def mean_abs_pct_error(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100


def confidence_interval(data, confidence=0.95):
    import scipy.stats as stats
    a = np.array(data)
    n = len(a)
    mean_val = np.mean(a)
    if n > 1:
        se = np.std(a, ddof=1) / np.sqrt(n)
        h = se * stats.t.ppf((1 + confidence) / 2, n - 1)
    else:
        h = 0
    return mean_val, h, np.var(a) if n > 1 else 0, (mean_val - h, mean_val + h)


def kfold_eval(X, y, model, ks=[3, 5, 7, 10], print_header="Voting Regressor with Boosters"):
    results = {}
    print("\nThe Original Model")
    print(f"=== {print_header} ===")
    for k in ks:
        kf = KFold(n_splits=k, shuffle=True, random_state=42)
        scores_r2, scores_rmse, scores_mae, scores_mape = [], [], [], []
        train_times, peak_mems = [], []
        for train_idx, test_idx in kf.split(X):
            X_train, X_test = X[train_idx], X[test_idx]
            y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

            def train_model():
                model.fit(X_train, y_train)

            start = time.time()
            mem_usage = memory_usage((train_model, ()), interval=0.1, max_usage=True)
            elapsed = time.time() - start

            y_pred = model.predict(X_test)
            scores_r2.append(r2_score(y_test, y_pred))
            scores_rmse.append(np.sqrt(mean_squared_error(y_test, y_pred)))
            scores_mae.append(mean_absolute_error(y_test, y_pred))
            scores_mape.append(mean_abs_pct_error(y_test, y_pred))
            train_times.append(elapsed)
            peak_mems.append(mem_usage)

        mean_r2, h_r2, var_r2, ci_r2 = confidence_interval(scores_r2)
        mean_rmse, h_rmse, var_rmse, ci_rmse = confidence_interval(scores_rmse)
        mean_mae, h_mae, var_mae, ci_mae = confidence_interval(scores_mae)
        mean_mape, h_mape, var_mape, ci_mape = confidence_interval(scores_mape)
        mean_time = np.mean(train_times)
        mean_mem = np.mean(peak_mems)

        print(f"k={k}: R2={mean_r2:.3f}  RMSE={mean_rmse:.3f}  MAE={mean_mae:.3f} "
              f"MAPE={mean_mape:.3f}  Time={mean_time:.3f}s Mem={mean_mem:.2f}MB")
        results[k] = {
            'r2': (mean_r2, h_r2, var_r2, ci_r2),
            'rmse': (mean_rmse, h_rmse, var_rmse, ci_rmse),
            'mae': (mean_mae, h_mae, var_mae, ci_mae),
            'mape': (mean_mape, h_mape, var_mape, ci_mape),
            'time': mean_time,
            'mem': mean_mem
        }

    # Aggregate stats
    metrics_to_aggregate = ['r2', 'rmse', 'mae', 'mape']
    for metric in metrics_to_aggregate:
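        means = [results[k][metric][0] for k in results]
        # NOTE: index 1 holds the 95% CI half-width h returned by confidence_interval(),
        # not a standard deviation, so the value printed as "Std" is the mean half-width.
        stds = [results[k][metric][1] for k in results]
        vars_ = [results[k][metric][2] for k in results]
        cis = [results[k][metric][3] for k in results]
        overall_mean = np.mean(means)
        overall_std = np.mean(stds)
        overall_var = np.mean(vars_)
        # The overall interval is the envelope (min lower bound, max upper bound) across all k
        overall_ci_low = min(ci[0] for ci in cis)
        overall_ci_high = max(ci[1] for ci in cis)
        metric_name = metric.upper()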
        print(f" {metric_name}: Mean={overall_mean:.3f}, Std={overall_std:.3f}, "
              f"Var={overall_var:.5f}, CI=({overall_ci_low:.3f}, {overall_ci_high:.3f})")

    # Best k by mean R2
    mean_r2_per_k = {k: results[k]['r2'][0] for k in results}
    best_k = max(mean_r2_per_k, key=mean_r2_per_k.get)
    best_r2 = mean_r2_per_k[best_k]
    print(f" Best k for Mean R2: {best_k} with Mean R2 = {best_r2:.3f}")


# --- Run evaluation ---
kfold_eval(X_scaled, y, best_model, ks=[3, 5, 7, 10], print_header="Voting Regressor with Boosters")


The Original Model
=== Voting Regressor with Boosters ===




k=3: R2=0.693  RMSE=3.643  MAE=2.766 MAPE=2.486  Time=1.352s Mem=546.21MB




k=5: R2=0.695  RMSE=3.633  MAE=2.756 MAPE=2.478  Time=1.815s Mem=554.60MB




k=7: R2=0.695  RMSE=3.631  MAE=2.756 MAPE=2.477  Time=1.687s Mem=557.99MB




k=10: R2=0.695  RMSE=3.627  MAE=2.755 MAPE=2.477  Time=1.756s Mem=558.35MB
 R2: Mean=0.695, Std=0.010, Var=0.00009, CI=(0.682, 0.707)
 RMSE: Mean=3.633, Std=0.088, Var=0.00584, CI=(3.511, 3.774)
 MAE: Mean=2.758, Std=0.064, Var=0.00272, CI=(2.659, 2.873)
 MAPE: Mean=2.479, Std=0.055, Var=0.00182, CI=(2.383, 2.590)
 Best k for Mean R2: 10 with Mean R2 = 0.695




BAGGING REGRESSOR WITH CROSS_VAL SCORE

In [None]:
data.rename(columns={'male/female': 'gender'}, inplace=True)
data['Date of determination'] = pd.to_datetime(data['Date of determination'])
data['date of birth'] = pd.to_datetime(data['date of birth'])
data = pd.concat([data, pd.get_dummies(data['gender'])], axis=1)
data = pd.concat([data, pd.get_dummies(data['breed'])], axis=1)
data['age_in_days'] = (data['Date of determination'] - data['date of birth']).dt.days
median_height = data[data['The height of a pig'] != 0]['The height of a pig'].median()
data['The height of a pig'] = data['The height of a pig'].replace(0, median_height)
drop_cols = ['Serial number', 'breed', 'gender', 'Date of determination', 'date of birth']
data = data.drop(columns=drop_cols)


main_features = [
    'Chest circumference of pig',
    'Abdominal circumference of pigs',
    'Waist circumference of pig',
    'Length of pig',
    'The height of a pig',
    'female',
    'male',
    'S21',
    'S23',
    'age_in_days'
]

# Outlier Removal
iso = IsolationForest(contamination=0.05, random_state=42)
mask = iso.fit_predict(data[main_features]) == 1
data = data.loc[mask]

X = data[main_features].values
y = data['Weight measurement']

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Model
base_rf = RandomForestRegressor(random_state=42)
bagging_reg = BaggingRegressor(
    estimator=base_rf,
    random_state=42,
    n_jobs=-1
)
param_grid = {
    'n_estimators': [10, 30, 50],
    'max_samples': [0.6, 0.8, 1.0],
    'max_features': [0.6, 0.8, 1.0]
}
random_search = RandomizedSearchCV(
    bagging_reg, param_distributions=param_grid, n_iter=15, cv=3,
    n_jobs=-1, scoring='r2', random_state=42, verbose=0
)
random_search.fit(X_scaled, y)
best_model = random_search.best_estimator_

def mean_abs_pct_error(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def confidence_interval(data, confidence=0.95):
    import scipy.stats as stats
    a = np.array(data)
    n = len(a)
    mean_val = np.mean(a)
    if n > 1:
        se = np.std(a, ddof=1) / np.sqrt(n)
        h = se * stats.t.ppf((1 + confidence) / 2, n - 1)
    else:
        h = 0
    return mean_val, h, np.var(a) if n > 1 else 0, (mean_val - h, mean_val + h)

# Cross-validation and Results Summary with memory profiling
def kfold_eval(X, y, model, ks=[3,5,7,10], print_header="Bagging RF"):
    results = {}
    print("\nThe Original Model")
    print(f"=== {print_header} ===")
    for k in ks:
        kf = KFold(n_splits=k, shuffle=True, random_state=42)
        scores_r2, scores_rmse, scores_mae, scores_mape = [], [], [], []
        train_times, peak_mems = [], []
        for train_idx, test_idx in kf.split(X):
            X_train, X_test = X[train_idx], X[test_idx]
            y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

            def train_model():
                model.fit(X_train, y_train)

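            # memory_usage runs train_model and reports its peak memory, so
            # elapsed also includes the profiler's sampling overhead.
            start = time.time()
            mem_usage = memory_usage((train_model, ()), interval=0.1, max_usage=True)
            elapsed = time.time() - start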
            y_pred = model.predict(X_test)
            scores_r2.append(r2_score(y_test, y_pred))
            scores_rmse.append(np.sqrt(mean_squared_error(y_test, y_pred)))
            scores_mae.append(mean_absolute_error(y_test, y_pred))
            scores_mape.append(mean_abs_pct_error(y_test, y_pred))
            train_times.append(elapsed)
            peak_mems.append(mem_usage)  # this returns a float in MB

        # Stats/summary
        mean_r2, h_r2, var_r2, ci_r2 = confidence_interval(scores_r2)
        mean_rmse, h_rmse, var_rmse, ci_rmse = confidence_interval(scores_rmse)
        mean_mae, h_mae, var_mae, ci_mae = confidence_interval(scores_mae)
        mean_mape, h_mape, var_mape, ci_mape = confidence_interval(scores_mape)
        mean_time = np.mean(train_times)
        mean_mem = np.mean(peak_mems)

        print(f"k={k}: R2={mean_r2:.3f}  RMSE={mean_rmse:.3f}  MAE={mean_mae:.3f} "
              f"MAPE={mean_mape:.3f}  Time={mean_time:.3f}s Mem={mean_mem:.2f}MB")
        results[k] = {
            'r2': (mean_r2, h_r2, var_r2, ci_r2),
            'rmse': (mean_rmse, h_rmse, var_rmse, ci_rmse),
            'mae': (mean_mae, h_mae, var_mae, ci_mae),
            'mape': (mean_mape, h_mape, var_mape, ci_mape),
            'time': mean_time,
            'mem': mean_mem
        }
    # Aggregate across all values of k
    metrics_to_aggregate = ['r2', 'rmse', 'mae', 'mape']
    for metric in metrics_to_aggregate:
        means = [results[k][metric][0] for k in results]
        half_widths = [results[k][metric][1] for k in results]  # CI half-widths, not standard deviations
        vars_ = [results[k][metric][2] for k in results]
        cis   = [results[k][metric][3] for k in results]
        overall_mean = np.mean(means)
        overall_std  = np.mean(half_widths)  # printed as "Std": mean 95% CI half-width across k
        overall_var  = np.mean(vars_)
        # Envelope of the per-k intervals: lowest lower bound to highest upper bound
        overall_ci_low = min(ci[0] for ci in cis)
        overall_ci_high = max(ci[1] for ci in cis)
        metric_name = metric.upper()  # 'r2' -> 'R2', 'rmse' -> 'RMSE', and so on
        print(f" {metric_name}: Mean={overall_mean:.3f}, Std={overall_std:.3f}, "
              f"Var={overall_var:.5f}, CI=({overall_ci_low:.3f}, {overall_ci_high:.3f})")
    # Best k by mean R2
    mean_r2_per_k = {k: results[k]['r2'][0] for k in results}
    best_k = max(mean_r2_per_k, key=mean_r2_per_k.get)
    best_r2 = mean_r2_per_k[best_k]
    print(f" Best k for Mean R2: {best_k} with Mean R2 = {best_r2}")

# Run it!
kfold_eval(X_scaled, y, best_model, ks=[3,5,7,10], print_header="Bagging RF")




The Original Model
=== Bagging RF ===
k=3: R2=0.675  RMSE=3.751  MAE=2.847 MAPE=2.563  Time=23.962s Mem=2423.68MB
k=5: R2=0.677  RMSE=3.734  MAE=2.833 MAPE=2.550  Time=25.695s Mem=2566.73MB
k=7: R2=0.678  RMSE=3.728  MAE=2.829 MAPE=2.546  Time=27.205s Mem=2589.45MB
k=10: R2=0.679  RMSE=3.722  MAE=2.827 MAPE=2.545  Time=28.529s Mem=2610.82MB
 R2: Mean=0.677, Std=0.011, Var=0.00010, CI=(0.663, 0.692)
 RMSE: Mean=3.734, Std=0.081, Var=0.00469, CI=(3.630, 3.872)
 MAE: Mean=2.834, Std=0.068, Var=0.00259, CI=(2.727, 2.968)
 MAPE: Mean=2.551, Std=0.059, Var=0.00173, CI=(2.444, 2.682)
 Best k for Mean R2: 10 with Mean R2 = 0.6791535298927858
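
The `confidence_interval` helper above builds a t-based interval from the per-fold scores, and the value printed as `Std` in the summary is the half-width of that interval rather than a standard deviation. The sketch below walks through the same arithmetic step by step. The five fold scores are hypothetical values chosen for illustration and are not taken from the runs above.

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold R2 scores (illustrative only, not from the notebook runs)
fold_scores = np.array([0.66, 0.68, 0.67, 0.69, 0.68])

n = len(fold_scores)
mean_score = fold_scores.mean()
sample_std = fold_scores.std(ddof=1)      # spread of the individual fold scores
std_error = sample_std / np.sqrt(n)       # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)     # two-sided 95% critical value
half_width = t_crit * std_error           # the quantity the helper returns as `h`

ci = (mean_score - half_width, mean_score + half_width)
print(f"mean={mean_score:.3f}  std={sample_std:.4f}  half_width={half_width:.4f}")
print(f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

With few folds the critical value is well above 1.96, so the half-width can be noticeably larger than the standard error; this is one reason intervals for k=3 tend to be wider than those for k=10.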


STACKING REGRESSOR USING K-FOLD CROSS-VALIDATION

In [None]:
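# This cell expects the raw `data` frame: the previous cell dropped the date, breed and
# gender columns, so reload the dataset before running it in the same session.
data.rename(columns={'male/female':'gender'}, inplace=True)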
data['Date of determination'] = pd.to_datetime(data['Date of determination'])
data['date of birth'] = pd.to_datetime(data['date of birth'])
data = pd.concat([data, pd.get_dummies(data['gender'])], axis=1)
data = pd.concat([data, pd.get_dummies(data['breed'])], axis=1)
data['age_in_days'] = (data['Date of determination'] - data['date of birth']).dt.days
median_height = data[data['The height of a pig'] != 0]['The height of a pig'].median()
data['The height of a pig'] = data['The height of a pig'].replace(0, median_height)
drop_cols = ['Serial number', 'breed', 'gender', 'Date of determination', 'date of birth']
data = data.drop(columns=drop_cols)
main_features = [
    'Chest circumference of pig',
    'Abdominal circumference of pigs',
    'Waist circumference of pig',
    'Length of pig',
    'The height of a pig',
    'female',
    'male',
    'S21',
    'S23',
    'age_in_days'
]
iso = IsolationForest(contamination=0.05, random_state=42)
mask = iso.fit_predict(data[main_features]) == 1
data = data.loc[mask]
X = data[main_features].values
y = data['Weight measurement']
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# -------------------------------
# STACKING REGRESSOR SETUP
# -------------------------------
hgb = HistGradientBoostingRegressor(random_state=42)
lgbm = LGBMRegressor(random_state=42, verbose=-1)
cat  = CatBoostRegressor(verbose=0, random_state=42)
xgb  = XGBRegressor(random_state=42, verbosity=0)
rf   = RandomForestRegressor(random_state=42)
base_estimators = [
    ('hgb', hgb),
    ('lgbm', lgbm),
    ('cat', cat),
    ('xgb', xgb),
    ('rf', rf)
]
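# The meta-learner is trained on out-of-fold predictions of the base estimators
# (StackingRegressor uses 5-fold CV internally by default).
meta_reg = HistGradientBoostingRegressor(random_state=42)
stacking_reg = StackingRegressor(
    estimators=base_estimators,
    final_estimator=meta_reg,
    n_jobs=-1,
    passthrough=False  # meta-learner sees only base predictions, not the original features
)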

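def mean_abs_pct_error(y_true, y_pred):
    """Mean absolute percentage error in percent; assumes y_true contains no zeros."""
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def confidence_interval(values, confidence=0.95):
    """Return (mean, CI half-width, variance, (ci_low, ci_high)) for a list of fold scores."""
    import scipy.stats as stats
    a = np.array(values)
    n = len(a)
    mean_val = np.mean(a)
    if n > 1:
        se = np.std(a, ddof=1) / np.sqrt(n)  # standard error of the mean
        h = se * stats.t.ppf((1 + confidence) / 2, n - 1)  # t-based half-width
        var = np.var(a)  # population variance (ddof=0), unlike the sample std above
    else:
        h = 0
        var = 0
    return mean_val, h, var, (mean_val - h, mean_val + h)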

def kfold_eval(X, y, model, ks=(3, 5, 7, 10), print_header="StackingRegressor"):
    """Run shuffled K-fold CV for each k in `ks`, printing per-k and aggregate metrics."""
    results = {}
    print("\nThe Original Model")
    print(f"=== {print_header} ===")
    for k in ks:
        kf = KFold(n_splits=k, shuffle=True, random_state=42)
        scores_r2, scores_rmse, scores_mae, scores_mape = [], [], [], []
        train_times, peak_mems = [], []
        for train_idx, test_idx in kf.split(X):
            X_train, X_test = X[train_idx], X[test_idx]
            y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
            def train_model():
                model.fit(X_train, y_train)
            # elapsed includes the memory profiler's sampling overhead
            start = time.time()
            mem_usage = memory_usage((train_model, ()), interval=0.1, max_usage=True)
            elapsed = time.time() - start
            y_pred = model.predict(X_test)
            scores_r2.append(r2_score(y_test, y_pred))
            scores_rmse.append(np.sqrt(mean_squared_error(y_test, y_pred)))
            scores_mae.append(mean_absolute_error(y_test, y_pred))
            scores_mape.append(mean_abs_pct_error(y_test, y_pred))
            train_times.append(elapsed)
            peak_mems.append(mem_usage)
        mean_r2, h_r2, var_r2, ci_r2 = confidence_interval(scores_r2)
        mean_rmse, h_rmse, var_rmse, ci_rmse = confidence_interval(scores_rmse)
        mean_mae, h_mae, var_mae, ci_mae = confidence_interval(scores_mae)
        mean_mape, h_mape, var_mape, ci_mape = confidence_interval(scores_mape)
        mean_time = np.mean(train_times)
        mean_mem = np.mean(peak_mems)
        print(f"k={k}: R2={mean_r2:.3f}  RMSE={mean_rmse:.3f}  MAE={mean_mae:.3f} MAPE={mean_mape:.3f}  Time={mean_time:.3f}s Mem={mean_mem:.2f}MB")
        results[k] = {
            'r2': (mean_r2, h_r2, var_r2, ci_r2),
            'rmse': (mean_rmse, h_rmse, var_rmse, ci_rmse),
            'mae': (mean_mae, h_mae, var_mae, ci_mae),
            'mape': (mean_mape, h_mape, var_mape, ci_mape),
            'time': mean_time,
            'mem': mean_mem
        }
    # Aggregate across all values of k
    metrics_to_aggregate = ['r2', 'rmse', 'mae', 'mape']
    for metric in metrics_to_aggregate:
        means = [results[k][metric][0] for k in results]
        half_widths = [results[k][metric][1] for k in results]  # CI half-widths, not standard deviations
        vars_ = [results[k][metric][2] for k in results]
        cis   = [results[k][metric][3] for k in results]
        overall_mean = np.mean(means)
        overall_std  = np.mean(half_widths)  # printed as "Std": mean 95% CI half-width across k
        overall_var  = np.mean(vars_)
        # Envelope of the per-k intervals: lowest lower bound to highest upper bound
        overall_ci_low = min(ci[0] for ci in cis)
        overall_ci_high = max(ci[1] for ci in cis)
        metric_name = metric.upper()  # 'r2' -> 'R2', 'rmse' -> 'RMSE', and so on
        print(f" {metric_name}: Mean={overall_mean:.3f}, Std={overall_std:.3f}, Var={overall_var:.5f}, CI=({overall_ci_low:.3f}, {overall_ci_high:.3f})")
    mean_r2_per_k = {k: results[k]['r2'][0] for k in results}
    best_k = max(mean_r2_per_k, key=mean_r2_per_k.get)
    best_r2 = mean_r2_per_k[best_k]
    print(f" Best k for Mean R2: {best_k} with Mean R2 = {best_r2}")

# -------------------------------
# RUN THE STACKED REGRESSOR EVALUATION
# -------------------------------
kfold_eval(X_scaled, y, stacking_reg, ks=[3,5,7,10], print_header="StackingRegressor")


The Original Model
=== StackingRegressor ===
k=3: R2=0.657  RMSE=3.853  MAE=2.950 MAPE=2.659  Time=21.851s Mem=733.49MB
k=5: R2=0.669  RMSE=3.780  MAE=2.893 MAPE=2.607  Time=20.135s Mem=830.23MB
k=7: R2=0.666  RMSE=3.801  MAE=2.918 MAPE=2.629  Time=20.532s Mem=867.04MB
k=10: R2=0.668  RMSE=3.787  MAE=2.894 MAPE=2.608  Time=21.501s Mem=882.91MB
 R2: Mean=0.665, Std=0.014, Var=0.00011, CI=(0.636, 0.685)
 RMSE: Mean=3.805, Std=0.114, Var=0.00738, CI=(3.658, 4.047)
 MAE: Mean=2.914, Std=0.093, Var=0.00386, CI=(2.760, 3.140)
 MAPE: Mean=2.626, Std=0.084, Var=0.00288, CI=(2.477, 2.841)
 Best k for Mean R2: 5 with Mean R2 = 0.6694995839491183
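
The evaluation above uses a hand-written `KFold` loop and fits `MinMaxScaler` once on the full dataset. A more compact alternative is scikit-learn's `cross_validate` combined with a pipeline, which refits the scaler on each training fold. The following is a minimal sketch of that pattern, not the notebook's method: the synthetic data from `make_regression` and the small random forest are stand-ins, assumed here only so the example runs quickly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the pig measurements (assumption: 10 numeric features)
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)

# The scaler lives inside the pipeline, so it is refit on every training fold
pipe = make_pipeline(MinMaxScaler(), RandomForestRegressor(n_estimators=20, random_state=42))

# scikit-learn error scorers are negated so that "higher is better" always holds
scoring = {
    "r2": "r2",
    "rmse": "neg_root_mean_squared_error",
    "mae": "neg_mean_absolute_error",
}

summary = {}
for k in (3, 5):
    cv = KFold(n_splits=k, shuffle=True, random_state=42)
    out = cross_validate(pipe, X_demo, y_demo, cv=cv, scoring=scoring)
    summary[k] = {
        "r2": out["test_r2"].mean(),
        "rmse": -out["test_rmse"].mean(),  # flip the sign back to a positive error
        "mae": -out["test_mae"].mean(),
    }
    print(f"k={k}: R2={summary[k]['r2']:.3f}  RMSE={summary[k]['rmse']:.3f}  MAE={summary[k]['mae']:.3f}")
```

To apply the same pattern to the stacking model, `stacking_reg` could take the place of the random forest in the pipeline, at the cost of a much longer run time.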


