This notebook trains several regression models and evaluates each with K-fold cross-validation.

For this dataset size (2000 samples), k = 5 is a reasonable choice:

1. Balanced trade-off between bias and variance
2. Enough samples per fold (~1600 train / 400 valid)
3. Commonly used and stable for datasets of this size
4. Efficient runtime (vs. 10-fold)
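
The fold sizes and runtime comparison in the list above follow from simple arithmetic. A minimal sketch, assuming the 2000 samples divide evenly across folds; `fold_sizes` is a hypothetical helper for illustration and is not used elsewhere in this notebook:

```python
def fold_sizes(n_samples, k):
    """Return (train_size, valid_size) per fold for an even k-fold split."""
    valid = n_samples // k
    return n_samples - valid, valid


# Compare the 5-fold setup used here with the 10-fold alternative
for k in (5, 10):
    train, valid = fold_sizes(2000, k)
    print(f"k={k}: {k} fits, {train} train / {valid} valid per fold")
```

With k = 10 each model is fit twice as often and each validation fold is half as large, which is the runtime trade-off point 4 refers to.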


Resource links:

Hyperparameter Tuning: https://neptune.ai/blog/hyperparameter-tuning-in-python-complete-guide

In [18]:
%pip install catboost
%pip install pytorch-tabnet
%pip install xgboost


Note: you may need to restart the kernel to use updated packages.
Collecting xgboost
  Downloading xgboost-3.0.2-py3-none-win_amd64.whl.metadata (2.1 kB)
Downloading xgboost-3.0.2-py3-none-win_amd64.whl (150.0 MB)

In [None]:
# Importing packages

from sklearn.model_selection import KFold
from sklearn.multioutput import MultiOutputRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_percentage_error
import lightgbm as lgb
import numpy as np
import pandas as pd




In [4]:

# Load the training and test datasets
train_df = pd.read_csv(r"C:\Users\tm0792.STUDENTS.007\OneDrive - UNT System\Competitions\Shell ai Hackathon\dataset\train.csv")
test_df = pd.read_csv(r"C:\Users\tm0792.STUDENTS.007\OneDrive - UNT System\Competitions\Shell ai Hackathon\dataset\test.csv")  


In [5]:
# Split the data: the first 55 columns are features, the remaining columns are targets

X = train_df.iloc[:, :55]
y = train_df.iloc[:, 55:]


# Set up 5-fold CV; the fixed random_state gives every model the same splits

kf = KFold(n_splits=5, shuffle=True, random_state=42)
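
Because `kf` is created with `shuffle=True` and an integer `random_state`, every call to `kf.split(...)` should yield the same folds, so the models below are compared on identical train/validation partitions. A small sketch that checks this on toy data (`X_demo` and `kf_demo` are stand-ins, not the competition data):

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(20).reshape(10, 2)  # toy stand-in for the feature matrix
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# Collect the validation indices from two independent passes over the splitter
first = [val_idx.tolist() for _, val_idx in kf_demo.split(X_demo)]
second = [val_idx.tolist() for _, val_idx in kf_demo.split(X_demo)]

same_splits = first == second
# Every row should appear in exactly one validation fold
covered = sorted(i for fold in first for i in fold)
print("identical splits:", same_splits)
print("each row validated once:", covered == list(range(10)))
```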


In [7]:
# Linear Regression with 5-Fold CV

lr_mape_scores = []

for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

    model = MultiOutputRegressor(LinearRegression())
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)

    mape = mean_absolute_percentage_error(y_val, y_pred)
    lr_mape_scores.append(mape)
    print(f"Linear Regression - Fold {fold+1} MAPE: {mape:.4f}")

print(f"\nLinear Regression - Average MAPE: {np.mean(lr_mape_scores):.4f}")


Linear Regression - Fold 1 MAPE: 3.2206
Linear Regression - Fold 2 MAPE: 2.5862
Linear Regression - Fold 3 MAPE: 2.5361
Linear Regression - Fold 4 MAPE: 1.8997
Linear Regression - Fold 5 MAPE: 1.5331

Linear Regression - Average MAPE: 2.3552
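
A note on reading these scores: scikit-learn's `mean_absolute_percentage_error` returns a fraction rather than a percentage, so a value of 2.3552 corresponds to roughly 236%. Scores this large usually suggest that some targets are close to zero, where a small absolute error becomes a large relative one. The sketch below uses made-up numbers to show both effects; it is an illustration, not a diagnosis of this dataset:

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Two toy targets: one of ordinary size, one close to zero
y_true = np.array([1.0, 0.01])
y_pred = np.array([1.1, 0.02])

# MAPE written out by hand: mean of |error| / |true value|
manual = np.mean(np.abs(y_true - y_pred) / np.abs(y_true))
sk = mean_absolute_percentage_error(y_true, y_pred)

print(f"manual={manual:.4f}, sklearn={sk:.4f}")
```

The first sample contributes a relative error of 0.1 and the second contributes 1.0, even though its absolute error is ten times smaller, so the near-zero target dominates the average.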


In [8]:
# Random Forest with 5-Fold CV

rf_mape_scores = []

for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

    model = MultiOutputRegressor(RandomForestRegressor(n_estimators=100, random_state=42))
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)

    mape = mean_absolute_percentage_error(y_val, y_pred)
    rf_mape_scores.append(mape)
    print(f"Random Forest - Fold {fold+1} MAPE: {mape:.4f}")

print(f"\nRandom Forest - Average MAPE: {np.mean(rf_mape_scores):.4f}")


Random Forest - Fold 1 MAPE: 3.3093
Random Forest - Fold 2 MAPE: 1.4264
Random Forest - Fold 3 MAPE: 2.3522
Random Forest - Fold 4 MAPE: 1.5515
Random Forest - Fold 5 MAPE: 1.3220

Random Forest - Average MAPE: 1.9923
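
The per-fold loop used in the cells above can also be written with `cross_val_score` and scikit-learn's built-in `"neg_mean_absolute_percentage_error"` scorer. The sketch below runs on synthetic data (`X_demo`, `y_demo`, and `kf_demo` are illustrative names); it also relies on the fact that `LinearRegression` accepts 2D targets directly, so the `MultiOutputRegressor` wrapper is optional for that model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))   # synthetic features
W = rng.normal(size=(5, 3))
y_demo = X_demo @ W + 10.0           # three noise-free linear targets

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# The scorer returns negative MAPE (higher is better), so flip the sign
scores = -cross_val_score(
    LinearRegression(),
    X_demo,
    y_demo,
    scoring="neg_mean_absolute_percentage_error",
    cv=kf_demo,
)
print("per-fold MAPE:", scores)
print("average MAPE:", scores.mean())
```

Passing the same `KFold` object as `cv` keeps the folds identical to those of the explicit loop.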


In [9]:
# LightGBM with 5-Fold CV

lgb_mape_scores = []

for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

    model = MultiOutputRegressor(lgb.LGBMRegressor(n_estimators=100, random_state=42))
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)

    mape = mean_absolute_percentage_error(y_val, y_pred)
    lgb_mape_scores.append(mape)
    print(f"LightGBM - Fold {fold+1} MAPE: {mape:.4f}")

print(f"\nLightGBM - Average MAPE: {np.mean(lgb_mape_scores):.4f}")


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000972 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12981
[LightGBM] [Info] Number of data points in the train set: 1600, number of used features: 55
[LightGBM] [Info] Start training from score -0.007867

CatBoost parameters tried so far:
verbose=0, iterations=500, learning_rate=0.05, random_seed=42 --> Average MAPE: 0.8586
verbose=0, iterations=800, learning_rate=0.05, random_seed=42 --> Average MAPE: 0.8183


In [21]:
# CatBoost Regressor (wrapped in MultiOutputRegressor) with 5-Fold CV

from catboost import CatBoostRegressor

cat_mape_scores = []

for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

    model = MultiOutputRegressor(CatBoostRegressor(verbose=0, iterations=800, learning_rate=0.05, random_seed=42))
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)

    mape = mean_absolute_percentage_error(y_val, y_pred)
    cat_mape_scores.append(mape)
    print(f"CatBoost - Fold {fold+1} MAPE: {mape:.4f}")

print(f"\nCatBoost - Average MAPE: {np.mean(cat_mape_scores):.4f}")


CatBoost - Fold 1 MAPE: 0.9051
CatBoost - Fold 2 MAPE: 0.6674
CatBoost - Fold 3 MAPE: 1.1052
CatBoost - Fold 4 MAPE: 0.7940
CatBoost - Fold 5 MAPE: 0.6201

CatBoost - Average MAPE: 0.8183


In [20]:
from catboost import CatBoostRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, mean_absolute_percentage_error

# Custom scoring function for MAPE (lower is better)
def mape_scorer(y_true, y_pred):
    return mean_absolute_percentage_error(y_true, y_pred)

neg_mape_scorer = make_scorer(mape_scorer, greater_is_better=False)

# Base model
base_model = MultiOutputRegressor(CatBoostRegressor(verbose=0, random_seed=42))

# Hyperparameter grid
param_grid = {
    'estimator__learning_rate': [0.01, 0.05, 0.1],
    'estimator__iterations': [300, 500, 800]
}

# Grid search with 5-fold cross-validation
grid_search = GridSearchCV(
    estimator=base_model,
    param_grid=param_grid,
    scoring=neg_mape_scorer,
    cv=5,
    verbose=1,
    n_jobs=-1
)

# Fit grid search
grid_search.fit(X, y)

# Best parameters and score
print("Best Parameters:", grid_search.best_params_)
print("Best Average MAPE (negative):", grid_search.best_score_)
print("Best Average MAPE (positive):", -grid_search.best_score_)


Fitting 5 folds for each of 9 candidates, totalling 45 fits
Best Parameters: {'estimator__iterations': 800, 'estimator__learning_rate': 0.05}
Best Average MAPE (negative): -0.8375409204021406
Best Average MAPE (positive): 0.8375409204021406
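
The grid search reports 0.8375 for the same parameters that scored 0.8183 in the manual CatBoost loop. One likely reason is that `cv=5` makes `GridSearchCV` use an unshuffled `KFold` for regressors, so its folds differ from those of the shuffled `kf` object. A sketch of passing a shuffled splitter and the built-in scorer instead, shown on synthetic data with `Ridge` as a fast stand-in for CatBoost (`X_demo`, `y_demo`, `kf_demo`, and the `alpha` grid are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
noise = rng.normal(scale=0.1, size=(100, 2))
y_demo = X_demo @ rng.normal(size=(4, 2)) + 10.0 + noise

# Reuse one shuffled splitter so the search sees the same folds as a manual loop
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    estimator=MultiOutputRegressor(Ridge()),
    # The "estimator__" prefix reaches the model inside MultiOutputRegressor
    param_grid={"estimator__alpha": [0.1, 1.0, 10.0]},
    scoring="neg_mean_absolute_percentage_error",
    cv=kf_demo,
)
search.fit(X_demo, y_demo)

best_mape = -search.best_score_  # flip the sign back to a positive MAPE
print("Best Parameters:", search.best_params_)
print("Best Average MAPE:", best_mape)
```

In the notebook this would amount to replacing `cv=5` with `cv=kf`; the scores would then be directly comparable with the manual loops.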


In [15]:
# TabNet Regressor using 5-fold

from pytorch_tabnet.tab_model import TabNetRegressor
from sklearn.base import BaseEstimator, RegressorMixin
import torch

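class MultiTabNet(BaseEstimator, RegressorMixin):
    """Fits one single-output TabNetRegressor per target column."""

    def __init__(self, input_dim, output_dim):
        # Store constructor arguments unchanged so get_params()/clone() work
        self.input_dim = input_dim
        self.output_dim = output_dim

    def fit(self, X, Y):
        # Build fresh models on every fit so refitting never reuses trained state
        self.models_ = [
            TabNetRegressor(input_dim=self.input_dim, output_dim=1, verbose=0)
            for _ in range(self.output_dim)
        ]
        for i, model in enumerate(self.models_):
            # TabNet expects NumPy arrays and a 2D target of shape (n_samples, 1)
            model.fit(X.values, Y.iloc[:, i].values.reshape(-1, 1), max_epochs=100)
        return self

    def predict(self, X):
        preds = [model.predict(X.values).ravel() for model in self.models_]
        return np.stack(preds, axis=1)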

tabnet_mape_scores = []

for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

    model = MultiTabNet(input_dim=X.shape[1], output_dim=y.shape[1])
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)

    mape = mean_absolute_percentage_error(y_val, y_pred)
    tabnet_mape_scores.append(mape)
    print(f"TabNet - Fold {fold+1} MAPE: {mape:.4f}")

print(f"\nTabNet - Average MAPE: {np.mean(tabnet_mape_scores):.4f}")




TabNet - Fold 1 MAPE: 4.3672
TabNet - Fold 2 MAPE: 2.6047
TabNet - Fold 3 MAPE: 3.0359
TabNet - Fold 4 MAPE: 2.7833
TabNet - Fold 5 MAPE: 1.8744

TabNet - Average MAPE: 2.9331


In [16]:
# Multi-Layer Perceptron (MLP) with PyTorch

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

class MLP(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(MLP, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, output_dim)
        )

    def forward(self, x):
        return self.model(x)

mlp_mape_scores = []

for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    X_train, X_val = torch.tensor(X.iloc[train_idx].values, dtype=torch.float32), torch.tensor(X.iloc[val_idx].values, dtype=torch.float32)
    y_train, y_val = torch.tensor(y.iloc[train_idx].values, dtype=torch.float32), torch.tensor(y.iloc[val_idx].values, dtype=torch.float32)

    model = MLP(X.shape[1], y.shape[1])
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    loss_fn = nn.L1Loss()

    train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=64, shuffle=True)

    model.train()
    for epoch in range(50):
        for xb, yb in train_loader:
            optimizer.zero_grad()
            pred = model(xb)
            loss = loss_fn(pred, yb)
            loss.backward()
            optimizer.step()

    model.eval()
    with torch.no_grad():
        y_pred = model(X_val).numpy()
    mape = mean_absolute_percentage_error(y_val, y_pred)
    mlp_mape_scores.append(mape)
    print(f"MLP - Fold {fold+1} MAPE: {mape:.4f}")

print(f"\nMLP - Average MAPE: {np.mean(mlp_mape_scores):.4f}")


MLP - Fold 1 MAPE: 8.6005
MLP - Fold 2 MAPE: 2.4126
MLP - Fold 3 MAPE: 3.1793
MLP - Fold 4 MAPE: 3.3021
MLP - Fold 5 MAPE: 2.3370

MLP - Average MAPE: 3.9663
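

To compare the models side by side, the average MAPE values printed in the cells above can be collected and sorted. The numbers below are copied from the recorded outputs; LightGBM and the stacking model are left out because their average scores do not appear in the recorded output:

```python
# Average 5-fold MAPE per model, copied from the cell outputs in this notebook
avg_mape = {
    "Linear Regression": 2.3552,
    "Random Forest": 1.9923,
    "CatBoost": 0.8183,
    "TabNet": 2.9331,
    "MLP": 3.9663,
}

# Lower MAPE is better, so sort ascending
ranking = sorted(avg_mape.items(), key=lambda item: item[1])
for name, score in ranking:
    print(f"{name:<18} {score:.4f}")
```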


In [19]:
# Stacking Regressor (LightGBM + XGBoost + RandomForest)

from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from xgboost import XGBRegressor
import lightgbm as lgb

stacking_mape_scores = []

base_learners = [
    ('lgb', lgb.LGBMRegressor(n_estimators=100, random_state=42)),
    ('xgb', XGBRegressor(n_estimators=100, random_state=42, verbosity=0)),
    ('rf', RandomForestRegressor(n_estimators=100, random_state=42)),
]

for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

    model = MultiOutputRegressor(StackingRegressor(estimators=base_learners, final_estimator=lgb.LGBMRegressor()))
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)

    mape = mean_absolute_percentage_error(y_val, y_pred)
    stacking_mape_scores.append(mape)
    print(f"Stacking - Fold {fold+1} MAPE: {mape:.4f}")

print(f"\nStacking - Average MAPE: {np.mean(stacking_mape_scores):.4f}")


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000956 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12981
[LightGBM] [Info] Number of data points in the train set: 1600, number of used features: 55
[LightGBM] [Info] Start training from score -0.007867



Stacking - Fold 1 MAPE: 2.6746

KeyboardInterrupt: 