## Modelling experiments guidelines:

1. The problem provided to us is a `Binary Classification problem`. The raw dataset is already pre-processed and saved. We will work with that processed dataset here for training different models and performing inference.

2. Multiple models will be tried out here, ranging from baselines such as Logistic Regression, KNN and SVM, to tree-based ensembles like XGBoost, Random Forest and CatBoost, to Neural Networks and probabilistic models like Gaussian Processes (GP).

3. One important thing to note is that the problem statement asks us to train the model on the entire dataset and save it. However, since no separate test/unseen data is available right now, we will split the available data into a training and a test set, train models only on the training set, and evaluate them on the held-out test set. Later, the saved models will be retrained on the entire dataset, not on any split.

4. On this split, we will choose and report the best model. However, the final models saved in the `Models/` folder will be trained on the entire dataset.
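
The evaluate-then-refit workflow described in points 3 and 4 can be sketched as follows. This is a minimal illustration under assumptions, not the notebook's actual saving code: the data is synthetic, a temporary folder stands in for `Models/`, and the file name is hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed dataset (hypothetical data).
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Step 1: hold out 20% to estimate performance on unseen data.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=42)
candidate = LogisticRegression(max_iter=1000)
candidate.fit(X_tr, y_tr)
holdout_accuracy = accuracy_score(y_te, candidate.predict(X_te))

# Step 2: refit a fresh copy with the same hyperparameters on ALL rows.
final_model = clone(candidate).fit(X, y)

# Step 3: persist the final model (a temporary folder stands in for `Models/`).
models_dir = Path(tempfile.mkdtemp()) / "Models"
models_dir.mkdir(parents=True, exist_ok=True)
model_path = models_dir / "logistic_regression.joblib"  # hypothetical file name
joblib.dump(final_model, model_path)
```

The held-out accuracy is the number to report; the model refit on all rows has no untouched data left to be scored on.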

In [1]:
import os
from tqdm.auto import tqdm
import numpy as np
import pandas as pd
from pathlib import Path
from dotenv import load_dotenv



#   Load the Data path
env_path = Path(os.getcwd()).parent / 'Config' / '.env'
load_dotenv(env_path)

PREPROCESSED_PATH = os.getenv('PREPROCESSED_PATH')


#   Load the unprocessed data and the 4 processed datasets. They will be used for training different models as per convenience
unprocessed_df = pd.read_csv(os.path.join(PREPROCESSED_PATH, 'unprocessed_data.csv'))

final_df_qt = pd.read_csv(os.path.join(PREPROCESSED_PATH, 'final_df.csv'))
final_df_not_qt = pd.read_csv(os.path.join(PREPROCESSED_PATH, 'final_df_not_qt.csv'))

qt_df_with_corr = pd.read_csv(os.path.join(PREPROCESSED_PATH, 'final_df_with_corr.csv'))
not_qt_with_corr = pd.read_csv(os.path.join(PREPROCESSED_PATH, 'final_df_not_qt_with_corr.csv'))


print(final_df_qt.shape, final_df_not_qt.shape, qt_df_with_corr.shape, not_qt_with_corr.shape)



(2188, 101) (2188, 101) (2188, 103) (2188, 103)


In [2]:
#   The target variable is well-balanced. No need for any additional operations
print(final_df_qt['target'].value_counts())

target
1    1117
0    1071
Name: count, dtype: int64
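
As a quick check on what "well-balanced" implies, the class counts above give the accuracy of a trivial classifier that always predicts the majority class. This is a small arithmetic sketch with illustrative variable names; it gives a floor against which the model accuracies further down can be read.

```python
# Class counts copied from the value_counts() output above.
counts = {1: 1117, 0: 1071}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the most frequent class.
majority_baseline = max(shares.values())

print(total)                        # 2188
print(round(majority_baseline, 4))  # 0.5105
```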


To begin with, we will start with baseline models such as Logistic Regression. For these models, the quantile-transformed datasets `final_df_qt` and `qt_df_with_corr` are the best fit. We will run all our experiments with `final_df_qt` first and later repeat them with `qt_df_with_corr`, keeping the same code.
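
This notebook does not show how the quantile transform was applied during pre-processing. The following is therefore only a generic, rank-based sketch of the idea: map each value to its rank, then to the matching standard-normal quantile. The helper name and the skewed sample are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def quantile_to_normal(x):
    """Map values to standard-normal quantiles via their ranks (illustrative helper)."""
    ranks = np.argsort(np.argsort(x)) + 1          # ranks 1..n (assumes no ties)
    uniform = ranks / (len(x) + 1)                 # strictly inside (0, 1)
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(u) for u in uniform])

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=3.0, size=1000)     # hypothetical skewed feature
transformed = quantile_to_normal(skewed)

# The ordering of the samples is preserved; only the spacing changes.
order_preserved = np.array_equal(np.argsort(skewed), np.argsort(transformed))
```

The transformed feature is centred and roughly unit-scaled whatever the original skew was, which is the property linear and distance-based baselines benefit from.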

### Train-test Split

In [3]:
from sklearn.model_selection import train_test_split

df = final_df_qt.copy()
X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1), df['target'], test_size=0.20, shuffle=True, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(1750, 100) (438, 100) (1750,) (438,)


### Let's automate the model fitting, training and testing part

In [4]:
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score

In [5]:
def model_fit_and_evaluate(X_train, X_test, y_train, y_test, model):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    try:
        # Not every estimator exposes predict_proba (e.g. SVC without probability=True)
        y_prob = model.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, y_prob)
    except Exception:
        auc = None

    print(f"Model: {model.__class__.__name__}")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred, zero_division=0))
    print("Recall:", recall_score(y_test, y_pred, zero_division=0))
    print("F1 Score:", f1_score(y_test, y_pred, zero_division=0))
    if auc is not None:
        print("ROC AUC:", auc)
    print('-' * 30)
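
Unlike the other four metrics, ROC AUC is computed from the predicted probabilities rather than from the hard labels. One way to read it is as the fraction of (positive, negative) pairs in which the positive sample gets the higher score. The helper below is an illustrative re-implementation of that definition on toy data, not a replacement for `roc_auc_score`.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)  # 0.75
```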

#### Train and evaluate on some baseline models

In [6]:
#   Seems like SVC(kernel='poly', degree=2) and Logistic Regression are the top performers (~71-72% accuracy)

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

models = [
    LogisticRegression(max_iter=1000),
    SVC(kernel='rbf', probability=True),             
    SVC(kernel='poly', degree=2, probability=True),
    SVC(kernel='poly', degree=3, probability=True),  
    SVC(kernel='poly', degree=4, probability=True),  
    KNeighborsClassifier()
]

# Run evaluation for all models
for model in models:
    model_fit_and_evaluate(X_train, X_test, y_train, y_test, model)

Model: LogisticRegression
Accuracy: 0.7123287671232876
Precision: 0.6991150442477876
Recall: 0.7314814814814815
F1 Score: 0.7149321266968326
ROC AUC: 0.7816358024691359
------------------------------
Model: SVC
Accuracy: 0.7123287671232876
Precision: 0.6939655172413793
Recall: 0.7453703703703703
F1 Score: 0.71875
ROC AUC: 0.7946800967634301
------------------------------
Model: SVC
Accuracy: 0.7214611872146118
Precision: 0.7117117117117117
Recall: 0.7314814814814815
F1 Score: 0.7214611872146118
ROC AUC: 0.7832520020020021
------------------------------
Model: SVC
Accuracy: 0.6894977168949772
Precision: 0.6769911504424779
Recall: 0.7083333333333334
F1 Score: 0.6923076923076923
ROC AUC: 0.7683308308308308
------------------------------
Model: SVC
Accuracy: 0.682648401826484
Precision: 0.672645739910314
Recall: 0.6944444444444444
F1 Score: 0.683371298405467
ROC AUC: 0.76073990657324
------------------------------
Model: KNeighborsClassifier
Accuracy: 0.6552511415525114
Precision: 0.620817
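
Accuracy, precision, recall and F1 all derive from the same four confusion-matrix counts. The notebook does not print those counts. The values below are back-calculated so that they reproduce the Logistic Regression scores reported above, and should be read as an illustration rather than as notebook output.

```python
# Confusion counts back-calculated from the reported Logistic Regression metrics
# (438 test rows); they are an inference, not notebook output.
tp, fp, fn, tn = 158, 68, 58, 154

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
# 0.7123 0.6991 0.7315 0.7149
```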

#### Check whether our pre-processing actually added any value

In [7]:
df = unprocessed_df.copy()
X_train_u, X_test_u, y_train_u, y_test_u = train_test_split(df.drop('target', axis=1), df['target'], test_size=0.20, shuffle=True, random_state=42)
print(X_train_u.shape, X_test_u.shape, y_train_u.shape, y_test_u.shape)

(1750, 48) (438, 48) (1750,) (438,)


In [8]:
models = [
    LogisticRegression(max_iter=1000),
    SVC(kernel='rbf', probability=True),             
    SVC(kernel='poly', degree=2, probability=True),
    SVC(kernel='poly', degree=3, probability=True),  
    SVC(kernel='poly', degree=4, probability=True),  
    KNeighborsClassifier()
]

for model in models:
    model_fit_and_evaluate(X_train_u, X_test_u, y_train_u, y_test_u, model)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=1000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Model: LogisticRegression
Accuracy: 0.7146118721461188
Precision: 0.6842105263157895
Recall: 0.7824074074074074
F1 Score: 0.7300215982721382
ROC AUC: 0.7743576910243577
------------------------------
Model: SVC
Accuracy: 0.684931506849315
Precision: 0.6433823529411765
Recall: 0.8101851851851852
F1 Score: 0.7172131147540983
ROC AUC: 0.7446404738071405
------------------------------
Model: SVC
Accuracy: 0.6164383561643836
Precision: 0.5710059171597633
Recall: 0.8935185185185185
F1 Score: 0.6967509025270758
ROC AUC: 0.6946634134134133
------------------------------
Model: SVC
Accuracy: 0.591324200913242
Precision: 0.5506849315068493
Recall: 0.9305555555555556
F1 Score: 0.6919104991394148
ROC AUC: 0.6536953620286955
------------------------------
Model: SVC
Accuracy: 0.5662100456621004
Precision: 0.5347593582887701
Recall: 0.9259259259259259
F1 Score: 0.6779661016949152
ROC AUC: 0.6158658658658659
------------------------------
Model: KNeighborsClassifier
Accuracy: 0.6027397260273972
Preci

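**Comparing the two runs, the feature transformation clearly helps the SVC models: the RBF kernel rises from about 68% to 71% accuracy and the degree-2 polynomial kernel from about 62% to 72%, with ROC AUC improving in both cases. Logistic Regression stays at roughly 71% accuracy either way, but on the processed data it converges within `max_iter=1000` without the warning shown above.**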
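
One plausible reason for the SVC gap is feature scale: kernel SVMs and KNN work on distances between rows, so a feature with a large numeric range can dominate all others. The sketch below uses two hypothetical features (it does not inspect the actual unprocessed columns) to show how standardization evens out each column's contribution to the squared distance.

```python
import numpy as np

def distance_share(X):
    """Share of the total squared pairwise distance contributed by each column."""
    diffs = X[:, None, :] - X[None, :, :]
    per_column = (diffs ** 2).sum(axis=(0, 1))
    return per_column / per_column.sum()

rng = np.random.default_rng(0)
# Two hypothetical features on very different scales.
X_raw = np.column_stack([rng.uniform(0, 1, 200), rng.uniform(0, 5000, 200)])
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

raw_share = distance_share(X_raw)   # the second column dominates almost entirely
std_share = distance_share(X_std)   # both columns contribute equally
```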

#### Let's try some combined (polynomial) features to see if they can improve the baselines (~71%)

In [9]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

def model_fit_and_evaluate(X_train, X_test, y_train, y_test, model, use_poly=False, degree=2):
    steps = []

    if use_poly:
        steps.append(('poly', PolynomialFeatures(degree=degree)))
    
    steps.append(('model', model))
    pipeline = Pipeline(steps)
    
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)

    try:
        # Not every estimator exposes predict_proba (e.g. SVC without probability=True)
        y_prob = pipeline.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, y_prob)
    except Exception:
        auc = None

    result =  {
        'Model': model.__class__.__name__ + (f' (poly deg={degree})' if use_poly else ''),
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred, zero_division=0),
        'Recall': recall_score(y_test, y_pred, zero_division=0),
        'F1 Score': f1_score(y_test, y_pred, zero_division=0),
        'ROC AUC': auc
    }

    if isinstance(model, SVC):
        result.update({
            'kernel': model.kernel,
            'degree': model.degree,
        })

    return result
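
The number of columns produced by `PolynomialFeatures` grows quickly with the degree. With the default settings (bias column included, all powers and interactions kept), the output has C(n + d, d) columns for n inputs and degree d. The helper name below is illustrative.

```python
from math import comb

def n_polynomial_features(n_inputs, degree):
    """Number of monomials of total degree <= `degree` in `n_inputs` variables (bias included)."""
    return comb(n_inputs + degree, degree)

print(n_polynomial_features(100, 2))  # 5151
print(n_polynomial_features(100, 3))  # 176851
```

With only 1,750 training rows, a degree-3 expansion would give roughly a hundred times more columns than samples.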

In [10]:
#   Explicit degree-2 polynomial features did not help: both linear models drop to ~65-66% accuracy.
#   The best baseline is still SVC with a degree-2 polynomial kernel, at ~72%

results = []

linear_models = [
    LogisticRegression(max_iter=1000),
    SVC(kernel='linear', probability=True)
]

nonlinear_models = [
    SVC(kernel='rbf', probability=True),
    SVC(kernel='poly', degree=2, probability=True),
    SVC(kernel='poly', degree=3, probability=True),
    SVC(kernel='poly', degree=4, probability=True),
    KNeighborsClassifier()
]

# Linear models with and without poly
print('Starting with Linear models....')
for model in tqdm(linear_models):
    results.append(model_fit_and_evaluate(X_train, X_test, y_train, y_test, model, use_poly=False))     #   ~ 100 features
    results.append(model_fit_and_evaluate(X_train, X_test, y_train, y_test, model, use_poly=True, degree=2))  # 5151 features (incl. bias column)
    # results.append(model_fit_and_evaluate(X_train, X_test, y_train, y_test, model, use_poly=True, degree=3))


# Non-linear models without poly
for model in tqdm(nonlinear_models):
    results.append(model_fit_and_evaluate(X_train, X_test, y_train, y_test, model, use_poly=False))

df_results = pd.DataFrame(results)
df_results.sort_values(by='Accuracy', ascending=False, inplace=True)

df_results

Starting with Linear models....


100%|██████████| 2/2 [02:28<00:00, 74.18s/it]
100%|██████████| 5/5 [00:12<00:00,  2.49s/it]


| | Model | Accuracy | Precision | Recall | F1 Score | ROC AUC | kernel | degree |
|---|---|---|---|---|---|---|---|---|
| 5 | SVC | 0.721461 | 0.711712 | 0.731481 | 0.721461 | 0.783231 | poly | 2.0 |
| 2 | SVC | 0.714612 | 0.702222 | 0.731481 | 0.716553 | 0.780197 | linear | 3.0 |
| 0 | LogisticRegression | 0.712329 | 0.699115 | 0.731481 | 0.714932 | 0.781636 | | |
| 4 | SVC | 0.712329 | 0.693966 | 0.74537 | 0.71875 | 0.794753 | rbf | 3.0 |
| 6 | SVC | 0.689498 | 0.676991 | 0.708333 | 0.692308 | 0.768352 | poly | 3.0 |
| 7 | SVC | 0.682648 | 0.672646 | 0.694444 | 0.683371 | 0.76074 | poly | 4.0 |
| 1 | LogisticRegression (poly deg=2) | 0.657534 | 0.646018 | 0.675926 | 0.660633 | 0.697823 | | |
| 8 | KNeighborsClassifier | 0.655251 | 0.620818 | 0.773148 | 0.68866 | 0.713536 | | |
| 3 | SVC (poly deg=2) | 0.646119 | 0.639269 | 0.648148 | 0.643678 | 0.676916 | linear | 3.0 |


**Thus, among our baselines, `SVC` with `polynomial kernel` and `degree=2` provided us with the best results**

1. Now, we will move to the tree-based ensemble methods, such as Random Forest, XGBoost, LightGBM and CatBoost. An important point is that tree-based models split on feature thresholds, so monotonic transformations such as quantile transforms or standardization generally bring little benefit, and they can sometimes degrade performance.

2. That's why we will use `final_df_not_qt` as our data here. In this dataset the (numerical + other) features are not quantile-transformed and the remaining features are not standardized, which suits tree-based models.

3. It is also not recommended to create polynomial features for tree-based ensembles, since the models can capture interactions and nonlinearities on their own.
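
The first point can be illustrated with a single decision stump. Because a threshold split depends only on the ordering of the values, the best split sends the same samples left before and after a monotonic transform. The helper and the synthetic feature below are hypothetical and much simpler than a real tree learner.

```python
import numpy as np

def best_split_members(x, y):
    """Return a boolean mask of the samples sent left by the Gini-optimal split on one feature."""
    order = np.argsort(x, kind="stable")
    xs, ys = x[order], y[order]
    n = len(ys)
    best_gini, best_i = np.inf, None
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue  # no threshold can separate equal values
        gini = 0.0
        for part in (ys[:i], ys[i:]):
            p = part.mean()
            gini += len(part) / n * 2.0 * p * (1.0 - p)  # weighted binary Gini impurity
        if gini < best_gini:
            best_gini, best_i = gini, i
    mask = np.zeros(n, dtype=bool)
    mask[order[:best_i]] = True
    return mask

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=200)                     # skewed, hypothetical feature
y = (x + rng.normal(scale=1.0, size=200) > 2.0).astype(int)  # noisy binary target

mask_raw = best_split_members(x, y)
mask_log = best_split_members(np.log1p(x), y)                # monotonic transform of x
same_partition = np.array_equal(mask_raw, mask_log)
```

The threshold value itself changes under the transform, but the partition of the training samples does not.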

#### Performing train-test split

In [11]:
df = final_df_not_qt.copy()
X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1), df['target'], test_size=0.20, shuffle=True, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(1750, 100) (438, 100) (1750,) (438,)


#### Tree-based ensemble model fitting

In [12]:
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostClassifier

In [13]:
def evaluate_tree_model(X_train, X_test, y_train, y_test, model, param_desc=''):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    try:
        # Fall back to no AUC if the model cannot produce probabilities
        y_prob = model.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, y_prob)
    except Exception:
        auc = None

    return {
        'Model': model.__class__.__name__,
        'Params': param_desc,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred, zero_division=0),
        'Recall': recall_score(y_test, y_pred, zero_division=0),
        'F1 Score': f1_score(y_test, y_pred, zero_division=0),
        'ROC AUC': auc
    }

def run_tree_model_grid(X_train, X_test, y_train, y_test):
    results = []

    param_grid = {
        'max_depth': [3, 5, 7],
        'n_estimators': [250, 450, 750],
        'learning_rate': [0.05, 0.1],  # Only for boosting models
    }

    rf_params = [(d, n) for d in param_grid['max_depth'] for n in param_grid['n_estimators']]
    boosting_params = [(d, n, lr) for d in param_grid['max_depth'] for n in param_grid['n_estimators'] for lr in param_grid['learning_rate']]

    print("Running Random Forest...")
    for d, n in tqdm(rf_params):
        model = RandomForestClassifier(max_depth=d, n_estimators=n, random_state=42)
        desc = f'max_depth={d}, n_estimators={n}'
        results.append(evaluate_tree_model(X_train, X_test, y_train, y_test, model, desc))

    print("Running XGBoost...")
    for d, n, lr in tqdm(boosting_params):
        model = xgb.XGBClassifier(max_depth=d, n_estimators=n, learning_rate=lr,
                                   eval_metric='logloss', random_state=42)
        desc = f'max_depth={d}, n_estimators={n}, lr={lr}'
        results.append(evaluate_tree_model(X_train, X_test, y_train, y_test, model, desc))

    print("Running LightGBM...")
    for d, n, lr in tqdm(boosting_params):
        model = lgb.LGBMClassifier(max_depth=d, n_estimators=n, learning_rate=lr,
                                force_col_wise=True, verbosity=-1, random_state=42)
        desc = f'max_depth={d}, n_estimators={n}, lr={lr}'
        results.append(evaluate_tree_model(X_train, X_test, y_train, y_test, model, desc))

    print("Running CatBoost...")
    for d, n, lr in tqdm(boosting_params):
        model = CatBoostClassifier(depth=d, iterations=n, learning_rate=lr,
                                   verbose=0, random_state=42)
        desc = f'depth={d}, iterations={n}, lr={lr}'
        results.append(evaluate_tree_model(X_train, X_test, y_train, y_test, model, desc))

    df_results = pd.DataFrame(results)
    df_results.sort_values(by='Accuracy', ascending=False, inplace=True)
    
    return df_results
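
The nested list comprehensions in `run_tree_model_grid` enumerate a Cartesian product of the grid values. The sketch below reproduces that enumeration with `itertools.product` to show how many fits the grid implies; the variable `n_boosting_libraries` is introduced here for illustration.

```python
from itertools import product

param_grid = {
    'max_depth': [3, 5, 7],
    'n_estimators': [250, 450, 750],
    'learning_rate': [0.05, 0.1],
}

# Same ordering as the nested comprehensions: max_depth outermost, learning_rate innermost.
rf_params = list(product(param_grid['max_depth'], param_grid['n_estimators']))
boosting_params = list(product(param_grid['max_depth'],
                               param_grid['n_estimators'],
                               param_grid['learning_rate']))

n_boosting_libraries = 3  # XGBoost, LightGBM, CatBoost
total_fits = len(rf_params) + n_boosting_libraries * len(boosting_params)
print(len(rf_params), len(boosting_params), total_fits)  # 9 18 63
```

Each row of the results table therefore corresponds to one of 63 fits, and its index encodes both the library and the position in the grid.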

In [14]:
df_tree_results = run_tree_model_grid(X_train, X_test, y_train, y_test)
df_tree_results

Running Random Forest...


100%|██████████| 9/9 [00:37<00:00,  4.20s/it]


Running XGBoost...


100%|██████████| 18/18 [00:49<00:00,  2.77s/it]


Running LightGBM...


100%|██████████| 18/18 [00:16<00:00,  1.08it/s]


Running CatBoost...


100%|██████████| 18/18 [01:38<00:00,  5.47s/it]


| | Model | Params | Accuracy | Precision | Recall | F1 Score | ROC AUC |
|---|---|---|---|---|---|---|---|
| 45 | CatBoostClassifier | depth=3, iterations=250, lr=0.05 | 0.751142 | 0.735683 | 0.773148 | 0.753950 | 0.813167 |
| 9 | XGBClassifier | max_depth=3, n_estimators=250, lr=0.05 | 0.748858 | 0.736607 | 0.763889 | 0.750000 | 0.806953 |
| 6 | RandomForestClassifier | max_depth=7, n_estimators=250 | 0.748858 | 0.734513 | 0.768519 | 0.751131 | 0.810540 |
| 27 | LGBMClassifier | max_depth=3, n_estimators=250, lr=0.05 | 0.748858 | 0.734513 | 0.768519 | 0.751131 | 0.808037 |
| 4 | RandomForestClassifier | max_depth=5, n_estimators=450 | 0.746575 | 0.733333 | 0.763889 | 0.748299 | 0.805535 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 32 | LGBMClassifier | max_depth=3, n_estimators=750, lr=0.1 | 0.710046 | 0.692641 | 0.740741 | 0.715884 | 0.795796 |
| 42 | LGBMClassifier | max_depth=7, n_estimators=450, lr=0.1 | 0.707763 | 0.692982 | 0.731481 | 0.711712 | 0.799383 |
| 56 | CatBoostClassifier | depth=5, iterations=750, lr=0.1 | 0.707763 | 0.703704 | 0.703704 | 0.703704 | 0.799383 |
| 36 | LGBMClassifier | max_depth=5, n_estimators=450, lr=0.1 | 0.705479 | 0.688312 | 0.736111 | 0.711409 | 0.795942 |


**From the above evaluations, CatBoost (depth=3, iterations=250, lr=0.05) is now the best-performing model at ~75% accuracy, ahead of the ~72% SVC baseline. Its margin over XGBoost, Random Forest and LightGBM (all at ~74.9%) is small.**
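
To put the gap between the top tree models in perspective, the reported accuracies can be converted back into counts of correctly classified test rows. The counts below are inferred from the table (they are not printed by the notebook), and the standard error uses the usual binomial approximation.

```python
from math import sqrt

n_test = 438
best_correct, runner_up_correct = 329, 328   # inferred from the reported accuracies

best_acc = best_correct / n_test
runner_up_acc = runner_up_correct / n_test

# Binomial standard error of an accuracy estimate on n_test samples.
std_err = sqrt(best_acc * (1 - best_acc) / n_test)

print(round(best_acc, 6), round(runner_up_acc, 6))  # 0.751142 0.748858
print(round(std_err, 3))                            # 0.021
```

The top four models are separated by a single test sample, while the standard error of each accuracy estimate is about two percentage points, so their ranking on this split should be treated as tentative.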

### Soft-voting classifier
As tree-based ensembles perform well on this dataset, let's combine the three best-performing tree models (CatBoost, XGBoost and Random Forest) in a soft-voting ensemble, which averages their predicted class probabilities, and check whether performance improves.
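
The difference between soft and hard voting can be seen on a toy example. The probabilities below are made up for three hypothetical models and four samples. Soft voting averages the class-1 probabilities before thresholding, whereas hard voting takes a majority over the individual labels.

```python
import numpy as np

# Hypothetical class-1 probabilities from three models for four samples.
proba = np.array([
    [0.90, 0.40, 0.55, 0.20],   # model A
    [0.60, 0.45, 0.35, 0.30],   # model B
    [0.70, 0.80, 0.45, 0.10],   # model C
])

soft_proba = proba.mean(axis=0)                             # average the probabilities
soft_pred = (soft_proba > 0.5).astype(int)
hard_pred = ((proba > 0.5).sum(axis=0) >= 2).astype(int)    # majority of hard labels

print(soft_pred.tolist())  # [1, 1, 0, 0]
print(hard_pred.tolist())  # [1, 0, 0, 0]
```

On the second sample a single confident model outweighs two mildly negative ones under soft voting, which is why well-calibrated probabilities matter for this kind of ensemble.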

In [15]:
from sklearn.ensemble import VotingClassifier

In [16]:
def evaluate_ensemble_model(X_train, X_test, y_train, y_test):
    model_cb = CatBoostClassifier(depth=3, iterations=250, learning_rate=0.05, verbose=0, random_state=42)
    model_xgb = xgb.XGBClassifier(max_depth=3, n_estimators=250, learning_rate=0.05, eval_metric='logloss', use_label_encoder=False, random_state=42)
    model_rf = RandomForestClassifier(max_depth=7, n_estimators=250, random_state=42)


    ensemble = VotingClassifier(
        estimators=[
            ('catboost', model_cb),
            ('xgboost', model_xgb),
            ('randomforest', model_rf)
        ],
        voting='soft'
    )

    ensemble.fit(X_train, y_train)
    y_pred = ensemble.predict(X_test)
    y_prob = ensemble.predict_proba(X_test)[:, 1]


    results = {
        'Model': 'Ensemble (CatBoost + XGB + RF)',
        'Params': 'cb:depth=3,iter=250,lr=0.05 | xgb:max_depth=3,n_est=250,lr=0.05 | rf:max_depth=7,n_est=250',
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred, zero_division=0),
        'Recall': recall_score(y_test, y_pred, zero_division=0),
        'F1 Score': f1_score(y_test, y_pred, zero_division=0),
        'ROC AUC': roc_auc_score(y_test, y_prob)
    }

    return pd.DataFrame([results])

In [17]:
df_ensemble_result = evaluate_ensemble_model(X_train, X_test, y_train, y_test)
df_ensemble_result

# final_df = pd.concat([df_tree_results, df_ensemble_result], ignore_index=True)
# final_df.sort_values(by='Accuracy', ascending=False)

Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


Unnamed: 0,Model,Params,Accuracy,Precision,Recall,F1 Score,ROC AUC
0,Ensemble (CatBoost + XGB + RF),"cb:depth=3,iter=250,lr=0.05 | xgb:max_depth=3,...",0.767123,0.752212,0.787037,0.769231,0.816275


**Accuracy did improve to ~77% (0.767, versus 0.751 for the best single model), and ROC AUC rose slightly from 0.813 to 0.816. This suggests that the three models complement one another, with each compensating for some of the others' errors.**

### Gaussian Processes

1. Now, we will move to a new type of model: probabilistic models. We saw in the pre-processing section that, after transformation, most of the (numerical + other) features were approximately Gaussian in their distribution.

2. Gaussian Processes (GPs) can be a reasonable choice here since they are non-parametric and built on Gaussian distributions over functions. Along with predicting the class, they also provide an uncertainty estimate for each prediction.

3. We can use a composite kernel with Automatic Relevance Determination (ARD), i.e. one learned lengthscale per feature. This gives us a natural form of feature selection and lets us handle the numerical and the binary features with different kernels.
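
To see why ARD lengthscales act as a relevance measure, consider an RBF kernel with one lengthscale per dimension. The sketch below is a plain NumPy illustration with made-up lengthscales, not the GPyTorch kernel used later. Moving a point by the same amount along a short-lengthscale dimension changes the kernel value far more than moving it along a long-lengthscale one.

```python
import numpy as np

def ard_rbf(x1, x2, lengthscales):
    """RBF kernel with a separate lengthscale per input dimension."""
    scaled_diff = (x1 - x2) / lengthscales
    return float(np.exp(-0.5 * np.sum(scaled_diff ** 2)))

lengthscales = np.array([0.3, 3.0])   # hypothetical: dim 0 is "relevant", dim 1 is not
origin = np.array([0.0, 0.0])

k_shift_dim0 = ard_rbf(origin, np.array([1.0, 0.0]), lengthscales)
k_shift_dim1 = ard_rbf(origin, np.array([0.0, 1.0]), lengthscales)

print(round(k_shift_dim0, 4))  # 0.0039
print(round(k_shift_dim1, 4))  # 0.946
```

A kernel value near 1 means the model treats the two points as almost identical, so a feature with a large lengthscale barely influences the prediction.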

In [18]:
import torch
import gpytorch

In [19]:
class CompositeKernelGP(gpytorch.models.ApproximateGP):
    def __init__(self, inducing_points, num_numerical, num_binary):
        # Variational distribution and strategy
        variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(
            inducing_points.size(0)
        )
        variational_strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, variational_distribution
        )
        super().__init__(variational_strategy)
        
     
        self.num_numerical = num_numerical
        self.num_binary = num_binary
     
        self.mean_module = gpytorch.means.ConstantMean()
        
        # Composite kernel:
        # 1. ARD-RBF for numerical features
        # 2. Matern (ν=0.5) for binary features
        self.covar_module = (
            gpytorch.kernels.ScaleKernel(
                gpytorch.kernels.RBFKernel(
                    ard_num_dims=num_numerical,
                    active_dims=range(num_numerical)
                )
                + gpytorch.kernels.ScaleKernel(
                    gpytorch.kernels.MaternKernel(
                        nu=0.5,
                        ard_num_dims=num_binary,
                        active_dims=range(num_numerical, num_numerical + num_binary)
                    )
                )
            )
        )

    
    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)

        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

In [20]:
def train_gpytorch_model(X_train, y_train, num_numerical, num_binary, num_epochs=100, lr=0.1):
    
    if isinstance(X_train, pd.DataFrame):
        X_train = X_train.values

    if isinstance(y_train, (pd.Series, pd.DataFrame)):
        y_train = y_train.values
    

    X_train_t = torch.tensor(X_train, dtype=torch.float32)
    y_train_t = torch.tensor(y_train, dtype=torch.float32)
    
    num_inducing = max(100, len(X_train) // 10)
    rand_idx = torch.randperm(len(X_train))[:num_inducing]
    inducing_points = X_train_t[rand_idx].clone()
    
   
    model = CompositeKernelGP(
        inducing_points=inducing_points,
        num_numerical=num_numerical,
        num_binary=num_binary
    )
    likelihood = gpytorch.likelihoods.BernoulliLikelihood()
    

    model.train()
    likelihood.train()
    
    
    optimizer = torch.optim.Adam([
        {'params': model.parameters()},
        {'params': likelihood.parameters()}
    ], lr=lr)
    
    
    mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=y_train_t.size(0))
    
    
    for epoch in tqdm(range(num_epochs)):
        optimizer.zero_grad()
        output = model(X_train_t)
        loss = -mll(output, y_train_t)
        loss.backward()
        optimizer.step()
        
        if (epoch + 1) % 10 == 0:
            print(f'Epoch {epoch+1}/{num_epochs} - Loss: {loss.item():.4f}')
    
    return model, likelihood



def evaluate_gpytorch(model, likelihood, X_test, y_test):

    if isinstance(X_test, pd.DataFrame):
        X_test = X_test.values

    if isinstance(y_test, (pd.Series, pd.DataFrame)):
        y_test = y_test.values


    X_test_t = torch.tensor(X_test, dtype=torch.float32)
    
    
    model.eval()
    likelihood.eval()
    
   
    with torch.no_grad(), gpytorch.settings.fast_pred_var():
        observed_pred = likelihood(model(X_test_t))
        probabilities = observed_pred.mean.numpy()
        predictions = (probabilities > 0.5).astype(int)
    
   
    metrics = {
        'Accuracy': accuracy_score(y_test, predictions),
        'Precision': precision_score(y_test, predictions, zero_division=0),
        'Recall': recall_score(y_test, predictions, zero_division=0),
        'F1 Score': f1_score(y_test, predictions, zero_division=0),
        'ROC AUC': roc_auc_score(y_test, probabilities)
    }
    return metrics


def gp_model_fit_and_evaluate(X_train, X_test, y_train, y_test, num_numerical, num_binary):

    X_train = X_train.values if isinstance(X_train, pd.DataFrame) else X_train
    X_test = X_test.values if isinstance(X_test, pd.DataFrame) else X_test
    y_train = y_train.values if isinstance(y_train, pd.Series) else y_train
    y_test = y_test.values if isinstance(y_test, pd.Series) else y_test


    model, likelihood = train_gpytorch_model(
        X_train, y_train,
        num_numerical=num_numerical,
        num_binary=num_binary
    )
    
 
    metrics = evaluate_gpytorch(model, likelihood, X_test, y_test)
    
    top_scale = model.covar_module
    sum_kernel = top_scale.base_kernel
    rbf_kernel = sum_kernel.kernels[0]
    matern_scale = sum_kernel.kernels[1]
    matern_kernel = matern_scale.base_kernel

    lengthscales_numerical = rbf_kernel.lengthscale.detach().cpu().numpy().squeeze()
    lengthscales_binary   = matern_kernel.lengthscale.detach().cpu().numpy().squeeze()

    return {
        'model': model,
        'likelihood': likelihood,
        'metrics': metrics,
        'lengthscales_numerical': lengthscales_numerical,
        'lengthscales_binary': lengthscales_binary
    }

In [21]:
results = gp_model_fit_and_evaluate(
    X_train=X_train,
    X_test=X_test,
    y_train=y_train,
    y_test=y_test,
    num_numerical=101-74-1,   # 101 columns - 74 binary - 1 target = 26 (the kernel assumes these columns come first)
    num_binary=74
)

print("\nEvaluation Metrics:")
for metric, value in results['metrics'].items():
    print(f"{metric}: {value:.4f}")


print("\nFeature Importance (Numerical - smaller lengthscale = more important):")
ls_num = results['lengthscales_numerical'].squeeze()  # shape: (num_numerical,)

#   Sorted by importance
sorted_idxs = np.argsort(ls_num)
print("Numerical features sorted by importance (smaller lengthscale ⇒ more important):")
for rank, idx in enumerate(sorted_idxs, start=1):
    print(f"{rank:2d}. Feature {idx+1:3d} — lengthscale = {ls_num[idx]:.4f}")

Epoch 10/100 - Loss: 0.7612
Epoch 20/100 - Loss: 0.6712
Epoch 30/100 - Loss: 0.6477
Epoch 40/100 - Loss: 0.6405
Epoch 50/100 - Loss: 0.6375
Epoch 60/100 - Loss: 0.6355
Epoch 70/100 - Loss: 0.6344
Epoch 80/100 - Loss: 0.6336
Epoch 90/100 - Loss: 0.6332
100%|██████████| 100/100 [00:12<00:00,  8.16it/s]
Epoch 100/100 - Loss: 0.6330

Evaluation Metrics:
Accuracy: 0.6301
Precision: 0.6098
Recall: 0.6944
F1 Score: 0.6494
ROC AUC: 0.6745

Feature Importance (Numerical - smaller lengthscale = more important):
Numerical features sorted by importance (smaller lengthscale ⇒ more important):
 1. Feature  16 — lengthscale = 0.3457
 2. Feature  15 — lengthscale = 0.5052
 3. Feature   4 — lengthscale = 0.5707
 4. Feature  23 — lengthscale = 0.9184
 5. Feature  22 — lengthscale = 0.9703
 6. Feature   3 — lengthscale = 1.0329
 7. Feature  20 — lengthscale = 1.0993
 8. Feature  13 — lengthscale = 1.6707
 9. Feature  12 — lengthscale = 1.9108
10. Feature  24 — lengthscale = 2.1892
11. Feature   8 — lengthscale = 2.2631
12. Feature   7 — lengthscale = 2.2870
13. Feature  14 — lengthscale = 2.3688
14. Feature  17 — lengthscale = 2.4872
15. Feature  18 — lengthscale = 2.5336
16. Feature  10 — lengthscale = 2.5780
17. Feature   1 — lengthscale = 2.5823
18. Feature  11 — lengthscale = 2.7070
19. Feature  




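**Although the Gaussian Process model did not outperform the tree-based ensembles (~63% accuracy versus ~75-77%), it still gives a useful insight: the learned ARD lengthscales rank the numerical features by relevance, which helps show which features carry more predictive power. Note that `X_train` at this point is still the `final_df_not_qt` split from In [11], so the GP was fitted on the untransformed features rather than on the quantile-transformed ones discussed above.**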
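
The printed ranking refers to features by position only. A small sketch of how lengthscales could be mapped back to column names and turned into a normalised relevance score is shown below. The names and values are hypothetical, and inverse lengthscale is used as a common heuristic rather than a formal importance measure.

```python
import numpy as np

# Hypothetical column names and lengthscales; in the notebook these would come from
# X_train.columns[:num_numerical] and results['lengthscales_numerical'].
feature_names = ["feat_a", "feat_b", "feat_c", "feat_d"]
lengthscales = np.array([2.58, 0.35, 1.03, 0.57])

relevance = 1.0 / lengthscales                 # shorter lengthscale -> larger relevance
relevance = relevance / relevance.sum()        # normalise so the scores sum to 1

ranking = [feature_names[i] for i in np.argsort(lengthscales)]
print(ranking)  # ['feat_b', 'feat_d', 'feat_c', 'feat_a']
```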


### Neural Networks

1. We will try out three different Neural Network architectures and check how they perform compared to the previous models and baselines.

2. For neural networks as well, we will use the quantile-transformed dataset, as neural networks tend to train better on transformed and standardized data.

#### Train-test split

In [22]:
df = final_df_qt.copy()
X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1), df['target'], test_size=0.20, shuffle=True, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(1750, 100) (438, 100) (1750,) (438,)


In [23]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

In [24]:
#   Type-I architecture
class Network_v1(nn.Module):
    def __init__(self, dropout=0.1):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(100, 64),
            nn.SiLU(),
            nn.Dropout(dropout),
            nn.Linear(64, 16),
            nn.SiLU(),
            nn.Dropout(dropout),
            nn.Linear(16, 1),
            # nn.Sigmoid()          #   We will use BCEWithLogits loss (to induce more stability)
        )

    def forward(self, x):
        return self.layers(x)



class Network_v2(nn.Module):
    def __init__(self, dropout=0.1):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(100, 64),
            nn.SiLU(),
            nn.Dropout(dropout),
            nn.Linear(64, 32),
            nn.SiLU(),
            nn.Dropout(dropout),
            nn.Linear(32, 16),
            nn.SiLU(),
            nn.Dropout(dropout),
            nn.Linear(16, 1),
            # nn.Sigmoid()          #   We will use BCEWithLogits loss (to induce more stability)
        )

    def forward(self, x):
        return self.layers(x)
    


#   Using skip-connection
class Network_v3(nn.Module):
    def __init__(self, dropout=0.1):
        super().__init__()
        self.layers1 = nn.Sequential(
            nn.Linear(100, 64),
            nn.SiLU(),
            nn.Dropout(dropout),
            nn.Linear(64, 32),
            nn.SiLU(),
            nn.Dropout(dropout),
            nn.Linear(32, 100),
            nn.SiLU()
        )

        self.layers2 = nn.Sequential(
            nn.Linear(100, 32),
            nn.SiLU(),
            nn.Dropout(dropout),
            nn.Linear(32, 16),
            nn.SiLU(),
            nn.Dropout(dropout),
            nn.Linear(16, 1),
            # nn.Sigmoid()          #   We will use BCEWithLogits loss (to induce more stability)
        )


    def forward(self, x):
        skip = self.layers1(x)
        return self.layers2(x + skip)
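
The commented-out `nn.Sigmoid()` layers reflect the choice to compute the loss directly on logits. The NumPy sketch below illustrates why this is more stable, using the standard log-sum-exp style formulation of binary cross-entropy. It is an illustration of the idea, not PyTorch's internal code, and both helper names are hypothetical.

```python
import numpy as np

def bce_from_probability(logit, target):
    """Naive route: squash with a sigmoid first, then take logs."""
    p = 1.0 / (1.0 + np.exp(-logit))
    with np.errstate(divide="ignore", invalid="ignore"):
        return -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))

def bce_from_logit(logit, target):
    """Numerically stable form computed directly on the logit."""
    return np.maximum(logit, 0.0) - logit * target + np.log1p(np.exp(-np.abs(logit)))

# A confidently wrong prediction: large positive logit, true label 0.
naive_extreme = bce_from_probability(40.0, 0.0)   # sigmoid saturates to exactly 1.0
stable_extreme = bce_from_logit(40.0, 0.0)

print(naive_extreme)             # inf
print(round(stable_extreme, 6))  # 40.0
```

For moderate logits both routes agree; they only diverge once the sigmoid saturates, which is exactly when a usable gradient is needed most.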

In [25]:
def train_and_evaluate(
    X_train, y_train,
    X_test, y_test,
    model: nn.Module,
    optimizer: torch.optim.Optimizer,
    epochs: int = 10,
    batch_size: int = 32,
    device: str = None
):
    
    device = device or 'cpu'   # honour the `device` argument; default to CPU
    model.to(device)


    if not torch.is_tensor(X_train):
        X_train = torch.tensor(X_train, dtype=torch.float32)
    if not torch.is_tensor(y_train):
        y_train = torch.tensor(y_train, dtype=torch.float32).unsqueeze(1)
    if not torch.is_tensor(X_test):
        X_test = torch.tensor(X_test, dtype=torch.float32)
    if not torch.is_tensor(y_test):
        y_test = torch.tensor(y_test, dtype=torch.float32).unsqueeze(1)

    train_ds = TensorDataset(X_train, y_train)
    test_ds = TensorDataset(X_test, y_test)
    train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_ds, batch_size=batch_size, shuffle=False)

    criterion = nn.BCEWithLogitsLoss()


    model.train()
    for epoch in range(1, epochs + 1):
        epoch_loss = 0.0
        for xb, yb in train_loader:
            xb, yb = xb.to(device), yb.to(device)
            optimizer.zero_grad()
            logits = model(xb)
            loss = criterion(logits, yb)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item() * xb.size(0)
        avg_loss = epoch_loss / len(train_loader.dataset)
        print(f"Epoch {epoch}/{epochs} - Loss: {avg_loss:.4f}")

  
    model.eval()
    all_preds = []
    all_probs = []
    all_targets = []
    with torch.no_grad():
        for xb, yb in test_loader:
            xb = xb.to(device)
            logits = model(xb)
            probs = torch.sigmoid(logits).cpu().numpy().flatten()
            preds = (probs >= 0.5).astype(int)
            all_probs.extend(probs.tolist())
            all_preds.extend(preds.tolist())
            all_targets.extend(yb.numpy().flatten().tolist())

    
    acc = accuracy_score(all_targets, all_preds)
    prec = precision_score(all_targets, all_preds, zero_division=0)
    rec = recall_score(all_targets, all_preds, zero_division=0)
    f1 = f1_score(all_targets, all_preds, zero_division=0)
    try:
        auc = roc_auc_score(all_targets, all_probs)
    except ValueError:
        auc = None

    print("\nEvaluation Results")
    print(f"Accuracy: {acc:.4f}")
    print(f"Precision: {prec:.4f}")
    print(f"Recall: {rec:.4f}")
    print(f"F1 Score: {f1:.4f}")
    if auc is not None:
        print(f"ROC AUC: {auc:.4f}")
    print('-' * 30)

    return {
        'accuracy': acc,
        'precision': prec,
        'recall': rec,
        'f1_score': f1,
        'roc_auc': auc
    }
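
The thresholded metrics reported by the evaluation step reduce to confusion-matrix counts. The sketch below shows that arithmetic in plain Python. `binary_metrics` is a hypothetical helper written for illustration, not the scikit-learn implementation; it returns 0 when a denominator is zero, mirroring `zero_division=0`. The probabilities and labels are made-up toy values.

```python
def binary_metrics(targets, preds):
    """Accuracy, precision, recall and F1 from 0/1 labels and 0/1 predictions."""
    pairs = list(zip(targets, preds))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    # Fall back to 0.0 when a denominator is zero (same idea as zero_division=0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs)  # assumes at least one sample

    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1_score': f1}


# Toy data: probabilities are thresholded at 0.5, as in the evaluation loop
probs = [0.9, 0.4, 0.6, 0.2, 0.7]
targets = [1, 1, 0, 0, 1]
preds = [int(p >= 0.5) for p in probs]
print(binary_metrics(targets, preds))
```

ROC AUC is left out because it is computed from the raw probabilities rather than from the thresholded predictions.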

In [26]:
from torch.optim import AdamW

def run_all_models(X_train, y_train, X_test, y_test, epochs=70, batch_size=64):
    models = [
        ('Network_v1', Network_v1()),
        ('Network_v2', Network_v2()),
        ('Network_v3', Network_v3())
    ]

    results = {}
    for name, model in models:
        print(f"\n=== Training {name} ===")
        optimizer = AdamW(model.parameters())
        metrics = train_and_evaluate(
            X_train, y_train,
            X_test,  y_test,
            model,
            optimizer,
            epochs=epochs,
            batch_size=batch_size
        )
        results[name] = metrics

    return results

In [27]:
results = run_all_models(X_train.values, y_train.values, X_test.values, y_test.values, epochs=70, batch_size=64)
pd.DataFrame(results)


=== Training Network_v1 ===
Epoch 1/70 - Loss: 0.6808
Epoch 2/70 - Loss: 0.6179
Epoch 3/70 - Loss: 0.5618
Epoch 4/70 - Loss: 0.5355
Epoch 5/70 - Loss: 0.5250
Epoch 6/70 - Loss: 0.5217
Epoch 7/70 - Loss: 0.5124
Epoch 8/70 - Loss: 0.5124
Epoch 9/70 - Loss: 0.5121
Epoch 10/70 - Loss: 0.5051
Epoch 11/70 - Loss: 0.5010
Epoch 12/70 - Loss: 0.4984
Epoch 13/70 - Loss: 0.4954
Epoch 14/70 - Loss: 0.4905
Epoch 15/70 - Loss: 0.4881
Epoch 16/70 - Loss: 0.4835
Epoch 17/70 - Loss: 0.4761
Epoch 18/70 - Loss: 0.4729
Epoch 19/70 - Loss: 0.4721
Epoch 20/70 - Loss: 0.4595
Epoch 21/70 - Loss: 0.4583
Epoch 22/70 - Loss: 0.4529
Epoch 23/70 - Loss: 0.4496
Epoch 24/70 - Loss: 0.4451
Epoch 25/70 - Loss: 0.4333
Epoch 26/70 - Loss: 0.4282
Epoch 27/70 - Loss: 0.4269
Epoch 28/70 - Loss: 0.4165
Epoch 29/70 - Loss: 0.4168
Epoch 30/70 - Loss: 0.4039
Epoch 31/70 - Loss: 0.4035
Epoch 32/70 - Loss: 0.3950
Epoch 33/70 - Loss: 0.3876
Epoch 34/70 - Loss: 0.3938
Epoch 35/70 - Loss: 0.3795
Epoch 36/70 - Loss: 0.3778
Epoch 37

| Metric | Network_v1 | Network_v2 | Network_v3 |
|---|---|---|---|
| accuracy | 0.682648 | 0.678082 | 0.657534 |
| precision | 0.662447 | 0.647059 | 0.638655 |
| recall | 0.726852 | 0.763889 | 0.703704 |
| f1_score | 0.693157 | 0.700637 | 0.669604 |
| roc_auc | 0.728604 | 0.718385 | 0.722379 |


In [28]:
results = run_all_models(X_train.values, y_train.values, X_test.values, y_test.values, epochs=200, batch_size=64)
pd.DataFrame(results)


=== Training Network_v1 ===
Epoch 1/200 - Loss: 0.6740
Epoch 2/200 - Loss: 0.6189
Epoch 3/200 - Loss: 0.5650
Epoch 4/200 - Loss: 0.5315
Epoch 5/200 - Loss: 0.5277
Epoch 6/200 - Loss: 0.5235
Epoch 7/200 - Loss: 0.5159
Epoch 8/200 - Loss: 0.5095
Epoch 9/200 - Loss: 0.5056
Epoch 10/200 - Loss: 0.5008
Epoch 11/200 - Loss: 0.5000
Epoch 12/200 - Loss: 0.4997
Epoch 13/200 - Loss: 0.4941
Epoch 14/200 - Loss: 0.4884
Epoch 15/200 - Loss: 0.4879
Epoch 16/200 - Loss: 0.4802
Epoch 17/200 - Loss: 0.4733
Epoch 18/200 - Loss: 0.4757
Epoch 19/200 - Loss: 0.4695
Epoch 20/200 - Loss: 0.4620
Epoch 21/200 - Loss: 0.4594
Epoch 22/200 - Loss: 0.4509
Epoch 23/200 - Loss: 0.4444
Epoch 24/200 - Loss: 0.4471
Epoch 25/200 - Loss: 0.4391
Epoch 26/200 - Loss: 0.4304
Epoch 27/200 - Loss: 0.4260
Epoch 28/200 - Loss: 0.4243
Epoch 29/200 - Loss: 0.4152
Epoch 30/200 - Loss: 0.4150
Epoch 31/200 - Loss: 0.4127
Epoch 32/200 - Loss: 0.4088
Epoch 33/200 - Loss: 0.3960
Epoch 34/200 - Loss: 0.3923
Epoch 35/200 - Loss: 0.3838


| Metric | Network_v1 | Network_v2 | Network_v3 |
|---|---|---|---|
| accuracy | 0.6621 | 0.639269 | 0.671233 |
| precision | 0.651786 | 0.623932 | 0.662162 |
| recall | 0.675926 | 0.675926 | 0.680556 |
| f1_score | 0.663636 | 0.648889 | 0.671233 |
| roc_auc | 0.719323 | 0.697719 | 0.712775 |


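**Conclusion:** The neural networks did not outperform the tree-based ensembles on the evaluation metrics: their best test accuracy was about 68% (Network_v1 at 70 epochs), and training for 200 epochs did not improve on that. The best accuracy achieved by any of our models therefore remains **75%**.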
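
The comparison between the 70-epoch and 200-epoch runs can be checked directly from the two result tables. The sketch below copies the accuracy, F1 and ROC AUC values from those tables into plain dictionaries. The names `results_70` and `results_200` are introduced here for illustration and are not variables from the notebook.

```python
# Test-set metrics copied from the 70-epoch and 200-epoch result tables
results_70 = {
    'Network_v1': {'accuracy': 0.682648, 'f1_score': 0.693157, 'roc_auc': 0.728604},
    'Network_v2': {'accuracy': 0.678082, 'f1_score': 0.700637, 'roc_auc': 0.718385},
    'Network_v3': {'accuracy': 0.657534, 'f1_score': 0.669604, 'roc_auc': 0.722379},
}
results_200 = {
    'Network_v1': {'accuracy': 0.6621, 'f1_score': 0.663636, 'roc_auc': 0.719323},
    'Network_v2': {'accuracy': 0.639269, 'f1_score': 0.648889, 'roc_auc': 0.697719},
    'Network_v3': {'accuracy': 0.671233, 'f1_score': 0.671233, 'roc_auc': 0.712775},
}

# Change in each metric when going from 70 to 200 epochs
for name in results_70:
    for metric in results_70[name]:
        delta = results_200[name][metric] - results_70[name][metric]
        print(f"{name:<11} {metric:<9} {delta:+.4f}")

# Best test accuracy across both runs
best_name, best_epochs, best_acc = max(
    (
        (name, epochs, run[name]['accuracy'])
        for epochs, run in ((70, results_70), (200, results_200))
        for name in run
    ),
    key=lambda item: item[2],
)
print(f"Best: {best_name} at {best_epochs} epochs, accuracy {best_acc:.4f}")
```

ROC AUC drops for all three networks in the longer run, while training loss was still falling in the visible part of the logs. That pattern is consistent with overfitting, although a single unseeded run per setting is not enough to confirm it.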