## Modelling experiments guidelines:

1. The task is a `Binary Classification problem`. The raw dataset has already been pre-processed and saved, so this notebook works with the processed data to train different models and run inference.

2. Several model families are tried here: baselines such as Logistic Regression, KNN and SVM; tree-based ensembles such as XGBoost, Random Forest and CatBoost; Neural Networks; and probabilistic models such as Gaussian Processes (GP).

3. Afterwards, we will also project the dataset to 2D with PCA/t-SNE and cluster it, in order to get an idea of how the different classes are spread in 2D.

4. One important thing to note is that the problem statement asks for the model to be trained on the entire dataset and saved. However, since no separate test/unseen data is available, we split the available data into a training set and a test set, train models only on the training set, and evaluate them on the held-out test set.

5. The best model is chosen and reported based on this split. However, the final models saved in the `Models/` folder will be trained on the `entire dataset`.
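
The protocol in points 4 and 5 (evaluate on a held-out split, then refit the chosen model on all rows before saving) can be sketched roughly as follows. This is a minimal illustration only: the synthetic `make_classification` data, the temporary output directory and the file name `best_model.joblib` are assumptions made for the example, not the project's actual data or artifacts.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed dataset (assumption for illustration)
X, y = make_classification(n_samples=400, n_features=20, random_state=42)

# Step 1: hold out 20% of the rows and evaluate the candidate on them
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, shuffle=True, random_state=42
)
candidate = LogisticRegression(max_iter=1000)
candidate.fit(X_train, y_train)
holdout_accuracy = accuracy_score(y_test, candidate.predict(X_test))
print("Hold-out accuracy:", holdout_accuracy)

# Step 2: refit an unfitted copy of the same configuration on ALL rows and save it
final_model = clone(candidate).fit(X, y)
model_dir = Path(tempfile.mkdtemp())          # hypothetical stand-in for Models/
model_path = model_dir / "best_model.joblib"  # hypothetical file name
joblib.dump(final_model, model_path)
```

The hold-out score describes the model trained on 80% of the rows; the saved model has seen all rows, so its own test performance is not measured directly.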

### Benchmarks

|S No. |  Model Name  |    Test accuracy  | Test Precision |  Test Recall   |   Test F1   |
|------|--------------|--------------------|---------------|----------------|-------------|
|  1.  |  Logistic Regression | 1          |  1

In [1]:
import os
from tqdm.auto import tqdm
import numpy as np
import pandas as pd
from pathlib import Path
from dotenv import load_dotenv



#   Load the Data path
env_path = Path(os.getcwd()).parent / 'Config' / '.env'
load_dotenv(env_path)

PREPROCESSED_PATH = os.getenv('PREPROCESSED_PATH')


#   Load the unprocessed data and the 4 pre-processed datasets. They will be used for training different models as per convenience
unprocessed_df = pd.read_csv(os.path.join(PREPROCESSED_PATH, 'unprocessed_data.csv'))

final_df_qt = pd.read_csv(os.path.join(PREPROCESSED_PATH, 'final_df.csv'))
final_df_not_qt = pd.read_csv(os.path.join(PREPROCESSED_PATH, 'final_df_not_qt.csv'))

qt_df_with_corr = pd.read_csv(os.path.join(PREPROCESSED_PATH, 'final_df_with_corr.csv'))
not_qt_with_corr = pd.read_csv(os.path.join(PREPROCESSED_PATH, 'final_df_not_qt_with_corr.csv'))


print(final_df_qt.shape, final_df_not_qt.shape, qt_df_with_corr.shape, not_qt_with_corr.shape)



(2188, 101) (2188, 101) (2188, 103) (2188, 103)


In [2]:
#   The target variable is well-balanced. No need for any additional operations
print(final_df_qt['target'].value_counts())

target
1    1117
0    1071
Name: count, dtype: int64


To begin with, we start with baseline models such as Logistic Regression. For these models, the quantiled datasets (`final_df_qt` and `qt_df_with_corr`) are the best fit. We run all our experiments with `final_df_qt` first and later repeat them with `qt_df_with_corr`, keeping the same code.
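
As a rough illustration of why quantiled features suit these baselines, the sketch below maps a skewed feature to an approximately Gaussian one. It assumes the quantiled datasets were produced with something like scikit-learn's `QuantileTransformer`; the actual pre-processing is not shown in this notebook, and the synthetic log-normal feature and the `skewness` helper are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer


def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    centered = values - values.mean()
    return float((centered ** 3).mean() / (centered ** 2).mean() ** 1.5)


# Heavily right-skewed synthetic feature (assumption for illustration)
rng = np.random.default_rng(0)
raw_feature = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

# Map the empirical quantiles of the feature onto a standard normal distribution
qt = QuantileTransformer(n_quantiles=200, output_distribution="normal", random_state=0)
quantiled_feature = qt.fit_transform(raw_feature)

raw_skew = skewness(raw_feature[:, 0])
quantiled_skew = skewness(quantiled_feature[:, 0])
print("Skewness before:", raw_skew)
print("Skewness after: ", quantiled_skew)
```

Distance-based and margin-based models such as KNN and SVC are sensitive to feature scale and outliers, which is why a transformation of this kind tends to help them.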

### Train-test Split

In [3]:
from sklearn.model_selection import train_test_split

df = final_df_qt.copy()
X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1), df['target'], test_size=0.20, shuffle=True, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(1750, 100) (438, 100) (1750,) (438,)


### Let's automate the model fitting and evaluation

In [4]:
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, f1_score

In [5]:
def model_fit_and_evaluate(X_train, X_test, y_train, y_test, model):
    """Fit `model` on the training split and print its test-set metrics."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    #   ROC AUC needs class probabilities; skip it for models that cannot provide them
    try:
        y_prob = model.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, y_prob)
    except (AttributeError, ValueError):
        auc = None

    print(f"Model: {model.__class__.__name__}")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred, zero_division=0))
    print("Recall:", recall_score(y_test, y_pred, zero_division=0))
    print("F1 Score:", f1_score(y_test, y_pred, zero_division=0))
    if auc is not None:
        print("ROC AUC:", auc)
    print('-' * 30)

#### Train and evaluate on some baseline models

In [6]:
#   SVC(kernel='poly', degree=2) and Logistic Regression are the top performers (~71-72% accuracy)

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

models = [
    LogisticRegression(max_iter=1000),
    SVC(kernel='rbf', probability=True),             
    SVC(kernel='poly', degree=2, probability=True),
    SVC(kernel='poly', degree=3, probability=True),  
    SVC(kernel='poly', degree=4, probability=True),  
    KNeighborsClassifier()
]

# Run evaluation for all models
for model in models:
    model_fit_and_evaluate(X_train, X_test, y_train, y_test, model)

Model: LogisticRegression
Accuracy: 0.7123287671232876
Precision: 0.6991150442477876
Recall: 0.7314814814814815
F1 Score: 0.7149321266968326
ROC AUC: 0.7816358024691359
------------------------------
Model: SVC
Accuracy: 0.7123287671232876
Precision: 0.6939655172413793
Recall: 0.7453703703703703
F1 Score: 0.71875
ROC AUC: 0.7946800967634301
------------------------------
Model: SVC
Accuracy: 0.7214611872146118
Precision: 0.7117117117117117
Recall: 0.7314814814814815
F1 Score: 0.7214611872146118
ROC AUC: 0.7832311478144812
------------------------------
Model: SVC
Accuracy: 0.6894977168949772
Precision: 0.6769911504424779
Recall: 0.7083333333333334
F1 Score: 0.6923076923076923
ROC AUC: 0.7683725392058726
------------------------------
Model: SVC
Accuracy: 0.682648401826484
Precision: 0.672645739910314
Recall: 0.6944444444444444
F1 Score: 0.683371298405467
ROC AUC: 0.7606773440106774
------------------------------
Model: KNeighborsClassifier
Accuracy: 0.6552511415525114
Precision: 0.6208

#### Now let's check whether our pre-processing actually added any value

In [7]:
df = unprocessed_df.copy()
X_train_u, X_test_u, y_train_u, y_test_u = train_test_split(df.drop('target', axis=1), df['target'], test_size=0.20, shuffle=True, random_state=42)
print(X_train_u.shape, X_test_u.shape, y_train_u.shape, y_test_u.shape)

(1750, 48) (438, 48) (1750,) (438,)


In [None]:
models = [
    LogisticRegression(max_iter=1000),
    SVC(kernel='rbf', probability=True),             
    SVC(kernel='poly', degree=2, probability=True),
    SVC(kernel='poly', degree=3, probability=True),  
    SVC(kernel='poly', degree=4, probability=True),  
    KNeighborsClassifier()
]

for model in models:
    model_fit_and_evaluate(X_train_u, X_test_u, y_train_u, y_test_u, model)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=1000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Model: LogisticRegression
Accuracy: 0.7146118721461188
Precision: 0.6842105263157895
Recall: 0.7824074074074074
F1 Score: 0.7300215982721382
ROC AUC: 0.7743576910243577
------------------------------
Model: SVC
Accuracy: 0.684931506849315
Precision: 0.6433823529411765
Recall: 0.8101851851851852
F1 Score: 0.7172131147540983
ROC AUC: 0.7447238905572238
------------------------------
Model: SVC
Accuracy: 0.6164383561643836
Precision: 0.5710059171597633
Recall: 0.8935185185185185
F1 Score: 0.6967509025270758
ROC AUC: 0.6948302469135802
------------------------------
Model: SVC
Accuracy: 0.591324200913242
Precision: 0.5506849315068493
Recall: 0.9305555555555556
F1 Score: 0.6919104991394148
ROC AUC: 0.6536953620286955
------------------------------
Model: SVC
Accuracy: 0.5662100456621004
Precision: 0.5347593582887701
Recall: 0.9259259259259259
F1 Score: 0.6779661016949152
ROC AUC: 0.6158658658658659
------------------------------
Model: KNeighborsClassifier
Accuracy: 0.6027397260273972
Preci

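**Comparing the scores, the feature transformation clearly helped the kernel-based and distance-based models: accuracy rose for every `SVC` variant (for example 0.68 → 0.71 with the RBF kernel and 0.62 → 0.72 with the degree-2 polynomial kernel) and for KNN (0.60 → 0.66). Logistic Regression stayed roughly level (0.715 → 0.712 accuracy), but it no longer triggers the convergence warning seen on the unprocessed data.**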
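
The convergence warning in the unprocessed run points at feature scaling. One common remedy, sketched here on synthetic data with deliberately mismatched feature scales (an assumption for illustration; this is not the pre-processing used to build `final_df_qt`), is to place a `StandardScaler` in front of the model inside a pipeline, so that the scaler is fitted on the training split only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data whose columns span several orders of magnitude (assumption)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X = X * np.logspace(0, 4, num=X.shape[1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# The scaler learns its mean/variance from the training split only
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_train, y_train)

scaled_accuracy = scaled_lr.score(X_test, y_test)
n_iterations = int(scaled_lr.named_steps["logisticregression"].n_iter_[0])
print("Accuracy:", scaled_accuracy)
print("Solver iterations used:", n_iterations)
```

Comparing `n_iterations` against `max_iter` is a quick way to see whether the solver stopped because it converged or because it ran out of iterations.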

#### Let's now try combined (polynomial) features to see if they can improve the baselines (~71%)

In [9]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

def model_fit_and_evaluate(X_train, X_test, y_train, y_test, model, use_poly=False, degree=2):
    steps = []

    if use_poly:
        steps.append(('poly', PolynomialFeatures(degree=degree)))
    
    steps.append(('model', model))
    pipeline = Pipeline(steps)
    
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)

    try:
        y_prob = pipeline.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, y_prob)
    except (AttributeError, ValueError):
        auc = None

    result =  {
        'Model': model.__class__.__name__ + (f' (poly deg={degree})' if use_poly else ''),
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred, zero_division=0),
        'Recall': recall_score(y_test, y_pred, zero_division=0),
        'F1 Score': f1_score(y_test, y_pred, zero_division=0),
        'ROC AUC': auc
    }

    if isinstance(model, SVC):
        #   Note: `degree` is only used by the 'poly' kernel; SVC ignores it for other kernels
        result.update({
            'kernel': model.kernel,
            'degree': model.degree,
        })

    return result

In [None]:
#   Explicit degree-2 combined features did not help: the linear models dropped to ~65-66% accuracy with them.
#   The best result is still SVC with a degree-2 polynomial kernel, at ~72%

results = []

linear_models = [
    LogisticRegression(max_iter=1000),
    SVC(kernel='linear', probability=True)
]

nonlinear_models = [
    SVC(kernel='rbf', probability=True),
    SVC(kernel='poly', degree=2, probability=True),
    SVC(kernel='poly', degree=3, probability=True),
    SVC(kernel='poly', degree=4, probability=True),
    KNeighborsClassifier()
]

# Linear models with and without poly
print('Starting with Linear models....')
for model in tqdm(linear_models):
    results.append(model_fit_and_evaluate(X_train, X_test, y_train, y_test, model, use_poly=False))     #   ~ 100 features
    results.append(model_fit_and_evaluate(X_train, X_test, y_train, y_test, model, use_poly=True, degree=2))  # 5151 features (incl. bias)
    # results.append(model_fit_and_evaluate(X_train, X_test, y_train, y_test, model, use_poly=True, degree=3))


# Non-linear models without poly
for model in tqdm(nonlinear_models):
    results.append(model_fit_and_evaluate(X_train, X_test, y_train, y_test, model, use_poly=False))

df_results = pd.DataFrame(results)
df_results.sort_values(by='Accuracy', ascending=False, inplace=True)

df_results

Starting with Linear models....


100%|██████████| 2/2 [02:19<00:00, 69.81s/it]
100%|██████████| 5/5 [00:09<00:00,  1.86s/it]


|   | Model | Accuracy | Precision | Recall | F1 Score | ROC AUC | kernel | degree |
|---|-------|----------|-----------|--------|----------|---------|--------|--------|
| 5 | SVC | 0.721461 | 0.711712 | 0.731481 | 0.721461 | 0.783262 | poly | 2.0 |
| 2 | SVC | 0.714612 | 0.702222 | 0.731481 | 0.716553 | 0.780197 | linear | 3.0 |
| 0 | LogisticRegression | 0.712329 | 0.699115 | 0.731481 | 0.714932 | 0.781636 |  |  |
| 4 | SVC | 0.712329 | 0.693966 | 0.74537 | 0.71875 | 0.79468 | rbf | 3.0 |
| 6 | SVC | 0.689498 | 0.676991 | 0.708333 | 0.692308 | 0.768456 | poly | 3.0 |
| 7 | SVC | 0.682648 | 0.672646 | 0.694444 | 0.683371 | 0.760761 | poly | 4.0 |
| 1 | LogisticRegression (poly deg=2) | 0.657534 | 0.646018 | 0.675926 | 0.660633 | 0.697823 |  |  |
| 8 | KNeighborsClassifier | 0.655251 | 0.620818 | 0.773148 | 0.68866 | 0.713536 |  |  |
| 3 | SVC (poly deg=2) | 0.646119 | 0.639269 | 0.648148 | 0.643678 | 0.676885 | linear | 3.0 |


**Thus, among our baselines, `SVC` with a `polynomial kernel` and `degree=2` gave the best results.**
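
To put the explicit polynomial expansion in perspective, the number of columns it produces can be worked out directly. With scikit-learn's `PolynomialFeatures` defaults (`interaction_only=False`, `include_bias=True`), `n` input features at degree `d` yield `C(n + d, d)` output features. The helper name below is hypothetical and exists only for this example.

```python
from math import comb


def n_polynomial_features(n_features: int, degree: int, include_bias: bool = True) -> int:
    """Count all monomials of total degree <= `degree` in `n_features` variables."""
    total = comb(n_features + degree, degree)
    return total if include_bias else total - 1


print(n_polynomial_features(100, 2))  # 5151
print(n_polynomial_features(100, 3))  # 176851
```

With 1750 training rows, the degree-2 expansion already has roughly three times as many columns as samples, which may explain why the linear models performed worse with it than without it.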

1. Now we move to the tree-based ensemble methods, such as Random Forest, XGBoost, LightGBM and CatBoost. An important thing to note is that data transformations such as scaling or quantile transforms tend to bring little benefit for tree-based models, because their splits depend on the ordering of feature values rather than on their scale. In some cases such transformations can even degrade performance.

2. That's why we use `final_df_not_qt` as our data here. Notably, the (numerical + other) features are not quantiled in this dataset, and the remaining features are not standardized either, which makes it a good fit for tree-based models.

3. It is also not recommended to create polynomial features for tree-based ensembles. These models are capable of capturing interactions and nonlinearities on their own.
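
The point about transformations can be checked on a small example. The sketch below fits the same decision tree on raw features and on a strictly increasing transform of them (`log1p`), then compares the predictions on the training rows. The synthetic integer-valued data and the labelling rule are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic integer-valued features and a simple labelling rule (assumption)
rng = np.random.default_rng(0)
X_raw = rng.integers(0, 50, size=(300, 5)).astype(float)
y = (X_raw[:, 0] + X_raw[:, 1] > 50).astype(int)

# A strictly increasing transform changes the values but keeps their ordering
X_log = np.log1p(X_raw)

tree_raw = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_raw, y)
tree_log = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_log, y)

# Both trees should partition the training rows in the same way
pred_raw = tree_raw.predict(X_raw)
pred_log = tree_log.predict(X_log)
agreement = float((pred_raw == pred_log).mean())
print("Share of training rows with identical predictions:", agreement)
```

Because each split only asks whether a value falls on one side of a threshold, an order-preserving transform moves the thresholds without changing which training rows end up in which leaf. Predictions on unseen rows can still differ slightly, since thresholds are placed between observed values.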

#### Performing train-test split

In [11]:
df = final_df_not_qt.copy()
X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1), df['target'], test_size=0.20, shuffle=True, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(1750, 100) (438, 100) (1750,) (438,)


#### Tree-based ensemble model fitting

In [13]:
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostClassifier

In [None]:
def evaluate_tree_model(X_train, X_test, y_train, y_test, model, param_desc=''):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    try:
        y_prob = model.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, y_prob)
    except (AttributeError, ValueError):
        auc = None

    return {
        'Model': model.__class__.__name__,
        'Params': param_desc,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred, zero_division=0),
        'Recall': recall_score(y_test, y_pred, zero_division=0),
        'F1 Score': f1_score(y_test, y_pred, zero_division=0),
        'ROC AUC': auc
    }

def run_tree_model_grid(X_train, X_test, y_train, y_test):
    results = []

    param_grid = {
        'max_depth': [3, 5, 7],
        'n_estimators': [250, 450, 750],
        'learning_rate': [0.05, 0.1],  # Only for boosting models
    }

    rf_params = [(d, n) for d in param_grid['max_depth'] for n in param_grid['n_estimators']]
    boosting_params = [(d, n, lr) for d in param_grid['max_depth'] for n in param_grid['n_estimators'] for lr in param_grid['learning_rate']]

    print("Running Random Forest...")
    for d, n in tqdm(rf_params):
        model = RandomForestClassifier(max_depth=d, n_estimators=n, random_state=42)
        desc = f'max_depth={d}, n_estimators={n}'
        results.append(evaluate_tree_model(X_train, X_test, y_train, y_test, model, desc))

    print("Running XGBoost...")
    for d, n, lr in tqdm(boosting_params):
        model = xgb.XGBClassifier(max_depth=d, n_estimators=n, learning_rate=lr,
                                   eval_metric='logloss', random_state=42)
        desc = f'max_depth={d}, n_estimators={n}, lr={lr}'
        results.append(evaluate_tree_model(X_train, X_test, y_train, y_test, model, desc))

    print("Running LightGBM...")
    for d, n, lr in tqdm(boosting_params):
        model = lgb.LGBMClassifier(max_depth=d, n_estimators=n, learning_rate=lr,
                                force_col_wise=True, verbosity=-1, random_state=42)
        desc = f'max_depth={d}, n_estimators={n}, lr={lr}'
        results.append(evaluate_tree_model(X_train, X_test, y_train, y_test, model, desc))

    print("Running CatBoost...")
    for d, n, lr in tqdm(boosting_params):
        model = CatBoostClassifier(depth=d, iterations=n, learning_rate=lr,
                                   verbose=0, random_state=42)
        desc = f'depth={d}, iterations={n}, lr={lr}'
        results.append(evaluate_tree_model(X_train, X_test, y_train, y_test, model, desc))

    df_results = pd.DataFrame(results)
    df_results.sort_values(by='Accuracy', ascending=False, inplace=True)
    
    return df_results

In [24]:
df_tree_results = run_tree_model_grid(X_train, X_test, y_train, y_test)
df_tree_results

Running Random Forest...


100%|██████████| 9/9 [00:16<00:00,  1.87s/it]


Running XGBoost...


100%|██████████| 18/18 [00:41<00:00,  2.32s/it]


Running LightGBM...


100%|██████████| 18/18 [00:11<00:00,  1.55it/s]


Running CatBoost...


100%|██████████| 18/18 [01:27<00:00,  4.87s/it]


|    | Model | Params | Accuracy | Precision | Recall | F1 Score | ROC AUC |
|----|-------|--------|----------|-----------|--------|----------|---------|
| 45 | CatBoostClassifier | depth=3, iterations=250, lr=0.05 | 0.751142 | 0.735683 | 0.773148 | 0.753950 | 0.813167 |
| 9 | XGBClassifier | max_depth=3, n_estimators=250, lr=0.05 | 0.748858 | 0.736607 | 0.763889 | 0.750000 | 0.806953 |
| 6 | RandomForestClassifier | max_depth=7, n_estimators=250 | 0.748858 | 0.734513 | 0.768519 | 0.751131 | 0.810540 |
| 27 | LGBMClassifier | max_depth=3, n_estimators=250, lr=0.05 | 0.748858 | 0.734513 | 0.768519 | 0.751131 | 0.808037 |
| 4 | RandomForestClassifier | max_depth=5, n_estimators=450 | 0.746575 | 0.733333 | 0.763889 | 0.748299 | 0.805535 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 32 | LGBMClassifier | max_depth=3, n_estimators=750, lr=0.1 | 0.710046 | 0.692641 | 0.740741 | 0.715884 | 0.795796 |
| 42 | LGBMClassifier | max_depth=7, n_estimators=450, lr=0.1 | 0.707763 | 0.692982 | 0.731481 | 0.711712 | 0.799383 |
| 56 | CatBoostClassifier | depth=5, iterations=750, lr=0.1 | 0.707763 | 0.703704 | 0.703704 | 0.703704 | 0.799383 |
| 36 | LGBMClassifier | max_depth=5, n_estimators=450, lr=0.1 | 0.705479 | 0.688312 | 0.736111 | 0.711409 | 0.795942 |


**From the above evaluations, CatBoost (`depth=3, iterations=250, lr=0.05`) is now the best-performing model at ~75% accuracy, ahead of the previous best baseline (~72%). XGBoost, Random Forest and LightGBM follow within about a quarter of a percentage point.**
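
Every configuration in the grid above is ranked on the same 438-row test split, so the top score is likely to be a slightly optimistic estimate. One possible alternative, sketched here on synthetic data with a deliberately small grid (both are assumptions for illustration, not the setup used in this notebook), is to select hyperparameters by cross-validation on the training split and score the test split only once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the tree-model dataset (assumption)
X, y = make_classification(n_samples=400, n_features=20, n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# 5-fold cross-validation on the training split picks the configuration
search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid={"max_depth": [3, 5], "n_estimators": [50, 100]},
    scoring="accuracy",
    cv=5,
)
search.fit(X_train, y_train)

best_params = search.best_params_
cv_accuracy = search.best_score_               # mean accuracy across the 5 folds
test_accuracy = search.score(X_test, y_test)   # single final check on held-out rows
print(best_params, cv_accuracy, test_accuracy)
```

With differences between the top models as small as those in the table, the ranking could plausibly change under a different split, which is what the fold-averaged score helps to smooth out.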

1. Now we move to a new type of model: probabilistic models. We saw in the pre-processing section that, after transformation, most of the (numerical + other) features were approximately Gaussian in their distribution.

2. Gaussian Processes (GP) can be a good choice for such data, since they are non-parametric and built on Gaussian distributions. Along with predicting the class, they also provide an uncertainty estimate for each prediction.

3. We can use a hybrid kernel design that includes an ARD (Automatic Relevance Determination) kernel, which learns a separate length scale per feature. This gives us a natural form of feature selection and the ability to handle both numerical and categorical features.
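
A minimal sketch of the ARD idea, assuming scikit-learn's `GaussianProcessClassifier` and an anisotropic `RBF` kernel (one length scale per feature). The synthetic data and the kernel choice are assumptions for illustration; this is not the hybrid kernel described above and does not treat categorical features specially.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF, ConstantKernel
from sklearn.model_selection import train_test_split

# Small synthetic problem: 2 informative features, 2 noise features (assumption)
X, y = make_classification(
    n_samples=150, n_features=4, n_informative=2, n_redundant=0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# Passing an array of length scales makes the RBF kernel anisotropic (ARD-style)
n_features = X_train.shape[1]
ard_kernel = ConstantKernel(1.0) * RBF(length_scale=np.ones(n_features))

gpc = GaussianProcessClassifier(kernel=ard_kernel, random_state=42)
gpc.fit(X_train, y_train)

# Class probabilities act as a per-prediction uncertainty estimate
class_probabilities = gpc.predict_proba(X_test)

# Fitted per-feature length scales; a large value suggests the feature matters less
learned_length_scales = gpc.kernel_.k2.length_scale
print(class_probabilities[:3])
print(learned_length_scales)
```

Exact GP inference scales cubically with the number of training rows, so fitting it on all 1750 training rows with 100 features would be noticeably slower than this toy example.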