### Description:
This notebook uses **CatBoost** and **LGBM** to see how they perform on the Heart Disease prediction dataset. It follows a few earlier trials with a **deep-learning** model and an **XGBoost** model (you can check those out on my GitHub page or on my Kaggle profile).
The best hyperparameter sets are found using **Optuna**. I hope you find this helpful.

## Importing the dataset:

In [20]:
import pandas as pd

train_path = '/kaggle/input/competitions/playground-series-s6e2/train.csv'
train_df = pd.read_csv(train_path)
train_df.head() #checking if the data has loaded safely

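| | id | Age | Sex | Chest pain type | BP | Cholesterol | FBS over 120 | EKG results | Max HR | Exercise angina | ST depression | Slope of ST | Number of vessels fluro | Thallium | Heart Disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 58 | 1 | 4 | 152 | 239 | 0 | 0 | 158 | 1 | 3.6 | 2 | 2 | 7 | Presence |
| 1 | 1 | 52 | 1 | 1 | 125 | 325 | 0 | 2 | 171 | 0 | 0.0 | 1 | 0 | 3 | Absence |
| 2 | 2 | 56 | 0 | 2 | 160 | 188 | 0 | 2 | 151 | 0 | 0.0 | 1 | 0 | 3 | Absence |
| 3 | 3 | 44 | 0 | 3 | 134 | 229 | 0 | 2 | 150 | 0 | 1.0 | 2 | 0 | 3 | Absence |
| 4 | 4 | 58 | 1 | 4 | 140 | 234 | 0 | 2 | 125 | 1 | 3.8 | 2 | 3 | 3 | Presence |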


In [21]:
target = 'Heart Disease' #name of the target column, stored once so it can be reused below

## Encoding the target:
This section covers not only encoding the target but also separating the target from the other features and removing redundant features.

In [22]:
train_df[target] = train_df[target].map({'Absence' : 0, 'Presence' : 1})
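One thing worth checking after a dict-based `map`: any label that is missing from the dict silently becomes NaN instead of raising an error. Here is a minimal sketch of that behavior and a simple guard. The `labels` Series is made up for illustration (it is not the competition data), and the lowercase `'presence'` is just a simulated typo.

```python
import pandas as pd

# Toy labels for illustration only; 'presence' simulates an unexpected spelling
labels = pd.Series(['Absence', 'Presence', 'presence'])

mapping = {'Absence': 0, 'Presence': 1}
encoded = labels.map(mapping)  # labels that are not keys of the dict become NaN

# Count the values that failed to map so the problem is caught before training
n_unmapped = int(encoded.isna().sum())
print(n_unmapped)
```

If `n_unmapped` is greater than zero, it is probably safer to inspect the raw labels before going any further.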

In [23]:
#removing 'id' from the df:
train_df = train_df.drop('id', axis=1)

#Separating the target and features:
X = train_df.drop(target, axis=1)
y = train_df[target]

## Feature Engineering:
We'll be using the same feature engineering as in the previous iteration (where I tried XGB), but in addition we'll also try frequency encoding (a new concept to me as well, but it doesn't hurt to try). More about this further down the notebook.

In [24]:
def create_features(df):
    df = df.copy() #work on a copy so the caller's DataFrame is not modified in place
    #Feature crosses / interaction features (optional: some models are able to learn these relationships on their own)
    df['Y1'] = df['BP'] * df['Cholesterol']
    df['Y2'] = df['Number of vessels fluro'] * df['Slope of ST']
    df['Y3'] = df['Cholesterol'] * df['Slope of ST']
    df['Y4'] = df['Cholesterol'] * df['Number of vessels fluro']
    df['Y5'] = df['BP'] * df['Slope of ST']
    df['Y6'] = df['BP'] * df['Number of vessels fluro']

    return df

In [25]:
#creating the interaction features in the X:
X = create_features(X)

In [26]:
#Creating custom transformers

from sklearn.base import BaseEstimator, TransformerMixin

#Binning Transformer
class Binning(BaseEstimator, TransformerMixin):
    def __init__(self, col_to_bin, num_bins, new_col_name ,labels=None):
        self.col_to_bin = col_to_bin
        self.num_bins = num_bins
        self.labels = labels
        self.new_col_name = new_col_name

    def fit(self, X, y=None):
        X = X.copy()
        _, self.bin_edges = pd.cut(X[self.col_to_bin], bins=self.num_bins, labels=False, retbins=True)
        return self

    def transform(self,X):
        X = X.copy() 
        X[self.new_col_name] = pd.cut(X[self.col_to_bin], bins=self.bin_edges, labels=False)
        return X

#GroupMean Transformer
class GroupMeanEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, groupby_col, agg_col, new_col_name):
        self.groupby_col = groupby_col
        self.agg_col = agg_col
        self.new_col_name = new_col_name

    def fit(self,X,y=None):
        self.means = X.groupby(self.groupby_col,observed=True)[self.agg_col].mean()
        return self

    def transform(self,X):
        X = X.copy()
        X[self.new_col_name] = X[self.groupby_col].map(self.means)
        return X
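To make the fit/transform split of the binning step a bit more concrete, here is a small sketch on made-up ages (`train_age` and `test_age` are illustrative, not taken from the dataset). It shows that the bin edges are learned once on the training values and then reused, and that a value outside the learned range ends up as NaN.

```python
import pandas as pd

train_age = pd.Series([30, 40, 50, 60, 70])
test_age = pd.Series([35, 65, 90])  # 90 lies above the training maximum

# Learn 3 equal-width bins on the training values and keep the edges
train_bins, bin_edges = pd.cut(train_age, bins=3, labels=False, retbins=True)

# Reuse the learned edges on new data; values outside the edges become NaN
test_bins = pd.cut(test_age, bins=bin_edges, labels=False)

print(train_bins.tolist())
print(test_bins.tolist())
```

Tree-based models such as CatBoost and LGBM can generally cope with NaN inputs, but it is worth knowing that the group means mapped from such a bin will also be NaN.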

###  Frequency Encoding:
Frequency Encoding is a way to convert a categorical feature into numbers by replacing each category with how often it appears in the dataset.
We'll be using a custom transformer as above for this purpose as well.

In [27]:
class FreqEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, cat_cols, normalize=True):
        self.cat_cols = cat_cols
        self.normalize = normalize

    def fit(self, X, y=None):
        #learned state is created in fit (not __init__), so every refit starts from a clean mapping
        self.freq_maps_ = {}
        for col in self.cat_cols:
            self.freq_maps_[col] = X[col].value_counts(normalize=self.normalize)
        return self

    def transform(self, X):
        X = X.copy()

        for col in self.cat_cols:
            X[col + '_freq'] = X[col].map(self.freq_maps_[col])
            X[col + '_freq'] = X[col + '_freq'].fillna(0) #to handle unseen categories
        return X

If a category appears in the test data but was never seen in the training data, map() returns NaN for it, which is bad.

Filling with 0 tells the model that this category never appeared in the training data.
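A tiny worked example may help here. The two Series below are made up for illustration: the frequencies are learned from the training column only, and the unseen category `'c'` in the test column falls back to 0.

```python
import pandas as pd

train_col = pd.Series(['a', 'a', 'a', 'b'])
test_col = pd.Series(['a', 'b', 'c'])  # 'c' never appears in the training column

# Relative frequency of each category, learned from the training column only
freq_map = train_col.value_counts(normalize=True)

# Map the learned frequencies onto the test column; unseen categories become NaN, then 0
test_freq = test_col.map(freq_map).fillna(0)
print(test_freq.tolist())  # [0.75, 0.25, 0.0]
```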

## Building data preprocessing pipelines:

In [28]:
cat_cols = ['Sex','Chest pain type','FBS over 120','Exercise angina','EKG results'] #to pass in FreqEncoder

In [29]:
from sklearn.pipeline import Pipeline 

preprocessor = Pipeline([
    ('Binning', Binning(col_to_bin='Age', num_bins=3, new_col_name='Age_bins')),
    ('GroupMeanEncoder_BP', GroupMeanEncoder(groupby_col='Age_bins', agg_col='BP', new_col_name='X1')),
    ('GroupMeanEncoder_Cholesterol', GroupMeanEncoder(groupby_col='Age_bins', agg_col='Cholesterol', new_col_name='X2')),
    ('GroupMeanEncoder_HR', GroupMeanEncoder(groupby_col='Age_bins', agg_col='Max HR', new_col_name='X3')),
    ('FreqEncoding', FreqEncoder(cat_cols=cat_cols))
])
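A side benefit of putting the preprocessing inside a Pipeline is that cross-validation refits it on each training fold, so statistics such as group means and frequencies are never learned from the validation rows. The sketch below is only an illustration of that behavior: `RecordFitSize` is a hypothetical do-nothing transformer, the data is synthetic, and LogisticRegression is just a lightweight stand-in for the real models.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

fit_sizes = []  # number of rows seen by each call to fit

class RecordFitSize(TransformerMixin, BaseEstimator):
    """Hypothetical pass-through transformer that records how many rows it was fitted on."""
    def fit(self, X, y=None):
        fit_sizes.append(len(X))
        return self

    def transform(self, X):
        return X

# Synthetic data: 100 rows, 3 features, binary target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

pipe = Pipeline([('record', RecordFitSize()), ('clf', LogisticRegression())])
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='roc_auc')

# Each fit sees only the training part of a fold, never all 100 rows
print(fit_sizes)
```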

## Finding best Hyperparameter set (via **Optuna**):

In [30]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning) #FutureWarnings are noisy, so this suppresses them

### CatBoost:

In [31]:
import optuna 
from sklearn.model_selection import cross_val_score
from catboost import CatBoostClassifier

def objective_cb(trial):
    params = {
        'iterations': trial.suggest_int('iterations', 500, 2000),
        'depth': trial.suggest_int('depth', 4, 8),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1, 50, log=True),
    
        'task_type': 'GPU',
        'devices': '0',
        'verbose': False,
        'random_seed': 42,
    }

    model = Pipeline([
        ('prep', preprocessor),
        ('catboost', CatBoostClassifier(**params))
    ])

    score = cross_val_score(model, X, y, cv=5, scoring='roc_auc').mean()

    return score
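A short note on `log=True` in the search space above: it asks for values to be sampled from the log domain, so small learning rates get as much attention as large ones. The sketch below is only a conceptual illustration using the standard library, not Optuna's internal code, and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(42)
samples = [sample_log_uniform(0.01, 0.3, rng) for _ in range(10_000)]

# About half of the draws fall below the geometric midpoint sqrt(0.01 * 0.3), roughly 0.055.
# A plain uniform draw over [0.01, 0.3] would put only about 15% of its draws there.
geo_mid = math.sqrt(0.01 * 0.3)
share_below = sum(s < geo_mid for s in samples) / len(samples)
print(round(share_below, 2))
```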


In [32]:
# study_cb = optuna.create_study(direction='maximize') #keeps track of all hyperparameter trials
# study_cb.optimize(objective_cb, n_trials=75) #Run the objective function 75 times with different hyperparameters.

In [33]:
# print(study_cb.best_params) #best hyperparameters
# print(study_cb.best_value) #best score

First run score and best parameters:

* Score = 0.9553924958148589
* Hyperparameter set = {'iterations': 1254, 'depth': 4, 'learning_rate': 0.09787901496322517, 'l2_leaf_reg': 48.73782544764864}

Sadly, unlike GridSearchCV, Optuna does not give you a best model trained on the full data, so we'll do that manually with the best parameters. Optuna takes a long time to find the best hyperparameters, so running it again and again is expensive and not a good idea; running it once or twice is enough in my opinion.
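The manual refit boils down to the pattern sketched below. To keep it light and runnable anywhere, it uses synthetic data and scikit-learn's GradientBoostingClassifier as a stand-in for CatBoost/LGBM, and the `best_params` dict is hypothetical (in a real run it would come from `study.best_params`).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data; the notebook itself would use X and y
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] - X_demo[:, 1] > 0).astype(int)

# Hypothetical result of a finished search (with Optuna: study.best_params)
best_params = {'n_estimators': 50, 'max_depth': 2, 'learning_rate': 0.1}

# Settings that were fixed during the search are not part of best_params, so add them back
final_model = GradientBoostingClassifier(**best_params, random_state=42)
final_model.fit(X_demo, y_demo)  # fit once on the full training data

proba = final_model.predict_proba(X_demo)[:, 1]
```

Unpacking the dict with `**best_params` avoids copying the tuned values by hand, which is where typos tend to creep in.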

### LGBM:

In [34]:
from lightgbm import LGBMClassifier

def objective_lgbm(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 500, 2000),
        "learning_rate": trial.suggest_float("learning_rate", 0.005, 0.1, log=True),
        
        "num_leaves": trial.suggest_int("num_leaves", 16, 128),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        
        "min_child_samples": trial.suggest_int("min_child_samples", 10, 100),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
        
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-3, 10, log=True),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-3, 10, log=True),
        
        "random_state": 42,
        "n_jobs": -1,
        "verbose":-1
    }

    model = Pipeline([
        ('prep', preprocessor),
        ('LGBM', LGBMClassifier(**params))
    ])

    score = cross_val_score(model, X, y, cv=5, scoring='roc_auc').mean()

    return score

In [35]:
# study_lgbm = optuna.create_study(direction='maximize')
# study_lgbm.optimize(objective_lgbm, n_trials=50)

In [36]:
# print(study_lgbm.best_params)
# print(study_lgbm.best_value)

First run score and best parameters:

* Score = 0.9553766302150766
* Hyperparameter set = {'n_estimators': 1712, 'learning_rate': 0.02743719738580626, 'num_leaves': 24, 'max_depth': 4, 'min_child_samples': 29, 'subsample': 0.7009221068214425, 'colsample_bytree': 0.6046253918162702, 'reg_alpha': 0.028014049796877397, 'reg_lambda': 0.00813793499748922}

## Retraining the respective models:
Training the models again with their respective best hyperparameter sets on the full training dataset and then using those models to make the final predictions.

### CatBoost:
Freezing this as our final CatBoost model


In [37]:
final_cb_model = Pipeline([  
        ('prep', preprocessor),
        ('catboost', CatBoostClassifier(
            iterations=1254,
            depth= 4, 
            learning_rate= 0.09787901496322517, 
            l2_leaf_reg= 48.73782544764864,
            task_type= 'GPU',
            devices= '0',
            verbose= False,
            random_seed= 42,))
    ])

In [38]:
final_cb_model.fit(X,y)

### LGBM:
Freezing this as our final LGBM model


In [39]:
final_lgbm_model = Pipeline([
        ('prep', preprocessor),
        ('LGBM', LGBMClassifier(
            n_estimators= 1712,
            learning_rate= 0.02743719738580626,
            num_leaves= 24,
            max_depth= 4,
            min_child_samples= 29,
            subsample= 0.7009221068214425,
            colsample_bytree= 0.6046253918162702,
            reg_alpha= 0.028014049796877397,
            reg_lambda= 0.00813793499748922,
            random_state= 42,
            n_jobs= -1,
            verbose= -1
        ))
    ])

In [40]:
final_lgbm_model.fit(X,y)

## Preparing the test data:

In [41]:
test_path = '/kaggle/input/competitions/playground-series-s6e2/test.csv'
test_df = pd.read_csv(test_path)
test_df.head() #checking if the file has loaded safely

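| | id | Age | Sex | Chest pain type | BP | Cholesterol | FBS over 120 | EKG results | Max HR | Exercise angina | ST depression | Slope of ST | Number of vessels fluro | Thallium |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 630000 | 58 | 1 | 3 | 120 | 288 | 0 | 2 | 145 | 1 | 0.8 | 2 | 3 | 3 |
| 1 | 630001 | 55 | 0 | 2 | 120 | 209 | 0 | 0 | 172 | 0 | 0.0 | 1 | 0 | 3 |
| 2 | 630002 | 54 | 1 | 4 | 120 | 268 | 0 | 0 | 150 | 1 | 0.0 | 2 | 3 | 7 |
| 3 | 630003 | 44 | 0 | 3 | 112 | 177 | 0 | 0 | 168 | 0 | 0.9 | 1 | 0 | 3 |
| 4 | 630004 | 43 | 1 | 1 | 138 | 267 | 0 | 0 | 163 | 0 | 1.8 | 2 | 0 | 7 |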


In [44]:
# Removing redundant features
X_test = test_df.drop('id', axis=1)

#applying the create_features on test set
X_test = create_features(X_test)

## Predicting on the test set:

In [45]:
y_pred_cb = final_cb_model.predict_proba(X_test)[:,1] #CatBoost predictions

y_pred_lgbm = final_lgbm_model.predict_proba(X_test)[:,1] #LGBM predictions
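About the `[:,1]` indexing: in scikit-learn-style classifiers the columns of `predict_proba` follow the order of `classes_`, so with a 0/1 target column 1 holds the probability of class 1 (Presence). The sketch below shows this on a toy LogisticRegression; I am assuming the CatBoost and LightGBM wrappers follow the same convention.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: one feature, target encoded as 0/1
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X_demo, y_demo)
proba = clf.predict_proba(X_demo)  # shape (n_samples, n_classes)

# Columns follow clf.classes_, so look up the positive class instead of assuming its position
pos_col = list(clf.classes_).index(1)
p_positive = proba[:, pos_col]
print(pos_col)
```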

## Preparing submission CSVs

In [48]:
#CatBoost submission
submission_cb = pd.DataFrame({
    'id': test_df['id'],
    target: y_pred_cb
})

#LGBM submission
submission_lgbm = pd.DataFrame({
    'id':  test_df['id'],
    target: y_pred_lgbm
})

In [49]:
submission_cb.to_csv('submission.csv', index=False) #CB
# submission_lgbm.to_csv('submission_lgbm.csv', index=False) #LGBM
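Before uploading, a few cheap checks on the submission frame can save a wasted submission. This is a minimal sketch on toy values: `check_submission` is a hypothetical helper, and the expected layout (an `id` column plus a `Heart Disease` probability column) is simply taken from the submission cell above.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for test_df['id'] and the predicted probabilities
ids = pd.Series([630000, 630001, 630002], name='id')
preds = np.array([0.91, 0.08, 0.47])
submission = pd.DataFrame({'id': ids, 'Heart Disease': preds})

def check_submission(df, expected_rows):
    """Return a list of problems found; an empty list means the basic checks passed."""
    if list(df.columns) != ['id', 'Heart Disease']:
        return ['unexpected columns']  # the remaining checks rely on these columns
    problems = []
    if len(df) != expected_rows:
        problems.append('row count mismatch')
    if df['id'].duplicated().any():
        problems.append('duplicate ids')
    if df['Heart Disease'].isna().any():
        problems.append('missing predictions')
    if not df['Heart Disease'].between(0, 1).all():
        problems.append('predictions outside [0, 1]')
    return problems

problems = check_submission(submission, expected_rows=3)
print(problems)
```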

The CatBoost model performed slightly better than the LGBM model:

Scores:
* (CB:0.95355)
* (LGBM:0.95353)

These scores are much better than those of my previous iteration's XGBoost model, which I tuned using GridSearchCV.

## Key Takeaways:
* Optuna is good for finding the best set of hyperparameters but has its own disadvantages:
  1. It takes a lot of time
  2. If you are thinking of running it in industry, the processing would be quite expensive, but the results are worth it
* Investing time in feature engineering yields good returns most of the time


If you found this notebook helpful or informative in any way, please leave an upvote to appreciate the time and effort put into it. (Also, sorry for any typos xD)

Thanks for reading it all the way!!! 