## Importing dataset:

In [1]:
import pandas as pd

train_path = '/kaggle/input/competitions/playground-series-s6e2/train.csv'
train_df = pd.read_csv(train_path)
train_df.head()  # check that the dataset loaded correctly

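| | id | Age | Sex | Chest pain type | BP | Cholesterol | FBS over 120 | EKG results | Max HR | Exercise angina | ST depression | Slope of ST | Number of vessels fluro | Thallium | Heart Disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 58 | 1 | 4 | 152 | 239 | 0 | 0 | 158 | 1 | 3.6 | 2 | 2 | 7 | Presence |
| 1 | 1 | 52 | 1 | 1 | 125 | 325 | 0 | 2 | 171 | 0 | 0.0 | 1 | 0 | 3 | Absence |
| 2 | 2 | 56 | 0 | 2 | 160 | 188 | 0 | 2 | 151 | 0 | 0.0 | 1 | 0 | 3 | Absence |
| 3 | 3 | 44 | 0 | 3 | 134 | 229 | 0 | 2 | 150 | 0 | 1.0 | 2 | 0 | 3 | Absence |
| 4 | 4 | 58 | 1 | 4 | 140 | 234 | 0 | 2 | 125 | 1 | 3.8 | 2 | 3 | 3 | Presence |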



No EDA is performed in this iteration since it was already covered in the first iteration of this project (the earlier notebooks are available on my profile and on my GitHub).

In [2]:
target = 'Heart Disease'  # small but important: keep the target column name in one variable

## Reducing the dtypes (quantization):
Although dtype reduction (e.g., float64 → float32) can reduce memory usage, this is mainly useful in memory-constrained environments such as edge devices. In Kaggle’s environment, memory is sufficient, and reducing precision may introduce unnecessary numerical rounding. Since our goal is maximizing accuracy rather than minimizing memory, we retain the original dtypes.
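
To make the trade-off concrete, here is a minimal sketch on a hypothetical toy frame (not the competition data). It compares memory before and after downcasting and measures the rounding error that float32 introduces. The column names are borrowed from the dataset, but the values are random.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for two numeric columns of train_df
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'ST depression': rng.uniform(0, 6, size=1000).round(1),  # float64 by default
    'Cholesterol': rng.integers(120, 400, size=1000),        # int64 by default
})

before = toy.memory_usage(deep=True).sum()

# Downcast: float64 -> float32 and int64 -> int16
small = toy.astype({'ST depression': 'float32', 'Cholesterol': 'int16'})
after = small.memory_usage(deep=True).sum()

# Largest absolute rounding error caused by storing the floats as float32
max_err = (toy['ST depression'] - small['ST depression'].astype('float64')).abs().max()

print('bytes before:', before, 'bytes after:', after, 'max rounding error:', max_err)
```

On data of this scale the rounding error is tiny, so the decision mostly comes down to whether memory is actually a constraint.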

## Feature Engineering:
This section tries several feature engineering methods to create new, more informative features. It covers:
* Putting numerical features into bins (converting numerical to categorical)
* Grouping by categorical features and aggregating statistics of numerical features
* Trying some interaction features / feature crosses
* Performing target encoding (more about it later)

In [3]:
# Wrap the feature engineering in a reusable function so it can be applied to both train and test data:
def create_features(df):
    # Remove the redundant 'id' column:
    df = df.drop('id',axis=1)
    
    # Putting Age into 3 bins (disabled: handled by a custom transformer, see the notes below)
    # df['age_bins'] = pd.cut(df['Age'],bins=3,labels=['moderate','middle','old']) 

    # Grouping numerical columns by Age and calculating the aggregate statistic (disabled for the same reason):
    # df['X1'] = df.groupby('age_bins')['BP'].transform('mean')   
    # df['X2'] = df.groupby('age_bins')['Cholesterol'].transform('mean') 
    # df['X3'] = df.groupby('age_bins')['Max HR'].transform('mean') 

    # Feature crosses / interaction features (optional: XGBoost can learn these relationships on its own)
    df['Y1'] = df['BP'] * df['Cholesterol']
    df['Y2'] = df['Number of vessels fluro'] * df['Slope of ST']
    df['Y3'] = df['Cholesterol'] * df['Slope of ST']
    df['Y4'] = df['Cholesterol'] * df['Number of vessels fluro']
    df['Y5'] = df['BP'] * df['Slope of ST']
    df['Y6'] = df['BP'] * df['Number of vessels fluro']

    return df

* Binning Age separately on the train and test data causes problems: the bin edges are computed from each dataset's own min and max, so nothing guarantees that the same edges are used for both. The custom transformer in the listing below mitigates this by learning the edges once and reusing them.
* Grouping numerical columns by Age and calculating the aggregate statistics causes data leakage if it is performed before splitting the train and validation data. A better approach is a custom transformer, covered in the listings below.

In [4]:
#creating the interaction features in the train_df:
train_df = create_features(train_df)

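When a transformer is a step in a scikit-learn Pipeline, calling fit() on the pipeline does the following:

1.  Calls fit() on the transformer
2.  Then immediately calls transform() on that same training data, without an explicit call from you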
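
The call order above can be observed with a small sketch. `Recorder` is a hypothetical no-op transformer written only for this illustration. It logs which of its methods the pipeline invokes.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

calls = []  # records the order in which the pipeline invokes each method

class Recorder(BaseEstimator, TransformerMixin):
    """Hypothetical no-op transformer that only logs its method calls."""

    def fit(self, X, y=None):
        calls.append('fit')
        return self

    def transform(self, X):
        calls.append('transform')
        return X

# Two steps: the first is an intermediate step, the second is the final step
pipe = Pipeline([('first', Recorder()), ('second', Recorder())])
pipe.fit(pd.DataFrame({'a': [1, 2, 3]}))

print(calls)
```

The intermediate step is fitted and then used to transform the training data, while the final step is only fitted. This is why a custom transformer needs both methods before it can feed a model inside a pipeline.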

In [5]:
#Creating custom transformer for binning ages
from sklearn.base import BaseEstimator, TransformerMixin

class Binning(BaseEstimator, TransformerMixin):
    def __init__(self, col_to_bin, num_bins, new_col_name ,labels=None):
        self.col_to_bin = col_to_bin
        self.num_bins = num_bins
        self.labels = labels
        self.new_col_name = new_col_name

    def fit(self, X, y=None):
        # retbins=True also returns the bin edges computed on the training data;
        # they are stored so that transform() applies the same edges to validation and test data
        _, self.bin_edges = pd.cut(X[self.col_to_bin], bins=self.num_bins, labels=False, retbins=True)
        return self

    def transform(self,X):
        X = X.copy() 
        X[self.new_col_name] = pd.cut(X[self.col_to_bin], bins=self.bin_edges, labels=False)
        return X
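
As a rough illustration of what the transformer does internally, the sketch below learns bin edges on hypothetical training ages and reuses them on hypothetical test ages. The values are made up for the example. Note one consequence of fixed edges: an age outside the training range falls in no bin and becomes NaN, which XGBoost treats as a missing value.

```python
import pandas as pd

# Hypothetical ages, not taken from the competition data
train_age = pd.Series([30, 50, 70])
test_age = pd.Series([35, 45, 60, 80])  # 80 lies outside the training range

# "fit": learn three equal-width bins on the training data and keep the edges
_, edges = pd.cut(train_age, bins=3, labels=False, retbins=True)

# "transform": bin the test data with the stored training edges
test_bins = pd.cut(test_age, bins=edges, labels=False)

print(edges)
print(test_bins.tolist())
```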

In [6]:
# Custom transformer that adds the per-group mean of a numeric column as a new feature
# The means are learned from the training data in fit() and reused for validation and test data
# It can also be placed directly in a Pipeline, so it is refit on each training fold during CV

class GroupMeanEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, groupby_col, agg_col, new_col_name):
        self.groupby_col = groupby_col
        self.agg_col = agg_col
        self.new_col_name = new_col_name

    def fit(self,X,y=None):
        self.means = X.groupby(self.groupby_col,observed=True)[self.agg_col].mean()
        return self

    def transform(self,X):
        X = X.copy()
        X[self.new_col_name] = X[self.groupby_col].map(self.means)
        return X
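
The fit/transform split of this encoder boils down to a groupby followed by a map. The sketch below uses hypothetical toy frames and shows that the test rows receive the means computed on the training rows, and that a group never seen in training maps to NaN.

```python
import pandas as pd

# Hypothetical toy data: two age bins in train, one unseen bin (2) in test
train = pd.DataFrame({'Age_bins': [0, 0, 1, 1], 'BP': [120, 130, 140, 160]})
test = pd.DataFrame({'Age_bins': [0, 1, 2], 'BP': [999, 999, 999]})

# "fit": learn the group means from the training data only
means = train.groupby('Age_bins', observed=True)['BP'].mean()

# "transform": look up the training means; the test BP values are never used
test['X1'] = test['Age_bins'].map(means)

print(test)
```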

## Creating data preprocessing pipeline:
We'll use K-fold cross-validation with GridSearchCV to find the optimal hyperparameters for XGBoost, so the preprocessing is packaged as a pipeline that can be refit inside each fold.

In [7]:
from sklearn.pipeline import Pipeline

preprocessor = Pipeline([
    ('Binning', Binning(col_to_bin='Age', num_bins=3, new_col_name='Age_bins')),
    ('GroupMeanEncoder_BP', GroupMeanEncoder(groupby_col='Age_bins', agg_col='BP', new_col_name='X1')),
    ('GroupMeanEncoder_Cholesterol', GroupMeanEncoder(groupby_col='Age_bins', agg_col='Cholesterol', new_col_name='X2')),
    ('GroupMeanEncoder_HR', GroupMeanEncoder(groupby_col='Age_bins', agg_col='Max HR', new_col_name='X3')),
])

## Encoding the target:

In [8]:
train_df[target] = train_df[target].map({'Absence': 0, 'Presence': 1})
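
One thing worth checking after a dictionary-based map: any label that is not a key of the dictionary silently becomes NaN. The sketch below shows this on a hypothetical series with a deliberately misspelled label. The same `isna().sum()` check could be applied to the real target column.

```python
import pandas as pd

# Hypothetical labels; the last one is a deliberate typo (lowercase 'p')
labels = pd.Series(['Presence', 'Absence', 'Absence', 'presence'])

encoded = labels.map({'Absence': 0, 'Presence': 1})

# Labels missing from the dictionary are mapped to NaN rather than raising an error
n_unmapped = int(encoded.isna().sum())
print('unmapped labels:', n_unmapped)
```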

## Separating target and features:

In [9]:
X = train_df.drop(target, axis=1) #features
y = train_df[target] #target

## Preparing the model:
As mentioned above, we'll use GridSearchCV to find the best hyperparameters and then retrain the model with those parameters on the complete training data.

In [10]:
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

model = XGBClassifier(
    objective = 'binary:logistic',
    enable_categorical = True,
    eval_metric = 'auc',
    device = 'cuda',
    tree_method='hist',
    random_state = 42
)

final_pipeline = Pipeline([
    ('prep',preprocessor),
    ('model',model)
])

param_grid = {
    'model__max_depth': range(3,8),
    'model__colsample_bytree':[0.3,0.5,0.7,0.9],
    'model__learning_rate':[0.01,0.05,0.1],
    'model__subsample':[0.7,0.8],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid = GridSearchCV(
    estimator=final_pipeline,
    param_grid=param_grid,
    cv=cv,
    scoring='roc_auc',
    verbose=False,
    n_jobs=1
)
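
Before launching the search it helps to know how many models will be trained. A small sketch, assuming the same grid and the 5-fold setup defined above, counts the candidates with scikit-learn's `ParameterGrid`.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above (range written as a list for clarity)
param_grid = {
    'model__max_depth': list(range(3, 8)),
    'model__colsample_bytree': [0.3, 0.5, 0.7, 0.9],
    'model__learning_rate': [0.01, 0.05, 0.1],
    'model__subsample': [0.7, 0.8],
}

n_candidates = len(ParameterGrid(param_grid))  # 5 * 4 * 3 * 2 combinations
n_splits = 5                                   # folds used by StratifiedKFold above
total_fits = n_candidates * n_splits + 1       # +1 for the final refit on all data

print(n_candidates, 'candidates,', total_fits, 'fits in total')
```

With 601 pipeline fits running one at a time (`n_jobs=1`), a long training time is expected.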

In [11]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)  # silence FutureWarning messages to keep the output readable

In [12]:
#fitting data to the grid:
grid.fit(X,y)

print("best_score:",grid.best_score_)
print('best_parameters:',grid.best_params_)

best_score: 0.9549090916217624
best_parameters: {'model__colsample_bytree': 0.5, 'model__learning_rate': 0.1, 'model__max_depth': 7, 'model__subsample': 0.7}


After a long training run we finally have the best parameters and the CV score, and both look pretty solid. Note that max_depth=7 and learning_rate=0.1 sit at the upper edge of the searched grid, so a wider grid might score slightly higher. We don't have to retrain a new model on the complete training data because GridSearchCV does it for us: with its default refit=True, it finds the best set of parameters and then refits the pipeline with those parameters on the whole training set. The result can be retrieved with 'grid.best_estimator_'.

In [13]:
best_model = grid.best_estimator_ #best model with best set of hyperparameters found through GridSearchCV
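
The refit behaviour can be checked on a toy problem. The sketch below is not the notebook's model: it assumes a synthetic dataset from `make_classification` and swaps in `LogisticRegression` so that it runs in a moment on CPU. It only shows that `best_estimator_` is already fitted and ready to predict.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, not the competition data
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={'C': [0.1, 1.0, 10.0]},
    cv=3,
    scoring='roc_auc',
)  # refit=True is the default
search.fit(X_toy, y_toy)

# best_estimator_ was refit on all of X_toy with the best parameters found
best = search.best_estimator_
proba = best.predict_proba(X_toy)[:, 1]

print(search.best_params_, proba[:3])
```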

## Importing Test data:

In [14]:
test_path = '/kaggle/input/competitions/playground-series-s6e2/test.csv'
test_df  = pd.read_csv(test_path)
test_df.head()  # check that the test data loaded correctly

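| | id | Age | Sex | Chest pain type | BP | Cholesterol | FBS over 120 | EKG results | Max HR | Exercise angina | ST depression | Slope of ST | Number of vessels fluro | Thallium |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 630000 | 58 | 1 | 3 | 120 | 288 | 0 | 2 | 145 | 1 | 0.8 | 2 | 3 | 3 |
| 1 | 630001 | 55 | 0 | 2 | 120 | 209 | 0 | 0 | 172 | 0 | 0.0 | 1 | 0 | 3 |
| 2 | 630002 | 54 | 1 | 4 | 120 | 268 | 0 | 0 | 150 | 1 | 0.0 | 2 | 3 | 7 |
| 3 | 630003 | 44 | 0 | 3 | 112 | 177 | 0 | 0 | 168 | 0 | 0.9 | 1 | 0 | 3 |
| 4 | 630004 | 43 | 1 | 1 | 138 | 267 | 0 | 0 | 163 | 0 | 1.8 | 2 | 0 | 7 |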


## Performing preprocessing on test data:

In [15]:
X_test = create_features(test_df)

## Predicting on test data:
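The best_model obtained from GridSearchCV is used for prediction. Because it is the whole pipeline, the binning and group-mean steps fitted on the training data are applied to the test data automatically.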

In [16]:
y_test_pred_proba = best_model.predict_proba(X_test)[:,1]

## Preparing the submission CSV:

In [17]:
submission = pd.DataFrame({
    'id': test_df['id'],
    target: y_test_pred_proba 
})

submission.to_csv('submission.csv', index=False)

### Future Improvements:
* Use **Optuna** for more efficient hyperparameter optimization.

Thanks to the Kaggle community for helpful feedback and insights.