# <span style="color:#2E86C1">🏥 Hospital Readmission Prediction Challenge</span>

---

## <span style="color:#1ABC9C">📌 Challenge Overview</span>
<div style="padding: 15px; border-radius: 5px">
Healthcare organizations face significant challenges in preventing avoidable hospital readmissions. This competition focuses on developing machine learning models to predict <span style="color:#E74C3C">**30-day readmission risk**</span> using clinical data, helping providers:
- <span style="color:#28B463">Improve patient outcomes</span> 🌱
- <span style="color:#2E86C1">Reduce healthcare costs</span> 💰
- <span style="color:#8E44AD">Optimize care coordination</span> 🤝
</div>

---

## <span style="color:#1ABC9C">🎯 Competition Objective</span>
<div style="padding: 15px; border-radius: 5px; border: 0.5px solid #D4E6F1">

**Develop a binary classifier to predict:**

- `Readmitted_30 = 1` if the patient was readmitted within 30 days
- `Readmitted_30 = 0` otherwise

</div>


## Hospital Readmission Dataset Overview

### Key Features Table
| Column / Group        | Description                                                                 | Relationship to Readmitted_30                                                                 |
|-----------------------|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------------|
| ID                    | Unique stay identifier                                                      | No predictive value; used only for record linkage                                            |
| STAY_DRG_CD           | Diagnosis-Related Group code                                               | Complex DRGs (e.g., cardiac/respiratory) correlate with higher readmission rates            |
| stay_length (engineered) | Days between admission & discharge                                      | Longer stays → higher risk (indicates complications/comorbidities)                         |
| STUS_CD               | Discharge status code                                                       | AMA/transfers show distinct readmission patterns (see discharge codes below)               |
| TYPE_ADM              | Admission type (emergency, urgent, elective)                                | Emergency/urgent admissions → higher risk                                                   |
| SRC_ADMS              | Source of admission (ER, referral, transfer)                                | ER/transfers from post-acute care → distinct risk profiles                                  |
| AD_DGNS               | Admitting diagnosis code (ICD)                                              | Initial diagnosis (e.g., sepsis, CHF) strongly influences readmission                      |
| DGNSCD01-DGNSCD25     | 25 diagnosis codes                                                          | Comorbidity burden → higher risk; specific codes (e.g., CKD) are risk factors               |
| PRCDRCD01-PRCDRCD25   | 25 procedure codes                                                          | Major surgeries → higher complication/readmission rates                                     |
| Readmitted_30         | **Target**: 1 if readmitted within 30 days, else 0                          |                                                                                             |

---

### Discharge Status Codes (STUS_CD)
| Code | Definition                                      | Readmission Risk          |
|------|-------------------------------------------------|---------------------------|
| 1    | Discharged to home/self-care                    | Lowest risk               |
| 3    | Transferred to SNF (Medicare-certified)         | Moderate risk             |
| 6    | Home with home health services                  | Moderate–high risk        |
| 62   | Transferred to inpatient rehabilitation (IRF)   | High risk                 |
| 63   | Transferred to long-term care hospital (LTCH)   | Very high risk            |

---

### Admission Types (TYPE_ADM)
| Code | Definition                                      | Readmission Risk          |
|------|-------------------------------------------------|---------------------------|
| 1    | Emergency admission                             | High risk                 |
| 2    | Urgent admission                                | High risk                 |
| 3    | Elective admission                              | Lower risk                |
| 5    | Trauma Center admission                         | Very High risk            |
| 9    | Information Not Available                       | Unknown risk              |

---

### Admission Sources (SRC_ADMS)
| Code | Definition                                      | Readmission Risk          |
|------|-------------------------------------------------|---------------------------|
| 1    | Physician Referral                              | Moderate risk             |
| 2    | Clinic Referral                                 | Moderate risk             |
| 5    | Transfer from SNF/ICF                           | High risk                 |


---

### Diagnosis and Procedure Codes (DGNSCD01-25, PRCDRCD01-25)
| Column          | Description                                      | Relationship to Readmitted_30                                                                 |
|-----------------|--------------------------------------------------|-----------------------------------------------------------------------------------------------|
| **DGNSCD01**    | Primary diagnosis code (ICD-10)                  | Primary condition driving the stay; often the strongest predictor. High-risk primaries (e.g., CHF, sepsis) ↑ readmission. |
| **DGNSCD02**    | Secondary diagnosis code                         | Comorbidity #1: adds to overall burden; presence ↑ readmission risk.                          |
| **DGNSCD03**    | Secondary diagnosis code                         | Comorbidity #2: adds to overall burden; presence ↑ readmission risk.                          |
| **DGNSCD04**    | Secondary diagnosis code                         | Comorbidity #3: adds to overall burden; presence ↑ readmission risk.                          |
| **DGNSCD05**    | Secondary diagnosis code                         | Comorbidity #4: adds to overall burden; presence ↑ readmission risk.                          |
| **DGNSCD06**    | Secondary diagnosis code                         | Comorbidity #5: adds to overall burden; presence ↑ readmission risk.                          |
| **DGNSCD07**    | Secondary diagnosis code                         | Comorbidity #6: adds to overall burden; presence ↑ readmission risk.                          |
| **DGNSCD08**    | Secondary diagnosis code                         | Comorbidity #7: adds to overall burden; presence ↑ readmission risk.                          |
| **DGNSCD09**    | Secondary diagnosis code                         | Comorbidity #8: adds to overall burden; presence ↑ readmission risk.                          |
| **DGNSCD10**    | Secondary diagnosis code                         | Comorbidity #9: adds to overall burden; presence ↑ readmission risk.                          |
| **DGNSCD11**    | Secondary diagnosis code                         | Comorbidity #10: adds to overall burden; presence ↑ readmission risk.                         |
| **DGNSCD12**    | Secondary diagnosis code                         | Comorbidity #11: adds to overall burden; presence ↑ readmission risk.                         |
| **DGNSCD13**    | Secondary diagnosis code                         | Comorbidity #12: adds to overall burden; presence ↑ readmission risk.                         |
| **DGNSCD14**    | Secondary diagnosis code                         | Comorbidity #13: adds to overall burden; presence ↑ readmission risk.                         |
| **DGNSCD15**    | Secondary diagnosis code                         | Comorbidity #14: adds to overall burden; presence ↑ readmission risk.                         |
| **DGNSCD16**    | Secondary diagnosis code                         | Comorbidity #15: adds to overall burden; presence ↑ readmission risk.                         |
| **DGNSCD17**    | Secondary diagnosis code                         | Comorbidity #16: adds to overall burden; presence ↑ readmission risk.                         |
| **DGNSCD18**    | Secondary diagnosis code                         | Comorbidity #17: adds to overall burden; presence ↑ readmission risk.                         |
| **DGNSCD19**    | Secondary diagnosis code                         | Comorbidity #18: adds to overall burden; presence ↑ readmission risk.                         |
| **DGNSCD20**    | Secondary diagnosis code                         | Comorbidity #19: adds to overall burden; presence ↑ readmission risk.                         |
| **DGNSCD21**    | Secondary diagnosis code                         | Comorbidity #20: adds to overall burden; presence ↑ readmission risk.                         |
| **DGNSCD22**    | Secondary diagnosis code                         | Comorbidity #21: adds to overall burden; presence ↑ readmission risk.                         |
| **DGNSCD23**    | Secondary diagnosis code                         | Comorbidity #22: adds to overall burden; presence ↑ readmission risk.                         |
| **DGNSCD24**    | Secondary diagnosis code                         | Comorbidity #23: adds to overall burden; presence ↑ readmission risk.                         |
| **DGNSCD25**    | Secondary diagnosis code                         | Comorbidity #24: adds to overall burden; presence ↑ readmission risk.                         |
| **PRCDRCD01**   | Primary procedure code (ICD-10-PCS)              | Primary intervention; major surgeries or invasive procedures ↑ readmission likelihood.        |
| **PRCDRCD02**   | Secondary procedure code                          | Procedure #2: additional interventions; more procedures signal complexity and ↑ readmission risk. |
| **PRCDRCD03**   | Secondary procedure code                          | Procedure #3: additional interventions; more procedures signal complexity and ↑ readmission risk. |
| **PRCDRCD04**   | Secondary procedure code                          | Procedure #4: additional interventions; more procedures signal complexity and ↑ readmission risk. |
| **PRCDRCD05**   | Secondary procedure code                          | Procedure #5: additional interventions; more procedures signal complexity and ↑ readmission risk. |
| **PRCDRCD06**   | Secondary procedure code                          | Procedure #6: additional interventions; more procedures signal complexity and ↑ readmission risk. |
| **PRCDRCD07**   | Secondary procedure code                          | Procedure #7: additional interventions; more procedures signal complexity and ↑ readmission risk. |
| **PRCDRCD08**   | Secondary procedure code                          | Procedure #8: additional interventions; more procedures signal complexity and ↑ readmission risk. |
| **PRCDRCD09**   | Secondary procedure code                          | Procedure #9: additional interventions; more procedures signal complexity and ↑ readmission risk. |
| **PRCDRCD10**   | Secondary procedure code                          | Procedure #10: additional interventions; more procedures signal complexity and ↑ readmission risk. |
| **PRCDRCD11**   | Secondary procedure code                          | Procedure #11: additional interventions; more procedures signal complexity and ↑ readmission risk. |
| **PRCDRCD12**   | Secondary procedure code                          | Procedure #12: additional interventions; more procedures signal complexity and ↑ readmission risk. |
| **PRCDRCD13**   | Secondary procedure code                          | Procedure #13: additional interventions; more procedures signal complexity and ↑ readmission risk. |
| **PRCDRCD14**   | Secondary procedure code                          | Procedure #14: additional interventions; more procedures signal complexity and ↑ readmission risk. |
| **PRCDRCD15**   | Secondary procedure code                          | Procedure #15: additional interventions; more procedures signal complexity and ↑ readmission risk. |
| **PRCDRCD16**   | Secondary procedure code                          | Procedure #16: additional interventions; more procedures signal complexity and ↑ readmission risk. |
| **PRCDRCD17**   | Secondary procedure code                          | Procedure #17: additional interventions; more procedures signal complexity and ↑ readmission risk. |
| **PRCDRCD18**   | Secondary procedure code                          | Procedure #18: additional interventions; more procedures signal complexity and ↑ readmission risk. |
| **PRCDRCD19**   | Secondary procedure code                          | Procedure #19: additional interventions; more procedures signal complexity and ↑ readmission risk. |
| **PRCDRCD20**   | Secondary procedure code                          | Procedure #20: additional interventions; more procedures signal complexity and ↑ readmission risk. |
| **PRCDRCD21**   | Secondary procedure code                          | Procedure #21: additional interventions; more procedures signal complexity and ↑ readmission risk. |
| **PRCDRCD22**   | Secondary procedure code                          | Procedure #22: additional interventions; more procedures signal complexity and ↑ readmission risk. |
| **PRCDRCD23**   | Secondary procedure code                          | Procedure #23: additional interventions; more procedures signal complexity and ↑ readmission risk. |
| **PRCDRCD24**   | Secondary procedure code                          | Procedure #24: additional interventions; more procedures signal complexity and ↑ readmission risk. |
| **PRCDRCD25**   | Secondary procedure code                          | Procedure #25: additional interventions; more procedures signal complexity and ↑ readmission risk. |

# ✅ Data Preparation

In [4]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from catboost import CatBoostClassifier
from sklearn.metrics import (f1_score, roc_auc_score, precision_score, recall_score, roc_curve, auc)
from sklearn.model_selection import train_test_split, StratifiedKFold
import optuna

# Settings
import warnings
warnings.filterwarnings('ignore')
np.seterr(all='ignore')

{'divide': 'warn', 'over': 'warn', 'under': 'ignore', 'invalid': 'warn'}

In [5]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import torch
print("CUDA Available:", torch.cuda.is_available())

/kaggle/input/softec-25-machine-learning-competition/sample_submission.csv
/kaggle/input/softec-25-machine-learning-competition/train.csv
/kaggle/input/softec-25-machine-learning-competition/metaData.csv
/kaggle/input/softec-25-machine-learning-competition/test.csv
CUDA Available: False


In [6]:
# Define paths
train_csv_path = '/kaggle/input/softec-25-machine-learning-competition/train.csv'
test_csv_path = '/kaggle/input/softec-25-machine-learning-competition/test.csv'

df_train = pd.read_csv(train_csv_path)
df_test = pd.read_csv(test_csv_path)

In [7]:
# Peek at the data
df_train.head()

Unnamed: 0,ID,STAY_DRG_CD,STAY_FROM_DT,STAY_THRU_DT,STUS_CD,TYPE_ADM,SRC_ADMS,AD_DGNS,DGNSCD01,PRCDRCD01,...,DGNSCD22,PRCDRCD22,DGNSCD23,PRCDRCD23,DGNSCD24,PRCDRCD24,DGNSCD25,PRCDRCD25,stay_drg_cd,Readmitted_30
0,17319,,2017-12-13 00:00:00,2017-12-20 00:00:00,62,1,2,M25551,S72001A,0SRR01Z,...,Z803,,Z86711,,Z86718,,Z85828,,469,0
1,19722,,2017-10-19 00:00:00,2017-10-23 00:00:00,1,1,1,R531,A419,,...,,,,,,,,,871,1
2,89699,,2018-08-06 00:00:00,2018-08-08 00:00:00,1,1,1,R002,J690,,...,,,,,,,,,177,0
3,8086,,2016-12-20 00:00:00,2016-12-27 00:00:00,62,5,1,K661,K661,,...,I440,,,,,,,,393,0
4,68049,,2016-01-06 00:00:00,2016-01-12 00:00:00,6,1,1,J9601,J690,,...,,,,,,,,,177,0


In [8]:
df_train.describe(include='all')

Unnamed: 0,ID,STAY_DRG_CD,STAY_FROM_DT,STAY_THRU_DT,STUS_CD,TYPE_ADM,SRC_ADMS,AD_DGNS,DGNSCD01,PRCDRCD01,...,DGNSCD22,PRCDRCD22,DGNSCD23,PRCDRCD23,DGNSCD24,PRCDRCD24,DGNSCD25,PRCDRCD25,stay_drg_cd,Readmitted_30
count,130296.0,3798.0,130296,130296,130296.0,130296.0,130296.0,130296,130296,61628,...,41252,4423,36587,4409,32382,4399,20261,4388,126498.0,130296.0
unique,,,3213,3214,,,,2260,2691,2596,...,1826,28,1698,23,1603,18,1286,15,520.0,
top,,,2017-01-09 00:00:00,2018-01-11 00:00:00,,,,R0602,A419,5A09357,...,-,-,-,-,-,-,-,-,871.0,
freq,,,132,141,,,,24865,28655,4650,...,4365,4365,4365,4365,4365,4365,4365,4365,27887.0,
mean,81315.674833,430.033175,,,7.427434,1.263838,1.482478,,,,...,,,,,,,,,,0.204903
std,47062.963,230.927115,,,16.370187,0.649099,1.200835,,,,...,,,,,,,,,,0.403632
min,2.0,23.0,,,1.0,1.0,1.0,,,,...,,,,,,,,,,0.0
25%,40507.75,291.0,,,1.0,1.0,1.0,,,,...,,,,,,,,,,0.0
50%,81315.5,309.0,,,3.0,1.0,1.0,,,,...,,,,,,,,,,0.0
75%,122083.5,682.0,,,6.0,1.0,1.0,,,,...,,,,,,,,,,0.0


In [9]:
print("\nRaw Training Data Shape:", df_train.shape)
print("Raw Test Data Shape:", df_test.shape)


Raw Training Data Shape: (130296, 60)
Raw Test Data Shape: (32575, 59)


In [10]:
print("Duplicates:", df_train.duplicated().sum())
print("Duplicates:", df_test.duplicated().sum())

Duplicates: 0
Duplicates: 0


In [11]:
print("Raw Training Data Info:")
df_train.info()
print("\nRaw Test Data Info:")
df_test.info()

Raw Training Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130296 entries, 0 to 130295
Data columns (total 60 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   ID             130296 non-null  int64  
 1   STAY_DRG_CD    3798 non-null    float64
 2   STAY_FROM_DT   130296 non-null  object 
 3   STAY_THRU_DT   130296 non-null  object 
 4   STUS_CD        130296 non-null  int64  
 5   TYPE_ADM       130296 non-null  int64  
 6   SRC_ADMS       130296 non-null  int64  
 7   AD_DGNS        130296 non-null  object 
 8   DGNSCD01       130296 non-null  object 
 9   PRCDRCD01      61628 non-null   object 
 10  DGNSCD02       130153 non-null  object 
 11  PRCDRCD02      38509 non-null   object 
 12  DGNSCD03       129937 non-null  object 
 13  PRCDRCD03      25879 non-null   object 
 14  DGNSCD04       129532 non-null  object 
 15  PRCDRCD04      17840 non-null   object 
 16  DGNSCD05       128889 non-null  object 
 17  PRCDR

In [12]:
print("\nTraining Missing Values(%):")
print((df_train.isnull().sum()/len(df_train) * 100).sort_values(ascending=False))


Training Missing Values(%):
STAY_DRG_CD      97.085099
PRCDRCD25        96.632283
PRCDRCD24        96.623841
PRCDRCD23        96.616166
PRCDRCD22        96.605422
PRCDRCD21        96.594677
PRCDRCD20        96.581630
PRCDRCD19        96.565512
PRCDRCD18        96.541720
PRCDRCD17        96.518696
PRCDRCD16        96.483392
PRCDRCD15        96.432738
PRCDRCD14        96.345245
PRCDRCD13        96.246239
PRCDRCD12        96.113465
PRCDRCD11        95.933874
PRCDRCD10        95.669092
PRCDRCD09        95.360564
PRCDRCD08        94.893934
PRCDRCD07        94.141800
PRCDRCD06        92.493246
PRCDRCD05        90.244520
PRCDRCD04        86.308098
DGNSCD25         84.450021
PRCDRCD03        80.138300
DGNSCD24         75.147357
DGNSCD23         71.920090
PRCDRCD02        70.444987
DGNSCD22         68.339780
DGNSCD21         64.448640
DGNSCD20         60.510683
DGNSCD19         56.214312
PRCDRCD01        52.701541
DGNSCD18         46.801130
DGNSCD17         41.464051
DGNSCD16         36.268189

In [13]:
print("🔍 Missing columns in Training Data:")
print(df_train.isnull().sum()[df_train.isnull().sum() > 0])
print("\n🔍 Missing columns in Testing Data:")
print(df_test.isnull().sum()[df_test.isnull().sum() > 0])

🔍 Missing columns in Training Data:
STAY_DRG_CD    126498
PRCDRCD01       68668
DGNSCD02          143
PRCDRCD02       91787
DGNSCD03          359
PRCDRCD03      104417
DGNSCD04          764
PRCDRCD04      112456
DGNSCD05         1407
PRCDRCD05      117585
DGNSCD06         2469
PRCDRCD06      120515
DGNSCD07         4086
PRCDRCD07      122663
DGNSCD08         6285
PRCDRCD08      123643
DGNSCD09         9212
PRCDRCD09      124251
DGNSCD10        12929
PRCDRCD10      124653
DGNSCD11        17350
PRCDRCD11      124998
DGNSCD12        22472
PRCDRCD12      125232
DGNSCD13        28056
PRCDRCD13      125405
DGNSCD14        34213
PRCDRCD14      125534
DGNSCD15        40506
PRCDRCD15      125648
DGNSCD16        47256
PRCDRCD16      125714
DGNSCD17        54026
PRCDRCD17      125760
DGNSCD18        60980
PRCDRCD18      125790
DGNSCD19        73245
PRCDRCD19      125821
DGNSCD20        78843
PRCDRCD20      125842
DGNSCD21        83974
PRCDRCD21      125859
DGNSCD22        89044
PRCDRCD22      125
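The counts above rise steadily with slot number, so the number of filled diagnosis and procedure slots differs from stay to stay. Since the feature table ties comorbidity burden and procedure count to readmission risk, one option (not used in the original notebook) is to count the filled slots per stay. The sketch below uses hypothetical column names `n_diagnoses` and `n_procedures`, and it assumes that the `-` value seen in the `describe()` output marks an empty slot.

```python
import pandas as pd

def count_filled_slots(df: pd.DataFrame, prefix: str) -> pd.Series:
    """Count non-empty code slots per row for columns starting with `prefix`."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    codes = df[cols]
    # Assumption: "-" is a placeholder for "no code", like a missing value
    filled = codes.notna() & codes.ne("-")
    return filled.sum(axis=1)

# Toy rows in the same layout as the training data
toy = pd.DataFrame({
    "DGNSCD01": ["A419", "J690"],
    "DGNSCD02": ["I440", None],
    "DGNSCD03": ["-", None],
    "PRCDRCD01": ["0SRR01Z", None],
})
toy["n_diagnoses"] = count_filled_slots(toy, "DGNSCD")
toy["n_procedures"] = count_filled_slots(toy, "PRCDRCD")
print(toy[["n_diagnoses", "n_procedures"]])
```

Whether these counts help the model would need to be checked against the validation metrics.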

In [14]:
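def preprocess_df(df):
    """Drop identifier/date columns and keep only rows with a valid numeric DRG code."""
    df = df.copy()
    # STAY_DRG_CD is ~97% missing, so the mostly populated stay_drg_cd is used instead
    df.drop(columns=['ID', 'STAY_DRG_CD', 'STAY_FROM_DT', 'STAY_THRU_DT'], inplace=True, errors='ignore')
    df.dropna(subset=['stay_drg_cd'], inplace=True)
    df['stay_drg_cd'] = pd.to_numeric(df['stay_drg_cd'], errors='coerce').astype('Int64')
    # Drop rows whose DRG code could not be parsed as a number
    df.dropna(subset=['stay_drg_cd'], inplace=True)
    return df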
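
The feature table lists an engineered `stay_length`, but `preprocess_df` drops both date columns without deriving it. A minimal sketch of how it could be computed before the drop, assuming the dates parse with `pd.to_datetime` as they appear in the `head()` output; the helper name `add_stay_length` is hypothetical and not part of the original notebook.

```python
import pandas as pd

def add_stay_length(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with stay_length = whole days between admission and discharge."""
    out = df.copy()
    start = pd.to_datetime(out["STAY_FROM_DT"], errors="coerce")
    end = pd.to_datetime(out["STAY_THRU_DT"], errors="coerce")
    out["stay_length"] = (end - start).dt.days
    return out

# Two date pairs copied from the head() output above
sample = pd.DataFrame({
    "STAY_FROM_DT": ["2017-12-13 00:00:00", "2017-10-19 00:00:00"],
    "STAY_THRU_DT": ["2017-12-20 00:00:00", "2017-10-23 00:00:00"],
})
sample = add_stay_length(sample)
print(sample["stay_length"].tolist())  # [7, 4]
```

If used, the call would have to run before the date columns are dropped, and on the test set as well.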

In [15]:
df_train = preprocess_df(df_train)

In [16]:
categorical_cols = ['STUS_CD', 'TYPE_ADM', 'SRC_ADMS', 'AD_DGNS', 'stay_drg_cd']

for col in categorical_cols:
    value_counts = df_train[col].value_counts(dropna=False)
    print(f"\n🔢 Total Unique Values in '{col}': {df_train[col].nunique(dropna=False)}")
    print("=" * 50)


🔢 Total Unique Values in 'STUS_CD': 5

🔢 Total Unique Values in 'TYPE_ADM': 5

🔢 Total Unique Values in 'SRC_ADMS': 3

🔢 Total Unique Values in 'AD_DGNS': 2236

🔢 Total Unique Values in 'stay_drg_cd': 519


In [17]:
#converting into categorical columns
categorical_cols = ['STUS_CD', 'TYPE_ADM', 'SRC_ADMS']
df_train[categorical_cols] = df_train[categorical_cols].astype('category')

In [19]:
df_train.dtypes

STUS_CD          category
TYPE_ADM         category
SRC_ADMS         category
AD_DGNS            object
DGNSCD01           object
PRCDRCD01          object
DGNSCD02           object
PRCDRCD02          object
DGNSCD03           object
PRCDRCD03          object
DGNSCD04           object
PRCDRCD04          object
DGNSCD05           object
PRCDRCD05          object
DGNSCD06           object
PRCDRCD06          object
DGNSCD07           object
PRCDRCD07          object
DGNSCD08           object
PRCDRCD08          object
DGNSCD09           object
PRCDRCD09          object
DGNSCD10           object
PRCDRCD10          object
DGNSCD11           object
PRCDRCD11          object
DGNSCD12           object
PRCDRCD12          object
DGNSCD13           object
PRCDRCD13          object
DGNSCD14           object
PRCDRCD14          object
DGNSCD15           object
PRCDRCD15          object
DGNSCD16           object
PRCDRCD16          object
DGNSCD17           object
PRCDRCD17          object
DGNSCD18    

---

# 📈 Exploratory Data Analysis & Feature Selection

# Hidden Layers Baseline Model

The `hidden_layers_baseline` function preprocesses the training data, trains a classifier, evaluates it on a held-out validation split, and plots feature importance. The importance scores and metrics are then used to guide feature selection.

## Function Overview
The `hidden_layers_baseline` function performs the following tasks:
1. **Preprocessing**:
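   - Drops unnecessary columns.
   - Drops rows with a missing `stay_drg_cd` and fills all other missing values with `'unknown'`.
   - Converts `stay_drg_cd` to a nullable integer; the remaining text columns are passed to CatBoost as categorical features.

2. **Model Training**:
   - Splits the dataset into stratified training and validation sets (80/20).
   - Trains a CatBoost classifier with class weights. GPU training is available through `task_type="GPU"` but is commented out in the code, since no CUDA device was available in this session.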
   
3. **Model Evaluation**:
   - Predicts on the validation set and computes the following metrics:
     - F1 Score
     - AUC-ROC
     - Precision
     - Recall
     
4. **Plots**:
   - Bar chart of evaluation metrics.
   - ROC Curve to evaluate the model's true positive vs false positive rates.

5. **Feature Importance**:
   - Extracts feature importance scores from the trained model.
   - Plots the top 20 most important features based on the CatBoost model.

6. **Test Data Prediction**:
   - Prepares the test dataset in the same way as the training data.
   - Makes predictions on the test set and saves the results in a CSV file.

## Detailed Steps

### 1. **Data Preprocessing**
   The preprocessing steps include:
   - Dropping `ID`, the date columns `STAY_FROM_DT` and `STAY_THRU_DT`, and `STAY_DRG_CD` (about 97% missing).
   - Dropping rows where `stay_drg_cd` is missing.
   - Converting `stay_drg_cd` to a nullable integer (`Int64`).
   - Filling the remaining missing values with `'unknown'`, so that CatBoost treats them as their own category.

### 2. **Model Training**
   A CatBoost classifier is trained as follows:
   - GPU acceleration can be enabled with `task_type="GPU"`; it is commented out in the code, so training runs on CPU.
   - Class weights `[0.63, 2.44]` compensate for class imbalance (about 20% of training stays are readmissions).
   - The model is fit on `X_train` and `y_train`, with the text and category columns passed as `cat_features`.
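
The weights `[0.63, 2.44]` are consistent with the common "balanced" heuristic `n_samples / (n_classes * class_count)` applied to the roughly 20.5% positive rate reported by `describe()`. Whether they were actually derived this way is an assumption. A sketch of the calculation in plain Python:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in sorted(counts.items())}

# Toy labels with about the same positive rate as the training set (~20.5%)
labels = [1] * 205 + [0] * 795
weights = balanced_class_weights(labels)
print({cls: round(w, 2) for cls, w in weights.items()})  # {0: 0.63, 1: 2.44}
```

The resulting values could be passed to CatBoost as `class_weights=[weights[0], weights[1]]`.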

### 3. **Model Evaluation**
   After training, the model's performance is evaluated on the validation set (`X_val`, `y_val`) using several metrics:
   - **F1 Score**: Balances precision and recall for binary classification tasks.
   - **AUC-ROC**: Area under the ROC curve; summarizes how well the predicted probabilities rank readmitted stays above non-readmitted ones across all thresholds.
   - **Precision**: Measures the proportion of positive predictions that are actually correct.
   - **Recall**: Measures the proportion of actual positives that are correctly predicted.

   The results are visualized using a bar chart with the metrics displayed for easy comparison.
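
For reference, precision, recall and F1 all follow from the confusion-matrix counts. A small sketch in plain Python, using made-up counts rather than results from this notebook:

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts: 60 true positives, 40 false positives, 20 false negatives
precision, recall, f1 = precision_recall_f1(tp=60, fp=40, fn=20)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With a heavily weighted positive class, recall typically rises at the expense of precision, which is why F1 is reported alongside both.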

### 4. **ROC Curve**
   The ROC curve is plotted to visually assess the performance of the classifier. The area under the curve (AUC) is also calculated, which indicates the ability of the model to discriminate between classes.
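
One way to read the AUC value: it equals the probability that a randomly chosen readmitted stay receives a higher score than a randomly chosen non-readmitted one, with ties counting as half. The brute-force sketch below illustrates that definition on toy scores. It is quadratic in the number of samples and is meant only as an illustration, not as a replacement for `roc_auc_score`.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and scores, not taken from the notebook
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, y_score))  # 0.75
```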

### 5. **Feature Importance**
   The feature importance is obtained from the CatBoost model, which helps identify the most influential features in making predictions:
   - The importance of each feature is printed and saved to a CSV file (`feature_importance.csv`).
   - A horizontal bar chart is plotted to visualize the top 20 most important features.

### 6. **Test Data Prediction**
   After preprocessing the test data (similar to the training data), predictions are made on the test set, and the results are saved to a submission file (`submission.csv`).

### Example of Usage:
```python
hidden_layers_baseline('train_data.csv', 'test_data.csv', submission_filename='submission.csv')
```

In [22]:
def hidden_layers_baseline(train_df_path, test_df_path, submission_filename='submission.csv'):
    df = pd.read_csv(train_df_path)

    # Preprocessing
    df.drop(columns=['ID','STAY_DRG_CD','STAY_FROM_DT','STAY_THRU_DT'], inplace=True)
    df.dropna(subset=['stay_drg_cd'], inplace=True)
    df['stay_drg_cd'] = pd.to_numeric(df['stay_drg_cd'], errors='coerce').astype('Int64')
    df.dropna(subset=['stay_drg_cd'], inplace=True)
    df.fillna('unknown', inplace=True)

    X = df.drop(columns=['Readmitted_30'])
    y = df['Readmitted_30']
    cat_features = X.select_dtypes(include=['object','category']).columns.tolist()

    # Stratified 80/20 split into training and validation sets
    X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

    # Train model
    model = CatBoostClassifier(
        #task_type="GPU",
        #devices='0',
        verbose=100,
        random_state=42,
        class_weights=[0.63, 2.44]
    )
    model.fit(X_train, y_train, cat_features=cat_features)

    # Predict & Metrics on the validation set
    y_pred = model.predict(X_val)
    y_prob = model.predict_proba(X_val)[:, 1]
    f1 = f1_score(y_val, y_pred)
    auc_score = roc_auc_score(y_val, y_prob)
    precision = precision_score(y_val, y_pred)
    recall = recall_score(y_val, y_pred)

    # Plotting metric bar chart
    plt.figure(figsize=(8, 5))
    metrics = {'F1 Score': f1, 'AUC-ROC': auc_score, 'Precision': precision, 'Recall': recall}
    sns.barplot(x=list(metrics.keys()), y=list(metrics.values()), palette='viridis')
    plt.ylim(0, 1)
    plt.title("Model Performance Metrics")
    for i, value in enumerate(metrics.values()):
        plt.text(i, value + 0.02, f"{value:.2f}", ha='center', va='bottom', fontsize=12)
    plt.tight_layout()
    plt.show()

    # Plotting ROC Curve
    fpr, tpr, _ = roc_curve(y_val, y_prob)
    roc_auc = auc(fpr, tpr)  # sklearn.metrics.auc, imported at the top of the notebook
    plt.figure(figsize=(8, 6))
    plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
    plt.plot([0, 1], [0, 1], color='gray', linestyle='--', lw=2, label='Random')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate (Recall)')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend(loc='lower right')
    plt.grid(True)
    plt.show()

    # Feature Importances
    feature_importances = model.get_feature_importance()
    feature_names = X_train.columns
    importance_df = pd.DataFrame({
        'Feature': feature_names,
        'Importance': feature_importances
    }).sort_values(by='Importance', ascending=False)
    print(importance_df)
    importance_df.to_csv('feature_importance.csv')

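    # Plot only the 20 most important features, with the largest at the top
    top_features = importance_df.head(20)
    plt.figure(figsize=(12, 8))
    plt.barh(top_features['Feature'][::-1], top_features['Importance'][::-1], color='skyblue')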
    plt.xlabel('Importance')
    plt.title('Top 20 Feature Importances (CatBoost)')
    plt.grid(axis='x')
    plt.tight_layout()
    plt.show()

    # Test data preprocessing
    test_df = pd.read_csv(test_df_path)
    submission_ids = test_df['ID']
    test_df.drop(columns=['ID', 'STAY_DRG_CD', 'STAY_FROM_DT', 'STAY_THRU_DT'], inplace=True)
    test_df['stay_drg_cd'] = pd.to_numeric(test_df['stay_drg_cd'], errors='coerce').fillna(-1).astype('Int64')
    test_df.fillna('unknown', inplace=True)

    # Predict on test
    y_pred_test = model.predict(test_df)

    # Submission
    submission_df = pd.DataFrame({
        'ID': submission_ids,
        'Readmitted_30': y_pred_test
    })
    submission_df.to_csv(submission_filename, index=False)
    print(f"Submission file saved as {submission_filename}")

In [None]:
hidden_layers_baseline(
    train_df_path= '/kaggle/input/softec-25-machine-learning-competition/train.csv',
    test_df_path='/kaggle/input/softec-25-machine-learning-competition/test.csv',
    submission_filename='submission_2.csv'
)

Learning rate set to 0.072884
0:	learn: 0.6905125	total: 619ms	remaining: 10m 18s


## Baseline Model Results

Based on the feature importances from the baseline model, a subset of columns is retained for the rest of the analysis. All 25 diagnosis slots are kept, while the procedure slots `PRCDRCD14`, `PRCDRCD16` to `PRCDRCD18`, `PRCDRCD20` and `PRCDRCD23` to `PRCDRCD25` are absent from the lists.

- **`columns_to_keep`**: the selected diagnosis, procedure, admission and discharge columns plus the target `Readmitted_30`; applied to the training data.
- **`columns_to_test`**: the same features without the target; applied to the test data.
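
How the column lists were derived from the importance scores is not shown. One plausible route, sketched here with a hypothetical cutoff and made-up scores, is to keep every feature whose importance exceeds a threshold:

```python
import pandas as pd

def select_features(importance_df: pd.DataFrame, min_importance: float) -> list:
    """Return feature names with importance above the cutoff, most important first."""
    ranked = importance_df.sort_values("Importance", ascending=False)
    return ranked.loc[ranked["Importance"] > min_importance, "Feature"].tolist()

# Made-up scores in the same Feature/Importance layout as feature_importance.csv
importance_df = pd.DataFrame({
    "Feature": ["STUS_CD", "PRCDRCD25", "stay_drg_cd", "DGNSCD01"],
    "Importance": [21.5, 0.0, 18.2, 9.7],
})
selected = select_features(importance_df, min_importance=0.1)
print(selected)  # ['STUS_CD', 'stay_drg_cd', 'DGNSCD01']
```

The cutoff value is a judgment call; re-running the baseline on the reduced set is the usual way to confirm that the metrics do not degrade.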



In [None]:
# Columns to keep
columns_to_keep = [
    'STUS_CD', 'stay_drg_cd', 'DGNSCD02', 'DGNSCD01', 'PRCDRCD01', 'AD_DGNS', 'DGNSCD18', 'DGNSCD03',
    'DGNSCD17', 'TYPE_ADM', 'DGNSCD05', 'DGNSCD08', 'DGNSCD06', 'DGNSCD07', 'DGNSCD13', 'DGNSCD04',
    'DGNSCD12', 'DGNSCD11', 'DGNSCD15', 'DGNSCD09', 'DGNSCD14', 'DGNSCD16', 'DGNSCD10', 'DGNSCD24',
    'PRCDRCD02', 'DGNSCD19', 'DGNSCD20', 'DGNSCD22', 'DGNSCD25', 'PRCDRCD03', 'DGNSCD21', 'SRC_ADMS',
    'PRCDRCD04', 'DGNSCD23', 'PRCDRCD05', 'PRCDRCD06', 'PRCDRCD22', 'PRCDRCD10', 'PRCDRCD09',
    'PRCDRCD21', 'PRCDRCD11', 'PRCDRCD07', 'PRCDRCD08', 'PRCDRCD19', 'PRCDRCD12', 'PRCDRCD13', 'PRCDRCD15', 'Readmitted_30'
]
# Same features without the target, for the test set
columns_to_test = [col for col in columns_to_keep if col != 'Readmitted_30']
df_train = df_train[columns_to_keep]
df_test = df_test[columns_to_test]

### Splitting Training Data by Target Value

The training data is split into two subsets based on the target variable `Readmitted_30`, which indicates whether the patient was readmitted within 30 days:

- **df_target_0**: Contains rows where `Readmitted_30` is 0 (no readmission).
- **df_target_1**: Contains rows where `Readmitted_30` is 1 (readmitted within 30 days).

In [None]:
df_target_0 = df_train[df_train['Readmitted_30'] == 0].copy()
df_target_1 = df_train[df_train['Readmitted_30'] == 1].copy()
print('Total train data shape:', df_train.shape)
print('Train data shape with Readmitted_30 = 0:', df_target_0.shape)
print('Train data shape with Readmitted_30 = 1:', df_target_1.shape)
print('Total test data shape:', df_test.shape)

### Outlier Analysis and Distribution of `STUS_CD` for Target Classes

In this analysis, we examine the distribution of the `STUS_CD` feature for two target classes in the dataset: 

- **Readmitted_30 = 0** (patients not readmitted within 30 days)
- **Readmitted_30 = 1** (patients readmitted within 30 days)

#### Steps:
1. **Outlier Detection**: 
   - The Interquartile Range (IQR) method is used to detect outliers. Any values outside the range defined by `Q1 - 1.5 * IQR` and `Q3 + 1.5 * IQR` are considered outliers.
   - We calculate the number and percentage of outliers for both target classes.

2. **Boxplot**:
   - A boxplot is plotted for each target class to visualize the spread and presence of outliers.

3. **Distribution Plot with KDE**:
   - A histogram and Kernel Density Estimation (KDE) curve are plotted for both target classes to understand the underlying distribution of `STUS_CD`.

#### Results:
The cell below prints the number and percentage of IQR outliers for **Readmitted_30 = 0** and then for **Readmitted_30 = 1**, alongside each class's boxplot and histogram with KDE.

This approach provides insights into the variability and distribution of the feature across different target classes.
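The IQR rule described above can be wrapped in a small helper so the same logic is reused for each class. This is a minimal sketch: the name `iqr_outlier_summary` and the toy values are illustrative assumptions, not part of the notebook or the competition data.

```python
import pandas as pd

def iqr_outlier_summary(values: pd.Series, k: float = 1.5) -> dict:
    """Return IQR bounds plus the count and percentage of values outside them."""
    # Coerce to numbers and drop anything that cannot be parsed
    data = pd.to_numeric(values, errors="coerce").dropna()
    q1, q3 = data.quantile(0.25), data.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    n_out = int(((data < lower) | (data > upper)).sum())
    return {
        "lower": lower,
        "upper": upper,
        "n_outliers": n_out,
        "pct_outliers": 100.0 * n_out / len(data) if len(data) else 0.0,
    }

# Toy values (hypothetical): one extreme code and one non-numeric entry
toy = pd.Series(["1", "1", "2", "3", "3", "6", "62", "n/a"])
summary = iqr_outlier_summary(toy)
print(summary)
```

On the toy series the bounds come out as -3.0 and 9.0, so only the value 62 is flagged.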




In [None]:
data = pd.to_numeric(df_target_0['STUS_CD'], errors='coerce').dropna()
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = data[(data < lower_bound) | (data > upper_bound)]
num_outliers = outliers.count()
percent_outliers = (num_outliers / len(data)) * 100


plt.figure(figsize=(8, 2))
plt.boxplot(data, vert=False)
plt.title('STUS_CD Boxplot (Readmitted_30 = 0)')
plt.xlabel('STUS_CD')
plt.show()
print(f"[Target = 0] Number of outliers: {num_outliers}")
print(f"[Target = 0] Percentage of outliers: {percent_outliers:.2f}%")

plt.figure(figsize=(8, 4))
ax = data.plot.hist(bins=30, density=True, alpha=0.6)
data.plot.kde(ax=ax)
plt.title('STUS_CD Distribution with KDE (Readmitted_30 = 0)')
plt.xlabel('STUS_CD')
plt.ylabel('Density')
plt.show()
print('-'*40)
data1 = pd.to_numeric(df_target_1['STUS_CD'], errors='coerce').dropna()
Q1 = data1.quantile(0.25)
Q3 = data1.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = data1[(data1 < lower_bound) | (data1 > upper_bound)]
num_outliers = outliers.count()
percent_outliers = (num_outliers / len(data1)) * 100
plt.figure(figsize=(8, 2))
plt.boxplot(data1, vert=False)
plt.title('STUS_CD Boxplot (Readmitted_30 = 1)')
plt.xlabel('STUS_CD')
plt.show()
print(f"[Target = 1] Number of outliers: {num_outliers}")
print(f"[Target = 1] Percentage of outliers: {percent_outliers:.2f}%")
plt.figure(figsize=(8, 4))
ax = data1.plot.hist(bins=30, density=True, alpha=0.6)
data1.plot.kde(ax=ax)
plt.title('STUS_CD Distribution with KDE (Readmitted_30 = 1)')
plt.xlabel('STUS_CD')
plt.ylabel('Density')
plt.show()

### Outlier Handling: Removal for Target 0 and Capping (Winsorization) for Target 1

In this analysis, we handle the outliers in the `STUS_CD` feature for two target classes (Readmitted_30 = 0 and Readmitted_30 = 1) in the dataset:

1. **For Target 0 (Readmitted_30 = 0)**:
   - **Outlier Removal**: We remove the rows where the `STUS_CD` values fall outside the calculated bounds using the IQR method (lower and upper bounds). 
   - **Result**: Rows with extreme `STUS_CD` values are removed, reducing the dataset size for this target class.

2. **For Target 1 (Readmitted_30 = 1)**:
   - **Outlier Capping (Winsorization)**: Instead of removing outliers, we cap the extreme `STUS_CD` values that are below the lower bound or above the upper bound to the respective boundary value (Winsorization).
   - **Result**: Outliers are transformed, ensuring no rows are removed from this dataset.

After cleaning, the following insights are provided:
- The number and percentage of rows removed from Target 0 due to outliers.
- The number of outliers capped in Target 1 without any rows being removed.
- The distribution and outliers are visualized using boxplots and histograms with Kernel Density Estimation (KDE) for both target classes.

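Removal is applied to the no-readmission class, while capping keeps every readmitted case. The results are stored in `df_target_0_cleaned` and `df_target_1_cleaned`; the missing-value and modelling cells further down read `df_target_0` and `df_target_1`, so the cleaned frames affect training only if they are assigned back to those names.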
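To make the difference between the two strategies concrete, here is a minimal sketch on a toy frame. The helper `iqr_bounds` and the values are assumptions for illustration; only the column name `STUS_CD` comes from the notebook.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple:
    """Return (lower, upper) bounds from the IQR rule."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy data (hypothetical): a single extreme value of 62
toy = pd.DataFrame({"STUS_CD": [1, 1, 2, 3, 3, 6, 62]})
lower, upper = iqr_bounds(toy["STUS_CD"])

# Strategy 1: removal drops the rows outside the bounds
removed = toy[toy["STUS_CD"].between(lower, upper)].copy()

# Strategy 2: capping keeps every row and pulls extremes to the nearest bound
capped = toy.copy()
capped["STUS_CD"] = capped["STUS_CD"].clip(lower=lower, upper=upper)

print(len(toy), len(removed), len(capped))
print(capped["STUS_CD"].max())
```

Removal shortens the frame by one row, while capping keeps all seven rows and replaces 62 with the upper bound.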


In [None]:
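# Remove target-0 rows whose STUS_CD falls outside that class's own IQR bounds.
# Bounds are recomputed here because lower_bound/upper_bound from the previous
# cell were last computed on the target-1 data.
stus_0 = pd.to_numeric(df_target_0['STUS_CD'], errors='coerce')
q1_0, q3_0 = stus_0.quantile(0.25), stus_0.quantile(0.75)
lower_0, upper_0 = q1_0 - 1.5 * (q3_0 - q1_0), q3_0 + 1.5 * (q3_0 - q1_0)
df_target_0_cleaned = df_target_0[stus_0.between(lower_0, upper_0)].copy()

# Cap (winsorize) target-1 outliers at the target-1 IQR bounds; no rows are dropped.
df_target_1_cleaned = df_target_1.copy()
stus_1 = pd.to_numeric(df_target_1_cleaned['STUS_CD'], errors='coerce')
q1_1, q3_1 = stus_1.quantile(0.25), stus_1.quantile(0.75)
lower_1, upper_1 = q1_1 - 1.5 * (q3_1 - q1_1), q3_1 + 1.5 * (q3_1 - q1_1)
df_target_1_cleaned['STUS_CD'] = stus_1.clip(lower=lower_1, upper=upper_1)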

In [None]:
print(f'Total Train Data Shape with target column 0: {df_target_0.shape}')
print(f'Total Train Data Shape with target column 1: {df_target_1.shape}')
print('\nAfter Removing Outliers:\n')
print(f'Total Train Data Shape with target column 0: {df_target_0_cleaned.shape}')
print(f'Total Train Data Shape with target column 1: {df_target_1_cleaned.shape}')
print('\n')

# For target 0
removed_rows_target_0 = df_target_0.shape[0] - df_target_0_cleaned.shape[0]
print(f'Rows removed from target 0 dataset due to outliers: {removed_rows_target_0} ({(removed_rows_target_0 / df_target_0.shape[0]) * 100:.2f}%)')

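# For target 1: count the values changed by capping (no rows are removed)
stus_1_original = pd.to_numeric(df_target_1['STUS_CD'], errors='coerce')
capped_rows_target_1 = int((stus_1_original.notna() & (stus_1_original != df_target_1_cleaned['STUS_CD'])).sum())
print(f'Rows capped in target 1 dataset (not removed, just transformed): {capped_rows_target_1} ({(capped_rows_target_1 / df_target_1.shape[0]) * 100:.2f}%)')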

In [None]:
# === For df_target_1 ===
data_1 = pd.to_numeric(df_target_1_cleaned['STUS_CD'], errors='coerce').dropna()

# Compute IQR-based outliers
Q1 = data_1.quantile(0.25)
Q3 = data_1.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = data_1[(data_1 < lower_bound) | (data_1 > upper_bound)]
num_outliers = outliers.count()
percent_outliers = (num_outliers / len(data_1)) * 100

# Boxplot
plt.figure(figsize=(8, 2))
plt.boxplot(data_1, vert=False)
plt.title('STUS_CD Boxplot (Readmitted_30 = 1)')
plt.xlabel('STUS_CD')
plt.show()

print(f"[Target = 1] Number of outliers: {num_outliers}")
print(f"[Target = 1] Percentage of outliers: {percent_outliers:.2f}%")

# Distribution with KDE
plt.figure(figsize=(8, 4))
ax = data_1.plot.hist(bins=30, density=True, alpha=0.6)
data_1.plot.kde(ax=ax)
plt.title('STUS_CD Distribution with KDE (Readmitted_30 = 1)')
plt.xlabel('STUS_CD')
plt.ylabel('Density')
plt.show()

In [None]:
# === For df_test ===
data = pd.to_numeric(df_test['STUS_CD'], errors='coerce').dropna()

# Compute IQR-based outliers
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = data[(data < lower_bound) | (data > upper_bound)]
num_outliers = outliers.count()
percent_outliers = (num_outliers / len(data)) * 100

# Boxplot
plt.figure(figsize=(8, 2))
plt.boxplot(data, vert=False)
plt.title('STUS_CD Boxplot (Test Set)')
plt.xlabel('STUS_CD')
plt.show()

print(f"[Test] Number of outliers: {num_outliers}")
print(f"[Test] Percentage of outliers: {percent_outliers:.2f}%")

### Step 6: Analyze Missing Value Ranges in `df_target_0`

- We calculate the percentage of missing values for each row in `df_target_0`.
- We then analyze rows that have more than 50%, 60%, 70%, 80%, and 90% missing values. For each threshold, we calculate the percentage and number of rows with missing values exceeding the threshold.

### Step 7: Analyze Missing Value Ranges in `df_target_1`

- The same analysis is performed on `df_target_1` to determine the rows with varying percentages of missing values.

### Dropping Rows with Missing Values Above a Threshold

- Rows with more than a specified threshold (e.g., 70%) of missing values are dropped to clean the dataset. In the cells below, the drop is applied to `df_target_0` only.
- This reduces the dataset size by removing rows with excessive missing values, ensuring that only rows with sufficient data are retained.
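The row-level threshold can be illustrated on a toy frame. This is a sketch with made-up column names and values; it filters on the missing fraction directly, which gives the same result as comparing missing counts against 70% of the column count.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): four columns, rows with 1, 4, 3 and 0 missing cells
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": ["x", None, None, "y"],
    "c": [np.nan, np.nan, 3.0, 4.0],
    "d": [1.0, np.nan, np.nan, 4.0],
})

# Fraction of missing cells per row
row_missing = toy.isnull().mean(axis=1)

# Keep rows whose missing fraction does not exceed the 70% threshold
kept = toy[row_missing <= 0.70].copy()

print(row_missing.tolist())
print(kept.index.tolist())
```

Rows 1 and 2 are 100% and 75% missing, so both are dropped and rows 0 and 3 remain.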

### Imputing Missing Values

- For remaining missing values:
  - **Categorical columns** are filled with the mode (most frequent value).
  - **Numerical columns** are filled with the column median, which by construction lies within the Interquartile Range (IQR) bounds.

- The imputation process ensures that the dataset has no missing values after cleaning.
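A minimal sketch of the mode and median fill on a toy frame follows. The values are invented for illustration, and `is_numeric_dtype` is used here as one possible way to separate numeric from categorical columns.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame (hypothetical values): one categorical and one numeric column
toy = pd.DataFrame({
    "TYPE_ADM": ["1", "1", None, "3"],
    "stay_length": [2.0, np.nan, 4.0, 30.0],
})

filled = toy.copy()
for col in filled.columns:
    if is_numeric_dtype(filled[col]):
        # Numeric: fill with the median of the observed values
        filled[col] = filled[col].fillna(filled[col].median())
    else:
        # Categorical: fill with the most frequent observed value
        filled[col] = filled[col].fillna(filled[col].mode()[0])

print(filled)
```

The missing `TYPE_ADM` becomes "1" (the mode) and the missing `stay_length` becomes 4.0 (the median of 2, 4 and 30), which is not pulled upward by the extreme value 30.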

### Final Dataset

- After dropping rows and imputing missing values, the final cleaned datasets for both training (`df_target_0` and `df_target_1`) and test (`df_test`) are obtained, and their shapes are displayed for verification.


In [None]:
# Analyzing Missing Value Ranges in df_target_0
missing_percentage = df_target_0.isnull().mean(axis=1)

thresholds = [0.5, 0.6, 0.7, 0.8, 0.9]

for thresh in thresholds:
    percent = (missing_percentage > thresh).mean() * 100
    print(f"Rows with >{int(thresh*100)}% missing values: {percent:.2f}% ({(missing_percentage > thresh).sum()} rows)")

In [None]:
# Analyzing Missing Value Ranges in df_target_1
missing_percentage = df_target_1.isnull().mean(axis=1)

thresholds = [0.5, 0.6, 0.7, 0.8, 0.9]

for thresh in thresholds:
    percent = (missing_percentage > thresh).mean() * 100
    print(f"Rows with >{int(thresh*100)}% missing values: {percent:.2f}% ({(missing_percentage > thresh).sum()} rows)")

In [None]:
def drop_rows_with_missing_threshold(df, threshold_percent=70):
    """
    Drops rows from the DataFrame that have more than `threshold_percent` of missing values.

    Args:
        df (pd.DataFrame): Input DataFrame.
        threshold_percent (float): Threshold percentage (0-100) for missing values.

    Returns:
        pd.DataFrame: Cleaned DataFrame.
    """
    # Calculate how many columns is the threshold
    threshold_count = int((threshold_percent / 100.0) * df.shape[1])

    # Drop rows with more than threshold_count missing values
    df_cleaned = df[df.isnull().sum(axis=1) <= threshold_count].copy()

    print(f"✅ Rows before: {df.shape[0]}, after: {df_cleaned.shape[0]} (Dropped {df.shape[0] - df_cleaned.shape[0]})")
    return df_cleaned

In [None]:
df_target_0 = drop_rows_with_missing_threshold(df_target_0, threshold_percent=70)

In [None]:
def impute_missing_values(df):
    """
    Imputes missing values:
      - For categorical columns: fills with the mode.
      - For numerical columns: fills with the median.

    Returns:
        pd.DataFrame: DataFrame with imputed values.
    """
    df_imputed = df.copy()

    for col in df_imputed.columns:
        if df_imputed[col].dtype == 'object' or df_imputed[col].dtype.name == 'category':
            # Fill categorical with mode (skipped if the column is entirely missing)
            mode = df_imputed[col].mode()
            if not mode.empty:
                df_imputed[col] = df_imputed[col].fillna(mode[0])
        else:
            # Fill numeric columns with the median; assigning back avoids
            # chained inplace fillna, which newer pandas versions warn about
            median = df_imputed[col].median()
            df_imputed[col] = df_imputed[col].fillna(median)

    print("✅ Null values imputed successfully.")
    return df_imputed

In [None]:
df_target_0 = impute_missing_values(df_target_0)
df_target_1 = impute_missing_values(df_target_1)
df_test = impute_missing_values(df_test)

In [None]:
print(f'Missing values in df_target_0 : {df_target_0.isnull().sum().sum()}')
print(f'Missing values in df_target_1 : {df_target_1.isnull().sum().sum()}')
print(f'Missing values in df_test : {df_test.isnull().sum().sum()}')

In [None]:
df_cleaned = pd.concat([df_target_0, df_target_1], ignore_index=True)

In [None]:
print(f'Cleaned train Dataset shape :{df_cleaned.shape}')
print(f'Cleaned test Dataset shape :{df_test.shape}')

In [None]:
def check_column_mismatches(df_train, df_test):
    """
    Prints the mismatched columns between train and test DataFrames without dropping any columns.

    Parameters:
        df_train (pd.DataFrame): Training dataset
        df_test (pd.DataFrame): Testing dataset
    """
    train_cols = set(df_train.columns)
    test_cols = set(df_test.columns)

    train_extra = train_cols - test_cols
    test_extra = test_cols - train_cols

    if train_extra or test_extra:
        print("⚠️ Column mismatch found:")
        if train_extra:
            print(f"🟠 Columns in train but not in test: {sorted(train_extra)}")
        if test_extra:
            print(f"🔵 Columns in test but not in train: {sorted(test_extra)}")
    else:
        print("✅ Train and test datasets have exactly the same columns.")

    print(f"\nTrain Dataset shape : {df_train.shape}")
    print(f"Test Dataset shape  : {df_test.shape}")

In [None]:
check_column_mismatches(df_cleaned, df_test)

---

# 🧠 Model Selection

 

---

# ⚙️ Why Use CatBoost?

- CatBoost is a gradient boosting library developed by Yandex, designed with categorical features in mind.  
- Handles categorical variables **natively**, with no need for manual One-Hot or Label Encoding.  
- Supports **GPU training**, which can significantly speed up experiments.  
- Includes **regularization parameters** and **early stopping**, which help limit overfitting.  
- Often delivers **strong performance** on tabular data with minimal preprocessing.  


In [None]:
df_cleaned.drop(columns=['PRCDRCD19','PRCDRCD13','PRCDRCD11','PRCDRCD22'],inplace=True)
df_cleaned.shape

## 🚫 Removal of Leakage-Prone Features  
We dropped features: `PRCDRCD19`, `PRCDRCD13`, `PRCDRCD11`, and `PRCDRCD22`.  

- These were found to cause data leakage, leading to suspiciously high validation performance and poor generalization.  
- Including such features would mean the model "cheats" by using information not available at prediction time.  
- ✅ Removing them improved model robustness and prevented overfitting. 

In [None]:
X = df_cleaned.drop("Readmitted_30", axis=1)
y = df_cleaned["Readmitted_30"]

cat_features = X.select_dtypes(include=['category', 'object']).columns.tolist()

def objective(trial):
    bootstrap_type = trial.suggest_categorical("bootstrap_type", ["Bayesian", "Bernoulli"])
    
    params = {
        "iterations": trial.suggest_int("iterations", 500, 1500),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.1),
        "depth": trial.suggest_int("depth", 4, 10),
        "l2_leaf_reg": trial.suggest_float("l2_leaf_reg", 1.0, 10.0),
        "random_strength": trial.suggest_float("random_strength", 1.0, 10.0),
        "bootstrap_type": bootstrap_type,
        "class_weights": [0.55, 2.0],
        "loss_function": "Logloss",
        "eval_metric": "AUC",
        "verbose": 0,
        "task_type": "GPU",
        "devices": "0",
        "early_stopping_rounds": 50,
        "random_seed": 42
    }

    # subsample for Bernoulli bootstrapping
    if bootstrap_type == "Bernoulli":
        params["subsample"] = trial.suggest_float("subsample", 0.6, 1.0)

    auc_scores = []
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    for train_idx, val_idx in skf.split(X, y):
        X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
        y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

        model = CatBoostClassifier(**params)
        model.fit(
            X_train, y_train,
            eval_set=(X_val, y_val),
            use_best_model=True,
            cat_features=cat_features
        )

        preds = model.predict_proba(X_val)[:, 1]
        auc = roc_auc_score(y_val, preds)
        auc_scores.append(auc)

    return np.mean(auc_scores)

## 🧪 Step 1: Preparing Data for Optuna Study  

- Separated features (`X`) from the target (`y = Readmitted_30`).  
- Identified **categorical columns** to inform CatBoost which features require special handling.  
- Preparing this before training ensures **reproducibility** and proper model configuration.  

---

## 🛠️ Step 2: Optuna Objective Function for Hyperparameter Tuning  

- Used **Optuna**, a powerful library for automated hyperparameter optimization.  
- Defined a custom `objective()` function that:
  - Performs **5-fold Stratified Cross-Validation** to maintain class distribution.  
  - Trains `CatBoostClassifier` using trial-specific parameters.  
  - Returns the **mean AUC** across folds for each trial.

---

## 🧾 Explained: Key Hyperparameters Tuned via Optuna  

- `iterations` (500 to 1500): Number of boosting rounds. More rounds can fit better but raise the risk of overfitting.  
- `learning_rate` (0.01 to 0.1): Shrinkage applied to each tree's contribution. Lower values learn more slowly and usually need more iterations.  
- `depth` (4 to 10): Tree depth. Shallow trees overfit less; deeper trees can learn more complex patterns.  
- `l2_leaf_reg` (1.0 to 10.0): L2 regularization on leaf values, used to **penalize overly complex models**.  
- `random_strength` (1.0 to 10.0): Amount of randomness added when scoring candidate splits, to improve generalization.  
- `bootstrap_type`: Sampling strategy applied to the training rows:  
  - **Bayesian**: Assigns random weights to the rows.  
  - **Bernoulli**: Keeps each row with probability `subsample`.  
- `subsample` (0.6 to 1.0): Suggested only when `bootstrap_type` is Bernoulli. Training on random subsets reduces overfitting.  
- `early_stopping_rounds` (fixed at 50): Stops training when the validation metric has not improved for that many rounds.  

---

## ⚖️ Why Use `class_weights`?

- Addresses **class imbalance** in the target (`Readmitted_30`).  
- Heavier weight is assigned to the **minority class** to reduce bias toward majority.  

### 📐 Formula for Balanced Class Weights:

$$
\text{weight}_i = \frac{n_{\text{samples}}}{n_{\text{classes}} \times n_i}
$$

where $n_i$ is the number of training rows in class $i$.

- In our case: `class_weights = [0.55, 2.0]`  
- Under the formula, the ratio 2.0 / 0.55 ≈ 3.6 implies that the second class (likely "readmitted") is roughly 3.6x rarer than the first.  
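As a worked instance of the balanced-weight formula, the sketch below computes the weights for a toy label vector with an 80/20 split. The function name `balanced_class_weights` and the label counts are assumptions for illustration; they are not the competition's class counts.

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight_i = n_samples / (n_classes * n_i), one entry per class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in sorted(counts.items())}

# Toy labels (hypothetical): 80 negatives, 20 positives
toy_labels = [0] * 80 + [1] * 20
weights = balanced_class_weights(toy_labels)

print(weights)
print(weights[1] / weights[0])
```

The ratio of the two weights equals the ratio of the class counts (80 / 20 = 4), which is why the weight ratio can be read as how much rarer the minority class is.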

---

In [None]:
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=1)
print("Best AUC:", study.best_value)
print("Best Params:", study.best_params)

## 🎯 Step 3: Running the Optuna Study  

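- Optuna searches the parameter space to **maximize the mean cross-validated AUC**, a threshold-independent metric that suits imbalanced classification.  
- Only one trial is run here (`n_trials=1`), so the "best" parameters are a single draw from the search space; a real search typically uses **dozens to hundreds** of trials.  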

Extracted after tuning:
- `study.best_value`: Best AUC achieved.  
- `study.best_params`: Parameters that led to best performance.  
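To show what the AUC objective measures, here is a small pure-Python sketch that computes AUC as the fraction of (positive, negative) pairs ranked correctly. The function `pairwise_auc` and the toy scores are illustrative assumptions; the notebook itself uses `roc_auc_score`.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of positive/negative pairs where the positive scores higher.

    Ties count as half a win. Quadratic in the number of rows, so suitable
    only for small illustrative inputs.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): 4 negatives, 2 positives
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
toy_auc = pairwise_auc(y_true, scores)

print(toy_auc)
```

Seven of the eight positive/negative pairs are ordered correctly here. Because the metric depends only on the ranking of the scores, it does not change with the decision threshold or with the class ratio itself.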

In [None]:
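# Grab the best params from the Optuna study
best_params = study.best_params
# best_params contains only the values suggested by the trial ('iterations',
# 'learning_rate', etc.); fixed settings such as class_weights are not stored
# in the study, so they are added back here.
best_params.update({
    "class_weights": [0.55, 2.0],
    "loss_function": "Logloss",
    "eval_metric": "AUC",
    "verbose": 0,
    "task_type": "GPU",
    "devices": "0",
    "random_seed": 42
})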

# Train final model on full training set
final_model = CatBoostClassifier(**best_params)
final_model.fit(
    X, 
    y, 
    cat_features=cat_features
)

# Load and preprocess test data
X_test = df_test[X.columns]
test_csv_path = '/kaggle/input/softec-25-machine-learning-competition/test.csv'
df_ID = pd.read_csv(test_csv_path)

ID = df_ID['ID']

# Predict
test_preds = final_model.predict(X_test)

# Building submission DataFrame
submission = pd.DataFrame({
    "ID": ID,
    "Readmitted_30": test_preds
})

# Save to CSV
submission.to_csv("submission_optuna.csv", index=False)
print("Wrote submission_optuna.csv!")

---

## 🧠 Step 4: Final Model Training Using Best Parameters  

- Used the best hyperparameters from Optuna to train on the **full cleaned dataset (X, y)**.  
- Reused the tuned **regularization, tree depth, and learning rate** from the study.  
- Enabled `task_type = "GPU"` to **speed up training**.  

---

## 📦 Step 5: Predictions and Submission File  

- Preprocessed the test dataset using the same feature order as training data.  
- Used the trained model to make predictions.  
- Created `submission_optuna.csv` containing:  
  - `ID`: From test set.  
  - `Readmitted_30`: Predicted label.  
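The column alignment and submission layout can be sanity-checked before writing the file. The sketch below uses a toy test frame and placeholder predictions; `train_features`, `toy_test`, and `toy_preds` are hypothetical stand-ins for `X.columns`, the real test set, and the model output.

```python
import pandas as pd

# Stand-in for X.columns: the feature order the model was trained on
train_features = ["STUS_CD", "TYPE_ADM", "SRC_ADMS"]

# Toy test frame (hypothetical values) with columns in a different order
toy_test = pd.DataFrame({
    "ID": [101, 102, 103],
    "SRC_ADMS": ["1", "2", "1"],
    "STUS_CD": [1, 6, 3],
    "TYPE_ADM": ["1", "3", "1"],
})

# Fail early if any training feature is absent from the test data
missing = [c for c in train_features if c not in toy_test.columns]
if missing:
    raise ValueError(f"Test data is missing columns: {missing}")

# Reorder the test columns to match the training order
X_test_toy = toy_test[train_features]

# Placeholder predictions; in the notebook these come from the trained model
toy_preds = [0, 1, 0]
submission_toy = pd.DataFrame({"ID": toy_test["ID"], "Readmitted_30": toy_preds})

print(X_test_toy.columns.tolist())
print(submission_toy)
```

Checking that the submission has one row per test ID and exactly the two expected columns catches misalignment before upload.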

In [None]:
# Split validation set (X, y are full training data)
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)

# Re-training model on X_train
final_model.fit(X_train, y_train, cat_features=cat_features)

# Predictions
y_pred = final_model.predict(X_val)
y_prob = final_model.predict_proba(X_val)[:, 1]

# Computing metrics
precision = precision_score(y_val, y_pred)
recall = recall_score(y_val, y_pred)
f1 = f1_score(y_val, y_pred)
auc = roc_auc_score(y_val, y_prob)

# Bar Plot of Metrics
plt.figure(figsize=(8, 5))
metrics = {
    'Precision': precision,
    'Recall': recall,
    'F1 Score': f1,
    'AUC': auc
}
sns.barplot(x=list(metrics.keys()), y=list(metrics.values()), palette='mako')
plt.ylim(0, 1)
plt.title("Final Model Metrics")
for i, v in enumerate(metrics.values()):
    plt.text(i, v + 0.02, f"{v:.2f}", ha='center', va='bottom')
plt.tight_layout()
plt.show()

# ROC Curve
fpr, tpr, _ = roc_curve(y_val, y_prob)
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label=f'AUC = {auc:.2f}', color='blue')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.grid(True)
plt.tight_layout()
plt.show()

---

# ✅ Summary  

- ❌ Removed leaking features that were harming generalization.  
- 🔍 Hyperparameters were tuned via Optuna to boost AUC score.  
- 🧠 Used CatBoost with built-in categorical support and regularization.  
- ⚖️ Applied class weights to correct class imbalance.  
- 📈 Final model was trained on the full dataset using the best parameters.  
- 📁 Submission file generated for competition entry.  