> - **Goal**: Predict the likelihood of heart disease.
> - **Evaluation**: Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
> - **Best Score**: 0.95393 \ 0.94733 (610 \ 764 ) - Private Leaderboard


### Import Libraries

In [37]:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

In [18]:
test = pd.read_csv('./data/heart_disease/test.zip', index_col='id')
train = pd.read_csv('./data/heart_disease/train.zip', index_col='id')

In [19]:
train.columns = train.columns.str.replace(' ', '_').str.lower()
test.columns = test.columns.str.replace(' ', '_').str.lower()

In [20]:
train.head(5)

id,age,sex,chest_pain_type,bp,cholesterol,fbs_over_120,ekg_results,max_hr,exercise_angina,st_depression,slope_of_st,number_of_vessels_fluro,thallium,heart_disease
0,58,1,4,152,239,0,0,158,1,3.6,2,2,7,Presence
1,52,1,1,125,325,0,2,171,0,0.0,1,0,3,Absence
2,56,0,2,160,188,0,2,151,0,0.0,1,0,3,Absence
3,44,0,3,134,229,0,2,150,0,1.0,2,0,3,Absence
4,58,1,4,140,234,0,2,125,1,3.8,2,3,3,Presence


In [21]:
train.info()

<class 'pandas.DataFrame'>
RangeIndex: 630000 entries, 0 to 629999
Data columns (total 14 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   age                      630000 non-null  int64  
 1   sex                      630000 non-null  int64  
 2   chest_pain_type          630000 non-null  int64  
 3   bp                       630000 non-null  int64  
 4   cholesterol              630000 non-null  int64  
 5   fbs_over_120             630000 non-null  int64  
 6   ekg_results              630000 non-null  int64  
 7   max_hr                   630000 non-null  int64  
 8   exercise_angina          630000 non-null  int64  
 9   st_depression            630000 non-null  float64
 10  slope_of_st              630000 non-null  int64  
 11  number_of_vessels_fluro  630000 non-null  int64  
 12  thallium                 630000 non-null  int64  
 13  heart_disease            630000 non-null  str    
dtypes: float64(1), 

In [22]:
train.describe()

,age,sex,chest_pain_type,bp,cholesterol,fbs_over_120,ekg_results,max_hr,exercise_angina,st_depression,slope_of_st,number_of_vessels_fluro,thallium
count,630000.0,630000.0,630000.0,630000.0,630000.0,630000.0,630000.0,630000.0,630000.0,630000.0,630000.0,630000.0,630000.0
mean,54.136706,0.714735,3.312752,130.497433,245.011814,0.079987,0.98166,152.816763,0.273725,0.716028,1.455871,0.45104,4.618873
std,8.256301,0.451541,0.851615,14.975802,33.681581,0.271274,0.998783,19.112927,0.44587,0.948472,0.545192,0.798549,1.950007
min,29.0,0.0,1.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,1.0,0.0,3.0
25%,48.0,0.0,3.0,120.0,223.0,0.0,0.0,142.0,0.0,0.0,1.0,0.0,3.0
50%,54.0,1.0,4.0,130.0,243.0,0.0,0.0,157.0,0.0,0.1,1.0,0.0,3.0
75%,60.0,1.0,4.0,140.0,269.0,0.0,2.0,166.0,1.0,1.4,2.0,1.0,7.0
max,77.0,1.0,4.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,3.0,3.0,7.0


## Feature Exploration

### number_of_vessels_fluro:
> - Coronary Artery Disease Detection: Fluoroscopy identifies stenotic lesions by detecting calcification, with two or three-vessel calcification significantly increasing the risk of stenosis in patients aged 45–64.
> - Age Dependency: In patients over 65, the number of vessels calcified is less effective for detection. In patients under 45, even single mild calcification is highly significant.
> - Procedure Metrics: During imaging, typically 10.4 ± 3.1 exposures are taken, with a total fluoroscopy time of 10.2 ± 5.1 minutes.
> - Selective Angiography: Specific vessels like the celiac, splenic, or renal arteries require precise flow rates (e.g., 4–15 mL/s for the hepatic artery).

### bp (Blood Pressure):
> Categories:
> - Normal: Less than 120/80 mmHg.
> - Elevated: 120-129 systolic AND less than 80 mmHg diastolic.
> - Hypertension (High BP): 130/80 mmHg or higher.
> - Hypotension (Low BP): Generally, any reading too low to deliver enough blood to the body's organs.
> - Significance: Consistently high BP (hypertension) can damage blood vessels and lead to serious health issues, including heart disease, stroke, and kidney failure.
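
The `bp` column in this dataset holds a single reading per patient (94 to 200 in `train.describe()`), so the categories above can only be applied to the systolic part. The sketch below assumes `bp` is systolic pressure in mmHg, which the data description does not confirm; `bp_category_systolic` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def bp_category_systolic(bp: pd.Series) -> pd.Series:
    """Bin a systolic reading (assumed mmHg) into the categories listed above."""
    bins = [-np.inf, 120, 130, np.inf]          # [low, high) intervals
    labels = ["normal", "elevated", "hypertension"]
    return pd.cut(bp, bins=bins, labels=labels, right=False).astype(str)

sample_bp = pd.Series([94, 125, 130, 152])
bp_labels = bp_category_systolic(sample_bp)
print(bp_labels.tolist())
```

Without a diastolic value, the "hypotension" category cannot be derived, so it is omitted here.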

### cholesterol
> - Total Cholesterol: Below 200 mg/dL is desirable; 200–239 mg/dL is borderline high; 240 mg/dL or higher is high.
> - LDL (Low-Density Lipoprotein - "Bad" Cholesterol): Ideally less than (100 mg/dL). People with diabetes or heart disease may need to aim for below (70 mg/dL).
> - HDL (High-Density Lipoprotein - "Good" Cholesterol): Higher is better. Preferably (60 mg/dL) or higher, although levels at least (40 mg/dL) for men and (50 mg/dL) for women are generally acceptable.

### fbs_over_120 (Fasting Blood Sugar):
> - Low Blood Sugar (Hypoglycemia): Below (70 mg/dL).
> - Normal Blood Sugar: (70-99 mg/dL).
> - Prediabetes: (100-125 mg/dL).
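> - Diabetes: 126 mg/dL or higher on two separate tests.
> - In this dataset, `fbs_over_120` is a binary flag (0 or 1) indicating whether fasting blood sugar exceeds 120 mg/dL.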

### max_hr (Maximum Heart Rate):
> - Age-Related Maximum Heart Rate: A common formula to estimate maximum heart rate is (220 - age).
> - Exercise Intensity Zones:
>   - Moderate Intensity: 50-70% of maximum heart rate.
>   - Vigorous Intensity: 70-85% of maximum heart rate.
> - Significance: Achieving a higher maximum heart rate during exercise can indicate better cardiovascular fitness, while a lower maximum heart rate may suggest potential heart issues.
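
The (220 - age) estimate can be turned into a ratio feature: the achieved `max_hr` divided by the age-predicted maximum. This is only a sketch of one possible feature; the column name `pct_predicted_max_hr` and the helper are hypothetical and were not used for the scores reported in this notebook.

```python
import pandas as pd

def add_pct_predicted_max_hr(df: pd.DataFrame) -> pd.DataFrame:
    """Add achieved max heart rate as a fraction of the (220 - age) estimate."""
    out = df.copy()
    predicted_max = 220 - out["age"]
    out["pct_predicted_max_hr"] = out["max_hr"] / predicted_max
    return out

# First two rows of train.head(): ages 58 and 52, max_hr 158 and 171.
hr_sample = pd.DataFrame({"age": [58, 52], "max_hr": [158, 171]})
hr_result = add_pct_predicted_max_hr(hr_sample)
print(hr_result)
```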

### st_depression (ST Depression):
> - Definition: ST depression refers to a downward shift of the ST segment on an electrocardiogram (ECG) reading.
> - Significance: It can indicate myocardial ischemia, which is a condition where the heart muscle doesn't get enough oxygen-rich blood.
> - Causes: ST depression can be caused by various factors, including coronary artery disease, electrolyte imbalances, or certain medications.
> - Normal: Less than 0.5 mm
> - Mild: 0.5 to 1.0 mm
> - Moderate: 1.0 to 2.0 mm
> - Severe: Greater than 2.0 mm
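
With 630,000 rows, a row-by-row `apply` is slow; the same thresholds can be expressed as a vectorized binning. The following is a minimal sketch using `pd.cut` with left-closed bins, which treats the boundary values 0.5, 1.0, and 2.0 as the start of the next category. The names `ST_BINS`, `ST_LABELS`, and `st_depression_category` are illustrative.

```python
import numpy as np
import pandas as pd

ST_BINS = [-np.inf, 0.5, 1.0, 2.0, np.inf]
ST_LABELS = ["normal", "mild", "moderate", "severe"]

def st_depression_category(values: pd.Series) -> pd.Series:
    """Vectorized binning; right=False makes each interval [low, high)."""
    return pd.cut(values, bins=ST_BINS, labels=ST_LABELS, right=False).astype(str)

st_sample = pd.Series([0.0, 0.5, 1.0, 2.0, 3.6])
st_labels = st_depression_category(st_sample)
print(st_labels.tolist())
```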


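### slope_of_st (Slope of the ST Segment):
> - Upsloping: Generally considered normal; indicates that the heart is receiving adequate blood flow.
> - Flat: May suggest some level of ischemia or reduced blood flow to the heart muscle.
> - Downsloping: Often associated with significant ischemia and may indicate a higher risk of heart disease.
> - 0 - upsloping
> - 1 - flat
> - 2 - downsloping
> - 3 - undefined
> - Caveat: `train.describe()` reports values from 1 to 3 for this column, so the data may instead follow the UCI convention (1 = upsloping, 2 = flat, 3 = downsloping) rather than the 0-based coding listed here.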

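### thallium (Thallium Stress Test):
> - This column is a stress-test result code rather than a measured level; `train.describe()` shows values between 3 and 7.
> - In the UCI heart disease data the codes are 3 = normal, 6 = fixed defect, 7 = reversible defect; this dataset appears to follow the same coding.
> - Working thresholds used in this notebook: Normal < 5, Toxic > 7 (aka thallotoxicosis).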
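
If the codes do follow the UCI convention, the `slope_of_st` and `thallium` columns can be labeled with plain dictionaries. This is a sketch under that assumption; the mappings are not confirmed by the competition's data description, and `label_codes` is a hypothetical helper.

```python
import pandas as pd

# Assumed coding (UCI heart disease convention), not verified for this dataset.
THALLIUM_LABELS = {3: "normal", 6: "fixed_defect", 7: "reversible_defect"}
SLOPE_LABELS = {1: "upsloping", 2: "flat", 3: "downsloping"}

def label_codes(codes: pd.Series, mapping: dict) -> pd.Series:
    """Map integer codes to labels; any code missing from the mapping becomes 'unknown'."""
    return codes.map(mapping).fillna("unknown")

thallium_labels = label_codes(pd.Series([3, 6, 7, 5]), THALLIUM_LABELS)
slope_labels = label_codes(pd.Series([1, 2, 3]), SLOPE_LABELS)
print(thallium_labels.tolist(), slope_labels.tolist())
```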

In [23]:
# Based on the descriptions above, we can create new features
def feature_engineering(data: pd.DataFrame):
    df = data.copy()

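    # create thallotoxicosis column
    # NOTE: describe() above shows thallium has a maximum of 7, so `> 7` never
    # fires and this column is 0 for every row.
    df['thallotoxicosis'] = np.where(df['thallium'] > 7, 1, 0)

    # create st_slope_category column
    # NOTE: describe() above shows slope_of_st ranges from 1 to 3, so the
    # 'upsloping' key (0) is never matched by this 0-based mapping.
    slope_mapping = {0: 'upsloping', 1: 'flat', 2: 'downsloping', 3: 'undefined'}
    df['st_slope_category'] = df['slope_of_st'].map(slope_mapping)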

    # create st_depression_category column
    def categorize_st_depression(value):
        if value < 0.5:
            return 'normal'
        elif 0.5 <= value < 1.0:
            return 'mild'
        elif 1.0 <= value < 2.0:
            return 'moderate'
        else:
            return 'severe'

    df['st_depression_category'] = df['st_depression'].apply(categorize_st_depression)

    # create cholesterol_category column
    def categorize_cholesterol(value):
        if value < 200:
            return 'desirable'
        elif 200 <= value < 240:
            return 'borderline_high'
        else:
            return 'high'

    df['cholesterol_category'] = df['cholesterol'].apply(categorize_cholesterol)

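    # create bp_category column
    # NOTE: this assumes bp is packed as systolic * 1000 + diastolic. In this
    # dataset bp is a single reading (94-200 per describe()), so systolic is
    # always 0, diastolic equals bp, and every row lands in 'hypertension'.
    def categorize_bp(row):
        systolic = row['bp'] // 1000
        diastolic = row['bp'] % 1000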
        if systolic < 120 and diastolic < 80:
            return 'normal'
        elif 120 <= systolic < 130 and diastolic < 80:
            return 'elevated'
        elif systolic >= 130 or diastolic >= 80:
            return 'hypertension'
        else:
            return 'hypotension'

    df['bp_category'] = df.apply(categorize_bp, axis=1)

    # create vessels_fluro_cat column

    def categorize_vessels_fluro(value):
        if value == 0:
            return 'none'
        elif value == 1:
            return 'single_vessel'
        elif value == 2:
            return 'two_vessels'
        elif value == 3:
            return 'three_vessels'
        else:
            return 'unknown'

    df['vessels_fluro_cat'] = df['number_of_vessels_fluro'].apply(categorize_vessels_fluro)

    return df
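
Engineered columns are worth checking for variation before they reach a model: given the value ranges in `train.describe()`, some of the thresholds above may never trigger. A minimal sketch of such a check follows; `constant_columns` is a hypothetical helper, and the small frame only mimics the shape of `train_fe`.

```python
import pandas as pd

def constant_columns(df: pd.DataFrame) -> list[str]:
    """Return the names of columns that hold a single distinct value."""
    counts = df.nunique(dropna=False)
    return counts[counts <= 1].index.tolist()

fe_demo = pd.DataFrame(
    {
        "bp": [152, 125, 160],
        "bp_category": ["hypertension"] * 3,
        "thallotoxicosis": [0, 0, 0],
    }
)
constant_cols = constant_columns(fe_demo)
print(constant_cols)
```

Running the same helper on `train_fe` would show which engineered features carry no information and can be dropped or redefined.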

In [24]:
# apply feature engineering
train_fe = feature_engineering(train)
test_fe = feature_engineering(test)

In [25]:
train_fe.head(5)

id,age,sex,chest_pain_type,bp,cholesterol,fbs_over_120,ekg_results,max_hr,exercise_angina,st_depression,slope_of_st,number_of_vessels_fluro,thallium,heart_disease,thallotoxicosis,st_slope_category,st_depression_category,cholesterol_category,bp_category,vessels_fluro_cat
0,58,1,4,152,239,0,0,158,1,3.6,2,2,7,Presence,0,downsloping,severe,borderline_high,hypertension,two_vessels
1,52,1,1,125,325,0,2,171,0,0.0,1,0,3,Absence,0,flat,normal,high,hypertension,none
2,56,0,2,160,188,0,2,151,0,0.0,1,0,3,Absence,0,flat,normal,desirable,hypertension,none
3,44,0,3,134,229,0,2,150,0,1.0,2,0,3,Absence,0,downsloping,moderate,borderline_high,hypertension,none
4,58,1,4,140,234,0,2,125,1,3.8,2,3,3,Presence,0,downsloping,severe,borderline_high,hypertension,three_vessels


In [26]:
cal_columns = test_fe.select_dtypes(include=[str]).columns.tolist()
num_columns = test_fe.select_dtypes(include=[np.number]).columns.tolist()

In [27]:
# create a pipeline for preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_columns),
        ('cat', OrdinalEncoder(), cal_columns)
    ])

In [28]:
# split the data

X = train_fe.drop('heart_disease', axis=1)
y = train_fe['heart_disease']

y = y.map({'Absence': 0, 'Presence': 1})

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
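
Passing `stratify=y` keeps the class ratio nearly identical in both splits. The sketch below checks this on synthetic labels (the 45% positive rate is an arbitrary choice for illustration, not taken from the competition data).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
y_demo = pd.Series((rng.random(10_000) < 0.45).astype(int))
X_demo = pd.DataFrame({"x": rng.normal(size=10_000)})

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification the positive rate differs only by rounding.
rate_gap = abs(y_tr.mean() - y_va.mean())
print(f"train rate={y_tr.mean():.4f}, val rate={y_va.mean():.4f}")
```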

In [29]:
X.head(5)

id,age,sex,chest_pain_type,bp,cholesterol,fbs_over_120,ekg_results,max_hr,exercise_angina,st_depression,slope_of_st,number_of_vessels_fluro,thallium,thallotoxicosis,st_slope_category,st_depression_category,cholesterol_category,bp_category,vessels_fluro_cat
0,58,1,4,152,239,0,0,158,1,3.6,2,2,7,0,downsloping,severe,borderline_high,hypertension,two_vessels
1,52,1,1,125,325,0,2,171,0,0.0,1,0,3,0,flat,normal,high,hypertension,none
2,56,0,2,160,188,0,2,151,0,0.0,1,0,3,0,flat,normal,desirable,hypertension,none
3,44,0,3,134,229,0,2,150,0,1.0,2,0,3,0,downsloping,moderate,borderline_high,hypertension,none
4,58,1,4,140,234,0,2,125,1,3.8,2,3,3,0,downsloping,severe,borderline_high,hypertension,three_vessels


In [30]:
y.head(5)

id
0    1
1    0
2    0
3    0
4    1
Name: heart_disease, dtype: int64

In [31]:
# create a model pipeline ( use GaussianNB as base model)

model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', GaussianNB())
])

In [32]:
# train the model
model.fit(X_train, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num', StandardScaler(), [...]),
                                                 ('cat', OrdinalEncoder(), [...])])),
                ('classifier', GaussianNB())])

In [33]:
# evaluate the model on the validation set

y_val_pred = model.predict_proba(X_val)[:, 1]
roc_auc = roc_auc_score(y_val, y_val_pred)
print(f'Validation ROC AUC: {roc_auc:.4f}')

Validation ROC AUC: 0.9334


In [35]:
# evaluate the model on the train set (to check for overfitting)
y_train_pred = model.predict_proba(X_train)[:, 1]
roc_auc_train = roc_auc_score(y_train, y_train_pred)
print(f'Train ROC AUC: {roc_auc_train:.4f}')

Train ROC AUC: 0.9314
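
A single train/validation comparison depends on one particular split. A complementary check, sketched here on synthetic data, is stratified k-fold cross-validation with ROC AUC as the scoring metric; the dataset from `make_classification` is a stand-in, not the competition data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X_cv, y_cv = make_classification(n_samples=2000, n_features=10, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(GaussianNB(), X_cv, y_cv, cv=cv, scoring="roc_auc")

print(f"ROC AUC per fold: {cv_scores.round(4)}")
print(f"mean={cv_scores.mean():.4f}, std={cv_scores.std():.4f}")
```

In the notebook itself, the full `model` pipeline could be passed in place of `GaussianNB()` so that scaling and encoding are refit inside each fold.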


In [34]:
# confusion matrix and classification report
y_val_pred_class = model.predict(X_val)
cm = confusion_matrix(y_val, y_val_pred_class)
cr = classification_report(y_val, y_val_pred_class)
print('Confusion Matrix:')
print(cm)
print('\nClassification Report:')
print(cr)

Confusion Matrix:
[[62347  7162]
 [ 9487 47004]]

Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.90      0.88     69509
           1       0.87      0.83      0.85     56491

    accuracy                           0.87    126000
   macro avg       0.87      0.86      0.87    126000
weighted avg       0.87      0.87      0.87    126000
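
The class-1 figures in the report can be reproduced directly from the confusion matrix above, which makes the row and column convention explicit (rows are true labels, columns are predictions).

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class.
cm_values = np.array([[62347, 7162],
                      [9487, 47004]])
tn, fp, fn, tp = cm_values.ravel()

precision_1 = tp / (tp + fp)             # of predicted positives, how many are correct
recall_1 = tp / (tp + fn)                # of actual positives, how many are found
accuracy = (tp + tn) / cm_values.sum()

print(f"precision={precision_1:.2f}, recall={recall_1:.2f}, accuracy={accuracy:.2f}")
```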



In [38]:
# let's iterate across different models and calculate roc_auc_score
models = {
    'GaussianNB': GaussianNB(),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42)
}

for model_name, model_instance in models.items():
    model_pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', model_instance)
    ])
    model_pipeline.fit(X_train, y_train)
    y_val_pred = model_pipeline.predict_proba(X_val)[:, 1]
    roc_auc = roc_auc_score(y_val, y_val_pred)
    print(f'{model_name} Validation ROC AUC: {roc_auc:.4f}')

# GaussianNB Validation ROC AUC: 0.9334
# Decision Tree Validation ROC AUC: 0.8252
# Random Forest Validation ROC AUC: 0.9469
# Logistic Regression Validation ROC AUC: 0.9519

GaussianNB Validation ROC AUC: 0.9334
Decision Tree Validation ROC AUC: 0.8252
Random Forest Validation ROC AUC: 0.9469
Logistic Regression Validation ROC AUC: 0.9519


In [42]:
# let's build a Voting Classifier with the top 3 models

voting_clf = VotingClassifier(
    estimators=[
        ('gnb', GaussianNB()),
        ('rf', RandomForestClassifier(random_state=42)),
        ('lr', LogisticRegression(max_iter=1000, random_state=42))
    ],
    voting='soft'
)

voting_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', voting_clf)
])

voting_pipeline.fit(X_train, y_train)
y_val_pred = voting_pipeline.predict_proba(X_val)[:, 1]
roc_auc = roc_auc_score(y_val, y_val_pred)
print(f'Voting Classifier Validation ROC AUC: {roc_auc:.4f}')

Voting Classifier Validation ROC AUC: 0.9472
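
With `voting='soft'` and no weights, the ensemble's predicted probability is the plain average of the members' probabilities, which helps explain why a weaker member can pull the ensemble below its best single model. The sketch below verifies the averaging on synthetic data with two of the three estimators.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X_vote, y_vote = make_classification(n_samples=500, n_features=8, random_state=42)

voter = VotingClassifier(
    estimators=[
        ("gnb", GaussianNB()),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
).fit(X_vote, y_vote)

# estimators_ holds the fitted clones in the order given above.
manual_avg = np.mean([est.predict_proba(X_vote) for est in voter.estimators_], axis=0)
matches = np.allclose(manual_avg, voter.predict_proba(X_vote))
print(matches)
```

The `weights` parameter of `VotingClassifier` can shift the average toward stronger members if equal weighting is not desired.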


In [46]:
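# NOTE: np.round(..., 2) collapses nearby probabilities into ties; since the
# metric is ROC AUC, submitting unrounded probabilities avoids that.
submission = pd.DataFrame(
    {
        'id': test_fe.index,
        'heart_disease': np.round(voting_pipeline.predict_proba(test_fe)[:, 1], 2)
    }
)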
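
ROC AUC depends only on how predictions are ranked, and tied scores count as half a correct ordering. Rounding does not always change the score, but the toy example below shows how it can: two correctly ordered pairs become ties after rounding to two decimals.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 1, 0, 1])
proba = np.array([0.501, 0.504, 0.101, 0.104])

auc_raw = roc_auc_score(y_true, proba)
auc_rounded = roc_auc_score(y_true, np.round(proba, 2))  # scores become 0.5, 0.5, 0.1, 0.1

print(f"raw={auc_raw:.2f}, rounded={auc_rounded:.2f}")
```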

In [47]:
submission.head(5)

,id,heart_disease
0,630000,0.91
1,630001,0.0
2,630002,1.0
3,630003,0.0
4,630004,0.12


In [48]:
submission.to_csv('submission.csv', index=False)