# Cardiovascular Disease – Model Comparison and Final Model Export

This notebook continues the work on the **Cardiovascular Disease** dataset from Kaggle.

Goals:

1. Load the dataset and show basic info.
2. Split into training and validation sets.
3. Train multiple models and compare them with accuracy, F1, and ROC AUC.
4. Decide which model is best for this problem.
5. Train the chosen model on the full dataset and save it to disk for deployment.


In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier

from joblib import dump

In [2]:
# Path to the dataset (adjust if needed)
data_path = "dataset/cardio_train.csv"  # forward slash avoids the invalid "\c" escape and works on every OS

# The Kaggle cardio dataset uses ';' as a separator
df = pd.read_csv(data_path, sep=';')

print("Shape:", df.shape)
df.head()

Shape: (70000, 13)


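|   | id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |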


In [3]:
# Basic info about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           70000 non-null  int64  
 1   age          70000 non-null  int64  
 2   gender       70000 non-null  int64  
 3   height       70000 non-null  int64  
 4   weight       70000 non-null  float64
 5   ap_hi        70000 non-null  int64  
 6   ap_lo        70000 non-null  int64  
 7   cholesterol  70000 non-null  int64  
 8   gluc         70000 non-null  int64  
 9   smoke        70000 non-null  int64  
 10  alco         70000 non-null  int64  
 11  active       70000 non-null  int64  
 12  cardio       70000 non-null  int64  
dtypes: float64(1), int64(12)
memory usage: 6.9 MB


In [4]:
# Summary statistics for numeric columns
df.describe()

|   | id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 70000.0 | 70000.0 | 70000.0 | 70000.0 | 70000.0 | 70000.0 | 70000.0 | 70000.0 | 70000.0 | 70000.0 | 70000.0 | 70000.0 | 70000.0 |
| mean | 49972.4199 | 19468.865814 | 1.349571 | 164.359229 | 74.20569 | 128.817286 | 96.630414 | 1.366871 | 1.226457 | 0.088129 | 0.053771 | 0.803729 | 0.4997 |
| std | 28851.302323 | 2467.251667 | 0.476838 | 8.210126 | 14.395757 | 154.011419 | 188.47253 | 0.68025 | 0.57227 | 0.283484 | 0.225568 | 0.397179 | 0.500003 |
| min | 0.0 | 10798.0 | 1.0 | 55.0 | 10.0 | -150.0 | -70.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 25006.75 | 17664.0 | 1.0 | 159.0 | 65.0 | 120.0 | 80.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 50% | 50001.5 | 19703.0 | 1.0 | 165.0 | 72.0 | 120.0 | 80.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 75% | 74889.25 | 21327.0 | 2.0 | 170.0 | 82.0 | 140.0 | 90.0 | 2.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| max | 99999.0 | 23713.0 | 2.0 | 250.0 | 200.0 | 16020.0 | 11000.0 | 3.0 | 3.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [5]:
# Distribution of the target variable (cardio = 1 means cardiovascular disease present)
df['cardio'].value_counts(normalize=True)

cardio
0    0.5003
1    0.4997
Name: proportion, dtype: float64

In [6]:
# Create age in years (rounded to 1 decimal) and drop the original 'age' (in days)
df['age_years'] = (df['age'] / 365).round(1)
df = df.drop(columns=['age'])

# Define features X and target y
X = df.drop(columns=['cardio', 'id'])
y = df['cardio']

# Train/validation split
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

X_train.shape, X_val.shape

((56000, 11), (14000, 11))

In [7]:
models = {
    "log_reg": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "random_forest": RandomForestClassifier(
        n_estimators=100, max_depth=10, random_state=42, n_jobs=-1
    ),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
    "knn": KNeighborsClassifier(n_neighbors=15),
}

def evaluate_models(models, X_train, y_train, X_val, y_val):
    results = []
    for name, model in models.items():
        print(f"Training {name}...")
        model.fit(X_train, y_train)
        
        y_pred = model.predict(X_val)
        acc = accuracy_score(y_val, y_pred)
        f1 = f1_score(y_val, y_pred)
        
        if hasattr(model, "predict_proba"):
            y_proba = model.predict_proba(X_val)[:, 1]
            roc_auc = roc_auc_score(y_val, y_proba)
        else:
            roc_auc = np.nan
        
        results.append({
            "model": name,
            "accuracy": acc,
            "f1": f1,
            "roc_auc": roc_auc
        })
    return pd.DataFrame(results)

In [8]:
results_df = evaluate_models(models, X_train, y_train, X_val, y_val)

# Show models sorted by ROC AUC (higher is better)
results_df.sort_values("roc_auc", ascending=False)

Training log_reg...


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Training decision_tree...
Training random_forest...
Training gradient_boosting...
Training knn...


|   | model | accuracy | f1 | roc_auc |
|---|---|---|---|---|
| 3 | gradient_boosting | 0.733143 | 0.723546 | 0.799762 |
| 2 | random_forest | 0.732571 | 0.720597 | 0.79832 |
| 1 | decision_tree | 0.727571 | 0.71375 | 0.787435 |
| 0 | log_reg | 0.714071 | 0.702578 | 0.77829 |
| 4 | knn | 0.709429 | 0.699023 | 0.768099 |
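
The convergence warning printed for `log_reg` suggests scaling the data, and k-nearest neighbours is also sensitive to feature scale because it relies on distances. A minimal sketch of how scaled variants could be added to the comparison is shown below. The dictionary name `scaled_models`, its keys, and the synthetic data are assumptions made for illustration; the sketch does not reproduce or replace the validation scores reported above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical scaled variants of the two scale-sensitive models.
scaled_models = {
    "log_reg_scaled": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "knn_scaled": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15)),
}

# Synthetic stand-in data with very different feature scales (not the cardio dataset).
rng = np.random.default_rng(42)
X_demo = np.column_stack([
    rng.normal(165, 8, 500),       # height-like feature
    rng.normal(20000, 2500, 500),  # age-in-days-like feature
])
y_demo = (X_demo[:, 1] + rng.normal(0, 2500, 500) > 20000).astype(int)

for name, model in scaled_models.items():
    # The scaler is fitted on the training data only, inside the pipeline.
    model.fit(X_demo, y_demo)
    proba = model.predict_proba(X_demo)[:, 1]
    print(name, proba.shape)
```

Because a pipeline exposes `predict_proba` when its final step does, these entries could be passed to `evaluate_models` in the same way as the original `models` dictionary.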


## Choosing the best model

For this medical classification problem, ROC AUC is the most important metric,
because we care about ranking patients by their risk of cardiovascular disease.
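
To make the "ranking" interpretation concrete: ROC AUC equals the fraction of (positive, negative) patient pairs in which the positive patient receives the higher predicted risk, with ties counted as half. The small check below uses made-up labels and scores (not values from the cardio dataset) and compares that pairwise fraction with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted risks (illustrative values only).
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70])

# Compare every positive score with every negative score.
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
diff = pos[:, None] - neg[None, :]

# Share of pairs ranked correctly; ties count as half.
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(pairwise_auc, roc_auc_score(y_true, y_score))
```

Here 8 of the 9 pairs are ranked correctly, so both numbers are 8/9.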

Steps:

1. Look at **ROC AUC** first.
2. If two models have very close ROC AUC, use **F1** (which balances precision and recall) and then **accuracy** as tie-breakers.
3. Simpler models (like logistic regression) are easier to interpret, but tree-based ensembles (random forest, gradient boosting) often give better performance.
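
The first two steps can be sketched as a small helper. The function name `pick_best` and the tolerance of 0.005 for "very close" are assumptions chosen for illustration, not values used elsewhere in this notebook; the metrics are copied from the validation results table.

```python
import pandas as pd

def pick_best(results, auc_tolerance=0.005):
    """Pick by ROC AUC; among models within `auc_tolerance` of the top, prefer F1, then accuracy."""
    top_auc = results["roc_auc"].max()
    close = results[results["roc_auc"] >= top_auc - auc_tolerance]
    return close.sort_values(["f1", "accuracy"], ascending=False).iloc[0]["model"]

# Validation metrics copied from the results table in this notebook.
results_demo = pd.DataFrame({
    "model": ["log_reg", "decision_tree", "random_forest", "gradient_boosting", "knn"],
    "accuracy": [0.714071, 0.727571, 0.732571, 0.733143, 0.709429],
    "f1": [0.702578, 0.713750, 0.720597, 0.723546, 0.699023],
    "roc_auc": [0.778290, 0.787435, 0.798320, 0.799762, 0.768099],
})

print(pick_best(results_demo))
```

With this tolerance, random forest and gradient boosting both count as "close", and gradient boosting wins on F1.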

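In my runs on this dataset, **gradient boosting** gives the highest ROC AUC (0.7998) together with the highest F1 (0.7235) and accuracy (0.7331), although its lead over random forest (ROC AUC 0.7983) is small.
So I choose **GradientBoostingClassifier** as the final baseline model.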


In [9]:
# Programmatically pick the best model by ROC AUC
sorted_df = results_df.sort_values("roc_auc", ascending=False)
best_row = sorted_df.iloc[0]
best_name = best_row['model']

print("Best model based on ROC AUC:")
print(best_row)

best_model = models[best_name]

Best model based on ROC AUC:
model       gradient_boosting
accuracy             0.733143
f1                   0.723546
roc_auc              0.799762
Name: 3, dtype: object


## Train the chosen model on the full dataset and save it

Now we:

1. Re-train the best model (gradient boosting) on the **entire dataset** (no validation split).
2. Save the fitted model and the feature column order with `joblib`.

Later, `train.py` will reproduce these files and `predict.py` will load them when we deploy the model as a web service.


In [10]:
# Rebuild X and y on the full dataset (with age_years)
df_full = pd.read_csv(data_path, sep=';')
df_full['age_years'] = (df_full['age'] / 365).round(1)
df_full = df_full.drop(columns=['age'])

X_full = df_full.drop(columns=['cardio', 'id'])
y_full = df_full['cardio']

feature_columns = X_full.columns.tolist()

final_model = GradientBoostingClassifier(random_state=42)
final_model.fit(X_full, y_full)

dump(final_model, 'cardio_model.joblib')
dump(feature_columns, 'cardio_feature_columns.joblib')

print('Saved cardio_model.joblib and cardio_feature_columns.joblib')

Saved cardio_model.joblib and cardio_feature_columns.joblib


### Summary

- We compared five models and found that **gradient boosting** performs best on the validation split (highest ROC AUC, F1, and accuracy).
- We retrained a `GradientBoostingClassifier` on the full dataset and saved:
  - `cardio_model.joblib` – the trained model
  - `cardio_feature_columns.joblib` – the exact list and order of feature columns
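
Before relying on the saved files, it can be worth checking that they round-trip cleanly. The sketch below is a self-contained illustration under stated assumptions: it trains on synthetic data with a subset of the feature names, and writes to hypothetical file names in a temporary directory rather than to `cardio_model.joblib`.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from joblib import dump, load
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data; column names are a subset of the notebook's features.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "ap_hi": rng.normal(128, 15, 200),
    "cholesterol": rng.integers(1, 4, 200),
    "age_years": rng.normal(53, 7, 200).round(1),
})
y_demo = (X_demo["ap_hi"] > 130).astype(int)

model = GradientBoostingClassifier(random_state=42).fit(X_demo, y_demo)
columns = X_demo.columns.tolist()

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "demo_model.joblib"      # hypothetical file name
    columns_path = Path(tmp) / "demo_columns.joblib"  # hypothetical file name
    dump(model, model_path)
    dump(columns, columns_path)

    loaded_model = load(model_path)
    loaded_columns = load(columns_path)

# The reloaded model should reproduce the original probabilities.
same = np.allclose(
    model.predict_proba(X_demo)[:, 1],
    loaded_model.predict_proba(X_demo[loaded_columns])[:, 1],
)
print(loaded_columns, same)
```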

Next steps for the midterm project:

1. Move the training logic into a standalone script: `train.py`.
2. Create a `predict.py` (or `main.py`) web service that:
   - loads `cardio_model.joblib` and `cardio_feature_columns.joblib`
   - accepts JSON with patient features
   - returns the predicted risk of cardiovascular disease.

You can use this notebook as a reference when writing those scripts.
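
As a starting point for the prediction step, here is a minimal sketch of turning one JSON-like dictionary into a risk score. The function name `predict_risk`, the toy model, and the two-feature column list are assumptions for illustration; in `predict.py` the model and column list would instead come from `joblib.load` on the two saved files, and the web framework is left open.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

def predict_risk(model, feature_columns, patient):
    """Return P(cardio = 1) for one patient given as a dict of feature values."""
    missing = [col for col in feature_columns if col not in patient]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    # Build a one-row frame in the exact column order used during training.
    row = pd.DataFrame([{col: patient[col] for col in feature_columns}])
    return float(model.predict_proba(row)[0, 1])

# Stand-in model trained on toy data (not the saved gradient boosting model).
feature_columns = ["ap_hi", "age_years"]
X_toy = pd.DataFrame({
    "ap_hi": [110, 120, 150, 160],
    "age_years": [40.0, 45.0, 60.0, 62.0],
})
y_toy = [0, 0, 1, 1]
toy_model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Key order in the incoming data does not matter; extra keys are ignored.
patient = {"age_years": 61.0, "ap_hi": 155, "id": 7}
risk = predict_risk(toy_model, feature_columns, patient)
print(round(risk, 3))
```

Selecting columns by name from the saved list keeps the input aligned with the training data even when the request lists features in a different order.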
