# Heart Disease Classification with Logistic Regression

This notebook mirrors the production pipeline in `src/` and adds intuitive diagnostics so you can explain, verify, and extend the classifier before deploying it through the FastAPI service.

## 1. Experiment roadmap

1. Load the heart disease dataset and review basic descriptive statistics.
2. Visualise relationships that typically influence cardiovascular risk.
3. Train the same scikit-learn pipeline defined in `src/pipeline.py`.
4. Evaluate discrimination (ROC-AUC) and calibration-related metrics (log loss) to check the quality of the predicted probabilities.

In [None]:
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    RocCurveDisplay,
    accuracy_score,
    classification_report,
    f1_score,
    log_loss,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

DATA_PATH = Path('..') / 'data' / 'heart.csv'
DATA_PATH.resolve()

The feature engineering and scaling steps here must stay in sync with `src/pipeline.py`. If you change preprocessing for experimentation, port those adjustments back into the production code.

In [None]:
df = pd.read_csv(DATA_PATH)
df.head()

The dataset combines continuous measurements (resting blood pressure, cholesterol, maximum heart rate) with integer-coded categorical features (chest pain type, thalassemia). A quick overview confirms that value ranges are plausible and shows which features sit on very different scales and will need standardising.
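
One way to run that quick overview is sketched below. The helper name `summarise` and the small synthetic frame are assumptions made so the snippet runs on its own; in the notebook you would pass the real `df` instead.

```python
import numpy as np
import pandas as pd


def summarise(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column range, spread, cardinality, and missing-value count."""
    stats = frame.describe().T[["min", "max", "mean", "std"]].copy()
    stats["n_unique"] = frame.nunique()
    stats["n_missing"] = frame.isna().sum()
    return stats


# Synthetic stand-in rows (NOT the real data); replace `demo` with `df`.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {
        "age": rng.integers(29, 78, size=50),
        "chol": rng.normal(245, 50, size=50).round(),
        "cp": rng.integers(0, 4, size=50),
    }
)
summary = summarise(demo)
print(summary)
```

Columns with a low `n_unique` are likely integer-coded categories, and a large gap between `min` and `max` across columns is the usual sign that scaling matters.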

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.scatterplot(data=df, x='age', y='thalach', hue='target', ax=axes[0], palette='Set1')
axes[0].set_title('Age vs. Max Heart Rate by Outcome')
sns.boxplot(data=df, x='target', y='oldpeak', ax=axes[1])
axes[1].set_title('ST Depression by Outcome')
plt.tight_layout()
plt.show()

## 2. Train/validation split

We recreate the stratified split from `src/data.py`. Stratification keeps the positive-class proportion the same in the training and validation sets, so validation metrics are not skewed by a different class balance.

In [None]:
FEATURES = [
    'age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
    'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal'
]
TARGET = 'target'

X = df[FEATURES]
y = df[TARGET]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train.shape, X_val.shape
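
A quick sanity check on the stratification is to compare the positive rate in the two splits. The sketch below uses synthetic labels (an assumption, so it runs standalone); in the notebook the same comparison is `y_train.mean()` against `y_val.mean()`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced labels standing in for df['target'].
rng = np.random.default_rng(42)
y_demo = (rng.random(300) < 0.45).astype(int)
X_demo = rng.normal(size=(300, 3))

_, _, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify set, the two rates should differ by at most about one sample.
rate_train = y_tr.mean()
rate_val = y_va.mean()
print(f"train positive rate: {rate_train:.3f}")
print(f"val positive rate:   {rate_val:.3f}")
```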

## 3. Pipeline training and metrics

The pipeline mirrors `HeartDiseasePipeline.build()`: it standardises every column in `FEATURES` (the integer-coded categorical columns are scaled as if they were numeric) and fits a class-weighted logistic regression to counter mild class imbalance.

In [None]:
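# Standardise every feature; the integer-coded categoricals are scaled as numeric too.
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
preprocessor = ColumnTransformer(
    transformers=[('num', numeric_transformer, FEATURES)], remainder='drop'
)
classifier = LogisticRegression(
    max_iter=1000,
    solver='liblinear',
    class_weight='balanced',  # reweight classes inversely to their frequency
    random_state=42,
)
clf = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', classifier)])
clf.fit(X_train, y_train)

# Hard labels use the default 0.5 threshold; probabilities are for the positive class.
y_pred = clf.predict(X_val)
y_proba = clf.predict_proba(X_val)[:, 1]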
metrics = {
    'accuracy': float(accuracy_score(y_val, y_pred)),
    'precision': float(precision_score(y_val, y_pred)),
    'recall': float(recall_score(y_val, y_pred)),
    'f1': float(f1_score(y_val, y_pred)),
    'roc_auc': float(roc_auc_score(y_val, y_proba)),
    'log_loss': float(log_loss(y_val, y_proba)),
}
metrics
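
For reference, `class_weight='balanced'` assigns each class the weight `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula by hand on made-up labels (the 60/40 split is an assumption for illustration, not the real class balance) and compares it with scikit-learn's `compute_class_weight`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels: 60 negatives, 40 positives.
y_demo = np.array([0] * 60 + [1] * 40)
classes = np.unique(y_demo)

# Manual formula: n_samples / (n_classes * count per class).
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))
from_sklearn = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

print(dict(zip(classes.tolist(), manual.round(3).tolist())))
```

The rarer class receives the larger weight, so its misclassifications cost more in the loss.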

### Classification report

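The per-class report and confusion matrix show how the aggregate metrics above break down by class, and the ROC curve shows the trade-off between true-positive and false-positive rates across decision thresholds.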

In [None]:
print(classification_report(y_val, y_pred))
ConfusionMatrixDisplay.from_predictions(y_val, y_pred, display_labels=['No disease', 'Disease'])
plt.title('Confusion Matrix')
plt.show()

RocCurveDisplay.from_predictions(y_val, y_proba)
plt.title('ROC Curve')
plt.show()
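
Log loss summarises probability quality in a single number; a Brier score and a binned reliability table give a more direct view of calibration. The sketch below is self-contained and uses synthetic data from `make_classification` (an assumption); in the notebook you would pass `y_val` and `y_proba` instead.

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; swap in y_val / y_proba from the cells above.
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)
proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_va)[:, 1]

# Brier score: mean squared error between predicted probability and outcome.
brier = brier_score_loss(y_va, proba)

# Reliability table: observed positive fraction per bin of predicted probability.
frac_pos, mean_pred = calibration_curve(y_va, proba, n_bins=5, strategy="quantile")
print(f"Brier score: {brier:.3f}")
for predicted, observed in zip(mean_pred, frac_pos):
    print(f"predicted {predicted:.2f} -> observed {observed:.2f}")
```

For a well-calibrated model the predicted and observed columns track each other closely; large gaps in a bin suggest over- or under-confidence in that probability range.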

## 4. Sync with production artifacts

- The metrics dictionary above should match the values stored in `artifacts/metrics.json` after running `python "Supervised Learning/Logistic Regression/src/train.py"`.
- If you change preprocessing or model parameters here, port them into `src/pipeline.py` to keep the FastAPI endpoint consistent.
- Calibration concerns? Consider Platt scaling or isotonic regression and update the service to expose recalibrated probabilities.
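
If recalibration turns out to be needed, a minimal sketch using scikit-learn's `CalibratedClassifierCV` is shown below, where `method='sigmoid'` corresponds to Platt scaling and `method='isotonic'` to isotonic regression. The data is synthetic and the simplified two-step pipeline is an assumption for illustration, not the production `HeartDiseasePipeline`; whether either method helps on the real data is something to verify, and isotonic regression in particular can overfit on small samples.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); use X_train / X_val in the notebook.
X_demo, y_demo = make_classification(n_samples=800, n_features=8, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1, stratify=y_demo
)

# Simplified stand-in for the notebook pipeline.
base = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("classifier", LogisticRegression(
            max_iter=1000, solver="liblinear",
            class_weight="balanced", random_state=42,
        )),
    ]
)

# Fit one calibrated wrapper per method and compare Brier scores (lower is better).
results = {}
for method in ("sigmoid", "isotonic"):
    calibrated = CalibratedClassifierCV(base, method=method, cv=5)
    calibrated.fit(X_tr, y_tr)
    proba = calibrated.predict_proba(X_va)[:, 1]
    results[method] = brier_score_loss(y_va, proba)

print(results)
```

Because the calibrated wrapper exposes the same `predict_proba` interface, it could in principle replace the plain pipeline behind the service, provided the same change is made in `src/pipeline.py`.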