# Heart Disease Prediction

This notebook demonstrates how to build a machine learning pipeline to predict heart disease. We will use the UCI Heart Disease dataset, clean it, train two different models (Logistic Regression and Decision Tree), and evaluate their performance.

## 1. Load Dataset

First, we load the dataset with pandas. The file `heart_disease_uci.csv` is expected to be in the same directory as this notebook.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
import joblib

# Load the dataset
df = pd.read_csv('heart_disease_uci.csv')
df.head()
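
Before cleaning, it can help to see which columns contain missing values and what type pandas inferred for each. The helper below is a minimal sketch: `summarize_columns` is a hypothetical name, and the toy frame only stands in for the real file with made-up values.

```python
import numpy as np
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing count, and unique count for every column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(dropna=True),
    })


# Toy stand-in for the real file (values are made up for illustration)
toy = pd.DataFrame({
    "age": [63, 67, 41, 56],
    "sex": ["Male", "Male", "Female", None],
    "chol": [233.0, np.nan, 204.0, 236.0],
})

summary = summarize_columns(toy)
print(summary)
```

Calling `summarize_columns(df)` on the loaded dataset would give the same overview for the real columns.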

## 2. Data Cleaning and Preprocessing

Real-world data is often messy. In this step, we:
1. **Drop irrelevant columns**: The row identifier `id` and the `dataset` column (the record's origin) don't help prediction.
2. **Rename columns**: We rename `thalch` to `thalach` for consistent naming.
3. **Encode categorical variables**: Machine learning models require numbers, so we map text and boolean values to integer codes (e.g., 'Male' to 1, 'Female' to 0).
4. **Handle missing values**: We fill missing categorical data with the mode (most frequent value) and missing numeric data with the median.
5. **Binarize the target**: We convert `num` (0-4) to a binary `target` (0 = healthy, 1 = disease).
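
One caveat about the encoding step: `Series.map` turns any value that is missing from the dictionary into NaN without raising an error, so a typo or an unexpected category would silently end up being imputed. The sketch below shows one way to surface such values; `map_with_report` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd


def map_with_report(series: pd.Series, mapping: dict):
    """Map values and list the original values the mapping did not cover."""
    mapped = series.map(mapping)
    # Values that were present before mapping but became NaN afterwards
    lost = series[series.notna() & mapped.isna()]
    return mapped, lost.unique().tolist()


toy_sex = pd.Series(["Male", "Female", "male", None])
encoded, unmapped = map_with_report(toy_sex, {"Male": 1, "Female": 0})

print(encoded)
print("Unmapped values:", unmapped)
```

Here the lowercase `'male'` is reported as unmapped, while the genuinely missing entry is left for the imputation step.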

In [None]:
# Data Cleaning
# Step 1: Drop irrelevant columns (errors='ignore' skips any that are absent)
irrelevant_cols = ['id', 'dataset']
df.drop(columns=irrelevant_cols, inplace=True, errors='ignore')

# Step 2: Rename columns for consistency
df.rename(columns={'thalch': 'thalach'}, inplace=True)

# Step 3: Robust Categorical Encoding
# Using a dictionary for mapping to ensure clarity and easy updates
mappings = {
    'sex': {'Male': 1, 'Female': 0},
    'cp': {'typical angina': 0, 'atypical angina': 1, 'non-anginal': 2, 'asymptomatic': 3},
    'fbs': {True: 1, False: 0},
    'restecg': {'normal': 0, 'st-t abnormality': 1, 'lv hypertrophy': 2},
    'exang': {True: 1, False: 0},
    'slope': {'upsloping': 0, 'flat': 1, 'downsloping': 2},
    'thal': {'normal': 1, 'fixed defect': 2, 'reversable defect': 3}
}

for col, mapping in mappings.items():
    if col in df.columns:
        df[col] = df[col].map(mapping)

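# Step 4: Handle Missing Values
# Columns encoded in Step 3 are numeric now but still categorical in meaning,
# so they are imputed with the mode; a median could land between two codes.
encoded_cols = [col for col in mappings if col in df.columns]
text_cols = list(df.select_dtypes(exclude=[np.number]).columns)
categorical_cols = encoded_cols + [col for col in text_cols if col not in encoded_cols]
numeric_cols = [col for col in df.select_dtypes(include=[np.number]).columns
                if col not in categorical_cols]

# Fill numeric NaNs with the median
for col in numeric_cols:
    df[col] = df[col].fillna(df[col].median())

# Fill categorical NaNs with the mode (most frequent value)
for col in categorical_cols:
    if not df[col].mode().empty:
        df[col] = df[col].fillna(df[col].mode()[0])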

# Step 5: Binarize the target (safe to re-run)
# Convert 'num' (0-4) to a binary target (0 = healthy, 1 = disease)
if 'num' in df.columns:
    df['target'] = (df['num'] > 0).astype(int)
    df.drop(columns='num', inplace=True)
elif 'target' not in df.columns:
    # Neither column is present, so the later cells that use 'target' will fail
    print("Warning: 'num' column not found and 'target' does not exist.")

df.head()
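
Once the target is binary, it is worth checking how balanced the two classes are, because a strong imbalance would make accuracy a misleading metric later on. The snippet below is a small sketch on made-up severity scores, not on the real data.

```python
import pandas as pd

# Hypothetical severity scores in the 0-4 range used by the 'num' column
num = pd.Series([0, 2, 0, 1, 4, 0, 3, 0])

# Any score above 0 counts as disease
target = (num > 0).astype(int)

# Share of each class (index 0 = healthy, 1 = disease)
balance = target.value_counts(normalize=True).sort_index()
print(balance)
```

On the real data, the same check would be `df['target'].value_counts(normalize=True)`.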

## 3. Train/Test Split

We split the data into two sets:
- **Training Set (80%)**: Used to teach the model.
- **Test Set (20%)**: Used to evaluate how well the model generalizes to new, unseen data.

In [6]:
X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
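
The split above is purely random. If the classes turn out to be imbalanced, passing `stratify` to `train_test_split` keeps the class proportions the same in both subsets. The following sketch uses synthetic data and is a possible variation, not what the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 30% positive class (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 30 + [0] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Positive rate in each subset; stratification keeps them aligned
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```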

## 4. Feature Scaling

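Logistic Regression performs better when all features are on a similar scale (e.g., age is 0-100, while cholesterol is 100-500). We use `StandardScaler` to standardize each feature to zero mean and unit variance. The scaler is fitted on the training set only and then applied to the test set, so no information from the test data leaks into training. Decision Trees don't require scaling, so the tree in the next section is trained on the unscaled features.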

In [7]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
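
To make the effect of `StandardScaler` concrete, the sketch below fits it on a tiny made-up training array and then applies the learned statistics to a separate test row. The numbers are illustrative, not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up age and cholesterol values on very different scales
train = np.array([[40.0, 180.0], [50.0, 220.0], [60.0, 260.0]])
test = np.array([[50.0, 300.0]])

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learns mean and std from train
test_scaled = demo_scaler.transform(test)        # reuses the training statistics

print("training means:", demo_scaler.mean_)
print("scaled train mean/std:", train_scaled.mean(axis=0), train_scaled.std(axis=0))
print("scaled test row:", test_scaled)
```

The test row is expressed in units of training standard deviations, which is why the scaler must never be refitted on test data.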

## 5. Model Training

We train two models:
1. **Logistic Regression**: A statistical model that uses a logistic function to model a binary dependent variable. It is trained on the scaled features.
2. **Decision Tree**: A flowchart-like structure where each internal node tests a feature, each branch represents a decision rule, and each leaf node represents the outcome. It is trained on the unscaled features.

In [None]:
# Train Logistic Regression
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train_scaled, y_train)

# Train Decision Tree
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)
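
Because Logistic Regression needs scaled input and the Decision Tree does not, it is easy to pass the wrong array to the wrong model. One alternative, shown here as a sketch on synthetic data rather than as this notebook's approach, is to bundle the scaler and the classifier with scikit-learn's `make_pipeline` so that raw features can always be passed in.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with a simple linear decision rule (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# The pipeline scales inside fit/predict, so raw features go in directly
lr_pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
lr_pipeline.fit(X_demo, y_demo)

demo_predictions = lr_pipeline.predict(X_demo[:5])
demo_accuracy = lr_pipeline.score(X_demo, y_demo)
print(demo_predictions)
print(demo_accuracy)
```

A pipeline like this could also be saved as a single object, which removes the need to keep a separate scaler file in sync with the model.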

## 6. Evaluation

We evaluate the models using accuracy (percentage of correct predictions) and a classification report (Precision, Recall, F1-Score).

In [None]:
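# Logistic Regression was trained on scaled features, so it is evaluated on them too
y_pred_lr = lr_model.predict(X_test_scaled)
print("--- Logistic Regression Evaluation ---")
print(f"Accuracy: {accuracy_score(y_test, y_pred_lr):.3f}")
print(classification_report(y_test, y_pred_lr))

# The Decision Tree was trained on unscaled features
y_pred_dt = dt_model.predict(X_test)
print("--- Decision Tree Evaluation ---")
print(f"Accuracy: {accuracy_score(y_test, y_pred_dt):.3f}")
print(classification_report(y_test, y_pred_dt))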
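
For a medical screening task, a missed case (false negative) is usually costlier than a false alarm, so it can be useful to look at the confusion matrix and at recall for the disease class alongside accuracy. The example below uses hypothetical labels and predictions to show how these numbers relate.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels and predictions (1 = disease, 0 = healthy)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

demo_acc = accuracy_score(y_true, y_hat)
demo_recall = recall_score(y_true, y_hat)  # TP / (TP + FN) for the disease class

print(cm)
print("accuracy:", demo_acc)
print("recall (disease):", demo_recall)
```

In this toy case, accuracy is 0.7 while recall is only 0.5, meaning half of the disease cases were missed.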

## 7. Export Models

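Finally, we save the trained models and the scaler using `joblib`. These `.pkl` files will be loaded by our FastAPI backend to make real-time predictions in the application. When serving predictions, the saved scaler must be applied before calling the Logistic Regression model, while the Decision Tree expects unscaled features. In both cases the features must be in the same column order used for training.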

In [None]:
joblib.dump(lr_model, 'logistic_model.pkl')
joblib.dump(dt_model, 'decision_tree_model.pkl')
joblib.dump(scaler, 'scaler.pkl')
print("Models saved successfully: logistic_model.pkl, decision_tree_model.pkl, scaler.pkl")
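
The sketch below illustrates the round trip a consumer of these files would perform: load the model and the scaler, scale the incoming features, and predict. It trains a tiny stand-in model on synthetic data and writes to a temporary directory, so it does not touch the files saved above; the backend code itself is not shown in this notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Train a tiny stand-in model on synthetic data (illustrative only)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)
demo_scaler = StandardScaler().fit(X_demo)
demo_model = LogisticRegression(max_iter=1000).fit(
    demo_scaler.transform(X_demo), y_demo
)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "logistic_model.pkl")
    scaler_path = os.path.join(tmp, "scaler.pkl")
    joblib.dump(demo_model, model_path)
    joblib.dump(demo_scaler, scaler_path)

    # What a consumer of the files would do: load both objects
    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)

# Scale with the loaded scaler, then predict with the loaded model
sample = X_demo[:5]
original = demo_model.predict(demo_scaler.transform(sample))
restored = loaded_model.predict(loaded_scaler.transform(sample))
print("Predictions match after reload:", np.array_equal(original, restored))
```

Files written by `joblib` are pickle-based, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that created them.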