# Heart Disease Prediction
#### This project uses machine learning to predict the presence of heart disease based on various health metrics.
In this project, I build a multilayer perceptron (MLP) neural network for binary classification: predicting the presence of heart disease in patients from their health metrics. The dataset originates from the UCI Machine Learning Repository and contains features such as age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting ECG results, and maximum heart rate achieved. The notebook covers data preprocessing, model training, and evaluation. The final model is evaluated using accuracy, precision, recall, F1 score, and a confusion matrix.

The final code cell provides an interactive interface for users to input their health metrics and receive a prediction on whether they are likely to have heart disease or not. This project serves as a practical application of machine learning techniques in the healthcare domain, demonstrating how data-driven approaches can aid in early detection and prevention of heart disease.

In [None]:
%pip install torch "kagglehub[pandas-datasets]" scikit-learn pandas numpy matplotlib

### Data Structure
There are 13 features in the dataset, and 1 target variable.
### Heart Disease Dataset
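This dataset is used for predicting the presence of heart disease in individuals based on various medical attributes. The dataset contains both categorical and numerical features. The value codes listed below follow the original UCI documentation. The input prompts in the final cell use different ranges for some columns (0-3 for `cp`, 0-2 for `slope`, 1-3 for `thal`), so it is worth checking `df[column].unique()` on the loaded CSV before relying on a specific code.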
### Features
- **age**: Age of the individual
- **sex**: Sex of the individual (1 = male; 0 = female)
- **cp**: Chest pain type
    - Value 1: typical angina
    - Value 2: atypical angina
    - Value 3: non-anginal pain
    - Value 4: asymptomatic
- **trestbps**: Resting blood pressure (in mm Hg)
- **chol**: Serum cholesterol in mg/dl
- **fbs**: Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
- **restecg**: Resting electrocardiographic results
    - Value 0: normal
    - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
- **thalach**: Maximum heart rate achieved
- **exang**: Exercise induced angina (1 = yes; 0 = no)
- **oldpeak**: ST depression induced by exercise relative to rest
- **slope**: Slope of the peak exercise ST segment
    - Value 1: upsloping
    - Value 2: flat
    - Value 3: downsloping
- **ca**: Number of major vessels (0-3) colored by fluoroscopy
- **thal**: Thalassemia
    - 3 = normal
    - 6 = fixed defect
    - 7 = reversible defect
### Target Variable
- **target**: Diagnosis of heart disease
    - Value 0: No presence of heart disease
    - Value 1: Presence of heart disease


In [None]:
# Import the libraries used to load and preprocess the data
import kagglehub
from kagglehub import KaggleDatasetAdapter
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

BATCH_SIZE = 32

# Set the path to the file you'd like to load
file_path = "heart.csv"

# Load the latest version
df: pd.DataFrame = kagglehub.dataset_load(
  KaggleDatasetAdapter.PANDAS,
  "johnsmith88/heart-disease-dataset",
  file_path,
)

df = df.dropna().reset_index(drop=True)



# Split the dataset into features and target variable
X = df.drop(columns=["target"])
y = df["target"]

# Split the dataset into training and testing sets
# (random_state fixes the shuffle so the split is reproducible)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


We import the dataset from Kaggle (https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset) and load it into a pandas DataFrame. Rows with missing values are dropped, and the features are separated from the target variable. The data is then split into training and testing sets (80/20) using `train_test_split` from `sklearn.model_selection`. Finally, the features are standardized using `StandardScaler` from `sklearn.preprocessing`; the scaler is fitted on the training set only and then applied to the test set, so no test-set statistics leak into training.
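
To make the standardization step concrete, here is a minimal standard-library sketch of what fitting on the training rows and reusing those statistics on the test rows amounts to. The helper names and the sample values are made up for illustration; the notebook itself relies on `StandardScaler`, which likewise uses the population standard deviation.

```python
from statistics import fmean, pstdev


def fit_standardizer(rows):
    """Return per-column means and standard deviations from the given rows only."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Population standard deviation; fall back to 1.0 for constant columns
    # so the division below never divides by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def apply_standardizer(rows, means, stds):
    """Scale every value with statistics that were computed elsewhere."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Hypothetical (age, resting blood pressure) rows, not taken from the dataset.
train_rows = [[63.0, 145.0], [37.0, 130.0], [41.0, 130.0], [56.0, 120.0]]
test_rows = [[57.0, 140.0]]

means, stds = fit_standardizer(train_rows)                # statistics from training data only
train_scaled = apply_standardizer(train_rows, means, stds)
test_scaled = apply_standardizer(test_rows, means, stds)  # reuse the training statistics

print(train_scaled)
print(test_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1, while the test row is expressed relative to the training distribution.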

## Formatting the pandas DataFrame into a PyTorch Dataset
PyTorch's `DataLoader` expects a dataset object, so we create a custom dataset class that inherits from `torch.utils.data.Dataset`. The class simply wraps the already preprocessed features and targets. The `__init__` method stores them, the `__len__` method returns the number of samples, and the `__getitem__` method returns a single (features, target) pair by index.

In [None]:
# Define a custom pytorch dataset
import torch
from torch.utils.data import Dataset


class HeartDiseaseDataset(Dataset):
    def __init__(self, features, targets):
        self.features = features
        self.targets = targets

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.targets[idx]

## Creating DataLoaders
We use `torch.utils.data.DataLoader` to create DataLoaders for both the training and testing datasets. DataLoaders allow us to iterate over the dataset in batches, which is essential for training neural networks efficiently. We set the batch size to 32 and enable shuffling for the training DataLoader to ensure that the model sees the data in a different order each epoch.
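
As a small illustration of how batching behaves, the sketch below wraps ten made-up samples in a dataset class with the same three methods and iterates over them with a batch size of 4. `ToyDataset` and the tensor values are hypothetical and exist only for this example.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Wraps a feature tensor and a target tensor of equal length."""

    def __init__(self, features, targets):
        self.features = features
        self.targets = targets

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.targets[idx]


# Ten samples with three features each (values are arbitrary).
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
targets = torch.tensor([0.0, 1.0] * 5)

loader = DataLoader(ToyDataset(features, targets), batch_size=4, shuffle=False)

# Collect the shape of every batch the loader yields.
batch_shapes = [(tuple(X.shape), tuple(y.shape)) for X, y in loader]
print(len(loader), batch_shapes)
```

The last batch is smaller because 10 is not a multiple of 4. The targets arrive with shape `(batch,)` while a one-output model produces `(batch, 1)`, which is why the training loop later calls `unsqueeze(1)` on the labels.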

In [None]:
from torch.utils.data import DataLoader

# Convert numpy arrays to torch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.float32)

# Convert the tensors to a pytorch dataset
train_dataset = HeartDiseaseDataset(X_train_tensor, y_train_tensor)
test_dataset = HeartDiseaseDataset(X_test_tensor, y_test_tensor)

train_dataloader: DataLoader = DataLoader(
    train_dataset, batch_size=BATCH_SIZE, shuffle=True
)
test_dataloader: DataLoader = DataLoader(
    test_dataset, batch_size=BATCH_SIZE, shuffle=False
)

## Defining the Neural Network
The goal is to create a neural network that can predict the presence of heart disease based on the features provided. We will use PyTorch to define and train this neural network. It will predict binary outcomes (presence or absence of heart disease) based on the input features.

We define a simple feedforward neural network using `torch.nn.Module`. The network consists of an input layer, two hidden layers, and an output layer. The input layer takes 13 values (one per feature), the first hidden layer has 64 neurons and the second has 18, each followed by a ReLU activation, and the output layer has 1 neuron that outputs the raw logit for the binary classification task.
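
To get a feel for the size of this network, the following sketch counts the trainable parameters implied by the layer widths used in the `NeuralNetwork` class (13, 64, 18, 1). The helper `count_parameters` is a made-up name for this illustration; each fully connected layer contributes one weight per input-output pair plus one bias per output.

```python
def count_parameters(layer_sizes):
    """Sum weights (n_in * n_out) and biases (n_out) over consecutive layers."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


layer_sizes = [13, 64, 18, 1]  # input, hidden 1, hidden 2, output

for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
    print(f"Linear({n_in}, {n_out}): {n_in * n_out + n_out} parameters")

total = count_parameters(layer_sizes)
print(f"Total: {total}")  # 896 + 1170 + 19 = 2085
```

With roughly two thousand parameters, the model is small enough to train on a CPU in seconds.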

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class NeuralNetwork(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear_relu_stack = torch.nn.Sequential(
            torch.nn.Linear(13, 64),  # 13 input features
            torch.nn.ReLU(),
            torch.nn.Linear(64, 18),
            torch.nn.ReLU(),
            torch.nn.Linear(18, 1),  # Output layer for binary classification
        )

    def forward(self, x):
        logits = self.linear_relu_stack(x)
        return logits
    
# Initialize the model
model = NeuralNetwork().to(device)

# Implementing the training and testing loop functions

## Training Loop
### Abstract
We implement the training loop to train the neural network on the training dataset. The training loop will iterate over the training DataLoader, compute the predictions, calculate the loss, and update the model parameters using backpropagation.

### Detailed training steps
On each iteration, we will:
1. Forward pass the input data through the model to get the predictions.
2. Compute the loss using binary cross-entropy with logits. This loss combines a sigmoid activation with the binary cross-entropy loss in a single step, which suits binary classification tasks. The exact equation is:
   $$
   \begin{aligned}
   \sigma(x) &= \frac{1}{1 + e^{-x}} \\
   \text{loss} &= -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \cdot \log(\sigma(x_i)) + (1 - y_i) \cdot \log(1 - \sigma(x_i)) \right]
   \end{aligned}
   $$
   where $y_i$ is the true label, $x_i$ is the model's output (logit) for sample $i$, $N$ is the number of samples in the batch, and $\sigma$ is the sigmoid function.
3. Backward pass to compute gradients.
4. Update the model parameters using the optimizer.
5. Reset the gradients to zero for the next iteration. This is important to prevent accumulation of gradients from previous iterations, which can lead to incorrect updates.
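
The loss in step 2 can be sketched in plain Python to show what is being computed. This is an illustration of the formula, not PyTorch's implementation: `bce_naive` follows the equation literally, while `bce_with_logits` uses an algebraically equivalent form that avoids taking the logarithm of values very close to zero. Both function names and the sample numbers are assumptions made for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_naive(logits, labels):
    """Mean binary cross-entropy, written exactly as in the equation."""
    total = 0.0
    for x, y in zip(logits, labels):
        p = sigmoid(x)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(logits)


def bce_with_logits(logits, labels):
    """Same loss computed directly from logits in a numerically safer form."""
    total = 0.0
    for x, y in zip(logits, labels):
        # max(x, 0) - x*y + log(1 + exp(-|x|)) is an algebraic
        # rearrangement of the per-sample term used in bce_naive.
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


logits = [2.0, -1.0, 0.5]   # raw model outputs (made-up values)
labels = [1.0, 0.0, 1.0]    # true classes

print(bce_naive(logits, labels))
print(bce_with_logits(logits, labels))
```

Both functions return the same value for moderate logits; the second form stays finite even when a logit is large in magnitude, which is the motivation for fusing the sigmoid into the loss.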

## Testing Loop
### Abstract
We implement the testing loop to evaluate the model's performance on the testing dataset. The testing loop will iterate over the testing DataLoader, compute the predictions, and calculate the accuracy of the model.

### Detailed testing steps
On each iteration, we will:
1. Forward pass the input data through the model to get the predictions.
2. Apply a sigmoid activation function to the output logits to convert them into probabilities.
3. Convert the probabilities to binary predictions (0 or 1) based on a threshold (0.5).
4. Compare the predictions with the true labels to compute the accuracy.
5. Calculate precision, recall, and F1 score using `sklearn.metrics` for a more comprehensive evaluation of the model's performance and build a confusion matrix to visualize the model's performance.
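
Steps 2 and 3 can be sketched for a single logit as follows. The function name `logit_to_prediction` and the sample logits are made up for illustration.

```python
import math


def logit_to_prediction(logit, threshold=0.5):
    """Convert a raw logit into (probability, predicted class)."""
    probability = 1.0 / (1.0 + math.exp(-logit))
    return probability, int(probability > threshold)


for logit in (-2.0, 0.0, 0.3, 4.0):
    probability, label = logit_to_prediction(logit)
    print(f"logit={logit:+.1f}  probability={probability:.3f}  class={label}")
```

Because the sigmoid maps 0 to exactly 0.5 and is strictly increasing, a threshold of 0.5 on the probability is equivalent to checking whether the logit is positive.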

### Metrics used for Evaluation
- **Accuracy**: The proportion of true results (both true positives and true negatives) among the total number of cases examined.
- **Precision**: The proportion of true positive results in all positive predictions made by the model.
- **Recall**: The proportion of true positive results in all actual positive cases.
- **F1 Score**: The harmonic mean of precision and recall, providing a balance between the two metrics.
  - Harmonic Mean: The harmonic mean is a measure of the average of a set of numbers, calculated as the reciprocal of the arithmetic mean of the reciprocals of the numbers. It is particularly useful for rates and ratios, such as precision and recall in classification tasks.
   $$
   F1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
   $$
- **Confusion Matrix**: A table used to describe the performance of a classification model, showing the true positive, true negative, false positive, and false negative counts.
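
The definitions above can be checked by hand from the four cells of a confusion matrix. The sketch below uses hypothetical counts (they are not results from this notebook) with a few false negatives and no false positives.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1


# Hypothetical counts: 2 missed positives, no false alarms.
accuracy, precision, recall, f1 = classification_metrics(tp=95, fp=0, fn=2, tn=108)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
print(f"recall with one decimal: {recall:.1f}")
```

With no false positives, precision is exactly 1, but any false negative pulls recall (and therefore F1) below 1. Printing with a single decimal place hides that difference, so three decimals are a safer default for reporting.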

In [None]:
import pandas as pd
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix as sk_confusion_matrix

def train_loop(train_dataloader, model: NeuralNetwork, loss_fn: torch.nn.Module, optimizer: torch.optim.Optimizer):
    size = len(train_dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(train_dataloader):
        X, y = X.to(device), y.to(device)

        # Compute prediction and loss
        pred = model(X) # Forward pass through the model

        # Unsqueeze to match the output shape
        loss = loss_fn(pred, y.unsqueeze(1).float()) # Calculate BCE loss

        # Backpropagation
        loss.backward() # Compute gradients
        optimizer.step() # Make a step with the optimizer to update the model parameters
        optimizer.zero_grad() # Reset the gradients to zero before the next iteration

        loss, current = loss.item(), batch * BATCH_SIZE + len(X)
        print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")


def test_loop(test_dataloader, model: NeuralNetwork, loss_fn: torch.nn.Module):
    model.eval()
    size = len(test_dataloader.dataset)
    num_batches = len(test_dataloader)
    test_loss, correct = 0, 0

    all_preds = []
    all_labels = []

    with torch.no_grad():
        for X, y in test_dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y.unsqueeze(1).float()).item()

            # Convert logits to probabilities using sigmoid activation
            pred = torch.sigmoid(pred)
            predicted_classes = (pred > 0.5).float()

            correct += (predicted_classes == y.unsqueeze(1)).sum().item()
            # Flatten to plain Python floats so sklearn receives 1-D label lists
            all_preds.extend(predicted_classes.view(-1).cpu().tolist())
            all_labels.extend(y.cpu().tolist())
    
    test_loss /= num_batches
    correct /= size

    # Calculate precision, recall, and F1 score
    precision = precision_score(all_labels, all_preds)
    recall = recall_score(all_labels, all_preds)
    f1 = f1_score(all_labels, all_preds)
    
    # confusion matrix
    confusion_matrix = sk_confusion_matrix(all_labels, all_preds)
    confusion_df = pd.DataFrame(confusion_matrix, index=["Actual 0", "Actual 1"], columns=["Predicted 0", "Predicted 1"])
    print("Confusion Matrix:")
    print(confusion_df)
    # Report three decimals so values just below 1 are not rounded up to 1.0
    print(
        f"Test metrics: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n"
        f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1 Score: {f1:.3f}")

In [None]:
learning_rate = 0.001
num_epochs = 10
criterion = torch.nn.BCEWithLogitsLoss()  # Binary Cross-Entropy Loss for binary classification
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for t in range(num_epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_dataloader, model, criterion, optimizer)
    test_loop(test_dataloader, model, criterion)
    
# Save the final model once, after the last epoch has finished
torch.save(model.state_dict(), "heart_disease_model.pth")
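
The saved state dict can later be loaded back into a freshly constructed model. The sketch below shows the round trip on a stand-in `torch.nn.Sequential` with the same layer sizes, written to a temporary directory; `build_model` is a hypothetical helper for this example. To load the file written above, the model would need to be an instance of the notebook's `NeuralNetwork` class so that the parameter names match.

```python
import os
import tempfile

import torch


def build_model():
    """Stand-in network with the same layer sizes as the notebook's model."""
    return torch.nn.Sequential(
        torch.nn.Linear(13, 64),
        torch.nn.ReLU(),
        torch.nn.Linear(64, 18),
        torch.nn.ReLU(),
        torch.nn.Linear(18, 1),
    )


trained = build_model()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "heart_disease_model.pth")
    torch.save(trained.state_dict(), path)            # write the weights only

    restored = build_model()                          # same architecture, new random weights
    restored.load_state_dict(torch.load(path, map_location="cpu"))

restored.eval()
sample = torch.zeros(1, 13)
with torch.no_grad():
    same_output = torch.allclose(trained(sample), restored(sample))
print(same_output)
```

Note that the state dict holds only the network weights. The fitted `scaler` is not part of it, so a fresh session would also need to refit or separately persist the scaler before making predictions.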

# Post-Training Thoughts
After training the model for 10 epochs, we observe the following:
- The model achieves an accuracy of about 99% on the testing dataset.
- The precision, recall, and F1 score all round to 1.0 at one decimal place, indicating that the model performs well in distinguishing between the presence and absence of heart disease. They cannot all be exactly 1.0, because the accuracy is below 100%.
- The confusion matrix shows a small number of false negatives, which means that the model sometimes fails to identify individuals with heart disease, so recall is slightly below 1. There are no false positives, so every positive prediction on the test set was correct and precision is exactly 1.

Overall, the model demonstrates excellent performance on this dataset, achieving high accuracy, perfect precision, and near-perfect recall and F1 score. The confusion matrix confirms that the model is effective in classifying heart disease cases, although there is room for improvement in reducing false negatives, which are the more costly error in a screening setting.
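
Scores this high on a small tabular dataset are worth a sanity check. One possibility, which nothing above establishes, is that identical rows end up in both the training and testing sets, which would inflate the test metrics. The sketch below counts such overlaps; `count_shared_rows` and the sample rows are made up for illustration.

```python
def count_shared_rows(train_rows, test_rows):
    """Count test rows whose values exactly match some training row."""
    train_set = {tuple(row) for row in train_rows}
    return sum(1 for row in test_rows if tuple(row) in train_set)


# Hypothetical (age, sex, chol) rows, not taken from the dataset.
train_rows = [[63, 1, 233], [37, 1, 250], [41, 0, 204]]
test_rows = [[37, 1, 250], [56, 1, 236]]

shared = count_shared_rows(train_rows, test_rows)
print(f"{shared} of {len(test_rows)} test rows also appear in the training set")
```

In the notebook, the same function could be called with `X_train.tolist()` and `X_test.tolist()`, since identical raw rows remain identical after scaling. With pandas, `df.duplicated().sum()` on the loaded DataFrame gives a quick count of repeated rows before splitting.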

# Try it Yourself
- Run the below cell to input your own data and see how the model performs.
- You can input the values for each feature in the dataset, and the model will predict whether the individual has heart disease or not. It will also provide the probability of the prediction.

In [None]:
import numpy as np

def predict_heart_disease(model, input_data):
    model.eval()
    input_tensor = torch.tensor(input_data, dtype=torch.float32).to(device)
    with torch.no_grad():
        output = model(input_tensor)
        probability = torch.sigmoid(output).float()
        prediction = 1 if probability > 0.5 else 0
    return prediction, probability

# Get user input for prediction
def get_user_input():
    print("Enter the following features for heart disease prediction:")
    features = []
    features.append(float(input("Age: ")))
    features.append(float(input("Sex (1 = male; 0 = female): ")))
    features.append(float(input("Chest Pain Type (0-3): ")))
    features.append(float(input("Resting Blood Pressure: ")))
    features.append(float(input("Serum Cholesterol: ")))
    features.append(float(input("Fasting Blood Sugar > 120 mg/dl (1 = true; 0 = false): ")))
    features.append(float(input("Resting ECG (0-2): ")))
    features.append(float(input("Max Heart Rate: ")))
    features.append(float(input("Exercise Induced Angina (1 = yes; 0 = no): ")))
    features.append(float(input("ST Depression: ")))
    features.append(float(input("Slope of ST Segment (0-2): ")))
    features.append(float(input("Number of Major Vessels (0-3): ")))
    features.append(float(input("Thalassemia (1 = normal; 2 = fixed defect; 3 = reversible defect): ")))

    features = np.array(features, dtype=np.float32)

    # Standardize the input features
    features = scaler.transform(features.reshape(1, -1)) # uses same scaler as training data

    return features[0]

# Interactive loop for predictions
while True:
    user_input = get_user_input()
    prediction, probability = predict_heart_disease(model, user_input)
    print(f"Prediction: {'Heart Disease Likely' if prediction == 1 else 'Heart Disease Unlikely'}")
    print(f"Probability: {probability.item():.4f}")
    
    cont = input("Do you want to make another prediction? (yes/no): ").strip().lower()
    if cont != 'yes':
        break
# Printed once, after the user chooses to stop
print("Thank you for using the heart disease prediction model!")

Remember to run all the cells in order to ensure that the model is trained and ready for predictions!

Thanks for reading! If you have any questions or suggestions, feel free to reach out.