<div align="center">

###### Lab 2

# National Tsing Hua University

#### Spring 2025

#### 11320IEEM 513600

#### Deep Learning and Industrial Applications
    
## Lab 2: Predicting Heart Disease with Deep Learning

</div>

### Introduction

In the realm of healthcare, early detection and accurate prediction of diseases play a crucial role in patient care and management. Heart disease remains one of the leading causes of mortality worldwide, making the development of effective diagnostic tools essential. This lab leverages deep learning to predict the presence of heart disease in patients using a subset of 14 key attributes from the Cleveland Heart Disease Database. The objective is to explore and apply deep learning techniques to distinguish between the presence and absence of heart disease based on clinical parameters.

Throughout this lab, you'll engage with the following key activities:
- Use [Pandas](https://pandas.pydata.org) to process the CSV files.
- Use [PyTorch](https://pytorch.org) to build an Artificial Neural Network (ANN) to fit the dataset.
- Evaluate the performance of the trained model to understand its accuracy.

### Attribute Information

1. age: Age of the patient in years
2. sex: (Male/Female)
3. cp: Chest pain type (4 types: low, medium, high, and severe)
4. trestbps: Resting blood pressure
5. chol: Serum cholesterol in mg/dl
6. fbs: Fasting blood sugar > 120 mg/dl
7. restecg: Resting electrocardiographic results (values 0,1,2)
8. thalach: Maximum heart rate achieved
9. exang: Exercise induced angina
10. oldpeak: Oldpeak = ST depression induced by exercise relative to rest
11. slope: The slope of the peak exercise ST segment
12. ca: Number of major vessels (0-3) colored by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
14. target: Whether the patient has heart disease (1 = yes, 0 = no)

### References
- [UCI Heart Disease Data](https://www.kaggle.com/datasets/redwankarimsony/heart-disease-data) for the dataset we use in this lab.


# Utils

## Get Training Data

In [None]:
import pandas as pd
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

df = pd.read_csv('heart_dataset_train_all.csv')

# Map 'sex' descriptions to integer codes (label encoding, not one-hot encoding)
sex_description = {
    'Male': 0,
    'Female': 1,
}
df.loc[:, 'sex'] = df['sex'].map(sex_description)

# Mapping 'cp' (chest pain) descriptions to numbers
pain_description = {
    'low': 0,
    'medium': 1,
    'high': 2,
    'severe': 3
}
df.loc[:, 'cp'] = df['cp'].map(pain_description)

df = df.dropna()
# split train, validation data
np_data = df.values
print(np_data.shape)
split_point = int(np_data.shape[0]*0.7)

np.random.shuffle(np_data)

x_train = np_data[:split_point, :13]
y_train = np_data[:split_point, 13]
x_val = np_data[split_point:, :13]
y_val = np_data[split_point:, 13]

# Transform to tensors and wrap them in DataLoaders
x_train = np.array(x_train, dtype=float)
x_train = torch.from_numpy(x_train).float()
y_train = np.array(y_train, dtype=int)
y_train = torch.from_numpy(y_train).long()

x_val = np.array(x_val, dtype=float)
x_val = torch.from_numpy(x_val).float()
y_val = np.array(y_val, dtype=int)
y_val = torch.from_numpy(y_val).long()

batch_size = 32

# Create datasets
train_dataset = TensorDataset(x_train, y_train)
val_dataset = TensorDataset(x_val, y_val)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
print(f'Number of samples in train and validation are {len(train_loader.dataset)} and {len(val_loader.dataset)}.')
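
The cell above shuffles with NumPy's global random state, so the 70/30 train/validation split changes on every run. The following is a minimal pure-Python sketch of the same shuffle-then-split idea with a fixed seed. The helper name `shuffled_split` and its `seed` parameter are illustrative assumptions, not part of the lab code.

```python
import random


def shuffled_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of `rows` with a fixed seed, then cut it into two parts."""
    rng = random.Random(seed)   # local generator: the global RNG state is untouched
    shuffled = list(rows)       # copy, so the caller's data keeps its order
    rng.shuffle(shuffled)
    split_point = int(len(shuffled) * train_fraction)
    return shuffled[:split_point], shuffled[split_point:]


# Hypothetical usage: indices stand in for the rows of the data array
train_rows, val_rows = shuffled_split(range(20), train_fraction=0.7, seed=42)
print(len(train_rows), len(val_rows))
```

Calling the helper twice with the same seed returns the same split, which makes runs with different hyperparameters easier to compare.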

## Get Testing Data

In [None]:
test_data = pd.read_csv('heart_dataset_test.csv')
test_data = test_data.values
# Convert to PyTorch tensors
x_test = torch.from_numpy(test_data[:, :13]).float()
y_test = torch.from_numpy(test_data[:, 13]).long()

# Create datasets
test_dataset = TensorDataset(x_test, y_test)

# Create dataloaders
test_loader = DataLoader(test_dataset, batch_size=1, shuffle=False)

In [None]:
import torch.nn as nn
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(13, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, 2)
        )

    def forward(self, x):
        return self.model(x)
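
To get a feel for the size of this network, the sketch below counts the weights and biases of a stack of fully connected layers from the layer widths alone. The helper `count_mlp_parameters` is a hypothetical name introduced for illustration. In PyTorch the same number can be read from the model with `sum(p.numel() for p in model.parameters())`.

```python
def count_mlp_parameters(layer_sizes):
    """Count weights and biases of consecutive fully connected layers."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out  # weight matrix of the layer
        total += fan_out           # bias vector of the layer
    return total


# Layer widths of the model above: 13 inputs -> 256 -> 256 -> 2 outputs
print(count_mlp_parameters([13, 256, 256, 2]))  # 69890
```

With a few hundred training samples, a model of this size has far more parameters than data points, which is worth keeping in mind when reading the train/validation curves.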

## Train Function

In [None]:
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR, StepLR
from tqdm.auto import tqdm
from typing import List, Tuple


epochs = 100


# print(model)



# model = Model().to(device)
# optimizer = optim.Adam(model.parameters(), lr=1e-3)
# lr_scheduler = CosineAnnealingLR(optimizer, T_max=epochs, eta_min=0)

def experiment(model,
               train_loader,
               val_loader,
               test_loader,
               lr_scheduler,
               optimizer,
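               epochs=100) -> Tuple[List, float, float]:
    """
    Train the model, then evaluate the best checkpoint on the test set.

    Args:
        model: network to train (already moved to `device`)
        train_loader, val_loader, test_loader: data loaders for each split
        lr_scheduler, optimizer: learning-rate schedule and optimizer
        epochs: number of training epochs

    Returns:
        history: [train_losses, val_losses, train_accuracies, val_accuracies, lrs]
        test_accuracy: test accuracy in percent
        test_loss: test loss averaged over batches
    """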
    
    train_losses = []
    val_losses = []
    train_accuracies = []
    val_accuracies = []
    lrs = []
    criterion = nn.CrossEntropyLoss()
    best_val_loss = float('inf')
    best_val_acc = -1
    for epoch in tqdm(range(epochs)):
        # Training
        model.train()
        total_loss = 0.0
        train_correct = 0
        total_train_samples = 0

        for features, labels in train_loader:
            features = features.to(device)
            labels = labels.to(device)

            outputs = model(features)

            loss = criterion(outputs, labels)
            total_loss += loss.item()

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            train_predicted = outputs.argmax(-1)
            train_correct += (train_predicted == labels).sum().item()
            total_train_samples += labels.size(0)
            
        current_lr = optimizer.param_groups[0]['lr']
        lrs.append(current_lr)
        # Learning rate update
        lr_scheduler.step()

        avg_train_loss = total_loss / len(train_loader)
        train_accuracy = 100. * train_correct / total_train_samples

        # Validation
        model.eval()
        total_val_loss = 0.0
        correct = 0
        total = 0
        with torch.no_grad():
            for features, labels in val_loader:
                features = features.to(device)
                labels = labels.to(device)

                outputs = model(features)

                loss = criterion(outputs, labels)
                total_val_loss += loss.item()

                predicted = outputs.argmax(-1)
                correct += (predicted == labels).sum().item()
                total += labels.size(0)

        avg_val_loss = total_val_loss / len(val_loader)
        val_accuracy = 100. * correct / total

        # Checkpoint
        if avg_val_loss < best_val_loss:
            best_val_loss = avg_val_loss

        if val_accuracy > best_val_acc:
            best_val_acc = val_accuracy
            torch.save(model.state_dict(), 'model_classification.pth')

        # Store performance
        train_losses.append(avg_train_loss)
        train_accuracies.append(train_accuracy)
        val_losses.append(avg_val_loss)
        val_accuracies.append(val_accuracy)
        
    #Test Section
    # Load the trained weights
    model.load_state_dict(torch.load('model_classification.pth'))

    # Set the model to evaluation mode
    model.eval()

    test_correct = 0
    total_test_loss = 0.0
    test_total = 0

    with torch.no_grad():
        for features, labels in test_loader:
            features = features.to(device)
            labels = labels.to(device)

            outputs = model(features)
            
            loss = criterion(outputs, labels)
            total_test_loss += loss.item()
            
            predicted = outputs.argmax(-1)
            test_correct += (predicted == labels).sum().item()
            test_total += labels.size(0)
            
    test_loss = total_test_loss / len(test_loader)        
    test_accuracy = 100. * test_correct / test_total
    print(f'Test accuracy is {test_accuracy}%')
    
    history = [train_losses, val_losses, train_accuracies, val_accuracies, lrs]
    
    return history, test_accuracy, test_loss
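
The training loop feeds raw logits to `nn.CrossEntropyLoss`, which applies a log-softmax internally and, with its default settings, averages the per-sample losses over the batch. The following pure-Python sketch shows that computation under those assumptions. The function names are illustrative and this is not PyTorch's actual implementation.

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability of the true class, computed from raw logits."""
    max_logit = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[label]


def batch_cross_entropy(batch_logits, labels):
    """Mean of the per-sample losses over one batch."""
    losses = [cross_entropy(logits, label) for logits, label in zip(batch_logits, labels)]
    return sum(losses) / len(losses)


# Two samples with two classes each (no disease / disease)
print(batch_cross_entropy([[2.0, -1.0], [0.0, 0.0]], [0, 1]))
```

An uninformative prediction with equal logits for both classes costs `log(2)`, about 0.693, so a loss curve that stays near that value indicates the model has not learned to separate the classes.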

In [None]:
model = Model().to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
lr_scheduler = CosineAnnealingLR(optimizer, T_max=epochs, eta_min=0)
history, test_accuracy, test_loss = experiment(model,
                                               train_loader,
                                               val_loader,
                                               test_loader,
                                               lr_scheduler,
                                               optimizer)
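
For reference, the schedule produced by `CosineAnnealingLR` can be written in closed form when nothing else modifies the learning rate. The sketch below evaluates that formula in plain Python with the settings used above (`lr=1e-3`, `T_max=100`, `eta_min=0`). The function name is an illustrative assumption.

```python
import math


def cosine_annealing_lr(epoch, base_lr=1e-3, t_max=100, eta_min=0.0):
    """Closed-form cosine annealing: starts at base_lr, reaches eta_min at t_max."""
    cosine = math.cos(math.pi * epoch / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + cosine)


# Learning rate at the start, the midpoint, and the last trained epoch
for epoch in (0, 50, 99):
    print(epoch, cosine_annealing_lr(epoch))
```

The rate falls slowly at first, fastest around the midpoint, and flattens out near zero at the end, which is the curve recorded in `lrs` and plotted in the experiments that follow.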

## Change Learning Rate

In [None]:
import matplotlib.pyplot as plt

lr_param = [1e-2, 5*1e-3, 1e-4]

fig, ax = plt.subplots(3, 3, figsize=(15, 15))
ax = ax.flatten()

result = {"learning rate":[1e-2, 1e-2, 5*1e-3, 5*1e-3, 1e-4, 1e-4],
          "meaning":["Accuracy", "Loss", "Accuracy", "Loss", "Accuracy", "Loss"],
          "train":[],
          "val":[],
          "test":[]}

for i in range(3):
    
    model = Model().to(device)
    optimizer = optim.Adam(model.parameters(), lr=lr_param[i])
    lr_scheduler = CosineAnnealingLR(optimizer, T_max=epochs, eta_min=0)
    history, test_accuracy, test_loss = experiment(model,
                                                   train_loader,
                                                   val_loader,
                                                   test_loader,
                                                   lr_scheduler,
                                                   optimizer)
    
    train_losses, val_losses, train_accuracies, val_accuracies, lr = history
    
    result["train"].append(train_accuracies[-1])
    result["train"].append(train_losses[-1])
    result["val"].append(val_accuracies[-1])
    result["val"].append(val_losses[-1])
    result["test"].append(test_accuracy)
    result["test"].append(test_loss)
    
    # Plotting training and validation accuracy
    ax[i*3].plot(train_accuracies)
    ax[i*3].plot(val_accuracies)
    ax[i*3].set_title(f'Model Accuracy, Test Accuracy={test_accuracy:.4f}')
    ax[i*3].set_xlabel('Epochs')
    ax[i*3].set_ylabel('Accuracy')
    ax[i*3].legend(['Train', 'Val'])

    # Plotting training and validation loss
    ax[i*3+1].plot(train_losses)
    ax[i*3+1].plot(val_losses)
    ax[i*3+1].set_title('Model Loss')
    ax[i*3+1].set_xlabel('Epochs')
    ax[i*3+1].set_ylabel('Loss')
    ax[i*3+1].legend(['Train', 'Val'])
    
    # Plotting Change of Learning Rate
    ax[i*3+2].plot(lr, label='learning rate')
    ax[i*3+2].set_title('Learning Rate')
    ax[i*3+2].set_xlabel('Epochs')
    ax[i*3+2].set_ylabel('Learning Rate')
    ax[i*3+2].legend()
    
plt.subplots_adjust(wspace=0.2, hspace=1)

In [None]:
df = pd.DataFrame(result)
df = df.groupby(["learning rate", "meaning"]).mean().sort_values(["learning rate"], ascending=False)
df

## Change Beta

In [None]:
betas = [(0.9, 0.999), (0.5, 0.6), (0.1, 0.2)]
fig, ax = plt.subplots(3, 3, figsize=(15, 15))
ax = ax.flatten()

# Labels must match the `betas` actually passed to the optimizer above
result = {"beta":[(0.9, 0.999), (0.9, 0.999), (0.5, 0.6), (0.5, 0.6), (0.1, 0.2), (0.1, 0.2)],
          "meaning":["Accuracy", "Loss", "Accuracy", "Loss", "Accuracy", "Loss"],
          "train":[],
          "val":[],
          "test":[]}

for i in range(3):
    
    model = Model().to(device)
    optimizer = optim.Adam(model.parameters(), lr=1e-3, betas=betas[i])
    lr_scheduler = CosineAnnealingLR(optimizer, T_max=epochs, eta_min=0)
    history, test_accuracy, test_loss = experiment(model,
                                        train_loader,
                                        val_loader,
                                        test_loader,
                                        lr_scheduler,
                                        optimizer)
    
    train_losses, val_losses, train_accuracies, val_accuracies, lr = history
    
    result["train"].append(train_accuracies[-1])
    result["train"].append(train_losses[-1])
    result["val"].append(val_accuracies[-1])
    result["val"].append(val_losses[-1])
    result["test"].append(test_accuracy)
    result["test"].append(test_loss)


    # Plotting training and validation accuracy
    ax[i*3].plot(train_accuracies)
    ax[i*3].plot(val_accuracies)
    ax[i*3].set_title(f'Model Accuracy, Test Accuracy={test_accuracy:.4f}')
    ax[i*3].set_xlabel('Epochs')
    ax[i*3].set_ylabel('Accuracy')
    ax[i*3].legend(['Train', 'Val'])

    # Plotting training and validation loss
    ax[i*3+1].plot(train_losses)
    ax[i*3+1].plot(val_losses)
    ax[i*3+1].set_title('Model Loss')
    ax[i*3+1].set_xlabel('Epochs')
    ax[i*3+1].set_ylabel('Loss')
    ax[i*3+1].legend(['Train', 'Val'])
    
    # Plotting Change of Learning Rate
    ax[i*3+2].plot(lr, label='learning rate')
    ax[i*3+2].set_title('Learning Rate')
    ax[i*3+2].set_xlabel('Epochs')
    ax[i*3+2].set_ylabel('Learning Rate')
    ax[i*3+2].legend()
    
plt.subplots_adjust(wspace=0.2, hspace=1)

In [None]:
df = pd.DataFrame(result)
df = df.groupby(["beta", "meaning"]).mean()
df
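
To make the role of the two beta values more concrete, here is a minimal scalar sketch of the standard Adam update rule with bias correction. It is written in plain Python for illustration and is not PyTorch's implementation. The function name `adam_step` and the toy gradient sequence are assumptions.

```python
import math


def adam_step(param, grad, m, v, step, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """One Adam update for a single scalar parameter (step counts from 1)."""
    beta1, beta2 = betas
    m = beta1 * m + (1 - beta1) * grad         # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** step)            # bias correction for the zero init
    v_hat = v / (1 - beta2 ** step)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Three gradients of +1 followed by a sign flip: how fast does the momentum react?
for betas in [(0.9, 0.999), (0.5, 0.6), (0.1, 0.2)]:
    param, m, v = 1.0, 0.0, 0.0
    for step, grad in enumerate([1.0, 1.0, 1.0, -1.0], start=1):
        param, m, v = adam_step(param, grad, m, v, step, betas=betas)
    print(betas, "momentum after sign flip:", round(m, 4))
```

With `beta1 = 0.9` the momentum term is still positive after the sign flip, while with `beta1 = 0.1` it follows the latest gradient almost immediately. Smaller betas therefore give less smoothing and noisier updates.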

## Differences between Training Data and Testing Data

In [None]:
df_test = pd.read_csv('heart_dataset_test.csv')

df = pd.read_csv('heart_dataset_train_all.csv')

# Map 'sex' descriptions to integer codes (label encoding, not one-hot encoding)
sex_description = {
    'Male': 0,
    'Female': 1,
}
df.loc[:, 'sex'] = df['sex'].map(sex_description)

# Mapping 'cp' (chest pain) descriptions to numbers
pain_description = {
    'low': 0,
    'medium': 1,
    'high': 2,
    'severe': 3
}
df.loc[:, 'cp'] = df['cp'].map(pain_description)

df_train = df.dropna()
features = df_train.columns

In [None]:
print(len(df_train), len(df_test))

In [None]:
df_train.describe()

In [None]:
df_test.describe()

In [None]:
from scipy.stats import ks_2samp

for col in features:
    stat, p_value = ks_2samp(df_train[col], df_test[col])
    print(f"{col}: KS statistic = {stat:.4f}, p = {p_value:.4f}")
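
The KS statistic reported above is the largest vertical gap between the empirical cumulative distribution functions (ECDFs) of the two samples. The sketch below computes only that statistic in plain Python, assuming numeric inputs. The function name is illustrative, and the p-value that `ks_2samp` also returns is not reproduced here.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)  # fraction of sample_a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of sample_b that is <= x
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Toy values standing in for one feature in the train and test sets
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

A statistic near 0 means the feature is distributed similarly in both sets, and a value near 1 means almost no overlap. The KS test is designed for continuous data, so results for binary or categorical columns such as `sex` or `fbs` should be read with caution.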

In [None]:
from scipy.stats import chi2_contingency

for col in ["sex", "cp"]:
    train_counts = df_train[col].value_counts()
    test_counts = df_test[col].value_counts()
    all_categories = set(train_counts.index).union(set(test_counts.index))

    train_freq = [train_counts.get(cat, 0) for cat in all_categories]
    test_freq = [test_counts.get(cat, 0) for cat in all_categories]

    stat, p_value, _, _ = chi2_contingency([train_freq, test_freq])
    print(f"{col}: Chi2 p = {p_value:.4f}")
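
The chi-square test compares the observed category counts in each set with the counts expected if both sets shared the same category proportions. The following plain-Python sketch computes the uncorrected Pearson statistic for such a table. The function name and the toy counts are assumptions. Note that `chi2_contingency` applies Yates' continuity correction by default for 2x2 tables, so its statistic for a binary column such as `sex` can be smaller than this one.

```python
def chi_square_statistic(table):
    """Uncorrected Pearson chi-square statistic for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Rows: train and test; columns: counts per category (toy numbers)
print(chi_square_statistic([[10, 20], [20, 40]]))  # same proportions in both rows
print(chi_square_statistic([[10, 0], [0, 10]]))    # completely different proportions
```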

## Feature Selection & Experiments

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(20, 15))
ax = sns.heatmap(df_train.corr(), cmap="Purples", annot=True)

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif, RFE

X = df_train.drop(columns="target")
y = df_train["target"]

selector = SelectKBest(score_func=f_classif, k=10)  # Or use mutual_info_classif
X_selected = selector.fit_transform(X, y)

# Get selected feature names
selected_features = X.columns[selector.get_support()]
print("Selected features:", selected_features.tolist())
selected_features = selected_features.tolist()
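
`f_classif` scores each feature with a one-way ANOVA F-value: the variance of the class means relative to the variance inside each class. Here is a minimal plain-Python sketch for a single feature, with an illustrative function name and toy data. It assumes at least two classes and some within-class variation.

```python
def anova_f_score(values, labels):
    """One-way ANOVA F-value of one feature with respect to the class labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    n_total = len(values)
    grand_mean = sum(values) / n_total
    means = {label: sum(g) / len(g) for label, g in groups.items()}

    # Spread of the class means around the overall mean
    ss_between = sum(len(g) * (means[label] - grand_mean) ** 2 for label, g in groups.items())
    # Spread of the samples around their own class mean
    ss_within = sum((x - means[label]) ** 2 for label, g in groups.items() for x in g)

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


labels = [0, 0, 0, 1, 1, 1]
print(anova_f_score([1, 2, 3, 4, 5, 6], labels))  # classes well separated
print(anova_f_score([1, 2, 3, 1, 2, 3], labels))  # identical distribution per class
```

`SelectKBest` keeps the `k` features with the highest scores. A feature whose values look the same in both classes scores near zero and is dropped first.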

In [None]:
n = len(selected_features)
selected_features.append("target")
df = df_train[selected_features]
df = df.dropna()
# split train, validation data
np_data = df.values
print(np_data.shape)
split_point = int(np_data.shape[0]*0.7)

np.random.shuffle(np_data)

x_train = np_data[:split_point, :n]
y_train = np_data[:split_point, n]
x_val = np_data[split_point:, :n]
y_val = np_data[split_point:, n]

# Transform to tensors and wrap them in DataLoaders
x_train = np.array(x_train, dtype=float)
x_train = torch.from_numpy(x_train).float()
y_train = np.array(y_train, dtype=int)
y_train = torch.from_numpy(y_train).long()

x_val = np.array(x_val, dtype=float)
x_val = torch.from_numpy(x_val).float()
y_val = np.array(y_val, dtype=int)
y_val = torch.from_numpy(y_val).long()

batch_size = 32

# Create datasets
train_dataset = TensorDataset(x_train, y_train)
val_dataset = TensorDataset(x_val, y_val)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
print(f'Number of samples in train and validation are {len(train_loader.dataset)} and {len(val_loader.dataset)}.')

test_data = df_test[selected_features]
test_data = test_data.values
# Convert to PyTorch tensors
x_test = torch.from_numpy(test_data[:, :n]).float()
y_test = torch.from_numpy(test_data[:, n]).long()

# Create datasets
test_dataset = TensorDataset(x_test, y_test)

# Create dataloaders
test_loader = DataLoader(test_dataset, batch_size=1, shuffle=False)

In [None]:
class Model(nn.Module):
    def __init__(self, in_channel):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(in_channel, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, 2)
        )

    def forward(self, x):
        return self.model(x)


model = Model(n).to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
lr_scheduler = CosineAnnealingLR(optimizer, T_max=epochs, eta_min=0)
# experiment() returns three values: history, test accuracy, and test loss
history, test_accuracy, test_loss = experiment(model,
                                               train_loader,
                                               val_loader,
                                               test_loader,
                                               lr_scheduler,
                                               optimizer)

In [None]:
import torch
import torch.nn as nn
from torch.nn import TransformerEncoder, TransformerEncoderLayer

class TabularTransformer(nn.Module):
    def __init__(self, num_features, d_model, nhead, num_layers, num_classes=1):
        super(TabularTransformer, self).__init__()
        self.feature_embedding = nn.Linear(1, d_model)  # or use Embedding for categorical
        self.pos_embedding = nn.Parameter(torch.randn(1, num_features, d_model))
        # batch_first=True so the input layout is [batch, num_features, d_model]
        encoder_layer = TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.transformer = TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(num_features * d_model, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes)
        )

    def forward(self, x):
        # x shape: [batch, num_features]
        x = x.unsqueeze(-1)  # [batch, num_features, 1]
        x = self.feature_embedding(x)  # [batch, num_features, d_model]
        x = x + self.pos_embedding  # Add positional embedding
        x = self.transformer(x)  # [batch, num_features, d_model]
        out = self.classifier(x)  # [batch, num_classes]
        return out
    


In [None]:
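# Use n input features to match the loaders built from the selected features
model = TabularTransformer(n, 256, 4, 1, num_classes=2).to(device)
# Create the optimizer after the model so that it updates this model's parameters
optimizer = optim.Adam(model.parameters(), lr=1e-3)
lr_scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=0)
history, test_accuracy, test_loss = experiment(model,
                                               train_loader,
                                               val_loader,
                                               test_loader,
                                               lr_scheduler,
                                               optimizer,
                                               epochs=100)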

In [None]:
import matplotlib.pyplot as plt

train_losses, val_losses, train_accuracies, val_accuracies, _ = history
fig, ax = plt.subplots(1, 2, figsize=(15, 5))

# Plotting training and validation accuracy
ax[0].plot(train_accuracies)
ax[0].plot(val_accuracies)
ax[0].set_title('Model Accuracy')
ax[0].set_xlabel('Epochs')
ax[0].set_ylabel('Accuracy')
ax[0].legend(['Train', 'Val'])

# Plotting training and validation loss
ax[1].plot(train_losses)
ax[1].plot(val_losses)
ax[1].set_title('Model Loss')
ax[1].set_xlabel('Epochs')
ax[1].set_ylabel('Loss')
ax[1].legend(['Train', 'Val'])

plt.show()