<a href="https://colab.research.google.com/github/ChristophWuersch/AppliedNeuralNetworks/blob/main/U02/BinaryClassification_HeartDataset_SOLUTION-PyTorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="Bilder/ost_logo.png" width="240" height="120" align="right"/>
<div style="text-align: left"> <b> Applied Neural Networks | FS 2025 </b><br>
<a href="mailto:christoph.wuersch@ost.ch"> © Christoph Würsch </a> </div>
<a href="https://www.ost.ch/de/forschung-und-dienstleistungen/technik/systemtechnik/ice-institut-fuer-computational-engineering/"> Eastern Switzerland University of Applied Sciences OST | ICE </a>

# Binary Classification with Continuous and Categorical Features

**Authors:**
- Christoph Würsch, Eastern Switzerland University of Applied Sciences OST
- [Francois Chollet](https://twitter.com/fchollet)<br>



This exercise series shows how to classify structured data, starting from a raw
CSV file, with PyTorch. The data contain both numerical and categorical features. We standardize the numerical features with scikit-learn's `StandardScaler` and vectorize the categorical features (one-hot encoding) with `pandas.get_dummies`.

### The dataset

[Our dataset](https://archive.ics.uci.edu/ml/datasets/heart+Disease) is provided by the Cleveland Clinic Foundation for Heart Disease. It is a CSV file with 303 rows. Each row contains information about one patient (a **sample**), and each column describes an attribute of the patient (a **feature**). We use the features to predict whether a patient has heart disease (**binary classification**).

Here is a summary of the features:

Column| Description| Feature Type
------------|--------------------|----------------------
Age | Age in years | Numerical
Sex | (1 = male; 0 = female) | Categorical
CP | Chest pain type (0, 1, 2, 3, 4) | Categorical
Trestbps | Resting blood pressure (in mm Hg on admission) | Numerical
Chol | Serum cholesterol in mg/dl | Numerical
FBS | Fasting blood sugar > 120 mg/dl (1 = true; 0 = false) | Categorical
RestECG | Resting electrocardiogram results (0, 1, 2) | Categorical
Thalach | Maximum heart rate achieved | Numerical
Exang | Exercise induced angina (1 = yes; 0 = no) | Categorical
Oldpeak | ST depression induced by exercise relative to rest | Numerical
Slope | Slope of the peak exercise ST segment | Numerical
CA | Number of major vessels (0-3) colored by fluoroscopy | Both numerical & categorical
Thal | 3 = normal; 6 = fixed defect; 7 = reversible defect | Categorical
Target | Diagnosis of heart disease (1 = true; 0 = false) | Target

## Setup

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, random_split
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import pandas as pd
import sys

# Print versions in a compact form
print(f"Python version: {sys.version}")
print(f"torch: {torch.__version__}")
print(f"matplotlib: {matplotlib.__version__}")
print(f"numpy: {np.__version__}")
print(f"pandas: {pd.__version__}")

In [None]:
# Comment out this line if Lightning is already installed
!pip install lightning

## (a) Loading the dataset

We download the data and store them in a pandas DataFrame:

In [None]:

file_url = "http://storage.googleapis.com/download.tensorflow.org/data/heart.csv"
df = pd.read_csv(file_url)


In [None]:
df.to_csv('heart.csv')

The dataset comprises 303 samples with 14 columns per sample (13 features plus the target label):

In [None]:
df.shape

The last column, `target`, indicates whether the patient has heart disease (`1`) or not (`0`).
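
Because the target is binary, it can be useful to check how balanced the two classes are: the accuracy of a classifier is only informative relative to the accuracy obtained by always predicting the most frequent class. The following is a minimal sketch on a made-up label list (not the real `target` column); `majority_baseline` is a hypothetical helper name introduced for illustration.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (majority_label, accuracy) for always predicting the most frequent label."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Toy labels for illustration only; with the notebook's data one would pass df["target"].
toy_labels = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
label, acc = majority_baseline(toy_labels)
print(f"majority class: {label}, baseline accuracy: {acc:.2f}")
```

A trained model should clearly exceed this baseline on the validation data.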


## (b) EDA

Here is a quick look at the data:

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
# Select only numeric columns
numeric_features= df.select_dtypes(include=['number']).columns.tolist()
print(numeric_features)

corr=df[numeric_features].corr()    



In [None]:
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=np.bool_)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Draw the heatmap with the mask and correct aspect ratio
# Use the full correlation range [-1, 1] so that negative correlations are not clipped
sns.heatmap(corr, mask=mask, vmin=-1.0, vmax=1.0, center=0, annot=True,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.show()

In [None]:
features=df.columns[0:-1]
response=df.columns[-1]

features

In [None]:
# Create a 3x5 grid of subplots
fig, axes = plt.subplots(3, 5, figsize=(20, 12))  # Adjust figsize for readability

# Flatten axes for easy iteration
axes = axes.flatten()

# Loop through features and create boxplots
for i, feature in enumerate(features):
    sns.boxplot(data=df, x='target', y=feature, ax=axes[i])
    axes[i].set_title(f'Boxplot of {feature}')
    axes[i].set_xlabel('Target')
    axes[i].set_ylabel(feature)

# Hide any unused subplots (if features < 15)
for i in range(len(features), len(axes)):
    fig.delaxes(axes[i])

# Adjust layout
plt.tight_layout()
plt.show()


## (c) Standardization and one-hot encoding

The following features are categorical features encoded as integers:

- `sex`
- `cp`
- `fbs`
- `restecg`
- `exang`
- `ca`

The code below additionally treats `thal` as categorical. We encode these features with **one-hot encoding**, here using `pd.get_dummies` from pandas.
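
To illustrate what one-hot encoding does, here is a minimal sketch on a small made-up DataFrame. The column names mimic the heart dataset, but the values are invented for illustration.

```python
import pandas as pd

# Toy frame for illustration; the column names mimic the heart dataset.
toy = pd.DataFrame({
    "chol": [233, 286, 229],
    "sex": [1, 0, 1],
    "thal": ["fixed", "normal", "reversible"],
})

# Each listed column is replaced by one indicator column per distinct value;
# columns that are not listed (here `chol`) are passed through unchanged.
encoded = pd.get_dummies(toy, columns=["sex", "thal"])
print(encoded.columns.tolist())
```

The passed-through columns come first, followed by the indicator columns, which is why the numeric features can later be selected by position.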

In [None]:
print(df.dtypes)

In [None]:
categorical= ['sex', 'cp', 'fbs', 'restecg', 'exang', 'ca', 'thal']

# Note: iloc[:, 1:-1] leaves out the first column (age) as well as the last column (target)
df_onehot=pd.get_dummies(data=df.iloc[:,1:-1],columns=categorical)

features=df_onehot.columns
features


In [None]:
df_onehot.head()

## (d) Splitting into a training and a validation set

We split the dataset into a training set and a validation set. For this we use `train_test_split` from scikit-learn, which holds out 33 % of the samples (`test_size=0.33`).
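
For comparison, a split can also be sketched with pandas alone via `DataFrame.sample` and `DataFrame.drop`. The example below uses a made-up DataFrame and an arbitrarily chosen validation fraction of 20 %; both are assumptions for illustration.

```python
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({"x": range(10), "target": [0, 1] * 5})

# Draw 20% of the rows at random for validation (fixed seed for reproducibility) ...
val_df = toy.sample(frac=0.2, random_state=42)
# ... and keep every remaining row for training.
train_df = toy.drop(val_df.index)

print(len(train_df), len(val_df))
```

Dropping by index guarantees that no row appears in both subsets.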

In [None]:
# Split the column names into numeric and one-hot encoded (categorical) features
numericFeatures=features[0:5]
categoricFeatures=features[5:]

print(numericFeatures)
print(categoricFeatures)

In [None]:
df_onehot[features].dtypes

In [None]:
X=df_onehot[features].astype(np.float32).to_numpy()
y=df['target'].to_numpy()

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
X_train

In [None]:
dg=pd.DataFrame(data=X_train, columns=features)
dg.head()

## (e) Standardizing the quantitative, continuous features
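
Standardization rescales each numeric feature to zero mean and unit variance. The usual convention is to estimate the mean and standard deviation on the training split and to reuse exactly these statistics for the held-out split. The following is a minimal NumPy sketch on made-up numbers; like `StandardScaler`, it uses the population standard deviation.

```python
import numpy as np

# Toy data for illustration: one feature, separate training and test samples.
x_train = np.array([[1.0], [2.0], [3.0], [4.0]])
x_test = np.array([[2.5], [5.0]])

# The statistics are estimated on the training data only ...
mean = x_train.mean(axis=0)
std = x_train.std(axis=0)  # population standard deviation (ddof=0)

# ... and then applied unchanged to both splits: z = (x - mean) / std.
z_train = (x_train - mean) / std
z_test = (x_test - mean) / std

print(z_train.ravel(), z_test.ravel())
```

The transformed test data generally do not have exactly zero mean and unit variance, which is expected.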

In [None]:
from sklearn.preprocessing import StandardScaler

myScaler=StandardScaler()
# Fit the scaler on the training data only and reuse its statistics for the test data
Xtrain= np.hstack((myScaler.fit_transform(X_train[:,0:5]),X_train[:,5:]))
Xtest = np.hstack((myScaler.transform(X_test[:,0:5]),X_test[:,5:]))


In [None]:
np.shape(Xtrain)

In [None]:
Xtrain

In [None]:
np.shape(y_train.reshape(-1,1))

## (f) Creating the data class (`Dataset`) and the `DataLoader`

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader

# Custom Dataset class
class CustomDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.float32)
    
    def __len__(self):
        return len(self.y)
    
    def __getitem__(self, index):
        return self.X[index], self.y[index]

# Create Dataset objects
train_dataset = CustomDataset(Xtrain, y_train)
test_dataset  = CustomDataset(Xtest, y_test)

In [None]:
# Check for GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create DataLoader objects
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

# Move data to device (optional, for demonstration purposes)
for X_batch, y_batch in train_loader:
    X_batch, y_batch = X_batch.to(device), y_batch.to(device)
    print(f"Batch X shape: {X_batch.shape}, Batch y shape: {y_batch.shape}")
    break  # Just to show one batch, remove in actual training


In [None]:
trainData=np.hstack((Xtrain,y_train.reshape(-1,1)))
valData =np.hstack((Xtest,y_test.reshape(-1,1)))
fullFeatures=list(features)
fullFeatures.append('target')
print(fullFeatures)


Alternatively, a helper function can create the `DataLoader` objects directly from `pandas` DataFrames.

In [None]:
train_dataframe=pd.DataFrame(data=trainData, columns=fullFeatures)
train_dataframe

In [None]:
val_dataframe=pd.DataFrame(data=valData, columns=fullFeatures)
val_dataframe

In [None]:
from torch.utils.data import DataLoader, TensorDataset

# Convert DataFrame to tensors
def dataframe_to_dataloader(df, target_column, batch_size=32, shuffle=True):
    """
    Converts a pandas DataFrame into a PyTorch DataLoader.
    Args:
    - df: pandas DataFrame with features and target.
    - target_column: Name of the column containing the target variable.
    - batch_size: Size of batches in the DataLoader.
    - shuffle: Whether to shuffle the data.
    """
    features = torch.tensor(df.drop(columns=[target_column]).values, dtype=torch.float32)
    target   = torch.tensor(df[target_column].values, dtype=torch.float32)
    
    dataset = TensorDataset(features, target)
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)




In [None]:
# Example usage
batch_size = 16
train_loader = dataframe_to_dataloader(train_dataframe, target_column='target', batch_size=batch_size)
# Validation data does not need to be shuffled
val_loader   = dataframe_to_dataloader(val_dataframe, target_column='target', batch_size=batch_size, shuffle=False)

# Iterate through the DataLoader
for batch_features, batch_targets in train_loader:
    print(batch_features.shape, batch_targets.shape)
    break

## (g) Building the model architecture
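
The network in this section ends in a sigmoid and is trained with binary cross-entropy. As a rough sketch of what this combination computes, the example below evaluates both steps by hand in NumPy on made-up values. The function names `sigmoid` and `binary_cross_entropy` are introduced here for illustration only.

```python
import numpy as np

def sigmoid(z):
    """Map raw scores (logits) to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(p, y):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

logits = np.array([2.0, -1.0, 0.0])  # toy raw outputs of the last linear layer
y = np.array([1.0, 0.0, 1.0])        # toy labels

p = sigmoid(logits)
loss = binary_cross_entropy(p, y)
# A sample counts as predicted positive if its probability exceeds 0.5.
accuracy = float(np.mean((p > 0.5) == (y == 1)))
print(f"loss={loss:.4f}, accuracy={accuracy:.2f}")
```

Confident correct predictions contribute little to the loss, while predictions close to 0.5 or on the wrong side of it contribute more.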

In [None]:
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping

class ClassifierNet(pl.LightningModule):
    def __init__(self, n_hidden=32):
        super(ClassifierNet, self).__init__()
        self.fc1 = nn.Linear(28, n_hidden)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(n_hidden, n_hidden)
        self.output = nn.Linear(n_hidden, 1)
        self.sigmoid = nn.Sigmoid()
        self.criterion = nn.BCELoss()

        # Initialize metric storage
        self.train_losses = []
        self.val_losses = []
        self.train_accuracies = []
        self.val_accuracies = []

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.relu(x)
        x = self.output(x)
        return self.sigmoid(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x).squeeze(-1)  # drop only the trailing dimension, also safe for a batch of size 1
        loss = self.criterion(y_hat, y)
        acc = ((y_hat > 0.5).float() == y).float().mean()
        self.log('train_loss', loss)
        self.log('train_acc', acc, prog_bar=True)
        self.train_losses.append(loss.item())
        self.train_accuracies.append(acc.item())
        return {'loss': loss, 'train_acc': acc}

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x).squeeze(-1)  # drop only the trailing dimension, also safe for a batch of size 1
        loss = self.criterion(y_hat, y)
        acc = ((y_hat > 0.5).float() == y).float().mean()
        self.log('val_loss', loss, prog_bar=True)
        self.log('val_acc', acc, prog_bar=True)
        self.val_losses.append(loss.item())
        self.val_accuracies.append(acc.item())
        return {'val_loss': loss, 'val_acc': acc}

    def configure_optimizers(self):
        return torch.optim.RMSprop(self.parameters(), lr=0.001)


## (h) Training

In [None]:
# Model instantiation
model = ClassifierNet(n_hidden=32)


In [None]:
from torchsummary import summary

# Example input size: (batch_size, input_features)
summary(model, input_size=(batch_size,28))

In [None]:
# Trainer setup
max_epochs=50
trainer = pl.Trainer(max_epochs=max_epochs, log_every_n_steps=1)
trainer.fit(model, train_loader, val_loader)

## (i) Learning curves
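
The model stores one loss and one accuracy value per batch, so the raw curves can be noisy. One possible way to smooth them is to average the values per epoch. The sketch below assumes a fixed number of steps per epoch (for example `len(train_loader)`); `epoch_means` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def epoch_means(step_values, steps_per_epoch):
    """Average consecutive groups of `steps_per_epoch` values.

    A trailing, incomplete group is dropped.
    """
    values = np.asarray(step_values, dtype=float)
    n_epochs = len(values) // steps_per_epoch
    trimmed = values[: n_epochs * steps_per_epoch]
    return trimmed.reshape(n_epochs, steps_per_epoch).mean(axis=1)

# Toy per-step losses for illustration: three epochs with two steps each.
toy_losses = [0.9, 0.7, 0.6, 0.4, 0.3, 0.3]
print(epoch_means(toy_losses, steps_per_epoch=2))
```

The result has one value per epoch and can be plotted against the epoch index directly.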

In [None]:
# Function to plot learning curves
def plot_learning_curves(model):
    epochs_train = np.array(range(1, len(model.train_losses) + 1)) / len(model.train_losses)*max_epochs 
    epochs_val = np.array(range(1, len(model.val_losses) + 1)) / len(model.val_losses)*max_epochs
    
    # Plotting
    plt.figure(figsize=(12, 5))

    # Loss plot
    plt.subplot(1, 2, 1)
    plt.plot(epochs_train, model.train_losses, 'b.-', label='Training Loss')
    plt.plot(epochs_val, model.val_losses, 'r.-', label='Validation Loss')
    plt.title('Loss Over Epochs')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.grid(True)
    plt.legend()

    # Accuracy plot
    plt.subplot(1, 2, 2)
    plt.plot(epochs_train, model.train_accuracies, 'b.-', label='Training Accuracy')
    plt.plot(epochs_val, model.val_accuracies, 'r.-', label='Validation Accuracy')
    plt.title('Accuracy Over Epochs')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.grid(True)
    plt.legend()


    plt.tight_layout()
    plt.show()



In [None]:
# Plot learning curves
plot_learning_curves(model)

## (j) One function for training and plotting

We now combine everything in a single function.

In [None]:
def TrainAndPlot(n_hidden=32,max_epochs=50):
    # Model instantiation
    model = ClassifierNet(n_hidden=n_hidden)
    # Trainer setup

    trainer = pl.Trainer(max_epochs=max_epochs, log_every_n_steps=1)
    trainer.fit(model, train_loader, val_loader)
    plot_learning_curves(model)
    

## (k) A slightly different network

In [None]:
TrainAndPlot(n_hidden=32,max_epochs=50)   

## (l) Variable number of hidden layers
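
When a network is described by a list of layer sizes, the number of trainable parameters follows directly from that list: each fully connected layer with `n_in` inputs and `n_out` outputs has `n_in * n_out` weights and `n_out` biases, while ReLU and dropout layers add none. The helper `count_parameters` below is a hypothetical name used only for this sketch.

```python
def count_parameters(layer_sizes):
    """Number of weights and biases in a stack of fully connected layers."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

print(count_parameters([28, 32, 32, 1]))          # 2017
print(count_parameters([28, 16, 32, 32, 16, 1]))  # 2609
```

With roughly 200 training samples, even these small networks have many more parameters than samples, which is one motivation for regularization such as dropout.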

In [None]:
from pytorch_lightning import Trainer


In [None]:
class ClassifierNet(pl.LightningModule):
    def __init__(self, layer_sizes):
        """
        Initialize the classifier network.
        
        Args:
        - layer_sizes (list): A list of integers with the size of every layer.
                              The first element is the input size, the last element
                              is the output size, and the elements in between are
                              the numbers of nodes in the hidden layers.
        """
        super(ClassifierNet, self).__init__()

        self.layers = nn.ModuleList()
        for i in range(len(layer_sizes) - 1):
            self.layers.append(nn.Linear(layer_sizes[i], layer_sizes[i + 1]))
            if i < len(layer_sizes) - 2:  # Add dropout only for hidden layers
                self.layers.append(nn.ReLU())
                self.layers.append(nn.Dropout(p=0.5))  # Dropout probability is 0.5

        self.output_activation = nn.Sigmoid()  # Output layer activation for binary classification
        self.criterion = nn.BCELoss()  # Binary Cross-Entropy Loss

        # Initialize metric storage
        self.train_losses = []
        self.val_losses = []
        self.train_accuracies = []
        self.val_accuracies = []

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return self.output_activation(x)  # Apply sigmoid activation at the output layer

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x).squeeze(-1)  # drop only the trailing dimension, also safe for a batch of size 1
        loss = self.criterion(y_hat, y)
        acc = ((y_hat > 0.5).float() == y).float().mean()
        self.log('train_loss', loss)
        self.log('train_acc', acc, prog_bar=True)
        self.train_losses.append(loss.item())
        self.train_accuracies.append(acc.item())
        return {'loss': loss, 'train_acc': acc}

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x).squeeze(-1)  # drop only the trailing dimension, also safe for a batch of size 1
        loss = self.criterion(y_hat, y)
        acc = ((y_hat > 0.5).float() == y).float().mean()
        self.log('val_loss', loss, prog_bar=True)
        self.log('val_acc', acc, prog_bar=True)
        self.val_losses.append(loss.item())
        self.val_accuracies.append(acc.item())
        return {'val_loss': loss, 'val_acc': acc}

    def configure_optimizers(self):
        return torch.optim.RMSprop(self.parameters(), lr=0.001)

In [None]:
def TrainPlot(layer_sizes, batch_size, max_epochs=50):
    # Model instantiation
    model = ClassifierNet(layer_sizes)
    # Print a layer-by-layer summary of the model
    summary(model, input_size=(batch_size,28))

    # Trainer setup
    trainer = pl.Trainer(max_epochs=max_epochs, log_every_n_steps=1)
    trainer.fit(model, train_loader, val_loader)
    plot_learning_curves(model)

In [None]:
# Example usage
layer_sizes = [28, 16, 32, 32, 16, 1]  # Input size, hidden layers, output size
batch_size=32
TrainPlot(layer_sizes,batch_size, max_epochs=50)  