<a href="https://colab.research.google.com/github/MoLue/wft_digital_medicine/blob/main/wft_ds.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview
In this notebook, we will work with a cardiovascular disease (CVD) dataset published on [Kaggle](https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction/data). The primary goal is to perform an initial data exploration to understand the dataset's structure and key insights, followed by the application of various machine learning methods to predict the likelihood of heart disease.

## Getting Started

The imports are categorized for clarity into core libraries, visualization, machine learning, and configurations, making them easier to locate and manage.

In [1]:
# ===============================
# 1. Core Libraries for Data and Math
# ===============================
import numpy as np  # Linear algebra
import pandas as pd  # Data processing, CSV file I/O (e.g., pd.read_csv)

# ===============================
# 2. Visualization Libraries
# ===============================
import matplotlib.pyplot as plt  # Plotting
import matplotlib  # Additional Matplotlib configuration
import seaborn as sns  # Statistical data visualization

# ===============================
# 3. Machine Learning Libraries
# ===============================
from sklearn import preprocessing  # Preprocessing utilities
from sklearn.preprocessing import LabelEncoder  # Label encoding

# ===============================
# 4. Miscellaneous Configurations
# ===============================
import warnings  # Suppress warnings

# ===============================
# 5. Configurations
# ===============================
warnings.filterwarnings("ignore")  # Ignore warnings
pd.set_option("display.max_rows", None)  # Show all rows in DataFrames
matplotlib.style.use('ggplot')  # Use ggplot style for Matplotlib


## The data
Cardiovascular diseases (CVDs) are the leading cause of death globally, claiming 17.9 million lives annually. Many of these deaths are linked to heart attacks or strokes, and many occur prematurely, in people under 70 years of age. This dataset includes 11 features to help predict heart disease, supporting early detection and management through machine learning models.

Let’s import the data to begin our analysis.

In [None]:
# Read the CSV and store it in our data frame variable
df = pd.read_csv("https://raw.githubusercontent.com/MoLue/wft_digital_medicine/main/data/heart.csv")

# show the first 5 entries
df.head()

**The attributes include:**
* Age: age of the patient [years]
* Sex: sex of the patient [M: Male, F: Female]
* ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
* RestingBP: resting blood pressure [mm Hg]
* Cholesterol: serum cholesterol [mg/dl]
* FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
* RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
* MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
* ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
* Oldpeak: ST depression induced by exercise relative to rest [numeric value]
* ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
* HeartDisease: output class [1: heart disease, 0: Normal]

You can find more information about the dataset here: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction

## Pandas as central data exploration and data handling tool

The Pandas Cheat Sheet is a concise and powerful reference tool that summarizes key Pandas operations, making it easier to work with data in Python. It’s particularly helpful for beginners learning Pandas or for experienced users needing a quick refresher.

If you’re new to Pandas, the cheat sheet introduces essential concepts like creating DataFrames, indexing, and basic operations like filtering, sorting, and aggregation.

**Quick Reference for Common Tasks:**
When working on data manipulation, the cheat sheet provides quick access to frequently used operations, such as:
- Handling missing data
- Applying functions (`apply`, `map`)
- Joining and merging datasets
- Grouping and aggregating data

**Explore Advanced Features:**
For advanced users, the cheat sheet highlights more complex operations like pivot tables, reshaping data, and handling time-series data, which can save time when dealing with complex workflows.

**Bookmark for Fast Access:**
Keep the cheat sheet accessible while working on projects to quickly recall the syntax for specific operations, ensuring smoother and more efficient coding.

**Where to Find It:**

You can access the Pandas Cheat Sheet via this [link](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf).
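
The common tasks listed above can be sketched in a few lines. The example below is only an illustration: the two small tables and the column `patient_id` are made up and are not part of the heart dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy tables (not taken from the heart dataset)
patients = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "Sex": ["M", "F", "M", "F"],
    "Cholesterol": [210.0, np.nan, 250.0, 190.0],
})
visits = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "MaxHR": [150, 172, 128, 160],
})

# Handling missing data: fill the gap with the column median
patients["Cholesterol"] = patients["Cholesterol"].fillna(patients["Cholesterol"].median())

# Applying functions: map the codes to readable labels
patients["SexLabel"] = patients["Sex"].map({"M": "Male", "F": "Female"})

# Joining and merging: combine both tables on the shared key
merged = patients.merge(visits, on="patient_id", how="inner")

# Grouping and aggregating: mean values per sex
summary = merged.groupby("Sex")[["Cholesterol", "MaxHR"]].mean()
print(summary)
```

Filling with the median is just one possible choice here; dropping incomplete rows with `dropna()` would be another.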

# Data exploration
In the first step, get comfortable with the data. Along the way you will also get an overview of this notebook and the tools it uses.

Some useful methods to get an overview:
- `df.describe()` and `df.describe().T`
- `df.head()`
- `df.tail()`
- `df.info()`
- `df.dtypes`



In [None]:
# Use this block to get information about your data
# ...



This code prepares the dataset for more structured and efficient data exploration by clearly separating categorical (string) columns and numerical columns. Here's how it helps:

**Consistent Data Types**:
Converts all object columns to string explicitly, ensuring uniform data types for categorical columns. This avoids errors or inconsistencies during exploration or analysis.

**Column Segmentation:**
By separating categorical (string_col) and numerical columns (num_col), you can apply appropriate analysis techniques to each type of data.
Example: Use summary statistics or visualizations (like histograms) for numeric columns.


**Focus on the Target Variable**:
Excludes the target variable HeartDisease from num_col to avoid accidentally treating it as an independent feature during exploratory analysis or model training.

By organizing the data this way, the process of exploring relationships, distributions, and patterns becomes much easier and reduces the risk of errors. This segmentation is particularly useful for tasks like:

- Feature engineering
- Applying statistical tests
- Creating tailored visualizations for numeric and categorical variables

In [3]:
# Select all columns with object data type (categorical/string columns)
string_col = df.select_dtypes(include="object").columns

# Convert these columns explicitly to string type for consistency
df[string_col] = df[string_col].astype("string")

# Re-select string columns to ensure consistency after the conversion
string_col = df.select_dtypes("string").columns.to_list()

# Create a list of all numeric columns excluding the target variable "HeartDisease"
num_col = df.columns.to_list()  # Start with all columns
for col in string_col:
    num_col.remove(col)  # Remove string columns from the numeric list
num_col.remove("HeartDisease")  # Exclude the target variable


### Example for a more detailed descriptive analysis

In [None]:
# Calculate descriptive statistics for patients with and without heart disease
yes = df[df['HeartDisease'] == 1].describe().T
no = df[df['HeartDisease'] == 0].describe().T

# Define new color palette
colors = 'coolwarm'

# Create subplots
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12, 6))

# Plot heatmap for patients with heart disease
plt.subplot(1, 2, 1)
sns.heatmap(
    yes[['mean']],
    annot=True,
    cmap=colors,
    linewidths=0.5,
    linecolor='white',
    cbar=False,
    fmt='.2f'
)
plt.title('Heart Disease', fontsize=14, color='darkred', weight='bold')

# Plot heatmap for patients without heart disease
plt.subplot(1, 2, 2)
sns.heatmap(
    no[['mean']],
    annot=True,
    cmap=colors,
    linewidths=0.5,
    linecolor='white',
    cbar=False,
    fmt='.2f'
)
plt.title('No Heart Disease', fontsize=14, color='darkblue', weight='bold')

# Adjust layout for better spacing
fig.tight_layout(pad=3)

# Display the plots
plt.show()


## Correlation Matrix:
What does `df.corr()` do? Try to visualize it with a diagram by using a heatmap. What does it represent?

Note: `sns` is our helper to visualize our graphs and plots with the library *seaborn*
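
As a small hint before you try it on the real data: by default, `DataFrame.corr()` computes the pairwise Pearson correlation coefficient between numeric columns. The sketch below uses made-up values (only the column names are borrowed from the dataset) and checks the pandas result against the formula computed by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, not taken from the heart dataset
toy = pd.DataFrame({
    "Age": [40, 50, 60, 70],
    "MaxHR": [180, 165, 150, 140],
})

# Pearson correlation by hand: covariance divided by the product of the spreads
x, y_toy = toy["Age"], toy["MaxHR"]
manual_r = ((x - x.mean()) * (y_toy - y_toy.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y_toy - y_toy.mean()) ** 2).sum()
)

# The same coefficient as computed by pandas
pandas_r = toy.corr().loc["Age", "MaxHR"]
print(manual_r, pandas_r)
```

Values close to -1 or 1 indicate a strong linear relationship, and values close to 0 indicate a weak one.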

In [None]:
# Correlate all numeric columns ("int64" alone would skip the float column Oldpeak)
corr_matrix = df.select_dtypes("number").corr()
corr_matrix

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Compute the correlation matrix over all numeric columns (including the float column Oldpeak)
corr_matrix = df.select_dtypes("number").corr()

# Create the heatmap
plt.figure(figsize=(10, 8))

# Add your code here:
# Hint: SNS Heatmap https://seaborn.pydata.org/generated/seaborn.heatmap.html

# End of your code

# Add title
plt.title("Correlation Plot of the Heart Failure Prediction", fontsize=16)

# Show the plot
plt.tight_layout()
plt.show()


## Histograms
Compare your columns with histogram plots. For this you can use `plt.hist(...)`, as in the example below. Which columns give a good overview?

In [None]:
# Create a figure
plt.figure(figsize=(10, 6))

# Group data by Sex and plot histograms
sexes = df["Sex"].unique()
for sex in sexes:
    subset = df[df["Sex"] == sex]
    plt.hist(subset["Age"], bins=20, alpha=0.7, label=sex)

# Add title and labels
plt.title("Distribution of Age by Sex", fontsize=16)
plt.xlabel("Age", fontsize=14)
plt.ylabel("Count", fontsize=14)
plt.legend(title="Sex", fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

Now try some other plots on your own in the following code box.


## Pair Plot

Create a pair plot to visualize relationships between variables in your dataset. The [Seaborn pairplot documentation](https://seaborn.pydata.org/generated/seaborn.pairplot.html) explains how to implement this visualization.

Hint: Use the hue parameter to group data by a categorical variable like HeartDisease.

Purpose: This plot will help you explore patterns, correlations, and distributions within your data.
A pair plot is a powerful data visualization tool that displays relationships between multiple features in a dataset. It creates scatterplots for every pair of numerical features and histograms (or kernel density estimates) for individual features along the diagonal.

Add your solution in the provided code section to complete the visualization.

A pair plot can show pairwise bivariate distributions. What does this plot show in the end? (More information: https://seaborn.pydata.org/generated/seaborn.pairplot.html)

In [None]:
# You can enter your results for Question 3
plt.figure(figsize=(15,10))

# Add your code for the pairplot here.
# Hint: Think about what do you try to say with that kind of visualization!
# Hint: https://seaborn.pydata.org/generated/seaborn.pairplot.html


# End of your code
plt.tight_layout()
plt.show()

## Outliers
A box plot (or box-and-whisker plot) shows the distribution of quantitative data in a way that facilitates comparisons between variables. It is a standardized way of displaying a distribution based on the five-number summary:

- Minimum
- First quartile
- Median
- Third quartile
- Maximum

The box spans the first to the third quartile, and a segment inside it marks the median. In the classic form, the whiskers extend to the minimum and maximum. With seaborn's default settings, the whiskers instead reach only to the most extreme points within 1.5 times the interquartile range of the box, and any points beyond them are drawn individually as potential outliers.
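
The five-number summary can also be computed directly. The sketch below uses made-up blood pressure values and flags outliers with the 1.5 × IQR rule of thumb, which is an assumption here and only one of several possible outlier criteria.

```python
import numpy as np

# Hypothetical resting blood pressure values [mm Hg]; 0 and 200 look suspicious
values = np.array([0, 110, 120, 125, 130, 135, 140, 150, 200])

# Five-number summary: minimum, first quartile, median, third quartile, maximum
minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])

# 1.5 * IQR rule: anything outside the fences is flagged as a potential outlier
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(minimum, q1, median, q3, maximum)
print(outliers)
```

A flagged value is not automatically an error. A resting blood pressure of 0, however, is physiologically implausible and most likely a placeholder for a missing measurement.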


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Create a boxplot
plt.figure(figsize=(8, 6))
sns.boxplot(
    data=df,
    x="HeartDisease",
    y="Age",
    palette="husl"  # Color palette
)

# Add title and labels
plt.title("Distribution of Age by Heart Disease Status", fontsize=16)
plt.xlabel("Heart Disease (0 = No, 1 = Yes)", fontsize=14)
plt.ylabel("Age", fontsize=14)

# Show the plot
plt.tight_layout()
plt.show()


Try to find other outliers within the dataset!

In [None]:
# Your code here

# Data preprocessing
Data preprocessing is an integral step in Machine Learning as the quality of data and the useful information that can be derived from it directly affects the ability of our model to learn; therefore, it is extremely important that we preprocess our data before feeding it into our model.

The steps covered here are:

1. Check for Null Values
2. Feature Scaling
3. Handling Categorical Variables

## Null Values
Do we have null values? (More information under: https://pandas.pydata.org/docs/reference/api/pandas.isnull.html)

In [None]:
# Check null values. Your code goes here:
df.isnull().sum()


## Feature Scaling
Feature scaling is a preprocessing technique used in machine learning to standardize the range of values across different features in a dataset. This process ensures that features with varying units or magnitudes contribute equally to a model, preventing one feature from dominating due to its scale. For example, in a dataset with age measured in years and income in thousands, the larger range of income values could disproportionately affect the results unless the data is scaled.

Feature scaling is crucial for models that rely on distance metrics, such as k-Nearest Neighbors or Support Vector Machines, because these methods calculate distances between points, and larger-scaled features can skew these calculations. Additionally, gradient-based models like neural networks benefit from scaled data, as it helps them converge faster during training by normalizing the gradient values.

There are several methods to perform feature scaling. Min-Max scaling, also known as normalization, transforms values to a fixed range, typically between 0 and 1, preserving the distribution of the data. Standardization, on the other hand, centers features around a mean of 0 and scales them to have a standard deviation of 1, which is particularly useful when the data is assumed to follow a normal distribution. For datasets with outliers, robust scaling is an alternative that scales data based on the median and interquartile range, making it less sensitive to extreme values.

Applying feature scaling ensures that all features contribute equally to the learning process and improves the performance and reliability of machine learning models. It is particularly essential when the algorithms depend on feature magnitude or require gradient-based optimization for training.
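
To make the difference between the three methods concrete, the sketch below applies them to a single made-up feature with one extreme value. The data are purely illustrative; only the scikit-learn scaler classes are real.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical single feature with one extreme value (100)
X_demo = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

X_minmax = MinMaxScaler().fit_transform(X_demo)      # (x - min) / (max - min)
X_standard = StandardScaler().fit_transform(X_demo)  # (x - mean) / standard deviation
X_robust = RobustScaler().fit_transform(X_demo)      # (x - median) / interquartile range

print("Min-Max: ", X_minmax.ravel())
print("Standard:", X_standard.ravel())
print("Robust:  ", X_robust.ravel())
```

With Min-Max scaling, the extreme value squeezes the other four points into a narrow band near 0, whereas robust scaling keeps them evenly spread around 0 and leaves only the extreme value far away.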

Normalization of all features: for various machine learning methods it is necessary to have normalized values and clear distinction in types.

In [13]:
# We need to transform text data to numerical values later to be able
# to process them

# textual columns
string_col = df.select_dtypes(include="object").columns
df[string_col]=df[string_col].astype("string")
string_col=df.select_dtypes("string").columns.to_list()

# numerical columns
num_col=df.columns.to_list()
for col in string_col:
    num_col.remove(col)
num_col.remove("HeartDisease")

In [None]:
# We will demonstrate both types of approaches, so let's start with label encoding,
# which will be used with the tree-based algorithms
df_tree = df.apply(LabelEncoder().fit_transform)
df_tree.head()

## Preparation of Categorization / Classification

The code performs one-hot encoding on categorical features to convert them into a numerical format suitable for non-tree-based algorithms like Logistic Regression, Support Vector Machines, or k-Nearest Neighbors. This transformation creates new binary columns for each category in the categorical variables, making the data fully numerical while preserving the information in the original categories.

Additionally, the code ensures that the target variable (HeartDisease) is moved to the end of the DataFrame. This is done for better organization and to separate the target column from the features, which is helpful for data exploration and model training.

**Why Is This Important?**

Non-tree-based algorithms require numerical input to process data effectively, as they cannot inherently handle categorical features. One-hot encoding ensures that categorical information is represented in a machine-readable format without introducing ordinal relationships that do not exist. By organizing the target column at the end, the code enhances clarity and simplifies workflows for modeling and evaluation. This step is critical for maintaining a structured and interpretable dataset when working with machine learning pipelines.
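
The difference between label encoding and one-hot encoding is easiest to see on a tiny example. The sketch below uses a made-up column that borrows the category names of `ChestPainType`; the rows themselves are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column using the category names of ChestPainType
toy = pd.DataFrame({"ChestPainType": ["ATA", "NAP", "ASY", "TA", "ASY"]})

# Label encoding: one integer per category (assigned in sorted order),
# which implicitly suggests an order such as ASY < ATA < NAP < TA
label_encoded = LabelEncoder().fit_transform(toy["ChestPainType"])

# One-hot encoding: one binary column per category, no implied order
one_hot = pd.get_dummies(toy, columns=["ChestPainType"], dtype=int)

print(label_encoded)
print(one_hot)
```

Each row of the one-hot result contains exactly one 1, so no category appears numerically "larger" than another.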

In [None]:
## Creating one hot encoded features for working with non tree based algorithms
df_nontree=pd.get_dummies(df, columns=string_col,drop_first=False)
df_nontree.head()

In [None]:
# Getting the target column at the end
target = "HeartDisease"
y = df_nontree[target].values
df_nontree.drop("HeartDisease",axis=1,inplace=True)
df_nontree=pd.concat([df_nontree,df[target]],axis=1)
df_nontree.head()

# Using ML for classification

## Classification Metrics Table
Each model below prints a classification report for every cross-validation fold.
The metrics, reported for each class (0 and 1) as well as averaged overall, are explained as follows:

**Precision**:	The proportion of correctly predicted positive instances out of all instances predicted as positive.

**Recall**:	The proportion of correctly predicted positive instances out of all actual positive instances.

**F1-Score**:	The harmonic mean of precision and recall, providing a balanced measure of model performance.

**Support**:	The number of true instances for each class in the validation set.

The **ROC-AUC** score (Receiver Operating Characteristic - Area Under the Curve) measures the model's ability to distinguish between classes.
For example, a score of 0.88 indicates that the model has a high ability to differentiate between positive (1) and negative (0) classes, with 1.0 being a perfect score and 0.5 representing random guessing.
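
As a worked illustration of these metrics, the sketch below evaluates six made-up predictions. The labels, probabilities, and the 0.5 threshold are assumptions chosen for the example. Note that precision, recall, and F1 are computed from the hard class predictions, while ROC-AUC is computed from the predicted probabilities.

```python
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

# Hypothetical true labels and predicted probabilities of class 1 for six patients
y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.1, 0.6, 0.8, 0.4, 0.9, 0.2]

# Turn probabilities into hard class predictions with a 0.5 threshold
y_hat = [int(p >= 0.5) for p in y_prob]

precision = precision_score(y_true, y_hat)  # TP / (TP + FP)
recall = recall_score(y_true, y_hat)        # TP / (TP + FN)
f1 = f1_score(y_true, y_hat)                # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_prob)         # ranking quality of the probabilities

print(precision, recall, f1, auc)
```

Here there are 2 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 are all 2/3. The ROC-AUC is 8/9 because 8 of the 9 positive-negative pairs are ranked correctly.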


## Distance based ML
- Logistic regression
- SVM
- k-Nearest Neighbors (KNN)

In [None]:
# Prepare feature columns by excluding the target column
feature_columns = df_nontree.columns.to_list()
feature_columns.remove(target)

### Logistic Regression
Logistic Regression is a statistical method used for binary classification tasks, where the goal is to predict one of two possible outcomes (e.g., yes/no, 0/1, positive/negative) based on input features. Despite its name, Logistic Regression is a classification algorithm, not a regression algorithm. It models the relationship between the input features and the target variable by estimating probabilities using a logistic (sigmoid) function, which maps values to a range between 0 and 1.

Logistic Regression is widely used in applications such as medical diagnosis (e.g., predicting the likelihood of a disease), customer behavior analysis (e.g., churn prediction), and risk assessment (e.g., likelihood of loan default). It provides interpretable results, as the coefficients of the features indicate their contribution to the predicted probability.

One of its key strengths is simplicity and efficiency, making it a good baseline model for classification tasks. However, it assumes a linear relationship between the features and the log-odds of the outcome, which might limit its performance on complex, non-linear datasets without feature transformations or extensions like polynomial terms.
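
The link between the linear model, the log-odds, and the predicted probability can be sketched in plain NumPy. The coefficients and feature values below are made up for illustration and are not fitted to the heart dataset.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients for two scaled features, plus an intercept
weights = np.array([1.5, -2.0])
intercept = 0.25
x_demo = np.array([0.8, 0.3])

# The linear part of the model is the log-odds of the positive class
log_odds = intercept + weights @ x_demo

# The sigmoid turns the log-odds into a probability
probability = sigmoid(log_odds)

# The default decision rule predicts class 1 when the probability is at least 0.5
predicted_class = int(probability >= 0.5)

print(log_odds, probability, predicted_class)
```

A positive coefficient raises the log-odds (and therefore the probability) as its feature increases, while a negative coefficient lowers it. This is what makes the coefficients interpretable.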

In [None]:
# Import necessary libraries
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.preprocessing import MinMaxScaler

# Initialize a list to store accuracy for each fold
accuracy_log = []

# Set up Stratified K-Fold cross-validation with 5 splits
kf = StratifiedKFold(n_splits=5)

# Iterate through each fold
for fold, (train_idx, val_idx) in enumerate(kf.split(X=df_nontree, y=y)):
    # Split the data into training and validation sets
    X_train = df_nontree.loc[train_idx, feature_columns]
    y_train = df_nontree.loc[train_idx, target]
    X_valid = df_nontree.loc[val_idx, feature_columns]
    y_valid = df_nontree.loc[val_idx, target]

    # Apply Min-Max Scaling to the features
    scaler = MinMaxScaler()
    X_train = scaler.fit_transform(X_train)
    X_valid = scaler.transform(X_valid)

    # Initialize and train a logistic regression model
    clf = LogisticRegression()
    clf.fit(X_train, y_train)

    # Predict on the validation set
    y_pred = clf.predict(X_valid)

    # Print classification metrics for the current fold
    print(f"Fold {fold + 1}")
    print(classification_report(y_valid, y_pred))

    # Calculate the ROC-AUC score from the predicted probabilities of class 1
    # (hard 0/1 predictions would discard the ranking information) and log it
    y_proba = clf.predict_proba(X_valid)[:, 1]
    accuracy = roc_auc_score(y_valid, y_proba)
    accuracy_log.append(accuracy)
    print(f"The ROC-AUC score for Fold {fold + 1}: {accuracy}")


### Support Vector Machine

Support Vector Machine (SVM) is a supervised learning algorithm used for classification and regression tasks. It works by finding the optimal hyperplane that separates data points from different classes with the largest possible margin. For non-linearly separable data, SVM can use kernel functions to map the data into higher-dimensional spaces, allowing it to handle more complex relationships.

SVM is commonly used in tasks such as text classification, image recognition, and bioinformatics, as it is effective in high-dimensional spaces and robust to overfitting, especially in cases with a clear margin of separation.
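
To illustrate the effect of kernel functions, the sketch below compares a linear and an RBF kernel on a synthetic two-class dataset that is not linearly separable. It deliberately uses scikit-learn's `make_moons` toy data rather than the heart dataset, and all parameter values are assumptions chosen for the example.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic toy data: two interleaving half circles
X_moons, y_moons = make_moons(n_samples=300, noise=0.2, random_state=0)

# Compare a linear kernel with an RBF kernel using 5-fold cross-validation
scores = {}
for kernel in ("linear", "rbf"):
    # Scaling lives inside the pipeline, so it is fitted on the training folds only
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores[kernel] = cross_val_score(model, X_moons, y_moons, cv=5).mean()

print(scores)
```

On data like this, the RBF kernel typically scores higher because it can model the curved boundary between the two classes.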

Now it’s your turn to write the code for implementing an SVM model! Use your knowledge of scikit-learn and the concepts discussed to create a working example.


In [None]:
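# SVM
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold

# List to store the ROC-AUC score of each fold
acc_svm = []
kf = StratifiedKFold(n_splits=5)
for fold, (trn_, val_) in enumerate(kf.split(X=df_nontree, y=y)):
    # Your code starts here...
    # Hint: You find some help here: https://scikit-learn.org/1.5/modules/svm.html
    pass  # Placeholder so the cell is valid Python; replace it with your implementation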

### k-Nearest Neighbor (KNN)


In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import MinMaxScaler

# List to store accuracy scores for each fold
acc_KNN = []

# Initialize Stratified K-Fold cross-validation
kf = StratifiedKFold(n_splits=5)

# Iterate through each fold
for fold, (trn_, val_) in enumerate(kf.split(X=df_nontree, y=y)):
    # Your code goes here:
    pass  # Placeholder so the loop is valid Python; replace it with your implementation

# Display average accuracy across folds
print(f"Average ROC-AUC across all folds: {sum(acc_KNN) / len(acc_KNN)}")


## Tree based ML
- Decision tree classifier
- Random forest
- XGBoost

### Decision Tree Classifier

In [None]:
# Prepare the feature columns for the tree-based model
feature_col_tree = df_tree.columns.to_list()
feature_col_tree.remove(target)

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import StratifiedKFold

# List to store accuracy scores for each fold
acc_Dtree = []

# Initialize Stratified K-Fold cross-validation
kf = StratifiedKFold(n_splits=5)

# Iterate through each fold
for fold, (trn_, val_) in enumerate(kf.split(X=df_tree, y=y)):
    # Your code goes here:
    pass  # Placeholder so the loop is valid Python; replace it with your implementation

# Display average accuracy across folds
print(f"Average ROC-AUC across all folds: {sum(acc_Dtree) / len(acc_Dtree)}")


In [None]:
# If you have a trained Decision Tree Classifier (clf), let's visualize it

# Import necessary libraries
import graphviz
from sklearn.tree import export_graphviz

# Export the Decision Tree as DOT data
dot_data = export_graphviz(
    clf,                          # Trained Decision Tree Classifier
    out_file=None,                # No need to save the DOT file to disk
    feature_names=feature_col_tree,  # Feature names for interpretability
    class_names=[str(cls) for cls in clf.classes_],  # Class names as strings
    filled=True,                  # Fill the nodes with colors based on class
    rounded=True,                 # Rounded node borders for better aesthetics
    special_characters=True       # Allow special characters in feature names
)

# Render the graph using Graphviz
graph = graphviz.Source(dot_data, format="png")  # Render graph as a PNG
graph.render("decision_tree")  # Save the graph to a file (optional)
graph  # Display the graph in the notebook


### Random Forest

In [None]:
# Import Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Initialize list to store accuracy scores
acc_RandF = []

# Initialize Stratified K-Fold cross-validation
kf = StratifiedKFold(n_splits=5)

# Iterate through each fold
for fold, (trn_, val_) in enumerate(kf.split(X=df_tree, y=y)):
    # Your code goes here:
    pass  # Placeholder so the loop is valid Python; replace it with your implementation

# Display average accuracy across folds
print(f"Average ROC-AUC across all folds: {sum(acc_RandF) / len(acc_RandF)}")


#### Feature importance:
Run the following code and think about what the plot says about the data.

In [None]:
# Visualize Feature Importance from the Random Forest Classifier
import numpy as np
import matplotlib.pyplot as plt

# Ensure clf (Random Forest Classifier) is already trained

plt.figure(figsize=(20, 15))

# Extract feature importance from the trained classifier
importance = clf.feature_importances_

# Sort feature importance in ascending order for visualization
idxs = np.argsort(importance)

# Create a horizontal bar chart for feature importance
plt.title("Feature Importance")
plt.barh(range(len(idxs)), importance[idxs], align="center")

# Add feature names to the y-axis
plt.yticks(range(len(idxs)), [feature_col_tree[i] for i in idxs])

# Label the x-axis
plt.xlabel("Random Forest Feature Importance")

# Display the plot
plt.show()


### XGBoost

In [None]:
# Import necessary libraries
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import StratifiedKFold

# List to store accuracy scores
acc_XGB = []

# Initialize Stratified K-Fold cross-validation
kf = StratifiedKFold(n_splits=5)

# Iterate through each fold
for fold, (trn_, val_) in enumerate(kf.split(X=df_tree, y=y)):
    # Your code goes here:
    pass  # Placeholder so the loop is valid Python; replace it with your implementation

# Display average accuracy across folds
print(f"Average ROC-AUC across all folds: {sum(acc_XGB) / len(acc_XGB)}")


In [None]:
# Import necessary libraries for visualization
from xgboost import plot_tree
import matplotlib.pyplot as plt

# Ensure clf (XGBoost Classifier) is already trained
fig, ax = plt.subplots(figsize=(30, 30))

# Visualize the first tree (num_trees=0) from the trained XGBoost model
plot_tree(clf, num_trees=0, rankdir="LR", ax=ax)

# Display the plot
plt.show()


# Discussion
- Do you think that the results of the different ML methods are reasonable and overall "good"?
- What do you think about the data quality?