# Heart Disease – Random Forest Classification  
**Train/test split + 5-fold cross-validation (grid search on the training set only)**

This notebook:
1. loads `heart.csv`  
2. builds a preprocessing + random forest pipeline  
3. sets aside a holdout test set (20%)  
4. tunes hyperparameters with `GridSearchCV` on the training set (5-fold CV)  
5. evaluates the final model once on the test set.

> Advantage: the test set plays no part in preprocessing or tuning, so no information leaks into model selection and the final test score is a credible estimate of generalization performance.
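The leakage argument depends on the preprocessing living inside the pipeline, so that its statistics are learned from training rows only. The following is a minimal sketch of that idea on synthetic data; `make_classification`, the names `X_demo`, `demo_pipe`, and the reduced forest size are illustrative assumptions, not part of this notebook's data or settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for heart.csv (assumption: any binary classification data works)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Hold out 20% before anything is fitted
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

demo_pipe = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
    ]
)

# Inside cross-validation the whole pipeline is refit on each training fold,
# so the scaler never sees the validation fold or the holdout set
cv_scores = cross_val_score(demo_pipe, X_tr, y_tr, cv=5, scoring="accuracy")

# After a final fit, the scaler's means come from the training rows alone
demo_pipe.fit(X_tr, y_tr)
scaler_mean = demo_pipe.named_steps["scale"].mean_
print("CV scores:", cv_scores)
print("Scaler fitted on train only:", np.allclose(scaler_mean, X_tr.mean(axis=0)))
```

Scaling the full dataset before splitting would instead bake test-set statistics into the features, which is the pattern this notebook avoids.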


In [4]:
# 0) Imports
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.metrics import (
    accuracy_score,
    roc_auc_score,
    classification_report,
    confusion_matrix
)


## 1) Load the dataset

In [6]:
df = pd.read_csv("heart.csv")

print("Dataset shape:", df.shape)
print("Columns:", list(df.columns))
df.head()


Dataset shape: (918, 12)
Columns: ['Age', 'Sex', 'ChestPainType', 'RestingBP', 'Cholesterol', 'FastingBS', 'RestingECG', 'MaxHR', 'ExerciseAngina', 'Oldpeak', 'ST_Slope', 'HeartDisease']


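|   | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |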


## 2) Features (X) and target variable (y)

In [7]:
X = df.drop(columns=["HeartDisease"])
y = df["HeartDisease"]

X.head()


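|   | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up |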


## 3) Separate numeric and categorical features

In [8]:
# Categorical features are stored with dtype object (e.g. "M"/"F", "ASY", ...)
cat_features = X.select_dtypes(include=["object"]).columns.tolist()

# Numeric features are int/float
num_features = X.select_dtypes(include=[np.number]).columns.tolist()

print("Categorical features:", cat_features)
print("Numeric features:", num_features)


Categorical features: ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']
Numeric features: ['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak']
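A dtype-based split sorts columns by how they are stored, not by what they mean. The sketch below shows this on a small hand-made frame (the `toy` frame and its values are illustrative assumptions): a 0/1 flag such as `FastingBS` is stored as an integer and therefore lands in the numeric group. The sketch uses `exclude=[np.number]` for the categorical side so that it does not depend on how a given pandas version labels string columns.

```python
import numpy as np
import pandas as pd

# Hand-made frame with a subset of the notebook's column names (values are illustrative)
toy = pd.DataFrame(
    {
        "Age": [40, 49, 37],
        "Sex": ["M", "F", "M"],
        "FastingBS": [0, 0, 1],
        "Oldpeak": [0.0, 1.0, 0.0],
    }
)

# Everything that is not numeric is treated as categorical
toy_cat = toy.select_dtypes(exclude=[np.number]).columns.tolist()
toy_num = toy.select_dtypes(include=[np.number]).columns.tolist()

# Number of distinct values per numeric column; a count of 2 hints at a binary flag
toy_cardinality = toy[toy_num].nunique()

print("Categorical:", toy_cat)
print("Numeric:", toy_num)
print(toy_cardinality)
```

For a random forest it makes little practical difference whether a binary flag is scaled or one-hot encoded, but for other models the distinction can matter.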


## 4) Define the preprocessing

In [9]:
# - OneHotEncoder: turns categories into numeric dummy columns
# - StandardScaler: scales the numeric features (tree-based models such as RF are
#   insensitive to feature scaling, so this step is optional here; it is kept so
#   the pipeline stays generic)
preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_features)
    ]
)


## 5) Pipeline (preprocessing + random forest)

In [10]:
pipe = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("model", RandomForestClassifier(random_state=42))
    ]
)
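The step names chosen here determine how hyperparameters are addressed later: a parameter of the step named `model` is reached as `model__<parameter>`. A minimal sketch, in which `StandardScaler` stands in for the full `ColumnTransformer` purely to keep the example short (an assumption, not the notebook's setup):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipe = Pipeline(
    steps=[
        ("preprocess", StandardScaler()),  # stand-in for the ColumnTransformer
        ("model", RandomForestClassifier(random_state=42)),
    ]
)

# All tunable parameters of the "model" step, in <step>__<parameter> form
model_params = sorted(
    name for name in demo_pipe.get_params() if name.startswith("model__")
)
print(model_params)

# The same names work with set_params, which is what GridSearchCV uses internally
demo_pipe.set_params(model__n_estimators=300, model__max_depth=5)
print(demo_pipe.named_steps["model"])
```

Renaming a step therefore requires renaming the matching keys in any parameter grid.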


## 6) Train/test split (set aside a holdout test set)

In [11]:
# stratify=y preserves the class ratio in both train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("Train size:", X_train.shape, "Test size:", X_test.shape)


Train size: (734, 11) Test size: (184, 11)
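The effect of `stratify` can be checked by comparing the positive rate in both parts. The sketch below uses a made-up 40/60 label vector (the names `X_toy`, `y_toy` and the class ratio are assumptions) with the same split settings as above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up labels: 40 negatives, 60 positives
y_toy = pd.Series([0] * 40 + [1] * 60)
X_toy = pd.DataFrame({"x": range(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    random_state=42,
    stratify=y_toy,
)

# With stratification both parts keep the overall 60% positive rate
train_rate = y_tr.mean()
test_rate = y_te.mean()
print("Positive rate train:", train_rate, "test:", test_rate)
```

The same check on the real data would be `y_train.mean()` versus `y_test.mean()`.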


## 7) Grid search with 5-fold CV on the training set

In [12]:
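# Note: plain KFold does not stratify by class; StratifiedKFold would keep the
# class ratio in every fold. KFold is kept here so the recorded results stay reproducible.
cv = KFold(n_splits=5, shuffle=True, random_state=42)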

param_grid = {
    "model__n_estimators": [100, 200, 300],
    "model__max_depth": [None, 5, 10, 20],
    "model__min_samples_split": [2, 5, 10],
    "model__min_samples_leaf": [1, 2, 4],
    "model__max_features": ["sqrt", "log2", None]
}

grid = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    cv=cv,
    scoring="accuracy",
    n_jobs=-1,
    verbose=1
)

# Important: fit on the training data only, so the CV folds are drawn from train alone
grid.fit(X_train, y_train)

print("Best hyperparameters:")
print(grid.best_params_)
print("Best CV accuracy (train CV):", grid.best_score_)


Fitting 5 folds for each of 324 candidates, totalling 1620 fits
Best hyperparameters:
{'model__max_depth': 5, 'model__max_features': 'sqrt', 'model__min_samples_leaf': 4, 'model__min_samples_split': 10, 'model__n_estimators': 300}
Best CV accuracy (train CV): 0.8692386543658559
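The candidate and fit counts in the log follow directly from the grid: the number of candidates is the product of the list lengths, and each candidate is fitted once per fold. A short sketch that reproduces the arithmetic with `ParameterGrid` (the grid is copied from the cell above; `n_folds` is a helper name introduced here):

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "model__n_estimators": [100, 200, 300],
    "model__max_depth": [None, 5, 10, 20],
    "model__min_samples_split": [2, 5, 10],
    "model__min_samples_leaf": [1, 2, 4],
    "model__max_features": ["sqrt", "log2", None],
}

n_folds = 5

# 3 * 4 * 3 * 3 * 3 combinations
n_candidates = len(ParameterGrid(param_grid))

# One fit per candidate and fold (the final refit on all training data is extra)
n_fits = n_candidates * n_folds

print("Candidates:", n_candidates)
print("CV fits:", n_fits)
```

Per-candidate results are available afterwards in `grid.cv_results_`, for example under the keys `mean_test_score`, `std_test_score`, and `rank_test_score`, which helps judge whether the best setting is clearly ahead or only marginally better than its neighbours.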


## 8) Final evaluation on the test set

In [13]:
best_model = grid.best_estimator_

y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]

test_acc = accuracy_score(y_test, y_pred)
test_auc = roc_auc_score(y_test, y_proba)

print("Final test set evaluation")
print("Test Accuracy:", test_acc)
print("Test ROC-AUC:", test_auc)

print("\nClassification Report (Test):")
print(classification_report(y_test, y_pred))

print("\nConfusion Matrix (Test):")
print(confusion_matrix(y_test, y_pred))


Final test set evaluation
Test Accuracy: 0.8913043478260869
Test ROC-AUC: 0.930416068866571

Classification Report (Test):
              precision    recall  f1-score   support

           0       0.91      0.84      0.87        82
           1       0.88      0.93      0.90       102

    accuracy                           0.89       184
   macro avg       0.89      0.89      0.89       184
weighted avg       0.89      0.89      0.89       184


Confusion Matrix (Test):
[[69 13]
 [ 7 95]]
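The figures in the classification report can be recomputed from the confusion matrix alone, which is a useful cross-check. The sketch below uses the matrix printed above (rows are true classes, columns are predicted classes, as in scikit-learn). The standard error at the end is an added back-of-the-envelope estimate that assumes a simple binomial model for the test accuracy; it is not something the notebook computes.

```python
import numpy as np

# Confusion matrix from the test set: rows = true class, columns = predicted class
cm = np.array([[69, 13],
               [7, 95]])
tn, fp, fn, tp = cm.ravel()
n_test = cm.sum()

accuracy = (tp + tn) / n_test
precision_pos = tp / (tp + fp)   # precision for class 1
recall_pos = tp / (tp + fn)      # recall (sensitivity) for class 1
specificity = tn / (tn + fp)     # recall for class 0
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# Rough uncertainty of the accuracy (assumption: independent test cases, binomial model)
accuracy_se = np.sqrt(accuracy * (1 - accuracy) / n_test)

print(f"Accuracy:    {accuracy:.4f} (approx. standard error {accuracy_se:.4f})")
print(f"Precision 1: {precision_pos:.2f}")
print(f"Recall 1:    {recall_pos:.2f}")
print(f"Specificity: {specificity:.2f}")
print(f"F1 class 1:  {f1_pos:.2f}")
```

With 184 test cases the standard error is roughly two percentage points, so the gap between the CV accuracy (about 0.869) and the test accuracy (about 0.891) is within the range that sampling variation alone could produce.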
