# Heart Disease – Random Forest Classification  
**Train/test split + RandomizedSearchCV (5-fold CV on the training set only)**

This notebook:
1. loads `heart.csv`  
2. builds a pipeline (preprocessing + RandomForest)  
3. holds out a test set (20%)  
4. tunes hyperparameters with `RandomizedSearchCV` on the training set (5-fold CV)  
5. runs the final evaluation on the test set


In [1]:
# 0) Imports
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, RandomizedSearchCV, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.metrics import (
    accuracy_score,
    roc_auc_score,
    classification_report,
    confusion_matrix
)

# For the hyperparameter distributions
from scipy.stats import randint


## 1) Load the dataset

In [2]:
df = pd.read_csv("heart.csv")
print("Dataset shape:", df.shape)
print("Columns:", list(df.columns))
df.head()


Dataset shape: (918, 12)
Columns: ['Age', 'Sex', 'ChestPainType', 'RestingBP', 'Cholesterol', 'FastingBS', 'RestingECG', 'MaxHR', 'ExerciseAngina', 'Oldpeak', 'ST_Slope', 'HeartDisease']


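|   | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|-----|-----|---------------|-----------|-------------|-----------|------------|-------|----------------|---------|----------|--------------|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |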


## 2) Features (X) and target variable (y)

In [3]:
X = df.drop(columns=["HeartDisease"])
y = df["HeartDisease"]


## 3) Numerical vs. categorical features

In [4]:
cat_features = X.select_dtypes(include=["object"]).columns.tolist()
num_features = X.select_dtypes(include=[np.number]).columns.tolist()

print("Categorical features:", cat_features)
print("Numerical features:", num_features)


Categorical features: ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']
Numerical features: ['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak']


## 4) Preprocessing + pipeline

In [5]:
preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_features)
    ]
)

pipe = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("model", RandomForestClassifier(random_state=42))
    ]
)
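
The `handle_unknown="ignore"` setting above matters once the fitted pipeline meets a category it did not see during training. The following is a minimal sketch of that behavior on made-up toy data (the frames `train_demo` and `new_demo` and the name `toy_preprocess` are illustrative assumptions, not part of this notebook): the unseen category is encoded as an all-zero one-hot vector instead of raising an error. `sparse_threshold=0` is set only so the toy output is a dense array that is easy to print.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical values, not taken from heart.csv)
train_demo = pd.DataFrame({"Age": [40, 50, 60], "ST_Slope": ["Up", "Flat", "Up"]})
new_demo = pd.DataFrame({"Age": [50], "ST_Slope": ["Down"]})  # "Down" was never seen in fit

toy_preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["Age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["ST_Slope"]),
    ],
    sparse_threshold=0,  # always return a dense array (for readability only)
)

toy_preprocess.fit(train_demo)
encoded = toy_preprocess.transform(new_demo)

# Columns: scaled Age, then one-hot columns for the categories seen in fit
# ("Flat", "Up"). The unseen "Down" yields zeros in both one-hot columns.
print(encoded)
```

With the default `handle_unknown="error"`, the same `transform` call would raise instead.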


## 5) Train/test split (reserve a holdout test set)

In [6]:
# stratify=y keeps the class distribution the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("Train size:", X_train.shape, "Test size:", X_test.shape)


Train size: (734, 11) Test size: (184, 11)
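
To see what `stratify=y` does, one can compare the share of positive labels before and after the split. The sketch below uses synthetic labels (`X_demo` and `y_demo` are made-up stand-ins, not the notebook's data) so that it runs on its own; the same comparison could be applied to `y`, `y_train`, and `y_test` from the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 550 positives, 450 negatives (made-up numbers)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 550 + [0] * 450)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    random_state=42,
    stratify=y_demo,  # split each class separately in the same 80/20 ratio
)

# With stratification the positive rate should be (nearly) identical everywhere
print("all:", y_demo.mean(), "train:", y_tr.mean(), "test:", y_te.mean())
```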


## 6) Randomized search with 5-fold CV on the training set only

In [7]:
# Plain KFold with shuffling; StratifiedKFold would additionally keep class ratios equal per fold
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Search space: a mix of distributions and fixed lists
param_dist = {
    "model__n_estimators": randint(100, 600),      # 100–599 trees
    "model__max_depth": [None, 5, 10, 20, 30],
    "model__min_samples_split": randint(2, 15),   # 2–14
    "model__min_samples_leaf": randint(1, 8),     # 1–7
    "model__max_features": ["sqrt", "log2", None]
}

random_search = RandomizedSearchCV(
    estimator=pipe,
    param_distributions=param_dist,
    n_iter=40,                # number of random parameter combinations sampled
    cv=cv,
    scoring="accuracy",
    n_jobs=-1,
    random_state=42,
    verbose=1
)

# Fit on the training data only -> the internal CV never sees the test set
random_search.fit(X_train, y_train)

print("Best hyperparameters:")
print(random_search.best_params_)
print("Best CV accuracy (train CV):", random_search.best_score_)


Fitting 5 folds for each of 40 candidates, totalling 200 fits
Best hyperparameters:
{'model__max_depth': 5, 'model__max_features': 'log2', 'model__min_samples_leaf': 5, 'model__min_samples_split': 3, 'model__n_estimators': 317}
Best CV accuracy (train CV): 0.8692386543658561
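
Besides `best_params_` and `best_score_`, a fitted `RandomizedSearchCV` keeps the scores of every sampled candidate in `cv_results_`, which helps judge how far the winner is ahead of the runners-up. The sketch below shows one way to inspect them. It is self-contained and therefore uses synthetic data from `make_classification` and a deliberately tiny search space; `X_demo`, `y_demo`, and `search` are illustrative names, and the same `pd.DataFrame(...)` call could be applied to `random_search.cv_results_` instead.

```python
import pandas as pd
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic stand-in data so the example runs without heart.csv
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(10, 50),   # upper bound is exclusive: 10-49
        "max_depth": [None, 3, 5],
    },
    n_iter=5,                              # small on purpose, to keep it fast
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
    random_state=0,
)
search.fit(X_demo, y_demo)

# One row per sampled candidate; sort so the best-ranked candidate comes first
results = pd.DataFrame(search.cv_results_)
top = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(top.head())
```

Comparing `std_test_score` with the gap between the top candidates gives a rough sense of whether the differences exceed fold-to-fold noise.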


## 7) Final evaluation on the test set

In [8]:
best_model = random_search.best_estimator_

# Predictions on the test data
y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]

test_acc = accuracy_score(y_test, y_pred)
test_auc = roc_auc_score(y_test, y_proba)

print("Final test set evaluation")
print("Test Accuracy:", test_acc)
print("Test ROC-AUC:", test_auc)

print("\nClassification Report (Test):")
print(classification_report(y_test, y_pred))

print("\nConfusion Matrix (Test):")
print(confusion_matrix(y_test, y_pred))


Final test set evaluation
Test Accuracy: 0.8967391304347826
Test ROC-AUC: 0.9305356288857006

Classification Report (Test):
              precision    recall  f1-score   support

           0       0.92      0.84      0.88        82
           1       0.88      0.94      0.91       102

    accuracy                           0.90       184
   macro avg       0.90      0.89      0.89       184
weighted avg       0.90      0.90      0.90       184


Confusion Matrix (Test):
[[69 13]
 [ 6 96]]
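
The figures in the classification report can be recomputed by hand from the confusion matrix above. The short sketch below does this, assuming scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the test set output above (rows: true, columns: predicted)
cm = np.array([[69, 13],
               [ 6, 96]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # 165 / 184
recall_pos = tp / (tp + fn)          # 96 / 102, recall of class 1 (sensitivity)
recall_neg = tn / (tn + fp)          # 69 / 82,  recall of class 0 (specificity)
precision_pos = tp / (tp + fp)       # 96 / 109, precision of class 1
precision_neg = tn / (tn + fn)       # 69 / 75,  precision of class 0

print(f"accuracy={accuracy:.4f}")
print(f"class 1: precision={precision_pos:.2f} recall={recall_pos:.2f}")
print(f"class 0: precision={precision_neg:.2f} recall={recall_neg:.2f}")
```

Read this way, the model misses 6 of the 102 patients with heart disease (false negatives) and flags 13 of the 82 healthy patients (false positives).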
