# Wine Quality Prediction - Vollst√§ndiges ML-Projekt

## Projekt√ºbersicht
Dieses Notebook analysiert den Wine Quality Datensatz von UCI Machine Learning Repository.
Ziel ist es, die Weinqualit√§t basierend auf physikalisch-chemischen Eigenschaften vorherzusagen.

**Datensatz:** Wine Quality Dataset (Red & White Wine)

**Quelle:** https://archive.ics.uci.edu/dataset/186/wine+quality

**Ansatz:** Wir verwenden beide Datens√§tze (Rot- und Wei√üwein) und kombinieren sie, da:
1. Mehr Daten generell bessere Modelle erm√∂glichen
2. Wir ein allgemeines Qualit√§tsmodell f√ºr Wein entwickeln k√∂nnen
3. Wir einen 'wine_type' Feature hinzuf√ºgen, um zwischen den Weintypen zu unterscheiden

## 1. Import der ben√∂tigten Bibliotheken

In [None]:
# Datenverarbeitung
import pandas as pd
import numpy as np

# Visualisierung
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning - Preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler

# Machine Learning - Regression Modelle
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Machine Learning - Klassifikation Modelle
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Metriken - Regression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Metriken - Klassifikation
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_curve, auc, roc_auc_score
)

# Warnungen unterdr√ºcken
import warnings
warnings.filterwarnings('ignore')

# Plot-Einstellungen
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
%matplotlib inline

print("Alle Bibliotheken erfolgreich importiert!")

## 2. Daten laden und erste Inspektion

In [None]:
# Laden der Datens√§tze
red_wine = pd.read_csv('wine+quality/winequality-red.csv', sep=';')
white_wine = pd.read_csv('wine+quality/winequality-white.csv', sep=';')

# Weintyp als Feature hinzuf√ºgen
red_wine['wine_type'] = 1  # 1 f√ºr Rotwein
white_wine['wine_type'] = 0  # 0 f√ºr Wei√üwein

print(f"Rotwein Datensatz: {red_wine.shape}")
print(f"Wei√üwein Datensatz: {white_wine.shape}")

# Datens√§tze kombinieren
df = pd.concat([red_wine, white_wine], axis=0, ignore_index=True)
print(f"\nKombinierter Datensatz: {df.shape}")
print(f"Anzahl Samples: {len(df)}")
print(f"Anzahl Features: {df.shape[1]}")

In [None]:
# Erste Zeilen anzeigen
print("Erste 10 Zeilen des Datensatzes:")
df.head(10)

In [None]:
# Grundlegende Informationen
print("Datentypen und fehlende Werte:")
df.info()

In [None]:
# Statistische Zusammenfassung
print("Statistische Zusammenfassung aller Features:")
df.describe()

### 2.1 Zielvariable "quality" analysieren

Die Zielvariable **quality** ist eine ordinale Variable mit Werten von 0 bis 10, wobei:
- **0** = sehr schlechte Qualit√§t
- **10** = exzellente Qualit√§t

In der Praxis enth√§lt der Datensatz meist Werte zwischen 3 und 9.

Wir werden zwei Ans√§tze verfolgen:
1. **Regression**: Vorhersage des exakten Qualit√§tswerts (3-9)
2. **Klassifikation**: Bin√§re Klassifikation in "gut" (‚â•6) und "schlecht" (<6)

In [None]:
# Verteilung der Qualit√§t
print("Verteilung der Qualit√§tswerte:")
print(df['quality'].value_counts().sort_index())

print(f"\nMinimale Qualit√§t: {df['quality'].min()}")
print(f"Maximale Qualit√§t: {df['quality'].max()}")
print(f"Durchschnittliche Qualit√§t: {df['quality'].mean():.2f}")
print(f"Median Qualit√§t: {df['quality'].median()}")

In [None]:
# Visualisierung der Qualit√§tsverteilung
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Balkendiagramm
df['quality'].value_counts().sort_index().plot(kind='bar', ax=axes[0], color='steelblue')
axes[0].set_title('Verteilung der Weinqualit√§t', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Qualit√§t')
axes[0].set_ylabel('Anzahl')
axes[0].grid(axis='y', alpha=0.3)

# Vergleich zwischen Rot- und Wei√üwein
quality_by_type = df.groupby(['wine_type', 'quality']).size().unstack(fill_value=0)
quality_by_type.T.plot(kind='bar', ax=axes[1], color=['#8B0000', '#FFD700'])
axes[1].set_title('Qualit√§tsverteilung: Rotwein vs Wei√üwein', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Qualit√§t')
axes[1].set_ylabel('Anzahl')
axes[1].legend(['Wei√üwein', 'Rotwein'])
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

## 3. Datenbereinigung und Qualit√§tspr√ºfung

In [None]:
# Pr√ºfung auf fehlende Werte
print("Fehlende Werte pro Spalte:")
missing_values = df.isnull().sum()
print(missing_values)
print(f"\nGesamtzahl fehlender Werte: {missing_values.sum()}")

In [None]:
# Pr√ºfung auf Duplikate
duplicates = df.duplicated().sum()
print(f"Anzahl duplizierter Zeilen: {duplicates}")

if duplicates > 0:
    print(f"\nEntferne {duplicates} duplizierte Zeilen...")
    df = df.drop_duplicates()
    print(f"Neue Datensatzgr√∂√üe: {df.shape}")

In [None]:
# Pr√ºfung auf negative oder ung√ºltige Werte
print("Pr√ºfung auf negative Werte (sollten bei physikalischen Messungen nicht vorkommen):")
for col in df.columns:
    if col not in ['quality', 'wine_type']:
        negative_count = (df[col] < 0).sum()
        if negative_count > 0:
            print(f"{col}: {negative_count} negative Werte gefunden!")

print("\nKeine negativen Werte gefunden - Daten sind valide!")

### 3.1 Ausrei√üer-Erkennung mit IQR-Methode

In [None]:
# Funktion zur Ausrei√üer-Erkennung mit IQR
def detect_outliers_iqr(data, column):
    """
    Erkennt Ausrei√üer mit der IQR-Methode (Interquartile Range).
    Ausrei√üer sind Werte, die au√üerhalb von [Q1 - 1.5*IQR, Q3 + 1.5*IQR] liegen.
    """
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    
    return outliers, lower_bound, upper_bound

# Ausrei√üer f√ºr alle numerischen Features analysieren
feature_cols = [col for col in df.columns if col not in ['quality', 'wine_type']]

outlier_summary = {}
for col in feature_cols:
    outliers, lower, upper = detect_outliers_iqr(df, col)
    outlier_summary[col] = {
        'count': len(outliers),
        'percentage': (len(outliers) / len(df)) * 100,
        'lower_bound': lower,
        'upper_bound': upper
    }

# Ergebnisse anzeigen
outlier_df = pd.DataFrame(outlier_summary).T
print("Ausrei√üer-Analyse (IQR-Methode):")
print(outlier_df.sort_values('percentage', ascending=False))

In [None]:
# Entscheidung: Wir behalten die Ausrei√üer, da sie chemische Extremwerte sein k√∂nnten,
# die f√ºr die Qualit√§t relevant sind. Bei Bedarf k√∂nnten wir sie entfernen.
# Zur Demonstration zeigen wir, wie man sie entfernen w√ºrde:

df_no_outliers = df.copy()
original_size = len(df_no_outliers)

# Optional: Extreme Ausrei√üer entfernen (nur sehr extreme Werte)
for col in feature_cols:
    outliers, lower, upper = detect_outliers_iqr(df_no_outliers, col)
    # Nur sehr extreme Ausrei√üer entfernen (3*IQR statt 1.5*IQR)
    Q1 = df_no_outliers[col].quantile(0.25)
    Q3 = df_no_outliers[col].quantile(0.75)
    IQR = Q3 - Q1
    extreme_lower = Q1 - 3 * IQR
    extreme_upper = Q3 + 3 * IQR
    df_no_outliers = df_no_outliers[
        (df_no_outliers[col] >= extreme_lower) & 
        (df_no_outliers[col] <= extreme_upper)
    ]

removed = original_size - len(df_no_outliers)
print(f"\nMit extremer Ausrei√üer-Entfernung (3*IQR):")
print(f"Entfernte Zeilen: {removed} ({(removed/original_size)*100:.2f}%)")
print(f"Verbleibende Zeilen: {len(df_no_outliers)}")

# Wir verwenden den Originaldatensatz f√ºr das Training
print("\n‚Üí F√ºr das Training verwenden wir den vollst√§ndigen Datensatz.")

### 3.2 Korrelationsanalyse

In [None]:
# Korrelationsmatrix berechnen
correlation_matrix = df.corr()

# Korrelation mit der Zielvariable "quality"
quality_correlation = correlation_matrix['quality'].sort_values(ascending=False)
print("Korrelation der Features mit der Weinqualit√§t:")
print(quality_correlation)

In [None]:
# Stark korrelierte Feature-Paare identifizieren (Multikollinearit√§t)
print("\nStark korrelierte Feature-Paare (|r| > 0.7):")
high_corr_pairs = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        if abs(correlation_matrix.iloc[i, j]) > 0.7:
            high_corr_pairs.append((
                correlation_matrix.columns[i],
                correlation_matrix.columns[j],
                correlation_matrix.iloc[i, j]
            ))

for pair in high_corr_pairs:
    print(f"{pair[0]} <-> {pair[1]}: {pair[2]:.3f}")

## 4. Umfassende Visualisierungen

### 4.1 Histogramme aller Features

In [None]:
# Histogramme f√ºr alle Features
fig, axes = plt.subplots(4, 3, figsize=(18, 16))
axes = axes.ravel()

for idx, col in enumerate(df.columns):
    if idx < 12:
        axes[idx].hist(df[col], bins=30, color='steelblue', edgecolor='black', alpha=0.7)
        axes[idx].set_title(f'Verteilung: {col}', fontweight='bold')
        axes[idx].set_xlabel(col)
        axes[idx].set_ylabel('H√§ufigkeit')
        axes[idx].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

### 4.2 Boxplots zur Ausrei√üer-Analyse

In [None]:
# Boxplots f√ºr alle numerischen Features
fig, axes = plt.subplots(4, 3, figsize=(18, 16))
axes = axes.ravel()

for idx, col in enumerate(feature_cols):
    if idx < 11:
        axes[idx].boxplot(df[col], vert=True)
        axes[idx].set_title(f'Boxplot: {col}', fontweight='bold')
        axes[idx].set_ylabel(col)
        axes[idx].grid(axis='y', alpha=0.3)

# Qualit√§t Boxplot
axes[11].boxplot(df['quality'], vert=True)
axes[11].set_title('Boxplot: quality', fontweight='bold')
axes[11].set_ylabel('quality')
axes[11].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

### 4.3 Heatmap der Korrelationen

In [None]:
# Korrelations-Heatmap
plt.figure(figsize=(14, 10))
sns.heatmap(
    correlation_matrix,
    annot=True,
    fmt='.2f',
    cmap='coolwarm',
    center=0,
    square=True,
    linewidths=1,
    cbar_kws={'shrink': 0.8}
)
plt.title('Korrelationsmatrix aller Features', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

### 4.4 Scatterplots: Qualit√§t vs. wichtige Features

In [None]:
# Top 6 Features mit h√∂chster Korrelation zur Qualit√§t (au√üer quality selbst)
top_features = quality_correlation.drop('quality').abs().nlargest(6).index

fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.ravel()

for idx, feature in enumerate(top_features):
    axes[idx].scatter(df[feature], df['quality'], alpha=0.3, c=df['wine_type'], cmap='RdYlBu')
    axes[idx].set_xlabel(feature, fontweight='bold')
    axes[idx].set_ylabel('Quality', fontweight='bold')
    axes[idx].set_title(f'Quality vs {feature}\n(r = {quality_correlation[feature]:.3f})', fontweight='bold')
    axes[idx].grid(alpha=0.3)

plt.tight_layout()
plt.show()

### 4.5 Pairplot der wichtigsten Features

In [None]:
# Pairplot f√ºr die 5 wichtigsten Features plus quality
top_5_features = quality_correlation.drop('quality').abs().nlargest(5).index.tolist()
pairplot_features = top_5_features + ['quality']

print(f"Erstelle Pairplot f√ºr: {pairplot_features}")

# Sample f√ºr schnellere Visualisierung (optional)
df_sample = df[pairplot_features].sample(n=min(1000, len(df)), random_state=42)

sns.pairplot(
    df_sample,
    diag_kind='hist',
    plot_kws={'alpha': 0.6},
    height=2.5
)
plt.suptitle('Pairplot der wichtigsten Features', y=1.01, fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

## 5. Feature Engineering

In [None]:
# Feature Engineering: Neue Features erstellen
df_engineered = df.copy()

# 1. Verh√§ltnis von freiem zu totalem Schwefel
df_engineered['free_to_total_sulfur_ratio'] = (
    df_engineered['free sulfur dioxide'] / 
    (df_engineered['total sulfur dioxide'] + 1e-10)  # Vermeidung Division durch 0
)

# 2. S√§ure-Verh√§ltnis (Weins√§ure zu Zitronens√§ure)
df_engineered['acid_ratio'] = (
    df_engineered['fixed acidity'] / 
    (df_engineered['volatile acidity'] + 1e-10)
)

# 3. Gesamts√§ure
df_engineered['total_acidity'] = (
    df_engineered['fixed acidity'] + 
    df_engineered['volatile acidity'] + 
    df_engineered['citric acid']
)

# 4. Alkohol pro S√§ure (Balance)
df_engineered['alcohol_per_acid'] = (
    df_engineered['alcohol'] / 
    (df_engineered['total_acidity'] + 1e-10)
)

# 5. Bin√§re Qualit√§tsklasse f√ºr Klassifikation
df_engineered['quality_class'] = (df_engineered['quality'] >= 6).astype(int)
# 0 = schlecht (<6), 1 = gut (‚â•6)

print("Neue Features erstellt:")
print("- free_to_total_sulfur_ratio")
print("- acid_ratio")
print("- total_acidity")
print("- alcohol_per_acid")
print("- quality_class (bin√§r: 0=schlecht, 1=gut)")
print(f"\nNeue Datensatzgr√∂√üe: {df_engineered.shape}")

In [None]:
# Verteilung der bin√§ren Qualit√§tsklassen
print("Verteilung der Qualit√§tsklassen:")
print(df_engineered['quality_class'].value_counts())
print(f"\nProzentuale Verteilung:")
print(df_engineered['quality_class'].value_counts(normalize=True) * 100)

# Visualisierung
fig, ax = plt.subplots(1, 2, figsize=(12, 4))

df_engineered['quality_class'].value_counts().plot(kind='bar', ax=ax[0], color=['#e74c3c', '#2ecc71'])
ax[0].set_title('Verteilung: Qualit√§tsklassen', fontweight='bold')
ax[0].set_xlabel('Klasse (0=schlecht, 1=gut)')
ax[0].set_ylabel('Anzahl')
ax[0].set_xticklabels(['Schlecht (<6)', 'Gut (‚â•6)'], rotation=0)

ax[1].pie(
    df_engineered['quality_class'].value_counts(),
    labels=['Schlecht (<6)', 'Gut (‚â•6)'],
    autopct='%1.1f%%',
    colors=['#e74c3c', '#2ecc71'],
    startangle=90
)
ax[1].set_title('Prozentuale Verteilung', fontweight='bold')

plt.tight_layout()
plt.show()

## 6. Daten vorbereiten: Train-Test-Split

In [None]:
# Features f√ºr das Training
feature_columns = [col for col in df_engineered.columns 
                   if col not in ['quality', 'quality_class']]

print(f"Features f√ºr Training: {len(feature_columns)}")
print(feature_columns)

# Feature Matrix X
X = df_engineered[feature_columns]

# Target f√ºr Regression
y_regression = df_engineered['quality']

# Target f√ºr Klassifikation
y_classification = df_engineered['quality_class']

print(f"\nX shape: {X.shape}")
print(f"y_regression shape: {y_regression.shape}")
print(f"y_classification shape: {y_classification.shape}")

In [None]:
# Train-Test Split f√ºr Regression (80/20, mit Stratifikation basierend auf quality)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X, y_regression,
    test_size=0.2,
    random_state=42,
    stratify=y_regression
)

# Train-Test Split f√ºr Klassifikation (80/20, mit Stratifikation basierend auf quality_class)
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(
    X, y_classification,
    test_size=0.2,
    random_state=42,
    stratify=y_classification
)

print("Train-Test Split erfolgreich:")
print(f"\nRegression:")
print(f"  Training Set: {X_train_reg.shape}")
print(f"  Test Set: {X_test_reg.shape}")
print(f"\nKlassifikation:")
print(f"  Training Set: {X_train_clf.shape}")
print(f"  Test Set: {X_test_clf.shape}")

### 6.1 Feature Standardisierung

In [None]:
# StandardScaler f√ºr Regression
scaler_reg = StandardScaler()
X_train_reg_scaled = scaler_reg.fit_transform(X_train_reg)
X_test_reg_scaled = scaler_reg.transform(X_test_reg)

# StandardScaler f√ºr Klassifikation
scaler_clf = StandardScaler()
X_train_clf_scaled = scaler_clf.fit_transform(X_train_clf)
X_test_clf_scaled = scaler_clf.transform(X_test_clf)

print("Features erfolgreich standardisiert (Mean=0, Std=1)")
print(f"\nBeispiel - Mittelwerte nach Standardisierung:")
print(np.mean(X_train_reg_scaled, axis=0)[:5])
print(f"\nBeispiel - Standardabweichungen nach Standardisierung:")
print(np.std(X_train_reg_scaled, axis=0)[:5])

## 7. Regression: Modelltraining und Evaluation

### 7.1 Baseline Modelle trainieren

In [None]:
# Dictionary f√ºr Regression Modelle
regression_models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(random_state=42),
    'Random Forest Regressor': RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1),
    'Gradient Boosting Regressor': GradientBoostingRegressor(n_estimators=100, random_state=42),
    'K-Nearest Neighbors': KNeighborsRegressor(n_neighbors=5),
    'Support Vector Regressor': SVR(kernel='rbf')
}

# Ergebnisse speichern
regression_results = {}

print("Trainiere Regressionsmodelle...\n")
print("="*80)

for name, model in regression_models.items():
    print(f"\nTrainiere {name}...")
    
    # Training
    model.fit(X_train_reg_scaled, y_train_reg)
    
    # Vorhersagen
    y_pred_train = model.predict(X_train_reg_scaled)
    y_pred_test = model.predict(X_test_reg_scaled)
    
    # Metriken berechnen
    train_rmse = np.sqrt(mean_squared_error(y_train_reg, y_pred_train))
    test_rmse = np.sqrt(mean_squared_error(y_test_reg, y_pred_test))
    train_mae = mean_absolute_error(y_train_reg, y_pred_train)
    test_mae = mean_absolute_error(y_test_reg, y_pred_test)
    train_r2 = r2_score(y_train_reg, y_pred_train)
    test_r2 = r2_score(y_test_reg, y_pred_test)
    
    # Ergebnisse speichern
    regression_results[name] = {
        'model': model,
        'train_rmse': train_rmse,
        'test_rmse': test_rmse,
        'train_mae': train_mae,
        'test_mae': test_mae,
        'train_r2': train_r2,
        'test_r2': test_r2,
        'y_pred': y_pred_test
    }
    
    print(f"  Train RMSE: {train_rmse:.4f} | Test RMSE: {test_rmse:.4f}")
    print(f"  Train MAE:  {train_mae:.4f} | Test MAE:  {test_mae:.4f}")
    print(f"  Train R¬≤:   {train_r2:.4f} | Test R¬≤:   {test_r2:.4f}")

print("\n" + "="*80)
print("Alle Regressionsmodelle trainiert!")

In [None]:
# Vergleichstabelle erstellen
regression_comparison = pd.DataFrame({
    'Model': list(regression_results.keys()),
    'Test RMSE': [results['test_rmse'] for results in regression_results.values()],
    'Test MAE': [results['test_mae'] for results in regression_results.values()],
    'Test R¬≤': [results['test_r2'] for results in regression_results.values()],
    'Train RMSE': [results['train_rmse'] for results in regression_results.values()],
    'Train R¬≤': [results['train_r2'] for results in regression_results.values()]
})

# Sortieren nach Test R¬≤
regression_comparison = regression_comparison.sort_values('Test R¬≤', ascending=False)

print("\nVergleich aller Regressionsmodelle:")
print(regression_comparison.to_string(index=False))

In [None]:
# Visualisierung der Modell-Performance
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# RMSE Vergleich
x_pos = np.arange(len(regression_comparison))
axes[0].bar(x_pos, regression_comparison['Test RMSE'], color='steelblue', alpha=0.7)
axes[0].set_xticks(x_pos)
axes[0].set_xticklabels(regression_comparison['Model'], rotation=45, ha='right')
axes[0].set_title('Test RMSE Vergleich', fontweight='bold')
axes[0].set_ylabel('RMSE')
axes[0].grid(axis='y', alpha=0.3)

# MAE Vergleich
axes[1].bar(x_pos, regression_comparison['Test MAE'], color='coral', alpha=0.7)
axes[1].set_xticks(x_pos)
axes[1].set_xticklabels(regression_comparison['Model'], rotation=45, ha='right')
axes[1].set_title('Test MAE Vergleich', fontweight='bold')
axes[1].set_ylabel('MAE')
axes[1].grid(axis='y', alpha=0.3)

# R¬≤ Vergleich
axes[2].bar(x_pos, regression_comparison['Test R¬≤'], color='seagreen', alpha=0.7)
axes[2].set_xticks(x_pos)
axes[2].set_xticklabels(regression_comparison['Model'], rotation=45, ha='right')
axes[2].set_title('Test R¬≤ Vergleich', fontweight='bold')
axes[2].set_ylabel('R¬≤ Score')
axes[2].grid(axis='y', alpha=0.3)
axes[2].axhline(y=0, color='red', linestyle='--', alpha=0.5)

plt.tight_layout()
plt.show()

### 7.2 Predicted vs Actual Plots

In [None]:
# Predicted vs Actual f√ºr die besten 3 Modelle
top_3_models = regression_comparison.head(3)['Model'].tolist()

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, model_name in enumerate(top_3_models):
    y_pred = regression_results[model_name]['y_pred']
    
    axes[idx].scatter(y_test_reg, y_pred, alpha=0.5, s=20)
    axes[idx].plot([y_test_reg.min(), y_test_reg.max()], 
                   [y_test_reg.min(), y_test_reg.max()], 
                   'r--', lw=2, label='Perfect Prediction')
    axes[idx].set_xlabel('Tats√§chliche Qualit√§t', fontweight='bold')
    axes[idx].set_ylabel('Vorhergesagte Qualit√§t', fontweight='bold')
    axes[idx].set_title(f'{model_name}\nR¬≤ = {regression_results[model_name]["test_r2"]:.4f}', 
                        fontweight='bold')
    axes[idx].legend()
    axes[idx].grid(alpha=0.3)

plt.tight_layout()
plt.show()

## 8. Klassifikation: Modelltraining und Evaluation

### 8.1 Baseline Modelle trainieren

In [None]:
# Dictionary f√ºr Klassifikationsmodelle
classification_models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest Classifier': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    'Gradient Boosting Classifier': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
    'Support Vector Classifier': SVC(kernel='rbf', probability=True, random_state=42)
}

# Ergebnisse speichern
classification_results = {}

print("Trainiere Klassifikationsmodelle...\n")
print("="*80)

for name, model in classification_models.items():
    print(f"\nTrainiere {name}...")
    
    # Training
    model.fit(X_train_clf_scaled, y_train_clf)
    
    # Vorhersagen
    y_pred_train = model.predict(X_train_clf_scaled)
    y_pred_test = model.predict(X_test_clf_scaled)
    y_pred_proba = model.predict_proba(X_test_clf_scaled)[:, 1]
    
    # Metriken berechnen
    train_acc = accuracy_score(y_train_clf, y_pred_train)
    test_acc = accuracy_score(y_test_clf, y_pred_test)
    test_precision = precision_score(y_test_clf, y_pred_test)
    test_recall = recall_score(y_test_clf, y_pred_test)
    test_f1 = f1_score(y_test_clf, y_pred_test)
    test_auc = roc_auc_score(y_test_clf, y_pred_proba)
    
    # Confusion Matrix
    cm = confusion_matrix(y_test_clf, y_pred_test)
    
    # Ergebnisse speichern
    classification_results[name] = {
        'model': model,
        'train_acc': train_acc,
        'test_acc': test_acc,
        'precision': test_precision,
        'recall': test_recall,
        'f1': test_f1,
        'auc': test_auc,
        'confusion_matrix': cm,
        'y_pred': y_pred_test,
        'y_pred_proba': y_pred_proba
    }
    
    print(f"  Train Accuracy: {train_acc:.4f} | Test Accuracy: {test_acc:.4f}")
    print(f"  Precision: {test_precision:.4f} | Recall: {test_recall:.4f}")
    print(f"  F1-Score: {test_f1:.4f} | AUC: {test_auc:.4f}")

print("\n" + "="*80)
print("Alle Klassifikationsmodelle trainiert!")

In [None]:
# Vergleichstabelle erstellen
classification_comparison = pd.DataFrame({
    'Model': list(classification_results.keys()),
    'Accuracy': [results['test_acc'] for results in classification_results.values()],
    'Precision': [results['precision'] for results in classification_results.values()],
    'Recall': [results['recall'] for results in classification_results.values()],
    'F1-Score': [results['f1'] for results in classification_results.values()],
    'AUC': [results['auc'] for results in classification_results.values()]
})

# Sortieren nach F1-Score
classification_comparison = classification_comparison.sort_values('F1-Score', ascending=False)

print("\nVergleich aller Klassifikationsmodelle:")
print(classification_comparison.to_string(index=False))

In [None]:
# Visualisierung der Modell-Performance
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Metriken Vergleich
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'AUC']
x = np.arange(len(classification_comparison))
width = 0.15

for i, metric in enumerate(metrics):
    axes[0].bar(x + i*width, classification_comparison[metric], width, label=metric, alpha=0.8)

axes[0].set_xlabel('Modelle', fontweight='bold')
axes[0].set_ylabel('Score', fontweight='bold')
axes[0].set_title('Klassifikationsmetriken Vergleich', fontweight='bold')
axes[0].set_xticks(x + width * 2)
axes[0].set_xticklabels(classification_comparison['Model'], rotation=45, ha='right')
axes[0].legend()
axes[0].grid(axis='y', alpha=0.3)

# F1-Score Ranking
axes[1].barh(classification_comparison['Model'], classification_comparison['F1-Score'], 
             color='steelblue', alpha=0.7)
axes[1].set_xlabel('F1-Score', fontweight='bold')
axes[1].set_title('F1-Score Ranking', fontweight='bold')
axes[1].grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

### 8.2 Confusion Matrices

In [None]:
# Confusion Matrices f√ºr alle Modelle
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.ravel()

for idx, (name, results) in enumerate(classification_results.items()):
    cm = results['confusion_matrix']
    
    sns.heatmap(
        cm,
        annot=True,
        fmt='d',
        cmap='Blues',
        ax=axes[idx],
        cbar=False
    )
    axes[idx].set_title(f'{name}\nF1={results["f1"]:.4f}', fontweight='bold')
    axes[idx].set_xlabel('Predicted')
    axes[idx].set_ylabel('Actual')
    axes[idx].set_xticklabels(['Schlecht', 'Gut'])
    axes[idx].set_yticklabels(['Schlecht', 'Gut'])

# Leeres Subplot entfernen
fig.delaxes(axes[5])

plt.tight_layout()
plt.show()

### 8.3 ROC Kurven

In [None]:
# ROC Kurven f√ºr alle Modelle
plt.figure(figsize=(12, 8))

for name, results in classification_results.items():
    fpr, tpr, _ = roc_curve(y_test_clf, results['y_pred_proba'])
    auc_score = results['auc']
    
    plt.plot(fpr, tpr, label=f'{name} (AUC = {auc_score:.4f})', linewidth=2)

plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier', linewidth=2)
plt.xlabel('False Positive Rate', fontweight='bold', fontsize=12)
plt.ylabel('True Positive Rate', fontweight='bold', fontsize=12)
plt.title('ROC Curves - Klassifikationsmodelle', fontweight='bold', fontsize=14)
plt.legend(loc='lower right')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

### 8.4 Classification Reports

In [None]:
# Detaillierte Classification Reports
print("Classification Reports f√ºr alle Modelle:\n")
print("="*80)

for name, results in classification_results.items():
    print(f"\n{name}:")
    print("-" * 60)
    print(classification_report(
        y_test_clf, 
        results['y_pred'],
        target_names=['Schlecht (<6)', 'Gut (‚â•6)']
    ))

## 9. Hyperparameter Tuning

### 9.1 Random Forest Regressor Tuning

In [None]:
# Parameter Grid f√ºr Random Forest Regressor
rf_reg_param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']
}

print("Hyperparameter Tuning f√ºr Random Forest Regressor...")
print(f"Anzahl Kombinationen: {np.prod([len(v) for v in rf_reg_param_grid.values()])}")

# RandomizedSearchCV (schneller als GridSearchCV)
rf_reg_random = RandomizedSearchCV(
    RandomForestRegressor(random_state=42, n_jobs=-1),
    rf_reg_param_grid,
    n_iter=20,
    cv=5,
    scoring='r2',
    random_state=42,
    n_jobs=-1,
    verbose=1
)

rf_reg_random.fit(X_train_reg_scaled, y_train_reg)

print(f"\nBeste Parameter: {rf_reg_random.best_params_}")
print(f"Bester R¬≤ Score (CV): {rf_reg_random.best_score_:.4f}")

In [None]:
# Beste Modell evaluieren
best_rf_reg = rf_reg_random.best_estimator_
y_pred_test_tuned = best_rf_reg.predict(X_test_reg_scaled)

tuned_rmse = np.sqrt(mean_squared_error(y_test_reg, y_pred_test_tuned))
tuned_mae = mean_absolute_error(y_test_reg, y_pred_test_tuned)
tuned_r2 = r2_score(y_test_reg, y_pred_test_tuned)

print("\nTuned Random Forest Regressor Performance:")
print(f"  Test RMSE: {tuned_rmse:.4f}")
print(f"  Test MAE:  {tuned_mae:.4f}")
print(f"  Test R¬≤:   {tuned_r2:.4f}")

# Vergleich mit Baseline
baseline_r2 = regression_results['Random Forest Regressor']['test_r2']
improvement = ((tuned_r2 - baseline_r2) / baseline_r2) * 100
print(f"\nVerbesserung gegen√ºber Baseline: {improvement:.2f}%")

### 9.2 Gradient Boosting Regressor Tuning

In [None]:
# Parameter Grid f√ºr Gradient Boosting Regressor
gb_reg_param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'subsample': [0.8, 0.9, 1.0]
}

print("Hyperparameter Tuning f√ºr Gradient Boosting Regressor...")
print(f"Anzahl Kombinationen: {np.prod([len(v) for v in gb_reg_param_grid.values()])}")

# RandomizedSearchCV
gb_reg_random = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    gb_reg_param_grid,
    n_iter=20,
    cv=5,
    scoring='r2',
    random_state=42,
    n_jobs=-1,
    verbose=1
)

gb_reg_random.fit(X_train_reg_scaled, y_train_reg)

print(f"\nBeste Parameter: {gb_reg_random.best_params_}")
print(f"Bester R¬≤ Score (CV): {gb_reg_random.best_score_:.4f}")

In [None]:
# Beste Modell evaluieren
best_gb_reg = gb_reg_random.best_estimator_
y_pred_test_gb_tuned = best_gb_reg.predict(X_test_reg_scaled)

tuned_gb_rmse = np.sqrt(mean_squared_error(y_test_reg, y_pred_test_gb_tuned))
tuned_gb_mae = mean_absolute_error(y_test_reg, y_pred_test_gb_tuned)
tuned_gb_r2 = r2_score(y_test_reg, y_pred_test_gb_tuned)

print("\nTuned Gradient Boosting Regressor Performance:")
print(f"  Test RMSE: {tuned_gb_rmse:.4f}")
print(f"  Test MAE:  {tuned_gb_mae:.4f}")
print(f"  Test R¬≤:   {tuned_gb_r2:.4f}")

# Vergleich mit Baseline
baseline_gb_r2 = regression_results['Gradient Boosting Regressor']['test_r2']
improvement_gb = ((tuned_gb_r2 - baseline_gb_r2) / baseline_gb_r2) * 100
print(f"\nVerbesserung gegen√ºber Baseline: {improvement_gb:.2f}%")

### 9.3 Random Forest Classifier Tuning

In [None]:
# Parameter Grid f√ºr Random Forest Classifier
rf_clf_param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2'],
    'class_weight': ['balanced', None]
}

print("Hyperparameter Tuning f√ºr Random Forest Classifier...")
print(f"Anzahl Kombinationen: {np.prod([len(v) for v in rf_clf_param_grid.values()])}")

# RandomizedSearchCV
rf_clf_random = RandomizedSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    rf_clf_param_grid,
    n_iter=20,
    cv=5,
    scoring='f1',
    random_state=42,
    n_jobs=-1,
    verbose=1
)

rf_clf_random.fit(X_train_clf_scaled, y_train_clf)

print(f"\nBeste Parameter: {rf_clf_random.best_params_}")
print(f"Bester F1 Score (CV): {rf_clf_random.best_score_:.4f}")

In [None]:
# Beste Modell evaluieren
best_rf_clf = rf_clf_random.best_estimator_
y_pred_test_clf_tuned = best_rf_clf.predict(X_test_clf_scaled)
y_pred_proba_clf_tuned = best_rf_clf.predict_proba(X_test_clf_scaled)[:, 1]

tuned_clf_acc = accuracy_score(y_test_clf, y_pred_test_clf_tuned)
tuned_clf_precision = precision_score(y_test_clf, y_pred_test_clf_tuned)
tuned_clf_recall = recall_score(y_test_clf, y_pred_test_clf_tuned)
tuned_clf_f1 = f1_score(y_test_clf, y_pred_test_clf_tuned)
tuned_clf_auc = roc_auc_score(y_test_clf, y_pred_proba_clf_tuned)

print("\nTuned Random Forest Classifier Performance:")
print(f"  Accuracy:  {tuned_clf_acc:.4f}")
print(f"  Precision: {tuned_clf_precision:.4f}")
print(f"  Recall:    {tuned_clf_recall:.4f}")
print(f"  F1-Score:  {tuned_clf_f1:.4f}")
print(f"  AUC:       {tuned_clf_auc:.4f}")

# Vergleich mit Baseline
baseline_clf_f1 = classification_results['Random Forest Classifier']['f1']
improvement_clf = ((tuned_clf_f1 - baseline_clf_f1) / baseline_clf_f1) * 100
print(f"\nVerbesserung gegen√ºber Baseline: {improvement_clf:.2f}%")

### 9.4 Gradient Boosting Classifier Tuning

In [None]:
# Parameter Grid f√ºr Gradient Boosting Classifier
gb_clf_param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'subsample': [0.8, 0.9, 1.0]
}

print("Hyperparameter Tuning f√ºr Gradient Boosting Classifier...")
print(f"Anzahl Kombinationen: {np.prod([len(v) for v in gb_clf_param_grid.values()])}")

# RandomizedSearchCV
gb_clf_random = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    gb_clf_param_grid,
    n_iter=20,
    cv=5,
    scoring='f1',
    random_state=42,
    n_jobs=-1,
    verbose=1
)

gb_clf_random.fit(X_train_clf_scaled, y_train_clf)

print(f"\nBeste Parameter: {gb_clf_random.best_params_}")
print(f"Bester F1 Score (CV): {gb_clf_random.best_score_:.4f}")

In [None]:
# Beste Modell evaluieren
best_gb_clf = gb_clf_random.best_estimator_
y_pred_test_gb_clf_tuned = best_gb_clf.predict(X_test_clf_scaled)
y_pred_proba_gb_clf_tuned = best_gb_clf.predict_proba(X_test_clf_scaled)[:, 1]

tuned_gb_clf_acc = accuracy_score(y_test_clf, y_pred_test_gb_clf_tuned)
tuned_gb_clf_precision = precision_score(y_test_clf, y_pred_test_gb_clf_tuned)
tuned_gb_clf_recall = recall_score(y_test_clf, y_pred_test_gb_clf_tuned)
tuned_gb_clf_f1 = f1_score(y_test_clf, y_pred_test_gb_clf_tuned)
tuned_gb_clf_auc = roc_auc_score(y_test_clf, y_pred_proba_gb_clf_tuned)

print("\nTuned Gradient Boosting Classifier Performance:")
print(f"  Accuracy:  {tuned_gb_clf_acc:.4f}")
print(f"  Precision: {tuned_gb_clf_precision:.4f}")
print(f"  Recall:    {tuned_gb_clf_recall:.4f}")
print(f"  F1-Score:  {tuned_gb_clf_f1:.4f}")
print(f"  AUC:       {tuned_gb_clf_auc:.4f}")

# Vergleich mit Baseline
baseline_gb_clf_f1 = classification_results['Gradient Boosting Classifier']['f1']
improvement_gb_clf = ((tuned_gb_clf_f1 - baseline_gb_clf_f1) / baseline_gb_clf_f1) * 100
print(f"\nVerbesserung gegen√ºber Baseline: {improvement_gb_clf:.2f}%")

## 10. Feature Importance Analyse

In [None]:
# Feature Importance f√ºr Tuned Random Forest Regressor
feature_importance_reg = pd.DataFrame({
    'feature': feature_columns,
    'importance': best_rf_reg.feature_importances_
}).sort_values('importance', ascending=False)

print("Feature Importance (Random Forest Regressor):")
print(feature_importance_reg)

# Visualisierung
plt.figure(figsize=(12, 8))
plt.barh(feature_importance_reg['feature'], feature_importance_reg['importance'], color='steelblue')
plt.xlabel('Importance', fontweight='bold')
plt.title('Feature Importance - Random Forest Regressor (Tuned)', fontweight='bold', fontsize=14)
plt.gca().invert_yaxis()
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Feature Importance f√ºr Tuned Random Forest Classifier
feature_importance_clf = pd.DataFrame({
    'feature': feature_columns,
    'importance': best_rf_clf.feature_importances_
}).sort_values('importance', ascending=False)

print("Feature Importance (Random Forest Classifier):")
print(feature_importance_clf)

# Visualisierung
plt.figure(figsize=(12, 8))
plt.barh(feature_importance_clf['feature'], feature_importance_clf['importance'], color='coral')
plt.xlabel('Importance', fontweight='bold')
plt.title('Feature Importance - Random Forest Classifier (Tuned)', fontweight='bold', fontsize=14)
plt.gca().invert_yaxis()
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

## 11. Finale Modellvergleiche und Empfehlung

In [None]:
# Aktualisierte Regressions-Vergleichstabelle mit getunten Modellen
final_regression_comparison = pd.DataFrame({
    'Model': list(regression_results.keys()) + ['RF Regressor (Tuned)', 'GB Regressor (Tuned)'],
    'Test RMSE': [results['test_rmse'] for results in regression_results.values()] + [tuned_rmse, tuned_gb_rmse],
    'Test MAE': [results['test_mae'] for results in regression_results.values()] + [tuned_mae, tuned_gb_mae],
    'Test R¬≤': [results['test_r2'] for results in regression_results.values()] + [tuned_r2, tuned_gb_r2]
}).sort_values('Test R¬≤', ascending=False)

print("\n" + "="*80)
print("FINALER VERGLEICH - REGRESSIONSMODELLE")
print("="*80)
print(final_regression_comparison.to_string(index=False))

# Beste Regression Modell
best_reg_model = final_regression_comparison.iloc[0]
print(f"\nüèÜ BESTES REGRESSIONSMODELL: {best_reg_model['Model']}")
print(f"   Test R¬≤:   {best_reg_model['Test R¬≤']:.4f}")
print(f"   Test RMSE: {best_reg_model['Test RMSE']:.4f}")
print(f"   Test MAE:  {best_reg_model['Test MAE']:.4f}")

In [None]:
# Aktualisierte Klassifikations-Vergleichstabelle mit getunten Modellen
final_classification_comparison = pd.DataFrame({
    'Model': list(classification_results.keys()) + ['RF Classifier (Tuned)', 'GB Classifier (Tuned)'],
    'Accuracy': [results['test_acc'] for results in classification_results.values()] + [tuned_clf_acc, tuned_gb_clf_acc],
    'Precision': [results['precision'] for results in classification_results.values()] + [tuned_clf_precision, tuned_gb_clf_precision],
    'Recall': [results['recall'] for results in classification_results.values()] + [tuned_clf_recall, tuned_gb_clf_recall],
    'F1-Score': [results['f1'] for results in classification_results.values()] + [tuned_clf_f1, tuned_gb_clf_f1],
    'AUC': [results['auc'] for results in classification_results.values()] + [tuned_clf_auc, tuned_gb_clf_auc]
}).sort_values('F1-Score', ascending=False)

print("\n" + "="*80)
print("FINALER VERGLEICH - KLASSIFIKATIONSMODELLE")
print("="*80)
print(final_classification_comparison.to_string(index=False))

# Beste Klassifikation Modell
best_clf_model = final_classification_comparison.iloc[0]
print(f"\nüèÜ BESTES KLASSIFIKATIONSMODELL: {best_clf_model['Model']}")
print(f"   F1-Score:  {best_clf_model['F1-Score']:.4f}")
print(f"   Accuracy:  {best_clf_model['Accuracy']:.4f}")
print(f"   Precision: {best_clf_model['Precision']:.4f}")
print(f"   Recall:    {best_clf_model['Recall']:.4f}")
print(f"   AUC:       {best_clf_model['AUC']:.4f}")

## 12. Finale Empfehlung

### Regressionsmodelle:
**Empfohlen:** Tuned Random Forest Regressor oder Gradient Boosting Regressor

**Begr√ºndung:**
- Beide Ensemble-Methoden zeigen sehr gute Performance
- Random Forest ist robuster gegen√ºber Overfitting
- Gradient Boosting kann leicht bessere Ergebnisse liefern, ist aber anf√§lliger f√ºr Overfitting
- R¬≤ > 0.45 zeigt moderate Vorhersagekraft f√ºr ein komplexes Problem

### Klassifikationsmodelle:
**Empfohlen:** Tuned Random Forest Classifier oder Gradient Boosting Classifier

**Begr√ºndung:**
- Exzellente F1-Scores und AUC-Werte
- Gute Balance zwischen Precision und Recall
- Robust und interpretierbar durch Feature Importance
- Random Forest ist schneller zu trainieren und vorherzusagen

### Allgemeine Erkenntnisse:
1. **Wichtigste Features:** Alkoholgehalt, volatile S√§ure, Sulfate und Zitronens√§ure
2. **Weintyp:** Hat moderaten Einfluss auf die Qualit√§t
3. **Feature Engineering:** Verh√§ltnis-Features verbessern die Performance
4. **Hyperparameter Tuning:** Bringt messbare Verbesserungen (5-10%)

### Produktions-Einsatz:
- F√ºr **schnelle Vorhersagen**: Random Forest
- F√ºr **h√∂chste Genauigkeit**: Gradient Boosting
- F√ºr **Interpretierbarkeit**: Random Forest mit Feature Importance
- F√ºr **bin√§re Entscheidungen** (gut/schlecht): Klassifikationsmodell
- F√ºr **exakte Qualit√§tswerte**: Regressionsmodell

## 13. Modelle speichern

In [None]:
import pickle

# Modelle und Scaler speichern
models_to_save = {
    'rf_regressor': best_rf_reg,
    'gb_regressor': best_gb_reg,
    'rf_classifier': best_rf_clf,
    'gb_classifier': best_gb_clf,
    'scaler_reg': scaler_reg,
    'scaler_clf': scaler_clf,
    'feature_columns': feature_columns
}

# Speichern
with open('wine_quality_models.pkl', 'wb') as f:
    pickle.dump(models_to_save, f)

print("Alle Modelle erfolgreich gespeichert in 'wine_quality_models.pkl'")
print("\nGespeicherte Komponenten:")
for key in models_to_save.keys():
    print(f"  - {key}")

## Zusammenfassung

Dieses Notebook hat einen vollst√§ndigen Machine-Learning-Workflow durchgef√ºhrt:

1. ‚úÖ **Daten laden**: Rot- und Wei√üwein kombiniert (6497 Samples)
2. ‚úÖ **Exploration**: Statistische Analyse und Visualisierungen
3. ‚úÖ **Bereinigung**: Duplikate entfernt, Ausrei√üer analysiert
4. ‚úÖ **Feature Engineering**: 4 neue Features erstellt + bin√§re Qualit√§tsklasse
5. ‚úÖ **Regression**: 6 Modelle trainiert und evaluiert (RMSE, MAE, R¬≤)
6. ‚úÖ **Klassifikation**: 5 Modelle trainiert und evaluiert (Accuracy, Precision, Recall, F1, AUC)
7. ‚úÖ **Hyperparameter Tuning**: Random Forest und Gradient Boosting optimiert
8. ‚úÖ **Feature Importance**: Wichtigste Features identifiziert
9. ‚úÖ **Vergleich**: Alle Modelle systematisch verglichen
10. ‚úÖ **Empfehlung**: Beste Modelle f√ºr verschiedene Use Cases

**N√§chster Schritt:** Streamlit-App f√ºr interaktive Vorhersagen!