# Movies Machine Learning - Stratified Sample (Upgrade 2025)

**Author:** Andreas Traut  
**Date:** December 2025  
**Version:** 2025.1

## Goal

This notebook demonstrates the use of **stratified sampling** to obtain representative train-test splits in machine learning.

## Updates (2025)

- ✅ Compatible with Python 3.10+
- ✅ scikit-learn >= 1.2 APIs
- ✅ `SimpleImputer` instead of the deprecated (and since removed) `Imputer`
- ✅ `OneHotEncoder` with `handle_unknown='ignore'`
- ✅ `ColumnTransformer` for handling features by type
- ✅ `StandardScaler` in the pipeline
- ✅ `random_state` for reproducibility
- ✅ `GridSearchCV` and `RandomizedSearchCV` examples

## Requirements

```bash
pip install pandas numpy scikit-learn matplotlib seaborn jupyterlab scipy
```

## Data Source

**Kaggle:** [IMDB 10000+ Movies Dataset](https://www.kaggle.com/datasets)

Please download the data and save it under: `datasets/movies/`

## 1. Setup and Imports

In [None]:
# Standard library
import warnings
from pathlib import Path

# Data science libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-Learn
from sklearn.model_selection import (
    train_test_split,
    StratifiedShuffleSplit,
    cross_val_score,
    GridSearchCV,
    RandomizedSearchCV
)
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from scipy.stats import randint
import joblib

# Configuration
warnings.filterwarnings('ignore')
np.random.seed(42)

# Visualization
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print("✅ All libraries imported successfully")

## 2. Load the Data

In [None]:
# Path to the dataset
data_path = Path('../../datasets/movies/movies.csv')

if not data_path.exists():
    print("⚠️ Dataset not found!")
    print(f"Please download the data and save it to: {data_path}")
    print("Source: https://www.kaggle.com/datasets")
else:
    # Load the data
    movies_df = pd.read_csv(data_path)
    print(f"✅ Data loaded: {movies_df.shape[0]} rows, {movies_df.shape[1]} columns")

    # Show the first rows
    display(movies_df.head())

    # Dataset info
    print("\n📊 Dataset Info:")
    movies_df.info()

## 3. Preparing Stratified Sampling

**Stratified sampling** ensures that the distribution of important categories in the training and test sets matches that of the full dataset.
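
As a minimal sketch of the idea, the following uses a small synthetic DataFrame and the `stratify` argument of `train_test_split`, which performs a single stratified shuffle split. The column names `value` and `group` are made up for illustration and are not part of the movies dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data: 80 rows of group "a", 20 rows of group "b".
# The column names are illustrative only.
toy = pd.DataFrame({
    "value": np.arange(100),
    "group": ["a"] * 80 + ["b"] * 20,
})

# stratify= keeps the group proportions the same in both parts of the split.
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, stratify=toy["group"], random_state=42
)

print(toy_test["group"].value_counts(normalize=True))
```

With a purely random split, the share of the small group in the test set would vary from run to run; with `stratify`, it stays at the share observed in the full data.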

In [None]:
# Create revenue categories (if the column exists)
if 'movies_df' in locals() and 'Revenue' in movies_df.columns:
    # Drop rows without Revenue for this example
    movies_clean = movies_df[movies_df['Revenue'].notna()].copy()

    # Create revenue categories; include_lowest=True keeps rows with Revenue == 0,
    # which would otherwise get a NaN category and break the stratified split
    movies_clean['revenue_cat'] = pd.cut(
        movies_clean['Revenue'],
        bins=[0, 50, 100, 200, np.inf],
        labels=['low', 'medium', 'high', 'very_high'],
        include_lowest=True
    )

    print(f"✅ {len(movies_clean)} movies with Revenue values")
    print("\n📊 Revenue category distribution:")
    print(movies_clean['revenue_cat'].value_counts(normalize=True))

    # Visualization
    plt.figure(figsize=(10, 4))
    
    plt.subplot(1, 2, 1)
    movies_clean['Revenue'].hist(bins=50, edgecolor='black')
    plt.xlabel('Revenue')
    plt.ylabel('Frequency')
    plt.title('Revenue Distribution')
    
    plt.subplot(1, 2, 2)
    movies_clean['revenue_cat'].value_counts().plot(kind='bar')
    plt.xlabel('Revenue Category')
    plt.ylabel('Count')
    plt.title('Revenue Categories')
    plt.xticks(rotation=45)
    
    plt.tight_layout()
    plt.show()

## 4. Comparison: Random vs. Stratified Split

In [None]:
if 'movies_clean' in locals():
    # 1. Random Split
    train_random, test_random = train_test_split(
        movies_clean,
        test_size=0.2,
        random_state=42
    )
    
    # 2. Stratified Split
    splitter = StratifiedShuffleSplit(
        n_splits=1,
        test_size=0.2,
        random_state=42
    )
    
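    # split() yields positional indices, so select rows with .iloc;
    # movies_clean keeps its original index labels after filtering
    train_idx, test_idx = next(splitter.split(movies_clean, movies_clean['revenue_cat']))
    train_stratified = movies_clean.iloc[train_idx]
    test_stratified = movies_clean.iloc[test_idx]

    # Compare the distributions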
    def compare_distributions(overall, random, stratified, category='revenue_cat'):
        comparison = pd.DataFrame({
            'Overall': overall[category].value_counts(normalize=True),
            'Random': random[category].value_counts(normalize=True),
            'Stratified': stratified[category].value_counts(normalize=True)
        })
        
        comparison['Random_Error_%'] = 100 * (comparison['Random'] - comparison['Overall']) / comparison['Overall']
        comparison['Stratified_Error_%'] = 100 * (comparison['Stratified'] - comparison['Overall']) / comparison['Overall']
        
        return comparison
    
    comparison = compare_distributions(movies_clean, test_random, test_stratified)
    
    print("📊 Comparison of the distributions:")
    display(comparison)

    print("\n💡 Interpretation:")
    print("- Smaller error % = better representation")
    print("- The stratified split typically has smaller errors")
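
`StratifiedShuffleSplit.split` yields positional indices (0 to n-1), not index labels. Once rows have been filtered out of a DataFrame, its labels are no longer contiguous, so the returned positions need to be applied with `.iloc`. A small sketch on synthetic data (all names are illustrative) shows the difference:

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic frame whose index labels (10, 20, ...) differ from the row positions.
frame = pd.DataFrame(
    {"cat": ["x", "x", "x", "x", "y", "y", "y", "y"]},
    index=[10, 20, 30, 40, 50, 60, 70, 80],
)

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
pos_train, pos_test = next(sss.split(frame, frame["cat"]))

# The returned arrays are positions in 0..len(frame)-1 ...
print(sorted(pos_test))

# ... so .iloc selects the intended rows, whereas .loc would look up the
# labels 0..7, which do not exist in this frame.
held_out = frame.iloc[pos_test]
print(held_out)
```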

## 5. Preparing Features and the Pipeline

In [None]:
if 'train_stratified' in locals():
    # Drop the temporary category column
    train_set = train_stratified.drop('revenue_cat', axis=1).copy()
    test_set = test_stratified.drop('revenue_cat', axis=1).copy()

    # Define the features (keep only columns that exist in the dataset)
    numeric_features = ['Year', 'Score', 'Metascore', 'Vote', 'Runtime']
    numeric_features = [f for f in numeric_features if f in train_set.columns]

    categorical_features = ['Genre']
    categorical_features = [f for f in categorical_features if f in train_set.columns]

    all_features = numeric_features + categorical_features

    # Features and labels
    X_train = train_set[all_features]
    y_train = train_set['Revenue']
    X_test = test_set[all_features]
    y_test = test_set['Revenue']

    print(f"✅ Training Set: {X_train.shape}")
    print(f"✅ Test Set: {X_test.shape}")
    print("\n📊 Features:")
    print(f"Numeric: {numeric_features}")
    print(f"Categorical: {categorical_features}")

In [None]:
# Build the preprocessing pipeline
if 'X_train' in locals():
    # Numeric pipeline
    numeric_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])

    # Categorical pipeline
    categorical_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
    ])

    # Combined pipeline
    preprocessor = ColumnTransformer([
        ('num', numeric_pipeline, numeric_features),
        ('cat', categorical_pipeline, categorical_features)
    ])

    # Transform the data: fit on the training set only, then apply to the test set
    X_train_prepared = preprocessor.fit_transform(X_train)
    X_test_prepared = preprocessor.transform(X_test)

    print("✅ Preprocessing finished")
    print(f"Transformed shape: {X_train_prepared.shape}")

## 6. Training the Models

In [None]:
# Linear Regression
if 'X_train_prepared' in locals():
    print("📈 Training Linear Regression...")
    
    lin_reg = LinearRegression()
    lin_reg.fit(X_train_prepared, y_train)
    
    # Evaluation
    y_pred_train = lin_reg.predict(X_train_prepared)
    y_pred_test = lin_reg.predict(X_test_prepared)
    
    train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
    test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
    test_r2 = r2_score(y_test, y_pred_test)
    
    print("\n📊 Linear Regression:")
    print(f"Train RMSE: {train_rmse:,.2f}")
    print(f"Test RMSE: {test_rmse:,.2f}")
    print(f"Test R²: {test_r2:.3f}")

In [None]:
# Decision Tree
if 'X_train_prepared' in locals():
    print("🌲 Training Decision Tree...")
    
    tree_reg = DecisionTreeRegressor(random_state=42, max_depth=10)
    tree_reg.fit(X_train_prepared, y_train)
    
    # Evaluation
    y_pred_train = tree_reg.predict(X_train_prepared)
    y_pred_test = tree_reg.predict(X_test_prepared)
    
    train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
    test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
    test_r2 = r2_score(y_test, y_pred_test)
    
    print("\n📊 Decision Tree:")
    print(f"Train RMSE: {train_rmse:,.2f}")
    print(f"Test RMSE: {test_rmse:,.2f}")
    print(f"Test R²: {test_r2:.3f}")

In [None]:
# Random Forest
if 'X_train_prepared' in locals():
    print("🌳 Training Random Forest...")
    
    forest_reg = RandomForestRegressor(
        n_estimators=100,
        max_depth=15,
        random_state=42,
        n_jobs=-1
    )
    forest_reg.fit(X_train_prepared, y_train)
    
    # Evaluation
    y_pred_train = forest_reg.predict(X_train_prepared)
    y_pred_test = forest_reg.predict(X_test_prepared)
    
    train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
    test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
    test_r2 = r2_score(y_test, y_pred_test)
    
    print("\n📊 Random Forest:")
    print(f"Train RMSE: {train_rmse:,.2f}")
    print(f"Test RMSE: {test_rmse:,.2f}")
    print(f"Test R²: {test_r2:.3f}")

## 7. Cross-Validation

In [None]:
# Cross-validation for the Random Forest
if 'forest_reg' in locals():
    print("🔄 Performing 5-Fold Cross-Validation...")
    
    cv_scores = cross_val_score(
        forest_reg,
        X_train_prepared, y_train,
        cv=5,
        scoring='neg_mean_squared_error',
        n_jobs=-1
    )
    
    cv_rmse = np.sqrt(-cv_scores)
    
    print("\n📊 Cross-Validation Results:")
    print(f"RMSE per fold: {cv_rmse}")
    print(f"Mean RMSE: {cv_rmse.mean():,.2f}")
    print(f"Std RMSE: {cv_rmse.std():,.2f}")
    # Mean ± 2 std over 5 folds is a rough spread, not a formal confidence interval
    print(f"Mean ± 2 Std: [{cv_rmse.mean() - 2*cv_rmse.std():,.2f}, {cv_rmse.mean() + 2*cv_rmse.std():,.2f}]")
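
scikit-learn scorers follow a "greater is better" convention, which is why the MSE scores in the cell above are negative and have to be negated before taking the square root. As an alternative sketch, the built-in `neg_root_mean_squared_error` scorer returns the negated RMSE directly. The data below is synthetic (generated with `make_regression`) so that the block runs on its own; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data as a stand-in for the prepared movie features.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

model = RandomForestRegressor(n_estimators=20, random_state=42)

# Negated MSE, converted to RMSE by hand.
neg_mse = cross_val_score(
    model, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
)
rmse_from_mse = np.sqrt(-neg_mse)

# Negated RMSE straight from the scorer.
neg_rmse = cross_val_score(
    model, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
)
rmse_direct = -neg_rmse

print(rmse_from_mse)
print(rmse_direct)
```

Both routes give the same per-fold RMSE here because the folds and the model's `random_state` are identical across the two calls.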

## 8. Hyperparameter Tuning with GridSearchCV

In [None]:
# GridSearchCV for the Random Forest
if 'X_train_prepared' in locals():
    print("🔍 GridSearchCV for Random Forest...")
    
    param_grid = {
        'n_estimators': [50, 100, 150],
        'max_depth': [10, 15, 20],
        'min_samples_split': [2, 5, 10]
    }
    
    forest_reg_grid = RandomForestRegressor(random_state=42, n_jobs=-1)
    
    grid_search = GridSearchCV(
        forest_reg_grid,
        param_grid,
        cv=3,  # Fewer folds for faster execution
        scoring='neg_mean_squared_error',
        n_jobs=-1,
        verbose=1
    )
    
    grid_search.fit(X_train_prepared, y_train)
    
    print(f"\n✅ Best parameters: {grid_search.best_params_}")
    print(f"Best CV score (RMSE): {np.sqrt(-grid_search.best_score_):,.2f}")

    # Evaluate the best model on the test set
    best_model = grid_search.best_estimator_
    y_pred_best = best_model.predict(X_test_prepared)
    test_rmse_best = np.sqrt(mean_squared_error(y_test, y_pred_best))

    print(f"Test RMSE (tuned): {test_rmse_best:,.2f}")
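
The search above tunes the model on features that were already transformed using the full training set. A possible variant, sketched here on synthetic data, places the preprocessing and the regressor in one `Pipeline` and searches over that, so that imputation and scaling are re-fitted inside each CV fold. The step names `prep` and `regressor`, the column names, and the small grid are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the movie features (column names are illustrative).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Runtime": rng.normal(110, 15, size=120),
    "Score": rng.uniform(1, 10, size=120),
    "Genre": rng.choice(["Action", "Drama", "Comedy"], size=120),
})
target = 3.0 * demo["Score"] + 0.1 * demo["Runtime"] + rng.normal(0, 1, size=120)

numeric_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
prep = ColumnTransformer([
    ("num", numeric_steps, ["Runtime", "Score"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Genre"]),
])

full_pipeline = Pipeline([
    ("prep", prep),
    ("regressor", RandomForestRegressor(random_state=42)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
demo_grid = {
    "regressor__n_estimators": [10, 20],
    "regressor__max_depth": [3, 5],
}

demo_search = GridSearchCV(
    full_pipeline, demo_grid, cv=3, scoring="neg_mean_squared_error"
)
demo_search.fit(demo, target)

print(demo_search.best_params_)
```

A side effect of this layout is that `demo_search.best_estimator_` is already a complete, fitted pipeline that accepts raw feature columns.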

## 9. RandomizedSearchCV (Alternative)

In [None]:
# RandomizedSearchCV for a faster search
if 'X_train_prepared' in locals():
    print("🎲 RandomizedSearchCV for Random Forest...")
    
    param_distributions = {
        'n_estimators': randint(50, 200),
        'max_depth': randint(5, 30),
        'min_samples_split': randint(2, 20),
        'min_samples_leaf': randint(1, 10)
    }
    
    forest_reg_random = RandomForestRegressor(random_state=42, n_jobs=-1)
    
    random_search = RandomizedSearchCV(
        forest_reg_random,
        param_distributions,
        n_iter=10,  # 10 random combinations
        cv=3,
        scoring='neg_mean_squared_error',
        random_state=42,
        n_jobs=-1,
        verbose=1
    )
    
    random_search.fit(X_train_prepared, y_train)
    
    print(f"\n✅ Best parameters: {random_search.best_params_}")
    print(f"Best CV score (RMSE): {np.sqrt(-random_search.best_score_):,.2f}")

    # Evaluate the best model on the test set
    best_model_random = random_search.best_estimator_
    y_pred_random = best_model_random.predict(X_test_prepared)
    test_rmse_random = np.sqrt(mean_squared_error(y_test, y_pred_random))

    print(f"Test RMSE (tuned): {test_rmse_random:,.2f}")
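
Note that `scipy.stats.randint(low, high)` samples integers from the half-open interval `[low, high)`, so `randint(50, 200)` can return 50 but never 200. A quick check of the distribution used for `n_estimators`:

```python
from scipy.stats import randint

# Same bounds as the n_estimators entry in param_distributions.
dist = randint(50, 200)
samples = dist.rvs(size=1000, random_state=42)

# The minimum is at least 50 and the maximum is at most 199.
print(samples.min(), samples.max())
```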

## 10. Saving the Model

In [None]:
# Save the best model together with the preprocessor
if 'best_model' in locals():
    # Build the complete pipeline from the already fitted steps
    final_pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', best_model)
    ])

    # Save
    model_path = 'movies_stratified_model.pkl'
    joblib.dump(final_pipeline, model_path)
    print(f"✅ Pipeline saved: {model_path}")

    # Test: load and predict
    loaded_pipeline = joblib.load(model_path)
    test_pred = loaded_pipeline.predict(X_test.head(5))
    print("\n✅ Model loaded and tested successfully")
    print(f"Example predictions: {test_pred}")

## 11. Summary

In this notebook we:

1. ✅ Applied **stratified sampling** for representative splits
2. ✅ **Compared** random vs. stratified splits
3. ✅ Built a modern **preprocessing pipeline**
4. ✅ Trained several models (Linear Regression, Decision Tree, Random Forest)
5. ✅ Performed **cross-validation**
6. ✅ Used **GridSearchCV** for hyperparameter tuning
7. ✅ Used **RandomizedSearchCV** as an alternative
8. ✅ Saved the complete pipeline

### Key Updates (2025):

- ✅ `StratifiedShuffleSplit` with `random_state`
- ✅ `SimpleImputer` instead of the deprecated `Imputer`
- ✅ `OneHotEncoder` with `handle_unknown='ignore'` and `sparse_output=False`
- ✅ `ColumnTransformer` for a clear separation of feature types
- ✅ `StandardScaler` in the pipeline instead of the `normalize` parameter
- ✅ Consistent `random_state` for reproducibility
- ✅ Pipeline persistence with joblib

### Why Stratified Sampling?

**Advantages:**
- Representative distribution in the train and test sets
- Better generalization on imbalanced data
- More reliable evaluation

**When to use it:**
- For classification with imbalanced classes
- For regression with important categories
- When small subgroups need to be preserved
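
For regression targets, the strata first have to be derived from the continuous target. Besides fixed bin edges with `pd.cut`, one option is quantile-based binning with `pd.qcut`, which yields roughly equally sized strata. A sketch on synthetic, right-skewed data (all variable names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, right-skewed target as a stand-in for a revenue-like column.
rng = np.random.default_rng(0)
y = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=400), name="target")

# Four quantile-based strata (quartiles) derived from the target itself.
strata = pd.qcut(y, q=4, labels=["q1", "q2", "q3", "q4"])

y_train, y_test = train_test_split(
    y, test_size=0.25, stratify=strata, random_state=0
)

# Each quartile contributes the same number of rows to the test set.
print(strata.loc[y_test.index].value_counts().sort_index())
```

Quantile bins avoid nearly empty strata in the long tail, at the cost of bin edges that are harder to interpret than round thresholds.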