# Movies Machine Learning - Predict NaNs (Upgrade 2025)

**Author:** Andreas Traut  
**Date:** December 2025  
**Version:** 2025.1

## Goal

This notebook shows how missing values (NaNs) in the "Revenue" column of a movie dataset can be predicted from the other columns.

## Updates (2025)

- ✅ Compatible with Python 3.10+
- ✅ scikit-learn >= 1.2 APIs
- ✅ SimpleImputer instead of the deprecated Imputer
- ✅ OneHotEncoder with `handle_unknown='ignore'`
- ✅ StandardScaler inside the pipeline
- ✅ random_state for reproducibility
- ✅ Modern, modular code
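
To illustrate the `SimpleImputer` item in the list above, here is a minimal sketch on a made-up column. The numbers are invented for illustration and do not come from the movie dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column with one missing value (made-up numbers, not from the dataset)
values = np.array([[1.0], [3.0], [np.nan], [100.0]])

# strategy="median" replaces NaN with the median of the observed values
imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(values)

print(filled.ravel())
```

The median is less sensitive to the outlier (100.0) than the mean would be, which is one plausible reason to prefer it for skewed columns.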

## Requirements

```bash
pip install pandas numpy scikit-learn matplotlib seaborn jupyterlab
```

## Data Source

**Kaggle:** [IMDB 10000+ Movies Dataset](https://www.kaggle.com/datasets)

Please download the data and save it under: `datasets/movies/`

## 1. Setup and Imports

In [None]:
# Standard library
import warnings
from pathlib import Path

# Data science libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# scikit-learn
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import joblib

# Configuration
warnings.filterwarnings('ignore')
np.random.seed(42)

# Plot styling
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print("✅ All libraries imported successfully")

## 2. Loading the Data

In [None]:
# Path to the dataset
data_path = Path('../../datasets/movies/movies.csv')

if not data_path.exists():
    print("⚠️ Dataset not found!")
    print(f"Please download the data and save it under: {data_path}")
    print("Source: https://www.kaggle.com/datasets")
else:
    # Load the data
    movies_df = pd.read_csv(data_path)
    print(f"✅ Data loaded: {movies_df.shape[0]} rows, {movies_df.shape[1]} columns")

    # Show the first rows
    display(movies_df.head())

    # Dataset info
    print("\n📊 Dataset info:")
    movies_df.info()

## 3. Exploratory Data Analysis (EDA)

In [None]:
# Statistical summary
print("📈 Statistical summary:")
display(movies_df.describe())

In [None]:
# Analyze missing values
print("❓ Missing values:")
missing_values = movies_df.isnull().sum()
missing_percent = 100 * missing_values / len(movies_df)
missing_df = pd.DataFrame({
    'Count': missing_values,
    'Percent': missing_percent
})
display(missing_df[missing_df['Count'] > 0].sort_values('Count', ascending=False))
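
The missing-value table can be checked by hand on a tiny frame. The sketch below uses a hypothetical toy DataFrame whose column names only mimic the movie dataset, and it also shows that `isna().mean()` gives the same percentages in one step.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are invented for illustration
toy_df = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Revenue": [10.0, np.nan, np.nan, 40.0],
    "Metascore": [70.0, 65.0, np.nan, 80.0],
})

missing_count = toy_df.isnull().sum()
missing_percent = 100 * missing_count / len(toy_df)
summary = pd.DataFrame({"Count": missing_count, "Percent": missing_percent})

# Keep only columns that actually contain missing values
summary = summary[summary["Count"] > 0].sort_values("Count", ascending=False)

# Equivalent shortcut: the mean of a boolean mask is the missing fraction
shortcut_percent = toy_df.isna().mean() * 100

print(summary)
```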

In [None]:
# Visualization: revenue distribution
if 'Revenue' in movies_df.columns:
    plt.figure(figsize=(12, 4))
    
    plt.subplot(1, 2, 1)
    movies_df['Revenue'].dropna().hist(bins=50, edgecolor='black')
    plt.xlabel('Revenue')
    plt.ylabel('Frequency')
    plt.title('Revenue Distribution')
    
    plt.subplot(1, 2, 2)
    movies_df['Revenue'].dropna().plot(kind='box')
    plt.ylabel('Revenue')
    plt.title('Revenue Box Plot')
    
    plt.tight_layout()
    plt.show()

## 4. Preparing the Data

In [None]:
# Separate rows with and without NaN in Revenue
if 'Revenue' in movies_df.columns:
    # Rows WITH Revenue values (for training)
    movies_with_revenue = movies_df[movies_df['Revenue'].notna()].copy()

    # Rows WITHOUT Revenue values (for prediction)
    movies_without_revenue = movies_df[movies_df['Revenue'].isna()].copy()

    print(f"✅ Rows with Revenue: {len(movies_with_revenue)}")
    print(f"⚠️ Rows without Revenue (NaN): {len(movies_without_revenue)}")

    # Select the features for modeling
    # Numeric features
    numeric_features = ['Year', 'Score', 'Metascore', 'Vote', 'Runtime']
    numeric_features = [f for f in numeric_features if f in movies_df.columns]

    # Categorical features
    categorical_features = ['Genre', 'Director']
    categorical_features = [f for f in categorical_features if f in movies_df.columns]

    print(f"\n📊 Features:")
    print(f"Numeric ({len(numeric_features)}): {numeric_features}")
    print(f"Categorical ({len(categorical_features)}): {categorical_features}")
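
A minimal sketch of the same split on a hypothetical toy frame (invented values) may make the logic easier to verify: rows with a known target go to training, rows without it are held back for prediction, and requested feature names that are absent from the data are silently dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names resemble the movie dataset
toy_df = pd.DataFrame({
    "Year": [2001, 2005, 2010, 2015],
    "Genre": ["Drama", "Comedy", "Drama", "Action"],
    "Revenue": [12.5, np.nan, 48.0, np.nan],
})

# Boolean mask: True where the target is known
has_target = toy_df["Revenue"].notna()
train_part = toy_df[has_target].copy()
predict_part = toy_df[~has_target].copy()

# "Score" is requested but not present, so it is filtered out
wanted = ["Year", "Score", "Genre"]
available = [c for c in wanted if c in toy_df.columns]

print(len(train_part), len(predict_part), available)
```

Because the two masks are complements, every row lands in exactly one of the two subsets.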

## 5. Train-Test Split

In [None]:
if 'Revenue' in movies_df.columns and len(movies_with_revenue) > 0:
    # Features and labels
    all_features = numeric_features + categorical_features
    X = movies_with_revenue[all_features]
    y = movies_with_revenue['Revenue']

    # Train-test split with random_state for reproducibility
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=0.2,
        random_state=42
    )

    print(f"✅ Train-test split created:")
    print(f"Training set: {X_train.shape}")
    print(f"Test set: {X_test.shape}")
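
The effect of `random_state` can be checked directly. The sketch below uses small synthetic arrays (not the movie data) and splits them twice with the same seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 10 samples, 2 features
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two independent calls with the same seed
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# With a fixed seed, both calls select the same test rows
same_split = np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)
print(X_train_a.shape, X_test_a.shape, same_split)
```

With `test_size=0.2` and 10 samples, 8 rows go to training and 2 to testing.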

## 6. Building the Preprocessing Pipeline

In [None]:
# Numeric pipeline
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical pipeline
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Combined pipeline with ColumnTransformer
preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, numeric_features),
    ('cat', categorical_pipeline, categorical_features)
])

print("✅ Preprocessing pipeline created")

## 7. Training and Evaluating Models

In [None]:
# Decision Tree Regressor
if 'X_train' in locals():
    print("🌲 Training Decision Tree Regressor...")

    # Pipeline: preprocessing + model
    tree_pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', DecisionTreeRegressor(random_state=42, max_depth=10))
    ])

    # Training
    tree_pipeline.fit(X_train, y_train)

    # Predictions
    y_pred_train = tree_pipeline.predict(X_train)
    y_pred_test = tree_pipeline.predict(X_test)

    # Evaluation
    train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
    test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
    test_r2 = r2_score(y_test, y_pred_test)

    print(f"\n📊 Decision Tree Results:")
    print(f"Train RMSE: {train_rmse:,.2f}")
    print(f"Test RMSE: {test_rmse:,.2f}")
    print(f"Test R²: {test_r2:.3f}")

In [None]:
# Random Forest Regressor
if 'X_train' in locals():
    print("🌳 Training Random Forest Regressor...")

    # Pipeline: preprocessing + model
    forest_pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', RandomForestRegressor(
            n_estimators=100,
            max_depth=15,
            random_state=42,
            n_jobs=-1
        ))
    ])

    # Training
    forest_pipeline.fit(X_train, y_train)

    # Predictions
    y_pred_train = forest_pipeline.predict(X_train)
    y_pred_test = forest_pipeline.predict(X_test)

    # Evaluation
    train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
    test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
    test_r2 = r2_score(y_test, y_pred_test)

    print(f"\n📊 Random Forest Results:")
    print(f"Train RMSE: {train_rmse:,.2f}")
    print(f"Test RMSE: {test_rmse:,.2f}")
    print(f"Test R²: {test_r2:.3f}")
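
Both model cells report RMSE and R². As a small worked example on made-up numbers, the sketch below computes the two metrics with scikit-learn and again by hand, so the definitions are explicit.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions for illustration
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

# RMSE: square root of the mean squared error
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R²: 1 minus residual sum of squares over total sum of squares
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_pred)

print(rmse_sklearn, rmse_manual, r2_sklearn, r2_manual)
```

RMSE is expressed in the units of `Revenue`, so it should be read against the typical revenue scale. A train RMSE far below the test RMSE usually indicates overfitting.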

## 8. Cross-Validation

In [None]:
# Cross-validation for the random forest
if 'forest_pipeline' in locals():
    print("🔄 Performing Cross-Validation...")

    cv_scores = cross_val_score(
        forest_pipeline,
        X_train, y_train,
        cv=5,
        scoring='neg_mean_squared_error',
        n_jobs=-1
    )

    cv_rmse_scores = np.sqrt(-cv_scores)

    print(f"\n📊 Cross-Validation Results (5-fold):")
    print(f"RMSE Scores: {cv_rmse_scores}")
    print(f"Mean RMSE: {cv_rmse_scores.mean():,.2f}")
    print(f"Std RMSE: {cv_rmse_scores.std():,.2f}")
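
The sign flip in `np.sqrt(-cv_scores)` follows from scikit-learn's convention that a higher score is always better, so error metrics are reported as negative numbers. A minimal sketch on synthetic data (an assumption, not the movie dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression problem: y depends mostly on the first feature
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=60)

model = RandomForestRegressor(n_estimators=20, random_state=42)

# One negated MSE value per fold
neg_mse = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")

# Negate back to a positive MSE before taking the square root
rmse_per_fold = np.sqrt(-neg_mse)

print(neg_mse)
print(rmse_per_fold)
```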

## 9. Predicting the NaN Values

In [None]:
# Predictions for movies without Revenue
if 'movies_without_revenue' in locals() and len(movies_without_revenue) > 0:
    print(f"🔮 Predicting Revenue for {len(movies_without_revenue)} movies without Revenue...")

    # Prepare the features
    X_predict = movies_without_revenue[all_features]

    # Predict
    predicted_revenue = forest_pipeline.predict(X_predict)

    # Attach the results
    movies_without_revenue['Predicted_Revenue'] = predicted_revenue

    print("\n✅ Predictions complete!")
    print("\nExamples:")
    # Show only the preview columns that exist in this dataset
    preview_cols = [c for c in ['Title', 'Year', 'Predicted_Revenue']
                    if c in movies_without_revenue.columns]
    display(movies_without_revenue[preview_cols].head(10))
else:
    print("ℹ️ No movies without Revenue found")
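
The cell above keeps the predictions in a separate frame. If they should be merged back into the full table, one possible approach is index-aligned `fillna`, sketched here on a hypothetical toy frame. The column names `Revenue_Filled` and `Revenue_Is_Predicted` are illustrative assumptions, and the predicted numbers are stand-ins for model output.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with two missing Revenue values
movies = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Revenue": [10.0, np.nan, 30.0, np.nan],
})

# Stand-in for model output, indexed like the rows that lack Revenue
missing_idx = movies.index[movies["Revenue"].isna()]
predicted = pd.Series([22.0, 44.0], index=missing_idx)

# fillna aligns on the index, so each prediction lands in its own row
movies["Revenue_Filled"] = movies["Revenue"].fillna(predicted)

# Flag imputed rows so predicted values are never mistaken for observed ones
movies["Revenue_Is_Predicted"] = movies["Revenue"].isna()

print(movies)
```

Keeping the original `Revenue` column untouched and adding a flag makes it easy to exclude model-generated values from later analyses.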

## 10. Saving the Model

In [None]:
# Save the model
if 'forest_pipeline' in locals():
    model_path = 'movies_revenue_predictor.pkl'
    joblib.dump(forest_pipeline, model_path)
    print(f"✅ Model saved: {model_path}")

    # Load the model (test)
    loaded_model = joblib.load(model_path)
    print(f"✅ Model loaded successfully")
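
Loading the file without an error does not yet show that the restored model behaves the same. A stricter round-trip check compares predictions before and after, sketched here with a small stand-in model on synthetic data and a temporary file (both assumptions for illustration).

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Small stand-in model trained on synthetic data
X = np.arange(10, dtype=float).reshape(-1, 1)
y = X.ravel() ** 2
model = DecisionTreeRegressor(random_state=42, max_depth=3).fit(X, y)

# Save to a temporary directory and load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_file = Path(tmp_dir) / "toy_model.pkl"
    joblib.dump(model, model_file)
    restored = joblib.load(model_file)

# The restored model should reproduce the original predictions exactly
predictions_match = np.array_equal(model.predict(X), restored.predict(X))
print(predictions_match)
```

Pickled scikit-learn models are generally only safe to load with the same library version that saved them, and only from trusted sources.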

## 11. Summary

In this notebook we:

1. ✅ Loaded and explored the data
2. ✅ Analyzed the missing values
3. ✅ Built a preprocessing pipeline (modern APIs)
4. ✅ Trained Decision Tree and Random Forest models
5. ✅ Ran cross-validation
6. ✅ Predicted the NaN values
7. ✅ Saved the model

### Key Updates (2025):

- `SimpleImputer` instead of the deprecated `Imputer`
- `OneHotEncoder` with `handle_unknown='ignore'`
- `StandardScaler` inside the pipeline
- `random_state` for reproducibility
- Modern pipeline structure with `ColumnTransformer`