# NHL Player Performance Prediction - Complete ML Pipeline

This notebook walks through a complete machine learning pipeline for predicting NHL player performance (points per game), from feature engineering and baseline models through ensembles and evaluation.

## 🎯 **What We'll Cover:**
1. **Feature Engineering** - Hockey-specific features
2. **Baseline Models** - Simple, interpretable models
3. **Advanced Models** - More sophisticated algorithms
4. **Ensemble Methods** - Combining multiple models
5. **Model Evaluation** - Comprehensive performance analysis
6. **Hockey-Specific Insights** - Error analysis by position and age group

## 📊 **Models Included:**
- **Baseline**: Linear Regression, Ridge, Lasso, Decision Trees, k-NN
- **Advanced**: Random Forest, XGBoost, LightGBM, SVR, Neural Networks
- **Ensemble**: Voting, Stacking, Bagging, Custom Weighted

## Setup and Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import logging
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Set up plotting
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)

# Import our ML modules
from ml_models.features import FeatureEngineer, HockeyFeatures
from ml_models.models import BaselineModels, AdvancedModels, EnsembleModels
from ml_models.evaluation import ModelEvaluator

# Import data pipeline
from data_pipeline import NHLDataPipeline
from config import config

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

print("✅ All imports successful!")
print(f"🏒 Ready to build NHL prediction models!")

## Data Loading and Preparation

In [None]:
# Check if we have processed data available
processed_dir = Path(config.data.base_data_dir) / "processed"
training_file = processed_dir / "training_dataset.parquet"

if training_file.exists():
    print("📊 Loading existing processed data...")
    df = pd.read_parquet(training_file)
    print(f"✅ Loaded {len(df)} training examples")
else:
    print("📥 No processed data found. Running data pipeline...")
    print("This will take several minutes to download and process NHL data.")
    
    # Initialize and run data pipeline
    pipeline = NHLDataPipeline()
    
    # Download data
    download_results = pipeline.download_all_data(force_refresh=False)
    
    # Get players and create training dataset
    all_players = pipeline.get_all_players_for_seasons()
    df = pipeline.create_training_dataset(all_players)
    
    if df.empty:
        raise ValueError("No training data could be created. Please check the data pipeline.")
    
    print(f"✅ Created {len(df)} training examples")

# Display basic info about the dataset
print(f"\n📋 Dataset Info:")
print(f"  Shape: {df.shape}")
print(f"  Columns: {len(df.columns)}")
print(f"  Missing values: {df.isnull().sum().sum()}")

# Show sample of data
print(f"\n🔍 Sample Data:")
display(df[['name', 'role', 'age', 'ppg_1', 'ppg_2', 'target_points']].head())

# Target variable statistics
print(f"\n🎯 Target Variable (Points per Game) Statistics:")
print(df['target_points'].describe())

## Feature Engineering

Create hockey-specific features to improve model performance.
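
The internals of `HockeyFeatures.create_all_hockey_features` are not shown in this notebook. As a rough illustration of what hockey-specific features can look like, the sketch below derives a recency-weighted scoring rate, a season-over-season trend, and a distance-from-peak-age term from the `ppg_1`, `ppg_2`, and `age` columns. The function name, the 0.6/0.4 weights, the peak age of 26, and the reading of `ppg_1` as the most recent season are all assumptions made for illustration, not values taken from the `ml_models` package.

```python
import pandas as pd

def add_simple_hockey_features(df: pd.DataFrame, peak_age: float = 26.0) -> pd.DataFrame:
    """Hypothetical helper: add a few hand-built features to a copy of df."""
    out = df.copy()
    # Recency-weighted points per game (assumes ppg_1 is the most recent season).
    out["ppg_weighted"] = 0.6 * out["ppg_1"] + 0.4 * out["ppg_2"]
    # Positive when the player scored more in the ppg_1 season than in the ppg_2 season.
    out["ppg_trend"] = out["ppg_1"] - out["ppg_2"]
    # Squared distance from an assumed peak age, a crude age-curve term.
    out["age_from_peak_sq"] = (out["age"] - peak_age) ** 2
    return out

demo = pd.DataFrame({"age": [22, 30], "ppg_1": [0.8, 0.5], "ppg_2": [0.6, 0.7]})
features = add_simple_hockey_features(demo)
```

Returning a copy keeps the raw dataframe unchanged, which makes it easier to compare model performance with and without the extra columns.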

In [None]:
print("🔧 Starting feature engineering...")

# Initialize feature engineer
feature_engineer = FeatureEngineer(scaler_type='standard')

# Separate features and target
# Remove non-predictive columns
exclude_cols = ['player_id', 'name', 'target_points']
X_raw = df.drop(columns=exclude_cols)
y = df['target_points']

print(f"📊 Initial features: {len(X_raw.columns)}")

# Apply hockey-specific feature engineering
print("🏒 Creating hockey-specific features...")
X_hockey = HockeyFeatures.create_all_hockey_features(X_raw)

print(f"📈 Features after hockey engineering: {len(X_hockey.columns)}")

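# Apply general feature engineering.
# Note: fit_transform runs on the full dataset here, so the scaler also sees rows that
# later land in the test set; fitting on the training rows only would avoid that leakage.
print("⚙️ Applying feature engineering and scaling...")
X_engineered = feature_engineer.fit_transform(X_hockey, y)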

print(f"✅ Final engineered features: {len(X_engineered.columns)}")

# Show a summary of the fitted feature-engineering pipeline
feature_info = feature_engineer.get_feature_importance_data()
print(f"\n📋 Feature Engineering Summary:")
print(f"  Total features: {feature_info['num_features']}")
print(f"  Scaler type: {feature_info['scaler_type']}")
print(f"  Pipeline fitted: {feature_info['is_fitted']}")

# Display feature groups
feature_groups = HockeyFeatures.get_feature_groups()
print(f"\n📊 Feature Groups Available:")
for group_name, patterns in feature_groups.items():
    # Count each column once, even if it matches several patterns in the group
    matching_features = [col for col in X_engineered.columns
                         if any(pattern in col for pattern in patterns)]
    if matching_features:
        print(f"  {group_name}: {len(matching_features)} features")

## Train-Test Split

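Hold out 20% of the data for evaluation. The `train_test_split` call below produces a random split, not a temporal one; if the rows span multiple seasons, splitting by season is the safer way to keep future information out of training.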
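
A temporal split trains on earlier seasons and tests on later ones, so the model is never evaluated on seasons it has already seen. The sketch below shows one way to do this, assuming the dataset carries a `season` column; that column name and the `temporal_split` helper are hypothetical and do not appear elsewhere in this notebook.

```python
import pandas as pd

def temporal_split(df: pd.DataFrame, season_col: str = "season", n_test_seasons: int = 1):
    """Hypothetical helper: hold out the most recent seasons as the test set."""
    seasons = sorted(df[season_col].unique())
    if len(seasons) <= n_test_seasons:
        raise ValueError("Need more seasons than n_test_seasons to split.")
    test_seasons = seasons[-n_test_seasons:]
    is_test = df[season_col].isin(test_seasons)
    # Earlier seasons go to training, the latest ones to testing.
    return df[~is_test], df[is_test]

demo = pd.DataFrame({
    "season": [2020, 2020, 2021, 2021, 2022, 2022],
    "ppg_1": [0.5, 0.9, 0.6, 1.0, 0.7, 1.1],
    "target_points": [0.6, 1.0, 0.7, 1.1, 0.8, 1.2],
})
train_df, test_df = temporal_split(demo)
```

With a split like this, any scaler or encoder should also be fitted on the training rows only and then applied to the test rows.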

In [None]:
from sklearn.model_selection import train_test_split

# Random 80/20 split (rows are shuffled, so this is not a temporal split)
X_train, X_test, y_train, y_test = train_test_split(
    X_engineered, y, test_size=0.2, random_state=42, stratify=None
)

print(f"📊 Data Split:")
print(f"  Training set: {X_train.shape[0]} samples")
print(f"  Test set: {X_test.shape[0]} samples")
print(f"  Features: {X_train.shape[1]}")

# Check target distribution
print(f"\n🎯 Target Distribution:")
print(f"  Training mean: {y_train.mean():.3f} ± {y_train.std():.3f}")
print(f"  Test mean: {y_test.mean():.3f} ± {y_test.std():.3f}")

# Convert to numpy arrays for sklearn compatibility
X_train_np = X_train.values
X_test_np = X_test.values
y_train_np = y_train.values
y_test_np = y_test.values

feature_names = X_train.columns.tolist()

print(f"✅ Data prepared for model training")

## Baseline Models

Start with simple, interpretable baseline models.
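
What `tune=True, cv=5` does inside `BaselineModels` is not shown in this notebook. A common way to tune a baseline such as Ridge regression is a cross-validated grid search, sketched below with scikit-learn on synthetic data. The alpha grid and the synthetic data are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the engineered features.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
true_coef = np.array([0.5, -0.2, 0.0, 0.3, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=200)

# 5-fold cross-validated search over the regularization strength.
search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["alpha"]
cv_rmse = -search.best_score_  # scikit-learn reports negated RMSE
```

Because the search only uses the training data, the test set stays untouched until the final evaluation.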

In [None]:
print("🎯 Training Baseline Models...")

# Initialize baseline models
baseline = BaselineModels(random_state=42)

# Get all baseline models
baseline_models = baseline.get_all_baseline_models()
print(f"📊 Training {len(baseline_models)} baseline models...")

# Train models (with hyperparameter tuning)
fitted_baseline = baseline.fit_all_models(X_train_np, y_train_np, tune=True, cv=5)

print(f"✅ Successfully trained {len(fitted_baseline)} baseline models")

# Evaluate baseline models
print("\n📈 Evaluating baseline models...")
baseline_scores = baseline.evaluate_models(X_test_np, y_test_np)

# Display results
baseline_summary = baseline.get_model_summary()
print("\n🏆 Baseline Model Results:")
display(baseline_summary.round(4))

# Get best baseline model
best_baseline_name, best_baseline_model, best_baseline_score = baseline.get_best_model('rmse')
print(f"\n🥇 Best baseline model: {best_baseline_name} (RMSE: {best_baseline_score:.4f})")

## Advanced Models

Train more sophisticated models for better performance.

In [None]:
print("🚀 Training Advanced Models...")

# Initialize advanced models
advanced = AdvancedModels(random_state=42)

# Get all advanced models
advanced_models = advanced.get_all_advanced_models()
print(f"📊 Training {len(advanced_models)} advanced models...")

# Train models (with hyperparameter tuning for key models)
fitted_advanced = advanced.fit_all_models(X_train_np, y_train_np, tune=True, cv=5)

print(f"✅ Successfully trained {len(fitted_advanced)} advanced models")

# Evaluate advanced models
print("\n📈 Evaluating advanced models...")
advanced_scores = advanced.evaluate_models(X_test_np, y_test_np)

# Display results
advanced_summary = advanced.get_model_summary()
print("\n🏆 Advanced Model Results:")
display(advanced_summary.round(4))

# Get best advanced model
best_advanced_name, best_advanced_model, best_advanced_score = advanced.get_best_model('rmse')
print(f"\n🥇 Best advanced model: {best_advanced_name} (RMSE: {best_advanced_score:.4f})")

## Ensemble Models

Combine multiple models for potentially better performance.
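
How `EnsembleModels` builds its custom weighted ensemble is not shown here. One common scheme is to weight each model by the inverse of its validation RMSE, so that more accurate models contribute more to the blend. The sketch below illustrates that idea with NumPy; the helper names and the toy predictions are hypothetical.

```python
import numpy as np

def inverse_rmse_weights(val_preds: dict, y_val: np.ndarray) -> dict:
    """Hypothetical helper: weight each model by 1 / validation RMSE, normalized to sum to 1."""
    inv = {}
    for name, pred in val_preds.items():
        rmse = float(np.sqrt(np.mean((np.asarray(pred) - y_val) ** 2)))
        inv[name] = 1.0 / max(rmse, 1e-12)  # guard against division by zero
    total = sum(inv.values())
    return {name: w / total for name, w in inv.items()}

def weighted_average(preds: dict, weights: dict) -> np.ndarray:
    """Blend per-model predictions using the given weights."""
    return sum(weights[name] * np.asarray(pred) for name, pred in preds.items())

y_val = np.array([0.5, 1.0, 0.8])
val_preds = {
    "model_a": np.array([0.6, 1.1, 0.9]),  # every prediction is off by 0.1
    "model_b": np.array([0.7, 1.2, 1.0]),  # every prediction is off by 0.2
}
weights = inverse_rmse_weights(val_preds, y_val)
blended = weighted_average(val_preds, weights)
```

In this toy case `model_a` has half the error of `model_b`, so it receives roughly twice the weight.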

In [None]:
print("🎭 Training Ensemble Models...")

# Initialize ensemble models
ensemble = EnsembleModels(random_state=42)

# Get all ensemble models
ensemble_models = ensemble.get_all_ensemble_models()
print(f"📊 Training {len(ensemble_models)} ensemble models...")

# Train ensemble models
fitted_ensembles = ensemble.fit_ensemble_models(X_train_np, y_train_np)

print(f"✅ Successfully trained {len(fitted_ensembles)} ensemble models")

# Create adaptive ensemble
print("\n🧠 Creating adaptive ensemble...")
adaptive_ensemble = ensemble.create_adaptive_ensemble(X_train_np, y_train_np, validation_size=0.2)
if adaptive_ensemble:
    fitted_ensembles['adaptive'] = adaptive_ensemble
    print("✅ Adaptive ensemble created")

# Evaluate ensemble models
print("\n📈 Evaluating ensemble models...")
ensemble_scores = ensemble.evaluate_ensembles(X_test_np, y_test_np)

# Display results
ensemble_summary = ensemble.get_ensemble_summary()
print("\n🏆 Ensemble Model Results:")
display(ensemble_summary.round(4))

# Get best ensemble model
best_ensemble_name, best_ensemble_model, best_ensemble_score = ensemble.get_best_ensemble('rmse')
print(f"\n🥇 Best ensemble model: {best_ensemble_name} (RMSE: {best_ensemble_score:.4f})")

## Comprehensive Model Evaluation

Compare every fitted model on the held-out test set and report RMSE, R², and MAE for each.
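
The comparison reports RMSE, MAE, and R². As a reference for how these are defined, the sketch below computes all three with NumPy. The `regression_metrics` helper is hypothetical and is not part of `ModelEvaluator`.

```python
import numpy as np

def regression_metrics(y_true, y_pred) -> dict:
    """Hypothetical helper: compute RMSE, MAE, and R² for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    rmse = float(np.sqrt(np.mean(residuals ** 2)))   # penalizes large errors more
    mae = float(np.mean(np.abs(residuals)))          # average absolute miss
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    # R² is undefined when the target has zero variance.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"rmse": rmse, "mae": mae, "r2": r2}

metrics = regression_metrics([0.5, 1.0, 1.5], [0.6, 0.9, 1.5])
```

RMSE and MAE are in the same units as the target (points per game), while R² is unitless and compares the model against always predicting the mean.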

In [None]:
print("📊 Comprehensive Model Evaluation...")

# Initialize evaluator
evaluator = ModelEvaluator()

# Combine all fitted models
all_models = {}
all_models.update({f"baseline_{k}": v for k, v in fitted_baseline.items()})
all_models.update({f"advanced_{k}": v for k, v in fitted_advanced.items()})
all_models.update({f"ensemble_{k}": v for k, v in fitted_ensembles.items()})

print(f"🔍 Evaluating {len(all_models)} total models...")

# Generate comprehensive comparison
comparison_results = evaluator.compare_models(all_models, X_test_np, y_test_np)

print("\n🏆 Final Model Comparison (Top 10):")
display(comparison_results.head(10).round(4))

# Find overall best model
overall_best = comparison_results.iloc[0]
print(f"\n🥇 Overall Best Model: {overall_best['model']}")
print(f"   RMSE: {overall_best['rmse']:.4f}")
print(f"   R²: {overall_best['r2']:.4f}")
print(f"   MAE: {overall_best['mae']:.4f}")

## Visualization and Analysis

In [None]:
# Model comparison visualization
print("📈 Creating model comparison visualizations...")
evaluator.plot_model_comparison(figsize=(15, 6))

In [None]:
# Predictions vs actual for top 6 models
top_models = comparison_results.head(6)['model'].tolist()
print(f"📊 Plotting predictions vs actual for top {len(top_models)} models...")
evaluator.plot_predictions_vs_actual(model_names=top_models, figsize=(18, 12))

In [None]:
# Residual analysis for top models
print("🔍 Residual analysis for top models...")
evaluator.plot_residuals(model_names=top_models[:4], figsize=(16, 10))

In [None]:
# Feature importance for interpretable models
interpretable_models = {
    'Random Forest': fitted_advanced.get('random_forest'),
    'Decision Tree': fitted_baseline.get('decision_tree')
}

# Add XGBoost if available
if 'xgboost' in fitted_advanced:
    interpretable_models['XGBoost'] = fitted_advanced['xgboost']

interpretable_models = {k: v for k, v in interpretable_models.items() if v is not None}

if interpretable_models:
    print(f"🔧 Feature importance analysis for {len(interpretable_models)} models...")
    evaluator.plot_feature_importance(
        interpretable_models, feature_names, top_k=15, figsize=(16, 10)
    )
else:
    print("⚠️ No interpretable models available for feature importance analysis")

## Hockey-Specific Analysis

In [None]:
# Analyze performance by player position
print("🏒 Hockey-Specific Analysis...")

# Get the best model for analysis
best_model = all_models[overall_best['model']]
best_predictions = best_model.predict(X_test_np)

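# Create analysis dataframe.
# X_test holds engineered features (scaled with scaler_type='standard'), so read
# role and age from the original rows instead; y_test keeps df's index labels.
test_meta = df.loc[y_test.index]
analysis_df = pd.DataFrame({
    'actual': y_test_np,
    'predicted': best_predictions,
    'role': test_meta['role'].values if 'role' in test_meta.columns else 'Unknown',
    'age': test_meta['age'].values if 'age' in test_meta.columns else np.nan
})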

analysis_df['residual'] = analysis_df['actual'] - analysis_df['predicted']
analysis_df['absolute_error'] = np.abs(analysis_df['residual'])

# Performance by position
if 'role' in analysis_df.columns and analysis_df['role'].nunique() > 1:
    print("\n📊 Model Performance by Position:")
    position_performance = analysis_df.groupby('role').agg({
        'absolute_error': ['mean', 'std'],
        'residual': ['mean', 'std']
    }).round(4)
    display(position_performance)
    
    # Visualize performance by position
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))
    
    # Box plot of errors by position
    analysis_df.boxplot(column='absolute_error', by='role', ax=axes[0])
    axes[0].set_title('Prediction Error by Position')
    axes[0].set_xlabel('Position')
    axes[0].set_ylabel('Absolute Error')
    
    # Scatter plot of actual vs predicted by position
    for role in analysis_df['role'].unique():
        role_data = analysis_df[analysis_df['role'] == role]
        axes[1].scatter(role_data['actual'], role_data['predicted'], 
                       alpha=0.6, label=f'Position {role}')
    
    axes[1].plot([analysis_df['actual'].min(), analysis_df['actual'].max()], 
                [analysis_df['actual'].min(), analysis_df['actual'].max()], 
                'r--', label='Perfect Prediction')
    axes[1].set_xlabel('Actual Points per Game')
    axes[1].set_ylabel('Predicted Points per Game')
    axes[1].set_title('Predictions by Position')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

# Performance by age groups
if not analysis_df['age'].isna().all():
    analysis_df['age_group'] = pd.cut(analysis_df['age'], 
                                     bins=[0, 23, 27, 31, 50], 
                                     labels=['Young (≤23)', 'Prime (24-27)', 'Veteran (28-31)', 'Old (32+)'])
    
    print("\n📊 Model Performance by Age Group:")
    age_performance = analysis_df.groupby('age_group').agg({
        'absolute_error': ['mean', 'std', 'count'],
        'residual': ['mean']
    }).round(4)
    display(age_performance)

## Model Insights and Recommendations

In [None]:
# Generate comprehensive evaluation report
print("📋 Generating Comprehensive Evaluation Report...")

evaluation_report = evaluator.generate_evaluation_report(
    all_models, X_test_np, y_test_np, feature_names
)

print("\n📊 Evaluation Summary:")
summary = evaluation_report['summary']
for key, value in summary.items():
    print(f"  {key}: {value}")

print("\n🏆 Best Models:")
best_models = evaluation_report['best_models']
for metric, model in best_models.items():
    print(f"  {metric}: {model}")

print("\n💡 Recommendations:")
for i, recommendation in enumerate(evaluation_report['recommendations'], 1):
    print(f"  {i}. {recommendation}")

## Save Best Model for Production Use

In [None]:
import joblib

# Save the best model and feature engineer
models_dir = Path("models_saved")
models_dir.mkdir(exist_ok=True)

# Save best model
best_model_name = overall_best['model']
best_model_obj = all_models[best_model_name]

model_path = models_dir / f"best_nhl_model_{best_model_name.replace('_', '-')}.joblib"
joblib.dump(best_model_obj, model_path)

# Save feature engineer
feature_engineer_path = models_dir / "feature_engineer.joblib"
joblib.dump(feature_engineer, feature_engineer_path)

# Save model metadata
metadata = {
    'best_model_name': best_model_name,
    'performance_metrics': {
        'rmse': float(overall_best['rmse']),
        'r2': float(overall_best['r2']),
        'mae': float(overall_best['mae'])
    },
    'feature_names': feature_names,
    'training_samples': len(X_train),
    'test_samples': len(X_test),
    'target_mean': float(y.mean()),
    'target_std': float(y.std())
}

metadata_path = models_dir / "model_metadata.json"
import json
with open(metadata_path, 'w') as f:
    json.dump(metadata, f, indent=2)

print(f"💾 Model artifacts saved:")
print(f"  Best model: {model_path}")
print(f"  Feature engineer: {feature_engineer_path}")
print(f"  Metadata: {metadata_path}")

print(f"\n🎯 Production Ready!")
print(f"Best model: {best_model_name}")
print(f"Performance: RMSE={overall_best['rmse']:.4f}, R²={overall_best['r2']:.4f}")

## Example: Making Predictions on New Data
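
When predictions are made from a bare NumPy array (`.values`), column names are lost and the model cannot check that features arrive in the training order. One way to guard against this is to align new data to the `feature_names` list saved in the metadata before predicting. The sketch below assumes a hypothetical helper, `align_to_training_features`, and toy column names.

```python
import pandas as pd

def align_to_training_features(X_new: pd.DataFrame, feature_names: list) -> pd.DataFrame:
    """Hypothetical helper: enforce the training-time column set and order."""
    missing = [c for c in feature_names if c not in X_new.columns]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    # Drop any extra columns and reorder to match training.
    return X_new[feature_names]

X_new = pd.DataFrame({"ppg_1": [0.8], "extra": [1], "age": [24]})
aligned = align_to_training_features(X_new, ["age", "ppg_1"])
```

Failing loudly on missing columns is usually preferable to filling them silently, since a silent fill can produce plausible-looking but wrong predictions.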

In [None]:
# Example of how to use the saved model for predictions
print("🔮 Example: Making Predictions on New Data")

# Load saved model and feature engineer
loaded_model = joblib.load(model_path)
loaded_feature_engineer = joblib.load(feature_engineer_path)

# Use current season data if available
current_file = processed_dir / f"current_season_{config.data.current_season}.parquet"

if current_file.exists():
    print(f"📊 Loading current season data for predictions...")
    current_df = pd.read_parquet(current_file)
    
    # Prepare features (same preprocessing as training)
    exclude_cols = ['player_id', 'name'] + (['target_points'] if 'target_points' in current_df.columns else [])
    X_current_raw = current_df.drop(columns=[col for col in exclude_cols if col in current_df.columns])
    
    # Apply same feature engineering
    X_current_hockey = HockeyFeatures.create_all_hockey_features(X_current_raw)
    X_current_processed = loaded_feature_engineer.transform(X_current_hockey)
    
    # Make predictions
    predictions = loaded_model.predict(X_current_processed.values)
    
    # Create results dataframe
    results_df = pd.DataFrame({
        'player_name': current_df['name'].values,
        'position': current_df['role'].values if 'role' in current_df.columns else 'Unknown',
        'predicted_ppg': predictions
    })
    
    # Sort by predicted performance
    results_df = results_df.sort_values('predicted_ppg', ascending=False)
    
    print(f"\n🏆 Top 10 Predicted Performers for {config.data.current_season}:")
    display(results_df.head(10))
    
    # Save predictions
    predictions_path = models_dir / f"predictions_{config.data.current_season}.csv"
    results_df.to_csv(predictions_path, index=False)
    print(f"\n💾 Predictions saved to: {predictions_path}")
    
else:
    print("📝 No current season data available for predictions.")
    print("Run the data pipeline first to download current season data.")

print("\n✅ Model training and evaluation complete!")
print("🚀 Ready for NHL pool optimization!")

## Summary

### 🎯 **What We Accomplished:**

1. **✅ Data Pipeline**: Loaded and processed NHL player data
2. **✅ Feature Engineering**: Created hockey-specific features
3. **✅ Baseline Models**: Trained simple, interpretable models
4. **✅ Advanced Models**: Applied sophisticated ML algorithms
5. **✅ Ensemble Methods**: Combined models for better performance
6. **✅ Evaluation**: Comprehensive model comparison and analysis
7. **✅ Production Ready**: Saved best model for deployment

### 📊 **Model Performance:**
- **Best Model**: See the final model comparison table above
- **Key Insights**: See the error breakdowns by position and age group above
- **Ready for**: Pool optimization and player selection

### 🚀 **Next Steps:**
1. Use predictions for team optimization
2. Integrate with salary constraints 
3. Build automated prediction pipeline
4. Monitor model performance over time

**Your NHL prediction models are ready to help build the optimal fantasy team! 🏒**