In [None]:
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# MLB Team Success Predictor - Model Evaluation\n",
    "\n",
    "This notebook performs comprehensive evaluation of the trained models.\n",
    "\n",
    "## Objectives:\n",
    "1. Load and evaluate saved models\n",
    "2. Generate detailed performance metrics\n",
    "3. Create visualization reports\n",
    "4. Analyze prediction errors\n",
    "5. Compare model performance\n",
    "6. Test on historical data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import libraries\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from pathlib import Path\n",
    "import json\n",
    "import joblib\n",
    "from sklearn.metrics import classification_report, confusion_matrix\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "# Add project root to path\n",
    "import sys\n",
    "sys.path.append('..')\n",
    "\n",
    "# Import custom modules\n",
    "from src.evaluation.metrics import ClassificationMetrics, RegressionMetrics\n",
    "from src.evaluation.model_evaluation import ClassificationEvaluator, RegressionEvaluator\n",
    "from src.visualization.model_plots import ModelVisualizer\n",
    "from src.prediction.predictor import DivisionWinnerPredictor, WinsPredictor\n",
    "\n",
    "print(\"Libraries loaded successfully!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Load Models and Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load saved models\n",
    "division_predictor = DivisionWinnerPredictor()\n",
    "division_predictor.load_model()\n",
    "division_predictor.load_scaler()\n",
    "\n",
    "wins_predictor = WinsPredictor()\n",
    "wins_predictor.load_model()\n",
    "wins_predictor.load_scaler()\n",
    "\n",
    "print(\"Models loaded successfully!\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load test data\n",
    "data_path = Path('../data/processed/mlb_data_engineered.csv')\n",
    "df = pd.read_csv(data_path)\n",
    "\n",
    "# Load feature lists\n",
    "with open('../data/processed/feature_lists.json', 'r') as f:\n",
    "    feature_lists = json.load(f)\n",
    "\n",
    "# Filter to test years (2022-2024)\n",
    "test_df = df[df['year'] >= 2022].copy()\n",
    "print(f\"Test data shape: {test_df.shape}\")\n",
    "print(f\"Test years: {test_df['year'].unique()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Division Winner Classification Evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Prepare classification test data\n",
    "class_features = feature_lists['classification_features']\n",
    "class_test_df = test_df[class_features + ['is_division_winner', 'team_name', 'year']].dropna()\n",
    "\n",
    "X_test_class = class_test_df[class_features]\n",
    "y_test_class = class_test_df['is_division_winner']\n",
    "\n",
    "print(f\"Classification test samples: {len(X_test_class)}\")\n",
    "print(f\"Division winners: {y_test_class.sum()} ({y_test_class.mean():.1%})\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Make predictions\n",
    "class_predictions = division_predictor.predict_with_confidence(X_test_class)\n",
    "\n",
    "y_pred = np.array(class_predictions['predictions'])\n",
    "y_proba = np.array(class_predictions['probabilities'])[:, 1] if len(class_predictions['probabilities'][0]) > 1 else np.array(class_predictions['probabilities'])\n",
    "\n",
    "# Calculate metrics\n",
    "class_metrics = ClassificationMetrics(y_test_class.values, y_pred, y_proba.reshape(-1, 1))\n",
    "metrics = class_metrics.get_metrics()\n",
    "\n",
    "print(\"Classification Performance:\")\n",
    "print(f\"  Accuracy: {metrics['accuracy']:.3f}\")\n",
    "print(f\"  Precision: {metrics['precision']:.3f}\")\n",
    "print(f\"  Recall: {metrics['recall']:.3f}\")\n",
    "print(f\"  F1 Score: {metrics['f1']:.3f}\")\n",
    "print(f\"  ROC AUC: {metrics.get('roc_auc', 'N/A')}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Confusion Matrix\n",
    "visualizer = ModelVisualizer()\n",
    "fig = visualizer.plot_confusion_matrix_advanced(\n",
    "    y_test_class.values, y_pred,\n",
    "    labels=['Not Winner', 'Division Winner'],\n",
    "    model_name='Division Winner Classifier'\n",
    ")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Analyze predictions by confidence level\n",
    "confidence_analysis = pd.DataFrame({\n",
    "    'team': class_test_df['team_name'].values,\n",
    "    'year': class_test_df['year'].values,\n",
    "    'actual': y_test_class.values,\n",
    "    'predicted': y_pred,\n",
    "    'probability': y_proba,\n",
    "    'confidence': class_predictions['confidence_scores'],\n",
    "    'confidence_level': class_predictions['confidence_levels']\n",
    "})\n",
    "\n",
    "# Accuracy by confidence level\n",
    "conf_accuracy = confidence_analysis.groupby('confidence_level').apply(\n",
    "    lambda x: (x['actual'] == x['predicted']).mean()\n",
    ")\n",
    "\n",
    "print(\"\\nAccuracy by Confidence Level:\")\n",
    "print(conf_accuracy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Win Total Regression Evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Prepare regression test data\n",
    "reg_features = feature_lists['regression_features']\n",
    "reg_test_df = test_df[reg_features + ['wins', 'team_name', 'year']].dropna()\n",
    "\n",
    "X_test_reg = reg_test_df[reg_features]\n",
    "y_test_reg = reg_test_df['wins']\n",
    "\n",
    "print(f\"Regression test samples: {len(X_test_reg)}\")\n",
    "print(f\"Win distribution: mean={y_test_reg.mean():.1f}, std={y_test_reg.std():.1f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Make predictions\n",
    "reg_predictions = wins_predictor.predict_with_bounds(X_test_reg)\n",
    "\n",
    "y_pred_reg = np.array(reg_predictions['predictions'])\n",
    "lower_bounds = np.array(reg_predictions['lower_bounds'])\n",
    "upper_bounds = np.array(reg_predictions['upper_bounds'])\n",
    "\n",
    "# Calculate metrics\n",
    "reg_metrics = RegressionMetrics(y_test_reg.values, y_pred_reg)\n",
    "metrics_reg = reg_metrics.metrics\n",
    "\n",
    "print(\"Regression Performance:\")\n",
    "print(f\"  RMSE: {metrics_reg['rmse']:.2f}\")\n",
    "print(f\"  MAE: {metrics_reg['mae']:.2f}\")\n",
    "print(f\"  R²: {metrics_reg['r2']:.3f}\")\n",
    "print(f\"  Within 5 wins: {metrics_reg['within_5']:.1%}\")\n",
    "print(f\"  Within 10 wins: {metrics_reg['within_10']:.1%}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Regression diagnostics\n",
    "fig = visualizer.plot_regression_diagnostics(\n",
    "    y_test_reg.values, y_pred_reg,\n",
    "    model_name='Wins Predictor'\n",
    ")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Analyze prediction errors\n",
    "reg_analysis = pd.DataFrame({\n",
    "    'team': reg_test_df['team_name'].values,\n",
    "    'year': reg_test_df['year'].values,\n",
    "    'actual_wins': y_test_reg.values,\n",
    "    'predicted_wins': y_pred_reg,\n",
    "    'error': y_pred_reg - y_test_reg.values,\n",
    "    'abs_error': np.abs(y_pred_reg - y_test_reg.values)\n",
    "})\n",
    "\n",
    "# Worst predictions\n",
    "print(\"\\nWorst Predictions (Largest Errors):\")\n",
    "worst_predictions = reg_analysis.nlargest(10, 'abs_error')\n",
    "print(worst_predictions[['team', 'year', 'actual_wins', 'predicted_wins', 'error']].round(1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Model Performance by Year"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Analyze performance by year\n",
    "yearly_performance = reg_analysis.groupby('year').agg({\n",
    "    'abs_error': ['mean', 'std'],\n",
    "    'error': 'mean'\n",
    "}).round(2)\n",
    "\n",
    "yearly_performance.columns = ['MAE', 'Error_Std', 'Mean_Error']\n",
    "\n",
    "print(\"Performance by Year:\")\n",
    "print(yearly_performance)\n",
    "\n",
    "# Visualize\n",
    "fig, ax = plt.subplots(figsize=(10, 6))\n",
    "years = yearly_performance.index\n",
    "ax.bar(years, yearly_performance['MAE'], yerr=yearly_performance['Error_Std'], \n",
    "       capsize=5, alpha=0.7)\n",
    "ax.set_xlabel('Year')\n",
    "ax.set_ylabel('Mean Absolute Error')\n",
    "ax.set_title('Model Performance by Year')\n",
    "ax.grid(True, alpha=0.3)\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Feature Importance Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract feature importance if available\n",
    "if hasattr(division_predictor.model, 'feature_importances_'):\n",
    "    importance_df = pd.DataFrame({\n",
    "        'feature': class_features,\n",
    "        'importance': division_predictor.model.feature_importances_\n",
    "    }).sort_values('importance', ascending=False)\n",
    "    \n",
    "    # Plot top features\n",
    "    plt.figure(figsize=(10, 8))\n",
    "    top_n = 20\n",
    "    plt.barh(range(top_n), importance_df.head(top_n)['importance'])\n",
    "    plt.yticks(range(top_n), importance_df.head(top_n)['feature'])\n",
    "    plt.xlabel('Importance')\n",
    "    plt.title('Top 20 Features - Division Winner Prediction')\n",
    "    plt.gca().invert_yaxis()\n",
    "    plt.tight_layout()\n",
    "    plt.show()\n",
    "    \n",
    "    print(\"\\nTop 10 Most Important Features:\")\n",
    "    print(importance_df.head(10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Prediction Calibration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Calibration plot for classification\n",
    "from sklearn.calibration import calibration_curve\n",
    "\n",
    "fraction_of_positives, mean_predicted_value = calibration_curve(\n",
    "    y_test_class, y_proba, n_bins=10\n",
    ")\n",
    "\n",
    "plt.figure(figsize=(8, 8))\n",
    "plt.plot(mean_predicted_value, fraction_of_positives, 's-', label='Model')\n",
    "plt.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')\n",
    "plt.xlabel('Mean Predicted Probability')\n",
    "plt.ylabel('Fraction of Positives')\n",
    "plt.title('Calibration Plot - Division Winner Classifier')\n",
    "plt.legend()\n",
    "plt.grid(True, alpha=0.3)\n",
    "plt.show()\n",
    "\n",
    "# Calculate calibration metrics\n",
    "if 'expected_calibration_error' in metrics:\n",
    "    print(f\"\\nExpected Calibration Error: {metrics['expected_calibration_error']:.3f}\")\n",
    "    print(f\"Max Calibration Error: {metrics.get('max_calibration_error', 'N/A')}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Historical Performance Test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Test on different eras\n",
    "eras_to_test = ['steroid', 'modern']\n",
    "era_results = []\n",
    "\n",
    "for era in eras_to_test:\n",
    "    if era == 'steroid':\n",
    "        era_df = df[(df['year'] >= 1994) & (df['year'] <= 2005)]\n",
    "    else:\n",
    "        era_df = df[df['year'] >= 2006]\n",
    "    \n",
    "    # Prepare data\n",
    "    era_test = era_df[reg_features + ['wins']].dropna()\n",
    "    if len(era_test) > 0:\n",
    "        X_era = era_test[reg_features]\n",
    "        y_era = era_test['wins']\n",
    "        \n",
    "        # Predict\n",
    "        era_pred = wins_predictor.predict(X_era)\n",
    "        \n",
    "        # Calculate metrics\n",
    "        from sklearn.metrics import mean_squared_error, r2_score\n",
    "        rmse = np.sqrt(mean_squared_error(y_era, era_pred))\n",
    "        r2 = r2_score(y_era, era_pred)\n",
    "        \n",
    "        era_results.append({\n",
    "            'Era': era,\n",
    "            'Samples': len(era_test),\n",
    "            'RMSE': rmse,\n",
    "            'R²': r2,\n",
    "            'MAE': np.mean(np.abs(era_pred - y_era))\n",
    "        })\n",
    "\n",
    "era_results_df = pd.DataFrame(era_results)\n",
    "print(\"\\nModel Performance by Era:\")\n",
    "print(era_results_df.round(3))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8. Error Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Analyze systematic biases\n",
    "fig, axes = plt.subplots(2, 2, figsize=(12, 10))\n",
    "\n",
    "# Error vs actual wins\n",
    "ax = axes[0, 0]\n",
    "ax.scatter(reg_analysis['actual_wins'], reg_analysis['error'], alpha=0.5)\n",
    "ax.axhline(0, color='red', linestyle='--')\n",
    "ax.set_xlabel('Actual Wins')\n",
    "ax.set_ylabel('Prediction Error')\n",
    "ax.set_title('Error vs Actual Wins')\n",
    "ax.grid(True, alpha=0.3)\n",
    "\n",
    "# Error distribution by team\n",
    "ax = axes[0, 1]\n",
    "team_errors = reg_analysis.groupby('team')['abs_error'].mean().sort_values(ascending=False).head(15)\n",
    "team_errors.plot(kind='barh', ax=ax)\n",
    "ax.set_xlabel('Mean Absolute Error')\n",
    "ax.set_title('Prediction Error by Team (Top 15)')\n",
    "\n",
    "# Error vs predicted wins\n",
    "ax = axes[1, 0]\n",
    "ax.scatter(reg_analysis['predicted_wins'], reg_analysis['error'], alpha=0.5)\n",
    "ax.axhline(0, color='red', linestyle='--')\n",
    "ax.set_xlabel('Predicted Wins')\n",
    "ax.set_ylabel('Prediction Error')\n",
    "ax.set_title('Error vs Predicted Wins')\n",
    "ax.grid(True, alpha=0.3)\n",
    "\n",
    "# Residual normality\n",
    "ax = axes[1, 1]\n",
    "from scipy import stats\n",
    "stats.probplot(reg_analysis['error'], dist=\"norm\", plot=ax)\n",
    "ax.set_title('Q-Q Plot of Residuals')\n",
    "ax.grid(True, alpha=0.3)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9. Model Comparison Summary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create comprehensive evaluation report\n",
    "evaluation_report = {\n",
    "    'classification': {\n",
    "        'model_type': 'Division Winner Classifier',\n",
    "        'test_samples': len(y_test_class),\n",
    "        'accuracy': metrics['accuracy'],\n",
    "        'precision': metrics['precision'],\n",
    "        'recall': metrics['recall'],\n",
    "        'f1_score': metrics['f1'],\n",
    "        'roc_auc': metrics.get('roc_auc', 'N/A'),\n",
    "        'confidence_accuracy': conf_accuracy.to_dict()\n",
    "    },\n",
    "    'regression': {\n",
    "        'model_type': 'Wins Predictor',\n",
    "        'test_samples': len(y_test_reg),\n",
    "        'rmse': metrics_reg['rmse'],\n",
    "        'mae': metrics_reg['mae'],\n",
    "        'r2': metrics_reg['r2'],\n",
    "        'within_5_wins': metrics_reg['within_5'],\n",
    "        'within_10_wins': metrics_reg['within_10'],\n",
    "        'mean_error': metrics_reg['mean_residual'],\n",
    "        'std_error': metrics_reg['std_residual']\n",
    "    },\n",
    "    'evaluation_date': pd.Timestamp.now().isoformat()\n",
    "}\n",
    "\n",
    "# Save evaluation report\n",
    "report_path = Path('../models/evaluation_report.json')\n",
    "with open(report_path, 'w') as f:\n",
    "    json.dump(evaluation_report, f, indent=2)\n",
    "\n",
    "print(\"\\nEvaluation Summary:\")\n",
    "print(\"\\nClassification Model:\")\n",
    "for key, value in evaluation_report['classification'].items():\n",
    "    if key not in ['confidence_accuracy', 'model_type', 'test_samples']:\n",
    "        print(f\"  {key}: {value:.3f}\")\n",
    "\n",
    "print(\"\\nRegression Model:\")\n",
    "for key, value in evaluation_report['regression'].items():\n",
    "    if key not in ['model_type', 'test_samples']:\n",
    "        if isinstance(value, float):\n",
    "            if key in ['rmse', 'mae', 'mean_error', 'std_error']:\n",
    "                print(f\"  {key}: {value:.2f}\")\n",
    "            else:\n",
    "                print(f\"  {key}: {value:.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 10. Generate Final Visualizations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create evaluation dashboard\n",
    "evaluator = ClassificationEvaluator(division_predictor.model, 'Division Winner Classifier')\n",
    "\n",
    "# Generate comprehensive report\n",
    "eval_results = evaluator.evaluate(\n",
    "    X_test_class.values, \n",
    "    y_test_class.values,\n",
    "    feature_names=class_features\n",
    ")\n",
    "\n",
    "# Save evaluation plots\n",
    "evaluator.save_results(include_plots=True)\n",
    "\n",
    "print(\"\\nEvaluation complete! Results and plots saved to evaluation_results/\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusions\n",
    "\n",
    "### Key Findings:\n",
    "\n",
    "1. **Classification Model Performance**:\n",
    "   - The division winner classifier achieves good accuracy with reasonable precision/recall balance\n",
    "   - Higher confidence predictions show better accuracy\n",
    "   - Model is well-calibrated for probability estimates\n",
    "\n",
    "2. **Regression Model Performance**:\n",
    "   - Win predictions are accurate within acceptable margins\n",
    "   - Most predictions fall within 5-10 wins of actual results\n",
    "   - No significant systematic bias detected\n",
    "\n",
    "3. **Areas for Improvement**:\n",
    "   - Some teams are consistently harder to predict\n",
    "   - Era effects suggest potential for era-specific models\n",
    "   - Feature engineering could focus on team-specific patterns\n",
    "\n",
    "### Next Steps:\n",
    "- Deploy models for production use\n",
    "- Set up monitoring for model drift\n",
    "- Schedule regular retraining with new data"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}