# Premier League Predictions - Multi-Algorithm Comparison

## Business Objectives Overview

This notebook compares multiple ML algorithms across **3 distinct business objectives**:

1. **BO1**: Season Ranking Prediction (Regression) - Predict final league position 1-20
2. **BO2**: Match Winner Prediction (Classification) - Predict Home/Draw/Away for each match
3. **BO3**: Champions League Qualification (Binary Classification) - Identify Top 4 teams

Each objective is paired with its own dataset and evaluation metric: MAE for BO1, accuracy for BO2, and F1-score for BO3.

In [None]:
print("="*60)
print("🎉 FINAL PROJECT CONCLUSION")
print("="*60)
print(f"1. Season Ranking (BO1) Winner: {best_bo1['Model']}")
print(f"2. Match Outcome (BO2) Winner:  {best_bo2['Model']}")
print(f"3. Champions League Qualification (BO3) Winner: {best_bo3['Model']}")
print("="*60)


## 5. Final Comparison & Conclusion

Summary of the best-performing algorithm for each business objective.
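
The per-objective winners can also be gathered into a single table for easier reading. The sketch below is one possible way to do that; `summarize_winners` is a hypothetical helper, it assumes each winner is a dict with a `'Model'` key plus its metric key (`'MAE'`, `'Accuracy'`, `'F1'`), and the values passed in are placeholders rather than results from this notebook.

```python
import pandas as pd


def summarize_winners(best_bo1, best_bo2, best_bo3):
    """Build a one-row-per-objective summary table from the winner dicts."""
    rows = [
        {"Objective": "BO1: Season Ranking", "Model": best_bo1["Model"],
         "Metric": "MAE", "Score": best_bo1["MAE"]},
        {"Objective": "BO2: Match Winner", "Model": best_bo2["Model"],
         "Metric": "Accuracy", "Score": best_bo2["Accuracy"]},
        {"Objective": "BO3: Top 4 Qualification", "Model": best_bo3["Model"],
         "Metric": "F1", "Score": best_bo3["F1"]},
    ]
    return pd.DataFrame(rows)


# Placeholder inputs for demonstration only; not results from this notebook.
demo_summary = summarize_winners(
    {"Model": "model_a", "MAE": 1.0},
    {"Model": "model_b", "Accuracy": 0.5},
    {"Model": "model_c", "F1": 0.5},
)
print(demo_summary.to_string(index=False))
```

In the notebook itself, the call would presumably take `best_bo1`, `best_bo2`, and `best_bo3` directly once the earlier cells have run.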


In [None]:
# Prepare Data for BO3
# Create Binary Target: 1 if the team finished in the Top 4 (Pos <= 4), 0 otherwise
y_top4_train = (y_train <= 4).astype(int)
y_top4_test = (y_test <= 4).astype(int)

# Define Models (class_weight='balanced' offsets the 4-vs-16 class imbalance)
models_bo3 = {
    'SVM': SVC(kernel='rbf', class_weight='balanced', random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42)
}

results_bo3 = []

print("🏆 BO3: Champions League Qualification Results")
print("-" * 40)

for name, model in models_bo3.items():
    model.fit(X_train, y_top4_train)
    preds = model.predict(X_test)
    recall = recall_score(y_top4_test, preds)
    f1 = f1_score(y_top4_test, preds)
    results_bo3.append({'Model': name, 'Recall': recall, 'F1': f1})
    print(f"{name}: Recall = {recall:.4f}, F1 = {f1:.4f}")

# Select the winner by F1, the metric stated for this objective
best_bo3 = max(results_bo3, key=lambda x: x['F1'])
print(f"\n✅ Winner BO3: {best_bo3['Model']} (F1: {best_bo3['F1']:.4f})")


## 4. BO3: Champions League Qualification Prediction (Binary Classification)

- **Objective**: Identify teams that will finish in the Top 4 (Champions League spots)
- **Dataset**: `team_season_aggregated.csv`
- **Type**: Binary Classification (Top 4 = 1, Others = 0)
- **Metric**: F1-Score - balances precision and recall; higher is better
- **Algorithms**: SVM (RBF), Random Forest, Gradient Boosting
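
To illustrate why F1 suits this objective: only 4 of 20 teams qualify each season, so a model that predicts "no qualification" for every team reaches 80% accuracy while finding no Top 4 team at all. The sketch below uses made-up labels (not project data) to compute precision, recall, and F1 by hand and compare them with scikit-learn.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels for one 20-team season: 1 = Top 4, 0 = other.
y_true_demo = [1, 1, 1, 1] + [0] * 16
# Hypothetical predictions: 3 of the 4 qualifiers found, plus 1 false alarm.
y_pred_demo = [1, 1, 1, 0, 1] + [0] * 15

pairs = list(zip(y_true_demo, y_pred_demo))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1_manual:.2f}")
print(f"sklearn F1={f1_score(y_true_demo, y_pred_demo):.2f}")

# A trivial "nobody qualifies" model: high accuracy, zero F1.
all_negative = [0] * 20
acc_trivial = accuracy_score(y_true_demo, all_negative)
f1_trivial = f1_score(y_true_demo, all_negative, zero_division=0)
print(f"trivial model: accuracy={acc_trivial:.2f} F1={f1_trivial:.2f}")
```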


In [None]:
# Prepare Data for BO2
# Simple features for demonstration: Team Encodings
match_features = ['HomeTeam_le', 'AwayTeam_le', 'Season_encoded']
match_target = 'FTR_encoded'

# Train/Test Split (Time-based)
train_mask_m = df_match['Season'] != '2024-25'
test_mask_m = df_match['Season'] == '2024-25'

X_train_m = df_match[train_mask_m][match_features]
y_train_m = df_match[train_mask_m][match_target]
X_test_m = df_match[test_mask_m][match_features]
y_test_m = df_match[test_mask_m][match_target]

# Define Models
models_bo2 = {
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'XGBoost': xgb.XGBClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(kernel='rbf', random_state=42)
}

results_bo2 = []

print("üèÜ BO2: Match Outcome Prediction Results")
print("-" * 40)

for name, model in models_bo2.items():
    model.fit(X_train_m, y_train_m)
    preds = model.predict(X_test_m)
    acc = accuracy_score(y_test_m, preds)
    results_bo2.append({'Model': name, 'Accuracy': acc})
    print(f"{name}: Accuracy = {acc:.4f}")

best_bo2 = max(results_bo2, key=lambda x: x['Accuracy'])
print(f"\n✅ Winner BO2: {best_bo2['Model']} (Accuracy: {best_bo2['Accuracy']:.4f})")


## 3. BO2: Match Winner Prediction (Classification)

- **Objective**: Predict the result of each match (Home Win / Draw / Away Win)
- **Dataset**: `processed_premier_league_combined.csv` (~9500 matches)
- **Type**: Multi-class Classification
- **Metric**: Accuracy - higher is better
- **Algorithms**: SVM (RBF), Random Forest, XGBoost
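
Accuracy is easier to interpret next to a trivial baseline that always predicts the most frequent outcome. The sketch below shows one way to compute such a baseline with scikit-learn's `DummyClassifier`. The labels are synthetic, and both the class proportions and the 0/1/2 encoding are assumptions made for illustration, not properties of the match dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic outcome labels (assumed encoding: 0 = Home, 1 = Draw, 2 = Away).
y_train_demo = np.array([0] * 45 + [1] * 25 + [2] * 30)
y_test_demo = np.array([0] * 9 + [1] * 5 + [2] * 6)

# DummyClassifier ignores the features, so a constant column is enough.
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_demo, y_train_demo)
baseline_acc = accuracy_score(y_test_demo, baseline.predict(X_test_demo))
print(f"Majority-class baseline accuracy: {baseline_acc:.4f}")
```

A model in this section is only informative if its accuracy clearly exceeds the baseline computed on the same test split.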


In [None]:
# Prepare Data for BO1
feature_cols = [
    'Wins', 'Draws', 'Losses', 'Goals_Scored', 'Goals_Conceded', 
    'Goal_Difference', 'Points', 'Win_Rate'
]
target_col = 'Final_Position'

# Train/Test Split (Time-based)
train_mask = df_season['Season'] != '2024-25'
test_mask = df_season['Season'] == '2024-25'

X_train = df_season[train_mask][feature_cols]
y_train = df_season[train_mask][target_col]
X_test = df_season[test_mask][feature_cols]
y_test = df_season[test_mask][target_col]

# Define Models
models_bo1 = {
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'XGBoost': xgb.XGBRegressor(n_estimators=100, random_state=42),
    'KNN': KNeighborsRegressor(n_neighbors=7)
}

results_bo1 = []

print("üèÜ BO1: Season Ranking Prediction Results")
print("-" * 40)

for name, model in models_bo1.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    mae = mean_absolute_error(y_test, preds)
    results_bo1.append({'Model': name, 'MAE': mae})
    print(f"{name}: MAE = {mae:.4f}")

best_bo1 = min(results_bo1, key=lambda x: x['MAE'])
print(f"\n✅ Winner BO1: {best_bo1['Model']} (MAE: {best_bo1['MAE']:.4f})")


## 2. BO1: Season Final Position Prediction (Regression)

- **Objective**: Predict the exact final league position (1-20)
- **Dataset**: `team_season_aggregated.csv`
- **Type**: Regression
- **Metric**: MAE (Mean Absolute Error) - lower is better
- **Algorithms**: Random Forest, XGBoost, KNN
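
MAE is the average absolute gap between predicted and actual positions, so an MAE of 1.0 means predictions are off by one league place on average. As a small worked example with made-up positions (not project data), the sketch below computes it by hand and checks the result against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted final positions for five teams.
actual_pos = np.array([1, 2, 3, 4, 5])
predicted_pos = np.array([2.0, 1.5, 3.0, 6.0, 4.5])

# Absolute errors are 1.0, 0.5, 0.0, 2.0, 0.5; their mean is the MAE.
manual_mae = np.mean(np.abs(actual_pos - predicted_pos))
sklearn_mae = mean_absolute_error(actual_pos, predicted_pos)
print(f"manual MAE = {manual_mae:.2f}, sklearn MAE = {sklearn_mae:.2f}")
```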


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, accuracy_score, recall_score, f1_score, precision_score, classification_report
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
import xgboost as xgb

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Load Datasets
season_path = Path('../data/processed/team_season_aggregated.csv')
match_path = Path('../data/processed/processed_premier_league_combined.csv')

if not season_path.exists():
    season_path = Path('data/processed/team_season_aggregated.csv')
    match_path = Path('data/processed/processed_premier_league_combined.csv')

print("="*80)
print("LOADING DATASETS")
print("="*80)
print(f"Season Data: {season_path}")
df_season = pd.read_csv(season_path)
print(f"  ✅ Loaded {len(df_season)} team-seasons")

print(f"\nMatch Data: {match_path}")
df_match = pd.read_csv(match_path)
print(f"  ✅ Loaded {len(df_match)} matches")
print("="*80)


## 1. Setup and Data Loading

Load both datasets:
- `team_season_aggregated.csv` → for **BO1** (Season Ranking) & **BO3** (Top 4 Qualification)
- `processed_premier_league_combined.csv` → for **BO2** (Match Winner Prediction)