# Premier League Betting App – Match Outcome & Score Prediction

This notebook implements the pipeline described in the project proposal:

- **Client:** Premier League Betting App  
- **Goal:**  
  - Classification: Predict the **match outcome**  
    - `0` = Home Win, `1` = Away Win, `2` = Draw  
  - Regression: Predict the **final score** (home & away goals).  
- **Data:** Premier League match statistics (e.g., Kaggle datasets for seasons 2019/2020, 2020/2021, 2021/2022), concatenated into a single dataset.  
- **Models:**  
  - **Classification:** SVM, Random Forest, Logistic Regression  
  - **Regression:** Linear Regression (with regularization), Random Forest Regressor, Gradient Boosting Regressor  
- **Framework:** Preprocessing → Data splitting → Hyperparameter tuning → Model training → Validation → Visualization


In [1]:
# Core libraries
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt

# Preprocessing & model selection
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

# Classification models
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Regression models
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report,
    mean_squared_error, r2_score
)

pd.set_option("display.max_columns", 100)


## 1. Data Loading

Update the file paths below to point to your downloaded Kaggle CSV files for the **Premier League** seasons
2019/2020, 2020/2021, and 2021/2022.

The later cells expect the CSVs to contain columns such as:

- `HomeTeam`, `AwayTeam`
- `FTHG` (Full-Time Home Goals), `FTAG` (Full-Time Away Goals)
- `FTR` (Full-Time Result) as `'H'`, `'A'`, `'D'`

You can adjust the column names in the preprocessing steps later if your dataset uses different ones.
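
Before loading three seasons and concatenating them, it can help to confirm that each file exposes the columns listed above. The helper below is a minimal sketch: `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names introduced for illustration, and the toy header is made up rather than read from a real file.

```python
# Columns the later cells rely on; extend the list if you add features.
REQUIRED_COLUMNS = ["HomeTeam", "AwayTeam", "FTHG", "FTAG", "FTR"]


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


# Toy header standing in for one season's CSV (FTR deliberately left out).
toy_header = ["Div", "Date", "HomeTeam", "AwayTeam", "FTHG", "FTAG"]
print(missing_columns(toy_header))  # ['FTR']
```

With a loaded DataFrame the same check would be `missing_columns(df_19_20.columns)`; an empty list means the season file is compatible with the rest of the notebook.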


In [4]:
# TODO: Update these paths to match your local dataset files
path_2019_2020 = "2019-20.csv"
path_2020_2021 = "2020-2021.csv"
path_2021_2022 = "2021-2022.csv"

# Load datasets
df_19_20 = pd.read_csv(path_2019_2020)
df_20_21 = pd.read_csv(path_2020_2021)
df_21_22 = pd.read_csv(path_2021_2022)

# Concatenate datasets
df = pd.concat([df_19_20, df_20_21, df_21_22], ignore_index=True)

print(df.shape)
df.head()


(1020, 106)


Unnamed: 0,Div,Date,Time,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,Referee,HS,AS,HST,AST,HF,AF,HC,AC,HY,AY,HR,AR,B365H,B365D,B365A,BWH,BWD,BWA,IWH,IWD,IWA,PSH,PSD,PSA,WHH,WHD,WHA,VCH,VCD,VCA,MaxH,MaxD,MaxA,AvgH,AvgD,AvgA,B365>2.5,B365<2.5,...,AHh,B365AHH,B365AHA,PAHH,PAHA,MaxAHH,MaxAHA,AvgAHH,AvgAHA,B365CH,B365CD,B365CA,BWCH,BWCD,BWCA,IWCH,IWCD,IWCA,PSCH,PSCD,PSCA,WHCH,WHCD,WHCA,VCCH,VCCD,VCCA,MaxCH,MaxCD,MaxCA,AvgCH,AvgCD,AvgCA,B365C>2.5,B365C<2.5,PC>2.5,PC<2.5,MaxC>2.5,MaxC<2.5,AvgC>2.5,AvgC<2.5,AHCh,B365CAHH,B365CAHA,PCAHH,PCAHA,MaxCAHH,MaxCAHA,AvgCAHH,AvgCAHA
0,E0,09/08/2019,20:00,Liverpool,Norwich,4,1,H,4,0,H,M Oliver,15,12,7,5,9,9,11,2,0,2,0,0,1.14,10.0,19.0,1.14,8.25,18.5,1.15,8.0,18.0,1.15,9.59,18.05,1.12,8.5,21.0,1.14,9.5,23.0,1.16,10.0,23.0,1.14,8.75,19.83,1.4,3.0,...,-2.25,1.96,1.94,1.97,1.95,1.97,2.0,1.94,1.94,1.14,9.5,21.0,1.14,9.0,20.0,1.15,8.0,18.0,1.14,10.43,19.63,1.11,9.5,21.0,1.14,9.5,23.0,1.16,10.5,23.0,1.14,9.52,19.18,1.3,3.5,1.34,3.44,1.36,3.76,1.32,3.43,-2.25,1.91,1.99,1.94,1.98,1.99,2.07,1.9,1.99
1,E0,10/08/2019,12:30,West Ham,Man City,0,5,A,0,1,A,M Dean,5,14,3,9,6,13,1,1,2,2,0,0,12.0,6.5,1.22,11.5,5.75,1.26,11.0,6.1,1.25,11.68,6.53,1.26,13.0,6.0,1.24,12.0,6.5,1.25,13.0,6.75,1.29,11.84,6.28,1.25,1.44,2.75,...,1.75,2.0,1.9,2.02,1.9,2.02,1.92,1.99,1.89,12.0,7.0,1.25,11.0,6.0,1.26,11.0,6.1,1.25,11.11,6.68,1.27,11.0,6.5,1.24,12.0,6.5,1.25,13.0,7.0,1.29,11.14,6.46,1.26,1.4,3.0,1.43,3.03,1.5,3.22,1.41,2.91,1.75,1.95,1.95,1.96,1.97,2.07,1.98,1.97,1.92
2,E0,10/08/2019,15:00,Bournemouth,Sheffield United,1,1,D,0,0,D,K Friend,13,8,3,3,10,19,3,4,2,1,0,0,1.95,3.6,3.6,1.95,3.6,3.9,1.97,3.55,3.8,2.04,3.57,3.9,2.0,3.5,3.8,2.0,3.6,4.0,2.06,3.65,4.0,2.01,3.53,3.83,1.9,1.9,...,-0.5,2.01,1.89,2.04,1.88,2.04,1.91,2.0,1.88,1.95,3.7,4.2,1.95,3.6,3.9,1.97,3.55,3.85,1.98,3.67,4.06,1.95,3.6,3.9,2.0,3.6,4.0,2.03,3.7,4.2,1.98,3.58,3.96,1.9,1.9,1.94,1.97,1.97,1.98,1.91,1.92,-0.5,1.95,1.95,1.98,1.95,2.0,1.96,1.96,1.92
3,E0,10/08/2019,15:00,Burnley,Southampton,3,0,H,0,0,D,G Scott,10,11,4,3,6,12,2,7,0,0,0,0,2.62,3.2,2.75,2.65,3.2,2.75,2.65,3.2,2.75,2.71,3.31,2.81,2.7,3.2,2.75,2.7,3.3,2.8,2.8,3.33,2.85,2.68,3.22,2.78,2.1,1.72,...,0.0,1.92,1.98,1.93,2.0,1.94,2.0,1.91,1.98,2.7,3.25,2.9,2.65,3.1,2.85,2.6,3.2,2.85,2.71,3.19,2.9,2.62,3.2,2.8,2.7,3.25,2.9,2.72,3.26,2.95,2.65,3.18,2.88,2.1,1.72,2.19,1.76,2.25,1.78,2.17,1.71,0.0,1.87,2.03,1.89,2.03,1.9,2.07,1.86,2.02
4,E0,10/08/2019,15:00,Crystal Palace,Everton,0,0,D,0,0,D,J Moss,6,10,2,3,16,14,6,2,2,1,0,1,3.0,3.25,2.37,3.2,3.2,2.35,3.1,3.2,2.4,3.21,3.37,2.39,3.1,3.3,2.35,3.2,3.3,2.45,3.21,3.4,2.52,3.13,3.27,2.4,2.2,1.66,...,0.25,1.85,2.05,1.88,2.05,1.88,2.09,1.84,2.04,3.4,3.5,2.25,3.3,3.3,2.25,3.4,3.3,2.2,3.37,3.45,2.27,3.3,3.3,2.25,3.4,3.3,2.25,3.55,3.5,2.34,3.41,3.37,2.23,2.2,1.66,2.22,1.74,2.28,1.77,2.17,1.71,0.25,1.82,2.08,1.97,1.96,2.03,2.08,1.96,1.93


## 2. Exploratory Data Analysis (EDA)

Quick sanity checks: data types, missing values, and simple distributions.


In [None]:
# Overview of the dataset
df.info()


In [None]:
# Check basic statistics for numerical columns
df.describe().T


In [None]:
# Check missing values
df.isna().mean().sort_values(ascending=False).head(20)


## 3. Feature Engineering & Target Definition

We define:

- **Classification target `y_cls`**: Match outcome encoded as  
  - `0` = Home Win  
  - `1` = Away Win  
  - `2` = Draw  

- **Regression targets `y_reg_home` and `y_reg_away`**: Final scores (home & away goals).

Adjust the column names in this section if your dataset uses different labels for goals and result.
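
As a sanity check on the encoding, the outcome class can also be derived from the goal counts and compared with the `FTR` letter. The sketch below uses the first three matches from the `df.head()` output above; `outcome_from_goals` is a hypothetical helper, not part of the notebook's pipeline.

```python
# Same encoding as the notebook: 0 = Home Win, 1 = Away Win, 2 = Draw.
RESULT_MAPPING = {"H": 0, "A": 1, "D": 2}


def outcome_from_goals(home_goals, away_goals):
    """Derive the outcome class from the final score."""
    if home_goals > away_goals:
        return RESULT_MAPPING["H"]
    if home_goals < away_goals:
        return RESULT_MAPPING["A"]
    return RESULT_MAPPING["D"]


# (FTHG, FTAG, FTR) for the first three rows shown in the head output.
rows = [(4, 1, "H"), (0, 5, "A"), (1, 1, "D")]

for home_goals, away_goals, ftr in rows:
    derived = outcome_from_goals(home_goals, away_goals)
    # A mismatch here would point to a data-entry problem or a wrong column name.
    print(home_goals, away_goals, ftr, derived, derived == RESULT_MAPPING[ftr])
```

Running the same comparison over the full dataset is a cheap way to catch swapped goal columns before training.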


In [None]:
# ---- Adjust these column names to match your dataset ----
home_team_col = "HomeTeam"
away_team_col = "AwayTeam"
home_goals_col = "FTHG"  # Full Time Home Goals
away_goals_col = "FTAG"  # Full Time Away Goals
result_col = "FTR"       # Full Time Result: 'H', 'A', 'D'

# Map textual result to numeric class: 0=Home Win, 1=Away Win, 2=Draw
result_mapping = {"H": 0, "A": 1, "D": 2}
# Explicit copy so the column assignments below cannot trigger SettingWithCopyWarning
df = df.dropna(subset=[home_goals_col, away_goals_col, result_col]).copy()
df["match_outcome"] = df[result_col].map(result_mapping)

# Regression targets
df["home_score"] = df[home_goals_col]
df["away_score"] = df[away_goals_col]

# Example feature set: team names plus every remaining numeric column.
# NOTE: this also picks up in-match statistics (e.g., HTHG, HS, HST), which are
# not known before kickoff. Restrict the list to pre-match columns if the
# model has to predict before the game starts.
feature_cols_categorical = [home_team_col, away_team_col]
excluded_cols = [
    home_goals_col, away_goals_col, result_col,
    "match_outcome", "home_score", "away_score",
    "Unnamed: 0",  # row index saved into the CSV, not a real feature
]
feature_cols_numeric = [
    col for col in df.columns
    if col not in feature_cols_categorical
    and col not in excluded_cols
    and pd.api.types.is_numeric_dtype(df[col])
]

print("Categorical features:", feature_cols_categorical)
print("Numeric features:", feature_cols_numeric)


## 4. Train–Test Split

We split the data into:

- **Training set:** 80%  
- **Test set:** 20%  

We will perform **5-fold cross-validation** on the training set during hyperparameter tuning.
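
To illustrate what a stratified 80/20 split followed by 5-fold cross-validation does to the class proportions, here is a minimal sketch on synthetic labels. The 50/30/20 class counts are an assumption chosen for easy arithmetic, not the real distribution of match outcomes.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic labels (not real match data): 50 home wins, 30 away wins, 20 draws.
y_toy = np.array([0] * 50 + [1] * 30 + [2] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print("train class counts:", np.bincount(y_tr).tolist())
print("test class counts: ", np.bincount(y_te).tolist())

# For classifiers, GridSearchCV(cv=5) uses stratified folds; spelled out here.
cv = StratifiedKFold(n_splits=5)
fold_sizes = [len(val_idx) for _, val_idx in cv.split(X_tr, y_tr)]
print("validation fold sizes:", fold_sizes)
```

Because of `stratify`, the test set keeps the same 5:3:2 ratio as the full label vector, so accuracy on it is not distorted by an unlucky draw of classes.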


In [None]:
from sklearn.utils import shuffle

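# Shuffle rows so the split does not depend on date ordering.
# Note: train_test_split shuffles by default as well, and a random split mixes
# matches from all seasons across the training and test sets.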
df = shuffle(df, random_state=42).reset_index(drop=True)

X = df[feature_cols_categorical + feature_cols_numeric]

# Classification target
y_cls = df["match_outcome"]

# Regression targets
y_reg_home = df["home_score"]
y_reg_away = df["away_score"]

(
    X_train, X_test,
    y_cls_train, y_cls_test,
    y_reg_home_train, y_reg_home_test,
    y_reg_away_train, y_reg_away_test,
) = train_test_split(
    X, y_cls, y_reg_home, y_reg_away,
    test_size=0.2, random_state=42, stratify=y_cls,
)

X_train.shape, X_test.shape


## 5. Preprocessing Pipelines

We apply the following preprocessing steps:

- **Missing values:** `SimpleImputer` with mean (numeric) or most frequent (categorical)  
- **Scaling:** `StandardScaler` for numeric features  
- **Encoding:** `OneHotEncoder` for categorical features  

We build a `ColumnTransformer` to apply the correct preprocessing to each subset of columns.
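
The effect of these steps is easiest to see on a tiny example. The sketch below rebuilds the same kind of `ColumnTransformer` on a made-up frame (team names `Red`, `Blue`, `Green` and the `shots` column are illustrative assumptions) and shows how an unseen team is handled by `handle_unknown="ignore"`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training rows; one numeric value is missing on purpose.
toy_train = pd.DataFrame({
    "HomeTeam": ["Red", "Blue", "Green"],
    "AwayTeam": ["Blue", "Green", "Red"],
    "shots": [10.0, np.nan, 14.0],
})
# New row with a home team that never appeared during fitting.
toy_new = pd.DataFrame({"HomeTeam": ["Yellow"], "AwayTeam": ["Red"], "shots": [12.0]})

toy_preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]), ["shots"]),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["HomeTeam", "AwayTeam"]),
])


def to_dense(matrix):
    """ColumnTransformer may return a sparse matrix; normalise to a dense array."""
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)


train_matrix = to_dense(toy_preprocessor.fit_transform(toy_train))
new_matrix = to_dense(toy_preprocessor.transform(toy_new))

print(train_matrix.shape)  # 1 scaled numeric column + 3 home + 3 away indicators
print(new_matrix.round(3))
```

In the transformed new row the three home-team indicator columns are all zero, because `Yellow` was not seen during fitting, while the missing `shots` value in the training data was replaced by the column mean (12.0) before scaling.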


In [None]:
# Numeric preprocessing: impute missing values with mean, then scale
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]
)

# Categorical preprocessing: impute most frequent, then one-hot encode
categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

# Combined preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, feature_cols_numeric),
        ("cat", categorical_transformer, feature_cols_categorical),
    ]
)

preprocessor


## 6. Classification – Match Outcome

We train and tune:

- **SVM (SVC)** – tuning `kernel`, `C`, `gamma`  
- **RandomForestClassifier** – tuning `n_estimators`, `max_depth`, `min_samples_split`  
- **LogisticRegression** – tuning `penalty` (L1/L2) and `C`  
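
Grid search cost grows multiplicatively with every tuned hyperparameter. The sketch below counts the candidate combinations and cross-validation fits for grids of the same shape as the ones used in this section; `count_fits` is a hypothetical helper, and the final refit on the full training set that `GridSearchCV` performs by default is not included in the totals.

```python
from math import prod

# Grids mirroring the ones defined in the cells of this section.
# Single-valued entries (such as a fixed solver) multiply the count by 1
# and are omitted here.
grids = {
    "SVM": {
        "model__kernel": ["rbf", "linear"],
        "model__C": [0.1, 1, 10],
        "model__gamma": ["scale", "auto"],
    },
    "RandomForestClassifier": {
        "model__n_estimators": [100, 200],
        "model__max_depth": [None, 10, 20],
        "model__min_samples_split": [2, 5],
    },
    "LogisticRegression": {
        "model__penalty": ["l1", "l2"],
        "model__C": [0.1, 1, 10],
    },
}


def count_fits(param_grid, n_folds=5):
    """Return (number of candidates, number of cross-validation fits)."""
    n_candidates = prod(len(values) for values in param_grid.values())
    return n_candidates, n_candidates * n_folds


for name, grid in grids.items():
    candidates, fits = count_fits(grid)
    print(f"{name}: {candidates} candidates, {fits} fits")
```

For the SVM grid this gives 12 candidates and 60 fits, although `gamma` has no effect when the kernel is linear, so some of those fits are redundant.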


In [None]:
classification_results = {}

# Helper function to run GridSearchCV and store results
def run_classification_grid_search(model, param_grid, model_name):
    pipe = Pipeline(steps=[("preprocess", preprocessor), ("model", model)])
    
    grid = GridSearchCV(
        pipe,
        param_grid=param_grid,
        cv=5,
        scoring="accuracy",
        n_jobs=-1,
        verbose=1,
    )
    grid.fit(X_train, y_cls_train)
    
    best_model = grid.best_estimator_
    y_pred = best_model.predict(X_test)
    
    acc = accuracy_score(y_cls_test, y_pred)
    prec = precision_score(y_cls_test, y_pred, average="weighted", zero_division=0)
    rec = recall_score(y_cls_test, y_pred, average="weighted", zero_division=0)
    f1 = f1_score(y_cls_test, y_pred, average="weighted", zero_division=0)
    
    classification_results[model_name] = {
        "best_params": grid.best_params_,
        "accuracy": acc,
        "precision": prec,
        "recall": rec,
        "f1": f1,
        "best_model": best_model,
        "y_pred": y_pred,
    }
    
    print(f"\nModel: {model_name}")
    print("Best params:", grid.best_params_)
    print(f"Accuracy: {acc:.4f}")
    print(f"Precision (weighted): {prec:.4f}")
    print(f"Recall (weighted): {rec:.4f}")
    print(f"F1-score (weighted): {f1:.4f}")
    print("\nClassification report:")
    print(classification_report(y_cls_test, y_pred, zero_division=0))
    
    return best_model, y_pred


In [None]:
# 6.1 SVM (SVC)
svm_param_grid = {
    "model__kernel": ["rbf", "linear"],
    "model__C": [0.1, 1, 10],
    "model__gamma": ["scale", "auto"],
}

svm_model, svm_y_pred = run_classification_grid_search(
    SVC(probability=True),
    svm_param_grid,
    model_name="SVM"
)


In [None]:
# 6.2 Random Forest Classifier
rf_cls_param_grid = {
    "model__n_estimators": [100, 200],
    "model__max_depth": [None, 10, 20],
    "model__min_samples_split": [2, 5],
}

rf_cls_model, rf_cls_y_pred = run_classification_grid_search(
    RandomForestClassifier(random_state=42),
    rf_cls_param_grid,
    model_name="RandomForestClassifier"
)


In [None]:
# 6.3 Logistic Regression
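log_reg_param_grid = {
    "model__penalty": ["l1", "l2"],
    "model__C": [0.1, 1, 10],
    # saga supports L1 and L2 for multiclass targets; recent scikit-learn
    # releases deprecate liblinear for problems with more than two classes
    "model__solver": ["saga"],
}

log_reg_model, log_reg_y_pred = run_classification_grid_search(
    # multi_class is deprecated in recent scikit-learn; the default handles 3 classes
    LogisticRegression(max_iter=5000),
    log_reg_param_grid,
    model_name="LogisticRegression"
)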


### 6.4 Confusion Matrices & Model Comparison

In [None]:
def plot_confusion_matrix_for_model(y_true, y_pred, title):
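    # Fix the label order so the matrix is always 3x3, even if a class is never predicted
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
    fig, ax = plt.subplots(figsize=(4, 4))
    # Sequential colormap: dark cells hold high counts, so the white/black text rule below stays readable
    im = ax.imshow(cm, interpolation="nearest", cmap="Blues")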
    ax.figure.colorbar(im, ax=ax)
    ax.set(
        xticks=range(3),
        yticks=range(3),
        xticklabels=["Home Win (0)", "Away Win (1)", "Draw (2)"],
        yticklabels=["Home Win (0)", "Away Win (1)", "Draw (2)"],
        ylabel="True label",
        xlabel="Predicted label",
        title=title,
    )
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
    
    # Annotate
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(
                j, i, format(cm[i, j], "d"),
                ha="center", va="center",
                color="white" if cm[i, j] > thresh else "black"
            )
    plt.tight_layout()
    plt.show()


for name, res in classification_results.items():
    plot_confusion_matrix_for_model(y_cls_test, res["y_pred"], f"Confusion Matrix – {name}")


In [None]:
# Compare classification models by metrics
cls_summary = pd.DataFrame(
    {
        name: {
            "accuracy": res["accuracy"],
            "precision_weighted": res["precision"],
            "recall_weighted": res["recall"],
            "f1_weighted": res["f1"],
        }
        for name, res in classification_results.items()
    }
).T

cls_summary.sort_values("accuracy", ascending=False)


## 7. Regression – Final Scores

We train and tune:

- **Linear Regression with regularization**: Ridge (L2) and Lasso (L1)  
- **RandomForestRegressor** – tuning `n_estimators`, `max_depth`  
- **GradientBoostingRegressor** – tuning `n_estimators`, `max_depth`, `learning_rate`  

We build **separate models** for home and away scores.
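
Both regression metrics reported in this section are easy to compute by hand, which helps when interpreting them. In the sketch below the actual values are the home goals from the five rows of the `df.head()` output, while the predictions are invented purely for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error, in goals."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


actual = [4, 0, 1, 3, 0]               # FTHG of the first five matches
predicted = [2.5, 0.5, 1.5, 2.0, 1.0]  # made-up model outputs

print(f"RMSE: {rmse(actual, predicted):.4f}")  # 0.9747
print(f"R^2:  {r2(actual, predicted):.4f}")    # 0.6402

# Always predicting the mean of the actual values gives R^2 = 0 by construction.
mean_only = [sum(actual) / len(actual)] * len(actual)
print(f"R^2 of mean predictor: {r2(actual, mean_only):.4f}")
```

An R² near zero on the test set therefore means the model does no better than predicting the average number of goals for every match.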


In [None]:
regression_results_home = {}
regression_results_away = {}

def run_regression_grid_search(base_model, param_grid, model_name, y_train, y_test, target_label, results_dict):
    pipe = Pipeline(steps=[("preprocess", preprocessor), ("model", base_model)])
    
    grid = GridSearchCV(
        pipe,
        param_grid=param_grid,
        cv=5,
        scoring="neg_root_mean_squared_error",
        n_jobs=-1,
        verbose=1,
    )
    grid.fit(X_train, y_train)
    
    best_model = grid.best_estimator_
    y_pred = best_model.predict(X_test)
    
    # The `squared` argument was removed in recent scikit-learn releases; take the root explicitly
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    
    results_dict[model_name] = {
        "best_params": grid.best_params_,
        "rmse": rmse,
        "r2": r2,
        "best_model": best_model,
        "y_pred": y_pred,
    }
    
    print(f"\n[{target_label}] Model: {model_name}")
    print("Best params:", grid.best_params_)
    print(f"RMSE: {rmse:.4f}")
    print(f"R^2: {r2:.4f}")
    
    return best_model, y_pred


In [None]:
# 7.1 Ridge Regression – Home score
ridge_param_grid = {
    "model__alpha": [0.1, 1.0, 10.0],
}

ridge_home_model, ridge_home_pred = run_regression_grid_search(
    Ridge(),
    ridge_param_grid,
    model_name="Ridge",
    y_train=y_reg_home_train,
    y_test=y_reg_home_test,
    target_label="Home Score",
    results_dict=regression_results_home,
)

# 7.2 Lasso Regression – Home score
lasso_param_grid = {
    "model__alpha": [0.001, 0.01, 0.1, 1.0],
}

lasso_home_model, lasso_home_pred = run_regression_grid_search(
    Lasso(max_iter=10000),
    lasso_param_grid,
    model_name="Lasso",
    y_train=y_reg_home_train,
    y_test=y_reg_home_test,
    target_label="Home Score",
    results_dict=regression_results_home,
)


In [None]:
# 7.3 RandomForestRegressor – Home score
rf_reg_param_grid = {
    "model__n_estimators": [100, 200],
    "model__max_depth": [None, 10, 20],
}

rf_home_model, rf_home_pred = run_regression_grid_search(
    RandomForestRegressor(random_state=42),
    rf_reg_param_grid,
    model_name="RandomForestRegressor",
    y_train=y_reg_home_train,
    y_test=y_reg_home_test,
    target_label="Home Score",
    results_dict=regression_results_home,
)

# 7.4 GradientBoostingRegressor – Home score
gb_reg_param_grid = {
    "model__n_estimators": [100, 200],
    "model__learning_rate": [0.05, 0.1],
    "model__max_depth": [2, 3],
}

gb_home_model, gb_home_pred = run_regression_grid_search(
    GradientBoostingRegressor(random_state=42),
    gb_reg_param_grid,
    model_name="GradientBoostingRegressor",
    y_train=y_reg_home_train,
    y_test=y_reg_home_test,
    target_label="Home Score",
    results_dict=regression_results_home,
)


In [None]:
# Repeat for AWAY score

# Ridge – Away score
ridge_away_model, ridge_away_pred = run_regression_grid_search(
    Ridge(),
    ridge_param_grid,
    model_name="Ridge",
    y_train=y_reg_away_train,
    y_test=y_reg_away_test,
    target_label="Away Score",
    results_dict=regression_results_away,
)

# Lasso – Away score
lasso_away_model, lasso_away_pred = run_regression_grid_search(
    Lasso(max_iter=10000),
    lasso_param_grid,
    model_name="Lasso",
    y_train=y_reg_away_train,
    y_test=y_reg_away_test,
    target_label="Away Score",
    results_dict=regression_results_away,
)

# RandomForestRegressor – Away score
rf_away_model, rf_away_pred = run_regression_grid_search(
    RandomForestRegressor(random_state=42),
    rf_reg_param_grid,
    model_name="RandomForestRegressor",
    y_train=y_reg_away_train,
    y_test=y_reg_away_test,
    target_label="Away Score",
    results_dict=regression_results_away,
)

# GradientBoostingRegressor – Away score
gb_away_model, gb_away_pred = run_regression_grid_search(
    GradientBoostingRegressor(random_state=42),
    gb_reg_param_grid,
    model_name="GradientBoostingRegressor",
    y_train=y_reg_away_train,
    y_test=y_reg_away_test,
    target_label="Away Score",
    results_dict=regression_results_away,
)


### 7.5 Regression Performance Summary

In [None]:
home_reg_summary = pd.DataFrame(
    {
        name: {"rmse": res["rmse"], "r2": res["r2"]}
        for name, res in regression_results_home.items()
    }
).T

away_reg_summary = pd.DataFrame(
    {
        name: {"rmse": res["rmse"], "r2": res["r2"]}
        for name, res in regression_results_away.items()
    }
).T

print("Home score regression summary:")
display(home_reg_summary.sort_values("rmse"))

print("\nAway score regression summary:")
display(away_reg_summary.sort_values("rmse"))


## 8. Visualization

### 8.1 Predicted vs Actual Score Scatter Plots

We use the **best-performing regression models** for home and away scores and visualize predicted vs actual values.


In [None]:
def plot_predicted_vs_actual(y_true, y_pred, title):
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.scatter(y_true, y_pred, alpha=0.6)
    min_val = min(y_true.min(), y_pred.min())
    max_val = max(y_true.max(), y_pred.max())
    ax.plot([min_val, max_val], [min_val, max_val])
    ax.set_xlabel("Actual")
    ax.set_ylabel("Predicted")
    ax.set_title(title)
    plt.tight_layout()
    plt.show()

# Choose the model with lowest RMSE for home and away
best_home_model_name = min(regression_results_home, key=lambda k: regression_results_home[k]["rmse"])
best_away_model_name = min(regression_results_away, key=lambda k: regression_results_away[k]["rmse"])

best_home_pred = regression_results_home[best_home_model_name]["y_pred"]
best_away_pred = regression_results_away[best_away_model_name]["y_pred"]

print("Best home score model:", best_home_model_name)
print("Best away score model:", best_away_model_name)

plot_predicted_vs_actual(y_reg_home_test, best_home_pred, f"Home Score – Predicted vs Actual ({best_home_model_name})")
plot_predicted_vs_actual(y_reg_away_test, best_away_pred, f"Away Score – Predicted vs Actual ({best_away_model_name})")


### 8.2 Outcome Prediction Accuracy by Score Margin

We can also examine how often the **predicted outcome** (from the best classification model) is correct
as a function of the **true goal difference** (score margin).
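
The margin bins rely on `pd.cut` with `right=False`, which makes every interval closed on the left and open on the right. The short sketch below, using made-up margins, shows which label each absolute goal difference receives under the same bin edges and labels as the cell that follows.

```python
import numpy as np
import pandas as pd

bins = [0, 1, 2, 3, 5, np.inf]
labels = ["0", "1", "2", "3-4", "5+"]

# Made-up absolute goal differences covering every bin.
abs_margin = pd.Series([0, 1, 2, 3, 4, 5, 7])

# right=False gives [0, 1), [1, 2), [2, 3), [3, 5), [5, inf)
binned = pd.cut(abs_margin, bins=bins, labels=labels, right=False)
print(binned.astype(str).tolist())
# ['0', '1', '2', '3-4', '3-4', '5+', '5+']
```

Draws always land in the `"0"` bin, so the accuracy for that bin is effectively the model's recall on draws.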

In [None]:
# Pick best classification model by accuracy
best_cls_model_name = max(classification_results, key=lambda k: classification_results[k]["accuracy"])
best_cls_pred = classification_results[best_cls_model_name]["y_pred"]

print("Best classification model:", best_cls_model_name)

# Compute true score margin (home - away) and absolute margin
true_home = y_reg_home_test.reset_index(drop=True)
true_away = y_reg_away_test.reset_index(drop=True)
true_margin = true_home - true_away
abs_margin = true_margin.abs()

# Accuracy per margin bin
bins = [0, 1, 2, 3, 5, np.inf]
labels = ["0", "1", "2", "3-4", "5+"]
margin_bins = pd.cut(abs_margin, bins=bins, labels=labels, right=False)

correct = (best_cls_pred == y_cls_test.reset_index(drop=True)).astype(int)
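# observed=False keeps empty margin bins in the result and avoids the pandas FutureWarning
accuracy_by_margin = correct.groupby(margin_bins, observed=False).mean()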

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(accuracy_by_margin.index.astype(str), accuracy_by_margin.values)
ax.set_xlabel("Absolute Goal Difference (True)")
ax.set_ylabel("Outcome Prediction Accuracy")
ax.set_title("Outcome Prediction Accuracy by Score Margin")
plt.tight_layout()
plt.show()

accuracy_by_margin


## 9. Conclusion & Next Steps

In this notebook we:

1. Loaded and concatenated multiple Premier League seasons (2019/2020–2021/2022).  
2. Preprocessed the data with imputation, scaling, and one-hot encoding.  
3. Trained and tuned **classification models** (SVM, Random Forest, Logistic Regression) to predict match outcomes.  
4. Trained and tuned **regression models** (Ridge, Lasso, RandomForestRegressor, GradientBoostingRegressor) to predict final scores.  
5. Evaluated models using appropriate metrics:  
   - **Classification:** Accuracy, Precision, Recall, F1-score  
   - **Regression:** RMSE, R²  
6. Visualized confusion matrices, predicted vs actual scores, and outcome accuracy by score margin.

**Possible extensions:**

- Incorporate engineered pre-match features (recent form, xG, home/away streaks); the bookmaker odds columns in the dataset are already picked up as numeric features.  
- Use time-aware validation (e.g., train on earlier seasons, test on later).  
- Try more advanced models (XGBoost, LightGBM, neural networks).  
- Calibrate predicted probabilities for betting strategies.
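
As one way to approach the time-aware validation mentioned above, the sketch below splits matches by date instead of at random. It assumes a `Date` column in day/month/year format, as in the `df.head()` output; `split_by_date` and the cutoff of 1 August 2021 (roughly the start of the 2021/2022 season) are illustrative assumptions, and the toy rows are not real results.

```python
import pandas as pd


def split_by_date(frame, cutoff, date_col="Date"):
    """Return (train, test): rows strictly before `cutoff`, and rows on or after it."""
    dates = pd.to_datetime(frame[date_col], dayfirst=True)
    is_train = dates < pd.Timestamp(cutoff)
    return frame[is_train], frame[~is_train]


# Toy matches spanning three seasons (outcomes are placeholders).
toy_matches = pd.DataFrame({
    "Date": ["09/08/2019", "12/09/2020", "14/08/2021", "22/05/2022"],
    "FTR": ["H", "A", "D", "H"],
})

train_part, test_part = split_by_date(toy_matches, cutoff="2021-08-01")
print(len(train_part), "training rows;", len(test_part), "test rows")
print(test_part["Date"].tolist())
```

Compared with the shuffled split used in this notebook, this keeps every test match later than every training match, which is closer to how a betting model would be used in practice.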
