<u>**EPL (English Premier League) Soccer Match Predictor**</u>

This project creates a machine learning model pipeline designed to predict Premier League match outcomes (home win, draw, or away win) using historical match data enriched with advanced features. Data is secured from www.football-data.co.uk (publicly available data). It incorporates team performance metrics such as rolling averages of goals scored and conceded, team form calculated from recent results, and dynamic ELO ratings that reflect team strength over time. The model preprocesses these features, scales them for consistency, and employs a Random Forest classifier optimized through randomized hyperparameter search and validated with stratified 10-fold cross-validation. The model significantly beats a random prediction (1/3) with an accuracy of over 56%. The model then gives an example at the end by predicting the winner of **Nottingham Forrest vs Burnley**.

In [26]:
# Import required packages
import pandas as pd
import numpy as np
import requests
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

In [27]:
# Load data (from football-data.co.uk
url = "https://www.football-data.co.uk/mmz4281/2324/E0.csv"
df = pd.read_csv(url)
df = df[['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HS', 'AS', 'HST', 'AST', 'HC', 'AC']]
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df = df.sort_values('Date')
df.head()

Unnamed: 0,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HS,AS,HST,AST,HC,AC
0,2023-08-11,Burnley,Man City,0,3,A,6,17,1,8,6,5
1,2023-08-12,Arsenal,Nott'm Forest,2,1,H,15,6,7,2,8,3
2,2023-08-12,Bournemouth,West Ham,1,1,D,14,16,5,3,10,4
3,2023-08-12,Brighton,Luton,4,1,H,27,9,12,3,6,7
4,2023-08-12,Everton,Fulham,0,1,A,19,9,9,2,10,4


In [28]:
# Initialize and create ELO ratings
elo_ratings = {}
BASE_ELO = 1500
K = 30  # Sensitivity constant

In [29]:
# Create ELO helper functions
def expected_score(r1, r2):
    """
    Calculate the expected score (probability of winning) for player/team 1 
    against player/team 2 based on their ELO ratings.

    Parameters:
    r1 (float): ELO rating of player/team 1 (e.g., home team).
    r2 (float): ELO rating of player/team 2 (e.g., away team).

    Returns:
    float: Expected score (between 0 and 1) representing the probability 
           that player/team 1 wins the match.
    """
    return 1 / (1 + 10 ** ((r2 - r1) / 400))


def update_elo(home, away, result):
    """
    Update ELO ratings for two teams after a match and return their pre-match ratings.

    Parameters:
    home (str): Name of the home team.
    away (str): Name of the away team.
    result (str): Match result. Should be one of:
                  'H' for a home win,
                  'D' for a draw,
                  'A' for an away win.

    Returns:
    tuple: (home_elo, away_elo) — the ELO ratings of the home and away teams before the match.

    Notes:
    - Uses global `elo_ratings` dictionary to retrieve and store team ELOs.
    - Uses global constants `BASE_ELO` (starting rating) and `K` (adjustment factor).
    - ELO ratings are updated in-place in the `elo_ratings` dictionary.
    """
    home_elo = elo_ratings.get(home, BASE_ELO)
    away_elo = elo_ratings.get(away, BASE_ELO)
    
    expected_home = expected_score(home_elo, away_elo)
    expected_away = 1 - expected_home

    # Assign actual result scores
    if result == 'H':
        actual_home = 1
        actual_away = 0
    elif result == 'D':
        actual_home = actual_away = 0.5
    else:  # 'A'
        actual_home = 0
        actual_away = 1

    # Update ELOs based on result
    new_home = home_elo + K * (actual_home - expected_home)
    new_away = away_elo + K * (actual_away - expected_away)

    elo_ratings[home] = new_home
    elo_ratings[away] = new_away

    return home_elo, away_elo  # Return pre-match ELOs

In [31]:
# Build pre-match rolling averages and ELOs
home_stats = ['FTHG', 'HS', 'HST', 'HC']
away_stats = ['FTAG', 'AS', 'AST', 'AC']
for stat in home_stats:
    df[f'Home_{stat}_avg'] = df.groupby('HomeTeam')[stat].shift().rolling(5).mean()
for stat in away_stats:
    df[f'Away_{stat}_avg'] = df.groupby('AwayTeam')[stat].shift().rolling(5).mean()

# Add ELO columns
home_elos = []
away_elos = []
for idx, row in df.iterrows():
    home = row['HomeTeam']
    away = row['AwayTeam']
    result = row['FTR']
    
    home_elo, away_elo = update_elo(home, away, result)
    home_elos.append(home_elo)
    away_elos.append(away_elo)

df['HomeELO'] = home_elos
df['AwayELO'] = away_elos

df['elo_diff'] = df['HomeELO'] - df['AwayELO']

In [32]:
def calc_form(df, team_col, result_col, window=5):
    """
    Calculates a team's rolling form based on recent results.

    Encodes match outcomes as Win=1, Draw=0.5, Loss=0, and computes
    the average over the last `window` games for each team.

    Parameters:
        df (DataFrame): Match data with dates and results.
        team_col (str): Column with team names.
        result_col (str): Column with match outcomes ('W', 'D', 'L').
        window (int): Number of past games to average (default=5).

    Returns:
        List[float]: Rolling form score or NaN if insufficient history.
    """
    df = df.sort_values('Date')
    form = []
    team_results = {}

    for _, row in df.iterrows():
        team = row[team_col]
        result = row[result_col]
        
        history = team_results.get(team, [])
        form.append(sum(history[-window:]) / window if len(history) >= window else np.nan)
        # Encode result (e.g., Win=1, Draw=0.5, Loss=0)
        outcome = 1 if result == 'W' else 0.5 if result == 'D' else 0
        team_results.setdefault(team, []).append(outcome)

    return form

df['home_form'] = calc_form(df, 'HomeTeam', 'FTR')
df['away_form'] = calc_form(df, 'AwayTeam', 'FTR')

In [33]:
# Prep final data set
df.dropna(inplace=True)
features = [col for col in df.columns if 'avg' in col] + ['HomeELO', 'AwayELO', 'elo_diff', 'home_form', 'away_form']
X = df[features]
y = df['FTR']
le = LabelEncoder()
y_encoded = le.fit_transform(y)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, stratify=y_encoded, random_state=42)

# Scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [36]:
# Train model
# Define the random forest classifier
rf = RandomForestClassifier(random_state=42)

# Parameter grid for RandomizedSearchCV
param_dist = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None]
}

# Stratified 10-Fold CV
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Setup RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_dist,
    n_iter=50,
    scoring='accuracy',
    cv=cv,
    verbose=2,
    random_state=42,
    n_jobs=-1
)

# Fit RandomizedSearchCV on training data
random_search.fit(X_train_scaled, y_train)

print("Best hyperparameters found:")
print(random_search.best_params_)
print(f"Best cross-validation accuracy: {random_search.best_score_:.4f}")

Fitting 10 folds for each of 50 candidates, totalling 500 fits
Best hyperparameters found:
{'n_estimators': 300, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': None, 'max_depth': 10}
Best cross-validation accuracy: 0.6091


In [38]:
# Evaluate
y_pred = random_search.predict(X_test_scaled)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=le.classes_))


Accuracy: 0.5636363636363636
              precision    recall  f1-score   support

           A       0.56      0.53      0.55        17
           D       0.67      0.17      0.27        12
           H       0.56      0.77      0.65        26

    accuracy                           0.56        55
   macro avg       0.59      0.49      0.49        55
weighted avg       0.58      0.56      0.53        55



In [48]:
# Predict Example-- Burnley v Nott'm Forest on 2024-05-19
# (note further iteration will take out the hard coding)
match_features = {
    'Home_FTHG_avg': 1.5,    # example rolling avg goals scored by Burnley at home
    'Home_HS_avg': 12.0,     # example shots average
    'Home_HST_avg': 5.0,
    'Home_HC_avg': 4.0,
    'Away_FTAG_avg': 1.2,    # away stats for Nott'm Forest
    'Away_AS_avg': 10.0,
    'Away_AST_avg': 3.5,
    'Away_AC_avg': 2.5,
    'HomeELO': elo_ratings.get('Burnley', 1500),
    'AwayELO': elo_ratings.get("Nott'm Forest", 1500),
    'elo_diff': elo_ratings.get('Burnley', 1500) - elo_ratings.get("Nott'm Forest", 1500),
    'home_form': 0.6,        # example form scores (between 0 and 1)
    'away_form': 0.4,
}

# Convert to DataFrame
match_df = pd.DataFrame([match_features])

# Scale features
match_scaled = scaler.transform(match_df)

# Predict result
pred_encoded = random_search.predict(match_scaled)
pred_label = le.inverse_transform(pred_encoded)

print(f"Predicted match outcome: {pred_label[0]}")

Predicted match outcome: H
