
# Premier League Match Score Prediction — Regression Analysis

## Objective
Predict **home and away team scores** (FTHG, FTAG) using **historical performance data only** — no betting odds or external features.

## Approach
- **Data:** 3 seasons (2019-20, 2020-21, 2021-22) — 1,020 Premier League matches
- **Features:** Team identities + season-to-date cumulative statistics (goals scored/conceded averages, matches played)
- **Evaluation:** Time-based split
  - **Train:** 2019-20 + 2020-21 seasons (640 matches)
  - **Test:** 2021-22 season (380 matches)
- **Models:** Ridge, Lasso, Random Forest, Gradient Boosting

## Why This Matters
This analysis tests whether season-to-date team performance alone can predict match scores, without relying on betting market information.

---


In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Classification Models
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Regression Models
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, confusion_matrix,
    mean_squared_error, r2_score
)

pd.set_option("display.max_columns", 100)



# **STEP 1 — Load & Merge Dataset**



In [3]:

# TODO: Update dataset paths
path_2019 = "2019-20.csv"
path_2020 = "2020-2021.csv"
path_2021 = "2021-2022.csv"

df_19 = pd.read_csv(path_2019)
df_20 = pd.read_csv(path_2020)
df_21 = pd.read_csv(path_2021)

df = pd.concat([df_19, df_20, df_21], ignore_index=True)
df.head()


Unnamed: 0,Div,Date,Time,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,Referee,HS,AS,HST,AST,HF,AF,HC,AC,HY,AY,HR,AR,B365H,B365D,B365A,BWH,BWD,BWA,IWH,IWD,IWA,PSH,PSD,PSA,WHH,WHD,WHA,VCH,VCD,VCA,MaxH,MaxD,MaxA,AvgH,AvgD,AvgA,B365>2.5,B365<2.5,...,AHh,B365AHH,B365AHA,PAHH,PAHA,MaxAHH,MaxAHA,AvgAHH,AvgAHA,B365CH,B365CD,B365CA,BWCH,BWCD,BWCA,IWCH,IWCD,IWCA,PSCH,PSCD,PSCA,WHCH,WHCD,WHCA,VCCH,VCCD,VCCA,MaxCH,MaxCD,MaxCA,AvgCH,AvgCD,AvgCA,B365C>2.5,B365C<2.5,PC>2.5,PC<2.5,MaxC>2.5,MaxC<2.5,AvgC>2.5,AvgC<2.5,AHCh,B365CAHH,B365CAHA,PCAHH,PCAHA,MaxCAHH,MaxCAHA,AvgCAHH,AvgCAHA
0,E0,09/08/2019,20:00,Liverpool,Norwich,4,1,H,4,0,H,M Oliver,15,12,7,5,9,9,11,2,0,2,0,0,1.14,10.0,19.0,1.14,8.25,18.5,1.15,8.0,18.0,1.15,9.59,18.05,1.12,8.5,21.0,1.14,9.5,23.0,1.16,10.0,23.0,1.14,8.75,19.83,1.4,3.0,...,-2.25,1.96,1.94,1.97,1.95,1.97,2.0,1.94,1.94,1.14,9.5,21.0,1.14,9.0,20.0,1.15,8.0,18.0,1.14,10.43,19.63,1.11,9.5,21.0,1.14,9.5,23.0,1.16,10.5,23.0,1.14,9.52,19.18,1.3,3.5,1.34,3.44,1.36,3.76,1.32,3.43,-2.25,1.91,1.99,1.94,1.98,1.99,2.07,1.9,1.99
1,E0,10/08/2019,12:30,West Ham,Man City,0,5,A,0,1,A,M Dean,5,14,3,9,6,13,1,1,2,2,0,0,12.0,6.5,1.22,11.5,5.75,1.26,11.0,6.1,1.25,11.68,6.53,1.26,13.0,6.0,1.24,12.0,6.5,1.25,13.0,6.75,1.29,11.84,6.28,1.25,1.44,2.75,...,1.75,2.0,1.9,2.02,1.9,2.02,1.92,1.99,1.89,12.0,7.0,1.25,11.0,6.0,1.26,11.0,6.1,1.25,11.11,6.68,1.27,11.0,6.5,1.24,12.0,6.5,1.25,13.0,7.0,1.29,11.14,6.46,1.26,1.4,3.0,1.43,3.03,1.5,3.22,1.41,2.91,1.75,1.95,1.95,1.96,1.97,2.07,1.98,1.97,1.92
2,E0,10/08/2019,15:00,Bournemouth,Sheffield United,1,1,D,0,0,D,K Friend,13,8,3,3,10,19,3,4,2,1,0,0,1.95,3.6,3.6,1.95,3.6,3.9,1.97,3.55,3.8,2.04,3.57,3.9,2.0,3.5,3.8,2.0,3.6,4.0,2.06,3.65,4.0,2.01,3.53,3.83,1.9,1.9,...,-0.5,2.01,1.89,2.04,1.88,2.04,1.91,2.0,1.88,1.95,3.7,4.2,1.95,3.6,3.9,1.97,3.55,3.85,1.98,3.67,4.06,1.95,3.6,3.9,2.0,3.6,4.0,2.03,3.7,4.2,1.98,3.58,3.96,1.9,1.9,1.94,1.97,1.97,1.98,1.91,1.92,-0.5,1.95,1.95,1.98,1.95,2.0,1.96,1.96,1.92
3,E0,10/08/2019,15:00,Burnley,Southampton,3,0,H,0,0,D,G Scott,10,11,4,3,6,12,2,7,0,0,0,0,2.62,3.2,2.75,2.65,3.2,2.75,2.65,3.2,2.75,2.71,3.31,2.81,2.7,3.2,2.75,2.7,3.3,2.8,2.8,3.33,2.85,2.68,3.22,2.78,2.1,1.72,...,0.0,1.92,1.98,1.93,2.0,1.94,2.0,1.91,1.98,2.7,3.25,2.9,2.65,3.1,2.85,2.6,3.2,2.85,2.71,3.19,2.9,2.62,3.2,2.8,2.7,3.25,2.9,2.72,3.26,2.95,2.65,3.18,2.88,2.1,1.72,2.19,1.76,2.25,1.78,2.17,1.71,0.0,1.87,2.03,1.89,2.03,1.9,2.07,1.86,2.02
4,E0,10/08/2019,15:00,Crystal Palace,Everton,0,0,D,0,0,D,J Moss,6,10,2,3,16,14,6,2,2,1,0,1,3.0,3.25,2.37,3.2,3.2,2.35,3.1,3.2,2.4,3.21,3.37,2.39,3.1,3.3,2.35,3.2,3.3,2.45,3.21,3.4,2.52,3.13,3.27,2.4,2.2,1.66,...,0.25,1.85,2.05,1.88,2.05,1.88,2.09,1.84,2.04,3.4,3.5,2.25,3.3,3.3,2.25,3.4,3.3,2.2,3.37,3.45,2.27,3.3,3.3,2.25,3.4,3.3,2.25,3.55,3.5,2.34,3.41,3.37,2.23,2.2,1.66,2.22,1.74,2.28,1.77,2.17,1.71,0.25,1.82,2.08,1.97,1.96,2.03,2.08,1.96,1.93


### Explore the data

In [4]:
# Explore the data structure
print("Total columns:", len(df.columns))
print("\nDataframe shape:", df.shape)
print("\nColumn names:")
print(df.columns.tolist())
print("\nData types:")
print(df.dtypes)

Total columns: 106

Dataframe shape: (1020, 106)

Column names:
['Div', 'Date', 'Time', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HTHG', 'HTAG', 'HTR', 'Referee', 'HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC', 'AC', 'HY', 'AY', 'HR', 'AR', 'B365H', 'B365D', 'B365A', 'BWH', 'BWD', 'BWA', 'IWH', 'IWD', 'IWA', 'PSH', 'PSD', 'PSA', 'WHH', 'WHD', 'WHA', 'VCH', 'VCD', 'VCA', 'MaxH', 'MaxD', 'MaxA', 'AvgH', 'AvgD', 'AvgA', 'B365>2.5', 'B365<2.5', 'P>2.5', 'P<2.5', 'Max>2.5', 'Max<2.5', 'Avg>2.5', 'Avg<2.5', 'AHh', 'B365AHH', 'B365AHA', 'PAHH', 'PAHA', 'MaxAHH', 'MaxAHA', 'AvgAHH', 'AvgAHA', 'B365CH', 'B365CD', 'B365CA', 'BWCH', 'BWCD', 'BWCA', 'IWCH', 'IWCD', 'IWCA', 'PSCH', 'PSCD', 'PSCA', 'WHCH', 'WHCD', 'WHCA', 'VCCH', 'VCCD', 'VCCA', 'MaxCH', 'MaxCD', 'MaxCA', 'AvgCH', 'AvgCD', 'AvgCA', 'B365C>2.5', 'B365C<2.5', 'PC>2.5', 'PC<2.5', 'MaxC>2.5', 'MaxC<2.5', 'AvgC>2.5', 'AvgC<2.5', 'AHCh', 'B365CAHH', 'B365CAHA', 'PCAHH', 'PCAHA', 'MaxCAHH', 'MaxCAHA', 'AvgCAHH', 'AvgCAHA']

Data types:
D


# **STEP 2 — Feature Engineering**
Define the regression targets (FTHG, FTAG), select leak-free base columns, and build season-to-date features.


In [13]:

# Define target variables for regression (what we want to predict)
home_goals_col = "FTHG"
away_goals_col = "FTAG"

# Step 1: Remove columns with data leakage (post-match info)
# These columns contain information AFTER the match or during the match
columns_to_drop = [
    # Results 
    'FTR', 'HTR',
    
    # Half-time scores (leak info about final score)
    'HTHG', 'HTAG',
    
    # Match statistics (not available before match)
    'HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC', 'AC', 'HY', 'AY', 'HR', 'AR', 'Referee',
    
    # Closing odds (final prices before kickoff; excluded because this model uses no betting data)
    'B365CH', 'B365CD', 'B365CA', 'BWCH', 'BWCD', 'BWCA', 'IWCH', 'IWCD', 'IWCA',
    'PSCH', 'PSCD', 'PSCA', 'WHCH', 'WHCD', 'WHCA', 'VCCH', 'VCCD', 'VCCA',
    'MaxCH', 'MaxCD', 'MaxCA', 'AvgCH', 'AvgCD', 'AvgCA',
    'B365C>2.5', 'B365C<2.5', 'PC>2.5', 'PC<2.5', 'MaxC>2.5', 'MaxC<2.5', 
    'AvgC>2.5', 'AvgC<2.5', 'AHCh', 'B365CAHH', 'B365CAHA', 'PCAHH', 'PCAHA',
    'MaxCAHH', 'MaxCAHA', 'AvgCAHH', 'AvgCAHA',
    
    # Non-predictive columns
    'Div', 'Time'
]

# Step 2: Select features for HISTORICAL MODEL (no betting odds)
features_to_keep = [
    # Team identity (essential for team strength)
    'HomeTeam', 'AwayTeam',
    
    # Date (needed for time-series feature engineering)
    'Date'
]

# Create feature dataframe - only team names and date initially
df_features = df[features_to_keep + [home_goals_col, away_goals_col]].copy()

print("Base features selected (before engineering):")
print(f"  - Team identity: HomeTeam, AwayTeam")
print(f"  - Historical features will be added")

print(f"\nFeatures: {[col for col in df_features.columns if col not in [home_goals_col, away_goals_col]]}")
print(f"Targets: {home_goals_col}, {away_goals_col}")

df_features.head()


Base features selected (before engineering):
  - Team identity: HomeTeam, AwayTeam
  - Historical features will be added

Features: ['HomeTeam', 'AwayTeam', 'Date']
Targets: FTHG, FTAG


Unnamed: 0,HomeTeam,AwayTeam,Date,FTHG,FTAG
0,Liverpool,Norwich,09/08/2019,4,1
1,West Ham,Man City,10/08/2019,0,5
2,Bournemouth,Sheffield United,10/08/2019,1,1
3,Burnley,Southampton,10/08/2019,3,0
4,Crystal Palace,Everton,10/08/2019,0,0


### Historical-Only Model Approach (Season-to-Date)
**This model uses ONLY historical match results - no betting odds or market data.**

Features include:
- Team identity (HomeTeam, AwayTeam)
- Season-to-date averages: each team's goals scored and conceded per match before the current fixture, computed separately for home and away games
- All features derived from past match results only
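
The shift-then-expand idea behind these features can be seen on a small example. The sketch below is not the notebook's loop; it uses a `groupby` on a toy frame with made-up rows, and it assumes the same column names (`Season`, `HomeTeam`, `FTHG`) used elsewhere in this notebook.

```python
import pandas as pd

# Toy data (illustrative values only): one team's home matches across two seasons
toy = pd.DataFrame({
    "Season":   [2019, 2019, 2019, 2020, 2020],
    "HomeTeam": ["Liverpool"] * 5,
    "FTHG":     [4, 3, 3, 2, 1],
})

groups = toy.groupby(["HomeTeam", "Season"])

# shift(1) removes the current match from its own average;
# expanding().mean() then averages every earlier match in the same season.
toy["home_goals_scored_avg"] = groups["FTHG"].transform(
    lambda s: s.shift(1).expanding().mean()
)

# Number of earlier home matches for this team in the same season
toy["home_matches_played"] = groups.cumcount()

print(toy)
```

The first match of each season has no history, so its average is NaN until it is filled with a fallback value.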

In [18]:
# Step 4: Create season-to-date cumulative features
# Sort by date to ensure chronological order
df_features['Date'] = pd.to_datetime(df_features['Date'], format='%d/%m/%Y')
df_features = df_features.sort_values('Date').reset_index(drop=True)

# Extract the season's starting year (e.g., 2019-08-09 -> 2019):
# matches played before August belong to the season that began the previous year
df_features['Season'] = df_features['Date'].dt.year
df_features.loc[df_features['Date'].dt.month < 8, 'Season'] = df_features['Date'].dt.year - 1

# Initialize cumulative feature columns
df_features['home_goals_scored_avg'] = np.nan
df_features['home_goals_conceded_avg'] = np.nan
df_features['home_matches_played'] = 0
df_features['away_goals_scored_avg'] = np.nan
df_features['away_goals_conceded_avg'] = np.nan
df_features['away_matches_played'] = 0

# Calculate cumulative stats for each team in each season
for team in df_features['HomeTeam'].unique():
    # Home matches
    home_mask = df_features['HomeTeam'] == team
    for season in df_features.loc[home_mask, 'Season'].unique():
        season_mask = (df_features['HomeTeam'] == team) & (df_features['Season'] == season)
        season_indices = df_features[season_mask].sort_values('Date').index
        
        goals_scored = df_features.loc[season_indices, 'FTHG']
        goals_conceded = df_features.loc[season_indices, 'FTAG']
        
        # Cumulative averages (shift to exclude current match)
        df_features.loc[season_indices, 'home_goals_scored_avg'] = goals_scored.shift(1).expanding().mean().values
        df_features.loc[season_indices, 'home_goals_conceded_avg'] = goals_conceded.shift(1).expanding().mean().values
        df_features.loc[season_indices, 'home_matches_played'] = range(len(season_indices))
    
    # Away matches
    away_mask = df_features['AwayTeam'] == team
    for season in df_features.loc[away_mask, 'Season'].unique():
        season_mask = (df_features['AwayTeam'] == team) & (df_features['Season'] == season)
        season_indices = df_features[season_mask].sort_values('Date').index
        
        goals_scored = df_features.loc[season_indices, 'FTAG']
        goals_conceded = df_features.loc[season_indices, 'FTHG']
        
        # Cumulative averages (shift to exclude current match)
        df_features.loc[season_indices, 'away_goals_scored_avg'] = goals_scored.shift(1).expanding().mean().values
        df_features.loc[season_indices, 'away_goals_conceded_avg'] = goals_conceded.shift(1).expanding().mean().values
        df_features.loc[season_indices, 'away_matches_played'] = range(len(season_indices))

# Fill initial NaN values (each team's first home/away match of a season) with the
# dataset-wide mean of home goals (FTHG); note that this mean includes test-season matches
league_avg_goals = df_features['FTHG'].mean()
df_features['home_goals_scored_avg'] = df_features['home_goals_scored_avg'].fillna(league_avg_goals)
df_features['home_goals_conceded_avg'] = df_features['home_goals_conceded_avg'].fillna(league_avg_goals)
df_features['away_goals_scored_avg'] = df_features['away_goals_scored_avg'].fillna(league_avg_goals)
df_features['away_goals_conceded_avg'] = df_features['away_goals_conceded_avg'].fillna(league_avg_goals)

print(f"\nNew shape: {df_features.shape}")
print("\nSeason-to-date features created:")
print("- home_goals_scored_avg: Home team's avg goals scored so far this season")
print("- home_goals_conceded_avg: Home team's avg goals conceded so far this season")
print("- away_goals_scored_avg: Away team's avg goals scored so far this season")
print("- away_goals_conceded_avg: Away team's avg goals conceded so far this season")
print("- home_matches_played: Number of matches played by home team this season")
print("- away_matches_played: Number of matches played by away team this season")

print("\nExample: First 10 matches with season-to-date stats")
df_features[['Date', 'Season', 'HomeTeam', 'AwayTeam', 'home_matches_played', 
             'home_goals_scored_avg', 'away_goals_scored_avg', 'FTHG', 'FTAG']].head(10)


New shape: (1020, 12)

Season-to-date features created:
- home_goals_scored_avg: Home team's avg goals scored so far this season
- home_goals_conceded_avg: Home team's avg goals conceded so far this season
- away_goals_scored_avg: Away team's avg goals scored so far this season
- away_goals_conceded_avg: Away team's avg goals conceded so far this season
- home_matches_played: Number of matches played by home team this season
- away_matches_played: Number of matches played by away team this season

Example: First 10 matches with season-to-date stats


Unnamed: 0,Date,Season,HomeTeam,AwayTeam,home_matches_played,home_goals_scored_avg,away_goals_scored_avg,FTHG,FTAG
0,2019-08-09,2019,Liverpool,Norwich,0,1.445098,1.445098,4,1
1,2019-08-10,2019,West Ham,Man City,0,1.445098,1.445098,0,5
2,2019-08-10,2019,Bournemouth,Sheffield United,0,1.445098,1.445098,1,1
3,2019-08-10,2019,Burnley,Southampton,0,1.445098,1.445098,3,0
4,2019-08-10,2019,Crystal Palace,Everton,0,1.445098,1.445098,0,0
5,2019-08-10,2019,Watford,Brighton,0,1.445098,1.445098,0,3
6,2019-08-10,2019,Tottenham,Aston Villa,0,1.445098,1.445098,3,1
7,2019-08-11,2019,Leicester,Wolves,0,1.445098,1.445098,0,0
8,2019-08-11,2019,Newcastle,Arsenal,0,1.445098,1.445098,0,1
9,2019-08-11,2019,Man United,Chelsea,0,1.445098,1.445098,4,0


In [19]:
# Step 6: Final feature selection and summary
# Define final feature set for HISTORICAL MODEL

categorical_features = ['HomeTeam', 'AwayTeam']

numerical_features = [
    # Season-to-date cumulative statistics
    'home_goals_scored_avg', 'home_goals_conceded_avg',
    'away_goals_scored_avg', 'away_goals_conceded_avg',
    'home_matches_played', 'away_matches_played',
]

# Target variables
target_home = 'FTHG'
target_away = 'FTAG'

# Create final modeling dataframe
X = df_features[categorical_features + numerical_features].copy()
y_home = df_features[target_home].copy()
y_away = df_features[target_away].copy()

print("="*60)
print("FINAL FEATURE SET FOR MODELING (Season-to-Date)")
print("="*60)
print(f"\nFeatures: 8 total (2 categorical + 6 numerical)")
print("  • HomeTeam, AwayTeam")
print("  • Season-to-date averages: goals scored/conceded")
print("  • Matches played this season (for context)")
print(f"\nTargets: FTHG (home goals), FTAG (away goals)")
print(f"Samples: {len(X)} matches")

print("\nNote: Features use only data from BEFORE each match")
print("="*60)

FINAL FEATURE SET FOR MODELING (Season-to-Date)

Features: 8 total (2 categorical + 6 numerical)
  • HomeTeam, AwayTeam
  • Season-to-date averages: goals scored/conceded
  • Matches played this season (for context)

Targets: FTHG (home goals), FTAG (away goals)
Samples: 1020 matches

Note: Features use only data from BEFORE each match


In [20]:
# Verify cumulative stats are working correctly - show Liverpool's progression
print("Liverpool home matches in 2019-20 season (first 5 games):")
liverpool_home = df_features[(df_features['HomeTeam'] == 'Liverpool') & (df_features['Season'] == 2019)].head(5)
print(liverpool_home[['Date', 'HomeTeam', 'AwayTeam', 'home_matches_played', 
                       'home_goals_scored_avg', 'home_goals_conceded_avg', 'FTHG', 'FTAG']])

Liverpool home matches in 2019-20 season (first 5 games):
         Date   HomeTeam   AwayTeam  home_matches_played  \
0  2019-08-09  Liverpool    Norwich                    0   
25 2019-08-24  Liverpool    Arsenal                    1   
40 2019-09-14  Liverpool  Newcastle                    2   
72 2019-10-05  Liverpool  Leicester                    3   
98 2019-10-27  Liverpool  Tottenham                    4   

    home_goals_scored_avg  home_goals_conceded_avg  FTHG  FTAG  
0                1.445098                 1.445098     4     1  
25               4.000000                 1.000000     3     1  
40               3.500000                 1.000000     3     1  
72               3.333333                 1.000000     2     1  
98               3.000000                 1.000000     2     1  



# **STEP 3 — Time-Based Train/Test Split**
**Evaluation Strategy:**
- **Train:** 2019-20 + 2020-21 seasons (640 matches)
- **Test:** 2021-22 season (380 matches)
- **Why:** No temporal leakage, realistic future prediction


In [17]:

# Add Season column to X and y for filtering
X['Season'] = df_features['Season']

# Create time-based split masks
train_mask = X['Season'].isin([2019, 2020])
test_mask = X['Season'] == 2021

# Split features and targets
X_train = X[train_mask].drop('Season', axis=1).copy()
X_test = X[test_mask].drop('Season', axis=1).copy()

y_home_train = y_home[train_mask].copy()
y_home_test = y_home[test_mask].copy()

y_away_train = y_away[train_mask].copy()
y_away_test = y_away[test_mask].copy()

print("="*60)
print("TIME-BASED TRAIN/TEST SPLIT")
print("="*60)
print(f"\nTraining Set:")
print(f"  Seasons: 2019-20, 2020-21")
print(f"  Samples: {len(X_train)} matches")
print(f"\nTest Set:")
print(f"  Season: 2021-22")
print(f"  Samples: {len(X_test)} matches")
print(f"\nSplit Ratio: {len(X_test)/(len(X_train)+len(X_test))*100:.1f}% test")
print("\nNote: Model trained on past seasons, tested on future season")
print("="*60)

# Remove Season from original X to keep it clean
X = X.drop('Season', axis=1)


TIME-BASED TRAIN/TEST SPLIT

Training Set:
  Seasons: 2019-20, 2020-21
  Samples: 640 matches

Test Set:
  Season: 2021-22
  Samples: 380 matches

Split Ratio: 37.3% test

Note: Model trained on past seasons, tested on future season



# **STEP 4 — Preprocessing Pipelines**
Separate categorical and numeric pipelines.
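
No code survives for this step, so the following is only a sketch of one way to build it from the classes imported at the top of the notebook. The feature lists match those defined in Step 2. The choice of `handle_unknown="ignore"` and the tiny `demo` frame are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_features = ["HomeTeam", "AwayTeam"]
numerical_features = [
    "home_goals_scored_avg", "home_goals_conceded_avg",
    "away_goals_scored_avg", "away_goals_conceded_avg",
    "home_matches_played", "away_matches_played",
]

preprocessor = ColumnTransformer(transformers=[
    # Assumption: ignore unseen categories, since newly promoted teams in the
    # test season never appear in the training seasons
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ("num", StandardScaler(), numerical_features),
])

# Hypothetical rows, only to show the transformed shape
demo = pd.DataFrame({
    "HomeTeam": ["Liverpool", "Burnley", "Brentford"],
    "AwayTeam": ["Norwich", "Southampton", "Norwich"],
    "home_goals_scored_avg": [2.0, 1.0, 1.5],
    "home_goals_conceded_avg": [1.0, 1.5, 1.2],
    "away_goals_scored_avg": [0.8, 1.1, 0.9],
    "away_goals_conceded_avg": [2.0, 1.3, 1.8],
    "home_matches_played": [3, 4, 0],
    "away_matches_played": [2, 3, 1],
})

# Fit on the first two rows, then transform a row whose home team was never seen
preprocessor.fit(demo.iloc[:2])
Xt = preprocessor.transform(demo.iloc[2:])
print(Xt.shape)  # 2 home-team columns + 2 away-team columns + 6 numeric columns
```

Fitting the transformer on training rows only keeps the scaler statistics and the team vocabulary free of test-season information.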




## **STEP 6 — Train Regression Pipelines**
Models:
- Ridge  
- Lasso  
- RandomForestRegressor  
- GradientBoostingRegressor  
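
The training cell is also missing, so here is a minimal sketch of how the four models could be wrapped in pipelines. It runs on synthetic data rather than the match data. The hyperparameters, the `make_preprocessor` helper, and the `X_demo`/`y_demo` names are all assumptions, not the notebook's settings. In the real workflow each pipeline would be fitted twice, once per target (FTHG and FTAG).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the match data (random values, not real results)
rng = np.random.default_rng(0)
n = 60
X_demo = pd.DataFrame({
    "HomeTeam": rng.choice(["A", "B", "C", "D"], n),
    "AwayTeam": rng.choice(["A", "B", "C", "D"], n),
    "home_goals_scored_avg": rng.uniform(0.5, 2.5, n),
    "away_goals_scored_avg": rng.uniform(0.5, 2.5, n),
})
y_demo = rng.poisson(1.4, n)  # goal-like counts


def make_preprocessor():
    """Hypothetical helper: a fresh preprocessor for each pipeline."""
    return ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["HomeTeam", "AwayTeam"]),
        ("num", StandardScaler(), ["home_goals_scored_avg", "away_goals_scored_avg"]),
    ])


# Hyperparameters are placeholders, not tuned values
models = {
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.01),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=42),
    "GradientBoosting": GradientBoostingRegressor(random_state=42),
}

fitted = {}
for name, model in models.items():
    pipe = Pipeline([("prep", make_preprocessor()), ("model", model)])
    fitted[name] = pipe.fit(X_demo, y_demo)

print({name: pipe.predict(X_demo.head(3)).round(2) for name, pipe in fitted.items()})
```

Keeping preprocessing inside each `Pipeline` means the encoder and scaler are refitted with the model, which also makes the pipelines safe to pass to `GridSearchCV`.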



# **STEP 7 — Evaluate Models**
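
The evaluation code is missing from this step. One possible shape for it, using the `mean_squared_error` and `r2_score` metrics imported earlier, is sketched below. The `evaluate_regression` helper and the prediction values are hypothetical. The true values are the first five FTHG entries shown in Step 1.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score


def evaluate_regression(y_true, y_pred):
    """Hypothetical helper: return RMSE and R^2 for one target."""
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    return {"rmse": rmse, "r2": float(r2_score(y_true, y_pred))}


y_true = np.array([4, 0, 1, 3, 0])            # first five FTHG values from the data
y_pred = np.array([2.0, 1.0, 1.5, 1.5, 1.0])  # made-up model predictions

# Constant-mean baseline; in practice the mean should come from the training set
baseline_pred = np.full(len(y_true), y_true.mean())

model_scores = evaluate_regression(y_true, y_pred)
baseline_scores = evaluate_regression(y_true, baseline_pred)

print("model:   ", model_scores)
print("baseline:", baseline_scores)
```

Reporting a constant-mean baseline next to each model shows whether the features add anything beyond predicting the average number of goals.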



# **STEP 10 — Analysis + Conclusion**

Write your analysis here:
- Which regression model performed the best?  
- Which classification algorithm performed the best?  
- Was the dataset balanced?  
- Which features were most important?  
- What improvements could be made in the next iteration?  
