## Goal

The goal of this project is to demonstrate and apply the concepts learned in my Machine Learning course by developing a match prediction system. To achieve this, I will implement both supervised and unsupervised learning techniques. Unsupervised models are used to explore the underlying structure of the data, such as identifying patterns, similarities, or groupings among teams or matches, which informs feature engineering and model understanding. Supervised learning models are then trained on labeled match outcomes to perform the actual prediction task. Together, these approaches allow for both interpretability and predictive performance while showcasing a comprehensive application of machine learning methodologies.


### Unsupervised Learning Models

1. **K-Means Clustering**  
   K-Means is used to cluster teams based on performance statistics such as scoring efficiency, defensive metrics, and possession-related features. The goal of this model is to identify latent play styles or performance tiers. Cluster assignments are later incorporated as additional features for supervised prediction.

2. **Principal Component Analysis (PCA)**  
   PCA is applied for dimensionality reduction and feature decorrelation. By projecting the original feature space into a smaller set of principal components, PCA reduces noise and multicollinearity while retaining most of the variance in the data. These components provide a compact representation of team or match characteristics.
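
The two unsupervised steps above can be sketched as follows. This is a minimal illustration, not the project's final pipeline: the per-team table is synthetic, the metric names (`goals_per_shot`, `shots_conceded`, `corners`, `fouls`) and the helper `cluster_and_project` are hypothetical, and the choices of three clusters and two components are assumptions made only for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def cluster_and_project(team_stats, n_clusters=3, n_components=2, seed=0):
    """Scale per-team statistics, cluster them, and project them with PCA."""
    # K-Means and PCA are both sensitive to feature scale, so standardise first.
    scaled = StandardScaler().fit_transform(team_stats)

    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(scaled)

    pca = PCA(n_components=n_components, random_state=seed)
    components = pca.fit_transform(scaled)

    result = pd.DataFrame(
        components,
        index=team_stats.index,
        columns=[f"PC{i + 1}" for i in range(n_components)],
    )
    result["cluster"] = labels
    return result, pca.explained_variance_ratio_


# Synthetic stand-in for a per-team table (20 teams, 4 made-up metrics).
rng = np.random.default_rng(0)
team_stats = pd.DataFrame(
    rng.normal(size=(20, 4)),
    index=[f"Team {i}" for i in range(20)],
    columns=["goals_per_shot", "shots_conceded", "corners", "fouls"],
)

team_summary, explained = cluster_and_project(team_stats)
print(team_summary.head())
print("Variance retained:", explained.sum())
```

In practice the number of clusters and components would be chosen from the data, for example with an elbow plot for K-Means and the cumulative explained variance for PCA.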

### Supervised Learning Models

1. **Logistic Regression**  
   Logistic Regression serves as a baseline classifier for predicting match outcomes (win/draw/loss). It provides interpretability through feature coefficients and establishes a performance benchmark for more complex models.

2. **Random Forest Classifier**  
   A Random Forest model is used to capture non-linear relationships and feature interactions that logistic regression cannot model effectively. Because it averages many decorrelated trees, it is less prone to overfitting than a single decision tree, and it performs well on the structured, tabular data commonly found in sports analytics.

3. **Gradient Boosting (e.g., XGBoost or GradientBoostingClassifier)**  
   Gradient Boosting is employed as a high-performance model to further improve predictive accuracy. Because each new tree is fitted to the errors of the trees before it, this model often achieves strong results on tabular prediction tasks, especially when feature interactions are important.
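
As a rough sketch of how the three classifiers above could be compared, the snippet below runs 5-fold cross-validation on synthetic three-class data standing in for win/draw/loss labels. The data, the hyperparameters (`n_estimators=50`, `max_iter=1000`), and the use of scikit-learn's `GradientBoostingClassifier` rather than XGBoost are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data; the real features would come from the match table.
X, y = make_classification(
    n_samples=300,
    n_features=8,
    n_informative=4,
    n_redundant=0,
    n_classes=3,
    random_state=0,
)

models = {
    # Logistic regression benefits from scaled inputs, so wrap it in a pipeline.
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Mean accuracy over 5 stratified folds for each model.
cv_accuracy = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in cv_accuracy.items():
    print(f"{name}: {score:.3f}")
```

For real match data, a chronological split (training on earlier seasons, testing on later ones) is likely a better fit than the default folds used here, since matches are ordered in time.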

### Model Integration Strategy

Outputs from the unsupervised models (cluster labels and/or principal components) are appended to the original feature set and used as inputs to the supervised models. This hybrid approach leverages unsupervised learning for structure discovery and supervised learning for outcome prediction, providing both explanatory insights and strong predictive performance.
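
One way this integration could look is sketched below. It assumes the clustering and PCA are applied directly to the rows of the feature matrix, which is a simplification of the per-team clustering described earlier. The helper `augment_features` is hypothetical, and the data is random. The scaler, K-Means, and PCA are fitted on the training rows only so that no information from the test rows leaks into the features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def augment_features(X_train, X_test, n_clusters=3, n_components=2, seed=0):
    """Append a cluster label and principal components to both feature sets."""
    # Fit every transformer on the training rows only.
    scaler = StandardScaler().fit(X_train)
    train_scaled = scaler.transform(X_train)
    test_scaled = scaler.transform(X_test)

    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    kmeans.fit(train_scaled)
    pca = PCA(n_components=n_components, random_state=seed).fit(train_scaled)

    def build(X, scaled):
        cluster = kmeans.predict(scaled).reshape(-1, 1)
        # Original columns first, then the cluster label, then the components.
        return np.hstack([X, cluster, pca.transform(scaled)])

    return build(X_train, train_scaled), build(X_test, test_scaled)


# Random stand-in data: 80 training rows and 20 test rows with 5 features each.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 5))
X_test = rng.normal(size=(20, 5))

X_train_aug, X_test_aug = augment_features(X_train, X_test)
print(X_train_aug.shape, X_test_aug.shape)
```

The cluster label is an arbitrary integer, so for linear models such as logistic regression it would usually be one-hot encoded before use; tree-based models can work with it as is.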

### Data Limitations
Because the model focuses on the English Premier League, I will only use the publicly available data from https://www.football-data.co.uk/englandm.php. Although the site's history goes back a long way, I will only look at the past five seasons for now, including data from the current season, and will expand the range once the initial work is done. The data does not include European form or lower-level leagues; it covers only the 380 Premier League matches played each season.
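
A quick sanity check on the per-season match count could look like the sketch below. The helper `matches_per_season` and the `Season` column are hypothetical additions, and the tiny frames are made up so the snippet runs without the CSV files. With the real data, each frame would come from `pd.read_csv` on one season's file.

```python
import pandas as pd


def matches_per_season(frames_by_season):
    """Tag each season's frame with a Season label and count rows per season."""
    tagged = [
        frame.assign(Season=season) for season, frame in frames_by_season.items()
    ]
    combined = pd.concat(tagged, ignore_index=True)
    return combined, combined.groupby("Season").size()


# Made-up frames standing in for the per-season CSV files.
demo_frames = {
    "21-22": pd.DataFrame({"HomeTeam": ["A", "B", "C"], "AwayTeam": ["B", "C", "A"]}),
    "22-23": pd.DataFrame({"HomeTeam": ["A", "B"], "AwayTeam": ["C", "A"]}),
}

combined, counts = matches_per_season(demo_frames)
print(counts)
```

On the real files, a completed season should show 380 matches, and the current season should show fewer until it finishes.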

In [1]:
# Create the pandas DataFrame by concatenating one CSV file per season
import pandas as pd

csv_files = [
    "DATA/21-22.csv",
    "DATA/22-23.csv",
    "DATA/23-24.csv",
    "DATA/24-25.csv",
    "DATA/25-26.csv"
]

df = pd.concat(
    [pd.read_csv(file) for file in csv_files],
    ignore_index=True
)

print(df.shape)
df.head()


(1680, 162)


Unnamed: 0,Div,Date,Time,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,...,BMGMCA,BVCH,BVCD,BVCA,CLCH,CLCD,CLCA,LBCH,LBCD,LBCA
0,E0,13/08/2021,20:00,Brentford,Arsenal,2,0,H,1,0,...,,,,,,,,,,
1,E0,14/08/2021,12:30,Man United,Leeds,5,1,H,1,0,...,,,,,,,,,,
2,E0,14/08/2021,15:00,Burnley,Brighton,1,2,A,1,0,...,,,,,,,,,,
3,E0,14/08/2021,15:00,Chelsea,Crystal Palace,3,0,H,2,0,...,,,,,,,,,,
4,E0,14/08/2021,15:00,Everton,Southampton,3,1,H,0,1,...,,,,,,,,,,


### Encoding & Cleaning

In [11]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder


admin_cols = ['Div', 'Date', 'Time']
target_col = ['FTR']
halftime_cols = ['HTHG', 'HTAG', 'HTR']

bookmaker_prefixes_to_drop = [
    'PS', 'WH', 'VC', 'GB', 'BS', 'LB', 'SB', 'SJ', 'SY',
    'BM', 'BV', 'CL'
]

# str.startswith accepts a tuple, so each column is checked against all prefixes at once
betting_cols_to_drop = [
    col for col in df.columns
    if col.startswith(tuple(bookmaker_prefixes_to_drop))
]

cols_to_drop = admin_cols + halftime_cols + betting_cols_to_drop

df_clean = df.drop(columns=cols_to_drop, errors='ignore')


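# Encode the full-time result; LabelEncoder sorts classes alphabetically
# (A=0, D=1, H=2 when all three outcomes occur)
target_encoder = LabelEncoder()
y = target_encoder.fit_transform(df_clean['FTR'])

# Remove the target from the feature table.
# Note: FTHG/FTAG (full-time goals) are still in df_clean and determine FTR
# directly, so they must be excluded before fitting any predictive model.
df_clean = df_clean.drop(columns=['FTR'])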


le_home = LabelEncoder()
le_away = LabelEncoder()

df_clean['HomeTeam_encoded'] = le_home.fit_transform(df_clean['HomeTeam'])
df_clean['AwayTeam_encoded'] = le_away.fit_transform(df_clean['AwayTeam'])

df_clean = df_clean.drop(columns=['HomeTeam', 'AwayTeam'])


# Normalise textual missing-value markers to NaN
df_clean = df_clean.replace(
    ['Missing value', 'missing value', 'NA', 'N/A', ''],
    np.nan
)
# Coerce every column to numeric; non-numeric entries become NaN
for col in df_clean.columns:
    df_clean[col] = pd.to_numeric(df_clean[col], errors='coerce')

# Drop columns with more than 40% missing values
missing_fraction = df_clean.isna().mean()
cols_to_drop_missing = missing_fraction[missing_fraction > 0.4].index

df_clean = df_clean.drop(columns=cols_to_drop_missing)

# Impute the remaining gaps with each column's median
df_clean = df_clean.fillna(df_clean.median())

df_clean




Unnamed: 0,FTHG,FTAG,HS,AS,HST,AST,HF,AF,HC,AC,...,B365CAHH,B365CAHA,PCAHH,PCAHA,MaxCAHH,MaxCAHA,AvgCAHH,AvgCAHA,HomeTeam_encoded,AwayTeam_encoded
0,2,0,8,22,3,4,12,8,2,5,...,1.75,2.05,1.81,2.13,2.05,2.17,1.80,2.09,3,0
1,5,1,16,10,8,3,11,9,5,4,...,2.05,1.75,2.17,1.77,2.19,1.93,2.10,1.79,16,11
2,1,2,14,14,3,8,10,7,7,6,...,1.79,2.15,1.81,2.14,1.82,2.19,1.79,2.12,5,4
3,3,0,13,4,6,1,15,11,5,2,...,2.05,1.75,2.12,1.81,2.16,1.93,2.06,1.82,6,7
4,3,1,14,6,6,3,13,15,6,8,...,2.05,1.88,2.05,1.88,2.08,1.90,2.03,1.86,8,21
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1675,3,0,15,6,6,1,9,13,3,3,...,1.93,1.93,1.96,1.96,1.93,1.95,1.86,1.89,19,23
1676,1,0,5,6,1,2,10,10,3,6,...,1.98,1.88,2.05,1.88,2.00,1.88,1.95,1.77,22,17
1677,2,3,10,7,3,4,18,13,5,3,...,1.95,1.90,1.99,1.93,1.95,1.90,1.93,1.80,25,1
1678,1,1,7,18,2,4,8,9,0,3,...,1.83,2.03,1.91,2.02,1.85,2.03,1.81,1.96,3,11
