# <p style="padding:10px;background-color:skyblue;margin:0;color:navy;font-family:newtimeroman;font-size:100%;text-align:left;border: 2px solid black; border-radius: 5px; overflow:hidden;font-weight:500">EPL Match Statistical Analysis </p>

- Let's take a look at English Premier League data from 2021 to 2024.
- Then see whether we can train a predictive model for some of the recorded stats and results of football matches.



# <p style="padding:10px;background-color:skyblue;margin:0;color:navy;font-family:newtimeroman;font-size:100%;text-align:left;border: 2px solid black; border-radius: 5px; overflow:hidden;font-weight:500">Importing Packages  </p>


In [44]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# <p style="padding:10px;background-color:skyblue;margin:0;color:navy;font-family:newtimeroman;font-size:100%;text-align:left;border: 2px solid black; border-radius: 5px; overflow:hidden;font-weight:500">Load the dataset </p>


In [45]:
# Load the dataset
file_path = '/Users/neomodibedi/Downloads/EPL_Data/English_PremieSre_League.csv'  
football_data = pd.read_csv(file_path)

**About the Dataset:**
- The dataset contains detailed football match data, including results, team names, and various betting odds.
- Here's a breakdown of the key columns:

**Div:** The division (e.g., E0 refers to the English Premier League).

**Date:** The match date.

**HomeTeam and AwayTeam:** Teams involved in the match.

**FTHG (Full-Time Home Goals) and FTAG (Full-Time Away Goals):** Number of goals scored by the home and away teams at full time.

**FTR:** Full-Time Result (H = Home Win, D = Draw, A = Away Win).

**HTHG (Half-Time Home Goals) and HTAG (Half-Time Away Goals):** Number of goals scored by home and away teams at half time.

**Various betting odds:** Columns holding odds from different bookmakers (e.g., B365CAHH and B365CAHA are the Bet365 closing Asian handicap odds for the home and away team).


In [46]:
football_data.head()

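|   | Div | Date | Time | HomeTeam | AwayTeam | FTHG | FTAG | FTR | HTHG | HTAG | ... | 1XBCH | 1XBCD | 1XBCA | BFECH | BFECD | BFECA | BFEC>2.5 | BFEC<2.5 | BFECAHH | BFECAHA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | E0 | 13/08/2021 | 20:00 | Brentford | Arsenal | 2 | 0 | H | 1 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | E0 | 14/08/2021 | 12:30 | Man United | Leeds | 5 | 1 | H | 1 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | E0 | 14/08/2021 | 15:00 | Burnley | Brighton | 1 | 2 | A | 1 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | E0 | 14/08/2021 | 15:00 | Chelsea | Crystal Palace | 3 | 0 | H | 2 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | E0 | 14/08/2021 | 15:00 | Everton | Southampton | 3 | 1 | H | 0 | 1 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |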


# <p style="padding:10px;background-color:skyblue;margin:0;color:navy;font-family:newtimeroman;font-size:100%;text-align:left;border: 2px solid black; border-radius: 5px; overflow:hidden;font-weight:500">Exploratory Data Analysis </p>


In [47]:
football_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1210 entries, 0 to 1209
Columns: 132 entries, Div to BFECAHA
dtypes: float64(108), int64(16), object(8)
memory usage: 1.2+ MB


In [48]:
football_data.shape

(1210, 132)

In [49]:
football_data.describe()

|   | FTHG | FTAG | HTHG | HTAG | HS | AS | HST | AST | HF | AF | ... | 1XBCH | 1XBCD | 1XBCA | BFECH | BFECD | BFECA | BFEC>2.5 | BFEC<2.5 | BFECAHH | BFECAHA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1210.0 | 1210.0 | 1210.0 | 1210.0 | 1210.0 | 1210.0 | 1210.0 | 1210.0 | 1210.0 | 1210.0 | ... | 70.0 | 70.0 | 70.0 | 70.0 | 70.0 | 70.0 | 70.0 | 70.0 | 70.0 | 70.0 |
| mean | 1.639669 | 1.33719 | 0.730579 | 0.595041 | 14.386777 | 11.790909 | 5.019008 | 4.169421 | 10.542149 | 10.842149 | ... | 2.747857 | 4.523571 | 4.893 | 2.768714 | 4.658571 | 5.483429 | 1.680571 | 2.575429 | 1.974571 | 2.015714 |
| std | 1.371124 | 1.233133 | 0.869561 | 0.790541 | 5.904603 | 5.251342 | 2.709931 | 2.437688 | 3.434168 | 3.667853 | ... | 1.6159 | 1.516727 | 4.478872 | 1.658472 | 1.645569 | 5.691861 | 0.201652 | 0.490701 | 0.095821 | 0.098871 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.09 | 3.18 | 1.34 | 1.12 | 3.25 | 1.35 | 1.33 | 1.76 | 1.8 | 1.85 |
| 25% | 1.0 | 0.0 | 0.0 | 0.0 | 10.0 | 8.0 | 3.0 | 2.0 | 8.0 | 8.0 | ... | 1.6525 | 3.63 | 2.195 | 1.6975 | 3.75 | 2.285 | 1.5125 | 2.165 | 1.89 | 1.93 |
| 50% | 1.0 | 1.0 | 1.0 | 0.0 | 14.0 | 11.0 | 5.0 | 4.0 | 10.0 | 11.0 | ... | 2.32 | 3.97 | 3.06 | 2.35 | 4.0 | 3.25 | 1.655 | 2.49 | 1.96 | 2.02 |
| 75% | 2.0 | 2.0 | 1.0 | 1.0 | 18.0 | 15.0 | 7.0 | 6.0 | 13.0 | 13.0 | ... | 3.2525 | 4.5425 | 5.37 | 3.15 | 4.775 | 5.575 | 1.84 | 2.91 | 2.0575 | 2.0975 |
| max | 9.0 | 8.0 | 5.0 | 5.0 | 36.0 | 31.0 | 16.0 | 15.0 | 23.0 | 25.0 | ... | 8.57 | 11.0 | 26.0 | 9.6 | 12.5 | 32.0 | 2.28 | 4.0 | 2.16 | 2.22 |


# <p style="padding:10px;background-color:skyblue;margin:0;color:navy;font-family:newtimeroman;font-size:100%;text-align:left;border: 2px solid black; border-radius: 5px; overflow:hidden;font-weight:500">Data Preprocessing </p>


In [50]:
# Identify missing values
print(football_data.isnull().sum())

# Drop or fill missing values (depending on the situation)
#football_data = football_data.dropna()  # Or use filling methods


Div            0
Date           0
Time           0
HomeTeam       0
AwayTeam       0
            ... 
BFECA       1140
BFEC>2.5    1140
BFEC<2.5    1140
BFECAHH     1140
BFECAHA     1140
Length: 132, dtype: int64
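
The counts above show that some bookmaker columns are missing in 1140 of the 1210 rows, so calling `dropna()` on the whole frame would keep at most 70 matches. A minimal sketch of one alternative, dropping sparse columns instead of rows. The toy frame and the 50% threshold are assumptions for illustration, not values taken from the notebook:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for football_data (values are made up).
toy = pd.DataFrame({
    "FTHG": [2, 5, 1, 3],
    "FTAG": [0, 1, 2, 0],
    "BFECH": [np.nan, np.nan, np.nan, 1.9],  # mostly missing, like the BFEC* columns
})

# Fraction of missing values per column.
missing_share = toy.isnull().mean()

# Keep only columns where at most half of the values are missing (assumed threshold).
max_missing = 0.5
kept = toy.loc[:, missing_share <= max_missing]

print(missing_share)
print(kept.columns.tolist())
```

All rows survive this step; only the sparse odds column is removed.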


In [51]:
# The raw dates are day-first (e.g. 13/08/2021), so state that explicitly
football_data['Date'] = pd.to_datetime(football_data['Date'], dayfirst=True, errors='coerce')
# Drop rows whose date could not be parsed
football_data = football_data.dropna(subset=['Date'])
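
The raw `Date` strings are day-first (`13/08/2021`), so it is safer to state that explicitly than to rely on inference. A small sketch on made-up strings showing what `dayfirst=True` changes for an ambiguous date, and how an explicit `format` combined with `errors="coerce"` turns unparseable entries into `NaT`:

```python
import pandas as pd

# An ambiguous date: 1 February 2022 in day-first notation.
ambiguous = "01/02/2022"

as_day_first = pd.to_datetime(ambiguous, dayfirst=True)
as_month_first = pd.to_datetime(ambiguous, dayfirst=False)

print(as_day_first.date())    # 2022-02-01
print(as_month_first.date())  # 2022-01-02

# An explicit format is stricter; with errors="coerce" non-matching strings become NaT.
dates = pd.Series(["13/08/2021", "14/08/2021", "not a date"])
parsed = pd.to_datetime(dates, format="%d/%m/%Y", errors="coerce")
print(parsed.isna().sum())
```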


In [52]:
football_data.head()

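|   | Div | Date | Time | HomeTeam | AwayTeam | FTHG | FTAG | FTR | HTHG | HTAG | ... | 1XBCH | 1XBCD | 1XBCA | BFECH | BFECD | BFECA | BFEC>2.5 | BFEC<2.5 | BFECAHH | BFECAHA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | E0 | 2021-08-13 | 20:00 | Brentford | Arsenal | 2 | 0 | H | 1 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | E0 | 2021-08-14 | 12:30 | Man United | Leeds | 5 | 1 | H | 1 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | E0 | 2021-08-14 | 15:00 | Burnley | Brighton | 1 | 2 | A | 1 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | E0 | 2021-08-14 | 15:00 | Chelsea | Crystal Palace | 3 | 0 | H | 2 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | E0 | 2021-08-14 | 15:00 | Everton | Southampton | 3 | 1 | H | 0 | 1 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |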


In [53]:
# Date is already datetime64 after the earlier conversion, so this call leaves it unchanged
football_data['Date'] = pd.to_datetime(football_data['Date'], dayfirst=True)


In [54]:
football_data['DayOfWeek'] = football_data['Date'].dt.dayofweek
football_data['Month'] = football_data['Date'].dt.month
football_data['Year'] = football_data['Date'].dt.year


- **Encoding Target (FTR):**

Let's encode the full-time result (FTR) as a numeric label (1 for home win, 0 for draw, -1 for away win). The goal models below do not use this label, but it is useful for a potential classification task.

- **Categorical Variable Encoding:**

LabelEncoder transforms HomeTeam and AwayTeam into numeric labels for modeling. This is straightforward, although the integer codes carry no real ordering between teams.
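
One detail worth checking with `LabelEncoder` is that a team receives the same integer whether it plays at home or away. A minimal sketch on made-up fixtures, fitting a single encoder on the union of both columns. The toy frame is an assumption for illustration, not the notebook's data:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up fixtures in which the home and away columns hold different team sets.
toy = pd.DataFrame({
    "HomeTeam": ["Arsenal", "Chelsea", "Everton"],
    "AwayTeam": ["Chelsea", "Everton", "Fulham"],
})

# Fitting separately gives each column its own numbering.
home_codes = LabelEncoder().fit_transform(toy["HomeTeam"])
away_codes = LabelEncoder().fit_transform(toy["AwayTeam"])
print(home_codes, away_codes)  # Chelsea is 1 at home but 0 away

# Fitting once on every team name keeps the numbering consistent.
shared = LabelEncoder().fit(pd.concat([toy["HomeTeam"], toy["AwayTeam"]]))
toy["HomeTeam_encoded"] = shared.transform(toy["HomeTeam"])
toy["AwayTeam_encoded"] = shared.transform(toy["AwayTeam"])
print(dict(zip(shared.classes_, range(len(shared.classes_)))))
```

When every team appears in both columns, as in complete seasons, the two approaches agree; the shared encoder simply removes the dependency on that condition.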


In [55]:

# Encode Full-Time Result (FTR) as a numeric label for training
football_data['FTR_label'] = football_data['FTR'].map({'H': 1, 'D': 0, 'A': -1})

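# Encode categorical variables (HomeTeam and AwayTeam).
# Fit once on all team names so a team gets the same code at home and away.
label_encoder = LabelEncoder()
label_encoder.fit(pd.concat([football_data['HomeTeam'], football_data['AwayTeam']]))
football_data['HomeTeam_encoded'] = label_encoder.transform(football_data['HomeTeam'])
football_data['AwayTeam_encoded'] = label_encoder.transform(football_data['AwayTeam'])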

# Prepare data for training (features and targets)
X = football_data[['HomeTeam_encoded', 'AwayTeam_encoded', 'DayOfWeek', 'Month', 'Year']]

y_home = football_data['FTHG']  # Target for Home Goals
y_away = football_data['FTAG']  # Target for Away Goals



In [56]:
# Split the data into training and testing sets (70% training, 30% testing)
X_train_home, X_test_home, y_home_train, y_home_test = train_test_split(
    X, y_home, test_size=0.3, random_state=42
)
X_train_away, X_test_away, y_away_train, y_away_test = train_test_split(
    X, y_away, test_size=0.3, random_state=42
)
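
Because both calls above use the same `X`, `test_size`, and `random_state`, they produce identical feature splits. `train_test_split` accepts any number of arrays, so a single call can split the features and both targets together. A sketch on placeholder arrays (the data below is synthetic):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 rows, 2 features, two targets.
X = np.arange(20).reshape(10, 2)
y_home = np.arange(10)
y_away = np.arange(10) * 10

# One call returns a train/test pair for every array passed in.
X_train, X_test, y_home_train, y_home_test, y_away_train, y_away_test = train_test_split(
    X, y_home, y_away, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)
```

The rows stay aligned across all returned arrays, so both goal models can share `X_train` and `X_test`.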

- **Random Forest Regressors:**

Separate models for predicting home and away goals.
Random Forest is robust and handles interactions between features well.


In [57]:
model_home_goals = RandomForestRegressor(random_state=42)
model_home_goals.fit(X_train_home, y_home_train)

model_away_goals = RandomForestRegressor(random_state=42)
model_away_goals.fit(X_train_away, y_away_train)


- **Prediction Function:**

The function encodes user-input team names and predicts match scores.

In [58]:
def predict_match_score(home_team, away_team, match_date):
    try:
        # Encode team names
        home_encoded = label_encoder.transform([home_team])[0]
        away_encoded = label_encoder.transform([away_team])[0]

        # Extract date features
        match_date = pd.to_datetime(match_date)
        day_of_week = match_date.dayofweek
        month = match_date.month
        year = match_date.year
        
        # Create input data with all required features
        input_data = pd.DataFrame([[home_encoded, away_encoded, day_of_week, month, year]],
                                  columns=['HomeTeam_encoded', 'AwayTeam_encoded', 'DayOfWeek', 'Month', 'Year'])
        
        # Predict home and away goals
        predicted_home_goals = model_home_goals.predict(input_data)[0]
        predicted_away_goals = model_away_goals.predict(input_data)[0]
        
        return predicted_home_goals, predicted_away_goals
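    except ValueError as e:
        # Re-raise instead of returning a string, so callers that unpack two values fail clearly
        raise ValueError(f"{e}. Make sure the team names and match date are correct.") from e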


# <p style="padding:10px;background-color:skyblue;margin:0;color:navy;font-family:newtimeroman;font-size:100%;text-align:left;border: 2px solid black; border-radius: 5px; overflow:hidden;font-weight:500">Predictions  </p>


In [66]:
home_team = 'Liverpool'
away_team = 'Chelsea'
match_date = '2025-02-10'  # Specify the match date
predicted_home_goals, predicted_away_goals = predict_match_score(home_team, away_team, match_date)

print(f'Predicted Score: {home_team} {round(predicted_home_goals)} - {round(predicted_away_goals)} {away_team}')


Predicted Score: Liverpool 3 - 1 Chelsea


In [60]:
# Example usage
home_team = 'Southampton'  # Change to the home team name you want to test
away_team = 'Leicester'  # Change to the away team name you want to test
match_date = '2025-01-25'  # Specify the match date
predicted_home_goals, predicted_away_goals = predict_match_score(home_team, away_team, match_date)

print(f'Predicted Score: {home_team} {round(predicted_home_goals)} - {round(predicted_away_goals)} {away_team}')


Predicted Score: Southampton 1 - 2 Leicester


In [61]:
# Example usage
home_team = 'Fulham'  # Change to the home team name you want to test
away_team = 'Aston Villa'  # Change to the away team name you want to test
match_date = '2024-11-20'  # Specify the match date
predicted_home_goals, predicted_away_goals = predict_match_score(home_team, away_team, match_date)

print(f'Predicted Score: {home_team} {round(predicted_home_goals)} - {round(predicted_away_goals)} {away_team}')


Predicted Score: Fulham 2 - 2 Aston Villa


In [62]:
# Check accuracy of models
def check_accuracy(model, X_test, y_test):
    y_pred = model.predict(X_test)
    
    # Calculate metrics
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)
    
    return mae, mse, rmse, r2

# <p style="padding:10px;background-color:skyblue;margin:0;color:navy;font-family:newtimeroman;font-size:100%;text-align:left;border: 2px solid black; border-radius: 5px; overflow:hidden;font-weight:500">Model Evaluation </p>


In [63]:
# Predicting on the test set for home goals
y_home_pred = model_home_goals.predict(X_test_home)

# Evaluation metrics for home goals
mae_home = mean_absolute_error(y_home_test, y_home_pred)
mse_home = mean_squared_error(y_home_test, y_home_pred)
rmse_home = np.sqrt(mse_home)
r2_home = r2_score(y_home_test, y_home_pred)

# Display results
print(f"Home Goals Model Evaluation:")
print(f"Mean Absolute Error: {mae_home:.2f}")
print(f"Mean Squared Error: {mse_home:.2f}")
print(f"Root Mean Squared Error: {rmse_home:.2f}")
print(f"R² Score: {r2_home:.2f}")


Home Goals Model Evaluation:
Mean Absolute Error: 1.04
Mean Squared Error: 1.69
Root Mean Squared Error: 1.30
R² Score: -0.03


In [64]:
# Predicting on the test set for away goals
y_away_pred = model_away_goals.predict(X_test_away)

# Evaluation metrics for away goals
mae_away = mean_absolute_error(y_away_test, y_away_pred)
mse_away = mean_squared_error(y_away_test, y_away_pred)
rmse_away = np.sqrt(mse_away)
r2_away = r2_score(y_away_test, y_away_pred)

# Display results
print(f"Away Goals Model Evaluation:")
print(f"Mean Absolute Error: {mae_away:.2f}")
print(f"Mean Squared Error: {mse_away:.2f}")
print(f"Root Mean Squared Error: {rmse_away:.2f}")
print(f"R² Score: {r2_away:.2f}")


Away Goals Model Evaluation:
Mean Absolute Error: 0.99
Mean Squared Error: 1.59
Root Mean Squared Error: 1.26
R² Score: -0.04


**The Home Goals and Away Goals models are not very accurate, as indicated by their evaluation metrics. Here's a detailed breakdown:**

Key Metrics:

**Mean Absolute Error (MAE):**

- Home Goals: 1.04
- Away Goals: 0.99
- These values suggest that, on average, predictions are off by about 1 goal. That is a significant margin in football, where goal counts are low (this dataset averages 1.64 home goals and 1.34 away goals per match).

**Root Mean Squared Error (RMSE):**

- Home Goals: 1.30
- Away Goals: 1.26
- RMSE, which penalizes larger errors more than MAE, is close to the standard deviation of the targets in the full dataset (1.37 for home goals, 1.23 for away goals). That is roughly the error a constant prediction of the mean would produce.
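
A short worked example of why RMSE reacts more strongly to large misses than MAE. The two prediction sets below are made up and share the same MAE:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

actual = np.array([1, 1, 1, 1])

# Four misses of one goal each versus a single miss of four goals.
even_misses = np.array([2, 2, 2, 2])
one_big_miss = np.array([1, 1, 1, 5])

for name, predicted in [("even", even_misses), ("one big", one_big_miss)]:
    mae = mean_absolute_error(actual, predicted)
    rmse = np.sqrt(mean_squared_error(actual, predicted))
    print(f"{name}: MAE={mae:.2f}, RMSE={rmse:.2f}")
# even: MAE=1.00, RMSE=1.00
# one big: MAE=1.00, RMSE=2.00
```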

**R² Score:**

- Home Goals: -0.03
- Away Goals: -0.04
- A negative R² means that the models are worse than a simple baseline prediction (e.g., always predicting the average number of goals). This indicates the models fail to capture any meaningful patterns in the data.
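
The baseline mentioned above can be made concrete with scikit-learn's `DummyRegressor`, which always predicts the training mean. A sketch on synthetic Poisson-distributed goal counts; the data and the rate of 1.6 are assumptions for illustration, not the notebook's dataset:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in: uninformative features and Poisson "goals".
X = rng.normal(size=(500, 3))
y = rng.poisson(lam=1.6, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Always predicts the mean of y_train, ignoring X entirely.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
y_pred = baseline.predict(X_test)

print(f"Baseline MAE: {mean_absolute_error(y_test, y_pred):.2f}")
print(f"Baseline R2:  {r2_score(y_test, y_pred):.2f}")
```

On a held-out set, a constant prediction can never score an R² above zero, so R² values of -0.03 and -0.04 place both random forests at roughly this baseline level.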


# Accuracy Interpretation:
- Low Accuracy: Both models struggle to make precise predictions, and their performance is no better than always predicting the average number of goals.
- Not Reliable: The metrics indicate that the models cannot yet be used for actionable insights or predictions in their current state.


# Acceptable Benchmarks in Football Predictions:
- MAE < 0.5: A rough target for goal predictions rather than an established standard.
- R² > 0.5: Would mean the model explains at least 50% of the variance in the target variable, which is likely ambitious for low-scoring, noisy outcomes such as goals.
- RMSE close to MAE: Suggests fewer large prediction errors.

**Overall Accuracy Assessment:**

- The models are not accurate at present. The conclusion lists possible improvements.


# <p style="padding:10px;background-color:skyblue;margin:0;color:navy;font-family:newtimeroman;font-size:100%;text-align:left;border: 2px solid black; border-radius: 5px; overflow:hidden;font-weight:500">Conclusion  </p>


- Collect recent game data.
- Prepare recent game statistics.
- Merge recent statistics into the main dataset.
- Explore more advanced or more appropriate models.
- Apply hyperparameter optimization and cross-validation.
- Feature engineering: add more features, such as previous performance data, team strength, player-related features, weather, and squad lists.
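
As one illustration of the "previous performance" idea, the sketch below computes each home team's average goals over its previous home matches, shifted so that a match never sees its own result. The toy fixtures, the window of 3, and the column name `HomeForm` are assumptions for illustration:

```python
import pandas as pd

# Made-up, date-ordered fixtures.
toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2021-08-14", "2021-08-21", "2021-08-28", "2021-09-11", "2021-09-18"]
    ),
    "HomeTeam": ["Chelsea", "Chelsea", "Everton", "Chelsea", "Everton"],
    "FTHG": [3, 1, 2, 2, 0],
}).sort_values("Date")

# Rolling mean of previous home goals per team; shift(1) excludes the current match.
toy["HomeForm"] = (
    toy.groupby("HomeTeam")["FTHG"]
    .transform(lambda goals: goals.shift(1).rolling(window=3, min_periods=1).mean())
)

print(toy)
```

Each team's first home match has no history and gets `NaN`, which would need to be filled or dropped before training.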