# Premier League Match Analysis: A Statistical Exploration

This project analyzes Premier League match results from the 2006-2007 season to the 2023-2024 season. It serves as a portfolio piece to demonstrate a wide array of foundational data analysis skills, from basic statistical concepts to more advanced techniques.

First, we need to import the libraries that we're going to use for analysis.

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
sns.set_theme()

Next, we load the data from our CSV file into a pandas DataFrame.

In [None]:
results_df = pd.read_csv("data/results_data.csv")

Now that the data is loaded into a DataFrame, we can take a quick look at what we are working with. First, we display the DataFrame's basic _info_, and then we look at its _first 5 rows_.

In [None]:
results_df.info()

In [None]:
results_df.head()

The dataset appears fine. I am a little suspicious of the team names and whether or not they are all spelled consistently. Let's take a look at the unique values that we have for the columns *home_team*, *away_team*, and *season*.

In [None]:
sorted(results_df.home_team.unique())

As I suspected, the names have issues. For example, the home_team column contains 3 different versions of the team Brighton & Hove Albion. Let's make a list of all the team names from both columns, sort it, and use that list to create a name map for both columns.

In [None]:
team_list = []

for team in results_df.home_team.unique():
    team_list.append(team)

for team in results_df.away_team.unique():
    if team not in team_list:
        team_list.append(team)

team_list.sort()

team_list_length = len(team_list)

print(f"The number of unique team names in this list is: {team_list_length}")

for team in team_list:
    print(team)
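
As a side note, the same sorted list can be built in one step by stacking both columns before taking the unique values. This is only a minimal sketch: `toy_df` and its team spellings are made-up stand-ins for `results_df`, not rows from the real file.

```python
import pandas as pd

# Toy stand-in for results_df; the team spellings here are illustrative only
toy_df = pd.DataFrame({
    "home_team": ["Brighton", "Arsenal", "Brighton & Hove Albion"],
    "away_team": ["Arsenal", "Man City", "Brighton"],
})

# Stack both columns, then keep the sorted distinct names
team_names = sorted(pd.concat([toy_df.home_team, toy_df.away_team]).unique())

print(f"The number of unique team names in this list is: {len(team_names)}")
for team in team_names:
    print(team)
```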

I am going to create a dictionary to map all of the variations to standard names. We will simply pick 1 version of a team name and apply it to any variations that might occur in the list.

In [None]:
team_name_mapping = {
    "AFC Bournemouth": "Bournemouth",
    "Brighton": "Brighton and Hove Albion",
    "Brighton & Hove Albion": "Brighton and Hove Albion",
    "Cardiff": "Cardiff City",
    "Huddersfield": "Huddersfield Town",
    "Leicester": "Leicester City",
    "Man City": "Manchester City",
    "Man United": "Manchester United",
    "Newcastle": "Newcastle United",
    "Norwich": "Norwich City",
    "Nottingham": "Nottingham Forest",
    "Tottenham": "Tottenham Hotspur",
    "West Brom": "West Bromwich Albion",
    "West Ham": "West Ham United",
    "Wolves": "Wolverhampton Wanderers"
}

Let's go ahead and apply our mapping and just double check that we captured all of the issues in our columns.

In [None]:
results_df['home_team'] = results_df['home_team'].replace(team_name_mapping)
results_df['away_team'] = results_df['away_team'].replace(team_name_mapping)

In [None]:
team_list = []

for team in results_df.home_team.unique():
    team_list.append(team)

for team in results_df.away_team.unique():
    if team not in team_list:
        team_list.append(team)

team_list.sort()

team_list_length = len(team_list)

print(f"The number of unique team names in this list is: {team_list_length}")

for team in team_list:
    print(team)
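
Reading the printed list works, but a programmatic check is harder to get wrong. The sketch below assumes a small made-up DataFrame and a shortened mapping (`toy_df` and `toy_mapping` are hypothetical names); it confirms that none of the mapping's keys survive in either column after the replacement.

```python
import pandas as pd

# Toy data and a three-entry mapping, both illustrative only
toy_df = pd.DataFrame({
    "home_team": ["Brighton", "Man City", "Arsenal"],
    "away_team": ["Arsenal", "Brighton & Hove Albion", "Man City"],
})
toy_mapping = {
    "Brighton": "Brighton and Hove Albion",
    "Brighton & Hove Albion": "Brighton and Hove Albion",
    "Man City": "Manchester City",
}

for column in ["home_team", "away_team"]:
    toy_df[column] = toy_df[column].replace(toy_mapping)

# Any mapping key that still appears in the data was not replaced
remaining_names = set(toy_df.home_team) | set(toy_df.away_team)
leftover_variants = remaining_names & set(toy_mapping)
print(leftover_variants)
```

An empty set means every listed variant was replaced. It does not catch variants that were never added to the mapping, so the visual check above is still worth doing.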

I think we have taken care of all of the issues with the team names. Let's take a look at our *season* column and the unique values for that variable.

In [None]:
print(results_df.season.unique())

Since we have already shown in a previous step that we have no null values, and soccer matches are allowed to have 0 goals scored, I will not check the home_goals and away_goals columns. This is for 2 reasons. One, if the data has no values that are specifically labeled null or left blank, and a team is allowed to score 0 goals in a game, we cannot assume that a 0 means that the data is missing or null. Two, we will later be looking at the distribution of this data and getting an idea of its shape and potential outliers. So we can leave these alone for now.

Next, let's just make sure that the DataFrame data types are exactly how we want them to be for analysis. Let's go ahead and change the *home_team*, *away_team*, and *season* to the categorical data type.

In [None]:
results_df.info()

In [None]:
results_df.home_team = results_df.home_team.astype('category')
results_df.away_team = results_df.away_team.astype('category')
results_df.season = results_df.season.astype('category')

In [None]:
results_df.info()

## Feature Engineering

Before we begin to do further analysis I want to create a couple more variables that we might want to keep track of. I am going to start with some basic additions and then work up to some more complex ones.

Let's start with a goal difference column and a total goals column.

In [None]:
results_df["goal_difference"] = results_df.home_goals - results_df.away_goals
results_df["total_goals"] = results_df.home_goals + results_df.away_goals
results_df.head()

I am also going to create a column for the result of the match. I'm going to create it as a categorical variable at first and then use pandas' one-hot encoding to break it down into binary columns.

In [None]:
results_df["result"] = results_df.goal_difference.apply(lambda x: "home_win" if x > 0 else ("draw" if x == 0 else "away_win"))
results_df.head()

I'm going to turn the new *result* column into three binary columns by using pandas' `get_dummies`.

In [None]:
results_df = pd.get_dummies(data=results_df,columns=["result"])
results_df.head()
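
For reference, here is a small sketch of the same two steps on made-up data. It swaps the row-wise `apply` for NumPy's `np.select`, which is an alternative and not what the cells above use. It also passes `dtype=int` to `get_dummies`, because recent pandas versions return boolean columns by default. `toy_df` is a hypothetical stand-in for `results_df`.

```python
import numpy as np
import pandas as pd

# Toy goal differences: one home win, one draw, one away win
toy_df = pd.DataFrame({"goal_difference": [2, 0, -1]})

# Conditions are checked in order; anything left over is an away win
conditions = [toy_df.goal_difference > 0, toy_df.goal_difference == 0]
labels = ["home_win", "draw"]
toy_df["result"] = np.select(conditions, labels, default="away_win")
result_labels = toy_df["result"].tolist()

# One-hot encode the label column into 0/1 integer columns
toy_df = pd.get_dummies(toy_df, columns=["result"], dtype=int)
print(toy_df)
```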

I want to create a column that shows the team's average goals scored for the season up to that point. I also want to show their total goals scored. I'm going to create a function that will help me do this.

In [None]:
def calculate_average_goals_seasonal(df):
    df["home_team_goals_scored"] = 0
    df["away_team_goals_scored"] = 0
    df["home_team_avg_goals_scored"] = 0.0
    df["away_team_avg_goals_scored"] = 0.0

    for season in df['season'].unique():
        temp_season_df = df[df['season'] == season]
        for index in range(len(temp_season_df)):
            home_team = temp_season_df.iloc[index, 0]
            away_team = temp_season_df.iloc[index, 1]
            temp_df = temp_season_df.iloc[:index].reset_index(drop=True)

            home_team_home_goals_table = temp_df[temp_df.home_team == home_team]
            home_team_away_goals_table = temp_df[temp_df.away_team == home_team]
            home_team_home_goals_total = home_team_home_goals_table.home_goals.sum()
            home_team_away_goals_total = home_team_away_goals_table.away_goals.sum()
            home_team_cumulative_total = home_team_home_goals_total + home_team_away_goals_total
            home_team_games_count = len(home_team_home_goals_table) + len(home_team_away_goals_table)
            home_team_cumulative_average = home_team_cumulative_total / home_team_games_count if home_team_games_count > 0 else 0
            home_team_cumulative_average = round(home_team_cumulative_average, 2)
            df.at[temp_season_df.index[index], 'home_team_avg_goals_scored'] = home_team_cumulative_average
            df.at[temp_season_df.index[index], 'home_team_goals_scored'] = home_team_cumulative_total

            away_team_home_goals_table = temp_df[temp_df.home_team == away_team]
            away_team_away_goals_table = temp_df[temp_df.away_team == away_team]
            away_team_home_goals_total = away_team_home_goals_table.home_goals.sum()
            away_team_away_goals_total = away_team_away_goals_table.away_goals.sum()
            away_team_cumulative_total = away_team_home_goals_total + away_team_away_goals_total
            away_team_games_count = len(away_team_home_goals_table) + len(away_team_away_goals_table)
            away_team_cumulative_average = away_team_cumulative_total / away_team_games_count if away_team_games_count > 0 else 0
            away_team_cumulative_average = round(away_team_cumulative_average, 2)
            df.at[temp_season_df.index[index], 'away_team_avg_goals_scored'] = away_team_cumulative_average
            df.at[temp_season_df.index[index], 'away_team_goals_scored'] = away_team_cumulative_total
    return df

In [None]:
results_df = calculate_average_goals_seasonal(results_df)
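
The function above rescans every earlier match of the season for each row, which can get slow as the data grows. As a rough sketch of an alternative (not the method used in this notebook), the same "before this match" totals can be built with a grouped cumulative sum. The toy data, the `match_id` and `side` helper columns, and the long-format layout are all assumptions made for illustration.

```python
import pandas as pd

# Toy single-season data; column names follow the notebook, values are made up
toy_df = pd.DataFrame({
    "home_team": ["A", "C", "B", "A"],
    "away_team": ["B", "A", "C", "C"],
    "home_goals": [2, 1, 0, 3],
    "away_goals": [1, 1, 2, 0],
    "season": ["toy-season"] * 4,
})

# Long format: one row per team per match, remembering which side the team was on
matches = toy_df.reset_index().rename(columns={"index": "match_id"})
home_rows = matches[["match_id", "season", "home_team", "home_goals"]].copy()
home_rows.columns = ["match_id", "season", "team", "goals"]
home_rows["side"] = "home"
away_rows = matches[["match_id", "season", "away_team", "away_goals"]].copy()
away_rows.columns = ["match_id", "season", "team", "goals"]
away_rows["side"] = "away"
long_df = pd.concat([home_rows, away_rows], ignore_index=True)
long_df = long_df.sort_values("match_id", kind="stable")

# Totals BEFORE the current match: running sum minus the current row's goals
team_groups = long_df.groupby(["season", "team"])
long_df["goals_before"] = team_groups["goals"].cumsum() - long_df["goals"]
long_df["games_before"] = team_groups.cumcount()
# 0 / 0 gives NaN for a team's first match of the season, so fill it with 0
long_df["avg_before"] = (long_df["goals_before"] / long_df["games_before"]).fillna(0).round(2)

# Back to one row per match
wide = long_df.pivot(index="match_id", columns="side", values=["goals_before", "avg_before"])
toy_df["home_team_goals_scored"] = wide[("goals_before", "home")].astype(int)
toy_df["away_team_goals_scored"] = wide[("goals_before", "away")].astype(int)
toy_df["home_team_avg_goals_scored"] = wide[("avg_before", "home")]
toy_df["away_team_avg_goals_scored"] = wide[("avg_before", "away")]
print(toy_df)
```

The same pattern should carry over to goals conceded by swapping which goals column each side contributes. Like the loop, it relies on the rows already being in chronological order within each season.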

Let's take a look at the DataFrame info followed by a look at the first 5 rows.

In [None]:
results_df.info()

In [None]:
results_df.head()

I would like to create one more set of features that gives us each team's total goals conceded and average goals conceded. This will give us a rough metric for measuring a team's defense.

In [None]:
def calculate_conceded_goals_seasonal(df):
    df["home_team_goals_conceded"] = 0
    df["away_team_goals_conceded"] = 0
    df["home_team_avg_goals_conceded"] = 0.0
    df["away_team_avg_goals_conceded"] = 0.0

    for season in df['season'].unique():
        temp_season_df = df[df['season'] == season]
        for index in range(len(temp_season_df)):
            home_team = temp_season_df.iloc[index, 0]
            away_team = temp_season_df.iloc[index, 1]
            temp_df = temp_season_df.iloc[:index].reset_index(drop=True)

            home_team_home_goals_conceded_table = temp_df[temp_df.home_team == home_team]
            home_team_away_conceded_table = temp_df[temp_df.away_team == home_team]
            home_team_home_goals_conceded_total = home_team_home_goals_conceded_table.away_goals.sum()
            home_team_away_goals_conceded_total = home_team_away_conceded_table.home_goals.sum()
            home_team_cumulative_total = home_team_home_goals_conceded_total + home_team_away_goals_conceded_total
            home_team_games_count = len(home_team_home_goals_conceded_table) + len(home_team_away_conceded_table)
            home_team_cumulative_average = home_team_cumulative_total / home_team_games_count if home_team_games_count > 0 else 0
            home_team_cumulative_average = round(home_team_cumulative_average, 2)
            df.at[temp_season_df.index[index], 'home_team_avg_goals_conceded'] = home_team_cumulative_average
            df.at[temp_season_df.index[index], 'home_team_goals_conceded'] = home_team_cumulative_total

            away_team_home_goals_conceded_table = temp_df[temp_df.home_team == away_team]
            away_team_away_goals_conceded_table = temp_df[temp_df.away_team == away_team]
            away_team_home_goals_conceded_total = away_team_home_goals_conceded_table.away_goals.sum()
            away_team_away_goals_conceded_total = away_team_away_goals_conceded_table.home_goals.sum()
            away_team_cumulative_total = away_team_home_goals_conceded_total + away_team_away_goals_conceded_total
            away_team_games_count = len(away_team_home_goals_conceded_table) + len(away_team_away_goals_conceded_table)
            away_team_cumulative_average = away_team_cumulative_total / away_team_games_count if away_team_games_count > 0 else 0
            away_team_cumulative_average = round(away_team_cumulative_average, 2)
            df.at[temp_season_df.index[index], 'away_team_avg_goals_conceded'] = away_team_cumulative_average
            df.at[temp_season_df.index[index], 'away_team_goals_conceded'] = away_team_cumulative_total
    return df

results_df = calculate_conceded_goals_seasonal(results_df)

results_df.head()

## Exploratory Data Analysis

Now that we have created quite a few features for our data and we have cleaned it up we can move on to some EDA. We just want to get a feel for the data.

In [None]:
results_df.describe(include='all')

I want to get a nice overview of some of the columns visually. Let's start by checking out the distributions for our columns *home_goals*, *away_goals*, and *total_goals*.

In [None]:
home_goals_mean = results_df.home_goals.mean()
home_goals_mean = round(home_goals_mean, 2)

away_goals_mean = results_df.away_goals.mean()
away_goals_mean = round(away_goals_mean, 2)

home_goals_std_dev = results_df.home_goals.std()
home_goals_std_dev = round(home_goals_std_dev, 2)
home_upper_limit = round((home_goals_std_dev + home_goals_mean), 2)
home_lower_limit = round((home_goals_mean - home_goals_std_dev), 2)

away_goals_std_dev = results_df.away_goals.std()
away_goals_std_dev = round(away_goals_std_dev, 2)
away_upper_limit = round((away_goals_mean + away_goals_std_dev), 2)
away_lower_limit = round((away_goals_mean - away_goals_std_dev), 2)

plt.figure(figsize=(16, 6))

plt.subplot(1, 2, 1)
sns.set_palette('colorblind')
sns.histplot(x='home_goals', data=results_df, discrete=True)
plt.axvline(home_goals_mean, color='black', label=f'Mean: {home_goals_mean}')
plt.axvline(home_upper_limit, color='black', linestyle='--', label=f'Std Dev Upper Limit: {home_upper_limit}')
plt.axvline(home_lower_limit, color='black', linestyle='--', label=f'Std Dev Lower Limit: {home_lower_limit}')
plt.xlabel('Number of Goals Scored')
plt.title("Distribution of Goals Scored by the Home Team")
plt.legend()

plt.subplot(1, 2, 2)
sns.set_palette('colorblind')
sns.histplot(x='away_goals', data=results_df, discrete=True, color='red')
plt.axvline(away_goals_mean, color='black', label=f'Mean: {away_goals_mean}')
plt.axvline(away_upper_limit, color='black', linestyle='--', label=f'Std Dev Upper Limit: {away_upper_limit}')
plt.axvline(away_lower_limit, color='black', linestyle='--', label=f'Std Dev Lower Limit: {away_lower_limit}')
plt.xlabel('Number of Goals Scored')
plt.title("Distribution of Goals Scored by the Away Team")
plt.legend()

plt.subplots_adjust(wspace=0.4)

plt.show()

In [None]:
total_goals_mean = results_df.total_goals.mean()
total_goals_mean = round(total_goals_mean, 2)

total_std_dev = results_df.total_goals.std()
total_std_dev = round(total_std_dev, 2)
upper_limit = round((total_goals_mean + total_std_dev), 2)
lower_limit = round((total_goals_mean - total_std_dev), 2)

sns.set_palette('colorblind')
sns.histplot(x='total_goals', data=results_df, discrete=True, color='green')
plt.axvline(total_goals_mean, color='black', label=f'Mean: {total_goals_mean}')
plt.axvline(upper_limit, color='black', linestyle='--', label=f'Std Dev Upper Limit: {upper_limit}')
plt.axvline(lower_limit, color='black', linestyle='--', label=f'Std Dev Lower Limit: {lower_limit}')
plt.xlabel('Total Number of Goals Scored')
plt.title("Distribution of Total Goals Scored per Game")
plt.legend()

plt.show()

#### A Few Insights

We can see a few insightful things in these distribution plots. The first thing that I noticed was that the plot for the goals scored by the home team is noticeably different from the plot for the goals scored by the away team. We can see that, in general, the away team scores fewer goals than the home team. This makes sense intuitively, but our plot gives us quantitative values that support that logic. Later on we will explore this relationship in more detail to try to understand to what extent the location of a match affects the number of goals a team might score.

We also took a look at the distribution of the total goals scored in each game. One conclusion we might draw from this plot is that it seems pretty rare to get a game with 0 goals scored. The majority of games will have at least 1 goal scored. It seems that most games will have between 2 and 3 goals.

I think one possible insight we can draw from all of this data could be this: Generally, we can expect that at least 1 goal will be scored in any Premier League match and the team that is most likely to score that goal would be the home team. We can also make some inferences about the defensive strength of particular teams. For example, if team A plays away vs team B, and the match ends 0 - 0, it might suggest that team A has a rather strong defense. This is because it would be rare for a game to not feature at least 1 goal from the home team. So we might infer that the away team, team A, needed to be strong defensively to create such an outcome.
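
The claims above can be checked with a few proportions rather than read off the plots. The sketch below uses made-up scores in a hypothetical `toy_df`, so the printed numbers say nothing about the real dataset; on `results_df` the same three lines would give the actual shares.

```python
import pandas as pd

# Toy scores, illustrative only
toy_df = pd.DataFrame({"home_goals": [0, 2, 1, 3], "away_goals": [0, 1, 1, 0]})
toy_df["total_goals"] = toy_df.home_goals + toy_df.away_goals

# The mean of a boolean Series is the share of rows where it is True
goalless_share = (toy_df.total_goals == 0).mean()
home_scored_share = (toy_df.home_goals > 0).mean()
away_scored_share = (toy_df.away_goals > 0).mean()

print(f"Goalless games: {goalless_share:.0%}")
print(f"Games where the home team scored: {home_scored_share:.0%}")
print(f"Games where the away team scored: {away_scored_share:.0%}")
```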

#### Time Series Analysis

Next, let's take a look at some trends over time. Since we already took a look at some of the variables, I want to stick with those for now. We can make line graphs that display the average goals scored by the home team, the away team, and the average total goals scored per game. This can give us an idea of whether goals are mostly consistent over time or if they trend in a particular direction.

In [None]:
# observed must be a boolean; the string "False" is truthy and would behave like True
avg_total_goals_df = results_df.groupby("season", observed=False).total_goals.mean().reset_index()
avg_home_goals_df = results_df.groupby("season", observed=False).home_goals.mean().reset_index()
avg_away_goals_df = results_df.groupby("season", observed=False).away_goals.mean().reset_index()

plt.figure(figsize=(20, 8))

plt.subplot(2, 1, 1)
plt.plot(avg_home_goals_df.season, avg_home_goals_df.home_goals, color='blue', marker='o')
plt.plot(avg_away_goals_df.season, avg_away_goals_df.away_goals, color='red', marker='o')
plt.xlabel("Season")
plt.ylabel("Average Goals per Game")
plt.title("Average Goals per Game Over the Seasons")
plt.legend(['Home Team', 'Away Team'])
# Keep both panels on the same figure; extra vertical space stops titles and labels overlapping
plt.subplots_adjust(hspace=0.4)

plt.subplot(2, 1, 2)
plt.plot(avg_total_goals_df.season, avg_total_goals_df.total_goals, color='green', marker='o')
plt.xlabel("Season")
plt.ylabel("Average Total Goals Per Game")
plt.title("Average Total Goals per Game Over the Seasons")
plt.show()

#### Key Insight

If you examine the plot showing the average goals scored by the home team and the away team, you may notice something interesting. During the 2020-2021 season, the home and away teams appeared to be much more evenly matched compared to other seasons. *This anomaly warrants further investigation!*

To understand what might have caused this, we need to consider any events or circumstances during that time period. This is where some Premier League knowledge becomes valuable. Any dedicated fan would recall that the 2020-2021 season was largely played in empty stadiums due to COVID-19. This offers a plausible explanation for the unusual data point on the plot and suggests that home fans do provide an advantage to the home team: the absence of fans coincided with nearly even average goals scored by both teams.
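
To put a number on that gap instead of judging it from the plot, we could compute the home-minus-away difference per season. The sketch below runs on made-up values with hypothetical season labels (`season_a`, `season_b`), so its output is not a result from the real data.

```python
import pandas as pd

# Toy matches for two hypothetical seasons; values are illustrative only
toy_df = pd.DataFrame({
    "season": ["season_a", "season_a", "season_b", "season_b"],
    "home_goals": [2, 2, 1, 1],
    "away_goals": [1, 0, 1, 2],
})

# Average goals per game for each side, by season
season_means = toy_df.groupby("season")[["home_goals", "away_goals"]].mean()

# Positive values mean the home side outscored the away side on average
season_means["home_advantage"] = season_means.home_goals - season_means.away_goals

# Season where the two sides were closest to even
smallest_gap_season = season_means.home_advantage.abs().idxmin()
print(season_means)
print(smallest_gap_season)
```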

This insight will be crucial in our next steps as we aim to build a simple model to predict game outcomes. Knowing that there is a significant advantage to playing at home, we can adjust the model's parameters to account for the match location.

We should also note that the most recent season was the highest-scoring season in terms of goals per game by far. This introduces a bit more uncertainty about the upcoming season. We might interpret this as a sign that the next season could also see more goals per game than many of the previous seasons. Alternatively, we might consider the 2023-2024 season an outlier and expect the 2024-2025 season to be more in line with previous seasons.

### Adding some new features

Part of the reason for creating this project in the first place was to establish a foundation on which I can build a predictive model for Premier League games. With this in mind, I want to create a few additional features. These features will use a formula to predict the number of goals a particular team might score against another. We will have many decisions to make when developing these features. Testing, experimenting, and adjusting them will be key to creating an effective predictive model.

To keep it brief for now, the features I want to create will use a mathematical formula based on each team's average goals scored and conceded at a particular location (home or away) to estimate the goals each team will score, thereby predicting the outcome of the match.
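
The exact formula is still to be decided, so the following is only a hypothetical illustration of the idea. It assumes one simple option: estimate a team's goals as the midpoint between its own average goals scored and its opponent's average goals conceded. The function name `estimate_goals` and the input numbers are made up.

```python
def estimate_goals(avg_scored: float, opponent_avg_conceded: float) -> float:
    """Hypothetical estimate: midpoint of a team's attack and the opposing defense."""
    return round((avg_scored + opponent_avg_conceded) / 2, 2)


# Made-up averages: the home side at home, the away side away
home_estimate = estimate_goals(avg_scored=2.0, opponent_avg_conceded=1.0)
away_estimate = estimate_goals(avg_scored=1.0, opponent_avg_conceded=0.5)

# Turn the two estimates into one of the notebook's result labels
if home_estimate > away_estimate:
    predicted_result = "home_win"
elif home_estimate == away_estimate:
    predicted_result = "draw"
else:
    predicted_result = "away_win"

print(home_estimate, away_estimate, predicted_result)
```

Other weightings are just as plausible, which is why testing and adjusting the formula will matter.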

Next, let's create a heatmap to visualize the correlation matrix for our numerical variables. This will help us identify which variables are most closely related.

In [None]:
correlation_matrix = results_df[['home_goals', 'away_goals', 'goal_difference', 'total_goals', 'home_team_goals_scored',
                                 'away_team_goals_scored', 'home_team_avg_goals_scored', 'away_team_avg_goals_scored',
                                 'home_team_goals_conceded', 'away_team_goals_conceded', 'home_team_avg_goals_conceded', 'away_team_avg_goals_conceded']].corr()

plt.figure(figsize=(12, 10))
color_map = sns.color_palette("coolwarm", as_cmap=True)
sns.heatmap(correlation_matrix, annot=True, cmap=color_map)
plt.title("Correlation Matrix")
plt.show()
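
A heatmap with this many columns can be hard to scan, so it may help to list the strongest pairs as well. The sketch below is one way to do that, shown on a made-up four-column `toy_df` and not on the real data: it keeps the upper triangle of the matrix so each pair appears once, then sorts by absolute correlation.

```python
import numpy as np
import pandas as pd

# Toy scores, illustrative only
toy_df = pd.DataFrame({
    "home_goals": [2, 0, 1, 3],
    "away_goals": [1, 2, 1, 0],
})
toy_df["goal_difference"] = toy_df.home_goals - toy_df.away_goals
toy_df["total_goals"] = toy_df.home_goals + toy_df.away_goals

corr = toy_df.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper_mask).stack().dropna()

# Strongest relationships first, whether positive or negative
pairs = pairs.sort_values(key=abs, ascending=False)
print(pairs)
```

Keep in mind that columns derived from each other, such as *goal_difference* and *home_goals*, will correlate strongly by construction.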