# Data Cleaning Process Documentation


## Project Overview

Project Name: **Sports Analytics - Premier League** <br>
Author: **Alvin Ong** <br>

This Premier League Sports Analytics project analyzes standard player and team performance over the last six seasons (2019-2020 to 2024-2025).  
The cleaned dataset will be stored in an SQL database and used for analysis in Python and Tableau to uncover trends, statistics, and market insights while showcasing data collection, cleaning, and analytical skills.

------------------------------------------------------------------------

## 1. Data Source

-   Dataset Name: **player_stats, team_stats**
-   Source: FBref.com
-   Collection Period: Last six Premier League seasons (2019-2020 to 2024-2025)
-   Original Format: CSV (extracted from FBref)
-   Initial Size: Player stats (3299 rows, 37 columns), Team stats (120 rows, 35 columns)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [76]:
# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Load the data
player_df = pd.read_csv("data/raw/player_stats.csv", header=[0, 1])
team_df = pd.read_csv("data/raw/team_stats.csv", header=[0, 1])

# Display basic information
print(f"Player dataset shape {player_df.shape}")
print(f"Team dataset shape {team_df.shape}")

Player dataset shape (3299, 37)
Team dataset shape (120, 35)


------------------------------------------------------------------------

## 2. Data Overview

### 2.1 Overview of Raw Player Data

In [49]:
# Display the first few rows
player_df.head()

Unnamed: 0_level_0,league,season,team,player,nation,pos,age,born,Playing Time,Playing Time,Playing Time,Playing Time,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Expected,Expected,Expected,Expected,Progression,Progression,Progression,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes
Unnamed: 0_level_1,Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,MP,Starts,Min,90s,Gls,Ast,G+A,G-PK,PK,PKatt,CrdY,CrdR,xG,npxG,xAG,npxG+xAG,PrgC,PrgP,PrgR,Gls,Ast,G+A,G-PK,G+A-PK,xG,xAG,xG+xAG,npxG,npxG+xAG
0,ENG-Premier League,1920,Arsenal,Ainsley Maitland-Niles,ENG,DF,21,1997,20,15,1386,15.4,0,2,2,0,0,0,6,1,0.4,0.4,1.5,2.0,38,61,34,0.0,0.13,0.13,0.0,0.13,0.03,0.1,0.13,0.03,0.13
1,ENG-Premier League,1920,Arsenal,Alexandre Lacazette,FRA,FW,28,1991,30,22,1874,20.8,10,4,14,10,0,0,8,0,8.1,8.1,2.9,11.1,21,49,137,0.48,0.19,0.67,0.48,0.67,0.39,0.14,0.53,0.39,0.53
2,ENG-Premier League,1920,Arsenal,Bernd Leno,GER,GK,27,1992,30,30,2649,29.4,0,0,0,0,0,0,2,0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,ENG-Premier League,1920,Arsenal,Bukayo Saka,ENG,"DF,FW",17,2001,26,19,1753,19.5,1,5,6,1,0,0,6,0,1.0,1.0,3.3,4.3,67,67,143,0.05,0.26,0.31,0.05,0.31,0.05,0.17,0.22,0.05,0.22
4,ENG-Premier League,1920,Arsenal,Calum Chambers,ENG,DF,24,1995,14,13,1102,12.2,1,1,2,1,0,0,5,0,0.8,0.8,1.6,2.4,16,45,67,0.08,0.08,0.16,0.08,0.16,0.06,0.13,0.2,0.06,0.2


In [50]:
# Get summary statistics
player_df.describe(include='all')

Unnamed: 0_level_0,league,season,team,player,nation,pos,age,born,Playing Time,Playing Time,Playing Time,Playing Time,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Expected,Expected,Expected,Expected,Progression,Progression,Progression,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes
Unnamed: 0_level_1,Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,MP,Starts,Min,90s,Gls,Ast,G+A,G-PK,PK,PKatt,CrdY,CrdR,xG,npxG,xAG,npxG+xAG,PrgC,PrgP,PrgR,Gls,Ast,G+A,G-PK,G+A-PK,xG,xAG,xG+xAG,npxG,npxG+xAG
count,3299,3299.0,3299,3299,3299,3299,3299.0,3299.0,3299.0,3299.0,3299.0,3299.0,3299.0,3299.0,3299.0,3299.0,3299.0,3299.0,3299.0,3299.0,3299.0,3299.0,3299.0,3299.0,3299.0,3299.0,3299.0,3299.0,3299.0,3299.0,3299.0,3299.0,3299.0,3299.0,3299.0,3299.0,3299.0
unique,1,,27,1324,85,10,538.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
top,ENG-Premier League,,Chelsea,Chris Wood,ENG,DF,25.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
freq,3299,,173,8,1125,1023,235.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
mean,,2177.199454,,,,,,1995.829645,19.018794,14.544407,1306.647469,14.517369,1.842073,1.309791,3.151864,1.696272,0.145802,0.178236,2.437708,0.079115,1.893574,1.752198,1.346287,3.099424,24.083055,50.020309,49.532889,0.11194,0.082537,0.194416,0.105577,0.188045,0.135632,0.089973,0.22555,0.129254,0.219251
std,,171.180286,,,,,,4.756415,11.493385,11.618327,1005.896413,11.176721,3.397852,2.11548,4.935082,3.024143,0.702777,0.804625,2.55681,0.28525,3.113891,2.748916,1.884225,4.189431,30.43689,54.688105,67.073978,0.269366,0.240671,0.376223,0.26344,0.370311,0.228885,0.18857,0.317153,0.222779,0.311366
min,,1920.0,,,,,,1981.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,,2021.0,,,,,,1992.0,9.0,3.0,360.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.1,0.1,0.2,2.0,6.0,2.0,0.0,0.0,0.0,0.0,0.0,0.02,0.01,0.05,0.02,0.05
50%,,2223.0,,,,,,1996.0,20.0,13.0,1186.0,13.2,0.0,0.0,1.0,0.0,0.0,0.0,2.0,0.0,0.7,0.7,0.6,1.5,12.0,34.0,20.0,0.0,0.0,0.09,0.0,0.09,0.06,0.06,0.14,0.06,0.14
75%,,2324.0,,,,,,1999.0,28.0,24.0,2103.0,23.4,2.0,2.0,4.0,2.0,0.0,0.0,4.0,0.0,2.2,2.2,1.9,4.1,35.0,76.0,73.0,0.15,0.12,0.29,0.14,0.28,0.18,0.13,0.32,0.17,0.31


In [51]:
# Check data types and missing values
df_info = pd.DataFrame({
    'Data Type': player_df.dtypes,
    'Non-Null Count': player_df.count(),
    'Missing Values': player_df.isnull().sum(),
    'Missing Percentage': (player_df.isnull().sum() / len(player_df) * 100).round(2)
})
df_info

Unnamed: 0,Unnamed: 1,Data Type,Non-Null Count,Missing Values,Missing Percentage
league,Unnamed: 0_level_1,object,3299,0,0.0
season,Unnamed: 1_level_1,int64,3299,0,0.0
team,Unnamed: 2_level_1,object,3299,0,0.0
player,Unnamed: 3_level_1,object,3299,0,0.0
nation,Unnamed: 4_level_1,object,3299,0,0.0
pos,Unnamed: 5_level_1,object,3299,0,0.0
age,Unnamed: 6_level_1,object,3299,0,0.0
born,Unnamed: 7_level_1,int64,3299,0,0.0
Playing Time,MP,int64,3299,0,0.0
Playing Time,Starts,int64,3299,0,0.0


### 2.2 Overview of Raw Team Data

In [53]:
# Display the first few rows
team_df.head()

Unnamed: 0_level_0,league,season,team,players_used,Age,Poss,Playing Time,Playing Time,Playing Time,Playing Time,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Expected,Expected,Expected,Expected,Progression,Progression,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,url
Unnamed: 0_level_1,Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,MP,Starts,Min,90s,Gls,Ast,G+A,G-PK,PK,PKatt,CrdY,CrdR,xG,npxG,xAG,npxG+xAG,PrgC,PrgP,Gls,Ast,G+A,G-PK,G+A-PK,xG,xAG,xG+xAG,npxG,npxG+xAG,Unnamed: 34_level_1
0,ENG-Premier League,1920,Arsenal,29,25.8,53.8,38,418,3420,38,56,35,91,53,3,3,88,5,47.0,44.6,31.4,76.1,815,1641,1.47,0.92,2.39,1.39,2.32,1.24,0.83,2.06,1.17,2.0,/en/squads/18bb7c10/2019-2020/Arsenal-Stats
1,ENG-Premier League,1920,Aston Villa,28,25.7,44.1,38,418,3420,38,40,32,72,39,1,3,70,1,44.3,41.8,34.2,76.0,639,1247,1.05,0.84,1.89,1.03,1.87,1.16,0.9,2.06,1.1,2.0,/en/squads/8602292d/2019-2020/Aston-Villa-Stats
2,ENG-Premier League,1920,Bournemouth,27,25.2,44.1,38,418,3420,38,38,24,62,34,4,4,78,3,44.8,41.7,30.6,72.3,601,1256,1.0,0.63,1.63,0.89,1.53,1.18,0.8,1.98,1.1,1.9,/en/squads/4ba7cbea/2019-2020/Bournemouth-Stats
3,ENG-Premier League,1920,Brighton,25,26.4,52.2,38,418,3420,38,35,24,59,34,1,2,59,2,45.4,43.8,31.6,75.4,669,1675,0.92,0.63,1.55,0.89,1.53,1.2,0.83,2.03,1.15,1.99,/en/squads/d07537b9/2019-2020/Brighton-and-Hov...
4,ENG-Premier League,1920,Burnley,22,28.0,41.9,38,418,3420,38,41,30,71,38,3,3,67,0,47.4,45.3,32.4,77.7,437,890,1.08,0.79,1.87,1.0,1.79,1.25,0.85,2.1,1.19,2.04,/en/squads/943e8050/2019-2020/Burnley-Stats
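The `season` column stores each season as a four-digit code: in the rows above, `1920` sits alongside `2019-2020` in the `url` column. Because the code is an integer, summary statistics such as its mean are not meaningful. A minimal sketch for turning the code into a readable label, assuming every season falls in the 2000s (`season_label` is a hypothetical helper, not part of the original notebook):

```python
def season_label(code: int) -> str:
    """Turn a code such as 1920 into '2019-2020' (assumes years 2000-2099)."""
    # Split the four digits into the start-year and end-year suffixes.
    start, end = divmod(int(code), 100)
    return f"20{start:02d}-20{end:02d}"


labels = [season_label(code) for code in (1920, 2021, 2425)]
print(labels)
```

Applied to a dataframe, this could be used as `df['season'].map(season_label)` to produce a label column for charts and reports.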


In [54]:
# Get summary statistics
team_df.describe(include='all')

Unnamed: 0_level_0,league,season,team,players_used,Age,Poss,Playing Time,Playing Time,Playing Time,Playing Time,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Expected,Expected,Expected,Expected,Progression,Progression,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,Per 90 Minutes,url
Unnamed: 0_level_1,Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,MP,Starts,Min,90s,Gls,Ast,G+A,G-PK,PK,PKatt,CrdY,CrdR,xG,npxG,xAG,npxG+xAG,PrgC,PrgP,Gls,Ast,G+A,G-PK,G+A-PK,xG,xAG,xG+xAG,npxG,npxG+xAG,Unnamed: 34_level_1
count,120,120.0,120,120.0,120.0,120.0,120.0,120.0,120.0,120.0,120.0,120.0,120.0,120.0,120.0,120.0,120.0,120.0,120.0,120.0,120.0,120.0,120.0,120.0,120.0,120.0,120.0,120.0,120.0,120.0,120.0,120.0,120.0,120.0,120
unique,1,,27,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,120
top,ENG-Premier League,,Arsenal,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,/en/squads/18bb7c10/2019-2020/Arsenal-Stats
freq,120,,6,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1
mean,,2172.5,,27.491667,26.619167,50.0,36.35,399.85,3271.5,36.35,50.641667,36.025,86.666667,46.633333,4.008333,4.9,67.041667,2.175,50.929167,47.133333,36.6375,83.78,662.083333,1375.141667,1.395333,0.9945,2.38925,1.285417,2.2805,1.403,1.010917,2.4145,1.299333,2.310667,
std,,173.213571,,2.819488,1.074322,7.39025,3.707023,40.777249,333.632037,3.707023,17.64867,13.683986,31.078468,16.293492,2.427074,2.838511,14.431755,1.493192,13.973296,12.730803,10.426736,23.069182,175.360832,373.745235,0.463715,0.367663,0.823529,0.430475,0.791847,0.360076,0.276187,0.631945,0.330776,0.604797,
min,,1920.0,,21.0,24.2,35.8,28.0,308.0,2520.0,28.0,19.0,12.0,31.0,16.0,0.0,0.0,38.0,0.0,24.6,23.2,17.7,42.8,346.0,700.0,0.5,0.32,0.84,0.42,0.76,0.85,0.62,1.49,0.78,1.43,
25%,,2021.0,,25.0,25.975,44.05,38.0,418.0,3420.0,38.0,38.0,26.0,63.75,35.75,2.0,3.0,57.0,1.0,41.05,38.25,29.075,67.5,541.25,1130.5,1.045,0.68,1.725,0.96,1.61,1.1375,0.82,1.9775,1.06,1.885,
50%,,2172.5,,27.0,26.6,49.4,38.0,418.0,3420.0,38.0,49.0,35.0,83.0,44.5,3.0,4.0,64.5,2.0,48.25,44.85,34.2,77.65,642.5,1309.0,1.35,0.95,2.38,1.24,2.24,1.34,0.975,2.325,1.24,2.215,
75%,,2324.0,,29.0,27.2,54.325,38.0,418.0,3420.0,38.0,60.0,43.5,105.25,55.0,5.25,6.25,78.0,3.0,56.85,52.325,41.6,94.3,760.5,1574.75,1.6875,1.21,2.875,1.5425,2.76,1.6175,1.18,2.7625,1.49,2.67,


In [55]:
# Check data types and missing values
df_info = pd.DataFrame({
    'Data Type': team_df.dtypes,
    'Non-Null Count': team_df.count(),
    'Missing Values': team_df.isnull().sum(),
    'Missing Percentage': (team_df.isnull().sum() / len(team_df) * 100).round(2)
})
df_info

Unnamed: 0,Unnamed: 1,Data Type,Non-Null Count,Missing Values,Missing Percentage
league,Unnamed: 0_level_1,object,120,0,0.0
season,Unnamed: 1_level_1,int64,120,0,0.0
team,Unnamed: 2_level_1,object,120,0,0.0
players_used,Unnamed: 3_level_1,int64,120,0,0.0
Age,Unnamed: 4_level_1,float64,120,0,0.0
Poss,Unnamed: 5_level_1,float64,120,0,0.0
Playing Time,MP,int64,120,0,0.0
Playing Time,Starts,int64,120,0,0.0
Playing Time,Min,int64,120,0,0.0
Playing Time,90s,int64,120,0,0.0


------------------------------------------------------------------------

## 3. Data Cleaning Process

### 3.1 Fixing Table Headers and Renaming Columns for Player Stats

The data is already fairly clean, with no missing or null values. The remaining issues are the two-level column headers, the terse column names, and the `age` column, which is stored as `object` rather than as an integer.

In [None]:
# Create a copy of the raw dataframe to clean
clean_player_df = player_df.copy()

# Flatten multi-level columns
clean_player_df.columns = ['_'.join(col).strip() if 'Unnamed' not in col[1] else col[0] for col in clean_player_df.columns]

clean_player_df.head()

Unnamed: 0,league,season,team,player,nation,pos,age,born,Playing Time_MP,Playing Time_Starts,Playing Time_Min,Playing Time_90s,Performance_Gls,Performance_Ast,Performance_G+A,Performance_G-PK,Performance_PK,Performance_PKatt,Performance_CrdY,Performance_CrdR,Expected_xG,Expected_npxG,Expected_xAG,Expected_npxG+xAG,Progression_PrgC,Progression_PrgP,Progression_PrgR,Per 90 Minutes_Gls,Per 90 Minutes_Ast,Per 90 Minutes_G+A,Per 90 Minutes_G-PK,Per 90 Minutes_G+A-PK,Per 90 Minutes_xG,Per 90 Minutes_xAG,Per 90 Minutes_xG+xAG,Per 90 Minutes_npxG,Per 90 Minutes_npxG+xAG
0,ENG-Premier League,1920,Arsenal,Ainsley Maitland-Niles,ENG,DF,21,1997,20,15,1386,15.4,0,2,2,0,0,0,6,1,0.4,0.4,1.5,2.0,38,61,34,0.0,0.13,0.13,0.0,0.13,0.03,0.1,0.13,0.03,0.13
1,ENG-Premier League,1920,Arsenal,Alexandre Lacazette,FRA,FW,28,1991,30,22,1874,20.8,10,4,14,10,0,0,8,0,8.1,8.1,2.9,11.1,21,49,137,0.48,0.19,0.67,0.48,0.67,0.39,0.14,0.53,0.39,0.53
2,ENG-Premier League,1920,Arsenal,Bernd Leno,GER,GK,27,1992,30,30,2649,29.4,0,0,0,0,0,0,2,0,0.0,0.0,0.0,0.0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,ENG-Premier League,1920,Arsenal,Bukayo Saka,ENG,"DF,FW",17,2001,26,19,1753,19.5,1,5,6,1,0,0,6,0,1.0,1.0,3.3,4.3,67,67,143,0.05,0.26,0.31,0.05,0.31,0.05,0.17,0.22,0.05,0.22
4,ENG-Premier League,1920,Arsenal,Calum Chambers,ENG,DF,24,1995,14,13,1102,12.2,1,1,2,1,0,0,5,0,0.8,0.8,1.6,2.4,16,45,67,0.08,0.08,0.16,0.08,0.16,0.06,0.13,0.2,0.06,0.2
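To see what the flattening step does in isolation, here is a minimal sketch on a toy two-level header. The three columns mimic the raw layout and the values are illustrative; `flatten` is a hypothetical helper that spells out the same rule as the list comprehension above.

```python
import pandas as pd

# Toy two-level header mimicking the raw CSV layout (illustrative values).
columns = pd.MultiIndex.from_tuples([
    ("league", "Unnamed: 0_level_1"),   # no real second-level label
    ("Playing Time", "MP"),
    ("Performance", "Gls"),
])
toy_df = pd.DataFrame([["ENG-Premier League", 20, 0]], columns=columns)


def flatten(col):
    """Keep only the top label when the second level is a pandas placeholder."""
    top, bottom = col
    return top if "Unnamed" in bottom else f"{top}_{bottom}"


toy_df.columns = [flatten(col) for col in toy_df.columns]
print(list(toy_df.columns))
# ['league', 'Playing Time_MP', 'Performance_Gls']
```

Columns whose second level is an auto-generated `Unnamed: ...` placeholder keep their top-level name; all others are joined with an underscore.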


In [71]:
# Drop all per 90 columns as they can be calculated manually if needed
clean_player_df.drop(columns=clean_player_df.loc[:, 'Per 90 Minutes_Gls':'Per 90 Minutes_npxG+xAG'].columns, inplace=True)

clean_player_df.head()

Unnamed: 0,league,season,team,player,nation,pos,age,born,Playing Time_MP,Playing Time_Starts,Playing Time_Min,Playing Time_90s,Performance_Gls,Performance_Ast,Performance_G+A,Performance_G-PK,Performance_PK,Performance_PKatt,Performance_CrdY,Performance_CrdR,Expected_xG,Expected_npxG,Expected_xAG,Expected_npxG+xAG,Progression_PrgC,Progression_PrgP,Progression_PrgR
0,ENG-Premier League,1920,Arsenal,Ainsley Maitland-Niles,ENG,DF,21,1997,20,15,1386,15.4,0,2,2,0,0,0,6,1,0.4,0.4,1.5,2.0,38,61,34
1,ENG-Premier League,1920,Arsenal,Alexandre Lacazette,FRA,FW,28,1991,30,22,1874,20.8,10,4,14,10,0,0,8,0,8.1,8.1,2.9,11.1,21,49,137
2,ENG-Premier League,1920,Arsenal,Bernd Leno,GER,GK,27,1992,30,30,2649,29.4,0,0,0,0,0,0,2,0,0.0,0.0,0.0,0.0,0,0,0
3,ENG-Premier League,1920,Arsenal,Bukayo Saka,ENG,"DF,FW",17,2001,26,19,1753,19.5,1,5,6,1,0,0,6,0,1.0,1.0,3.3,4.3,67,67,143
4,ENG-Premier League,1920,Arsenal,Calum Chambers,ENG,DF,24,1995,14,13,1102,12.2,1,1,2,1,0,0,5,0,0.8,0.8,1.6,2.4,16,45,67
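The dropped per-90 columns can be rebuilt from the totals and the `Playing Time_90s` column. The sketch below is one way to do that, not the notebook's own method: `per_90` and the `*_per_90` column names are hypothetical, and the last row is a made-up player with 0.0 90s, included because the summary statistics show a minimum of 0.0 in that column.

```python
import numpy as np
import pandas as pd

players = pd.DataFrame({
    "player": ["Alexandre Lacazette", "Bukayo Saka", "Made-up substitute"],
    "Performance_Gls": [10, 1, 0],
    "Performance_Ast": [4, 5, 0],
    "Playing Time_90s": [20.8, 19.5, 0.0],
})


def per_90(total: pd.Series, nineties: pd.Series) -> pd.Series:
    """Divide a season total by 90s played; zero 90s yields NaN, not an error."""
    return (total / nineties.replace(0, np.nan)).round(2)


players["Gls_per_90"] = per_90(players["Performance_Gls"], players["Playing Time_90s"])
players["Ast_per_90"] = per_90(players["Performance_Ast"], players["Playing Time_90s"])
print(players[["player", "Gls_per_90", "Ast_per_90"]])
```

For Lacazette and Saka the results (0.48 and 0.05 goals, 0.19 and 0.26 assists per 90) agree with the per-90 values shown in the raw data before the columns were dropped.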


In [74]:
#  Renaming columns
clean_player_df.rename(columns={

    'league'                : 'League',
    'season'                : 'Season',
    'team'                  : 'Team',
    'player'                : 'Player',
    'nation'                : 'Nationality',
    'pos'                   : 'Position',
    'age'                   : 'Age',
    'born'                  : 'Birth Year',
    'Playing Time_MP'       : 'Matches Played',
    'Playing Time_Starts'   : 'Games Started',
    'Playing Time_Min'      : 'Minutes Played',
    'Playing Time_90s'      : '90 Minutes Played',
    'Performance_Gls'       : 'Goals',
    'Performance_Ast'       : 'Assists',
    'Performance_G+A'       : 'Goals+Assists',
    'Performance_G-PK'      : 'Non-Penalty Goals',
    'Performance_PK'        : 'Penalty Kick Goals',
    'Performance_PKatt'     : 'Penalty Kick Attempts',
    'Performance_CrdY'      : 'Yellow Cards',
    'Performance_CrdR'      : 'Red Cards',
    'Expected_xG'           : 'Expected Goals',
    'Expected_npxG'         : 'Expected Non-Penalty Goals',
    'Expected_xAG'          : 'Expected Assisted Goals',
    'Expected_npxG+xAG'     : 'Expected Non-Penalty+Assisted Goals',
    'Progression_PrgC'      : 'Progressive Carries',
    'Progression_PrgP'      : 'Progressive Passes',
    'Progression_PrgR'      : 'Progressive Passes Received'

}, inplace=True)

clean_player_df.columns

Index(['League', 'Season', 'Team', 'Player', 'Nationality', 'Position', 'Age',
       'Birth Year', 'Matches Played', 'Games Started', 'Minutes Played',
       '90 Minutes Played', 'Goals', 'Assists', 'Goals+Assists',
       'Non-Penalty Goals', 'Penalty Kick Goals', 'Penalty Kick Attempts',
       'Yellow Cards', 'Red Cards', 'Expected Goals',
       'Expected Non-Penalty Goals', 'Expected Assisted Goals',
       'Expected Non-Penalty+Assisted Goals', 'Progressive Carries',
       'Progressive Passes', 'Progressive Passes Received'],
      dtype='object')

In [87]:
# Standardize ages to whole years and convert the column to a numeric type

# Keep only the part before '-' (the years)
clean_player_df['Age'] = clean_player_df['Age'].astype(str).str.split('-').str[0]

# Convert to a numeric dtype; values that cannot be parsed become NaN
clean_player_df['Age'] = pd.to_numeric(clean_player_df['Age'], errors='coerce')

clean_player_df.dtypes

League                                  object
Season                                   int64
Team                                    object
Player                                  object
Nationality                             object
Position                                object
Age                                      int64
Birth Year                               int64
Matches Played                           int64
Games Started                            int64
Minutes Played                           int64
90 Minutes Played                      float64
Goals                                    int64
Assists                                  int64
Goals+Assists                            int64
Non-Penalty Goals                        int64
Penalty Kick Goals                       int64
Penalty Kick Attempts                    int64
Yellow Cards                             int64
Red Cards                                int64
Expected Goals                         float64
Expected Non-
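The raw `age` column was loaded as `object`, which suggests values that are not plain integers. Here is a minimal sketch of the same split-and-convert step on made-up values, assuming a years-days format such as `21-150`. The final cast to pandas' nullable `Int64` is an addition here, so that unparseable entries stay in an integer column instead of forcing it to float.

```python
import pandas as pd

# Made-up ages: two in an assumed "years-days" format, one plain, one missing.
ages = pd.Series(["21-150", "28-003", "17", None])

# Keep the part before '-' and parse it; unparseable entries become NaN.
years = ages.astype(str).str.split("-").str[0]
years = pd.to_numeric(years, errors="coerce")

# Nullable integer dtype keeps whole-number ages even when some are missing.
years = years.astype("Int64")
print(years)
```

In the notebook's data the result is already `int64`, which indicates that no age failed to parse, so the extra cast matters only if future extracts contain blanks.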

### 3.2 Fixing Table Headers and Renaming Columns for Team Stats


In [89]:
# Create a copy of the raw dataframe to clean
clean_team_df = team_df.copy()

# Flatten multi-level columns
clean_team_df.columns = ['_'.join(col).strip() if 'Unnamed' not in col[1] else col[0] for col in clean_team_df.columns]

clean_team_df.head()

Unnamed: 0,league,season,team,players_used,Age,Poss,Playing Time_MP,Playing Time_Starts,Playing Time_Min,Playing Time_90s,Performance_Gls,Performance_Ast,Performance_G+A,Performance_G-PK,Performance_PK,Performance_PKatt,Performance_CrdY,Performance_CrdR,Expected_xG,Expected_npxG,Expected_xAG,Expected_npxG+xAG,Progression_PrgC,Progression_PrgP,Per 90 Minutes_Gls,Per 90 Minutes_Ast,Per 90 Minutes_G+A,Per 90 Minutes_G-PK,Per 90 Minutes_G+A-PK,Per 90 Minutes_xG,Per 90 Minutes_xAG,Per 90 Minutes_xG+xAG,Per 90 Minutes_npxG,Per 90 Minutes_npxG+xAG,url
0,ENG-Premier League,1920,Arsenal,29,25.8,53.8,38,418,3420,38,56,35,91,53,3,3,88,5,47.0,44.6,31.4,76.1,815,1641,1.47,0.92,2.39,1.39,2.32,1.24,0.83,2.06,1.17,2.0,/en/squads/18bb7c10/2019-2020/Arsenal-Stats
1,ENG-Premier League,1920,Aston Villa,28,25.7,44.1,38,418,3420,38,40,32,72,39,1,3,70,1,44.3,41.8,34.2,76.0,639,1247,1.05,0.84,1.89,1.03,1.87,1.16,0.9,2.06,1.1,2.0,/en/squads/8602292d/2019-2020/Aston-Villa-Stats
2,ENG-Premier League,1920,Bournemouth,27,25.2,44.1,38,418,3420,38,38,24,62,34,4,4,78,3,44.8,41.7,30.6,72.3,601,1256,1.0,0.63,1.63,0.89,1.53,1.18,0.8,1.98,1.1,1.9,/en/squads/4ba7cbea/2019-2020/Bournemouth-Stats
3,ENG-Premier League,1920,Brighton,25,26.4,52.2,38,418,3420,38,35,24,59,34,1,2,59,2,45.4,43.8,31.6,75.4,669,1675,0.92,0.63,1.55,0.89,1.53,1.2,0.83,2.03,1.15,1.99,/en/squads/d07537b9/2019-2020/Brighton-and-Hov...
4,ENG-Premier League,1920,Burnley,22,28.0,41.9,38,418,3420,38,41,30,71,38,3,3,67,0,47.4,45.3,32.4,77.7,437,890,1.08,0.79,1.87,1.0,1.79,1.25,0.85,2.1,1.19,2.04,/en/squads/943e8050/2019-2020/Burnley-Stats


In [None]:
# Drop the per-90 columns (they can be recalculated if needed) and the url column, which is not used in the analysis
clean_team_df.drop(columns=clean_team_df.loc[:, 'Per 90 Minutes_Gls':'url'].columns, inplace=True)

clean_team_df.head()

Unnamed: 0,league,season,team,players_used,Age,Poss,Playing Time_MP,Playing Time_Starts,Playing Time_Min,Playing Time_90s,Performance_Gls,Performance_Ast,Performance_G+A,Performance_G-PK,Performance_PK,Performance_PKatt,Performance_CrdY,Performance_CrdR,Expected_xG,Expected_npxG,Expected_xAG,Expected_npxG+xAG,Progression_PrgC,Progression_PrgP
0,ENG-Premier League,1920,Arsenal,29,25.8,53.8,38,418,3420,38,56,35,91,53,3,3,88,5,47.0,44.6,31.4,76.1,815,1641
1,ENG-Premier League,1920,Aston Villa,28,25.7,44.1,38,418,3420,38,40,32,72,39,1,3,70,1,44.3,41.8,34.2,76.0,639,1247
2,ENG-Premier League,1920,Bournemouth,27,25.2,44.1,38,418,3420,38,38,24,62,34,4,4,78,3,44.8,41.7,30.6,72.3,601,1256
3,ENG-Premier League,1920,Brighton,25,26.4,52.2,38,418,3420,38,35,24,59,34,1,2,59,2,45.4,43.8,31.6,75.4,669,1675
4,ENG-Premier League,1920,Burnley,22,28.0,41.9,38,418,3420,38,41,30,71,38,3,3,67,0,47.4,45.3,32.4,77.7,437,890


In [None]:
#  Renaming columns
clean_team_df.rename(columns={

    'league'                : 'League',
    'season'                : 'Season',
    'team'                  : 'Team',
    'players_used'          : 'Players Used',
    'Poss'                  : 'Possession',
    'Age'                   : 'Average Age',
    'Playing Time_MP'       : 'Matches Played',
    'Playing Time_Starts'   : 'Games Started',
    'Playing Time_Min'      : 'Minutes Played',
    'Playing Time_90s'      : '90 Minutes Played',
    'Performance_Gls'       : 'Goals',
    'Performance_Ast'       : 'Assists',
    'Performance_G+A'       : 'Goals+Assists',
    'Performance_G-PK'      : 'Non-Penalty Goals',
    'Performance_PK'        : 'Penalty Kick Goals',
    'Performance_PKatt'     : 'Penalty Kick Attempts',
    'Performance_CrdY'      : 'Yellow Cards',
    'Performance_CrdR'      : 'Red Cards',
    'Expected_xG'           : 'Expected Goals',
    'Expected_npxG'         : 'Expected Non-Penalty Goals',
    'Expected_xAG'          : 'Expected Assisted Goals',
    'Expected_npxG+xAG'     : 'Expected Non-Penalty+Assisted Goals',
    'Progression_PrgC'      : 'Progressive Carries',
    'Progression_PrgP'      : 'Progressive Passes',

}, inplace=True)

clean_team_df.columns

Index(['League', 'Season', 'Team', 'Players Used', 'Average Age', 'Possession',
       'Matches Played', 'Games Started', 'Minutes Played',
       '90 Minutes Played', 'Goals', 'Assists', 'Goals+Assists',
       'Non-Penalty Goals', 'Penalty Kick Goals', 'Penalty Kick Attempts',
       'Yellow Cards', 'Red Cards', 'Expected Goals',
       'Expected Non-Penalty Goals', 'Expected Assisted Goals',
       'Expected Non-Penalty+Assisted Goals', 'Progressive Carries',
       'Progressive Passes'],
      dtype='object')

------------------------------------------------------------------------

## 4. Summary of Cleaning Actions

Actions Taken
- Flattened the two-level table headers
- Dropped unnecessary columns (per-90 metrics and the team `url`)
- Renamed headers for better clarity
- Standardized player ages to whole years and converted them to an integer data type

In [94]:
# Save the cleaned dataset
clean_player_df.to_csv('data/cleaned/cleaned_standard_player_dataset.csv', index=False)
clean_team_df.to_csv('data/cleaned/cleaned_standard_team_stats.csv', index=False)
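The project overview mentions storing the cleaned data in an SQL database. The sketch below shows one way that step could look. It assumes SQLite via Python's built-in `sqlite3` module and a hypothetical table name `team_stats`, because the notebook names neither a database engine nor a schema. The three rows are taken from the team table above.

```python
import sqlite3

import pandas as pd

# Small stand-in for the cleaned team dataframe (rows from the 2019-2020 season).
cleaned = pd.DataFrame({
    "Team": ["Arsenal", "Aston Villa", "Burnley"],
    "Season": [1920, 1920, 1920],
    "Goals": [56, 40, 41],
})

# In-memory database for illustration; a file path would persist the data.
conn = sqlite3.connect(":memory:")
cleaned.to_sql("team_stats", conn, if_exists="replace", index=False)

# Read the table back with SQL to confirm the load worked.
top_scorers = pd.read_sql_query(
    "SELECT Team, Goals FROM team_stats ORDER BY Goals DESC", conn
)
conn.close()
print(top_scorers)
```

Column names with spaces or symbols, such as `Goals+Assists`, must be quoted in SQL queries, so renaming them to snake_case before loading may be worth considering.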

------------------------------------------------------------------------

## 5. Next Steps

#### Key Variables to Focus On

##### Player-Level Analysis

1. Performance Metrics
- Goals, Assists, Goals+Assists → Key indicators of attacking performance.
- Non-Penalty Goals, Penalty Kick Goals, Penalty Kick Attempts → Separate penalty scoring from non-penalty scoring.
- Expected Goals (xG), Expected Assisted Goals (xAG) → Measure performance vs. expected contribution.
- Progressive Carries, Progressive Passes, Progressive Passes Received → Gauge involvement in attacking plays.
- Minutes Played, Matches Played, Games Started → Identify player consistency and participation.

2. Discipline & Efficiency
- Yellow Cards, Red Cards → Discipline tracking.
- xG vs. Goals → Determine finishing efficiency.

3. Demographics & Experience
- Age, Birth Year, Nationality → Player longevity and trends.

##### Team-Level Analysis

1. Overall Performance
- Goals, Assists, Goals+Assists → Team attacking output.
- Possession → Playing style indicator.
- Expected Goals, Expected Assisted Goals → Compare team output to expected performance.

2. Discipline & Efficiency
- Yellow Cards, Red Cards → Discipline impact.
- Penalty Kick Goals, Penalty Kick Attempts → Penalty conversion efficiency.

3. Squad Characteristics
- Average Age, Players Used → Team experience and rotation trends.

##### Suggested Analytical Approaches

1. Descriptive Analysis

- Summary Statistics (mean, median, min, max) for key performance indicators.
- Player Performance Trends across seasons and teams.

2. Correlation Analysis

- xG vs. Actual Goals → Identify under/over-performing players.
- Possession vs. Goals Scored → Analyze if high-possession teams score more.
- Progressive Passes & Carries vs. Assists → Relationship between build-up play and assists.

3. Clustering

- Player Segmentation → Cluster players based on attributes (e.g., attacking, defensive, all-round).
- Team Playstyle Clustering → Compare teams based on possession, xG, discipline, etc.

4. Regression Analysis

- Predicting Goals using xG, progressive carries, and passes.
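- Impact of Possession on Wins → Does more possession lead to more success? (The cleaned team data has no wins or points column, so match results would need to be added first.)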

5. Time-Series Analysis

- Performance Trends over Six Seasons → Track improvements/declines in teams and players.
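As a starting point for the correlation ideas above, the sketch below compares goals with expected goals and computes a correlation matrix. It uses the five 2019-2020 team rows shown earlier, so the numbers only illustrate the mechanics and are far too few to support conclusions. `Goals minus xG` is a hypothetical column name.

```python
import pandas as pd

# Five team rows from the 2019-2020 season, using the renamed column headers.
team = pd.DataFrame({
    "Team": ["Arsenal", "Aston Villa", "Bournemouth", "Brighton", "Burnley"],
    "Possession": [53.8, 44.1, 44.1, 52.2, 41.9],
    "Goals": [56, 40, 38, 35, 41],
    "Expected Goals": [47.0, 44.3, 44.8, 45.4, 47.4],
})

# Positive values mean a team scored more than its xG suggested.
team["Goals minus xG"] = (team["Goals"] - team["Expected Goals"]).round(1)

# Pearson correlations between the numeric columns of interest.
corr = team[["Possession", "Goals", "Expected Goals"]].corr()
possession_vs_goals = corr.loc["Possession", "Goals"]

print(team[["Team", "Goals minus xG"]])
print(corr.round(2))
```

On the full cleaned dataset the same calls could be run per season with `groupby('Season')` to check whether any relationship holds over time.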

##### Potential Limitations to Consider in the Future

1. Data Availability
- Defensive stats such as tackles and interceptions are not part of this dataset.
- There is no injury data, even though injuries affect player availability and performance.

2. xG Limitations
- xG models are estimates and may not always reflect actual outcomes.

3. Team-Level Generalization
- Aggregated team data may hide individual player contributions.