## Premier League Matchup Prediction

### Data Preparation & Preprocessing

#### Roberto Ruiz, B.S. Petroleum Engineering & Certificate in CS, The University of Texas at Austin

This workflow demonstrates basic data preparation and preprocessing of Premier League data gathered from <a href="http://football-data.co.uk/englandm.php">this</a> website. Cleaning and preprocessing choices were made experimentally, guided by the statistics that matter in soccer matchups. Features were extracted and constructed from the original datasets to give deeper insight into how each Premier League team was performing at a particular time (year, season, week).

The data is being cleaned for classification models that will predict match outcomes (home team win, away team win, draw). However, feature selection and feature transformation are not the focus of this workflow.

Download the CSV files for the 2003-2004 through 2020-2021 seasons from the website mentioned above. Acronyms for feature names can be found <a href="http://football-data.co.uk/notes.txt">here</a>.

We will also need some standard packages for this workflow.

In [1]:
import numpy as np     # Numpy scientific computing tool
import pandas as pd    # Pandas data analysis and data frame tool
import matplotlib      # Matplotlib for visualizing data
import time            # Time & Calendar will be used to transform dates into proper values for ML models
import calendar
import os              # Set directory and manipulate file names if desired
import re              # Regex for file name manipulation if desired

#### Set the working directory
The address should be specific to where your data is located.

In [2]:
os.chdir('C:/Users/Roberto Ruiz/Documents/Personal Projects/PLGamePrediction/original_data')

#### Only use if your download order matches the one below
This cell assumes the 2003-2004 season was downloaded first (saved as E0.csv) and the remaining seasons were downloaded in order. <br>
Ex: 2004-2005 = E0(1).csv, 2005-2006 = E0(2).csv, etc.

In [3]:
'''Use only once, then no need for it'''

# years_off = 3
# leading = '20'

# for file in os.listdir():
    
#     file_name, file_extension = os.path.splitext(file)
#     match = re.findall(r'\d+', file_name)
    
#     if len(match) > 0:
#         season = years_off + int(match[-1])
#     else:
#         season = years_off
        
#     new_name = '{}{}_{}{}{}'.format(leading, str(season).zfill(2),leading,str(season+1).zfill(2),file_extension)
#     os.rename(file, new_name)


'Use only once, then no need for it'
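
The commented-out loop above renames every file in the working folder. A minimal sketch of the same mapping as a pure function, which can be checked before anything on disk is touched, might look like the following. The name `season_file_name` and its `first_season_start` parameter are hypothetical, and the pattern assumes repeat downloads were saved as `E0(1).csv`, `E0(2).csv`, and so on.

```python
import os
import re

def season_file_name(file, first_season_start=2003):
    """Map 'E0.csv' -> '2003_2004.csv', 'E0(1).csv' -> '2004_2005.csv', ...

    Returns None for anything that is not a downloaded season file,
    so unrelated files in the folder are left alone.
    """
    stem, ext = os.path.splitext(file)
    match = re.fullmatch(r'E0(?:\s*\((\d+)\))?', stem)
    if match is None or ext.lower() != '.csv':
        return None
    # No download counter means the first season; otherwise offset by the counter
    start = first_season_start + int(match.group(1) or 0)
    return f'{start}_{start + 1}{ext}'

print(season_file_name('E0.csv'))
print(season_file_name('E0(2).csv'))
```

A rename loop could then call `os.rename(file, new_name)` only when the function returns a name.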

#### Read and view data
Read csv files, get all of them in a list, and view one of the files.

In [4]:
df_raw_2003_2004 = pd.read_csv('2003_2004.csv')
df_raw_2004_2005 = pd.read_csv('2004_2005.csv')
df_raw_2005_2006 = pd.read_csv('2005_2006.csv')
df_raw_2006_2007 = pd.read_csv('2006_2007.csv')
df_raw_2007_2008 = pd.read_csv('2007_2008.csv')
df_raw_2008_2009 = pd.read_csv('2008_2009.csv')
df_raw_2009_2010 = pd.read_csv('2009_2010.csv')
df_raw_2010_2011 = pd.read_csv('2010_2011.csv')
df_raw_2011_2012 = pd.read_csv('2011_2012.csv')
df_raw_2012_2013 = pd.read_csv('2012_2013.csv')
df_raw_2013_2014 = pd.read_csv('2013_2014.csv')
df_raw_2014_2015 = pd.read_csv('2014_2015.csv')
df_raw_2015_2016 = pd.read_csv('2015_2016.csv')
df_raw_2016_2017 = pd.read_csv('2016_2017.csv')
df_raw_2017_2018 = pd.read_csv('2017_2018.csv')
df_raw_2018_2019 = pd.read_csv('2018_2019.csv')
df_raw_2019_2020 = pd.read_csv('2019_2020.csv')
df_raw_2020_2021 = pd.read_csv('2020_2021.csv')

all_datasets = [df_raw_2003_2004, df_raw_2004_2005, df_raw_2005_2006, df_raw_2006_2007, df_raw_2007_2008, df_raw_2008_2009, 
               df_raw_2009_2010, df_raw_2010_2011, df_raw_2011_2012, df_raw_2012_2013, df_raw_2013_2014, df_raw_2014_2015, 
               df_raw_2015_2016, df_raw_2016_2017, df_raw_2017_2018, df_raw_2018_2019, df_raw_2019_2020, df_raw_2020_2021]


df_raw_2003_2004.head(10)

Unnamed: 0,Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,...,B365<2.5,GBAHH,GBAHA,GBAH,LBAHH,LBAHA,LBAH,B365AHH,B365AHA,B365AH
0,E0,16/08/03,Arsenal,Everton,2,1,H,1,0,H,...,2.1,1.9,1.9,-1.25,1.9,1.95,-1.25,1.975,1.925,-1.25
1,E0,16/08/03,Birmingham,Tottenham,1,0,H,1,0,H,...,1.85,2.05,1.75,-0.25,1.75,2.1,0.0,2.1,1.8,-0.25
2,E0,16/08/03,Blackburn,Wolves,5,1,H,2,0,H,...,1.9,1.95,1.85,-0.75,2.0,1.85,-0.75,1.95,1.95,-0.75
3,E0,16/08/03,Fulham,Middlesbrough,3,2,H,1,1,D,...,1.7,1.8,2.0,0.0,1.85,2.0,0.0,2.1,1.8,-0.25
4,E0,16/08/03,Leicester,Southampton,2,2,D,2,0,H,...,1.8,1.85,1.95,0.0,1.85,2.0,0.0,1.85,2.05,0.0
5,E0,16/08/03,Man United,Bolton,4,0,H,1,0,H,...,2.2,1.95,1.85,-1.75,1.95,1.9,-1.75,2.05,1.85,-1.75
6,E0,16/08/03,Portsmouth,Aston Villa,2,1,H,1,0,H,...,1.9,2.05,1.75,-0.25,2.1,1.75,-0.25,2.1,1.8,-0.25
7,E0,17/08/03,Charlton,Man City,0,3,A,0,2,A,...,1.95,2.05,1.75,-0.25,2.05,1.8,-0.25,1.8,2.1,0.0
8,E0,17/08/03,Leeds,Newcastle,2,2,D,1,1,D,...,2.0,1.8,2.0,0.25,1.8,2.05,0.25,1.825,2.075,0.25
9,E0,17/08/03,Liverpool,Chelsea,1,2,A,0,1,A,...,1.9,2.05,1.85,-0.25,2.05,1.8,-0.25,2.1,1.8,-0.25
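
Reading eighteen files one by one works, but the same list can be built in a loop. The sketch below is one way to do it, assuming the files follow the `YYYY_YYYY.csv` naming produced by the renaming step; `load_seasons` and its parameters are hypothetical names, not part of the original workflow.

```python
import glob
import os
import pandas as pd

def load_seasons(data_dir='.', pattern='20??_20??.csv'):
    """Read every season file in data_dir, oldest season first.

    Returns a dict that maps '2003_2004' style keys to data frames.
    Sorting the paths keeps the seasons in chronological order.
    """
    paths = sorted(glob.glob(os.path.join(data_dir, pattern)))
    return {os.path.splitext(os.path.basename(p))[0]: pd.read_csv(p) for p in paths}

# Example use (assumes the season files are in the working directory):
# seasons = load_seasons()
# all_datasets = list(seasons.values())
```

Keeping the season key next to each data frame also makes it easier to see which season a later warning or missing value belongs to.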


#### Observe the columns
Refer to the notes file linked at the beginning of this workflow for the key to the abbreviated feature names.

In [5]:
df_raw_2003_2004.columns

Index(['Div', 'Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HTHG',
       'HTAG', 'HTR', 'Referee', 'HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC',
       'AC', 'HY', 'AY', 'HR', 'AR', 'B365H', 'B365D', 'B365A', 'GBH', 'GBD',
       'GBA', 'IWH', 'IWD', 'IWA', 'LBH', 'LBD', 'LBA', 'SOH', 'SOD', 'SOA',
       'SBH', 'SBD', 'SBA', 'WHH', 'WHD', 'WHA', 'GB>2.5', 'GB<2.5',
       'B365>2.5', 'B365<2.5', 'GBAHH', 'GBAHA', 'GBAH', 'LBAHH', 'LBAHA',
       'LBAH', 'B365AHH', 'B365AHA', 'B365AH'],
      dtype='object')

### For the purpose of this project betting odds will not be used
Let's remove them from all of our datasets. We will be left with the following fields: <br>
- Date
- HomeTeam
- AwayTeam
- FTHG = Full Time Home Team Goals
- FTAG = Full Time Away Team Goals
- FTR = Full Time Result (H=Home Win, D=Draw, A=Away Win)
- HTHG = Half Time Home Team Goals
- HTAG = Half Time Away Team Goals
- HTR = Half Time Result (H=Home Win, D=Draw, A=Away Win)
- HS = Home Team Shots
- AS = Away Team Shots
- HST = Home Team Shots on Target
- AST = Away Team Shots on Target
- HF = Home Team Fouls Committed
- AF = Away Team Fouls Committed
- HC = Home Team Corners
- AC = Away Team Corners
- HY = Home Team Yellow Cards
- AY = Away Team Yellow Cards
- HR = Home Team Red Cards
- AR = Away Team Red Cards

In [6]:
for i, df in enumerate(all_datasets):
    df.drop(df.columns.difference(['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HTHG','HTAG', 'HTR', 
                                   'HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC','AC', 'HY', 'AY', 'HR', 'AR']), 
                                    axis=1, inplace=True)
    
    # Check for missing values and print the list index of any data frame that has them
    
    if df.isna().sum().sum() != 0:
        print(i)
    
df_raw_2003_2004.columns

Index(['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HTHG', 'HTAG',
       'HTR', 'HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC', 'AC', 'HY', 'AY',
       'HR', 'AR'],
      dtype='object')
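
As an alternative to dropping columns in place, the kept fields can be selected by name, which also fails loudly if a season file lacks one of them. This is only a sketch: `KEEP` and `keep_match_stats` are hypothetical names, and the missing-value report is an addition for illustration.

```python
import pandas as pd

KEEP = ['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HTHG', 'HTAG', 'HTR',
        'HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC', 'AC', 'HY', 'AY', 'HR', 'AR']

def keep_match_stats(df, keep=KEEP):
    """Return a trimmed copy of df and the count of missing values per column.

    The input frame is left untouched; only columns with at least one
    missing value appear in the returned counts.
    """
    absent = [c for c in keep if c not in df.columns]
    if absent:
        raise KeyError(f'columns not found: {absent}')
    trimmed = df.loc[:, keep].copy()
    na_counts = trimmed.isna().sum()
    return trimmed, na_counts[na_counts > 0]
```

Calling it once per season and printing the non-empty counts shows which column is incomplete, not just which file.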

#### Standardize date column
Now let's standardize the Date column by converting its date strings to seconds since the epoch (unixtime). This way, the date feature can be used by ML algorithms.

In [7]:
# Function that takes in a date and prepares it for calendar.timegm transformation

def standardize_date(d):
    
    date_list = d.split('/')
    day = date_list[0]
    month = date_list[1]
    year = date_list[2]
    
    if len(day)==2 and day[0] == '0':
        day = day.replace('0','')
    if len(month)==2 and month[0] == '0':
        month = month.replace('0','')
    if len(year) != 4:
        year = '20' + year
    
    return (int(year),int(month),int(day),0,0,0,0,0,0)


#### Replace Date column with unixtime column
Check the dataframe to ensure there is no date column and there is a unixtime column.

In [8]:
for df in all_datasets:
    df['Date'] = df['Date'].apply( lambda x : standardize_date(x))
    df['unixtime'] = df['Date'].apply(lambda x : calendar.timegm(x))
    df.drop(columns="Date", inplace=True)

df_raw_2003_2004.head(10)

Unnamed: 0,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,HS,AS,...,AST,HF,AF,HC,AC,HY,AY,HR,AR,unixtime
0,Arsenal,Everton,2,1,H,1,0,H,11,13,...,7,8,15,6,9,1,3,1,1,1060992000
1,Birmingham,Tottenham,1,0,H,1,0,H,10,15,...,7,20,27,1,4,3,5,0,0,1060992000
2,Blackburn,Wolves,5,1,H,2,0,H,25,8,...,5,8,14,6,2,1,1,0,0,1060992000
3,Fulham,Middlesbrough,3,2,H,1,1,D,17,8,...,5,18,16,7,6,1,1,0,0,1060992000
4,Leicester,Southampton,2,2,D,2,0,H,12,13,...,10,27,15,2,7,3,1,0,0,1060992000
5,Man United,Bolton,4,0,H,1,0,H,13,15,...,5,12,8,8,4,0,4,0,0,1060992000
6,Portsmouth,Aston Villa,2,1,H,1,0,H,4,9,...,5,18,22,7,9,2,1,0,1,1060992000
7,Charlton,Man City,0,3,A,0,2,A,10,14,...,8,12,16,4,7,2,1,1,0,1061078400
8,Leeds,Newcastle,2,2,D,1,1,D,8,19,...,10,19,16,4,8,4,2,0,0,1061078400
9,Liverpool,Chelsea,1,2,A,0,1,A,12,10,...,6,13,12,8,6,3,1,0,0,1061078400
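
The same conversion can also be written with `datetime.strptime`, which parses the day, month and year directly. The sketch below assumes the dates come as `dd/mm/yy` or `dd/mm/yyyy` strings, as in the files used here; `date_to_unixtime` is a hypothetical helper name.

```python
import calendar
from datetime import datetime

def date_to_unixtime(d):
    """Convert 'dd/mm/yy' or 'dd/mm/yyyy' to seconds since the epoch (midnight UTC)."""
    year = d.split('/')[2]
    # %Y expects a four-digit year, %y a two-digit year
    fmt = '%d/%m/%Y' if len(year) == 4 else '%d/%m/%y'
    parsed = datetime.strptime(d, fmt)
    return calendar.timegm(parsed.timetuple())

print(date_to_unixtime('16/08/03'))   # 1060992000, the value shown for the first rows above
```

Note that `%y` maps two-digit years 00 to 68 onto 2000 to 2068, which covers every season in this workflow.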


### Let's come up with some features that could be important
CP = Current Points --> Points gained before the current match <br>
GDPH = Goal Difference Playing Home --> Goals for minus goals against in home games<br>
GDPA = Goal Difference Playing Away --> Goals for minus goals against in away games<br>
TGD = Total Goal Difference --> Total goals for minus total goals against<br>
FUD = Fatigue Unixtime Difference --> Time in seconds between a team's previous game and the current game<br>
SOV = Strength of Victory --> Average share of available points held by the opponents a team has beaten so far<br>
HTGD = Half Time Goal Difference --> Total half-time goal difference for each team<br>
<br>
Each feature is computed for both the home and the away team and stored with an H or A prefix (for example HCP and ACP). The half-time goal difference is stored as HHGD and AHGD, because HTGD and ATGD hold the total goal differences.<br>
<br>
##### Additionally, let's turn every other match statistic into a per-game average for each team up to that point



In [9]:
# Function to determine fatigue unixtime for teams
# dataframes come in ascending order (oldest time first)

def fatigue_time(home_away_ut_df):
    
    first_matchup_fatigue = 1209600     # default first matchups fatigue to two weeks
    home_team_fatigue_col = np.zeros(len(home_away_ut_df))
    away_team_fatigue_col = np.zeros(len(home_away_ut_df))
    
    teams = home_away_ut_df['HomeTeam'].unique()   # Get all teams in prem league
    
    for team in teams:
        
        team_matchups = home_away_ut_df.loc[(home_away_ut_df['HomeTeam'] == team) | (home_away_ut_df['AwayTeam'] == team)]
        
        i = 0
        old_row = np.nan
        for index, row in team_matchups.iterrows():
            
            if i == 0:
                i += 1
                if row['HomeTeam'] == team:
                    home_team_fatigue_col[index] = first_matchup_fatigue
                    old_row = row.copy()
                    continue
                else:
                    away_team_fatigue_col[index] = first_matchup_fatigue
                    old_row = row.copy()
                    continue
            
            # Seconds since this team's previous match
            time_since_last = row['unixtime'] - old_row['unixtime']
            old_row = row.copy()
            
            if row['HomeTeam'] == team:
                home_team_fatigue_col[index] = time_since_last
            else:
                away_team_fatigue_col[index] = time_since_last
                
    return home_team_fatigue_col, away_team_fatigue_col
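
The per-team loop above can also be expressed with a grouped difference. The following is a sketch rather than a drop-in replacement: `rest_time` and `TWO_WEEKS` are hypothetical names, and unlike the loop it also covers a team that has not yet played at home.

```python
import pandas as pd

TWO_WEEKS = 1209600  # same default the loop above gives a team's first match

def rest_time(df, first_match_rest=TWO_WEEKS):
    """Seconds since each side's previous match, for the home and away team of every row."""
    home = df[['HomeTeam', 'unixtime']].rename(columns={'HomeTeam': 'team'}).assign(side='H')
    away = df[['AwayTeam', 'unixtime']].rename(columns={'AwayTeam': 'team'}).assign(side='A')
    # One row per (match, team); 'match' remembers the original row label
    long = pd.concat([home, away]).rename_axis('match').reset_index()
    long = long.sort_values(['unixtime', 'match'])
    # Difference to the same team's previous match; the first match gets the default
    long['rest'] = long.groupby('team')['unixtime'].diff().fillna(first_match_rest)
    rest_h = long.loc[long['side'] == 'H'].set_index('match')['rest']
    rest_a = long.loc[long['side'] == 'A'].set_index('match')['rest']
    return rest_h.reindex(df.index).to_numpy(), rest_a.reindex(df.index).to_numpy()

toy = pd.DataFrame({'HomeTeam': ['A', 'C', 'B', 'D'],
                    'AwayTeam': ['B', 'D', 'A', 'A'],
                    'unixtime': [0, 86400, 604800, 864000]})
hfud, afud = rest_time(toy)
print(hfud, afud)
```

In the toy data, team A plays at 0, 604800 and 864000 seconds, so its rest before the last match is 259200 seconds (three days).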


In [10]:
# Function to get goal difference as away, home, and total
# goals_df contains HomeTeam, AwayTeam, FTHG, FTAG (rows in chronological order)

def goal_difference(goals_df):
    
    # Home team goal difference playing home and away
    ht_diff_play_home = np.zeros(len(goals_df))
    ht_diff_play_away = np.zeros(len(goals_df))
    
    # Away team goal difference playing home and away
    at_diff_play_home = np.zeros(len(goals_df))
    at_diff_play_away = np.zeros(len(goals_df))
    
    teams = goals_df['HomeTeam'].unique()
    
    for team in teams:
        
        team_matchups = goals_df.loc[(goals_df['HomeTeam'] == team) | (goals_df['AwayTeam'] == team)]
        
        # Running goal difference from this team's earlier home and away games.
        # Running totals also handle a team that opens the season with
        # consecutive home (or consecutive away) games.
        home_gd = 0
        away_gd = 0
        
        for index, row in team_matchups.iterrows():
            
            if row['HomeTeam'] == team:
                
                # Record the totals before kick-off, then add this home game
                ht_diff_play_home[index] = home_gd
                ht_diff_play_away[index] = away_gd
                home_gd += row['FTHG'] - row['FTAG']
                
            else:
                
                # Record the totals before kick-off, then add this away game
                at_diff_play_home[index] = home_gd
                at_diff_play_away[index] = away_gd
                away_gd += row['FTAG'] - row['FTHG']
                
            
    return ht_diff_play_home, at_diff_play_home, ht_diff_play_away, at_diff_play_away


def tot_goal_difference(at_home, away):
    return at_home + away  
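
To make the four goal-difference columns concrete, here is a small worked sketch that walks the matches once, in chronological order, and keeps two running totals per team. It uses plain tuples instead of a data frame; `gd_before_match` is a hypothetical name and the three toy results are made up for illustration.

```python
from collections import defaultdict

def gd_before_match(matches):
    """matches: iterable of (home, away, fthg, ftag) in chronological order.

    Returns one dict per match with the goal differences each side had
    accumulated before kick-off, split by home games and away games.
    """
    home_gd = defaultdict(int)   # goal difference from games played at home
    away_gd = defaultdict(int)   # goal difference from games played away
    rows = []
    for home, away, fthg, ftag in matches:
        rows.append({'HGDPH': home_gd[home], 'HGDPA': away_gd[home],
                     'AGDPH': home_gd[away], 'AGDPA': away_gd[away]})
        # Update only after recording, so the current result never leaks in
        home_gd[home] += fthg - ftag
        away_gd[away] += ftag - fthg
    return rows

toy_matches = [('A', 'B', 2, 1), ('A', 'C', 0, 0), ('B', 'A', 1, 3)]
gd_rows = gd_before_match(toy_matches)
print(gd_rows[2])   # {'HGDPH': 0, 'HGDPA': -1, 'AGDPH': 1, 'AGDPA': 0}
```

Team A opens with two home games here, which is the situation where the order of the bookkeeping matters most.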
    
    

In [11]:
def current_points(current_points_df):
    
    '''
    Current points data frame includes HomeTeam, AwayTeam, FTR (full-time result)
    '''
    
    original_points = 0     # default original points for each team to 0
    home_team_points_col = np.zeros(len(current_points_df))
    away_team_points_col = np.zeros(len(current_points_df))
    home_games_played = np.zeros(len(current_points_df))
    away_games_played = np.zeros(len(current_points_df))
    
    teams = current_points_df['HomeTeam'].unique()   # Get all teams in prem league
    
    for team in teams:
        team_matchups = current_points_df.loc[(current_points_df['HomeTeam'] == team) | (current_points_df['AwayTeam'] == team)]
        game_counter = 0
        current_points = 0
        
        for index, row in team_matchups.iterrows():
            
            if game_counter == 0:
                game_counter += 1
                
                if row['HomeTeam'] == team:
                    if row['FTR'] == 'H':
                        current_points += 3
                        
                    elif row['FTR'] == 'D':
                        current_points += 1
                
                else:
                    if row['FTR'] == 'A':
                        current_points += 3
                        
                    elif row['FTR'] == 'D':
                        current_points += 1
                
                continue
            
            
            if row['HomeTeam'] == team:
                
                home_team_points_col[index] = current_points
                home_games_played[index] = game_counter
                game_counter += 1
                
                if row['FTR'] == 'H':
                    current_points += 3
                        
                elif row['FTR'] == 'D':
                    current_points += 1
                
            else:
                
                away_team_points_col[index] = current_points
                away_games_played[index] = game_counter
                game_counter += 1
                
                if row['FTR'] == 'A':
                    current_points += 3
                        
                elif row['FTR'] == 'D':
                    current_points += 1
    
    
    return home_team_points_col, away_team_points_col, home_games_played, away_games_played

In [12]:
def strength_of_victory(sov_df):
    
    '''
    SOV dataframe will contain HomeTeam, AwayTeam, FTR, current points (HCP, ACP), and games played (HGP, AGP)
    '''
    
    original_sov = 0     # default original sov for each team to 0
    home_team_sov_col = np.zeros(len(sov_df))
    away_team_sov_col = np.zeros(len(sov_df))
    
    teams = sov_df['HomeTeam'].unique()   # Get all teams in prem league
    
    for team in teams:
        team_matchups = sov_df.loc[(sov_df['HomeTeam'] == team) | (sov_df['AwayTeam'] == team)]
        game_counter = 0
        victory_counter = 0
        current_sov = 0
        
        for index, row in team_matchups.iterrows():
            
            if game_counter == 0:
                game_counter += 1
                
                if row['HomeTeam'] == team:
                    
                    if row['FTR'] == 'H':
                        current_sov = row['ACP']/3
                        victory_counter += 1
                
                else:
                    if row['FTR'] == 'A':
                        current_sov = row['HCP']/3
                        victory_counter += 1
                        
                continue
            
            
            if row['HomeTeam'] == team:
                
                home_team_sov_col[index] = current_sov
                if row['FTR'] == 'H':
                    victory_counter += 1
                    if row['AGP'] == 0:
                        current_sov = current_sov*((victory_counter-1)/victory_counter)
                    else:
                        current_sov = current_sov*((victory_counter-1)/victory_counter) + ((row['ACP']/(row['AGP'] * 3))/victory_counter)
                
            else:
                away_team_sov_col[index] = current_sov
                if row['FTR'] == 'A':
                    victory_counter += 1
                    if row['HGP'] == 0:
                        current_sov = current_sov*((victory_counter-1)/victory_counter)
                    else:
                        current_sov = current_sov*((victory_counter-1)/victory_counter) + ((row['HCP']/(row['HGP'] * 3))/victory_counter)
       
    
    return home_team_sov_col, away_team_sov_col 
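
As I read the loop above, the update applied after each win is an incremental mean: every beaten opponent contributes the share of available points it held at the time (points divided by three times games played), and the running value is the average of those shares. A minimal sketch of that update, with the hypothetical name `update_sov` and made-up opponent records, is:

```python
def update_sov(current_sov, victories, opp_points, opp_games):
    """Fold one more beaten opponent into the running mean.

    victories is the win count including the win being added.
    An opponent with no games played contributes a share of 0.
    """
    share = opp_points / (3 * opp_games) if opp_games else 0.0
    return current_sov * (victories - 1) / victories + share / victories

# (points, games played) of three beaten opponents at the time of each win
beaten = [(6, 3), (3, 4), (0, 0)]
sov = 0.0
for wins, (points, games) in enumerate(beaten, start=1):
    sov = update_sov(sov, wins, points, games)

print(sov)
```

The result equals the plain average of the three shares, (6/9 + 3/12 + 0) / 3, which is what the incremental form is meant to reproduce without storing every opponent.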

In [13]:
def halftime_goal_difference(htgd_df):
    
    default_htgd = 0
    home_team_htgd = np.zeros(len(htgd_df))
    away_team_htgd = np.zeros(len(htgd_df))
    
    teams = htgd_df['HomeTeam'].unique()   # Get all teams in prem league
    
    for team in teams:
        team_matchups = htgd_df.loc[(htgd_df['HomeTeam'] == team) | (htgd_df['AwayTeam'] == team)]
        game_counter = 0
        current_htgd = 0
        
        for index, row in team_matchups.iterrows():
            
            if game_counter == 0:
                game_counter += 1
                
                if row['HomeTeam'] == team:
                    current_htgd = row['HTHG'] - row['HTAG']
                else:
                    current_htgd = row['HTAG'] - row['HTHG']
                continue
            
            if row['HomeTeam'] == team:
                home_team_htgd[index] = current_htgd
                current_htgd += row['HTHG'] - row['HTAG']
            else:
                
                away_team_htgd[index] = current_htgd                  
                current_htgd += row['HTAG'] - row['HTHG']
       
    
    return home_team_htgd, away_team_htgd

In [14]:
def per_game_stats(pergame_df):
    
    # Shots per game
    home_team_spg = np.zeros(len(pergame_df))
    away_team_spg = np.zeros(len(pergame_df))
    
    # Shots on target per game
    home_team_stpg = np.zeros(len(pergame_df))
    away_team_stpg = np.zeros(len(pergame_df))
    
    # Fouls per game
    home_team_fpg = np.zeros(len(pergame_df))
    away_team_fpg = np.zeros(len(pergame_df))
    
    # Corners per game
    home_team_cpg = np.zeros(len(pergame_df))
    away_team_cpg = np.zeros(len(pergame_df))
    
    # Yellow cards per game
    home_team_ypg = np.zeros(len(pergame_df))
    away_team_ypg = np.zeros(len(pergame_df))
    
    # Red cards per game
    home_team_rpg = np.zeros(len(pergame_df))
    away_team_rpg = np.zeros(len(pergame_df))
    
    teams = pergame_df['HomeTeam'].unique()   # Get all teams in prem league
    
    for team in teams:
        
        team_matchups = pergame_df.loc[(pergame_df['HomeTeam'] == team) | (pergame_df['AwayTeam'] == team)]
        game_counter = 0
        spg = 0
        stpg = 0
        fpg = 0
        cpg = 0
        ypg = 0
        rpg = 0
        
        for index, row in team_matchups.iterrows():
            
            if game_counter == 0:
                game_counter += 1
                
                if row['HomeTeam'] == team:
                    spg = row['HS']
                    stpg = row['HST']
                    fpg = row['HF']
                    cpg = row['HC']
                    ypg = row['HY']
                    rpg = row['HR']
                    
                else:
                    spg = row['AS']
                    stpg = row['AST']
                    fpg = row['AF']
                    cpg = row['AC']
                    ypg = row['AY']
                    rpg = row['AR']
                
                continue
            
                
            if row['HomeTeam'] == team:
                
                home_team_spg[index] = spg/game_counter
                home_team_stpg[index] = stpg/game_counter
                home_team_fpg[index] = fpg/game_counter
                home_team_cpg[index] = cpg/game_counter
                home_team_ypg[index] = ypg/game_counter
                home_team_rpg[index] = rpg/game_counter
                
                game_counter += 1
                
                spg += row['HS']
                stpg += row['HST']
                fpg += row['HF']
                cpg += row['HC']
                ypg += row['HY']
                rpg += row['HR']
                    
            else:
                
                away_team_spg[index] = spg/game_counter
                away_team_stpg[index] = stpg/game_counter
                away_team_fpg[index] = fpg/game_counter
                away_team_cpg[index] = cpg/game_counter
                away_team_ypg[index] = ypg/game_counter
                away_team_rpg[index] = rpg/game_counter
                
                game_counter += 1
                
                spg += row['AS']
                stpg += row['AST']
                fpg += row['AF']
                cpg += row['AC']
                ypg += row['AY']
                rpg += row['AR']
                
    return home_team_spg, away_team_spg, home_team_stpg, away_team_stpg, home_team_fpg, away_team_fpg, home_team_cpg, away_team_cpg, home_team_ypg, away_team_ypg, home_team_rpg, away_team_rpg
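
The six per-game statistics all follow one pattern: sum a team's earlier values and divide by its earlier game count. A sketch of that pattern for a single statistic, using grouped cumulative sums, could look like this. The function name `per_game_before_match` is hypothetical, and the sketch assumes the rows are in chronological order.

```python
import pandas as pd

def per_game_before_match(df, home_col, away_col):
    """Average of one stat over each side's earlier matches (0 when there are none)."""
    home = pd.DataFrame({'match': df.index, 'team': df['HomeTeam'].to_numpy(),
                         'value': df[home_col].to_numpy(), 'side': 'H'})
    away = pd.DataFrame({'match': df.index, 'team': df['AwayTeam'].to_numpy(),
                         'value': df[away_col].to_numpy(), 'side': 'A'})
    long = pd.concat([home, away], ignore_index=True).sort_values('match')
    grouped = long.groupby('team')['value']
    prior_total = grouped.cumsum() - long['value']   # exclude the current match
    prior_games = grouped.cumcount()                 # 0 for a team's first match
    long['avg'] = (prior_total / prior_games.where(prior_games > 0)).fillna(0.0)
    avg_h = long.loc[long['side'] == 'H'].set_index('match')['avg']
    avg_a = long.loc[long['side'] == 'A'].set_index('match')['avg']
    return avg_h.reindex(df.index).to_numpy(), avg_a.reindex(df.index).to_numpy()

toy = pd.DataFrame({'HomeTeam': ['A', 'A', 'B'], 'AwayTeam': ['B', 'C', 'A'],
                    'HS': [10, 6, 5], 'AS': [4, 8, 12]})
hspg, aspg = per_game_before_match(toy, 'HS', 'AS')
print(hspg, aspg)
```

Team A takes 10 and 6 shots in its first two matches, so its average before the third match is 8.0; calling the function once per column pair (`HST`/`AST`, `HF`/`AF`, and so on) would cover the remaining statistics.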
            
                
                
    
    
    
    

In [15]:
for df in all_datasets:
    
    # Add FUD --> Fatigue unixtime difference
    hfud, afud = fatigue_time(df[['HomeTeam','AwayTeam','unixtime']])
    df['HFUD'] = hfud
    df['AFUD'] = afud
    
    # Add goal differences HGDPH, AGDPH, HGDPA, AGDPA
    hgdph, agdph, hgdpa, agdpa  = goal_difference(df[['HomeTeam','AwayTeam','FTHG','FTAG']])
    df['HGDPH'] = hgdph
    df['AGDPH'] = agdph
    df['HGDPA'] = hgdpa
    df['AGDPA'] = agdpa
    
    # Add total goal differences HTGD, ATGD
    df['HTGD'] = df.apply(lambda row: tot_goal_difference(row['HGDPH'], row['HGDPA']), axis=1)
    df['ATGD'] = df.apply(lambda row: tot_goal_difference(row['AGDPH'], row['AGDPA']), axis=1)
    
    # Add current points HCP, ACP and games played HGP, AGP
    hcp, acp, hgp, agp  = current_points(df[['HomeTeam','AwayTeam','FTR']])
    df['HCP'] = hcp
    df['ACP'] = acp
    df['HGP'] = hgp
    df['AGP'] = agp
    
    # Add strength of victory HSOV, ASOV
    hsov, asov  = strength_of_victory(df[['HomeTeam','AwayTeam','FTR','HGP','HCP','AGP','ACP']])
    df['HSOV'] = hsov
    df['ASOV'] = asov
    
    # Add half time goal difference HHGD AHGD
    hhgd, ahgd = halftime_goal_difference(df[['HomeTeam','AwayTeam','HTHG','HTAG']])
    df['HHGD'] = hhgd
    df['AHGD'] = ahgd
    
    # Add per game stats
    hspg, aspg, htspg, atspg, hfpg, afpg, hcpg, acpg, hypg, aypg, hrpg, arpg  = per_game_stats(df)
    df['HSPG'] = hspg
    df['ASPG'] = aspg
    df['HTSPG'] = htspg
    df['ATSPG'] = atspg
    df['HFPG'] = hfpg
    df['AFPG'] = afpg
    df['HCPG'] = hcpg
    df['ACPG'] = acpg
    df['HYPG'] = hypg
    df['AYPG'] = aypg
    df['HRPG'] = hrpg
    df['ARPG'] = arpg
    
    
    # Drop unnecessary stats
    df.drop(['FTHG', 'FTAG', 'HTHG', 'HTAG','HTR', 'HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC', 'AC', 'HY', 'AY',
       'HR', 'AR'], axis=1, inplace=True)
    
    
    
    
    

#### Final features
32 columns

In [16]:
all_datasets[0].columns

Index(['HomeTeam', 'AwayTeam', 'FTR', 'unixtime', 'HFUD', 'AFUD', 'HGDPH',
       'AGDPH', 'HGDPA', 'AGDPA', 'HTGD', 'ATGD', 'HCP', 'ACP', 'HGP', 'AGP',
       'HSOV', 'ASOV', 'HHGD', 'AHGD', 'HSPG', 'ASPG', 'HTSPG', 'ATSPG',
       'HFPG', 'AFPG', 'HCPG', 'ACPG', 'HYPG', 'AYPG', 'HRPG', 'ARPG'],
      dtype='object')

#### Final dataframe/csv
6577 rows x 32 columns

In [17]:
final_df = pd.concat(all_datasets)
len(final_df)

6577

#### Make dataframe into csv

In [18]:
final_df.to_csv('PL_all_seasons.csv',index=False, encoding='utf-8-sig')