# Preprocessing Footywire

In this notebook we clean the dataset sourced from footywire.com.

Player names are represented differently here than in the fryzigg dataset (e.g. Thomas Stewart vs. Tom Stewart), so we need a way to combine the datasets such that the player ids can be joined to the match performance.
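As a rough illustration of the idea used later in this notebook, both spellings of a name can be reduced to the same key of first initial plus surname. The helper `name_key` below is a hypothetical simplification, not the notebook's `transform_name`, and it ignores the special cases handled further down.

```python
def name_key(full_name: str) -> str:
    """Hypothetical helper: first initial + remaining name parts, lowercased."""
    # Drop apostrophes, then split off the first name from the rest
    first, *rest = full_name.replace("'", "").split()
    return f"{first[0]} {' '.join(rest)}".lower()


# Two spellings of the same player collapse to one key
keys = [name_key(n) for n in ["Thomas Stewart", "Tom Stewart", "Jordan De Goey"]]
print(keys)  # ['t stewart', 't stewart', 'j de goey']
```

The key on its own is ambiguous (two players can share an initial and surname), which is why the join below also uses team and season.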

In [1]:
import pandas as pd

In [2]:
SC = pd.read_csv('../../data/raw/cleaned_footywire.csv', index_col=0)

In [3]:
SC

Unnamed: 0,match_id,season,match_round,Player,Team,SC
0,13966,2012,1,Aaron Sandilands,Fremantle,113.0
1,13966,2012,1,Adam McPhee,Fremantle,38.0
2,13966,2012,1,Allen Christensen,Geelong,48.0
3,13966,2012,1,Andrew Mackie,Geelong,95.0
4,13966,2012,1,Billie Smedts,Geelong,46.0
...,...,...,...,...,...,...
94222,16346,2022,23,Tom Hickey,Sydney,88.0
94223,16346,2022,23,Tom McCartin,Sydney,24.0
94224,16346,2022,23,Tom Papley,Sydney,37.0
94225,16346,2022,23,Will Hayward,Sydney,121.0


In [4]:
SC['name_length'] = SC['Player'].apply(lambda x: len(x.split()))

In [5]:
SC.drop_duplicates('Player').sort_values('name_length', ascending=False)[:15]

Unnamed: 0,match_id,season,match_round,Player,Team,SC,name_length
2966,14025,2012,8,Jay Van Berlo,Fremantle,20.0,3
117,13964,2012,1,Nathan Van Berlo,Adelaide,117.0,3
18666,14409,2014,4,Dylan Van Unen,Essendon,40.0,3
22,13966,2012,1,Matthew De Boer,Fremantle,97.0,3
26203,14584,2015,1,Jordan De Goey,Collingwood,11.0,3
68,13960,2012,1,Josh P. Kennedy,Sydney,137.0,3
34854,14789,2016,1,Callum Ah Chee,Gold Coast,58.0,3
78171,15919,2021,5,Sam De Koning,Geelong,27.0,3
60234,15387,2018,22,Tom De Koning,Carlton,58.0,3
27200,14603,2015,3,Brendon Ah Chee,Port Adelaide,22.0,3


In [6]:
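# Josh P. Kennedy is the only player listed with a middle initial

# regex=False matches a literal full stop and avoids the invalid "\." string escape
has_initial = SC['Player'].str.contains('.', regex=False)
print(SC.loc[has_initial, 'Player'].nunique())
SC[has_initial].head()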


1


Unnamed: 0,match_id,season,match_round,Player,Team,SC,name_length
68,13960,2012,1,Josh P. Kennedy,Sydney,137.0,3
720,13971,2012,2,Josh P. Kennedy,Sydney,139.0,3
1080,13980,2012,3,Josh P. Kennedy,Sydney,113.0,3
1514,13994,2012,4,Josh P. Kennedy,Sydney,107.0,3
1822,14002,2012,5,Josh P. Kennedy,Sydney,173.0,3


In [7]:
# B Ebert and L Young are legitimate cases of two players on the same team sharing an initial + surname combination

# M Frederick: first name changed from Minairo to Michael in the dataset, but it is the same person
# J McInerney: one record has the "I" in McInerney capitalised, so it was counted as two different names

def transform_name(row):
    
    player_name = row['Player'].replace("'", "")
    player_team = row['Team']

    # for unique case of Josh P. Kennedy
    if '.' in player_name:

        first, mid, last = player_name.split()

        return_name = f'{first[0]} {last}'

        return return_name.lower()
    
    # L Young (Western Bulldogs) and B Ebert (Port Adelaide): two players share the
    # initial + surname key, so keep the full name to tell them apart
    double_names = {'Lewis Young', 'Lachie Young', 'Brad Ebert', 'Brett Ebert'}
    
    if player_name in double_names:

        return player_name.lower()
    
    # cases like Jordan De Goey -> j de goey
    if row['name_length'] == 3:

        first, mid, last = player_name.split()

        return_name = f'{first[0]} {mid} {last}'
        
        return return_name.lower()

    if row['name_length'] == 2: 

        first, last = player_name.split()

        return_name = f'{first[0]} {last}'
        
        return return_name.lower()

In [8]:
SC['process_name'] = SC.apply(transform_name, axis=1)

In [9]:
SC.query('Player.str.contains("Lewis Young")')

Unnamed: 0,match_id,season,match_round,Player,Team,SC,name_length,process_name
49478,15136,2017,17,Lewis Young,Western Bulldogs,67.0,2,lewis young
50228,15141,2017,18,Lewis Young,Western Bulldogs,79.0,2,lewis young
50622,15153,2017,19,Lewis Young,Western Bulldogs,48.0,2,lewis young
50713,15159,2017,20,Lewis Young,Western Bulldogs,52.0,2,lewis young
51412,15165,2017,21,Lewis Young,Western Bulldogs,23.0,2,lewis young
51810,15175,2017,22,Lewis Young,Western Bulldogs,46.0,2,lewis young
52029,15183,2017,23,Lewis Young,Western Bulldogs,23.0,2,lewis young
54982,15256,2018,7,Lewis Young,Western Bulldogs,69.0,2,lewis young
55377,15269,2018,8,Lewis Young,Western Bulldogs,62.0,2,lewis young
61495,15424,2019,2,Lewis Young,Western Bulldogs,35.0,2,lewis young


In [10]:
players = pd.read_csv('../../data/curated/player_information_12-22.csv', index_col=0)

In [11]:
merge = pd.merge(SC, players, how='left',
         left_on=['process_name', 'Team', 'season'],
         right_on=['process_name', 'player_team', 'season'])

In [12]:
merge[merge.isna().any(axis=1)].drop_duplicates('Player')

Unnamed: 0,match_id,season,match_round,Player,Team,SC,name_length,process_name,player_id,player_team,no_teams,player_first_name,player_last_name
17754,14381,2014,1,Jay K-Harris,Melbourne,24.0,2,j k-harris,,,,,
17864,14384,2014,2,Angus Litherland,Hawthorn,27.0,2,a litherland,,,,,
26922,14590,2015,2,Simon Tunbridge,West Coast,48.0,2,s tunbridge,,,,,
35077,14791,2016,1,Dane Swan,Collingwood,0.0,2,d swan,,,,,
35101,14791,2016,1,Michael Talia,Sydney,56.0,2,m talia,,,,,
48421,15108,2017,14,Josh D-Cardillo,Fremantle,33.0,2,j d-cardillo,,,,,
53018,15216,2018,2,Willie Rioli,West Coast,31.0,2,w rioli,,,,,
67117,15549,2019,17,Ian Hill,Greater Western Sydney,60.0,2,i hill,,,,,
76628,15877,2021,1,James Jordan,Melbourne,78.0,2,j jordan,,,,,
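One alternative way to list unmatched rows, sketched here on toy frames, is the `indicator=True` option of `pd.merge`, which adds a `_merge` column marking rows found only on the left side. The frames `left` and `right` are assumptions for illustration; the column names and ids are taken from this notebook.

```python
import pandas as pd

# Toy stand-ins for SC (left) and players (right)
left = pd.DataFrame({
    "process_name": ["a sandilands", "w rioli"],
    "Team": ["Fremantle", "West Coast"],
    "season": [2012, 2018],
})
right = pd.DataFrame({
    "process_name": ["a sandilands", "j rioli"],
    "player_team": ["Fremantle", "West Coast"],
    "season": [2012, 2018],
    "player_id": [11260, 12613],
})

# indicator=True adds a "_merge" column: "both" or "left_only" for a left join
checked = pd.merge(
    left, right, how="left",
    left_on=["process_name", "Team", "season"],
    right_on=["process_name", "player_team", "season"],
    indicator=True,
)

# Rows with no partner in the player table
unmatched = checked.loc[checked["_merge"] == "left_only", "process_name"].tolist()
print(unmatched)  # ['w rioli']
```

Unlike `isna().any(axis=1)`, this does not flag rows that matched but carry a missing value in some other column.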


Names to fix:

J K-Harris: the other dataframe spells the surname out as "Kennedy Harris" (no hyphen), so the processed names do not match.

A Litherland: changed his surname from Litherland to Dewar on his return to the AFL; the other dataset applies the new name to his earlier records.

Simon Tunbridge, Michael Talia and Dane Swan each played one game in 2015/2016 for which the other stats were not recorded, so those records were dropped in the previous notebook.

J D-Cardillo is written as J Deluca in the other dataset.

Willie Rioli and Ian Hill go by Junior and Bobby respectively in the other dataset.

J Jordon is written as Jordan in this dataset for his first 9 or so career games.
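The fixes listed above could also be applied in one pass with a lookup dictionary. This is a minimal sketch on a toy frame, not the approach the notebook takes in the cells that follow: it matches `process_name` exactly rather than by substring and only fills ids that are missing. The ids are the ones looked up in the following cells; the name `id_overrides` is hypothetical.

```python
import pandas as pd

# Footywire process_name -> player_id, using the ids found in this notebook
id_overrides = {
    "j k-harris": 12259,
    "a litherland": 12261,
    "s tunbridge": 12212,
    "d swan": 11290,
    "m talia": 12129,
    "j d-cardillo": 12556,
    "w rioli": 12613,
    "i hill": 12744,
    "j jordan": 12853,
}

# Toy stand-in for the merged frame: two unmatched rows and one matched row
toy = pd.DataFrame({
    "process_name": ["w rioli", "j jordan", "a sandilands"],
    "player_id": [float("nan"), float("nan"), 11260.0],
})

# map() gives NaN for names without an override; fillna() only touches missing ids
toy["player_id"] = toy["player_id"].fillna(toy["process_name"].map(id_overrides))
print(toy["player_id"].tolist())  # [12613.0, 12853.0, 11260.0]
```

Exact matching avoids accidentally overwriting another player whose processed name merely contains the search string.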

In [13]:
# 94227 rows means the left merge produced no duplicate rows
# => joining on processed name, season and team was successful
merge

Unnamed: 0,match_id,season,match_round,Player,Team,SC,name_length,process_name,player_id,player_team,no_teams,player_first_name,player_last_name
0,13966,2012,1,Aaron Sandilands,Fremantle,113.0,2,a sandilands,11260.0,Fremantle,1.0,Aaron,Sandilands
1,13966,2012,1,Adam McPhee,Fremantle,38.0,2,a mcphee,11135.0,Fremantle,1.0,Adam,McPhee
2,13966,2012,1,Allen Christensen,Geelong,48.0,2,a christensen,11927.0,Geelong,2.0,Allen,Christensen
3,13966,2012,1,Andrew Mackie,Geelong,95.0,2,a mackie,11327.0,Geelong,1.0,Andrew,Mackie
4,13966,2012,1,Billie Smedts,Geelong,46.0,2,b smedts,12048.0,Geelong,2.0,Billie,Smedts
...,...,...,...,...,...,...,...,...,...,...,...,...,...
94222,16346,2022,23,Tom Hickey,Sydney,88.0,2,t hickey,12009.0,Sydney,4.0,Tom,Hickey
94223,16346,2022,23,Tom McCartin,Sydney,24.0,2,t mccartin,12634.0,Sydney,1.0,Tom,McCartin
94224,16346,2022,23,Tom Papley,Sydney,37.0,2,t papley,12419.0,Sydney,1.0,Tom,Papley
94225,16346,2022,23,Will Hayward,Sydney,121.0,2,w hayward,12516.0,Sydney,1.0,Will,Hayward
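The row-count reasoning above can also be enforced in code. The sketch below uses toy frames (the names `scores` and `lookup` and the score values are assumptions for illustration) and the `validate` argument of `pd.merge`, which raises `pandas.errors.MergeError` when the right-hand keys are not unique.

```python
import pandas as pd

# Toy frames; ids come from this notebook, score values are illustrative
scores = pd.DataFrame({
    "process_name": ["t hickey", "t hickey", "t papley"],
    "season": [2022, 2022, 2022],
    "SC": [88.0, 75.0, 37.0],
})
lookup = pd.DataFrame({
    "process_name": ["t hickey", "t papley"],
    "season": [2022, 2022],
    "player_id": [12009, 12419],
})

# many_to_one: each left row may match at most one right row
merged = pd.merge(scores, lookup, how="left",
                  on=["process_name", "season"], validate="many_to_one")
assert len(merged) == len(scores)  # a left join without double ups keeps the row count

# A duplicated key on the right side is rejected instead of silently adding rows
dup_lookup = pd.concat([lookup, lookup.iloc[[0]]], ignore_index=True)
try:
    pd.merge(scores, dup_lookup, how="left",
             on=["process_name", "season"], validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(duplicate_detected)  # True
```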


# J K-Harris

In [14]:
players.query('process_name.str.contains("kennedy harris")')

Unnamed: 0,season,player_id,player_team,no_teams,player_first_name,player_last_name,process_name
17599,2014,12259,Melbourne,1,Jay,Kennedy Harris,j kennedy harris
25949,2015,12259,Melbourne,1,Jay,Kennedy Harris,j kennedy harris
44949,2017,12259,Melbourne,1,Jay,Kennedy Harris,j kennedy harris
57835,2018,12259,Melbourne,1,Jay,Kennedy Harris,j kennedy harris
64656,2019,12259,Melbourne,1,Jay,Kennedy Harris,j kennedy harris


In [15]:
merge.loc[merge.query('process_name.str.contains("k-harris")').index, 'player_id'] = 12259

# A Litherland

In [16]:
players.query('process_name.str.contains("dewar")')

Unnamed: 0,season,player_id,player_team,no_teams,player_first_name,player_last_name,process_name
17731,2014,12261,Hawthorn,2,Angus,Dewar,a dewar
28330,2015,12261,Hawthorn,2,Angus,Dewar,a dewar
34796,2016,12261,Hawthorn,2,Angus,Dewar,a dewar
85708,2022,12261,West Coast,2,Angus,Dewar,a dewar


In [17]:
merge.loc[merge.query('process_name.str.contains("litherland")').index, 'player_id'] = 12261

# Tunbridge, Swan, Talia

In [18]:
players.query('process_name.str.contains("tunbridge")')

Unnamed: 0,season,player_id,player_team,no_teams,player_first_name,player_last_name,process_name
13375,2013,12212,West Coast,1,Simon,Tunbridge,s tunbridge
20019,2014,12212,West Coast,1,Simon,Tunbridge,s tunbridge
41214,2016,12212,West Coast,1,Simon,Tunbridge,s tunbridge


In [19]:
merge.loc[merge.query('process_name.str.contains("tunbridge")').index, 'player_id'] = 12212

In [20]:
players.query('process_name.str.contains("swan")')

Unnamed: 0,season,player_id,player_team,no_teams,player_first_name,player_last_name,process_name
96,2012,11290,Collingwood,1,Dane,Swan,d swan
8938,2013,11290,Collingwood,1,Dane,Swan,d swan
17252,2014,11290,Collingwood,1,Dane,Swan,d swan
26004,2015,11290,Collingwood,1,Dane,Swan,d swan


In [21]:
merge.loc[merge.query('process_name.str.contains("d swan")').index, 'player_id'] = 11290

In [22]:
players.query('process_name.str.contains("m talia")')

Unnamed: 0,season,player_id,player_team,no_teams,player_first_name,player_last_name,process_name
6775,2012,12129,Western Bulldogs,1,Michael,Talia,m talia
12356,2013,12129,Western Bulldogs,1,Michael,Talia,m talia
19177,2014,12129,Western Bulldogs,1,Michael,Talia,m talia
26079,2015,12129,Western Bulldogs,1,Michael,Talia,m talia


In [23]:
merge.loc[merge.query('process_name.str.contains("m talia")').index, 'player_id'] = 12129

# J D-Cardillo

In [24]:
players.query('process_name.str.contains("deluca")')

Unnamed: 0,season,player_id,player_team,no_teams,player_first_name,player_last_name,process_name
48091,2017,12556,Fremantle,2,Josh,Deluca,j deluca
66912,2019,12556,Carlton,2,Josh,Deluca,j deluca


In [25]:
merge.loc[merge.query('process_name.str.contains("cardillo")').index, 'player_id'] = 12556

# W Rioli, I Hill

In [26]:
players.query('process_name.str.contains("j rioli")')

Unnamed: 0,season,player_id,player_team,no_teams,player_first_name,player_last_name,process_name
52447,2018,12613,West Coast,1,Junior,Rioli,j rioli
63661,2019,12613,West Coast,1,Junior,Rioli,j rioli
85401,2022,12613,West Coast,1,Junior,Rioli,j rioli


In [27]:
merge.loc[merge.query('process_name.str.contains("w rioli")').index, 'player_id'] = 12613

In [28]:
players.query('process_name.str.contains("b hill")')

Unnamed: 0,season,player_id,player_team,no_teams,player_first_name,player_last_name,process_name
1363,2012,12066,Hawthorn,3,Bradley,Hill,b hill
9016,2013,12066,Hawthorn,3,Bradley,Hill,b hill
17543,2014,12066,Hawthorn,3,Bradley,Hill,b hill
26258,2015,12066,Hawthorn,3,Bradley,Hill,b hill
36019,2016,12066,Hawthorn,3,Bradley,Hill,b hill
43410,2017,12066,Fremantle,3,Bradley,Hill,b hill
51898,2018,12066,Fremantle,3,Bradley,Hill,b hill
60826,2019,12066,Fremantle,3,Bradley,Hill,b hill
66703,2019,12744,Greater Western Sydney,1,Bobby,Hill,b hill
69446,2020,12066,St Kilda,3,Bradley,Hill,b hill


In [29]:
merge.loc[merge.query('process_name.str.contains("i hill")').index, 'player_id'] = 12744

# J Jordon

In [30]:
players.query('process_name.str.contains("jord")')

Unnamed: 0,season,player_id,player_team,no_teams,player_first_name,player_last_name,process_name
76036,2021,12853,Melbourne,1,James,Jordon,j jordon
85051,2022,12853,Melbourne,1,James,Jordon,j jordon


In [31]:
merge.loc[merge.query('process_name.str.contains("jord")').index, 'player_id'] = 12853

In [32]:
merge.isna().sum()

match_id               0
season                 0
match_round            0
Player                 0
Team                   0
SC                     0
name_length            0
process_name           0
player_id              0
player_team          165
no_teams             165
player_first_name    165
player_last_name     165
dtype: int64

In [33]:
merge = merge[['match_id', 'season', 'player_id', 'SC']]

In [34]:
final_df = pd.merge(merge, players, on=['player_id', 'season'], how='left')

In [35]:
final_df

Unnamed: 0,match_id,season,player_id,SC,player_team,no_teams,player_first_name,player_last_name,process_name
0,13966,2012,11260.0,113.0,Fremantle,1.0,Aaron,Sandilands,a sandilands
1,13966,2012,11135.0,38.0,Fremantle,1.0,Adam,McPhee,a mcphee
2,13966,2012,11927.0,48.0,Geelong,2.0,Allen,Christensen,a christensen
3,13966,2012,11327.0,95.0,Geelong,1.0,Andrew,Mackie,a mackie
4,13966,2012,12048.0,46.0,Geelong,2.0,Billie,Smedts,b smedts
...,...,...,...,...,...,...,...,...,...
94222,16346,2022,12009.0,88.0,Sydney,4.0,Tom,Hickey,t hickey
94223,16346,2022,12634.0,24.0,Sydney,1.0,Tom,McCartin,t mccartin
94224,16346,2022,12419.0,37.0,Sydney,1.0,Tom,Papley,t papley
94225,16346,2022,12516.0,121.0,Sydney,1.0,Will,Hayward,w hayward


In [36]:
# these rows are Tunbridge, Swan and Talia: there is no player information for them in that season
final_df.loc[final_df.isna().any(axis=1)]

Unnamed: 0,match_id,season,player_id,SC,player_team,no_teams,player_first_name,player_last_name,process_name
26922,14590,2015,12212.0,48.0,,,,,
35077,14791,2016,11290.0,0.0,,,,,
35101,14791,2016,12129.0,56.0,,,,,


In [37]:
final_df.to_csv('../../data/raw/supercoach_12-22.csv')