# Contents:
   ### 0) [Setup](#setup)
   ### 1) [Basic view of data with some filtering](#first)
   ### 2) [Separating fixtures into those with all match stats and those with at least some](#second)
   ### 3) [Proportions of games that have all match statistics](#third)
   ### 4) [Investigating rows that have at least 1 match stat missing](#fourth)
   ### 5) [Separating League and Cup games](#fifth)

---
<a id='setup'></a>

## 0) Setup

In [1]:
import pandas as pd

import convert_json_to_csv as jtocsv
import utils as u

cfg = u.get_config()
data_cfg = u.get_config("data")

pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 200)

In [2]:
fixture_filepaths = u.get_all_files_with_name("fixtures_all.json", file_dirs=cfg['ALL_LEAGUES_AND_CUPS'])

match_stats = jtocsv.create_match_statistics_df_from_json(
    fixture_filepaths,
    file_to_save_match_stats="match_stats.csv",
    file_to_save_extra_data="extra_data.json",
    return_match_stats_df=True,
    verbose=False, read_data_if_exists=True)

---
<a id='first'></a>

## 1) Basic view of data with some filtering

In [3]:
print(match_stats.shape)
match_stats.head()

(23677, 55)


Unnamed: 0,fixture_id,country,league_name,league_id,league_type,league_season,fixture_date,fixture_round,fixture_status,fixture_elapsed,fixture_venue,fixture_referee,fixture_result_ht,fixture_result_ft,fixture_result_et,fixture_result_pen,home_team_name,home_team_id,away_team_name,away_team_id,home_goals,away_goals,has_match_stats,home_shots_ont,away_shots_ont,home_shots_offt,away_shots_offt,home_shots_tot,away_shots_tot,home_shots_inb,away_shots_inb,home_shots_outb,away_shots_outb,home_passes_acc,away_passes_acc,home_passes_tot,away_passes_tot,home_passes_pct,away_passes_pct,home_possession,away_possession,home_corners,away_corners,home_offsides,away_offsides,home_fouls,away_fouls,home_yc,away_yc,home_rc,away_rc,home_gksaves,away_gksaves,home_shots_bl,away_shots_bl
0,276393,England,FA Cup,1063,Cup,2019,09/08/2019,Extra Preliminary Round,Match Finished,90,"Valerie Park (Prescot, Merseyside)",,,1-1,,,Skelmersdale United,8937,Penistone Church,8895,1.0,1.0,False,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,276394,England,FA Cup,1063,Cup,2019,09/08/2019,Extra Preliminary Round,Match Finished,90,"The Harlow Arena (Harlow, Essex)",,,1-1,,,Woodford Town,9009,White Ensign,8998,1.0,1.0,False,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,276395,England,FA Cup,1063,Cup,2019,09/08/2019,Extra Preliminary Round,Match Finished,90,Wibbandune Sports Ground (London),,,1-0,,,Balham,8695,Rusthall,8917,1.0,0.0,False,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,276396,England,FA Cup,1063,Cup,2019,09/08/2019,Extra Preliminary Round,Match Finished,90,"The Polegrove (Bexhill-on-Sea, East Sussex)",,,1-6,,,Bexhill United,8704,Eastbourne Town,8769,1.0,6.0,False,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,276397,England,FA Cup,1063,Cup,2019,09/08/2019,Extra Preliminary Round,Match Finished,90,"Leg O'Mutton Field (Cobham, Surrey)",,,1-1,,,Sutton Common Rovers,8962,Molesey,8874,1.0,1.0,False,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [4]:
print("Unique fixture ID's:", len(match_stats.fixture_id.unique()), "\n")
print("Countries"); print(match_stats.country.value_counts(), "\n")
print("League Name"); print(match_stats.league_name.value_counts(), "\n")
print("League Type"); print(match_stats.league_type.value_counts(), "\n")
print("Season"); print(match_stats.league_season.value_counts(), "\n")
print("Has some match statistics?"); print(match_stats.has_match_stats.value_counts(), "\n")
print("Fixture Status"); print(match_stats.fixture_status.value_counts(), "\n")

Unique fixture ID's: 23677 

Countries
England    5211
France     4753
Italy      4097
Spain      4035
Germany    3307
World      2274
Name: country, dtype: int64 

League Name
Ligue 1              3802
Serie A              3800
Primera Division     3800
Premier League       3790
Bundesliga 1         3062
Europa League        1419
FA Cup               1328
Champions League      855
Coupe de France       778
Coppa Italia          297
DFB Pokal             245
Copa del Rey          235
Coupe de la Ligue     173
League Cup             93
Name: league_name, dtype: int64 

League Type
League    18254
Cup        5423
Name: league_type, dtype: int64 

Season
2019    3980
2018    3205
2017    2779
2016    2767
2015    1826
2013    1826
2014    1826
2011    1826
2012    1826
2010    1816
Name: league_season, dtype: int64 

Has some match statistics?
False    15229
True      8448
Name: has_match_stats, dtype: int64 

Fixture Status
Match Finished        22863
Not Started             633
Time to 

Looking at the dates of matches whose fixture status is anything other than 'Match Finished', 99% are from the current (2019) season. Because that season is still in progress, we will exclude it from the analysis to begin with. It may still be useful later as the test set when we split our data.
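
A check like the one described above could be sketched as follows. The toy frame and its values (including the 'Match Postponed' status) are made up for illustration; only the column names `fixture_status` and `league_season` come from `match_stats`.

```python
import pandas as pd

# Toy stand-in for match_stats (values are illustrative, not real data)
toy = pd.DataFrame({
    "fixture_status": ["Match Finished", "Not Started", "Not Started",
                       "Match Postponed", "Match Finished"],
    "league_season": ["2018", "2019", "2019", "2019", "2019"],
})

# Keep only fixtures that are not finished, then see which seasons they come from
unfinished = toy[toy.fixture_status != "Match Finished"]
share_by_season = unfinished.league_season.value_counts(normalize=True)
print(share_by_season)
```

On the real data, the same two lines applied to `match_stats` would give the share of unfinished fixtures per season.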

In [5]:
match_stats_filtered = match_stats[match_stats.fixture_status == "Match Finished"].copy()
match_stats_filtered = match_stats_filtered[match_stats_filtered.league_season != "2019"].copy()
print(match_stats_filtered.shape)
match_stats_filtered.head()

(19695, 55)


Unnamed: 0,fixture_id,country,league_name,league_id,league_type,league_season,fixture_date,fixture_round,fixture_status,fixture_elapsed,fixture_venue,fixture_referee,fixture_result_ht,fixture_result_ft,fixture_result_et,fixture_result_pen,home_team_name,home_team_id,away_team_name,away_team_id,home_goals,away_goals,has_match_stats,home_shots_ont,away_shots_ont,home_shots_offt,away_shots_offt,home_shots_tot,away_shots_tot,home_shots_inb,away_shots_inb,home_shots_outb,away_shots_outb,home_passes_acc,away_passes_acc,home_passes_tot,away_passes_tot,home_passes_pct,away_passes_pct,home_possession,away_possession,home_corners,away_corners,home_offsides,away_offsides,home_fouls,away_fouls,home_yc,away_yc,home_rc,away_rc,home_gksaves,away_gksaves,home_shots_bl,away_shots_bl
873,209926,England,FA Cup,758,Cup,2018,09/11/2018,1st Round,Match Finished,90,Coles Park (London),"Salisbury Michael, England",0-0,0-1,,,Haringey Borough,4681,AFC Wimbledon,1333,0.0,1.0,True,0.0,5.0,2.0,12.0,4.0,20.0,1.0,8.0,3.0,12.0,128.0,293.0,265.0,409.0,48%,72%,38%,62%,5.0,6.0,1.0,1.0,12.0,13.0,1.0,0.0,0.0,0.0,4.0,0.0,2.0,3.0
874,209927,England,FA Cup,758,Cup,2018,10/11/2018,1st Round,Match Finished,90,"York Road, Maidenhead","Dean Whitestone, England",0-1,0-4,,,Maidenhead,1838,Portsmouth,1355,0.0,4.0,True,0.0,12.0,5.0,7.0,6.0,25.0,3.0,18.0,3.0,7.0,114.0,452.0,232.0,564.0,49%,80%,28%,72%,1.0,7.0,,,13.0,7.0,,,0.0,0.0,8.0,0.0,1.0,6.0
875,209928,England,FA Cup,758,Cup,2018,10/11/2018,1st Round,Match Finished,90,"The Crown Ground, Accrington","Martin Coy, England",1-0,1-0,,,Accrington ST,1360,Colchester,1361,1.0,0.0,False,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
876,209929,England,FA Cup,758,Cup,2018,10/11/2018,1st Round,Match Finished,90,"Borough Sports Ground, Sutton, London","Sam Purkiss,",0-0,0-0,,,Sutton Utd,1835,Slough Town,4685,0.0,0.0,False,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
877,209930,England,FA Cup,758,Cup,2018,10/11/2018,1st Round,Match Finished,90,"Plainmoor, Torquay","Declan Bourne,",0-0,0-1,,,Torquay,1827,Woking,1836,0.0,1.0,False,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [6]:
print("Season"); print(match_stats_filtered.league_season.value_counts(), "\n")
print("Has some match statistics?"); print(match_stats_filtered.has_match_stats.value_counts(), "\n")
print("Fixture Status"); print(match_stats_filtered.fixture_status.value_counts(), "\n")
print("Time Elapsed"); print(match_stats_filtered.fixture_elapsed.value_counts(), "\n")

Season
2018    3205
2017    2778
2016    2766
2013    1826
2014    1826
2011    1826
2012    1826
2015    1826
2010    1816
Name: league_season, dtype: int64 

Has some match statistics?
False    12998
True      6697
Name: has_match_stats, dtype: int64 

Fixture Status
Match Finished    19695
Name: fixture_status, dtype: int64 

Time Elapsed
90     19361
120      334
Name: fixture_elapsed, dtype: int64 



We'll now separate our data into fixtures with and without match statistics.

In [7]:
match_stats_filtered_yes = match_stats_filtered[match_stats_filtered.has_match_stats].copy()
match_stats_filtered_no = match_stats_filtered[~match_stats_filtered.has_match_stats].copy()
print("With match stats:")
print(match_stats_filtered_yes.shape)
display(match_stats_filtered_yes.head(3))
print("\nWithout match stats:")
print(match_stats_filtered_no.shape)
display(match_stats_filtered_no.head(3))

With match stats:
(6697, 55)


Unnamed: 0,fixture_id,country,league_name,league_id,league_type,league_season,fixture_date,fixture_round,fixture_status,fixture_elapsed,fixture_venue,fixture_referee,fixture_result_ht,fixture_result_ft,fixture_result_et,fixture_result_pen,home_team_name,home_team_id,away_team_name,away_team_id,home_goals,away_goals,has_match_stats,home_shots_ont,away_shots_ont,home_shots_offt,away_shots_offt,home_shots_tot,away_shots_tot,home_shots_inb,away_shots_inb,home_shots_outb,away_shots_outb,home_passes_acc,away_passes_acc,home_passes_tot,away_passes_tot,home_passes_pct,away_passes_pct,home_possession,away_possession,home_corners,away_corners,home_offsides,away_offsides,home_fouls,away_fouls,home_yc,away_yc,home_rc,away_rc,home_gksaves,away_gksaves,home_shots_bl,away_shots_bl
873,209926,England,FA Cup,758,Cup,2018,09/11/2018,1st Round,Match Finished,90,Coles Park (London),"Salisbury Michael, England",0-0,0-1,,,Haringey Borough,4681,AFC Wimbledon,1333,0.0,1.0,True,0.0,5.0,2.0,12.0,4.0,20.0,1.0,8.0,3.0,12.0,128.0,293.0,265.0,409.0,48%,72%,38%,62%,5.0,6.0,1.0,1.0,12.0,13.0,1.0,0.0,0.0,0.0,4.0,0.0,2.0,3.0
874,209927,England,FA Cup,758,Cup,2018,10/11/2018,1st Round,Match Finished,90,"York Road, Maidenhead","Dean Whitestone, England",0-1,0-4,,,Maidenhead,1838,Portsmouth,1355,0.0,4.0,True,0.0,12.0,5.0,7.0,6.0,25.0,3.0,18.0,3.0,7.0,114.0,452.0,232.0,564.0,49%,80%,28%,72%,1.0,7.0,,,13.0,7.0,,,0.0,0.0,8.0,0.0,1.0,6.0
880,209933,England,FA Cup,758,Cup,2018,10/11/2018,1st Round,Match Finished,90,"Imber Court, Molesey","Matt Donohue, England",0-1,0-2,,,Metropolitan Police,4683,Newport County,1367,0.0,2.0,True,6.0,9.0,2.0,8.0,12.0,21.0,6.0,11.0,6.0,10.0,154.0,290.0,262.0,395.0,59%,73%,40%,60%,9.0,5.0,0.0,4.0,11.0,8.0,3.0,2.0,1.0,0.0,7.0,6.0,4.0,4.0



Without match stats:
(12998, 55)


Unnamed: 0,fixture_id,country,league_name,league_id,league_type,league_season,fixture_date,fixture_round,fixture_status,fixture_elapsed,fixture_venue,fixture_referee,fixture_result_ht,fixture_result_ft,fixture_result_et,fixture_result_pen,home_team_name,home_team_id,away_team_name,away_team_id,home_goals,away_goals,has_match_stats,home_shots_ont,away_shots_ont,home_shots_offt,away_shots_offt,home_shots_tot,away_shots_tot,home_shots_inb,away_shots_inb,home_shots_outb,away_shots_outb,home_passes_acc,away_passes_acc,home_passes_tot,away_passes_tot,home_passes_pct,away_passes_pct,home_possession,away_possession,home_corners,away_corners,home_offsides,away_offsides,home_fouls,away_fouls,home_yc,away_yc,home_rc,away_rc,home_gksaves,away_gksaves,home_shots_bl,away_shots_bl
875,209928,England,FA Cup,758,Cup,2018,10/11/2018,1st Round,Match Finished,90,"The Crown Ground, Accrington","Martin Coy, England",1-0,1-0,,,Accrington ST,1360,Colchester,1361,1.0,0.0,False,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
876,209929,England,FA Cup,758,Cup,2018,10/11/2018,1st Round,Match Finished,90,"Borough Sports Ground, Sutton, London","Sam Purkiss,",0-0,0-0,,,Sutton Utd,1835,Slough Town,4685,0.0,0.0,False,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
877,209930,England,FA Cup,758,Cup,2018,10/11/2018,1st Round,Match Finished,90,"Plainmoor, Torquay","Declan Bourne,",0-0,0-1,,,Torquay,1827,Woking,1836,0.0,1.0,False,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


We'll take a brief look at the fixtures that don't have match stats so we know what we are currently ignoring.
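
The same block of `value_counts()` prints is repeated for several subsets in this notebook. A small helper could cut down the repetition; the sketch below is one possible shape, and the name `print_value_counts` is hypothetical, not something defined in `utils`.

```python
import pandas as pd

def print_value_counts(df, columns):
    """Print value_counts() for each listed column, one block per column."""
    for col in columns:
        print(col)
        print(df[col].value_counts(), "\n")

# Toy frame: column names follow match_stats, values are illustrative
toy = pd.DataFrame({
    "country": ["England", "England", "France"],
    "league_type": ["Cup", "League", "Cup"],
})
print_value_counts(toy, ["country", "league_type"])
country_counts = toy["country"].value_counts()
```

Usage on the real data would look like `print_value_counts(match_stats_filtered_no, ["country", "league_name", "league_type"])`.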

In [8]:
print("Unique fixture ID's:", len(match_stats_filtered_no.fixture_id.unique()), "\n")
print("Countries"); print(match_stats_filtered_no.country.value_counts(), "\n")
print("League Name"); print(match_stats_filtered_no.league_name.value_counts(), "\n")
print("League Type"); print(match_stats_filtered_no.league_type.value_counts(), "\n")
print("Season"); print(match_stats_filtered_no.league_season.value_counts(), "\n")
print("Fixture Status"); print(match_stats_filtered_no.fixture_status.value_counts(), "\n")
print("Time Elapsed"); print(match_stats_filtered_no.fixture_elapsed.value_counts(), "\n")

Unique fixture ID's: 12998 

Countries
France     2904
England    2650
Italy      2456
Spain      2354
Germany    1851
World       783
Name: country, dtype: int64 

League Name
Primera Division     2319
Serie A              2292
Ligue 1              2287
Premier League       2270
Bundesliga 1         1836
Europa League         646
Coupe de France       520
FA Cup                380
Coppa Italia          164
Champions League      137
Coupe de la Ligue      97
Copa del Rey           35
DFB Pokal              15
Name: league_name, dtype: int64 

League Type
League    11004
Cup        1994
Name: league_type, dtype: int64 

Season
2013    1826
2014    1826
2011    1826
2012    1826
2015    1826
2010    1816
2017     790
2018     636
2016     626
Name: league_season, dtype: int64 

Fixture Status
Match Finished    12998
Name: fixture_status, dtype: int64 

Time Elapsed
90     12750
120      248
Name: fixture_elapsed, dtype: int64 



So we don't have match stats for any games before the 2016/17 season. We'll recalculate the above for just the seasons where we have some match statistics.

In [9]:
match_stats.league_season.unique()

array(['2019', '2018', '2017', '2016', '2015', '2014', '2013', '2012',
       '2011', '2010'], dtype=object)
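
The output above shows that `league_season` is stored as strings. A filter such as `league_season >= '2016'` still behaves as intended because four-digit year strings sort the same way lexicographically and numerically. A minimal sketch on a toy series, showing both the string comparison and an explicit integer conversion:

```python
import pandas as pd

seasons = pd.Series(["2019", "2018", "2016", "2015", "2010"])

# Lexicographic comparison on the raw strings
mask_str = seasons >= "2016"

# Numeric comparison after an explicit conversion
mask_int = seasons.astype(int) >= 2016

print(mask_str.tolist())
print(mask_int.tolist())
```

Converting to integers is optional here, but it makes the intent explicit and would guard against season labels that are not plain four-digit years.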

In [10]:
match_stats_filtered_no_aft2016 = match_stats_filtered_no[match_stats_filtered_no.league_season >= '2016']

print("Unique fixture ID's:", len(match_stats_filtered_no_aft2016.fixture_id.unique()), "\n")
print("Countries"); print(match_stats_filtered_no_aft2016.country.value_counts(), "\n")
print("League Name"); print(match_stats_filtered_no_aft2016.league_name.value_counts(), "\n")
print("League Type"); print(match_stats_filtered_no_aft2016.league_type.value_counts(), "\n")
print("Season"); print(match_stats_filtered_no_aft2016.league_season.value_counts(), "\n")
print("Fixture Status"); print(match_stats_filtered_no_aft2016.fixture_status.value_counts(), "\n")
print("Time Elapsed"); print(match_stats_filtered_no_aft2016.fixture_elapsed.value_counts(), "\n")

Unique fixture ID's: 2052 

Countries
World      783
France     624
England    380
Italy      176
Spain       74
Germany     15
Name: country, dtype: int64 

League Name
Europa League        646
Coupe de France      520
FA Cup               380
Coppa Italia         164
Champions League     137
Coupe de la Ligue     97
Primera Division      39
Copa del Rey          35
DFB Pokal             15
Serie A               12
Ligue 1                7
Name: league_name, dtype: int64 

League Type
Cup       1994
League      58
Name: league_type, dtype: int64 

Season
2017    790
2018    636
2016    626
Name: league_season, dtype: int64 

Fixture Status
Match Finished    2052
Name: fixture_status, dtype: int64 

Time Elapsed
90     1804
120     248
Name: fixture_elapsed, dtype: int64 



We'll take a closer look at which rounds from each competition we don't have match stats for

In [11]:
for league in match_stats_filtered_no_aft2016.league_name.unique():
    data = match_stats_filtered_no_aft2016[match_stats_filtered_no_aft2016.league_name == league]
    print(league)
    print(data.league_season.unique())
    print(data.fixture_round.value_counts(), "\n")

FA Cup
['2018' '2017' '2016']
1st Round            112
3rd Round             77
2nd Round             51
4th Round             34
1st Round Replays     32
3rd Round Replays     19
5th Round             16
2nd Round Replays     16
Quarter-finals         8
4th Round Replays      6
Semi-finals            4
5th Round Replays      3
Final                  2
Name: fixture_round, dtype: int64 

Ligue 1
['2017' '2016']
Regular Season - 31                 2
Regular Season - 33                 2
Relegation Play Off - Second Leg    1
Regular Season - 24                 1
Relegation Play Off - First Leg     1
Name: fixture_round, dtype: int64 

Coupe de la Ligue
['2018' '2016' '2017']
1st Round         32
16th Finals       24
8th Finals        19
2nd Round         15
Quarter-finals     4
Semi-finals        2
Final              1
Name: fixture_round, dtype: int64 

Coupe de France
['2018' '2017' '2016']
7th Round      263
8th Round      132
32nd Finals     80
16th Finals     37
8th Finals       8
N

Most of the matches without match statistics are cup games, unfortunately including the UCL and UEL. These are mostly games before the knockout stage. They also include most of the games that went to extra time (248 of 334). Of the top 5 European leagues, only Ligue 1 (7), Serie A (12) and Primera Division (39) have any games without match statistics.

For some of the cups, a number of important knockout games and finals lack match statistics, but most of the affected matches are from early rounds and may involve teams outside the top 5 European leagues. (TODO: we can check this, but not now.)

So we will ignore these matches for now and focus on the matches with match statistics.
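
The TODO above (checking whether the cup fixtures involve teams from the top 5 leagues) could be approached roughly as below. This is only a sketch on made-up data: it assumes that every fixture with `league_type == 'League'` belongs to one of the five leagues, and the column `n_league_teams` is a hypothetical name.

```python
import pandas as pd

# Toy fixtures: column names follow match_stats, values are illustrative
toy = pd.DataFrame({
    "league_type":  ["League", "League", "Cup", "Cup", "Cup"],
    "home_team_id": [1, 2, 1, 50, 2],
    "away_team_id": [2, 3, 60, 70, 3],
})

# Team ids that appear in any league fixture
league = toy[toy.league_type == "League"]
league_team_ids = set(league.home_team_id) | set(league.away_team_id)

# For each cup fixture, count how many of the two sides are league teams
cup = toy[toy.league_type == "Cup"].copy()
cup["n_league_teams"] = (
    cup.home_team_id.isin(league_team_ids).astype(int)
    + cup.away_team_id.isin(league_team_ids).astype(int)
)
print(cup)
```

A value of 0 marks a cup fixture between two teams that never appear in the league data, which are the fixtures least likely to matter here.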

---
<a id='second'></a>

## 2) Separating fixtures into those with all match stats and those with at least some

We'll start by looking at the breakdown of the data

In [12]:
print("Unique fixture ID's:", len(match_stats_filtered_yes.fixture_id.unique()), "\n")
print("Countries"); print(match_stats_filtered_yes.country.value_counts(), "\n")
print("League Name"); print(match_stats_filtered_yes.league_name.value_counts(), "\n")
print("League Type"); print(match_stats_filtered_yes.league_type.value_counts(), "\n")
print("Season"); print(match_stats_filtered_yes.league_season.value_counts(), "\n")
print("Fixture Status"); print(match_stats_filtered_yes.fixture_status.value_counts(), "\n")
print("Time Elapsed"); print(match_stats_filtered_yes.fixture_elapsed.value_counts(), "\n")

Unique fixture ID's: 6697 

Countries
France     1229
England    1215
Italy      1183
Spain      1178
Germany    1094
World       798
Name: country, dtype: int64 

League Name
Premier League       1140
Ligue 1              1135
Serie A              1128
Primera Division     1101
Bundesliga 1          920
Champions League      515
Europa League         283
DFB Pokal             174
Copa del Rey           77
FA Cup                 75
Coupe de France        64
Coppa Italia           55
Coupe de la Ligue      30
Name: league_name, dtype: int64 

League Type
League    5424
Cup       1273
Name: league_type, dtype: int64 

Season
2018    2569
2016    2140
2017    1988
Name: league_season, dtype: int64 

Fixture Status
Match Finished    6697
Name: fixture_status, dtype: int64 

Time Elapsed
90     6611
120      86
Name: fixture_elapsed, dtype: int64 



Now we will see in detail how much of our data are null values

In [13]:
match_stats_filtered_yes.isna().sum()

fixture_id               0
country                  0
league_name              0
league_id                0
league_type              0
league_season            0
fixture_date             0
fixture_round            0
fixture_status           0
fixture_elapsed          0
fixture_venue            0
fixture_referee       3125
fixture_result_ht        1
fixture_result_ft        0
fixture_result_et     6617
fixture_result_pen    6656
home_team_name           0
home_team_id             0
away_team_name           0
away_team_id             0
home_goals               0
away_goals               0
has_match_stats          0
home_shots_ont           8
away_shots_ont           8
home_shots_offt         11
away_shots_offt         11
home_shots_tot          10
away_shots_tot          10
home_shots_inb         346
away_shots_inb         346
home_shots_outb        347
away_shots_outb        347
home_passes_acc        346
away_passes_acc        346
home_passes_tot        346
away_passes_tot        346
h

We will filter this data further so that we start with only the matches that have all match statistics, and then check how much is missing for each league and season

In [14]:
print(match_stats_filtered_yes[data_cfg['MATCH_STAT_COLUMNS']].isna().any(axis=1).value_counts(), "\n")

False    5378
True     1319
dtype: int64 



In [15]:
all_match_stats = match_stats_filtered_yes[~match_stats_filtered_yes[data_cfg['MATCH_STAT_COLUMNS']].isna().any(axis=1)]
some_match_stats = match_stats_filtered_yes[match_stats_filtered_yes[data_cfg['MATCH_STAT_COLUMNS']].isna().any(axis=1)]

print(all_match_stats.shape)
display(all_match_stats.head(3))
print(some_match_stats.shape)
display(some_match_stats.head(3))

(5378, 55)


Unnamed: 0,fixture_id,country,league_name,league_id,league_type,league_season,fixture_date,fixture_round,fixture_status,fixture_elapsed,fixture_venue,fixture_referee,fixture_result_ht,fixture_result_ft,fixture_result_et,fixture_result_pen,home_team_name,home_team_id,away_team_name,away_team_id,home_goals,away_goals,has_match_stats,home_shots_ont,away_shots_ont,home_shots_offt,away_shots_offt,home_shots_tot,away_shots_tot,home_shots_inb,away_shots_inb,home_shots_outb,away_shots_outb,home_passes_acc,away_passes_acc,home_passes_tot,away_passes_tot,home_passes_pct,away_passes_pct,home_possession,away_possession,home_corners,away_corners,home_offsides,away_offsides,home_fouls,away_fouls,home_yc,away_yc,home_rc,away_rc,home_gksaves,away_gksaves,home_shots_bl,away_shots_bl
873,209926,England,FA Cup,758,Cup,2018,09/11/2018,1st Round,Match Finished,90,Coles Park (London),"Salisbury Michael, England",0-0,0-1,,,Haringey Borough,4681,AFC Wimbledon,1333,0.0,1.0,True,0.0,5.0,2.0,12.0,4.0,20.0,1.0,8.0,3.0,12.0,128.0,293.0,265.0,409.0,48%,72%,38%,62%,5.0,6.0,1.0,1.0,12.0,13.0,1.0,0.0,0.0,0.0,4.0,0.0,2.0,3.0
880,209933,England,FA Cup,758,Cup,2018,10/11/2018,1st Round,Match Finished,90,"Imber Court, Molesey","Matt Donohue, England",0-1,0-2,,,Metropolitan Police,4683,Newport County,1367,0.0,2.0,True,6.0,9.0,2.0,8.0,12.0,21.0,6.0,11.0,6.0,10.0,154.0,290.0,262.0,395.0,59%,73%,40%,60%,9.0,5.0,0.0,4.0,11.0,8.0,3.0,2.0,1.0,0.0,7.0,6.0,4.0,4.0
903,209956,England,FA Cup,758,Cup,2018,11/11/2018,1st Round,Match Finished,90,"Field Mill, Mansfield","Oliver Yates, England",1-0,1-1,,,Mansfield Town,1374,Charlton,1335,1.0,1.0,True,3.0,4.0,8.0,8.0,18.0,13.0,11.0,7.0,7.0,6.0,257.0,354.0,353.0,455.0,73%,78%,45%,55%,10.0,5.0,2.0,0.0,11.0,12.0,0.0,4.0,0.0,0.0,3.0,2.0,7.0,1.0


(1319, 55)


Unnamed: 0,fixture_id,country,league_name,league_id,league_type,league_season,fixture_date,fixture_round,fixture_status,fixture_elapsed,fixture_venue,fixture_referee,fixture_result_ht,fixture_result_ft,fixture_result_et,fixture_result_pen,home_team_name,home_team_id,away_team_name,away_team_id,home_goals,away_goals,has_match_stats,home_shots_ont,away_shots_ont,home_shots_offt,away_shots_offt,home_shots_tot,away_shots_tot,home_shots_inb,away_shots_inb,home_shots_outb,away_shots_outb,home_passes_acc,away_passes_acc,home_passes_tot,away_passes_tot,home_passes_pct,away_passes_pct,home_possession,away_possession,home_corners,away_corners,home_offsides,away_offsides,home_fouls,away_fouls,home_yc,away_yc,home_rc,away_rc,home_gksaves,away_gksaves,home_shots_bl,away_shots_bl
874,209927,England,FA Cup,758,Cup,2018,10/11/2018,1st Round,Match Finished,90,"York Road, Maidenhead","Dean Whitestone, England",0-1,0-4,,,Maidenhead,1838,Portsmouth,1355,0.0,4.0,True,0.0,12.0,5.0,7.0,6.0,25.0,3.0,18.0,3.0,7.0,114.0,452.0,232.0,564.0,49%,80%,28%,72%,1.0,7.0,,,13.0,7.0,,,0.0,0.0,8.0,0.0,1.0,6.0
958,209863,England,FA Cup,758,Cup,2018,05/01/2019,3rd Round,Match Finished,90,"Turf Moor, Burnley","Simon Hooper, England",0-0,1-0,,,Burnley,44,Barnsley,747,1.0,0.0,True,3.0,0.0,5.0,6.0,8.0,6.0,,,,,,,,,,,47%,53%,4.0,4.0,2.0,4.0,6.0,8.0,,,0.0,0.0,0.0,2.0,,
964,209869,England,FA Cup,758,Cup,2018,05/01/2019,3rd Round,Match Finished,90,"Villa Park, Birmingham","Gavin Ward, England",0-1,0-3,,,Aston Villa,66,Swansea,76,0.0,3.0,True,2.0,8.0,7.0,5.0,9.0,13.0,,,,,,,,,,,46%,54%,9.0,5.0,1.0,0.0,9.0,9.0,,,0.0,0.0,5.0,2.0,,
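
The tables above show the pass-accuracy and possession columns as strings such as `48%`. Before any numeric work they would need converting. A minimal sketch, assuming the values are always an integer followed by `%` (this has not been verified against the full data):

```python
import pandas as pd

# Toy frame: column names follow match_stats, values are illustrative
toy = pd.DataFrame({
    "home_possession": ["38%", "28%", None],
    "away_possession": ["62%", "72%", None],
})

pct_cols = ["home_possession", "away_possession"]
for col in pct_cols:
    # Strip the trailing '%' and convert; missing values stay NaN
    toy[col] = pd.to_numeric(toy[col].str.rstrip("%"), errors="coerce")

print(toy)
```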


---
<a id='third'></a>

## 3) Proportions of games that have all match statistics

The most important games are those from the top 5 leagues; the domestic cup games are less important at the moment. They will be used to supplement our view of how well one team performs against another.

Below we show, out of the matches with at least some match stats, the proportion that have all match stats.

In [16]:
all_match_stats_count = all_match_stats.groupby(['league_season', 'league_name']).has_match_stats.count()
some_match_stats_count = some_match_stats.groupby(['league_season', 'league_name']).has_match_stats.count()
match_stats_proportion = pd.DataFrame(all_match_stats_count/(all_match_stats_count+some_match_stats_count)).fillna(value=0).round(decimals=2)
display(match_stats_proportion)

league_season,league_name,has_match_stats
2016,Bundesliga 1,0.94
2016,Champions League,0.58
2016,Coppa Italia,0.6
2016,Coupe de France,0.88
2016,Coupe de la Ligue,0.09
2016,DFB Pokal,0.25
2016,Ligue 1,0.92
2016,Premier League,0.91
2016,Primera Division,0.98
2016,Serie A,0.98
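
One thing to watch in a ratio like the one above: when two count Series are added with `+`, any group present in only one of them becomes NaN, and `fillna(0)` then reports 0 for it. A group where every match is complete would therefore show 0 rather than 1. Whether this affects any league and season in the real data has not been checked here; the sketch below only illustrates the behaviour on toy counts.

```python
import pandas as pd

# Toy counts indexed by league; "B" has no partially complete matches
complete = pd.Series({"A": 8, "B": 10})
partial = pd.Series({"A": 2})

# Plain + yields NaN for "B", so the ratio is NaN and becomes 0 after fillna
naive = (complete / (complete + partial)).fillna(0)

# add(..., fill_value=0) treats the missing group as a zero count
aligned = complete / complete.add(partial, fill_value=0)

print(naive)
print(aligned)
```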


And here is the percentage of all matches (with or without stats) that have all match stats

In [17]:
all_matches_count_of_match_stats = match_stats_filtered.groupby(['league_season', 'league_name']).has_match_stats.count()
match_stats_total_proportion = pd.DataFrame(((all_match_stats_count/all_matches_count_of_match_stats).fillna(value=0).round(decimals=2)*100).apply(int))
match_stats_total_proportion


league_season,league_name,has_match_stats
2010,Bundesliga 1,0
2010,Ligue 1,0
2010,Premier League,0
2010,Primera Division,0
2010,Serie A,0
2011,Bundesliga 1,0
2011,Ligue 1,0
2011,Premier League,0
2011,Primera Division,0
2011,Serie A,0
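
The cell above scales rounded proportions by 100 and then truncates with `int`. Because of floating-point representation this can come out one lower than expected for some values. A small sketch of the effect and one way around it, using made-up proportions:

```python
import pandas as pd

proportions = pd.Series([0.29, 0.57, 0.5])

# Scaling then truncating: 0.29 * 100 is 28.999999999999996 in floating point
truncated = (proportions * 100).apply(int)

# Rounding after scaling avoids the off-by-one
rounded = (proportions * 100).round().astype(int)

print(truncated.tolist())
print(rounded.tolist())
```

The zeros shown in the output above are unaffected; the difference would only appear for non-zero percentages.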


---
<a id='fourth'></a>

## 4) Investigating rows that have at least 1 match stat missing

To make the most of the data we have, we're going to go back over the matches that have only some match stats and break them down

In [18]:
some_match_stats.head()

Unnamed: 0,fixture_id,country,league_name,league_id,league_type,league_season,fixture_date,fixture_round,fixture_status,fixture_elapsed,fixture_venue,fixture_referee,fixture_result_ht,fixture_result_ft,fixture_result_et,fixture_result_pen,home_team_name,home_team_id,away_team_name,away_team_id,home_goals,away_goals,has_match_stats,home_shots_ont,away_shots_ont,home_shots_offt,away_shots_offt,home_shots_tot,away_shots_tot,home_shots_inb,away_shots_inb,home_shots_outb,away_shots_outb,home_passes_acc,away_passes_acc,home_passes_tot,away_passes_tot,home_passes_pct,away_passes_pct,home_possession,away_possession,home_corners,away_corners,home_offsides,away_offsides,home_fouls,away_fouls,home_yc,away_yc,home_rc,away_rc,home_gksaves,away_gksaves,home_shots_bl,away_shots_bl
874,209927,England,FA Cup,758,Cup,2018,10/11/2018,1st Round,Match Finished,90,"York Road, Maidenhead","Dean Whitestone, England",0-1,0-4,,,Maidenhead,1838,Portsmouth,1355,0.0,4.0,True,0.0,12.0,5.0,7.0,6.0,25.0,3.0,18.0,3.0,7.0,114.0,452.0,232.0,564.0,49%,80%,28%,72%,1.0,7.0,,,13.0,7.0,,,0.0,0.0,8.0,0.0,1.0,6.0
958,209863,England,FA Cup,758,Cup,2018,05/01/2019,3rd Round,Match Finished,90,"Turf Moor, Burnley","Simon Hooper, England",0-0,1-0,,,Burnley,44,Barnsley,747,1.0,0.0,True,3.0,0.0,5.0,6.0,8.0,6.0,,,,,,,,,,,47%,53%,4.0,4.0,2.0,4.0,6.0,8.0,,,0.0,0.0,0.0,2.0,,
964,209869,England,FA Cup,758,Cup,2018,05/01/2019,3rd Round,Match Finished,90,"Villa Park, Birmingham","Gavin Ward, England",0-1,0-3,,,Aston Villa,66,Swansea,76,0.0,3.0,True,2.0,8.0,7.0,5.0,9.0,13.0,,,,,,,,,,,46%,54%,9.0,5.0,1.0,0.0,9.0,9.0,,,0.0,0.0,5.0,2.0,,
990,209839,England,FA Cup,758,Cup,2018,26/01/2019,4th Round,Match Finished,90,"Liberty Stadium, Swansea","England Darren, England",2-0,4-1,,,Swansea,76,Gillingham,1347,4.0,1.0,True,7.0,5.0,4.0,3.0,11.0,8.0,,,,,,,,,,,59%,41%,7.0,6.0,0.0,2.0,6.0,11.0,,,0.0,0.0,4.0,3.0,,
993,209842,England,FA Cup,758,Cup,2018,26/01/2019,4th Round,Match Finished,90,"Keepmoat Stadium, Doncaster","Peter Bankes, England",0-0,2-1,,,Doncaster,1354,Oldham,1349,2.0,1.0,True,5.0,3.0,6.0,6.0,11.0,9.0,,,,,,,,,,,59%,41%,10.0,1.0,1.0,3.0,9.0,16.0,,,0.0,1.0,2.0,3.0,,


In [19]:
lots_missing_cols = ["home_shots_inb","away_shots_inb","home_shots_outb","away_shots_outb","home_passes_acc","away_passes_acc","home_passes_tot","away_passes_tot","home_passes_pct","away_passes_pct", "home_shots_bl", "away_shots_bl"]

print(some_match_stats[some_match_stats[lots_missing_cols].isna().all(axis=1)].shape)
print(some_match_stats[~some_match_stats[lots_missing_cols].isna().all(axis=1)].shape)
print(some_match_stats[some_match_stats[lots_missing_cols].isna().any(axis=1)].shape)


(344, 55)
(975, 55)
(374, 55)
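
The three shapes above come from `all`, its negation, and `any` over the same set of columns: 344 rows miss every column in `lots_missing_cols`, 975 do not, and 374 miss at least one, so 374 - 344 = 30 rows miss only some of them. A toy sketch of the difference between the two reductions (values are illustrative):

```python
import pandas as pd

nan = float("nan")
toy = pd.DataFrame({
    "home_shots_inb":  [nan, nan, 3.0],
    "home_passes_tot": [nan, 250.0, 300.0],
})

missing = toy.isna()
all_missing = missing.all(axis=1)   # every listed column is NaN
any_missing = missing.any(axis=1)   # at least one listed column is NaN

print(all_missing.sum(), (~all_missing).sum(), any_missing.sum())
```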


In [20]:
some_match_stats[~some_match_stats[lots_missing_cols].isna().all(axis=1)].isna().sum()

fixture_id              0
country                 0
league_name             0
league_id               0
league_type             0
league_season           0
fixture_date            0
fixture_round           0
fixture_status          0
fixture_elapsed         0
fixture_venue           0
fixture_referee       169
fixture_result_ht       0
fixture_result_ft       0
fixture_result_et     967
fixture_result_pen    973
home_team_name          0
home_team_id            0
away_team_name          0
away_team_id            0
home_goals              0
away_goals              0
has_match_stats         0
home_shots_ont          0
away_shots_ont          0
home_shots_offt         0
away_shots_offt         0
home_shots_tot          2
away_shots_tot          2
home_shots_inb          2
away_shots_inb          2
home_shots_outb         3
away_shots_outb         3
home_passes_acc         2
away_passes_acc         2
home_passes_tot         2
away_passes_tot         2
home_passes_pct         2
away_passes_

In [21]:
cols_to_investigate = data_cfg['MATCH_STAT_COLUMNS']
some_match_stats_investigate = some_match_stats[cols_to_investigate].copy()
missing_col_combinations = []
count = 0

# Repeatedly peel off the rows with the most missing columns (capped at 15 passes)
while (len(some_match_stats_investigate) > 0) and (count < 15):
    count += 1
    # Number of missing match-stat columns in each row that has any missing
    count_of_missing_per_row = some_match_stats_investigate[
        some_match_stats_investigate[cols_to_investigate].isna().any(axis=1)].isna().sum(axis=1)
    largest_num_missing_per_row = count_of_missing_per_row.max()

    # Rows at the current maximum, and the columns missing in any of them
    data_subset = some_match_stats_investigate[count_of_missing_per_row == largest_num_missing_per_row]
    has_missing = data_subset.isna().sum() > 0
    subset_columns = sorted(has_missing[has_missing].index)

    missing_col_combinations.append((subset_columns, len(data_subset)))

    # Drop those rows and continue with the remainder
    some_match_stats_investigate = some_match_stats_investigate[count_of_missing_per_row != largest_num_missing_per_row]



In [22]:
for columns, num_rows in missing_col_combinations:
    print(num_rows)
    columns = [x.split('away_')[-1] for x in columns[:int(len(columns)/2)]]
    
    print(sorted(columns), "\n")

2
['fouls', 'gksaves', 'offsides', 'passes_acc', 'passes_pct', 'passes_tot', 'possession', 'shots_bl', 'shots_inb', 'shots_offt', 'shots_ont', 'shots_outb', 'shots_tot', 'yc'] 

9
['corners', 'fouls', 'gksaves', 'offsides', 'passes_acc', 'passes_pct', 'passes_tot', 'possession', 'shots_bl', 'shots_inb', 'shots_offt', 'shots_ont', 'shots_outb', 'shots_tot', 'yc'] 

1
['gksaves', 'offsides', 'passes_acc', 'passes_pct', 'passes_tot', 'possession', 'shots_bl', 'shots_inb', 'shots_outb', 'yc'] 

2
['offsides', 'passes_acc', 'passes_pct', 'passes_tot', 'possession', 'shots_bl', 'shots_inb', 'shots_outb', 'yc'] 

9
['offsides', 'passes_acc', 'passes_pct', 'passes_tot', 'possession', 'rc', 'shots_bl', 'shots_inb', 'shots_outb', 'yc'] 

51
['gksaves', 'offsides', 'passes_acc', 'passes_pct', 'passes_tot', 'possession', 'rc', 'shots_bl', 'shots_inb', 'shots_outb', 'yc'] 

2
['passes_acc', 'passes_pct', 'passes_tot', 'shots_bl', 'shots_inb', 'shots_outb'] 

270
['passes_acc', 'passes_pct', 'passes

For now we'll ignore the partially missing data, as there doesn't seem to be an easy fix. TODO: come back and find a way to utilise this data too.
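
When we do come back to this, note that the loop in this section groups rows by how many columns are missing, so two rows that miss the same number of columns but different ones end up in a single group whose column list is the union. An alternative is to tally the exact set of missing columns per row. A minimal sketch on toy data (column names follow `match_stats`, values are illustrative):

```python
from collections import Counter

import pandas as pd

nan = float("nan")
toy = pd.DataFrame({
    "home_yc":       [nan, nan, 1.0, 2.0],
    "home_offsides": [nan, nan, nan, 3.0],
    "home_corners":  [5.0, 4.0, 6.0, nan],
})

# For each row, record the tuple of column names that are NaN
is_missing = toy.isna()
patterns = [
    tuple(col for col, flag in zip(toy.columns, row) if flag)
    for row in is_missing.itertuples(index=False)
]

# Count how many rows share each exact missing-column pattern
pattern_counts = Counter(patterns)
for pattern, n_rows in pattern_counts.most_common():
    print(n_rows, pattern)
```

Rows 2 and 3 each miss exactly one column, but they are kept apart here because the missing column differs.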

---
<a id='fifth'></a>

## 5) Separating League and Cup games

We will start with only those matches that have all match statistics.

For now we will look only at league games and compare the different leagues.

In [23]:
all_match_stats_league = all_match_stats[all_match_stats.league_type == 'League']
all_match_stats_cup = all_match_stats[all_match_stats.league_type == 'Cup']
print(all_match_stats_league.shape)
print(all_match_stats_cup.shape)

(4572, 55)
(806, 55)


In [24]:
print(all_match_stats_league.league_name.value_counts(), "\n")
print(all_match_stats_league.league_season.unique(), "\n")
print(380*3)
print(306*3 + 2)

Serie A             968
Primera Division    958
Premier League      941
Ligue 1             941
Bundesliga 1        764
Name: league_name, dtype: int64 

['2018' '2017' '2016'] 

1140
920


Note that we expect the Bundesliga to have far fewer games, since it has fewer fixtures per season. Compared with the expected totals printed above, each league is missing about 150-200 games.
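
The expected totals follow from the double round-robin format: a league of n teams plays n(n-1) matches per season. The sketch below reproduces the arithmetic and the gap per league from the counts printed above. The extra 2 for the Bundesliga mirrors the cell above; that these are play-off games is an assumption.

```python
def matches_per_season(n_teams):
    """Double round-robin: every team hosts every other team once."""
    return n_teams * (n_teams - 1)

seasons = 3
expected_20 = matches_per_season(20) * seasons   # 20-team leagues
expected_18 = matches_per_season(18) * seasons   # 18-team Bundesliga

# Counts of league fixtures with all match stats, copied from the output above
observed = {"Serie A": 968, "Primera Division": 958, "Premier League": 941,
            "Ligue 1": 941, "Bundesliga 1": 764}

expected = {league: expected_20 for league in observed}
expected["Bundesliga 1"] = expected_18 + 2  # +2 as in the cell above (assumed play-offs)

gaps = {league: expected[league] - n for league, n in observed.items()}
print(gaps)
```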

In [25]:
pd.DataFrame(all_match_stats_league.groupby(['league_name', 'fixture_round']).has_match_stats.count())

league_name,fixture_round,has_match_stats
Bundesliga 1,Regular Season - 1,24
Bundesliga 1,Regular Season - 10,24
Bundesliga 1,Regular Season - 11,23
Bundesliga 1,Regular Season - 12,26
Bundesliga 1,Regular Season - 13,27
Bundesliga 1,Regular Season - 14,27
Bundesliga 1,Regular Season - 15,24
Bundesliga 1,Regular Season - 16,23
Bundesliga 1,Regular Season - 17,25
Bundesliga 1,Regular Season - 18,22


In [26]:
pd.DataFrame(all_match_stats_league.groupby(['fixture_round']).has_match_stats.count()).sort_values(['has_match_stats'], ascending=False)


fixture_round,has_match_stats
Regular Season - 23,142
Regular Season - 1,141
Regular Season - 21,141
Regular Season - 19,140
Regular Season - 4,140
Regular Season - 7,139
Regular Season - 6,139
Regular Season - 22,138
Regular Season - 20,138
Regular Season - 9,138


We see that rounds later in the season tend to have more matches with missing data

In [27]:
pd.DataFrame(match_stats_filtered_yes[match_stats_filtered_yes.league_type == 'League'].groupby(['fixture_round']).has_match_stats.count()).sort_values(['has_match_stats'], ascending=False)


fixture_round,has_match_stats
Regular Season - 9,147
Regular Season - 8,147
Regular Season - 6,147
Regular Season - 32,147
Regular Season - 29,147
Regular Season - 7,147
Regular Season - 1,147
Regular Season - 4,147
Regular Season - 2,147
Regular Season - 5,147
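
Comparing the two tables by eye is awkward, partly because round labels sort as strings. A share of complete fixtures per numeric round would make the pattern easier to see. The sketch below uses toy data; `is_complete` is a hypothetical flag for "all match stats present", and it assumes every round label ends in a number, so play-off rounds would need dropping first.

```python
import pandas as pd

# Toy league fixtures with a flag for whether every match stat is present
toy = pd.DataFrame({
    "fixture_round": ["Regular Season - 1", "Regular Season - 1",
                      "Regular Season - 38", "Regular Season - 38"],
    "is_complete": [True, True, True, False],
})

# Pull the round number out of labels like "Regular Season - 23"
toy["round_number"] = (
    toy.fixture_round.str.extract(r"(\d+)$", expand=False).astype(int)
)

# Share of complete fixtures per round, in numeric round order
complete_share = toy.groupby("round_number").is_complete.mean().sort_index()
print(complete_share)
```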


---