**Project context** <br>This notebook is part of "Bookmakerspy", a project that aims to predict football match results in the English Premier League (based on data from 2014 to 2018) and ultimately to beat bookmakers' odds.<br>
The current notebook, "bookmakerspy_data_collection", is the first in a series of four notebooks. It is followed by "bookmakerspy_data_preprocessing", "bookmakerspy_modelisation" and "bookmakerspy_odds_strategy".

**Information about the notebook**<br>
This notebook is intended to run on Google Colab. It collects data from the following sources: https://www.kaggle.com/shubhmamp/english-premier-league-match-data and https://datahub.io/sports-data/english-premier-league in order to create a dataframe containing English Premier League match statistics, player statistics and the corresponding bookmakers' odds.<br>
The Kaggle dataset is available in JSON format and contains match and player statistics from 2014 to 2018. The datahub dataset provides the bookmakers' odds for the same matches.<br>

**Notebook goal**<br>
Running the notebook creates an intermediate Google Drive folder containing the relevant data and performs minor pre-processing tasks. The data are then assembled into a dataset that can be processed further in the exploration, pre-processing and modelling steps.

In [1]:
# Connect the notebook with Google drive to collect data from Kaggle
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Upload your personal kaggle.json containing your personal token info. This file can be retrieved via your personal Kaggle account (more info: https://www.kaggle.com/docs/api#authentication)
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"<redacted>","key":"<redacted>"}'}

In [3]:
# Create Kaggle folder
! mkdir -p ~/.kaggle

# Copy kaggle.json into kaggle folder
! cp kaggle.json ~/.kaggle/

# Modify permissions for kaggle.json
! chmod 600 ~/.kaggle/kaggle.json

In [4]:
# Download Kaggle data https://www.kaggle.com/shubhmamp/english-premier-league-match-data
! kaggle datasets download -d shubhmamp/english-premier-league-match-data

Downloading english-premier-league-match-data.zip to /content
 90% 5.00M/5.57M [00:00<00:00, 13.6MB/s]
100% 5.57M/5.57M [00:00<00:00, 13.8MB/s]


In [5]:
# Create a "dataset" folder in Google Drive and unzip the Kaggle data into it
! mkdir '/content/drive/My Drive/dataset'
! unzip english-premier-league-match-data.zip -d '/content/drive/My Drive/dataset'

Archive:  english-premier-league-match-data.zip
  inflating: /content/drive/My Drive/dataset/datafile/season14-15/season_match_stats.json  
  inflating: /content/drive/My Drive/dataset/datafile/season14-15/season_stats.json  
  inflating: /content/drive/My Drive/dataset/datafile/season15-16/season_match_stats.json  
  inflating: /content/drive/My Drive/dataset/datafile/season15-16/season_stats.json  
  inflating: /content/drive/My Drive/dataset/datafile/season16-17/season_match_stats.json  
  inflating: /content/drive/My Drive/dataset/datafile/season16-17/season_stats.json  
  inflating: /content/drive/My Drive/dataset/datafile/season17-18/season_match_stats.json  
  inflating: /content/drive/My Drive/dataset/datafile/season17-18/season_stats.json  
  inflating: /content/drive/My Drive/dataset/datafilev2/datafile/season14-15/season_match_stats.json  
  inflating: /content/drive/My Drive/dataset/datafilev2/datafile/season14-15/season_stats.json  
  inflating: /content/drive/My Drive/dat

In [2]:
import json
import pandas as pd

# Match team stats data

In [3]:
# retrieving the files containing statistics
team_stats_14_15_json = json.load(open('/content/drive/My Drive/dataset/datafilev2/datafile/season14-15/season_stats.json'))
team_stats_15_16_json = json.load(open('/content/drive/My Drive/dataset/datafilev2/datafile/season15-16/season_stats.json'))
team_stats_16_17_json = json.load(open('/content/drive/My Drive/dataset/datafilev2/datafile/season16-17/season_stats.json'))
team_stats_17_18_json = json.load(open('/content/drive/My Drive/dataset/datafilev2/datafile/season17-18/season_stats.json'))
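
The four loads above differ only in the season label, so they could be driven by a loop. The following is a minimal sketch, assuming the folder layout shown in the unzip output; `load_season_stats` and `base_dir` are hypothetical names that do not appear in the original notebook. The `with` block also closes each file handle explicitly, which the bare `open(...)` calls do not.

```python
import json
from pathlib import Path


def load_season_stats(base_dir, seasons):
    """Load season_stats.json for each season label into a dict (hypothetical helper)."""
    loaded = {}
    for season in seasons:
        # Assumed layout: <base_dir>/season<label>/season_stats.json
        path = Path(base_dir) / f"season{season}" / "season_stats.json"
        # The context manager closes the file handle once the JSON is parsed
        with open(path, encoding="utf-8") as handle:
            loaded[season] = json.load(handle)
    return loaded
```

With the paths used in this notebook, the call would look like `load_season_stats('/content/drive/My Drive/dataset/datafilev2/datafile', ['14-15', '15-16', '16-17', '17-18'])`, returning one parsed JSON object per season label.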

In [4]:
def team_stats(season_json, teamloc):

  # home team data sits at position 0 of each match entry, away team data at position 1
  if teamloc not in ('home', 'away'):
      raise ValueError("teamloc must be 'home' or 'away'")
  teamidx = 0 if teamloc == 'home' else 1

  # dataframe gathering the relevant data, one row per match
  stats = pd.DataFrame()
  row = 0

  # iterating over the json data to retrieve the stats related to the team
  for match_id, infos_match in season_json.items():

      stats.loc[row, 'match_id'] = match_id

      team = dict(list(infos_match.values())[teamidx])

      for column, team_info in team['team_details'].items():
          stats.loc[row, column] = team_info

      for column, team_stat in team['aggregate_stats'].items():
          stats.loc[row, column] = team_stat

      row += 1

  # dates are parsed day-first, then rows are sorted chronologically
  stats['date'] = pd.to_datetime(stats['date'], dayfirst=True)
  stats = stats.sort_values(by=['date', 'match_id'])
  stats = stats.reset_index(drop=True)

  # the rating and every column after the first five (ids, name, rating, date) are numeric
  stats['team_rating'] = stats['team_rating'].astype(float)

  for column in stats.columns[5:]:
      stats[column] = stats[column].astype(float)

  return stats
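
Filling a dataframe cell by cell with `stats.loc[row, column]` works but can be slow on large inputs. As an illustration only, the sketch below collects one flat dict per match and builds the dataframe in a single call. `team_stats_rows` and the `sample` data are hypothetical; the nesting (`team_details` and `aggregate_stats` under each team entry) is the one assumed by `team_stats` above. The date parsing, sorting and float casts of `team_stats` would still have to be applied afterwards.

```python
import pandas as pd


def team_stats_rows(season_json, teamloc):
    """Build one row per match from a list of dicts (hypothetical variant of team_stats)."""
    if teamloc not in ('home', 'away'):
        raise ValueError("teamloc must be 'home' or 'away'")
    teamidx = 0 if teamloc == 'home' else 1

    rows = []
    for match_id, infos_match in season_json.items():
        team = list(infos_match.values())[teamidx]
        # Merge the match id, team details and aggregate stats into one flat record
        rows.append({'match_id': match_id, **team['team_details'], **team['aggregate_stats']})

    # Keys missing from a record become NaN when the records are assembled
    return pd.DataFrame(rows)


# Tiny made-up sample that mimics the nesting assumed above
sample = {
    '1': {
        '10': {'team_details': {'team_id': '10', 'team_name': 'A'}, 'aggregate_stats': {'goals': '2'}},
        '20': {'team_details': {'team_id': '20', 'team_name': 'B'}, 'aggregate_stats': {}},
    }
}
home = team_stats_rows(sample, 'home')
away = team_stats_rows(sample, 'away')
```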

In [None]:
team_stats_home = {'season_14_15': team_stats(team_stats_14_15_json,'home').add_prefix('home_'), 
                   'season_15_16': team_stats(team_stats_15_16_json,'home').add_prefix('home_'), 
                   'season_16_17': team_stats(team_stats_16_17_json,'home').add_prefix('home_'),  
                   'season_17_18': team_stats(team_stats_17_18_json,'home').add_prefix('home_')}

team_stats_away = {'season_14_15': team_stats(team_stats_14_15_json,'away').add_prefix('away_'), 
                   'season_15_16': team_stats(team_stats_15_16_json,'away').add_prefix('away_'), 
                   'season_16_17': team_stats(team_stats_16_17_json,'away').add_prefix('away_'),  
                   'season_17_18': team_stats(team_stats_17_18_json,'away').add_prefix('away_')}

In [15]:
# Add season info
team_stats_home['season_14_15']['season'] = '2014_2015'
team_stats_home['season_15_16']['season'] = '2015_2016'
team_stats_home['season_16_17']['season'] = '2016_2017'
team_stats_home['season_17_18']['season'] = '2017_2018'

In [16]:
# Concatenation home data
df_home = pd.concat([team_stats_home['season_14_15'],team_stats_home['season_15_16'],team_stats_home['season_16_17'],team_stats_home['season_17_18']])
# Concatenation away data
df_away = pd.concat([team_stats_away['season_14_15'],team_stats_away['season_15_16'],team_stats_away['season_16_17'],team_stats_away['season_17_18']])


In [18]:
df_home.head()

Unnamed: 0,home_match_id,home_team_id,home_team_name,home_team_rating,home_date,home_att_goal_low_left,home_won_contest,home_possession_percentage,home_total_throws,home_att_miss_high_left,home_blocked_scoring_att,home_total_scoring_att,home_att_sv_low_left,home_total_tackle,home_att_miss_high_right,home_aerial_won,home_att_miss_right,home_att_sv_low_centre,home_aerial_lost,home_accurate_pass,home_total_pass,home_won_corners,home_shot_off_target,home_ontarget_scoring_att,home_goals,home_att_miss_left,home_fk_foul_lost,home_att_sv_low_right,home_att_goal_low_centre,home_att_sv_high_left,home_total_offside,home_att_goal_high_left,home_att_goal_low_right,home_att_miss_high,home_att_sv_high_centre,home_att_post_high,home_post_scoring_att,home_att_sv_high_right,home_att_pen_goal,home_att_post_right,home_att_goal_high_right,home_att_post_left,home_att_goal_high_centre,home_penalty_save,season
0,829513,13,Arsenal,7.015,2014-08-16,1.0,12.0,76.0,21.0,1.0,3.0,14.0,2.0,26.0,1.0,23.0,1.0,1.0,17.0,640.0,730.0,9.0,5.0,6.0,2.0,2.0,13.0,1.0,1.0,,,,,,,,,,,,,,,,2014_2015
1,829515,14,Leicester,6.714286,2014-08-16,,6.0,36.7,12.0,,3.0,11.0,,13.0,,27.0,2.0,,14.0,265.0,344.0,3.0,5.0,3.0,2.0,1.0,16.0,1.0,,,,,2.0,2.0,,,,,,,,,,,2014_2015
2,829517,32,Manchester United,6.707143,2014-08-16,1.0,13.0,59.6,29.0,,4.0,14.0,1.0,13.0,,20.0,1.0,3.0,10.0,482.0,558.0,4.0,5.0,5.0,1.0,2.0,14.0,,,,1.0,,,1.0,,1.0,1.0,,,,,,,,2014_2015
3,829519,171,Queens Park Rangers,6.715,2014-08-16,,8.0,51.0,22.0,,6.0,19.0,1.0,14.0,,30.0,5.0,3.0,15.0,296.0,383.0,8.0,7.0,6.0,,2.0,10.0,,,,,,,,1.0,,,1.0,,,,,,,2014_2015
4,829520,96,Stoke,6.799231,2014-08-16,,9.0,63.1,36.0,,6.0,12.0,2.0,27.0,,30.0,1.0,,9.0,432.0,517.0,2.0,4.0,2.0,,2.0,14.0,,,,1.0,,,1.0,,,,,,,,,,,2014_2015


In [19]:
df_away.head()

Unnamed: 0,away_match_id,away_team_id,away_team_name,away_team_rating,away_date,away_won_corners,away_fk_foul_lost,away_won_contest,away_total_tackle,away_aerial_lost,away_possession_percentage,away_att_goal_low_left,away_total_pass,away_total_throws,away_total_offside,away_blocked_scoring_att,away_ontarget_scoring_att,away_aerial_won,away_accurate_pass,away_total_scoring_att,away_att_sv_low_left,away_goals,away_att_miss_high_right,away_att_miss_right,away_shot_off_target,away_att_miss_left,away_att_miss_high,away_att_goal_low_centre,away_att_goal_high_right,away_att_miss_high_left,away_att_sv_low_centre,away_att_goal_high_left,away_att_sv_high_centre,away_att_goal_low_right,away_att_sv_high_right,away_att_sv_high_left,away_penalty_save,away_att_sv_low_right,away_post_scoring_att,away_att_post_high,away_att_post_left,away_att_pen_goal,away_att_goal_high_centre,away_att_post_right
0,829513,162,Crystal Palace,6.628571,2014-08-16,3.0,19.0,7.0,33.0,23.0,24.0,1.0,222.0,18.0,1.0,2.0,2.0,17.0,127.0,4.0,1.0,1.0,,,,,,,,,,,,,,,,,,,,,,
1,829515,31,Everton,6.61,2014-08-16,6.0,10.0,9.0,19.0,27.0,63.3,,605.0,17.0,,5.0,3.0,14.0,509.0,13.0,,2.0,,1.0,5.0,2.0,1.0,,1.0,1.0,1.0,1.0,,,,,,,,,,,,
2,829517,259,Swansea,6.886429,2014-08-16,,20.0,4.0,19.0,20.0,40.4,,383.0,23.0,1.0,1.0,4.0,10.0,307.0,5.0,1.0,2.0,,,,,,1.0,,,,,,1.0,1.0,,,,,,,,,
3,829519,214,Hull,7.176429,2014-08-16,9.0,10.0,4.0,17.0,30.0,49.0,1.0,382.0,26.0,2.0,4.0,4.0,15.0,289.0,11.0,,1.0,2.0,1.0,3.0,,,,,,2.0,,,,,,1.0,1.0,,,,,,
4,829520,24,Aston Villa,6.767692,2014-08-16,8.0,9.0,4.0,19.0,30.0,36.9,1.0,289.0,35.0,1.0,2.0,1.0,9.0,197.0,7.0,,1.0,,,4.0,1.0,2.0,,,1.0,,,,,,,,,,,,,,


In [17]:
# Merge away / home on match id
df_merge = df_home.merge(df_away, left_on=['home_match_id'], right_on=['away_match_id'])

In [20]:
# removing columns made redundant by the merge
df_merge = df_merge.rename(columns={"home_match_id": "match_id", "home_date": "date"})
df_merge = df_merge.drop(['away_match_id','away_date','home_goals','away_goals'], axis=1)

In [21]:
df_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1520 entries, 0 to 1519
Data columns (total 85 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   match_id                    1520 non-null   object        
 1   home_team_id                1520 non-null   object        
 2   home_team_name              1520 non-null   object        
 3   home_team_rating            1520 non-null   float64       
 4   date                        1520 non-null   datetime64[ns]
 5   home_att_goal_low_left      537 non-null    float64       
 6   home_won_contest            1519 non-null   float64       
 7   home_possession_percentage  1520 non-null   float64       
 8   home_total_throws           1520 non-null   float64       
 9   home_att_miss_high_left     537 non-null    float64       
 10  home_blocked_scoring_att    1453 non-null   float64       
 11  home_total_scoring_att      1520 non-null   float64     

In [22]:
# Although this is already pre-processing, we set all NaNs to 0 here:
# in this first part of the dataframe, a NaN is equivalent to 0
df_merge = df_merge.fillna(0)

In [40]:
df_merge['match_id'] = df_merge['match_id'].astype(int)

# Match players stats data

For each match, we retrieve the individual player statistics in order to compute an average rating per player position, which gives a more refined indicator than the overall team rating.

In [23]:
def players_stats(season_json, teamloc):

    # home team data sits at position 0 of each match entry, away team data at position 1
    if teamloc not in ('home', 'away'):
        raise ValueError("teamloc must be 'home' or 'away'")
    teamidx = 0 if teamloc == 'home' else 1

    stats = pd.DataFrame()

    # one row per player and per match
    row = 0
    for match_id, infos_match in season_json.items():

        team = dict(list(infos_match.values())[teamidx])
        for player_key, player in team['Player_stats'].items():
            stats.loc[row, 'season'] = None
            stats.loc[row, 'match_id'] = match_id
            for column, player_detail in player['player_details'].items():
                stats.loc[row, column] = player_detail
            for column, match_stat in player['Match_stats'].items():
                stats.loc[row, column] = match_stat

            row += 1

    stats = stats.sort_values(by=['match_id'])
    stats = stats.reset_index(drop=True)

    # every column from the player rating onwards is numeric
    for column in stats.columns[6:]:
        stats[column] = stats[column].astype(float)

    return stats

In [24]:
players_stats_home = {'season_14_15': players_stats(team_stats_14_15_json, 'home').add_prefix('home_'), 
                      'season_15_16': players_stats(team_stats_15_16_json, 'home').add_prefix('home_'), 
                      'season_16_17': players_stats(team_stats_16_17_json, 'home').add_prefix('home_'),  
                      'season_17_18': players_stats(team_stats_17_18_json, 'home').add_prefix('home_')}

players_stats_away = {'season_14_15': players_stats(team_stats_14_15_json, 'away').add_prefix('away_'), 
                      'season_15_16': players_stats(team_stats_15_16_json, 'away').add_prefix('away_'), 
                      'season_16_17': players_stats(team_stats_16_17_json, 'away').add_prefix('away_'),  
                      'season_17_18': players_stats(team_stats_17_18_json, 'away').add_prefix('away_')}

In [25]:
# concatenation for home team data
df_players_home = pd.concat([players_stats_home['season_14_15'], players_stats_home['season_15_16'], players_stats_home['season_16_17'], players_stats_home['season_17_18']])

# concatenation for away team data
df_players_away = pd.concat([players_stats_away['season_14_15'], players_stats_away['season_15_16'], players_stats_away['season_16_17'], players_stats_away['season_17_18']])

In [26]:
df_players_home.head()

Unnamed: 0,home_season,home_match_id,home_player_id,home_player_name,home_player_position_value,home_player_position_info,home_player_rating,home_touches,home_saves,home_total_pass,home_aerial_won,home_formation_place,home_accurate_pass,home_total_tackle,home_aerial_lost,home_fouls,home_yellow_card,home_total_scoring_att,home_man_of_the_match,home_goals,home_won_contest,home_blocked_scoring_att,home_goal_assist,home_good_high_claim,home_last_man_tackle,home_six_yard_block,home_post_scoring_att,home_att_pen_target,home_att_pen_goal,home_second_yellow,home_red_card,home_att_pen_miss,home_error_lead_to_goal,home_own_goals,home_clearance_off_line,home_penalty_conceded,home_penalty_save,home_att_pen_post
0,,829513,73379,Wojciech Szczesny,1,GK,5.81,20.0,1.0,13.0,1.0,1.0,11.0,,,,,,,,,,,,,,,,,,,,,,,,,
1,,829513,95977,Joel Campbell,5,Sub,0.0,,,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,
2,,829513,845,Tomas Rosicky,5,Sub,0.0,,,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,
3,,829513,102248,Emiliano Martínez,5,Sub,0.0,,,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,
4,,829513,84146,Alex Oxlade Chamberlain,5,Sub,6.21,28.0,,23.0,1.0,0.0,19.0,,1.0,,,,,,,,,,,,,,,,,,,,,,,


In [27]:
df_players_away.head()

Unnamed: 0,away_season,away_match_id,away_player_id,away_player_name,away_player_position_value,away_player_position_info,away_player_rating,away_good_high_claim,away_touches,away_saves,away_total_pass,away_formation_place,away_accurate_pass,away_won_contest,away_total_tackle,away_aerial_lost,away_aerial_won,away_fouls,away_yellow_card,away_total_scoring_att,away_goals,away_second_yellow,away_blocked_scoring_att,away_goal_assist,away_red_card,away_man_of_the_match,away_error_lead_to_goal,away_last_man_tackle,away_penalty_save,away_penalty_conceded,away_clearance_off_line,away_post_scoring_att,away_six_yard_block,away_att_pen_goal,away_att_pen_target,away_own_goals,away_att_pen_miss,away_att_pen_post
0,,829513,7895,Julian Speroni,1,GK,6.26,2.0,35.0,4.0,29.0,1.0,7.0,,,,,,,,,,,,,,,,,,,,,,,,,
1,,829513,21571,Wayne Hennessey,5,Sub,0.0,,,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,
2,,829513,11668,Paddy McCarthy,5,Sub,0.0,,,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,
3,,829513,8466,Damien Delaney,5,Sub,6.03,,5.0,,2.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,
4,,829513,21427,Glenn Murray,5,Sub,0.0,,,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,


In [28]:
# keeping only players with a rating
df_players_home_rated = df_players_home.loc[df_players_home['home_player_rating'] > 0,:]
df_players_away_rated = df_players_away.loc[df_players_away['away_player_rating'] > 0,:]

In [29]:
# creating the key for merge for later stage
df_players_home_rated = df_players_home_rated.rename(columns={"home_match_id": "match_id"})
df_players_away_rated = df_players_away_rated.rename(columns={"away_match_id": "match_id"})

In [30]:
# grouping players by position
position_labels = {'1': 'Goalkeeper', '2': 'Defender', '3': 'Midfielder', '4': 'Forward', '5': 'Substitute'}

# assigning the replaced column back (instead of a chained inplace replace) keeps the update explicit
df_players_home_rated['home_player_position'] = df_players_home_rated['home_player_position_value'].replace(position_labels)

df_players_away_rated['away_player_position'] = df_players_away_rated['away_player_position_value'].replace(position_labels)

In [31]:
def position_rating(teamloc):

  position_list = ['Goalkeeper', 'Defender', 'Midfielder', 'Forward', 'Substitute']

  # select the rated-players dataframe matching the requested side
  if teamloc == 'home':
    players = df_players_home_rated
  elif teamloc == 'away':
    players = df_players_away_rated
  else:
    raise ValueError("teamloc must be 'home' or 'away'")

  # mean rating per match and position; selecting the rating column before .mean()
  # avoids averaging the non-numeric columns
  df = players.groupby(['match_id', teamloc + '_player_position'])[teamloc + '_player_rating'].mean().reset_index()
  df.index = df['match_id']

  # one rating column per position, aligned on match_id
  output = pd.DataFrame()
  for position in position_list:
    output = pd.concat([output, df.loc[df[teamloc + '_player_position'] == position,:]], axis = 1)
    output = output.rename(columns = {teamloc + '_player_rating':position.lower() + '_' + teamloc + '_player_rating'})
    output = output.drop(['match_id', teamloc + '_player_position'], axis = 1)

  output = output.reset_index()
  output = output.rename(columns = {'index':'match_id'})

  return output

df_position_home = position_rating('home')

In [32]:
# Creation of ratings dataframes
df_position_home_rating = position_rating('home')
df_position_away_rating = position_rating('away')
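
The per-position averaging done by `position_rating` can also be expressed as a group-by followed by `unstack`, which spreads the positions into columns in one step. This is only a sketch on made-up ratings, not the notebook's actual data; the column names follow the `home_` prefix used above.

```python
import pandas as pd

# Made-up ratings for two matches; column names follow the notebook's home_ prefix
players = pd.DataFrame({
    'match_id': ['1', '1', '1', '2', '2'],
    'home_player_position': ['Defender', 'Defender', 'Forward', 'Defender', 'Goalkeeper'],
    'home_player_rating': [6.0, 7.0, 8.0, 6.5, 7.5],
})

# Mean rating per (match, position), then positions spread into columns
wide = (
    players
    .groupby(['match_id', 'home_player_position'])['home_player_rating']
    .mean()
    .unstack('home_player_position')
)

# Same naming scheme as position_rating: "<position>_home_player_rating"
wide.columns = [f"{position.lower()}_home_player_rating" for position in wide.columns]
wide = wide.reset_index()
```

A position that is absent from a match (for example, no forward fielded) shows up as NaN in the corresponding column, which matches the missing forward ratings handled later in the notebook.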

In [33]:
df_position_home_rating.head()

Unnamed: 0,match_id,goalkeeper_home_player_rating,defender_home_player_rating,midfielder_home_player_rating,forward_home_player_rating,substitute_home_player_rating
0,1080506,5.92,6.935,6.424,6.44,7.09
1,1080507,6.75,6.445,6.423333,5.996667,6.313333
2,1080508,7.36,6.835,6.5825,6.265,6.055
3,1080509,6.48,7.05,7.14,7.6,6.213333
4,1080510,6.7,6.6025,7.19,6.05,6.093333


In [37]:
df_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1520 entries, 0 to 1519
Data columns (total 85 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   match_id                    1520 non-null   object        
 1   home_team_id                1520 non-null   object        
 2   home_team_name              1520 non-null   object        
 3   home_team_rating            1520 non-null   float64       
 4   date                        1520 non-null   datetime64[ns]
 5   home_att_goal_low_left      1520 non-null   float64       
 6   home_won_contest            1520 non-null   float64       
 7   home_possession_percentage  1520 non-null   float64       
 8   home_total_throws           1520 non-null   float64       
 9   home_att_miss_high_left     1520 non-null   float64       
 10  home_blocked_scoring_att    1520 non-null   float64       
 11  home_total_scoring_att      1520 non-null   float64     

In [41]:
df_position_rating = df_position_home_rating.merge(df_position_away_rating, on = ['match_id'])
df_position_rating['match_id'] = df_position_rating['match_id'].astype(int)

# Merging both df
df_merge = df_merge.merge(df_position_rating, on = ['match_id'])

# Removing substitutes as not considered as relevant
df_merge = df_merge.drop(columns=['substitute_away_player_rating', 'substitute_home_player_rating']) 

# If there is no attacking player, rating equals 0
df_merge['forward_away_player_rating'] = df_merge['forward_away_player_rating'].fillna(0)

In [42]:
df_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1520 entries, 0 to 1519
Data columns (total 93 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   match_id                       1520 non-null   int64         
 1   home_team_id                   1520 non-null   object        
 2   home_team_name                 1520 non-null   object        
 3   home_team_rating               1520 non-null   float64       
 4   date                           1520 non-null   datetime64[ns]
 5   home_att_goal_low_left         1520 non-null   float64       
 6   home_won_contest               1520 non-null   float64       
 7   home_possession_percentage     1520 non-null   float64       
 8   home_total_throws              1520 non-null   float64       
 9   home_att_miss_high_left        1520 non-null   float64       
 10  home_blocked_scoring_att       1520 non-null   float64       
 11  home_total_scorin

# Match odds data

In [43]:
! pip install datapackage

Collecting datapackage
  Downloading datapackage-1.15.2-py2.py3-none-any.whl (85 kB)
Collecting unicodecsv>=0.14
  Downloading unicodecsv-0.14.1.tar.gz (10 kB)
Collecting tabulator>=1.29
  Downloading tabulator-1.53.5-py2.py3-none-any.whl (72 kB)
Collecting jsonpointer>=1.10
  Downloading jsonpointer-2.2-py2.py3

In [44]:
import datapackage

In [45]:
# retrieving odds data relevant for seasons considered

data_url = 'https://datahub.io/sports-data/english-premier-league/datapackage.json'
package = datapackage.Package(data_url)
resources = package.resources

cotes_1415 = pd.read_csv(resources[5].descriptor['path'])
cotes_1516 = pd.read_csv(resources[4].descriptor['path'])
cotes_1617 = pd.read_csv(resources[3].descriptor['path']) 
cotes_1718 = pd.read_csv(resources[2].descriptor['path']) 

In [46]:
df_odds = pd.concat([cotes_1415, cotes_1516, cotes_1617, cotes_1718])

In [47]:
# Converting date
df_odds['date'] = df_odds['Date'].apply(lambda x: pd.to_datetime(x, dayfirst=True))
df_odds = df_odds.drop(['Date'], axis=1)
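
Because the statistics and the odds are later merged on `date`, both sides need to read day and month in the same order. The short example below, on a made-up date string, shows how `dayfirst` changes the result for an ambiguous value.

```python
import pandas as pd

# The same ambiguous string parsed under both conventions
as_day_first = pd.to_datetime('01/02/2015', dayfirst=True)     # read as 1 February 2015
as_month_first = pd.to_datetime('01/02/2015', dayfirst=False)  # read as 2 January 2015
```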

In [48]:
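# Harmonising team names across datasets
# (the two sorted lists are paired by position, which assumes both datasets contain the same teams in the same alphabetical order)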

old_names = sorted(df_odds['HomeTeam'].unique())
new_names = sorted(df_merge['home_team_name'].unique())

#print(old_names)
#print(new_names)
df_odds['HomeTeam'] = df_odds['HomeTeam'].replace(old_names, new_names)
df_odds['AwayTeam'] = df_odds['AwayTeam'].replace(old_names, new_names)
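
Pairing two sorted lists by position silently produces wrong names if one dataset has an extra team or sorts differently. A possible safeguard, sketched below with a hypothetical helper `build_name_mapping` and made-up team lists, is to check the lengths and build an explicit dictionary that can be printed and reviewed before it is applied.

```python
import pandas as pd


def build_name_mapping(old_names, new_names):
    """Pair two name lists positionally after sorting, failing early on a length mismatch (hypothetical helper)."""
    old_sorted, new_sorted = sorted(old_names), sorted(new_names)
    if len(old_sorted) != len(new_sorted):
        raise ValueError(f"{len(old_sorted)} source names vs {len(new_sorted)} target names")
    return dict(zip(old_sorted, new_sorted))


# Made-up example; in the notebook the lists would come from df_odds and df_merge
mapping = build_name_mapping(['Man City', 'Arsenal', 'QPR'],
                             ['Queens Park Rangers', 'Arsenal', 'Manchester City'])
odds = pd.DataFrame({'HomeTeam': ['QPR', 'Arsenal']})
odds['HomeTeam'] = odds['HomeTeam'].map(mapping)
```

The length check does not prove that every pair is right, so reviewing the printed `mapping` by eye remains worthwhile.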

In [49]:
# Remove odds columns that contain NaNs
df_odds = df_odds.dropna(axis='columns')

In [50]:
# Remove variables that are redundant with df_merge
df_odds = df_odds.drop(['Div','HS','AS','HST','AST','HC', 'AC'], axis=1)

# Merging Stats and Odds

In [51]:
df_stats_odds = df_merge.merge(df_odds, left_on = ['date', 'home_team_name', 'away_team_name'], right_on = ['date', 'HomeTeam', 'AwayTeam'])
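
The default inner merge drops any match whose date or team names do not line up across the two datasets, without any warning. One way to inspect this, sketched here on made-up rows (the `B365H` value is illustrative), is a left merge with `indicator=True`, which flags the rows that found no odds.

```python
import pandas as pd

# Made-up rows: two matches with stats, only one of them with odds
stats = pd.DataFrame({'date': pd.to_datetime(['2014-08-16', '2014-08-17']),
                      'home_team_name': ['Arsenal', 'Liverpool'],
                      'away_team_name': ['Crystal Palace', 'Southampton']})
odds = pd.DataFrame({'date': pd.to_datetime(['2014-08-16']),
                     'HomeTeam': ['Arsenal'],
                     'AwayTeam': ['Crystal Palace'],
                     'B365H': [1.25]})

# how='left' keeps every stats row; indicator=True adds a _merge column
checked = stats.merge(odds, how='left', indicator=True,
                      left_on=['date', 'home_team_name', 'away_team_name'],
                      right_on=['date', 'HomeTeam', 'AwayTeam'])

# Rows marked 'left_only' are matches for which no odds were found
unmatched = checked[checked['_merge'] == 'left_only']
```

An empty `unmatched` frame would indicate that every match in the statistics found its odds.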

In [52]:
# Removing redundant and unneeded columns, and making sure id columns are ints
df_stats_odds = df_stats_odds.drop(['HomeTeam','AwayTeam'], axis=1)
df_stats_odds[['match_id','home_team_id','away_team_id']] =  df_stats_odds[['match_id','home_team_id','away_team_id']].astype(int)
df_stats_odds  = df_stats_odds.drop(['Referee'], axis=1)

# CSV Output

In [56]:
df_stats_odds.to_csv('df_stats_odds.csv')