In [1]:
import requests
from bs4 import BeautifulSoup

import json
import re

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# xG Model

## 1. Fetching data

### 1.1 Objects and functions to fetch data from Statsbomb's Github

In the third cell below, we call `fetch_all_seasons_for_league` with argument **11**, which is the La Liga competition ID in Statsbomb's public data. See [here](https://github.com/statsbomb/open-data/blob/master/data/competitions.json).

In [2]:
class Game:
    """Game object whose only attribute is the parsed event-level JSON for a game (from Statsbomb's github)"""
    
    def __init__(self, json_file):
        self.json_file = json.loads(json_file)

In [3]:
def fetch_matches_for_season(github_season_url):
    """
    Function which takes a URL from Statsbomb's github for a specific season and returns a dictionary mapping game IDs to 
    Game objects holding each game's event-level JSON data.
    
    Arguments:
    
    github_season_url - (String) URL from Statsbomb's github. Format is:
                        https://github.com/statsbomb/open-data/blob/master/data/matches/{league_ID}/{season_ID}.json
    """
    req = requests.get(github_season_url).text
    soup = BeautifulSoup(req, "lxml") 
    table = soup.find('table')
    
    game_nums = []
    for td in table.find_all('td'):
        if "match_id" in td.text:
            game_num = re.findall(r'[0-9]+', td.text)[0]
            game_nums.append(game_num)
            
    #download the event-level JSON for every game ID found above
    base_url_string = "https://raw.githubusercontent.com/statsbomb/open-data/master/data/events/"
    game_num_dict = {
        game_num  : Game(requests.get(base_url_string + game_num + ".json").text)
        for game_num in game_nums
    }
 
    return game_num_dict
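
Scraping the rendered GitHub page depends on its HTML layout, which may change over time. As a hedged alternative, the sketch below reads match IDs from the text of a matches JSON file instead. It assumes the file is a list of objects that each carry a `match_id` field, and that the raw file is served under `raw.githubusercontent.com` with the same layout as the events files. `raw_matches_url` and `extract_match_ids` are hypothetical helper names, not part of this notebook.

```python
import json

# Assumed base path for raw files, mirroring the events URL used above
RAW_BASE = "https://raw.githubusercontent.com/statsbomb/open-data/master/data"


def raw_matches_url(competition_id, season_id):
    """Build the (assumed) raw URL of a season's matches file."""
    return "{}/matches/{}/{}.json".format(RAW_BASE, competition_id, season_id)


def extract_match_ids(matches_json_text):
    """Return the match IDs, as strings, found in the text of a matches JSON file."""
    matches = json.loads(matches_json_text)
    return [str(match["match_id"]) for match in matches]


# Offline demonstration on a hand-written sample, so no network access is needed
sample_text = json.dumps([{"match_id": 266310}, {"match_id": 266498}])
match_ids = extract_match_ids(sample_text)
print(match_ids)
```

The IDs are returned as strings so they can be concatenated into the events URL exactly as `fetch_matches_for_season` does.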

In [7]:
def fetch_all_seasons_for_league(competition_id):
    """
    Function which takes a competition_id, as specified by Statsbomb, and returns a dictionary where each season maps to 
    another dictionary containing all games in that season.
    
    Arguments:
    
    competition_id - (int) competition_id as specified by Statsbomb 
                      See here: https://github.com/statsbomb/open-data/blob/master/data/competitions.json
    
    """
    #Get webpage html for competitions.json
    req = requests.get("https://raw.githubusercontent.com/statsbomb/open-data/master/data/competitions.json").text
    #Convert webpage to json format
    competitions_statsbomb = json.loads(req)

    all_seasons_id = {}
    for comps in competitions_statsbomb:
        if comps['competition_id'] == competition_id:
            season_id = comps['season_id']
            season_name = comps['season_name']
        
            all_seasons_id[season_name] = season_id
    
    league_all_games_by_seasons = {}
    
    for keys, values in all_seasons_id.items():
        season_url = "https://github.com/statsbomb/open-data/blob/master/data/matches/{}/{}.json".format(competition_id, values)
        print("Getting season {}...".format(keys))
        season = fetch_matches_for_season(season_url)
        league_all_games_by_seasons[keys] = season
    
    print("Done")
    
    return league_all_games_by_seasons

la_liga = fetch_all_seasons_for_league(11)

Getting season 2015/2016...
Getting season 2014/2015...
Getting season 2013/2014...
Getting season 2012/2013...
Getting season 2011/2012...
Getting season 2010/2011...
Getting season 2009/2010...
Getting season 2008/2009...
Getting season 2007/2008...
Getting season 2006/2007...
Getting season 2005/2006...
Getting season 2004/2005...
Done


So we have a data structure, called `la_liga`, which has league season names as the keys of the outer dictionary. The values are dictionaries, which map game ids (as specified by statsbomb) to `Game` objects (defined above).
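
To make the nesting concrete, here is a minimal sketch of how such a structure can be traversed. The `Game` class below is a stand-in that takes already-parsed events, and `la_liga_demo` is a hypothetical miniature with a placeholder game id, not real data.

```python
class Game:
    """Stand-in for the notebook's Game class: holds already-parsed event data."""

    def __init__(self, events):
        self.json_file = events


# Hypothetical miniature of the structure: season name -> {game id -> Game}
la_liga_demo = {
    "2015/2016": {"266310": Game([]), "266498": Game([])},
    "2014/2015": {"placeholder-id": Game([])},
}

# Number of games stored for each season
games_per_season = {season: len(games) for season, games in la_liga_demo.items()}
print(games_per_season)

# Reaching one game's events takes two lookups: season name, then game id
first_game_events = la_liga_demo["2015/2016"]["266310"].json_file
print(len(first_game_events))
```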

In [38]:
la_liga

{'2015/2016': {'266310': <__main__.Game at 0x22a25a610>,
  '266498': <__main__.Game at 0x1341dabd0>,
  '265839': <__main__.Game at 0x118d62e90>,
  '265958': <__main__.Game at 0x119adadd0>,
  '266106': <__main__.Game at 0x11a4e1f90>,
  '266160': <__main__.Game at 0x11b326f10>,
  '266236': <__main__.Game at 0x11bddce10>,
  '266254': <__main__.Game at 0x11cf55c50>,
  '266424': <__main__.Game at 0x11d7b0fd0>,
  '266467': <__main__.Game at 0x11e95bfd0>,
  '266653': <__main__.Game at 0x11f0aef90>,
  '266620': <__main__.Game at 0x11b331750>,
  '266664': <__main__.Game at 0x120cfdd90>,
  '266815': <__main__.Game at 0x121741f50>,
  '266885': <__main__.Game at 0x122490f50>,
  '267274': <__main__.Game at 0x123109fd0>,
  '267327': <__main__.Game at 0x124264f90>,
  '267422': <__main__.Game at 0x124a3bfd0>,
  '267533': <__main__.Game at 0x12554ef10>,
  '267506': <__main__.Game at 0x125fa3f10>,
  '267611': <__main__.Game at 0x126d37dd0>,
  '266961': <__main__.Game at 0x127a92f50>,
  '266056': <__main

### 1.2 Getting shots data 

Now I will write some functions to parse through the json of each game, and extract all the shots, as well as features related to the shots. In particular, these features are: 

- **play pattern**: pattern of play which led to the shot
- **x start location**: x-location of the shot 
- **y start location**: y-location of the shot 
- **duration**: duration of the shot
- **outcome**: result of the shot
- **technique**: technique with which the shot was hit
- **first time**: whether the shot was hit first time (without a preceding touch) or not
- **x gk position**: x-location of the gk when shot was taken
- **y gk position**: y-location of the gk when shot was taken
- **type of shot**: whether shot was from open play or set piece (and type of set piece specified)
- **num opponents within 5 yards**: number of opponents which were within 5 yards of shot location
- **num opponents between shot and goal**: number of opponents inside the triangle formed by the shot location and the two goalposts

All the variables listed above can be read directly from the json, except for the last two, which are computed from the shot's freeze frame. The function below performs the geometric check needed for the last one.
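
As a minimal sketch of the **num opponents within 5 yards** feature, the snippet below counts outfield opponents inside a 5-yard radius of the shot. The freeze frame is made up for illustration (positions and position names are placeholders), and `count_opponents_within` is a hypothetical helper; the fields `location`, `teammate` and `position` follow the usage in this notebook.

```python
import math


def count_opponents_within(shot_xy, freeze_frame, radius=5.0):
    """Count non-goalkeeper opponents within `radius` yards of the shot location."""
    count = 0
    for player in freeze_frame:
        # Skip teammates and goalkeepers, as they are not counted for this feature
        if player["teammate"] or player["position"]["name"] == "Goalkeeper":
            continue
        dx = player["location"][0] - shot_xy[0]
        dy = player["location"][1] - shot_xy[1]
        if math.hypot(dx, dy) <= radius:
            count += 1
    return count


# Made-up freeze frame for a shot taken from (108, 40)
freeze_frame = [
    {"location": [111.0, 40.0], "teammate": False, "position": {"name": "Center Back"}},  # 3 yards away
    {"location": [108.0, 47.0], "teammate": False, "position": {"name": "Left Back"}},    # 7 yards away
    {"location": [110.0, 41.0], "teammate": True, "position": {"name": "Center Forward"}},
    {"location": [112.0, 40.0], "teammate": False, "position": {"name": "Goalkeeper"}},
]

nearby = count_opponents_within((108.0, 40.0), freeze_frame)
print(nearby)
```

Only the first player is counted: the second is too far away, the third is a teammate and the fourth is the goalkeeper.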

In [22]:
def check_player_btwn_shot_and_goal(x_shot, y_shot, x_player, y_player):
    """
    Function which checks whether a player in a Statsbomb freeze frame is inside the triangle formed by the shot 
    location and the two posts. See here for coordinate specifications: 
    https://github.com/statsbomb/open-data/blob/master/doc/StatsBomb%20Open%20Data%20Specification%20v1.1.pdf
    
    Arguments:
    x_shot    x-location of shot
    y_shot    y-location of shot
    x_player  x-location of player
    y_player  y-location of player    
    """
    #a shot taken from the goal line has no area between it and the goal (this also avoids dividing by zero below)
    if x_shot >= 120:
        return False
    
    x_diff = x_player - x_shot
    
    #slopes of the lines from the shot location to the posts at (120, 36) and (120, 44)
    slope_1 = (36 - y_shot) / (120 - x_shot)
    slope_2 = (44 - y_shot) / (120 - x_shot) 
                    
    return (x_diff >= 0) and ((y_shot + slope_1 * x_diff) < y_player < (y_shot + slope_2 * x_diff))


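def plot_shot_freeze_frame(game_json, shot_id):
    """
    Plots the opponents in the freeze frame of the shot with id shot_id, along with the two lines
    connecting the shot location to the posts.
    """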
    player_pos_list_x = []
    player_pos_list_y = []
    x_shot = 0
    y_shot = 0
    
    for events in game_json:
        if events['id'] == shot_id:
            x_shot = events['location'][0]
            y_shot = events['location'][1]
        
            for players in events['shot']['freeze_frame']:
                if (players['teammate'] == False):
                    player_pos_list_x.append(players['location'][0])
                    player_pos_list_y.append(players['location'][1])
            
    plt.scatter(player_pos_list_x, player_pos_list_y)
    plt.plot([x_shot, 120], [y_shot,36], color = 'red', linestyle = '--')
    plt.plot([x_shot, 120], [y_shot,44], color = 'red', linestyle = '--')
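
To check the geometry by hand, here is a small worked example. It restates the same test under the name `is_between_shot_and_goal` (a hypothetical name), with the posts at (120, 36) and (120, 44) as in the code above, and a guard for shots taken from the goal line.

```python
def is_between_shot_and_goal(x_shot, y_shot, x_player, y_player):
    """Return True if the player lies inside the triangle shot-post-post."""
    if x_shot >= 120:
        # No area between the shot and the goal; also avoids dividing by zero
        return False
    x_diff = x_player - x_shot
    # y-values of the two shot-to-post lines at the player's x-location
    lower = y_shot + (36 - y_shot) / (120 - x_shot) * x_diff
    upper = y_shot + (44 - y_shot) / (120 - x_shot) * x_diff
    return x_diff >= 0 and lower < y_player < upper


# Shot from (100, 40): at x = 110 the two lines sit at y = 38 and y = 42
inside = is_between_shot_and_goal(100, 40, 110, 41)   # between the lines
outside = is_between_shot_and_goal(100, 40, 110, 43)  # wide of the upper line
behind = is_between_shot_and_goal(100, 40, 95, 40)    # behind the shooter
print(inside, outside, behind)
```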

Now I will write a function which takes a json file for a given game, and returns a data frame containing all shots taken along with the variables related to each shot. Then I will write two wrapper functions: `get_shots_for_season`, which calls `get_shots_for_game` on all game files for that season, and `get_shots_for_league`, which does the same for each game across every season we have for that league.

In [29]:
def get_shots_for_game(game_json):
    """
    Function which parses through a game JSON and returns a data frame containing all shots taken in that game, with several 
    features related to each shot
    
    Arguments
    game_json - event level json for a game
    """
    
    #features for each shot, which will be the columns of our data frame
    
    shot_id_list = []
    x_start_location_list = []
    y_start_location_list = []
    play_pattern_list = []
    duration_list = []
    outcome_list = []
    technique_list = []
    type_shot_list = []
    first_time_list = []
    x_gk_pos_list = []
    y_gk_pos_list = []
    num_opponents_5_yards_list = []
    num_opponents_between_goal_list = []
    
    #-------------------------#

    for events in game_json:
    
        if events['type']['name'] == 'Shot':
        
            #get data for first 8 features
            shot_id = events['id']
            x_start_location = events['location'][0]
            y_start_location = events['location'][1]
            play_pattern = events['play_pattern']['name']
            duration = events['duration']
            outcome = events['shot']['outcome']['name'] 
            technique = events['shot']['technique']['name']
            type_shot = events['shot']['type']['name']
        
            #check if json shot has a first_time attribute, if not set first_time to False
            if 'first_time' in events['shot']:
                first_time = events['shot']['first_time']
            else:
                first_time = False
            
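            #check if shot has a freeze_frame dictionary
            if "freeze_frame" in events["shot"]:
                
                num_opponents_5_yards = 0
                num_opponents_between_goal = 0
                #default goalkeeper position (center of goal), used if the freeze frame has no opposing goalkeeper
                x_gk_pos = 120
                y_gk_pos = 40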
                
                for player in events["shot"]["freeze_frame"]:
                    x_player = player['location'][0]
                    y_player = player['location'][1]
                    
                    #count how many opponents were within 5 yards of player when shot was taken
                    if ((x_start_location - x_player)**2 + (y_start_location - y_player)**2) <= 25 and (player['teammate'] == False):
                        if (player['position']['name'] != 'Goalkeeper'):
                            num_opponents_5_yards += 1
                    
                    #count how many opponents were between shot and goal
                    if (player['teammate'] == False) and (player['position']['name'] != 'Goalkeeper'):
                        if check_player_btwn_shot_and_goal(x_start_location, y_start_location, x_player, y_player):
                            num_opponents_between_goal += 1
                    
                    #get position of opponent's goalkeeper 
                    if ((player['position']['name'] == 'Goalkeeper') and (player['teammate'] == False)):
                        x_gk_pos = player['location'][0]
                        y_gk_pos = player['location'][1]
            
            #if there is no freeze frame, assume goalkeeper is at center of goal, and 0 opponents around shot location
            else:
                num_opponents_between_goal = 0
                num_opponents_5_yards = 0
                x_gk_pos = 120
                y_gk_pos = 40
                
            
            #append data on shot to relevant list (column)
            shot_id_list.append(shot_id)
            play_pattern_list.append(play_pattern)
            x_start_location_list.append(x_start_location)
            y_start_location_list.append(y_start_location)
            duration_list.append(duration)
            outcome_list.append(outcome)
            technique_list.append(technique)
            first_time_list.append(first_time)
            x_gk_pos_list.append(x_gk_pos)
            y_gk_pos_list.append(y_gk_pos)
            type_shot_list.append(type_shot)
            num_opponents_5_yards_list.append(num_opponents_5_yards)
            num_opponents_between_goal_list.append(num_opponents_between_goal)
        
    #create data frame with column features
    shot_df = pd.DataFrame({
                       "shot id" : shot_id_list,
                       "play pattern" : play_pattern_list, 
                       "x start location" : x_start_location_list, 
                       "y start location" : y_start_location_list,
                       "duration" : duration_list, 
                       "outcome" : outcome_list, 
                       "technique" : technique_list, 
                       "first time" : first_time_list,
                       "x gk position" : x_gk_pos_list,
                       "y gk position" : y_gk_pos_list,
                       "type of shot" : type_shot_list,
                       "num opponents within 5 yards" : num_opponents_5_yards_list,
                       "num opponents between shot and goal" : num_opponents_between_goal_list
                       })
    
    return shot_df.set_index("shot id")
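
As a hedged alternative to building each column list by hand, `pd.json_normalize` can flatten the nested event dictionaries into dotted column names. The sketch below runs on two made-up events shaped like the fields used above; it is an illustration, not a replacement for `get_shots_for_game`, and it does not compute the freeze-frame features.

```python
import pandas as pd

# Two made-up events with the same nesting as the fields read above
events = [
    {"id": "a", "type": {"name": "Pass"}, "location": [60.0, 40.0], "duration": 1.0},
    {"id": "b", "type": {"name": "Shot"}, "location": [109.7, 30.6], "duration": 0.56,
     "play_pattern": {"name": "Regular Play"},
     "shot": {"outcome": {"name": "Off T"}, "technique": {"name": "Normal"},
              "type": {"name": "Open Play"}, "first_time": True}},
]

# Nested dictionaries become columns such as "type.name" and "shot.outcome.name"
flat = pd.json_normalize(events)
shots = flat[flat["type.name"] == "Shot"].copy()

# Locations stay as [x, y] lists, so split them into two columns
shots["x start location"] = [loc[0] for loc in shots["location"]]
shots["y start location"] = [loc[1] for loc in shots["location"]]

columns = ["id", "x start location", "y start location", "shot.outcome.name", "shot.technique.name"]
print(shots[columns])
```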

**Wrapper functions:**

In [28]:
def get_shots_for_season(season_dict):
    """
    Takes a dictionary which maps game ids to Game objects, and calls get_shots_for_game() on each one.
    
    Arguments:
    season_dict - dictionary which maps game ids (string, as specified by statsbomb) to Game objects (defined above)
    """
    #collect one data frame per game, then concatenate once (DataFrame.append was removed in pandas 2.0)
    shot_dfs = [get_shots_for_game(game.json_file) for game in season_dict.values()]
    
    #pd.concat raises an error on an empty list, so return an empty data frame instead
    if not shot_dfs:
        return pd.DataFrame()
        
    return pd.concat(shot_dfs)


def get_shots_for_league(league_dict):
    """
    Takes a dictionary which maps season names to season dictionaries for a league, and calls get_shots_for_game() on each game.
    
    Arguments:
    league_dict - dictionary which maps season names to another dictionary. This inner dictionary maps 
    game ids (as specified by statsbomb) to Game objects (defined above)
    """
    #collect one data frame per season, then concatenate once (DataFrame.append was removed in pandas 2.0)
    shot_dfs = []
    
    for season_name, season_dict in league_dict.items():
        print("Getting shots for " + season_name)
        shot_dfs.append(get_shots_for_season(season_dict))
    
    #pd.concat raises an error on an empty list, so return an empty data frame instead
    if not shot_dfs:
        return pd.DataFrame()
        
    return pd.concat(shot_dfs)
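
The combined data frame does not record which season a shot came from. If that label is wanted, one option is to pass a dictionary of per-season frames to `pd.concat`, which adds the dictionary keys as an outer index level. The frames below are made-up placeholders (`s1`, `s2`, `s3` are not real shot ids).

```python
import pandas as pd

# Made-up per-season shot frames, indexed by "shot id" like the output of get_shots_for_game
season_frames = {
    "2015/2016": pd.DataFrame({"outcome": ["Goal", "Saved"]},
                              index=pd.Index(["s1", "s2"], name="shot id")),
    "2014/2015": pd.DataFrame({"outcome": ["Blocked"]},
                              index=pd.Index(["s3"], name="shot id")),
}

# The dictionary keys become the outer index level, named "season"
labelled = pd.concat(season_frames, names=["season", "shot id"])
print(labelled)

# Selecting on the outer level returns that season's shots
shots_2015 = labelled.loc["2015/2016"]
print(len(shots_2015))
```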

In [17]:
all_la_liga_shots = get_shots_for_league(la_liga)

Getting shots for 2015/2016
Getting shots for 2014/2015
Getting shots for 2013/2014
Getting shots for 2012/2013
Getting shots for 2011/2012
Getting shots for 2010/2011
Getting shots for 2009/2010
Getting shots for 2008/2009
Getting shots for 2007/2008
Getting shots for 2006/2007
Getting shots for 2005/2006
Getting shots for 2004/2005


So we have our data frame with all 8512 shots taken in Barca games from 2004/05 to 2015/16.

In [26]:
all_la_liga_shots

shot id,play pattern,x start location,y start location,duration,outcome,technique,first time,x gk position,y gk position,type of shot,num opponents within 5 yards,num opponents between shot and goal
8522d7b1-8efd-4400-9006-1c2d1327f41d,Regular Play,109.7,30.6,0.564551,Off T,Normal,True,117.7,37.1,Open Play,2,1
fa3b6e06-dac2-4e77-81f7-20ff9429ad3a,From Throw In,87.5,31.7,1.201127,Off T,Normal,False,118.4,40.1,Open Play,2,1
b70b5409-5781-4182-b31a-bad2c8e3e891,From Goal Kick,107.1,34.0,0.466900,Saved,Normal,False,117.6,37.5,Open Play,6,1
e3fc1a0b-d68a-4ffb-95bf-4f97c075c049,From Corner,116.4,38.9,0.323872,Goal,Volley,True,119.7,38.5,Open Play,2,1
97b07723-77e9-4ece-aa90-b6c975b0dbc1,From Corner,97.5,33.8,0.268600,Wayward,Volley,True,119.3,40.0,Open Play,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...
e05a628b-87f5-49b1-980a-71e3f8d59929,From Free Kick,114.9,37.3,0.703500,Goal,Normal,False,118.9,39.0,Open Play,5,0
0db9c930-e402-4ab3-bf98-bcd888b1acf7,Regular Play,108.4,25.7,0.596500,Blocked,Normal,True,115.2,32.7,Open Play,0,1
6ab038e3-9e39-4bb9-a823-6221d6a16542,Regular Play,114.0,39.6,0.296701,Blocked,Normal,False,115.0,35.1,Open Play,2,1
5012da5d-c22e-4d65-95bb-7c88b1d23432,From Throw In,97.1,51.0,1.001687,Goal,Volley,True,118.3,40.8,Open Play,0,1
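
Since an xG model predicts whether a shot is scored, the **outcome** column would typically be turned into a binary target. A minimal sketch, assuming a new column called `is goal` (a hypothetical name) and using the first five rows displayed above:

```python
import pandas as pd

# First five rows of the output above, restricted to two columns
sample = pd.DataFrame({
    "outcome": ["Off T", "Off T", "Saved", "Goal", "Wayward"],
    "x start location": [109.7, 87.5, 107.1, 116.4, 97.5],
})

# 1 if the shot resulted in a goal, 0 for every other outcome
sample["is goal"] = (sample["outcome"] == "Goal").astype(int)
print(sample)
```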


## 2. Data Exploration