# 1. Data Extraction
Parse raw _StatsBomb_ data and store it in a _Pandas_ dataframe.

In [3]:
import requests
import pandas as pd

The data is read from the StatsBomb open-data repository on GitHub. The {} placeholders are filled in with .format() inside the _parse_data_ function defined further down.

In [4]:
base_url = "https://raw.githubusercontent.com/statsbomb/open-data/master/data/"
comp_url = base_url + "matches/{}/{}.json"
match_url = base_url + "events/{}.json"
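
As a quick illustration of how the placeholders are filled, the sketch below formats both templates by hand. It uses the World Cup IDs (43 and 3) and the match ID 7581 that appear later in this notebook; the names wc_matches_url and example_match_url are only illustrative and are not used elsewhere.

```python
base_url = "https://raw.githubusercontent.com/statsbomb/open-data/master/data/"
comp_url = base_url + "matches/{}/{}.json"
match_url = base_url + "events/{}.json"

# Illustrative names, not part of the original notebook
wc_matches_url = comp_url.format(43, 3)      # competition_id, season_id
example_match_url = match_url.format(7581)   # match_id

print(wc_matches_url)
print(example_match_url)
```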

The _parse_data_ function handles all extraction and transformation in one go:

1. Load the matches for the given competition and season into the _matches_ list.
2. Collect the Match ID of each match in a separate list.
3. Go through each Match ID and load the raw match data into the _events_ list.
4. Filter the events to extract the shots into a separate list.
5. Go through all shots, extract specific features from each shot event, and store them in the _attributes_ dictionary.
6. Append each shot's attributes to the _all_events_ list.

In [38]:
def parse_data(competition_id, season_id):
    # Load the match list and collect the match IDs
    matches = requests.get(url=comp_url.format(competition_id, season_id)).json()
    match_ids = [match["match_id"] for match in matches]

    all_events = []
    for match_id in match_ids:
        # Load the raw event data for this match
        events = requests.get(url=match_url.format(match_id)).json()

        # Keep only the shot events
        shots = [x for x in events if x["type"]["name"] == "Shot"]
        for shot in shots:
            # Extract the features of interest from the shot event
            attributes = {
                "match_id": match_id,
                "team": shot["possession_team"]["name"],
                "player": shot["player"]["name"],
                "x": shot["location"][0],
                "y": shot["location"][1],
                "outcome": shot["shot"]["outcome"]["name"],
            }
            all_events.append(attributes)
    # One row per shot
    return pd.DataFrame(all_events)
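
One possible refinement, sketched here as an assumption rather than as part of the original notebook, is to separate the transformation from the download so that it can be checked without network access. The helper name shot_rows and the sample events are hypothetical; the sample is hand-written and only mimics the fields that parse_data reads.

```python
def shot_rows(events, match_id):
    """Return one dict per shot event, with the same keys parse_data uses."""
    rows = []
    for event in events:
        if event["type"]["name"] != "Shot":
            continue  # skip passes, duels, and every other event type
        rows.append({
            "match_id": match_id,
            "team": event["possession_team"]["name"],
            "player": event["player"]["name"],
            "x": event["location"][0],
            "y": event["location"][1],
            "outcome": event["shot"]["outcome"]["name"],
        })
    return rows


# Hand-written sample events (hypothetical, heavily trimmed)
sample_events = [
    {"type": {"name": "Pass"}},
    {
        "type": {"name": "Shot"},
        "possession_team": {"name": "Croatia"},
        "player": {"name": "Mario Mandžukić"},
        "location": [112.0, 36.0],
        "shot": {"outcome": {"name": "Goal"}},
    },
]

rows = shot_rows(sample_events, match_id=7581)
print(rows)
```

The resulting list can be passed to pd.DataFrame in the same way as all_events is in parse_data.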

The World Cup has competition_id 43 and season_id 3.

In [9]:
WC_id = 43
season_id = 3

Run the function on the World Cup data.

In [40]:
df = parse_data(WC_id, season_id)

In [41]:
df.head(10)

| | match_id | team | player | x | y | outcome |
|---|---|---|---|---|---|---|
| 0 | 7581 | Denmark | Mathias Jattah-Njie Jørgensen | 115.0 | 34.0 | Goal |
| 1 | 7581 | Croatia | Mario Mandžukić | 112.0 | 36.0 | Goal |
| 2 | 7581 | Croatia | Ivan Perišić | 101.0 | 55.0 | Blocked |
| 3 | 7581 | Croatia | Ivan Perišić | 103.0 | 24.0 | Blocked |
| 4 | 7581 | Denmark | Christian Dannemann Eriksen | 96.0 | 37.0 | Blocked |
| 5 | 7581 | Denmark | Martin Braithwaite Christensen | 111.0 | 50.0 | Saved |
| 6 | 7581 | Croatia | Ivan Rakitić | 94.0 | 32.0 | Saved |
| 7 | 7581 | Croatia | Ivan Perišić | 110.0 | 32.0 | Wayward |
| 8 | 7581 | Croatia | Ivan Perišić | 111.0 | 34.0 | Off T |
| 9 | 7581 | Croatia | Ante Rebić | 97.0 | 30.0 | Blocked |


### Exploring other metrics in the dataset

In [38]:
def parse_data2(competition_id, season_id):
    matches = requests.get(url=comp_url.format(competition_id, season_id)).json()
    match_ids = [match["match_id"] for match in matches]
    
    all_events = []
    for match in match_ids:
        
        events = requests.get(url=match_url.format(match)).json()
        
        duels = [x for x in events if x["type"]["name"] == "Duel"]
        
        for duel in duels:
            attributes = {
                'player_id': duel['player']['name'],
                'duel': 0 if 'Lost In Play' in duel['duel']['outcome']['name'] else 1,
            }
            all_events.append(attributes)
    
    return pd.DataFrame(all_events)

In [39]:
parse_data2(WC_id, season_id)

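KeyError: 'outcome'

The call fails because at least one duel event has no 'outcome' key inside its 'duel' dictionary, so the direct lookup duel['duel']['outcome']['name'] in _parse_data2_ raises.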
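
One way to get past the KeyError above, sketched under the assumption that duel events without an outcome should be skipped rather than counted as won or lost, is to read the nested value with dict.get. The helper name duel_result and the sample events are hypothetical and hand-written; they are not taken from the StatsBomb data.

```python
def duel_result(event):
    """Return 0 for 'Lost In Play', 1 for any other outcome, None if no outcome exists."""
    outcome = event.get("duel", {}).get("outcome")
    if outcome is None:
        return None  # assumption: a duel without an outcome is left unscored
    return 0 if "Lost In Play" in outcome["name"] else 1


# Hand-written sample events (hypothetical)
sample_duels = [
    {"player": {"name": "Player A"}, "duel": {"outcome": {"name": "Won"}}},
    {"player": {"name": "Player B"}, "duel": {"outcome": {"name": "Lost In Play"}}},
    {"player": {"name": "Player C"}, "duel": {"type": {"name": "Aerial Lost"}}},
]

rows = [
    {"player": d["player"]["name"], "duel": duel_result(d)}
    for d in sample_duels
    if duel_result(d) is not None  # drop duels that have no outcome
]
print(rows)
```

Inside parse_data2, the same check could stand in for the direct duel['duel']['outcome']['name'] lookup. Whether skipping is the right choice depends on what the missing outcome means for the metric being built.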