<a id='top'></a>

# Data Engineering of StatsBomb Data
##### Notebook to engineer previously parsed JSON data from the [StatsBomb Open Data GitHub repository](https://github.com/statsbomb/open-data)

### By [Edd Webster](https://www.twitter.com/eddwebster)
Notebook first written: 10/11/2020<br>
Notebook last updated: 18/02/2021

![title](../../img/logos/stats-bomb-logo.png)

Click [here](#section5) to jump straight to the Exploratory Data Analysis section and skip the [Task Brief](#section2), [Data Sources](#section3), and [Data Engineering](#section4) sections. Or click [here](#section6) to jump straight to the Conclusion.

___


## <a id='import_libraries'>Introduction</a>
This notebook parses publicly available [StatsBomb](https://statsbomb.com/) Event data, using [pandas](http://pandas.pydata.org/) for data manipulation through DataFrames.

For more information about this notebook and the author, I'm available through all the following channels:
*    [eddwebster.com](https://www.eddwebster.com/);
*    edd.j.webster@gmail.com;
*    [@eddwebster](https://www.twitter.com/eddwebster);
*    [linkedin.com/in/eddwebster](https://www.linkedin.com/in/eddwebster/);
*    [github/eddwebster](https://github.com/eddwebster/);
*    [public.tableau.com/profile/edd.webster](https://public.tableau.com/profile/edd.webster);
*    [kaggle.com/eddwebster](https://www.kaggle.com/eddwebster); and
*    [hackerrank.com/eddwebster](https://www.hackerrank.com/eddwebster).

![title](../../img/edd_webster/fifa21eddwebsterbanner.png)

The accompanying GitHub repository for this notebook can be found [here](https://github.com/eddwebster/football_analytics) and a static version of this notebook can be found [here](https://nbviewer.jupyter.org/github/eddwebster/football_analytics/blob/master/notebooks/2_data_parsing/StatsBomb%20Parsing%20and%20Data%20Engineering.ipynb).

___

## <a id='notebook_contents'>Notebook Contents</a>
1.    [Notebook Dependencies](#section1)<br>
2.    [Project Brief](#section2)<br>
3.    [Data Sources](#section3)<br>
      1.    [Introduction](#section3.1)<br>
      2.    [Download the Data](#section3.2)<br>
      3.    [Read in the Datasets](#section3.3)<br>
      4.    [Join the Datasets](#section3.4)<br>
      5.    [Initial Data Handling](#section3.5)<br>
4.    [Data Engineering](#section4)<br>
      1.    [Assign Raw DataFrame to Engineered DataFrame](#section4.1)<br>
      2.    [Sort the DataFrame](#section4.2)<br>
      3.    [Create Engineered Attributes](#section4.3)<br>
      4.    [Subset DataFrame](#section4.4)<br>
5.    [Export DataFrame](#section5)<br>
6.    [Summary](#section6)<br>
7.    [Next Steps](#section7)<br>
8.    [Bibliography](#section8)<br>

___

<a id='section1'></a>

## <a id='#section1'>1. Notebook Dependencies</a>

This notebook was written using [Python 3](https://docs.python.org/3.7/) and requires the following libraries:
*    [`Jupyter notebooks`](https://jupyter.org/) for this notebook environment with which this project is presented;
*    [`NumPy`](http://www.numpy.org/) for multidimensional array computing;
*    [`pandas`](http://pandas.pydata.org/) for data analysis and manipulation; and
*    `tqdm` for a clean progress bar.

All packages used for this notebook except for BeautifulSoup can be obtained by downloading and installing the [Conda](https://anaconda.org/anaconda/conda) distribution, available on all platforms (Windows, Linux and Mac OSX). Step-by-step guides on how to install Anaconda can be found for Windows [here](https://medium.com/@GalarnykMichael/install-python-on-windows-anaconda-c63c7c3d1444) and Mac [here](https://medium.com/@GalarnykMichael/install-python-on-mac-anaconda-ccd9f2014072), as well as in the Anaconda documentation itself [here](https://docs.anaconda.com/anaconda/install/).

### Import Libraries and Modules

In [3]:
%load_ext autoreload
%autoreload 2

# Python ≥3.6 (f-strings are used in the custom functions below)
import platform
import sys, getopt
assert sys.version_info >= (3, 6)
import csv

# Import Dependencies
%matplotlib inline

# Math Operations
import numpy as np
from math import pi

# Datetime
import datetime
from datetime import date
import time

# Data Preprocessing
import pandas as pd    # version 1.0.3
import os    #  used to read the csv filenames
import re
import random
from io import BytesIO
from pathlib import Path

# Reading directories (os is already imported above)
import glob

# Working with JSON
import json
import codecs
from pandas import json_normalize    # public import path since pandas 1.0; pandas.io.json.json_normalize is deprecated

# Football Libraries
from FCPython import createPitch

# Data Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-whitegrid')
import missingno as msno    # visually display missing data

# Progress Bar
from tqdm import tqdm    # a clean progress bar library

# Display in Jupyter
from IPython.display import Image, Video, YouTubeVideo
from IPython.core.display import HTML

# Ignore Warnings
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

print('Setup Complete')

Setup Complete


In [4]:
# Python / module versions used here for reference
print('Python: {}'.format(platform.python_version()))
print('NumPy: {}'.format(np.__version__))
print('pandas: {}'.format(pd.__version__))
print('matplotlib: {}'.format(mpl.__version__))
print('Seaborn: {}'.format(sns.__version__))

Python: 3.7.6
NumPy: 1.18.0
pandas: 1.2.0
matplotlib: 3.3.2
Seaborn: 0.11.1


### Defined Variables

In [5]:
# Define today's date as a DDMMYYYY string
today = datetime.datetime.now().strftime('%d%m%Y')

### Defined Filepaths

In [6]:
# Set up initial paths to subfolders
base_dir = os.path.join('..', '..', )
data_dir = os.path.join(base_dir, 'data')
data_dir_sb = os.path.join(base_dir, 'data', 'sb')
scripts_dir = os.path.join(base_dir, 'scripts')
scripts_dir_sb = os.path.join(base_dir, 'scripts', 'sb')
data_dir_understat = os.path.join(base_dir, 'data', 'understat')
img_dir = os.path.join(base_dir, 'img')
fig_dir = os.path.join(base_dir, 'img', 'fig')
video_dir = os.path.join(base_dir, 'video')

### Custom Functions

In [7]:
# Define custom function to read JSON files that also handles the encoding of special characters e.g. accents in names of players and teams
def read_json_file(filename):
    with open(filename, 'rb') as json_file:
        return BytesIO(json_file.read()).getvalue().decode('unicode_escape')
    
# Define custom function to flatten pandas DataFrames with nested JSON columns. Source: https://stackoverflow.com/questions/39899005/how-to-flatten-a-pandas-dataframe-with-some-columns-as-json
def flatten_nested_json_df(df):

    df = df.reset_index()

    print(f"original shape: {df.shape}")
    print(f"original columns: {df.columns}")


    # search for columns to explode/flatten
    s = (df.applymap(type) == list).all()
    list_columns = s[s].index.tolist()

    s = (df.applymap(type) == dict).all()
    dict_columns = s[s].index.tolist()

    print(f"lists: {list_columns}, dicts: {dict_columns}")
    while len(list_columns) > 0 or len(dict_columns) > 0:
        new_columns = []

        for col in dict_columns:
            print(f"flattening: {col}")
            # explode dictionaries horizontally, adding new columns
            horiz_exploded = pd.json_normalize(df[col]).add_prefix(f'{col}.')
            horiz_exploded.index = df.index
            df = pd.concat([df, horiz_exploded], axis=1).drop(columns=[col])
            new_columns.extend(horiz_exploded.columns) # inplace

        for col in list_columns:
            print(f"exploding: {col}")
            # explode lists vertically, adding new columns
            df = df.drop(columns=[col]).join(df[col].explode().to_frame())
            new_columns.append(col)

        # check if there are still dict or list fields to flatten
        s = (df[new_columns].applymap(type) == list).all()
        list_columns = s[s].index.tolist()

        s = (df[new_columns].applymap(type) == dict).all()
        dict_columns = s[s].index.tolist()

        print(f"lists: {list_columns}, dicts: {dict_columns}")

    print(f"final shape: {df.shape}")
    print(f"final columns: {df.columns}")
    return df
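
The function above alternates between two operations: expanding dict columns horizontally and exploding list columns vertically. A minimal sketch of those two steps on a toy DataFrame follows; the column names and values (`team`, `tags`) are hypothetical and are not taken from the StatsBomb data.

```python
import pandas as pd

# Toy records with one dict column and one list column (hypothetical values)
df_toy = pd.DataFrame({
    'id': [1, 2],
    'team': [{'id': 10, 'name': 'A'}, {'id': 20, 'name': 'B'}],
    'tags': [['x', 'y'], ['z']],
})

# Dict column: expand horizontally into prefixed columns
df_team = pd.json_normalize(df_toy['team'].tolist()).add_prefix('team.')
df_team.index = df_toy.index
df_flat = pd.concat([df_toy.drop(columns=['team']), df_team], axis=1)

# List column: expand vertically, one row per list element
df_flat = df_flat.explode('tags').reset_index(drop=True)

print(df_flat)
```

Exploding repeats the other columns for every list element, so row counts grow with the length of the lists.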

### Notebook Settings

In [8]:
pd.set_option('display.max_columns', None)

---

<a id='section2'></a>

## <a id='#section2'>2. Project Brief</a>
This Jupyter notebook explores how to parse publicly available Event data from [StatsBomb](https://statsbomb.com/) using [pandas](http://pandas.pydata.org/) for data manipulation through DataFrames.


The combined event data produced in this notebook is exported to CSV. This data can be further analysed in Python, joined to other datasets, or explored using Tableau, PowerBI, or Microsoft Excel.


**Notebook Conventions**:<br>
*    Variables that refer to a `DataFrame` object are prefixed with `df_`.
*    Variables that refer to a collection of `DataFrame` objects (e.g., a list, a set or a dict) are prefixed with `dfs_`.

---

## <a id='#section3'>3. Data Sources</a>

### <a id='#section3.1'>3.1. Introduction</a>

#### <a id='#section3.1.1'>3.1.1. About StatsBomb</a>
[StatsBomb](https://statsbomb.com/) are a football analytics and data company.

![title](../../img/logos/stats-bomb-logo.png)

Before conducting our EDA, the data needs to be imported as a DataFrame in the Data Sources section [Section 3](#section3) and Cleaned in the Data Engineering section [Section 4](#section4).

We'll be using the [pandas](http://pandas.pydata.org/) library to import our data to this workbook as a DataFrame.

#### <a id='#section3.1.2'>3.1.2. About the StatsBomb publicly available data</a>
The complete data set contains:
- 7 competitions;
- 879 matches;
- 3,161,917 events; and
- z players.

The datasets we will be using are:
- competitions;
- matches;
- events;
- lineups; and
- tactics.

The data needs to be imported as a DataFrame in the Data Sources section [Section 3](#section3) and cleaned in the Data Engineering section [Section 4](#section4).

### <a id='#section3.2'>3.2. Read in Data</a>
The following cells read the data, previously parsed from the `JSON` files and saved as a combined `CSV` file, into a `DataFrame` object, with some basic Data Engineering to flatten the data and select only the columns of interest so that the notebook doesn't crash on a standard laptop.

#### <a id='#section3.3.1.'>3.3.1. Competitions</a>

##### Data dictionary

##### Read in data

In [9]:
# Show files in directory
print(glob.glob(os.path.join(data_dir_sb, 'combined', 'raw', 'csv')))

['../../data/sb/combined/raw/csv']


In [None]:
# Read CSV file as a pandas DataFrame
df_sb = pd.read_csv(os.path.join(data_dir_sb, 'combined', 'raw', 'csv', 'combined.csv'))
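
Wide event extracts can be heavy to load and may contain columns whose type pandas has to guess chunk by chunk. A minimal sketch, assuming a small in-memory CSV with hypothetical columns, of restricting the read to the columns of interest with `usecols` and inferring types over the whole file with `low_memory=False`:

```python
import io
import pandas as pd

# Small in-memory CSV standing in for a large extract (hypothetical columns and values)
csv_text = "id,type.name,minute,extra\n1,Pass,0,a\n2,Shot,1,b\n"

# Only the listed columns are parsed; low_memory=False infers dtypes over the whole file
cols_of_interest = ['id', 'type.name', 'minute']
df_small = pd.read_csv(io.StringIO(csv_text), usecols=cols_of_interest, low_memory=False)

print(df_small.dtypes)
```

Whether this is needed for `combined.csv` depends on the size and content of that file, which is not shown here.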



### <a id='#section3.3'>3.3. Initial Data Handling</a>

In [None]:
# Display the first 5 rows of the raw DataFrame, df_sb
df_sb.head()

In [None]:
# Display the last 5 rows of the raw DataFrame, df_sb
df_sb.tail()

In [None]:
# Print the shape of the raw DataFrame, df_sb
print(df_sb.shape)

In [None]:
# Print the column names of the raw DataFrame, df_sb
print(df_sb.columns)

The joined dataset has forty features (columns). Full details of these attributes can be found in the [Data Dictionary](section3.3.1).

In [None]:
# Data types of the features of the raw DataFrame, df_sb
df_sb.dtypes

Full details of these attributes and their data types can be found in the [Data Dictionary](section3.3.1).

In [None]:
# Info for the raw DataFrame, df_sb
df_sb.info()

In [None]:
# Description of the raw DataFrame, df_sb, showing some summary statistics for each numerical column in the DataFrame
df_sb.describe()

In [None]:
# Plot visualisation of the missing values for each feature of the raw DataFrame, df_sb
msno.matrix(df_sb, figsize = (30, 7))

In [None]:
# Counts of missing values
null_value_stats = df_sb.isnull().sum(axis=0)
null_value_stats[null_value_stats != 0]
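
The counts above can also be expressed as a share of rows, which is easier to compare across columns. A minimal sketch on a toy DataFrame (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame in which one column is only populated for some rows (hypothetical values)
df_toy = pd.DataFrame({
    'type.name': ['Pass', 'Shot', 'Pass', 'Carry'],
    'pass.outcome.name': [np.nan, np.nan, 'Incomplete', np.nan],
})

# Percentage of missing values per column, keeping only columns with any missing values
pct_missing = df_toy.isnull().mean().mul(100).round(1)
pct_missing = pct_missing[pct_missing > 0].sort_values(ascending=False)

print(pct_missing)
```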

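The visualisation and the counts show which features contain missing values. Event-specific attributes such as `pass.outcome.name`, which is filtered on its null values later in this notebook, are expected to be empty for many rows.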

---

## <a id='#section4'>4. Data Engineering</a>
Before conducting an [Exploratory Data Analysis (EDA)](#section5) of the data, we'll first need to clean and wrangle the datasets into a form that meets our needs.

### <a id='#section4.1'>4.1. Sort DataFrame</a>
Sort the data by `matchId`, `matchPeriod`, and `eventSec`. The ordering matters when determining previous events, which are attributes created for the DataFrame in the Data Engineering notebook.

In [None]:
# Sort data by matchId, matchPeriod, and eventSec
df_sb = df_sb.sort_values(['matchId', 'matchPeriod', 'eventSec'])
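
To illustrate why the ordering matters, here is a minimal sketch of deriving a "previous event" attribute once the rows are sorted. The toy column names `match_id` and `index` are assumptions made for the example and may not match the columns in `df_sb`.

```python
import pandas as pd

# Toy events in shuffled order (hypothetical column names and values)
df_toy = pd.DataFrame({
    'match_id': [2, 1, 1, 2],
    'index': [1, 2, 1, 2],
    'type.name': ['Pass', 'Shot', 'Pass', 'Carry'],
})

# Sort within each match, then look one row back without crossing match boundaries
df_toy = df_toy.sort_values(['match_id', 'index']).reset_index(drop=True)
df_toy['previous_event'] = df_toy.groupby('match_id')['type.name'].shift(1)

print(df_toy)
```

Without the sort, `shift` would pair each event with whichever row happened to precede it in the file.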

### <a id='#section4.3'>4.3. Create Engineered Attributes</a>

#### <a id='#section4.2.1'>4.3.1. Create `Team` and `Opponent` Attributes</a>

In [None]:
df_sb['Team'] = np.where(df_sb['team.name'] == df_sb['home_team.home_team_name'], df_sb['home_team.home_team_name'], df_sb['away_team.away_team_name'])
df_sb['Opponent'] = np.where(df_sb['team.name'] == df_sb['away_team.away_team_name'], df_sb['home_team.home_team_name'], df_sb['away_team.away_team_name'])
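
A minimal sketch of the same `np.where` logic on a toy DataFrame, to show what the `Opponent` attribute holds for each row. The shortened column names `home` and `away` are hypothetical stand-ins for the home and away team name columns.

```python
import numpy as np
import pandas as pd

# One match, two events, one by each team (hypothetical values)
df_toy = pd.DataFrame({
    'team.name': ['A', 'B'],
    'home': ['A', 'A'],
    'away': ['B', 'B'],
})

# If the acting team is the away team, the opponent is the home team, and vice versa
df_toy['Opponent'] = np.where(df_toy['team.name'] == df_toy['away'], df_toy['home'], df_toy['away'])

print(df_toy)
```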

#### <a id='#section4.2.2'>4.3.2. Create `Full_Fixture_Date` Attribute</a>

In [None]:
df_sb['Full_Fixture_Date'] = df_sb['match_date'].astype(str) + ' ' + df_sb['home_team.home_team_name'].astype(str)  + ' ' + df_sb['home_score'].astype(str) + ' ' + ' vs. ' + ' ' + df_sb['away_score'].astype(str) + ' ' + df_sb['away_team.away_team_name'].astype(str)

In [None]:
df_sb.head()

### <a id='#section4.4'>4.4. Subset DataFrame</a>
Subset DataFrame into
- In-play Events only
- Lineups - Starting XI
- Tactical Changes
- Halves

In [None]:
# List unique values in the df_sb['type.name'] column
df_sb['type.name'].unique()

#### <a id='#section4.4.1'>4.4.1. Isolate In-Play Events</a>
Create a DataFrame containing only players' actions, i.e. with line-up rows, half markers, etc. removed.

##### <a id='#section4.4.1.1'>4.4.1.1. Remove Non-Event rows</a>

In [None]:
lst_events = ['Pass', 'Ball Receipt*', 'Carry', 'Duel', 'Miscontrol', 'Pressure', 'Ball Recovery', 'Dribbled Past', 'Dribble', 'Shot', 'Block', 'Goal Keeper', 'Clearance', 'Dispossessed', 'Foul Committed', 'Foul Won', 'Interception', 'Shield', 'Half End', 'Substitution', 'Tactical Shift', 'Injury Stoppage', 'Player Off', 'Player On', 'Offside', 'Referee Ball-Drop', 'Error']

In [None]:
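# Filter to the listed event types; .copy() avoids SettingWithCopyWarning when new columns are added below
df_sb_events = df_sb[df_sb['type.name'].isin(lst_events)].copy()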

In [None]:
df_sb_events.shape

##### <a id='#section4.4.1.2'>4.4.1.2. Break down all `location` attributes into separate attributes for X, Y (and sometimes Z) coordinates</a>

In [None]:
# Display all location columns
for col in df_sb_events.columns:
    if 'location' in col:
        print(col)

There are the following five 'location' attributes:
- `location`
- `pass.end_location`
- `carry.end_location`
- `shot.end_location`
- `goalkeeper.end_location`

From reviewing the official documentation [[link](https://statsbomb.com/stat-definitions/)], the five attributes have the following dimensionality:
- `location` [x, y]
- `pass.end_location` [x, y]
- `carry.end_location` [x, y]
- `shot.end_location` [x, y, z]
- `goalkeeper.end_location` [x, y]

In [None]:
"""
# CURRENTLY NOT WORKING, NEED TO FIX

# Normalize 'shot.freeze_frame' attribute - see: https://stackoverflow.com/questions/52795561/flattening-nested-json-in-pandas-data-frame

## explode all columns with lists of dicts
df_sb_events_normalize = df_sb_events.apply(lambda x: x.explode()).reset_index(drop=True)

## list of columns with dicts
cols_to_normalize = ['shot.freeze_frame']

## if the keys, which will become column names, overlap with existing column names, add the current column name as a prefix
normalized = list()

for col in cols_to_normalize:
    d = pd.json_normalize(df_sb_events_normalize[col], sep='_')
    d.columns = [f'{col}_{v}' for v in d.columns]
    normalized.append(d.copy())

## combine df with the normalized columns
df_sb_events_normalize = pd.concat([df_sb_events_normalize] + normalized, axis=1).drop(columns=cols_to_normalize)

## display(df_lineup_select_normalize)
df_sb_events_normalize.head(30)
"""

In [None]:
# Break down each location attribute into separate coordinate columns

## Location attributes and the coordinates each one holds
dict_location_cols = {'location': ['x', 'y'],
                      'pass.end_location': ['x', 'y'],
                      'carry.end_location': ['x', 'y'],
                      'shot.end_location': ['x', 'y', 'z'],
                      'goalkeeper.end_location': ['x', 'y']
                     }

for col, axes in dict_location_cols.items():

    ## Cast to string and strip the square brackets (regex=False treats '[' and ']' as literal characters)
    df_sb_events[col] = (df_sb_events[col]
                             .astype(str)
                             .str.replace('[', '', regex=False)
                             .str.replace(']', '', regex=False)
                        )

    ## Split on commas, one column per coordinate; coordinates that are absent (e.g. a shot without z) are left as None
    df_coords = df_sb_events[col].str.split(',', expand=True)
    for i, axis in enumerate(axes):
        df_sb_events[f'{col}_{axis}'] = df_coords[i] if i in df_coords.columns else None


## Display DataFrame
df_sb_events.head(10)

In [None]:
df_sb_events.shape
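
As an alternative to string manipulation, a location stored as text can be parsed into numeric coordinates directly. This is a minimal sketch, assuming the values look like `'[61.0, 40.1]'` after a round trip through CSV; `parse_location` is a hypothetical helper, not part of this notebook or of pandas.

```python
import ast
import numpy as np
import pandas as pd

def parse_location(value, n_dims=3):
    """Parse a string such as '[61.0, 40.1]' into n_dims floats, padding with NaN."""
    coords = [float(v) for v in ast.literal_eval(value)] if isinstance(value, str) else []
    return (coords + [np.nan] * n_dims)[:n_dims]

# Toy series with a 2D location, a missing value, and a 3D location (hypothetical values)
s_location = pd.Series(['[61.0, 40.1]', np.nan, '[120.0, 36.5, 2.1]'])

# One numeric column per coordinate
df_coords = pd.DataFrame(s_location.map(parse_location).tolist(),
                         columns=['x', 'y', 'z'],
                         index=s_location.index)

print(df_coords)
```

The resulting columns are already floats, so no later cast from string is needed.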

##### Export Dataset

In [None]:
# Export 
#df_sb_events.to_csv(data_dir_sb + '/events/engineered/' + '/sb_events_1819_2021_wsl.csv', index=None, header=True)

# Export 
#df_sb_events.to_csv(data_dir + '/export/' + '/sb_wsl_events.csv', index=None, header=True)

##### <a id='#section4.4.1.3'>4.4.1.3. Create Passing Matrix Data</a>
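The following DataFrame is the CSV extract used for Tableau dashboarding. The events are duplicated into two labelled copies, `df1` and `df2`, so that each event has one row for its start location and one row for its pass end location.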

In [None]:
df1 = df_sb_events.copy()

In [None]:
df1['df_name'] = 'df1'

In [None]:
df1.head()

In [None]:
df2 = df_sb_events.copy()

In [None]:
df2['df_name'] = 'df2'

In [None]:
df2.head()

In [None]:
df1.head()

##### Concatenate DataFrames

In [None]:
df_sb_events_passing = pd.concat([df1, df2])

In [None]:
df_sb_events_passing.shape

##### Create `Pass_X` and `Pass_Y` Attributes

In [None]:
df_sb_events_passing['Pass_X'] = np.where(df_sb_events_passing['df_name'] == 'df1', df_sb_events_passing['location_x'], df_sb_events_passing['pass.end_location_x'])
df_sb_events_passing['Pass_Y'] = np.where(df_sb_events_passing['df_name'] == 'df1', df_sb_events_passing['location_y'], df_sb_events_passing['pass.end_location_y'])

In [None]:
df_sb_events_passing.head()
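
A minimal sketch of the duplication pattern on a single toy pass, showing how the two labelled copies yield one row for the start point and one row for the end point. The coordinate values are hypothetical.

```python
import numpy as np
import pandas as pd

# One pass with a start and an end location (hypothetical values)
df_pass = pd.DataFrame({
    'id': ['a'],
    'location_x': [10.0],
    'location_y': [20.0],
    'pass.end_location_x': [30.0],
    'pass.end_location_y': [40.0],
})

# Two labelled copies of the same rows, stacked on top of each other
df_long = pd.concat([df_pass.assign(df_name='df1'), df_pass.assign(df_name='df2')], ignore_index=True)

# The df1 copy takes the start point, the df2 copy takes the end point
is_start = df_long['df_name'] == 'df1'
df_long['Pass_X'] = np.where(is_start, df_long['location_x'], df_long['pass.end_location_x'])
df_long['Pass_Y'] = np.where(is_start, df_long['location_y'], df_long['pass.end_location_y'])

print(df_long[['id', 'df_name', 'Pass_X', 'Pass_Y']])
```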

In [None]:
sorted(df_sb_events_passing.columns)

##### Export Dataset

In [None]:
# Export DataFrame as a CSV file
if not os.path.exists(os.path.join(data_dir_sb, 'export', 'sb_wsl_events_passing_matrix.csv')):
    df_sb_events_passing.to_csv(os.path.join(data_dir_sb, 'export', 'sb_wsl_events_passing_matrix.csv'), index=None, header=True)
else:
    pass
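
`to_csv` does not create missing folders, so the export above relies on the target directory already existing. A minimal sketch, using a temporary directory and a hypothetical file name, of creating the folder before writing:

```python
import os
import tempfile
import pandas as pd

df_toy = pd.DataFrame({'a': [1, 2]})

# Temporary location standing in for the export folder (hypothetical path and file name)
export_dir = os.path.join(tempfile.mkdtemp(), 'export', 'engineered')
export_path = os.path.join(export_dir, 'toy_export.csv')

# Create the folder if needed, then write only when the file is not already there
os.makedirs(export_dir, exist_ok=True)
if not os.path.exists(export_path):
    df_toy.to_csv(export_path, index=False)

print(os.path.exists(export_path))
```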

##### <a id='#section4.4.1.4'>4.4.1.4. Create Passing Network Data</a>

See: https://community.tableau.com/s/question/0D54T00000C6YbE/football-passing-network

In [None]:
df_sb_pass_network = df_sb_events_passing.copy()

In [None]:
df_sb_pass_network = df_sb_pass_network[df_sb_pass_network['type.name'] == 'Pass']

In [None]:
df_sb_pass_network['player_recipient'] = np.where(df_sb_pass_network['df_name'] == 'df1', df_sb_pass_network['player.name'], df_sb_pass_network['pass.recipient.name'])

In [None]:
df_sb_pass_network.head()

In [None]:
sorted(df_sb_pass_network.columns)

In [None]:
df_sb_pass_network.shape

In [None]:
# Select columns of interest

## Define columns
cols = ['df_name',
        'id',
        'index',
        'competition_name',
        'season_name',
        'match_date',
        'kick_off',
        'Full_Fixture_Date',
        'Team',
        'Opponent',
        'home_team.home_team_name',
        'away_team.away_team_name',
        'home_score',
        'away_score',
        'player_recipient',
        'player.name',
        'pass.recipient.name',
        'position.id',
        'position.name',
        'type.name',
        'pass.type.name',
        'pass.outcome.name',
        'location_x',
        'location_y', 
        'pass.end_location_x',
        'pass.end_location_y',
        'Pass_X',
        'Pass_Y'
       ]

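## Select only columns of interest; .copy() avoids SettingWithCopyWarning when new columns are added below
df_sb_pass_network_select = df_sb_pass_network[cols].copy()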

In [None]:
df_sb_pass_network_select['pass.to.from'] = df_sb_pass_network_select['player.name'] + ' - ' + df_sb_pass_network_select['pass.recipient.name']

In [None]:
# List unique values in the df_sb_pass_network_select['pass.outcome.name'] column
df_sb_pass_network_select['pass.outcome.name'].unique()

In [None]:
df_sb_pass_network_select = df_sb_pass_network_select[df_sb_pass_network_select['pass.outcome.name'].isnull()]

In [None]:
df_sb_pass_network_select.shape

In [None]:
df_sb_pass_network_select = df_sb_pass_network_select.sort_values(['season_name', 'match_date', 'kick_off', 'Full_Fixture_Date', 'index', 'id', 'df_name'], ascending=[True, True, True, True, True, True, True])

In [None]:
df_sb_pass_network_select['Pass_X'] = df_sb_pass_network_select['Pass_X'].astype(str).astype(float)
df_sb_pass_network_select['Pass_Y'] = df_sb_pass_network_select['Pass_Y'].astype(str).astype(float)
df_sb_pass_network_select['location_x'] = df_sb_pass_network_select['location_x'].astype(str).astype(float)
df_sb_pass_network_select['location_y'] = df_sb_pass_network_select['location_y'].astype(str).astype(float)
df_sb_pass_network_select['pass.end_location_x'] = df_sb_pass_network_select['pass.end_location_x'].astype(str).astype(float)
df_sb_pass_network_select['pass.end_location_y'] = df_sb_pass_network_select['pass.end_location_y'].astype(str).astype(float)

In [None]:
df_sb_pass_network_select.dtypes

In [None]:
df_sb_pass_network_select.head()

In [None]:
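# Count the passes for each passer and recipient combination in each match

## Group by the match, team, and player attributes and count the rows in each group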
df_sb_pass_network_grouped = (df_sb_pass_network_select
                                  .groupby(['competition_name',
                                            'season_name',
                                            'match_date',
                                            'kick_off',
                                            'Full_Fixture_Date',
                                            'Team',
                                            'Opponent',
                                            'home_team.home_team_name',
                                            'away_team.away_team_name',
                                            'home_score',
                                            'away_score',
                                            'pass.to.from',
                                            'player.name',
                                            'pass.recipient.name',
                                            'player_recipient'
                                           ])
                                  .agg({'pass.to.from': ['count']
                                       })
                             )

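## Drop the top level of the column MultiIndex created by the aggregation
df_sb_pass_network_grouped.columns = df_sb_pass_network_grouped.columns.droplevel(level=0)

## Move the grouping keys from the index back into columns
df_sb_pass_network_grouped = df_sb_pass_network_grouped.reset_index()

## Rename columns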
df_sb_pass_network_grouped.columns = ['competition_name',
                                      'season_name',
                                      'match_date',
                                      'kick_off',
                                      'full_fixture_date',
                                      'team',
                                      'opponent',
                                      'home_team_name',
                                      'away_team_name',
                                      'home_score',
                                      'away_score',
                                      'pass_to_from',
                                      'player_name',
                                      'pass_recipient_name',
                                      'player_recipient',
                                      'count_passes',
                                     ]

##
#df_sb_pass_network_grouped['count_passes'] = df_sb_pass_network_grouped['count_passes'] / 2

##
df_sb_pass_network_grouped = df_sb_pass_network_grouped.sort_values(['season_name', 'match_date', 'kick_off', 'full_fixture_date', 'team', 'opponent', 'pass_to_from'], ascending=[True, True, True, True, True, True, True])

##
df_sb_pass_network_grouped.head()

In [None]:
df_sb_pass_network_grouped.shape
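
The count per passer and recipient pair can also be written with `groupby(...).size()`, which avoids the MultiIndex column handling. A minimal sketch on toy data (hypothetical values), not a replacement for the cell above:

```python
import pandas as pd

# Three completed passes between two players (hypothetical values)
df_toy = pd.DataFrame({
    'player.name': ['A', 'A', 'B'],
    'pass.recipient.name': ['B', 'B', 'A'],
})

# One row per passer and recipient pair, with the number of passes between them
df_counts = (df_toy
                 .groupby(['player.name', 'pass.recipient.name'])
                 .size()
                 .reset_index(name='count_passes')
            )

print(df_counts)
```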

In [None]:
# Select columns of interest

## Define columns
cols = ['Full_Fixture_Date',
        'player.name',
        'position.id',
        'position.name',
        'Pass_X',
        'Pass_Y'
       ]

##
df_sb_pass_network_avg_pass = df_sb_pass_network_select[cols]

In [None]:
df_sb_pass_network_avg_pass 

In [None]:
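# Average pass location for each player in each match

## Group by match and player, then take the mean of the X and Y coordinates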
df_sb_pass_network_avg_pass_grouped = (df_sb_pass_network_avg_pass 
                                          .groupby(['Full_Fixture_Date',
                                                    'player.name',
                                                    'position.id',
                                                    'position.name',
                                                   ])
                                          .agg({'Pass_X': ['mean'],
                                                'Pass_Y': ['mean']
                                               })
                                     )

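## Drop the top level of the column MultiIndex created by the aggregation
df_sb_pass_network_avg_pass_grouped.columns = df_sb_pass_network_avg_pass_grouped.columns.droplevel(level=0)

## Move the grouping keys from the index back into columns
df_sb_pass_network_avg_pass_grouped = df_sb_pass_network_avg_pass_grouped.reset_index()

## Rename columns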
df_sb_pass_network_avg_pass_grouped.columns = ['full_fixture_date',
                                               'player_name',
                                               'position_id',
                                               'position_name',
                                               'avg_location_pass_x',
                                               'avg_location_pass_y'
                                     ]

##
df_sb_pass_network_avg_pass_grouped['avg_location_pass_x'] = df_sb_pass_network_avg_pass_grouped['avg_location_pass_x'].round(decimals=1)
df_sb_pass_network_avg_pass_grouped['avg_location_pass_y'] = df_sb_pass_network_avg_pass_grouped['avg_location_pass_y'].round(decimals=1)

##
df_sb_pass_network_avg_pass_grouped = df_sb_pass_network_avg_pass_grouped.sort_values(['full_fixture_date', 'player_name'], ascending=[True, True])

##
df_sb_pass_network_avg_pass_grouped.head()
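
Named aggregation (available since pandas 0.25) gives the output columns their final names in one step. A minimal sketch on toy data with hypothetical values:

```python
import pandas as pd

# Three pass points for two players (hypothetical values)
df_toy = pd.DataFrame({
    'player.name': ['A', 'A', 'B'],
    'Pass_X': [10.0, 30.0, 50.0],
    'Pass_Y': [20.0, 40.0, 60.0],
})

# Each keyword names an output column and maps it to (input column, aggregation)
df_avg = (df_toy
              .groupby('player.name')
              .agg(avg_location_pass_x=('Pass_X', 'mean'),
                   avg_location_pass_y=('Pass_Y', 'mean'))
              .round(1)
              .reset_index()
         )

print(df_avg)
```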

In [None]:
# Join the grouped pass counts to the average pass locations of each player
df_sb_pass_network_final = pd.merge(df_sb_pass_network_grouped, df_sb_pass_network_avg_pass_grouped, left_on=['full_fixture_date', 'player_recipient'], right_on=['full_fixture_date', 'player_name'])

In [None]:
## Rename columns
df_sb_pass_network_final = df_sb_pass_network_final.rename(columns={'player_name_x': 'player_name',
                                                                   #'player_name_x': 'player_name'
                                                                   }
                                                          )

In [None]:
df_sb_pass_network_final.head()

In [None]:
df_sb_pass_network_final.shape
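
Because both sides of the join carry a `player_name` column, pandas appends suffixes to tell them apart. A minimal sketch on toy data (hypothetical values) that sets the suffixes explicitly and uses `validate` to check that each match and player combination appears only once in the averages table:

```python
import pandas as pd

# Pass counts (edges) and average locations (nodes) for one match (hypothetical values)
df_edges = pd.DataFrame({
    'full_fixture_date': ['m1', 'm1'],
    'player_name': ['A', 'A'],
    'player_recipient': ['A', 'B'],
    'count_passes': [3, 3],
})
df_nodes = pd.DataFrame({
    'full_fixture_date': ['m1', 'm1'],
    'player_name': ['A', 'B'],
    'avg_location_pass_x': [40.0, 60.0],
})

# validate='many_to_one' raises if a match and player pair is duplicated on the right-hand side
df_net = pd.merge(df_edges, df_nodes,
                  left_on=['full_fixture_date', 'player_recipient'],
                  right_on=['full_fixture_date', 'player_name'],
                  suffixes=('', '_node'),
                  validate='many_to_one')

print(df_net)
```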

##### Export Dataset

In [None]:
# Export DataFrame as a CSV file
if not os.path.exists(os.path.join(data_dir_sb, 'export', 'engineered', 'sb_events_passing_network.csv')):
    df_sb_pass_network_final.to_csv(os.path.join(data_dir_sb, 'export', 'engineered', 'sb_events_passing_network.csv'), index=None, header=True)
else:
    pass

In [None]:
# Export a second copy of the DataFrame as a CSV file to the general export folder
if not os.path.exists(os.path.join(data_dir, 'export', 'sb_events_passing_network.csv')):
    df_sb_pass_network_final.to_csv(os.path.join(data_dir, 'export', 'sb_events_passing_network.csv'), index=None, header=True)
else:
    pass

##### Export WSL data Dataset

In [None]:
# Export 
#df_sb_pass_network_final.to_csv(data_dir_sb + '/events/engineered/' + '/sb_events_passing_network_1819_2021_wsl.csv', index=None, header=True)

# Export 
#df_sb_pass_network_final.to_csv(data_dir + '/export/' + '/sb_wsl_events_passing_network.csv', index=None, header=True)

#### <a id='#section4.4.2'>4.4.2. Lineups</a>

In [None]:
# List unique values in the df_sb['type.name'] column
df_sb['type.name'].unique()

The starting XI players and formation can be found in the rows where `type.name` is 'Starting XI'.

In [None]:
df_lineup = df_sb[df_sb['type.name'] == 'Starting XI']

In [None]:
df_lineup

In [None]:
# Streamline DataFrame to include just the columns of interest

## Define columns
cols = ['id', 'type.name', 'match_date', 'kick_off', 'Full_Fixture_Date', 'team.id', 'team.name', 'tactics.formation', 'tactics.lineup', 'competition_name', 'season_name', 'home_team.home_team_name', 'away_team.away_team_name', 'Team', 'Opponent', 'home_score', 'away_score']

## Select only columns of interest
df_lineup_select = df_lineup[cols]

In [None]:
df_lineup_select

The extracted lineup data so far still holds each team's players in a single nested attribute. To get the starting XI players, we need to break down the `tactics.lineup` attribute.

In [None]:
# Normalize tactics.lineup - see: https://stackoverflow.com/questions/52795561/flattening-nested-json-in-pandas-data-frame

## explode all columns with lists of dicts
df_lineup_select_normalize = df_lineup_select.apply(lambda x: x.explode()).reset_index(drop=True)

## list of columns with dicts
cols_to_normalize = ['tactics.lineup']

## if the keys, which will become column names, overlap with existing column names, add the current column name as a prefix
normalized = list()

for col in cols_to_normalize:
    d = pd.json_normalize(df_lineup_select_normalize[col], sep='_')
    d.columns = [f'{col}_{v}' for v in d.columns]
    normalized.append(d.copy())

## combine df with the normalized columns
df_lineup_select_normalize = pd.concat([df_lineup_select_normalize] + normalized, axis=1).drop(columns=cols_to_normalize)

## display(df_lineup_select_normalize)
df_lineup_select_normalize.head(30)
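
If the nested attribute has been through a CSV round trip, it may arrive as a string rather than a list of dicts, in which case `explode` has nothing to expand. A minimal sketch, assuming the string is a Python-style literal, of parsing it first with `ast.literal_eval`; the toy lineup below is hypothetical.

```python
import ast
import pandas as pd

# One Starting XI row whose lineup was saved to CSV as text (hypothetical values)
df_toy = pd.DataFrame({
    'team.name': ['A'],
    'tactics.lineup': ["[{'player': {'id': 1, 'name': 'P1'}, 'jersey_number': 9}, "
                       "{'player': {'id': 2, 'name': 'P2'}, 'jersey_number': 4}]"],
})

# Parse the text back into a list of dicts, then create one row per player
df_toy['tactics.lineup'] = df_toy['tactics.lineup'].map(ast.literal_eval)
df_long = df_toy.explode('tactics.lineup').reset_index(drop=True)

# Flatten each player dict into prefixed columns and attach them to the rows
df_players = pd.json_normalize(df_long['tactics.lineup'].tolist(), sep='_').add_prefix('tactics.lineup_')
df_long = pd.concat([df_long.drop(columns=['tactics.lineup']), df_players], axis=1)

print(df_long)
```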

In [None]:
df_lineup_engineered = df_lineup_select_normalize

In [None]:
# Streamline DataFrame to include just the columns of interest

## Define columns
cols = ['id', 'match_date', 'kick_off', 'Full_Fixture_Date', 'type.name', 'season_name', 'competition_name', 'home_team.home_team_name', 'away_team.away_team_name', 'Team', 'Opponent', 'home_score', 'away_score', 'tactics.formation', 'tactics.lineup_jersey_number', 'tactics.lineup_position_id', 'tactics.lineup_player_name', 'tactics.lineup_position_name']

## Select only columns of interest; .copy() avoids SettingWithCopyWarning when the dtypes are changed below
df_lineup_engineered_select = df_lineup_engineered[cols].copy()

In [None]:
df_lineup_engineered_select['tactics.formation'] = df_lineup_engineered_select['tactics.formation'].astype('Int64')
df_lineup_engineered_select['tactics.lineup_jersey_number'] = df_lineup_engineered_select['tactics.lineup_jersey_number'].astype('Int64')
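
The nullable `Int64` dtype (capital "I") is used here because a plain `int64` column cannot hold missing values. A minimal sketch with hypothetical values:

```python
import numpy as np
import pandas as pd

# Formation codes read as floats because one value is missing (hypothetical values)
s_formation = pd.Series([433.0, np.nan, 442.0])

# Nullable integer dtype keeps whole numbers and represents the gap as <NA>
s_formation_int = s_formation.astype('Int64')

print(s_formation_int)
```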

In [None]:
df_lineup_engineered_select.head(5)

In [None]:
df_lineup_engineered_select.columns

In [None]:
## Rename columns
df_lineup_engineered_select = df_lineup_engineered_select.rename(columns={'id': 'Match_Id',
                                                                          'match_date': 'Match_Date',
                                                                          'kick_off': 'Kick_Off',
                                                                          'type.name': 'Type_Name',
                                                                          'season_name': 'Season',
                                                                          'competition_name': 'Competition',
                                                                          'home_team.home_team_name': 'Home_Team',
                                                                          'away_team.away_team_name': 'Away_Team',
                                                                          'home_score': 'Home_Score',
                                                                          'away_score': 'Away_Score',
                                                                          'tactics.formation': 'Formation',
                                                                          'tactics.lineup_jersey_number': 'Shirt_Number',
                                                                          'tactics.lineup_position_id': 'Position_Number',
                                                                          'tactics.lineup_player_name': 'Player_Name',
                                                                          'tactics.lineup_position_name': 'Position_Name'
                                                                         }
                                                                         
                                                                )

## Display DataFrame
df_lineup_engineered_select.head()

In [None]:
# Convert Match_Date from string to datetime64[ns]
df_lineup_engineered_select['Match_Date'] = pd.to_datetime(df_lineup_engineered_select['Match_Date'])

In [None]:
"""
# THIS IS NOT WORKING ATM

# Convert Kick_Off from string to datetime64[ns]
df_lineup_engineered_select['Kick_Off']= pd.to_datetime(df_lineup_engineered_select['Kick_Off'], format='%H:%M', errors='ignore')
df_lineup_engineered_select['Kick_Off'] = df_lineup_engineered_select['Kick_Off'].dt.time
"""
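One possible reason the disabled cell above fails: with `errors='ignore'`, `pd.to_datetime` returns its input unchanged when parsing fails, so the column stays as strings and the `.dt` accessor is unavailable. The sketch below is a minimal alternative, assuming the kick-off strings look like `'20:45:00.000'`. That layout is an assumption and should be checked against the real data.

```python
import pandas as pd

# Hypothetical kick-off strings; the 'HH:MM:SS.fff' layout is an assumption
# to verify against the real data
kick_off = pd.Series(['20:45:00.000', '13:30:00.000', None])

# errors='coerce' turns unparseable or missing values into NaT rather than
# returning the input unchanged
parsed = pd.to_datetime(kick_off, format='%H:%M:%S.%f', errors='coerce')

# Keep only the time-of-day component; rows that failed to parse stay missing
kick_off_time = parsed.dt.time
print(kick_off_time.tolist())
```

If many values come back missing, the format string most likely does not match the data and should be adjusted.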

In [None]:
df_lineup_engineered_select.dtypes

In [None]:
# Put hyphens between numbers in Formation attribute

## Convert Formation attribute from Integer to String
df_lineup_engineered_select['Formation'] = df_lineup_engineered_select['Formation'].astype(str)

## Define custom function to put a hyphen between each character, e.g. '433' -> '4-3-3'. See StackOverflow: https://stackoverflow.com/questions/29382285/python-making-a-function-that-would-add-between-letters
def f(s):
    return '-'.join(s)

## Apply custom function
df_lineup_engineered_select['Formation'] = df_lineup_engineered_select['Formation'].apply(f)
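Because the cell above casts `Formation` to a string first, any row without a formation in the nullable `Int64` column would become the text `'<NA>'` and pick up hyphens as well. The sketch below shows one way to guard against that. `format_formation` is a hypothetical helper name and the sample values are illustrative.

```python
import pandas as pd

def format_formation(value):
    """Return e.g. 433 as '4-3-3'; missing values are passed through as pd.NA."""
    if pd.isna(value):
        return pd.NA
    return '-'.join(str(int(value)))

# Illustrative formations, including a missing one
formation = pd.Series([433, 4231, pd.NA], dtype='Int64')

formatted = formation.apply(format_formation)
print(formatted.tolist())
```

Applied to the integer column directly, this would replace both the string cast and the hyphenation step.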

In [None]:
lst_formation = df_lineup_engineered_select['Formation'].unique().tolist()

In [None]:
lst_formation

##### Add Position Coordinates

In [None]:
df_formations_coords = pd.read_csv(data_dir_sb + '/sb_formation_coordinates.csv')

In [None]:
#df_formations_coords['Id'] = df_formations_coords['Id'].astype('Int8')
#df_formations_coords['Player_Number'] = df_formations_coords['Player_Number'].astype('Int8')

In [None]:
df_lineup_engineered_select = pd.merge(df_lineup_engineered_select, df_formations_coords, how='left', left_on=['Formation', 'Position_Number'], right_on=['Formation', 'Player_Number'])

In [None]:
## Drop lookup columns that are no longer needed
#df_lineup_engineered_select = df_lineup_engineered_select.drop(['Player_Number'], axis=1)
df_lineup_engineered_select = df_lineup_engineered_select.drop(columns=['Id', 'Player_Position'])

In [None]:
df_lineup_engineered_select.head()
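A left join silently leaves the coordinate columns empty wherever a formation and position pair is missing from the lookup file. The sketch below shows one way to surface such rows using the `indicator` and `validate` options of `pd.merge`. The toy DataFrames are made up, and the `x` and `y` coordinate column names are hypothetical, since the real column names come from `sb_formation_coordinates.csv`.

```python
import pandas as pd

# Toy lineup rows; '3-5-2' is deliberately absent from the lookup below
df_lineup_toy = pd.DataFrame({
    'Player_Name': ['Player One', 'Player Two', 'Player Three'],
    'Formation': ['4-4-2', '4-4-2', '3-5-2'],
    'Position_Number': [1, 2, 1],
})

# Toy coordinate lookup; the 'x' and 'y' column names and values are hypothetical
df_coords_toy = pd.DataFrame({
    'Formation': ['4-4-2', '4-4-2'],
    'Player_Number': [1, 2],
    'x': [5, 25],
    'y': [40, 70],
})

# Left join keeps every lineup row; indicator=True records whether a match was
# found, and validate raises if the lookup has duplicate key pairs
df_merged = pd.merge(
    df_lineup_toy,
    df_coords_toy,
    how='left',
    left_on=['Formation', 'Position_Number'],
    right_on=['Formation', 'Player_Number'],
    indicator=True,
    validate='many_to_one',
)

# Rows with no coordinates, e.g. a formation or position missing from the lookup
df_unmatched = df_merged[df_merged['_merge'] == 'left_only']
print(df_unmatched[['Player_Name', 'Formation', 'Position_Number']])
```

An empty `df_unmatched` would suggest that every player received coordinates.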

##### Add Opponent Data to Each Row

In [None]:
# Select columns of interest

## Define columns
cols = ['Match_Date',
        'Competition',
        'Full_Fixture_Date',
        'Team',
        'Formation'
       ]

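## Select only columns of interest
df_lineup_opponent = df_lineup_engineered_select[cols]

## Drop duplicates to leave one row per team per fixture
df_lineup_opponent = df_lineup_opponent.drop_duplicates()

## Display DataFrame
df_lineup_opponent.head()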

In [None]:
# Join DataFrame to itself on 'Match_Date', 'Competition', 'Full_Fixture_Date', and 'Opponent'/'Team', to attach the opponent's formation to each row
df_lineup_engineered_opponent_select = pd.merge(df_lineup_engineered_select, df_lineup_opponent,  how='left', left_on=['Match_Date', 'Competition', 'Full_Fixture_Date', 'Opponent'], right_on = ['Match_Date', 'Competition', 'Full_Fixture_Date', 'Team'])

In [None]:
# Clean Data

## Drop columns
df_lineup_engineered_opponent_select = df_lineup_engineered_opponent_select.drop(columns=['Team_y'])


## Rename columns
df_lineup_engineered_opponent_select = df_lineup_engineered_opponent_select.rename(columns={'Team_x': 'Team',
                                                                                            'Formation_x': 'Formation',
                                                                                            'Formation_y': 'Opponent_Formation'
                                                                                           }
                                                                                      )

## Display DataFrame
df_lineup_engineered_opponent_select.head()
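The self-join above can be easier to follow on a small example. The sketch below uses made-up data with a reduced set of join keys, and renames the opponent columns before merging so that no `_x`/`_y` suffixes need cleaning up afterwards.

```python
import pandas as pd

# Toy data: one row per player, two teams in a single fixture (values are illustrative)
df_toy = pd.DataFrame({
    'Full_Fixture_Date': ['2020-09-05 Team A vs. Team B'] * 4,
    'Team': ['Team A', 'Team A', 'Team B', 'Team B'],
    'Opponent': ['Team B', 'Team B', 'Team A', 'Team A'],
    'Player_Name': ['A1', 'A2', 'B1', 'B2'],
    'Formation': ['4-3-3', '4-3-3', '4-4-2', '4-4-2'],
})

# One row per team per fixture, renamed up front so the merge needs no suffix clean-up
df_opponent = (
    df_toy[['Full_Fixture_Date', 'Team', 'Formation']]
    .drop_duplicates()
    .rename(columns={'Team': 'Opponent', 'Formation': 'Opponent_Formation'})
)

# Each player row picks up the formation of the team it is facing
df_joined = df_toy.merge(
    df_opponent,
    how='left',
    on=['Full_Fixture_Date', 'Opponent'],
    validate='many_to_one',
)
print(df_joined[['Team', 'Formation', 'Opponent', 'Opponent_Formation']])
```

The `validate='many_to_one'` argument raises an error if a team appears with more than one formation in the same fixture, which would otherwise duplicate player rows.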

##### Export DataFrame

In [None]:
# Export 
#df_lineup_engineered_opponent_select.to_csv(data_dir_sb + '/lineups/engineered/sb_lineups_1819_2021_wsl.csv', index=False, header=True)

In [None]:
# Export 
#df_lineup_engineered_opponent_select.to_csv(data_dir + '/export/sb_wsl_lineups.csv', index=False, header=True)

#### <a id='#section4.4.3'>4.4.3. Tactical Shifts</a>

In [None]:
df_tactics = df_sb[df_sb['type.name'] == 'Tactical Shift']

In [None]:
df_tactics

In [None]:
# Select columns of interest

## Define columns
cols = ['id', 'type.name', 'team.id', 'team.name', 'tactics.formation', 'tactics.lineup']

## Select only columns of interest
df_tactics_select = df_tactics[cols]

In [None]:
df_tactics_select

In [None]:
# Normalize tactics.lineup - see: https://stackoverflow.com/questions/52795561/flattening-nested-json-in-pandas-data-frame

## explode all columns with lists of dicts
df_tactics_select_normalize = df_tactics_select.apply(lambda x: x.explode()).reset_index(drop=True)

## list of columns with dicts
cols_to_normalize = ['tactics.lineup']

## if the keys, which will become column names, overlap with existing column names, add the current column name as a prefix
normalized = list()
for col in cols_to_normalize:
    
    d = pd.json_normalize(df_tactics_select_normalize[col], sep='_')
    d.columns = [f'{col}_{v}' for v in d.columns]
    normalized.append(d.copy())

## combine df with the normalized columns
df_tactics_select_normalize = pd.concat([df_tactics_select_normalize] + normalized, axis=1).drop(columns=cols_to_normalize)

## display(df_tactics_select_normalize)
df_tactics_select_normalize.head(10)

#### <a id='#section4.4.4'>4.4.4. Halves</a>

In [None]:
df_half = df_sb[df_sb['type.name'] == 'Half Start']

In [None]:
df_half

---

## <a id='#section5'>5. Export Data</a>
Export the data, ready for further data engineering in the subsequent notebooks.

In [None]:
# Export 
#df_sb.to_csv(data_dir_sb + '/combined/raw/csv/wsl/df_sb_combined_data_wsl.csv', index=False, header=True)

## <a id='#section6'>6. Summary</a>
This notebook engineers previously parsed [StatsBomb](https://statsbomb.com/) data using [pandas](http://pandas.pydata.org/) for data manipulation through DataFrames.

---

## <a id='#section7'>7. Next Steps</a>
The next step is to take the parsed dataset created in this notebook and engineer new features from it, which is carried out in the following [Data Engineering](https://nbviewer.jupyter.org/github/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/StatsBomb%20Data%20Engineering.ipynb) notebook. The data is then ready for use in projects including Expected Goals (xG) models and Tableau visualisations.

## <a id='#section8'>8. References</a>

#### Data
*    [StatsBomb](https://statsbomb.com/) data
*    [StatsBomb](https://github.com/statsbomb/open-data/tree/master/data) open data GitHub repository

---

***Visit my website [EddWebster.com](https://www.eddwebster.com) or my [GitHub Repository](https://github.com/eddwebster) for more projects. If you'd like to get in contact, my Twitter handle is [@eddwebster](http://www.twitter.com/eddwebster) and my email is: edd.j.webster@gmail.com.***

[Back to the top](#top)