## Libraries

In [1]:
import pandas as pd
import re
import math
import ast
import numpy as np
from statsbombpy import sb

# Suppress warnings from not having a full StatsBomb subscription:
import warnings
warnings.filterwarnings("ignore", message="credentials were not supplied. open data access only")

---

## Import Data

Import the dataset from a CSV file containing all shots in the StatsBomb open data, keeping every column that is not entirely missing.

In [2]:
df = pd.read_csv("statsbomb_open_shots.csv")

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63358 entries, 0 to 63357
Data columns (total 42 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   duration               63358 non-null  float64
 1   id                     63358 non-null  object 
 2   index                  63358 non-null  int64  
 3   location               63358 non-null  object 
 4   match_id               63358 non-null  int64  
 5   minute                 63358 non-null  int64  
 6   off_camera             63 non-null     object 
 7   out                    1194 non-null   object 
 8   period                 63358 non-null  int64  
 9   play_pattern           63358 non-null  object 
 10  player                 63358 non-null  object 
 11  player_id              63358 non-null  float64
 12  position               63358 non-null  object 
 13  possession             63358 non-null  int64  
 14  possession_team        63358 non-null  object 
 15  po

---

## Adding Extra Features

Based on the StatsBomb paper about "*Estimated Player Impact...*", the following features will be added to the data:

- Shooter distance to goal
- GK distance to goal
- Shot angle
- GK present in shot triangle
- Number of players in shot triangle
- Body part will be changed to strong foot and weak foot instead of right and left foot
- A binary goal column, where currently it is in the `shot_outcome` column.

Also, based on some discussions with Oktay, the following will also be added:

- Number of opponents in 1m radius of shooter
- More general position of player (striker, attacking midfielder, other midfielder, defender)

---

### Distance to goal

Before calculating this, we need to convert the `location` column from a string to a list.

In [4]:
df.location = [ast.literal_eval(df.location[row]) for row in range(df.shape[0])]

In [5]:
print("""
X Coordinates:
{}

Y Coordinates:
{}
""".format(
    pd.Series([df.location[row][0] for row in range(df.shape[0])]).describe().round(1),
    pd.Series([df.location[row][1] for row in range(df.shape[0])]).describe().round(1)
))


X Coordinates:
count    63358.0
mean       103.9
std          8.7
min         35.4
25%         97.8
50%        105.2
75%        110.8
max        120.0
dtype: float64

Y Coordinates:
count    63358.0
mean        39.8
std          9.9
min          1.5
25%         32.3
50%         39.8
75%         47.2
max         79.6
dtype: float64



Add distance to goal by measuring Euclidean distance to centre of the goal:

In [6]:
goal_centre = [120,40] # From StatsBomb data spec

df["distance_to_goal"] = [math.dist(goal_centre, [df.location[row][0], df.location[row][1]]) for row in range(df.shape[0])]

In [7]:
df.distance_to_goal.describe().round(1)

count    63358.0
mean        19.0
std          8.6
min          0.4
25%         12.1
50%         18.2
75%         25.0
max         88.8
Name: distance_to_goal, dtype: float64
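
As a quick sanity check on the distance calculation, the sketch below computes the same Euclidean distance in a vectorised way. The `sample_locations` values are hypothetical (not taken from the dataset) and are chosen so that the answers are easy to verify by hand.

```python
import numpy as np

goal_centre = np.array([120, 40])  # from the StatsBomb data spec, as above

# Hypothetical shot locations, not taken from the dataset
sample_locations = np.array([
    [108.0, 40.0],  # 12 units straight out from the goal centre
    [117.0, 44.0],  # 3 units out and 4 units wide: a 3-4-5 triangle
    [120.0, 40.0],  # exactly at the goal centre
])

# Row-wise Euclidean norm of the offset from the goal centre
sample_distances = np.linalg.norm(sample_locations - goal_centre, axis=1)
print(sample_distances)
```

The three distances should come out as 12, 5 and 0 respectively.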

---

### Shot angle

We need three numbers to calculate the shot angle:

- Distance between the goalposts (8 from StatsBomb data spec.) = $X$
- Distance from shooter to left goalpost = $Y$
- Distance from shooter to right goalpost = $Z$

Then, the corresponding formula to find the shot angle ($\theta$) is:

$\theta = \arccos(\frac{Y^2 + Z^2 - X^2}{2YZ})$

Before calculating this, we need to address the issue of shots from the byline, shown below:

In [8]:
df.iloc[[row for row in range(df.shape[0]) if df.location[row][0] == 120]].location

538      [120.0, 44.9]
9276     [120.0, 54.6]
18287    [120.0, 39.6]
25978    [120.0, 42.0]
26998    [120.0, 43.7]
54447    [120.0, 29.1]
56524    [120.0, 37.8]
Name: location, dtype: object

These shots will cause errors when calculating shot angles since they do not create shot triangles but simply a straight line.

Because there are so few of these, we simply remove them:

In [9]:
df = df.iloc[[row for row in range(df.shape[0]) if df.location[row][0] != 120]].reset_index(drop=True)

We then create a function that returns the shot angle as above, converted to degrees for readability:

In [10]:
def calc_shot_angle(row_index):
    X = 8 # From StatsBomb data spec.
    Y = math.dist([120,36], [df.location[row_index][0], df.location[row_index][1]])
    Z = math.dist([120,44], [df.location[row_index][0], df.location[row_index][1]])
    
    theta = np.rad2deg(math.acos((Y**2 + Z**2 - X**2) / (2*Y*Z)))
    
    return theta

In [11]:
df["shot_angle"] = [calc_shot_angle(row) for row in range(df.shape[0])]

In [12]:
df.shot_angle.describe().round(3)

count    63351.000
mean        25.391
std         15.599
min          0.660
25%         15.148
50%         19.856
75%         31.011
max        168.607
Name: shot_angle, dtype: float64
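
To make the law-of-cosines formula concrete, here is a minimal standalone sketch that takes coordinates directly rather than a row index. The name `shot_angle_degrees` and the clamping step are assumptions added for illustration; the clamp guards against rounding pushing the cosine slightly outside $[-1, 1]$. A shot taken exactly from a goalpost would still divide by zero and is not handled here.

```python
import math

def shot_angle_degrees(x, y):
    """Angle (in degrees) subtended by the two goalposts at the point (x, y)."""
    posts = 8.0                               # X: distance between the goalposts
    to_left = math.dist([120, 36], [x, y])    # Y: distance to the left goalpost
    to_right = math.dist([120, 44], [x, y])   # Z: distance to the right goalpost

    cos_theta = (to_left**2 + to_right**2 - posts**2) / (2 * to_left * to_right)
    # Clamp so floating-point error cannot take acos outside its domain
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))

print(round(shot_angle_degrees(108, 40), 2))
```

For a shot 12 units straight out from the goal centre, $Y^2 = Z^2 = 160$, so $\cos\theta = 256/320 = 0.8$ and the angle is about 36.9 degrees.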

---

### Non-shooter features

For these features, we have to use the `shot_freeze_frame` column, which contains information on other players on the pitch at the moment the shot is taken.

In [13]:
df.shot_freeze_frame.isna().value_counts()

shot_freeze_frame
False    63351
Name: count, dtype: int64

We drop any rows that do not contain this information. In the output above, every remaining row has a freeze frame.

In [14]:
df = df.dropna(subset="shot_freeze_frame").reset_index(drop=True)

In [15]:
GK_present_list = []

for frame in df.shot_freeze_frame:
    frame = ast.literal_eval(frame)
    
    GK_present = False
    
    for player in frame:
        if player["position"]["name"] == "Goalkeeper":
            GK_present = True
            
    GK_present_list.append(GK_present)
    
pd.Series(GK_present_list).value_counts()

True     63331
False       20
Name: count, dtype: int64

For whatever reason, 20 events do not have a goalkeeper in the freeze frame, so we drop them.

In [16]:
df = df.iloc[GK_present_list].reset_index(drop=True)

---

### GK distance to goal

This is calculated the same way as shooter distance to goal, so we first extract the GK location from `shot_freeze_frame`.

In [17]:
def get_gk_location(row):
    for player in ast.literal_eval(df.shot_freeze_frame[row]):
        if player["position"]["name"] == "Goalkeeper":
            return player["location"]

In [18]:
df["gk_location"] = [get_gk_location(row) for row in range(df.shape[0])]

In [19]:
df["gk_distance_to_goal"] = [math.dist(goal_centre, [df.gk_location[row][0], df.gk_location[row][1]]) for row in range(df.shape[0])]

In [20]:
df.gk_distance_to_goal.describe().round(1)

count    63331.0
mean         3.6
std          2.6
min          0.0
25%          2.0
50%          3.0
75%          4.2
max        118.0
Name: gk_distance_to_goal, dtype: float64
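
The cells above call `ast.literal_eval` on the same freeze-frame string for every feature. A minimal sketch of parsing once and reusing the result is shown below. The `frame_str` value is a hypothetical example shaped like the entries the code above relies on (`position.name` and `location`), and `find_gk_location` is an illustrative name rather than part of the notebook.

```python
import ast
import math

# Hypothetical freeze-frame string, shaped like the entries used above
frame_str = str([
    {"location": [118.0, 40.5], "position": {"id": 1, "name": "Goalkeeper"}},
    {"location": [110.0, 38.0], "position": {"id": 3, "name": "Right Center Back"}},
])

def find_gk_location(frame):
    """Return the first goalkeeper's [x, y] from a parsed freeze frame, or None."""
    for player in frame:
        if player["position"]["name"] == "Goalkeeper":
            return player["location"]
    return None

frame = ast.literal_eval(frame_str)   # parse once, reuse for every feature
gk_location = find_gk_location(frame)
gk_distance = math.dist([120, 40], gk_location)
print(gk_location, round(gk_distance, 2))
```

Returning `None` for a frame without a goalkeeper also makes the "GK present" check and the location lookup a single pass over the frame.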

---

### GK present in shot triangle

The shot triangle is formed by the goal line and the lines from the shot location to each of the goalposts.

A general function is defined which determines if a given player is in the shot triangle or not. This will be done using a barycentric coordinate system, as described here: http://totologic.blogspot.com/2014/01/accurate-point-in-triangle-test.html.

In [21]:
def player_in_shot_triangle(x1,y1,x,y):
    """
    Checks if a point (x,y), giving a player's location on the pitch, is in the 
    shot triangle defined by the shot location (x1,y1) and the two goalposts.
    """
    
    # left_goalpost:
    x2, y2 = 120, 36
    # right_goalpost:
    x3, y3 = 120, 44

    a = ((y2-y3)*(x-x3) + (x3-x2)*(y-y3)) / ((y2-y3)*(x1-x3) + (x3-x2)*(y1-y3))
    b = ((y3-y1)*(x-x3) + (x1-x3)*(y-y3)) / ((y2-y3)*(x1-x3) + (x3-x2)*(y1-y3))
    c = 1-a-b

    # The point is inside (or on the edge of) the triangle when all three weights lie in [0, 1]
    return (0 <= a <= 1) and (0 <= b <= 1) and (0 <= c <= 1)

In [22]:
df["gk_in_shot_triangle"] = [player_in_shot_triangle(df.location[row][0], df.location[row][1], df.gk_location[row][0], df.gk_location[row][1]) for row in range(df.shape[0])]

In [23]:
df.gk_in_shot_triangle.value_counts()

gk_in_shot_triangle
True     60591
False     2740
Name: count, dtype: int64
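
A small worked example may help to see how the barycentric test behaves. The sketch below restates the same formulas in a standalone function (`in_shot_triangle` is an illustrative name) and adds an explicit guard for a zero denominator, which occurs when the shot is taken from the goal line and the triangle collapses to a line. That guard is an assumption added here, not part of the function above.

```python
def in_shot_triangle(shot, point, left_post=(120, 36), right_post=(120, 44)):
    """Barycentric point-in-triangle test for the triangle shot/left post/right post."""
    (x1, y1), (x2, y2), (x3, y3) = shot, left_post, right_post
    x, y = point

    denom = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    if denom == 0:
        return False  # degenerate triangle: shot location is on the goal line

    a = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / denom  # weight of the shot vertex
    b = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / denom  # weight of the left post
    c = 1 - a - b                                              # weight of the right post
    return all(0 <= w <= 1 for w in (a, b, c))

print(in_shot_triangle((108, 40), (114, 40)))  # halfway between shooter and goal centre
print(in_shot_triangle((108, 40), (114, 30)))  # same depth, but 10 units wide
```

For the first point the weights are $a = 0.5$, $b = 0.25$, $c = 0.25$, so it is inside; for the second, $b = 1.5$, so it is outside.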

---

### Number of players in shot triangle

To calculate this, we just apply the function above to each player in the `shot_freeze_frame` for a shot.

In [24]:
def count_players_in_shot_triangle(x1,y1, freeze_frame):
    current_count = 0
    
    for player in ast.literal_eval(freeze_frame):
        if player_in_shot_triangle(x1,y1, player["location"][0], player["location"][1]):
            current_count += 1
            
    return current_count

In [25]:
df["players_in_shot_triangle"] = [count_players_in_shot_triangle(df.location[row][0], df.location[row][1], df.shot_freeze_frame[row]) for row in range(df.shape[0])]

In [26]:
df.players_in_shot_triangle.value_counts()

players_in_shot_triangle
1     30272
2     19490
3      6805
4      2918
0      1786
5      1217
6       513
7       211
8        77
9        29
10       11
11        2
Name: count, dtype: int64

---

### Number of opponents in 1m radius of shooter

We count only opposition players in the radius but all players in the shot triangle. The reasoning is that even teammates in the shot triangle often disrupt a shot accidentally, because they have so little time to react or are not facing the shooter. A teammate outside the shot triangle will rarely disrupt the shot, whereas opponents will actively try to apply pressure or block the shot even from outside the triangle.

To calculate this number, we simply calculate the Euclidean distance from each player to the shooter, and add them to the count if it is less than or equal to 1.

In [27]:
def count_opponents_in_radius(x1,y1, freeze_frame_list, radius):
    current_count = 0
    
    for player in ast.literal_eval(freeze_frame_list):
        if math.dist([x1,y1], player["location"]) <= radius:
            current_count += 1
            
    return current_count

In [28]:
df["opponents_in_radius"] = [count_opponents_in_radius(df.location[row][0], df.location[row][1], df.shot_freeze_frame[row], 1) for row in range(df.shape[0])]

In [29]:
df.opponents_in_radius.value_counts()

opponents_in_radius
0    55552
1     7036
2      662
3       71
4       10
Name: count, dtype: int64
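
Note that the function above counts every player in the freeze frame who is within the radius. A minimal sketch of an opponent-only variant follows, assuming each freeze-frame entry carries a boolean `teammate` key (as in the StatsBomb open-data shot freeze frames; this key is not used elsewhere in this notebook). The frame below is hypothetical, and the counts above were not produced with this variant.

```python
import math

def count_opponents_within(shot_xy, frame, radius=1.0):
    """Count non-teammates in a parsed freeze frame within `radius` of the shooter."""
    return sum(
        1
        for player in frame
        if not player["teammate"]
        and math.dist(shot_xy, player["location"]) <= radius
    )

# Hypothetical parsed freeze frame
frame = [
    {"location": [108.5, 40.0], "teammate": False},  # opponent, 0.5 away
    {"location": [108.0, 40.8], "teammate": True},   # teammate, about 0.8 away
    {"location": [111.0, 40.0], "teammate": False},  # opponent, 3.0 away
]

n_close = count_opponents_within([108.0, 40.0], frame)
print(n_close)
```

Only the first player is both an opponent and within 1 unit, so the count here is 1.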

---

### General player positions

We classify players into:

- Strikers (ST)
- Attacking midfielders (AM)
- Non-attacking midfielders (M)
- Defenders (D)

We expect players who play as strikers to be chosen largely for their goalscoring ability, while attacking midfielders may act more as facilitators, though still with attacking talent. Other midfielders may be more defensive-minded or focused on keeping possession. Finally, defenders will generally not be focused on scoring goals.

Since players don't always play in the same position, and sometimes have to play in unnatural positions, we assign each player their most common position before converting to the aggregated positions above.

In [30]:
mode_position_lookup = pd.DataFrame({"player": df.player.unique(),
                                     "position": [df[df.player == player].position.mode()[0] for player in df.player.unique()]})

df["mode_position"] = [mode_position_lookup[mode_position_lookup.player == df.player[row]].position.values[0] for row in df.index]

In [31]:
df.mode_position.value_counts()

mode_position
Center Forward               11047
Right Wing                    7894
Left Wing                     7083
Center Attacking Midfield     4969
Left Center Midfield          3479
Left Center Forward           3204
Right Center Midfield         3169
Right Center Forward          2834
Right Center Back             2441
Left Back                     2429
Right Back                    2420
Left Defensive Midfield       2203
Left Center Back              2175
Left Midfield                 1957
Right Defensive Midfield      1880
Right Midfield                1770
Center Defensive Midfield     1402
Right Wing Back                362
Left Wing Back                 310
Center Back                    170
Left Attacking Midfield         79
Right Attacking Midfield        37
Goalkeeper                       6
Center Midfield                  6
Secondary Striker                5
Name: count, dtype: int64
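
The lookup above filters the full DataFrame once per player and once per row. A minimal sketch of the same "most common position per player" idea using `groupby` and `map` is shown below on a small hypothetical DataFrame (`shots` and its values are made up for illustration). As with `mode()[0]` above, ties go to the first mode in sorted order.

```python
import pandas as pd

# Hypothetical data: player A mostly plays Left Wing, player B only Left Back
shots = pd.DataFrame({
    "player": ["A", "A", "A", "B", "B"],
    "position": ["Left Wing", "Center Forward", "Left Wing", "Left Back", "Left Back"],
})

# Most common position per player
mode_by_player = shots.groupby("player")["position"].agg(lambda s: s.mode().iloc[0])

# Broadcast each player's mode back onto their rows
shots["mode_position"] = shots["player"].map(mode_by_player)
print(shots)
```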

We classify the players into general positions as follows:

**Strikers:**
- Center Forward, L/R Center Forward

**Attacking Midfielders:**
- L/R Wing
- L/R/C Attacking Midfield
- Secondary Striker

**Non-Attacking Midfielders:**
- Center Midfield, L/R Center Midfield
- L/R Midfield
- L/R/C Defensive Midfield

**Defenders:**
- Center Back, L/R Center Back
- L/R Back
- L/R Wing Back
- Goalkeeper

In [32]:
def general_position(row):
    if df.mode_position[row] in ["Center Forward","Left Center Forward","Right Center Forward"]:
        return "ST"
    elif df.mode_position[row] in ["Left Wing","Right Wing","Left Attacking Midfield","Right Attacking Midfield","Center Attacking Midfield","Secondary Striker"]:
        return "AM"
    elif df.mode_position[row] in ["Left Center Midfield","Right Center Midfield","Center Midfield","Left Midfield","Right Midfield","Left Defensive Midfield","Right Defensive Midfield","Center Defensive Midfield"]:
        return "M"
    else:
        return "D"

In [33]:
df["general_position"] = [general_position(row) for row in range(df.shape[0])]

In [34]:
df.general_position.value_counts(dropna=False)

general_position
AM    20067
ST    17085
M     15866
D     10313
Name: count, dtype: int64
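
The same classification can be written as a lookup table, which keeps the grouping in one place. This is a sketch rather than the notebook's method: `GENERAL_POSITION` is an illustrative name, and, like the `else` branch above, anything not listed falls through to "D".

```python
import pandas as pd

GENERAL_POSITION = {
    **dict.fromkeys(
        ["Center Forward", "Left Center Forward", "Right Center Forward"], "ST"),
    **dict.fromkeys(
        ["Left Wing", "Right Wing", "Left Attacking Midfield",
         "Right Attacking Midfield", "Center Attacking Midfield",
         "Secondary Striker"], "AM"),
    **dict.fromkeys(
        ["Left Center Midfield", "Right Center Midfield", "Center Midfield",
         "Left Midfield", "Right Midfield", "Left Defensive Midfield",
         "Right Defensive Midfield", "Center Defensive Midfield"], "M"),
}

# A few example positions; unmapped values (defenders, goalkeeper) become "D"
positions = pd.Series(
    ["Left Center Forward", "Right Wing", "Center Midfield", "Left Back", "Goalkeeper"])
general = positions.map(GENERAL_POSITION).fillna("D")
print(general.tolist())
```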

---

### Body part

In the StatsBomb paper, this feature is created by selecting whichever foot a given player plays most passes with. It makes sense to use passes instead of shots because it is more likely with a pass that you are able to choose which foot you use, whereas with a shot you have limited time so likely have to shoot with your less preferred foot more often. However, since we only have data on shots we have to go back to the overall data to obtain this.

**NOTE: This cell takes over an hour to run on my machine (see times below).**

In [42]:
%%time

passes = pd.DataFrame()

for season_index,season in sb.competitions().query("competition_gender == 'male' & season_id != 76").iterrows():
    for match_index,match in sb.matches(season_id=season.season_id, competition_id=season.competition_id).iterrows():
        match_passes = sb.events(match_id=match.match_id).query("(type == 'Pass') & (pass_body_part.isin(['Right Foot','Left Foot']))").loc[:,["player","pass_body_part"]]
        passes = pd.concat([passes, match_passes], ignore_index=True)

CPU times: total: 43min 40s
Wall time: 1h 12min 3s
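
Calling `pd.concat` inside the loop copies the accumulated DataFrame on every iteration. A common alternative is to collect the per-match frames in a list and concatenate once at the end, sketched below. `fake_match_passes` is a hypothetical stand-in for the filtered `sb.events(...)` call so that the example runs offline. Whether this noticeably reduces the runtime here is untested, since much of the time is probably spent fetching the events.

```python
import pandas as pd

def fake_match_passes(match_id):
    """Hypothetical stand-in for the filtered sb.events(...) call above."""
    return pd.DataFrame({
        "player": [f"Player {match_id}", "Shared Player"],
        "pass_body_part": ["Right Foot", "Left Foot"],
    })

# Collect one small frame per match, then concatenate a single time
frames = [fake_match_passes(match_id) for match_id in range(3)]
passes_demo = pd.concat(frames, ignore_index=True)
print(passes_demo.shape)
```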


Create a lookup of players and their preferred foot, based on the foot each player uses most often for passing. We also remove from `df` any player who did not make a pass (so we do not know their preferred foot), as well as players with the same number of passes on each foot, to simplify the analysis (there aren't many).

In [63]:
df = df[df.player.isin(passes.player.unique())].copy()

In [110]:
preferred_foot = pd.DataFrame()

preferred_foot["player"] = df.player.sort_values().unique()

preferred_foot["foot"] = passes[passes.player.isin(preferred_foot.player)].groupby("player").pass_body_part.agg(pd.Series.mode).values

preferred_foot = preferred_foot.query("foot in ['Left Foot', 'Right Foot']").copy()

df = df[df.player.isin(preferred_foot.player)].reset_index(drop=True).copy()
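
The cell above relies on the sorted player names lining up with the `groupby` output, and on tied players producing a mode that is not a single foot label. A sketch that makes both steps explicit is shown below, using per-foot pass counts on hypothetical data (`toy_passes` and the player names are made up).

```python
import pandas as pd

# Hypothetical passes: A prefers left, B is tied, C only used the right foot
toy_passes = pd.DataFrame({
    "player": ["A", "A", "A", "B", "B", "C"],
    "pass_body_part": ["Left Foot", "Left Foot", "Right Foot",
                       "Right Foot", "Left Foot", "Right Foot"],
})

# Passes per player and foot, with both columns guaranteed to exist
counts = pd.crosstab(toy_passes["player"], toy_passes["pass_body_part"])
counts = counts.reindex(columns=["Left Foot", "Right Foot"], fill_value=0)

# Drop tied players explicitly, then take the foot with more passes
no_tie = counts["Left Foot"] != counts["Right Foot"]
preferred = counts[no_tie].idxmax(axis=1)
print(preferred)
```

Because `preferred` is indexed by player name, it can be joined back to the shots by name instead of by position.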

In [111]:
preferred_foot.head()

| | player | foot |
|---|---|---|
| 0 | Aaron Cresswell | Left Foot |
| 1 | Aaron Hunt | Left Foot |
| 2 | Aaron Lennon | Right Foot |
| 3 | Aaron Mooy | Right Foot |
| 4 | Aaron Ramsey | Right Foot |


Check process with some obvious players:

In [112]:
preferred_foot[["Messi" in preferred_foot.player[row] for row in preferred_foot.index]]

| | player | foot |
|---|---|---|
| 2348 | Lionel Andrés Messi Cuccittini | Left Foot |


In [113]:
preferred_foot[["Cristiano Ronaldo" in preferred_foot.player[row] for row in preferred_foot.index]]

| | player | foot |
|---|---|---|
| 737 | Cristiano Ronaldo dos Santos Aveiro | Right Foot |


In [114]:
preferred_foot[["Robben" in preferred_foot.player[row] for row in preferred_foot.index]]

| | player | foot |
|---|---|---|
| 389 | Arjen Robben | Left Foot |


Reassign `shot_body_part` to preferred or other foot, as opposed to right and left foot. Players who didn't make a pass, and thus have no preferred foot in our lookup table, were already removed from `df` above.

In [116]:
def assign_preferred_foot(row_index):
    if df.shot_body_part[row_index] in ["Right Foot", "Left Foot"]:
        if preferred_foot[preferred_foot.player == df.player[row_index]].foot.values[0] == df.shot_body_part[row_index]:
            return "Preferred Foot"
        else:
            return "Other Foot"
    else:
        return df.shot_body_part[row_index]

In [117]:
df["shot_body_part"] = [assign_preferred_foot(row) for row in range(df.shape[0])]

In [118]:
df.shot_body_part.value_counts(dropna=False)

shot_body_part
Preferred Foot    40738
Other Foot        11733
Head              10647
Other               191
Name: count, dtype: int64
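
The relabelling can also be done without a per-row lookup. The sketch below shows one vectorised way on hypothetical data, assuming a Series `preferred` that maps player name to preferred foot (an illustrative structure, not the `preferred_foot` DataFrame used above).

```python
import numpy as np
import pandas as pd

# Hypothetical shots and preferred-foot lookup
shots = pd.DataFrame({
    "player": ["A", "A", "C", "C"],
    "shot_body_part": ["Left Foot", "Right Foot", "Head", "Right Foot"],
})
preferred = pd.Series({"A": "Left Foot", "C": "Right Foot"})

is_foot = shots["shot_body_part"].isin(["Left Foot", "Right Foot"])
uses_preferred = shots["shot_body_part"] == shots["player"].map(preferred)

# Keep non-foot body parts as they are; relabel foot shots by preference
shots["shot_body_part"] = np.where(
    ~is_foot,
    shots["shot_body_part"],
    np.where(uses_preferred, "Preferred Foot", "Other Foot"),
)
print(shots)
```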

---

### Goal indicator

In [119]:
df["goal"] = df.shot_outcome == "Goal"

---

## Write to CSV

In [123]:
df.to_csv("statsbomb_open_shots_extended.csv", index=False)