## Building an Expected Goals (xG) Model Using StatsBomb Free Data
A data science project using [StatsBomb](https://statsbomb.com/) event data to build an expected goals model. This is the second notebook in the series. It prepares the shot data that was extracted earlier and stored as a parquet file.

#### By Dahbi El Mehdi


___

<a id='sectioncontents'></a>

## <a id='notebook_contents'>Notebook Contents</a>
1.    [Notebook Dependencies](#section1)<br>
2.    [Notebook Brief](#section2)<br>
3.    [Data Preparation](#section3)<br> 
4.    [Outcome](#section4)<br>

___

<a id='section1'></a>

## <a id='#section1'>1. Notebook Dependencies</a>

This notebook was written using [Python 3](https://docs.python.org/3.7/) and requires the following libraries:
*    [`Jupyter notebooks`](https://jupyter.org/) as the environment in which this project is presented;
*    [`pandas`](http://pandas.pydata.org/) for data manipulation;
*    [`numpy`](https://numpy.org/) for data transformation.


In [1]:
# For data manipulation
import pandas as pd
pd.set_option('display.max_columns', None)

import math 
import numpy as np

<a id='section2'></a>

## <a id='#section2'>2. Notebook Brief</a>
This Jupyter notebook is part of a series of notebooks that extract, save, clean, and prepare a dataset, culminating in basic modeling, to **build an Expected Goals model**.

To prepare our data, we'll follow these steps:
<ul>
    <li>Drop unused columns</li>
    <li>Add columns with feature potential</li>
    <li>Save the dataframe as a parquet file.</li>
</ul>

    
    

In [2]:
# Read the shots dataframe saved by the previous notebook
df = pd.read_parquet('output/shots.parquet')
df.head()

Unnamed: 0,id,index,period,timestamp,minute,second,type,possession,possession_team,play_pattern,team,player,position,location,duration,related_events,match_id,shot_statsbomb_xg,shot_end_location,shot_key_pass_id,shot_technique,shot_body_part,shot_type,shot_outcome,shot_freeze_frame,possession_team_id,player_id,shot_first_time,under_pressure,shot_aerial_won,shot_open_goal,out,shot_one_on_one,shot_saved_to_post,shot_deflected,shot_follows_dribble,shot_saved_off_target,shot_redirect,off_camera,shot_kick_off
0,8975991a-ecaa-47ef-8b98-86e09c242131,183,1,00:04:56.223,4,56,Shot,12,Switzerland,Regular Play,Switzerland,Breel-Donald Embolo,Right Center Forward,"[104.1, 27.2]",0.164009,"[32cfd094-de56-4749-b2fb-a2181cdc6237, cb67971...",3869254,0.061377,"[107.7, 30.2]",40346fdd-9318-4c85-ae53-bd4064d14789,Normal,Left Foot,Open Play,Blocked,"[{'location': [105.8, 49.8], 'player': {'id': ...",773,5545,,,,,,,,,,,,,
1,fead457a-c1a7-42f4-84a6-ab7698e9df5e,750,1,00:16:43.605,16,43,Shot,33,Portugal,From Throw In,Portugal,Gonçalo Matias Ramos,Center Forward,"[111.3, 27.5]",0.527886,[8bc375e4-e957-4e83-b665-ccaea88a3cbf],3869254,0.101021,"[120.0, 36.4, 2.4]",887e559a-77e4-4662-ad3a-96d4df0d7c22,Normal,Left Foot,Open Play,Goal,"[{'location': [85.5, 24.6], 'player': {'id': 3...",780,38803,,,,,,,,,,,,,
2,e616c23c-786b-487b-a35d-af91f8d575ee,858,1,00:21:03.108,21,3,Shot,38,Portugal,From Free Kick,Portugal,Otávio Edmilson da Silva Monteiro,Left Center Midfield,"[101.4, 36.8]",0.734486,[1e8be93a-19d4-4a99-8087-64645f1fb76b],3869254,0.032515,"[118.0, 38.6, 0.8]",336f6b0b-4ec5-4751-91e7-a05719aa52fc,Half Volley,Right Foot,Open Play,Saved,"[{'location': [96.5, 59.6], 'player': {'id': 1...",780,11184,True,,,,,,,,,,,,
3,47cd9da1-eb07-4544-aca9-9fb5f2956055,899,1,00:21:54.174,21,54,Shot,42,Portugal,Regular Play,Portugal,Gonçalo Matias Ramos,Center Forward,"[100.7, 32.2]",0.617896,[5771283a-9d06-4471-8c48-b6fe96d521ea],3869254,0.072024,"[116.8, 38.0, 0.0]",,Normal,Right Foot,Open Play,Saved,"[{'location': [93.5, 47.8], 'player': {'id': 3...",780,38803,True,,,,,,,,,,,,
4,c176824b-e608-4b52-98c4-f328748925b7,1108,1,00:29:38.907,29,38,Shot,57,Switzerland,From Free Kick,Switzerland,Xherdan Shaqiri,Left Center Forward,"[87.2, 46.1]",1.203185,[260c2713-6dc6-4ffe-841d-2175a128c93a],3869254,0.02389,"[118.9, 43.9, 0.3]",,Normal,Left Foot,Free Kick,Saved,"[{'location': [99.6, 36.7], 'player': {'id': 2...",773,3533,,,,,,,,,,,,,


___

<a id='section3'></a>

## <a id='#section3'>3. Data Preparation</a>

In [3]:
# Dropping columns
columns_to_drop= ["id", "period", "timestamp", "minute", "second", "type", "possession",
                  "possession_team", "play_pattern", "team", "player", "position", "duration",
                  "match_id", "shot_key_pass_id", "shot_statsbomb_xg", "possession_team_id",
                  "player_id", "shot_end_location", "related_events"]

df= df.drop(columns= columns_to_drop)
df.head()

Unnamed: 0,index,location,shot_technique,shot_body_part,shot_type,shot_outcome,shot_freeze_frame,shot_first_time,under_pressure,shot_aerial_won,shot_open_goal,out,shot_one_on_one,shot_saved_to_post,shot_deflected,shot_follows_dribble,shot_saved_off_target,shot_redirect,off_camera,shot_kick_off
0,183,"[104.1, 27.2]",Normal,Left Foot,Open Play,Blocked,"[{'location': [105.8, 49.8], 'player': {'id': ...",,,,,,,,,,,,,
1,750,"[111.3, 27.5]",Normal,Left Foot,Open Play,Goal,"[{'location': [85.5, 24.6], 'player': {'id': 3...",,,,,,,,,,,,,
2,858,"[101.4, 36.8]",Half Volley,Right Foot,Open Play,Saved,"[{'location': [96.5, 59.6], 'player': {'id': 1...",True,,,,,,,,,,,,
3,899,"[100.7, 32.2]",Normal,Right Foot,Open Play,Saved,"[{'location': [93.5, 47.8], 'player': {'id': 3...",True,,,,,,,,,,,,
4,1108,"[87.2, 46.1]",Normal,Left Foot,Free Kick,Saved,"[{'location': [99.6, 36.7], 'player': {'id': 2...",,,,,,,,,,,,,


In [4]:
# Find columns where more than 85% of the values are missing
to_drop = [col for col in df.columns if df[col].isnull().mean() > 0.85]

# Drop these columns
df = df.drop(columns=to_drop)
df.head()

Unnamed: 0,index,location,shot_technique,shot_body_part,shot_type,shot_outcome,shot_freeze_frame,shot_first_time,under_pressure
0,183,"[104.1, 27.2]",Normal,Left Foot,Open Play,Blocked,"[{'location': [105.8, 49.8], 'player': {'id': ...",,
1,750,"[111.3, 27.5]",Normal,Left Foot,Open Play,Goal,"[{'location': [85.5, 24.6], 'player': {'id': 3...",,
2,858,"[101.4, 36.8]",Half Volley,Right Foot,Open Play,Saved,"[{'location': [96.5, 59.6], 'player': {'id': 1...",True,
3,899,"[100.7, 32.2]",Normal,Right Foot,Open Play,Saved,"[{'location': [93.5, 47.8], 'player': {'id': 3...",True,
4,1108,"[87.2, 46.1]",Normal,Left Foot,Free Kick,Saved,"[{'location': [99.6, 36.7], 'player': {'id': 2...",,
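As a minimal sketch of how the 85% threshold behaves, the toy frame below (made-up values, not StatsBomb data) has one column that is 90% missing and one that is 70% missing. Only the first crosses the threshold. The names `toy`, `null_share`, and `sparse_cols` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values: one dense column and two sparse flag columns
toy = pd.DataFrame({
    "shot_type": ["Open Play"] * 10,
    "shot_open_goal": [True] + [np.nan] * 9,      # 90% missing
    "under_pressure": [True] * 3 + [np.nan] * 7,  # 70% missing
})

# isnull() gives a boolean frame; its column means are the missing fractions
null_share = toy.isnull().mean()

# Keep only the names of columns above the 85% threshold, then drop them
sparse_cols = null_share[null_share > 0.85].index.tolist()
toy_kept = toy.drop(columns=sparse_cols)

print(null_share)
print("Dropped:", sparse_cols)
```

Inspecting `null_share` before dropping makes it easy to see how close each column sits to the cutoff.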


In [5]:
# Split the location array into X (first element) and Y (second element)
df[['X','Y']] = df['location'].apply(lambda x: pd.Series(x, index=['X', 'Y']))
df.drop(columns=['location'], inplace=True)
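Building a `pd.Series` per row can be slow on large frames. A possible alternative, sketched here on a toy frame and assuming every `location` entry holds exactly two values, is to build the two columns in one step from a list of pairs. The names `toy` and `xy` are illustrative.

```python
import pandas as pd

# Toy frame reusing three shot locations from the preview above
toy = pd.DataFrame({"location": [[104.1, 27.2], [111.3, 27.5], [87.2, 46.1]]})

# Turn the column of [x, y] pairs into a two-column frame in one call,
# keeping the original index so the rows stay aligned
xy = pd.DataFrame(toy["location"].tolist(), index=toy.index, columns=["X", "Y"])

# Attach the new columns and drop the original array column
toy = toy.join(xy).drop(columns=["location"])
print(toy)
```

If some rows carried a third coordinate, the `columns=["X", "Y"]` argument would raise an error, which is a useful early warning rather than a silent truncation.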

In [6]:
# Let's make two more columns with the potential to be features.
# 1. shot_angle
# X2 and Y2 are the squared distances from the shot to the reference points (104, 42) and (104, 34)
df['X2'] = ((df['X']-104)**2 + (df['Y']-42)**2)
df['Y2'] = ((df['X']-104)**2 + (df['Y']-34)**2)
df['shot_angle'] = np.arctan2(df['Y2'] - df['Y'], df['X2'] - df['X'])
# Convert the angle from radians to degrees
df['shot_angle'] = df['shot_angle'].apply(math.degrees)

# 2. distance_to_goal: Euclidean distance from the shot to the reference point (104, 38)
df['distance'] = (((df['X']-104)**2 + (df['Y']-38)**2)**(1/2))

# Drop the intermediate columns
df.drop(columns=["X", "Y", "X2", "Y2"], inplace=True)
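The cell above measures from the reference point (104, 38). For comparison, here is a sketch of a common geometric definition of the two features, under the assumption that the pitch is 120 x 80 with the goal centered at (120, 40) and posts at y = 36 and y = 44, as described in the StatsBomb open data specification. The angle is the one subtended by the two posts as seen from the shot location. The function name `add_shot_geometry` and the constants are hypothetical and not part of this notebook's pipeline.

```python
import numpy as np
import pandas as pd

# Assumed pitch geometry (not taken from the cell above)
GOAL_X, GOAL_Y, GOAL_WIDTH = 120.0, 40.0, 8.0


def add_shot_geometry(shots: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `shots` with `distance` and `shot_angle` (degrees)."""
    dx = GOAL_X - shots["X"]   # distance to the goal line along the pitch
    dy = shots["Y"] - GOAL_Y   # lateral offset from the goal center
    out = shots.copy()
    out["distance"] = np.hypot(dx, dy)
    # Angle between the vectors from the shot to each post:
    # cross product = GOAL_WIDTH * dx, dot product = dx^2 + dy^2 - (GOAL_WIDTH / 2)^2
    out["shot_angle"] = np.degrees(
        np.arctan2(GOAL_WIDTH * dx, dx**2 + dy**2 - (GOAL_WIDTH / 2) ** 2)
    )
    return out


# A shot from (108, 40), straight in front of goal, and one from the preview above
toy = pd.DataFrame({"X": [108.0, 104.1], "Y": [40.0, 27.2]})
result = add_shot_geometry(toy)
print(result)
```

With this definition the angle is never negative and widens as the shooter moves closer to the goal or toward its center, which tends to make it easier to interpret as a model feature.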


In [7]:
print("Unique values of Shot_body_part column: "+ str(df.shot_body_part.unique())+ "\n"+ 
      "Unique values of Shot_outcome column: "+ str(df.shot_outcome.unique()) )


Unique values of Shot_body_part column: ['Left Foot' 'Right Foot' 'Head' 'Other']
Unique values of Shot_outcome column: ['Blocked' 'Goal' 'Saved' 'Off T' 'Wayward' 'Post' 'Saved to Post'
 'Saved Off Target']


- A player can shoot the ball with a foot (right or left), the head, or another body part. It therefore makes sense to group the column values into three categories: Foot, Head, and Other.

- A shot either results in a goal or it does not, so we'll label each shot as either Goal or No_Goal.
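The grouping described in the two points above can be sketched on a toy frame as follows. This version is an illustration only: it uses `np.where` for the outcome so that any label other than `Goal` becomes `No_Goal`, including labels that do not appear in this dataset. The frame `toy` holds made-up rows.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up rows covering each body part
toy = pd.DataFrame({
    "shot_body_part": ["Left Foot", "Right Foot", "Head", "Other"],
    "shot_outcome": ["Blocked", "Goal", "Saved", "Off T"],
})

# Merge both feet into a single "Foot" category
toy["shot_body_part"] = toy["shot_body_part"].replace(
    {"Left Foot": "Foot", "Right Foot": "Foot"}
)

# Binary target: "Goal" stays, every other outcome becomes "No_Goal"
toy["shot_outcome"] = np.where(toy["shot_outcome"] == "Goal", "Goal", "No_Goal")
print(toy)
```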

In [8]:
df["shot_body_part"]= df["shot_body_part"].replace(["Left Foot","Right Foot"], "Foot")

df["shot_outcome"]= df["shot_outcome"].replace(["Blocked", "Saved", "Off T", "Wayward", "Post", "Saved to Post",
                                                "Saved Off Target"], "No_Goal")

___

<a id='section4'></a>

## <a id='#section4'>4. Outcome</a>

In [9]:
df.head()

Unnamed: 0,index,shot_technique,shot_body_part,shot_type,shot_outcome,shot_freeze_frame,shot_first_time,under_pressure,shot_angle,distance
0,183,Normal,Foot,Open Play,No_Goal,"[{'location': [105.8, 49.8], 'player': {'id': ...",,,9.409776,10.800463
1,750,Normal,Foot,Open Play,Goal,"[{'location': [85.5, 24.6], 'player': {'id': 3...",,,24.081105,12.788276
2,858,Half Volley,Foot,Open Play,No_Goal,"[{'location': [96.5, 59.6], 'player': {'id': 1...",True,,-161.819697,2.863564
3,899,Normal,Foot,Open Play,No_Goal,"[{'location': [93.5, 47.8], 'player': {'id': 3...",True,,-70.977326,6.67308
4,1108,Normal,Foot,Free Kick,No_Goal,"[{'location': [99.6, 36.7], 'player': {'id': 2...",,,61.023026,18.650737


In [10]:
# Save the prepared shots dataframe as a parquet file
df.to_parquet('output/prep_shots.parquet')