## Building an Expected Goals (xG) Model Using StatsBomb Free Data
A data science project using [StatsBomb](https://statsbomb.com/) event data to build an expected goals model. This notebook is the first in a series of project notebooks; in it, I extract shot data from StatsBomb's free data with the `statsbombpy` package.

#### By Dahbi El Mehdi


___

<a id='sectioncontents'></a>

## <a id='notebook_contents'>Notebook Contents</a>
1.    [Notebook Dependencies](#section1)<br>
2.    [Notebook Brief](#section2)<br>
3.    [Data Extraction](#section3)<br>
4.    [Extraction Outcome](#section4)<br>

___

<a id='section1'></a>

## <a id='#section1'>1. Notebook Dependencies</a>

This notebook was written in [Python 3](https://docs.python.org/3.7/) and requires the following libraries:
*    [`Jupyter notebooks`](https://jupyter.org/), the environment in which this project is presented;
*    [`pandas`](http://pandas.pydata.org/) for data manipulation;
*    [`statsbombpy`](https://github.com/statsbomb/statsbombpy), a Python package for streaming StatsBomb data into Python.


In [1]:
# Install statsbombpy package to extract event data
!pip install statsbombpy
from statsbombpy import sb

# Suppress warnings to keep the cell outputs readable
import warnings
warnings.filterwarnings('ignore')

# data manipulation
import pandas as pd



<a id='section2'></a>

## <a id='#section2'>2. Notebook Brief</a>
This Jupyter notebook is part of a series that extracts, saves, cleans, and prepares a dataset, culminating in basic modeling to **build an Expected Goals model**.

In this notebook, we'll follow these steps:
<ul>
    <li>Extract the competition ids.</li>
    <li>Build a dataframe with all matches of the selected competitions.</li>
    <li>Use the match ids to retrieve all shots.</li>
    <li>Save the shots dataframe as a parquet file.</li>
</ul>

    
    

___

<a id='section3'></a>

## <a id='#section3'>3. Data Extraction</a>

In [2]:
# Create a dataframe of the competitions available through statsbombpy
competitions_df = sb.competitions()

# Get the available competitions
print('Available competitions:', competitions_df['competition_name'].unique())

Available competitions: ['Champions League' "FA Women's Super League" 'FIFA World Cup'
 'Indian Super league' 'La Liga' 'NWSL' 'Premier League' 'UEFA Euro'
 "UEFA Women's Euro" "Women's World Cup"]


In [3]:
# We want to keep as much data as we can, so we'll extract shot events from all matches
# of every available competition

# Competition ids
comps_ids = competitions_df['competition_id'].unique()

# Season ids
seasons_ids = competitions_df['season_id'].unique()



In [4]:
print(comps_ids)

[  16   37   43 1238   11   49    2   55   53   72]


In [5]:
# Competitions names, id, and seasons to extract

print("Competitions names:",
      competitions_df['competition_name'].unique(),
      '\n',
      "Competitions ids:",
      comps_ids,
      '\n',
      "Seasons ids:",
      seasons_ids)

Competitions names: ['Champions League' "FA Women's Super League" 'FIFA World Cup'
 'Indian Super league' 'La Liga' 'NWSL' 'Premier League' 'UEFA Euro'
 "UEFA Women's Euro" "Women's World Cup"] 
 Competitions ids: [  16   37   43 1238   11   49    2   55   53   72] 
 Seasons ids: [  4   1   2  27  26  25  24  23  22  21  41  39  37  44  76  90  42 106
   3 108  40  38  43  30]
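
Note that the unique ids printed above cannot be combined freely: each competition is only available for certain seasons, so matches have to be requested per row of `competitions_df`. A minimal sketch of listing the valid pairs, using a small hand-made dataframe as a stand-in for `sb.competitions()` (the three rows are illustrative, not real output):

```python
import pandas as pd

# Illustrative stand-in for sb.competitions(); the rows are made up for this example
toy_competitions = pd.DataFrame({
    "competition_id": [43, 43, 72],
    "season_id": [106, 3, 30],
})

# One tuple per (competition_id, season_id) combination that actually occurs
pairs = (
    toy_competitions[["competition_id", "season_id"]]
    .drop_duplicates()
    .itertuples(index=False, name=None)
)
valid_pairs = list(pairs)
print(valid_pairs)
```

Iterating over these pairs avoids requesting combinations that do not exist, which is what the `zip` over the two columns in the next cell achieves.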


In [6]:
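comps_ids = competitions_df['competition_id']
comps_seasons = competitions_df['season_id']

# Seed the list with the 2022 FIFA World Cup (competition 43, season 106)
matches_list = [sb.matches(competition_id=43, season_id=106)]

for comp_id, season_id in zip(comps_ids, comps_seasons):
    try:
        matches_list.append(sb.matches(competition_id=comp_id, season_id=season_id))
    except Exception:
        # Skip competition/season pairs that cannot be retrieved
        continue

# Concatenate once at the end; DataFrame.append was removed in pandas 2.0
matches_dfs = pd.concat(matches_list, ignore_index=True)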
    

In [7]:
matches_dfs= matches_dfs.drop_duplicates()
matches_dfs.head()

Unnamed: 0,match_id,match_date,kick_off,competition,season,home_team,away_team,home_score,away_score,match_status,...,last_updated_360,match_week,competition_stage,stadium,referee,home_managers,away_managers,data_version,shot_fidelity_version,xy_fidelity_version
0,3869254,2022-12-06,21:00:00.000,International - FIFA World Cup,2022,Portugal,Switzerland,6,1,available,...,2023-01-14T15:08:46.172894,4,Round of 16,Lusail Stadium,César Arturo Ramos Palazuelos,Fernando Manuel Fernandes da Costa Santos,Murat Yakin,1.1.0,2,2
1,3869118,2022-12-04,21:00:00.000,International - FIFA World Cup,2022,England,Senegal,3,0,available,...,2022-12-13T21:39:52.223504,4,Round of 16,Al Bayt Stadium,Ivan Arcides Barton Cisneros,Gareth Southgate,Aliou Cissé,1.1.0,2,2
2,3869486,2022-12-10,17:00:00.000,International - FIFA World Cup,2022,Morocco,Portugal,1,0,available,...,2023-01-04T12:36:10.102347,5,Quarter-finals,Al Thumama Stadium,Facundo Tello Figueroa,Hoalid Regragui,Fernando Manuel Fernandes da Costa Santos,1.1.0,2,2
3,3869685,2022-12-18,17:00:00.000,International - FIFA World Cup,2022,Argentina,France,3,3,available,...,2022-12-21T16:02:21.075183,7,Final,Lusail Stadium,Szymon Marciniak,Lionel Sebastián Scaloni,Didier Deschamps,1.1.0,2,2
4,3869684,2022-12-17,17:00:00.000,International - FIFA World Cup,2022,Croatia,Morocco,2,1,available,...,2022-12-18T21:30:47.341680,7,3rd Place Final,Sheikh Khalifa International Stadium,Abdulrahman Ibrahim Al Jassim,Zlatko Dalić,Hoalid Regragui,1.1.0,2,2
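
Because the World Cup request used as a seed is repeated inside the loop, deduplication matters here. A small sketch, on made-up rows, of deduplicating on `match_id` alone and confirming the result; keying on `match_id` is an assumption of this sketch, whereas the cell above compares whole rows:

```python
import pandas as pd

# Made-up rows standing in for matches_dfs; match 3869254 appears twice
toy_matches = pd.DataFrame({
    "match_id": [3869254, 3869118, 3869254],
    "home_team": ["Portugal", "England", "Portugal"],
})

# Keep the first occurrence of each match and renumber the rows
deduped = toy_matches.drop_duplicates(subset="match_id").reset_index(drop=True)

# Each match should now appear exactly once
assert deduped["match_id"].is_unique
print(len(toy_matches), "->", len(deduped))
```

Resetting the index afterwards keeps the row labels contiguous, which makes later label-based lookups easier to reason about.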


In [8]:
matches_dfs

Unnamed: 0,match_id,match_date,kick_off,competition,season,home_team,away_team,home_score,away_score,match_status,...,last_updated_360,match_week,competition_stage,stadium,referee,home_managers,away_managers,data_version,shot_fidelity_version,xy_fidelity_version
0,3869254,2022-12-06,21:00:00.000,International - FIFA World Cup,2022,Portugal,Switzerland,6,1,available,...,2023-01-14T15:08:46.172894,4,Round of 16,Lusail Stadium,César Arturo Ramos Palazuelos,Fernando Manuel Fernandes da Costa Santos,Murat Yakin,1.1.0,2,2
1,3869118,2022-12-04,21:00:00.000,International - FIFA World Cup,2022,England,Senegal,3,0,available,...,2022-12-13T21:39:52.223504,4,Round of 16,Al Bayt Stadium,Ivan Arcides Barton Cisneros,Gareth Southgate,Aliou Cissé,1.1.0,2,2
2,3869486,2022-12-10,17:00:00.000,International - FIFA World Cup,2022,Morocco,Portugal,1,0,available,...,2023-01-04T12:36:10.102347,5,Quarter-finals,Al Thumama Stadium,Facundo Tello Figueroa,Hoalid Regragui,Fernando Manuel Fernandes da Costa Santos,1.1.0,2,2
3,3869685,2022-12-18,17:00:00.000,International - FIFA World Cup,2022,Argentina,France,3,3,available,...,2022-12-21T16:02:21.075183,7,Final,Lusail Stadium,Szymon Marciniak,Lionel Sebastián Scaloni,Didier Deschamps,1.1.0,2,2
4,3869684,2022-12-17,17:00:00.000,International - FIFA World Cup,2022,Croatia,Morocco,2,1,available,...,2022-12-18T21:30:47.341680,7,3rd Place Final,Sheikh Khalifa International Stadium,Abdulrahman Ibrahim Al Jassim,Zlatko Dalić,Hoalid Regragui,1.1.0,2,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1367,69199,2019-06-27,21:00:00.000,International - Women's World Cup,2019,Norway Women's,England Women's,0,3,available,...,2021-06-13T16:17:31.694,5,Regular Season,Stade Océane,Lucila Venegas Montes,Martin Sjögren,Phil Neville,1.1.0,2,2
1368,69258,2019-07-02,21:00:00.000,International - Women's World Cup,2019,England Women's,United States Women's,1,2,available,...,2021-06-13T16:17:31.694,6,Semi-finals,Groupama Stadium,,Phil Neville,Jillian Ellis,1.1.0,2,2
1369,69284,2019-07-03,21:00:00.000,International - Women's World Cup,2019,Netherlands Women's,Sweden Women's,1,0,available,...,2021-06-13T16:17:31.694,6,Semi-finals,Groupama Stadium,Marie-Soleil Beaudoin,Sarina Glotzbach-Wiegman,Peter Gerhardsson,1.1.0,2,2
1370,69205,2019-06-29,15:00:00.000,International - Women's World Cup,2019,Italy Women's,Netherlands Women's,0,2,available,...,2021-06-13T16:17:31.694,5,Regular Season,Stade du Hainaut,Claudia Umpiérrez,Milena Bertolini,Sarina Glotzbach-Wiegman,1.1.0,2,2


In [9]:
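# Series of match ids
matches_ids = matches_dfs["match_id"]

# Collect each match's shots in a list, starting with the first match
shots_list = [sb.events(match_id=matches_ids.iloc[0], split=True, flatten_attrs=True)["shots"]]

for match_id in matches_ids.iloc[1:]:
    shots_list.append(sb.events(match_id=match_id, split=True, flatten_attrs=True)["shots"])
    print('Shots scraped from match with id:' + str(match_id))

# Concatenate once at the end; DataFrame.append was removed in pandas 2.0
all_events = pd.concat(shots_list, ignore_index=True)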


Shots scraped from match with id:3869118
Shots scraped from match with id:3869486
Shots scraped from match with id:3869685
Shots scraped from match with id:3869684
Shots scraped from match with id:3869519
Shots scraped from match with id:3869354
Shots scraped from match with id:3869552
Shots scraped from match with id:3869420
Shots scraped from match with id:3869321
Shots scraped from match with id:3869220
Shots scraped from match with id:3869219
Shots scraped from match with id:3869253
Shots scraped from match with id:3869151
Shots scraped from match with id:3869152
Shots scraped from match with id:3869117
Shots scraped from match with id:3857256
Shots scraped from match with id:3857270
Shots scraped from match with id:3857269
Shots scraped from match with id:3857263
Shots scraped from match with id:3857259
Shots scraped from match with id:3857295
Shots scraped from match with id:3857266
Shots scraped from match with id:3857283
Shots scraped from match with id:3857284
...

Shots scraped from match with id:3775583
Shots scraped from match with id:3775542
Shots scraped from match with id:3775608
Shots scraped from match with id:3775599
Shots scraped from match with id:3775554
Shots scraped from match with id:3775652
Shots scraped from match with id:3764238
Shots scraped from match with id:2275127
Shots scraped from match with id:2275136
Shots scraped from match with id:2275154
Shots scraped from match with id:2275150
Shots scraped from match with id:2275146
Shots scraped from match with id:2275142
Shots scraped from match with id:2275137
Shots scraped from match with id:2275153
Shots scraped from match with id:2275151
Shots scraped from match with id:2275144
Shots scraped from match with id:2275088
Shots scraped from match with id:2275110
Shots scraped from match with id:2275044
Shots scraped from match with id:2275105
Shots scraped from match with id:2275063
Shots scraped from match with id:2275054
Shots scraped from match with id:2275072
...

Shots scraped from match with id:7536
Shots scraped from match with id:7555
Shots scraped from match with id:7546
Shots scraped from match with id:7539
Shots scraped from match with id:7538
Shots scraped from match with id:7576
Shots scraped from match with id:7565
Shots scraped from match with id:7551
Shots scraped from match with id:7550
Shots scraped from match with id:7537
Shots scraped from match with id:7580
Shots scraped from match with id:8650
Shots scraped from match with id:7581
Shots scraped from match with id:7549
Shots scraped from match with id:7529
Shots scraped from match with id:7534
Shots scraped from match with id:7562
Shots scraped from match with id:7571
Shots scraped from match with id:7569
Shots scraped from match with id:7568
Shots scraped from match with id:7530
Shots scraped from match with id:7558
Shots scraped from match with id:7583
Shots scraped from match with id:7547
Shots scraped from match with id:7535
Shots scraped from match with id:7584
...

Shots scraped from match with id:3773547
Shots scraped from match with id:3773415
Shots scraped from match with id:3764440
Shots scraped from match with id:3773689
Shots scraped from match with id:3773477
Shots scraped from match with id:303731
Shots scraped from match with id:303532
Shots scraped from match with id:303516
Shots scraped from match with id:303596
Shots scraped from match with id:303430
Shots scraped from match with id:303725
Shots scraped from match with id:303504
Shots scraped from match with id:303451
Shots scraped from match with id:303664
Shots scraped from match with id:303682
Shots scraped from match with id:303400
Shots scraped from match with id:303634
Shots scraped from match with id:303421
Shots scraped from match with id:303493
Shots scraped from match with id:303680
Shots scraped from match with id:303479
Shots scraped from match with id:303615
Shots scraped from match with id:303696
Shots scraped from match with id:303487
...

Shots scraped from match with id:266871
Shots scraped from match with id:266967
Shots scraped from match with id:266929
Shots scraped from match with id:266770
Shots scraped from match with id:266142
Shots scraped from match with id:267368
Shots scraped from match with id:70301
Shots scraped from match with id:70298
Shots scraped from match with id:70302
Shots scraped from match with id:70291
Shots scraped from match with id:267138
Shots scraped from match with id:267520
Shots scraped from match with id:70289
Shots scraped from match with id:266066
Shots scraped from match with id:267675
Shots scraped from match with id:70288
Shots scraped from match with id:266074
Shots scraped from match with id:267567
Shots scraped from match with id:266201
Shots scraped from match with id:266274
Shots scraped from match with id:70293
Shots scraped from match with id:265918
Shots scraped from match with id:70277
Shots scraped from match with id:70283
Shots scraped from match with id:70306
...

Shots scraped from match with id:69143
Shots scraped from match with id:69181
Shots scraped from match with id:68365
Shots scraped from match with id:69178
Shots scraped from match with id:68364
Shots scraped from match with id:69170
Shots scraped from match with id:68359
Shots scraped from match with id:68356
Shots scraped from match with id:69158
Shots scraped from match with id:69187
Shots scraped from match with id:68363
Shots scraped from match with id:69166
Shots scraped from match with id:68366
Shots scraped from match with id:69148
Shots scraped from match with id:69184
Shots scraped from match with id:69173
Shots scraped from match with id:69146
Shots scraped from match with id:69182
Shots scraped from match with id:68358
Shots scraped from match with id:68361
Shots scraped from match with id:69141
Shots scraped from match with id:69145
Shots scraped from match with id:69176
Shots scraped from match with id:69179
Shots scraped from match with id:68334
...

Shots scraped from match with id:3835342
Shots scraped from match with id:3835337
Shots scraped from match with id:3835338
Shots scraped from match with id:3835330
Shots scraped from match with id:3835329
Shots scraped from match with id:3835322
Shots scraped from match with id:3835332
Shots scraped from match with id:3835327
Shots scraped from match with id:3835326
Shots scraped from match with id:3835341
Shots scraped from match with id:3835340
Shots scraped from match with id:3835339
Shots scraped from match with id:3835336
Shots scraped from match with id:3835334
Shots scraped from match with id:3835328
Shots scraped from match with id:3835333
Shots scraped from match with id:3835321
Shots scraped from match with id:3835319
Shots scraped from match with id:22963
Shots scraped from match with id:68311
Shots scraped from match with id:68357
Shots scraped from match with id:22933
Shots scraped from match with id:22940
Shots scraped from match with id:22943
...
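
A single failed request would interrupt the loop above partway through. One way to make it more tolerant, sketched here with a hypothetical `fetch_shots` callable standing in for `sb.events(...)["shots"]`, is to collect the frames in a list, record the ids that fail, and concatenate once at the end:

```python
import pandas as pd


def collect_shots(match_ids, fetch_shots):
    """Fetch shots for each match id, skipping ids whose request fails.

    `fetch_shots` is a hypothetical callable: match_id -> DataFrame of shots.
    """
    frames, failed = [], []
    for match_id in match_ids:
        try:
            frames.append(fetch_shots(match_id))
        except Exception:
            # Remember the id so the request can be retried later
            failed.append(match_id)
    shots = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    return shots, failed


# Fake fetcher for demonstration only: match 2 simulates a failed request
def fake_fetch(match_id):
    if match_id == 2:
        raise ConnectionError("simulated failure")
    return pd.DataFrame({"match_id": [match_id, match_id], "minute": [10, 55]})


shots, failed = collect_shots([1, 2, 3], fake_fetch)
print(shots.shape, failed)
```

In the notebook, `fetch_shots` could be a small lambda wrapping the `sb.events` call used above; the list of failed ids then shows which matches are missing from the final dataframe.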

___

<a id='section4'></a>

## <a id='#section4'>4. Extraction Outcome</a>
As a result of the extraction process, we now have a dataframe of shots from international and club competitions, covering both men's and women's football.

In [10]:
all_events.head()

Unnamed: 0,id,index,period,timestamp,minute,second,type,possession,possession_team,play_pattern,...,shot_open_goal,out,shot_one_on_one,shot_saved_to_post,shot_deflected,shot_follows_dribble,shot_saved_off_target,shot_redirect,off_camera,shot_kick_off
0,8975991a-ecaa-47ef-8b98-86e09c242131,183,1,00:04:56.223,4,56,Shot,12,Switzerland,Regular Play,...,,,,,,,,,,
1,fead457a-c1a7-42f4-84a6-ab7698e9df5e,750,1,00:16:43.605,16,43,Shot,33,Portugal,From Throw In,...,,,,,,,,,,
2,e616c23c-786b-487b-a35d-af91f8d575ee,858,1,00:21:03.108,21,3,Shot,38,Portugal,From Free Kick,...,,,,,,,,,,
3,47cd9da1-eb07-4544-aca9-9fb5f2956055,899,1,00:21:54.174,21,54,Shot,42,Portugal,Regular Play,...,,,,,,,,,,
4,c176824b-e608-4b52-98c4-f328748925b7,1108,1,00:29:38.907,29,38,Shot,57,Switzerland,From Free Kick,...,,,,,,,,,,


In [16]:
all_events.shape

(33266, 40)

In [14]:
# Save the shots dataframe as a parquet file (the output directory must already exist)
all_events.to_parquet('output/shots.parquet')
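
`to_parquet` generally does not create missing directories, so the `output` folder has to exist before the cell above runs. A minimal sketch of a helper that takes care of this; `save_parquet` is a hypothetical name, and it accepts any object exposing a `to_parquet(path)` method, such as a pandas dataframe:

```python
from pathlib import Path


def save_parquet(frame, path):
    """Create the parent directory if needed, then write `frame` to `path`."""
    target = Path(path)
    # parents=True builds nested folders; exist_ok=True tolerates reruns
    target.parent.mkdir(parents=True, exist_ok=True)
    frame.to_parquet(target)
    return target
```

With the dataframe from this notebook, the call would be `save_parquet(all_events, 'output/shots.parquet')`. Writing parquet from pandas also requires an engine such as `pyarrow` or `fastparquet` to be installed.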