# Extracting and storing the StatsBomb data

### Library to open the data - [mplsoccer](https://mplsoccer.readthedocs.io/en/latest/gallery/statsbomb/plot_statsbomb_data.html)

Import the `Sbopen` class from mplsoccer and create a parser instance with `Sbopen()`.



In [1]:
from mplsoccer import Sbopen

parser = Sbopen()

### Opening Competition Data

Use the `competition()` method. It takes no arguments and returns one row per competition and season.

In [2]:
df_competition = parser.competition()
# view the structure of the data
df_competition.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42 entries, 0 to 41
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   competition_id             42 non-null     int64 
 1   season_id                  42 non-null     int64 
 2   country_name               42 non-null     object
 3   competition_name           42 non-null     object
 4   competition_gender         42 non-null     object
 5   competition_youth          42 non-null     bool  
 6   competition_international  42 non-null     bool  
 7   season_name                42 non-null     object
 8   match_updated              42 non-null     object
 9   match_updated_360          41 non-null     object
 10  match_available_360        3 non-null      object
 11  match_available            42 non-null     object
dtypes: bool(2), int64(2), object(8)
memory usage: 3.5+ KB
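
The `competition_id` and `season_id` columns are the values the `match()` method needs. A minimal sketch of looking them up by name, assuming only the column names shown in the output above; `competition_ids` is a hypothetical helper, not part of mplsoccer:

```python
import pandas as pd


def competition_ids(df_competition: pd.DataFrame, competition_name: str) -> pd.DataFrame:
    """Return the id columns for every season of one competition.

    Hypothetical helper: it relies only on the column names printed by
    df_competition.info().
    """
    # Keep the rows whose competition name matches exactly
    mask = df_competition["competition_name"] == competition_name
    columns = ["competition_id", "season_id", "competition_name", "season_name"]
    return df_competition.loc[mask, columns].reset_index(drop=True)
```

Calling `competition_ids(df_competition, "<competition name>")` should return one row per available season, or an empty DataFrame if the name does not match.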


### Opening Match Data

Use the `match()` method. It takes two arguments, `competition_id` and `season_id`, both of which are columns in the competition data.

In [3]:
df_match = parser.match(competition_id = 72, season_id = 30)
# view the structure of the data
df_match.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 52 columns):
 #   Column                           Non-Null Count  Dtype         
---  ------                           --------------  -----         
 0   match_id                         52 non-null     int64         
 1   match_date                       52 non-null     datetime64[ns]
 2   kick_off                         52 non-null     datetime64[ns]
 3   home_score                       52 non-null     int64         
 4   away_score                       52 non-null     int64         
 5   match_status                     52 non-null     object        
 6   match_status_360                 52 non-null     object        
 7   last_updated                     52 non-null     datetime64[ns]
 8   last_updated_360                 52 non-null     datetime64[ns]
 9   match_week                       52 non-null     int64         
 10  competition_id                   52 non-null     int64         
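
The `match_id` column holds the ids that the lineup, event and 360 methods expect. A small sketch of listing them in chronological order, assuming only the `match_id`, `match_date` and `kick_off` columns shown above; `match_ids_by_date` is a hypothetical helper:

```python
import pandas as pd


def match_ids_by_date(df_match: pd.DataFrame) -> list:
    """Return the match ids ordered by match date, then kick-off time.

    Hypothetical helper: it uses only columns visible in df_match.info().
    """
    ordered = df_match.sort_values(["match_date", "kick_off"])
    return ordered["match_id"].tolist()
```

Each id in the returned list can then be passed to the methods in the following sections.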


### Opening Lineup Data

Use the `lineup()` method. It takes one argument, the match id (a value from the `match_id` column of the match data).

In [4]:
df_lineup = parser.lineup(69301)
# view the structure of the data
df_lineup.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41 entries, 0 to 40
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   player_id        41 non-null     int64 
 1   player_name      41 non-null     object
 2   player_nickname  41 non-null     object
 3   jersey_number    41 non-null     int64 
 4   match_id         41 non-null     int64 
 5   team_id          41 non-null     int64 
 6   team_name        41 non-null     object
 7   country_id       41 non-null     int64 
 8   country_name     41 non-null     object
dtypes: int64(5), object(4)
memory usage: 3.0+ KB
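
The lineup table mixes both teams in one DataFrame. A minimal sketch of splitting it into one squad list per team, assuming the `team_name`, `jersey_number` and `player_name` columns shown above; `squad_by_team` is a hypothetical helper:

```python
import pandas as pd


def squad_by_team(df_lineup: pd.DataFrame) -> dict:
    """Map each team name to a list of (jersey_number, player_name) tuples.

    Hypothetical helper: players are sorted by jersey number within a team.
    """
    squads = {}
    ordered = df_lineup.sort_values(["team_name", "jersey_number"])
    # groupby keeps the row order inside each group, so the sort above carries over
    for team, group in ordered.groupby("team_name"):
        squads[team] = list(
            zip(group["jersey_number"].tolist(), group["player_name"].tolist())
        )
    return squads
```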


### Opening Event Data 

Use the `event()` method. It takes one argument, the match id.

> `parser.event(<match_id>)` returns four DataFrames, in this order: `[event_data, related_data, freeze_data, tactics_data]`

1. Event data - Information about the events that took place in the match

In [5]:
df_event = parser.event(69301)[0]
# view the structure of the data
df_event.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3291 entries, 0 to 3290
Data columns (total 74 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              3291 non-null   object 
 1   index                           3291 non-null   int64  
 2   period                          3291 non-null   int64  
 3   timestamp                       3291 non-null   object 
 4   minute                          3291 non-null   int64  
 5   second                          3291 non-null   int64  
 6   possession                      3291 non-null   int64  
 7   duration                        2457 non-null   float64
 8   match_id                        3291 non-null   int64  
 9   type_id                         3291 non-null   int64  
 10  type_name                       3291 non-null   object 
 11  possession_team_id              3291 non-null   int64  
 12  possession_team_name            32

2. Related data - Pairs of events that are related to each other, for example a pass and the pressure applied to it

In [6]:
df_related = parser.event(69301)[1]
# view the structure of the data
df_related.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6272 entries, 0 to 4734
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   match_id           6272 non-null   int64 
 1   id                 6272 non-null   object
 2   index              6272 non-null   int64 
 3   type_name          6272 non-null   object
 4   id_related         6272 non-null   object
 5   index_related      6272 non-null   int64 
 6   type_name_related  6272 non-null   object
dtypes: int64(3), object(4)
memory usage: 392.0+ KB
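
To see which kinds of events are linked to each other, the pairs can be counted by type. A minimal sketch, assuming the `type_name` and `type_name_related` columns shown above; `related_type_counts` is a hypothetical helper:

```python
import pandas as pd


def related_type_counts(df_related: pd.DataFrame) -> pd.Series:
    """Count how often each (type_name, type_name_related) pair occurs.

    Hypothetical helper: the result is indexed by the pair of type names,
    most frequent pair first.
    """
    counts = df_related.groupby(["type_name", "type_name_related"]).size()
    return counts.sort_values(ascending=False)
```

`related_type_counts(df_related).head()` would then list the most common pairings in the match.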


3. Freeze data - Freeze frames with the player positions at the moment of each shot

In [7]:
df_freeze = parser.event(69301)[2]
# view the structure of the data
df_freeze.info()
# df_freeze.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 232 entries, 0 to 231
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   teammate         232 non-null    bool   
 1   match_id         232 non-null    int64  
 2   id               232 non-null    object 
 3   x                232 non-null    float64
 4   y                232 non-null    float64
 5   player_id        232 non-null    int64  
 6   player_name      232 non-null    object 
 7   position_id      232 non-null    int64  
 8   position_name    232 non-null    object 
 9   event_freeze_id  232 non-null    int64  
dtypes: bool(1), float64(2), int64(4), object(3)
memory usage: 16.7+ KB
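
A freeze frame is usually read one shot at a time. The sketch below separates teammates from opponents for a single shot. It assumes that the `id` column holds the id of the shot event (the same value as `id` in the event data) and that `teammate` is a boolean column, as the output above suggests; `split_freeze_frame` is a hypothetical helper:

```python
import pandas as pd


def split_freeze_frame(df_freeze: pd.DataFrame, shot_id: str):
    """Return (teammates, opponents) x/y coordinates for one shot.

    Hypothetical helper; assumes `id` identifies the shot event and
    `teammate` has a bool dtype.
    """
    frame = df_freeze[df_freeze["id"] == shot_id]
    teammates = frame.loc[frame["teammate"], ["x", "y"]]
    opponents = frame.loc[~frame["teammate"], ["x", "y"]]
    return teammates, opponents
```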


4. Tactics data - Information about each player's tactical position (`position_id`, `position_name`) on the pitch

In [8]:
df_tactics = parser.event(69301)[3]
# view the structure of the data
df_tactics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44 entries, 0 to 43
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   jersey_number     44 non-null     int64 
 1   match_id          44 non-null     int64 
 2   id                44 non-null     object
 3   player_id         44 non-null     int64 
 4   player_name       44 non-null     object
 5   position_id       44 non-null     int64 
 6   position_name     44 non-null     object
 7   event_tactics_id  44 non-null     int64 
dtypes: int64(5), object(3)
memory usage: 2.9+ KB
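
The cells above call `parser.event()` once per table, which parses the same match four times. Since the method returns the four tables together, a single call can be unpacked instead. A minimal sketch, assuming the return order described above; `load_event_tables` is a hypothetical helper and `parser` is any object with an `event(match_id)` method, such as the `Sbopen()` instance created earlier:

```python
from typing import Any, Dict


def load_event_tables(parser: Any, match_id: int) -> Dict[str, Any]:
    """Fetch all four event tables for a match with one parser call.

    Hypothetical helper; assumes the order event, related, freeze, tactics.
    """
    df_event, df_related, df_freeze, df_tactics = parser.event(match_id)
    return {
        "event": df_event,
        "related": df_related,
        "freeze": df_freeze,
        "tactics": df_tactics,
    }
```

For example, `tables = load_event_tables(parser, 69301)` followed by `tables["freeze"].info()` should show the same freeze-frame structure as above.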


### Opening 360 Data

Use the `frame()` method. It takes one argument, the match id.

> `parser.frame(<match_id>)` returns two DataFrames, in this order: `[frame_data, visible_data]`

1. Frame data - Player positions captured in each 360 freeze frame

In [9]:
df_frame = parser.frame(3788741)[0]
# view the structure of the data
df_frame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47060 entries, 0 to 47059
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   teammate  47060 non-null  bool   
 1   actor     47060 non-null  bool   
 2   keeper    47060 non-null  bool   
 3   match_id  47060 non-null  int64  
 4   id        47060 non-null  object 
 5   x         47060 non-null  float64
 6   y         47060 non-null  float64
dtypes: bool(3), float64(2), int64(1), object(1)
memory usage: 1.6+ MB


2. Visible data - The area of the pitch that is visible in each 360 freeze frame (`visible_area`)

In [10]:
df_visible = parser.frame(3788741)[1]
# view the structure of the data
df_visible.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3470 entries, 0 to 3469
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   match_id      3470 non-null   int64 
 1   id            3470 non-null   object
 2   visible_area  3470 non-null   object
dtypes: int64(1), object(2)
memory usage: 81.5+ KB
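
Both 360 tables can also be taken from a single `frame()` call. The sketch below unpacks them and counts how many players were recorded in each frame. It assumes the return order described above and that `id` identifies the frame's event; `load_frame_tables` is a hypothetical helper and `parser` is any object with a `frame(match_id)` method:

```python
import pandas as pd


def load_frame_tables(parser, match_id: int):
    """Return (df_frame, df_visible, players_per_event) from one parser call.

    Hypothetical helper; assumes the order frame, visible and that rows
    sharing an `id` belong to the same freeze frame.
    """
    df_frame, df_visible = parser.frame(match_id)
    # Number of player rows recorded for each frame
    players_per_event = df_frame.groupby("id").size().rename("n_players")
    return df_frame, df_visible, players_per_event
```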


Additional: a helper that prints a labelled separator line

In [11]:
def breakline(s):
    """Print the label `s` between dashed rules, with a blank line before and after."""
    print("\n" + "-----------" + s + "-----------" + "\n")

In [12]:
breakline("Competition Data")


-----------Competition Data-----------

