<a href="https://www.kaggle.com/code/carolinariddick/data-scraping-from-understat-statsbomb?scriptVersionId=274322231" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## _Data Scraping from Understat & Statsbomb_

#### This notebook shows how to scrape football data from Understat and StatsBomb

*Some of the steps include*
1. **Installing and importing libraries**
2. **Accessing different football data providers**
3. **Data manipulation (numpy, pandas, processing, creating new data frames)**
4. **Data cleaning and preprocessing**
5. **Data visualization**
    1. **Team pass map**
    2. **Player pass map**

## Install libraries

In [None]:
!pip install statsbombpy --quiet
!pip install mplsoccer --quiet
!pip install highlight_text --quiet

## Import libraries

In [None]:
from statsbombpy import sb
import pandas as pd
from mplsoccer import Pitch, VerticalPitch
import requests
from bs4 import BeautifulSoup
import json
from highlight_text import ax_text, fig_text
from matplotlib.colors import LinearSegmentedColormap
import matplotlib.pyplot as plt
import matplotlib.patheffects as path_effects
import seaborn as sns

## We can access the data in different ways and from many providers:
### 1. Understat
### 2. StatsBomb

## _Data Scraping from Understat_

In [None]:
baseUrl = 'https://understat.com/match/'
match = input('Please enter the match id: ').strip()  # e.g. 22514
url = baseUrl + match
print(url)
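
Since the id is typed in by hand, it can help to validate it before building the URL. This is a minimal sketch, assuming Understat match ids are purely numeric (as in the 22514 example). `build_match_url` is a hypothetical helper, not part of any library.

```python
BASE_URL = 'https://understat.com/match/'

def build_match_url(match_id, base_url=BASE_URL):
    """Return the match page URL, rejecting ids that are not plain digits."""
    match_id = str(match_id).strip()
    if not match_id.isdigit():
        raise ValueError(f'Expected a numeric match id, got {match_id!r}')
    return base_url + match_id

print(build_match_url(' 22514 '))
```

In the cell above, `url = build_match_url(match)` would replace the plain string concatenation.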

In [None]:
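response = requests.get(url)
response.raise_for_status()  # stop early if the page could not be fetched
soup = BeautifulSoup(response.content,'lxml')
# Collect every <script> tag; the match data is embedded in some of them
scripts = soup.find_all('script')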

# scripts

#### If we print `scripts`, we can see that the shots data is the first dataset to appear, and that it sits in the script tag at index 1.
#### So, to work only with shot data, we use `scripts[1]`.
#### Each `var` keyword in the output introduces a different dataset; to work with another one, use the index of the script tag that contains it.

In [None]:
dataShots = scripts[1].string
# dataShots
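
Relying on a fixed position such as `scripts[1]` breaks if the page adds or reorders script tags. A hedged alternative is to search the script texts for the variable name. The sketch below assumes the shots variable is called `shotsData` (check the printed `scripts` output to confirm). `find_script_with_var` and the sample strings are made up for illustration.

```python
import re

def find_script_with_var(script_texts, var_name):
    """Return the first script text that declares `var <var_name>`, else None."""
    pattern = re.compile(r'\bvar\s+' + re.escape(var_name) + r'\b')
    for text in script_texts:
        if text and pattern.search(text):
            return text
    return None

# Made-up stand-ins for [s.string for s in scripts];
# None mimics a script tag without inline text.
sample_scripts = [
    None,
    "var shotsData = JSON.parse('{}')",
    "var otherData = JSON.parse('{}')",
]
print(find_script_with_var(sample_scripts, 'shotsData'))
```

In the notebook this could be called as `find_script_with_var([s.string for s in scripts], 'shotsData')`.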

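### What is data scraping? How can we access StatsBomb's free database of competitions and matches?


#### *Data scraping, also known as web scraping, is a technique for extracting information from websites. The extracted information can be saved in files or spreadsheets.*


For StatsBomb we call its API, as shown in the StatsBomb section further down. For Understat, the data is embedded in the page itself, so the next cell extracts it from the script text.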

In [None]:
# Let's convert this to JSON
# First strip the surrounding JavaScript so that only the JSON text remains

indStart = dataShots.index("('")+2
indEnd = dataShots.index("')")

jsonData = dataShots[indStart:indEnd]
# Decode escape sequences such as \x7B into the characters they represent
jsonData = jsonData.encode('utf8').decode('unicode_escape')

# Now parse the JSON string into Python objects
data = json.loads(jsonData)
# data
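
The same extraction can be wrapped in a small function so it can be reused for other script variables. This is a minimal sketch, assuming the payload sits inside a `JSON.parse('...')` call with characters written as `\xNN` hex escapes. `extract_json` and the sample string are illustrative, not real Understat output.

```python
import json
import re

def extract_json(script_text):
    """Decode the JSON payload found inside a JSON.parse call."""
    match = re.search(r"JSON\.parse\('(.*?)'\)", script_text, flags=re.DOTALL)
    if match is None:
        raise ValueError('No JSON.parse call found in the script text')
    # Turn escapes such as \x7B back into the characters they stand for
    decoded = match.group(1).encode('utf8').decode('unicode_escape')
    return json.loads(decoded)

# Raw string so the \x escapes stay literal, as they do in the page source
sample = r"var shotsData = JSON.parse('\x7B\x22h\x22\x3A\x5B\x5D\x2C\x22a\x22\x3A\x5B\x5D\x7D')"
print(extract_json(sample))  # {'h': [], 'a': []}
```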

#### For this example, we only need `X` and `Y`, the coordinates that place each shot at the correct spot on the pitch. We also need `xG` (expected goals for each shot) and `h_team` (home team).

#### To collect more fields, add them in the same way as the ones below, for example `shotType` for the type of shot, or `situation` for the phase of play the shot came from.
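
To pull extra fields without adding one list per field, the keys can be passed in as a list. This is a minimal sketch with made-up shot records. The key names (`X`, `Y`, `xG`, `shotType`, `situation`) are assumed to match what the printed `data` shows, and `select_fields` is a hypothetical helper.

```python
def select_fields(shots, fields):
    """Keep only the requested keys from each shot; missing keys become None."""
    return [{field: shot.get(field) for field in fields} for shot in shots]

# Made-up records in the shape of data['h'] (values are assumed to be strings)
sample_shots = [
    {'X': '0.88', 'Y': '0.52', 'xG': '0.31', 'shotType': 'RightFoot', 'situation': 'OpenPlay'},
    {'X': '0.75', 'Y': '0.40', 'xG': '0.04', 'shotType': 'Head', 'situation': 'FromCorner'},
]
rows = select_fields(sample_shots, ['X', 'Y', 'xG', 'shotType'])
print(rows[0])
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)`.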

In [None]:
x = []
y = []
xg = []
team = []

# Data from the away team
data_away = data['a']

# Data from the home team
data_home = data['h']

# Each shot is a dictionary, so read the fields we need directly
for shot in data_home:
    x.append(shot['X'])
    y.append(shot['Y'])
    xg.append(shot['xG'])
    team.append(shot['h_team'])

### Now, we want the processed data to appear with the columns we need (x, y, xG, and h_team) so we can view them in an organised way. We add the following:

In [None]:
# Build the data frame column by column
df = pd.DataFrame({'x': x, 'y': y, 'xg': xg, 'team': team})
df

### What have we achieved with this?

Well, for example, we can now see the xG (expected goals) of each shot taken by the home team, Sassuolo in this example.
This is just an example using shot data, but as explained earlier, you can try it with other types of data as well. Just replace xG with the field that matches the data you need, such as key passes if you are looking at passing data.
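
One thing to watch for is that scraped values often arrive as strings (check with `df.dtypes`), so they need converting before any arithmetic. The sketch below sums xG per team using made-up rows with the same column names as the data frame above. `total_xg` is a hypothetical helper.

```python
from collections import defaultdict

def total_xg(rows):
    """Sum xG per team, converting the string values to floats first."""
    totals = defaultdict(float)
    for row in rows:
        totals[row['team']] += float(row['xg'])
    return dict(totals)

# Made-up rows; team names and values are placeholders
sample_rows = [
    {'team': 'Home FC', 'xg': '0.25'},
    {'team': 'Home FC', 'xg': '0.5'},
    {'team': 'Away FC', 'xg': '0.125'},
]
print(total_xg(sample_rows))
```

With the data frame itself, the equivalent would be `df['xg'].astype(float).sum()`.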

## _Data Scraping from StatsBomb_

In [None]:
# Call the statsbombpy API to get all free competitions
free_comps = sb.competitions()

# Show the first rows of the free competitions table
free_comps.head()

In [None]:
# Call the statsbombpy API to get a list of matches for a given competition
# Europe - UEFA Euro
europe_UEFA = sb.matches(competition_id=55, season_id=282)
europe_UEFA.head(4)

### Call specific Match

In [None]:
game_events_df = sb.events(match_id=3943043)
game_events_df

## Explore dataset 

In [None]:
game_events_df.columns

### Using `game_events_df.columns` we can get a list of all the columns in the dataset. Some of the most useful ones are:

1. type: the type of event (e.g., Pass, Shot)
2. shot_statsbomb_xg: the expected goals value of a shot
3. team: the team that performed the event
4. player: the player who performed the event
5. location: the x and y coordinates of the event
6. pass_assisted_shot_id: the id of the shot that a pass led to, which marks that pass as a "key pass"
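
The `location` column stores each position as an `[x, y]` list, and events without a position hold a missing value instead. This is a minimal sketch of splitting such values into separate coordinates. The sample values are made up and `split_location` is a hypothetical helper.

```python
import math

def split_location(location):
    """Return (x, y) from an [x, y] list, or (None, None) when it is missing."""
    if isinstance(location, (list, tuple)) and len(location) >= 2:
        return location[0], location[1]
    return None, None

# Made-up values: two positioned events and one without a location (NaN)
sample_locations = [[60.0, 40.0], [102.5, 33.1], math.nan]
print([split_location(loc) for loc in sample_locations])
```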


To explore a specific column in more detail, we can use the `unique()` method.
Let's find all the players who appeared in the final with the following command:

In [None]:
# List of all players
game_events_df.player.unique()

### Creating Visualization: Team Pass Map


In [None]:
# Separate start and end locations from coordinates
game_events_df[['x', 'y']] = game_events_df['location'].apply(pd.Series)
game_events_df[['pass_end_x', 'pass_end_y']] = game_events_df['pass_end_location'].apply(pd.Series)
game_events_df[['carry_end_x', 'carry_end_y']] = game_events_df['carry_end_location'].apply(pd.Series)

#create a variable for the team you want to look into
team="Spain"

#filter for only matches that the focus team played in
matches_df = europe_UEFA[(europe_UEFA['home_team'] == team)|(europe_UEFA['away_team'] == team)]

#filter for events done by the focus team
#filter by event type to get only passes
#filter for passes that started outside of the final third
#filter for passes that ended in the final third
#filter for completed passes
passes_df=game_events_df[(game_events_df.team==team)&(game_events_df.type=="Pass")&
            (game_events_df.x<80)&
            (game_events_df.pass_end_x>80)&
            (game_events_df.pass_outcome.isna())]

# Visualize for a team
pass_colour='#e21017'

# Create the pitch
pitch = Pitch(pitch_type='statsbomb', pitch_color='white', line_zorder=2, line_color='black')
fig, ax = pitch.draw(figsize=(16, 11),constrained_layout=True, tight_layout=False)
fig.set_facecolor('white')

# Visualize passes
pitch.arrows(passes_df.x, passes_df.y,
passes_df.pass_end_x, passes_df.pass_end_y, width=3,
headwidth=8, headlength=5, color=pass_colour, ax=ax, zorder=2, label = "Pass")

ax.legend(facecolor='white', handlelength=5, edgecolor='None', fontsize=20, loc='best')

# Set title
ax_title = ax.set_title(f'{team} Progressions into Final 3rd: Euros Final', fontsize=30,color='black')
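
The filter in this cell encodes a simple rule: a pass progresses into the final third when it starts before x = 80 and ends beyond it, on StatsBomb's 120-unit-long pitch. Below is a sketch of that rule as a plain function, using made-up coordinates. `enters_final_third` is a hypothetical helper.

```python
FINAL_THIRD_X = 80  # StatsBomb pitches are 120 long, so the final third starts at 80

def enters_final_third(start_x, end_x, boundary=FINAL_THIRD_X):
    """True when a pass starts outside the final third and ends inside it."""
    return start_x < boundary < end_x

# Made-up (start_x, end_x) pairs
sample_passes = [(55.0, 85.0), (82.0, 100.0), (70.0, 78.0)]
print([enters_final_third(sx, ex) for sx, ex in sample_passes])  # [True, False, False]
```

Writing the rule this way makes the boundary easy to change, for example to look at passes into a different zone of the pitch.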

## Creating Visualization: Player Pass Map


In [None]:
# Visualize for a specific player
player_name="Daniel Olmo Carvajal"

player_passes=game_events_df[(game_events_df.player==player_name)&
            (game_events_df.type=="Pass")&
            (game_events_df.x<80)&
            (game_events_df.pass_end_x>80)&
            (game_events_df.pass_outcome.isna())]

pass_colour='#e21017'

# set up the pitch
pitch = Pitch(pitch_type='statsbomb', pitch_color='white', line_zorder=2, line_color='black')
fig, ax = pitch.draw(figsize=(16, 11),constrained_layout=True, tight_layout=False)
fig.set_facecolor('white')

# plot passes
pitch.arrows(player_passes.x, player_passes.y,
player_passes.pass_end_x, player_passes.pass_end_y, width=3,
headwidth=8, headlength=5, color=pass_colour, ax=ax, zorder=2, label = "Pass")

# plot legend
ax.legend(facecolor='white', handlelength=5, edgecolor='None', fontsize=20, loc='best')

ax_title = ax.set_title(f'{player_name} Progressions into Final 3rd', fontsize=30,color='black')