# **Data Retrieval**

## **Installing and importing packages**

In [None]:
!pip install soccerdata &> /dev/null
!pip install dill &> /dev/null

In [None]:
from soccerdata import FBref
import pandas as pd
import dill

## **Data retrieving phase**

To collect data on soccer players we use *soccerdata*, a collection of scrapers that retrieve soccer-related data from several platforms, such as FBref, Fotmob, and ESPN.
We use the FBref platform because it provides statistics on individual players and lets us filter results by season (the complete *soccerdata* documentation is available at [Soccerdata](https://soccerdata.readthedocs.io/en/latest/index.html)).

FBref provides data on the following leagues:

In [None]:
FBref.available_leagues()

['ARG-Primera División Argentina',
 'AUS-A-League',
 'AUT-Austrian Football Bundesliga',
 'BEL-Challenger Pro League',
 'BEL-Jupiler Pro League',
 'BRA-Serie A Brasil',
 'BUL-Bulgarian First League',
 'Big 5 European Leagues Combined',
 'COL-Primera A Colombia',
 'CRO-Croatian Football League',
 'CZE-Czech First League',
 'DEN-Superliga',
 'ECU-Liga Profesional Ecuador',
 'ENG-Championship',
 'ENG-Premier League',
 'ESP-La Liga',
 'ESP-Segunda División',
 'FRA-Ligue 1',
 'FRA-Ligue 2',
 'GER-2. Fußball-Bundesliga',
 'GER-Bundesliga',
 'GRE-Super League Greece',
 'HUN-Nemzeti Bajnokság I',
 "INT-Women's World Cup",
 'INT-World Cup',
 'ITA-Serie A',
 'ITA-Serie B',
 'JAP-J1 League',
 'KOR-K League 1',
 'KSA-Saudi Professional League',
 'MEX-Liga MX',
 'NED-Eerste Divisie',
 'NED-Eredivisie',
 'NOR-Eliteserien',
 'PAR-Paraguayan Primera División',
 'POL-Ekstraklasa',
 'POR-Primeira Liga',
 'RUS-Russian Premier League',
 'SRB-Serbian SuperLiga',
 'SUI-Swiss Super League',
 'SWE-Allsvenskan

Despite the wide choice of leagues, we consider only the ones listed in the next cell, since they have the most data available.

We use data from the 2022/2023 season, since we want the most recent data that still covers an entire season.

In [None]:
fbref = FBref(leagues=['ARG-Primera División Argentina',
 'Big 5 European Leagues Combined',
 'ENG-Championship',
 'ESP-Segunda División',
 'FRA-Ligue 2',
 'GER-2. Fußball-Bundesliga',
 'ITA-Serie B',
 'MEX-Liga MX',
 'NED-Eredivisie',
 'POR-Primeira Liga',
 'USA-Major League Soccer'] , seasons='2022')
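
The call above passes `seasons='2022'`, while the merged table later in this notebook labels the same season `2223`. As a small illustration of how a two-year season code like that can be derived from the starting year, here is a sketch. `season_code` is a hypothetical helper written for this note; it is not part of *soccerdata* and does not reproduce the library's own season parsing.

```python
def season_code(start_year: int) -> str:
    """Build a two-year season code from the starting year (hypothetical helper)."""
    end_year = start_year + 1
    # Keep only the last two digits of each year, zero-padded
    return f"{start_year % 100:02d}{end_year % 100:02d}"


print(season_code(2022))  # 2223
```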

We retrieve the following types of statistics:
* Standard stats
* Shooting stats
* Passing stats
* Passing types stats
* Goal and shot creation stats
* Defensive stats
* Possession stats
* Playing time stats
* Miscellaneous stats

In the next phases we will filter for the statistics we are interested in.

In [None]:
standard = fbref.read_player_season_stats(stat_type='standard')



In [None]:
standard.reset_index(inplace=True)

In [None]:
shooting = fbref.read_player_season_stats(stat_type='shooting')



In [None]:
shooting.reset_index(inplace=True)

In [None]:
passing = fbref.read_player_season_stats(stat_type='passing')



In [None]:
passing.reset_index(inplace=True)

In [None]:
passing_types = fbref.read_player_season_stats(stat_type='passing_types')



In [None]:
passing_types.reset_index(inplace=True)

In [None]:
shot_creation = fbref.read_player_season_stats(stat_type='goal_shot_creation')



In [None]:
shot_creation.reset_index(inplace=True)

In [None]:
defense = fbref.read_player_season_stats(stat_type='defense')



In [None]:
defense.reset_index(inplace=True)

In [None]:
possession = fbref.read_player_season_stats(stat_type='possession')



In [None]:
possession.reset_index(inplace=True)

In [None]:
playing = fbref.read_player_season_stats(stat_type='playing_time')



In [None]:
playing.reset_index(inplace=True)

In [None]:
misc = fbref.read_player_season_stats(stat_type='misc')



In [None]:
misc.reset_index(inplace=True)
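
The read-and-reset cells above all follow the same pattern, so they could also be collapsed into a loop. The sketch below is one possible way to do that, not the approach used in this notebook. `load_stat_tables` and `STAT_TYPES` are hypothetical names, and `read_stats` is assumed to be a callable that accepts a `stat_type` keyword, like `fbref.read_player_season_stats`.

```python
from typing import Callable, Dict, Iterable

# stat_type values copied from the cells above
STAT_TYPES = (
    "standard", "shooting", "passing", "passing_types",
    "goal_shot_creation", "defense", "possession", "playing_time", "misc",
)


def load_stat_tables(read_stats: Callable,
                     stat_types: Iterable[str] = STAT_TYPES) -> Dict[str, object]:
    """Fetch one table per stat type and move its index levels into columns."""
    tables = {}
    for stat_type in stat_types:
        table = read_stats(stat_type=stat_type)
        # Non-inplace equivalent of table.reset_index(inplace=True)
        tables[stat_type] = table.reset_index()
    return tables
```

Under those assumptions, `tables = load_stat_tables(fbref.read_player_season_stats)` would return a dictionary in which `tables["standard"]` plays the role of `standard`, and so on.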

In [None]:
playing = playing[playing['Playing Time']['MP'] > 0]
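
The cell above keeps only players with at least one match played (`MP`). Because the FBref tables have two-level column labels, the `MP` column sits under the `Playing Time` group. The toy example below illustrates that selection on made-up data; `playing_demo` and its values are invented for the illustration.

```python
import pandas as pd

# Toy stand-in for the `playing` table, with two-level column labels
playing_demo = pd.DataFrame(
    [["Player A", 6, 4], ["Player B", 0, 0], ["Player C", 18, 14]],
    columns=pd.MultiIndex.from_tuples(
        [("player", ""), ("Playing Time", "MP"), ("Playing Time", "Starts")]
    ),
)

# A tuple addresses one column across both levels in a single lookup
mask = playing_demo[("Playing Time", "MP")] > 0

# Boolean indexing keeps only the rows where the mask is True
played = playing_demo[mask]
```

The tuple lookup selects the same column as the chained form `playing['Playing Time']['MP']` used in the cell.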

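Each call returned a separate dataframe. We now merge them on the columns they share: league, season, team, player, nation, pos, age, and born. Since `merge` performs an inner join by default, only players present in every table are kept.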

In [None]:
# Columns that identify a player-season row in every table
keys = ['league','season','team','player','nation','pos','age','born']

# Explicit suffixes are passed where the default _x/_y names would clash with earlier merges
df1 = standard.merge(right=shooting, on=keys)
df2 = df1.merge(right=passing, on=keys)
df3 = df2.merge(right=passing_types, on=keys)
df4 = df3.merge(right=shot_creation, on=keys, suffixes=['_1','_2'])
df5 = df4.merge(right=defense, on=keys)
df6 = df5.merge(right=possession, on=keys, suffixes=['_3','_4'])
df7 = df6.merge(right=playing, on=keys)
final_df = df7.merge(right=misc, on=keys)
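
As an alternative to chaining the merges by hand, the same kind of join can be written as a loop that tags clashing columns with the name of the table they came from. This is a sketch on toy data with flat column labels and a shortened key list. `merge_stat_tables`, `KEYS`, and the demo values are hypothetical, and the resulting column names differ from the `_x`/`_y`/`_1`/... suffixes produced above.

```python
import pandas as pd

KEYS = ["league", "season", "team", "player"]  # shortened key list for the toy data


def merge_stat_tables(tables, keys=KEYS):
    """Inner-join the tables in order; clashing right-hand columns get a _<table name> suffix."""
    names = list(tables)
    merged = tables[names[0]]
    for name in names[1:]:
        merged = merged.merge(
            tables[name], on=keys, how="inner", suffixes=("", f"_{name}")
        )
    return merged


# Made-up rows, only to exercise the function
base = {"league": ["L1", "L1"], "season": ["2223", "2223"],
        "team": ["T1", "T1"], "player": ["A", "B"]}
demo = {
    "standard": pd.DataFrame({**base, "Gls": [3, 0]}),
    "shooting": pd.DataFrame({**base, "Gls": [3, 0], "Sh": [10, 2]}),
    "misc": pd.DataFrame({**base, "Fls": [5, 1]}),
}
merged = merge_stat_tables(demo)
print(list(merged.columns))
```

Naming the suffix after the source table makes it easier to tell later which `Gls` column came from which statistics page.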


In [None]:
final_df

| | league | season | team | player | nation | pos | age | born | Playing Time_x: MP | Playing Time_x: Starts | ... | Performance_y: Crs | Performance_y: Int | Performance_y: TklW | Performance_y: PKwon | Performance_y: PKcon | Performance_y: OG | Performance_y: Recov | Aerial Duels: Won | Aerial Duels: Lost | Aerial Duels: Won% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ARG-Primera División Argentina | 2223 | Aldosivi | Andrés Ríos | ARG | FW,MF | 32 | 1989 | 6 | 4 | ... | 4 | 1 | 2 | 0 | 0 | 0 | 16 | 9 | 10 | 47.4 |
| 1 | ARG-Primera División Argentina | 2223 | Aldosivi | Bautista Kociubinski | ARG | MF | 20 | 2001 | 8 | 8 | ... | 5 | 12 | 13 | 0 | 0 | 0 | 59 | 12 | 8 | 60.0 |
| 2 | ARG-Primera División Argentina | 2223 | Aldosivi | Brian Martínez | ARG | FW,MF | 22 | 1999 | 18 | 14 | ... | 56 | 5 | 8 | | | 0 | | | | |
| 3 | ARG-Primera División Argentina | 2223 | Aldosivi | David Torres | ARG | FW,DF | 20 | 2001 | 3 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 1 | 66.7 |
| 4 | ARG-Primera División Argentina | 2223 | Aldosivi | Edwin Mosquera | COL | FW | 20 | 2001 | 1 | 0 | ... | 2 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9460 | USA-Major League Soccer | 2223 | Vancouver | Sebastian Berhalter | USA | MF | 20 | 2001 | 18 | 11 | ... | 29 | 24 | 20 | 0 | 0 | 0 | 93 | 5 | 10 | 33.3 |
| 9461 | USA-Major League Soccer | 2223 | Vancouver | Simon Becher | USA | FW | 22 | 1999 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 9462 | USA-Major League Soccer | 2223 | Vancouver | Thomas Hasal | CAN | GK | 22 | 1999 | 17 | 17 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 16 | 4 | 2 | 66.7 |
| 9463 | USA-Major League Soccer | 2223 | Vancouver | Tosaint Ricketts | CAN | FW,DF | 34 | 1987 | 23 | 2 | ... | 3 | 0 | 3 | 0 | 0 | 0 | 8 | 22 | 26 | 45.8 |


## **Saving dataframe**

Finally, we save the dataframe to Google Drive (`MyDrive`).

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
path = '/content/drive/MyDrive/DC_Inter/'

In [None]:
with open(path+'football_stats_2022_2023.pkl', 'wb') as f:
    dill.dump(final_df, f)
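
To check that a saved file can be read back in a later session, a save-then-reload round trip can help. The sketch below uses the standard-library `pickle` module as a stand-in so that it runs without Colab or Drive; `dill` exposes `dump` and `load` functions that are called the same way. `save_and_reload`, the temporary directory, and the sample object are hypothetical and only serve the illustration.

```python
import pickle
import tempfile
from pathlib import Path


def save_and_reload(obj, directory, filename):
    """Serialize obj to directory/filename, then load it back and return the copy."""
    # Path joining does not depend on a trailing slash in the directory string
    target = Path(directory) / filename
    with open(target, "wb") as f:
        pickle.dump(obj, f)
    with open(target, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    sample = {"player": ["A", "B"], "MP": [6, 8]}  # made-up stand-in for final_df
    restored = save_and_reload(sample, tmp, "demo.pkl")
```

In the notebook itself, the reloading half would correspond to opening the `.pkl` file in `'rb'` mode and calling `dill.load(f)`.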