## Web Scraping EPL Matches

In this project we will extract data on two seasons (2020-2021 and 2021-2022) from 'fbref.com'.

We will start by requesting the URL and keeping the response as text (the raw HTML), which we will later parse to get the data we want.

In [1]:
import requests 

In [2]:
url = 'https://fbref.com/en/comps/9/2021-2022/2021-2022-Premier-League-Stats'

In [3]:
data = requests.get(url)
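The request above assumes the download succeeded. As a minimal sketch, assuming we would rather fail fast on HTTP errors (for example when the site rate-limits us), the fetch could be wrapped as below. `fetch_html` and its `get` parameter are hypothetical names introduced for this illustration and are not part of the original notebook.

```python
import requests


def fetch_html(url, get=requests.get, timeout=30):
    """Return the page body as text, raising if the server reports an error.

    `get` defaults to requests.get; it is a parameter only so the helper can
    be exercised without network access (an assumption of this sketch).
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text
```

With such a helper, `fetch_html(url)` would stand in for `requests.get(url).text`.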

In [44]:
from bs4 import BeautifulSoup # lets us navigate and extract data from HTML pages

In [7]:
soup = BeautifulSoup(data.text, 'html.parser') # parse the downloaded HTML; naming the parser avoids a bs4 warning

In [45]:
standings_table = soup.select('table.stats_table')[0] # first table element on the page with class "stats_table"

In [10]:
links = standings_table.find_all('a') 

In [11]:
links = [l.get('href') for l in links]

In [12]:
links = [l for l in links if l and '/squads' in l] # keep only links containing "/squads" (skipping anchors with no href)
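To see the select, find and filter steps in isolation, here is a small sketch that runs on a hand-written HTML fragment rather than the live page. The fragment, including the player link, is invented for illustration and is not real fbref markup. The `h and` check is an assumption of this sketch that skips anchors without an `href`.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the standings page (not real fbref markup).
html = """
<table class="stats_table">
  <tr><td><a href="/en/squads/b8fd03ef/2021-2022/Manchester-City-Stats">Manchester City</a></td>
      <td><a href="/en/players/abc/Some-Player">Top scorer</a></td></tr>
  <tr><td><a href="/en/squads/822bd0ba/2021-2022/Liverpool-Stats">Liverpool</a></td>
      <td><a>no href here</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.select('table.stats_table')[0]

# .get('href') returns None for anchors that have no href attribute
hrefs = [a.get('href') for a in table.find_all('a')]

# keep only the squad pages
squad_links = [h for h in hrefs if h and '/squads' in h]
print(squad_links)
```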

In [13]:
links

['/en/squads/b8fd03ef/2021-2022/Manchester-City-Stats',
 '/en/squads/822bd0ba/2021-2022/Liverpool-Stats',
 '/en/squads/cff3d9bb/2021-2022/Chelsea-Stats',
 '/en/squads/361ca564/2021-2022/Tottenham-Hotspur-Stats',
 '/en/squads/18bb7c10/2021-2022/Arsenal-Stats',
 '/en/squads/19538871/2021-2022/Manchester-United-Stats',
 '/en/squads/7c21e445/2021-2022/West-Ham-United-Stats',
 '/en/squads/a2d435b3/2021-2022/Leicester-City-Stats',
 '/en/squads/d07537b9/2021-2022/Brighton-and-Hove-Albion-Stats',
 '/en/squads/8cec06e1/2021-2022/Wolverhampton-Wanderers-Stats',
 '/en/squads/b2b47a98/2021-2022/Newcastle-United-Stats',
 '/en/squads/47c64c55/2021-2022/Crystal-Palace-Stats',
 '/en/squads/cd051869/2021-2022/Brentford-Stats',
 '/en/squads/8602292d/2021-2022/Aston-Villa-Stats',
 '/en/squads/33c895d4/2021-2022/Southampton-Stats',
 '/en/squads/d3fd31cc/2021-2022/Everton-Stats',
 '/en/squads/5bfb9659/2021-2022/Leeds-United-Stats',
 '/en/squads/943e8050/2021-2022/Burnley-Stats',
 '/en/squads/2abfe087/2021-

##### Above is a list of the relative links to each team's stats page. We will turn these relative links into absolute links that we can use to extract the stat tables.

In [14]:
full_urls = [f'https://fbref.com{l}' for l in links]
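The f-string works because every link here starts with `/`. As an alternative sketch, the standard library's `urljoin` builds the same absolute URL and also leaves an already-absolute URL unchanged. `BASE_URL` is a name chosen for this example.

```python
from urllib.parse import urljoin

BASE_URL = 'https://fbref.com'
relative = '/en/squads/b8fd03ef/2021-2022/Manchester-City-Stats'

# join the site root with the relative path
absolute = urljoin(BASE_URL, relative)
print(absolute)

# joining an absolute URL returns it as-is
print(urljoin(BASE_URL, absolute) == absolute)
```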

In [15]:
full_urls

['https://fbref.com/en/squads/b8fd03ef/2021-2022/Manchester-City-Stats',
 'https://fbref.com/en/squads/822bd0ba/2021-2022/Liverpool-Stats',
 'https://fbref.com/en/squads/cff3d9bb/2021-2022/Chelsea-Stats',
 'https://fbref.com/en/squads/361ca564/2021-2022/Tottenham-Hotspur-Stats',
 'https://fbref.com/en/squads/18bb7c10/2021-2022/Arsenal-Stats',
 'https://fbref.com/en/squads/19538871/2021-2022/Manchester-United-Stats',
 'https://fbref.com/en/squads/7c21e445/2021-2022/West-Ham-United-Stats',
 'https://fbref.com/en/squads/a2d435b3/2021-2022/Leicester-City-Stats',
 'https://fbref.com/en/squads/d07537b9/2021-2022/Brighton-and-Hove-Albion-Stats',
 'https://fbref.com/en/squads/8cec06e1/2021-2022/Wolverhampton-Wanderers-Stats',
 'https://fbref.com/en/squads/b2b47a98/2021-2022/Newcastle-United-Stats',
 'https://fbref.com/en/squads/47c64c55/2021-2022/Crystal-Palace-Stats',
 'https://fbref.com/en/squads/cd051869/2021-2022/Brentford-Stats',
 'https://fbref.com/en/squads/8602292d/2021-2022/Aston-Vill

In [16]:
team_url = full_urls[0]

In [17]:
data = requests.get(team_url)

In [18]:
import pandas as pd

In [19]:
matches = pd.read_html(data.text, match = 'Scores & Fixtures')

In [21]:
matches[0]

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,xG,xGA,Poss,Attendance,Captain,Formation,Referee,Match Report,Notes
0,2021-08-07,17:15,Community Shield,FA Community Shield,Sat,Neutral,L,0,1,Leicester City,,,57,,Fernandinho,4-3-3,Paul Tierney,Match Report,
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0,1,Tottenham,2.0,1.0,65,58262.0,Fernandinho,4-3-3,Anthony Taylor,Match Report,
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5,0,Norwich City,2.7,0.1,67,51437.0,İlkay Gündoğan,4-3-3,Graham Scott,Match Report,
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5,0,Arsenal,4.0,0.2,80,52276.0,İlkay Gündoğan,4-3-3,Martin Atkinson,Match Report,
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1,0,Leicester City,3.3,0.6,61,32087.0,İlkay Gündoğan,4-3-3,Paul Tierney,Match Report,
5,2021-09-15,20:00,Champions Lg,Group stage,Wed,Home,W,6,3,de RB Leipzig,2.3,1.3,50,38062.0,Rúben Dias,4-3-3,Serdar Gözübüyük,Match Report,
6,2021-09-18,15:00,Premier League,Matchweek 5,Sat,Home,D,0,0,Southampton,1.2,0.5,64,52698.0,Fernandinho,4-3-3,Jonathan Moss,Match Report,
7,2021-09-21,19:45,EFL Cup,Third round,Tue,Home,W,6,1,Wycombe,,,79,30959.0,Kevin De Bruyne,4-3-3,Robert Jones,Match Report,
8,2021-09-25,12:30,Premier League,Matchweek 6,Sat,Away,W,1,0,Chelsea,1.4,0.2,59,40036.0,Rúben Dias,4-3-3,Michael Oliver,Match Report,
9,2021-09-28,21:00,Champions Lg,Group stage,Tue,Away,L,0,2,fr Paris S-G,1.9,0.4,54,37350.0,Rúben Dias,4-3-3,Carlos del Cerro,Match Report,


In [22]:
soup = BeautifulSoup(data.text, 'html.parser')

In [23]:
link = soup.find_all('a')

In [24]:
link = [l.get('href') for l in link ]

In [25]:
link = [l for l in link if l and 'all_comps/shooting/' in l]

In [26]:
link

['/en/squads/b8fd03ef/2021-2022/matchlogs/all_comps/shooting/Manchester-City-Match-Logs-All-Competitions',
 '/en/squads/b8fd03ef/2021-2022/matchlogs/all_comps/shooting/Manchester-City-Match-Logs-All-Competitions',
 '/en/squads/b8fd03ef/2021-2022/matchlogs/all_comps/shooting/Manchester-City-Match-Logs-All-Competitions',
 '/en/squads/b8fd03ef/2021-2022/matchlogs/all_comps/shooting/Manchester-City-Match-Logs-All-Competitions']
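The same link appears four times in the output above, which is why taking the first element is enough. If a de-duplicated list is preferred, one possible sketch is shown below. `unique_links` and `shooting_url` are names introduced for this example.

```python
# Same shape as the output above: four identical entries.
links = [
    '/en/squads/b8fd03ef/2021-2022/matchlogs/all_comps/shooting/'
    'Manchester-City-Match-Logs-All-Competitions',
] * 4

# dict keys keep insertion order (Python 3.7+), so this removes repeats
# without reordering the remaining links
unique_links = list(dict.fromkeys(links))

shooting_url = f'https://fbref.com{unique_links[0]}'
print(len(unique_links), shooting_url)
```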

In [27]:
data = requests.get(f'https://fbref.com{link[0]}')

In [28]:
shooting = pd.read_html(data.text, match = 'Shooting')[0]

In [29]:
shooting.head()

Unnamed: 0_level_0,For Manchester City,For Manchester City,For Manchester City,For Manchester City,For Manchester City,For Manchester City,For Manchester City,For Manchester City,For Manchester City,For Manchester City,...,Standard,Standard,Standard,Standard,Expected,Expected,Expected,Expected,Expected,Unnamed: 25_level_0
Unnamed: 0_level_1,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,...,Dist,FK,PK,PKatt,xG,npxG,npxG/Sh,G-xG,np:G-xG,Match Report
0,2021-08-07,17:15,Community Shield,FA Community Shield,Sat,Neutral,L,0,1,Leicester City,...,,,0,0,,,,,,Match Report
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0,1,Tottenham,...,17.3,1.0,0,0,2.0,2.0,0.11,-2.0,-2.0,Match Report
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5,0,Norwich City,...,18.5,1.0,0,0,2.7,2.7,0.17,1.3,1.3,Match Report
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5,0,Arsenal,...,14.8,0.0,0,0,4.0,4.0,0.16,1.0,1.0,Match Report
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1,0,Leicester City,...,14.3,0.0,0,0,3.3,3.3,0.14,-2.3,-2.3,Match Report


In [30]:
shooting.columns = shooting.columns.droplevel()

In [31]:
shooting.head()

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,...,Dist,FK,PK,PKatt,xG,npxG,npxG/Sh,G-xG,np:G-xG,Match Report
0,2021-08-07,17:15,Community Shield,FA Community Shield,Sat,Neutral,L,0,1,Leicester City,...,,,0,0,,,,,,Match Report
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0,1,Tottenham,...,17.3,1.0,0,0,2.0,2.0,0.11,-2.0,-2.0,Match Report
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5,0,Norwich City,...,18.5,1.0,0,0,2.7,2.7,0.17,1.3,1.3,Match Report
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5,0,Arsenal,...,14.8,0.0,0,0,4.0,4.0,0.16,1.0,1.0,Match Report
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1,0,Leicester City,...,14.3,0.0,0,0,3.3,3.3,0.14,-2.3,-2.3,Match Report
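The shooting table comes back with two header rows, which pandas stores as a `MultiIndex`. The toy frame below is a sketch of what `droplevel()` does to it. It uses a handful of the column names and one row of values from the output above, and `toy` is a name made up for this example.

```python
import pandas as pd

# Two-level header: (group, column), as in the shooting table
columns = pd.MultiIndex.from_tuples([
    ('For Manchester City', 'Date'),
    ('Standard', 'Sh'),
    ('Standard', 'SoT'),
    ('Expected', 'xG'),
])
toy = pd.DataFrame([['2021-08-15', 18, 4, 2.0]], columns=columns)

# droplevel() defaults to level 0, i.e. it removes the top header row
toy.columns = toy.columns.droplevel()
print(list(toy.columns))
```

After this step the columns can be selected by their plain names, such as `toy['Sh']`.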


##### In the past few cells we extracted two stat tables (scores & fixtures, and shooting). Now we will merge them on the 'Date' column (this table covers only the 2021-2022 season).

In [32]:
team_data = matches[0].merge(shooting[['Date','Sh', 'SoT', 'Dist','FK','PK','PKatt']], on = 'Date')

In [33]:
team_data

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,...,Formation,Referee,Match Report,Notes,Sh,SoT,Dist,FK,PK,PKatt
0,2021-08-07,17:15,Community Shield,FA Community Shield,Sat,Neutral,L,0,1,Leicester City,...,4-3-3,Paul Tierney,Match Report,,12,3,,,0,0
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0,1,Tottenham,...,4-3-3,Anthony Taylor,Match Report,,18,4,17.3,1.0,0,0
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5,0,Norwich City,...,4-3-3,Graham Scott,Match Report,,16,4,18.5,1.0,0,0
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5,0,Arsenal,...,4-3-3,Martin Atkinson,Match Report,,25,10,14.8,0.0,0,0
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1,0,Leicester City,...,4-3-3,Paul Tierney,Match Report,,25,8,14.3,0.0,0,0
5,2021-09-15,20:00,Champions Lg,Group stage,Wed,Home,W,6,3,de RB Leipzig,...,4-3-3,Serdar Gözübüyük,Match Report,,15,7,16.8,0.0,1,1
6,2021-09-18,15:00,Premier League,Matchweek 5,Sat,Home,D,0,0,Southampton,...,4-3-3,Jonathan Moss,Match Report,,16,1,16.4,1.0,0,0
7,2021-09-21,19:45,EFL Cup,Third round,Tue,Home,W,6,1,Wycombe,...,4-3-3,Robert Jones,Match Report,,26,14,,,0,0
8,2021-09-25,12:30,Premier League,Matchweek 6,Sat,Away,W,1,0,Chelsea,...,4-3-3,Michael Oliver,Match Report,,15,3,17.1,0.0,0,0
9,2021-09-28,21:00,Champions Lg,Group stage,Tue,Away,L,0,2,fr Paris S-G,...,4-3-3,Carlos del Cerro,Match Report,,18,7,15.9,2.0,0,0


In [34]:
matches[0].shape

(58, 19)
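Comparing shapes is one way to check that the merge kept every match. `merge` defaults to an inner join, so a date missing from either table is dropped without any message. The sketch below uses toy frames in which one shooting row is deliberately left out to show the difference. `matches_toy` and `shooting_toy` are made-up names, and the `how`, `validate` and `indicator` arguments are an optional variation rather than what the notebook does.

```python
import pandas as pd

matches_toy = pd.DataFrame({
    'Date': ['2021-08-07', '2021-08-15', '2021-08-21'],
    'Opponent': ['Leicester City', 'Tottenham', 'Norwich City'],
})
# The first date is deliberately missing from this toy table
shooting_toy = pd.DataFrame({
    'Date': ['2021-08-15', '2021-08-21'],
    'Sh': [18, 16],
})

# Default inner join: the unmatched 2021-08-07 row disappears
inner = matches_toy.merge(shooting_toy, on='Date')

# Left join keeps every match; indicator adds a '_merge' column showing
# where each row came from, and validate raises if a Date repeats
left = matches_toy.merge(shooting_toy, on='Date', how='left',
                         validate='one_to_one', indicator=True)

print(inner.shape)
print(left['_merge'].tolist())
```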

## Bringing It All Together

Now that we have a framework for what we need to do, we will put it all in a 'for' loop to extract stats for both seasons (2020-2021 and 2021-2022) and combine them in one dataframe. This dataframe will be saved as a '.csv' file that can be cleaned and later used to extract useful information.

In [None]:
years = list(range(2022,2020, -1))
all_matches = []
standings_url = 'https://fbref.com/en/comps/9/2021-2022/2021-2022-Premier-League-Stats'

import time

for year in years:
    data = requests.get(standings_url)
    soup = BeautifulSoup(data.text, 'html.parser')
    standings_table = soup.select('table.stats_table')[0]
    
    links = [l.get('href') for l in standings_table.find_all('a')]
    links = [l for l in links if '/squads' in l]
    team_urls = [f'https://fbref.com{l}' for l in links] 
    
    previous_season = soup.select('a.prev')[0].get('href')
    standings_url = f'https://fbref.com{previous_season}'
    
    for team_url in team_urls:
        team_name = team_url.split('/')[-1].replace('-Stats', '').replace('-',' ')
        data = requests.get(team_url)
        matches = pd.read_html(data.text, match ='Scores & Fixtures')[0]
        
        soup = BeautifulSoup(data.text, 'html.parser')
        links = [l.get('href') for l in soup.find_all('a')]
        links = [l for l in links if l and 'all_comps/shooting/' in l]
        data = requests.get(f'https://fbref.com{links[0]}')
        shooting = pd.read_html(data.text, match = 'Shooting')[0]
        shooting.columns = shooting.columns.droplevel()
        
        try:
            team_data = matches.merge(shooting[['Date','Sh', 'SoT', 'Dist','FK','PK','PKatt']], on = 'Date')
        except (KeyError, ValueError): # selecting a missing shooting column raises KeyError
            continue
            
        team_data = team_data[team_data['Comp'] == 'Premier League'].copy() # copy so the new columns below do not trigger SettingWithCopyWarning
        team_data['Season'] = year
        team_data['Team'] = team_name
        all_matches.append(team_data)
        time.sleep(1.5)
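The loop derives the team name from the URL with plain string handling and labels the season from the loop counter. As a sketch, the two small helpers below isolate the name logic and, as an optional variation, read the season directly from the URL. Both function names are hypothetical and do not appear in the loop above. `season_from_url` assumes the URL contains a `YYYY-YYYY` path segment and returns `None` otherwise.

```python
import re


def team_name_from_url(team_url):
    """Same string handling as the loop: last path segment minus '-Stats'."""
    return team_url.split('/')[-1].replace('-Stats', '').replace('-', ' ')


def season_from_url(url):
    """Return the 'YYYY-YYYY' path segment of the URL, or None if absent."""
    found = re.search(r'/(\d{4}-\d{4})/', url)
    return found.group(1) if found else None


url = 'https://fbref.com/en/squads/d07537b9/2021-2022/Brighton-and-Hove-Albion-Stats'
print(team_name_from_url(url))
print(season_from_url(url))
```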

In [39]:
match_df = pd.concat(all_matches)

In [40]:
match_df.columns = [c.lower() for c in match_df.columns]

In [41]:
match_df

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,match report,notes,sh,sot,dist,fk,pk,pkatt,season,team
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0,1,Tottenham,...,Match Report,,18.0,4.0,17.3,1.0,0.0,0.0,2022,Manchester City
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5,0,Norwich City,...,Match Report,,16.0,4.0,18.5,1.0,0.0,0.0,2022,Manchester City
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5,0,Arsenal,...,Match Report,,25.0,10.0,14.8,0.0,0.0,0.0,2022,Manchester City
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1,0,Leicester City,...,Match Report,,25.0,8.0,14.3,0.0,0.0,0.0,2022,Manchester City
6,2021-09-18,15:00,Premier League,Matchweek 5,Sat,Home,D,0,0,Southampton,...,Match Report,,16.0,1.0,16.4,1.0,0.0,0.0,2022,Manchester City
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38,2021-05-02,19:15,Premier League,Matchweek 34,Sun,Away,L,0,4,Tottenham,...,Match Report,,8.0,1.0,18.2,0.0,0.0,0.0,2021,Sheffield United
39,2021-05-08,15:00,Premier League,Matchweek 35,Sat,Home,L,0,2,Crystal Palace,...,Match Report,,7.0,0.0,13.4,1.0,0.0,0.0,2021,Sheffield United
40,2021-05-16,19:00,Premier League,Matchweek 36,Sun,Away,W,1,0,Everton,...,Match Report,,10.0,3.0,18.5,0.0,0.0,0.0,2021,Sheffield United
41,2021-05-19,18:00,Premier League,Matchweek 37,Wed,Away,L,0,1,Newcastle Utd,...,Match Report,,11.0,1.0,18.3,1.0,0.0,0.0,2021,Sheffield United


In [43]:
match_df.to_csv('matches_1.csv')
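In the output above, each team's rows keep their original row numbers, so the index repeats across teams, and `to_csv` writes that index as an extra unnamed column by default. As an optional variation, the sketch below uses two one-row toy frames to show a fresh index and a CSV without the index column. It writes to an in-memory buffer instead of a file so it has no side effects.

```python
import io

import pandas as pd

# Two one-row frames that both carry index label 1, as after the filtering step
city = pd.DataFrame({'date': ['2021-08-15'], 'team': ['Manchester City']}, index=[1])
sheffield = pd.DataFrame({'date': ['2021-05-02'], 'team': ['Sheffield United']}, index=[1])

# ignore_index=True discards the old labels and numbers rows 0..n-1
combined = pd.concat([city, sheffield], ignore_index=True)

buffer = io.StringIO()
combined.to_csv(buffer, index=False)  # index=False: do not write the index column
print(buffer.getvalue())
```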