<a href="https://colab.research.google.com/github/Brighton94/predicting-football-matches/blob/main/web_scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, I scrape match data from [FBREF](https://fbref.com/en/), focusing on English football clubs.

In [41]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time 

In [2]:
epl_url = "https://fbref.com/en/comps/9/Premier-League-Stats"

In [3]:
data = requests.get(epl_url)

In [4]:
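# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data.text, "html.parser")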

In [5]:
standings_table = soup.select('table.stats_table')[0]

In [6]:
standings_table

<table class="stats_table sortable min_width" data-cols-to-freeze=",2" id="results111601_overall"> <caption>League Table Table</caption> <colgroup><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/></colgroup> <thead> <tr> <th aria-label="Rk" class=" poptip sort_default_asc center" data-stat="rank" data-tip="&lt;strong&gt;Squad finish in competition&lt;/strong&gt;&lt;br&gt;Finish within the league or competition.&lt;br&gt;For knockout competitions may show final round reached.&lt;br&gt;Colors and arrows represent promotion/relegation or qualifiation for continental cups.&lt;br&gt;Trophy indicates team won league whether by playoffs or by leading the table.&lt;br&gt;Star indicates topped table in league USING another means of naming champion." scope="col">Rk</th> <th aria-label="Squad" class=" poptip sort_default_asc center" data-stat="squad" scope="col">Squad</th> <th aria-label="MP" class=" poptip center" data-stat="games"

In [7]:
# get all a tags in standings_table and store in a list
links = standings_table.find_all('a')

In [8]:
# get the relative links
links = [l.get("href") for l in links]

In [9]:
# keep only squad links
links = [l for l in links if '/squads/' in l]
links

['/en/squads/b8fd03ef/Manchester-City-Stats',
 '/en/squads/822bd0ba/Liverpool-Stats',
 '/en/squads/cff3d9bb/Chelsea-Stats',
 '/en/squads/361ca564/Tottenham-Hotspur-Stats',
 '/en/squads/18bb7c10/Arsenal-Stats',
 '/en/squads/19538871/Manchester-United-Stats',
 '/en/squads/7c21e445/West-Ham-United-Stats',
 '/en/squads/8cec06e1/Wolverhampton-Wanderers-Stats',
 '/en/squads/a2d435b3/Leicester-City-Stats',
 '/en/squads/d07537b9/Brighton-and-Hove-Albion-Stats',
 '/en/squads/cd051869/Brentford-Stats',
 '/en/squads/b2b47a98/Newcastle-United-Stats',
 '/en/squads/47c64c55/Crystal-Palace-Stats',
 '/en/squads/8602292d/Aston-Villa-Stats',
 '/en/squads/33c895d4/Southampton-Stats',
 '/en/squads/d3fd31cc/Everton-Stats',
 '/en/squads/5bfb9659/Leeds-United-Stats',
 '/en/squads/943e8050/Burnley-Stats',
 '/en/squads/2abfe087/Watford-Stats',
 '/en/squads/1c781004/Norwich-City-Stats']

In [10]:
# get the absolute links
team_urls = [f"https://fbref.com{l}" for l in links]
team_urls

['https://fbref.com/en/squads/b8fd03ef/Manchester-City-Stats',
 'https://fbref.com/en/squads/822bd0ba/Liverpool-Stats',
 'https://fbref.com/en/squads/cff3d9bb/Chelsea-Stats',
 'https://fbref.com/en/squads/361ca564/Tottenham-Hotspur-Stats',
 'https://fbref.com/en/squads/18bb7c10/Arsenal-Stats',
 'https://fbref.com/en/squads/19538871/Manchester-United-Stats',
 'https://fbref.com/en/squads/7c21e445/West-Ham-United-Stats',
 'https://fbref.com/en/squads/8cec06e1/Wolverhampton-Wanderers-Stats',
 'https://fbref.com/en/squads/a2d435b3/Leicester-City-Stats',
 'https://fbref.com/en/squads/d07537b9/Brighton-and-Hove-Albion-Stats',
 'https://fbref.com/en/squads/cd051869/Brentford-Stats',
 'https://fbref.com/en/squads/b2b47a98/Newcastle-United-Stats',
 'https://fbref.com/en/squads/47c64c55/Crystal-Palace-Stats',
 'https://fbref.com/en/squads/8602292d/Aston-Villa-Stats',
 'https://fbref.com/en/squads/33c895d4/Southampton-Stats',
 'https://fbref.com/en/squads/d3fd31cc/Everton-Stats',
 'https://fbref.
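
The f-string above works because every `href` in the standings table is site-relative. As a slightly more defensive sketch, the standard library's `urllib.parse.urljoin` resolves relative links and leaves absolute ones unchanged. `BASE_URL` and `to_absolute` are illustrative names, not part of the original notebook.

```python
from urllib.parse import urljoin

BASE_URL = "https://fbref.com"  # assumed site root, taken from the URLs above


def to_absolute(hrefs, base=BASE_URL):
    """Resolve each href against the site root, skipping missing ones."""
    return [urljoin(base, h) for h in hrefs if h]


sample = [
    "/en/squads/b8fd03ef/Manchester-City-Stats",             # relative link
    "https://fbref.com/en/squads/822bd0ba/Liverpool-Stats",  # already absolute
    None,                                                    # <a> tag without href
]
absolute = to_absolute(sample)
print(absolute)
```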

# Scraping the Data for One Team

I will first scrape the data for just one team, before generalizing the code for all 20 English Premier League teams.

In [13]:
man_united_url = team_urls[5]
data = requests.get(man_united_url)
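
Indexing with `team_urls[5]` relies on Manchester United sitting sixth in the table when the page is scraped. Below is a minimal sketch of a lookup by team name instead, assuming the URLs keep the `.../<Team-Name>-Stats` shape seen above. `find_team_url` is a hypothetical helper, not something from the original walkthrough.

```python
def find_team_url(team_urls, team_name):
    """Return the first URL whose final path segment matches the team name."""
    slug = team_name.replace(" ", "-") + "-Stats"
    for url in team_urls:
        if url.rsplit("/", 1)[-1] == slug:
            return url
    raise LookupError(f"no squad URL found for {team_name!r}")


sample_urls = [
    "https://fbref.com/en/squads/b8fd03ef/Manchester-City-Stats",
    "https://fbref.com/en/squads/19538871/Manchester-United-Stats",
]
man_united = find_team_url(sample_urls, "Manchester United")
print(man_united)
```

Raising `LookupError` makes a renamed or relegated team fail loudly rather than silently scraping the wrong club.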

In [14]:
matches = pd.read_html(data.text, match="Scores & Fixtures")

In [31]:
matches[0].head()

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,xG,xGA,Poss,Attendance,Captain,Formation,Referee,Match Report,Notes
0,2021-08-14,12:30,Premier League,Matchweek 1,Sat,Home,W,5,1,Leeds United,1.5,0.6,49.0,72732.0,Harry Maguire,4-2-3-1,Paul Tierney,Match Report,
1,2021-08-22,14:00,Premier League,Matchweek 2,Sun,Away,D,1,1,Southampton,1.4,0.8,62.0,32000.0,Harry Maguire,4-2-3-1,Craig Pawson,Match Report,
2,2021-08-29,16:30,Premier League,Matchweek 3,Sun,Away,W,1,0,Wolves,0.6,1.8,56.0,30621.0,Harry Maguire,4-2-3-1,Mike Dean,Match Report,
3,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Home,W,4,1,Newcastle Utd,2.1,0.7,64.0,72732.0,Harry Maguire,4-2-3-1,Anthony Taylor,Match Report,
4,2021-09-14,18:45,Champions Lg,Group stage,Tue,Away,L,1,2,ch Young Boys,0.5,1.6,45.0,31120.0,Harry Maguire,4-2-3-1,François Letexier,Match Report,


## Shooting stats

In [16]:
soup = BeautifulSoup(data.text, "html.parser")

In [17]:
links = soup.find_all('a')

In [18]:
links = [l.get("href") for l in links]

In [22]:
# skip <a> tags without an href (None) before testing the substring
links = [l for l in links if l and 'all_comps/shooting/' in l]

In [23]:
links

['/en/squads/19538871/2021-2022/matchlogs/all_comps/shooting/Manchester-United-Match-Logs-All-Competitions',
 '/en/squads/19538871/2021-2022/matchlogs/all_comps/shooting/Manchester-United-Match-Logs-All-Competitions',
 '/en/squads/19538871/2021-2022/matchlogs/all_comps/shooting/Manchester-United-Match-Logs-All-Competitions',
 '/en/squads/19538871/2021-2022/matchlogs/all_comps/shooting/Manchester-United-Match-Logs-All-Competitions']

In [21]:
data = requests.get(f"https://fbref.com{links[0]}")

In [26]:
shooting_df = pd.read_html(data.text, match="Shooting")[0]

In [27]:
shooting_df.head()

Unnamed: 0_level_0,For Manchester United,For Manchester United,For Manchester United,For Manchester United,For Manchester United,For Manchester United,For Manchester United,For Manchester United,For Manchester United,For Manchester United,...,Standard,Standard,Standard,Standard,Expected,Expected,Expected,Expected,Expected,Unnamed: 25_level_0
Unnamed: 0_level_1,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,...,Dist,FK,PK,PKatt,xG,npxG,npxG/Sh,G-xG,np:G-xG,Match Report
0,2021-08-14,12:30,Premier League,Matchweek 1,Sat,Home,W,5,1,Leeds United,...,18.0,0.0,0,0,1.5,1.5,0.09,3.5,3.5,Match Report
1,2021-08-22,14:00,Premier League,Matchweek 2,Sun,Away,D,1,1,Southampton,...,13.9,1.0,0,0,1.4,1.4,0.1,-0.4,-0.4,Match Report
2,2021-08-29,16:30,Premier League,Matchweek 3,Sun,Away,W,1,0,Wolves,...,18.3,1.0,0,0,0.6,0.6,0.06,0.4,0.4,Match Report
3,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Home,W,4,1,Newcastle Utd,...,20.3,0.0,0,0,2.1,2.1,0.1,1.9,1.9,Match Report
4,2021-09-14,18:45,Champions Lg,Group stage,Tue,Away,L,1,2,ch Young Boys,...,11.1,0.0,0,0,0.5,0.5,0.26,0.5,0.5,Match Report


In [29]:
shooting_df.columns = shooting_df.columns.droplevel()

In [30]:
shooting_df.head()

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,...,Dist,FK,PK,PKatt,xG,npxG,npxG/Sh,G-xG,np:G-xG,Match Report
0,2021-08-14,12:30,Premier League,Matchweek 1,Sat,Home,W,5,1,Leeds United,...,18.0,0.0,0,0,1.5,1.5,0.09,3.5,3.5,Match Report
1,2021-08-22,14:00,Premier League,Matchweek 2,Sun,Away,D,1,1,Southampton,...,13.9,1.0,0,0,1.4,1.4,0.1,-0.4,-0.4,Match Report
2,2021-08-29,16:30,Premier League,Matchweek 3,Sun,Away,W,1,0,Wolves,...,18.3,1.0,0,0,0.6,0.6,0.06,0.4,0.4,Match Report
3,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Home,W,4,1,Newcastle Utd,...,20.3,0.0,0,0,2.1,2.1,0.1,1.9,1.9,Match Report
4,2021-09-14,18:45,Champions Lg,Group stage,Tue,Away,L,1,2,ch Young Boys,...,11.1,0.0,0,0,0.5,0.5,0.26,0.5,0.5,Match Report
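
To make the `droplevel()` step concrete, here is a small sketch on a toy frame with the same kind of two-level header. The column names and the single row are copied from the output above; `demo` and `flat` are illustrative names.

```python
import pandas as pd

# Two-level header: a group label on top, the real column name underneath.
columns = pd.MultiIndex.from_tuples([
    ("For Manchester United", "Date"),
    ("Standard", "Sh"),
    ("Expected", "xG"),
])
demo = pd.DataFrame([["2021-08-14", 16, 1.5]], columns=columns)

# droplevel() removes level 0 by default, leaving only the inner names.
demo.columns = demo.columns.droplevel()
flat = list(demo.columns)
print(flat)
```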


In [33]:
team_data = matches[0].merge(shooting_df[["Date", "Sh", "SoT", "Dist", "FK", "PK", "PKatt" ]], on="Date")
team_data.head()

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,...,Formation,Referee,Match Report,Notes,Sh,SoT,Dist,FK,PK,PKatt
0,2021-08-14,12:30,Premier League,Matchweek 1,Sat,Home,W,5,1,Leeds United,...,4-2-3-1,Paul Tierney,Match Report,,16,8,18.0,0.0,0,0
1,2021-08-22,14:00,Premier League,Matchweek 2,Sun,Away,D,1,1,Southampton,...,4-2-3-1,Craig Pawson,Match Report,,15,3,13.9,1.0,0,0
2,2021-08-29,16:30,Premier League,Matchweek 3,Sun,Away,W,1,0,Wolves,...,4-2-3-1,Mike Dean,Match Report,,10,3,18.3,1.0,0,0
3,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Home,W,4,1,Newcastle Utd,...,4-2-3-1,Anthony Taylor,Match Report,,21,6,20.3,0.0,0,0
4,2021-09-14,18:45,Champions Lg,Group stage,Tue,Away,L,1,2,ch Young Boys,...,4-2-3-1,François Letexier,Match Report,,2,2,11.1,0.0,0,0
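
`merge` performs an inner join by default, so any match without a shooting row for the same date is dropped without warning. The sketch below shows this on a few rows copied from the tables above, with the third shooting row deliberately left out. `matches_demo` and `shooting_demo` are stand-ins for the scraped frames.

```python
import pandas as pd

matches_demo = pd.DataFrame({
    "Date": ["2021-08-14", "2021-08-22", "2021-08-29"],
    "Result": ["W", "D", "W"],
})
shooting_demo = pd.DataFrame({
    "Date": ["2021-08-14", "2021-08-22"],  # 2021-08-29 intentionally missing
    "Sh": [16, 15],
})

# Inner join (the default): the 2021-08-29 match disappears.
# validate="one_to_one" raises if a date appears twice on either side.
inner = matches_demo.merge(shooting_demo, on="Date", validate="one_to_one")

# Left join keeps every match and fills the missing stats with NaN.
left = matches_demo.merge(shooting_demo, on="Date", how="left")

print(len(inner), len(left))
```

Whether dropping or keeping unmatched rows is preferable depends on how the downstream model treats missing values.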


# Scraping the Data for All Teams

Now let us scrape the data for every team in the standings table, stepping back one season at a time.

In [38]:
# seasons to scrape, labelled by their end year: 2021-2022 and 2020-2021
years = list(range(2022, 2020, -1))
years

[2022, 2021]
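
Each value in `years` is the year a season ends in, which is what later lands in the `Season` column. As a small sketch, the mapping to the `YYYY-YYYY` form that appears in the match-log URLs can be written as follows. `season_label` is a hypothetical helper.

```python
def season_label(end_year):
    """Map a season's end year to the 'YYYY-YYYY' form seen in the match-log URLs."""
    return f"{end_year - 1}-{end_year}"


years = list(range(2022, 2020, -1))
labels = [season_label(y) for y in years]
print(labels)  # ['2021-2022', '2020-2021']
```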

In [58]:
all_fixtures = []

In [59]:
standings_url = "https://fbref.com/en/comps/9/Premier-League-Stats"

In [60]:
for year in years:
    data = requests.get(standings_url)
    soup = BeautifulSoup(data.text, "html.parser")
    standings_table = soup.select('table.stats_table')[0]

    links = [l.get("href") for l in standings_table.find_all('a')]
    links = [l for l in links if '/squads/' in l]
    team_urls = [f"https://fbref.com{l}" for l in links]
    
    previous_season = soup.select("a.prev")[0].get("href")
    standings_url = f"https://fbref.com{previous_season}"

    for team_url in team_urls:
        team_name = team_url.split("/")[-1].replace("-Stats", "").replace("-", " ")
        data = requests.get(team_url)
        matches = pd.read_html(data.text, match="Scores & Fixtures")[0]
        soup = BeautifulSoup(data.text, "html.parser")
        links = [l.get("href") for l in soup.find_all('a')]
        links = [l for l in links if l and 'all_comps/shooting/' in l]
        data = requests.get(f"https://fbref.com{links[0]}")
        shooting = pd.read_html(data.text, match="Shooting")[0]
        shooting.columns = shooting.columns.droplevel()
        # skip teams whose shooting stats are missing and raise a ValueError on merge
        try:
            team_data = matches.merge(shooting[["Date", "Sh", "SoT", "Dist", "FK", "PK", "PKatt"]], on="Date")
        except ValueError:
            continue
        # filter out all the other competitions and keep the Premier League only
        # .copy() so the column assignments below do not trigger SettingWithCopyWarning
        team_data = team_data[team_data["Comp"] == "Premier League"].copy()
        
        team_data["Season"] = year
        team_data["Team"] = team_name
        all_fixtures.append(team_data)
        # sending requests too quickly can put load on the website and may
        # get us blocked for making too many requests, so pause between
        # teams with time.sleep()
        time.sleep(1)
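
The fixed one-second pause above can be extended to retry a failed request with a growing delay. The following is a minimal sketch, not the notebook's method: `fetch_with_retry`, `FakeResponse`, and `fake_fetch` are hypothetical names, and it assumes that any status other than 200 is worth retrying. The `fetch` argument is any callable returning an object with a `status_code` attribute, such as `requests.get`.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url), pausing before each attempt and doubling the pause on failure."""
    wait = delay
    for _ in range(retries):
        sleep(wait)
        response = fetch(url)
        if response.status_code == 200:
            return response
        wait *= 2  # back off: wait twice as long before the next attempt
    raise RuntimeError(f"giving up on {url} after {retries} attempts")


# Offline demonstration with a fake fetcher: first reply 429, then 200.
class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code


statuses = iter([429, 200])
calls = []


def fake_fetch(url):
    calls.append(url)
    return FakeResponse(next(statuses))


pauses = []  # record the requested pauses instead of actually sleeping
result = fetch_with_retry(fake_fetch, "https://example.com", sleep=pauses.append)
print(result.status_code, pauses)
```

In the loop above, `requests.get(team_url)` could then become `fetch_with_retry(requests.get, team_url)`, which would also make the trailing `time.sleep(1)` redundant.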

In [61]:
# number of team DataFrames collected (one per team and season), not individual matches
len(all_fixtures)

19

In [62]:
fixtures_df = pd.concat(all_fixtures)
fixtures_df.columns = [c.lower() for c in fixtures_df.columns]

In [64]:
fixtures_df.head()

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,match report,notes,sh,sot,dist,fk,pk,pkatt,season,team
0,2021-08-14,17:30,Premier League,Matchweek 1,Sat,Away,W,3,0,Norwich City,...,Match Report,,19.0,6.0,16.4,1.0,0,0,2022,Liverpool
1,2021-08-21,12:30,Premier League,Matchweek 2,Sat,Home,W,2,0,Burnley,...,Match Report,,28.0,8.0,15.1,0.0,0,0,2022,Liverpool
2,2021-08-28,17:30,Premier League,Matchweek 3,Sat,Home,D,1,1,Chelsea,...,Match Report,,23.0,6.0,14.8,0.0,1,1,2022,Liverpool
3,2021-09-12,16:30,Premier League,Matchweek 4,Sun,Away,W,3,0,Leeds United,...,Match Report,,30.0,8.0,14.7,1.0,0,0,2022,Liverpool
5,2021-09-18,15:00,Premier League,Matchweek 5,Sat,Home,W,3,0,Crystal Palace,...,Match Report,,25.0,10.0,15.2,0.0,0,0,2022,Liverpool


In [65]:
fixtures_df.to_csv("fixtures.csv")

The code in this notebook is based on a Dataquest walkthrough in this YouTube [video](https://www.youtube.com/watch?v=Nt7WJa2iu0s&ab_channel=Dataquest).