# Python_Web_Crawling: Football matches from the EPL

Tools used: requests, BeautifulSoup, pandas

**Index**

1.Scrape the first page with requests

2.Parse html links with BeautifulSoup

3.Extract match stats using pandas and requests

4.Get match shooting stats with requests and pandas

5.Clean and merge scraped data with Pandas

6.Scraping data for multiple seasons and teams with a loop

7.Final match results dataframe



---



# 1.Scrape the first page with requests

In [None]:
import requests

In [None]:
url = "https://fbref.com/en/comps/9/Premier-League-Stats"
data = requests.get(url)
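A failed download is easy to miss here, because `requests.get` returns a response object even when the server answers with an error page. The sketch below shows one way to make such failures visible. The helper name `fetch_html`, its `get` parameter and the 30-second timeout are assumptions made for illustration; they are not part of the original notebook.

```python
import requests


def fetch_html(url, get=requests.get, timeout=30):
    """Return the page body, raising if the server answers with an error status.

    `fetch_html` and its parameters are hypothetical names for this sketch.
    """
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx answers,
    # so an error reply fails loudly instead of being parsed as a page
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(url)` would then give the same text as `requests.get(url).text`, but only when the request succeeded.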

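We only need the 'League Table' from this page, because it contains the URL of each squad. To locate the table in the page's HTML:

1. Open the Chrome inspector
2. Click the arrow icon at the top left of the inspector pane
3. Move the mouse over the table and click it

or

Right-click the table and choose Inspect to jump to its HTML code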



---



# 2.Parse html links with BeautifulSoup
Use **BeautifulSoup**

In [None]:
from bs4 import BeautifulSoup

In [None]:
soup = BeautifulSoup(data.text)

Select only the 'League Table' element.

Use a CSS selector: 'select' returns a list of every matching element, and the league table is the first one.

In [None]:
table = soup.select('table.stats_table')[0]
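To make the selector's behaviour concrete, here is a small sketch on a made-up HTML snippet (the snippet, its links and the `demo_` names are invented for illustration). It shows that `table.stats_table` matches a table whose class list merely contains `stats_table`, and that the anchors are then read from that table only.

```python
from bs4 import BeautifulSoup

# Made-up miniature page: one unrelated table and one stats table
demo_html = """
<table class="other"><tr><td><a href="/en/other">Other</a></td></tr></table>
<table class="stats_table sortable min_width">
  <tr>
    <td><a href="/en/squads/aaa/Team-A-Stats">Team A</a></td>
    <td><a href="/en/players/bbb/Player-B">Player B</a></td>
  </tr>
</table>
"""

demo_soup = BeautifulSoup(demo_html, "html.parser")

# select() returns a list of all matches; [0] picks the first one
demo_tables = demo_soup.select("table.stats_table")
demo_table = demo_tables[0]

# Only anchors inside the selected table are collected
demo_hrefs = [anchor.get("href") for anchor in demo_table.find_all("a")]
print(demo_hrefs)
```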

Raw table data

In [None]:
table

<table class="stats_table sortable min_width" data-cols-to-freeze=",2" id="results111601_overall"> <caption>League Table Table</caption> <colgroup><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/></colgroup> <thead> <tr> <th aria-label="Rk" class=" poptip sort_default_asc center" data-stat="rank" data-tip="&lt;strong&gt;Squad finish in competition&lt;/strong&gt;&lt;br&gt;Finish within the league or competition.&lt;br&gt;For knockout competitions may show final round reached.&lt;br&gt;Colors and arrows represent promotion/relegation or qualifiation for continental cups.&lt;br&gt;Trophy indicates team won league whether by playoffs or by leading the table.&lt;br&gt;Star indicates topped table in league USING another means of naming champion." scope="col">Rk</th> <th aria-label="Squad" class=" poptip sort_default_asc center" data-stat="squad" scope="col">Squad</th> <th aria-label="MP" class=" poptip center" data-stat="games" data-

Extract only the URLs from the table's anchor ('a') tags

In [None]:
anchors = table.find_all('a')

In [None]:
links = [anchor.get("href") for anchor in anchors]

In [None]:
links

['/en/squads/b8fd03ef/Manchester-City-Stats',
 '/en/players/e46012d4/Kevin-De-Bruyne',
 '/en/players/3bb7b8b4/Ederson',
 '/en/squads/822bd0ba/Liverpool-Stats',
 '/en/players/e342ad68/Mohamed-Salah',
 '/en/players/7a2e46a8/Alisson',
 '/en/squads/cff3d9bb/Chelsea-Stats',
 '/en/players/9674002f/Mason-Mount',
 '/en/players/33887998/Edouard-Mendy',
 '/en/squads/361ca564/Tottenham-Hotspur-Stats',
 '/en/players/92e7e919/Son-Heung-min',
 '/en/players/8f62b6ee/Hugo-Lloris',
 '/en/squads/18bb7c10/Arsenal-Stats',
 '/en/players/bc7dc64d/Bukayo-Saka',
 '/en/players/466fb2c5/Aaron-Ramsdale',
 '/en/squads/19538871/Manchester-United-Stats',
 '/en/players/dea698d9/Cristiano-Ronaldo',
 '/en/players/7ba6d84e/David-de-Gea',
 '/en/squads/7c21e445/West-Ham-United-Stats',
 '/en/players/79c84d1c/Jarrod-Bowen',
 '/en/players/9328b835/Lukasz-Fabianski',
 '/en/squads/a2d435b3/Leicester-City-Stats',
 '/en/players/45963054/Jamie-Vardy',
 '/en/players/53af52f3/Kasper-Schmeichel',
 '/en/squads/d07537b9/Brighton-and-

The list mixes squad URLs and player URLs.
Keep only the squad URLs.

In [None]:
links = [link for link in links if '/squads/' in link]

In [None]:
links

['/en/squads/b8fd03ef/Manchester-City-Stats',
 '/en/squads/822bd0ba/Liverpool-Stats',
 '/en/squads/cff3d9bb/Chelsea-Stats',
 '/en/squads/361ca564/Tottenham-Hotspur-Stats',
 '/en/squads/18bb7c10/Arsenal-Stats',
 '/en/squads/19538871/Manchester-United-Stats',
 '/en/squads/7c21e445/West-Ham-United-Stats',
 '/en/squads/a2d435b3/Leicester-City-Stats',
 '/en/squads/d07537b9/Brighton-and-Hove-Albion-Stats',
 '/en/squads/8cec06e1/Wolverhampton-Wanderers-Stats',
 '/en/squads/b2b47a98/Newcastle-United-Stats',
 '/en/squads/47c64c55/Crystal-Palace-Stats',
 '/en/squads/cd051869/Brentford-Stats',
 '/en/squads/8602292d/Aston-Villa-Stats',
 '/en/squads/33c895d4/Southampton-Stats',
 '/en/squads/d3fd31cc/Everton-Stats',
 '/en/squads/5bfb9659/Leeds-United-Stats',
 '/en/squads/943e8050/Burnley-Stats',
 '/en/squads/2abfe087/Watford-Stats',
 '/en/squads/1c781004/Norwich-City-Stats']

The links are relative: they lack the 'https://fbref.com' prefix.

Add it to build full URLs.

In [None]:
squad_urls = [f"https://fbref.com{link}" for link in links]
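As an alternative to string formatting, the standard library's `urljoin` can combine a base URL with a relative link. This is only a sketch of that option; the constant `BASE_URL` and the `demo_` names are assumptions for this example, and the two links are taken from the output above.

```python
from urllib.parse import urljoin

BASE_URL = "https://fbref.com"  # hypothetical constant for this sketch

demo_links = [
    "/en/squads/b8fd03ef/Manchester-City-Stats",
    "/en/squads/822bd0ba/Liverpool-Stats",
]

# urljoin resolves each site-relative path against the base URL
demo_squad_urls = [urljoin(BASE_URL, link) for link in demo_links]
print(demo_squad_urls)
```

One practical difference is that `urljoin` yields a single slash between host and path whether or not the base ends with one.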

In [None]:
squad_urls

['https://fbref.com/en/squads/b8fd03ef/Manchester-City-Stats',
 'https://fbref.com/en/squads/822bd0ba/Liverpool-Stats',
 'https://fbref.com/en/squads/cff3d9bb/Chelsea-Stats',
 'https://fbref.com/en/squads/361ca564/Tottenham-Hotspur-Stats',
 'https://fbref.com/en/squads/18bb7c10/Arsenal-Stats',
 'https://fbref.com/en/squads/19538871/Manchester-United-Stats',
 'https://fbref.com/en/squads/7c21e445/West-Ham-United-Stats',
 'https://fbref.com/en/squads/a2d435b3/Leicester-City-Stats',
 'https://fbref.com/en/squads/d07537b9/Brighton-and-Hove-Albion-Stats',
 'https://fbref.com/en/squads/8cec06e1/Wolverhampton-Wanderers-Stats',
 'https://fbref.com/en/squads/b2b47a98/Newcastle-United-Stats',
 'https://fbref.com/en/squads/47c64c55/Crystal-Palace-Stats',
 'https://fbref.com/en/squads/cd051869/Brentford-Stats',
 'https://fbref.com/en/squads/8602292d/Aston-Villa-Stats',
 'https://fbref.com/en/squads/33c895d4/Southampton-Stats',
 'https://fbref.com/en/squads/d3fd31cc/Everton-Stats',
 'https://fbref.

---



# 3.Extract match stats using pandas and requests

Work with the first team.

In [None]:
squad_url = squad_urls[0]
data = requests.get(squad_url)

Get only the 'Scores & Fixtures' table

Use **pandas**

In the page's HTML, this table carries the text "Scores & Fixtures" in its <caption>.

Pass that text as the 'match' argument of read_html to find the table.

In [None]:
import pandas as pd

matches = pd.read_html(data.text, match = "Scores & Fixtures")
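The sketch below runs `read_html` with `match` on a made-up two-table page, so the filtering can be seen without downloading anything. The HTML and the `demo_` names are invented for illustration. The text is wrapped in `StringIO` because recent pandas versions (2.1 and later) warn when a literal HTML string is passed directly; `read_html` also needs an HTML parser such as lxml to be installed.

```python
from io import StringIO

import pandas as pd

# Made-up miniature page with two captioned tables
demo_html = """
<table>
  <caption>Scores &amp; Fixtures</caption>
  <thead><tr><th>Date</th><th>GF</th></tr></thead>
  <tbody><tr><td>2021-08-15</td><td>0</td></tr></tbody>
</table>
<table>
  <caption>Shooting</caption>
  <thead><tr><th>Date</th><th>Sh</th></tr></thead>
  <tbody><tr><td>2021-08-15</td><td>18</td></tr></tbody>
</table>
"""

# read_html returns a list holding every table whose text matches `match`
demo_tables = pd.read_html(StringIO(demo_html), match="Scores & Fixtures")
demo_fixtures = demo_tables[0]
print(demo_fixtures)
```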

read_html returns a list of dataframes, so use matches[0] rather than matches to get the dataframe itself (the pandas table format).

head() shows the first 5 rows.

In [None]:
matches[0].head()

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,xG,xGA,Poss,Attendance,Captain,Formation,Referee,Match Report,Notes
0,2021-08-07,17:15,Community Shield,FA Community Shield,Sat,Neutral,L,0,1,Leicester City,,,57,,Fernandinho,4-3-3,Paul Tierney,Match Report,
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0,1,Tottenham,1.9,1.3,64,58262.0,Fernandinho,4-3-3,Anthony Taylor,Match Report,
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5,0,Norwich City,2.7,0.1,67,51437.0,İlkay Gündoğan,4-3-3,Graham Scott,Match Report,
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5,0,Arsenal,3.8,0.1,80,52276.0,İlkay Gündoğan,4-3-3,Martin Atkinson,Match Report,
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1,0,Leicester City,2.9,0.8,61,32087.0,İlkay Gündoğan,4-3-3,Paul Tierney,Match Report,




---



# 4.Get match shooting stats with requests and pandas

Get the 'Shooting' table by following the 'Shooting' link listed under '2021-2022 Match Log Types' on the squad page

In [None]:
soup = BeautifulSoup(data.text)

In [None]:
anchors = soup.find_all('a')

In [None]:
links = [anchor.get("href") for anchor in anchors]

In [None]:
links = [link for link in links if link and 'all_comps/shooting/' in link]

In [None]:
links

['/en/squads/b8fd03ef/2021-2022/matchlogs/all_comps/shooting/Manchester-City-Match-Logs-All-Competitions',
 '/en/squads/b8fd03ef/2021-2022/matchlogs/all_comps/shooting/Manchester-City-Match-Logs-All-Competitions',
 '/en/squads/b8fd03ef/2021-2022/matchlogs/all_comps/shooting/Manchester-City-Match-Logs-All-Competitions',
 '/en/squads/b8fd03ef/2021-2022/matchlogs/all_comps/shooting/Manchester-City-Match-Logs-All-Competitions']

The same link appears several times, so the first one is enough.
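If the repeats need to be removed rather than skipped, one option is `dict.fromkeys`, which keeps the first occurrence of each item and preserves order. This is a sketch on a copy of the list shown above; `demo_links` and `unique_links` are names chosen for this example.

```python
shooting_path = (
    "/en/squads/b8fd03ef/2021-2022/matchlogs/all_comps/shooting/"
    "Manchester-City-Match-Logs-All-Competitions"
)

# The same path four times, as in the output above
demo_links = [shooting_path] * 4

# dict keys are unique and keep insertion order, so this removes repeats
unique_links = list(dict.fromkeys(demo_links))
print(unique_links)
```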

In [None]:
data = requests.get(f"https://fbref.com{links[0]}")

In [None]:
shooting = pd.read_html(data.text, match="Shooting")[0]



---



# 5.Clean and merge scraped data with Pandas

In [None]:
shooting.head()

Unnamed: 0_level_0,For Manchester City,For Manchester City,For Manchester City,For Manchester City,For Manchester City,For Manchester City,For Manchester City,For Manchester City,For Manchester City,For Manchester City,...,Standard,Standard,Standard,Standard,Expected,Expected,Expected,Expected,Expected,Unnamed: 25_level_0
Unnamed: 0_level_1,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,...,Dist,FK,PK,PKatt,xG,npxG,npxG/Sh,G-xG,np:G-xG,Match Report
0,2021-08-07,17:15,Community Shield,FA Community Shield,Sat,Neutral,L,0,1,Leicester City,...,,,0,0,,,,,,Match Report
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0,1,Tottenham,...,16.9,1.0,0,0,1.9,1.9,0.11,-1.9,-1.9,Match Report
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5,0,Norwich City,...,17.3,1.0,0,0,2.7,2.7,0.17,1.3,1.3,Match Report
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5,0,Arsenal,...,14.3,0.0,0,0,3.8,3.8,0.15,1.2,1.2,Match Report
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1,0,Leicester City,...,14.0,0.0,0,0,2.9,2.9,0.12,-1.9,-1.9,Match Report


Cleaning: remove the top level of the multi-level column index

In [None]:
shooting.columns = shooting.columns.droplevel()
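The effect of `droplevel()` is easier to see on a tiny frame. The sketch below builds a made-up dataframe with two header levels, similar in shape to the shooting table, and drops the top one. The column subset and the `demo_` names are chosen for illustration only.

```python
import pandas as pd

# Two-level header: (group, column), mimicking the shooting table layout
demo_columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "Date"),
    ("Standard", "Sh"),
    ("Standard", "SoT"),
    ("Expected", "xG"),
])
demo_shooting = pd.DataFrame([["2021-08-15", 18, 4, 1.9]], columns=demo_columns)

# droplevel() without arguments removes level 0, the top header row
demo_shooting.columns = demo_shooting.columns.droplevel()
print(list(demo_shooting.columns))
```

After this step the columns can be selected by their plain names, for example `demo_shooting["Sh"]`.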

In [None]:
shooting.head()

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,...,Dist,FK,PK,PKatt,xG,npxG,npxG/Sh,G-xG,np:G-xG,Match Report
0,2021-08-07,17:15,Community Shield,FA Community Shield,Sat,Neutral,L,0,1,Leicester City,...,,,0,0,,,,,,Match Report
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0,1,Tottenham,...,16.9,1.0,0,0,1.9,1.9,0.11,-1.9,-1.9,Match Report
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5,0,Norwich City,...,17.3,1.0,0,0,2.7,2.7,0.17,1.3,1.3,Match Report
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5,0,Arsenal,...,14.3,0.0,0,0,3.8,3.8,0.15,1.2,1.2,Match Report
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1,0,Leicester City,...,14.0,0.0,0,0,2.9,2.9,0.12,-1.9,-1.9,Match Report


Merging: Combine the 'matches[0]' table and the 'shooting' table.

Use 'Date' as the merge key.

In [None]:
team_data = matches[0].merge(shooting[["Date","Sh","SoT","Dist","FK","PK","PKatt"]], on="Date")

Merged dataframe

In [None]:
team_data.head()

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,...,Formation,Referee,Match Report,Notes,Sh,SoT,Dist,FK,PK,PKatt
0,2021-08-07,17:15,Community Shield,FA Community Shield,Sat,Neutral,L,0,1,Leicester City,...,4-3-3,Paul Tierney,Match Report,,12,3,,,0,0
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0,1,Tottenham,...,4-3-3,Anthony Taylor,Match Report,,18,4,16.9,1.0,0,0
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5,0,Norwich City,...,4-3-3,Graham Scott,Match Report,,16,4,17.3,1.0,0,0
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5,0,Arsenal,...,4-3-3,Martin Atkinson,Match Report,,25,10,14.3,0.0,0,0
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1,0,Leicester City,...,4-3-3,Paul Tierney,Match Report,,25,8,14.0,0.0,0,0


Check the number of rows

In [None]:
matches[0].shape

(58, 19)

In [None]:
shooting.shape

(59, 26)

In [None]:
team_data.shape

(58, 25)

merge performs an inner join by default, so the row whose 'Date' does not appear in both tables is dropped from the merged result
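To see which row goes missing in a merge like this, an outer merge with `indicator=True` marks where each row came from. The two small frames below are made-up stand-ins for the real tables, and the `demo_` names are assumptions for this sketch.

```python
import pandas as pd

demo_matches = pd.DataFrame({
    "Date": ["2021-08-07", "2021-08-15"],
    "GF": [0, 0],
})
demo_shooting = pd.DataFrame({
    "Date": ["2021-08-07", "2021-08-15", "2021-08-21"],
    "Sh": [12, 18, 16],
})

# Default merge is an inner join: only dates present in both frames survive
demo_inner = demo_matches.merge(demo_shooting, on="Date")

# An outer join keeps every date; the _merge column records its origin
demo_outer = demo_matches.merge(demo_shooting, on="Date", how="outer", indicator=True)
demo_unmatched = demo_outer[demo_outer["_merge"] != "both"]
print(demo_unmatched)
```

Rows marked `left_only` or `right_only` are the ones an inner merge would drop.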



---



# 6.Scraping data for multiple season and teams with a loop

Repeat steps 1 to 5 in a loop, once per season and squad.

In [None]:
years = list(range(2022,2020,-1))

In [None]:
all_matches = []

In [None]:
url = "https://fbref.com/en/comps/9/Premier-League-Stats"

Each season page has a "Previous Season" button. Its link (the 'a.prev' anchor) leads to the previous season's page, so the loop follows it after each season.

In [None]:
import time

for year in years:
    # Season page: collect the squad URLs from the league table
    data = requests.get(url)
    soup = BeautifulSoup(data.text)
    table = soup.select('table.stats_table')[0]
    links = [anchor.get("href") for anchor in table.find_all('a')]
    squad_links = [link for link in links if link and '/squads/' in link]
    full_urls = [f"https://fbref.com{link}" for link in squad_links]

    # The href is site-relative (it starts with "/" like the links above),
    # so it is appended to the host without an extra slash
    previous_season = soup.select("a.prev")[0].get("href")
    url = f"https://fbref.com{previous_season}"

    for full_url in full_urls:
        # e.g. ".../Manchester-City-Stats" -> "Manchester City"
        squad_name = full_url.split("/")[-1].replace("-Stats","").replace("-"," ")

        # Scores & Fixtures table of this squad
        data = requests.get(full_url)
        matches = pd.read_html(data.text, match="Scores & Fixtures")[0]

        # Shooting table, reached through the shooting link on the squad page
        soup = BeautifulSoup(data.text)
        links = [anchor.get("href") for anchor in soup.find_all('a')]
        shooting_links = [link for link in links if link and 'all_comps/shooting/' in link]
        data = requests.get(f"https://fbref.com{shooting_links[0]}")
        shooting = pd.read_html(data.text, match="Shooting")[0]
        shooting.columns = shooting.columns.droplevel()

        try:
            team_data = matches.merge(shooting[["Date","Sh","SoT","Dist","FK","PK","PKatt"]], on="Date")
        except ValueError:
            # Skip squads whose tables cannot be merged
            continue

        # Keep league matches only; copy() so the new columns go into an independent dataframe
        team_data = team_data[team_data["Comp"] == "Premier League"].copy()
        team_data["Season"] = year
        team_data["Team"] = squad_name
        all_matches.append(team_data)

        # Pause between squads to avoid sending requests too quickly
        time.sleep(1)
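The squad name in the loop is derived purely from the last segment of the squad URL. The sketch below isolates that step so it can be checked on its own; the function name `squad_name_from_url` is an assumption for this example and does not appear in the notebook.

```python
def squad_name_from_url(squad_url):
    """Turn '.../Manchester-City-Stats' into 'Manchester City'."""
    # Last path segment, e.g. "Brighton-and-Hove-Albion-Stats"
    last_segment = squad_url.split("/")[-1]
    # Drop the "-Stats" suffix, then turn the remaining hyphens into spaces
    return last_segment.replace("-Stats", "").replace("-", " ")


demo_name = squad_name_from_url(
    "https://fbref.com/en/squads/d07537b9/Brighton-and-Hove-Albion-Stats"
)
print(demo_name)
```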

In [None]:
match_df = pd.concat(all_matches)

In [None]:
match_df.columns = [c.lower() for c in match_df.columns]

In [None]:
match_df.to_csv("matches.csv")



---



# 7.Final match results dataframe

In [None]:
match_df

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,match report,notes,sh,sot,dist,fk,pk,pkatt,season,team
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0,1,Tottenham,...,Match Report,,18.0,4.0,16.9,1.0,0.0,0.0,2022,Manchester City
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5,0,Norwich City,...,Match Report,,16.0,4.0,17.3,1.0,0.0,0.0,2022,Manchester City
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5,0,Arsenal,...,Match Report,,25.0,10.0,14.3,0.0,0.0,0.0,2022,Manchester City
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1,0,Leicester City,...,Match Report,,25.0,8.0,14.0,0.0,0.0,0.0,2022,Manchester City
6,2021-09-18,15:00,Premier League,Matchweek 5,Sat,Home,D,0,0,Southampton,...,Match Report,,16.0,1.0,15.7,1.0,0.0,0.0,2022,Manchester City
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38,2021-05-02,19:15,Premier League,Matchweek 34,Sun,Away,L,0,4,Tottenham,...,Match Report,,8.0,1.0,17.4,0.0,0.0,0.0,2021,Sheffield United
39,2021-05-08,15:00,Premier League,Matchweek 35,Sat,Home,L,0,2,Crystal Palace,...,Match Report,,7.0,0.0,11.4,1.0,0.0,0.0,2021,Sheffield United
40,2021-05-16,19:00,Premier League,Matchweek 36,Sun,Away,W,1,0,Everton,...,Match Report,,10.0,3.0,17.0,0.0,0.0,0.0,2021,Sheffield United
41,2021-05-19,18:00,Premier League,Matchweek 37,Wed,Away,L,0,1,Newcastle Utd,...,Match Report,,11.0,1.0,16.0,1.0,0.0,0.0,2021,Sheffield United
