# Web Scraping Football Matches from the EPL

## Objective

In this project, we will learn how to scrape football match data from the English Premier League (EPL). First, we will download the matches played over several seasons using Python and the Requests library. Then we will parse and clean the data with the BeautifulSoup and pandas libraries. By the end, we will have a single pandas DataFrame containing the EPL matches from those seasons.

## Scraping our first page with requests

In [1]:
import requests

In [2]:
standings = "https://fbref.com/en/comps/9/Premier-League-Stats"

In [3]:
data = requests.get(standings)

In [9]:
data.text[:1000]

'    \n      \n<!DOCTYPE html>\n<html data-version="klecko-" data-root="/home/fb/deploy/www/base" itemscope itemtype="https://schema.org/WebSite" lang="en" class="no-js" >\n<head>\n    <meta charset="utf-8">\n    <meta http-equiv="x-ua-compatible" content="ie=edge">\n    <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=2.0" />\n    <link rel="dns-prefetch" href="https://d2p3bygnnzw9w3.cloudfront.net/req/202204185" />\n    <!-- Quantcast Choice. Consent Manager Tag v2.0 (for TCF 2.0) -->\n<script type="text/javascript" async=true>\n    (function() {\n\tvar host = window.location.hostname;\n\tvar element = document.createElement(\'script\');\n\tvar firstScript = document.getElementsByTagName(\'script\')[0];\n\tvar url = \'https://quantcast.mgr.consensu.org\'\n\t    .concat(\'/choice/\', \'XwNYEpNeFfhfr\', \'/\', host, \'/choice.js\')\n\tvar uspTries = 0;\n\tvar uspTriesLimit = 3;\n\telement.async = true;\n\telement.type = \'text/javascript\';\n\telement
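
Before parsing a response, it can help to confirm that the request actually succeeded, since an error page would otherwise be parsed as if it were data. The helper below is a minimal sketch, not part of the original notebook: `get_html` and `pause_seconds` are hypothetical names, and the `get` argument is assumed to behave like `requests.get` (it returns an object with `status_code` and `text` attributes). The pause is there because sites may throttle or block clients that send requests in rapid succession.

```python
import time


def get_html(url, get, pause_seconds=0.0):
    """Fetch `url` with the supplied `get` callable and return the page text.

    `get` is assumed to behave like `requests.get`: it returns an object
    with `status_code` and `text` attributes.
    """
    response = get(url)
    if response.status_code != 200:
        # Fail loudly instead of parsing an error page as if it were data.
        raise RuntimeError(f"GET {url} returned status {response.status_code}")
    # Pause after each successful request so repeated calls are spaced out.
    time.sleep(pause_seconds)
    return response.text
```

In the notebook this could be called as `get_html(standings, requests.get, pause_seconds=1)`; the one-second pause is an arbitrary illustration, not a documented limit.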

## Parsing HTML links with BeautifulSoup

In [10]:
from bs4 import BeautifulSoup

In [12]:
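# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data.text, "html.parser")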

In [13]:
standings_table = soup.select("table.stats_table")[0]

In [15]:
links = standings_table.find_all('a')

In [19]:
links = [l.get('href') for l in links]
links[:10]

['/en/squads/b8fd03ef/Manchester-City-Stats',
 '/en/matches/c294f564/Burnley-Manchester-City-April-2-2022-Premier-League',
 '/en/matches/37e2fe92/Manchester-City-Liverpool-April-10-2022-Premier-League',
 '/en/matches/34fd93f9/Manchester-City-Brighton-and-Hove-Albion-April-20-2022-Premier-League',
 '/en/matches/af522ca3/Manchester-City-Watford-April-23-2022-Premier-League',
 '/en/matches/5ce80a04/Leeds-United-Manchester-City-April-30-2022-Premier-League',
 '/en/players/892d5bb1/Riyad-Mahrez',
 '/en/players/e46012d4/Kevin-De-Bruyne',
 '/en/players/3bb7b8b4/Ederson',
 '/en/squads/822bd0ba/Liverpool-Stats']
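
To see what the selector and the link extraction do without downloading anything, the same steps can be run on a small inline document. This is a sketch under stated assumptions: `SAMPLE_HTML` is a hypothetical, heavily trimmed stand-in for the standings page, and the parser is named explicitly as `"html.parser"`.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily trimmed stand-in for the standings page.
SAMPLE_HTML = """
<table class="stats_table">
  <tr><td><a href="/en/squads/b8fd03ef/Manchester-City-Stats">Manchester City</a></td>
      <td><a href="/en/players/892d5bb1/Riyad-Mahrez">Riyad Mahrez</a></td></tr>
  <tr><td><a href="/en/squads/822bd0ba/Liverpool-Stats">Liverpool</a></td></tr>
</table>
<table class="other"><tr><td><a href="/en/squads/ignored">Ignored</a></td></tr></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select() takes a CSS selector: "table.stats_table" matches every
# <table> element whose class list contains "stats_table".
tables = soup.select("table.stats_table")
if not tables:
    raise ValueError("no table with class 'stats_table' found")

# Collect the href of every anchor inside the first matching table only.
hrefs = [a.get("href") for a in tables[0].find_all("a")]

# Keep squad pages; the `h and` guard skips anchors without an href.
squad_hrefs = [h for h in hrefs if h and "/squads/" in h]
print(squad_hrefs)
```

The link in the second table is never collected, because only the first `stats_table` match is searched.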

In [20]:
links = [l for l in links if "/squads/" in l]
links

['/en/squads/b8fd03ef/Manchester-City-Stats',
 '/en/squads/822bd0ba/Liverpool-Stats',
 '/en/squads/cff3d9bb/Chelsea-Stats',
 '/en/squads/18bb7c10/Arsenal-Stats',
 '/en/squads/361ca564/Tottenham-Hotspur-Stats',
 '/en/squads/19538871/Manchester-United-Stats',
 '/en/squads/7c21e445/West-Ham-United-Stats',
 '/en/squads/8cec06e1/Wolverhampton-Wanderers-Stats',
 '/en/squads/d07537b9/Brighton-and-Hove-Albion-Stats',
 '/en/squads/b2b47a98/Newcastle-United-Stats',
 '/en/squads/a2d435b3/Leicester-City-Stats',
 '/en/squads/47c64c55/Crystal-Palace-Stats',
 '/en/squads/8602292d/Aston-Villa-Stats',
 '/en/squads/cd051869/Brentford-Stats',
 '/en/squads/33c895d4/Southampton-Stats',
 '/en/squads/943e8050/Burnley-Stats',
 '/en/squads/5bfb9659/Leeds-United-Stats',
 '/en/squads/d3fd31cc/Everton-Stats',
 '/en/squads/2abfe087/Watford-Stats',
 '/en/squads/1c781004/Norwich-City-Stats']

In [22]:
team_urls = [f"https://fbref.com{l}" for l in links]
team_urls

['https://fbref.com/en/squads/b8fd03ef/Manchester-City-Stats',
 'https://fbref.com/en/squads/822bd0ba/Liverpool-Stats',
 'https://fbref.com/en/squads/cff3d9bb/Chelsea-Stats',
 'https://fbref.com/en/squads/18bb7c10/Arsenal-Stats',
 'https://fbref.com/en/squads/361ca564/Tottenham-Hotspur-Stats',
 'https://fbref.com/en/squads/19538871/Manchester-United-Stats',
 'https://fbref.com/en/squads/7c21e445/West-Ham-United-Stats',
 'https://fbref.com/en/squads/8cec06e1/Wolverhampton-Wanderers-Stats',
 'https://fbref.com/en/squads/d07537b9/Brighton-and-Hove-Albion-Stats',
 'https://fbref.com/en/squads/b2b47a98/Newcastle-United-Stats',
 'https://fbref.com/en/squads/a2d435b3/Leicester-City-Stats',
 'https://fbref.com/en/squads/47c64c55/Crystal-Palace-Stats',
 'https://fbref.com/en/squads/8602292d/Aston-Villa-Stats',
 'https://fbref.com/en/squads/cd051869/Brentford-Stats',
 'https://fbref.com/en/squads/33c895d4/Southampton-Stats',
 'https://fbref.com/en/squads/943e8050/Burnley-Stats',
 'https://fbref.
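
The f-string above works because every link starts with `/`. As an alternative sketch, the standard library's `urllib.parse.urljoin` resolves relative links against a base URL and leaves links that are already absolute untouched. `to_absolute` and `BASE_URL` are hypothetical names introduced for this illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://fbref.com"


def to_absolute(hrefs, base=BASE_URL):
    """Resolve each href against `base`, dropping empties and duplicates."""
    seen = {}
    for href in hrefs:
        if href:
            # dict keys keep insertion order, so the first occurrence wins.
            seen.setdefault(urljoin(base, href), None)
    return list(seen)
```

Calling `to_absolute(links)` on the filtered squad links would give the same URLs as the f-string version, with any repeated link collapsed to a single entry.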

## Extract match stats using pandas and requests

In [23]:
team_url = team_urls[0]

In [24]:
data = requests.get(team_url)

In [25]:
import pandas as pd

matches = pd.read_html(data.text, match="Scores & Fixtures")

In [28]:
matches[0][:10]

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,xG,xGA,Poss,Attendance,Captain,Formation,Referee,Match Report,Notes
0,2021-08-07,17:15,Community Shield,FA Community Shield,Sat,Neutral,L,0,1,Leicester City,,,57.0,,Fernandinho,4-3-3,Paul Tierney,Match Report,
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0,1,Tottenham,1.9,1.3,64.0,58262.0,Fernandinho,4-3-3,Anthony Taylor,Match Report,
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5,0,Norwich City,2.7,0.1,67.0,51437.0,İlkay Gündoğan,4-3-3,Graham Scott,Match Report,
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5,0,Arsenal,3.8,0.1,80.0,52276.0,İlkay Gündoğan,4-3-3,Martin Atkinson,Match Report,
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1,0,Leicester City,2.9,0.8,61.0,32087.0,İlkay Gündoğan,4-3-3,Paul Tierney,Match Report,
5,2021-09-15,20:00,Champions Lg,Group stage,Wed,Home,W,6,3,de RB Leipzig,2.1,0.6,51.0,38062.0,Rúben Dias,4-3-3,Serdar Gözübüyük,Match Report,
6,2021-09-18,15:00,Premier League,Matchweek 5,Sat,Home,D,0,0,Southampton,1.1,0.4,63.0,52698.0,Fernandinho,4-3-3,Jonathan Moss,Match Report,
7,2021-09-21,19:45,EFL Cup,Third round,Tue,Home,W,6,1,Wycombe,,,79.0,30959.0,Kevin De Bruyne,4-3-3,Robert Jones,Match Report,
8,2021-09-25,12:30,Premier League,Matchweek 6,Sat,Away,W,1,0,Chelsea,1.7,0.3,60.0,40036.0,Rúben Dias,4-3-3,Michael Oliver,Match Report,
9,2021-09-28,21:00,Champions Lg,Group stage,Tue,Away,L,0,2,fr Paris S-G,1.9,0.8,54.0,37350.0,Rúben Dias,4-3-3,Carlos del Cerro,Match Report,
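
The table above mixes competitions (Community Shield, Champions Lg, EFL Cup) with league matches. Since the goal is a DataFrame of EPL matches, one possible follow-up is to filter on the `Comp` column. This step is not shown in this part of the notebook; the sketch below uses a hypothetical three-row `sample` frame built from rows in the output above.

```python
import pandas as pd

# Three rows copied from the output above, reduced to a few columns.
sample = pd.DataFrame(
    {
        "Date": ["2021-08-07", "2021-08-15", "2021-09-15"],
        "Comp": ["Community Shield", "Premier League", "Champions Lg"],
        "Opponent": ["Leicester City", "Tottenham", "de RB Leipzig"],
    }
)

# A boolean mask keeps only league matches; reset_index renumbers the rows.
league_only = sample[sample["Comp"] == "Premier League"].reset_index(drop=True)
print(league_only)
```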


## Get match shooting stats

In [29]:
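# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data.text, "html.parser")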

In [30]:
links = soup.find_all('a')

In [31]:
links = [l.get('href') for l in links]

In [32]:
links = [l for l in links if l and "all_comps/shooting/" in l]

In [33]:
links

['/en/squads/b8fd03ef/2021-2022/matchlogs/all_comps/shooting/Manchester-City-Match-Logs-All-Competitions',
 '/en/squads/b8fd03ef/2021-2022/matchlogs/all_comps/shooting/Manchester-City-Match-Logs-All-Competitions',
 '/en/squads/b8fd03ef/2021-2022/matchlogs/all_comps/shooting/Manchester-City-Match-Logs-All-Competitions',
 '/en/squads/b8fd03ef/2021-2022/matchlogs/all_comps/shooting/Manchester-City-Match-Logs-All-Competitions']
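
The same link appears four times in the output, and indexing with `links[0]` would raise an `IndexError` if a page had no shooting link at all. A small helper can handle both cases. This is a sketch; `first_link_containing` is a hypothetical name, not something from the original notebook.

```python
def first_link_containing(hrefs, fragment):
    """Return the first href containing `fragment`, or None if there is none."""
    # dict.fromkeys removes duplicates while preserving the original order;
    # the `h and` guard skips anchors that had no href attribute.
    unique = list(dict.fromkeys(h for h in hrefs if h and fragment in h))
    return unique[0] if unique else None
```

A caller can then test the result for `None` before building the request URL.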

In [34]:
data = requests.get(f"https://fbref.com{links[0]}")

In [35]:
shooting = pd.read_html(data.text, match='Shooting')[0]

In [36]:
shooting.head()

Unnamed: 0_level_0,For Manchester City,For Manchester City,For Manchester City,For Manchester City,For Manchester City,For Manchester City,For Manchester City,For Manchester City,For Manchester City,For Manchester City,Standard,Standard,Standard,Standard,Standard,Standard,Standard,Standard,Standard,Standard,Expected,Expected,Expected,Expected,Expected,Unnamed: 25_level_0
Unnamed: 0_level_1,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,Gls,Sh,SoT,SoT%,G/Sh,G/SoT,Dist,FK,PK,PKatt,xG,npxG,npxG/Sh,G-xG,np:G-xG,Match Report
0,2021-08-07,17:15,Community Shield,FA Community Shield,Sat,Neutral,L,0,1,Leicester City,0,12,3,25.0,0.0,0.0,,,0,0,,,,,,Match Report
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0,1,Tottenham,0,18,4,22.2,0.0,0.0,16.9,1.0,0,0,1.9,1.9,0.11,-1.9,-1.9,Match Report
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5,0,Norwich City,4,16,4,25.0,0.25,1.0,17.3,1.0,0,0,2.7,2.7,0.17,1.3,1.3,Match Report
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5,0,Arsenal,5,25,10,40.0,0.2,0.5,14.3,0.0,0,0,3.8,3.8,0.15,1.2,1.2,Match Report
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1,0,Leicester City,1,25,8,32.0,0.04,0.13,14.0,0.0,0,0,2.9,2.9,0.12,-1.9,-1.9,Match Report


## Cleaning and merging scraped data

In [37]:
shooting.columns = shooting.columns.droplevel()

In [38]:
shooting.head()

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,Gls,Sh,SoT,SoT%,G/Sh,G/SoT,Dist,FK,PK,PKatt,xG,npxG,npxG/Sh,G-xG,np:G-xG,Match Report
0,2021-08-07,17:15,Community Shield,FA Community Shield,Sat,Neutral,L,0,1,Leicester City,0,12,3,25.0,0.0,0.0,,,0,0,,,,,,Match Report
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0,1,Tottenham,0,18,4,22.2,0.0,0.0,16.9,1.0,0,0,1.9,1.9,0.11,-1.9,-1.9,Match Report
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5,0,Norwich City,4,16,4,25.0,0.25,1.0,17.3,1.0,0,0,2.7,2.7,0.17,1.3,1.3,Match Report
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5,0,Arsenal,5,25,10,40.0,0.2,0.5,14.3,0.0,0,0,3.8,3.8,0.15,1.2,1.2,Match Report
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1,0,Leicester City,1,25,8,32.0,0.04,0.13,14.0,0.0,0,0,2.9,2.9,0.12,-1.9,-1.9,Match Report
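
The shooting table has a two-level column header (a group such as `Standard` above a stat such as `Sh`), which pandas represents as a `MultiIndex`. Calling `droplevel()` with no argument removes the outermost level. The toy frame below is a hypothetical miniature of the shooting table, used only to show the effect.

```python
import pandas as pd

# Two-level header: (group, stat), mirroring the shape of the shooting table.
columns = pd.MultiIndex.from_tuples(
    [
        ("For Manchester City", "Date"),
        ("Standard", "Sh"),
        ("Standard", "SoT"),
        ("Expected", "xG"),
    ]
)
frame = pd.DataFrame([["2021-08-15", 18, 4, 1.9]], columns=columns)

# droplevel(0) drops the outer (group) level and keeps the stat names.
frame.columns = frame.columns.droplevel(0)
print(list(frame.columns))  # ['Date', 'Sh', 'SoT', 'xG']
```

After the drop, columns can be selected by their plain names, which is what the merge on `Date` relies on.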


In [42]:
team_data = pd.merge(matches[0], shooting[['Date', 'Sh', 'SoT', 'Dist', 'FK', 'PK', 'PKatt']], on='Date')

In [44]:
team_data.head()

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,xG,xGA,Poss,Attendance,Captain,Formation,Referee,Match Report,Notes,Sh,SoT,Dist,FK,PK,PKatt
0,2021-08-07,17:15,Community Shield,FA Community Shield,Sat,Neutral,L,0,1,Leicester City,,,57.0,,Fernandinho,4-3-3,Paul Tierney,Match Report,,12,3,,,0,0
1,2021-08-15,16:30,Premier League,Matchweek 1,Sun,Away,L,0,1,Tottenham,1.9,1.3,64.0,58262.0,Fernandinho,4-3-3,Anthony Taylor,Match Report,,18,4,16.9,1.0,0,0
2,2021-08-21,15:00,Premier League,Matchweek 2,Sat,Home,W,5,0,Norwich City,2.7,0.1,67.0,51437.0,İlkay Gündoğan,4-3-3,Graham Scott,Match Report,,16,4,17.3,1.0,0,0
3,2021-08-28,12:30,Premier League,Matchweek 3,Sat,Home,W,5,0,Arsenal,3.8,0.1,80.0,52276.0,İlkay Gündoğan,4-3-3,Martin Atkinson,Match Report,,25,10,14.3,0.0,0,0
4,2021-09-11,15:00,Premier League,Matchweek 4,Sat,Away,W,1,0,Leicester City,2.9,0.8,61.0,32087.0,İlkay Gündoğan,4-3-3,Paul Tierney,Match Report,,25,8,14.0,0.0,0,0


In [45]:
team_data.shape

(53, 25)
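
By default `pd.merge` performs an inner join, so a match whose `Date` is missing from either table is dropped without any message. One way to check for that, sketched below on hypothetical toy frames, is to use a left join with `indicator=True` and inspect the rows that found no partner. The `validate="one_to_one"` argument additionally raises an error if a date is duplicated in either table.

```python
import pandas as pd

# Toy stand-ins: the third match has no row in the shooting table.
matches = pd.DataFrame(
    {"Date": ["2021-08-07", "2021-08-15", "2021-08-21"], "GF": [0, 0, 5]}
)
shooting = pd.DataFrame({"Date": ["2021-08-07", "2021-08-15"], "Sh": [12, 18]})

merged = pd.merge(
    matches,
    shooting,
    on="Date",
    how="left",             # keep every match, even without shooting stats
    validate="one_to_one",  # raise if a Date is duplicated on either side
    indicator=True,         # adds a "_merge" column describing each row
)

# Rows marked "left_only" exist in matches but not in shooting.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["Date", "GF"]])
```

Comparing `len(merged)` with the length of the original matches table gives a quick check that no rows were lost.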