# Soccer Data Scraping

1.) Data Acquisition: Using BeautifulSoup to scrape English Premier League data from FBref (fbref.com) for every team over the past few seasons

2.) Data Conversion: Converting the scraped data to CSV format for analysis

3.) Analysis: Employing machine learning models to predict match outcomes based on the gathered data

In [1]:
# importing required libraries

import pandas as pd
import requests
from bs4 import BeautifulSoup

In [2]:
# source link

source_url = "https://fbref.com/en/comps/9/2023-2024/2023-2024-Premier-League-Stats"

In [3]:
# downloading the html of the source

data = requests.get(source_url)
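
The bare `requests.get` call above does not check whether the download succeeded, so an error page could be parsed as if it were the league table. A minimal sketch of a more defensive fetch follows. The helper name `fetch_html`, the 3-second pause, and the 30-second timeout are assumptions made for illustration and are not part of the original notebook.

```python
import time


def fetch_html(url, get, pause_seconds=3.0, timeout=30):
    """Download one page and return its HTML text.

    `get` is any callable with the same call shape as `requests.get`;
    passing it in keeps the helper usable without network access in tests.
    """
    time.sleep(pause_seconds)        # pause before each request to avoid hammering the site
    response = get(url, timeout=timeout)
    response.raise_for_status()      # raise on 4xx/5xx instead of parsing an error page
    return response.text
```

In the notebook this could be called as `fetch_html(source_url, requests.get)`, and the returned string used wherever `data.text` appears.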

In [5]:
# Parse the HTML with BeautifulSoup; naming the parser explicitly avoids the "no parser specified" warning

soup = BeautifulSoup(data.text, "html.parser")

In [6]:
# CSS selector to get the table stats

stats_table = soup.select('table.stats_table')[0]

In [8]:
links = stats_table.find_all('a')

In [9]:
links = [i.get("href") for i in links]

In [10]:
# Extracting the squad links based on the stats_table

links = [i for i in links if i and '/squads/' in i]

In [11]:
links

['/en/squads/18bb7c10/Arsenal-Stats',
 '/en/squads/b8fd03ef/Manchester-City-Stats',
 '/en/squads/822bd0ba/Liverpool-Stats',
 '/en/squads/8602292d/Aston-Villa-Stats',
 '/en/squads/361ca564/Tottenham-Hotspur-Stats',
 '/en/squads/19538871/Manchester-United-Stats',
 '/en/squads/b2b47a98/Newcastle-United-Stats',
 '/en/squads/7c21e445/West-Ham-United-Stats',
 '/en/squads/cff3d9bb/Chelsea-Stats',
 '/en/squads/8cec06e1/Wolverhampton-Wanderers-Stats',
 '/en/squads/4ba7cbea/Bournemouth-Stats',
 '/en/squads/d07537b9/Brighton-and-Hove-Albion-Stats',
 '/en/squads/fd962109/Fulham-Stats',
 '/en/squads/47c64c55/Crystal-Palace-Stats',
 '/en/squads/d3fd31cc/Everton-Stats',
 '/en/squads/cd051869/Brentford-Stats',
 '/en/squads/e4a775cb/Nottingham-Forest-Stats',
 '/en/squads/e297cd13/Luton-Town-Stats',
 '/en/squads/943e8050/Burnley-Stats',
 '/en/squads/1df6b87e/Sheffield-United-Stats']
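
The four cells above (select the table, collect anchors, read `href`, filter on `/squads/`) can be folded into one function. This is only a sketch: the name `squad_links` and the inline sample HTML (including the made-up player link) are illustrative assumptions, not something taken from the site.

```python
from bs4 import BeautifulSoup


def squad_links(html):
    """Return the '/squads/' hrefs found in the first stats table, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    tables = soup.select("table.stats_table")
    if not tables:
        raise ValueError("no table with class 'stats_table' found")
    hrefs = [a.get("href") for a in tables[0].find_all("a")]
    # Anchors without an href yield None, so guard before the substring test.
    return [h for h in hrefs if h and "/squads/" in h]


# Tiny hand-written page used only to exercise the function.
sample = """
<table class="stats_table">
  <tr><td><a href="/en/squads/18bb7c10/Arsenal-Stats">Arsenal</a></td>
      <td><a href="/en/players/example/Some-Player">Some Player</a></td>
      <td><a>no href</a></td></tr>
</table>
"""
print(squad_links(sample))
```

Raising when no table is found makes a layout change or a blocked request fail loudly instead of producing an `IndexError` on `[0]`.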

In [12]:
# Prepending the base URL to build absolute links for data extraction

team_links = [f"https://fbref.com{i}" for i in links]

In [13]:
team_links

['https://fbref.com/en/squads/18bb7c10/Arsenal-Stats',
 'https://fbref.com/en/squads/b8fd03ef/Manchester-City-Stats',
 'https://fbref.com/en/squads/822bd0ba/Liverpool-Stats',
 'https://fbref.com/en/squads/8602292d/Aston-Villa-Stats',
 'https://fbref.com/en/squads/361ca564/Tottenham-Hotspur-Stats',
 'https://fbref.com/en/squads/19538871/Manchester-United-Stats',
 'https://fbref.com/en/squads/b2b47a98/Newcastle-United-Stats',
 'https://fbref.com/en/squads/7c21e445/West-Ham-United-Stats',
 'https://fbref.com/en/squads/cff3d9bb/Chelsea-Stats',
 'https://fbref.com/en/squads/8cec06e1/Wolverhampton-Wanderers-Stats',
 'https://fbref.com/en/squads/4ba7cbea/Bournemouth-Stats',
 'https://fbref.com/en/squads/d07537b9/Brighton-and-Hove-Albion-Stats',
 'https://fbref.com/en/squads/fd962109/Fulham-Stats',
 'https://fbref.com/en/squads/47c64c55/Crystal-Palace-Stats',
 'https://fbref.com/en/squads/d3fd31cc/Everton-Stats',
 'https://fbref.com/en/squads/cd051869/Brentford-Stats',
 'https://fbref.com/en/s

In [14]:
current_team = team_links[0]
val = requests.get(current_team)

In [16]:
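# Using pandas to read the HTML and get the 'Scores & Fixtures' table
# (wrapping the text in StringIO avoids the deprecation warning newer pandas raises for literal HTML strings)

from io import StringIO

match_data = pd.read_html(StringIO(val.text), match = 'Scores & Fixtures')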

In [17]:
match_data[0]

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,xG,xGA,Poss,Attendance,Captain,Formation,Referee,Match Report,Notes
0,2023-08-06,16:00,Community Shield,FA Community Shield,Sun,Neutral,D,1 (4),1 (1),Manchester City,,,45.0,81145.0,Martin Ødegaard,4-3-3,Stuart Attwell,Match Report,Arsenal won on penalty kicks following normal ...
1,2023-08-12,12:30,Premier League,Matchweek 1,Sat,Home,W,2,1,Nott'ham Forest,0.8,1.2,78.0,59984.0,Martin Ødegaard,4-3-3,Michael Oliver,Match Report,
2,2023-08-21,20:00,Premier League,Matchweek 2,Mon,Away,W,1,0,Crystal Palace,2.0,1.0,53.0,24189.0,Martin Ødegaard,4-3-3,David Coote,Match Report,
3,2023-08-26,15:00,Premier League,Matchweek 3,Sat,Home,D,2,2,Fulham,3.2,0.6,71.0,59961.0,Martin Ødegaard,4-3-3,Paul Tierney,Match Report,
4,2023-09-03,16:30,Premier League,Matchweek 4,Sun,Home,W,3,1,Manchester Utd,2.3,0.9,55.0,60192.0,Martin Ødegaard,4-3-3,Anthony Taylor,Match Report,
5,2023-09-17,16:30,Premier League,Matchweek 5,Sun,Away,W,1,0,Everton,1.0,0.3,74.0,39217.0,Martin Ødegaard,4-3-3,Simon Hooper,Match Report,
6,2023-09-20,20:00,Champions Lg,Group stage,Wed,Home,W,4,0,nl PSV Eindhoven,2.3,0.5,58.0,58860.0,Martin Ødegaard,4-3-3,Felix Zwayer,Match Report,
7,2023-09-24,14:00,Premier League,Matchweek 6,Sun,Home,D,2,2,Tottenham,1.8,1.4,47.0,60156.0,Martin Ødegaard,4-3-3,Robert Jones,Match Report,
8,2023-09-27,19:45,EFL Cup,Third round,Wed,Away,W,1,0,Brentford,,,60.0,16688.0,Jorginho,4-3-3,Darren Bond,Match Report,
9,2023-09-30,15:00,Premier League,Matchweek 7,Sat,Away,W,4,0,Bournemouth,3.4,0.6,57.0,11193.0,Martin Ødegaard,4-3-3,Michael Salisbury,Match Report,


In [18]:
soup = BeautifulSoup(val.text, "html.parser")

In [19]:
links = soup.find_all('a')

In [20]:
links = [i.get("href") for i in links]

In [21]:
# As above, keep only the links to the shooting match logs for all competitions

links = [i for i in links if i and 'all_comps/shooting/' in i]

In [22]:
links

['/en/squads/18bb7c10/2023-2024/matchlogs/all_comps/shooting/Arsenal-Match-Logs-All-Competitions',
 '/en/squads/18bb7c10/2023-2024/matchlogs/all_comps/shooting/Arsenal-Match-Logs-All-Competitions',
 '/en/squads/18bb7c10/2023-2024/matchlogs/all_comps/shooting/Arsenal-Match-Logs-All-Competitions',
 '/en/squads/18bb7c10/2023-2024/matchlogs/all_comps/shooting/Arsenal-Match-Logs-All-Competitions']
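
The output above lists the same link four times, and the notebook simply takes `links[0]`. One possible way to collapse repeats and build absolute URLs in a single step is sketched below. The helper `absolute_unique` and the constant `BASE_URL` are hypothetical names introduced here.

```python
from urllib.parse import urljoin

BASE_URL = "https://fbref.com"


def absolute_unique(hrefs, base=BASE_URL):
    """Drop empty and repeated hrefs (keeping first-seen order) and make them absolute."""
    unique = dict.fromkeys(h for h in hrefs if h)   # dict keys preserve insertion order
    return [urljoin(base, h) for h in unique]


shooting_href = (
    "/en/squads/18bb7c10/2023-2024/matchlogs/all_comps/shooting/"
    "Arsenal-Match-Logs-All-Competitions"
)
print(absolute_unique([shooting_href] * 4 + [None]))
```

With the duplicates removed, an empty result clearly signals that the page had no shooting link, which is easier to diagnose than an `IndexError` from `links[0]`.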

In [23]:
val = requests.get(f"https://fbref.com{links[0]}")

In [24]:
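# Using pandas to read the HTML and get the 'Shooting' table
# (wrapping the text in StringIO avoids the deprecation warning newer pandas raises for literal HTML strings)

from io import StringIO

shoot_data = pd.read_html(StringIO(val.text), match = 'Shooting')[0]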

In [25]:
shoot_data.head()

Unnamed: 0_level_0,For Arsenal,For Arsenal,For Arsenal,For Arsenal,For Arsenal,For Arsenal,For Arsenal,For Arsenal,For Arsenal,For Arsenal,...,Standard,Standard,Standard,Standard,Expected,Expected,Expected,Expected,Expected,Unnamed: 25_level_0
Unnamed: 0_level_1,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,...,Dist,FK,PK,PKatt,xG,npxG,npxG/Sh,G-xG,np:G-xG,Match Report
0,2023-08-06,16:00,Community Shield,FA Community Shield,Sun,Neutral,D,1 (4),1 (1),Manchester City,...,,,0,0,,,,,,Match Report
1,2023-08-12,12:30,Premier League,Matchweek 1,Sat,Home,W,2,1,Nott'ham Forest,...,19.1,0.0,0,0,0.8,0.8,0.06,1.2,1.2,Match Report
2,2023-08-21,20:00,Premier League,Matchweek 2,Mon,Away,W,1,0,Crystal Palace,...,16.4,0.0,1,1,2.0,1.2,0.09,-1.0,-1.2,Match Report
3,2023-08-26,15:00,Premier League,Matchweek 3,Sat,Home,D,2,2,Fulham,...,13.8,0.0,1,1,3.2,2.4,0.14,-1.2,-1.4,Match Report
4,2023-09-03,16:30,Premier League,Matchweek 4,Sun,Home,W,3,1,Manchester Utd,...,15.0,0.0,0,0,2.3,2.3,0.13,0.7,0.7,Match Report


In [26]:
shoot_data.columns = shoot_data.columns.droplevel()

In [27]:
shoot_data.head()

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,...,Dist,FK,PK,PKatt,xG,npxG,npxG/Sh,G-xG,np:G-xG,Match Report
0,2023-08-06,16:00,Community Shield,FA Community Shield,Sun,Neutral,D,1 (4),1 (1),Manchester City,...,,,0,0,,,,,,Match Report
1,2023-08-12,12:30,Premier League,Matchweek 1,Sat,Home,W,2,1,Nott'ham Forest,...,19.1,0.0,0,0,0.8,0.8,0.06,1.2,1.2,Match Report
2,2023-08-21,20:00,Premier League,Matchweek 2,Mon,Away,W,1,0,Crystal Palace,...,16.4,0.0,1,1,2.0,1.2,0.09,-1.0,-1.2,Match Report
3,2023-08-26,15:00,Premier League,Matchweek 3,Sat,Home,D,2,2,Fulham,...,13.8,0.0,1,1,3.2,2.4,0.14,-1.2,-1.4,Match Report
4,2023-09-03,16:30,Premier League,Matchweek 4,Sun,Home,W,3,1,Manchester Utd,...,15.0,0.0,0,0,2.3,2.3,0.13,0.7,0.7,Match Report
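
To see what `droplevel()` did to the two-row header, here is a small self-contained sketch. The frame `demo` is hand-built for illustration, reusing a few group and column names and the Matchweek 1 values from the output above.

```python
import pandas as pd

# Two-level header: (group label, column name), as read_html returns for the shooting table.
columns = pd.MultiIndex.from_tuples([
    ("For Arsenal", "Date"),
    ("Standard", "Sh"),
    ("Standard", "SoT"),
    ("Expected", "xG"),
])
demo = pd.DataFrame([["2023-08-12", 15, 7, 0.8]], columns=columns)

# droplevel() with no argument removes level 0, i.e. the group labels on top.
demo.columns = demo.columns.droplevel()
print(list(demo.columns))
```

After flattening, columns are addressed by a single name such as `demo["Sh"]`. If two groups ever shared a column name, the flattened header would contain duplicates, so it is worth checking `demo.columns.is_unique` on real data.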


In [28]:
# Shooting stats for a single team in a single season

shoot_data.shape

(50, 26)

In [29]:
# Merging Scores & Fixtures with the shooting columns needed for prediction, joined on 'Date'

merge_data = match_data[0].merge(shoot_data[["Date", "Sh", "SoT", "Dist", "FK", "PK", "PKatt"]], on = 'Date')

In [30]:
merge_data

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,...,Formation,Referee,Match Report,Notes,Sh,SoT,Dist,FK,PK,PKatt
0,2023-08-06,16:00,Community Shield,FA Community Shield,Sun,Neutral,D,1 (4),1 (1),Manchester City,...,4-3-3,Stuart Attwell,Match Report,Arsenal won on penalty kicks following normal ...,7,3,,,0,0
1,2023-08-12,12:30,Premier League,Matchweek 1,Sat,Home,W,2,1,Nott'ham Forest,...,4-3-3,Michael Oliver,Match Report,,15,7,19.1,0.0,0,0
2,2023-08-21,20:00,Premier League,Matchweek 2,Mon,Away,W,1,0,Crystal Palace,...,4-3-3,David Coote,Match Report,,13,2,16.4,0.0,1,1
3,2023-08-26,15:00,Premier League,Matchweek 3,Sat,Home,D,2,2,Fulham,...,4-3-3,Paul Tierney,Match Report,,18,9,13.8,0.0,1,1
4,2023-09-03,16:30,Premier League,Matchweek 4,Sun,Home,W,3,1,Manchester Utd,...,4-3-3,Anthony Taylor,Match Report,,17,5,15.0,0.0,0,0
5,2023-09-17,16:30,Premier League,Matchweek 5,Sun,Away,W,1,0,Everton,...,4-3-3,Simon Hooper,Match Report,,13,4,17.4,0.0,0,0
6,2023-09-20,20:00,Champions Lg,Group stage,Wed,Home,W,4,0,nl PSV Eindhoven,...,4-3-3,Felix Zwayer,Match Report,,18,8,14.4,0.0,0,0
7,2023-09-24,14:00,Premier League,Matchweek 6,Sun,Home,D,2,2,Tottenham,...,4-3-3,Robert Jones,Match Report,,12,4,16.9,0.0,1,1
8,2023-09-27,19:45,EFL Cup,Third round,Wed,Away,W,1,0,Brentford,...,4-3-3,Darren Bond,Match Report,,10,3,,,0,0
9,2023-09-30,15:00,Premier League,Matchweek 7,Sat,Away,W,4,0,Bournemouth,...,4-3-3,Michael Salisbury,Match Report,,13,6,15.5,0.0,2,2
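
`merge` defaults to an inner join, so any fixture without a matching shooting row is silently dropped. The sketch below uses toy frames (the third shooting row is left out on purpose) to show two optional checks: `validate="one_to_one"` to catch duplicated dates, and `indicator=True` to list unmatched fixtures. The variable names are illustrative only.

```python
import pandas as pd

matches = pd.DataFrame({
    "Date": ["2023-08-12", "2023-08-21", "2023-08-26"],
    "Result": ["W", "W", "D"],
})
shooting = pd.DataFrame({
    "Date": ["2023-08-12", "2023-08-21"],   # 2023-08-26 deliberately missing
    "Sh": [15, 13],
    "SoT": [7, 2],
})

# Inner join (the default): raises MergeError if a Date repeats on either side.
merged = matches.merge(shooting, on="Date", validate="one_to_one")
print(len(merged))

# Left join with an indicator column shows which fixtures found no shooting row.
checked = matches.merge(shooting, on="Date", how="left", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", "Date"].tolist()
print(missing)
```

A date-only key assumes a team never plays twice on the same day, which seems reasonable for these match logs but is worth validating rather than assuming.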


In [31]:
# Dropping the other competitions to keep only 'Premier League' matches

merge_data = merge_data[merge_data["Comp"] == "Premier League"]

In [32]:
merge_data.head()

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,...,Formation,Referee,Match Report,Notes,Sh,SoT,Dist,FK,PK,PKatt
1,2023-08-12,12:30,Premier League,Matchweek 1,Sat,Home,W,2,1,Nott'ham Forest,...,4-3-3,Michael Oliver,Match Report,,15,7,19.1,0.0,0,0
2,2023-08-21,20:00,Premier League,Matchweek 2,Mon,Away,W,1,0,Crystal Palace,...,4-3-3,David Coote,Match Report,,13,2,16.4,0.0,1,1
3,2023-08-26,15:00,Premier League,Matchweek 3,Sat,Home,D,2,2,Fulham,...,4-3-3,Paul Tierney,Match Report,,18,9,13.8,0.0,1,1
4,2023-09-03,16:30,Premier League,Matchweek 4,Sun,Home,W,3,1,Manchester Utd,...,4-3-3,Anthony Taylor,Match Report,,17,5,15.0,0.0,0,0
5,2023-09-17,16:30,Premier League,Matchweek 5,Sun,Away,W,1,0,Everton,...,4-3-3,Simon Hooper,Match Report,,13,4,17.4,0.0,0,0


In [33]:
# Seasons to extract: 7 years across the league, most recent first

years = list(range(2024,2017,-1))
years

[2024, 2023, 2022, 2021, 2020, 2019, 2018]
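
The `years` list is defined here but the cells shown scrape only one team in one season. As a sketch of how the list might drive season-level requests, the function below builds one league URL per year. It assumes each entry in `years` is the season's end year and that every season follows the same URL pattern as `source_url`; only the 2023-2024 URL appears in this notebook, so the other six are unverified extrapolations. The name `season_url` is hypothetical.

```python
def season_url(end_year):
    """Build the league page URL for the season ending in `end_year` (assumed pattern)."""
    season = f"{end_year - 1}-{end_year}"
    return f"https://fbref.com/en/comps/9/{season}/{season}-Premier-League-Stats"


years = list(range(2024, 2017, -1))
season_urls = [season_url(year) for year in years]

for url in season_urls:
    print(url)
```

Each of these pages could then go through the same steps as above: extract the squad links, fetch each team page, and merge the fixtures with the shooting log.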

In [34]:
final_data = []

In [35]:
final_data.append(merge_data)
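
Once frames for several teams and seasons are appended to the same list, the rows no longer say which team or season they belong to. A minimal sketch of tagging each frame before appending is shown below. The helper `team_name_from_url` and the added `Season` and `Team` columns are assumptions for illustration, not columns the site provides.

```python
import pandas as pd


def team_name_from_url(team_url):
    """Turn '.../Brighton-and-Hove-Albion-Stats' into 'Brighton and Hove Albion'."""
    slug = team_url.split("/")[-1]
    return slug.replace("-Stats", "").replace("-", " ")


team_url = "https://fbref.com/en/squads/18bb7c10/Arsenal-Stats"

# Stand-in for the merged per-team frame built earlier.
frame = pd.DataFrame({"Date": ["2023-08-12"], "Result": ["W"]})
frame["Season"] = 2024                       # season end year, matching the `years` list
frame["Team"] = team_name_from_url(team_url)

collected = []
collected.append(frame)
print(frame)
```

With these two columns in place, the concatenated DataFrame can be grouped or filtered by team and season.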

In [36]:
final_data

[          Date   Time            Comp         Round  Day Venue Result GF GA  \
 1   2023-08-12  12:30  Premier League   Matchweek 1  Sat  Home      W  2  1   
 2   2023-08-21  20:00  Premier League   Matchweek 2  Mon  Away      W  1  0   
 3   2023-08-26  15:00  Premier League   Matchweek 3  Sat  Home      D  2  2   
 4   2023-09-03  16:30  Premier League   Matchweek 4  Sun  Home      W  3  1   
 5   2023-09-17  16:30  Premier League   Matchweek 5  Sun  Away      W  1  0   
 7   2023-09-24  14:00  Premier League   Matchweek 6  Sun  Home      D  2  2   
 9   2023-09-30  15:00  Premier League   Matchweek 7  Sat  Away      W  4  0   
 11  2023-10-08  16:30  Premier League   Matchweek 8  Sun  Home      W  1  0   
 12  2023-10-21  17:30  Premier League   Matchweek 9  Sat  Away      D  2  2   
 14  2023-10-28  15:00  Premier League  Matchweek 10  Sat  Home      W  5  0   
 16  2023-11-04  17:30  Premier League  Matchweek 11  Sat  Away      L  0  1   
 18  2023-11-11  15:00  Premier League  

In [37]:
# Concatenating the collected frames into a single DataFrame

match_df = pd.concat(final_data)

In [38]:
match_df.columns = [c.lower() for c in match_df.columns]

In [39]:
match_df.head()

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,formation,referee,match report,notes,sh,sot,dist,fk,pk,pkatt
1,2023-08-12,12:30,Premier League,Matchweek 1,Sat,Home,W,2,1,Nott'ham Forest,...,4-3-3,Michael Oliver,Match Report,,15,7,19.1,0.0,0,0
2,2023-08-21,20:00,Premier League,Matchweek 2,Mon,Away,W,1,0,Crystal Palace,...,4-3-3,David Coote,Match Report,,13,2,16.4,0.0,1,1
3,2023-08-26,15:00,Premier League,Matchweek 3,Sat,Home,D,2,2,Fulham,...,4-3-3,Paul Tierney,Match Report,,18,9,13.8,0.0,1,1
4,2023-09-03,16:30,Premier League,Matchweek 4,Sun,Home,W,3,1,Manchester Utd,...,4-3-3,Anthony Taylor,Match Report,,17,5,15.0,0.0,0,0
5,2023-09-17,16:30,Premier League,Matchweek 5,Sun,Away,W,1,0,Everton,...,4-3-3,Simon Hooper,Match Report,,13,4,17.4,0.0,0,0


In [40]:
# Saving the data as CSV for the prediction step (uncomment to write the file)

#match_df.to_csv('final_dataset.csv')
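
Called without arguments, `to_csv` also writes the DataFrame index, which is where columns like `Unnamed: 0` come from when the file is read back. The sketch below round-trips a toy frame through an in-memory buffer to show `index=False` on write and `parse_dates` on read. The frame `demo` is made up for illustration, and the buffer stands in for `final_dataset.csv`.

```python
import io

import pandas as pd

demo = pd.DataFrame(
    {"date": ["2023-08-12", "2023-08-21"], "result": ["W", "W"], "sh": [15, 13]},
    index=[1, 2],                      # leftover index after filtering, as in match_df
)

buffer = io.StringIO()                 # in-memory stand-in for a file on disk
demo.to_csv(buffer, index=False)       # skip the index so no 'Unnamed: 0' column appears
buffer.seek(0)

restored = pd.read_csv(buffer, parse_dates=["date"])
print(restored.dtypes)
```

Parsing `date` on load means the prediction step can sort or split by time without a separate conversion.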