# Web Scraping English Premier League Football Data

Welcome to my web scraping project for the English Premier League! 

As a football enthusiast, I have always been fascinated by the performance of my favourite teams and players. I wanted to take my analysis to the next level by gathering English Premier League data from the web (here, the statistics site fbref.com) and extracting meaningful insights with the Python programming language and the BeautifulSoup library.

So let's get started! 

Let's start by importing the "requests" module.

The "requests" module in Python is a powerful tool for making HTTP requests to web servers and receiving responses. It provides a simple and user-friendly interface for sending HTTP requests and handling responses, making it an essential tool for web scraping, web development, and data analysis.

In [1]:
#importing the requests module
import requests

We store the address of the web page we want in the variable "url", so that we can write "url" instead of the full address each time.

In [2]:
#initializing the url of the web-page that we want to retrieve data from
url = "https://fbref.com/en/comps/9/Premier-League-Stats"

Here, we are using the requests module to send an HTTP GET request to retrieve the contents of a webpage located at https://fbref.com/en/comps/9/Premier-League-Stats. 

The get() function of the requests module is used to send the request, and the response from the server is stored in the "data" variable.

By doing this we are making a request to the webpage.

In [3]:
#sending an HTTP GET request to the url and storing the response in "data"
data = requests.get(url)
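
A request can fail or come back with an error page, so it can help to check the response before using it. Below is a minimal sketch of one way to do that. The helper `fetch_html` and its `get` parameter are hypothetical names introduced only for this illustration, and the sketch assumes that whatever is passed as `get` behaves like `requests.get`.

```python
def fetch_html(url, get, timeout=10):
    """Fetch a page and return its HTML text.

    `get` is a hypothetical parameter: any callable that behaves like
    requests.get, so the helper can also be tried with a stand-in.
    """
    # a timeout stops the call from waiting forever on a slow server
    response = get(url, timeout=timeout)
    # raise_for_status() raises an exception for 4xx/5xx responses
    response.raise_for_status()
    return response.text


# typical use with the requests module (needs network access):
# import requests
# html = fetch_html("https://fbref.com/en/comps/9/Premier-League-Stats", requests.get)
```

Passing the getter in as an argument is only a design choice for this sketch; calling `requests.get(url, timeout=10)` followed by `data.raise_for_status()` directly in the notebook would do the same job.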

However, the response we get from the webpage is raw HTML, which is not easy to read. So we want to parse it into a structure we can navigate, using the BeautifulSoup library.

If you haven't installed BeautifulSoup, you can install it using 'pip install beautifulsoup4' command.  

In [4]:
#checking the html text we just retrieved. 
data.text



We can see that there is a lot of unwanted HTML text in our data. We need to filter it out to get the data we are actually looking for.

BeautifulSoup is a Python library that allows us to parse HTML and XML documents. It is a very popular library for web scraping and data extraction. 
BeautifulSoup helps us navigate and search a parsed HTML or XML document, so that we can easily extract the data we need. 

So why not use this powerful library?

In [5]:
#importing beautifulsoup library
from bs4 import BeautifulSoup

Parsing the HTML text of the webpage into the variable "soup":

In [6]:
#passing the html text into BeautifulSoup class
soup = BeautifulSoup(data.text)
soup

<!DOCTYPE html>
<html class="no-js" data-root="/home/fb/deploy/www/base" data-version="klecko-" lang="en">
<head>
<meta charset="utf-8"/>
<meta content="ie=edge" http-equiv="x-ua-compatible"/>
<meta content="width=device-width, initial-scale=1.0, maximum-scale=2.0" name="viewport"/>
<link href="https://cdn.ssref.net/req/202303231" rel="dns-prefetch"/>
<!-- Quantcast Choice. Consent Manager Tag v2.0 (for TCF 2.0) -->
<script async="true" type="text/javascript">
    (function() {
	var host = window.location.hostname;
	var element = document.createElement('script');
	var firstScript = document.getElementsByTagName('script')[0];
	var url = 'https://cmp.quantcast.com'
	    .concat('/choice/', 'XwNYEpNeFfhfr', '/', host, 
		    '/choice.js?tag_version=V2');
	var uspTries = 0;
	var uspTriesLimit = 3;
	element.async = true;
	element.type = 'text/javascript';
	element.src = url;
	
	firstScript.parentNode.insertBefore(element, firstScript);
	
	function makeStub() {
	    var TCF_LOCATOR_NAME = '_

BeautifulSoup has a select() function which helps us find elements in the webpage using CSS (Cascading Style Sheets) selectors. Here we use the selector "table.stats_table", which matches every table element that has the class "stats_table", and we take the first match. To find a suitable selector, open the web page, right-click on the desired table and open the inspector to see its HTML. The extracted table is stored in the variable "table".

In [7]:
#select uses CSS selectors, which give a lot of flexibility to pick elements by tag, class, ID, etc.
table = soup.select("table.stats_table")[0]
table

<table class="stats_table sortable min_width force_mobilize" data-cols-to-freeze=",2" id="results2022-202391_overall"> <caption>Regular season Table</caption> <colgroup><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/><col/></colgroup> <thead> <tr> <th aria-label="Rank" class="poptip sort_default_asc center" data-stat="rank" data-tip="&lt;strong&gt;Rank&lt;/strong&gt;&lt;br&gt;Squad finish in competition&lt;br&gt;Finish within the league or competition.&lt;br&gt;For knockout competitions may show final round reached.&lt;br&gt;Colors and arrows represent promotion/relegation or qualifiation for continental cups.&lt;br&gt;Trophy indicates team won league whether by playoffs or by leading the table.&lt;br&gt;Star indicates topped table in league USING another means of naming champion." scope="col">Rk</th> <th aria-label="Squad" class="poptip sort_default_asc center" data-stat="team" scope="col">Squad</th> <th aria-label="Mat

Now that we have found our desired table, we can look for the tags we need using the find_all() function. Here we extract every "a" (anchor) tag in the table and store the result in the variable "links".

In [8]:
#find_all returns every matching tag (here, all the "a" tags)
links = table.find_all("a")
links

[<a href="/en/squads/18bb7c10/Arsenal-Stats">Arsenal</a>,
 <a href="/en/matches/705a2f3c/Arsenal-Everton-March-1-2023-Premier-League" style="color:#fff; text-decoration:none; background-color: transparent">W</a>,
 <a href="/en/matches/3e9a33fc/Arsenal-Bournemouth-March-4-2023-Premier-League" style="color:#fff; text-decoration:none; background-color: transparent">W</a>,
 <a href="/en/matches/d238973c/Fulham-Arsenal-March-12-2023-Premier-League" style="color:#fff; text-decoration:none; background-color: transparent">W</a>,
 <a href="/en/matches/98e0de00/Arsenal-Crystal-Palace-March-19-2023-Premier-League" style="color:#fff; text-decoration:none; background-color: transparent">W</a>,
 <a href="/en/matches/2e4383ca/Arsenal-Leeds-United-April-1-2023-Premier-League" style="color:#fff; text-decoration:none; background-color: transparent">W</a>,
 <a href="/en/players/48a5a5d6/Martinelli">Martinelli</a>,
 <a href="/en/players/466fb2c5/Aaron-Ramsdale">Aaron Ramsdale</a>,
 <a href="/en/squads/b8f

As we are interested only in the value of "href", we read it from each tag with the get() function.

In [9]:
#getting the "href" values
links = [l.get("href") for l in links]
links

['/en/squads/18bb7c10/Arsenal-Stats',
 '/en/matches/705a2f3c/Arsenal-Everton-March-1-2023-Premier-League',
 '/en/matches/3e9a33fc/Arsenal-Bournemouth-March-4-2023-Premier-League',
 '/en/matches/d238973c/Fulham-Arsenal-March-12-2023-Premier-League',
 '/en/matches/98e0de00/Arsenal-Crystal-Palace-March-19-2023-Premier-League',
 '/en/matches/2e4383ca/Arsenal-Leeds-United-April-1-2023-Premier-League',
 '/en/players/48a5a5d6/Martinelli',
 '/en/players/466fb2c5/Aaron-Ramsdale',
 '/en/squads/b8fd03ef/Manchester-City-Stats',
 '/en/matches/7f12d9aa/Nottingham-Forest-Manchester-City-February-18-2023-Premier-League',
 '/en/matches/e731c1dd/Bournemouth-Manchester-City-February-25-2023-Premier-League',
 '/en/matches/b33bcb97/Manchester-City-Newcastle-United-March-4-2023-Premier-League',
 '/en/matches/d29fe1d9/Crystal-Palace-Manchester-City-March-11-2023-Premier-League',
 '/en/matches/40966f45/Manchester-City-Liverpool-April-1-2023-Premier-League',
 '/en/players/1f44ac21/Erling-Haaland',
 '/en/player
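
To make the two steps above less magical (pick a table by its class, then read the "href" of every "a" tag inside it), here is a small sketch that does the same kind of thing on a tiny made-up HTML snippet. Please note that it uses Python's built-in `html.parser` module as a stand-in, not BeautifulSoup, and that the class name `TableLinkCollector` and the sample HTML are invented for this illustration. Unlike `soup.select(...)[0]`, it collects links from every matching table, not just the first one.

```python
from html.parser import HTMLParser


class TableLinkCollector(HTMLParser):
    """Collect href values of <a> tags that sit inside a table
    whose class list contains "stats_table" (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # greater than 0 while we are inside a matching table
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            classes = (attrs.get("class") or "").split()
            # count nested tables too, so we know when the outer one ends
            if self.depth or "stats_table" in classes:
                self.depth += 1
        elif tag == "a" and self.depth:
            # like get("href"), this gives None when the attribute is missing
            self.hrefs.append(attrs.get("href"))

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1


sample_html = """
<a href="/en/ignored">outside the table</a>
<table class="stats_table sortable">
  <tr><td><a href="/en/squads/18bb7c10/Arsenal-Stats">Arsenal</a></td>
      <td><a>anchor without href</a></td></tr>
</table>
"""

collector = TableLinkCollector()
collector.feed(sample_html)
print(collector.hrefs)
```

The second entry in the printed list is `None`, because that anchor has no "href" attribute. This is worth keeping in mind when filtering the links afterwards.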

We are now looking for the urls which contain "squads" 

In [10]:
#keeping only the links which contain "/squads/"
links = [l for l in links if "/squads/" in l]
links

['/en/squads/18bb7c10/Arsenal-Stats',
 '/en/squads/b8fd03ef/Manchester-City-Stats',
 '/en/squads/b2b47a98/Newcastle-United-Stats',
 '/en/squads/361ca564/Tottenham-Hotspur-Stats',
 '/en/squads/19538871/Manchester-United-Stats',
 '/en/squads/d07537b9/Brighton-and-Hove-Albion-Stats',
 '/en/squads/8602292d/Aston-Villa-Stats',
 '/en/squads/822bd0ba/Liverpool-Stats',
 '/en/squads/cd051869/Brentford-Stats',
 '/en/squads/fd962109/Fulham-Stats',
 '/en/squads/cff3d9bb/Chelsea-Stats',
 '/en/squads/47c64c55/Crystal-Palace-Stats',
 '/en/squads/5bfb9659/Leeds-United-Stats',
 '/en/squads/8cec06e1/Wolverhampton-Wanderers-Stats',
 '/en/squads/7c21e445/West-Ham-United-Stats',
 '/en/squads/d3fd31cc/Everton-Stats',
 '/en/squads/e4a775cb/Nottingham-Forest-Stats',
 '/en/squads/4ba7cbea/Bournemouth-Stats',
 '/en/squads/a2d435b3/Leicester-City-Stats',
 '/en/squads/33c895d4/Southampton-Stats']
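
One thing to watch out for: if a tag has no "href", get("href") gives None, and the check `"/squads/" in l` then raises a TypeError. The sketch below shows a slightly more defensive filter on made-up sample values. The function name `squad_links` is hypothetical, and removing duplicates is an extra step that the notebook itself does not perform.

```python
def squad_links(hrefs):
    """Keep only squad links, skipping missing hrefs and duplicates."""
    # `h and ...` skips None values (anchors without an href attribute)
    kept = [h for h in hrefs if h and "/squads/" in h]
    # dict.fromkeys keeps the first occurrence of each link, in order
    return list(dict.fromkeys(kept))


sample = [
    "/en/squads/18bb7c10/Arsenal-Stats",
    "/en/matches/705a2f3c/Arsenal-Everton-March-1-2023-Premier-League",
    None,
    "/en/squads/b8fd03ef/Manchester-City-Stats",
    "/en/squads/18bb7c10/Arsenal-Stats",
]
print(squad_links(sample))
```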

As we can see, the above urls are relative links, and we want the corresponding absolute links. To convert a relative link into an absolute link we can simply prepend "https://fbref.com" to it.

Let's now convert them.

In [11]:
#formatting url links so that we get the full (absolute links) urls
team_urls = [f"https://fbref.com{l}" for l in links]
#printing the extracted team_urls
team_urls

['https://fbref.com/en/squads/18bb7c10/Arsenal-Stats',
 'https://fbref.com/en/squads/b8fd03ef/Manchester-City-Stats',
 'https://fbref.com/en/squads/b2b47a98/Newcastle-United-Stats',
 'https://fbref.com/en/squads/361ca564/Tottenham-Hotspur-Stats',
 'https://fbref.com/en/squads/19538871/Manchester-United-Stats',
 'https://fbref.com/en/squads/d07537b9/Brighton-and-Hove-Albion-Stats',
 'https://fbref.com/en/squads/8602292d/Aston-Villa-Stats',
 'https://fbref.com/en/squads/822bd0ba/Liverpool-Stats',
 'https://fbref.com/en/squads/cd051869/Brentford-Stats',
 'https://fbref.com/en/squads/fd962109/Fulham-Stats',
 'https://fbref.com/en/squads/cff3d9bb/Chelsea-Stats',
 'https://fbref.com/en/squads/47c64c55/Crystal-Palace-Stats',
 'https://fbref.com/en/squads/5bfb9659/Leeds-United-Stats',
 'https://fbref.com/en/squads/8cec06e1/Wolverhampton-Wanderers-Stats',
 'https://fbref.com/en/squads/7c21e445/West-Ham-United-Stats',
 'https://fbref.com/en/squads/d3fd31cc/Everton-Stats',
 'https://fbref.com/en/
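
String formatting works well here because every link starts with "/". As an alternative, the standard library has `urllib.parse.urljoin`, which also copes with links that are already absolute. The following is just a small sketch on two of the links from above; the constant name `BASE_URL` is my own choice for the example.

```python
from urllib.parse import urljoin

BASE_URL = "https://fbref.com"

relative_links = [
    "/en/squads/18bb7c10/Arsenal-Stats",
    "/en/squads/b8fd03ef/Manchester-City-Stats",
]

# urljoin combines the base address with each relative link
team_urls = [urljoin(BASE_URL, link) for link in relative_links]
print(team_urls)

# a link that is already absolute is returned unchanged
already_absolute = urljoin(BASE_URL, "https://fbref.com/en/squads/822bd0ba/Liverpool-Stats")
print(already_absolute)
```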

Our desired absolute links are stored in the list "team_urls", so we'll now visit an individual team url to extract data from two of its tables.

In [12]:
#taking only the first url from team_urls
team_url = team_urls[0]

Just as we requested the league page earlier, we again use the requests.get() function, this time on team_url, and store the response in the variable "data".

In [13]:
#requesting and storing the content from the first url into the variable "data". 
data = requests.get(team_url)

Let's import Pandas, an open-source Python library that provides powerful and flexible tools for data analysis and manipulation.  

In [14]:
#importing pandas as pd so that we can refer to it by its abbreviation each time we use it.
import pandas as pd

We now use the read_html() function of pandas to read the table "Scores & Fixtures". read_html() returns a list of DataFrames, one for each matching table, and we store that list in the variable "matches".

In [15]:
#read_html() returns a list of DataFrames for the tables matching "Scores & Fixtures"; we store the list in "matches".
matches = pd.read_html(data.text, match="Scores & Fixtures")

Let's now look at the first few rows of the first DataFrame in "matches" to see the content we just retrieved.

In [16]:
#printing the first five rows of the first DataFrame in matches
matches[0].head()

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,xG,xGA,Poss,Attendance,Captain,Formation,Referee,Match Report,Notes
0,2022-08-05,20:00,Premier League,Matchweek 1,Fri,Away,W,2,0,Crystal Palace,1.0,1.2,44.0,25286.0,Martin Ødegaard,4-3-3,Anthony Taylor,Match Report,
1,2022-08-13,15:00,Premier League,Matchweek 2,Sat,Home,W,4,2,Leicester City,2.7,0.5,50.0,60033.0,Martin Ødegaard,4-3-3,Darren England,Match Report,
2,2022-08-20,17:30,Premier League,Matchweek 3,Sat,Away,W,3,0,Bournemouth,1.3,0.3,57.0,10423.0,Martin Ødegaard,4-3-3,Craig Pawson,Match Report,
3,2022-08-27,17:30,Premier League,Matchweek 4,Sat,Home,W,2,1,Fulham,2.6,0.8,71.0,60164.0,Martin Ødegaard,4-3-3,Jarred Gillett,Match Report,
4,2022-08-31,19:30,Premier League,Matchweek 5,Wed,Home,W,2,1,Aston Villa,2.4,0.4,59.0,60012.0,Martin Ødegaard,4-3-3,Robert Jones,Match Report,


Voilà! We have finally completed the first part of our web scraping, where we extracted data from the table "Scores & Fixtures" and stored it in the variable "matches".

Now in the second part of our web scraping session we'll repeat the above steps to extract data from another table from the same webpage. So we'll quickly repeat the steps and extract data from the table "Shooting" then store it to the variable "shooting".

In [17]:
#parsing the html text into variable "soup" using BeautifulSoup class
soup = BeautifulSoup(data.text)

#pointing into the tag "a" using find_all function
links = soup.find_all("a")

#getting the "href" values using get() function and storing the urls into variables "links"
links = [l.get("href") for l in links]

#extracting only the links which contain "shooting/"
links = [l for l in links if l and "shooting/" in l]

#requesting the first shooting link (converted to an absolute url) and storing the response in the variable "data"
data = requests.get(f"https://fbref.com{links[0]}")

#using pandas read_html() function we store the webpage content into the variable "shooting" as a dataframe.
shooting = pd.read_html(data.text, match="Shooting")[0]

#dropping the top level of the two-level column header using droplevel() function.
shooting.columns = shooting.columns.droplevel()

#printing the shooting dataframe
shooting.head()

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,...,Dist,FK,PK,PKatt,xG,npxG,npxG/Sh,G-xG,np:G-xG,Match Report
0,2022-08-05,20:00,Premier League,Matchweek 1,Fri,Away,W,2,0,Crystal Palace,...,14.6,1.0,0,0,1.0,1.0,0.1,0.0,0.0,Match Report
1,2022-08-13,15:00,Premier League,Matchweek 2,Sat,Home,W,4,2,Leicester City,...,13.0,0.0,0,0,2.7,2.7,0.16,1.3,1.3,Match Report
2,2022-08-20,17:30,Premier League,Matchweek 3,Sat,Away,W,3,0,Bournemouth,...,14.8,0.0,0,0,1.3,1.3,0.1,1.7,1.7,Match Report
3,2022-08-27,17:30,Premier League,Matchweek 4,Sat,Home,W,2,1,Fulham,...,15.5,1.0,0,0,2.6,2.6,0.12,-0.6,-0.6,Match Report
4,2022-08-31,19:30,Premier League,Matchweek 5,Wed,Home,W,2,1,Aston Villa,...,16.3,1.0,0,0,2.4,2.4,0.12,-0.4,-0.4,Match Report
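
If the droplevel() step feels unclear, the small sketch below may help. It builds a tiny DataFrame with a two-level column header and shows that droplevel(), called without arguments, removes the top level. The group labels "Match info" and "Standard" are made up for this illustration and are not taken from the scraped page.

```python
import pandas as pd

# a two-level header: (group label, column name); the labels are invented
columns = pd.MultiIndex.from_tuples([
    ("Match info", "Date"),
    ("Standard", "Sh"),
    ("Standard", "SoT"),
])
demo = pd.DataFrame([["2022-08-05", 10, 2]], columns=columns)

# droplevel() with no argument removes level 0, the top row of the header
demo.columns = demo.columns.droplevel()
print(list(demo.columns))
```

After this step the columns can be selected by their plain names, such as demo["Sh"].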


Let's look into the summary of our "shooting" dataframe.

In [18]:
#"info()" function is a method in pandas library for getting concise information about a DataFrame or Series.
shooting.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41 entries, 0 to 40
Data columns (total 26 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Date          40 non-null     object 
 1   Time          40 non-null     object 
 2   Comp          40 non-null     object 
 3   Round         40 non-null     object 
 4   Day           40 non-null     object 
 5   Venue         40 non-null     object 
 6   Result        41 non-null     object 
 7   GF            41 non-null     object 
 8   GA            41 non-null     object 
 9   Opponent      40 non-null     object 
 10  Gls           41 non-null     int64  
 11  Sh            41 non-null     int64  
 12  SoT           41 non-null     int64  
 13  SoT%          41 non-null     float64
 14  G/Sh          41 non-null     float64
 15  G/SoT         40 non-null     float64
 16  Dist          38 non-null     float64
 17  FK            38 non-null     float64
 18  PK            41 non-null     in

We can see that there are 26 columns, but we don't need them all. We need only the "Date", "Sh", "SoT", "Dist", "FK", "PK", and "PKatt" columns. We want to combine these columns of "shooting" with the dataframe "matches", so we use the "merge()" function to form a new dataframe "team_data".

In [19]:
#combining the chosen columns of "shooting" with "matches" dataframe on column "Date" to form a new dataframe "team_data".
team_data = matches[0].merge(shooting[["Date","Sh","SoT","Dist","FK","PK","PKatt"]], on="Date")

Let's check out "team_data" with "head()" method of pandas library which is used to display the first n rows of a DataFrame. 

In [20]:
#The default number of rows displayed is 5.
team_data.head()

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,...,Formation,Referee,Match Report,Notes,Sh,SoT,Dist,FK,PK,PKatt
0,2022-08-05,20:00,Premier League,Matchweek 1,Fri,Away,W,2,0,Crystal Palace,...,4-3-3,Anthony Taylor,Match Report,,10,2,14.6,1.0,0,0
1,2022-08-13,15:00,Premier League,Matchweek 2,Sat,Home,W,4,2,Leicester City,...,4-3-3,Darren England,Match Report,,19,7,13.0,0.0,0,0
2,2022-08-20,17:30,Premier League,Matchweek 3,Sat,Away,W,3,0,Bournemouth,...,4-3-3,Craig Pawson,Match Report,,14,6,14.8,0.0,0,0
3,2022-08-27,17:30,Premier League,Matchweek 4,Sat,Home,W,2,1,Fulham,...,4-3-3,Jarred Gillett,Match Report,,22,8,15.5,1.0,0,0
4,2022-08-31,19:30,Premier League,Matchweek 5,Wed,Home,W,2,1,Aston Villa,...,4-3-3,Robert Jones,Match Report,,22,8,16.3,1.0,0,0
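
It is useful to know that merge() performs an inner join by default, so a match only survives if its "Date" appears in both dataframes. Here is a minimal sketch on two made-up mini tables (the names `matches_demo` and `shooting_demo` are just for this example) that compares the default with how="left".

```python
import pandas as pd

matches_demo = pd.DataFrame({
    "Date": ["2022-08-05", "2022-08-13", "2022-08-20"],
    "Opponent": ["Crystal Palace", "Leicester City", "Bournemouth"],
})
shooting_demo = pd.DataFrame({
    "Date": ["2022-08-05", "2022-08-13"],
    "Sh": [10, 19],
})

# default (inner join): only dates present in both tables are kept
inner = matches_demo.merge(shooting_demo, on="Date")

# left join: every row of matches_demo is kept, missing values become NaN
left = matches_demo.merge(shooting_demo, on="Date", how="left")

print(len(inner), len(left))
```

Whether the inner or the left join is the better choice depends on whether matches without shooting data should be kept; the notebook uses the default.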


Here we are done with the extraction of data for the first team in "team_urls" (if you recall, we had a list of team urls). We have extracted data from the tables "Scores & Fixtures" and "Shooting" for the most recent season (i.e. 2023). We need to do the same extraction for every team in the Premier League over several seasons. Instead of repeating the above steps by hand for each team, we want to write a loop that repeats the extraction for all the teams.

Before we do that we'll import the "time" module which is a built-in Python module that provides various functions to handle time-related tasks, such as measuring time elapsed, pausing code execution, and formatting time values.

In [21]:
#importing time module
import time
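
As a small illustration of how "time" can be used to pause between steps, here is a sketch of a helper that waits between consecutive items of a loop. The name `polite_iter` and its parameters are hypothetical, introduced only for this example; the `sleep` parameter exists so that the pause can be swapped out when trying the helper.

```python
import time


def polite_iter(items, delay=1.0, sleep=time.sleep):
    """Yield items one by one, pausing `delay` seconds between them."""
    for index, item in enumerate(items):
        # no pause before the very first item
        if index > 0:
            sleep(delay)
        yield item


for team in polite_iter(["Arsenal", "Manchester City"], delay=0.1):
    print(team)
```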

Let's make a list of the seasons for which we want to extract data. You can add any number of seasons to the list; here I've chosen three, 2021 to 2023.

In [22]:
#creating list of desired years
years = list(range(2023,2020,-1))
years

[2023, 2022, 2021]
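
In the data above, the matches labelled 2023 start in August 2022, and the table id we saw earlier contains "2022-2023". So, assuming each year in the list stands for the season that ends in that year, a readable label can be built as in the sketch below. The function name `season_label` is hypothetical.

```python
def season_label(end_year):
    """Return a label such as '2022-2023' for the season ending in end_year."""
    return f"{end_year - 1}-{end_year}"


years = list(range(2023, 2020, -1))
labels = {year: season_label(year) for year in years}
print(labels)
```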

In order to store all our extracted data, we want to initialize a list. 

In [23]:
#initializing an empty list to store all the data to be extracted
all_matches = []

#assigning the main url of the webpage to a variable "standings_url"
standings_url = "https://fbref.com/en/comps/9/Premier-League-Stats"

Now let's begin the "for" loop

In [24]:
#initializing "for" loop to iterate into each year
for year in years:
    #extracting the urls of each and every team    
    #sending HTTP GET request to "standings_url" and retrieving the content of the webpage. 
    data = requests.get(standings_url)
    #passing "data" (string) as an argument to BeautifulSoup constructor to create a BeautifulSoup object "soup".
    soup = BeautifulSoup(data.text)
    #extracting elements from HTML document ("table.stats_table") based on CSS selectors.
    standings_table = soup.select("table.stats_table")[0]
    #extracting "href" value from an HTML document that has tag "a". 
    links = [l.get("href") for l in standings_table.find_all("a")]
    #extracting links which contain "/squads/"
    links = [l for l in links if "/squads/" in l]
    #converting relative links into absolute links
    team_urls = [f"https://fbref.com{l}" for l in links]

    #finding the link to the previous season's standings page
    #extracting the "href" value of the first element matching the CSS selector "a.prev".
    previous_session = soup.select("a.prev")[0].get("href")
    #updating "standings_url" with the "previous_session" link
    standings_url = f"https://fbref.com{previous_session}"
    
    #initializing "for" loop to iterate into each "team_url"
    for team_url in team_urls:
        #extracting team name from its url and storing into variable "team_name"
        team_name = team_url.split("/")[-1].replace("-Stats","").replace("-"," ")
     
        #extracting data from the table "Scores & Fixtures"
        #sending HTTP GET request to "team_url" and retrieving the content of the webpage.
        data = requests.get(team_url)
        #parsing HTML table ("Scores & Fixtures") into a pandas DataFrame.
        matches = pd.read_html(data.text, match="Scores & Fixtures")[0]
        
        #extracting data from the table "Shooting"
        #passing "data" (string) as an argument to BeautifulSoup constructor to create a BeautifulSoup object "soup".
        soup = BeautifulSoup(data.text)
        #extracting "href" value from an HTML document that has tag "a".
        links = [l.get("href") for l in soup.find_all("a")]
        #extracting links which contain "all_comps/shooting/"
        links = [l for l in links if l and "all_comps/shooting/" in l]
        #converting relative link into absolute link and 
        #sending HTTP GET request the url and retrieving the content of the webpage.
        data = requests.get(f"https://fbref.com{links[0]}")
        #parsing HTML table ("Shooting") into pandas DataFrames.
        shooting = pd.read_html(data.text, match="Shooting")[0]
        #dropping the top level of the column index
        shooting.columns = shooting.columns.droplevel()
        
        #skipping this team if the merge raises a ValueError (for example because of missing data).
        try:
            #merging the selected columns from the table "Shooting" with the table "matches" and
            #forming a new dataframe "team_data"
            team_data = matches.merge(shooting[["Date","Sh","SoT","Dist","FK","PK","PKatt"]], on="Date")
        except ValueError:
            continue
        
        #keeping only the rows where the competition is "Premier League"
        team_data = team_data[team_data["Comp"] == "Premier League"]
        #forming new column "Session" in team_data
        team_data["Session"] = year
        #forming new column "Team" 
        team_data["Team"] = team_name
        #adding "team_data" into the list "all_matches"
        all_matches.append(team_data)
        
        #pausing the execution for a second
        #this ensures that our requests do not put too much load on the website
        time.sleep(1)
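
The team name in the loop comes straight from the url, so it is the long form of the name, while the "Opponent" column in the scraped data uses short forms in places (for example "Newcastle Utd" and "Tottenham"). The sketch below repeats the name extraction as a function and adds a small lookup for such cases. The function names and the `SHORT_NAMES` dictionary are hypothetical, and the dictionary lists only the two short forms visible in this notebook's output.

```python
def team_name_from_url(team_url):
    """Turn '.../Arsenal-Stats' into 'Arsenal' (same steps as in the loop)."""
    return team_url.split("/")[-1].replace("-Stats", "").replace("-", " ")


# hypothetical lookup from long team names to the short forms
# that appear in the "Opponent" column (only two examples shown)
SHORT_NAMES = {
    "Newcastle United": "Newcastle Utd",
    "Tottenham Hotspur": "Tottenham",
}


def opponent_style_name(team_name):
    """Return the short form if we know one, else the name unchanged."""
    return SHORT_NAMES.get(team_name, team_name)


name = team_name_from_url("https://fbref.com/en/squads/b2b47a98/Newcastle-United-Stats")
print(name, "->", opponent_style_name(name))
```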

"all_matches" is a list of multiple dataframes, so we concatenate all of them to form a single dataframe.

In [25]:
all_matches

[          Date   Time            Comp         Round  Day Venue Result GF GA  \
 0   2022-08-05  20:00  Premier League   Matchweek 1  Fri  Away      W  2  0   
 1   2022-08-13  15:00  Premier League   Matchweek 2  Sat  Home      W  4  2   
 2   2022-08-20  17:30  Premier League   Matchweek 3  Sat  Away      W  3  0   
 3   2022-08-27  17:30  Premier League   Matchweek 4  Sat  Home      W  2  1   
 4   2022-08-31  19:30  Premier League   Matchweek 5  Wed  Home      W  2  1   
 5   2022-09-04  16:30  Premier League   Matchweek 6  Sun  Away      L  1  3   
 7   2022-09-18  12:00  Premier League   Matchweek 8  Sun  Away      W  3  0   
 8   2022-10-01  12:30  Premier League   Matchweek 9  Sat  Home      W  3  1   
 10  2022-10-09  16:30  Premier League  Matchweek 10  Sun  Home      W  3  2   
 12  2022-10-16  14:00  Premier League  Matchweek 11  Sun  Away      W  1  0   
 14  2022-10-23  14:00  Premier League  Matchweek 13  Sun  Away      D  1  1   
 16  2022-10-30  14:00  Premier League  

In [26]:
#concatenating all the dataframes of "all_matches" to form a DataFrame
match_df = pd.concat(all_matches)
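
Note that concat() keeps the original row labels of each dataframe, so the index of the combined dataframe starts again for every team. If a continuous index is preferred, `ignore_index=True` can be passed. The sketch below shows the difference on two tiny made-up dataframes.

```python
import pandas as pd

frames = [
    pd.DataFrame({"Date": ["2022-08-05", "2022-08-13"], "Team": ["Arsenal", "Arsenal"]}),
    pd.DataFrame({"Date": ["2022-08-06", "2022-08-13"], "Team": ["Liverpool", "Liverpool"]}),
]

# default: each dataframe keeps its own row labels
stacked = pd.concat(frames)

# ignore_index=True renumbers the rows from 0 upwards
renumbered = pd.concat(frames, ignore_index=True)

print(list(stacked.index))
print(list(renumbered.index))
```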

Let's look into the DataFrame we just formed

In [27]:
match_df

Unnamed: 0,Date,Time,Comp,Round,Day,Venue,Result,GF,GA,Opponent,...,Match Report,Notes,Sh,SoT,Dist,FK,PK,PKatt,Session,Team
0,2022-08-05,20:00,Premier League,Matchweek 1,Fri,Away,W,2,0,Crystal Palace,...,Match Report,,10.0,2.0,14.6,1.0,0.0,0.0,2023,Arsenal
1,2022-08-13,15:00,Premier League,Matchweek 2,Sat,Home,W,4,2,Leicester City,...,Match Report,,19.0,7.0,13.0,0.0,0.0,0.0,2023,Arsenal
2,2022-08-20,17:30,Premier League,Matchweek 3,Sat,Away,W,3,0,Bournemouth,...,Match Report,,14.0,6.0,14.8,0.0,0.0,0.0,2023,Arsenal
3,2022-08-27,17:30,Premier League,Matchweek 4,Sat,Home,W,2,1,Fulham,...,Match Report,,22.0,8.0,15.5,1.0,0.0,0.0,2023,Arsenal
4,2022-08-31,19:30,Premier League,Matchweek 5,Wed,Home,W,2,1,Aston Villa,...,Match Report,,22.0,8.0,16.3,1.0,0.0,0.0,2023,Arsenal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38,2021-05-02,19:15,Premier League,Matchweek 34,Sun,Away,L,0,4,Tottenham,...,Match Report,,8.0,1.0,18.2,0.0,0.0,0.0,2021,Sheffield United
39,2021-05-08,15:00,Premier League,Matchweek 35,Sat,Home,L,0,2,Crystal Palace,...,Match Report,,7.0,0.0,13.4,1.0,0.0,0.0,2021,Sheffield United
40,2021-05-16,19:00,Premier League,Matchweek 36,Sun,Away,W,1,0,Everton,...,Match Report,,10.0,3.0,18.5,0.0,0.0,0.0,2021,Sheffield United
41,2021-05-19,18:00,Premier League,Matchweek 37,Wed,Away,L,0,1,Newcastle Utd,...,Match Report,,11.0,1.0,18.3,1.0,0.0,0.0,2021,Sheffield United


In [29]:
match_df.shape

(2088, 27)

We successfully extracted 2088 rows of data with 27 columns. Before we conclude our web scraping session I just want to do a minor cleanup of the column names by making them lowercase. This is an optional step; you can skip it if you are comfortable typing the capitalised column names.

In [30]:
#converting the column names to lowercase
match_df.columns = [c.lower() for c in match_df.columns]

Let's check out the final look of our DataFrame.

In [31]:
#printing dataframe "match_df"
match_df

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,match report,notes,sh,sot,dist,fk,pk,pkatt,session,team
0,2022-08-05,20:00,Premier League,Matchweek 1,Fri,Away,W,2,0,Crystal Palace,...,Match Report,,10.0,2.0,14.6,1.0,0.0,0.0,2023,Arsenal
1,2022-08-13,15:00,Premier League,Matchweek 2,Sat,Home,W,4,2,Leicester City,...,Match Report,,19.0,7.0,13.0,0.0,0.0,0.0,2023,Arsenal
2,2022-08-20,17:30,Premier League,Matchweek 3,Sat,Away,W,3,0,Bournemouth,...,Match Report,,14.0,6.0,14.8,0.0,0.0,0.0,2023,Arsenal
3,2022-08-27,17:30,Premier League,Matchweek 4,Sat,Home,W,2,1,Fulham,...,Match Report,,22.0,8.0,15.5,1.0,0.0,0.0,2023,Arsenal
4,2022-08-31,19:30,Premier League,Matchweek 5,Wed,Home,W,2,1,Aston Villa,...,Match Report,,22.0,8.0,16.3,1.0,0.0,0.0,2023,Arsenal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38,2021-05-02,19:15,Premier League,Matchweek 34,Sun,Away,L,0,4,Tottenham,...,Match Report,,8.0,1.0,18.2,0.0,0.0,0.0,2021,Sheffield United
39,2021-05-08,15:00,Premier League,Matchweek 35,Sat,Home,L,0,2,Crystal Palace,...,Match Report,,7.0,0.0,13.4,1.0,0.0,0.0,2021,Sheffield United
40,2021-05-16,19:00,Premier League,Matchweek 36,Sun,Away,W,1,0,Everton,...,Match Report,,10.0,3.0,18.5,0.0,0.0,0.0,2021,Sheffield United
41,2021-05-19,18:00,Premier League,Matchweek 37,Wed,Away,L,0,1,Newcastle Utd,...,Match Report,,11.0,1.0,18.3,1.0,0.0,0.0,2021,Sheffield United


We've come a long way to form this dataframe with match details. It is time to save our data for future use. We'll export the dataframe to a CSV file with the method "to_csv()". This method will save our dataframe in the current working directory in CSV format.

In [32]:
#exporting dataframe into CSV file
match_df.to_csv("matches.csv")
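
By default to_csv() also writes the row index as the first, unnamed column, which shows up as "Unnamed: 0" when the file is read back. If that column is not wanted, `index=False` leaves it out. The sketch below shows the difference on a one-row made-up dataframe; it writes to a string instead of a file so that nothing is saved to disk.

```python
import io

import pandas as pd

demo = pd.DataFrame({"date": ["2022-08-05"], "team": ["Arsenal"]})

# without a file name, to_csv() returns the CSV text as a string
with_index = demo.to_csv()
without_index = demo.to_csv(index=False)

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])

# reading the text back gives the same two columns
round_trip = pd.read_csv(io.StringIO(without_index))
print(list(round_trip.columns))
```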

Now that we know how to scrape data from a webpage, we can apply the same steps to other webpages in our upcoming projects. With this we conclude our web scraping session.

Happy Learning!