# English Premier League

![Image](tile-premier-league-mar.jpg)

The Premier League is the highest level of the English football league system. Contested by 20 clubs, it operates on a system of promotion and relegation with the English Football League (EFL). Seasons typically run from August to May with each team playing 38 matches against all other teams both home and away. Most games are played on Saturday and Sunday afternoons, with occasional weekday evening fixtures.

We will attempt to retrieve data for the latest EPL season using web scraping and feed it into a machine learning model to predict each team's matches. The package we will use to scrape data from the internet is **Beautiful Soup**, a Python library for web scraping and parsing HTML or XML documents. It provides a convenient way to extract data from web pages by traversing the HTML or XML structure and locating specific elements or attributes.
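
To make "locating specific elements or attributes" concrete, here is a minimal sketch that parses a small, invented HTML fragment (it is not fbref's real markup) and reads the text and `href` of each link. The built-in `html.parser` backend is used only to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

# A tiny, invented HTML fragment standing in for a real web page.
html = """
<table class="stats_table">
  <tr><td><a href="/en/squads/aaa/Team-One-Stats">Team One</a></td></tr>
  <tr><td><a href="/en/squads/bbb/Team-Two-Stats">Team Two</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate elements by tag name, then read their text and attributes.
anchors = soup.find_all("a")
names = [a.get_text() for a in anchors]
hrefs = [a.get("href") for a in anchors]

print(names)
print(hrefs)
```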

### Web Scraping tools/packages to find our data  

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [None]:
# get the link of the EPL stats which we will convert to a pandas DataFrame
standings_url = "https://fbref.com/en/comps/9/Premier-League-Stats"

In [None]:
epl_data = requests.get(standings_url)
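
A site can reject or throttle automated requests, in which case `epl_data.text` holds an error page rather than the standings. The helper below is a hedged sketch of one way to guard against that: `fetch_html`, its `delay` parameter and the 30-second timeout are assumptions made for illustration, not part of the original notebook.

```python
import time

import requests


def fetch_html(url, get=requests.get, delay=0.0, timeout=30):
    """Fetch a page and return its HTML text, raising on HTTP errors.

    `get` is injectable so the helper can be tried without network access.
    """
    time.sleep(delay)                     # optional pause between requests
    response = get(url, timeout=timeout)  # fail instead of hanging forever
    response.raise_for_status()           # raise on 4xx/5xx responses
    return response.text
```

With this in place, `fetch_html(standings_url, delay=1)` would return the page HTML directly, ready to pass to `BeautifulSoup`.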

## Parse HTML links using BeautifulSoup

In [None]:
soup = BeautifulSoup(epl_data.text)

In [None]:
# select() returns a list of matches; take the first table (the league standings)
league_standing = soup.select('table.stats_table')[0]

In [None]:
team_links = league_standing.find_all('a')

In [None]:
team_links = [l.get("href") for l in team_links]

In [None]:
team_links = [l for l in team_links if '/squads/' in l]

In [None]:
team_url = [f"https://fbref.com{l}" for l in team_links]
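
The cells above can be folded into one function. This is only a sketch run against invented HTML: `squad_urls` and `BASE_URL` are hypothetical names, and `urljoin` from the standard library stands in for the f-string used above. It also skips anchors that have no `href`, which would otherwise come back as `None` and break the `in` test.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://fbref.com"


def squad_urls(html, base_url=BASE_URL):
    """Return absolute squad URLs found in the first table.stats_table."""
    soup = BeautifulSoup(html, "html.parser")
    tables = soup.select("table.stats_table")  # select() returns a list
    if not tables:
        return []
    hrefs = [a.get("href") for a in tables[0].find_all("a")]
    # Skip anchors without an href (None) and keep only squad pages.
    return [urljoin(base_url, h) for h in hrefs if h and "/squads/" in h]


# Invented markup: two squad links, one player link, one anchor with no href.
sample = """
<table class="stats_table">
  <tr><td><a href="/en/squads/aaa/Team-One-Stats">Team One</a></td>
      <td><a href="/en/players/zzz/Some-Player">Top scorer</a></td></tr>
  <tr><td><a>No link</a></td>
      <td><a href="/en/squads/bbb/Team-Two-Stats">Team Two</a></td></tr>
</table>
"""

print(squad_urls(sample))
```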

## Extract Each Match Stats Using Pandas and Requests

In [None]:
#let's extract the team url for the first team on the table
team_url = team_url[0]

In [None]:
# We can see that the first team on the table is Man City
team_url

In [None]:
epl_data = requests.get(team_url)

In [None]:
# Let's get all of Man City's 2022/2023 matches played (MP)
matches_played = pd.read_html(epl_data.text , match="Scores & Fixtures")

In [None]:
matches_played[0].head()

## Get Match Shooting stats using Beautiful Soup

In [None]:
soup = BeautifulSoup(epl_data.text)

In [None]:
#get links for the shooting stats
links = soup.find_all('a')

In [None]:
links = [l.get("href") for l in links]

In [None]:
links = [l for l in links if l and 'all_comps/shooting/' in l]

In [None]:
epl_data = requests.get(f"https://fbref.com{links[0]}")

In [None]:
shooting = pd.read_html(epl_data.text ,match="Shooting")[0]

In [None]:
shooting.head()

## Clean and Merge Scraped Data

In [None]:
shooting.columns = shooting.columns.droplevel()
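
To see what `droplevel()` does, here is a small sketch on an invented frame with a two-level header. The group labels and numbers are made up for illustration; only the shape resembles a grouped stats table.

```python
import pandas as pd

# Invented two-level header: an outer group label above each column name.
columns = pd.MultiIndex.from_tuples([
    ("Match", "Date"),
    ("Standard", "Sh"),
    ("Standard", "SoT"),
])
demo = pd.DataFrame([["2022-08-07", 14, 1]], columns=columns)

# droplevel() with no argument removes level 0, the outer group labels.
demo.columns = demo.columns.droplevel()

print(list(demo.columns))
```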

In [None]:
shooting

In [None]:
# Let's combine the matches played and shooting tables to get a more detailed match table
match_table = pd.concat([matches_played[0], shooting], axis=1, join='inner')

In [None]:
match_table.info()
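
Note that `pd.concat(..., axis=1)` lines rows up by index position and keeps every column from both frames, so any column names the two tables share appear twice. One possible alternative, sketched here on invented data, is to merge on a shared key column. The frames, values and the choice of `Date` as the key are assumptions for illustration.

```python
import pandas as pd

# Invented stand-in for the matches table.
matches = pd.DataFrame({
    "Date": ["2022-08-07", "2022-08-13", "2022-08-21"],
    "Comp": ["Premier League"] * 3,
    "Result": ["W", "W", "D"],
})

# Invented stand-in for the shooting table: rows are in a different order
# and there is an extra totals-like row with no date.
shots = pd.DataFrame({
    "Date": ["2022-08-13", "2022-08-07", "2022-08-21", None],
    "Comp": ["Premier League"] * 3 + [None],
    "Sh": [19, 14, 21, 54],
    "SoT": [7, 1, 4, 12],
})

# Keep the key plus only the columns that `matches` does not already have.
extra = ["Date"] + [c for c in shots.columns if c not in matches.columns]

# An inner merge pairs rows by date and drops rows without a partner.
merged = matches.merge(shots[extra], on="Date")

print(merged)
```

Because rows are paired by date rather than position, the differing row order and the totals-like row do not affect the result.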

### Scraping Data for Multiple Seasons from 2013 to 2023

In [None]:
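# Seasons to scrape, newest first. This yields [2022, 2021];
# lower the stop value of the range to go further back.
years = list(range(2022, 2020, -1))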

In [None]:
years

In [None]:
all_matches = []

In [None]:
standings_url = "https://fbref.com/en/comps/9/Premier-League-Stats"

In [None]:
for year in years:
    epl_data = requests.get(standings_url)
    soup = BeautifulSoup(epl_data.text)
    league_standing = soup.select('table.stats_table')[0]
    team_links = [l.get("href") for l in league_standing.find_all('a')]
    team_links = [l for l in team_links if '/squads/' in l]
    team_url = [f"https://fbref.com{l}" for l in team_links]
    previous_season = soup.select("a.prev")[0].get("href")
    standings_url = f"https://fbref.com{previous_season}"
    
    for t_url in team_url:
        team_name = t_url.split("/")[-1].replace("-Stats","").replace("-"," ")
        epl_data = requests.get(t_url)
        matches_played = pd.read_html(epl_data.text, match="Scores & Fixtures")[0]
        soup = BeautifulSoup(epl_data.text)
        team_links = [l.get("href") for l in soup.find_all("a")]
        team_links = [l for l in team_links if l and 'all_comps/shooting' in l]
        epl_data = requests.get(f"https://fbref.com{team_links[0]}")
        shooting = pd.read_html(epl_data.text, match="Shooting")[0]
        shooting.columns = shooting.columns.droplevel()
        
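        try:
            # matches_played is already a DataFrame here, so no [0] is needed.
            # Merge on "Date" so each match lines up with its shooting row, and take
            # only the shooting columns that matches_played does not already have
            new_cols = ["Date"] + [c for c in shooting.columns if c not in matches_played.columns]
            match_table = matches_played.merge(shooting[new_cols], on="Date")
        except ValueError:
            continue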
        match_table = match_table[match_table["Comp"] == "Premier League"]
        match_table["Season"] = year
        match_table["Team"] = team_name
        all_matches.append(match_table)
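
The team name in the loop comes from the last path segment of the squad URL. As a quick sketch, the same string handling is wrapped here in a hypothetical helper, `team_name_from_url`; the squad ids in the example URLs are placeholders, not real fbref ids.

```python
def team_name_from_url(url):
    """Turn '.../Manchester-City-Stats' into 'Manchester City'."""
    slug = url.rstrip("/").split("/")[-1]  # last path segment
    return slug.replace("-Stats", "").replace("-", " ")


# Placeholder ids ("abc123", "def456") are used instead of real squad ids.
print(team_name_from_url("https://fbref.com/en/squads/abc123/Manchester-City-Stats"))
print(team_name_from_url("https://fbref.com/en/squads/def456/Brighton-and-Hove-Albion-Stats"))
```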


In [None]:
match_df = pd.concat(all_matches)
match_df.columns = [c.lower() for c in match_df.columns]

In [None]:
match_df.to_csv("matches.csv")
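
Each per-team frame carries its own index, so a plain `pd.concat` repeats index values, and `to_csv` writes that index as an unnamed first column. The sketch below uses two invented frames and an in-memory buffer instead of a file to show one way to avoid both; `ignore_index=True` and `index=False` are optional choices, not something the notebook requires.

```python
import io

import pandas as pd

# Two invented per-team frames, each with its own 0-based index.
frames = [
    pd.DataFrame({"Date": ["2022-08-07"], "Team": ["Team One"], "Sh": [14]}),
    pd.DataFrame({"Date": ["2022-08-07"], "Team": ["Team Two"], "Sh": [9]}),
]

# ignore_index=True gives the combined frame a fresh 0..n-1 index.
combined = pd.concat(frames, ignore_index=True)
combined.columns = [c.lower() for c in combined.columns]

# Write to an in-memory buffer; index=False leaves the index out of the CSV.
buffer = io.StringIO()
combined.to_csv(buffer, index=False)

print(buffer.getvalue())
```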