# Teams: League Table & Scores Scraper

We start by importing the packages we are going to use:

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

### Current League Table

First, we write a function that scrapes the current league table from the website. For each row of the table, we loop over the columns and append the values to lists, one list per column, which are then combined into a dataframe.

In [2]:
def league_table_scraper(url):
    
    # getting soup from the url
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    league_table = soup.find("div", {'class': 'table_container'}).table
    
    # preparation for looping
    rank = []
    list_names = ["squad", "games", "wins", "draws", "losses", "goals_for", "goals_against", "goal_diff", "points", "last_5", "attendance_per_g", "top_team_scorers", "top_keeper"]
    lists = [[] for _ in list_names]  # one empty list per column
    
    # looping over different teams (rows)
    for team in league_table.find_all("tbody"):
        rows = team.find_all("tr")
        
        # looping over different variables (columns) and writing into lists
        for row in rows:
            rank.append(row.find("th", {'data-stat': 'rank'}).text)
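            # pair each column name with its list (lists.index(k) can pick the wrong column when two lists hold equal values)
            for name, k in zip(list_names, lists):
                k.append(row.find("td", {'data-stat': name}).text)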
    
    # creating a dataframe by concatenating the lists
    table = pd.DataFrame({"Rank": rank, "Team": lists[0], "Games": lists[1], "Wins": lists[2], "Draws": lists[3], "Losses": lists[4], "Goals for": lists[5], "Goals against": lists[6], "Goal Difference": lists[7], "Points": lists[8], "Last 5 Games": lists[9], "Attendance per Game": lists[10], "Top Team Scorers": lists[11], "Top Goalkeeper": lists[12]})
    return table
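
The column-wise lookup can be tried offline on a small fragment before hitting the real site. The sketch below is only an illustration: the HTML string, the team names, and the helper `parse_sample_table` are made up, and only the `table_container` class and the `data-stat` attributes mimic what the function above relies on.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hand-written fragment (an assumption); only the data-stat attributes
# and the table_container class mirror the structure used above.
SAMPLE_HTML = """
<div class="table_container"><table><tbody>
<tr><th data-stat="rank">1</th><td data-stat="squad">Team A</td><td data-stat="points">44</td></tr>
<tr><th data-stat="rank">2</th><td data-stat="squad">Team B</td><td data-stat="points">35</td></tr>
</tbody></table></div>
"""

def parse_sample_table(html):
    """Hypothetical helper: collect one list per column, then build a dataframe."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("div", {"class": "table_container"}).table
    columns = {"rank": [], "squad": [], "points": []}
    for row in table.find_all("tr"):
        # the rank sits in a <th> cell, all other values in <td> cells
        columns["rank"].append(row.find("th", {"data-stat": "rank"}).text)
        for name in ("squad", "points"):
            columns[name].append(row.find("td", {"data-stat": name}).text)
    return pd.DataFrame(columns)

sample_table = parse_sample_table(SAMPLE_HTML)
print(sample_table)
```

All values come back as strings, so numeric columns still need converting during data cleaning.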

Now we pass the URL of the page we want to scrape to the function and obtain the league table as a dataframe. To work with the scraped data later, we also write it to a CSV file (the export line is commented out in the cell below), which we read in a separate notebook dedicated to data cleaning and analysis.

In [3]:
url = "https://fbref.com/en/comps/66/Czech-First-League-Stats"
table = league_table_scraper(url)
#table.to_csv(r"C:\Users\Honza Stuchlík\Documents\IES\Data Processing in Python\Czech-Football-League\league_table.csv", index = False)
table

| | Rank | Team | Games | Wins | Draws | Losses | Goals for | Goals against | Goal Difference | Points | Last 5 Games | Attendance per Game | Top Team Scorers | Top Goalkeeper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Slavia Prague | 16 | 14 | 2 | 0 | 48 | 8 | 40 | 44 | W W W W W | 2019 | Abdallah Sima - 9 | Ondřej Kolář |
| 1 | 2 | Sparta Prague | 16 | 11 | 2 | 3 | 33 | 18 | 15 | 35 | W D D W W | 1445 | Lukáš Juliš - 11 | Florin Niță |
| 2 | 3 | Jablonec | 16 | 11 | 2 | 3 | 33 | 16 | 17 | 35 | D W W W W | 722 | Ivan Schranz - 6 | Jan Hanuš |
| 3 | 4 | Slovácko | 16 | 9 | 3 | 4 | 30 | 17 | 13 | 30 | W W W W W | 498 | Jan Kliment - 6 | Vít Nemrava |
| 4 | 5 | Baník Ostrava | 16 | 7 | 5 | 4 | 21 | 13 | 8 | 26 | W D L D W | 1382 | Dyjan Carlos De Azevedo - 6 | Jan Laštůvka |
| 5 | 6 | Sigma Olomouc | 16 | 6 | 7 | 3 | 25 | 19 | 6 | 25 | D D L L W | 819 | David Houska - 4 | Aleš Mandous |
| 6 | 7 | Slovan Liberec | 16 | 7 | 4 | 5 | 24 | 17 | 7 | 25 | W D D W L | 984 | Michael Rabušic - 7 | Filip Nguyen |
| 7 | 8 | Viktoria Plzeň | 16 | 7 | 3 | 6 | 30 | 21 | 9 | 24 | W L L D W | 1333 | Aleš Čermák, Jean-David Beauguel - 6 | Aleš Hruška |
| 8 | 9 | České Budĕjov. | 16 | 6 | 6 | 4 | 23 | 23 | 0 | 24 | D L W W W | 471 | Benjamin Čolić, Patrik Brandner - 5 | Jaroslav Drobný |
| 9 | 10 | FK Pardubice | 16 | 6 | 4 | 6 | 15 | 19 | -4 | 22 | L D W L L | 589 | David Huf - 5 | Marek Boháč |
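
Handing the data over to another notebook through a CSV file could look roughly like the sketch below. It is an assumption-laden illustration, not the export used here: the toy dataframe, the temporary directory, and the file name stand in for the real table and path.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the scraped table (values are illustrative only).
toy_table = pd.DataFrame({"Rank": ["1", "2"], "Team": ["Team A", "Team B"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # hypothetical file name inside a throwaway directory
    csv_path = Path(tmp_dir) / "league_table.csv"
    # index=False keeps the row index out of the file
    toy_table.to_csv(csv_path, index=False)
    # dtype=str keeps every value as text, exactly as it was scraped
    reloaded = pd.read_csv(csv_path, dtype=str)

print(reloaded)
```

A path relative to the project folder, rather than an absolute path on one machine, keeps the notebooks runnable for other users.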


### Scores & Fixtures

We wrote a function for scraping scores from different seasons. It takes two arguments: the URL of the page we want to scrape and a boolean argument "regular", which is True for a regular season and False otherwise. A regular season has no extra rounds after all teams have played each other twice; irregular seasons usually include extra championship and relegation rounds. The argument makes sure we follow the correct HTML structure, because the pages for regular seasons are structured slightly differently from those for irregular ones.

In [4]:
def scores_scraper(url, regular = True):
    
    # getting soup from the url
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    scores_table = soup.find("div", {'class': 'table_container'}).table
    
    # preparation for looping
    list_names = ["gameweek", "dayofweek", "date", "time", "squad_a", "score", "squad_b", "attendance", "venue", "referee"]
    lists = [[] for _ in list_names]  # one empty list per column

    # looping over different rows
    for game in scores_table.find_all("tbody"):
        rows = game.find_all("tr")
        
        if regular:
            # looping over different variables (columns) and writing into lists
            for row in rows:
                lists[0].append(row.find("th", {'data-stat': 'gameweek'}).text)
                # the game week was read from the <th> cell above; the remaining columns are <td> cells
                for name, k in zip(list_names[1:], lists[1:]):
                    k.append(row.find("td", {'data-stat': name}).text)
                
        else:
            # looping over different variables (columns) and writing into lists
            for row in rows:
                for name, k in zip(list_names, lists):
                    k.append(row.find("td", {'data-stat': name}).text)
                    
    # creating a dataframe by concatenating the lists
    scores = pd.DataFrame({"Game Week": lists[0], "Weekday": lists[1], "Date": lists[2], "Time": lists[3], "Home Team": lists[4], "Score": lists[5], "Away Team": lists[6], "Attendance": lists[7], "Venue": lists[8], "Referee": lists[9]})
    return scores
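
The two branches above differ only in whether the game week sits in a `<th>` or a `<td>` cell. One possible alternative, sketched here under assumptions, is a lookup that accepts either tag, so a single loop could serve both page layouts. The helper `cell_text` and the sample rows are hypothetical and not part of the scraper above.

```python
from bs4 import BeautifulSoup

# Hand-written rows (an assumption): the game week is a <th> in the
# first row and a <td> in the second, mimicking the two page layouts.
SAMPLE_ROWS = """
<table><tbody>
<tr><th data-stat="gameweek">1</th><td data-stat="squad_a">Team A</td></tr>
<tr><td data-stat="gameweek">2</td><td data-stat="squad_a">Team B</td></tr>
</tbody></table>
"""

def cell_text(row, stat):
    """Return the text of the <th> or <td> cell carrying the given data-stat."""
    cell = row.find(["th", "td"], {"data-stat": stat})
    # rows without that cell yield an empty string
    return cell.text if cell is not None else ""

soup = BeautifulSoup(SAMPLE_ROWS, "html.parser")
gameweeks = [cell_text(row, "gameweek") for row in soup.find_all("tr")]
print(gameweeks)
```

Returning an empty string for a missing cell also avoids an `AttributeError` on rows that lack a column, at the cost of silently hiding unexpected markup.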

To test the function, we pass the URL of the current season's page as an argument and obtain a dataframe with the scores and some additional information. The season is regular, so we do not need to specify the argument "regular", which is True by default.

In [5]:
url = "https://fbref.com/en/comps/66/schedule/Czech-First-League-Scores-and-Fixtures"
scores_scraper(url)

| | Game Week | Weekday | Date | Time | Home Team | Score | Away Team | Attendance | Venue | Referee |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Fri | 2020-08-21 | 18:00 | Viktoria Plzeň | 3–1 | Opava | 2813 | Doosan Arena | Alex Denev |
| 1 | 1 | Sat | 2020-08-22 | 17:00 | Fastav Zlín | 1–2 | Slovácko | 1282 | Stadion Letná | Pavel Královec |
| 2 | 1 | Sat | 2020-08-22 | 17:00 | Příbram | 1–3 | Teplice | 1350 | Energon Aréna | Ondřej Berka |
| 3 | 1 | Sat | 2020-08-22 | 17:00 | Sigma Olomouc | 1–0 | Slovan Liberec | 2216 | Andrův stadion | Paval Julínek |
| 4 | 1 | Sat | 2020-08-22 | 19:30 | Zbrojovka Brno | 1–4 | Sparta Prague | 2500 | Městský fotbalový stadion Srbská | Pavel Franek |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 343 | 34 | Fri | 2021-05-28 | | Sparta Prague | | Zbrojovka Brno | | Generali Arena | |
| 344 | 34 | Fri | 2021-05-28 | | Slovan Liberec | | Sigma Olomouc | | Stadion u Nisy | |
| 345 | 34 | Fri | 2021-05-28 | | Slovácko | | Fastav Zlín | | Městský fotbalový stadion Miroslava Vale... | |
| 346 | 34 | Fri | 2021-05-28 | | Slavia Prague | | České Budĕjov. | | Sinobo Stadium | |


The second URL belongs to an "irregular" season. This time we set the argument "regular" to False, and we get a dataframe in the same format as for a regular season.

In [6]:
url = "https://fbref.com/en/comps/66/3226/schedule/2019-2020-Czech-First-League-Scores-and-Fixtures"
scores_scraper(url, regular = False)

| | Game Week | Weekday | Date | Time | Home Team | Score | Away Team | Attendance | Venue | Referee |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Fri | 2019-07-12 | 18:00 | Jablonec | 2–0 | Bohemians 1905 | 2612 | Stadion Střelnice | Pavel Franek |
| 1 | 1 | Sat | 2019-07-13 | 17:00 | Příbram | 1–1 | Teplice | 2862 | Energon Aréna | Paval Julínek |
| 2 | 1 | Sat | 2019-07-13 | 17:00 | Baník Ostrava | 1–2 | Slovan Liberec | 7542 | Městský stadion - Vítkovice Aréna | Ondřej Berka |
| 3 | 1 | Sat | 2019-07-13 | 19:30 | Viktoria Plzeň | 3–1 | Sigma Olomouc | 9611 | Doosan Arena | Ondřej Pechanec |
| 4 | 1 | Sun | 2019-07-14 | 16:30 | České Budĕjov. | 0–1 | Opava | 4381 | Fotbalový stadion Střelecký ostrov | Ondřej Ginzel |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 311 | 4 | Thu | 2020-07-23 | 18:00 | Fastav Zlín | | Karviná | | Stadion Letná | |
| 312 | | | | | | | | | | |
| 313 | 5 | Sun | 2020-07-26 | 17:00 | Sigma Olomouc | | Fastav Zlín | | Andrův stadion | |
| 314 | 5 | Sun | 2020-07-26 | 17:00 | Karviná | | Příbram | | Městský stadion | |


Since we get a dataframe in the same format for both regular and irregular seasons, we can easily combine the dataframes for all scraped seasons into a single dataframe to make our later work easier. We first wrote a function that constructs the URL for a given season and passed its result to our scraping function. This was done for all seasons of our choice. Then, all the obtained dataframes were concatenated into one. The result can again be written to a CSV file (the export line is commented out below), so that we can load and process it in another notebook.

In [7]:
seasons = ["2015-2016", "2016-2017", "2017-2018", "2018-2019", "2019-2020", "2020-2021"]

def season_scores_url(season_index):
    season_id = [1459, 1518, 1623, 2427, 3226, ""]
    core_url1 = "https://fbref.com/en/comps/66/"
    core_url2 = "-Czech-First-League-Scores-and-Fixtures"
    if season_index == 5:
        schedule = "schedule/"
    else:
        schedule = "/schedule/"
    season_url = core_url1 + str(season_id[season_index]) + schedule + seasons[season_index] + core_url2
    return season_url
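
The same URL pattern can also be keyed by season name instead of list position, which avoids the hard-coded index 5 for the current season. This is a sketch only: the names `SEASON_IDS`, `BASE_URL`, `SUFFIX`, and `season_scores_url_by_name` are hypothetical, while the IDs and URL pieces are copied from the function above.

```python
# Season IDs copied from the function above; None marks the current
# season, whose URL has no ID segment.
SEASON_IDS = {
    "2015-2016": 1459,
    "2016-2017": 1518,
    "2017-2018": 1623,
    "2018-2019": 2427,
    "2019-2020": 3226,
    "2020-2021": None,
}

BASE_URL = "https://fbref.com/en/comps/66/"
SUFFIX = "-Czech-First-League-Scores-and-Fixtures"

def season_scores_url_by_name(season):
    """Build the scores-and-fixtures URL for a season such as "2019-2020"."""
    season_id = SEASON_IDS[season]
    # the ID segment is left out entirely for the current season
    id_segment = "" if season_id is None else f"{season_id}/"
    return f"{BASE_URL}{id_segment}schedule/{season}{SUFFIX}"

print(season_scores_url_by_name("2019-2020"))
```

Looking seasons up by name makes a typo fail loudly with a `KeyError` rather than silently selecting the wrong list position.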

In [8]:
scores_1516 = scores_scraper(season_scores_url(0))
scores_1617 = scores_scraper(season_scores_url(1))
scores_1718 = scores_scraper(season_scores_url(2))
scores_1819 = scores_scraper(season_scores_url(3), regular = False)
scores_1920 = scores_scraper(season_scores_url(4), regular = False)
scores_2021 = scores_scraper(season_scores_url(5))
scores_dfs = [scores_1516, scores_1617, scores_1718, scores_1819, scores_1920, scores_2021]
# combine all seasons into one dataframe with a fresh index
scores = pd.concat(scores_dfs, ignore_index = True, sort = False)
#scores.to_csv(r"C:\Users\Honza Stuchlík\Documents\IES\Data Processing in Python\Czech-Football-League\scores.csv", index = False)
scores

| | Game Week | Weekday | Date | Time | Home Team | Score | Away Team | Attendance | Venue | Referee |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Fri | 2015-07-24 | 17:30 | Viktoria Plzeň | 2–1 | Slavia Prague | 11233 | Doosan Arena | Pavel Franek |
| 1 | 1 | Fri | 2015-07-24 | 19:00 | Vysočina Jihlava | 0–0 | Sparta Prague | 3894 | Stadion v Jiráskově ulici | Tomas Kocourek |
| 2 | 1 | Sat | 2015-07-25 | 17:00 | Příbram | 2–3 | Jablonec | 4182 | Energon Aréna | Pavel Královec |
| 3 | 1 | Sat | 2015-07-25 | 17:00 | Slovácko | 4–3 | Dukla Prague | 3726 | Městský fotbalový stadion Miroslava Vale... | Zbyněk Proske |
| 4 | 1 | Sat | 2015-07-25 | 17:00 | Zbrojovka Brno | 2–1 | Baník Ostrava | 5326 | Městský fotbalový stadion Srbská | Libor Kovařík |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1732 | 34 | Fri | 2021-05-28 | | Slovan Liberec | | Sigma Olomouc | | Stadion u Nisy | |
| 1733 | 34 | Fri | 2021-05-28 | | Opava | | Viktoria Plzeň | | Stadion v Městských sadech | |
| 1734 | 34 | Fri | 2021-05-28 | | Slovácko | | Fastav Zlín | | Městský fotbalový stadion Miroslava Vale... | |
| 1735 | 34 | Fri | 2021-05-28 | | Slavia Prague | | České Budĕjov. | | Sinobo Stadium | |
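
Because game week numbers repeat across seasons, it may help to label each row with its season before combining the dataframes. The sketch below is an optional variation under assumptions: the toy dataframes and the "Season" column are hypothetical and do not appear in the output above.

```python
import pandas as pd

# Toy stand-ins for two scraped seasons (values are illustrative only).
toy_frames = {
    "2019-2020": pd.DataFrame({"Game Week": ["1", "2"], "Home Team": ["Team A", "Team B"]}),
    "2020-2021": pd.DataFrame({"Game Week": ["1"], "Home Team": ["Team C"]}),
}

# add a hypothetical "Season" column to each dataframe, then stack them
labelled = [frame.assign(Season=season) for season, frame in toy_frames.items()]
combined = pd.concat(labelled, ignore_index=True)
print(combined)
```

With `ignore_index=True` the combined dataframe gets a single running index instead of repeating each season's own row numbers.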
