# Obtaining FBref data
The aim of this script is to download statistics from the FBref website.
The site offers detailed statistics and data on players, teams, and leagues, including historical data and records. This project uses all the Italian Serie A matches from the 2018-2019 season to the current one (2022-2023).

The datasets are extracted by parsing the site's HTML tables with the Python BeautifulSoup library.
The class used to obtain data is DownloadDati. 
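
As a rough illustration of this kind of table parsing (not the actual code of `DownloadDati`, whose internals are not shown here), the sketch below reads a small hand-written HTML table with BeautifulSoup. The markup, the `table_to_records` helper, and the column names are assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for one stats table on a page.
SAMPLE_HTML = """
<table id="shooting">
  <thead><tr><th>Squadra</th><th>Tiri</th></tr></thead>
  <tbody>
    <tr><th>Inter</th><td>15</td></tr>
    <tr><th>Milan</th><td>11</td></tr>
  </tbody>
</table>
"""


def table_to_records(html, table_id):
    """Return the body rows of the table with the given id as a list of dicts."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        return []
    # Header cells give the field names.
    headers = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]
    records = []
    for tr in table.find("tbody").find_all("tr"):
        # A row may mix <th> (row label) and <td> (values) cells.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        records.append(dict(zip(headers, cells)))
    return records


records = table_to_records(SAMPLE_HTML, "shooting")
```

All values come back as strings, so numeric fields still need to be converted before any analysis.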

This notebook calls the methods needed to obtain all the stats.
The Italian version of the FBref website is used.

In [1]:
from download import DownloadDati
import pandas as pd
import time
import util_strings as utils

The dictionary below maps each season name (its years) to the link from which that season is downloaded.

In [2]:
download_seasons = {
    "2022-2023": "https://fbref.com/it/comp/11/Statistiche-di-Serie-A",
    "2021-2022": "https://fbref.com/it/comp/11/2021-2022/Statistiche-di-Serie-A-2021-2022",
    "2020-2021": "https://fbref.com/it/comp/11/2020-2021/Statistiche-di-Serie-A-2020-2021",
    "2019-2020": "https://fbref.com/it/comp/11/2019-2020/Statistiche-di-Serie-A-2019-2020",
    "2018-2019": "https://fbref.com/it/comp/11/2018-2019/Statistiche-di-Serie-A-2018-2019"
}

## Data download
Each season in the dictionary is downloaded and saved in the folder "Serie A/Stats".
The stats downloaded from FBref come from 3 different HTML tables:
- Shooting
- Possession
- Miscellaneous stats

Before saving the matches several operations must be performed:
* renaming of fields
* conversion of the date from string to datetime
* sorting of the matches by date
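
The three operations above can be sketched with pandas as follows. This is only an illustration under assumed column names (`Squadra`, `Data`, `Tiri` and their renamed counterparts are hypothetical), not the code used inside `DownloadDati`.

```python
import pandas as pd

# Hypothetical raw rows standing in for the parsed tables.
raw = pd.DataFrame({
    "Squadra": ["Milan", "Inter", "Juventus"],
    "Data": ["2022-08-20", "2022-08-13", "2022-08-15"],
    "Tiri": [11, 15, 9],
})

# 1. Rename the fields (this mapping is an assumption for the example).
matches = raw.rename(columns={"Squadra": "team1", "Data": "date", "Tiri": "shots"})

# 2. Convert the date from string to datetime.
matches["date"] = pd.to_datetime(matches["date"], format="%Y-%m-%d")

# 3. Sort the matches by date and rebuild the index.
matches = matches.sort_values("date").reset_index(drop=True)
```

Converting the date before sorting matters: with a day-first string format, a plain string sort would not give chronological order.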

In [None]:
downloadDF = pd.DataFrame()
row = 0
for season, link in download_seasons.items():
    print("Season "+season+ " downloading...")
    start = time.time()
    
    download = DownloadDati("Serie A") 
    download.connect(link)
    download.get_teams_names()
    download.get_matches()
    download.save_matches(utils.statistics.format(season))
    download.save_championship_games(utils.championship.format(season))

    print("Season "+season+ " download ended\n")
    end = time.time()
    downloadDF.at[row, 'season'] = season
    downloadDF.at[row, 'download time'] = end-start
    # all_matches holds two records per match (one per team), hence the division by 2
    downloadDF.at[row, 'n. matches'] = len(download.all_matches)/2
    row += 1
    

The script above is slow: it takes between 7 and 10 minutes per season.

## Match merge
All the CSV files found in the Stats folder are merged into a single CSV file called "Stats/all_stats.csv".

In [None]:
from os import listdir
from os.path import isfile, join
import pandas as pd
from analysis import MatchAnalysis

onlyfiles = [f for f in listdir(utils.stats) if isfile(join(utils.stats, f))]
name_csv_statistics_FE_season = [x for x in onlyfiles] 

season_frames = []

for name_statistics_single_season in sorted(name_csv_statistics_FE_season):
    print(utils.stats+name_statistics_single_season)
    statistics_single_season = pd.read_csv(utils.stats+name_statistics_single_season, index_col=0)

    # Add a field to each row with the season in which the game was played
    statistics_single_season['season'] = name_statistics_single_season[:9]  # e.g.: 2022-2023
    season_frames.append(statistics_single_season)

# DataFrame.append was removed in pandas 2.0, so the seasons are concatenated in a single call
merged_statistics = pd.concat(season_frames) if season_frames else pd.DataFrame()

In [None]:
merged_statistics['team1'] = merged_statistics['team1'].str.lower()
merged_statistics['team2'] = merged_statistics['team2'].str.lower()

In [None]:
merged_statistics.to_csv(utils.merged_statistics)

The resulting dataset has a pair of records for each match: the first record holds the statistics of the home team against the away team, and the second one the reverse. The two records of a match must be combined into one, and the script that does this is .....
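
One possible way to combine such a pair, shown only as a sketch and not as the project's actual script, is to split the rows by perspective and join them back on the match. The `team1` and `team2` columns come from the dataset above; the `date`, `home`, and `shots` columns are assumptions made for the example.

```python
import pandas as pd

# Hypothetical merged rows: two records for one match, one per perspective.
stats = pd.DataFrame({
    "date": ["2022-08-13", "2022-08-13"],
    "team1": ["lecce", "inter"],
    "team2": ["inter", "lecce"],
    "home": [True, False],  # assumed flag: True when team1 plays at home
    "shots": [8, 15],
})

home = stats[stats["home"]].drop(columns="home")
away = stats[~stats["home"]].drop(columns="home")

# Swap the team labels of the away records so both halves share the same key.
away = away.rename(columns={"team1": "team2", "team2": "team1"})

# One row per match, with the stats of each side in suffixed columns.
combined = home.merge(away, on=["date", "team1", "team2"], suffixes=("_home", "_away"))
```

In the combined frame `team1` is always the home team, and each statistic appears twice, as `shots_home` and `shots_away` in this example.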