# Cleaning and pre-processing the FBref data
The dataset needs processing because each match appears in two records: the first holds the statistics achieved by the home team against the away team, and the second holds the reverse. These two records must be combined into one.

Each pair of rows referencing the same match is merged in the following 3 steps:
- filter the previous statistics of the 2 teams involved in the match to obtain two datasets: one with the home stats and one with the away stats
- compute the average of each statistic over the 5 previous matches
- subtract the averaged features: avg(home features) – avg(away features)
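
The three steps can be illustrated with a minimal, standard-library sketch. It is not the code of MatchAnalysis: the function names, the dictionary layout and the toy numbers are assumptions made for illustration, and it assumes each team has at least one previous match on record.

```python
from statistics import mean


def previous_average(history, n=5):
    """Average each statistic over the last n matches in `history`.

    `history` is a chronologically ordered list (oldest first) holding
    one dict of numeric statistics per match already played.
    """
    recent = history[-n:]
    return {stat: mean(match[stat] for match in recent) for stat in recent[0]}


def difference_features(home_history, away_history, n=5):
    """Return avg(home features) - avg(away features) for each statistic."""
    home_avg = previous_average(home_history, n)
    away_avg = previous_average(away_history, n)
    return {stat: home_avg[stat] - away_avg[stat] for stat in home_avg}


# Hypothetical toy data: two statistics, two previous matches per team.
home_history = [{"shots": 10, "possession": 55}, {"shots": 14, "possession": 60}]
away_history = [{"shots": 8, "possession": 45}, {"shots": 12, "possession": 50}]

features = difference_features(home_history, away_history, n=5)
print(features)
```

When fewer than 5 previous matches are available, this sketch simply averages over the ones that exist; how MatchAnalysis handles that case is not stated here.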

Another important feature is the ranking of the teams, computed as the number of direct matches each of the two teams has won across all their previous meetings. Each match therefore has two additional features: home rank and away rank. The resulting features are stored in "Stats/cleaned_stats.csv".
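
As a rough illustration of this head-to-head count, the sketch below tallies the direct wins of two teams. It is not the Ranking class: the function name, the tuple layout and the sample results are hypothetical, and it assumes a draw adds nothing to either rank.

```python
def head_to_head_rank(previous_matches, home, away):
    """Count the direct wins of `home` and `away` in their earlier meetings.

    `previous_matches` is an iterable of
    (home_team, away_team, home_goals, away_goals) tuples for matches
    played before the one being described.
    """
    wins = {home: 0, away: 0}
    for h, a, h_goals, a_goals in previous_matches:
        if {h, a} != {home, away}:
            continue  # not a direct meeting between the two teams
        if h_goals > a_goals:
            wins[h] += 1
        elif a_goals > h_goals:
            wins[a] += 1
        # a draw leaves both counts unchanged (assumption)
    return wins[home], wins[away]


# Hypothetical results, used only to exercise the function.
matches = [
    ("inter", "milan", 2, 1),
    ("milan", "inter", 0, 0),
    ("milan", "inter", 3, 2),
    ("inter", "roma", 1, 0),
    ("inter", "milan", 1, 0),
]
home_rank, away_rank = head_to_head_rank(matches, "inter", "milan")
print(home_rank, away_rank)
```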

The classes that implement this are MatchAnalysis (analysis.py) and Ranking (ranking.py).

Old stats in "Stats/cleaned_stats.csv" are overwritten by the new ones.

In [1]:
from analysis import MatchAnalysis
import util_strings as utils
from ranking import Ranking

In [2]:
ranking = Ranking('SerieA', '2022-2023')
ranking.read_matches(seasons=6, path=utils.ranking)

ma = MatchAnalysis()
ma.set_ranking(ranking=ranking)

In [3]:
# read all the matches from "Stats/all_stats.csv" (each match appears twice, once per team)
ma.read_matches(utils.merged_statistics)
# build a list of team objects
ma.create_team_dataset()
# split the home games from the away ones
ma.divide_and_merge_home_away()
# combine the 2 datasets by averaging the previous 5 matches
ma.reduce_dataset_with_avg(number=5, path=utils.dataset_without_text)

Each team is saved in a JSON file together with its identification code.

In [4]:
import json
with open(utils.teams_codes, 'rb') as json_file:
    data = json.load(json_file)

In [5]:
# take the next free id (one past the largest id already in use), to assign it to the next team;
# unlike a sequential scan, this does not depend on the order of the ids in the file
cont = max(data.values(), default=-1) + 1

# assign an id to the teams not yet present in the file
keys = data.keys()
for team in ma.matches_by_team:
    if team.name.lower() not in keys:
        data[team.name.lower()]=cont
        cont+=1

In [6]:
with open(utils.teams_codes, 'w') as outfile:
    json.dump(data, outfile)