# Data Prep

Loads the output of a [transfermarkt-scraper](https://github.com/dcaribou/transfermarkt-scraper) run and applies a series of transformations to produce a validated file that is friendlier for analysis. Some of these transformations are:

* Creating handy ID columns
* Renaming fields to comply with the naming convention
* Parsing raw values into their own columns

## Load
The input to the data prep process is expected to be the output of the [transfermarkt-scraper](https://github.com/dcaribou/transfermarkt-scraper), i.e., a JSON Lines file with one line per player and competition, each containing all of that player's appearances up to the date the scraper was run.

In [1]:
import pandas as pd

raw_file = '../data/tfmkt__2019-08-03__GB1_ES1.json'

raw = pd.read_json(
  raw_file,
  lines=True,
  convert_dates=True,
  orient='records'  # JSON Lines input: one record per line
)

raw.head()

| | confederation | domestic_competition | stats_competition | current_team | player_name | stats |
|---|---|---|---|---|---|---|
| 0 | europa | GB1 | GB1 | fc-watford | adam-masina | `[{'matchday': '5', 'date': '2018-09-15', 'home...` |
| 1 | europa | GB1 | FAC | fc-watford | adam-masina | `[{'matchday': 'Third Round', 'date': '2019-01-...` |
| 2 | europa | GB1 | CGB | fc-watford | adam-masina | `[{'matchday': 'Second Round', 'date': '2018-08...` |
| 3 | europa | GB1 | GB1 | brighton-amp-hove-albion | jayson-molumby | `[]` |
| 4 | europa | GB1 | GBFL | brighton-amp-hove-albion | jayson-molumby | `[]` |


## Prep
The prep phase applies a series of transformations to the raw data frame loaded above.

In [2]:
from prep_lib import flatten, renames, improve_columns, add_new_columns, filter_appearances, validate

### Flatten
First, we explode the data frame so that it has one row per player appearance rather than one row per player.

In [3]:
raw_flat = flatten(raw, ['stats'])
raw_flat.head()

| | matchday | date | home_team | away_team | result | pos | goals | assists | own_goals | yellow_cards | ... | red_cards | substituted_on | substituted_off | minutes_played | transfermarkt_player_rating | confederation | domestic_competition | stats_competition | current_team | player_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 2018-09-15 | fc-watford | manchester-united | 1:2 | LB | 0 | 0 | 0 | 0 | ... | 0 | 84 | 0 | 6 | | europa | GB1 | GB1 | fc-watford | adam-masina |
| 1 | 9 | 2018-10-20 | wolverhampton-wanderers | fc-watford | 0:2 | LB | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 90 | | europa | GB1 | GB1 | fc-watford | adam-masina |
| 2 | 10 | 2018-10-27 | fc-watford | huddersfield-town | 3:0 | LB | 0 | 1 | 0 | 67' | ... | 0 | 0 | 0 | 90 | | europa | GB1 | GB1 | fc-watford | adam-masina |
| 3 | 13 | 2018-11-24 | fc-watford | fc-liverpool | 0:3 | LB | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 90 | | europa | GB1 | GB1 | fc-watford | adam-masina |
| 4 | 20 | 2018-12-29 | fc-watford | newcastle-united | 1:1 | LB | 0 | 0 | 0 | 44' | ... | 0 | 0 | 78 | 78 | | europa | GB1 | GB1 | fc-watford | adam-masina |
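
The flattening step can be approximated with plain pandas. The sketch below is not the code in `prep_lib`; `flatten_sketch` is a hypothetical helper that takes a single column name (the real `flatten` is called with a list) and assumes that column holds a list of dicts per row.

```python
import pandas as pd


def flatten_sketch(df: pd.DataFrame, list_column: str) -> pd.DataFrame:
    """Hypothetical stand-in for prep_lib.flatten: one row per list element."""
    # explode() repeats the parent row once per list element; a row whose
    # list is empty yields a single NaN, which dropna() then removes.
    exploded = df.explode(list_column).dropna(subset=[list_column])
    exploded = exploded.reset_index(drop=True)
    # Expand each appearance dict into its own set of columns.
    details = pd.json_normalize(exploded[list_column].tolist())
    # Per-appearance columns first, then the per-player columns.
    return pd.concat([details, exploded.drop(columns=[list_column])], axis=1)


sample = pd.DataFrame({
    'player_name': ['adam-masina', 'jayson-molumby'],
    'current_team': ['fc-watford', 'brighton-amp-hove-albion'],
    'stats': [
        [{'matchday': '5', 'date': '2018-09-15'},
         {'matchday': '9', 'date': '2018-10-20'}],
        [],
    ],
})
flat = flatten_sketch(sample, 'stats')
```

In this sketch, players without any appearances (an empty `stats` list) drop out of the result.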


### Rename
Rename the input columns to make them consistent with the naming convention.

In [4]:
mappings = {
    'matchday': 'round',
    'home_team': 'home_club_name',
    'away_team': 'away_club_name',
    'current_team': 'player_club_name',
    'pos': 'player_position',
    'confederation': 'club_confederation',
    'domestic_competition': 'club_domestic_competition',
    'stats_competition': 'competition',
    'transfermarkt_player_rating': 'player_transfermarkt_rating'
}

with_renamed_columns = renames(raw_flat, mappings)
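
For reference, a rename like this maps directly onto `DataFrame.rename`. The helper below is a hypothetical sketch, not the `renames` function from `prep_lib`; the check for missing columns is an added assumption about what would be useful here.

```python
import pandas as pd


def renames_sketch(df: pd.DataFrame, mappings: dict) -> pd.DataFrame:
    """Hypothetical equivalent of prep_lib.renames."""
    # DataFrame.rename silently ignores keys that are not columns,
    # so fail loudly on a mapping that would have no effect.
    missing = set(mappings) - set(df.columns)
    if missing:
        raise KeyError(f'columns not found: {sorted(missing)}')
    return df.rename(columns=mappings)


sample = pd.DataFrame({'matchday': ['5'], 'pos': ['LB']})
renamed = renames_sketch(sample, {'matchday': 'round', 'pos': 'player_position'})
```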

### Update
- [x] Convert `goals`, `assists`, `own_goals` and `date` to the appropriate types
- [x] Revamp `yellow_cards` and `red_cards`. `second_yellows` column is not needed
- [ ] Club name prettifying. _FC Watford_ instead of _fc-watford_
- [ ] Player name prettifying. _Adam Masina_ instead of _adam-masina_
- [ ] Use longer names for `position` instead of the cryptic 'LB', etc. (use 'filter by position' [here](https://www.transfermarkt.co.uk/diogo-jota/leistungsdatendetails/spieler/340950/saison/2020/verein/0/liga/0/wettbewerb/GB1/pos/0/trainer_id/0/plus/1) to get the mappings)

In [5]:
with_improved_columns = improve_columns(with_renamed_columns)
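
The two checked items could look roughly like the sketch below. It is not the `improve_columns` implementation; `improve_columns_sketch` is a hypothetical helper, and the card handling rests on an assumption drawn from the flattened sample above: a raw card value is either `0` or the minute of the booking (for example `67'`).

```python
import pandas as pd


def improve_columns_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical illustration of the checked items; not prep_lib's code."""
    out = df.copy()
    # Counters arrive as strings; coerce them to integers.
    for column in ['goals', 'assists', 'own_goals']:
        out[column] = pd.to_numeric(out[column], errors='coerce').fillna(0).astype(int)
    out['date'] = pd.to_datetime(out['date'], format='%Y-%m-%d')
    # Assumption: anything other than a missing value, '' or '0' is the
    # minute of a booking and therefore counts as one card.
    for column in ['yellow_cards', 'red_cards']:
        as_text = out[column].astype(str).str.strip()
        is_zero = out[column].isna() | as_text.isin(['0', ''])
        out[column] = (~is_zero).astype(int)
    return out


sample = pd.DataFrame({
    'date': ['2018-10-27', '2018-11-24'],
    'goals': ['0', '1'],
    'assists': ['1', '0'],
    'own_goals': ['0', '0'],
    'yellow_cards': ["67'", '0'],
    'red_cards': ['0', '0'],
})
improved = improve_columns_sketch(sample)
```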

### Create
- [x] Add surrogate keys `game_id`, `player_id`, `appearance_id`, `home_club_id`, `away_club_id`
- [x] Split `result` into `home_club_goals` and `away_club_goals`
- [x] Approximate appearance `season`

In [6]:
with_new_columns = add_new_columns(with_improved_columns)
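
The three items above can each be expressed as a small pure function. The sketch below is illustrative only: the function names are hypothetical, the July season boundary and the hash-based key are assumptions, and `split_result` assumes a clean `home:away` score such as `1:2`.

```python
import hashlib
from datetime import date
from typing import Tuple


def split_result(result: str) -> Tuple[int, int]:
    """Split a 'home:away' score such as '1:2' into two integers."""
    home, away = result.split(':')
    return int(home), int(away)


def approximate_season(match_date: date, first_month: int = 7) -> int:
    """Assumption: a season is labelled by the year it starts in and runs
    from July to June, so a match in May 2019 belongs to season 2018."""
    if match_date.month >= first_month:
        return match_date.year
    return match_date.year - 1


def surrogate_key(*parts: str) -> str:
    """Assumption: derive a stable ID by hashing the natural key."""
    digest = hashlib.md5('|'.join(parts).encode('utf-8')).hexdigest()
    return digest[:12]


home_club_goals, away_club_goals = split_result('1:2')
season = approximate_season(date(2018, 9, 15))
game_id = surrogate_key('2018-09-15', 'fc-watford', 'manchester-united')
```

Each function can be applied column-wise, for example with `Series.map`.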

### Filter
* Only season 2018 is complete in the current file, so we remove the rest
  - [ ] Rather than hardcoding the filter, the whole script should be parameterized for a specific season
* To reduce the scope of this version of the data prep script, select only appearances from domestic competitions


In [7]:
with_filtered_appearances = filter_appearances(with_new_columns)
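
A parameterized version of this filter, as suggested by the open item above, might look like the following sketch. `filter_appearances_sketch` is hypothetical, and it assumes an appearance counts as "domestic" when `competition` equals `club_domestic_competition`; the actual rule in `prep_lib` may differ.

```python
import pandas as pd


def filter_appearances_sketch(df: pd.DataFrame, season: int) -> pd.DataFrame:
    """Hypothetical parameterized variant of the filter step."""
    in_season = df['season'] == season
    # Assumption: an appearance is domestic when it was played in the
    # club's own league, i.e. both competition codes match.
    is_domestic = df['competition'] == df['club_domestic_competition']
    return df[in_season & is_domestic].reset_index(drop=True)


sample = pd.DataFrame({
    'season': [2018, 2018, 2019],
    'competition': ['GB1', 'FAC', 'GB1'],
    'club_domestic_competition': ['GB1', 'GB1', 'GB1'],
})
filtered = filter_appearances_sketch(sample, season=2018)
```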

## Validate
Validate that the output dataframe contains consistent data. Two types of checks are performed.

### Value checks
- [x] Fields `red_cards`, `yellow_cards`, `own_goals`, `assists`, `goals` and `minutes_played` contain values within an expected range
- [x] Rows are unique on `player_id` + `date`
- [ ] `position` field is one of the long-form player positions from Transfermarkt

### Completeness checks
- [x] Number of teams per domestic competition must be exactly 20
- [ ] Each club must play 38 games per season on the domestic competition
- [ ] In each match, both clubs should have at least 11 appearances
- [ ] Similarly, each club must have at least 11 appearances per game


In [8]:
validate(with_filtered_appearances)

Validation clubs_per_domestic_competition did not pass
Validation games_per_season_per_club did not pass
Validation appearances_per_match did not pass
Validation appearances_per_club_per_game did not pass
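
To make the two kinds of checks concrete, here is a minimal sketch of one value check and one completeness check. It is not the `validate` function from `prep_lib`: `validate_sketch` and the `unique_player_date` check name are hypothetical, and counting clubs through `player_club_name` is an assumption.

```python
import pandas as pd


def validate_sketch(df: pd.DataFrame, expected_clubs: int = 20) -> dict:
    """Hypothetical versions of two checks; returns check name -> passed."""
    # Value check: no two rows share the same player and date.
    unique_rows = not df.duplicated(subset=['player_id', 'date']).any()
    # Completeness check: every domestic competition has the expected
    # number of distinct clubs.
    clubs = df.groupby('club_domestic_competition')['player_club_name'].nunique()
    return {
        'unique_player_date': bool(unique_rows),
        'clubs_per_domestic_competition': bool((clubs == expected_clubs).all()),
    }


sample = pd.DataFrame({
    'player_id': [1, 1, 2],
    'date': ['2018-09-15', '2018-10-20', '2018-09-15'],
    'club_domestic_competition': ['GB1', 'GB1', 'GB1'],
    'player_club_name': ['fc-watford', 'fc-watford', 'fc-liverpool'],
})
results = validate_sketch(sample)
for name, passed in results.items():
    if not passed:
        print(f'Validation {name} did not pass')
```

Returning a mapping of check names to booleans keeps the reporting separate from the checks themselves, so a caller can decide whether a failed check should only be logged or should stop the run.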


## Save

In [9]:
with_filtered_appearances.to_csv(
  '../data/tfmkt__2019-08-03__GB1_ES1__prep.csv',
  index=False
)