## List of quality and tidiness issues in soccer database

### Quality

##### match table
- `date` column should be datetime
- `season` should be categorical
- figure out what `stage` represents
    - represents the matchday
    - ACTION: rename to `matchday` and make it categorical
- player IDs (`home_player_1` ... `away_player_11`) are null for some matches
    - should drop null values
- match events (`goal` ... `possession`) are null for some matches
    - should drop null values
- predictions are null for some matches
    - should drop null values
- player positions (`home_player_X1` ... `away_player_Y11`) are null for some matches
    - leave for now as I don't intend to use these values
    - keeping these rows preserves whatever other values they provide
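
The type fixes and the player-ID null handling above could be sketched roughly as follows. This is a minimal pandas sketch, not the actual cleaning code: `clean_match` is a hypothetical helper, and the sample rows are made up for illustration. Only the player-ID columns are used for dropping nulls here; the same `dropna` pattern would apply to the event and prediction columns.

```python
import pandas as pd

def clean_match(match: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the match table with the type fixes listed above."""
    out = match.copy()
    out["date"] = pd.to_datetime(out["date"])
    out["season"] = out["season"].astype("category")
    # `stage` represents the matchday, so rename it before converting.
    out = out.rename(columns={"stage": "matchday"})
    out["matchday"] = out["matchday"].astype("category")
    # Drop matches with a missing player ID. The position columns
    # (home_player_X1 ... away_player_Y11) are deliberately left alone.
    player_cols = [
        f"{side}_player_{i}" for side in ("home", "away") for i in range(1, 12)
    ]
    present = [c for c in player_cols if c in out.columns]
    if present:
        out = out.dropna(subset=present)
    return out

# Illustrative rows only, not taken from the database.
sample = pd.DataFrame({
    "date": ["2008-08-17 00:00:00", "2008-08-16 00:00:00"],
    "season": ["2008/2009", "2008/2009"],
    "stage": [1, 1],
    "home_player_1": [39890.0, None],
})
cleaned = clean_match(sample)
```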

##### player attributes table
- `date` column should be datetime
- missing a `season` column
    - create from date column
- missing a `league_id` column
    - create from the league of the first team the player played for that season
- preferred foot should be categorical
- all attributes missing for some rows
    - should drop null values
- attacking work rate null for some players
    - should drop null values
- attacking work rate has strange values
    - should normalize values
- defensive work rate has strange values
    - should normalize values
- attacking work rate should be categorical variable
- defensive work rate should be categorical variable
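
Two of the items above (deriving `season` from `date`, and normalizing the work rate columns) might look something like this. Both helpers are hypothetical. The sketch assumes a season starts in July and is labelled `YYYY/YYYY`, and that the only valid work rates are `low`, `medium` and `high`, with anything else treated as missing; those assumptions would need checking against the data.

```python
import pandas as pd

# Assumption: these are the only meaningful work rate values.
VALID_RATES = ["low", "medium", "high"]

def season_from_date(dates: pd.Series, start_month: int = 7) -> pd.Series:
    """Map each date to a 'YYYY/YYYY' season label (assumed July start)."""
    dates = pd.to_datetime(dates)
    year = dates.dt.year
    # Dates before the start month belong to the season that began the year before.
    start_year = year.where(dates.dt.month >= start_month, year - 1)
    return start_year.astype(str) + "/" + (start_year + 1).astype(str)

def normalize_work_rate(col: pd.Series) -> pd.Series:
    """Lower-case the values, blank out unknown ones, return an ordered categorical."""
    cleaned = col.str.strip().str.lower()
    cleaned = cleaned.where(cleaned.isin(VALID_RATES))
    return cleaned.astype(pd.CategoricalDtype(VALID_RATES, ordered=True))

# Illustrative values only.
seasons = season_from_date(pd.Series(["2016-02-18", "2015-09-21", "2015-06-30"]))
rates = normalize_work_rate(pd.Series(["High", "norm", None, "medium"]))
```

Rows whose work rate ends up missing after normalization would then be caught by the "drop null values" step.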

##### team attributes table
- `date` column should be datetime
- missing a `season` column
    - create from date column
- all the columns that end with `Class` should be categorical
- `buildUpPlayDribbling` is float, should be int
    - not handling because will drop the column
- `buildUpPlayDribbling` has (a lot of) null values
    - consider dropping the entire column
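
A rough sketch of the team attributes fixes, assuming the `buildUpPlayDribbling` column is dropped as suggested. `clean_team_attributes` is a hypothetical helper, and `buildUpPlaySpeedClass` is used only as an example of a column ending in `Class`.

```python
import pandas as pd

def clean_team_attributes(team_attrs: pd.DataFrame) -> pd.DataFrame:
    """Drop the mostly-null dribbling column and fix the column types."""
    # errors="ignore" keeps this safe to re-run after the column is gone.
    out = team_attrs.drop(columns=["buildUpPlayDribbling"], errors="ignore")
    out["date"] = pd.to_datetime(out["date"])
    class_cols = [c for c in out.columns if c.endswith("Class")]
    return out.astype({c: "category" for c in class_cols})

# Illustrative rows only.
sample_team = pd.DataFrame({
    "date": ["2010-02-22 00:00:00", "2014-09-19 00:00:00"],
    "buildUpPlaySpeedClass": ["Balanced", "Fast"],
    "buildUpPlayDribbling": [None, 48.0],
})
team_clean = clean_team_attributes(sample_team)
```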


##### champs league table
- contains data for years that are not part of our period of interest
    - those years should be removed
- order prefix assigned to `progress` values is not consistent, e.g. `5. Last 16` and `6. Last 16`
    - remove order before making the variables categorical
- `progress` should be categorical
- missing team api id column
    - match from team name and make necessary adjustments
- missing a `season` column
    - create from year column
    - drop rows for years that are not in our timeframe
- missing a league id column
    - create from country column
- inconsistent team names (e.g. typo in Steaua Bucuresti, shortened names like Bayern Munich and Real Madrid)
    - use names in the main team table
- potential duplicates (APOEL vs Hapoel Tel Aviv)
    - remove duplicates
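
Stripping the inconsistent order prefix from `progress` before converting it could be done with a regex, roughly as below. The first two values come from the note above; `1. Winner` is a made-up value added for illustration.

```python
import pandas as pd

progress = pd.Series(["5. Last 16", "6. Last 16", "1. Winner"])

# Remove a leading "<digits>." prefix and any whitespace after it.
stripped = progress.str.replace(r"^\d+\.\s*", "", regex=True)

# Only convert once the labels are consistent, otherwise
# "5. Last 16" and "6. Last 16" would become separate categories.
progress_cat = stripped.astype("category")
```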

### Tidiness

##### match table
- the xml columns (`goal` ... `possession`) contain multiple details that can and should be in a separate table
- player positions (`home_player_X1` ... `away_player_Y11`) should be in a separate table
- match players (`home_player_1` ... `away_player_11`) should be in a separate table
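
One way to move the match players into their own table is to melt the 22 player columns into one row per match and player. This is only a sketch: `match_players_long` is a hypothetical helper, and `match_api_id` is assumed to be the match key column.

```python
import re
import pandas as pd

def match_players_long(match: pd.DataFrame, id_col: str = "match_api_id") -> pd.DataFrame:
    """Reshape home_player_1 ... away_player_11 into one row per match and player."""
    # The pattern requires digits after "player_", so the X/Y position columns are skipped.
    player_cols = [
        c for c in match.columns if re.fullmatch(r"(home|away)_player_\d+", c)
    ]
    long = match.melt(
        id_vars=[id_col],
        value_vars=player_cols,
        var_name="slot",
        value_name="player_id",
    ).dropna(subset=["player_id"])
    # Split e.g. "home_player_3" into the side and the player number.
    parts = long["slot"].str.extract(r"^(home|away)_player_(\d+)$")
    long = long.assign(
        team_designation=parts[0],
        player_num=parts[1].astype(int),
        player_id=long["player_id"].astype("int64"),
    )
    return long.drop(columns="slot").reset_index(drop=True)

# Illustrative rows only.
sample_match = pd.DataFrame({
    "match_api_id": [1, 2],
    "home_player_1": [10.0, None],
    "away_player_1": [20.0, 30.0],
})
players = match_players_long(sample_match)
```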

##### player attributes table
- duplicated rows for the same information (same player and date)

##### champs league table
- some rows are duplicates

### Iterating - Quality

##### `goal` events table
- `x` and `y` should be dropped because too many of them are null
- investigate rows that lack `player1`
    - drop if they represent incomplete data and they also lack the `team` column (that would mean we don't know the scorer or the team that scored the goal)
- investigate high proportion (~42%) of rows that lack `player2`
- investigate rows that lack `subtype`
    - can it be inferred from `goal_type`?
- investigate what exactly the `_del` column represents
    - drop if it's not useful
- the beneficiary of own goals needs to be clearly identified
- investigate rows that have null `team` column
    - check if it can be inferred from `player1`
    - if not then drop the row
- columns to drop: `type`
- confirm that players and teams are actually correct i.e. the players belonged to the teams and the team IDs correspond to the teams that played the match in the match table
- type conversions:
    - `event_api_id`, `event_incident_typefk`, `elapsed`, `elapsed_plus`, `player1`, `player2`, `team` should be integer columns
    - `subtype` should be categorical
- renaming columns:
    - `player1` and `player2` should be renamed to `scorer` and `assister` respectively
    - `stats` should be changed to something more appropriate
- `goal_type` has unclear values
    - need to confirm what each value represents then make categorical
- `stats` has inconsistent values that need to be handled
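
Because several of these columns (`player2`, `elapsed_plus`, ...) contain nulls, a plain `int` conversion would fail, so the sketch below uses pandas' nullable `Int64` dtype instead. `convert_goal_types` is a hypothetical helper and the sample values are illustrative. It assumes the parsed XML values arrive as strings.

```python
import pandas as pd

INT_COLS = [
    "event_api_id", "event_incident_typefk", "elapsed",
    "elapsed_plus", "player1", "player2", "team",
]

def convert_goal_types(goal: pd.DataFrame) -> pd.DataFrame:
    """Apply the type conversions and renames listed for the goal events table."""
    out = goal.copy()
    for col in INT_COLS:
        if col in out.columns:
            # Nullable Int64 keeps missing values as <NA> instead of forcing floats.
            out[col] = pd.to_numeric(out[col], errors="coerce").astype("Int64")
    out["subtype"] = out["subtype"].astype("category")
    return out.rename(columns={"player1": "scorer", "player2": "assister"})

# Illustrative rows only.
sample_goal = pd.DataFrame({
    "event_api_id": ["378998", "379019"],
    "elapsed": ["22", "24"],
    "elapsed_plus": [None, "2"],
    "player1": ["37799", "24148"],
    "player2": ["38807", None],
    "team": ["10261", "10260"],
    "subtype": ["header", "shot"],
})
goal_clean = convert_goal_types(sample_goal)
```

The same pattern should carry over to the other event tables, with their own column lists and renames.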

##### `shoton` events table
- investigate rows that lack `player1`
    - drop if they represent incomplete data and they also lack the `team` column (that would mean we don't know the shooter or the team that took the shot)
- investigate rows that lack `subtype`
- investigate what exactly the `_del` column represents
    - drop if it's not useful
- investigate rows that have null `team` column
    - check if it can be inferred from `player1`
    - if not then drop the row
- columns to drop: `type`, `card_type`
- investigate negative values for `elapsed_plus` to know if the row should be dropped or if the column should be set to null
- confirm that players and teams are actually correct i.e. the players belonged to the teams and the team IDs correspond to the teams that played the match in the match table
- type conversions:
    - `event_api_id`, `event_incident_typefk`, `elapsed`, `elapsed_plus`, `player1`, `team` should be integer columns
    - `subtype` should be categorical
- renaming columns:
    - `player1` should be renamed to `shooter`

##### `shotoff` events table
- investigate rows that lack `player1`
    - drop if they represent incomplete data and they also lack the `team` column (that would mean we don't know the shooter or the team that took the shot)
- investigate rows that lack `subtype`
- investigate what exactly the `_del` column represents
    - drop if it's not useful
- investigate rows that have null `team` column
    - check if it can be inferred from `player1`
    - if not then drop the row
- columns to drop: `type`, `card_type`, `stats`
- confirm that players and teams are actually correct i.e. the players belonged to the teams and the team IDs correspond to the teams that played the match in the match table
- type conversions:
    - `event_api_id`, `event_incident_typefk`, `elapsed`, `elapsed_plus`, `player1`, `team` should be integer columns
    - `subtype` should be categorical
- renaming columns:
    - `player1` should be renamed to `shooter`

##### `foulcommit` events table
- investigate rows that lack `player1`, `player2` or `team`
    - drop if they represent incomplete data and they also lack the other column (we therefore can't infer the team that should receive the event)
- `card_type` has too many missing values (only 12 yellow cards and 1 red card over 8 years?)
    - consider dropping if this info can't be easily completed
- investigate what exactly the `_del` column represents
    - drop if it's not useful
- columns to drop: `injury_time`, `venue`, `type`, `stats`
- drop the row that has 17 minutes `elapsed_plus`
    - couldn't find information to validate it
    - that row is lacking player and team data so there's a good chance it'll already be removed by the earlier step that investigates missing player and team data
- confirm that players and teams are actually correct i.e. the players belonged to the teams and the team IDs correspond to the teams that played the match in the match table
- type conversions:
    - `event_api_id`, `event_incident_typefk`, `elapsed`, `elapsed_plus`, `player1`, `player2`, `team` should be integer columns
    - `subtype` should be categorical
    - if kept, `card_type` should be categorical
- renaming columns:
    - `player1` should be renamed to `fouler` and `player2` should be renamed to `fouled_player`

##### `card` events table
- investigate rows that lack `player1` or `team`
    - drop if they represent incomplete data and they also lack the other column (we therefore can't infer the team that should receive the event)
- investigate what exactly the `_del` column represents
    - drop if it's not useful
- columns to drop: `goal_type`, `type`
- confirm that players and teams are actually correct i.e. the players belonged to the teams and the team IDs correspond to the teams that played the match in the match table
- type conversions:
    - `event_api_id`, `event_incident_typefk`, `elapsed`, `elapsed_plus`, `player1`, `team` should be integer columns
    - `subtype` should be categorical
    - `comment` should be categorical
- renaming columns:
    - `player1` should be renamed to `card_recipient`

##### `cross` events table
- investigate rows that lack `player1` or `team`
    - drop if they represent incomplete data and they also lack the other column (we therefore can't infer the team that should receive the event)
- investigate what exactly the `_del` column represents
    - drop if it's not useful
- columns to drop: `goal_type`, `spectators`
- confirm that players and teams are actually correct i.e. the players belonged to the teams and the team IDs correspond to the teams that played the match in the match table
- type conversions:
    - `event_api_id`, `event_incident_typefk`, `elapsed`, `elapsed_plus`, `player1`, `team` should be integer columns
    - `type` should be categorical
- renaming columns:
    - `player1` should be renamed to `crosser`

##### `corner` events table
- investigate rows that lack `player1` or `team`
    - drop if they represent incomplete data and they also lack the other column (we therefore can't infer the team that should receive the event)
- investigate what exactly the `_del` column represents
    - drop if it's not useful
- columns to drop: `spectators`
- confirm that players and teams are actually correct i.e. the players belonged to the teams and the team IDs correspond to the teams that played the match in the match table
- type conversions:
    - `event_api_id`, `event_incident_typefk`, `elapsed`, `elapsed_plus`, `player1`, `team` should be integer columns
    - `subtype` should be categorical
- renaming columns:
    - `player1` should be renamed to `corner_taker`

##### `possession` events table
- investigate rows that are missing `homepos` and `awaypos`
    - if only one is missing, it should be calculated from the other
    - if both are missing, check if `comment` has a value and calculate home as the rounded integer of `comment`, then derive away
    - if both are missing and comment has no value, delete the row
- investigate what exactly the `_del` column represents
    - drop if it's not useful
- columns to drop: `spectators`, `injury_time`, `subtype`, `stats`, `card_type`
- confirm that players and teams are actually correct i.e. the players belonged to the teams and the team IDs correspond to the teams that played the match in the match table
- type conversions:
    - `event_api_id`, `event_incident_typefk`, `elapsed`, `elapsed_plus`, `homepos`, `awaypos` should be integer columns
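
The fill rules for `homepos` and `awaypos` could be sketched as below. `fill_possession` is a hypothetical helper, and the sketch assumes the two values are percentages that sum to 100, which should be confirmed against the data first.

```python
import pandas as pd

def fill_possession(pos: pd.DataFrame) -> pd.DataFrame:
    """Fill missing homepos/awaypos following the rules listed above."""
    out = pos.copy()
    for col in ("homepos", "awaypos", "comment"):
        out[col] = pd.to_numeric(out[col], errors="coerce")
    # Both missing: fall back to `comment`, rounded, as the home share.
    both_missing = out["homepos"].isna() & out["awaypos"].isna()
    out.loc[both_missing, "homepos"] = out.loc[both_missing, "comment"].round()
    # One missing: derive it from the other (assumes the shares sum to 100).
    out["homepos"] = out["homepos"].fillna(100 - out["awaypos"])
    out["awaypos"] = out["awaypos"].fillna(100 - out["homepos"])
    # Anything still missing had no usable value at all, so drop the row.
    out = out.dropna(subset=["homepos", "awaypos"])
    return out.astype({"homepos": "Int64", "awaypos": "Int64"})

# Illustrative rows only.
sample_pos = pd.DataFrame({
    "homepos": [55, None, 60, None, None],
    "awaypos": [45, 40, None, None, None],
    "comment": ["55", None, None, "52", None],
})
filled = fill_possession(sample_pos)
```

Note that this has to run before `comment` is dropped as a redundant column.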

##### player positions table
- drop rows where `x_pos` and `y_pos` are null
- type conversions:
    - `x_pos`, `y_pos`, `player_id` and `player_num` should be int
    - `team_designation` should be categorical

### Iterating - Tidiness

##### `goal` events table
- handle duplicated rows that refer to the same goal
    - for one of them, one of the rows had a `saved_back_into_play` subtype, which could be one way to identify the duplicate
    - it's likely that all the duplicates need to be identified to correctly decide how to handle them

##### `shoton` events table
- drop `stats` column because it contains info (blocked shots) that's already in the `subtype` column

##### `card` events table
- drop `card_type` and `stats` columns because they contain info that's already in the `comment` column
- handle duplicated rows that refer to the same card event
    - confirmed that these rows have the same card type, team and player, so almost certain that they're duplicates

##### `cross` events table
- drop `stats` and `subtype` columns because they contain info that's already in the `type` column

##### `corner` events table
- drop `stats` and `type` columns because they contain info that's already in the `subtype` column

##### `possession` events table
- drop `comment` column because it contains info that's already in the `homepos` column
    - only do this after using `comment` to fill in missing values for `homepos` and `awaypos`
- handle duplicated rows that refer to the same possession event
    - confirmed that these rows have the same home and away possession, timestamp and match, so almost certain that they're duplicates
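
Once the duplicates are confirmed, removing them could be as simple as the sketch below. `drop_duplicate_events` is a hypothetical helper; `match_api_id` as the match key and `elapsed` as the timestamp are assumptions, and keeping the first occurrence is an arbitrary choice.

```python
import pandas as pd

def drop_duplicate_events(events: pd.DataFrame, keys: list[str]) -> pd.DataFrame:
    """Keep the first row for each combination of the identifying columns."""
    return events.drop_duplicates(subset=keys, keep="first").reset_index(drop=True)

# Illustrative rows only: the first two describe the same possession event.
sample_events = pd.DataFrame({
    "match_api_id": [1, 1, 1],
    "elapsed": [45, 45, 90],
    "homepos": [55, 55, 53],
    "awaypos": [45, 45, 47],
})
deduped = drop_duplicate_events(
    sample_events, keys=["match_api_id", "elapsed", "homepos", "awaypos"]
)
```

The same helper would apply to the `card` events table with the card type, team and player columns as keys.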