# 4. Manipulates Raw Statistics into Multiple Forms #
## For Brownlow Predictor Project ##

Turns raw counts of game/player stats into
- [x] Standardised
- [x] Normalised
- [x] Rank followed by Standardisation
- [x] Percentages
- [ ] Percentages followed by Standardisation *[since proven to be exactly the same as Standardisation]*
- [x] Percentages followed by Normalisation

*For different statistics to be used together for prediction, they must first be put into a form in which they are comparable ('equal'). There are many ways to do this, and the author wanted to try several data manipulations to see which gave the best predictive results, and also to test the general effectiveness of each manipulation format.*

**Author: `Lang (Ron) Chen` 2021.12-2022.1**

---

*Note: As Percentages followed by Standardisation was proven to be exactly the same as Standardisation, all relevant code has been commented out.*
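
The equivalence noted above can be checked with a short derivation. This is a sketch under assumptions not stated in the source: "percentage" is taken to mean each player's share $p_i = x_i / S$ of the game total $S = \sum_j x_j > 0$ over $n$ players, and standardisation is taken to be the usual z-score.

```latex
\begin{align*}
\bar{p} &= \frac{1}{n}\sum_{i=1}^{n}\frac{x_i}{S} = \frac{\bar{x}}{S}, &
\sigma_p &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i}{S}-\frac{\bar{x}}{S}\right)^{2}} = \frac{\sigma_x}{S}, \\
\frac{p_i-\bar{p}}{\sigma_p} &= \frac{(x_i-\bar{x})/S}{\sigma_x/S} = \frac{x_i-\bar{x}}{\sigma_x}.
\end{align*}
```

Dividing by a positive constant rescales the mean and the standard deviation by the same factor, so the z-scores are unchanged.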

**0. Import Libraries**

In [96]:
import pandas as pd
import os
import numpy as np
from sklearn.model_selection import train_test_split
import json

**1. Writing the functions for standardisation**

Each manipulation method (except percentage) has four functions because some stats are negative (e.g. turnovers, frees against), and their scores should be reversed.

- BT means both teams: manipulates the data with respect to all players on the ground
- OT means own team: manipulates the data with respect to teammates only
- inv means inverse: the scores are ranked in reverse (according to what 'reverse' means for that particular manipulation method)
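
The BT / OT / inv variants can be illustrated with a minimal sketch for standardisation. The function names, the `Team` column, and the toy data are hypothetical and are not taken from the project's code. "Inverse" is assumed here to mean negating the z-score, so that fewer turnovers gives a higher score.

```python
import pandas as pd

def standardise_bt(df, col, inverse=False):
    """Z-score of `col` against every player on the ground (both teams)."""
    z = (df[col] - df[col].mean()) / df[col].std(ddof=0)
    return -z if inverse else z

def standardise_ot(df, col, team_col='Team', inverse=False):
    """Z-score of `col` against the player's own teammates only."""
    grouped = df.groupby(team_col)[col]
    team_mean = grouped.transform('mean')
    team_std = grouped.transform(lambda s: s.std(ddof=0))
    z = (df[col] - team_mean) / team_std
    return -z if inverse else z

# Made-up game: two players per team (the 'Team' column is an assumption)
game = pd.DataFrame({
    'Player': ['A', 'B', 'C', 'D'],
    'Team': ['X', 'X', 'Y', 'Y'],
    'Turnovers': [2, 6, 1, 3],
})

bt_inv = standardise_bt(game, 'Turnovers', inverse=True)
ot_inv = standardise_ot(game, 'Turnovers', inverse=True)
```

In this toy game, player B has the most turnovers and therefore the lowest inverse score across both teams, while the own-team version only compares A with B and C with D.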

'Normalisation followed by Standardisation' and the reverse order were also initially planned, until an experiment showed that the former is equal to 'Standardisation' whilst the latter is exactly equal to 'Normalisation'.
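
The experiment mentioned above can be reproduced on toy numbers. This is a sketch only: it assumes "normalisation" means min-max scaling to [0, 1] and "standardisation" means the z-score, and the disposal counts are made up.

```python
import numpy as np

def standardise(x):
    """Z-score: subtract the mean, divide by the standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def normalise(x):
    """Min-max scaling to the range [0, 1] (assumed definition)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Made-up disposal counts for six players in one game
disposals = np.array([31, 24, 18, 12, 27, 9])

# Normalisation followed by standardisation collapses to standardisation
norm_then_std_is_std = np.allclose(standardise(normalise(disposals)),
                                   standardise(disposals))

# Standardisation followed by normalisation collapses to normalisation
std_then_norm_is_norm = np.allclose(normalise(standardise(disposals)),
                                    normalise(disposals))
```

Both transforms are positive affine maps of the raw values, and each one undoes any earlier shift and rescale, so only the last transform applied matters.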

In [8]:
filelist = os.listdir(f'../data/curated/AFLCAVotes')
filelist.sort()
filelist = [file for file in filelist if file[0] != '.' ]

In [9]:
# First make lists which direct the flow of control as to whether a column should receive manipulation and whether it should be in the positive or inverse direction.

# Unchanged
Orig = ['Player']

# compare
compare = ['Winloss','HomeAway', 'AFLCA_votes', 'Kicks', 'Handballs', 'Disposals', 'Marks', 'Goals', 'Behinds', 'Tackles', 'Hitouts', 'Goal Assists', 'Inside 50s', 
               'Clearances', 'Rebound 50s', 'Frees For', 'Contested Possessions', 'Uncontested Possessions', 
               'Effective Disposals', 'Contested Marks', 'Marks Inside 50', 'One Percenters', 'Bounces', 'Centre Clearances', 
               'Stoppage Clearances', 'Score Involvements', 'Metres Gained', 'Intercepts', 'Tackles Inside 50', 'Time On Ground %', 'Uncontested Marks',
               'Marks Outside 50', 'Tackles Outside 50', 'Behind Assists', 'Ineffective Disposals', 'Brownlow Votes', 'Clangers', 'Turnovers', 'Frees Against']

In [12]:
for file in filelist:

    if file[2:4] not in ['20', '21', '22']:
        continue

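    # Skip games already converted (outputs are saved with a .parquet extension)
    if f'{file.split(".")[0]}.parquet' in os.listdir('../data/curated/Head2Head'):
        continue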

    df = pd.read_csv(f'../data/curated/AFLCAVotes/{file}')
    df = df[df['Time On Ground %'] > 0.55] # below this time on ground no player has won Brownlow Votes (ever)

    df['HomeAway'] = df['HomeAway'].apply(lambda x: 1 if x=='Home' else 0)

    game_head2head_df_list = []

    for player in df['Player']:
        
        one_player_df_list = []

        player_stat = df[df['Player']==player]

        for other_player in df['Player']:
            if other_player == player:
                continue
            
            other_player_stat = df[df['Player']==other_player]

            head_2_head = {'player1': [player], 'player2': [other_player]}

            for stat in compare:
                
                head_2_head[stat] = [player_stat[stat].values[0] - other_player_stat[stat].values[0]]
    
            head_2_head_df = pd.DataFrame(head_2_head)

            one_player_df_list.append(head_2_head_df)

        one_player_df = pd.concat(one_player_df_list)
        game_head2head_df_list.append(one_player_df)  
    
    game_head2head_df = pd.concat(game_head2head_df_list)

    game_head2head_df.to_parquet(f'../data/curated/Head2Head/{file.split(".")[0]}.parquet')
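
The nested loop above builds every ordered pair of players and subtracts their stats. The same table could plausibly be produced in one step with a cross join. This is a sketch, not the notebook's method: `head_to_head` is a hypothetical helper, it assumes player names are unique within a game, and `how='cross'` needs pandas 1.2 or later.

```python
import pandas as pd

def head_to_head(df, stats, id_col='Player'):
    """Return one row per ordered pair (player1, player2) with stat differences."""
    stats = list(dict.fromkeys(stats))          # drop repeated stat names
    slim = df[[id_col] + stats]
    # Cartesian product of the game with itself; every column gets a suffix
    pairs = slim.merge(slim, how='cross', suffixes=('_1', '_2'))
    # A player is never compared with themselves
    pairs = pairs[pairs[f'{id_col}_1'] != pairs[f'{id_col}_2']]
    out = pd.DataFrame({
        'player1': pairs[f'{id_col}_1'].to_numpy(),
        'player2': pairs[f'{id_col}_2'].to_numpy(),
    })
    for stat in stats:
        out[stat] = pairs[f'{stat}_1'].to_numpy() - pairs[f'{stat}_2'].to_numpy()
    return out

# Made-up three-player game
game = pd.DataFrame({
    'Player': ['A', 'B', 'C'],
    'Kicks': [10, 7, 4],
    'Marks': [5, 5, 2],
})

h2h = head_to_head(game, ['Kicks', 'Marks'])
```

For `n` players this yields `n * (n - 1)` rows, in the same order as the nested loop: all of the first player's comparisons, then the second player's, and so on.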

In [50]:
non_2022_list = [x for x in os.listdir(f'../data/curated/Head2Head') if x[2:4] in ['21', '20']]
is_2022_list = [x for x in os.listdir(f'../data/curated/Head2Head') if x[2:4] in ['22']]

In [51]:
train_games, val_test_games = train_test_split(non_2022_list, train_size = 0.7, random_state = 19260817)
val_games, test_games = train_test_split(val_test_games, train_size = 0.5, random_state = 19260817)

In [70]:
game_id = 0

train_list = list()
for file in train_games:

    df = pd.read_parquet(f'../data/curated/Head2Head/{file}')
    df['game_id'] = game_id

    game_id += 1

    train_list.append(df)

In [71]:
game_id = 0

val_list = list()
for file in val_games:

    df = pd.read_parquet(f'../data/curated/Head2Head/{file}')
    df['game_id'] = game_id

    game_id += 1

    val_list.append(df)

In [72]:
game_id = 0

test_list = list()
for file in test_games:

    df = pd.read_parquet(f'../data/curated/Head2Head/{file}')
    df['game_id'] = game_id

    game_id += 1

    test_list.append(df)

In [73]:
game_id = 0

future_list = list()
for file in is_2022_list:

    df = pd.read_parquet(f'../data/curated/Head2Head/{file}')
    df['game_id'] = game_id

    game_id += 1

    future_list.append(df)

In [74]:
train_data = pd.concat(train_list)
val_data = pd.concat(val_list)
test_data = pd.concat(test_list)
future_data = pd.concat(future_list)

In [75]:
col_std = dict()

for col in train_data:
    if col == 'game_id':
        continue
    if train_data[col].dtype == np.float64:
        col_std[col] = train_data[col].std()
        train_data[col] = train_data[col]/col_std[col]

        val_data[col] = val_data[col]/col_std[col]
        test_data[col] = test_data[col]/col_std[col]
        future_data[col] = future_data[col]/col_std[col]
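
The cell above divides each column by its training standard deviation without subtracting a mean. One reason this may be enough (an interpretation, not something the source states) is that head-to-head differences are antisymmetric: every pair appears once as `a - b` and once as `b - a`, so each column already averages to zero. A small check on made-up numbers:

```python
import numpy as np

# Made-up kick counts for four players in one game
kicks = np.array([10.0, 7.0, 4.0, 12.0])

# Every ordered pair (i, j) with i != j, as in the head-to-head table
diffs = np.array([a - b
                  for i, a in enumerate(kicks)
                  for j, b in enumerate(kicks)
                  if i != j])

mean_of_diffs = diffs.mean()        # zero, up to floating-point error

# Sample standard deviation (ddof=1), matching the pandas default for .std()
train_std = diffs.std(ddof=1)
scaled = diffs / train_std          # unit spread, still centred on zero
```

Reusing the training standard deviation for the validation, test and future sets keeps information from those sets out of the scaling step.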

In [94]:
train_data.to_parquet('../data/curated/modelling/train_data.parquet')
val_data.to_parquet('../data/curated/modelling/val_data.parquet')
test_data.to_parquet('../data/curated/modelling/test_data.parquet')
future_data.to_parquet('../data/curated/modelling/future_data.parquet')

In [97]:
with open('../data/curated/stdev.json', 'w') as f:
    json.dump(col_std, f)
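
Saving the standard deviations lets later notebooks scale new games in the same way. A minimal sketch of that round trip follows. `scale_with_saved_std`, the temporary file path, and the numbers are all hypothetical and only illustrate the idea.

```python
import json
import os
import tempfile

import pandas as pd

# Made-up standard deviations, written the same way as the cell above
col_std = {'Kicks': 4.0, 'Marks': 2.0}
json_path = os.path.join(tempfile.mkdtemp(), 'stdev.json')
with open(json_path, 'w') as f:
    json.dump(col_std, f)

def scale_with_saved_std(df, path):
    """Divide each saved column of `df` by its stored training standard deviation."""
    with open(path) as f:
        saved = json.load(f)
    out = df.copy()
    for col, sd in saved.items():
        if col in out.columns:      # ignore saved columns missing from this frame
            out[col] = out[col] / sd
    return out

# Made-up head-to-head rows for a new game
new_game = pd.DataFrame({
    'player1': ['A', 'B'],
    'player2': ['B', 'A'],
    'Kicks': [8.0, -8.0],
    'Marks': [-1.0, 1.0],
})

scaled_game = scale_with_saved_std(new_game, json_path)
```

The function returns a copy, so the unscaled frame stays available for inspection.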