# Script 4. Training and collecting statistics from Linear Regression Models in two steps with Bootstrapping #
## For Brownlow Predictor Project ##

Trains and collects statistics from 9600 LR model runs (1920 configurations × 5 folds), each trained in two steps with Bootstrapping, for Brownlow prediction

Different models arise from the permutations of choices one can make when training models. For this case they are:
- [x] 5 Data Manipulation Types
- [x] 3 Values of Adjusted Votes for labels of step 1
- [x] 4 Macro Rules of Feature Selection 
- [x] 4 Feature Selection Coefficient Cutoff Values 
- [x] 4 Micro Rules of Feature Selection
- [x] 2 Choices of whether to include Winloss in columns
- [x] (5 Folds of Train-Test Split)
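
As a rough check on the size of the search, the choices listed above can be enumerated with `itertools.product`. This is a minimal sketch rather than the notebook's own code: the `grid` dictionary and its key names are illustrative, while the values mirror the loops used in section 1.

```python
from itertools import product

# Illustrative grid: key names are assumptions, values mirror the loops in section 1.
grid = {
    'dt': ['N', 'S', 'RS', 'P', 'PN'],
    'adj_votes': [1, 2, 3],
    'use': ['BT', 'OT', 'BT_OT', 'BT+OT'],
    'FS_val': [0.2, 0.25, 0.3, 0.35],
    'FS_rule': [1, 2, 3, 4],
    'winloss': ['In', 'Out'],
    'fold': [1, 2, 3, 4, 5],
}

# Every combination of choices, including the fold
combos = list(product(*grid.values()))

# Distinct configurations once the 5 folds are factored out
n_configs = len(combos) // len(grid['fold'])

print(len(combos), n_configs)  # 9600 1920
```

Each combination corresponds to one pass through the innermost loop body, where a step-1 and a step-2 regression are both fitted.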


**Author: `Lang (Ron) Chen` 2021.12-2022.1**

___

**0. Import Libraries**

In [1]:
import pandas as pd
import os
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

from BrownlowPredictorTools2.predict import predict1_mass, predict2_mass
from BrownlowPredictorTools2.test import test1_mass, test2_mass
from BrownlowPredictorTools2.wholeseason import wholeseason
from BrownlowPredictorTools.feature_selection2 import feature_selection2

In [2]:
Central_Statistics = pd.read_csv('Central_Statistics.csv')

**1. Using Loops to simulate permutations**

*Makes clever use of f-strings to input and output the desired data*

In [3]:
%%time
choice = {'N': 'NormalisedData', 'S': 'StandardisedData', 'RS': 'RankStandardisedData', 
          'P': 'PercentageData', 'PN': 'PercentageNormalisedData'}

for dt in ['N', 'S', 'RS', 'P', 'PN']: # 5 Data Manipulation Types
    
    filelist = os.listdir(f'./Data/{choice[dt]}')[1:]
    
    # Gets list of empirical test games (full 2021 season)
    final_test_games = [file for file in filelist if '2021' in file]
    
    for adj_votes in range(1, 4): # 3 Adjusted Vote Values to use for labels of step-one
    
        for use in ['BT', 'OT', 'BT_OT', 'BT+OT']:
            
            # BT: Both Teams
            # OT: Own Team
            # BT_OT: Both Teams data OR Own Team data (same stat cannot use both columns)
            # BT+OT: Both Teams data AND Own Team data (free to use both columns for same stats)
            
            BT_OT = False # Variable to be used later in feature_selection2() function
            if use == 'BT_OT':
                BT_OT = True

            for FS_val in [0.2, 0.25, 0.3, 0.35]: # 4 Feature Selection Coefficient Cutoff Values: the level of Pearson
            # correlation coefficient with Brownlow Votes that a column must exceed to be selected as a feature

                for FS_rule in [1, 2, 3, 4]: # 4 Micro Rules of Feature Selection
                    
                # 1: All cols that passed FS_val selected
                # 2: For those with dependency/triangle relationships (i.e. A=Disposals/B=Kicks/C=Handballs), if A comes first then B, C excluded. If B or C comes first then A excluded
                # 3: All cols that passed FS_val selected but abandon all 'summary' cols such as Disposal/Tackles/Marks
                # 4: Exclude Disposals, otherwise as per rule 2
                
                    for winloss in ['In', 'Out']: # 2 Whether to include Winloss in columns

                        for fold in [1, 2, 3, 4, 5]: # 5 Folds of Train-Test Split
                            
                            # Read in appropriate Train and Test data
                            first_lr_data = pd.read_csv(f'./PreparedData/Train_Data_{fold} ({dt}) (2_1_{adj_votes})(B).csv')

                            second_lr_data = pd.read_csv(f'./PreparedData/Train_Data_{fold} ({dt}) (2_2)(B).csv')
                            
                            test_games = list(pd.read_csv(f'./PreparedData/Test_Games_List_{fold} ({dt}) (2)(B).csv')['Test Games'])
                            
                            # Primary filtering of features according to Macro Rules
                            if use in ['BT', 'OT']:
                                # Accounts for Winloss inclusion choice ('In' keeps the Winloss columns, 'Out' drops them).
                                # Compare against 'In' explicitly: both 'In' and 'Out' are non-empty strings and so are truthy.
                                if winloss == 'In':
                                    cols = [col for col in first_lr_data.columns if (f'{use}{dt}' in col or 'Winloss' in col)]

                                else:
                                    cols = [col for col in first_lr_data.columns if (f'{use}{dt}' in col)]

                            else:
                                # Accounts for Winloss inclusion choice
                                if winloss == 'In':
                                    cols = [col for col in first_lr_data.columns if (f'BT{dt}' in col or f'OT{dt}' in col or 'Winloss' in col)]

                                else:
                                    cols = [col for col in first_lr_data.columns if (f'BT{dt}' in col or f'OT{dt}' in col)]
                            
                            # Calculates correlation and only accept columns that have surpassed FS_Val
                            corr1 = dict()
                            corr2 = dict()
                            for col in cols:
                                corr1[col] = first_lr_data[[col, 'Brownlow Votes']].corr(method = 'pearson').loc[col]['Brownlow Votes']
                                corr2[col] = second_lr_data[[col, 'Brownlow Votes']].corr(method = 'pearson').loc[col]['Brownlow Votes']

                            corr1 = list(corr1.items())
                            corr2 = list(corr2.items())
        
                            selected_features1 = [col[0] for col in corr1 if col[1] > FS_val]
                            selected_features2 = [col[0] for col in corr2 if col[1] > FS_val]

                            # Put into feature_selection2 function to do secondary filtering based on the FS_rule and BT_OT (or not)
                            selected_features1 = feature_selection2(selected_features1, FS_rule, BT_OT)
                            selected_features2 = feature_selection2(selected_features2, FS_rule, BT_OT)
                            
                            # Initialises a blank dataframe for this test sample
                            cent_storage_cols = {'Method': [f'LR(2)(B)_({adj_votes})'], 'Datatype': [dt], 'Use': [use], 'Feature Selection Value': [FS_val], 
                                                 'Feature Selection Rule': [FS_rule], 'Winloss': [winloss], 'Fold': [fold], 'TP0': [None], 
                                                 'TP0.5': [None], 'TP1': [None], 'TP2': [None], 'TP3': [None], 'Coef1': [None], 'Coef2': [None]}
                            # Blank leaderboard columns P1..P20 and V1..V20
                            for i in range(1, 21):
                                cent_storage_cols[f'P{i}'] = [None]
                                cent_storage_cols[f'V{i}'] = [None]

                            if not selected_features1 or not selected_features2:
                                # Adds it onto our Dataframe for writing onto Central Database later
                                Central_Statistics = pd.concat([Central_Statistics, pd.DataFrame(cent_storage_cols)])  # DataFrame.append was removed in pandas 2.0
                                continue
                            
                            # Prepare data for Training for step 1
                            traindataf_x = first_lr_data[selected_features1]
                            traindataf_x.index = range(len(first_lr_data))
                            traindataf_y = first_lr_data['Brownlow Votes']
                            traindataf_y.index = range(len(first_lr_data))
                            
                            # Train model for step 1
                            lm_f = linear_model.LinearRegression()
                            traindataf_x = traindataf_x.replace((np.inf, -np.inf, np.nan), 0).reset_index(drop=True)
                            modelf = lm_f.fit(traindataf_x, traindataf_y)
                            
                            # Perform predictions and collect predictions and actual observations into one dataframe for step 1
                            out1 = predict1_mass(test_games, lm_f, selected_features1, choice[dt])
                            
                            # Performs testing on test case and collects the step 1 true positive stats (as a tuple, stored later as TP0 and TP0.5)
                            tp1 = test1_mass(out1, adj_votes)

                            # Prepare data for Training for step 2
                            traindatas_x = second_lr_data[selected_features2]
                            traindatas_x.index = range(0,len(second_lr_data))
                            traindatas_y = second_lr_data['Brownlow Votes']
                            traindatas_y.index = range(0,len(second_lr_data))
                            
                            # Train model for step 2 
                            lm_s = linear_model.LinearRegression()
                            traindatas_x = traindatas_x.replace((np.inf, -np.inf, np.nan), 0).reset_index(drop=True)
                            models = lm_s.fit(traindatas_x, traindatas_y)

                            # Perform predictions and collect predictions and actual observations into one dataframe for step 2
                            out2 = predict2_mass(test_games, lm_s, selected_features2, choice[dt])
                            
                            # Performs testing on test case and collects the step 2 true positive stats (as a tuple, stored later as TP1, TP2 and TP3)
                            tp2 = test2_mass(out2)
                            
                            # Performs empirical testing on 2021 season
                            leaderboard = wholeseason(final_test_games, lm_f, lm_s, selected_features1, selected_features2, choice[dt])
                            
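                            # Collects the coefficient of determination (R^2) of each model on its training data
                            # (LinearRegression.score returns R^2, not Pearson's coefficient)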
                            pears_co1 = lm_f.score(traindataf_x, traindataf_y)
                            pears_co2 = lm_s.score(traindatas_x, traindatas_y)

                            # Initialises a dataframe for this test sample, filling in the test statistics
                            cent_storage_cols = {'Method': [f'LR(2)(B)_({adj_votes})'], 'Datatype': [dt], 'Use': [use], 'Feature Selection Value': [FS_val], 
                                                 'Feature Selection Rule': [FS_rule], 'Winloss': [winloss], 'Fold': [fold], 'TP0': [tp1[0]], 
                                                 'TP0.5': [tp1[1]], 'TP1': [tp2[0]], 'TP2': [tp2[1]], 'TP3': [tp2[2]], 'Coef1': [pears_co1], 'Coef2': [pears_co2]}
                            
                            # Fill in empirical observations (top 20 of the 2021 leaderboard)
                            for i in range(1, 21):
                                cent_storage_cols[f'P{i}'] = [leaderboard[i-1][0]]
                                cent_storage_cols[f'V{i}'] = [leaderboard[i-1][1]]

                            # Adds it onto our Dataframe for writing onto Central Database later
                            Central_Statistics = pd.concat([Central_Statistics, pd.DataFrame(cent_storage_cols)])  # DataFrame.append was removed in pandas 2.0
                    
                    # Write the Dataframe out after each batch of 10 models (2 Winloss choices x 5 folds) and their statistics.
                    # Chose to write in batches of 10 to conserve computational power.
                    # But must output regularly as the full block takes around 10 hours to run - if it fails somewhere in between, need a method to salvage results.
                    Central_Statistics.to_csv('Central_Statistics.csv', index = None)

Wall time: 10h 26min 31s
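
The correlation cutoff used in the cell above can be illustrated on toy data. The notebook itself relies on pandas' `.corr(method = 'pearson')`; the sketch below recomputes the same quantity by hand so that it runs without any dependencies. The `pearson` helper, the column names and the numbers are all made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy label and feature columns (hypothetical values)
votes = [0, 0, 1, 2, 3]
columns = {
    'Kicks BTN': [1, 2, 3, 4, 5],
    'Handballs BTN': [5, 3, 4, 1, 2],
    'Winloss': [0, 1, 0, 1, 1],
}

FS_val = 0.3

# Keep only columns whose correlation with the label exceeds the cutoff
selected = [name for name, values in columns.items() if pearson(values, votes) > FS_val]
print(selected)  # ['Kicks BTN', 'Winloss']
```

As in the notebook, the comparison is on the signed coefficient, so a column with a strong negative correlation (here `Handballs BTN`) is dropped as well.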


## Note: A few improvements could be made to this notebook: ##

*1. To further save computational time, could consider changing the order of iteration (although that may be dangerous, as it doesn't guarantee the correct copy is read in from the files each time).*

*2. A mechanism for starting the loop at a predetermined point (in case it fails halfway, so that previous computation time is not wasted). This would be useful for future projects that do mass testing.*

*- One idea is to first turn the iterations into a list and then iterate through that list. At the point of failure, the index the iterator had reached could be salvaged and the run restarted from list[i] using list[i:].*
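
The restart idea in note 2 could look roughly like the following. This is a minimal sketch under stated assumptions, not code from the project: `run_from` and `work` are hypothetical names, the grid is a toy stand-in for the full set of choices, and the crash is simulated.

```python
from itertools import product

def run_from(combos, start, work):
    """Run work(combo) over combos[start:] and return the index to resume from."""
    for i, combo in enumerate(combos[start:], start=start):
        try:
            work(combo)
        except Exception:
            return i  # index of the combination that failed; restart here
    return len(combos)  # everything completed

# Toy stand-in for the full grid of choices
combos = list(product(['N', 'S'], [1, 2, 3]))

done = []
failed_once = set()

def work(combo):
    # Hypothetical stand-in for training one model; crashes the first time it sees ('S', 1)
    if combo == ('S', 1) and combo not in failed_once:
        failed_once.add(combo)
        raise RuntimeError('simulated crash')
    done.append(combo)

resume_at = run_from(combos, 0, work)          # stops at the simulated crash
first_stop = resume_at
resume_at = run_from(combos, resume_at, work)  # picks up where the first run stopped
print(first_stop, resume_at)  # 3 6
```

In a real run the returned index would need to be written to disk (for example next to `Central_Statistics.csv`) so that it survives a kernel restart.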