# 02.1 League Cross Validation Splitter 

## Now moved to src!

1. Explanation
2. Usage

In [1]:
import os
import sys
import pickle

import xarray as xr

test_dir = '../data/test/'

In [2]:
# Load the "autoreload" extension
%load_ext autoreload

# always reload modules marked with "%aimport"
%autoreload 1

# add the 'src' directory to path to import modules
src_dir = os.path.join(os.getcwd(), os.pardir, 'src')
sys.path.append(src_dir)

# import my class code from the source
%aimport cross_val.LeagueKFold

from cross_val.LeagueKFold import LeagueKFold

## Explanation

The league fixtures for the first three months of the 2017 season are shown below. The games are sorted by date and then alphabetically, in ascending order, by home team name. Each game has an index number.

![Fixture List](figures/fixture_list.png "Fixture List")
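
As a small illustration of the ordering described above, the sketch below sorts a handful of games by date and then by home team name, and numbers them in that order. The team names and dates are toy data, not the real fixture list, and the tuple layout is an assumption made for the example.

```python
from datetime import date

# Toy data (hypothetical): (match date, home team, away team)
games = [
    (date(2017, 8, 12), "Echo", "Foxtrot"),
    (date(2017, 8, 11), "Charlie", "Delta"),
    (date(2017, 8, 12), "Alpha", "Bravo"),
]

# Sort by date first, then alphabetically by home team name.
ordered = sorted(games, key=lambda g: (g[0], g[1]))

# Attach an index number to each game in the sorted order.
indexed = list(enumerate(ordered))

for idx, (day, home, away) in indexed:
    print(idx, day.isoformat(), home, "v", away)
```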

The cross-validation scheme is designed around the following requirements:
1. Models are fitted on historical results, so there is a "burn-in" period that runs from the start of the season until every team in the league has played at least one home game and one away game.
2. Data from the burn-in period is used to build the training features for the games that follow. The set of games immediately after the burn-in is therefore the first data that can be used for model fitting. Call this Training Batch 1.
3. Every team plays most weeks, so the cross-validator needs to identify batches of games in which no team plays twice. This is because we want to use all the games a team has previously played when fitting a model. These batches are called prediction batches.
4. Once the prediction batches are identified, the first feature extractor can decide how far back to go when developing features. See the diagram below (toy data).

![Cross Validation Layout](figures/cross_val_layout.png "Cross Validation layout")
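
The burn-in and batching requirements can be sketched in a few lines of plain Python. This is not the `LeagueKFold` source: the helper names `burn_in_end` and `prediction_batches` are hypothetical, the greedy grouping is an assumption about how batches might be found, and the four-team fixture list is toy data.

```python
def burn_in_end(fixtures, teams):
    """Return the index of the first game after the burn-in period.

    The burn-in ends once every team has played at least one home
    game and at least one away game.
    """
    need_home, need_away = set(teams), set(teams)
    for idx, (home, away) in enumerate(fixtures):
        need_home.discard(home)
        need_away.discard(away)
        if not need_home and not need_away:
            return idx + 1
    return len(fixtures)  # burn-in never completes on this data


def prediction_batches(fixtures, start=0):
    """Greedily group consecutive games so no team plays twice in a batch."""
    batches, current, seen = [], [], set()
    for idx in range(start, len(fixtures)):
        home, away = fixtures[idx]
        if home in seen or away in seen:
            # A team would appear twice: close this batch, start a new one.
            batches.append(current)
            current, seen = [], set()
        current.append(idx)
        seen.update((home, away))
    if current:
        batches.append(current)
    return batches


# Toy data: four teams, games already sorted by date and home team.
teams = ["A", "B", "C", "D"]
fixtures = [
    ("A", "B"), ("C", "D"),  # round 1
    ("B", "A"), ("D", "C"),  # round 2: all teams have now played home and away
    ("A", "C"), ("B", "D"),  # round 3
    ("C", "A"), ("D", "B"),  # round 4
]

first = burn_in_end(fixtures, teams)
batches = prediction_batches(fixtures, start=first)
print(first, batches)
```

With a fixture DataFrame holding `h_team` and `a_team` columns, the tuples could be obtained with `list(df[['h_team', 'a_team']].itertuples(index=False, name=None))`.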

## Usage

In [3]:
# Get a Fixture List
# Load Data
ds = xr.open_dataset(test_dir + 'XArrayDataSet_1.nc')
#ds = xr.open_dataset(full_final_dir + 'XArrayDataSet_1.nc')
# Select the game index numbers, drop the games that have not been played yet,
# and sort on index number
df = (ds['Idx'].to_dataframe().dropna().sort_values('Idx')
      .reset_index().drop('Idx', axis=1))
df.head()

| | h_team | a_team |
|---|---|---|
| 0 | Arsenal | Leicester City |
| 1 | Brighton and Hove Albion | Manchester City |
| 2 | Chelsea | Burnley |
| 3 | Crystal Palace | Huddersfield Town |
| 4 | Everton | Stoke City |


In [4]:
# Get the team list
# Use a context manager so the file handle is closed after loading
with open(test_dir + 'team_list.pickle', 'rb') as pickle_in:
    team_list = pickle.load(pickle_in)
print(team_list)

['Arsenal', 'Bournemouth', 'Brighton and Hove Albion', 'Burnley', 'Chelsea', 'Crystal Palace', 'Everton', 'Huddersfield Town', 'Leicester City', 'Liverpool', 'Manchester City', 'Manchester United', 'Newcastle United', 'Southampton', 'Stoke City', 'Swansea City', 'Tottenham Hotspur', 'Watford', 'West Bromwich Albion', 'West Ham United']


### Get the Splits

In [5]:
lkf = LeagueKFold(df, team_list, pretrained=False)
cv = lkf.split()
for train, test in cv:
    print('train', train, '*TEST*', test, '\n')
    break

train [50 51 52 53 54 55 56 57 58 59] *TEST* [60 61 62 63 64 65 66 67 68 69] 
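
The split printed above pairs a training batch with the prediction batch that follows it. A minimal sketch of how such `(train, test)` pairs could be produced as an expanding window, where the training set grows by one batch at each step, is given here. The function name `expanding_splits` is hypothetical, plain lists stand in for index arrays, and the pattern is an assumption rather than the `LeagueKFold` implementation.

```python
def expanding_splits(batches):
    """Yield (train, test) index lists from an ordered list of batches.

    Each test set is the next batch; each training set is every batch
    seen before it, so training data grows by one batch per step.
    """
    train = []
    for previous, upcoming in zip(batches, batches[1:]):
        train = train + previous
        yield list(train), list(upcoming)


# Toy batches of ten game indices each (hypothetical values).
batches = [list(range(50, 60)), list(range(60, 70)), list(range(70, 80))]
splits = list(expanding_splits(batches))

for train, test in splits:
    print('train', train[0], '...', train[-1], '*TEST*', test[0], '...', test[-1])
```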



In [6]:
lkf = LeagueKFold(df, team_list, pretrained=True)
cv = lkf.split()
for train, test in cv:
    print('train', train, '*TEST*', test, '\n')
    break

train [40 41 42 43 44 45 46 47 48 49] *TEST* [50 51 52 53 54 55 56 57 58 59] 



In [7]:
lkf = LeagueKFold(df, team_list, pretrained=False)
cv = lkf.split()
for train, test in cv:
    print('train', train, '*TEST*', test, '\n')

train [50 51 52 53 54 55 56 57 58 59] *TEST* [60 61 62 63 64 65 66 67 68 69] 

train [50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69] *TEST* [70 71 72 73 74 75 76 77 78 79] 

train [50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
 75 76 77 78 79] *TEST* [80 81 82 83 84 85 86 87 88 89] 

train [50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89] *TEST* [90 91 92 93 94 95 96 97 98 99] 

train [50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99] *TEST* [100 101 102 103 104 105 106 107 108 109] 

train [ 50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67
  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102 103
 104 105 106 107 108 109] *TEST* [110 111 112 113 114 115 116 117 1