# 03.2.2 Shots on Target

## Outline: 

Some estimators work by using a summarized history of the 2 teams e.g. mean goals for in home games, mean goals against in away games, etc.

'Power Ranking' type schemes make use of all the games played by every team in order to put a numerical rating against each team.

This means that even though we might only be interested in getting the rating for 2 teams for a particular match, we need the historical data from every team.
Therefore, for a ranking feature, we will need to draw in the whole DataSet for interrogation.

## Forming the Ranker Feature Extractor:

The fixture list is split into batches, where each batch (after the burn in period) contains no repeat games.

Therefore, we can look at the lowest number in the batch, and grab the data for all the games indexed before this number.
Using these games we can form the ranking, and use the rankings as features for the batch

We can split the code into 2 parts:
1. Get and format the historical data
2. Apply the ranking scheme (there may be multiple possible ranking schemes)

### Pseudo Code

get_league_history()

Get the lowest fixture index in the batch

`lowest_index = min(fixture_list.index`

Get all historical games lower than the lowest index

`hist_game_set = self.ds.where(selfds['Idx'] < lowest_index)`

Would be good to get a DataFrame out of this

What should this DataFrame look like? - Need to know before constructing scheme to manipulate it

get_league_history_for_ranker()

Get the relevant columns for the ranker to work on

relevant cols = get any cols containing 'goals', 'xG' etc., etc.

get_rankings()

Apply the ranking method

Return single line for each fixture with the ranking metric

## Code

In [1]:
% matplotlib inline

import os
import sys

import pandas as pd

test_dir = '../data/test/'
datacube_path = test_dir + 'XArrayDataSet_1.nc'
team_list_path = test_dir + 'team_list.pickle'

pd.set_option("display.width",100)
pd.options.display.float_format = '{:,.2f}'.format

# add the 'src' directory to path to import modules
src_dir = os.path.join(os.getcwd(), os.pardir, 'src')
sys.path.append(src_dir)

## Set Up To Do Ranker Feature Extraction

In [2]:
# MOVE TO UTILITIES ###################
import xarray as xr
import pickle

def get_fixture_list(full_path):
    """
    Opens a Datacube at the specified path
    Looks into the DataCube, and gets all played fixtures home and away teams,
    sorted on index value (set at Data Processing)
    Returns a 2 column sequential integer indexed DataFrame of strings |h_team | a_team |
    where h_team represents home team, and away team reps away team
    Integer Index represents sequence of games played. Earliest (lowest) to most recent (highest)
    """
    ds = xr.open_dataset(full_path)
    fixture_list = ds['Idx'].to_dataframe().dropna().sort_values('Idx').reset_index().drop('Idx', axis=1)
    return fixture_list

def get_team_list(team_list_path):
    """
    Accepts a string path to the reference team list for a league
    Loads a previously formed team list and returns
    """
    pickle_in = open(team_list_path, "rb")
    team_list = pickle.load(pickle_in)
    return team_list
# MOVE TO UTILITIES ###################

fixtures = get_fixture_list(datacube_path)
print(fixtures.head())
print(fixtures.shape, '\n')



team_list= get_team_list(team_list_path)
print(team_list)
print(len(team_list), '\n')

print(fixtures.iloc[50:56,:])

                     h_team             a_team
0                   Arsenal     Leicester City
1  Brighton and Hove Albion    Manchester City
2                   Chelsea            Burnley
3            Crystal Palace  Huddersfield Town
4                   Everton         Stoke City
(170, 2) 

['Arsenal', 'Bournemouth', 'Brighton and Hove Albion', 'Burnley', 'Chelsea', 'Crystal Palace', 'Everton', 'Huddersfield Town', 'Leicester City', 'Liverpool', 'Manchester City', 'Manchester United', 'Newcastle United', 'Southampton', 'Stoke City', 'Swansea City', 'Tottenham Hotspur', 'Watford', 'West Bromwich Albion', 'West Ham United']
20 

             h_team             a_team
50          Burnley  Huddersfield Town
51          Everton        Bournemouth
52   Leicester City          Liverpool
53  Manchester City     Crystal Palace
54      Southampton  Manchester United
55       Stoke City            Chelsea


## Ranker Feature Extractor

In [3]:
def get_league_history(fixture_list_batch, ds):
    """
    """
    lowest_index = min(fixture_list_batch.index)
    prev_games_ds = ds.where(ds['Idx'] < lowest_index, drop=True)
    prev_games_df = prev_games_ds.to_dataframe().dropna(subset=['Idx']).sort_values('Idx')
    prev_games_df.reset_index(inplace=True)
    prev_games_df.set_index('Idx', inplace=True)
    return prev_games_df


fixture_batch = fixtures.iloc[50:56,:]
ds = xr.open_dataset(datacube_path)
print(fixture_batch, '\n')
prev_games_df = get_league_history(fixture_batch, ds)
print(prev_games_df.iloc[0:5,0:5], '\n')
print(prev_games_df.shape, '\n')
print(prev_games_df.columns, '\n')
print(prev_games_df.index)

             h_team             a_team
50          Burnley  Huddersfield Town
51          Everton        Bournemouth
52   Leicester City          Liverpool
53  Manchester City     Crystal Palace
54      Southampton  Manchester United
55       Stoke City            Chelsea 

                 a_team                    h_team       Date result  h_goals
Idx                                                                         
0.00     Leicester City                   Arsenal 2017-08-11   hwin     4.00
1.00    Manchester City  Brighton and Hove Albion 2017-08-12   awin     0.00
2.00            Burnley                   Chelsea 2017-08-12   awin     2.00
3.00  Huddersfield Town            Crystal Palace 2017-08-12   awin     0.00
4.00         Stoke City                   Everton 2017-08-12   hwin     1.00 

(50, 81) 

Index(['a_team', 'h_team', 'Date', 'result', 'h_goals', 'a_goals', 'h_xGF', 'h_xGF_diff', 'h_xGA',
       'h_xGA_diff', 'a_xGF', 'a_xGF_diff', 'a_xGA', 'a_xGA_diff', 'h_shot

In [4]:
def get_prev_games_subset_df(prev_games_df, some_strs):
    """
    """
    relevant_cols = [col for col in prev_games_df.columns for some_str in some_strs if some_str in col]
    relevant_cols = relevant_cols + ['h_team', 'a_team']
    print(relevant_cols)
    return prev_games_df[relevant_cols]
    #prev_games_subset_df = 
    #lowest_index = min(fixture_list_batch.index)
    #prev_games_ds = ds.where(ds['Idx'] < lowest_index, drop=True)
    #prev_games_df = prev_games_ds.to_dataframe().dropna(subset=['Idx']).sort_values('Idx')
    #return prev_games_df
#some_str = ['xGF', 'xGA']
some_str = ['xG', 'Idx']
prev_games_subset_df = get_prev_games_subset_df(prev_games_df, some_str)
print(prev_games_subset_df.head())
print(prev_games_subset_df.shape)

['h_xGF', 'h_xGF_diff', 'h_xGA', 'h_xGA_diff', 'a_xGF', 'a_xGF_diff', 'a_xGA', 'a_xGA_diff', 'h_team', 'a_team']
      h_xGF  h_xGF_diff  h_xGA  h_xGA_diff  a_xGF  a_xGF_diff  a_xGA  a_xGA_diff  \
Idx                                                                                
0.00   2.54       -1.46   2.50       -1.50   1.46       -1.54   2.50       -1.50   
1.00   0.29        0.29   0.07        0.07   2.26        1.26   0.07        0.07   
2.00   1.36       -0.64   1.17       -0.83   0.57       -2.43   1.17       -0.83   
3.00   0.99        0.99   0.45        0.45   1.74       -0.26   0.45        0.45   
4.00   0.72       -0.28   0.57       -0.43   0.28        0.28   0.57       -0.43   

                        h_team             a_team  
Idx                                                
0.00                   Arsenal     Leicester City  
1.00  Brighton and Hove Albion    Manchester City  
2.00                   Chelsea            Burnley  
3.00            Crystal Palace  Hudder

In [5]:
# GET LEAGUE SUMMARY STATS
h_tot_games_played = a_tot_games_played = len(prev_games_subset_df.index)
print(h_tot_games_played, a_tot_games_played)

h_tot_GF = prev_games_subset_df['h_xGF'].sum() ; a_tot_GF = prev_games_subset_df['a_xGF'].sum()
print(h_tot_GF, a_tot_GF)

h_GF_mean_per_game = h_tot_GF/h_tot_games_played ; a_GF_mean_per_game = a_tot_GF/a_tot_games_played
print(h_GF_mean_per_game, a_GF_mean_per_game)

G_mean_per_game = (h_tot_GF + a_tot_GF)/(h_tot_games_played + a_tot_games_played)
print(G_mean_per_game)

h_GF_mean_per_team_to_date = prev_games_subset_df['h_xGF'].mean()
h_GA_mean_per_team_to_date = prev_games_subset_df['h_xGA'].mean()
a_GF_mean_per_team_to_date = prev_games_subset_df['a_xGF'].mean()
a_GA_mean_per_team_to_date = prev_games_subset_df['a_xGA'].mean()
print(h_GF_mean_per_team_to_date, h_GA_mean_per_team_to_date, a_GF_mean_per_team_to_date, a_GA_mean_per_team_to_date)

50 50
73.71 56.11
1.4742 1.1222
1.2982
1.4742000000000006 1.0886000000000002 1.1222000000000003 1.0886000000000002


In [6]:
## FUNCTION TO CREATE REF DF
def get_ref_df(pdf, team_list):

    df_ref = pd.DataFrame(index = team_list)

    for team in df_ref.index:
        df_ref.loc[team,'h_xGF'] = pdf[pdf['h_team'] == team]['h_xGF'].mean()#pdf.loc[team, 'h_xGF'].mean()
        df_ref.loc[team, 'h_xGA'] = pdf[pdf['h_team'] == team]['h_xGA'].mean()
        df_ref.loc[team, 'a_xGF'] = pdf[pdf['a_team'] == team]['h_xGF'].mean()
        df_ref.loc[team, 'a_xGA'] = pdf[pdf['a_team'] == team]['a_xGA'].mean()





    df_ref = df_ref.assign(Attack_Strength = (df_ref['h_xGF']/h_GF_mean_per_team_to_date + \
                                              df_ref['a_xGF']/a_GF_mean_per_team_to_date)/2)


    df_ref = df_ref.assign(Defence_Weakness = (df_ref['h_xGA']/h_GA_mean_per_team_to_date + \
                                               df_ref['a_xGA']/a_GA_mean_per_team_to_date)/2)
    df_ref = df_ref.assign(Supremacy = (df_ref['Attack_Strength'] - df_ref['Defence_Weakness'])*G_mean_per_game)
    return df_ref

df_ref = get_ref_df(prev_games_subset_df, team_list)


print(df_ref)

                          h_xGF  h_xGA  a_xGF  a_xGA  Attack_Strength  Defence_Weakness  Supremacy
Arsenal                    2.52   2.42   1.64   1.27             1.59              1.70      -0.14
Bournemouth                0.86   0.52   1.83   1.75             1.11              1.04       0.09
Brighton and Hove Albion   0.34   0.21   1.20   0.85             0.65              0.49       0.21
Burnley                    0.90   0.61   1.99   1.51             1.20              0.97       0.29
Chelsea                    1.10   0.94   1.04   0.53             0.84              0.68       0.20
Crystal Palace             1.28   0.86   1.64   0.98             1.16              0.85       0.41
Everton                    0.93   0.80   1.87   1.24             1.15              0.94       0.28
Huddersfield Town          0.83   0.61   1.53   0.68             0.96              0.59       0.48
Leicester City             1.63   0.85   2.01   1.66             1.45              1.15       0.39
Liverpool 

In [7]:
### GET DATA OUT OF REF DF AND INTO EXPANDED FIXTURE LIST
print(fixture_batch, '\n')
fb = fixture_batch

fb = fb.assign(h_attack_strength = df_ref.loc[fb['h_team'], 'Attack_Strength'].values)
fb = fb.assign(h_defence_weakness = df_ref.loc[fb['h_team'], 'Defence_Weakness'].values)
fb = fb.assign(h_GF_mean_per_game = h_GF_mean_per_game)
fb = fb.assign(a_attack_strength = df_ref.loc[fb['a_team'], 'Attack_Strength'].values)
fb = fb.assign(a_defence_weakness = df_ref.loc[fb['a_team'], 'Defence_Weakness'].values)
fb = fb.assign(a_GF_mean_per_game = a_GF_mean_per_game)
fb['dependancy'] = 0.0

print(fb)

             h_team             a_team
50          Burnley  Huddersfield Town
51          Everton        Bournemouth
52   Leicester City          Liverpool
53  Manchester City     Crystal Palace
54      Southampton  Manchester United
55       Stoke City            Chelsea 

             h_team             a_team  h_attack_strength  h_defence_weakness  h_GF_mean_per_game  \
50          Burnley  Huddersfield Town               1.20                0.97                1.47   
51          Everton        Bournemouth               1.15                0.94                1.47   
52   Leicester City          Liverpool               1.45                1.15                1.47   
53  Manchester City     Crystal Palace               0.95                0.97                1.47   
54      Southampton  Manchester United               1.32                1.00                1.47   
55       Stoke City            Chelsea               1.01                1.02                1.47   

    a_attack_stre

In [8]:
stop

NameError: name 'stop' is not defined

In [None]:
import numpy as np

from scipy.misc import comb, factorial

from sklearn.base import BaseEstimator, ClassifierMixin, TransformerMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
from sklearn.utils.multiclass import unique_labels

class BivariatePoissonClassifier(BaseEstimator, ClassifierMixin):
    """
    Accepts Home Team Attack Strength, Home Team Defence Weakness, home GF mean per game and,
    Away Team Attack Strength, Away Team Attack Weakness, away GF mean per game, and dependancy like this:
    h_attack_strength -> float | h_defence_weakness -> float | h_GF_mean_per_game -> float | 
    a_attack_strength -> float | a_defence_weakness -> float | a_GF_mean_per_game -> float | dependency -> float
    0       1      2      3      4      5     6
    1.31 | 0.95 | 1.48 | 0.90 | 1.02 | 1.14 | 0.0 -> 7 floats in the correct order x N instances
    Returns a classification - a numpy array representing the classes of home win, away win, draw
    pred_h_win | pred_a_win | pred_draw
    [0,1,0] -> 0 or 1 x N instances
    Method to return a numpy array - a probability associated with each class like:
    [0.25, 0.50, 0.25] -> x N instances
    
    """
    def __init__(self):
        pass
    
    
    def fit(self, X, y):
        """
        This performs checks on the X,y shapes. Other than that it does nothing
        
        """
        # Check that X and y have correct shape
        X, y = check_X_y(X, y)
        # Store the classes seen during fit
        self.classes_ = unique_labels(y)

        self.X_ = X
        self.y_ = y
        # Return the classifier
        return self
        

    def predict(self, X):
        # Check is fit had been called
        check_is_fitted(self, ['X_', 'y_'])
        # Input validation
        X = check_array(X)
        
        self.num_instances = X.shape[0]
        # calculate expected goals for home and away teams
        h_xG = X[:,0]* X[:,4] *X[:,2]
        a_xG = X[:,3]* X[:,1] *X[:,5]
        xG = np.append(h_xG.reshape(self.num_instances,1),
                       a_xG.reshape(self.num_instances,1),axis=1)
        xG = np.append(xG, X[:,6].reshape(self.num_instances,1), axis=1)
        self.probs = self.get_bivariate_poisson_probs(xG)
        return np.argmax(self.probs, axis=1)
  
        
        
    def predict_proba(self):
        return self.probs


    def get_bivariate_poisson_probs(self, xG):
        """
        Returns an 11 x 11 probability table
        """
        all_probs = np.zeros((self.num_instances, 3))
        hExp = xG[:,0]; aExp = xG[:,1] ; dep = xG[:,2]

        for k, hExp_, aExp_ , dep_ in np.broadcast(np.arange(hExp.shape[0]), hExp, aExp, dep):
            max_goals = 10
            arr = np.empty(shape=(11,11))
            for hX in range(max_goals+1):
                for aY in range(max_goals+1):
                    arr[hX,aY] = self.bivariate_poisson_p(hX, aY, hExp_, aExp_, dep_)
            probs = self.event_probs(arr)
            all_probs[k,:] = probs
        return all_probs

            
    def bivariate_poisson_p(self, homeX: int, awayY: int, homeExp: float, awayExp: float, dep: float) -> float:
        """

        """    

        bivpois = 0
        for i in range(0,min(homeX, awayY)+1):
            #print(i)
            bivpois = bivpois + comb(homeX,i) * comb(awayY, i) * factorial(i) * (dep/(homeExp*awayExp))**i
        #print(np.exp(-0))

        bivpois = bivpois * np.exp(-(homeExp + awayExp + dep)) \
        * (homeExp**homeX) * (awayExp**awayY) / (factorial(homeX) * factorial(awayY))

        return bivpois

    def event_probs(self, prob_table: np.ndarray) -> np.ndarray:
        """
        Accepts an nxn ndarray of floats representing a probability table
        Returns a numpy 3 x 1 array of floats where index 0 is the sum of table values where the vertical 
        index is greater than the horizontal. This represents the probability of the vertical team winning
        Index 1 is the sum of table values where the indices areequal. Represents the probability of a Draw
        Index 2 is the sume of table values where the horizontal index is greater than the vertical.
        This represents the probability of the horizontal team winning
        >>> event_probs(np.array([[ 0.067,  0.156],[ 0.234,  0.543]]))
        array([ 0.234,  0.61 ,  0.156])
        """
        n = prob_table.shape[0]
        lower_triangle = np.tril_indices(n,-1)
        h_winp = np.sum(prob_table[lower_triangle])
        upper_triangle = np.triu_indices(n,1)
        a_winp = np.sum(prob_table[upper_triangle])
        diags = np.diag_indices(n)
        drawp = np.sum(prob_table[diags])
        return np.array([h_winp, a_winp, drawp ])


