# Big Data Cup 2021 
## How to value Zone Entries and other actions that are not shots or goals
### VAEP (Valuing actions by estimating probabilities) framework for Hockey 
Inspired by paper of the Soccer version [Actions Speak Louder Than Goals: Valuing Player Actions in Soccer](https://arxiv.org/abs/1802.07127) by Tom Decroos, Lotte Bransen, Jan Van Haaren and Jesse Davis. Very helpful was the Tutorial as part of the Friends of Tracking initiative by Lotte Bransen and Jan Van Haaren: [Friends of Tracking: Valuing actions in football](https://github.com/SciSports-Labs/fot-valuing-actions)

In [2]:
%reload_ext nb_black
import pandas as pd
from tqdm import tqdm
import numpy as np
import matplotlib.pyplot as plt

from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import KFold
from xgboost import XGBClassifier, plot_importance

import shap
from ipywidgets import interact_manual, fixed, widgets
%matplotlib inline

<IPython.core.display.Javascript object>

# Importing data, renaming columns, creating extra columns

In [3]:
# Import and Data Frame for womens data
project_dir = '/Users/keltim01/git_repos/TK5/Data/Big-Data-Cup-2021/'
womens = pd.read_csv(project_dir + 'hackathon_womens.csv')
nwhl = pd.read_csv(project_dir + 'hackathon_nwhl.csv')
womens = womens.append(nwhl, ignore_index=True)
# important numbers for the hockey rink 
ICE_LENGTH = 200
ICE_WIDTH = 85
GOAL_X = ICE_LENGTH - 10
GOAL_Y = ICE_WIDTH / 2
D_ZONE = 75
O_ZONE = ICE_LENGTH - 75

womens.columns = ['game_date', 'home_team', 'away_team', 'period', 'clock', 'home_team_skaters', 'away_team_skaters', 'home_team_goals','away_team_goals', 'team', 'player', 'event', 'x_coord', 'y_coord', 'detail_1', 'detail_2', 'detail_3', 'detail_4', 'player_2', 'x_coord_2', 'y_coord_2']
womens['game_id'] = womens.loc[:, ['game_date', 'home_team', 'away_team']].sum(axis=1).astype('category').cat.codes
womens['is_home'] = 0
womens['is_shot'] = 0
womens['is_goal'] = 0
womens['event_id'] = womens['event'].astype('category').cat.codes
womens['team_id'] = womens['team'].astype('category').cat.codes
womens['player_id'] = womens['player'].astype('category').cat.codes

for x in range(1,5):
    womens[f'detail_{x}_code'] = womens[f'detail_{x}'].astype('category').cat.codes
womens.loc[womens['home_team'] == womens['team'], 'is_home'] = 1
womens.loc[womens['event']=='Shot', 'is_shot'] = 1
womens.loc[womens['event']=='Goal', 'is_goal'] = 1
womens['goal_diff'] = womens['home_team_goals'].sub(womens['away_team_goals'])
womens['clock'] = pd.to_datetime(womens['clock'], format='%M:%S')
womens['seconds_remaining'] = womens['clock'].dt.minute.mul(60).add(womens['clock'].dt.second)

<IPython.core.display.Javascript object>

## Possession gained/lost
* Shot: which team has the puck recovery? (next event)
* Goal: not interesting because you scored -> 0
* Play: possession stays -> 0
* Incomplete Play: Possession lost -> -1
* Takeaway: Possession won 
* Puck recovery: according to the team Possessing the puck before 
* Dump In/out: team recovering the puck 
* Zone Entry: 
    * carried: possesion retained 
    * dump in: next event 
    * passed: possesion retained 
* Faceoff Win: Possession gained
* Penalty Taken: Possession 0 like goal

## Glossary 
* -1 possesion lost through action 
* 0 possesion stays the same 
* 1 possesion gained through action

In [4]:
womens.loc[(womens['event']=='Shot') & (womens['team']==womens['team'].shift(-1)),'poss_status'] = 0
womens.loc[(womens['event']=='Shot') & (womens['team']!=womens['team'].shift(-1)),'poss_status'] = -1
womens.loc[(womens['event']=='Puck Recovery') & (womens['team']!=womens['team'].shift(1)),'poss_status'] = 1
womens.loc[(womens['event']=='Puck Recovery') & (womens['team']==womens['team'].shift(1)),'poss_status'] = 0
womens.loc[(womens['event']=='Dump In/Out') & (womens['team']==womens['team'].shift(-1)),'poss_status'] = 0
womens.loc[(womens['event']=='Dump In/Out') & (womens['team']!=womens['team'].shift(-1)),'poss_status'] = -1
womens.loc[womens['event']=='Goal','poss_status'] = 0
womens.loc[womens['event']=='Takeaway','poss_status'] = 1
womens.loc[womens['event']=='Play','poss_status'] = 0
womens.loc[womens['event']=='Incomplete Play','poss_status'] = -1
womens.loc[(womens['event']=='Zone Entry') & (womens['detail_1']=='Carried'),'poss_status'] = 0
womens.loc[(womens['event']=='Zone Entry') & (womens['detail_1']=='Passed'),'poss_status'] = 0
womens.loc[(womens['event']=='Zone Entry') & (womens['detail_1']=='Dumped') & (womens['event'].shift(-1) == 'Faceoff Win'),'poss_status'] = 0
womens.loc[(womens['event']=='Zone Entry') & (womens['detail_1']=='Dumped') & (womens['event'].shift(-1) == 'Penalty Taken'),'poss_status'] = 0
womens.loc[(womens['team']==womens['team'].shift(-1)) & (womens['event']=='Zone Entry') & (womens['detail_1']=='Dumped') & (womens['event'].shift(-1) == 'Puck Recovery'),'poss_status'] = 0
womens.loc[(womens['team']!=womens['team'].shift(-1)) & (womens['event']=='Zone Entry') & (womens['detail_1']=='Dumped') & (womens['event'].shift(-1) == 'Puck Recovery'),'poss_status'] = -1
womens.loc[womens['event']=='Faceoff Win','poss_status'] = 1
womens.loc[womens['event']=='Penalty Taken','poss_status'] = 0

<IPython.core.display.Javascript object>

In [5]:
womens[['poss_status','event']].value_counts()

poss_status  event          
 0.0         Play               14673
 1.0         Puck Recovery       9368
-1.0         Incomplete Play     6111
 0.0         Puck Recovery       5806
             Zone Entry          2555
-1.0         Dump In/Out         2106
 1.0         Takeaway            2092
 0.0         Shot                1917
 1.0         Faceoff Win         1629
-1.0         Shot                1607
 0.0         Dump In/Out         1439
-1.0         Zone Entry           928
 0.0         Penalty Taken        260
             Goal                 132
dtype: int64

<IPython.core.display.Javascript object>

 ## For everything we assume the next event
we have to account for what can happen: 
* Shot: 
    * Puck Recovery
    * Shot: Rebound 
    * Faceoff Win: Out of Play or Goalie froze the puck 
    * Goal: Rebound 
    * Penalty Taken 
    * Play: Pass from Rebound 
    * Incomplete Play: Pass from Rebound 
    * Dump in/out: surrender the puck from rebound
*  Dump In/Out: 
    * Puck Recovery
    * Zone Entry
    * Faceoff Win: Goalie Freeze, icing 
    * Penalty Taken 
* Zone Entry Dump:
    * Puck Recovery 
    * Faceoff Win: Icing or over the glass.
    * Penalty Taken 
    


## Strength States: differences between team Strengths 

In [6]:
womens.loc[womens['team']==womens['home_team'],'strength_state'] = womens.loc[womens['team']==womens['home_team'],'home_team_skaters'].sub(womens.loc[womens['team']==womens['home_team'],'away_team_skaters'])
womens.loc[womens['team']==womens['away_team'],'strength_state'] = womens.loc[womens['team']==womens['away_team'],'away_team_skaters'].sub(womens.loc[womens['team']==womens['away_team'],'home_team_skaters'])
womens['strength_state'].value_counts()

 0.0    39618
 1.0     8113
-1.0     2755
 2.0      350
-2.0       48
Name: strength_state, dtype: int64

<IPython.core.display.Javascript object>

In [7]:
womens.loc[womens['team']==womens['home_team'],'home_team_skaters'].value_counts(dropna=False)

5    23164
4     2109
6      137
3       38
Name: home_team_skaters, dtype: int64

<IPython.core.display.Javascript object>

In [8]:
womens.loc[50883,'home_team_skaters']

4

<IPython.core.display.Javascript object>

In [9]:
womens.columns

Index(['game_date', 'home_team', 'away_team', 'period', 'clock',
       'home_team_skaters', 'away_team_skaters', 'home_team_goals',
       'away_team_goals', 'team', 'player', 'event', 'x_coord', 'y_coord',
       'detail_1', 'detail_2', 'detail_3', 'detail_4', 'player_2', 'x_coord_2',
       'y_coord_2', 'game_id', 'is_home', 'is_shot', 'is_goal', 'event_id',
       'team_id', 'player_id', 'detail_1_code', 'detail_2_code',
       'detail_3_code', 'detail_4_code', 'goal_diff', 'seconds_remaining',
       'poss_status', 'strength_state'],
      dtype='object')

<IPython.core.display.Javascript object>

#  Calculate differences in disctance for actions
## create endpoint for actions
### Shot
* on net: position of the goal
* missed/blocked possesion lost or Retained: location next event -> Puck Recovery
### Goal 
* position of the goal
### Takeaway
* same position
### Puck Recovery
* same position
### Dump In/Out
* Possession Lost or Retained: location of the next event (Puck Recovery)
### Zone Entry
* Carried: same position
* Dumped: location of the next event (Puck Recovery)
* Passed: entpoint of the pass
### Faceoff Wins
* same position
### Penalty Taken
* same position

In [10]:
womens.loc[(womens['event']=='Shot') & (womens['detail_2'] == 'On Net'),['x_coord_2','y_coord_2']] = [GOAL_X,GOAL_Y]
shifted_coords = womens.loc[:,['x_coord','y_coord']].shift(-1)
womens2 = womens.loc[:]
womens2.loc[:,['x_coord','y_coord']] = shifted_coords
womens.loc[(womens['event']=='Shot') & (womens['detail_2'] == 'Blocked'),'x_coord_2'] = womens2.loc[(womens2['event']=='Shot') & (womens2['detail_2'] == 'Blocked'),'x_coord']
womens.loc[(womens['event']=='Shot') & (womens['detail_2'] == 'Blocked'),'y_coord_2'] = womens2.loc[(womens2['event']=='Shot') & (womens2['detail_2'] == 'Blocked'),'y_coord']
womens.loc[(womens['event']=='Shot') & (womens['detail_2'] == 'Missed'),'x_coord_2'] = womens2.loc[(womens2['event']=='Shot') & (womens2['detail_2'] == 'Missed'),'x_coord']
womens.loc[(womens['event']=='Shot') & (womens['detail_2'] == 'Missed'),'y_coord_2'] = womens2.loc[(womens2['event']=='Shot') & (womens2['detail_2'] == 'Missed'),'y_coord']
womens.loc[womens['event']=='Goal',['x_coord_2','y_coord_2']] = [GOAL_X,GOAL_Y]
womens.loc[womens['event']=='Takeaway','x_coord_2'] = womens.loc[womens['event']=='Takeaway','x_coord']
womens.loc[womens['event']=='Takeaway','y_coord_2'] = womens.loc[womens['event']=='Takeaway','y_coord']
womens.loc[womens['event']=='Puck Recovery','x_coord_2'] = womens.loc[womens['event']=='Puck Recovery','x_coord']
womens.loc[womens['event']=='Puck Recovery','y_coord_2'] = womens.loc[womens['event']=='Puck Recovery','y_coord']
womens.loc[womens['event']=='Dump In/Out','x_coord_2'] = womens2.loc[womens2['event']=='Dump In/Out','x_coord']
womens.loc[womens['event']=='Dump In/Out','y_coord_2'] = womens2.loc[womens2['event']=='Dump In/Out','y_coord']
womens.loc[womens['event']=='Zone Entry','x_coord_2'] = womens.loc[womens['event']=='Zone Entry','x_coord']
womens.loc[womens['event']=='Zone Entry','y_coord_2'] = womens.loc[womens['event']=='Zone Entry','y_coord']
womens.loc[(womens['event']=='Zone Entry') & (womens['detail_1']=='Dumped'),'x_coord_2'] = womens2.loc[(womens2['event']=='Zone Entry') & (womens2['detail_1']=='Dumped'),'x_coord']
womens.loc[(womens['event']=='Zone Entry') & (womens['detail_1']=='Dumped'),'y_coord_2'] = womens2.loc[(womens2['event']=='Zone Entry') & (womens2['detail_1']=='Dumped'),'y_coord']
womens.loc[womens['event']=='Faceoff Win','x_coord_2'] = womens.loc[womens['event']=='Faceoff Win','x_coord']
womens.loc[womens['event']=='Faceoff Win','y_coord_2'] = womens.loc[womens['event']=='Faceoff Win','y_coord']
womens.loc[womens['event']=='Penalty Taken','x_coord_2'] = womens.loc[womens['event']=='Penalty Taken','x_coord']
womens.loc[womens['event']=='Penalty Taken','y_coord_2'] = womens.loc[womens['event']=='Penalty Taken','y_coord']

<IPython.core.display.Javascript object>

## make columns for in which zone a player is in and a diff column for it 
* 1 is the defensive zone 
* 2 is the neutral zone 
* 3 is the offensive zone
* a positive difference is the difference in zones forward
* a negative difference is the differene in zone backwards

In [11]:
womens.loc[womens['x_coord'] <= D_ZONE, 'zone_1'] = 1
womens.loc[womens['x_coord'] > D_ZONE, 'zone_1'] = 2
womens.loc[womens['x_coord'] >= O_ZONE, 'zone_1'] = 3
womens.loc[womens['x_coord_2'] <= D_ZONE, 'zone_2'] = 1
womens.loc[womens['x_coord_2'] > D_ZONE, 'zone_2'] = 2
womens.loc[womens['x_coord_2'] >= O_ZONE, 'zone_2'] = 3
womens.loc[womens['event']=='Zone Entry','zone_1'] = 2
womens.loc[womens['event']=='Zone Entry','zone_2'] = 3
womens.loc[:,'zone_diff'] = womens['zone_2'] - womens['zone_1']

<IPython.core.display.Javascript object>

In [12]:
womens['zone_diff'].value_counts()

 0.0    41378
 1.0     7089
-2.0      975
-1.0      829
 2.0      613
Name: zone_diff, dtype: int64

<IPython.core.display.Javascript object>

In [13]:
diff_x1 = GOAL_X - womens['x_coord']
diff_y1 = abs(GOAL_Y - womens['y_coord'])
diff_x2 = GOAL_X - womens['x_coord_2']
diff_y2 = abs(GOAL_Y - womens['y_coord_2'])
womens['start_distance_to_goal'] = np.sqrt(diff_x1 ** 2 + diff_y1 ** 2)
womens['end_distance_to_goal'] = np.sqrt(diff_x2 ** 2 + diff_y2 ** 2)
womens['diff_x'] = womens['x_coord_2'] - womens['x_coord']
womens['diff_y'] = womens['y_coord_2'] - womens['y_coord']
womens['distance_covered'] = np.sqrt((womens['x_coord_2'] - womens['x_coord']) ** 2 + (womens['y_coord_2'] - womens['y_coord']) ** 2)
diff_x1 = diff_x1.astype(float)
womens['angle_to_goal_start'] = np.divide(diff_x1, diff_y1,out=np.zeros_like(diff_x1),where=(diff_y1 != 0))
womens.loc[womens['angle_to_goal_start']>=360,'angle_to_goal_start'] = womens.loc[womens['angle_to_goal_start'] >=360,'angle_to_goal_start'] - 360
womens.loc[womens['angle_to_goal_start']< 0,'angle_to_goal_start'] = womens.loc[womens['angle_to_goal_start'] < 0,'angle_to_goal_start'] + 360
diff_x2 = diff_x2.astype(float)
womens['angle_to_goal_end'] = np.divide(diff_x2, diff_y2,out=np.zeros_like(diff_x2),where=(diff_y2 != 0))
womens.loc[womens['angle_to_goal_end']>=360,'angle_to_goal_end'] = womens.loc[womens['angle_to_goal_end'] >=360,'angle_to_goal_end'] - 360
womens.loc[womens['angle_to_goal_end']< 0,'angle_to_goal_end'] = womens.loc[womens['angle_to_goal_end'] < 0,'angle_to_goal_end'] + 360
womens['diff_angle_to_goal'] = womens['angle_to_goal_end'] - womens['angle_to_goal_start']

<IPython.core.display.Javascript object>

# non-shot xG models

In [14]:
xg_features = ['x_coord','y_coord','start_distance_to_goal','angle_to_goal_start','strength_state']
xg_labels = ['is_goal']

<IPython.core.display.Javascript object>

In [15]:
parameters = {
    'nthread': [4],
    'objective': ['binary:logistic'],
    'max_depth': [3, 4, 5, 6],
    'learning_rate': [0.01],
    'n_estimators': [100, 500, 1000],
    'seed': [42]
    }

df_xg_model = pd.DataFrame()
kf = KFold(10, shuffle=True)

for train_idx, test_idx in kf.split(womens):
    train_data = womens.iloc[train_idx].copy()
    test_data = womens.iloc[test_idx].copy()

    classifier = XGBClassifier()
    classifier = GridSearchCV(classifier, parameters, scoring='roc_auc', verbose=2)
    classifier.fit(
        train_data[xg_features],
        train_data[xg_labels]
    )
    dfs_predictions = {}
    y_pred = classifier.predict_proba(test_data[xg_features])
    dfs_predictions[xg_labels[0]] = pd.Series(y_pred[:,1], index=test_data.index)
    df_predictions = pd.concat(dfs_predictions, axis=1)
    df_xg_model = df_xg_model.append(df_predictions)

arning_rate=0.01, max_depth=3, n_estimators=1000, nthread=4, objective=binary:logistic, seed=42; total time=   3.0s
[CV] END learning_rate=0.01, max_depth=3, n_estimators=1000, nthread=4, objective=binary:logistic, seed=42; total time=   2.9s
[CV] END learning_rate=0.01, max_depth=3, n_estimators=1000, nthread=4, objective=binary:logistic, seed=42; total time=   2.9s
[CV] END learning_rate=0.01, max_depth=3, n_estimators=1000, nthread=4, objective=binary:logistic, seed=42; total time=   3.0s
[CV] END learning_rate=0.01, max_depth=4, n_estimators=100, nthread=4, objective=binary:logistic, seed=42; total time=   0.3s
[CV] END learning_rate=0.01, max_depth=4, n_estimators=100, nthread=4, objective=binary:logistic, seed=42; total time=   0.3s
[CV] END learning_rate=0.01, max_depth=4, n_estimators=100, nthread=4, objective=binary:logistic, seed=42; total time=   0.3s
[CV] END learning_rate=0.01, max_depth=4, n_estimators=100, nthread=4, objective=binary:logistic, seed=42; total time=   0.3s

<IPython.core.display.Javascript object>

In [16]:
womens['is_goal'].sum()

132

<IPython.core.display.Javascript object>

In [17]:
df_xg_model.sum()

is_goal    133.816666
dtype: float32

<IPython.core.display.Javascript object>

In [18]:
abs(df_xg_model.sum() - womens['is_goal'].sum())

is_goal    1.816666
dtype: float32

<IPython.core.display.Javascript object>

In [19]:
womens['non_shot_xg'] = df_xg_model['is_goal']

<IPython.core.display.Javascript object>

# create Labels

In [20]:
goals = womens['event'].str.contains('Goal')
y = pd.concat([womens.loc[:, 'is_goal'], womens.loc[:,'team_id']], axis = 1)
y.columns = ['goal','team_id']
for i in range(1, 10):
    for col in ['team_id', 'goal']:
        shifted = y[col].shift(-i)
        shifted[-i:] = y[col][len(y) - 1]
        y[f'{col}+{i}'] = shifted.astype(int)

scores = y['goal']
concedes = y['goal']
for i in range(1, 10):
    goal_scored = y[f'goal+{i}'] & (y[f'team_id+{i}'] == y['team_id'])
    goal_opponent = y[f'goal+{i}'] & (y[f'team_id+{i}'] != y['team_id'])
    scores = scores | goal_scored
    concedes = concedes | goal_opponent
label_scores = pd.DataFrame(scores, columns=['scores'])
label_concedes = pd.DataFrame(concedes, columns=['concedes'])
df_labels = pd.concat([label_scores, label_concedes], axis=1)

<IPython.core.display.Javascript object>

# Features

In [21]:
features = ['game_id','team_id', 'player_id', 'period', 'x_coord', 'y_coord', 'x_coord_2',
       'y_coord_2', 'is_home', 'is_shot', 'is_goal', 'event_id',
       'goal_diff', 'seconds_remaining','diff_x', 'diff_y', 'distance_covered', 'start_distance_to_goal', 'end_distance_to_goal','zone_diff','poss_status','diff_angle_to_goal','non_shot_xg']
df_delays = [womens[features].shift(step).add_suffix(f'-{step}') for step in range(0,3)]
df_features = pd.concat(df_delays, axis=1)

<IPython.core.display.Javascript object>

In [22]:
for step in range(0,3):
    df_features[f'team-{step}'] = df_features['team_id-0'] == df_features[f'team_id-{step}']

for step in range(0,3):
    df_features.loc[~(df_features[f'team-{step}']),f'x_coord-{step}'] = ICE_LENGTH - df_features[f'x_coord-{step}']
    df_features.loc[~(df_features[f'team-{step}']),f'x_coord_2-{step}'] = ICE_LENGTH - df_features[f'x_coord_2-{step}']
    df_features.loc[~(df_features[f'team-{step}']),f'y_coord-{step}'] = ICE_WIDTH - df_features[f'y_coord-{step}']
    df_features.loc[~(df_features[f'team-{step}']),f'y_coord_2-{step}'] = ICE_WIDTH - df_features[f'y_coord_2-{step}']

<IPython.core.display.Javascript object>

In [23]:
for step in range(0,3):
    start_diff_x = GOAL_X - df_features[f'x_coord-{step}']
    start_diff_y = abs(GOAL_Y - df_features[f'y_coord-{step}'])
    df_features[f'start_distance_to_goal-{step}'] = np.sqrt(start_diff_x ** 2 + start_diff_y ** 2)
    end_diff_x = GOAL_X - df_features[f'x_coord_2-{step}']
    end_diff_y = abs(GOAL_Y - df_features[f'y_coord_2-{step}'])
    df_features[f'end_distance_to_goal-{step}'] = np.sqrt(end_diff_x ** 2 + end_diff_y ** 2)
    df_features[f'diff_x-{step}'] = df_features[f'x_coord_2-{step}'] - df_features[f'x_coord-{step}']
    df_features[f'diff_y-{step}'] = df_features[f'y_coord_2-{step}'] - df_features[f'y_coord-{step}']
    df_features[f'distance_covered-{step}'] = np.sqrt((df_features[f'x_coord_2-{step}'] - df_features[f'x_coord-{step}']) ** 2 + (df_features[f'y_coord_2-{step}'] - df_features[f'y_coord-{step}']) ** 2)

<IPython.core.display.Javascript object>

In [24]:
df_features['xdiff_sequenc_pre'] = df_features['x_coord-0'] - df_features['x_coord-2']
df_features['ydiff_sequenc_pre'] = df_features['y_coord-0'] - df_features['y_coord-2']
df_features['time_sequence_pre'] = df_features['seconds_remaining-0'] - df_features['seconds_remaining-2']
df_features[['start_distance_to_goal-0', 'end_distance_to_goal-0', 'start_distance_to_goal-1', 'end_distance_to_goal-1', 'start_distance_to_goal-2', 'end_distance_to_goal-2', 'team-1', 'team-2']]
            

Unnamed: 0,start_distance_to_goal-0,end_distance_to_goal-0,start_distance_to_goal-1,end_distance_to_goal-1,start_distance_to_goal-2,end_distance_to_goal-2,team-1,team-2
0,90.001389,90.001389,,,,,False,False
1,101.986519,101.986519,90.001389,90.001389,,,True,False
2,92.402651,92.402651,101.986519,101.986519,90.001389,90.001389,True,True
3,92.402651,46.970736,92.402651,92.402651,101.986519,101.986519,True,True
4,46.970736,46.970736,92.402651,46.970736,92.402651,92.402651,True,True
...,...,...,...,...,...,...,...,...
50879,182.937831,182.937831,162.059403,18.607794,149.141711,162.059403,False,False
50880,182.937831,165.774697,182.937831,182.937831,162.059403,18.607794,True,False
50881,161.425679,69.615013,182.937831,165.774697,182.937831,182.937831,True,True
50882,69.615013,69.615013,47.940067,127.930645,18.607794,43.832066,False,False


<IPython.core.display.Javascript object>

In [25]:
df_features.columns

Index(['game_id-0', 'team_id-0', 'player_id-0', 'period-0', 'x_coord-0',
       'y_coord-0', 'x_coord_2-0', 'y_coord_2-0', 'is_home-0', 'is_shot-0',
       'is_goal-0', 'event_id-0', 'goal_diff-0', 'seconds_remaining-0',
       'diff_x-0', 'diff_y-0', 'distance_covered-0',
       'start_distance_to_goal-0', 'end_distance_to_goal-0', 'zone_diff-0',
       'poss_status-0', 'diff_angle_to_goal-0', 'non_shot_xg-0', 'game_id-1',
       'team_id-1', 'player_id-1', 'period-1', 'x_coord-1', 'y_coord-1',
       'x_coord_2-1', 'y_coord_2-1', 'is_home-1', 'is_shot-1', 'is_goal-1',
       'event_id-1', 'goal_diff-1', 'seconds_remaining-1', 'diff_x-1',
       'diff_y-1', 'distance_covered-1', 'start_distance_to_goal-1',
       'end_distance_to_goal-1', 'zone_diff-1', 'poss_status-1',
       'diff_angle_to_goal-1', 'non_shot_xg-1', 'game_id-2', 'team_id-2',
       'player_id-2', 'period-2', 'x_coord-2', 'y_coord-2', 'x_coord_2-2',
       'y_coord_2-2', 'is_home-2', 'is_shot-2', 'is_goal-2', 'event_i

<IPython.core.display.Javascript object>

# Split Dataset & Train Classifiers

In [26]:
labels = ['scores','concedes']
feat = ['start_distance_to_goal-0', 'end_distance_to_goal-0', 'start_distance_to_goal-1', 'end_distance_to_goal-1', 'start_distance_to_goal-2', 'end_distance_to_goal-2','team-1', 'team-2','seconds_remaining-0','goal_diff-0','zone_diff-0','zone_diff-1','zone_diff-2','poss_status-0','poss_status-1','poss_status-2', 'diff_angle_to_goal-0','diff_angle_to_goal-1','diff_angle_to_goal-2','non_shot_xg-0','non_shot_xg-1','non_shot_xg-2']

<IPython.core.display.Javascript object>

In [27]:
df_model = pd.concat([df_features,df_labels],axis=1)
df_score_concede_prob = pd.DataFrame()
kf = KFold(10, shuffle=True)

for train_idx, test_idx in kf.split(df_model):
    train_data = df_model.iloc[train_idx].copy()
    test_data = df_model.iloc[test_idx].copy()

    models = {}
    for label in tqdm(labels):
        model = XGBClassifier(
            n_estimators=50,
            max_depth=3
        )
        model.fit(
            X=train_data[feat],
            y=train_data[label]
        )
        models[label] = model

    dfs_predictions = {}
    for label in tqdm(labels):
        model = models[label]
        probabilities = model.predict_proba(test_data[feat])
        predictions = probabilities[:, 1]
        print(np.isnan(probabilities).sum())
        dfs_predictions[label] = pd.Series(predictions, index=test_data.index)
    df_predictions = pd.concat(dfs_predictions, axis=1)
    df_score_concede_prob = df_score_concede_prob.append(df_predictions)

100%|██████████| 2/2 [00:01<00:00,  1.14it/s]
100%|██████████| 2/2 [00:00<00:00, 71.18it/s]
  0%|          | 0/2 [00:00<?, ?it/s]0
0
100%|██████████| 2/2 [00:01<00:00,  1.15it/s]
100%|██████████| 2/2 [00:00<00:00, 66.21it/s]
  0%|          | 0/2 [00:00<?, ?it/s]0
0
100%|██████████| 2/2 [00:01<00:00,  1.15it/s]
100%|██████████| 2/2 [00:00<00:00, 69.08it/s]
  0%|          | 0/2 [00:00<?, ?it/s]0
0
100%|██████████| 2/2 [00:01<00:00,  1.11it/s]
100%|██████████| 2/2 [00:00<00:00, 68.45it/s]
  0%|          | 0/2 [00:00<?, ?it/s]0
0
100%|██████████| 2/2 [00:01<00:00,  1.11it/s]
100%|██████████| 2/2 [00:00<00:00, 57.06it/s]
  0%|          | 0/2 [00:00<?, ?it/s]0
0
100%|██████████| 2/2 [00:01<00:00,  1.10it/s]
100%|██████████| 2/2 [00:00<00:00, 65.23it/s]
  0%|          | 0/2 [00:00<?, ?it/s]0
0
100%|██████████| 2/2 [00:01<00:00,  1.11it/s]
100%|██████████| 2/2 [00:00<00:00, 65.87it/s]
  0%|          | 0/2 [00:00<?, ?it/s]0
0
100%|██████████| 2/2 [00:01<00:00,  1.20it/s]
100%|██████████| 2/2 [0

<IPython.core.display.Javascript object>

In [28]:
dfs_actions = []
dfs_actions.append(womens)
df_actions = pd.concat(dfs_actions).reset_index(drop=True)

df_actions_predictions = pd.concat([df_actions, df_score_concede_prob], axis=1)
df_actions_predictions = df_actions_predictions.dropna(subset=['start_distance_to_goal', 'end_distance_to_goal', 'diff_x', 'diff_y',
       'distance_covered', 'scores', 'concedes'])

<IPython.core.display.Javascript object>

# calculate the VAEP value

In [29]:
def prev(x: pd.Series) -> pd.Series:
    prev_x = x.shift(1)
    prev_x[:1] = x.values[0]
    return prev_x

<IPython.core.display.Javascript object>

In [30]:
dfs_values = []
df_values = pd.DataFrame()

sameteam = prev(df_actions_predictions.team_id) == df_actions_predictions.team_id
prev_scores = prev(df_actions_predictions.scores) * sameteam + prev(df_actions_predictions.concedes) * (~sameteam)
prev_concedes = prev(df_actions_predictions.concedes) * sameteam + prev(df_actions_predictions.scores) * (~sameteam)

toolong_idx = abs(prev(df_actions_predictions.seconds_remaining) - df_actions_predictions.seconds_remaining) > 10
prev_scores[toolong_idx] = 0
prev_concedes[toolong_idx] = 0

prevgoal_idx = prev(df_actions_predictions.event) == 'Goal'
prev_scores[prevgoal_idx] = 0
prev_concedes[prevgoal_idx] = 0

df_values['offensive_value'] = df_actions_predictions.scores - prev_scores
df_values['defensive_value'] = df_actions_predictions.concedes - prev_concedes
df_values['vaep'] = df_values['offensive_value'] + df_values['defensive_value']

<IPython.core.display.Javascript object>

# Analysis

In [31]:
df_final = pd.concat([df_actions_predictions,df_values],axis=1).dropna(subset=['vaep'])
df_ranking = (df_final[['player','team','vaep']]
.groupby(['player','team'])
.agg(vaep_count=('vaep','count'),
vaep_mean=('vaep','mean'),
vaep_sum=('vaep','sum'))
.sort_values('vaep_sum',ascending=False)
.reset_index()
)

df_rank_events = (df_final[['event','vaep']]
.groupby(['event'])
.agg(vaep_count=('vaep','count'),
vaep_mean=('vaep','mean'),
vaep_sum=('vaep','sum'))
.sort_values('vaep_sum',ascending=False)
.reset_index()
)

df_zone_entries = (df_final.loc[womens['event']=='Zone Entry',['detail_1','defensive_value','offensive_value','vaep']]
.groupby(['detail_1'])
.agg(vaep_count=('vaep','count'),
vaep_mean=('vaep','mean'),
vaep_sum=('vaep','sum'))
.sort_values('vaep_sum',ascending=False)
.reset_index()
)

df_rank_strength = (df_final[['strength_state','event','vaep']]
.groupby(['strength_state','event'])
.agg(vaep_count=('vaep','count'),
vaep_mean=('vaep','mean'),
vaep_sum=('vaep','sum'))
.sort_values('vaep_sum',ascending=False)
.reset_index()
)

<IPython.core.display.Javascript object>

In [32]:
df_final.columns

Index(['game_date', 'home_team', 'away_team', 'period', 'clock',
       'home_team_skaters', 'away_team_skaters', 'home_team_goals',
       'away_team_goals', 'team', 'player', 'event', 'x_coord', 'y_coord',
       'detail_1', 'detail_2', 'detail_3', 'detail_4', 'player_2', 'x_coord_2',
       'y_coord_2', 'game_id', 'is_home', 'is_shot', 'is_goal', 'event_id',
       'team_id', 'player_id', 'detail_1_code', 'detail_2_code',
       'detail_3_code', 'detail_4_code', 'goal_diff', 'seconds_remaining',
       'poss_status', 'strength_state', 'zone_1', 'zone_2', 'zone_diff',
       'start_distance_to_goal', 'end_distance_to_goal', 'diff_x', 'diff_y',
       'distance_covered', 'angle_to_goal_start', 'angle_to_goal_end',
       'diff_angle_to_goal', 'non_shot_xg', 'scores', 'concedes',
       'offensive_value', 'defensive_value', 'vaep'],
      dtype='object')

<IPython.core.display.Javascript object>

In [33]:
df_ranking.head(10)

Unnamed: 0,player,team,vaep_count,vaep_mean,vaep_sum
0,Natalie Spooner,Olympic (Women) - Canada,411,0.01122,4.611493
1,Rebecca Johnston,Olympic (Women) - Canada,686,0.005697,3.90805
2,Kendall Coyne Schofield,Olympic (Women) - United States,466,0.007629,3.554918
3,Christina Putigna,Boston Pride,365,0.008796,3.21049
4,Hilary Knight,Olympic (Women) - United States,447,0.006685,2.988146
5,Meghan Lorence,Minnesota Whitecaps,229,0.01279,2.92891
6,Meghan Agosta,Olympic (Women) - Canada,324,0.007937,2.571652
7,Sarah Nurse,Olympic (Women) - Canada,395,0.006459,2.55147
8,Taylor Wenczkowski,Boston Pride,304,0.008384,2.54883
9,Autumn MacDougall,Buffalo Beauts,285,0.008468,2.41346


<IPython.core.display.Javascript object>

In [34]:
df_rank_events

Unnamed: 0,event,vaep_count,vaep_mean,vaep_sum
0,Shot,3524,0.043461,153.155273
1,Play,14673,0.00382,56.047165
2,Goal,132,0.3841,50.701187
3,Zone Entry,3744,0.005183,19.405521
4,Takeaway,2092,0.000632,1.323032
5,Dump In/Out,3545,0.000172,0.608482
6,Penalty Taken,260,0.001298,0.337507
7,Incomplete Play,6111,-0.007413,-45.301025
8,Faceoff Win,1629,-0.032298,-52.613205
9,Puck Recovery,15174,-0.007565,-114.796928


<IPython.core.display.Javascript object>

# First Impression
## What looks wrong here:
* Takeaway gains possesion and has negative value
* Incomplete Play loses Possession and has positive value
* Faceoff Win: alth negative value
* Puck Recovery

## What looks right:
* Shot: high value overall and mean
* Goal: highest value mean
* Play
* zone entry: kinda right.
* Dump in/out: low value. seems right
* penalty taken: negative value

# New Impression
## What looks right now: 
* Takeaway has positive mean value now
* Incomplete Play has now negative mean value 
* Puck Recovery can be for and against. There are positive and negative values 
## What is still wrong: 
* Faceoff Win still has 


In [35]:
df_zone_entries

Unnamed: 0,detail_1,vaep_count,vaep_mean,vaep_sum
0,Carried,2316,0.007657,17.733377
1,Played,261,0.003639,0.949745
2,Dumped,1167,0.000619,0.722398


<IPython.core.display.Javascript object>

In [36]:
womens.loc[(womens['event']=='Zone Entry'),'detail_1'].value_counts(dropna=False)

Carried    2316
Dumped     1167
Played      261
Name: detail_1, dtype: int64

<IPython.core.display.Javascript object>