## Data retrieval and cleaning

Don't forget to delete your `json_clean` directory if you make any modifications to `cleaning.py`!
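A minimal sketch of how the cleaned-data cache could be cleared before re-running the cleaning cell. The helper name `clear_clean_cache` is hypothetical and not part of the `ift6758` package; it only assumes that the cleaned pickles live under a single directory, such as the `json_clean` path used in this notebook.

```python
import shutil
from pathlib import Path


def clear_clean_cache(clean_dir):
    """Delete the cleaned-data directory so the next cleaning run rebuilds it.

    Hypothetical helper: returns True if a directory was removed,
    False if there was nothing to delete.
    """
    path = Path(clean_dir)
    if path.is_dir():
        shutil.rmtree(path)  # removes the directory and everything inside it
        return True
    return False
```

For example, `clear_clean_cache('./../../ift6758/data/json_clean/')` would force the next `clean_season` call to regenerate its files, assuming the cleaner recreates the directory on demand.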


In [1]:
from ift6758.data.acquisition import NHLGameData

data_path_raw = './../../ift6758/data/json_raw/'
nhl_games_data = NHLGameData(data_path_raw)
# Fetch seasons 2016-2017 through 2020-2021 (cached pickles are reused when present)
for year in range(2016, 2021):
    nhl_games_data.fetch_season(year)

Loading from cache file ./../../ift6758/data/json_raw/2016/2016-regular.pkl
Found 1230 regular games for season 2016-2017
Loading from cache file ./../../ift6758/data/json_raw/2016/2016-playoff.pkl
Found 102 playoff games for season 2016-2017
Loading from cache file ./../../ift6758/data/json_raw/2017/2017-regular.pkl
Found 1271 regular games for season 2017-2018
Loading from cache file ./../../ift6758/data/json_raw/2017/2017-playoff.pkl
Found 105 playoff games for season 2017-2018
Loading from cache file ./../../ift6758/data/json_raw/2018/2018-regular.pkl
Found 1271 regular games for season 2018-2019
Loading from cache file ./../../ift6758/data/json_raw/2018/2018-playoff.pkl
Found 105 playoff games for season 2018-2019
Loading from cache file ./../../ift6758/data/json_raw/2019/2019-regular.pkl
Found 1271 regular games for season 2019-2020
Loading from cache file ./../../ift6758/data/json_raw/2019/2019-playoff.pkl
Found 105 playoff games for season 2019-2020
Loading from cache file ./..

In [2]:
from ift6758.data.cleaning import DataCleaner

data_path_clean = './../../ift6758/data/json_clean/'
data_cleaner = DataCleaner(data_raw=nhl_games_data, data_path_clean=data_path_clean)
# Clean every fetched season, keeping the previous-event information
for year in range(2016, 2021):
    data_cleaner.clean_season(year, keepPreviousEventInfo=True)

Failed to extract event data for game 2017020045 due to missing rink side information
Failed to extract event data for game 2017020062 due to missing rink side information
Failed to extract event data for game 2017020077 due to missing rink side information
Failed to extract event data for game 2017020090 due to missing rink side information
Failed to extract event data for game 2017020121 due to missing rink side information
Failed to extract event data for game 2017020135 due to missing rink side information
Failed to extract event data for game 2017020149 due to missing rink side information
Failed to extract event data for game 2017020248 due to missing rink side information
Failed to extract event data for game 2017020308 due to missing rink side information
Failed to extract event data for game 2017020347 due to missing rink side information
Failed to extract event data for game 2017020376 due to missing rink side information
Failed to extract event data for game 2017020411 due t

In [3]:
import pandas as pd

data_2016 = pd.read_pickle(data_path_clean + '2016/2016.pkl')
data_2017 = pd.read_pickle(data_path_clean + '2017/2017.pkl')
data_2018 = pd.read_pickle(data_path_clean + '2018/2018.pkl')
data_2019 = pd.read_pickle(data_path_clean + '2019/2019.pkl')
data_2020 = pd.read_pickle(data_path_clean + '2020/2020.pkl')

In [4]:
data_2020.sample(5)

Unnamed: 0,game_id,period,period_time,event_type,team,x,y,shooter,goalie,shot_type,empty_net,strength,opposite_team_side,prev_type,prev_period_time,prev_x,prev_y,time_between_events,distance_between_events
49599,2020020848,1,11:34,SHOT,Winnipeg Jets,-60.0,12.0,Paul Stastny,Anton Forsberg,Backhand,False,,left,HIT,11:27,-95.0,-25.0,7,50.931326
14157,2020020244,2,02:12,SHOT,Boston Bruins,-81.0,31.0,David Krejci,Scott Wedgewood,Wrist Shot,False,,left,MISSED_SHOT,01:54,35.0,29.0,18,116.01724
24464,2020020418,1,03:14,SHOT,Washington Capitals,69.0,16.0,Alex Ovechkin,Brian Elliott,Snap Shot,False,,right,MISSED_SHOT,02:41,48.0,29.0,33,24.698178
36923,2020020630,1,06:22,SHOT,Columbus Blue Jackets,38.0,4.0,Cam Atkinson,Andrei Vasilevskiy,Wrist Shot,False,,right,FACEOFF,05:41,0.0,0.0,41,38.209946
30905,2020020527,2,02:54,SHOT,Calgary Flames,77.0,25.0,Sam Bennett,Connor Hellebuyck,Wrist Shot,False,,right,BLOCKED_SHOT,02:31,70.0,-13.0,23,38.639358


The `NaN` values in the previous-event columns are expected: some previous events are of no interest to us, so their columns are left empty.
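To illustrate why a missing previous position also leaves `distance_between_events` empty, here is a small sketch. The function `event_distance` is a hypothetical stand-in, not the code in `cleaning.py`; it assumes the distance is a plain Euclidean distance, which is consistent with the first sampled row above.

```python
import math


def event_distance(x, y, prev_x, prev_y):
    """Euclidean distance between an event and the previous one (assumed formula)."""
    return math.hypot(x - prev_x, y - prev_y)


nan = float("nan")

# Coordinates of the first sampled 2020 row: shot at (-60, 12), previous HIT at (-95, -25)
with_prev = event_distance(-60.0, 12.0, -95.0, -25.0)

# When the previous event has no coordinates, the NaN propagates to the distance
without_prev = event_distance(-60.0, 12.0, nan, nan)

print(round(with_prev, 4), math.isnan(without_prev))
```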

In [5]:
data_2020.isna().sum()

game_id                        0
period                         0
period_time                    0
event_type                     0
team                           0
x                              0
y                              0
shooter                        0
goalie                       277
shot_type                      0
empty_net                      0
strength                   50044
opposite_team_side             0
prev_type                      0
prev_period_time               0
prev_x                      2822
prev_y                      2822
time_between_events            0
distance_between_events     2822
dtype: int64

## Feature engineering

In [6]:
from ift6758.features import FeatureEng
data_path_clean = './../../ift6758/data/json_clean/'
w = FeatureEng(data_path_clean)

In [7]:
df = w.features_2(2016,2020)
df.sample(10)

Unnamed: 0,game_id,period,period_time,event_type,x,y,prev_type,prev_period_time,prev_x,prev_y,time_between_events,distance_between_events,distance_goal,prev_distance_goal,angle_shot,prev_angle_shot,bounce,angle_change,speed
129677,2017020876,3,16:22,SHOT,61.0,-8.0,SHOT,15:43,-68.0,17.0,39.0,131.400152,30.08,158.91,-15.42,6.14,True,-21.56,3.37
276156,2019020725,1,14:52,SHOT,-82.0,-28.0,FACEOFF,14:42,69.0,22.0,10.0,159.062881,29.12,160.51,-74.06,7.88,False,0.0,15.91
100031,2017020402,2,05:42,SHOT,-71.0,35.0,BLOCKED_SHOT,05:09,67.0,-14.0,33.0,146.441114,39.82,157.62,61.52,-5.1,True,66.62,4.44
235002,2019020049,1,09:17,GOAL,-79.0,-12.0,TAKEAWAY,09:08,94.0,-6.0,9.0,173.104015,16.28,184.1,-47.49,-1.87,False,0.0,19.23
20306,2016020336,1,06:51,SHOT,54.0,22.0,TAKEAWAY,06:40,-12.0,-40.0,11.0,90.553851,42.19,109.56,31.43,-21.41,False,0.0,8.23
219805,2018021073,3,10:10,SHOT,-67.0,-19.0,SHOT,10:08,-53.0,14.0,2.0,35.846897,29.83,39.56,-39.56,20.73,True,-60.29,17.92
247021,2019020246,2,04:13,SHOT,78.0,-2.0,SHOT,03:38,75.0,-10.0,35.0,8.544004,12.17,18.03,-9.46,-33.69,True,24.23,0.24
285035,2019020867,4,02:48,SHOT,77.0,3.0,SHOT,02:18,66.0,3.0,30.0,11.0,13.34,24.19,13.0,7.12,True,5.88,0.37
197952,2018020717,1,01:28,SHOT,-85.0,7.0,HIT,01:16,-92.0,-36.0,12.0,43.566042,8.6,36.06,54.48,-86.69,False,0.0,3.63
4825,2016020081,2,11:38,SHOT,-69.0,-5.0,MISSED_SHOT,11:27,74.0,39.0,11.0,149.616176,21.59,168.57,-13.39,13.38,True,-26.77,13.6
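
The engineered columns can be sanity-checked by hand. The sketch below is not the `FeatureEng` implementation: the function names are hypothetical, and the net position `x = ±90` is an assumption inferred from the sampled values rather than taken from the package. It recomputes the distance, angle, speed and angle change for row 129677.

```python
import math


def shot_geometry(x, y, net_x):
    """Distance and signed angle (degrees) from a shot to the centre of the net.

    net_x is the assumed x coordinate of the attacked net (+90 or -90).
    """
    dx = abs(net_x - x)  # separation along the length of the rink
    distance = math.hypot(dx, y)
    angle = math.degrees(math.atan2(y, dx))  # sign follows the y coordinate
    return distance, angle


def event_speed(distance_between_events, time_between_events):
    """Distance covered per second between two consecutive events."""
    if time_between_events == 0:
        return 0.0  # assumption: avoid dividing by zero for simultaneous events
    return distance_between_events / time_between_events


def angle_change(angle, prev_angle, bounce):
    """Change in angle relative to the previous event, counted only for bounces."""
    return angle - prev_angle if bounce else 0.0


# Row 129677: shot at (61, -8), 131.400152 away from the previous shot, 39 s later
distance, angle = shot_geometry(61.0, -8.0, net_x=90.0)
speed = event_speed(131.400152, 39.0)
change = angle_change(-15.42, 6.14, bounce=True)
print(round(distance, 2), round(angle, 2), round(speed, 2), round(change, 2))
# → 30.08 -15.42 3.37 -21.56
```

These match the `distance_goal`, `angle_shot`, `speed` and `angle_change` values of that row, which suggests (but does not prove) that the features follow these formulas.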


In [8]:
df.isna().sum()

game_id                       0
period                        0
period_time                   0
event_type                    0
x                             0
y                             0
prev_type                     0
prev_period_time              0
prev_x                     4076
prev_y                     4075
time_between_events           0
distance_between_events    4076
distance_goal                 0
prev_distance_goal         4076
angle_shot                    0
prev_angle_shot            4076
bounce                        0
angle_change                  2
speed                         0
dtype: int64

Annoying `SHOT` events that have no coordinates (and therefore no distance or angles).
The second one occurred during a shootout. We still have to decide how to handle shootout shots, whose period times are 0.
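
One possible way to handle shootout shots is to flag them and filter them out before computing time-based features. The sketch below is only a suggestion: `is_shootout_shot` is a hypothetical helper, and it assumes that digits 5 and 6 of `game_id` encode the game type (`02` for regular season) and that period 5 of a regular-season game is the shootout. Both assumptions should be checked against the raw data.

```python
def is_shootout_shot(game_id, period):
    """Flag shots taken in a regular-season shootout (assumed encoding)."""
    game_type = str(game_id)[4:6]  # assumption: '02' = regular season, '03' = playoffs
    return game_type == "02" and int(period) == 5  # assumption: period 5 = shootout


rows = [
    {"game_id": 2019020867, "period": 4},  # overtime shot from the sample above
    {"game_id": 2019020867, "period": 5},  # hypothetical shootout attempt
    {"game_id": 2019030111, "period": 5},  # hypothetical playoff game: overtime, not a shootout
]

# Keep everything except the flagged shootout attempts
kept = [r for r in rows if not is_shootout_shot(r["game_id"], r["period"])]
print(len(kept))
```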

In [None]:
df[df.angle_change.isna()]

An annoying previous event that has only a y coordinate. We keep it for the speed and the time between events.

In [None]:
df[df.prev_x.isna() & df.prev_y.notna()]