<h1>Exploratory Analysis Notebook</h1>
<p>Purpose: This notebook explores the dataset, reviews the features present, and examines how they correlate with each other.</p>
<hr>

<h2>Packages</h2>

In [2]:
# Packages

from nhldata import moneypuck
import pandas as pd
import numpy as np

# Show every column when displaying wide DataFrames
pd.set_option('display.max_columns', None)

In [3]:
# Establish connection to NHL Shots Data
mp_conn = moneypuck.Connector()

# Pull every shot recorded for the 2024 season
shots = mp_conn.shots_season(season=2024)

shots_df = pd.DataFrame(shots)

<h2>Examine Dataset</h2>

In [4]:
shots_df.shape

(117829, 137)

In [5]:
shots_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117829 entries, 0 to 117828
Columns: 137 entries, shotID to yCordAdjusted
dtypes: object(137)
memory usage: 123.2+ MB


Because the dataset has so many columns, `shots_df.info()` has printed a condensed summary instead of a per-column listing. We can still see that there are 137 columns, all of dtype `object`. We will check for missing data in the next cell.
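Since every column arrives as `object`, numeric work such as correlations will likely need a type conversion first. The following is a minimal sketch, assuming the values are numbers stored as strings. The toy frame stands in for `shots_df`, and `coerce_numeric_columns` is a hypothetical helper, not part of `nhldata` or pandas.

```python
import pandas as pd


def coerce_numeric_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which columns that parse fully as numbers become numeric.

    Hypothetical helper for illustration only.
    """
    out = df.copy()
    for col in out.columns:
        # Values that cannot be parsed become NaN instead of raising an error.
        converted = pd.to_numeric(out[col], errors="coerce")
        # Keep the conversion only if no non-null value was lost to NaN.
        if converted.notna().sum() == out[col].notna().sum():
            out[col] = converted
    return out


# Toy stand-in for shots_df (assumed values, not real data)
toy = pd.DataFrame({
    "xCord": ["57", "71", "-41"],
    "shotDistance": ["51.22", "33.29", "57.14"],
    "shotType": ["WRIST", "WRIST", "SLAP"],
})

typed = coerce_numeric_columns(toy)
print(typed.dtypes)
```

The check is deliberately conservative: a column with even one unparseable value, including a blank string, is left unchanged. To see the full per-column listing that the summary above omits, `shots_df.info(verbose=True, show_counts=True)` should work on recent pandas versions.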

In [6]:
print('Checking for missing data: ')
print('=' * 50)

shots_df.isnull().sum().sort_values(ascending=False).head(10)

Checking for missing data: 


shotID                       0
arenaAdjustedShotDistance    0
arenaAdjustedXCord           0
arenaAdjustedXCordABS        0
arenaAdjustedYCord           0
arenaAdjustedYCordAbs        0
averageRestDifference        0
awayEmptyNet                 0
awayPenalty1Length           0
awayPenalty1TimeLeft         0
dtype: int64

The null counts were sorted in descending order and the first 10 values displayed. Because even the largest count is 0, no column in the dataset contains a `NaN` or `None` value. Note that `isnull()` does not flag empty strings, so with every column stored as `object`, blank entries would not be counted here.
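A zero null count does not rule out values that are missing but encoded as text. The sketch below assumes missing entries may appear as empty or whitespace-only strings, which the source does not confirm for this dataset. `count_blank_strings` is a hypothetical helper, and the toy frame stands in for `shots_df`.

```python
import pandas as pd


def count_blank_strings(df: pd.DataFrame) -> pd.Series:
    """Count, per column, the values that are empty or whitespace-only strings.

    Hypothetical helper for illustration only.
    """
    def is_blank(value) -> bool:
        return isinstance(value, str) and value.strip() == ""

    # Map each column to booleans, sum them, and list the worst columns first.
    return df.apply(lambda col: col.map(is_blank).sum()).sort_values(ascending=False)


# Toy stand-in for shots_df (assumed values, not real data)
toy = pd.DataFrame({
    "shooterLeftRight": ["", "L", "R", " "],
    "shotType": ["WRIST", "WRIST", "SLAP", "SLAP"],
})

null_total = toy.isnull().sum().sum()   # isnull() sees nothing missing here
blank_counts = count_blank_strings(toy)

print(null_total)
print(blank_counts)
```

Running `count_blank_strings(shots_df).head(10)` alongside the `isnull()` check would show whether any columns carry blank entries.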

<h2>Features</h2>

<p>Given there are no null values, let's spend some time looking through the features we have available.</p>
<p>The complete list of features can be found <a href='https://docs.google.com/spreadsheets/d/1aB-AkJJMTEPhb4oBCyOv-kJr11sOXW5MQtMjBeNss-Y/edit?gid=241218541#gid=241218541'>here</a>.</p>

In [8]:
target_cols = [
    'homeTeamCode',
    'awayTeamCode',
    'isPlayoffGame',
    'time',
    'period',
    'team',
    'location',
    'event',
    'goal',
    'xCord',
    'yCord',
    'shotAngle',
    'shotDistance',
    'shotType',
    'shotOnEmptyNet',
    'shotRebound',
    'shotRush',
    'homeEmptyNet',
    'awayEmptyNet',
    'playerPositionThatDidEvent',
    'goalieIdForShot',
    'shooterPlayerId',
    'shooterLeftRight',
    'shooterTimeOnIce',
    'offWing',
    'isHomeTeam',
    'teamCode'
]

shots_df[target_cols].head()

Unnamed: 0,homeTeamCode,awayTeamCode,isPlayoffGame,time,period,team,location,event,goal,xCord,yCord,shotAngle,shotDistance,shotType,shotOnEmptyNet,shotRebound,shotRush,homeEmptyNet,awayEmptyNet,playerPositionThatDidEvent,goalieIdForShot,shooterPlayerId,shooterLeftRight,shooterTimeOnIce,offWing,isHomeTeam,teamCode
0,BUF,NJD,0,8,1,AWAY,HOMEZONE,SHOT,0,57,-40,-51.3401917459,51.2249938995,WRIST,0,0,0,0,0,D,8480045,8483495,,8,0,0.0,NJD
1,BUF,NJD,0,29,1,AWAY,HOMEZONE,MISS,0,71,-28,-57.2647737279,33.2866339542,WRIST,0,0,0,0,0,L,8480045,8479407,L,7,1,0.0,NJD
2,BUF,NJD,0,40,1,AWAY,HOMEZONE,SHOT,0,48,-24,-30.3432488842,47.5078940809,SLAP,0,0,0,0,0,D,8480045,8476462,R,11,0,0.0,NJD
3,BUF,NJD,0,62,1,HOME,AWAYZONE,SHOT,0,-41,-31,32.8557219504,57.1401785086,WRIST,0,0,0,0,0,R,8474593,8482175,L,41,0,1.0,BUF
4,BUF,NJD,0,66,1,HOME,AWAYZONE,MISS,0,-36,15,-15.8025139539,55.0817574157,SLAP,0,0,0,0,0,D,8474593,8482671,L,15,1,1.0,BUF
