<h1><center>NHL DATA SCRAPE AND VISUALIZATION</center></h1>
<h2><center>A Learning Project</center></h2>

The following is an attempt to learn pandas, NumPy, and Beautiful Soup by building a repository of data scraped from the web, cleaning it, and then processing it to draw and test assumptions against that data. It uses the NHL API along with the NHL Play by Play sheet. One reason to use a Jupyter Notebook is that the data can be fetched once and then explored repeatedly without making further calls to those websites.

First we import the required python libraries. 
- NHLget is a self-made module that accesses the NHL API and the NHL Play by Play pages. It has five functions: two are public and three are internal to the module. One public function returns a list of gameIDs for a given time period; the other returns the play information for a specific game. The three internal functions support the latter: one gets the game information from the NHL API, the second gets additional game information from the NHL Game Report, and the third combines the two and returns the result.
- pandas will be used to conduct the data analysis.
- Matplotlib will be used for data visualization.

To make the most of the output area in the notebook, we set the `display.expand_frame_repr` option to False, so that each row of a wide DataFrame is printed on a single line instead of being wrapped. According to the pandas documentation, this option controls: "Whether to print out the full DataFrame repr for wide DataFrames across multiple lines, max_columns is still respected, but the output will wrap-around across multiple “pages” if its width exceeds display.width. [default: True] [currently: True]"

In [11]:
import NHLget
import pandas as pd
import matplotlib.pyplot as plt 
from datetime import datetime

pd.set_option('display.expand_frame_repr', False)
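
As a minimal sketch of what this option changes, the snippet below builds a small, made-up wide DataFrame (`wide_df` and its column names are illustrative, not part of the NHL data) and renders it inside `pd.option_context`, which applies the settings only within the `with` block. `display.max_columns` is also lifted here, as an assumption, so that no columns are elided.

```python
import pandas as pd

# Hypothetical wide frame: 12 long column names, 2 rows.
wide_df = pd.DataFrame(
    {f'long_column_name_{i:02d}': [i, i * 10] for i in range(12)}
)

# Apply the options temporarily; they revert when the block exits.
with pd.option_context('display.expand_frame_repr', False,
                       'display.max_columns', None):
    one_line_repr = repr(wide_df)

# With wrapping disabled, the first line holds every column name.
header_line = one_line_repr.splitlines()[0]
print(header_line)
```

Using `option_context` instead of `set_option` is only a convenience for experimenting; the notebook itself sets the option globally.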

Next, we use `get_player_on_ice_info` from the NHLget module to gather the play information from the NHL API and NHL Play by Play for one game: 2021020001. In a game ID, the first four digits are the year in which the season starts, and the next two digits denote the part of the season: 01 is the preseason, 02 is the regular season, 03 is the playoffs, and 04 is the All-Star Game. The final four digits are the game number; for playoff games, the game number also encodes the round. With 32 teams each playing 82 games (41 at home and 41 away), a regular season has 32 × 41 = 1312 games.
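
The ID layout can be sketched as a small parser. This is an illustration only: `parse_game_id` and `GAME_TYPES` are hypothetical names that are not part of NHLget, and the mapping assumes the commonly documented type codes (01 preseason, 02 regular season, 03 playoffs, 04 All-Star).

```python
# Assumed game-type codes; the labels are illustrative.
GAME_TYPES = {
    '01': 'preseason',
    '02': 'regular season',
    '03': 'playoffs',
    '04': 'all-star',
}

def parse_game_id(game_id: str) -> dict:
    """Split a 10-digit game ID into season, game type, and game number."""
    if len(game_id) != 10 or not game_id.isdigit():
        raise ValueError(f'expected a 10-digit game ID, got {game_id!r}')
    return {
        'season_start': int(game_id[:4]),
        'game_type': GAME_TYPES.get(game_id[4:6], 'unknown'),
        'game_number': int(game_id[6:]),
    }

parsed = parse_game_id('2021020001')
print(parsed)

# Each game involves two teams, so halve teams x games to avoid double counting.
regular_season_games = 32 * 82 // 2
print(regular_season_games)
```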

In [2]:
all_plays = NHLget.get_player_on_ice_info('2021020001')

The function returns a list of lists, where each inner list holds the information for one play from the game. Collectively they will be turned into a pandas dataframe. Each play has the following information, in this order: the play ID, the period, the player strength, the time elapsed in the period, the time remaining in the period, the play code, the play description, the visiting players on the ice, the home players on the ice, one to four players involved in the play, the date-time of the play, and the coordinates on the ice.

The play codes are:
- PGSTR for the pregame start
- PGEND for the pregame end
- PSTR for the period start
- PEND for the period end
- GEND for the game end
- FAC for a faceoff
- HIT for a hit
- STOP for a play stoppage
- SHOT for a shot on goal
- MISS for a missed shot
- GOAL for a goal
- TAKE for a takeaway
- GIVE for a giveaway
- BLOCK for a blocked shot
- PENL for a penalty

pandas would otherwise create its own default integer index, so we set `play_id` as the index for the dataframe instead.

In [3]:
column_names = [
  'play_id', 
  'period', 
  'strength', 
  'period_time',
  'remain_time', 
  'play_code', 
  'play_description', 
  'visit_player_on_ice', 
  'home_player_on_ice', 
  'player_one', 
  'player_two', 
  'player_three', 
  'player_four', 
  'date_time', 
  'coordinates'
]

game_df = pd.DataFrame(all_plays, columns = column_names)
game_df.set_index('play_id', inplace = True)

With the play information gathered and the dataframe created, it is now time to start exploring the data.

First, we'll use the `info()` method, which reports the shape of the dataframe (rows and columns), the number of non-null entries in each column, and the datatype of each column.

In [4]:
game_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 324 entries, 1 to 324
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   period               324 non-null    object
 1   strength             324 non-null    object
 2   period_time          324 non-null    object
 3   remain_time          324 non-null    object
 4   play_code            324 non-null    object
 5   play_description     324 non-null    object
 6   visit_player_on_ice  324 non-null    object
 7   home_player_on_ice   324 non-null    object
 8   player_one           169 non-null    object
 9   player_two           169 non-null    object
 10  player_three         169 non-null    object
 11  player_four          169 non-null    object
 12  date_time            169 non-null    object
 13  coordinates          169 non-null    object
dtypes: object(14)
memory usage: 38.0+ KB


Every column has the dtype `object`, which is plausible here because the values are strings and dictionaries. However, we need to look closer at the data to ensure that each column is stored in a suitable type and to make adjustments where needed.
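
One way to look closer at `object` columns is to count the Python types they actually hold. The sketch below uses a toy stand-in for `game_df`; `sample_df`, its values, and the `x`/`y` keys are made up for illustration.

```python
import pandas as pd

# Toy stand-in for game_df: strings, dicts, and a missing value.
sample_df = pd.DataFrame({
    'period': ['1', '1', '2'],
    'play_code': ['PSTR', 'FAC', 'SHOT'],
    'coordinates': [{}, {'x': 0.0, 'y': 0.0}, None],
})

# Count the Python type names found in each column.
type_summary = {
    col: sample_df[col].map(lambda v: type(v).__name__).value_counts().to_dict()
    for col in sample_df.columns
}
print(type_summary)
```

A column that reports a single type such as `str` is a candidate for conversion (for example, to an integer or a time), while a column with mixed types needs cleaning first.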

We can see what the individual rows look like using the `head()` method.

In [12]:
print(game_df.head())

        period strength period_time remain_time play_code                                   play_description                                visit_player_on_ice                                 home_player_on_ice   player_one     player_two player_three player_four             date_time           coordinates
play_id                                                                                                                                                                                                                                                                                                            
1            1                00:00       20:00     PGSTR                                                                                                    {}                                                 {}                                                       2021-10-12T22:45:48Z                    {}
2            1                00:00       20:00     PGEND                   

The columns 'period_time' and 'remain_time' are redundant: within a 20-minute period, each one can be derived from the other, so one of them can be removed.
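
Before dropping a column, the redundancy can be checked rather than assumed. The following is a minimal sketch on toy rows; `clock_to_seconds` and `sample_df` are hypothetical, and the check assumes every period is 20 minutes long, which would not hold for a shorter overtime period.

```python
import pandas as pd

def clock_to_seconds(clock: str) -> int:
    """Convert an 'MM:SS' clock string to seconds."""
    minutes, seconds = clock.split(':')
    return int(minutes) * 60 + int(seconds)

# Toy rows in the same 'MM:SS' format as the real columns.
sample_df = pd.DataFrame({
    'period_time': ['00:00', '07:35', '20:00'],
    'remain_time': ['20:00', '12:25', '00:00'],
})

elapsed = sample_df['period_time'].map(clock_to_seconds)
remaining = sample_df['remain_time'].map(clock_to_seconds)

# Redundant only if the two always sum to the 20-minute period length.
is_redundant = bool(((elapsed + remaining) == 20 * 60).all())
print(is_redundant)

if is_redundant:
    sample_df = sample_df.drop(columns='remain_time')
```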

In [5]:
game_df.describe(include='all')

| | period | strength | period_time | remain_time | play_code | play_description | visit_player_on_ice | home_player_on_ice | player_one | player_two | player_three | player_four | date_time | coordinates |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 324 | 324 | 324 | 324 | 324 | 324 | 324 | 324 | 169.0 | 169.0 | 169.0 | 169.0 | 169 | 169 |
| unique | 3 | 4 | 216 | 216 | 16 | 261 | 40 | 62 | 35.0 | 33.0 | 5.0 | 3.0 | 166 | 105 |
| top | 1 | EV | 00:00 | 20:00 | FAC | ICING | {'C': '77', 'R': '17', 'L': '43', 'D1': '8', '... | {'C': '21', 'R': '86', 'L': '18', 'D1': '27', ... | | | | | 2021-10-12T23:44:06Z | {} |
| freq | 119 | 248 | 9 | 9 | 69 | 15 | 55 | 29 | 31.0 | 41.0 | 165.0 | 166.0 | 2 | 31 |


It looks like there are 16 unique play codes, and faceoffs occur most often. The list above contains only 15 play codes, so why are there 16? Let's take a closer look at the different values within 'play_code' and their counts.

In [6]:
print(game_df['play_code'].value_counts())

FAC       69
HIT       59
STOP      56
SHOT      55
BLOCK     26
MISS      24
TAKE       8
GOAL       8
GIVE       7
PSTR       3
PEND       3
PENL       2
PGSTR      1
PGEND      1
ANTHEM     1
GEND       1
Name: play_code, dtype: int64


It appears that we left ANTHEM off our list of play codes. It makes sense that the pregame start (PGSTR), pregame end (PGEND), anthem (ANTHEM), and game end (GEND) each happen only once. It also makes sense that a game ending in regulation time has exactly three period start (PSTR) and three period end (PEND) plays.
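
A comparison like this can also be done programmatically with a set difference. In the sketch below, `documented_codes` holds the 15 codes from the list earlier in this notebook, and `observed` is a short toy stand-in for `game_df['play_code']`.

```python
import pandas as pd

# The 15 play codes listed earlier in the notebook.
documented_codes = {
    'PGSTR', 'PGEND', 'PSTR', 'PEND', 'GEND', 'FAC', 'HIT', 'STOP',
    'SHOT', 'MISS', 'GOAL', 'TAKE', 'GIVE', 'BLOCK', 'PENL',
}

# Toy stand-in for game_df['play_code'].
observed = pd.Series(
    ['PGSTR', 'PGEND', 'ANTHEM', 'PSTR', 'FAC', 'HIT'], name='play_code'
)

# Codes that appear in the data but not in the documented list.
undocumented = sorted(set(observed.unique()) - documented_codes)
print(undocumented)
```

Running the same check against each new game would flag any further codes the list is missing.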

From the list, we see that hits (HIT), stoppages (STOP), and shots on goal (SHOT) occur at similar rates, not far behind faceoffs. These are followed by blocked shots (BLOCK) and missed shots (MISS). Takeaways (TAKE) and giveaways (GIVE) were nearly equal. Finally, eight goals were scored in this game and only two penalties were called.
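
To put numbers on "similar rates", the counts can be converted to percentage shares. The sketch below re-enters the counts printed above by hand as `play_counts`; in the notebook itself, `game_df['play_code'].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
play_counts = pd.Series({
    'FAC': 69, 'HIT': 59, 'STOP': 56, 'SHOT': 55, 'BLOCK': 26, 'MISS': 24,
    'TAKE': 8, 'GOAL': 8, 'GIVE': 7, 'PSTR': 3, 'PEND': 3, 'PENL': 2,
    'PGSTR': 1, 'PGEND': 1, 'ANTHEM': 1, 'GEND': 1,
}, name='play_code')

# Share of all plays, in percent, rounded to one decimal place.
shares = (play_counts / play_counts.sum() * 100).round(1)
print(shares.head(4))
```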

Let's see if we can find the average real-time length of a period. First, let's look at the 'date_time' entries for all the period start and period end rows.

In [7]:
print(game_df[(game_df['play_code'] == 'PSTR') | (game_df['play_code'] == 'PEND')]['date_time'])

play_id
4      2021-10-12T23:44:06Z
119    2021-10-13T00:21:07Z
120    2021-10-13T00:40:21Z
212    2021-10-13T01:16:55Z
213    2021-10-13T01:35:37Z
323    2021-10-13T02:17:04Z
Name: date_time, dtype: object


The timestamps appear to be in the format YYYY-MM-DDTHH:MM:SSZ, where T separates the date from the time and Z indicates that the time is in UTC. We can start by parsing the start and end of the third period into datetime objects and taking the difference.

In [8]:

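# Select the period start/end rows by play code rather than by position:
# iloc is zero-based, so iloc[323] is the row with play_id 324, not 323.
period_marks = game_df[game_df['play_code'].isin(['PSTR', 'PEND'])]['date_time']

# 'Z' marks UTC, so it maps to an offset of +0000.
third_start = datetime.strptime(period_marks.iloc[-2].replace('Z', '+0000'), '%Y-%m-%dT%H:%M:%S%z')
third_end = datetime.strptime(period_marks.iloc[-1].replace('Z', '+0000'), '%Y-%m-%dT%H:%M:%S%z')

print(third_start)
print(third_end)

print(third_end - third_start)

2021-10-13 01:35:37+00:00
2021-10-13 02:17:04+00:00
0:41:27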
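
The same idea extends to all three periods at once. As a hedged sketch, the snippet below re-enters the six period start and end timestamps shown earlier as a standalone Series (`stamps` is a stand-in for the filtered `date_time` column) and relies on `pd.to_datetime` with `utc=True` to parse the trailing Z. It assumes the rows alternate strictly between start and end.

```python
import pandas as pd

# PSTR/PEND timestamps copied from the output above, indexed by play_id.
stamps = pd.Series(
    ['2021-10-12T23:44:06Z', '2021-10-13T00:21:07Z',
     '2021-10-13T00:40:21Z', '2021-10-13T01:16:55Z',
     '2021-10-13T01:35:37Z', '2021-10-13T02:17:04Z'],
    index=[4, 119, 120, 212, 213, 323], name='date_time',
)

times = pd.to_datetime(stamps, utc=True)

# Even positions are period starts, odd positions are period ends.
starts = times.iloc[0::2].reset_index(drop=True)
ends = times.iloc[1::2].reset_index(drop=True)

durations = ends - starts
durations.index = [1, 2, 3]  # label each duration by period number
print(durations)
print(durations.mean())
```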


