## NHL Advanced Stats

Over the last 15 to 20 years, statistics and probability have fundamentally changed the world of professional sports. This statistical 'revolution' is romanticized in the movie "Moneyball," based on Michael Lewis' book of the same name. In the film, set in 2002, the General Manager of MLB's Oakland Athletics, Billy Beane (played by Brad Pitt), teams up with Peter Brand (played by Jonah Hill), a fictitious Yale economics graduate based on the real-life assistant GM at the time, Paul DePodesta, to radically alter how the Oakland A's assess player value. To compete with MLB teams that had much larger player payrolls, Beane and DePodesta had to find players who were undervalued by the conventional wisdom of baseball's scouts, managers, and coaches. As such, they had to apply a new method of evaluating player value.  
  
They set aside the commonly followed baseball stats and the intuition of their scouts, and started to use sabermetrics. Roughly defined, sabermetrics is the empirical analysis of in-game activity. Through advanced statistical analysis, certain indicators (on-base percentage, slugging percentage, etc.) were determined to be better predictors of offensive success than the 'standard' stats (batting average, stolen bases, etc.). Focusing on these stats, Beane and his staff were able to acquire players who were atypical and undervalued. This approach helped the Oakland A's put together a 20-game win streak and compete successfully with franchises that had significantly larger payrolls (Yankees, Red Sox, etc.). Since then, sabermetrics has become a pillar of valuing players in Major League Baseball.
    
Just as Beane and DePodesta changed the landscape of evaluating professional baseball players, a similar 'statistical renaissance' is just beginning to occur in the National Hockey League.

As of writing this, I haven't entirely decided on what my main question will be, but here are some ideas for what I'm aiming for:
- What stats most heavily contribute to a winning NHL franchise?
- Using advanced stats, how can we determine what drives value in an NHL forward?
- Relative to their salary, which forwards are the most valuable to their team?

In [None]:
# To do still:
# Pull team wins or points per season
# Creating predictive model on player 'value' before the years that the data for
# Create some type of value basis on a per-dollar amount (per point, per goal) for the years that I do have salary data for

# Imports

In [1]:
# Run all imports
import pandas as pd
import numpy as np
import requests
import time
import regex as re
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup

%matplotlib inline

# Data Collection

## NHL Salary Cap

The NHL has a 'hard' salary cap, meaning each team can only spend up to that cap amount on its active roster (at most 23 players), with no more than 50 players under contract in total.  
  
As one of my metrics for understanding player value will be value added vs cap hit, I'll be using the NHL salary cap to gauge how much each player is making as a percentage of their team's total cap.  

As I was only able to find salary information going back to the 2011-2012 season, and the cap changes by a different amount every year, I'll be manually entering the salary cap for each year. In case I need it in different formats, I'll enter it twice: once keyed by the full season label (2011-2012), and once keyed by the starting year alone (so 2011-2012 becomes just 2011). There's certainly a better way to do this, but it's moot, as writing both out doesn't take much time at all.
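
As a rough sketch of what a less manual version might look like: the start-year dictionary can be derived from the season-label one, and either can be used to express a cap hit as a share of that season's cap. The helper name `cap_share` and the $6,000,000 cap hit are made up for illustration; they are not real salary data.

```python
# Season-label caps (same values as the notebook's salary_cap_y2y)
salary_cap_y2y = {"2011-2012": 64300000,
                  "2012-2013": 70200000,
                  "2013-2014": 64300000,
                  "2014-2015": 69000000,
                  "2015-2016": 71400000,
                  "2016-2017": 73000000,
                  "2017-2018": 75000000}

# Derive the start-year version: "2011-2012" -> "2011"
salary_cap_year = {season[:4]: cap for season, cap in salary_cap_y2y.items()}


def cap_share(cap_hit, season):
    """Return a cap hit as a percentage of that season's salary cap."""
    return 100 * cap_hit / salary_cap_y2y[season]


# Hypothetical $6M cap hit in 2017-2018
print(round(cap_share(6000000, "2017-2018"), 1))  # → 8.0
```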

In [39]:
salary_cap_y2y = {"2011-2012":64300000,
                  "2012-2013":70200000,
                  "2013-2014":64300000,
                  "2014-2015":69000000,
                  "2015-2016":71400000,
                  "2016-2017":73000000,
                  "2017-2018":75000000}

salary_cap_year = {"2011":64300000,
                  "2012":70200000,
                  "2013":64300000,
                  "2014":69000000,
                  "2015":71400000,
                  "2016":73000000,
                  "2017":75000000}

## Scraper Code - Part I - Player Performance Data

**Note: After further research, I ended up finding a better, more detailed site that included extra data points.** This site (http://www.corsica.hockey/) allows its users to export the data to .csv, rendering this loop, and the data it pulled, redundant. However, since I put a good deal of effort into getting this loop to run properly, I'm going to leave it here to show my work.

I have to pull player salary from a different source than player performance data, so there will be two separate data pulls, and therefore two loops.

The next dozen cells or so are aimed at testing out functionality before creating a loop.

### Relevant URLs that I'll be pulling from.

First page:

https://www.hockey-reference.com/play-index/ppbp_finder.cgi?c2stat=&c4stat=&c2comp=&order_by_asc=&game_location=&c1comp=&year_min=2008&request=1&franch_id=&birth_country=&match=single&year_max=2018&c3comp=&report=ppbp&season_end=-1&c3stat=&order_by=player&season_start=1&c1val=&c3val=&c2val=&handed=&rookie=N&pos=S&describe_only=&c1stat=&situation_id=ev&c4val=&age_min=0&age_max=99&c4comp=&offset=0

Second page:

https://www.hockey-reference.com/play-index/ppbp_finder.cgi?c2stat=&c4stat=&c2comp=&order_by_asc=&game_location=&c1comp=&year_min=2008&request=1&franch_id=&birth_country=&match=single&year_max=2018&c3comp=&report=ppbp&season_end=-1&c3stat=&order_by=player&season_start=1&c1val=&c3val=&c2val=&handed=&rookie=N&pos=S&describe_only=&c1stat=&situation_id=ev&c4val=&age_min=0&age_max=99&c4comp=&offset=100

Notes:
- There are 100 rows on each page, and the `offset` parameter at the end of the URL increases by 100 per page.
- I ran through all possible pages to figure out where the iteration would have to end. The last row is number 9684, so the last offset I'll request is 9600, which should cover everything up to row 9700.
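
The offsets implied by these notes can be computed rather than counted by hand. A minimal sketch, assuming the 9684-row total noted above and 100 rows per page (the variable names are my own):

```python
import math

total_rows = 9684   # row count found by paging through the site
page_size = 100     # rows per results page

# Number of pages needed to cover every row, rounding up for the partial last page
n_pages = math.ceil(total_rows / page_size)

# One offset per page: 0, 100, ..., 9600
offsets = [page * page_size for page in range(n_pages)]

print(n_pages, offsets[0], offsets[-1])  # → 97 0 9600
```

This is the same sequence that `range(0, 9700, 100)` produces.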

In [2]:
# So first off, I'm going to write code that pulls just one page, and make sure that works.
# Once that's done, I'm going to put in the pull for the second page, to make sure I understand how to combine the two
# Then I'll write a loop based off of those two pulls to get the remaining data
# First page URL from above:
url = 'https://www.hockey-reference.com/play-index/ppbp_finder.cgi?c2stat=&c4stat=&c2comp=&order_by_asc=&game_location=&c1comp=&year_min=2008&request=1&franch_id=&birth_country=&match=single&year_max=2018&c3comp=&report=ppbp&season_end=-1&c3stat=&order_by=player&season_start=1&c1val=&c3val=&c2val=&handed=&rookie=N&pos=S&describe_only=&c1stat=&situation_id=ev&c4val=&age_min=0&age_max=99&c4comp=&offset=0'

# Second page:
url2 = 'https://www.hockey-reference.com/play-index/ppbp_finder.cgi?c2stat=&c4stat=&c2comp=&order_by_asc=&game_location=&c1comp=&year_min=2008&request=1&franch_id=&birth_country=&match=single&year_max=2018&c3comp=&report=ppbp&season_end=-1&c3stat=&order_by=player&season_start=1&c1val=&c3val=&c2val=&handed=&rookie=N&pos=S&describe_only=&c1stat=&situation_id=ev&c4val=&age_min=0&age_max=99&c4comp=&offset=100'

### Pull request, status code check, and transform into BeautifulSoup

In [3]:
# Get request
res = requests.get(url)
# res2 = requests.get(url2)

In [4]:
# Confirm we got a successful response code
res.status_code

200
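
Checking `status_code` by eye works for a single request, but a loop over many pages needs that check in code. Below is a rough sketch of a retry wrapper. `get_with_retry`, its parameters, and the wait schedule are my own assumptions rather than anything the site documents. The fetch function is passed in as an argument, so `requests.get` can be supplied directly.

```python
import time


def get_with_retry(fetch, url, retries=3, wait=3.0):
    """Call fetch(url) until it returns a 200 response or retries run out.

    fetch is any callable returning an object with a .status_code attribute,
    for example requests.get. retries must be at least 1.
    Hypothetical helper, for illustration only.
    """
    last_error = None
    for attempt in range(retries):
        try:
            res = fetch(url)
            if res.status_code == 200:
                return res
            last_error = RuntimeError(f"status {res.status_code} for {url}")
        except Exception as err:  # e.g. a timeout or connection error
            last_error = err
        # Wait a little longer after each failed attempt, except the last
        if attempt < retries - 1:
            time.sleep(wait * (attempt + 1))
    raise last_error
```

In this notebook it could be called as `res = get_with_retry(requests.get, url)`.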

In [5]:
# Setup my soup object to parse out the data
soup = BeautifulSoup(res.content, 'lxml')
# soup2 = BeautifulSoup(res2.content, 'lxml')

In [6]:
# Take a brief look to make sure it pulled correctly
soup.text



In [7]:
# This find_all is exploratory, to understand how it's pulling, and how I can break it down further
soup.find_all('td', {'class':'left'})

[<td class="left " csk="Zyuzin,Andrei" data-append-csv="zyuzian01" data-stat="player"><a href="/players/z/zyuzian01.html">Andrei Zyuzin</a></td>,
 <td class="left " data-stat="team_id"><a href="/teams/CHI/2008.html">CHI</a></td>,
 <td class="left " data-stat="season">2007-08</td>,
 <td class="left " csk="Zykov,Valentin" data-append-csv="zykovva01" data-stat="player"><strong><a href="/players/z/zykovva01.html">Valentin Zykov</a></strong></td>,
 <td class="left " data-stat="team_id"><a href="/teams/CAR/2018.html">CAR</a></td>,
 <td class="left " data-stat="season">2017-18</td>,
 <td class="left " csk="Zykov,Valentin" data-append-csv="zykovva01" data-stat="player"><strong><a href="/players/z/zykovva01.html">Valentin Zykov</a></strong></td>,
 <td class="left " data-stat="team_id"><a href="/teams/CAR/2017.html">CAR</a></td>,
 <td class="left " data-stat="season">2016-17</td>,
 <td class="left " csk="Zucker,Jason" data-append-csv="zuckeja01" data-stat="player"><strong><a href="/players/z/zuc

In [8]:
# These are the individual scrapes, which I'll aggregate to loop and pull
# Creating them and printing them to make sure they work
player_name = soup.find_all('td', {'class':'left', 'data-stat':'player'})
pos = soup.find_all('td', {'class':'center', 'data-stat':'pos'})
team_id = soup.find_all('td', {'class':'left', 'data-stat':'team_id'})
season = soup.find_all('td', {'class':'left', 'data-stat':'season'})
games_played = soup.find_all('td', {'class':'right', 'data-stat':'games_played'})
goals = soup.find_all('td', {'class':'right', 'data-stat':'goals'})
assists = soup.find_all('td', {'class':'right', 'data-stat':'assists'})
points = soup.find_all('td', {'class':'right', 'data-stat':'points'})
corsi_for = soup.find_all('td', {'class':'right', 'data-stat':'corsi_for'})
corsi_against = soup.find_all('td', {'class':'right', 'data-stat':'corsi_against'})
corsi_pct = soup.find_all('td', {'class':'right', 'data-stat':'corsi_pct'})
corsi_rel_pct = soup.find_all('td', {'class':'right', 'data-stat':'corsi_rel_pct'})
corsi_per_60 = soup.find_all('td', {'class':'right', 'data-stat':'corsi_per_60'})
corsi_rel_per_60 = soup.find_all('td', {'class':'right', 'data-stat':'corsi_rel_per_60'})
fenwick_for = soup.find_all('td', {'class':'right', 'data-stat':'fenwick_for'})
fenwick_against = soup.find_all('td', {'class':'right', 'data-stat':'fenwick_against'})
fenwick_pct = soup.find_all('td', {'class':'right', 'data-stat':'fenwick_pct'})
fenwick_rel_pct = soup.find_all('td', {'class':'right', 'data-stat':'fenwick_rel_pct'})
on_ice_shot_pct = soup.find_all('td', {'class':'right', 'data-stat':'on_ice_shot_pct'})
on_ice_sv_pct = soup.find_all('td', {'class':'right', 'data-stat':'on_ice_sv_pct'})
pdo = soup.find_all('td', {'class':'right', 'data-stat':'pdo'})
zs_offense_pct = soup.find_all('td', {'class':'right', 'data-stat':'zs_offense_pct'})
zs_defense_pct = soup.find_all('td', {'class':'right', 'data-stat':'zs_defense_pct'})
toi_pbp_avg = soup.find_all('td', {'class':'right', 'data-stat':'toi_pbp_avg'})
faceoff_wins = soup.find_all('td', {'class':'right', 'data-stat':'faceoff_wins'})
faceoff_losses = soup.find_all('td', {'class':'right', 'data-stat':'faceoff_losses'})
faceoff_percentage = soup.find_all('td', {'class':'center', 'data-stat':'faceoff_percentage'})
hits = soup.find_all('td', {'class':'right', 'data-stat':'hits'})
blocks = soup.find_all('td', {'class':'right', 'data-stat':'blocks'})
takeaways = soup.find_all('td', {'class':'right', 'data-stat':'takeaways'})
giveaways = soup.find_all('td', {'class':'right', 'data-stat':'giveaways'})
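
Since every cell carries a `data-stat` attribute, another way to pull the same fields might be to walk the table row by row and key each cell by that attribute, rather than running one `find_all` per column. The sketch below runs on a tiny hand-written HTML snippet, made up to mirror the markup shown above, and uses the built-in `html.parser` so it does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Made-up two-row table in the same shape as the site's markup
html = '''
<table>
<tr><td class="left " data-stat="player"><a href="#">Andrei Zyuzin</a></td>
    <td class="left " data-stat="season">2007-08</td>
    <td class="right " data-stat="goals">1</td></tr>
<tr><td class="left " data-stat="player"><a href="#">Valentin Zykov</a></td>
    <td class="left " data-stat="season">2017-18</td>
    <td class="right " data-stat="goals">3</td></tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')

rows = []
for tr in soup.find_all('tr'):
    # Map each cell's data-stat name to its text
    cells = tr.find_all('td', attrs={'data-stat': True})
    row = {td['data-stat']: td.text for td in cells}
    if row:  # skip header or spacer rows that have no data cells
        rows.append(row)

print(rows[0])
```

Because each row is parsed as a unit, a missing cell in one row cannot shift the values of the rows after it, which is a risk when separate per-column lists are lined up by position.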

In [9]:
# This cell to make sure each of my variables got pulled in correctly
# And that I can pull the data out as expected
# Print player 1's stats essentially
print(f'''
{player_name[0].text}
{pos[0].text}
{team_id[0].text}
{season[0].text}
{games_played[0].text}
{goals[0].text}
{assists[0].text}
{points[0].text}
{corsi_for[0].text}
{corsi_against[0].text}
{corsi_pct[0].text}
{corsi_rel_pct[0].text}
{corsi_per_60[0].text}
{corsi_rel_per_60[0].text}
{fenwick_for[0].text}
{fenwick_against[0].text}
{fenwick_pct[0].text}
{fenwick_rel_pct[0].text}
{on_ice_shot_pct[0].text}
{on_ice_sv_pct[0].text}
{pdo[0].text}
{zs_offense_pct[0].text}
{zs_defense_pct[0].text}
{toi_pbp_avg[0].text}
{faceoff_wins[0].text}
{faceoff_losses[0].text}
{faceoff_percentage[0].text}
{hits[0].text}
{blocks[0].text}
{takeaways[0].text}
{giveaways[0].text}
''')


Andrei Zyuzin
D
CHI
2007-08
32
1
2
3
275
312
46.8
-7.0
-6.0
-9.3
207
237
46.6
-6.5
6.4
88.9
95.3
45.8
54.2
11.5
0
0

24
29
4
7



In [10]:
# This cell is to test putting together a dataframe from many lists
df_test_player1 = pd.DataFrame(
    {'player_name': player_name[0].text,
    'pos': pos[0].text, 
    'team_id': team_id[0].text, 
    'season': season[0].text, 
    'games_played': games_played[0].text, 
    'goals': goals[0].text, 
    'assists': assists[0].text, 
    'points': points[0].text, 
    'corsi_for': corsi_for[0].text, 
    'corsi_against': corsi_against[0].text, 
    'corsi_pct': corsi_pct[0].text, 
    'corsi_rel_pct': corsi_rel_pct[0].text, 
    'corsi_per_60': corsi_per_60[0].text, 
    'corsi_rel_per_60': corsi_rel_per_60[0].text, 
    'fenwick_for': fenwick_for[0].text, 
    'fenwick_against': fenwick_against[0].text, 
    'fenwick_pct': fenwick_pct[0].text, 
    'fenwick_rel_pct': fenwick_rel_pct[0].text, 
    'on_ice_shot_pct': on_ice_shot_pct[0].text, 
    'on_ice_sv_pct': on_ice_sv_pct[0].text, 
    'pdo': pdo[0].text, 
    'zs_offense_pct': zs_offense_pct[0].text, 
    'zs_defense_pct': zs_defense_pct[0].text, 
    'toi_pbp_avg': toi_pbp_avg[0].text, 
    'faceoff_wins': faceoff_wins[0].text, 
    'faceoff_losses': faceoff_losses[0].text, 
    'faceoff_percentage': faceoff_percentage[0].text, 
    'hits': hits[0].text, 
    'blocks': blocks[0].text, 
    'takeaways': takeaways[0].text, 
    'giveaways': giveaways[0].text}, index=[0])

In [11]:
# Take a look at the df, compare it to the website to make sure everything lines up correctly
df_test_player1.T

Unnamed: 0,0
player_name,Andrei Zyuzin
pos,D
team_id,CHI
season,2007-08
games_played,32
goals,1
assists,2
points,3
corsi_for,275
corsi_against,312


In [12]:
# Keeping this here for a visualization of how zip works,
# and how I might look through to aggregate my future lists together in a dataframe
for i, j in zip(player_name, season):
    print(i.text, j.text)

Andrei Zyuzin 2007-08
Valentin Zykov 2017-18
Valentin Zykov 2016-17
Jason Zucker 2013-14
Jason Zucker 2017-18
Jason Zucker 2012-13
Jason Zucker 2016-17
Jason Zucker 2011-12
Jason Zucker 2015-16
Jason Zucker 2014-15
Mats Zuccarello 2012-13
Mats Zuccarello 2016-17
Mats Zuccarello 2011-12
Mats Zuccarello 2015-16
Mats Zuccarello 2010-11
Mats Zuccarello 2014-15
Mats Zuccarello 2013-14
Mats Zuccarello 2017-18
Dainius Zubrus 2010-11
Dainius Zubrus 2014-15
Dainius Zubrus 2009-10
Dainius Zubrus 2013-14
Dainius Zubrus 2008-09
Dainius Zubrus 2012-13
Dainius Zubrus 2007-08
Dainius Zubrus 2011-12
Dainius Zubrus 2015-16
Sergei Zubov 2008-09
Sergei Zubov 2007-08
Ilya Zubov 2008-09
Ilya Zubov 2007-08
Andrei Zubarev 2010-11
Harry Zolnierczyk 2013-14
Harry Zolnierczyk 2016-17
Harry Zolnierczyk 2012-13
Harry Zolnierczyk 2011-12
Harry Zolnierczyk 2015-16
Harry Zolnierczyk 2014-15
Mike Zigomanis 2008-09
Mike Zigomanis 2007-08
Mike Zigomanis 2010-11
Marek Zidlicky 2009-10
Marek Zidlicky 2013-14
Marek Zidlic
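
The same `zip` idea extends to any number of columns: zipping the per-column lists together yields one tuple per player-season, which can then be paired with the column names. A small sketch, with plain lists standing in for the scraped cells (values taken from the outputs in this notebook):

```python
columns = ['player_name', 'season', 'goals']

# Plain lists standing in for the .text of each find_all result
player_name = ['Andrei Zyuzin', 'Valentin Zykov', 'Valentin Zykov']
season = ['2007-08', '2017-18', '2016-17']
goals = ['1', '3', '1']

# zip walks the lists in parallel, producing one tuple of values per row
records = [dict(zip(columns, values)) for values in zip(player_name, season, goals)]

print(records[1])
```

A list of dicts like `records` can be passed straight to `pd.DataFrame(records)`.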

In [13]:
goals = soup.find_all('td', {'class':'right', 'data-stat':'goals'})

In [14]:
# Testing the url + next_get portion of the request pull
next_get = str(100)
url = 'https://www.hockey-reference.com/play-index/ppbp_finder.cgi?c2stat=&c4stat=&c2comp=&order_by_asc=&game_location=&c1comp=&year_min=2008&request=1&franch_id=&birth_country=&match=single&year_max=2018&c3comp=&report=ppbp&season_end=-1&c3stat=&order_by=player&season_start=1&c1val=&c3val=&c2val=&handed=&rookie=N&pos=S&describe_only=&c1stat=&situation_id=ev&c4val=&age_min=0&age_max=99&c4comp=&offset='

res = requests.get(url+next_get)

In [15]:
# Testing creation of DF
df_puck = pd.DataFrame([], columns=['player_name', 'pos', 'team_id', 'season', 'games_played', 'goals', 'assists', 'points', 'corsi_for', 'corsi_against', 'corsi_pct', 'corsi_rel_pct', 'corsi_per_60', 'corsi_rel_per_60', 'fenwick_for', 'fenwick_against', 'fenwick_pct', 'fenwick_rel_pct', 'on_ice_shot_pct', 'on_ice_sv_pct', 'pdo', 'zs_offense_pct', 'zs_defense_pct', 'toi_pbp_avg', 'faceoff_wins', 'faceoff_losses', 'faceoff_percentage', 'hits', 'blocks', 'takeaways', 'giveaways'])

In [16]:
# Testing appending (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
df_puck = pd.concat([df_puck, df_test_player1])

In [17]:
df_puck

Unnamed: 0,player_name,pos,team_id,season,games_played,goals,assists,points,corsi_for,corsi_against,...,zs_offense_pct,zs_defense_pct,toi_pbp_avg,faceoff_wins,faceoff_losses,faceoff_percentage,hits,blocks,takeaways,giveaways
0,Andrei Zyuzin,D,CHI,2007-08,32,1,2,3,275,312,...,45.8,54.2,11.5,0,0,,24,29,4,7


In [19]:
# Scraper loop
# Original URL
url = 'https://www.hockey-reference.com/play-index/ppbp_finder.cgi?c2stat=&c4stat=&c2comp=&order_by_asc=&game_location=&c1comp=&year_min=2008&request=1&franch_id=&birth_country=&match=single&year_max=2018&c3comp=&report=ppbp&season_end=-1&c3stat=&order_by=player&season_start=1&c1val=&c3val=&c2val=&handed=&rookie=N&pos=S&describe_only=&c1stat=&situation_id=ev&c4val=&age_min=0&age_max=99&c4comp=&offset='
df_puck = pd.DataFrame([], columns=['player_name', 'pos', 'team_id', 'season', 'games_played', 'goals', 'assists', 'points', 'corsi_for', 'corsi_against', 'corsi_pct', 'corsi_rel_pct', 'corsi_per_60', 'corsi_rel_per_60', 'fenwick_for', 'fenwick_against', 'fenwick_pct', 'fenwick_rel_pct', 'on_ice_shot_pct', 'on_ice_sv_pct', 'pdo', 'zs_offense_pct', 'zs_defense_pct', 'toi_pbp_avg', 'faceoff_wins', 'faceoff_losses', 'faceoff_percentage', 'hits', 'blocks', 'takeaways', 'giveaways'])

# Cells that are not right-aligned; every other stat uses class 'right'
td_class = {'player_name': 'left', 'team_id': 'left', 'season': 'left',
            'pos': 'center', 'faceoff_percentage': 'center'}

columns = list(df_puck.columns)
pages = []

# See logic above for why I chose these numbers
for i in range(0, 9700, 100):

    # Request the page at this offset and create the bs4 object
    res = requests.get(url + str(i))
    soup = BeautifulSoup(res.content, 'lxml')

    # Break the soup down via find_all: one list of cell texts per column
    # (the player_name column is tagged data-stat="player" on the site)
    page = {}
    for col in columns:
        data_stat = 'player' if col == 'player_name' else col
        cells = soup.find_all('td', {'class': td_class.get(col, 'right'), 'data-stat': data_stat})
        page[col] = [cell.text for cell in cells]

    # Add this page's rows in one step
    # (DataFrame.append was removed in pandas 2.0, so use pd.concat instead)
    pages.append(pd.DataFrame(page, columns=columns))
    df_puck = pd.concat(pages, ignore_index=True)

    # Save progress after every page so a failed request doesn't lose everything
    df_puck.to_csv('hockey_data.csv', index=True)

    # Kept getting timeout errors, so added a sleep to offset
    time.sleep(3)

In [33]:
df_puck['year'] = df_puck['season'].apply(lambda x: int(x[:4]))

In [36]:
df_puck[df_puck['year'] >= 2014].shape

(3558, 32)
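
With a numeric `year` column in place, the salary cap for each row can be looked up from a start-year dictionary. Below is a sketch on a three-row stand-in for `df_puck`. The column name `salary_cap` is my own, and only a subset of the cap values is repeated here. Seasons before 2011 have no cap entry and come out as NaN.

```python
import pandas as pd

# Subset of the start-year salary caps entered earlier in the notebook
salary_cap_year = {"2011": 64300000, "2014": 69000000, "2017": 75000000}

# Small stand-in for df_puck
df = pd.DataFrame({'player_name': ['Andrei Zyuzin', 'Jason Zucker', 'Jason Zucker'],
                   'season': ['2007-08', '2014-15', '2017-18']})

# The first four characters of the season label are the starting year
df['year'] = df['season'].str[:4].astype(int)

# The dict is keyed by strings, so look up on the string form of the year
df['salary_cap'] = df['year'].astype(str).map(salary_cap_year)

print(df)
```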

In [34]:
#df_puck['year'] = pd.to_datetime(df_puck['year'])

In [37]:
df_puck.head()

Unnamed: 0,player_name,pos,team_id,season,games_played,goals,assists,points,corsi_for,corsi_against,...,zs_defense_pct,toi_pbp_avg,faceoff_wins,faceoff_losses,faceoff_percentage,hits,blocks,takeaways,giveaways,year
0,Andrei Zyuzin,D,CHI,2007-08,32,1,2,3,275,312,...,54.2,11.5,0,0,,24,29,4,7,2007
1,Valentin Zykov,LW,CAR,2017-18,10,3,4,7,132,90,...,30.5,12.7,0,1,0.0,3,1,5,4,2017
2,Valentin Zykov,LW,CAR,2016-17,2,1,0,1,17,9,...,55.6,5.2,0,0,,0,0,0,1,2016
3,Jason Zucker,LW,MIN,2013-14,21,3,0,3,198,259,...,42.7,12.0,0,0,,16,10,8,9,2013
4,Jason Zucker,LW,MIN,2017-18,82,25,21,46,1137,1134,...,55.9,14.2,4,9,30.8,78,46,46,29,2017
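
Everything scraped so far is text, and some cells (such as `faceoff_percentage` for players who took no faceoffs) are empty. Before any modelling, the stat columns would need to be converted to numbers. A sketch on a stand-in frame, using `pd.to_numeric` with `errors='coerce'` so that blanks become NaN:

```python
import pandas as pd

# Stand-in for a few scraped columns; values are strings, as returned by .text
df = pd.DataFrame({'player_name': ['Andrei Zyuzin', 'Valentin Zykov', 'Jason Zucker'],
                   'goals': ['1', '3', '25'],
                   'faceoff_percentage': ['', '0.0', '30.8']})

stat_cols = ['goals', 'faceoff_percentage']

# Convert each stat column; anything unparseable (here the empty string) becomes NaN
df[stat_cols] = df[stat_cols].apply(pd.to_numeric, errors='coerce')

print(df.dtypes)
```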


In [None]:
# Pull team wins or points per season
# Pull salary cap for each year
# Creating predictive model on player 'value' before the years that the data for
# Create some type of value basis on a per-dollar amount (per point, per goal) for the years that I do have salary data for