The first code cell imports pandas, NumPy, glob, and os, along with the eliteprospect_scraper module from the eliteprospect package.

The eliteprospect_scraper module provides functions for scraping data from the Eliteprospects website, which is a database of hockey player statistics and information.

In [1]:
import pandas as pd
import numpy as np
import glob
import os

import eliteprospect.eliteprospect_scraper as ep


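This cell scrapes NHL player data for each season from 2000-01 through 2020-21 and combines it with head-injury records. Each scraped season is paired with the injury records from the following season (for example, 2000-01 statistics with 2001/02 injuries). The injury data is first narrowed to head-related injury types (concussion, facial, ear, eye, nose, jaw, migraine, and mouth), which are relabeled as "Head", and is then left-joined to the player data on player name. Players with no matching injury record get 0 for games missed and for the head_injuries flag. The combined data for each season is saved as its own CSV file.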
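In the cell below, each scraped season label (such as '2000-01') is zipped with the injury season one year later ('2001/02'), and both lists are typed out by hand. As a minimal sketch, the second list could instead be derived from the first; the helper name following_injury_season is hypothetical and not part of the original notebook.

```python
def following_injury_season(season: str) -> str:
    """Map a label like '2000-01' to the next season in 'YYYY/YY' form (hypothetical helper)."""
    start_year = int(season.split('-')[0]) + 1
    end_suffix = (start_year + 1) % 100
    return f"{start_year}/{end_suffix:02d}"


# Rebuild the scraper season labels, then derive the paired injury seasons.
years_to_grab = [f"{y}-{(y + 1) % 100:02d}" for y in range(2000, 2021)]
head_injury_years = [following_injury_season(s) for s in years_to_grab]

for season, injury_season in list(zip(years_to_grab, head_injury_years))[:3]:
    print(season, '->', injury_season)
```

Deriving the second list keeps the two from drifting out of alignment if the season range is ever changed.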

In [2]:
years_to_grab = ['2000-01', '2001-02', '2002-03', '2003-04', '2004-05', '2005-06', '2006-07', '2007-08',
                 '2008-09', '2009-10', '2010-11', '2011-12', '2012-13', '2013-14', '2014-15', '2015-16',
                 '2016-17', '2017-18', '2018-19', '2019-20', '2020-21']
head_injury_years = ['2001/02', '2002/03', '2003/04', '2004/05', '2005/06', '2006/07', '2007/08','2008/09',
                    '2009/10', '2010/11', '2011/12', '2012/13', '2013/14', '2014/15','2015/16', '2016/17',
                    '2017/18', '2018/19', '2019/20', '2020/21', '2021/22']

# load the injury csv
injuries = pd.read_csv('data/nhl_injury.csv', encoding='latin')

# reading the file as latin-1 leaves the byte-order mark glued to the first
# header, so rename that first column to plain 'Season'
injuries = injuries.rename(columns={injuries.columns[0]: 'Season'})

for year, head_injury_year in zip(years_to_grab, head_injury_years):
    print(f'Running on year: {year}')
    nhl_data = ep.getPlayers('nhl', year)

    # select the injury season paired with this year (injury data = year + 1)
    # .copy() gives an independent frame, which avoids SettingWithCopyWarning below
    injuries_by_year = injuries[injuries['Season'] == head_injury_year].copy()

    # relabel head-related injury types as "Head"
    injuries_by_year['Injury Type'] = injuries_by_year['Injury Type'].replace(
        {"Facial": "Head", "Ear": "Head", "Concussion": "Head", "Eye": "Head",
         "Nose": "Head", "Jaw": "Head", "Migraine": "Head", "Mouth": "Head"})

    # keep only rows whose injury type is Head
    head_injuries = injuries_by_year[injuries_by_year['Injury Type'] == 'Head'].copy()

    # build a "First Last" name from "Last, First" to join the 2 datasets on
    new_names = head_injuries['Player'].map(lambda x: x.split(' ')[1]) + ' ' + head_injuries['Player'].map(lambda x: x.split(' ')[0])
    new_names = [x.replace(',', '') for x in new_names]
    head_injuries['Name'] = new_names

    # strip stray whitespace from the scraped names before joining
    nhl_data['playername'] = nhl_data['playername'].map(lambda x: x.strip())

    # left join: keep every player, attach injury info where the names match
    joined_data = nhl_data.merge(head_injuries[['Name', 'Games Missed']],
                        how='left', left_on='playername',
                        right_on='Name')

    # fill in missing values with 0
    joined_data.fillna(0, inplace=True)

    # flag rows that matched a head injury (Name is 0 where there was no match)
    joined_data['head_injuries'] = np.where(joined_data['Name'] == 0, 0, 1)

    # save the combined season to csv
    joined_data.to_csv(f'data/{year}.csv', index=False)

Running on year: 2000-01
Running on year: 2001-02
Running on year: 2002-03
Running on year: 2003-04
Running on year: 2004-05
Running on year: 2005-06
Running on year: 2006-07
Running on year: 2007-08
Running on year: 2008-09
Running on year: 2009-10
Running on year: 2010-11
Running on year: 2011-12
Running on year: 2012-13
Running on year: 2013-14
Running on year: 2014-15
Running on year: 2015-16
Running on year: 2016-17
Running on year: 2017-18
Running on year: 2018-19
Running on year: 2019-20
Running on year: 2020-21

This cell points to the directory where the CSV files are stored and uses glob to collect the path of every CSV file in it. Each file is read into a pandas DataFrame and appended to a list, and the DataFrames in the list are then concatenated into a single DataFrame. Passing ignore_index=True gives the combined DataFrame a fresh 0-based index instead of repeating each file's own index.

In [3]:
path = "C:\\Users\\Jeff\\Desktop\\phase_5_project\\data"
all_files = glob.glob(os.path.join(path , "*.csv"))

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

df = pd.concat(li, axis=0, ignore_index=True)
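Because "*.csv" matches every CSV in the folder, any other CSV stored alongside the season files (for instance an injury file or a previously saved combined file, if they live in the same directory) would be concatenated too. A minimal sketch of a narrower pattern follows; the function name load_season_files and the demo files are hypothetical, and the pattern assumes the season files are named like 2000-01.csv, as written by the earlier cell.

```python
import glob
import os
import tempfile

import pandas as pd


def load_season_files(data_dir: str) -> pd.DataFrame:
    """Concatenate only files named like '2000-01.csv' found in data_dir."""
    pattern = os.path.join(data_dir, "[0-9][0-9][0-9][0-9]-[0-9][0-9].csv")
    season_files = sorted(glob.glob(pattern))
    frames = [pd.read_csv(f) for f in season_files]
    return pd.concat(frames, axis=0, ignore_index=True)


# Demo in a throwaway directory with two season files and one unrelated CSV.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"playername": ["A"], "gp": [1]}).to_csv(
        os.path.join(tmp, "2000-01.csv"), index=False)
    pd.DataFrame({"playername": ["B"], "gp": [2]}).to_csv(
        os.path.join(tmp, "2001-02.csv"), index=False)
    pd.DataFrame({"Player": ["X"]}).to_csv(
        os.path.join(tmp, "nhl_injury.csv"), index=False)
    demo = load_season_files(tmp)

print(demo)
```

Sorting the matched paths also makes the row order of the combined frame reproducible, since glob does not guarantee an order.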

In [4]:
# view dataframe
df.head()

,player,team,gp,g,a,tp,ppg,pim,+/-,link,season,league,playername,position,fw_def,Name,Games Missed,head_injuries
0,Jaromír Jágr (RW),Pittsburgh Penguins,81,52,69,121,1.49,42,19,https://www.eliteprospects.com/player/8627/jar...,2000-01,nhl,Jaromír Jágr,RW,FW,0,0.0,0
1,Joe Sakic (C),Colorado Avalanche,82,54,64,118,1.44,30,45,https://www.eliteprospects.com/player/8862/joe...,2000-01,nhl,Joe Sakic,C,FW,0,0.0,0
2,Patrik Elias (LW),New Jersey Devils,82,40,56,96,1.17,51,45,https://www.eliteprospects.com/player/8698/pat...,2000-01,nhl,Patrik Elias,LW,FW,0,0.0,0
3,Alexei Kovalev (RW),Pittsburgh Penguins,79,44,51,95,1.2,96,12,https://www.eliteprospects.com/player/8670/ale...,2000-01,nhl,Alexei Kovalev,RW,FW,0,0.0,0
4,Jason Allison (C),Boston Bruins,82,36,59,95,1.16,85,-8,https://www.eliteprospects.com/player/9064/jas...,2000-01,nhl,Jason Allison,C,FW,0,0.0,0


In [5]:
# view row and column count of dataframe
df.shape

(18750, 18)

In [6]:
# view column names and datatypes.
df.dtypes

player            object
team              object
gp                 int64
g                  int64
a                  int64
tp                 int64
ppg               object
pim               object
+/-               object
link              object
season            object
league            object
playername        object
position          object
fw_def            object
Name              object
Games Missed     float64
head_injuries      int64
dtype: object
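The ppg, pim, and +/- columns come back as object rather than numeric, which usually means some entries are not parseable numbers. A hedged sketch of one way to convert them is shown below on toy data; the '-' placeholder is an assumption about how missing values appear and is not confirmed by the output above.

```python
import pandas as pd

# Toy frame; the '-' placeholder for missing values is an assumption.
toy = pd.DataFrame({
    "ppg": ["1.49", "-", "0.50"],
    "pim": ["42", "30", "-"],
    "+/-": ["19", "-8", "-"],
})

numeric_cols = ["ppg", "pim", "+/-"]

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
toy[numeric_cols] = toy[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(toy.dtypes)
print(toy)
```

Coercing to NaN keeps genuinely missing values distinguishable from real zeros, which matters for a column such as +/- where 0 is a valid value.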

This cell saves the DataFrame df to a CSV file named "df.csv" in the data directory, relative to the current working directory. Setting index=False keeps the index from being written to the file as an extra column.

In [7]:
df.to_csv("data/df.csv", index=False)