# Fantasy Hockey Data Wrangling

In this notebook we wrangle the data used for the Fantasy Hockey draft. This involves three steps:
1. Gather various tables of players, teams, and salaries
2. Join the tables
3. Calculate each player's score-per-game with custom scoring metrics for our fantasy hockey league 

Note that many of the helper functions have been abstracted away to `scripts/hockey_bots.py`.


## Collecting Data

Below we import the required libraries and load the raw data tables into the notebook.

In [4]:
import pandas as pd
import numpy as np
import sys
import importlib
sys.path.insert(1, '../')
import scripts.hockey_bots as hockey
importlib.reload(hockey)

# players table (stats)
df_p = pd.read_csv("../data/game_skater_stats.csv")
# game data table
df_g = pd.read_csv("../data/game.csv")
# goalies table 
df_go = pd.read_csv("../data/game_goalie_stats.csv")
# player/goalie table (name, team, etc)
df_player = pd.read_csv("../data/player_info.csv")
# shifts table 
shifts = pd.read_csv("../data/game_shifts.csv")
# teams
teams = pd.read_csv("../data/team_info.csv")

### Goalies

Goalies are awarded points for starting a game, so we filter the shifts table down to goalie shifts in the first period to see who started.

In [None]:
# Figuring out if a goalie started a game or not (starting the game is worth points)
goal_shifts = shifts[shifts.player_id.isin(df_go.player_id)]
goal_shifts = goal_shifts[goal_shifts.period==1]
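
As a rough illustration of how a start could be derived from first-period shifts, the sketch below uses a small made-up table. It assumes a `shift_start` column measured in seconds from the start of the period, and it treats a goalie whose first-period shift begins at 0 as the starter. Both are assumptions for illustration, not necessarily how `hockey.goalie_points` decides.

```python
import pandas as pd

# Hypothetical shift data: goalies 30 and 31 start game 1, goalie 32 comes in as relief.
shifts_demo = pd.DataFrame({
    "game_id":     [1, 1, 1, 1, 2],
    "player_id":   [30, 31, 32, 30, 31],
    "period":      [1, 1, 1, 2, 1],
    "shift_start": [0, 0, 600, 0, 0],  # assumed: seconds from start of period
})
goalie_ids = [30, 31, 32]

# Same filter as above: goalie shifts in the first period only.
first_period = shifts_demo[
    shifts_demo["player_id"].isin(goalie_ids) & (shifts_demo["period"] == 1)
]

# Assumed rule: a shift that begins at time 0 of period 1 counts as a start.
starters = first_period[first_period["shift_start"] == 0]
started = set(zip(starters["game_id"].tolist(), starters["player_id"].tolist()))

print(sorted(started))
```

A `(game_id, player_id)` set like `started` makes the per-row lookup inside a scoring function a constant-time membership test.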


### Filtering to the 2018-2019 Season

To simplify our analysis, we will only focus on the 2018-2019 hockey season. This is done below.

In [2]:
# Finding data for just this season
df = pd.read_csv('../data/game_teams_stats.csv')
df =  pd.merge(df, df_g[['game_id', 'date_time', 'type']])
df['date_time'] = pd.to_datetime(df['date_time'])

df = df[(df['date_time'] > '2018-10-3') &
        (df['date_time'] < '2019-04-8') & 
        (df['type'] == 'R')]
df = df.sort_values(by=['team_id', 'date_time'])
df['game_num'] = df.groupby('team_id').cumcount()
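
To make the filter and the per-team game counter concrete, here is a minimal sketch on made-up rows. The dates and team ids are illustrative only; the logic mirrors the cell above (strict date bounds, regular-season games only, then `cumcount` within each team).

```python
import pandas as pd

# Hypothetical games: one playoff game and one pre-season date should be dropped.
games_demo = pd.DataFrame({
    "team_id": [1, 1, 1, 2, 2],
    "date_time": pd.to_datetime(
        ["2018-10-11", "2018-10-06", "2019-04-20", "2018-10-04", "2018-09-20"]
    ),
    "type": ["R", "R", "P", "R", "R"],
})

season_start = pd.Timestamp("2018-10-03")
season_end = pd.Timestamp("2019-04-08")

mask = (
    (games_demo["date_time"] > season_start)
    & (games_demo["date_time"] < season_end)
    & (games_demo["type"] == "R")
)

# Sort before counting so game_num follows chronological order within each team.
season = games_demo[mask].sort_values(by=["team_id", "date_time"]).copy()
season["game_num"] = season.groupby("team_id").cumcount()

print(season)
```

Because `cumcount` starts at zero, a team that plays a full 82-game season ends with `game_num` 81.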



In [3]:
df

Unnamed: 0,game_id,team_id,HoA,won,settled_in,head_coach,goals,shots,hits,pim,powerPlayOpportunities,powerPlayGoals,faceOffWinPercentage,giveaways,takeaways,date_time,type,game_num
15265,2018020020,1,home,True,REG,John Hynes,5,27,24,8,2,1,49.2,11,10,2018-10-06,R,0
15327,2018020048,1,home,True,REG,John Hynes,6,36,19,10,4,1,61.7,9,20,2018-10-11,R,1
15373,2018020071,1,home,True,REG,John Hynes,3,36,19,6,8,1,61.2,8,14,2018-10-14,R,2
15387,2018020078,1,home,True,REG,John Hynes,3,34,15,13,6,1,44.3,8,11,2018-10-16,R,3
15413,2018020091,1,home,False,REG,John Hynes,3,30,19,19,5,2,54.3,13,11,2018-10-18,R,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22555,2018021202,54,home,False,REG,Gerard Gallant,2,37,28,8,2,0,48.1,7,18,2019-03-30,R,77
22572,2018021211,54,away,False,OT,Gerard Gallant,3,26,40,27,2,0,52.5,10,7,2019-03-31,R,78
22605,2018021227,54,home,True,REG,Gerard Gallant,3,31,23,4,2,0,53.7,7,19,2019-04-02,R,79
22657,2018021253,54,home,False,REG,Gerard Gallant,1,38,32,6,1,0,54.2,7,10,2019-04-05,R,80


In [None]:
df_games = df[['game_id','team_id', 'won', 'game_num']]


### Salary Information

As a constraint, we may want to cap the total salary of our fantasy hockey team. To support this, we gather player salaries below and save them to a data frame.

In [None]:
salaries = pd.read_html('https://www.hockey-reference.com/friv/current_nhl_salaries.cgi')[0]
# Split "First Last" on the first space only, so multi-word last names stay intact
salaries[['firstName', 'lastName']] = salaries['Player'].str.split(' ', n=1, expand=True)


### Merging Tables
Below we merge the player tables, their salaries, and other information into a single table.

In [None]:
#players 
df_p_2018 = hockey.player_merge(df_p, df_g, df_player, salaries)


In [None]:
df_ = pd.merge(df_p, df_g[['game_id', 'date_time', 'type']])
df_['date_time'] = pd.to_datetime(df_['date_time'])

df_ = pd.merge(df_, df_player[['player_id','firstName', 'lastName', 'primaryPosition']])
df_ = pd.merge(df_, salaries[['firstName', 'lastName', 'Salary']], on = ['firstName', 'lastName'])
df_
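
Joining on first and last name silently drops any player whose name is spelled differently in the salary table. The sketch below, on made-up names and salaries, shows one way to surface those rows with a left join and `indicator=True`. It is only a diagnostic idea, not part of `hockey.player_merge`.

```python
import pandas as pd

# Hypothetical players and salaries; "Jo Vale" has no salary entry.
players_demo = pd.DataFrame({
    "player_id": [1, 2, 3],
    "firstName": ["Alex", "Sam", "Jo"],
    "lastName": ["Stone", "Reed", "Vale"],
})
salaries_demo = pd.DataFrame({
    "Player": ["Alex Stone", "Sam Reed"],
    "Salary": [5_000_000, 1_000_000],
})
salaries_demo[["firstName", "lastName"]] = (
    salaries_demo["Player"].str.split(" ", n=1, expand=True)
)

# A left join keeps every player; the _merge column says whether a salary was found.
merged = players_demo.merge(
    salaries_demo[["firstName", "lastName", "Salary"]],
    on=["firstName", "lastName"],
    how="left",
    indicator=True,
)
unmatched = merged[merged["_merge"] == "left_only"]

print(unmatched[["player_id", "firstName", "lastName"]])
```

With the default inner join used above, the unmatched players would simply disappear from the result.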

In [None]:
#players 
df_p_2018 = hockey.player_merge(df_p, df_g, df_player, salaries)
df_p_2018 =df_p_2018[(df_p_2018['date_time'] > '2018-10-3') &
           (df_p_2018['date_time'] < '2019-04-8') & 
           (df_p_2018['type'] == 'R')]

# goalies

df_g_2018 = hockey.player_merge(df_go, df_g, df_player, salaries)
df_g_2018 =df_g_2018[(df_g_2018['date_time'] > '2018-10-3') &
           (df_g_2018['date_time'] < '2019-04-8') & 
           (df_g_2018['type'] == 'R')]

In [None]:
df_p_2018['points'] = df_p_2018.copy().apply(hockey.player_points, axis=1)
df_g_2018['points'] = df_g_2018.copy().apply(hockey.goalie_points, args=[goal_shifts], axis=1)
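
The actual scoring rules live in `hockey.player_points` and `hockey.goalie_points`. As a sketch of the general pattern only, the example below scores each row as a weighted sum of stat columns. The weights and the choice of columns are hypothetical and are not our league's settings.

```python
import pandas as pd

# Hypothetical weights, for illustration only.
WEIGHTS = {"goals": 3.0, "assists": 2.0, "shots": 0.5}

stats_demo = pd.DataFrame({
    "goals":   [1, 0],
    "assists": [2, 0],
    "shots":   [4, 2],
})

def row_points(row, weights=WEIGHTS):
    """Weighted sum of one player's stats for one game."""
    return sum(row[col] * w for col, w in weights.items())

# Row-wise version, same shape as the apply call above.
stats_demo["points"] = stats_demo.apply(row_points, axis=1)

# Equivalent vectorized version, usually much faster on a full season of rows.
vectorized = stats_demo[list(WEIGHTS)].mul(pd.Series(WEIGHTS)).sum(axis=1)

print(stats_demo)
```

When the scoring rule is a plain weighted sum, the vectorized form avoids calling a Python function once per row.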

In [None]:
import matplotlib.pyplot as plt
df_p_2018=df_p_2018.sort_values(by='date_time').reset_index(drop=True)
df_g_2018=df_g_2018.sort_values(by='date_time').reset_index(drop=True)
df_score = df_p_2018[['game_id', 'team_id', 'player_id','firstName', 'lastName', 'primaryPosition', 'points']]
df_scoreg = df_g_2018[['game_id', 'team_id', 'player_id','firstName', 'lastName', 'primaryPosition', 'points']]
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df_score = pd.concat([df_score, df_scoreg], ignore_index=True)

In [None]:
a = pd.merge(df_score, df_games, on = 'game_id', how='left')


In [None]:
a=a.sort_values(by=['player_id', 'game_num'], ascending=False).drop_duplicates(subset=['player_id','game_num'])


In [None]:
import matplotlib.pyplot as plt
import cufflinks as cf
import plotly
cf.set_config_file(offline=True)
ax1 = pd.DataFrame()
ax1['Edmonton Oilers'] = a[a.team_id_x==22]['points']

print(len(ax1))
ax2 = pd.DataFrame()
ax2['Tampa Bay Lightning'] = a[a.team_id_x==14]['points']

# ax.iplot(kind='hist', barmode='overlay')
#ax.set_ylim([0,17])
team_compare = pd.concat([ax1,ax2], ignore_index=True, axis=0, sort=False)
team_compare.iplot(kind='hist',
                   barmode='overlay',
                   bins=25,
                   histnorm='probability density',
                   yTitle='Proportion of Points',
                   xTitle = "Bin Value")

## Final Data Merge

Next we add zero-filled rows for players who did not play in a particular game. Every player needs the same number of games for our portfolio optimization later. Note that zero filling is not _necessarily_ the best choice: one could also fill with the mean, the median, or some other metric. We chose zeros because a player who isn't playing many games isn't going to help us win at fantasy hockey.

In [None]:
games = set(a.game_num.unique())
filler_rows = []
for player in a['player_id'].unique():
    player_rows = a[a['player_id'] == player]
    # Look up the player's details once rather than once per missing game
    pos = player_rows['primaryPosition'].to_list()[0]
    first = player_rows['firstName'].to_list()[0]
    last = player_rows['lastName'].to_list()[0]
    fill_games = games - set(player_rows['game_num'])
    for game in fill_games:
        # Zero points; the game, team, and result fields are unknown for a missed game
        filler_rows.append([np.nan, np.nan, player, first, last, pos,
                            0, np.nan, np.nan, game])

# Build the filler frame once; DataFrame.append was removed in pandas 2.0
to_append = pd.DataFrame(filler_rows, columns=list(a))
test = pd.concat([a, to_append], ignore_index=True)
    
        


test.head()
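
An alternative to the explicit loop is to reindex against the full player-by-game grid. The sketch below shows the idea on a made-up three-row table with only a `points` column. It assumes each `(player_id, game_num)` pair appears at most once, and it is an illustration rather than a drop-in replacement, since the real table carries more columns.

```python
import pandas as pd

# Hypothetical scores: player 2 missed game 0.
scores_demo = pd.DataFrame({
    "player_id": [1, 1, 2],
    "game_num":  [0, 1, 1],
    "points":    [2.0, 3.5, 1.0],
})

# Every (player, game) combination that should exist after filling.
full_index = pd.MultiIndex.from_product(
    [sorted(scores_demo["player_id"].unique()),
     sorted(scores_demo["game_num"].unique())],
    names=["player_id", "game_num"],
)

# reindex inserts the missing combinations and fills their points with zero.
filled = (
    scores_demo.set_index(["player_id", "game_num"])["points"]
    .reindex(full_index, fill_value=0.0)
    .reset_index()
)

print(filled)
```

Swapping `fill_value=0.0` for a per-player mean or median would implement the other filling strategies mentioned above.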

### Dropping Players
Here we drop any player who did not play more than 10 games in the previous season.

In [None]:
grouped = a.groupby('player_id').count()
players=list(grouped[grouped['won'] > 10].index)
test2=test[test['player_id'].isin(players)].reset_index(drop=True)
len(players)
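
The same threshold can be expressed with `groupby(...).size()`, which counts rows per player directly rather than counting non-null values in one particular column. A minimal sketch on made-up data, with a smaller hypothetical threshold so the effect is visible:

```python
import pandas as pd

# Hypothetical appearances: player 1 played three games, player 2 one, player 3 two.
appearances = pd.DataFrame({"player_id": [1, 1, 1, 2, 3, 3]})

MIN_GAMES = 2  # hypothetical threshold for the demo; the notebook uses 10

games_per_player = appearances.groupby("player_id").size()

# Strictly more than the threshold, matching the "> 10" rule above.
keep = games_per_player[games_per_player > MIN_GAMES].index.tolist()

print(keep)
```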

In [None]:
# Zero filling
test = test.fillna(0)
test = test.sort_values(by=['player_id', 'game_num'])
len(test.player_id.unique())

test2 = test2.fillna(0)
test2 = test2.sort_values(by=['player_id', 'game_num'])
len(test2.player_id.unique())


In [None]:
p = a.groupby(['firstName', 'lastName']).count().sort_values(by='game_id', ascending=False).reset_index()
ax = p['game_id'].plot( figsize=(14,10), linewidth=3, grid=True)
ax.tick_params(axis="x", labelsize=16)
ax.tick_params(axis="y", labelsize=16)
ax.set_ylabel("Games Played", size = 22)
ax.set_xlabel("Player", size = 22)
plt.show()

In [None]:
test2.to_csv("../data/fixed_data_2018.csv")

In [None]:
a.to_csv("../data/textaa.csv")