People always look at a player's performance over the past season or several as an indicator of his skill. But how accurate is it at predicting the same player's performance in the new season? How many seasons back should we look? And how much can the player's team affect it? Those are the main questions this notebook aims to research.

First, we are going to test a simple model where we are only interested in points (goals + assists) and, therefore, only in skaters. If the model shows promising results, we can attempt to expand it to the other key performance indicators as well.

In [1]:
# Importing standard packages for data exploration and processing.
import numpy as np
import pandas as pd

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)


data = pd.read_csv('../data/players/skaters_season.csv')
data.head()

Unnamed: 0,Profile,Player,Position,Season,Year,Team,Number,Games,Goals,Assists,Points,Plus_minus,Plus,Minus,Penalties,Goals_even,Goals_powerplay,Goals_shorthanded,Goals_overtime,Game_winning_goals,Game_winning_shootouts,Shots,Shots_percentage,Shots_game,Faceoffs,Faceoffs_won,Faceoffs_percentage,Icetime_game,Icetime_game_seconds,Shifts_game,Hits,Shots_blocked,Penalties_against
0,https://en.khl.ru/players/16673/,Sergei Abramov,Skater,Regular season,2014/2015,Amur (Khabarovsk),93.0,13,1,0,1,-4,1,5,6,1,0,0,0,0,0,11,9.1,0.8,0,0,,6:57,417,9.3,1.0,2.0,1.0
1,https://en.khl.ru/players/16673/,Sergei Abramov,Skater,Regular season,2013/2014,Amur (Khabarovsk),91.0,12,0,0,0,0,1,1,0,0,0,0,0,0,0,14,0.0,1.2,1,0,0.0,6:15,375,8.0,,,
2,https://en.khl.ru/players/19200/,Dmitry Ambrozheichik,Skater,Regular season,2017/2018,Dinamo (Minsk),63.0,8,0,0,0,-1,1,2,0,0,0,0,0,0,0,6,0.0,0.8,0,0,,6:00,360,8.2,2.0,2.0,0.0
3,https://en.khl.ru/players/19200/,Dmitry Ambrozheichik,Skater,Regular season,2016/2017,Dinamo (Minsk),15.0,20,3,1,4,1,5,4,10,3,0,0,0,0,0,21,14.3,1.1,7,3,42.9,9:43,583,12.7,10.0,7.0,9.0
4,https://en.khl.ru/players/19200/,Dmitry Ambrozheichik,Skater,Regular season,2015/2016,Dinamo (Minsk),24.0,11,1,0,1,1,3,2,0,1,0,0,0,0,0,5,20.0,0.5,0,0,,4:43,283,8.5,1.0,0.0,1.0


We will definitely need the total time on ice over the season. After all, two players might be equally skilled, but one of them simply gets much more icetime and thus scores more points. What we are going to use is therefore not the raw points over the season but a standardised number of points per fixed interval of icetime. For ease of reading, let us set the interval to 60 minutes (the standard match length), the same as with goalies.

In [2]:
data['Icetime'] = data['Games'] * data['Icetime_game_seconds'] / 3600
data['Points_average'] = data['Points'] / data['Icetime']
data.head()

Unnamed: 0,Profile,Player,Position,Season,Year,Team,Number,Games,Goals,Assists,Points,Plus_minus,Plus,Minus,Penalties,Goals_even,Goals_powerplay,Goals_shorthanded,Goals_overtime,Game_winning_goals,Game_winning_shootouts,Shots,Shots_percentage,Shots_game,Faceoffs,Faceoffs_won,Faceoffs_percentage,Icetime_game,Icetime_game_seconds,Shifts_game,Hits,Shots_blocked,Penalties_against,Icetime,Points_average
0,https://en.khl.ru/players/16673/,Sergei Abramov,Skater,Regular season,2014/2015,Amur (Khabarovsk),93.0,13,1,0,1,-4,1,5,6,1,0,0,0,0,0,11,9.1,0.8,0,0,,6:57,417,9.3,1.0,2.0,1.0,1.505833,0.664084
1,https://en.khl.ru/players/16673/,Sergei Abramov,Skater,Regular season,2013/2014,Amur (Khabarovsk),91.0,12,0,0,0,0,1,1,0,0,0,0,0,0,0,14,0.0,1.2,1,0,0.0,6:15,375,8.0,,,,1.25,0.0
2,https://en.khl.ru/players/19200/,Dmitry Ambrozheichik,Skater,Regular season,2017/2018,Dinamo (Minsk),63.0,8,0,0,0,-1,1,2,0,0,0,0,0,0,0,6,0.0,0.8,0,0,,6:00,360,8.2,2.0,2.0,0.0,0.8,0.0
3,https://en.khl.ru/players/19200/,Dmitry Ambrozheichik,Skater,Regular season,2016/2017,Dinamo (Minsk),15.0,20,3,1,4,1,5,4,10,3,0,0,0,0,0,21,14.3,1.1,7,3,42.9,9:43,583,12.7,10.0,7.0,9.0,3.238889,1.234991
4,https://en.khl.ru/players/19200/,Dmitry Ambrozheichik,Skater,Regular season,2015/2016,Dinamo (Minsk),24.0,11,1,0,1,1,3,2,0,1,0,0,0,0,0,5,20.0,0.5,0,0,,4:43,283,8.5,1.0,0.0,1.0,0.864722,1.156441
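
As a sanity check of the formula, the first displayed row (13 games, 417 seconds of icetime per game, 1 point) can be recomputed by hand. This is only a small arithmetic sketch; the variable names are illustrative and not part of the notebook.

```python
# Values copied from the first row of the table above (Sergei Abramov, 2014/2015).
games = 13
icetime_game_seconds = 417
points = 1

# Total icetime in hours: games * seconds per game / 3600 seconds per hour.
icetime_hours = games * icetime_game_seconds / 3600

# Points per hour of icetime, i.e. points per 60 minutes.
points_per_60 = points / icetime_hours

print(round(icetime_hours, 6))  # 1.505833
print(round(points_per_60, 6))  # 0.664084
```

Both values match the `Icetime` and `Points_average` columns of that row.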


Since we are using averages, we need to ensure that every player has participated at a certain bare minimum during the season. This could be measured in two ways, by games played or by icetime recorded. For now, icetime looks like a good choice. Let us set the minimum requirement at 2 hours.

On a related note, let us drop all playoff seasons from the data. Not only do they tend to be fairly short, so most of them would be filtered out by the icetime requirement anyway, but playoff matches also tend to behave somewhat differently from regular-season ones.

In [3]:
data = data[data['Season'] == 'Regular season']
data = data[data['Icetime'] >= 2]

Important note! Some rows have Team specified as "Summary:". This happens when a player changed teams during the season: he then has a separate row of statistics for each team, plus one row for them combined.

In [4]:
data[35:38]

Unnamed: 0,Profile,Player,Position,Season,Year,Team,Number,Games,Goals,Assists,Points,Plus_minus,Plus,Minus,Penalties,Goals_even,Goals_powerplay,Goals_shorthanded,Goals_overtime,Game_winning_goals,Game_winning_shootouts,Shots,Shots_percentage,Shots_game,Faceoffs,Faceoffs_won,Faceoffs_percentage,Icetime_game,Icetime_game_seconds,Shifts_game,Hits,Shots_blocked,Penalties_against,Icetime,Points_average
65,https://en.khl.ru/players/3989/,Vitaly Atyushov,Skater,Regular season,2008/2009,Metallurg (Magnitogorsk),27.0,55,8,27,35,17,51,34,34,6,1,1,0,1,0,124,6.5,2.3,1,0,0.0,22:39,1359,24.1,,,,20.7625,1.685731
66,https://en.khl.ru/players/23434/,Jonas Ahnelov,Skater,Regular season,2017/2018,Avangard (Omsk),5.0,42,3,7,10,-3,18,21,8,2,1,0,0,0,0,47,6.4,1.1,0,0,,15:01,901,22.0,28.0,51.0,1.0,10.511667,0.951324
68,https://en.khl.ru/players/23434/,Jonas Ahnelov,Skater,Regular season,2016/2017,Avangard (Omsk),5.0,44,4,6,10,16,36,20,28,3,1,0,0,1,0,48,8.3,1.1,0,0,,18:13,1093,25.6,28.0,57.0,4.0,13.358889,0.748565


Ideally, we would take such cases into account, but that presents problems of its own. If we keep just the summary, we cannot include teams in our model. If we keep the statistics for each team separately, the individual rows might fail the icetime requirement even if the player had enough total icetime that season to be included. For now, let us go with the latter approach.

In [5]:
data = data[data['Team'] != 'Summary:']
data.head(5)

Unnamed: 0,Profile,Player,Position,Season,Year,Team,Number,Games,Goals,Assists,Points,Plus_minus,Plus,Minus,Penalties,Goals_even,Goals_powerplay,Goals_shorthanded,Goals_overtime,Game_winning_goals,Game_winning_shootouts,Shots,Shots_percentage,Shots_game,Faceoffs,Faceoffs_won,Faceoffs_percentage,Icetime_game,Icetime_game_seconds,Shifts_game,Hits,Shots_blocked,Penalties_against,Icetime,Points_average
3,https://en.khl.ru/players/19200/,Dmitry Ambrozheichik,Skater,Regular season,2016/2017,Dinamo (Minsk),15.0,20,3,1,4,1,5,4,10,3,0,0,0,0,0,21,14.3,1.1,7,3,42.9,9:43,583,12.7,10.0,7.0,9.0,3.238889,1.234991
6,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,Skater,Regular season,2010/2011,Lokomotiv (Yaroslavl),57.0,52,5,14,19,22,51,29,80,2,2,1,0,1,0,81,6.2,1.6,3,0,0.0,20:16,1216,22.7,,,,17.564444,1.081731
8,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,Skater,Regular season,2009/2010,Lokomotiv (Yaroslavl),57.0,52,7,11,18,10,38,28,50,1,6,0,0,0,0,105,6.7,2.0,0,0,,21:31,1291,23.6,,,,18.647778,0.965262
10,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,Skater,Regular season,2008/2009,Lokomotiv (Yaroslavl),57.0,40,2,10,12,13,32,19,44,0,1,1,0,2,0,66,3.0,1.6,0,0,,18:35,1115,22.0,,,,12.388889,0.96861
11,https://en.khl.ru/players/20844/,Semyon Afonasyevsky,Skater,Regular season,2016/2017,Traktor (Chelyabinsk),85.0,28,1,0,1,-3,4,7,4,1,0,0,0,0,0,29,3.4,1.0,0,0,,8:35,515,11.9,17.0,11.0,2.0,4.005556,0.249653
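
The drawback mentioned above (per-team rows failing the icetime requirement on their own) could in principle be softened by applying the threshold to the season total instead. The sketch below is not what this notebook does; it runs on hypothetical toy data (profiles `p1`, `p2`, teams `Team A` to `Team C`) and introduces an illustrative column `Icetime_season`.

```python
import pandas as pd

# Hypothetical toy data: p1 changed teams mid-season, so he has two
# per-team rows plus a combined 'Summary:' row. Icetime is in hours.
toy = pd.DataFrame({
    'Profile': ['p1', 'p1', 'p1', 'p2'],
    'Year': ['2016/2017'] * 4,
    'Team': ['Team A', 'Team B', 'Summary:', 'Team C'],
    'Icetime': [1.5, 1.0, 2.5, 1.2],
})

# Keep only the per-team rows, as in the notebook.
per_team = toy[toy['Team'] != 'Summary:'].copy()

# Sum the icetime over all teams of the same player in the same season.
per_team['Icetime_season'] = (
    per_team.groupby(['Profile', 'Year'])['Icetime'].transform('sum')
)

# Apply the 2-hour requirement to the season total rather than to each row.
kept = per_team[per_team['Icetime_season'] >= 2]
print(kept[['Profile', 'Team', 'Icetime', 'Icetime_season']])
```

Here both rows of `p1` survive (1.5 + 1.0 = 2.5 hours), while `p2` is dropped. The price is that a player can then have two rows for one season, which would need extra care in the shifting step further down.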


We are first going to try predicting based on the two latest seasons the player has participated in. Important note: those are not necessarily the two immediately preceding seasons, as a player might have skipped some seasons or not played enough in them to be included in our analysis. And since we need the values for at least the current and the two preceding seasons for each player, players with fewer than 3 seasons in the data have to be dropped altogether.

In [6]:
data = data.groupby('Profile').filter(lambda x: len(x) > 2)
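
To illustrate what this filter does, here is a minimal sketch on hypothetical toy data (the profile ids `p1` and `p2` are made up). It also shows an equivalent vectorised form using `transform('size')`, which is an alternative and not the notebook's own code.

```python
import pandas as pd

# Hypothetical toy data: p1 has three rows, p2 only two.
toy = pd.DataFrame({
    'Profile': ['p1', 'p1', 'p1', 'p2', 'p2'],
    'Year': ['2010/2011', '2009/2010', '2008/2009', '2015/2016', '2014/2015'],
})

# Same pattern as in the cell above: keep groups with more than 2 rows.
kept = toy.groupby('Profile').filter(lambda rows: len(rows) > 2)

# Equivalent form: broadcast each group's size to its rows, then mask.
sizes = toy.groupby('Profile')['Profile'].transform('size')
kept_vectorised = toy[sizes > 2]

print(kept)
```

Only the rows of `p1` remain. Note that the filter counts rows rather than distinct seasons, so with per-team rows kept, a mid-season transfer can contribute two rows to a single season if both pass the icetime requirement.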

In [7]:
# To avoid typing column lists manually.
data.columns

Index(['Profile', 'Player', 'Position', 'Season', 'Year', 'Team', 'Number',
       'Games', 'Goals', 'Assists', 'Points', 'Plus_minus', 'Plus', 'Minus',
       'Penalties', 'Goals_even', 'Goals_powerplay', 'Goals_shorthanded',
       'Goals_overtime', 'Game_winning_goals', 'Game_winning_shootouts',
       'Shots', 'Shots_percentage', 'Shots_game', 'Faceoffs', 'Faceoffs_won',
       'Faceoffs_percentage', 'Icetime_game', 'Icetime_game_seconds',
       'Shifts_game', 'Hits', 'Shots_blocked', 'Penalties_against', 'Icetime',
       'Points_average'],
      dtype='object')

In [8]:
# We can drop all unnecessary columns now.
drop_list = ['Position', 'Season', 'Number', 'Games', 'Goals', 'Assists', 'Points', 'Plus_minus', 'Plus',
             'Minus', 'Penalties', 'Goals_even', 'Goals_powerplay', 'Goals_shorthanded', 'Goals_overtime',
             'Game_winning_goals', 'Game_winning_shootouts', 'Shots', 'Shots_percentage', 'Shots_game',
             'Faceoffs', 'Faceoffs_won', 'Faceoffs_percentage', 'Icetime_game', 'Icetime_game_seconds',
             'Shifts_game', 'Hits', 'Shots_blocked', 'Penalties_against', 'Icetime']
data.drop(drop_list, axis=1, inplace=True)
data.reset_index(drop=True, inplace=True)
data.head()

Unnamed: 0,Profile,Player,Year,Team,Points_average
0,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,2010/2011,Lokomotiv (Yaroslavl),1.081731
1,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,2009/2010,Lokomotiv (Yaroslavl),0.965262
2,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,2008/2009,Lokomotiv (Yaroslavl),0.96861
3,https://en.khl.ru/players/14763/,Sergei Andronov,2020/2021,CSKA (Moscow),1.811594
4,https://en.khl.ru/players/14763/,Sergei Andronov,2019/2020,CSKA (Moscow),0.450563


We now need to add the year/team/points data from the two previous seasons to the dataframe. The simplest way to do this is to add shifted copies of the same columns under different column names. However, the neighbouring row does not necessarily contain data for the same player. To account for that, we are going to include the profile and player name of the shifted rows in the output dataframe and check that they remain the same.

In [9]:
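# shift(-1) moves the values up by one row, so each row receives the data
# of the row below it (an earlier season when the player is the same).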
header_0 = ['Profile_0', 'Player_0', 'Year_0', 'Team_0', 'Points_0']
header_1 = ['Profile_1', 'Player_1', 'Year_1', 'Team_1', 'Points_1']
header_2 = ['Profile_2', 'Player_2', 'Year_2', 'Team_2', 'Points_2']
data.columns = header_0
data[header_1] = data[header_0].shift(-1)
data[header_2] = data[header_0].shift(-2)
data.head()

Unnamed: 0,Profile_0,Player_0,Year_0,Team_0,Points_0,Profile_1,Player_1,Year_1,Team_1,Points_1,Profile_2,Player_2,Year_2,Team_2,Points_2
0,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,2010/2011,Lokomotiv (Yaroslavl),1.081731,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,2009/2010,Lokomotiv (Yaroslavl),0.965262,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,2008/2009,Lokomotiv (Yaroslavl),0.96861
1,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,2009/2010,Lokomotiv (Yaroslavl),0.965262,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,2008/2009,Lokomotiv (Yaroslavl),0.96861,https://en.khl.ru/players/14763/,Sergei Andronov,2020/2021,CSKA (Moscow),1.811594
2,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,2008/2009,Lokomotiv (Yaroslavl),0.96861,https://en.khl.ru/players/14763/,Sergei Andronov,2020/2021,CSKA (Moscow),1.811594,https://en.khl.ru/players/14763/,Sergei Andronov,2019/2020,CSKA (Moscow),0.450563
3,https://en.khl.ru/players/14763/,Sergei Andronov,2020/2021,CSKA (Moscow),1.811594,https://en.khl.ru/players/14763/,Sergei Andronov,2019/2020,CSKA (Moscow),0.450563,https://en.khl.ru/players/14763/,Sergei Andronov,2018/2019,CSKA (Moscow),1.013514
4,https://en.khl.ru/players/14763/,Sergei Andronov,2019/2020,CSKA (Moscow),0.450563,https://en.khl.ru/players/14763/,Sergei Andronov,2018/2019,CSKA (Moscow),1.013514,https://en.khl.ru/players/14763/,Sergei Andronov,2017/2018,CSKA (Moscow),0.809262


In [10]:
data.tail(5)

Unnamed: 0,Profile_0,Player_0,Year_0,Team_0,Points_0,Profile_1,Player_1,Year_1,Team_1,Points_1,Profile_2,Player_2,Year_2,Team_2,Points_2
7642,https://en.khl.ru/players/23355/,Denis Zernov,2017/2018,Lada (Togliatti),1.615499,https://en.khl.ru/players/23355/,Denis Zernov,2016/2017,Lada (Togliatti),1.487272,https://en.khl.ru/players/16217/,Airat Ziazov,2014/2015,Vityaz (Moscow Region),0.0
7643,https://en.khl.ru/players/23355/,Denis Zernov,2016/2017,Lada (Togliatti),1.487272,https://en.khl.ru/players/16217/,Airat Ziazov,2014/2015,Vityaz (Moscow Region),0.0,https://en.khl.ru/players/16217/,Airat Ziazov,2013/2014,Vityaz (Moscow Region),2.089286
7644,https://en.khl.ru/players/16217/,Airat Ziazov,2014/2015,Vityaz (Moscow Region),0.0,https://en.khl.ru/players/16217/,Airat Ziazov,2013/2014,Vityaz (Moscow Region),2.089286,https://en.khl.ru/players/16217/,Airat Ziazov,2012/2013,Neftekhimik (Nizhnekamsk),0.239187
7645,https://en.khl.ru/players/16217/,Airat Ziazov,2013/2014,Vityaz (Moscow Region),2.089286,https://en.khl.ru/players/16217/,Airat Ziazov,2012/2013,Neftekhimik (Nizhnekamsk),0.239187,,,,,
7646,https://en.khl.ru/players/16217/,Airat Ziazov,2012/2013,Neftekhimik (Nizhnekamsk),0.239187,,,,,,,,,,


In [11]:
# Drop the rows that mix data from different players. The trailing rows, where
# the shifted columns are NaN, fail the comparison and are dropped as well.
data = data[(data['Profile_0'] == data['Profile_1']) & (data['Profile_1'] == data['Profile_2'])]
data.reset_index(drop=True, inplace=True)
data.head()

Unnamed: 0,Profile_0,Player_0,Year_0,Team_0,Points_0,Profile_1,Player_1,Year_1,Team_1,Points_1,Profile_2,Player_2,Year_2,Team_2,Points_2
0,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,2010/2011,Lokomotiv (Yaroslavl),1.081731,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,2009/2010,Lokomotiv (Yaroslavl),0.965262,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,2008/2009,Lokomotiv (Yaroslavl),0.96861
1,https://en.khl.ru/players/14763/,Sergei Andronov,2020/2021,CSKA (Moscow),1.811594,https://en.khl.ru/players/14763/,Sergei Andronov,2019/2020,CSKA (Moscow),0.450563,https://en.khl.ru/players/14763/,Sergei Andronov,2018/2019,CSKA (Moscow),1.013514
2,https://en.khl.ru/players/14763/,Sergei Andronov,2019/2020,CSKA (Moscow),0.450563,https://en.khl.ru/players/14763/,Sergei Andronov,2018/2019,CSKA (Moscow),1.013514,https://en.khl.ru/players/14763/,Sergei Andronov,2017/2018,CSKA (Moscow),0.809262
3,https://en.khl.ru/players/14763/,Sergei Andronov,2018/2019,CSKA (Moscow),1.013514,https://en.khl.ru/players/14763/,Sergei Andronov,2017/2018,CSKA (Moscow),0.809262,https://en.khl.ru/players/14763/,Sergei Andronov,2016/2017,CSKA (Moscow),0.798509
4,https://en.khl.ru/players/14763/,Sergei Andronov,2017/2018,CSKA (Moscow),0.809262,https://en.khl.ru/players/14763/,Sergei Andronov,2016/2017,CSKA (Moscow),0.798509,https://en.khl.ru/players/14763/,Sergei Andronov,2015/2016,CSKA (Moscow),0.952822


In [12]:
# Everything seems in order, we can drop the duplicate columns now.
data.drop(['Profile_1', 'Player_1', 'Profile_2', 'Player_2'], axis=1, inplace=True)
data.head()

Unnamed: 0,Profile_0,Player_0,Year_0,Team_0,Points_0,Year_1,Team_1,Points_1,Year_2,Team_2,Points_2
0,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,2010/2011,Lokomotiv (Yaroslavl),1.081731,2009/2010,Lokomotiv (Yaroslavl),0.965262,2008/2009,Lokomotiv (Yaroslavl),0.96861
1,https://en.khl.ru/players/14763/,Sergei Andronov,2020/2021,CSKA (Moscow),1.811594,2019/2020,CSKA (Moscow),0.450563,2018/2019,CSKA (Moscow),1.013514
2,https://en.khl.ru/players/14763/,Sergei Andronov,2019/2020,CSKA (Moscow),0.450563,2018/2019,CSKA (Moscow),1.013514,2017/2018,CSKA (Moscow),0.809262
3,https://en.khl.ru/players/14763/,Sergei Andronov,2018/2019,CSKA (Moscow),1.013514,2017/2018,CSKA (Moscow),0.809262,2016/2017,CSKA (Moscow),0.798509
4,https://en.khl.ru/players/14763/,Sergei Andronov,2017/2018,CSKA (Moscow),0.809262,2016/2017,CSKA (Moscow),0.798509,2015/2016,CSKA (Moscow),0.952822
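
The shift-and-check steps above could also be expressed with a per-player shift, which never pulls values across players and so needs no profile comparison afterwards. This is a minimal sketch on hypothetical toy data (profile ids `p1`, `p2` and rounded point values are illustrative), not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical toy data, sorted like the notebook: newest season first per player.
toy = pd.DataFrame({
    'Profile': ['p1', 'p1', 'p1', 'p2', 'p2', 'p2', 'p2'],
    'Year': ['2010/2011', '2009/2010', '2008/2009',
             '2020/2021', '2019/2020', '2018/2019', '2017/2018'],
    'Points_average': [1.08, 0.97, 0.97, 1.81, 0.45, 1.01, 0.81],
})

# Shifting within each player's group leaves NaN where no earlier season exists.
by_player = toy.groupby('Profile')['Points_average']
lagged = toy.assign(
    Points_1=by_player.shift(-1),
    Points_2=by_player.shift(-2),
)

# Rows without two earlier seasons for the same player are dropped.
lagged = lagged.dropna(subset=['Points_1', 'Points_2']).reset_index(drop=True)
print(lagged)
```

On this toy input, three rows remain: one for `p1` and two for `p2`. The same pattern would extend to the `Year` and `Team` columns by shifting them within the same groups.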
