People often treat a player's performance over the past season or several as an indicator of his skill. But how accurate is it at predicting the same player's performance in the new season? How many seasons back should we look? And how much can the player's team affect it? Those are the main questions this notebook aims to research.

To start, we are going to test a simple model where we are only interested in points (goals + assists) and, therefore, only in skaters. If the model shows promising results, we can attempt to expand it to other key performance indicators as well.

In [1]:
# Importing standard packages for data exploration and processing.
import numpy as np
import pandas as pd

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)


data = pd.read_csv('../data/players/skaters_season.csv')
data.head()

Unnamed: 0,Profile,Player,Position,Season,Year,Team,Number,Games,Goals,Assists,Points,Plus_minus,Plus,Minus,Penalties,Goals_even,Goals_powerplay,Goals_shorthanded,Goals_overtime,Game_winning_goals,Game_winning_shootouts,Shots,Shots_percentage,Shots_game,Faceoffs,Faceoffs_won,Faceoffs_percentage,Icetime_game,Icetime_game_seconds,Shifts_game,Hits,Shots_blocked,Penalties_against
0,https://en.khl.ru/players/16673/,Sergei Abramov,Skater,Regular season,2014/2015,Amur (Khabarovsk),93.0,13,1,0,1,-4,1,5,6,1,0,0,0,0,0,11,9.1,0.8,0,0,,6:57,417,9.3,1.0,2.0,1.0
1,https://en.khl.ru/players/16673/,Sergei Abramov,Skater,Regular season,2013/2014,Amur (Khabarovsk),91.0,12,0,0,0,0,1,1,0,0,0,0,0,0,0,14,0.0,1.2,1,0,0.0,6:15,375,8.0,,,
2,https://en.khl.ru/players/19200/,Dmitry Ambrozheichik,Skater,Regular season,2017/2018,Dinamo (Minsk),63.0,8,0,0,0,-1,1,2,0,0,0,0,0,0,0,6,0.0,0.8,0,0,,6:00,360,8.2,2.0,2.0,0.0
3,https://en.khl.ru/players/19200/,Dmitry Ambrozheichik,Skater,Regular season,2016/2017,Dinamo (Minsk),15.0,20,3,1,4,1,5,4,10,3,0,0,0,0,0,21,14.3,1.1,7,3,42.9,9:43,583,12.7,10.0,7.0,9.0
4,https://en.khl.ru/players/19200/,Dmitry Ambrozheichik,Skater,Regular season,2015/2016,Dinamo (Minsk),24.0,11,1,0,1,1,3,2,0,1,0,0,0,0,0,5,20.0,0.5,0,0,,4:43,283,8.5,1.0,0.0,1.0


We will definitely need the total time on ice over the season. After all, two players might be equally skilled, but one of them simply gets much more icetime and thus more points. What we are going to use is not the points over the season but a standardised number of points over a fixed interval. For ease of reading, let us set the interval to 60 minutes (standard match length), the same as with goalies.

In [2]:
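# Total icetime over the season, in hours (games * average seconds per game / 3600).
data['Icetime'] = data['Games'] * data['Icetime_game_seconds'] / 3600
# Points per 60 minutes of icetime.
data['Points_average'] = data['Points'] / data['Icetime']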
data.head()

Unnamed: 0,Profile,Player,Position,Season,Year,Team,Number,Games,Goals,Assists,Points,Plus_minus,Plus,Minus,Penalties,Goals_even,Goals_powerplay,Goals_shorthanded,Goals_overtime,Game_winning_goals,Game_winning_shootouts,Shots,Shots_percentage,Shots_game,Faceoffs,Faceoffs_won,Faceoffs_percentage,Icetime_game,Icetime_game_seconds,Shifts_game,Hits,Shots_blocked,Penalties_against,Icetime,Points_average
0,https://en.khl.ru/players/16673/,Sergei Abramov,Skater,Regular season,2014/2015,Amur (Khabarovsk),93.0,13,1,0,1,-4,1,5,6,1,0,0,0,0,0,11,9.1,0.8,0,0,,6:57,417,9.3,1.0,2.0,1.0,1.505833,0.664084
1,https://en.khl.ru/players/16673/,Sergei Abramov,Skater,Regular season,2013/2014,Amur (Khabarovsk),91.0,12,0,0,0,0,1,1,0,0,0,0,0,0,0,14,0.0,1.2,1,0,0.0,6:15,375,8.0,,,,1.25,0.0
2,https://en.khl.ru/players/19200/,Dmitry Ambrozheichik,Skater,Regular season,2017/2018,Dinamo (Minsk),63.0,8,0,0,0,-1,1,2,0,0,0,0,0,0,0,6,0.0,0.8,0,0,,6:00,360,8.2,2.0,2.0,0.0,0.8,0.0
3,https://en.khl.ru/players/19200/,Dmitry Ambrozheichik,Skater,Regular season,2016/2017,Dinamo (Minsk),15.0,20,3,1,4,1,5,4,10,3,0,0,0,0,0,21,14.3,1.1,7,3,42.9,9:43,583,12.7,10.0,7.0,9.0,3.238889,1.234991
4,https://en.khl.ru/players/19200/,Dmitry Ambrozheichik,Skater,Regular season,2015/2016,Dinamo (Minsk),24.0,11,1,0,1,1,3,2,0,1,0,0,0,0,0,5,20.0,0.5,0,0,,4:43,283,8.5,1.0,0.0,1.0,0.864722,1.156441
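
As a quick sanity check of the two derived columns, the first row of the output above can be recomputed by hand. This is only an illustrative sketch: the variable names are made up for the example, and the input numbers are copied from Sergei Abramov's 2014/2015 row.

```python
# Values copied from the first row of the table above.
games = 13
icetime_game_seconds = 417
points = 1

# Total icetime over the season, in hours.
icetime_hours = games * icetime_game_seconds / 3600
# Points per 60 minutes of icetime.
points_per_60 = points / icetime_hours

print(round(icetime_hours, 6), round(points_per_60, 6))
```

Both rounded values should agree with the Icetime and Points_average columns of that row (1.505833 and 0.664084).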


Since we are using averages, we need to ensure that every player has participated at a certain bare minimum during the season. This could be accounted for in two ways, based on either games played or icetime recorded. For now, icetime looks like a good choice. Let us set the minimum requirement at 2 hours.

On a related note, let us drop all playoff seasons from the data. Not only do they tend to be fairly short, so most of them would be filtered out by the icetime requirement anyway, but playoff matches also tend to behave somewhat differently from regular season ones.

In [3]:
data = data[data['Season'] == 'Regular season']
data = data[data['Icetime'] >= 2]

Important note! As indicated below, we have some rows where Team is specified as "Summary:". That happens when a player has changed teams during the season, so he ends up with a separate row of statistics for each team and one more for them combined.

In [4]:
data[35:38]

Unnamed: 0,Profile,Player,Position,Season,Year,Team,Number,Games,Goals,Assists,Points,Plus_minus,Plus,Minus,Penalties,Goals_even,Goals_powerplay,Goals_shorthanded,Goals_overtime,Game_winning_goals,Game_winning_shootouts,Shots,Shots_percentage,Shots_game,Faceoffs,Faceoffs_won,Faceoffs_percentage,Icetime_game,Icetime_game_seconds,Shifts_game,Hits,Shots_blocked,Penalties_against,Icetime,Points_average
65,https://en.khl.ru/players/3989/,Vitaly Atyushov,Skater,Regular season,2008/2009,Metallurg (Magnitogorsk),27.0,55,8,27,35,17,51,34,34,6,1,1,0,1,0,124,6.5,2.3,1,0,0.0,22:39,1359,24.1,,,,20.7625,1.685731
66,https://en.khl.ru/players/23434/,Jonas Ahnelov,Skater,Regular season,2017/2018,Avangard (Omsk),5.0,42,3,7,10,-3,18,21,8,2,1,0,0,0,0,47,6.4,1.1,0,0,,15:01,901,22.0,28.0,51.0,1.0,10.511667,0.951324
68,https://en.khl.ru/players/23434/,Jonas Ahnelov,Skater,Regular season,2016/2017,Avangard (Omsk),5.0,44,4,6,10,16,36,20,28,3,1,0,0,1,0,48,8.3,1.1,0,0,,18:13,1093,25.6,28.0,57.0,4.0,13.358889,0.748565


Ideally, we want to take such cases into account, but that presents problems of its own. If we keep just the summary, we cannot include teams in our model. If we keep the statistics for each team separately, the rows might fail the icetime requirement even if the player had enough total icetime that season to be included. For now, let us go with the latter approach.

In [5]:
data = data[data['Team'] != 'Summary:']
data.head(5)

Unnamed: 0,Profile,Player,Position,Season,Year,Team,Number,Games,Goals,Assists,Points,Plus_minus,Plus,Minus,Penalties,Goals_even,Goals_powerplay,Goals_shorthanded,Goals_overtime,Game_winning_goals,Game_winning_shootouts,Shots,Shots_percentage,Shots_game,Faceoffs,Faceoffs_won,Faceoffs_percentage,Icetime_game,Icetime_game_seconds,Shifts_game,Hits,Shots_blocked,Penalties_against,Icetime,Points_average
3,https://en.khl.ru/players/19200/,Dmitry Ambrozheichik,Skater,Regular season,2016/2017,Dinamo (Minsk),15.0,20,3,1,4,1,5,4,10,3,0,0,0,0,0,21,14.3,1.1,7,3,42.9,9:43,583,12.7,10.0,7.0,9.0,3.238889,1.234991
6,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,Skater,Regular season,2010/2011,Lokomotiv (Yaroslavl),57.0,52,5,14,19,22,51,29,80,2,2,1,0,1,0,81,6.2,1.6,3,0,0.0,20:16,1216,22.7,,,,17.564444,1.081731
8,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,Skater,Regular season,2009/2010,Lokomotiv (Yaroslavl),57.0,52,7,11,18,10,38,28,50,1,6,0,0,0,0,105,6.7,2.0,0,0,,21:31,1291,23.6,,,,18.647778,0.965262
10,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,Skater,Regular season,2008/2009,Lokomotiv (Yaroslavl),57.0,40,2,10,12,13,32,19,44,0,1,1,0,2,0,66,3.0,1.6,0,0,,18:35,1115,22.0,,,,12.388889,0.96861
11,https://en.khl.ru/players/20844/,Semyon Afonasyevsky,Skater,Regular season,2016/2017,Traktor (Chelyabinsk),85.0,28,1,0,1,-3,4,7,4,1,0,0,0,0,0,29,3.4,1.0,0,0,,8:35,515,11.9,17.0,11.0,2.0,4.005556,0.249653
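
The trade-off described above (per-team rows can fail the icetime threshold even when the season total would pass) could in principle be softened by applying the threshold to the season total while still keeping the per-team rows. The sketch below shows one way to do that on a made-up toy frame. It is not what this notebook does, and the profiles, team names and numbers are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: player "a" changed teams mid-season, player "b" did not.
toy = pd.DataFrame({
    "Profile": ["a", "a", "a", "b"],
    "Year": ["2016/2017"] * 4,
    "Team": ["Summary:", "Team X", "Team Y", "Team Z"],
    "Icetime": [2.5, 1.5, 1.0, 1.2],
})

# Keep only the per-team rows, as in the notebook.
per_team = toy[toy["Team"] != "Summary:"]

# Total icetime per player and season, aligned with the per-team rows.
season_total = per_team.groupby(["Profile", "Year"])["Icetime"].transform("sum")

# Apply the 2-hour threshold to the season total instead of each row.
kept = per_team[season_total >= 2]
print(kept)
```

Here both rows of player "a" survive (1.5 + 1.0 hours in total) even though neither would pass the threshold on its own, while player "b" is dropped.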


We are first going to try predicting based on the latest two seasons the player has participated in. Important note: those seasons are not necessarily the immediately preceding ones, as a player might have skipped some seasons or not played enough in them to be included in our analysis. And since we need values for at least the current and the two latest seasons for each player, players with fewer than 3 seasons in the data have to be dropped altogether.

In [6]:
data = data.groupby('Profile').filter(lambda x: len(x) > 2)
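
To make the behaviour of `groupby(...).filter(...)` concrete, here is a small toy example (the profiles and years are made up). Note that the condition counts rows rather than distinct seasons, so a player who changed teams mid-season contributes more than one row for that season.

```python
import pandas as pd

# Hypothetical toy data: player "a" has three rows, player "b" only two.
toy = pd.DataFrame({
    "Profile": ["a", "a", "a", "b", "b"],
    "Year": ["2012/2013", "2011/2012", "2010/2011", "2012/2013", "2011/2012"],
})

# Keep only the groups (players) with more than two rows.
kept = toy.groupby("Profile").filter(lambda x: len(x) > 2)
print(kept)
```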

In [7]:
# To avoid typing column lists manually.
data.columns

Index(['Profile', 'Player', 'Position', 'Season', 'Year', 'Team', 'Number',
       'Games', 'Goals', 'Assists', 'Points', 'Plus_minus', 'Plus', 'Minus',
       'Penalties', 'Goals_even', 'Goals_powerplay', 'Goals_shorthanded',
       'Goals_overtime', 'Game_winning_goals', 'Game_winning_shootouts',
       'Shots', 'Shots_percentage', 'Shots_game', 'Faceoffs', 'Faceoffs_won',
       'Faceoffs_percentage', 'Icetime_game', 'Icetime_game_seconds',
       'Shifts_game', 'Hits', 'Shots_blocked', 'Penalties_against', 'Icetime',
       'Points_average'],
      dtype='object')

In [8]:
# We can drop all unnecessary columns now.
drop_list = ['Position', 'Season', 'Number', 'Games', 'Goals', 'Assists', 'Points', 'Plus_minus', 'Plus',
             'Minus', 'Penalties', 'Goals_even', 'Goals_powerplay', 'Goals_shorthanded', 'Goals_overtime',
             'Game_winning_goals', 'Game_winning_shootouts', 'Shots', 'Shots_percentage', 'Shots_game',
             'Faceoffs', 'Faceoffs_won', 'Faceoffs_percentage', 'Icetime_game', 'Icetime_game_seconds',
             'Shifts_game', 'Hits', 'Shots_blocked', 'Penalties_against', 'Icetime']
data.drop(drop_list, axis=1, inplace=True)
data.reset_index(drop=True, inplace=True)
data.head()

Unnamed: 0,Profile,Player,Year,Team,Points_average
0,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,2010/2011,Lokomotiv (Yaroslavl),1.081731
1,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,2009/2010,Lokomotiv (Yaroslavl),0.965262
2,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,2008/2009,Lokomotiv (Yaroslavl),0.96861
3,https://en.khl.ru/players/14763/,Sergei Andronov,2020/2021,CSKA (Moscow),1.811594
4,https://en.khl.ru/players/14763/,Sergei Andronov,2019/2020,CSKA (Moscow),0.450563


We now need to add the year/team/points data from the past two seasons to the dataframe. The simplest way to do this is by adding shifted versions of the same columns under different column names. However, the row below does not necessarily contain data for the same player. To account for that, we are going to include both the past profile and the past player name in the output dataframe and check that they remain the same.

In [9]:
# shift(-1) moves the rows up, so each row picks up the values from the row below it (the previous entry in the data).
# All columns are getting a T<n>_ prefix indicating their timeshift relative to the current period.
header = ['T0_Profile', 'T0_Player', 'T0_Year', 'T0_Team', 'T0_Points']
header_1 = ['T1_Profile', 'T1_Player', 'T1_Year', 'T1_Team', 'T1_Points']
header_2 = ['T2_Profile', 'T2_Player', 'T2_Year', 'T2_Team', 'T2_Points']
data.columns = header
data[header_1] = data[header].shift(-1)
data[header_2] = data[header].shift(-2)
data.head()

Unnamed: 0,T0_Profile,T0_Player,T0_Year,T0_Team,T0_Points,T1_Profile,T1_Player,T1_Year,T1_Team,T1_Points,T2_Profile,T2_Player,T2_Year,T2_Team,T2_Points
0,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,2010/2011,Lokomotiv (Yaroslavl),1.081731,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,2009/2010,Lokomotiv (Yaroslavl),0.965262,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,2008/2009,Lokomotiv (Yaroslavl),0.96861
1,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,2009/2010,Lokomotiv (Yaroslavl),0.965262,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,2008/2009,Lokomotiv (Yaroslavl),0.96861,https://en.khl.ru/players/14763/,Sergei Andronov,2020/2021,CSKA (Moscow),1.811594
2,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,2008/2009,Lokomotiv (Yaroslavl),0.96861,https://en.khl.ru/players/14763/,Sergei Andronov,2020/2021,CSKA (Moscow),1.811594,https://en.khl.ru/players/14763/,Sergei Andronov,2019/2020,CSKA (Moscow),0.450563
3,https://en.khl.ru/players/14763/,Sergei Andronov,2020/2021,CSKA (Moscow),1.811594,https://en.khl.ru/players/14763/,Sergei Andronov,2019/2020,CSKA (Moscow),0.450563,https://en.khl.ru/players/14763/,Sergei Andronov,2018/2019,CSKA (Moscow),1.013514
4,https://en.khl.ru/players/14763/,Sergei Andronov,2019/2020,CSKA (Moscow),0.450563,https://en.khl.ru/players/14763/,Sergei Andronov,2018/2019,CSKA (Moscow),1.013514,https://en.khl.ru/players/14763/,Sergei Andronov,2017/2018,CSKA (Moscow),0.809262


In [10]:
data.tail(5)

Unnamed: 0,T0_Profile,T0_Player,T0_Year,T0_Team,T0_Points,T1_Profile,T1_Player,T1_Year,T1_Team,T1_Points,T2_Profile,T2_Player,T2_Year,T2_Team,T2_Points
7642,https://en.khl.ru/players/23355/,Denis Zernov,2017/2018,Lada (Togliatti),1.615499,https://en.khl.ru/players/23355/,Denis Zernov,2016/2017,Lada (Togliatti),1.487272,https://en.khl.ru/players/16217/,Airat Ziazov,2014/2015,Vityaz (Moscow Region),0.0
7643,https://en.khl.ru/players/23355/,Denis Zernov,2016/2017,Lada (Togliatti),1.487272,https://en.khl.ru/players/16217/,Airat Ziazov,2014/2015,Vityaz (Moscow Region),0.0,https://en.khl.ru/players/16217/,Airat Ziazov,2013/2014,Vityaz (Moscow Region),2.089286
7644,https://en.khl.ru/players/16217/,Airat Ziazov,2014/2015,Vityaz (Moscow Region),0.0,https://en.khl.ru/players/16217/,Airat Ziazov,2013/2014,Vityaz (Moscow Region),2.089286,https://en.khl.ru/players/16217/,Airat Ziazov,2012/2013,Neftekhimik (Nizhnekamsk),0.239187
7645,https://en.khl.ru/players/16217/,Airat Ziazov,2013/2014,Vityaz (Moscow Region),2.089286,https://en.khl.ru/players/16217/,Airat Ziazov,2012/2013,Neftekhimik (Nizhnekamsk),0.239187,,,,,
7646,https://en.khl.ru/players/16217/,Airat Ziazov,2012/2013,Neftekhimik (Nizhnekamsk),0.239187,,,,,,,,,,
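
The cross-player rows visible above (for example row 7643, where Denis Zernov's current season is paired with Airat Ziazov's past seasons) are what the profile check in the next cell is meant to remove. As an aside, one possible alternative is to shift within each player's group so that such rows never get filled in the first place. The sketch below uses a made-up toy frame and is not the approach taken in this notebook.

```python
import pandas as pd

# Hypothetical toy data, sorted newest season first within each player.
toy = pd.DataFrame({
    "Profile": ["a", "a", "a", "b", "b", "b"],
    "Year": ["2012/2013", "2011/2012", "2010/2011",
             "2015/2016", "2014/2015", "2013/2014"],
    "Points": [1.0, 0.9, 0.8, 2.0, 1.5, 1.2],
})

# Shifting inside each group leaves NaN at player boundaries
# instead of pulling in another player's values.
t1 = toy.groupby("Profile")["Points"].shift(-1)
t2 = toy.groupby("Profile")["Points"].shift(-2)
toy["T1_Points"] = t1
toy["T2_Points"] = t2

# Rows with both past periods available.
complete = toy.dropna(subset=["T1_Points", "T2_Points"])
print(complete)
```

With this variant the equality check on profiles becomes a simple `dropna`, at the cost of shifting each column group separately.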


We are also encountering another issue here. Since some players have changed teams during the season, we are getting observations where the latest of the three time periods is season 2009/2010. And, what is worse, some players even played for three teams in season 2008/2009. Not only do those seasons have few observations, they are also not representative, as only players who changed teams can have either of those seasons as the latest of the three. Therefore, let us drop them.

In [11]:
# Now dropping the rows which contain data for different players in them.
data = data[(data['T0_Profile'] == data['T1_Profile']) & (data['T0_Profile'] == data['T2_Profile'])]
# Also dropping the rows where the current period is one of the first two seasons (see the note above).
data = data[(data['T0_Year'] != '2008/2009') & (data['T0_Year'] != '2009/2010')]
data.reset_index(drop=True, inplace=True)
data.head()

Unnamed: 0,T0_Profile,T0_Player,T0_Year,T0_Team,T0_Points,T1_Profile,T1_Player,T1_Year,T1_Team,T1_Points,T2_Profile,T2_Player,T2_Year,T2_Team,T2_Points
0,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,2010/2011,Lokomotiv (Yaroslavl),1.081731,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,2009/2010,Lokomotiv (Yaroslavl),0.965262,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,2008/2009,Lokomotiv (Yaroslavl),0.96861
1,https://en.khl.ru/players/14763/,Sergei Andronov,2020/2021,CSKA (Moscow),1.811594,https://en.khl.ru/players/14763/,Sergei Andronov,2019/2020,CSKA (Moscow),0.450563,https://en.khl.ru/players/14763/,Sergei Andronov,2018/2019,CSKA (Moscow),1.013514
2,https://en.khl.ru/players/14763/,Sergei Andronov,2019/2020,CSKA (Moscow),0.450563,https://en.khl.ru/players/14763/,Sergei Andronov,2018/2019,CSKA (Moscow),1.013514,https://en.khl.ru/players/14763/,Sergei Andronov,2017/2018,CSKA (Moscow),0.809262
3,https://en.khl.ru/players/14763/,Sergei Andronov,2018/2019,CSKA (Moscow),1.013514,https://en.khl.ru/players/14763/,Sergei Andronov,2017/2018,CSKA (Moscow),0.809262,https://en.khl.ru/players/14763/,Sergei Andronov,2016/2017,CSKA (Moscow),0.798509
4,https://en.khl.ru/players/14763/,Sergei Andronov,2017/2018,CSKA (Moscow),0.809262,https://en.khl.ru/players/14763/,Sergei Andronov,2016/2017,CSKA (Moscow),0.798509,https://en.khl.ru/players/14763/,Sergei Andronov,2015/2016,CSKA (Moscow),0.952822


In [12]:
# Everything seems in order, we can drop the duplicate columns now.
data.drop(['T1_Profile', 'T1_Player', 'T2_Profile', 'T2_Player'], axis=1, inplace=True)
data.head()

Unnamed: 0,T0_Profile,T0_Player,T0_Year,T0_Team,T0_Points,T1_Year,T1_Team,T1_Points,T2_Year,T2_Team,T2_Points
0,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,2010/2011,Lokomotiv (Yaroslavl),1.081731,2009/2010,Lokomotiv (Yaroslavl),0.965262,2008/2009,Lokomotiv (Yaroslavl),0.96861
1,https://en.khl.ru/players/14763/,Sergei Andronov,2020/2021,CSKA (Moscow),1.811594,2019/2020,CSKA (Moscow),0.450563,2018/2019,CSKA (Moscow),1.013514
2,https://en.khl.ru/players/14763/,Sergei Andronov,2019/2020,CSKA (Moscow),0.450563,2018/2019,CSKA (Moscow),1.013514,2017/2018,CSKA (Moscow),0.809262
3,https://en.khl.ru/players/14763/,Sergei Andronov,2018/2019,CSKA (Moscow),1.013514,2017/2018,CSKA (Moscow),0.809262,2016/2017,CSKA (Moscow),0.798509
4,https://en.khl.ru/players/14763/,Sergei Andronov,2017/2018,CSKA (Moscow),0.809262,2016/2017,CSKA (Moscow),0.798509,2015/2016,CSKA (Moscow),0.952822


In [13]:
# We can now get dummies for our years and teams.
dummies = data.copy()
dummies.drop(['T0_Profile', 'T0_Player'], axis=1, inplace=True)
dummies = pd.get_dummies(dummies, drop_first=True)
dummies.head()

Unnamed: 0,T0_Points,T1_Points,T2_Points,T0_Year_2011/2012,T0_Year_2012/2013,T0_Year_2013/2014,T0_Year_2014/2015,T0_Year_2015/2016,T0_Year_2016/2017,T0_Year_2017/2018,T0_Year_2018/2019,T0_Year_2019/2020,T0_Year_2020/2021,T0_Team_Ak Bars (Kazan),T0_Team_Amur (Khabarovsk),T0_Team_Atlant (Moscow Region),T0_Team_Avangard (Omsk),T0_Team_Avtomobilist (Ekaterinburg),T0_Team_Barys (Nur-Sultan),T0_Team_CSKA (Moscow),T0_Team_Dinamo (Minsk),T0_Team_Dinamo (Riga),T0_Team_Donbass (Donetsk),T0_Team_Dynamo (Moscow),T0_Team_HC Dynamo (Moscow),T0_Team_Jokerit (Helsinki),T0_Team_Kunlun Red Star (Beijing),T0_Team_Lada (Togliatti),T0_Team_Lev (Poprad),T0_Team_Lev (Praha),T0_Team_Lokomotiv (Yaroslavl),T0_Team_Medvescak (Zagreb),T0_Team_Metallurg (Magnitogorsk),T0_Team_Metallurg (Novokuznetsk),T0_Team_Neftekhimik (Nizhnekamsk),T0_Team_OHC Dynamo (Moscow),T0_Team_SKA (Saint Petersburg),T0_Team_Salavat Yulaev (Ufa),T0_Team_Severstal (Cherepovets),T0_Team_Sibir (Novosibirsk Region),T0_Team_Slovan (Bratislava),T0_Team_Sochi (Sochi),T0_Team_Spartak (Moscow),T0_Team_Torpedo (Nizhny Novgorod Region),T0_Team_Traktor (Chelyabinsk),T0_Team_Ugra (Khanty-Mansiysk),T0_Team_Vityaz (Moscow Region),T1_Year_2009/2010,T1_Year_2010/2011,T1_Year_2011/2012,T1_Year_2012/2013,T1_Year_2013/2014,T1_Year_2014/2015,T1_Year_2015/2016,T1_Year_2016/2017,T1_Year_2017/2018,T1_Year_2018/2019,T1_Year_2019/2020,T1_Year_2020/2021,T1_Team_Ak Bars (Kazan),T1_Team_Amur (Khabarovsk),T1_Team_Atlant (Moscow Region),T1_Team_Avangard (Omsk),T1_Team_Avtomobilist (Ekaterinburg),T1_Team_Barys (Nur-Sultan),T1_Team_CSKA (Moscow),T1_Team_Dinamo (Minsk),T1_Team_Dinamo (Riga),T1_Team_Donbass (Donetsk),T1_Team_Dynamo (Moscow),T1_Team_HC Dynamo (Moscow),T1_Team_HC MVD (Moscow Region),T1_Team_Jokerit (Helsinki),T1_Team_Kunlun Red Star (Beijing),T1_Team_Lada (Togliatti),T1_Team_Lev (Poprad),T1_Team_Lev (Praha),T1_Team_Lokomotiv (Yaroslavl),T1_Team_Medvescak (Zagreb),T1_Team_Metallurg (Magnitogorsk),T1_Team_Metallurg (Novokuznetsk),T1_Team_Neftekhimik (Nizhnekamsk),T1_Team_OHC Dynamo (Moscow),T1_Team_SKA (Saint Petersburg),T1_Team_Salavat Yulaev (Ufa),T1_Team_Severstal (Cherepovets),T1_Team_Sibir (Novosibirsk Region),T1_Team_Slovan (Bratislava),T1_Team_Sochi (Sochi),T1_Team_Spartak (Moscow),T1_Team_Torpedo (Nizhny Novgorod Region),T1_Team_Traktor (Chelyabinsk),T1_Team_Ugra (Khanty-Mansiysk),T1_Team_Vityaz (Moscow Region),T2_Year_2009/2010,T2_Year_2010/2011,T2_Year_2011/2012,T2_Year_2012/2013,T2_Year_2013/2014,T2_Year_2014/2015,T2_Year_2015/2016,T2_Year_2016/2017,T2_Year_2017/2018,T2_Year_2018/2019,T2_Year_2019/2020,T2_Team_Ak Bars (Kazan),T2_Team_Amur (Khabarovsk),T2_Team_Atlant (Moscow Region),T2_Team_Avangard (Omsk),T2_Team_Avtomobilist (Ekaterinburg),T2_Team_Barys (Nur-Sultan),T2_Team_CSKA (Moscow),T2_Team_Dinamo (Minsk),T2_Team_Dinamo (Riga),T2_Team_Donbass (Donetsk),T2_Team_Dynamo (Moscow),T2_Team_HC Dynamo (Moscow),T2_Team_HC MVD (Moscow Region),T2_Team_Jokerit (Helsinki),T2_Team_Khimik (Voskresensk),T2_Team_Kunlun Red Star (Beijing),T2_Team_Lada (Togliatti),T2_Team_Lev (Poprad),T2_Team_Lev (Praha),T2_Team_Lokomotiv (Yaroslavl),T2_Team_Medvescak (Zagreb),T2_Team_Metallurg (Magnitogorsk),T2_Team_Metallurg (Novokuznetsk),T2_Team_Neftekhimik (Nizhnekamsk),T2_Team_OHC Dynamo (Moscow),T2_Team_SKA (Saint Petersburg),T2_Team_Salavat Yulaev (Ufa),T2_Team_Severstal (Cherepovets),T2_Team_Sibir (Novosibirsk Region),T2_Team_Slovan (Bratislava),T2_Team_Sochi (Sochi),T2_Team_Spartak (Moscow),T2_Team_Torpedo (Nizhny Novgorod Region),T2_Team_Traktor (Chelyabinsk),T2_Team_Ugra (Khanty-Mansiysk),T2_Team_Vityaz (Moscow Region)
0,1.081731,0.965262,0.96861,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.811594,0.450563,1.013514,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0.450563,1.013514,0.809262,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1.013514,0.809262,0.798509,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0.809262,0.798509,0.952822,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
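
For readers unfamiliar with `drop_first=True`, the toy example below (hypothetical team names) shows its effect: the first category of each encoded column is dropped and becomes the implicit baseline against which the other dummies are measured. `dtype=int` is passed only to get 0/1 values as in the output above, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical toy data with one numeric and one categorical column.
toy = pd.DataFrame({
    "Points": [1.0, 0.5, 0.8],
    "Team": ["Team A", "Team B", "Team C"],
})

# Numeric columns pass through untouched; "Team A" becomes the dropped baseline.
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

This is why the encoded frame above has no `T0_Year_2010/2011` column: that season is the baseline for the `T0_Year` dummies.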


We can now try fitting machine learning models on the prepared data. Let us start with a simple linear regression before moving on to random forests and boosted trees.

In [14]:
from sklearn.model_selection import train_test_split

In [15]:
# We are trying to predict how many points a player gets this new season based off everything else.
y = dummies['T0_Points'].copy()
X = dummies.drop('T0_Points', axis=1).copy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [16]:
from sklearn.linear_model import LinearRegression

In [17]:
linear = LinearRegression(n_jobs=-1)

In [18]:
linear.fit(X_train, y_train)

LinearRegression(n_jobs=-1)

In [19]:
y_pred = linear.predict(X_test)

In [20]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

In [21]:
y_train.mean()

1.2295533281568383

In [22]:
y_test.mean()

1.2472970748957866

In [23]:
y_pred.mean()

1.2270979409610767

In [24]:
mean_squared_error(y_test, y_pred)

0.3095964715535181

In [25]:
mean_absolute_error(y_test, y_pred)

0.43046456025457347

In [26]:
r2_score(y_test, y_pred)

0.4793434263510711
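
To make the three metrics above easier to interpret, the sketch below recomputes them by hand with NumPy on a tiny made-up example. The formulas are the standard definitions; the numbers are purely illustrative and unrelated to the model's results.

```python
import numpy as np

# Hypothetical true and predicted values.
y_true = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 2.0, 2.5])

residuals = y_true - y_hat

# Mean squared error and mean absolute error.
mse = np.mean(residuals ** 2)
mae = np.mean(np.abs(residuals))

# R^2 compares the model's squared error with that of always predicting the mean.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, mae, r2)
```

Read this way, the R² of about 0.48 above means the regression removes roughly half of the squared error that a constant prediction at the test-set mean would have. Taking the square root of the reported MSE gives a root-mean-square error of roughly 0.56 points per 60 minutes.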

In [27]:
# Getting coefficients in a more readable form.
linear_coef = pd.DataFrame(zip(X.columns, linear.coef_))
linear_coef.columns = ['Feature', 'Coefficient']
linear_coef

Unnamed: 0,Feature,Coefficient
0,T1_Points,0.429645
1,T2_Points,0.325855
2,T0_Year_2011/2012,-0.064013
3,T0_Year_2012/2013,-0.200282
4,T0_Year_2013/2014,-0.1261
5,T0_Year_2014/2015,-0.111896
6,T0_Year_2015/2016,-0.179117
7,T0_Year_2016/2017,-0.235924
8,T0_Year_2017/2018,-0.117033
9,T0_Year_2018/2019,-0.012113


We can still improve on the way coefficients are stored. After all, we have 3 sets of coefficients for three different time periods. Why not give each period its own column?

In [28]:
# Splitting the dataframe into three.
T0_coef = linear_coef[linear_coef['Feature'].str.startswith('T0_')].copy().reset_index(drop=True)
T1_coef = linear_coef[linear_coef['Feature'].str.startswith('T1_')].copy().reset_index(drop=True)
T2_coef = linear_coef[linear_coef['Feature'].str.startswith('T2_')].copy().reset_index(drop=True)

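# Removing the period prefixes (T0_, T1_, T2_) from feature names.
# Assigning the result back avoids relying on in-place modification of a column.
T0_coef['Feature'] = T0_coef['Feature'].str.replace('^T0_', '', regex=True)
T1_coef['Feature'] = T1_coef['Feature'].str.replace('^T1_', '', regex=True)
T2_coef['Feature'] = T2_coef['Feature'].str.replace('^T2_', '', regex=True)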

# Changing column names for coefficients.
T0_coef.columns = ['Feature', 'Period_0']
T1_coef.columns = ['Feature', 'Period_1']
T2_coef.columns = ['Feature', 'Period_2']

# Final dataframe.
linear_coef = T0_coef.merge(T1_coef, how='outer', on='Feature').merge(T2_coef, how='outer', on='Feature')
linear_coef

Unnamed: 0,Feature,Period_0,Period_1,Period_2
0,Year_2011/2012,-0.064013,0.90106,0.167746
1,Year_2012/2013,-0.200282,0.646707,0.114594
2,Year_2013/2014,-0.1261,0.723344,0.176152
3,Year_2014/2015,-0.111896,0.670813,0.154804
4,Year_2015/2016,-0.179117,0.765797,0.174308
5,Year_2016/2017,-0.235924,0.639734,0.214877
6,Year_2017/2018,-0.117033,0.455714,0.161237
7,Year_2018/2019,-0.012113,0.520274,0.292618
8,Year_2019/2020,-0.068847,0.574191,0.226517
9,Year_2020/2021,-0.066881,0.545501,
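
The split, rename and merge steps above can also be expressed as a single reshape. The sketch below is one possible alternative on a made-up coefficient table (the values are hypothetical): it splits each feature name at the first underscore into a period and a base name, then pivots the periods into columns.

```python
import pandas as pd

# Hypothetical coefficients in the same long format as linear_coef.
coef = pd.DataFrame({
    "Feature": ["T1_Points", "T2_Points",
                "T0_Year_2011/2012", "T1_Year_2011/2012", "T2_Year_2011/2012"],
    "Coefficient": [0.4, 0.3, -0.06, 0.9, 0.17],
})

# Split "T1_Year_2011/2012" into "T1" and "Year_2011/2012".
parts = coef["Feature"].str.split("_", n=1, expand=True)
coef["Period"] = parts[0]
coef["Name"] = parts[1]

# One row per base feature, one column per period; missing combinations become NaN.
wide = coef.pivot(index="Name", columns="Period", values="Coefficient")
print(wide)
```

Features that exist for only some periods (such as Points, which has no T0 coefficient because T0_Points is the target) simply show NaN in the missing column.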


We will need to rearrange the columns for sure. More importantly, why are the coefficients of the Period 1 year features so high? Something is definitely not right here.