In [1]:
# Importing standard packages for data exploration and processing.

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

This notebook is going to focus on processing the players' season statistics only. The player information and performance statistics for individual matches will be processed in two separate notebooks.

In [2]:
# Unlike in the Stage 1 notebooks, we are going to create new variables rather than perform the operations in-place here.
# The reason is that we might need to review the original data during processing.

raw_players_season = pd.read_csv('raw_players_season.csv')

In [3]:
# Does everything seem to be alright with the data?
raw_players_season

Unnamed: 0,URL,Player name,Season,Tournament / Team,№,GP,G,Assists,PTS,+/-,...,FOA,W,L,SOP,GA,Sv,%Sv,GAA,SO,TOI
0,https://en.khl.ru/players/16673/,Sergei Abramov,Regular season 2014/2015,Amur (Khabarovsk),93.0,13.0,1.0,0.0,1.0,-4.0,...,1.0,,,,,,,,,
1,https://en.khl.ru/players/16673/,Sergei Abramov,Nadezhda Cup 2013/2014,Amur (Khabarovsk),91.0,2.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,
2,https://en.khl.ru/players/16673/,Sergei Abramov,Regular season 2013/2014,Amur (Khabarovsk),91.0,12.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,
3,https://en.khl.ru/players/16673/,Sergei Abramov,Nadezhda Cup 2012/2013,Amur (Khabarovsk),99.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,
4,https://en.khl.ru/players/16673/,Sergei Abramov,KHL Summary,Regular season:,,25.0,1.0,0.0,1.0,-4.0,...,1.0,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28643,https://en.khl.ru/players/23656/,Tomislav Zanoski,KHL Summary,KHL Total:,,39.0,5.0,1.0,6.0,-7.0,...,8.0,,,,,,,,,
28644,https://en.khl.ru/players/11543/,Alexander Zevakhin,Regular season 2009/2010,Severstal (Cherepovets),15.0,20.0,1.0,1.0,2.0,-5.0,...,,,,,,,,,,
28645,https://en.khl.ru/players/11543/,Alexander Zevakhin,Regular season 2008/2009,Severstal (Cherepovets),4.0,44.0,3.0,6.0,9.0,2.0,...,,,,,,,,,,
28646,https://en.khl.ru/players/11543/,Alexander Zevakhin,KHL Summary,Regular season:,,64.0,4.0,7.0,11.0,-3.0,...,0.0,,,,,,,,,


This dataframe contains statistics for individual seasons as well as for the player's whole KHL career. Since those two cannot be combined in a single analysis, we are going to separate the data into two dataframes based on the Season column.

In [7]:
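# Rows labelled 'KHL Summary' hold career totals; every other row is a single season.
# .copy() gives each dataframe its own data, so later edits do not trigger SettingWithCopyWarning.
players_season = raw_players_season[raw_players_season['Season'] != 'KHL Summary'].copy()
players_career = raw_players_season[raw_players_season['Season'] == 'KHL Summary'].copy()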

Both season and career statistics are likely to share many issues, as they were stored as a single table and are closely related. However, let us check first how similar the two seem.
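A compact way to compare two dataframes is to put their non-null counts side by side. The snippet below is only a sketch on made-up toy data, not the real season and career tables; the names `season`, `career` and `comparison` are placeholders introduced for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the season and career dataframes (values are made up).
season = pd.DataFrame({'GP': [13.0, 2.0, 12.0], 'PTS': [1.0, 0.0, np.nan]})
career = pd.DataFrame({'GP': [25.0, 27.0], 'PTS': [1.0, np.nan]})

# Non-null counts per column, aligned on the column name.
comparison = pd.DataFrame({
    'season_non_null': season.notna().sum(),
    'career_non_null': career.notna().sum(),
})

# Shares are easier to compare than raw counts when the frames differ in size.
comparison['season_share'] = comparison['season_non_null'] / len(season)
comparison['career_share'] = comparison['career_non_null'] / len(career)

print(comparison)
```

Columns whose shares differ sharply between the two frames are the ones worth a closer look.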

In [10]:
players_season.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19674 entries, 0 to 28645
Data columns (total 39 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   URL                19674 non-null  object 
 1   Player name        19674 non-null  object 
 2   Season             19674 non-null  object 
 3   Tournament / Team  19674 non-null  object 
 4   №                  18677 non-null  float64
 5   GP                 19674 non-null  float64
 6   G                  19674 non-null  float64
 7   Assists            19674 non-null  float64
 8   PTS                17753 non-null  float64
 9   +/-                17753 non-null  float64
 10  +                  17753 non-null  float64
 11  -                  17753 non-null  float64
 12  PIM                19674 non-null  float64
 13  ESG                17753 non-null  float64
 14  PPG                17753 non-null  float64
 15  SHG                17753 non-null  float64
 16  OTG                177

In [11]:
players_career.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8974 entries, 4 to 28647
Data columns (total 39 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   URL                8974 non-null   object 
 1   Player name        8974 non-null   object 
 2   Season             8974 non-null   object 
 3   Tournament / Team  8974 non-null   object 
 4   №                  0 non-null      float64
 5   GP                 8974 non-null   float64
 6   G                  8974 non-null   float64
 7   Assists            8974 non-null   float64
 8   PTS                7859 non-null   float64
 9   +/-                7859 non-null   float64
 10  +                  7405 non-null   float64
 11  -                  7405 non-null   float64
 12  PIM                8974 non-null   float64
 13  ESG                7859 non-null   float64
 14  PPG                7859 non-null   float64
 15  SHG                7859 non-null   float64
 16  OTG                7859

We can see that in many columns there is no missing data at all. At the same time, for other columns there is a clear separation into skaters (forwards and defencemen) and goalies.

For example, we can see that the season statistics appear to have 17753 rows of data for skaters and 1919 rows for goalies, for a total of 19672 rows. However, there are 19674 rows in the dataframe, so 2 rows are unaccounted for in either group.

The same thing is happening in the career statistics, with 7859 rows for skaters and 1113 rows for goalies. That sums to 8972 rows, while the dataframe has 8974 rows, so again 2 rows are unaccounted for.
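The bookkeeping above can be done in code rather than by reading `info()` output. This is a minimal sketch on a toy dataframe, not the real data. The helper `count_unaccounted` is hypothetical, and it assumes that `PTS` is filled only for skaters and `TOI` only for goalies.

```python
import numpy as np
import pandas as pd

# Toy stand-in: 3 skaters, 2 goalies and 1 row with neither kind of data.
toy = pd.DataFrame({
    'Player name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'PTS': [10.0, 3.0, 0.0, np.nan, np.nan, np.nan],          # assumed skater-only
    'TOI': [np.nan, np.nan, np.nan, 1200.0, 300.0, np.nan],   # assumed goalie-only
})

def count_unaccounted(df, skater_col='PTS', goalie_col='TOI'):
    """Rows that are counted neither as skaters nor as goalies."""
    skaters = df[skater_col].notna().sum()
    goalies = df[goalie_col].notna().sum()
    return int(len(df) - skaters - goalies)

unaccounted = count_unaccounted(toy)
print(unaccounted)
```

A non-zero result means some rows belong to neither group and deserve a manual look.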

Let us find out who is messing up our data. Ice time seems like a good indicator: it must be present for every player who has played at least one match during the season, and it is stored in different columns for skaters (TOI/G, average ice time per match) and goalies (TOI, total ice time per season).

In [17]:
# We need the rows for which both ice-time columns are null.

players_season[players_season['TOI/G'].isnull() & players_season['TOI'].isnull()]

Unnamed: 0,URL,Player name,Season,Tournament / Team,№,GP,G,Assists,PTS,+/-,...,FOA,W,L,SOP,GA,Sv,%Sv,GAA,SO,TOI
2924,https://en.khl.ru/players/29144/,David Boldizar,Regular season 2018/2019,Slovan (Bratislava),61.0,3.0,0.0,0.0,,,...,,,,,,,,,,
2925,https://en.khl.ru/players/29144/,David Boldizar,Regular season 2017/2018,Slovan (Bratislava),23.0,2.0,0.0,0.0,,,...,,,,,,,,,,


In [20]:
# Confirm it for the career statistics.

players_career[players_career['TOI/G'].isnull() & players_career['TOI'].isnull()]

Unnamed: 0,URL,Player name,Season,Tournament / Team,№,GP,G,Assists,PTS,+/-,...,FOA,W,L,SOP,GA,Sv,%Sv,GAA,SO,TOI
2926,https://en.khl.ru/players/29144/,David Boldizar,KHL Summary,Regular season:,,5.0,0.0,0.0,,,...,,,,,,,,,,
2927,https://en.khl.ru/players/29144/,David Boldizar,KHL Summary,KHL Total:,,5.0,0.0,0.0,,,...,,,,,,,,,,


So, David Boldizar from Slovan (Bratislava) is the culprit. I wonder what is going on with him. Thankfully, we stored each player's profile link, so we can easily check the original data. It turns out that the data was not recorded properly on the website to begin with.

In [18]:
# What do we know about that specific player?

players_season[players_season['URL'] == 'https://en.khl.ru/players/29144/'].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 2924 to 2925
Data columns (total 39 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   URL                2 non-null      object 
 1   Player name        2 non-null      object 
 2   Season             2 non-null      object 
 3   Tournament / Team  2 non-null      object 
 4   №                  2 non-null      float64
 5   GP                 2 non-null      float64
 6   G                  2 non-null      float64
 7   Assists            2 non-null      float64
 8   PTS                0 non-null      float64
 9   +/-                0 non-null      float64
 10  +                  0 non-null      float64
 11  -                  0 non-null      float64
 12  PIM                2 non-null      float64
 13  ESG                0 non-null      float64
 14  PPG                0 non-null      float64
 15  SHG                0 non-null      float64
 16  OTG                0 non

Most of the data is missing, and not because the values are supposed to be zero: a player with recorded games cannot have zero ice time. Therefore, we need to drop this player from our data altogether.

In [21]:
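# Drop every row that belongs to this player, matched by profile URL.
# The results are stored under new names (player_*), so players_season and players_career stay unchanged.
player_season = players_season[players_season['URL'] != 'https://en.khl.ru/players/29144/']
player_career = players_career[players_career['URL'] != 'https://en.khl.ru/players/29144/']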
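After a drop like this, it is worth confirming that no row is left with both ice-time columns empty. The following is a sketch on toy data with placeholder URLs and made-up ice-time values. The helper `rows_without_icetime` is hypothetical and only mirrors the filter used earlier in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in: one skater, one goalie and one player with no ice time at all.
toy = pd.DataFrame({
    'URL': ['player/1/', 'player/2/', 'player/3/', 'player/3/'],
    'TOI/G': [15.5, np.nan, np.nan, np.nan],   # placeholder values
    'TOI': [np.nan, 3600.0, np.nan, np.nan],   # placeholder values
})

def rows_without_icetime(df):
    """Rows where both the skater and the goalie ice-time columns are null."""
    return df[df['TOI/G'].isnull() & df['TOI'].isnull()]

# Collect the offending profiles and drop all of their rows.
bad_urls = rows_without_icetime(toy)['URL'].unique()
cleaned = toy[~toy['URL'].isin(bad_urls)]

# After the drop, the same filter should return nothing.
remaining = len(rows_without_icetime(cleaned))
print(remaining)
```

Filtering by the URLs found, rather than by a hard-coded link, would also catch any other player with the same problem.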