In [1]:
# Importing standard packages for data exploration and processing.

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

This notebook is going to focus on processing the players' season statistics only. The other three files will be processed in separate notebooks.

In [2]:
# Unlike in the Stage 1 notebooks, we are going to create new variables rather than perform the operations in-place here.
# The reason is that we might need to review the original data during processing.

raw_players_season = pd.read_csv('raw_players_season.csv')

In [3]:
# Does everything seem to be alright with the data?

raw_players_season

Unnamed: 0,URL,Player name,Season,Tournament / Team,№,GP,G,Assists,PTS,+/-,...,FOA,W,L,SOP,GA,Sv,%Sv,GAA,SO,TOI
0,https://en.khl.ru/players/16673/,Sergei Abramov,Regular season 2014/2015,Amur (Khabarovsk),93.0,13.0,1.0,0.0,1.0,-4.0,...,1.0,,,,,,,,,
1,https://en.khl.ru/players/16673/,Sergei Abramov,Nadezhda Cup 2013/2014,Amur (Khabarovsk),91.0,2.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,
2,https://en.khl.ru/players/16673/,Sergei Abramov,Regular season 2013/2014,Amur (Khabarovsk),91.0,12.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,
3,https://en.khl.ru/players/16673/,Sergei Abramov,Nadezhda Cup 2012/2013,Amur (Khabarovsk),99.0,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,
4,https://en.khl.ru/players/16462/,Maxim Alyapkin,Regular season 2015/2016,Torpedo (Nizhny Novgorod Region),31.0,2.0,0.0,0.0,,,...,,1.0,1.0,0.0,3.0,10.0,76.9,2.98,0.0,60:25
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19669,https://en.khl.ru/players/16217/,Airat Ziazov,Regular season 2009/2010,Neftekhimik (Nizhnekamsk),79.0,1.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,
19670,https://en.khl.ru/players/23656/,Tomislav Zanoski,Regular season 2016/2017,Medvescak (Zagreb),10.0,24.0,2.0,1.0,3.0,-8.0,...,6.0,,,,,,,,,
19671,https://en.khl.ru/players/23656/,Tomislav Zanoski,Regular season 2015/2016,Medvescak (Zagreb),10.0,15.0,3.0,0.0,3.0,1.0,...,2.0,,,,,,,,,
19672,https://en.khl.ru/players/11543/,Alexander Zevakhin,Regular season 2009/2010,Severstal (Cherepovets),15.0,20.0,1.0,1.0,2.0,-5.0,...,,,,,,,,,,


We can already see that there are some issues with missing data and integers stored as floats.
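As a side note on why the integers show up as floats: pandas has to store a whole-number column as float64 as soon as it contains a missing value, because NaN is itself a float. The sketch below uses made-up jersey numbers (not the real data) and shows one possible remedy, the nullable `Int64` dtype. Whether this is the right fix here is an assumption, not something the notebook has settled on at this point.

```python
import pandas as pd

# Toy data (hypothetical): a jersey-number column with one missing value.
numbers = pd.Series([93, 91, None])

# The missing value forces the whole column to float64.
print(numbers.dtype)

# The nullable integer dtype keeps whole numbers and marks the gap as <NA>.
nullable = numbers.astype("Int64")
print(nullable.dtype)
print(nullable.isna().sum())
```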

In [4]:
# What would the summary tell us?

raw_players_season.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19674 entries, 0 to 19673
Data columns (total 39 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   URL                19674 non-null  object 
 1   Player name        19674 non-null  object 
 2   Season             19674 non-null  object 
 3   Tournament / Team  19674 non-null  object 
 4   №                  18677 non-null  float64
 5   GP                 19674 non-null  float64
 6   G                  19674 non-null  float64
 7   Assists            19674 non-null  float64
 8   PTS                17753 non-null  float64
 9   +/-                17753 non-null  float64
 10  +                  17753 non-null  float64
 11  -                  17753 non-null  float64
 12  PIM                19674 non-null  float64
 13  ESG                17753 non-null  float64
 14  PPG                17753 non-null  float64
 15  SHG                17753 non-null  float64
 16  OTG                177

We can see that in many columns there is no missing data at all. At the same time, for other columns there is a clear separation into skaters (forwards and defencemen) and goalies.

For example, the season statistics appear to have 17753 rows of data for skaters and 1919 rows for goalies, which adds up to 19672 rows. However, the dataframe has 19674 rows, so 2 rows are not accounted for in either group.

The same thing is happening in the career statistics, with 7859 rows for skaters and 1113 rows for goalies. That sums to 8972 rows while the dataframe has 8974 rows, so again 2 rows are missing.
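The bookkeeping above can be sketched in a few lines. The toy dataframe is made up, and treating `PTS` as a skater-only column and `W` as a goalie-only column is an assumption based on the preview of the data. The last two lines simply redo the arithmetic with the counts quoted above.

```python
import pandas as pd

# Toy data (hypothetical): PTS is filled only for skaters, W only for goalies.
toy = pd.DataFrame({
    "PTS": [1.0, 0.0, None, None, None],
    "W":   [None, None, 1.0, 3.0, None],
})

skaters = int(toy["PTS"].notna().sum())
goalies = int(toy["W"].notna().sum())

# Rows that belong to neither group (valid only if no row has both filled).
unaccounted = len(toy) - skaters - goalies
print(skaters, goalies, unaccounted)

# The same arithmetic with the counts quoted in the text.
season_gap = 19674 - (17753 + 1919)
career_gap = 8974 - (7859 + 1113)
print(season_gap, career_gap)
```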

Let us find out who is messing up our data. Icetime seems like a good indicator since it must be present for all players who have recorded a match during that season and is stored differently for skaters (average icetime per match) and goalies (total icetime per season).

In [5]:
# We need the rows in which both icetime columns are null.

raw_players_season[raw_players_season['TOI/G'].isnull() & raw_players_season['TOI'].isnull()]

Unnamed: 0,URL,Player name,Season,Tournament / Team,№,GP,G,Assists,PTS,+/-,...,FOA,W,L,SOP,GA,Sv,%Sv,GAA,SO,TOI
2037,https://en.khl.ru/players/29144/,David Boldizar,Regular season 2018/2019,Slovan (Bratislava),61.0,3.0,0.0,0.0,,,...,,,,,,,,,,
2038,https://en.khl.ru/players/29144/,David Boldizar,Regular season 2017/2018,Slovan (Bratislava),23.0,2.0,0.0,0.0,,,...,,,,,,,,,,


So, David Boldizar from Slovan (Bratislava) is the culprit. I wonder what is going on with him. Thankfully, we have added each player's profile link so we can easily check the original data, and it turns out that the data was not stored properly on the website to begin with.

In [6]:
# What do we know about that specific player?

raw_players_season[raw_players_season['URL'] == 'https://en.khl.ru/players/29144/'].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 2037 to 2038
Data columns (total 39 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   URL                2 non-null      object 
 1   Player name        2 non-null      object 
 2   Season             2 non-null      object 
 3   Tournament / Team  2 non-null      object 
 4   №                  2 non-null      float64
 5   GP                 2 non-null      float64
 6   G                  2 non-null      float64
 7   Assists            2 non-null      float64
 8   PTS                0 non-null      float64
 9   +/-                0 non-null      float64
 10  +                  0 non-null      float64
 11  -                  0 non-null      float64
 12  PIM                2 non-null      float64
 13  ESG                0 non-null      float64
 14  PPG                0 non-null      float64
 15  SHG                0 non-null      float64
 16  OTG                0 non

Most of the data is missing, and not because the values are supposed to be zero. After all, icetime cannot be zero for a player who has appeared in a match. Therefore, we need to drop this player from our data altogether.

In [7]:
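# Drop both of his rows; .copy() gives an independent dataframe so that adding columns later does not trigger SettingWithCopyWarning.
player_season = raw_players_season[raw_players_season['URL'] != 'https://en.khl.ru/players/29144/'].copy()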
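A filter like this is easy to sanity-check: the number of rows removed should equal the number of rows that matched the URL. Below is a minimal sketch on a made-up four-row dataframe. The names `toy`, `bad_url`, `n_bad` and `filtered` are hypothetical and only serve the illustration.

```python
import pandas as pd

bad_url = 'https://en.khl.ru/players/29144/'

# Toy data (hypothetical): two rows belong to the player we want to drop.
toy = pd.DataFrame({
    'URL': [
        'https://en.khl.ru/players/16673/',
        bad_url,
        bad_url,
        'https://en.khl.ru/players/16462/',
    ],
    'GP': [13.0, 3.0, 2.0, 2.0],
})

n_bad = int((toy['URL'] == bad_url).sum())
filtered = toy[toy['URL'] != bad_url].copy()

# Exactly the matching rows should be gone, and nothing else.
assert len(filtered) == len(toy) - n_bad
assert not (filtered['URL'] == bad_url).any()
print(len(toy), n_bad, len(filtered))
```

The same two assertions could be run against the real dataframes, with an expected difference of 2 rows.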

Now we can create a new column indicating whether a player is a skater or a goalie. Let us use the icetime for the separation.