In [1]:
# Importing standard packages for data exploration and processing.

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

This notebook is going to focus on processing the players' career statistics only. The other three files will be processed in separate notebooks.

In [4]:
# Unlike in the Stage 1 notebooks, we are going to create new variables rather than perform the operations in-place here.
# The reason is that we might need to review the original data during processing.

raw_players_career = pd.read_csv('../raw_data/raw_players_career.csv')

In [5]:
# Does everything seem to be alright with the data?

raw_players_career

Unnamed: 0,URL,Player name,Season,Tournament / Team,№,GP,G,Assists,PTS,+/-,...,FOA,W,L,SOP,GA,Sv,%Sv,GAA,SO,TOI
0,https://en.khl.ru/players/16673/,Sergei Abramov,KHL Summary,Regular season:,,25.0,1.0,0.0,1.0,-4.0,...,1.0,,,,,,,,,
1,https://en.khl.ru/players/16673/,Sergei Abramov,KHL Summary,Nadezhda Cup:,,2.0,0.0,0.0,0.0,0.0,...,0.0,,,,,,,,,
2,https://en.khl.ru/players/16673/,Sergei Abramov,KHL Summary,KHL Total:,,25.0,1.0,0.0,1.0,-4.0,...,1.0,,,,,,,,,
3,https://en.khl.ru/players/16462/,Maxim Alyapkin,KHL Summary,Regular season:,,3.0,0.0,0.0,,,...,,1.0,2.0,0.0,5.0,19.0,79.2,3.17,0.0,94:41
4,https://en.khl.ru/players/16462/,Maxim Alyapkin,KHL Summary,KHL Total:,,3.0,0.0,0.0,,,...,,1.0,2.0,0.0,5.0,19.0,79.2,3.17,0.0,94:41
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8969,https://en.khl.ru/players/16217/,Airat Ziazov,KHL Summary,KHL Total:,,79.0,6.0,10.0,16.0,-4.0,...,6.0,,,,,,,,,
8970,https://en.khl.ru/players/23656/,Tomislav Zanoski,KHL Summary,Regular season:,,39.0,5.0,1.0,6.0,-7.0,...,8.0,,,,,,,,,
8971,https://en.khl.ru/players/23656/,Tomislav Zanoski,KHL Summary,KHL Total:,,39.0,5.0,1.0,6.0,-7.0,...,8.0,,,,,,,,,
8972,https://en.khl.ru/players/11543/,Alexander Zevakhin,KHL Summary,Regular season:,,64.0,4.0,7.0,11.0,-3.0,...,0.0,,,,,,,,,


We can already see that there are some issues with missing data and integers stored as floats.
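One reason whole numbers end up stored as floats is that a NumPy-backed integer column cannot hold NaN, so pandas falls back to float64 as soon as a value is missing. The sketch below uses invented values rather than rows from our data. It does not explain columns without gaps, which must have become floats for some other reason.

```python
import numpy as np
import pandas as pd

# Invented goal counts, for illustration only.
complete = pd.Series([1, 0, 5])
with_gap = pd.Series([1, np.nan, 5])

# Without gaps pandas keeps an integer dtype; with a gap it switches to float64.
print(pd.api.types.is_integer_dtype(complete))
print(with_gap.dtype)
```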

In [6]:
# What would the summary tell us?

raw_players_career.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8974 entries, 0 to 8973
Data columns (total 39 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   URL                8974 non-null   object 
 1   Player name        8974 non-null   object 
 2   Season             8974 non-null   object 
 3   Tournament / Team  8974 non-null   object 
 4   №                  0 non-null      float64
 5   GP                 8974 non-null   float64
 6   G                  8974 non-null   float64
 7   Assists            8974 non-null   float64
 8   PTS                7859 non-null   float64
 9   +/-                7859 non-null   float64
 10  +                  7405 non-null   float64
 11  -                  7405 non-null   float64
 12  PIM                8974 non-null   float64
 13  ESG                7859 non-null   float64
 14  PPG                7859 non-null   float64
 15  SHG                7859 non-null   float64
 16  OTG                7859 

We can see that in many columns there is no missing data at all. At the same time, for other columns there is a clear separation into skaters (forwards and defencemen) and goalies.

For example, the season statistics appear to have 7859 rows of data for skaters and 1113 rows for goalies, for a total of 8972 rows. However, there are 8974 rows in the dataframe, so 2 rows are unaccounted for in either group.
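The same bookkeeping can be done in code instead of by reading the summary. This is a minimal sketch on an invented frame, where 'PTS' stands in for a skater-only column and 'GAA' for a goalie-only one. Applied to the figures above it amounts to 8974 - 7859 - 1113 = 2.

```python
import numpy as np
import pandas as pd

# Invented rows: two skaters, one goalie and one row with neither set of statistics.
toy = pd.DataFrame({
    'GP':  [25.0, 3.0, 5.0, 39.0],
    'PTS': [1.0, np.nan, np.nan, 6.0],
    'GAA': [np.nan, 3.17, np.nan, np.nan],
})

skater_rows = toy['PTS'].notna().sum()
goalie_rows = toy['GAA'].notna().sum()

# Rows that belong to neither group.
unaccounted = len(toy) - skater_rows - goalie_rows
print(skater_rows, goalie_rows, unaccounted)
```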

Let us find out who is messing up our data. Ice time seems like a good indicator since it must be present for every player who has played a match during that season, and it is stored differently for skaters (average ice time per match, 'TOI/G') and goalies (total ice time per season, 'TOI').

In [7]:
# We need the rows for which both ice time columns are null.

raw_players_career[raw_players_career['TOI/G'].isnull() & raw_players_career['TOI'].isnull()]

Unnamed: 0,URL,Player name,Season,Tournament / Team,№,GP,G,Assists,PTS,+/-,...,FOA,W,L,SOP,GA,Sv,%Sv,GAA,SO,TOI
887,https://en.khl.ru/players/29144/,David Boldizar,KHL Summary,Regular season:,,5.0,0.0,0.0,,,...,,,,,,,,,,
888,https://en.khl.ru/players/29144/,David Boldizar,KHL Summary,KHL Total:,,5.0,0.0,0.0,,,...,,,,,,,,,,


So, David Boldizar from Slovan (Bratislava) is the culprit. I wonder what is going on with him. Thankfully, we have added each player's profile link so we can easily check the original data, and it turns out that the data was not stored properly on the website to begin with.

In [8]:
# What do we know about that specific player?

raw_players_career[raw_players_career['URL'] == 'https://en.khl.ru/players/29144/'].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 887 to 888
Data columns (total 39 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   URL                2 non-null      object 
 1   Player name        2 non-null      object 
 2   Season             2 non-null      object 
 3   Tournament / Team  2 non-null      object 
 4   №                  0 non-null      float64
 5   GP                 2 non-null      float64
 6   G                  2 non-null      float64
 7   Assists            2 non-null      float64
 8   PTS                0 non-null      float64
 9   +/-                0 non-null      float64
 10  +                  0 non-null      float64
 11  -                  0 non-null      float64
 12  PIM                2 non-null      float64
 13  ESG                0 non-null      float64
 14  PPG                0 non-null      float64
 15  SHG                0 non-null      float64
 16  OTG                0 non-n

Most of the data is missing, and not because the values are supposed to be zero. After all, ice time cannot be zero for a player with games played. Therefore, we need to drop this player from our data altogether.
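When dropping a player it is safer to filter on the profile link than on the name, since two different players may share a name. The sketch below shows the difference on invented rows. The names are made up and the URLs are placeholders on the reserved example.com domain, not real profile links.

```python
import pandas as pd

# Invented rows: two different players who happen to share a name.
toy = pd.DataFrame({
    'URL': ['https://example.com/players/1/',
            'https://example.com/players/1/',
            'https://example.com/players/2/'],
    'Player name': ['Ivan Ivanov', 'Ivan Ivanov', 'Ivan Ivanov'],
    'GP': [10.0, 10.0, 42.0],
})

# Names that map to more than one profile link are ambiguous as keys.
profiles_per_name = toy.groupby('Player name')['URL'].nunique()
ambiguous_names = profiles_per_name[profiles_per_name > 1]
print(ambiguous_names)

# Filtering by URL removes one player only; filtering by name would remove both.
without_player_1 = toy[toy['URL'] != 'https://example.com/players/1/']
without_name = toy[toy['Player name'] != 'Ivan Ivanov']
print(len(without_player_1), len(without_name))
```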

In [9]:
# Once again, we are using the profile link as the primary key because different players may share a name.

# Taking an explicit copy so that the new columns added below do not trigger a SettingWithCopyWarning.

players_career = raw_players_career[raw_players_career['URL'] != 'https://en.khl.ru/players/29144/'].copy()

Now we can create a new column indicating whether a player is a skater or a goalie. Let us use the ice time for the separation.

In [14]:
# Total ice time is only tracked for goalies, so skaters are supposed to have it as null.

players_career['Role'] = np.where(players_career['TOI'].isnull(), 'Skater', 'Goalie')


In [15]:
players_career

Unnamed: 0,URL,Player name,Season,Tournament / Team,№,GP,G,Assists,PTS,+/-,...,W,L,SOP,GA,Sv,%Sv,GAA,SO,TOI,Role
0,https://en.khl.ru/players/16673/,Sergei Abramov,KHL Summary,Regular season:,,25.0,1.0,0.0,1.0,-4.0,...,,,,,,,,,,Skater
1,https://en.khl.ru/players/16673/,Sergei Abramov,KHL Summary,Nadezhda Cup:,,2.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,Skater
2,https://en.khl.ru/players/16673/,Sergei Abramov,KHL Summary,KHL Total:,,25.0,1.0,0.0,1.0,-4.0,...,,,,,,,,,,Skater
3,https://en.khl.ru/players/16462/,Maxim Alyapkin,KHL Summary,Regular season:,,3.0,0.0,0.0,,,...,1.0,2.0,0.0,5.0,19.0,79.2,3.17,0.0,94:41,Goalie
4,https://en.khl.ru/players/16462/,Maxim Alyapkin,KHL Summary,KHL Total:,,3.0,0.0,0.0,,,...,1.0,2.0,0.0,5.0,19.0,79.2,3.17,0.0,94:41,Goalie
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8969,https://en.khl.ru/players/16217/,Airat Ziazov,KHL Summary,KHL Total:,,79.0,6.0,10.0,16.0,-4.0,...,,,,,,,,,,Skater
8970,https://en.khl.ru/players/23656/,Tomislav Zanoski,KHL Summary,Regular season:,,39.0,5.0,1.0,6.0,-7.0,...,,,,,,,,,,Skater
8971,https://en.khl.ru/players/23656/,Tomislav Zanoski,KHL Summary,KHL Total:,,39.0,5.0,1.0,6.0,-7.0,...,,,,,,,,,,Skater
8972,https://en.khl.ru/players/11543/,Alexander Zevakhin,KHL Summary,Regular season:,,64.0,4.0,7.0,11.0,-3.0,...,,,,,,,,,,Skater


We would like to fix the floats in columns where we know the values are supposed to be integers. You can't score 3.5 goals after all. However, we may run into a problem here, since converting a column to a plain integer type requires that there are no NaN values in it.
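To see the problem in isolation, here is a minimal sketch on invented values. Casting a float column that contains NaN to a plain integer type raises an error, and the same cast goes through once the gaps are removed.

```python
import numpy as np
import pandas as pd

goals = pd.Series([1.0, np.nan, 3.0])  # invented values

try:
    goals.astype(int)
    conversion_failed = False
except ValueError as err:
    # pandas refuses to cast NaN to a plain integer.
    conversion_failed = True
    print(f'Conversion failed: {err}')

# Once the gaps are gone, the same conversion works.
goals_int = goals.dropna().astype(int)
print(goals_int.tolist())
```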

At the same time, the off-season tournaments can mess up our data. If a player has only participated in the off-season tournaments, the rest of his career statistics would end up with NaN values since the off-season statistics are not included in the totals. Therefore, let us drop such players from the data despite going through the hassle of including them while scraping the data.
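Before dropping anything, it can be useful to list the players in question. The sketch below does so on an invented frame. It assumes that a player with only off-season games still has a 'KHL Total:' row with zero games, and the short URL strings are placeholders rather than real profile links.

```python
import pandas as pd

# Invented rows: 'u1' played official matches, 'u2' only played in the off-season cup.
toy = pd.DataFrame({
    'URL': ['u1', 'u1', 'u1', 'u2', 'u2'],
    'Tournament / Team': ['Regular season:', 'Nadezhda Cup:', 'KHL Total:',
                          'Nadezhda Cup:', 'KHL Total:'],
    'GP': [25.0, 2.0, 25.0, 4.0, 0.0],
})

# Keep only rows from official tournaments, leaving out the summary row as well.
official = toy[~toy['Tournament / Team'].isin(['Nadezhda Cup:', 'KHL Total:'])]

# Official games per player; players with no official rows get 0.
official_gp = (official.groupby('URL')['GP'].sum()
               .reindex(toy['URL'].unique(), fill_value=0))

off_season_only = official_gp[official_gp == 0].index.tolist()
print(off_season_only)
```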

In [16]:
# What off-season tournaments do we have?

players_career['Tournament / Team'].unique()

array(['Regular season:', 'Nadezhda Cup:', 'KHL Total:', 'Playoffs:'],
      dtype=object)

In [17]:
# Apparently, only the Nadezhda Cup.

players_career = players_career[players_career['Tournament / Team'] != 'Nadezhda Cup:']

# We still have players left with no games recorded in the official matches.
# Taking an explicit copy so that the column changes below act on an independent dataframe.

players_career = players_career[players_career['GP'] > 0].copy()

# The 'Season' column does not tell us much the way it is.
# At the same time, 'Tournament / Team' would do better as 'Season', without the colon at the end.

players_career['Season'] = players_career['Tournament / Team'].str.rstrip(':')
players_career = players_career.drop(columns='Tournament / Team')

# Cleaning up the dataframe.

players_career = players_career.reset_index(drop=True)

In [18]:
players_career

Unnamed: 0,URL,Player name,Season,№,GP,G,Assists,PTS,+/-,+,...,W,L,SOP,GA,Sv,%Sv,GAA,SO,TOI,Role
0,https://en.khl.ru/players/16673/,Sergei Abramov,Regular season,,25.0,1.0,0.0,1.0,-4.0,2.0,...,,,,,,,,,,Skater
1,https://en.khl.ru/players/16673/,Sergei Abramov,KHL Total,,25.0,1.0,0.0,1.0,-4.0,2.0,...,,,,,,,,,,Skater
2,https://en.khl.ru/players/16462/,Maxim Alyapkin,Regular season,,3.0,0.0,0.0,,,,...,1.0,2.0,0.0,5.0,19.0,79.2,3.17,0.0,94:41,Goalie
3,https://en.khl.ru/players/16462/,Maxim Alyapkin,KHL Total,,3.0,0.0,0.0,,,,...,1.0,2.0,0.0,5.0,19.0,79.2,3.17,0.0,94:41,Goalie
4,https://en.khl.ru/players/19200/,Dmitry Ambrozheichik,Regular season,,39.0,4.0,1.0,5.0,1.0,9.0,...,,,,,,,,,,Skater
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8018,https://en.khl.ru/players/16217/,Airat Ziazov,KHL Total,,79.0,6.0,10.0,16.0,-4.0,27.0,...,,,,,,,,,,Skater
8019,https://en.khl.ru/players/23656/,Tomislav Zanoski,Regular season,,39.0,5.0,1.0,6.0,-7.0,15.0,...,,,,,,,,,,Skater
8020,https://en.khl.ru/players/23656/,Tomislav Zanoski,KHL Total,,39.0,5.0,1.0,6.0,-7.0,15.0,...,,,,,,,,,,Skater
8021,https://en.khl.ru/players/11543/,Alexander Zevakhin,Regular season,,64.0,4.0,7.0,11.0,-3.0,18.0,...,,,,,,,,,,Skater


Can we now change the data from floats to integers? Not really.

Most of our columns still have many NaN values because different statistics are tracked for skaters and goalies, and plain integer columns cannot hold NaN values. This could be worked around, but such an approach is not necessarily the best one.
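One workaround of this kind is pandas' nullable integer dtype, 'Int64' with a capital I, which keeps missing values as `<NA>` instead of forcing the column to float. Whether this is the workaround meant here is an assumption. The sketch uses invented values.

```python
import numpy as np
import pandas as pd

wins = pd.Series([1.0, np.nan, 2.0])  # invented values; NaN marks a skater row

# 'Int64' is the nullable integer dtype, as opposed to NumPy's 'int64'.
wins_nullable = wins.astype('Int64')

print(wins_nullable.dtype)
print(wins_nullable.isna().tolist())
```

The price is a second integer dtype in the project and missing values that still have to be handled in every calculation.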

We could, of course, leave the data as it is or replace the missing values with zeros. However, analysing skaters and goalies together sounds like poor analysis design since the two groups are very distinct. Therefore, let us split the data into two dataframes and store skater statistics and goalie statistics separately. That way, we can also change floats into integers within each dataframe separately.

In [19]:
# Thankfully, we have a convenient column to separate on.

skaters_career = players_career[players_career['Role'] == 'Skater'].reset_index(drop=True)
goalies_career = players_career[players_career['Role'] == 'Goalie'].reset_index(drop=True)
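With the two groups separated, the float-to-integer conversion could look roughly like the sketch below. The helper `floats_to_ints` is hypothetical and not part of this notebook. It assumes that columns which are entirely empty for a group can be dropped, and that only columns with no gaps and only whole-number values should become integers. The toy frame uses invented values.

```python
import numpy as np
import pandas as pd

def floats_to_ints(df):
    """Return a copy without all-NaN columns, with whole-number float columns cast to int64."""
    result = df.dropna(axis=1, how='all').copy()
    for column in result.select_dtypes(include='float64').columns:
        values = result[column]
        # Only cast when nothing is missing and every value is a whole number.
        if values.notna().all() and (values % 1 == 0).all():
            result[column] = values.astype('int64')
    return result

# Invented goalie rows: 'PTS' is empty for goalies, 'GAA' is a genuine float.
toy_goalies = pd.DataFrame({
    'GP': [3.0, 10.0],
    'GAA': [3.17, 2.5],
    'PTS': [np.nan, np.nan],
})

converted = floats_to_ints(toy_goalies)
print(converted.dtypes)
```

Columns such as 'GAA' stay as floats because their values are not whole numbers, and columns that still contain some gaps are left untouched.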