In [1]:
# Importing standard packages for data exploration and processing.

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

This notebook is going to focus on processing the players' match statistics only. The other three files will be processed in separate notebooks.

In [2]:
# Unlike in the Stage 1 notebooks, we are going to create new variables rather than perform the operations in-place here.
# The reason is that we might need to review the original data during processing.

raw_players_match = pd.read_csv('../raw_data/raw_players_match.csv')

In [3]:
# Does everything seem to be alright with the data?

raw_players_match

Unnamed: 0,URL,Player name,IDSeason,Season,Team,Date,Teams,Score,№,G,...,BLS,FOA,W,L,SOP,GA,Sv,%Sv,GAA,SO
0,https://en.khl.ru/players/16673/,Sergei Abramov,244,Regular season 2013/2014,54,28 Dec 2013,Barys - Amur,8:2,91,0,...,,,,,,,,,,
1,https://en.khl.ru/players/16673/,Sergei Abramov,244,Regular season 2013/2014,54,3 Jan 2014,Amur - Lokomotiv,2:1,91,0,...,,,,,,,,,,
2,https://en.khl.ru/players/16673/,Sergei Abramov,244,Regular season 2013/2014,54,5 Jan 2014,Amur - SKA,1:6,91,0,...,,,,,,,,,,
3,https://en.khl.ru/players/16673/,Sergei Abramov,244,Regular season 2013/2014,54,7 Jan 2014,Amur - Atlant,2:3 Б,91,0,...,,,,,,,,,,
4,https://en.khl.ru/players/16673/,Sergei Abramov,244,Regular season 2013/2014,54,9 Jan 2014,Amur - Severstal,1:3,91,0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
451101,https://en.khl.ru/players/11543/,Alexander Zevakhin,167,Regular season 2009/2010,56,13 Dec 2009,Severstal - CSKA,4:3 Б,15,0,...,,,,,,,,,,
451102,https://en.khl.ru/players/11543/,Alexander Zevakhin,167,Regular season 2009/2010,56,23 Dec 2009,Barys - Severstal,3:4,15,0,...,,,,,,,,,,
451103,https://en.khl.ru/players/11543/,Alexander Zevakhin,167,Regular season 2009/2010,56,25 Dec 2009,Salavat Yulaev - Severstal,2:3,15,0,...,,,,,,,,,,
451104,https://en.khl.ru/players/11543/,Alexander Zevakhin,167,Regular season 2009/2010,56,27 Dec 2009,Avangard - Severstal,3:1,15,0,...,,,,,,,,,,


We can already see that there are some issues with missing data. In addition, the player's team is only indicated by an id rather than its official name.

In [4]:
# What would the summary tell us?

raw_players_match.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 451106 entries, 0 to 451105
Data columns (total 40 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   URL          451106 non-null  object 
 1   Player name  451106 non-null  object 
 2   IDSeason     451106 non-null  int64  
 3   Season       451106 non-null  object 
 4   Team         451106 non-null  int64  
 5   Date         451106 non-null  object 
 6   Teams        451106 non-null  object 
 7   Score        451106 non-null  object 
 8   №            451106 non-null  int64  
 9   G            451106 non-null  int64  
 10  Assists      451106 non-null  int64  
 11  PTS          409063 non-null  float64
 12  +/-          409063 non-null  float64
 13  +            409063 non-null  float64
 14  -            409063 non-null  float64
 15  PIM          451106 non-null  int64  
 16  ESG          409063 non-null  float64
 17  PPG          409063 non-null  float64
 18  SHG          409063 non-

We can see that in many columns there is no missing data at all. Some columns are stored as floats while they should in fact be integers. At the same time, for other columns there is a clear separation into skaters (forwards and defencemen) and goalies.
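
The float columns are most likely a side effect of the missing values: NumPy integer dtypes cannot hold NaN, so pandas falls back to float64. A minimal sketch on made-up data, assuming pandas' nullable `Int64` dtype suits the later processing, of how such a column could be converted:

```python
import numpy as np
import pandas as pd

# Toy stand-in for a skater-only column such as 'PTS'; the values are made up.
toy = pd.DataFrame({'PTS': [0.0, 2.0, np.nan, 1.0]})

# 'Int64' (capital I) is the nullable integer dtype: missing values become <NA>.
toy['PTS'] = toy['PTS'].astype('Int64')

print(toy['PTS'].dtype)
print(toy['PTS'].isna().sum())
```

The conversion only works when every non-missing value is a whole number, which is what we expect for counting statistics.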

For example, the match statistics appear to contain 409063 rows for skaters and 42020 rows for goalies, 451083 rows in total. However, the dataframe has 451106 rows, so 23 rows are unaccounted for.

Let us find out who is messing up our data. Icetime (the 'TOI' column) has exactly 451083 non-null values, which is in line with our calculations, so we are probably interested in the rows where icetime is null.
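
The same bookkeeping can be checked in code. The sketch below uses a four-row toy frame with made-up values; only the column names 'SFT', 'Sv' and 'TOI' come from the real data, and the assumption is that 'SFT' is filled for skaters only and 'Sv' for goalies only.

```python
import numpy as np
import pandas as pd

# Toy frame: 'SFT' is filled for skaters, 'Sv' for goalies, 'TOI' for both.
toy = pd.DataFrame({
    'SFT': [20.0, 18.0, np.nan, np.nan],
    'Sv': [np.nan, np.nan, 31.0, np.nan],
    'TOI': ['15:02', '12:40', '60:00', np.nan],
})

skaters = toy['SFT'].notnull().sum()
goalies = toy['Sv'].notnull().sum()
unaccounted = len(toy) - skaters - goalies

# Rows that belong to neither group should be exactly the rows without icetime.
neither = toy['SFT'].isnull() & toy['Sv'].isnull()
print(unaccounted, neither.sum(), toy['TOI'].isnull().sum())
```

With the real counts the arithmetic is 451106 - 409063 - 42020 = 23.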

In [5]:
# We need the rows for which icetime ('TOI') is null.

raw_players_match[raw_players_match['TOI'].isnull()]

Unnamed: 0,URL,Player name,IDSeason,Season,Team,Date,Teams,Score,№,G,...,BLS,FOA,W,L,SOP,GA,Sv,%Sv,GAA,SO
35442,https://en.khl.ru/players/33314/,Casey Bailey,671,Regular season 2018/2019,246,20 Feb 2019,Jokerit - Slovan,7:1,25,0,...,,,,,,,,,,
46321,https://en.khl.ru/players/29144/,David Boldizar,468,Regular season 2017/2018,246,20 Sep 2017,Slovan - Ak Bars,3:6,23,0,...,,,,,,,,,,
46322,https://en.khl.ru/players/29144/,David Boldizar,468,Regular season 2017/2018,246,23 Sep 2017,Vityaz - Slovan,4:0,23,0,...,,,,,,,,,,
46323,https://en.khl.ru/players/29144/,David Boldizar,468,Regular season 2017/2018,246,25 Sep 2017,CSKA - Slovan,3:2,23,0,...,,,,,,,,,,
46324,https://en.khl.ru/players/29144/,David Boldizar,468,Regular season 2017/2018,246,27 Sep 2017,Slovan - Vityaz,4:3,23,0,...,,,,,,,,,,
46325,https://en.khl.ru/players/29144/,David Boldizar,468,Regular season 2017/2018,246,3 Oct 2017,Slovan - Severstal,5:4 Б,23,0,...,,,,,,,,,,
46326,https://en.khl.ru/players/29144/,David Boldizar,468,Regular season 2017/2018,246,5 Oct 2017,Slovan - Torpedo,0:1,23,0,...,,,,,,,,,,
46327,https://en.khl.ru/players/29144/,David Boldizar,671,Regular season 2018/2019,246,22 Jan 2019,Dinamo R - Slovan,3:2,61,0,...,,,,,,,,,,
46328,https://en.khl.ru/players/29144/,David Boldizar,671,Regular season 2018/2019,246,24 Jan 2019,Lokomotiv - Slovan,7:0,61,0,...,,,,,,,,,,
46329,https://en.khl.ru/players/29144/,David Boldizar,671,Regular season 2018/2019,246,26 Jan 2019,Slovan - Dinamo Mn,2:4,61,0,...,,,,,,,,,,


We have multiple culprits here. Something must have gone wrong with the way their data was stored.

In [6]:
raw_players_match[raw_players_match['TOI'].isnull()].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23 entries, 35442 to 354778
Data columns (total 40 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   URL          23 non-null     object 
 1   Player name  23 non-null     object 
 2   IDSeason     23 non-null     int64  
 3   Season       23 non-null     object 
 4   Team         23 non-null     int64  
 5   Date         23 non-null     object 
 6   Teams        23 non-null     object 
 7   Score        23 non-null     object 
 8   №            23 non-null     int64  
 9   G            23 non-null     int64  
 10  Assists      23 non-null     int64  
 11  PTS          0 non-null      float64
 12  +/-          0 non-null      float64
 13  +            0 non-null      float64
 14  -            0 non-null      float64
 15  PIM          23 non-null     int64  
 16  ESG          0 non-null      float64
 17  PPG          0 non-null      float64
 18  SHG          0 non-null      float64
 19  OT

Most of the data is missing, and not because it is supposed to be zero. After all, icetime cannot be zero if a player has participated in a match. The only values present are integers, so something is definitely wrong with how these rows were stored.

We do not know whether the player has zeroes in all those columns or if it is just a data storage issue. Since those are only a few broken rows, let us just drop them altogether.

In [7]:
players_match = raw_players_match[~raw_players_match['TOI'].isnull()]

# Cleaning up the dataframe.

players_match = players_match.reset_index(drop=True)
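
An equivalent way to express the same filter is `dropna` with a `subset`. This is a sketch on made-up rows, not the cell that was actually run:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Player name': ['A', 'B', 'C'],
    'TOI': ['15:02', np.nan, '60:00'],
})

# dropna(subset=...) removes rows where the listed columns are missing
# and returns a new frame, leaving the original untouched.
cleaned = toy.dropna(subset=['TOI']).reset_index(drop=True)

print(len(toy) - len(cleaned))
```

Comparing the lengths before and after is a quick check that only the expected number of rows was removed.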

Now we can create a new column indicating whether a player is a skater or a goalie. Let us use the shifts for separation.

In [9]:
# Number of shifts on ice is only tracked for skaters, so goalies are supposed to have it as null.

players_match['Role'] = np.where(players_match['SFT'].isnull(), 'Goalie', 'Skater')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  players_match['Role'] = np.where(players_match['SFT'].isnull(), 'Goalie', 'Skater')
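
The warning above appears because the filtered dataframe is still tracked as a slice of the raw one. A minimal sketch on made-up data of one common way to avoid it, taking an explicit copy right after filtering:

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    'TOI': ['15:02', None, '60:00'],
    'SFT': [20.0, np.nan, np.nan],
})

# .copy() makes the filtered frame independent of `raw`,
# so adding a column to it is unambiguous.
subset = raw[raw['TOI'].notnull()].copy().reset_index(drop=True)
subset['Role'] = np.where(subset['SFT'].isnull(), 'Goalie', 'Skater')

print(subset['Role'].tolist())
```

The original dataframe stays untouched, which matches the goal of keeping the raw data available for review.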


In [10]:
players_match

Unnamed: 0,URL,Player name,IDSeason,Season,Team,Date,Teams,Score,№,G,...,FOA,W,L,SOP,GA,Sv,%Sv,GAA,SO,Role
0,https://en.khl.ru/players/16673/,Sergei Abramov,244,Regular season 2013/2014,54,28 Dec 2013,Barys - Amur,8:2,91,0,...,,,,,,,,,,Skater
1,https://en.khl.ru/players/16673/,Sergei Abramov,244,Regular season 2013/2014,54,3 Jan 2014,Amur - Lokomotiv,2:1,91,0,...,,,,,,,,,,Skater
2,https://en.khl.ru/players/16673/,Sergei Abramov,244,Regular season 2013/2014,54,5 Jan 2014,Amur - SKA,1:6,91,0,...,,,,,,,,,,Skater
3,https://en.khl.ru/players/16673/,Sergei Abramov,244,Regular season 2013/2014,54,7 Jan 2014,Amur - Atlant,2:3 Б,91,0,...,,,,,,,,,,Skater
4,https://en.khl.ru/players/16673/,Sergei Abramov,244,Regular season 2013/2014,54,9 Jan 2014,Amur - Severstal,1:3,91,0,...,,,,,,,,,,Skater
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
451078,https://en.khl.ru/players/11543/,Alexander Zevakhin,167,Regular season 2009/2010,56,13 Dec 2009,Severstal - CSKA,4:3 Б,15,0,...,,,,,,,,,,Skater
451079,https://en.khl.ru/players/11543/,Alexander Zevakhin,167,Regular season 2009/2010,56,23 Dec 2009,Barys - Severstal,3:4,15,0,...,,,,,,,,,,Skater
451080,https://en.khl.ru/players/11543/,Alexander Zevakhin,167,Regular season 2009/2010,56,25 Dec 2009,Salavat Yulaev - Severstal,2:3,15,0,...,,,,,,,,,,Skater
451081,https://en.khl.ru/players/11543/,Alexander Zevakhin,167,Regular season 2009/2010,56,27 Dec 2009,Avangard - Severstal,3:1,15,0,...,,,,,,,,,,Skater


We have quite a bit of work ahead of us. Many columns contain data that we would like to see in separate columns, such as the years, the home and away teams, and whether the game ended in regulation time, in overtime or in a shootout.

In [11]:
# We do not really need the 'IDSeason' column, as the current 'Season' column is indicative enough.

players_match = players_match.drop('IDSeason', axis=1)

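# Separating the 'Season' column into the type of season and the years makes it easier to sort.
# Note the naming: 'Year' receives the season type (e.g. 'Regular season'),
# while 'Season' keeps only the years (e.g. '2013/2014').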

players_match['Year'] = players_match['Season'].apply(lambda x: x[:-10])
players_match['Season'] = players_match['Season'].apply(lambda x: x[-9:])

# We need to separate the teams into two columns.
# It is important to strip the surrounding whitespace from the results.

players_match['Home_team'] = players_match['Teams'].apply(lambda x: x.split('-')[0].strip())
players_match['Away_team'] = players_match['Teams'].apply(lambda x: x.split('-')[1].strip())

# Now separating the match score into each team's corresponding score.
# In addition, we will create a 'Length' column that will indicate in which period the game has ended.
# The split on a space separates the scores from an overtime indicator, and the split on a colon separates teams' scores.

players_match['Home_score'] = players_match['Score'].apply(lambda x: x.split(' ')[0].split(':')[0])
players_match['Away_score'] = players_match['Score'].apply(lambda x: x.split(' ')[0].split(':')[1])

# We cannot just take the second element after the split since the list will only contain 1 element if there is no overtime.
# But we can artificially create an extra element of a list by padding the string with an extra space at the end.
# This trick allows us to take the overtime indicator if it is present or a blank string if it is not.

players_match['Length'] = players_match['Score'].apply(lambda x: (x + ' ').split(' ')[1])
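
The same splits can also be written with pandas' vectorized string methods. The sketch below runs on three made-up rows; splitting on ' - ' with the spaces included and converting the scores to integers are choices of this sketch, not something the cell above does.

```python
import pandas as pd

toy = pd.DataFrame({
    'Teams': ['Barys - Amur', 'Amur - Atlant', 'Slovan - Severstal'],
    'Score': ['8:2', '2:3 Б', '5:4 ОТ'],
})

# Splitting on ' - ' (spaces included) avoids cutting a team name
# that itself contains a hyphen.
toy[['Home_team', 'Away_team']] = toy['Teams'].str.split(' - ', n=1, expand=True)

# One regular expression captures both scores and the optional trailing marker.
parts = toy['Score'].str.extract(r'^(\d+):(\d+)(?:\s+(\S+))?$')
toy['Home_score'] = parts[0].astype(int)
toy['Away_score'] = parts[1].astype(int)

# Rows without a marker yield NaN in the third group; use '' to match the cell above.
toy['Length'] = parts[2].fillna('')

print(toy[['Home_team', 'Away_team', 'Home_score', 'Away_score', 'Length']])
```

A row whose score does not fit the pattern would produce NaN and make the integer conversion fail, which surfaces malformed scores early.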

In [12]:
# What values do we have here?

players_match['Length'].unique()

array(['', 'Б', 'ОТ'], dtype=object)

We previously saw the Cyrillic letter 'Б' in the 'Score' column. It indicates a shootout and apparently was not translated into English. Therefore, we need to replace it and, while we are at it, might as well map all values to more descriptive ones.

In [13]:
length_dict = {'': 'Standard', 'ОТ': 'Overtime', 'Б': 'Shootouts'}

players_match['Length'] = players_match['Length'].map(length_dict)
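
One caveat with `Series.map` and a dictionary is that any value without an entry silently becomes NaN. A small sketch of a sanity check, where the stray marker 'X' is hypothetical and only there to trigger the case:

```python
import pandas as pd

length_dict = {'': 'Standard', 'ОТ': 'Overtime', 'Б': 'Shootouts'}

# The known markers plus one hypothetical stray value that has no dictionary entry.
toy = pd.Series(list(length_dict) + ['X'])
mapped = toy.map(length_dict)

# Values missing from the dictionary come back as NaN, so list them explicitly.
unmapped = toy[mapped.isnull()].unique()
print(unmapped)
```

In our data the three unique values are all covered by the dictionary, so no NaN is expected after the mapping.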

We can now rearrange the columns.

In [31]:
# Dropping the columns we are no longer interested in.
# This cell is meant to be run once: re-running it after the columns are gone raises a KeyError.

players_match = players_match.drop(['Teams', 'Score'], axis=1)

# We will have to move the columns around quite a bit.

columns = players_match.columns

players_match = players_match[[col for col in columns[:2]] + ['Role', 'Year'] + [col for col in columns[2:5]] +
                              ['Home_team', 'Away_team', 'Home_score', 'Away_score','Length'] + [col for col in columns[5:-7]]]

# The current column names are not very informative, are they?
# The dataframe has 44 columns at this point, so the header needs 44 names, including 'Games' right after 'Number'.

header = ['Profile', 'Player', 'Role', 'Year', 'Season', 'Team_id', 'Date', 'Home_team', 'Away_team', 'Home_score',
          'Away_score', 'Length', 'Number', 'Games', 'Goals', 'Assists', 'Points', 'Plus_minus', 'Plus', 'Minus',
          'Penalties', 'Goals_even', 'Goals_powerplay', 'Goals_shorthanded', 'Goals_overtime', 'Game_winning_goals',
          'Game_winning_shootouts', 'Shots', 'Shots_percentage', 'Faceoffs', 'Faceoffs_won', 'Faceoffs_percentage',
          'Icetime', 'Hits', 'Shots_blocked', 'Penalties_against', 'Wins', 'Losses', 'Shootouts', 'Goals_against',
          'Saves', 'Saves_percentage', 'Goals_against_average', 'Shutouts']

KeyError: "['Teams' 'Score'] not found in axis"
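
Positional slices such as `columns[5:-7]` depend on the current column order, so the rearrangement gives a different result when run a second time. A sketch of an order-independent alternative on an empty toy frame with a handful of the real column names:

```python
import pandas as pd

toy = pd.DataFrame(columns=['URL', 'Player name', 'Season', 'G', 'Role', 'Year', 'Length'])

# Columns to pull forward, in the desired order; the rest keep their relative order.
front = ['URL', 'Player name', 'Role', 'Year', 'Season', 'Length']
rest = [col for col in toy.columns if col not in front]
toy = toy[front + rest]

# Running the same selection again gives the same result, unlike positional slices.
toy = toy[front + [col for col in toy.columns if col not in front]]
print(list(toy.columns))
```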

In [25]:
players_match.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 451083 entries, 0 to 451082
Data columns (total 44 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   URL          451083 non-null  object 
 1   Player name  451083 non-null  object 
 2   Role         451083 non-null  object 
 3   Year         451083 non-null  object 
 4   Season       451083 non-null  object 
 5   Team         451083 non-null  int64  
 6   Date         451083 non-null  object 
 7   Home_team    451083 non-null  object 
 8   Away_team    451083 non-null  object 
 9   Home_score   451083 non-null  object 
 10  Away_score   451083 non-null  object 
 11  Length       451083 non-null  object 
 12  №            451083 non-null  int64  
 13  G            451083 non-null  int64  
 14  Assists      451083 non-null  int64  
 15  PTS          409063 non-null  float64
 16  +/-          409063 non-null  float64
 17  +            409063 non-null  float64
 18  -            409063 non-

In [36]:
# Renaming the columns; the header must contain one name per column (44 in total).
# An earlier version of this list lacked 'Games' and failed with:
# ValueError: Length mismatch: Expected axis has 44 elements, new values have 43 elements

header = ['Profile', 'Player', 'Role', 'Year', 'Season', 'Team_id', 'Date', 'Home_team', 'Away_team', 'Home_score',
          'Away_score', 'Length', 'Number', 'Games', 'Goals', 'Assists', 'Points', 'Plus_minus', 'Plus', 'Minus',
          'Penalties', 'Goals_even', 'Goals_powerplay', 'Goals_shorthanded', 'Goals_overtime', 'Game_winning_goals',
          'Game_winning_shootouts', 'Shots', 'Shots_percentage', 'Faceoffs', 'Faceoffs_won', 'Faceoffs_percentage',
          'Icetime', 'Hits', 'Shots_blocked', 'Penalties_against', 'Wins', 'Losses', 'Shootouts', 'Goals_against',
          'Saves', 'Saves_percentage', 'Goals_against_average', 'Shutouts']

players_match.columns = header
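
Assigning a full list to `.columns` requires exactly one name per column. A sketch of an alternative that is less sensitive to a miscount, renaming through a dictionary; the mapping below is partial and purely illustrative:

```python
import pandas as pd

toy = pd.DataFrame(columns=['URL', 'Player name', '№', 'PIM'])

# Illustrative partial mapping; columns missing from it are simply left unchanged.
new_names = {'URL': 'Profile', 'Player name': 'Player', '№': 'Number'}
toy = toy.rename(columns=new_names)

print(list(toy.columns))
```

The trade-off is that a forgotten column keeps its old name silently instead of raising an error, so it is worth inspecting the result with `info()` afterwards.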

In [37]:
players_match.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 451083 entries, 0 to 451082
Data columns (total 44 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   Profile                 451083 non-null  object 
 1   Player                  451083 non-null  object 
 2   Role                    451083 non-null  object 
 3   Year                    451083 non-null  object 
 4   Season                  451083 non-null  object 
 5   Team_id                 451083 non-null  int64  
 6   Date                    451083 non-null  object 
 7   Home_team               451083 non-null  object 
 8   Away_team               451083 non-null  object 
 9   Home_score              451083 non-null  object 
 10  Away_score              451083 non-null  object 
 11  Length                  451083 non-null  object 
 12  Number                  451083 non-null  int64  
 13  Games                   451083 non-null  int64  
 14  Goals               

In [30]:
players_match.loc[0]

Profile                   https://en.khl.ru/players/16673/
Player                                      Sergei Abramov
Role                                                Skater
Year                                        Regular season
Season                                           2013/2014
Team_id                                                 54
Date                                           28 Dec 2013
Home_team                                            Barys
Away_team                                             Amur
Home_score                                               8
Away_score                                               2
Length                                            Standard
Number                                                  91
Games                                                    0
Goals                                                    0
Assists                                                0.0
Points                                                 0