This notebook prepares player data for future analysis. The original data was gathered using the "data_scraping_players.ipynb" notebook.

In [1]:
# Importing standard packages for data preparation

import numpy as np
import pandas as pd

Let us start with Konstantin Okulov, the top scorer of the 2020-2021 KHL playoffs and a former player of my favourite hockey team, HC Sibir. His data was scraped in the data_scraping_players.ipynb notebook.

In [2]:
# There are four files: player name and profile URL, player information, statistics for every season played, and statistics for every match played.

player_name = pd.read_csv('okulov_data/player_name.csv')
player_info = pd.read_csv('okulov_data/raw_player_info.csv')
player_season_stat = pd.read_csv('okulov_data/raw_player_season_stat.csv')
player_match_stat = pd.read_csv('okulov_data/raw_player_match_stat.csv')

First of all, let us look at the scraped tables and check whether there are any issues to be addressed.

In [3]:
player_name

Unnamed: 0,URL,Player name
0,https://en.khl.ru/players/18770/,Konstantin Okulov


In [4]:
player_info

Unnamed: 0,0,1,2,3,4,5
0,Born 18 February 1995,Height 183,Weight 83,Age 26,Shoots left,Country Russia


In [5]:
player_season_stat

Unnamed: 0,Tournament / Team,№,GP,G,Assists,PTS,+/-,+,-,PIM,...,%SOG,S/G,FO,FOW,%FO,TOI/G,SFT/G,HITS,BLS,FOA
0,Playoffs 2020/2021,,,,,,,,,,...,,,,,,,,,,
1,CSKA (Moscow),71.0,23.0,6.0,14.0,20.0,10.0,19.0,9.0,2.0,...,9.8,2.7,3.0,0.0,0.0,16:53,23.4,7.0,5.0,8.0
2,Regular season 2020/2021,,,,,,,,,,...,,,,,,,,,,
3,CSKA (Moscow),71.0,55.0,18.0,31.0,49.0,14.0,43.0,29.0,6.0,...,14.0,2.3,9.0,2.0,22.2,15:59,19.9,5.0,13.0,13.0
4,Playoffs 2019/2020,,,,,,,,,,...,,,,,,,,,,
5,CSKA (Moscow),71.0,3.0,0.0,1.0,1.0,2.0,2.0,0.0,0.0,...,0.0,2.7,0.0,0.0,-,14:27,18.0,1.0,0.0,0.0
6,Regular season 2019/2020,,,,,,,,,,...,,,,,,,,,,
7,CSKA (Moscow),71.0,56.0,17.0,21.0,38.0,14.0,35.0,21.0,22.0,...,12.1,2.5,39.0,19.0,48.7,14:48,18.6,8.0,21.0,7.0
8,Playoffs 2018/2019,,,,,,,,,,...,,,,,,,,,,
9,CSKA (Moscow),71.0,19.0,7.0,7.0,14.0,8.0,11.0,3.0,2.0,...,15.2,2.4,8.0,4.0,50.0,13:35,19.2,4.0,4.0,1.0


In [6]:
player_match_stat

Unnamed: 0,IDSeason,Season,Team,Date,Teams,Score,№,G,Assists,PTS,...,SOG,%SOG,FO,FOW,%FO,TOI,SFT,HITS,BLS,FOA
0,244,Regular season 2013/2014,29,18 Sep 2013,Metallurg Mg - Sibir,3:2 ОТ,17,0,0,0,...,0,-,0,0,-,1:53,3.0,,,
1,244,Regular season 2013/2014,29,22 Sep 2013,Neftekhimik - Sibir,4:3,17,0,0,0,...,0,-,0,0,-,-,-,,,
2,244,Regular season 2013/2014,29,24 Sep 2013,Traktor - Sibir,1:0,17,0,0,0,...,0,-,0,0,-,0:17,1.0,,,
3,244,Regular season 2013/2014,29,11 Dec 2013,Avangard - Sibir,3:2,17,0,0,0,...,0,-,1,1,100.0,5:05,7.0,,,
4,244,Regular season 2013/2014,29,13 Dec 2013,Barys - Sibir,3:4,17,0,0,0,...,0,-,0,0,-,-,-,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
361,1046,Playoffs 2020/2021,2,20 Apr 2021,CSKA - Avangard,3:0,71,0,1,1,...,3,0.0,0,0,-,17:46,22.0,1.0,1.0,1.0
362,1046,Playoffs 2020/2021,2,22 Apr 2021,Avangard - CSKA,1:2,71,1,1,2,...,6,16.7,0,0,-,11:47,21.0,0.0,0.0,1.0
363,1046,Playoffs 2020/2021,2,24 Apr 2021,Avangard - CSKA,4:3 ОТ,71,0,1,1,...,5,0.0,1,0,0.0,17:55,27.0,0.0,0.0,1.0
364,1046,Playoffs 2020/2021,2,26 Apr 2021,CSKA - Avangard,0:2,71,0,0,0,...,5,0.0,0,0,-,18:16,23.0,0.0,0.0,0.0


We can see several problems with the raw data that need to be handled.

First of all, in the player information table the header was not recognised as such: each label is fused with its value in a single cell. A similar thing occurred in the season stats table, where the season identifier was placed on a separate row instead of in an additional column.

Lastly, all tables lack the player's name. While it does not hamper our ability to analyse this particular player's statistics, the intention is to gather the data for all players in the league. As such, we must be able to recognise which player a particular row refers to.

In [7]:
# Let us handle the problem with player information first.
# The header is always the first word in each field, so we split each field once, on the first whitespace.

split_info = player_info.iloc[0].str.split(n=1)

# The first word (index 0) becomes the header, while the remainder (index 1) becomes the actual information.

header = split_info.str[0]
actual_info = split_info.str[1]

# Finally, replace the column names with the correct header and replace the first row with the clean information.
# The table is modified in place; the original data remains untouched in the raw_ CSV files.

player_info.columns = header
player_info.iloc[0] = actual_info

In [8]:
player_info

Unnamed: 0,Born,Height,Weight,Age,Shoots,Country
0,18 February 1995,183,83,26,left,Russia
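
Since the same fix will be needed for every player in the league, the header split above could be wrapped in a small helper. The following is only a minimal sketch: the function name split_header_from_values and the toy input are assumptions made for illustration, and the helper returns a new table instead of modifying its input.

```python
import pandas as pd


def split_header_from_values(raw_info: pd.DataFrame) -> pd.DataFrame:
    """Turn a one-row table of 'Label value' cells into a properly labelled table."""
    # Split every cell of the first row once, on the first run of whitespace.
    parts = raw_info.iloc[0].astype(str).str.split(n=1)
    labels = parts.str[0].tolist()   # first word of each cell -> column name
    values = parts.str[1].tolist()   # rest of each cell -> actual value
    # Build a fresh one-row table so that the input stays unchanged.
    return pd.DataFrame([values], columns=labels)


# Toy input mimicking the scraped layout (a subset of the cells shown above).
raw_example = pd.DataFrame([["Born 18 February 1995", "Height 183", "Shoots left"]])
clean_example = split_header_from_values(raw_example)
print(clean_example)
```

A cell that contains a single word would end up with a missing value, so such cases would need separate handling.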


In [9]:
# Now let us handle the problem with season statistics.
# Every row with actual statistics needs at least one game played (GP) to be included in the table.
# We can use it to separate out the rows with season identifiers - they will be the only ones with no games recorded.

player_season_stat['Season'] = np.where(player_season_stat['GP'].isna(), player_season_stat['Tournament / Team'], np.nan)

# We are interested in adding seasons to the actual statistics as a separate column.
# The Season column values are NaN when GP isn't NaN in that row - that is, when the row has games recorded.
# The season identifier is always directly above the actual season statistics, so we can fill the NaN values from the row above.

player_season_stat['Season'] = player_season_stat['Season'].ffill()

# The Season column is now ready, so drop the rows that served only to identify seasons.

player_season_stat = player_season_stat[player_season_stat['Tournament / Team'] != player_season_stat['Season']]

# Finally, move the Season column to the start of the table and reset the index.

player_season_stat = player_season_stat[['Season'] + [col for col in player_season_stat.columns if col != 'Season']]
player_season_stat = player_season_stat.reset_index(drop=True)

In [10]:
player_season_stat

Unnamed: 0,Season,Tournament / Team,№,GP,G,Assists,PTS,+/-,+,-,...,%SOG,S/G,FO,FOW,%FO,TOI/G,SFT/G,HITS,BLS,FOA
0,Playoffs 2020/2021,CSKA (Moscow),71.0,23.0,6.0,14.0,20.0,10.0,19.0,9.0,...,9.8,2.7,3.0,0.0,0.0,16:53,23.4,7.0,5.0,8.0
1,Regular season 2020/2021,CSKA (Moscow),71.0,55.0,18.0,31.0,49.0,14.0,43.0,29.0,...,14.0,2.3,9.0,2.0,22.2,15:59,19.9,5.0,13.0,13.0
2,Playoffs 2019/2020,CSKA (Moscow),71.0,3.0,0.0,1.0,1.0,2.0,2.0,0.0,...,0.0,2.7,0.0,0.0,-,14:27,18.0,1.0,0.0,0.0
3,Regular season 2019/2020,CSKA (Moscow),71.0,56.0,17.0,21.0,38.0,14.0,35.0,21.0,...,12.1,2.5,39.0,19.0,48.7,14:48,18.6,8.0,21.0,7.0
4,Playoffs 2018/2019,CSKA (Moscow),71.0,19.0,7.0,7.0,14.0,8.0,11.0,3.0,...,15.2,2.4,8.0,4.0,50.0,13:35,19.2,4.0,4.0,1.0
5,Regular season 2018/2019,CSKA (Moscow),71.0,48.0,20.0,11.0,31.0,21.0,30.0,9.0,...,18.3,2.3,41.0,18.0,43.9,14:04,17.8,5.0,13.0,6.0
6,Playoffs 2017/2018,CSKA (Moscow),77.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.3,0.0,0.0,-,10:35,16.3,1.0,1.0,0.0
7,Regular season 2017/2018,CSKA (Moscow),77.0,35.0,6.0,13.0,19.0,14.0,19.0,5.0,...,9.5,1.8,81.0,42.0,51.9,13:20,18.3,8.0,11.0,6.0
8,Regular season 2016/2017,Sibir (Novosibirsk Region),17.0,59.0,17.0,11.0,28.0,7.0,31.0,24.0,...,11.6,2.5,480.0,238.0,49.6,16:30,21.2,4.0,26.0,14.0
9,Playoffs 2015/2016,Sibir (Novosibirsk Region),17.0,9.0,2.0,2.0,4.0,5.0,5.0,0.0,...,20.0,1.1,38.0,9.0,23.7,10:05,15.3,4.0,8.0,0.0
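
The season fix relies on forward filling: the identifier row sits above the statistics rows it describes, so its label can be carried downwards. Below is a minimal sketch of the same idea as a reusable function, run on a toy table built from the first rows shown earlier. The function name attach_season and its parameters are assumptions for illustration, not part of the original workflow.

```python
import pandas as pd


def attach_season(stats: pd.DataFrame,
                  label_col: str = "Tournament / Team",
                  games_col: str = "GP") -> pd.DataFrame:
    """Move season identifier rows into a 'Season' column on the rows below them."""
    # Identifier rows are the only ones with no games recorded.
    is_identifier = stats[games_col].isna()
    # Keep the label only on identifier rows, then carry it down with ffill().
    season = stats[label_col].where(is_identifier).ffill()
    # Drop the identifier rows and put the Season column first.
    result = stats.loc[~is_identifier].copy()
    result.insert(0, "Season", season[~is_identifier])
    return result.reset_index(drop=True)


toy = pd.DataFrame({
    "Tournament / Team": ["Playoffs 2020/2021", "CSKA (Moscow)",
                          "Regular season 2020/2021", "CSKA (Moscow)"],
    "GP": [float("nan"), 23.0, float("nan"), 55.0],
})
tidy = attach_season(toy)
print(tidy)
```

Unlike the cell above, this variant drops the identifier rows by the GP mask rather than by comparing the two text columns, which avoids relying on team names never coinciding with season labels.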


Now that the headers are fixed, let us add the player name and KHL profile link to the player information table, and the player name to both statistics tables.

In [11]:
# Adding player name and profile link to player information.

player_info = pd.concat([player_name, player_info], axis=1)

In [12]:
player_info

Unnamed: 0,URL,Player name,Born,Height,Weight,Age,Shoots,Country
0,https://en.khl.ru/players/18770/,Konstantin Okulov,18 February 1995,183,83,26,left,Russia


In [13]:
# Add a player name column to the player season statistics and move it to the start of the table.

player_season_stat['Player name'] = player_info['Player name'][0]
player_season_stat = player_season_stat[['Player name'] + [col for col in player_season_stat.columns if col != 'Player name']]

In [14]:
player_season_stat

Unnamed: 0,Player name,Season,Tournament / Team,№,GP,G,Assists,PTS,+/-,+,...,%SOG,S/G,FO,FOW,%FO,TOI/G,SFT/G,HITS,BLS,FOA
0,Konstantin Okulov,Playoffs 2020/2021,CSKA (Moscow),71.0,23.0,6.0,14.0,20.0,10.0,19.0,...,9.8,2.7,3.0,0.0,0.0,16:53,23.4,7.0,5.0,8.0
1,Konstantin Okulov,Regular season 2020/2021,CSKA (Moscow),71.0,55.0,18.0,31.0,49.0,14.0,43.0,...,14.0,2.3,9.0,2.0,22.2,15:59,19.9,5.0,13.0,13.0
2,Konstantin Okulov,Playoffs 2019/2020,CSKA (Moscow),71.0,3.0,0.0,1.0,1.0,2.0,2.0,...,0.0,2.7,0.0,0.0,-,14:27,18.0,1.0,0.0,0.0
3,Konstantin Okulov,Regular season 2019/2020,CSKA (Moscow),71.0,56.0,17.0,21.0,38.0,14.0,35.0,...,12.1,2.5,39.0,19.0,48.7,14:48,18.6,8.0,21.0,7.0
4,Konstantin Okulov,Playoffs 2018/2019,CSKA (Moscow),71.0,19.0,7.0,7.0,14.0,8.0,11.0,...,15.2,2.4,8.0,4.0,50.0,13:35,19.2,4.0,4.0,1.0
5,Konstantin Okulov,Regular season 2018/2019,CSKA (Moscow),71.0,48.0,20.0,11.0,31.0,21.0,30.0,...,18.3,2.3,41.0,18.0,43.9,14:04,17.8,5.0,13.0,6.0
6,Konstantin Okulov,Playoffs 2017/2018,CSKA (Moscow),77.0,3.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.3,0.0,0.0,-,10:35,16.3,1.0,1.0,0.0
7,Konstantin Okulov,Regular season 2017/2018,CSKA (Moscow),77.0,35.0,6.0,13.0,19.0,14.0,19.0,...,9.5,1.8,81.0,42.0,51.9,13:20,18.3,8.0,11.0,6.0
8,Konstantin Okulov,Regular season 2016/2017,Sibir (Novosibirsk Region),17.0,59.0,17.0,11.0,28.0,7.0,31.0,...,11.6,2.5,480.0,238.0,49.6,16:30,21.2,4.0,26.0,14.0
9,Konstantin Okulov,Playoffs 2015/2016,Sibir (Novosibirsk Region),17.0,9.0,2.0,2.0,4.0,5.0,5.0,...,20.0,1.1,38.0,9.0,23.7,10:05,15.3,4.0,8.0,0.0


In [15]:
# Same for the player match statistics.

player_match_stat['Player name'] = player_info['Player name'][0]
player_match_stat = player_match_stat[['Player name'] + [col for col in player_match_stat.columns if col != 'Player name']]

In [16]:
player_match_stat

Unnamed: 0,Player name,IDSeason,Season,Team,Date,Teams,Score,№,G,Assists,...,SOG,%SOG,FO,FOW,%FO,TOI,SFT,HITS,BLS,FOA
0,Konstantin Okulov,244,Regular season 2013/2014,29,18 Sep 2013,Metallurg Mg - Sibir,3:2 ОТ,17,0,0,...,0,-,0,0,-,1:53,3.0,,,
1,Konstantin Okulov,244,Regular season 2013/2014,29,22 Sep 2013,Neftekhimik - Sibir,4:3,17,0,0,...,0,-,0,0,-,-,-,,,
2,Konstantin Okulov,244,Regular season 2013/2014,29,24 Sep 2013,Traktor - Sibir,1:0,17,0,0,...,0,-,0,0,-,0:17,1.0,,,
3,Konstantin Okulov,244,Regular season 2013/2014,29,11 Dec 2013,Avangard - Sibir,3:2,17,0,0,...,0,-,1,1,100.0,5:05,7.0,,,
4,Konstantin Okulov,244,Regular season 2013/2014,29,13 Dec 2013,Barys - Sibir,3:4,17,0,0,...,0,-,0,0,-,-,-,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
361,Konstantin Okulov,1046,Playoffs 2020/2021,2,20 Apr 2021,CSKA - Avangard,3:0,71,0,1,...,3,0.0,0,0,-,17:46,22.0,1.0,1.0,1.0
362,Konstantin Okulov,1046,Playoffs 2020/2021,2,22 Apr 2021,Avangard - CSKA,1:2,71,1,1,...,6,16.7,0,0,-,11:47,21.0,0.0,0.0,1.0
363,Konstantin Okulov,1046,Playoffs 2020/2021,2,24 Apr 2021,Avangard - CSKA,4:3 ОТ,71,0,1,...,5,0.0,1,0,0.0,17:55,27.0,0.0,0.0,1.0
364,Konstantin Okulov,1046,Playoffs 2020/2021,2,26 Apr 2021,CSKA - Avangard,0:2,71,0,0,...,5,0.0,0,0,-,18:16,23.0,0.0,0.0,0.0
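
Adding a column and then reordering all columns takes two steps. As a possible alternative, pandas' DataFrame.insert places a new column at a given position in one call. The sketch below uses a tiny toy table (two dates taken from the match table above, with most columns omitted) purely for illustration.

```python
import pandas as pd

# Toy match table with only two of the real columns.
matches = pd.DataFrame({
    "Date": ["18 Sep 2013", "22 Sep 2013"],
    "G": [0, 0],
})

# insert() modifies the table in place; position 0 makes the new column the first one.
# The scalar value is repeated for every row.
matches.insert(0, "Player name", "Konstantin Okulov")
print(matches)
```

Note that insert raises a ValueError if the column already exists, so re-running such a cell on an already processed table would fail rather than silently duplicate the column.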


In [17]:
# Our data still has many issues to be addressed before it can be properly analysed.
# However, it can already be combined with data for other players.
# So let us save it for further processing.

player_info.to_csv('okulov_data/player_info.csv', encoding='utf8', index=False)
player_season_stat.to_csv('okulov_data/player_season_stat.csv', encoding='utf8', index=False)
player_match_stat.to_csv('okulov_data/player_match_stat.csv', encoding='utf8', index=False)
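
The player name column is what makes combining tables for several players possible. The following is only a sketch of how that combination might look once more players are processed: the two tables, the player names and all numbers are hypothetical placeholders, not scraped data.

```python
import pandas as pd

# Hypothetical per-player season tables; names and values are placeholders.
player_a = pd.DataFrame({
    "Player name": ["Player A", "Player A"],
    "Season": ["Season 1", "Season 2"],
    "GP": [10, 12],
})
player_b = pd.DataFrame({
    "Player name": ["Player B"],
    "Season": ["Season 1"],
    "GP": [8],
})

# Stack the tables on top of each other and renumber the rows.
all_players = pd.concat([player_a, player_b], ignore_index=True)

# Every row can still be traced back to its player, so grouping works directly.
games_per_player = all_players.groupby("Player name")["GP"].sum()
print(games_per_player)
```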