In [1]:
# Importing standard packages for data exploration and processing.
import numpy as np
import pandas as pd


# Unlike in the Stage 1 notebooks, we are going to create new variables rather than perform the operations in-place here.
# The reason is that we might need to review the original data during processing.
raw_player_info = pd.read_csv('../raw_data/raw_players_info.csv')

# Does everything seem to be alright with the data?
raw_player_info.head(5)

Unnamed: 0,URL,Player name,Born,Height,Weight,Age,Shoots,Country,Died
0,https://en.khl.ru/players/16673/,Sergei Abramov,1 February 1993,189.0,99.0,28.0,left,Russia,
1,https://en.khl.ru/players/16462/,Maxim Alyapkin,28 February 1993,183.0,79.0,28.0,left,Russia,
2,https://en.khl.ru/players/19200/,Dmitry Ambrozheichik,26 March 1995,174.0,78.0,26.0,right,Belarus,
3,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,2 January 1987,,,,right,Russia,7 September 2011
4,https://en.khl.ru/players/20844/,Semyon Afonasyevsky,15 October 1996,188.0,89.0,24.0,left,Russia,


This notebook is going to focus on processing the players' personal information only. The other three files will be processed in separate notebooks.

Already we can see two problems. All numeric values are stored as floats despite clearly being integers. On the KHL website itself the values are displayed as integers, so something might have gone wrong during the data scraping process.
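
One plausible cause of the floats, which the scraping code would have to confirm, is the missing values themselves: pandas stores an integer column that contains NaN as float64. The sketch below uses a toy series mirroring the first four 'Height' values (the variable names are illustrative) and shows that the nullable 'Int64' dtype would keep whole numbers if we ever needed them.

```python
import numpy as np
import pandas as pd

# Toy column mirroring the first four 'Height' values in the output above.
heights = pd.Series([189, 183, 174, np.nan], name='Height')

# A single NaN is enough for pandas to store the whole column as float64.
float_dtype = heights.dtype

# The nullable integer dtype keeps whole numbers and marks the gap as <NA>.
heights_int = heights.astype('Int64')

print(float_dtype, heights_int.dtype)
```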

At the same time, the player in the fourth row (index 3), Vitaly Anikeyenko, has no data for his height, weight and age. The reason is easy to spot: unfortunately, this player passed away on September 7th 2011. That is probably the case for all deceased players, but let us check.

In [2]:
raw_player_info[raw_player_info['Died'].notnull()]

Unnamed: 0,URL,Player name,Born,Height,Weight,Age,Shoots,Country,Died
3,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,2 January 1987,,,,right,Russia,7 September 2011
334,https://en.khl.ru/players/4800/,Mikhail Balandin,27 July 1980,,,,left,Russia,7 September 2011
399,https://en.khl.ru/players/3945/,Artyom S. Chernov,28 April 1982,,,,left,Russia,11 December 2020
426,https://en.khl.ru/players/14277/,Gennady Churilov,5 May 1987,,,,left,Russia,7 September 2011
444,https://en.khl.ru/players/14595/,Alexei Cherepanov,15 January 1989,,,,left,Russia,13 October 2008
545,https://en.khl.ru/players/16916/,Pavol Demitra,29 November 1974,,,,left,Slovakia,7 September 2011
578,https://en.khl.ru/players/15207/,Ray Emery,28 September 1982,,,,left,Canada,15 July 2018
676,https://en.khl.ru/players/13718/,Alexander Galimov,2 May 1985,,,,left,Russia,12 September 2011
1050,https://en.khl.ru/players/14474/,Marat Kalimulin,12 August 1988,,,,right,Russia,7 September 2011
1108,https://en.khl.ru/players/13719/,Igor Korolyov,6 September 1970,,,,left,Russia,7 September 2011


Indeed, we only know the physical attributes of players who are still living.
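
Eyeballing the table works for a short list, but the same claim can be verified programmatically. The following is a minimal sketch on a toy frame built from four rows shown above; the variable names are illustrative, and the second check (that living players have no gaps) is an extra assumption the real data may not satisfy.

```python
import numpy as np
import pandas as pd

# Toy frame: two living and two deceased players taken from the outputs above.
sample = pd.DataFrame({
    'Player name': ['Sergei Abramov', 'Vitaly Anikeyenko', 'Maxim Alyapkin', 'Ray Emery'],
    'Height': [189.0, np.nan, 183.0, np.nan],
    'Weight': [99.0, np.nan, 79.0, np.nan],
    'Age': [28.0, np.nan, 28.0, np.nan],
    'Died': [np.nan, '7 September 2011', np.nan, '15 July 2018'],
})

deceased = sample['Died'].notna()
physical = ['Height', 'Weight', 'Age']

# True if every deceased player lacks all three attributes.
all_missing_for_deceased = bool(sample.loc[deceased, physical].isna().all().all())

# True if no living player lacks any of them (not guaranteed for the full data).
none_missing_for_living = bool(sample.loc[~deceased, physical].notna().all().all())

print(all_missing_for_deceased, none_missing_for_living)
```

On the full data the same two expressions could be run against raw_player_info directly.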

Moreover, you may notice that quite a number of players passed away on September 7th 2011 - 21 players out of the 37 present in this list. Tragically, on that date the plane carrying the Lokomotiv Yaroslavl team crashed while they were travelling to their first match of the season.

But do we even care about a player's height and weight? After all, they are recorded as of May 31st 2021 and cannot properly represent his physical characteristics for other periods. As for a player's age, we already have his date of birth, so we can simply calculate it for any point in time, such as the season's start or a specific match.
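
To illustrate the last point, here is a minimal sketch of computing a player's age in completed years on an arbitrary date. The helper name age_on is hypothetical and not part of this project; the two dates of birth come from the table above.

```python
import pandas as pd


def age_on(born: pd.Series, on_date: str) -> pd.Series:
    """Return the age in completed years on `on_date` for each date of birth."""
    ref = pd.Timestamp(on_date)
    # Has the birthday already occurred in the reference year?
    had_birthday = (born.dt.month < ref.month) | (
        (born.dt.month == ref.month) & (born.dt.day <= ref.day)
    )
    # Subtract one year for players whose birthday is still ahead.
    return ref.year - born.dt.year - (~had_birthday).astype(int)


# Sergei Abramov and Semyon Afonasyevsky, as listed in the raw data.
born = pd.to_datetime(pd.Series(['1 February 1993', '15 October 1996']),
                      format='%d %B %Y')
ages = age_on(born, '2021-05-31')
print(ages.tolist())
```

For May 31st 2021 this reproduces the ages stored in the raw file for these two players, and any other reference date (a season's start, a match day) can be passed instead.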

In [3]:
# Let us drop the columns in question.
# drop() returns a new DataFrame, so raw_player_info stays untouched.
player_info = raw_player_info.drop(columns=['Height', 'Weight', 'Age'])
player_info.head(5)

Unnamed: 0,URL,Player name,Born,Shoots,Country,Died
0,https://en.khl.ru/players/16673/,Sergei Abramov,1 February 1993,left,Russia,
1,https://en.khl.ru/players/16462/,Maxim Alyapkin,28 February 1993,left,Russia,
2,https://en.khl.ru/players/19200/,Dmitry Ambrozheichik,26 March 1995,right,Belarus,
3,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,2 January 1987,right,Russia,7 September 2011
4,https://en.khl.ru/players/20844/,Semyon Afonasyevsky,15 October 1996,left,Russia,


Do we really need to know whether a player is still alive for our analysis? After all, many players have already retired from professional hockey or gone to lower-level leagues. But what if we wanted to analyse at what age players usually leave the KHL, and did not want to mistake a player's death for him no longer making the cut in the league?

That would still not give us much, since many players move from the KHL to the NHL, which is the exact opposite of what we are interested in. Separating those two cases would be impossible without acquiring some additional data, so let us drop the last column after all.

In [4]:
player_info = player_info.drop(['Died'], axis=1)
player_info.head(5)

Unnamed: 0,URL,Player name,Born,Shoots,Country
0,https://en.khl.ru/players/16673/,Sergei Abramov,1 February 1993,left,Russia
1,https://en.khl.ru/players/16462/,Maxim Alyapkin,28 February 1993,left,Russia
2,https://en.khl.ru/players/19200/,Dmitry Ambrozheichik,26 March 1995,right,Belarus
3,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,2 January 1987,right,Russia
4,https://en.khl.ru/players/20844/,Semyon Afonasyevsky,15 October 1996,left,Russia


In [5]:
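# Convert the 'Born' column from strings to datetimes.
# The explicit format matches values such as '1 February 1993' and avoids day/month guessing.
player_info['Born'] = pd.to_datetime(player_info['Born'], format='%d %B %Y')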
player_info.head(5)

Unnamed: 0,URL,Player name,Born,Shoots,Country
0,https://en.khl.ru/players/16673/,Sergei Abramov,1993-02-01,left,Russia
1,https://en.khl.ru/players/16462/,Maxim Alyapkin,1993-02-28,left,Russia
2,https://en.khl.ru/players/19200/,Dmitry Ambrozheichik,1995-03-26,right,Belarus
3,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,1987-01-02,right,Russia
4,https://en.khl.ru/players/20844/,Semyon Afonasyevsky,1996-10-15,left,Russia


Since there are no numeric values left in our dataframe, we no longer need to fix the issue of them being stored as floats. In essence, we have only removed some columns from the players' personal information and changed the format of the dates of birth.

In [6]:
# But we still need to save the result.
player_info.to_csv('../data/players_info.csv', encoding='utf8')
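
One thing to keep in mind when loading this file later: CSV stores dates as plain text, so the datetime type of 'Born' does not survive the round trip unless it is parsed again. The sketch below demonstrates this with a toy frame and an in-memory buffer instead of the real file; passing index=False is a simplification made here, not what the cell above does.

```python
import io

import pandas as pd

# Toy frame with a datetime column, standing in for player_info.
frame = pd.DataFrame({
    'Player name': ['Sergei Abramov', 'Maxim Alyapkin'],
    'Born': pd.to_datetime(['1993-02-01', '1993-02-28']),
})

# Write to an in-memory buffer rather than to disk.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Reading without hints gives the dates back as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Asking read_csv to parse the column restores the datetime type.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=['Born'])

print(plain['Born'].dtype, parsed['Born'].dtype)
```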