This notebook aims to scrape the statistics for every season played by every player listed on the KHL website.

In [1]:
# Importing standard packages for web scraping

import requests
import bs4
import pandas as pd

Let us start with Konstantin Okulov, the best scorer of KHL playoffs 2020-2021 and also a former player of my favourite hockey team, HC Sibir.

In [2]:
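# Scraping the data and transforming it into Pandas dataframes.

from io import StringIO

base_url = 'https://en.khl.ru/players/18770/'
result = requests.get(base_url)
result.raise_for_status()  # stop early if the request failed

# pd.read_html returns a list with one dataframe per <table> found on the page.
# Recent pandas versions expect a file-like object rather than a raw HTML string.
result_pandas = pd.read_html(StringIO(result.text))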

# Some of the tables we have scraped are website-wide tables showing the outcomes of the most recent matches.
# Thus, we need to extract only the player-specific information from the result.

player_info = result_pandas[-3]
player_season_stat = result_pandas[-2]
player_match_stat = result_pandas[-1]

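# Saving the data to separate files for future reference and ease of browsing.
# Forward slashes avoid invalid escape sequences such as '\p' and work on every OS.
# The files are tab-separated; the folder is created if it does not exist yet.

import os
os.makedirs('okulov_data', exist_ok=True)
player_info.to_csv('okulov_data/player_info.csv', sep='\t', encoding='utf8', index=False)
player_season_stat.to_csv('okulov_data/player_season_stat.csv', sep='\t', encoding='utf8', index=False)
player_match_stat.to_csv('okulov_data/player_match_stat.csv', sep='\t', encoding='utf8', index=False)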
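
Picking the tables by position (`-3`, `-2`, `-1`) relies on the page always listing them in the same order. A possible alternative, sketched here under the assumption that the column names seen in the outputs of this notebook (`Tournament / Team`, `IDSeason`, `Date`) stay stable, is to select each table by the columns it contains. The helper name `pick_table` is hypothetical and the frames below are toy stand-ins for the scraped list.

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first dataframe whose columns include every required name."""
    required = set(required_columns)
    for table in tables:
        # Column labels are compared as strings so numeric headers do not break the check.
        if required.issubset(map(str, table.columns)):
            return table
    raise LookupError(f"no table with columns {sorted(required)}")


# Toy stand-ins for the list returned by pd.read_html (assumed shapes, not real data).
recent_scores = pd.DataFrame({"Home": ["CSKA"], "Away": ["Avangard"]})
season_table = pd.DataFrame({"Tournament / Team": ["CSKA (Moscow)"], "GP": [23]})
match_table = pd.DataFrame({"IDSeason": [244], "Date": ["18 Sep 2013"]})
tables = [recent_scores, season_table, match_table]

season_stats = pick_table(tables, ["Tournament / Team"])
match_stats = pick_table(tables, ["IDSeason", "Date"])
```

If the site ever adds or reorders a table, this kind of lookup fails loudly with a `LookupError` instead of silently returning the wrong table.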

In [3]:
player_info

Unnamed: 0,0,1,2,3,4,5
0,Born 18 February 1995,Height 183,Weight 83,Age 26,Shoots left,Country Russia


In [4]:
player_season_stat

Unnamed: 0,Tournament / Team,№,GP,G,Assists,PTS,+/-,+,-,PIM,...,%SOG,S/G,FO,FOW,%FO,TOI/G,SFT/G,HITS,BLS,FOA
0,Playoffs 2020/2021,,,,,,,,,,...,,,,,,,,,,
1,CSKA (Moscow),71.0,23.0,6.0,14.0,20.0,10.0,19.0,9.0,2.0,...,9.8,2.7,3.0,0.0,0.0,16:53,23.4,7.0,5.0,8.0
2,Regular season 2020/2021,,,,,,,,,,...,,,,,,,,,,
3,CSKA (Moscow),71.0,55.0,18.0,31.0,49.0,14.0,43.0,29.0,6.0,...,14.0,2.3,9.0,2.0,22.2,15:59,19.9,5.0,13.0,13.0
4,Playoffs 2019/2020,,,,,,,,,,...,,,,,,,,,,
5,CSKA (Moscow),71.0,3.0,0.0,1.0,1.0,2.0,2.0,0.0,0.0,...,0.0,2.7,0.0,0.0,-,14:27,18.0,1.0,0.0,0.0
6,Regular season 2019/2020,,,,,,,,,,...,,,,,,,,,,
7,CSKA (Moscow),71.0,56.0,17.0,21.0,38.0,14.0,35.0,21.0,22.0,...,12.1,2.5,39.0,19.0,48.7,14:48,18.6,8.0,21.0,7.0
8,Playoffs 2018/2019,,,,,,,,,,...,,,,,,,,,,
9,CSKA (Moscow),71.0,19.0,7.0,7.0,14.0,8.0,11.0,3.0,2.0,...,15.2,2.4,8.0,4.0,50.0,13:35,19.2,4.0,4.0,1.0


In [5]:
player_match_stat

Unnamed: 0,IDSeason,Season,Team,Date,Teams,Score,№,G,Assists,PTS,...,SOG,%SOG,FO,FOW,%FO,TOI,SFT,HITS,BLS,FOA
0,244,Regular season 2013/2014,29,18 Sep 2013,Metallurg Mg - Sibir,3:2 ОТ,17,0,0,0,...,0,-,0,0,-,1:53,3.0,,,
1,244,Regular season 2013/2014,29,22 Sep 2013,Neftekhimik - Sibir,4:3,17,0,0,0,...,0,-,0,0,-,-,-,,,
2,244,Regular season 2013/2014,29,24 Sep 2013,Traktor - Sibir,1:0,17,0,0,0,...,0,-,0,0,-,0:17,1.0,,,
3,244,Regular season 2013/2014,29,11 Dec 2013,Avangard - Sibir,3:2,17,0,0,0,...,0,-,1,1,100.0,5:05,7.0,,,
4,244,Regular season 2013/2014,29,13 Dec 2013,Barys - Sibir,3:4,17,0,0,0,...,0,-,0,0,-,-,-,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
361,1046,Playoffs 2020/2021,2,20 Apr 2021,CSKA - Avangard,3:0,71,0,1,1,...,3,0.0,0,0,-,17:46,22.0,1.0,1.0,1.0
362,1046,Playoffs 2020/2021,2,22 Apr 2021,Avangard - CSKA,1:2,71,1,1,2,...,6,16.7,0,0,-,11:47,21.0,0.0,0.0,1.0
363,1046,Playoffs 2020/2021,2,24 Apr 2021,Avangard - CSKA,4:3 ОТ,71,0,1,1,...,5,0.0,1,0,0.0,17:55,27.0,0.0,0.0,1.0
364,1046,Playoffs 2020/2021,2,26 Apr 2021,CSKA - Avangard,0:2,71,0,0,0,...,5,0.0,0,0,-,18:16,23.0,0.0,0.0,0.0


We can see several problems with the raw data that need to be handled.

First, the header of the player information table was not recognised as such: each cell holds both the label and the value (for example, "Height 183"). Second, a similar thing occurred in the season stats table, where the season identifier was parsed as a separate row instead of an additional column.

Lastly, none of the tables contains the player's name. While this does not hamper our ability to analyse this particular player's statistics, the intention is to gather the data for all players in the league. As such, we must be able to recognise which player a particular row refers to.
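
To make the last two points concrete, here is a minimal sketch on a toy frame shaped like the season stats output above. It treats any row that is empty everywhere except the first column as a season label, carries that label down to the team rows beneath it, and adds the player's name. The function name `tidy_season_stats` and the new column names `Player`, `Season` and `Team` are assumptions made for this illustration.

```python
import pandas as pd


def tidy_season_stats(raw, player_name):
    """Turn season label rows into a 'Season' column and add a 'Player' column."""
    first_col = raw.columns[0]
    # A label row has a value in the first column and NaN in every other column.
    is_label = raw.drop(columns=first_col).isna().all(axis=1)
    # Keep the labels only, then carry each one down to the rows below it.
    season = raw[first_col].where(is_label).ffill()
    tidy = raw.loc[~is_label].copy()
    tidy.insert(0, "Season", season[~is_label])
    tidy.insert(0, "Player", player_name)
    return tidy.rename(columns={first_col: "Team"}).reset_index(drop=True)


# Toy frame with the same layout as the scraped season table (only two stat columns).
toy = pd.DataFrame({
    "Tournament / Team": ["Playoffs 2020/2021", "CSKA (Moscow)",
                          "Regular season 2020/2021", "CSKA (Moscow)"],
    "GP": [None, 23.0, None, 55.0],
    "G": [None, 6.0, None, 18.0],
})
tidy = tidy_season_stats(toy, "Konstantin Okulov")
print(tidy)
```

The sketch assumes a team row always has at least one non-empty statistic; a team row with no data at all would be mistaken for a season label.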

In [123]:
# Let us handle the problem with player information first.
# The header for it is always the first word in the field, so we are going to perform the split on it.

split_info = player_info.iloc[0].str.split(n=1)

# Then the first words at index 0 are saved as header while everything else at index 1 is saved as actual information.

header = split_info.str[0]
actual_info = split_info.str[1]

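# Finally, replace the column names with the correct header and the first row with the clean information.
# .to_numpy() drops the old 0..5 labels, so the values are assigned by position
# regardless of how the installed pandas version aligns a Series on assignment.

player_info.columns = header
player_info.iloc[0] = actual_info.to_numpy()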

In [124]:
player_info

Unnamed: 0,Born,Height,Weight,Age,Shoots,Country
0,18 February 1995,183,83,26,left,Russia
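
Since the same clean-up will be needed for every player, the steps above could be wrapped in a function that returns a new dataframe instead of modifying the scraped one in place. This is only a sketch: the name `clean_player_info` is hypothetical, and it assumes every cell follows the "label value" pattern seen for this player.

```python
import pandas as pd


def clean_player_info(raw):
    """Split the 'Label value' cells of a one-row table into a header and a value row."""
    # Split each cell once, on the first run of whitespace.
    parts = raw.iloc[0].astype(str).str.split(n=1)
    labels = parts.str[0].tolist()
    values = parts.str[1].tolist()
    return pd.DataFrame([values], columns=labels)


# Toy input shaped like the raw player information table.
raw_info = pd.DataFrame([["Born 18 February 1995", "Height 183", "Shoots left"]])
clean_info = clean_player_info(raw_info)
print(clean_info)
```

All values stay strings here; converting `Height`, `Weight` and `Age` to numbers would be a separate step.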
