This stage of the project focuses on gathering the data for every player ever to participate in a KHL match. The three steps are as follows: gathering a comprehensive list of KHL profile links for all players, web scraping the raw data from each link and cleaning the data so that we can combine the data for all players' together without creating a mess.

The detailed process behind each step is shown in the notebooks players_scraping_list.ipynb, players_scraping_data.ipynb and players_cleaning_data.ipynb accordingly. Below is the final script for gathering our data.

WARNING: the script is currently working on a very limited number of players for the purpose of testing. It will be scaled up afterwards.

In [1]:
# Importing the packages we are going to need

import requests
import bs4
import re

import numpy as np
import pandas as pd

import string
import time


# The list of players available on the KHL website is filtered by the first letter of their surname.
# So we will need to loop through the English alphabet to get all of their profile links.
# alphabet = list(string.ascii_uppercase) - COMMENTED OUT FOR TEST PURPOSES

alphabet = ['A', 'B']

# Creating an empty dataframe so that the loop can append results to it.

list_profile_links = pd.DataFrame()


for letter in alphabet:
    
    # We are nice people and will add a time delay for each request to the KHL website.
    # After all, bombarding someone's servers with multiple requests all at once is a bit mean.
    
    time.sleep(5)
    
    # Constructing an URL for our target webpage and get it.
    
    base_url = f'https://en.khl.ru/players/season/all/?letter={letter}'
    result = requests.get(base_url)
    
    # The only thing we really need to get out of the result is the players' profile links.
    # The complicated regular expression you are about to see below was found on StackOverflow, here is its description.

    # <a - is an a tag
    # [^>]*? - can have any characters that are not >
    # href=" - have href
    # [^\">]+ - have any number of characters other than " and >

    regex_outcome = re.findall(r"<a [^>]*?(href=\'([^\">]+)\')", result.text)
    
    # The regular expression returns a list of tuples containing two values, with/without the "href-" part.
    # We only need the latter option.
    
    profile_links = pd.DataFrame({x[1:] for x in regex_outcome}, columns=['Profile link'])
    
    # Appending the results to the outcome dataframe.
    # list_profile_links =  list_profile_links.append(profile_links, ignore_index=True) - COMMENTED OUT FOR TEST PURPOSES
    
    list_profile_links =  list_profile_links.append(profile_links[:5], ignore_index=True)
    

# Now that we have a list of profile links, let us start scraping the data for individual players.
# We are going to have three outcome tables of interest, so we will need three empty dataframes.

list_player_info = pd.DataFrame()
list_player_season_stat = pd.DataFrame()
list_player_match_stat = pd.DataFrame()

for profile in list_profile_links['Profile link']:
    
    # Again, first add a time delay for each request.
    
    time.sleep(5)
    
    # Constructing an URL for our target webpage and scrape it into a Pandas dataframe.

    base_url = f'https://en.khl.ru{profile}'
    result = requests.get(base_url)
    result_pandas = pd.read_html(result.text)
    
    # Some of the tables we have scraped are website-wide tables showing the outcomes of the most recent matches.
    # Thus, we need to extract only the player-specific information from the result.
    # That information is contained in the last three tables on the page.
    
    player_info = result_pandas[-3]
    player_season_stat = result_pandas[-2]
    player_match_stat = result_pandas[-1]
    
    # None of the tables include the name of the player.
    # We can find it under the 'e-player_name' class using BeautifulSoup.
    # Warning - the names may contain trailing spaces, so let us clean them up.

    soup = bs4.BeautifulSoup(result.text, 'lxml')
    name = soup.select('.e-player_name')[0].text.strip()

    # Turning it into a dataframe with both player name and his KHL profile link to be added to the rest of the tables.

    player_name = pd.DataFrame([{ 'URL': base_url, 'Player name': name}])
    
    # Unfortunately, the headers got messed up during web scraping.
    # For player info the header is in the first row.
    # For season statistics the season indicators are parsed as a row of their own.
    
    
    # Let us handle the problem with player information first.
    # The header for it is always the first word in the field, so we are going to perform the split on it.

    split_info = player_info.iloc[0].str.split(n=1)

    # Then the first words at index 0 are saved as header while everything else at index 1 is saved as actual information.

    header = split_info.str[0]
    actual_info = split_info.str[1]

    # Finally, replace the column names with correct header and replace the first row with clean information.
    # All changes are going to be applied to new variables, to keep original data in raw_ set of variables.

    player_info.columns = header
    player_info.iloc[0] = actual_info
    
    
    # Now let us handle the problem with season statistics.
    # Every row with actual statistics needs at least one game played (GP) to be included in the table.
    # We can use it to separate out the rows with season identifiers - they will be the only ones with no games recorded.

    player_season_stat['Season'] = np.where(np.isnan(player_season_stat['GP']), player_season_stat['Tournament / Team'], np.NaN)

    # We are interested in adding seasons to actual statistics as a separate column.
    # The Season column values are NaN when GP isn't NaN in that row - that is, it has games recorded.
    # The season identifiers is always above the actual season statistics, so we can fill the NaN values with row above.

    player_season_stat['Season'] = player_season_stat['Season'].fillna(method='ffill')

    # The Season column is now ready, dropping the rows that served only to identify seasons.

    player_season_stat = player_season_stat[player_season_stat['Tournament / Team'] != player_season_stat['Season']]

    # Finally, moving the Season column to the start of the table and update the index.

    player_season_stat = player_season_stat[['Season'] + [col for col in player_season_stat.columns if col != 'Season']]
    player_season_stat = player_season_stat.reset_index(drop=True)
    
    
    # Now that the headers are all fixed, let us add player name and KHL profile link to the tables.
    # The reason why we need a profile link is to have a unique player identifier.
    # After all, there is an unlikely yet possible event where two players have the exact same name.

    player_info = pd.concat([player_name, player_info], axis=1)
    
    # In case of multiple rows we need to additionally stretch the player name and profile link down.
    # It can be done the same way as previously with the season identifiers.
    
    player_season_stat = pd.concat([player_name, player_season_stat], axis=1)
    player_season_stat[['URL','Player name']] = player_season_stat[['URL','Player name']].fillna(method='ffill')

    player_match_stat = pd.concat([player_name, player_match_stat], axis=1)
    player_match_stat[['URL','Player name']] = player_match_stat[['URL','Player name']].fillna(method='ffill')

    
    # Appending the results to the outcome dataframe.
    
    list_player_info =  list_player_info.append(player_info, ignore_index=True)
    list_player_season_stat =  list_player_season_stat.append(player_season_stat, ignore_index=True)
    list_player_match_stat =  list_player_match_stat.append(player_match_stat, ignore_index=True)

In [2]:
list_profile_links

Unnamed: 0,Profile link
0,/players/14252/
1,/players/19010/
2,/players/16824/
3,/players/31163/
4,/players/16785/
5,/players/21424/
6,/players/31877/
7,/players/15696/
8,/players/19034/
9,/players/18975/


In [3]:
list_player_info

Unnamed: 0,URL,Player name,Born,Height,Weight,Age,Shoots,Country
0,https://en.khl.ru/players/14252/,Nikita Alexeyev,27 December 1981,198,112,39,left,Russia
1,https://en.khl.ru/players/19010/,Pontus Aberg,23 September 1993,183,88,27,right,Sweden
2,https://en.khl.ru/players/16824/,Nick Angell,31 October 1979,181,100,41,right,USA
3,https://en.khl.ru/players/31163/,Yohann Auvitu,27 July 1989,182,88,31,left,France
4,https://en.khl.ru/players/16785/,Juhamatti Aaltonen,4 June 1985,184,89,35,right,Finland
5,https://en.khl.ru/players/21424/,Patrick Bjorkstrand,1 July 1992,184,87,28,left,Denmark
6,https://en.khl.ru/players/31877/,Beau Bennett,27 November 1991,188,93,29,right,USA
7,https://en.khl.ru/players/15696/,Nikolai Bogomolov,30 May 1991,176,81,29,left,Russia
8,https://en.khl.ru/players/19034/,Timur Bilyalov,28 March 1995,179,79,26,left,Russia
9,https://en.khl.ru/players/18975/,Vladislav Boiko,15 December 1995,194,86,25,left,Russia


In [4]:
list_player_match_stat

Unnamed: 0,URL,Player name,IDSeason,Season,Team,Date,Teams,Score,№,G,...,BLS,FOA,W,L,SOP,GA,Sv,%Sv,GAA,SO
0,https://en.khl.ru/players/14252/,Nikita Alexeyev,160,Regular season 2008/2009,53,5 Sep 2008,Ak Bars - Khimik,4:0,81,0,...,,,,,,,,,,
1,https://en.khl.ru/players/14252/,Nikita Alexeyev,160,Regular season 2008/2009,53,7 Sep 2008,Ak Bars - Barys,3:2,81,0,...,,,,,,,,,,
2,https://en.khl.ru/players/14252/,Nikita Alexeyev,160,Regular season 2008/2009,53,11 Sep 2008,Ak Bars - Dynamo M,3:4 ОТ,81,0,...,,,,,,,,,,
3,https://en.khl.ru/players/14252/,Nikita Alexeyev,160,Regular season 2008/2009,53,15 Sep 2008,Lada - Ak Bars,3:2 ОТ,81,0,...,,,,,,,,,,
4,https://en.khl.ru/players/14252/,Nikita Alexeyev,160,Regular season 2008/2009,53,17 Sep 2008,Barys - Ak Bars,2:7,81,1,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1057,https://en.khl.ru/players/19034/,Timur Bilyalov,1046,Playoffs 2020/2021,53,9 Apr 2021,Avangard - Ak Bars,1:2,82,0,...,,,1.0,0.0,0.0,1.0,33.0,97.1,1.00,0.0
1058,https://en.khl.ru/players/19034/,Timur Bilyalov,1046,Playoffs 2020/2021,53,11 Apr 2021,Ak Bars - Avangard,2:3 ОТ,82,0,...,,,0.0,1.0,0.0,3.0,17.0,85.0,2.92,0.0
1059,https://en.khl.ru/players/19034/,Timur Bilyalov,1046,Playoffs 2020/2021,53,13 Apr 2021,Avangard - Ak Bars,0:2,82,0,...,,,1.0,0.0,0.0,0.0,18.0,100.0,0.00,1.0
1060,https://en.khl.ru/players/19034/,Timur Bilyalov,1046,Playoffs 2020/2021,53,15 Apr 2021,Ak Bars - Avangard,3:4 ОТ,82,0,...,,,0.0,1.0,0.0,4.0,29.0,87.9,3.83,0.0


In [5]:
list_player_season_stat

Unnamed: 0,URL,Player name,Season,Tournament / Team,№,GP,G,Assists,PTS,+/-,...,FOA,W,L,SOP,GA,Sv,%Sv,GAA,SO,TOI
0,https://en.khl.ru/players/14252/,Nikita Alexeyev,Playoffs 2012/2013,Severstal (Cherepovets),45.0,9.0,2.0,0.0,2.0,0.0,...,,,,,,,,,,
1,https://en.khl.ru/players/14252/,Nikita Alexeyev,Regular season 2012/2013,Severstal (Cherepovets),45.0,23.0,2.0,2.0,4.0,2.0,...,,,,,,,,,,
2,https://en.khl.ru/players/14252/,Nikita Alexeyev,Playoffs 2011/2012,Severstal (Cherepovets),15.0,6.0,1.0,2.0,3.0,-1.0,...,,,,,,,,,,
3,https://en.khl.ru/players/14252/,Nikita Alexeyev,Regular season 2011/2012,Severstal (Cherepovets),15.0,7.0,0.0,1.0,1.0,0.0,...,,,,,,,,,,
4,https://en.khl.ru/players/14252/,Nikita Alexeyev,Playoffs 2010/2011,Ak Bars (Kazan),81.0,9.0,0.0,0.0,0.0,-3.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65,https://en.khl.ru/players/19034/,Timur Bilyalov,KHL Summary,Playoffs:,,15.0,0.0,1.0,,,...,,13.0,2.0,0.0,17.0,403.0,96.0,1.11,3.0,917:58
66,https://en.khl.ru/players/19034/,Timur Bilyalov,KHL Summary,KHL Total:,,123.0,0.0,5.0,,,...,,67.0,28.0,12.0,205.0,2966.0,93.5,1.86,18.0,6619:27
67,https://en.khl.ru/players/18975/,Vladislav Boiko,Regular season 2018/2019,Admiral (Vladivostok),91.0,1.0,0.0,0.0,0.0,-2.0,...,0.0,,,,,,,,,
68,https://en.khl.ru/players/18975/,Vladislav Boiko,KHL Summary,Regular season:,,1.0,0.0,0.0,0.0,-2.0,...,0.0,,,,,,,,,
