This stage of the project focuses on gathering the data for every player who has ever participated in a KHL match. It has three steps: gathering a comprehensive list of KHL profile links for all players, scraping the raw data from each link, and cleaning the data so that the data for all players can be combined without creating a mess.

The detailed process behind each step is shown in the notebooks players_scraping_list.ipynb, players_scraping_data.ipynb and players_cleaning_data.ipynb, respectively. Below is the final script for gathering our data.

WARNING: for testing purposes, the script currently runs on a very limited number of players. It will be scaled up afterwards.

In [1]:
# Importing all the packages we are going to need.

import requests
import bs4
import re

import numpy as np
import pandas as pd

import string
import time

The profile page URL for a player contains a number, and these numbers do not follow a continuous sequence. We could still get the data for all players by iterating over a large range of possible numbers, but many of those requests would end up on nonexistent pages. It is therefore more efficient to first get a list of all valid profile page URLs.

The KHL website has a list of players that includes, among other information, the number used in each profile link. However, the list is always filtered by the first letter of the surname. Once we know how to get the players' profile links for a specific letter, we can loop through the English alphabet to get all of them.

Let us make a function for it.

In [2]:
def get_profile_links(letter):

    # We are nice people and will add a time delay for each request to the KHL website.
    # After all, bombarding someone's servers with multiple requests all at once is a bit mean.

    time.sleep(5)

    # Constructing a URL for our target webpage and requesting it.

    base_url = f'https://en.khl.ru/players/season/all/?letter={letter}'
    result = requests.get(base_url)

    # The only thing we really need to get out of the result is the players' profile links.
    # The complicated regular expression you are about to see below was found on StackOverflow. Here is what each part does:

    # <a - an opening a tag
    # [^>]*? - followed by any characters other than >, as few as possible
    # href=' - then the href attribute, opened with a single quote
    # [^\">]+ - then one or more characters other than " and >
    # ' - and finally the closing single quote

    regex_outcome = re.findall(r"<a [^>]*?(href=\'([^\">]+)\')", result.text)

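    # The regular expression returns a list of tuples with two values: the link with and without the "href=" part.
    # We only need the latter. Duplicates are dropped with dict.fromkeys, which keeps the original order
    # (recent pandas versions refuse to build a dataframe from an unordered set).

    unique_links = list(dict.fromkeys(x[1] for x in regex_outcome))
    profile_links = pd.DataFrame({'Profile link': unique_links})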
    
    return profile_links
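
To make the regular expression less opaque, here is a minimal sketch that runs the same pattern on a hand-written HTML fragment. The fragment is hypothetical and only assumes that the links use single-quoted href attributes, as the pattern implies; the real KHL markup may differ.

```python
import re

# Hypothetical HTML fragment, written by hand for illustration only.
sample_html = (
    "<a class='name' href='/players/23434/'>Jonas Ahnelov</a>"
    "<a href='/players/16003/'>Daniil Apalkov</a>"
    "<a href='/players/23434/'>duplicate link</a>"
)

# The same pattern as in get_profile_links.
pattern = r"<a [^>]*?(href=\'([^\">]+)\')"

# Each match is a tuple: (link with the href= part, link without it).
matches = re.findall(pattern, sample_html)

# Keep only the bare links and drop duplicates while preserving order.
links = list(dict.fromkeys(match[1] for match in matches))

print(matches[0])
print(links)
```

One caveat worth keeping in mind: the character class excludes only `"` and `>`, so if a tag had further single-quoted attributes after href, the captured link would swallow them as well. The sketch avoids that case by keeping href as the last attribute.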

The profile links are ready and will be stored as a dataframe. The reason is that pandas has very convenient methods of reading and writing data into a .csv file.

The next function is going to tackle the issue of getting the data we need given a player's profile page.

In [3]:
def scrape_player_data(profile_link):

    # Again, first add a time delay for each request.
    
    time.sleep(5)
    
    # Constructing a URL for our target webpage and scraping its tables into a list of pandas dataframes.

    base_url = f'https://en.khl.ru{profile_link}'
    result = requests.get(base_url)
    result_pandas = pd.read_html(result.text)
    
    # Some of the tables we have scraped are website-wide tables showing the outcomes of the most recent matches.
    # Thus, we need to extract only the player-specific information from the result.
    # That information is contained in the last three tables on the page.
    
    player_info = result_pandas[-3]
    player_season_stat = result_pandas[-2]
    player_match_stat = result_pandas[-1]
    
    # None of the tables include the name of the player.
    # We can find it under the 'e-player_name' class using BeautifulSoup.
    # Warning - the names may contain trailing spaces, so let us clean them up.

    soup = bs4.BeautifulSoup(result.text, 'lxml')
    name = soup.select('.e-player_name')[0].text.strip()

    # Turning it into a dataframe with both player name and his KHL profile link to be added to the rest of the tables.

    player_name = pd.DataFrame([{ 'URL': base_url, 'Player name': name}])
    
    return player_name, player_info, player_season_stat, player_match_stat

Now we know how to get a list of all players' profile links and how to get an individual player's data from his link. What remains is to combine the data from all players.

The data, however, is far from clean in the form we get it. The headers and season indicators get messed up: the former merge with the first row and the latter become rows of their own. Moreover, the player name and profile link sit in a separate table and are missing from all three main tables of interest.

Let us fix that before merging the data for different players. To do so, we are going to make three different functions.

In [4]:
def fix_player_information(player_info):
    
    # Let us fix the player information first.
    # The header for it is always the first word in the field, so we are going to perform the split on it.

    split_info = player_info.iloc[0].str.split(n=1)

    # Then the first words at index 0 are saved as header while everything else at index 1 is saved as actual information.

    header = split_info.str[0]
    actual_info = split_info.str[1]

    # Finally, replace the column names with the correct header and the first row with the clean information.
    # The changes are applied to a copy so that the raw table passed in stays untouched.
    # .to_numpy() assigns the values by position instead of aligning on the old column labels.

    player_info = player_info.copy()
    player_info.columns = header
    player_info.iloc[0] = actual_info.to_numpy()
    
    return player_info
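
As an illustration of the header split, here is a minimal sketch on a hand-made table. The cell contents are hypothetical stand-ins for the fused "header value" cells described above, and the sketch builds a fresh dataframe instead of modifying one in place.

```python
import pandas as pd

# Hypothetical raw table: every cell holds the header and the value fused together.
raw_info = pd.DataFrame([["Born 11 December 1987", "Height 190", "Country Sweden"]])

# Split each cell once, on the first whitespace only.
split_info = raw_info.iloc[0].str.split(n=1)

# The first piece is the header, the remainder is the actual value.
header = split_info.str[0]
values = split_info.str[1]

# Build a clean one-row table from the two pieces.
clean_info = pd.DataFrame([values.to_numpy()], columns=header.to_numpy())

print(clean_info)
```

Splitting with n=1 matters here: a value such as "11 December 1987" contains spaces of its own and must stay in one piece.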

In [5]:
def fix_player_statistics(player_season_stat):
    
    # Now let us handle the problem with season statistics.
    # Every row with actual statistics needs at least one game played (GP) to be included in the table.
    # We can use it to separate out the rows with season identifiers - they will be the only ones with no games recorded.

    player_season_stat['Season'] = np.where(player_season_stat['GP'].isna(), player_season_stat['Tournament / Team'], np.nan)

    # We are interested in adding seasons to actual statistics as a separate column.
    # The Season column values are NaN when GP isn't NaN in that row - that is, it has games recorded.
    # The season identifier is always above the actual season statistics, so we can fill the NaN values from the row above.

    player_season_stat['Season'] = player_season_stat['Season'].ffill()

    # The Season column is now ready, dropping the rows that served only to identify seasons.

    player_season_stat = player_season_stat[player_season_stat['Tournament / Team'] != player_season_stat['Season']]

    # Finally, move the Season column to the start of the table and reset the index.

    player_season_stat = player_season_stat[['Season'] + [col for col in player_season_stat.columns if col != 'Season']]
    player_season_stat = player_season_stat.reset_index(drop=True)
    
    return player_season_stat
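
The forward-fill idea can be seen on a small hand-made table. This is a sketch under assumptions: the rows are made up, and it uses `Series.where` and a boolean mask in place of `np.where` and the column comparison used in the function above.

```python
import pandas as pd

# Hypothetical raw season table: season labels occupy rows of their own, with no GP value.
raw = pd.DataFrame({
    "Tournament / Team": ["Regular season 2016/2017", "Avangard (Omsk)",
                          "Playoffs 2016/2017", "Avangard (Omsk)"],
    "GP": [float("nan"), 44.0, float("nan"), 12.0],
})

# Rows without games played are the season identifiers.
is_season_row = raw["GP"].isna()

# Keep the label only on identifier rows, then carry it down to the rows below.
raw["Season"] = raw["Tournament / Team"].where(is_season_row).ffill()

# Drop the identifier rows and put Season first.
stats = raw[~is_season_row].reset_index(drop=True)
stats = stats[["Season", "Tournament / Team", "GP"]]

print(stats)
```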

In [6]:
def add_player_name(player_name, player_info, player_season_stat, player_match_stat):

    # Now that the headers are all fixed, let us add player name and KHL profile link to the tables.
    # The reason why we need a profile link is to have a unique player identifier.
    # After all, there is an unlikely yet possible event where two players have the exact same name.

    player_info = pd.concat([player_name, player_info], axis=1)
    
    # In case of multiple rows we need to additionally stretch the player name and profile link down.
    # It can be done the same way as previously with the season identifiers.
    
    player_season_stat = pd.concat([player_name, player_season_stat], axis=1)
    player_season_stat[['URL','Player name']] = player_season_stat[['URL','Player name']].ffill()

    player_match_stat = pd.concat([player_name, player_match_stat], axis=1)
    player_match_stat[['URL','Player name']] = player_match_stat[['URL','Player name']].ffill()
    
    return player_info, player_season_stat, player_match_stat
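
To see why the forward fill is needed after the concatenation, here is a minimal sketch with a one-row name table and a hypothetical two-row statistics table.

```python
import pandas as pd

# One-row table with the player's identifier and name.
player_name = pd.DataFrame([{"URL": "https://en.khl.ru/players/23434/",
                             "Player name": "Jonas Ahnelov"}])

# Hypothetical two-row statistics table.
season_stat = pd.DataFrame({"Season": ["Regular season 2016/2017", "Playoffs 2016/2017"],
                            "GP": [44, 12]})

# Side-by-side concatenation aligns on the index, so only row 0 gets a name.
combined = pd.concat([player_name, season_stat], axis=1)
missing_before = int(combined["URL"].isna().sum())

# Stretch the name and link down to the remaining rows.
combined[["URL", "Player name"]] = combined[["URL", "Player name"]].ffill()

print(missing_before)
print(combined)
```

The approach relies on both tables having a default integer index that starts at 0, which is one reason the statistics table has its index reset before this step.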

The tables are all ready! Now we just need to call the functions and combine the data for all players together.

In [7]:
# Creating empty dataframes so that our loops can append results to them.

list_profile_links = pd.DataFrame()
list_player_info = pd.DataFrame()
list_player_season_stat = pd.DataFrame()
list_player_match_stat = pd.DataFrame()

# We need to now loop through the English alphabet to get all profile links.    
# Instead of entering the list of English letters manually, we can get it in a smart and lazy way.

alphabet = list(string.ascii_uppercase)


for letter in alphabet:

    profile_links = get_profile_links(letter)
    
    # Appending the results to the outcome dataframe.
    # pd.concat is used because DataFrame.append was removed in pandas 2.0.

    list_profile_links = pd.concat([list_profile_links, profile_links], ignore_index=True)

# And let us save it straight away.

list_profile_links.to_csv('players_profile_links.csv', encoding='utf8', index=False)


# Start looping through profile links in the list and getting the data for each of them.

for profile in list_profile_links['Profile link']:

    player_name, player_info, player_season_stat, player_match_stat = scrape_player_data(profile)
    
    player_info = fix_player_information(player_info)
    player_season_stat = fix_player_statistics(player_season_stat)
    
    player_info, player_season_stat, player_match_stat = add_player_name(player_name, player_info,
                                                                         player_season_stat, player_match_stat)
    
    # Appending the results to the outcome dataframes (pd.concat replaces the removed DataFrame.append).
    
    list_player_info = pd.concat([list_player_info, player_info], ignore_index=True)
    list_player_season_stat = pd.concat([list_player_season_stat, player_season_stat], ignore_index=True)
    list_player_match_stat = pd.concat([list_player_match_stat, player_match_stat], ignore_index=True)
    
# Saving the final outcome.

list_player_info.to_csv('player_info.csv', encoding='utf8', index=False)
list_player_season_stat.to_csv('player_season_stat.csv', encoding='utf8', index=False)
list_player_match_stat.to_csv('player_match_stat.csv', encoding='utf8', index=False)

Below you can see the final outcome of our data-gathering script.

In [8]:
list_profile_links

Unnamed: 0,Profile link
0,/players/23434/
1,/players/16003/
2,/players/10846/
3,/players/39812/


In [9]:
list_player_info

Unnamed: 0,URL,Player name,Born,Height,Weight,Age,Shoots,Country
0,https://en.khl.ru/players/23434/,Jonas Ahnelov,11 December 1987,190,97,33,left,Sweden
1,https://en.khl.ru/players/16003/,Daniil Apalkov,1 January 1992,185,94,29,left,Russia
2,https://en.khl.ru/players/10846/,Maxim Balmochnykh,7 March 1979,182,88,42,left,Russia
3,https://en.khl.ru/players/39812/,Andrei Bakanov,28 May 2002,190,97,18,left,Russia


In [10]:
list_player_match_stat

Unnamed: 0,URL,Player name,IDSeason,Season,Team,Date,Teams,Score,№,G,...,SOG,%SOG,FO,FOW,%FO,TOI,SFT,HITS,BLS,FOA
0,https://en.khl.ru/players/23434/,Jonas Ahnelov,309,Regular season 2015/2016,34,25 Aug 2015,Avangard - Salavat Yulaev,4:2,5,1,...,1,100.0,0,0,-,19:57,28.0,3.0,2.0,0.0
1,https://en.khl.ru/players/23434/,Jonas Ahnelov,309,Regular season 2015/2016,34,27 Aug 2015,Avangard - Metallurg Mg,3:4 Б,5,0,...,2,0.0,0,0,-,20:31,29.0,1.0,3.0,1.0
2,https://en.khl.ru/players/23434/,Jonas Ahnelov,309,Regular season 2015/2016,34,30 Aug 2015,Traktor - Avangard,1:2,5,0,...,1,0.0,0,0,-,20:17,26.0,0.0,1.0,0.0
3,https://en.khl.ru/players/23434/,Jonas Ahnelov,309,Regular season 2015/2016,34,1 Sep 2015,Neftekhimik - Avangard,1:4,5,0,...,0,-,0,0,-,18:43,25.0,0.0,1.0,1.0
4,https://en.khl.ru/players/23434/,Jonas Ahnelov,309,Regular season 2015/2016,34,3 Sep 2015,Salavat Yulaev - Avangard,2:3,5,0,...,1,0.0,0,0,-,19:24,25.0,0.0,2.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
752,https://en.khl.ru/players/39812/,Andrei Bakanov,1045,Regular season 2020/2021,568,18 Feb 2021,Severstal - Kunlun RS,3:1,89,0,...,0,-,0,0,-,3:57,6.0,0.0,0.0,0.0
753,https://en.khl.ru/players/39812/,Andrei Bakanov,1045,Regular season 2020/2021,568,20 Feb 2021,Salavat Yulaev - Kunlun RS,5:4,89,0,...,0,-,0,0,-,4:42,5.0,0.0,1.0,0.0
754,https://en.khl.ru/players/39812/,Andrei Bakanov,1045,Regular season 2020/2021,568,22 Feb 2021,Kunlun RS - SKA,3:4,89,0,...,0,-,0,0,-,-,-,0.0,0.0,0.0
755,https://en.khl.ru/players/39812/,Andrei Bakanov,1045,Regular season 2020/2021,568,25 Feb 2021,Kunlun RS - Barys,0:3,89,0,...,0,-,0,0,-,2:09,4.0,2.0,0.0,0.0


In [11]:
list_player_season_stat

Unnamed: 0,URL,Player name,Season,Tournament / Team,№,GP,G,Assists,PTS,+/-,...,%SOG,S/G,FO,FOW,%FO,TOI/G,SFT/G,HITS,BLS,FOA
0,https://en.khl.ru/players/23434/,Jonas Ahnelov,Regular season 2017/2018,Avangard (Omsk),5.0,42.0,3.0,7.0,10.0,-3.0,...,6.4,1.1,0.0,0.0,-,15:01,22.0,28.0,51.0,1.0
1,https://en.khl.ru/players/23434/,Jonas Ahnelov,Playoffs 2016/2017,Avangard (Omsk),5.0,12.0,3.0,2.0,5.0,12.0,...,14.3,1.8,0.0,0.0,-,19:47,28.6,7.0,13.0,3.0
2,https://en.khl.ru/players/23434/,Jonas Ahnelov,Regular season 2016/2017,Avangard (Omsk),5.0,44.0,4.0,6.0,10.0,16.0,...,8.3,1.1,0.0,0.0,-,18:13,25.6,28.0,57.0,4.0
3,https://en.khl.ru/players/23434/,Jonas Ahnelov,Playoffs 2015/2016,Avangard (Omsk),5.0,7.0,0.0,1.0,1.0,3.0,...,0.0,0.9,0.0,0.0,-,15:46,21.4,6.0,14.0,3.0
4,https://en.khl.ru/players/23434/,Jonas Ahnelov,Regular season 2015/2016,Avangard (Omsk),5.0,54.0,2.0,7.0,9.0,9.0,...,4.2,0.9,0.0,0.0,-,17:03,23.6,48.0,60.0,9.0
5,https://en.khl.ru/players/23434/,Jonas Ahnelov,KHL Summary,Regular season:,,140.0,9.0,20.0,29.0,22.0,...,6.3,1.0,0.0,0.0,-,16:49,23.7,104.0,168.0,14.0
6,https://en.khl.ru/players/23434/,Jonas Ahnelov,KHL Summary,Playoffs:,,19.0,3.0,3.0,6.0,15.0,...,11.1,1.4,0.0,0.0,-,18:18,25.9,13.0,27.0,6.0
7,https://en.khl.ru/players/23434/,Jonas Ahnelov,KHL Summary,KHL Total:,,159.0,12.0,23.0,35.0,37.0,...,7.1,1.1,0.0,0.0,-,16:59,24.0,117.0,195.0,20.0
8,https://en.khl.ru/players/16003/,Daniil Apalkov,Regular season 2020/2021,Sochi (Sochi),40.0,15.0,0.0,1.0,1.0,-8.0,...,0.0,0.5,150.0,64.0,42.7,11:48,17.9,2.0,6.0,2.0
9,https://en.khl.ru/players/16003/,Daniil Apalkov,Regular season 2019/2020,Lokomotiv (Yaroslavl),40.0,31.0,1.0,4.0,5.0,-8.0,...,3.1,1.0,131.0,64.0,48.9,9:45,14.3,15.0,4.0,1.0


ERROR!

The script broke on line 701 of the list of player profile links. The underlying reason is that Nikita Glotov (profile page https://en.khl.ru/players/16467/) participated in an off-season tournament between KHL teams (the Nadezhda Cup) but never in a regular-season or playoff match. As a result, he has no match-level statistics, yet he is in the KHL database because he has season-level statistics, however weird that might sound.

The last table on the page is supposed to hold the match-level statistics, yet for this player it holds the season-level statistics. This shifts every table by one and messes up the entire dataset: the season-level slot receives the player information, and the player-information slot receives an unrelated table showing the result of one of the latest matches, which is present on all pages of the KHL website.

We definitely need to alter the script so that it checks whether the dataframe for the player's match statistics really contains match statistics. The easiest way to do that is to check the header: if it matches the standard one, the script goes ahead with cleaning the data and adding it to the rest. However, we still need to decide what to do with such players once they are identified.

Two approaches can be taken here: drop them from the data altogether, or fix the player information and add the player's data to the rest without any match data. Let us try the latter first, even though that kind of data will probably get dropped later on in the analysis.
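
A header check along these lines could look like the sketch below. It is not the final implementation: the helper names `has_match_stat` and `pick_player_tables` are hypothetical, and the marker columns `Date`, `Teams` and `Score` are an assumption based on the match-level output shown earlier.

```python
import pandas as pd

# Assumed marker columns: present in the match-level table, absent from the others.
MATCH_COLUMNS = {"Date", "Teams", "Score"}


def has_match_stat(table):
    """Return True if the table's header contains the match-level columns."""
    return MATCH_COLUMNS.issubset(set(table.columns))


def pick_player_tables(tables):
    """Hypothetical helper: pick the info, season and match tables from a scraped list."""
    if has_match_stat(tables[-1]):
        return tables[-3], tables[-2], tables[-1]
    # No match-level table: the last two tables are info and season statistics.
    return tables[-2], tables[-1], pd.DataFrame()


# Toy tables standing in for the output of pd.read_html.
widget = pd.DataFrame({"Latest match": ["3:1"]})
info = pd.DataFrame([["Born 1 January 1992", "Height 185"]])
season = pd.DataFrame({"Tournament / Team": ["Sochi (Sochi)"], "GP": [15.0]})
match = pd.DataFrame({"Date": ["25 Aug 2015"], "Teams": ["Avangard - Traktor"], "Score": ["4:2"]})

regular_player = pick_player_tables([widget, info, season, match])
off_season_only = pick_player_tables([widget, info, season])

print(regular_player[2].empty, off_season_only[2].empty)
```

Returning an empty dataframe for the missing match table keeps the return shape the same for both kinds of players, so the rest of the pipeline would need fewer special cases. Whether an empty match table survives the later cleaning steps unchanged would still have to be checked.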