This stage of the project focuses on gathering the data for every player who has ever participated in a KHL match. There are three steps: gathering a comprehensive list of KHL profile links for all players, scraping the raw data from each link, and cleaning the data so that the data for all players can be combined without creating a mess.

The detailed process behind each step is shown in the notebooks players_scraping_list.ipynb, players_scraping_data.ipynb and players_cleaning_data.ipynb respectively. Below is the final script for gathering our data.

In [1]:
# Importing all the packages we are going to need.

import requests
import bs4
import re

import numpy as np
import pandas as pd

import string
import time

The profile page URL for a player contains a number, and these numbers do not form a continuous sequence. We could still get the data for all players by iterating over a large range of possible numbers, but many of those requests would end up on nonexistent pages. Therefore, it is more efficient to first get a list of all valid profile page URLs.

There is a list of players available on the KHL website, containing their information including the number for the profile link. However, the list is permanently filtered by the first letter of their surname. Once we know how to get the list of players' profile links for a specific letter, we can loop through the English alphabet to get all of them.
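
As a small illustration of the alphabet loop, here is a minimal sketch that only builds the per-letter listing URLs without requesting them. The helper name `letter_url` is hypothetical, and the URL pattern is assumed to be the one used by the script in this notebook.

```python
import string

# Hypothetical helper: builds the listing URL for one surname letter.
# The URL pattern is an assumption taken from the script in this notebook.
def letter_url(letter):
    return f'https://en.khl.ru/players/season/all/?letter={letter}'

# One URL per letter of the English alphabet, A through Z.
letter_urls = [letter_url(letter) for letter in string.ascii_uppercase]

print(len(letter_urls))
print(letter_urls[0])
```

With one request per letter, the full list of players takes 26 requests instead of one request per candidate profile number.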

Let us make a function for it.

In [2]:
def get_profile_links(letter):

    # We are nice people and will add a time delay for each request to the KHL website.
    # After all, bombarding someone's servers with multiple requests all at once is a bit mean.

    time.sleep(5)

    # Constructing the URL for our target webpage and fetching it.

    base_url = f'https://en.khl.ru/players/season/all/?letter={letter}'
    result = requests.get(base_url)

    # The only thing we really need to get out of the result is the players' profile links.
    # The regular expression below was found on StackOverflow; here is what each part does.

    # <a - matches the start of an a tag
    # [^>]*? - then any characters that are not >, as few as possible
    # href=' - then the href attribute, opened with a single quote
    # [^">]+ - then one or more characters other than " and >
    # ' - and finally the closing single quote

    regex_outcome = re.findall(r"<a [^>]*?(href=\'([^\">]+)\')", result.text)

    # The regular expression returns a list of tuples with two values: the link with and without the "href=" part.
    # We only need the latter, so we take the second element and drop duplicates while keeping the page order.

    unique_links = list(dict.fromkeys(x[1] for x in regex_outcome))
    profile_links = pd.DataFrame(unique_links, columns=['Profile link'])
    
    return profile_links

The profile links are ready and will be stored as a dataframe. The reason is that pandas has very convenient methods of reading and writing data into a .csv file.
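
To make the link extraction and the .csv round trip concrete, here is a minimal sketch run on a made-up HTML fragment. The markup in `sample_html` is an assumption for illustration only, and the real KHL page may be laid out differently. An in-memory buffer stands in for the file on disk.

```python
import io
import re

import pandas as pd

# Made-up HTML fragment; the real page layout may differ.
sample_html = """
<a class='e-player' href='/players/14252/'>Player One</a>
<a href='/players/16429/'>Player Two</a>
<a href='/players/14252/'>Player One again</a>
<a href="/news/">double-quoted link, not matched by the pattern</a>
"""

# Same pattern as in the function above: it only matches single-quoted hrefs.
matches = re.findall(r"<a [^>]*?(href=\'([^\">]+)\')", sample_html)

# Keep the bare link (second tuple element) and drop duplicates in page order.
links = list(dict.fromkeys(match[1] for match in matches))
profile_links = pd.DataFrame(links, columns=['Profile link'])

# Round trip through CSV text, using a buffer instead of a real file.
buffer = io.StringIO()
profile_links.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```

Note that the pattern skips links written with double quotes, so it relies on the page using single quotes for the links of interest.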

The next function is going to tackle the issue of getting the data we need given a player's profile page.

# Warning!

This script breaks on line 701 of the player profile links list. The underlying reason is that Nikita Glotov (profile page https://en.khl.ru/players/16467/) has participated in an off-season tournament between KHL teams (the Nadezhda Cup) but never in a regular season or playoff match. As a result, he has no match-level statistics, yet he is in the KHL database because he has season-level statistics, however weird that might sound.

The last table on the page is supposed to hold the match-level statistics but ends up holding the season-level statistics instead. This messes up the entire dataset: the season-level statistics are read as the player information, and the player information is replaced by an unrelated table showing the result of one of the latest matches, which is present on all pages of the KHL website.

Thus, the script needs to check whether the dataframe for the player's match statistics actually contains them. The easiest way to do that is to check the header. If the header matches the standard one, the script goes ahead with cleaning the data and adding it to the rest. However, we need to decide what to do with such players after identifying them.

Two approaches can be taken here: drop them from the data altogether or fix the player information and add the player data to the rest without any match data. We are going to do the latter, even though that kind of data would probably get dropped later on in the analysis.
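
The header check can be sketched on toy tables as follows. The helper `split_player_tables` and all of the sample dataframes are hypothetical; only the `IDSeason` header is taken from the match statistics table on the real pages.

```python
import pandas as pd

def split_player_tables(tables, match_header='IDSeason'):
    """Hypothetical helper: pick the player tables from the end of the list.

    Returns (info, season, match); match is None when the last table
    does not start with the expected match statistics header.
    """
    if tables[-1].columns[0] == match_header:
        return tables[-3], tables[-2], tables[-1]
    return tables[-2], tables[-1], None

# Toy stand-ins for the tables found on a profile page.
site_wide = pd.DataFrame({'Score': ['3:1']})
info = pd.DataFrame({'Born': ['1 January 1990']})
season = pd.DataFrame({'Tournament / Team': ['Nadezhda Cup 2012/2013'], 'GP': [2]})
match = pd.DataFrame({'IDSeason': [244], 'Date': ['28 Dec 2013']})

full_page = [site_wide, info, season, match]
no_match_page = [site_wide, info, season]

_, _, full_match = split_player_tables(full_page)
fallback_info, fallback_season, fallback_match = split_player_tables(no_match_page)

print(full_match is match, fallback_match is None)
```

Returning `None` for the missing table keeps the player information and season statistics usable, which matches the second approach described above.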

In [3]:
def scrape_player_data(profile_link):

    # Again, first add a time delay for each request.
    
    time.sleep(5)
    
    # Constructing the URL for our target webpage and scraping its tables into pandas dataframes.

    base_url = f'https://en.khl.ru{profile_link}'
    result = requests.get(base_url)
    result_pandas = pd.read_html(result.text)
    
    # Some of the tables we have scraped are website-wide tables showing the outcomes of the most recent matches.
    # Thus, we need to extract only the player-specific information from the result.
    # That information is normally contained in the last three tables on the page.
    
    # However, sometimes the player match statistics are missing, so we need to account for it.
    # We can check whether the first column's header is indeed the same as in the match statistics template.
    
    if result_pandas[-1].columns[0] == 'IDSeason':
        
        # All is good, saving all three tables.
        
        player_info = result_pandas[-3]
        player_season_stat = result_pandas[-2]
        player_match_stat = result_pandas[-1]
        
    else:
        
        # Setting the match statistics to None as it is absent from the page.
        
        player_info = result_pandas[-2]
        player_season_stat = result_pandas[-1]
        player_match_stat = None
    
    # None of the tables include the name of the player.
    # We can find it under the 'e-player_name' class using BeautifulSoup.
    # Warning - the names may contain trailing spaces, so let us clean them up.

    soup = bs4.BeautifulSoup(result.text, 'lxml')
    name = soup.select('.e-player_name')[0].text.strip()

    # Turning it into a dataframe with both player name and his KHL profile link to be added to the rest of the tables.

    player_name = pd.DataFrame([{ 'URL': base_url, 'Player name': name}])
    
    return player_name, player_info, player_season_stat, player_match_stat

Now we know how to get a list of all players' profile links and how to get an individual player's data using his link. However, we still need to combine the data from all players.

The data is far from clean in the form we receive it. The headers and season indicators get messed up: the former merge with the first row and the latter become rows of their own. Moreover, the player name and profile link sit in a separate table and are missing from the three main tables of interest.

Let us fix that before merging together the data for different players. For it, we are going to make three different functions.

In [4]:
def fix_player_information(player_info):
    
    # Let us fix the player information first.
    # The header for it is always the first word in the field, so we are going to perform the split on it.

    split_info = player_info.iloc[0].str.split(n=1)

    # Then the first words at index 0 are saved as header while everything else at index 1 is saved as actual information.

    header = split_info.str[0]
    actual_info = split_info.str[1]

    # Finally, replace the column names with the correct header and replace the first row with the clean information.
    # Note that the changes are applied to the dataframe that was passed in, which is then returned.

    player_info.columns = header
    player_info.iloc[0] = actual_info
    
    return player_info
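
To see the header split in isolation, here is a minimal sketch on a toy table. The raw layout (one row, with the header word fused to the value in each cell) is an assumption based on the description above, and the sample values are illustrative.

```python
import pandas as pd

# Assumed raw layout: one row, each cell is "<header word> <value>".
raw_info = pd.DataFrame([['Born 1 February 1993', 'Shoots left', 'Country Russia']])

# Split each cell once on whitespace: first word is the header, the rest is the value.
split_info = raw_info.iloc[0].str.split(n=1)
header = split_info.str[0]
values = split_info.str[1]

# Build a clean one-row table instead of modifying the raw one in place.
clean_info = pd.DataFrame([values.tolist()], columns=header.tolist())

print(clean_info)
```

Because the split happens only once per cell (`n=1`), values that contain spaces, such as the birth date, stay intact. A header made of more than one word would not survive this split.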

In [5]:
def fix_player_statistics(player_season_stat):
    
    # Now let us handle the problem with season statistics.
    # Every row with actual statistics needs at least one game played (GP) to be included in the table.
    # We can use it to separate out the rows with season identifiers - they will be the only ones with no games recorded.

    player_season_stat['Season'] = np.where(np.isnan(player_season_stat['GP']), player_season_stat['Tournament / Team'], np.nan)

    # We are interested in adding seasons to actual statistics as a separate column.
    # The Season column values are NaN when GP isn't NaN in that row - that is, it has games recorded.
    # The season identifier is always above the actual season statistics, so we can fill the NaN values with the row above.

    player_season_stat['Season'] = player_season_stat['Season'].ffill()

    # The Season column is now ready, dropping the rows that served only to identify seasons.

    player_season_stat = player_season_stat[player_season_stat['Tournament / Team'] != player_season_stat['Season']]

    # Finally, moving the Season column to the start of the table and resetting the index.

    player_season_stat = player_season_stat[['Season'] + [col for col in player_season_stat.columns if col != 'Season']]
    player_season_stat = player_season_stat.reset_index(drop=True)
    
    return player_season_stat
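
The forward-fill idea can be tried on a toy table like the one below. The sample rows are illustrative, and this sketch drops the season rows with a boolean mask rather than by comparing the two columns as the function above does.

```python
import numpy as np
import pandas as pd

# Toy table: season identifier rows have no games played (GP is NaN).
raw_stats = pd.DataFrame({
    'Tournament / Team': ['Regular season 2014/2015', 'Amur (Khabarovsk)',
                          'Regular season 2013/2014', 'Amur (Khabarovsk)'],
    'GP': [np.nan, 13.0, np.nan, 12.0],
})

# Rows without games played are the season identifiers.
is_season_row = raw_stats['GP'].isna()

# Keep the identifier only on those rows, then carry it down to the rows below.
raw_stats['Season'] = raw_stats['Tournament / Team'].where(is_season_row).ffill()

# Drop the identifier rows and put Season first.
clean_stats = raw_stats[~is_season_row].reset_index(drop=True)
clean_stats = clean_stats[['Season', 'Tournament / Team', 'GP']]

print(clean_stats)
```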

In [6]:
def add_player_name(player_name, player_info, player_season_stat, player_match_stat):

    # Now that the headers are all fixed, let us add player name and KHL profile link to the tables.
    # The reason why we need a profile link is to have a unique player identifier.
    # After all, there is an unlikely yet possible event where two players have the exact same name.

    player_info = pd.concat([player_name, player_info], axis=1)
    
    # In case of multiple rows we need to additionally stretch the player name and profile link down.
    # It can be done the same way as previously with the season identifiers.
    
    player_season_stat = pd.concat([player_name, player_season_stat], axis=1)
    player_season_stat[['URL','Player name']] = player_season_stat[['URL','Player name']].ffill()
    
    # Again, we need to check whether we even have match data before adding the player name to it.
    
    if player_match_stat is not None:
        
        # Go ahead.

        player_match_stat = pd.concat([player_name, player_match_stat], axis=1)
        player_match_stat[['URL','Player name']] = player_match_stat[['URL','Player name']].ffill()
        
    else:
        
        # There is no match data for this player, so there is nothing to add the name to.
        
        pass
    
    return player_info, player_season_stat, player_match_stat
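
Here is a minimal sketch of how a one-row name table stretches over a multi-row statistics table. The sample values are illustrative; the column names `URL` and `Player name` are the ones used in the functions above.

```python
import pandas as pd

# One-row table with the player identifier, as built in scrape_player_data.
player_name = pd.DataFrame([{'URL': 'https://en.khl.ru/players/16673/',
                             'Player name': 'Sergei Abramov'}])

# Toy two-row statistics table.
season_stat = pd.DataFrame({'Season': ['Regular season 2014/2015', 'Regular season 2013/2014'],
                            'GP': [13.0, 12.0]})

# Side-by-side concat aligns on the index, so only row 0 gets the name at first.
combined = pd.concat([player_name, season_stat], axis=1)

# Forward fill carries the name and link down to the remaining rows.
combined[['URL', 'Player name']] = combined[['URL', 'Player name']].ffill()

print(combined)
```

This relies on both tables having a default index starting at 0; a statistics table with a different index would not line up with the name row.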

The tables are all ready! Now we just need to call the functions and combine the data for all players together.
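
One common way to combine per-player tables is to collect them in a plain list and concatenate once at the end. The sketch below shows that pattern on toy data; it is an alternative to growing a dataframe inside the loop, not a description of what the script in this notebook does, and the sample frames are made up.

```python
import pandas as pd

# Stand-ins for per-player results; None marks a player without match data.
per_player_frames = [
    pd.DataFrame({'Player name': ['Player A'], 'GP': [10]}),
    None,
    pd.DataFrame({'Player name': ['Player B', 'Player B'], 'GP': [3, 7]}),
]

# Skip the missing tables explicitly, then concatenate everything in one go.
collected = [frame for frame in per_player_frames if frame is not None]
all_players = pd.concat(collected, ignore_index=True)

print(all_players)
```

Concatenating once avoids copying the growing result on every iteration, which tends to matter when there are thousands of players.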

In [7]:
# Creating empty dataframes so that our loops can append results to them.

list_profile_links = pd.DataFrame()
list_player_info = pd.DataFrame()
list_player_season_stat = pd.DataFrame()
list_player_match_stat = pd.DataFrame()

# We need to now loop through the English alphabet to get all profile links.    
# Instead of entering the list of English letters manually, we can get it in a smart and lazy way.

alphabet = list(string.ascii_uppercase)


for letter in alphabet:

    profile_links = get_profile_links(letter)
    
    # Appending the results to the outcome dataframe.
    # DataFrame.append was removed in pandas 2.0, so we use pd.concat instead.

    list_profile_links = pd.concat([list_profile_links, profile_links], ignore_index=True)

# And let us save it straight away.

list_profile_links.to_csv('players_profile.csv', encoding='utf8', index=False)


# Start looping through profile links in the list and getting the data for each of them.

for profile in list_profile_links['Profile link']:

    player_name, player_info, player_season_stat, player_match_stat = scrape_player_data(profile)
    
    player_info = fix_player_information(player_info)
    player_season_stat = fix_player_statistics(player_season_stat)
    
    player_info, player_season_stat, player_match_stat = add_player_name(player_name, player_info,
                                                                         player_season_stat, player_match_stat)
    
    # Appending the results to the outcome dataframes.
    # pd.concat silently skips None, which covers players without match data.
    
    list_player_info = pd.concat([list_player_info, player_info], ignore_index=True)
    list_player_season_stat = pd.concat([list_player_season_stat, player_season_stat], ignore_index=True)
    list_player_match_stat = pd.concat([list_player_match_stat, player_match_stat], ignore_index=True)
    
    
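# Separating the combined season table into career summaries and individual seasons.
# The career rows are the ones labelled 'KHL Summary'; they must be extracted first, before the season table is overwritten.

list_player_career_stat = list_player_season_stat[list_player_season_stat['Season'] == 'KHL Summary']
list_player_season_stat = list_player_season_stat[list_player_season_stat['Season'] != 'KHL Summary']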
    
# Saving the final outcome.

list_player_info.to_csv('raw_players_info.csv', encoding='utf8', index=False)
list_player_career_stat.to_csv('raw_players_career.csv', encoding='utf8', index=False)
list_player_season_stat.to_csv('raw_players_season.csv', encoding='utf8', index=False)
list_player_match_stat.to_csv('raw_players_match.csv', encoding='utf8', index=False)

Below you can see the final outcome of our data-gathering script.

In [8]:
list_profile_links

Unnamed: 0,Profile link
0,/players/14252/
1,/players/16429/
2,/players/4475/
3,/players/13714/
4,/players/15669/
...,...
3359,/players/14527/
3360,/players/20260/
3361,/players/25749/
3362,/players/15664/


In [9]:
list_player_info

Unnamed: 0,URL,Player name,Born,Shoots,Country
0,https://en.khl.ru/players/16673/,Sergei Abramov,1 February 1993,left,Russia
1,https://en.khl.ru/players/16462/,Maxim Alyapkin,28 February 1993,left,Russia
2,https://en.khl.ru/players/19200/,Dmitry Ambrozheichik,26 March 1995,right,Belarus
3,https://en.khl.ru/players/13714/,Vitaly Anikeyenko,2 January 1987,right,Russia
4,https://en.khl.ru/players/20844/,Semyon Afonasyevsky,15 October 1996,left,Russia
...,...,...,...,...,...
3359,https://en.khl.ru/players/14930/,Alexander Zalivin,15 July 1990,left,Russia
3360,https://en.khl.ru/players/23355/,Denis Zernov,10 January 1996,left,Russia
3361,https://en.khl.ru/players/16217/,Airat Ziazov,24 January 1991,left,Russia
3362,https://en.khl.ru/players/23656/,Tomislav Zanoski,3 March 1984,left,Croatia


In [10]:
list_player_career_stat

Unnamed: 0.1,Unnamed: 0,URL,Player name,Season,Tournament / Team,№,GP,G,Assists,PTS,...,FOA,W,L,SOP,GA,Sv,%Sv,GAA,SO,TOI
0,4,https://en.khl.ru/players/16673/,Sergei Abramov,KHL Summary,Regular season:,,25.0,1.0,0.0,1.0,...,1.0,,,,,,,,,
1,5,https://en.khl.ru/players/16673/,Sergei Abramov,KHL Summary,Nadezhda Cup:,,2.0,0.0,0.0,0.0,...,0.0,,,,,,,,,
2,6,https://en.khl.ru/players/16673/,Sergei Abramov,KHL Summary,KHL Total:,,25.0,1.0,0.0,1.0,...,1.0,,,,,,,,,
3,9,https://en.khl.ru/players/16462/,Maxim Alyapkin,KHL Summary,Regular season:,,3.0,0.0,0.0,,...,,1.0,2.0,0.0,5.0,19.0,79.2,3.17,0.0,94:41
4,10,https://en.khl.ru/players/16462/,Maxim Alyapkin,KHL Summary,KHL Total:,,3.0,0.0,0.0,,...,,1.0,2.0,0.0,5.0,19.0,79.2,3.17,0.0,94:41
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8969,28639,https://en.khl.ru/players/16217/,Airat Ziazov,KHL Summary,KHL Total:,,79.0,6.0,10.0,16.0,...,6.0,,,,,,,,,
8970,28642,https://en.khl.ru/players/23656/,Tomislav Zanoski,KHL Summary,Regular season:,,39.0,5.0,1.0,6.0,...,8.0,,,,,,,,,
8971,28643,https://en.khl.ru/players/23656/,Tomislav Zanoski,KHL Summary,KHL Total:,,39.0,5.0,1.0,6.0,...,8.0,,,,,,,,,
8972,28646,https://en.khl.ru/players/11543/,Alexander Zevakhin,KHL Summary,Regular season:,,64.0,4.0,7.0,11.0,...,0.0,,,,,,,,,


In [11]:
list_player_season_stat

Unnamed: 0.1,Unnamed: 0,URL,Player name,Season,Tournament / Team,№,GP,G,Assists,PTS,...,FOA,W,L,SOP,GA,Sv,%Sv,GAA,SO,TOI
0,0,https://en.khl.ru/players/16673/,Sergei Abramov,Regular season 2014/2015,Amur (Khabarovsk),93.0,13.0,1.0,0.0,1.0,...,1.0,,,,,,,,,
1,1,https://en.khl.ru/players/16673/,Sergei Abramov,Nadezhda Cup 2013/2014,Amur (Khabarovsk),91.0,2.0,0.0,0.0,0.0,...,,,,,,,,,,
2,2,https://en.khl.ru/players/16673/,Sergei Abramov,Regular season 2013/2014,Amur (Khabarovsk),91.0,12.0,0.0,0.0,0.0,...,,,,,,,,,,
3,3,https://en.khl.ru/players/16673/,Sergei Abramov,Nadezhda Cup 2012/2013,Amur (Khabarovsk),99.0,0.0,0.0,0.0,0.0,...,,,,,,,,,,
4,7,https://en.khl.ru/players/16462/,Maxim Alyapkin,Regular season 2015/2016,Torpedo (Nizhny Novgorod Region),31.0,2.0,0.0,0.0,,...,,1.0,1.0,0.0,3.0,10.0,76.9,2.98,0.0,60:25
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19669,28635,https://en.khl.ru/players/16217/,Airat Ziazov,Regular season 2009/2010,Neftekhimik (Nizhnekamsk),79.0,1.0,0.0,0.0,0.0,...,,,,,,,,,,
19670,28640,https://en.khl.ru/players/23656/,Tomislav Zanoski,Regular season 2016/2017,Medvescak (Zagreb),10.0,24.0,2.0,1.0,3.0,...,6.0,,,,,,,,,
19671,28641,https://en.khl.ru/players/23656/,Tomislav Zanoski,Regular season 2015/2016,Medvescak (Zagreb),10.0,15.0,3.0,0.0,3.0,...,2.0,,,,,,,,,
19672,28644,https://en.khl.ru/players/11543/,Alexander Zevakhin,Regular season 2009/2010,Severstal (Cherepovets),15.0,20.0,1.0,1.0,2.0,...,,,,,,,,,,


In [12]:
list_player_match_stat

Unnamed: 0,URL,Player name,IDSeason,Season,Team,Date,Teams,Score,№,G,...,BLS,FOA,W,L,SOP,GA,Sv,%Sv,GAA,SO
0,https://en.khl.ru/players/16673/,Sergei Abramov,244,Regular season 2013/2014,54,28 Dec 2013,Barys - Amur,8:2,91,0,...,,,,,,,,,,
1,https://en.khl.ru/players/16673/,Sergei Abramov,244,Regular season 2013/2014,54,3 Jan 2014,Amur - Lokomotiv,2:1,91,0,...,,,,,,,,,,
2,https://en.khl.ru/players/16673/,Sergei Abramov,244,Regular season 2013/2014,54,5 Jan 2014,Amur - SKA,1:6,91,0,...,,,,,,,,,,
3,https://en.khl.ru/players/16673/,Sergei Abramov,244,Regular season 2013/2014,54,7 Jan 2014,Amur - Atlant,2:3 Б,91,0,...,,,,,,,,,,
4,https://en.khl.ru/players/16673/,Sergei Abramov,244,Regular season 2013/2014,54,9 Jan 2014,Amur - Severstal,1:3,91,0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
451101,https://en.khl.ru/players/11543/,Alexander Zevakhin,167,Regular season 2009/2010,56,13 Dec 2009,Severstal - CSKA,4:3 Б,15,0,...,,,,,,,,,,
451102,https://en.khl.ru/players/11543/,Alexander Zevakhin,167,Regular season 2009/2010,56,23 Dec 2009,Barys - Severstal,3:4,15,0,...,,,,,,,,,,
451103,https://en.khl.ru/players/11543/,Alexander Zevakhin,167,Regular season 2009/2010,56,25 Dec 2009,Salavat Yulaev - Severstal,2:3,15,0,...,,,,,,,,,,
451104,https://en.khl.ru/players/11543/,Alexander Zevakhin,167,Regular season 2009/2010,56,27 Dec 2009,Avangard - Severstal,3:1,15,0,...,,,,,,,,,,
