Every player who has ever participated in a KHL match is supposed to have his own profile page on the KHL website. We need to prepare a complete list of players with their respective profile links to later feed into our data scraping script.

In [1]:
# Importing standard packages for web scraping

import requests
import bs4
import pandas as pd

In [2]:
# The target webpage does not have any tables containing data.
# Thus, we are unable to easily find the player names and their profile links by using pandas.

base_url = 'https://en.khl.ru/players/season/all/'
result = requests.get(base_url)
soup = bs4.BeautifulSoup(result.text, 'lxml')

In [3]:
soup

<!DOCTYPE html>
<html prefix="og: http://ogp.me/ns#">
<head>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<!-- Google Tag Manager -->
<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start': new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src='https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);})(window,document,'script','dataLayer','GTM-5HS55GV');</script>
<!-- End Google Tag Manager -->
<script src="//static.wi-fi.ru/mtt/banners/libs/1.6.3/all.js" type="text/javascript"></script>
<script async="" src="https://yastatic.net/pcode/adfox/header-bidding.js"></script>
<script>
var adfoxBiddersMap = {
    "criteo": "739918",
    "myTarget": "791539",
    "betweenDigital": "813654",
    "rtbhouse": "856599",
    "getintent": "827941",
    "adriver": "831182",
    "buzzoola": "962407"
};
var adUnits = [
    {
        "code": "adfox_15112835524839378",
   

The first player on our target page's list is Aaltonen Juhamatti. Searching for his name in the soup output shows that the full list of players, with all the data, is hidden in the "var_ Data" part of a script.

After a number of failed attempts to scoop out the contents of that script using BeautifulSoup methods, we are going to take a different approach. We only really care about getting the link to each player's profile page; all other data will be gathered from the profile directly. So let us brute-force the profile links out of the "a href" attributes inside "var_ Data" with a regular expression.

In [4]:
import re

In [5]:
# The complicated regular expression below was found on StackOverflow. Here is how it reads.

# <a       - the start of an a tag
# [^>]*?   - any characters other than >, as few as possible
# href=\'  - the href attribute, opened by a single quote
# [^\">]+  - one or more characters other than " and >
# \'       - the closing single quote

regex_outcome = re.findall(r"<a [^>]*?(href=\'([^\">]+)\')", result.text)
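
To see what this pattern returns without hitting the website, here is a minimal sketch that runs it on a made-up HTML fragment. The `sample` string is an assumption for illustration only, not actual KHL markup. It shows that each match is a tuple of the whole `href='...'` attribute and the bare path, and that double-quoted links are skipped.

```python
import re

# Hypothetical fragment: one single-quoted player link (like those in the
# page's script) and one ordinary double-quoted link that should be ignored.
sample = (
    "<a href='/players/16785/'>Aaltonen Juhamatti</a> "
    "<a class=\"menu\" href=\"/teams/\">Teams</a>"
)

pattern = r"<a [^>]*?(href=\'([^\">]+)\')"
matches = re.findall(pattern, sample)

# Each match is (full attribute, bare path); keep only the bare path.
paths = [path for _, path in matches]
print(matches)
print(paths)
```

Because the pattern has two capture groups, `re.findall` returns pairs, which is why the notebook later picks out the second element of each tuple.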

In [6]:
regex_outcome

[("href='/players/16785/'", '/players/16785/'),
 ("href='/players/17585/'", '/players/17585/'),
 ("href='/players/13041/'", '/players/13041/'),
 ("href='/players/38736/'", '/players/38736/'),
 ("href='/players/24998/'", '/players/24998/'),
 ("href='/players/13873/'", '/players/13873/'),
 ("href='/players/19010/'", '/players/19010/'),
 ("href='/players/16221/'", '/players/16221/'),
 ("href='/players/19276/'", '/players/19276/'),
 ("href='/players/20861/'", '/players/20861/'),
 ("href='/players/16673/'", '/players/16673/'),
 ("href='/players/17968/'", '/players/17968/'),
 ("href='/players/18961/'", '/players/18961/'),
 ("href='/players/13490/'", '/players/13490/'),
 ("href='/players/22616/'", '/players/22616/'),
 ("href='/players/31046/'", '/players/31046/'),
 ("href='/players/13121/'", '/players/13121/'),
 ("href='/players/20844/'", '/players/20844/'),
 ("href='/players/15376/'", '/players/15376/'),
 ("href='/players/23434/'", '/players/23434/'),
 ("href='/players/18767/'", '/players/18

In [7]:
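# The set removes duplicate links (in no particular order).
# list(...) keeps this working on pandas versions that reject a bare set.
profile_links = pd.DataFrame(list({x[1] for x in regex_outcome}), columns=['Profile link'])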

In [8]:
profile_links

Unnamed: 0,Profile link
0,/players/14252/
1,/players/16429/
2,/players/4475/
3,/players/13714/
4,/players/15669/
...,...
127,/players/26794/
128,/players/6237/
129,/players/30159/
130,/players/12814/


We can see that there are 132 rows in the resulting dataframe, which is exactly the number of entries the KHL website reports on the source webpage.
However, a hockey league cannot have had only 132 players in its entire history, can it?

The webpage we scraped the data from appears to contain all players from all seasons, but it actually does something different. The KHL website filters players by the first letter of their surname, and this filter is set to the letter "A" by default without explicitly saying so.

In effect, https://en.khl.ru/players/season/all and https://en.khl.ru/players/season/all/?letter=A give exactly the same result. For the next step, we are going to make our script gather the data for every letter one by one and combine the results into the dataframe we are looking for.

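P.S. On a positive note: since the players are filtered by letter, their profile links will be grouped by the first letter of the surname in the combined list, in alphabetical order of the letters. Within a single letter the order is not kept, though, because the duplicates are removed through a set, which is unordered. This is why the dataframe above starts with /players/14252/ while the first regex match was /players/16785/.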
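
If the page order within each letter matters later on, one possible alternative to the set is `dict.fromkeys`, which drops duplicates while keeping the first-seen order. The sketch below uses a small hand-written `matches` list in the same shape as `regex_outcome`; the repeated entry is made up for illustration.

```python
# Same shape as regex_outcome: (full attribute, bare path) tuples.
matches = [
    ("href='/players/16785/'", '/players/16785/'),
    ("href='/players/17585/'", '/players/17585/'),
    ("href='/players/16785/'", '/players/16785/'),  # made-up duplicate
]

# dict keys are unique and keep insertion order, unlike a set.
ordered_links = list(dict.fromkeys(path for _, path in matches))
print(ordered_links)
```

The resulting list can be passed to `pd.DataFrame(ordered_links, columns=['Profile link'])` in place of the set-based version.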

In [9]:
# Instead of manually typing a list of English letters, we can do it the smart way.

import string

In [10]:
alphabet = list(string.ascii_uppercase)

In [11]:
print(alphabet)

['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']


At this stage, we could edit the notebook to exclude unnecessary steps: put all the steps above in a "for" loop and run it over our list of letters. However, it makes sense to leave the previous work as it is for now, so that the logic behind some of the steps stays clear. Thus, the loop below will duplicate some of the code above.

In [12]:
# Creating an empty dataframe so that the loop can append results to it.

list_profile_links = pd.DataFrame(columns=['Profile link'])

In [13]:
for letter in alphabet:

    # Construct a URL for our target webpage and get it.

    base_url = f'https://en.khl.ru/players/season/all/?letter={letter}'
    result = requests.get(base_url)

    # Still using the same regular expression.

    # <a       - the start of an a tag
    # [^>]*?   - any characters other than >, as few as possible
    # href=\'  - the href attribute, opened by a single quote
    # [^\">]+  - one or more characters other than " and >
    # \'       - the closing single quote

    regex_outcome = re.findall(r"<a [^>]*?(href=\'([^\">]+)\')", result.text)

    # The set removes duplicates; list(...) keeps pandas versions that reject a bare set happy.
    profile_links = pd.DataFrame(list({x[1] for x in regex_outcome}), columns=['Profile link'])

    # Adding the results to the outcome dataframe.
    # pd.concat is used because DataFrame.append was removed in pandas 2.0.

    list_profile_links = pd.concat([list_profile_links, profile_links], ignore_index=True)
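
As a possible tidier variant of the loop above, the sketch below separates downloading from parsing so the logic can be tried offline. The function name `collect_profile_links`, the `fetch` parameter and the `fake_pages` data are all assumptions made for illustration. The pattern is the same as above, only with a single capture group, so `findall` returns plain paths.

```python
import re
import string

BASE_URL = 'https://en.khl.ru/players/season/all/'
# Same pattern as in the notebook, but capturing only the bare path.
LINK_PATTERN = re.compile(r"<a [^>]*?href='([^\">]+)'")


def collect_profile_links(fetch, letters=string.ascii_uppercase):
    """Return unique profile paths for all letters, in first-seen order.

    `fetch` is any callable that maps a URL to the page's HTML text.
    """
    seen = {}
    for letter in letters:
        html = fetch(f'{BASE_URL}?letter={letter}')
        for path in LINK_PATTERN.findall(html):
            seen.setdefault(path, None)  # dict keeps insertion order
    return list(seen)


# Offline demonstration with made-up pages keyed by letter.
fake_pages = {
    'A': "<a href='/players/1/'>A One</a><a href='/players/1/'>A One</a>",
    'B': "<a href='/players/2/'>B Two</a>",
}


def fake_fetch(url):
    # The letter is the last character of the URL built above.
    return fake_pages[url[-1]]


links = collect_profile_links(fake_fetch, letters='AB')
print(links)
```

For a real run, `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`, and the returned list can go straight into `pd.DataFrame(links, columns=['Profile link'])`.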

In [14]:
list_profile_links

Unnamed: 0,Profile link
0,/players/14252/
1,/players/16429/
2,/players/4475/
3,/players/13714/
4,/players/15669/
...,...
3359,/players/14527/
3360,/players/20260/
3361,/players/25749/
3362,/players/15664/


In [15]:
list_profile_links.to_csv('players_profile_links.csv', encoding='utf8', index=False)
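
The saved links are relative paths, so the scraping script will need to turn them into full URLs. A minimal sketch of that step follows, assuming the site root is `https://en.khl.ru` as in the URLs above. The `csv_text` string is a stand-in for the contents of `players_profile_links.csv`, so the example runs without the file.

```python
import csv
import io
from urllib.parse import urljoin

SITE_ROOT = 'https://en.khl.ru'

# Stand-in for the saved file; the real script would open the CSV instead.
csv_text = "Profile link\n/players/16785/\n/players/17585/\n"

reader = csv.DictReader(io.StringIO(csv_text))
profile_urls = [urljoin(SITE_ROOT, row['Profile link']) for row in reader]
print(profile_urls)
```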