<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Creating-dataframe-№1" data-toc-modified-id="Creating-dataframe-№1-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Creating dataframe №1</a></span></li><li><span><a href="#Creating-dataframe-№2" data-toc-modified-id="Creating-dataframe-№2-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Creating dataframe №2</a></span></li></ul></div>

Our goal is to build two dataframes for further analysis in pandas by scraping the website https://en.khl.ru.

First, we create dataframe №1, which records the time intervals during which each player represented a particular club:

- We iterate through 125 listing pages to collect the link to each player's page, which contains the player's ID;

- We build the URL of each player's page and determine how many pages of game dates it has;

- We go through every page for every player and collect the game dates for each club;

- We write the data to a CSV file.

Dataframe №2 reflects the statistics of each player throughout their career:

- We navigate through the pages for each KHL player and collect statistical data;

- We write the data to a CSV file.


## Creating dataframe №1

In [3]:
# Import necessary libraries
from bs4 import BeautifulSoup  
import requests  
from tqdm import tqdm  
import csv
from concurrent.futures import ThreadPoolExecutor


In [4]:
# Generating links to player pages
links = [f'https://en.khl.ru/players/season/all/?season=all&pager_selector=&PAGEN_1={x}' for x in range(1, 126)]

# Function to process a single link and generate player links
def process_link(url):
    try:
        response = requests.get(url)
        response.raise_for_status()
        response.encoding = 'utf-8'
        soup = BeautifulSoup(response.text, 'lxml')
        player_entries = soup.find('tbody', id='player_list_container').find_all('a', class_='players-table__player')
        player_links = [f"https://en.khl.ru{entry['href']}" for entry in player_entries]
        return player_links
    except Exception as e:
        print(f"An error occurred while processing {url}: {e}")
        return []

# Using ThreadPoolExecutor to parallelize link processing
with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(tqdm(executor.map(process_link, links), total=len(links)))

# Flatten the list of player links
player_links = [link for sublist in results for link in sublist]


100%|█████████████████████████████████████████| 125/125 [00:34<00:00,  3.59it/s]
100%|███████████████████████████████████| 3750/3750 [00:00<00:00, 681956.30it/s]
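
The flattening step relies on `executor.map` returning results in the order of its inputs, regardless of which thread finishes first. Below is a minimal sketch of the same fan-out and flatten pattern. The function `fake_fetch` is hypothetical and stands in for `process_link` so that the example runs without network access:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fake_fetch(page):
    # Hypothetical stand-in for process_link: earlier pages "respond" more slowly
    time.sleep(0.01 * (5 - page))
    return [f"player-{page}-{i}" for i in range(2)]

pages = [1, 2, 3, 4]
with ThreadPoolExecutor(max_workers=4) as executor:
    # map() yields results in input order, not completion order
    per_page = list(executor.map(fake_fetch, pages))

# Flatten the per-page lists into one list, as done for player_links
flat = [link for sublist in per_page for link in sublist]
```

Because the order is preserved, the flattened list follows the page order of the site even though the requests run in parallel.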


In [24]:
# Generate player links with IDs and page number
player_links_new = [f"{el}?idplayer={el.split('/')[-2]}&PAGEN_1=1" for el in tqdm(player_links)]


100%|███████████████████████████████████| 3750/3750 [00:00<00:00, 647215.87it/s]
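
`el.split('/')[-2]` works because every collected link ends with a slash after the player ID. Below is a slightly more defensive sketch, assuming links of the form `https://en.khl.ru/players/<id>/`. The helper names `player_id` and `first_page_url` are hypothetical and not part of the notebook:

```python
from urllib.parse import urlsplit

def player_id(link):
    # Take the last non-empty path segment, with or without a trailing slash
    segments = [s for s in urlsplit(link).path.split('/') if s]
    return segments[-1]

def first_page_url(link):
    # Same URL shape as the entries of player_links_new
    return f"{link}?idplayer={player_id(link)}&PAGEN_1=1"
```

For a link such as `https://en.khl.ru/players/13705/`, this produces the same URL as the list comprehension above.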


In [25]:
# Build a dictionary that maps each player's link to the number of pages
# in that player's list of KHL games.
#
# process_url(url):
#   1. sends an HTTP GET request to the URL;
#   2. decodes the response as UTF-8;
#   3. parses the HTML with BeautifulSoup;
#   4. looks for the pager, a div with the class 'pagging judges-tableStat__pagging';
#   5. returns the last line of the pager's text as an integer (assumed to be
#      the number of pages), or 1 if the page has no pager.
#
# A ThreadPoolExecutor with max_workers=10 runs the requests in parallel.
# executor.submit(process_url, url) schedules one call per URL in player_links_new
# and returns a future; the futures list keeps them in the same order as the URLs.
# The loop walks the futures and URLs together (tqdm shows a progress bar),
# calls future.result() to get the page count, and stores it in mydict under the URL.
def process_url(url):
    response = requests.get(url=url)
    response.encoding = 'utf-8'
    soup = BeautifulSoup(response.text, 'lxml')
    pagen = soup.find('div', {'class': 'pagging judges-tableStat__pagging'})
    return int(pagen.text.strip().split('\n')[-1]) if pagen else 1

mydict = {}

with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(process_url, url) for url in player_links_new]
    for future, url in tqdm(zip(futures, player_links_new), total=len(player_links_new)):
        result = future.result()
        mydict[url] = result



100%|███████████████████████████████████████| 3750/3750 [29:53<00:00,  2.09it/s]
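
The page count is read from the last line of the pager's text, so `int(...)` raises `ValueError` whenever that line is not a number. Below is a hedged sketch of a more tolerant parser. It swaps in a different rule, taking the largest integer found in the text instead of the last line. The helper name `last_page_number` is hypothetical, and the sample strings in the usage note are made up, since the real pager markup may differ:

```python
import re

def last_page_number(pager_text):
    # Return the largest integer found in the pager text, or 1 if there is none
    numbers = [int(n) for n in re.findall(r'\d+', pager_text or '')]
    return max(numbers) if numbers else 1
```

With made-up inputs, `last_page_number('1\n2\n3\n15')` gives 15, and an empty string or `None` gives 1, which matches the single-page default used in `process_url`.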


In [26]:
# Initialize an empty dictionary to store the result
result_dict = {}

# Loop through the items in mydict
for k, v in mydict.items():
    # Check if the value v is greater than 1
    if v > 1:
        # If v is greater than 1, create a list of URLs with a range of numbers
        urls = [k[:-1] + str(i) for i in range(1, v+1)]
        # Store the list of URLs in the result_dict under the key k
        result_dict[k] = urls
    else:
        # If v is not greater than 1, create a list with a single URL (the original key k)
        # Store the list with a single URL in the result_dict under the key k
        result_dict[k] = [k]

In [27]:
# Check the result
result_dict['https://en.khl.ru/players/13705/?idplayer=13705&PAGEN_1=1']


['https://en.khl.ru/players/13705/?idplayer=13705&PAGEN_1=1',
 'https://en.khl.ru/players/13705/?idplayer=13705&PAGEN_1=2',
 'https://en.khl.ru/players/13705/?idplayer=13705&PAGEN_1=3',
 'https://en.khl.ru/players/13705/?idplayer=13705&PAGEN_1=4',
 'https://en.khl.ru/players/13705/?idplayer=13705&PAGEN_1=5',
 'https://en.khl.ru/players/13705/?idplayer=13705&PAGEN_1=6',
 'https://en.khl.ru/players/13705/?idplayer=13705&PAGEN_1=7',
 'https://en.khl.ru/players/13705/?idplayer=13705&PAGEN_1=8',
 'https://en.khl.ru/players/13705/?idplayer=13705&PAGEN_1=9',
 'https://en.khl.ru/players/13705/?idplayer=13705&PAGEN_1=10',
 'https://en.khl.ru/players/13705/?idplayer=13705&PAGEN_1=11',
 'https://en.khl.ru/players/13705/?idplayer=13705&PAGEN_1=12',
 'https://en.khl.ru/players/13705/?idplayer=13705&PAGEN_1=13',
 'https://en.khl.ru/players/13705/?idplayer=13705&PAGEN_1=14',
 'https://en.khl.ru/players/13705/?idplayer=13705&PAGEN_1=15']
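
The expression `k[:-1] + str(i)` depends on the key ending in the single character `1`. A minimal sketch of an alternative that replaces the paging parameter by name, using only the standard library, looks like this. The helper name `page_urls` is hypothetical, and the sketch assumes `PAGEN_1` is the only parameter that changes between pages:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def page_urls(first_url, n_pages, param='PAGEN_1'):
    parts = urlsplit(first_url)
    query = parse_qsl(parts.query)
    urls = []
    for page in range(1, n_pages + 1):
        # Replace only the paging parameter and keep the others in order
        new_query = [(k, str(page) if k == param else v) for k, v in query]
        urls.append(urlunsplit(parts._replace(query=urlencode(new_query))))
    return urls
```

Calling `page_urls(k, v)` for each item of `mydict` would yield the same lists as the loop above, including the single-URL case when `v` is 1.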

In [None]:
def player_team(links):
    commands = []  # Initialize a list to store player teams and game dates

    for url in links:
        response = requests.get(url=url)
        response.encoding = 'utf-8'
        soup = BeautifulSoup(response.text, 'lxml')

        # Extract the player's name from the web page
        player = soup.find('span', {'class': 'frameCard-header__detail-titleItem roboto'}).text

        pagen = soup.find('tbody', id='table_all_games')
        dates = pagen.find_all('td', {'class': 'matches-table__col matches-table__col_date'})
        teams = pagen.find_all('td', {'class': 'matches-table__col matches-table__col_name'})

        # Extract teams and corresponding dates for the player's games
        for x, y in zip(teams, dates):
            commands.append((x.find('strong').text, y.text.strip('\n')))

    dict_list = []  # Initialize a list to store dictionaries with team-game date pairs
    current_dict = {}

    for command, date in commands:
        current_dict.setdefault(command, []).append(date)

        if len(current_dict) > 1:
            key, values = current_dict.popitem()
            new_dict = {k: [v[-1], v[0]] for k, v in current_dict.items()}
            dict_list.append(new_dict)
            current_dict.clear()
            current_dict[key] = values

    # Add the last dictionary to the list
    if current_dict:
        new_dict = {k: [v[-1], v[0]] for k, v in current_dict.items()}
        dict_list.append(new_dict)

    # Create a final list of tuples with player info, team, and game dates
    final_list = [(links[0], player, k, v[0], v[1]) for el in dict_list for k, v in el.items()]

    return final_list  # Return the list of player-team-game date tuples
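
The second half of `player_team` collapses consecutive games for the same club into one interval. The same grouping can be sketched with `itertools.groupby`, which also groups consecutive equal keys. The helper name `club_stints`, the team names, and the dates are made up for illustration. As in the function above, the sketch assumes the games are listed newest first, so the last date of a run is its start:

```python
from itertools import groupby

def club_stints(games):
    # games: (team, date) pairs in site order, assumed newest first
    stints = []
    for team, run in groupby(games, key=lambda g: g[0]):
        dates = [d for _, d in run]
        # Last date in the run is the start, first date is the end
        stints.append((team, dates[-1], dates[0]))
    return stints

# Made-up sample data: the player leaves Team B and later returns
games = [
    ("Team B", "2021-03-01"),
    ("Team B", "2021-01-10"),
    ("Team A", "2020-12-20"),
    ("Team B", "2020-02-02"),
    ("Team B", "2019-09-05"),
]
stints = club_stints(games)
```

A player who returns to a former club gets two separate intervals for that club, which is also how `dict_list` behaves.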


In [None]:
lst_error = []  # Initialize a list to store links that result in errors during processing

def process_links(links):
    try:
        return player_team(links)  # Call the player_team function to process the links and return the result
    except Exception as e:
        print(f"Error processing links: {e}")
        lst_error.append(links)  # Log the links that resulted in errors
        return []  # Return an empty list in case of an error

def process_result_dict(result_dict):
    lst_final = [('player_link', 'player', "team", "start_date", 'end_date')]  # Initialize the final list with column headers

    with ThreadPoolExecutor(max_workers=10) as executor:
        futures = [executor.submit(process_links, links) for k, links in result_dict.items()]  # Create futures for link processing

        for future in tqdm(futures, total=len(result_dict)):
            lst_final += future.result()  # Append the processed data to the final list

    return lst_final  # Return the rows so they can be written to the CSV file

lst_final = process_result_dict(result_dict)  # Process the result_dict and keep the resulting rows

with open('khl_timedelta.csv', 'a', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(lst_final)  # Write the data to a CSV file

print("Writing complete")  # Print a completion message after writing to the CSV
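
Opening the file in append mode (`'a'`) means that rerunning the cell adds a second header row and a second copy of the data. Below is a minimal sketch of the alternative, assuming a full rewrite on each run is acceptable. The file name `demo.csv` and the sample row are made up, and the file is written to a temporary directory:

```python
import csv
import os
import tempfile

header = ('player_link', 'player', 'team', 'start_date', 'end_date')
# Made-up row for illustration
rows = [('https://example.com/players/1/', 'Example Player', 'Team A', '2020-09-01', '2021-02-27')]

path = os.path.join(tempfile.mkdtemp(), 'demo.csv')

for _ in range(2):  # Simulate running the cell twice
    # 'w' truncates the file, so the header appears exactly once
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

# Read the file back to check its contents
with open(path, newline='', encoding='utf-8') as f:
    records = list(csv.DictReader(f))
```

Setting `encoding='utf-8'` explicitly keeps the output independent of the platform's default encoding.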



## Creating dataframe №2

In [None]:
# Column headers for the CSV file
lst_info = [('player_link', 'player', 'position', 'born', 'age', 'country', 'height', 'weight', 'shoot', 'GP', 'G', 'Assists', 'PTS', '+/-', '+', '-', 'PIM', 'ESG', 'PPG', 'SHG', 'OTG', 'GWG', 'SDS', 'SOG', '%SOG', 'S/G', 'FO', 'FOW', '%FO', 'TOI/G', 'SFT/G', 'TIE/G', 'SFTE/G', 'TIPP/G', 'SFTPP/G', 'TISH/G', 'SFTSH/G', 'HITS', 'BLS', 'FOA', 'TkA')]

# Initialize a session for making HTTP requests
session = requests.Session()

# Set user-agent headers to mimic a web browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
session.headers.update(headers)

# Function to process a player's information page
def process_url(url):
    try:
        response = session.get(url=url)
        response.encoding = 'utf-8'
        soup = BeautifulSoup(response.text, 'lxml')

        # Extract player details such as name, position, and personal information
        player = extract_text(soup.find('span', {'class': 'frameCard-header__detail-titleItem roboto'}))
        position = extract_text(soup.find('p', {'class': 'frameCard-header__detail-local roboto roboto-normal roboto-xxl color-black'}))
        player_info = soup.find('div', {'class': 'frameCard-header__detail-body'}).find_all('p', {'class': 'roboto roboto-bold roboto-lg roboto-lg__lg-bigger color-black'})
        player_info = [extract_text(info) for info in player_info]
        born, age, country, height, weight, shoots = player_info[:6]

        # Extract statistical information for the player
        stat = [extract_text(el) for el in soup.find('div', class_='detail-table').find_all('tr')[-1] if el.text.strip('\n') != ''][1:]

        return (url, player, position, born, age, country, height, weight, shoots, *stat)
    except Exception as e:
        print(f"Error occurred while processing {url}: {str(e)}")
        return None

# Function to extract text from an HTML element or return an empty string if the element is None
def extract_text(element):
    return element.text.strip('\n') if element is not None else ''

# Process player information pages using ThreadPoolExecutor
with ThreadPoolExecutor() as executor:
    results = list(tqdm(executor.map(process_url, player_links_new), total=len(player_links_new)))

# Remove None values from results
results = [result for result in results if result is not None]

# Write the header row and the player information to the CSV file
with open('khl_info.csv', 'a', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(lst_info)  # Header row
    writer.writerows(results)  # One row per player

print("Writing complete")  # Print a completion message after writing to the CSV
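
Each row is built from a fixed set of fields plus `*stat`, so its length depends on how many cells the last table row of the page contains. If a page's table has a different number of columns, the row no longer lines up with the header. Below is a minimal sketch of a width check that could run before writing. The helper name `split_by_width` is hypothetical, and the shortened header and rows are made up:

```python
def split_by_width(header, rows):
    # Separate rows that match the header width from those that do not
    good = [r for r in rows if len(r) == len(header)]
    bad = [r for r in rows if len(r) != len(header)]
    return good, bad

header = ('player_link', 'player', 'GP', 'G')  # Shortened, illustrative header
rows = [
    ('link-1', 'Player One', '10', '3'),
    ('link-2', 'Player Two', '7'),  # One stat cell missing
]
good, bad = split_by_width(header, rows)
```

Rows in `bad` can then be inspected by hand instead of ending up silently misaligned in the CSV file.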

