# Web scraper to obtain World Championship matches

## Note on terminology
In professional League of Legends, teams compete in matches against other teams, and every match ends in a win or a loss (there are no draws). A match consists of one or more games, and the winner of the match is the team that wins a majority of those games. Tournament matches are often just 1 game (best of 1), while matches in the finals are typically up to 5 games (best of 5).
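
To make the majority rule concrete, here is a minimal sketch. The helper name `wins_needed` is hypothetical and is not used elsewhere in this notebook; it assumes the number of games in a series is odd.

```python
def wins_needed(best_of):
    """Number of game wins required to take a best-of-N match (N assumed odd)."""
    return best_of // 2 + 1


print(wins_needed(1))  # 1: a best of 1 is decided by a single game
print(wins_needed(5))  # 3: a best of 5 ends as soon as one team has 3 wins
```

A best-of-5 score such as '1 - 3' is therefore a finished match: the second team reached the 3 wins it needed after 4 games.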

## Workflow for scraping data
1. Get the matchlists for the years in which we are interested (store as tournament_matchlist_urls). Each matchlist is a URL containing the list of matches for that specific tournament and year. 
2. Create an empty dictionary, 'matches'.
3. For each matchlist URL, go through the list of matches and, for each match, create a new key-value pair in the 'matches' dict. The key will be the match number, and the value will be a dictionary that consists of two key-value pairs. The first will be metadata for the match, and the second will be the match_data.

metadata will be a dictionary containing {'Tournament':'', 'TeamA':'', 'TeamB':'', 'Score':'', 'Date':'', 'Patch':''}

match_data will be a dataframe with the match data for that specific match
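
As a rough sketch of the structure described above: the helper `add_match`, the inner key names 'metadata' and 'match_data', and the match number 1 are illustrative assumptions. In the real workflow match_data would be a pandas dataframe; `None` stands in for it here so the sketch runs without any scraping.

```python
# Illustrative sketch of the nested 'matches' dictionary.
matches = {}


def add_match(matches, match_number, metadata, match_data):
    """Store one match under its match number (hypothetical helper)."""
    matches[match_number] = {'metadata': metadata, 'match_data': match_data}


# Example values: the 2014 final between these two teams.
metadata = {
    'Tournament': 'World Championship 2014',
    'TeamA': 'Star Horn Royal Club',
    'TeamB': 'Samsung Galaxy White',
    'Score': '1 - 3',
    'Date': '2014-10-19',
    'Patch': '4.14',
}

# The key 1 is a placeholder match number; None stands in for the dataframe.
add_match(matches, 1, metadata, match_data=None)

print(matches[1]['metadata']['TeamB'])
```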

In [123]:
# Imports

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

pd.set_option('display.max_columns', 200)


In [2]:
# URLs from which we will be scraping data
# Each item in this list is the matchlist URL for a different year of the World Championship
tournament_matchlist_urls = [f"https://gol.gg/tournament/tournament-matchlist/World%20Championship%2020{i}/" for i in range(14,23)]

In [3]:
# Print the list of urls
tournament_matchlist_urls

['https://gol.gg/tournament/tournament-matchlist/World%20Championship%202014/',
 'https://gol.gg/tournament/tournament-matchlist/World%20Championship%202015/',
 'https://gol.gg/tournament/tournament-matchlist/World%20Championship%202016/',
 'https://gol.gg/tournament/tournament-matchlist/World%20Championship%202017/',
 'https://gol.gg/tournament/tournament-matchlist/World%20Championship%202018/',
 'https://gol.gg/tournament/tournament-matchlist/World%20Championship%202019/',
 'https://gol.gg/tournament/tournament-matchlist/World%20Championship%202020/',
 'https://gol.gg/tournament/tournament-matchlist/World%20Championship%202021/',
 'https://gol.gg/tournament/tournament-matchlist/World%20Championship%202022/']

In [124]:
matches = {}

In [237]:
def get_game_numbers_and_metadata(tournament_matchlist_url):
    '''
    Retrieves the list of game numbers and metadata from the matches in a given tournament.
    Input: URL of the tournament matchlist
    Returns: A pair (game_numbers, game_data).
    game_numbers is a sorted list of the game numbers of the games played in that tournament.
    game_data is a dictionary mapping each game number to the metadata of the match that game belongs to.
    '''
    
    game_numbers = []
    game_data = {}
    links = []
    
    # Load the URL using requests
    # We need to use a request header to pretend we are using a popular browser or the website will (correctly) think that we are a bot.
    tournament_matchlist = requests.get(tournament_matchlist_url, headers={'User-Agent': 'Mozilla/5.0'})
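    # Parse the HTML using BeautifulSoup (naming the parser explicitly avoids a warning and keeps the result the same on every machine)
    tournament_matchlist_soup = BeautifulSoup(tournament_matchlist.text, 'html.parser')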
    
    # Select the table in which we are interested (uses CSS selectors) 
    matchlist_table = tournament_matchlist_soup.select('table.table_list')[0]
    # Select the table rows (note that table_rows[0] is the column header, so we skip it below)
    table_rows = matchlist_table.find_all('tr')
    
    # Iterate through the rows on the table
    for row in table_rows[1:]:
        
        # Select the columns in the row 
        cols = row.find_all('td')
        
        # The first column always contains the link, and the link contains the game number. If the link contains the word 'summary', the match may consist of multiple games.
        # In that case, we will need to do a little more work.
        link = cols[0].find_all('a')
        link = [l.get("href") for l in link]
        # If the match is a single game match, then compute the metadata and store the game number and metadata.
        game_number = [re.split('/', l)[3] for l in link if 'summary' not in l]
        if game_number:
            metadata = {}
            metadata['TEAMA'] = cols[1].text
            metadata['MATCH_SCORE'] = cols[2].text
            metadata['TEAMB'] = cols[3].text
            metadata['DATE'] = cols[-1].text
            metadata['PATCH'] = cols[-2].text
            game_numbers.append(game_number[0])
            game_data[game_number[0]] = metadata
        
        # If the link has 'page-summary' in it, then there may be multiple games in the match.
        link_multiples = [l for l in link if 'summary' in l]

        # In this case, getting the game numbers is a little trickier. First, we find out how many games were in the match.
        # For each link in link_multiples, retrieve the appropriate html and find out how many games are in that match.
        for link in link_multiples:
            # Load the appropriate URL
            link_data = requests.get('https://gol.gg'+link[2:], headers={'User-Agent': 'Mozilla/5.0'})
            # Parse the HTML with BeautifulSoup
            link_soup = BeautifulSoup(link_data.text, 'html.parser')
            # Count the number of times the div class 'row pb-1' appears in the html. This is the number of games played.
            n_games = len(link_soup.find_all("div", {"class":"row pb-1"}))
            # The summary link holds the number of the first game in the match. We assume the games of a match are numbered consecutively, so game i is that number plus i, for i from 0 to n_games-1.
            # The metadata for each game in a multi game match is the same, but we attach it to each game number anyway.
            for i in range(n_games):
                game_number = str(int(re.split('/', link)[3]) + i)
                game_numbers.append(game_number)
                
                metadata = {}
                metadata['TEAMA'] = cols[1].text
                metadata['MATCH_SCORE'] = cols[2].text
                metadata['TEAMB'] = cols[3].text
                metadata['DATE'] = cols[-1].text
                metadata['PATCH'] = cols[-2].text
                game_data[game_number] = metadata

    # Sort numerically: the game numbers are strings, so a plain sort would put '1000' before '999'
    return sorted(game_numbers, key=int), game_data


def flatten(l):
    '''
    Flattens a list
    '''
    
    return [item for sublist in l for item in sublist]
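
The link handling above can be tried in isolation, without any network access. This is only a sketch: the example href (including the 'page-game' segment) is an assumed shape inferred from the indexing used in the function, and the helper names are hypothetical.

```python
def game_number_from_href(href):
    """Return the game number embedded in a relative link like '../game/stats/181/...'."""
    # Splitting on '/' gives ['..', 'game', 'stats', '181', ...], so index 3 is the number.
    return href.split('/')[3]


def absolute_url(href):
    """Turn the relative link into a full URL by dropping the leading '..'."""
    return 'https://gol.gg' + href[2:]


href = '../game/stats/181/page-game/'  # assumed link shape
print(game_number_from_href(href))     # 181
print(absolute_url(href))              # https://gol.gg/game/stats/181/page-game/

# Game numbers are strings, so text order and numeric order can differ.
numbers = ['99', '100', '181']
print(sorted(numbers))           # ['100', '181', '99']
print(sorted(numbers, key=int))  # ['99', '100', '181']
```

The last two lines matter once game numbers of different lengths appear in the same list.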

In [240]:
%%time
game_numbers, game_data = get_game_numbers_and_metadata(tournament_matchlist_urls[0])

Wall time: 2.52 s


{'TEAMA': 'Star Horn Royal Club',
 'MATCH_SCORE': '1 - 3',
 'TEAMB': 'Samsung Galaxy White',
 'DATE': '2014-10-19',
 'PATCH': '4.14'}

In [127]:
%%time
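# Get the game numbers for the tournaments we are interested in (currently only the first URL in tournament_matchlist_urls).
# get_game_numbers_and_metadata returns (game_numbers, game_data); only the game numbers are needed here.
match_numbers = flatten([get_game_numbers_and_metadata(url)[0] for url in [tournament_matchlist_urls[0]]])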

Wall time: 2.47 s


In [6]:
def get_match_url(match_number):
    '''
    Takes a match number and returns the URL of the data table for that match. Very simple, very easy. 
    '''
    
    return f'https://gol.gg/game/stats/{match_number}/page-fullstats/'

In [7]:
# Store all the match URLs we are interested in into one list 
match_urls = [get_match_url(num) for num in match_numbers]

In [122]:
match_urls

['https://gol.gg/game/stats/181/page-fullstats/',
 'https://gol.gg/game/stats/182/page-fullstats/',
 'https://gol.gg/game/stats/183/page-fullstats/',
 'https://gol.gg/game/stats/184/page-fullstats/',
 'https://gol.gg/game/stats/185/page-fullstats/',
 'https://gol.gg/game/stats/186/page-fullstats/',
 'https://gol.gg/game/stats/187/page-fullstats/',
 'https://gol.gg/game/stats/188/page-fullstats/',
 'https://gol.gg/game/stats/189/page-fullstats/',
 'https://gol.gg/game/stats/190/page-fullstats/',
 'https://gol.gg/game/stats/191/page-fullstats/',
 'https://gol.gg/game/stats/192/page-fullstats/',
 'https://gol.gg/game/stats/193/page-fullstats/',
 'https://gol.gg/game/stats/194/page-fullstats/',
 'https://gol.gg/game/stats/195/page-fullstats/',
 'https://gol.gg/game/stats/196/page-fullstats/',
 'https://gol.gg/game/stats/197/page-fullstats/',
 'https://gol.gg/game/stats/198/page-fullstats/',
 'https://gol.gg/game/stats/199/page-fullstats/',
 'https://gol.gg/game/stats/200/page-fullstats/',


In [120]:


from io import StringIO


def get_match_df(match_url):
    '''
    Takes a match URL and returns the match data in the form of a pandas dataframe.
    '''
    
    link_data = requests.get(match_url, headers={'User-Agent': 'Mozilla/5.0'})

    # Read the first table on the page. Wrapping the HTML in StringIO avoids the
    # deprecation warning that newer pandas versions raise for literal HTML strings.
    stats = pd.read_html(StringIO(link_data.text))[0]
    # The first column holds the stat names; use it as the index, then transpose so that each row is a player
    stats.set_index('Unnamed: 0', inplace=True)
    stats.index.name = None
    stats = stats.T

    # Relabel the roles by side. This assumes the table lists the five blue side players first,
    # then the five red side players, each in TOP, JNG, MID, ADC, SUP order.
    stats['Role'] = ['BLUE_TOP', 'BLUE_JNG', 'BLUE_MID', 'BLUE_ADC', 'BLUE_SUP',
                     'RED_TOP', 'RED_JNG', 'RED_MID', 'RED_ADC', 'RED_SUP']

    # Strip the trailing percent sign so these columns can be converted to numbers
    for col in ['GOLD%', 'VS%', 'DMG%', 'KP%']:
        stats[col] = stats[col].str.rstrip('%')

    stats.set_index('Role', inplace=True)

    # Convert all columns except Player and KDA to float. KDA stays as text because it can hold values such as 'Perfect KDA'.
    # Because of a known bug, columns must be converted to float before they can be converted to Int64.
    # See more: https://github.com/pandas-dev/pandas/issues/25472
    float_cols = [col for col in stats.columns if (col != 'Player' and col != 'KDA')]
    stats = stats.astype({col: 'float64' for col in float_cols})

    return stats
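
The reshaping steps in get_match_df can be followed on a miniature table without touching the website. This is a sketch under assumptions: the column labels 'p0' and 'p1' and the raw role labels 'TOP' and 'JUNGLE' are placeholders, and only two players and two stats are included.

```python
import pandas as pd

# Miniature stand-in for the scraped table: one row per stat, one column per player.
raw = pd.DataFrame({
    'Unnamed: 0': ['Player', 'Role', 'Kills', 'KP%'],
    'p0': ['Zeus', 'TOP', '4', '70%'],
    'p1': ['Oner', 'JUNGLE', '2', '90%'],
})

# Use the stat names as the index, then transpose so that each row is a player.
stats = raw.set_index('Unnamed: 0')
stats.index.name = None
stats = stats.T

# Relabel the roles by side and strip the percent sign.
stats['Role'] = ['BLUE_TOP', 'BLUE_JNG']
stats['KP%'] = stats['KP%'].str.rstrip('%')
stats = stats.set_index('Role')

# Convert the numeric columns from text to float.
stats = stats.astype({'Kills': 'float64', 'KP%': 'float64'})
print(stats)
```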


In [121]:
get_match_df(match_urls[-1])

Role,Player,Level,Kills,Deaths,Assists,KDA,CS,CS in Team's Jungle,CS in Enemy Jungle,CSM,Golds,GPM,GOLD%,Vision Score,Wards placed,Wards destroyed,Control Wards Purchased,Detector Wards Placed,VSPM,WPM,VWPM,WCPM,VS%,Total damage to Champion,Physical Damage,Magic Damage,True Damage,DPM,DMG%,K+A Per Minute,KP%,Solo kills,Double kills,Triple kills,Quadra kills,Penta kills,GD@15,CSD@15,XPD@15,LVLD@15,Objectives Stolen,Damage dealt to turrets,Damage dealt to buildings,Total heal,Total Heals On Teammates,Damage self mitigated,Total Damage Shielded On Teammates,Time ccing others,Total Time CC Dealt,Total damage taken,Total Time Spent Dead,Consumables purchased,Items Purchased,Shutdown bounty collected,Shutdown bounty lost
BLUE_TOP,Zeus,18.0,4.0,4.0,3.0,1.8,292.0,37.0,,6.9,15552.0,369.0,21.6,54.0,15.0,17.0,9.0,8.0,1.28,0.36,0.21,0.4,14.3,27482.0,2756.0,18611.0,6115.0,652.0,26.3,0.17,70.0,,0.0,0.0,0.0,0.0,-637.0,-22.0,-1872.0,-2.0,0.0,5741.0,5741.0,12842.0,0.0,31989.0,0.0,9.0,93.0,36130.0,100.0,11.0,31.0,,
BLUE_JNG,Oner,17.0,2.0,2.0,7.0,4.5,256.0,181.0,,6.1,14713.0,349.0,20.4,88.0,13.0,33.0,13.0,11.0,2.09,0.31,0.31,0.78,23.3,18047.0,15492.0,1378.0,1177.0,428.0,17.2,0.21,90.0,,0.0,0.0,0.0,0.0,137.0,9.0,368.0,0.0,0.0,1711.0,1711.0,16210.0,0.0,30126.0,0.0,17.0,222.0,31569.0,96.0,12.0,32.0,,
BLUE_MID,Faker,18.0,2.0,5.0,4.0,1.2,325.0,24.0,,7.7,15380.0,365.0,21.3,53.0,21.0,10.0,14.0,9.0,1.26,0.5,0.33,0.24,14.1,22454.0,474.0,21696.0,284.0,533.0,21.5,0.14,60.0,,0.0,0.0,0.0,0.0,-310.0,3.0,585.0,1.0,0.0,2648.0,2648.0,1294.0,0.0,22574.0,0.0,21.0,473.0,20289.0,144.0,15.0,33.0,,
BLUE_ADC,Gumayusi,17.0,1.0,3.0,3.0,1.3,356.0,32.0,,8.4,16301.0,387.0,22.6,59.0,24.0,11.0,9.0,9.0,1.4,0.57,0.21,0.26,15.6,25817.0,23033.0,2608.0,176.0,613.0,24.7,0.09,40.0,,0.0,0.0,0.0,0.0,756.0,23.0,269.0,0.0,1.0,7094.0,7094.0,3036.0,0.0,8791.0,0.0,27.0,481.0,12202.0,84.0,13.0,31.0,,
BLUE_SUP,Keria,13.0,1.0,5.0,3.0,0.8,35.0,0.0,,0.8,10124.0,240.0,14.0,123.0,75.0,17.0,25.0,23.0,2.92,1.78,0.59,0.4,32.6,10849.0,496.0,9308.0,1045.0,257.0,10.4,0.09,40.0,,0.0,0.0,0.0,0.0,487.0,2.0,-315.0,0.0,0.0,2155.0,2155.0,863.0,0.0,11043.0,3543.0,20.0,224.0,15192.0,155.0,30.0,44.0,,
RED_TOP,kingen,18.0,6.0,3.0,6.0,4,269.0,27.0,,6.4,15346.0,364.0,20.5,57.0,15.0,18.0,12.0,12.0,1.35,0.36,0.28,0.43,13.6,27392.0,26129.0,0.0,1263.0,650.0,31.1,0.28,63.2,2.0,0.0,0.0,0.0,0.0,637.0,22.0,1872.0,2.0,0.0,966.0,966.0,21444.0,0.0,61310.0,0.0,20.0,189.0,40472.0,150.0,13.0,35.0,,
RED_JNG,Pyosik,16.0,5.0,4.0,8.0,3.3,192.0,163.0,,4.6,13765.0,327.0,18.4,68.0,21.0,17.0,21.0,20.0,1.61,0.5,0.5,0.4,16.3,12377.0,6709.0,3842.0,1826.0,294.0,14.1,0.31,68.4,,1.0,0.0,0.0,0.0,-137.0,-9.0,-368.0,0.0,0.0,526.0,526.0,19270.0,0.0,50572.0,0.0,19.0,223.0,40180.0,155.0,22.0,41.0,,
RED_MID,Zeka,18.0,3.0,2.0,9.0,6,387.0,29.0,,9.2,18568.0,441.0,24.9,62.0,23.0,11.0,11.0,9.0,1.47,0.55,0.26,0.26,14.8,23385.0,267.0,22308.0,810.0,555.0,26.6,0.28,63.2,,0.0,0.0,0.0,0.0,310.0,-3.0,-585.0,-1.0,0.0,12956.0,12956.0,2981.0,0.0,13752.0,0.0,15.0,312.0,15604.0,42.0,14.0,33.0,,
RED_ADC,Deft,18.0,5.0,0.0,4.0,Perfect KDA,388.0,37.0,,9.2,17575.0,417.0,23.5,67.0,22.0,15.0,6.0,6.0,1.59,0.52,0.14,0.36,16.0,19285.0,16676.0,1526.0,1083.0,458.0,21.9,0.21,47.4,,2.0,0.0,0.0,0.0,-756.0,-23.0,-269.0,0.0,0.0,5378.0,5378.0,5462.0,0.0,18651.0,0.0,12.0,76.0,22208.0,0.0,9.0,31.0,,
RED_SUP,BeryL,15.0,0.0,1.0,10.0,10,47.0,0.0,,1.1,9449.0,224.0,12.6,164.0,116.0,27.0,27.0,25.0,3.89,2.75,0.64,0.64,39.2,5520.0,732.0,4166.0,622.0,131.0,6.3,0.24,52.6,,0.0,0.0,0.0,0.0,-487.0,-2.0,315.0,0.0,0.0,204.0,204.0,14408.0,2834.0,15724.0,1047.0,24.0,194.0,21875.0,35.0,30.0,47.0,,
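
In the output above, whole-number stats such as Kills are shown as floats, and some columns (for example Solo kills) contain missing values. A minimal sketch of the two-step conversion mentioned in the comments, first to float and then to the nullable Int64 dtype, might look like this. The tiny frame reuses a few values from the output, and the dtype mapping is an illustrative subset rather than the full mapping for the table.

```python
import numpy as np
import pandas as pd

# Two players, three stats, stored as text the way they might arrive after scraping.
df = pd.DataFrame({
    'Kills': ['4', '6'],
    'Solo kills': [np.nan, '2'],  # missing for the first player
    'CSM': ['6.9', '6.4'],
})

# Illustrative subset of a column-to-dtype mapping.
col_dtypes = {'Kills': 'Int64', 'Solo kills': 'Int64', 'CSM': 'float64'}

# Step 1: everything to float, which turns the missing entry into NaN.
as_float = df.astype('float64')
# Step 2: whole-number columns to nullable Int64, where NaN becomes <NA>.
typed = as_float.astype(col_dtypes)

print(typed.dtypes)
print(typed)
```

The nullable Int64 dtype is what allows an integer column to keep a missing value instead of falling back to float.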


In [None]:
get_match_df(match_urls[4]).columns

In [None]:
df = get_match_df(match_urls[-1])
df.dtypes

In [None]:
get_match_df(match_urls[-4])

In [None]:
col_dtypes_dict = {"Level":"Int64", 
                   "Kills":"Int64", 
                   "Deaths":"Int64", 
                   "Assists":"Int64", 
                   #'KDA':'float64',
                   "CS":"Int64",
                   "CS in Team's Jungle":"Int64",
                   "CS in Enemy Jungle":"Int64",
                   "CSM":'float64',
                   "Golds":"Int64",
                   "GPM":"Int64",
                   "GOLD%":"float64"}