# Web scraper to obtain World Championship matches

## Note on terminology
In professional League of Legends, teams compete in matches against other teams; a match always has a winner and a loser, never a draw. A match consists of one or more games, and the winner of the match is the team that wins a majority of its games. In a tournament, many matches consist of a single game (best of 1), while matches in the finals typically consist of up to 5 games (best of 5).
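
As a small illustration of the match/game distinction, the sketch below decides a match from a list of game winners. The function name `match_winner` and its `best_of` parameter are hypothetical helpers that are not used elsewhere in this notebook; the game winners are those of the 2014 final shown later.

```python
from collections import Counter

def match_winner(game_winners, best_of):
    """Return the team holding a majority of a best-of-N match, or None if undecided."""
    needed = best_of // 2 + 1          # e.g. 3 wins in a best of 5
    counts = Counter(game_winners)     # wins per team
    for team, wins in counts.items():
        if wins >= needed:
            return team
    return None

final_2014 = [
    "Samsung Galaxy White",
    "Samsung Galaxy White",
    "Star Horn Royal Club",
    "Samsung Galaxy White",
]
print(match_winner(final_2014, best_of=5))  # Samsung Galaxy White
```

A best of 1 is just the special case where one game win is already a majority.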

## Workflow for scraping data
1. Get the matchlists for the years in which we are interested (stored as tournament_matchlist_urls). Each matchlist is a URL containing the list of matches for a specific tournament and year.
2. Create the game_data dictionary. Each key is a game number, and each value is a dict containing both the game metadata and the game data.
    - Collect the metadata for the games in each tournament matchlist.
        - metadata is a dictionary of the form {'BLUE_TEAM':'', 'RED_TEAM':'', 'MATCH_WINNER':'', 'GAME_WINNER':'', 'DATE':'', 'PATCH':'', 'MATCH_SCORE':[]}
    - Collect the game data for the games in each tournament matchlist.
        - data is a dataframe holding the player statistics for that specific game
3. Create a dataframe by iterating through the games in game_data and extracting relevant statistics.
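
To make the target structure concrete, here is a minimal sketch of what one entry of game_data looks like once both steps have run. The metadata values are copied from game 257 as printed later in this notebook; the `None` under `'data'` is only a placeholder for the pandas dataframe of player statistics.

```python
# One entry of game_data, keyed by the game number as a string
game_data = {
    "257": {
        "metadata": {
            "BLUE_TEAM": "Star Horn Royal Club",
            "RED_TEAM": "Samsung Galaxy White",
            "MATCH_WINNER": "Samsung Galaxy White",
            "GAME_WINNER": "Samsung Galaxy White",
            "DATE": "2014-10-19",
            "PATCH": "4.14",
            "MATCH_SCORE": ["1", "3"],
        },
        "data": None,  # placeholder: in the notebook this is a pandas DataFrame
    }
}

for game_number, game in game_data.items():
    print(game_number, sorted(game.keys()), game["metadata"]["GAME_WINNER"])
```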





In [93]:
# Imports

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import pickle 
import numpy as np
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 40)

In [2]:
import sys
print(sys.version)

3.7.13 (default, Mar 28 2022, 08:03:21) [MSC v.1916 64 bit (AMD64)]


In [3]:
# URLs from which we will be scraping data
# Each item in this list is the matchlist URL for a different year of the World Championship
tournament_matchlist_urls = [f"https://gol.gg/tournament/tournament-matchlist/World%20Championship%2020{i}/" for i in range(14,23)]

In [4]:
# Print the list of urls
tournament_matchlist_urls

['https://gol.gg/tournament/tournament-matchlist/World%20Championship%202014/',
 'https://gol.gg/tournament/tournament-matchlist/World%20Championship%202015/',
 'https://gol.gg/tournament/tournament-matchlist/World%20Championship%202016/',
 'https://gol.gg/tournament/tournament-matchlist/World%20Championship%202017/',
 'https://gol.gg/tournament/tournament-matchlist/World%20Championship%202018/',
 'https://gol.gg/tournament/tournament-matchlist/World%20Championship%202019/',
 'https://gol.gg/tournament/tournament-matchlist/World%20Championship%202020/',
 'https://gol.gg/tournament/tournament-matchlist/World%20Championship%202021/',
 'https://gol.gg/tournament/tournament-matchlist/World%20Championship%202022/']
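
The `%20` in these URLs is the percent-encoded space in the tournament name. As an alternative sketch, the same list can be built from readable names with the standard library; `BASE` and `matchlist_url` are hypothetical names introduced here for illustration.

```python
from urllib.parse import quote

BASE = "https://gol.gg/tournament/tournament-matchlist/"

def matchlist_url(tournament_name):
    """Percent-encode the tournament name and append it to the matchlist base URL."""
    return BASE + quote(tournament_name) + "/"

urls = [matchlist_url(f"World Championship {year}") for year in range(2014, 2023)]
print(urls[0])
print(len(urls))  # 9
```

This makes it easier to add a tournament whose name does not follow the `20{i}` pattern.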

In [5]:
def count_games_in_match(game_number):
    '''
    Counts how many games are in a given match. The input is the game number (as a string) of the first game in the match.
    '''
    
    link = 'https://gol.gg/game/stats/'+game_number+'/page-summary/'
    link_data = requests.get(link, headers={'User-Agent': 'Mozilla/5.0'})
    # Parse the URL with BeautifulSoup
    link_soup = BeautifulSoup(link_data.text)
    # Count the number of times the div class 'row pb-1' appears in the html. This is the number of games played.
    n_games = len(link_soup.find_all("div", {"class":"row pb-1"}))

    
    return n_games
    
    
def get_game_metadata(tournament_matchlist_url):
    '''
    Retrieves the game metadata from the matches in a given tournament.
    Input: URL of the tournament matchlist
    Returns: A dict where the keys are the game numbers and the values are the metadata for those games. 
    '''
    

    temp_game_data = {}
    links = []
    
    # Load the tournament matchlist URL using requests
    # We need to use a request header to pretend we are using a popular browser or the website will (correctly) think that we are a bot.
    tournament_matchlist = requests.get(tournament_matchlist_url, headers={'User-Agent': 'Mozilla/5.0'})
    # Parse the html file using BeautifulSoup
    tournament_matchlist_soup = BeautifulSoup(tournament_matchlist.text)
    
    # Select the table in which we are interested (uses CSS selectors) 
    matchlist_table = tournament_matchlist_soup.select('table.table_list')[0]
    # Select the table rows 
    table_rows = matchlist_table.find_all('tr')
    
    # Iterate through the rows on the table (i.e. the matches in the tournament)
    # Note that the first row is the column header, so we ignore that one
    for row in table_rows[1:]:
        
        # Select the columns in the row 
        cols = row.find_all('td')
        
        # The first column will always contain the link to the match, and the link contains the game number of the first game in the match.
        link = cols[0].find_all('a')[0]
        link = link.get("href")
        # Get the game_number
        game_number = re.split('/', link)[3]
        
        # Count the number of games in the match
        n_games = count_games_in_match(game_number)
        
        # Store the parts of the metadata that are common to a match
        date = cols[-1].text
        patch = cols[-2].text
        
        # Calculates the winner of the match
        team1 = cols[1].text
        team2 = cols[3].text
        match_score = cols[2].text
        match_score = re.split('-', match_score.replace(" ", ""))
        if int(match_score[0]) > int(match_score[1]):
            match_winner = team1
        elif int(match_score[0]) < int(match_score[1]):
            match_winner = team2
        else:
            match_winner = 'UNKNOWN'
        
        
        # We now iterate over the number of games, collecting information from the page-game link for each game
        # If there is only one game then this just runs once.
        for i in range(n_games):
            match_game_number = str(int(game_number) + i)
            
            page_game_link = 'https://gol.gg/game/stats/' + match_game_number + '/page-game/'
            page_game_link_data = requests.get(page_game_link, headers={'User-Agent': 'Mozilla/5.0'})
            page_game_link_soup = BeautifulSoup(page_game_link_data.text)     
            
            # Get Blue Team (first component is blue team, second component is whether blue team won or lost)
            BLUE = page_game_link_soup.find_all("div", {"class":"col-12 blue-line-header"})[0].text.strip('\n')
            blue_team = re.split('-', BLUE)[0].strip()
            
            # Get Red Team (first component is red team, second component is whether red team won or lost)
            RED = page_game_link_soup.find_all("div", {"class":"col-12 red-line-header"})[0].text.strip('\n')
            red_team = re.split('-', RED)[0].strip()
            
            # Get game Winner
            if (re.split('-', BLUE)[1].strip() == 'WIN'):
                game_winner = blue_team
            elif (re.split('-', RED)[1].strip() == 'WIN'):
                game_winner = red_team
            else:
                game_winner = 'UNKNOWN'
            
            
            # Store all the metadata in a metadata dictionary
            metadata = {}
            metadata['BLUE_TEAM'] = blue_team
            metadata['RED_TEAM'] = red_team
            metadata['MATCH_WINNER'] = match_winner
            metadata['GAME_WINNER'] = game_winner
            metadata['DATE'] = date
            metadata['PATCH'] = patch
            metadata['MATCH_SCORE'] = match_score
            
            temp_game_data[match_game_number] = {'metadata':metadata}
            
    return temp_game_data
            



def flatten(l):
    '''
    Flattens a list
    '''
    
    return [item for sublist in l for item in sublist]

In [6]:
%%time
test = get_game_metadata(tournament_matchlist_urls[0])
print(len(test))

78
Wall time: 33.4 s
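
get_game_metadata splits the blue and red header text on every `-`, which would cut a team name short if the name itself contained a hyphen. A minimal sketch of a more defensive parse is shown below; `parse_team_header` is a hypothetical helper, the header format `Team - WIN` is inferred from the comments in the function, and the `LOSS` label and the hyphenated team name are made up for the example.

```python
def parse_team_header(header_text):
    """Split 'Team Name - WIN' into (team, result) at the last hyphen only."""
    team, _, result = header_text.strip().rpartition("-")
    return team.strip(), result.strip()

print(parse_team_header("\nSamsung Galaxy White - WIN\n"))
print(parse_team_header("Some-Hyphenated Team - LOSS"))  # hypothetical team name
```

If the text contains no hyphen at all, `rpartition` returns an empty team name, which is easy to check for before storing the metadata.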


In [7]:
%%time
# Get all the match data in all the tournaments in tournament_matchlist_urls. 

#Initialise empty dict for the data
game_data = {}

for url in tournament_matchlist_urls:
    # Get the game numbers and data from the url and temporarily store them
    temp_game_metadata = get_game_metadata(url)
    
    # Combine the temporary game data dict with the existing game data dict
    # Note: {**x, **y} is a shallow merge of x and y, with values in y overriding the values of x if necessary. 
    game_data = {**game_data,  **temp_game_metadata}


Wall time: 4min 54s
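
The `{**x, **y}` merge used in the loop can be checked on a toy example. The patch values below are hypothetical and chosen only to show which side wins when a game number appears in both dicts.

```python
first = {"257": {"metadata": {"PATCH": "4.14"}}}
second = {
    "257": {"metadata": {"PATCH": "4.15"}},  # same key as in first
    "258": {"metadata": {"PATCH": "4.14"}},
}

merged = {**first, **second}

print(sorted(merged))                       # ['257', '258']
print(merged["257"]["metadata"]["PATCH"])   # 4.15, the value from second wins
print(merged["258"] is second["258"])       # True, the merge is shallow
```

Because the merge is shallow, the inner dicts are shared rather than copied, so later additions to an entry are visible through either dict.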


In [8]:
game_numbers = list(game_data.keys())
game_numbers = [int(num) for num in game_numbers]

In [9]:
def get_game_df(game_url):
    '''
    Takes a game URL and returns the game data in the form of a pandas dataframe.
    '''
    
    # Download the full-stats page for the game
    link_data = requests.get(game_url, headers={'User-Agent': 'Mozilla/5.0'})

    stats = pd.read_html(link_data.text)[0]
    stats.set_index('Unnamed: 0',inplace=True)
    stats.index.name = None
    stats = stats.T

    # Label the ten player rows by side and role (rows 0-4 are blue, rows 5-9 are red)
    stats['Role'] = ['BLUE_TOP', 'BLUE_JNG', 'BLUE_MID', 'BLUE_ADC', 'BLUE_SUP',
                     'RED_TOP', 'RED_JNG', 'RED_MID', 'RED_ADC', 'RED_SUP']
    
    stats["GOLD%"] = stats["GOLD%"].str.rstrip('%')
    stats["VS%"] = stats["VS%"].str.rstrip('%')
    stats["DMG%"] = stats["DMG%"].str.rstrip('%')
    stats["KP%"] = stats["KP%"].str.rstrip('%')


    stats.set_index('Role',inplace=True)    

    
    # Convert all columns except Player and KDA to float  
    float_cols = [col for col in stats.columns if (col !='Player' and col!='KDA')]
    stats = stats.astype({col:'float64' for col in float_cols})
        
    return stats


In [10]:
%%time
for num in game_numbers:
    game_data[f'{num}']['data'] = get_game_df(f'https://gol.gg/game/stats/{num}/page-fullstats/')

Wall time: 2min 31s


In [11]:
# Save game_data dict to pickle file
with open('game_data.pkl', 'wb') as f:
    pickle.dump(game_data, f)
        
# To load, we use
#with open('game_data.pkl', 'rb') as f:
#    game_data = pickle.load(f)

In [85]:
game_data['257']['metadata']

{'BLUE_TEAM': 'Star Horn Royal Club',
 'RED_TEAM': 'Samsung Galaxy White',
 'MATCH_WINNER': 'Samsung Galaxy White',
 'GAME_WINNER': 'Samsung Galaxy White',
 'DATE': '2014-10-19',
 'PATCH': '4.14',
 'MATCH_SCORE': ['1', '3']}

In [84]:
game_data['257']['data']

Role,Player,Level,Kills,Deaths,Assists,KDA,CS,CS in Team's Jungle,CS in Enemy Jungle,CSM,Golds,GPM,GOLD%,Vision Score,Wards placed,Wards destroyed,Control Wards Purchased,Detector Wards Placed,VSPM,WPM,VWPM,WCPM,VS%,Total damage to Champion,Physical Damage,Magic Damage,True Damage,DPM,DMG%,K+A Per Minute,KP%,Solo kills,Double kills,Triple kills,Quadra kills,Penta kills,GD@15,CSD@15,XPD@15,LVLD@15,Objectives Stolen,Damage dealt to turrets,Damage dealt to buildings,Total heal,Total Heals On Teammates,Damage self mitigated,Total Damage Shielded On Teammates,Time ccing others,Total Time CC Dealt,Total damage taken,Total Time Spent Dead,Consumables purchased,Items Purchased,Shutdown bounty collected,Shutdown bounty lost
BLUE_TOP,Cola,,0.0,4.0,1.0,0.3,152.0,8.0,3.0,6.2,6326.0,258.0,20.7,,11.0,1.0,1.0,,,0.45,0.04,0.04,,4015.0,252.0,3763.0,0.0,164.0,18.4,0.04,100.0,,0.0,0.0,0.0,0.0,-1151.0,-10.0,-801.0,-1.0,,,,,,,,,,,,,,,
BLUE_JNG,inSec,,0.0,4.0,0.0,0,115.0,78.0,2.0,4.7,5964.0,243.0,19.5,,10.0,5.0,2.0,,,0.41,0.08,0.2,,3866.0,3411.0,0.0,455.0,158.0,17.7,0.0,0.0,,0.0,0.0,0.0,0.0,-886.0,23.0,352.0,0.0,,,,,,,,,,,,,,,
BLUE_MID,Corn,,1.0,4.0,0.0,0.3,192.0,27.0,0.0,7.8,7062.0,288.0,23.1,,9.0,2.0,2.0,,,0.37,0.08,0.08,,6195.0,359.0,5836.0,0.0,253.0,28.4,0.04,100.0,,0.0,0.0,0.0,0.0,-1734.0,-10.0,-962.0,-1.0,,,,,,,,,,,,,,,
BLUE_ADC,Uzi,,0.0,2.0,1.0,0.5,168.0,12.0,0.0,6.8,6940.0,283.0,22.7,,5.0,2.0,1.0,,,0.2,0.04,0.08,,5449.0,2368.0,2657.0,424.0,222.0,24.9,0.04,100.0,,0.0,0.0,0.0,0.0,-412.0,-2.0,-3.0,0.0,,,,,,,,,,,,,,,
BLUE_SUP,Zero,,0.0,2.0,1.0,0.5,5.0,0.0,0.0,0.2,4306.0,176.0,14.1,,24.0,4.0,6.0,,,0.98,0.24,0.16,,2319.0,601.0,1458.0,260.0,95.0,10.6,0.04,100.0,,0.0,0.0,0.0,0.0,-701.0,-12.0,-415.0,0.0,,,,,,,,,,,,,,,
RED_TOP,Looper,,3.0,0.0,8.0,Perfect KDA,193.0,0.0,0.0,7.9,10492.0,428.0,21.3,,8.0,5.0,2.0,,,0.33,0.08,0.2,,6715.0,837.0,5878.0,0.0,274.0,17.8,0.45,68.8,,0.0,0.0,0.0,0.0,1151.0,10.0,801.0,1.0,,,,,,,,,,,,,,,
RED_JNG,DanDy,,5.0,0.0,9.0,Perfect KDA,83.0,59.0,11.0,3.4,9692.0,395.0,19.6,,10.0,6.0,5.0,,,0.41,0.2,0.24,,6506.0,5263.0,968.0,275.0,265.0,17.3,0.57,87.5,,0.0,0.0,0.0,0.0,886.0,-23.0,-352.0,0.0,,,,,,,,,,,,,,,
RED_MID,PawN,,7.0,0.0,7.0,Perfect KDA,230.0,6.0,9.0,9.4,12507.0,510.0,25.4,,16.0,6.0,1.0,,,0.65,0.04,0.24,,11460.0,10913.0,457.0,90.0,467.0,30.4,0.57,87.5,,1.0,0.0,0.0,0.0,1734.0,10.0,962.0,1.0,,,,,,,,,,,,,,,
RED_ADC,imp,,1.0,0.0,7.0,Perfect KDA,163.0,3.0,2.0,6.6,9507.0,388.0,19.3,,7.0,2.0,2.0,,,0.29,0.08,0.08,,10293.0,9091.0,0.0,1202.0,420.0,27.3,0.33,50.0,,0.0,0.0,0.0,0.0,412.0,2.0,3.0,0.0,,,,,,,,,,,,,,,
RED_SUP,Mata,,0.0,1.0,11.0,11,23.0,0.0,1.0,0.9,7128.0,291.0,14.5,,33.0,6.0,4.0,,,1.35,0.16,0.24,,2705.0,358.0,2113.0,234.0,110.0,7.2,0.45,68.8,,0.0,0.0,0.0,0.0,701.0,12.0,415.0,0.0,,,,,,,,,,,,,,,
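
The KDA column is left as text because players with zero deaths are reported as `Perfect KDA`. If a numeric KDA is needed, one option is to recompute it from the Kills, Deaths and Assists columns. The sketch below assumes KDA = (kills + assists) / deaths, which is consistent with the rows shown above (apparently rounded to one decimal), and makes the arbitrary choice of returning infinity for a perfect KDA; `kda` is a hypothetical helper.

```python
import math

def kda(kills, deaths, assists):
    """Numeric KDA; a deathless game is mapped to infinity (a choice made here)."""
    if deaths == 0:
        return math.inf
    return (kills + assists) / deaths

print(kda(0, 4, 1))    # Cola:  0.25 (shown as 0.3 in the table)
print(kda(0, 1, 11))   # Mata:  11.0
print(kda(7, 0, 7))    # PawN:  inf ('Perfect KDA' in the table)
```

For a model input, a finite cap or a separate "no deaths" flag may be easier to work with than infinity.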


In [87]:
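# GD@15 is already a per-player differential, so subtracting the red sum doubles the team value (the blue sum alone is -4884)
game_data['257']['data']['GD@15'][0:5].sum() - game_data['257']['data']['GD@15'][5:].sum()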

-9768.0

In [88]:
game_data['257']['data']['Golds'][0:5].sum() - game_data['257']['data']['Golds'][5:].sum()

-18728.0
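
The two cells above treat the columns differently for a reason that is easy to verify by hand. Golds is an absolute amount per player, so the team difference is the blue sum minus the red sum. GD@15 is already a difference per player, so the blue sum alone is the team differential. A plain-Python check with the values of game 257 copied from the table:

```python
# Per-player values for game 257, in the order TOP, JNG, MID, ADC, SUP
blue_golds = [6326, 5964, 7062, 6940, 4306]
red_golds = [10492, 9692, 12507, 9507, 7128]
blue_gd15 = [-1151, -886, -1734, -412, -701]
red_gd15 = [1151, 886, 1734, 412, 701]

gold_diff = sum(blue_golds) - sum(red_golds)
print(gold_diff)                          # -18728
print(sum(blue_gd15))                     # -4884, the team gold difference at 15
print(sum(blue_gd15) - sum(red_gd15))     # -9768, twice the team difference
```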

# Decide on relevant predictors and construct training/testing dataframe

We now need to decide which data to use to train our ML models. The data will be a pandas dataframe in which each row is a single game and the columns hold data about that game.

In [138]:
def extract_gamestats(game_number):
    '''
    Extracts summary statistics for a single game.
    game_data[game_number] is a dictionary with the keys 'metadata' and 'data'.
    Returns a dict of statistics; differences are computed as blue minus red.
    '''
    game = game_data[game_number]
    
    game_stats = {}
    
    metadata = game['metadata']
    data = game['data']
    
    game_stats['GAME_NUMBER'] = game_number
    game_stats['DATE'] = metadata['DATE']
    game_stats['PATCH'] = metadata['PATCH']
    game_stats['BLUE_TEAM'] = metadata['BLUE_TEAM']
    game_stats['RED_TEAM'] = metadata['RED_TEAM']
    game_stats['GAME_WINNER'] = metadata['GAME_WINNER']
    game_stats['MATCH_WINNER'] = metadata['MATCH_WINNER']
    
    game_stats['games_in_match'] = count_games_in_match(game_number)
    game_stats['MATCH_SCORE'] = metadata['MATCH_SCORE']
    
    game_stats['GD'] = data['Golds'][0:5].sum() - data['Golds'][5:].sum()
    game_stats['GD@15'] = data['GD@15'][0:5].sum()
    
    game_stats['GPMD'] = data['GPM'][0:5].sum() - data['GPM'][5:].sum()
    game_stats['CSMD'] = data['CSM'][0:5].sum() - data['CSM'][5:].sum()
    

    
    return game_stats
    

In [139]:
%%time
df_dict = [extract_gamestats(f'{num}') for num in game_numbers]

df =  pd.DataFrame(df_dict)
df = df.set_index('GAME_NUMBER')

Wall time: 3min 24s


In [141]:
# Convert dtypes
df['DATE'] = pd.to_datetime(df['DATE'])

# Add codes for categorical data
# 1 for blue win, 0 for red win
df['BLUEWIN'] = np.where(df['BLUE_TEAM'] == df['GAME_WINNER'], 1, 0)

# Encode team names as integer codes, using one shared mapping for the blue and red columns
teams = ['BLUE_TEAM','RED_TEAM']
df[['BLUE_TEAM_CODES', 'RED_TEAM_CODES']] = (pd.factorize(df[teams].values.ravel())[0]+1).reshape(-1, len(teams))

In [142]:
df

GAME_NUMBER,DATE,PATCH,BLUE_TEAM,RED_TEAM,GAME_WINNER,MATCH_WINNER,games_in_match,MATCH_SCORE,GD,GD@15,GPMD,CSMD,BLUEWIN,BLUE_TEAM_CODES,RED_TEAM_CODES
257,2014-10-19,4.14,Star Horn Royal Club,Samsung Galaxy White,Samsung Galaxy White,Samsung Galaxy White,4,"[1, 3]",-18728.0,-4884.0,-764.0,-2.5,0,1,2
258,2014-10-19,4.14,Samsung Galaxy White,Star Horn Royal Club,Samsung Galaxy White,Samsung Galaxy White,4,"[1, 3]",19765.0,2766.0,676.0,3.2,1,2,1
259,2014-10-19,4.14,Star Horn Royal Club,Samsung Galaxy White,Star Horn Royal Club,Samsung Galaxy White,4,"[1, 3]",10371.0,2182.0,269.0,0.5,1,1,2
260,2014-10-19,4.14,Samsung Galaxy White,Star Horn Royal Club,Samsung Galaxy White,Samsung Galaxy White,4,"[1, 3]",15523.0,2011.0,675.0,4.2,1,2,1
252,2014-10-12,4.14,Star Horn Royal Club,OMG,OMG,Star Horn Royal Club,5,"[3, 2]",-21841.0,-4325.0,-743.0,-4.0,0,1,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44341,2022-10-07,12.18,T1,Edward Gaming,T1,T1,1,"[1, 0]",8802.0,2932.0,385.0,4.0,1,73,6
44340,2022-10-07,12.18,JD Gaming,Evil Geniuses,JD Gaming,JD Gaming,1,"[1, 0]",10842.0,2377.0,364.0,2.1,1,64,80
44339,2022-10-07,12.18,CTBC Flying Oyster,100 Thieves,CTBC Flying Oyster,CTBC Flying Oyster,1,"[1, 0]",9693.0,2338.0,304.0,2.3,1,79,49
44338,2022-10-07,12.18,G2 Esports,DWG KIA,DWG KIA,DWG KIA,1,"[0, 1]",-17989.0,-4376.0,-555.0,-5.7,0,39,72
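
The team codes in the table come from factorizing the blue and red columns together, so a team gets the same code whichever side it plays on. The sketch below reproduces that numbering in plain Python for the first five rows; `encode_teams` is a hypothetical helper written only to illustrate the idea, not the pandas implementation.

```python
def encode_teams(rows):
    """Assign codes 1, 2, 3, ... in order of first appearance, reading row by row."""
    codes = {}
    encoded = []
    for blue, red in rows:
        for team in (blue, red):
            codes.setdefault(team, len(codes) + 1)  # new teams get the next code
        encoded.append((codes[blue], codes[red]))
    return encoded, codes

rows = [
    ("Star Horn Royal Club", "Samsung Galaxy White"),
    ("Samsung Galaxy White", "Star Horn Royal Club"),
    ("Star Horn Royal Club", "Samsung Galaxy White"),
    ("Samsung Galaxy White", "Star Horn Royal Club"),
    ("Star Horn Royal Club", "OMG"),
]
encoded, codes = encode_teams(rows)
print(encoded)  # [(1, 2), (2, 1), (1, 2), (2, 1), (1, 3)]
```

Note that these codes are arbitrary labels; a tree-based model will treat them as ordered numbers.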


# Set up the Classification model (this will be moved later)

In [143]:
from sklearn.ensemble import RandomForestClassifier

In [144]:
rf = RandomForestClassifier(n_estimators = 10, min_samples_split = 5, random_state = 42)

In [164]:
train = df[(df['DATE'] < '2022-10-19') & (df['DATE'] > '2021-10-01') ]
test = df[df['DATE'] > '2022-10-19']
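
Both conditions in this split use strict inequalities around 2022-10-19, so any game played on exactly that date lands in neither set. The notebook does not say whether that is intended. A plain-Python sketch of the same logic, with hypothetical dates and a hypothetical `split_by_date` helper:

```python
from datetime import date

def split_by_date(game_dates, train_start, cutoff):
    """Mirror the strict comparisons used for the train/test split."""
    train = [d for d in game_dates if train_start < d < cutoff]
    test = [d for d in game_dates if d > cutoff]
    return train, test

# Hypothetical game dates, one of them exactly on the cutoff
dates = [date(2021, 10, 5), date(2022, 10, 7), date(2022, 10, 19), date(2022, 10, 20)]
train_dates, test_dates = split_by_date(dates, date(2021, 10, 1), date(2022, 10, 19))

print(len(train_dates), len(test_dates))  # 2 1, the game on the cutoff is dropped
```

Using `>=` on the test side would keep the cutoff date in the test set instead.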

In [165]:
predictors = ['BLUE_TEAM_CODES', 'RED_TEAM_CODES']

In [166]:
rf.fit(train[predictors], train['BLUEWIN'])

RandomForestClassifier(min_samples_split=5, n_estimators=10, random_state=42)

In [167]:
preds = rf.predict(test[predictors])

In [168]:
preds

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1])

In [169]:
from sklearn.metrics import accuracy_score

In [170]:
acc = accuracy_score(test['BLUEWIN'], preds)

In [171]:
acc

0.4482758620689655
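
Since every prediction is 1, this accuracy is simply the share of test games won by the blue side: 13 of 29. A useful sanity check is to compare the model against constant baselines. The sketch below does this in plain Python; the ordering of `y_true` is an assumption (only the counts, 13 blue wins and 16 red wins, follow from the output above), and `accuracy` is a hypothetical stand-in for accuracy_score.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# 29 test games: 13 blue wins and 16 red wins (order is arbitrary here)
y_true = [1] * 13 + [0] * 16
always_blue = [1] * 29
always_red = [0] * 29

print(accuracy(y_true, always_blue))  # 13/29, about 0.448
print(accuracy(y_true, always_red))   # 16/29, about 0.552
```

On this test set, always predicting a red win would score higher than the fitted model, which suggests the team codes alone carry little signal here.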