# Overview

The strategy for collecting data will be simple. Since it is already known that rating is a very good indicator of skill level, we will:

- select the top 2000 accounts on the ranked ladder
- identify games with these players in them
- save this data to a database

To do this we will need to connect to a few different endpoints to get the required data.

(Separately, we will need a mapping file between professional players and their account names for our labelled data.)

### Import some libraries & API-Keys

In [77]:
import requests
import json
import time
import pandas as pd

In [6]:
with open('creds.json') as api_creds:
    creds = json.load(api_creds)
    
# credentials are now stored in the variable creds['key_one']

In [11]:
def riot_api(api, api_key):
    
    # note - the api must specify all variables in it, 
    # so the path may need to be created outside of this function
    URL = 'https://' + api + "?api_key=" + api_key
    
    response = requests.get(URL)
    
    return response.json()
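
The helper above returns whatever JSON comes back, including error payloads, and leaves all query arguments to the caller. A minimal sketch of a more defensive variant follows. The name `riot_api_retry` and its `params`, `max_retries` and `get` arguments are illustrative and not part of the original notebook, and the sketch assumes the API signals rate limiting with HTTP 429 plus a `Retry-After` header given in seconds.

```python
import time


def riot_api_retry(api, api_key, params=None, max_retries=3, get=None):
    """Hypothetical variant of riot_api with query params and a retry on HTTP 429."""
    if get is None:
        import requests  # imported lazily so a stub can be injected for testing
        get = requests.get

    url = "https://" + api
    query = {"api_key": api_key}
    query.update(params or {})

    for attempt in range(max_retries + 1):
        response = get(url, params=query, timeout=10)
        if response.status_code == 429 and attempt < max_retries:
            # assumption: Retry-After holds the number of seconds to wait
            time.sleep(float(response.headers.get("Retry-After", 1)))
            continue
        response.raise_for_status()  # surface errors instead of returning them as data
        return response.json()
```

With this shape, optional arguments can be passed as a dictionary rather than being appended to the key string by hand.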

In [28]:
# first we need a list of the top players in the EUW region

region = "euw1.api.riotgames.com"
challenger_endpoint = "/lol/league/v4/challengerleagues/by-queue/RANKED_SOLO_5x5/"
grandmaster_endpoint = "/lol/league/v4/grandmasterleagues/by-queue/RANKED_SOLO_5x5/"
master_endpoint = "/lol/league/v4/masterleagues/by-queue/RANKED_SOLO_5x5/"

In [34]:
challenger = riot_api(region + challenger_endpoint, creds['key_one'])

# looks good!
challenger['entries']

[{'summonerId': 'BUpbqiivzJcwwTE5K61m2EvesrfpjR-uywQ_bp51Q03oaj0',
  'summonerName': 'motus vetiti',
  'leaguePoints': 820,
  'rank': 'I',
  'wins': 91,
  'losses': 71,
  'veteran': False,
  'inactive': False,
  'freshBlood': False,
  'hotStreak': False},
 {'summonerId': 'JmkODrIn4ql6MBWZBB8z2pAT8CXGGzvtFwE32VOq8kQPYxUT',
  'summonerName': 'JDX7 Odi11',
  'leaguePoints': 902,
  'rank': 'I',
  'wins': 132,
  'losses': 115,
  'veteran': True,
  'inactive': False,
  'freshBlood': False,
  'hotStreak': False},
 {'summonerId': 'ykmcXq_mFOKMZWMFtR1pdKiz3iL_SSjbxmWw3eaDn4F4GlE-KII7LRQtsA',
  'summonerName': 'YOUNG TALENT',
  'leaguePoints': 636,
  'rank': 'I',
  'wins': 82,
  'losses': 53,
  'veteran': False,
  'inactive': False,
  'freshBlood': True,
  'hotStreak': False},
 {'summonerId': 'BykFSgMA-hkbuPdF3AE63gvGZ6b4L77X_9T4-J8Ed2PNyQM',
  'summonerName': 'TakeSet sama',
  'leaguePoints': 575,
  'rank': 'I',
  'wins': 78,
  'losses': 65,
  'veteran': False,
  'inactive': False,
  'freshBloo

In [38]:
grandmaster = riot_api(region + grandmaster_endpoint, creds['key_one'])
master = riot_api(region + master_endpoint, creds['key_one'])

In [40]:
master_len, grandmaster_len, challenger_len = len(master['entries']), len(grandmaster['entries']), len(challenger['entries'])

print(f'number of master players: {master_len}, grandmaster: {grandmaster_len}, and challenger: {challenger_len}')

number of master players: 3720, grandmaster: 700, and challenger: 300


# At this point we have some questions to answer.

Our API keys allow a maximum of 400 requests every 2 minutes. This means that at most we will be able to ingest 12,000 games per hour, and we only have time to run the script for ~10 hours. Allowing for some loss because our requests will not perfectly match the rate limits, we will have about 100,000 games of data. We therefore have to choose carefully which games we select.

Many of the top players are known to play against each other all the time, so to save on API calls we need to implement a system to check if the game has already been downloaded. 
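
One simple way to implement such a check is to keep a set of match IDs that have already been handled. The sketch below is illustrative only: the function name `unseen_matches` and the example IDs are made up for this example.

```python
def unseen_matches(candidate_ids, downloaded):
    """Return the IDs not yet downloaded, preserving order, and mark them as seen."""
    fresh = []
    for match_id in candidate_ids:
        if match_id not in downloaded:
            downloaded.add(match_id)
            fresh.append(match_id)
    return fresh


downloaded = set()
# two hypothetical players who shared one game
print(unseen_matches(["EUW1_1", "EUW1_2"], downloaded))  # ['EUW1_1', 'EUW1_2']
print(unseen_matches(["EUW1_2", "EUW1_3"], downloaded))  # ['EUW1_3']
```

Set membership tests are constant time on average, so the check stays cheap even with tens of thousands of IDs.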

On top of this, we need to decide whether choosing a broader set of players (~4,000) with a small number of games per player (~25 in the worst-case scenario) is better or worse than focusing on a smaller set of players with a larger number of games per player. Since we intend to use pro players' game histories as the supervised set, these will be loaded separately and are not a concern here.
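
The trade-off can be made concrete with a few lines of arithmetic. This is a rough sketch using only the figures quoted in the text; the constant and function names are illustrative, and it ignores the requests spent on looking up players.

```python
REQUESTS_PER_WINDOW = 400   # from the rate limit quoted above
WINDOW_MINUTES = 2
HOURS_AVAILABLE = 10
USABLE_BUDGET = 100_000     # the text's estimate after rate-limit losses

per_hour = REQUESTS_PER_WINDOW * 60 // WINDOW_MINUTES
ceiling = per_hour * HOURS_AVAILABLE
print(per_hour, ceiling)  # 12000 120000


def games_per_player(n_players, budget=USABLE_BUDGET, calls_per_game=1):
    """Games each player can contribute if no games overlap."""
    return budget // (n_players * calls_per_game)


for n in (4000, 2000, 1000):
    print(n, games_per_player(n))
# 4000 25
# 2000 50
# 1000 100
```

If each game needs more than one request, raising `calls_per_game` scales every figure down accordingly.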

The question is: how many games are needed to determine whether or not someone is a rising star? Given that top-tier players play on average 10 games/day, what is the absolute minimum length of time needed to get a reasonable sample size?

Borrowing from significance testing, a 'small sample size', below which normal-approximation tests become unreliable and small-sample tests such as t-tests are needed, is conventionally put at around 20 to 30. As a ballpark figure, 20 therefore seems a good place to start as the absolute lowest number of games per player.

A different way of looking at the problem: before any modelling has been done, can we approximate what kind of data the modelling steps will need?

## The (very) rough modelling idea will be as follows:

- Semi-supervised: Take the games data of each player and cluster players into groups. Each group will contain some number of known professionals, so groups with a high number of professionals should in theory contain amateurs that behave like professionals.
- Supervised: Take a set of players' performances in a number of matches and label the professionals. Use this to build a supervised binary classifier that identifies whether a player's set of performances resembles that of the pros. In theory we will then be able to interpret strongly misclassified players as those who, though amateur, *should* be professional. Using Shapley values we will be able to determine feature importance for the classifier, and so identify features that are useful in understanding player performance.



## How many features do we have in the match data?

To understand what these algorithms will need to perform well, let's look at the number of base features we get from each match. (Connecting to APIs below)


In [44]:
region = "europe.api.riotgames.com"
match = "/lol/match/v5/matches/by-puuid/ykmcXq_mFOKMZWMFtR1pdKiz3iL_SSjbxmWw3eaDn4F4GlE-KII7LRQtsA/ids"

test = riot_api(region + match, creds['key_one'])

In [45]:
# it seems puuid and summonerid are not the same...

test

{'status': {'message': 'Bad Request - Expected message of type puuid, but found summonerId',
  'status_code': 400}}

In [58]:
region = "euw1.api.riotgames.com"
random_player = "/lol/summoner/v4/summoners/ykmcXq_mFOKMZWMFtR1pdKiz3iL_SSjbxmWw3eaDn4F4GlE-KII7LRQtsA/"

In [59]:
test = riot_api(region + random_player, creds['key_one'])

In [60]:
test

{'id': 'ykmcXq_mFOKMZWMFtR1pdKiz3iL_SSjbxmWw3eaDn4F4GlE-KII7LRQtsA',
 'accountId': 'cugHSw2kD9xrvRjFkG7VyNesaJAsf4QY6mYaUQmXYbG1b2pcQckGznoj',
 'puuid': '_un5TMN7ELF20phpmUEg-hHGQkt1siDTDWnQg1duwsvPPWZSCeJWgIRr0rMUHzKbMil5AXIUDXANNQ',
 'name': 'YOUNG TALENT',
 'profileIconId': 5128,
 'revisionDate': 1643222366000,
 'summonerLevel': 201}

In [61]:
europe = "europe.api.riotgames.com"
matches_list_endpoint = '/lol/match/v5/matches/by-puuid/' + test['puuid'] +  '/ids'

# default is 20 returned matches, need to add in extra optional arguments to specify range
matches_list =  riot_api(europe + matches_list_endpoint, creds['key_one'])

In [62]:
matches_list

['EUW1_5698494054',
 'EUW1_5697281617',
 'EUW1_5697193265',
 'EUW1_5697073881',
 'EUW1_5696995150',
 'EUW1_5696878023',
 'EUW1_5696882146',
 'EUW1_5696803607',
 'EUW1_5696648703',
 'EUW1_5696537695',
 'EUW1_5696090147',
 'EUW1_5696126139',
 'EUW1_5696081836',
 'EUW1_5695983008',
 'EUW1_5695924148',
 'EUW1_5692967563',
 'EUW1_5692921090',
 'EUW1_5692705244',
 'EUW1_5692617738',
 'EUW1_5692603206']
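
The optional arguments mentioned in the comment above can be built with the standard library instead of typed by hand. This is a small sketch: `match_list_query` is an illustrative name, and it produces the `&key=value` form because `riot_api` already starts the query string with `?api_key=`.

```python
from urllib.parse import urlencode


def match_list_query(start=0, count=20, match_type=None):
    """Return extra query arguments in the '&k=v' form that follows ?api_key=... ."""
    query = {"start": start, "count": count}
    if match_type is not None:
        query["type"] = match_type
    return "&" + urlencode(query)


print(match_list_query(start=70, count=70, match_type="ranked"))
# &start=70&count=70&type=ranked
```

The result can be appended to the key, for example `riot_api(europe + matches_list_endpoint, creds['key_one'] + match_list_query(0, 70, "ranked"))`.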

In [96]:
one_match_endpoint = '/lol/match/v5/matches/EUW1_5697193265'

one_match_data = riot_api(europe + one_match_endpoint, creds['key_one'])

In [70]:
# number of features for 1 player in a given game (not all useful however)

print(len(one_match_data['info']['participants'][0]))

one_match_data['info']['participants'][0]

105


{'assists': 7,
 'baronKills': 0,
 'bountyLevel': 0,
 'champExperience': 8205,
 'champLevel': 11,
 'championId': 266,
 'championName': 'Aatrox',
 'championTransform': 0,
 'consumablesPurchased': 6,
 'damageDealtToBuildings': 728,
 'damageDealtToObjectives': 1203,
 'damageDealtToTurrets': 728,
 'damageSelfMitigated': 12229,
 'deaths': 6,
 'detectorWardsPlaced': 3,
 'doubleKills': 1,
 'dragonKills': 0,
 'firstBloodAssist': False,
 'firstBloodKill': False,
 'firstTowerAssist': False,
 'firstTowerKill': False,
 'gameEndedInEarlySurrender': False,
 'gameEndedInSurrender': True,
 'goldEarned': 7690,
 'goldSpent': 7225,
 'individualPosition': 'MIDDLE',
 'inhibitorKills': 0,
 'inhibitorTakedowns': 0,
 'inhibitorsLost': 1,
 'item0': 6630,
 'item1': 0,
 'item2': 3133,
 'item3': 1054,
 'item4': 3076,
 'item5': 3047,
 'item6': 3364,
 'itemsPurchased': 24,
 'killingSprees': 1,
 'kills': 3,
 'lane': 'MIDDLE',
 'largestCriticalStrike': 2,
 'largestKillingSpree': 2,
 'largestMultiKill': 2,
 'longestTim

In [69]:
timeline_endpoint = '/lol/match/v5/matches/EUW1_5698494054/timeline'
timeline_data = riot_api(europe + timeline_endpoint, creds['key_one'])

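# 'frames' is a list (one entry per frameInterval), so participantFrames must be read per frame,
# e.g. timeline_data['info']['frames'][0]['participantFrames']; show the whole response first
timeline_data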

{'metadata': {'dataVersion': '2',
  'matchId': 'EUW1_5698494054',
  'participants': ['d_F-mdI9s-rYRn5K4rnjJyi9QYVzzpqbrN8rBuHnC0vCLo3LtYSGCulY-WGe-R3obFAGvcK-Gs5Ajw',
   'Xt_b6XX39kLswXAz050elJmr6jDR-YPe0br9ub4iIF_ISnroqohovSG15A-9bTjgP7jfQFZTuFjhTA',
   'pxprPAKB1HgTJCIbzHTlMZ-ZWc83kEPNqk_4GxhldCa_CZA9GwWTx8IVVDPkADeEvvWQoO85D9CC7w',
   'KmQNlVkBJ0jYhrlIFSgrpL1Kxtlx6hsE4Tg2YklDp36OwKVN11dBgjInlmcK1CcVmI7vbWt0RShN7w',
   'OKOvFwwWpS2PMPgs9tXBmqjHGYIumaDVfitvhCd-f-nlfz--NioOWka6n_wrjK9QYbxt-9NM5JxSDg',
   'uTKkJBshrNZMnLlNkxlbRw9OQ2gwCixHCb2MtemQKB1i6pA4rR04cljYYMf4D3pjDNdg3IxT-Oc2DQ',
   '_un5TMN7ELF20phpmUEg-hHGQkt1siDTDWnQg1duwsvPPWZSCeJWgIRr0rMUHzKbMil5AXIUDXANNQ',
   'y16jk6xF5alhCY9vmqTrRrBhfzDVBsw7pwk_R1PeKfcADPftDKai0lN9Y8WirJk2jimJ0zdUjDah8g',
   'vgNx7oWONHFxMv_3SkunClA8QlGCOeCOdQO3kRkJeLun08OVmXlv-Tk-8Dl_08twaBsSk46ItU7SEw',
   'qbpSoURdKm2ivhP4uO0Zf6XKWVpUkQXS2_iGQk9eqtSsrWLwxW2fgBVz6MdNfVtLms1fiAGLTJCwIw']},
 'info': {'frameInterval': 60000,
  'frames': [{'events': [{'realT

# It turns out we have two separate sources for one match. Timeline (in-game data) and Summary data.

# Now we have seen one Timeline and one Match call, we are in a better position to answer our data-related questions.

Every match comes with many features. The summary statistics hold ~105 features per player, and the timeline data contains a snapshot of each player for every minute of the game, together with any 'events' that take place throughout the game. The timeline includes thousands of 'features' (though at this point it would be infeasible to include them all in a model).

A common question is 'how much data do we need to cover the possible vector space of N features?', and the generic answer used to be '2^N datapoints'. In reality this is not the case, but the take-home message is that, in ML terms, more data on a player will lead to better-quality predictions. (Prediction is not strictly what we are trying to do here, but if the models perform better, the feature-importance analysis done afterwards will also be better.)

Qualitatively speaking, 4,000 players feels like too many: we will have a set of ~50-100 professional players (and even fewer stars), and we want to identify amateurs playing around this level. In the top tiers of play there are three brackets: the top with 300 players, the second with 700 players, and the third with a few thousand (in this case 3,720). Measuring how frequently professional players already appear within each bracket gives a baseline for the 'odds' that a pro is in that group. However, this frequency decreases significantly with each bracket (it is already well understood that a player's rating is strongly correlated with skill level).
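
The per-bracket frequency described above could be measured along these lines once the mapping file of professional account names is available. This is a sketch on toy data only: `pro_share` and the names below are illustrative, and no real frequencies are implied.

```python
def pro_share(entries, pro_names):
    """Fraction of ladder entries whose summonerName is in the set of known pro accounts."""
    if not entries:
        return 0.0
    hits = sum(1 for entry in entries if entry["summonerName"] in pro_names)
    return hits / len(entries)


# toy data, not real results
toy_bracket = [
    {"summonerName": "A"},
    {"summonerName": "B"},
    {"summonerName": "C"},
    {"summonerName": "D"},
]
toy_pros = {"B"}
print(pro_share(toy_bracket, toy_pros))  # 0.25
```

In the notebook this would be called once per bracket, for example on `challenger['entries']`, with the set of pro account names loaded from the mapping file.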

Therefore, to get a good mix of amateurs playing with pros, we will look at the top 1,000 players (the top two brackets), giving us 100 games/player from a budget of 100,000 games. This should be enough to cancel out any randomness caused by 'lucky games' or 'unlucky games', whilst keeping a larger number of amateurs than professionals. The extremely high dimensionality of the data is still a problem in this case, but that will have to be dealt with separately. (When has too much data ever been a problem?)

## Finally: Are we going to use Match Summaries **and** timelines, or just one?

In principle everything in the match summaries data can be calculated from the timeline - however this will be time consuming and there are some handy statistics readily available to us in this data. So we will keep both.

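Keeping both means two API calls per match (one for the summary, one for the timeline), which halves the number of matches our request budget covers. We should therefore expect that, even without any overlapping games, we will end up with just 50 games per player. This is still markedly above our minimum of 20, so we won't reduce the total number of players we look at any further.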

With all that out of the way, let's start getting our data!

# The quick and dirty data collection method

## Step 1: Collect a list of ~50,000 match_id's

## Step 2: Chunk into groups of 1,000, and call the API for each of these games.

- Append each result to a dataframe and save this dataframe to a file in a data folder. Leave running for ~10 hours (overnight), and move the data into a database once the data has been collected.
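
The chunking in Step 2 can be sketched with plain slicing, which also handles a final chunk that is shorter than 1,000. The helper name `chunks` and the generated IDs are illustrative.

```python
def chunks(items, size=1000):
    """Yield (start_index, chunk) pairs; the last chunk may be shorter than size."""
    for start in range(0, len(items), size):
        yield start, items[start:start + size]


match_ids = [f"EUW1_{n}" for n in range(2500)]  # hypothetical IDs
for start, chunk in chunks(match_ids):
    print(start, len(chunk))
# 0 1000
# 1000 1000
# 2000 500
```

The start index doubles as a file suffix, so an interrupted run can be resumed from the last chunk that was written.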

In [73]:
#Step 1: Collect a list of 50,000 match ids:

players = challenger['entries'] + grandmaster['entries']
players

[{'summonerId': 'BUpbqiivzJcwwTE5K61m2EvesrfpjR-uywQ_bp51Q03oaj0',
  'summonerName': 'motus vetiti',
  'leaguePoints': 820,
  'rank': 'I',
  'wins': 91,
  'losses': 71,
  'veteran': False,
  'inactive': False,
  'freshBlood': False,
  'hotStreak': False},
 {'summonerId': 'JmkODrIn4ql6MBWZBB8z2pAT8CXGGzvtFwE32VOq8kQPYxUT',
  'summonerName': 'JDX7 Odi11',
  'leaguePoints': 902,
  'rank': 'I',
  'wins': 132,
  'losses': 115,
  'veteran': True,
  'inactive': False,
  'freshBlood': False,
  'hotStreak': False},
 {'summonerId': 'ykmcXq_mFOKMZWMFtR1pdKiz3iL_SSjbxmWw3eaDn4F4GlE-KII7LRQtsA',
  'summonerName': 'YOUNG TALENT',
  'leaguePoints': 636,
  'rank': 'I',
  'wins': 82,
  'losses': 53,
  'veteran': False,
  'inactive': False,
  'freshBlood': True,
  'hotStreak': False},
 {'summonerId': 'BykFSgMA-hkbuPdF3AE63gvGZ6b4L77X_9T4-J8Ed2PNyQM',
  'summonerName': 'TakeSet sama',
  'leaguePoints': 575,
  'rank': 'I',
  'wins': 78,
  'losses': 65,
  'veteran': False,
  'inactive': False,
  'freshBloo

In [75]:
summoner_ids = [player['summonerId'] for player in players]

In [88]:
region = "euw1.api.riotgames.com"
base_endpoint = "/lol/summoner/v4/summoners/"

players_full = []

for i, summoner_id in enumerate(summoner_ids):
    URL = region + base_endpoint + summoner_id + "/"
    
    results = riot_api(URL, creds['key_one'])
    
    time.sleep(0.85) # put in some sleep time to avoid hitting rate limits
    
    if i % 100 == 0:
        print(f'{i} requests made')
    
    players_full.append(results)

0 requests made
100 requests made
200 requests made
300 requests made
400 requests made
500 requests made
600 requests made
700 requests made
800 requests made
900 requests made
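
A fixed `time.sleep(0.85)` is added on top of however long each request took. A small pacing helper that sleeps only for the remainder of the interval is sketched below; the class name `Pacer` and its injectable `clock` and `sleep` arguments are illustrative choices, not part of the notebook.

```python
import time


class Pacer:
    """Keep successive calls at least min_interval seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last = None  # time of the previous call, None before the first one

    def wait(self):
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now
```

In the loop above one would create `pacer = Pacer(0.85)` once and call `pacer.wait()` immediately before each `riot_api` call.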


In [91]:
# so that we don't lose this data (and have to call the APIs again), let's save it to file

players_df = pd.DataFrame(players_full)
players_df.to_csv('players.csv', index=False)

In [147]:
# this cell was run twice; on the second run the all_matches initialisation below was commented out
# all_matches = set()


for i, puuid in enumerate(players_df['puuid'].values):
    endpoint = '/lol/match/v5/matches/by-puuid/' + puuid + '/ids'
    
    # count=70 allows for some overlap of games; we are aiming for 50 games/player
    # (first run: start=0, second run: start=70)
    # the extra query arguments follow the key because riot_api builds "?api_key=" + api_key
    matches_list = riot_api(europe + endpoint, creds['key_one'] + "&type=ranked&start=70&count=70")
    
    # add every returned match id; the set silently drops duplicates
    all_matches.update(matches_list)
    
    time.sleep(0.85)
    
    if i % 100 == 0:
        print(f'{i} requests made')
    
# again save to remove chance of losing data!
matches = pd.DataFrame(list(all_matches), columns=['Matches'])
matches.to_csv('matches_2.csv')

0 requests made
100 requests made
200 requests made
300 requests made
400 requests made
500 requests made
600 requests made
700 requests made
800 requests made
900 requests made


In [146]:
# check how many matches we have in total (a max of 70,000 given no overlap)
print(matches.shape) 

# only 22,800 matches with parameters of 70 games / api call. Means a high degree of game overlap. 
# Let us take 70 more by changing above cell parameters to start=70, count=70 (from start=0)

(22866, 1)


In [148]:
print(matches.shape)

(52155, 1)


In [182]:
# step 2: call data and save to file every 1000 calls for timeline + summary data

save_file_summaries = './match-data/summaries/'
save_file_timeline = './match-data/timelines/'

# another loop here to save files - but first test inside loop

for j in range(18000,matches.shape[0],1000):

    match_summary_data = []
    match_timeline_data = []

    for i, match in enumerate(matches['Matches'].values[j:j+1000]):

        a = time.time()

        one_match_endpoint = '/lol/match/v5/matches/' + match
        one_match_data = riot_api(europe + one_match_endpoint, creds['key_one'])
        two_match_data = riot_api(europe + one_match_endpoint, creds['key_two'])

        timeline_endpoint = f'/lol/match/v5/matches/{match}/timeline'
        three_timeline_data = riot_api(europe + timeline_endpoint, creds['key_three'])
        four_timeline_data = riot_api(europe + timeline_endpoint, creds['key_four'])

        match_summary_data = match_summary_data + [one_match_data] + [two_match_data]
        match_timeline_data = match_timeline_data + [three_timeline_data] + [four_timeline_data]

        b = time.time()

        # add in rate-limiting if necessary
        if b-a < 0.85:
            remainder = 0.85 - (b-a)
            time.sleep(remainder)
            
        if i % 100 == 0:
            print(str(j+i) +  ' requests made')

    timeline_path = save_file_timeline + 'timeline_' + str(j) + '.csv'
    summary_path = save_file_summaries + 'summary_' + str(j) + '.csv'
    
    timeline_df = pd.DataFrame(match_timeline_data)
    timeline_df.to_csv(timeline_path, index=False)

    summary_df = pd.DataFrame(match_summary_data)
    summary_df.to_csv(summary_path, index=False)

# each chunk of 1,000 matches is saved to its own file, named by the chunk's starting index
# spotted a bug in here: every match is requested twice (once per key), so each datapoint is duplicated exactly once - will deduplicate later

18000 requests made
18100 requests made
18200 requests made
18300 requests made
18400 requests made
18500 requests made
18600 requests made
18700 requests made
18800 requests made
18900 requests made
19000 requests made
19100 requests made
19200 requests made


KeyboardInterrupt: 

## ----- Manually Interrupted -----
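
The deduplication mentioned in the cell above could look roughly like this. It is a sketch that works on an in-memory list of response dictionaries and assumes each record carries `metadata.matchId`, as the timeline response shown earlier does; the function name is illustrative, and records without a match ID (such as error payloads) are dropped.

```python
def deduplicate(records):
    """Keep the first record seen for each metadata.matchId; drop records without one."""
    unique = {}
    for record in records:
        match_id = record.get("metadata", {}).get("matchId")
        if match_id is not None and match_id not in unique:
            unique[match_id] = record
    return list(unique.values())


# toy records, not real API output
records = [
    {"metadata": {"matchId": "EUW1_1"}},
    {"metadata": {"matchId": "EUW1_1"}},
    {"status": {"status_code": 429}},
    {"metadata": {"matchId": "EUW1_2"}},
]
print([r["metadata"]["matchId"] for r in deduplicate(records)])  # ['EUW1_1', 'EUW1_2']
```

Data that has already been written to CSV would first need to be parsed back into dictionaries before a function like this applies.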

## Each batch of 1,000 matches produces 1 GB of timeline data and 65 MB of summary data. The download is continuing anyway, but this will push the limit of a local-only setup (~60 GB of data in total).
## Time considerations mean it won't realistically be possible to deploy the data + compute resources to the cloud

## After collecting 20 GB of timeline data, due to time constraints, the decision was made to continue collecting the 50,000 games as summary data only, to speed up the data collection process

In [199]:
# run the same script with only summary data

# step 2: call data and save to file every 1000 calls

save_file_summaries = './match-data/summaries/'
save_file_timeline = './match-data/timelines/'

# another loop here to save files - but first test inside loop

for j in range(37000,matches.shape[0],1000):

    match_summary_data = []

    
    for i in range(j, j+1000, 4):
        
        a = time.time()
        
        match_endpoint_1 = '/lol/match/v5/matches/' + matches['Matches'].values[i]
        match_endpoint_2 = '/lol/match/v5/matches/' + matches['Matches'].values[i+1]
        match_endpoint_3 = '/lol/match/v5/matches/' + matches['Matches'].values[i+2]
        match_endpoint_4 = '/lol/match/v5/matches/' + matches['Matches'].values[i+3]
        
        match_data_1 = riot_api(europe + match_endpoint_1, creds['key_one'])
        match_data_2 = riot_api(europe + match_endpoint_2, creds['key_two'])
        match_data_3 = riot_api(europe + match_endpoint_3, creds['key_three'])
        match_data_4 = riot_api(europe + match_endpoint_4, creds['key_four'])
        

        match_summary_data = match_summary_data + [match_data_1] + \
                            [match_data_2] + [match_data_3] + [match_data_4]
        
        b = time.time()
        
        # add in rate-limiting if necessary
        if b-a < 0.85:
            remainder = 0.85 - (b-a)
            time.sleep(remainder)
            
        # keep track of progress
        if i % 100 == 0:
            print(str(i) +  ' requests made')

            
    summary_path = save_file_summaries + 'summary_' + str(j) + '.csv'
    summary_df = pd.DataFrame(match_summary_data)
    summary_df.to_csv(summary_path, index=False)

37000 requests made
37100 requests made
37200 requests made
37300 requests made
37400 requests made
37500 requests made
37600 requests made
37700 requests made
37800 requests made
37900 requests made
38000 requests made
38100 requests made
38200 requests made
38300 requests made
38400 requests made
38500 requests made
38600 requests made
38700 requests made
38800 requests made
38900 requests made
39000 requests made
39100 requests made
39200 requests made
39300 requests made
39400 requests made
39500 requests made
39600 requests made
39700 requests made
39800 requests made
39900 requests made
40000 requests made
40100 requests made
40200 requests made
40300 requests made
40400 requests made
40500 requests made
40600 requests made
40700 requests made
40800 requests made
40900 requests made
41000 requests made
41100 requests made
41200 requests made
41300 requests made
41400 requests made
41500 requests made
41600 requests made
41700 requests made
41800 requests made
41900 requests made


IndexError: index 52155 is out of bounds for axis 0 with size 52155
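
The `IndexError` arises because 52,155 is not a multiple of 4, so the last group of four reads index 52155. As the cell is written, the exception is also raised before the final partial chunk is saved, so it appears that chunk was never written. A bounds-safe way to spread matches across the four keys is sketched below; `assign_keys` is an illustrative name and the IDs are made up.

```python
from itertools import cycle


def assign_keys(match_ids, key_names):
    """Pair every match ID with a key name in round-robin order."""
    # zip stops when match_ids is exhausted, so no index can run past the end
    return list(zip(match_ids, cycle(key_names)))


keys = ["key_one", "key_two", "key_three", "key_four"]
ids = [f"EUW1_{n}" for n in range(6)]  # hypothetical IDs; 6 is not a multiple of 4
for match_id, key_name in assign_keys(ids, keys):
    print(match_id, key_name)
```

Each pair can then be fetched with `riot_api(europe + '/lol/match/v5/matches/' + match_id, creds[key_name])`, whatever the length of the list.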

In [None]:
# finished collecting! (in a very dirty way - but time is of the essence)