# Getting match data from the RIOT games API
This notebook goes through the process of aggregating match data from the RIOT games API. Due to rate limiting and the convoluted method of obtaining match data, the process can take quite a while. Where possible, I have saved the intermediate results so that long steps do not need to be repeated. 

## Here is the workflow for obtaining the data:

1. ### Obtain summoner IDs by looking up the summoner data from the first page of each tier and division 
(No saving required, since there are only about 20 API calls to make)

2. ### Use those summoner IDs to obtain the corresponding PUUIDs
(Save PUUIDs in a pickle file 'PUUIDs')

3. ### Use the PUUIDs to query the match history of those summoners, obtaining a list of match IDs
(Save match IDs in a pickle file 'match_IDs')

4. ### Use the match IDs to get the match data
(Save match data in a pickle file 'match_data')



In [1]:
"""
@author: Mark Bugden
August 2022

Part of a ML project in predicting win rates for League of Legends games based on team composition.
Current update available on GitHub: https://github.com/Mark-Bugden
"""

# Import anything necessary
import requests
import pandas as pd
from ratelimit import limits, sleep_and_retry
import pickle


# This gives us a progress bar for longer computations. 
from tqdm.notebook import tqdm
# To use it, just wrap any iterable with tqdm(iterable).
# Eg: 
# for i in tqdm(range(100)):
#     ....




# We need to pick a region. 
region_list = ['BR1', 'EUN1', 'EUW1', 'JP1', 'KR', 'LA1', 'LA2', 'NA1', 'OC1', 'RU', 'TR1']
region = 'EUN1'


# Here are the tiers and divisions
tier_list = ['DIAMOND', 'PLATINUM', 'GOLD', 'SILVER', 'BRONZE', 'IRON']
division_list = ['I', 'II', 'III', 'IV']



# Load the data for the champions
champion_url = 'http://ddragon.leagueoflegends.com/cdn/12.14.1/data/en_US/champion.json'
r = requests.get(champion_url)
json_data = r.json()
champion_data = json_data['data']

#champions = list(champion_data.keys())
#champion_data['Zyra']

### Note: 
The API rate limits are meant to be 20 calls per second and 100 calls per 120 seconds, but I get errors when I set the rate limit to exactly that. I don't get any errors at half that rate, which works for now but doubles the time required to get the data. I should try again at slightly over half the rate to see if I get an error. If I do, then I am probably accidentally accessing the API twice per call instead of once.
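
One possible contributor to these errors (an assumption on my part, not something verified in this notebook) is that the `limits` decorator from `ratelimit` counts calls in fixed windows, so calls bunched around a window boundary can exceed the limit when measured over a rolling window. The sketch below shows a rolling-window limiter built only from the standard library. The class name `SlidingWindowLimiter` and its `acquire` method are hypothetical and are not used elsewhere in this notebook.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `calls` acquisitions in any rolling `period`-second window."""

    def __init__(self, calls, period, clock=time.monotonic, sleep=time.sleep):
        self.calls = calls
        self.period = period
        self.clock = clock      # injectable so the limiter can be tested without waiting
        self.sleep = sleep
        self.stamps = deque()   # times of the calls still inside the window

    def acquire(self):
        """Block until another call is allowed, then record it."""
        while True:
            now = self.clock()
            # Forget calls that have left the rolling window.
            while self.stamps and now - self.stamps[0] >= self.period:
                self.stamps.popleft()
            if len(self.stamps) < self.calls:
                self.stamps.append(now)
                return
            # Wait until the oldest recorded call leaves the window.
            self.sleep(self.period - (now - self.stamps[0]))
```

To respect both limits, one could create two limiters, for example `SlidingWindowLimiter(20, 1)` and `SlidingWindowLimiter(100, 120)`, and call `acquire()` on each before every request.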


In [2]:
# Some useful functions


def flatten(l):
    ''' Flattens a list
    
    Parameters
    ----------
        l:list
            A list to be flattened
    
    Returns
    -------
        list
            The flattened list
    
    '''
    return [item for sublist in l for item in sublist]

def unique(l):
    ''' Returns the unique elements of a list
    
    Parameters
    ----------
        l:list
            A list you want the unique elements of
    
    Returns
    -------
        list
            The unique elements of the list (unordered)
    
    '''
    return list(set(l))

# We will need an API key to access the Riot games API. I have one of these, but I don't want it to be publicly available on my GitHub, so I am storing it locally in a text file. 

def getAPI_key():
    ''' Accesses my locally stored API key so that I don't have to include it publicly on GitHub
    
    Returns
    -------
        string:
            My API key for RIOT games

    '''
    # Use a context manager so the file is always closed, and strip any trailing newline from the key
    with open("api_key.txt", "r") as f:
        return f.read().strip()



# Our API calls are rate limited to 100 every 2 minutes and 20 every second. So we will use the ratelimit package to limit how many times we call the API. 
# If the rate limit is reached, the program will sleep until it can try again. We will set the rate to 5 calls per 7s. This will be slower for short queries, but won't give us errors on long ones.
# Note that it should be 5/6s (100 calls per 120s), but for some reason that gave me Error 429 (too many requests). Trying 7s just to be a bit safer.




@sleep_and_retry
@limits(5, 7)
def callAPI(url):
    ''' Send and retrieve API requests, rate limited to the RIOT games API rate limit. 
    
    Parameters
    ----------
        url: string
            The URL of the request you are making. 

    Returns
    -------
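        list or dict
            The decoded JSON response. Depending on the endpoint this is a list (e.g. league entries, match IDs) or a dict (e.g. a single summoner or match). 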
    '''
    r = requests.get(url)
    if r.status_code != 200:
        raise Exception('API response: {}'.format(r.status_code))
    return r.json()


# If I am getting a 401 error, I probably just need to refresh my API key from the developer website




def get_summoner_ids(page=1):
    '''
    Aggregates a list of summoner ids from the first page of all the low-ranking tiers and divisions.
    
    Parameters
    ----------
        page: int
            Which page is queried for the summoner info
    
    Returns
    -------
        list
            A list of summoner ids
    
    '''
    summoners = []

    # For all tiers from Iron to Diamond, and for all divisions from I to IV, send a request to get the requested page of summoners for that tier and division.
    for tier in tqdm(tier_list):
        for division in division_list:

            url = 'https://' + region + '.api.riotgames.com/lol/league/v4/entries/RANKED_SOLO_5x5/' + tier + '/' + division + '?page=' + str(page) + '&api_key='

            # Here json_data is a list. Each item in the list corresponds to one summoner, and is a dict whose key/value pairs contain information about that summoner.
            json_data = callAPI(url + getAPI_key())
            

            for item in json_data:
                summoners.append(item)

    summoners_df = pd.DataFrame(summoners)

    return summoners_df['summonerId'].tolist()





def get_puuids(ids):
    ''' Takes a list of summoner ids and queries the RIOT API for their puuids
    
    Parameters
    ----------
        ids: list
            A list of summoner ids
        
    Returns
    -------
        list
            A list of the corresponding puuids
    '''
    
    summoner_info = []

    for summoner in tqdm(ids):
        url = 'https://' + region + '.api.riotgames.com/lol/summoner/v4/summoners/' + summoner + '?api_key='
        json_data = callAPI(url + getAPI_key())
        summoner_info.append(json_data)
    df_summ = pd.DataFrame(summoner_info)
    
    return df_summ['puuid'].tolist()





def get_match_ids(puuids, n = 10):
    ''' Takes a list of puuids and returns the match IDs for the previous n matches. Any duplicate match IDs are removed. 
    
    Parameters
    ----------
        puuids: list
            A list of puuids to query
        n: int 
            The number of matches to get per puuid
        
    Returns
    -------
        list
            A list of match ids
    '''
    
    match_id = []
    
    for puuid in tqdm(puuids):
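        # count=n asks for the n most recent matches (the endpoint accepts at most 100 per call)
        url = 'https://europe.api.riotgames.com/lol/match/v5/matches/by-puuid/' + puuid + '/ids?start=0&count=' + str(n) + '&api_key='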
        json_data = callAPI(url + getAPI_key())
        match_id.append(json_data)
    
    return list(set(flatten(match_id)))






def get_match_data(batch):
    ''' Accesses the match data for a given batch of match ids and returns the data as a list
    
    Parameters
    ----------
        batch: list
            A list of match ids 
            
    Returns
    -------
        list
            A list containing the match data for each of the match ids in batch
            
    '''
    data_list = []
    
    for match in tqdm(batch):
        url = 'https://europe.api.riotgames.com/lol/match/v5/matches/'+ match + '?api_key='
        json_data = callAPI(url + getAPI_key())
        data_list.append(json_data)
    
    return data_list
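
The functions above hardcode the `europe` host for match-v5 calls and append the API key to the URL. As a sketch of an alternative, the block below derives the regional host from the platform code and builds the query string with `urllib.parse`. The mapping `PLATFORM_TO_REGIONAL` is my assumption and should be checked against Riot's routing documentation (`OC1` is left out because I am not sure which regional host currently serves it). The helper names `match_ids_url` and `auth_headers` are hypothetical.

```python
from urllib.parse import quote, urlencode

# Assumed platform -> regional routing map; verify against Riot's documentation.
PLATFORM_TO_REGIONAL = {
    'BR1': 'americas', 'LA1': 'americas', 'LA2': 'americas', 'NA1': 'americas',
    'EUN1': 'europe', 'EUW1': 'europe', 'RU': 'europe', 'TR1': 'europe',
    'JP1': 'asia', 'KR': 'asia',
}


def match_ids_url(platform, puuid, start=0, count=20):
    """Build the match-v5 'match IDs by PUUID' URL for the given platform."""
    host = PLATFORM_TO_REGIONAL[platform] + '.api.riotgames.com'
    path = '/lol/match/v5/matches/by-puuid/' + quote(puuid, safe='') + '/ids'
    query = urlencode({'start': start, 'count': count})
    return 'https://' + host + path + '?' + query


def auth_headers(api_key):
    """Send the key in the X-Riot-Token header instead of the query string."""
    return {'X-Riot-Token': api_key}
```

With these, a request could be sent as `requests.get(match_ids_url(region, puuid, count=100), headers=auth_headers(getAPI_key()))`, which keeps the key out of URLs that might end up in logs or error messages.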

# Step 1

In [3]:
# Getting the summoner ids is as easy as calling the get_summoner_ids function 
#summoner_ids = get_summoner_ids()

# We already have these, so we can just load them.
with open("Step 1 summoner ids/summoner_id_file", "rb") as fp:   # Unpickling
    summoner_ids = pickle.load(fp)
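
The "compute once, then load from disk" pattern used in each step can be wrapped in a small helper. This is only a sketch: `load_or_compute` is a hypothetical name, and it assumes the result can be pickled and that an existing file is always up to date.

```python
import os
import pickle


def load_or_compute(path, compute):
    """Return the pickled result at `path`, computing and saving it if missing."""
    if os.path.exists(path):
        with open(path, 'rb') as fp:
            return pickle.load(fp)
    result = compute()
    with open(path, 'wb') as fp:
        pickle.dump(result, fp)
    return result
```

For example, `load_or_compute("Step 1 summoner ids/summoner_id_file", get_summoner_ids)` would only call the API when the file does not exist yet.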

# Step 2

In [4]:
# Now that we have the summoner IDs, we can call get_puuids to get the associated puuids. Remember this step takes a long time, since we need one query per ID.
#summoner_puuids = get_puuids(summoner_ids[0:20])
# Save this to pickle file since it took a long time to get
#with open("puuid_file", "wb") as fp:   #Pickling
    #pickle.dump(summoner_puuids, fp)
    
# We already have these, so we can just load them.
summoner_puuids = {}
for i in range(8):
    with open(f"Step 2 puuids/puuid_file_batch{i}", "rb") as fp:   # Unpickling
        summoner_puuids[i] = pickle.load(fp)
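
The notebook does not show how the eight batch files were produced. One way to split a long list into a fixed number of batches is sketched below; `make_batches` is a hypothetical helper, not the code that generated the saved files.

```python
def make_batches(items, n_batches):
    """Split `items` into `n_batches` contiguous lists of nearly equal length."""
    size, extra = divmod(len(items), n_batches)
    batches = []
    start = 0
    for i in range(n_batches):
        # The first `extra` batches take one additional item each.
        end = start + size + (1 if i < extra else 0)
        batches.append(items[start:end])
        start = end
    return batches
```

Each batch can then be passed to `get_puuids` and pickled separately, so an interrupted run only loses the batch in progress.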

# Step 3

In [5]:
# Now that we have a list of puuids, we can query their match histories
# This takes a long time, so we will do it in batches.

#match_ids = get_match_ids(summoner_puuids)

# We already have the match ids, so we can just load them
match_ids_batches = {}
for i in range(8):
    with open(f"Step 3 match ids/match_id_file_batch{i}", "rb") as fp:   # Unpickling
        match_ids_batches[i] = pickle.load(fp)


In [6]:
match_ids = []
for i in range(8):
    match_ids += match_ids_batches[i] 

In [7]:
# We have approximately 38.7 million match ids (duplicates included)
len(match_ids)

38653157

In [8]:
# How many of these are unique?
match_ids = unique(match_ids)
len(match_ids)

28558609
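
Because `unique` goes through a `set`, the order of the resulting list is arbitrary, and for strings it can change from one Python session to the next. That matters if we later slice the list into batches and want the same batches on every run. A minimal order-preserving alternative (the name `unique_ordered` is hypothetical) is:

```python
def unique_ordered(items):
    """Remove duplicates while keeping the first occurrence of each item in order."""
    # dict keys are unique and remember insertion order
    return list(dict.fromkeys(items))
```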

In [11]:
# Approximately 28.6 million unique matches

# Step 4

In [16]:
# We have the match IDs, but now we need the match data. Unfortunately, this is going to take a LONG time to get due to rate limiting, so we will be doing it in batches.
#match_id_firstbatch = match_ids[0:100]
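
To get a feel for how long "a LONG time" is, here is a back-of-the-envelope estimate. It assumes one API call per match and the 5 calls per 7 seconds limit set on `callAPI`, and it ignores network latency and retries.

```python
n_matches = 28_558_609    # unique match IDs counted above
calls_per_window = 5      # rate limit used for callAPI
window_seconds = 7

total_seconds = n_matches / calls_per_window * window_seconds
total_days = total_seconds / 86_400   # 86,400 seconds in a day

print(f"Estimated time: {total_days:.0f} days")
```

At this rate, downloading every match would take well over a year, which is why only a subset of the match IDs is fetched, in batches.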

In [18]:
# Great, we now have a big list of match IDs. We can now access the match data for each of the match IDs in the first batch. 
#match_data = get_match_data(match_id_firstbatch)



In [9]:
# Let's load in one of the minibatch data files.
# In later iterations of this, we will do all the minibatches here.
match_data_batches = {}
for i in tqdm(range(8)):
    # The f prefix is needed so that {i} is replaced by the minibatch number
    with open(f"Step 4 match data/batch0/minibatch{i}", "rb") as fp:   # Unpickling
        match_data_batches[i] = pickle.load(fp)

In [10]:
match_data = []
match_data += match_data_batches[0]

In [14]:
# match_data is a big list where each element contains the data for a single match. We will construct a DataFrame from all this data.
len(match_data)

48287

In [15]:
# Each element is a match, and each match is a dict, whose keys are metadata and info. For the first match, we have
match_data[0].keys()

dict_keys(['metadata', 'info'])

In [17]:
# To be honest, we don't really care about the metadata, except for maybe the matchId
# The info is much more relevant for us. It is also a dict, with a lot of different entries. We'll be interested in the participants for now.

print("The metadata keys are: ", match_data[0]['metadata'].keys(), "\n")
print("The info keys are: ", match_data[0]['info'].keys())

The metadata keys are:  dict_keys(['dataVersion', 'matchId', 'participants']) 

The info keys are:  dict_keys(['gameCreation', 'gameDuration', 'gameEndTimestamp', 'gameId', 'gameMode', 'gameName', 'gameStartTimestamp', 'gameType', 'gameVersion', 'mapId', 'participants', 'platformId', 'queueId', 'teams', 'tournamentCode'])


In [18]:
# Here we can see what game mode the first match was, and how long it lasted.  
# We will need to subset our data to include only ranked games, and to exclude games that ended in an early forfeit (say, games under 15 mins).

print('Game mode: ', match_data[0]['info']['gameMode'])
print("Duration (mins): ", match_data[0]['info']['gameDuration']/60)

Game mode:  CLASSIC
Duration (mins):  30.916666666666668
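
A caveat on `gameDuration`: as far as I understand Riot's match-v5 documentation, the field is in seconds for recent games, but older games (before patch 11.20) report it in milliseconds, and the suggested way to tell them apart is whether `gameEndTimestamp` is present. The helper below is a sketch under that assumption. `duration_seconds` is a hypothetical name.

```python
def duration_seconds(info):
    """Return the game length in seconds from a match 'info' dict."""
    if 'gameEndTimestamp' in info:
        # Newer matches: gameDuration is already in seconds
        return info['gameDuration']
    # Older matches (assumed): gameDuration is in milliseconds
    return info['gameDuration'] / 1000
```

Using this in the 15 minute filter would stop old matches from slipping through with an apparently huge duration.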


In [19]:
# Participants is a list with 10 entries (for normal game modes). Each entry is a dict containing information about each summoner. For example, the first summoner has the following info:
match_data[0]['info']['participants'][0].keys()

dict_keys(['assists', 'baronKills', 'basicPings', 'bountyLevel', 'challenges', 'champExperience', 'champLevel', 'championId', 'championName', 'championTransform', 'consumablesPurchased', 'damageDealtToBuildings', 'damageDealtToObjectives', 'damageDealtToTurrets', 'damageSelfMitigated', 'deaths', 'detectorWardsPlaced', 'doubleKills', 'dragonKills', 'eligibleForProgression', 'firstBloodAssist', 'firstBloodKill', 'firstTowerAssist', 'firstTowerKill', 'gameEndedInEarlySurrender', 'gameEndedInSurrender', 'goldEarned', 'goldSpent', 'individualPosition', 'inhibitorKills', 'inhibitorTakedowns', 'inhibitorsLost', 'item0', 'item1', 'item2', 'item3', 'item4', 'item5', 'item6', 'itemsPurchased', 'killingSprees', 'kills', 'lane', 'largestCriticalStrike', 'largestKillingSpree', 'largestMultiKill', 'longestTimeSpentLiving', 'magicDamageDealt', 'magicDamageDealtToChampions', 'magicDamageTaken', 'neutralMinionsKilled', 'nexusKills', 'nexusLost', 'nexusTakedowns', 'objectivesStolen', 'objectivesStolenAs

In [20]:
# We can get a list of the champions in the first game:
for summoner in match_data[0]['info']['participants']:
    print(summoner['championName'])

TahmKench
Elise
Azir
Jhin
Yuumi
Garen
Kayn
Cassiopeia
Varus
Xerath


In [21]:
# We can see what team they are on, whether they won or lost, etc
# Note: team 100 is blue team, team 200 is red team

for summoner in match_data[0]['info']['participants']:
    print(summoner['teamId'], summoner['championName'], summoner['win'])

100 TahmKench False
100 Elise False
100 Azir False
100 Jhin False
100 Yuumi False
200 Garen True
200 Kayn True
200 Cassiopeia True
200 Varus True
200 Xerath True
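
Since the project is about team composition, it can be handy to collapse the ten participants into one record per team. The sketch below uses only the `teamId`, `championName` and `win` fields shown above. The function name `team_compositions` and the shape of its return value are my own choices for illustration.

```python
def team_compositions(match):
    """Group a match's champions by team, together with the team's result."""
    teams = {}
    for participant in match['info']['participants']:
        # Create the team entry the first time we see this teamId
        team = teams.setdefault(
            participant['teamId'],
            {'champions': [], 'win': participant['win']},
        )
        team['champions'].append(participant['championName'])
    return teams
```

Calling `team_compositions(match_data[0])` returns a dict keyed by `teamId` (100 for blue, 200 for red), each holding the list of champions and whether that team won.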


In [23]:
# Since we are interested in ranked games that last at least 15 minutes, we will create a new DataFrame with only such games, and only keep information relevant to our project.
# All of the match data will always still be available as the match_data list
# Note that this loop doesn't involve API calls and is therefore not rate limited - it should run relatively quickly. 

ranked_matches = []
for match in match_data:
    info = match['info']
    # Keep only ranked solo queue games (queueId 420) that lasted at least 15 minutes (900s)
    if info['gameDuration'] < 900 or info['queueId'] != 420:
        continue
    for participant in info['participants']:
        row_dict = {k: participant[k] for k in ('win', 'championName', 'teamId', 'summonerName')}
        row_dict['team'] = 'Blue' if row_dict['teamId']==100 else 'Red'
        row_dict['matchId'] = match['metadata']['matchId']
        # Note: despite the column name, this stores the queueId (always 420 here)
        row_dict['gameMode'] = info['queueId']
        ranked_matches.append(row_dict)

        
rankeddf = pd.DataFrame(ranked_matches)
rankeddf = rankeddf.drop(columns=['teamId'])
rankeddf = rankeddf.set_index(['matchId', 'team'])

In [47]:
# Let's save this as a csv which we will import in the next notebook.
rankeddf.to_csv('ranked_matches.csv')