## Build Training Set

Now that we know what data we'd like to get from the API, let's create a training set. To get started, we'll aim for ~10,000 games.

In [1]:
import os
import json
import time
import datetime
import requests

In [2]:
if 'STEAM_API_KEY' not in os.environ:
  print("No API Key :(")
else:
  print("Found API Key.")
  STEAM_API_KEY = os.environ['STEAM_API_KEY']

Found API Key.


Below I've modified the `__request` function to handle `429` (Too Many Requests) errors. It looks like we can't hit the API more than once every 5 seconds, and if we're rate limited, we have to wait ~15 seconds before retrying.

None of this is based on any documentation, so it could change at a moment's notice.
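
The function below retries only once after a `429`. As a rough sketch of how the same idea could be extended to a bounded number of retries, here is a small helper. The name `request_with_backoff`, its parameters, and the `fetch` callable (which returns a `(status_code, body)` pair) are all hypothetical and not part of this notebook or the Steam API. The 15 second wait is the same observed value as above, not a documented one. The example uses canned responses and a stand-in for `time.sleep`, so it runs instantly and never touches the network.

```python
import time


def request_with_backoff(fetch, max_retries=3, wait_seconds=15, sleep=time.sleep):
    """Call fetch() until it stops returning 429 or the retries run out.

    fetch is a hypothetical zero-argument callable returning (status_code, body).
    """
    for attempt in range(max_retries + 1):
        status_code, body = fetch()
        if status_code != 429:
            return status_code, body
        if attempt < max_retries:
            print("Backing off, attempt", attempt + 1)
            sleep(wait_seconds)
    # Still rate limited after all retries; the caller decides what to do next.
    return status_code, body


# Canned responses standing in for the API: two rate-limit replies, then success.
responses = iter([(429, None), (429, None), (200, {"result": {"matches": []}})])
waits = []  # Records the requested waits instead of actually sleeping.
status, body = request_with_backoff(lambda: next(responses), sleep=waits.append)
print(status, waits)
```

Passing the sleep function in as a parameter is only there to make the helper easy to try out without waiting; in the notebook it would default to `time.sleep`.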

In [None]:
base_url = 'https://api.steampowered.com'

def __request(method, path, **kwargs):
  url = base_url + path
  kwargs.setdefault('params', dict()).update(key=STEAM_API_KEY)
  response = requests.request(method, url, **kwargs)

  if response.status_code == 429:
    # Rate limited: wait, then retry once.
    print("Backing off.")
    print(response.headers)
    time.sleep(15)
    response = requests.request(method, url, **kwargs)
  try:
    return response.json()
  except ValueError:
    # The body was not valid JSON (for example, an HTML error page).
    print("Error converting to json.")
    print(response)
    return None


def get_match_history_by_seq_num(seq_num, num_matches, **params):
  path = '/IDOTA2Match_570/GetMatchHistoryBySequenceNum/V001'
  params.update(start_at_match_seq_num=seq_num)
  params.update(matches_requested=num_matches)  
  return __request('get', path, params=params)

In [4]:
# The starting sequence number.
current_seq_num = 5126114401
# The maximum number of matches the API will return at once.
num_matches_per_request = 100
# Total number of matches we'd like to save.
num_total_matches = 10000
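
With these settings we can estimate a lower bound on how long the collection will take. The sketch below is just arithmetic on the values above plus the 5 second pause between requests; the name `seconds_between_requests` is introduced here for illustration. The real run will take longer, since matches without 10 human players are skipped and every back-off adds another ~15 seconds.

```python
import math

num_total_matches = 10000
num_matches_per_request = 100
seconds_between_requests = 5  # Observed limit, not documented.

# Fewest requests possible if every returned match were usable.
min_requests = math.ceil(num_total_matches / num_matches_per_request)
min_seconds = min_requests * seconds_between_requests

print(min_requests, "requests, at least", round(min_seconds / 60, 1), "minutes")
```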

In [None]:
training_set = []

while len(training_set) < num_total_matches:
    # Don't hit the API too frequently.
    time.sleep(5)
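    response = get_match_history_by_seq_num(current_seq_num, num_matches_per_request)

    if response is None:
        # __request returns None when the response body was not valid JSON.
        print("Problem. No usable response for sequence number:", current_seq_num)
        break

    matches = response['result']['matches']

    if len(matches) != num_matches_per_request:
        print("Problem. Expected", num_matches_per_request, "matches. Actual:", len(matches))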
        print(response)
        break
    
    for match in matches:
        if match['human_players'] != 10:
            # We only want "real" games of Dota so we're ignoring
            # games without 10 human players.
            continue
        try:
            players = match['players']
            hero0 = players[0]['hero_id']
            hero1 = players[1]['hero_id']
            hero2 = players[2]['hero_id']
            hero3 = players[3]['hero_id']
            hero4 = players[4]['hero_id']
            hero5 = players[5]['hero_id']
            hero6 = players[6]['hero_id']
            hero7 = players[7]['hero_id']
            hero8 = players[8]['hero_id']
            hero9 = players[9]['hero_id']
            radiant_win = match['radiant_win']
            training_set.append((hero0, hero1, hero2, hero3, hero4,
                                hero5, hero6, hero7, hero8, hero9,
                                radiant_win))
        except (KeyError, IndexError):
            # A field or player slot we rely on is missing from this match.
            print("Error:", match)
            break

        
    current_seq_num -= num_matches_per_request
        
print("Complete. Training set size:", len(training_set))

Backing off.
{'Content-Type': 'text/html; charset=UTF-8', 'Content-Length': '108', 'Expires': 'Sat, 07 Aug 2021 23:03:46 GMT', 'Cache-Control': 'max-age=0, no-cache, no-store', 'Pragma': 'no-cache', 'Date': 'Sat, 07 Aug 2021 23:03:46 GMT', 'Connection': 'keep-alive'}
Backing off.
{'Content-Type': 'text/html; charset=UTF-8', 'Content-Length': '108', 'Expires': 'Sat, 07 Aug 2021 23:05:49 GMT', 'Cache-Control': 'max-age=0, no-cache, no-store', 'Pragma': 'no-cache', 'Date': 'Sat, 07 Aug 2021 23:05:49 GMT', 'Connection': 'keep-alive'}
Backing off.
{'Content-Type': 'text/html; charset=UTF-8', 'Content-Length': '108', 'Expires': 'Sat, 07 Aug 2021 23:10:50 GMT', 'Cache-Control': 'max-age=0, no-cache, no-store', 'Pragma': 'no-cache', 'Date': 'Sat, 07 Aug 2021 23:10:50 GMT', 'Connection': 'keep-alive'}
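
The per-match extraction in the loop above can also be written as a small helper, which makes it easier to try out on a single match without calling the API. This is only a sketch: `match_to_row` is a hypothetical name, and `example_match` is made-up data that mimics the three fields the loop reads (`human_players`, `players`, and `radiant_win`), not a real API response.

```python
def match_to_row(match, expected_players=10):
    """Return (hero0, ..., hero9, radiant_win), or None if the match is unusable."""
    players = match.get('players', [])
    if match.get('human_players') != expected_players:
        return None
    if len(players) != expected_players:
        return None
    hero_ids = tuple(player['hero_id'] for player in players)
    return hero_ids + (match['radiant_win'],)


# Made-up match with placeholder hero ids 1 through 10.
example_match = {
    'human_players': 10,
    'radiant_win': True,
    'players': [{'hero_id': hero_id} for hero_id in range(1, 11)],
}

row = match_to_row(example_match)
print(row)
```

Inside the loop, a row of `None` would simply be skipped, and anything else appended to `training_set`.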


Now we just need to save our training set so we can feed it to a machine learning classifier later.

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(training_set, 
             columns=['hero0', 'hero1', 'hero2', 'hero3', 'hero4',
                      'hero5', 'hero6', 'hero7', 'hero8', 'hero9',
                      'radiant_win'])
df.head()

In [None]:
df.to_csv("training_set_small.csv", index=False)

Great, now we can try to train a classifier on this dataset.