# Simulations Episode Scraper Match Downloader

From Kaggle user robga: https://www.kaggle.com/robga/simulations-episode-scraper-match-downloader

This notebook downloads episodes using Kaggle's GetEpisodeReplay API and the [Meta Kaggle](https://www.kaggle.com/kaggle/meta-kaggle) dataset.

Meta Kaggle is refreshed daily, though a daily refresh sometimes fails. That's OK: goose keeps well for 24 hours.

Why download replays?
- Train your ML/RL model
- Inspect the performance of your own and others' agents
- Add to your ever-growing JSON collection

Only one scraping strategy is implemented: for each top-scoring submission, download all missing matches, then move on to the next submission.

Other scraping strategies are possible but are not implemented here. For example: download at most X matches per submission or per team per day, ignore certain teams, ignore matches where some scores are below X, or download only certain teams.
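As a rough illustration of one such alternative, the sketch below caps the number of episodes selected per submission. It is not part of this notebook: the function name `select_episodes` and the parameter `max_per_submission` are hypothetical, and it assumes a mapping shaped like the `sub_to_episodes` dictionary built later (submission ID to a list of episode IDs).

```python
def select_episodes(sub_to_episodes, seen, max_per_submission):
    """Pick at most `max_per_submission` not-yet-saved episode IDs per
    submission, newest (highest ID) first.

    Hypothetical helper: an episode shared by several submissions is
    only picked once, for the first submission that reaches it.
    """
    seen = set(seen)
    picked = {}
    for sub_id, episode_ids in sub_to_episodes.items():
        unseen = sorted((e for e in set(episode_ids) if e not in seen), reverse=True)
        chosen = unseen[:max_per_submission]
        picked[sub_id] = chosen
        seen.update(chosen)  # avoid scheduling the same episode twice
    return picked


plan = select_episodes({101: [5, 4, 3, 2], 102: [5, 4, 1]}, seen=[4], max_per_submission=2)
print(plan)  # {101: [5, 3], 102: [1]}
```

Passing the submissions in descending score order would keep the "best submissions first" behaviour of the implemented strategy while bounding the work done per submission.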

Please let me know of any bugs. It's new, and my goose may be cooked.

Todo:
- Add team IDs once Meta Kaggle adds them (a few days away)

In [1]:
import pandas as pd
import numpy as np
import os
import requests
import json
import datetime
import time
import glob
import collections

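# Running count of API calls made in this session.
# (A `global` statement has no effect at module level, so it is not needed here.)
num_api_calls_today = 0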

In [2]:
## Configure these to your needs. COMP must be one of:
# 'hungry-geese', 'rock-paper-scissors', 'santa-2020', 'halite', 'google-football'
COMP = 'hungry-geese'
MAX_CALLS_PER_DAY = 3400    # stop downloading after about this many API calls
LOWEST_SCORE_THRESH = 1000  # skip submissions whose latest score is below this

In [3]:
META = "metadata/"
MATCH_DIR = 'data/'
base_url = "https://www.kaggle.com/requests/EpisodeService/"
get_url = base_url + "GetEpisodeReplay"
BUFFER = 1
COMPETITIONS = {
    'hungry-geese': 25401,
    'rock-paper-scissors': 22838,
    'santa-2020': 24539,
    'halite': 18011,
    'google-football': 21723
}

In [4]:
# Load Episodes
episodes_df = pd.read_csv(META + "Episodes.csv")

# Load EpisodeAgents
epagents_df = pd.read_csv(META + "EpisodeAgents.csv")

print(f'Episodes.csv: {len(episodes_df)} rows before filtering.')
print(f'EpisodeAgents.csv: {len(epagents_df)} rows before filtering.')

episodes_df = episodes_df[episodes_df.CompetitionId == COMPETITIONS[COMP]] 
epagents_df = epagents_df[epagents_df.EpisodeId.isin(episodes_df.Id)]

print(f'Episodes.csv: {len(episodes_df)} rows after filtering for {COMP}.')
print(f'EpisodeAgents.csv: {len(epagents_df)} rows after filtering for {COMP}.')

Episodes.csv: 19335777 rows before filtering.
EpisodeAgents.csv: 42549950 rows before filtering.
Episodes.csv: 116112 rows after filtering for hungry-geese.
EpisodeAgents.csv: 464448 rows after filtering for hungry-geese.


In [5]:
# Prepare dataframes
episodes_df = episodes_df.set_index(['Id'])
episodes_df['CreateTime'] = pd.to_datetime(episodes_df['CreateTime'])
episodes_df['EndTime'] = pd.to_datetime(episodes_df['EndTime'])

# Reassign rather than fill in place: epagents_df is a filtered slice of the original frame
epagents_df = epagents_df.fillna(0)
epagents_df = epagents_df.sort_values(by=['Id'], ascending=False)

In [6]:
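# Get top scoring submissions: take the most recent EpisodeAgents row per
# submission, so UpdatedScore reflects each submission's latest score
max_df = (
    epagents_df.sort_values(by=['EpisodeId'], ascending=False)
    .groupby('SubmissionId')
    .head(1)
    .drop_duplicates()
    .reset_index(drop=True)
)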
max_df = max_df[max_df.UpdatedScore>=LOWEST_SCORE_THRESH]
max_df = pd.merge(left=episodes_df, right=max_df, left_on='Id', right_on='EpisodeId')
sub_to_score_top = pd.Series(max_df.UpdatedScore.values,index=max_df.SubmissionId).to_dict()
print(f'{len(sub_to_score_top)} submissions with score over {LOWEST_SCORE_THRESH}')

1195 submissions with score over 1000


In [7]:
# Get episodes for these submissions
sub_to_episodes = collections.defaultdict(list)
for key, value in sorted(sub_to_score_top.items(), key=lambda kv: kv[1], reverse=True):
    eps = sorted(epagents_df[epagents_df['SubmissionId'].isin([key])]['EpisodeId'].values,reverse=True)
    sub_to_episodes[key] = eps
candidates = len(set([item for sublist in sub_to_episodes.values() for item in sublist]))
print(f'{candidates} episodes for these {len(sub_to_score_top)} submissions')

40357 episodes for these 1195 submissions


In [8]:
all_files = []
for root, dirs, files in os.walk(MATCH_DIR, topdown=False):
    all_files.extend(files)
seen_episodes = [int(f.split('.')[0]) for f in all_files 
                      if '.' in f and f.split('.')[0].isdigit() and f.split('.')[1] == 'json']
remaining = np.setdiff1d([item for sublist in sub_to_episodes.values() for item in sublist],seen_episodes)
print(f'{len(remaining)} of these {candidates} episodes not yet saved')
print('Total of {} games in existing library'.format(len(seen_episodes)))

37056 of these 40357 episodes not yet saved
Total of 3301 games in existing library
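The filename filter in the cell above can also be written as a small helper, which makes it easier to see that the `<id>_info.json` companion files are deliberately not counted as replays. This is only a sketch: `episode_id_from_filename` is a hypothetical name, and it is slightly stricter than the original expression because it requires `.json` to be the final extension.

```python
import os


def episode_id_from_filename(name):
    """Return the episode ID for a replay file such as '123.json', or
    None for any other file (for example '123_info.json')."""
    stem, ext = os.path.splitext(name)
    if ext == '.json' and stem.isdigit():
        return int(stem)
    return None


names = ['19439530.json', '19439530_info.json', 'notes.txt']
ids = [episode_id_from_filename(n) for n in names]
print(ids)  # [19439530, None, None]
```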


In [9]:
def create_info_json(epid):
    
    # Convert the episode's nanosecond timestamps to whole epoch seconds
    create_seconds = int((episodes_df[episodes_df.index == epid]['CreateTime'].values[0]).item()/1e9)
    end_seconds = int((episodes_df[episodes_df.index == epid]['EndTime'].values[0]).item()/1e9)

    agents = []
    for index, row in epagents_df[epagents_df['EpisodeId'] == epid].sort_values(by=['Index']).iterrows():
        agent = {
            "id": int(row["Id"]),
            "state": int(row["State"]),
            "submissionId": int(row['SubmissionId']),
            "reward": int(row['Reward']),
            "index": int(row['Index']),
            "initialScore": float(row['InitialScore']),
            "initialConfidence": float(row['InitialConfidence']),
            "updatedScore": float(row['UpdatedScore']),
            "updatedConfidence": float(row['UpdatedConfidence']),
            "teamId": int(99999)  # placeholder until Meta Kaggle provides team IDs
        }
        agents.append(agent)

    info = {
        "id": int(epid),
        "competitionId": int(COMPETITIONS[COMP]),
        "createTime": {
            "seconds": int(create_seconds)
        },
        "endTime": {
            "seconds": int(end_seconds)
        },
        "agents": agents
    }

    return info

def saveEpisode(epid):
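    # request the replay
    resp = requests.post(get_url, json={"EpisodeId": int(epid)})
    resp.raise_for_status()  # stop on an HTTP error before any file is created
    replay = resp.json()['result']['replay']

    # save replay
    with open(MATCH_DIR + '{}.json'.format(epid), 'w') as f:
        f.write(replay)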

    # save match info
    info = create_info_json(epid)
    with open(MATCH_DIR +  '{}_info.json'.format(epid), 'w') as f:
        json.dump(info, f)
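
The download loop in the next cell throttles itself by comparing a call counter (seeded with `BUFFER`) against the seconds elapsed since the start, sleeping whenever the counter runs ahead. The sketch below isolates that idea so it can be read and tested on its own. The names `seconds_to_wait` and `paced` are hypothetical, and the injectable `clock` and `sleep` arguments are an addition made purely for testability.

```python
import time


def seconds_to_wait(calls_made, elapsed_seconds, buffer=1):
    """Seconds to sleep so that call number n never finishes earlier
    than (buffer + n) seconds after the start."""
    return max(0.0, (buffer + calls_made) - elapsed_seconds)


def paced(items, action, buffer=1, clock=time.monotonic, sleep=time.sleep):
    """Run action(item) for each item, averaging at most one per second."""
    start = clock()
    for n, item in enumerate(items, start=1):
        action(item)
        wait = seconds_to_wait(n, clock() - start, buffer)
        if wait > 0:
            sleep(wait)


print(seconds_to_wait(3, 1.5))  # 2.5
print(seconds_to_wait(3, 10))   # 0.0
```

Because the budget is measured from the start of the run rather than from the previous call, a slow download "earns back" time and the following calls are not delayed further.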

In [None]:
r = BUFFER  # pacing budget in seconds: call number n may not finish before BUFFER + n seconds

start_time = datetime.datetime.now()
se = 0  # episodes saved in this run
for key, value in sorted(sub_to_score_top.items(), key=lambda kv: kv[1], reverse=True):
    if num_api_calls_today < MAX_CALLS_PER_DAY:
        print('')
        remaining = sorted(np.setdiff1d(sub_to_episodes[key], seen_episodes), reverse=True)
        print(f'submission={key}, LB={value:.0f}, matches={len(set(sub_to_episodes[key]))}, still to save={len(remaining)}')

        for epid in remaining:
            if epid not in seen_episodes and num_api_calls_today < MAX_CALLS_PER_DAY:
                saveEpisode(epid)
                r += 1
                try:
                    # getsize raises OSError if the replay file was not written
                    os.path.getsize(MATCH_DIR + '{}.json'.format(epid))
                    print(f'{num_api_calls_today}: saved episode #{epid}')
                    seen_episodes.append(epid)
                    num_api_calls_today += 1
                    se += 1
                except OSError:
                    print('  file {}.json did not seem to save'.format(epid))
                # Throttle to roughly one request per second
                elapsed = (datetime.datetime.now() - start_time).total_seconds()
                if r > elapsed:
                    time.sleep(r - elapsed)
            if num_api_calls_today >= min(3600, MAX_CALLS_PER_DAY):
                break
print('')
print(f'Episodes saved: {se}')


submission=19722457, LB=1254, matches=21, still to save=5
0: saved episode #19439530
1: saved episode #19437799
2: saved episode #19435628
3: saved episode #19433412
4: saved episode #19431254

submission=19711300, LB=1249, matches=25, still to save=4
5: saved episode #19439896
6: saved episode #19437293
7: saved episode #19434473
8: saved episode #19431485

submission=19719024, LB=1219, matches=23, still to save=5
9: saved episode #19439834
10: saved episode #19436450
11: saved episode #19435031
12: saved episode #19432326
13: saved episode #19430265

submission=19721626, LB=1214, matches=21, still to save=4
14: saved episode #19439331
15: saved episode #19437018
16: saved episode #19433217
17: saved episode #19431031

submission=19728229, LB=1214, matches=18, still to save=18
18: saved episode #19440001
19: saved episode #19438432
20: saved episode #19437048
21: saved episode #19435500
22: saved episode #19434440
23: saved episode #19434405
24: saved episode #19434074
25: saved epis

225: saved episode #19428731
226: saved episode #19428699
227: saved episode #19428666
228: saved episode #19428588

submission=19471900, LB=1126, matches=100, still to save=3
229: saved episode #19432550
230: saved episode #19431198
231: saved episode #19430583

submission=19469841, LB=1126, matches=103, still to save=3
232: saved episode #19438461
233: saved episode #19435124
234: saved episode #19431546

submission=19514384, LB=1125, matches=74, still to save=3
235: saved episode #19437590
236: saved episode #19433680
237: saved episode #19430125

submission=19719678, LB=1124, matches=24, still to save=5
238: saved episode #19439793
239: saved episode #19437056
240: saved episode #19436280
241: saved episode #19433383
242: saved episode #19433254

submission=19705803, LB=1123, matches=29, still to save=3
243: saved episode #19439761
244: saved episode #19437823
245: saved episode #19434755

submission=19476680, LB=1121, matches=82, still to save=3
246: saved episode #19438018
247: s

451: saved episode #19429868
452: saved episode #19429835
453: saved episode #19429801
454: saved episode #19429769
455: saved episode #19429737
456: saved episode #19429700
457: saved episode #19429662
458: saved episode #19429627
459: saved episode #19429587
460: saved episode #19429505

submission=19701978, LB=1102, matches=34, still to save=1
461: saved episode #19440864

submission=19699440, LB=1102, matches=33, still to save=27
462: saved episode #19439137
463: saved episode #19436283
464: saved episode #19433116
465: saved episode #19428219
466: saved episode #19426682
467: saved episode #19424060
468: saved episode #19422709
469: saved episode #19419876
470: saved episode #19417296
471: saved episode #19414993
472: saved episode #19412998
473: saved episode #19410946
474: saved episode #19410424
475: saved episode #19409271
476: saved episode #19408228
477: saved episode #19407227
478: saved episode #19407171
479: saved episode #19407036
480: saved episode #19407016
481: saved 