# Simulations Episode Scraper Match Downloader

From Kaggle user robga: https://www.kaggle.com/robga/simulations-episode-scraper-match-downloader

This notebook downloads episodes using Kaggle's GetEpisodeReplay API and the [Meta Kaggle](https://www.kaggle.com/kaggle/meta-kaggle) dataset.

Meta Kaggle is refreshed daily, and sometimes fails a daily refresh. That's OK, Goose keeps well for 24hr.

Why download replays?
- Train your ML/RL model
- Inspect the performance of your own and others' agents
- Add to your ever-growing JSON collection

Only one scraping strategy is implemented: for each top-scoring submission, download all missing matches, then move on to the next submission.

Other scraping strategies are possible but are not implemented here. For example: download at most X matches per submission or per team per day, ignore certain teams, ignore matches where some scores are below X, or download only certain teams.
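As a rough sketch of one such alternative (at most X matches per submission), something like the following could work. It is not part of this notebook: the helper name `cap_episodes_per_submission` and the toy data are made up for illustration, and only the `EpisodeId` and `SubmissionId` column names are taken from EpisodeAgents.csv.

```python
import pandas as pd

def cap_episodes_per_submission(epagents, max_per_submission):
    """Keep at most `max_per_submission` of the newest episodes per submission.

    Hypothetical helper; assumes a higher EpisodeId means a more recent episode.
    """
    newest_first = epagents.sort_values("EpisodeId", ascending=False)
    capped = newest_first.groupby("SubmissionId").head(max_per_submission)
    return sorted(capped["EpisodeId"].unique().tolist())

# Toy stand-in for EpisodeAgents.csv (values are invented).
toy = pd.DataFrame({
    "EpisodeId":    [1, 2, 3, 4, 4, 5],
    "SubmissionId": [10, 10, 10, 10, 20, 20],
})
selected = cap_episodes_per_submission(toy, max_per_submission=2)
print(selected)
```

An episode is kept if it falls within the cap for any of its submissions, so the result can hold more than X episodes for a submission that often meets heavily sampled opponents.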

Please let me know of any bugs. It's new, and my goose may be cooked.

Todo:
- Add team IDs once Meta Kaggle adds them (a few days away)

In [1]:
import pandas as pd
import numpy as np
import os
import requests
import json
import datetime
import time
import glob
import collections

# Running count of API calls made in this session
num_api_calls_today = 0

In [2]:
## Configure these to your needs. Choose one of:
# 'hungry-geese', 'rock-paper-scissors', 'santa-2020', 'halite', 'google-football'
COMP = 'hungry-geese'
MAX_CALLS_PER_DAY = 3500
LOWEST_SCORE_THRESH = 1000

In [3]:
META = "episode_scraping/metadata/"
MATCH_DIR = 'episode_scraping/episodes/'
INFO_DIR = 'episode_scraping/infos/'
base_url = "https://www.kaggle.com/requests/EpisodeService/"
get_url = base_url + "GetEpisodeReplay"
BUFFER = 1
COMPETITIONS = {
    'hungry-geese': 25401,
    'rock-paper-scissors': 22838,
    'santa-2020': 24539,
    'halite': 18011,
    'google-football': 21723
}

In [4]:
# Load Episodes
episodes_df = pd.read_csv(META + "Episodes.csv")

# Load EpisodeAgents
epagents_df = pd.read_csv(META + "EpisodeAgents.csv")

print(f'Episodes.csv: {len(episodes_df)} rows before filtering.')
print(f'EpisodeAgents.csv: {len(epagents_df)} rows before filtering.')

episodes_df = episodes_df[episodes_df.CompetitionId == COMPETITIONS[COMP]] 
epagents_df = epagents_df[epagents_df.EpisodeId.isin(episodes_df.Id)]

print(f'Episodes.csv: {len(episodes_df)} rows after filtering for {COMP}.')
print(f'EpisodeAgents.csv: {len(epagents_df)} rows after filtering for {COMP}.')

# Prepare dataframes
episodes_df = episodes_df.set_index(['Id'])
episodes_df['CreateTime'] = pd.to_datetime(episodes_df['CreateTime'])
episodes_df['EndTime'] = pd.to_datetime(episodes_df['EndTime'])

epagents_df = epagents_df.fillna(0)  # avoids in-place modification of a filtered slice
epagents_df = epagents_df.sort_values(by=['Id'], ascending=False)

latest_scores_df = epagents_df.loc[epagents_df.groupby('SubmissionId').EpisodeId.idxmax(),:].sort_values(by=['UpdatedScore'])
latest_scores_df['LatestScore'] = latest_scores_df.UpdatedScore
latest_scores_df = latest_scores_df[['SubmissionId', 'LatestScore']]
epagents_df = epagents_df.merge(latest_scores_df, left_on='SubmissionId', right_on='SubmissionId', how='outer').sort_values(by=['LatestScore'])

Episodes.csv: 19817699 rows before filtering.
EpisodeAgents.csv: 44176230 rows before filtering.
Episodes.csv: 446741 rows after filtering for hungry-geese.
EpisodeAgents.csv: 1786964 rows after filtering for hungry-geese.
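To make the "latest score" step easier to follow, here is a small self-contained sketch of the same idea on toy data: for each submission, take the `UpdatedScore` from its highest `EpisodeId`, then attach that value to every row of the submission. The numbers are invented and the merge is simplified to a left join, so treat it as an illustration rather than the exact cell above.

```python
import pandas as pd

# Toy stand-in for EpisodeAgents.csv (values are invented).
toy = pd.DataFrame({
    "EpisodeId":    [100, 101, 102, 103],
    "SubmissionId": [1, 1, 2, 2],
    "UpdatedScore": [950.0, 1010.0, 1200.0, 1150.0],
})

# Row label of the most recent episode for each submission.
latest_idx = toy.groupby("SubmissionId")["EpisodeId"].idxmax()

# The score after that episode is the submission's latest score.
latest = toy.loc[latest_idx, ["SubmissionId", "UpdatedScore"]]
latest = latest.rename(columns={"UpdatedScore": "LatestScore"})

# Attach the latest score to every row of the same submission.
merged = toy.merge(latest, on="SubmissionId", how="left")
print(merged)
```

Note that submission 2 ends up with 1150.0, not its historical best of 1200.0: the latest score is used, not the maximum.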


In [8]:
# Get episodes with all agent scores > a given threshold
episode_min_scores = epagents_df.groupby('EpisodeId').LatestScore.min()
ep_to_score = episode_min_scores[episode_min_scores >= LOWEST_SCORE_THRESH].to_dict()
print(f'{len(ep_to_score)} episodes with all agent scores over {LOWEST_SCORE_THRESH}')

all_files = []
for root, dirs, files in os.walk(MATCH_DIR, topdown=False):
    all_files.extend(files)
seen_episodes = [int(f.split('.')[0]) for f in all_files 
                 if '.' in f and f.split('.')[0].isdigit() and f.split('.')[1] == 'json']
remaining = np.setdiff1d([ep for ep in ep_to_score.keys()], seen_episodes)
print(f'{len(remaining)} of these {len(ep_to_score)} episodes not yet saved')
print('Total of {} games in existing library'.format(len(seen_episodes)))

48655 episodes with all agent scores over 1000
455 of these 48655 episodes not yet saved
Total of 94497 games in existing library
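The list of already-downloaded episodes is rebuilt from file names on disk. A stricter variant of that parsing could look like the sketch below. The helper name `episode_id_from_filename` and the example file names are hypothetical. Unlike the cell above, it rejects names such as `123.json.bak`, and it collects the IDs in a set so that membership checks stay fast for large libraries.

```python
from pathlib import Path

def episode_id_from_filename(name):
    """Return the episode ID for names like '19918452.json', else None."""
    path = Path(name)
    if path.suffix == ".json" and path.stem.isdecimal():
        return int(path.stem)
    return None

# Made-up directory listing with a few names that should be skipped.
names = ["19918452.json", "19918452_info.json", "notes.txt", ".DS_Store", "123.json.bak"]
seen = {episode_id_from_filename(n) for n in names} - {None}
print(seen)
```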


In [6]:
def create_info_json(epid):
    
    # Timestamps are stored as nanoseconds; convert to whole seconds
    create_seconds = int((episodes_df[episodes_df.index == epid]['CreateTime'].values[0]).item()/1e9)
    end_seconds = int((episodes_df[episodes_df.index == epid]['EndTime'].values[0]).item()/1e9)

    agents = []
    for index, row in epagents_df[epagents_df['EpisodeId'] == epid].sort_values(by=['Index']).iterrows():
        agent = {
            "id": int(row["Id"]),
            "state": int(row["State"]),
            "submissionId": int(row['SubmissionId']),
            "reward": int(row['Reward']),
            "index": int(row['Index']),
            "initialScore": float(row['InitialScore']),
            "initialConfidence": float(row['InitialConfidence']),
            "updatedScore": float(row['UpdatedScore']),
            "updatedConfidence": float(row['UpdatedConfidence']),
            "teamId": int(99999)
        }
        agents.append(agent)

    info = {
        "id": int(epid),
        "competitionId": int(COMPETITIONS[COMP]),
        "createTime": {
            "seconds": int(create_seconds)
        },
        "endTime": {
            "seconds": int(end_seconds)
        },
        "agents": agents
    }

    return info

def saveEpisode(epid):
    # request the replay from Kaggle
    response = requests.post(get_url, json = {"EpisodeId": int(epid)})
        
    # save replay (returned as a JSON string under result -> replay)
    replay = response.json()['result']['replay']
    with open(MATCH_DIR + '{}.json'.format(epid), 'w') as f:
        f.write(replay)

    # save match info
    info = create_info_json(epid)
    with open(INFO_DIR +  '{}_info.json'.format(epid), 'w') as f:
        json.dump(info, f)
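`saveEpisode` assumes the response always contains `result` and `replay`. If you want to guard against odd responses before writing a file, a defensive check could look like the sketch below. The helper `extract_replay` is hypothetical, and the shape of the failing payloads is invented for the example. Only the `result` / `replay` keys come from the code above.

```python
import json

def extract_replay(payload):
    """Return the replay JSON string from a decoded response, or None.

    Hypothetical helper: returns None when the expected keys are missing
    or when the replay text is not valid JSON.
    """
    result = payload.get("result") if isinstance(payload, dict) else None
    replay = result.get("replay") if isinstance(result, dict) else None
    if not isinstance(replay, str):
        return None
    try:
        json.loads(replay)  # parse only to validate; the raw string is returned
    except json.JSONDecodeError:
        return None
    return replay

ok = extract_replay({"result": {"replay": '{"steps": []}'}})
missing = extract_replay({"error": "not found"})          # invented payload
broken = extract_replay({"result": {"replay": "{not json"}})
print(ok, missing, broken)
```

With a check like this, a bad response can be skipped without leaving a half-written file in the episode directory.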

In [7]:
r = BUFFER  # elapsed-seconds mark before which the next request must not start

start_time = datetime.datetime.now()
se = 0  # episodes saved in this run
call_limit = min(3600, MAX_CALLS_PER_DAY)
for epid, value in sorted(ep_to_score.items(), key=lambda kv: kv[1], reverse=True):
    if num_api_calls_today >= call_limit:
        break
    if epid in seen_episodes:
        continue
    try:
        saveEpisode(epid)
    except requests.exceptions.ConnectionError:
        pass
    r += 1
    # Count the episode only if its replay file actually exists on disk
    if os.path.isfile(MATCH_DIR + '{}.json'.format(epid)):
        print(f'{num_api_calls_today+1}: saved episode #{epid}')
        seen_episodes.append(epid)
        num_api_calls_today += 1
        se += 1
    else:
        print('  file {}.json did not seem to save'.format(epid))
    # Throttle to at most one request per second on average
    elapsed = (datetime.datetime.now() - start_time).total_seconds()
    if r > elapsed:
        time.sleep(r - elapsed)
print('')
print(f'Episodes saved: {se}')

1: saved episode #19918452
2: saved episode #19905522
3: saved episode #19718808
4: saved episode #19797610
5: saved episode #19805330
6: saved episode #19815154
7: saved episode #19820090
8: saved episode #19828974
9: saved episode #19840439
10: saved episode #19848599
11: saved episode #19848852
12: saved episode #19866929
13: saved episode #19885231
14: saved episode #19813556
15: saved episode #19827834
16: saved episode #19833145
17: saved episode #19878945
18: saved episode #19899048
19: saved episode #19906025
20: saved episode #19919693
21: saved episode #19433838
22: saved episode #19535117
23: saved episode #19555250
24: saved episode #19555372
25: saved episode #19651679
26: saved episode #19681472
27: saved episode #19691612
28: saved episode #19692046
29: saved episode #19700649
30: saved episode #19749396
31: saved episode #19777590
32: saved episode #19778876
33: saved episode #19798420
34: saved episode #19817135
35: saved episode #19858171
36: saved episode #19859755
...

289: saved episode #19809482
290: saved episode #19827348
291: saved episode #19846355
292: saved episode #19885661
293: saved episode #19893905
294: saved episode #19895566
295: saved episode #19918743
296: saved episode #19797611
297: saved episode #19816757
298: saved episode #19825006
299: saved episode #19825384
300: saved episode #19825778
301: saved episode #19838251
302: saved episode #19839425
303: saved episode #19847283
304: saved episode #19864251
305: saved episode #19882145
306: saved episode #19887673
307: saved episode #19890250
308: saved episode #19907003
309: saved episode #19912914
310: saved episode #19917061
311: saved episode #19680304
312: saved episode #19698479
313: saved episode #19804385
314: saved episode #19837609
315: saved episode #19842393
316: saved episode #19842432
317: saved episode #19848383
318: saved episode #19866496
319: saved episode #19866861
320: saved episode #19878396
321: saved episode #19894976
322: saved episode #19896891
...

572: saved episode #19881835
573: saved episode #19889884
574: saved episode #19902385
575: saved episode #19915997
576: saved episode #19916052
577: saved episode #19917249
578: saved episode #19523568
579: saved episode #19796927
580: saved episode #19808167
581: saved episode #19811750
582: saved episode #19812516
583: saved episode #19813969
584: saved episode #19821336
585: saved episode #19829240
586: saved episode #19834333
587: saved episode #19849211
588: saved episode #19856681
589: saved episode #19862258
590: saved episode #19867653
591: saved episode #19880218
592: saved episode #19893117
593: saved episode #19893817
594: saved episode #19895755
595: saved episode #19896405
596: saved episode #19464205
597: saved episode #19557631
598: saved episode #19623291
599: saved episode #19723625
600: saved episode #19727954
601: saved episode #19740088
602: saved episode #19755245
603: saved episode #19783011
604: saved episode #19783394
605: saved episode #19802363
...

855: saved episode #19920428
856: saved episode #19800515
857: saved episode #19801262
858: saved episode #19811497
859: saved episode #19821519
860: saved episode #19862261
861: saved episode #19876247
862: saved episode #19884839
863: saved episode #19912488
864: saved episode #19922623
865: saved episode #19583511
866: saved episode #19912666
867: saved episode #19919359
868: saved episode #19731950
869: saved episode #19754603
870: saved episode #19798838
871: saved episode #19809457
872: saved episode #19816544
873: saved episode #19830579
874: saved episode #19836690
875: saved episode #19839387
876: saved episode #19840452
877: saved episode #19859948
878: saved episode #19891764
879: saved episode #19899654
880: saved episode #19906464
881: saved episode #19732287
882: saved episode #19768775
883: saved episode #19796740
884: saved episode #19800697
885: saved episode #19806632
886: saved episode #19806803
887: saved episode #19854397
888: saved episode #19901813
...

1133: saved episode #19815788
1134: saved episode #19823258
1135: saved episode #19835840
1136: saved episode #19854225
1137: saved episode #19870345
1138: saved episode #19879944
1139: saved episode #19893440
1140: saved episode #19906327
1141: saved episode #19570293
1142: saved episode #19619465
1143: saved episode #19730151
1144: saved episode #19732433
1145: saved episode #19766794
1146: saved episode #19769221
1147: saved episode #19804815
1148: saved episode #19817224
1149: saved episode #19866053
1150: saved episode #19917627
1151: saved episode #19408782
1152: saved episode #19788036
1153: saved episode #19801455
1154: saved episode #19807990
1155: saved episode #19808009
1156: saved episode #19817695
1157: saved episode #19822387
1158: saved episode #19829908
1159: saved episode #19835318
1160: saved episode #19839818
1161: saved episode #19850970
1162: saved episode #19851158
1163: saved episode #19852342
1164: saved episode #19861457
1165: saved episode #19861734
...

1408: saved episode #19826242
1409: saved episode #19833672
1410: saved episode #19844261
1411: saved episode #19850321
1412: saved episode #19854065
1413: saved episode #19857974
1414: saved episode #19858192
1415: saved episode #19869390
1416: saved episode #19872456
1417: saved episode #19873481
1418: saved episode #19888607
1419: saved episode #19772114
1420: saved episode #19839485
1421: saved episode #19850751
1422: saved episode #19916321
1423: saved episode #19919617
1424: saved episode #19812947
1425: saved episode #19838961
1426: saved episode #19842896
1427: saved episode #19890778
1428: saved episode #19900533
1429: saved episode #19906074
1430: saved episode #19916022
1431: saved episode #19596762
1432: saved episode #19687647
1433: saved episode #19690406
1434: saved episode #19705211
1435: saved episode #19719788
1436: saved episode #19724566
1437: saved episode #19749600
1438: saved episode #19764350
1439: saved episode #19777887
1440: saved episode #19782203
...

1682: saved episode #19805126
1683: saved episode #19811899
1684: saved episode #19820404
1685: saved episode #19825856
1686: saved episode #19837840
1687: saved episode #19844653
1688: saved episode #19850073
1689: saved episode #19867072
1690: saved episode #19876480
1691: saved episode #19899102
1692: saved episode #19917057
1693: saved episode #19592148
1694: saved episode #19714676
1695: saved episode #19733287
1696: saved episode #19767221
1697: saved episode #19802569
1698: saved episode #19815795
1699: saved episode #19824154
1700: saved episode #19830767
1701: saved episode #19834400
1702: saved episode #19838153
1703: saved episode #19856251
1704: saved episode #19860533
1705: saved episode #19861217
1706: saved episode #19871223
1707: saved episode #19873456
1708: saved episode #19876380
1709: saved episode #19882663
1710: saved episode #19883899
1711: saved episode #19888510
1712: saved episode #19902482
1713: saved episode #19905514
1714: saved episode #19918228
...

1956: saved episode #19920661
1957: saved episode #19641938
1958: saved episode #19739059
1959: saved episode #19745589
1960: saved episode #19810311
1961: saved episode #19842702
1962: saved episode #19843100
1963: saved episode #19853850
1964: saved episode #19892798
1965: saved episode #19915817
1966: saved episode #19763237
1967: saved episode #19778186
1968: saved episode #19790819
1969: saved episode #19791276
1970: saved episode #19794094
1971: saved episode #19801309
1972: saved episode #19811685
1973: saved episode #19835022
1974: saved episode #19849166
1975: saved episode #19855935
1976: saved episode #19870041
1977: saved episode #19874648
1978: saved episode #19901179
1979: saved episode #19903564
1980: saved episode #19909075
1981: saved episode #19909888
1982: saved episode #19914082
1983: saved episode #19917067
1984: saved episode #19555011
1985: saved episode #19556797
1986: saved episode #19568178
1987: saved episode #19581192
1988: saved episode #19584921
...

2230: saved episode #19916507
2231: saved episode #19595588
2232: saved episode #19606549
2233: saved episode #19715454
2234: saved episode #19739387
2235: saved episode #19773000
2236: saved episode #19786777
2237: saved episode #19800039
2238: saved episode #19810041
2239: saved episode #19810239
2240: saved episode #19818790
2241: saved episode #19822309
2242: saved episode #19822897
2243: saved episode #19830208
2244: saved episode #19835007
2245: saved episode #19844939
2246: saved episode #19859862
2247: saved episode #19865739
2248: saved episode #19876105
2249: saved episode #19877368
2250: saved episode #19877847
2251: saved episode #19882196
2252: saved episode #19896325
2253: saved episode #19906980
2254: saved episode #19915116
2255: saved episode #19917953
2256: saved episode #19923458
2257: saved episode #19591662
2258: saved episode #19599843
2259: saved episode #19694141
2260: saved episode #19742064
2261: saved episode #19785059
2262: saved episode #19786994
...

2504: saved episode #19898247
2505: saved episode #19633585
2506: saved episode #19716536
2507: saved episode #19718607
2508: saved episode #19749115
2509: saved episode #19759128
2510: saved episode #19766090
2511: saved episode #19771129
2512: saved episode #19773250
2513: saved episode #19779692
2514: saved episode #19784206
2515: saved episode #19790953
2516: saved episode #19804224
2517: saved episode #19828633
2518: saved episode #19833530
2519: saved episode #19836747
2520: saved episode #19865738
2521: saved episode #19866606
2522: saved episode #19895471
2523: saved episode #19906291
2524: saved episode #19914309
2525: saved episode #19918683
2526: saved episode #19922935
2527: saved episode #19460160
2528: saved episode #19549361
2529: saved episode #19806933
2530: saved episode #19814916
2531: saved episode #19864838
2532: saved episode #19872930
2533: saved episode #19898034
2534: saved episode #19910900
2535: saved episode #19911366
2536: saved episode #19806491
...

2778: saved episode #19853579
2779: saved episode #19858390
2780: saved episode #19870080
2781: saved episode #19878471
2782: saved episode #19880911
2783: saved episode #19888513
2784: saved episode #19894541
2785: saved episode #19903092
2786: saved episode #19908753
2787: saved episode #19917074
2788: saved episode #19921829
2789: saved episode #19528999
2790: saved episode #19636164
2791: saved episode #19756084
2792: saved episode #19763578
2793: saved episode #19812205
2794: saved episode #19818797
2795: saved episode #19835346
2796: saved episode #19836987
2797: saved episode #19848070
2798: saved episode #19859435
2799: saved episode #19886524
2800: saved episode #19894443
2801: saved episode #19898577
2802: saved episode #19905434
2803: saved episode #19909616
2804: saved episode #19916965
2805: saved episode #19538648
2806: saved episode #19666197
2807: saved episode #19733967
2808: saved episode #19738405
2809: saved episode #19745356
2810: saved episode #19750514
...

3052: saved episode #19923077
3053: saved episode #19803520
3054: saved episode #19846977
3055: saved episode #19850850
3056: saved episode #19864186
3057: saved episode #19876322
3058: saved episode #19907428
3059: saved episode #19911413
3060: saved episode #19923383
3061: saved episode #19593005
3062: saved episode #19653122
3063: saved episode #19714871
3064: saved episode #19787930
3065: saved episode #19797705
3066: saved episode #19805983
3067: saved episode #19870627
3068: saved episode #19889105
3069: saved episode #19891079
3070: saved episode #19923007
3071: saved episode #19743010
3072: saved episode #19839935
3073: saved episode #19845101
3074: saved episode #19850376
3075: saved episode #19875572
3076: saved episode #19877131
3077: saved episode #19882556
3078: saved episode #19899546
3079: saved episode #19912599
3080: saved episode #19913309
3081: saved episode #19669745
3082: saved episode #19696096
3083: saved episode #19744947
3084: saved episode #19755508
...

3326: saved episode #19720264
3327: saved episode #19726036
3328: saved episode #19729546
3329: saved episode #19729604
3330: saved episode #19732274
3331: saved episode #19733273
3332: saved episode #19735959
3333: saved episode #19736647
3334: saved episode #19747002
3335: saved episode #19748829
3336: saved episode #19774518
3337: saved episode #19777847
3338: saved episode #19778821
3339: saved episode #19785889
3340: saved episode #19788196
3341: saved episode #19799978
3342: saved episode #19812648
3343: saved episode #19822920
3344: saved episode #19825835
3345: saved episode #19826298
3346: saved episode #19827322
3347: saved episode #19836251
3348: saved episode #19853195
3349: saved episode #19857867
3350: saved episode #19863962
3351: saved episode #19871226
3352: saved episode #19876047
3353: saved episode #19899638
3354: saved episode #19899819
3355: saved episode #19906757
3356: saved episode #19908641
3357: saved episode #19913950
3358: saved episode #19916213
...
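The download loop paces itself so that, on average, no more than one request per second goes out: after each attempt it sleeps until the elapsed time catches up with `BUFFER` plus the number of attempts so far. The arithmetic can be isolated as below. The function name `seconds_to_wait` is made up for this sketch and is not used elsewhere in the notebook.

```python
def seconds_to_wait(attempts_made, elapsed_seconds, buffer_seconds=1):
    """How long to sleep so attempts average at most one per second.

    Hypothetical helper mirroring the loop's pacing rule: the next request
    may start once `buffer_seconds + attempts_made` seconds have elapsed.
    """
    earliest_next = buffer_seconds + attempts_made
    return max(0.0, earliest_next - elapsed_seconds)

fast_run = seconds_to_wait(attempts_made=5, elapsed_seconds=3.0)   # ahead of schedule
slow_run = seconds_to_wait(attempts_made=5, elapsed_seconds=10.0)  # already behind
print(fast_run, slow_run)
```

Because the rule tracks total elapsed time rather than sleeping a fixed second per call, slow downloads do not add extra delay on top of the time they already took.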

## Deprecated - filter episodes by submission:

In [12]:
# Get top scoring submissions
max_df = (epagents_df.sort_values(by=['EpisodeId'], ascending=False).groupby('SubmissionId').head(1).drop_duplicates().reset_index(drop=True))
max_df = max_df[max_df.UpdatedScore>=LOWEST_SCORE_THRESH]
max_df = pd.merge(left=episodes_df, right=max_df, left_on='Id', right_on='EpisodeId')
sub_to_score_top = pd.Series(max_df.UpdatedScore.values,index=max_df.SubmissionId).to_dict()
print(f'{len(sub_to_score_top)} submissions with score over {LOWEST_SCORE_THRESH}')

# Get episodes for these submissions
sub_to_episodes = collections.defaultdict(list)
for key, value in sorted(sub_to_score_top.items(), key=lambda kv: kv[1], reverse=True):
    eps = sorted(epagents_df[epagents_df['SubmissionId'].isin([key])]['EpisodeId'].values,reverse=True)
    sub_to_episodes[key] = eps
candidates = len(set([item for sublist in sub_to_episodes.values() for item in sublist]))
print(f'{candidates} episodes for these {len(sub_to_score_top)} submissions')

all_files = []
for root, dirs, files in os.walk(MATCH_DIR, topdown=False):
    all_files.extend(files)
seen_episodes = [int(f.split('.')[0]) for f in all_files 
                      if '.' in f and f.split('.')[0].isdigit() and f.split('.')[1] == 'json']
remaining = np.setdiff1d([item for sublist in sub_to_episodes.values() for item in sublist], seen_episodes)
print(f'{len(remaining)} of these {candidates} episodes not yet saved')
print('Total of {} games in existing library'.format(len(seen_episodes)))

512 submissions with score over 1050
31948 episodes for these 512 submissions
10224 of these 31948 episodes not yet saved
Total of 23850 games in existing library
