# Simulations Episode Scraper Match Downloader

From Kaggle user robga: https://www.kaggle.com/robga/simulations-episode-scraper-match-downloader

This notebook downloads episodes using Kaggle's GetEpisodeReplay API and the [Meta Kaggle](https://www.kaggle.com/kaggle/meta-kaggle) dataset.

Meta Kaggle is refreshed daily, and the refresh occasionally fails. That's OK, goose keeps well for 24 hours.

Why download replays?
- Train your ML/RL model
- Inspect the performance of your own and others' agents
- Add to your ever-growing JSON collection

Only one scraping strategy is implemented: for each top-scoring submission, download all missing matches, then move on to the next submission.

Other scraping strategies are possible but are not implemented here. Examples: download at most X matches per submission or per team per day, ignore certain teams, ignore matches where some score is below X, or download only certain teams.
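
As an illustration of one of those alternatives, here is a minimal sketch of a "max X matches per submission" filter. It is not part of this notebook's pipeline; `cap_per_submission` and `max_per_submission` are hypothetical names, and the input is assumed to be a dict shaped like the `sub_to_episodes` mapping built in the deprecated cell at the end (submission id to a list of episode ids, best submissions first).

```python
def cap_per_submission(sub_to_episodes, seen, max_per_submission):
    """Pick at most `max_per_submission` not-yet-saved episodes per submission.

    sub_to_episodes: dict of submission id -> list of episode ids (assumed shape)
    seen: set of episode ids already on disk
    Returns a list of episode ids in download order, without duplicates.
    """
    queue = []
    queued = set()
    for sub_id, episode_ids in sub_to_episodes.items():
        taken = 0
        for epid in episode_ids:
            if taken >= max_per_submission:
                break
            # Skip episodes already saved or already queued via another submission
            if epid in seen or epid in queued:
                continue
            queue.append(epid)
            queued.add(epid)
            taken += 1
    return queue
```

An episode shared by two submissions is queued once and counts only against the first submission that claims it, which is one possible design choice among several.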

Please let me know of any bugs. It's new, and my goose may be cooked.

Todo:
- Add team IDs once Meta Kaggle adds them (a few days away)

In [1]:
import pandas as pd
import numpy as np
import os
import requests
import json
import datetime
import time
import glob
import collections

# Running count of API calls made in this session (module-level, so no `global` needed)
num_api_calls_today = 0

In [6]:
## Configure these to your needs. Choose one of:
# 'hungry-geese', 'rock-paper-scissors', 'santa-2020', 'halite', 'google-football'
COMP = 'hungry-geese'
MAX_CALLS_PER_DAY = 3400
LOWEST_SCORE_THRESH = 950

In [7]:
META = "episode_scraping/metadata/"
MATCH_DIR = 'episode_scraping/episodes/'
INFO_DIR = 'episode_scraping/infos/'
base_url = "https://www.kaggle.com/requests/EpisodeService/"
get_url = base_url + "GetEpisodeReplay"
BUFFER = 1
COMPETITIONS = {
    'hungry-geese': 25401,
    'rock-paper-scissors': 22838,
    'santa-2020': 24539,
    'halite': 18011,
    'google-football': 21723
}

In [4]:
# Load Episodes
episodes_df = pd.read_csv(META + "Episodes.csv")

# Load EpisodeAgents
epagents_df = pd.read_csv(META + "EpisodeAgents.csv")

print(f'Episodes.csv: {len(episodes_df)} rows before filtering.')
print(f'EpisodeAgents.csv: {len(epagents_df)} rows before filtering.')

episodes_df = episodes_df[episodes_df.CompetitionId == COMPETITIONS[COMP]] 
epagents_df = epagents_df[epagents_df.EpisodeId.isin(episodes_df.Id)]

print(f'Episodes.csv: {len(episodes_df)} rows after filtering for {COMP}.')
print(f'EpisodeAgents.csv: {len(epagents_df)} rows after filtering for {COMP}.')

# Prepare dataframes
episodes_df = episodes_df.set_index(['Id'])
episodes_df['CreateTime'] = pd.to_datetime(episodes_df['CreateTime'])
episodes_df['EndTime'] = pd.to_datetime(episodes_df['EndTime'])

epagents_df.fillna(0, inplace=True)
epagents_df = epagents_df.sort_values(by=['Id'], ascending=False)

latest_scores_df = epagents_df.loc[epagents_df.groupby('SubmissionId').EpisodeId.idxmax(),:].sort_values(by=['UpdatedScore'])
latest_scores_df['LatestScore'] = latest_scores_df.UpdatedScore
latest_scores_df = latest_scores_df[['SubmissionId', 'LatestScore']]
epagents_df = epagents_df.merge(latest_scores_df, left_on='SubmissionId', right_on='SubmissionId', how='outer').sort_values(by=['LatestScore'])

Episodes.csv: 19412649 rows before filtering.
EpisodeAgents.csv: 42804524 rows before filtering.
Episodes.csv: 166443 rows after filtering for hungry-geese.
EpisodeAgents.csv: 665772 rows after filtering for hungry-geese.


In [8]:
# Get episodes with all agent scores > a given threshold
episode_min_scores = epagents_df.groupby('EpisodeId').LatestScore.min()
ep_to_score = episode_min_scores[episode_min_scores >= LOWEST_SCORE_THRESH].to_dict()
print(f'{len(ep_to_score)} episodes with all agent scores over {LOWEST_SCORE_THRESH}')

all_files = []
for root, dirs, files in os.walk(MATCH_DIR, topdown=False):
    all_files.extend(files)
seen_episodes = [int(f.split('.')[0]) for f in all_files 
                 if '.' in f and f.split('.')[0].isdigit() and f.split('.')[1] == 'json']
remaining = np.setdiff1d([ep for ep in ep_to_score.keys()], seen_episodes)
print(f'{len(remaining)} of these {len(ep_to_score)} episodes not yet saved')
print('Total of {} games in existing library'.format(len(seen_episodes)))

33719 episodes with all agent scores over 950
8915 of these 33719 episodes not yet saved
Total of 32889 games in existing library
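
The "already saved" scan above keeps episode ids in a list, so each later membership check walks the whole list. A minimal sketch of the same scan returning a set is shown below. `find_saved_episode_ids` is a hypothetical helper, not part of the notebook, and it assumes replays are stored as `<episode id>.json` somewhere under the match directory. Unlike the original filter it ignores names with extra suffixes such as `123.json.bak`.

```python
from pathlib import Path


def find_saved_episode_ids(match_dir):
    """Return the set of episode ids that have a '<digits>.json' file under match_dir."""
    root = Path(match_dir)
    if not root.is_dir():
        # Nothing downloaded yet (or wrong path): treat as an empty library
        return set()
    return {
        int(path.stem)
        for path in root.rglob("*.json")
        # '123_info.json' has stem '123_info', so info files are skipped here
        if path.is_file() and path.stem.isdecimal()
    }
```

With a set, `epid not in seen` is a constant-time lookup, which matters once the library holds tens of thousands of files.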


In [10]:
def create_info_json(epid):
    
    # Convert the nanosecond timestamps to whole seconds since the epoch
    create_seconds = int((episodes_df[episodes_df.index == epid]['CreateTime'].values[0]).item()/1e9)
    end_seconds = int((episodes_df[episodes_df.index == epid]['EndTime'].values[0]).item()/1e9)

    agents = []
    for index, row in epagents_df[epagents_df['EpisodeId'] == epid].sort_values(by=['Index']).iterrows():
        agent = {
            "id": int(row["Id"]),
            "state": int(row["State"]),
            "submissionId": int(row['SubmissionId']),
            "reward": int(row['Reward']),
            "index": int(row['Index']),
            "initialScore": float(row['InitialScore']),
            "initialConfidence": float(row['InitialConfidence']),
            "updatedScore": float(row['UpdatedScore']),
            "updatedConfidence": float(row['UpdatedConfidence']),
            "teamId": int(99999)  # placeholder until Meta Kaggle provides team ids
        }
        agents.append(agent)

    info = {
        "id": int(epid),
        "competitionId": int(COMPETITIONS[COMP]),
        "createTime": {
            "seconds": int(create_seconds)
        },
        "endTime": {
            "seconds": int(end_seconds)
        },
        "agents": agents
    }

    return info

def saveEpisode(epid):
    # request
    re = requests.post(get_url, json = {"EpisodeId": int(epid)})
        
    # save replay
    with open(MATCH_DIR + '{}.json'.format(epid), 'w') as f:
        f.write(re.json()['result']['replay'])

    # save match info
    info = create_info_json(epid)
    with open(INFO_DIR +  '{}_info.json'.format(epid), 'w') as f:
        json.dump(info, f)
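
`saveEpisode` writes whatever the service returns, so a failed or malformed response would raise or leave a bad file behind. A minimal sketch of a guard that could be applied to the decoded response before writing is given below. `extract_replay` is a hypothetical helper. It assumes only what the code above already relies on (a `result` object containing a `replay` string) plus the assumption that the replay string is itself valid JSON.

```python
import json


def extract_replay(payload):
    """Return the replay string from a decoded response, or None if it looks unusable."""
    if not isinstance(payload, dict):
        return None
    result = payload.get("result")
    if not isinstance(result, dict):
        return None
    replay = result.get("replay")
    if not isinstance(replay, str) or not replay:
        return None
    try:
        # Assumption: the replay text is a JSON document; reject it otherwise
        json.loads(replay)
    except ValueError:
        return None
    return replay
```

A caller could skip the episode (and not count it as saved) whenever this returns `None`.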

In [11]:
r = BUFFER  # pacing counter: target elapsed seconds, advanced once per request

start_time = datetime.datetime.now()
se = 0  # episodes requested in this run
# Walk episodes from the highest to the lowest minimum agent score
for epid, value in sorted(ep_to_score.items(), key=lambda kv: kv[1], reverse=True):
    # Stop once the daily call budget is used up
    if num_api_calls_today >= min(3600, MAX_CALLS_PER_DAY):
        break
    if epid not in seen_episodes:
        saveEpisode(epid)
        r += 1
        se += 1
        try:
            # getsize raises OSError if the replay file was not written
            size = os.path.getsize(MATCH_DIR + '{}.json'.format(epid)) / 1e6
            print(f'{num_api_calls_today+1}: saved episode #{epid}')
            seen_episodes.append(epid)
            num_api_calls_today += 1
        except OSError:
            print('  file {}.json did not seem to save'.format(epid))
        # Sleep so that, on average, at most one request is made per second
        elapsed = (datetime.datetime.now() - start_time).total_seconds()
        if r > elapsed:
            time.sleep(r - elapsed)
print('')
print(f'Episodes saved: {se}')

1: saved episode #18860203
2: saved episode #19356431
3: saved episode #19445659
4: saved episode #19466278
5: saved episode #19487256
6: saved episode #19493546
7: saved episode #19497793
8: saved episode #19501422
9: saved episode #19506675
10: saved episode #19512787
11: saved episode #17114034
12: saved episode #17465690
13: saved episode #18749343
14: saved episode #19283340
15: saved episode #19419432
16: saved episode #19422000
17: saved episode #19422581
18: saved episode #19443249
19: saved episode #19452326
20: saved episode #19460218
21: saved episode #19502686
22: saved episode #19505853
23: saved episode #19512451
24: saved episode #19515227
25: saved episode #19426626
26: saved episode #19452480
27: saved episode #19454373
28: saved episode #19454416
29: saved episode #19459600
30: saved episode #19471124
31: saved episode #19499671
32: saved episode #19486385
33: saved episode #19451345
34: saved episode #19451526
35: saved episode #19454384
36: saved episode #19458429
...

288: saved episode #19488120
289: saved episode #19489797
290: saved episode #19490165
291: saved episode #19495593
292: saved episode #19511308
293: saved episode #19515571
294: saved episode #19514301
295: saved episode #19514713
296: saved episode #19514790
297: saved episode #19514953
298: saved episode #19516911
299: saved episode #19432691
300: saved episode #19433010
301: saved episode #19460153
302: saved episode #19473921
303: saved episode #19476683
304: saved episode #19479660
305: saved episode #19492647
306: saved episode #19509950
307: saved episode #19515365
308: saved episode #16844074
309: saved episode #17087204
310: saved episode #17662041
311: saved episode #18193902
312: saved episode #18226478
313: saved episode #18257810
314: saved episode #18384976
315: saved episode #18747974
316: saved episode #18825736
317: saved episode #18976569
318: saved episode #19411770
319: saved episode #19415719
320: saved episode #19429893
321: saved episode #19431960
...

572: saved episode #17415270
573: saved episode #17962643
574: saved episode #18688847
575: saved episode #18897364
576: saved episode #19408385
577: saved episode #19408548
578: saved episode #19412599
579: saved episode #19419814
580: saved episode #19434821
581: saved episode #19456159
582: saved episode #19460816
583: saved episode #19463996
584: saved episode #19466273
585: saved episode #19471951
586: saved episode #19479558
587: saved episode #19483653
588: saved episode #19514051
589: saved episode #19517165
590: saved episode #19411590
591: saved episode #19411940
592: saved episode #19422194
593: saved episode #19423934
594: saved episode #19438868
595: saved episode #19449571
596: saved episode #19454198
597: saved episode #19477964
598: saved episode #19479667
599: saved episode #19483497
600: saved episode #19494311
601: saved episode #19497697
602: saved episode #19511766
603: saved episode #19499227
604: saved episode #19499511
605: saved episode #19500059
...

855: saved episode #19412439
856: saved episode #19412475
857: saved episode #19413094
858: saved episode #19414805
859: saved episode #19423873
860: saved episode #19431328
861: saved episode #19436412
862: saved episode #19461203
863: saved episode #19470227
864: saved episode #19478209
865: saved episode #19487264
866: saved episode #19493843
867: saved episode #19498008
868: saved episode #19498766
869: saved episode #19506382
870: saved episode #19441283
871: saved episode #19457311
872: saved episode #19488125
873: saved episode #19514060
874: saved episode #19411803
875: saved episode #19421181
876: saved episode #19427343
877: saved episode #19457416
878: saved episode #19459462
879: saved episode #19477247
880: saved episode #19481074
881: saved episode #19493812
882: saved episode #19497277
883: saved episode #19499595
884: saved episode #19501265
885: saved episode #19512718
886: saved episode #19516944
887: saved episode #19508922
888: saved episode #19509164
...

1133: saved episode #17432073
1134: saved episode #17552957
1135: saved episode #17592227
1136: saved episode #17700077
1137: saved episode #18137522
1138: saved episode #18210824
1139: saved episode #18285979
1140: saved episode #18321069
1141: saved episode #18397518
1142: saved episode #18503434
1143: saved episode #18629719
1144: saved episode #18847776
1145: saved episode #19139191
1146: saved episode #19201955
1147: saved episode #19240570
1148: saved episode #19270923
1149: saved episode #19415315
1150: saved episode #19416633
1151: saved episode #19432849
1152: saved episode #19436726
1153: saved episode #19448628
1154: saved episode #19456073
1155: saved episode #19474801
1156: saved episode #19475160
1157: saved episode #19477397
1158: saved episode #19483067
1159: saved episode #19484668
1160: saved episode #19496764
1161: saved episode #15682401
1162: saved episode #17215174
1163: saved episode #17532363
1164: saved episode #17647702
1165: saved episode #18178238
...

1407: saved episode #19411792
1408: saved episode #19431637
1409: saved episode #19437864
1410: saved episode #19456473
1411: saved episode #19464265
1412: saved episode #19468236
1413: saved episode #19477929
1414: saved episode #19479703
1415: saved episode #19484874
1416: saved episode #19502760
1417: saved episode #19507153
1418: saved episode #19508379
1419: saved episode #19514864
1420: saved episode #19517177
1421: saved episode #19422307
1422: saved episode #19425270
1423: saved episode #19459711
1424: saved episode #19465795
1425: saved episode #19483142
1426: saved episode #19502434
1427: saved episode #19510521
1428: saved episode #19511081
1429: saved episode #16630361
1430: saved episode #17224548
1431: saved episode #17574766
1432: saved episode #17758438
1433: saved episode #17940767
1434: saved episode #18037722
1435: saved episode #18659959
1436: saved episode #18745222
1437: saved episode #18939395
1438: saved episode #19022689
1439: saved episode #19407529
...

1682: saved episode #19476471
1683: saved episode #19480470
1684: saved episode #19491261
1685: saved episode #19500146
1686: saved episode #19506793
1687: saved episode #19507190
1688: saved episode #19507239
1689: saved episode #19513446
1690: saved episode #17301371
1691: saved episode #17723156
1692: saved episode #18319805
1693: saved episode #18388107
1694: saved episode #19375740
1695: saved episode #19407075
1696: saved episode #19414376
1697: saved episode #19418036
1698: saved episode #19421871
1699: saved episode #19423143
1700: saved episode #19439034
1701: saved episode #19448680
1702: saved episode #19458039
1703: saved episode #19464124
1704: saved episode #19464485
1705: saved episode #19480438
1706: saved episode #19482580
1707: saved episode #19495638
1708: saved episode #19514957
1709: saved episode #18389507
1710: saved episode #18597886
1711: saved episode #18647595
1712: saved episode #18792014
1713: saved episode #19181958
1714: saved episode #19242633
...

1956: saved episode #17257644
1957: saved episode #17319408
1958: saved episode #17397839
1959: saved episode #17784610
1960: saved episode #17803314
1961: saved episode #17821393
1962: saved episode #17911357
1963: saved episode #18188266
1964: saved episode #18286603
1965: saved episode #18495915
1966: saved episode #18532946
1967: saved episode #18754152
1968: saved episode #18842963
1969: saved episode #19190253
1970: saved episode #19363322
1971: saved episode #19392270
1972: saved episode #19411215
1973: saved episode #19421336
1974: saved episode #19433919
1975: saved episode #19442708
1976: saved episode #19452760
1977: saved episode #19461578
1978: saved episode #19462384
1979: saved episode #19464137
1980: saved episode #19468140
1981: saved episode #19472888
1982: saved episode #19484422
1983: saved episode #19491913
1984: saved episode #19492820
1985: saved episode #19495126
1986: saved episode #19498194
1987: saved episode #19509063
1988: saved episode #19481546
...

2230: saved episode #19435734
2231: saved episode #19467937
2232: saved episode #19471569
2233: saved episode #17185214
2234: saved episode #17190198
2235: saved episode #17448257
2236: saved episode #17476264
2237: saved episode #17554193
2238: saved episode #17649571
2239: saved episode #17774642
2240: saved episode #18113084
2241: saved episode #18261567
2242: saved episode #18619392
2243: saved episode #18668906
2244: saved episode #18836757
2245: saved episode #19407724
2246: saved episode #19416057
2247: saved episode #19440910
2248: saved episode #19441817
2249: saved episode #19460639
2250: saved episode #19462489
2251: saved episode #19469610
2252: saved episode #19481809
2253: saved episode #19487762
2254: saved episode #19507831
2255: saved episode #18492781
2256: saved episode #18497157
2257: saved episode #18555525
2258: saved episode #18954545
2259: saved episode #19132292
2260: saved episode #19409590
2261: saved episode #19410323
2262: saved episode #19426907
...

2504: saved episode #18489007
2505: saved episode #18556152
2506: saved episode #18902193
2507: saved episode #19052330
2508: saved episode #19161958
2509: saved episode #19326104
2510: saved episode #19416085
2511: saved episode #19418716
2512: saved episode #19424770
2513: saved episode #19424803
2514: saved episode #19442012
2515: saved episode #19443182
2516: saved episode #19445497
2517: saved episode #19447881
2518: saved episode #19451508
2519: saved episode #19472124
2520: saved episode #19477750
2521: saved episode #19496551
2522: saved episode #19498270
2523: saved episode #19502436
2524: saved episode #19508019
2525: saved episode #16578076
2526: saved episode #17579132
2527: saved episode #17896328
2528: saved episode #18128119
2529: saved episode #19210919
2530: saved episode #19212986
2531: saved episode #19270233
2532: saved episode #19422063
2533: saved episode #19427920
2534: saved episode #19448431
2535: saved episode #19455924
2536: saved episode #19467149
...

2778: saved episode #19212993
2779: saved episode #19349531
2780: saved episode #19404020
2781: saved episode #19417339
2782: saved episode #19424182
2783: saved episode #19428714
2784: saved episode #19441024
2785: saved episode #19447689
2786: saved episode #19455212
2787: saved episode #19462020
2788: saved episode #19474956
2789: saved episode #19477367
2790: saved episode #19516364
2791: saved episode #19429504
2792: saved episode #19430635
2793: saved episode #19435498
2794: saved episode #19437587
2795: saved episode #19442354
2796: saved episode #19448884
2797: saved episode #19453408
2798: saved episode #19457794
2799: saved episode #19459955
2800: saved episode #19469091
2801: saved episode #19477920
2802: saved episode #19485943
2803: saved episode #19493177
2804: saved episode #19496398
2805: saved episode #19502469
2806: saved episode #19506558
2807: saved episode #19507270
2808: saved episode #19511723
2809: saved episode #19461969
2810: saved episode #19465128
...

3052: saved episode #19515643
3053: saved episode #19472746
3054: saved episode #19480729
3055: saved episode #19497714
3056: saved episode #19505397
3057: saved episode #19510551
3058: saved episode #15569857
3059: saved episode #16248368
3060: saved episode #17493706
3061: saved episode #17642712
3062: saved episode #17888191
3063: saved episode #18102768
3064: saved episode #18450763
3065: saved episode #18506575
3066: saved episode #18528554
3067: saved episode #18562421
3068: saved episode #19075759
3069: saved episode #19186107
3070: saved episode #19308158
3071: saved episode #19407756
3072: saved episode #19411242
3073: saved episode #19419596
3074: saved episode #19427201
3075: saved episode #19427992
3076: saved episode #19428252
3077: saved episode #19431770
3078: saved episode #19436011
3079: saved episode #19436984
3080: saved episode #19446275
3081: saved episode #19453092
3082: saved episode #19456669
3083: saved episode #19460238
3084: saved episode #19465938
...

3326: saved episode #19511646
3327: saved episode #19517371
3328: saved episode #19478859
3329: saved episode #19491454
3330: saved episode #19494453
3331: saved episode #19496903
3332: saved episode #19512516
3333: saved episode #19437231
3334: saved episode #19437577
3335: saved episode #19437610
3336: saved episode #19437648
3337: saved episode #19442678
3338: saved episode #19447188
3339: saved episode #19456106
3340: saved episode #19463965
3341: saved episode #19479811
3342: saved episode #19495890
3343: saved episode #19500004
3344: saved episode #19507713
3345: saved episode #18418817
3346: saved episode #18426943
3347: saved episode #18498422
3348: saved episode #18512857
3349: saved episode #19085411
3350: saved episode #19387455
3351: saved episode #19409587
3352: saved episode #19414839
3353: saved episode #19415342
3354: saved episode #19418132
3355: saved episode #19420289
3356: saved episode #19425793
3357: saved episode #19427105
3358: saved episode #19434183
...
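
The download loop paces itself by comparing a per-request counter with the seconds elapsed since the start, sleeping whenever it runs ahead. The same idea can be sketched as a small reusable class. `Pacer` and its parameters are hypothetical names introduced for illustration, and the one-second interval is an assumption mirroring the loop above, not a documented Kaggle limit. The clock and sleep functions are injectable so the logic can be exercised without real waiting.

```python
import time


class Pacer:
    """Keep the average request rate at or below one call per `min_interval` seconds."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._start = clock()
        self._calls = 0

    def wait(self):
        """Call after each request. Sleeps if ahead of schedule; returns the delay used."""
        self._calls += 1
        # Earliest time (relative to start) at which the next request is allowed
        target = self._calls * self.min_interval
        elapsed = self._clock() - self._start
        delay = max(0.0, target - elapsed)
        if delay > 0:
            self._sleep(delay)
        return delay
```

In the loop, this would amount to creating one `Pacer()` before iterating and calling `pacer.wait()` after each `saveEpisode(epid)`. Slow requests accumulate slack, so a burst of fast ones afterwards is not delayed until the average catches up.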

## Deprecated - filter episodes by submission:

In [12]:
# Get top scoring submissions
max_df = (epagents_df.sort_values(by=['EpisodeId'], ascending=False).groupby('SubmissionId').head(1).drop_duplicates().reset_index(drop=True))
max_df = max_df[max_df.UpdatedScore>=LOWEST_SCORE_THRESH]
max_df = pd.merge(left=episodes_df, right=max_df, left_on='Id', right_on='EpisodeId')
sub_to_score_top = pd.Series(max_df.UpdatedScore.values,index=max_df.SubmissionId).to_dict()
print(f'{len(sub_to_score_top)} submissions with score over {LOWEST_SCORE_THRESH}')

# Get episodes for these submissions
sub_to_episodes = collections.defaultdict(list)
for key, value in sorted(sub_to_score_top.items(), key=lambda kv: kv[1], reverse=True):
    eps = sorted(epagents_df[epagents_df['SubmissionId'].isin([key])]['EpisodeId'].values,reverse=True)
    sub_to_episodes[key] = eps
candidates = len(set([item for sublist in sub_to_episodes.values() for item in sublist]))
print(f'{candidates} episodes for these {len(sub_to_score_top)} submissions')

all_files = []
for root, dirs, files in os.walk(MATCH_DIR, topdown=False):
    all_files.extend(files)
seen_episodes = [int(f.split('.')[0]) for f in all_files 
                      if '.' in f and f.split('.')[0].isdigit() and f.split('.')[1] == 'json']
remaining = np.setdiff1d([item for sublist in sub_to_episodes.values() for item in sublist], seen_episodes)
print(f'{len(remaining)} of these {candidates} episodes not yet saved')
print('Total of {} games in existing library'.format(len(seen_episodes)))

512 submissions with score over 1050
31948 episodes for these 512 submissions
10224 of these 31948 episodes not yet saved
Total of 23850 games in existing library
