# Simulations Episode Scraper Match Downloader

From Kaggle user robga: https://www.kaggle.com/robga/simulations-episode-scraper-match-downloader

This notebook downloads episodes using Kaggle's GetEpisodeReplay API and the [Meta Kaggle](https://www.kaggle.com/kaggle/meta-kaggle) dataset.

Meta Kaggle is refreshed daily, and sometimes fails a daily refresh. That's OK, Goose keeps well for 24hr.

Why download replays?
- Train your ML/RL model
- Inspect the performance of your own and others' agents
- Add to your ever-growing JSON collection

Only one scraping strategy is implemented: for each top-scoring submission, download all missing matches, then move on to the next submission.

Other scraping strategies are possible but are not implemented here. For example: download at most X matches per submission, or per team per day; ignore certain teams; ignore matches where some scores are below X; or download only selected teams.
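
As a rough illustration of one such alternative, the sketch below caps how many episodes are kept per submission. It assumes a mapping shaped like the `sub_to_episodes` dictionary built later in this notebook, and it assumes that larger episode ids are newer. `cap_per_submission` and `max_per_sub` are hypothetical names, not part of the scraper.

```python
def cap_per_submission(sub_to_episodes, max_per_sub):
    """Keep at most `max_per_sub` episode ids for each submission."""
    capped = {}
    for sub_id, episode_ids in sub_to_episodes.items():
        # Assumption: larger episode ids are newer, so descending order puts newest first.
        newest_first = sorted(set(episode_ids), reverse=True)
        capped[sub_id] = newest_first[:max_per_sub]
    return capped


# Toy input with made-up ids, only to show the shape of the result.
example = {101: [5, 9, 7, 9], 202: [3]}
print(cap_per_submission(example, max_per_sub=2))
```

The capped dictionary could then stand in for `sub_to_episodes` in the download loop, leaving the rest of the notebook unchanged.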

Please let me know of any bugs. It's new, and my goose may be cooked.

Todo:
- Add team IDs once Meta Kaggle adds them (a few days away)

In [5]:
import pandas as pd
import numpy as np
import os
import requests
import json
import datetime
import time
import glob
import collections

# Running count of API calls made in this session.
# (A `global` statement has no effect at module level, so it is omitted.)
num_api_calls_today = 0

In [17]:
## You should configure these to your needs. Choose one of ...
# 'hungry-geese', 'rock-paper-scissors', 'santa-2020', 'halite', 'google-football'
COMP = 'hungry-geese'
MAX_CALLS_PER_DAY = 3400
LOWEST_SCORE_THRESH = 1000

In [18]:
META = "metadata/"
MATCH_DIR = 'data/'
base_url = "https://www.kaggle.com/requests/EpisodeService/"
get_url = base_url + "GetEpisodeReplay"
BUFFER = 1
COMPETITIONS = {
    'hungry-geese': 25401,
    'rock-paper-scissors': 22838,
    'santa-2020': 24539,
    'halite': 18011,
    'google-football': 21723
}

In [19]:
# Load Episodes
episodes_df = pd.read_csv(META + "Episodes.csv")

# Load EpisodeAgents
epagents_df = pd.read_csv(META + "EpisodeAgents.csv")

print(f'Episodes.csv: {len(episodes_df)} rows before filtering.')
print(f'EpisodeAgents.csv: {len(epagents_df)} rows before filtering.')

episodes_df = episodes_df[episodes_df.CompetitionId == COMPETITIONS[COMP]] 
epagents_df = epagents_df[epagents_df.EpisodeId.isin(episodes_df.Id)]

print(f'Episodes.csv: {len(episodes_df)} rows after filtering for {COMP}.')
print(f'EpisodeAgents.csv: {len(epagents_df)} rows after filtering for {COMP}.')

Episodes.csv: 19347847 rows before filtering.
EpisodeAgents.csv: 42589576 rows before filtering.
Episodes.csv: 123847 rows after filtering for hungry-geese.
EpisodeAgents.csv: 495388 rows after filtering for hungry-geese.


In [20]:
# Prepare dataframes
episodes_df = episodes_df.set_index(['Id'])
episodes_df['CreateTime'] = pd.to_datetime(episodes_df['CreateTime'])
episodes_df['EndTime'] = pd.to_datetime(episodes_df['EndTime'])

epagents_df.fillna(0, inplace=True)
epagents_df = epagents_df.sort_values(by=['Id'], ascending=False)

In [21]:
# Get top scoring submissions
max_df = (epagents_df.sort_values(by=['EpisodeId'], ascending=False).groupby('SubmissionId').head(1).drop_duplicates().reset_index(drop=True))
max_df = max_df[max_df.UpdatedScore>=LOWEST_SCORE_THRESH]
max_df = pd.merge(left=episodes_df, right=max_df, left_on='Id', right_on='EpisodeId')
sub_to_score_top = pd.Series(max_df.UpdatedScore.values,index=max_df.SubmissionId).to_dict()
print(f'{len(sub_to_score_top)} submissions with score over {LOWEST_SCORE_THRESH}')

1241 submissions with score over 1000


In [22]:
# Get episodes for these submissions
sub_to_episodes = collections.defaultdict(list)
for key, value in sorted(sub_to_score_top.items(), key=lambda kv: kv[1], reverse=True):
    eps = sorted(epagents_df[epagents_df['SubmissionId'].isin([key])]['EpisodeId'].values,reverse=True)
    sub_to_episodes[key] = eps
candidates = len(set([item for sublist in sub_to_episodes.values() for item in sublist]))
print(f'{candidates} episodes for these {len(sub_to_score_top)} submissions')

43622 episodes for these 1241 submissions


In [23]:
all_files = []
for root, dirs, files in os.walk(MATCH_DIR, topdown=False):
    all_files.extend(files)
seen_episodes = [int(f.split('.')[0]) for f in all_files 
                      if '.' in f and f.split('.')[0].isdigit() and f.split('.')[1] == 'json']
remaining = np.setdiff1d([item for sublist in sub_to_episodes.values() for item in sublist],seen_episodes)
print(f'{len(remaining)} of these {candidates} episodes not yet saved')
print('Total of {} games in existing library'.format(len(seen_episodes)))

36587 of these 43622 episodes not yet saved
Total of 7035 games in existing library
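
The filename filter in the cell above can also be isolated into a small helper that returns a set, which makes the later `epid not in seen_episodes` checks cheaper than scanning a list. This is only a sketch: `episode_ids_from_filenames` is a hypothetical name, and it assumes the `<episode id>.json` / `<episode id>_info.json` naming used by this notebook.

```python
import os


def episode_ids_from_filenames(filenames):
    """Collect episode ids from names like '123.json', skipping '123_info.json'."""
    ids = set()
    for name in filenames:
        stem, ext = os.path.splitext(name)
        # Only plain '<digits>.json' files are replays; '<digits>_info.json' holds match info.
        if ext == '.json' and stem.isdigit():
            ids.add(int(stem))
    return ids


# Made-up filenames for illustration.
print(sorted(episode_ids_from_filenames(['123.json', '123_info.json', 'notes.txt', '456.json'])))
```

Note that the download loop calls `seen_episodes.append(epid)`, so switching to a set would also mean changing that call to `add`.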


In [24]:
def create_info_json(epid):
    
    # Timestamps are datetime64[ns]; .item() yields nanoseconds since the epoch
    create_seconds = int((episodes_df[episodes_df.index == epid]['CreateTime'].values[0]).item()/1e9)
    end_seconds = int((episodes_df[episodes_df.index == epid]['EndTime'].values[0]).item()/1e9)

    agents = []
    for index, row in epagents_df[epagents_df['EpisodeId'] == epid].sort_values(by=['Index']).iterrows():
        agent = {
            "id": int(row["Id"]),
            "state": int(row["State"]),
            "submissionId": int(row['SubmissionId']),
            "reward": int(row['Reward']),
            "index": int(row['Index']),
            "initialScore": float(row['InitialScore']),
            "initialConfidence": float(row['InitialConfidence']),
            "updatedScore": float(row['UpdatedScore']),
            "updatedConfidence": float(row['UpdatedConfidence']),
            "teamId": int(99999)  # placeholder until Meta Kaggle provides team ids (see Todo)
        }
        agents.append(agent)

    info = {
        "id": int(epid),
        "competitionId": int(COMPETITIONS[COMP]),
        "createTime": {
            "seconds": int(create_seconds)
        },
        "endTime": {
            "seconds": int(end_seconds)
        },
        "agents": agents
    }

    return info

def saveEpisode(epid):
    # request the replay for this episode
    response = requests.post(get_url, json={"EpisodeId": int(epid)})

    # save replay (a string inside the response's 'result' field)
    with open(MATCH_DIR + '{}.json'.format(epid), 'w') as f:
        f.write(response.json()['result']['replay'])

    # save match info built from the Meta Kaggle tables
    info = create_info_json(epid)
    with open(MATCH_DIR + '{}_info.json'.format(epid), 'w') as f:
        json.dump(info, f)
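
`saveEpisode` assumes every response body contains `result` and, inside it, `replay`. A minimal sketch of a guard for that lookup follows. `extract_replay` is a hypothetical helper, and the sketch assumes only the response shape used above; what a failed or throttled response actually looks like is not documented here.

```python
def extract_replay(payload):
    """Return the replay string from a decoded response body, or None if it is missing."""
    if not isinstance(payload, dict):
        return None
    result = payload.get("result")
    if not isinstance(result, dict):
        return None
    replay = result.get("replay")
    # The scraper writes the replay with f.write(), so only a non-empty string is usable.
    if isinstance(replay, str) and replay:
        return replay
    return None


# Made-up payloads for illustration.
print(extract_replay({"result": {"replay": "{}"}}))
print(extract_replay({"unexpected": "shape"}))
```

A caller could skip the episode when the helper returns `None`, rather than letting a `KeyError` stop the whole run.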

In [25]:
r = BUFFER  # calls made so far, offset by BUFFER; used to pace requests to about one per second

start_time = datetime.datetime.now()
se = 0  # episodes saved in this run
for key, value in sorted(sub_to_score_top.items(), key=lambda kv: kv[1], reverse=True):
    if num_api_calls_today <= MAX_CALLS_PER_DAY:
        print('')
        remaining = sorted(np.setdiff1d(sub_to_episodes[key], seen_episodes), reverse=True)
        print(f'submission={key}, LB={value:.0f}, matches={len(set(sub_to_episodes[key]))}, still to save={len(remaining)}')

        for epid in remaining:
            if epid not in seen_episodes and num_api_calls_today <= MAX_CALLS_PER_DAY:
                saveEpisode(epid)
                r += 1
                se += 1
                try:
                    # getsize raises OSError if the replay file was not written
                    os.path.getsize(MATCH_DIR + '{}.json'.format(epid))
                    print(str(num_api_calls_today) + f': saved episode #{epid}')
                    seen_episodes.append(epid)
                    num_api_calls_today += 1
                except OSError:
                    print('  file {}.json did not seem to save'.format(epid))
                # Sleep whenever we are ahead of one request per second
                elapsed = int((datetime.datetime.now() - start_time).total_seconds())
                if r > elapsed:
                    time.sleep(r - elapsed)
            if num_api_calls_today >= min(3600, MAX_CALLS_PER_DAY):
                break
print('')
print(f'Episodes saved: {se}')


submission=19722457, LB=1299, matches=26, still to save=0

submission=19728229, LB=1265, matches=24, still to save=0

submission=19721626, LB=1258, matches=26, still to save=0

submission=19719024, LB=1250, matches=27, still to save=0

submission=19711300, LB=1244, matches=32, still to save=0

submission=19725734, LB=1220, matches=26, still to save=0

submission=19729286, LB=1209, matches=24, still to save=0

submission=19722219, LB=1207, matches=26, still to save=0

submission=19727623, LB=1207, matches=24, still to save=0

submission=19708538, LB=1202, matches=32, still to save=0

submission=19729402, LB=1196, matches=25, still to save=0

submission=19721013, LB=1196, matches=29, still to save=0

submission=19725921, LB=1190, matches=25, still to save=0

submission=19711133, LB=1173, matches=34, still to save=0

submission=19533620, LB=1157, matches=89, still to save=0

submission=19722649, LB=1153, matches=27, still to save=0

submission=19711076, LB=1146, matches=32, still to save

444: saved episode #19452390
445: saved episode #19451992
446: saved episode #19450352
447: saved episode #19449358
448: saved episode #19448389

submission=19699440, LB=1093, matches=37, still to save=3
449: saved episode #19447421
450: saved episode #19444550
451: saved episode #19442078

submission=19469854, LB=1093, matches=112, still to save=4
452: saved episode #19451718
453: saved episode #19449259
454: saved episode #19447755
455: saved episode #19443747

submission=19723306, LB=1093, matches=30, still to save=26
456: saved episode #19452491
457: saved episode #19450927
458: saved episode #19447747
459: saved episode #19444569
460: saved episode #19443780
461: saved episode #19441953
462: saved episode #19441950
463: saved episode #19439490
464: saved episode #19439365
465: saved episode #19437263
466: saved episode #19433775
467: saved episode #19431810
468: saved episode #19430646
469: saved episode #19429906
470: saved episode #19429778
471: saved episode #19429709
472: save

698: saved episode #19444275
699: saved episode #19440601
700: saved episode #19437793
701: saved episode #19435689
702: saved episode #19431737
703: saved episode #19430020
704: saved episode #19429489
705: saved episode #19427596
706: saved episode #19422100
707: saved episode #19418901
708: saved episode #19416516
709: saved episode #19413512
710: saved episode #19412806
711: saved episode #19407473
712: saved episode #19334371
713: saved episode #19265407
714: saved episode #19224709
715: saved episode #19141953
716: saved episode #19062663
717: saved episode #18966932
718: saved episode #18966237
719: saved episode #18915978
720: saved episode #18840217
721: saved episode #18836769
722: saved episode #18754157
723: saved episode #18656530
724: saved episode #18524782
725: saved episode #18487756
726: saved episode #18409414
727: saved episode #18337976
728: saved episode #18277845
729: saved episode #18274102
730: saved episode #18227730
731: saved episode #18221473
732: saved epi

965: saved episode #18365554
966: saved episode #18323566
967: saved episode #18243394
968: saved episode #18212707
969: saved episode #18212704
970: saved episode #18138145
971: saved episode #18085886
972: saved episode #18038980
973: saved episode #18002691
974: saved episode #17940762
975: saved episode #17865081
976: saved episode #17859457
977: saved episode #17787738
978: saved episode #17714427
979: saved episode #17657057
980: saved episode #17573520
981: saved episode #17561680
982: saved episode #17560426
983: saved episode #17480632
984: saved episode #17440169
985: saved episode #17424615
986: saved episode #17320656
987: saved episode #17248279
988: saved episode #17171470
989: saved episode #17158972
990: saved episode #17150235
991: saved episode #17131515
992: saved episode #17125265
993: saved episode #17054171
994: saved episode #16982468
995: saved episode #16981219
996: saved episode #16943184
997: saved episode #16902036
998: saved episode #16887699
999: saved epi

1214: saved episode #19447327
1215: saved episode #19443309

submission=19471969, LB=1087, matches=118, still to save=3
1216: saved episode #19452689
1217: saved episode #19445597
1218: saved episode #19441671

submission=19530173, LB=1086, matches=102, still to save=73
1219: saved episode #19452864
1220: saved episode #19448862
1221: saved episode #19447756
1222: saved episode #19443748
1223: saved episode #19440660
1224: saved episode #19436953
1225: saved episode #19436323
1226: saved episode #19435332
1227: saved episode #19433022
1228: saved episode #19429126
1229: saved episode #19428838
1230: saved episode #19424354
1231: saved episode #19422916
1232: saved episode #19416326
1233: saved episode #19408425
1234: saved episode #19407893
1235: saved episode #19407046
1236: saved episode #19379880
1237: saved episode #19357124
1238: saved episode #19341954
1239: saved episode #19275756
1240: saved episode #19154363
1241: saved episode #19144708
1242: saved episode #19106786
1243: sav

1452: saved episode #18277221
1453: saved episode #18255304
1454: saved episode #18217090
1455: saved episode #18164463
1456: saved episode #18082768
1457: saved episode #18077131
1458: saved episode #18037718
1459: saved episode #17972037
1460: saved episode #17957025
1461: saved episode #17900721
1462: saved episode #17817018
1463: saved episode #17813287
1464: saved episode #17813286
1465: saved episode #17778385
1466: saved episode #17731254
1467: saved episode #17706944
1468: saved episode #17678873
1469: saved episode #17636480
1470: saved episode #17633367
1471: saved episode #17602193
1472: saved episode #17600321
1473: saved episode #17594714
1474: saved episode #17594092
1475: saved episode #17593469
1476: saved episode #17592846
1477: saved episode #17592223
1478: saved episode #17591600
1479: saved episode #17590977
1480: saved episode #17590353
1481: saved episode #17589730
1482: saved episode #17589105
1483: saved episode #17588478
1484: saved episode #17587852
1485: save

1716: saved episode #17202690
1717: saved episode #17202066
1718: saved episode #17201442
1719: saved episode #17200818
1720: saved episode #17200194
1721: saved episode #17199570
1722: saved episode #17198946
1723: saved episode #17198323
1724: saved episode #17197699
1725: saved episode #17197075
1726: saved episode #17196450
1727: saved episode #17195825
1728: saved episode #17195210

submission=19658330, LB=1082, matches=57, still to save=4
1729: saved episode #19452455
1730: saved episode #19449726
1731: saved episode #19447780
1732: saved episode #19446791

submission=19567961, LB=1082, matches=80, still to save=65
1733: saved episode #19452683
1734: saved episode #19450761
1735: saved episode #19449324
1736: saved episode #19445389
1737: saved episode #19443947
1738: saved episode #19441023
1739: saved episode #19440873
1740: saved episode #19440472
1741: saved episode #19438961
1742: saved episode #19437196
1743: saved episode #19430684
1744: saved episode #19426686
1745: saved

1976: saved episode #18547362
1977: saved episode #18414433
1978: saved episode #18361165
1979: saved episode #18342991
1980: saved episode #18342369
1981: saved episode #18269091
1982: saved episode #18267839
1983: saved episode #18193268
1984: saved episode #18167587
1985: saved episode #18131881
1986: saved episode #18007079
1987: saved episode #17885693
1988: saved episode #17847596
1989: saved episode #17791476
1990: saved episode #17789602
1991: saved episode #17782741
1992: saved episode #17711306
1993: saved episode #17690725
1994: saved episode #17683866
1995: saved episode #17610295
1996: saved episode #17574771
1997: saved episode #17568537
1998: saved episode #17427716
1999: saved episode #17412774
2000: saved episode #17412773
2001: saved episode #17393475
2002: saved episode #17363604
2003: saved episode #17355521
2004: saved episode #17351161
2005: saved episode #17321904
2006: saved episode #17288247
2007: saved episode #17282000
2008: saved episode #17219552
2009: save

2240: saved episode #19451139
2241: saved episode #19450384
2242: saved episode #19446547
2243: saved episode #19442585
2244: saved episode #19438724
2245: saved episode #19432582
2246: saved episode #19431863
2247: saved episode #19427475
2248: saved episode #19425670
2249: saved episode #19422903
2250: saved episode #19419117
2251: saved episode #19416381
2252: saved episode #19412879
2253: saved episode #19409496
2254: saved episode #19407534
2255: saved episode #19380564
2256: saved episode #19301273
2257: saved episode #19224024
2258: saved episode #19219877
2259: saved episode #19128843
2260: saved episode #18969681
2261: saved episode #18889791
2262: saved episode #18815415
2263: saved episode #18778938
2264: saved episode #18686776
2265: saved episode #18675090
2266: saved episode #18612317
2267: saved episode #18547361
2268: saved episode #18526042
2269: saved episode #18511600
2270: saved episode #18462035
2271: saved episode #18406291
2272: saved episode #18371191
2273: save

2506: saved episode #17594093
2507: saved episode #17504933
2508: saved episode #17446392
2509: saved episode #17428343
2510: saved episode #17351791
2511: saved episode #17300742
2512: saved episode #17255770
2513: saved episode #17187085
2514: saved episode #17123392
2515: saved episode #17081599
2516: saved episode #17050425
2517: saved episode #17037959
2518: saved episode #17032975
2519: saved episode #17028611
2520: saved episode #17020503
2521: saved episode #16998675
2522: saved episode #16933216
2523: saved episode #16869622
2524: saved episode #16852173
2525: saved episode #16841580
2526: saved episode #16828173
2527: saved episode #16786702
2528: saved episode #16785462
2529: saved episode #16754286
2530: saved episode #16701317
2531: saved episode #16680788
2532: saved episode #16665234
2533: saved episode #16662123
2534: saved episode #16661501
2535: saved episode #16660879
2536: saved episode #16660254
2537: saved episode #16659629
2538: saved episode #16659006
2539: save

2766: saved episode #18960728
2767: saved episode #18911153
2768: saved episode #18906319
2769: saved episode #18873279
2770: saved episode #18846403
2771: saved episode #18831251
2772: saved episode #18758276
2773: saved episode #18731471
2774: saved episode #18574357
2775: saved episode #18571214
2776: saved episode #18516627
2777: saved episode #18492141
2778: saved episode #18453898
2779: saved episode #18376204
2780: saved episode #18319815
2781: saved episode #18274097
2782: saved episode #18263444
2783: saved episode #18237131
2784: saved episode #18121848
2785: saved episode #18041473
2786: saved episode #18037094
2787: saved episode #18034601
2788: saved episode #18022725
2789: saved episode #18008957
2790: saved episode #17866954
2791: saved episode #17825136
2792: saved episode #17806425
2793: saved episode #17765914
2794: saved episode #17731259
2795: saved episode #17712556
2796: saved episode #17635857
2797: saved episode #17596585
2798: saved episode #17468807
2799: save

3030: saved episode #19449064
3031: saved episode #19445163
3032: saved episode #19443936

submission=19463929, LB=1078, matches=117, still to save=68
3033: saved episode #19446107
3034: saved episode #19441844
3035: saved episode #19438398
3036: saved episode #19431741
3037: saved episode #19426717
3038: saved episode #19426687
3039: saved episode #19424585
3040: saved episode #19424321
3041: saved episode #19423932
3042: saved episode #19421873
3043: saved episode #19419341
3044: saved episode #19415806
3045: saved episode #19412753
3046: saved episode #19412138
3047: saved episode #19411579
3048: saved episode #19330244
3049: saved episode #19243331
3050: saved episode #19022699
3051: saved episode #18939389
3052: saved episode #18891176
3053: saved episode #18808531
3054: saved episode #18763114
3055: saved episode #18649650
3056: saved episode #18637274
3057: saved episode #18578432
3058: saved episode #18502803
3059: saved episode #18453904
3060: saved episode #18332341
3061: sav

3291: saved episode #19450451
3292: saved episode #19449674
3293: saved episode #19445732
3294: saved episode #19445164
3295: saved episode #19444846
3296: saved episode #19444482
3297: saved episode #19438131
3298: saved episode #19429601
3299: saved episode #19428127
3300: saved episode #19424353
3301: saved episode #19422681
3302: saved episode #19421217
3303: saved episode #19418264
3304: saved episode #19407415
3305: saved episode #19332312
3306: saved episode #19315073
3307: saved episode #19231606
3308: saved episode #19109545
3309: saved episode #19086781
3310: saved episode #19001340
3311: saved episode #18918047
3312: saved episode #18875343
3313: saved episode #18749350
3314: saved episode #18697098
3315: saved episode #18606036
3316: saved episode #18586584
3317: saved episode #18510344
3318: saved episode #18495909
3319: saved episode #18482738
3320: saved episode #18452017
3321: saved episode #18425070
3322: saved episode #18403152
3323: saved episode #18264072
3324: save