# Simulations Episode Scraper Match Downloader

From Kaggle user robga: https://www.kaggle.com/robga/simulations-episode-scraper-match-downloader

This notebook downloads episodes using Kaggle's GetEpisodeReplay API and the [Meta Kaggle](https://www.kaggle.com/kaggle/meta-kaggle) dataset.

Meta Kaggle is refreshed daily, and the refresh sometimes fails. That's OK: goose keeps well for 24 hours.

Why download replays?
- Train your ML/RL model
- Inspect the performance of your own and others' agents
- Add to your ever-growing JSON collection

Only one scraping strategy is implemented: for each top-scoring submission, download all missing matches, then move on to the next submission.

Other scraping strategies are possible but are not implemented here. Examples: download at most X matches per submission or per team per day, ignore certain teams, ignore matches where some scores are below X, or download only certain teams.
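As an illustration of one such alternative, here is a minimal sketch of a "max X matches per submission" filter. The helper name `cap_episodes_per_submission` and its parameter `max_per_submission` are hypothetical and not part of this notebook. The sketch assumes the same dict shape as `sub_to_episodes` (submission id mapped to a list of episode ids) and assumes that higher episode ids are newer.

```python
from typing import Dict, List


def cap_episodes_per_submission(sub_to_episodes: Dict[int, List[int]],
                                max_per_submission: int) -> Dict[int, List[int]]:
    """Keep only the `max_per_submission` highest episode ids for each submission."""
    capped = {}
    for sub_id, episode_ids in sub_to_episodes.items():
        # Assumption: higher episode ids are newer, so sort descending before truncating.
        capped[sub_id] = sorted(episode_ids, reverse=True)[:max_per_submission]
    return capped


# Toy data only; real ids would come from EpisodeAgents.csv.
example = {101: [5, 9, 7, 3], 202: [8]}
capped = cap_episodes_per_submission(example, max_per_submission=2)
print(capped)
```

The capped dict could then stand in for `sub_to_episodes` in the download loop, leaving the rest of the notebook unchanged.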

Please let me know of any bugs. It's new, and my goose may be cooked.

Todo:
- Add team IDs once Meta Kaggle adds them (a few days away)

In [1]:
import pandas as pd
import numpy as np
import os
import requests
import json
import datetime
import time
import glob
import collections

# Running count of API calls made in this session; checked against MAX_CALLS_PER_DAY
num_api_calls_today = 0

In [2]:
## Configure these to your needs. Choose one of:
# 'hungry-geese', 'rock-paper-scissors', 'santa-2020', 'halite', 'google-football'
COMP = 'hungry-geese'
MAX_CALLS_PER_DAY = 3400
LOWEST_SCORE_THRESH = 1050

In [3]:
META = "episode_scraping/metadata/"
MATCH_DIR = 'episode_scraping/episodes/'
INFO_DIR = 'episode_scraping/infos/'
base_url = "https://www.kaggle.com/requests/EpisodeService/"
get_url = base_url + "GetEpisodeReplay"
BUFFER = 1
COMPETITIONS = {
    'hungry-geese': 25401,
    'rock-paper-scissors': 22838,
    'santa-2020': 24539,
    'halite': 18011,
    'google-football': 21723
}

In [4]:
# Load Episodes
episodes_df = pd.read_csv(META + "Episodes.csv")

# Load EpisodeAgents
epagents_df = pd.read_csv(META + "EpisodeAgents.csv")

print(f'Episodes.csv: {len(episodes_df)} rows before filtering.')
print(f'EpisodeAgents.csv: {len(epagents_df)} rows before filtering.')

episodes_df = episodes_df[episodes_df.CompetitionId == COMPETITIONS[COMP]] 
epagents_df = epagents_df[epagents_df.EpisodeId.isin(episodes_df.Id)]

print(f'Episodes.csv: {len(episodes_df)} rows after filtering for {COMP}.')
print(f'EpisodeAgents.csv: {len(epagents_df)} rows after filtering for {COMP}.')

Episodes.csv: 19360293 rows before filtering.
EpisodeAgents.csv: 42630338 rows before filtering.
Episodes.csv: 131934 rows after filtering for hungry-geese.
EpisodeAgents.csv: 527736 rows after filtering for hungry-geese.


In [5]:
# Prepare dataframes
episodes_df = episodes_df.set_index(['Id'])
episodes_df['CreateTime'] = pd.to_datetime(episodes_df['CreateTime'])
episodes_df['EndTime'] = pd.to_datetime(episodes_df['EndTime'])

# Reassign rather than fill in place: epagents_df is a filtered slice of the original frame
epagents_df = epagents_df.fillna(0)
epagents_df = epagents_df.sort_values(by=['Id'], ascending=False)

In [6]:
# Get top scoring submissions
max_df = (epagents_df.sort_values(by=['EpisodeId'], ascending=False).groupby('SubmissionId').head(1).drop_duplicates().reset_index(drop=True))
max_df = max_df[max_df.UpdatedScore>=LOWEST_SCORE_THRESH]
max_df = pd.merge(left=episodes_df, right=max_df, left_on='Id', right_on='EpisodeId')
sub_to_score_top = pd.Series(max_df.UpdatedScore.values,index=max_df.SubmissionId).to_dict()
print(f'{len(sub_to_score_top)} submissions with score over {LOWEST_SCORE_THRESH}')

488 submissions with score over 1050


In [7]:
# Get episodes for these submissions
sub_to_episodes = collections.defaultdict(list)
for key, value in sorted(sub_to_score_top.items(), key=lambda kv: kv[1], reverse=True):
    eps = sorted(epagents_df[epagents_df['SubmissionId'].isin([key])]['EpisodeId'].values,reverse=True)
    sub_to_episodes[key] = eps
candidates = len(set([item for sublist in sub_to_episodes.values() for item in sublist]))
print(f'{candidates} episodes for these {len(sub_to_score_top)} submissions')

25915 episodes for these 488 submissions


In [8]:
all_files = []
for root, dirs, files in os.walk(MATCH_DIR, topdown=False):
    all_files.extend(files)
seen_episodes = [int(f.split('.')[0]) for f in all_files 
                      if '.' in f and f.split('.')[0].isdigit() and f.split('.')[1] == 'json']
remaining = np.setdiff1d([item for sublist in sub_to_episodes.values() for item in sublist],seen_episodes)
print(f'{len(remaining)} of these {candidates} episodes not yet saved')
print('Total of {} games in existing library'.format(len(seen_episodes)))

12459 of these 25915 episodes not yet saved
Total of 13504 games in existing library
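The filename filter above can also be written as a small helper, which makes the rule easier to test in isolation. This is a minimal sketch, not the notebook's code: `episode_id_from_filename` and `seen_episode_ids` are hypothetical names, and the regular expression is slightly stricter than the original check (it accepts only names of the exact form `<digits>.json`). Returning a set is a design choice for faster membership tests; the download loop below would then need `.add` instead of `.append`.

```python
import re
from typing import Iterable, Optional, Set

# Matches replay files such as "12345.json" but not "12345_info.json".
_EPISODE_FILE = re.compile(r"(\d+)\.json")


def episode_id_from_filename(name: str) -> Optional[int]:
    """Return the episode id encoded in a replay filename, or None if it is not a replay."""
    match = _EPISODE_FILE.fullmatch(name)
    return int(match.group(1)) if match else None


def seen_episode_ids(filenames: Iterable[str]) -> Set[int]:
    """Collect the ids of all replay files in an iterable of filenames."""
    ids = (episode_id_from_filename(name) for name in filenames)
    return {episode_id for episode_id in ids if episode_id is not None}


# Toy filenames only.
print(seen_episode_ids(["123.json", "123_info.json", "notes.txt", "45.json"]))
```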


In [9]:
def create_info_json(epid):
    
    create_seconds = int((episodes_df[episodes_df.index == epid]['CreateTime'].values[0]).item()/1e9)
    end_seconds = int((episodes_df[episodes_df.index == epid]['EndTime'].values[0]).item()/1e9)

    agents = []
    for index, row in epagents_df[epagents_df['EpisodeId'] == epid].sort_values(by=['Index']).iterrows():
        agent = {
            "id": int(row["Id"]),
            "state": int(row["State"]),
            "submissionId": int(row['SubmissionId']),
            "reward": int(row['Reward']),
            "index": int(row['Index']),
            "initialScore": float(row['InitialScore']),
            "initialConfidence": float(row['InitialConfidence']),
            "updatedScore": float(row['UpdatedScore']),
            "updatedConfidence": float(row['UpdatedConfidence']),
            "teamId": int(99999)  # placeholder until Meta Kaggle publishes team ids
        }
        agents.append(agent)

    info = {
        "id": int(epid),
        "competitionId": int(COMPETITIONS[COMP]),
        "createTime": {
            "seconds": int(create_seconds)
        },
        "endTime": {
            "seconds": int(end_seconds)
        },
        "agents": agents
    }

    return info

def saveEpisode(epid):
    # request the replay; raise on HTTP errors instead of writing a bad file
    response = requests.post(get_url, json={"EpisodeId": int(epid)})
    response.raise_for_status()

    # save replay
    with open(MATCH_DIR + '{}.json'.format(epid), 'w') as f:
        f.write(response.json()['result']['replay'])

    # save match info
    info = create_info_json(epid)
    with open(INFO_DIR +  '{}_info.json'.format(epid), 'w') as f:
        json.dump(info, f)
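`saveEpisode` indexes straight into the `['result']['replay']` keys of the decoded response, so a response without them surfaces as a bare `KeyError`. A minimal sketch of a more explicit check follows. The helper name `extract_replay` is hypothetical, and the only payload shape assumed is the one the notebook already relies on.

```python
def extract_replay(payload):
    """Return the replay string from a decoded GetEpisodeReplay response.

    Raises ValueError with a readable message if the expected keys are missing.
    """
    result = payload.get("result") if isinstance(payload, dict) else None
    replay = result.get("replay") if isinstance(result, dict) else None
    if not isinstance(replay, str) or not replay:
        raise ValueError("response does not contain a replay string")
    return replay


# Toy payloads only; real ones come from the decoded HTTP response.
print(extract_replay({"result": {"replay": "{}"}}))
```

Calling this on the decoded response before opening the output file would avoid leaving an empty replay file behind when the request fails.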

In [10]:
r = BUFFER  # requests budgeted so far; used to pace at roughly one request per second

start_time = datetime.datetime.now()
se = 0  # episodes saved in this run
for key, value in sorted(sub_to_score_top.items(), key=lambda kv: kv[1], reverse=True):
    if num_api_calls_today <= MAX_CALLS_PER_DAY:
        print('')
        remaining = sorted(np.setdiff1d(sub_to_episodes[key], seen_episodes), reverse=True)
        print(f'submission={key}, LB={value:.0f}, matches={len(set(sub_to_episodes[key]))}, still to save={len(remaining)}')

        for epid in remaining:
            if epid not in seen_episodes and num_api_calls_today <= MAX_CALLS_PER_DAY:
                saveEpisode(epid)
                r += 1
                se += 1
                try:
                    # getsize raises OSError if the replay file was not written
                    os.path.getsize(MATCH_DIR + '{}.json'.format(epid))
                    print(str(num_api_calls_today) + f': saved episode #{epid}')
                    seen_episodes.append(epid)
                    num_api_calls_today += 1
                except OSError:
                    print('  file {}.json did not seem to save'.format(epid))
                # sleep until the elapsed seconds catch up with the number of requests made
                elapsed = (datetime.datetime.now() - start_time).total_seconds()
                if r > elapsed:
                    time.sleep(r - elapsed)
            if num_api_calls_today >= min(3600, MAX_CALLS_PER_DAY):
                break
print('')
print(f'Episodes saved: {se}')


submission=19722457, LB=1307, matches=30, still to save=0

submission=19728229, LB=1287, matches=28, still to save=0

submission=19721626, LB=1287, matches=30, still to save=0

submission=19719024, LB=1255, matches=31, still to save=0

submission=19725734, LB=1246, matches=30, still to save=0

submission=19727623, LB=1233, matches=29, still to save=0

submission=19722219, LB=1221, matches=31, still to save=0

submission=19725921, LB=1220, matches=29, still to save=0

submission=19711300, LB=1212, matches=43, still to save=0

submission=19721013, LB=1211, matches=33, still to save=0

submission=19729286, LB=1210, matches=28, still to save=0

submission=19708538, LB=1199, matches=35, still to save=0

submission=19729402, LB=1195, matches=33, still to save=0

submission=19740432, LB=1182, matches=24, still to save=0

submission=19747794, LB=1180, matches=23, still to save=0

submission=19711133, LB=1164, matches=41, still to save=0

submission=19533620, LB=1158, matches=93, still to save

submission=19722120, LB=1080, matches=35, still to save=0

submission=19645304, LB=1080, matches=75, still to save=0

submission=19745165, LB=1080, matches=23, still to save=0

submission=19676284, LB=1079, matches=51, still to save=0

submission=19529174, LB=1079, matches=128, still to save=0

submission=19622511, LB=1079, matches=82, still to save=0

submission=19532556, LB=1079, matches=112, still to save=0

submission=19513042, LB=1079, matches=130, still to save=0

submission=19586963, LB=1078, matches=79, still to save=0

submission=19603402, LB=1078, matches=88, still to save=0

submission=19537603, LB=1078, matches=117, still to save=0

submission=19706986, LB=1077, matches=41, still to save=0

submission=19591514, LB=1077, matches=84, still to save=0

submission=19537771, LB=1077, matches=123, still to save=0

submission=19546013, LB=1077, matches=93, still to save=0

submission=19537041, LB=1077, matches=95, still to save=0

submission=19457211, LB=1077, matches=135, still to

KeyboardInterrupt: 
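The download loop above paces itself by comparing a request counter with the seconds elapsed since the start, sleeping whenever the counter is ahead. A minimal sketch of that rule as a standalone function follows. The name `seconds_to_wait` and the `min_interval` parameter are hypothetical; the one-second default mirrors the loop's counter-versus-seconds comparison.

```python
def seconds_to_wait(requests_made: int, elapsed_seconds: float,
                    min_interval: float = 1.0) -> float:
    """Return how long to sleep so the average rate stays at or below
    one request per `min_interval` seconds."""
    # The n-th request should not complete before n * min_interval seconds have passed.
    earliest_allowed = requests_made * min_interval
    return max(0.0, earliest_allowed - elapsed_seconds)


# Five requests after 3.5 s: ahead of schedule, so wait.
print(seconds_to_wait(requests_made=5, elapsed_seconds=3.5))
# Five requests after 9 s: behind schedule, so no wait.
print(seconds_to_wait(requests_made=5, elapsed_seconds=9.0))
```

Clamping at zero matters because `time.sleep` rejects negative values.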