<a href="https://colab.research.google.com/github/AseiSugiyama/PokemonAnalytics/blob/master/scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scrape Pokémon battle results dataset 

To analyze Pokémon battles, we need a dataset of battle results. We are going to build it from [Pokémon Showdown](https://pokemonshowdown.com/) battle replays. In this notebook, we use [Requests](https://2.python-requests.org/en/master/#) to download battle results and [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) to parse them.

Almost all of the code is from [odanado/poke2vec](https://github.com/odanado/poke2vec). This notebook adds some explanation to it.

## Setup

In [0]:
import json
import time
from pathlib import Path
import logging

import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

In [0]:
LADDER_URL = 'http://pokemonshowdown.com/ladder/{ladder}'
USERNAME_URL = ('http://replay.pokemonshowdown.com/search/?output=html&'
                'user={user}&format=&page={page}&output=html')
REPLAY_URL = 'http://replay.pokemonshowdown.com/{replay_id}'
SLEEP = 0.8
save_dir = './data/battle'

We can choose the "generation". The default is generation 8 (Sword and Shield). If you want to collect Pokémon up to generation 7 (Sun and Moon) instead, uncomment the second line in the following cell.

In [0]:
ladder = 'gen8battlestadiumsingles' # gen8
# ladder = 'gen7battlespotsingles' # gen7

## Download top users

In [0]:
def top_users(save_dir, ladder):
    save_dir = Path(save_dir)
    save_dir.mkdir(parents=True, exist_ok=True)
    save_file = save_dir / '{}_top_users.json'.format(ladder)

    url = LADDER_URL.format(ladder=ladder)
    text = requests.get(url).text
    soup = BeautifulSoup(text, 'html.parser')
    users = [a.get('href')
             for a in soup.find_all('a', {'class': 'subtle'})]
    users = [Path(user).name for user in users]

    save_file.write_text(json.dumps({'ladder': ladder, 'users': users}))

In [0]:
top_users(save_dir, ladder)

## Download replay ids

Pokémon Showdown provides battle replays. You can watch each saved battle in a graphical game simulator, for example [[Gen 8] Battle Stadium Singles replay: zvws77 vs. ABCWXYZ - Pokémon Showdown](https://replay.pokemonshowdown.com/gen8battlestadiumsingles-1044334093). Its URI has the form `https://replay.pokemonshowdown.com/{ladder}-{replayid}`. Let's get replay IDs from the top users list!
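
As a small illustration of that URI form, the sketch below joins and splits a ladder name and a numeric ID. The helper names `replay_uri` and `split_replay_path` are hypothetical and not part of the original code; the example values come from the replay linked above, and the leading-slash path format is an assumption.

```python
REPLAY_BASE = 'https://replay.pokemonshowdown.com'


def replay_uri(ladder, replay_number):
    """Build a replay URI of the form {base}/{ladder}-{replayid}."""
    return f'{REPLAY_BASE}/{ladder}-{replay_number}'


def split_replay_path(path):
    """Split a path such as '/ladder-123' into (ladder, replay number)."""
    # rpartition splits on the last '-', so a ladder name that itself
    # contains '-' would stay in one piece.
    ladder, _, replay_number = path.strip('/').rpartition('-')
    return ladder, replay_number


print(replay_uri('gen8battlestadiumsingles', '1044334093'))
print(split_replay_path('/gen8battlestadiumsingles-1044334093'))
```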

In [0]:
def replay_ids(save_dir, users_file):
    save_dir = Path(save_dir)

    users_file = Path(users_file)
    data = json.loads(users_file.read_text())

    ladder = data['ladder']
    users = data['users']

    save_file = save_dir / '{}_replay_ids.json'.format(ladder)

    all_replay_ids = {}

    for user in tqdm(users):
        logging.info('user = {}'.format(user))
        replay_ids = []
        already_ids = set()
        for page in range(1, 100):
            url = USERNAME_URL.format(
                user=user,
                page=page
            )
            html = requests.get(url).text
            time.sleep(SLEEP)
            soup = BeautifulSoup(html, 'html.parser')
            links = soup.find_all('a')
            ids = [link.get('href') for link in links]
            # No links at all: there are no more result pages.
            if len(ids) == 0:
                break

            # Keep only replay links for this ladder; skip anchors without href.
            ids = [x for x in ids if x and ladder in x]
            if len(ids) == 0:
                continue
            # Seeing a known ID again means the results started repeating.
            if ids[0] in already_ids or ids[-1] in already_ids:
                break

            replay_ids += ids
            already_ids |= set(ids)
        logging.info(len(replay_ids))
        all_replay_ids[user] = replay_ids

    save_file.write_text(json.dumps(
        {'ladder': ladder, 'replay_ids': all_replay_ids}))

Note: This will take over **30 minutes** to finish.

In [0]:
replay_ids(save_dir, save_dir + f'/{ladder}_top_users.json')

## Download each battle log

We can get battle logs in any of the following ways:

1. Download replay page ( https://replay.pokemonshowdown.com/{ladder}-{replayid} ) and parse it.
2. Download logs from https://replay.pokemonshowdown.com/{ladder}-{replayid}.log
3. Download annotated json from https://replay.pokemonshowdown.com/{ladder}-{replayid}.json

In this notebook, we download the replay page and parse it (1). If you are starting from scratch, downloading the annotated JSON (3) is recommended.
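
The three options differ only in the URL suffix. The sketch below builds all three addresses for one replay without making any request; `replay_sources` is a hypothetical helper that is not part of the original code, and the replay ID is the example linked earlier.

```python
def replay_sources(replay_id):
    """Return the three download URLs for a replay ID like '{ladder}-{replayid}'."""
    base = f'https://replay.pokemonshowdown.com/{replay_id}'
    return {
        'page': base,            # (1) HTML replay page, parsed in this notebook
        'log': base + '.log',    # (2) plain battle log
        'json': base + '.json',  # (3) annotated JSON
    }


for kind, url in replay_sources('gen8battlestadiumsingles-1044334093').items():
    print(kind, url)
```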

In [0]:
def battle_logs(save_dir, replay_ids_file):
    save_dir = Path(save_dir)

    replay_ids_file = Path(replay_ids_file)
    data = json.loads(replay_ids_file.read_text())

    ladder = data['ladder']
    replay_ids = data['replay_ids']

    save_file = save_dir / '{}_battle_logs.json'.format(ladder)
    battle_logs = {}
    sorted_replay_ids = sorted(replay_ids.items(), key=lambda x: x[0])

    for user, replay_id_list in tqdm(sorted_replay_ids):
        logging.info('user = {}'.format(user))
        logs = []
        for replay_id in replay_id_list:
            html = requests.get(REPLAY_URL.format(replay_id=replay_id)).text
            soup = BeautifulSoup(html, 'html.parser')
            time.sleep(SLEEP)
            log = soup.find('script', {'class': 'log'}).text
            assert len(log) != 0
            logs.append(log)

        battle_logs[user] = logs

    save_file.write_text(json.dumps(
        {'ladder': ladder, 'battle_logs': battle_logs}))


Note: This will take over **60 minutes** to finish.

In [0]:
battle_logs(save_dir, save_dir + f"/{ladder}_replay_ids.json")

## Parse battle logs

Battle logs from Pokémon Showdown contain every action the players take. However, we only need the names of the Pokémon in each player's party. Let's parse the logs and extract them!

In [0]:
import re
USER_PLAYER = re.compile(r"\|player\|(?P<player>.+?)\|(?P<username>.+?)\|.*?")
POKE = re.compile(r"\|poke\|(?P<player>.+?)\|(?P<poke>.+?)\|.*?")


def to_id(name):
    return re.sub(r'[^a-z0-9]+', '', name.lower())


def parse_logs(save_dir, battle_logs_file):
    print(save_dir, battle_logs_file)
    save_dir = Path(save_dir)
    battle_logs_file = Path(battle_logs_file)

    data = json.loads(battle_logs_file.read_text())

    ladder = data['ladder']
    battle_logs = data['battle_logs']

    save_file = save_dir / '{}_parsed_battle_logs.json'.format(ladder)

    players_list = []
    pokes_list = []

    for user, battle_log_list in sorted(battle_logs.items(),
                                        key=lambda x: x[0]):
        logging.info('user = {}'.format(user))
        for battle_log in battle_log_list:
            players = {}
            matches = USER_PLAYER.findall(battle_log)
            for match in matches:
                players[match[0]] = to_id(match[1])

            pokes = {}
            matches = POKE.findall(battle_log)
            for match in matches:
                player, poke = match
                poke = to_id(poke.split(',')[0])

                if player not in pokes:
                    pokes[player] = []

                pokes[player].append(poke)

            players_list.append(players)
            pokes_list.append(pokes)

    save_file.write_text(json.dumps({
        'ladder': ladder,
        'players': players_list,
        'pokes': pokes_list
    }))


In [0]:
parse_logs(save_dir, save_dir + f'/{ladder}_battle_logs.json')
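
To see what the two patterns above extract, here is a minimal sketch that applies them to a made-up log fragment. The player names, Pokémon lines, and trailing fields in `sample_log` are invented for illustration and only imitate the `|player|` and `|poke|` line shapes the patterns expect; the patterns and `to_id` are copied from the cell above so the block runs on its own.

```python
import re

USER_PLAYER = re.compile(r"\|player\|(?P<player>.+?)\|(?P<username>.+?)\|.*?")
POKE = re.compile(r"\|poke\|(?P<player>.+?)\|(?P<poke>.+?)\|.*?")


def to_id(name):
    # Lowercase and drop everything except letters and digits.
    return re.sub(r'[^a-z0-9]+', '', name.lower())


# Hypothetical log fragment, not taken from a real battle.
sample_log = '\n'.join([
    '|player|p1|Alice Example|1|',
    '|player|p2|Bob_42|2|',
    '|poke|p1|Dragapult, L50, M|item',
    '|poke|p1|Mimikyu, L50, F|item',
    '|poke|p2|Rotom-Wash, L50|item',
])

players = {slot: to_id(name) for slot, name in USER_PLAYER.findall(sample_log)}

parties = {}
for slot, details in POKE.findall(sample_log):
    # The species name is the part before the first comma.
    parties.setdefault(slot, []).append(to_id(details.split(',')[0]))

print(players)  # {'p1': 'aliceexample', 'p2': 'bob42'}
print(parties)  # {'p1': ['dragapult', 'mimikyu'], 'p2': ['rotomwash']}
```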

## Load the bag of Pokémon

We have now finished building our dataset. Let's check it.

In [0]:
parsed_battle_logs_file = save_dir + f'/{ladder}_parsed_battle_logs.json'
parsed_battle_logs_file = Path(parsed_battle_logs_file)
data = json.loads(parsed_battle_logs_file.read_text())

In [0]:
data["pokes"][:3]
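
As a quick sanity check, one could also count how often each Pokémon appears. The sketch below does this on a tiny made-up list that only imitates the shape of `data["pokes"]` (one dict per battle, mapping `p1` and `p2` to lists of names); `sample_pokes` is hypothetical and does not come from the scraped data.

```python
from collections import Counter

# Hypothetical stand-in for data["pokes"].
sample_pokes = [
    {'p1': ['dragapult', 'mimikyu'], 'p2': ['rotomwash', 'dragapult']},
    {'p1': ['mimikyu'], 'p2': ['dragapult']},
]

counts = Counter(
    name
    for battle in sample_pokes
    for party in battle.values()
    for name in party
)

print(counts.most_common(2))  # [('dragapult', 3), ('mimikyu', 2)]
```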

## Data cleansing

To make our dataset easier to use, let's clean it up! We will do the following:

1. Flatten it; we do not need to know whether a party belongs to `p1` or `p2`.
2. Standardize it by keeping only parties that have exactly six Pokémon.
3. Split it into training and validation datasets.
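
The three steps can be sketched on toy data as follows. This is a simplified illustration, not the notebook's implementation: `flatten_parties` and `split_parties` are hypothetical names, the single-letter Pokémon names are placeholders, and the standard library's `random.Random` stands in for NumPy's shuffle.

```python
import random


def flatten_parties(battles, party_size=6):
    """Steps 1 and 2: drop the p1/p2 distinction, keep only full parties."""
    parties = []
    for battle in battles:
        for party in battle.values():
            if len(party) == party_size:
                # Sort so the same party in a different order compares equal.
                parties.append(tuple(sorted(party)))
    return parties


def split_parties(parties, seed=42):
    """Step 3: deduplicate, shuffle, and hold out one tenth for validation."""
    unique = sorted(set(parties))
    random.Random(seed).shuffle(unique)
    n_valid = len(unique) // 10
    return unique[n_valid:], unique[:n_valid]


# Toy battles with placeholder names.
battles = [
    {'p1': list('abcdef'), 'p2': list('fedcba')},  # same party, different order
    {'p1': ['a', 'b'], 'p2': list('ghijkl')},      # p1 is incomplete
    {},                                            # log without any party
]

parties = flatten_parties(battles)
train, valid = split_parties(parties)
print(len(parties), len(train), len(valid))  # 3 2 0
```

With only two unique parties the validation split is empty; the one-tenth hold-out only becomes meaningful on a dataset of realistic size.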

In [0]:
import numpy as np
import random


def set_seed(random_seed):
    random.seed(random_seed)
    np.random.seed(random_seed)


def preprocess(save_dir, parsed_battle_logs_file, random_seed=42):
    set_seed(random_seed)
    save_dir = Path(save_dir)
    parsed_battle_logs_file = Path(parsed_battle_logs_file)

    data = json.loads(parsed_battle_logs_file.read_text())
    ladder = data['ladder']

    save_file = save_dir / '{}_dataset.json'.format(ladder)

    pokes = []
    for poke in data['pokes']:
        if not poke:
            continue
        # Use .get so a log that lists only one player's party does not raise.
        for player in ('p1', 'p2'):
            party = poke.get(player, [])
            if len(party) == 6:
                pokes.append(tuple(sorted(party)))

    uniq_pokes = list(set(pokes))

    logging.info('reduce {} -> {} ({:.03f} %)'
                .format(len(pokes), len(uniq_pokes),
                        100 * len(uniq_pokes) / len(pokes)))

    np.random.shuffle(uniq_pokes)
    N = len(uniq_pokes)

    train = uniq_pokes[N // 10:]
    valid = uniq_pokes[:N // 10]

    save_file.write_text(json.dumps({
        'ladder': ladder,
        'train': train,
        'valid': valid
    }))

In [0]:
preprocess(save_dir, save_dir + f'/{ladder}_parsed_battle_logs.json')

Let's check it.

In [0]:
parsed_battle_logs_file = save_dir + f'/{ladder}_dataset.json'
parsed_battle_logs_file = Path(parsed_battle_logs_file)
data = json.loads(parsed_battle_logs_file.read_text())

In [0]:
data["train"][:3]