Author: Kayleb Klapp
Date: 11 May, 2025
File: parse_file.ipynb

The purpose of this program is to parse .SC2Replay files into a feature dataset. The replay files were downloaded with a web scraper from https://lotv.spawningtool.com/replays/

The program reads the files in the "replays" directory, parses them into memory, writes one metadata file (.abt) per replay, and finally writes an output CSV with all features and metadata.

The file_name feature references a .abt file, which contains the name of the actual replay file along with the rest of that replay's metadata and summary.


Imports and file directory variables. All are self-explanatory except abt_file_info_path, which points to a text file holding the next unique number used in .abt file names, so that .abt files never overwrite each other.

See https://github.com/ZephyrBlu/zephyrus-sc2-parser?tab=readme-ov-file for information on zephyrus_sc2_parser
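
The numbering scheme described above can be sketched in isolation. This is a minimal illustration, not the notebook's own function: `next_abt_filename` is a hypothetical helper, and treating a missing counter file as "start from 0" is an assumption made only for the demo.

```python
import os
import tempfile


def next_abt_filename(counter_path):
    """Return a zero-padded hex .abt name and store counter + 1 for next time."""
    # Assumption for this sketch: a missing counter file means "start from 0".
    if os.path.exists(counter_path):
        with open(counter_path, "r") as f:
            num = int(f.read().strip())
    else:
        num = 0
    # Write the incremented value back to the SAME file that was read.
    with open(counter_path, "w") as f:
        f.write(str(num + 1))
    return f"{num:05X}.abt"


# Demonstration in a throwaway directory so no real counter file is touched.
demo_dir = tempfile.mkdtemp()
demo_counter = os.path.join(demo_dir, "about_file_number.txt")
names = [next_abt_filename(demo_counter) for _ in range(3)]
print(names)  # ['00000.abt', '00001.abt', '00002.abt']
```

Because the counter is persisted between calls, names stay unique across separate runs as long as the counter file is not edited or deleted.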

In [None]:
from zephyrus_sc2_parser import parse_replay
from glob import glob
import os
import numpy as np
import time

replay_path = "replays"
metadata_path = "replay_metadata"
abt_file_info_path = "about_file_number.txt"
output_file = "output.csv"


Header information for the output CSV. The columns are ordered as follows:
    admin_header -> player 1 generic header (p1header) -> player 2 generic header (p2header) -> admin_footer

Any new feature added to the parsing must also be added to the header here.
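
As a small illustration of that ordering, the sketch below assembles a full header from a shortened feature list. `build_header` is a hypothetical helper introduced only for this example, and the two-item `generic_header` is a stand-in for the real list.

```python
admin_header = ["id", "file_name", "gametime"]
admin_footer = ["winner"]
# Shortened stand-in for the real feature list, to keep the output readable.
generic_header = ["unspent_minerals", "unspent_gas"]


def build_header(admin_header, generic_header, admin_footer):
    """Prefix the generic feature names per player and join the four parts."""
    p1 = ["p1_" + name for name in generic_header]
    p2 = ["p2_" + name for name in generic_header]
    return admin_header + p1 + p2 + admin_footer


full_header = build_header(admin_header, generic_header, admin_footer)
print(",".join(full_header))
print(len(full_header))  # 3 admin + 2 per player * 2 players + 1 footer = 8
```

Adding one name to `generic_header` adds two columns, one per player, which is why the feature extraction and the header have to be kept in step.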

In [None]:

admin_header = [
    "id",
    "file_name",
    "gametime"
]

admin_footer = [
    "winner"
]

# This is the CSV header sans admin. Admin includes player prefixes, ids, etc.
generic_header = [
        "unspent_minerals",
        "unspent_gas",
        "unit_count",
        "building_count",
        "upgrade_count",
        "active_workers",
        "supply_cap",
        "total_gas_collected",
        "total_minerals_collected",
        "total_army_value"
        ]

p1header = ["p1_" + val for val in generic_header]
p2header = ["p2_" + val for val in generic_header]

File processing for .SC2Replay files.
The order of operations should be:
    Parse the files
    Get the metadata and game information
    Write metadata to a file (.abt file)
    For each point in time in each file:
        extract features for both players

Strictly file I/O operations. No feature extraction or handling is done here.

In [None]:

# Gets all of the specified file type from the replay path directory.
def get_replay_files():
    replay_files = glob(os.path.join(replay_path, "*.SC2Replay"))
    return replay_files

# Parses the files into replay data structures. No features are extracted here.
def parse_replay_files(files):
    replays = list()
    badfilecount = 0
    goodfilecount = 0
    for i, file in enumerate(files):
        # parse_replay can raise on some replays, so failures are counted and skipped.
        try:
            replay = parse_replay(file)
            replay.metadata["filename"] = file
            replays.append(replay)
            goodfilecount += 1
        except Exception:
            badfilecount += 1

        if i % 1000 == 0:
            print(f"{i}/{len(files)} files parsed...")

    total = goodfilecount + badfilecount
    print("Parsing Complete:")
    print(f"\tReplays Failed:    {badfilecount}")
    print(f"\tReplays Succeeded: {goodfilecount}")
    # Guard against division by zero when no replay files were found.
    if total > 0:
        print(f"\tSuccess Rate: {goodfilecount / total}")
    return replays

# Makes a .abt file with the metadata, and a pointer to the replay file
def make_replay_summary_file(replay):
    # This makes a unique filename, as long as
    # abt_file_info_path hasn't been tampered with
    with open(abt_file_info_path, "r") as f:
        num = int(f.read())
    filename = f"{num:05X}.abt"

    # Writes all the metadata to the file. This can fail on some replays,
    # apparently because of characters from other languages in the metadata.
    try:
        with open(os.path.join(metadata_path, filename), "w") as summary_file:
            for key, value in replay.metadata.items():
                summary_file.writelines([str(key), str(value), "\n"])
            for key, value in replay.summary.items():
                summary_file.writelines([str(key), str(value), "\n"])
    except Exception:
        # These errors are really rare (<1/10000), so the affected replays
        # are simply recorded without a summary file.
        filename = "nofile"
        print("Language string problem found.")
    finally:
        # Store the next number in the same file it was read from, so that
        # later .abt files never reuse a name.
        num += 1
        with open(abt_file_info_path, "w") as f:
            f.write(str(num))
    return filename

### Feature Extraction

These are the functions that do the actual feature extraction. The features developed so far are:
        unspent_minerals - An unused resource is generally bad.
        unspent_gas      - A scarcer resource than minerals.
        unit_count       - The number of units. StarCraft II counts some units as more than one.
        building_count   - The number of buildings. Buildings do a lot of things, so this is low resolution.
        upgrade_count    - The number of upgrade buildings.
        active_workers   - Workers collect resources and build buildings.
        supply_cap       - The maximum unit supply a player can currently reach.
        total_gas_collected         - Total amount of gas collected in the whole game
        total_minerals_collected    - Total amount of minerals collected in the whole game
        total_army_value            - The sum of the value of the army units

Features are collected for each recorded state of the game, so a long game can yield potentially thousands of observations.
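
To make the "one row per game state" idea concrete, here is a minimal sketch that flattens a mock timeline into rows. The mock data is invented for illustration and only imitates the shape the notebook relies on (a state indexed by player number 1 and 2, each holding a `gameloop` value); `rows_from_timeline` and `feature_keys` are hypothetical names.

```python
# Mock timeline: each state maps player number -> that player's state.
# All values are made up for this example.
mock_timeline = [
    {1: {"gameloop": 0, "workers_active": 12, "supply_cap": 15},
     2: {"gameloop": 0, "workers_active": 12, "supply_cap": 15}},
    {1: {"gameloop": 112, "workers_active": 13, "supply_cap": 15},
     2: {"gameloop": 112, "workers_active": 14, "supply_cap": 23}},
]

feature_keys = ["workers_active", "supply_cap"]


def rows_from_timeline(timeline, keys):
    """Build one flat row per game state: gameloop, then p1 and p2 features."""
    rows = []
    for state in timeline:
        row = [state[1]["gameloop"]]
        for player in (1, 2):
            row.extend(state[player][key] for key in keys)
        rows.append(row)
    return rows


rows = rows_from_timeline(mock_timeline, feature_keys)
for row in rows:
    print(row)
# [0, 12, 15, 12, 15]
# [112, 13, 15, 14, 23]
```

The number of rows equals the number of states in the timeline, so the dataset grows with game length rather than with the number of games alone.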

In [None]:
# Highest level per game feature extraction
def get_features_from_replay(replay, make_summary_file=False):
    play_features = list()
    for gamestate in replay.timeline:
        return_val = get_features_from_gamestate(gamestate) 
        play_features.append(return_val)

    filename = "nofile"
    if make_summary_file:
        # make_replay_summary_file takes only the replay as its argument.
        filename = make_replay_summary_file(replay)

    for i in range(len(play_features)):
        play_features[i] = [filename] + play_features[i]
    
    return play_features

# Per frame feature extraction
def get_features_from_gamestate(gamestate): 
    state_features = list([gamestate[1]['gameloop']])
    for player in [1,2]:
        state_features.extend(get_features_for_player_state(gamestate[player]))
    return state_features

# Per player per frame feature extraction
def get_features_for_player_state(playerstate):
    resources = playerstate["unspent_resources"]
    unspent_minerals = resources["minerals"]
    unspent_gas = resources["gas"]
    unit_count = len(playerstate["unit"])
    building_count = len(playerstate["building"])
    upgrade_count = len(playerstate["upgrade"])
    active_workers = playerstate["workers_active"]
    supply_cap = playerstate["supply_cap"]
    total_resources_collected = playerstate["resources_collected"]
    total_gas_collected = total_resources_collected["gas"]
    total_minerals_collected = total_resources_collected["minerals"]
    total_army_value = playerstate["total_army_value"]

    # The purpose of all this extra complexity is to make the header dynamic.
    # The dictionary keys are listed in the generic_header variable, and the
    # values are looked up in this dictionary in that same order.
    feature_dict = dict({
        "unspent_minerals":unspent_minerals,
        "unspent_gas":unspent_gas,
        "unit_count":unit_count,
        "building_count":building_count,
        "upgrade_count":upgrade_count,
        "active_workers":active_workers,
        "supply_cap":supply_cap,
        "total_gas_collected":total_gas_collected,
        "total_minerals_collected":total_minerals_collected,
        "total_army_value":total_army_value
    })

    feature_list = [feature_dict[header_key] for header_key in generic_header]
    return list(feature_list)


### Main

The rest of this notebook calls the functions above and builds a numpy array to output to a file.

Read and parse the files. This is a lengthy process.

In [None]:
files = get_replay_files()
print(f"""
      Looking for .SC2Replay files
      {len(files)} files found.
      Starting parsing.
      """)

replays = parse_replay_files(files)
total_states = 0
for game in replays:
    total_states += len(game.timeline)

print(f"""
      Parsing Complete.
      Starting Feature Processing on {total_states} observations.
      """)

Process each game for features, and output the table to a csv file.
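
For comparison, the row layout (id, then the per-state observation, then the winner) can be sketched with the standard-library `csv` module instead of `np.savetxt`. This is an alternative shown under stated assumptions, not the notebook's method: the header is shortened, and the observations and `winner` value are invented.

```python
import csv
import io

# Shortened, hypothetical header and data for illustration only.
header = ["id", "file_name", "gametime", "p1_supply_cap", "p2_supply_cap", "winner"]
observations = [
    ["00000.abt", 0, 15, 15],
    ["00000.abt", 112, 15, 23],
]
winner = 1  # made-up value standing in for the replay's recorded winner

# Write to an in-memory buffer; a real run would open the output file instead.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
for row_id, obs in enumerate(observations):
    # Same layout as the notebook: running id, observation, then winner.
    writer.writerow([row_id] + obs + [winner])

csv_text = buffer.getvalue()
print(csv_text)
```

One practical difference is that `csv.writer` quotes any field that itself contains a comma, while a plain `"%s"` format does not.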

In [None]:

total_columns = (len(p1header)) + len(p2header) + len(admin_header) + len(admin_footer)

all_features = np.zeros((total_states, total_columns), dtype=object)
row_index = 0
for i,replay in enumerate(replays):  
    obs = get_features_from_replay(replay, make_summary_file=True)
    for observation in obs:
        all_features[row_index] = [row_index] + observation + [replay.metadata["winner"]]
        row_index += 1

# Join the header with "," only, so column names do not start with a space.
np.savetxt(output_file, all_features, fmt="%s", delimiter=",", comments="", header=",".join(admin_header + p1header + p2header + admin_footer))
print(all_features.shape)
print("Done.")