#### WARNING: DO NOT RUN THIS NOTEBOOK END TO END, IT CONTAINS A COUPLE OF CELLS WHICH EXECUTE LONG-RUNNING PROCESSES

In this notebook we extract the raw data from the match files. These were already parsed from .demo to .json; we couldn't attach the original .demo files to this upload because they weigh 4.8 GB.

First we load the graph created in the previous notebook and define a few file and directory names.

In [3]:
import pickle
import json
import os
import kdtree

graph = None
with open('graph.pickle', 'rb') as handle:
    graph = pickle.load(handle)

points, _, connect_to, connect_from = graph

filenames_dir_train = "parsed_demos"
middle_result_name_train = "movements_raw_train.pickle"
final_result_name_train = "samples_train.pickle"

filenames_dir_test = os.path.join("parsed_demos", "test")
middle_result_name_test = "movements_raw_test.pickle"
final_result_name_test = "samples_test.pickle"

Since the connections are directed, we also keep an "inverse" of the set of edges (`connect_from`, loaded with the graph above); this will be useful later.
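As an illustration of what "inverse" means here, the sketch below derives incoming edges from outgoing ones. It assumes `connect_to` is a list where entry `i` lists the cells reachable from cell `i`; the function name `invert_edges` and the toy graph are hypothetical, and the real construction lives in the previous notebook.

```python
def invert_edges(connect_to):
    """For each cell, list the cells that have a directed edge into it."""
    connect_from = [[] for _ in connect_to]
    for source, targets in enumerate(connect_to):
        for target in targets:
            connect_from[target].append(source)
    return connect_from


# Toy directed graph (made up): 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
toy_connect_to = [[1, 2], [2], [0]]
toy_connect_from = invert_edges(toy_connect_to)
print(toy_connect_from)  # [[2], [0], [0, 1]]
```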

Then, since this task is pretty heavy to run, we make use of our *extensive* knowledge of algorithms and data structures so that what we are about to run is not a naive algorithm.

For background: k-d trees are space-partitioning data structures (common in 3D graphics and physics simulations) that speed up spatial queries such as collision checks and nearest-neighbor lookups, since they use the location of the objects to discard most candidates early. Here we use one to find the graph cell closest to a player's position.

The `kdtree` module we import is a .py file we wrote ourselves; the implementation is quite naive, but effective nonetheless.

In [4]:
import kdtree

data_struct = kdtree.construct_2d_tree([(point[0], point[1], point[2], i) for i,point in enumerate(points)])
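Our `kdtree.py` is not reproduced here, so the following is only a minimal sketch of the idea, not the module itself. It assumes points are `(x, y, z, index)` tuples, that the tree splits on x and y only (our guess at what "2d" means in `construct_2d_tree`), and that distance is measured in 3D. The names `build_kdtree` and `find_closest` below are illustrative stand-ins.

```python
from math import dist  # Python 3.8+


def build_kdtree(items, depth=0, dims=2):
    """Build a k-d tree over (x, y, z, index) tuples, splitting on the first `dims` axes."""
    if not items:
        return None
    axis = depth % dims
    items = sorted(items, key=lambda p: p[axis])
    mid = len(items) // 2
    return {
        "item": items[mid],
        "axis": axis,
        "left": build_kdtree(items[:mid], depth + 1, dims),
        "right": build_kdtree(items[mid + 1:], depth + 1, dims),
    }


def find_closest(node, target, best=None):
    """Return the stored tuple nearest to target (x, y, z) by 3D Euclidean distance."""
    if node is None:
        return best
    item = node["item"]
    if best is None or dist(item[:3], target) < dist(best[:3], target):
        best = item
    axis = node["axis"]
    diff = target[axis] - item[axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = find_closest(near, target, best)
    # Cross the splitting plane only if a closer point could lie beyond it
    if abs(diff) < dist(best[:3], target):
        best = find_closest(far, target, best)
    return best


toy_points = [(0.0, 0.0, 0.0, 0), (10.0, 0.0, 0.0, 1), (0.0, 10.0, 5.0, 2)]
toy_tree = build_kdtree(toy_points)
print(find_closest(toy_tree, (9.0, 1.0, 0.0))[3])  # 1
```

The pruning step is what avoids the naive scan over every point: a whole subtree is skipped whenever the splitting plane is farther away than the best candidate found so far.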

Next we load all the .json files we are interested in.

In [5]:
filenames = os.listdir(filenames_dir_train)

datas = dict()

for filename in filenames:
    name, ext = os.path.splitext(filename)

    if ext != ".json":
        continue

    print(f"Loading {filename}")
    
    with open(os.path.join(filenames_dir_train, filename), 'rb') as handle:
        datas[filename] = json.load(handle)

Loading astralis-vs-g2-m1-dust2.json
Loading catevil-vs-nexga-dust2.json
Loading ence-vs-faze-m4-dust2.json
Loading faze-vs-big-m1-dust2.json
Loading g2-vs-spirit-dust2.json
Loading natus-vincere-vs-faze-m2-dust2.json
Loading outsiders-vs-big-dust2.json
Loading players-vs-astralis-m2-dust2.json
Loading vitality-vs-mouz-m2-dust2.json
Loading vitality-vs-outsiders-m1-dust2.json


Then we define a function that returns the cell a player has moved to, in two forms: as an absolute cell index in the whole structure (needed to keep track of the player's position while "scraping" the round) and as a choice relative to the cell they are coming from (0 for staying put, `i + 1` for the `i`-th outgoing neighbor).

In [6]:
def actual_movement(next_position, prev_cell):
    next_cell = data_struct.find_closest(next_position[0], next_position[1], next_position[2])[0]
    
    if prev_cell == next_cell:
        return 0, next_cell
    
    for i, val in enumerate(connect_to[prev_cell]):
        if val == next_cell:
            return i + 1, next_cell
    # Not a direct neighbor. The player may have moved fast enough to skip
    # over a cell and land on a second-degree neighbor; if so, we record
    # the first-hop neighbor as the choice.
    for j, other_cell in enumerate(connect_to[prev_cell]):
        if next_cell in connect_to[other_cell]:
            return j + 1, next_cell

    # No path of length <= 2 explains the move: flag it as bad
    return None, next_cell
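To make the encoding concrete, here is a small sketch of the same rule on a made-up graph, without the k-d tree lookup. `encode_choice` and `toy_connect_to` are hypothetical names used only for this illustration.

```python
def encode_choice(prev_cell, next_cell, connect_to):
    """Encode a move as 0 (stay), i + 1 (i-th outgoing neighbor) or None (unexplained)."""
    if next_cell == prev_cell:
        return 0
    neighbors = connect_to[prev_cell]
    if next_cell in neighbors:
        return neighbors.index(next_cell) + 1
    # Second-degree neighbor: credit the first neighbor that leads there
    for j, via in enumerate(neighbors):
        if next_cell in connect_to[via]:
            return j + 1
    return None


# Toy directed graph: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3, 2 -> 4, 4 -> 0
toy_connect_to = [[1, 2], [3], [3, 4], [], [0]]

print(encode_choice(0, 0, toy_connect_to))  # 0: stayed in place
print(encode_choice(0, 2, toy_connect_to))  # 2: second outgoing neighbor
print(encode_choice(0, 4, toy_connect_to))  # 2: reached through neighbor 2
print(encode_choice(3, 0, toy_connect_to))  # None: no outgoing edges
```

Note that when several neighbors lead to the same second-degree cell (cell 3 above is reachable through both 1 and 2), the first one in the list wins.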

Now we do the round scraping: for every round in every .json we go through each frame (our parsed .jsons have one frame every 0.5 seconds) and keep track of each player's movements.

In [7]:
cell_samples = [list() for _ in range(len(points))]

counter = 0
bad = 0

for filename, data in datas.items():
    for round_i, gameRound in enumerate(data["gameRounds"]):
        # Last known cell of each player in this round, keyed by steamID
        bookkeep_pos = dict()
        for frame_i, frame in enumerate(gameRound["frames"]):
            # Both teams are handled identically, CT first
            for team in ("ct", "t"):
                if frame[team]["players"] is None:
                    continue
                for player in frame[team]["players"]:
                    pos = (player["x"], player["y"], player["z"])
                    prev_cell = bookkeep_pos.get(player["steamID"])
                    if prev_cell is None:
                        # First sighting in this round: just record the starting cell
                        bookkeep_pos[player["steamID"]] = data_struct.find_closest(pos[0], pos[1], pos[2])[0]
                        continue

                    choice, next_cell = actual_movement(pos, prev_cell)
                    if choice is not None:
                        # Stored against the previous frame (frame_i - 1), i.e. before the move
                        cell_samples[prev_cell].append((filename, round_i, frame_i - 1, team, player["steamID"], choice))
                    else:
                        bad += 1
                    counter += 1
                    bookkeep_pos[player["steamID"]] = next_cell

with open(middle_result_name_train + "DONOTOVERWRITE", 'wb') as handle:
    pickle.dump(cell_samples, handle)

print(f"Movements done: {counter}. Of which {bad} bad.")

Movements done: 547761. Of which 6210 bad.


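Only 6210 bad movements out of 547761 (about 1.1%)? That's nice. How do we define a bad movement? Essentially, it is any movement our graph is not designed for: in code, one whose destination cell is neither the current cell, nor a direct neighbor, nor a second-degree neighbor. These are special and rare cases that were really too difficult to model, as we'll explain.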

Now we actually build our samples. To do so we go through each cell in the graph and, for each cell, through every sample recorded for it (we group the data this way because we'll have to build one model per cell).

For each sample we record the cell the player came from, the team they were part of, the number of team members alive (counting the player) and the direction of the teammates' average position relative to the player (you'll read later that these last two explanatory variables are problematic; we've left them in the files anyway for the record).
Of course we also record the cell the player went to.

Some of these records will not be stored and will be counted as bad for a few reasons: the structure was not designed for the movement they made (most of these cases were already pruned in the previous step), the sample sits at the very start of a round so there is no earlier frame to look at, or the player cannot be found in that earlier frame.
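The team-direction variable can be sketched on its own as follows. This is a plain-Python illustration of the rule (angle of the vector from the player to the teammates' mean position, 0 when there are no teammates); the function name `team_direction` is hypothetical.

```python
import math


def team_direction(player_xy, teammates_xy):
    """Angle in radians, in [-pi, pi], from the player to the teammates' mean position."""
    if not teammates_xy:
        return 0.0  # same fallback value as "teammates straight along +x"
    mean_x = sum(x for x, _ in teammates_xy) / len(teammates_xy)
    mean_y = sum(y for _, y in teammates_xy) / len(teammates_xy)
    return math.atan2(mean_y - player_xy[1], mean_x - player_xy[0])


print(team_direction((0.0, 0.0), [(1.0, 0.0)]))              # 0.0
print(team_direction((0.0, 0.0), [(0.0, 2.0), (0.0, 4.0)]))  # pi / 2
print(team_direction((0.0, 0.0), []))                        # 0.0
```

Two properties are worth keeping in mind when using this variable: a lone player gets the same value as a player whose team lies along the positive x axis, and the angle wraps around at ±π, so nearby directions can have very different numeric values.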

#### WARNING: THIS CELL TAKES A COUPLE OF MINUTES TO RUN

In [8]:
import pandas as pd
import numpy as np

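# Reads the plain file name: the cell above saves with a "DONOTOVERWRITE"
# suffix, so this expects a previously saved (or renamed) copy to exist
with open(middle_result_name_train, 'rb') as handle:
    cell_samples = pickle.load(handle)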

# Actually parsing data
recovered_samples = list()
total_data = 0
bad = 0
# Let's now go through every sample
for cell_i,samples in enumerate(cell_samples):
    cell_dict = {"choice": list(), "cell_from": list(), "team": list(), "dir_team": list(), "n_alive": list()}

    for filename, round_i, frame_i, team, player_id, true_choice in samples:
        total_data += 1
        if frame_i <= 0:
            bad += 1
            continue
        team_info_before = datas[filename]["gameRounds"][round_i]["frames"][frame_i - 1][team]
        index = None
        other_positions = list()
        for i, player in enumerate(team_info_before["players"]):
            if player["steamID"] == player_id:
                index = i
            else:
                other_positions.append((player["x"], player["y"]))
        
        if index == None:
            bad += 1
            continue
        player = team_info_before["players"][index]
        prev_cell = data_struct.find_closest(player["x"],player["y"],player["z"])[0]
        
        prev_choice = 0
        possible_bad = True

        if prev_cell in connect_from[cell_i]:
            prev_choice = connect_from[cell_i].index(prev_cell) + 1
            possible_bad = False
        elif prev_cell == cell_i:
            possible_bad = False
        
        if possible_bad:
            for j, other_cell in enumerate(connect_from[cell_i]):
                for val2 in connect_from[other_cell]:
                    if val2 == prev_cell:
                        possible_bad = False
                        prev_choice = j + 1
                        break
                if not possible_bad:
                    break

        if possible_bad:
            bad += 1
            continue

        angle = 0
        if len(other_positions) > 0:
            mean_pos = np.mean(np.asarray(other_positions), axis=0)
            mean_pos -= np.asarray([player["x"],player["y"]])
            angle = np.arctan2(mean_pos[1], mean_pos[0])

        cell_dict["n_alive"].append(len(other_positions) + 1)
        cell_dict["dir_team"].append(angle)
        cell_dict["choice"].append(true_choice)
        cell_dict["cell_from"].append(prev_choice)
        if team == "ct":
            cell_dict["team"].append(0)
        else:
            cell_dict["team"].append(1)

    temp_df = pd.DataFrame()
    temp_df["cell_from"] = pd.Categorical(cell_dict["cell_from"], categories=list(range(len(connect_from[cell_i]) + 1)), ordered=False)
    temp_df["team"] = pd.Categorical(cell_dict["team"], categories=[0,1], ordered=False)
    temp_df["n_alive"] = pd.Categorical(cell_dict["n_alive"], categories=list(range(1,6)), ordered=False)
    temp_df["dir_team"] = cell_dict["dir_team"]
    temp_df["choice"] = cell_dict["choice"]
    recovered_samples.append(temp_df)


print(total_data, bad)

with open(final_result_name_train + "DONOTOVERWRITE", 'wb') as handle:
    pickle.dump(recovered_samples, handle)

541551 7920


These are the results for the data we'll use to *train* (to regress) our model. As you can see, the number of bad samples is again pretty low compared to how many samples we've looked at (7920 out of 541551, about 1.5%): huge success!


Now we extract the data for testing. Since we want to predict players' positions over a 5-second timeline (10 frames in our .jsons), we'll have to use slightly different code in the final part.

Again, we load the .jsons.

In [9]:
filenames_test = os.listdir(filenames_dir_test)

datas_test = dict()

for filename in filenames_test:
    name, ext = os.path.splitext(filename)

    if ext != ".json":
        continue

    print(f"Loading {filename}")
    
    with open(os.path.join(filenames_dir_test, filename), 'rb') as handle:
        datas_test[filename] = json.load(handle)

Loading ence-vs-natus-vincere-m2-dust2.json
Loading ex-uyu-vs-unjustified-m2-dust2.json
Loading faze-vs-spirit-m2-dust2.json
Loading finest-vs-cloudrunners-dust2.json
Loading shape-vs-babos-m1-dust2.json
Loading stone-vs-leviatan-dust2.json


We run it through the scraper as before...

In [10]:
cell_samples_test = [list() for _ in range(len(points))]

counter = 0
bad = 0

for filename, data in datas_test.items():
    for round_i, gameRound in enumerate(data["gameRounds"]):
        # Last known cell of each player in this round, keyed by steamID
        bookkeep_pos = dict()
        for frame_i, frame in enumerate(gameRound["frames"]):
            # Both teams are handled identically, CT first
            for team in ("ct", "t"):
                if frame[team]["players"] is None:
                    continue
                for player in frame[team]["players"]:
                    pos = (player["x"], player["y"], player["z"])
                    prev_cell = bookkeep_pos.get(player["steamID"])
                    if prev_cell is None:
                        # First sighting in this round: just record the starting cell
                        bookkeep_pos[player["steamID"]] = data_struct.find_closest(pos[0], pos[1], pos[2])[0]
                        continue

                    choice, next_cell = actual_movement(pos, prev_cell)
                    if choice is not None:
                        # Stored against the previous frame (frame_i - 1), i.e. before the move
                        cell_samples_test[prev_cell].append((filename, round_i, frame_i - 1, team, player["steamID"], choice))
                    else:
                        bad += 1
                    counter += 1
                    bookkeep_pos[player["steamID"]] = next_cell

with open(middle_result_name_test  + "DONOTOVERWRITE", 'wb') as handle:
    pickle.dump(cell_samples_test, handle)

print(f"Movements done: {counter}. Of which {bad} bad.")

Movements done: 398074. Of which 4506 bad.


As we can see, we have fewer movements for testing (six games instead of the ten used before!). Still, the number of "bad" ones is again pretty low (4506 out of 398074, about 1.1%).

We now run code similar to the one used to recover the regression data to get the explanatory variables. The only difference is that we now also store some data across multiple frames: a list with one value per frame of the window (the starting frame plus the 10 that follow) instead of a single number.
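The windowing rule can be sketched in isolation like this. It is a simplified illustration, assuming each frame is a dict with a `"players"` list for one team (or `None`); `collect_window` is a hypothetical helper, and the real cell also tracks movements and angles along the way.

```python
def collect_window(frames, player_id, start, horizon=10):
    """Return the player's records for frames start..start+horizon, or None if incomplete."""
    if start + horizon >= len(frames):
        return None  # the round ends before the window does
    window = []
    for frame in frames[start:start + horizon + 1]:
        players = frame.get("players")
        if players is None:
            return None  # no data for the team in this frame
        record = next((p for p in players if p["steamID"] == player_id), None)
        if record is None:
            return None  # the player is missing from the frame
        window.append(record)
    return window


# Toy round of 12 frames with a single player (made-up data)
toy_frames = [{"players": [{"steamID": 7, "x": float(k)}]} for k in range(12)]
print(len(collect_window(toy_frames, 7, 0)))  # 11 records: the start frame plus 10 more
print(collect_window(toy_frames, 7, 2))       # None: the round is too short
```

A sample is only usable when every frame of the window contains the player, which is why incomplete windows are discarded rather than padded.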

#### WARNING: THIS CELL TAKES 5 MINUTES TO RUN

In [11]:
import pandas as pd
import numpy as np

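# Reads the plain file name: the cell above saves with a "DONOTOVERWRITE"
# suffix, so this expects a previously saved (or renamed) copy to exist
with open(middle_result_name_test, 'rb') as handle:
    cell_samples_test = pickle.load(handle)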

# Actually parsing data
recovered_samples = list()
total_data = 0
bad = 0

# Let's now go through every sample
for cell_i,samples in enumerate(cell_samples_test):
    cell_dict = {"cell_from": list(), "team": list(), "course": list(), "n_alive_lst": list(), "dir_team_lst": list()}

    for filename, round_i, frame_i, team, player_id, true_choice in samples:
        total_data += 1
        if frame_i <= 0:
            bad += 1
            continue

        team_info_before = datas_test[filename]["gameRounds"][round_i]["frames"][frame_i - 1][team]
        index = None
        for i, player in enumerate(team_info_before["players"]):
            if player["steamID"] == player_id:
                index = i
                break
        
        if index == None:
            bad += 1
            continue
        player = team_info_before["players"][index]
        prev_cell = data_struct.find_closest(player["x"],player["y"],player["z"])[0]

        
        prev_choice = 0
        possible_bad = True
        if prev_cell in connect_from[cell_i]:
            prev_choice = connect_from[cell_i].index(prev_cell) + 1
            possible_bad = False
        elif prev_cell == cell_i:
            possible_bad = False
        
        if possible_bad:
            for j, other_cell in enumerate(connect_from[cell_i]):
                for val2 in connect_from[other_cell]:
                    if val2 == prev_cell:
                        possible_bad = False
                        prev_choice = j + 1
                        break
                if not possible_bad:
                    break

        if possible_bad:
            bad += 1
            continue
        
        course = []
        angles = []
        n_alives = []
        get_data_frame = frame_i + 10
        if get_data_frame < len(datas_test[filename]["gameRounds"][round_i]["frames"]):
            bad_one = False
            prev_cell = cell_i
            for frame_k in range(frame_i, get_data_frame + 1):
                team_info_curr = datas_test[filename]["gameRounds"][round_i]["frames"][frame_k][team]
                
                index = None
                other_positions = list()
                try:
                    for i, player in enumerate(team_info_curr["players"]):
                        if player["steamID"] == player_id:
                            index = i
                        else:
                            other_positions.append((player["x"], player["y"]))
                except (TypeError, KeyError):
                    # "players" is None or a record lacks a field: no usable data for this frame
                    bad_one = True
                    break
                if index == None:
                    bad_one = True
                    break

                player = team_info_curr["players"][index]

                mov, next_cell = actual_movement((player["x"],player["y"],player["z"]), prev_cell)

                if mov == None:
                    bad_one = True
                    break

                angle = 0
                if len(other_positions) > 0:
                    mean_pos = np.mean(np.asarray(other_positions), axis=0)
                    mean_pos -= np.asarray([player["x"],player["y"]])
                    angle = np.arctan2(mean_pos[1], mean_pos[0])

                angles.append(angle)
                n_alives.append(len(other_positions) + 1)
                course.append(next_cell)
                prev_cell = next_cell

            
            if bad_one:
                bad += 1
                continue
        else:
            bad += 1
            continue


        cell_dict["n_alive_lst"].append(n_alives)
        cell_dict["dir_team_lst"].append(angles)
        cell_dict["course"].append(course)
        cell_dict["cell_from"].append(prev_choice)
        if team == "ct":
            cell_dict["team"].append(0)
        else:
            cell_dict["team"].append(1)

    temp_df = pd.DataFrame()
    temp_df["cell_from"] = pd.Categorical(cell_dict["cell_from"], categories=list(range(len(connect_from[cell_i]) + 1)), ordered=False)
    temp_df["team"] = pd.Categorical(cell_dict["team"], categories=[0,1], ordered=False)
    temp_df["n_alive_lst"] = [pd.Categorical(lst, categories=list(range(1,6)), ordered=False) for lst in cell_dict["n_alive_lst"]]
    temp_df["dir_team_lst"] = cell_dict["dir_team_lst"]
    temp_df["course"] = cell_dict["course"]
    recovered_samples.append(temp_df)

print(total_data, bad)

with open(final_result_name_test  + "DONOTOVERWRITE", 'wb') as handle:
    pickle.dump(recovered_samples, handle)

393568 47745


Oh wow, 47745 "bad" samples (about 12% of 393568)? This may seem like a lot, but the reason is simple, and it is not that our design is unable to represent the events nicely (as happened in the previous cases).
In this case the problem stems from the fact that we count as "bad" every case in which we cannot get the full 5 seconds of information about the player: if something happens such that the data is missing (simplest case: the player dies within that time frame, or the round ends first), we must discard the sample.

At any rate, the number of good samples is still very high (345823), so we do not worry about this number.
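For reference, the bad-sample rates of the four steps can be recomputed from the counts printed above. The stage labels and the helper `bad_rate` are ours; the numbers are copied from the cell outputs.

```python
def bad_rate(total, bad):
    """Share of discarded items, as a percentage."""
    return 100 * bad / total


# (total, bad) pairs copied from the printed outputs of the four cells
stages = {
    "train movements": (547761, 6210),
    "train samples": (541551, 7920),
    "test movements": (398074, 4506),
    "test samples": (393568, 47745),
}

for name, (total, bad) in stages.items():
    print(f"{name}: {bad}/{total} bad ({bad_rate(total, bad):.1f}%), {total - bad} kept")
```

The counts are also consistent across steps: the movements kept by each scraping step (547761 - 6210 and 398074 - 4506) are exactly the totals seen by the following sample-building step.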