## Graf Dataset Balancing

Balances the Graf dataset generated by `generate-graf-data.ipynb`. Because push is chosen far more often than any other trump value, only a random subset of each trump class is sampled for the final dataset.

Note: the same caveat applies here. I have not cleaned this up into a stand-alone script, because that would take a lot of effort for little benefit: the code will never change, and tuning random states like the one used here is not something I stand for anyway. In a real script I would not have hard-coded all these numbers. Fortunately, they are deterministic and will never change, because the Graf dataset contains every possible card combination a player could hold in their hand.
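
The balancing rule itself is simple arithmetic: each class is sampled with a fraction equal to the smallest class count divided by that class's own count, so every class ends up with roughly the same number of rows. A minimal sketch of that calculation follows. The `toy_counts` numbers and the `sampling_fractions` helper are hypothetical and only illustrate the idea; they are not the notebook's real counts or code.

```python
# Hypothetical toy counts, keyed like the notebook's dict: counts[fh][trump].
toy_counts = {
    0: {0: 1_000, 1: 800},
    1: {0: 400, 1: 50, 2: 4_000},
}


def sampling_fractions(counts):
    """Return frac[fh][trump] so that each class keeps about min-count rows."""
    lowest = min(n for per_fh in counts.values() for n in per_fh.values())
    return {
        fh: {trump: lowest / n for trump, n in per_fh.items()}
        for fh, per_fh in counts.items()
    }


fractions = sampling_fractions(toy_counts)

# Expected number of rows kept per class after sampling with these fractions.
expected_rows = {
    fh: {trump: round(toy_counts[fh][trump] * frac) for trump, frac in per_fh.items()}
    for fh, per_fh in fractions.items()
}
```

The rarest class gets a fraction of 1.0 (it is kept in full), while the most common class is cut down the hardest.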

In [None]:
import dask.dataframe as dd
import pandas as pd

In [None]:
ddf = dd.read_parquet("./data/graf-dataset/")
ddf

In [None]:
# Number of times UNE_UFE was selected when partner pushed (the rarest class); see graf-dataset-analysis.
LOWEST_N = 276_332
TOTAL_N = 188_286_560
# Note: this is heavy downsampling, as push alone was selected about 72 million times.

In [None]:
!mkdir -p ./data/graf-dataset-balanced

In [None]:
rand_state = 42

In [None]:
# see graf-dataset-analysis
counts = {
    0: {
        0: 16123470,
        1: 15877006,
        2: 15635613,
        3: 15399127,
        4: 14540772,
        5: 16567292,
    },
    1: {
        0: 4561545,
        1: 4546921,
        2: 4532349,
        3: 4517829,
        4: 3618884,
        5:  276332,
        6: 72089420,  # push; with this value all counts sum to TOTAL_N (188_286_560)
    }
}

In [None]:
import os

In [None]:
for fh in [0, 1]:
    # fh == 1 has one extra trump value (7 classes instead of 6)
    for trump in range(6 + fh):
        suffix = f"{trump}" + ("fh" if fh == 1 else "")
        train_path = f"./data/graf-dataset-balanced/train/{suffix}/"
        val_path = f"./data/graf-dataset-balanced/val/{suffix}/"
        # skip classes that were already written in an earlier run
        if os.path.exists(train_path):
            continue
        partition = ddf.query(f"fh == {fh} & trump == {trump}")
        # len(partition) would be expensive, so use the precomputed counts
        total_n = counts[fh][trump]
        # downsample every class to roughly LOWEST_N rows
        downsampled = partition.sample(frac=LOWEST_N / total_n, random_state=rand_state)
        train, val = downsampled.random_split([.8, .2], random_state=rand_state)
        train.to_parquet(train_path)
        val.to_parquet(val_path)
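
To check the logic of the loop above without touching the full dataset, the same downsample-then-split procedure can be tried on a small in-memory frame. The sketch below is an illustration only: it uses plain pandas instead of Dask, the toy class sizes are made up, and the 80/20 split is done with `sample` plus `drop` as a stand-in for Dask's `random_split`. Unlike Dask's sampling, which is approximate, pandas keeps an exact number of rows here.

```python
import pandas as pd

RAND_STATE = 42

# Hypothetical toy data: class (fh=1, trump=1) is the rarest with 50 rows.
sizes = {(0, 0): 1_000, (0, 1): 800, (1, 0): 400, (1, 1): 50}
toy = pd.DataFrame(
    [{"fh": fh, "trump": trump} for (fh, trump), n in sizes.items() for _ in range(n)]
)

# Size of the rarest class, the analogue of LOWEST_N.
lowest_n = toy.groupby(["fh", "trump"]).size().min()

train_parts, val_parts = [], []
for (fh, trump), group in toy.groupby(["fh", "trump"]):
    # Downsample every class to the size of the rarest one.
    downsampled = group.sample(frac=lowest_n / len(group), random_state=RAND_STATE)
    # 80/20 split: draw the training rows, keep the remainder for validation.
    train = downsampled.sample(frac=0.8, random_state=RAND_STATE)
    train_parts.append(train)
    val_parts.append(downsampled.drop(train.index))

train_df = pd.concat(train_parts)
val_df = pd.concat(val_parts)
```

After running this, every `(fh, trump)` class contributes the same number of rows to `train_df` and to `val_df`, and no row appears in both.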

In [None]:
train_ddf = dd.read_parquet("./data/graf-dataset-balanced/train")
train_ddf

In [None]:
train_ddf.compute()

In [None]:
val_ddf = dd.read_parquet("./data/graf-dataset-balanced/val/")
val_ddf

In [None]:
val_ddf.compute()