# EDH card pair prediction

1. build train / test / dev data set
    1. get EDH card pair recommendations from edhrec.com. these have prediction value 1
    1. generate false pairs (prediction value 0) by randomly generating pairs
    1. split, stratifying on card color identity, card type, rarity.
    1. convert cards into sentences
1. fine-tune
    1. load pre-trained bert model and fine-tune it on the prediction task "card a, card b --> {yes,no} was an edhrec recommendation"
1. make deck predictions for one of my existing decks

In [None]:
import csv
import itertools
import os
from glob import glob

import datasets
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from datasets import load_dataset, load_from_disk
from sklearn.model_selection import train_test_split
from tqdm.notebook import tqdm
from transformers import (BertTokenizerFast,
                          BertForNextSentencePrediction,
                          Trainer, TrainingArguments, )

import mtg.cards
import mtg.extract.edhrec

In [None]:
%matplotlib inline

## build train / test / dev data set

### get EDH card pair recommendations from edhrec.com

these will have prediction value 1

In [None]:
edhrec_cards = (mtg.extract.edhrec.get_commanders_and_cards()
                [['name', 'commander']])
edhrec_cards.head()

In [None]:
# distribution of how many commander recs each card appears in
edhrec_cards.name.value_counts().plot.hist()

In [None]:
# most common cards
vc = edhrec_cards.name.value_counts()
vc.head()

In [None]:
# most cards appear in only one commander's recs; up to 500 cards
# appear in ten commanders' recs
vc[vc <= 10].plot.hist()

if we just ran with this, how many total pairs could we generate? basically, for every card in deck X, every other card in that deck is a valid pair, so a deck with n cards yields n(n-1)/2 pairs. summed over all decks, that's:

In [None]:
z = edhrec_cards.commander.value_counts()
f"{int((z * (z - 1) / 2).sum()):,}"

at first I was going to say no way, buuuuuut it's actually not terrible... we want big data, after all

we would need to generate about 32 mil negative labels if that were the dataset we were interested in
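as a sanity check on the n(n-1)/2 count, here is a minimal sketch comparing the closed form against actually enumerating pairs with `itertools.combinations`. the `toy_decks` data is made up for illustration and is not real edhrec output.

```python
import itertools

# hypothetical toy data: commander -> recommended cards (not real edhrec output)
toy_decks = {
    "commander a": ["card 1", "card 2", "card 3", "card 4"],
    "commander b": ["card 2", "card 5", "card 6"],
}


def count_pairs_closed_form(decks):
    """sum of n * (n - 1) / 2 over decks, same formula as the cell above"""
    return sum(len(c) * (len(c) - 1) // 2 for c in decks.values())


def count_pairs_enumerated(decks):
    """count pairs by enumerating every unordered pair, deck by deck"""
    return sum(1 for c in decks.values() for _ in itertools.combinations(c, 2))


closed = count_pairs_closed_form(toy_decks)
enumerated = count_pairs_enumerated(toy_decks)
print(closed, enumerated)
```

note that this counts per deck, so a pair that shows up under two commanders is counted twice. the number of distinct pairs may be lower than the total.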

### get all cards from mtgjson

to generate false pairs we will randomly select from all cards. about 65% of all MTG cards are referenced on edhrec, but the rest are also, presumably, good choices for 0 labels
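a minimal sketch of that random negative sampling, assuming we already have the set of all card names and a collection of known positive pairs. `sample_negative_pairs` and its parameters are hypothetical names, not something defined elsewhere in this notebook.

```python
import random


def sample_negative_pairs(all_cards, positive_pairs, n, seed=1337):
    """draw up to n distinct unordered card pairs that are not known positives"""
    rng = random.Random(seed)
    cards = sorted(all_cards)  # sort so the seed gives reproducible draws
    positives = {frozenset(p) for p in positive_pairs}
    negatives = set()
    tries, max_tries = 0, 100 * n  # give up eventually if the pool is tiny
    while len(negatives) < n and tries < max_tries:
        tries += 1
        pair = frozenset(rng.sample(cards, 2))  # two different cards
        if pair not in positives:
            negatives.add(pair)
    return sorted(tuple(sorted(p)) for p in negatives)


# toy usage with made-up card names
toy_cards = {"a", "b", "c", "d", "e"}
toy_positives = [("a", "b")]
toy_negatives = sample_negative_pairs(toy_cards, toy_positives, n=3)
print(toy_negatives)
```

these labels are noisy by construction: a random pair that edhrec never recommends is only presumed to be a bad pair.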

In [None]:
cards = (mtg.cards.cards_df()
         .sort_values(by=['name', 'multiverseId'], ascending=False)
         .groupby('name')
         .first())

In [None]:
# before groupby().first(): 56_002, 78
# after: 21_814, 77
cards.shape

In [None]:
all_cards = set(cards.index.values)

we can eventually use this dataframe to create a generator of true card pairs off of a single card anchor
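one way such a generator could look, as a sketch: it assumes rows of `(name, commander)` like the two `edhrec_cards` columns, and the function names `build_lookups` and `true_pairs_for_anchor` are made up here.

```python
from collections import defaultdict


def build_lookups(name_commander_rows):
    """index (card name, commander) rows in both directions"""
    cards_by_commander = defaultdict(set)
    commanders_by_card = defaultdict(set)
    for name, commander in name_commander_rows:
        cards_by_commander[commander].add(name)
        commanders_by_card[name].add(commander)
    return cards_by_commander, commanders_by_card


def true_pairs_for_anchor(anchor, cards_by_commander, commanders_by_card):
    """yield (anchor, other) for every card sharing a commander with anchor"""
    seen = set()
    for commander in sorted(commanders_by_card.get(anchor, ())):
        for other in sorted(cards_by_commander[commander]):
            if other != anchor and other not in seen:
                seen.add(other)
                yield anchor, other


# toy rows, not real edhrec data
toy_rows = [("x", "c1"), ("y", "c1"), ("z", "c2"), ("x", "c2")]
by_commander, by_card = build_lookups(toy_rows)
toy_pairs = list(true_pairs_for_anchor("x", by_commander, by_card))
print(toy_pairs)
```

with the real frame, something like `edhrec_cards[['name', 'commander']].itertuples(index=False, name=None)` should produce rows in the shape `build_lookups` expects.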

### split, stratifying on card color identity, card type, rarity.

we will split on cards. this is actually tricky, right? it would be easy if we could just do a 90/5/5 split and there were enough pairings between the 5s and other 5s to build an entire test / val set, but I suspect we might have trouble fielding that many records. oh well, I guess we'll find out in due time

since we want to stratify on so many things, and any card has a roughly 2/3 chance of appearing in a true-label pair, I actually think fully random sampling is appropriate. we can look at the breakdown of the split by other features if we need to
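a sketch of that breakdown check, to run once a card split exists. it assumes a dict of split name to card names and a dict of card name to one categorical feature (rarity here); both structures and the name `split_breakdown` are hypothetical.

```python
from collections import Counter


def split_breakdown(split_cards, feature_by_card):
    """fraction of each feature value within each split"""
    out = {}
    for split, names in split_cards.items():
        counts = Counter(feature_by_card[n] for n in names)
        total = sum(counts.values())
        out[split] = {k: v / total for k, v in sorted(counts.items())}
    return out


# toy data, made up for illustration
toy_rarity = {"a": "common", "b": "common", "c": "rare", "d": "rare", "e": "rare"}
toy_split = {"train": ["a", "b", "c", "d"], "test": ["e"]}
breakdown = split_breakdown(toy_split, toy_rarity)
print(breakdown)
```

if the fractions look similar across train / test / val, random sampling was probably good enough for that feature.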

In [None]:
cards_train, cards_test_val = train_test_split(cards.index.values, test_size=.1, random_state=1337)
cards_test, cards_val = train_test_split(cards_test_val, test_size=.5, random_state=1337)

print(f"""
train: {cards_train.shape[0]}
test: {cards_test.shape[0]}
val: {cards_val.shape[0]}
""")

### convert cards into sentences

In [None]:
cmc_map = {0.0: 'zero',
           0.5: 'one half',
           1.0: 'one',
           2.0: 'two',
           3.0: 'three',
           4.0: 'four',
           5.0: 'five',
           6.0: 'six',
           7.0: 'seven',
           8.0: 'eight',
           9.0: 'nine',
           10.0: 'ten',
           11.0: 'eleven',
           12.0: 'twelve',
           13.0: 'thirteen',
           14.0: 'fourteen',
           15.0: 'fifteen',
           16.0: 'sixteen',
           1000000.0: 'one million', }


color_map = {'W': 'white', 'U': 'blue', 'B': 'black', 'R': 'red', 'G': 'green'}


def parse_mana_colors_from_cost(mc):
    return ', '.join(color_map[c] for c in 'WUBRG' if c in (mc or ''))

In [None]:
assert parse_mana_colors_from_cost('{2}{U}{U}{B}') == 'blue, black'
assert parse_mana_colors_from_cost('{8}{W}{W}') == 'white'

In [None]:
def get_card_text(card):
    mana_color_str = parse_mana_colors_from_cost(card.manaCost)
    cmc_str = f"{cmc_map[card.convertedManaCost]} mana"
    
    if mana_color_str != '':
        mana_color_str = f' including {mana_color_str}'
    
    return (f"for {cmc_str}{mana_color_str}, cast {card.type} {card.name}: {card.text}"
            .lower()
            .replace('\n', ' '))
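to see what a sentence looks like, here is a self-contained sketch that repeats trimmed copies of the helpers above and runs them on a hypothetical card. `Example Bolt` is invented, not a row from mtgjson.

```python
from types import SimpleNamespace

# trimmed copies of the maps above, just enough for this example
cmc_map = {2.0: 'two'}
color_map = {'W': 'white', 'U': 'blue', 'B': 'black', 'R': 'red', 'G': 'green'}


def parse_mana_colors_from_cost(mc):
    return ', '.join(color_map[c] for c in 'WUBRG' if c in (mc or ''))


def get_card_text(card):
    mana_color_str = parse_mana_colors_from_cost(card.manaCost)
    cmc_str = f"{cmc_map[card.convertedManaCost]} mana"
    if mana_color_str != '':
        mana_color_str = f' including {mana_color_str}'
    return (f"for {cmc_str}{mana_color_str}, cast {card.type} {card.name}: {card.text}"
            .lower()
            .replace('\n', ' '))


# hypothetical card with the same attribute names the function reads
card = SimpleNamespace(name='Example Bolt', type='Instant',
                       manaCost='{1}{R}', convertedManaCost=2.0,
                       text='Example Bolt deals 2 damage\nto any target.')
sentence = get_card_text(card)
print(sentence)
```

a converted mana cost that is missing from `cmc_map` raises a `KeyError` in this formulation, so the map has to cover every value in the data.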

In [None]:
name = cards_train[4]
cards.loc[name]
get_card_text(cards.loc[name])

In [None]:
card_text = pd.DataFrame({'text': cards.apply(get_card_text, axis=1)})
card_text.head(20)

let's just go with this, see how it works out

### create a `huggingface` `datasets` dataset

following along with the relatively simple example [here](https://github.com/huggingface/datasets/blob/master/datasets/squad/squad.py)

#### custom dataset loader?

meh let's try the `csv` loader first

In [None]:
# class Edhrec(datasets.GeneratorBasedBuilder):
#     raise NotImplementedError("havent written builder configs")
#     BUILDER_CONFIGS = []
    
#     def _info(self):
#         return datasets.DatasetInfo(
#             description="lol no thanks",
#             features=datasets.Features({"id": datasets.Value('string'),
#                                         "text_a": datasets.Value('string'),
#                                         "text_b": datasets.Value('string'),
#                                         "label": datasets.Value("int32"), }),
#             supervised_keys=None
#         )

#### `csv` loader

generate `csv`s the same way we were doing `parquet` (see appendix) and load those as datasets
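a minimal sketch of the chunked `csv` writing, assuming each record is a dict with the `text_a`, `text_b`, and `label` columns used elsewhere in this notebook. the function name, file naming scheme, and chunk size are assumptions. `csv.QUOTE_ALL` is used on the way out so it matches the `quoting` setting used when loading.

```python
import csv
import os
import tempfile


def write_pair_csvs(records, out_dir, chunk_size=1000):
    """write records to out_dir as numbered csv files of chunk_size rows each"""
    os.makedirs(out_dir, exist_ok=True)
    fieldnames = ['text_a', 'text_b', 'label']
    paths = []
    for start in range(0, len(records), chunk_size):
        path = os.path.join(out_dir, f"pairs_{start // chunk_size:05d}.csv")
        with open(path, 'w', newline='', encoding='utf-8') as fp:
            writer = csv.DictWriter(fp, fieldnames=fieldnames,
                                    quoting=csv.QUOTE_ALL)
            writer.writeheader()
            writer.writerows(records[start:start + chunk_size])
        paths.append(path)
    return paths


# toy records, written to a throwaway directory
toy_records = [{'text_a': f'card {i}, with a comma', 'text_b': 'other card',
                'label': i % 2} for i in range(1, 6)]
with tempfile.TemporaryDirectory() as tmp:
    paths = write_pair_csvs(toy_records, os.path.join(tmp, 'train'), chunk_size=2)
    n_files = len(paths)
    with open(paths[0], newline='', encoding='utf-8') as fp:
        first_rows = list(csv.DictReader(fp))
print(n_files, first_rows)
```

writing one directory per split (`train`, `test`, `val`) would line up with a `./data/<split>/*.csv` glob.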

#### loading csvs, shuffling, tokenizing, etc datasets now

+ tokenizing from [here](https://huggingface.co/docs/datasets/processing.html#processing-data-in-batches)

In [None]:
F_DS_CACHE = os.path.join('.', 'data', 'edhrec_cache_dataset')

In [None]:
# MAX_LENGTH = 300

# tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# def tokenizer_map_func(rec):
#     return tokenizer(rec['text_a'], rec['text_b'],
#                      return_tensors='np',
#                      padding='max_length',
#                      max_length=MAX_LENGTH,
#                      truncation=True)


# (load_dataset('csv',
#               data_files={split_type: glob(os.path.join('.', 'data', split_type, '*.csv'))
#                           for split_type in ['val', 'test', 'train']},
#               quoting=csv.QUOTE_ALL)
#  .shuffle()
#  .map(tokenizer_map_func,
#       batched=True)
#  .save_to_disk(F_DS_CACHE))

In [None]:
dataset = load_from_disk(F_DS_CACHE)

In [None]:
dataset.column_names

In [None]:
# dataset['val'][0]

## fine-tune

In [None]:
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total # of training epochs
    per_device_train_batch_size=64,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=1,
    # my custom ones
    overwrite_output_dir=True,
    evaluation_strategy='steps',
    logging_first_step=True,
    no_cuda=True,
    seed=1337,
    dataloader_drop_last=True,
    dataloader_num_workers=30,
    label_names=['label'],
)


trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=dataset['train'],      # training dataset
    eval_dataset=dataset['val'],         # evaluation dataset
)

In [None]:
trainer.train()

## make deck predictions for one of my existing decks
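nothing is implemented for this step yet. one possible shape for it, as a sketch: score every candidate card against every card already in the deck and rank candidates by their mean pair score. `rank_candidates` and the `pair_score` callable are hypothetical. In practice `pair_score` would wrap the fine-tuned model, but a toy lookup stands in for it here.

```python
def rank_candidates(deck, candidates, pair_score):
    """rank candidate cards by mean pair score against the cards in deck"""
    if not deck:
        return []
    deck_set = set(deck)
    scored = []
    for cand in candidates:
        if cand in deck_set:
            continue  # already in the deck, nothing to recommend
        scores = [pair_score(cand, card) for card in deck]
        scored.append((cand, sum(scores) / len(scores)))
    # highest mean score first, ties broken by name
    return sorted(scored, key=lambda t: (-t[1], t[0]))


# toy scorer standing in for the model; the numbers are made up
toy_scores = {
    frozenset(('c', 'a')): 0.9, frozenset(('c', 'b')): 0.5,
    frozenset(('d', 'a')): 0.2, frozenset(('d', 'b')): 0.2,
}


def toy_pair_score(x, y):
    return toy_scores.get(frozenset((x, y)), 0.0)


ranked = rank_candidates(['a', 'b'], ['c', 'd', 'a'], toy_pair_score)
print(ranked)
```

mean is only one choice of aggregate. max or a top-k mean would favor cards that pair very well with a few deck cards instead of moderately well with all of them.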

# appendix

the following is exploratory hacking, stuff that didn't work, etc

### tokenizing sentences

~~we will be reusing most of the text sentences above several times; might as well tokenize them all up front once instead of tokenizing most of them 100x later~~

just do shit the way the documentation suggests we should: tokenize the completely built pair parquet files below

In [None]:
# from transformers import RobertaTokenizerFast
# tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')

In [None]:
# def my_tokenizer(row, *args, **kwargs):
#     return pd.Series(tokenizer(row.text, *args, **kwargs))

In [None]:
# (card_text.head(20)
#  .apply(my_tokenizer, axis=1, truncation=True, padding=True))

In [None]:
# card_text = (card_text
#              .join(card_text
#                    .apply(my_tokenizer, axis=1, truncation=True, padding=True)))

# card_text.head(10)

### making the pair suggestions dataset

okay so we have

1. a train / test / val split of all cards
1. a series of card text values (our "sentences")
1. a list of `card --> deck` relationships

the task now is to

1. generate positive and negative cases for each card
    + positive: `card --> deck <-- card`
    + negative: just not that
1. look up their text values
1. write those values to file
    + probably want to chunk this up somehow, maybe write 1k sentences per parquet
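the middle steps above could be sketched like this, assuming pairs are `(card_a, card_b)` tuples, a split is a set of card names, and card text lives in a dict of name to sentence. all names here are hypothetical. pairs are kept only when both cards fall in the same split, which is the constraint that makes splitting on cards tricky.

```python
def pairs_within_split(pairs, split_cards):
    """keep only pairs whose two cards both belong to the given split"""
    split_cards = set(split_cards)
    return [(a, b) for a, b in pairs if a in split_cards and b in split_cards]


def build_pair_records(pairs, label, text_by_card):
    """look up both sentences for each pair, skipping cards without text"""
    records = []
    for a, b in pairs:
        if a in text_by_card and b in text_by_card:
            records.append({'text_a': text_by_card[a],
                            'text_b': text_by_card[b],
                            'label': label})
    return records


# toy inputs, made up for illustration
toy_pairs = [('a', 'b'), ('a', 'z'), ('c', 'd')]
toy_split = {'a', 'b', 'c', 'd'}
toy_text = {'a': 'text for a', 'b': 'text for b', 'c': 'text for c'}

kept = pairs_within_split(toy_pairs, toy_split)
records = build_pair_records(kept, label=1, text_by_card=toy_text)
print(kept, records)
```

the same two functions work for negative pairs by passing `label=0`.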

### build the pytorch datasets

basing this in large part off of [this doc page](https://huggingface.co/transformers/custom_datasets.html#nlplib)

#### do the encodings

so, the below killed the kernel... :(