# EDH card pair prediction

1. build train / test / dev data set
    1. get EDH card pair recommendations from edhrec.com. these have prediction value 1
    1. generate false pairs (prediction value 0) by randomly generating pairs
    1. split, stratifying on card color identity, card type, rarity.
    1. convert cards into sentences
1. fine-tune
    1. load a pre-trained bert model and fine-tune it on the prediction task "card a, card b --> {yes, no}: was this pair an edhrec rec?"
1. make deck predictions for one of my existing decks

In [None]:
# !pip install datasets transformers

In [None]:
import csv
import itertools
import os
from glob import glob

import datasets
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from datasets import load_dataset, load_from_disk
from sklearn.model_selection import train_test_split
from tqdm.notebook import tqdm
from transformers import (BertConfig, BertTokenizerFast,
                          BertForNextSentencePrediction,
                          DataCollatorWithPadding,
                          PreTrainedModel, PreTrainedTokenizerFast,
                          Trainer, TrainingArguments, )

In [None]:
%matplotlib inline

## build train / test / dev data set

### get EDH card pair recommendations from edhrec.com

these will have prediction value 1

In [None]:
import mtg.cards
import mtg.extract.edhrec

In [None]:
edhrec_cards = (mtg.extract.edhrec.get_commanders_and_cards()
                [['name', 'commander']])
edhrec_cards.head()

In [None]:
# most common cards
edhrec_cards.name.value_counts().plot.hist()

In [None]:
# most common cards
vc = edhrec_cards.name.value_counts()
vc.head()

In [None]:
# most cards appear in only 1 commander's recs; up to 500 cards appear
# in 10 commanders' recs
vc[vc <= 10].plot.hist()

if we just ran with this, how many total pairs could we generate this way? basically, for every card in deck X, every other card is a valid pair. that's:

at first I was going to say no way, buuuuuut it's actually not terrible... we want big data, after all

we would need to generate about 32 million negative labels if that were the dataset we were interested in
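
a rough sketch of how that count could be computed, assuming a dataframe shaped like `edhrec_cards` (one row per `name`, `commander`). the function name `count_true_pairs` and the toy data are made up for illustration, and a pair that shows up under several commanders gets counted once per commander here:

```python
import pandas as pd


def count_true_pairs(recs: pd.DataFrame) -> int:
    """count unordered card pairs that share a commander.

    assumes one row per (name, commander). a pair that co-occurs under
    several commanders is counted once per commander.
    """
    n_per_commander = recs.groupby('commander')['name'].nunique()
    # n cards under one commander --> n * (n - 1) / 2 unordered pairs
    return int((n_per_commander * (n_per_commander - 1) // 2).sum())


# toy data, not real edhrec output
toy = pd.DataFrame({'name': ['a', 'b', 'c', 'a', 'b'],
                    'commander': ['x', 'x', 'x', 'y', 'y']})
n_pairs = count_true_pairs(toy)
```

if deduplicated pairs are what we actually want, the pairs would have to be materialized (or at least hashed) instead of just counted.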

### get all cards from mtgjson

to generate false pairs we will randomly select from all cards. about 65% of all MTG cards are referenced on edhrec, but the rest are also, presumably, good choices for 0 labels
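
one way the random negative sampling could look. this is only a sketch: `sample_false_pairs` is a hypothetical helper, and it assumes the true pairs are available as a set of name-sorted tuples so we can reject any random draw that happens to be a real recommendation:

```python
import random


def sample_false_pairs(all_cards, true_pairs, n, seed=1337):
    """draw `n` distinct random unordered pairs that are not in `true_pairs`."""
    pool = sorted(all_cards)  # fixed order so the seed gives reproducible draws
    max_pairs = len(pool) * (len(pool) - 1) // 2 - len(true_pairs)
    if n > max_pairs:
        raise ValueError(f"asked for {n} false pairs, at most {max_pairs} exist")

    rng = random.Random(seed)
    false_pairs = set()
    while len(false_pairs) < n:
        a, b = rng.sample(pool, 2)      # two different cards
        pair = tuple(sorted((a, b)))    # (a, b) and (b, a) are the same pair
        if pair not in true_pairs:
            false_pairs.add(pair)
    return false_pairs


# toy data: four cards, one known true pair
negatives = sample_false_pairs({'a', 'b', 'c', 'd'}, {('a', 'b')}, n=3)
```

rejection sampling like this is cheap as long as true pairs are a small share of all possible pairs; it slows down a lot if we ask for nearly every remaining pair.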

In [None]:
cards = (mtg.cards.cards_df()
         .sort_values(by=['name', 'multiverseId'], ascending=False)
         .groupby('name')
         .first())

In [None]:
# before groupby().first(): 56_002, 78
# after: 21_814, 77
cards.shape

In [None]:
all_cards = set(cards.index.values)

we can eventually use this dataframe to create a generator of true card pairs off of a single card anchor
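
a minimal sketch of that generator, assuming "this dataframe" ends up being something shaped like `edhrec_cards` (columns `name` and `commander`). `true_pairs_for` is a made-up name and the toy frame is not real data:

```python
import pandas as pd


def true_pairs_for(anchor, recs):
    """yield (anchor, other) for every card that shares a commander with `anchor`."""
    commanders = recs.loc[recs['name'] == anchor, 'commander'].unique()
    shares_commander = recs['commander'].isin(commanders)
    others = recs.loc[shares_commander & (recs['name'] != anchor), 'name'].unique()
    for other in others:
        yield (anchor, other)


# toy data, not real edhrec output
toy_recs = pd.DataFrame({'name': ['a', 'b', 'c', 'a', 'b', 'd'],
                         'commander': ['x', 'x', 'x', 'y', 'y', 'z']})
pairs_for_a = sorted(true_pairs_for('a', toy_recs))
```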

### split, stratifying on card color identity, card type, rarity.

we will split on cards. this is actually tricky, right? it would be easy if we could just do a 90/5/5 split and there was enough pairing between the 5s and other 5s to build an entire test / val set, but I actually suspect we might have a problem fielding that many extra records. oh well, I guess we'll tell in due time

since we want to stratify on so many things, and we have a 2/3s chance of any card being in the true label, I actually think fully random sampling is appropriate. we can look at the breakdown of that by other features if we need to
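
a sketch of the fully random card-level split, using only the standard library (sklearn's `train_test_split` applied twice would do the same job). the 5% val / 5% test fractions and the name `split_cards` are assumptions for illustration:

```python
import random


def split_cards(card_names, val_frac=0.05, test_frac=0.05, seed=1337):
    """randomly assign each card to exactly one of train / val / test."""
    names = sorted(card_names)            # fixed order so the seed is reproducible
    random.Random(seed).shuffle(names)
    n_val = int(len(names) * val_frac)
    n_test = int(len(names) * test_frac)
    return {'val': names[:n_val],
            'test': names[n_val:n_val + n_test],
            'train': names[n_val + n_test:]}


# toy card names
splits = split_cards([f'card {i}' for i in range(100)])
```

because the split is on cards and not on pairs, a pair only lands cleanly in val or test if both of its cards do, which is exactly the "enough pairing between 5s" worry above.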

### convert cards into sentences

In [None]:
cmc_map = {0.0: 'zero',
           0.5: 'one half',
           1.0: 'one',
           2.0: 'two',
           3.0: 'three',
           4.0: 'four',
           5.0: 'five',
           6.0: 'six',
           7.0: 'seven',
           8.0: 'eight',
           9.0: 'nine',
           10.0: 'ten',
           11.0: 'eleven',
           12.0: 'twelve',
           13.0: 'thirteen',
           14.0: 'fourteen',
           15.0: 'fifteen',
           16.0: 'sixteen',
           1000000.0: 'one million', }


color_map = {'W': 'white', 'U': 'blue', 'B': 'black', 'R': 'red', 'G': 'green'}


def parse_mana_colors_from_cost(mc):
    return ', '.join(color_map[c] for c in 'WUBRG' if c in (mc or ''))

In [None]:
assert parse_mana_colors_from_cost('{2}{U}{U}{B}') == 'blue, black'
assert parse_mana_colors_from_cost('{8}{W}{W}') == 'white'

In [None]:
def get_card_text(card):
    mana_color_str = parse_mana_colors_from_cost(card.manaCost)
    cmc_str = f"{cmc_map[card.convertedManaCost]} mana"
    
    if mana_color_str != '':
        mana_color_str = f' including {mana_color_str}'
    
    return (f"for {cmc_str}{mana_color_str}, cast {card.type} {card.name}: {card.text}"
            .lower()
            .replace('\n', ' '))

In [None]:
card_text = pd.DataFrame({'text': cards.apply(get_card_text, axis=1)})
card_text.head(20)

let's just go with this, see how it works out

### create a `huggingface` `datasets`

following along with the relatively simple example [here](https://github.com/huggingface/datasets/blob/master/datasets/squad/squad.py)

#### custom dataset loader?

meh let's try the `csv` loader first

#### `csv` loader

generate `csv`s the same way we were doing `parquet` (see appendix) and load those as datasets

#### loading csvs, shuffling, tokenizing, etc datasets now

+ tokenizing from [here](https://huggingface.co/docs/datasets/processing.html#processing-data-in-batches)

## fine-tune

## double-checking our trained model

next steps

+ what do our false positives look like
+ what is the separation like for "cards that have been on edhrec" vs. "cards that haven't"
    + i.e. do we just predict "both cards have been on EDHREC"?
    + did we create a dataset that is just (edhrec cards, either type)? I thought we were making (either type, either type)
+ what is the sorted list of recommendations given an existing deck

why are these all only edhrec cards? I thought I was generating pairs from both sides?

is this a problem? when a new card shows up and has never been seen before, will the model be unable to handle it? I think not, because presumably there were cards in test / val that it had never seen before (have I verified that?).
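
a quick way to verify that, sketched under the assumption that each split can be reduced to a list of `(card_a, card_b)` pairs (for the csvs here that would probably mean the `text_a` / `text_b` columns). `unseen_cards` is a hypothetical helper:

```python
def unseen_cards(train_pairs, test_pairs):
    """cards that appear in test pairs but never in any train pair."""
    seen = {card for pair in train_pairs for card in pair}
    return {card for pair in test_pairs for card in pair} - seen


# toy pairs, not real data
toy_train = [('a', 'b'), ('b', 'c')]
toy_test = [('a', 'd'), ('e', 'c')]
never_seen = unseen_cards(toy_train, toy_test)
```

if this comes back empty for the real splits, the test metrics say nothing about how the model handles brand new cards.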

In [None]:
# loading the trained model
config = BertConfig.from_pretrained('edhrec-bert-base-uncased')
model = BertForNextSentencePrediction.from_pretrained('edhrec-bert-base-uncased', config=config)
tokenizer = BertTokenizerFast.from_pretrained('edhrec-bert-base-uncased')

In [None]:
EVAL_BATCH = 36

training_args = TrainingArguments(
    output_dir='./ignore',
    per_device_eval_batch_size=EVAL_BATCH,    # batch size for evaluation
    label_names=['labels'],
)

trainer = Trainer(model=model, args=training_args, )

In [None]:
MAX_LENGTH = 300

# tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

def tokenizer_map_func(rec):
    return tokenizer(rec['text_a'], rec['text_b'],
                     padding='max_length',
                     max_length=MAX_LENGTH,
                     truncation=True)


def fix_label(rec):
    return {'label_as_int': [int(_) for _ in rec['label']]}


split_types = ['val', 'test', 'train']
dataset = (load_dataset('csv',
                        data_files={split_type: sorted(glob(os.path.join('.', 'data', split_type, '*.csv')))
                                    for split_type in split_types},
                        quoting=csv.QUOTE_ALL)
           #.map(fix_label,
           #     batched=True)
           .shuffle(seeds={split_type: 1337
                           for split_type in split_types})
           .map(tokenizer_map_func,
                batched=True))

TODO

+ make the combo dataframe
+ convert that into a dataset (probably a `.from_pandas` or some shit)
+ pass that to eval
+ sort by predictions
+ profit

run the following on a computer with `mtg` installed

In [None]:
kykar_cards = [
    "Aetherflux Reservoir",
    "Anointed Procession",
    "As Foretold",
    "Austere Command",
    "Azorius Chancery",
    "Azorius Signet",
    "Baral, Chief of Compliance",
    "Blue Sun's Zenith",
    "Boros Charm",
    "Boros Garrison",
    "Boros Signet",
    "Cascade Bluffs",
    "Chaos Warp",
    "Chromatic Lantern",
    "Command Tower",
    "Commander's Sphere",
    "Counterspell",
    "Cultivator's Caravan",
    "Cyclonic Rift",
    "Desolate Lighthouse",
    "Disallow",
    "Dismantling Blow",
    "Docent of Perfection",
    "Dovin's Veto",
    "Eerie Interlude",
    "Esper Panorama",
    "Fact or Fiction",
    "Faithless Looting",
    "Flood Plain",
    "Gitaxian Probe",
    "Glacial Fortress",
    "Grixis Panorama",
    "Guttersnipe",
    "Hallowed Fountain",
    "Impulse",
    "Izzet Boilerworks",
    "Izzet Signet",
    "Kor Haven",
    "Kykar, Wind's Fury",
    "Mentor of the Meek",
    "Mind Stone",
    "Mizzix of the Izmagnus",
    "Mizzix's Mastery",
    "Murmuring Mystic",
    "Mystic Confluence",
    "Mystic Monastery",
    "Mystic Speculation",
    "Mystical Tutor",
    "Narset Transcendent",
    "Needle Spires",
    "Neurok Stealthsuit",
    "Nimbus Maze",
    "Niv-Mizzet, Parun",
    "Omniscience",
    "Ponder",
    "Port Town",
    "Prairie Stream",
    "Preordain",
    "Primal Amulet",
    "Ral, Izzet Viceroy",
    "Reliquary Tower",
    "Render Silent",
    "Rhystic Study",
    "Sacred Foundry",
    "Sea of Clouds",
    "Seachrome Coast",
    "Serum Visions",
    "Sol Ring",
    "Spirebluff Canal",
    "Sram's Expertise",
    "Steam Vents",
    "Stroke of Genius",
    "Sulfur Falls",
    "Sunforger",
    "Supreme Verdict",
    "Swords to Plowshares",
    "Taigam, Ojutai Master",
    "Talrand, Sky Summoner",
    "Teferi, Hero of Dominaria",
    "Teferi, Time Raveler",
    "Temple of Enlightenment",
    "Temple of Epiphany",
    "The Locust God",
    "Thought Vessel",
    "Tidespout Tyrant",
    "Trail of Evidence",
    "Vandalblast",
    "Young Pyromancer", 
]

In [None]:
infinite_purphoros = [
    "Combustible Gearhulk",
    "Hellkite Charger",
    "Inferno Titan",
    "Neheb, the Eternal",
    "Tyrant's Familiar",
    "Urabrask the Hidden",
    "Zealous Conscripts",
    "Braid of Fire",
    "Impact Tremors",
    "Seething Song",
    "Sundial of the Infinite",
    "Purphoros, Bronze-Blooded",
]

In [None]:
tokens = [
    "Advent of the Wurm", 
    "Ajani, Mentor of Heroes", 
    "Akroma's Memorial", 
    "Archangel of Thune", 
    "Armada Wurm", 
    "Austere Command", 
    "Avenger of Zendikar", 
    "Blighted Woodland", 
    "Blossoming Sands", 
    "Bow of Nylea", 
    "Brushland", 
    "Caged Sun", 
    "Champion of Lambholt", 
    "Command Tower", 
    "Constant Mists", 
    "Courser of Kruphix", 
    "Cultivate", 
    "Darien, King of Kjeldor", 
    "Doubling Season", 
    "Elfhame Palace", 
    "Elspeth, Sun's Champion", 
    "Emmara Tandris", 
    "Evolving Wilds", 
    "Forest", 
    "Geist-Honored Monk", 
    "Giant Adephage", 
    "Graypelt Refuge", 
    "Green Sun's Zenith", 
    "Grove of the Guardian", 
    "Growing Ranks", 
    "Hornet Queen", 
    "Hydra Broodmaster", 
    "Incremental Growth", 
    "Into the Wilds", 
    "Kodama's Reach", 
    "Krosan Verge", 
    "Meadowboon", 
    "Mikaeus, the Lunarch", 
    "Mimic Vat", 
    "Mirari's Wake", 
    "Nature's Lore", 
    "Nissa's Renewal", 
    "Nissa, Voice of Zendikar", 
    "Nylea, God of the Hunt", 
    "Oblivion Ring", 
    "Oracle of Mul Daya", 
    "Parallel Lives", 
    "Phyrexian Processor", 
    "Phyrexian Rebirth", 
    "Plains", 
    "Primal Vigor", 
    "Rampaging Baloths", 
    "Rancor", 
    "Razorverge Thicket", 
    "Reap What Is Sown", 
    "Reliquary Tower", 
    "Restoration Angel", 
    "Rhys the Redeemed", 
    "Riftstone Portal", 
    "Rupture Spire", 
    "Sakura-Tribe Elder", 
    "Second Harvest", 
    "Selesnya Charm", 
    "Selesnya Sanctuary", 
    "Selesnya Signet", 
    "Skyshroud Claim", 
    "Slime Molding", 
    "Spawnwrithe", 
    "Sundering Growth", 
    "Sunpetal Grove", 
    "Temple Garden", 
    "Terminus", 
    "Tireless Tracker", 
    "Transguild Promenade", 
    "Trostani's Summoner", 
    "Trostani, Selesnya's Voice", 
    "Vitu-Ghazi, the City-Tree", 
    "Voice of Resurgence", 
    "Wayfaring Temple", 
    "Windswept Heath", 
    "Worldspine Wurm", 
    "Worn Powerstone", 
    "Wrath of God", 
]

In [None]:
goblins = [
    "Ash Barrens",
    "Auntie's Hovel",
    "Battle Squadron",
    "Beetleback Chief",
    "Blasphemous Act",
    "Blood Crypt",
    "Bloodfell Caves",
    "Bloodmark Mentor",
    "Boggart Harbinger",
    "Boggart Mob",
    "Boggart Shenanigans",
    "Brightstone Ritual",
    "Chandra Ablaze",
    "Cinder Barrens",
    "Coat of Arms",
    "Command Tower",
    "Commander's Sphere",
    "Diabolic Tutor",
    "Door of Destinies",
    "Dreadbore",
    "Earwig Squad",
    "Empty the Warrens",
    "Fatal Push",
    "Fervor",
    "Foreboding Ruins",
    "Frenzied Goblin",
    "Frogtosser Banneret",
    "Gempalm Incinerator",
    "Ghost Quarter",
    "Goblin Charbelcher",
    "Goblin Chieftain",
    "Goblin Grenade",
    "Goblin King",
    "Goblin Lackey",
    "Goblin Matron",
    "Goblin Offensive",
    "Goblin Piledriver",
    "Goblin Rabblemaster",
    "Goblin Razerunners",
    "Goblin Recruiter",
    "Goblin Ringleader",
    "Goblin Sharpshooter",
    "Goblin War Strike",
    "Goblin Warchief",
    "Grenzo, Dungeon Warden",
    "Grenzo, Havoc Raiser",
    "Hammer of Purphoros",
    "Havoc Festival",
    "Hordeling Outburst",
    "Impact Tremors",
    "Kiki-Jiki, Mirror Breaker",
    "Knucklebone Witch",
    "Krenko, Mob Boss",
    "Lightning Crafter",
    "Mad Auntie",
    "Mana Echoes",
    "Mogg Infestation",
    "Mogg War Marshal",
    "Mountain",
    "Nykthos, Shrine to Nyx",
    "Phyrexian Arena",
    "Purphoros, God of the Forge",
    "Quest for the Goblin Lord",
    "Rakdos Carnarium",
    "Rakdos's Return",
    "Reckless One",
    "Reliquary Tower",
    "Ruby Medallion",
    "Siege-Gang Commander",
    "Smoldering Marsh",
    "Sol Ring",
    "Solemn Simulacrum",
    "Stingscourger",
    "Sulfuric Vortex",
    "Swamp",
    "Temple of Malice",
    "Terminate",
    "Tuktuk the Explorer",
    "Vivid Marsh",
    "Whip of Erebos",
    "Wort, Boggart Auntie",
]

In [None]:
rogues = [
    "Akki Underminer",
    "Amphin Pathmage",
    "Aqueous Form",
    "Ash Barrens",
    "Ashling, the Extinguisher",
    "Balefire Dragon",
    "Barren Moor",
    "Bident of Thassa",
    "Blasphemous Act",
    "Blighted Agent",
    "Chromatic Lantern",
    "Command Tower",
    "Commander's Sphere",
    "Counterspell",
    "Crumbling Necropolis",
    "Cyclonic Rift",
    "Darkwater Catacombs",
    "Decree of Pain",
    "Deepchannel Mentor",
    "Deepfathom Skulker",
    "Diabolic Tutor",
    "Dictate of Erebos",
    "Dimir Aqueduct",
    "Dimir Guildgate",
    "Dimir Signet",
    "Disallow",
    "Dismal Backwater",
    "Dowsing Dagger",
    "Drana, Liberator of Malakir",
    "Elbrus, the Binding Blade",
    "Evolving Wilds",
    "Exotic Orchard",
    "Exsanguinate",
    "Fellwar Stone",
    "Filth",
    "Fortune Thief",
    "Goblin Vandal",
    "Grave Pact",
    "Halimar Depths",
    "Hero's Downfall",
    "Ink-Eyes, Servant of Oni",
    "Island",
    "Jwar Isle Refuge",
    "Keeper of Keys",
    "Lonely Sandbar",
    "Marchesa, the Black Rose",
    "Mask of Riddles",
    "Master of Cruelties",
    "Mind Stone",
    "Mountain",
    "Mu Yanling",
    "Myriad Landscape",
    "Mystical Tutor",
    "Nicol Bolas",
    "Night Market Lookout",
    "Notion Thief",
    "Oona's Blackguard",
    "Phage the Untouchable",
    "Polluted Delta",
    "Pyreheart Wolf",
    "Quietus Spike",
    "Rakdos Guildgate",
    "Rakdos Signet",
    "Rankle, Master of Pranks",
    "Raving Dead",
    "Reliquary Tower",
    "Rhystic Study",
    "Rogue's Passage",
    "Scion of Darkness",
    "Scytheclaw",
    "Sheoldred, Whispering One",
    "Shizo, Death's Storehouse",
    "Skeleton Key",
    "Sol Ring",
    "Submerged Boneyard",
    "Sunken Hollow",
    "Swamp",
    "Sword of Sinew and Steel",
    "Teleportal",
    "Temple of the False God",
    "Terramorphic Expanse",
    "Thada Adel, Acquisitor",
    "Thassa, God of the Sea",
    "Thought Vessel",
    "Thraximundar",
    "Unclaimed Territory",
    "Vraska, Scheming Gorgon",
    "Whispersilk Cloak",
]

In [None]:
esper_blink = [
    "Acrobatic Maneuver",
    "Angel of Condemnation",
    "Angel of Despair",
    "Angelic Chorus",
    "Arcane Sanctum",
    "Ashen Rider",
    "Austere Command",
    "Azor, the Lawbringer",
    "Azorius Chancery",
    "Azorius Signet",
    "Baleful Strix",
    "Basalt Monolith",
    "Brago, King Eternal",
    "Cathars' Crusade",
    "Cloudblazer",
    "Command Tower",
    "Commander's Sphere",
    "Conjurer's Closet",
    "Counterspell",
    "Day of Judgment",
    "Deadeye Navigator",
    "Dimir Signet",
    "Dire Undercurrents",
    "Eerie Interlude",
    "Eldrazi Displacer",
    "Ephara, God of the Polis",
    "Esper Panorama",
    "Felidar Guardian",
    "Flickerform",
    "Flickerwisp",
    "Ghostly Flicker",
    "Ghostway",
    "Glacial Fortress",
    "Glimmerpoint Stag",
    "Gonti, Lord of Luxury",
    "Halimar Depths",
    "Illusionist's Stratagem",
    "Island",
    "Knight of the White Orchid",
    "Kor Cartographer",
    "Magister Sphinx",
    "Merciless Eviction",
    "Merieke Ri Berit",
    "Mistmeadow Witch",
    "Momentary Blink",
    "Mulldrifter",
    "Mycosynth Wellspring",
    "Nebelgast Herald",
    "Nephalia Smuggler",
    "Orzhov Basilica",
    "Orzhov Signet",
    "Panharmonicon",
    "Peregrine Drake",
    "Plains",
    "Port Town",
    "Reflector Mage",
    "Rescue from the Underworld",
    "Rune-Scarred Demon",
    "Solemn Simulacrum",
    "Sphinx of the Final Word",
    "Sphinx of Uthuun",
    "Spine of Ish Sah",
    "Stonehorn Dignitary",
    "Strionic Resonator",
    "Sudden Disappearance",
    "Supreme Verdict",
    "Suture Priest",
    "Swamp",
    "Temple of Deceit",
    "Thought Vessel",
    "Traveler's Cloak",
    "Unquestioned Authority",
    "Venser, Shaper Savant",
    "Venser, the Sojourner",
    "Wall of Omens",
]

In [None]:
sets_to_check = [
    "2XM",
    "AKR",
    "C20",
    "CC1",
    "CMC",
    "CMR",
    "IKO",
    "JMP",
    "KHC",
    "M21",
    "MB1",
    "MH2",
    "Q03",
    "SLD",
    "SLU",
    "SS3",
    "THB",
    "TSR",
    "ZNC",
    "ZNE",
    "ZNR",
]

In [None]:
it = [(kykar_cards, ['W', 'U', 'R'], 'kykar.csv'),
      (infinite_purphoros, ['R'], 'infinite_purphoros.csv'),
      (tokens, ['W', 'G'], 'tokens.csv'),
      (goblins, ['B', 'R'], 'goblins.csv'),
      (rogues, ['U', 'B', 'R'], 'rogues.csv'),
      (esper_blink, ['U', 'B', 'W'], 'esper_blink.csv')]

In [None]:
for (deck_cards, deck_colors, f_out) in it:
    print(f"f_out = {f_out}")
    cards_to_check = (cards
                      [cards.setname.isin(sets_to_check)
                       & cards.colorIdentity.apply(lambda x: set(x).difference(deck_colors) == set())
                       & ~cards.index.isin(deck_cards)]
                      .index
                      .unique())
    
    df_to_check = pd.DataFrame([{'text_a': card_text.loc[kc, 'text'],
                                 'text_b': card_text.loc[ctc, 'text'],
                                 'name_a': kc,
                                 'name_b': ctc}
                                for kc in deck_cards
                                for ctc in cards_to_check])
    
    print(df_to_check.shape)

    df_to_check.to_csv(os.path.join('.', f_out),
                       index=False,
                       quoting=csv.QUOTE_ALL)

now run the following on any machine that has those `csvs` copied to it

In [None]:
deck_names = ['esper_blink',
              'goblins',
              'kykar',
              'infinite_purphoros',
              'rogues',
              'tokens']

ds_to_check = (load_dataset('csv',
                            data_files={k: f'{k}.csv' for k in deck_names},
                            quoting=csv.QUOTE_ALL)
               .map(tokenizer_map_func, batched=True))

In [None]:
ds_to_check

In [None]:
pd.set_option('display.max_columns', None)  # or 1000
pd.set_option('display.max_rows', None)  # or 1000
pd.set_option('display.max_colwidth', None)

In [None]:
from scipy.special import softmax

for deck_name in deck_names:
    print(f"deck_name = {deck_name}")
    p = trainer.predict(ds_to_check[deck_name])
    print(f"p.predictions.shape = {p.predictions.shape}")

    probs = softmax(p.predictions, axis=1)

    z = pd.DataFrame({'p1': probs[:, 1],
                      'y_pred': probs.argmax(axis=1),
                      'name_b': ds_to_check[deck_name]['name_b'],
                      'text_b': ds_to_check[deck_name]['text_b']})
    z.reset_index(drop=True, inplace=True)
    
    recs = (z
            .groupby(['name_b', 'text_b'])
            .p1
            .median()
            .sort_values(ascending=False)
            .reset_index())
    
    recs.to_parquet(f"{deck_name}.parquet")

In [None]:
d1k = dataset['test'].select(range(1_000))
d1k

In [None]:
d1h = dataset['test'].select(range(100))
d1h

In [None]:
pd.set_option('display.max_columns', None)  # or 1000
pd.set_option('display.max_rows', None)  # or 1000
pd.set_option('display.max_colwidth', None)

In [None]:
import torch


def get_preds(n=100, chunk_size=100):
    """score test records in chunks and return them sorted by confidence."""
    model_keys = ['attention_mask', 'input_ids', 'token_type_ids']
    chunks = []

    for i in range(0, n, chunk_size):
        print(f"i = {i}")
        # slicing a dataset returns a dict of column name --> list of values
        d = dataset['test'][i: i + chunk_size]
        inputs = {k: torch.as_tensor(np.array(v))
                  for (k, v) in d.items()
                  if k in model_keys}

        # inference only, so skip gradient tracking
        with torch.no_grad():
            p = model(**inputs)[0].softmax(1)

        z_now = pd.DataFrame(p.numpy(), columns=['p0', 'p1'])
        for key in ['label', 'text_a', 'text_b']:
            z_now.loc[:, key] = d[key]
        chunks.append(z_now)

    # DataFrame.append was removed in pandas 2.0; concat once at the end
    z = pd.concat(chunks, ignore_index=True)

    # how far apart the two class probabilities are (model confidence)
    z.loc[:, 'p_delta'] = (z.p0 - z.p1).abs()
    z.loc[:, 'is_right'] = (z.p1 > z.p0) == z.label
    z.sort_values(by='p_delta', inplace=True, ascending=False)

    return z

In [None]:
z = get_preds(500)

z.tail(20)

In [None]:
z.is_right.value_counts()

In [None]:
z[~z.is_right].head(15)

In [None]:
is_edhrec = card_text.copy()
is_edhrec.loc[:, 'is_edhrec'] = is_edhrec.index.isin(edhrec_cards.name.unique())
is_edhrec.reset_index(inplace=True)
is_edhrec.head(10)

In [None]:
(z
 .merge(is_edhrec.rename(columns={'text': 'text_a', 'is_edhrec': 'is_edhrec_a'})[['text_a', 'is_edhrec_a']],
        how='left',
        on='text_a')
 .merge(is_edhrec.rename(columns={'text': 'text_b', 'is_edhrec': 'is_edhrec_b'})[['text_b', 'is_edhrec_b']],
        how='left',
        on='text_b')
 .groupby(['is_right', 'is_edhrec_b'])
 .is_right.count())

## make deck predictions for one of my existing decks

# appendix

the following is either hacking or stuff that didn't work

### tokenizing sentences

~~we will be reusing most of the text sentences above several times; might as well tokenize them all up front once instead of tokenizing most of them 100x later~~

just do shit the way the documentation suggests we should. do them on the completely built pair parquet files below

In [None]:
# from transformers import RobertaTokenizerFast
# tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')

In [None]:
# def my_tokenizer(row, *args, **kwargs):
#     return pd.Series(tokenizer(row.text, *args, **kwargs))

In [None]:
# (card_text.head(20)
#  .apply(my_tokenizer, axis=1, truncation=True, padding=True))

In [None]:
# card_text = (card_text
#              .join(card_text
#                    .apply(my_tokenizer, axis=1, truncation=True, padding=True)))

# card_text.head(10)

### making the pair suggestions dataset

okay so we have

1. a train / test / val split of all cards
1. a series of card text values (our "sentences")
1. a list of `card --> deck` relationships

the task now is to

1. generate positive and negative cases for each card
    + positive: `card --> deck <-- card`
    + negative: just not that
1. look up their text values
1. write those values to file
    + probably want to chunk this up somehow, maybe write 1k sentences per parquet
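
the chunked write could look something like this. it's a sketch only: it writes `csv` instead of parquet (so it needs nothing beyond the standard library and matches the `QUOTE_ALL` loader settings used earlier), and the `write_chunks` name, the file naming, and the `text_a` / `text_b` / `label` columns are assumptions:

```python
import csv
import itertools
import os


def write_chunks(records, out_dir, chunk_size=1_000):
    """write dict records to numbered csv files, `chunk_size` rows per file."""
    os.makedirs(out_dir, exist_ok=True)
    it = iter(records)
    paths = []
    for i in itertools.count():
        # pull the next `chunk_size` records without loading everything at once
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            break
        path = os.path.join(out_dir, f'{i:05d}.csv')
        with open(path, 'w', newline='') as fp:
            writer = csv.DictWriter(fp,
                                    fieldnames=['text_a', 'text_b', 'label'],
                                    quoting=csv.QUOTE_ALL)
            writer.writeheader()
            writer.writerows(chunk)
        paths.append(path)
    return paths
```

since `records` can be any iterable, a generator of positive and negative pairs can be streamed straight to disk, e.g. `write_chunks(pair_records, os.path.join('.', 'data', 'train'))` where `pair_records` is whatever yields the dicts.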

### build the pytorch datasets

basing this in large part off of [this doc page](https://huggingface.co/transformers/custom_datasets.html#nlplib)

#### do the encodings

so, the below killed the kernel... :(