Initial Testing of a Decoder-only (GPT style) architecture.

# NOTE:

The tokenizer and other preprocessing steps were fit on the FULL dataset. This leaks validation and test data into training; the final product should split into train/val/test first and fit only on the training portion.

# Imports

In [36]:
import os

import tokenizers
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

import transformers
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer
from transformers import AutoTokenizer, BertForSequenceClassification, Trainer, TrainingArguments, DataCollatorWithPadding
from datasets import Dataset, load_dataset #, load_metric

import tensorflow as tf

from sklearn.model_selection import train_test_split


import ast # Because the tokenized_text column data is stored as a string instead of a list...
import pandas as pd


In [None]:
# Check TensorFlow GPU availability
print(tf.config.list_physical_devices('GPU'))

# It looks like Brandon's rig hates him right now... stupid WSL2...

[]


# Grab the dataset

1) Get the dataset
2) Convert the tokenized_text column from a string to a list of strings
3) Pull out all of the words into a mega-list

In [40]:
# Read the processed card data from wherever you have it.

# df = pd.read_csv('../data/processed/mtg_carddata_processed.csv')
df = pd.read_csv('../data/processed/mtg_carddata_processed_2_23_25.csv')  # forward slashes avoid invalid escape sequences and work on every OS

In [41]:
df

Unnamed: 0,name,mana_cost,type_line,oracle_text,power,toughness,colors,keywords,mtgo_id,loyalty,defense,processed_oracle_text,tokenized_text,tfidf_vector
0,"Nissa, Worldsoul Speaker",{3}{G},Legendary Creature — Elf Druid,"Landfall — Whenever a land you control enters,...",3,3,['G'],['Landfall'],,,,Landfall — Whenever a land you control enters ...,"['Landfall', 'Whenever', 'a', 'land', 'you', '...",[0. 0. 0. ... 0. 0. 0.]
1,Static Orb,{3},Artifact,"As long as <name> is untapped, players can't u...",,,[],[],15870.0,,,"As long as Static Orb is untapped , players ca...","['As', 'long', 'as', '<name>', 'is', 'untapped...",[0. 0. 0. ... 0. 0. 0.]
2,Sensory Deprivation,{U},Enchantment — Aura,Enchant creature\r\nEnchanted creature gets -3...,,,['U'],['Enchant'],49283.0,,,Enchant creature \n Enchanted creature gets -3...,"['Enchant', 'creature', '\\n', 'Enchanted', 'c...",[0. 0. 0. ... 0. 0. 0.]
3,Road of Return,{G}{G},Sorcery,Choose one —\r\n• Return target permanent card...,,,['G'],['Entwine'],77122.0,,,Choose one — \n • Return target permanent card...,"['Choose', 'one', '\\n', 'Return', 'target', '...",[0. 0. 0. ... 0. 0. 0.]
4,Storm Crow,{1}{U},Creature — Bird,Flying (This creature can't be blocked except ...,1,2,['U'],['Flying'],22609.0,,,Flying (This creature can't be blocked except ...,"['Flying', 'This', 'creature', ""can't"", 'be', ...",[0. 0. 0. ... 0. 0. 0.]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30618,Devoted Hero,{W},Creature — Elf Soldier,,1,2,['W'],[],,,,,[],[0. 0. 0. ... 0. 0. 0.]
30619,Without Weakness,{1}{B},Instant,Target creature you control gains indestructib...,,,['B'],['Cycling'],64646.0,,,Target creature you control gains indestructib...,"['Target', 'creature', 'you', 'control', 'gain...",[0. 0. 0. ... 0. 0. 0.]
30620,Firesong and Sunspeaker,{4}{R}{W},Legendary Creature — Minotaur Cleric,Red instant and sorcery spells you control hav...,4,6,"['R', 'W']",[],101914.0,,,Red instant and sorcery spells you control hav...,"['Red', 'instant', 'and', 'sorcery', 'spells',...",[0. 0. 0. ... 0. 0. 0.]
30621,"Samut, the Tested",{2}{R}{G},Legendary Planeswalker — Samut,+1: Up to one target creature gains double str...,,,"['G', 'R']",[],64772.0,4,,+1 : Up to one target creature gains double st...,"['1', ':', 'Up', 'to', 'one', 'target', 'creat...",[0. 0. 0. ... 0. 0. 0.]


In [42]:
df['tokenized_text'][0]

"['Landfall', 'Whenever', 'a', 'land', 'you', 'control', 'enters', ',', 'you', 'get', '{E}', '{E}', 'two', 'energy', 'counters', '.', '\\\\n', 'You', 'may', 'pay', 'eight', '{E}', 'rather', 'than', 'pay', 'the', 'mana', 'cost', 'for', 'permanent', 'spells', 'you', 'cast', '.']"

In [43]:
import ast

# Apply ast.literal_eval to each row in the tokenized_text column
df['tokenized_text'] = df['tokenized_text'].apply(ast.literal_eval)

# Display first few rows to verify
df['tokenized_text'].head()

0    [Landfall, Whenever, a, land, you, control, en...
1    [As, long, as, <name>, is, untapped, ,, player...
2    [Enchant, creature, \n, Enchanted, creature, g...
3    [Choose, one, \n, Return, target, permanent, c...
4    [Flying, This, creature, can't, be, blocked, e...
Name: tokenized_text, dtype: object

In [44]:
df['tokenized_text'][0][0]

'Landfall'

In [45]:
# Combine all tokens into one large list
all_tokens = []
for tokens in df['tokenized_text']:
    all_tokens.extend(tokens)

# Alternative one-liner using list comprehension
# all_tokens = [token for tokens in df['tokenized_text'] for token in tokens]

# Display the first 20 tokens to verify
print(f"Total tokens: {len(all_tokens)}")
print("First 20 tokens:", all_tokens[:20])

Total tokens: 968000
First 20 tokens: ['Landfall', 'Whenever', 'a', 'land', 'you', 'control', 'enters', ',', 'you', 'get', '{E}', '{E}', 'two', 'energy', 'counters', '.', '\\n', 'You', 'may', 'pay']


In [46]:
corpus = df['processed_oracle_text']
corpus

0        Landfall — Whenever a land you control enters ...
1        As long as Static Orb is untapped , players ca...
2        Enchant creature \n Enchanted creature gets -3...
3        Choose one — \n • Return target permanent card...
4        Flying (This creature can't be blocked except ...
                               ...                        
30618                                                  NaN
30619    Target creature you control gains indestructib...
30620    Red instant and sorcery spells you control hav...
30621    +1 : Up to one target creature gains double st...
30622                     All Sliver creatures get +1/+1 .
Name: processed_oracle_text, Length: 30623, dtype: object

In [47]:
# Convert corpus to list of strings if it's not already
corpus_list = corpus.tolist() if hasattr(corpus, 'tolist') else list(corpus)


In [48]:
type(corpus)

pandas.core.series.Series

In [49]:
type(corpus_list[0])

str

In [50]:
print(corpus[0:3])

0    Landfall — Whenever a land you control enters ...
1    As long as Static Orb is untapped , players ca...
2    Enchant creature \n Enchanted creature gets -3...
Name: processed_oracle_text, dtype: object


In [51]:
# Convert pandas Series to list of strings
corpus_list = corpus.values.tolist()

# Ensure all elements are strings (note: cards with no rules text are NaN and become the string 'nan')
corpus_list = [str(text) for text in corpus_list]



In [52]:
corpus_list[0]

'Landfall — Whenever a land you control enters , you get {E} {E} (two energy counters) . \\n You may pay eight {E} rather than pay the mana cost for permanent spells you cast .'

In [53]:
print(type(corpus_list))  # Should show: <class 'list'>
print(type(corpus_list[0]))  # Should show: <class 'str'>

<class 'list'>
<class 'str'>


# Split into train/val/test datasets

Do we need to keep other information with the corpus_list as we split it up?  Names, color, etc.?

If we're just generating text, I think that answer is no.  If we want more, then the answer is likely yes.
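
If the extra columns do turn out to matter, one way to keep the option open is to split row positions rather than the text itself, so names, colors, and so on can be looked up later. Below is a minimal sketch using only the standard library. The helper name `split_indices`, the toy card lists, and the fixed seed are illustrative assumptions, not part of this notebook.

```python
import random

def split_indices(n_rows, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle row positions 0..n_rows-1 and cut them into train/val/test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed so the split is repeatable
    n_val = round(n_rows * val_frac)
    n_test = round(n_rows * test_frac)
    val_idx = indices[:n_val]
    test_idx = indices[n_val:n_val + n_test]
    train_idx = indices[n_val + n_test:]
    return train_idx, val_idx, test_idx

# Toy stand-in for the card table: parallel lists of names and rules text.
names = [f"Card {i}" for i in range(10)]
texts = [f"rules text {i}" for i in range(10)]

train_idx, val_idx, test_idx = split_indices(len(texts))
train_texts = [texts[i] for i in train_idx]
train_names = [names[i] for i in train_idx]  # metadata stays aligned with the text
print(len(train_idx), len(val_idx), len(test_idx))
```

With the DataFrame in this notebook, the same positions could be passed to `df.iloc[train_idx]` to recover any column for a given split.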

In [54]:
train, temp = train_test_split(corpus_list, test_size=0.2) # Set 80% for training
val, test = train_test_split(temp, test_size=0.5) # Set 10% for validation and 10% for testing

In [56]:
train[:5]

['{3} : Hypnotic Grifter connives . (Draw a card , then discard a card . If you discarded a nonland card , put a +1/+1 counter on this creature . )',
 'Burning Hands deals 2 damage to target creature or planeswalker . If that permanent is green , Burning Hands deals 6 damage instead .',
 '(Theme color : {U} )',
 "Enchant creature \\n Enchanted creature gets +1/+1 and has lifelink . \\n When Squire's Devotion enters , create a 1/1 white Vampire creature token with lifelink .",
 'Devoid (This card has no color . ) \\n Flying \\n When Eldrazi Skyspawner enters , create a 1/1 colorless Eldrazi Scion creature token . It has "Sacrifice this creature : Add {C} . "']

# Train the Tokenizer on Training (or the full thing?) Data

In [None]:
# Create a directory for tokenizer files if it doesn't exist
models_dir = "../models"
os.makedirs(models_dir, exist_ok=True)

# Initialize the BPE trainer (target vocabulary size and special tokens)
tk_trainer = tokenizers.trainers.BpeTrainer(
    vocab_size=8192,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Initialize an empty BPE tokenizer; unk_token lets unseen characters map to [UNK]
tokenizer = tokenizers.Tokenizer(tokenizers.models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = tokenizers.pre_tokenizers.Whitespace()

tokenizer.enable_padding(pad_id=0, pad_token="[PAD]")
tokenizer.enable_truncation(max_length=512) # Maybe we cut this down to 258?

# Train the tokenizer on just the training split to prevent data leakage
# tokenizer.train_from_iterator(corpus_list, trainer=tk_trainer)
tokenizer.train_from_iterator(train, trainer=tk_trainer)

# Save the tokenizer
tokenizer.save(os.path.join(models_dir, "tokenizer.json")) # Your tokenizer may have been trained with the full dataset causing a data leak...

In [58]:
# Verify the tokenizer works
sample_text = val[0]
encoded = tokenizer.encode(sample_text)
print(f"Encoded: {encoded.tokens}")

Encoded: ['If', 'Leyline', 'of', 'Ab', 'und', 'ance', 'is', 'in', 'your', 'opening', 'hand', ',', 'you', 'may', 'begin', 'the', 'game', 'with', 'it', 'on', 'the', 'battlefield', '.', '\\', 'n', 'Whenever', 'you', 'tap', 'a', 'creature', 'for', 'mana', ',', 'add', 'an', 'additional', '{', 'G', '}', '.', '\\', 'n', '{', '6', '}', '{', 'G', '}', '{', 'G', '}', ':', 'Put', 'a', '+', '1', '/+', '1', 'counter', 'on', 'each', 'creature', 'you', 'control', '.']


In [59]:
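# Test the tokenizer on a string with a literal backslash-t sequence and nonsensical text
# (named test_encoding so it does not overwrite the `test` split created above)
test_encoding = tokenizer.encode("This is a test of \\t nonzensicallicalness {U}")
print(test_encoding.tokens)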

['This', 'is', 'a', 'test', 'of', '\\', 't', 'non', 'z', 'ens', 'ic', 'all', 'ical', 'ness', '{', 'U', '}']
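
The encoded outputs above show the Whitespace pre-tokenizer breaking mana symbols such as {G} into three pieces and the literal \n marker into two. If that turns out to hurt generation, one option is to split on a pattern that keeps those units whole. The sketch below uses only the standard library `re` module to illustrate such a rule. The pattern and the helper name `pre_split` are assumptions for illustration, not part of this notebook's pipeline.

```python
import re

# Assumed pattern: keep {..} symbols and the literal backslash-n marker whole,
# then fall back to word runs and single punctuation characters.
TOKEN_PATTERN = re.compile(r"\{[^{}]+\}|\\n|\w+|[^\w\s]")

def pre_split(text):
    """Split card text into units, keeping mana symbols like {G} intact."""
    return TOKEN_PATTERN.findall(text)

sample = "add an additional {G} . \\n {6}{G}{G} : Put a +1/+1 counter"
print(pre_split(sample))
```

Another route that stays inside the tokenizers library may be to list the mana symbols in the trainer's `special_tokens`, since special tokens are not split further.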


## Load your Tokenizer

In [None]:
# Since you have a pre-trained tokenizer, you can now load it directly
tokenizer = Tokenizer.from_file(os.path.join(models_dir, "tokenizer.json"))

# Build the Decoder

In [60]:
class PositionalEmbedding(tf.keras.layers.Layer):
    def __init__(self, vocab_size, embed_dims, seq_len):
        super().__init__()
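        # mask_zero belongs on the token embedding so that [PAD] (id 0) is masked;
        # compute_mask below delegates to this layer.
        self.embedding = tf.keras.layers.Embedding(vocab_size + 1, embed_dims, mask_zero=True)
        self.pos_embedding = tf.keras.layers.Embedding(seq_len, embed_dims)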

    def compute_mask(self, *args, **kwargs):
        return self.embedding.compute_mask(*args, **kwargs)

    def call(self, x):
        index_range = tf.shape(x)[-1]
        y = self.embedding(x)
        indices = tf.range(index_range)
        pos = self.pos_embedding(indices)
        return y + pos

In [61]:
class DecoderSelfAttention(tf.keras.layers.Layer):
    def __init__(self, num_heads, key_dim):
        super().__init__()
        self.attention = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads,
            key_dim=key_dim)
        
    def call(self, x):
        y = self.attention(
            query=x,
            value=x,
            key=x,
            use_causal_mask=True)
        return y
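
`use_causal_mask=True` asks Keras to block attention to later positions, which is what makes this layer usable in a decoder. As a plain-Python illustration of the mask pattern (not the Keras implementation), the hypothetical helper below builds the lower-triangular grid of allowed query/key pairs.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len grid; mask[i][j] is True when query i may attend to key j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

# Print the pattern: "x" marks an allowed position, "." a blocked (future) one.
for row in causal_mask(4):
    print(" ".join("x" if allowed else "." for allowed in row))
```

Each row gains one more allowed position than the row above it, so token i only ever sees tokens 0 through i.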

In [None]:
# Should match the tokenizer's vocabulary size. The tokenizer above was trained with
# vocab_size=8192, so 8912 looks like a transposition; confirm via tokenizer.get_vocab_size().
VOCAB_SIZE = 8912
# Max sequence length. Tokenizer truncation is set to 512, which exceeds this.
SEQUENCE_LENGTH = 500
EMBEDDING_DIM = 512 # Configure
FEED_FORWARD_DIM = 4 * EMBEDDING_DIM # This is usual value used by other LLMs, from below reference.
DROPOUT = 0.5 # Configure
NUM_HEADS = 16 # Configure
NUM_DECODER_LAYERS = 4 # Configure

### Reference: https://cameronrwolfe.substack.com/p/decoder-only-transformers-the-workhorse

decoder_inputs = tf.keras.Input(shape=(None,), dtype=tf.int64, name="decoder_inputs")
positional_embedding = PositionalEmbedding(
    VOCAB_SIZE, EMBEDDING_DIM, SEQUENCE_LENGTH)(decoder_inputs)

layer_input = positional_embedding
layer_output = None
for i in range(NUM_DECODER_LAYERS):
    decoder_norm_1 = tf.keras.layers.LayerNormalization()(layer_input)
    decoder_self_attention = DecoderSelfAttention(NUM_HEADS, EMBEDDING_DIM)(decoder_norm_1)
    # Pre-norm residual: add the attention output to the un-normalized block input
    decoder_add_1 = tf.keras.layers.Add()([decoder_self_attention, layer_input])
    decoder_norm_2 = tf.keras.layers.LayerNormalization()(decoder_add_1)
    decoder_feedforward_1 = tf.keras.layers.Dense(FEED_FORWARD_DIM, activation="gelu")(decoder_norm_2)
    decoder_feedforward_2 = tf.keras.layers.Dense(EMBEDDING_DIM)(decoder_feedforward_1)
    decoder_dropout = tf.keras.layers.Dropout(DROPOUT)(decoder_feedforward_2)
    decoder_add_2 = tf.keras.layers.Add()([decoder_dropout, decoder_add_1])

    layer_input = decoder_add_2
    layer_output = decoder_add_2

prediction = tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax")(layer_output)
transformer = tf.keras.models.Model(
    inputs=decoder_inputs, outputs=prediction, name="transformer")

In [63]:
transformer.summary()

Model: "transformer"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 decoder_inputs (InputLayer  [(None, None)]               0         []                            
 )                                                                                                
                                                                                                  
 positional_embedding (Posi  (None, None, 512)            4819456   ['decoder_inputs[0][0]']      
 tionalEmbedding)                                                                                 
                                                                                                  
 layer_normalization (Layer  (None, None, 512)            1024      ['positional_embedding[0][0]']
 Normalization)                                                                         

## A Short Rebuild of the above but with Hugging Face API

We might be able to leverage accelerated training by using HF, so this could be worth having. It also uses a PyTorch backend, which I know works on Brandon's system. TensorFlow has been giving me problems in my environment lately.

In [None]:
# Grab the decoder-only setup
from transformers import GPT2Config, GPT2LMHeadModel

In [None]:
# Should match the tokenizer's vocabulary size. The tokenizer above was trained with
# vocab_size=8192, so 8912 looks like a transposition; confirm via tokenizer.get_vocab_size().
VOCAB_SIZE = 8912
# Max sequence length, aka context window. Tokenizer truncation is set to 512, which exceeds this.
SEQUENCE_LENGTH = 500
EMBEDDING_DIM = 512 # Configure
FEED_FORWARD_DIM = 4 * EMBEDDING_DIM # Standard for decoder-only items.
DROPOUT = 0.5 # Configure
NUM_HEADS = 16 # Configure attention heads
NUM_DECODER_LAYERS = 4 # Configure

# Define the model configuration
model_config = GPT2Config(
    vocab_size=VOCAB_SIZE,
    n_positions=SEQUENCE_LENGTH,
    n_ctx=SEQUENCE_LENGTH,
    n_embd=EMBEDDING_DIM,
    n_layer=NUM_DECODER_LAYERS,
    n_head=NUM_HEADS,
    resid_pdrop=DROPOUT,
    embd_pdrop=DROPOUT,
    attn_pdrop=DROPOUT,
    # layer_norm_epsilon=1e-5,
    # initializer_range=0.02,
)

# Initialize the model
gptmodel = GPT2LMHeadModel(model_config)
# Output the model layers for inspection
gptmodel

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(8912, 512)
    (wpe): Embedding(500, 512)
    (drop): Dropout(p=0.5, inplace=False)
    (h): ModuleList(
      (0-3): 4 x GPT2Block(
        (ln_1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=1536, nx=512)
          (c_proj): Conv1D(nf=512, nx=512)
          (attn_dropout): Dropout(p=0.5, inplace=False)
          (resid_dropout): Dropout(p=0.5, inplace=False)
        )
        (ln_2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=2048, nx=512)
          (c_proj): Conv1D(nf=512, nx=2048)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.5, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=512, out_features=8912, bias=False)
)

In [None]:
# How many trainable parameters exist in the gptmodel? (counts only parameters with requires_grad=True)
trainable_params = sum(p.numel() for p in gptmodel.parameters() if p.requires_grad)
print("Number of trainable parameters:", trainable_params)

Number of trainable parameters: 17429504


In [76]:
# Print each parameter's name and trainable count individually
for name, param in gptmodel.named_parameters():
    if param.requires_grad:
        print(f"{name}: {param.numel()} trainable parameters")

transformer.wte.weight: 4562944 trainable parameters
transformer.wpe.weight: 256000 trainable parameters
transformer.h.0.ln_1.weight: 512 trainable parameters
transformer.h.0.ln_1.bias: 512 trainable parameters
transformer.h.0.attn.c_attn.weight: 786432 trainable parameters
transformer.h.0.attn.c_attn.bias: 1536 trainable parameters
transformer.h.0.attn.c_proj.weight: 262144 trainable parameters
transformer.h.0.attn.c_proj.bias: 512 trainable parameters
transformer.h.0.ln_2.weight: 512 trainable parameters
transformer.h.0.ln_2.bias: 512 trainable parameters
transformer.h.0.mlp.c_fc.weight: 1048576 trainable parameters
transformer.h.0.mlp.c_fc.bias: 2048 trainable parameters
transformer.h.0.mlp.c_proj.weight: 1048576 trainable parameters
transformer.h.0.mlp.c_proj.bias: 512 trainable parameters
transformer.h.1.ln_1.weight: 512 trainable parameters
transformer.h.1.ln_1.bias: 512 trainable parameters
transformer.h.1.attn.c_attn.weight: 786432 trainable parameters
transformer.h.1.attn.c_at

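There is a difference in the reported number of trainable parameters between Etienne's custom build and the Hugging Face model.

The embeddings do not look like the cause. The custom PositionalEmbedding layer reports 4,819,456 parameters (about 4.8M, not 48M), which is nearly identical to wte + wpe in the GPT-2 model (4,562,944 + 256,000 = 4,818,944). The 512-parameter gap comes from the extra row added by vocab_size + 1. Two more likely sources are the attention layers and the output layer. Keras' MultiHeadAttention treats key_dim as the size of each head, and the custom build passes EMBEDDING_DIM rather than EMBEDDING_DIM // NUM_HEADS, so its attention projections are NUM_HEADS times wider. GPT2LMHeadModel also ties lm_head to wte, whereas the custom build has a separate Dense(VOCAB_SIZE) output layer.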

Both models are likely to generate similar results.
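
The parameter counts can be checked by hand. The sketch below recomputes the Hugging Face total from the hyperparameters above and compares the size of one Keras MultiHeadAttention layer for two choices of key_dim. The helper names are hypothetical, and the formulas assume a 4x feed-forward width, biases on every projection, and an output head tied to the token embedding.

```python
def gpt2_param_count(vocab, positions, embed, layers):
    """Trainable parameters of a GPT-2 style stack with a tied output head."""
    embeddings = vocab * embed + positions * embed                         # wte + wpe
    layer_norm = 2 * embed                                                 # weight + bias
    attention = (embed * 3 * embed + 3 * embed) + (embed * embed + embed)  # c_attn + c_proj
    mlp = (embed * 4 * embed + 4 * embed) + (4 * embed * embed + embed)    # c_fc + c_proj
    block = 2 * layer_norm + attention + mlp
    return embeddings + layers * block + layer_norm                        # plus final ln_f

def keras_mha_param_count(embed, num_heads, key_dim):
    """Parameters of one Keras MultiHeadAttention layer (value_dim defaults to key_dim)."""
    inner = num_heads * key_dim
    qkv = 3 * (embed * inner + inner)  # query, key, value projections with biases
    out = inner * embed + embed        # output projection with bias
    return qkv + out

print(gpt2_param_count(8912, 500, 512, 4))        # matches the 17429504 reported above
print(keras_mha_param_count(512, 16, 512))        # key_dim=EMBEDDING_DIM, as in the custom build
print(keras_mha_param_count(512, 16, 512 // 16))  # key_dim=EMBEDDING_DIM // NUM_HEADS
```

With key_dim=512 a single attention layer comes to 16,802,304 parameters, versus 1,050,624 with key_dim=32, which equals c_attn + c_proj in one GPT2Block.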

# Tokenize your train/val/test

## Set up datasets for loading data
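
For next-token prediction, each tokenized card is typically turned into an input sequence and a target sequence shifted by one position. Below is a minimal sketch on plain lists of token ids. The helper name `make_lm_pair` and the example ids are hypothetical, and pad id 0 is assumed to match the `[PAD]` setting used when padding was enabled on the tokenizer.

```python
PAD_ID = 0  # assumed to match pad_id=0 from tokenizer.enable_padding

def make_lm_pair(token_ids, max_len):
    """Truncate or pad token_ids to max_len + 1, then return (inputs, targets) shifted by one."""
    ids = list(token_ids)[:max_len + 1]
    ids = ids + [PAD_ID] * (max_len + 1 - len(ids))
    inputs = ids[:-1]   # everything except the last token
    targets = ids[1:]   # the same sequence moved one step left
    return inputs, targets

inputs, targets = make_lm_pair([17, 42, 8, 99], max_len=6)
print(inputs)
print(targets)
```

This step applies to the Keras build. GPT2LMHeadModel performs the shift internally when `labels` are supplied, so there the labels can simply be a copy of the input ids.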

# Training Loop

# Setup Training Arguments

OLD CODING WORK:

# Set up the Trainer

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True, return_tensors='pt')

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset_reduced['train'],
    eval_dataset=tokenized_dataset_reduced['validation'],
    tokenizer=tokenizer,
    data_collator=data_collator
)
