Cleaned version of a decoder-only (GPT-style) transformer architecture.

# Imports

*YOU NEED TO RUN `decoder_thompson_botzercopy.ipynb` before this so you can create the dataset via: `training_data.to_csv("../data/processed/training_data.csv", index=False)`*

In [1]:
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors
from tokenizers.normalizers import NFKC, Sequence
import os
from typing import List, Optional, Union
from tqdm import tqdm  # plain-text progress bars (the tqdm.notebook import was immediately overridden by this one)
import random

import transformers
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer, GPT2TokenizerFast
from transformers import PreTrainedTokenizerFast
from transformers import AutoTokenizer, BertForSequenceClassification, Trainer, TrainingArguments, DataCollatorWithPadding, DataCollatorForLanguageModeling
from datasets import Dataset, DatasetDict,load_dataset #, load_metric
from transformers.tokenization_utils_base import AddedToken

# import tensorflow as tf

from sklearn.model_selection import train_test_split

import torch


import ast # Because the tokenized_text column data is stored as a string instead of a list...
import pandas as pd



This notebook uses the Hugging Face `transformers` `Trainer` API, which runs on a PyTorch backend.

In [2]:
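# Check PyTorch GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# get_device_name() raises without a CUDA device, so fall back to a plain label
device_name = torch.cuda.get_device_name() if torch.cuda.is_available() else "CPU"
print(f'You are using {device} on a {device_name}.')

You are using cuda on a NVIDIA GeForce GTX 1070.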


# Grab the dataset

1) Get the dataset
2) Fix the tokenize_text column from a string to a list of strings
3) Pull out all of the words into a mega-list

**SOMETHING IS WRONG WITH THE DATASET.  Likely an issue with the preprocessing steps...**

See:

Item 19152 seems like it is too many cards put together... I need to figure out what is going on.

'Crash Landing — Search your library for a basic land card , reveal it , put it into your hand , then shuffle . \\n Goblin Camp — Create a Treasure token . \\n Emerald Grove — Create a 2/2 white Knight creature token . \\n Auntie\'s Teahouse — Scry 3 . \\n Defiled Temple — You may sacrifice a permanent . If you do , draw a card . \\n Mountain Pass — You may put a land card from your hand onto the battlefield . \\n Ebonlake Grotto — Create two 1/1 blue Faerie Dragon creature tokens with flying . \\n Grymforge — For each opponent , goad up to one target creature that player controls . \\n Githyanki Crèche — Distribute three +1/+1 counters among up to three target creatures you control . \\n Last Light Inn — Draw two cards . \\n Reithwin Tollhouse — Roll 2d4 and create that many Treasure tokens . \\n Moonrise Towers — Instant and sorcery spells you cast this turn cost 3 less to cast . \\n Gauntlet of Shar — Each opponent loses 5 life . \\n Balthazar\'s Lab — Return up to two target creature cards from your graveyard to your hand . \\n Circus of the Last Days — Create a token that\'s a copy of one of your commanders , except it\'s not legendary . \\n Undercity Ruins — Create three 4/1 black Skeleton creature tokens with menace . \\n Steel Watch Foundry — You get an emblem with "Creatures you control get +2/+2 and have trample . " \\n Ansur\'s Sanctum — Reveal the top four cards of your library and put them into your hand . Each opponent loses life equal to those cards\' total mana value . \\n Temple of Bhaal — Creatures your opponents control get -5/-5 until end of turn .'


In [25]:
# Read the processed card data from wherever you have it.

# df = pd.read_csv('../data/processed/mtg_carddata_processed.csv')

# df = pd.read_csv('../data/processed/mtg_carddata_processed_2_23_25.csv')  # Trainer and current process will work with this dataset

df = pd.read_csv('../data/processed/training_data.csv')

In [26]:
df

Unnamed: 0,sentence,prediction
0,"[start] <THEMES> N Identity , Artifact , N Col...",<CARD_NAME>
1,"[start] <THEMES> N Identity , Artifact , N Col...",<name>
2,"[start] <THEMES> N Identity , Artifact , N Col...",<MANA_COST>
3,"[start] <THEMES> N Identity , Artifact , N Col...",{3}
4,"[start] <THEMES> N Identity , Artifact , N Col...",<TYPE_LINE>
...,...,...
1267820,"[start] <THEMES> Creature , W Color , Creature...",<POWER>
1267821,"[start] <THEMES> Creature , W Color , Creature...",1
1267822,"[start] <THEMES> Creature , W Color , Creature...",<TOUGHNESS>
1267823,"[start] <THEMES> Creature , W Color , Creature...",1


In [27]:
df[:7]

Unnamed: 0,sentence,prediction
0,"[start] <THEMES> N Identity , Artifact , N Col...",<CARD_NAME>
1,"[start] <THEMES> N Identity , Artifact , N Col...",<name>
2,"[start] <THEMES> N Identity , Artifact , N Col...",<MANA_COST>
3,"[start] <THEMES> N Identity , Artifact , N Col...",{3}
4,"[start] <THEMES> N Identity , Artifact , N Col...",<TYPE_LINE>
5,"[start] <THEMES> N Identity , Artifact , N Col...",Artifact
6,"[start] <THEMES> N Identity , Artifact , N Col...",<ORACLE_TEXT>
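The preview above suggests that each row pairs a growing prefix (`sentence`) with the token that follows it (`prediction`). A minimal sketch of how such rows could be produced from one tokenized card is shown here; `expand_prefix_pairs` and the toy card are hypothetical and are not taken from the preprocessing notebook.

```python
def expand_prefix_pairs(tokens, start):
    """Return (prefix, next_token) rows for every position from `start` onward."""
    return [(" ".join(tokens[:i]), tokens[i]) for i in range(start, len(tokens))]


# Toy card assembled from the first rows of the preview (an assumption for illustration).
card = ("[start] <THEMES> N Identity , Artifact , N Color , {3} Cost "
        "<CARD_NAME> <name> <MANA_COST> {3}").split()

# Prediction starts where the theme prompt ends, i.e. at <CARD_NAME>.
themes_len = card.index("<CARD_NAME>")
rows = expand_prefix_pairs(card, themes_len)

for sentence, prediction in rows:
    print(repr(sentence[-25:]), "->", prediction)
```

If the real data was built this way, one card contributes one row per predicted token, which would be consistent with the roughly 1.27 million rows in the frame.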


In [28]:

print(df['sentence'][:7])

0    [start] <THEMES> N Identity , Artifact , N Col...
1    [start] <THEMES> N Identity , Artifact , N Col...
2    [start] <THEMES> N Identity , Artifact , N Col...
3    [start] <THEMES> N Identity , Artifact , N Col...
4    [start] <THEMES> N Identity , Artifact , N Col...
5    [start] <THEMES> N Identity , Artifact , N Col...
6    [start] <THEMES> N Identity , Artifact , N Col...
Name: sentence, dtype: object


In [29]:
df['sentence'][0]

'[start] <THEMES> N Identity , Artifact , N Color , {3} Cost'

In [30]:
# Basic split on whitespace
df['sentence_split'] = df['sentence'].str.split()
df['sentence_split'][:7]

0    [[start], <THEMES>, N, Identity, ,, Artifact, ...
1    [[start], <THEMES>, N, Identity, ,, Artifact, ...
2    [[start], <THEMES>, N, Identity, ,, Artifact, ...
3    [[start], <THEMES>, N, Identity, ,, Artifact, ...
4    [[start], <THEMES>, N, Identity, ,, Artifact, ...
5    [[start], <THEMES>, N, Identity, ,, Artifact, ...
6    [[start], <THEMES>, N, Identity, ,, Artifact, ...
Name: sentence_split, dtype: object

In [34]:
df['sentence_split'][0][0]

'[start]'

In [32]:
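# Combine all tokens into one large list
# NOTE: df['sentence'] holds raw strings, so extend() adds one character at a time
# (which is what the output shows). Iterate over df['sentence_split'] to collect
# whitespace-split tokens instead.
all_tokens = []
for tokens in df['sentence']:
    all_tokens.extend(tokens)

# Alternative one-liner using list comprehension (same character-level behaviour)
# all_tokens = [token for tokens in df['sentence'] for token in tokens]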

# Display the first 20 tokens to verify
print(f"Total tokens: {len(all_tokens)}")
print("First 20 tokens:", all_tokens[:20])

Total tokens: 343568844
First 20 tokens: ['[', 's', 't', 'a', 'r', 't', ']', ' ', '<', 'T', 'H', 'E', 'M', 'E', 'S', '>', ' ', 'N', ' ', 'I']
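The character-level result above comes from how `list.extend` treats strings. A small sketch with two made-up sentences shows the difference between extending with a string and extending with a split list:

```python
sentences = ["[start] <THEMES> N Identity", "[start] <THEMES> Creature"]

char_level = []
for s in sentences:
    char_level.extend(s)          # a str is iterable, so this adds single characters

word_level = []
for s in sentences:
    word_level.extend(s.split())  # split first to add whitespace-separated tokens

print(char_level[:7])   # ['[', 's', 't', 'a', 'r', 't', ']']
print(word_level[:4])   # ['[start]', '<THEMES>', 'N', 'Identity']
```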


In [33]:
corpus = df['sentence']
print(corpus[0:3])

0    [start] <THEMES> N Identity , Artifact , N Col...
1    [start] <THEMES> N Identity , Artifact , N Col...
2    [start] <THEMES> N Identity , Artifact , N Col...
Name: sentence, dtype: object


In [35]:
# Convert corpus to list of strings if it's not already
corpus_list = corpus.tolist() if hasattr(corpus, 'tolist') else list(corpus)

In [36]:
type(corpus)

pandas.core.series.Series

In [37]:
type(corpus_list[0])

str

In [38]:
# Convert pandas Series to list of strings
corpus_list = corpus.values.tolist()

# Ensure all elements are strings
corpus_list = [str(text) for text in corpus_list]

In [39]:
# Corrected the above code to ensure all elements are strings
corpus_list[0]

'[start] <THEMES> N Identity , Artifact , N Color , {3} Cost'

In [40]:
# Check the first few elements to verify
print(type(corpus_list))  # Should show: <class 'list'>
print(type(corpus_list[0]))  # Should show: <class 'str'>

<class 'list'>
<class 'str'>


# Split into train/val/test datasets

Do we need to keep other information with the corpus_list as we split it up?  Names, color, etc.?

If we're just generating text, I think that answer is no.  If we want more, then the answer is likely yes.

In [41]:
# Split data into train/validation/test (80/10/10) sets
train_list, temp = train_test_split(corpus_list, test_size=0.2, random_state=42) # Set 80% for training
val, test = train_test_split(temp, test_size=0.5, random_state=42) # Set 10% for validation and 10% for testing
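The two calls above hold out 20% first and then halve that hold-out, which is why the second `test_size` is 0.5 rather than 0.1. A standard-library sketch of the same idea follows; `two_stage_split` is a hypothetical helper and does not reproduce scikit-learn's exact shuffling.

```python
import random


def two_stage_split(items, holdout=0.2, seed=42):
    """Shuffle, hold out a fraction, then halve the hold-out into val and test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout)
    train, temp = shuffled[n_holdout:], shuffled[:n_holdout]
    n_val = len(temp) // 2
    return train, temp[:n_val], temp[n_val:]


train_demo, val_demo, test_demo = two_stage_split(range(100))
print(len(train_demo), len(val_demo), len(test_demo))  # 80 10 10
```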

In [42]:
# Turn them into dicts so I can use them with the Hugging Face Dataset class
train = [{"sentence": text} for text in train_list]  # Wrap each sentence in a dict
val = [{"sentence": text} for text in val]
test = [{"sentence": text} for text in test]

# Create the Dataset Dictionary for future mapping
data = DatasetDict({
    'train': Dataset.from_list(train),
    'validation': Dataset.from_list(val),
    'test': Dataset.from_list(test)
    })

data

DatasetDict({
    train: Dataset({
        features: ['sentence'],
        num_rows: 1014260
    })
    validation: Dataset({
        features: ['sentence'],
        num_rows: 126782
    })
    test: Dataset({
        features: ['sentence'],
        num_rows: 126783
    })
})

In [43]:
# Check the items
data['train'][:5]

{'sentence': ['[start] <THEMES> B Identity , 1 1 Counters , Creature , Sacrifice , Creature Based , B Color , Zombie , {2} {B} Cost <CARD_NAME> <name> <MANA_COST> {2} {B} <TYPE_LINE> Creature — Zombie <ORACLE_TEXT> {1} {B} , Sacrifice another creature : Put two 1 1 counters on <name> . Activate only as a',
  '[start] <THEMES> Damage , Flash , Enchantment , R Color , R Identity , {3} {R} {R} Cost <CARD_NAME> <name> <MANA_COST> {3} {R} {R} <TYPE_LINE> Enchantment <ORACLE_TEXT> Flash \\n If a source would deal damage to a permanent or player , it deals double that damage to that permanent or player instead',
  '[start] <THEMES> {2} {W} Cost , Enchantment , W Color , Tokens , Treasure Tokens , W Identity , Card Draw <CARD_NAME> <name> <MANA_COST> {2} {W} <TYPE_LINE> Enchantment <ORACLE_TEXT> At the beginning',
  '[start] <THEMES> {3} {W} Cost , W Color , Creature Based , Sorcery , Rebound , W Identity <CARD_NAME> <name> <MANA_COST> {3} {W} <TYPE_LINE> Sorcery <ORACLE_TEXT> Creatures you co

In [44]:
data['validation']['sentence'][0]

'[start] <THEMES> UW Identity , Creature Based , {W} {U} Cost , Combat Damage , UW Color , Instant , Fog <CARD_NAME> <name> <MANA_COST> {W} {U} <TYPE_LINE> Instant'

# Train the Tokenizer on Training Data

Keep the validation and test sets out of the tokenizer to prevent data leakage.
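As a rough illustration of why this matters, the sketch below builds a vocabulary from training text only and then measures how many held-out tokens it has never seen. It uses whitespace tokens as a stand-in for BPE, and `build_vocab`, `unseen_rate`, and the example sentences are hypothetical.

```python
from collections import Counter


def build_vocab(texts, min_frequency=1):
    """Collect whitespace tokens that occur at least `min_frequency` times."""
    counts = Counter(tok for text in texts for tok in text.split())
    return {tok for tok, c in counts.items() if c >= min_frequency}


def unseen_rate(vocab, texts):
    """Fraction of tokens in `texts` that are missing from `vocab`."""
    tokens = [tok for text in texts for tok in text.split()]
    return sum(tok not in vocab for tok in tokens) / len(tokens)


train_texts = ["Create a Treasure token .", "Draw two cards ."]
val_texts = ["Create two Treasure tokens ."]

vocab = build_vocab(train_texts)          # validation text is never used here
print(unseen_rate(vocab, val_texts))      # 0.2 ("tokens" is unseen)
```

Had the validation text been included when building the vocabulary, the unseen rate would be zero and would say nothing about how the tokenizer handles new text.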

In [45]:
# Create a directory for tokenizer files if it doesn't exist
models_dir = "../models/themed_data"
os.makedirs(models_dir, exist_ok=True)

## Tokenizer Functions

1) `create_gpt2_tokenizer`

2) `train_tokenizer_from_texts`

3) `convert_to_transformers_tokenizer` -> This is important to allow `trainer.train` to work

4) `use_transformers_tokenizer`

5) `load_transformers_tokenizer` -> Use this to load a previously saved tokenizer; otherwise keep using the object returned by `convert_to_transformers_tokenizer`

6) `demonstrate_workflow2` -> make sure this whole thing even works...
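For background on what the training functions listed above rely on: byte-pair encoding repeatedly merges the most frequent adjacent pair of symbols. The sketch below is a character-level toy in plain Python, not the byte-level implementation in the `tokenizers` library; `most_frequent_pair`, `merge_pair`, and `learn_merges` are hypothetical names.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if symbols[i:i + 2] == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


def learn_merges(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-separated corpus."""
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # every word is already a single symbol
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges


merges = learn_merges("token tokens token Treasure", 3)
print(merges)
```

In this toy, `num_merges` plays the role that `vocab_size` plays for the real trainer: each merge adds one symbol to the vocabulary.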

In [46]:
def create_gpt2_tokenizer(vocab_size=8192, min_frequency=2):
    """
    Create an untrained GPT-2-style byte-level BPE tokenizer using the Hugging Face tokenizers library.

    Args:
        vocab_size: Unused here; the vocabulary size is set on the trainer in train_tokenizer_from_texts
        min_frequency: Unused here; the minimum frequency is set on the trainer in train_tokenizer_from_texts

    Returns:
        A tokenizer object ready for training
    """
    # Initialize a ByteLevelBPE model-based tokenizer
    tokenizer = Tokenizer(models.BPE())

    # Add byte-level pre-tokenizer
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

    # Add decoder to properly decode byte-level tokens
    tokenizer.decoder = decoders.ByteLevel()

    # Set up normalizers - GPT-2 doesn't do much normalization
    tokenizer.normalizer = Sequence([NFKC()])

    # Return the tokenizer (to be trained later)
    return tokenizer


def train_tokenizer_from_texts(
    tokenizer: Tokenizer,
    texts: List[str],
    vocab_size: int = 8192,
    min_frequency: int = 2,
    batch_size: int = 512,
    output_dir: str = "tokenizer"
):
    """
    Train the tokenizer on a list of texts using batching

    Args:
        tokenizer: The tokenizer object to train
        texts: List of strings to use for training
        vocab_size: Maximum vocabulary size
        min_frequency: Minimum frequency for a token
        batch_size: Number of texts to process in each batch
        output_dir: Directory to save the trained tokenizer

    Returns:
        The trained tokenizer
    """
    if not texts:
        raise ValueError("No texts provided for training")

    # Configure the BPE trainer
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        min_frequency=min_frequency,
        special_tokens=["<|endoftext|>", "<|pad|>"],
        show_progress=True,
    )

    # Create temporary files for batched training
    temp_dir = os.path.join(output_dir, "temp")
    os.makedirs(temp_dir, exist_ok=True)

    # Split texts into batches
    batches = [texts[i:i+batch_size] for i in range(0, len(texts), batch_size)]
    batch_files = []

    print(f"Preparing {len(batches)} batches for training...")

    # Write batches to temporary files
    for i, batch in enumerate(tqdm(batches)):
        batch_file = os.path.join(temp_dir, f"batch_{i}.txt")
        with open(batch_file, "w", encoding="utf-8") as f:
            f.write("\n<|endoftext|>\n".join(batch))
            f.write("\n<|endoftext|>\n")  # Add final separator
        batch_files.append(batch_file)

    # Train the tokenizer
    print("Training tokenizer...")
    tokenizer.train(batch_files, trainer)

    # Add post-processor to handle special tokens
    tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)

    # Enable padding with the PAD token
    pad_id = tokenizer.token_to_id("<|pad|>")
    if pad_id is not None:
        tokenizer.enable_padding(pad_id=pad_id, pad_token="<|pad|>")
    else:
        print("Warning: <|pad|> token not found in vocabulary. Padding won't work correctly.")

    # Save the trained tokenizer
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    tokenizer.save(f"{output_dir}/gpt2_tokenizer.json")

    # Clean up temporary files
    for file in batch_files:
        os.remove(file)
    os.rmdir(temp_dir)

    print(f"Tokenizer trained and saved to {output_dir}/gpt2_tokenizer.json")
    return tokenizer


def convert_to_transformers_tokenizer(
    tokenizer_path: Union[str, Tokenizer],
    output_dir: Optional[str] = None,
    model_max_length: int = 1024
):
    """
    Convert a trained tokenizers.Tokenizer to a transformers compatible tokenizer

    Args:
        tokenizer_path: Path to the saved tokenizer JSON file, or an already loaded Tokenizer object
        output_dir: Directory to save the transformers tokenizer (nothing is saved if None)
        model_max_length: Maximum sequence length for the model

    Returns:
        A transformers-compatible tokenizer
    """
    # Load the tokenizer
    if isinstance(tokenizer_path, Tokenizer):
        # If a tokenizer object is passed directly
        raw_tokenizer = tokenizer_path
    else:
        # Load from file
        raw_tokenizer = Tokenizer.from_file(tokenizer_path)

    # Define special tokens
    eos_token = "<|endoftext|>"  # GPT-2 uses EOS as both BOS and EOS
    pad_token = "<|pad|>"

    # Create the transformers wrapper
    transformers_tokenizer = PreTrainedTokenizerFast(
        tokenizer_object=raw_tokenizer,
        bos_token=eos_token,  # GPT-2 uses EOS as BOS too
        eos_token=eos_token,
        pad_token=pad_token,
        unk_token=eos_token,  # GPT-2 doesn't really have UNK, defaults to EOS
        model_max_length=model_max_length
    )

    # Set additional GPT-2 specific attributes
    transformers_tokenizer.add_special_tokens({
        "eos_token": eos_token,
        "bos_token": eos_token,
        "pad_token": pad_token,
    })

    # Save the transformers tokenizer if an output directory is provided
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)
        transformers_tokenizer.save_pretrained(output_dir)
        print(f"Transformers tokenizer saved to {output_dir}")

    return transformers_tokenizer


def use_transformers_tokenizer(tokenizer, text, padding=True, truncation=True, max_length=None, return_tensors=None):
    """
    Demonstrate how to use the transformers tokenizer

    Args:
        tokenizer: The transformers tokenizer
        text: Text to tokenize or list of texts
        padding: Whether to pad the sequences (True, False, 'max_length', 'longest')
        truncation: Whether to truncate sequences (True, False)
        max_length: Maximum length for padding/truncation (defaults to model_max_length if None)
        return_tensors: Return format for tensors ('pt' for PyTorch, 'tf' for TensorFlow, None for lists)

    Returns:
        Encoding object(s) with tokens, ids, etc.
    """
    # Configure parameters
    if max_length is None:
        max_length = tokenizer.model_max_length

    # Encode the text(s)
    encoded = tokenizer(
        text,
        padding=padding,
        truncation=truncation,
        max_length=max_length,
        return_tensors=return_tensors
    )

    # Display results for single text if it's a string
    if isinstance(text, str):
        print(f"Input text: {text}")
        print(f"Token IDs: {encoded['input_ids'].tolist()[0] if return_tensors else encoded['input_ids']}")
        tokens = tokenizer.convert_ids_to_tokens(
            encoded['input_ids'].tolist()[0] if return_tensors else encoded['input_ids']
        )
        print(f"Tokens: {tokens}")

        if padding:
            print(f"Attention mask: {encoded['attention_mask'].tolist()[0] if return_tensors else encoded['attention_mask']}")

        # Decoding demonstration
        decoded = tokenizer.decode(
            encoded['input_ids'].tolist()[0] if return_tensors else encoded['input_ids'],
            skip_special_tokens=True
        )
        print(f"Decoded text: {decoded}")

    return encoded


def load_transformers_tokenizer(path):
    """
    Load a previously saved transformers tokenizer

    Args:
        path: Path to the saved transformers tokenizer directory

    Returns:
        The loaded transformers tokenizer
    """
    return GPT2TokenizerFast.from_pretrained(path)


def demonstrate_workflow2():
    """
    Full demonstration of creating, training, and using a GPT-2 style tokenizer
    that's compatible with the transformers library
    """
    # Create a new tokenizer
    print("Creating tokenizer...")
    tokenizer = create_gpt2_tokenizer()

    # Generate some sample texts for demonstration
    print("Generating sample texts...")
    sample_texts = [
        "This is an example of text that could be used to train a tokenizer.",
        "It should include diverse vocabulary, punctuation (like commas, periods, question marks?).",
        "Multiple paragraphs are good to include.",
        "Numbers like 42, 3.14159, and 2023 should be represented.",
        "Code snippets might be important if your model will process code:\ndef hello_world():\n    print('Hello, world!')",
    ]

    # Add some more generated texts for better training
    for i in range(50):
        words = ["The", "quick", "brown", "fox", "jumps", "over", "lazy", "dog",
                "Hello", "world", "Python", "programming", "is", "fun", "GPT",
                "natural", "language", "processing", "models", "work", "well"]
        length = random.randint(5, 20)
        text = " ".join(random.choices(words, k=length)) + "."
        sample_texts.append(text)

    # Train the tokenizer on the sample texts
    print(f"Training tokenizer on {len(sample_texts)} texts...")
    raw_tokenizer = train_tokenizer_from_texts(
        tokenizer,
        sample_texts,
        vocab_size=1000,  # Smaller vocab for demo
        min_frequency=1,
        batch_size=20,
        output_dir="tokenizer_demo"
    )

    # Convert to transformers tokenizer
    print("\nConverting to transformers tokenizer...")
    transformers_tokenizer = convert_to_transformers_tokenizer(
        "tokenizer_demo/gpt2_tokenizer.json",
        output_dir="transformers_tokenizer_demo",
        model_max_length=512
    )

    # Test the transformers tokenizer
    print("\nTesting transformers tokenizer...")
    test_texts = [
        "This is a short text.",
        "This is a slightly longer text with more content.",
        "Let's see how padding works on texts of different lengths."
    ]

    # Encode with transformers tokenizer (return PyTorch tensors)
    print("\nEncoding with PyTorch tensors:")
    encodings = use_transformers_tokenizer(
        transformers_tokenizer,
        test_texts,
        padding=True,
        return_tensors="pt"
    )

    # Show that it can be saved and loaded
    print("\nLoading saved transformers tokenizer:")
    loaded_tokenizer = load_transformers_tokenizer("transformers_tokenizer_demo")
    test_encode = loaded_tokenizer(
        "Testing that loading works correctly.",
        return_tensors="pt"
    )
    print(f"Encoded IDs: {test_encode['input_ids'].tolist()[0]}")
    print(f"Decoded: {loaded_tokenizer.decode(test_encode['input_ids'][0])}")

    return transformers_tokenizer

In [47]:
# Check if this even works...
demonstrate_workflow2()

Creating tokenizer...
Generating sample texts...
Training tokenizer on 55 texts...
Preparing 3 batches for training...


100%|██████████| 3/3 [00:00<00:00, 1502.08it/s]
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'PreTrainedTokenizer'. 
The class this function is called from is 'GPT2TokenizerFast'.


Training tokenizer...
Tokenizer trained and saved to tokenizer_demo/gpt2_tokenizer.json

Converting to transformers tokenizer...
Transformers tokenizer saved to transformers_tokenizer_demo

Testing transformers tokenizer...

Encoding with PyTorch tensors:

Loading saved transformers tokenizer:
Encoded IDs: [26, 96, 47, 70, 287, 57, 39, 42, 28, 31, 70, 125, 46, 166, 66, 182, 220, 39, 52, 7]
Decoded: Testing that loading works correctly.


PreTrainedTokenizerFast(name_or_path='', vocab_size=339, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|pad|>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<|pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

## Create and train the tokenizer

In [48]:
# Train a tokenizer if you don't have one yet
trained_tokenizer = create_gpt2_tokenizer(vocab_size=8192, min_frequency=2)

# This also saves the tokenizer to disk as gpt2_tokenizer.json inside output_dir
trained_tokenizer = train_tokenizer_from_texts(tokenizer=trained_tokenizer, texts=data['train']['sentence'], batch_size=512, output_dir=models_dir)


Preparing 1981 batches for training...


100%|██████████| 1981/1981 [00:03<00:00, 497.26it/s]


Training tokenizer...
Tokenizer trained and saved to ../models/themed_data/gpt2_tokenizer.json
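The batch counts reported in the logs follow directly from the slicing in `train_tokenizer_from_texts`. A short check, with `batch_texts` as a hypothetical stand-in for that list comprehension:

```python
import math


def batch_texts(texts, batch_size):
    """Slice `texts` into consecutive batches; the last one may be shorter."""
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]


# Demo run: 55 texts with batch_size=20
print(len(batch_texts(list(range(55)), 20)))   # 3

# Training split: 1,014,260 texts with batch_size=512
print(math.ceil(1014260 / 512))                # 1981
```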


In [49]:
# This loads the trained tokenizer and makes it compatible with the transformers library for training
# Needs to be a PreTrainedTokenizerFast object
tokenizer_fast = convert_to_transformers_tokenizer(tokenizer_path=f"{models_dir}/gpt2_tokenizer.json",
                                                   output_dir=models_dir, model_max_length=256)

# Check the tokenizer
tokenizer_fast

Transformers tokenizer saved to ../models/themed_data


PreTrainedTokenizerFast(name_or_path='', vocab_size=7490, model_max_length=256, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|pad|>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<|pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

In [50]:
# Check to make sure the fast tokenizer is working
encodings = use_transformers_tokenizer(tokenizer_fast, data['validation']['sentence'][1], padding=True, max_length=1024, return_tensors='pt')
encodings

Input text: [start] <THEMES> B Identity , Creature , Imp , Flying , Threshold , {B} Cost , B Color , Zombie , Discard <CARD_NAME> <name> <MANA_COST> {B} <TYPE_LINE> Creature — Zombie Imp <ORACLE_TEXT> Discard a card : <name> gains flying until end of turn . \n Threshold — As long as seven or more cards are in your graveyard , <name> gets 1 1
Token IDs: [48, 130, 50, 92, 142, 21, 121, 134, 91, 114, 91, 1258, 91, 272, 91, 1179, 91, 94, 23, 80, 129, 91, 121, 131, 91, 507, 91, 501, 92, 146, 51, 147, 21, 92, 119, 21, 92, 151, 51, 153, 21, 94, 23, 80, 92, 160, 51, 158, 21, 114, 183, 507, 1258, 92, 173, 51, 174, 21, 501, 188, 229, 274, 92, 119, 21, 586, 741, 407, 367, 226, 329, 175, 193, 65, 1179, 183, 585, 673, 384, 1274, 292, 571, 366, 747, 325, 252, 455, 91, 92, 119, 21, 449, 198, 198]
Tokens: ['[', 'start', ']', 'Ġ<', 'THEMES', '>', 'ĠB', 'ĠIdentity', 'Ġ,', 'ĠCreature', 'Ġ,', 'ĠImp', 'Ġ,', 'ĠFlying', 'Ġ,', 'ĠThreshold', 'Ġ,', 'Ġ{', 'B', '}', 'ĠCost', 'Ġ,', 'ĠB', 'ĠColor', 'Ġ,', 'ĠZombie',

{'input_ids': tensor([[  48,  130,   50,   92,  142,   21,  121,  134,   91,  114,   91, 1258,
           91,  272,   91, 1179,   91,   94,   23,   80,  129,   91,  121,  131,
           91,  507,   91,  501,   92,  146,   51,  147,   21,   92,  119,   21,
           92,  151,   51,  153,   21,   94,   23,   80,   92,  160,   51,  158,
           21,  114,  183,  507, 1258,   92,  173,   51,  174,   21,  501,  188,
          229,  274,   92,  119,   21,  586,  741,  407,  367,  226,  329,  175,
          193,   65, 1179,  183,  585,  673,  384, 1274,  292,  571,  366,  747,
          325,  252,  455,   91,   92,  119,   21,  449,  198,  198]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attenti

# Tokenize all of the data

## Functions

In [51]:
def tokenize_dataset_dict(dataset_dict, tokenizer, max_length=None, batch_size=1000):
    """
    Tokenize all sentence entries in a Hugging Face DatasetDict.

    Args:
        dataset_dict: A Hugging Face DatasetDict containing data with 'sentence' column
        tokenizer: A transformers tokenizer
        max_length: Maximum sequence length for tokenization (None = no limit)
        batch_size: Batch size for processing

    Returns:
        A new DatasetDict with tokenized data
    """
    from datasets import DatasetDict

    # Function to tokenize a batch of examples
    def tokenize_function(examples):
        # For transformers tokenizers use the __call__ method
        return tokenizer(
            examples["sentence"],
            padding="max_length" if max_length else False,
            truncation=True if max_length else False,
            max_length=max_length,
            return_special_tokens_mask=True,
        )

    # Create a new DatasetDict to store the tokenized datasets
    tokenized_datasets = DatasetDict()

    # Process each split in the dataset
    for split_name, dataset in dataset_dict.items():
        # Map the tokenize function over the dataset in batches
        tokenized_datasets[split_name] = dataset.map(
            tokenize_function,
            batched=True,
            batch_size=batch_size,
            # remove_columns=["sentence"] if "sentence" in dataset.column_names else None, # Keep the sentences
            desc=f"Tokenizing {split_name} split",
        )

        # Print the first example to verify
        if len(tokenized_datasets[split_name]) > 0:
            print(f"\nExample from {split_name} split:")
            example = tokenized_datasets[split_name][0]
            print(f"Input IDs (first 10): {example['input_ids'][:10]}...")
            print(f"Attention Mask (first 10): {example['attention_mask'][:10]}...")

            # Decode a sample for verification
            decoded = tokenizer.decode(example['input_ids'])
            print(f"Decoded sample preview: {decoded[:50]}...")

    return tokenized_datasets
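With `padding="max_length"` and `truncation=True`, every row ends up exactly `max_length` ids long, and the attention mask marks which positions are real. A plain-Python sketch of that behaviour, assuming right-side padding and pad id 1 (the id of `<|pad|>` in the tokenizer printed earlier); `pad_or_truncate` is a hypothetical helper:

```python
PAD_ID = 1  # id of <|pad|> in this notebook's tokenizer


def pad_or_truncate(ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly `max_length` long."""
    kept = ids[:max_length]                 # truncate on the right
    n_pad = max_length - len(kept)          # pad on the right
    return kept + [pad_id] * n_pad, [1] * len(kept) + [0] * n_pad


input_ids, attention_mask = pad_or_truncate([48, 130, 50, 92, 142, 21], max_length=8)
print(input_ids)        # [48, 130, 50, 92, 142, 21, 1, 1]
print(attention_mask)   # [1, 1, 1, 1, 1, 1, 0, 0]
```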

## Tokenize the Datasets

Do we need to do this?  Yes: the `Trainer` works on `input_ids` and `attention_mask`, not raw text, so the splits have to be tokenized before training.

In [52]:
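# Tokenize the DatasetDict
# TODO: We can probably decrease max_length here to 128 or 64 to save memory; we'll have to test it out
# Or maybe we don't use a max length at all? TODO: Figure out how to handle this
# NOTE: the assignment below rebinds `tokenize_dataset_dict` from the function to its result,
# so re-run the cell that defines the function before re-running this one.
tokenize_dataset_dict = tokenize_dataset_dict(data, tokenizer_fast, max_length=256, batch_size=1000)
tokenize_dataset_dict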

Tokenizing train split:   0%|          | 0/1014260 [00:00<?, ? examples/s]


Example from train split:
Input IDs (first 10): [48, 130, 50, 92, 142, 21, 121, 134, 91, 198]...
Attention Mask (first 10): [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]...
Decoded sample preview: [start] <THEMES> B Identity , 1 1 Counters , Creat...


Tokenizing validation split:   0%|          | 0/126782 [00:00<?, ? examples/s]


Example from validation split:
Input IDs (first 10): [48, 130, 50, 92, 142, 21, 660, 134, 91, 114]...
Attention Mask (first 10): [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]...
Decoded sample preview: [start] <THEMES> UW Identity , Creature Based , {W...


Tokenizing test split:   0%|          | 0/126783 [00:00<?, ? examples/s]


Example from test split:
Input IDs (first 10): [48, 130, 50, 92, 142, 21, 733, 91, 114, 91]...
Attention Mask (first 10): [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]...
Decoded sample preview: [start] <THEMES> Druid , Creature , Vigilance , GW...


DatasetDict({
    train: Dataset({
        features: ['sentence', 'input_ids', 'token_type_ids', 'attention_mask', 'special_tokens_mask'],
        num_rows: 1014260
    })
    validation: Dataset({
        features: ['sentence', 'input_ids', 'token_type_ids', 'attention_mask', 'special_tokens_mask'],
        num_rows: 126782
    })
    test: Dataset({
        features: ['sentence', 'input_ids', 'token_type_ids', 'attention_mask', 'special_tokens_mask'],
        num_rows: 126783
    })
})
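One way to approach the `max_length` TODO is to look at the distribution of token counts before padding and pick a length that covers most rows. The sketch below uses made-up lengths; `length_covering` and `padding_fraction` are hypothetical helpers, and in practice the lengths would come from tokenizing without padding.

```python
import math


def length_covering(lengths, coverage=0.99):
    """Smallest length L such that at least `coverage` of sequences have len <= L."""
    ordered = sorted(lengths)
    index = math.ceil(coverage * len(ordered)) - 1
    return ordered[max(index, 0)]


def padding_fraction(lengths, max_length):
    """Share of positions that are padding when every row is padded to max_length."""
    real = sum(min(n, max_length) for n in lengths)
    return 1 - real / (max_length * len(lengths))


toy_lengths = [54, 94, 60, 72, 120, 48, 66, 81, 58, 70]  # made-up token counts

print(length_covering(toy_lengths, 0.9))
print(round(padding_fraction(toy_lengths, 256), 3))
print(round(padding_fraction(toy_lengths, 128), 3))
```

A high padding fraction at `max_length=256` would suggest that a shorter limit, or dynamic padding per batch, could save memory.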

In [53]:
# Make sure 'sentence' is still there
tokenize_dataset_dict['validation']['sentence'][0]

'[start] <THEMES> UW Identity , Creature Based , {W} {U} Cost , Combat Damage , UW Color , Instant , Fog <CARD_NAME> <name> <MANA_COST> {W} {U} <TYPE_LINE> Instant'

In [54]:
tokenize_dataset_dict['validation']['input_ids'][0]

[48, 130, 50, 92, 142, 21, 660, 134, 91, 114, 185, 91, 94, 44, 80, 94, 42, 80, 129, 91,
 531, 319, 91, 660, 131, 91, 293, 91, 1073, 92, 146, 51, 147, 21, 92, 119, 21, 92, 151, 51,
 153, 21, 94, 44, 80, 94, 42, 80, 92, 160, 51, 158, 21, 293,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,


In [55]:
# A short helper that converts token IDs back to text
# It effectively just wraps the decode function
def ids_to_tokens(tokenizer, ids):
    '''
    Decode token IDs back into a single string using the tokenizer.
    Really just a thin wrapper around tokenizer.decode(); despite the name,
    it returns the decoded text rather than a list of token strings.
    '''
    tokens = tokenizer.decode(ids)

    return tokens
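Decoding to text and converting ids to token strings are different operations. The toy sketch below makes the distinction concrete using id/token pairs copied from the "Token IDs" and "Tokens" output shown earlier; `toy_ids_to_tokens` and `toy_decode` are hypothetical and only mimic the byte-level space marker `Ġ`.

```python
# Id -> token pairs taken from the encoding output earlier in the notebook.
ID_TO_TOKEN = {48: "[", 130: "start", 50: "]", 92: "Ġ<", 142: "THEMES", 21: ">",
               121: "ĠB", 134: "ĠIdentity", 91: "Ġ,"}


def toy_ids_to_tokens(ids):
    """Return the list of token strings, one per id."""
    return [ID_TO_TOKEN[i] for i in ids]


def toy_decode(ids):
    """Join the tokens and turn the byte-level space marker back into a space."""
    return "".join(toy_ids_to_tokens(ids)).replace("Ġ", " ")


ids = [48, 130, 50, 92, 142, 21, 121, 134, 91]
print(toy_ids_to_tokens(ids))  # ['[', 'start', ']', 'Ġ<', 'THEMES', '>', 'ĠB', 'ĠIdentity', 'Ġ,']
print(toy_decode(ids))         # [start] <THEMES> B Identity ,
```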

In [56]:
trained_tokenizer.decode(tokenize_dataset_dict['validation']['input_ids'][0])

'[start] <THEMES> UW Identity , Creature Based , {W} {U} Cost , Combat Damage , UW Color , Instant , Fog <CARD_NAME> <name> <MANA_COST> {W} {U} <TYPE_LINE> Instant'

In [57]:
ids_to_tokens(trained_tokenizer, tokenize_dataset_dict['validation']['input_ids'][0])

'[start] <THEMES> UW Identity , Creature Based , {W} {U} Cost , Combat Damage , UW Color , Instant , Fog <CARD_NAME> <name> <MANA_COST> {W} {U} <TYPE_LINE> Instant'

In [58]:
tokenize_dataset_dict['validation']['attention_mask'][0]

[1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
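A quick sanity check on the two outputs above is that the attention mask is 1 exactly where `input_ids` holds a real token and 0 where it holds padding. The sketch below assumes the pad token ID is 1 (which is what the padded tail of `input_ids` suggests) and uses a hypothetical helper name, `mask_matches_padding`, plus a toy row rather than real data.

```python
def mask_matches_padding(input_ids, attention_mask, pad_token_id):
    """Return True when the mask is 1 on real tokens and 0 on padding."""
    if len(input_ids) != len(attention_mask):
        return False
    return all(
        (tok != pad_token_id) == bool(mask)
        for tok, mask in zip(input_ids, attention_mask)
    )

# Toy row: four real tokens followed by three pad tokens (pad ID assumed to be 1)
toy_ids = [48, 130, 50, 92, 1, 1, 1]
toy_mask = [1, 1, 1, 1, 0, 0, 0]
print(mask_matches_padding(toy_ids, toy_mask, pad_token_id=1))
```

This check would report a false mismatch if the pad ID could also appear as a real token inside a sentence, so it is only a rough diagnostic.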


## Set the DatasetDict format to torch

In [59]:
# Make sure the tokenized dataset is in the correct format for PyTorch

# old
# tokenize_dataset_dict.set_format(type='torch', columns=['sentence', 'input_ids', 'token_type_ids', 'attention_mask', 'special_tokens_mask'])

# New
tokenize_dataset_dict.set_format(type='torch', columns=['sentence', 'input_ids', 'token_type_ids', 'attention_mask', 'special_tokens_mask'])

### Check shapes

In [60]:
# Note that these have been padded out to the max_length

for i in range(0, 11):
    print(tokenize_dataset_dict['train'][i]['input_ids'].shape)

torch.Size([256])
torch.Size([256])
torch.Size([256])
torch.Size([256])
torch.Size([256])
torch.Size([256])
torch.Size([256])
torch.Size([256])
torch.Size([256])
torch.Size([256])
torch.Size([256])


In [61]:
for i in range(0, 11):
    print(tokenize_dataset_dict['train'][i]['attention_mask'].shape)

torch.Size([256])
torch.Size([256])
torch.Size([256])
torch.Size([256])
torch.Size([256])
torch.Size([256])
torch.Size([256])
torch.Size([256])
torch.Size([256])
torch.Size([256])
torch.Size([256])


In [62]:
for i in range(0, 11):
    print(tokenize_dataset_dict['train'][i]['sentence'])

[start] <THEMES> B Identity , 1 1 Counters , Creature , Sacrifice , Creature Based , B Color , Zombie , {2} {B} Cost <CARD_NAME> <name> <MANA_COST> {2} {B} <TYPE_LINE> Creature — Zombie <ORACLE_TEXT> {1} {B} , Sacrifice another creature : Put two 1 1 counters on <name> . Activate only as a
[start] <THEMES> Damage , Flash , Enchantment , R Color , R Identity , {3} {R} {R} Cost <CARD_NAME> <name> <MANA_COST> {3} {R} {R} <TYPE_LINE> Enchantment <ORACLE_TEXT> Flash \n If a source would deal damage to a permanent or player , it deals double that damage to that permanent or player instead
[start] <THEMES> {2} {W} Cost , Enchantment , W Color , Tokens , Treasure Tokens , W Identity , Card Draw <CARD_NAME> <name> <MANA_COST> {2} {W} <TYPE_LINE> Enchantment <ORACLE_TEXT> At the beginning
[start] <THEMES> {3} {W} Cost , W Color , Creature Based , Sorcery , Rebound , W Identity <CARD_NAME> <name> <MANA_COST> {3} {W} <TYPE_LINE> Sorcery <ORACLE_TEXT> Creatures you control get 2 1 until end of turn

# Build the Decoder

## A Short Rebuild of the above but with Hugging Face API

We might be able to leverage accelerated training by using HF, so this could be worth having.  It also uses a PyTorch backend, which I know works on Brandon's system... TensorFlow has been giving me problems in my environment lately.

In [63]:
# Grab the decoder-only setup
from transformers import GPT2Config, GPT2LMHeadModel

In [64]:
# Make the model

VOCAB_SIZE = 8912 # Define this based on the tokenizer's vocab size.
# It must be at least as large as the tokenizer vocabulary, since every token ID indexes the embedding table.
SEQUENCE_LENGTH = 500 # Context window; must be at least as long as the longest tokenized sequence.
EMBEDDING_DIM = 512 # Configure
FEED_FORWARD_DIM = 4 * EMBEDDING_DIM # Standard for decoder-only items.
DROPOUT = 0.5 # Configure
NUM_HEADS = 16 # Configure attention heads
NUM_DECODER_LAYERS = 4 # Configure

# Define the model configuration
model_config = GPT2Config(
    vocab_size=VOCAB_SIZE,
    n_positions=SEQUENCE_LENGTH,
    n_ctx=SEQUENCE_LENGTH,
    n_embd=EMBEDDING_DIM,
    n_layer=NUM_DECODER_LAYERS,
    n_head=NUM_HEADS,
    resid_pdrop=DROPOUT,
    embd_pdrop=DROPOUT,
    attn_pdrop=DROPOUT,
    # layer_norm_epsilon=1e-5,
    # initializer_range=0.02,
)

# Initialize the model
gptmodel = GPT2LMHeadModel(model_config)
# Output the model layers for inspection
gptmodel

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(8912, 512)
    (wpe): Embedding(500, 512)
    (drop): Dropout(p=0.5, inplace=False)
    (h): ModuleList(
      (0-3): 4 x GPT2Block(
        (ln_1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=1536, nx=512)
          (c_proj): Conv1D(nf=512, nx=512)
          (attn_dropout): Dropout(p=0.5, inplace=False)
          (resid_dropout): Dropout(p=0.5, inplace=False)
        )
        (ln_2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=2048, nx=512)
          (c_proj): Conv1D(nf=512, nx=2048)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.5, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=512, out_features=8912, bias=False)
)

In [65]:
# Check the number of trainable parameters.
# Only parameters with p.requires_grad == True are counted.
trainable_params = sum(p.numel() for p in gptmodel.parameters() if p.requires_grad)
print("Number of trainable parameters:", trainable_params)

Number of trainable parameters: 17429504
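The reported total can be cross-checked by hand from the configuration values. The sketch below recomputes it from the layer shapes printed in the model summary above. It assumes the output projection (`lm_head`) shares its weights with the token embedding (`wte`), which is the GPT-2 default in `transformers`, so it is counted once.

```python
# Config values taken from the model definition above
vocab_size, n_positions, d_model, n_layers = 8912, 500, 512, 4
d_ff = 4 * d_model

# Token embedding (wte) + learned positional embedding (wpe)
embeddings = vocab_size * d_model + n_positions * d_model

layer_norm = 2 * d_model                         # weight + bias
attn = d_model * 3 * d_model + 3 * d_model       # c_attn: fused Q, K, V projection
attn += d_model * d_model + d_model              # attention output projection (c_proj)
mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)  # c_fc + c_proj
per_block = 2 * layer_norm + attn + mlp          # ln_1 + ln_2 + attention + MLP

# All blocks plus the final layer norm (ln_f); lm_head is tied to wte
total = embeddings + n_layers * per_block + layer_norm
print(per_block, total)
```

The result matches the 17,429,504 reported by the cell above, with roughly 4.8M of that in the two embedding tables.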


In [67]:
# Print each parameter's name and trainable count individually
for name, param in gptmodel.named_parameters():
    if param.requires_grad:
        print(f"{name}: {param.numel()} trainable parameters")

transformer.wte.weight: 4562944 trainable parameters
transformer.wpe.weight: 256000 trainable parameters
transformer.h.0.ln_1.weight: 512 trainable parameters
transformer.h.0.ln_1.bias: 512 trainable parameters
transformer.h.0.attn.c_attn.weight: 786432 trainable parameters
transformer.h.0.attn.c_attn.bias: 1536 trainable parameters
transformer.h.0.attn.c_proj.weight: 262144 trainable parameters
transformer.h.0.attn.c_proj.bias: 512 trainable parameters
transformer.h.0.ln_2.weight: 512 trainable parameters
transformer.h.0.ln_2.bias: 512 trainable parameters
transformer.h.0.mlp.c_fc.weight: 1048576 trainable parameters
transformer.h.0.mlp.c_fc.bias: 2048 trainable parameters
transformer.h.0.mlp.c_proj.weight: 1048576 trainable parameters
transformer.h.0.mlp.c_proj.bias: 512 trainable parameters
transformer.h.1.ln_1.weight: 512 trainable parameters
transformer.h.1.ln_1.bias: 512 trainable parameters
transformer.h.1.attn.c_attn.weight: 786432 trainable parameters
transformer.h.1.attn.c_at

There is a difference in the reported number of trainable parameters between Etienne's custom build and the Hugging Face model.

I'm 90% certain it comes from the positional encodings being handled slightly differently.  48M of his model's parameters come from the PosEnc, which doesn't show up the same way in the GPT model.

Both models are likely to generate similar results.

# Training the Model

## Setup Training Arguments

In [74]:
# Set up the Training Args
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5, # Prior runs show that I am easily overfitting the data at 25 epochs...
    weight_decay=0.01,
    logging_dir='./logs',
    # logging_steps=10, # This made my loss vs epoch plot too noisy...
    logging_strategy='epoch',
    save_strategy='epoch',
)
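To get a feel for how long a run with these arguments is, the number of optimizer steps can be estimated from the dataset size. This is a rough sketch that assumes the 1,014,260-row split shown earlier is the train split, a single device, no gradient accumulation, and that the last partial batch is kept.

```python
import math

train_rows = 1_014_260        # assumed to be the 'train' split size
per_device_batch_size = 16    # per_device_train_batch_size above
num_epochs = 5                # num_train_epochs above

# One optimizer step per batch; the final partial batch still counts as a step
steps_per_epoch = math.ceil(train_rows / per_device_batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

With `logging_strategy='epoch'` and `save_strategy='epoch'`, that works out to one log entry and one checkpoint per epoch.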

In [75]:
# Set up the Trainer

# mlm=True would enable masked language modeling (BERT style); mlm=False gives causal (next-token) language modeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer_fast, mlm=False, return_tensors='pt')

trainer = Trainer(
    model=gptmodel,
    args=training_args,
    train_dataset=tokenize_dataset_dict['train'],
    eval_dataset=tokenize_dataset_dict['validation'],
    processing_class=tokenizer_fast,
    data_collator=data_collator
)
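With `mlm=False`, the collator builds the labels as a copy of `input_ids` with padding positions replaced by -100 so they are ignored by the loss, and the model itself handles the one-position shift for next-token prediction. The pure-Python sketch below only illustrates that label construction. `causal_lm_labels` is a hypothetical helper, not part of the library, and the pad ID of 1 is an assumption based on the padded outputs shown earlier.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def causal_lm_labels(input_ids, pad_token_id):
    """Copy the input IDs, replacing padding positions with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]

# Toy row: four real tokens followed by two pad tokens (pad ID assumed to be 1)
example_ids = [48, 130, 50, 92, 1, 1]
labels = causal_lm_labels(example_ids, pad_token_id=1)
print(labels)  # [48, 130, 50, 92, -100, -100]
```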

## CHOO CHOO!

In [None]:
# Run the Trainer
trainer.train()

In [None]:
# Save out the model
trainer.save_model("../models/hf_gpt2_style_theme_model")

# Metrics

In [None]:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 11.90
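Perplexity here is just the exponential of the mean per-token cross-entropy loss (in nats) on the evaluation set, so the two numbers can be converted back and forth. A minimal sketch, using the hypothetical helper name `perplexity`:

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity is exp of the mean per-token cross-entropy (natural log)."""
    return math.exp(mean_cross_entropy)

# Invert the reported value to recover the eval loss it implies
reported_perplexity = 11.90
implied_eval_loss = math.log(reported_perplexity)
print(f"Implied eval loss: {implied_eval_loss:.3f}")
print(f"Round trip perplexity: {perplexity(implied_eval_loss):.2f}")
```

A perplexity of 11.90 therefore corresponds to an eval loss of about 2.48, and it can be read loosely as the model being about as uncertain as a uniform choice among roughly 12 tokens at each position.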


We'll likely need to build a helper down here that turns user input into a prompt in the same format as the training sentences...
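One possible shape for such a helper is sketched below. It assembles a prompt that mirrors the layout of the training sentences shown earlier (`[start] <THEMES> ... <CARD_NAME> <name> <MANA_COST> ... <TYPE_LINE> ...`). The function name `build_prompt` and its arguments are hypothetical, and the exact tag order is an assumption taken from those example sentences.

```python
def build_prompt(themes, mana_cost=None, type_line=None):
    """Assemble a prompt string that mirrors the training sentence layout."""
    parts = ["[start]", "<THEMES>", " , ".join(themes), "<CARD_NAME>", "<name>"]
    if mana_cost is not None:
        parts += ["<MANA_COST>", mana_cost]
    if type_line is not None:
        parts += ["<TYPE_LINE>", type_line]
    return " ".join(parts)

prompt = build_prompt(
    themes=["UW Identity", "Instant", "Fog"],
    mana_cost="{W} {U}",
    type_line="Instant",
)
print(prompt)
```

A prompt built this way looks like the start of a training example, so the model would be asked to continue a pattern it has actually seen rather than a bare card name.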

### Simple Prompting

In [None]:
def generate_text(model, tokenizer, prompt, max_length=100, num_return_sequences=1, temperature=1.0):
    """
    Generate text using the trained GPT2 model.

    Args:
        model: The trained GPT2 model
        tokenizer: The tokenizer used for encoding/decoding text
        prompt: The input prompt text to generate from
        max_length: Maximum length of generated sequence
        num_return_sequences: Number of sequences to generate
        temperature: Controls randomness (higher = more random)

    Returns:
        List of generated text sequences
    """
    # Move model to GPU if available
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    model.eval()

    # Encode the input prompt
    encoded_prompt = tokenizer(prompt, return_tensors='pt').to(device)

    # Generate text
    with torch.no_grad():
        output_sequences = model.generate(
            input_ids=encoded_prompt['input_ids'],
            attention_mask=encoded_prompt['attention_mask'],
            max_length=max_length,
            temperature=temperature,
            num_return_sequences=num_return_sequences,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
            do_sample=True,
        )

    # Decode and return the generated sequences
    generated_sequences = []
    for generated_sequence in output_sequences:
        generated_text = tokenizer.decode(generated_sequence, skip_special_tokens=True)
        generated_sequences.append(generated_text)

    return generated_sequences

In [51]:
generate_text(gptmodel, tokenizer_fast, "llanowar elves", max_length=100, num_return_sequences=1, temperature=1.0)

['llanowar elves-or blocks a creature an opponent controls enters with three +1/+1 counter on it were {T} : Exile a creature of your hand : Create a 2/1 white colorless1 white Spirit creature token . \\n Whenever an artifact creature you control attacks , draw a card , then put into the top card of each1/+1 counter on it counter on it and you may put another Treasure token that many +1 counter on it . (To create a +1']

### More Advanced Prompting

In [None]:
# This uses nucleus (top-p) sampling and top-k sampling
# This is likely a better way for us to generate text

from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Load the trained model and tokenizer
# model = GPT2LMHeadModel.from_pretrained("./model/path")
# tokenizer = GPT2TokenizerFast.from_pretrained("./tokenizer/path")

# Prepare your prompt
prompt = "llanowar elves"
# inputs = tokenizer(prompt, return_tensors="pt").to(device)
inputs = tokenizer_fast(prompt, return_tensors="pt").to(device)


# TODO: change `gptmodel` to `model` and `tokenizer_fast` to `tokenizer`
# Generate text
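outputs = gptmodel.generate(
    inputs.input_ids,  # already on `device` from the tokenizer call above
    attention_mask=inputs.attention_mask,  # pass the mask explicitly, as in generate_text()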
    max_length=100,
    do_sample=True,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    no_repeat_ngram_size=2,
    pad_token_id=tokenizer_fast.pad_token_id,
    eos_token_id=tokenizer_fast.eos_token_id,
)

# Decode the generated text
generated_text = tokenizer_fast.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

llanowar elves enters tapped . \n {T} : Add {U} . It gets +1/+1 until end of turn , Sacrifice an opponent sacrifices a creature : Return target creature card from your graveyard to its owner's hand . If you do , draw a card . You may put a +2/+0 counter on it . Activate only as a land . (If you control an instant or more +X/+X counter from a sorcery card in a random order . ) \
