In [4]:
from datasets import load_dataset
from transformers import GPT2Tokenizer

In [5]:
dataset = load_dataset("text", data_files={"train": "../dataset/gpt2_recipes.txt"})

In [6]:
# Check the first few rows
for i in range(16):
    print(dataset["train"][i]["text"])

<|startoftext|>
Name: Low-Fat Berry Blue Frozen Dessert
Ingredients: blueberries, granulated sugar, vanilla yogurt, lemon juice
Instructions:
Toss 2 cups berries with sugar.
Let stand for 45 minutes
stirring occasionally.
Transfer berry-sugar mixture to food processor.
Add yogurt and process until smooth.
Strain through fine sieve. Pour into baking pan (or transfer to ice cream maker and process according to manufacturers' directions). Freeze uncovered until edges are solid but centre is soft.  Transfer to processor and blend until smooth again.
Return to pan and freeze until edges are solid.
Transfer to processor and blend until smooth again.

Fold in remaining 2 cups of blueberries.
Pour into plastic mold and freeze overnight. Let soften slightly to serve.
<|endoftext|>
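
The output above suggests that the `text` loader yields one row per line, so a single recipe is spread over many rows (16 here, including the empty one). If one training example per recipe is preferred, the lines between the two markers could be joined first. The following is only a minimal sketch under that assumption; `group_recipes` is a hypothetical helper, not part of `datasets`, and it works on a plain list of strings.

```python
from typing import Iterable, List

START = "<|startoftext|>"
END = "<|endoftext|>"

def group_recipes(lines: Iterable[str]) -> List[str]:
    """Join the lines between START and END markers into one string per recipe."""
    recipes: List[str] = []
    current = None  # holds the lines of the recipe being read, or None outside a recipe
    for line in lines:
        stripped = line.strip()
        if stripped == START:
            current = [START]               # begin a new recipe
        elif stripped == END and current is not None:
            current.append(END)
            recipes.append("\n".join(current))
            current = None                  # wait for the next START marker
        elif current is not None and stripped:
            current.append(stripped)        # keep non-empty lines inside a recipe
    return recipes

# Toy rows shaped like the printed output above
rows = [START, "Name: Toast", "Ingredients: bread", "Instructions:", "Toast the bread.", "", END]
recipes = group_recipes(rows)
print(len(recipes))
print(recipes[0])
```

With the real data, the rows would come from `dataset["train"]["text"]` instead of the toy list.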


In [7]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

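# Register the recipe delimiters as special tokens.
# '<|endoftext|>' is already GPT-2's EOS token, so only '<|startoftext|>' is new to the vocabulary.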
tokenizer.add_special_tokens({
    'bos_token': '<|startoftext|>',
    'eos_token': '<|endoftext|>',
})

# Set pad token (needed for padding during batching)
tokenizer.pad_token = tokenizer.eos_token

In [8]:
def tokenize_function(example):
    return tokenizer(
        example['text'],
        truncation=True,
        padding='max_length',
        max_length=512
    )

`tokenize_function` transforms each sample from raw text into token IDs.  
Applying it with `dataset.map` stores the result as a new Dataset object ready for training.  
Language models like GPT-2 don’t operate on raw text like: "Ingredients: tomato, onion, garlic"  
They require that text to be tokenized into integer IDs, using the tokenizer that matches the model.  
For example, `tokenizer.encode("tomato")` returns a short list of integer token IDs; a single word can map to more than one ID.  

> Tokenization is how we turn our cleaned recipe text into input the model can understand and train on.

What we've done so far:
1. GPT-2 needs numbers, not text: the tokenizer converts the text into token IDs.
2. GPT-2 has a fixed context window (1024 tokens): inputs are truncated to at most 512 tokens.
3. The Trainer needs equal-size inputs: shorter samples are padded to the same length.
4. GPT-2 doesn’t have a pad token: the EOS token is assigned as the pad token (`tokenizer.pad_token = tokenizer.eos_token`).
5. The model needs start/end markers: custom special tokens are added.
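
Steps 2 and 3 can be illustrated without the tokenizer. The sketch below is a simplified stand-in for what `truncation=True` and `padding='max_length'` do, assuming right-side truncation and padding; `pad_or_truncate` is a hypothetical helper, and the IDs are toy values.

```python
from typing import Dict, List

def pad_or_truncate(ids: List[int], max_length: int, pad_id: int) -> Dict[str, List[int]]:
    """Cut ids to max_length, then right-pad with pad_id and build the attention mask."""
    ids = ids[:max_length]                  # truncation: drop tokens beyond max_length
    n_pad = max_length - len(ids)           # number of padding positions needed
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,  # 1 = real token, 0 = padding
    }

short = pad_or_truncate([11, 22, 33], max_length=5, pad_id=0)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_id=0)
print(short)  # {'input_ids': [11, 22, 33, 0, 0], 'attention_mask': [1, 1, 1, 0, 0]}
print(long)   # {'input_ids': [1, 2, 3, 4, 5], 'attention_mask': [1, 1, 1, 1, 1]}
```

The attention mask is what lets the model ignore the padded positions.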

In [9]:
tokenized_dataset = dataset.map(tokenize_function, batched=True)

In [10]:
# Just taking 500 samples for quicker training in this example
sample_dataset = tokenized_dataset["train"].select(range(500))

In [11]:
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained('gpt2')

# Resize embedding layer to include new tokens
model.resize_token_embeddings(len(tokenizer))
# This gives added tokens like <|startoftext|> their own embedding rows, which are then learned during fine-tuning.

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


Embedding(50258, 768)
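
The size 50258 is the 50257 tokens of the base GPT-2 vocabulary plus one: `<|endoftext|>` already exists (ID 50256), so only `<|startoftext|>` is actually added and receives ID 50257. The toy sketch below mimics that bookkeeping with a plain dictionary; `add_tokens` is a hypothetical helper and the `tok…` entries are placeholders, not real GPT-2 tokens.

```python
from typing import Dict, List

def add_tokens(vocab: Dict[str, int], new_tokens: List[str]) -> int:
    """Append tokens that are missing from vocab and return how many were added."""
    added = 0
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)         # next free ID
            added += 1
    return added

# Toy stand-in for the base vocabulary: 50257 entries, <|endoftext|> last (ID 50256)
vocab = {f"tok{i}": i for i in range(50256)}
vocab["<|endoftext|>"] = 50256

added = add_tokens(vocab, ["<|startoftext|>", "<|endoftext|>"])
print(added, len(vocab), vocab["<|startoftext|>"])  # 1 50258 50257
```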

In [12]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # We are NOT using masked language modeling (BERT style)
)
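
With `mlm=False`, the collator is understood to build the labels by copying `input_ids` and replacing every pad position with -100, the value the loss ignores. Because the pad token here is the EOS token, real `<|endoftext|>` tokens would be ignored by the loss as well, which may matter if the model should learn where a recipe ends. The sketch below reproduces that labelling rule on a toy sequence; it is an illustration of the assumed behaviour, not the library's code.

```python
from typing import List

IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def make_causal_lm_labels(input_ids: List[int], pad_token_id: int) -> List[int]:
    """Copy input_ids into labels, masking every position equal to pad_token_id."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]

EOS = PAD = 50256                            # pad token shares the EOS ID, as in this notebook
ids = [50257, 1212, 318, EOS, PAD, PAD]      # toy sequence: BOS, two tokens, EOS, padding
labels = make_causal_lm_labels(ids, PAD)
print(labels)  # [50257, 1212, 318, -100, -100, -100]
```

A dedicated pad token would keep the real EOS position in the loss, at the cost of one more embedding row.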

In [13]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="../NLPModel/gpt2-recipes",         # where to save the model
    overwrite_output_dir=True,
    num_train_epochs=3,                  # try 1 first to test
    per_device_train_batch_size=2,       # lower if memory is tight
    save_steps=500,
    save_total_limit=2,
    logging_steps=100,
    prediction_loss_only=True
)
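
These settings determine how long the run is. As a rough check, assuming a single device, no gradient accumulation, and the 500-sample subset selected earlier, the step count and the logging and checkpoint steps can be worked out by hand:

```python
import math

def total_training_steps(num_samples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps for a single device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * epochs

steps = total_training_steps(num_samples=500, batch_size=2, epochs=3)
log_steps = list(range(100, steps + 1, 100))    # logging_steps=100
save_steps = list(range(500, steps + 1, 500))   # save_steps=500

print(steps)       # 750
print(log_steps)   # [100, 200, 300, 400, 500, 600, 700]
print(save_steps)  # [500]
```

The result should match the `global_step` that `trainer.train()` reports and the steps listed in the loss table.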


In [14]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=sample_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)




In [15]:
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': 50257, 'pad_token_id': 50256}.
`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


Step,Training Loss
100,3.4691
200,3.0382
300,2.7714
400,2.4674
500,2.391
600,2.132
700,1.9613




TrainOutput(global_step=750, training_loss=2.5535806477864584, metrics={'train_runtime': 3212.9717, 'train_samples_per_second': 0.467, 'train_steps_per_second': 0.233, 'total_flos': 391938048000000.0, 'train_loss': 2.5535806477864584, 'epoch': 3.0})
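
One way to read these numbers: the exponential of a cross-entropy loss is a perplexity, and the throughput figures follow from the step count and runtime. The sketch below recomputes them from the values in `TrainOutput`, assuming 500 samples seen once per epoch. Note that `train_loss` is the average over the whole run, so this is an average training perplexity, not a held-out score.

```python
import math

# Values copied from the TrainOutput above
train_loss = 2.5535806477864584
train_runtime = 3212.9717      # seconds
global_step = 750
samples_seen = 500 * 3         # assumption: 500 samples per epoch, 3 epochs

perplexity = math.exp(train_loss)               # roughly 12.85
steps_per_second = global_step / train_runtime
samples_per_second = samples_seen / train_runtime

print(round(perplexity, 2))
print(round(steps_per_second, 3))    # 0.233
print(round(samples_per_second, 3))  # 0.467
```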