# LLM Practice Notebook
#
# We'll load Shakespeare's poems from input.txt, inspect the text, fine-tune GPT-2 on it, and generate similar verse.


Open file: input.txt

In [2]:
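from transformers import GPT2Tokenizer

model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

with open("input.txt", "r", encoding="utf-8") as file:
    text = file.read()

# Encode the whole file only to count its tokens. The result is far longer
# than GPT-2's 1024-token context, so the tokenizer prints a warning; that is
# harmless here because these ids are never passed to the model.
tokens = tokenizer.encode(text)
print(f"Number of tokens: {len(tokens)}")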


Token indices sequence length is longer than the specified maximum sequence length for this model (109103 > 1024). Running this sequence through the model will result in indexing errors


Number of tokens: 109103
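
The warning above appears because the whole file was encoded as a single sequence, while GPT-2 accepts at most 1024 tokens at a time. One common way around this is to cut the token list into fixed-size blocks before anything reaches the model. The sketch below is only an illustration: `split_into_blocks` is a hypothetical helper, and `placeholder_ids` stands in for the real token ids so the example runs without the tokenizer.

```python
def split_into_blocks(token_ids, block_size):
    """Cut a token list into consecutive full blocks, dropping any short remainder."""
    usable = len(token_ids) - len(token_ids) % block_size
    return [token_ids[i:i + block_size] for i in range(0, usable, block_size)]


# Placeholder ids standing in for the 109103 real tokens reported above.
placeholder_ids = list(range(109103))

# 1024 is GPT-2's maximum sequence length, as quoted in the warning.
blocks = split_into_blocks(placeholder_ids, block_size=1024)
print(f"{len(blocks)} blocks of {len(blocks[0])} tokens")
```

Dropping the trailing partial block keeps every example the same length, at the cost of discarding a few hundred tokens at the end of the file.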


In [3]:
# Inspect the raw text; note that it starts with publisher front matter, not verse
print(text)

Classic Poetry Series








William Shakespeare
- poems -




Publication Date:
 2012
Publisher:
Poemhunter.com - The World's Poetry Archive
William Shakespeare(26 April 1564 - 23 April 1616)

an English poet and playwright, widely regarded as the greatest writer in the
English language and the world's pre-eminent dramatist. He is often called
England's national poet and the "Bard of Avon". His surviving works, including
some collaborations, consist of about 38 plays, 154 sonnets, two long narrative
poems, and several other poems. His plays have been translated into every
major living language and are performed more often than those of any other
playwright.
Shakespeare was born and raised in Stratford-upon-Avon. At the age of 18, he
married Anne Hathaway, with whom he had three children: Susanna, and twins
Hamnet and Judith. Between 1585 and 1592, he began a successful career in
London as an actor, writer, and part owner of a playing company called the Lord
Chamberlain's Men, later k

In [4]:
from transformers import GPT2LMHeadModel, Trainer, TrainingArguments, TextDataset, DataCollatorForLanguageModeling

model = GPT2LMHeadModel.from_pretrained(model_name)

# Cut input.txt into fixed 128-token training blocks.
# Note: TextDataset is deprecated in recent transformers releases.
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="input.txt",
    block_size=128,
)

# mlm=False selects causal (next-token) language modeling, which is what GPT-2 uses
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False,
)

training_args = TrainingArguments(
    output_dir="./gpt2-finetuned",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=1,
    save_steps=500,        # write a checkpoint every 500 steps
    save_total_limit=2,    # keep only the two most recent checkpoints
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)

trainer.train()


`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Step,Training Loss
500,4.7958
1000,4.3265
1500,4.1608
2000,4.1019
2500,4.0834


TrainOutput(global_step=2556, training_loss=4.290635050742279, metrics={'train_runtime': 1215.7035, 'train_samples_per_second': 2.102, 'train_steps_per_second': 2.102, 'total_flos': 166965608448000.0, 'train_loss': 4.290635050742279, 'epoch': 3.0})
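
As a sanity check on the numbers above, the short sketch below recomputes the expected step count and converts the logged losses to perplexity (the exponential of the cross-entropy loss). It assumes that the dataset consists of non-overlapping 128-token blocks with the trailing partial block dropped, that training ran on a single device, and that there was no gradient accumulation. The perplexities are training-set values averaged over each logging window, not a held-out evaluation.

```python
import math

num_tokens = 109103   # token count printed earlier in this notebook
block_size = 128      # block_size passed to TextDataset
epochs = 3            # num_train_epochs
batch_size = 1        # per_device_train_batch_size (single device assumed)

# Assumption: non-overlapping blocks, trailing partial block dropped.
blocks_per_epoch = num_tokens // block_size
expected_steps = math.ceil(blocks_per_epoch / batch_size) * epochs
print(f"expected optimizer steps: {expected_steps}")

# Training losses copied from the log table above.
logged_losses = {500: 4.7958, 1000: 4.3265, 1500: 4.1608, 2000: 4.1019, 2500: 4.0834}
perplexities = {step: math.exp(loss) for step, loss in logged_losses.items()}
for step, ppl in perplexities.items():
    print(f"step {step}: loss {logged_losses[step]:.4f} -> perplexity {ppl:.1f}")
```

Under these assumptions the computed step count matches the reported `global_step=2556`. The loss curve is flattening by step 2500, so further epochs on this small corpus would probably yield diminishing returns on the training loss.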

In [None]:
# Set a prompt to guide the generation
prompt = "In today's discussion,"
# Calling the tokenizer directly returns both input_ids and attention_mask
inputs = tokenizer(prompt, return_tensors="pt")

# Generate on the CPU, and switch to eval mode so dropout is disabled
model = model.cpu()
model.eval()

output = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
    max_length=100,  # total length in tokens, prompt included
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    do_sample=True,
)

print(tokenizer.decode(output[0], skip_special_tokens=True))


In today's discussion, I am not so bold as to proclaim that I am not a philosopher, nor a philosopher of that I am.
But what is it that I am not a philosopher of the, nor a philosopher of the present,
Nor a philosopher of the the world, nor a philosopher of the present.
William Shakespeare
www.PoemHunter.com - The World's Poetry Archive 389
Sonnet 4: The Time of My Death

My death is the
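
The sample above was drawn with `temperature=0.7`, `top_k=50`, and `top_p=0.9`. To make those three settings concrete, here is a simplified pure-Python sketch of how a single next token could be chosen from raw logits. It is an illustration of the general technique, not the transformers implementation; `sample_next` and `toy_logits` are hypothetical names invented for this example.

```python
import math
import random


def sample_next(logits, temperature=0.7, top_k=50, top_p=0.9, rng=random):
    """Pick one token index from raw logits using temperature, top-k, then top-p."""
    # Temperature: divide logits before softmax; values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Top-k: keep the k highest-scoring candidates, best first.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for idx, p in zip(ranked, probs):
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Sample among the kept candidates; random.choices normalizes the weights.
    indices, kept_probs = zip(*kept)
    return rng.choices(indices, weights=kept_probs, k=1)[0]


# Toy five-token vocabulary: the two highest logits hold almost all the mass.
toy_logits = [5.0, 4.0, 1.0, 0.0, -2.0]
rng = random.Random(0)
draws = [sample_next(toy_logits, rng=rng) for _ in range(200)]
print(sorted(set(draws)))
```

With these toy logits the top-p cutoff is reached after the two best candidates, so the low-probability tail is never sampled. Raising `temperature` or `top_p` would let more of the tail through and make the output more varied.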
