# Training GPT-2 with English Poetry

#### Don't run this unless you have a GPU, PyTorch, and conda installed

This notebook fine-tunes the model you provide on the given data. A later part generates poems from the newly trained model.

You can substitute your own files for the training and test data.

The training instructions are based on those at https://towardsdatascience.com/how-to-fine-tune-gpt-2-for-text-generation-ae2ea53bc272 and https://colab.research.google.com/github/philschmid/fine-tune-GPT-2/blob/master/Fine_tune_a_non_English_GPT_2_Model_with_Huggingface.ipynb#scrollTo=V36gOIOfLHvB. The model is already pre-trained and is then **fine-tuned** on poetry data. As with all language models of this kind, the training objective is to predict the next token, learned from large amounts of training text.
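To make the objective concrete, here is a minimal sketch in plain Python, with no model involved. The four-word vocabulary and the probabilities are made up for illustration. The loss is the average negative log-probability assigned to each token that actually comes next.

```python
import math

def next_token_loss(predicted_probs, target_ids):
    """Average negative log-likelihood of the true next tokens.

    predicted_probs[t] is a (hypothetical) probability distribution over the
    vocabulary at position t; target_ids[t] is the token that actually follows.
    """
    total = 0.0
    for probs, target in zip(predicted_probs, target_ids):
        total += -math.log(probs[target])
    return total / len(target_ids)

# Toy vocabulary and made-up model outputs, for illustration only.
vocab = ["the", "snow", "is", "white"]
tokens = [0, 1, 2, 3]                       # "the snow is white"
inputs, targets = tokens[:-1], tokens[1:]   # targets are the inputs shifted by one

predicted_probs = [
    [0.1, 0.7, 0.1, 0.1],  # after "the": most mass on "snow"
    [0.1, 0.1, 0.6, 0.2],  # after "the snow": most mass on "is"
    [0.2, 0.1, 0.2, 0.5],  # after "the snow is": most mass on "white"
]

loss = next_token_loss(predicted_probs, targets)
print(f"loss = {loss:.4f}")
```

Fine-tuning lowers this quantity on the poetry corpus by shifting probability mass towards the tokens that actually follow in the poems.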

In [1]:
import torch
from tqdm import tqdm

In [2]:
print(torch.cuda.is_available())
print(torch.cuda.device_count())
torch.cuda.empty_cache()
torch.cuda.memory_summary()

True
1




In [19]:
# Note: "ansi" is a Windows-only codec alias, so this line fails on other platforms.
with open("data_english/english_poems_processed_train.txt", mode="r", encoding="ansi") as f:
    train = f.read()
# A smaller excerpt used for validation during training.
with open("data_english/english_poems_processed_test.txt", mode="r", encoding="utf-8") as f:
    test = f.read()

print("Train dataset length: " + str(len(train)))
print("Test dataset length: " + str(len(test)))

Train dataset length: 2210851
Test dataset length: 35245


In [4]:
from transformers import AutoTokenizer

In [20]:
tokenizer = AutoTokenizer.from_pretrained("gpt2")

train_path = 'data_english/english_poems_processed_train.txt'
test_path = 'data_english/english_poems_processed_test.txt'

Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file config.json from cache at C:\Users\etien/.cache\huggingface\hub\models--gpt2\snapshots\e7da7f221d5bf496a48136c0cd264e630fe9fcc8\config.json
Model config GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
  

In [6]:
from transformers import TextDataset, DataCollatorForLanguageModeling

In [7]:
def load_dataset(train_path, test_path, tokenizer):
    """Build block-wise train/test datasets and a causal-LM data collator."""
    train_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=train_path,
        block_size=128,  # tokens per training example
    )

    test_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=test_path,
        block_size=128,
    )

    # mlm=False selects causal (next-token) language modeling rather than masked LM.
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False,
    )
    return train_dataset, test_dataset, data_collator

train_dataset, test_dataset, data_collator = load_dataset(train_path, test_path, tokenizer)
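The `block_size=128` argument means the tokenized text is cut into consecutive 128-token blocks, each of which becomes one training example. Below is a rough sketch of that chunking in plain Python, with integers standing in for token ids. It illustrates the idea and is not the library's code. It assumes, as I understand `TextDataset` to behave, that a trailing remainder shorter than one block is dropped.

```python
def chunk_tokens(token_ids, block_size):
    """Split a flat list of token ids into consecutive blocks of block_size.

    A trailing remainder shorter than block_size is dropped (an assumption
    about the behaviour being illustrated, not a copy of library code).
    """
    return [
        token_ids[i:i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]

# Stand-in for a tokenized corpus: 1000 fake token ids.
fake_tokens = list(range(1000))
blocks = chunk_tokens(fake_tokens, block_size=128)

print(len(blocks))                             # number of examples
print(len(blocks[0]))                          # tokens per example
print(len(fake_tokens) - 128 * len(blocks))    # tokens dropped at the end
```

With the 4722 training examples reported in the training log, this implies roughly 4722 × 128 ≈ 604,000 training tokens.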



In [8]:
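from transformers import Trainer, TrainingArguments, AutoModelForCausalLM

In [9]:
# AutoModelWithLMHead is deprecated; AutoModelForCausalLM is its replacement for GPT-2.
# This loads a checkpoint previously saved in this directory; pass "gpt2" instead
# to start from the base pre-trained model.
model = AutoModelForCausalLM.from_pretrained("gpt2-poetry-model-with-start-token")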



In [10]:
training_args = TrainingArguments(
    output_dir="./gpt2-poetry-model-with-start-token",  # output directory
    overwrite_output_dir=True,  # overwrite the contents of the output directory
    num_train_epochs=10,  # number of training epochs
    per_device_train_batch_size=16,  # batch size for training
    per_device_eval_batch_size=32,  # batch size for evaluation
    eval_steps=500,  # number of update steps between two evaluations
    save_steps=500,  # save a checkpoint every 500 update steps
    warmup_steps=500,  # number of warmup steps for the learning-rate scheduler
    prediction_loss_only=True,  # report only the loss during evaluation
    evaluation_strategy="steps",  # evaluate every eval_steps update steps
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

In [11]:
trainer.train()

***** Running training *****
  Num examples = 4722
  Num Epochs = 10
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 2960
  Number of trainable parameters = 124439808


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 500  | 2.0249        | 5.471236        |
| 1000 | 1.7122        | 5.62081         |
| 1500 | 1.7383        | 5.681308        |
| 2000 | 1.7748        | 5.539335        |
| 2500 | 1.8701        | 5.50561         |
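Cross-entropy losses are often easier to interpret as perplexities, computed as exp(loss). The sketch below converts the logged values. It assumes the losses are mean per-token cross-entropies in nats, which is the usual convention for this kind of training loop but is not stated in the log.

```python
import math

# (step, training loss, validation loss) copied from the log above.
log_rows = [
    (500, 2.0249, 5.471236),
    (1000, 1.7122, 5.62081),
    (1500, 1.7383, 5.681308),
    (2000, 1.7748, 5.539335),
    (2500, 1.8701, 5.50561),
]

def perplexity(loss):
    """Perplexity for a mean per-token cross-entropy given in nats."""
    return math.exp(loss)

for step, train_loss, val_loss in log_rows:
    print(f"step {step:>4}: train ppl {perplexity(train_loss):7.1f}, "
          f"val ppl {perplexity(val_loss):7.1f}")
```

The validation perplexity stays far above the training perplexity and does not improve across checkpoints, which suggests the model fits the training poems much better than the held-out ones.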


***** Running Evaluation *****
  Num examples = 73
  Batch size = 32
Saving model checkpoint to ./gpt2-poetry-model-with-start-token\checkpoint-500
Configuration saved in ./gpt2-poetry-model-with-start-token\checkpoint-500\config.json
Model weights saved in ./gpt2-poetry-model-with-start-token\checkpoint-500\pytorch_model.bin
***** Running Evaluation *****
  Num examples = 73
  Batch size = 32
Saving model checkpoint to ./gpt2-poetry-model-with-start-token\checkpoint-1000
Configuration saved in ./gpt2-poetry-model-with-start-token\checkpoint-1000\config.json
Model weights saved in ./gpt2-poetry-model-with-start-token\checkpoint-1000\pytorch_model.bin
***** Running Evaluation *****
  Num examples = 73
  Batch size = 32
Saving model checkpoint to ./gpt2-poetry-model-with-start-token\checkpoint-1500
Configuration saved in ./gpt2-poetry-model-with-start-token\checkpoint-1500\config.json
Model weights saved in ./gpt2-poetry-model-with-start-token\checkpoint-1500\pytorch_model.bin
***** Runn

TrainOutput(global_step=2960, training_loss=1.8501393911000845, metrics={'train_runtime': 1848.2822, 'train_samples_per_second': 25.548, 'train_steps_per_second': 1.601, 'total_flos': 3084552437760000.0, 'train_loss': 1.8501393911000845, 'epoch': 10.0})
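The step count in the log can be reproduced from the dataset size and the batch size. This is a small arithmetic check, assuming a single device, no gradient accumulation, and a final partial batch that still counts as one step. The first two are stated in the log; the last is my assumption about the data loader.

```python
import math

num_examples = 4722     # "Num examples" from the training log
batch_size = 16         # per_device_train_batch_size, one device
num_epochs = 10         # num_train_epochs
grad_accum_steps = 1    # "Gradient Accumulation steps" from the log

steps_per_epoch = math.ceil(num_examples / (batch_size * grad_accum_steps))
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)  # prints: 296 2960
```

This matches `Total optimization steps = 2960`. With evaluation and checkpointing every 500 steps, it also explains why the loss table has five rows, for steps 500 through 2500.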

In [12]:
trainer.save_model()

Saving model checkpoint to ./gpt2-poetry-model-with-start-token
Configuration saved in ./gpt2-poetry-model-with-start-token\config.json
Model weights saved in ./gpt2-poetry-model-with-start-token\pytorch_model.bin


In [13]:
torch.device("cuda")

device(type='cuda')

In [14]:
from transformers import pipeline
poetry = pipeline('text-generation', model='./gpt2-poetry-model-with-start-token', tokenizer='gpt2', device = 0)

loading configuration file ./gpt2-poetry-model-with-start-token\config.json
Model config GPT2Config {
  "_name_or_path": "./gpt2-poetry-model-with-start-token",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "torch_dtype": "float32",
  "transformers_version": "4.24.0",
  "use

In [15]:
print(poetry('<start>Though as the weather goes, my mind would be all in trouble for me', max_length = 19)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


<start>Though as the weather goes, my mind would be all in trouble for me
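The output above is essentially just the prompt because `max_length` counts the prompt tokens as well as the generated ones. A small sketch of that budget follows. The prompt length of 18 tokens is an illustrative guess, not a measured value, and `new_token_budget` is a hypothetical helper, not part of `transformers`.

```python
def new_token_budget(prompt_tokens, max_length):
    """Tokens left for generation when max_length includes the prompt."""
    return max(0, max_length - prompt_tokens)

# Illustrative numbers only: a prompt of about 18 tokens with max_length=19
# leaves room for a single new token.
print(new_token_budget(prompt_tokens=18, max_length=19))   # prints: 1
print(new_token_budget(prompt_tokens=18, max_length=300))  # prints: 282
```

If the installed version of `transformers` supports it, passing `max_new_tokens` instead of `max_length` limits only the generated part, which avoids this interaction with the prompt length.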



In [16]:
print(len(poetry('<start>The snow is white and the sky is blue.\n<start>', max_length = 210)[0]['generated_text']))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


824


In [17]:
print(poetry('Do not stand by my grave and weep', max_length = 300)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Do not stand by my grave and weep

For pity only show thy woe
And love and pity may me lose

And she that did my sighs inspire
May pray for me in this wise

Thou most imperfect spirit, say
What is the cause of this almighty pain

Then say, and I will think it well
This eye which kindled this fire

And since that grief and sorrow I know
Does with it feed and combat fire

Thus may all pity prove deceiving
Thus doth love and this withen die

And then 't is but by hoping too much
May I still live on in love's despite

Thou most unjust and cruel god, to me unjust

Yet dost love but to torment me with a smile
Painful and certain doom

Lo, cruel god, my soul doth wish a thing unjust

Which never shall be done unjust to thee

Yet cruel god, I do requite thy wrong

O cruel god, by hard compulsion crave

O cruel god, by hard compulsion crave

O cruel god, since all my wrongs are vain

Therefore, since she in time hath been unjust

Can I for ever unjust be damned


Since thou hast cursed am I and

In [18]:
import random
import time

# Seed couplets used as prompts (AABB rhyme scheme).
samples = [
    'Do not stand by my grave and weep\nThe appeal for darkness is to steep\n\n',
    'She walks in beauty, like the night\nI would love to be her knight\n\n',
    'Life, believe, is not a dream\nYou can not wake up just with a scream\n\n',
    'life is a tall tender tree\nSavor it like a cookie\n\n',
    'The taste of marmalade is better with you\nYou light up my day when you lace your shoe\n\n',
    'They were mad and in love\nShe was a wolf and he was a dove\n\n',
    'My heart aches just thinking of you\nBut now I am left feeling blue\n\n',
    'They were lonely and tired\nThat is when the shot was fired\n\n',
    'I am writing this poem for you\nTo let you know that I will always be true\n\n',
    'I long for the warmth of your smile\nand the sight of your teeth like a crocodile\n\n',
]

# ABAB alternative:
# samples = ['Do not stand by my grave and weep\n My memory of you remains\nThe appeal for darkness is to steep\nMy link with death feel like chains\n\n', 'She walks in beauty, like the night\n In a white dress like the moon\nI would love to be her knight\n On her path flowers bloom\n\n', 'Life, believe, is not a dream\nEven in the middle of a nightmare\nYou can not wake up just with a scream\nBut you still feel the scare\n\n','life is a tall tender tree\nLive it with all your soul\nSavor it like a cookie\nBut beware not to fall\n\n']

for i in tqdm(range(50)):
    first = random.choice(samples)  # pick one seed at random

    begin = time.time()
    poem = poetry(first, max_length=300)[0]['generated_text']

    # Append the poem to the output file.
    with open('poems_generated_english/TEST_POEM.txt', 'a', encoding='utf-8') as f:
        f.write(poem)
    end = time.time()

    # Elapsed time for this iteration only (generation plus file write).
    print(f"Total runtime of the program is {end - begin}")

  0%|                                                                                           | 0/50 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
  2%|█▋                                                                                 | 1/50 [00:04<03:23,  4.14s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 4.143949031829834


  4%|███▎                                                                               | 2/50 [00:08<03:18,  4.13s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 4.126929044723511


  6%|████▉                                                                              | 3/50 [00:12<03:12,  4.09s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 4.046182632446289


  8%|██████▋                                                                            | 4/50 [00:16<03:11,  4.17s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 4.285568952560425


 10%|████████▎                                                                          | 5/50 [00:20<03:06,  4.14s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 4.094080924987793


 12%|█████████▉                                                                         | 6/50 [00:24<03:00,  4.09s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 3.9983434677124023


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 4.010268449783325


 16%|█████████████▎                                                                     | 8/50 [00:32<02:49,  4.05s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 4.002304792404175


 18%|██████████████▉                                                                    | 9/50 [00:36<02:45,  4.05s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 4.049151659011841


 20%|████████████████▍                                                                 | 10/50 [00:41<02:45,  4.14s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 4.346354007720947


 22%|██████████████████                                                                | 11/50 [00:45<02:39,  4.10s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 3.996342658996582


 24%|███████████████████▋                                                              | 12/50 [00:49<02:34,  4.07s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 4.021247625350952


 26%|█████████████████████▎                                                            | 13/50 [00:53<02:28,  4.02s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 3.8806211948394775


 28%|██████████████████████▉                                                           | 14/50 [00:57<02:24,  4.01s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 3.9943230152130127


 30%|████████████████████████▌                                                         | 15/50 [01:01<02:21,  4.04s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 4.115575551986694


 32%|██████████████████████████▏                                                       | 16/50 [01:05<02:17,  4.06s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 4.091071844100952


 34%|███████████████████████████▉                                                      | 17/50 [01:09<02:13,  4.05s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 4.026198625564575


 36%|█████████████████████████████▌                                                    | 18/50 [01:13<02:09,  4.05s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 4.049174547195435


 38%|███████████████████████████████▏                                                  | 19/50 [01:17<02:04,  4.00s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 3.900570869445801


 40%|████████████████████████████████▊                                                 | 20/50 [01:21<01:59,  3.98s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 3.92551851272583


 42%|██████████████████████████████████▍                                               | 21/50 [01:25<01:56,  4.01s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 4.063109874725342


 44%|████████████████████████████████████                                              | 22/50 [01:29<01:51,  4.00s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 3.98038649559021


 46%|█████████████████████████████████████▋                                            | 23/50 [01:33<01:47,  3.98s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 3.9434502124786377


 48%|███████████████████████████████████████▎                                          | 24/50 [01:37<01:43,  3.98s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 3.9653897285461426


 50%|█████████████████████████████████████████                                         | 25/50 [01:41<01:39,  3.97s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 3.9494400024414062


 52%|██████████████████████████████████████████▋                                       | 26/50 [01:44<01:34,  3.94s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 3.8696274757385254


 54%|████████████████████████████████████████████▎                                     | 27/50 [01:48<01:30,  3.92s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 3.874666213989258


 56%|█████████████████████████████████████████████▉                                    | 28/50 [01:52<01:26,  3.94s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 3.972339630126953


 58%|███████████████████████████████████████████████▌                                  | 29/50 [01:56<01:22,  3.94s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 3.9364752769470215


 60%|█████████████████████████████████████████████████▏                                | 30/50 [02:00<01:17,  3.90s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 3.8138394355773926


 62%|██████████████████████████████████████████████████▊                               | 31/50 [02:04<01:13,  3.89s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 3.881586790084839


 64%|████████████████████████████████████████████████████▍                             | 32/50 [02:08<01:09,  3.88s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 3.8357441425323486


 66%|██████████████████████████████████████████████████████                            | 33/50 [02:12<01:06,  3.88s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 3.8956198692321777


 68%|███████████████████████████████████████████████████████▊                          | 34/50 [02:16<01:02,  3.91s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 3.987304925918579


 70%|█████████████████████████████████████████████████████████▍                        | 35/50 [02:20<00:58,  3.92s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 3.920543670654297


 72%|███████████████████████████████████████████████████████████                       | 36/50 [02:23<00:54,  3.92s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 3.933488368988037


 74%|████████████████████████████████████████████████████████████▋                     | 37/50 [02:27<00:51,  3.93s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 3.9354708194732666


 76%|██████████████████████████████████████████████████████████████▎                   | 38/50 [02:31<00:47,  3.97s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 4.063110113143921


 78%|███████████████████████████████████████████████████████████████▉                  | 39/50 [02:35<00:43,  3.95s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 3.9065544605255127


 80%|█████████████████████████████████████████████████████████████████▌                | 40/50 [02:39<00:39,  3.95s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 3.9631519317626953


 82%|███████████████████████████████████████████████████████████████████▏              | 41/50 [02:43<00:35,  3.93s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 3.86862850189209


 84%|████████████████████████████████████████████████████████████████████▉             | 42/50 [02:47<00:31,  3.89s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 3.807845115661621


 86%|██████████████████████████████████████████████████████████████████████▌           | 43/50 [02:51<00:27,  3.86s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 3.7983460426330566


 88%|████████████████████████████████████████████████████████████████████████▏         | 44/50 [02:55<00:23,  3.86s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 3.847738027572632


 90%|█████████████████████████████████████████████████████████████████████████▊        | 45/50 [02:59<00:19,  3.92s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 4.053130626678467


 92%|███████████████████████████████████████████████████████████████████████████▍      | 46/50 [03:02<00:15,  3.87s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 3.7649362087249756


 94%|█████████████████████████████████████████████████████████████████████████████     | 47/50 [03:07<00:11,  3.94s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 4.088070392608643


 96%|██████████████████████████████████████████████████████████████████████████████▋   | 48/50 [03:11<00:08,  4.07s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 4.36732292175293


 98%|████████████████████████████████████████████████████████████████████████████████▎ | 49/50 [03:15<00:03,  4.00s/it]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Total runtime of the program is 3.84368896484375


100%|██████████████████████████████████████████████████████████████████████████████████| 50/50 [03:19<00:00,  3.98s/it]

Total runtime of the program is 3.79288649559021
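Because the loop appends each poem with a bare `f.write(poem)`, consecutive poems run together in `TEST_POEM.txt`. One possible variant, sketched below, writes a separator after each poem so the file can be split again later. The separator string, the `append_poem` and `read_poems` helpers, and the stand-in `fake_generate` function are all hypothetical. In the notebook, the call to `poetry(...)` would take the place of `fake_generate`.

```python
import os
import tempfile

SEPARATOR = "\n\n=====\n\n"  # hypothetical delimiter between poems

def append_poem(path, poem, separator=SEPARATOR):
    """Append one poem followed by a separator so poems can be split later."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(poem.rstrip("\n") + separator)

def read_poems(path, separator=SEPARATOR):
    """Read the file back and return the list of non-empty poems."""
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()
    return [p for p in content.split(separator) if p]

def fake_generate(seed):
    """Stand-in for the text-generation pipeline (not a real model)."""
    return seed + "A generated line\nAnother generated line"

# Demonstration on a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "poems.txt")
    for seed in ["First seed\n", "Second seed\n"]:
        append_poem(out_path, fake_generate(seed))
    poems = read_poems(out_path)

print(len(poems))
```

The separator should be a string that is unlikely to occur inside a generated poem, otherwise `read_poems` would split a poem in the wrong place.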



