# Fine-Tuning GPT-2 for Short Story Generation

This project demonstrates fine-tuning **GPT-2**, a pre-trained language model, to generate short stories using the **TinyStories** dataset. We'll preprocess the data, fine-tune the model with early stopping, evaluate its performance, and generate new stories.

In [1]:
from datasets import load_dataset

# Load the TinyStories dataset
dataset = load_dataset("roneneldan/TinyStories")

# Print dataset information
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 2119719
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 21990
    })
})


To reduce training time, we'll sample only **5% of the dataset** from both the training and validation splits. Shuffling with a fixed seed before selecting keeps the sample random and reproducible.

In [2]:
# Function to sample 5% of the dataset
def sample_five_percent(dataset_split):
    total_size = len(dataset_split)
    five_percent_size = total_size // 20  # 5% of the dataset
    return dataset_split.shuffle(seed=42).select(range(five_percent_size))

# Sample 5% from train and validation splits
train_data = sample_five_percent(dataset['train'])
val_data = sample_five_percent(dataset['validation'])

# Check sizes of sampled datasets
print(f"Train size (5%): {len(train_data)}")
print(f"Validation size (5%): {len(val_data)}")

Train size (5%): 105985
Validation size (5%): 1099


The **GPT-2 tokenizer** converts text into integer token IDs. GPT-2 does not define a padding token, so we reuse the **end-of-sequence (EOS)** token for padding.

In [3]:
from transformers import GPT2Tokenizer

# Load GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# GPT-2 has no padding token by default; reuse the EOS token so padding works
tokenizer.pad_token = tokenizer.eos_token

The dataset is tokenized into sequences of token IDs, with **padding** and **truncation** applied so that every sequence is exactly 512 tokens long. Fixed-length sequences can be stacked directly into batch tensors for training.

In [4]:
# Function to tokenize and prepare inputs/labels
def tokenize_function_with_labels(examples):
    tokenized = tokenizer(
        examples['text'],
        truncation=True,
        padding="max_length",  # Pad to a fixed length for batches
        max_length=512         # Set maximum length for sequences
    )
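    # Labels are a copy of input_ids; the model shifts them internally for next-token prediction.
    # Padded positions are not masked here, so they also count toward the loss.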
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized

# Tokenize train and validation datasets
train_dataset = train_data.map(tokenize_function_with_labels, batched=True, remove_columns=["text"])
val_dataset = val_data.map(tokenize_function_with_labels, batched=True, remove_columns=["text"])

# Verify tokenized data structure
print(train_dataset[0])

{'input_ids': [14967, 290, 32189, 588, 284, 711, 287, 262, 3952, 13, 1119, 766, 257, 1263, 3430, 319, 262, 2323, 13, 632, 318, 7586, 290, 890, 290, 4334, 13, 198, 198, 1, 8567, 11, 257, 3430, 2474, 5045, 1139, 13, 366, 40, 460, 10303, 340, 2474, 198, 198, 1544, 8404, 284, 10303, 262, 3430, 11, 475, 340, 318, 1165, 5802, 13, 679, 8953, 866, 290, 10532, 262, 3430, 13, 198, 198, 1, 46, 794, 2474, 339, 1139, 13, 366, 2504, 5938, 2474, 198, 198, 44, 544, 22051, 13, 1375, 318, 407, 1612, 11, 673, 655, 6834, 340, 318, 8258, 13, 198, 198, 1, 5756, 502, 1949, 2474, 673, 1139, 13, 366, 40, 460, 5236, 340, 2474, 198, 198, 3347, 11103, 510, 262, 3430, 290, 7584, 340, 319, 607, 1182, 13, 1375, 11114, 6364, 290, 7773, 13, 1375, 857, 407, 2121, 866, 13, 198, 198, 1, 22017, 2474, 5045, 1139, 13, 366, 1639, 389, 922, 379, 22486, 2474, 198, 198, 1, 10449, 345, 2474, 32189, 1139, 13, 366, 1026, 318, 1257, 2474, 198, 198, 2990, 1011, 4962, 22486, 262, 3430, 319, 511, 6665, 11, 5101, 11, 290, 7405, 13, 111
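Because the padding token is the EOS token and `labels` is a plain copy of `input_ids`, the padded tail of each sequence is scored by the loss like any other token. A common remedy is to set the label to `-100` wherever `attention_mask` is 0, since `-100` is the index that PyTorch's cross-entropy loss ignores by default. The helper below is a minimal sketch of that idea on plain Python lists. `mask_padding_labels` is a hypothetical name that does not appear in this notebook, and the training run logged here did not use it.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding_labels(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Copy input_ids into labels, replacing padded positions with ignore_index."""
    return [
        token if keep == 1 else ignore_index
        for token, keep in zip(input_ids, attention_mask)
    ]

# Toy example (made-up row): three real tokens followed by two EOS pads (id 50256).
example_ids = [14967, 290, 32189, 50256, 50256]
example_mask = [1, 1, 1, 0, 0]
example_labels = mask_padding_labels(example_ids, example_mask)
print(example_labels)  # [14967, 290, 32189, -100, -100]
```

Inside a batched tokenize function, the same helper would be applied row by row to `tokenized["input_ids"]` and `tokenized["attention_mask"]`. Masking the padding would change the reported loss values, so results would not be directly comparable with the log in this notebook.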

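We load the pre-trained **GPT-2** model and resize its token embeddings to match the tokenizer. Because the padding token reuses the existing EOS token, no new token was added, so the vocabulary stays at 50,257 entries and the resize leaves the embedding matrix unchanged.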

In [5]:
from transformers import GPT2LMHeadModel

# Load GPT-2 model with a language modeling head
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Resize embeddings to match the tokenizer vocabulary (a no-op here, since no new token was added)
model.resize_token_embeddings(len(tokenizer))

Embedding(50257, 768)

Training arguments control the **learning rate**, **batch size**, **evaluation frequency**, and checkpoint saving. With `load_best_model_at_end=True`, the checkpoint with the lowest validation loss is reloaded when training finishes, which is why evaluation and saving both run every 500 steps.

In [6]:
from transformers import TrainingArguments

# Define training arguments
training_args = TrainingArguments(
    output_dir="./gpt2-tinystories",  # Directory to save model checkpoints
    evaluation_strategy="steps",     # Evaluate at specific intervals
    eval_steps=500,                  # Evaluate every 500 steps
    learning_rate=5e-5,              # Learning rate for optimizer
    weight_decay=0.01,               # Regularization strength
    per_device_train_batch_size=4,   # Training batch size
    per_device_eval_batch_size=4,    # Evaluation batch size
    num_train_epochs=3,              # Total number of training epochs
    save_strategy="steps",           # Save model at specific intervals
    save_steps=500,                  # Save model every 500 steps
    logging_dir="./logs",            # Directory for training logs
    save_total_limit=2,              # Limit total number of saved checkpoints
    load_best_model_at_end=True,     # Load best model after training
)
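These settings determine how long the run is. The arithmetic below is a small sketch that derives the number of optimizer steps from the sample size and batch size, assuming a single device and no gradient accumulation (neither is set explicitly in this notebook). The variable names are illustrative only.

```python
import math

train_examples = 105985   # size of the 5% training sample printed earlier
train_batch_size = 4      # per_device_train_batch_size
num_epochs = 3            # num_train_epochs
eval_every = 500          # eval_steps and save_steps

# The last, partially filled batch of each epoch still counts as one step.
steps_per_epoch = math.ceil(train_examples / train_batch_size)
total_steps = steps_per_epoch * num_epochs
num_evaluations = total_steps // eval_every

print(steps_per_epoch, total_steps, num_evaluations)  # 26497 79491 158
```

The total of 79,491 steps matches the denominator shown in the training progress bar.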



An **early stopping callback** halts training if the validation loss fails to improve for 2 consecutive evaluations (1,000 training steps at `eval_steps=500`), which avoids spending compute once the model has stopped improving.

In [7]:
from transformers import EarlyStoppingCallback

# Add EarlyStoppingCallback to stop training if validation loss does not improve
early_stopping_callback = EarlyStoppingCallback(early_stopping_patience=2)

The **Trainer** API simplifies training by integrating the model, dataset, training arguments, and callbacks into one interface.

In [8]:
from transformers import Trainer

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    callbacks=[early_stopping_callback],  # Add early stopping
)

The model is fine-tuned on the **TinyStories** sample, with evaluation and checkpointing performed every 500 steps. If the validation loss plateaus, early stopping ends the run before all 3 epochs are completed.
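The `learning_rate` values in the training log are consistent with a linear decay from 5e-5 to zero over the whole run with no warmup, which is the `Trainer` default when no scheduler is configured. The function below is a sketch of that schedule for checking logged values by hand. `linear_decay_lr` is a hypothetical helper, and the total of 79,491 steps is taken from the progress bar.

```python
def linear_decay_lr(step, base_lr=5e-5, total_steps=79491, warmup_steps=0):
    """Learning rate after `step` optimizer steps under linear warmup and decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Compare against the logged values at steps 500, 1000 and 26500.
for step in (500, 1000, 26500):
    print(step, linear_decay_lr(step))
```

At step 500 this gives roughly 4.9685e-05, in line with the first logged entry. By the end of the first epoch (step 26,500) the rate has dropped to about two thirds of its starting value.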

In [9]:
# Train the model
trainer.train()

  1%|          | 500/79491 [03:02<7:56:07,  2.77it/s]
{'loss': 0.8855, 'grad_norm': 1.7314858436584473, 'learning_rate': 4.968549898730674e-05, 'epoch': 0.02}
{'eval_loss': 0.770193874835968, 'eval_runtime': 27.9024, 'eval_samples_per_second': 39.387, 'eval_steps_per_second': 9.856, 'epoch': 0.02}

  1%|▏         | 1000/79491 [06:33<8:07:16,  2.68it/s]
{'loss': 0.8014, 'grad_norm': 1.4368808269500732, 'learning_rate': 4.937099797461348e-05, 'epoch': 0.04}
{'eval_loss': 0.7424324154853821, 'eval_runtime': 27.6315, 'eval_samples_per_second': 39.773, 'eval_steps_per_second': 9.952, 'epoch': 0.04}

  2%|▏         | 1500/79491 [10:04<7:52:22,  2.75it/s]
{'loss': 0.7713, 'grad_norm': 1.3933892250061035, 'learning_rate': 4.9056496961920216e-05, 'epoch': 0.06}
{'eval_loss': 0.7257647514343262, 'eval_runtime': 27.704, 'eval_samples_per_second': 39.669, 'eval_steps_per_second': 9.926, 'epoch': 0.06}

  3%|▎         | 2000/79491 [13:36<7:49:59,  2.75it/s]
{'loss': 0.7662, 'grad_norm': 0.8095577955245972, 'learning_rate': 4.874199594922696e-05, 'epoch': 0.08}
{'eval_loss': 0.7115599513053894, 'eval_runtime': 27.63, 'eval_samples_per_second': 39.776, 'eval_steps_per_second': 9.953, 'epoch': 0.08}

  3%|▎         | 2500/79491 [17:07<7:45:09,  2.76it/s]
{'loss': 0.7475, 'grad_norm': 0.9268472194671631, 'learning_rate': 4.8427494936533695e-05, 'epoch': 0.09}
{'eval_loss': 0.7021278738975525, 'eval_runtime': 27.6065, 'eval_samples_per_second': 39.809, 'eval_steps_per_second': 9.961, 'epoch': 0.09}

  4%|▍         | 3000/79491 [20:38<7:43:27,  2.75it/s]
{'loss': 0.7345, 'grad_norm': 0.8242177367210388, 'learning_rate': 4.811299392384044e-05, 'epoch': 0.11}
{'eval_loss': 0.6948038935661316, 'eval_runtime': 27.6003, 'eval_samples_per_second': 39.818, 'eval_steps_per_second': 9.964, 'epoch': 0.11}

  4%|▍         | 3500/79491 [24:09<7:38:37,  2.76it/s]
{'loss': 0.7339, 'grad_norm': 0.7347555160522461, 'learning_rate': 4.779849291114717e-05, 'epoch': 0.13}
{'eval_loss': 0.6883376836776733, 'eval_runtime': 27.6229, 'eval_samples_per_second': 39.786, 'eval_steps_per_second': 9.955, 'epoch': 0.13}

  5%|▌         | 4000/79491 [27:40<7:35:06,  2.76it/s]
{'loss': 0.7298, 'grad_norm': 0.660949170589447, 'learning_rate': 4.7483991898453915e-05, 'epoch': 0.15}
{'eval_loss': 0.6822287440299988, 'eval_runtime': 27.6113, 'eval_samples_per_second': 39.803, 'eval_steps_per_second': 9.96, 'epoch': 0.15}

  6%|▌         | 4500/79491 [31:12<7:33:28,  2.76it/s]
{'loss': 0.7239, 'grad_norm': 0.7262910604476929, 'learning_rate': 4.716949088576065e-05, 'epoch': 0.17}
{'eval_loss': 0.6774558424949646, 'eval_runtime': 27.5808, 'eval_samples_per_second': 39.847, 'eval_steps_per_second': 9.971, 'epoch': 0.17}

  6%|▋         | 5000/79491 [34:42<7:29:33,  2.76it/s]
{'loss': 0.7213, 'grad_norm': 0.8898983597755432, 'learning_rate': 4.6854989873067394e-05, 'epoch': 0.19}
{'eval_loss': 0.6730363965034485, 'eval_runtime': 27.3209, 'eval_samples_per_second': 40.226, 'eval_steps_per_second': 10.066, 'epoch': 0.19}

  7%|▋         | 5500/79491 [38:12<7:26:50,  2.76it/s]
{'loss': 0.7095, 'grad_norm': 0.7683311700820923, 'learning_rate': 4.654048886037413e-05, 'epoch': 0.21}
{'eval_loss': 0.6708530187606812, 'eval_runtime': 27.3148, 'eval_samples_per_second': 40.235, 'eval_steps_per_second': 10.068, 'epoch': 0.21}

  8%|▊         | 6000/79491 [41:42<7:23:22,  2.76it/s]
{'loss': 0.7128, 'grad_norm': 0.6735734343528748, 'learning_rate': 4.622598784768087e-05, 'epoch': 0.23}
{'eval_loss': 0.6665481925010681, 'eval_runtime': 27.2917, 'eval_samples_per_second': 40.269, 'eval_steps_per_second': 10.076, 'epoch': 0.23}

  8%|▊         | 6500/79491 [45:12<7:19:18,  2.77it/s]
{'loss': 0.7007, 'grad_norm': 0.7345395088195801, 'learning_rate': 4.591148683498761e-05, 'epoch': 0.25}
{'eval_loss': 0.662101686000824, 'eval_runtime': 27.3524, 'eval_samples_per_second': 40.179, 'eval_steps_per_second': 10.054, 'epoch': 0.25}

  9%|▉         | 7000/79491 [48:42<7:17:18,  2.76it/s]
{'loss': 0.7039, 'grad_norm': 0.9376696944236755, 'learning_rate': 4.559698582229435e-05, 'epoch': 0.26}
{'eval_loss': 0.6596439480781555, 'eval_runtime': 27.3062, 'eval_samples_per_second': 40.247, 'eval_steps_per_second': 10.071, 'epoch': 0.26}

  9%|▉         | 7500/79491 [52:12<7:22:48,  2.71it/s]
{'loss': 0.7074, 'grad_norm': 0.8313651084899902, 'learning_rate': 4.5282484809601086e-05, 'epoch': 0.28}
{'eval_loss': 0.6577733755111694, 'eval_runtime': 27.3568, 'eval_samples_per_second': 40.173, 'eval_steps_per_second': 10.052, 'epoch': 0.28}

 10%|█         | 8000/79491 [55:42<7:10:55,  2.76it/s]
{'loss': 0.704, 'grad_norm': 0.806961178779602, 'learning_rate': 4.496798379690783e-05, 'epoch': 0.3}
{'eval_loss': 0.6563994884490967, 'eval_runtime': 27.3177, 'eval_samples_per_second': 40.23, 'eval_steps_per_second': 10.067, 'epoch': 0.3}

 11%|█         | 8500/79491 [59:12<7:06:42,  2.77it/s]
{'loss': 0.6837, 'grad_norm': 0.7555195689201355, 'learning_rate': 4.465348278421457e-05, 'epoch': 0.32}
{'eval_loss': 0.6546250581741333, 'eval_runtime': 27.2811, 'eval_samples_per_second': 40.284, 'eval_steps_per_second': 10.08, 'epoch': 0.32}

 11%|█▏        | 9000/79491 [1:02:42<7:05:23,  2.76it/s]
{'loss': 0.6956, 'grad_norm': 0.6832727789878845, 'learning_rate': 4.433898177152131e-05, 'epoch': 0.34}
{'eval_loss': 0.651020884513855, 'eval_runtime': 27.3148, 'eval_samples_per_second': 40.235, 'eval_steps_per_second': 10.068, 'epoch': 0.34}

 12%|█▏        | 9500/79491 [1:06:13<7:06:21,  2.74it/s]
{'loss': 0.6946, 'grad_norm': 0.5899460315704346, 'learning_rate': 4.402448075882805e-05, 'epoch': 0.36}
{'eval_loss': 0.64857017993927, 'eval_runtime': 27.8089, 'eval_samples_per_second': 39.52, 'eval_steps_per_second': 9.889, 'epoch': 0.36}

 13%|█▎        | 10000/79491 [1:09:47<7:01:05,  2.75it/s]
{'loss': 0.6857, 'grad_norm': 0.7748724222183228, 'learning_rate': 4.3709979746134785e-05, 'epoch': 0.38}
{'eval_loss': 0.6458824872970581, 'eval_runtime': 28.2403, 'eval_samples_per_second': 38.916, 'eval_steps_per_second': 9.738, 'epoch': 0.38}

 13%|█▎        | 10500/79491 [1:13:20<6:59:13,  2.74it/s]
{'loss': 0.6724, 'grad_norm': 0.7316901087760925, 'learning_rate': 4.339547873344153e-05, 'epoch': 0.4}
{'eval_loss': 0.645098090171814, 'eval_runtime': 27.9653, 'eval_samples_per_second': 39.299, 'eval_steps_per_second': 9.834, 'epoch': 0.4}

 14%|█▍        | 11000/79491 [1:16:54<6:55:37,  2.75it/s]
{'loss': 0.6876, 'grad_norm': 0.5783931612968445, 'learning_rate': 4.308097772074826e-05, 'epoch': 0.42}
{'eval_loss': 0.6435040831565857, 'eval_runtime': 27.9401, 'eval_samples_per_second': 39.334, 'eval_steps_per_second': 9.842, 'epoch': 0.42}

 14%|█▍        | 11500/79491 [1:20:27<6:54:03,  2.74it/s]
{'loss': 0.6839, 'grad_norm': 0.6709120273590088, 'learning_rate': 4.2766476708055006e-05, 'epoch': 0.43}
{'eval_loss': 0.6418911218643188, 'eval_runtime': 27.9613, 'eval_samples_per_second': 39.304, 'eval_steps_per_second': 9.835, 'epoch': 0.43}

 15%|█▌        | 12000/79491 [1:24:00<6:51:10,  2.74it/s]
{'loss': 0.6777, 'grad_norm': 0.8772743940353394, 'learning_rate': 4.245197569536174e-05, 'epoch': 0.45}
{'eval_loss': 0.6394924521446228, 'eval_runtime': 27.9533, 'eval_samples_per_second': 39.316, 'eval_steps_per_second': 9.838, 'epoch': 0.45}

 16%|█▌        | 12500/79491 [1:27:32<6:43:44,  2.77it/s]
{'loss': 0.6838, 'grad_norm': 0.599226713180542, 'learning_rate': 4.2137474682668484e-05, 'epoch': 0.47}
{'eval_loss': 0.6377079486846924, 'eval_runtime': 27.3561, 'eval_samples_per_second': 40.174, 'eval_steps_per_second': 10.053, 'epoch': 0.47}

 16%|█▋        | 13000/79491 [1:31:02<6:41:49,  2.76it/s]
{'loss': 0.6672, 'grad_norm': 0.6534574031829834, 'learning_rate': 4.182297366997522e-05, 'epoch': 0.49}
{'eval_loss': 0.6360580325126648, 'eval_runtime': 27.3668, 'eval_samples_per_second': 40.158, 'eval_steps_per_second': 10.049, 'epoch': 0.49}

 17%|█▋        | 13500/79491 [1:34:32<6:37:31,  2.77it/s]
{'loss': 0.6665, 'grad_norm': 0.7958711981773376, 'learning_rate': 4.150847265728196e-05, 'epoch': 0.51}
{'eval_loss': 0.6351034641265869, 'eval_runtime': 27.3621, 'eval_samples_per_second': 40.165, 'eval_steps_per_second': 10.05, 'epoch': 0.51}

 18%|█▊        | 14000/79491 [1:38:03<6:35:50,  2.76it/s]
{'loss': 0.6735, 'grad_norm': 0.7535305619239807, 'learning_rate': 4.11939716445887e-05, 'epoch': 0.53}
{'eval_loss': 0.6334057450294495, 'eval_runtime': 27.3305, 'eval_samples_per_second': 40.211, 'eval_steps_per_second': 10.062, 'epoch': 0.53}

 18%|█▊        | 14500/79491 [1:41:34<6:31:16,  2.77it/s]
{'loss': 0.6734, 'grad_norm': 0.6295506358146667, 'learning_rate': 4.087947063189544e-05, 'epoch': 0.55}
{'eval_loss': 0.6328350901603699, 'eval_runtime': 27.3392, 'eval_samples_per_second': 40.199, 'eval_steps_per_second': 10.059, 'epoch': 0.55}

 19%|█▉        | 15000/79491 [1:45:05<6:29:17,  2.76it/s]
{'loss': 0.6666, 'grad_norm': 0.7021023631095886, 'learning_rate': 4.0564969619202176e-05, 'epoch': 0.57}
{'eval_loss': 0.6302140355110168, 'eval_runtime': 27.3339, 'eval_samples_per_second': 40.206, 'eval_steps_per_second': 10.061, 'epoch': 0.57}

 19%|█▉        | 15500/79491 [1:48:35<6:25:10,  2.77it/s]
{'loss': 0.6729, 'grad_norm': 0.6633785367012024, 'learning_rate': 4.025046860650892e-05, 'epoch': 0.58}
{'eval_loss': 0.6299276351928711, 'eval_runtime': 27.395, 'eval_samples_per_second': 40.117, 'eval_steps_per_second': 10.038, 'epoch': 0.58}

 20%|██        | 16000/79491 [1:52:06<6:24:16,  2.75it/s]
{'loss': 0.6623, 'grad_norm': 0.69783616065979, 'learning_rate': 3.9935967593815654e-05, 'epoch': 0.6}
{'eval_loss': 0.6276156902313232, 'eval_runtime': 27.4693, 'eval_samples_per_second': 40.008, 'eval_steps_per_second': 10.011, 'epoch': 0.6}

 21%|██        | 16500/79491 [1:55:37<6:22:33,  2.74it/s]
{'loss': 0.6681, 'grad_norm': 0.7070022225379944, 'learning_rate': 3.962146658112239e-05, 'epoch': 0.62}
{'eval_loss': 0.6273935437202454, 'eval_runtime': 27.496, 'eval_samples_per_second': 39.969, 'eval_steps_per_second': 10.001, 'epoch': 0.62}

 21%|██▏       | 17000/79491 [1:59:09<6:19:31,  2.74it/s]
{'loss': 0.6581, 'grad_norm': 0.5963709950447083, 'learning_rate': 3.930696556842913e-05, 'epoch': 0.64}
{'eval_loss': 0.6257082223892212, 'eval_runtime': 27.4416, 'eval_samples_per_second': 40.049, 'eval_steps_per_second': 10.021, 'epoch': 0.64}

 22%|██▏       | 17500/79491 [2:02:41<6:18:11,  2.73it/s]
{'loss': 0.6712, 'grad_norm': 0.5927578806877136, 'learning_rate': 3.899246455573587e-05, 'epoch': 0.66}
{'eval_loss': 0.6242783069610596, 'eval_runtime': 27.489, 'eval_samples_per_second': 39.98, 'eval_steps_per_second': 10.004, 'epoch': 0.66}

 23%|██▎       | 18000/79491 [2:06:13<6:14:13,  2.74it/s]
{'loss': 0.6557, 'grad_norm': 0.7926117181777954, 'learning_rate': 3.867796354304261e-05, 'epoch': 0.68}
{'eval_loss': 0.6223311424255371, 'eval_runtime': 27.4521, 'eval_samples_per_second': 40.033, 'eval_steps_per_second': 10.017, 'epoch': 0.68}

 23%|██▎       | 18500/79491 [2:09:45<6:09:43,  2.75it/s]
{'loss': 0.6616, 'grad_norm': 0.6481297016143799, 'learning_rate': 3.836346253034935e-05, 'epoch': 0.7}
{'eval_loss': 0.6222378015518188, 'eval_runtime': 27.4479, 'eval_samples_per_second': 40.04, 'eval_steps_per_second': 10.019, 'epoch': 0.7}

 24%|██▍       | 19000/79491 [2:13:17<6:07:41,  2.74it/s]
{'loss': 0.6611, 'grad_norm': 0.7353787422180176, 'learning_rate': 3.804896151765609e-05, 'epoch': 0.72}
{'eval_loss': 0.62031489610672, 'eval_runtime': 27.4511, 'eval_samples_per_second': 40.035, 'eval_steps_per_second': 10.018, 'epoch': 0.72}

 25%|██▍       | 19500/79491 [2:16:49<6:03:30,  2.75it/s]
{'loss': 0.6489, 'grad_norm': 0.7574770450592041, 'learning_rate': 3.7734460504962825e-05, 'epoch': 0.74}
{'eval_loss': 0.6199206113815308, 'eval_runtime': 27.4476, 'eval_samples_per_second': 40.04, 'eval_steps_per_second': 10.019, 'epoch': 0.74}

 25%|██▌       | 20000/79491 [2:20:21<6:00:34,  2.75it/s]
{'loss': 0.6568, 'grad_norm': 0.6529269814491272, 'learning_rate': 3.741995949226957e-05, 'epoch': 0.75}
{'eval_loss': 0.6188176274299622, 'eval_runtime': 27.4831, 'eval_samples_per_second': 39.988, 'eval_steps_per_second': 10.006, 'epoch': 0.75}

 26%|██▌       | 20500/79491 [2:23:53<5:58:38,  2.74it/s]
{'loss': 0.6591, 'grad_norm': 0.6927819848060608, 'learning_rate': 3.71054584795763e-05, 'epoch': 0.77}
{'eval_loss': 0.6186086535453796, 'eval_runtime': 27.4369, 'eval_samples_per_second': 40.056, 'eval_steps_per_second': 10.023, 'epoch': 0.77}

 26%|██▋       | 21000/79491 [2:27:25<5:57:03,  2.73it/s]
{'loss': 0.6412, 'grad_norm': 0.8211563229560852, 'learning_rate': 3.6790957466883046e-05, 'epoch': 0.79}
{'eval_loss': 0.618323028087616, 'eval_runtime': 27.45, 'eval_samples_per_second': 40.036, 'eval_steps_per_second': 10.018, 'epoch': 0.79}

 27%|██▋       | 21500/79491 [2:30:57<5:54:37,  2.73it/s]
{'loss': 0.6509, 'grad_norm': 0.8604490160942078, 'learning_rate': 3.647645645418978e-05, 'epoch': 0.81}
{'eval_loss': 0.616117537021637, 'eval_runtime': 28.0657, 'eval_samples_per_second': 39.158, 'eval_steps_per_second': 9.798, 'epoch': 0.81}

 28%|██▊       | 22000/79491 [2:34:33<6:07:15,  2.61it/s]
{'loss': 0.6543, 'grad_norm': 0.6240878105163574, 'learning_rate': 3.6161955441496524e-05, 'epoch': 0.83}
{'eval_loss': 0.6156446933746338, 'eval_runtime': 28.6127, 'eval_samples_per_second': 38.409, 'eval_steps_per_second': 9.611, 'epoch': 0.83}

 28%|██▊       | 22500/79491 [2:38:11<6:03:31,  2.61it/s]
{'loss': 0.6574, 'grad_norm': 0.5610270500183105, 'learning_rate': 3.584745442880326e-05, 'epoch': 0.85}
{'eval_loss': 0.6154545545578003, 'eval_runtime': 29.2458, 'eval_samples_per_second': 37.578, 'eval_steps_per_second': 9.403, 'epoch': 0.85}

 29%|██▉       | 23000/79491 [2:41:47<5:45:39,  2.72it/s]
{'loss': 0.6507, 'grad_norm': 0.5796142220497131, 'learning_rate': 3.553295341611e-05, 'epoch': 0.87}
{'eval_loss': 0.6134079694747925, 'eval_runtime': 28.0739, 'eval_samples_per_second': 39.147, 'eval_steps_per_second': 9.796, 'epoch': 0.87}

 30%|██▉       | 23500/79491 [2:45:21<5:41:49,  2.73it/s]
{'loss': 0.641, 'grad_norm': 0.5646616816520691, 'learning_rate': 3.521845240341674e-05, 'epoch': 0.89}
{'eval_loss': 0.6128501296043396, 'eval_runtime': 28.186, 'eval_samples_per_second': 38.991, 'eval_steps_per_second': 9.757, 'epoch': 0.89}

 30%|███       | 24000/79491 [2:48:56<5:45:39,  2.68it/s]
{'loss': 0.6523, 'grad_norm': 0.6489104628562927, 'learning_rate': 3.490395139072348e-05, 'epoch': 0.91}
{'eval_loss': 0.611576497554779, 'eval_runtime': 27.787, 'eval_samples_per_second': 39.551, 'eval_steps_per_second': 9.897, 'epoch': 0.91}

 31%|███       | 24500/79491 [2:52:27<5:31:54,  2.76it/s]
{'loss': 0.6525, 'grad_norm': 0.5968216061592102, 'learning_rate': 3.4589450378030216e-05, 'epoch': 0.92}
{'eval_loss': 0.6109687089920044, 'eval_runtime': 27.4137, 'eval_samples_per_second': 40.09, 'eval_steps_per_second': 10.031, 'epoch': 0.92}

 31%|███▏      | 25000/79491 [2:55:58<5:29:25,  2.76it/s]
{'loss': 0.6468, 'grad_norm': 0.5695536136627197, 'learning_rate': 3.427494936533696e-05, 'epoch': 0.94}
{'eval_loss': 0.6099345684051514, 'eval_runtime': 27.4609, 'eval_samples_per_second': 40.021, 'eval_steps_per_second': 10.014, 'epoch': 0.94}

 32%|███▏      | 25500/79491 [2:59:29<5:26:11,  2.76it/s]
{'loss': 0.6421, 'grad_norm': 0.868198573589325, 'learning_rate': 3.39604483526437e-05, 'epoch': 0.96}
{'eval_loss': 0.6092467308044434, 'eval_runtime': 27.4252, 'eval_samples_per_second': 40.073, 'eval_steps_per_second': 10.027, 'epoch': 0.96}

 33%|███▎      | 26000/79491 [3:03:01<5:24:22,  2.75it/s]
{'loss': 0.6464, 'grad_norm': 0.7231113314628601, 'learning_rate': 3.364594733995044e-05, 'epoch': 0.98}
{'eval_loss': 0.6087997555732727, 'eval_runtime': 27.4811, 'eval_samples_per_second': 39.991, 'eval_steps_per_second': 10.007, 'epoch': 0.98}

 33%|███▎      | 26500/79491 [3:06:32<5:01:03,  2.93it/s]
{'loss': 0.6446, 'grad_norm': 0.6516608595848083, 'learning_rate': 3.333144632725718e-05, 'epoch': 1.0}
{'eval_loss': 0.6080111861228943, 'eval_runtime': 27.4811, 'eval_samples_per_second': 39.991, 'eval_steps_per_second': 10.007, 'epoch': 1.0}

 34%|███▍      | 27000/79491 [3:10:03<5:18:35,  2.75it/s]
{'loss': 0.6305, 'grad_norm': 0.5322346091270447, 'learning_rate': 3.3016945314563915e-05, 'epoch': 1.02}
{'eval_loss': 0.6084678173065186, 'eval_runtime': 27.433, 'eval_samples_per_second': 40.061, 'eval_steps_per_second': 10.024, 'epoch': 1.02}

 35%|███▍      | 27500/79491 [3:13:34<5:14:33,  2.75it/s]
{'loss': 0.6248, 'grad_norm': 0.6140685081481934, 'learning_rate': 3.270244430187066e-05, 'epoch': 1.04}
{'eval_loss': 0.6068751215934753, 'eval_runtime': 27.4036, 'eval_samples_per_second': 40.104, 'eval_steps_per_second': 10.035, 'epoch': 1.04}

 35%|███▌      | 28000/79491 [3:17:04<5:12:04,  2.75it/s]
{'loss': 0.6332, 'grad_norm': 0.6205551624298096, 'learning_rate': 3.2387943289177393e-05, 'epoch': 1.06}
{'eval_loss': 0.6056958436965942, 'eval_runtime': 27.4468, 'eval_samples_per_second': 40.041, 'eval_steps_per_second': 10.019, 'epoch': 1.06}

 36%|███▌      | 28500/79491 [3:20:35<5:08:50,  2.75it/s]
{'loss': 0.6215, 'grad_norm': 0.6073554754257202, 'learning_rate': 3.2073442276484136e-05, 'epoch': 1.08}
{'eval_loss': 0.6060558557510376, 'eval_runtime': 27.4224, 'eval_samples_per_second': 40.077, 'eval_steps_per_second': 10.028, 'epoch': 1.08}

 36%|███▋      | 29000/79491 [3:24:06<5:04:19,  2.77it/s]
{'loss': 0.6298, 'grad_norm': 0.5228870511054993, 'learning_rate': 3.175894126379087e-05, 'epoch': 1.09}
{'eval_loss': 0.6052270531654358, 'eval_runtime': 27.3982, 'eval_samples_per_second': 40.112, 'eval_steps_per_second': 10.037, 'epoch': 1.09}

 37%|███▋      | 29500/79491 [3:27:37<5:02:14,  2.76it/s]
{'loss': 0.6219, 'grad_norm': 0.8285207748413086, 'learning_rate': 3.1444440251097614e-05, 'epoch': 1.11}
{'eval_loss': 0.604537844657898, 'eval_runtime': 27.4098, 'eval_samples_per_second': 40.095, 'eval_steps_per_second': 10.033, 'epoch': 1.11}

 38%|███▊      | 30000/79491 [3:31:07<4:59:52,  2.75it/s]
{'loss': 0.6197, 'grad_norm': 0.6045670509338379, 'learning_rate': 3.112993923840435e-05, 'epoch': 1.13}
{'eval_loss': 0.6030477285385132, 'eval_runtime': 27.7676, 'eval_samples_per_second': 39.579, 'eval_steps_per_second': 9.904, 'epoch': 1.13}

 38%|███▊      | 30500/79491 [3:34:40<4:57:21,  2.75it/s]
{'loss': 0.6207, 'grad_norm': 0.6885800957679749, 'learning_rate': 3.081543822571109e-05, 'epoch': 1.15}
{'eval_loss': 0.6036983132362366, 'eval_runtime': 27.835, 'eval_samples_per_second': 39.483, 'eval_steps_per_second': 9.88, 'epoch': 1.15}

 39%|███▉      | 31000/79491 [3:38:13<4:57:04,  2.72it/s]
{'loss': 0.62, 'grad_norm': 0.5706389546394348, 'learning_rate': 3.0500937213017828e-05, 'epoch': 1.17}
{'eval_loss': 0.6024275422096252, 'eval_runtime': 27.4866, 'eval_samples_per_second': 39.983, 'eval_steps_per_second': 10.005, 'epoch': 1.17}

 40%|███▉      | 31500/79491 [3:41:45<4:51:56,  2.74it/s]
{'loss': 0.627, 'grad_norm': 0.7186428308486938, 'learning_rate': 3.0186436200324564e-05, 'epoch': 1.19}
{'eval_loss': 0.6017741560935974, 'eval_runtime': 27.5192, 'eval_samples_per_second': 39.936, 'eval_steps_per_second': 9.993, 'epoch': 1.19}

 40%|████      | 32000/79491 [3:45:17<4:51:21,  2.72it/s]
{'loss': 0.6241, 'grad_norm': 0.754653811454773, 'learning_rate': 2.9871935187631306e-05, 'epoch': 1.21}
{'eval_loss': 0.6011863350868225, 'eval_runtime': 27.5105, 'eval_samples_per_second': 39.948, 'eval_steps_per_second': 9.996, 'epoch': 1.21}

 41%|████      | 32500/79491 [3:48:49<4:45:05,  2.75it/s]
{'loss': 0.6156, 'grad_norm': 0.5729495286941528, 'learning_rate': 2.9557434174938042e-05, 'epoch': 1.23}
{'eval_loss': 0.600623607635498, 'eval_runtime': 27.5339, 'eval_samples_per_second': 39.914, 'eval_steps_per_second': 9.988, 'epoch': 1.23}


 42%|████▏     | 33000/79491 [3:52:20<4:43:07,  2.74it/s]  

{'loss': 0.6204, 'grad_norm': 0.7587493658065796, 'learning_rate': 2.9242933162244785e-05, 'epoch': 1.25}


                                                         
 42%|████▏     | 33000/79491 [3:52:48<4:43:07,  2.74it/s]

{'eval_loss': 0.600182831287384, 'eval_runtime': 27.5525, 'eval_samples_per_second': 39.887, 'eval_steps_per_second': 9.981, 'epoch': 1.25}


 42%|████▏     | 33500/79491 [3:55:52<4:38:27,  2.75it/s]  

{'loss': 0.6094, 'grad_norm': 0.7315734624862671, 'learning_rate': 2.892843214955152e-05, 'epoch': 1.26}


                                                         
 42%|████▏     | 33500/79491 [3:56:20<4:38:27,  2.75it/s]

{'eval_loss': 0.6007843017578125, 'eval_runtime': 27.5632, 'eval_samples_per_second': 39.872, 'eval_steps_per_second': 9.977, 'epoch': 1.26}


 43%|████▎     | 34000/79491 [3:59:24<4:34:57,  2.76it/s]  

{'loss': 0.6172, 'grad_norm': 0.5776650905609131, 'learning_rate': 2.8613931136858263e-05, 'epoch': 1.28}


                                                         
 43%|████▎     | 34000/79491 [3:59:51<4:34:57,  2.76it/s]

{'eval_loss': 0.5993344187736511, 'eval_runtime': 27.527, 'eval_samples_per_second': 39.924, 'eval_steps_per_second': 9.99, 'epoch': 1.28}


 43%|████▎     | 34500/79491 [4:02:56<4:32:50,  2.75it/s]  

{'loss': 0.6162, 'grad_norm': 0.6426888704299927, 'learning_rate': 2.8299430124165e-05, 'epoch': 1.3}


                                                         
 43%|████▎     | 34500/79491 [4:03:23<4:32:50,  2.75it/s]

{'eval_loss': 0.5986365675926208, 'eval_runtime': 27.4855, 'eval_samples_per_second': 39.985, 'eval_steps_per_second': 10.005, 'epoch': 1.3}


 44%|████▍     | 35000/79491 [4:06:27<4:29:05,  2.76it/s]  

{'loss': 0.6212, 'grad_norm': 0.8923363089561462, 'learning_rate': 2.798492911147174e-05, 'epoch': 1.32}


                                                         
 44%|████▍     | 35000/79491 [4:06:55<4:29:05,  2.76it/s]

{'eval_loss': 0.5985850691795349, 'eval_runtime': 27.5161, 'eval_samples_per_second': 39.94, 'eval_steps_per_second': 9.994, 'epoch': 1.32}


 45%|████▍     | 35500/79491 [4:09:59<4:29:59,  2.72it/s]  

{'loss': 0.612, 'grad_norm': 0.8470678925514221, 'learning_rate': 2.7670428098778477e-05, 'epoch': 1.34}


                                                         
 45%|████▍     | 35500/79491 [4:10:27<4:29:59,  2.72it/s]

{'eval_loss': 0.5976590514183044, 'eval_runtime': 27.5162, 'eval_samples_per_second': 39.94, 'eval_steps_per_second': 9.994, 'epoch': 1.34}


 45%|████▌     | 36000/79491 [4:13:31<4:24:29,  2.74it/s]  

{'loss': 0.6139, 'grad_norm': 0.7511613965034485, 'learning_rate': 2.735592708608522e-05, 'epoch': 1.36}


                                                         
 45%|████▌     | 36000/79491 [4:13:59<4:24:29,  2.74it/s]

{'eval_loss': 0.5978723168373108, 'eval_runtime': 27.5955, 'eval_samples_per_second': 39.825, 'eval_steps_per_second': 9.965, 'epoch': 1.36}


 46%|████▌     | 36500/79491 [4:17:03<4:19:53,  2.76it/s]  

{'loss': 0.6149, 'grad_norm': 0.828278124332428, 'learning_rate': 2.7041426073391955e-05, 'epoch': 1.38}


                                                         
 46%|████▌     | 36500/79491 [4:17:31<4:19:53,  2.76it/s]

{'eval_loss': 0.5963183641433716, 'eval_runtime': 27.5654, 'eval_samples_per_second': 39.869, 'eval_steps_per_second': 9.976, 'epoch': 1.38}


 47%|████▋     | 37000/79491 [4:20:35<4:18:14,  2.74it/s]  

{'loss': 0.6101, 'grad_norm': 0.5643382668495178, 'learning_rate': 2.6726925060698698e-05, 'epoch': 1.4}


                                                         
 47%|████▋     | 37000/79491 [4:21:03<4:18:14,  2.74it/s]

{'eval_loss': 0.5968629717826843, 'eval_runtime': 27.5107, 'eval_samples_per_second': 39.948, 'eval_steps_per_second': 9.996, 'epoch': 1.4}


 47%|████▋     | 37500/79491 [4:24:07<4:14:57,  2.75it/s]  

{'loss': 0.6182, 'grad_norm': 0.6713138222694397, 'learning_rate': 2.6412424048005437e-05, 'epoch': 1.42}


                                                         
 47%|████▋     | 37500/79491 [4:24:35<4:14:57,  2.75it/s]

{'eval_loss': 0.5958506464958191, 'eval_runtime': 27.5223, 'eval_samples_per_second': 39.931, 'eval_steps_per_second': 9.992, 'epoch': 1.42}


 48%|████▊     | 38000/79491 [4:27:39<4:13:34,  2.73it/s]  

{'loss': 0.6128, 'grad_norm': 0.7566153407096863, 'learning_rate': 2.6097923035312176e-05, 'epoch': 1.43}


                                                         
 48%|████▊     | 38000/79491 [4:28:07<4:13:34,  2.73it/s]

{'eval_loss': 0.595589816570282, 'eval_runtime': 27.4893, 'eval_samples_per_second': 39.979, 'eval_steps_per_second': 10.004, 'epoch': 1.43}


 48%|████▊     | 38500/79491 [4:31:11<4:08:56,  2.74it/s]  

{'loss': 0.6153, 'grad_norm': 0.7034271955490112, 'learning_rate': 2.5783422022618915e-05, 'epoch': 1.45}


                                                         
 48%|████▊     | 38500/79491 [4:31:39<4:08:56,  2.74it/s]

{'eval_loss': 0.5946606993675232, 'eval_runtime': 27.5514, 'eval_samples_per_second': 39.889, 'eval_steps_per_second': 9.981, 'epoch': 1.45}


 49%|████▉     | 39000/79491 [4:34:43<4:06:18,  2.74it/s]  

{'loss': 0.6179, 'grad_norm': 0.5959654450416565, 'learning_rate': 2.5468921009925654e-05, 'epoch': 1.47}


                                                         
 49%|████▉     | 39000/79491 [4:35:11<4:06:18,  2.74it/s]

{'eval_loss': 0.5942524075508118, 'eval_runtime': 27.4575, 'eval_samples_per_second': 40.026, 'eval_steps_per_second': 10.015, 'epoch': 1.47}


 50%|████▉     | 39500/79491 [4:38:15<4:01:31,  2.76it/s]  

{'loss': 0.6109, 'grad_norm': 0.60798180103302, 'learning_rate': 2.5154419997232393e-05, 'epoch': 1.49}


                                                         
 50%|████▉     | 39500/79491 [4:38:43<4:01:31,  2.76it/s]

{'eval_loss': 0.5938854813575745, 'eval_runtime': 27.5242, 'eval_samples_per_second': 39.929, 'eval_steps_per_second': 9.991, 'epoch': 1.49}


 50%|█████     | 40000/79491 [4:41:47<4:01:08,  2.73it/s]  

{'loss': 0.6183, 'grad_norm': 0.7595256567001343, 'learning_rate': 2.4839918984539132e-05, 'epoch': 1.51}


                                                         
 50%|█████     | 40000/79491 [4:42:15<4:01:08,  2.73it/s]

{'eval_loss': 0.5935652256011963, 'eval_runtime': 27.5108, 'eval_samples_per_second': 39.948, 'eval_steps_per_second': 9.996, 'epoch': 1.51}


 51%|█████     | 40500/79491 [4:45:21<4:03:31,  2.67it/s]  

{'loss': 0.6093, 'grad_norm': 0.7164648175239563, 'learning_rate': 2.452541797184587e-05, 'epoch': 1.53}


                                                         
 51%|█████     | 40500/79491 [4:45:49<4:03:31,  2.67it/s]

{'eval_loss': 0.5925575494766235, 'eval_runtime': 28.0924, 'eval_samples_per_second': 39.121, 'eval_steps_per_second': 9.789, 'epoch': 1.53}


 52%|█████▏    | 41000/79491 [4:48:55<3:52:26,  2.76it/s]  

{'loss': 0.6149, 'grad_norm': 0.5769319534301758, 'learning_rate': 2.421091695915261e-05, 'epoch': 1.55}


                                                         
 52%|█████▏    | 41000/79491 [4:49:22<3:52:26,  2.76it/s]

{'eval_loss': 0.592928946018219, 'eval_runtime': 27.3383, 'eval_samples_per_second': 40.2, 'eval_steps_per_second': 10.059, 'epoch': 1.55}


 52%|█████▏    | 41500/79491 [4:52:25<3:48:15,  2.77it/s] 

{'loss': 0.6126, 'grad_norm': 0.6081030368804932, 'learning_rate': 2.389641594645935e-05, 'epoch': 1.57}


                                                         
 52%|█████▏    | 41500/79491 [4:52:53<3:48:15,  2.77it/s]

{'eval_loss': 0.5921077728271484, 'eval_runtime': 27.3094, 'eval_samples_per_second': 40.242, 'eval_steps_per_second': 10.07, 'epoch': 1.57}


 53%|█████▎    | 42000/79491 [4:55:55<3:45:32,  2.77it/s] 

{'loss': 0.6054, 'grad_norm': 0.6022106409072876, 'learning_rate': 2.358191493376609e-05, 'epoch': 1.59}


                                                         
 53%|█████▎    | 42000/79491 [4:56:22<3:45:32,  2.77it/s]

{'eval_loss': 0.5915566086769104, 'eval_runtime': 27.3008, 'eval_samples_per_second': 40.255, 'eval_steps_per_second': 10.073, 'epoch': 1.59}


 53%|█████▎    | 42500/79491 [4:59:25<3:42:35,  2.77it/s] 

{'loss': 0.6134, 'grad_norm': 0.5991289615631104, 'learning_rate': 2.3267413921072828e-05, 'epoch': 1.6}


                                                         
 53%|█████▎    | 42500/79491 [4:59:52<3:42:35,  2.77it/s]

{'eval_loss': 0.5916240215301514, 'eval_runtime': 27.3023, 'eval_samples_per_second': 40.253, 'eval_steps_per_second': 10.072, 'epoch': 1.6}


 54%|█████▍    | 43000/79491 [5:02:55<3:39:43,  2.77it/s] 

{'loss': 0.6071, 'grad_norm': 0.7706238627433777, 'learning_rate': 2.2952912908379567e-05, 'epoch': 1.62}


                                                         
 54%|█████▍    | 43000/79491 [5:03:22<3:39:43,  2.77it/s]

{'eval_loss': 0.5910969972610474, 'eval_runtime': 27.2845, 'eval_samples_per_second': 40.279, 'eval_steps_per_second': 10.079, 'epoch': 1.62}


 55%|█████▍    | 43500/79491 [5:06:24<3:36:38,  2.77it/s] 

{'loss': 0.6044, 'grad_norm': 0.6097461581230164, 'learning_rate': 2.2638411895686303e-05, 'epoch': 1.64}


                                                         
 55%|█████▍    | 43500/79491 [5:06:52<3:36:38,  2.77it/s]

{'eval_loss': 0.5899255871772766, 'eval_runtime': 27.3022, 'eval_samples_per_second': 40.253, 'eval_steps_per_second': 10.072, 'epoch': 1.64}


 55%|█████▌    | 44000/79491 [5:09:54<3:33:36,  2.77it/s] 

{'loss': 0.6095, 'grad_norm': 0.5952906608581543, 'learning_rate': 2.2323910882993042e-05, 'epoch': 1.66}


                                                         
 55%|█████▌    | 44000/79491 [5:10:21<3:33:36,  2.77it/s]

{'eval_loss': 0.5894774198532104, 'eval_runtime': 27.2916, 'eval_samples_per_second': 40.269, 'eval_steps_per_second': 10.076, 'epoch': 1.66}


 56%|█████▌    | 44500/79491 [5:13:24<3:30:43,  2.77it/s] 

{'loss': 0.6036, 'grad_norm': 0.6640282273292542, 'learning_rate': 2.200940987029978e-05, 'epoch': 1.68}


                                                         
 56%|█████▌    | 44500/79491 [5:13:51<3:30:43,  2.77it/s]

{'eval_loss': 0.5895929336547852, 'eval_runtime': 27.3055, 'eval_samples_per_second': 40.248, 'eval_steps_per_second': 10.071, 'epoch': 1.68}


 57%|█████▋    | 45000/79491 [5:16:53<3:27:28,  2.77it/s] 

{'loss': 0.6132, 'grad_norm': 0.6004648208618164, 'learning_rate': 2.169490885760652e-05, 'epoch': 1.7}


                                                         
 57%|█████▋    | 45000/79491 [5:17:21<3:27:28,  2.77it/s]

{'eval_loss': 0.5888933539390564, 'eval_runtime': 27.5522, 'eval_samples_per_second': 39.888, 'eval_steps_per_second': 9.981, 'epoch': 1.7}


 57%|█████▋    | 45500/79491 [5:20:25<3:27:03,  2.74it/s] 

{'loss': 0.5987, 'grad_norm': 0.7889156341552734, 'learning_rate': 2.138040784491326e-05, 'epoch': 1.72}


                                                         
 57%|█████▋    | 45500/79491 [5:20:53<3:27:03,  2.74it/s]

{'eval_loss': 0.5886742472648621, 'eval_runtime': 27.8836, 'eval_samples_per_second': 39.414, 'eval_steps_per_second': 9.862, 'epoch': 1.72}


 58%|█████▊    | 46000/79491 [5:23:58<3:23:17,  2.75it/s] 

{'loss': 0.6176, 'grad_norm': 0.6018222570419312, 'learning_rate': 2.1065906832220002e-05, 'epoch': 1.74}


                                                         
 58%|█████▊    | 46000/79491 [5:24:26<3:23:17,  2.75it/s]

{'eval_loss': 0.587986171245575, 'eval_runtime': 27.8795, 'eval_samples_per_second': 39.42, 'eval_steps_per_second': 9.864, 'epoch': 1.74}


 58%|█████▊    | 46500/79491 [5:27:31<3:20:13,  2.75it/s] 

{'loss': 0.616, 'grad_norm': 0.6065133810043335, 'learning_rate': 2.075140581952674e-05, 'epoch': 1.75}


                                                         
 58%|█████▊    | 46500/79491 [5:27:58<3:20:13,  2.75it/s]

{'eval_loss': 0.5875309705734253, 'eval_runtime': 27.8901, 'eval_samples_per_second': 39.405, 'eval_steps_per_second': 9.86, 'epoch': 1.75}


 59%|█████▉    | 47000/79491 [5:31:03<3:16:53,  2.75it/s] 

{'loss': 0.6095, 'grad_norm': 0.5345597267150879, 'learning_rate': 2.043690480683348e-05, 'epoch': 1.77}


                                                         
 59%|█████▉    | 47000/79491 [5:31:31<3:16:53,  2.75it/s]

{'eval_loss': 0.5878124237060547, 'eval_runtime': 27.8792, 'eval_samples_per_second': 39.42, 'eval_steps_per_second': 9.864, 'epoch': 1.77}


 60%|█████▉    | 47500/79491 [5:34:35<3:12:33,  2.77it/s] 

{'loss': 0.6117, 'grad_norm': 0.8767426609992981, 'learning_rate': 2.012240379414022e-05, 'epoch': 1.79}


                                                         
 60%|█████▉    | 47500/79491 [5:35:02<3:12:33,  2.77it/s]

{'eval_loss': 0.5871134996414185, 'eval_runtime': 27.3348, 'eval_samples_per_second': 40.205, 'eval_steps_per_second': 10.06, 'epoch': 1.79}


 60%|██████    | 48000/79491 [5:38:06<3:14:54,  2.69it/s] 

{'loss': 0.6161, 'grad_norm': 0.8823980689048767, 'learning_rate': 1.980790278144696e-05, 'epoch': 1.81}


                                                         
 60%|██████    | 48000/79491 [5:38:34<3:14:54,  2.69it/s]

{'eval_loss': 0.5867051482200623, 'eval_runtime': 27.8867, 'eval_samples_per_second': 39.409, 'eval_steps_per_second': 9.861, 'epoch': 1.81}


 61%|██████    | 48500/79491 [5:41:38<3:06:02,  2.78it/s] 

{'loss': 0.6107, 'grad_norm': 0.5764450430870056, 'learning_rate': 1.9493401768753698e-05, 'epoch': 1.83}


                                                         
 61%|██████    | 48500/79491 [5:42:06<3:06:02,  2.78it/s]

{'eval_loss': 0.586264967918396, 'eval_runtime': 27.3374, 'eval_samples_per_second': 40.201, 'eval_steps_per_second': 10.059, 'epoch': 1.83}


 62%|██████▏   | 49000/79491 [5:45:10<3:08:50,  2.69it/s] 

{'loss': 0.6022, 'grad_norm': 0.5715750455856323, 'learning_rate': 1.9178900756060437e-05, 'epoch': 1.85}


                                                         
 62%|██████▏   | 49000/79491 [5:45:38<3:08:50,  2.69it/s]

{'eval_loss': 0.5856779217720032, 'eval_runtime': 28.2173, 'eval_samples_per_second': 38.948, 'eval_steps_per_second': 9.746, 'epoch': 1.85}


 62%|██████▏   | 49500/79491 [5:48:44<3:03:40,  2.72it/s] 

{'loss': 0.6061, 'grad_norm': 0.8328660726547241, 'learning_rate': 1.8864399743367176e-05, 'epoch': 1.87}


                                                         
 62%|██████▏   | 49500/79491 [5:49:13<3:03:40,  2.72it/s]

{'eval_loss': 0.5853335857391357, 'eval_runtime': 28.8048, 'eval_samples_per_second': 38.153, 'eval_steps_per_second': 9.547, 'epoch': 1.87}


 63%|██████▎   | 50000/79491 [5:52:19<2:58:57,  2.75it/s] 

{'loss': 0.6063, 'grad_norm': 0.6875287294387817, 'learning_rate': 1.8549898730673915e-05, 'epoch': 1.89}


                                                         
 63%|██████▎   | 50000/79491 [5:52:47<2:58:57,  2.75it/s]

{'eval_loss': 0.585791289806366, 'eval_runtime': 27.4032, 'eval_samples_per_second': 40.105, 'eval_steps_per_second': 10.035, 'epoch': 1.89}


 64%|██████▎   | 50500/79491 [5:55:55<2:59:44,  2.69it/s] 

{'loss': 0.6121, 'grad_norm': 0.7554181218147278, 'learning_rate': 1.8235397717980654e-05, 'epoch': 1.91}


                                                         
 64%|██████▎   | 50500/79491 [5:56:24<2:59:44,  2.69it/s]

{'eval_loss': 0.5845136642456055, 'eval_runtime': 28.4795, 'eval_samples_per_second': 38.589, 'eval_steps_per_second': 9.656, 'epoch': 1.91}


 64%|██████▍   | 51000/79491 [5:59:32<2:56:02,  2.70it/s] 

{'loss': 0.6057, 'grad_norm': 0.5578505396842957, 'learning_rate': 1.792089670528739e-05, 'epoch': 1.92}


                                                         
 64%|██████▍   | 51000/79491 [6:00:00<2:56:02,  2.70it/s]

{'eval_loss': 0.5840162634849548, 'eval_runtime': 28.4041, 'eval_samples_per_second': 38.692, 'eval_steps_per_second': 9.682, 'epoch': 1.92}


 65%|██████▍   | 51500/79491 [6:03:06<2:48:38,  2.77it/s] 

{'loss': 0.6183, 'grad_norm': 0.5958075523376465, 'learning_rate': 1.760639569259413e-05, 'epoch': 1.94}


                                                         
 65%|██████▍   | 51500/79491 [6:03:34<2:48:38,  2.77it/s]

{'eval_loss': 0.5836584568023682, 'eval_runtime': 27.8437, 'eval_samples_per_second': 39.47, 'eval_steps_per_second': 9.877, 'epoch': 1.94}


 65%|██████▌   | 52000/79491 [6:06:41<2:48:52,  2.71it/s] 

{'loss': 0.5992, 'grad_norm': 0.813602089881897, 'learning_rate': 1.7291894679900868e-05, 'epoch': 1.96}


                                                         
 65%|██████▌   | 52000/79491 [6:07:10<2:48:52,  2.71it/s]

{'eval_loss': 0.5834474563598633, 'eval_runtime': 28.4103, 'eval_samples_per_second': 38.683, 'eval_steps_per_second': 9.68, 'epoch': 1.96}


 66%|██████▌   | 52500/79491 [6:10:17<2:47:47,  2.68it/s] 

{'loss': 0.5962, 'grad_norm': 0.6377272605895996, 'learning_rate': 1.6977393667207607e-05, 'epoch': 1.98}


                                                         
 66%|██████▌   | 52500/79491 [6:10:45<2:47:47,  2.68it/s]

{'eval_loss': 0.5837182998657227, 'eval_runtime': 28.1963, 'eval_samples_per_second': 38.977, 'eval_steps_per_second': 9.753, 'epoch': 1.98}


 67%|██████▋   | 53000/79491 [6:13:53<2:38:53,  2.78it/s] 

{'loss': 0.6075, 'grad_norm': 0.7350703477859497, 'learning_rate': 1.6662892654514346e-05, 'epoch': 2.0}


                                                         
 67%|██████▋   | 53000/79491 [6:14:21<2:38:53,  2.78it/s]

{'eval_loss': 0.5833860635757446, 'eval_runtime': 28.2792, 'eval_samples_per_second': 38.863, 'eval_steps_per_second': 9.724, 'epoch': 2.0}


 67%|██████▋   | 53500/79491 [6:17:26<2:34:46,  2.80it/s] 

{'loss': 0.5937, 'grad_norm': 0.7892270684242249, 'learning_rate': 1.6348391641821085e-05, 'epoch': 2.02}


                                                         
 67%|██████▋   | 53500/79491 [6:17:54<2:34:46,  2.80it/s]

{'eval_loss': 0.5838709473609924, 'eval_runtime': 27.6525, 'eval_samples_per_second': 39.743, 'eval_steps_per_second': 9.945, 'epoch': 2.02}


 68%|██████▊   | 54000/79491 [6:21:00<2:37:48,  2.69it/s] 

{'loss': 0.5888, 'grad_norm': 0.6508153080940247, 'learning_rate': 1.6033890629127825e-05, 'epoch': 2.04}


                                                         
 68%|██████▊   | 54000/79491 [6:21:28<2:37:48,  2.69it/s]

{'eval_loss': 0.5830591320991516, 'eval_runtime': 28.0974, 'eval_samples_per_second': 39.114, 'eval_steps_per_second': 9.787, 'epoch': 2.04}


 69%|██████▊   | 54500/79491 [6:24:35<2:37:42,  2.64it/s] 

{'loss': 0.5888, 'grad_norm': 0.7069897651672363, 'learning_rate': 1.5719389616434567e-05, 'epoch': 2.06}


                                                         
 69%|██████▊   | 54500/79491 [6:25:03<2:37:42,  2.64it/s]

{'eval_loss': 0.5833233594894409, 'eval_runtime': 28.1607, 'eval_samples_per_second': 39.026, 'eval_steps_per_second': 9.765, 'epoch': 2.06}


 69%|██████▉   | 55000/79491 [6:28:09<2:28:31,  2.75it/s] 

{'loss': 0.5848, 'grad_norm': 0.6200448274612427, 'learning_rate': 1.5404888603741306e-05, 'epoch': 2.08}


                                                         
 69%|██████▉   | 55000/79491 [6:28:38<2:28:31,  2.75it/s]

{'eval_loss': 0.5823865532875061, 'eval_runtime': 28.1204, 'eval_samples_per_second': 39.082, 'eval_steps_per_second': 9.779, 'epoch': 2.08}


 70%|██████▉   | 55500/79491 [6:31:44<2:27:26,  2.71it/s] 

{'loss': 0.5876, 'grad_norm': 0.8145111799240112, 'learning_rate': 1.5090387591048044e-05, 'epoch': 2.09}


                                                         
 70%|██████▉   | 55500/79491 [6:32:11<2:27:26,  2.71it/s]

{'eval_loss': 0.5826457142829895, 'eval_runtime': 27.7254, 'eval_samples_per_second': 39.639, 'eval_steps_per_second': 9.919, 'epoch': 2.09}


 70%|███████   | 56000/79491 [6:35:16<2:21:52,  2.76it/s] 

{'loss': 0.5939, 'grad_norm': 0.525015115737915, 'learning_rate': 1.4775886578354783e-05, 'epoch': 2.11}


                                                         
 70%|███████   | 56000/79491 [6:35:43<2:21:52,  2.76it/s]

{'eval_loss': 0.5820212960243225, 'eval_runtime': 27.5607, 'eval_samples_per_second': 39.876, 'eval_steps_per_second': 9.978, 'epoch': 2.11}


 71%|███████   | 56500/79491 [6:38:46<2:18:50,  2.76it/s] 

{'loss': 0.588, 'grad_norm': 0.630763590335846, 'learning_rate': 1.4461385565661522e-05, 'epoch': 2.13}


                                                         
 71%|███████   | 56500/79491 [6:39:14<2:18:50,  2.76it/s]

{'eval_loss': 0.5818555355072021, 'eval_runtime': 27.4988, 'eval_samples_per_second': 39.965, 'eval_steps_per_second': 10.0, 'epoch': 2.13}


 72%|███████▏  | 57000/79491 [6:42:16<2:15:16,  2.77it/s] 

{'loss': 0.5872, 'grad_norm': 0.570725679397583, 'learning_rate': 1.4146884552968263e-05, 'epoch': 2.15}


                                                         
 72%|███████▏  | 57000/79491 [6:42:44<2:15:16,  2.77it/s]

{'eval_loss': 0.5821970701217651, 'eval_runtime': 27.5596, 'eval_samples_per_second': 39.877, 'eval_steps_per_second': 9.978, 'epoch': 2.15}


 72%|███████▏  | 57500/79491 [6:45:47<2:12:30,  2.77it/s] 

{'loss': 0.594, 'grad_norm': 0.6017985343933105, 'learning_rate': 1.3832383540275002e-05, 'epoch': 2.17}


                                                         
 72%|███████▏  | 57500/79491 [6:46:14<2:12:30,  2.77it/s]

{'eval_loss': 0.5818544030189514, 'eval_runtime': 27.5262, 'eval_samples_per_second': 39.926, 'eval_steps_per_second': 9.99, 'epoch': 2.17}


 73%|███████▎  | 58000/79491 [6:49:17<2:09:18,  2.77it/s] 

{'loss': 0.5929, 'grad_norm': 0.6221532821655273, 'learning_rate': 1.3517882527581741e-05, 'epoch': 2.19}


                                                         
 73%|███████▎  | 58000/79491 [6:49:45<2:09:18,  2.77it/s]

{'eval_loss': 0.581585168838501, 'eval_runtime': 27.5383, 'eval_samples_per_second': 39.908, 'eval_steps_per_second': 9.986, 'epoch': 2.19}


 74%|███████▎  | 58500/79491 [6:52:47<2:05:57,  2.78it/s] 

{'loss': 0.5833, 'grad_norm': 0.6306270956993103, 'learning_rate': 1.320338151488848e-05, 'epoch': 2.21}


                                                         
 74%|███████▎  | 58500/79491 [6:53:15<2:05:57,  2.78it/s]

{'eval_loss': 0.5811446309089661, 'eval_runtime': 27.5349, 'eval_samples_per_second': 39.913, 'eval_steps_per_second': 9.987, 'epoch': 2.21}


 74%|███████▍  | 59000/79491 [6:56:18<2:03:08,  2.77it/s] 

{'loss': 0.5955, 'grad_norm': 0.7630110383033752, 'learning_rate': 1.2888880502195216e-05, 'epoch': 2.23}


                                                         
 74%|███████▍  | 59000/79491 [6:56:45<2:03:08,  2.77it/s]

{'eval_loss': 0.5811436176300049, 'eval_runtime': 27.5994, 'eval_samples_per_second': 39.82, 'eval_steps_per_second': 9.964, 'epoch': 2.23}


 75%|███████▍  | 59500/79491 [6:59:48<2:00:08,  2.77it/s] 

{'loss': 0.5795, 'grad_norm': 0.5841894149780273, 'learning_rate': 1.2574379489501955e-05, 'epoch': 2.25}


                                                         
 75%|███████▍  | 59500/79491 [7:00:16<2:00:08,  2.77it/s]

{'eval_loss': 0.5806331634521484, 'eval_runtime': 27.5307, 'eval_samples_per_second': 39.919, 'eval_steps_per_second': 9.989, 'epoch': 2.25}


 75%|███████▌  | 60000/79491 [7:03:19<1:57:45,  2.76it/s] 

{'loss': 0.592, 'grad_norm': 0.7432337403297424, 'learning_rate': 1.2259878476808696e-05, 'epoch': 2.26}


                                                         
 75%|███████▌  | 60000/79491 [7:03:46<1:57:45,  2.76it/s]

{'eval_loss': 0.5805425047874451, 'eval_runtime': 27.5455, 'eval_samples_per_second': 39.898, 'eval_steps_per_second': 9.983, 'epoch': 2.26}


 76%|███████▌  | 60500/79491 [7:06:52<2:03:47,  2.56it/s] 

{'loss': 0.585, 'grad_norm': 0.6748996376991272, 'learning_rate': 1.1945377464115435e-05, 'epoch': 2.28}


                                                         
 76%|███████▌  | 60500/79491 [7:07:22<2:03:47,  2.56it/s]

{'eval_loss': 0.5804219841957092, 'eval_runtime': 29.4654, 'eval_samples_per_second': 37.298, 'eval_steps_per_second': 9.333, 'epoch': 2.28}


 77%|███████▋  | 61000/79491 [7:10:33<1:52:25,  2.74it/s] 

{'loss': 0.5914, 'grad_norm': 0.6919480562210083, 'learning_rate': 1.1630876451422174e-05, 'epoch': 2.3}


                                                         
 77%|███████▋  | 61000/79491 [7:11:01<1:52:25,  2.74it/s]

{'eval_loss': 0.5799312591552734, 'eval_runtime': 28.2989, 'eval_samples_per_second': 38.835, 'eval_steps_per_second': 9.718, 'epoch': 2.3}


 77%|███████▋  | 61500/79491 [7:14:14<1:57:36,  2.55it/s] 

{'loss': 0.5913, 'grad_norm': 0.8473767042160034, 'learning_rate': 1.1316375438728913e-05, 'epoch': 2.32}


                                                         
 77%|███████▋  | 61500/79491 [7:14:43<1:57:36,  2.55it/s]

{'eval_loss': 0.5802159905433655, 'eval_runtime': 29.7491, 'eval_samples_per_second': 36.942, 'eval_steps_per_second': 9.244, 'epoch': 2.32}


 78%|███████▊  | 62000/79491 [7:17:58<1:47:32,  2.71it/s] 

{'loss': 0.5888, 'grad_norm': 0.5818822383880615, 'learning_rate': 1.1001874426035652e-05, 'epoch': 2.34}


                                                         
 78%|███████▊  | 62000/79491 [7:18:29<1:47:32,  2.71it/s]

{'eval_loss': 0.5793130397796631, 'eval_runtime': 30.3477, 'eval_samples_per_second': 36.214, 'eval_steps_per_second': 9.062, 'epoch': 2.34}


 79%|███████▊  | 62500/79491 [7:21:42<1:43:32,  2.73it/s] 

{'loss': 0.5928, 'grad_norm': 0.6450375318527222, 'learning_rate': 1.0687373413342391e-05, 'epoch': 2.36}


                                                         
 79%|███████▊  | 62500/79491 [7:22:10<1:43:32,  2.73it/s]

{'eval_loss': 0.579116940498352, 'eval_runtime': 28.5326, 'eval_samples_per_second': 38.517, 'eval_steps_per_second': 9.638, 'epoch': 2.36}


 79%|███████▉  | 63000/79491 [7:25:18<1:41:55,  2.70it/s] 

{'loss': 0.6001, 'grad_norm': 0.6082453727722168, 'learning_rate': 1.037287240064913e-05, 'epoch': 2.38}


                                                         
 79%|███████▉  | 63000/79491 [7:25:47<1:41:55,  2.70it/s]

{'eval_loss': 0.5791692733764648, 'eval_runtime': 28.3166, 'eval_samples_per_second': 38.811, 'eval_steps_per_second': 9.712, 'epoch': 2.38}


 80%|███████▉  | 63500/79491 [7:28:54<1:38:41,  2.70it/s] 

{'loss': 0.5768, 'grad_norm': 0.8261473178863525, 'learning_rate': 1.005837138795587e-05, 'epoch': 2.4}


                                                         
 80%|███████▉  | 63500/79491 [7:29:23<1:38:41,  2.70it/s]

{'eval_loss': 0.5790166854858398, 'eval_runtime': 28.3306, 'eval_samples_per_second': 38.792, 'eval_steps_per_second': 9.707, 'epoch': 2.4}


 81%|████████  | 64000/79491 [7:32:30<1:35:46,  2.70it/s] 

{'loss': 0.5899, 'grad_norm': 0.6818480491638184, 'learning_rate': 9.743870375262609e-06, 'epoch': 2.42}


                                                         
 81%|████████  | 64000/79491 [7:32:59<1:35:46,  2.70it/s]

{'eval_loss': 0.5781592726707458, 'eval_runtime': 28.3305, 'eval_samples_per_second': 38.792, 'eval_steps_per_second': 9.707, 'epoch': 2.42}


 81%|████████  | 64500/79491 [7:36:07<1:32:31,  2.70it/s] 

{'loss': 0.5869, 'grad_norm': 0.6237018704414368, 'learning_rate': 9.429369362569348e-06, 'epoch': 2.43}


                                                         
 81%|████████  | 64500/79491 [7:36:35<1:32:31,  2.70it/s]

{'eval_loss': 0.5780287384986877, 'eval_runtime': 28.3326, 'eval_samples_per_second': 38.789, 'eval_steps_per_second': 9.706, 'epoch': 2.43}


 82%|████████▏ | 65000/79491 [7:39:43<1:30:17,  2.67it/s] 

{'loss': 0.586, 'grad_norm': 0.6287600994110107, 'learning_rate': 9.114868349876087e-06, 'epoch': 2.45}


                                                         
 82%|████████▏ | 65000/79491 [7:40:11<1:30:17,  2.67it/s]

{'eval_loss': 0.5778172612190247, 'eval_runtime': 27.8421, 'eval_samples_per_second': 39.473, 'eval_steps_per_second': 9.877, 'epoch': 2.45}


 82%|████████▏ | 65500/79491 [7:43:14<1:23:59,  2.78it/s] 

{'loss': 0.592, 'grad_norm': 0.6483827233314514, 'learning_rate': 8.800367337182826e-06, 'epoch': 2.47}


                                                         
 82%|████████▏ | 65500/79491 [7:43:41<1:23:59,  2.78it/s]

{'eval_loss': 0.5776991248130798, 'eval_runtime': 27.3408, 'eval_samples_per_second': 40.196, 'eval_steps_per_second': 10.058, 'epoch': 2.47}


 83%|████████▎ | 66000/79491 [7:46:43<1:20:56,  2.78it/s] 

{'loss': 0.5916, 'grad_norm': 0.8808621764183044, 'learning_rate': 8.485866324489565e-06, 'epoch': 2.49}


                                                         
 83%|████████▎ | 66000/79491 [7:47:11<1:20:56,  2.78it/s]

{'eval_loss': 0.5777106881141663, 'eval_runtime': 27.3085, 'eval_samples_per_second': 40.244, 'eval_steps_per_second': 10.07, 'epoch': 2.49}


 84%|████████▎ | 66500/79491 [7:50:13<1:18:14,  2.77it/s] 

{'loss': 0.5924, 'grad_norm': 0.5999689698219299, 'learning_rate': 8.171365311796304e-06, 'epoch': 2.51}


                                                         
 84%|████████▎ | 66500/79491 [7:50:40<1:18:14,  2.77it/s]

{'eval_loss': 0.577665388584137, 'eval_runtime': 27.3175, 'eval_samples_per_second': 40.231, 'eval_steps_per_second': 10.067, 'epoch': 2.51}


 84%|████████▍ | 67000/79491 [7:53:43<1:15:15,  2.77it/s] 

{'loss': 0.5795, 'grad_norm': 0.5984557867050171, 'learning_rate': 7.856864299103044e-06, 'epoch': 2.53}


                                                         
 84%|████████▍ | 67000/79491 [7:54:10<1:15:15,  2.77it/s]

{'eval_loss': 0.5772587656974792, 'eval_runtime': 27.3189, 'eval_samples_per_second': 40.228, 'eval_steps_per_second': 10.066, 'epoch': 2.53}


 85%|████████▍ | 67500/79491 [7:57:12<1:12:35,  2.75it/s] 

{'loss': 0.5901, 'grad_norm': 0.8431810736656189, 'learning_rate': 7.542363286409783e-06, 'epoch': 2.55}


                                                         
 85%|████████▍ | 67500/79491 [7:57:40<1:12:35,  2.75it/s]

{'eval_loss': 0.5771465301513672, 'eval_runtime': 27.2933, 'eval_samples_per_second': 40.266, 'eval_steps_per_second': 10.076, 'epoch': 2.55}


 86%|████████▌ | 68000/79491 [8:00:42<1:08:59,  2.78it/s] 

{'loss': 0.5771, 'grad_norm': 0.6316896080970764, 'learning_rate': 7.227862273716522e-06, 'epoch': 2.57}


                                                         
 86%|████████▌ | 68000/79491 [8:01:09<1:08:59,  2.78it/s]

{'eval_loss': 0.576915979385376, 'eval_runtime': 27.3151, 'eval_samples_per_second': 40.234, 'eval_steps_per_second': 10.068, 'epoch': 2.57}


 86%|████████▌ | 68500/79491 [8:04:12<1:06:09,  2.77it/s] 

{'loss': 0.5842, 'grad_norm': 0.5890490412712097, 'learning_rate': 6.913361261023262e-06, 'epoch': 2.59}


                                                         
 86%|████████▌ | 68500/79491 [8:04:40<1:06:09,  2.77it/s]

{'eval_loss': 0.57649165391922, 'eval_runtime': 27.3071, 'eval_samples_per_second': 40.246, 'eval_steps_per_second': 10.071, 'epoch': 2.59}


 87%|████████▋ | 69000/79491 [8:07:42<1:03:01,  2.77it/s] 

{'loss': 0.5856, 'grad_norm': 0.7476420998573303, 'learning_rate': 6.598860248329999e-06, 'epoch': 2.6}


                                                         
 87%|████████▋ | 69000/79491 [8:08:09<1:03:01,  2.77it/s]

{'eval_loss': 0.5765967965126038, 'eval_runtime': 27.3259, 'eval_samples_per_second': 40.218, 'eval_steps_per_second': 10.064, 'epoch': 2.6}


 87%|████████▋ | 69500/79491 [8:11:12<1:00:12,  2.77it/s] 

{'loss': 0.5753, 'grad_norm': 0.8646785616874695, 'learning_rate': 6.284359235636738e-06, 'epoch': 2.62}


                                                         
 87%|████████▋ | 69500/79491 [8:11:39<1:00:12,  2.77it/s]

{'eval_loss': 0.5763429403305054, 'eval_runtime': 27.3111, 'eval_samples_per_second': 40.24, 'eval_steps_per_second': 10.069, 'epoch': 2.62}


 88%|████████▊ | 70000/79491 [8:14:41<57:07,  2.77it/s]   

{'loss': 0.5827, 'grad_norm': 0.6407005190849304, 'learning_rate': 5.969858222943478e-06, 'epoch': 2.64}


                                                       
 88%|████████▊ | 70000/79491 [8:15:09<57:07,  2.77it/s]

{'eval_loss': 0.5761524438858032, 'eval_runtime': 27.2916, 'eval_samples_per_second': 40.269, 'eval_steps_per_second': 10.076, 'epoch': 2.64}


 89%|████████▊ | 70500/79491 [8:18:11<54:01,  2.77it/s]   

{'loss': 0.5801, 'grad_norm': 0.5968071818351746, 'learning_rate': 5.6553572102502174e-06, 'epoch': 2.66}


                                                       
 89%|████████▊ | 70500/79491 [8:18:38<54:01,  2.77it/s]

{'eval_loss': 0.5760002136230469, 'eval_runtime': 27.303, 'eval_samples_per_second': 40.252, 'eval_steps_per_second': 10.072, 'epoch': 2.66}


 89%|████████▉ | 71000/79491 [8:21:40<51:03,  2.77it/s]   

{'loss': 0.5803, 'grad_norm': 0.7828541994094849, 'learning_rate': 5.3408561975569566e-06, 'epoch': 2.68}


                                                       
 89%|████████▉ | 71000/79491 [8:22:08<51:03,  2.77it/s]

{'eval_loss': 0.5759164094924927, 'eval_runtime': 27.2778, 'eval_samples_per_second': 40.289, 'eval_steps_per_second': 10.081, 'epoch': 2.68}


 90%|████████▉ | 71500/79491 [8:25:10<48:01,  2.77it/s]   

{'loss': 0.586, 'grad_norm': 0.7680946588516235, 'learning_rate': 5.026355184863696e-06, 'epoch': 2.7}


                                                       
 90%|████████▉ | 71500/79491 [8:25:37<48:01,  2.77it/s]

{'eval_loss': 0.5757426023483276, 'eval_runtime': 27.273, 'eval_samples_per_second': 40.296, 'eval_steps_per_second': 10.083, 'epoch': 2.7}


 91%|█████████ | 72000/79491 [8:28:39<45:08,  2.77it/s]   

{'loss': 0.5894, 'grad_norm': 0.6174433827400208, 'learning_rate': 4.711854172170435e-06, 'epoch': 2.72}


                                                       
 91%|█████████ | 72000/79491 [8:29:07<45:08,  2.77it/s]

{'eval_loss': 0.5754876732826233, 'eval_runtime': 27.2992, 'eval_samples_per_second': 40.258, 'eval_steps_per_second': 10.074, 'epoch': 2.72}


 91%|█████████ | 72500/79491 [8:32:09<42:11,  2.76it/s]   

{'loss': 0.5785, 'grad_norm': 0.6316460371017456, 'learning_rate': 4.397353159477174e-06, 'epoch': 2.74}


                                                       
 91%|█████████ | 72500/79491 [8:32:37<42:11,  2.76it/s]

{'eval_loss': 0.5752719044685364, 'eval_runtime': 27.296, 'eval_samples_per_second': 40.262, 'eval_steps_per_second': 10.075, 'epoch': 2.74}


 92%|█████████▏| 73000/79491 [8:35:39<39:02,  2.77it/s]   

{'loss': 0.5869, 'grad_norm': 0.6153004765510559, 'learning_rate': 4.082852146783913e-06, 'epoch': 2.76}


                                                       
 92%|█████████▏| 73000/79491 [8:36:06<39:02,  2.77it/s]

{'eval_loss': 0.5750085711479187, 'eval_runtime': 27.2846, 'eval_samples_per_second': 40.279, 'eval_steps_per_second': 10.079, 'epoch': 2.76}


 92%|█████████▏| 73500/79491 [8:39:08<36:03,  2.77it/s]   

{'loss': 0.5803, 'grad_norm': 0.616878092288971, 'learning_rate': 3.768351134090652e-06, 'epoch': 2.77}


                                                       
 92%|█████████▏| 73500/79491 [8:39:36<36:03,  2.77it/s]

{'eval_loss': 0.5749304294586182, 'eval_runtime': 27.3013, 'eval_samples_per_second': 40.255, 'eval_steps_per_second': 10.073, 'epoch': 2.77}


 93%|█████████▎| 74000/79491 [8:42:38<33:09,  2.76it/s]   

{'loss': 0.588, 'grad_norm': 0.6422815918922424, 'learning_rate': 3.453850121397391e-06, 'epoch': 2.79}


                                                       
 93%|█████████▎| 74000/79491 [8:43:05<33:09,  2.76it/s]

{'eval_loss': 0.5747304558753967, 'eval_runtime': 27.2927, 'eval_samples_per_second': 40.267, 'eval_steps_per_second': 10.076, 'epoch': 2.79}


 94%|█████████▎| 74500/79491 [8:46:08<29:59,  2.77it/s]   

{'loss': 0.5805, 'grad_norm': 0.6129563450813293, 'learning_rate': 3.13934910870413e-06, 'epoch': 2.81}


                                                       
 94%|█████████▎| 74500/79491 [8:46:35<29:59,  2.77it/s]

{'eval_loss': 0.574815571308136, 'eval_runtime': 27.2984, 'eval_samples_per_second': 40.259, 'eval_steps_per_second': 10.074, 'epoch': 2.81}


 94%|█████████▍| 75000/79491 [8:49:37<27:05,  2.76it/s]   

{'loss': 0.5769, 'grad_norm': 0.5498062372207642, 'learning_rate': 2.824848096010869e-06, 'epoch': 2.83}


                                                       
 94%|█████████▍| 75000/79491 [8:50:05<27:05,  2.76it/s]

{'eval_loss': 0.5744608044624329, 'eval_runtime': 27.2967, 'eval_samples_per_second': 40.261, 'eval_steps_per_second': 10.074, 'epoch': 2.83}


 95%|█████████▍| 75500/79491 [8:53:07<23:59,  2.77it/s]   

{'loss': 0.5778, 'grad_norm': 0.7151221632957458, 'learning_rate': 2.5103470833176083e-06, 'epoch': 2.85}


                                                       
 95%|█████████▍| 75500/79491 [8:53:34<23:59,  2.77it/s]

{'eval_loss': 0.5743979811668396, 'eval_runtime': 27.2768, 'eval_samples_per_second': 40.291, 'eval_steps_per_second': 10.082, 'epoch': 2.85}


 96%|█████████▌| 76000/79491 [8:56:36<20:58,  2.77it/s]   

{'loss': 0.5819, 'grad_norm': 0.7319996953010559, 'learning_rate': 2.1958460706243474e-06, 'epoch': 2.87}


                                                       
 96%|█████████▌| 76000/79491 [8:57:04<20:58,  2.77it/s]

{'eval_loss': 0.5743251442909241, 'eval_runtime': 27.2991, 'eval_samples_per_second': 40.258, 'eval_steps_per_second': 10.074, 'epoch': 2.87}


 96%|█████████▌| 76500/79491 [9:00:06<17:58,  2.77it/s]  

{'loss': 0.5777, 'grad_norm': 0.5554105639457703, 'learning_rate': 1.8813450579310867e-06, 'epoch': 2.89}


                                                       
 96%|█████████▌| 76500/79491 [9:00:33<17:58,  2.77it/s]

{'eval_loss': 0.5744186639785767, 'eval_runtime': 27.2528, 'eval_samples_per_second': 40.326, 'eval_steps_per_second': 10.091, 'epoch': 2.89}


 97%|█████████▋| 77000/79491 [9:03:36<15:00,  2.77it/s]  

{'loss': 0.5833, 'grad_norm': 0.7311345934867859, 'learning_rate': 1.5668440452378257e-06, 'epoch': 2.91}


                                                       
 97%|█████████▋| 77000/79491 [9:04:03<15:00,  2.77it/s]

{'eval_loss': 0.5742220282554626, 'eval_runtime': 27.2769, 'eval_samples_per_second': 40.291, 'eval_steps_per_second': 10.082, 'epoch': 2.91}


 97%|█████████▋| 77500/79491 [9:07:05<12:00,  2.76it/s]  

{'loss': 0.581, 'grad_norm': 0.5449205636978149, 'learning_rate': 1.2523430325445648e-06, 'epoch': 2.92}


                                                       
 97%|█████████▋| 77500/79491 [9:07:32<12:00,  2.76it/s]

{'eval_loss': 0.574213981628418, 'eval_runtime': 27.2809, 'eval_samples_per_second': 40.285, 'eval_steps_per_second': 10.08, 'epoch': 2.92}


 98%|█████████▊| 78000/79491 [9:10:35<08:56,  2.78it/s]  

{'loss': 0.5862, 'grad_norm': 0.7882190346717834, 'learning_rate': 9.37842019851304e-07, 'epoch': 2.94}


                                                       
 98%|█████████▊| 78000/79491 [9:11:02<08:56,  2.78it/s]

{'eval_loss': 0.5741375088691711, 'eval_runtime': 27.294, 'eval_samples_per_second': 40.265, 'eval_steps_per_second': 10.075, 'epoch': 2.94}


 99%|█████████▉| 78500/79491 [9:14:04<05:58,  2.77it/s]  

{'loss': 0.5895, 'grad_norm': 0.7271870374679565, 'learning_rate': 6.233410071580431e-07, 'epoch': 2.96}


                                                       
 99%|█████████▉| 78500/79491 [9:14:32<05:58,  2.77it/s]

{'eval_loss': 0.5740313529968262, 'eval_runtime': 27.2826, 'eval_samples_per_second': 40.282, 'eval_steps_per_second': 10.08, 'epoch': 2.96}


 99%|█████████▉| 79000/79491 [9:17:35<02:57,  2.77it/s]  

{'loss': 0.5808, 'grad_norm': 0.8756421208381653, 'learning_rate': 3.088399944647822e-07, 'epoch': 2.98}


                                                       
 99%|█████████▉| 79000/79491 [9:18:02<02:57,  2.77it/s]

{'eval_loss': 0.5740256905555725, 'eval_runtime': 27.2687, 'eval_samples_per_second': 40.303, 'eval_steps_per_second': 10.085, 'epoch': 2.98}


100%|██████████| 79491/79491 [9:21:01<00:00,  3.43it/s]
There were missing keys in the checkpoint model loaded: ['lm_head.weight'].
100%|██████████| 79491/79491 [9:21:03<00:00,  2.36it/s]

{'train_runtime': 33663.1059, 'train_samples_per_second': 9.445, 'train_steps_per_second': 2.361, 'train_loss': 0.6290795461238095, 'epoch': 3.0}





TrainOutput(global_step=79491, training_loss=0.6290795461238095, metrics={'train_runtime': 33663.1059, 'train_samples_per_second': 9.445, 'train_steps_per_second': 2.361, 'total_flos': 8.307910803456e+16, 'train_loss': 0.6290795461238095, 'epoch': 3.0})
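
The log above is long, so it can help to summarize it programmatically. The following is a minimal sketch, assuming the logged dictionaries are available as a list, which is what `trainer.state.log_history` provides after training. The helper name `best_eval_entry` is hypothetical, and the sample entries are copied from the last lines of the log.

```python
def best_eval_entry(log_history):
    """Return the logged entry with the lowest eval_loss, or None if there is none."""
    # Keep only evaluation entries; training entries carry 'loss' instead.
    eval_entries = [entry for entry in log_history if "eval_loss" in entry]
    if not eval_entries:
        return None
    return min(eval_entries, key=lambda entry: entry["eval_loss"])

# A few entries copied from the log above. In the notebook, pass
# trainer.state.log_history instead of this hand-written sample.
log_history = [
    {"loss": 0.5895, "epoch": 2.96},
    {"eval_loss": 0.5740313529968262, "epoch": 2.96},
    {"loss": 0.5808, "epoch": 2.98},
    {"eval_loss": 0.5740256905555725, "epoch": 2.98},
]

best = best_eval_entry(log_history)
print(f"Lowest eval_loss {best['eval_loss']:.4f} at epoch {best['epoch']}")
```

On the sample entries, the lowest validation loss is the last one logged (epoch 2.98), which suggests the model was still improving slightly when training ended.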

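After training, we evaluate the model on the validation set. The reported `eval_loss` (about 0.574) is identical to the value logged at step 79000 (epoch 2.98), which is consistent with the best checkpoint having been reloaded when training finished.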

In [10]:
# Evaluate the model on the validation set
results = trainer.evaluate()
print("Evaluation Results:", results)

100%|██████████| 275/275 [00:27<00:00,  9.96it/s]

Evaluation Results: {'eval_loss': 0.5740256905555725, 'eval_runtime': 27.6359, 'eval_samples_per_second': 39.767, 'eval_steps_per_second': 9.951, 'epoch': 3.0}
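
Cross-entropy loss is easier to interpret as perplexity. The snippet below is a small sketch that converts the reported `eval_loss` into perplexity, assuming the loss is the mean per-token cross-entropy in nats. If padded positions were included in the labels, the result would understate the perplexity on real text.

```python
import math

# Value copied from the evaluation output above.
eval_loss = 0.5740256905555725

# Perplexity is exp(mean cross-entropy), assuming the loss is in nats per token.
perplexity = math.exp(eval_loss)
print(f"Validation perplexity: {perplexity:.3f}")
```

In the notebook, `results["eval_loss"]` could be used in place of the hard-coded value.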





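Finally, we use the fine-tuned model to generate a short story from a prompt. The sample below reads as a simple children's story in the TinyStories style, though it contains some odd phrasing ("eat the taste") and stops mid-sentence once the `max_length=150` token limit is reached.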

In [11]:
from transformers import pipeline

# Create a text generation pipeline using the fine-tuned model
story_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Provide a prompt to generate a story
# Note: max_length counts the prompt tokens as well as the newly generated ones
prompt = "Once upon a time in a magical forest"
generated_story = story_generator(prompt, max_length=150, num_return_sequences=1)

# Display the generated story
print("Generated Story:")
print(generated_story[0]['generated_text'])

Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Generated Story:
Once upon a time in a magical forest, there was a bunny named Benny. Benny loved to hop and play all day in his green grass. One day, Benny noticed a big tree with lots of leaves. He hopped over to it and saw something very yummy! Benny wanted to eat the taste, so he started eating the yummy thing.

Suddenly, a big bird flew down and grabbed Benny. It tried to take Benny away, but Benny didn't want to give up. He hopped back to his burrow and watched the little bird eat the yummy thing. Benny couldn't believe his eyes!

Later that day, Benny realized that he should be more careful around the tree. He hopped around it a little bit and
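
Because generation stops at a fixed token budget, the story above ends in the middle of a sentence. One simple way to tidy such output is to drop the trailing fragment. The helper below, `trim_to_last_sentence`, is a hypothetical post-processing sketch and not part of the original notebook. It only looks for `.`, `!`, and `?`, so closing quotation marks after the final punctuation mark would also be trimmed.

```python
def trim_to_last_sentence(text: str) -> str:
    """Drop any trailing fragment after the last '.', '!' or '?'."""
    # Find the right-most sentence-ending punctuation mark.
    last_end = max(text.rfind(ch) for ch in ".!?")
    # If there is none, return the text unchanged rather than an empty string.
    if last_end == -1:
        return text
    return text[: last_end + 1]

# Shortened sample modeled on the generated story above.
sample = (
    "Benny couldn't believe his eyes!\n\n"
    "Later that day, he hopped around it a little bit and"
)
print(trim_to_last_sentence(sample))
```

In the notebook, the helper could be applied to `generated_story[0]['generated_text']` before printing.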
