# Training and Evaluation of GNNs and LLMs
In this notebook, we train the models on the [MovieLens dataset](https://movielens.org/), following the PyTorch Geometric tutorial on [Link Prediction](https://colab.research.google.com/drive/1xpzn1Nvai1ygd_P5Yambc_oe4VBPK_ZT?usp=sharing#scrollTo=vit8xKCiXAue).

First we import all of our dependencies.

The **GNNTrainer** manages and trains a GNN model. Its most important interfaces are
**the constructor**, which defines the GNN architecture and loads the pre-trained GNN model if one already exists on disk,
**the training method**, which starts training the GNN model, and
**the get_embedding methods**, which serve as the inference interface to the GNN model and return, for given user-movie node pairs, the corresponding embeddings in the dimension defined in the constructor.

**The MovieLensLoader** loads and manages the data sets. The most important tasks include **saving and (re)loading and transforming** the data sets.

**PromptEncoderOnlyClassifier** and **VanillaEncoderOnlyClassifier** manage a **prompt LLM** and a **vanilla LLM**, respectively. An EncoderOnlyClassifier (ClassifierBase) provides interfaces for training and testing an LLM.
The prompt and vanilla classifiers differ in their data collators. The data collators change the behavior of the models during training and testing and allow data points to be created at runtime. With the help of these collators, we **create non-existent edges on the fly**.
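
As a rough illustration of what creating non-existent edges on the fly can look like, the following sketch samples one negative (user, movie) pair per positive pair while a batch is assembled. It is not the notebook's collator: the function names, the `(user, movie, label)` tuple layout, and the one-negative-per-positive ratio are assumptions made for this example.

```python
import random


def sample_negative_edges(positive_edges, existing_edges, num_movies, rng, max_tries=100):
    """Draw one non-existent (user, movie) pair per positive edge (hypothetical helper)."""
    negatives = []
    for user_id, _ in positive_edges:
        # Retry until the sampled pair is not a known edge; give up after max_tries.
        for _ in range(max_tries):
            movie_id = rng.randrange(num_movies)
            if (user_id, movie_id) not in existing_edges:
                negatives.append((user_id, movie_id))
                break
    return negatives


def collate_with_negatives(positive_edges, existing_edges, num_movies, rng):
    """Build a labeled batch: label 1 for existing edges, label 0 for sampled ones."""
    negatives = sample_negative_edges(positive_edges, existing_edges, num_movies, rng)
    batch = [(u, m, 1) for u, m in positive_edges] + [(u, m, 0) for u, m in negatives]
    rng.shuffle(batch)
    return batch


existing = {(0, 0), (0, 1), (1, 2)}
batch = collate_with_negatives([(0, 0), (1, 2)], existing, num_movies=5, rng=random.Random(0))
print(batch)
```

Because the negatives are drawn at batch time, each epoch can see different non-existent edges without storing them in the dataset.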

In [1]:
from gnn import GNNTrainer
from movie_lens_loader import MovieLensLoader
from llm import PromptBertClassifier, VanillaBertClassifier, AddingEmbeddingsBertClassifierBase

from transformers import AutoConfig

We define in advance the **knowledge graph embedding dimension (KGE_DIMENSION)** of the GNN encoder. We want to determine the output dimension from which the GNN encoder's embeddings lead to a significant increase in performance *without exceeding the context length of the LLMs*. In the original tutorial, the KGE_DIMENSION was $64$.
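
The trade-off can be made concrete with simple arithmetic. The sketch below assumes that both the user and the movie embedding are written into the prompt as text and that each value costs a fixed number of tokens; `tokens_per_value`, `text_tokens`, and the 512-token context length are illustrative assumptions, not values measured in this notebook.

```python
def kge_token_budget(kge_dimension, tokens_per_value, text_tokens, context_length):
    """Estimate the prompt length when two KGEs (user and movie) are appended as text."""
    kge_tokens = 2 * kge_dimension * tokens_per_value
    total_tokens = text_tokens + kge_tokens
    return total_tokens, total_tokens <= context_length


for dim in (4, 64, 128):
    # All numeric arguments below are assumed values for illustration only.
    total, fits = kge_token_budget(dim, tokens_per_value=6, text_tokens=40, context_length=512)
    print(f"dimension={dim}: about {total} tokens, fits={fits}")
```

Under these assumptions the prompt length grows linearly with the embedding dimension, which is why small dimensions are of interest for the prompt-based classifier.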

In [2]:
MODEL_NAME = "google/bert_uncased_L-2_H-128_A-2"
MODEL_HIDDEN_SIZE = AutoConfig.from_pretrained(MODEL_NAME).hidden_size
KGE_DIMENSIONS = [4] # Output Dimension of the GNN Encoder.
EPOCHS = 20
BATCH_SIZE = 256

First we load the MovieLensLoader, which downloads the MovieLens dataset (https://files.grouplens.org/datasets/movielens/ml-latest-small.zip) and prepares it for use with the GNN and the LLM. We also pass the embedding dimensions we intend to train with. The first run takes approx. 30 seconds.

In [3]:

movie_lens_loader = MovieLensLoader(kge_dimensions=KGE_DIMENSIONS)

In [4]:
movie_lens_loader.llm_df.head()

Unnamed: 0,mappedUserId,mappedMovieId,title,genres,prompt,split,user_embedding_4,movie_embedding_4,user_embedding_128,movie_embedding_128
0,0,0,Toy Story (1995),"['Adventure', 'Animation', 'Children', 'Comedy...","user: 0, title: Toy Story (1995), genres: ['Ad...",train,"[-0.03827861696481705, -2.2619593143463135, 2....","[-0.550093412399292, -0.8870471119880676, 0.99...","[-0.20877254009246826, 0.7261011004447937, -0....","[-0.23873792588710785, 0.15261110663414001, -0..."
1,0,2,Grumpier Old Men (1995),"['Comedy', 'Romance']","user: 0, title: Grumpier Old Men (1995), genre...",rest,"[0.33092743158340454, -2.609788417816162, 2.80...","[-0.14393314719200134, -1.0467438697814941, -0...","[-0.3288712799549103, 0.7533886432647705, -0.2...","[-0.601038932800293, -0.014170907437801361, -0..."
2,0,5,Heat (1995),"['Action', 'Crime', 'Thriller']","user: 0, title: Heat (1995), genres: ['Action'...",val,"[0.26843586564064026, -2.6071953773498535, 2.8...","[-0.5415042638778687, -0.8318737745285034, 0.6...","[-0.26235586404800415, 0.8743588328361511, -0....","[-0.00794781744480133, -0.03061569482088089, 0..."
3,0,43,Seven (a.k.a. Se7en) (1995),"['Mystery', 'Thriller']","user: 0, title: Seven (a.k.a. Se7en) (1995), g...",train,"[0.18218860030174255, -2.3380637168884277, 2.7...","[-0.9858976602554321, -0.801932692527771, 0.80...","[-0.23487024009227753, 0.6445444226264954, -0....","[0.23825839161872864, 0.006370842456817627, -0..."
4,0,46,"Usual Suspects, The (1995)","['Crime', 'Mystery', 'Thriller']","user: 0, title: Usual Suspects, The (1995), ge...",test,"[0.09569544345140457, -1.986099123954773, 3.03...","[-0.865667998790741, -0.9774402976036072, 0.52...","[-0.1536864936351776, 0.8239970803260803, -0.3...","[0.16465282440185547, 0.09803374111652374, -0...."
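
The visible part of the `prompt` column suggests the pattern `user: <id>, title: <title>, genres: <list>`. A minimal sketch of how such a prompt could be built is shown below; `build_prompt` is a hypothetical helper, and the exact format used by `MovieLensLoader` may differ in the truncated part of the column.

```python
def build_prompt(user_id, title, genres):
    """Format one user-movie pair as a text prompt (assumed format, hypothetical helper)."""
    return f"user: {user_id}, title: {title}, genres: {genres}"


print(build_prompt(0, "Heat (1995)", ["Action", "Crime", "Thriller"]))
```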


Next, we initialize the GNN trainers (on CUDA, if available), one for each KGE_DIMENSION.
A GNN trainer manages a model, and each model consists of an **encoder** and a **classifier** part.

**The encoder** is a parameterized *Graph Convolutional Network (GCN)* with a *2-layer GNN computation graph* and a single *ReLU* activation function in between.

**The classifier** applies the dot product between the source and destination KGEs to derive edge-level predictions.
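
The classifier step can be sketched without any framework. The example below scores a user-movie pair by the dot product of their embeddings; squashing the score with a sigmoid to obtain a probability is an assumption of this sketch (the notebook's classifier may apply it inside the loss instead), and the function names are hypothetical.

```python
import math


def dot_product_score(user_kge, movie_kge):
    """Raw edge score: dot product of the source and destination embeddings."""
    return sum(u * m for u, m in zip(user_kge, movie_kge))


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def edge_probability(user_kge, movie_kge):
    """Map the raw score to (0, 1); the sigmoid is an assumption of this sketch."""
    return sigmoid(dot_product_score(user_kge, movie_kge))


# Toy 4-dimensional embeddings, matching KGE_DIMENSIONS = [4]; values are made up.
user = [0.5, -1.0, 2.0, 0.0]
movie = [1.0, 0.5, 0.25, -3.0]
print(dot_product_score(user, movie), edge_probability(user, movie))
```

A dot-product classifier has no parameters of its own, so all learning happens in the encoder that produces the embeddings.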

In [5]:
gnn_trainers = [GNNTrainer(movie_lens_loader.data, kge_dimension=kge_dimension) for kge_dimension in KGE_DIMENSIONS]
gnn_trainer_large = GNNTrainer(movie_lens_loader.data, hidden_channels=MODEL_HIDDEN_SIZE, kge_dimension=MODEL_HIDDEN_SIZE)

loading pretrained model
Device: 'cuda'
loading pretrained model
Device: 'cuda'


We then train and validate the model on the link prediction task.

If the model is already trained, we can skip this part.
Training the models can take up to 5 minutes.

In [6]:
for gnn_trainer in gnn_trainers:
    print(gnn_trainer.kge_dimension)
    #gnn_trainer.train_model(movie_lens_loader.gnn_train_data, EPOCHS)
    #gnn_trainer.validate_model(movie_lens_loader.gnn_test_data)
#print("large_gnn")
#gnn_trainer_large.train_model(movie_lens_loader.gnn_train_data, EPOCHS)
#gnn_trainer_large.validate_model(movie_lens_loader.gnn_test_data)


4


Next we produce the KGEs for every edge in the dataset. These embeddings can then be used for the LLM on the link-prediction task.

In [7]:
for gnn_trainer in gnn_trainers:
    gnn_trainer.get_embeddings(movie_lens_loader)
gnn_trainer_large.get_embeddings(movie_lens_loader)
movie_lens_loader.llm_df.head()

Unnamed: 0,mappedUserId,mappedMovieId,title,genres,prompt,split,user_embedding_4,movie_embedding_4,user_embedding_128,movie_embedding_128
0,0,0,Toy Story (1995),"['Adventure', 'Animation', 'Children', 'Comedy...","user: 0, title: Toy Story (1995), genres: ['Ad...",train,"[-0.03827861696481705, -2.2619593143463135, 2....","[-0.550093412399292, -0.8870471119880676, 0.99...","[-0.20877254009246826, 0.7261011004447937, -0....","[-0.23873792588710785, 0.15261110663414001, -0..."
1,0,2,Grumpier Old Men (1995),"['Comedy', 'Romance']","user: 0, title: Grumpier Old Men (1995), genre...",rest,"[0.33092743158340454, -2.609788417816162, 2.80...","[-0.14393314719200134, -1.0467438697814941, -0...","[-0.3288712799549103, 0.7533886432647705, -0.2...","[-0.601038932800293, -0.014170907437801361, -0..."
2,0,5,Heat (1995),"['Action', 'Crime', 'Thriller']","user: 0, title: Heat (1995), genres: ['Action'...",val,"[0.26843586564064026, -2.6071953773498535, 2.8...","[-0.5415042638778687, -0.8318737745285034, 0.6...","[-0.26235586404800415, 0.8743588328361511, -0....","[-0.00794781744480133, -0.03061569482088089, 0..."
3,0,43,Seven (a.k.a. Se7en) (1995),"['Mystery', 'Thriller']","user: 0, title: Seven (a.k.a. Se7en) (1995), g...",train,"[0.18218860030174255, -2.3380637168884277, 2.7...","[-0.9858976602554321, -0.801932692527771, 0.80...","[-0.23487024009227753, 0.6445444226264954, -0....","[0.23825839161872864, 0.006370842456817627, -0..."
4,0,46,"Usual Suspects, The (1995)","['Crime', 'Mystery', 'Thriller']","user: 0, title: Usual Suspects, The (1995), ge...",test,"[0.09569544345140457, -1.986099123954773, 3.03...","[-0.865667998790741, -0.9774402976036072, 0.52...","[-0.1536864936351776, 0.8239970803260803, -0.3...","[0.16465282440185547, 0.09803374111652374, -0...."


Next we initialize the vanilla encoder-only classifier. This classifier uses only the natural-language part of the prompt (no KGE) to predict whether the given link exists.

In [8]:
vanilla_bert_classifier = VanillaBertClassifier(movie_lens_loader.llm_df, batch_size=BATCH_SIZE, model_name=MODEL_NAME)

Next we generate a vanilla LLM dataset and tokenize it for training.

In [9]:
dataset_vanilla = movie_lens_loader.generate_vanilla_dataset(vanilla_bert_classifier.tokenize_function)

Next we train the model on the generated dataset. This step can be skipped if the model has already been trained.

In [10]:
#vanilla_bert_classifier.train_model_on_data(dataset_vanilla, epochs=EPOCHS)

Next we initialize the prompt encoder-only classifiers, one per KGE_DIMENSION. Each classifier uses the vanilla prompt and the KGEs for its link prediction.

In [11]:
prompt_bert_classifiers = [PromptBertClassifier(movie_lens_loader, gnn_trainer.get_embedding, kge_dimension=embedding_dimension, batch_size=BATCH_SIZE, model_name=MODEL_NAME) for embedding_dimension, gnn_trainer in zip(KGE_DIMENSIONS, gnn_trainers)]

We also generate a prompt dataset; this time the prompts also include the KGEs.

In [12]:
datasets_prompt = [movie_lens_loader.generate_prompt_embedding_dataset(prompt_bert_classifier.tokenize_function, kge_dimension = prompt_bert_classifier.kge_dimension) for prompt_bert_classifier in prompt_bert_classifiers]

We also train the model. This step can be skipped if the model has already been trained.

In [13]:
#[prompt_bert_classifier.train_model_on_data(dataset_prompt, epochs = EPOCHS) for prompt_bert_classifier, dataset_prompt in zip(prompt_bert_classifiers, datasets_prompt)]

In [14]:
adding_embedding_bert_only_classifier = AddingEmbeddingsBertClassifierBase(movie_lens_loader, gnn_trainer_large.get_embedding, kge_dimension=MODEL_HIDDEN_SIZE, batch_size=BATCH_SIZE, model_name=MODEL_NAME)
dataset_adding_embedding = movie_lens_loader.generate_adding_embedding_dataset(adding_embedding_bert_only_classifier.tokenizer.sep_token, adding_embedding_bert_only_classifier.tokenizer.pad_token, adding_embedding_bert_only_classifier.tokenize_function, kge_dimension = MODEL_HIDDEN_SIZE)


Some weights of InsertEmbeddingBertForSequenceClassification were not initialized from the model checkpoint at google/bert_uncased_L-2_H-128_A-2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [15]:
adding_embedding_bert_only_classifier.train_model_on_data(dataset_adding_embedding, epochs = EPOCHS)

  0%|          | 0/4420 [00:00<?, ?it/s]



{'loss': 0.7771, 'grad_norm': 3.4389734268188477, 'learning_rate': 1.0000000000000002e-06, 'epoch': 0.05}
{'loss': 0.7699, 'grad_norm': 3.764988422393799, 'learning_rate': 2.0000000000000003e-06, 'epoch': 0.09}
{'loss': 0.7705, 'grad_norm': 3.20927357673645, 'learning_rate': 3e-06, 'epoch': 0.14}
{'loss': 0.7597, 'grad_norm': 3.2560675144195557, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.18}
{'loss': 0.7518, 'grad_norm': 2.8689229488372803, 'learning_rate': 5e-06, 'epoch': 0.23}
{'loss': 0.7356, 'grad_norm': 3.503141403198242, 'learning_rate': 6e-06, 'epoch': 0.27}
{'loss': 0.7217, 'grad_norm': 2.7570619583129883, 'learning_rate': 7.000000000000001e-06, 'epoch': 0.32}
{'loss': 0.7035, 'grad_norm': 1.7646822929382324, 'learning_rate': 8.000000000000001e-06, 'epoch': 0.36}
{'loss': 0.6869, 'grad_norm': 0.8705242276191711, 'learning_rate': 9e-06, 'epoch': 0.41}
{'loss': 0.6647, 'grad_norm': 1.5016316175460815, 'learning_rate': 1e-05, 'epoch': 0.45}
{'loss': 0.6443, 'grad_norm': 0.

  0%|          | 0/67 [00:00<?, ?it/s]

{'eval_loss': 0.587656557559967, 'eval_accuracy': 0.6646248979115622, 'eval_runtime': 157.3055, 'eval_samples_per_second': 108.973, 'eval_steps_per_second': 0.426, 'epoch': 1.0}
{'loss': 0.6019, 'grad_norm': 0.8464211225509644, 'learning_rate': 2.3000000000000003e-05, 'epoch': 1.04}
{'loss': 0.5844, 'grad_norm': 0.8668562173843384, 'learning_rate': 2.4e-05, 'epoch': 1.09}
{'loss': 0.5756, 'grad_norm': 1.1645541191101074, 'learning_rate': 2.5e-05, 'epoch': 1.13}
{'loss': 0.5514, 'grad_norm': 0.6494861245155334, 'learning_rate': 2.6000000000000002e-05, 'epoch': 1.18}
{'loss': 0.555, 'grad_norm': 0.900191068649292, 'learning_rate': 2.7000000000000002e-05, 'epoch': 1.22}
{'loss': 0.5468, 'grad_norm': 1.0752432346343994, 'learning_rate': 2.8000000000000003e-05, 'epoch': 1.27}
{'loss': 0.5371, 'grad_norm': 1.3139674663543701, 'learning_rate': 2.9e-05, 'epoch': 1.31}
{'loss': 0.5096, 'grad_norm': 1.4389201402664185, 'learning_rate': 3e-05, 'epoch': 1.36}
{'loss': 0.5249, 'grad_norm': 1.339485

  0%|          | 0/67 [00:00<?, ?it/s]

{'eval_loss': 0.39852476119995117, 'eval_accuracy': 0.8251079220627698, 'eval_runtime': 156.8076, 'eval_samples_per_second': 109.319, 'eval_steps_per_second': 0.427, 'epoch': 2.0}
{'loss': 0.3929, 'grad_norm': 1.1929914951324463, 'learning_rate': 4.5e-05, 'epoch': 2.04}
{'loss': 0.3949, 'grad_norm': 1.0564472675323486, 'learning_rate': 4.600000000000001e-05, 'epoch': 2.08}
{'loss': 0.4102, 'grad_norm': 1.180271029472351, 'learning_rate': 4.7e-05, 'epoch': 2.13}
{'loss': 0.4146, 'grad_norm': 1.3503707647323608, 'learning_rate': 4.8e-05, 'epoch': 2.17}
{'loss': 0.4125, 'grad_norm': 0.8526611924171448, 'learning_rate': 4.9e-05, 'epoch': 2.22}
{'loss': 0.3883, 'grad_norm': 1.1794458627700806, 'learning_rate': 5e-05, 'epoch': 2.26}
{'loss': 0.3823, 'grad_norm': 0.8128304481506348, 'learning_rate': 4.987244897959184e-05, 'epoch': 2.31}
{'loss': 0.4011, 'grad_norm': 0.7751808166503906, 'learning_rate': 4.974489795918368e-05, 'epoch': 2.35}
{'loss': 0.3689, 'grad_norm': 0.9094699621200562, 'le

  0%|          | 0/67 [00:00<?, ?it/s]

{'eval_loss': 0.34288468956947327, 'eval_accuracy': 0.8517092521292732, 'eval_runtime': 155.7217, 'eval_samples_per_second': 110.081, 'eval_steps_per_second': 0.43, 'epoch': 3.0}
{'loss': 0.3554, 'grad_norm': 0.8012714385986328, 'learning_rate': 4.783163265306123e-05, 'epoch': 3.03}
{'loss': 0.3558, 'grad_norm': 0.7976391911506653, 'learning_rate': 4.7704081632653066e-05, 'epoch': 3.08}
{'loss': 0.3718, 'grad_norm': 1.6673355102539062, 'learning_rate': 4.7576530612244904e-05, 'epoch': 3.12}
{'loss': 0.3707, 'grad_norm': 0.7889525890350342, 'learning_rate': 4.744897959183674e-05, 'epoch': 3.17}
{'loss': 0.3679, 'grad_norm': 1.1878535747528076, 'learning_rate': 4.732142857142857e-05, 'epoch': 3.21}
{'loss': 0.3675, 'grad_norm': 0.7542093992233276, 'learning_rate': 4.719387755102041e-05, 'epoch': 3.26}
{'loss': 0.3727, 'grad_norm': 1.6318435668945312, 'learning_rate': 4.706632653061225e-05, 'epoch': 3.3}
{'loss': 0.3503, 'grad_norm': 0.79146409034729, 'learning_rate': 4.6938775510204086e-

  0%|          | 0/67 [00:00<?, ?it/s]

{'eval_loss': 0.3149949908256531, 'eval_accuracy': 0.8656516159141291, 'eval_runtime': 155.1841, 'eval_samples_per_second': 110.462, 'eval_steps_per_second': 0.432, 'epoch': 4.0}
{'loss': 0.3367, 'grad_norm': 1.2894411087036133, 'learning_rate': 4.502551020408164e-05, 'epoch': 4.03}
{'loss': 0.3605, 'grad_norm': 1.003653645515442, 'learning_rate': 4.4897959183673474e-05, 'epoch': 4.07}
{'loss': 0.3366, 'grad_norm': 0.950372040271759, 'learning_rate': 4.477040816326531e-05, 'epoch': 4.12}
{'loss': 0.3446, 'grad_norm': 0.9905082583427429, 'learning_rate': 4.464285714285715e-05, 'epoch': 4.16}
{'loss': 0.3452, 'grad_norm': 1.411708116531372, 'learning_rate': 4.451530612244898e-05, 'epoch': 4.21}
{'loss': 0.3529, 'grad_norm': 1.0551918745040894, 'learning_rate': 4.438775510204082e-05, 'epoch': 4.25}
{'loss': 0.3609, 'grad_norm': 0.8467634320259094, 'learning_rate': 4.4260204081632656e-05, 'epoch': 4.3}
{'loss': 0.342, 'grad_norm': 0.8253487348556519, 'learning_rate': 4.4132653061224493e-05

  0%|          | 0/67 [00:00<?, ?it/s]

{'eval_loss': 0.3117503523826599, 'eval_accuracy': 0.8613347334033369, 'eval_runtime': 155.033, 'eval_samples_per_second': 110.57, 'eval_steps_per_second': 0.432, 'epoch': 5.0}
{'loss': 0.3484, 'grad_norm': 0.9919852614402771, 'learning_rate': 4.2219387755102045e-05, 'epoch': 5.02}
{'loss': 0.3305, 'grad_norm': 0.8359972834587097, 'learning_rate': 4.209183673469388e-05, 'epoch': 5.07}
{'loss': 0.3262, 'grad_norm': 0.916772186756134, 'learning_rate': 4.196428571428572e-05, 'epoch': 5.11}
{'loss': 0.3534, 'grad_norm': 0.7964627146720886, 'learning_rate': 4.183673469387756e-05, 'epoch': 5.16}
{'loss': 0.332, 'grad_norm': 0.850926399230957, 'learning_rate': 4.170918367346939e-05, 'epoch': 5.2}
{'loss': 0.3533, 'grad_norm': 0.7804059386253357, 'learning_rate': 4.1581632653061226e-05, 'epoch': 5.25}
{'loss': 0.3159, 'grad_norm': 0.8583928346633911, 'learning_rate': 4.1454081632653064e-05, 'epoch': 5.29}
{'loss': 0.3289, 'grad_norm': 2.086366891860962, 'learning_rate': 4.13265306122449e-05, '

  0%|          | 0/67 [00:00<?, ?it/s]

{'eval_loss': 0.29931893944740295, 'eval_accuracy': 0.8657099521642749, 'eval_runtime': 156.125, 'eval_samples_per_second': 109.797, 'eval_steps_per_second': 0.429, 'epoch': 6.0}
{'loss': 0.3385, 'grad_norm': 1.129901647567749, 'learning_rate': 3.9413265306122446e-05, 'epoch': 6.02}
{'loss': 0.3125, 'grad_norm': 0.8128340840339661, 'learning_rate': 3.928571428571429e-05, 'epoch': 6.06}
{'loss': 0.3174, 'grad_norm': 0.8235880732536316, 'learning_rate': 3.915816326530613e-05, 'epoch': 6.11}
{'loss': 0.2958, 'grad_norm': 0.9050197601318359, 'learning_rate': 3.9030612244897965e-05, 'epoch': 6.15}
{'loss': 0.3134, 'grad_norm': 0.8400742411613464, 'learning_rate': 3.8903061224489796e-05, 'epoch': 6.2}
{'loss': 0.3196, 'grad_norm': 1.1275615692138672, 'learning_rate': 3.8775510204081634e-05, 'epoch': 6.24}
{'loss': 0.3346, 'grad_norm': 1.0378447771072388, 'learning_rate': 3.864795918367347e-05, 'epoch': 6.29}
{'loss': 0.314, 'grad_norm': 0.9596097469329834, 'learning_rate': 3.852040816326531e

  0%|          | 0/67 [00:00<?, ?it/s]

{'eval_loss': 0.29066622257232666, 'eval_accuracy': 0.8750437521876093, 'eval_runtime': 154.9944, 'eval_samples_per_second': 110.598, 'eval_steps_per_second': 0.432, 'epoch': 7.0}
{'loss': 0.2968, 'grad_norm': 1.4202179908752441, 'learning_rate': 3.6607142857142853e-05, 'epoch': 7.01}
{'loss': 0.3297, 'grad_norm': 1.0436146259307861, 'learning_rate': 3.64795918367347e-05, 'epoch': 7.06}
{'loss': 0.3177, 'grad_norm': 1.0750023126602173, 'learning_rate': 3.6352040816326536e-05, 'epoch': 7.1}
{'loss': 0.3106, 'grad_norm': 0.8044393062591553, 'learning_rate': 3.622448979591837e-05, 'epoch': 7.15}
{'loss': 0.3176, 'grad_norm': 0.8793078660964966, 'learning_rate': 3.609693877551021e-05, 'epoch': 7.19}
{'loss': 0.3088, 'grad_norm': 0.9129120707511902, 'learning_rate': 3.596938775510204e-05, 'epoch': 7.24}
{'loss': 0.3109, 'grad_norm': 1.0727661848068237, 'learning_rate': 3.584183673469388e-05, 'epoch': 7.29}
{'loss': 0.3115, 'grad_norm': 0.8541715145111084, 'learning_rate': 3.571428571428572e

  0%|          | 0/67 [00:00<?, ?it/s]

{'eval_loss': 0.2809720039367676, 'eval_accuracy': 0.877727219694318, 'eval_runtime': 155.3397, 'eval_samples_per_second': 110.352, 'eval_steps_per_second': 0.431, 'epoch': 8.0}
{'loss': 0.3191, 'grad_norm': 0.8211286664009094, 'learning_rate': 3.380102040816326e-05, 'epoch': 8.01}
{'loss': 0.2931, 'grad_norm': 1.0676250457763672, 'learning_rate': 3.36734693877551e-05, 'epoch': 8.05}
{'loss': 0.3082, 'grad_norm': 0.8023940920829773, 'learning_rate': 3.354591836734694e-05, 'epoch': 8.1}
{'loss': 0.2998, 'grad_norm': 0.864509105682373, 'learning_rate': 3.341836734693878e-05, 'epoch': 8.14}
{'loss': 0.3039, 'grad_norm': 0.8879441618919373, 'learning_rate': 3.329081632653062e-05, 'epoch': 8.19}
{'loss': 0.3064, 'grad_norm': 0.9329259395599365, 'learning_rate': 3.316326530612245e-05, 'epoch': 8.24}
{'loss': 0.3114, 'grad_norm': 1.0114703178405762, 'learning_rate': 3.303571428571429e-05, 'epoch': 8.28}
{'loss': 0.2908, 'grad_norm': 1.1163395643234253, 'learning_rate': 3.2908163265306125e-05,

  0%|          | 0/67 [00:00<?, ?it/s]

{'eval_loss': 0.2766559422016144, 'eval_accuracy': 0.8847275697118189, 'eval_runtime': 156.1664, 'eval_samples_per_second': 109.768, 'eval_steps_per_second': 0.429, 'epoch': 9.0}
{'loss': 0.3071, 'grad_norm': 1.089714527130127, 'learning_rate': 3.0994897959183676e-05, 'epoch': 9.0}
{'loss': 0.3015, 'grad_norm': 0.9022876024246216, 'learning_rate': 3.086734693877551e-05, 'epoch': 9.05}
{'loss': 0.3119, 'grad_norm': 0.9541189670562744, 'learning_rate': 3.073979591836735e-05, 'epoch': 9.1}
{'loss': 0.2858, 'grad_norm': 0.8675541877746582, 'learning_rate': 3.061224489795919e-05, 'epoch': 9.14}
{'loss': 0.2986, 'grad_norm': 1.298288345336914, 'learning_rate': 3.0484693877551023e-05, 'epoch': 9.19}
{'loss': 0.2986, 'grad_norm': 1.2993860244750977, 'learning_rate': 3.0357142857142857e-05, 'epoch': 9.23}
{'loss': 0.282, 'grad_norm': 0.861420214176178, 'learning_rate': 3.0229591836734695e-05, 'epoch': 9.28}
{'loss': 0.3027, 'grad_norm': 1.1722073554992676, 'learning_rate': 3.0102040816326533e-0

  0%|          | 0/67 [00:00<?, ?it/s]

{'eval_loss': 0.2726163864135742, 'eval_accuracy': 0.8826858009567145, 'eval_runtime': 157.1139, 'eval_samples_per_second': 109.106, 'eval_steps_per_second': 0.426, 'epoch': 10.0}
{'loss': 0.2878, 'grad_norm': 1.012218713760376, 'learning_rate': 2.8061224489795918e-05, 'epoch': 10.05}
{'loss': 0.294, 'grad_norm': 1.0959364175796509, 'learning_rate': 2.7933673469387756e-05, 'epoch': 10.09}
{'loss': 0.2861, 'grad_norm': 1.3462098836898804, 'learning_rate': 2.7806122448979593e-05, 'epoch': 10.14}
{'loss': 0.3029, 'grad_norm': 2.371915102005005, 'learning_rate': 2.767857142857143e-05, 'epoch': 10.18}
{'loss': 0.2839, 'grad_norm': 1.5120748281478882, 'learning_rate': 2.7551020408163265e-05, 'epoch': 10.23}
{'loss': 0.2641, 'grad_norm': 1.1346203088760376, 'learning_rate': 2.7423469387755103e-05, 'epoch': 10.27}
{'loss': 0.2934, 'grad_norm': 0.9342988133430481, 'learning_rate': 2.729591836734694e-05, 'epoch': 10.32}
{'loss': 0.2967, 'grad_norm': 0.9687184691429138, 'learning_rate': 2.7168367

  0%|          | 0/67 [00:00<?, ?it/s]

{'eval_loss': 0.2699470520019531, 'eval_accuracy': 0.8819857659549644, 'eval_runtime': 159.9186, 'eval_samples_per_second': 107.192, 'eval_steps_per_second': 0.419, 'epoch': 11.0}
{'loss': 0.2767, 'grad_norm': 0.8424698114395142, 'learning_rate': 2.5255102040816326e-05, 'epoch': 11.04}
{'loss': 0.2908, 'grad_norm': 1.0856053829193115, 'learning_rate': 2.5127551020408164e-05, 'epoch': 11.09}
{'loss': 0.2818, 'grad_norm': 0.977380633354187, 'learning_rate': 2.5e-05, 'epoch': 11.13}
{'loss': 0.2918, 'grad_norm': 1.2663642168045044, 'learning_rate': 2.487244897959184e-05, 'epoch': 11.18}
{'loss': 0.2889, 'grad_norm': 0.8479412794113159, 'learning_rate': 2.4744897959183673e-05, 'epoch': 11.22}
{'loss': 0.2921, 'grad_norm': 0.8527776598930359, 'learning_rate': 2.461734693877551e-05, 'epoch': 11.27}
{'loss': 0.2919, 'grad_norm': 0.954310417175293, 'learning_rate': 2.448979591836735e-05, 'epoch': 11.31}
{'loss': 0.2862, 'grad_norm': 0.9729942679405212, 'learning_rate': 2.4362244897959186e-05, 

  0%|          | 0/67 [00:00<?, ?it/s]

{'eval_loss': 0.2679998576641083, 'eval_accuracy': 0.8836775172091937, 'eval_runtime': 161.1409, 'eval_samples_per_second': 106.379, 'eval_steps_per_second': 0.416, 'epoch': 12.0}
{'loss': 0.2916, 'grad_norm': 1.089388132095337, 'learning_rate': 2.2448979591836737e-05, 'epoch': 12.04}
{'loss': 0.2967, 'grad_norm': 1.1035181283950806, 'learning_rate': 2.2321428571428575e-05, 'epoch': 12.08}
{'loss': 0.2817, 'grad_norm': 0.8369129300117493, 'learning_rate': 2.219387755102041e-05, 'epoch': 12.13}
{'loss': 0.2955, 'grad_norm': 1.1285077333450317, 'learning_rate': 2.2066326530612247e-05, 'epoch': 12.17}
{'loss': 0.2623, 'grad_norm': 1.161742925643921, 'learning_rate': 2.193877551020408e-05, 'epoch': 12.22}
{'loss': 0.2834, 'grad_norm': 0.8510958552360535, 'learning_rate': 2.181122448979592e-05, 'epoch': 12.26}
{'loss': 0.2991, 'grad_norm': 1.020033597946167, 'learning_rate': 2.1683673469387756e-05, 'epoch': 12.31}
{'loss': 0.2893, 'grad_norm': 1.0266822576522827, 'learning_rate': 2.15561224

  0%|          | 0/67 [00:00<?, ?it/s]

{'eval_loss': 0.2704576849937439, 'eval_accuracy': 0.8827441372068603, 'eval_runtime': 159.7239, 'eval_samples_per_second': 107.323, 'eval_steps_per_second': 0.419, 'epoch': 13.0}
{'loss': 0.2853, 'grad_norm': 0.9642989039421082, 'learning_rate': 1.9642857142857145e-05, 'epoch': 13.03}
{'loss': 0.2915, 'grad_norm': 0.9962408542633057, 'learning_rate': 1.9515306122448983e-05, 'epoch': 13.08}
{'loss': 0.2961, 'grad_norm': 1.0513514280319214, 'learning_rate': 1.9387755102040817e-05, 'epoch': 13.12}
{'loss': 0.2915, 'grad_norm': 1.3349093198776245, 'learning_rate': 1.9260204081632655e-05, 'epoch': 13.17}
{'loss': 0.2979, 'grad_norm': 1.2407432794570923, 'learning_rate': 1.913265306122449e-05, 'epoch': 13.21}
{'loss': 0.2728, 'grad_norm': 1.0659154653549194, 'learning_rate': 1.9005102040816326e-05, 'epoch': 13.26}
{'loss': 0.2853, 'grad_norm': 1.374281883239746, 'learning_rate': 1.8877551020408164e-05, 'epoch': 13.3}
{'loss': 0.2848, 'grad_norm': 1.7093620300292969, 'learning_rate': 1.87500

  0%|          | 0/67 [00:00<?, ?it/s]

{'eval_loss': 0.25977084040641785, 'eval_accuracy': 0.8889277797223194, 'eval_runtime': 160.0289, 'eval_samples_per_second': 107.118, 'eval_steps_per_second': 0.419, 'epoch': 14.0}
{'loss': 0.2754, 'grad_norm': 0.9592587947845459, 'learning_rate': 1.683673469387755e-05, 'epoch': 14.03}
{'loss': 0.2916, 'grad_norm': 1.1504807472229004, 'learning_rate': 1.670918367346939e-05, 'epoch': 14.07}
{'loss': 0.2866, 'grad_norm': 0.9509300589561462, 'learning_rate': 1.6581632653061225e-05, 'epoch': 14.12}
{'loss': 0.2698, 'grad_norm': 1.054692268371582, 'learning_rate': 1.6454081632653062e-05, 'epoch': 14.16}
{'loss': 0.2721, 'grad_norm': 1.1029858589172363, 'learning_rate': 1.6326530612244897e-05, 'epoch': 14.21}
{'loss': 0.2836, 'grad_norm': 0.9609228372573853, 'learning_rate': 1.6198979591836734e-05, 'epoch': 14.25}
{'loss': 0.2948, 'grad_norm': 1.2036974430084229, 'learning_rate': 1.6071428571428572e-05, 'epoch': 14.3}
{'loss': 0.2925, 'grad_norm': 1.7518787384033203, 'learning_rate': 1.59438

  0%|          | 0/67 [00:00<?, ?it/s]

{'eval_loss': 0.26842913031578064, 'eval_accuracy': 0.8836775172091937, 'eval_runtime': 161.1449, 'eval_samples_per_second': 106.376, 'eval_steps_per_second': 0.416, 'epoch': 15.0}
{'loss': 0.2781, 'grad_norm': 1.0478496551513672, 'learning_rate': 1.4030612244897959e-05, 'epoch': 15.02}
{'loss': 0.28, 'grad_norm': 0.925595223903656, 'learning_rate': 1.3903061224489797e-05, 'epoch': 15.07}
{'loss': 0.2706, 'grad_norm': 0.8009387850761414, 'learning_rate': 1.3775510204081633e-05, 'epoch': 15.11}
{'loss': 0.2916, 'grad_norm': 0.9219182729721069, 'learning_rate': 1.364795918367347e-05, 'epoch': 15.16}
{'loss': 0.2859, 'grad_norm': 1.0717352628707886, 'learning_rate': 1.3520408163265308e-05, 'epoch': 15.2}
{'loss': 0.2887, 'grad_norm': 1.1216868162155151, 'learning_rate': 1.3392857142857144e-05, 'epoch': 15.25}
{'loss': 0.2759, 'grad_norm': 1.0146452188491821, 'learning_rate': 1.3265306122448982e-05, 'epoch': 15.29}
{'loss': 0.2677, 'grad_norm': 1.1930161714553833, 'learning_rate': 1.313775

  0%|          | 0/67 [00:00<?, ?it/s]

{'eval_loss': 0.2549313008785248, 'eval_accuracy': 0.8905028584762571, 'eval_runtime': 161.7393, 'eval_samples_per_second': 105.985, 'eval_steps_per_second': 0.414, 'epoch': 16.0}
{'loss': 0.2813, 'grad_norm': 0.8820334672927856, 'learning_rate': 1.1224489795918369e-05, 'epoch': 16.02}
{'loss': 0.2704, 'grad_norm': 0.8762072324752808, 'learning_rate': 1.1096938775510205e-05, 'epoch': 16.06}
{'loss': 0.2835, 'grad_norm': 0.8517970442771912, 'learning_rate': 1.096938775510204e-05, 'epoch': 16.11}
{'loss': 0.2907, 'grad_norm': 1.3361337184906006, 'learning_rate': 1.0841836734693878e-05, 'epoch': 16.15}
{'loss': 0.2872, 'grad_norm': 1.013611912727356, 'learning_rate': 1.0714285714285714e-05, 'epoch': 16.2}
{'loss': 0.2695, 'grad_norm': 1.4438892602920532, 'learning_rate': 1.0586734693877552e-05, 'epoch': 16.24}
{'loss': 0.2912, 'grad_norm': 0.9045618176460266, 'learning_rate': 1.045918367346939e-05, 'epoch': 16.29}
{'loss': 0.2716, 'grad_norm': 0.7900695204734802, 'learning_rate': 1.033163

  0%|          | 0/67 [00:00<?, ?it/s]

{'eval_loss': 0.25796571373939514, 'eval_accuracy': 0.8892777972231944, 'eval_runtime': 160.17, 'eval_samples_per_second': 107.024, 'eval_steps_per_second': 0.418, 'epoch': 17.0}
{'loss': 0.2589, 'grad_norm': 0.9772096276283264, 'learning_rate': 8.418367346938775e-06, 'epoch': 17.01}
{'loss': 0.2822, 'grad_norm': 1.4322434663772583, 'learning_rate': 8.290816326530612e-06, 'epoch': 17.06}
{'loss': 0.2768, 'grad_norm': 0.9762919545173645, 'learning_rate': 8.163265306122448e-06, 'epoch': 17.1}
{'loss': 0.2943, 'grad_norm': 1.1315116882324219, 'learning_rate': 8.035714285714286e-06, 'epoch': 17.15}
{'loss': 0.2785, 'grad_norm': 1.067330002784729, 'learning_rate': 7.908163265306124e-06, 'epoch': 17.19}
{'loss': 0.2917, 'grad_norm': 1.1795541048049927, 'learning_rate': 7.78061224489796e-06, 'epoch': 17.24}
{'loss': 0.2757, 'grad_norm': 1.0631250143051147, 'learning_rate': 7.653061224489797e-06, 'epoch': 17.29}
{'loss': 0.287, 'grad_norm': 1.241036057472229, 'learning_rate': 7.525510204081633

  0%|          | 0/67 [00:00<?, ?it/s]

{'eval_loss': 0.2539231479167938, 'eval_accuracy': 0.8931279897328199, 'eval_runtime': 160.3263, 'eval_samples_per_second': 106.919, 'eval_steps_per_second': 0.418, 'epoch': 18.0}
{'loss': 0.2778, 'grad_norm': 1.154860258102417, 'learning_rate': 5.612244897959184e-06, 'epoch': 18.01}
{'loss': 0.2791, 'grad_norm': 1.1577537059783936, 'learning_rate': 5.48469387755102e-06, 'epoch': 18.05}
{'loss': 0.2964, 'grad_norm': 0.9296960830688477, 'learning_rate': 5.357142857142857e-06, 'epoch': 18.1}
{'loss': 0.2751, 'grad_norm': 0.9612064361572266, 'learning_rate': 5.229591836734695e-06, 'epoch': 18.14}
{'loss': 0.2617, 'grad_norm': 1.0373328924179077, 'learning_rate': 5.102040816326531e-06, 'epoch': 18.19}
{'loss': 0.2788, 'grad_norm': 2.3092970848083496, 'learning_rate': 4.9744897959183674e-06, 'epoch': 18.24}
{'loss': 0.2612, 'grad_norm': 1.8288300037384033, 'learning_rate': 4.846938775510204e-06, 'epoch': 18.28}
{'loss': 0.26, 'grad_norm': 0.9837968945503235, 'learning_rate': 4.7193877551020

  0%|          | 0/67 [00:00<?, ?it/s]

{'eval_loss': 0.25330156087875366, 'eval_accuracy': 0.8917279197293198, 'eval_runtime': 159.9405, 'eval_samples_per_second': 107.177, 'eval_steps_per_second': 0.419, 'epoch': 19.0}
{'loss': 0.2612, 'grad_norm': 1.0151054859161377, 'learning_rate': 2.806122448979592e-06, 'epoch': 19.0}
{'loss': 0.2966, 'grad_norm': 1.208803415298462, 'learning_rate': 2.6785714285714285e-06, 'epoch': 19.05}
{'loss': 0.2811, 'grad_norm': 1.0914499759674072, 'learning_rate': 2.5510204081632653e-06, 'epoch': 19.1}
{'loss': 0.2747, 'grad_norm': 0.899793267250061, 'learning_rate': 2.423469387755102e-06, 'epoch': 19.14}
{'loss': 0.2777, 'grad_norm': 0.741868257522583, 'learning_rate': 2.295918367346939e-06, 'epoch': 19.19}
{'loss': 0.2847, 'grad_norm': 0.8857210278511047, 'learning_rate': 2.1683673469387757e-06, 'epoch': 19.23}
{'loss': 0.2958, 'grad_norm': 0.9391604065895081, 'learning_rate': 2.040816326530612e-06, 'epoch': 19.28}
{'loss': 0.2577, 'grad_norm': 0.8202571272850037, 'learning_rate': 1.9132653061

  0%|          | 0/67 [00:00<?, ?it/s]

{'eval_loss': 0.2521660327911377, 'eval_accuracy': 0.8913195659782989, 'eval_runtime': 160.1167, 'eval_samples_per_second': 107.059, 'eval_steps_per_second': 0.418, 'epoch': 20.0}
{'train_runtime': 14229.5051, 'train_samples_per_second': 79.369, 'train_steps_per_second': 0.311, 'train_loss': 0.3319284814515265, 'epoch': 20.0}
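
The logged learning rates are consistent with a linear warmup to a peak of 5e-05 over the first 500 optimizer steps, followed by a linear decay to zero at step 4420. The sketch below reproduces that schedule; the warmup length and the peak are inferred from the log above rather than read from the training configuration, so treat them as assumptions.

```python
def linear_schedule_lr(step, peak_lr=5e-5, warmup_steps=500, total_steps=4420):
    """Linear warmup followed by linear decay (parameters inferred from the log)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


# The log reports a value every 10 steps; 4420 steps / 20 epochs = 221 steps per epoch.
for step in (10, 500, 510, 4420):
    print(f"step={step}, epoch={step / 221:.2f}, lr={linear_schedule_lr(step):.6e}")
```

Reading the log with this schedule in mind shows that the rise in evaluation accuracy during the first two epochs happens while the learning rate is still ramping up.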
