# Training and Evaluation of GNNs and LLMs
In this notebook, we train the models on the [MovieLens Dataset](https://movielens.org/), following the PyTorch Geometric tutorial on [Link Prediction](https://colab.research.google.com/drive/1xpzn1Nvai1ygd_P5Yambc_oe4VBPK_ZT?usp=sharing#scrollTo=vit8xKCiXAue).

First we import all of our dependencies.

The **GraphRepresentationGenerator** manages and trains a GNN model. Its most important interfaces are
**the constructor**, which defines the GNN architecture and loads the pre-trained GNN model if it is already on the hard disk,
**the training method**, which starts training the GNN model, and
**the get_embedding methods**, which form the inference interface to the GNN model: for given user-movie node pairs, they return the corresponding embeddings in the dimension defined in the constructor.

The **MovieLensManager** loads and manages the datasets. Its main tasks are **saving, (re)loading, and transforming** the datasets.

**PromptBertClassifier** and **VanillaBertClassifier** manage a **prompt (model) LLM** and a **vanilla (model) LLM**, respectively. An EncoderOnlyClassifier (ClassifierBase) provides interfaces for training and testing an LLM.
The prompt and vanilla classifiers differ in their data collators. Data collators change the behavior of the models during training and testing and allow data points to be created at runtime. With the help of these collators, we **create non-existent edges on the fly**.
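
As a rough illustration of creating non-existent edges at runtime, the following sketch pairs every positive edge in a batch with one sampled negative edge. It is not the collator used in this project: the function name `add_negative_edges`, the 1:1 sampling ratio, and the toy graph are assumptions made only for this example.

```python
import random


def add_negative_edges(positive_edges, existing_edges, num_targets, rng):
    """Label each positive edge 1 and add one sampled non-existent edge labeled 0.

    Hypothetical helper: assumes every source has at least one target
    it is NOT connected to, otherwise the resampling loop would not end.
    """
    examples = []
    for source, target in positive_edges:
        examples.append((source, target, 1))
        # Resample until the (source, candidate) pair is not a known edge.
        while True:
            candidate = rng.randrange(num_targets)
            if (source, candidate) not in existing_edges:
                break
        examples.append((source, candidate, 0))
    return examples


# Toy graph: 3 users (sources) and 5 movies (targets).
existing_edges = {(0, 0), (0, 1), (1, 2), (2, 4)}
batch = [(0, 1), (2, 4)]
examples = add_negative_edges(
    batch, existing_edges, num_targets=5, rng=random.Random(0)
)
for source, target, label in examples:
    print(source, target, label)
```

Because the negatives are drawn when the batch is assembled, the same positive edge can be paired with a different negative in every epoch, and no negative edges need to be stored with the dataset.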

In [1]:
from graph_representation_generator import GraphRepresentationGenerator
from dataset_manager import (
    MovieLensManager,
    PROMPT_KGE_DIMENSION,
    INPUT_EMBEDS_REPLACE_KGE_DIMENSION,
    ROOT,
)
from llm_manager import (
    PromptBertClassifier,
    VanillaBertClassifier,
    InputEmbedsReplaceClassifier,
)

In [2]:
EPOCHS = 20
BATCH_SIZE_KGE = 128000
BATCH_SIZE_LLM = 256

We fix the **Knowledge Graph Embedding Dimension (KGE_DIMENSION)** of the GNN encoder in advance. The goal is to find the output dimension at which the GNN encoder produces embeddings that lead to a significant increase in performance *without exceeding the context length of the LLMs*. In the original tutorial, the KGE_DIMENSION was $64$.

In [3]:
kg_manager = MovieLensManager()

Using existing file ml-32m.zip
Extracting ./data\ml-32m.zip


splitting LLM dataset
6400040 6400040
generate llm dataset...


llm_df = kg_manager.llm_df.merge(kg_manager.target_df[["id", "prompt_feature_title", "prompt_feature_genres"]].rename(columns={"id": "target_id"}), on = "target_id")
llm_df

First we load the MovieLensManager, which downloads the MovieLens dataset (https://files.grouplens.org/datasets/movielens/ml-32m.zip) and prepares it for use with the GNN and the LLMs. We also pass the embedding dimensions that we assume we will train with. The first run takes approx. 30 seconds.

In [4]:
kg_manager.data

HeteroData(
  source={ node_id=[200948] },
  target={
    node_id=[87585],
    x=[87585, 20],
  },
  (source, edge, target)={ edge_index=[2, 32000204] },
  (target, rev_edge, source)={ edge_index=[2, 32000204] }
)

Next, we initialize the GNN trainers (on CUDA if available), one for each KGE_DIMENSION.
A GNN trainer manages a model, and each model consists of an **encoder** and a **classifier** part.

**The encoder** is a parameterized *Graph Convolutional Network (GCN)* with a *2-layer GNN computation graph* and a single *ReLU* activation function in between.

**The classifier** applies the dot product between source and destination KGEs to derive edge-level predictions.
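
To make the classifier step concrete, here is a minimal sketch in plain Python of scoring one edge by the dot product of its two node embeddings. The function names and the example vectors are hypothetical, and the sigmoid that turns the raw score into a probability is an assumption; the actual model may apply it inside the loss function instead. The GCN encoder itself is not reproduced here.

```python
import math


def edge_score(source_kge, target_kge):
    """Dot product of two equally sized embedding vectors."""
    if len(source_kge) != len(target_kge):
        raise ValueError("embeddings must have the same dimension")
    return sum(s * t for s, t in zip(source_kge, target_kge))


def edge_probability(source_kge, target_kge):
    """Squash the raw score into (0, 1) with a sigmoid (assumed here)."""
    return 1.0 / (1.0 + math.exp(-edge_score(source_kge, target_kge)))


# Made-up 4-dimensional embeddings for one user and one movie.
user_kge = [0.5, -1.0, 2.0, 0.0]
movie_kge = [1.0, 1.0, 0.5, 3.0]

score = edge_score(user_kge, movie_kge)  # 0.5 - 1.0 + 1.0 + 0.0 = 0.5
probability = edge_probability(user_kge, movie_kge)
print(score, probability)
```

A dot-product classifier has no parameters of its own, so everything the model learns about which user-movie pairs belong together has to be stored in the encoder's embeddings.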

In [5]:
graph_representation_generator_prompt = GraphRepresentationGenerator(
    kg_manager.data,
    kg_manager.gnn_train_data,
    kg_manager.gnn_val_data,
    kg_manager.gnn_test_data,
    kge_dimension=PROMPT_KGE_DIMENSION,
    force_recompute=False,
)
graph_representation_generator_input_embeds_replace = GraphRepresentationGenerator(
    kg_manager.data,
    kg_manager.gnn_train_data,
    kg_manager.gnn_val_data,
    kg_manager.gnn_test_data,
    hidden_channels=INPUT_EMBEDS_REPLACE_KGE_DIMENSION,
    kge_dimension=INPUT_EMBEDS_REPLACE_KGE_DIMENSION,
    force_recompute=False,
)

Device: 'cpu'
Device: 'cpu'


We then train and validate the model on the link prediction task.

If the model is already trained, we can skip this part.
Training time depends on the hardware: in the CPU run logged below, each epoch took roughly 8 to 10 minutes.
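
Skipping training for an already trained model amounts to a load-or-train pattern. The sketch below shows one way to implement it with the standard library. How `GraphRepresentationGenerator` caches its model is not shown in this notebook, so `load_or_train`, the pickle format, and the reading of `force_recompute` used here are assumptions.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_train(cache_path, train_fn, force_recompute=False):
    """Return (result, trained_now); reuse the cache unless recompute is forced."""
    cache_path = Path(cache_path)
    if cache_path.exists() and not force_recompute:
        with cache_path.open("rb") as handle:
            return pickle.load(handle), False
    result = train_fn()
    with cache_path.open("wb") as handle:
        pickle.dump(result, handle)
    return result, True


def fake_training():
    # Stand-in for an expensive training run.
    return {"weights": [0.1, 0.2, 0.3]}


with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "model.pkl"
    first, trained_first = load_or_train(path, fake_training)
    second, trained_second = load_or_train(path, fake_training)
    third, trained_third = load_or_train(path, fake_training, force_recompute=True)

print(trained_first, trained_second, trained_third)
```

In this sketch the first call trains and writes the cache, the second call only reads it, and the third call retrains because recomputation is forced.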

In [6]:
print("Prompt Training")
graph_representation_generator_prompt.train_model(
    kg_manager.gnn_train_data, EPOCHS, BATCH_SIZE_KGE
)
graph_representation_generator_prompt.validate_model(
    kg_manager.gnn_test_data, batch_size=BATCH_SIZE_KGE
)
print("Attention Training")
graph_representation_generator_input_embeds_replace.train_model(
    kg_manager.gnn_train_data, EPOCHS, BATCH_SIZE_KGE
)
graph_representation_generator_input_embeds_replace.validate_model(
    kg_manager.gnn_test_data, batch_size=BATCH_SIZE_KGE
)


Prompt Training


100%|██████████| 61/61 [08:28<00:00,  8.33s/it]


Epoch: 001, Loss: 0.5086


100%|██████████| 61/61 [08:24<00:00,  8.26s/it]


Epoch: 002, Loss: 0.2182


100%|██████████| 61/61 [08:23<00:00,  8.26s/it]


Epoch: 003, Loss: 0.1761


100%|██████████| 61/61 [08:23<00:00,  8.25s/it]


Epoch: 004, Loss: 0.1611


100%|██████████| 61/61 [08:23<00:00,  8.25s/it]


Epoch: 005, Loss: 0.1532


100%|██████████| 61/61 [08:25<00:00,  8.28s/it]


Epoch: 006, Loss: 0.1484


100%|██████████| 61/61 [08:27<00:00,  8.32s/it]


Epoch: 007, Loss: 0.1440


100%|██████████| 61/61 [08:26<00:00,  8.31s/it]


Epoch: 008, Loss: 0.1411


100%|██████████| 61/61 [08:26<00:00,  8.30s/it]


Epoch: 009, Loss: 0.1392


100%|██████████| 61/61 [08:23<00:00,  8.26s/it]


Epoch: 010, Loss: 0.1364


100%|██████████| 61/61 [08:24<00:00,  8.27s/it]


Epoch: 011, Loss: 0.1340


100%|██████████| 61/61 [08:24<00:00,  8.27s/it]


Epoch: 012, Loss: 0.1313


100%|██████████| 61/61 [08:21<00:00,  8.23s/it]


Epoch: 013, Loss: 0.1293


100%|██████████| 61/61 [08:23<00:00,  8.25s/it]


Epoch: 014, Loss: 0.1280


100%|██████████| 61/61 [08:25<00:00,  8.28s/it]


Epoch: 015, Loss: 0.1260


100%|██████████| 61/61 [08:23<00:00,  8.26s/it]


Epoch: 016, Loss: 0.1238


100%|██████████| 61/61 [08:25<00:00,  8.28s/it]


Epoch: 017, Loss: 0.1218


100%|██████████| 61/61 [08:24<00:00,  8.27s/it]


Epoch: 018, Loss: 0.1197


100%|██████████| 61/61 [08:23<00:00,  8.25s/it]


Epoch: 019, Loss: 0.1178


100%|██████████| 61/61 [08:24<00:00,  8.26s/it]


Epoch: 020, Loss: 0.1140


100%|██████████| 51/51 [04:35<00:00,  5.40s/it]



Validation AUC: 0.9898
Attention Training


100%|██████████| 61/61 [10:20<00:00, 10.18s/it]


Epoch: 001, Loss: 0.3936


100%|██████████| 61/61 [10:19<00:00, 10.16s/it]


Epoch: 002, Loss: 0.1813


100%|██████████| 61/61 [10:19<00:00, 10.15s/it]


Epoch: 003, Loss: 0.1592


100%|██████████| 61/61 [10:19<00:00, 10.16s/it]


Epoch: 004, Loss: 0.1510


100%|██████████| 61/61 [10:19<00:00, 10.15s/it]


Epoch: 005, Loss: 0.1458


100%|██████████| 61/61 [10:17<00:00, 10.12s/it]


Epoch: 006, Loss: 0.1419


100%|██████████| 61/61 [10:19<00:00, 10.15s/it]


Epoch: 007, Loss: 0.1388


100%|██████████| 61/61 [10:18<00:00, 10.13s/it]


Epoch: 008, Loss: 0.1341


100%|██████████| 61/61 [10:18<00:00, 10.14s/it]


Epoch: 009, Loss: 0.1312


100%|██████████| 61/61 [10:17<00:00, 10.13s/it]


Epoch: 010, Loss: 0.1317


100%|██████████| 61/61 [10:19<00:00, 10.16s/it]


Epoch: 011, Loss: 0.1286


100%|██████████| 61/61 [10:19<00:00, 10.15s/it]


Epoch: 012, Loss: 0.1272


100%|██████████| 61/61 [10:17<00:00, 10.13s/it]


Epoch: 013, Loss: 0.1237


100%|██████████| 61/61 [10:13<00:00, 10.05s/it]


Epoch: 014, Loss: 0.1225


100%|██████████| 61/61 [10:18<00:00, 10.14s/it]


Epoch: 015, Loss: 0.1226


100%|██████████| 61/61 [10:19<00:00, 10.15s/it]


Epoch: 016, Loss: 0.1181


100%|██████████| 61/61 [10:18<00:00, 10.14s/it]


Epoch: 017, Loss: 0.1140


100%|██████████| 61/61 [10:18<00:00, 10.14s/it]


Epoch: 018, Loss: 0.1090


100%|██████████| 61/61 [10:18<00:00, 10.14s/it]


Epoch: 019, Loss: 0.1069


100%|██████████| 61/61 [10:18<00:00, 10.14s/it]


Epoch: 020, Loss: 0.1038


100%|██████████| 51/51 [04:47<00:00,  5.63s/it]



Validation AUC: 0.9910
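
The AUC reported above can be read as the probability that a randomly chosen existing edge receives a higher score than a randomly chosen non-existent edge. The following sketch computes it directly from that definition on made-up labels and scores. It is not the evaluation code behind `validate_model`, and the quadratic pairwise loop is only practical for small inputs.

```python
def roc_auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: 3 existing edges (label 1) and 2 non-existent edges (label 0).
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.1]
auc = roc_auc(labels, scores)  # 5 of the 6 pairs are ranked correctly
print(auc)
```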


Next we produce the KGEs for every edge in the dataset. These embeddings can then be used for the LLM on the link-prediction task.