# Graph Node Classification with ThirdAI's Universal Deep Transformer
This notebook shows how to build a node classification model on a graph with ThirdAI's Universal Deep Transformer (UDT). In this demo, we train and evaluate the model on the YelpChi dataset, but you can easily swap in your own dataset.

UDT reaches SOTA accuracy on YelpChi extremely quickly. Below, we report accuracy numbers for some standard node classification methods from the paper [New Benchmarks for Learning on Non-Homophilous Graphs](https://arxiv.org/pdf/2104.01404.pdf), the results of CPU timing tests for those methods, and UDT's numbers when it is integrated into their [benchmark repo](https://github.com/CUAI/Non-Homophily-Benchmarks).

| Method  | AUC ROC (%)  | Training Time* | Entire Graph Inference Time** | Single Node Inference Time |
| -----------                             | -----------       | --            | --                | --  |
| Graph Convolutional Network (GCN)       | 63.62 ± 1.00      | ~15 min       | **0.74s ± 0.10s**     | Not Possible |
| Graph Attention Network (GAT)           | 81.42 ± 2.12      | ~1 hour       | 2.95s ± 0.17s     | Not Possible |
| GAT w/ Jumping Knowledge (GAT+JK)       | 90.04 ± 0.61      | ~1 hour       | 2.91s ± 0.16s     | Not Possible |
| **ThirdAI**    | **91.04 ± 0.73**  | **~1 min**    | 1.43s ± 0.09s     | **0.5 ms** |

*Training time on a 16-core Intel Xeon CPU (no GPU). This excludes the hyperparameter search, which is only feasible on GPUs (and only necessary for the non-UDT methods). The benchmark methods use batch size = graph size and run for 500 epochs, while UDT uses batch size = 2048 and runs for 15 epochs.

**Time to run inference on all nodes in the graph.

We can also increase UDT's AUC to about 93% at the cost of longer training; see below for more details.

You can immediately run a version of this notebook in your browser on Google Colab at the following link:

https://githubtocolab.com/ThirdAILabs/Demos/blob/main/universal_deep_transformer/graph_neural_networks/GraphNodeClassification.ipynb

This notebook uses an activation key that will only work with this demo. If you want to try us out on your own dataset, you can obtain a free trial license at the following link: https://www.thirdai.com/try-bolt/

In [None]:
!pip3 install thirdai --upgrade
!pip3 install scikit-learn

# This activates the ThirdAI package with a key that is only good for this demo
import thirdai

thirdai.licensing.activate("MWHE-FVNH-3XN3-KWXY-TFRA-X3VJ-U7FT-RLVR")

# Dataset Download
We will use the demos module in the thirdai package to download the YelpChi dataset. You can replace this step and the next one with a download method and a UDT initialization step specific to your dataset.

In [None]:
from thirdai.demos import download_yelp_chi_dataset

train_data_path, eval_data_path, inference_samples, _ = download_yelp_chi_dataset()

# UDT Initialization
We can now create a UDT model by passing in the type of each column in the dataset and the target column we want to predict. The input format for training UDT on a graph dataset is a CSV in which every row represents a single node in the graph. The bolt.types.node_id() column should be a unique integer identifying the node, while the bolt.types.neighbors() column should be a space-separated list of node ids corresponding to the node's neighbors. For YelpChi, each node additionally has 32 numerical features (a "feature vector"), as well as a target column we wish to predict.
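
To make the row layout concrete, here is a minimal sketch that writes a three-node toy graph in the shape described above, using only the Python standard library. The column names mirror the `data_types` keys used in this notebook; the column order, the helper names `make_row` and `rows_to_csv`, and the feature values are assumptions made for illustration, not part of the ThirdAI API.

```python
import csv
import io

NUM_FEATURES = 32  # YelpChi nodes carry 32 numerical features


def make_row(node_id, features, target, neighbor_ids):
    """Format one node as a dict keyed by the column names used in this notebook."""
    if len(features) != NUM_FEATURES:
        raise ValueError(f"expected {NUM_FEATURES} features, got {len(features)}")
    row = {"node_id": node_id}
    row.update({f"col_{i}": value for i, value in enumerate(features)})
    row["target"] = target
    # Neighbors are stored as a single space-separated field of node ids.
    row["neighbors"] = " ".join(str(n) for n in neighbor_ids)
    return row


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Toy graph: node 0 is connected to nodes 1 and 2.
toy_rows = [
    make_row(0, [0.5] * NUM_FEATURES, 0, [1, 2]),
    make_row(1, [0.1] * NUM_FEATURES, 1, [0]),
    make_row(2, [0.9] * NUM_FEATURES, 0, [0]),
]
toy_csv = rows_to_csv(toy_rows)
print(toy_csv.splitlines()[0])
```

Each row ends up with 35 fields: the node id, 32 features, the target, and the neighbor list.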

In [None]:
from thirdai import bolt

model = bolt.UniversalDeepTransformer(
    data_types={
        "node_id": bolt.types.node_id(),
        **{f"col_{i}": bolt.types.numerical([0, 1]) for i in range(32)},
        "target": bolt.types.categorical(),
        "neighbors": bolt.types.neighbors(),
    },
    target="target",
    n_target_classes=2,
    integer_target=True,
    # Optional: you can uncomment the next line to increase AUC to ~0.93 at the
    # cost of increasing training time
    # options={"contextual_columns": True},
)

# Indexing
Graph models with UDT require an additional *indexing* step before training. Since nodes in the train set have neighbors in the eval set, we must tell the model about the features of the eval-set nodes before training. This effectively gives the UDT model complete knowledge of the graph structure and features before training. The eval set has all labels set to 0, so indexing it does not leak test labels.

In [None]:
model.index_nodes(eval_data_path)
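
As a rough illustration of why indexing is needed, the sketch below finds neighbor ids that are referenced by a training CSV but do not appear as rows in it. It assumes the `node_id` and `neighbors` column names used in this notebook; `neighbors_outside_split` is a hypothetical helper, not a ThirdAI function, and the toy data is invented.

```python
import csv
import io


def neighbors_outside_split(csv_text):
    """Return neighbor ids referenced in the CSV that have no row of their own."""
    node_ids = set()
    neighbor_ids = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        node_ids.add(int(row["node_id"]))
        # The neighbors field is a space-separated list of node ids.
        neighbor_ids.update(int(n) for n in row["neighbors"].split())
    return neighbor_ids - node_ids


# Toy train split: nodes 0 and 1 point at nodes 7 and 8, which live elsewhere.
toy_train = "node_id,target,neighbors\n0,0,1 7\n1,1,0 8\n"
missing = neighbors_outside_split(toy_train)
print(sorted(missing))  # [7, 8]
```

Any id returned here belongs to a node whose features the model can only learn about through an indexing step like the one above.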

# Training
We can now train our UDT graph model! Feel free to customize the number of epochs and the learning rate; we have chosen values that give good convergence.

In [None]:
model.train(
    train_data_path, learning_rate=0.001, epochs=15, metrics=["categorical_accuracy"]
)

# Evaluation
We can evaluate the performance of UDT by calculating the [ROC AUC](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html). UDT achieves about 91% with the default configuration in this notebook, and as noted above we achieve 93-94% with longer training time (the previous SOTA was [90%](https://paperswithcode.com/sota/node-classification-on-yelpchi)). Here, we show the predict API, where we pass in samples without labels.

In [None]:
import numpy as np
from sklearn import metrics

# Score each unlabeled sample and keep its label aside for the metric.
predictions = []
ground_truths = []
for sample, ground_truth in inference_samples:
    predictions.append(model.predict(sample))
    ground_truths.append(ground_truth)
predictions = np.array(predictions)

auc = metrics.roc_auc_score(ground_truths, predictions[:, 1])

print("Test AUC:", auc)
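
For intuition about the metric, ROC AUC equals the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one, with ties counted as half. The sketch below computes it directly from that definition on toy data; `pairwise_roc_auc` is a hypothetical helper written for illustration, and its quadratic cost makes it unsuitable for large datasets, where `sklearn.metrics.roc_auc_score` (used above) is the better choice.

```python
def pairwise_roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # positive correctly ranked above negative
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy data: three of the four positive/negative pairs are ordered correctly.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_roc_auc(toy_labels, toy_scores))  # 0.75
```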