Zhangyang Gao*, Daize Dong*, Cheng Tan, Jun Xia, Bozhen Hu, Stan Z. Li
Published at the 41st International Conference on Machine Learning (ICML 2024).
Can we model Non-Euclidean graphs as pure language, or even as Euclidean vectors, while retaining their inherent information? The Non-Euclidean property has posed a long-standing challenge in graph modeling. Despite recent efforts by graph neural networks and graph transformers to encode graphs as Euclidean vectors, recovering the original graph from those vectors remains a challenge. In this paper, we introduce GraphsGPT, featuring a Graph2Seq encoder that transforms Non-Euclidean graphs into learnable GraphWords in Euclidean space, along with a GraphGPT decoder that reconstructs the original graph from GraphWords to ensure information equivalence. We pretrain GraphsGPT on 100M molecules and report several interesting findings:
- The pretrained Graph2Seq excels in graph representation learning, achieving state-of-the-art results on graph classification and regression tasks.
- The pretrained GraphGPT serves as a strong graph generator, demonstrated by its ability to perform both few-shot and conditional graph generation.
- Graph2Seq+GraphGPT enables effective graph mixup in the Euclidean space, overcoming previously known Non-Euclidean challenges.
- The edge-centric pretraining framework GraphsGPT demonstrates its efficacy in graph domain tasks, excelling in both representation and generation.
To get started with GraphsGPT, run the following commands to set up the environment.
```bash
git clone git@github.com:A4Bio/GraphsGPT.git --depth=1
cd GraphsGPT
conda create --name graphsgpt python=3.12
conda activate graphsgpt
pip install -e .[dev]
pip install -r requirements.txt
```
We provide some Jupyter Notebooks in `./jupyter_notebooks`, along with their corresponding online Google Colaboratory notebooks. You can run them for a quick start.
| | Jupyter Notebook | Google Colaboratory |
|---|---|---|
| GraphsGPT Pipeline | example_pipeline.ipynb | |
| Clustering Analysis | clustering.ipynb | |
| Hybridization Analysis | hybridization.ipynb | |
| Interpolation Analysis | interpolation.ipynb | |
The model checkpoints can be downloaded from 🤗 Hugging Face. We provide both the foundational pretrained models with different numbers of Graph Words and a finetuned model for conditional generation.
| Model Name | Model Type | Model Checkpoint |
|---|---|---|
| GraphsGPT-1W | Foundation Model | |
| GraphsGPT-2W | Foundation Model | |
| GraphsGPT-4W | Foundation Model | |
| GraphsGPT-8W | Foundation Model | |
| GraphsGPT-1W-C | Finetuned Model | |
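If you want to load a checkpoint programmatically rather than through the notebooks, the snippet below shows one plausible way to do it with the `transformers` library. The repository ID (`DaizeDong/GraphsGPT-1W`) and the use of `AutoModel` with `trust_remote_code` are assumptions for illustration; check the checkpoint links in the table above and the example notebooks for the exact import path and class name.

```python
# Hedged loading sketch: the Hugging Face repo ID below is a guess, and the
# actual model class may need to be imported from this repository instead of
# being resolved through AutoModel. Adjust to match the example notebooks.
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "DaizeDong/GraphsGPT-1W",  # hypothetical repo ID; use the link from the table above
    trust_remote_code=True,    # lets the checkpoint ship its own model code, if configured
)
model.eval()
print(model.config)            # inspect, e.g., the number of Graph Words and hidden size
```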
You should first download the configurations and data for finetuning and put them in `./data_finetune`. (We also include the finetuned checkpoints in the `model_zoom.zip` file for a quick test.)
To evaluate the representation performance of the Graph2Seq Encoder, please run:
```bash
bash ./scripts/representation/finetune.sh
```
You can also toggle the `--mixup_strategy` option for graph mixup using Graph2Seq.
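Because GraphWords live in Euclidean space, graph mixup reduces to interpolating two fixed-size tensors rather than matching or editing graph structures. The sketch below illustrates this idea with random stand-ins for encoder outputs; it is not the implementation behind `--mixup_strategy`, and the shapes are assumptions (one Graph Word of dimension 512).

```python
import torch

def mixup_graph_words(words_a: torch.Tensor, words_b: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    """Convex combination of two GraphWords tensors of shape (K, d).

    Since GraphWords are plain Euclidean vectors, no graph matching or
    edit-distance machinery is needed to mix two molecules.
    """
    assert words_a.shape == words_b.shape, "both molecules must use the same number of Graph Words"
    return lam * words_a + (1.0 - lam) * words_b

# Toy example with random tensors standing in for Graph2Seq outputs (K=1, d=512).
words_a = torch.randn(1, 512)
words_b = torch.randn(1, 512)
mixed = mixup_graph_words(words_a, words_b, lam=0.3)
# In the real pipeline, `mixed` would be decoded back into a molecular graph
# by the GraphGPT decoder (see the generation instructions below).
```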
For the unconditional generation with GraphGPT Decoder, please refer to README-Generation-Uncond.md.
For the conditional generation with GraphGPT-C Decoder, please refer to README-Generation-Cond.md.
To evaluate the few-shot generation performance of the GraphGPT Decoder, please run:
```bash
bash ./scripts/generation/evaluation/moses.sh
bash ./scripts/generation/evaluation/zinc250k.sh
```
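For intuition, few-shot generation can be pictured as fitting a simple distribution over the GraphWords of a handful of reference molecules and decoding new samples from it. The sketch below uses random tensors in place of real encoder outputs and assumed shapes; it only illustrates sampling in the Euclidean GraphWords space and is not what the evaluation scripts above implement.

```python
import torch

# Pretend these are GraphWords (K=1, d=512) encoded from 8 reference molecules.
reference_words = torch.randn(8, 1, 512)

# Fit a diagonal Gaussian over the reference GraphWords ...
mean = reference_words.mean(dim=0)
std = reference_words.std(dim=0)

# ... and draw new points in the same Euclidean space.
num_samples = 32
samples = mean + std * torch.randn(num_samples, *mean.shape)
print(samples.shape)  # torch.Size([32, 1, 512])

# Each sample would then be decoded by GraphGPT into a candidate molecule and
# scored with the MOSES / ZINC250k metrics used by the scripts above.
```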
```bibtex
@article{gao2024graph,
  title={A Graph is Worth $K$ Words: Euclideanizing Graph using Pure Transformer},
  author={Gao, Zhangyang and Dong, Daize and Tan, Cheng and Xia, Jun and Hu, Bozhen and Li, Stan Z},
  journal={arXiv preprint arXiv:2402.02464},
  year={2024}
}
```
If you have any questions, please contact:
- Zhangyang Gao: gaozhangyang@westlake.edu.cn
- Daize Dong: dzdong2019@gmail.com