Heterformer

This repository contains the source code and datasets for Heterformer: Transformer-based Deep Node Representation Learning on Heterogeneous Text-Rich Networks, published in KDD 2023.

Links

Requirements
Overview
Data
Train
Test
Inference
Downstream
Citations

Requirements

The code is written in Python 3.6. Before running it, install the required packages with the following command (using a virtual environment is recommended):

pip3 install -r requirements.txt

Overview

Heterformer is a Transformer architecture (language model) for node representation learning on heterogeneous text-rich (text-attributed) networks. It jointly considers the text associated with nodes and the heterogeneous network structure.
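To make the terminology concrete, here is a minimal sketch of a heterogeneous text-rich network as a plain data structure. It is not the format used in this repository; the class names (`Node`, `HeteroTextNetwork`), the node types, and the edge types are all illustrative assumptions. The point is only that some nodes carry text, others are textless, and neighbors can be grouped by type.

```python
from typing import Dict, List, Optional, Tuple


class Node:
    """A node with a type and optional text (hypothetical helper class)."""

    def __init__(self, node_id: str, node_type: str, text: Optional[str] = None):
        self.node_id = node_id
        self.node_type = node_type  # e.g. "paper" or "author" (illustrative)
        self.text = text            # None marks a textless node

    @property
    def is_text_rich(self) -> bool:
        return self.text is not None


class HeteroTextNetwork:
    """Undirected network with typed nodes and typed edges (illustrative)."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}
        # node_id -> list of (edge_type, neighbor_id)
        self.adj: Dict[str, List[Tuple[str, str]]] = {}

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node
        self.adj.setdefault(node.node_id, [])

    def add_edge(self, u: str, v: str, edge_type: str) -> None:
        self.adj[u].append((edge_type, v))
        self.adj[v].append((edge_type, u))

    def neighbors_by_type(self, u: str) -> Dict[str, List[str]]:
        """Group the neighbors of `u` by their node type."""
        grouped: Dict[str, List[str]] = {}
        for _, v in self.adj[u]:
            grouped.setdefault(self.nodes[v].node_type, []).append(v)
        return grouped


g = HeteroTextNetwork()
g.add_node(Node("p1", "paper", text="Transformers on graphs"))
g.add_node(Node("p2", "paper", text="Text-rich network mining"))
g.add_node(Node("a1", "author"))  # textless node
g.add_edge("p1", "a1", "written_by")
g.add_edge("p1", "p2", "cites")
print(g.neighbors_by_type("p1"))
```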

Data

Download raw data from DBLP, Twitter and Goodreads.
Data processing: Run the cells in data/$dataset/data_processing.ipynb to perform the first-stage data processing.
Network Sampling: Run the cells in data/$dataset/sampling.ipynb for ego-network sampling and train/val/test data generation.
Pretrain data: Run the cells in data/$dataset/generate_pretrain_data.ipynb for textless node pretraining data generation.
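The notebooks are the authoritative reference for how sampling is done. As a rough illustration of what ego-network sampling means, the sketch below keeps at most a fixed number of neighbors of each type around one center node. The function name `sample_ego_network`, the `max_per_type` parameter, and the per-type uniform sampling strategy are assumptions for illustration, not the repository's implementation.

```python
import random
from typing import Dict, List


def sample_ego_network(
    center: str,
    neighbors_by_type: Dict[str, List[str]],
    max_per_type: int,
    rng: random.Random,
) -> Dict[str, object]:
    """Keep at most `max_per_type` neighbors of each node type for `center`."""
    sampled: Dict[str, List[str]] = {}
    for node_type, neighbors in neighbors_by_type.items():
        if len(neighbors) <= max_per_type:
            # Few enough neighbors: keep them all.
            sampled[node_type] = list(neighbors)
        else:
            # Too many: draw a uniform sample without replacement.
            sampled[node_type] = rng.sample(neighbors, max_per_type)
    return {"center": center, "neighbors": sampled}


# Illustrative input: one paper with three authors and one venue.
rng = random.Random(0)  # fixed seed so the sample is reproducible
ego = sample_ego_network(
    "p1",
    {"author": ["a1", "a2", "a3"], "venue": ["v1"]},
    max_per_type=2,
    rng=rng,
)
print(ego)
```

Passing the random generator explicitly keeps the sampling reproducible, which matters when the sampled ego-networks are later split into train/val/test sets.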

Train

Pretrain textless node embeddings, taking the Goodreads dataset as an example.

cd pretrain/
bash run.sh

Prepare textless node embedding file for Heterformer training.

Run the cells in pretrain/transfer_embed.ipynb

Heterformer training.

cd ..
python main.py --data_path data/$dataset --model_type Heterformer --pretrain_embed True --pretrain_dir data/$dataset/pretrain_embed

Test

python main.py --data_path data/$dataset --model_type Heterformer --mode test --load_ckpt_name $load_ckpt_dir

Inference

python main.py --data_path data/$dataset --model_type Heterformer --mode infer --load 1 --load_ckpt_name $load_ckpt_dir
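The training, test, and inference commands above differ only in a few flags. If you script several runs, a small helper like the following can assemble them. It is a sketch: the function `build_main_command` is hypothetical, it uses only the flags shown in this README, and the dataset name and checkpoint path in the usage lines are placeholders.

```python
import shlex
from typing import List, Optional


def build_main_command(
    dataset: str, mode: str = "train", ckpt: Optional[str] = None
) -> List[str]:
    """Assemble a main.py command line from the flags listed in this README."""
    cmd = [
        "python", "main.py",
        "--data_path", "data/" + dataset,
        "--model_type", "Heterformer",
    ]
    if mode == "train":
        cmd += [
            "--pretrain_embed", "True",
            "--pretrain_dir", "data/" + dataset + "/pretrain_embed",
        ]
    elif mode in ("test", "infer"):
        if ckpt is None:
            raise ValueError("mode '%s' needs a checkpoint" % mode)
        cmd += ["--mode", mode]
        if mode == "infer":
            cmd += ["--load", "1"]
        cmd += ["--load_ckpt_name", ckpt]
    else:
        raise ValueError("unknown mode: %s" % mode)
    return cmd


# Placeholder dataset name and checkpoint path, for illustration only.
for mode, ckpt in [("train", None), ("test", "ckpt/example"), ("infer", "ckpt/example")]:
    argv = build_main_command("Goodreads", mode, ckpt)
    print(" ".join(shlex.quote(a) for a in argv))
```

The returned list can be passed to `subprocess.run(argv, check=True)` from the repository root.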

Downstream

Transductive Text-rich node classification

cd downstream/
python classification.py --mode transductive --dataset $dataset --method Heterformer

Inductive Text-rich node classification

python classification.py --mode inductive --dataset $dataset --method Heterformer

Textless node classification

python author_classification.py --dataset $dataset --method Heterformer

Node Clustering

python clustering.py --mode transductive --dataset $dataset --method Heterformer

Retrieval

python retrieval.py --method Heterformer
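To run all five downstream evaluations for one dataset, the commands above can be collected in a single mapping. This is a convenience sketch, not part of the repository: the function `downstream_commands` and the task keys are made up for illustration, while the script names and flags are copied from the commands above.

```python
from typing import Dict, List


def downstream_commands(dataset: str, method: str = "Heterformer") -> Dict[str, List[str]]:
    """Map each downstream task to the command listed in this README."""
    common = ["--dataset", dataset, "--method", method]
    return {
        "transductive_classification":
            ["python", "classification.py", "--mode", "transductive"] + common,
        "inductive_classification":
            ["python", "classification.py", "--mode", "inductive"] + common,
        "textless_classification":
            ["python", "author_classification.py"] + common,
        "clustering":
            ["python", "clustering.py", "--mode", "transductive"] + common,
        # The retrieval command in this README takes no --dataset flag.
        "retrieval":
            ["python", "retrieval.py", "--method", method],
    }


# "DBLP" is a placeholder for your dataset directory name.
for task, argv in downstream_commands("DBLP").items():
    print(task + ": " + " ".join(argv))
```

Because the scripts are run from downstream/, each command can be launched with `subprocess.run(argv, cwd="downstream", check=True)` from the repository root.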

Citations

Please cite the following paper if you find the code helpful for your research.

@inproceedings{jin2023heterformer,
  title={Heterformer: Transformer-based deep node representation learning on heterogeneous text-rich networks},
  author={Jin, Bowen and Zhang, Yu and Zhu, Qi and Han, Jiawei},
  booktitle={Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
  pages={1020--1031},
  year={2023}
}
