Orange-OpenSource/fgdiffusion

FGdiffusion

Large-Scale Knowledge Graph Generation Using a Diffusion Approach

Paper License: BSD-4

FGdiffusion is a framework for generating graphs, including knowledge graphs, using discrete diffusion models. It implements and compares several generative approaches, with a focus on discrete diffusion applied to edge-level graph tokens.

Key Contributions:

  • Discrete diffusion for graph generation: We adapt LLaDA-style discrete diffusion and score-based discrete diffusion (SEDD) to the graph generation task, operating directly on edge token sequences.
  • Graph flattening via BFS ordering: Graphs are serialized into edge-index sequences using BFS node ordering, enabling transformer-based models to process graphs as token sequences.
  • Knowledge graph generation: We extend the approach to generate typed knowledge graphs with node and edge labels.

This repository is the official code for the paper "FGdiffusion: Large-Scale Knowledge Graph Generation Using a Diffusion Approach" (Adrien Bufort, Lionel Tailhardat, 2025). If you are using FGdiffusion in your work, please cite:

@misc{hal-05410352,
  title        = {FGdiffusion: Large-Scale Knowledge Graph Generation Using a Diffusion Approach},
  author       = {Bufort, Adrien and Tailhardat, Lionel},
  year         = {2025},
  howpublished = {\url{https://hal.science/hal-05410352}}
}

Implemented Models

Model                  | Type                           | File                                            | Reference
-----------------------|--------------------------------|-------------------------------------------------|-----------------
G2PT + LLaDA           | Discrete diffusion             | deepgraphgen/trainers/trainer_g2pt_llada.py     | Xie et al., 2025
G2PT + Score Diffusion | Score-based discrete diffusion | deepgraphgen/trainers/trainer_g2pt_score.py     | Based on SEDD
G2PT + KG (NASA)       | KG generation with labels      | deepgraphgen/trainers/trainer_g2pt_llada_kg.py  | This paper

Installation

Requirements

  • Python 3.10+

For the full list of dependencies, see pyproject.toml.

Note that CUDA is recommended for the FGdiffusion training stage; it is the user's responsibility to download the required tools and to agree to the associated terms and conditions.

Setup

# Clone the repository
git clone https://github.com/Orange-OpenSource/fgdiffusion.git
cd fgdiffusion

# Install dependencies
pip install .

Docker

docker build -t fgdiffusion .
docker run --gpus all -it fgdiffusion

Quick Start

Training

# Train G2PT with LLaDA discrete diffusion on planar graphs
python scripts/train_g2pt_llada.py

# Train G2PT with score-based diffusion
python scripts/train_g2pt_score.py

# Train GraphGDP baseline
python scripts/train_diffusion.py

# Train GRAN baseline
python scripts/train_gran.py

Evaluation

# Evaluate generated graphs
python scripts/evaluate.py --checkpoint path/to/checkpoint.ckpt

Knowledge Graph Generation

# Train on NASA Knowledge Graph
python scripts/train_kg.py

Graph Representation

Graphs are flattened into sequences of edge tokens using the following approach:

  1. BFS ordering: Nodes are ordered via BFS traversal starting from node 0
  2. Edge serialization: Edges are serialized as pairs of node indices (u, v)
  3. Padding: Edge sequences are padded to a fixed length based on the edges_to_node_ratio
  4. Mask tokens: A special [MASK] token (index nb_max_node + 2) is used for discrete diffusion
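
As a rough illustration of steps 1 to 3, the sketch below flattens a small undirected graph into a padded edge-token sequence. It is not the repository's implementation: the function names `bfs_order` and `flatten_graph`, the padding index (`nb_max_node`), the restart rule for disconnected components, and the target length formula (`2 * nb_max_node * edges_to_node_ratio`) are all assumptions made for the example.

```python
from collections import deque


def bfs_order(adj, start=0):
    """Return all nodes in BFS visitation order, beginning at `start`.

    Neighbors are visited in ascending index order. Disconnected components
    are handled by restarting from the smallest unvisited node (an assumption).
    """
    order, seen = [], set()
    for root in [start] + sorted(adj):
        if root in seen:
            continue
        seen.add(root)
        queue = deque([root])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in sorted(adj[u]):
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
    return order


def flatten_graph(edges, nb_nodes, nb_max_node, edges_to_node_ratio):
    """Flatten an undirected graph into a fixed-length list of node indices."""
    adj = {n: set() for n in range(nb_nodes)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    # Step 1: relabel every node by its BFS rank, starting from node 0.
    rank = {node: i for i, node in enumerate(bfs_order(adj))}

    # Step 2: serialize each edge once as a (smaller, larger) pair, sorted.
    pairs = sorted({tuple(sorted((rank[u], rank[v]))) for u, v in edges})
    tokens = [idx for pair in pairs for idx in pair]

    # Step 3: pad to a fixed length.
    # ASSUMPTION: padding index and length formula are illustrative only.
    pad_id = nb_max_node
    target_len = int(2 * nb_max_node * edges_to_node_ratio)
    if len(tokens) > target_len:
        raise ValueError("graph has too many edges for the fixed length")
    return tokens + [pad_id] * (target_len - len(tokens))


edges = [(0, 2), (2, 1), (1, 3)]
seq = flatten_graph(edges, nb_nodes=4, nb_max_node=4, edges_to_node_ratio=2)
print(seq[:6])   # [0, 1, 1, 2, 2, 3]
print(len(seq))  # 16
```

In this example the path 0-2-1-3 is relabeled to 0-1-2-3 by the BFS ranks, so consecutive edges share a node index, which is the kind of locality a sequence model can exploit.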

Key Findings

  • Graph Transformers outperform GNNs for graph generation, especially in diffusion setups
  • Flattening to edge indices works better than adjacency matrix representations
  • Discrete diffusion (LLaDA-like) is effective for graph generation
  • BFS node ordering significantly enhances learning
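
For readers unfamiliar with LLaDA-like discrete diffusion: in masked diffusion models of this kind, the forward (noising) step replaces each token independently with a [MASK] token with probability t, and the model is trained to recover the masked tokens. The sketch below shows that step on a flat edge-token sequence, using the [MASK] index `nb_max_node + 2` given in Graph Representation. The function name `mask_tokens` is hypothetical, and the repository's actual noise scheduling (deepgraphgen/diffusion_generation.py) may differ.

```python
import random


def mask_tokens(tokens, t, mask_id, rng=random):
    """Mask each token independently with probability t (0 <= t <= 1).

    Returns the noised sequence and a boolean list marking masked positions,
    which a training loop would use to restrict the loss to those positions.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    noisy, is_masked = [], []
    for tok in tokens:
        hit = rng.random() < t
        noisy.append(mask_id if hit else tok)
        is_masked.append(hit)
    return noisy, is_masked


nb_max_node = 4
mask_id = nb_max_node + 2          # [MASK] index as described in this README
edge_tokens = [0, 1, 1, 2, 2, 3]   # three edges: (0,1), (1,2), (2,3)

# t = 0 leaves the sequence intact; t = 1 masks every position.
print(mask_tokens(edge_tokens, 0.0, mask_id)[0])  # [0, 1, 1, 2, 2, 3]
print(mask_tokens(edge_tokens, 1.0, mask_id)[0])  # [6, 6, 6, 6, 6, 6]

# Intermediate t gives a partially masked sequence (seeded for repeatability).
noisy, is_masked = mask_tokens(edge_tokens, 0.5, mask_id, random.Random(0))
```

Generation then runs the process in reverse: start from a fully masked sequence and let the model fill in edge tokens over several steps.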

Project Structure

FGdiffusion/
├── deepgraphgen/
│   ├── trainers/                # Training modules (PyTorch Lightning)
│   │   ├── trainer_g2pt_auto.py # Autoregressive G2PT
│   │   ├── trainer_g2pt_llada.py # G2PT + LLaDA discrete diffusion
│   │   ├── trainer_g2pt_score.py # G2PT + score-based diffusion
│   │   └── trainer_g2pt_llada_kg.py # G2PT + KG with labels
│   ├── datageneration.py        # Graph data generation utilities
│   ├── datasets.py              # Dataset classes for all approaches
│   ├── diffusion_generation.py  # Diffusion noise scheduling
│   ├── utils.py                 # Shared utilities
│   └── random_walk_features.py  # Random walk feature computation
├── scripts/                     # Training & evaluation scripts
├── tests/                       # Unit tests
├── scripts_preprocess/          # Data preprocessing notebooks
├── images/                      # Example generation images
├── data_kg/                     # Knowledge graph data (parquet)
├── Dockerfile
├── pyproject.toml
└── README.md

License

BSD-4-Clause

Copyright

Copyright (c) 2024-2026, Orange. All rights reserved.

Maintainer

About

FGdiffusion is a method that combines a diffusion model with a flattened graph representation as input and applies ontology-based constraints at inference to generate large-scale, domain-relevant knowledge graphs.
