DAO: Siamese Foundation Models for Crystal Structure Prediction

DAO (Diffusion-based crystAl Omni) presents a pair of Siamese foundation models for materials science:

  • DAO-G: A generative model for stable crystal structure prediction (CSP), capable of generating diverse polymorphic structures.
  • DAO-P: A predictive model for energy and property prediction, which acts as an energy guider for DAO-G to steer generation towards thermodynamic stability.

Both models are built upon Crysformer, an equivariant graph transformer, and are pretrained on CrysDB (940K entries) via a novel two-stage pretraining strategy involving unstable structure relaxation.

🌐 Website: https://glad-ruc.github.io/DAO/

Installation

We recommend using Conda to manage the environment to ensure compatibility with the specific PyTorch and CUDA versions used in our experiments. The repo ships a one-shot setup.sh that:

  • creates a conda env (default name: dao) with Python 3.8.18,
  • installs pytorch==1.10.0 + cudatoolkit=11.3 via conda,
  • installs the rest of the stack pinned in pyproject.toml via pip install -e .

bash setup.sh
conda activate dao
python -m dao doctor   # verify installation

To use a different env name (e.g., my_dao):

ENV_NAME=my_dao bash setup.sh
conda activate my_dao

Note: pyproject.toml pins prebuilt PyG wheels (torch-scatter, torch-sparse, torch-cluster) for Linux x86_64 + Python 3.8 + CUDA 11.3. On other OS/CUDA combinations you will need to adjust those URLs manually.

Environment variables (optional)

The scripts/CLI rely on a few environment variables to locate the repo and decide where to write outputs. You usually do not need to set these manually (the CLI sets reasonable defaults):

  • PROJECT_ROOT: repository root (used to find conf/).
  • HYDRA_JOBS: Hydra run directory root (default: outputs/hydra).
  • WANDB_DIR: W&B run directory root (default: outputs/wandb).
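
For example, to pin everything under the current checkout (the values below are only illustrative; the CLI falls back to its own defaults when these are unset):

```shell
# Optional overrides; values here are hypothetical examples.
export PROJECT_ROOT="$PWD"                       # repo root, used to find conf/
export HYDRA_JOBS="$PROJECT_ROOT/outputs/hydra"  # Hydra run directory root
export WANDB_DIR="$PROJECT_ROOT/outputs/wandb"   # W&B run directory root
```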

Data and Checkpoints

To replicate our results or use the models, you need to download the datasets and pretrained checkpoints.

Datasets

Download the datasets (MP-20, MPTS-52, SuperCon, etc.) from the shared Drive folder and place them in the data/ directory, or use gdown to download them directly:

pip install gdown
gdown https://drive.google.com/drive/folders/1SOOvLycBhsOKKp3qX_l6SkASjfwf7-7A?usp=drive_link --output ./data --folder

Checkpoints

Download the pretrained and finetuned checkpoints from the shared Drive folder and place them in the ckpts/ directory, or use gdown to download them directly:

pip install gdown
gdown https://drive.google.com/drive/folders/1msp-D3uWD0fJrwE7qrbRXok1u-FoOAdc?usp=drive_link --output ./ckpts --folder

The sampling/inference scripts expect the checkpoint directory to also contain the saved scalers (e.g. prop_scaler.pt).

Repository Structure

  • dao/: Source code for models and CLI.
    • cli.py: CLI entrypoint (python -m dao).
    • common/: Shared utilities, constants, data processing.
    • pl_modules/: PyTorch Lightning modules (CrysFormer, CSPNet, Diffusion, etc.).
    • pl_data/: Dataset and datamodule classes.
  • conf/: Hydra configuration files (data, model, optimizer, etc.).
  • scripts/: Backend scripts invoked by the CLI.
    • run/: Generation, finetune, and conversion launchers.
    • eval/: Structure evaluation utilities.
    • infer/: Property/energy inference utilities.
    • data/: Dataset preparation (CIF/CSV to cached .pt).
  • data/: Benchmark datasets (CSV + cached *_ori.pt).
  • ckpts/: Pretrained/finetuned checkpoints (plus scalers).
  • setup.sh: One-shot conda environment setup.
  • pyproject.toml: Package metadata and pinned dependencies.

Usage

All interactions go through the dao CLI (python -m dao).

CLI Overview

Command                           Description
dao csp generate-from-formula     Generate structures from chemical formula(s)
dao csp generate                  Generate structures from a benchmark dataset
dao csp convert                   Convert eval_diff_*.pt to CIF / POSCAR
dao csp evaluate                  Evaluate generated structures (MR / RMSD)
dao prop predict-from-cif         Predict properties directly from CIF files
dao prop predict                  Predict properties from cached datasets
dao supercon generate             Generate superconductor structures
dao csp finetune                  Finetune DAO-G on a downstream dataset
dao doctor                        Check repo layout and environment

Generate Structures

DAO-G generates crystal structures via a diffusion process. You can optionally enable energy guidance using DAO-P to steer generation towards lower-energy (more stable) structures.

From chemical formulas

The most common use case: predict structures for arbitrary compositions without needing a benchmark dataset.

Single formula:

python -m dao csp generate-from-formula \
  --model-path ckpts/finetune_mp_20 \
  --formula "Li2FeO4" \
  --num-evals 1 \
  --write-cifs \
  --output-dir ./outputs

Batch input file (one formula per line; optional second column = num_atoms):

python -m dao csp generate-from-formula \
  --model-path ckpts/finetune_mp_20 \
  --input-file my_formulas.txt \
  --num-evals 1 \
  --write-cifs

my_formulas.txt example:

# formula              num_atoms (optional)
Li2FeO4
SrTiO3                 5
NaCl                   8
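
The format above is simple enough to sketch in a few lines of Python. The hypothetical parser below only illustrates the format (one formula per line, optional second column num_atoms, `#` starting a comment); it is not the CLI's actual reader:

```python
# Illustrative parser for the batch-input format; not the project's own code.
def parse_formula_file(text: str):
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = line.split()
        formula = parts[0]
        num_atoms = int(parts[1]) if len(parts) > 1 else None  # optional column
        entries.append((formula, num_atoms))
    return entries

sample = """# formula              num_atoms (optional)
Li2FeO4
SrTiO3                 5
NaCl                   8
"""
print(parse_formula_file(sample))
# → [('Li2FeO4', None), ('SrTiO3', 5), ('NaCl', 8)]
```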

Outputs:

  • eval_diff_<label>.pt (default label formula_<num_evals>) under --output-dir (if not specified, saved in --model-path),
  • one CIF per (formula, eval) pair under <output_dir>/cifs/ when --write-cifs is set.

From benchmark datasets

Reproduce paper results on MP-20, MPTS-52, etc.

Single GPU:

python -m dao csp generate \
  --dataset mp_20 \
  --model-path ckpts/finetune_mp_20 \
  --num-evals 1 \
  --num-gpus 1

Multi-GPU (shard workload across GPUs):

python -m dao csp generate \
  --dataset mp_20 \
  --model-path ckpts/finetune_mp_20 \
  --num-evals 1 \
  --num-gpus 4 \
  --base-gpu 0

Energy-guided generation

Both modes support energy guidance. Add --energy-guidance --energy-model-path <dao-p-ckpt> to steer the diffusion process towards lower-energy structures:

python -m dao csp generate-from-formula \
  --model-path ckpts/finetune_mp_20 \
  --formula "Li2FeO4" \
  --energy-guidance \
  --energy-model-path ckpts/dao_p/last.ckpt \
  --num-evals 1

Convert to CIF / POSCAR

Both csp generate and csp generate-from-formula produce eval_diff_*.pt tensors. Use csp convert to extract human-readable structure files:

# Convert all evals to CIF (default)
python -m dao csp convert ckpts/finetune_mp_20/eval_diff_1_all.pt

# Convert only eval index 0, output as POSCAR (.vasp)
python -m dao csp convert ckpts/finetune_mp_20/eval_diff_1_all.pt \
  --format poscar --eval-idx 0

# Custom output directory
python -m dao csp convert ckpts/finetune_mp_20/eval_diff_1_all.pt \
  --out-dir ./my_structures

By default, files are written to <pt_file_dir>/<pt_stem>_structures/. For generate-from-formula outputs, filenames include the formula; for benchmark outputs, they use an index.

About --eval-idx: When running generation with --num-evals N > 1, the model produces N candidate structures per input. All candidates are stored in a single .pt file (the first dimension of frac_coords, lattices, etc. is N). Use --eval-idx to select which candidate to convert (0-indexed). The default (-1) converts all candidates.
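
As a sketch of what that layout implies, the snippet below builds a synthetic .pt file with a leading eval dimension and selects one candidate, mimicking --eval-idx. The keys and shapes are assumptions for illustration, not the project's exact schema:

```python
import torch

# Hypothetical layout: a torch-saved dict whose tensors are stacked along a
# leading eval dimension of size N (= --num-evals). Shapes are illustrative.
N, atoms = 3, 8
fake = {
    "frac_coords": torch.rand(N, atoms, 3),  # fractional coordinates per eval
    "lattices": torch.rand(N, 3, 3),         # lattice matrix per eval
}
torch.save(fake, "eval_diff_demo.pt")

data = torch.load("eval_diff_demo.pt")
eval_idx = 0  # analogous to --eval-idx 0 (0-indexed candidate selection)
candidate = {k: v[eval_idx] for k, v in data.items()}
print(candidate["frac_coords"].shape)  # torch.Size([8, 3])
```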

Predict Properties

DAO-P predicts properties (e.g., energy above hull, band gap) for crystal structures. The command prints MAE and saves predicted properties to a .npy file.

From CIF files

The easiest way: point to a directory of CIF files.

python -m dao prop predict-from-cif \
  --cif-dir my_cifs/ \
  --model-path ckpts/dao_p/last.ckpt \
  --pred-energy

This writes pred_<prop>_of_custom.npy inside --cif-dir. Intermediate files (CSV + .pt cache) are created in a temporary directory and cleaned up automatically; use --keep-cache to preserve them.
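
Reading the predictions back is a one-liner with NumPy. The snippet writes a synthetic file first so it runs standalone; the real file name depends on the predicted property:

```python
import numpy as np

# Stand-in for e.g. my_cifs/pred_energy_of_custom.npy (synthetic values).
np.save("pred_energy_of_custom.npy", np.array([-1.23, -0.98, -2.01]))

preds = np.load("pred_energy_of_custom.npy")  # one value per input structure
print(f"{len(preds)} structures, mean prediction {preds.mean():.3f}")
```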

From cached datasets

python -m dao prop predict \
  --mode dataset \
  --ori-path data/mp_20/test_ori.pt \
  --model-path ckpts/dao_p_dedup/last.ckpt \
  --pred-energy

This saves pred_<prop>_of_<ori_file>.npy next to --ori-path.

From generated structures

python -m dao prop predict \
  --mode generated \
  --ori-path data/mp_20/test_ori.pt \
  --eval-path ckpts/finetune_mp_20/eval_diff_1_all.pt \
  --model-path ckpts/dao_p_dedup/last.ckpt \
  --pred-energy \
  --sample-size 1

Note: To predict a property other than energy (Ehull), point --model-path at a DAO-P checkpoint finetuned for that property, specify the target property with --prop, and omit the --pred-energy flag. Critical temperature (Tc) prediction is one example.

Evaluate Structures

Evaluate generated structures against ground truth using Match Rate (MR) and RMSD metrics.

python -m dao csp evaluate \
  --dataset mp_20 \
  --root-path ckpts/finetune_mp_20 \
  --num-evals 1 \
  --label 1_all

  • --root-path: Directory containing generation results (eval_diff_*.pt).
  • --label: Suffix of the generated file (e.g., 1_all for eval_diff_1_all.pt).

Superconductor Discovery

DAO demonstrates significant potential in discovering superconductors.

Generate superconductor structures

For ordered superconductors in the SuperCon dataset without experimentally resolved structures (supercon_rest, 748 entries):

python -m dao supercon generate \
  --dataset supercon_rest \
  --model-path ckpts/finetune_gen_supercon \
  --energy-guidance \
  --energy-model-path ckpts/dao_p/last.ckpt \
  --num-evals 1 \
  --gpu 0

For three real-world superconductors, just replace supercon_rest with supercon_realworld and set --num-evals 20.

Critical Temperature (Tc) Prediction

For the three real-world superconductors not in the SuperCon3D dataset (supercon_real, 3 entries):

for fold in {0..4}; do
  model_path="ckpts/finetuned_tc_model_fold_${fold}/last.ckpt"
  python -m dao prop predict \
    --mode dataset \
    --ori-path data/super_conductors/real_world/output_ori.pt \
    --model-path "$model_path" \
    --prop logtc
done

Then average the results of the five folds to get the final predictions.
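
A minimal sketch of that averaging step, with synthetic per-fold files standing in for the real outputs (the file names are hypothetical, and converting back to Tc assumes logtc is log10 of Tc):

```python
import numpy as np

# Synthetic stand-ins for the five per-fold prediction files.
rng = np.random.default_rng(0)
for fold in range(5):
    np.save(f"pred_logtc_fold_{fold}.npy", rng.normal(1.0, 0.05, size=3))

stacked = np.stack([np.load(f"pred_logtc_fold_{f}.npy") for f in range(5)])
mean_logtc = stacked.mean(axis=0)  # final prediction per structure
tc = 10.0 ** mean_logtc           # assumption: logtc = log10(Tc in K)
print(tc.round(1))
```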

Training

Finetuning

To adapt a pretrained DAO-G model to a specific downstream dataset (e.g., MP-20):

python -m dao csp finetune \
  --dataset mp_20 \
  --pretrain-ckpt ckpts/dao_g/last.ckpt \
  --epochs 1000 \
  --lr 2e-5 \
  --weight-decay 1e-5 \
  --gpus 1

The finetune outputs are written by Hydra to outputs/hydra/singlerun/<date>/finetune_<dataset>/ by default. Use that directory as --model-path for subsequent csp generate/csp evaluate.

Pretraining

To reproduce the pretraining of DAO-G (Stage I & II) and DAO-P, please refer to the scripts in scripts/run/:

bash scripts/run/run_pretrain.sh

Citation

If you find this repository useful, please cite our paper:

@article{wu2026dao,
  title = {Siamese foundation models for crystal structure prediction},
  issn = {2041-1723},
  doi = {10.1038/s41467-026-72362-3},
  journal = {Nature Communications},
  author = {Wu, Liming and Huang, Wenbing and Jiao, Rui and Huang, Jianxing and Liu, Liwei and Zhou, Yipeng and Sun, Hao and Liu, Yang and Sun, Fuchun and Ren, Yuxiang and Wen, Ji-Rong},
  year = {2026},
}

Contact

If you have any questions, feedback, or collaboration ideas, feel free to reach out: wlm155@126.com

License

This project is licensed under the MIT License.

About

[Nature Communications] Official code for "Siamese Foundation Models for Crystal Structure Prediction".
