DAO (Diffusion-based crystAl Omni) presents a pair of Siamese foundation models for materials science:
- DAO-G: A generative model for stable crystal structure prediction (CSP), capable of generating diverse polymorphic structures.
- DAO-P: A predictive model for energy and property prediction, which acts as an energy guider for DAO-G to steer generation towards thermodynamic stability.
Both models are built upon Crysformer, an equivariant graph transformer, and are pretrained on CrysDB (940K entries) via a novel two-stage pretraining strategy involving unstable structure relaxation.
🌐 Website: https://glad-ruc.github.io/DAO/
We recommend using Conda to manage the environment to ensure compatibility with the specific PyTorch and CUDA versions used in our experiments. The repo ships a one-shot `setup.sh` that:

- creates a conda env (default name: `dao`) with Python 3.8.18,
- installs `pytorch==1.10.0` + `cudatoolkit=11.3` via conda,
- installs the rest of the stack pinned in `pyproject.toml` via `pip install -e .`.
```bash
bash setup.sh
conda activate dao
python -m dao doctor  # verify installation
```

To use a different env name (e.g., `my_dao`):
```bash
ENV_NAME=my_dao bash setup.sh
conda activate my_dao
```

Note: `pyproject.toml` pins prebuilt PyG wheels (`torch-scatter`, `torch-sparse`, `torch-cluster`) for Linux x86_64 + Python 3.8 + CUDA 11.3. On other OS/CUDA combinations you will need to adjust those URLs manually.
Environment variables (optional)
The scripts/CLI rely on a few environment variables to locate the repo and decide where to write outputs. You usually do not need to set these manually (the CLI sets reasonable defaults):
- `PROJECT_ROOT`: repository root (used to find `conf/`).
- `HYDRA_JOBS`: Hydra run directory root (default: `outputs/hydra`).
- `WANDB_DIR`: W&B run directory root (default: `outputs/wandb`).
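A minimal sketch of how such environment-variable defaults can be resolved (the actual logic lives inside the `dao` CLI; only the variable names and defaults above come from this README):

```python
# Sketch only: resolve the three directories with environment overrides,
# falling back to the defaults documented above.
import os
from pathlib import Path

def resolve_dirs(project_root: str = ".") -> dict:
    root = Path(os.environ.get("PROJECT_ROOT", project_root))
    return {
        "PROJECT_ROOT": root,
        "HYDRA_JOBS": Path(os.environ.get("HYDRA_JOBS", root / "outputs" / "hydra")),
        "WANDB_DIR": Path(os.environ.get("WANDB_DIR", root / "outputs" / "wandb")),
    }
```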
To replicate our results or use the models, you need to download the datasets and pretrained checkpoints.
Download the datasets (MP-20, MPTS-52, SuperCon, etc.) and place them in the `data/` directory:
or use `gdown` to download directly:

```bash
pip install gdown
gdown https://drive.google.com/drive/folders/1SOOvLycBhsOKKp3qX_l6SkASjfwf7-7A?usp=drive_link --output ./data --folder
```

Download the pretrained and finetuned checkpoints and place them in the `ckpts/` directory:
or use `gdown` to download directly:

```bash
pip install gdown
gdown https://drive.google.com/drive/folders/1msp-D3uWD0fJrwE7qrbRXok1u-FoOAdc?usp=drive_link --output ./ckpts --folder
```

The sampling/inference scripts expect the checkpoint directory to also contain the saved scalers (e.g. `prop_scaler.pt`).
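To avoid confusing errors later, it can help to sanity-check a checkpoint directory before running inference. A hypothetical check (the `last.ckpt` and `prop_scaler.pt` names come from this README; any other expectations are assumptions):

```python
# Hypothetical sanity check: verify a checkpoint directory contains both
# a Lightning checkpoint and the saved property scaler mentioned above.
from pathlib import Path

def check_ckpt_dir(ckpt_dir: str) -> list:
    """Return the list of expected files missing from ckpt_dir."""
    d = Path(ckpt_dir)
    expected = ["last.ckpt", "prop_scaler.pt"]
    return [name for name in expected if not (d / name).exists()]

missing = check_ckpt_dir("ckpts/dao_p")
if missing:
    print(f"missing files: {missing}")
```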
- `dao/`: Source code for models and CLI.
  - `cli.py`: CLI entrypoint (`python -m dao`).
  - `common/`: Shared utilities, constants, data processing.
  - `pl_modules/`: PyTorch Lightning modules (CrysFormer, CSPNet, Diffusion, etc.).
  - `pl_data/`: Dataset and datamodule classes.
- `conf/`: Hydra configuration files (data, model, optimizer, etc.).
- `scripts/`: Backend scripts invoked by the CLI.
  - `run/`: Generation, finetune, and conversion launchers.
  - `eval/`: Structure evaluation utilities.
  - `infer/`: Property/energy inference utilities.
  - `data/`: Dataset preparation (CIF/CSV to cached `.pt`).
- `data/`: Benchmark datasets (CSV + cached `*_ori.pt`).
- `ckpts/`: Pretrained/finetuned checkpoints (plus scalers).
- `setup.sh`: One-shot conda environment setup.
- `pyproject.toml`: Package metadata and pinned dependencies.
All interactions go through the `dao` CLI (`python -m dao`).
| Command | Description |
|---|---|
| `dao csp generate-from-formula` | Generate structures from chemical formula(s) |
| `dao csp generate` | Generate structures from a benchmark dataset |
| `dao csp convert` | Convert `eval_diff_*.pt` to CIF / POSCAR |
| `dao csp evaluate` | Evaluate generated structures (MR / RMSD) |
| `dao prop predict-from-cif` | Predict properties directly from CIF files |
| `dao prop predict` | Predict properties from cached datasets |
| `dao supercon generate` | Generate superconductor structures |
| `dao csp finetune` | Finetune DAO-G on a downstream dataset |
| `dao doctor` | Check repo layout and environment |
DAO-G generates crystal structures via a diffusion process. You can optionally enable energy guidance using DAO-P to steer generation towards lower-energy (more stable) structures.
The most common use case: predict structures for arbitrary compositions without needing a benchmark dataset.
Single formula:

```bash
python -m dao csp generate-from-formula \
  --model-path ckpts/finetune_mp_20 \
  --formula "Li2FeO4" \
  --num-evals 1 \
  --write-cifs \
  --output-dir ./outputs
```

Batch input file (one formula per line; optional second column = `num_atoms`):
```bash
python -m dao csp generate-from-formula \
  --model-path ckpts/finetune_mp_20 \
  --input-file my_formulas.txt \
  --num-evals 1 \
  --write-cifs
```

`my_formulas.txt` example:

```text
# formula num_atoms (optional)
Li2FeO4
SrTiO3 5
NaCl 8
```
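The input format above can be sketched as a few lines of parsing logic (illustrative only; the real CLI does its own parsing): one formula per line, an optional second column `num_atoms`, and `#` lines treated as comments.

```python
# Sketch of the batch input format: formula per line, optional num_atoms,
# '#'-prefixed lines and blanks skipped.
def parse_formula_file(text: str):
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        parts = line.split()
        formula = parts[0]
        num_atoms = int(parts[1]) if len(parts) > 1 else None
        entries.append((formula, num_atoms))
    return entries

print(parse_formula_file("# header\nLi2FeO4\nSrTiO3 5\nNaCl 8\n"))
# [('Li2FeO4', None), ('SrTiO3', 5), ('NaCl', 8)]
```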
Outputs:

- `eval_diff_<label>.pt` (default label `formula_<num_evals>`) under `--output-dir` (if not specified, saved in `--model-path`),
- one CIF per (formula, eval) pair under `<output_dir>/cifs/` when `--write-cifs` is set.
Reproduce paper results on MP-20, MPTS-52, etc.
Single GPU:
```bash
python -m dao csp generate \
  --dataset mp_20 \
  --model-path ckpts/finetune_mp_20 \
  --num-evals 1 \
  --num-gpus 1
```

Multi-GPU (shard workload across GPUs):
```bash
python -m dao csp generate \
  --dataset mp_20 \
  --model-path ckpts/finetune_mp_20 \
  --num-evals 1 \
  --num-gpus 4 \
  --base-gpu 0
```

Both modes support energy guidance. Add `--energy-guidance --energy-model-path <dao-p-ckpt>` to steer the diffusion process towards lower-energy structures:
```bash
python -m dao csp generate-from-formula \
  --model-path ckpts/finetune_mp_20 \
  --formula "Li2FeO4" \
  --energy-guidance \
  --energy-model-path ckpts/dao_p/last.ckpt \
  --num-evals 1
```

Both `csp generate` and `csp generate-from-formula` produce `eval_diff_*.pt` tensors. Use `csp convert` to extract human-readable structure files:
```bash
# Convert all evals to CIF (default)
python -m dao csp convert ckpts/finetune_mp_20/eval_diff_1_all.pt

# Convert only eval index 0, output as POSCAR (.vasp)
python -m dao csp convert ckpts/finetune_mp_20/eval_diff_1_all.pt \
  --format poscar --eval-idx 0

# Custom output directory
python -m dao csp convert ckpts/finetune_mp_20/eval_diff_1_all.pt \
  --out-dir ./my_structures
```

By default, files are written to `<pt_file_dir>/<pt_stem>_structures/`. For `generate-from-formula` outputs, filenames include the formula; for benchmark outputs, they use an index.
About `--eval-idx`: When running generation with `--num-evals N > 1`, the model produces N candidate structures per input. All candidates are stored in a single `.pt` file (the first dimension of `frac_coords`, `lattices`, etc. is N). Use `--eval-idx` to select which candidate to convert (0-indexed). The default (`-1`) converts all candidates.
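The indexing described above can be illustrated on stacked arrays (shapes and key names here are assumptions for illustration, not the repo's exact `.pt` schema):

```python
# Illustrative candidate selection: first dimension of each array is N
# (the number of evals); --eval-idx picks one candidate, -1 keeps all.
import numpy as np

def select_eval(batch, eval_idx):
    """Pick candidate structures from stacked arrays (first dim = N evals)."""
    n = len(next(iter(batch.values())))
    indices = range(n) if eval_idx == -1 else [eval_idx]
    return [{k: v[i] for k, v in batch.items()} for i in indices]

batch = {
    "frac_coords": np.zeros((3, 5, 3)),  # N=3 candidates, 5 atoms, xyz
    "lattices": np.zeros((3, 3, 3)),     # N=3 candidates, 3x3 lattices
}
print(len(select_eval(batch, -1)), len(select_eval(batch, 0)))
# 3 1
```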
DAO-P predicts properties (e.g., energy above hull, band gap) for crystal structures. The command prints MAE and saves predicted properties to a .npy file.
The easiest way: point to a directory of CIF files.
```bash
python -m dao prop predict-from-cif \
  --cif-dir my_cifs/ \
  --model-path ckpts/dao_p/last.ckpt \
  --pred-energy
```

This writes `pred_<prop>_of_custom.npy` inside `--cif-dir`. Intermediate files (CSV + `.pt` cache) are created in a temporary directory and cleaned up automatically; use `--keep-cache` to preserve them.
```bash
python -m dao prop predict \
  --mode dataset \
  --ori-path data/mp_20/test_ori.pt \
  --model-path ckpts/dao_p_dedup/last.ckpt \
  --pred-energy
```

This saves `pred_<prop>_of_<ori_file>.npy` next to `--ori-path`.
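The MAE the command reports is the mean absolute error between predictions and reference values. A self-contained sketch (in practice you would `np.load` the `pred_<prop>_of_*.npy` file the CLI writes):

```python
# Mean absolute error between predicted and reference property values.
import numpy as np

def mae(pred: np.ndarray, target: np.ndarray) -> float:
    return float(np.mean(np.abs(pred - target)))

# Synthetic example values (not real model output):
pred = np.array([0.10, 0.05, 0.30])
target = np.array([0.12, 0.00, 0.25])
print(round(mae(pred, target), 4))
# 0.04
```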
```bash
python -m dao prop predict \
  --mode generated \
  --ori-path data/mp_20/test_ori.pt \
  --eval-path ckpts/finetune_mp_20/eval_diff_1_all.pt \
  --model-path ckpts/dao_p_dedup/last.ckpt \
  --pred-energy \
  --sample-size 1
```

Note: To predict a property other than energy (Ehull), point `--model-path` at the DAO-P model finetuned for that property, provide the target property via `--prop`, and omit the `--pred-energy` flag. Critical temperature prediction is an example.
Evaluate generated structures against ground truth using Match Rate (MR) and RMSD metrics.
```bash
python -m dao csp evaluate \
  --dataset mp_20 \
  --root-path ckpts/finetune_mp_20 \
  --num-evals 1 \
  --label 1_all
```

- `--root-path`: Directory containing generation results (`eval_diff_*.pt`).
- `--label`: Suffix of the generated file (e.g., `1_all` for `eval_diff_1_all.pt`).
DAO demonstrates significant potential in discovering superconductors.
For ordered superconductors without experimentally resolved structures in the SuperCon dataset (`supercon_rest`, 748 entries):
```bash
python -m dao supercon generate \
  --dataset supercon_rest \
  --model-path ckpts/finetune_gen_supercon \
  --energy-guidance \
  --energy-model-path ckpts/dao_p/last.ckpt \
  --num-evals 1 \
  --gpu 0
```

For three real-world superconductors, just replace `supercon_rest` with `supercon_realworld` and set `--num-evals 20`.
To predict the critical temperature for the three real-world superconductors not in the SuperCon3D dataset (`supercon_real`, 3 entries):
```bash
for fold in {0..4}; do
  model_path="ckpts/finetuned_tc_model_fold_${fold}/last.ckpt"
  python -m dao prop predict \
    --mode dataset \
    --ori-path data/super_conductors/real_world/output_ori.pt \
    --model-path $model_path \
    --prop logtc
done
```

Then average the results of the five folds to get the final predictions.
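The averaging step can be sketched as follows, assuming each fold wrote a `.npy` array of per-structure `logtc` predictions (the file layout in the comment is an assumption; adapt paths to where the CLI saved its outputs):

```python
# Stack per-fold prediction arrays and average over folds.
import numpy as np

def average_folds(fold_preds):
    """fold_preds: list of 1-D arrays, one per fold, same length."""
    return np.stack(fold_preds, axis=0).mean(axis=0)

# e.g. fold_preds = [np.load(f) for f in sorted(glob.glob("pred_logtc_of_*fold*.npy"))]
fold_preds = [np.array([1.0, 2.0, 3.0]), np.array([3.0, 2.0, 1.0])]
print(average_folds(fold_preds))
# [2. 2. 2.]
```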
To adapt a pretrained DAO-G model to a specific downstream dataset (e.g., MP-20):
```bash
python -m dao csp finetune \
  --dataset mp_20 \
  --pretrain-ckpt ckpts/dao_g/last.ckpt \
  --epochs 1000 \
  --lr 2e-5 \
  --weight-decay 1e-5 \
  --gpus 1
```

The finetune outputs are written by Hydra to `outputs/hydra/singlerun/<date>/finetune_<dataset>/` by default. Use that directory as `--model-path` for subsequent `csp generate`/`csp evaluate`.
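Since Hydra names run directories by date, a small (hypothetical) helper can locate the latest finetune output to pass as `--model-path`, assuming the layout stated above and ISO-formatted date directories (which sort lexicographically):

```python
# Hypothetical helper: find the most recent Hydra finetune output dir,
# assuming outputs/hydra/singlerun/<date>/finetune_<dataset>/ with
# ISO-formatted <date> (so lexicographic sort = chronological sort).
from pathlib import Path

def latest_finetune_dir(dataset: str, root: str = "outputs/hydra/singlerun"):
    runs = sorted(Path(root).glob(f"*/finetune_{dataset}"))
    return runs[-1] if runs else None
```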
To reproduce the pretraining of DAO-G (Stage I & II) and DAO-P, please refer to the scripts in scripts/run/:
```bash
bash scripts/run/run_pretrain.sh
```

If you find this repository useful, please cite our paper:
```bibtex
@article{wu2026dao,
  title   = {Siamese foundation models for crystal structure prediction},
  issn    = {2041-1723},
  doi     = {10.1038/s41467-026-72362-3},
  journal = {Nature Communications},
  author  = {Wu, Liming and Huang, Wenbing and Jiao, Rui and Huang, Jianxing and Liu, Liwei and Zhou, Yipeng and Sun, Hao and Liu, Yang and Sun, Fuchun and Ren, Yuxiang and Wen, Ji-Rong},
  year    = {2026},
}
```

If you have any questions, feedback, or collaboration ideas, feel free to reach out: wlm155@126.com
This project is licensed under the MIT License.