DAO: Siamese Foundation Models for Crystal Structure Prediction

DAO (Diffusion-based crystAl Omni) presents a pair of Siamese foundation models for materials science:

  • DAO-G: A generative model for stable crystal structure prediction (CSP), capable of generating diverse polymorphic structures.
  • DAO-P: A predictive model for energy and property prediction, which acts as an energy guider for DAO-G to steer generation towards thermodynamic stability.

Both models are built upon Crysformer, an equivariant graph transformer, and are pretrained on CrysDB (940K entries) via a novel two-stage pretraining strategy involving unstable structure relaxation.

🌐 Website: https://glad-ruc.github.io/DAO/

Installation

We recommend using Conda to manage the environment to ensure compatibility with the specific PyTorch and CUDA versions used in our experiments. The repo ships a one-shot setup.sh that:

  • creates a conda env (default name: dao) with Python 3.8.18,
  • installs pytorch==1.10.0 + cudatoolkit=11.3 via conda,
  • installs the rest of the stack pinned in pyproject.toml via pip install -e .

bash setup.sh
conda activate dao
python -m dao doctor   # verify installation

To use a different env name (e.g., my_dao):

ENV_NAME=my_dao bash setup.sh
conda activate my_dao

Note: pyproject.toml pins prebuilt PyG wheels (torch-scatter, torch-sparse, torch-cluster) for Linux x86_64 + Python 3.8 + CUDA 11.3. On other OS/CUDA combinations you will need to adjust those URLs manually.

Environment variables (optional)

The scripts/CLI rely on a few environment variables to locate the repo and decide where to write outputs. You usually do not need to set these manually (the CLI sets reasonable defaults):

  • PROJECT_ROOT: repository root (used to find conf/).
  • HYDRA_JOBS: Hydra run directory root (default: outputs/hydra).
  • WANDB_DIR: W&B run directory root (default: outputs/wandb).
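
For example, to pin everything under the current checkout (the values below are only illustrative; the CLI falls back to its own defaults when these are unset):

```shell
# Optional overrides; values here are hypothetical examples.
export PROJECT_ROOT="$PWD"                       # repo root, used to find conf/
export HYDRA_JOBS="$PROJECT_ROOT/outputs/hydra"  # Hydra run directory root
export WANDB_DIR="$PROJECT_ROOT/outputs/wandb"   # W&B run directory root
```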

Data and Checkpoints

To replicate our results or use the models, you need to download the datasets and pretrained checkpoints.

Datasets

Download the datasets (MP-20, MPTS-52, SuperCon, etc.) from the shared Drive folder and place them in the data/ directory, or use gdown to download them directly:

pip install gdown
gdown https://drive.google.com/drive/folders/1SOOvLycBhsOKKp3qX_l6SkASjfwf7-7A?usp=drive_link --output ./data --folder

Checkpoints

Download the pretrained and finetuned checkpoints from the shared Drive folder and place them in the ckpts/ directory, or use gdown to download them directly:

pip install gdown
gdown https://drive.google.com/drive/folders/1msp-D3uWD0fJrwE7qrbRXok1u-FoOAdc?usp=drive_link --output ./ckpts --folder

The sampling/inference scripts expect the checkpoint directory to also contain the saved scalers (e.g. prop_scaler.pt).

Repository Structure

  • dao/: Source code for models and CLI.
    • cli.py: CLI entrypoint (python -m dao).
    • common/: Shared utilities, constants, data processing.
    • pl_modules/: PyTorch Lightning modules (CrysFormer, CSPNet, Diffusion, etc.).
    • pl_data/: Dataset and datamodule classes.
  • conf/: Hydra configuration files (data, model, optimizer, etc.).
  • scripts/: Backend scripts invoked by the CLI.
    • run/: Generation, finetune, and conversion launchers.
    • eval/: Structure evaluation utilities.
    • infer/: Property/energy inference utilities.
    • data/: Dataset preparation (CIF/CSV to cached .pt).
  • data/: Benchmark datasets (CSV + cached *_ori.pt).
  • ckpts/: Pretrained/finetuned checkpoints (plus scalers).
  • setup.sh: One-shot conda environment setup.
  • pyproject.toml: Package metadata and pinned dependencies.

Usage

All interactions go through the dao CLI (python -m dao).

CLI Overview

Command                           Description
dao csp generate-from-formula     Generate structures from chemical formula(s)
dao csp generate                  Generate structures from a benchmark dataset
dao csp convert                   Convert eval_diff_*.pt to CIF / POSCAR
dao csp evaluate                  Evaluate generated structures (MR / RMSD)
dao prop predict-from-cif         Predict properties directly from CIF files
dao prop predict                  Predict properties from cached datasets
dao supercon generate             Generate superconductor structures
dao csp finetune                  Finetune DAO-G on a downstream dataset
dao doctor                        Check repo layout and environment

Generate Structures

DAO-G generates crystal structures via a diffusion process. You can optionally enable energy guidance using DAO-P to steer generation towards lower-energy (more stable) structures.

From chemical formulas

The most common use case: predict structures for arbitrary compositions without needing a benchmark dataset.

Single formula:

python -m dao csp generate-from-formula \
  --model-path ckpts/finetune_mp_20 \
  --formula "Li2FeO4" \
  --num-evals 1 \
  --write-cifs \
  --output-dir ./outputs

Batch input file (one formula per line; optional second column = num_atoms):

python -m dao csp generate-from-formula \
  --model-path ckpts/finetune_mp_20 \
  --input-file my_formulas.txt \
  --num-evals 1 \
  --write-cifs

my_formulas.txt example:

# formula              num_atoms (optional)
Li2FeO4
SrTiO3                 5
NaCl                   8
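
The format above is simple enough to sketch in a few lines of Python. The hypothetical parser below only illustrates the format (one formula per line, optional second column num_atoms, `#` starting a comment); it is not the CLI's actual reader:

```python
# Illustrative parser for the batch-input format; not the project's own code.
def parse_formula_file(text: str):
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = line.split()
        formula = parts[0]
        num_atoms = int(parts[1]) if len(parts) > 1 else None  # optional column
        entries.append((formula, num_atoms))
    return entries

sample = """# formula              num_atoms (optional)
Li2FeO4
SrTiO3                 5
NaCl                   8
"""
print(parse_formula_file(sample))
# → [('Li2FeO4', None), ('SrTiO3', 5), ('NaCl', 8)]
```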

Outputs:

  • eval_diff_<label>.pt (default label formula_<num_evals>) under --output-dir (if not specified, saved in --model-path),
  • one CIF per (formula, eval) pair under <output_dir>/cifs/ when --write-cifs is set.

From benchmark datasets

Reproduce paper results on MP-20, MPTS-52, etc.

Single GPU:

python -m dao csp generate \
  --dataset mp_20 \
  --model-path ckpts/finetune_mp_20 \
  --num-evals 1 \
  --num-gpus 1

Multi-GPU (shard workload across GPUs):

python -m dao csp generate \
  --dataset mp_20 \
  --model-path ckpts/finetune_mp_20 \
  --num-evals 1 \
  --num-gpus 4 \
  --base-gpu 0

Energy-guided generation

Both modes support energy guidance. Add --energy-guidance --energy-model-path <dao-p-ckpt> to steer the diffusion process towards lower-energy structures:

python -m dao csp generate-from-formula \
  --model-path ckpts/finetune_mp_20 \
  --formula "Li2FeO4" \
  --energy-guidance \
  --energy-model-path ckpts/dao_p/last.ckpt \
  --num-evals 1

Convert to CIF / POSCAR

Both csp generate and csp generate-from-formula produce eval_diff_*.pt tensors. Use csp convert to extract human-readable structure files:

# Convert all evals to CIF (default)
python -m dao csp convert ckpts/finetune_mp_20/eval_diff_1_all.pt

# Convert only eval index 0, output as POSCAR (.vasp)
python -m dao csp convert ckpts/finetune_mp_20/eval_diff_1_all.pt \
  --format poscar --eval-idx 0

# Custom output directory
python -m dao csp convert ckpts/finetune_mp_20/eval_diff_1_all.pt \
  --out-dir ./my_structures

By default, files are written to <pt_file_dir>/<pt_stem>_structures/. For generate-from-formula outputs, filenames include the formula; for benchmark outputs, they use an index.

About --eval-idx: When running generation with --num-evals N > 1, the model produces N candidate structures per input. All candidates are stored in a single .pt file (the first dimension of frac_coords, lattices, etc. is N). Use --eval-idx to select which candidate to convert (0-indexed). The default (-1) converts all candidates.
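
As a sketch of what that layout implies, the snippet below builds a synthetic .pt file with a leading eval dimension and selects one candidate, mimicking --eval-idx. The keys and shapes are assumptions for illustration, not the project's exact schema:

```python
import torch

# Hypothetical layout: a torch-saved dict whose tensors are stacked along a
# leading eval dimension of size N (= --num-evals). Shapes are illustrative.
N, atoms = 3, 8
fake = {
    "frac_coords": torch.rand(N, atoms, 3),  # fractional coordinates per eval
    "lattices": torch.rand(N, 3, 3),         # lattice matrix per eval
}
torch.save(fake, "eval_diff_demo.pt")

data = torch.load("eval_diff_demo.pt")
eval_idx = 0  # analogous to --eval-idx 0 (0-indexed candidate selection)
candidate = {k: v[eval_idx] for k, v in data.items()}
print(candidate["frac_coords"].shape)  # torch.Size([8, 3])
```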

Predict Properties

DAO-P predicts properties (e.g., energy above hull, band gap) for crystal structures. The command prints MAE and saves predicted properties to a .npy file.

From CIF files

The easiest way: point to a directory of CIF files.

python -m dao prop predict-from-cif \
  --cif-dir my_cifs/ \
  --model-path ckpts/dao_p/last.ckpt \
  --pred-energy

This writes pred_<prop>_of_custom.npy inside --cif-dir. Intermediate files (CSV + .pt cache) are created in a temporary directory and cleaned up automatically; use --keep-cache to preserve them.
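
Reading the predictions back is a one-liner with NumPy. The snippet writes a synthetic file first so it runs standalone; the real file name depends on the predicted property:

```python
import numpy as np

# Stand-in for e.g. my_cifs/pred_energy_of_custom.npy (synthetic values).
np.save("pred_energy_of_custom.npy", np.array([-1.23, -0.98, -2.01]))

preds = np.load("pred_energy_of_custom.npy")  # one value per input structure
print(f"{len(preds)} structures, mean prediction {preds.mean():.3f}")
```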

From cached datasets

python -m dao prop predict \
  --mode dataset \
  --ori-path data/mp_20/test_ori.pt \
  --model-path ckpts/dao_p_dedup/last.ckpt \
  --pred-energy

This saves pred_<prop>_of_<ori_file>.npy next to --ori-path.

From generated structures

python -m dao prop predict \
  --mode generated \
  --ori-path data/mp_20/test_ori.pt \
  --eval-path ckpts/finetune_mp_20/eval_diff_1_all.pt \
  --model-path ckpts/dao_p_dedup/last.ckpt \
  --pred-energy \
  --sample-size 1

Note: To predict a property other than energy (Ehull), point --model-path at a DAO-P checkpoint finetuned for that property, specify the target property with --prop, and omit the --pred-energy flag. Critical temperature (Tc) prediction is one example.

Evaluate Structures

Evaluate generated structures against ground truth using Match Rate (MR) and RMSD metrics.

python -m dao csp evaluate \
  --dataset mp_20 \
  --root-path ckpts/finetune_mp_20 \
  --num-evals 1 \
  --label 1_all

  • --root-path: Directory containing generation results (eval_diff_*.pt).
  • --label: Suffix of the generated file (e.g., 1_all for eval_diff_1_all.pt).

Superconductor Discovery

DAO demonstrates significant potential in discovering superconductors.

Generate superconductor structures

For ordered superconductors in the SuperCon dataset without experimentally resolved structures (supercon_rest, 748 entries):

python -m dao supercon generate \
  --dataset supercon_rest \
  --model-path ckpts/finetune_gen_supercon \
  --energy-guidance \
  --energy-model-path ckpts/dao_p/last.ckpt \
  --num-evals 1 \
  --gpu 0

For three real-world superconductors, just replace supercon_rest with supercon_realworld and set --num-evals 20.

Critical Temperature (Tc) Prediction

For the three real-world superconductors not in the SuperCon3D dataset (supercon_real, 3 entries):

for fold in {0..4}; do
  model_path="ckpts/finetuned_tc_model_fold_${fold}/last.ckpt"
  python -m dao prop predict \
    --mode dataset \
    --ori-path data/super_conductors/real_world/output_ori.pt \
    --model-path "$model_path" \
    --prop logtc
done

Then average the results of the five folds to get the final predictions.
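
A minimal sketch of that averaging step, with synthetic per-fold files standing in for the real outputs (the file names are hypothetical, and converting back to Tc assumes logtc is log10 of Tc):

```python
import numpy as np

# Synthetic stand-ins for the five per-fold prediction files.
rng = np.random.default_rng(0)
for fold in range(5):
    np.save(f"pred_logtc_fold_{fold}.npy", rng.normal(1.0, 0.05, size=3))

stacked = np.stack([np.load(f"pred_logtc_fold_{f}.npy") for f in range(5)])
mean_logtc = stacked.mean(axis=0)  # final prediction per structure
tc = 10.0 ** mean_logtc           # assumption: logtc = log10(Tc in K)
print(tc.round(1))
```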

Training

Finetuning

To adapt a pretrained DAO-G model to a specific downstream dataset (e.g., MP-20):

python -m dao csp finetune \
  --dataset mp_20 \
  --pretrain-ckpt ckpts/dao_g/last.ckpt \
  --epochs 1000 \
  --lr 2e-5 \
  --weight-decay 1e-5 \
  --gpus 1

The finetune outputs are written by Hydra to outputs/hydra/singlerun/<date>/finetune_<dataset>/ by default. Use that directory as --model-path for subsequent csp generate/csp evaluate.

Pretraining

To reproduce the pretraining of DAO-G (Stage I & II) and DAO-P, please refer to the scripts in scripts/run/:

bash scripts/run/run_pretrain.sh

Citation

If you find this repository useful, please cite our paper:

@article{wu2026dao,
  title = {Siamese foundation models for crystal structure prediction},
  issn = {2041-1723},
  doi = {10.1038/s41467-026-72362-3},
  journal = {Nature Communications},
  author = {Wu, Liming and Huang, Wenbing and Jiao, Rui and Huang, Jianxing and Liu, Liwei and Zhou, Yipeng and Sun, Hao and Liu, Yang and Sun, Fuchun and Ren, Yuxiang and Wen, Ji-Rong},
  year = {2026},
}

Contact

If you have any questions, feedback, or collaboration ideas, feel free to reach out: wlm155@126.com

License

This project is licensed under the MIT License.

About

[Nature Communications] Official code for "Siamese Foundation Models for Crystal Structure Prediction".
