Add alexandria dataset to matsciml toolkit https://github.com/IntelLabs/matsciml/discussions/107 #132

Merged 27 commits on Mar 13, 2024
Changes from 25 commits
Commits
55ff4c9
add alexandria dataset and download api
JonathanSchmidt1 Jan 30, 2024
4caaacd
remove forgotten lines
JonathanSchmidt1 Jan 30, 2024
5294ed5
Merge pull request #6 from IntelLabs/main
JonathanSchmidt1 Feb 19, 2024
17f819e
remove m3gnet dataset class
JonathanSchmidt1 Feb 19, 2024
9a19502
remove m3gnet dataset class
JonathanSchmidt1 Feb 19, 2024
c7604d7
remove m3gnet dataset class
JonathanSchmidt1 Feb 19, 2024
0ba534e
Update MANIFEST.in with alexandria devset
JonathanSchmidt1 Feb 19, 2024
ff9f05e
Create README.md for alexandria database
JonathanSchmidt1 Feb 19, 2024
425070a
Add missing reference to README.md
JonathanSchmidt1 Feb 19, 2024
146669c
added examples similar to materials project and download example for …
JonathanSchmidt1 Feb 21, 2024
94ef446
added download options for all alexandria datasets to download_datase…
JonathanSchmidt1 Feb 21, 2024
502adec
Added PerdiodicPropertiesTransform to examples and test,
JonathanSchmidt1 Feb 21, 2024
b1ce3af
added pyg transform tests
JonathanSchmidt1 Feb 21, 2024
e19ec1a
Merge branch 'main' of https://github.com/IntelLabs/matsciml into ale…
JonathanSchmidt1 Feb 22, 2024
312941d
changes in devset
JonathanSchmidt1 Mar 4, 2024
06218d2
Merge changes from 'intel/main' into alexandria_api
JonathanSchmidt1 Mar 4, 2024
7e2ac25
Update README.md
JonathanSchmidt1 Mar 4, 2024
10a232f
Update README.md
JonathanSchmidt1 Mar 4, 2024
a26906c
change union operator to update for python 3.8 compatibility
JonathanSchmidt1 Mar 4, 2024
cf05f4e
Merge branch 'alexandria_api' of github.com:JonathanSchmidt1/matsciml…
JonathanSchmidt1 Mar 4, 2024
6760d62
add regression_targets for readability
JonathanSchmidt1 Mar 4, 2024
795eb45
fixed examples (fast devset), devset check before processing in api,…
JonathanSchmidt1 Mar 12, 2024
3739071
removed devset path
JonathanSchmidt1 Mar 12, 2024
90e8e1b
added cell to return_dict from parse structure
JonathanSchmidt1 Mar 12, 2024
e78fa10
reduced cutoff radii in examples
JonathanSchmidt1 Mar 12, 2024
af14c4a
change datamodule in example to from_devset, copy cell tensor to remo…
JonathanSchmidt1 Mar 12, 2024
0569d7e
remove unnecessary import
JonathanSchmidt1 Mar 12, 2024
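Commit a26906c swaps the dict union operator for `update` to stay compatible with Python 3.8. The diff for that change is not shown here, so the following is only a minimal sketch of the general pattern; the dictionary names and values are hypothetical and not taken from the PR.

```python
# Hypothetical dictionaries, used only to illustrate the merge pattern.
base_targets = {"energy_total": -1.0}
extra_targets = {"band_gap_ind": 0.5}

# Python 3.9+ only (PEP 584):
#     merged = base_targets | extra_targets

# Python 3.8 compatible: copy first so base_targets is left unmodified,
# then merge the second dictionary in place.
merged = dict(base_targets)
merged.update(extra_targets)

print(merged)
```

Copying before calling `update` keeps the semantics of `|`, which returns a new dictionary rather than mutating its left operand.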
1 change: 1 addition & 0 deletions MANIFEST.in
@@ -8,3 +8,4 @@ recursive-include matsciml/datasets/nomad/devset *
recursive-include matsciml/datasets/oqmd/devset *
recursive-include matsciml/datasets/symmetry/devset *
recursive-include matsciml/datasets/colabfit/devset *
recursive-include matsciml/datasets/alexandria/devset *
37 changes: 37 additions & 0 deletions examples/datasets/alexandria/download_alexandria.py
@@ -0,0 +1,37 @@
from matsciml.datasets.alexandria import AlexandriaRequest


# example of downloading the 3D scan dataset
indices = list(range(0, 5))
# The target directory where the LMDB file will be written
lmdb_target_dir = "alexandria_3D_scan"
request = AlexandriaRequest(indices, lmdb_target_dir, dataset="scan")
request.download_and_write(n_jobs=5)

# example of downloading the 3D pbesol dataset
indices = list(range(0, 5))
# The target directory where the LMDB file will be written
lmdb_target_dir = "alexandria_3D_pbesol"
request = AlexandriaRequest(indices, lmdb_target_dir, dataset="pbesol")
request.download_and_write(n_jobs=5)

# example of downloading the 3D pbe dataset
indices = list(range(0, 45))
# The target directory where the LMDB file will be written
lmdb_target_dir = "alexandria_3D_pbe"
request = AlexandriaRequest(indices, lmdb_target_dir, dataset="pbe")
request.download_and_write(n_jobs=5)

# example of downloading the 2D pbe dataset
indices = list(range(0, 2))
# The target directory where the LMDB file will be written
lmdb_target_dir = "alexandria_2D"
request = AlexandriaRequest(indices, lmdb_target_dir, dataset="2D")
request.download_and_write(n_jobs=2)

# example of downloading the 1D pbe dataset
indices = list(range(0, 1))
# The target directory where the LMDB file will be written
lmdb_target_dir = "alexandria_1D"
request = AlexandriaRequest(indices, lmdb_target_dir, dataset="1D")
request.download_and_write(n_jobs=1)
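The five download blocks in `download_alexandria.py` differ only in the dataset name, the number of indices, and the target directory. As a rough sketch, the same settings could be kept in one table and turned into a download plan. The helper `build_download_plan` and the directory naming rule are assumptions made for illustration; only the `AlexandriaRequest(indices, lmdb_target_dir, dataset=...)` call shape and the index counts come from the script above.

```python
from typing import Dict, List

# Number of indices per dataset, copied from the example script.
N_FILES: Dict[str, int] = {"scan": 5, "pbesol": 5, "pbe": 45, "2D": 2, "1D": 1}


def build_download_plan(n_files: Dict[str, int] = N_FILES) -> List[dict]:
    """Return one settings dict per dataset (hypothetical helper)."""
    plan = []
    for name, count in n_files.items():
        # Assumed naming rule: low-dimensional sets use "alexandria_<name>",
        # the 3D functionals use "alexandria_3D_<name>".
        low_dim = name in ("1D", "2D")
        target = f"alexandria_{name}" if low_dim else f"alexandria_3D_{name}"
        plan.append(
            {
                "dataset": name,
                "indices": list(range(count)),
                "lmdb_target_dir": target,
                # Never request more workers than there are indices.
                "n_jobs": min(count, 5),
            }
        )
    return plan


plan = build_download_plan()
for job in plan:
    print(job["dataset"], len(job["indices"]), job["lmdb_target_dir"], job["n_jobs"])

# With matsciml installed, each entry could then be passed on as in the script:
#     request = AlexandriaRequest(job["indices"], job["lmdb_target_dir"], dataset=job["dataset"])
#     request.download_and_write(n_jobs=job["n_jobs"])
```

Deriving the target directory from the dataset name gives every dataset its own LMDB directory without having to edit each block by hand.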
46 changes: 46 additions & 0 deletions examples/datasets/alexandria/single_task_base.py
@@ -0,0 +1,46 @@
from __future__ import annotations

import pytorch_lightning as pl
from torch.nn import LayerNorm, SiLU

from matsciml.datasets.transforms import (
    PointCloudToGraphTransform,
    PeriodicPropertiesTransform,
)
from matsciml.lightning.data_utils import MatSciMLDataModule
from matsciml.models import GraphConvModel
from matsciml.models.base import ScalarRegressionTask

pl.seed_everything(21616)


model = GraphConvModel(100, 128, encoder_only=True)
task = ScalarRegressionTask(
    model,
    output_kwargs={
        "norm": LayerNorm(128),
        "hidden_dim": 128,
        "activation": SiLU,
        "lazy": False,
        "input_dim": 128,
    },
    lr=1e-3,
    task_keys=["band_gap_ind"],
)


dm = MatSciMLDataModule(
    "AlexandriaDataset",
    train_path="../../../matsciml/datasets/alexandria/devset",
    dset_kwargs={
        "transforms": [
            PeriodicPropertiesTransform(10.0),
            PointCloudToGraphTransform("dgl", cutoff_dist=10.0),
        ]
    },
    val_split=0.2,
)

trainer = pl.Trainer(fast_dev_run=10)

trainer.fit(task, datamodule=dm)
31 changes: 31 additions & 0 deletions examples/datasets/alexandria/single_task_devset.py
@@ -0,0 +1,31 @@
from __future__ import annotations

import pytorch_lightning as pl

from matsciml.datasets.transforms import (
    PointCloudToGraphTransform,
    PeriodicPropertiesTransform,
)
from matsciml.lightning.data_utils import MatSciMLDataModule
from matsciml.models import GraphConvModel
from matsciml.models.base import ScalarRegressionTask

# configure a simple model for testing
model = GraphConvModel(100, 128, encoder_only=True)
task = ScalarRegressionTask(model, task_keys=["band_gap_ind"])

# configure alexandria devset
dm = MatSciMLDataModule.from_devset(
    "AlexandriaDataset",
    dset_kwargs={
        "transforms": [
            PeriodicPropertiesTransform(10.0, adaptive_cutoff=True),
            PointCloudToGraphTransform("dgl", cutoff_dist=10.0),
        ]
    },
)

# run 10 steps for funsies
trainer = pl.Trainer(fast_dev_run=10)

trainer.fit(task, datamodule=dm)
79 changes: 79 additions & 0 deletions examples/datasets/alexandria/single_task_egnn.py
@@ -0,0 +1,79 @@
from __future__ import annotations

import pytorch_lightning as pl
from torch.nn import LayerNorm, SiLU

from matsciml.datasets.transforms import (
    PointCloudToGraphTransform,
    PeriodicPropertiesTransform,
)
from matsciml.lightning.data_utils import MatSciMLDataModule
from matsciml.models import PLEGNNBackbone
from matsciml.models.base import ScalarRegressionTask

pl.seed_everything(21616)

model_args = {
    "embed_in_dim": 128,
    "embed_hidden_dim": 32,
    "embed_out_dim": 128,
    "embed_depth": 5,
    "embed_feat_dims": [128, 128, 128],
    "embed_message_dims": [128, 128, 128],
    "embed_position_dims": [64, 64],
    "embed_edge_attributes_dim": 0,
    "embed_activation": "relu",
    "embed_residual": True,
    "embed_normalize": True,
    "embed_tanh": True,
    "embed_activate_last": False,
    "embed_k_linears": 1,
    "embed_use_attention": False,
    "embed_attention_norm": "sigmoid",
    "readout": "sum",
    "node_projection_depth": 3,
    "node_projection_hidden_dim": 128,
    "node_projection_activation": "relu",
    "prediction_out_dim": 1,
    "prediction_depth": 3,
    "prediction_hidden_dim": 128,
    "prediction_activation": "relu",
    "encoder_only": True,
}

model = PLEGNNBackbone(**model_args)
task = ScalarRegressionTask(
    model,
    output_kwargs={
        "norm": LayerNorm(128),
        "hidden_dim": 128,
        "activation": SiLU,
        "lazy": False,
        "input_dim": 128,
    },
    lr=1e-3,
    task_keys=["band_gap_ind"],
)

dm = MatSciMLDataModule(
    dataset="AlexandriaDataset",
    train_path="../../../matsciml/datasets/alexandria/devset",
    dset_kwargs={
        "transforms": [
            PeriodicPropertiesTransform(10.0),
            PointCloudToGraphTransform(
                "dgl",
                cutoff_dist=10.0,
                node_keys=["pos", "atomic_numbers"],
            ),
        ],
    },
    val_split=0.2,
    batch_size=16,
    num_workers=0,
)

trainer = pl.Trainer(
    fast_dev_run=10,
)
trainer.fit(task, datamodule=dm)
58 changes: 58 additions & 0 deletions examples/datasets/alexandria/single_task_gala.py
@@ -0,0 +1,58 @@
from __future__ import annotations

import pytorch_lightning as pl
from torch.nn import LayerNorm, SiLU

from matsciml.lightning.data_utils import MatSciMLDataModule
from matsciml.models import GalaPotential
from matsciml.models.base import ScalarRegressionTask

model_args = {
    "D_in": 100,
    "hidden_dim": 128,
    "merge_fun": "concat",
    "join_fun": "concat",
    "invariant_mode": "full",
    "covariant_mode": "full",
    "include_normalized_products": True,
    "invar_value_normalization": "momentum",
    "eqvar_value_normalization": "momentum_layer",
    "value_normalization": "layer",
    "score_normalization": "layer",
    "block_normalization": "layer",
    "equivariant_attention": False,
    "tied_attention": True,
    "encoder_only": True,
}

mp_norms = {
    "formation_energy_per_atom_mean": -1.454,
    "formation_energy_per_atom_std": 1.206,
}

task = ScalarRegressionTask(
    mp_norms,
    encoder_class=GalaPotential,
    encoder_kwargs=model_args,
    output_kwargs={
        "norm": LayerNorm(128),
        "hidden_dim": 128,
        "activation": SiLU,
        "lazy": False,
        "input_dim": 128,
    },
    lr=1e-4,
    task_keys=["band_gap_ind"],
)


dm = MatSciMLDataModule(
    dataset="AlexandriaDataset",
    train_path="../../../matsciml/datasets/alexandria/devset",
    val_split=0.2,
    batch_size=16,
    num_workers=0,
)

trainer = pl.Trainer(fast_dev_run=10)
trainer.fit(task, datamodule=dm)
43 changes: 43 additions & 0 deletions examples/datasets/alexandria/single_task_mpnn_dgl.py
@@ -0,0 +1,43 @@
from __future__ import annotations

import pytorch_lightning as pl

from matsciml.datasets.transforms import (
    DistancesTransform,
    PointCloudToGraphTransform,
    PeriodicPropertiesTransform,
)
from matsciml.lightning.data_utils import MatSciMLDataModule
from matsciml.models import MPNN
from matsciml.models.base import ScalarRegressionTask

# construct a scalar regression task with MPNN encoder
task = ScalarRegressionTask(
    encoder_class=MPNN,
    encoder_kwargs={
        "encoder_only": True,
        "atom_embedding_dim": 8,
        "node_out_dim": 16,
    },
    task_keys=["band_gap_ind"],
    output_kwargs={"lazy": False, "input_dim": 16, "hidden_dim": 16},
)
# MPNN expects edge features corresponding to atom-atom distances
dm = MatSciMLDataModule.from_devset(
    "AlexandriaDataset",
    dset_kwargs={
        "transforms": [
            PeriodicPropertiesTransform(10.0),
            PointCloudToGraphTransform(
                "dgl",
                cutoff_dist=10.0,
                node_keys=["pos", "atomic_numbers"],
            ),
            DistancesTransform(),
        ],
    },
)

# run a quick training loop
trainer = pl.Trainer(fast_dev_run=10)
trainer.fit(task, datamodule=dm)
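The MPNN example relies on edge features that hold atom-atom distances within a cutoff. The sketch below shows that idea in plain Python for a non-periodic point cloud. It is not the implementation of `DistancesTransform` or `PointCloudToGraphTransform`; the function `radius_edges` is hypothetical and ignores periodic images entirely.

```python
import math
from typing import List, Sequence, Tuple

Edge = Tuple[int, int, float]


def radius_edges(positions: Sequence[Sequence[float]], cutoff: float) -> List[Edge]:
    """Return directed edges (i, j, distance) for all pairs within the cutoff."""
    edges: List[Edge] = []
    for i, p in enumerate(positions):
        for j, q in enumerate(positions):
            if i == j:
                continue  # no self-loops
            d = math.dist(p, q)  # Euclidean distance, Python 3.8+
            if d <= cutoff:
                edges.append((i, j, d))
    return edges


# Three atoms: two close together, one 20 Å away along z.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 20.0)]
edges = radius_edges(coords, cutoff=10.0)
print(edges)
```

With a 10 Å cutoff only the first two atoms are connected; the third atom sits beyond the cutoff and receives no edges.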
2 changes: 2 additions & 0 deletions matsciml/datasets/__init__.py
@@ -12,6 +12,7 @@
is2re_devset = Path(__file__).parents[0].joinpath("dev-is2re")


from matsciml.datasets.alexandria import AlexandriaDataset
from matsciml.datasets.carolina_db import CMDataset
from matsciml.datasets.colabfit import ColabFitDataset
from matsciml.datasets.lips import LiPSDataset, lips_devset
@@ -26,6 +27,7 @@
from matsciml.datasets.symmetry import SyntheticPointGroupDataset, symmetry_devset

__all__ = [
    "AlexandriaDataset",
    "IS2REDataset",
    "S2EFDataset",
    "CMDataset",
31 changes: 31 additions & 0 deletions matsciml/datasets/alexandria/README.md
@@ -0,0 +1,31 @@
Alexandria Database

The [alexandria database](https://alexandria.icams.rub.de/) is maintained by Miguel Marques at RUB.
The database comprises ~4.5 million three-dimensional relaxed crystal structures that span
the periodic table, in addition to over 100,000 two-dimensional and 10,000 one-dimensional crystal structures.
Furthermore, ~400k of the three-dimensional crystal structures are also available at PBEsol geometries and with all properties
calculated with the SCAN functional.
**Warning: The 2D and 1D structures are periodic in the "non-periodic" directions with a vacuum of 15 Å. Cutoff distances during
graph construction larger than this vacuum will produce wrong neighbor lists.**
Each structure has an associated total energy (eV), forces (eV/Å), band gap (eV), magnetization (Bohr magneton),
magnetic moments on each atom (Bohr magneton), distance to the convex hull per atom (eV/atom),
formation energy per atom (eV/atom), and density of states at the Fermi level (states/eV).
During standard processing of the dataset, these quantities are available as training targets.
Stress (eV/Å<sup>2</sup>) data is also available in the database but is not added as a target during standard processing.
A fixed training, validation, and test split, as well as a link to a FAIR repository, will be added in the future
with a further publication.

References:

[10.1002/adma.202210788](http://hdl.handle.net/10.1002/adma.202210788) (3D)

[10.1088/2053-1583/accc43](http://hdl.handle.net/10.1088/2053-1583/accc43) (2D)

[10.1126/sciadv.abi7948](http://hdl.handle.net/10.1126/sciadv.abi7948) (method)

[10.1038/s41597-022-01177-w](http://hdl.handle.net/10.1038/s41597-022-01177-w) (PBEsol and SCAN)

Alexandria is available for use under the terms of the Creative Commons Attribution 4.0 License.
Under this license you are free to share and adapt the data, but you must give appropriate credit
to alexandria, provide a link to the license, and indicate if changes were made. You may do so in
any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
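The README warns that graph cutoffs larger than the 15 Å vacuum give wrong neighbor lists for the 1D and 2D structures. A simple guard before building graphs could look like the sketch below. The function names are hypothetical and not part of matsciml; only the 15 Å value comes from the README.

```python
VACUUM_ANGSTROM = 15.0  # vacuum padding stated in the README for 1D/2D structures


def cutoff_is_safe(cutoff: float, vacuum: float = VACUUM_ANGSTROM) -> bool:
    """True if the cutoff does not exceed the vacuum padding."""
    return cutoff <= vacuum


def validate_cutoff(cutoff: float, vacuum: float = VACUUM_ANGSTROM) -> float:
    """Return the cutoff unchanged, or raise if it would reach periodic images."""
    if not cutoff_is_safe(cutoff, vacuum):
        raise ValueError(
            f"cutoff {cutoff} Å exceeds the {vacuum} Å vacuum; "
            "neighbor lists for 1D/2D structures would be wrong"
        )
    return cutoff


print(cutoff_is_safe(10.0))  # the cutoff used in the example scripts
print(cutoff_is_safe(20.0))
```

The 10.0 Å cutoff used throughout the example scripts passes this check.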
10 changes: 10 additions & 0 deletions matsciml/datasets/alexandria/__init__.py
@@ -0,0 +1,10 @@
from __future__ import annotations

from pathlib import Path
from matsciml.datasets.alexandria.api import AlexandriaRequest
from matsciml.datasets.alexandria.dataset import AlexandriaDataset

__all__ = [
    "AlexandriaDataset",
    "AlexandriaRequest",
]