
Importing Different Dataset #48

Closed
aydinmirac opened this issue Apr 10, 2022 · 7 comments
@aydinmirac

Hi,

First of all, thanks for creating this tool.

I am using the OMDB dataset for bandgap prediction with the SchNetPack and ALIGNN models. OMDB is similar to QM9, but it includes much larger molecules, with an average of 82 atoms per unit cell:

https://omdb.mathub.io/dataset

The dataset includes bandgap values and an XYZ file containing all structures.

The problem is that the dataset is much smaller than QM9: it contains only 12,500 molecules, which limits the achievable MAE. Is there any way to import this dataset into AugLiChem and augment it to train SchNetPack?

Best regards,

@CoopLo
Collaborator

CoopLo commented Apr 11, 2022

Hello, thank you for reaching out. I've updated the package on pip for easier use of user-defined data sets, so please update using pip install auglichem==0.1.6. Currently AugLiChem supports crystal data sets in the form of CIF files. The OMDB data set you mention can be converted using the code below:

from ase.io import read, write
import numpy as np
import os
from tqdm import tqdm

os.makedirs("./omdb_cifs", exist_ok=True)

# Read every structure from the OMDB XYZ file (index=':' loads all frames)
materials = read('structures.xyz', index=':')

# Band gap targets and COD identifiers, one per structure
bandgaps = np.loadtxt('bandgaps.csv', dtype=float)
cods = np.loadtxt('CODids.csv', dtype=int)

# Write each structure out as its own CIF file, named by its COD id
for idx in tqdm(range(len(materials))):
    write("./omdb_cifs/{}.cif".format(cods[idx]), materials[idx])

# id_prop.csv maps each CIF id to its band gap target
np.savetxt("./omdb_cifs/id_prop.csv",
           np.array(list(zip(cods.astype(float), bandgaps.astype(float)))),
           delimiter=',', fmt=["%i", "%f"])

After converting the data set, this can be loaded using the updated functionality:

dataset = CrystalDatasetWrapper(
    "custom", kfolds=5,
    data_path="./data_download",
    data_src="/path/to/omdb_cifs"
)

From there, the data set should be usable like the data explicitly supported in the package. I've updated the documentation on the Crystal Usage page as well. Please let me know if this works for you.
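For reference, here is a minimal sketch of loading the converted data with an augmentation (the transformation class and its distance argument follow the AugLiChem docs and are assumptions here, not part of the original comment):

from auglichem.crystal import PerturbStructureTransformation

# One small random perturbation per structure; 0.05 is an assumed example value
transforms = [PerturbStructureTransformation(distance=0.05)]

# Same call pattern as used later in this thread
train_loader, valid_loader, test_loader = dataset.get_data_loaders(
    target=None, transform=transforms, fold=None
)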

@aydinmirac
Author

Hi @CoopLo,

Thank you so much for your effort. I really appreciate it.

I will try this implementation and let you know as soon as possible. I also found that the ALIGNN tool uses JARVIS-Tools to import and configure datasets. The datasets are available in CIF, VASP, and XYZ formats:

https://github.com/usnistgov/alignn
https://jarvis-tools.readthedocs.io/en/master/databases.html

Here is an example script that imports and uses the listed datasets:

https://github.com/usnistgov/alignn/blob/main/alignn/examples/sample_data/scripts/generate_sample_data_reg.py

from jarvis.db.figshare import data as jdata
from jarvis.core.atoms import Atoms

dft_3d = jdata("dft_3d")
prop = "optb88vdw_bandgap"
max_samples = 50
f = open("id_prop.csv", "w")
count = 0
for i in dft_3d:
    atoms = Atoms.from_dict(i["atoms"])
    jid = i["jid"]
    poscar_name = "POSCAR-" + jid + ".vasp"
    target = i[prop]
    if target != "na":
        atoms.write_poscar(poscar_name)
        f.write("%s,%6f\n" % (poscar_name, target))
        count += 1
    if count == max_samples:
        break
f.close()

Maybe we can use the output folder of this code snippet. I will try this one as well.
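Since AugLiChem expects CIF files, the POSCAR output above would still need a conversion step. Here is a minimal sketch using ASE (the file names follow the snippet above; the conversion itself is an assumption, not part of the original thread):

import glob
import os
from ase.io import read, write

os.makedirs("./jarvis_cifs", exist_ok=True)

# Convert each POSCAR written by the JARVIS script into a CIF for AugLiChem
for poscar in glob.glob("POSCAR-*.vasp"):
    atoms = read(poscar, format="vasp")
    cif_name = os.path.splitext(os.path.basename(poscar))[0] + ".cif"
    write(os.path.join("./jarvis_cifs", cif_name), atoms)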

@aydinmirac
Author

Hi @CoopLo,

When I tried your implementation, I encountered the following error:

Traceback (most recent call last):
  File "test.py", line 72, in <module>
    train_loader, valid_loader, test_loader = dataset.get_data_loaders(
  File "/raid/apps/auglichem/2022/lib/python3.8/site-packages/auglichem/crystal/data/_crystal_dataset.py", line 712, in get_data_loaders
    train_set = self._remove_bad_cifs(train_set, transform)
  File "/raid/apps/auglichem/2022/lib/python3.8/site-packages/auglichem/crystal/data/_crystal_dataset.py", line 644, in _remove_bad_cifs
    train_set = CrystalDataset(train_set.dataset,
  File "/raid/apps/auglichem/2022/lib/python3.8/site-packages/auglichem/crystal/data/_crystal_dataset.py", line 185, in __init__
    raise RuntimeError(error_str)
RuntimeError: Need data source directory when using custom data set. Use data_src=/path/to/data.

The corresponding line is:

train_loader, valid_loader, test_loader = dataset.get_data_loaders(
    target=None, transform=transforms, fold=None, remove_bad_cifs=True
)

I think the get_data_loaders function needs this parameter, but when I added data_src it also raised an error, because the parameter is not defined in the source code:

https://github.com/BaratiLab/AugLiChem/blob/main/auglichem/crystal/data/_crystal_dataset.py#L659

@CoopLo
Collaborator

CoopLo commented Apr 13, 2022

Hello @miracaydin1, thank you for the follow-up. When I was testing the custom data set code, I forgot to test the CIF removal feature with it; thank you for catching that. There are currently issues with the remove_bad_cifs function that need to be fixed, so I don't recommend using it for now. I'll let you know when I have it fixed and updated.

I've also found most data sets don't need CIFs removed, though the swap-axes transformation tends to introduce a bad CIF file or two. In the meantime, I recommend running without the built-in CIF removal. If there are still CIF files that need to be removed, there's a work-around to remove them manually: iterate over the train/test/validation loaders with a batch size of 1, keep track of which entries throw an error, then manually remove them from the data directory and id_prop.csv; see the sketch below.
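A minimal sketch of that work-around (not from the original comment; it assumes the loader's underlying dataset supports indexing and exposes id_prop_augment, as mentioned later in this thread):

def find_bad_entries(loader):
    # Try to materialize every sample; record the ones that fail to load
    bad = []
    dataset = loader.dataset
    for idx in range(len(dataset)):
        try:
            _ = dataset[idx]
        except Exception:
            # id_prop_augment rows hold (cif id, target) pairs
            bad.append(dataset.id_prop_augment[idx])
    return bad

for loader in (train_loader, valid_loader, test_loader):
    print(find_bad_entries(loader))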

@CoopLo
Collaborator

CoopLo commented Apr 14, 2022

Hello @miracaydin1, I've updated the package. Custom data sets should now be fully functional, just as the explicitly supported data sets are. Let me know if you have any other issues. ALIGNN and JARVIS-Tools integration look like they could be a good addition to the package as well; thanks for linking those.

@aydinmirac
Author

Hello @CoopLo, thank you very much for your help.

I can train the SchNetPack model now, but I have some questions.

As I understand the data augmentation process, the id_prop.csv file does not include bandgap values for the augmented CIF files; it only covers the original files.

So how does the model learn from a structure without a target value? I feel like we are feeding data without any labels.
Also, native SchNetPack gave around 0.4 eV MAE in my previous trainings with the original OMDB dataset (12.5k molecules), but the SchNetPack model in AugLiChem gave 101 eV MAE with the augmented OMDB (around 60k molecules).

Do you have any suggestions for reducing the MAE of this model? Obviously I am making a mistake somewhere, but I cannot find it.

Best regards,

@CoopLo
Collaborator

CoopLo commented Apr 18, 2022

I'm glad you're able to train your model now!

The augmentation process uses the original target value for both the original and augmented CIF files. id_prop_train/valid/test simply keeps track of which CIFs go in each split. During the splitting step of CrystalDatasetWrapper initialization, the augmented CIF files are added to the training-set id_prop. You can verify this by printing train_loader.dataset.id_prop_augment after splitting; the augmented CIF files will show up there.
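For example, after splitting:

# The training set's bookkeeping now includes the augmented entries
print(train_loader.dataset.id_prop_augment[:10])  # rows of (cif id, target)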

As far as training goes, I would check the training curves first. I've noticed some training instability for some models and data sets: the model begins training well, then the train and validation MAE spike and get stuck in a local minimum. If you're seeing this, rerunning the model with a different random seed may help. Using different augmentations may also help both stabilize training and improve results. We've seen that perturbation with a perturbation value of 0.05 tends to work well; perturbation of 0.05 plus supercell also tends to work well, but takes more memory and trains more slowly. Other approaches, like changing the batch size, learning rate, learning rate schedule, or model hyperparameters, may also help. I've found each of these to be helpful in my own experiments.
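As an illustration, the augmentation setup described above might look like this (class names and arguments follow the AugLiChem docs; treat them as assumptions if your version differs):

from auglichem.crystal import (PerturbStructureTransformation,
                               SupercellTransformation)

# Perturbation of 0.05, optionally combined with a supercell transform;
# SupercellTransformation's constructor arguments are omitted here and
# its defaults are assumed
transforms = [
    PerturbStructureTransformation(distance=0.05),
    SupercellTransformation(),
]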

@CoopLo CoopLo closed this as completed Apr 21, 2022