
Importing Different Dataset #48

Closed
aydinmirac opened this issue Apr 10, 2022 · 7 comments
@aydinmirac

Hi,

First of all, thanks for creating this tool.

I am using the OMDB dataset for bandgap prediction with the SchNetPack and ALIGNN models. OMDB is similar to QM9, but it includes much larger molecules, with an average of 82 atoms per unit cell:

https://omdb.mathub.io/dataset

The dataset includes bandgap values and an XYZ file containing all structures.

The problem is that the dataset is much smaller than QM9: it contains only 12,500 molecules, which limits the achievable MAE. Is there any way to import this dataset into AugLiChem and augment it to train SchNetPack?

Best regards,

@CoopLo
Collaborator

CoopLo commented Apr 11, 2022

Hello, thank you for reaching out. I've updated the package on pip for easier use of user-defined data sets, so please update using pip install auglichem==0.1.6. Currently AugLiChem supports crystal data sets in the form of CIF files. The OMDB data set you mention can be converted using the code below:

from ase.io import read, write
import numpy as np
import os
from tqdm import tqdm

os.makedirs("./omdb_cifs", exist_ok=True)

# Read every structure from the OMDB XYZ file (index=':' loads all frames)
materials = read('structures.xyz', index=':')

# Band gap targets and COD identifiers, one per structure
bandgaps = np.loadtxt('bandgaps.csv', dtype=float)
cods = np.loadtxt('CODids.csv', dtype=int)

# Write each structure out as its own CIF file, named by its COD id
for idx in tqdm(range(len(materials))):
    write("./omdb_cifs/{}.cif".format(cods[idx]), materials[idx])

# id_prop.csv maps each CIF id to its band gap target
np.savetxt("./omdb_cifs/id_prop.csv",
           np.array(list(zip(cods.astype(float), bandgaps.astype(float)))),
           delimiter=',', fmt=["%i", "%f"])

After converting the data set, this can be loaded using the updated functionality:

dataset = CrystalDatasetWrapper(
    "custom", kfolds=5,
    data_path="./data_download",
    data_src="/path/to/omdb_cifs"
)

From there, the data set should be usable like the data explicitly supported in the package. I've updated the documentation on the Crystal Usage page as well. Please let me know if this works for you.
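For reference, here is a minimal sketch of loading the converted data with an augmentation (the transformation class and its distance argument follow the AugLiChem docs and are assumptions here, not part of the original comment):

from auglichem.crystal import PerturbStructureTransformation

# One small random perturbation per structure; 0.05 is an assumed example value
transforms = [PerturbStructureTransformation(distance=0.05)]

# Same call pattern as used later in this thread
train_loader, valid_loader, test_loader = dataset.get_data_loaders(
    target=None, transform=transforms, fold=None
)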

@aydinmirac
Author

Hi @CoopLo,

Thank you so much for your effort. I really appreciate it.

I will try this implementation and let you know as soon as possible. I also found that the ALIGNN tool uses JARVIS-Tools to import and configure datasets. The datasets are available in CIF, VASP, and XYZ formats:

https://github.com/usnistgov/alignn
https://jarvis-tools.readthedocs.io/en/master/databases.html

Here is an example script that imports and uses the listed datasets:

https://github.com/usnistgov/alignn/blob/main/alignn/examples/sample_data/scripts/generate_sample_data_reg.py

from jarvis.db.figshare import data as jdata
from jarvis.core.atoms import Atoms

dft_3d = jdata("dft_3d")
prop = "optb88vdw_bandgap"
max_samples = 50
f = open("id_prop.csv", "w")
count = 0
for i in dft_3d:
    atoms = Atoms.from_dict(i["atoms"])
    jid = i["jid"]
    poscar_name = "POSCAR-" + jid + ".vasp"
    target = i[prop]
    if target != "na":
        atoms.write_poscar(poscar_name)
        f.write("%s,%6f\n" % (poscar_name, target))
        count += 1
    if count == max_samples:
        break
f.close()

Maybe we can use the output folder of this code snippet. I will try this one as well.
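Since AugLiChem expects CIF files, the POSCAR output above would still need a conversion step. Here is a minimal sketch using ASE (the file names follow the snippet above; the conversion itself is an assumption, not part of the original thread):

import glob
import os
from ase.io import read, write

os.makedirs("./jarvis_cifs", exist_ok=True)

# Convert each POSCAR written by the JARVIS script into a CIF for AugLiChem
for poscar in glob.glob("POSCAR-*.vasp"):
    atoms = read(poscar, format="vasp")
    cif_name = os.path.splitext(os.path.basename(poscar))[0] + ".cif"
    write(os.path.join("./jarvis_cifs", cif_name), atoms)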

@aydinmirac
Author

Hi @CoopLo,

When I tried your implementation, I encountered the following error:

Traceback (most recent call last):
  File "test.py", line 72, in <module>
    train_loader, valid_loader, test_loader = dataset.get_data_loaders(
  File "/raid/apps/auglichem/2022/lib/python3.8/site-packages/auglichem/crystal/data/_crystal_dataset.py", line 712, in get_data_loaders
    train_set = self._remove_bad_cifs(train_set, transform)
  File "/raid/apps/auglichem/2022/lib/python3.8/site-packages/auglichem/crystal/data/_crystal_dataset.py", line 644, in _remove_bad_cifs
    train_set = CrystalDataset(train_set.dataset,
  File "/raid/apps/auglichem/2022/lib/python3.8/site-packages/auglichem/crystal/data/_crystal_dataset.py", line 185, in __init__
    raise RuntimeError(error_str)
RuntimeError: Need data source directory when using custom data set. Use data_src=/path/to/data.

The corresponding line is:

train_loader, valid_loader, test_loader = dataset.get_data_loaders(
    target=None, transform=transforms, fold=None, remove_bad_cifs=True
)

I think the get_data_loaders function needs this parameter, but when I added data_src it also raised an error, because the parameter is not defined in the source code:

https://github.com/BaratiLab/AugLiChem/blob/main/auglichem/crystal/data/_crystal_dataset.py#L659

@CoopLo
Collaborator

CoopLo commented Apr 13, 2022

Hello @miracaydin1, thank you for the follow-up. When I was testing the custom data set code, I forgot to test the CIF removal feature with it; thank you for catching that. There are currently issues with the remove_bad_cifs function that need to be fixed, so I don't recommend using it for now. I'll let you know when I have it fixed and updated.

I've also found most data sets don't need CIFs removed, though the swap-axes transformation tends to introduce a bad CIF file or two. In the meantime, I recommend running without the built-in CIF removal. If there are still CIF files that need to be removed, there's a work-around to remove them manually: iterate over the train/test/validation loaders with a batch size of 1, keep track of which entries throw an error, then manually remove them from the data directory and id_prop.csv; see the sketch below.
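A minimal sketch of that work-around (not from the original comment; it assumes the loader's underlying dataset supports indexing and exposes id_prop_augment, as mentioned later in this thread):

def find_bad_entries(loader):
    # Try to materialize every sample; record the ones that fail to load
    bad = []
    dataset = loader.dataset
    for idx in range(len(dataset)):
        try:
            _ = dataset[idx]
        except Exception:
            # id_prop_augment rows hold (cif id, target) pairs
            bad.append(dataset.id_prop_augment[idx])
    return bad

for loader in (train_loader, valid_loader, test_loader):
    print(find_bad_entries(loader))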

@CoopLo
Collaborator

CoopLo commented Apr 14, 2022

Hello @miracaydin1, I've updated the package. Custom data sets should now be fully functional, just as the explicitly supported data sets are. Let me know if you have any other issues. ALIGNN and JARVIS-Tools integration look like they could be a good addition to the package as well; thanks for linking those.

@aydinmirac
Author

Hello @CoopLo, thank you very much for your help.

I can train the SchNetPack model now, but I have some questions.

As I understand the data augmentation process, the id_prop.csv file does not include bandgap values for the augmented CIF files; it only covers the original files.

So how does the model learn from a structure without a target value? I feel like we are feeding data without any labels.
Also, native SchNetPack gave around 0.4 eV MAE in my previous trainings with the original OMDB dataset (12.5k molecules), but the SchNetPack model in AugLiChem gave 101 eV MAE with the augmented OMDB (around 60k molecules).

Do you have any suggestions for reducing the MAE of this model? Obviously I am making a mistake somewhere, but I cannot find it.

Best regards,

@CoopLo
Collaborator

CoopLo commented Apr 18, 2022

I'm glad you're able to train your model now!

The augmentation process uses the original target value for both the original and augmented CIF files. id_prop_train/valid/test simply keeps track of which CIFs go in each split. During the splitting step of CrystalDatasetWrapper initialization, the augmented CIF files are added to the training-set id_prop. You can verify this by printing train_loader.dataset.id_prop_augment after splitting; the augmented CIF files will show up there.
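For example, after splitting:

# The training set's bookkeeping now includes the augmented entries
print(train_loader.dataset.id_prop_augment[:10])  # rows of (cif id, target)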

As far as training goes, I would check the training curves first. I've noticed some training instability for some models and data sets: the model begins training well, then the train and validation MAE spike and get stuck in a local minimum. If you're seeing this, rerunning the model with a different random seed may help. Using different augmentations may also help both stabilize training and improve results. We've seen that perturbation with a perturbation value of 0.05 tends to work well; perturbation of 0.05 plus supercell also tends to work well, but takes more memory and trains more slowly. Other approaches, like changing the batch size, learning rate, learning rate schedule, or model hyperparameters, may also help. I've found each of these to be helpful in my own experiments.
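As an illustration, the augmentation setup described above might look like this (class names and arguments follow the AugLiChem docs; treat them as assumptions if your version differs):

from auglichem.crystal import (PerturbStructureTransformation,
                               SupercellTransformation)

# Perturbation of 0.05, optionally combined with a supercell transform;
# SupercellTransformation's constructor arguments are omitted here and
# its defaults are assumed
transforms = [
    PerturbStructureTransformation(distance=0.05),
    SupercellTransformation(),
]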

@CoopLo CoopLo closed this as completed Apr 21, 2022