Importing Different Dataset #48
Comments
Hello, thank you for reaching out. I've updated the package on pip for easier use of user-defined data sets, so please update using
After converting the data set, this can be loaded using the updated functionality:
From there, the data set should be usable like the data explicitly supported in the package. I've updated the documentation on the Crystal Usage page as well. Please let me know if this works for you.
Hi @CoopLo, thank you so much for your effort. I really appreciate it. I will try this implementation and let you know as soon as possible. I also found that the ALIGNN tool uses JARVIS-TOOL to import and configure datasets. They are usable in CIF, VASP and XYZ formats: https://github.com/usnistgov/alignn Here is an example script to import and use the listed datasets: `from jarvis.db.figshare import data as jdata` followed by `dft_3d = jdata("dft_3d")`. Maybe we can use the output folder of this code snippet. I will also try this one.
Hi @CoopLo, When I tried your implementation, I encountered the following error:
The corresponding line is:
https://github.com/BaratiLab/AugLiChem/blob/main/auglichem/crystal/data/_crystal_dataset.py#L659
Hello @miracaydin1, thank you for the follow-up. When I was testing the custom data set code I forgot to test the CIF removal feature with it, so thank you for catching that. I've noticed there are issues with the I've also found most data sets don't need CIFs removed, and the
Hello @miracaydin1, I've updated the package. Custom data sets should now be fully functional, just as the explicitly supported data sets are. Let me know if you have any other issues. ALIGNN and JARVIS-TOOL integration look like they could be a good addition to the package as well, thanks for linking those.
Hello @CoopLo, thank you very much for your help. I can train the Schnetpack model now, but I have some questions. As I understand the data augmentation process, the "id_prop.csv" file does not include bandgap values for the augmented CIF files. It only includes the original files. So how does the model learn from a structure that has no target value? It feels like we are feeding in data without any label. Do you have any suggestions for reducing the MAE of this model? I am probably making a mistake somewhere, but I could not find it. Best regards,
I'm glad you're able to train your model now! The augmentation process uses the original target value for both the original and augmented CIF files. id_prop_train/valid/test simply keeps track of which CIF goes in each split. During

As far as training goes, I would check the training curves first. I've noticed some training instability for some models and data sets. The model begins training well, and then train and validation MAE spike and get stuck in a local minimum. If you're seeing this, rerunning the model with a different random seed may help.

Using different augmentations may also help both stabilize training and improve results. We've seen that perturbation with a perturbation value of 0.05 tends to work well. Perturbation of 0.05 plus supercell also tends to work well, but takes more memory and longer to train.

Other approaches, such as changing the batch size, learning rate, learning rate schedule, or model hyperparameters, may also help. I've found each of these to be helpful in my own experiments.
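To make the label question concrete, here is a minimal sketch of the idea that an augmented structure reuses the target of the structure it was derived from. This is not AugLiChem's implementation: the function names, the `_perturbed_` ID suffix, and the uniform per-coordinate shift are all assumptions made for illustration, and the real perturbation transform may differ in distribution and units.

```python
import random


def perturb(coords, max_shift=0.05, rng=None):
    """Return a copy of coords with every component shifted by a uniform
    random amount in [-max_shift, max_shift]. Hypothetical stand-in for a
    perturbation augmentation; not the library's actual transform."""
    rng = rng or random.Random()
    return [
        tuple(x + rng.uniform(-max_shift, max_shift) for x in atom)
        for atom in coords
    ]


def augment_with_labels(dataset, n_copies=1, max_shift=0.05, seed=0):
    """dataset is a list of (cif_id, coords, target) tuples.

    Returns the originals plus n_copies perturbed copies of each one.
    Every copy carries the target of the structure it came from, so no
    augmented sample is unlabeled."""
    rng = random.Random(seed)
    out = []
    for cif_id, coords, target in dataset:
        out.append((cif_id, coords, target))
        for k in range(n_copies):
            new_id = f"{cif_id}_perturbed_{k}"  # hypothetical naming scheme
            out.append((new_id, perturb(coords, max_shift, rng), target))
    return out
```

With one original structure and `n_copies=2`, the result holds three samples that all share the same bandgap label, which is why a file listing only the original IDs and targets is enough to label the augmented set.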
Hi,
First of all, thanks for creating this tool.
I am using the OMDB dataset for bandgap prediction with the Schnetpack and ALIGNN models. The OMDB dataset is similar to QM9, but it includes much larger molecules, with an average of 82 atoms per unit cell:
https://omdb.mathub.io/dataset
The dataset includes bandgap values and an xyz file containing all structures.
The problem is that the dataset is not as big as QM9. It only includes 12500 molecules, which limits the MAE that can be reached. Is there any way to import this dataset into AugLiChem and augment it to train Schnetpack?
Best regards,
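As a starting point for the import question, a single xyz file holding many structures can be split with the standard library alone. The sketch below assumes the plain XYZ layout (an atom-count line, a comment line, then one `symbol x y z` row per atom, repeated for each structure). Whether the OMDB file follows exactly this layout is an assumption, and `parse_multi_xyz` is a hypothetical helper, not part of AugLiChem.

```python
def parse_multi_xyz(text):
    """Split concatenated XYZ text into a list of structures.

    Each structure is a dict with the record's comment line and a list of
    (symbol, x, y, z) tuples. Columns after the first four are ignored."""
    lines = text.splitlines()
    structures = []
    i = 0
    while i < len(lines):
        if not lines[i].strip():
            i += 1  # skip blank lines between records
            continue
        n_atoms = int(lines[i].strip())
        comment = lines[i + 1].strip() if i + 1 < len(lines) else ""
        atoms = []
        for row in lines[i + 2 : i + 2 + n_atoms]:
            symbol, x, y, z = row.split()[:4]
            atoms.append((symbol, float(x), float(y), float(z)))
        if len(atoms) != n_atoms:
            raise ValueError(f"record at line {i + 1} is truncated")
        structures.append({"comment": comment, "atoms": atoms})
        i += 2 + n_atoms
    return structures
```

Plain XYZ rows carry no unit cell, so writing CIF files from these structures would also need lattice parameters from elsewhere, for example from the comment line if the dataset stores them there.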