Clarification regarding the number of molecular building blocks. Why they are different from JT-VAE? #9

Open
Srilok opened this issue Jan 3, 2023 · 3 comments

Comments

@Srilok

Srilok commented Jan 3, 2023

Hello,

First, I really enjoyed reading the paper. Amazing work!

I have a question regarding the number of building blocks used for generating small molecules. Appendix A.3 of the paper states that there are a total of 105 unique building blocks (after accounting for different attachment points) and that they were obtained by the process suggested in the JT-VAE paper (Jin et al. (2020)). However, in the JT-VAE paper, the total vocabulary size is $|\chi|=780$, obtained from the same ZINC dataset. My understanding was that the two procedures are the same. If that is correct, why is the number of building blocks different here? What am I missing? If they are not the same, can you please explain the difference?

Thank you so much for your help

@MKorablyov
Collaborator

The building blocks in the two papers are not the same, but they are quite similar. In both cases we represent molecules as junction trees, which means the graph of blocks has no cycles. Ours are obtained by BRICS decomposition followed by Bemis-Murcko decomposition. Finally, we had a chemist curate our set of building blocks. In the end, I think our building blocks ended up slightly smaller and more rigid than those of JT-VAE, and they worked better for us in practice.

BRICS: https://www.rdkit.org/docs/source/rdkit.Chem.BRICS.html
Bemis-Murcko: https://rdkit.org/docs/source/rdkit.Chem.Scaffolds.MurckoScaffold.html
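
For readers who want to try this, a minimal sketch of "BRICS followed by Bemis-Murcko" using the two RDKit modules linked above might look like the following. It is not the authors' pipeline: the function name `candidate_blocks`, the decision to skip acyclic fragments, and the omission of chemist curation and attachment-point bookkeeping are all assumptions of this sketch.

```python
from __future__ import annotations

from rdkit import Chem
from rdkit.Chem import BRICS
from rdkit.Chem.Scaffolds import MurckoScaffold


def candidate_blocks(smiles: str) -> set[str]:
    """Return ring scaffolds of the BRICS fragments of one molecule.

    Hypothetical helper: it only illustrates the two decomposition steps
    and does not reproduce the curated 105-block vocabulary.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # unparsable SMILES
        return set()

    blocks: set[str] = set()
    # Step 1: BRICS cuts the molecule at retrosynthetically motivated bonds.
    # Fragments come back as SMILES with dummy atoms marking the cut points.
    for frag_smiles in BRICS.BRICSDecompose(mol):
        frag = Chem.MolFromSmiles(frag_smiles)
        if frag is None:
            continue
        # Assumption of this sketch: acyclic fragments have no Murcko
        # scaffold, so they are skipped here.
        if frag.GetRingInfo().NumRings() == 0:
            continue
        # Step 2: the Bemis-Murcko scaffold keeps rings and linkers and
        # drops side chains (including the dummy atoms left by BRICS).
        scaffold = MurckoScaffold.GetScaffoldForMol(frag)
        if scaffold.GetNumAtoms() > 0:
            blocks.add(Chem.MolToSmiles(scaffold))
    return blocks
```

Taking the union of `candidate_blocks(s)` over a SMILES dataset gives a raw candidate set; any further reduction to a small vocabulary would still depend on a curation step like the one described above.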

@Srilok
Author

Srilok commented Jan 10, 2023

Thank you. After performing BRICS followed by Bemis-Murcko decomposition on the 250k SMILES dataset, I get 8962 unique building blocks. Can you please say a bit more about the curation process? How did you narrow these down to the smaller list of 105 building blocks?

Also, how did you determine the attachment points (block_r in data/blocks_PDB_105.json)?

Thank you
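
To make the attachment-point question concrete, here is a small sketch of how one might summarize a block table such as `data/blocks_PDB_105.json`. The layout is an assumption, not something confirmed in this thread: the sketch supposes a column-oriented JSON object with a `block_smi` column of SMILES strings and a `block_r` column of attachment-atom index lists, keyed by row id. The helper name `summarize_blocks` is hypothetical.

```python
import json
from collections import Counter


def summarize_blocks(table: dict) -> dict:
    """Count blocks with and without distinguishing attachment points.

    Assumed layout (not verified against the repository):
    {"block_smi": {"0": "...", ...}, "block_r": {"0": [0, 2], ...}}
    """
    smiles = table["block_smi"]
    attach = table["block_r"]
    # One row per block: its SMILES plus its attachment-atom indices.
    rows = [(smiles[key], tuple(attach[key])) for key in smiles]
    return {
        "n_rows": len(rows),
        "n_unique_smiles": len({smi for smi, _ in rows}),
        "n_unique_smiles_with_attachment": len(set(rows)),
        "attachment_count_histogram": dict(Counter(len(r) for _, r in rows)),
    }


def summarize_block_file(path: str) -> dict:
    """Load a block table from disk and summarize it."""
    with open(path) as handle:
        return summarize_blocks(json.load(handle))
```

If the assumed layout holds, comparing `n_unique_smiles` with `n_unique_smiles_with_attachment` shows how much of the block count comes from the same fragment appearing with different attachment points.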

@Srilok
Author

Srilok commented Jan 18, 2023

Hi @MKorablyov, just following up on my earlier comment. It would be really helpful if you could provide those details. Thank you so much!
