
Dataset size and creation #3

Open
LivC193 opened this issue Dec 13, 2020 · 3 comments
Comments

LivC193 commented Dec 13, 2020

Hi, first of all congrats on your article and the NeurIPS workshop.

I have a few questions:

  1. Regarding fine-tuning: do you update the pre-trained encoder, or do you freeze it?
  2. You say that any molecule with an ECFP4 similarity higher than 0.323 to 10 drugs was discarded. I assume this was done for generalisation. However, what type of similarity did you use (Tanimoto, Dice, etc.), and why 0.323? Also, have you performed any clustering based on similarity for the final dataset to ensure that the parsed chemical space is balanced?
bfabiandev (Collaborator)

Hey, thanks!

  1. Where we note that we are fine-tuning, we fine-tune the whole pre-trained encoder without freezing any layers.
  2. I think there is some confusion here: 0.323 is the performance of the best model and does not denote similarity, so I'm not sure I understand the question. Feel free to clarify. Thanks!

LivC193 (Author) commented Dec 15, 2020

Hi, thank you for your fast answer. Sorry for the confusion; I will try to explain what I mean.
As input for your model you used the dataset published here:
"To generate the final dataset for the benchmarks, ChEMBL is post-processed by

  1. removal of salts.
  2. charge neutralization.
  3. removal of molecules with SMILES strings longer than 100 characters.
  4. removal of molecules containing any element other than H, B, C, N, O, F, Si, P, S, Cl, Se, Br, and I.
  5. removal of molecules with a larger ECFP4 similarity than 0.323 compared to a holdout set consisting of 10 marketed drugs (celecoxib, aripiprazole, cobimetinib, osimertinib, troglitazone, ranolazine, thiothixene, albuterol, fexofenadine, mestranol). This allows us to define similarity benchmarks for targets that are not part of the training set."

My question was referring to list item 5 in the dataset generation. I assumed that, for each molecule in your dataset, you computed the similarity to those 10 drugs and discarded the molecule if the similarity was higher than 0.323. I was curious how you selected this cutoff and what type of similarity was used.
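For illustration, the filtering step described in item 5 can be sketched with Tanimoto similarity on fingerprints represented as sets of on-bits. This is a minimal sketch, not the actual pipeline: in practice one would compute ECFP4/Morgan fingerprints with RDKit and compare them with `DataStructs.TanimotoSimilarity`; the helper names and toy fingerprints below are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def passes_filter(fp, holdout_fps, cutoff=0.323):
    """Keep a molecule only if it stays below the cutoff for every holdout drug."""
    return all(tanimoto(fp, h) < cutoff for h in holdout_fps)

# Toy fingerprints, not real ECFP4 bits.
mol = {1, 4, 9, 16}
holdout = [{1, 4, 9, 25}, {2, 3}]
print(passes_filter(mol, holdout))  # False: similarity to the first drug is 3/5 = 0.6
```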

As a follow-up question: for your pre-trained model you set max_seq_length (SMILES length) to 128, but in some tests you set it to 512. If I want to use your pre-trained model (max_seq_length = 128) to embed SMILES longer than 128 characters, can I simply change the max_seq_length argument, or would that embedding be incorrect?

JoshuaMeyers

Hi @LivC193, thanks for your interest in our work. Regarding threshold selection for similarity filtering in the GuacaMol training dataset, I can point you to reference (86) in the GuacaMol paper, which is this blog post: http://rdkit.blogspot.com/2013/10/fingerprint-thresholds.html. I believe Tanimoto similarity was used (admittedly the relevant figure from the blog seems to have been transcribed as 0.323 instead of 0.321). This is in line with other suggested Tanimoto thresholds for ECFP4 fingerprints (e.g. here).

If you have any follow-up questions regarding the training dataset, it might be worth asking in the GuacaMol repo (apologies for the slow response here).

Regarding max_seq_length: we use relative positional encodings as described in Transformer-XL, which allows MolBERT to process sequences of arbitrary length at inference time, despite training with a fixed maximum length. One caveat: MolBERT has not been trained on longer SMILES examples, so we cannot guarantee that the model generalizes to longer SMILES; this would require further investigation. I would also be interested in your experience with this if you do try it out.
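The reason arbitrary lengths work can be sketched as follows: a sinusoidal relative encoding (Transformer-XL style) is a function of the distance between two positions, so it is defined for any distance, unlike a learned absolute-position table of fixed size 128. This is an illustrative sketch only; the d_model and frequency schedule below are assumptions, not MolBERT's actual configuration.

```python
import math

def rel_sinusoid(dist, d_model=8):
    """Sinusoidal encoding of a relative distance (Transformer-XL style).

    Defined for any integer distance, so sequences longer than the training
    max_seq_length still receive valid positional information.
    """
    half = d_model // 2
    freqs = [1.0 / (10000 ** (2 * i / d_model)) for i in range(half)]
    angles = [dist * f for f in freqs]
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]

# A learned absolute embedding table would fail past index 127; this does not.
print(len(rel_sinusoid(5)), len(rel_sinusoid(500)))  # 8 8
```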
