COMPETITION: A Predictive Model For Series Four #421

Open
alintheopen opened this Issue Aug 19, 2016 · 67 comments

Owner

alintheopen commented Aug 19, 2016 edited

**UPDATE: Competition Deadline Extended to 31st March 2017.** All necessary data can be found here (contains all the relevant compounds, with ATP4 activities and potencies).

See YouTube video here.

This competition is to develop a computational model that predicts which molecules will block the malaria parasite's ion pump, PfATP4.

PfATP4 is an important target for the development of new drugs for malaria. We are providing a dataset of actives and inactives. The challenge is to use the data to develop a model that allows us to (better) design compounds that will be active against that target.

The competition soft-launched on 19th August 2016. It will close on 31st March 2017. This competition is part of Open Source Malaria, meaning everything needs to adhere to the Six Laws.

Details are as follows.

Outline
We need a predictive model for PfATP4.
PfATP4 is a sodium pump found in the membrane of the malaria parasite. A number of promising antimalarial compounds, with distinct and diverse chemical structures, have been found to be active in an ion regulation assay, which was developed by Kiaran Kirk's lab at the ANU. A number of publications have indicated that this pump is an important new target for malaria medicines. It seems that PfATP4 active compounds disrupt the pump and cause the rapid influx of Na+ into the parasite, leading to its demise. The structure of PfATP4 is not known. Simulations, based on docking of PfATP4 actives, have used a homology model developed by Joseph DeRisi's laboratory. The OSM consortium needs a predictive model for (potency vs PfATP4 activity) to assist in the design and synthesis of new Series 4 compounds and of course to help others working on other compound series.

The first attempt
@murrayfold had a quick, informal attempt (here and here) at the development of a pharmacophore model using known actives and inactives from the MMV Malaria Box. At the time Kiaran Kirk's paper was under embargo but Murray has since written up his work. This initial attempt was unsuccessful (i.e. not predictive - see image below, where the "P Model predictions" correlate poorly with what was found in the ion regulation assay) possibly because the model did not allow for overlapping binding sites or take into consideration compound chirality.

[Image: "P Model predictions" plotted against ion regulation assay results for compounds sent to Kirk and Fidock]

The Competition
We need a predictive in silico model. The best model will win the prize.

How will it work?
OSM will provide:

  • Files containing details of active and inactive compounds. NB: this list is more substantial than the original dataset used by Murray, so you could also use the new data as a 'test set' for any model developed. Here is the composite dataset containing all the relevant compounds, with ATP4 activities and potencies.
  • The Master Chemical List, which contains activity data for all OSM compounds, Series 1-4. Test Set A (column Y) includes both Series One (all inactive against PfATP4) and Series Four compounds.
  • Jeremy Horst's Homology Model:
    PfATP4-PNAS2014.pdb.txt
  • details of the relevant mutations known to be associated with resistance.

Submission Rules

  • all entries to be submitted to GitHub and shared openly
  • entrants can work individually or in teams (no limit to team size)
  • entrants must work openly during the competition. This doesn't necessarily mean that inputs have to be logged in real time (although that is strongly encouraged), but entries that have not openly deposited working data on a regular basis prior to the deadline will not be accepted.
    Open Electronic Notebooks (ELN) such as Labtrove or LabArchives can be useful places to post data and work collaboratively. For example, Ho Leung Ng's ELN can be viewed and commented on here. Please note that LabTrove authors are not alerted when a comment is added to an entry so GitHub is a useful place to tag others.
  • entrants must agree to their work's incorporation into a future OSM journal publication(s)
  • competition winner(s) will be authors on any relevant future paper(s)
  • any valid* entries will at least be acknowledged on any relevant future paper(s) and if the contribution is significant may lead to authorship.

How will entries be assessed?

  • The model will be evaluated using three 'test sets' of molecules:
  1. 'The Frontrunners' #400 [Test Set B (column Y) in Master Chemical List]
  2. @edwintse's new compounds #420 [Test Set C (column Y) in Master Chemical List]
  3. A closed dataset of compounds (not available in Master Chemical List)
    NB the 'closed dataset' of compounds may not be publicly available when the winner is announced. All models will be tested against this 'closed dataset' and all details, including the source of this 'closed dataset' will be revealed as soon as the data has been published.
  • Entrants will predict the activities of compounds against PfATP4 in each of the test sets
  • The winner will be decided by two currently-active members of OSM and two external experts

What's the prize?
$500
...and the opportunity to contribute to our understanding of a new class of antimalarials
...and authorship on a resulting peer-reviewed publication arising from the OSM consortium

What if none of the models are any good?
Good question. If none of the models prove to be predictive, then it may not be possible to announce a 'winner'. All data will be collated and published at least in the form of a blog, if not a paper...and then we will try again.

Deadline for Entries
30th October 2016, 23:59:59 AEST (Sydney Time)

*A 'valid' entry is one that stands up to the rigour expected from published in silico models. Judges are entitled to use discretion in the case of unconventional entrants, for example those from people with no formal training such as high school students.

An open consultation on how best to run this competition was conducted in Issue #412
Initial competition discussion was started in Issue #417

Owner

mattodd commented Oct 28, 2016

Deadline updated to Jan 16th 2017 as per comment here.

What kind of in silico model is requested here? A classification model (e.g. 'active' vs 'inactive' above/below a particular potency)? Potency predictions against parasite (in uMol, as directed by 'Column B' as found here)?

Thanks!

Owner

mattodd commented Nov 16, 2016

Activity against PfATP4 according to Kiaran Kirk's ion regulation assay. The current granularity is yes/no, though there are a few compounds that are "slightly active" and have been entered as 0.5. A model that predicts this would be useful. We make the assumption that activity in the assay correlates with activity vs parasite; this has held so far.

On 17 Nov 2016 5:45 am, "spadavec" notifications@github.com wrote:

What kind of in silico model is requested here? A classification model (e.g. 'active' vs 'inactive' above/below a particular potency)? Potency predictions against parasite (in uMol, as directed by 'Column B' as found here: https://docs.google.com/spreadsheets/d/1WWP8fE3X2BLzZ7jOm6bRWpnHZqVJBf8XgIYEb7hshXU/edit#gid=1950012249)?

Thanks!


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#421 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AELtNYntq_JtPBH0mgjTrHOOEGL1P0NKks5q-06xgaJpZM4JoIcr
.

@mattodd Thank you for the response! I want to clarify the purpose of this competition a little further, so I hope it's OK if I ask a few other questions:

  • is the purpose of this in silico model to triage synthesis of derivatives around an arbitrary (or specific) core (e.g. in fleshing out SAR)? Or is it geared more towards being a screening tool to discover novel chemotypes/cores?
  • what kind of accuracy is needed from the model? It would appear from the dataset that you'd like to be able to delineate compounds with EC50 below ~5 microM from those less potent. Is this classification sufficient, or do you need to know an approximate EC50 for any given compound within a given range (e.g. within ~1.5 kcal/mol, or a ~10-fold potency range)?
  • is there a specific limitation on using the data found in this datasheet for creating our model, or can we use GI50/EC50 values from the larger screens of Novartis/GSK (e.g. here and here, which appear to use LDH as a proxy for effectiveness)?

My apologies if I'm asking questions that have been answered elsewhere, but I dug around for a bit and couldn't find anything to these points :)

Owner

mattodd commented Nov 20, 2016 edited

No problem at all.

Well, it'd be great to understand this PfATP4 target a little better, but the immediate practical concern for us is indeed the triage. We would love to be able to predict better which of OSM's future Series 4 compounds will be actives. So developing a model around Series 4 is the main objective. A model that achieves this may or may not be effective in predicting structures based on other chemotypes. If the mechanism of action is the same, we might expect it to predict those other structures. If, however, PfATP4 has multiple binding sites, then it won't. I have my money on the latter.

Accuracy: This is not set. I'm not sure, to be honest - others should feel free to chime in with what is desirable/reasonable. From a pragmatic standpoint I'd like to know whether a proposed structure is predicted to be potent at the 200 nM level (good) or 5 micromolar level (bad). If it's possible to be confident in finer-grained predictions (compound X will be 200 nM, compound Y will be better than that) then great. I had assumed relatively low-definition predictions.

You can use whatever data you like. We've tried to collate the most relevant and useful data into one sheet - i.e. all the compounds for which we have ion regulation data and all the compounds we've made in OSM so far, as well as lit compounds that have been assayed. Be aware that other antimalarials identified in major screens (like the ones you cite) may have different mechanisms of action. They could of course be assumed to be PfATP4-inactives, but I don't know how well that will hold up. The Malaria Box screen mentioned above was of a subset of 400 known antimalarial compounds, and many more than might be expected were active in the "PfATP4" (more properly, "ion regulation") assay.
Don't hesitate to come back with more questions if you like.

Owner

mattodd commented Nov 24, 2016

2016 paper from Elizabeth Winzeler may be of interest - homology model of related target in yeast (ScPma1p) and docking with KAE609 and dihydroisoquinolines.

Thanks again for the clarification! If I'm understanding things properly, we want an in silico model for PfATP4, as it seems to correlate well with whole cell assays/potency, and is the putative target for the lead series we have - is that correct?

In my initial tests, it actually seems to me that the PfATP4 assay doesn't correlate well with whole cell assay data. If you look at all the compounds which are "inactive" in the PfATP4 assay, the average EC50 value is ~1.1 microM, with a standard deviation of 3.08 microM. Conversely, the average EC50 of "active" compounds in the PfATP4 assay is ~0.66 microM, with a standard deviation of 0.56 microM. It appears to me that there isn't a strong correlation between the two assays, or rather, that the PfATP4 assay doesn't have good discriminatory power (the active EC50 falls well within 1 std. dev. of the inactives' EC50). I should note that the data I'm using for this analysis is in the spreadsheet @mattodd compiled here, notably the "Ion Regulation Activity" and "Potency vs. Parasite" columns.
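The comparison above can be reproduced in a few lines. The sketch below uses invented toy numbers in place of the real spreadsheet values; only the meanings of the two columns are taken from the thread:

```python
# Toy sketch of comparing whole-cell EC50 distributions between compounds
# marked active vs inactive in the ion regulation assay. The numbers below
# are NOT real OSM data -- they are placeholders for illustration only.
from statistics import mean, stdev

# (ion_regulation_active, whole-cell EC50 in uM) -- invented values
rows = [(0, 0.2), (0, 0.9), (0, 4.5), (0, 0.1),
        (1, 0.3), (1, 0.8), (1, 1.1), (1, 0.4)]

inactive = [ec50 for hit, ec50 in rows if hit == 0]
active = [ec50 for hit, ec50 in rows if hit == 1]

# If the active mean falls within one standard deviation of the inactives,
# the assay has little discriminatory power for whole-cell potency.
print(f"inactive: mean={mean(inactive):.2f} uM, sd={stdev(inactive):.2f}")
print(f"active:   mean={mean(active):.2f} uM, sd={stdev(active):.2f}")
```

With the real sheet, the same comparison would be run over the "Ion Regulation Activity" and "Potency vs. Parasite" columns.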

If I am looking at this properly, then I think any model based on this data will have poor discriminatory power. Instead, I believe it would make more sense to base the model on the whole-cell potency, as it has a lot more data associated with it and will lead to better predictions. I would be more than happy to create two models (one based on PfATP4 assay data, and one based on whole-cell potency data), if the competition will allow it.

My apologies if I'm not understanding the connection properly!

Owner

mattodd commented Dec 12, 2016

Hi @spadavec sorry for the delay.
Yes, correct. The assumption is that PfATP4 is the target. We want to be more predictive, and to understand the interactions between the Series 4 compounds and this target. We don't know if there are in fact multiple binding sites on this target, accounting for the diversity of chemotypes that appear to be interacting with the protein. If we can understand that, great, but our first priority is OSM's Series 4.

For the other question, I think you may be on to something. The "MBox" entries in the spreadsheet are for a set of molecules that is structurally diverse. There are many molecules in there that are potent against the parasite and which have a mechanism of action that is unrelated to PfATP4. This may be skewing things. I'm not sure how best to factor that in.

(FYI: Data on a new set of 20 compounds (or so) should be coming this week with potency and PfATP4 data within Series 4, boosting what we can work with in the short term.)

Submission for competition: http://malaria.ourexperiment.org/in_silico_pfatp4_mo

I split it into 2 parts; the first is just using the data that was marked with relevant Pfal assay activity. The second is using all available whole cell data, given that (in my opinion) the Pfal assay doesn't seem to correlate at all with the whole cell potency.

holeung commented Jan 23, 2017 edited

My submission, using EC50 data and 3D alignment of ligands/fields to predicted protein binding site on PfATPase4. Predict OSM-S364, OSM-S365, OSM-S375 to be most active from Test Set C.

http://malaria.ourexperiment.org/series_4_modeling

edwintse commented Feb 3, 2017

UPDATE: The spreadsheet of data linked above has been updated to reflect the recent potency data reported here and here.

Note that the potency values for compounds with existing data have been calculated by averaging the old and new data values.

Owner

mattodd commented Feb 3, 2017

Hi @spadavec and @holeung - your entries are duly noted, thank you. We were awaiting these final data on relevant and interesting compounds - both frontrunners and newly-synthesised compounds with some more structural variation. The dataset is now, finally, complete. Please accept my apologies for the delay in reaching this point. We will extend the deadline for the competition until probably the end of March, but will clarify this early next week. You can leave your entries as they are or modify them. As indicated in the rules, models will be tested against the Mystery Dataset (which already exists) after the end of the submission period.

spadavec commented Feb 8, 2017

@mattodd Thanks for the update! I've used the new data in my model and posted a new entry in my notebook:

http://malaria.ourexperiment.org/in_silico_pfatp4_mo

Hi everybody!
I would like to participate in this contest by building QSAR predictive models but first I would need a clarification about some issues that are not clear to me. I hope somebody can help me.

  1. The activity to model and predict (meaning the PfATP4 ion pump activity) is the one called “Potency vs Parasite (uMol)” in column B of your data sheet? If that is the case, do we have to focus only on this activity or do we have to take into account also the “Ion Regulation Activity” in column C?

  2. Do we need a quantitative model to quantitatively predict the activity (inside a certain error range) or a qualitative model that predicts only if the molecules are active/inactive? And if we need a qualitative model, which activity value should we use to discriminate between active and inactive compounds? 5 uM? 1 uM?

  3. I'm considering building QSAR models to predict the activity, but it is not clear to me which compounds we can use as a “training set” and which we cannot. In the text it is stated that entries will be evaluated using three 'test sets' of molecules: “The Frontrunners” #400 [Test Set B (column Y) in Master Chemical List], @edwintse's new compounds #420 [Test Set C (column Y) in Master Chemical List] and a closed dataset of compounds (not available in Master Chemical List). In the case of the first two datasets, is it possible to (partially) include those in the training set? I would say that if they will be considered “test sets” they must not be included, but that is not explicitly stated anywhere in the text. This is why I'm asking.

  4. When will the molecular structures of the closed dataset of compounds be released for prediction?

Any answer, feedback and suggestion will be appreciated.

@gcincilla (Anyone please correct me if I am wrong!)

The activity to model and predict (meaning the PfATP4 ion pump activity) is the one called “Potency vs Parasite (uMol)” in column B of your data sheet? If that is the case, do we have to focus only on this activity or do we have to take into account also the “Ion Regulation Activity” in column C?

The "activity against parasite" column is the whole cell assay EC50 value, while the "Ion Regulation Activity" column holds the results of the biochemical assay against PfATP4, which is the putative target. The intent (correct me if I'm wrong) is to have a QSAR model for the PfATP4 assay data, but you could, in theory, use the whole cell assay data to help.

Do we need a quantitative model to quantitatively predict the activity (inside a certain error range) or a qualitative model that predicts only if the molecules are active/inactive? And if we need a qualitative model, which activity value should we use to discriminate between active and inactive compounds? 5 uM? 1 uM?

Either a quantitative or qualitative model is allowed. @mattodd mentioned earlier that a classification model would be allowed, with the hope that the cutoff would enable separation of nM from uM compounds. If you go the regression route, then there isn't a cutoff for accuracy; it just needs to be better than everyone else's and actually help move the project forward (or at least have the potential to).
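To illustrate the classification framing, here is a hypothetical sketch; the function name and EC50 values are invented, not part of the competition rules:

```python
# Hypothetical sketch: turn whole-cell potencies into active/inactive labels
# at a chosen threshold, as in the 5 uM vs 1 uM cutoff discussion above.
def classify(ec50_uM, cutoff_uM):
    """Return 1 (active) if the EC50 is below the cutoff, else 0 (inactive)."""
    return 1 if ec50_uM < cutoff_uM else 0

potencies = [0.2, 0.5, 1.1, 5.0]                    # toy EC50 values in uM
labels_1uM = [classify(p, 1.0) for p in potencies]  # -> [1, 1, 0, 0]
labels_5uM = [classify(p, 5.0) for p in potencies]  # -> [1, 1, 1, 0]
```

The choice of cutoff directly changes the class balance the model trains on, which is why the thread keeps returning to it.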

I'm considering building QSAR models to predict the activity, but it is not clear to me which compounds we can use as a “training set” and which we cannot. In the text it is stated that entries will be evaluated using three 'test sets' of molecules: “The Frontrunners” #400 [Test Set B (column Y) in Master Chemical List], @edwintse's new compounds #420 [Test Set C (column Y) in Master Chemical List] and a closed dataset of compounds (not available in Master Chemical List). In the case of the first two datasets, is it possible to (partially) include those in the training set? I would say that if they will be considered “test sets” they must not be included, but that is not explicitly stated anywhere in the text. This is why I'm asking.

You are supposed to train your model on compounds which are not marked "B" or "C" in the Ion Regulation Test Set (e.g. A or Lit, etc). The "test sets" come in two forms; first, the quality of your model will be judged against compounds labeled "B" or "C" in the Ion Regulation Test Set column. Then, after the competition date closes, we will be given another set of "secret" compounds (not yet posted) for us to evaluate the potency.

@spadavec, thank you very much for your reply and explanation. Now everything is clearer.
So, if I correctly understood, for this first phase we are required to provide only the PfATP4 ion pump activity prediction for compounds marked as “B” and “C” in the Ion Regulation Test Set column and a description about how we built the model. Is that right? Thank you in advance!

Owner

mattodd commented Mar 1, 2017

Hi @gcincilla - I hope this competition is of interest! Thanks @spadavec for these clarifications. You beat me to it. Yes, the language we used about "test sets" is a little ambiguous. In my mind we would be able to see whether the model is a good one by its ability to predict activity vs PfATP4 for the molecules for which we already have data. There's no restriction on how you do this. But the final test will be vs the mystery set which we'll reveal after the competition closes (which will have structures and PfATP4 assay data in it, as well as whole parasite potency I believe). Models submitted at the time of the competition close need to be run vs that set.

Note one thing (which I hope is clear in the above). There appear to be a number of chemotypes hitting PfATP4. We'd love to know how this is possible. It is possible that a model is developed which is predictive for OSM's Series 4 but not predictive for other chemotypes. Or vice versa. Both these situations are interesting, as is the case that one model works for all chemotypes. Since we don't understand the target well, it's as well not to be too prescriptive here. On the one hand we really need to be much more predictive for Series 4 (that, pragmatically, is OSM's priority) but we'd also love to understand how this target can be impacted by so many different structures, even if we're not going to make those in the short term.

gcincilla commented Mar 3, 2017 edited

@mattodd and @spadavec, thank you for your replies. I decided to participate in the competition, trying to build a good model for PfATP4, hoping this can help move research forward. I have a couple of additional questions for you guys:

  1. The "activity against parasite" (column B) is an IC50 or an EC50?
  2. I saw, @spadavec, that you submitted two different models in your notebook (called Model A and Model B) using different sets of training compounds. This can be a good option depending on whether the desired final model should be more or less general (meaning it can be applied to a wider or narrower chemical space). As this is not stated in the competition text: is it possible to present more than one model? I mention this because, depending on the pre-modeling data analysis, I may want to build a classification model for the PfATP4 “Ion Regulation Activity” (column C) and a regression model for the “Potency vs Parasite” (column B), or two different versions of them using different training sets. Anyway, strictly speaking in terms of the competition, this multiplies the chances of winning.
  3. @spadavec, I saw in your notebook that you have negative pEC50 values. Normally pEC50s are positive values, being -log10(EC50) where EC50 is expressed in mol/L. So a pEC50 of 6 corresponds to a compound active at 1 uM, and a pEC50 of 9 to one active at 1 nM. Did you scale pEC50 to the 1 uM range?
    Thanks in advance for any comments. I will publish the results in the next days/weeks, as soon as I have them.

spadavec commented Mar 4, 2017 edited

@gcincilla

The "activity against parasite" (column B) is an IC50 or an EC50?

EC50

I saw, @spadavec, that you submitted two different models in your notebook (called Model A and Model B) using different sets of training compounds. This can be a good option depending on whether the desired final model should be more or less general (meaning it can be applied to a wider or narrower chemical space). As this is not stated in the competition text: is it possible to present more than one model? I mention this because, depending on the pre-modeling data analysis, I may want to build a classification model for the PfATP4 “Ion Regulation Activity” (column C) and a regression model for the “Potency vs Parasite” (column B), or two different versions of them using different training sets. Anyway, strictly speaking in terms of the competition, this multiplies the chances of winning.

As far as I understand it, only one model was needed. I made a second model more out of morbid curiosity than anything, and because it might eventually get to the point where we want to predict compounds against whole cell, not just Series 4 against PfATP4. I am only submitting Model B as my entry to the competition.

@spadavec, I saw in your notebook that you have negative pEC50 values. Normally pEC50s are positive values, being -log10(EC50) where EC50 is expressed in mol/L. So a pEC50 of 6 corresponds to a compound active at 1 uM, and a pEC50 of 9 to one active at 1 nM. Did you scale pEC50 to the 1 uM range?

In my notebook, I mention that my 'pEC50' is log(EC50) (base 10), hence the negative values. Sorry for the confusion!
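The two conventions can be put side by side in a short sketch (function names invented for illustration). Standard pEC50 uses EC50 in mol/L, so values are positive; log10 of the EC50 in uM goes negative for sub-micromolar compounds:

```python
import math

def standard_pEC50(ec50_uM):
    """Standard convention: -log10 of the EC50 expressed in mol/L."""
    return -math.log10(ec50_uM * 1e-6)

def notebook_pEC50(ec50_uM):
    """Convention used in the notebook: log10 of the EC50 in uM."""
    return math.log10(ec50_uM)

standard_pEC50(1.0)     # ~6  (1 uM)
standard_pEC50(0.001)   # ~9  (1 nM)
notebook_pEC50(0.5)     # negative, as seen in the notebook
```

Both are monotone transforms of the same quantity, so a model's ranking is unaffected; only the sign and offset of the values differ.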

Owner

mattodd commented Mar 5, 2017

Hi @gcincilla - you can submit as many as you like, but we'd take only one as the entry, which would be the most recent unless an older one is flagged as better. Pragmatically we need a model focussed on Series 4. Intellectually we'd love to understand wider chemotypes.

Hi all, I am James McCulloch (is there a formal process to join OSM?) and have entered the competition. I have a couple of questions and some software.

First the questions.

  1. In the spreadsheet "Ion Regulation Data for OSM Compeition.xls" there is a column "Single Shot Inhibition %". How should I interpret this data? Most other compounds have Ion Activity classes or IC50 values, but a few have only a value for this field, and it seems a pity to exclude them from the analysis.

  2. In my research on the project I have come across third-party QSAR descriptor data, such as that generated by an old (but still free) version of the "Dragon" program. Can we use this third-party QSAR data for classification, or are we limited to the data in the spreadsheet?

The software.

  1. I have released a version of my "meta entry" software, which is available on GitHub here: https://github.com/kellerberrin/OSM-QSAR This software uses the plug-in pattern to allow ML models to be developed very quickly. It currently implements a couple of NNs using Keras, including the NN posted by Vito Spadavecchio (thanks very much Vito, any errors are mine), and a couple of models from SKLearn. The number of models will increase substantially because I intend to take the "kitchen sink" approach to this problem. I will post when they are stable(ish). The software runs on Python 2.7 and 3.5, Linux or Windows (Mac untested).

Hi @kellerberrin,
Welcome on board! As far as I know there is no formal process to join OSM (anyone please correct me if I am wrong). You only need to announce your participation here and follow the submission rules stated above.
Unfortunately I cannot answer your first question, but I'm sure members of the Open Source Malaria organization will. For the moment I took into consideration only the activities in the columns called “Potency vs Parasite (uMol)” and “Ion Regulation Activity”.
With respect to your second question, I think you can use any molecular descriptors you like, including third-party ones.
I hope this helps.

@mattodd, I published the first entry in the notebook with a pre-modeling activity analysis. As already correctly mentioned by @spadavec, if we consider all the available molecules, the PfATP4 Ion Regulation Activity assay doesn't correlate well with the whole cell assay (see figure 1). Nevertheless, a correlation between these two assays can be observed if only the OSM Series 4 compounds are considered (see figure 2).
We may try to build models for both activities but, as far as the competition is concerned, given that just one model can be submitted, I would like to know which of the two activities will be used to assess the model's predictive ability: “Ion Regulation Activity” or “Potency vs Parasite”? If I understood correctly it will be only the “Ion Regulation Activity”, but I would like you to confirm this. Thank you in advance.

Thanks for the feedback @gcincilla. I will post the Dragon QSAR data as soon as I download it (there is a limit of 150 SMILES per request).

Hi All,

I have posted the EDragon QSAR data here: https://github.com/kellerberrin/OSM-QSAR

You can also download the data from here: http://www.vcclab.org/lab/edragon/

I strongly advise using DragonNorm.csv initially (dimension 1552). I have normalized all fields in this file to the interval [0,1] and removed uninformative fields.

Unlike the sparse RDKit fingerprints, the Dragon data is very busy and NNs are likely to wander off into a local minimum. However, a deep neural network [1552, 50, 50, 10, 3] classifier provided a surprisingly good AUC = 0.76 (raw output files attached) against "Ion Regulation Activity". This result can almost certainly be significantly enhanced with an improved NN model and the addition of complementary QSAR data such as fingerprints. Hopefully, I will have time to post refined results tomorrow.
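The preprocessing behind DragonNorm.csv (rescaling each descriptor column to [0, 1] and dropping uninformative fields) can be sketched without any ML libraries. This is a minimal illustration, not the actual script used; the toy 3x3 matrix stands in for the (n_molecules, 1666) Dragon matrix, with one constant column playing the role of an "uninformative" field:

```python
# Min-max normalize descriptor columns to [0, 1], dropping constant columns.
def normalize_descriptors(rows):
    cols = list(zip(*rows))                          # column-major view
    kept = [c for c in cols if max(c) > min(c)]      # constant = uninformative
    scaled = [[(v - min(c)) / (max(c) - min(c)) for v in c] for c in kept]
    return [list(r) for r in zip(*scaled)]           # back to row-major

X_raw = [[1.0, 5.0, 3.0],
         [2.0, 5.0, 9.0],
         [3.0, 5.0, 6.0]]
X_norm = normalize_descriptors(X_raw)   # middle (constant) column is dropped
```

Normalizing to a common scale matters for NN training on descriptors whose raw ranges differ by orders of magnitude (molecular weight vs. count descriptors, for example).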

TestStatistics.txt
TrainStatistics.txt
OSM_QSAR.txt

Here is a key to all the EDragon QSAR fields. Note that these fields can also be used to check compound ADME, Rule of 5, etc.

dragon_molecular_descriptor_list.pdf

I've just finished my first set of modelling experiments, so here is my initial submission.

Owner

mattodd commented Mar 23, 2017

Hi @kellerberrin. No, no “joining” process. If you want to work on new medicines for malaria in an open way, you’re part of this thing already.

Sorry for the delay in replying. Major grant-writing season here in Aus.

The “single shot” data arose from an assay measuring potency of the compounds against the parasite. They are less useful than the full potency assay since they do not provide the usual inhibitory concentration. Compounds that are active (values near 100%) would have then had concentration values (IC50s) measured. Lesser activities in the single shot assay would have meant that compound was not examined in more detail.

The Dragon data - I’m not sure what you mean. Data on these compounds?

Hi @gcincilla yes, the aim is a prediction of activity in the ion regulation assay.

Thanks for the entry @IamDavyG!

Everyone - we’ve about a week left! Very excited.

Hi @mattodd and everybody.

Initial Report Submission.

The Dragon Data is a vector of 1666 molecular descriptors generated for each molecule. It starts with molecular weight and then moves on. Basically, everything and the kitchen sink. Neural Network models can train on this kind of unstructured data and may not need more structured data such as molecular fingerprints.

In fact, I have just completed my preliminary report into building a NN classifier (AUC = 0.77) for molecular PfATP4 Ion activity. The report is attached below as a pdf or there is a jupyter notebook here: https://github.com/kellerberrin/OSM-QSAR/tree/master/Results

The report finds that the NN classifiers produce better results on the Dragon data than on RDKIT fingerprints.

Best to all

OSM+Prelim+Results.pdf

Hi everybody.

Initial Report Submission (2).

I used a custom Neural Network and an "off-the-shelf" SKLearn logistic classifier to predict whether the test molecules fell into three classes; the AUC for each classification is in brackets:

EC50 <= 200 nMol (AUC = 0.82)

EC50 <= 500 nMol (AUC = 0.93)

EC50 <= 1000 nMol (AUC = 0.82)

Then, using the classifiers I developed to predict whether a molecule is in the EC50 <= 500 nMol class, I predicted whether the molecule was PfATP4 Ion activity [ACTIVE]. The custom Neural Network predictor had a PfATP4 Ion activity prediction AUC score of 0.95, and the "off-the-shelf" SKLearn classifier had an AUC score of 0.93.
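For readers scrutinising these figures: AUC can be computed directly as a rank statistic, the probability that a randomly chosen active scores above a randomly chosen inactive (ties counted as half). A minimal sketch with invented scores:

```python
# AUC as a rank statistic over all (active, inactive) score pairs.
def auc(scores, labels):
    """labels: 1 = active, 0 = inactive; scores: model outputs."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])   # perfect separation -> 1.0
auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])   # one misranked pair -> 0.75
```

This pairwise definition agrees with the area under the ROC curve, so it is a quick way to sanity-check AUC values reported by a modelling pipeline.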

If these numbers bear scrutiny, and please scrutinise, then they are pretty good.

These calculations are available in the jupyter notebook on google drive:
https://drive.google.com/open?id=0B0Rfx1fjhlsaMWFsTmFZeUx5WUk

A pdf file of the notebook is attached.

Best to all

EC50+Prelim+Results.pdf

Hi everybody,
We developed several PfATP4 Ion Regulation Activity classification models using different strategies for modeling set sampling, different machine learning methods and different descriptors. Here we submitted the best performing one with which we achieved good general results: balanced accuracy (for actives) = 0.77, sensitivity (for actives) = 0.833, AUC (for actives) = 0.810. All the details are explained in the notebook entry.
If, once the OSM hidden test set is published, our predictive model for PfATP4 Ion Regulation Activity proves useful, anybody will be able to exploit it effectively and thoroughly once Molomics provides it in Lead Designer, an Android app for quick and easy access to molecular properties important in drug discovery.
Lead Designer lets users sketch new molecules with an easy, fully automated touchpad drawing mechanism. For each molecule, the PfATP4 Ion Regulation Activity class and its associated prediction confidence can be calculated instantaneously on the fly. In this way, everyone willing to participate in the OSM project, especially medicinal and synthetic chemists, can formulate design hypotheses for new active compounds and easily check in real time whether those compounds have a high chance of being active (according to the provided prediction model). Each user can save his or her interesting molecules to the cloud and later access them from different devices through their own account.
If the current proposal is of interest, especially to the medicinal and synthetic chemists involved in the project, Lead Designer could be used for the design of new active compounds for OSM Series 4. All molecules designed for the project through Lead Designer are automatically collected in the cloud and then provided to the OSM consortium for possible synthesis and testing. As Lead Designer can involve an arbitrarily large number of participants spread around the globe, this project could become the world's first crowdsourced drug design campaign, which could also be interesting for publication purposes.
Please let us know whether you would be interested in this proposal.

Member

MedChemProf commented Mar 28, 2017

@gcincilla Any chance you are coming out with an iOS version?

@MedChemProf, we have the porting of Lead Designer to iOS on our to-do list, but that will not be completed in the short term, i.e. not in time for the OSM project.

jon-c-silva commented Mar 29, 2017 edited

Hi everyone! I am new to OSM. I probably won't have anything ready to submit to the competition by the end of the week, but I'm learning a lot from all the submissions and I may be able to help in the future.

@gcincilla, may I ask how you decided which samples to keep for internal/external validation? I was also training a Random Forest, but I was using the competition sets as external validation and it was not a good approach.

Hello All,

I have released my entry for the PfATP4 Ion Regulation Activity competition on a notebook here. The headline AUC score is 0.89.

The final model is a meta-classifier which uses the probability maps of upstream classifiers to produce an optimal composite classification. This seems like a powerful innovation: provided a classifier carries additional independent information, you can just add it to the meta-classifier to increase its predictive power.
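A minimal sketch of this stacking idea (the dataset and model choices below are invented stand-ins, not the actual submitted model): the base classifiers are trained, and their predicted probabilities become the features for a simple logistic-regression meta-learner.

```python
# Sketch of the meta-classifier idea: feed the probability outputs of
# several upstream classifiers into a downstream logistic regression.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

base = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("nn", MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)),
]
# stack_method="predict_proba" makes the meta-learner see probability maps,
# so any classifier with independent signal can simply be appended to `base`.
meta = StackingClassifier(
    estimators=base,
    final_estimator=LogisticRegression(),
    stack_method="predict_proba",
)
auc = cross_val_score(meta, X, y, cv=3, scoring="roc_auc").mean()
print(f"stacked AUC = {auc:.2f}")
```

Appending another `(name, estimator)` pair to `base` is all it takes to fold a new classifier into the composite, which is the "just add it" property described above.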

The notebook also has details on how to download the classification software and run it on your Linux or Windows computer. I have set up the software so that it runs from a single batch file command.

The software is written in Python and was forked from a regression model developed by @spadavec; thanks Vito (BTW, I checked the molecules you suggested in #486 - they look good). If anyone has suggestions for bells and whistles that might be bolted onto the software, please let me know.

BTW, have any of you Pythonistas looked at the DeepChem software? It was developed by a group at Stanford University. I cloned a copy from GitHub and have begun playing around with it. It looks impressive.

Best to All.

Owner

alintheopen commented Mar 30, 2017

Hi everyone - thanks for the entries, really interesting to see the different approaches and can't wait to see the results!

One day to go...

@jon-c-silva we tried several strategies we generally use here at Molomics in cases similar to the OSM one. We'll be happy to explain the sampling details if our model turns out to be the best.

We also tested the sampling strategy you mentioned and it didn't perform as well as the model we submitted. This is probably because most OSM Series 4 molecules (i.e. 35 out of a total of 43) are in the competition set (Ion Regulation Test Set = "A,B", "B", and "C"). As they represent 81% of all of OSM Series 4, if the objective is to build a predictive model especially focused on OSM Series 4 molecules, in our opinion they cannot be completely excluded from the model.

Here is my final submission.
@kellerberrin Yeah, I have extensively used DeepChem in my submission. It has come a long way from where it started last year.

holeung commented Mar 31, 2017

My final submission is here. I took a different approach from the others and instead tried to predict a binding site and the best fitting ligands.

Here is my submission.
I agree with @gcincilla: since we are only interested in Series 4 at the moment and the actual validation will be against the secret compounds, a sampling strategy can be useful here.
I used a sampling strategy of my own: Tanimoto similarity to construct a network of the compounds and sample those in the vicinity of OSM Series 4. Some of the test set compounds were used for internal validation too.
XGBoost was my algorithm of choice and it had a multiclass AUC of 0.8462.
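The Tanimoto-based sampling step could look something like the following sketch (fingerprints are represented here as plain sets of "on" bit indices, and all compounds, names and the 0.5 threshold are invented for illustration):

```python
# Sketch of similarity-based sampling: keep only pool compounds whose
# Tanimoto similarity to at least one Series 4 seed exceeds a threshold.

def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient of two fingerprint bit-index sets."""
    if not fp_a and not fp_b:
        return 0.0
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

def sample_near(pool: dict, seeds: list, threshold: float = 0.5) -> list:
    """Return ids of pool compounds within `threshold` of any seed."""
    return [cid for cid, fp in pool.items()
            if any(tanimoto(fp, s) >= threshold for s in seeds)]

seeds = [{1, 2, 3, 4}, {10, 11, 12}]          # Series 4 "seed" fingerprints
pool = {
    "near_s4":  {1, 2, 3, 5},     # Tanimoto 3/5 = 0.6 to the first seed
    "far_away": {20, 21, 22, 23}, # similarity 0.0 to both seeds
}
print(sample_near(pool, seeds))   # only the Series-4-like compound survives
```

In practice the fingerprints would come from a cheminformatics toolkit (e.g. RDKit Morgan fingerprints), but the selection logic is the same.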

madgpap commented Mar 31, 2017

Nice one @jon-c-silva!

Thanks @IamDavyG for your response about DeepChem. I have been watching some YouTube videos on DeepChem made by Bharath Ramsundar and playing around with the examples in their GitHub repository. These guys have some powerful ideas and, even better, have already implemented them in a Python library. Should we collaborate on using those ideas and code for OSM? Is there a discussion forum for cheminformatics in OSM?

spadavec commented Apr 5, 2017

@mattodd @alintheopen do we know when the final test set will be released?

Owner

mattodd commented Apr 6, 2017

Hi guys - just quickly. Am so sorry for the delay in responding to recent posts and for not yet posting a roundup/info about what happens next. Will get to this in the coming few days. Hang on...

Owner

alintheopen commented Apr 17, 2017

Hi everyone, I've constructed a Google Sheet with a summary of entries and notebook information ready for the judges and anyone following this competition. Feel free to let me know if you'd like to edit/add anything that could be useful.

https://docs.google.com/spreadsheets/d/1pY6sYXIw66jnzUO3CoP8HceYdDjLRvwg5_pLkBY1Wek/edit?usp=sharing

@alintheopen, thank you for the summary you prepared. In my case it seems the 3 references on QSAR modeling I followed are missing from the references column. They are listed at the end of this notebook entry.
What are the next steps? Do you know when the molecular structures of the final test set will be released?

Owner

mattodd commented Apr 18, 2017

Just finalising this right now, in fact @gcincilla . I'm hoping in the next day or so, but we're just clarifying how public we can make the dataset at the moment.

Owner

mattodd commented May 10, 2017

Hi all - so sorry for the delay.

The test dataset you need in order to evaluate the models consists of 400 compounds - it’s the MMV Pathogen Box, available here. This has been screened for activity in the Kiaran Kirk ion regulation assay by Adelaide Dennis. We are not, at this stage, making the identity of the hits public; by maintaining the confidentiality of the data set we can continue to use it as a test set for further iterations of any of the models, or indeed alternative models.

Instead we will ask all of the submitters of the models to download the spreadsheet of Pathogen Box compounds, run it through their own submitted models and to generate an output for the judges.

Ideally we would have outputs that can be easily compared. Can we come to an agreement as to what that would look like? A number? A diagram? The outputs can be uploaded to this Github post, or elsewhere. The judges can then compare those outputs with the actual activities obtained by Kiaran’s team and determine performance of the models. We will have two technical judges who will oversee this evaluation phase, to ensure that everything is done fairly.

Can each submitter please “like” this post to indicate their agreement to this way forward and that the dataset is downloadable and suitable?

Owner

murrayfold commented May 10, 2017

An active list of, say, 50-100 compounds (number to be decided) would be good. It would then be easy to compare against the test set data. Judging should be based not only on hit-rate % but also on the scaffold variation of the correctly predicted active compounds.

If results could be submitted in a standard format, that would make judging easier too. A CSV file of SMILES would probably be the most useful.
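The proposed submission format could be generated with a few lines of standard-library Python (the compounds and probabilities below are invented placeholders, and the column names are just one possible choice):

```python
# Sketch of the proposed submission format: a CSV of SMILES ranked by
# predicted probability of being active.
import csv

predictions = [
    ("CCO",      0.91),   # (SMILES, predicted active probability)
    ("c1ccccc1", 0.35),
    ("CC(=O)O",  0.78),
]
# Most probable active first, as requested for rank-order comparison.
ranked = sorted(predictions, key=lambda p: p[1], reverse=True)

with open("submission.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["rank", "smiles", "active_probability"])
    for rank, (smiles, prob) in enumerate(ranked, start=1):
        writer.writerow([rank, smiles, f"{prob:.2f}"])
```

A shared format like this would let the judges score every entry with the same rank-based metrics.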

@mattodd Just to clarify: are we tasked with ranking/classifying all of the compounds in the "MASTER SHEET" sheet (of which there seem to be ~400) in the file you attached?

I am also fine with creating a report on a subset of the compounds in the sheet, but would like to know what reporting method is preferred. Given that I went with a regression approach, I can either rank-order the compounds or provide all compounds beyond a certain micromolar cutoff, whichever is best for the group!

holeung commented May 10, 2017

Sounds good. I request that the number of compounds to be evaluated not be too large, maybe ~20. My methods are not automated; I like to look at and evaluate everything visually. I suggest rank order as the output, as that is easy to evaluate and actionable.

Owner

mattodd commented May 10, 2017

So: a rank order in .csv format that includes SMILES. Everyone OK with that?

The dataset contains experimental assay data for all 400 compounds. Some actives, some inactives. I guess the ideal situation is that the model predicts something for every compound, then ranks them. But is that likely to cause problems, based on what you've been saying above?

I can do rank order, and have created a spreadsheet with the results here, and a rather brief notebook entry here.

gcincilla commented May 12, 2017 edited

Hi everybody,
Thank you for providing the hidden molecule set. I think it's good to provide the prediction results in ranked order in ".csv" format (also including SMILES), as @mattodd suggested. I have some questions about how the project will continue.

  1. How will the technical judges evaluate this test set (and the previous ones)?
  2. Which assessment metrics will be used? (e.g. active molecules AUC, active molecule Enrichment Factor, etc)?
  3. What is the deadline for providing the ranked hidden set?
  4. When will the final model evaluation results be published?

Thank you in advance

Owner

mattodd commented May 18, 2017

Hi all, @gcincilla - sorry for the delay, I was in transit to a conference in Boulder where it's snowing, despite our being in the middle of May.

I'll consult with the judges on your questions and post back here. No immediate deadline, but each submitter of a model should submit the output of the model. If this does not happen in the next week, I'll chase people manually, to ensure that everyone is aware we're at this stage of the competition.

There is an unresolved issue about throughput - whether people can analyse compounds in batch or not - this needs resolving. Is the analysis of all 400 compounds going to be a problem, and should we do something about this if so, e.g. apply the models to a subset of the full Pathogen Box? Please let me know - I think we'd all made the assumption that the models would be able to take a number of molecules as inputs at the same time, in batch.

@mattodd referenced this issue May 18, 2017: 3D follow up #500 (Open)

Hi everybody,
Here you can find the file with the 400 compounds of the hidden test set and the predictions done by our previously described PfATP4 Ion Regulation Activity classification model. Where:

  1. the “Predicted PfATP4 Ion Regulation Activity Class” is the predicted class for the given molecule (1.0 = “active”, 0.0 = “inactive”, 0.5 = “partially active”).
  2. the “PfATP4 Active Class Probability” is the probability that the given molecule is active.

Molecules are reported in rank order with the most probably active at the top, so that rank-based evaluations (e.g. AUC) are facilitated.
Here we also introduced the results as a notebook entry.

@mattodd, thank you for your answer. Regarding your question about throughput: I was surprised by the nature of the hidden test set. I was expecting a smaller dataset focused on Series-4-like compounds, while the provided hidden test set is quite a large, general and diverse set of compounds. Having said this, the model we developed (like, I think, all the other models developed by participants using machine learning techniques) is high-throughput, meaning it can be applied in batch to large sets of compounds.
On the other hand, models based on more computationally expensive 3D techniques (e.g. molecular docking) may require more time and effort. Participants who used those tools can say more about this.

@mattodd, one last thing: in the Google Sheet with the summary for the judges, our references were not included in the specific reference column, and I think they can be useful for understanding the method we applied. Could you please add them? I added a comment with the references in the relevant cell.
Thank you

holeung commented May 23, 2017

Yes, I can still use 3D-based methods on a dataset of 400 compounds. It will just take more time for the computations and manual analysis of each prediction.

Hello All,

I have attached a csv spreadsheet (google drive link below) with the ranked 400 molecules for PfATP4 activity for the final competition evaluation.

The rankings are based on an ensemble of Neural Network models trained on molecular information obtained from the EDragon freeware available here: http://www.vcclab.org/lab/edragon/

This software converts molecular SDF records (SMILES) into 1666 pharmacophore fields. These fields were normalized and fed to a conventional (dense) neural network for training and analysis. The NNs were trained on the 3 reported PfATP4 activity classes: [ACTIVE, PARTIAL, INACTIVE].

The "SCORE" field in the spreadsheet is an average of the probability of each molecule being in the ACTIVE class across all the trained NN models.
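The ensemble averaging behind the "SCORE" field can be sketched as follows (the probability matrices below are invented stand-ins for real trained-NN outputs; rows are molecules, columns are the classes in the order ACTIVE, PARTIAL, INACTIVE):

```python
# Sketch of the ensemble "SCORE": average the ACTIVE-class probability for
# each molecule across several trained models.
import numpy as np

model_probs = [                      # one probability matrix per trained NN
    np.array([[0.8, 0.1, 0.1],
              [0.2, 0.3, 0.5]]),
    np.array([[0.6, 0.2, 0.2],
              [0.4, 0.4, 0.2]]),
]
# SCORE = mean ACTIVE-class (column 0) probability over the ensemble.
score = np.mean([p[:, 0] for p in model_probs], axis=0)
print(score)   # one score per molecule
```

Averaging over an ensemble of independently trained networks smooths out run-to-run variance in any single model's probabilities.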

If anyone is interested, I also have NN sensitivity results available for the 1666 pharmacophore fields (the relative importance of each field in the trained NN models).

https://docs.google.com/spreadsheets/d/1SfTbZ4aktv9CHmsxzF40xTZ3kZog_Xvi9LzNr7PUq0c/edit?usp=sharing

Owner

mattodd commented May 24, 2017

That's great, thanks. I'll take a look at the submitted lists tomorrow. I'm hoping that we might have all models submitted by Monday and the judges can begin to evaluate/compare. Is that timeline reasonable?

holeung commented May 24, 2017

Good with me.

holeung commented May 28, 2017 edited

My latest notebook entry contains my csv file containing the ranking of the 400 MMV box compounds. This is in notebook "MMV400 QSAR modeling". This is based on 3D homology modeling of PfATP4 and field alignment of compounds in Cresset Forge.

http://malaria.ourexperiment.org/mmv400_qsar_modelin/15893/post.html

jon-c-silva commented May 28, 2017 edited

Hi everybody,

Here is my submission to the competition for the hidden set.

I have used CDK to calculate molecular descriptors and eXtreme Gradient Boosting (xgboost) to predict the class of the compounds. I will post the source code I used after the competition.

The list of compounds is ranked by probability of Active class but I have also included the probabilities for Partial and Inactive. Six of the compounds were predicted to be Active.

Hey everyone,

Here are my model predictions for the MMV dataset. I have both classified and predicted the EC50 activity of the MMV molecules.

Hi everybody, is there any news on the competition front?

Owner

mattodd commented Jul 4, 2017

We've had a tough time getting everyone together - 3 different time zones. I'm taking a new approach of talking to judges in series and have just got the logistical ball rolling. Initial analysis of all entries has been done, and this just needs verification by others. Won't be much longer, sorry for the delay again.

spadavec commented Aug 9, 2017

Will this likely be completed before the Series4 paper comes out?
