Cellular Modeling Support #1245

Open
rbharath opened this issue May 3, 2018 · 11 comments

Comments

@rbharath
Member

rbharath commented May 3, 2018

There's been a lot of interesting progress recently in cellular modeling. In particular, I'm thinking of this paper that creates a deep learned cell simulator:

https://www.nature.com/articles/nmeth.4627

The code for the simulator is open source as well:

https://github.com/idekerlab/DCell

I wonder if there's a way to support this form of modeling work through DeepChem. I suspect this would be a very nice complement to deep microscopy support.

@peastman
Contributor

peastman commented May 4, 2018

That's a really cool paper! In some ways it's similar to a CNN. They basically take a fully connected network, then remove all the connections except the ones they expect to be important based on domain knowledge.
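A minimal sketch of that "remove most connections" idea, written in plain NumPy rather than any DeepChem API. The mask, layer sizes, and the `masked_dense` helper are assumptions made up for illustration; they are not taken from the paper or the DCell code.

```python
import numpy as np

def masked_dense(x, weights, bias, mask):
    """Dense layer with ReLU whose connections are restricted by a 0/1 mask."""
    # Multiplying by the mask zeroes every weight that prior knowledge rules out.
    return np.maximum(0.0, x @ (weights * mask) + bias)

rng = np.random.default_rng(0)
n_genes, n_units = 4, 2

# Hypothetical domain knowledge: unit 0 sees genes 0 and 1, unit 1 sees genes 2 and 3.
mask = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
weights = rng.normal(size=(n_genes, n_units))
bias = np.zeros(n_units)

x = np.array([[1.0, 0.0, 1.0, 1.0]])
x_changed = x.copy()
x_changed[0, 2] = 0.0  # perturb gene 2 only

out = masked_dense(x, weights, bias, mask)
out_changed = masked_dense(x_changed, weights, bias, mask)
```

Because gene 2 has no connection to unit 0, perturbing it leaves unit 0's output unchanged, which is the sense in which the sparse network encodes domain knowledge.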

This should be simple to implement with TensorGraph. For each leaf node in the hierarchy, use a Gather layer to collect the inputs for the genes it includes. Other than that, it's just Dense, Concat, and BatchNorm layers.
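The gather, dense, concat, and batch-norm steps described above can be sketched as follows. This is a NumPy illustration of the data flow only, not TensorGraph code: the toy hierarchy, node names, layer sizes, and the untrained random weights are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, n_out):
    """Untrained dense layer with tanh activation (random weights, for shape illustration)."""
    w = rng.normal(scale=0.1, size=(x.shape[1], n_out))
    return np.tanh(x @ w)

def batch_norm(x, eps=1e-5):
    """Normalise each feature over the batch (no learned scale or shift)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# Hypothetical toy hierarchy: two leaf nodes feeding one root node.
leaf_genes = {"leaf_a": [0, 1, 2], "leaf_b": [2, 3]}
children = {"root": ["leaf_a", "leaf_b"]}
n_outputs = {"leaf_a": 2, "leaf_b": 2, "root": 1}

# Batch of 8 genotypes over 4 genes, encoded as 0/1.
genotypes = rng.integers(0, 2, size=(8, 4)).astype(float)

activations = {}
for name, genes in leaf_genes.items():
    gathered = genotypes[:, genes]  # the "gather" step: pick this node's genes
    activations[name] = batch_norm(dense(gathered, n_outputs[name]))

# With more than one level, parents would need to be visited in topological order.
for name, kids in children.items():
    merged = np.concatenate([activations[k] for k in kids], axis=1)  # the "concat" step
    activations[name] = batch_norm(dense(merged, n_outputs[name]))
```

Each node only ever sees its own genes or its children's outputs, so the network's wiring mirrors the hierarchy.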

@rbharath
Member Author

rbharath commented May 7, 2018

@peastman Would it be feasible to build a framework for these types of simulations? I imagine people would want to model different cell types or make other tweaks.

@peastman
Contributor

peastman commented May 7, 2018

Sure. We just need to decide how the structure should be specified. Here are the things that need to be specified:

  • The full list of genes.
  • The set of genes in each leaf node.
  • The child nodes of each higher level node.
  • The number of outputs for each node.

We could possibly automate some of that, but not all. For example, we can build in the GO hierarchy, but the list of nodes needs to be filtered in ways that are data and problem specific (see the "preparation of ontologies" section).
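One possible way to hold the four items listed above is a small container with a consistency check. The `HierarchySpec` class, its field names, and the example gene and node names are hypothetical; nothing here is an existing DeepChem API.

```python
from dataclasses import dataclass

@dataclass
class HierarchySpec:
    genes: list        # the full list of genes
    leaf_genes: dict   # leaf node name -> genes it includes
    children: dict     # higher-level node name -> child node names
    n_outputs: dict    # node name -> number of outputs

    def validate(self):
        """Raise ValueError if the four parts are inconsistent with each other."""
        known_genes = set(self.genes)
        for node, members in self.leaf_genes.items():
            unknown = set(members) - known_genes
            if unknown:
                raise ValueError(f"{node} uses unknown genes: {sorted(unknown)}")
        nodes = set(self.leaf_genes) | set(self.children)
        for parent, kids in self.children.items():
            missing = set(kids) - nodes
            if missing:
                raise ValueError(f"{parent} has undefined children: {sorted(missing)}")
        unsized = nodes - set(self.n_outputs)
        if unsized:
            raise ValueError(f"nodes without an output size: {sorted(unsized)}")

# Made-up example with placeholder names.
spec = HierarchySpec(
    genes=["g1", "g2", "g3"],
    leaf_genes={"term_a": ["g1", "g2"], "term_b": ["g2", "g3"]},
    children={"root": ["term_a", "term_b"]},
    n_outputs={"term_a": 2, "term_b": 2, "root": 1},
)
spec.validate()
```

Keeping the specification as plain data leaves the data-specific and problem-specific filtering of ontology terms to the user, who can build these dictionaries however they like.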

@rbharath
Member Author

rbharath commented May 8, 2018

This sounds like a good first list. I don't think a similar API exists yet, so it should work well to start with a reasonable design without much automation and adjust as we get community feedback on ontology specification.

@rbharath
Member Author

rbharath commented Jun 3, 2018

@peastman Is this one on your TODO list already? Feel free to remove the contribution labels if so.

@peastman
Contributor

peastman commented Jun 3, 2018

There are still other things higher up on my list, so let's leave it in case someone else gets to it before I do.

@peastman
Contributor

I'm starting to work on this, so I've removed the labels.

@peastman
Contributor

I'm looking for datasets to test this on. We don't currently have any genomic datasets, do we? Any suggestions for things we should try?

I'm assuming we don't want to start adding genomic data to molnet. That seems outside its scope, and there are already public repositories with huge amounts of data.

@rbharath
Member Author

I don't think we have any genomic datasets at present. Could we reproduce some of the results from the original paper?

I think it would make sense to add genomic data to molnet if there's not an easily accessible public repository for the data already. If there is already a public repository, we could add a utility function that provides easy access to the data.

@peastman
Contributor

The most important one is probably the Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/). It has data from around 100,000 experiments, mostly gene expression but also genetic variation, TF binding, methylation, etc. The data isn't in any consistent format though. It's whatever files the experimenters uploaded for each one.

For sequence data there are many public databases, but GenBank (https://www.ncbi.nlm.nih.gov/genbank/) is the really big one. That's going to be harder to use with this model, though, because it isn't organized by gene.

@rbharath
Member Author

Better GEO and GenBank support would be great. Perhaps we could add download/processing functions that surface the specific datasets we need from these repositories for applications.
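As a rough sketch of what such a processing function might look like, the parser below reads an expression table out of text in the tab-delimited "series matrix" layout that GEO offers for many series. It assumes the table sits between `!series_matrix_table_begin` and `!series_matrix_table_end` marker lines and that every value is numeric; since uploads vary, real files may need extra handling. The function name and the sample text are made up for illustration.

```python
import csv
import io

def parse_series_matrix(text):
    """Return (sample_ids, {row_id: [values]}) from series-matrix style text."""
    in_table = False
    rows = []
    for line in text.splitlines():
        if line.startswith("!series_matrix_table_begin"):
            in_table = True
        elif line.startswith("!series_matrix_table_end"):
            break
        elif in_table and line.strip():
            rows.append(line)
    if not rows:
        raise ValueError("no expression table found")
    # csv handles the tab delimiter and strips the double quotes around IDs.
    reader = csv.reader(io.StringIO("\n".join(rows)), delimiter="\t")
    header = next(reader)
    # Assumes every cell is numeric; missing values would need handling here.
    values = {row[0]: [float(v) for v in row[1:]] for row in reader}
    return header[1:], values

# Synthetic input, not real GEO data.
example = (
    '!Series_title\t"synthetic example"\n'
    "!series_matrix_table_begin\n"
    '"ID_REF"\t"GSM_A"\t"GSM_B"\n'
    '"probe_1"\t1.5\t2.0\n'
    '"probe_2"\t0.25\t4.0\n'
    "!series_matrix_table_end\n"
)
samples, table = parse_series_matrix(example)
```

A loader built this way would still need a per-dataset step that maps row IDs to genes before the values could feed a gene-organized model.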
