new feature: download all data by type #111

Open
sgosline opened this issue Mar 11, 2024 · 8 comments
Assignees
Labels
enhancement New feature or request package

Comments

@sgosline
Member

It'd be nice to get ALL data of a particular data type (transcriptomics, proteomics, etc.) regardless of source. Can you add a function to do this? We can also filter by source after the fact.

@sgosline sgosline added the enhancement New feature or request label Mar 11, 2024
@jjacobson95
Collaborator

jjacobson95 commented Mar 21, 2024

I could build this into its own function or class; however, I think this could get redundant or confusing for users, as it can already be done using the following commands:

import coderdata as cd

depmap = cd.DatasetLoader('depmap')
mpnst = cd.DatasetLoader('MPNST')
cptac = cd.DatasetLoader('cptac')
beataml = cd.DatasetLoader('beataml')
hcmi = cd.DatasetLoader('hcmi')

joined_data = cd.join_datasets(beataml,hcmi,cptac,depmap,mpnst)

joined_data.transcriptomics # all transcriptomics data
joined_data.proteomics # all proteomics data
joined_data.drugs # all drug data
joined_data.samples # all sample data
#  ... etc

@sgosline
Member Author

sgosline commented Mar 22, 2024

Yes, but this assumes that people understand (and care about) the shorthand dataset names. For deep learning, they just need to know what type of data it is and how much there is. How about you rename DatasetLoader to something like data_by_source and create a new function called data_by_type that covers 'transcriptomics', 'proteomics', 'dose_response', 'perturbation', 'copy_number', 'mutations', etc.? They can exist side by side.

You can add sources and data_types functions as well so that users can see what there is to choose from. Then the above calls just become:

import coderdata as cd

# Proposed API (not yet implemented): sources() lists the available dataset names
sources = cd.sources()
ds = {}
for so in sources:
    ds[so] = cd.data_by_source(so)
joined_data = cd.join_data_by_source(ds.values())
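
For illustration only, the data_by_type idea could be sketched roughly as below. This is not coderdata code: the DATA_TYPES tuple, the data_by_type signature, and the toy datasets are all assumptions, and plain lists of row dicts stand in for whatever tables the package actually holds. Each row is tagged with its source so that filtering by source after the fact stays possible.

```python
from types import SimpleNamespace

# Hypothetical list of data types, taken from the names mentioned in this thread.
DATA_TYPES = ("transcriptomics", "proteomics", "dose_response",
              "perturbation", "copy_number", "mutations")


def data_by_type(data_type, datasets):
    """Collect every row of `data_type` across all sources.

    `datasets` maps a source name to any object exposing one attribute per
    data type (a list of row dicts here, purely as a stand-in). Each row is
    tagged with its source so users can still filter by source afterwards.
    """
    if data_type not in DATA_TYPES:
        raise ValueError(f"unknown data type {data_type!r}; choose from {DATA_TYPES}")
    rows = []
    for source, dataset in datasets.items():
        # A source may not provide this data type at all; treat that as empty.
        for row in getattr(dataset, data_type, None) or []:
            rows.append({**row, "source": source})
    return rows


# Toy stand-ins for loaded datasets (not real coderdata objects).
toy = {
    "beataml": SimpleNamespace(
        transcriptomics=[{"sample": 1, "gene": "TP53", "value": 2.5}],
    ),
    "hcmi": SimpleNamespace(
        transcriptomics=[{"sample": 7, "gene": "TP53", "value": 1.1}],
        proteomics=[{"sample": 7, "gene": "EGFR", "value": 0.3}],
    ),
}

all_rna = data_by_type("transcriptomics", toy)
print(len(all_rna), sorted({row["source"] for row in all_rna}))
```

With something like this, a user only needs to know the data type; the source names become an optional column to filter on rather than something they must know up front.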

Do you have a document describing the general users and use cases of the package?

@jjacobson95
Collaborator

Okay, will do. There is a general usage page in the docs, but I haven't gotten a chance to update it with the use cases. I'd like to link our tutorials directly to the docs but haven't had the time to do so yet. It takes quite a few extra steps with the CI blocked.

@sgosline
Member Author

Usage and use cases are not the same thing: use cases are the start of a design document that motivates the choices made in software development. Generally a good thing to have on hand to make detailed design decisions.

@jjacobson95
Collaborator

I didn't know about that - I'll add that as an issue.

@sgosline
Member Author

No need, it's not really a thing that can be fixed in the code base, just something that'll need to be done ahead of the paper/pub.

@jjacobson95
Collaborator

Shouldn't we keep track of it, as we will eventually need to add it to the docs?

@sgosline
Member Author

Docs are for end users; they do not need to know how or why the software was designed as it was. Use cases/specifications are for developers so they can make informed implementation choices. I believe there are some GitHub features to incorporate the full software engineering process, but I think that ship has sailed at this point :)

Projects
Status: Ready
Development

No branches or pull requests

2 participants