Data Download Script #2
Comments
We might also want a script that downloads small subsets of the data (or even dummy data with identical structure) so that we can test quickly via CI.
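One way to produce such a small subset could look like the sketch below. It assumes the data files are uncompressed, line-oriented text tables (for example TSV) with a single header line; `write_subset` and `max_rows` are hypothetical names used only for illustration, not part of any existing script.

```python
from itertools import islice


def write_subset(source, dest, max_rows=100):
    """Copy the header line plus the first `max_rows` data lines of a text table.

    Assumption: `source` is an uncompressed file with one header line
    and one record per line.
    """
    with open(source) as src, open(dest, "w") as out:
        # Read at most the header plus max_rows records.
        lines = list(islice(src, max_rows + 1))
        out.writelines(lines)
    # Number of data rows written, not counting the header.
    return max(len(lines) - 1, 0)
```

Because the subset keeps the header and column layout of the full file, analysis code that runs in CI can read it the same way it reads the real data.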
Some features we should really have:
@jharenza @yuankunzhu - if this is in progress, could you open a work-in-progress pull request here so we can see how things are going? That way we can check things out and maybe make suggestions before things get too far along, in case the issue is under-specified. Thanks!
We are fundamentally changing the download process and script because CAVATICA security pushed back on the shared downloader API token. For now we will most likely go the public s3 bucket route, which means the download will eventually be a wget/curl command line instead of a script. I will open the PR for that part under How to Obtain OpenPBTA Data once we have the bucket and data ready. In addition, instead of gzipping everything into one big compressed file, I'm proposing that we have separate files for each data type. We can review and discuss that during the actual PR process too.
Ok! I think the public s3 bucket is an appropriate way to go. If we have each data file separately, I think it will be important to have a solution that lets folks point their analysis code at a specific location, with the expectation that the location holds the latest version of the data files in question. If the md5sum file in the bucket always reflects the most recent version, we can write some code that runs at the start of the analysis to quickly check that the md5s match and warn if they don't.
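A minimal sketch of that check, assuming the md5sum file uses the standard `md5sum` output format (one `<checksum>  <filename>` pair per line) and that its contents have already been fetched as text. The function names here are hypothetical and only illustrate the idea.

```python
import hashlib
import warnings
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5sum(text):
    """Parse `md5sum`-style text into a {filename: checksum} dict."""
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        checksum, name = line.split(maxsplit=1)
        # md5sum marks binary-mode entries with a leading '*'.
        expected[name.lstrip("*")] = checksum.lower()
    return expected


def check_data_dir(data_dir, md5_text):
    """Warn about local files that are missing or differ from the md5sum list."""
    data_dir = Path(data_dir)
    stale = []
    for name, checksum in parse_md5sum(md5_text).items():
        path = data_dir / name
        if not path.is_file() or file_md5(path) != checksum:
            stale.append(name)
    if stale:
        warnings.warn(
            "Data files missing or out of date: " + ", ".join(sorted(stale))
        )
    return stale
```

Calling `check_data_dir` as the first step of an analysis run would warn rather than fail, which matches the "warn if they don't" behavior suggested above.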
This is implemented in #45.
We should put together a script to download the data into a defined folder. We should require folks to not modify that folder. We should include dummy germline data in the download if we are not able to distribute it without restrictions so that at least the CI works.
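As a rough sketch of what such a script could do: fetch a fixed list of files from a base URL into one folder, then mark each file read-only to discourage edits. The base URL, the file list, and the name `download_release` are all assumptions for illustration, since the actual hosting location is not settled in this thread.

```python
import shutil
import stat
import urllib.request
from pathlib import Path


def download_release(base_url, filenames, dest_dir):
    """Download each named file from base_url into dest_dir and make it read-only.

    Assumption: every file is reachable at "<base_url>/<filename>".
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    downloaded = []
    for name in filenames:
        target = dest_dir / name
        if target.exists():
            # Restore owner write permission so a re-download can overwrite.
            target.chmod(stat.S_IRUSR | stat.S_IWUSR)
        url = base_url.rstrip("/") + "/" + name
        with urllib.request.urlopen(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
        # Read-only for everyone, as a guard against accidental modification.
        target.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
        downloaded.append(target)
    return downloaded
```

The read-only bit is only a soft guard; it does not stop a determined user, but it makes accidental changes to the data folder fail loudly.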