This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Data Download Script #2

Closed
cgreene opened this issue Jul 11, 2019 · 6 comments

Comments

@cgreene
Collaborator

cgreene commented Jul 11, 2019

We should put together a script that downloads the data into a defined folder, and we should require folks not to modify that folder. If we are not able to distribute the germline data without restrictions, the download should include dummy germline data so that at least the CI works.

@cgreene
Collaborator Author

cgreene commented Jul 11, 2019

We might also want a script that downloads small subsets of the data (or even dummy data with identical structure) so that we can test quickly via CI.

@cgreene
Collaborator Author

cgreene commented Jul 29, 2019

Some features we should really have:

  • Uses a checksum to validate the downloaded data (probably best to download the file plus a .md5 or similar file & compare)
  • Downloads a compressed file that includes the freeze of the data (the full set of data files)
  • The compressed file includes a README file that has a version number and description of each of the data files / formats
  • The compressed file includes a CHANGELOG file that contains a list of all the changes that have occurred since the initial data release.
  • The downloaded files are decompressed to data/
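The checksum bullet could be as simple as an `md5sum -c` call against a downloaded `.md5` sidecar file. A minimal sketch; the data file and its sidecar are created locally here for illustration (in practice both would be downloaded first), and the filename is just an example:

```shell
# Sketch of the checksum check. In practice both files would come
# from the data release download; here they are created locally so
# the example runs standalone.
printf 'example data\n' > somatic-snv.gz
md5sum somatic-snv.gz > somatic-snv.gz.md5

# md5sum -c re-hashes the file and compares against the sidecar;
# it exits non-zero on mismatch, so a download script can fail fast.
if ! md5sum -c somatic-snv.gz.md5; then
    echo "checksum mismatch: re-download somatic-snv.gz" >&2
    exit 1
fi
```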

@cgreene
Collaborator Author

cgreene commented Aug 2, 2019

@jharenza @yuankunzhu - if this is in progress could you open up a work in progress pull request here so we can see how things are going? This way we can check things out and maybe make suggestions before things get too far along if the issue is under-specified. Thanks!

@yuankunzhu
Collaborator

yuankunzhu commented Aug 2, 2019

We are fundamentally updating the download process/script, as we got pushback from CAVATICA security about the shared downloader API token. At this point we will most likely go the public s3 bucket route, which will eventually be a wget/curl command line instead of a script.

I will do the PR for that part under How to Obtain OpenPBTA Data once we have the bucket/data ready.

In addition, instead of gzipping everything into one big compressed file, I'm also proposing that we have separate files for each data type. But we can review/discuss that during the actual PR process too.

data/somatic-snv.gz
data/somatic-cnv.gz
data/somatic-sv.gz
data/gene-fusion.gz
data/expression.gz
data/clinical-manifest.csv
data/md5sum.txt
data/README.md
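With one file per data type plus a manifest, the download step could be a small wget/curl loop followed by `md5sum -c`. A sketch under stated assumptions: the public bucket does not exist yet, so a local mock directory stands in for it and each `cp` stands in for the eventual `wget`/`curl` call; the two data files are placeholders:

```shell
# Mock "bucket": stands in for the proposed public s3 bucket, which
# does not exist yet. With a real bucket, each cp below would be e.g.
#   wget -q -P data "$BUCKET/$f"
mkdir -p mock-bucket data
printf 'expression values\n' > mock-bucket/expression.gz
printf 'fusion calls\n'      > mock-bucket/gene-fusion.gz
( cd mock-bucket && md5sum expression.gz gene-fusion.gz > md5sum.txt )

BUCKET="mock-bucket"
for f in expression.gz gene-fusion.gz md5sum.txt; do
    cp "$BUCKET/$f" data/
done

# Verify every downloaded file against the manifest; md5sum -c exits
# non-zero if any file does not match.
( cd data && md5sum -c md5sum.txt )
```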

@cgreene
Collaborator Author

cgreene commented Aug 2, 2019

Ok! I think the public s3 bucket is an appropriate way to go.

If we have each data file separately, I think it will be important to have a solution that lets folks point their analysis code at a specific location, with the expectation that the location holds the latest version of the data files in question. If the md5sum file in the bucket always has the most recent version, we can write some code at the beginning of each analysis to quickly check that the md5s match and warn if they don't.
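That start-of-analysis check could be a one-line `diff` between the local manifest and a freshly fetched copy. A sketch; both manifests and their contents are written locally here for illustration, whereas in practice `md5sum.latest.txt` would be fetched from the bucket with wget/curl:

```shell
# Local manifest from the last download, and the "latest" manifest
# (in practice fetched from the bucket with wget/curl).
printf 'd0bfc8  somatic-snv.gz\n' > md5sum.txt
printf 'd0bfc8  somatic-snv.gz\n' > md5sum.latest.txt

# Warn (but do not fail) if the local data no longer matches the
# most recent release.
if ! diff -q md5sum.txt md5sum.latest.txt > /dev/null; then
    echo "WARNING: local data does not match the latest release" >&2
fi
```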

@cgreene
Collaborator Author

cgreene commented Aug 12, 2019

This is implemented in #45.

@cgreene cgreene closed this as completed Aug 12, 2019
cansavvy added a commit that referenced this issue Aug 27, 2019
…alls (#76)

cansavvy pushed a commit to cansavvy/OpenPBTA-analysis that referenced this issue Apr 2, 2020
jaclyn-taroni referenced this issue in jaclyn-taroni/OpenPBTA-analysis Aug 21, 2020
Skip filtering and batch correction. in CI