This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Data Download Script #2

Closed
cgreene opened this issue Jul 11, 2019 · 6 comments

Comments

@cgreene
Collaborator

cgreene commented Jul 11, 2019

We should put together a script that downloads the data into a defined folder, and we should require folks not to modify that folder. If we are not able to distribute the germline data without restrictions, the download should include dummy germline data so that at least the CI works.

@cgreene
Collaborator Author

cgreene commented Jul 11, 2019

We might also want a script that downloads small subsets of the data (or even dummy data with identical structure) so that we can test quickly via CI.

@cgreene
Collaborator Author

cgreene commented Jul 29, 2019

Some features we should really have:

  • Uses a checksum to validate the downloaded data (probably best to download the file plus a .md5 or similar file & compare)
  • Downloads a compressed file that includes the freeze of the data (the full set of data files)
  • The compressed file includes a README file that has a version number and description of each of the data files / formats
  • The compressed file includes a CHANGELOG file that contains a list of all the changes that have occurred since the initial data release.
  • The downloaded files are decompressed to data/
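The checksum bullet could be as simple as an `md5sum -c` call against a downloaded `.md5` sidecar file. A minimal sketch; the data file and its sidecar are created locally here for illustration (in practice both would be downloaded first), and the filename is just an example:

```shell
# Sketch of the checksum check. In practice both files would come
# from the data release download; here they are created locally so
# the example runs standalone.
printf 'example data\n' > somatic-snv.gz
md5sum somatic-snv.gz > somatic-snv.gz.md5

# md5sum -c re-hashes the file and compares against the sidecar;
# it exits non-zero on mismatch, so a download script can fail fast.
if ! md5sum -c somatic-snv.gz.md5; then
    echo "checksum mismatch: re-download somatic-snv.gz" >&2
    exit 1
fi
```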

@cgreene
Collaborator Author

cgreene commented Aug 2, 2019

@jharenza @yuankunzhu - if this is in progress could you open up a work in progress pull request here so we can see how things are going? This way we can check things out and maybe make suggestions before things get too far along if the issue is under-specified. Thanks!

@yuankunzhu
Collaborator

yuankunzhu commented Aug 2, 2019

We are fundamentally updating the download process/script, as we got pushback from CAVATICA security about the shared downloader API token. At this point we will most likely go the public s3 bucket route, which will eventually be a wget/curl command line instead of a script.

I will do the PR for that part under How to Obtain OpenPBTA Data once we have the bucket/data ready.

In addition, instead of gzipping everything into one big compressed file, I'm also proposing that we have separate files for each data type. But we can review/discuss that during the actual PR process too.

data/somatic-snv.gz
data/somatic-cnv.gz
data/somatic-sv.gz
data/gene-fusion.gz
data/expression.gz
data/clinical-manifest.csv
data/md5sum.txt
data/README.md
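With one file per data type plus a manifest, the download step could be a small wget/curl loop followed by `md5sum -c`. A sketch under stated assumptions: the public bucket does not exist yet, so a local mock directory stands in for it and each `cp` stands in for the eventual `wget`/`curl` call; the two data files are placeholders:

```shell
# Mock "bucket": stands in for the proposed public s3 bucket, which
# does not exist yet. With a real bucket, each cp below would be e.g.
#   wget -q -P data "$BUCKET/$f"
mkdir -p mock-bucket data
printf 'expression values\n' > mock-bucket/expression.gz
printf 'fusion calls\n'      > mock-bucket/gene-fusion.gz
( cd mock-bucket && md5sum expression.gz gene-fusion.gz > md5sum.txt )

BUCKET="mock-bucket"
for f in expression.gz gene-fusion.gz md5sum.txt; do
    cp "$BUCKET/$f" data/
done

# Verify every downloaded file against the manifest; md5sum -c exits
# non-zero if any file does not match.
( cd data && md5sum -c md5sum.txt )
```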

@cgreene
Collaborator Author

cgreene commented Aug 2, 2019

Ok! I think the public s3 bucket is an appropriate way to go.

If we have each data file separately, I think it will be important to have a solution that lets folks point their analysis code at a specific location, with the expectation that the location holds the latest version of the data files in question. If the md5sum file in the bucket always has the most recent version, we can write some code at the beginning of each analysis to quickly check that the md5s match and warn if they don't.
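That start-of-analysis check could be a one-line `diff` between the local manifest and a freshly fetched copy. A sketch; both manifests and their contents are written locally here for illustration, whereas in practice `md5sum.latest.txt` would be fetched from the bucket with wget/curl:

```shell
# Local manifest from the last download, and the "latest" manifest
# (in practice fetched from the bucket with wget/curl).
printf 'd0bfc8  somatic-snv.gz\n' > md5sum.txt
printf 'd0bfc8  somatic-snv.gz\n' > md5sum.latest.txt

# Warn (but do not fail) if the local data no longer matches the
# most recent release.
if ! diff -q md5sum.txt md5sum.latest.txt > /dev/null; then
    echo "WARNING: local data does not match the latest release" >&2
fi
```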

@cgreene
Collaborator Author

cgreene commented Aug 12, 2019

This is implemented in #45.

@cgreene cgreene closed this as completed Aug 12, 2019
cansavvy added a commit that referenced this issue Aug 27, 2019
…alls (#76)

cansavvy pushed a commit to cansavvy/OpenPBTA-analysis that referenced this issue Apr 2, 2020
jaclyn-taroni referenced this issue in jaclyn-taroni/OpenPBTA-analysis Aug 21, 2020
Skip filtering and batch correction. in CI