As far as the SNV consensus files go, I think we ultimately want the unzipped files in
We do the first three points already so we could add the last step. Thoughts?
Thinking about this today as I'm getting ready to make changes required to get the subset files for CI. We can accomplish what I outlined above by adding the following to the end of the download script:
Co-Authored-By: Jaclyn Taroni <jaclyn.n.taroni@gmail.com>
download-data.sh
Outdated
# Check the md5s for everything we downloaded except CHANGELOG.md
cd data/$RELEASE
md5sum -c md5sum.txt
mv pbta-snv-consensus_11122019.zip ../
For now, the consensus zip file is downloaded to data/$RELEASE/ as part of the release. We can simply move that file up to data/ after the md5 check.
My concern with moving rather than symlinking is that having everything that is included with a data download live in the data/$RELEASE folder is convenient if folks want to go back and compare different versions of files to narrow down an issue they are having (#260). It's also helpful for generating files in CI because one of the inputs to that is a data/$RELEASE directory (#278).
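To make the trade-off concrete, here is a minimal sketch, not the project's actual script. It runs in a scratch directory, and the release name and file name are placeholders invented for illustration. It assumes GNU `md5sum` is available. The point it illustrates: when the file is symlinked rather than moved, the release folder stays complete, so the checksum check can be re-run later.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Scratch directory standing in for the repository root (hypothetical layout)
workdir="$(mktemp -d)"
cd "$workdir"

RELEASE="release-example"          # placeholder release name
FILES=("example-consensus.zip")    # placeholder file name

# Fake a downloaded release with a checksum manifest
mkdir -p "data/$RELEASE"
printf 'placeholder contents\n' > "data/$RELEASE/${FILES[0]}"
(cd "data/$RELEASE" && md5sum "${FILES[@]}" > md5sum.txt)

# Symlink instead of moving, mirroring the loop in download-data.sh
for file in "${FILES[@]}"
do
  ln -sfn "$RELEASE/$file" "data/$file"
done

# The release folder is still complete, so the check can be repeated
if (cd "data/$RELEASE" && md5sum -c md5sum.txt)
then
  recheck="passed"
else
  recheck="failed"
fi
echo "re-check after symlinking: $recheck"
```

Had the file been moved out with `mv`, the same `md5sum -c` call would report the missing file as a failure and exit non-zero.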
@@ -31,3 +39,6 @@ for file in "${FILES[@]}"
do
  ln -sfn $RELEASE/$file data/$file
done

# Unzip any zip files in the data directory using the update flag
unzip -u -d data data/*.zip
Sure, no problem. I now only add the unzip step to the end of the script. @jaclyn-taroni please review.
Running this locally right now -- if all looks good, I will approve. I will probably wait to merge until we get the CI files sorted (#278) so continuous integration doesn't fail for everyone at the download step. I expect that will happen sometime today.
👍 ran this locally
This is taking a while to run in CI currently -- I'm fairly sure the issue is with Ensembl being slow because I am stuck on the genome FASTA step locally.
As far as I know, no analysis module uses the FASTA reference file yet. I am going to try commenting those parts of the script out until we discuss how we might shorten the download process. Tracked in #281.
### release-v12-20191217
- release date: 2019-12-17
- status: available
- changes:
  - Add `data-file-descriptions.md` with data release to better track file types, origins, and workflows per [#334](#334) and [#336](#336)
  - Add stranded RNA-Seq for 23 PNOC samples and 21 CBTTC samples previously sequenced using a polyA library prep. Files updated:
    - pbta-fusion-arriba.tsv.gz
    - pbta-fusion-starfusion.tsv.gz
    - pbta-gene-expression-rsem-tpm.stranded.rds
    - pbta-gene-expression-rsem-fpkm.stranded.rds
    - pbta-isoform-expression-rsem-tpm.stranded.rds
    - pbta-isoform-counts-rsem-expected_count.stranded.rds
    - pbta-gene-counts-rsem-expected_count.stranded.rds
    - pbta-gene-expression-kallisto.stranded.rds
    - pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds
  - Add recurrently-fused genes by histology and matrix of recurrently-fused genes by biospecimen from [fusion filtering and prioritization analysis](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion_filtering)
  - Update consensus TMB files and MAF [#333](#333)
  - Add RNA-Seq [collapsed matrices](#287) - wrong files (tables of transcripts removed) were included with [V10](#273)
  - Rename `WGS.hg38.mutect2.unpadded.bed` to `WGS.hg38.mutect2.vardict.unpadded.bed` to better reflect usage
  - Update `pbta-histologies.tsv` to add new RNA-Seq samples listed above, [#222 harmonize disease separators](#222), and reran [medulloblastoma classifier](https://github.com/d3b-center/medullo-classifier-package) using V12 RSEM fpkm collapsed files
    - BS_2Z1MKS84, BS_5VQP0E6K re-classified from Group4 to WNT and BS_3BDAG9YN, BS_8T7DZV2F, and BS_JTMXAMB7 re-classified from Group3 to WNT
  - Add CNVkit GISTIC results focal CN analyses, eg: [#244](#244) and [#8](#8)
* Release V12 data
* Update release-notes.md fix link
* Update data-files-description.md fix GISTIC table sectioning
* Update data-files-description.md fix spacing on data description table
* Update data-files-description.md fix more spacing in data file description file
* Update download-data.sh add new release date to download script
* Update the TMB file descriptions
* Update TMB file formats section
* Update fusion section of data formats Also more specific description of the by sample file
* Add GISTIC file to data-formats
* Update download-data.sh
* Update download-data.sh
* data description md is also included in md5sum
* TMB exon -> coding sequence
* Coding TMB CDS, not exon
Purpose/implementation
Data release updates for v10:
- Reran VCF2MAF to harmonize columns
Issue/Caveats
Directions for reviewers
Changes were made to the release notes and the download script.
download script updated:
release file added:
release file content updated:
pbta-snv-mutect2.vep.maf.gz
pbta-snv-lancet.vep.maf.gz
pbta-gene-expression-rsem-fpkm.polya.rds
pbta-snv-vardict.vep.maf.gz
Results
Docker and continuous integration