V10 Release #273

yuankunzhu · 2019-11-15T20:12:44Z

Purpose/implementation

data release updates for v10

changes:
- Add RNA-Seq GTF and fasta files per ticket here
- Add RSEM gene TPM and isoform matrices
- Add SNV consensus files
- Add new MAFs for lancet, vardict, mutect2
  - Reran VCF2MAF to harmonize columns
- Add Collapsed RNA matrices

Issue/Caveats

Directions for reviewers

The changes are to the release notes and the download script.

download script updated:
- release version updated
- added FTP link download for GRCh38 reference and Gencode V27 GTF
release file added:
- pbta-gene-expression-rsem-fpkm-collapsed_table.polya.rds
- pbta-gene-expression-rsem-fpkm-collapsed_table.stranded.rds
- pbta-gene-expression-rsem-tpm.polya.rds
- pbta-gene-expression-rsem-tpm.stranded.rds
- pbta-isoform-expression-rsem-tpm.polya.rds
- pbta-isoform-expression-rsem-tpm.stranded.rds
- pbta-snv-consensus_11122019.zip
release file content updated:
- pbta-snv-mutect2.vep.maf.gz
- pbta-snv-lancet.vep.maf.gz
- pbta-gene-expression-rsem-fpkm.polya.rds
- pbta-snv-vardict.vep.maf.gz

Results

Docker and continuous integration

download-data.sh

jaclyn-taroni · 2019-11-15T20:43:57Z

As far as the SNV consensus files go, I think we ultimately want the unzipped files in data, which we could accomplish with:

- Download the file
- Check the checksum
- Symlink the files
- Unzip that folder

We do the first three points already so we could add the last step. Thoughts?

jaclyn-taroni · 2019-11-17T17:46:47Z

Thinking about this today as I'm getting ready to make the changes required to get the subset files for CI. We can accomplish what I outlined above by adding the following to the end of the download script:

# Unzip any zip files in the data directory using the update flag
unzip -u -d data data/*.zip

yuankunzhu · 2019-11-17T22:43:24Z

download-data.sh

 # Check the md5s for everything we downloaded except CHANGELOG.md
 cd data/$RELEASE
 md5sum -c md5sum.txt
+mv pbta-snv-consensus_11122019.zip ../


For now, the consensus zip file is downloaded to data/$RELEASE/ as part of the release. We can just move that file out to data/ after the md5 check.

My concern with moving rather than symlinking is that having everything that is included with a data download live in the data/$RELEASE folder is convenient if folks want to go back and compare different versions of files to narrow down an issue they are having (#260). It's also helpful for generating files in CI because one of the inputs to that is a data/$RELEASE directory (#278).
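
The symlink approach argued for here can be sketched as follows. This is only an illustration, assuming the data/$RELEASE/<file> layout the script uses; the function name `link_release_files` and the release name in the usage note are hypothetical.

```bash
# Expose each file of a release in the parent data directory through a
# relative symlink, leaving the originals in place under data/<release>/.
# link_release_files is a hypothetical helper name used for illustration.
link_release_files() {
  local data_dir="$1"   # e.g. data
  local release="$2"    # name of the release subdirectory
  local path name
  for path in "$data_dir/$release"/*; do
    [ -e "$path" ] || continue
    name=$(basename "$path")
    # The target is relative to the link's own directory, so data/<name>
    # resolves to data/<release>/<name>. -f replaces an existing link and
    # -n avoids following an existing link that points at a directory.
    ln -sfn "$release/$name" "$data_dir/$name"
  done
}
```

For example, `link_release_files data release-example` leaves every file under data/release-example/ untouched while making it reachable as data/<file>, so older release directories remain available for comparison.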

yuankunzhu · 2019-11-18T13:59:41Z

download-data.sh

@@ -31,3 +39,6 @@ for file in "${FILES[@]}"
 do
  ln -sfn $RELEASE/$file data/$file
 done
+
+# Unzip any zip files in the data directory using the update flag
+unzip -u -d data data/*.zip 


Sure, no problem. I only added the unzip to the end of the script. @jaclyn-taroni please review.

jaclyn-taroni · 2019-11-18T14:34:13Z

Running this locally right now -- if all looks good, I will approve. I will probably wait to merge until we get the CI files sorted (#278) so continuous integration doesn't fail for everyone at the download step. I expect that will happen sometime today.

jaclyn-taroni

👍 ran this locally

jaclyn-taroni · 2019-11-18T21:51:19Z

This is taking a while to run in CI currently -- I'm fairly sure the issue is with Ensembl being slow because I am stuck on the genome FASTA step locally.

jaclyn-taroni · 2019-11-18T23:56:31Z

As far as I know, no analysis module uses the FASTA reference file yet. I am going to try commenting those parts of the script out until we discuss how we might shorten the download process. Tracked in #281.
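
An alternative to commenting the download out is to make it opt-in. The sketch below only illustrates that idea and is not what the script does: `DOWNLOAD_REFERENCE` and `fetch_reference` are hypothetical names, and the URL and destination are supplied by the caller.

```bash
# Download a large reference file only when explicitly requested, and skip
# the transfer when a non-empty copy is already present.
# DOWNLOAD_REFERENCE and fetch_reference are assumed names for illustration.
fetch_reference() {
  local url="$1"
  local dest="$2"
  if [ "${DOWNLOAD_REFERENCE:-0}" != "1" ]; then
    echo "Skipping $dest (set DOWNLOAD_REFERENCE=1 to download it)"
    return 0
  fi
  if [ -s "$dest" ]; then
    echo "$dest already present; not downloading again"
    return 0
  fi
  # --fail: return a non-zero status on HTTP errors
  # --location: follow redirects
  curl --fail --location --output "$dest" "$url"
}
```

With this shape, CI runs that do not need the FASTA file pay no download cost by default, and a module that does need it can set `DOWNLOAD_REFERENCE=1` before calling the script.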

### release-v12-20191217
- release date: 2019-12-17
- status: available
- changes:
  - Add `data-file-descriptions.md` with data release to better track file types, origins, and workflows per [#334](#334) and [#336](#336)
  - Add stranded RNA-Seq for 23 PNOC samples and 21 CBTTC samples previously sequenced using a polyA library prep. Files updated:
    - pbta-fusion-arriba.tsv.gz
    - pbta-fusion-starfusion.tsv.gz
    - pbta-gene-expression-rsem-tpm.stranded.rds
    - pbta-gene-expression-rsem-fpkm.stranded.rds
    - pbta-isoform-expression-rsem-tpm.stranded.rds
    - pbta-isoform-counts-rsem-expected_count.stranded.rds
    - pbta-gene-counts-rsem-expected_count.stranded.rds
    - pbta-gene-expression-kallisto.stranded.rds
    - pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds
  - Add recurrently-fused genes by histology and matrix of recurrently-fused genes by biospecimen from [fusion filtering and prioritization analysis](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion_filtering)
  - Update consensus TMB files and MAF [#333](#333)
  - Add RNA-Seq [collapsed matrices](#287) - wrong files (tables of transcripts removed) were included with [V10](#273)
  - Rename `WGS.hg38.mutect2.unpadded.bed` to `WGS.hg38.mutect2.vardict.unpadded.bed` to better reflect usage
  - Update `pbta-histologies.tsv` to add new RNA-Seq samples listed above, [#222 harmonize disease separators](#222), and reran [medulloblastoma classifier](https://github.com/d3b-center/medullo-classifier-package) using V12 RSEM fpkm collapsed files
    - BS_2Z1MKS84, BS_5VQP0E6K re-classified from Group4 to WNT and BS_3BDAG9YN, BS_8T7DZV2F, and BS_JTMXAMB7 re-classified from Group3 to WNT
  - Add CNVkit GISTIC results focal CN analyses, eg: [#244](#244) and [#8](#8)

* Release V12 data
* Update release-notes.md fix link
* Update data-files-description.md fix GISTIC table sectioning
* Update data-files-description.md fix spacing on data description table
* Update data-files-description.md fix more spacing in data file description file
* Update download-data.sh add new release date to download script
* Update the TMB file descriptions
* Update TMB file formats section
* Update fusion section of data formats Also more specific description of the by sample file
* Add GISTIC file to data-formats
* Update download-data.sh
* Update download-data.sh
* data description md is also included in md5sum
* TMB exon -> coding sequence
* Coding TMB CDS, not exon

yuankunzhu added 4 commits November 15, 2019 14:57

🔧 update release-notes.md for v10 release

96b76d5

🔧 update release-notes.md for v10 release; version id

2238559

🔧 update download-data.sh for v10 release

5089802

🔧 update download-data.sh for v10 release; add ftp download for hg38.…

e84b0c2

…fa & gencode27.gtf

jaclyn-taroni mentioned this pull request Nov 15, 2019

Updated analysis: rewrite SNV consensus steps in the oncoprint plotting module #274

Closed

jaclyn-taroni reviewed Nov 15, 2019

yuankunzhu and others added 2 commits November 17, 2019 17:31

Update download-data.sh. ref/gft file location.

574fbd7

Co-Authored-By: Jaclyn Taroni <jaclyn.n.taroni@gmail.com>

move consensus file to data/ folder

48d05d3

yuankunzhu commented Nov 17, 2019

yuankunzhu added 2 commits November 17, 2019 17:45

Unzip *.zip under data/

5b634ab

Update download-data.sh

b4a0f16

yuankunzhu commented Nov 18, 2019

jaclyn-taroni approved these changes Nov 18, 2019

jaclyn-taroni mentioned this pull request Nov 18, 2019

Update CI files for v10 #278

Merged

Merge branch 'master' into release/v10

2a07c5c

Merge branch 'master' into release/v10

b835eb6

Comment out reference FASTA download

5b28432

jaclyn-taroni merged commit 3091d46 into AlexsLemonade:master Nov 19, 2019

yuankunzhu mentioned this pull request Nov 26, 2019

V11 Release #293

Merged

jharenza mentioned this pull request Dec 2, 2019

Proposed Analysis: Comparative RNA-Seq analysis #229

Open

jharenza mentioned this pull request Dec 17, 2019

Release V12 data #347

Merged

yuankunzhu deleted the release/v10 branch March 6, 2020 17:38
