This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

V10 Release #273

Merged
merged 11 commits into from
Nov 19, 2019

Conversation

yuankunzhu
Collaborator

Purpose/implementation

data release updates for v10

Issue/Caveats

Directions for reviewers

Changes to the release notes and download script:

  • download script updated:

    • release version updated
    • added FTP link download for GRCh38 reference and Gencode V27 GTF
  • release file added:

    • pbta-gene-expression-rsem-fpkm-collapsed_table.polya.rds
    • pbta-gene-expression-rsem-fpkm-collapsed_table.stranded.rds
    • pbta-gene-expression-rsem-tpm.polya.rds
    • pbta-gene-expression-rsem-tpm.stranded.rds
    • pbta-isoform-expression-rsem-tpm.polya.rds
    • pbta-isoform-expression-rsem-tpm.stranded.rds
    • pbta-snv-consensus_11122019.zip
  • release file content updated:

    • pbta-snv-mutect2.vep.maf.gz
    • pbta-snv-lancet.vep.maf.gz
    • pbta-gene-expression-rsem-fpkm.polya.rds
    • pbta-snv-vardict.vep.maf.gz

Results

Docker and continuous integration

download-data.sh Outdated
@jaclyn-taroni
Member

As far as the SNV consensus files go, I think we ultimately want the unzipped files in data, which we could accomplish with:

  • Download the file
  • Check the checksum
  • Symlink the files
  • Unzip that folder

We do the first three points already so we could add the last step. Thoughts?
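The checksum and symlink steps can be sketched roughly as follows. This is a minimal illustration, not the contents of download-data.sh: `WORKDIR`, the release name `release-example`, and the dummy file are hypothetical stand-ins so the sketch runs without network access, and the download step is omitted.

```shell
#!/bin/bash
set -eu

# Hypothetical stand-ins: a throwaway working directory and a dummy release
# file, so nothing here touches a real data/ directory.
WORKDIR=$(mktemp -d)
RELEASE=release-example
mkdir -p "$WORKDIR/data/$RELEASE"
echo "dummy contents" > "$WORKDIR/data/$RELEASE/example-file.tsv"
(cd "$WORKDIR/data/$RELEASE" && md5sum example-file.tsv > md5sum.txt)

# Check the checksums from inside the release directory, because the paths
# listed in md5sum.txt are relative to it.
(cd "$WORKDIR/data/$RELEASE" && md5sum -c md5sum.txt)

# Symlink every file in the release directory into data/. The link target is
# relative to data/, so the links keep working if the tree is moved.
for path in "$WORKDIR/data/$RELEASE"/*; do
  file=$(basename "$path")
  ln -sfn "$RELEASE/$file" "$WORKDIR/data/$file"
done
```

The files themselves stay in `data/$RELEASE`; only links are created one level up.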

@jaclyn-taroni
Member

Thinking about this today as I'm getting ready to make the changes required to get the subset files for CI. We can accomplish what I outlined above by adding the following to the end of the download script:

# Unzip any zip files in the data directory using the update flag
unzip -u -d data data/*.zip 
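One caveat worth noting: `unzip` takes a single archive per invocation and treats any further arguments as member names to extract, so the command above behaves as intended only while the glob matches exactly one zip file. A sketch of a loop that handles any number of archives follows; `unzip_all` is a hypothetical helper name, not something in the download script.

```shell
#!/bin/bash
# Extract every zip archive found in a directory, one archive per unzip call.
# unzip_all is a hypothetical helper name used only for this sketch.
unzip_all() {
  dir="$1"
  for archive in "$dir"/*.zip; do
    # If nothing matches, the glob is left as a literal string; skip it.
    [ -e "$archive" ] || continue
    # -u: extract files that are missing or newer than the copy on disk
    # -d: extract into the given directory
    unzip -u -d "$dir" "$archive"
  done
}

unzip_all data
```

If there is no `data` directory or it holds no zip files, the call does nothing.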

yuankunzhu and others added 2 commits November 17, 2019 17:31
download-data.sh Outdated
# Check the md5s for everything we downloaded except CHANGELOG.md
cd data/$RELEASE
md5sum -c md5sum.txt
mv pbta-snv-consensus_11122019.zip ../
Collaborator Author

For now, the consensus zip file is downloaded to data/$RELEASE/ as part of the release. We can just move that file out to data/ after the md5 check.

Member

My concern with moving rather than symlinking is that having everything that is included with a data download live in the data/$RELEASE folder is convenient if folks want to go back and compare different versions of files to narrow down an issue they are having (#260). It's also helpful for generating files in CI because one of the inputs to that is a data/$RELEASE directory (#278).
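To illustrate the version-comparison point: when each release keeps all of its files in its own directory, checking which files changed between two releases is a short loop. This is only a sketch; the directory names, file names, and the `compare_releases` helper are hypothetical.

```shell
#!/bin/bash
# Hypothetical layout: two release directories side by side, filled with
# dummy files so the sketch is self-contained.
DATA_DIR=$(mktemp -d)
mkdir -p "$DATA_DIR/release-old" "$DATA_DIR/release-new"
echo "gene_a 1" > "$DATA_DIR/release-old/example.tsv"
echo "gene_a 2" > "$DATA_DIR/release-new/example.tsv"
echo "same" > "$DATA_DIR/release-old/same.tsv"
echo "same" > "$DATA_DIR/release-new/same.tsv"

# For each file in the new release, report whether it was added, changed, or
# left unchanged relative to the old release (byte-for-byte comparison).
compare_releases() {
  old="$1"
  new="$2"
  for path in "$new"/*; do
    file=$(basename "$path")
    if [ ! -e "$old/$file" ]; then
      echo "added: $file"
    elif cmp -s "$old/$file" "$path"; then
      echo "unchanged: $file"
    else
      echo "changed: $file"
    fi
  done
}

compare_releases "$DATA_DIR/release-old" "$DATA_DIR/release-new"
```

Moving a file out of its release directory would make it show up as missing from that release in a comparison like this, whereas a symlink leaves the release directory complete.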

@@ -31,3 +39,6 @@ for file in "${FILES[@]}"
do
ln -sfn $RELEASE/$file data/$file
done

# Unzip any zip files in the data directory using the update flag
unzip -u -d data data/*.zip
Collaborator Author

Sure, no problem. I only added unzip to the end of the script. @jaclyn-taroni please review.

@jaclyn-taroni
Member

Running this locally right now -- if all looks good, I will approve. I will probably wait to merge until we get the CI files sorted (#278) so continuous integration doesn't fail for everyone at the download step. I expect that will happen sometime today.

Member

@jaclyn-taroni jaclyn-taroni left a comment

👍 ran this locally

@jaclyn-taroni
Member

This is taking a while to run in CI currently -- I'm fairly sure the issue is with Ensembl being slow because I am stuck on the genome FASTA step locally.

@jaclyn-taroni
Member

As far as I know, no analysis module uses the FASTA reference file yet. I am going to try commenting those parts of the script out until we discuss how we might shorten the download process. Tracked in #281.

@yuankunzhu yuankunzhu mentioned this pull request Nov 26, 2019
jharenza added a commit that referenced this pull request Dec 17, 2019
### release-v12-20191217
- release date: 2019-12-17
- status: available
- changes:
  - Add `data-file-descriptions.md` with data release to better track file types, origins, and workflows per [#334](#334) and [#336](#336)
  - Add stranded RNA-Seq for 23 PNOC samples and 21 CBTTC samples previously sequenced using a polyA library prep. Files updated:
    - pbta-fusion-arriba.tsv.gz
    - pbta-fusion-starfusion.tsv.gz
    - pbta-gene-expression-rsem-tpm.stranded.rds
    - pbta-gene-expression-rsem-fpkm.stranded.rds
    - pbta-isoform-expression-rsem-tpm.stranded.rds
    - pbta-isoform-counts-rsem-expected_count.stranded.rds
    - pbta-gene-counts-rsem-expected_count.stranded.rds
    - pbta-gene-expression-kallisto.stranded.rds
    - pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds
  - Add recurrently-fused genes by histology and matrix of recurrently-fused genes by biospecimen from [fusion filtering and prioritization analysis](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion_filtering)
  - Update consensus TMB files and MAF [#333](#333)
  - Add RNA-Seq [collapsed matrices](#287) - wrong files (tables of transcripts removed) were included with [V10](#273)
  - Rename `WGS.hg38.mutect2.unpadded.bed` to `WGS.hg38.mutect2.vardict.unpadded.bed` to better reflect usage
  - Update `pbta-histologies.tsv` to add new RNA-Seq samples listed above, [#222 harmonize disease separators](#222), and rerun [medulloblastoma classifier](https://github.com/d3b-center/medullo-classifier-package) using V12 RSEM fpkm collapsed files
    - BS_2Z1MKS84, BS_5VQP0E6K re-classified from Group4 to WNT and BS_3BDAG9YN, BS_8T7DZV2F, and BS_JTMXAMB7 re-classified from Group3 to WNT
  - Add CNVkit GISTIC results for focal CN analyses, e.g., [#244](#244) and [#8](#8)
@jharenza jharenza mentioned this pull request Dec 17, 2019
jaclyn-taroni pushed a commit that referenced this pull request Dec 19, 2019
* Release V12 data

* Update release-notes.md

fix link

* Update data-files-description.md

fix GISTIC table sectioning

* Update data-files-description.md

fix spacing on data description table

* Update data-files-description.md

fix more spacing in data file description file

* Update download-data.sh

add new release date to download script

* Update the TMB file descriptions

* Update TMB file formats section

* Update fusion section of data formats

Also more specific description of the by sample file

* Add GISTIC file to data-formats

* Update download-data.sh

* Update download-data.sh

* data description md is also included in md5sum

* TMB exon -> coding sequence

* Coding TMB CDS, not exon
@yuankunzhu yuankunzhu deleted the release/v10 branch March 6, 2020 17:38