This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Planned data release: V11 #287

Closed
2 tasks
jharenza opened this issue Nov 22, 2019 · 9 comments

@jharenza
Collaborator

jharenza commented Nov 22, 2019

What data file(s) does this issue pertain to?

The V10 collapsed files were tables of the genes removed, not the collapsed matrices (see #248).

What release are you using?

V10

Put a link to the relevant section of the OpenPBTA-manuscript here.

NA

Put your question or report your issue here.

Planned data to release:

@jharenza jharenza added the data label Nov 22, 2019
@jharenza jharenza self-assigned this Nov 22, 2019
@jashapiro
Member

As long as there is an update planned, I will be able to provide updated consensus SNV files. (With MNVs included)

@jaclyn-taroni
Member

Related: #275

@jaclyn-taroni
Member

New consensus files from @jashapiro: https://open-pbta.s3.us-east-1.amazonaws.com/data/snv-consensus/snv-consensus-20191125.zip

@jharenza
Collaborator Author

Re: putative oncogenic fusion (above), I am thinking we release the final prioritized list for the oncoprints, which would also become a supplemental table (I guess we add to the manuscript doc later?). cc: @jaclyn-taroni

@jaclyn-taroni
Member

To clarify, you are referring to this file: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/49acc98f5ffd86853fc70f220623311e13e3ca9f/analyses/fusion_filtering/results/PutativeOncogenicFusion.tsv correct @jharenza?

What are you going to name this file? I lean towards something like pbta-fusion-putative-oncogenic.tsv.

Regarding the new consensus mutation files, I think since we will probably put the pbta-fusion-putative-oncogenic.tsv directly in the release folder, we should make some changes to how the consensus mutation files are distributed.

The file structure of snv-consensus-20191125.zip is as follows:

.
├── README.md
├── consensus_mutation.maf.tsv
└── consensus_mutation_tmb.tsv

Could we rename the contents to:
consensus_mutation.maf.tsv -> pbta-snv-consensus-mutation.maf.tsv (+ possibly compress this file)
consensus_mutation_tmb.tsv -> pbta-snv-consensus-mutation-tmb.tsv

And then include these in the "top directory" of the release folder? I propose we move the information from the README in that folder, probably to both 1) the release notes for this release and 2) the main README of the repository. We may also want to include a Markdown file with the headers for these consensus files in doc/format, as the first one is "MAF-like."
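The renaming proposed above can be sketched as a small Python helper. This is only an illustration, not the script used for the actual release: the function name `extract_renamed` and the choice to skip `README.md` (on the assumption that its content moves into the release notes) are hypothetical.

```python
import shutil
import zipfile
from pathlib import Path

# Mapping proposed in the comment above: name in the archive -> release name.
RENAMES = {
    "consensus_mutation.maf.tsv": "pbta-snv-consensus-mutation.maf.tsv",
    "consensus_mutation_tmb.tsv": "pbta-snv-consensus-mutation-tmb.tsv",
}


def extract_renamed(zip_path, release_dir):
    """Copy the consensus files out of the archive under their proposed names.

    Files without an entry in RENAMES (for example README.md) are skipped.
    Returns the list of paths written.
    """
    release_dir = Path(release_dir)
    release_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            new_name = RENAMES.get(Path(member).name)
            if new_name is None:
                continue
            target = release_dir / new_name
            # Stream the member to disk so large MAF files are not held in memory.
            with archive.open(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(target)
    return written
```

Calling `extract_renamed("snv-consensus-20191125.zip", "release")` would place both renamed files directly in the top level of a `release` directory. Compression of the MAF file is left out here because the thread only mentions it as a possibility.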

It may behoove us to create a section under Data Formats that talks specifically about these "derivative" data files (e.g., prioritized fusion list, consensus mutation files, collapsed expression matrices and maybe even the independent specimens files) and how we expect folks to use them.

I want to note that I'd prefer we really consider the documentation changes and wait until after the holiday to release v11, rather than getting this out today without documentation changes. We're also going to have to make a bunch of changes to how we generate the CI files before anything consuming these files can get reviewed and merged.

@jaclyn-taroni
Member

Moving the discussion over from #248 -- can we include the collapsed/summarized matrices and call them pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds and pbta-gene-expression-rsem-fpkm-collapsed.polya.rds please?

This is what I would expect based on reading the comments in the code that generated them, and we will ideally link the code in the new section under Data Formats that I proposed in the comment above.

I think we can remove the following files that were included in v10 from the download: pbta-gene-expression-rsem-fpkm-collapsed_table.polya.rds and pbta-gene-expression-rsem-fpkm-collapsed_table.stranded.rds

Sounds like @komalsrathi will be filing a pull request with some additional analyses today (#248 (comment)). We can link to the notebook/script that gets added in Data Formats as well and direct folks to the analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed_table.polya.rds and analyses/collapse-rnaseq/pbta-gene-expression-rsem-fpkm-collapsed_table.stranded.rds files if they need more information.
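As a rough sketch of how the file list for the release could be checked against this proposal, the snippet below compares a listing of file names with the collapsed matrices that should be present and the `_table` files that should no longer be distributed. The function `check_release` is hypothetical, and it assumes both the polyA and stranded `_table` files are dropped from the download.

```python
# Collapsed matrices requested for the release.
EXPECTED = {
    "pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds",
    "pbta-gene-expression-rsem-fpkm-collapsed.polya.rds",
}

# Tables of removed genes shipped with v10 (assumed dropped for both library types).
DROPPED = {
    "pbta-gene-expression-rsem-fpkm-collapsed_table.polya.rds",
    "pbta-gene-expression-rsem-fpkm-collapsed_table.stranded.rds",
}


def check_release(file_names):
    """Report expected files that are missing and dropped files still present."""
    names = set(file_names)
    return {
        "missing": sorted(EXPECTED - names),
        "unexpected": sorted(DROPPED & names),
    }
```

Passing the names found in a release folder (for example from `os.listdir`) returns two sorted lists; both being empty would mean the folder matches the naming proposed here.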

@jharenza
Collaborator Author

@jaclyn-taroni yes to the renaming and the putative oncogenic fusion file. @yuankunzhu is on it!

@jaclyn-taroni
Member

Great, thank you. If that pull request allows edits from maintainers, we can have someone on our side make the doc/format change + consensus mutation README additions.

@jharenza
Collaborator Author

jharenza commented Dec 2, 2019

closed with #293

@jharenza jharenza closed this as completed Dec 2, 2019
jharenza added a commit that referenced this issue Dec 17, 2019
### release-v12-20191217
- release date: 2019-12-17
- status: available
- changes:
  - Add `data-file-descriptions.md` with data release to better track file types, origins, and workflows per [#334](#334) and [#336](#336)
  - Add stranded RNA-Seq for 23 PNOC samples and 21 CBTTC samples previously sequenced using a polyA library prep. Files updated:
    - pbta-fusion-arriba.tsv.gz
    - pbta-fusion-starfusion.tsv.gz
    - pbta-gene-expression-rsem-tpm.stranded.rds
    - pbta-gene-expression-rsem-fpkm.stranded.rds
    - pbta-isoform-expression-rsem-tpm.stranded.rds
    - pbta-isoform-counts-rsem-expected_count.stranded.rds
    - pbta-gene-counts-rsem-expected_count.stranded.rds
    - pbta-gene-expression-kallisto.stranded.rds
    - pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds
  - Add recurrently-fused genes by histology and matrix of recurrently-fused genes by biospecimen from [fusion filtering and prioritization analysis](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion_filtering)
  - Update consensus TMB files and MAF [#333](#333)
  - Add RNA-Seq [collapsed matrices](#287) - wrong files (tables of transcripts removed) were included with [V10](#273)
  - Rename `WGS.hg38.mutect2.unpadded.bed` to `WGS.hg38.mutect2.vardict.unpadded.bed` to better reflect usage
  - Update `pbta-histologies.tsv`: add the new RNA-Seq samples listed above, [#222 harmonize disease separators](#222), and rerun the [medulloblastoma classifier](https://github.com/d3b-center/medullo-classifier-package) using the V12 RSEM FPKM collapsed files
    - BS_2Z1MKS84, BS_5VQP0E6K re-classified from Group4 to WNT and BS_3BDAG9YN, BS_8T7DZV2F, and BS_JTMXAMB7 re-classified from Group3 to WNT
  - Add CNVkit GISTIC results for focal CN analyses, e.g., [#244](#244) and [#8](#8)
jaclyn-taroni pushed a commit that referenced this issue Dec 19, 2019
* Release V12 data

* Update release-notes.md

fix link

* Update data-files-description.md

fix GISTIC table sectioning

* Update data-files-description.md

fix spacing on data description table

* Update data-files-description.md

fix more spacing in data file description file

* Update download-data.sh

add new release date to download script

* Update the TMB file descriptions

* Update TMB file formats section

* Update fusion section of data formats

Also more specific description of the by sample file

* Add GISTIC file to data-formats

* Update download-data.sh

* Update download-data.sh

* data description md is also included in md5sum

* TMB exon -> coding sequence

* Coding TMB CDS, not exon