RTCGA data to ExperimentHub #85

Open
MarcinKosinski opened this Issue May 31, 2016 · 18 comments

Member

MarcinKosinski commented May 31, 2016

Hi Marcin,

I'm ready to start putting the RTCGA data sets in ExperimentHub. The idea is that we'll create a data package, similar to the usual Experimental Data packages, that will have a little more documentation and background. The data will be in an S3 bucket - it's ok if the data.frames live in S3 and we transform them to SummarizedExperiment when the user invokes '[[' on the resource, e.g.,

eh = ExperimentHub()
eh[["EH1234"]]

To get started please do the following:

  1. Let me know what you plan to name your data package, e.g., RTCGAData or just RTCGA etc. Once we have this we'll make an S3 bucket of the same name and give you permissions to upload the data.

  2. Put together the data package following instructions here, specifically section 2.1 'Add New Resources':

http://www.bioconductor.org/packages/3.4/bioc/vignettes/ExperimentHubData/inst/doc/ExperimentHubData.html

Here is a sample package (data are in ExperimentHub) you can use as a template:

http://www.bioconductor.org/packages/3.3/data/experiment/html/GSE62944.html

Thanks.
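To illustrate the idea of converting stored data only when a resource is requested with '[[', here is a minimal base R sketch. It is a toy stand-in and not the ExperimentHub API: the `toy_hub` class, the `EH_TOY_1` id, and `to_feature_matrix()` are hypothetical names, and a plain numeric matrix stands in for a SummarizedExperiment.

```r
# Toy stand-in for a hub (NOT the ExperimentHub API). Each record keeps the raw
# data.frame plus a function that converts it when the record is requested.
new_toy_hub <- function(records) {
  structure(list(records = records), class = "toy_hub")
}

# S3 method for `[[`: look the record up, then convert it on access.
`[[.toy_hub` <- function(x, i, ...) {
  rec <- unclass(x)$records[[i]]
  if (is.null(rec)) stop("unknown resource id: ", i)
  rec$convert(rec$raw)
}

# Hypothetical converter: samples in rows -> matrix with features in rows.
to_feature_matrix <- function(df) {
  m <- t(as.matrix(df[, -1, drop = FALSE]))
  colnames(m) <- df[[1]]
  m
}

raw <- data.frame(sample_id = c("S1", "S2"),
                  geneA = c(1.5, 2.0),
                  geneB = c(0.1, 0.3))
hub <- new_toy_hub(list(EH_TOY_1 = list(raw = raw, convert = to_feature_matrix)))

m <- hub[["EH_TOY_1"]]  # conversion happens here, not at storage time
```

The stored object stays a simple data.frame, and the richer representation is built only at the moment of access.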


MarcinKosinski May 31, 2016

Member

2016-05-31 15:30 GMT+02:00 Valerie Obenchain vobencha@gmail.com:

Hi Marcin,

Hello Valerie,

It's good to hear from you, as I have a Skype call scheduled for tomorrow to talk about uploading a new data package (or packages) to ExperimentHub with my collaborator, Witold Chodor, from the RTCGA project https://github.com/orgs/RTCGA/people (CC'd other contributors).

I'm ready to start putting the RTCGA data sets in ExperimentHub. The idea is that we'll create a data package, similar to the usual Experimental Data packages, that will have a little more documentation and background. The data will be in an S3 bucket - it's ok if the data.frames live in S3 and we transform them to SummarizedExperiment when the user invokes '[[' on the resource, e.g.,

eh = ExperimentHub()
eh[["EH1234"]]

To get started please do the following:

1) Let me know what you plan to name your data package, e.g., RTCGAData or just RTCGA etc. Once we have this we'll make an S3 bucket of the same name and give you permissions to upload the data.

I think "RTCGA.data" is a good name to start with. There I'll create "RTCGA.data_type.date_of_release" data.frames / SummarizedExperiments.

2) Put together the data package following instructions here, specifically section 2.1 'Add New Resources':

http://www.bioconductor.org/packages/3.4/bioc/vignettes/ExperimentHubData/inst/doc/ExperimentHubData.html

Here is a sample package (data are in ExperimentHub) you can use as a template:

http://www.bioconductor.org/packages/3.3/data/experiment/html/GSE62944.html

Are there any size limits for such packages? I would like to combine datasets from, among others, the RTCGA.methylation and RTCGA.rnaseq Experiment Data packages, which together exceed 16 GB and might not build as a single binary file because they exceed my RAM limits.

Maybe it's a good idea to upload more than one package to ExperimentHub, like:

  • RTCGA.clinical.20160128 as refreshed, newer version of RTCGA.clinical Experiment Data Package

  • RTCGA.mutations.20160128 as refreshed, newer version of RTCGA.mutations Experiment Data Package

  • RTCGA.rnaseq.20160128 as refreshed, newer version of RTCGA.rnaseq Experiment Data Package

  • RTCGA.methylation.20160128 as refreshed, newer version of RTCGA.methylation Experiment Data Package

  • RTCGA.RPPA.20160128 as refreshed, newer version of RTCGA.RPPA Experiment Data Package

  • RTCGA.miRNASeq.20160128 as refreshed, newer version of RTCGA.miRNASeq Experiment Data Package

  • RTCGA.mRNA.20160128 as refreshed, newer version of RTCGA.mRNA Experiment Data Package

  • RTCGA.CNV.20160128 as refreshed, newer version of RTCGA.CNV Experiment Data Package

    Thanks.
    Valerie

Would you rather continue this discussion by e-mail, or could we open an issue on our GitHub repository https://github.com/RTCGA/RTCGA/issues so that more people can benefit from our private correspondence :)?

Best,
Marcin
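The naming scheme proposed above ("RTCGA.data_type.date_of_release") can be sketched in base R as follows. This only shows how the names in the list are composed; the variable names are illustrative and nothing here is part of RTCGA itself.

```r
# Data types taken from the list of proposed packages above
data_types <- c("clinical", "mutations", "rnaseq", "methylation",
                "RPPA", "miRNASeq", "mRNA", "CNV")

# TCGA release date used in the proposed package names
release_date <- as.Date("2016-01-28")

# Compose "RTCGA.<data_type>.<YYYYMMDD>"
package_names <- sprintf("RTCGA.%s.%s",
                         data_types,
                         format(release_date, "%Y%m%d"))
print(package_names)
```

Keeping the release date in the name lets several snapshots of the same data type coexist.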

MarcinKosinski May 31, 2016

Member

Valerie:

I would try to keep the file size <= 5GB. How many data.frames do you
have for each of these categories? e.g., how many for clinical, for
mutations, etc.?
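One rough way to check objects against the 5GB guideline before uploading is to serialize each one and look at the resulting file size. The sketch below is an assumption about how such a check might look: `serialized_size_bytes()` and the toy datasets are hypothetical, and the limit is interpreted here as 5 * 1024^3 bytes.

```r
# Hypothetical helper: size on disk of an object saved with saveRDS()
serialized_size_bytes <- function(obj) {
  path <- tempfile(fileext = ".rds")
  on.exit(unlink(path))
  saveRDS(obj, path)
  file.size(path)
}

# Assumed interpretation of the "<= 5GB" guideline
limit_bytes <- 5 * 1024^3

# Toy stand-ins for the real per-cohort data.frames
datasets <- list(
  small  = data.frame(x = 1:10),
  larger = data.frame(x = runif(1e4))
)

sizes <- vapply(datasets, serialized_size_bytes, numeric(1))
within_limit <- sizes <= limit_bytes
print(data.frame(bytes = sizes, within_limit = within_limit))
```

File size depends on the serialization format and compression, so the in-memory size reported by `object.size()` can differ noticeably.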

MarcinKosinski May 31, 2016

Member

There are 38 cohort types in the TCGA study, so each category can contain at most 38 data.frames. TCGA does not always provide data for every cohort, though, so the actual data.frame counts for the selected categories are:

pckg.names <-
  c("RTCGA.clinical", "RTCGA.mutations", "RTCGA.rnaseq",
    "RTCGA.RPPA", "RTCGA.mRNA", "RTCGA.CNV", "RTCGA.miRNASeq",
    "RTCGA.PANCAN12", "RTCGA.methylation")

# count the datasets shipped in each (installed) package
data.frames.counts <- vapply(
  pckg.names,
  function(pckg) length(data(package = pckg)$results[, "Item"]),
  integer(1)
)

knitr::kable(
  data.frame(
    count = data.frames.counts
  ))
package            count
RTCGA.clinical        38
RTCGA.mutations       35
RTCGA.rnaseq          36
RTCGA.RPPA            36
RTCGA.mRNA            13
RTCGA.CNV             37
RTCGA.miRNASeq        38
RTCGA.PANCAN12         5
RTCGA.methylation     17

vobencha May 31, 2016

OK. Yes, I agree making a package for each type is a good way to organize this.

MarcinKosinski Jun 2, 2016

Member

@wchodor and I have talked and prepared the following plan:

  1. We'll create an RTCGA::createTCGA() function that generates an ExperimentHub-type package, based on the source code of RTCGA.clinical / RTCGA.mutations / etc.
  2. We'll use this function to create the new packages that we'll submit to ExperimentHub for the newest release date, 2016-01-28, and for future releases.
  3. I suggest we first draft an RTCGA::createTCGA() function that will
    a) take the parameters `createTCGA('date_of_release', 'package_name', 'dataset_name', 'author_name')`
    b) create only the DESCRIPTION file (at the beginning), with full information in that file

@wchodor do you agree this is a good start :)?
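As a rough illustration of step 3 of the plan above, a first draft that only writes a DESCRIPTION file could look like the base R sketch below. This is not the real RTCGA::createTCGA(): the function name `create_description_sketch()`, the `dir` argument, and every field value (version, license, maintainer address, wording) are placeholders chosen for the example.

```r
# Hypothetical draft, NOT the real RTCGA::createTCGA(). It takes the four
# parameters named in the plan and writes only a DESCRIPTION file.
create_description_sketch <- function(date_of_release, package_name,
                                      dataset_name, author_name,
                                      dir = tempdir()) {
  pkg_dir <- file.path(dir, package_name)
  dir.create(pkg_dir, recursive = TRUE, showWarnings = FALSE)

  # All values below are illustrative placeholders.
  fields <- data.frame(
    Package     = package_name,
    Title       = sprintf("%s datasets from TCGA, release %s",
                          dataset_name, date_of_release),
    Version     = "0.99.0",
    Author      = author_name,
    Maintainer  = sprintf("%s <maintainer@example.org>", author_name),
    Description = sprintf("Snapshot of TCGA %s data released on %s.",
                          dataset_name, date_of_release),
    License     = "GPL-2",
    biocViews   = "ExperimentData",
    stringsAsFactors = FALSE
  )

  desc_path <- file.path(pkg_dir, "DESCRIPTION")
  write.dcf(fields, file = desc_path)  # DESCRIPTION uses the DCF format
  invisible(desc_path)
}

path <- create_description_sketch("2016-01-28", "RTCGA.clinical.20160128",
                                  "clinical", "Jane Doe")
cat(readLines(path), sep = "\n")
```

Starting with DESCRIPTION alone keeps the first draft small; the remaining package skeleton can be added once the fields are agreed on.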

MarcinKosinski added a commit that referenced this issue Jun 21, 2016

MarcinKosinski added a commit that referenced this issue Jun 24, 2016

MarcinKosinski added a commit that referenced this issue Sep 6, 2016

@MarcinKosinski MarcinKosinski modified the milestones: RTCGA na Bioc2016, 18th October Bioc release Sep 29, 2016

@MarcinKosinski MarcinKosinski changed the title from RTCGA data to ExperimentHub to RTCGA data to ExperimentHub - before release on 18th October Sep 29, 2016

@MarcinKosinski MarcinKosinski changed the title from RTCGA data to ExperimentHub - before release on 18th October to RTCGA data to ExperimentHub Nov 30, 2016

MarcinKosinski Nov 30, 2016

Member

@vobencha do you think RTCGA.clinical.20160128 is suitable for ExperimentHub?
https://github.com/RTCGA/RTCGA.clinical.20160128 (without the createTCGA.R file in the root directory, which will be included in the RTCGA package).

If this package is OK, I'll create similar packages for the rest of the TCGA stuff :)

vobencha Dec 1, 2016

@MarcinKosinski RTCGA is broken in release and devel. We can't move forward with packages that depend on RTCGA until it builds and checks clean.
Valerie

MarcinKosinski Dec 2, 2016

Member

@vobencha yes I am aware of this :) I even have an issue for that - #98

It's due to the old version of the survminer package, which RTCGA requires. The main architect of survminer has promised to release a new version this weekend that is compatible with ggplot2 ver 2.2.0.

Once again, many packages have broken due to a ggplot2 release :)
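When a build breaks because of a dependency release like this, a quick check of the installed version can help confirm which side of the change a machine is on. The helper below is a hypothetical sketch using only base R and utils; the demo call uses the utils package so that it runs anywhere, and the same call could be made with "ggplot2" and "2.2.0".

```r
# Hypothetical helper: TRUE if `pkg` is installed at version >= `minimum`
has_min_version <- function(pkg, minimum) {
  requireNamespace(pkg, quietly = TRUE) &&
    utils::packageVersion(pkg) >= package_version(minimum)
}

# Demo on a package that ships with R
ok <- has_min_version("utils", "3.0.0")
print(ok)
```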

vobencha Dec 2, 2016

@MarcinKosinski Sounds good. We'll wait and see if the new survminer works.

MarcinKosinski Dec 16, 2016

Member

Hi @vobencha once again :) survminer had to be released twice in the last 14 days to fix the problem with RTCGA : ) It is now fixed in both devel and release.

So if you don't mind, could we check whether the packages below fit the ExperimentHubData methodology and implementation?

All packages were created with the RTCGA::createTCGA function, which is implemented in RTCGA 1.5.1 (only in this GitHub repository). I will upload 1.5.1 to Bioconductor after we put the RTCGA.data.20160128 packages on Bioc, so that I can link to those packages in RTCGA's documentation.

The new data packages extend the RTCGA workflow as shown below [new features of the whole family are described as ver 1.5.1] :)
[Image: rtcga_workflow]

vobencha Dec 16, 2016

Hi @MarcinKosinski,
Yes, I will take a look early next week. Thanks for the update.
Valerie

MarcinKosinski Dec 18, 2016

Member

I have even updated the whole RTCGA project website with the newest features and data packages http://rtcga.github.io/RTCGA/ (prepared with http://hadley.github.io/pkgdown/).

vobencha Dec 21, 2016

I'm getting this error when trying to build the RTCGA software package (devel):

~/sandbox/test >R-rel CMD build RTCGA/

MarcinKosinski Dec 21, 2016

Member

The current state is:
1.4.0 in Bioc release
1.5.0 in Bioc devel
1.5.1 on GitHub

I accidentally removed this file (3cc2977) when moving from version 1.5.0 to 1.5.1. I added it back today, so version 1.5.0 should no longer produce warnings. The current version, 1.5.1, will only work once the ExperimentHubData packages from the RTCGA family appear on Bioconductor, so I am not pushing 1.5.1 to Bioc devel yet : )

MarcinKosinski Dec 21, 2016

Member

This file is no longer needed in 1.5.1 : ) It is the .png of the RTCGA workflow, which was extended between 1.5.0 and 1.5.1, so I have uploaded a new file with the newest workflow png.

vobencha Dec 21, 2016

We are having problems with our git/svn mirror, so please commit changes directly to the svn repo. Also make sure you bump the version each time you commit to release or devel. I see 1.5.0 in devel svn - sounds like it should be 1.5.1? I will be using/testing the version in svn, not GitHub. If svn is broken, that needs to be fixed before we can move forward.
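Bumping the version before each commit can be scripted. The following base R sketch increments the last component of the Version field in a DESCRIPTION file; `bump_patch_version()` is a hypothetical helper, not a Bioconductor tool, and it assumes a dot-separated version such as 1.5.0. Note that rewriting the file with `write.dcf()` may reflow long fields.

```r
# Hypothetical helper: increment the last component of Version in DESCRIPTION
bump_patch_version <- function(desc_path) {
  desc <- read.dcf(desc_path)
  parts <- as.integer(strsplit(desc[1, "Version"], ".", fixed = TRUE)[[1]])
  parts[length(parts)] <- parts[length(parts)] + 1L
  desc[1, "Version"] <- paste(parts, collapse = ".")
  write.dcf(desc, file = desc_path)
  invisible(unname(desc[1, "Version"]))
}

# Demo on a throwaway file rather than a real package
demo_path <- tempfile("DESCRIPTION")
writeLines(c("Package: demoPkg", "Version: 1.5.0"), demo_path)
new_version <- bump_patch_version(demo_path)
print(new_version)
```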

MarcinKosinski Dec 21, 2016

Member

@MarcinKosinski MarcinKosinski self-assigned this Dec 28, 2016

@MarcinKosinski MarcinKosinski removed this from the 18th October Bioc release milestone Dec 30, 2016

@MarcinKosinski MarcinKosinski removed their assignment Jan 9, 2018
