Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

ready for merge request #628

Closed
wants to merge 6 commits into from
Closed

Conversation

natemella
Copy link

@natemella natemella commented Mar 13, 2020

Purpose/implementation Section

What scientific question is your analysis addressing?

image

What was your approach?

image

What GitHub issue does your pull request address?

image

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

image

Is there anything that you want to discuss further?

image

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Results

What types of results are included (e.g., table, figure)?

image

What is your summary of the results?

We tested for batch effects in three different areas:

  1. Batch = Sequencing Center
  2. Batch = Cohort
  3. Batch = polyA vs ribodepletion

Within each of these analyses I evaluated RNA seq data that had been prepared using either RSEM normalization or Kallisto (tpm) values.
The null hypothesis is that there are no batch effects in the data

Batch
Sequence center Cohort Preparation method
RSEM polyA reject reject reject
before before before
rsem_polyA_sequence.pdf rsem_polya_cohort_pca.pdf rsem_method.pdf
rsem_polyA_sequence_pca.pdf rsem_polyA_cohort.pdf rsem_method_pca.pdf
after after after
rsem_polya_sequence_combat.pdf rsem_polya_cohort_combat_pca.pdf rsem_method_combat.pdf
rsem_polya_sequence_combat_pca.pdf rsem_polya_cohort_combat.pdf rsem_method_combat_pca.pdf
ribodepletion reject fail to reject
before before
rsem_stranded_sequence.pdf rsem_stranded_cohort.pdf
rsem_stranded_sequence_pca.pdf rsem_stranded_cohort_pca.pdf
after after
rsem_stranded_sequence_combat_pca.pdf
rsem_stranded_sequence_combat.pdf
Kallisto polyA reject reject reject
before before before
kallisto_polyA_sequence.pdf kallisto_polyA_cohort.pdf kallisto_method_pca.pdf
kallisto_polya_sequence_pca.pdf - kallisto_polyA_cohort_pca.pdf kallisto_method.pdf
after after after
kallisto_polya_sequence_combat_pca.pdf kallisto_polya_cohort_combat.pdf kallisto_method_combat.pdf
kallisto_polya_sequence_combat.pdf kallisto_polya_cohort_combat_pca.pdf - kallisto_method_combat_pca.pdf
ribodepletion reject fail to reject
before before
kallisto_stranded_sequence_pca.pdf kallisto_stranded_cohort_pca.pdf
kallisto_stranded_sequence.pdf kallisto_stranded_cohort.pdf
after after
kallisto_stranded_sequence_combat.pdf
kallisto_stranded_sequence_combat_pca.pdf

After running combat, it appears that the batch effects are successfully eliminated. Attached are the two graphs for each section 1) the distribution of P-values and 2) the PCA plot generated from batchQC

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

@cgreene
Copy link
Collaborator

cgreene commented Mar 13, 2020

Hi @natemella - if you could paste text instead of images I'd quote tweet, but your answer to this one remains an ongoing question:

Is there anything that you want to discuss further?

You've batch effect corrected for different datasets, but we suspect those may be confounded with disease type. this means your attempts to remove unwanted variability might actually induce other confounded variability. If you wanted to check this, you'd pick one or a few dominant histologies from those centers, compare them, and at least see that:

  1. Those histologies now overlap, or at least overlap better than they did before.
  2. The histologies that are not those within the center(s) in question retain roughly the same variability, looking within center, that they did before the data were merged.

Without doing this analysis, I don't think there's any way to know if this correction is helping or hurting.

@jaclyn-taroni
Copy link
Member

I am going to resolve conflicts and add this analysis to continuous integration.

Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @natemella,

Thanks for adding this! As you address the high-level comments from @cgreene (#628 (comment)), I figured I would take a look at the code you are requesting to add here and see if there are adjustments we could make to smooth the way for the confounded by disease question.

In general, I think it would be helpful to take an approach where you are writing functions that are very general that can be sourced in all of the scripts or notebooks where you are looking at different "batches." These R files could live in analyses/batch-effects/util/.

I also added some specific comments about things like the handling of directories and file paths. rprojroot is very handy when you will be sourcing files in scripts because you want to make sure that your scripts that source files can be run from anywhere from within the project and they'll always be able to "find" the files they need to source. (here is probably helpful in the exact same way, but rprojroot allows you to specify the criterion used to determine the root of the project.)

Because this is a multi-step analysis, I think we will want to break this up into multiple pull requests per the contributing guidelines – a good rule of thumb is 1 analysis script per pull request or if you need to add files that will be sourced/dependencies to Dockerfile/the shell script, aim for 400 lines maximum. Limiting the size of pull request ensures thorough review. We can leave this pull request open for now as a sort of guide as we break things up.

A good first pull request that incorporates some of the feedback here might contain: 1) the functions that will be used throughout the analysis module 2) the first script that uses those functions 3) the required Dockerfile changes 4) adding this to CI (.circleci/config.yml; instructions here) 5) if you'd like you can add the beginnings of the shell script to run the entire module (adding it at the end vs. as you go is a bit of a matter of personal preference).

I added some specific comments about documentation, but I think those can wait until you file a pull request focused on documentation.

Thanks again! Please let us know if you have any questions about our comments or if you need a hand breaking things up!

# packages needed for batch-effects-analysis
RUN R -e "BiocManager::install(c('BatchQC'))"
RUN R -e "BiocManager::install(c('sva'))"
RUN R -e "install.packages('here', dependencies = TRUE)"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use rprojroot throughout the project. I'll also add specific comments about where and how to use it in what you are adding here.

Comment on lines +245 to +246
RUN R -e "BiocManager::install(c('BatchQC'))"
RUN R -e "BiocManager::install(c('sva'))"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
RUN R -e "BiocManager::install(c('BatchQC'))"
RUN R -e "BiocManager::install(c('sva'))"
RUN R -e "BiocManager::install(c('BatchQC', 'sva'))"

| [`tcga-capture-kit-investigation`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/tcga-capture-kit-investigation) | `pbta-snv-lancet.vep.maf.gz` <br> `pbta-snv-mutect2.vep.maf.gz` <br> `pbta-snv-strelka2.vep.maf.gz` <br> `tcga-snv-lancet.vep.maf.gz` <br> `tcga-snv-mutect2.vep.maf.gz` <br> `tcga-snv-strelka2.vep.maf.gz` <br> `pbta-histologies.tsv` <br> `pbta-tcga-manifest.tsv` <br> `WGS.hg38.lancet.unpadded.bed` <br> `WGS.hg38.strelka2.unpadded.bed` <br> `WGS.hg38.mutect2.vardict.unpadded.bed` <br> | Investigation of the TMB discrepancy between PBTA and TCGA data | `results/*.bed`
| [`batch-effects`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/batch-effects)| `pbta-gene-expression-rsem-fpkm.polya.rds` <br> `pbta-gene-expression-rsem-fpkm.stranded.rds` <br> `pbta-gene-expression-kallisto.polya.rds` <br> `pbta-gene-expression-kallisto.stranded.rds` | batchQC and PCA analysis of batch affects with SVA batch correction (part of [#448](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/448)) | N/A
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add this where it would be displayed on the analyses page in GitHub? For this particular add, I believe this would be the first entry in the table.

@@ -0,0 +1,75 @@
# Batch Effect Analysis

Batch effects are commonly observed in RNA expression data. Such effects are likely less pronounced in RNA-Seq data than microarray data. However, batch effects may bias conclusions of studies that do not account for these effects. We wish to evaluate the level to which batch effects are present in the RNA-Seq data from this study and create alternative versions of the data that have been adjusted for batch effects.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following one sentence per line for Markdown documents that are under version control is helpful for tracking and also makes it a bit easier for review because a review is able to comment on a specific sentence more easily.

Suggested change
Batch effects are commonly observed in RNA expression data. Such effects are likely less pronounced in RNA-Seq data than microarray data. However, batch effects may bias conclusions of studies that do not account for these effects. We wish to evaluate the level to which batch effects are present in the RNA-Seq data from this study and create alternative versions of the data that have been adjusted for batch effects.
Batch effects are commonly observed in RNA expression data.
Such effects are likely less pronounced in RNA-Seq data than microarray data.
However, batch effects may bias conclusions of studies that do not account for these effects.
We wish to evaluate the level to which batch effects are present in the RNA-Seq data from this study and create alternative versions of the data that have been adjusted for batch effects.

| | | - [kallisto_stranded_sequence_combat_pca.pdf](https://github.com/AlexsLemonade/OpenPBTA-analysis/files/4325816/kallisto_stranded_sequence_combat_pca.pdf) | | |
| | | | | |

We propose to use the BatchQC tool to evaluate the data for batch effects. BatchQC provides various visualization and quantitative tools for evaluating batch effects. We will use these and prepare a summary based on our findings. ComBat is a widely used method that uses empirical Bayes methods for correcting batch effects.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
We propose to use the BatchQC tool to evaluate the data for batch effects. BatchQC provides various visualization and quantitative tools for evaluating batch effects. We will use these and prepare a summary based on our findings. ComBat is a widely used method that uses empirical Bayes methods for correcting batch effects.
We propose to use the BatchQC tool to evaluate the data for batch effects.
BatchQC provides various visualization and quantitative tools for evaluating batch effects.
We will use these and prepare a summary based on our findings.
ComBat is a widely used method that uses empirical Bayes methods for correcting batch effects.

Can you add citations throughout please?




trans = function(data, Kids_First_Biospecimen_ID, val, starting_col){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make these function names a little more descriptive:

Suggested change
trans = function(data, Kids_First_Biospecimen_ID, val, starting_col){
transpose_expression_data = function(data, Kids_First_Biospecimen_ID, val, starting_col){

return(data)
}

clean = function(data, gene_id){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
clean = function(data, gene_id){
clean_expression_data = function(data, gene_id){

I don't thinkgene_id is used.


# Get correct file paths to data
library(here)
path = here("data", "release-v13-20200116")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we want to be able to run this with new releases as data gets updated, any references to data files should be to the symlinked files in data rather than to a specific release folder.

Comment on lines +3 to +6
scr_dir <- dirname("/fslhome/nmella/OpenPBTA-analysis")

setwd(paste0(scr_dir, "/data/release-v13-20200116"))
print(scr_dir)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use rprojroot here and symlinked files in data here as well please.

Comment on lines +113 to +124
print(path)
setwd(path)

# download all covariate data which will be used to identify batches
covariate = read_tsv("pbta-histologies.tsv",col_types = cols(molecular_subtype = "c"))

# download gene expression data
dat_rsem_polya = readRDS("pbta-gene-expression-rsem-tpm.polya.rds")
dat_rsem_stranded = readRDS("pbta-gene-expression-rsem-tpm.stranded.rds")
dat_kallisto_stranded <- readRDS("pbta-gene-expression-kallisto.stranded.rds")
dat_kallisto_stranded = dat_kallisto_stranded[,2:ncol(dat_kallisto_stranded)]
dat_kallisto_polya <- readRDS("pbta-gene-expression-kallisto.polya.rds")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than using set.wd, I would do something like:

data_dir <- file.path(root_dir, "data")
rsem_polya_file <- file.path(data_dir, "pbta-gene-expression-rsem-tpm.polya.rds")
dat_rsem_polya <- readRDS(rsem_polya_file)

@jaclyn-taroni
Copy link
Member

On the connected issue, the next steps included new pull requests #448 (comment) - the first of which was edits to the Docker image. The Docker image has been overhauled (#689) since this pull request was opened, so I think the best course of action is to close this one at this point.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants