ready for merge request #628

natemella · 2020-03-13T08:25:24Z

Purpose/implementation Section

What scientific question is your analysis addressing?

What was your approach?

What GitHub issue does your pull request address?

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Is there anything that you want to discuss further?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Results

What types of results are included (e.g., table, figure)?

What is your summary of the results?

We tested for batch effects in three different areas:

Batch = Sequencing Center
Batch = Cohort
Batch = polyA vs ribodepletion

Within each of these analyses I evaluated RNA seq data that had been prepared using either RSEM normalization or Kallisto (tpm) values.
The null hypothesis is that there are no batch effects in the data

		Batch
		Sequence center	Cohort	Preparation method
RSEM	polyA	reject	reject	reject
		before	before	before
		- rsem_polyA_sequence.pdf	- rsem_polya_cohort_pca.pdf	- rsem_method.pdf
		- rsem_polyA_sequence_pca.pdf	- rsem_polyA_cohort.pdf	- rsem_method_pca.pdf
		after	after	after
		- rsem_polya_sequence_combat.pdf	- rsem_polya_cohort_combat_pca.pdf	- rsem_method_combat.pdf
		- rsem_polya_sequence_combat_pca.pdf	- rsem_polya_cohort_combat.pdf	- rsem_method_combat_pca.pdf

	ribodepletion	reject	fail to reject
		before	before
		- rsem_stranded_sequence.pdf	- rsem_stranded_cohort.pdf
		- rsem_stranded_sequence_pca.pdf	- rsem_stranded_cohort_pca.pdf
		after	after
		- rsem_stranded_sequence_combat_pca.pdf
		- rsem_stranded_sequence_combat.pdf

Kallisto	polyA	reject	reject	reject
		before	before	before
		- kallisto_polyA_sequence.pdf	- kallisto_polyA_cohort.pdf	- kallisto_method_pca.pdf
		- kallisto_polya_sequence_pca.pdf	- kallisto_polyA_cohort_pca.pdf	- kallisto_method.pdf
		after	after	after
		- kallisto_polya_sequence_combat_pca.pdf	- kallisto_polya_cohort_combat.pdf	- kallisto_method_combat.pdf
		- kallisto_polya_sequence_combat.pdf	- kallisto_polya_cohort_combat_pca.pdf	- kallisto_method_combat_pca.pdf
	ribodepletion	reject	fail to reject
		before	before
		- kallisto_stranded_sequence_pca.pdf	- kallisto_stranded_cohort_pca.pdf
		- kallisto_stranded_sequence.pdf	- kallisto_stranded_cohort.pdf
		after	after
		- kallisto_stranded_sequence_combat.pdf
		- kallisto_stranded_sequence_combat_pca.pdf

After running combat, it appears that the batch effects are successfully eliminated. Attached are the two graphs for each section 1) the distribution of P-values and 2) the PCA plot generated from batchQC

Reproducibility Checklist

The dependencies required to run the code in this pull request have been added to the project Dockerfile.
This analysis has been added to continuous integration.

Documentation Checklist

This analysis module has a README and it is up to date.
This analysis is recorded in the table in analyses/README.md and the entry is up to date.
The analytical code is documented and contains comments.

cgreene · 2020-03-13T12:03:29Z

Hi @natemella - if you could paste text instead of images I'd quote tweet, but your answer to this one remains an ongoing question:

Is there anything that you want to discuss further?

You've batch effect corrected for different datasets, but we suspect those may be confounded with disease type. this means your attempts to remove unwanted variability might actually induce other confounded variability. If you wanted to check this, you'd pick one or a few dominant histologies from those centers, compare them, and at least see that:

Those histologies now overlap, or at least overlap better than they did before.
The histologies that are not those within the center(s) in question retain roughly the same variability, looking within center, that they did before the data were merged.

Without doing this analysis, I don't think there's any way to know if this correction is helping or hurting.

jaclyn-taroni · 2020-04-01T15:16:03Z

I am going to resolve conflicts and add this analysis to continuous integration.

jaclyn-taroni

Hi @natemella,

Thanks for adding this! As you address the high-level comments from @cgreene (#628 (comment)), I figured I would take a look at the code you are requesting to add here and see if there are adjustments we could make to smooth the way for the confounded by disease question.

In general, I think it would be helpful to take an approach where you are writing functions that are very general that can be sourced in all of the scripts or notebooks where you are looking at different "batches." These R files could live in analyses/batch-effects/util/.

I also added some specific comments about things like the handling of directories and file paths. rprojroot is very handy when you will be sourcing files in scripts because you want to make sure that your scripts that source files can be run from anywhere from within the project and they'll always be able to "find" the files they need to source. (here is probably helpful in the exact same way, but rprojroot allows you to specify the criterion used to determine the root of the project.)

Because this is a multi-step analysis, I think we will want to break this up into multiple pull requests per the contributing guidelines – a good rule of thumb is 1 analysis script per pull request or if you need to add files that will be sourced/dependencies to Dockerfile/the shell script, aim for 400 lines maximum. Limiting the size of pull request ensures thorough review. We can leave this pull request open for now as a sort of guide as we break things up.

A good first pull request that incorporates some of the feedback here might contain: 1) the functions that will be used throughout the analysis module 2) the first script that uses those functions 3) the required Dockerfile changes 4) adding this to CI (.circleci/config.yml; instructions here) 5) if you'd like you can add the beginnings of the shell script to run the entire module (adding it at the end vs. as you go is a bit of a matter of personal preference).

I added some specific comments about documentation, but I think those can wait until you file a pull request focused on documentation.

Thanks again! Please let us know if you have any questions about our comments or if you need a hand breaking things up!

jaclyn-taroni · 2020-04-01T13:41:18Z

Dockerfile

+# packages needed for batch-effects-analysis
+RUN R -e "BiocManager::install(c('BatchQC'))"
+RUN R -e "BiocManager::install(c('sva'))"
+RUN R -e "install.packages('here', dependencies = TRUE)"


We use rprojroot throughout the project. I'll also add specific comments about where and how to use it in what you are adding here.

jaclyn-taroni · 2020-04-01T13:41:52Z

Dockerfile

+RUN R -e "BiocManager::install(c('BatchQC'))"
+RUN R -e "BiocManager::install(c('sva'))"


Suggested change

RUN R -e "BiocManager::install(c('BatchQC'))"

RUN R -e "BiocManager::install(c('sva'))"

RUN R -e "BiocManager::install(c('BatchQC', 'sva'))"

jaclyn-taroni · 2020-04-01T13:43:09Z

analyses/README.md

+| [`tcga-capture-kit-investigation`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/tcga-capture-kit-investigation) | `pbta-snv-lancet.vep.maf.gz` <br> `pbta-snv-mutect2.vep.maf.gz` <br> `pbta-snv-strelka2.vep.maf.gz` <br> `tcga-snv-lancet.vep.maf.gz` <br> `tcga-snv-mutect2.vep.maf.gz` <br> `tcga-snv-strelka2.vep.maf.gz` <br> `pbta-histologies.tsv` <br> `pbta-tcga-manifest.tsv` <br> `WGS.hg38.lancet.unpadded.bed` <br> `WGS.hg38.strelka2.unpadded.bed` <br> `WGS.hg38.mutect2.vardict.unpadded.bed` <br> | Investigation of the TMB discrepancy between PBTA and TCGA data | `results/*.bed`
+| [`batch-effects`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/batch-effects)| `pbta-gene-expression-rsem-fpkm.polya.rds` <br> `pbta-gene-expression-rsem-fpkm.stranded.rds` <br> `pbta-gene-expression-kallisto.polya.rds` <br> `pbta-gene-expression-kallisto.stranded.rds` | batchQC and PCA analysis of batch affects with SVA batch correction (part of [#448](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/448)) | N/A


Can you add this where it would be displayed on the analyses page in GitHub? For this particular add, I believe this would be the first entry in the table.

jaclyn-taroni · 2020-04-01T13:46:33Z

analyses/batch-effects/README.md

@@ -0,0 +1,75 @@
+# Batch Effect Analysis
+
+Batch effects are commonly observed in RNA expression data. Such effects are likely less pronounced in RNA-Seq data than microarray data. However, batch effects may bias conclusions of studies that do not account for these effects. We wish to evaluate the level to which batch effects are present in the RNA-Seq data from this study and create alternative versions of the data that have been adjusted for batch effects.


Following one sentence per line for Markdown documents that are under version control is helpful for tracking and also makes it a bit easier for review because a review is able to comment on a specific sentence more easily.

Suggested change

Batch effects are commonly observed in RNA expression data. Such effects are likely less pronounced in RNA-Seq data than microarray data. However, batch effects may bias conclusions of studies that do not account for these effects. We wish to evaluate the level to which batch effects are present in the RNA-Seq data from this study and create alternative versions of the data that have been adjusted for batch effects.

Batch effects are commonly observed in RNA expression data.

Such effects are likely less pronounced in RNA-Seq data than microarray data.

However, batch effects may bias conclusions of studies that do not account for these effects.

We wish to evaluate the level to which batch effects are present in the RNA-Seq data from this study and create alternative versions of the data that have been adjusted for batch effects.

jaclyn-taroni · 2020-04-01T13:47:15Z

analyses/batch-effects/README.md

+|          |               | - [kallisto_stranded_sequence_combat_pca.pdf](https://github.com/AlexsLemonade/OpenPBTA-analysis/files/4325816/kallisto_stranded_sequence_combat_pca.pdf)      |                                   |                                    |
+|          |               |                                               |                                          |                                    | 
+
+We propose to use the BatchQC tool to evaluate the data for batch effects. BatchQC provides various visualization and quantitative tools for evaluating batch effects. We will use these and prepare a summary based on our findings. ComBat is a widely used method that uses empirical Bayes methods for correcting batch effects. 


Suggested change

We propose to use the BatchQC tool to evaluate the data for batch effects. BatchQC provides various visualization and quantitative tools for evaluating batch effects. We will use these and prepare a summary based on our findings. ComBat is a widely used method that uses empirical Bayes methods for correcting batch effects.

We propose to use the BatchQC tool to evaluate the data for batch effects.

BatchQC provides various visualization and quantitative tools for evaluating batch effects.

We will use these and prepare a summary based on our findings.

ComBat is a widely used method that uses empirical Bayes methods for correcting batch effects.

Can you add citations throughout please?

jaclyn-taroni · 2020-04-01T14:38:38Z

analyses/batch-effects/batch-cohort-effects.R

+
+
+
+trans = function(data, Kids_First_Biospecimen_ID, val, starting_col){


I would make these function names a little more descriptive:

Suggested change

trans = function(data, Kids_First_Biospecimen_ID, val, starting_col){

transpose_expression_data = function(data, Kids_First_Biospecimen_ID, val, starting_col){

jaclyn-taroni · 2020-04-01T14:39:25Z

analyses/batch-effects/batch-cohort-effects.R

+  return(data)
+}
+
+clean = function(data, gene_id){


Suggested change

clean = function(data, gene_id){

clean_expression_data = function(data, gene_id){

I don't thinkgene_id is used.

jaclyn-taroni · 2020-04-01T14:47:22Z

analyses/batch-effects/batch-cohort-effects.R

+
+# Get correct file paths to data
+library(here)
+path = here("data", "release-v13-20200116")


Because we want to be able to run this with new releases as data gets updated, any references to data files should be to the symlinked files in data rather than to a specific release folder.

jaclyn-taroni · 2020-04-01T14:56:58Z

analyses/batch-effects/batch-effects.R

+scr_dir <- dirname("/fslhome/nmella/OpenPBTA-analysis")
+
+setwd(paste0(scr_dir, "/data/release-v13-20200116"))
+print(scr_dir)


Use rprojroot here and symlinked files in data here as well please.

jaclyn-taroni · 2020-04-01T15:00:58Z

analyses/batch-effects/batch-cohort-effects.R

+print(path)
+setwd(path)
+
+# download all covariate data which will be used to identify batches
+covariate = read_tsv("pbta-histologies.tsv",col_types = cols(molecular_subtype = "c"))
+
+# download gene expression data
+dat_rsem_polya = readRDS("pbta-gene-expression-rsem-tpm.polya.rds")
+dat_rsem_stranded = readRDS("pbta-gene-expression-rsem-tpm.stranded.rds")
+dat_kallisto_stranded <- readRDS("pbta-gene-expression-kallisto.stranded.rds")
+dat_kallisto_stranded = dat_kallisto_stranded[,2:ncol(dat_kallisto_stranded)]
+dat_kallisto_polya <- readRDS("pbta-gene-expression-kallisto.polya.rds")


Rather than using set.wd, I would do something like:

data_dir <- file.path(root_dir, "data") rsem_polya_file <- file.path(data_dir, "pbta-gene-expression-rsem-tpm.polya.rds") dat_rsem_polya <- readRDS(rsem_polya_file)

jaclyn-taroni · 2020-08-17T16:52:39Z

On the connected issue, the next steps included new pull requests #448 (comment) - the first of which was edits to the Docker image. The Docker image has been overhauled (#689) since this pull request was opened, so I think the best course of action is to close this one at this point.

natemella and others added 3 commits March 13, 2020 02:15

ready for merge request

9903bd7

Merge branch 'master' into batch

d9a328c

Update Dockerfile

4e17a02

jaclyn-taroni added 2 commits April 1, 2020 11:16

Merge branch 'master' into batch

f0e7270

Add batch correction to CI

fccb9ab

jaclyn-taroni reviewed Apr 1, 2020

View reviewed changes

Merge branch 'master' into batch

705bb13

jaclyn-taroni mentioned this pull request Apr 13, 2020

Proposed Analysis: Assess batch effects in RNA-Seq data #448

Closed

jaclyn-taroni closed this Aug 17, 2020

jaclyn-taroni mentioned this pull request Jan 21, 2021

Proposed Analysis: Batch effects in RNA-seq data #919

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ready for merge request #628

ready for merge request #628

natemella commented Mar 13, 2020 •

edited

Loading

cgreene commented Mar 13, 2020

jaclyn-taroni commented Apr 1, 2020

jaclyn-taroni left a comment

jaclyn-taroni Apr 1, 2020

jaclyn-taroni Apr 1, 2020

jaclyn-taroni Apr 1, 2020

jaclyn-taroni Apr 1, 2020

jaclyn-taroni Apr 1, 2020

jaclyn-taroni Apr 1, 2020

jaclyn-taroni Apr 1, 2020

jaclyn-taroni Apr 1, 2020

jaclyn-taroni Apr 1, 2020

jaclyn-taroni Apr 1, 2020

jaclyn-taroni commented Aug 17, 2020

		RUN R -e "BiocManager::install(c('BatchQC'))"
		RUN R -e "BiocManager::install(c('sva'))"

		\| [`tcga-capture-kit-investigation`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/tcga-capture-kit-investigation) \| `pbta-snv-lancet.vep.maf.gz` <br> `pbta-snv-mutect2.vep.maf.gz` <br> `pbta-snv-strelka2.vep.maf.gz` <br> `tcga-snv-lancet.vep.maf.gz` <br> `tcga-snv-mutect2.vep.maf.gz` <br> `tcga-snv-strelka2.vep.maf.gz` <br> `pbta-histologies.tsv` <br> `pbta-tcga-manifest.tsv` <br> `WGS.hg38.lancet.unpadded.bed` <br> `WGS.hg38.strelka2.unpadded.bed` <br> `WGS.hg38.mutect2.vardict.unpadded.bed` <br> \| Investigation of the TMB discrepancy between PBTA and TCGA data \| `results/*.bed`
		\| [`batch-effects`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/batch-effects)\| `pbta-gene-expression-rsem-fpkm.polya.rds` <br> `pbta-gene-expression-rsem-fpkm.stranded.rds` <br> `pbta-gene-expression-kallisto.polya.rds` <br> `pbta-gene-expression-kallisto.stranded.rds` \| batchQC and PCA analysis of batch affects with SVA batch correction (part of [#448](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/448)) \| N/A

		@@ -0,0 +1,75 @@
		# Batch Effect Analysis

		Batch effects are commonly observed in RNA expression data. Such effects are likely less pronounced in RNA-Seq data than microarray data. However, batch effects may bias conclusions of studies that do not account for these effects. We wish to evaluate the level to which batch effects are present in the RNA-Seq data from this study and create alternative versions of the data that have been adjusted for batch effects.




		trans = function(data, Kids_First_Biospecimen_ID, val, starting_col){

	trans = function(data, Kids_First_Biospecimen_ID, val, starting_col){
	transpose_expression_data = function(data, Kids_First_Biospecimen_ID, val, starting_col){

	clean = function(data, gene_id){
	clean_expression_data = function(data, gene_id){

ready for merge request #628

ready for merge request #628

Conversation

natemella commented Mar 13, 2020 • edited Loading

Purpose/implementation Section

What scientific question is your analysis addressing?

What was your approach?

What GitHub issue does your pull request address?

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Is there anything that you want to discuss further?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Results

What types of results are included (e.g., table, figure)?

What is your summary of the results?

Reproducibility Checklist

Documentation Checklist

cgreene commented Mar 13, 2020

jaclyn-taroni commented Apr 1, 2020

jaclyn-taroni left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jaclyn-taroni commented Aug 17, 2020

natemella commented Mar 13, 2020 •

edited

Loading