Hi @natemella - if you could paste text instead of images I'd quote tweet, but your answer to this one remains an ongoing question:
You've batch-effect corrected for different datasets, but we suspect those may be confounded with disease type. This means your attempts to remove unwanted variability might actually induce other confounded variability. If you wanted to check this, you'd pick one or a few dominant histologies from those centers, compare them, and at least see that:
Without doing this analysis, I don't think there's any way to know if this correction is helping or hurting.
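As a rough illustration of the first step in such a check, the sketch below cross-tabulates a batch variable against histology to find histologies that appear in more than one batch, since only those can be compared across batches. The data frame, the column names `cohort` and `histology`, and all values are made up for illustration; they are not taken from `pbta-histologies.tsv`.

```r
# Hypothetical sample sheet: `cohort` stands in for the batch variable and
# `histology` for disease type. All values are invented for this sketch.
samples <- data.frame(
  cohort = c("A", "A", "A", "B", "B", "B", "B", "A"),
  histology = c("LGAT", "LGAT", "Medulloblastoma", "Ependymoma",
                "Ependymoma", "Ependymoma", "LGAT", "LGAT")
)

# Cross-tabulate batch against histology. A histology that occurs in only
# one batch is fully confounded with that batch.
batch_by_histology <- table(samples$cohort, samples$histology)
print(batch_by_histology)

# Histologies observed in more than one batch are the only ones where a
# within-histology, across-batch comparison is possible.
shared_histologies <- colnames(batch_by_histology)[
  colSums(batch_by_histology > 0) > 1
]
print(shared_histologies)
```

In this toy table only one histology is shared between the two batches, which is the kind of structure that would make it hard to tell a batch effect from a disease effect.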
I am going to resolve conflicts and add this analysis to continuous integration.
Hi @natemella,
Thanks for adding this! As you address the high-level comments from @cgreene (#628 (comment)), I figured I would take a look at the code you are requesting to add here and see if there are adjustments we could make to smooth the way for the confounded by disease question.
In general, I think it would be helpful to take an approach where you are writing functions that are very general that can be sourced in all of the scripts or notebooks where you are looking at different "batches." These R files could live in `analyses/batch-effects/util/`.
I also added some specific comments about things like the handling of directories and file paths. `rprojroot` is very handy when you will be sourcing files in scripts because you want to make sure that your scripts that source files can be run from anywhere from within the project and they'll always be able to "find" the files they need to source. (`here` is probably helpful in the exact same way, but `rprojroot` allows you to specify the criterion used to determine the root of the project.)
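A minimal sketch of that pattern follows. It builds a throwaway project in a temporary directory so it can run anywhere; the `.git` directory used as the root criterion and the helper file name `expression_helpers.R` are assumptions for illustration, not files from this pull request.

```r
library(rprojroot)

# Build a throwaway project so the example is self-contained. The ".git"
# marker and the file name "expression_helpers.R" are placeholders.
project_dir <- file.path(tempdir(), "example-project")
util_dir <- file.path(project_dir, "analyses", "batch-effects", "util")
dir.create(file.path(project_dir, ".git"), recursive = TRUE, showWarnings = FALSE)
dir.create(util_dir, recursive = TRUE, showWarnings = FALSE)
writeLines("add_one <- function(x) x + 1",
           file.path(util_dir, "expression_helpers.R"))

# The criterion states what marks the project root; here, a ".git" directory.
# find_root() walks upward from `path` until the criterion is satisfied.
root_dir <- find_root(has_dir(".git"), path = util_dir)

# Paths built from root_dir do not depend on the working directory.
source(file.path(root_dir, "analyses", "batch-effects", "util",
                 "expression_helpers.R"))
print(add_one(1))
```

In a real script the `path` argument would usually be left at its default (the current directory), so the same call works from any folder inside the repository.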
Because this is a multi-step analysis, I think we will want to break this up into multiple pull requests per the contributing guidelines – a good rule of thumb is 1 analysis script per pull request or, if you need to add files that will be sourced or dependencies to the Dockerfile or the shell script, aim for 400 lines maximum. Limiting the size of pull requests ensures thorough review. We can leave this pull request open for now as a sort of guide as we break things up.
A good first pull request that incorporates some of the feedback here might contain: 1) the functions that will be used throughout the analysis module 2) the first script that uses those functions 3) the required Dockerfile changes 4) adding this to CI (`.circleci/config.yml`; instructions here) 5) if you'd like you can add the beginnings of the shell script to run the entire module (adding it at the end vs. as you go is a bit of a matter of personal preference).
I added some specific comments about documentation, but I think those can wait until you file a pull request focused on documentation.
Thanks again! Please let us know if you have any questions about our comments or if you need a hand breaking things up!
# packages needed for batch-effects-analysis
RUN R -e "BiocManager::install(c('BatchQC'))"
RUN R -e "BiocManager::install(c('sva'))"
RUN R -e "install.packages('here', dependencies = TRUE)"
We use `rprojroot` throughout the project. I'll also add specific comments about where and how to use it in what you are adding here.
RUN R -e "BiocManager::install(c('BatchQC'))"
RUN R -e "BiocManager::install(c('sva'))"
- RUN R -e "BiocManager::install(c('BatchQC'))"
- RUN R -e "BiocManager::install(c('sva'))"
+ RUN R -e "BiocManager::install(c('BatchQC', 'sva'))"
| [`tcga-capture-kit-investigation`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/tcga-capture-kit-investigation) | `pbta-snv-lancet.vep.maf.gz` <br> `pbta-snv-mutect2.vep.maf.gz` <br> `pbta-snv-strelka2.vep.maf.gz` <br> `tcga-snv-lancet.vep.maf.gz` <br> `tcga-snv-mutect2.vep.maf.gz` <br> `tcga-snv-strelka2.vep.maf.gz` <br> `pbta-histologies.tsv` <br> `pbta-tcga-manifest.tsv` <br> `WGS.hg38.lancet.unpadded.bed` <br> `WGS.hg38.strelka2.unpadded.bed` <br> `WGS.hg38.mutect2.vardict.unpadded.bed` <br> | Investigation of the TMB discrepancy between PBTA and TCGA data | `results/*.bed` |
| [`batch-effects`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/batch-effects)| `pbta-gene-expression-rsem-fpkm.polya.rds` <br> `pbta-gene-expression-rsem-fpkm.stranded.rds` <br> `pbta-gene-expression-kallisto.polya.rds` <br> `pbta-gene-expression-kallisto.stranded.rds` | BatchQC and PCA analysis of batch effects with SVA batch correction (part of [#448](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/448)) | N/A |
Can you add this where it would be displayed on the `analyses` page in GitHub? For this particular add, I believe this would be the first entry in the table.
@@ -0,0 +1,75 @@
# Batch Effect Analysis

Batch effects are commonly observed in RNA expression data. Such effects are likely less pronounced in RNA-Seq data than microarray data. However, batch effects may bias conclusions of studies that do not account for these effects. We wish to evaluate the level to which batch effects are present in the RNA-Seq data from this study and create alternative versions of the data that have been adjusted for batch effects.
Following one sentence per line for Markdown documents that are under version control is helpful for tracking and also makes review a bit easier, because a reviewer is able to comment on a specific sentence more easily.
- Batch effects are commonly observed in RNA expression data. Such effects are likely less pronounced in RNA-Seq data than microarray data. However, batch effects may bias conclusions of studies that do not account for these effects. We wish to evaluate the level to which batch effects are present in the RNA-Seq data from this study and create alternative versions of the data that have been adjusted for batch effects.
+ Batch effects are commonly observed in RNA expression data.
+ Such effects are likely less pronounced in RNA-Seq data than microarray data.
+ However, batch effects may bias conclusions of studies that do not account for these effects.
+ We wish to evaluate the level to which batch effects are present in the RNA-Seq data from this study and create alternative versions of the data that have been adjusted for batch effects.
| | | - [kallisto_stranded_sequence_combat_pca.pdf](https://github.com/AlexsLemonade/OpenPBTA-analysis/files/4325816/kallisto_stranded_sequence_combat_pca.pdf) | | | |
| | | | | | |

We propose to use the BatchQC tool to evaluate the data for batch effects. BatchQC provides various visualization and quantitative tools for evaluating batch effects. We will use these and prepare a summary based on our findings. ComBat is a widely used method that uses empirical Bayes methods for correcting batch effects.
- We propose to use the BatchQC tool to evaluate the data for batch effects. BatchQC provides various visualization and quantitative tools for evaluating batch effects. We will use these and prepare a summary based on our findings. ComBat is a widely used method that uses empirical Bayes methods for correcting batch effects.
+ We propose to use the BatchQC tool to evaluate the data for batch effects.
+ BatchQC provides various visualization and quantitative tools for evaluating batch effects.
+ We will use these and prepare a summary based on our findings.
+ ComBat is a widely used method that uses empirical Bayes methods for correcting batch effects.
Can you add citations throughout please?

trans = function(data, Kids_First_Biospecimen_ID, val, starting_col){
I would make these function names a little more descriptive:
- trans = function(data, Kids_First_Biospecimen_ID, val, starting_col){
+ transpose_expression_data = function(data, Kids_First_Biospecimen_ID, val, starting_col){
return(data)
}

clean = function(data, gene_id){
- clean = function(data, gene_id){
+ clean_expression_data = function(data, gene_id){
I don't think `gene_id` is used.

# Get correct file paths to data
library(here)
path = here("data", "release-v13-20200116")
Because we want to be able to run this with new releases as data gets updated, any references to data files should be to the symlinked files in `data` rather than to a specific release folder.
scr_dir <- dirname("/fslhome/nmella/OpenPBTA-analysis")

setwd(paste0(scr_dir, "/data/release-v13-20200116"))
print(scr_dir)
Use `rprojroot` here and symlinked files in `data` here as well please.
print(path)
setwd(path)

# download all covariate data which will be used to identify batches
covariate = read_tsv("pbta-histologies.tsv",col_types = cols(molecular_subtype = "c"))

# download gene expression data
dat_rsem_polya = readRDS("pbta-gene-expression-rsem-tpm.polya.rds")
dat_rsem_stranded = readRDS("pbta-gene-expression-rsem-tpm.stranded.rds")
dat_kallisto_stranded <- readRDS("pbta-gene-expression-kallisto.stranded.rds")
dat_kallisto_stranded = dat_kallisto_stranded[,2:ncol(dat_kallisto_stranded)]
dat_kallisto_polya <- readRDS("pbta-gene-expression-kallisto.polya.rds")
Rather than using `setwd()`, I would do something like:
data_dir <- file.path(root_dir, "data")
rsem_polya_file <- file.path(data_dir, "pbta-gene-expression-rsem-tpm.polya.rds")
dat_rsem_polya <- readRDS(rsem_polya_file)
On the connected issue, the next steps included new pull requests #448 (comment) - the first of which was edits to the Docker image. The Docker image has been overhauled (#689) since this pull request was opened, so I think the best course of action is to close this one at this point.
Purpose/implementation Section
What scientific question is your analysis addressing?
What was your approach?
What GitHub issue does your pull request address?
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
Is there anything that you want to discuss further?
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Results
What types of results are included (e.g., table, figure)?
What is your summary of the results?
We tested for batch effects in three different areas:
Within each of these analyses I evaluated RNA-Seq data that had been prepared using either RSEM or Kallisto (TPM) values.
The null hypothesis is that there are no batch effects in the data.
After running ComBat, it appears that the batch effects are successfully eliminated. Attached are the two graphs for each section: 1) the distribution of p-values and 2) the PCA plot generated from BatchQC.