(PR 1 of 5) Fusion Analysis: filtering #92
Conversation
all.callers<-rbind(ar.total, sf.total)

#histology
clin<-read.delim(paste0(file.path("../../data/pbta-histologies.tsv")),stringsAsFactors = F,sep="\t")
Unfortunately I'm not going to be able to go through this whole script today, but this came to my attention because of #94 and I wanted to discuss it before making changes to the README.
I think this change would be equivalent to what you currently have:
clin<-read.delim("../../data/pbta-histologies.tsv",stringsAsFactors = F,sep="\t")
or if you wanted to use file.path:
clin<-read.delim(file.path("..", "..", "data", "pbta-histologies.tsv"),stringsAsFactors = F,sep="\t")
As far as I know, you don't need to treat the symlinks as any kind of special case when reading them into R. Provided bash download-data.sh ran to completion, you can use the path to the symlinks in ../../data and that will read in the equivalent file from the most recent release.
Please let me know if that works and if my explanation was helpful; it will help me formulate changes for #94.
I updated the file.path to file.path("..", "..", "data", "pbta-fusion-starfusion.tsv.gz") but it still doesn't read the file and fails CI - am I missing something?
I had the extra paste0 from the time I was running paste0("../../data/", Sys.readlink2("../../data/pbta-fusion-starfusion.tsv.gz")) to get the symbolic link target for gzfile(), which works in R 3.5.1, as does file.path("../../data/pbta-fusion-starfusion.tsv.gz").
Ah, okay, looking at the CI details: https://circleci.com/gh/AlexsLemonade/OpenPBTA-analysis/252?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link
I think I might know what's going on now that I have that context. When calling this in CI with:
./scripts/run_in_ci.sh Rscript analyses/fusion_filtering/01-fusion-filtering.R
The working directory will be the root directory of the repository, so you will not have to traverse upward to ../../data. It would be data/pbta-fusion-starfusion.tsv.gz rather than ../../data/pbta-fusion-starfusion.tsv.gz.
I went back to another PR where the Rscript was being run similarly: https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/54/files#diff-1d37e48f9ceff6d8030570cd36286a61R20 and data was used: https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/54/files#diff-ad4abc42962164e2453a997a53b3505fR49
oh yes that makes sense, thanks!
Good to know, @jaclyn-taroni!
I missed the part that the scripts need to be runnable on the test datasets in the container; some parts of this code will not work with a small sample set. Let me know if I should edit those parts so the code passes CI. Thanks!
@kgaonkar6 could you elaborate on which parts of the code won't work with a differently sized dataset? These may be opportunities to improve the generality of the script.
Oh yeah, sure! I believe it's the part of the code below that looks for genes fused more than an arbitrary number of times (5) within a histology, which is saved in the variable "total" and written to fusion_all_list_filt.rds as our "Filtered_Fusions" list. Filtered_Fusion <- (fusions found in at least 2 samples per histology | fusions called by both callers | fusions found in the transcription factor or kinase gene list | fusions with either gene fused recurrently (>=5 times) per histology) & uniquely found in the histology. The code starts from https://github.com/kgaonkar6/OpenPBTA-analysis/blob/f89849df225f391b8b27115015d2829aa557e5c3/analyses/fusion_filtering/01-fusion-filtering.R#L143. The CI failed here because the res dataframe is empty and I'm adding a column "note" to it to capture which filtering step resulted in keeping the gene.
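A minimal sketch of that recurrence check and of guarding the empty-result case that tripped CI, assuming hypothetical column names (Histology, Gene1A) on the merged calls; this is not the script's actual code:

library(dplyr)

recurrence_cutoff <- 5  # the arbitrary per-histology cutoff discussed above
total <- all.callers %>%
  group_by(Histology, Gene1A) %>%
  summarize(times_fused = n(), .groups = "drop") %>%
  filter(times_fused >= recurrence_cutoff)

# Adding a column to a zero-row data frame errors in base R
# ("replacement has 1 row, data has 0"), so check nrow() first.
if (nrow(total) > 0) {
  total$note <- "recurrently fused within histology"
}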
Hi @kgaonkar6, I wanted to ask a few general questions about the pipeline before going through this script line-by-line. There may be opportunities to revisit the structure of the pipeline in service of making it more generalizable and I want to understand the current structure better before making any further recommendations.
- In looking ahead at https://github.com/d3b-center/fusion_filtering_pipeline, I understand there to be 3 filtering steps: 1) what is contained here (filtering artifacts), 2) filtering based on low expression in the RNA-seq data, and 3) another step based on expression. What is the rationale for splitting the filtering into these three parts? Without knowing more or fully digging into the pipeline, I might expect the filtering to happen as a single step after standardization and annotation.
- Is it fair to say that there are two kinds of filtering: filtering you would want to do regardless of the specifics of a project (TF/kinase, is this called by n callers?) and filtering that is specific to this project, or more generally specific to a given project (e.g., based on histologies)?
- We can always expect that folks will have RNA-seq expression estimates of some kind because these tools call fusions from RNA-seq data; did I understand that correctly?
Thank you! I'm looking forward to learning more about this.
#IGLC6 is expressed in rnaseq
Can you tell me why you are making this substitution? This comment could be more detailed.
Yes, this specific gene name with the "@" appears only in the STARFusion calls and not in the RSEM expression matrix I had, so I had to change the gene name specifically for STARFusion to match a known gene name.
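Something like the following, where the "@"-style source symbol (shown as "IGL@") and the Gene1A column name are my assumptions for illustration, not confirmed from the script:

# Sketch: rename the "@"-style STARFusion gene symbol to the symbol used
# in the RSEM expression matrix.
sf.total$Gene1A <- gsub("IGL@", "IGLC6", sf.total$Gene1A, fixed = TRUE)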
library(tidyr)
library(R.utils)

# star fusion
Based on this code and what you stated, it seems that the first step is standardizing the file formats and based on #39 (comment), you may ultimately want to accommodate more fusion callers.
With this in mind, you could consider the first step in your pipeline to be standardizing the files, where you have a script that accepts the input file path, the caller type (e.g., Arriba), and the output file path as arguments. There would be logic in the script based on the type argument and the output would be some standardized format that the rest of your pipeline would accept. I am essentially suggesting that you break this data cleaning/standardizing step out from any kind of filtering. Though you may want to consider the read-through and discordant reads steps as part of 'data cleaning/standardizing.'
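A rough skeleton of that suggestion, assuming optparse for the arguments; the option names and the column mappings below are placeholders, not the PR's actual code:

library(optparse)

option_list <- list(
  make_option("--fusionfile", type = "character", help = "raw caller output file"),
  make_option("--caller", type = "character", help = "caller type, e.g., arriba or starfusion"),
  make_option("--outputfile", type = "character", help = "path for the standardized output")
)
opt <- parse_args(OptionParser(option_list = option_list))

raw_calls <- read.delim(opt$fusionfile, stringsAsFactors = FALSE)

# All caller-specific logic is confined to this switch; everything
# downstream of standardization sees a single format.
standard_calls <- switch(tolower(opt$caller),
  arriba = data.frame(FusionName = paste0(raw_calls$gene1, "--", raw_calls$gene2)),
  starfusion = data.frame(FusionName = raw_calls$X.FusionName),
  stop("Unsupported caller: ", opt$caller)
)

saveRDS(standard_calls, opt$outputfile)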
good idea, @jaclyn-taroni!
I agree, that's a great idea. While creating the package I did make a note to keep the first function to only standardizing the 2 callers that we have, with just if statements to run according to the caller type, but I had also included the artifact filtering in there, which as you mentioned would need to be in the filtering step. I'll rewrite the first code for the fusion filtering as you mentioned, keeping the read-through and read-based filtering in the filtering steps for the aggregated standardized fusion calls.
@jaclyn-taroni I am also adding some context on the prioritization scheme. We are modeling it after work done with the PPTC paper (Figure 1 here), but have since removed known driver gene lists and instead try to generalize high-confidence fusions using known oncogene/TSG/COSMIC gene partners, which has enabled us to capture them all without using a list of known fusions for the specific histologies in question.
For the expression filtering, there are two things: 1) completely removing fusions where each gene has <1 FPKM, and 2) comparing the expression of each gene partner vs. the median expression of that gene in the histology to determine whether that partner is up- or down-regulated relative to the whole group. The rationale was based on what pedcbio does with oncoprints when displaying up/down-regulated gene expression: they compare within the group of interest. We are going to change this, though, to compare fusion gene partner expression vs. GTEx normal expression (still up for discussion). As you can imagine, there may be a common fusion in a histology whose kinase partner, for example, is upregulated; comparing the sample of interest against the histology may not show the large difference in expression simply because many samples share this high expression, leaving an inability to see an "up"- or "down"-regulated gene.
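To make the two checks concrete, here is a toy sketch with made-up data and column names (the real cutoffs and comparisons are as described above):

library(dplyr)

# Toy stand-in for annotated fusion calls with per-partner expression.
fusions <- data.frame(
  Histology   = c("HGG", "HGG", "LGG"),
  FPKM_Gene1A = c(5.0, 0.2, 8.1),
  FPKM_Gene1B = c(2.3, 4.0, 1.5)
)

# 1) Drop fusions whose partners fall below the 1 FPKM cutoff
#    (shown here as: keep only fusions where both partners are expressed).
expressed <- fusions %>%
  filter(FPKM_Gene1A >= 1, FPKM_Gene1B >= 1)

# 2) Flag each partner as up/down relative to the histology-wide median.
expressed <- expressed %>%
  group_by(Histology) %>%
  mutate(gene1A_status = ifelse(FPKM_Gene1A > median(FPKM_Gene1A), "up", "down")) %>%
  ungroup()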
Thanks for the link to the presentation @jharenza, that's helpful. Is it fair to say that 1) is a general step, not specific to any project, and 2) is a more project-specific step?
Yes, that's what I would think too.
@kgaonkar6 I'm sorry I forgot to ask this as part of my general questions about the pipeline -- what is the difference between the annotation steps in the 01-fusion-filtering.R and 04-annotate-filtered-fusion.R scripts?
Currently, 01-fusion-filtering.R performs the genelist-based filtering to capture all fusions in genes that are tumor suppressors/oncogenic/COSMIC/TCGA etc. and saves them as putative_driver_fusion.rds; additionally, it identifies recurrent fusions using the n callers, n samples per group, and n times a fused gene is recurrently observed in a group, and saves them as filtered_fusion.rds. 04-annotate-filtered-fusion.R only annotates the final fusions that passed the expression filtering with the genelists (TSGs, oncogenes, TCGA fusions, TF, kinase) to save as final annotated files. Additionally, I also had quite a few substitutions in the script to convert old/synonym gene names to HUGO-approved gene names.
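As a toy illustration of the genelist annotation step (the lists and column names below are made-up stand-ins, not the curated lists the pipeline reads):

oncogenes <- c("BRAF", "MYCN")  # stand-in for the real Oncogene list
tsgs      <- c("TP53", "NF1")   # stand-in for the real TSG list

fusions <- data.frame(Gene1A = c("BRAF", "TP53", "ABC1"))
fusions$Gene1A_anno <- ifelse(fusions$Gene1A %in% oncogenes, "Oncogene",
                       ifelse(fusions$Gene1A %in% tsgs, "TumorSuppressor", NA))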
@kgaonkar6 @jharenza thank you for answering those questions - it's been very helpful for me. I sketched out my thoughts on the structure and order of the pipeline if the goal is to reuse it broadly: I would make every step except the project-specific step in the red box as general as possible, where each of the inputs is standardized in some way and the scripts accept any cutoffs as command line arguments or options like we've discussed. So, you might prefer to accept the expression estimates in a different format (e.g., not RDS, use certain gene identifiers, etc.) than what is available as part of the OpenPBTA data and then, as part of this analysis module, perform some data cleaning step(s) to get it into the correct format for the general pipeline. (This has the added benefit of being a public example of how to get things in the correct format for your pipeline.) There also may be some things I've missed that are prohibitive. I wanted to put this out there to get your thoughts and see if this is helpful -- it will help guide me in the rest of the reviews. You have already tackled some of this with the addition of
thanks @jaclyn-taroni! This looks good, but I think the red box is not entirely project-specific - there is filtering for artifacts in addition to filtering based on the project. Maybe we should have a phone call about it, since there are so many steps, so we can hash out what can go in what order? We did a lot of editing along the way, mainly because when looking at many iterations of results we saw X artifact, or Y duplicate, or Z was missing. So I do worry a bit about changing the order of everything, but agree that we should standardize as much as possible so that if people have only 1 histology, for example, they can run it vs. if they have many, or if they just want to run the filtering+annotation, they can also do so. So maybe we split filtering into artifact removal+filtering, move annotation before filtering, and then add plotting?
Actually, @kgaonkar6 - why don't we make a list of every step of the pipeline post-standardizing caller output and mark each as general or project-specific?
Sounds good, that's a great idea @jharenza! I was just writing back to say that I believe we are in agreement that the filtering needs to be split up; it is probably a matter of how. We are definitely on the same page on this point:
Great idea! I've drafted a filtering table here: https://docs.google.com/spreadsheets/d/1fKvDkkGddyeQ-_BHlE06XfMz82SC_nAPo7zwximxJRo/edit#gid=0. The filtering criteria have values general or project-specific from what we have discussed above. I also added the Output-type column to show the resulting type of dataset and some more description. @jaclyn-taroni thanks for pointing out that 01-fusion-filtering.R was still in the PR. I've removed that now.
@kgaonkar6 thanks for putting that information together! Can you change the access settings for the sheet please?
Let me know if you have access now, thanks!
Thanks @kgaonkar6! I added some notes with additional details as well, so @jaclyn-taroni, please let us know if you have any questions about the steps!
Thank you @kgaonkar6. In my proposed structure, I noted two filtering steps - one that is general and one that is project-specific. Wondering if the following is possible -- could you put all the steps noted as general in the spreadsheet in a single script:
And these steps, marked as project-specific, in a different script/set of scripts/notebook(s) that takes the output from the above general script:
Above, @jharenza proposed:
I believe splitting things as outlined above meets part of this. What I'm not clear on is the rationale for moving annotation before filtering. If I understand the current pipeline in the d3b-center/fusion_filtering_pipeline repository and @kgaonkar6's comment here #92 (comment), the last two steps are annotation and plotting. It's possible that the thinking on this has changed since putting together the table, but I wanted to check in to make sure I understand.
Thank you @jaclyn-taroni! This makes sense. I'll gather the filtering steps and write a script that only does general filtering, using the standardized output from the 01-fusion-standardization.R script.
Hi @kgaonkar6, I'm glad we were able to get some more details about the structure of the pipeline hashed out! Thanks for your patience and willingness to walk me through things, I really appreciate it.
The documentation for 01-fusion-standardization.R looks pretty good. I've left a few questions and there are a few changes I would like to see made before merging:
- Making sure outputfile is included in the documentation
- Moving the column selection/filtering for standard_calls outside the logic and steps that are specific to a particular caller
- Some error-handling if someone passes a caller that isn't (currently) supported
  - Optionally, you might also consider forcing all the characters in the caller argument to uppercase or lowercase so STARFusion vs. STArfusion vs. STARfusion doesn't cause any headaches.
- Removing any steps you don't need (e.g., steps around breakpoints)
- A comment explaining the JunctionReadCount standardization for Arriba
I've also included some comments that are more for our discussion/information only, such as how one could do something using dplyr. Those are not things that need to get changed unless you'd prefer to change them.
Please let me know if anything needs clarification. Thank you!
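For the error-handling and case-folding items, something along these lines would work (a sketch; the supported-caller vector is my assumption):

supported_callers <- c("arriba", "starfusion")
caller <- tolower(opt$caller)  # fold case so STARFusion/STARfusion/STArfusion all match
if (!(caller %in% supported_callers)) {
  stop("Caller '", opt$caller, "' is not supported; expected one of: ",
       paste(supported_callers, collapse = ", "))
}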
opt <- parse_args(OptionParser(option_list=option_list))

inputfile<-opt$fusionfile
Suggested change:
inputfile <- opt$fusionfile
I find code to be much easier to read and understand if there are spaces around assignment operators and after commas.
Ah yup, makes sense - I'll keep that in mind for assignments. Maybe this is a dumb question, but does it matter in the review process if I commit the suggestion here, or should it be OK to fix it with all the other changes in the script?
Not a dumb question at all. Took me a while to figure it out. Overall the answer is no, it does not matter. Here's how it does make a difference as far as I can tell:
- If you do this from the user interface, you can add suggestions to a batch and then commit them in a single commit. This would also give me credit for the commit (which is not something I care about when using this feature, I just find it convenient). The other thing that is sometimes annoying is that if you have larger changes to make after committing the suggestions, you'll have to pull those before making changes locally.
- I've found that the downside of doing everything, including the suggestions, locally is that sometimes I miss things, but some folks like using the resolving-conversations feature to help with this.
Oh yes, all credit for the edits goes to you and @jharenza :) Thanks! Since I had to make a bunch of edits I think I'll use the resolve button to keep track of editing; hopefully I didn't miss any in this commit.
These changes look good! The comments are really nice. This is very close, two things I'd like to see tweaked before merging:
- Using either saveRDS or write.table to save the standardized output, but not both. I'd also include the expected file extension/file type in the documentation around outputfile.
- There's something off about the indentation throughout. In my personal experience, issues with indentation can sometimes happen if I switch between different text editors during my editing process. I've put a couple specific comments about how I'd expect some of the lines to look. If you are editing in RStudio you can select all and then use Ctrl + I (or Cmd + I on a Mac) to automatically fix the indentation for you. One of my favorite RStudio features :)
I've added some dplyr alternatives as comments as well. You don't have to make these changes, they are for your information if you are interested in adopting more dplyr into your workflows.
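As an example of the kind of dplyr alternative being suggested (toy data and column names, not the script's):

library(dplyr)

calls <- data.frame(
  Sample            = c("s1", "s2", "s3"),
  FusionName        = c("A--B", "C--D", "E--F"),
  JunctionReadCount = c(10, 3, 7)
)

# A readable pipeline-style filter-and-sort instead of nested base R calls.
calls %>%
  filter(JunctionReadCount >= 5) %>%
  arrange(desc(JunctionReadCount))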
You caught me on changing editors in between edits :D I was looking into the Atom editor, which I've been told has great features and packages for writing beautiful code. I'm just beginning to learn how to do things in Atom, so I'll stick to RStudio and use the Cmd + I feature. I'll also implement only 1 outputfile and the dplyr alternatives as suggested to make the code more readable. Thanks!
I was wondering if there should be a step after standardization and before filtering to merge the standardized fusion calls from STARFusion and Arriba? Or should the merge be included in 01-fusion-standardization?
Do you exclusively use merged files downstream of this step @kgaonkar6?
Yes, all the filtering is done on merged files.
I would say let's leave this as is and then add a merge step before filtering. We can always revisit, but I think it would be helpful for me to have eyes on the code that is exclusively for the merge step before commenting further.
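A minimal sketch of that merge step, assuming the two standardized outputs share one column layout and live at the scratch paths listed in the PR description (the merged output filename is a placeholder):

sf_standard     <- read.delim(file.path("..", "..", "scratch", "star.tsv"),
                              stringsAsFactors = FALSE)
arriba_standard <- read.delim(file.path("..", "..", "scratch", "arriba.tsv"),
                              stringsAsFactors = FALSE)

# Stack the two callers into one table for the downstream filtering steps.
all_callers <- rbind(sf_standard, arriba_standard)
saveRDS(all_callers, file.path("..", "..", "scratch", "standard-fusion-calls.rds"))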
Looks good to me 👍 I will fix the conflicts and then merge. Great work on these changes @kgaonkar6! Looking forward to seeing the next steps.
Thank you so much! This has been so helpful to me, especially to understand how to write good code :)
Purpose/implementation
_What scientific question is your analysis addressing? Run to standardize fusion calls from both the STARFusion and Arriba callers and filter for biologically relevant genes and recurrent status.
What was your approach?
Standardize caller format
- Added columns for Arriba:
  - SpaninngReadCount in arriba <- split_read1 + split_read2
  - DiscordantFragmentCount in arriba <- discordant_mates
  - Also added annotation from FusionAnnotator [column 27]
  - TODO: change input column name
- Added columns for STARFusion:
  - Confidence is set to NA
Expected outputs: RDS file per caller, standardized for the pipeline:
- ../../scratch/star.tsv
- ../../scratch/arriba.tsv
If this is not adding an analysis, describe your changes in this section._
Issue
What GitHub issue does your pull request address?
Planned Analysis: Filter and Annotate Fusions #39
Directions for reviewers
_Tell potential reviewers what kind of feedback you are soliciting.
Are there particular areas that need a closer look? Code optimization, filtering strategy
Is there something you want to discuss further?
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?_
Results
If your pull request includes code that produces scientific results, please summarize the results here.
This can help facilitate discussion around interpretation.
Please state what kinds of results are included (e.g., table, figure).
Docker and continuous integration
Check all those that apply or remove this section if it is not applicable.
#Resolved in previous commit
@jaclyn-taroni Hi! I checked that the packages are available in the RStudio instance running the Docker image on http://localhost:8787/, but Sys.readlink2 didn't work in the CI checks for the script. I updated to file.path, which I saw is used in other pull requests, but that also failed CI. Both options run OK in the script on my laptop. I don't know how to get it to pass CI correctly.
Parts to this analysis