This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

(PR 1 of 5) Fusion Analysis: filtering #92

Merged: 21 commits, Sep 16, 2019

Conversation

kgaonkar6
Collaborator

@kgaonkar6 kgaonkar6 commented Aug 30, 2019

Purpose/implementation

What scientific question is your analysis addressing? Standardize fusion calls from both the STAR-Fusion and Arriba callers and filter for biologically relevant genes and recurrence status.

What was your approach?

Standardize caller format

  1. Added columns for Arriba:
     SpanningReadCount in arriba <- split_read1 + split_read2
     DiscordantFragmentCount in arriba <- discordant_mates
     Also added annotation from FusionAnnotator (column 27).
     TODO: change input column name.

  2. Added columns for STAR-Fusion:
     Confidence is set to NA.

Expected outputs: a standardized file per caller for the pipeline:
../../scratch/star.tsv
../../scratch/arriba.tsv
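
The column additions above can be sketched in R; this is a minimal illustration, with a toy data frame standing in for a real Arriba calls table (the column names split_read1, split_read2, and discordant_mates come from Arriba's output):

```r
# Toy stand-in for a raw Arriba calls table
arriba <- data.frame(split_read1 = c(3, 0),
                     split_read2 = c(2, 1),
                     discordant_mates = c(4, 2))

# Standardize the read-support columns as described above
arriba$SpanningReadCount <- arriba$split_read1 + arriba$split_read2
arriba$DiscordantFragmentCount <- arriba$discordant_mates
```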


Issue

What GitHub issue does your pull request address?
Planned Analysis: Filter and Annotate Fusions #39

Directions for reviewers

Tell potential reviewers what kind of feedback you are soliciting.
Are there particular areas that need a closer look? Code optimization, filtering strategy.

Is there something you want to discuss further?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Results

If your pull request includes code that produces scientific results, please summarize the results here.
This can help facilitate discussion around interpretation.
Please state what kinds of results are included (e.g., table, figure).

Docker and continuous integration

Check all those that apply or remove this section if it is not applicable.

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.

#Resolved in previous commit. @jaclyn-taroni Hi! I checked that the packages are available in the RStudio session running in Docker at http://localhost:8787/, but Sys.readlink2 didn't work in the CI checks for the script. I updated to file.path, which I saw used in other pull requests, but that also failed CI. Both options run fine in the script on my laptop. I don't know how to get it to pass CI correctly.

  • This analysis has been added to continuous integration.

Parts to this analysis

  • Fusion standardization
  • Fusion filtering
  • Putative driver fusion expression filtering
  • Filtered-fusion expression filtering
  • Annotation of final calls
  • Plot recurrent fusion calls

@jharenza jharenza changed the title Fusion filtering (PR 1 of 5) Fusion Analysis: filtering Aug 30, 2019
all.callers<-rbind(ar.total, sf.total)

#histology
clin<-read.delim(paste0(file.path("../../data/pbta-histologies.tsv")),stringsAsFactors = F,sep="\t")
Member

Unfortunately I'm not going to be able to go through this whole script today, but this came to my attention because of #94 and wanted to discuss before making changes to the README.

I think this change would be equivalent to what you currently have:

Suggested change
clin<-read.delim(paste0(file.path("../../data/pbta-histologies.tsv")),stringsAsFactors = F,sep="\t")
clin<-read.delim("../../data/pbta-histologies.tsv",stringsAsFactors = F,sep="\t")

or if you wanted to use file.path:

Suggested change
clin<-read.delim(paste0(file.path("../../data/pbta-histologies.tsv")),stringsAsFactors = F,sep="\t")
clin<-read.delim(file.path("..", "..", "data", "pbta-histologies.tsv"),stringsAsFactors = F,sep="\t")

As far as I know, you don't need to treat the symlinks as any kind of special case when reading them into R. Provided bash download-data.sh ran to completion, you can use the path to the symlinks in ../../data and that will read in the equivalent file from the most recent release.

Member

Please let me know if that works and if my explanation was helpful, it will help me formulate changes for #94.

Collaborator Author

@kgaonkar6 kgaonkar6 Aug 30, 2019

I updated the file.path to file.path("..", "..", "data", "pbta-fusion-starfusion.tsv.gz") but it still doesn't read the file and fails CI; am I missing something?
I had the extra paste0 from the time I was running paste0("../../data/", Sys.readlink2("../../data/pbta-fusion-starfusion.tsv.gz")) to get the symbolic link for gzfile(), which works in R 3.5.1, as does file.path("../../data/pbta-fusion-starfusion.tsv.gz").

Member

Ah, okay looking at the CI details: https://circleci.com/gh/AlexsLemonade/OpenPBTA-analysis/252?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

I think I might know what's going on now that I have that context. When calling this in CI with:

./scripts/run_in_ci.sh Rscript analyses/fusion_filtering/01-fusion-filtering.R

The working directory will be the root directory of the repository and you will not have to traverse upward to ../../data. It would be data/pbta-fusion-starfusion.tsv.gz rather than ../../data/pbta-fusion-starfusion.tsv.gz.

I went back to another PR where the Rscript was being run similarly: https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/54/files#diff-1d37e48f9ceff6d8030570cd36286a61R20
and data was used: https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/54/files#diff-ad4abc42962164e2453a997a53b3505fR49
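
One way to make the script tolerant of both working directories (the repo root when CI runs it vs. the analysis directory locally) is to fall back between the two relative paths; this is an illustrative sketch only, not what the PR ended up doing:

```r
# Prefer the repo-root-relative path (used when CI runs the script),
# falling back to the analysis-directory-relative path for local runs
data_file <- file.path("data", "pbta-fusion-starfusion.tsv.gz")
if (!file.exists(data_file)) {
  data_file <- file.path("..", "..", "data", "pbta-fusion-starfusion.tsv.gz")
}
```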

Collaborator Author

oh yes that makes sense, thanks!

@jharenza
Collaborator

Good to know, @jaclyn-taroni !

@kgaonkar6
Collaborator Author

I missed the part that the scripts need to be runnable on the test datasets in the container; some parts of this code will not work with a small sample set. Let me know if I should edit those parts to make the code pass CI. Thanks!

@cgreene
Collaborator

cgreene commented Aug 30, 2019

@kgaonkar6 could you elaborate on which parts of the code won't work with a differently sized dataset? These may be opportunities to improve the generality of the script.

@kgaonkar6
Collaborator Author

kgaonkar6 commented Aug 30, 2019

Oh yeah, sure!

I believe it's the part of the code below, which looks for genes that are fused more than (an arbitrary number of) 5 times within a histology, to save in the variable "total" that is written to fusion_all_list_filt.rds as our "Filtered_Fusions" list.

Filtered_Fusion <- (fusions found in at least 2 samples per histology | fusions called by both callers | fusions found in transcription factor or kinase gene list | fusions with either gene fused recurrently (>=5 times) per histology) & uniquely found in histology

Code starts from https://github.com/kgaonkar6/OpenPBTA-analysis/blob/f89849df225f391b8b27115015d2829aa557e5c3/analyses/fusion_filtering/01-fusion-filtering.R#L143

The CI failed here because the res dataframe is empty, and I'm adding a column "note" to it to capture which filtering step resulted in keeping the gene:

https://github.com/kgaonkar6/OpenPBTA-analysis/blob/f89849df225f391b8b27115015d2829aa557e5c3/analyses/fusion_filtering/01-fusion-filtering.R#L161
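
The Filtered_Fusion logic quoted above can be roughly sketched with dplyr; every column name here (Sample, Caller, histology, in_tf_kinase) is an assumed placeholder rather than the script's real schema, and the >=5 recurrence criterion is omitted for brevity:

```r
library(dplyr)

# Toy standardized calls table (placeholder schema)
calls <- data.frame(
  FusionName   = c("A--B", "A--B", "C--D"),
  Sample       = c("s1", "s2", "s1"),
  Caller       = c("arriba", "starfusion", "arriba"),
  histology    = c("HGG", "HGG", "LGG"),
  in_tf_kinase = c(FALSE, FALSE, TRUE)
)

filtered <- calls %>%
  group_by(FusionName, histology) %>%
  mutate(n_samples = n_distinct(Sample),     # samples supporting the fusion
         n_callers = n_distinct(Caller)) %>% # callers agreeing on it
  ungroup() %>%
  filter(n_samples >= 2 | n_callers >= 2 | in_tf_kinase)
```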

Member

@jaclyn-taroni jaclyn-taroni left a comment

Hi @kgaonkar6, I wanted to ask a few general questions about the pipeline before going through this script line-by-line. There may be opportunities to revisit the structure of the pipeline in service of making it more generalizable and I want to understand the current structure better before making any further recommendations.

  1. In looking ahead at https://github.com/d3b-center/fusion_filtering_pipeline, I understand there to be 3 filtering steps: 1) what is contained here (filtering artifacts) 2) filtering based on low expression in the RNA-seq data and 3) another step based on expression. What is the rationale for splitting up the filtering steps into these three parts? Without knowing more information or fully digging into the pipeline, I might expect that the filtering would happen as a single step after standardization and annotation.
  2. Is it fair to say that there are two kinds of filtering: filtering that you would want to do regardless of the specifics of a project (TF/kinase, is this called by n callers?) and filtering that is specific to this project or more generally specific to a project (e.g., based on histologies)?
  3. We can always expect that folks will have RNA-seq expression estimates of some kind because these tools call fusions from RNA-seq data, did I understand that correctly?

Thank you! I'm looking forward to learning more about this.

analyses/fusion_filtering/01-fusion-filtering.R (outdated review thread, resolved)



#IGLC6 is expressed in rnaseq
Member

Can you tell me why you are making this substitution? This comment could be more detailed.

Collaborator Author

Yes, this specific gene name with the "@" is only in the STARFusion calls and not in the RSEM expression matrix I had, so I had to change the gene name specifically for STARFusion to match a known gene name.
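
A targeted rename like the one described might look like this in R; the old symbol here is hypothetical, the point being that the caller-specific "@" name is replaced so it matches the expression matrix:

```r
# Hypothetical STARFusion gene symbols; "IGL@" stands in for the "@" name
genes <- c("IGL@", "TP53")
genes <- gsub("IGL@", "IGLC6", genes, fixed = TRUE)  # fixed = TRUE: literal match
```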

library(tidyr)
library(R.utils)

# star fusion
Member

Based on this code and what you stated, it seems that the first step is standardizing the file formats and based on #39 (comment), you may ultimately want to accommodate more fusion callers.

With this in mind, you could consider the first step in your pipeline to be standardizing the files, where you have a script that accepts the input file path, the caller type (e.g., Arriba), and the output file path as arguments. There would be logic in the script based on the type argument and the output would be some standardized format that the rest of your pipeline would accept. I am essentially suggesting that you break this data cleaning/standardizing step out from any kind of filtering. Though you may want to consider the read-through and discordant reads steps as part of 'data cleaning/standardizing.'
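
A skeleton of that suggested interface, using optparse (which the PR adopts later); the argument names and the function wrapper are illustrative, not the script's actual code:

```r
library(optparse)

option_list <- list(
  make_option("--fusionfile", type = "character", help = "raw caller output"),
  make_option("--caller",     type = "character", help = "arriba or starfusion"),
  make_option("--outputfile", type = "character", help = "standardized output path")
)

standardize_calls <- function(fusionfile, caller, outputfile) {
  caller <- tolower(caller)
  if (caller == "arriba") {
    # map Arriba columns to the standard schema
  } else if (caller == "starfusion") {
    # map STAR-Fusion columns to the standard schema
  } else {
    stop("Unsupported caller: ", caller)
  }
  # write the standardized table to outputfile
}
```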

Collaborator

good idea, @jaclyn-taroni !

Collaborator Author

@kgaonkar6 kgaonkar6 Sep 10, 2019

I agree, that's a great idea. While creating the package I did make a note to keep the first function to only standardizing the 2 callers we have, with just if statements to run according to the caller type, but I had also included the artifact filtering there, which, as you mentioned, would need to be in the filtering step. I'll rewrite the first code for the fusion filtering as you mentioned, keeping the readthrough and read-based filtering in the filtering steps for the aggregated standardized fusion calls.

Collaborator

@jharenza jharenza left a comment

  1. https://github.com/d3b-center/fusion_filtering_pipeline

@jaclyn-taroni I am also adding this for some context on the prioritization scheme. We are modeling it after work done with the PPTC paper (Figure 1 here), but have since removed known driver gene lists and instead try to generalize high-confidence fusions using known oncogene/TSG/COSMIC gene partners, which has enabled us to capture them all without using a list of known fusions for the specific histologies in question.

For the expression filtering there are two things: 1) completely removing fusions where either gene has <1 FPKM, and 2) comparing the expression of each gene partner vs the median expression of that gene in the histology to determine if that gene partner is up- or down-regulated relative to the whole group. The rationale was really based on what pedcbio does with oncoprints when they display up/down-regulated gene expression - they compare within the group of interest. We are going to change this, though, to a comparison of fusion gene partner expression vs GTEx normal expression (still up for discussion), because, as you can imagine, there may be a common fusion in a histology where the kinase partner, for example, is upregulated; doing the comparison between the sample of interest and the histology may then not show a large difference in expression, simply because many samples share this high expression, leaving an inability to see an "up"- or "down"-regulated gene.
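
The two expression filters described can be roughly sketched as follows; the column names are placeholders, and the within-histology comparison is simplified to a single gene partner for illustration:

```r
library(dplyr)

# Toy fusion table with placeholder expression columns
fusions <- data.frame(
  FusionName = c("A--B", "C--D", "E--F"),
  histology  = c("HGG", "HGG", "HGG"),
  gene1_FPKM = c(0.2, 5, 12),
  gene2_FPKM = c(3, 2, 8)
)

kept <- fusions %>%
  filter(gene1_FPKM >= 1, gene2_FPKM >= 1) %>% # 1) drop fusions with a <1 FPKM partner
  group_by(histology) %>%                      # 2) compare vs the group median
  mutate(gene1_direction = ifelse(gene1_FPKM > median(gene1_FPKM),
                                  "up", "down")) %>%
  ungroup()
```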

analyses/fusion_filtering/01-fusion-filtering.R (outdated review thread, resolved)

@jaclyn-taroni
Member

Thanks for the link to the presentation @jharenza, that's helpful. Is it fair to say that

  1. completely removing fusions where genes have <1 FPKM for each gene

is a general step, not specific to any project and

  2. doing a comparison of expression of each gene partner vs median expression of that gene in the histology to determine if that gene partner is up or down-regulated in comparison to the whole group.

is a more project-specific step?

@kgaonkar6
Collaborator Author

> Thanks for the link to the presentation @jharenza, that's helpful. Is it fair to say that
>
>   1. completely removing fusions where genes have <1 FPKM for each gene
>
> is a general step, not specific to any project and
>
>   2. doing a comparison of expression of each gene partner vs median expression of that gene in the histology to determine if that gene partner is up or down-regulated in comparison to the whole group.
>
> is a more project-specific step?

Yes, that's what I would think too.

@jaclyn-taroni
Member

@kgaonkar6 I'm sorry I forgot to ask this as part of my general questions about the pipeline -- what is the difference between the annotation steps in the 01-fusion-filtering.R and 04-annotate-filtered-fusion.R? We can also cross that bridge when we get to it but wondering if there are opportunities to combine steps.

@kgaonkar6
Collaborator Author

Currently, 01-fusion-filtering.R performs the gene-list-based filtering to capture all fusions in genes that are tumor suppressors/oncogenic/COSMIC/TCGA etc. and saves them as putative_driver_fusion.rds; additionally, it identifies recurrent fusions using n callers, n samples per group, and n times a fused gene is observed recurrently in a group, and saves them as filtered_fusion.rds.

04-annotate-filtered-fusion.R only annotates the final fusions that passed the expression filtering with the gene lists (TSGs, oncogenes, TCGA fusions, TF, kinase) to save as final annotated files. Additionally, I also had quite a few substitutions in the script to convert old/synonym gene names to HUGO-approved gene names.

@jaclyn-taroni
Member

@kgaonkar6 @jharenza thank you for answering those questions - it's been very helpful for me.

I sketched out my thoughts on the structure and order of the pipeline if the goal is to reuse it broadly:

[Image: sketch of the proposed pipeline structure]

I would make every step except the project-specific step in the red box as general as possible, where each of the inputs is standardized in some way and the scripts accept any cutoffs as command line arguments or options like we've discussed. So, you might prefer to accept the expression estimates in a different format (e.g., not RDS, use certain gene identifiers, etc.) than what is available as part of the OpenPBTA data and then, as part of this analysis module, perform some data cleaning step(s) to get it into the correct format for the general pipeline. (This has an added benefit of being a public example of how to get things in the correct format for your pipeline.)

There also may be some things I've missed that are prohibitive. I wanted to put this out there to get your thoughts and see if this is helpful -- will help guide me in the rest of the reviews.

You have already tackled some of this with the addition of 01-fusion-standardization.R. For this PR, I think we probably want to remove 01-fusion-filtering.R.

@jharenza
Copy link
Collaborator

> @kgaonkar6 @jharenza thank you for answering those questions - it's been very helpful for me.
>
> I sketched out my thoughts on the structure and order of the pipeline if the goal is to reuse it broadly:
>
> [Image: sketch of the proposed pipeline structure]
>
> I would make every step except the project-specific step in the red box as general as possible, where each of the inputs is standardized in some way and the scripts accept any cutoffs as command line arguments or options like we've discussed. So, you might prefer to accept the expression estimates in a different format (e.g., not RDS, use certain gene identifiers, etc.) than what is available as part of the OpenPBTA data and then, as part of this analysis module, perform some data cleaning step(s) to get it into the correct format for the general pipeline. (This has an added benefit of being a public example of how to get things in the correct format for your pipeline.)
>
> There also may be some things I've missed that are prohibitive. I wanted to put this out there to get your thoughts and see if this is helpful -- will help guide me in the rest of the reviews.
>
> You have already tackled some of this with the addition of 01-fusion-standardization.R. For this PR, I think we probably want to remove 01-fusion-filtering.R.

Thanks @jaclyn-taroni! This looks good, but I think that the red box is not entirely project-specific - there is filtering for artifacts in addition to filtering based on the project. Maybe we should have a phone call about it, since there are so many steps, so we can hash out what can go in what order? We did a lot of editing along the way, mainly because when looking at many iterations of results we saw X artifact, or Y duplicate, or Z was missing. So I do worry a bit about changing the order of everything, but agree that we should have standardized as much as possible so that if people have only 1 histology, for example, they can run it vs if they have many, or if they just want to run the filtering+annotation, they can also do so. So maybe we split filtering into artifact removal + filtering, move annotation before filtering, and then add plotting?

@jharenza
Collaborator

Actually, @kgaonkar6 - why don't we make a list of every step of the pipeline post-standardizing caller output, mark each as general or project-specific, and then go from there? Does this sound good, @jaclyn-taroni?

@jaclyn-taroni
Member

Sounds good, that's a great idea @jharenza! I was just writing back to say that I believe we are in agreement that the filtering needs to be split up; it is probably a matter of how. We are definitely on the same page on this point:

> agree that we should have standardized as much as possible so that if people have only 1 histology, for example, they can run it vs if they have many, or if they just want to run the filtering+annotation, they can also do so.

@kgaonkar6
Collaborator Author

kgaonkar6 commented Sep 12, 2019

> Actually, @kgaonkar6 - why don't we make a list of every step of the pipeline post-standardizing caller output and mark as general or project-specific and then go from there? Does this sound good, @jaclyn-taroni?

Great idea! I've drafted a filtering table here: https://docs.google.com/spreadsheets/d/1fKvDkkGddyeQ-_BHlE06XfMz82SC_nAPo7zwximxJRo/edit#gid=0. The filtering criteria have values general or project-specific from what we have discussed above. I also added the output type to show the resulting type of dataset, plus some more description.

@jaclyn-taroni thanks for pointing out that 01-fusion-filtering.R was still in the PR. I've removed that now.

@jaclyn-taroni
Member

@kgaonkar6 thanks for putting that information together! Can you change the access settings for the sheet please?

@kgaonkar6
Collaborator Author

Let me know if you have access now, thanks!

@jharenza
Collaborator

Thanks @kgaonkar6! I added some notes with additional details as well, so @jaclyn-taroni, please let us know if you have any questions about the steps!

@jaclyn-taroni
Member

Thank you @kgaonkar6.

In my proposed structure, I noted two filtering steps - one that is general and one that is project specific.

[Image: sketch of the proposed pipeline structure]

Wondering if the following is possible --

Could you put all these steps, noted as general in the spreadsheet, in a single script:

Both gene partners with FPKM<1
JunctionReadCount==0
SpanningFragCount-JunctionReadCount>=10
Artifact filtering (readthrough)
Artifact filtering (red flag database from FusionAnnotator)
Fused genes with either gene in TSGs,Cosmic,Oncogenic,TCGA fusion list
In-frame fusions called by both callers
Fusions with either gene in Kinase, TF list

And these steps, marked as project-specific in a different script/set of scripts/notebook(s) that takes the output from the above general script:

In-frame fusions called in at least 2 samples per histology
Fusions with either gene fused more than 5 times in a histology
Fusions in Filtered-fusions found in more than 1 histology
Fused genes in Filtered-fusions found to be multi-fused more than 5 times in a sample

Above, @jharenza proposed:

So maybe we split filtering into artifact removal+filtering, move annotation before filtering, and then add plotting?

I believe splitting things as outlined above meets part of this. What I'm not clear on is the rationale for moving annotation before filtering. If I understand the current pipeline in the d3b-center/fusion_filtering_pipeline repository and @kgaonkar6's comment here #92 (comment), the last two steps are annotation and plotting.

It's possible that the thinking on this has changed since putting together the table, but wanted to check in to make sure I understand.

@kgaonkar6
Copy link
Collaborator Author

Thank you @jaclyn-taroni! This makes sense; I'll gather the filtering steps and write a script that does only the general filtering, using the standardized output from the 01-fusion-standardization.R script.

Member

@jaclyn-taroni jaclyn-taroni left a comment

Hi @kgaonkar6, I'm glad we were able to get some more details about the structure of the pipeline hashed out! Thanks for your patience and willingness to walk me through things, I really appreciate it.

The documentation for 01-fusion-standardization.R looks pretty good. I've left a few questions and there are a few changes I would like to see made before merging:

  • Making sure outputfile is included in the documentation
  • Moving the column selection/filtering for standard_calls outside the logic and steps that are specific to a particular caller
  • Some error-handling if someone passes a caller that isn't (currently) supported
    • Optionally, you might also consider forcing all the characters in the caller argument to uppercase or lowercase so STARFusion vs. STArfusion vs. STARfusion doesn't cause any headaches.
  • Removing any steps you don't need (e.g., steps around breakpoints)
  • A comment explaining the JunctionReadCount standardization for Arriba
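
The error-handling and case-normalization items above might look something like this (the two accepted caller values are the ones this PR currently supports):

```r
# Normalize the caller argument and fail early on unsupported values
check_caller <- function(caller) {
  caller <- tolower(caller)
  if (!caller %in% c("arriba", "starfusion")) {
    stop("Unsupported caller: '", caller, "' (expected 'arriba' or 'starfusion')")
  }
  caller
}

check_caller("STARFusion")  # case no longer matters
```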

I've also included some comments that are more for our discussion/information only, such as how one could do something using dplyr. Those are not things that need to get changed unless you'd prefer to change them.

Please let me know if anything needs clarification. Thank you!

analyses/fusion_filtering/01-fusion-standardization.R (outdated review thread, resolved)

opt <- parse_args(OptionParser(option_list=option_list))

inputfile<-opt$fusionfile
Member

Suggested change
inputfile<-opt$fusionfile
inputfile <- opt$fusionfile

I find code to be much easier to read and understand if there are spaces around assignment operators and after commas.

Collaborator Author

@kgaonkar6 kgaonkar6 Sep 13, 2019

Ah yup, makes sense; I'll keep that in mind for assignments. Maybe this is a dumb question, but does it matter in the review process if I commit the suggestion here, or is it OK to fix it with all the other changes in the script?

Member

Not a dumb question at all. Took me a while to figure it out. Overall the answer is no, it does not matter. Here's how it does make a difference, as far as I can tell:

  • If you do this from the user interface, you can add suggestions to a batch and then commit them in a single commit. This would also give me credit for the commit (which is not something I care about when using this feature; I just find it convenient). The thing that is sometimes annoying is that if you have larger changes to make after committing the suggestions, you'll have to pull them before making changes locally.
  • I've found that the downside of doing everything, including the suggestions, locally is that sometimes I miss things, but some folks like using the resolve-conversation feature to help with this.

Collaborator Author

@kgaonkar6 kgaonkar6 Sep 15, 2019

Oh yes, all credit for the edits is yours and @jharenza's :) Thanks! Since I had to make a bunch of edits, I think I'll use the resolve button to keep track of the editing; hopefully I didn't miss any in this commit.

analyses/fusion_filtering/01-fusion-standardization.R (6 outdated review threads, resolved)
Member

@jaclyn-taroni jaclyn-taroni left a comment

These changes look good! The comments are really nice. This is very close, two things I'd like to see tweaked before merging:

  1. Using either saveRDS or write.table to save the standardized output but not both. I'd also include the expected file extension/file type in the documentation around outputfile.

  2. There's something off about the indentation throughout. In my personal experience, issues with indentation can sometimes happen if I switch between different text editors during my editing process. I've put a couple specific comments about how I'd expect some of the lines to look. If you are editing in RStudio you can select all and then use Ctrl + I (or Cmd + I on a Mac) to automatically fix the indentation for you. One of my favorite RStudio features :)

I've added some dplyr alternatives as comments as well. You don't have to make these changes, they are for your information if you are interested in adopting more dplyr into your workflows.
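
Picking a single writer, as suggested in point 1, might look like the following; the RDS choice and variable names are illustrative only:

```r
# Save the standardized calls once, as RDS; outputfile documented to end in .rds
standard_calls <- data.frame(FusionName = "A--B", Caller = "arriba")
outputfile <- file.path(tempdir(), "standardized-calls.rds")
saveRDS(standard_calls, outputfile)
round_trip <- readRDS(outputfile)
```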

analyses/fusion_filtering/01-fusion-standardization.R (6 outdated review threads, resolved)
@kgaonkar6
Collaborator Author

> These changes look good! The comments are really nice. This is very close, two things I'd like to see tweaked before merging:
>
>   1. Using either saveRDS or write.table to save the standardized output but not both. I'd also include the expected file extension/file type in the documentation around outputfile.
>   2. There's something off about the indentation throughout. In my personal experience, issues with indentation can sometimes happen if I switch between different text editors during my editing process. I've put a couple specific comments about how I'd expect some of the lines to look. If you are editing in RStudio you can select all and then use Ctrl + I (or Cmd + I on a Mac) to automatically fix the indentation for you. One of my favorite RStudio features :)
>
> I've added some dplyr alternatives as comments as well. You don't have to make these changes, they are for your information if you are interested in adopting more dplyr into your workflows.

You caught me about changing editors in between editing :D I was looking into the Atom editor, which I've been told has great features and packages for writing beautiful code, but I'm just beginning to learn how to do things in Atom. I'll stick to RStudio and use the Cmd+I feature.

I'll also implement only one output file and the dplyr alternatives as suggested, to make the code more readable. Thanks!

@kgaonkar6
Collaborator Author

I was wondering if there should be a step after standardization and before filtering to merge the standardized fusion calls from STARfusion and arriba? Or should the merge be included in 01-fusion-standardization?

@jaclyn-taroni
Member

> I was wondering if there should be a step after standardization and before filtering to merge the standardized fusion calls from STARfusion and arriba? Or should the merge be included in 01-fusion-standardization?

Do you exclusively use merged files downstream of this step @kgaonkar6 ?

@kgaonkar6
Collaborator Author

Yes, all the filtering is done on merged files.

@jaclyn-taroni
Member

I would say let's leave this as is and then add a merge step before filtering. We can always revisit, but I think it would be helpful for me to have eyes on the code that is exclusively for the merge step before commenting further.
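
A merge step of that kind could be as small as the following sketch, assuming both callers were standardized to the same schema by 01-fusion-standardization.R (the file names are hypothetical):

```r
# Toy standardized outputs with a shared schema
star   <- data.frame(FusionName = "A--B", Caller = "STARFusion")
arriba <- data.frame(FusionName = "A--B", Caller = "arriba")

# Same columns, so a plain rbind merges the calls for downstream filtering
all_callers <- rbind(star, arriba)
saveRDS(all_callers, file.path(tempdir(), "standardized-fusion-calls.rds"))
```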

Member

@jaclyn-taroni jaclyn-taroni left a comment

Looks good to me 👍 I will fix the conflicts and then merge. Great work on these changes @kgaonkar6! Looking forward to seeing the next steps.

@jaclyn-taroni jaclyn-taroni merged commit 160c8f9 into AlexsLemonade:master Sep 16, 2019
@kgaonkar6
Collaborator Author

Thank you so much! This has been so helpful to me especially to understand how to write good code :)
