Generate reference index to use for bulk RNA-sequencing #59
This looks good!
My only real concern is that the annotation outputs may not be written out, as the annotation directory was removed as an output from the `generate_fasta` process (which I would rename `generate_references` or `generate_transcriptome` since it should make gtf files too).
I made some suggestions based on learning something new about DSL2 (`emit` statements) to handle separating outputs we will and won't use.
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
LGTM!
This PR, stacked on #58, adds the necessary steps to create the index used for bulk RNA-sequencing mapping. I am generating the reference fasta in the same way that we generated the reference fasta for the splici index, except I am only using transcripts from spliced cDNA, rather than spliced cDNA + introns. I'm then using that reference fasta and creating an index with decoys added, with the whole genome being used as decoys.
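For reviewers less familiar with decoy-aware indices, here is a rough sketch of the usual preparation step: list the genome sequence names as decoys and append the genome to the spliced cDNA fasta. This is not the code in this PR; the function names, file names, and output layout below are hypothetical and chosen only for illustration.

```python
from pathlib import Path


def decoy_names(genome_fasta):
    """Return the sequence names in a genome FASTA (first token of each header)."""
    names = []
    with open(genome_fasta) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.append(fields[0])
    return names


def build_gentrome(transcriptome_fasta, genome_fasta, out_dir):
    """Write decoys.txt and gentrome.fa (transcripts first, then genome).

    File names are assumptions for this sketch, not the names used in the PR.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    decoys_path = out_dir / "decoys.txt"
    gentrome_path = out_dir / "gentrome.fa"

    # One decoy name per line, taken from the genome headers.
    decoys_path.write_text("".join(f"{name}\n" for name in decoy_names(genome_fasta)))

    # Transcript sequences must come before the decoy (genome) sequences.
    with open(gentrome_path, "w") as out:
        for fasta in (transcriptome_fasta, genome_fasta):
            with open(fasta) as handle:
                for line in handle:
                    # Guard against a missing trailing newline at end of file.
                    out.write(line if line.endswith("\n") else line + "\n")

    return gentrome_path, decoys_path

# The index itself would then be built with something along the lines of:
#   salmon index -t gentrome.fa -d decoys.txt -i <index_dir>
```

Using the whole genome as decoys makes the combined fasta large, so building the index is considerably more memory-hungry than building a transcriptome-only index.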
To do this, I modified our existing script for creating the splici reference fasta to also create the spliced cDNA only fasta, and renamed the file to reflect that it now preps both reference transcriptomes. I did this in one script rather than two because parsing the gtf and the whole primary genome sequence to annotate spliced and unspliced cDNA is time consuming, and those steps only need to happen once before outputting both fasta files, so it seemed more efficient to keep everything in the same script.
I then modified the current workflow used to build the indices and added a step to create the decoy salmon index using the reference fasta. I chose to keep just the two processes, generating the fastas and then creating the salmon indices, and added to the existing salmon index process rather than creating a third process. I could really go either way on this, but thought this worked nicely since that process already takes as input all of the output from generating the two fastas.
Note that this is stacked on #58 because I started on that first, but this should probably be reviewed first, since fully testing #58 depends on generating the index.