Add module for bulk RNA-seq to workflow #58
Conversation
This looks like a good start, and most of the functionality is there. I left a number of little comments for formatting things, but the big thoughts are really about inputs and outputs:

- Are there cases where the input reads are split into multiple R1 and R2 files? If so, how does `fastp` handle that?
- Should the output from the `fastp` process just be individual files? I would tend to say yes, but part of me is worried about the case where output files are blank. I also worry about the inputs being blank in the case of single-end reads. It may end up that we want to simply have separate processes for SE and PE, rather than trying to accommodate both.
- If Nextflow doesn't like passing blank files, that could cause trouble for SE samples anyway: maybe we should make the input to the `fastp` process just the directory with all the fastq files, and then do different things in the script based on `meta.technology`? This would basically move the `*_R1_*.fastq.gz` glob to the process rather than the workflow.
- Do we want to customize `fastp` behavior at all, or spell out any of the options, even if we are using defaults?

Happy to discuss any of this in a meeting... I am not sure I am explaining it all well.
modules/bulk-salmon.nf (Outdated)

```
container params.FASTP_CONTAINER
label 'cpus_8'
tag "${meta.library_id}-bulk"
publishDir "${params.outdir}/internal/fastp"
```
Do we need to publish this? I don't think we do, and we probably don't want to, as this will have PHI that we would have to remember to delete.
```
// create tuple of (metadata map, [Read 1 files], [Read 2 files])
bulk_reads_ch = bulk_channel
    .map{meta -> tuple(meta,
        file("s3://${meta.s3_prefix}/*_R1_*.fastq.gz"),
```
Are there any cases where `read1` will be more than one file? Does `fastp` handle that properly?
modules/bulk-salmon.nf (Outdated)

```
fastp --in1 ${read1} \
    ${meta.technology == 'paired_end' ? "--in2 ${read2}" : ""} \
    --out1 ${trimmed_reads}/${meta.library_id}-trimmed-R1.fastq.gz \
```
Do we need the `library_id` in the output here? I don't think it much matters, since the directory is named, but if we don't need it we could simplify this a bit.
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
As I was working on this, I just wanted to answer some of your questions and get your thoughts if you have any:
- Yes, there are cases with multiple input reads for both R1 and R2, and I have been testing with these samples.
- So I had originally broken this up into a path for trimmed R1 and trimmed R2 for the output of the `fastp` process.
- Passing around blank files doesn't seem to be an issue. Right now this works for single-end through the `fastp` process.
Okay, great! That was what I hoped it would do, but the docs didn't seem to be clear on the behavior. Just to confirm, though... you verified that it isn't just processing the first file?
There is an ability to use optional outputs so Nextflow doesn't complain that a file is missing, but I hadn't really looked at how that works in this context. I am not sure how it plays with tuples, in particular.

I don't feel too strongly about whether we do files or directories, but I don't generally like it when the internal Nextflow script depends on constructing a particular file to look for, as it means that any changes in the upstream process can break the downstream one. So if we decided not to do fastp, we would have more to modify. Not a huge deal, but it can be annoying.

So in this case I might make the `fastp` output files something more generally compatible, with names that still match the `*_R1_*.fastq.gz` glob, which will work with any read directory, trimmed or untrimmed.
Thanks for the helpful comments on getting this set up @jashapiro! I made some changes based on your last comments, and everything works well except the `salmon` step for single-end libraries.

After you asked this, I went back into the output from fastp to make sure that it was processing all of the files, and it actually wasn't: it was only processing the first input file, without throwing a warning that there were extra files in the input... To solve this, I concatenated all of the fastqs for R1 and R2 within the process and then used the merged fastq files as input to `fastp`. I also went ahead and updated the file name structure to be more generally compatible, as you had suggested, and incorporated the below suggestion into the `fastp` call.

However, the changes now work well for paired-end libraries, resulting in successful `salmon` runs.
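The `cat`-based merge described above works because gzip streams are concatenable: appending one `.gz` file to another yields a valid `.gz` file. A minimal sketch of that step, with hypothetical lane-split file names (not the actual workflow inputs):

```shell
# Sketch of the merge step, with hypothetical lane-split file names.
# Concatenated gzip streams form a valid gzip stream, so plain `cat`
# is enough to merge .fastq.gz files before handing them to fastp.
set -eu

printf '@r1\nACGT\n+\nIIII\n' | gzip > sample_R1_L001.fastq.gz
printf '@r2\nTTTT\n+\nIIII\n' | gzip > sample_R1_L002.fastq.gz

cat sample_R1_L001.fastq.gz sample_R1_L002.fastq.gz > sample_R1_merged.fastq.gz

# Decompressing the merged file yields both reads: 8 fastq lines total.
gunzip -c sample_R1_merged.fastq.gz | wc -l
```

Since the merged file is itself valid gzip, `fastp` (or any gzip-aware tool) can read it directly with no re-compression step.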
Ah, so looking at the salmon docs again: if you only have SE reads, you need to use `-r` rather than `-1`/`-2`. One other minor thought is that writing out the merged fastq file may not be necessary.
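Per the salmon docs, unmated (single-end) reads are passed to `salmon quant` via `-r`, while paired-end mates use `-1`/`-2`. A hedged sketch of that branching in plain shell (variable and file names here are illustrative, not taken from the workflow):

```shell
# Hedged sketch of choosing salmon's read flags from the library layout.
# Variable and file names are illustrative, not taken from the workflow.
technology="single_end"   # would come from meta.technology in Nextflow

if [ "$technology" = "paired_end" ]; then
    # paired-end mates go in via -1/-2
    read_args="-1 reads_R1_trimmed.fastq.gz -2 reads_R2_trimmed.fastq.gz"
else
    # single-end (unmated) reads go in via -r
    read_args="-r reads_R1_trimmed.fastq.gz"
fi

echo "salmon quant ... $read_args"
```

In the Nextflow module this would more likely be a Groovy ternary on `meta.technology`, like the `fastp` lines elsewhere in this thread; the shell form above just shows the flag choice.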
Thank you @jashapiro! I knew it was something small I was missing... I updated the workflow based on this comment and everything is good to go! I will note that I tried to implement the changes you suggested above with the named pipes, but I couldn't get them working.
I do have another suggestion! Or rather, Vince Buffalo does: process substitution. Try something like `<(cat ${read1})` as the `fastp` input (no need for the separate `cat` step or the merged file).
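The pattern being suggested can be demonstrated with a stand-in command (the fastp invocation itself is not reproduced here; `wc -l` plays its role, and the file names are made up):

```shell
# Process substitution demo: <(...) exposes a command's output as a
# readable path, so a tool can consume "merged" input with no
# intermediate file on disk. `wc -l` stands in here for fastp.
set -eu

printf 'a\nb\n' > part_R1_001.txt
printf 'c\nd\n' > part_R1_002.txt

# bash -c because <(...) is a bash feature, not POSIX sh.
bash -c 'cat <(cat part_R1_*.txt) | wc -l'   # 4 lines pass through
```

Note that the substituted path is a pipe, not a regular file, which is why some tools (as it turns out later in this thread, `fastp` among them) may not handle it correctly.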
I haven't tested this myself, but it looks good. I made a previous comment about testing process substitution, which I implemented here as a suggestion (along with some controversially long lines...)
Other than that, and one other little suggestion, it looks good. I haven't tested it, but if it works, feel free to merge with whatever changes you prefer.
modules/bulk-salmon.nf (Outdated)

```
fastp --in1 ${meta.library_id}_R1_merged.fastq.gz \
    ${meta.technology == 'paired_end' ? "--in2 ${meta.library_id}_R2_merged.fastq.gz" : ""} \
    --out1 ${trimmed_reads}/${meta.library_id}_R1_trimmed.fastq.gz \
    ${meta.technology == 'paired_end' ? "--out2 ${trimmed_reads}/${meta.library_id}_R2_trimmed.fastq.gz" : ""} \
```
It might make the lines long, but I always prefer fewer `if` statements if we can get away with it:

```
fastp --in1 ${meta.library_id}_R1_merged.fastq.gz --out1 ${trimmed_reads}/${meta.library_id}_R1_trimmed.fastq.gz \
    ${meta.technology == 'paired_end' ? "--in2 ${meta.library_id}_R2_merged.fastq.gz --out2 ${trimmed_reads}/${meta.library_id}_R2_trimmed.fastq.gz" : ""} \
```

I'm assuming here that `fastp` doesn't care about argument order. It shouldn't, and the fact that `salmon` does (for some things) is always a mystery to me!
I was able to implement the second suggestion of removing the second set of `if` statements, but unfortunately it seems like `fastp` doesn't like the process substitution method: it completes the process, but it results in an empty fastq file, so when that is used as input to the `salmon` process it fails. I went to check the work directories too, and I can see the files are output with the new trimmed names, but they are completely empty. I reverted back to using the `cat` outside of the `fastp` call and saving the merged file, even if it does mean taking up more disk space.
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
LGTM!
Sorry about the dead ends with trying to avoid writing to disk (which wasn't really about space as much as the speed of writing and reading). I have one last idea here, but if it fails I am fine with keeping things as they are.
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
In this PR, I started adding the ability to process bulk RNA sequencing samples to the main workflow. To do this, I first modified the list of allowed technologies to include the bulk technologies, i.e. `single_end` and `paired_end`.
I then created a separate module for the bulk processing that includes running fastp and then salmon. It takes as input the same metadata map that the other modules use and creates a tuple of the metadata and the R1 and R2 files. That is then used as input to the fastp process, which outputs the paths to the trimmed fastq files as a tuple.

Then I use the trimmed fastq files as input for salmon in a separate process. While testing this, I realized that I need a different index for bulk samples than we are using for the single-cell samples, since we don't want the splici index here. So I am filing this as a draft for now and will file the PR with the index as a separate PR stacked on this one.
If people do happen to look at this, the main question that I have is about the options that I chose for fastp and salmon, and whether those are what we would like.

Additionally, is there an alternative approach to dealing with processing both `single_end` and `paired_end` samples at the same time? Here, I am passing the path to the folder where the trimmed files live, rather than the individual paths to the trimmed R1/R2 files, so that the same process can work with either single-end or paired-end reads. I went back and forth on whether or not to separate them somehow, but thought that this would be a good way to approach it.
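The folder-passing approach described above can be sketched in shell: the process script receives a directory of fastqs and infers the library layout from whether any R2 files exist, doing the glob inside the script rather than the workflow. All directory and file names below are illustrative assumptions, not from the actual module:

```shell
# Sketch of moving the R1/R2 glob into the process script: the process
# receives a directory of fastqs and infers SE vs PE from whether any
# R2 files exist. Directory and file names are illustrative.
set -eu

mkdir -p fastq_dir
touch fastq_dir/sample_R1_001.fastq.gz    # pretend single-end input

# Count R2 files; the glob is handled here, not in the workflow.
r2_count=$(ls fastq_dir/*_R2_*.fastq.gz 2>/dev/null | wc -l | tr -d ' ')

if [ "$r2_count" -gt 0 ]; then
    mode="paired_end"
else
    mode="single_end"
fi
echo "$mode"
```

The trade-off mirrors the earlier review discussion: the script becomes self-contained for SE and PE, but now depends on the upstream process producing files that match the `*_R1_*`/`*_R2_*` naming.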