
Add module for bulk RNA-seq to workflow #58

Merged (25 commits) on Oct 26, 2021

Conversation

@allyhawkins (Member)

In this PR, I started adding the ability to process bulk RNA sequencing samples in the main workflow. To do this, I first modified the list of allowed technologies to include the bulk technologies, i.e. single_end and paired_end.

I then created a separate module for the bulk processing that runs fastp and then salmon. It takes as input the same metadata map that the other modules use and creates a tuple of the metadata and then the R1 and R2 files. That is then used as input to the fastp process, which outputs the paths to the trimmed fastq files as a tuple.

I then use the trimmed fastq files as input to salmon in a separate process. While testing this, I realized that bulk samples need a different index than the one we use for the single-cell samples, since we don't want the splici index here. So I am filing this as a draft for now and will add the index in a separate PR stacked on this one.

If people do happen to look at this, my main question is about the options I chose for fastp and salmon, and whether those are what we would like.

Additionally, is there a better approach to processing both single_end and paired_end samples at the same time? Here, I pass the path to the folder where the trimmed files live, rather than the individual paths to the trimmed R1/R2 files, so that the same process can handle either single-end or paired-end libraries. I went back and forth on whether to separate them somehow, but thought this would be a good way to approach it.
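
Roughly, the bulk module has the following shape (a simplified sketch, not the exact code in this PR; the trimmed-directory name is illustrative):

process fastp {
  input:
    tuple val(meta), path(read1), path(read2)
  output:
    tuple val(meta), path(trimmed_reads)
  script:
    trimmed_reads = "${meta.library_id}_trimmed"
    """
    mkdir -p ${trimmed_reads}
    fastp --in1 ${read1} \
      ${meta.technology == 'paired_end' ? "--in2 ${read2}" : ""} \
      --out1 ${trimmed_reads}/${meta.library_id}-trimmed-R1.fastq.gz \
      ${meta.technology == 'paired_end' ? "--out2 ${trimmed_reads}/${meta.library_id}-trimmed-R2.fastq.gz" : ""}
    """
}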

@jashapiro (Member) left a comment

This looks like a good start, and most of the functionality is there. I left a number of little comments about formatting, but the big thoughts are really about inputs and outputs:

  • Are there cases where the input reads are split into multiple R1 and R2 files? If so, how does fastp handle that?
  • Should the output from the fastp process just be individual files? I would tend to say yes, but part of me is worried about the case where output files are blank. I also worry about the inputs being blank, in the case of single end reads. It may end up that we want to simply have separate processes for SE and PE, rather than trying to accommodate both.
  • If nextflow doesn't like passing blank files, that could cause trouble for SE samples anyway: maybe we should make the input to the fastp process just the directory with all the fastq files, and then do different things in the script based on meta.technology? This would basically move the *_R1_*.fastq.gz glob to the process rather than the workflow. (See the sketch after this list.)
  • Do we want to customize fastp behavior at all, or spell out any of the options, even if we are using defaults?
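
For the third point, something like this is what I have in mind (an untested sketch):

process fastp {
  input:
    // stage the whole fastq directory rather than individual read files
    tuple val(meta), path(fastq_dir)
  ...
  script:
    """
    # caveat: assumes fastp can take the glob directly;
    # see the multi-file question above
    fastp --in1 ${fastq_dir}/*_R1_*.fastq.gz \
      ${meta.technology == 'paired_end' ? "--in2 ${fastq_dir}/*_R2_*.fastq.gz" : ""} \
      ...
    """
}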

Happy to discuss any of this in a meeting... I am not sure I am explaining it all well.

container params.FASTP_CONTAINER
label 'cpus_8'
tag "${meta.library_id}-bulk"
publishDir "${params.outdir}/internal/fastp"

Do we need to publish this? I don't think we do, and we probably don't want to, as this will have PHI that we would have to remember to delete.

// create tuple of (metadata map, [Read 1 files], [Read 2 files])
bulk_reads_ch = bulk_channel
  .map{meta -> tuple(meta,
                     file("s3://${meta.s3_prefix}/*_R1_*.fastq.gz"),
                     file("s3://${meta.s3_prefix}/*_R2_*.fastq.gz"))}

Are there any cases where read1 will be more than one file? Does fastp handle that properly?


fastp --in1 ${read1} \
${meta.technology == 'paired_end' ? "--in2 ${read2}":""} \
--out1 ${trimmed_reads}/${meta.library_id}-trimmed-R1.fastq.gz \

Do we need the library_id in the output here? I don't think it much matters, since the directory is named, but if we don't need it we could simplify this a bit.

@allyhawkins (Member, Author)

As I was working on this, I just wanted to answer some of your questions and get your thoughts:

Are there cases where the input reads are split into multiple R1 and R2 files? If so, how does fastp handle that?

Yes, there are cases with multiple input files for both R1 and R2. I have been testing with these samples, and fastp actually does okay: it takes the whole list of files and outputs a single trimmed file.

Should the output from the fastp process just be individual files? I would tend to say yes, but part of me is worried about the case where output files are blank. I also worry about the inputs being blank, in the case of single end reads. It may end up that we want to simply have separate processes for SE and PE, rather than trying to accommodate both.

So I originally had this broken up into a path for trimmed R1 and a path for trimmed R2 in the output of the fastp process. However, when you pass a SE library through, that doesn't work: Nextflow complains about having an empty output for the trimmed R2, which is why I combined them into an output directory rather than individual paths. If you would prefer the more explicit outputs of paths to each file, then I think we will have to break it up into separate SE and PE processes. Would you prefer that? If so, I will go ahead and do that now.

If nextflow doesn't like passing blank files, that could cause trouble for SE samples anyway: maybe we should make the input to the fastp process just the directory with all the fastq files, and then do different things in the script based on meta.technology? This would basically move the R1.fastq.gz glob to the process rather than the workflow.

Passing around blank files doesn't seem to be an issue. Right now this works for single-end: the fastp step only has the R1 files and only outputs R1 files. I don't see a reason to move the glob right now, since Nextflow isn't complaining about blank files here.

@jashapiro (Member)

Yes, there are cases with multiple input files for both R1 and R2. I have been testing with these samples, and fastp actually does okay: it takes the whole list of files and outputs a single trimmed file.

Okay, great! That was what I hoped it would do, but the docs didn't seem to be clear on the behavior. Just to confirm, though... you verified that it isn't just processing the first file?

So I originally had this broken up into a path for trimmed R1 and a path for trimmed R2 in the output of the fastp process. However, when you pass a SE library through, that doesn't work: Nextflow complains about having an empty output for the trimmed R2, which is why I combined them into an output directory rather than individual paths.

There is an ability to use optional outputs so nextflow doesn't complain that a file is missing, but I hadn't really looked at how it works in this context. I am not sure how it plays with tuples, in particular.
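
For reference, the optional output syntax looks roughly like this for a standalone path declaration (a sketch; whether it can be applied to individual elements of a tuple is exactly the part I haven't checked):

output:
    // nextflow won't fail when this file is absent (e.g. for SE libraries)
    path("${meta.library_id}_R2_trimmed.fastq.gz"), optional: true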

I don't feel too strongly about whether we do files or directories, but I don't generally like it when the internal nextflow script depends on constructing a particular file to look for, as it means that any changes in the upstream process can break the downstream one. So if we decided to not do fastp, we would have more to modify. Not a huge deal, but it can be annoying.

So in this case I might make the fastp output files something more generally compatible like:
${trimmed_reads}/${meta.library_id}_R1_trimmed.fastq.gz

(Note the _R1_ is preserved from normal Illumina file names.) This is to enable us to use a more generic salmon input like:

input:
  tuple val(meta), path(read_dir)

...

"""
salmon quant -i ${params.bulk_index} \
        --libType A \
        -1 ${read_dir}/*_R1_*.fastq.gz \
        ${meta.technology == 'paired_end' ? "-2 ${read_dir}/*_R2_*.fastq.gz" : ""} \
...

This will work with any read directory, trimmed or untrimmed.

@allyhawkins (Member, Author)

Thanks for the helpful comments on getting this set up, @jashapiro! I made some changes based on your last comments, and everything works well except the salmon process for single-end libraries.

Yes, there are cases with multiple input files for both R1 and R2. I have been testing with these samples, and fastp actually does okay: it takes the whole list of files and outputs a single trimmed file.

Okay, great! That was what I hoped it would do, but the docs didn't seem to be clear on the behavior. Just to confirm, though... you verified that it isn't just processing the first file?

After you asked this, I went back into the fastp output to make sure that it was processing all of the files, and it actually wasn't: it was only processing the first input file, without throwing a warning about the extra input files... To solve this, I concatenated all of the fastqs for R1 and R2 within the process and used the merged fastq files as input to fastp. When I checked the output, I saw that it was using all three input files from the samples I was testing.
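
Concretely, the merge step inside the process script is just (a sketch consistent with the merged file names in the fastp call quoted below; the paired_end guard mirrors the ternaries used elsewhere in this module):

# merge per-lane gzipped fastqs into one file per read
cat ${read1} > ${meta.library_id}_R1_merged.fastq.gz
${meta.technology == 'paired_end' ? "cat ${read2} > ${meta.library_id}_R2_merged.fastq.gz" : ""}

(Concatenated gzip streams are themselves valid gzip, so the per-lane files can be merged without decompressing.)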

I also went ahead and updated the file name structure to be more generally compatible, as you had suggested, and incorporated the suggestion below into the salmon process.

input:
  tuple val(meta), path(read_dir)

...

"""
salmon quant -i ${params.bulk_index} \
        --libType A \
        -1 ${read_dir}/*_R1_*.fastq.gz \
        ${meta.technology == 'paired_end' ? "-2 ${read_dir}/*_R2_*.fastq.gz" : ""} \
...

However, while the changes now work well for paired-end libraries (fastp and salmon both complete successfully and give the expected output files), this isn't the case for single-end. With a single-end library, fastp completes successfully and gives the expected trimmed output file, but the salmon process then errors, saying that it expects a file for R2 and has detected a paired-end library, even though it is single-end. I have tried automatic library detection, -l SR, and -l U, and all of them give the same error. I've also tried multiple single-end libraries, and each gives the same error. I'm including an example of the error message here for reference. Any thoughts on how to go about this @jashapiro?
[Screenshot: example of the salmon error message described above]

@jashapiro (Member)

Ah, so looking at the salmon docs again, if you only have SE reads, you need to use -r file.fastq.gz rather than -1. I would try that with -l A first, as we would rather have automatic detection of strand if we can.
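
So for a single-end library the call would look something like this (a sketch; the -o output directory name is illustrative):

salmon quant -i ${params.bulk_index} \
    --libType A \
    -r ${read_dir}/*_R1_*.fastq.gz \
    -o ${meta.library_id}-salmon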

One other minor thought: writing out the concatenated files before fastp is a bit wasteful, and I think you can use named pipes to save a bit of disk space (and the overhead of writing to disk). To do that, you would use something like the following code:

# create named pipes so fastp can read the merged streams
# without writing intermediate files to disk
mkfifo R1merge
mkfifo R2merge

# stream the concatenated fastqs into the pipes in the background
cat ${read1} > R1merge &
cat ${read2} > R2merge &

fastp --in1 R1merge --in2 R2merge

(The &s are critical, or the whole thing will hang)

@allyhawkins (Member, Author)

Ah, so looking at the salmon docs again, if you only have SE reads, you need to use -r file.fastq.gz rather than -1. I would try that with -l A first, as we would rather have automatic detection of strand if we can.

Thank you @jashapiro! I knew it was something small I was missing... I updated the workflow based on this comment and everything is good to go! I will note that I tried to implement the named-pipe changes you suggested above, and I made sure I had the &, but it just hangs at the fastp process... Any suggestions? Other than that, everything is processing as expected and is ready for review.

@allyhawkins marked this pull request as ready for review October 25, 2021 19:13
@jashapiro (Member)

Thank you @jashapiro! I knew it was something small I was missing... I updated the workflow based on this comment and everything is good to go! I will note that I tried to implement the named-pipe changes you suggested above, and I made sure I had the &, but it just hangs at the fastp process... Any suggestions? Other than that, everything is processing as expected and is ready for review.

I do have another suggestion! Or rather, Vince Buffalo does: process substitution.
(Honestly, this is what I was trying to think of first, but couldn't remember)

Try something like:

fastp --in1 <(cat ${read1}) --in2 <(cat ${read2})

(no need for the mkfifo commands!)

@jashapiro (Member) left a comment

I haven't tested this myself, but it looks good. I made a previous comment about testing process substitution, which I implemented here as a suggestion (along with some controversially long lines...)

Other than that, and one other little suggestion, it looks good. I haven't tested it, but if it works, feel free to merge with whatever changes you prefer.

Comment on lines 21 to 24
fastp --in1 ${meta.library_id}_R1_merged.fastq.gz \
${meta.technology == 'paired_end' ? "--in2 ${meta.library_id}_R2_merged.fastq.gz" : ""} \
--out1 ${trimmed_reads}/${meta.library_id}_R1_trimmed.fastq.gz \
${meta.technology == 'paired_end' ? "--out2 ${trimmed_reads}/${meta.library_id}_R2_trimmed.fastq.gz" : ""} \

It might make the lines long, but I always prefer fewer if statements if we can get away with it:

Suggested change
fastp --in1 ${meta.library_id}_R1_merged.fastq.gz \
${meta.technology == 'paired_end' ? "--in2 ${meta.library_id}_R2_merged.fastq.gz" : ""} \
--out1 ${trimmed_reads}/${meta.library_id}_R1_trimmed.fastq.gz \
${meta.technology == 'paired_end' ? "--out2 ${trimmed_reads}/${meta.library_id}_R2_trimmed.fastq.gz" : ""} \
fastp --in1 ${meta.library_id}_R1_merged.fastq.gz --out1 ${trimmed_reads}/${meta.library_id}_R1_trimmed.fastq.gz \
${meta.technology == 'paired_end' ? "--in2 ${meta.library_id}_R2_merged.fastq.gz --out2 ${trimmed_reads}/${meta.library_id}_R2_trimmed.fastq.gz" : ""} \

I'm assuming here that fastp doesn't care about argument order. It shouldn't, and the fact that salmon does (for some things) is always a mystery to me!

@allyhawkins (Member, Author)

I was able to implement the second suggestion of removing the extra if statements. Unfortunately, it seems like fastp doesn't like the process substitution method: the process completes, but it results in an empty fastq file, so when that is used as input to the salmon process it fails. I checked the work directories too, and I can see the files are output with the new trimmed names, but they are completely empty. I reverted to using cat outside of the fastp call and saving the merged file, even if it does mean taking up more disk space.

@jashapiro (Member) left a comment

LGTM!

Sorry about the dead ends with trying to avoid writing to disk (which wasn't really about space as much as the speed of writing and reading). I have one last idea here, but if it fails I am fine with keeping things as they are.
