
Add module for bulk RNA-seq to workflow #58

Merged (25 commits) on Oct 26, 2021

Conversation

@allyhawkins (Member)

In this PR, I started adding the ability to process bulk RNA sequencing samples in the main workflow. To do this, I first modified the list of allowed technologies to include the bulk technologies, i.e. single_end and paired_end.

I then created a separate module for the bulk processing that runs fastp and then salmon. It takes as input the same metadata map that the other modules use and creates a tuple of the metadata and then the R1 and R2 files. That is then used as input to the fastp process, which outputs the paths to the trimmed fastq files as a tuple.

I then use the trimmed fastq files as input to salmon in a separate process. While testing this, I realized that bulk samples need a different index than the one we use for the single-cell samples, since we don't want the splici index here. So I am filing this as a draft for now and will add the index in a separate PR stacked on this one.

If people do happen to look at this, my main question is about the options I chose for fastp and salmon, and whether those are what we would like.

Additionally, is there a better approach to processing both single_end and paired_end samples at the same time? Here, I pass the path to the folder where the trimmed files live, rather than the individual paths to the trimmed R1/R2 files, so that the same process can handle either single-end or paired-end libraries. I went back and forth on whether to separate them somehow, but thought this would be a good way to approach it.
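
Roughly, the bulk module has the following shape (a simplified sketch, not the exact code in this PR; the trimmed-directory name is illustrative):

process fastp {
  input:
    tuple val(meta), path(read1), path(read2)
  output:
    tuple val(meta), path(trimmed_reads)
  script:
    trimmed_reads = "${meta.library_id}_trimmed"
    """
    mkdir -p ${trimmed_reads}
    fastp --in1 ${read1} \
      ${meta.technology == 'paired_end' ? "--in2 ${read2}" : ""} \
      --out1 ${trimmed_reads}/${meta.library_id}-trimmed-R1.fastq.gz \
      ${meta.technology == 'paired_end' ? "--out2 ${trimmed_reads}/${meta.library_id}-trimmed-R2.fastq.gz" : ""}
    """
}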

@jashapiro (Member) left a comment

This looks like a good start, and most of the functionality is there. I left a number of little comments about formatting, but the big thoughts are really about inputs and outputs:

  • Are there cases where the input reads are split into multiple R1 and R2 files? If so, how does fastp handle that?
  • Should the output from the fastp process just be individual files? I would tend to say yes, but part of me is worried about the case where output files are blank. I also worry about the inputs being blank, in the case of single end reads. It may end up that we want to simply have separate processes for SE and PE, rather than trying to accommodate both.
  • If nextflow doesn't like passing blank files, that could cause trouble for SE samples anyway: maybe we should make the input to the fastp process just the directory with all the fastq files, and then do different things in the script based on meta.technology? This would basically move the *_R1_*.fastq.gz glob to the process rather than the workflow. (See the sketch after this list.)
  • Do we want to customize fastp behavior at all, or spell out any of the options, even if we are using defaults?
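
For the third point, something like this is what I have in mind (an untested sketch):

process fastp {
  input:
    // stage the whole fastq directory rather than individual read files
    tuple val(meta), path(fastq_dir)
  ...
  script:
    """
    # caveat: assumes fastp can take the glob directly;
    # see the multi-file question above
    fastp --in1 ${fastq_dir}/*_R1_*.fastq.gz \
      ${meta.technology == 'paired_end' ? "--in2 ${fastq_dir}/*_R2_*.fastq.gz" : ""} \
      ...
    """
}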

Happy to discuss any of this in a meeting... I am not sure I am explaining it all well.

container params.FASTP_CONTAINER
label 'cpus_8'
tag "${meta.library_id}-bulk"
publishDir "${params.outdir}/internal/fastp"

Do we need to publish this? I don't think we do, and we probably don't want to, as this will have PHI that we would have to remember to delete.

// create tuple of (metadata map, [Read 1 files], [Read 2 files])
bulk_reads_ch = bulk_channel
  .map{meta -> tuple(meta,
                     file("s3://${meta.s3_prefix}/*_R1_*.fastq.gz"),
                     file("s3://${meta.s3_prefix}/*_R2_*.fastq.gz"))}

Are there any cases where read1 will be more than one file? Does fastp handle that properly?


fastp --in1 ${read1} \
${meta.technology == 'paired_end' ? "--in2 ${read2}":""} \
--out1 ${trimmed_reads}/${meta.library_id}-trimmed-R1.fastq.gz \

Do we need the library_id in the output here? I don't think it much matters, since the directory is named, but if we don't need it we could simplify this a bit.

@allyhawkins (Member, Author)

As I was working on this, I just wanted to answer some of your questions and get your thoughts:

Are there cases where the input reads are split into multiple R1 and R2 files? If so, how does fastp handle that?

Yes, there are cases with multiple input files for both R1 and R2. I have been testing with these samples, and fastp actually does okay: it takes the whole list of files and outputs a single trimmed file.

Should the output from the fastp process just be individual files? I would tend to say yes, but part of me is worried about the case where output files are blank. I also worry about the inputs being blank, in the case of single end reads. It may end up that we want to simply have separate processes for SE and PE, rather than trying to accommodate both.

So I originally had this broken up into a path for trimmed R1 and a path for trimmed R2 in the output of the fastp process. However, when you pass a SE library through, that doesn't work: Nextflow complains about having an empty output for the trimmed R2, which is why I combined them into an output directory rather than individual paths. If you would prefer the more explicit outputs of paths to each file, then I think we will have to break it up into separate SE and PE processes. Would you prefer that? If so, I will go ahead and do that now.

If nextflow doesn't like passing blank files, that could cause trouble for SE samples anyway: maybe we should make the input to the fastp process just the directory with all the fastq files, and then do different things in the script based on meta.technology? This would basically move the R1.fastq.gz glob to the process rather than the workflow.

Passing around blank files doesn't seem to be an issue. Right now this works for single-end: the fastp step only has the R1 files and only outputs R1 files. I don't see a reason to move the glob right now, since Nextflow isn't complaining about blank files here.

@jashapiro (Member)

Yes, there are cases with multiple input files for both R1 and R2. I have been testing with these samples, and fastp actually does okay: it takes the whole list of files and outputs a single trimmed file.

Okay, great! That was what I hoped it would do, but the docs didn't seem to be clear on the behavior. Just to confirm, though... you verified that it isn't just processing the first file?

So I originally had this broken up into a path for trimmed R1 and a path for trimmed R2 in the output of the fastp process. However, when you pass a SE library through, that doesn't work: Nextflow complains about having an empty output for the trimmed R2, which is why I combined them into an output directory rather than individual paths.

There is an ability to use optional outputs so nextflow doesn't complain that a file is missing, but I hadn't really looked at how it works in this context. I am not sure how it plays with tuples, in particular.
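
For reference, the optional output syntax looks roughly like this for a standalone path declaration (a sketch; whether it can be applied to individual elements of a tuple is exactly the part I haven't checked):

output:
    // nextflow won't fail when this file is absent (e.g. for SE libraries)
    path("${meta.library_id}_R2_trimmed.fastq.gz"), optional: true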

I don't feel too strongly about whether we do files or directories, but I don't generally like it when the internal nextflow script depends on constructing a particular file to look for, as it means that any changes in the upstream process can break the downstream one. So if we decided to not do fastp, we would have more to modify. Not a huge deal, but it can be annoying.

So in this case I might make the fastp output files something more generally compatible like:
${trimmed_reads}/${meta.library_id}_R1_trimmed.fastq.gz

(Note the _R1_ is preserved from normal Illumina file names.) This is to enable us to use a more generic salmon input like:

input:
  tuple val(meta), path(read_dir)

...

"""
salmon quant -i ${params.bulk_index} \
        --libType A \
        -1 ${read_dir}/*_R1_*.fastq.gz \
        ${meta.technology == 'paired_end' ? "-2 ${read_dir}/*_R2_*.fastq.gz" : ""} \
...

This will work with any read directory, trimmed or untrimmed.

@allyhawkins (Member, Author)

Thanks for the helpful comments on getting this set up, @jashapiro! I made some changes based on your last comments, and everything works well except the salmon process for single-end libraries.

Yes, there are cases with multiple input files for both R1 and R2. I have been testing with these samples, and fastp actually does okay: it takes the whole list of files and outputs a single trimmed file.

Okay, great! That was what I hoped it would do, but the docs didn't seem to be clear on the behavior. Just to confirm, though... you verified that it isn't just processing the first file?

After you asked this, I went back into the fastp output to make sure that it was processing all of the files, and it actually wasn't: it was only processing the first input file, without throwing a warning about the extra input files... To solve this, I concatenated all of the fastqs for R1 and R2 within the process and used the merged fastq files as input to fastp. When I checked the output, I saw that it was using all three input files from the samples I was testing.
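
Concretely, the merge step inside the process script is just (a sketch consistent with the merged file names in the fastp call quoted below; the paired_end guard mirrors the ternaries used elsewhere in this module):

# merge per-lane gzipped fastqs into one file per read
cat ${read1} > ${meta.library_id}_R1_merged.fastq.gz
${meta.technology == 'paired_end' ? "cat ${read2} > ${meta.library_id}_R2_merged.fastq.gz" : ""}

(Concatenated gzip streams are themselves valid gzip, so the per-lane files can be merged without decompressing.)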

I also went ahead and updated the file name structure to be more generally compatible, as you had suggested, and incorporated the suggestion below into the salmon process.

input:
  tuple val(meta), path(read_dir)

...

"""
salmon quant -i ${params.bulk_index} \
        --libType A \
        -1 ${read_dir}/*_R1_*.fastq.gz \
        ${meta.technology == 'paired_end' ? "-2 ${read_dir}/*_R2_*.fastq.gz" : ""} \
...

However, while the changes now work well for paired-end libraries (fastp and salmon both complete successfully and give the expected output files), this isn't the case for single-end. With a single-end library, fastp completes successfully and gives the expected trimmed output file, but the salmon process then errors, saying that it expects a file for R2 and has detected a paired-end library, even though it is single-end. I have tried automatic library detection, -l SR, and -l U, and all of them give the same error. I've also tried multiple single-end libraries, and each gives the same error. I'm including an example of the error message here for reference. Any thoughts on how to go about this @jashapiro?
[Screenshot: example of the salmon error message described above]

@jashapiro (Member)

Ah, so looking at the salmon docs again, if you only have SE reads, you need to use -r file.fastq.gz rather than -1. I would try that with -l A first, as we would rather have automatic detection of strand if we can.
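
So for a single-end library the call would look something like this (a sketch; the -o output directory name is illustrative):

salmon quant -i ${params.bulk_index} \
    --libType A \
    -r ${read_dir}/*_R1_*.fastq.gz \
    -o ${meta.library_id}-salmon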

One other minor thought: writing out the concatenated files before fastp is a bit wasteful, and I think you can use named pipes to save a bit of disk space (and the overhead of writing to disk). To do that, you would use something like the following code:

# create named pipes so fastp can read the merged streams
# without writing intermediate files to disk
mkfifo R1merge
mkfifo R2merge

# stream the concatenated fastqs into the pipes in the background
cat ${read1} > R1merge &
cat ${read2} > R2merge &

fastp --in1 R1merge --in2 R2merge

(The &s are critical, or the whole thing will hang)

@allyhawkins (Member, Author)

Ah, so looking at the salmon docs again, if you only have SE reads, you need to use -r file.fastq.gz rather than -1. I would try that with -l A first, as we would rather have automatic detection of strand if we can.

Thank you @jashapiro! I knew it was something small I was missing... I updated the workflow based on this comment and everything is good to go! I will note that I tried to implement the named-pipe changes you suggested above, and I made sure I had the &, but it just hangs at the fastp process... Any suggestions? Other than that, everything is processing as expected and is ready for review.

@allyhawkins marked this pull request as ready for review October 25, 2021 19:13
@jashapiro (Member)

Thank you @jashapiro! I knew it was something small I was missing... I updated the workflow based on this comment and everything is good to go! I will note that I tried to implement the named-pipe changes you suggested above, and I made sure I had the &, but it just hangs at the fastp process... Any suggestions? Other than that, everything is processing as expected and is ready for review.

I do have another suggestion! Or rather, Vince Buffalo does: process substitution.
(Honestly, this is what I was trying to think of first, but couldn't remember)

Try something like:

fastp --in1 <(cat ${read1}) --in2 <(cat ${read2})

(no need for the mkfifo commands!)

@jashapiro (Member) left a comment

I haven't tested this myself, but it looks good. I made a previous comment about testing process substitution, which I implemented here as a suggestion (along with some controversially long lines...)

Other than that, and one other little suggestion, it looks good. I haven't tested it, but if it works, feel free to merge with whatever changes you prefer.

Comment on lines 21 to 24
fastp --in1 ${meta.library_id}_R1_merged.fastq.gz \
${meta.technology == 'paired_end' ? "--in2 ${meta.library_id}_R2_merged.fastq.gz" : ""} \
--out1 ${trimmed_reads}/${meta.library_id}_R1_trimmed.fastq.gz \
${meta.technology == 'paired_end' ? "--out2 ${trimmed_reads}/${meta.library_id}_R2_trimmed.fastq.gz" : ""} \

It might make the lines long, but I always prefer fewer if statements if we can get away with it:

Suggested change
fastp --in1 ${meta.library_id}_R1_merged.fastq.gz \
${meta.technology == 'paired_end' ? "--in2 ${meta.library_id}_R2_merged.fastq.gz" : ""} \
--out1 ${trimmed_reads}/${meta.library_id}_R1_trimmed.fastq.gz \
${meta.technology == 'paired_end' ? "--out2 ${trimmed_reads}/${meta.library_id}_R2_trimmed.fastq.gz" : ""} \
fastp --in1 ${meta.library_id}_R1_merged.fastq.gz --out1 ${trimmed_reads}/${meta.library_id}_R1_trimmed.fastq.gz \
${meta.technology == 'paired_end' ? "--in2 ${meta.library_id}_R2_merged.fastq.gz --out2 ${trimmed_reads}/${meta.library_id}_R2_trimmed.fastq.gz" : ""} \

I'm assuming here that fastp doesn't care about argument order. It shouldn't, and the fact that salmon does (for some things) is always a mystery to me!

@allyhawkins (Member, Author)

I was able to implement the second suggestion of removing the extra if statements. Unfortunately, it seems like fastp doesn't like the process substitution method: the process completes, but it results in an empty fastq file, so when that is used as input to the salmon process it fails. I checked the work directories too, and I can see the files are output with the new trimmed names, but they are completely empty. I reverted to using cat outside of the fastp call and saving the merged file, even if it does mean taking up more disk space.

@jashapiro (Member) left a comment

LGTM!

Sorry about the dead ends with trying to avoid writing to disk (which wasn't really about space as much as the speed of writing and reading). I have one last idea here, but if it fails I am fine with keeping things as they are.
