Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multithreaded support from Cutadapt in trim_galore #16

Closed
ajphipps opened this issue Jan 30, 2018 · 18 comments
Closed

Multithreaded support from Cutadapt in trim_galore #16

ajphipps opened this issue Jan 30, 2018 · 18 comments

Comments

@ajphipps
Copy link

Dear Felix,

I really like this piece of software and have been testing it for my research, but I was wondering how difficult it would be to implement multithreaded support that is available in basic cutadapt? Is it possible to pass the option or arguement from trim_galore to cutadapt?

My current loop I am using is as follows:
for file in *all.fastq ; do
trim_galore -fastqc_args "-t 60" -o /NGS_Data/Andrew/test/test2/ ${file}
done

Thanks again,
Andrew

@ajphipps
Copy link
Author

ajphipps commented Jan 30, 2018

Sorry just saw the closed issues with questions of multithreaded support - sorry for opening the new issue!

Another thought I had was it might be possible to use GNU parallel to run multiple instances of trim_galore if you have many samples.

@FelixKrueger
Copy link
Owner

Hi Andrew,
thanks for your comments. I can see that it would be 'nice-to-have' from a conceptual point of view to have trimming that is faster. We ourselves are typically dealing with the situation that we need to process hundreds (or even thousands) of samples in parallel (and have some 700 cores available to do so) rather than having a single sample which we would like to process 700 times faster. I suppose having a cluster readily available doesn't exactly create an incentive to work on the speed of trimming. But if demand rises further - I might reconsider...! Cheers, Felix

@ibwoo
Copy link

ibwoo commented May 14, 2018

I'd be very interested in being able to use the new multi-threading features of cutadapt via TrimGalore. Please consider this a +1 for demand!

@FelixKrueger
Copy link
Owner

Noted, thanks.

@ncimb
Copy link

ncimb commented May 21, 2018

+1 for this feature, google brought me here with the same request in mind!

@dpryan79
Copy link

If some nice person were to create a PR implementing this would you consider merging it?

@FelixKrueger
Copy link
Owner

I would definitely consider that. I suppose we, as in 'the nice person', would have to make sure that it still runs fine on machines that don't have Python 3 installed etc though.

While we are at it, did you manage to finish the MultiQC module for SNPsplit reports? Maybe I could find some time to get that done at some point? Cheers, Felix

@dpryan79
Copy link

While we are at it, did you manage to finish the MultiQC module for SNPsplit reports? Maybe I could find some time to get that done at some point? Cheers, Felix

[Devon looks away embarrassed] Right, that, it's still on my to do list

@FelixKrueger
Copy link
Owner

Don't be, it was super cool of you to make a start on it anyways. I was considering to try out how to add something to MultiQC for some time, so I could try to take a look at this if you didn't get round to it.

But regarding Trim Galore, I assume getting the multicore trimming to work would require a check for the version of python that is currently running, as well as whether the parallel zipping is installed. It might get more icky for downstream tests that are performed in Trim Galore itself, such as the length validation, N-trimming etc, which all require the reads to be present in a synchronised fashion between R1 and R2...

@dpryan79
Copy link

Yeah, it'll require some testing before I would want to make a PR, though I can tell you that cutadapt returns R1 and R2 synchronized, since it trims both at once.

Feel free to play around with my current SNPsplit PR and improve it. We're not using it heavily at the moment so there's not a lot of motivation.

@FelixKrueger
Copy link
Owner

Yeah, it'll require some testing before I would want to make a PR, though I can tell you that cutadapt returns R1 and R2 synchronized, since it trims both at once.

That is good because it will make a lot of things easier. It will mean though that we need to move the paired-end validation testing from two consecutive trimming processes into this new trimming mode. This might be a bit of work but was probably overdue anyway.... But yea it would be great if you wanted to start on that, I'm of course happy to assist if possible at all.

Feel free to play around with my current SNPsplit PR and improve it. We're not using it heavily at the moment so there's not a lot of motivation.

Same here, but I am always looking for things to broaden my horizon...

@dpryan79
Copy link

Upon further inspection it looks like implementing this would require an almost complete rewrite of how Trim Galore is internally structured, which is rather more involved than I have time for at the moment. For us the most useful feature of Trim Galore is the automatic detection of the most appropriate adapter sequence, I'll instead check and see if I can implement that in cutadapt directly.

@FelixKrueger
Copy link
Owner

That's pretty much the same conclusion I came to. Given that we never really are in need of having to speed up any single trimming event (as we can run hundreds of them in parallel), adding multithread support didn't seem to be justified, or at least high up on a priority list.

Thanks for looking anyway. Cheers, Felix

@wolfgangrumpf
Copy link

Another vote for making Trim Galore/Cutadapt multi-threaded, if possible.....

@FelixKrueger
Copy link
Owner

Thanks for the vote Wolfgang. In my preliminary tests (see issue #39) I found that multi-core support doesn't exactly look worthwhile (certainly not specifying beyond 2 cores...):

impact

Maybe indeed splitting the input into junks would be another option to speed things up?

@fjames003
Copy link
Contributor

@FelixKrueger, were these tests ran with pigz installed? In my tests, running Trim-galore with more than two cores (using code from #39) provided a speed up compared to two cores alone. Maybe you are be limited by how fast gzip can process the fastq files?

@fjames003
Copy link
Contributor

@FelixKrueger, I provided some benchmarks from my system, as well as updated the code to do some checks. Please check out the updates in #39.

@FelixKrueger
Copy link
Owner

No, these tests were run without pigz installed. I'll make sure to give #39 a try soon. Thanks for your work on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants