182 - Sort fasta before axa step by rbdavid · Pull Request #339 · EnzymeFunctionInitiative/EST

rbdavid · 2026-06-01T21:04:36Z

Past work on the duckdb analyses of blast output (within the all_by_all_blast(), blastreduce(), and restore_condensed() process blocks) should already standardize the lexicographical sorting, which should address the concerns in the original post of #182. But there's additional sorting we can do to standardize how the input fasta shards for the all_by_all_blast() process are created. This change standardizes the organization of those fasta files and load-balances the sequences across the shards. The load balancing should distribute the computationally expensive (long) sequences more evenly, and so avoid cases where a random subset of all_by_all_blast() instances takes much longer than the rest.
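To illustrate the load-balancing idea, here is a minimal Python sketch. It is not the code in this PR, which relies on seqkit; the function name `shard_by_length`, the toy sequences, and the round-robin dealing strategy are all assumptions made for the example.

```python
from typing import Dict, List


def shard_by_length(seqs: Dict[str, str], n_shards: int) -> List[List[str]]:
    """Deal sequence IDs into n_shards, longest first, round-robin.

    Hypothetical helper for illustration only.
    """
    if n_shards < 1:
        raise ValueError("n_shards must be >= 1")
    # Sort by length (descending); break ties by ID so the order is deterministic.
    ordered = sorted(seqs, key=lambda sid: (-len(seqs[sid]), sid))
    shards: List[List[str]] = [[] for _ in range(n_shards)]
    for i, sid in enumerate(ordered):
        # Consecutive (similarly sized) sequences land in different shards.
        shards[i % n_shards].append(sid)
    return shards


if __name__ == "__main__":
    # Toy data: two long sequences and three short ones.
    toy = {"A": "M" * 900, "B": "M" * 850, "C": "M" * 300, "D": "M" * 120, "E": "M" * 100}
    for idx, shard in enumerate(shard_by_length(toy, 2)):
        total = sum(len(toy[sid]) for sid in shard)
        print(f"shard {idx}: {shard} total residues = {total}")
```

Because the input is sorted before dealing, the two longest sequences in the toy set end up in different shards instead of possibly sharing one, and the tie-break on ID keeps the shard contents reproducible between runs.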

Closes #182

nilsoberg

It looks good. The only thing I would check -- assuming you haven't yet -- is that the seqkit call doesn't run out of memory with large sequence sets.


rbdavid · 2026-06-02T14:40:37Z

> It looks good. The only thing I would check -- assuming you haven't yet -- is that the seqkit call doesn't run out of memory with large sequence sets.

That kinda bleeds into what I was envisioning as my next major task: revamping the nextflow configuration files to better control how memory- and compute-hungry process blocks are handled and how resources are allocated. As the seqkit documentation notes, sort and split2 can be memory hungry (I use the --two-pass arg to try to minimize this). Also, seqkit has a --threads argument (set to 4 by default), so we need to control that argument explicitly. All in all, any process block that uses seqkit should probably be given its own process executor with extra allocation for memory and threads.
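As a rough sketch of what controlling those arguments explicitly could look like, the Python snippet below builds a `seqkit sort` command line with the thread count passed in rather than left at the default. The helper name `seqkit_sort_cmd` and the file names are hypothetical, and the exact flags used in this PR are not shown in the conversation; the flags here are ones seqkit documents for `sort`. In a Nextflow process the thread count would presumably come from the process's CPU allocation.

```python
import shlex
from typing import List


def seqkit_sort_cmd(fasta: str, out: str, threads: int) -> List[str]:
    """Build a `seqkit sort` call: longest first, two-pass mode, explicit threads.

    Hypothetical helper; it only assembles the argument list and runs nothing.
    """
    if threads < 1:
        raise ValueError("threads must be >= 1")
    return [
        "seqkit", "sort",
        "--by-length",              # sort by sequence length
        "--reverse",                # longest sequences first
        "--two-pass",               # two-pass mode to lower memory use
        "--threads", str(threads),  # override seqkit's default of 4
        "--out-file", out,
        fasta,
    ]


if __name__ == "__main__":
    # Placeholder file names, for illustration only.
    cmd = seqkit_sort_cmd("sequences.fasta", "sorted.fasta", threads=2)
    print(shlex.join(cmd))
```

Keeping the thread count as a required parameter means the caller has to decide it, so the command can never silently use more CPUs than the scheduler allocated.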

rbdavid · 2026-06-02T14:44:27Z

The discussion above is related, in part, to #301 and #191.

rbdavid added 2 commits June 1, 2026 15:53

Sort sequences largest to smallest before splitting the set into shards

d837828

Update process block name where its used

7510ac6

rbdavid self-assigned this Jun 1, 2026

rbdavid requested a review from nilsoberg June 1, 2026 21:04

nilsoberg approved these changes Jun 1, 2026

Add memory efficient file reading call and explicitly set the number of processors to be used

b15c49b

Remove incorrect arg

11e1154

rbdavid wants to merge 4 commits into nextflow-test from 182-Sort-fasta-before-axa-step
