182 - Sort fasta before axa step #339

Open: rbdavid wants to merge 4 commits into nextflow-test from 182-Sort-fasta-before-axa-step

rbdavid (Contributor) commented Jun 1, 2026

Past work on the duckdb analyses of blast output (within the all_by_all_blast(), blastreduce(), and restore_condensed() process blocks) should already standardize the lexicographical sorting, which should address the concerns raised in the original post of #182. But there is additional sorting we can do to standardize how the input fasta shards for the all_by_all_blast() process are created. This change standardizes the organization of those fasta files and load-balances the sequences across the shards. The load balancing should distribute computationally expensive (long) sequences more evenly, and so avoid cases where a random subset of all_by_all_blast() instances takes much longer than the rest.
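To illustrate the load-balancing idea, here is a minimal Python sketch. It is not the code in this PR (which relies on seqkit); it assumes a greedy "longest first, into the currently lightest shard" strategy, and the names `balance_shards`, `lengths`, and `n_shards` are hypothetical. Ties are broken lexicographically by sequence ID so the assignment is reproducible.

```python
import heapq
from typing import Dict, List


def balance_shards(lengths: Dict[str, int], n_shards: int) -> List[List[str]]:
    """Assign sequence IDs to shards so total length per shard is roughly even.

    Illustrative only: a greedy heuristic, not the seqkit-based approach in the PR.
    """
    if n_shards < 1:
        raise ValueError("n_shards must be at least 1")

    # Longest sequences first; ties broken by ID so the order is deterministic.
    ordered = sorted(lengths.items(), key=lambda kv: (-kv[1], kv[0]))

    # Min-heap of (total length so far, shard index).
    heap = [(0, i) for i in range(n_shards)]
    shards: List[List[str]] = [[] for _ in range(n_shards)]

    for seq_id, length in ordered:
        total, idx = heapq.heappop(heap)   # lightest shard so far
        shards[idx].append(seq_id)
        heapq.heappush(heap, (total + length, idx))

    return shards
```

With lengths `{"a": 100, "b": 90, "c": 10, "d": 5, "e": 5}` and two shards, both shards end up with a total length of 105, and the result does not depend on the input order.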

Closes #182

@rbdavid rbdavid self-assigned this Jun 1, 2026
@rbdavid rbdavid requested a review from nilsoberg June 1, 2026 21:04
nilsoberg (Collaborator) left a comment

It looks good. The only thing I would check -- assuming you haven't yet -- is that the seqkit call doesn't run out of memory with large sequence sets.

rbdavid (Contributor, Author) commented Jun 2, 2026

> It looks good. The only thing I would check -- assuming you haven't yet -- is that the seqkit call doesn't run out of memory with large sequence sets.

That bleeds into what I was envisioning as my next major task: revamping the nextflow configuration files to better control how memory/compute-hungry process blocks are handled and how resources are allocated. As the seqkit documentation notes, sort and split2 can be memory hungry (I use the --two-pass arg to try to minimize this). Also, seqkit has a --threads argument (set to 4 by default), so we need to set that argument explicitly. All in all, any process block that uses seqkit should probably be given its own process executor with extra allocation for memory and threads.
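As a rough sketch of what setting the thread count explicitly could look like, the Python helper below builds a `seqkit sort` command line with `--threads` and `--two-pass` spelled out. The helper name `seqkit_sort_cmd`, the choice of `--by-length --reverse`, and the file names are assumptions for illustration; the exact flags used in this PR may differ.

```python
from typing import List


def seqkit_sort_cmd(fasta_in: str, fasta_out: str, threads: int = 1) -> List[str]:
    """Build an argv list for a length-sorted, two-pass `seqkit sort` call.

    Hypothetical helper: the sort order chosen here is an assumption,
    not necessarily what the pipeline uses.
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return [
        "seqkit", "sort",
        "--by-length",               # order records by sequence length
        "--reverse",                 # longest sequences first
        "--two-pass",                # read the FASTA twice to lower memory use
        "--threads", str(threads),   # override seqkit's default of 4
        "--out-file", fasta_out,
        fasta_in,
    ]
```

The returned list could be passed to `subprocess.run(cmd, check=True)`. Inside a Nextflow process, the thread count would more naturally come from the CPUs allocated to the task, so that the request and the actual usage stay in sync.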

rbdavid (Contributor, Author) commented Jun 2, 2026

The discussion above is related, in part, to #301 and #191.

Development

Successfully merging this pull request may close these issues.

Ensure BLAST results are consistent
