blastreduce code should not remove self-alignment lines #258

@rbdavid

The self-alignment lines in the all-by-all BLAST result files need to be kept, particularly when the sequence set is compressed/reduced before running the alignments. The line below needs to be removed from the blastreduce SQL:

WHERE NOT qseqid = sseqid
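To illustrate what that clause drops, here is a minimal sketch. It uses Python's built-in SQLite as a stand-in for DuckDB, and the table name `blast`, the three-column layout, and the sequence IDs are all made up for the example; the real template will have more columns and a different query shape.

```python
import sqlite3

# In-memory SQLite stands in for DuckDB here (assumption for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blast (qseqid TEXT, sseqid TEXT, bitscore REAL)")

# Hypothetical all-by-all hits. Sequence C only aligns to itself.
conn.executemany(
    "INSERT INTO blast VALUES (?, ?, ?)",
    [
        ("A", "A", 600.0),
        ("A", "B", 450.0),
        ("B", "B", 620.0),
        ("C", "C", 580.0),
    ],
)

# With the clause in question: every self-alignment row is discarded.
with_filter = conn.execute(
    "SELECT qseqid, sseqid FROM blast "
    "WHERE NOT qseqid = sseqid ORDER BY qseqid, sseqid"
).fetchall()

# Without the clause: self-alignment rows survive.
without_filter = conn.execute(
    "SELECT qseqid, sseqid FROM blast ORDER BY qseqid, sseqid"
).fetchall()

print(with_filter)
print(without_filter)
```

In this toy data, sequence C has no row at all in the filtered output, which is the kind of loss this issue is about.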

In the same PR, the logic of blastreduce's reduce-template.sql should be verified to replicate the original Perl code's results. Comments in that SQL template file say that it does not exactly match the original code:

-- The original blastreduce uses a sort + picking the first occurrence of a
-- (qseqid, sseqid) pair to deduplicate. That is replicated here by paritioning
-- on (qseqid, sseqid) and assigning a rank to a specific sorted order. It does
-- not exactly match the previous method but it is extremly close
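For reference when comparing outputs, the "sort, then keep the first occurrence of each (qseqid, sseqid) pair" idea described in that comment can be sketched as below. This is not the original Perl code: the column set, the sort order (bitscore, then percent identity, then alignment length, all descending), and the sample rows are assumptions chosen only to make the example concrete.

```python
from itertools import groupby

# Hypothetical rows: (qseqid, sseqid, pident, alignment_length, bitscore)
rows = [
    ("A", "A", 100.0, 300, 600.0),
    ("A", "B", 90.0, 280, 450.0),
    ("A", "B", 85.0, 150, 200.0),  # weaker duplicate of the (A, B) pair
    ("B", "A", 90.0, 280, 450.0),
    ("B", "B", 100.0, 310, 620.0),
]


def sort_key(row):
    qseqid, sseqid, pident, length, bitscore = row
    # Assumed ordering: group rows by pair, best hit first within each pair.
    return (qseqid, sseqid, -bitscore, -pident, -length)


def first_per_pair(rows):
    """Sort the rows, then keep only the first row seen for each pair."""
    ordered = sorted(rows, key=sort_key)
    return [next(group) for _, group in groupby(ordered, key=lambda r: r[:2])]


reduced = first_per_pair(rows)
for row in reduced:
    print(row)
```

Self-alignment rows such as (A, A) pass through this kind of deduplication untouched. One place a window-function version could diverge from sort-and-take-first is ties in the sort columns: `RANK()` gives tied rows the same rank, while `ROW_NUMBER()` picks one of them in an order that is not guaranteed unless the `ORDER BY` is fully deterministic. I have not checked which of these the template uses.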

I haven't done any speed testing, but I highly doubt that processing the BLAST result file(s) is the bottleneck for the EST workflow. Is this DuckDB strategy worth the slight inaccuracies?
