The self-alignment lines in the all-by-all BLAST result files need to be retained, particularly when the sequence set is compressed/reduced before running the alignments. The line below needs to be removed from reduce-template.sql:

    WHERE NOT qseqid = sseqid
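
To make the effect of that filter concrete, here is a minimal sketch. It assumes a made-up three-column table (`qseqid`, `sseqid`, `bitscore`) and uses Python's built-in `sqlite3` as a stand-in for DuckDB; the table name `blast`, the rows, and the helper `ids_seen` are all hypothetical and not taken from the pipeline. It only illustrates one visible consequence of the filter: a sequence whose only hit is itself disappears from the result entirely.

```python
import sqlite3

# Hypothetical all-by-all BLAST rows: (qseqid, sseqid, bitscore).
rows = [
    ("A", "A", 200.0),  # self-alignment
    ("A", "B", 150.0),
    ("B", "A", 150.0),
    ("B", "B", 180.0),  # self-alignment
    ("C", "C", 90.0),   # C only aligns to itself
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE blast (qseqid TEXT, sseqid TEXT, bitscore REAL)")
con.executemany("INSERT INTO blast VALUES (?, ?, ?)", rows)


def ids_seen(where_clause=""):
    """Return the set of sequence IDs that appear in the selected rows."""
    cur = con.execute(f"SELECT qseqid, sseqid FROM blast {where_clause}")
    return {seq_id for pair in cur for seq_id in pair}


with_self = ids_seen()
without_self = ids_seen("WHERE NOT qseqid = sseqid")

print("IDs with self-alignments kept:   ", sorted(with_self))
print("IDs with self-alignments removed:", sorted(without_self))
print("IDs lost by the filter:          ", sorted(with_self - without_self))
```

In this toy data, sequence `C` is present only through its self-alignment, so it is the one ID that the filter drops.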
In the same PR, the logic of blastreduce's reduce-template.sql should be verified to replicate the original Perl code's results. Comments in that SQL template file say that it does not exactly match the original code:

    -- The original blastreduce uses a sort + picking the first occurrence of a
    -- (qseqid, sseqid) pair to deduplicate. That is replicated here by paritioning
    -- on (qseqid, sseqid) and assigning a rank to a specific sorted order. It does
    -- not exactly match the previous method but it is extremly close
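
One way the two deduplication strategies could diverge is in how ties are broken. The sketch below is an illustration under assumptions, not the pipeline's actual logic: the sort keys (`bitscore`, then `pident`), the sample rows, and both function names are hypothetical, and `sqlite3` (3.25 or newer, for window functions) stands in for DuckDB. The first function sorts on `bitscore` alone and keeps the first row per pair; the second partitions on `(qseqid, sseqid)` and ranks with an extra tie-breaker.

```python
import sqlite3

# Hypothetical rows: (qseqid, sseqid, pident, bitscore).
rows = [
    ("A", "B", 90.0, 150.0),
    ("A", "B", 95.0, 150.0),  # ties with the row above on bitscore
    ("A", "C", 80.0, 120.0),
    ("A", "C", 70.0, 100.0),
]


def sort_then_first(rows):
    """Sort by bitscore (descending), then keep the first row seen per pair."""
    kept = {}
    # sorted() is stable, so tied rows keep their input order.
    for row in sorted(rows, key=lambda r: -r[3]):
        kept.setdefault((row[0], row[1]), row)
    return sorted(kept.values())


def partition_and_rank(rows):
    """Partition on (qseqid, sseqid) and keep the top-ranked row per pair."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE blast (qseqid TEXT, sseqid TEXT, pident REAL, bitscore REAL)"
    )
    con.executemany("INSERT INTO blast VALUES (?, ?, ?, ?)", rows)
    query = """
        SELECT qseqid, sseqid, pident, bitscore FROM (
            SELECT *, ROW_NUMBER() OVER (
                PARTITION BY qseqid, sseqid
                ORDER BY bitscore DESC, pident DESC  -- assumed sort order
            ) AS rnk
            FROM blast
        ) WHERE rnk = 1
    """
    return sorted(con.execute(query).fetchall())


first_occurrence = sort_then_first(rows)
ranked = partition_and_rank(rows)

print(first_occurrence)
print(ranked)
print("identical:", first_occurrence == ranked)
```

With these rows the two methods agree on the `(A, C)` pair but keep different rows for `(A, B)`, because the tied bitscores are resolved by input order in one case and by `pident` in the other. Running both implementations on a real result file and diffing the outputs would show whether the actual discrepancy is of this kind.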

I haven't done any speed testing, but I highly doubt that processing the BLAST result file(s) is the bottleneck for the EST workflow. Is this DuckDB strategy worth the slight inaccuracies?
Code referenced above: EST/pipelines/est/templates/reduce-template.sql, line 10 and lines 14 to 17 in 2c470f5.