blastreduce code should not remove self-alignment lines

The self-alignment lines in the all-by-all BLAST result files need to be kept, particularly when the sequence set is compressed/reduced before running the alignments. The line below needs to be removed:

https://github.com/EnzymeFunctionInitiative/EST/blob/2c470f520844460b5e8982bfc4e41b33651cc09b/pipelines/est/templates/reduce-template.sql#L10-L10

In the same PR, the logic in blastreduce's reduce-template.sql should be verified to replicate the original Perl code's results. Comments in that SQL template file say that it does not exactly match the original code.

https://github.com/EnzymeFunctionInitiative/EST/blob/2c470f520844460b5e8982bfc4e41b33651cc09b/pipelines/est/templates/reduce-template.sql#L14-L17

I haven't done any speed testing, but I highly doubt that processing the BLAST result file(s) is the bottleneck for the EST workflow. Is this DuckDB strategy worth the slight inaccuracies?

	-- The original blastreduce uses a sort + picking the first occurrence of a
	-- (qseqid, sseqid) pair to deduplicate. That is replicated here by paritioning
	-- on (qseqid, sseqid) and assigning a rank to a specific sorted order. It does
	-- not exactly match the previous method but it is extremly close
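
For reference when verifying the SQL, here is a minimal Python sketch of the "sort, then keep the first occurrence of each (qseqid, sseqid) pair" behavior described in that comment, with self-alignment lines left in place. It is not the original Perl code: the `Hit` type, the `reduce_hits` function, and the choice of `bitscore` (descending) as the sort key are assumptions made for illustration, and the real sort order should be taken from the original blastreduce.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    """One BLAST tabular hit; only the fields needed for the sketch."""
    qseqid: str
    sseqid: str
    bitscore: float  # assumed sort key, not confirmed against the Perl code


def reduce_hits(hits: Iterable[Hit]) -> List[Hit]:
    """Sort hits, then keep the first hit seen for each (qseqid, sseqid) pair.

    Self-alignments (qseqid == sseqid) are deliberately NOT filtered out.
    """
    # Best-scoring hit of each pair sorts first (assumed ordering).
    ordered = sorted(hits, key=lambda h: (h.qseqid, h.sseqid, -h.bitscore))
    seen = set()
    kept = []
    for hit in ordered:
        pair = (hit.qseqid, hit.sseqid)
        if pair not in seen:
            seen.add(pair)
            kept.append(hit)
    return kept


hits = [
    Hit("A", "A", 200.0),  # self-alignment, must survive
    Hit("A", "B", 50.0),
    Hit("A", "B", 80.0),   # duplicate pair with the higher score
    Hit("B", "B", 150.0),  # self-alignment, must survive
]
reduced = reduce_hits(hits)
for hit in reduced:
    print(hit.qseqid, hit.sseqid, hit.bitscore)
```

Running the same small input through the DuckDB template and comparing the output against a sketch like this (or against the Perl script itself) would show whether the two agree. Hits that tie on the sort key are one place where a stable sort and a window-function rank could plausibly diverge, though that is a guess rather than a confirmed cause of the mismatch.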

blastreduce code should not remove self-alignment lines #258
