Ensure BLAST results are consistent #182

@nilsoberg

Description

The EST pipeline can produce different results depending on the order in which the cat_fasta_files pipeline step outputs sequences. Different sequence pairs can have identical scores. The inconsistency is likely caused by the blastreduce pipeline step, which always takes a specific pair of sequences (likely the first one). One solution is to order the sequences, either during the cat_fasta_files step or in pipelines/est/templates/reduce-template.sql.

Update the pipeline to consistently return results no matter the order of the input sequences.
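
To illustrate what "consistent no matter the order" could mean for the reduce step, here is a minimal Python sketch. It is not the blastreduce implementation: the `Hit` fields, the `reduce_hits` name, and the assumption that one hit is kept per unordered sequence pair are all hypothetical. The point is only that a tie between equal scores is broken by the hit's own fields rather than by its position in the input.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    # Hypothetical subset of the columns in a BLAST hit table.
    query: str
    subject: str
    score: float
    alignment_length: int


def reduce_hits(hits: Iterable[Hit]) -> List[Hit]:
    """Keep one hit per unordered (query, subject) pair, independent of input order."""
    best = {}
    for hit in hits:
        # Treat (A, B) and (B, A) as the same pair.
        pair = tuple(sorted((hit.query, hit.subject)))
        # Higher score wins; equal scores fall back to the remaining fields,
        # so the winner never depends on which hit was seen first.
        rank = (-hit.score, hit.query, hit.subject, hit.alignment_length)
        if pair not in best or rank < best[pair][0]:
            best[pair] = (rank, hit)
    # Emit pairs in sorted order so the output itself is reproducible.
    return [best[pair][1] for pair in sorted(best)]
```

In SQL, the analogous change would be an explicit and complete `ORDER BY` wherever a single row per pair is selected, but whether that applies to reduce-template.sql has not been checked here.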

A simple sort can be added in cat_fasta_files; however, alternatives should be examined to determine the best solution.
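
As a reference point for comparing alternatives, the sort could be sketched in Python as below. This is an illustration, not code from the pipeline: `sort_fasta` is a hypothetical helper, and it assumes each record starts with a `>` header whose first whitespace-delimited token is the sequence ID. Records are ordered by ID and then by their full content, so duplicate IDs also come out in the same order regardless of input order.

```python
def sort_fasta(text: str) -> str:
    """Return FASTA text with records sorted by header ID, then by full content."""
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            # A header line starts a new record.
            records.append([line])
        elif line and records:
            # Sequence lines stay attached to their header, in original order.
            # Blank lines and anything before the first header are dropped.
            records[-1].append(line)
    # Primary key: the ID token of the header. Secondary key: the whole record,
    # which makes the order of records sharing an ID independent of input order.
    records.sort(key=lambda rec: (rec[0].split()[0], rec))
    return "".join(line + "\n" for rec in records for line in rec)
```

Two details are worth weighing against a shell-based sort: `sort -s` keeps records with equal keys in their input order, so duplicate IDs would still depend on input order, and the collation used by `sort` depends on the locale unless it is pinned (for example with `LC_ALL=C`). Python string comparison is by code point and does not vary with locale.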

diff --git a/pipelines/est/est.nf b/pipelines/est/est.nf
index 9274ee3..8d150ad 100644
--- a/pipelines/est/est.nf
+++ b/pipelines/est/est.nf
@@ -102,9 +102,15 @@ process cat_fasta_files {
         """
         $cat_cmd
         perl $projectDir/import/append_blast_query.pl --blast-query-file ${params.blast_query_file} --output-sequence-file all_sequences.fasta
+        awk '/^>/{key=\$1} {print key, \$0}' all_sequences.fasta | sort -k1,1 -s | cut -d' ' -f2- > sorted.fasta
+        mv sorted.fasta all_sequences.fasta
         """
     } else {
-        cat_cmd
+        """
+        $cat_cmd
+        awk '/^>/{key=\$1} {print key, \$0}' all_sequences.fasta | sort -k1,1 -s | cut -d' ' -f2- > sorted.fasta
+        mv sorted.fasta all_sequences.fasta
+        """
     }
 }
