The EST pipeline can produce different results depending on the order of the sequences output by the cat_fasta_files pipeline step. Different sequence pairs can have identical scores. This is likely caused by the blastreduce pipeline step, which always keeps one specific pair of sequences (likely the first one), so the pair that is kept depends on input order. One solution is to order the sequences during the cat_fasta_files step or in pipelines/est/templates/reduce-template.sql.
Update the pipeline to return consistent results regardless of the order of the input sequences.
A simple sort can be added in cat_fasta_files; however, alternatives should be examined to determine the best solution.
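One alternative to sorting the input is to make the tie-break in the reduce step itself independent of row order. The sketch below illustrates the idea in Python only; it is not the actual blastreduce code or the contents of reduce-template.sql. The `Hit` fields, `reduce_hits`, and the ranking tuple are all hypothetical names chosen for illustration.

```python
from typing import Iterable, NamedTuple


class Hit(NamedTuple):
    # Hypothetical record layout; the real pipeline's columns may differ.
    query: str
    subject: str
    score: float
    alignment_length: int


def _rank(hit: Hit) -> tuple:
    # Total order over hits: score first, then fields that break ties
    # without looking at the position of the row in the input.
    return (hit.score, hit.alignment_length, hit.query, hit.subject)


def reduce_hits(hits: Iterable[Hit]) -> list[Hit]:
    """Keep one hit per unordered sequence pair, chosen by _rank."""
    best: dict[tuple[str, str], Hit] = {}
    for hit in hits:
        # (A, B) and (B, A) describe the same pair of sequences.
        pair = tuple(sorted((hit.query, hit.subject)))
        current = best.get(pair)
        if current is None or _rank(hit) > _rank(current):
            best[pair] = hit
    # Emit pairs in a fixed order so the output is reproducible too.
    return [best[pair] for pair in sorted(best)]
```

Because the winner is chosen by comparing field values rather than by which row arrives first, any permutation of the same hits should give the same output. In SQL, a comparable effect would presumably come from an explicit, fully specified ORDER BY on the tie-breaking columns.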
diff --git a/pipelines/est/est.nf b/pipelines/est/est.nf
index 9274ee3..8d150ad 100644
--- a/pipelines/est/est.nf
+++ b/pipelines/est/est.nf
@@ -102,9 +102,15 @@ process cat_fasta_files {
"""
$cat_cmd
perl $projectDir/import/append_blast_query.pl --blast-query-file ${params.blast_query_file} --output-sequence-file all_sequences.fasta
+ awk '/^>/{key=\$1} {print key, \$0}' all_sequences.fasta | sort -k1,1 -s | cut -d' ' -f2- > sorted.fasta
+ mv sorted.fasta all_sequences.fasta
"""
} else {
- cat_cmd
+ """
+ $cat_cmd
+ awk '/^>/{key=\$1} {print key, \$0}' all_sequences.fasta | sort -k1,1 -s | cut -d' ' -f2- > sorted.fasta
+ mv sorted.fasta all_sequences.fasta
+ """
}
}
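To reason about what the awk/sort/cut step in the diff is meant to do, here is a minimal Python sketch of the same idea: group each header line with its sequence lines, then stable-sort the records by the first whitespace-delimited token of the header. `sort_fasta_records` is a hypothetical helper written for illustration and is not part of the pipeline.

```python
def sort_fasta_records(text: str) -> str:
    """Return FASTA text with records ordered by header identifier."""
    records: list[list[str]] = []
    for line in text.splitlines():
        if line.startswith(">"):
            # A header line starts a new record.
            records.append([line])
        elif records:
            # Sequence lines stay attached to the preceding header.
            records[-1].append(line)
    # Python's sort is stable, like `sort -s`, so records with equal keys
    # keep their relative order and lines within a record never move.
    records.sort(key=lambda record: record[0].split()[0])
    return "".join(line + "\n" for record in records for line in record)


if __name__ == "__main__":
    first = ">s2 description\nAAA\nCCC\n"
    second = ">s1\nGGG\n"
    # Concatenation order no longer affects the result.
    print(sort_fasta_records(first + second) == sort_fasta_records(second + first))
```

One difference worth noting: Python compares strings by code point, whereas the collation used by `sort` depends on the locale, so setting `LC_ALL=C` for the shell command may be worth considering if the pipeline runs in different environments.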