Users should be able to submit any sequences in FASTA format, and the workflow should determine whether those sequences are valid input. Sequence names should NOT require any special formatting. Currently, _Segment is required in the user-provided sequence name, although it is not necessary. Sequence names should instead be handled like this:
    import re

    # rec is the FASTA record currently being processed
    seqid, desc, seq = rec.id, rec.description, rec.seq
    # replace non-word, non-digit, non-period or dash characters
    new_seqid = re.sub(r'[^\w.\-]+', '_', seqid)
    # remove leading and trailing underscores
    new_seqid = re.sub(r'^_|_$', '', new_seqid)
    # preserve seq description and document changes in FULL seq name
    seq_name = f'{seqid}{" " + desc if desc else ""}'
    new_seq_name = f'{new_seqid}{" " + desc if desc else ""}'
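To see what these substitutions do in practice, here is a minimal, self-contained sketch that applies them to a plain header string instead of a parsed record. The helper name `sanitize_header` and the example headers are hypothetical, and splitting the header at the first whitespace is only an assumption about how the ID and description are separated.

```python
import re
from typing import Tuple


def sanitize_header(header: str) -> Tuple[str, str]:
    """Return (old_name, new_name) for one FASTA header line (hypothetical helper)."""
    # Assumption: the ID is everything up to the first whitespace and the
    # remainder of the line is the description.
    parts = header.lstrip(">").strip().split(None, 1)
    if not parts:
        raise ValueError("Empty FASTA header")
    seqid = parts[0]
    desc = parts[1] if len(parts) > 1 else ""
    # Same two substitutions as in the snippet above.
    new_seqid = re.sub(r"[^\w.\-]+", "_", seqid)
    new_seqid = re.sub(r"^_|_$", "", new_seqid)
    suffix = f" {desc}" if desc else ""
    return f"{seqid}{suffix}", f"{new_seqid}{suffix}"


if __name__ == "__main__":
    for header in (">A/chicken/BC/1|HA segment 4", ">(H5N1)"):
        old, new = sanitize_header(header)
        print(f"{old!r} -> {new!r}")
```

The description is carried over unchanged, so only the ID part of the name is rewritten.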
A subworkflow should be created to validate user-specified sequences and ensure that they are valid input:
1. Create a representative gene sequence DB containing a non-redundant set of full-length gene segment sequences, clustered with CD-HIT at a 95% sequence identity threshold (or lower). Sequences with ambiguous bases should be excluded.
2. User input sequences should all be uppercase ASCII and distinct; sequence names can be duplicated, but will be renamed.
3. Run an all-against-all Edlib global alignment of the user sequences against the representative gene segment sequences.
4. Assign each user sequence to a gene segment provided there are fewer than X differences between the user sequence and a representative sequence. If there is no match, fail immediately with an informative message telling the user that one or more of their sequences do not pass the threshold. The value of X would need to be determined with some testing.
5. Format user sequence names in a pipeline-compatible way. The name changes should be documented in a table with 3 columns: sequence index, old name, new name. Duplicated sequence names should be handled with a warning and renamed with an appended -{seq index}.
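The checks described above (everything except building the CD-HIT database) could be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline's implementation. A pure-Python Levenshtein distance stands in for Edlib so the example has no dependencies; in the actual workflow the distance would presumably come from Edlib's Python binding (`edlib.align` with `mode="NW"`). All function names, the segment labels, the example sequences and the `max_diffs` value are hypothetical. Detecting duplicates on the sanitized ID and appending the suffix to every member of a duplicated group are also assumptions.

```python
import re
import warnings
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Global (Levenshtein) edit distance; pure-Python stand-in for Edlib."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = curr
    return prev[-1]


def check_sequences(seqs):
    """Raise ValueError unless every sequence is uppercase ASCII and distinct."""
    seen = {}
    for idx, seq in enumerate(seqs, start=1):
        if not (seq.isascii() and seq.isupper()):
            raise ValueError(f"Sequence {idx} is not uppercase ASCII")
        if seq in seen:
            raise ValueError(f"Sequence {idx} is identical to sequence {seen[seq]}")
        seen[seq] = idx


def assign_segments(seqs, rep_seqs, max_diffs):
    """Map each 1-based sequence index to the segment of its closest representative.

    rep_seqs maps a segment label to a list of representative sequences.
    Raises ValueError if any sequence has no representative with fewer than
    max_diffs differences.
    """
    assigned, failed = {}, []
    for idx, seq in enumerate(seqs, start=1):
        dist, segment = min(
            (edit_distance(seq, rep), segment)
            for segment, reps in rep_seqs.items()
            for rep in reps
        )
        if dist < max_diffs:
            assigned[idx] = segment
        else:
            failed.append(idx)
    if failed:
        raise ValueError(
            f"Sequences {failed} have {max_diffs} or more differences "
            "from every representative sequence"
        )
    return assigned


def rename_table(names):
    """Return rows of (sequence index, old name, new name)."""
    cleaned = []
    for name in names:
        seqid, _, desc = name.partition(" ")
        new_id = re.sub(r"^_|_$", "", re.sub(r"[^\w.\-]+", "_", seqid))
        cleaned.append((new_id, desc))
    counts = Counter(new_id for new_id, _ in cleaned)
    rows = []
    for idx, (name, (new_id, desc)) in enumerate(zip(names, cleaned), start=1):
        if counts[new_id] > 1:
            warnings.warn(f"Duplicated sequence name {new_id!r}; appending -{idx}")
            new_id = f"{new_id}-{idx}"
        rows.append((idx, name, f"{new_id} {desc}".rstrip()))
    return rows


if __name__ == "__main__":
    reps = {"4_HA": ["ATGAAGGCAA"], "6_NA": ["ATGAATCCAA"]}  # toy representatives
    names = ["sample/1 HA gene", "sample/1 NA gene"]
    seqs = ["ATGAAGGCAT", "ATGAATCCAA"]
    check_sequences(seqs)
    print(assign_segments(seqs, reps, max_diffs=3))
    for row in rename_table(names):
        print(*row, sep="\t")
```

Failures are collected across all sequences before raising, so a single error message can list every sequence that misses the threshold rather than only the first one.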
Currently, _Segment is required in the user-provided sequence name (nf-flu/bin/get_blastn_report.py, line 83 in bdc8942), but it is not necessary.
The full user sequence name should be preserved, not dropped (nf-flu/bin/get_blastn_report.py, line 86 in bdc8942).
The FASTA record description or comment should be preserved instead of being stripped away or ignored (nf-flu/bin/ref_fasta_check.py, lines 27 to 29 in bdc8942).