More robust handling and preprocessing of user input sequences #3

peterk87 · 2022-06-09T15:47:26Z

Users should be able to submit any sequences in FASTA format, and the workflow should determine whether those sequences are valid input. Sequence names should NOT require any special formatting. Currently, _Segment (or _segment) is required in the user-provided sequence name:

nf-flu/bin/get_blastn_report.py

Line 83 in bdc8942

df_blast_result["ref_name"] = df_blast_result["stitle"].str.extract('(.+?)_[sS]egment')

This suffix should not be necessary.
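One way to relax this requirement, as a minimal sketch: strip the suffix when it is present and otherwise fall back to the full title. The helper name ref_name_from_title is hypothetical and not part of nf-flu, and the sketch uses plain re on a single string rather than the pandas call quoted above.

```python
import re

# Matches "_segment" or "_Segment" and everything after it (assumed suffix format).
_SEGMENT_SUFFIX = re.compile(r"_[sS]egment.*$")


def ref_name_from_title(stitle: str) -> str:
    """Return the reference name whether or not the title carries a segment suffix.

    Hypothetical helper: titles without the suffix are returned unchanged,
    so no special formatting is required of user-provided names.
    """
    return _SEGMENT_SUFFIX.sub("", stitle)
```

With this approach, a title that lacks the suffix yields the title itself instead of a missing value.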

The full user sequence name should be preserved, not dropped:

nf-flu/bin/get_blastn_report.py

Line 86 in bdc8942

df_blast_result.drop(columns=["qaccver", "saccver", "stitle"], inplace=True)

The FASTA record description or comment should be preserved instead of stripped away/ignored:

nf-flu/bin/ref_fasta_check.py

Lines 27 to 29 in bdc8942

seqid, sequence = record.id.strip(), record.seq
seq_record_id = re.sub(r"[()\"#/@;:<>{}`+=~|!?,]", "_", seqid)
outfile.write(f'>{seq_record_id}\n{sequence}\n')

The code should be:

seqid, desc, seq = rec.id, rec.description, rec.seq
# replace non-word, non-digit, non-period or dash characters
new_seqid = re.sub(r'[^\w.\-]+', '_', seqid)
# remove leading and trailing underscores
new_seqid = re.sub(r'^_|_$', '', new_seqid)
# preserve seq description and document changes in FULL seq name
seq_name = f'{seqid}{" " + desc if desc else ""}'
new_seq_name = f'{new_seqid}{" " + desc if desc else ""}'
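As a self-contained sketch of the same idea, the function below works directly on a raw FASTA header line (without the leading ">") instead of a Biopython record. The name sanitize_header is hypothetical. Splitting the header manually is a deliberate choice here: with Biopython's FASTA parser, record.description holds the full header line including the ID, so the ID would need to be removed from it before concatenating.

```python
import re
from typing import Tuple


def sanitize_header(header: str) -> Tuple[str, str]:
    """Return (old_name, new_name) for a FASTA header given without '>'.

    Hypothetical helper: the ID is taken to be the first whitespace-delimited
    token and the rest of the line is the description, which is preserved.
    """
    parts = header.strip().split(None, 1)
    seqid = parts[0] if parts else ""
    desc = parts[1] if len(parts) > 1 else ""
    # Replace runs of characters that are not word characters, periods or dashes.
    new_seqid = re.sub(r"[^\w.\-]+", "_", seqid)
    # Remove a leading and a trailing underscore left over from the replacement.
    new_seqid = re.sub(r"^_|_$", "", new_seqid)
    suffix = f" {desc}" if desc else ""
    return f"{seqid}{suffix}", f"{new_seqid}{suffix}"
```

Returning both names makes it straightforward to record the change for each sequence.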

A subworkflow should be created to validate user-specified sequences and ensure that they are valid input:

Create a representative gene sequence DB containing a non-redundant set of full-length gene segment sequences, clustered with CD-HIT at a 95% threshold (or lower). Sequences with ambiguous bases should be excluded.
User input sequences should all be uppercase ASCII and distinct; sequence names can be duplicated, but duplicates will be renamed.
Perform an all-against-all Edlib global alignment of user sequences against the representative gene segment sequences.
Assign each user sequence to a gene segment provided there are fewer than X differences between the user sequence and the representative sequence. If there is no match, fail immediately with an informative message telling the user that one or more of their sequences do not pass the threshold. This threshold would need to be determined with some testing.
Format user sequence names in a pipeline-compatible way. The name changes should be documented in a table with 3 columns: sequence index, old name, new name. Duplicated sequence names should be handled with a warning and renamed with -{seq index} appended.
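The input checks and the rename table from the steps above could be sketched roughly as follows. This is only an illustration: the function names are hypothetical, the sequence index is assumed to be 1-based, and every occurrence of a duplicated name is assumed to be renamed (the issue does not say whether the first occurrence keeps its name). The Edlib alignment and segment assignment steps are not covered.

```python
import warnings
from collections import Counter
from typing import List, Tuple


def validate_sequences(seqs: List[str]) -> None:
    """Raise ValueError unless all sequences are non-empty, uppercase ASCII and distinct."""
    for idx, seq in enumerate(seqs, start=1):
        if not (seq and seq.isascii() and seq == seq.upper()):
            raise ValueError(f"Sequence {idx} is not uppercase ASCII")
    if len(set(seqs)) != len(seqs):
        raise ValueError("User input contains identical sequences")


def build_rename_table(names: List[str]) -> List[Tuple[int, str, str]]:
    """Return rows of (sequence index, old name, new name).

    Assumption: names seen more than once get '-{index}' appended,
    using the 1-based position of the sequence in the input.
    """
    counts = Counter(names)
    table = []
    for idx, old in enumerate(names, start=1):
        if counts[old] > 1:
            warnings.warn(f"Duplicated sequence name {old!r} renamed to {old}-{idx}")
            new = f"{old}-{idx}"
        else:
            new = old
        table.append((idx, old, new))
    return table
```

The returned rows map directly onto the three-column table described above and could be written out as a TSV alongside the pipeline results.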

peterk87 assigned peterk87 and nhhaidee Jun 9, 2022

peterk87 added this to the 3.2.0 milestone Jun 9, 2022

peterk87 added the enhancement (New feature or request) label Jun 9, 2022

peterk87 removed this from the 3.2.0 milestone Jul 14, 2023
