More robust handling and preprocessing of user input sequences #3

Open
peterk87 opened this issue Jun 9, 2022 · 0 comments

peterk87 commented Jun 9, 2022

Users should be able to submit any sequences in FASTA format, and the workflow should determine whether those sequences are valid input. Sequence names should NOT require any special formatting. Currently, _Segment is required in the user-provided sequence name:

df_blast_result["ref_name"] = df_blast_result["stitle"].str.extract('(.+?)_[sS]egment')

This requirement is unnecessary.
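
A rough sketch of how the extraction could tolerate names without _Segment. With pandas, `.str.extract` yields NaN for any title the pattern does not match, so such sequences lose their reference name. The helper `extract_ref_name` is hypothetical, and falling back to the full title is an assumption about the desired behaviour, not something the workflow currently does:

```python
import re


def extract_ref_name(stitle):
    """Return the text before '_Segment'/'_segment', or the whole title if absent."""
    match = re.search(r'(.+?)_[sS]egment', stitle)
    # Assumption: when the marker is missing, keep the full title as the name
    return match.group(1) if match else stitle


print(extract_ref_name('A/x/1_Segment_4'))  # name with the marker
print(extract_ref_name('my_seq'))           # name without the marker
```

The same fallback could be applied to the dataframe column by filling missing extracted values with the original stitle values.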

The full user sequence name should be preserved, not dropped:

df_blast_result.drop(columns=["qaccver", "saccver", "stitle"], inplace=True)

The FASTA record description or comment should be preserved instead of stripped away/ignored:

seqid, sequence = record.id.strip(), record.seq
seq_record_id = re.sub(r"[()\"#/@;:<>{}`+=~|!?,]", "_", seqid)
outfile.write(f'>{seq_record_id}\n{sequence}\n')

The code should be:

seqid, desc, seq = rec.id, rec.description, rec.seq
# replace each run of characters that are not word characters, periods, or dashes with an underscore
new_seqid = re.sub(r'[^\w.\-]+', '_', seqid)
# remove all leading and trailing underscores
new_seqid = re.sub(r'^_+|_+$', '', new_seqid)
# preserve seq description and document changes in FULL seq name
seq_name = f'{seqid}{" " + desc if desc else ""}'
new_seq_name = f'{new_seqid}{" " + desc if desc else ""}'
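
A self-contained sketch of the renaming above, working on plain strings so it can be tried without a FASTA parser. The function name `clean_seq_name` is hypothetical. One caveat, stated as an assumption about the parser: if the records come from Biopython's SeqIO, `rec.description` normally holds the whole header line including the ID, so the ID prefix would need to be stripped from it first; that step is not handled here.

```python
import re


def clean_seq_name(seqid, desc=''):
    """Return (original full name, sanitized full name) for one FASTA record.

    `desc` is assumed to be the header text AFTER the ID, not the whole header.
    """
    # Replace each run of characters other than word chars, '.', or '-' with '_'
    new_seqid = re.sub(r'[^\w.\-]+', '_', seqid)
    # Remove all leading and trailing underscores
    new_seqid = new_seqid.strip('_')
    # Keep the description so the full name is preserved in both versions
    suffix = f' {desc}' if desc else ''
    return f'{seqid}{suffix}', f'{new_seqid}{suffix}'


old_name, new_name = clean_seq_name('A/swine/Ontario/1(H1N1)', 'segment 4')
print(old_name)
print(new_name)
```

Returning both names makes it straightforward to record every change in an old name / new name table.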

A subworkflow should be created to validate user-specified sequences and ensure that they are valid input:

  1. Create a representative gene sequence DB containing a non-redundant set of full-length gene segment sequences, clustered with CD-HIT at a 95% cluster threshold (or lower). Sequences with ambiguous bases should be excluded.
  2. User input sequences should all be uppercase ASCII and distinct; sequence names can be duplicated, but duplicates will be renamed.
  3. All-against-all Edlib global alignment of user sequences against representative gene segment sequences.
  4. Assign each user sequence to a gene segment provided there are fewer than X differences between the user seq and the rep seq. If there is no match, fail immediately with an informative message telling the user that one or more of their sequences do not pass the threshold. This threshold would need to be determined through testing.
  5. Format user sequence names in a pipeline-compatible way. The name changes should be documented in a table with 3 columns: sequence index, old name, new name. Duplicated sequence names should be handled with a warning and renamed by appending -{seq index}.
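
Steps 2, 4, and 5 could be sketched roughly as follows. Everything here is illustrative: the function names are hypothetical, "uppercase ASCII" is assumed to mean the letters A-Z only, and the pure-Python Levenshtein distance is only a stand-in for the edit distance that an Edlib global alignment would report. The threshold X is left as a parameter because it still needs to be determined.

```python
import re
from collections import Counter

VALID_SEQ = re.compile(r'[A-Z]+')  # assumption: "uppercase ASCII" means A-Z only


def check_sequences(seqs):
    """Step 2: report sequences that are not uppercase A-Z or not distinct."""
    errors, seen = [], {}
    for idx, seq in enumerate(seqs, start=1):
        if not VALID_SEQ.fullmatch(seq):
            errors.append(f'sequence {idx} is not uppercase ASCII letters only')
        if seq in seen:
            errors.append(f'sequence {idx} duplicates sequence {seen[seq]}')
        else:
            seen[seq] = idx
    return errors


def edit_distance(a, b):
    """Levenshtein distance; stand-in for an Edlib global alignment."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]


def assign_segment(seq, rep_seqs, max_diff):
    """Step 4: closest segment name, or None unless fewer than max_diff differences."""
    best_name, best_dist = None, None
    for name, rep in rep_seqs.items():
        dist = edit_distance(seq, rep)
        if best_dist is None or dist < best_dist:
            best_name, best_dist = name, dist
    if best_dist is not None and best_dist < max_diff:
        return best_name
    return None


def rename_duplicates(names):
    """Step 5: (index, old name, new name) rows; duplicated names get -{index}."""
    counts = Counter(names)
    rows, warnings = [], []
    for idx, name in enumerate(names, start=1):
        new_name = f'{name}-{idx}' if counts[name] > 1 else name
        if new_name != name:
            warnings.append(f'duplicated name {name!r} renamed to {new_name!r}')
        rows.append((idx, name, new_name))
    return rows, warnings
```

A caller would fail the run if `check_sequences` returns any errors or `assign_segment` returns None for any sequence, and write the rows from `rename_duplicates` out as the name-change table. This sketch does not guard against a renamed name colliding with a name that already exists.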
peterk87 added this to the 3.2.0 milestone Jun 9, 2022
peterk87 added the enhancement (New feature or request) label Jun 9, 2022
peterk87 removed this from the 3.2.0 milestone Jul 14, 2023