Skip to content

Accesory script to homogenize sequence IDs in multifasta files

License

Notifications You must be signed in to change notification settings

BBCMdP/Sequence-renamer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 

Repository files navigation

Sequence-renamer

Accesory script to homogenize sequence IDs in multifasta files DOI

Different fasta file sources come with headers (this is, the sequences ID or description line) with an wide variety of formats. Indeed, often times headers may contain unusual characters, or even whitespaces, that may result in a hellish experience with downstream applications. In this sense, is quite usual to find different issues: some applications may trim the headers upon a given character (such as hmmer does with withespaces), change characters (e.g., PhyML changes "." by "_") or even more complicated, trim the header to a given number of characters (this is a common issue with the strict phylip format, which allows only 10 characters). As such, we recommend to homogenize fasta files headers before to start the Seqrutinator pipeline, including MUFASA (and, why not, as a general rule whatsoever). To do so, we provide the seq_renamer.py script

The script renames each of the sequences in a fasta file with a fixed digit string (by default, of 10 characters long). The user can add a key to identify the sequences with argument -id, which should be a letter code for the sequences in the fasta file (for example, -id eco to indicate the sequences come from Escherichia coli). The 10 digits will be completed with increasing numbers. An output file (file_map) will be generated, showing the corresponding original header to each new name. We also include the optional -taxid argument for the proteome's species, which will be incorporated as a third column in the file_map output.

Multiple fasta files can also be processed at once. This requires an additional .tsv file (called with -t), with columns file, code and taxid. All fasta files should have the same extension (defined with -ext). For example:

file code taxid
Aegilops_tauschii_accepted.fa ata 37682
Acomosus_321_v3_accepted.fa aco 4615
Aofficinalis_498_V1_accepted.fa EPAo 4686

The script will iterate over the fasta files in each row, and rename based on the corresponding code. The column taxid can be empty.

Requirements

Sequence renamer requires python package Biopython (https://biopython.org/). Install with: pip install biopython

Examples

We have the proteome of E. coli, with 4959 entries:

$ grep -c ">" Ecoli.fasta                                                                         1 ↵  
4959

Note the headers contain a rather complex setting, with neven lenghts, and presence of whitespaces.

$ head -n 6 Ecoli.fasta                                      
>WP_001300467.1 MULTISPECIES: leu operon leader peptide [Enterobacteriaceae]
MTHIVRFIGLLLLNASSLRGRRVSGIQH
>WP_000478195.1 MULTISPECIES: DUF3302 domain-containing protein [Enterobacteriaceae]
MFLDYFALGVLIFVFLVIFYGIIILHDIPYLIAKKRNHPHADAIHVAGWVSLFTLHVIWPFLWIWATLYRPERGWGMQSHDSSVMQLQQRIAGLEKQLADIKSSSAE
>WP_000543457.1 MULTISPECIES: 2,3-dihydroxyphenylpropionate/2,3-dihydroxicinnamic acid 1,2-dioxygenase [Bacteria]
MHAYLHCLSHSPLVGYVDPAQEVLDEVNGVIASARERIAAFSPELVVLFAPDHYNGFFYDVMPPFCLGVGATAIGDFGSAAGELPVPVELAEACAHAVMKSGIDLAVSYCMQVDHGFAQPLEFLLGGLDKVPVLPVFINGVATPLPGFQRTRMLGEAIGRFTSTLNKRVLFLGSGGLSHQPPVPELAKADAHMRDRLLGSGKDLPASERELRQQRVISAAEKFVEDQRTLHPLNPIWDNQFMTLLEQGRIQELDAVSNEELSAIAGKSTHEIKTWVAAFAAISAFGNWRSEGRYYRPIPEWIAGFGSLSARTEN

Running:

$ python3 seq_renamer.py -i Ecoli.fasta -id eco
Execution Successful: 0:00:00.032205

Yields an output file Ecoli_renamed.fsa which headers look now like:

$ head -n 6 Ecoli_renamed.fsa                                                                          
>eco0000001
MTHIVRFIGLLLLNASSLRGRRVSGIQH
>eco0000002
MFLDYFALGVLIFVFLVIFYGIIILHDIPYLIAKKRNHPHADAIHVAGWVSLFTLHVIWPFLWIWATLYRPERGWGMQSHDSSVMQLQQRIAGLEKQLADIKSSSAE
>eco0000003
MHAYLHCLSHSPLVGYVDPAQEVLDEVNGVIASARERIAAFSPELVVLFAPDHYNGFFYDVMPPFCLGVGATAIGDFGSAAGELPVPVELAEACAHAVMKSGIDLAVSYCMQVDHGFAQPLEFLLGGLDKVPVLPVFINGVATPLPGFQRTRMLGEAIGRFTSTLNKRVLFLGSGGLSHQPPVPELAKADAHMRDRLLGSGKDLPASERELRQQRVISAAEKFVEDQRTLHPLNPIWDNQFMTLLEQGRIQELDAVSNEELSAIAGKSTHEIKTWVAAFAAISAFGNWRSEGRYYRPIPEWIAGFGSLSARTEN

In addition, a file termed Ecoli_file_map is created, containing an index of new and original names:

$ head -n 6 Ecoli_file_map                                                                             
eco0000001 	 WP_001300467.1 MULTISPECIES: leu operon leader peptide [Enterobacteriaceae]
eco0000002 	 WP_000478195.1 MULTISPECIES: DUF3302 domain-containing protein [Enterobacteriaceae]
eco0000003 	 WP_000543457.1 MULTISPECIES: 2,3-dihydroxyphenylpropionate/2,3-dihydroxicinnamic acid 1,2-dioxygenase [Bacteria]
eco0000004 	 WP_000291549.1 MULTISPECIES: lactose permease [Bacteria]
eco0000005 	 WP_000995441.1 MULTISPECIES: host-nuclease inhibitor Gam family protein [Enterobacteriaceae]
eco0000006 	 WP_000018633.1 MULTISPECIES: DUF2526 family protein YdcY [Enterobacterales]

We recommend a header name of 10 characters, with an index code (-id) of 3-4 characters. However, the user can use -l argument to extend the lenght of the header if desired. For example running:

$ python3 seq_renamer.py -i Ecoli.fasta -id esch_coli -l 15
$ head -n 6 Ecoli_renamed.fsa                                                                          
>esch_coli000001
MTHIVRFIGLLLLNASSLRGRRVSGIQH
>esch_coli000002
MFLDYFALGVLIFVFLVIFYGIIILHDIPYLIAKKRNHPHADAIHVAGWVSLFTLHVIWPFLWIWATLYRPERGWGMQSHDSSVMQLQQRIAGLEKQLADIKSSSAE
>esch_coli000003
MHAYLHCLSHSPLVGYVDPAQEVLDEVNGVIASARERIAAFSPELVVLFAPDHYNGFFYDVMPPFCLGVGATAIGDFGSAAGELPVPVELAEACAHAVMKSGIDLAVSYCMQVDHGFAQPLEFLLGGLDKVPVLPVFINGVATPLPGFQRTRMLGEAIGRFTSTLNKRVLFLGSGGLSHQPPVPELAKADAHMRDRLLGSGKDLPASERELRQQRVISAAEKFVEDQRTLHPLNPIWDNQFMTLLEQGRIQELDAVSNEELSAIAGKSTHEIKTWVAAFAAISAFGNWRSEGRYYRPIPEWIAGFGSLSARTEN

Note that headers in the renamed file have 15 characters long, which allows to better accomodate the index "ecoli".

Multiple files:

$ python3 seq_renamer.py -t fasta_codes.tsv -ext '*.fa'

The script will iterate over all .fa files in the folder (provided they are also in the fasta_codes.tsv file). (Note: -l can also be applied here, but make sure not to use -i). *.fa is the default extension.

About

Accesory script to homogenize sequence IDs in multifasta files

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages