## Standard Numbering

The standard numbering tool assigns position numbers to the residues of protein sequences (or the bases of DNA sequences) in a common reference frame. Because every sequence is aligned to the same base sequence, equivalent positions in different sequences can be compared directly.

It can be run in two different modes:

1. **Pairwise alignment**: Each sequence is aligned individually to a base sequence that you provide, and its residues are numbered relative to that base sequence.
2. **Clustal alignment**: All sequences, including the base sequence, are aligned together in one multiple sequence alignment with Clustal Omega, and the residues are numbered relative to the base sequence.
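To make the idea of a common reference frame concrete, here is a minimal, self-contained sketch of how position labels could be derived from an alignment. It is not pyeed's implementation: the function name `number_against_base` and the `n.m` label format for insertions are assumptions made for illustration. The same logic applies to a row pair from a pairwise alignment or to any row of a multiple sequence alignment compared against the base row.

```python
def number_against_base(base_aligned: str, other_aligned: str, gap: str = "-") -> list[str]:
    """Return one position label per residue of `other_aligned`.

    Residues aligned to a base residue get that residue's 1-based number.
    Residues inserted relative to the base get "<last base number>.<k>".
    (Hypothetical labelling scheme, chosen for illustration only.)
    """
    if len(base_aligned) != len(other_aligned):
        raise ValueError("aligned sequences must have equal length")

    labels = []
    base_pos = 0    # number of base residues seen so far
    insert_idx = 0  # running index within the current insertion
    for b, o in zip(base_aligned, other_aligned):
        if b != gap:
            base_pos += 1
            insert_idx = 0
            if o != gap:
                labels.append(str(base_pos))
        elif o != gap:
            insert_idx += 1
            labels.append(f"{base_pos}.{insert_idx}")
        # columns where both rows have a gap (possible in an MSA) are skipped
    return labels


# Toy alignment: the second sequence has an insertion after position 2
# and a deletion at position 4 of the base.
print(number_against_base("MK-TAY", "MKLT-Y"))  # ['1', '2', '2.1', '3', '5']
```

With labels of this kind, position `3` refers to the same alignment column in every sequence, regardless of insertions or deletions elsewhere.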


In [25]:
%reload_ext autoreload
%autoreload 2
import sys

from loguru import logger

from pyeed import Pyeed
from pyeed.analysis.mutation_detection import MutationDetection
from pyeed.analysis.standard_numbering import StandardNumberingTool

# Replace loguru's default handler so that only INFO and above is printed
logger.remove()
level = logger.add(sys.stderr, level="INFO")

In [26]:
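# Connection details of the Neo4j instance used in this example
uri = "bolt://129.69.129.130:7687"
user = "neo4j"
password = "12345678"

eedb = Pyeed(uri, user=user, password=password)
# Start from an empty database. This deletes all existing data.
eedb.db.wipe_database(date="2025-03-14")

# Set up the indexes and constraints of the data model
eedb.db.initialize_db_constraints(user=user, password=password)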


📡 Connected to database.
All data has been wiped from the database.
the connection url is bolt://neo4j:12345678@129.69.129.130:7687
Loaded /home/nab/Niklas/pyeed/src/pyeed/model.py
Connecting to bolt://neo4j:12345678@129.69.129.130:7687
Setting up indexes and constraints...

Found model.StrictStructuredNode
 ! Skipping class model.StrictStructuredNode is abstract
Found model.Organism
 + Creating node unique constraint for taxonomy_id on label Organism for class model.Organism
{code: Neo.ClientError.Schema.EquivalentSchemaRuleAlreadyExists} {message: An equivalent constraint already exists, 'Constraint( id=12, name='constraint_unique_Organism_taxonomy_id', type='UNIQUENESS', schema=(:Organism {taxonomy_id}), ownedIndex=5 )'.}
Found model.Site
 + Creating node unique constraint for site_id on label Site for class model.Site
{code: Neo.ClientError.Schema.EquivalentSchemaRuleAlreadyExists} {message: An equivalent constraint already exists, 'Constraint( id=14, name='constraint_unique_Site_s

In [27]:
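# NCBI protein accessions used throughout this example
ids = ["AAM15527.1", "AAF05614.1", "AFN21551.1", "CAA76794.1", "AGQ50511.1"]

# Fetch the protein records from NCBI and add them to the database
eedb.fetch_from_primary_db(ids, db="ncbi_protein")
# Fetch the DNA entries that belong to the proteins in the database
eedb.fetch_dna_entries_for_proteins()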

2025-03-14 16:01:33.841 | INFO     | pyeed.main:fetch_from_primary_db:87 - Found 0 sequences in the database.
2025-03-14 16:01:33.841 | INFO     | pyeed.main:fetch_from_primary_db:89 - Fetching 5 sequences from ncbi_protein.
2025-03-14 16:01:33.864 | INFO     | pyeed.adapter.primary_db_adapter:execute_requests:140 - Starting requests for 1 batches.
2025-03-14 16:01:35.072 | INFO     | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein AAM15527.1 in database
2025-03-14 16:01:35.101 | INFO     | pyeed.adapter.ncbi_protein_mapper:add_to_db:301 - Added/updated NCBI protein AAF05614.1 in database
2025-03-14 16:01:35.128 | INFO     | pyeed.adapter.ncbi_protein_mapper:add_to_db

In [28]:
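# The name identifies the standard numbering node that is created in the database
sn = StandardNumberingTool(name="test_standard_numbering_pairwise")

# Pairwise mode: align every listed sequence to the base sequence AAM15527.1
# (ids[0:5] selects all five IDs)
sn.apply_standard_numbering_pairwise(
    base_sequence_id="AAM15527.1", db=eedb.db, list_of_seq_ids=ids[0:5]
)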


2025-03-14 16:01:37.045 | INFO     | pyeed.analysis.standard_numbering:apply_standard_numbering_pairwise:385 - Pairs: [('AAM15527.1', 'CAA76794.1'), ('AAM15527.1', 'AGQ50511.1'), ('AAM15527.1', 'AFN21551.1'), ('AAM15527.1', 'AAF05614.1')]
2025-03-14 16:01:37.046 | INFO     | pyeed.analysis.standard_numbering:apply_standard_numbering_pairwise:394 - Input: ['AAF05614.1', 'AFN21551.1', 'CAA76794.1', 'AGQ50511.1', 'AAM15527.1']



2025-03-14 16:01:41.722 | INFO     | pyeed.analysis.standard_numbering:apply_standard_numbering_pairwise:403 - Pairwise alignment results: [{'query_id': 'AAM15527.1', 'target_id': 'CAA76794.1', 'score': 272.0, 'identity': 0.9755244755244755, 'gaps': 0, 'mismatches': 7, 'query_aligned': 'MSIQHFRVALIPFFAAFCLPVFAHPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMLSTFKVLLCGAVLSRVDAGQEQLGRRIHYSQNDLVEYSPVTEKHLTDGMTVRELCSAAVTMSDNTAANLLLTTIGGPKELTAFLHNMGDHVTRLDRWEPELNEAIPNDERDTTMPAAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGAGERGSRGIIAALGPDGKPSRIVVIYTTGSQATMDERDRQIAEIGASLIKHW', 'target_aligned': 'MSIQHFRVALIPFFAAFCLPVFAHPETLVKVKDAEDKLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRVDAGQEQLGRRIHYSQNDLVKYSPVTEKHLTDGMTVRELCSAAITMSDNTAANLLLTTIGGPKELTAFLRNMGDHVTRLDRWEPELNEAIPNDERDTTMPAAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGASERGSRGIIAALGPDGKPSRIVVIYTTGSQATMDERNRQIAEIGASLIKHW'}, {'query_id': 'AAM15527.1', 'target_id': 'AGQ50511.1', 'score': 280

In [29]:
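# Running the same numbering again is safe: pairs that already exist under
# this standard numbering node are skipped, so no new alignments are computed
sn.apply_standard_numbering_pairwise(
    base_sequence_id="AAM15527.1", db=eedb.db, list_of_seq_ids=ids
)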


2025-03-14 16:01:51.025 | INFO     | pyeed.analysis.standard_numbering:apply_standard_numbering_pairwise:379 - Pair AAM15527.1 and AAF05614.1 already exists under the same standard numbering node
2025-03-14 16:01:51.026 | INFO     | pyeed.analysis.standard_numbering:apply_standard_numbering_pairwise:379 - Pair AAM15527.1 and AFN21551.1 already exists under the same standard numbering node
2025-03-14 16:01:51.026 | INFO     | pyeed.analysis.standard_numbering:apply_standard_numbering_pairwise:379 - Pair AAM15527.1 and CAA76794.1 already exists under the same standard numbering node
2025-03-14 16:01:51.027 | INFO     | pyeed.analysis.standard_numbering:apply_standard_numbering_pairwise:379 - Pair AAM15527.1 and AGQ50511.1 already exists under the same standard numbering node

2025-03-14 16:01:51.049 | INFO     | pyeed.analysis.standard_numbering:apply_standard_numbering_pairwise:403 - Pairwise alignment results: []
2025-03-14 16:01:51.049 | INFO     | pyeed.analysis.standard_numbering:apply_standard_numbering_pairwise:419 - No alignment found for AAM15527.1


In [30]:
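sn_clustal = StandardNumberingTool(name="test_standard_numbering_clustal")

# Clustal mode: one multiple sequence alignment (Clustal Omega) over all
# listed sequences, numbered relative to the base sequence
sn_clustal.apply_standard_numbering(
    base_sequence_id="AAM15527.1", db=eedb.db, list_of_seq_ids=ids
)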

2025-03-14 16:01:52.356 | INFO     | pyeed.analysis.standard_numbering:apply_standard_numbering:494 - Using 4 sequences for standard numbering
2025-03-14 16:01:52.467 | INFO     | pyeed.analysis.standard_numbering:apply_standard_numbering:514 - Alignment received from ClustalOmega:
AAM15527.1  MSIQHFRVALIPFFAAFCLPVFAHPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMLSTFKVLLCGAVLSRVDAGQEQLGRRIHYSQNDLVEYSPVTEKHLTDGMTVRELCSAAVTMSDNTAANLLLTTIGGPKELTAFLHNMGDHVTRLDRWEPELNEAIPNDERDTTMPAAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGAGERGSRGIIAALGPDGKPSRIVVIYTTGSQATMDERDRQIAEIGASLIKHW
AAF05614.1  MSIQHFRVALIPFFAAFCLPVFAHPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMLSTFKVLLCGAVLSRVDAGQEQLGRRIHYSQNDLVEYSPVTEKHLTDGMTVRELCSAAITMSDNTAANLLLTTIGGPKELTAFLHNMGDHVTRLDRWEPELNEAIPNDERDTTMPAAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGAGERGSSGIIAALGPDGKPSRIVVIYTTGSQATMDERNRQIAEIGASLIKHW
AFN21551.1  MSIQHFR

In [31]:
sn_dna = StandardNumberingTool(name="test_standard_numbering_dna")

# Clustal mode on DNA: the base sequence is a DNA accession and node_type="DNA"
sn_dna.apply_standard_numbering(
    base_sequence_id="AF190695.1", db=eedb.db, node_type="DNA"
)

2025-03-14 16:01:52.743 | INFO     | pyeed.analysis.standard_numbering:apply_standard_numbering:494 - Using 5 sequences for standard numbering
2025-03-14 16:01:53.287 | INFO     | pyeed.analysis.standard_numbering:apply_standard_numbering:514 - Alignment received from ClustalOmega:
AF190695.1  TTCTTGAAGACGAAAGGGCCTCGTGATACGCCTATTTTTATAGGTTAATGTCATGATAATAATGGTTTCTTAGACGTCAGGTGGCACTTTTCGGGGAAATGTGCGCGGAACCCCTATTTGTTTATTTTTCTAAATACATTCAAATATGTATCCGCTCATGAGACAATAACCCTGGTAAATGCTTCAATAATATTGAAAAAGGAAGAGTATGAGTATTCAACATTTTCGTGTCGCCCTTATTCCCTTTTTTGCGGCATTTTGCCTTCCTGTTTTTGCTCACCCAGAAACGCTGGTGAAAGTAAAAGATGCTGAAGATCAGTTGGGTGCACGAGTGGGTTACATCGAACTGGATCTCAACAGCGGTAAGATCCTTGAGAGTTTTCGCCCCGAAGAACGTTTTCCAATGCTGAGCACTTTTAAAGTTCTGCTATGTGGTGCGGTATTATCCCGTGTTGACGCCGGGCAAGAGCAACTCGGTCGCCGCATACACTATTCTCAGAATGACTTGGTTGAGTACTCACCAGTCACAGAAAAGCATCTTACGGATGGCATGACAGTAAGAGAATTATGCAGTGCTGCCATAACCATGAGTGATAACACTGCTG

In [32]:
sn_dna_pairwise = StandardNumberingTool(name="test_standard_numbering_dna_pairwise")

# Pairwise mode on DNA: each DNA sequence is aligned to the base AF190695.1
sn_dna_pairwise.apply_standard_numbering_pairwise(
    base_sequence_id="AF190695.1", db=eedb.db, node_type="DNA"
)

2025-03-14 16:01:53.600 | INFO     | pyeed.analysis.standard_numbering:apply_standard_numbering_pairwise:385 - Pairs: [('AF190695.1', 'Y17582.1'), ('AF190695.1', 'AF347054.1'), ('AF190695.1', 'KC844056.1'), ('AF190695.1', 'JX042489.1')]
2025-03-14 16:01:53.601 | INFO     | pyeed.analysis.standard_numbering:apply_standard_numbering_pairwise:394 - Input: ['AF347054.1', 'JX042489.1', 'KC844056.1', 'Y17582.1', 'AF190695.1']



2025-03-14 16:01:58.161 | INFO     | pyeed.analysis.standard_numbering:apply_standard_numbering_pairwise:403 - Pairwise alignment results: [{'query_id': 'AF190695.1', 'target_id': 'Y17582.1', 'score': 834.0, 'identity': 0.7679057116953762, 'gaps': 245, 'mismatches': 11, 'query_aligned': 'TTCTTGAAGACGAAAGGGCCTCGTGATACGCCTATTTTTATAGGTTAATGTCATGATAATAATGGTTTCTTAGACGTCAGGTGGCACTTTTCGGGGAAATGTGCGCGGAACCCCTATTTGTTTATTTTTCTAAATACATTCAAATATGTATCCGCTCATGAGACAATAACCCTGGTAAATGCTTCAATAATATTGAAAAAGGAAGAGTATGAGTATTCAACATTTTCGTGTCGCCCTTATTCCCTTTTTTGCGGCATTTTGCCTTCCTGTTTTTGCTCACCCAGAAACGCTGGTGAAAGTAAAAGATGCTGAAGATCAGTTGGGTGCACGAGTGGGTTACATCGAACTGGATCTCAACAGCGGTAAGATCCTTGAGAGTTTTCGCCCCGAAGAACGTTTTCCAATGCTGAGCACTTTTAAAGTTCTGCTATGTGGTGCGGTATTATCCCGTGTTGACGCCGGGCAAGAGCAACTCGGTCGCCGCATACACTATTCTCAGAATGACTTGGTTGAGTACTCACCAGTCACAGAAAAGCATCTTACGGATGGCATGACAGTAAGAGAATTATGCAGTGCTGCCATAACCATGAGTGATAACACTGCTGCCAACTTACTTCTGACAACGATCGGAGGACCGAAGGAGCTAACCGCTTTTTTGCACAA

In both modes, every sequence is now connected to a standard numbering node, and the edge between the sequence and that node stores the standard numbering data.
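Once each sequence carries its own list of position labels, sequences can be compared position by position. The following sketch shows one way this could be done on plain Python data. It does not query the database: the function `differing_positions`, the toy sequences, and the string label format (including `2.1` for an insertion) are assumptions for illustration, not pyeed's API or storage format.

```python
def differing_positions(base_seq, base_labels, other_seq, other_labels):
    """List (label, base_residue, other_residue) where the two sequences differ.

    Each *_labels list holds one position label per residue of its sequence.
    Labels that do not occur in the base (insertions) are ignored here.
    """
    if len(base_seq) != len(base_labels) or len(other_seq) != len(other_labels):
        raise ValueError("each sequence needs exactly one label per residue")

    base_at = dict(zip(base_labels, base_seq))  # label -> base residue
    diffs = []
    for label, residue in zip(other_labels, other_seq):
        ref = base_at.get(label)
        if ref is not None and ref != residue:
            diffs.append((label, ref, residue))
    return diffs


# Toy data: the second sequence has an insertion ("2.1"), lacks position 4,
# and carries a T -> S substitution at position 3.
base_seq, base_labels = "MKTAY", ["1", "2", "3", "4", "5"]
other_seq, other_labels = "MKLSY", ["1", "2", "2.1", "3", "5"]

print(differing_positions(base_seq, base_labels, other_seq, other_labels))
# [('3', 'T', 'S')]
```

Because the comparison is keyed on labels rather than on string indices, the insertion in the second sequence does not shift the positions that follow it.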