# Mutation Analysis

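This notebook shows how to detect mutations between pairs of protein or DNA sequences stored in a PyEED database. Standard numbering is applied first so that positions are comparable across sequences.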

In [9]:
import sys

from loguru import logger

from pyeed import Pyeed
from pyeed.analysis.mutation_detection import MutationDetection
from pyeed.analysis.standard_numbering import StandardNumberingTool

logger.remove()
level = logger.add(sys.stderr, level="WARNING")

- `Pyeed`: Main class for interacting with the PyEED database
- `MutationDetection`: Class for identifying differences between protein sequences
- `StandardNumberingTool`: Ensures consistent position numbering across different protein sequences

In [10]:
uri = "bolt://129.69.129.130:7687"
user = "neo4j"
password = "12345678"

eedb = Pyeed(uri, user=user, password=password)

eedb.db.wipe_database(date="2025-03-19")

📡 Connected to database.
All data has been wiped from the database.


1. Sets the connection parameters (URI, user, password) for the Neo4j database
2. Creates a `Pyeed` instance with these credentials
3. Wipes all existing data from the database (passing the date "2025-03-19")

This ensures we're working with a clean database state.
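
The connection parameters above are hard-coded in the cell. As a minimal sketch, they could instead be read from environment variables. The variable names `PYEED_NEO4J_URI`, `PYEED_NEO4J_USER`, and `PYEED_NEO4J_PASSWORD` and the helper `load_connection_settings` are hypothetical and not part of PyEED.

```python
import os


def load_connection_settings(env=os.environ):
    """Collect Neo4j connection settings from a mapping of environment variables.

    The variable names are hypothetical; adapt them to your own setup.
    """
    return {
        "uri": env.get("PYEED_NEO4J_URI", "bolt://localhost:7687"),
        "user": env.get("PYEED_NEO4J_USER", "neo4j"),
        # No default for the password: a missing value raises KeyError.
        "password": env["PYEED_NEO4J_PASSWORD"],
    }


# Demonstration with an explicit mapping instead of the real environment.
settings = load_connection_settings({"PYEED_NEO4J_PASSWORD": "example-password"})
print(settings["uri"], settings["user"])
```

The resulting values can then be passed on in the same way as in the cell above: `Pyeed(settings["uri"], user=settings["user"], password=settings["password"])`.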

## Sequence Retrieval

In [11]:
ids = ["AAM15527.1", "AAF05614.1", "AFN21551.1", "CAA76794.1", "AGQ50511.1"]

eedb.fetch_from_primary_db(ids, db="ncbi_protein")
eedb.fetch_dna_entries_for_proteins()
eedb.create_coding_sequences_regions()

1. Defines five protein sequence IDs to analyze
2. Fetches these sequences from NCBI's protein database
3. All sequences are beta-lactamase proteins
4. The sequences are automatically parsed and stored in the Neo4j database
5. Additional metadata like organism information and CDS (Coding Sequence) details are also stored
6. Fetches the DNA entries that encode these proteins and creates the coding sequence regions on them

## Apply Standard Numbering

In [12]:
sn_protein = StandardNumberingTool(name="test_standard_numbering_protein")


sn_protein.apply_standard_numbering(
    base_sequence_id="AAM15527.1", db=eedb.db, list_of_seq_ids=ids
)

sn_dna = StandardNumberingTool(name="test_standard_numbering_dna_pairwise")

query_get_region_ids = """
MATCH (p:Protein)<-[rel:ENCODES]-(d:DNA)-[rel2:HAS_REGION]->(r:Region)
WHERE r.annotation = $region_annotation AND p.accession_id IN $protein_id
RETURN id(r)
"""

region_ids = eedb.db.execute_read(
    query_get_region_ids,
    parameters={"protein_id": ids, "region_annotation": "coding sequence"},
)
region_ids = [record["id(r)"] for record in region_ids]
print(f"Region ids: {region_ids}")
print(f"len of ids: {len(ids)}")

sn_dna.apply_standard_numbering_pairwise(
    base_sequence_id="AF190695.1", db=eedb.db, node_type="DNA", region_ids_neo4j=region_ids
)

Region ids: [143, 129, 128, 69, 9]
len of ids: 5


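1. Creates a `StandardNumberingTool` instance named "test_standard_numbering_protein"
2. Uses AAM15527.1 as the reference sequence for numbering the proteins
3. Performs multiple sequence alignment (MSA) using CLUSTAL
4. In CLUSTAL alignment output:
   - Asterisks (*) indicate identical residues
   - Colons (:) indicate conserved substitutions
   - Periods (.) indicate semi-conserved substitutions
5. Creates a second tool named "test_standard_numbering_dna_pairwise", looks up the Neo4j IDs of the coding sequence regions with a Cypher query, and numbers these regions pairwise against the DNA reference AF190695.1
6. This step is crucial for ensuring mutations are correctly identified relative to consistent positions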
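
The idea of numbering a sequence against a reference can be sketched in plain Python. This is a simplified illustration, not the implementation inside `StandardNumberingTool`: the function `number_against_reference`, the toy alignment, and the `N.k` labels for insertions relative to the reference are all assumptions made for this example.

```python
def number_against_reference(aligned_ref, aligned_seq):
    """Label each residue of aligned_seq with a position taken from aligned_ref.

    Both inputs are rows of the same alignment and use '-' for gaps.
    Residues aligned to a reference gap get labels such as '2.1', '2.2'.
    """
    if len(aligned_ref) != len(aligned_seq):
        raise ValueError("aligned sequences must have equal length")

    labels = []
    ref_pos = 0      # number of reference residues seen so far
    insert_idx = 0   # insertions counted since the last reference residue
    for ref_char, seq_char in zip(aligned_ref, aligned_seq):
        if ref_char != "-":
            ref_pos += 1
            insert_idx = 0
            label = str(ref_pos)
        else:
            insert_idx += 1
            label = f"{ref_pos}.{insert_idx}"
        if seq_char != "-":
            labels.append((label, seq_char))
    return labels


# Toy alignment: the second sequence has an insertion after reference
# position 2 and lacks the last reference residue.
numbering = number_against_reference("MK-TLV", "MKATL-")
print(numbering)
# [('1', 'M'), ('2', 'K'), ('2.1', 'A'), ('3', 'T'), ('4', 'L')]
```

Because every sequence is labelled relative to the same reference, residues that share a label can be compared directly, even when their raw indices differ.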

## Mutation Detection

In [13]:
md = MutationDetection()

seq1 = "AAM15527.1"
seq2 = "AAF05614.1"
name_of_standard_numbering_tool = "test_standard_numbering_protein"

mutations_protein = md.get_mutations_between_sequences(
    seq1, seq2, eedb.db, name_of_standard_numbering_tool
)

In [14]:
md = MutationDetection()


seq1 = "AF190695.1"
seq2 = "JX042489.1"
name_of_standard_numbering_tool = "test_standard_numbering_dna_pairwise"

mutations_dna = md.get_mutations_between_sequences(
    seq1, seq2, eedb.db, name_of_standard_numbering_tool, node_type="DNA", region_ids_neo4j=region_ids
)

1. Creates a `MutationDetection` instance
2. Compares the two sequences using the named standard numbering scheme
3. Identifies all positions where the amino acids differ (or the nucleotides, when `node_type="DNA"` is passed)
4. Automatically saves the mutations to the database
5. Returns a dictionary containing mutation information
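
The comparison step can be sketched without a database. The example below is a hedged illustration of the principle, not the code behind `get_mutations_between_sequences`: the helper `find_mutations` and the toy numbered sequences are hypothetical, and it is an assumption of this sketch that the reported positions are 1-based indices within each sequence.

```python
def find_mutations(numbered_a, numbered_b):
    """Compare two sequences given as lists of (numbering_label, residue).

    Residues are compared only where both sequences share a numbering label.
    """
    # Map each label of the second sequence to its 1-based index and residue.
    lookup_b = {
        label: (index, residue)
        for index, (label, residue) in enumerate(numbered_b, start=1)
    }
    result = {
        "from_positions": [],
        "to_positions": [],
        "from_monomers": [],
        "to_monomers": [],
    }
    for index_a, (label, residue_a) in enumerate(numbered_a, start=1):
        if label not in lookup_b:
            continue  # no counterpart in the second sequence
        index_b, residue_b = lookup_b[label]
        if residue_a != residue_b:
            result["from_positions"].append(index_a)
            result["to_positions"].append(index_b)
            result["from_monomers"].append(residue_a)
            result["to_monomers"].append(residue_b)
    return result


# Toy input: the second sequence carries an insertion ('2.1') and a
# substitution at numbering label '3'.
seq_a = [("1", "M"), ("2", "K"), ("3", "T"), ("4", "L"), ("5", "V")]
seq_b = [("1", "M"), ("2", "K"), ("2.1", "A"), ("3", "S"), ("4", "L")]
toy_mutations = find_mutations(seq_a, seq_b)
print(toy_mutations)
```

In this toy case the substitution sits at index 3 of the first sequence and index 4 of the second, which illustrates why a result can carry separate `from_positions` and `to_positions` lists.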

## Results

In [15]:
print(mutations_protein)

{'from_positions': [241, 125, 272], 'to_positions': [241, 125, 272], 'from_monomers': ['R', 'V', 'D'], 'to_monomers': ['S', 'I', 'N']}


Outputs a detailed mutation map showing:
- `from_positions`: [241, 125, 272] - Where mutations occur in the first sequence
- `to_positions`: [241, 125, 272] - Corresponding positions in the second sequence
- `from_monomers`: ['R', 'V', 'D'] - Original amino acids
- `to_monomers`: ['S', 'I', 'N'] - Mutated amino acids

This means we found three mutations:
1. Position 241: Arginine (R) → Serine (S)
2. Position 125: Valine (V) → Isoleucine (I)
3. Position 272: Aspartic acid (D) → Asparagine (N)
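
The parallel lists in the dictionary printed above can be combined into compact labels in the common `<original><position><mutated>` notation. This is a small sketch that copies the printed dictionary literally; the variable name `mutation_labels` is introduced only for this example.

```python
# Dictionary copied from the printed output above.
mutations_protein = {
    "from_positions": [241, 125, 272],
    "to_positions": [241, 125, 272],
    "from_monomers": ["R", "V", "D"],
    "to_monomers": ["S", "I", "N"],
}

# Combine original residue, position, and mutated residue into one label.
mutation_labels = [
    f"{original}{position}{mutated}"
    for original, position, mutated in zip(
        mutations_protein["from_monomers"],
        mutations_protein["from_positions"],
        mutations_protein["to_monomers"],
    )
]
print(mutation_labels)  # ['R241S', 'V125I', 'D272N']
```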

In [16]:
for i in range(len(mutations_dna['from_positions'])):
    print(f"Mutation on position {mutations_dna['from_positions'][i]} -> {mutations_dna['to_positions'][i]} with a nucleotide change of {mutations_dna['from_monomers'][i]} -> {mutations_dna['to_monomers'][i]}")

Mutation on position 705 -> 705 with a nucleotide change of G -> A
Mutation on position 395 -> 395 with a nucleotide change of T -> G
Mutation on position 137 -> 137 with a nucleotide change of A -> G
Mutation on position 17 -> 17 with a nucleotide change of T -> C
Mutation on position 473 -> 473 with a nucleotide change of T -> C
Mutation on position 716 -> 716 with a nucleotide change of G -> A
Mutation on position 720 -> 720 with a nucleotide change of A -> C
Mutation on position 198 -> 198 with a nucleotide change of C -> A
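
The mutations are not printed in positional order. As a short sketch, the parallel lists can be zipped into records and sorted by position. The dictionary below is reconstructed from the printed lines above, and the name `sorted_mutations` is introduced only for this example.

```python
# Values reconstructed from the printed DNA mutations above.
mutations_dna = {
    "from_positions": [705, 395, 137, 17, 473, 716, 720, 198],
    "to_positions": [705, 395, 137, 17, 473, 716, 720, 198],
    "from_monomers": ["G", "T", "A", "T", "T", "G", "A", "C"],
    "to_monomers": ["A", "G", "G", "C", "C", "A", "C", "A"],
}

# Build (position, original, mutated) records and sort them by position.
sorted_mutations = sorted(
    zip(
        mutations_dna["from_positions"],
        mutations_dna["from_monomers"],
        mutations_dna["to_monomers"],
    )
)
for position, original, mutated in sorted_mutations:
    print(f"{position}: {original} -> {mutated}")
```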
