# Mutation Analysis

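This notebook compares two protein sequences stored in a PyEED database: it fetches them from NCBI, aligns them under a shared numbering scheme, and lists the positions at which their residues differ.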

In [1]:
import sys
from loguru import logger

from pyeed import Pyeed
from pyeed.analysis.mutation_detection import MutationDetection
from pyeed.analysis.standard_numbering import StandardNumberingTool

logger.remove()
level = logger.add(sys.stderr, level="WARNING")

- `Pyeed`: Main class for interacting with the PyEED database
- `MutationDetection`: Class for identifying differences between protein sequences
- `StandardNumberingTool`: Ensures consistent position numbering across different protein sequences

In [None]:
uri = "bolt://129.69.129.130:7687"
user = "neo4j"
password = "12345678"

eedb = Pyeed(uri, user=user, password=password)

Pyeed Graph Object Mapping constraints not defined. Use _install_labels() to set up model constraints.
📡 Connected to database.


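1. Defines connection parameters for a Neo4j database (here a remote host reached over the Bolt protocol on port 7687)
2. Creates a `Pyeed` instance with these credentials

The warning in the output reports that model constraints are not yet defined; as the message suggests, they can be set up with `_install_labels()`. The cell itself does not wipe existing data or remove constraints, so reset the database separately if the analysis requires a clean state.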
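The connection cell hard-codes the URI, user name, and password. As a minimal sketch of one alternative, the settings could be read from environment variables instead. The variable names `NEO4J_URI`, `NEO4J_USER`, and `NEO4J_PASSWORD` and the helper `load_connection_settings` are hypothetical choices for this example, not something PyEED defines or reads on its own.

```python
import os


def load_connection_settings(env=None):
    """Collect Neo4j connection settings from environment variables.

    The variable names used here are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    required = ("NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD")
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "uri": env["NEO4J_URI"],
        "user": env["NEO4J_USER"],
        "password": env["NEO4J_PASSWORD"],
    }


# Intended use with the constructor call shown in the notebook:
# settings = load_connection_settings()
# eedb = Pyeed(settings["uri"], user=settings["user"], password=settings["password"])
```

Keeping credentials out of the notebook means the file can be shared without exposing the database password.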

## Sequence Retrieval

In [3]:
ids = ["KJO56189.1", "KLP91446.1"]

eedb.fetch_from_primary_db(ids, db="ncbi_protein")

1. Defines two protein sequence IDs to analyze
2. Fetches these sequences from NCBI's protein database
3. Both sequences are beta-lactamase proteins:
   - KJO56189.1: beta-lactamase TEM
   - KLP91446.1: class A beta-lactamase
4. The sequences are automatically parsed and stored in the Neo4j database
5. Additional metadata like organism information and CDS (Coding Sequence) details are also stored

## Apply Standard Numbering

In [4]:
sn = StandardNumberingTool(name="test_standard_numbering")
sn.apply_standard_numbering(
    base_sequence_id="KJO56189.1", db=eedb.db, list_of_seq_ids=ids
)

2025-02-07 17:32:37.654 | ERROR    | pyeed.tools.clustalo:align:35 - Alignment failed: [Errno 60] Operation timed out


ConnectTimeout: [Errno 60] Operation timed out

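1. Creates a new `StandardNumberingTool` instance named "test_standard_numbering"
2. Uses KJO56189.1 as the reference sequence for numbering
3. Performs a multiple sequence alignment (MSA) with Clustal Omega (`pyeed.tools.clustalo`)
4. When the alignment succeeds, the Clustal output marks each column:
   - Asterisks (*) indicate identical residues
   - Colons (:) indicate substitutions between residues with strongly similar properties
   - Periods (.) indicate substitutions between residues with weakly similar properties
5. This step ensures that mutations are identified relative to consistent positions in both sequences

In the run captured above, the alignment request timed out (`ConnectTimeout`), so no alignment is displayed. Rerun the cell once the connection succeeds before continuing.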
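To illustrate the idea behind reference-based numbering, here is a minimal sketch that labels the residues of one aligned sequence with positions taken from the reference. It is not PyEED's implementation: the function name `number_against_reference` and the `"<position>.<n>"` labels for insertions are assumptions made for this example.

```python
def number_against_reference(aligned_ref, aligned_seq, gap="-"):
    """Label each residue of aligned_seq with a position derived from aligned_ref.

    Columns where the reference has a residue get the next integer label.
    Columns where the reference has a gap count as insertions and get a
    "<last reference position>.<n>" label (an assumed scheme for this sketch).
    """
    if len(aligned_ref) != len(aligned_seq):
        raise ValueError("aligned sequences must have equal length")

    labels = []
    ref_pos = 0        # last reference position seen so far
    insert_count = 0   # insertions since that reference position
    for ref_char, seq_char in zip(aligned_ref, aligned_seq):
        if ref_char != gap:
            ref_pos += 1
            insert_count = 0
            column_label = str(ref_pos)
        else:
            insert_count += 1
            column_label = f"{ref_pos}.{insert_count}"
        if seq_char != gap:  # gaps in the numbered sequence carry no residue
            labels.append((column_label, seq_char))
    return labels


# The second sequence has one extra residue (A) after reference position 2.
print(number_against_reference("MK-LV", "MKALV"))
# [('1', 'M'), ('2', 'K'), ('2.1', 'A'), ('3', 'L'), ('4', 'V')]
```

Because labels follow the reference, an insertion or deletion in one sequence does not shift the labels of the residues after it.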

## Mutation Detection

In [6]:
md = MutationDetection()

seq1 = "KJO56189.1"
seq2 = "KLP91446.1"
name_of_standard_numbering_tool = "test_standard_numbering"

mutations = md.get_mutations_between_sequences(
    seq1, seq2, eedb.db, name_of_standard_numbering_tool
)

2025-02-07 15:26:53.370 | DEBUG    | pyeed.analysis.mutation_detection:save_mutations_to_db:137 - Saved 3 mutations to database


1. Creates a MutationDetection instance
2. Compares the two sequences using the standard numbering scheme
3. Identifies all positions where amino acids differ
4. Automatically saves the mutations to the database
5. Returns a dictionary containing mutation information
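The comparison step can be sketched in plain Python. This is an illustration under stated assumptions, not the code behind `get_mutations_between_sequences`: each sequence is represented as a hypothetical `{position: residue}` mapping, both sequences share one position label, and the result mirrors the four keys of the dictionary shown in the Results section.

```python
def find_mutations(numbered_a, numbered_b):
    """Compare two sequences given as {position: residue} mappings.

    Only positions present in both sequences are compared; positions that
    exist in just one of them (insertions or deletions) are skipped.
    """
    result = {
        "from_positions": [],
        "to_positions": [],
        "from_monomers": [],
        "to_monomers": [],
    }
    for position, residue_a in numbered_a.items():
        residue_b = numbered_b.get(position)
        if residue_b is not None and residue_b != residue_a:
            result["from_positions"].append(position)
            result["to_positions"].append(position)
            result["from_monomers"].append(residue_a)
            result["to_monomers"].append(residue_b)
    return result


# Toy input: the sequences differ only at position 2 (E in the first, K in the second).
seq_a = {1: "M", 2: "E", 3: "S"}
seq_b = {1: "M", 2: "K", 3: "S"}
print(find_mutations(seq_a, seq_b))
# {'from_positions': [2], 'to_positions': [2], 'from_monomers': ['E'], 'to_monomers': ['K']}
```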

## Results

In [7]:
print(mutations)

{'from_positions': [236, 102, 162], 'to_positions': [236, 102, 162], 'from_monomers': ['G', 'E', 'S'], 'to_monomers': ['S', 'K', 'R']}


The returned dictionary holds four index-aligned lists, so entries at the same index describe one mutation:
- `from_positions`: [236, 102, 162] - Positions of the differing residues in the first sequence
- `to_positions`: [236, 102, 162] - Corresponding positions in the second sequence
- `from_monomers`: ['G', 'E', 'S'] - Amino acids in the first sequence
- `to_monomers`: ['S', 'K', 'R'] - Amino acids in the second sequence

Sorted by position, the three mutations are:
1. Position 102: Glutamic acid (E) → Lysine (K)
2. Position 162: Serine (S) → Arginine (R)
3. Position 236: Glycine (G) → Serine (S)
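The same summary can be produced programmatically. The sketch below converts the returned dictionary into conventional substitution strings (original residue, position, new residue) sorted by position. The helper name `format_mutations` is hypothetical, and the dictionary is copied from the printed output.

```python
# Dictionary copied from the printed output of the Results cell.
mutations = {
    "from_positions": [236, 102, 162],
    "to_positions": [236, 102, 162],
    "from_monomers": ["G", "E", "S"],
    "to_monomers": ["S", "K", "R"],
}


def format_mutations(mutations):
    """Return substitution strings such as 'E102K', ordered by position."""
    records = zip(
        mutations["from_positions"],
        mutations["from_monomers"],
        mutations["to_monomers"],
    )
    # Sorting the (position, from, to) tuples orders them by position.
    return [f"{src}{pos}{dst}" for pos, src, dst in sorted(records)]


print(format_mutations(mutations))
# ['E102K', 'S162R', 'G236S']
```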