# Detailed Mutation Analysis Tutorial Using PyEED

This notebook demonstrates a comprehensive workflow for analyzing mutations between protein sequences using PyEED (Python Enzyme Engineering Database). We'll walk through each step in detail.

## Cell 1: Required Imports

In [47]:
%reload_ext autoreload
%autoreload 2

import logging

from pyeed import Pyeed
from pyeed.analysis.mutation_detection import MutationDetection
from pyeed.analysis.standard_numbering import StandardNumberingTool

- `Pyeed`: Main class for interacting with the PyEED database
- `MutationDetection`: Class for identifying differences between protein sequences
- `StandardNumberingTool`: Ensures consistent position numbering across different protein sequences
- `logging`: For tracking execution progress and debugging

## Cell 2: Logging Configuration

In [48]:
logging.basicConfig(level=logging.ERROR, format='%(asctime)s - %(levelname)s - %(message)s')
LOGGER = logging.getLogger(__name__)

Sets up logging to:
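- Show only ERROR level messages and above from Python's standard `logging` module (the INFO and DEBUG lines in later cell outputs come from PyEED's own logger, which this setting does not filter)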
- Include timestamp, log level, and message in the output
- Create a logger instance specific to this notebook
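
To make the effect of the ERROR threshold concrete, here is a minimal sketch using only the standard library. The logger name `mutation_tutorial_demo` and the in-memory buffer are illustrative assumptions, not part of the notebook.

```python
import io
import logging

# Capture log output in memory so the effect of the level is easy to inspect.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))

demo_logger = logging.getLogger("mutation_tutorial_demo")  # hypothetical name
demo_logger.setLevel(logging.ERROR)
demo_logger.addHandler(handler)
demo_logger.propagate = False  # keep the demo isolated from the root logger

demo_logger.info("fetching sequences")     # below ERROR: dropped
demo_logger.error("could not reach NCBI")  # ERROR: kept

captured = buffer.getvalue()
print(captured)
```

Only the error message reaches the buffer; the INFO call is discarded before any handler sees it.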

## Cell 3: Database Setup and Connection

In [49]:
uri = "bolt://129.69.129.130:7687"
user = "neo4j"
password = "12345678"

eedb = Pyeed(uri, user=user, password=password)

eedb.db.wipe_database("2025-02-07")
eedb.db.remove_db_constraints(user=user, password=password)

Pyeed Graph Object Mapping constraints not defined. Use _install_labels() to set up model constraints.
📡 Connected to database.
All data has been wiped from the database.
Connecting to bolt://neo4j:12345678@129.69.129.130:7687
Dropping constraints...

Dropping indexes...

All constraints and indexes have been removed from the database.


1. Defines connection parameters for the Neo4j database at `bolt://129.69.129.130:7687`
2. Creates a PyEED instance with these credentials
3. Wipes all existing data from the database (passing the date "2025-02-07")
4. Removes all database constraints and indexes for a fresh start

This ensures we're working with a clean database state.
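
Because the cell above hard-codes the credentials, one possible alternative is to read them from environment variables. The sketch below is an assumption-laden illustration: the variable names `NEO4J_URI`, `NEO4J_USER`, and `NEO4J_PASSWORD`, the helper `load_neo4j_settings`, and the localhost default are all hypothetical and not part of PyEED.

```python
import os


def load_neo4j_settings(env=os.environ):
    """Return (uri, user, password) read from a mapping of environment variables.

    The variable names used here are hypothetical, chosen for this sketch.
    """
    uri = env.get("NEO4J_URI", "bolt://localhost:7687")  # assumed default
    user = env.get("NEO4J_USER", "neo4j")
    password = env.get("NEO4J_PASSWORD")
    if password is None:
        raise RuntimeError("Set NEO4J_PASSWORD before connecting")
    return uri, user, password


# Demonstration with an explicit mapping instead of the real environment.
uri, user, password = load_neo4j_settings({"NEO4J_PASSWORD": "example-only"})
print(uri, user)
```

The returned values can then be passed to `Pyeed(uri, user=user, password=password)` exactly as in the cell above.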

## Cell 4: Sequence Retrieval

In [50]:
ids = ["KJO56189.1", "KLP91446.1"]

eedb.fetch_from_primary_db(ids, db='ncbi_protein')

2025-02-07 12:42:41.902 | INFO     | pyeed.main:fetch_from_primary_db:87 - Found 0 sequences in the database.
2025-02-07 12:42:41.903 | INFO     | pyeed.main:fetch_from_primary_db:89 - Fetching 2 sequences from ncbi_protein.
2025-02-07 12:42:41.926 | INFO     | pyeed.adapter.primary_db_adapter:execute_requests:140 - Starting requests for 1 batches.
2025-02-07 12:42:41.927 | DEBUG    | pyeed.adapter.primary_db_adapter:execute_requests:142 - Prepared 1 request payloads.
2025-02-07 12:42:41.928 | DEBUG    | pyeed.adapter.primary_db_adapter:_fetch_response:121 - Sending request to https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi with parameters: {'retmode': 'text', 'rettype': 'genbank', 'db': 'protein', 'id': 

1. Defines two protein sequence IDs to analyze
2. Fetches these sequences from NCBI's protein database
3. Both sequences are beta-lactamase proteins:
   - KJO56189.1: beta-lactamase TEM
   - KLP91446.1: class A beta-lactamase
4. The sequences are automatically parsed and stored in the Neo4j database
5. Additional metadata like organism information and CDS (Coding Sequence) details are also stored
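
The log above shows the request going to NCBI's E-utilities `efetch` endpoint with `db`, `rettype`, and `retmode` parameters. As a rough illustration of what such a request looks like, the sketch below builds the URL without sending it. The helper `build_efetch_url` is hypothetical, and joining the IDs with commas is an assumption (E-utilities accepts comma-separated ID lists, but the `id` value is truncated in the log, so PyEED's exact formatting is not shown).

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accessions):
    """Build an efetch URL for GenBank-format protein records (hypothetical helper)."""
    params = {
        "db": "protein",
        "id": ",".join(accessions),  # assumption: comma-separated accession list
        "rettype": "genbank",
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


url = build_efetch_url(["KJO56189.1", "KLP91446.1"])
print(url)
```

No network call is made here; the URL could be opened with any HTTP client to inspect the raw GenBank records that PyEED parses.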

## Cell 5: Standard Numbering Application

In [51]:
sn = StandardNumberingTool(name="test_standard_numbering")
sn.apply_standard_numbering(base_sequence_id='KJO56189.1', db=eedb.db, list_of_seq_ids=ids)

2025-02-07 12:42:43.084 | INFO     | pyeed.analysis.standard_numbering:apply_standard_numbering:422 - Using 2 sequences for standard numbering
2025-02-07 12:42:43.112 | DEBUG    | pyeed.tools.clustalo:_run_clustalo_service:81 - Connection error: [Errno -3] Temporary failure in name resolution
2025-02-07 12:42:43.161 | INFO     | pyeed.analysis.standard_numbering:apply_standard_numbering:441 - Alignment received from ClustalOmega:
KJO56189.1  MSIQHFRVALIPFFAAFCLPVFAHPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRVDAGQEQLGRRIHYSQNDLVEYSPVTEKHLTDGMTVRELCSAAITMSDNTAANLLLTTIGGPKELTAFLHNMGDHVTRLDSWEPELNEAIPNDERDTTMPAAMATTLRKLLTGELLTLASRQQLIDWMEADKVAGPLLRSALPAGWFIADKSGAGERGSRGIIAALGPDGKPSRIVVIYTTGSQATMDERNRQIAEIGASLIKHW
KLP91446.1  MSIQHFRVALIPFFAAFCLPVFAHPETLVKVKDAEDQLGARVGYIELDLNSGKILESFRPEERFPMMSTFKVLLCGAVLSRVDAGQEQLGRRI

1. Creates a new StandardNumberingTool instance named "test_standard_numbering"
2. Uses KJO56189.1 as the reference sequence for numbering
3. Performs multiple sequence alignment (MSA) using Clustal Omega
4. The alignment output is truncated above; in full CLUSTAL-format output, a consensus line beneath the sequences marks each column:
   - Asterisks (*) indicate identical residues
   - Colons (:) indicate conserved substitutions
   - Periods (.) indicate semi-conserved substitutions
5. This step is crucial for ensuring mutations are correctly identified relative to consistent positions
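
The idea behind numbering against a reference can be sketched on a toy alignment. This is an illustration under stated assumptions, not PyEED's implementation: the function `number_against_reference`, the 1-based positions, and the `"2.1"` style labels for residues inserted relative to the reference are all choices made for this example.

```python
def number_against_reference(aligned_ref, aligned_seq, gap="-"):
    """Label each residue of aligned_seq with the reference position it aligns to.

    Residues that align to a gap in the reference get labels such as "2.1",
    meaning "first insertion after reference position 2" (an assumed convention).
    """
    if len(aligned_ref) != len(aligned_seq):
        raise ValueError("aligned sequences must have equal length")

    labels = []
    ref_pos = 0     # last reference position seen (1-based)
    insert_idx = 0  # counts columns where the reference has a gap
    for ref_char, seq_char in zip(aligned_ref, aligned_seq):
        if ref_char != gap:
            ref_pos += 1
            insert_idx = 0
        else:
            insert_idx += 1
        if seq_char == gap:
            continue  # the sequence has no residue in this column
        label = str(ref_pos) if ref_char != gap else f"{ref_pos}.{insert_idx}"
        labels.append((label, seq_char))
    return labels


toy_labels = number_against_reference("MS-IQH", "MSAI-H")
print(toy_labels)
```

In the toy case the inserted `A` is labeled `2.1`, and the final `H` keeps reference position 5 even though the sequence lacks the residue at position 4, so equivalent residues share a label across sequences.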

## Cell 6: Mutation Detection

In [52]:
md = MutationDetection()

seq1 = "KJO56189.1"
seq2 = "KLP91446.1"
name_of_standard_numbering_tool = "test_standard_numbering"

mutations = md.get_mutations_between_sequences(
    seq1, seq2, eedb.db, name_of_standard_numbering_tool
)

2025-02-07 12:42:43.824 | DEBUG    | pyeed.analysis.mutation_detection:save_mutations_to_db:137 - Saved 3 mutations to database


1. Creates a MutationDetection instance
2. Compares the two sequences using the standard numbering scheme
3. Identifies all positions where amino acids differ
4. Automatically saves the mutations to the database
5. Returns a dictionary containing mutation information
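
The comparison step can be approximated on a toy pair of aligned sequences. The sketch below is hypothetical and simplified: `find_substitutions` is not a PyEED function, it reports each sequence's own 1-based positions rather than standard numbering, and it skips gap columns, which may differ from how `MutationDetection` treats insertions and deletions. It only mirrors the four keys of the returned dictionary.

```python
def find_substitutions(aligned_a, aligned_b, gap="-"):
    """Collect substitutions between two aligned sequences (toy illustration)."""
    result = {
        "from_positions": [],
        "to_positions": [],
        "from_monomers": [],
        "to_monomers": [],
    }
    pos_a = pos_b = 0  # 1-based residue counters, advanced on non-gap characters
    for a, b in zip(aligned_a, aligned_b):
        if a != gap:
            pos_a += 1
        if b != gap:
            pos_b += 1
        if a == gap or b == gap or a == b:
            continue  # only aligned, differing residues count as substitutions
        result["from_positions"].append(pos_a)
        result["to_positions"].append(pos_b)
        result["from_monomers"].append(a)
        result["to_monomers"].append(b)
    return result


toy_mutations = find_substitutions("MSIQHF-RV", "MSLQHFARI")
print(toy_mutations)
```

The second substitution has different `from` and `to` positions because the inserted `A` shifts the second sequence by one; standard numbering exists to keep such positions comparable.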

## Cell 7: Results Analysis

In [53]:
print(mutations)

{'from_positions': [162, 236, 102], 'to_positions': [162, 236, 102], 'from_monomers': ['S', 'G', 'E'], 'to_monomers': ['R', 'S', 'K']}


The printed dictionary holds four parallel lists; entries at the same index describe one mutation. The lists are not sorted by position:
- `from_positions`: [162, 236, 102] - Positions of the differing residues in the first sequence
- `to_positions`: [162, 236, 102] - Corresponding positions in the second sequence
- `from_monomers`: ['S', 'G', 'E'] - Amino acids in the first sequence
- `to_monomers`: ['R', 'S', 'K'] - Amino acids in the second sequence

Ordered by position, the three mutations are:
1. Position 102: Glutamic acid (E) → Lysine (K)
2. Position 162: Serine (S) → Arginine (R)
3. Position 236: Glycine (G) → Serine (S)
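
The dictionary can be turned into conventional mutation labels (original residue, position, new residue) with a few lines of Python. The helper `to_mutation_labels` is a hypothetical convenience written for this tutorial; the input is the dictionary printed above.

```python
example_mutations = {
    "from_positions": [162, 236, 102],
    "to_positions": [162, 236, 102],
    "from_monomers": ["S", "G", "E"],
    "to_monomers": ["R", "S", "K"],
}


def to_mutation_labels(result):
    """Return labels such as 'E102K', sorted by position in the first sequence."""
    rows = zip(
        result["from_positions"],
        result["from_monomers"],
        result["to_monomers"],
    )
    # Tuples sort by their first element, so this orders the rows by position.
    return [f"{old}{pos}{new}" for pos, old, new in sorted(rows)]


mutation_labels = to_mutation_labels(example_mutations)
print(mutation_labels)  # ['E102K', 'S162R', 'G236S']
```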

These mutations could be significant for:
- Understanding protein evolution
- Analyzing functional differences
- Planning protein engineering experiments
- Studying antibiotic resistance mechanisms (since these are beta-lactamase proteins)

## Technical Notes:
- The database operations are performed using Neo4j's Bolt protocol
- Sequence data is retrieved using NCBI's E-utilities
- Multiple sequence alignment is performed using Clustal Omega
- All operations are tracked with detailed logging
- The system automatically handles protein metadata and relationships