# Multiple Sequence Alignment with Clustal Omega

PyEED provides a convenient interface to Clustal Omega for multiple sequence alignment. This notebook demonstrates how to:
1. Align sequences from a dictionary
2. Align sequences directly from the database

In [1]:
from pyeed import Pyeed
from pyeed.tools.clustalo import ClustalOmega


## Direct Sequence Alignment

You can align sequences directly by passing a dictionary that maps each sequence ID to its sequence string:

In [7]:
# Initialize ClustalOmega
clustalo = ClustalOmega()

# Example sequences
sequences = {
    "seq1": "AKFVMPDRAWHLYTGNECSKQRLYVWFHDGAPILKTQSDNMGAYRCPLFHVTKNWEI",
    "seq2": "AKFVMPDRQWHLYTGQECSKQRLYVWFHDGAPILKTQSDNMGAYRCPLFHVTKNWEI",
    "seq3": "AKFVMPDRQWHLYTGNECSKQRLYVWFHDGAPILKTQADNMGAYRCALFHVTK",
}

# Perform alignment
alignment = clustalo.align(sequences)
print("Aligned sequences:")
print(alignment)

2025-01-22 15:16:24.610 | ERROR | pyeed.tools.clustalo:_run_clustalo_service:80 - Connection error: [Errno 8] nodename nor servname provided, or not known


Aligned sequences:
seq1  AKFVMPDRAWHLYTGNECSKQRLYVWFHDGAPILKTQSDNMGAYRCPLFHVTKNWEI
seq2  AKFVMPDRQWHLYTGQECSKQRLYVWFHDGAPILKTQSDNMGAYRCPLFHVTKNWEI
seq3  AKFVMPDRQWHLYTGNECSKQRLYVWFHDGAPILKTQADNMGAYRCALFHVTK----
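
To get a quick feel for how similar the aligned sequences are, the rows printed above can be compared column by column. The following is a minimal sketch that works on plain strings copied from that output; it does not use any PyEED API. The helper name `pairwise_identity` is hypothetical, and the identity definition (matches divided by columns where neither row has a gap) is one common convention among several.

```python
from itertools import combinations

# Aligned rows copied from the output above (plain strings, not PyEED objects).
aligned = {
    "seq1": "AKFVMPDRAWHLYTGNECSKQRLYVWFHDGAPILKTQSDNMGAYRCPLFHVTKNWEI",
    "seq2": "AKFVMPDRQWHLYTGQECSKQRLYVWFHDGAPILKTQSDNMGAYRCPLFHVTKNWEI",
    "seq3": "AKFVMPDRQWHLYTGNECSKQRLYVWFHDGAPILKTQADNMGAYRCALFHVTK----",
}


def pairwise_identity(row_a: str, row_b: str) -> float:
    """Fraction of identical residues over columns where neither row has a gap.

    Hypothetical helper for illustration; other identity definitions exist.
    """
    compared = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not compared:
        return 0.0
    matches = sum(a == b for a, b in compared)
    return matches / len(compared)


# Report the identity for every pair of aligned rows.
for (id_a, row_a), (id_b, row_b) in combinations(aligned.items(), 2):
    print(f"{id_a} vs {id_b}: {pairwise_identity(row_a, row_b):.3f}")
```

Columns in which either row has a gap are skipped, so the trailing padding of `seq3` does not lower its score.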


## Database-based Alignment

You can also align sequences directly from the database by providing a list of accession IDs:

In [6]:
# Connect to database
pyeed = Pyeed(uri="bolt://localhost:7687", user="neo4j", password="12345678")

# Collect the accession IDs of the first 10 proteins in the database
from pyeed.model import Protein

accession_ids = [protein.accession_id for protein in Protein.nodes.all()][:10]

# Align sequences from database
alignment = clustalo.align_from_db(accession_ids, pyeed.db)
print("Database alignment:")
print(alignment)

📡 Connected to database.
Database alignment:
A0A0H5C8F2  MSAAAETFLFTSESVGEGHPDKICDQVSDAILDACLAVDPLSKVACETASKTGMIMVFGEITT-KAQLDYQKIIRDTIKHIGYDSSDKGFDYKTCNVLVAIEQQSPDIAQGLHYEK----------ALEELGAGDQGIMFGYATDETDEKLPLTILLAHKLNAALADARR--SGALPWLRPDTKTQVTVEYKKDGGAVIPLRVDTIVISTQHAEEISTEDLRSEIIKHIVQKVIPEHLLDDKTIYHIQPSGRFVIGGPQGDAGLTGRKIIVDTYGGWGAHGGGAFSGKDFSKVDRSAAYAARWIAKSLVHAKLARRALVQLSYAIGVAEPLSIYVDTYGTSKYTSEQLVDIIKGNFDLRPGVVVKELDLARPIYFKTASYGHFTDQSN-----------------PWEQPKPLKF--------
A0A098MEC3  -MSIKGRHLFTSESVTEGHPDKICDQISDAVLDAFLANDPNARVACEVAVATGLVLVIGEISTKSEYVDIPAIVRNTIKEIGYTRAKFGFDYNTCAVLTSLNEQSADIAQGVNAALEGRDPAQVDEETANIGAGDQGLMFGFATNETPELMPLPIALSHRIARRLAEVRK--NGTLEYLRPDGKTQVTIEYL-DDK---PVRVDTIVVSTQHAEEISLEQIQADIKEHVILPVVPAELLDGETKYFINPTGRFVIGGPQGDAGLTGRKIIVDTYGGYARHGGGAFSGKDPTKVDRSAAYAARYVAKNLVAAGLADKCEIQLAYAIGVANPVSINVDTYGTGKVSEEKLVELISNNFDLRPAGIIAMLDLRKPIYKHTAAYGHFGRTDI---------------DLPWERLDKADLLKSQAEL-
A0A1E3P4X5  ---MSETFLFTSESVGEGHPDKICDQVSDAILDAALAIDPLSKVACETAAKTGLILVFGEITT-KAQLDYQ

## Understanding Alignment Results

The alignment result is a `MultipleSequenceAlignment` object with the following properties:
- It contains a list of `Sequence` objects.
- Each `Sequence` has an ID and an aligned sequence string.
- Gaps are represented by `-` characters.
- All sequences are padded with gaps to the same length.

The alignment preserves sequence order and maintains sequence IDs from the input.
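
The properties listed above can be checked on the rows from the direct alignment example. This is a minimal sketch on plain strings copied from that output. How to read IDs and aligned strings out of the `MultipleSequenceAlignment` object is not shown here, and the helper `variable_columns` is a hypothetical name introduced for illustration.

```python
# Aligned rows copied from the direct alignment output (plain strings).
aligned = {
    "seq1": "AKFVMPDRAWHLYTGNECSKQRLYVWFHDGAPILKTQSDNMGAYRCPLFHVTKNWEI",
    "seq2": "AKFVMPDRQWHLYTGQECSKQRLYVWFHDGAPILKTQSDNMGAYRCPLFHVTKNWEI",
    "seq3": "AKFVMPDRQWHLYTGNECSKQRLYVWFHDGAPILKTQADNMGAYRCALFHVTK----",
}

# Every row of a valid alignment has the same length.
lengths = {len(row) for row in aligned.values()}
if len(lengths) != 1:
    raise ValueError(f"rows differ in length: {sorted(lengths)}")
alignment_length = lengths.pop()

# Number of gap characters per sequence.
gap_counts = {seq_id: row.count("-") for seq_id, row in aligned.items()}


def variable_columns(rows: list[str]) -> list[int]:
    """Return 0-based indices of columns with more than one distinct residue.

    Gap characters are ignored when deciding whether a column varies.
    """
    indices = []
    for index, column in enumerate(zip(*rows)):
        residues = set(column) - {"-"}
        if len(residues) > 1:
            indices.append(index)
    return indices


print("alignment length:", alignment_length)
print("gaps per sequence:", gap_counts)
print("variable columns:", variable_columns(list(aligned.values())))
```

Removing the `-` characters from a row gives back the original input sequence, which is a simple way to confirm that the alignment only inserted gaps.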

## Configuration

ClustalOmega requires the PyEED Docker service to be running. To set it up:
1. Install Docker.
2. Start the service with `docker-compose up -d`.

The service runs on port 5001 by default.
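
Before calling `clustalo.align`, it can help to confirm that something is listening on the service port. The sketch below uses only the Python standard library. The host `localhost` and the helper name `service_reachable` are assumptions for illustration; only the port number 5001 comes from this page. A successful TCP connection shows that the port is open, not that the service behind it is healthy.

```python
import socket


def service_reachable(host: str = "localhost", port: int = 5001,
                      timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper; the default host is an assumption.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if service_reachable():
    print("Port 5001 is open; the alignment service may be available.")
else:
    print("Nothing is listening on port 5001; start the Docker service first.")
```

If the check fails because the host name cannot be resolved, the cause is the configured service address rather than Docker itself.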