A multiple sequence alignment (MSA) arranges three or more nucleotide or amino acid sequences so that regions of similarity line up in the same columns. When only two sequences are compared, the result is called a pairwise alignment.

MSAs can be used to identify a subsequence of a protein that has been conserved in several different species. This can give us insight into the evolutionary relationship between the species.

For this example, the amino acid sequences of potassium voltage-gated channel subfamily A member 1 (KCNA1) from twelve different species, including *Homo sapiens*, were retrieved from the NCBI database and aligned with the Molecular Evolutionary Genetics Analysis (MEGA) software. The resulting MSA was saved as a FASTA file and then explored with Biopython, as demonstrated below.

In [1]:
!pip install biopython

Collecting biopython
  Downloading biopython-1.84-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Downloading biopython-1.84-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.2 MB)
Installing collected packages: biopython
Successfully installed biopython-1.84


In [8]:
from Bio import AlignIO

# AlignIO.read returns a single alignment object; the file must contain exactly one alignment
alignments = AlignIO.read('/content/Potassium_gated_voltage_channel_MSA.fas', 'fasta')

In [9]:
print(alignments)

Alignment with 12 rows and 517 columns
MTVMSG-----------------ENVDEASAAP-GHP---QDGS...TDV A.mel
MTVMSG-----------------ENVDEASAAP-GHP---QEGS...TDV P.tig
MTVMSG-----------------ENVDEASAAP-GHP---QDGS...TDV H.sap
MTVMSG-----------------ENADEASTAP-GHP---QDGS...TDV M.mus
MTVMSG-----------------ENVEEASAAQ-GHP---QDIS...TDV S.har
MTVMAG-----------------ENMDETSALP-GHP---QD-S...TDV E.gar
MTVMAG-----------------ENMDETSALP-GHP---QD-S...TDV M.uni
MDAISGIPSLTAGIDKGQGTGYTDNLNNSHVRPRGQPTLVNKPV...--- S.sal
MDAISGIPSLTAGIDKGQGTGYTDNLNNSHVRPRGQPTLVNKPV...--- S.sal
MDAISGIPSLTAGIDKGQGTGYTDNLNNSHVRPRGQPTLVNKPV...--- P.for
MDAISGIPSLTAGIDKGQGTGYTDNLNNSHVRPRGQPTLVNKPV...--- X.mac
MTVVAG-----------------DNMDETSAVP-GHP---QD-A...TDV N.fur


In [10]:
len(alignments)

12

In [13]:
for record in alignments:
  print(f"{record.id} {len(record)}")

A.mel 517
P.tig 517
H.sap 517
M.mus 517
S.har 517
E.gar 517
M.uni 517
S.sal 517
S.sal 517
P.for 517
X.mac 517
N.fur 517


In [19]:
# Individual rows can be accessed by index, much like the rows of a DataFrame

print(alignments[0])
print('\n')
print(alignments[-1])

ID: A.mel
Name: A.mel
Description: A.mel | pub gene id:KCNA1 seq id:9646 0 description:Potassium voltage-gated channel subfamily A member 1 KCNA1 length: 495
Number of features: 0
Seq('MTVMSG-----------------ENVDEASAAP-GHP---QDGSYPRPAEHDDH...TDV')


ID: N.fur
Name: N.fur
Description: N.fur | pub gene id:KCNA1 seq id:105023 0 description:Potassium voltage-gated channel shaker-related subfamily member 1 KCNA1 length: 491
Number of features: 0
Seq('MTVVAG-----------------DNMDETSAVP-GHP---QD-AY--PPDHNDH...TDV')
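
The description fields above report lengths of 495 and 491, while every aligned row has 517 columns. A likely explanation is that the aligned rows include the gap characters (`-`) added during alignment. The sketch below, which assumes gaps are written as `-` and uses a hypothetical helper named `ungapped_length`, counts only the residues in a row. With the Biopython alignment, a full row string could be obtained with `str(alignments[0].seq)`.

```python
def ungapped_length(aligned_row: str, gap: str = "-") -> int:
    """Count residues in an aligned row, ignoring gap characters."""
    return len(aligned_row) - aligned_row.count(gap)


# Visible start of the A.mel row as printed above (the full row has 517 columns).
row_start = "MTVMSG-----------------ENVDEASAAP-GHP---QDGS"
print(len(row_start), ungapped_length(row_start))
```

Applied to a complete row, this count can be compared with the length given in the record's description.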


In [20]:
#To access the id attribute of an individual record

print(alignments[0].id)
print(alignments[-1].id)

A.mel
N.fur


In [31]:
# A single column can be extracted as a string (index 1 is the second column)

print(alignments[:, 1])

TTTTTTTDDDDT
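
Column strings like the one above are a convenient starting point for spotting conserved positions. The following is a minimal sketch in plain Python, not a Biopython feature: the helpers `column_counts` and `conserved_columns` are hypothetical names, and the toy rows are the first six columns of four rows printed earlier. For the full alignment, the rows could be taken as `[str(record.seq) for record in alignments]`.

```python
from collections import Counter


def column_counts(column: str) -> Counter:
    """Tally how often each symbol (residue or gap) occurs in one column."""
    return Counter(column)


def conserved_columns(rows: list[str]) -> list[int]:
    """Return indices of columns where every row has the same non-gap residue."""
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all aligned rows must have the same length")
    return [
        i
        for i, column in enumerate(zip(*rows))
        if len(set(column)) == 1 and column[0] != "-"
    ]


# The column printed above (index 1 of the alignment).
print(column_counts("TTTTTTTDDDDT"))

# First six columns of the H.sap, E.gar, S.sal and N.fur rows shown earlier.
rows = ["MTVMSG", "MTVMAG", "MDAISG", "MTVVAG"]
print(conserved_columns(rows))  # [0, 5]
```

Requiring identical residues in every row is a strict definition of conservation; a looser threshold (for example, a majority residue) would be a different design choice.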


In [29]:
#Sub-alignments can be extracted from the alignment

print(alignments[:6, :])

Alignment with 6 rows and 517 columns
MTVMSG-----------------ENVDEASAAP-GHP---QDGS...TDV A.mel
MTVMSG-----------------ENVDEASAAP-GHP---QEGS...TDV P.tig
MTVMSG-----------------ENVDEASAAP-GHP---QDGS...TDV H.sap
MTVMSG-----------------ENADEASTAP-GHP---QDGS...TDV M.mus
MTVMSG-----------------ENVEEASAAQ-GHP---QDIS...TDV S.har
MTVMAG-----------------ENMDETSALP-GHP---QD-S...TDV E.gar
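
The same two-index syntax can also restrict the columns. As a sketch that runs without the original FASTA file, the example below builds a small toy alignment from the first six columns of three rows shown above (the toy data and variable names are illustrative only) and slices it by rows and columns.

```python
from Bio.Align import MultipleSeqAlignment
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Toy alignment: the first six columns of three rows shown above.
toy = MultipleSeqAlignment(
    [
        SeqRecord(Seq("MTVMSG"), id="H.sap"),
        SeqRecord(Seq("MTVMAG"), id="E.gar"),
        SeqRecord(Seq("MDAISG"), id="S.sal"),
    ]
)

# Keep all rows but only the first three columns.
first_columns = toy[:, :3]
print(first_columns)

# Keep the last two rows and columns 2 to 4 (the end index is exclusive).
block = toy[1:, 2:5]
print(block)
```

Both slices are themselves alignment objects, so they support the same indexing and iteration as the full alignment.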


The sequences were sourced from the [NCBI](https://www.ncbi.nlm.nih.gov/).

The MSA was performed using the MEGA software; documentation and installation instructions are available on the [MEGA website](https://www.megasoftware.net/).

The Biopython API documentation consulted for this example is available on the [Biopython website](https://biopython.org/docs/1.76/api/Bio.Align.html).