# Alignment of two or more sequences using Clustal



### Summary 
1. Step 1 - Get input sequences. In this example, the inputs are NCBI accession numbers, and the sequences are retrieved from the GenBank database. The sequences in this example are coronavirus spike proteins. Two or more sequences can be used.
2. Step 2 - Align the sequences. This uses Clustal Omega. The output is in fasta format, with one aligned sequence per input sequence.


### Notes and references

1. Clustal Omega: https://www.ebi.ac.uk/Tools/msa/clustalo/
2. Entrez (valid values of retmode and rettype): https://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/

### Installs
1. clustalo for alignments:
(a) sudo apt update
(b) sudo apt install clustalo
2. biopython:
(a) sudo apt update
(b) sudo apt install python3-biopython
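
Before running the notebook, it can help to confirm that both installs are visible from Python. The following is a minimal sketch using only the standard library; the helper name `check_requirements` is an illustrative assumption, not part of Biopython or Clustal Omega.

```python
import importlib.util
import shutil


def check_requirements(executable="clustalo", module="Bio"):
    """Report whether the aligner executable and the Python package can be found."""
    return {
        executable: shutil.which(executable) is not None,       # is it on PATH?
        module: importlib.util.find_spec(module) is not None,   # is it importable?
    }


if __name__ == "__main__":
    for name, found in check_requirements().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

If either entry is reported as missing, the install steps above (or an equivalent such as `pip install biopython`) need to be repeated for the Python environment the notebook is using.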

In [1]:
## Imports

In [2]:
import Bio                        
from Bio import Entrez
from Bio.Align.Applications import ClustalOmegaCommandline
import tempfile

## Step 1. Get Input Sequences
* Inputs are two or more sequences in fasta format. Include one as the reference sequence.
* In this case, sequences are retrieved from GenBank using NCBI accession numbers.
* All sequences are written to a single output file in fasta format for Clustal.
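
To make the retrieval step more concrete, the sketch below builds the kind of E-utilities `efetch` request that a fasta download by accession number corresponds to. It only constructs the URL and does not contact NCBI. The helper name `build_efetch_url` is a hypothetical illustration; the query parameters (`db`, `id`, `rettype`, `retmode`, `email`) are standard E-utilities parameters.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, database="protein", email=None):
    """Return an efetch URL that requests one record as plain-text fasta."""
    params = {
        "db": database,       # e.g. "protein" or "nucleotide"
        "id": accession,      # NCBI accession number
        "rettype": "fasta",   # record type
        "retmode": "text",    # plain text rather than XML
    }
    if email:
        params["email"] = email   # NCBI asks for a contact address
    return f"{EFETCH_URL}?{urlencode(params)}"


print(build_efetch_url("YP_009724390"))
```

Supplying a contact email with each request is what NCBI asks for; with Biopython the same thing is done by setting `Entrez.email` before calling `Entrez.efetch`.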

In [3]:
omic = ("UFO69279", "protein")          # NCBI accession number for Omicron spike protein
delta = ("QWK65230", "protein")          # NCBI accession number for Delta spike protein
gamma = ("P0DTC2", "protein")         # listed as Gamma spike protein; P0DTC2 appears to be the UniProt accession of the reference spike, so verify before use
beta = ("UJB55404", "protein")         # NCBI accession number for Beta spike protein


# this example will just use the Delta Spike Protein and the Wuhan Spike (as base reference)
base = ("YP_009724390","protein")       # NCBI accession number for Wuhan spike protein ** REFERENCE
subj_sequences = [delta]                   # include subjects in this list                 

In [4]:
def read_seq(idX, database):     # Read a sequence from GenBank
    """
    Inputs:   (a) NCBI accession number (b) database ("protein", etc.)
    Process:  (a) Retrieve sequence (header and sequence) using Entrez (see references / link for more info)
    Outputs:  (a) sequence (header and sequence); i.e. fasta format
    """
    hdrX = ''                           # fasta header line (starts with ">")
    seqX = ''                           # sequence lines, line breaks included
    fetch_handle = Entrez.efetch(db=database, id=idX, rettype='fasta', retmode="text")    # query NCBI

    for record in fetch_handle:        # one line of the fasta text at a time
        if len(record) == 0: break     # end of data
        if record[0] == ">":           # header line
            hdrX = record
        else:
            seqX = seqX + record       # sequence line
    fetch_handle.close()               # release the connection
    # Note: len(seqX) counts line-break characters as well as residues
    print(f"\nHdr: {hdrX} *** Sequence length: {len(seqX)}")
    return hdrX + seqX                 # header concatenated to sequence
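
Because the fasta text returned by `read_seq` keeps its line breaks, the length it prints is larger than the number of residues. The sketch below shows one way to split a single fasta record into a header and a sequence without line breaks, so that residues can be counted directly. The helper name `split_fasta_record` and the toy record are assumptions made for illustration only.

```python
def split_fasta_record(fasta_text):
    """Split one fasta record into (header, sequence) with line breaks removed."""
    # Drop blank lines and surrounding whitespace
    lines = [line.strip() for line in fasta_text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a fasta record starting with '>'")
    header = lines[0][1:]            # header text without the leading ">"
    sequence = "".join(lines[1:])    # residues only
    return header, sequence


# Toy record (not real data): two sequence lines plus a trailing blank line
example = ">demo_1 toy protein\nMFVFLVLLPL\nVSSQCVNLTT\n\n"
header, sequence = split_fasta_record(example)
raw_body = example.split("\n", 1)[1]     # everything after the header line
print(f"{header}: {len(sequence)} residues, {len(raw_body)} characters with line breaks")
```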

In [5]:
# write input sequences to a single fasta file for input to Clustal. This will be a temporary file used to pass data. 
print ("Header and length of sequence for each input (possible warning due to lack of specified email address):")
# 1. Add Reference as first sequence
out_fasta = read_seq(base[0],base[1])      # Retrieve sequence data from database (accession number and database)

# 2. Add Subject sequences (one or more)
for seq in subj_sequences:
    out_fasta = out_fasta + read_seq(seq[0],seq[1])    # Retrieve sequence data from database (accession number and database)

# 3. Write output for Clustal using temporary file
tmp = tempfile.NamedTemporaryFile()       # create and write sequences to a single temporary file 
with open(tmp.name, 'w') as f:
    f.write(out_fasta.strip()) 
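
The cell above reopens the temporary file by name while the original handle is still open. That works on Linux, but the Python documentation notes that it is platform-dependent, and the file disappears as soon as `tmp` is closed or garbage-collected. A more portable variant, sketched here with a hypothetical helper `write_temp_fasta`, writes through the handle itself and leaves cleanup to the caller.

```python
import os
import tempfile


def write_temp_fasta(fasta_text):
    """Write fasta text to a temporary file and return its path.

    The file is created with delete=False, so the caller is responsible
    for removing it once the aligner has read it.
    """
    with tempfile.NamedTemporaryFile(mode="w", suffix=".fasta", delete=False) as handle:
        handle.write(fasta_text.strip() + "\n")
        return handle.name


# Toy input (not real data)
path = write_temp_fasta(">a\nMFVF\n>b\nMFVL\n")
try:
    with open(path) as f:
        print(f.read())
finally:
    os.remove(path)   # explicit cleanup
```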



Email address is not specified.

To make use of NCBI's E-utilities, NCBI requires you to specify your
email address with each request.  As an example, if your email address
is A.N.Other@example.com, you can specify it as follows:
   from Bio import Entrez
   Entrez.email = 'A.N.Other@example.com'
In case of excessive usage of the E-utilities, NCBI will attempt to contact
a user at the email address provided before blocking access to the
E-utilities.



Hdr: >YP_009724390.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]
 *** Sequence length: 1293

Hdr: >QWK65230.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]
 *** Sequence length: 1291


## Step 2. Clustal Align
* Input is a single file of sequences in fasta format. In this case, it is the temporary file generated in Step 1.
* Output is a single set of aligned sequences (in the same order as the input) in fasta format.
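
The alignment step can also be run without the Biopython command-line wrapper, which newer Biopython releases mark as deprecated. The sketch below calls the `clustalo` executable through `subprocess`. It assumes `clustalo` is on the PATH; the helper names `build_clustalo_command` and `run_clustalo` are hypothetical. The options used (`-i`, `--auto`, `--seqtype`, `--outfmt`) are standard Clustal Omega options, and with no output file given the alignment is written to standard output.

```python
import shutil
import subprocess


def build_clustalo_command(infile, seqtype="Protein", executable="clustalo"):
    """Return the argument list for a fasta-in, fasta-out Clustal Omega run."""
    return [
        executable,
        "-i", infile,              # input fasta file with two or more sequences
        "--auto",                  # let Clustal Omega choose its own settings
        f"--seqtype={seqtype}",    # Protein, RNA or DNA
        "--outfmt=fasta",          # aligned sequences in fasta format
    ]


def run_clustalo(infile, seqtype="Protein"):
    """Run Clustal Omega and return the aligned sequences as text."""
    command = build_clustalo_command(infile, seqtype)
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} was not found on PATH")
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

With the temporary file from Step 1, a call such as `run_clustalo(tmp.name)` would be expected to return the same kind of aligned fasta text that the wrapper returns.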

In [6]:
# input is a single fasta file (tmp.name) with two or more sequences

cline = ClustalOmegaCommandline( cmd='clustalo',  infile = tmp.name,  auto = True , seqtype = base[1])

aligned_seqs, stderr = cline()        
print('Aligned sequences:\n\n',stderr,aligned_seqs)

Aligned sequences:

  >YP_009724390.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFS
NVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIV
NNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLE
GKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQT
LLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETK
CTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISN
CVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIAD
YNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPC
NGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVN
FNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITP
GTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSY
ECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTI
SVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQE
VFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDC
LGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSAL