# Lecture 4 - Sequence alignment

In this tutorial, we will learn how to align (nucleotide or protein) sequences. There are two kinds of applications for this:

- Comparing two input sequences to check their similarity.
- Comparing an input sequence against a sequence database to find best matches.

## Learning objectives:

- Learn to use the popular [NCBI BLAST tool](https://blast.ncbi.nlm.nih.gov/).
- Learn to interpret *BLASTing* results. 

## Exercise 1: Pairwise alignment

Let's start by loading a protein sequence from a fasta file. This is the sequence of protein [**P21880**](https://www.uniprot.org/uniprot/P21880) from the UniProt database.

In [None]:
from Bio import SeqIO

protein = SeqIO.read('files/P21880.faa', 'fasta')

print('> original', protein.seq, sep='\n')

### Exercise 1.1

Let's introduce some random mutations to this protein...

In [None]:
from random import randint

def random_AA():
    """ Generate one random amino acid. """
    
    AAs = 'ACDEFGHIKLMNPQRSTVWY'
    return AAs[randint(0, 19)]

def mutate_protein(seq, n=1):
    """ Randomly mutate a sequence N times """
    
    seq = list(seq)
    
    for i in range(n):
        seq[randint(0, len(seq)-1)] = random_AA()
        
    return ''.join(seq)

mutant = mutate_protein(protein.seq, 10)

print('> mutant', mutant, sep='\n')

Go to [NCBI BLAST](https://blast.ncbi.nlm.nih.gov/) and:

- Enter the sequence of the mutant protein under **Query Sequence**
- Select *Align two or more sequences*
- Enter the sequence of the original protein under **Subject Sequence**
- Run BLAST and wait for the result
- Go to *Alignments* and select view: *Pairwise with dots for identities*
- Can you find all 10 mutations?

> 🤔 If you see fewer than 10 mutations, why could that be?
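
A quick way to check your answer without BLAST is to compare the two sequences position by position. The sketch below is not part of the original notebook: `differing_positions` and the short example sequences are hypothetical, chosen only for illustration. It relies on the fact that substitutions do not change the sequence length.

```python
def differing_positions(original, mutant):
    """Return the 0-based positions where two equal-length sequences differ."""
    if len(original) != len(mutant):
        raise ValueError("sequences must have the same length")
    return [i for i, (a, b) in enumerate(zip(original, mutant)) if a != b]


# Toy sequences (hypothetical), with a single substitution at index 4.
original = "MAFEFKLPDIGEG"
mutant = "MAFEYKLPDIGEG"

print(differing_positions(original, mutant))  # [4]
```

In the notebook, something like `len(differing_positions(str(protein.seq), mutant))` should give the number of visible substitutions, which you can compare with the number of mutations you asked for.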

-------

### Exercise 1.2

As you saw, the two sequences still align quite well despite a few mutations.

Let's see what happens when we also mutate the sequence by randomly adding and removing short stretches of amino acids.

In [None]:
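def delete_chunks(seq, size=1):
    """ Delete a stretch of `size` residues at a random position. """
    seq = list(seq)
    p = randint(0, len(seq)-size)  # start of the deleted stretch
    seq = seq[:p] + seq[p+size:]
    return ''.join(seq)


def add_chunks(seq, size=1):
    """ Insert a stretch of `size` random residues at a random position. """
    seq = list(seq)
    p = randint(0, len(seq)-1)  # insertion point
    chunk = [random_AA() for _ in range(size)]
    seq = seq[:p] + chunk + seq[p:]
    return ''.join(seq)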

mutant = mutate_protein(protein.seq, 50)

for i in range(5):
    mutant = delete_chunks(mutant, 10)
    mutant = add_chunks(mutant, 10)

print('> mutant 2')
print(mutant)

Go to [NCBI BLAST](https://blast.ncbi.nlm.nih.gov/) and:

- Enter the sequence of mutant 2 under **Query Sequence**
- Select *Align two or more sequences*
- Enter the sequence of the original protein under **Subject Sequence**
- Run BLAST and wait for the result
- Go to *Alignments* -> can you find all the inserted and deleted stretches?

> 🧠 Advanced: Go to *Dot Plot* -> can you distinguish the locations of inserted and deleted stretches?
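
To build some intuition for what a dot plot shows, here is a minimal sketch, assuming that a dot is drawn wherever two words of length 3 match exactly. The BLAST web page draws its plot with its own method, so this is only an approximation of the idea. `dot_plot` and the toy sequences are hypothetical.

```python
from collections import Counter


def dot_plot(query, subject, word=3):
    """Return all (i, j) pairs where query[i:i+word] == subject[j:j+word]."""
    # Index every word of the subject by its start position.
    index = {}
    for j in range(len(subject) - word + 1):
        index.setdefault(subject[j:j + word], []).append(j)

    dots = set()
    for i in range(len(query) - word + 1):
        for j in index.get(query[i:i + word], []):
            dots.add((i, j))
    return dots


# Toy example (hypothetical): the query is the subject with "RQ" deleted.
subject = "MKTAYIAKQRQISFVKSHFSRQ"
query = "MKTAYIAKQISFVKSHFSRQ"

dots = dot_plot(query, subject)

# Each dot lies on the diagonal j - i. An indel shifts the diagonal.
diagonals = Counter(j - i for i, j in dots)
print(sorted(diagonals.items()))  # [(0, 7), (2, 10)]
```

Before the deletion all dots sit on diagonal 0; after it they jump to diagonal 2, which is the length of the deleted stretch. In a real dot plot, the same kind of jump marks where an insertion or deletion occurred.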

-------

## Exercise 2: Database alignment

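Instead of simply aligning two sequences, a more realistic scenario is to align a sequence against all the sequences in a database. This is a common way to infer the likely function of an unknown sequence: if it closely matches sequences that are already annotated, it probably has a similar function.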
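
As a rough illustration of the idea, the sketch below ranks the entries of a tiny in-memory "database" by the number of 3-letter words they share with the query. This is a crude stand-in, not how BLAST works internally: BLAST extends word hits into local alignments, scores them with a substitution matrix, and attaches statistics. The function names, entry names and sequences are all hypothetical.

```python
def kmers(seq, k=3):
    """Return the set of all words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_database(query, database, k=3):
    """Rank database entries by the number of k-mers shared with the query."""
    query_words = kmers(query, k)
    scores = {name: len(query_words & kmers(seq, k))
              for name, seq in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy database (hypothetical entries, not real proteins).
toy_db = {
    "entry_A": "MKTAYIAKQRQISFVKSHFSRQ",
    "entry_B": "GSHMLEDPVTRWYACNNQ",
    "entry_C": "MKTAYLAKQRQLSFVKSH",
}
query = "MKTAYIAKQISFVKSHFSRQ"

ranking = rank_database(query, toy_db)
for name, shared in ranking:
    print(name, shared)
```

The best-ranked entry is the one most similar to the query, and its annotation would be the first guess for the query's function.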

### Exercise 2.1:

Go to [NCBI BLAST](https://blast.ncbi.nlm.nih.gov/) and:

- Enter the sequence of mutant 2 under **Query Sequence**
- Do *NOT* select *Align two or more sequences*
- Change the database to UniProtKB/SwissProt
- Run BLAST and wait for the result (this could take some time)
- Can you identify the name and function of this protein?

### Exercise 2.2:

As you could see, the sequence above was easy to match to existing sequences. Let's try to mutate it even further...

In [None]:
mutant = mutate_protein(protein.seq, 100)

for i in range(10):
    mutant = delete_chunks(mutant, 10)
    mutant = add_chunks(mutant, 5)
    
print('> mutant 3')
print(mutant)

Go to [NCBI BLAST](https://blast.ncbi.nlm.nih.gov/) and:

- Enter the sequence of mutant 3 under **Query Sequence**
- Do *NOT* select *Align two or more sequences*
- Change the database to UniProtKB/SwissProt
- Run BLAST and wait for the result (this could take some time)
- 👀 Can you see a difference in alignment scores and E-values compared to mutant 2?
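
For orientation, the E-value is the number of hits with at least a given score that you would expect to find by chance. It is related to the bit score S' approximately by E = m × n × 2^(-S'), where m is the query length and n is the total length of the database. The sketch below applies this formula directly. It ignores the effective-length corrections that BLAST applies, and the lengths used are hypothetical round numbers, so do not expect it to reproduce the values on the results page exactly.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E-value: expected number of chance hits with this bit score or better."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical sizes, chosen only for illustration.
query_length = 400
database_length = 200_000_000

for bits in (30, 40, 50, 100):
    e_value = expect_value(bits, query_length, database_length)
    print(f"bit score {bits:>3}: E = {e_value:.2e}")
```

Every additional bit halves the E-value, so even a modest drop in bit score can turn a convincing hit into one that is indistinguishable from chance.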

### Exercise 2.3:

What happens if we take this even further? 🤔

Let's build an entirely fake protein with a randomly generated sequence...

In [None]:
def random_protein(size):
    seq = [random_AA() for i in range(size)]
    return ''.join(seq)

# Use a new name so the original `protein` record is not overwritten
fake_protein = random_protein(300)
print('> fake protein')
print(fake_protein)

Go to [NCBI BLAST](https://blast.ncbi.nlm.nih.gov/) and:

- Enter the sequence of the *fake protein* under **Query Sequence**
- Do *NOT* select *Align two or more sequences*
- Change the database to UniProtKB/SwissProt
- **Go to *Algorithm parameters* and change *Expect threshold* to 100**.
- Run BLAST and wait for the result (this could take some time)
- Go to *Graphic Summary* to see an overview of all alignments

> 🧠 What do **you** expect to happen?

**Notes**:

- If you don't get any matches, try generating a new fake protein.
- If you have time, repeat this exercise using different databases, different parameters, etc.


## Wrap-up

Finished? Well done! 🙂

This time you didn't have to write any code, but take some time to make sure you understand all the Python code in this notebook as well.