# Sequence alignment

This week we'll look at some of the alignment algorithms discussed in lectures.

If you are new to programming, you can review this [introductory Python tutorial](hammingdist.ipynb) again, as it will give you a more guided introduction to the first exercise on Hamming distance.

In [9]:
import numpy as np

## Sequence data 

We'll read in some real data to play with. Nucleotide (cDNA) sequences for an insulin gene in mice and the insulin gene in humans have been provided in the `data/` directory. 

We'll use nucleotide sequence rather than protein sequence, because the substitution matrix is very important when aligning protein sequences, and we won't implement substitution matrices today.

In [2]:
# Look at this file with the Linux cat command
!cat data/Homo_sapiens_INS_203_sequence.fa

>ENST00000397262.5 INS-203 cdna:protein_coding
TCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGGTCTGTTCCAAGGGCCTTTGCGT
CAGGTGGGCTCAGGATTCCAGGGTGGCTGGACCCCAGGCCCCAGCTCTGCAGCAGGGAGG
ACGTGGCTGGGCTCGTGAAGCATGTGGGGGTGAGCCCAGGGGCCCCAAGGCAGGGCACCT
GGCCTTCAGCCTGCCTCAGCCCTGCCTGTCTCCCAGATCACTGTCCTTCTGCCATGGCCC
TGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGACCCAGCCG
CAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTACCTAGTGT
GCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCTGCAGG
TGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTGGCCCTGG
AGGGGTCCCTGCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGCTCCCTCT
ACCAGCTGGAGAACTACTGCAACTAGACGCAGCCCGCAGGCAGCCCCACACCCGCCGCCT
CCTGCACCGAGAGAGATGGAATAAAGCCCTTGAACCAGC



In [3]:
!cat data/Mus_musculus_Ins2_205_sequence.fa

>ENSMUST00000105933.7 Ins2-205 cdna:protein_coding
GATCCGCTACAATCAAAAACCATCAGCAAGCAGGAAGGTACTCTTCTCAGTGGGCCTGGC
TCCCCAGCTAAGACCTCAGGGACTTGAGGTAGGATATAGCCTCCTCTCTTACGTGAAACT
TTTGCTATCCTCAACCCAGCCTATCTTCCAGGTTATTGTTTCAACATGGCCCTGTGGATG
CGCTTCCTGCCCCTGCTGGCCCTGCTCTTCCTCTGGGAGTCCCACCCCACCCAGGCTTTT
GTCAAGCAGCACCTTTGTGGTTCCCACCTGGTGGAGGCTCTCTACCTGGTGTGTGGGGAG
CGTGGCTTCTTCTACACACCCATGTCCCGCCGTGAAGTGGAGGACCCACAAGTGGCACAA
CTGGAGCTGGGTGGAGGCCCGGGAGCAGGTGACCTTCAGACCTTGGCACTGGAGGTGGCC
CAGCAGAAGCGTGGCATTGTAGATCAGTGCTGCACCAGCATCTGCTCCCTCTACCAGCTG
GAGAACTACTGCAACTAGACCCACCACTACCCAGCCTACCCCTCTGCAATGAATAAAACC
TTTGAATGAGCA



We could read the FASTA file using just Python code (note that this code assumes there is only ONE sequence in each FASTA file!):

In [4]:
with open('data/Homo_sapiens_INS_203_sequence.fa') as f:
    sequence = ""
    # Iterate over the file line by line, skipping the '>' header line
    for row in f:
        if not row.startswith('>'):
            sequence += row.strip()

In [5]:
print(sequence)

TCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGGTCTGTTCCAAGGGCCTTTGCGTCAGGTGGGCTCAGGATTCCAGGGTGGCTGGACCCCAGGCCCCAGCTCTGCAGCAGGGAGGACGTGGCTGGGCTCGTGAAGCATGTGGGGGTGAGCCCAGGGGCCCCAAGGCAGGGCACCTGGCCTTCAGCCTGCCTCAGCCCTGCCTGTCTCCCAGATCACTGTCCTTCTGCCATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTACCTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCTGCAGGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTGGCCCTGGAGGGGTCCCTGCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGCTCCCTCTACCAGCTGGAGAACTACTGCAACTAGACGCAGCCCGCAGGCAGCCCCACACCCGCCGCCTCCTGCACCGAGAGAGATGGAATAAAGCCCTTGAACCAGC
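
The loop above relies on there being a single record in the file. As a rough sketch of how the same idea could be extended to files with several records, the hypothetical helper `read_fasta` below (not part of the course code) collects sequences in a dictionary keyed by header line. The two toy records are made up for illustration.

```python
import io

def read_fasta(handle):
    """Return a dict mapping each header (without '>') to its sequence."""
    records = {}
    header = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith('>'):
            # A new record starts: remember its header and start collecting lines
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    # Join the wrapped sequence lines of each record into one string
    return {name: "".join(parts) for name, parts in records.items()}

# Toy two-record FASTA text (made-up sequences, not the insulin data)
example = io.StringIO(">seq1 first toy record\nACGT\nGGCC\n>seq2 second toy record\nTTAA\n")
records = read_fasta(example)
print(records)
```

Passing an open file object instead of the `io.StringIO` example should work the same way, since both yield lines when iterated.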


Alternatively, and usually better practice, we could use an established library such as scikit-bio to parse the file:

# You may need to install skbio:
1. File -> New Launcher
2. Open a new terminal
3. Run `pip install scikit-bio`

In [19]:
import skbio

In [13]:
sequences = skbio.io.read('data/Homo_sapiens_INS_203_sequence.fa', format='fasta')
# This gives us a skbio Sequence object
human_ins_object = list(sequences)[0]
# This gives us the actual sequence as a string
human_ins = str(human_ins_object)

In [14]:
# Repeat for mouse INS2 gene
mouse_ins = str(list(skbio.io.read('data/Mus_musculus_Ins2_205_sequence.fa', format='fasta'))[0])
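
Before aligning, it can be worth checking that a parsed sequence looks like what you expect. The sketch below uses a hypothetical helper, `summarise`, on a made-up toy string. It reports the length, the base counts, and any characters other than A, C, G and T. The same call could be tried on `human_ins` or `mouse_ins`.

```python
from collections import Counter

def summarise(seq):
    """Return (length, base counts, sorted list of non-ACGT characters)."""
    counts = Counter(seq.upper())
    unexpected = sorted(set(counts) - set("ACGT"))
    return len(seq), dict(counts), unexpected

# Toy sequence with one ambiguous base (N), purely for illustration
toy = "GATTACAN"
length, counts, unexpected = summarise(toy)
print(length, counts, unexpected)
```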

## Edit distance 

**Exercise 1: Hamming distance**

Edit the Hamming distance function below so that it returns the correct Hamming distance for two strings `a` and `b`.

In [15]:
def hamming(a,b):
    """
    Calculate the Hamming distance between strings a and b.
    The strings must be the same length.
    """
    # 0 is the wrong answer. Edit this function to give the right answer
    return 0

Think also: what will your function do if the strings are of different length? What *should* it do?

In [49]:
# Should return 2
hamming("GATTACA","GACTATA")

0

In [50]:
# Should return 6
hamming("tuesday","sundays")

0

In [51]:
# These strings are of different length!
hamming("happiness","applying")

0
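
On the question of unequal lengths: if a function pairs up characters with `zip()`, the extra characters of the longer string are silently ignored, which may hide a mistake. The sketch below shows that behaviour and one possible guard, a hypothetical helper `require_same_length` that raises an error instead. Whether raising an error is the right choice is a design decision for you to make.

```python
a = "happiness"
b = "applying"

# zip() stops at the end of the shorter string, so one character of a is dropped
pairs = list(zip(a, b))
print(len(a), len(b), len(pairs))

def require_same_length(a, b):
    """Raise ValueError if the two strings differ in length (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError(f"strings differ in length: {len(a)} vs {len(b)}")

try:
    require_same_length(a, b)
except ValueError as err:
    print("Rejected:", err)
```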

**Exercise 2: Levenshtein distance** 

Edit the `lev` function below to calculate Levenshtein distance recursively. Use the following costs:
* 1 for an indel
* 1 for a mismatch
* 0 for a match

This is the same function as shown during lectures, but try to implement it without looking back at the slides.

In [44]:
def lev(a,b):
    """
    Recursively calculate Levenshtein distance between strings a and b.
    """
    if len(a)==0:
        return len(b)
    if len(b)==0:
        return len(a)
    # Add code below to calculate the distance in terms of 
    # lev(a,b[1:]), lev(a[1:],b), and lev(a[1:],b[1:])
    # We're interested in the incremental cost of aligning the current characters,
    # a[0] and b[0]
    return 0 

In [45]:
# Should return 2
lev("GATTACA","GACTATA")

0

In [46]:
# Should return 4
lev("tuesday","sundays")

0

In [47]:
# Should return 6
lev("happiness","applying")

0

## Alignment scores, global and local alignment 

Here is a recursive alignment function like `lev()` that returns an alignment score rather than an edit distance. Notice that it uses `max()` rather than `min()`, as we're trying to find the maximum possible score, not the minimum possible edit distance.

Avoid looking at the function below until you've done Exercise 2 above as it may give away the answer.

In [52]:
def align_recursive(a, b, indel_score=-1, match_score=2, mismatch_score=-1):
    """
    Recursively calculate alignment score between strings a and b,
    using supplied scores for matches, mismatches and indels.
    """
    # If one string is empty, every remaining character of the other
    # string must be aligned against a gap, so each one costs an indel
    if len(a)==0:
        return len(b) * indel_score
    if len(b)==0:
        return len(a) * indel_score
    if a[0]==b[0]:
        match_mismatch_score = match_score
    else:
        match_mismatch_score = mismatch_score
    return max(align_recursive(a[1:],b) + indel_score,
               align_recursive(a,b[1:]) + indel_score,
               align_recursive(a[1:],b[1:]) + match_mismatch_score)

Notice that you can now change the scoring system for indels, matches, and mismatches.

In [53]:
align_recursive("GATTACA","GACTATA")

8

In [54]:
align_recursive("GATTACA","GACTATA",match_score=5)

23

Of course, this function makes up to three recursive calls at every step and caches nothing, so its running time grows exponentially with the length of the strings. `align_recursive(human_ins, mouse_ins)` is not practical to run.
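
One common way to avoid recomputing the same subproblems is memoisation. The sketch below is an illustration rather than part of the course code: `align_memo` is a hypothetical name, it works on string indexes instead of slices, and it assumes each leftover character at the end of a string is charged as an indel. It caches results with `functools.lru_cache`, so each pair of positions is scored only once.

```python
from functools import lru_cache

def align_memo(a, b, indel_score=-1, match_score=2, mismatch_score=-1):
    """Memoised recursive alignment score (illustrative sketch)."""
    @lru_cache(maxsize=None)
    def score(i, j):
        # Best score for aligning the suffixes a[i:] and b[j:]
        if i == len(a):
            return (len(b) - j) * indel_score  # assumption: leftovers cost indels
        if j == len(b):
            return (len(a) - i) * indel_score
        pair = match_score if a[i] == b[j] else mismatch_score
        return max(score(i + 1, j) + indel_score,
                   score(i, j + 1) + indel_score,
                   score(i + 1, j + 1) + pair)
    return score(0, 0)

print(align_memo("GATTACA", "GACTATA"))
print(align_memo("GATTACA", "GACTATA", match_score=5))
```

This is still recursive, so for sequences of several hundred bases it may hit Python's recursion limit. Filling in a grid iteratively, as in the next section, avoids that problem.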

### Global alignment: Needleman-Wunsch

To do global alignment with the Needleman-Wunsch algorithm, we need two steps:

1. Fill out the grid of alignment scores. This is enough to give the final alignment score.
2. Trace-back from the bottom-right corner of the grid to get the actual alignment of the strings.

Here, we've provided a function to do the traceback, and given an incomplete function to calculate the alignment score grid. Complete the `calculate_scoregrid()` function to correctly fill out the grid of scores.

For traceback, we have two options:
* Keep track of which cell(s) was/were the origin of the best score(s) for each cell, and use this information for traceback. This increases storage requirements by a constant factor (i.e. they are still O(N^2)).
* Or, during traceback, recalculate which cell(s) could have been the origin of the best score(s) for each cell. This increases the computational cost of traceback by a constant factor (i.e. it is still O(N)).

In this case, the provided traceback function will work out which path to follow, so you don't need to keep track of the path as you calculate the scores.
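
To make the second option concrete, here is a minimal sketch of a traceback that recomputes, at each cell, which neighbouring cell could have produced its score. It is not the provided `traceback()` function: the name `traceback_sketch`, the use of `-` for gaps, and the small hand-written grid for the toy strings `"GA"` and `"GCA"` are all assumptions made for illustration. When several origins are possible it simply takes the diagonal first.

```python
def traceback_sketch(a, b, grid, indel_score=-1, match_score=2, mismatch_score=-1):
    """Walk back from the bottom-right cell, recomputing each cell's origin."""
    x, y = len(a), len(b)
    out_a, out_b = [], []
    while x > 0 or y > 0:
        if x > 0 and y > 0:
            pair = match_score if a[x - 1] == b[y - 1] else mismatch_score
            if grid[x][y] == grid[x - 1][y - 1] + pair:
                # Score came from the diagonal: align a[x-1] with b[y-1]
                out_a.append(a[x - 1])
                out_b.append(b[y - 1])
                x, y = x - 1, y - 1
                continue
        if x > 0 and grid[x][y] == grid[x - 1][y] + indel_score:
            # Score came from the previous row: a[x-1] is aligned to a gap
            out_a.append(a[x - 1])
            out_b.append("-")
            x -= 1
            continue
        if y > 0 and grid[x][y] == grid[x][y - 1] + indel_score:
            # Score came from the previous column: b[y-1] is aligned to a gap
            out_a.append("-")
            out_b.append(b[y - 1])
            y -= 1
            continue
        raise ValueError("grid is not consistent with the scoring system")
    return "".join(reversed(out_a)), "".join(reversed(out_b))

# Hand-written score grid for a="GA" (rows) and b="GCA" (columns), default scores
grid = [[0, -1, -2, -3],
        [-1, 2, 1, 0],
        [-2, 1, 1, 3]]
aligned_a, aligned_b = traceback_sketch("GA", "GCA", grid)
print(aligned_a)
print(aligned_b)
```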

**Exercise 3:** Complete the `calculate_scoregrid()` function to calculate the scores needed for global alignment via the Needleman-Wunsch algorithm.

In [18]:
def calculate_scoregrid(a, b,
                        indel_score=-1, match_score=2, mismatch_score=-1):
    """
    Given two strings a and b, calculate the maximum score grid, using
    specified scores for indels, matches and mismatches. Return the grid.
    Grid row and column 0 correspond to "before" the start of each string,
    so grid indexes are offset by 1 from string indexes. That is,
    grid position [1,1] represents the result of matching a[0] to b[0].
    """
    # The grid needs to be 1 bigger in each direction than the string lengths
    X = len(a)+1
    Y = len(b)+1
    # Initialise the grid with zeroes
    # (np.int has been removed from recent NumPy versions; use the built-in int)
    scoregrid = np.zeros((X,Y), int)
    #### YOUR CODE HERE #### 
    # You need to:
    # * initialise the first column of the grid, i.e. scoregrid[x,0] for all x, with indel scores
    # * initialise the first row of the grid, i.e. scoregrid[0,y] for all y, with indel scores
    # * loop over x and y, filling out each cell of the grid by looking for the
    #   maximum possible score from each of the three earlier cells
    
    return scoregrid

In [18]:
# Pre-defined functions to get the traceback given a correct scoregrid
# Use help(traceback) or help(get_alignment) to see how to call them
from alignment_functions import traceback, get_alignment

If `calculate_scoregrid()` works correctly, the cells below will run and show the alignment:

In [16]:
a = "GATTACA"
b = "GACTATA"

In [21]:
# Once you've implemented calculate_scoregrid, this should show the correct
# values instead of all zeroes
scoregrid = calculate_scoregrid(a,b)
scoregrid

array([[0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0]])

In [22]:
print("Alignment score:",scoregrid[-1,-1])

Alignment score: 0


In [None]:
# If the score grid isn't correct and consistent with the scoring system
# and the strings, traceback won't be able to find a path and will give an error
trace = traceback(a,b,scoregrid)
aligned_string_a, aligned_string_b = get_alignment(trace)
print(aligned_string_a)
print(aligned_string_b)
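
Long alignments are easier to read when wrapped and annotated. The sketch below is a hypothetical helper, `format_alignment`, which assumes the two aligned strings have equal length and use `-` for gaps. It adds a middle line marking matching positions with `|` and wraps the output to a fixed width.

```python
def format_alignment(aligned_a, aligned_b, width=60):
    """Return the alignment as wrapped text with a '|' under each match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned strings must have the same length")
    # Mark columns where both strings have the same non-gap character
    midline = "".join("|" if p == q and p != "-" else " "
                      for p, q in zip(aligned_a, aligned_b))
    blocks = []
    for start in range(0, len(aligned_a), width):
        end = start + width
        blocks.append("\n".join([aligned_a[start:end],
                                 midline[start:end],
                                 aligned_b[start:end]]))
    return "\n\n".join(blocks)

# Toy aligned strings, made up for illustration
print(format_alignment("G-ATTACA", "GCA-TACA"))
```

For the insulin sequences, something like `print(format_alignment(aligned_string_a, aligned_string_b))` could be used once the alignment has been computed.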

Try aligning the cDNA strings `human_ins` and `mouse_ins`.

**Challenge exercise 4:** Change your `calculate_scoregrid()` function to perform local instead of global alignment. You can import the `traceback_local()` function to help you test your result.

In [17]:
from alignment_functions import traceback_local