# Local alignment with the Smith-Waterman algorithm

This notebook uses a simple implementation of local alignment methods to align the two short DNA sequences from the lecture discussion. 

Let's start with some bookkeeping:

*The Python library NumPy is used for the matrix operations.*

In [None]:
# run this cell to import Python numpy library
import numpy

In [1]:
# run this cell to check your Python version is OK for this notebook!
import sys
def check_python_version_above_3_6():
    major = sys.version_info.major
    minor = sys.version_info.minor
    # compare as a tuple so that e.g. a future 4.0 is not rejected for minor < 6
    if sys.version_info < (3, 6):
        print('ERROR you need to run this notebook with Python 3.6 or above (as f-strings used)')
        print('ERROR current Python version is {}.{}'.format(major, minor))        
        print('ERROR Please see:\n',
              '      https://canvas.anglia.ac.uk/courses/15139/pages/azure-notebooks-switching-kernel\n'
              '      for information on switching kernel on Azure Notebooks')
    else:
        print('Python version {}.{} you are good to go'.format(major, minor))
check_python_version_above_3_6()

Python version 3.8 you are good to go


To apply Smith-Waterman we need to be able to score an alignment.

Recall that, unlike edit-distance-based methods, the aim is to *maximize* the score obtained by applying the scoring scheme to each pair of aligned positions. The scoring scheme gives a positive value for a match, in contrast to edit-distance calculations, where a match is the default and gains nothing.

Non-matching options include the insertion of a gap into either of the sequences. Gaps have negative scores and are equivalent to deletions or insertions in terms of editing.

The first step is to populate the V matrix based on the scoring scheme, and the second step is the traceback. Traceback starts at the maximum score in the matrix and follows the highest-scoring predecessors until it is terminated by a zero value.

For simplicity, a basic scoring function is defined here instead of a complete scoring matrix or substitution table.

The scheme is: 1 for matched nucleotides, -1 for mismatched nucleotides, -1 for a gap inserted in either sequence.

In [None]:
# run this cell to define the simple_scoring_scheme
def simple_scoring_scheme(char_a, char_b):
    """
    return the score for two characters char_a and char_b
    Scoring scheme: 1 for match, -1 for gap, else -1 (for mismatch)
    """
    if char_a == char_b:  # matching nucleotides
        return 1
    if char_a == '-' or char_b == '-':  # gap
        return -1
    return -1  # mismatch

In [None]:
# run this cell for crude tests of simple_scoring_scheme
# (cannot easily do pytest in Jupyter notebook)
assert simple_scoring_scheme('A', 'A') == 1
assert simple_scoring_scheme('A', 'T') == -1
assert simple_scoring_scheme('A', '-') == -1
print('test for simple_scoring_scheme - pass!')
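The simple scheme treats every mismatch alike. As a rough sketch of what a substitution table could look like instead, the block below builds a lookup table that penalises transversions (purine to pyrimidine changes) more heavily than transitions. The names `SUBSTITUTION`, `GAP_SCORE` and `table_scoring`, and all of the score values, are assumptions made for illustration and are not taken from the lecture.

```python
# Hypothetical substitution table for DNA; the values are illustrative only.
PURINES = {'A', 'G'}
GAP_SCORE = -2  # assumed gap penalty


def _pair_score(base_a, base_b):
    """Score one pair of bases: 1 match, -1 transition, -2 transversion."""
    if base_a == base_b:
        return 1
    # a transition swaps two purines or two pyrimidines
    is_transition = (base_a in PURINES) == (base_b in PURINES)
    return -1 if is_transition else -2


# the table itself: one entry for each ordered pair of bases
SUBSTITUTION = {(a, b): _pair_score(a, b) for a in 'ACGT' for b in 'ACGT'}


def table_scoring(char_a, char_b):
    """Same call signature as simple_scoring_scheme, but table driven."""
    if char_a == '-' or char_b == '-':
        return GAP_SCORE
    return SUBSTITUTION[(char_a, char_b)]


print(table_scoring('A', 'G'), table_scoring('A', 'C'), table_scoring('A', '-'))
```

Because `table_scoring` takes two characters and returns a number, it could be passed anywhere this notebook passes `simple_scoring_scheme`.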

Suppose we have a trial alignment of two sequences:
```
-GGTATGCTGGCGC
TATATGCGGCGTTT
```
We want a function to get the total `simple_scoring_scheme` score for the alignment of two
strings of equal length (including `-` characters).

In [None]:
# write Python to complete the function
# remember to include a docstring
def simple_score_alignment(align_a, align_b):
    pass

In [None]:
# run this cell for crude tests of simple_score_alignment
# (cannot easily do pytest in Jupyter notebook)
assert simple_score_alignment('G', 'G') == 1
assert simple_score_alignment('G', 'A') == -1
assert simple_score_alignment('AG', 'AG') == 2
assert simple_score_alignment('AG', 'AA') == 0
assert simple_score_alignment('G', '-') == -1
assert simple_score_alignment('AAAAA', 'AAAAA') == 5
assert simple_score_alignment('AAAAA', 'AAA-A') == 3
print('test for simple_score_alignment - pass!')

Here are the two DNA sequences from the lecture example: 
```
GGTATGCTGGCGC
TATATGCGGCGTTT
```

In [None]:
# run this cell to define seq_a and sequence seq_b
seq_a = 'GGTATGCTGGCGC'
seq_b = 'TATATGCGGCGTTT'

As the first sequence is one character shorter, we add a gap to make the two sequences equal in length:
```
-GGTATGCTGGCGC
TATATGCGGCGTTT
```
What is the `simple_score_alignment` score for this alignment?

In [None]:
# write your Python code to get the score for the alignment 
# placing the gap at the start of the first sequence

When looking at alignments between two sequences it is convenient to highlight identities, the positions where the two sequences have the same base, using the `|` character. This function returns a line that can be used to achieve this.

In [None]:
def highlight_line(first_seq, second_seq):
    """ 
    for the two sequences, returns a line where matching letters are
    highlighted with |, except where the letters are gaps
    """
    joins = ['|' if a == b and a != '-' else ' ' for a, b in zip(first_seq, second_seq)]
    return ''.join(joins)
assert highlight_line('A', 'A') == '|'
assert highlight_line('AAAA', 'AAAA') == '||||'
assert highlight_line('AAAA', 'AGGA') == '|  |'
assert highlight_line('AA-AA', 'AA-AA') == '|| ||'

Use the `highlight_line` function to show the 'best' global alignment between the two sequences:

The problem here is that the sequences do not align well globally. However, they do have a shorter sequence motif in common, and a local alignment will help locate it.

# Smith-Waterman algorithm for optimal local alignment



The `smith_waterman` function below populates the V[*i,j*] matrix elements using the method described in the lecture. Notice that the matrix indices *i*, *j* correspond to the sequence positions *i*-1, *j*-1, because an initial empty string is prepended to each sequence. The scoring uses the scoring scheme function defined above rather than a *similarity matrix*.

The local alignment V[*i,j*] expression is the *maximum* form with the added '0' value (for an empty suffix) as explained in the lecture.
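The recurrence implemented by the function below can be written out explicitly. This is a sketch in the notation of this notebook, where $s$ stands for the scoring function (here `simple_scoring_scheme`) and $u_i$, $w_j$ are the $i$-th and $j$-th characters counting from 1, so that $u_i$ is `u[i-1]` in Python:

$$
V[i,j] = \max
\begin{cases}
V[i-1,j-1] + s(u_i, w_j) \\
V[i-1,j] + s(u_i, \texttt{-}) \\
V[i,j-1] + s(\texttt{-}, w_j) \\
0
\end{cases}
\qquad V[i,0] = V[0,j] = 0
$$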

The function returns the filled matrix together with the best similarity score, `V.max()`. The *i* and *j* indices of that score are recovered afterwards with the NumPy functions `argmax` and `unravel_index`.

In [None]:
def smith_waterman(u, w, scoring):
    ''' Fills V matrix values for sequences u and w using
        dynamic programming. Returns matrix and maximum alignment score.'''
    V = numpy.zeros((len(u)+1, len(w)+1), dtype=int) 
    for i in range(1, len(u)+1):
        for j in range(1, len(w)+1):
            V[i, j] = max(V[i-1, j-1] + scoring(u[i-1], w[j-1]), # max used 
                          V[i-1, j  ] + scoring(u[i-1], '-'),    
                          V[i  , j-1] + scoring('-',    w[j-1]), 
                          0)                               
    return V, V.max()

Here is the example alignment problem from the lecture for sequences `seq_a` and `seq_b` above (see lecture slide 39). The full matrix is printed out, including the initial row and column for the prepended empty string.

In [None]:
V, opt = smith_waterman(seq_a, seq_b, simple_scoring_scheme)
print(V)
print(f'Optimum={opt}, ends at {numpy.unravel_index(numpy.argmax(V), V.shape)}')
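The print statement above locates the optimum with `numpy.argmax` and `numpy.unravel_index`. The small example below, on a made-up matrix `M` (an assumption for illustration, not a real V matrix), shows what each call returns and how `numpy.where` can list every cell that holds the maximum when there are ties.

```python
import numpy

# toy matrix with the maximum value 5 appearing twice
M = numpy.array([[0, 1, 0],
                 [2, 5, 1],
                 [0, 5, 3]])

# argmax works on the flattened matrix and reports the first maximum only
flat_index = numpy.argmax(M)

# unravel_index converts the flat index back into (row, column)
i, j = numpy.unravel_index(flat_index, M.shape)
print(flat_index, (int(i), int(j)))

# numpy.where returns the row and column indices of every matching cell
rows, cols = numpy.where(M == M.max())
ties = [(int(r), int(c)) for r, c in zip(rows, cols)]
print(ties)
```

With tied maxima, a traceback that starts from `argmax` reports only one of the equally good local alignments.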

To retrieve the actual alignment, a traceback following the maximum V[*i,j*] values is required. The following function starts from the optimum and calls the scoring scheme again at each step to decide which neighbouring cell the score came from. The edit transcript is constructed together with the alignment display.

Notice that if the full matrix is retained in memory then the scoring function is not required. It is used here for generality: when searching large numbers of sequences, often only the maximum V score and its position are kept.

In [None]:
def traceback(V, u, w, scoring):
    ''' Traceback from optimum in the V matrix  '''
    # get i, j for optimum
    i, j = numpy.unravel_index(numpy.argmax(V), V.shape)
    edits, alignNu, alignNw, alignMark = [], [], [], []
    while (i > 0 or j > 0) and V[i, j] != 0:
        diagl, vertl, horizl = 0, 0, 0
        if i > 0 and j > 0:
            diagl = V[i-1, j-1] + scoring(u[i-1], w[j-1])
        if i > 0:
            vertl = V[i-1, j] + scoring(u[i-1], '-')
        if j > 0:
            horizl = V[i, j-1] + scoring('-', w[j-1])
        if diagl >= vertl and diagl >= horizl:
            match = u[i-1] == w[j-1]
            edits.append('M' if match else 'R')
            alignMark.append('|' if match else ' ')
            alignNu.append(u[i-1]); alignNw.append(w[j-1])
            i -= 1; j -= 1
        elif vertl >= horizl:
            edits.append('D')
            alignNu.append(u[i-1]); alignNw.append('-'); alignMark.append(' ')
            i -= 1
        else:
            edits.append('I')
            alignNw.append(w[j-1]); alignNu.append('-'); alignMark.append(' ')
            j -= 1
    # the edits transcript is built up in reverse
    edits = (''.join(edits))[::-1]
    # final alignment for display
    alignment = '\n'.join(''.join(row[::-1]) for row in (alignNu, alignMark, alignNw))
    return edits, alignment
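The remark that large searches often keep only the maximum score and its position can be sketched as follows. This is a minimal illustration, not the lecture's implementation: it holds just two rows of V at a time. The names `unit_scoring` and `smith_waterman_score_only` are hypothetical, and `unit_scoring` is a stand-in for the notebook's simple scheme so that the block runs on its own.

```python
def unit_scoring(char_a, char_b):
    """Stand-in scoring: +1 for matching bases, -1 for a mismatch or gap."""
    if char_a == char_b and char_a != '-':
        return 1
    return -1


def smith_waterman_score_only(u, w, scoring):
    """Return (best score, (i, j)) while storing only two rows of V."""
    prev = [0] * (len(w) + 1)          # row i-1 of V
    best, best_pos = 0, (0, 0)
    for i in range(1, len(u) + 1):
        curr = [0] * (len(w) + 1)      # row i of V, column 0 stays 0
        for j in range(1, len(w) + 1):
            curr[j] = max(prev[j - 1] + scoring(u[i - 1], w[j - 1]),
                          prev[j] + scoring(u[i - 1], '-'),
                          curr[j - 1] + scoring('-', w[j - 1]),
                          0)
            # strict > keeps the first maximum in row-major order
            if curr[j] > best:
                best, best_pos = curr[j], (i, j)
        prev = curr
    return best, best_pos


best, end = smith_waterman_score_only('GGTATGCTGGCGC', 'TATATGCGGCGTTT',
                                      unit_scoring)
print(best, end)
```

Memory use drops from one cell per pair of positions to two rows, at the cost of the traceback: recovering the alignment itself would mean recomputing part of the matrix around the reported end position.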

Here is the editing transcript: M is a match and D is a deletion, which is equivalent to a gap insertion in the *other* sequence. We anticipate at least one deletion/gap or insertion/gap as we know the sequences differ in length by one nucleotide character.

In [None]:
print(traceback(V, seq_a, seq_b, simple_scoring_scheme)[0])
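To make the meaning of the transcript letters concrete, the sketch below rebuilds the two gapped alignment rows from a transcript and the two ungapped subsequences it covers. The function name `apply_transcript` is hypothetical. The transcript `MMMMMDMMMM` used in the example is the one implied by the Simple Smith-Waterman alignment quoted in the comparison section of this notebook.

```python
def apply_transcript(edits, u_part, w_part):
    """Rebuild gapped alignment rows from an edit transcript.

    M (match) and R (replacement) consume one character from each
    sequence, D consumes one from u_part only (gap in the second row),
    I consumes one from w_part only (gap in the first row).
    """
    row_u, row_w = [], []
    i = j = 0
    for op in edits:
        if op in 'MR':
            row_u.append(u_part[i])
            row_w.append(w_part[j])
            i += 1
            j += 1
        elif op == 'D':
            row_u.append(u_part[i])
            row_w.append('-')
            i += 1
        elif op == 'I':
            row_u.append('-')
            row_w.append(w_part[j])
            j += 1
        else:
            raise ValueError(f'unknown edit operation: {op!r}')
    return ''.join(row_u), ''.join(row_w)


row_u, row_w = apply_transcript('MMMMMDMMMM', 'TATGCTGGCG', 'TATGCGGCG')
print(row_u)
print(row_w)
```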

And this is a representation of the local alignment (as in the lecture). You can see that the deletion edit is actually a gap insertion in the second sequence.

In [None]:
print(traceback(V, seq_a, seq_b, simple_scoring_scheme)[1])

Notice that the alignment covers only part of the sequences (why?). You should copy this alignment to answer the question on the tw6 quiz.

Here is an alignment with a second sequence that has a further mutation. 

In [None]:
seq_a = 'GGTATGCTGGCGC'
seq_x = 'TATATGCCGCGTTT'
V, opt = smith_waterman(seq_a, seq_x, simple_scoring_scheme)
print(V)
print(f'Optimum={opt}, ends at {numpy.unravel_index(numpy.argmax(V), V.shape)}')

The optimum score is decreased and the edits transcript now includes R for 'replacement' since there is a mismatch.

In [None]:
print(traceback(V, seq_a, seq_x, simple_scoring_scheme)[0])

In [None]:
print(traceback(V, seq_a, seq_x, simple_scoring_scheme)[1])

### Optional question
Can you use your earlier code for a similarity calculation to confirm the value for this optimum local alignment? 

## Comparing the alignment results to online pairwise alignment tools.

Above we did a crude global alignment and a Smith-Waterman local alignment of two sequences:
```
GGTATGCTGGCGC
TATATGCGGCGTTT
```
This results in alignments:

Crude global
```
GGTATGCTG-GCGC
  ||||| | |   
TATATGCGGCGTTT
```

Simple Smith-Waterman:
```
TATGCTGGCG
||||| ||||
TATGC-GGCG
```

Now aligning these two sequences using
1. EMBOSS **Needle** global alignment web tool at EBI https://www.ebi.ac.uk/Tools/psa/emboss_needle/
2. EMBOSS **Water** (Smith-Waterman) local alignment tool https://www.ebi.ac.uk/Tools/psa/emboss_water/

**Make sure that in STEP 1 you select that you are entering DNA sequences.** Leave the options at their defaults.

Copy the default EMBOSS Water alignment into this cell.

## Question for weekly quiz

For the two sequences copy the alignments for
* `Crude global` (from Jupyter Notebook).
* `Simple Smith-Waterman` (from Jupyter Notebook).
* `EMBOSS Needle global` 
* `EMBOSS Water local`.
*(3 marks)*

(A) What is the difference between the global and local alignments?

(B) How does the crude global alignment differ from the EMBOSS Needle alignment?

(C) Does the `Simple Smith-Waterman` alignment accord with the `EMBOSS Water` alignment? Which has the clearer user output?

*(3 Marks)*

## Optional advanced exercise.

Suppose you wanted to adapt the Smith-Waterman algorithm to look for repeats in a sequence (for instance repeated domains). What changes could you make? Can you alter the code to produce multiple alignments?