# Class 10: Pairwise alignments

---

## Before Class
In class today we will be implementing the Smith-Waterman algorithm to identify optimal local alignments of two sequences.

Prior to class, please do the following:
1. Review the slides on sequence alignment in detail
   * Focus on how to conceptually translate the algorithm to code


---
## Learning Objectives

1. Conceptually understand dynamic programming and sequence alignment
1. Implement the Smith-Waterman algorithm for local alignment


---
## Background

Today we will be implementing Smith-Waterman, a dynamic programming algorithm used for local sequence alignment. For today's class, we have provided the basic skeleton of the algorithm but have not populated the functions for 1) scoring each position in the matrix or 2) tracing back through the matrix. You will implement these portions of the algorithm in class today.

As a reminder from the slides, the scoring for Smith-Waterman only uses the scores from the positions above, left, and above-left of the current position in the matrix as below:

<center><img src="figures/Smith-Waterman_scoring.png"></center>
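
The scoring rule can be sketched for a tiny example as follows. This is only an illustration: `score_cell` is a hypothetical helper (not the `cal_score` function you will write below), and it assumes rows index `seq1`, columns index `seq2`, and the default scores of +1/-1/-1 used in this notebook.

```python
import numpy as np

# Hypothetical helper for illustration only; not the notebook's cal_score.
def score_cell(S, seq1, seq2, i, j, match=1, mismatch=-1, gap=-1):
    """Score cell (i, j) from its above-left, above, and left neighbors."""
    # Row i corresponds to seq1[i-1]; column j corresponds to seq2[j-1]
    diag = S[i - 1, j - 1] + (match if seq1[i - 1] == seq2[j - 1] else mismatch)
    up = S[i - 1, j] + gap
    left = S[i, j - 1] + gap
    # Local alignment: a score is never allowed to drop below zero
    return max(0, diag, up, left)

seq1, seq2 = "TGA", "GA"
S = np.zeros((len(seq1) + 1, len(seq2) + 1), dtype=int)
for i in range(1, S.shape[0]):
    for j in range(1, S.shape[1]):
        S[i, j] = score_cell(S, seq1, seq2, i, j)

print(S)
# [[0 0 0]
#  [0 0 0]
#  [0 1 0]
#  [0 0 2]]
```

The maximum value (2, in the bottom-right cell) marks the end of the best local alignment, `GA` against `GA`.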

For traceback, you will need to record the direction of the arrow for each position in a second matrix, then begin traceback from the position with the maximum score and follow the arrows until you reach a position with a score of 0.
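
As a rough sketch of how following the arrows might look, the example below walks a hand-built traceback matrix. The name `follow_arrows` and the hard-coded arrows are assumptions made for illustration; the move codes (0-END, 1-DIAG, 2-UP, 3-LEFT) match the ones used later in this notebook, and rows are assumed to index `seq1`.

```python
import numpy as np

END, DIAG, UP, LEFT = 0, 1, 2, 3

# Hypothetical helper for illustration only; not the notebook's traceback.
def follow_arrows(seq1, seq2, T, start):
    """Walk the traceback matrix T from `start` until an END cell is reached."""
    i, j = start
    out1, out2 = [], []
    while T[i, j] != END:
        move = T[i, j]
        if move == DIAG:      # both sequences advance (match or mismatch)
            out1.append(seq1[i - 1])
            out2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif move == UP:      # only seq1 advances, so seq2 gets a gap
            out1.append(seq1[i - 1])
            out2.append('-')
            i -= 1
        else:                 # LEFT: only seq2 advances, so seq1 gets a gap
            out1.append('-')
            out2.append(seq2[j - 1])
            j -= 1
    # The alignment was built backwards, so reverse it
    return ''.join(reversed(out1)), ''.join(reversed(out2))

seq1, seq2 = "GTTGAC", "GTTAC"
T = np.zeros((len(seq1) + 1, len(seq2) + 1), dtype=int)
for cell in [(1, 1), (2, 2), (3, 3), (5, 4), (6, 5)]:
    T[cell] = DIAG            # arrows filled in by hand for this example
T[4, 3] = UP

print(follow_arrows(seq1, seq2, T, (6, 5)))
# ('GTTGAC', 'GTT-AC')
```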

---
## Imports

In [None]:
import numpy as np

---
## Implement Smith-Waterman algorithm


```
SmithWaterman(seq1, seq2, match, mismatch, gap)
    Initialize [len(seq1)+1] x [len(seq2)+1] numpy array as scoring matrix with first column and row equal to 0
    
    Fill scoring matrix (score_matrix) and Traceback matrix, record the position with max score (max_pos):
    for i in each row number:
        for j in each column number:
            S[i][j] = max( S[i-1][j-1] + compute_diag_score, S[i-1][j] + gap_score, S[i][j-1] + gap_score, 0 )
            T[i][j] = direction of max( S[i-1][j-1] + compute_diag_score, S[i-1][j] + gap_score, S[i][j-1] + gap_score, 0 )
    
    Traceback. Find the optimal path through scoring matrix starting at max_pos
    
```
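
One possible way to get the score and its direction in a single step is to order the candidates so that each index matches its move code (0-END, 1-DIAG, 2-UP, 3-LEFT). The sketch below assumes that ordering; `score_and_direction` is a hypothetical name, and ties are broken in favor of the lowest code because `np.argmax` returns the first maximum.

```python
import numpy as np

# Hypothetical helper for illustration only.
def score_and_direction(diag_score, up_score, left_score):
    """Return (best score, move code) with codes 0-END, 1-DIAG, 2-UP, 3-LEFT."""
    # Index in this list doubles as the move code
    candidates = [0, diag_score, up_score, left_score]
    move = int(np.argmax(candidates))
    return candidates[move], move

print(score_and_direction(2, -1, -1))   # (2, 1): diagonal wins
print(score_and_direction(-1, -1, -1))  # (0, 0): floor at zero, traceback ends here
print(score_and_direction(0, 3, 1))     # (3, 2): up wins
```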

In [None]:
def smith_waterman(seq1, seq2, match=1, mismatch=-1, gap=-1):
    '''Smith-Waterman algorithm for local alignment
    
    Args:
        seq1 (str): input seq 1
        seq2 (str): input seq 2
        match: default = +1
        mismatch: default = -1
        gap: default = -1
    
    Returns:
        aligned_seq1 (str)
        aligned_seq2 (str)
        score_matrix (numpy array): scoring matrix
    '''
    
    #Initialize Matrix for our future calculations:
    num_rows = len(seq1) +1
    num_cols = len(seq2) +1
    score_matrix = np.zeros(shape=(num_rows,num_cols), dtype=int)
    traceback_matrix = np.zeros(shape=(num_rows,num_cols), dtype=int)
    max_score = 0
    max_pos = (0,0)

    #Create scoring matrix
    for i in range(1,num_rows):
        for j in range(1,num_cols): #iteration starts from position (1,1)
            score_matrix[i][j], traceback_matrix[i][j] = cal_score(score_matrix, seq1, seq2, i, j, match, mismatch, gap)
            
            # Keep track of the maximum-scoring position for traceback
            if score_matrix[i][j] > max_score:
                max_score = score_matrix[i][j]
                max_pos = (i,j)
    
    #Traceback the optimal path through scoring matrix
    aligned_seq1, aligned_seq2 = traceback(seq1, seq2, traceback_matrix, max_pos)
    
    return aligned_seq1, aligned_seq2, score_matrix

In [None]:
def cal_score(matrix, seq1, seq2, i, j, match, mismatch, gap):
    '''Calculate score for position (i,j) in scoring matrix, also record move to trace back
    
    Args:
        matrix (numpy array): scoring matrix
        seq1 (str): input seq 1 (indexed by row)
        seq2 (str): input seq 2 (indexed by column)
        i (int): current row number
        j (int): current column number
        match (int): score for a match
        mismatch (int): score for a mismatch
        gap (int): score for a gap
        
    Returns:
        score in position (i,j)    
        move to trace back: 0-END, 1-DIAG, 2-UP, 3-LEFT
        
    Pseudocode:
        Calculate scores based on upper-left, up, and left neighbors:
            diag_score = upper-left + (match or mismatch)
            up_score = up + gap
            left_score = left + gap
        score = max(0, diag_score, up_score, left_score)
        traceback = maximum direction or end
        
    '''


In [None]:
def traceback(seq1, seq2, traceback_matrix, maximum_position):
    '''Find the optimal path through the scoring matrix
        
        diagonal: match/mismatch
        up: gap in seq2 (seq1 letter aligned to '-')
        left: gap in seq1 (seq2 letter aligned to '-')
        
    Args:
        seq1 (str) : First sequence being aligned
        seq2 (str) : Second sequence being aligned
        traceback_matrix (numpy array): traceback matrix
        maximum_position (tuple): starting position to trace back from
        
    Returns:
        aligned_seq1 (str): e.g. GTTGAC
        aligned_seq2 (str): e.g. GTT-AC
        
    Pseudocode:
        while current_move != END:
            current_move = traceback_matrix[current_row][current_col]
            if current_move == DIAG:
                ...
            elif current_move == UP:
                ...
            elif current_move == LEFT:
                ...
            
    '''


In [None]:
# Example from slides
seq1 = 'TACTTAG'
seq2 = 'CACATTAA'

aligned_seq1, aligned_seq2, score_matrix = smith_waterman(seq1,seq2)

print(aligned_seq1)
print(aligned_seq2)
print(score_matrix)
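
A quick way to sanity-check a finished implementation is to re-score the returned alignment and compare it with the largest value in the scoring matrix. The helper below is a minimal sketch under the same +1/-1/-1 defaults; `alignment_score` is a hypothetical name, not part of the provided skeleton.

```python
# Hypothetical helper for illustration only.
def alignment_score(aligned1, aligned2, match=1, mismatch=-1, gap=-1):
    """Sum the per-column scores of two gapped, equal-length aligned strings."""
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(aligned1, aligned2):
        if a == '-' or b == '-':
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total

print(alignment_score("GTTGAC", "GTT-AC"))  # 4 (five matches, one gap)
```

Once `cal_score` and `traceback` are filled in, `alignment_score(aligned_seq1, aligned_seq2)` should equal `score_matrix.max()`.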
