# Class 11: Multiple sequence alignment

---

## Before Class
In class today we will implement a generalized version of the Smith-Waterman algorithm to identify optimal local alignments of multiple sequences.

Prior to class, please do the following:
1. Review slides on sequence alignments in detail
* Focus on how to conceptually translate the algorithm to code

---
Today we will implement a generalized form of the Smith-Waterman algorithm from our previous class. This is a dynamic programming algorithm, used here for local alignment of multiple sequences. For today's class, we have provided the basic implementation of the algorithm but have not populated the functions for 1) scoring the alignment or 2) tracing back through the matrix. You will implement these portions of the algorithm in class today.

As a reminder from the slides, the scoring for Smith-Waterman uses only the scores from the positions above, left, and above-left of the current position in the matrix. The main differences today are handling multiple strings in each alignment and the scoring function for the diagonal move, which is the average of the possible scores over all base combinations between the two alignments.
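The averaging for the diagonal move can be sketched as follows. This is a minimal illustration, not the provided solution: the helper name `average_pair_score` is hypothetical, and it assumes that any pair involving a gap character (`'-'`), including gap against gap, scores as a mismatch.

```python
from itertools import product

def average_pair_score(col1, col2, match=3, mismatch=-3):
    """Average the pairwise scores between two alignment columns.

    col1, col2: the bases found at one position of each alignment.
    Assumption: a pair scores `match` only when both characters are the
    same base; any pair involving a gap ('-') scores `mismatch`.
    """
    pair_scores = [
        match if a == b and a != '-' else mismatch
        for a, b in product(col1, col2)  # every base in col1 against every base in col2
    ]
    return sum(pair_scores) / len(pair_scores)

print(average_pair_score('GG', 'G'))  # both pairs match
print(average_pair_score('G-', 'G'))  # one match and one gap pair
```

With the default scores, the first call averages two matches and the second averages one match with one mismatch, which is why the diagonal score can be a non-integer.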

For traceback, you will need to keep track of the direction of the arrows in a matrix and then begin traceback from the maximum value.
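One way to picture the arrows is as integer move codes stored alongside the scores. The sketch below is illustrative only: the helper names `best_move` and `walk_back` are hypothetical, the move codes follow the 0-END, 1-DIAG, 2-UP, 3-LEFT convention used in this notebook, and a plain nested list stands in for the NumPy traceback matrix. It only follows the path of cells and does not build the aligned strings.

```python
END, DIAG, UP, LEFT = 0, 1, 2, 3  # move codes for the traceback matrix

def best_move(diag_score, up_score, left_score):
    """Return (score, move) for one cell; a score of 0 ends the local alignment."""
    candidates = [(0, END), (diag_score, DIAG), (up_score, UP), (left_score, LEFT)]
    # max() keeps the first candidate on ties, so END and DIAG are preferred
    return max(candidates, key=lambda c: c[0])

def walk_back(moves, start):
    """Follow the recorded moves from `start` until an END cell is reached.

    Returns the visited (row, col) cells, starting cell first.
    """
    i, j = start
    path = []
    while moves[i][j] != END:
        path.append((i, j))
        if moves[i][j] == DIAG:
            i, j = i - 1, j - 1
        elif moves[i][j] == UP:
            i -= 1
        else:  # LEFT
            j -= 1
    return path

# Hand-built example: DIAG at (3,2), UP at (2,1), DIAG at (1,1), then END at (0,0)
moves = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 2, 0],
    [0, 0, 1],
]
print(best_move(6, 1, 1))
print(walk_back(moves, (3, 2)))
```

In the real traceback, each step along this path would also append either bases or gap characters to every sequence in the two alignments.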

---
## Imports

In [None]:
import numpy as np
from itertools import product #iterate bases in two alignments

---
## Implement progressive alignment for multiple sequence alignment

In [None]:
#Define the calculation of the diag (substitution) score for aligning a pair of alignments
def compute_diag_score(aln1_chars,aln2_chars,match,mismatch): #treat mismatch & gap as same
    '''
    Calculate diag score by averaging over all base combinations in alignment 1 and alignment 2
    
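    Args:
        aln1_chars (str): bases in alignment 1
        aln2_chars (str): bases in alignment 2
        match (float): score for a pair of identical bases
        mismatch (float): score for a mismatched pair (also used when a gap is involved)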
    
    Returns:
        diag_score (float): diag score
    '''


In [None]:
#Modify score calculation function from previous class: 
def cal_score(matrix, aln1, aln2, i, j, match, mismatch, gap):
    '''Calculate score for position (i,j) in scoring matrix, also record move to trace back
    
    Args:
        matrix (numpy array): scoring matrix
        aln1 (array of strs): alignment 1 (indexed by rows of the matrix)
        aln2 (array of strs): alignment 2 (indexed by columns of the matrix)
        i (int): row number
        j (int): column number
        match, mismatch, gap: scoring parameters
        
    Returns:
        score in position (i,j)    
        move to trace back: 0-END, 1-DIAG, 2-UP, 3-LEFT

    Pseudocode:
        aln1_chars = bases of all seqs in alignment 1 at index i-1
        aln2_chars = bases of all seqs in alignment 2 at index j-1
        calculate scores based on upper-left, up, left neighbors:
        diag_score = matrix[i-1][j-1] + compute_diag_score(aln1_chars,aln2_chars,match,mismatch)
        up_score = ...
        left_score = ...
        take the maximum:
        score = max(0, diag_score, up_score, left_score)
        move = ...
        
    '''


In [None]:
#Hint: use zip()
def traceback(traceback_matrix, maximum_position):
    '''Find the optimal path through the scoring matrix
        
        diagonal: match/mismatch
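        up: gap in aln2 (rows index aln1, so moving up consumes a position of aln1)
        left: gap in aln1 (columns index aln2, so moving left consumes a position of aln2)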
        
    Args:
        traceback_matrix (numpy array): matrix of moves recorded by cal_score
        maximum_position (tuple): (row, col) of the maximum score, where traceback starts
        
    Returns:
        aln_final (array of str): results of multiple sequence alignment (e.g. ['GTTGAC','GTT-AC','GTTG-C'])
        
    Pseudocode:
        #Initialize alignment results for aln1 and aln2
        aligned_aln1 = [[] for i in range(len(aln1))]
        aligned_aln2 = [[] for i in range(len(aln2))]
        while current_move != END:
            current_move = traceback_matrix[current_row][current_col]
            if current_move == DIAG:
                for each element 
                ...
            elif current_move == UP:
                ...
            elif current_move == LEFT:
                ...
            
    '''

        
    return aln_final

In [None]:
#Generalize Smith-Waterman algorithm for a pair of alignments
def SmithWaterman_generalized(aln1, aln2, match=3, mismatch=-3, gap=-2):
    '''Smith-Waterman algorithm for local alignment, generalized for a pair of alignments
    
    Args:
        aln1 (array of strs): input alignment 1 (e.g. ['GTTGAC','GTT-AC'])
        aln2 (array of strs): input alignment 2 (e.g. ['GTTGAC','GTT-AC'])
        match: default = 3
        mismatch: default = -3
        gap: default = -2
    
    Returns:
        results of multiple sequence alignment (array of strs) 
        score_matrix (numpy array): scoring matrix
    '''
    
    
    num_rows = len(aln1[0]) +1
    num_cols = len(aln2[0]) +1
    score_matrix = np.zeros(shape=(num_rows,num_cols), dtype=float) #diag scores can be float
    traceback_matrix = np.zeros(shape=(num_rows,num_cols), dtype=int)
    max_score = 0
    max_pos = (0,0)

    #Create scoring matrix
    for i in range(1,num_rows):
        for j in range(1,num_cols): #iteration starts from position (1,1)
            score_matrix[i][j], traceback_matrix[i][j] = cal_score(score_matrix, aln1, aln2, i, j, match, mismatch, gap)
            
            # Keep track of maximum position for traceback
            if score_matrix[i][j] > max_score:
                max_score = score_matrix[i][j]
                max_pos = (i,j)
    
    #Traceback the optimal path through scoring matrix
    aln_final = traceback(traceback_matrix, max_pos)
    
    return aln_final, score_matrix

In [None]:
aligned_seq1 = 'GTTGAC'
aligned_seq2 = 'GTT-AC'
aln1 = ['GTTGAC', 'GTT-AC']
aln2 = ['AGTTGCG']

In [None]:
SmithWaterman_generalized(aln1,aln2)