Exercise 3.1
Consider the following substitution matrix for DNA sequences:

     A  C  G  T    
  A 10  2  5  2   
  C  2 10  2  5   
  G  5  2 10  2    
  T  2  5  2 10
 
What is the optimal (i.e. maximal) cost of a local alignment of AATAAT and AAGG using the above substitution matrix and a gap cost of -5? What is it with a gap cost of 0?

In [4]:
substitution_matrix ={'A': {'A': 10, 'C': 2, 'G': 5, 'T': 2}, 
                    'C': {'A': 2, 'C': 10, 'G': 2, 'T': 5}, 
                    'G': {'A': 5, 'C': 2, 'G': 10, 'T': 2}, 
                    'T': {'A': 2, 'C': 5, 'G': 2, 'T': 10}}

import numpy as np

def local_gap_cost_matrix(seq1:str, seq2:str, gap_cost:int, substitution_matrix:dict):
    """ Return the optimal local alignment cost of the two sequences together with the full score matrix. """
    n = len(seq1) + 1
    m = len(seq2) + 1
    # Row 0 and column 0 stay 0: a local alignment can start anywhere at no cost
    M = np.zeros((n, m))
    for i in range(1, n):
        for j in range(1, m):
            substitution_cost = substitution_matrix[seq1[i-1]][seq2[j-1]]
            M[i][j] = max(M[i-1][j-1] + substitution_cost, M[i][j-1] + gap_cost, M[i-1][j] + gap_cost, 0)
    return M.max(), M


print(local_gap_cost_matrix('AATAAT', 'AAGG', -5, substitution_matrix))
print(local_gap_cost_matrix('AATAAT', 'AAGG', 0, substitution_matrix))

(27.0, array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0., 10., 10.,  5.,  5.],
       [ 0., 10., 20., 15., 10.],
       [ 0.,  5., 15., 22., 17.],
       [ 0., 10., 15., 20., 27.],
       [ 0., 10., 20., 20., 25.],
       [ 0.,  5., 15., 22., 22.]]))
(30.0, array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0., 10., 10., 10., 10.],
       [ 0., 10., 20., 20., 20.],
       [ 0., 10., 20., 22., 22.],
       [ 0., 10., 20., 25., 27.],
       [ 0., 10., 20., 25., 30.],
       [ 0., 10., 20., 25., 30.]]))


In [5]:
def global_gap_cost_matrix(seq1:str, seq2:str, gap_cost:int, substitution_matrix:dict):
    """ Return the optimal global alignment cost of the two sequences together with the full score matrix. """
    n = len(seq1) + 1
    m = len(seq2) + 1
    # Initialize the matrix; row 0 and column 0 are then filled with cumulative gap costs
    M = np.zeros((n, m))
    for i in range(1, n):
        M[i][0] =  gap_cost * i
    for j in range(1, m):
        M[0][j] =  gap_cost * j
    for i in range(1, n):
        for j in range(1, m):
            substitution_cost = substitution_matrix[seq1[i-1]][seq2[j-1]]
            M[i][j] = max(M[i-1][j-1] + substitution_cost, M[i][j-1] + gap_cost, M[i-1][j] + gap_cost)
    return M[n-1][m-1], M
    
print(global_gap_cost_matrix('AATAAT', 'AAGG', -5, substitution_matrix))
print(global_gap_cost_matrix('AATAAT', 'AAGG', 0, substitution_matrix))

(20.0, array([[  0.,  -5., -10., -15., -20.],
       [ -5.,  10.,   5.,   0.,  -5.],
       [-10.,   5.,  20.,  15.,  10.],
       [-15.,   0.,  15.,  22.,  17.],
       [-20.,  -5.,  10.,  20.,  27.],
       [-25., -10.,   5.,  15.,  25.],
       [-30., -15.,   0.,  10.,  20.]]))
(30.0, array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0., 10., 10., 10., 10.],
       [ 0., 10., 20., 20., 20.],
       [ 0., 10., 20., 22., 22.],
       [ 0., 10., 20., 25., 27.],
       [ 0., 10., 20., 25., 30.],
       [ 0., 10., 20., 25., 30.]]))


Exercise 3.2
Explain how to find the optimal cost of a local alignment of two sequences A[1..n] and B[1..m] in time O(nm) and space O(n).

Suppose that A[h+1..i*] and B[k+1..j*] are a pair of segments with maximum similarity, i.e. their similarity is the cost of an optimal local alignment of A[1..n] and B[1..m]. Explain how to find (i*,j*) and (h,k) in linear space. Finding (i*,j*) should be easy, but you may have to think about how to find (h,k) in linear space.

Explain how to find an optimal local alignment of two sequences A[1..n] and B[1..m] in time O(nm) and space O(n).

ANSWER:
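One possible approach, given here as a sketch and not as the official solution: fill the local alignment table row by row while keeping only the previous and the current row. Alongside each score, store the cell where the alignment ending in that cell started; this start is copied from whichever predecessor gave the maximum and is reset whenever the score drops to 0. The best cell seen gives (i*,j*) and its stored start gives (h,k). The function name `local_alignment_endpoints` is hypothetical, and the sketch uses O(m) space, so swap the two sequences if O(n) is wanted.

```python
substitution_matrix = {'A': {'A': 10, 'C': 2, 'G': 5, 'T': 2},
                       'C': {'A': 2, 'C': 10, 'G': 2, 'T': 5},
                       'G': {'A': 5, 'C': 2, 'G': 10, 'T': 2},
                       'T': {'A': 2, 'C': 5, 'G': 2, 'T': 10}}

def local_alignment_endpoints(seq1, seq2, gap_cost, substitution_matrix):
    """Return (cost, (h, k), (i*, j*)) using two rows of scores and start cells."""
    m = len(seq2)
    prev_score = [0] * (m + 1)
    prev_start = [(0, j) for j in range(m + 1)]
    best, best_start, best_end = 0, (0, 0), (0, 0)
    for i in range(1, len(seq1) + 1):
        cur_score = [0] * (m + 1)
        cur_start = [(i, 0)] + [None] * m
        for j in range(1, m + 1):
            sub = substitution_matrix[seq1[i - 1]][seq2[j - 1]]
            candidates = [
                (prev_score[j - 1] + sub, prev_start[j - 1]),   # diagonal
                (prev_score[j] + gap_cost, prev_start[j]),      # gap in seq2
                (cur_score[j - 1] + gap_cost, cur_start[j - 1]) # gap in seq1
            ]
            score, start = max(candidates, key=lambda c: c[0])
            if score <= 0:
                # Restart: a new local alignment would begin after cell (i, j)
                score, start = 0, (i, j)
            cur_score[j], cur_start[j] = score, start
            if score > best:
                best, best_start, best_end = score, start, (i, j)
        prev_score, prev_start = cur_score, cur_start
    return best, best_start, best_end

print(local_alignment_endpoints('AATAAT', 'AAGG', -5, substitution_matrix))
# (27, (0, 0), (4, 4))
```

Once (h,k) and (i*,j*) are known, an optimal local alignment is an optimal global alignment of A[h+1..i*] and B[k+1..j*], which Hirschberg's method can produce in linear space.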


Exercise 3.3
Can Hirschberg's idea for finding an optimal alignment of A[1..n] and B[1..m] with linear gap cost in time O(nm) and space O(n) be extended to find an optimal alignment of A and B with affine gap cost in time O(nm) and space O(n)? Are there any problems?


ANSWER: Yes, we would have problems implementing his idea: a gap block can span the middle split, and we would have a hard time keeping track of it, since with affine gap cost each half would charge the opening cost of that gap separately.

Exercise 3.5
Sometimes one is interested in finding an optimal global alignment with at most k gap blocks (where a gap block is a consecutive block of gaps in one sequence, see e.g. the slides about affine gap cost). Design an algorithm which, given two sequences A[1..n] and B[1..m], a score matrix and linear gap cost, and a number k, computes the optimal cost of a global alignment of A and B containing at most k gap blocks. What is the running time and space consumption of your algorithm?

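WRONG BUT COOL (the code below restricts the alignment to a band of width k around the diagonal; it does not bound the number of gap blocks)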

In [8]:

def inside_k_band(i,j,k):
    """ This function takes two indices i and j and a band width k and returns True if i and j are inside the band and False otherwise. """
    return -k <= i - j <= k


def global_allignmen_k_gap_blocks(seq1:str, seq2:str, gap_cost:int, substitution_matrix:dict, k:int):
    """ Return the value of the last cell and the score matrix of a global alignment restricted to a band of width k. Cells outside the band are never updated and keep their initial value 0. """
    n = len(seq1) + 1
    m = len(seq2) + 1
    # Initialize the matrices
    M = np.zeros((n, m))
    for i in range(1, n):
        M[i][0] =  gap_cost * i
    for j in range(1, m):
        M[0][j] =  gap_cost * j
    for i in range(1, n):
        for j in range(1, m):
            substitution_cost = substitution_matrix[seq1[i-1]][seq2[j-1]]
            if inside_k_band(i-1,j,k):
                M[i][j] = max(M[i-1][j-1] + substitution_cost, M[i-1][j] + gap_cost)
            elif inside_k_band(i,j-1,k):
                M[i][j] = max(M[i-1][j-1] + substitution_cost, M[i][j-1] + gap_cost)
    return M[n-1][m-1], M

print(global_allignmen_k_gap_blocks('AATAAT', 'AAGAAG', -5, substitution_matrix,1))

(44.0, array([[  0.,  -5., -10., -15., -20., -25., -30.],
       [ -5.,  10.,   5.,   0.,   0.,   0.,   0.],
       [-10.,   5.,  20.,  15.,  10.,   0.,   0.],
       [-15.,   0.,  15.,  22.,  17.,  12.,   0.],
       [-20.,   0.,  10.,  20.,  32.,  27.,  22.],
       [-25.,   0.,   0.,  15.,  30.,  42.,  37.],
       [-30.,   0.,   0.,   0.,  25.,  37.,  44.]]))


Running time answer:
O(nm) in the worst case.
If the difference is bounded by a constant k, only the O(kn) cells inside the band need to be filled, so we can get O(n) (apparently).

We can use the KBand algorithm as a fast method for finding high-identity alignments:
• If we know that the two input sequences are highly similar and we
have a bound b on the number of gaps that will occur in the best
alignment, then the KBand algorithm with k = b will compute an
optimal alignment.
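
For comparison with the banded attempt above, the following is a minimal sketch of one way to bound the number of gap blocks directly; it is an assumption about how the exercise could be solved, not the official solution. The table is indexed by the state of the last column (substitution, gap in seq2, gap in seq1), the position (i, j) and the number b of gap blocks used so far. Extending a block keeps b, while opening a block from another state uses b - 1. The function name `global_cost_at_most_k_gap_blocks` is hypothetical.

```python
import math

substitution_matrix = {'A': {'A': 10, 'C': 2, 'G': 5, 'T': 2},
                       'C': {'A': 2, 'C': 10, 'G': 2, 'T': 5},
                       'G': {'A': 5, 'C': 2, 'G': 10, 'T': 2},
                       'T': {'A': 2, 'C': 5, 'G': 2, 'T': 10}}

def global_cost_at_most_k_gap_blocks(seq1, seq2, gap_cost, substitution_matrix, k):
    """Optimal global alignment cost using at most k gap blocks (-inf if impossible)."""
    n, m = len(seq1), len(seq2)
    NEG = -math.inf
    # T[s][i][j][b]: best cost of aligning seq1[:i] and seq2[:j] with exactly b
    # gap blocks, where the last column is s = 0 (substitution),
    # s = 1 (gap in seq2) or s = 2 (gap in seq1).
    T = [[[[NEG] * (k + 1) for _ in range(m + 1)] for _ in range(n + 1)]
         for _ in range(3)]
    T[0][0][0][0] = 0  # empty alignment, no gap blocks
    for i in range(n + 1):
        for j in range(m + 1):
            for b in range(k + 1):
                if i > 0 and j > 0:
                    prev = max(T[s][i - 1][j - 1][b] for s in range(3))
                    T[0][i][j][b] = prev + substitution_matrix[seq1[i - 1]][seq2[j - 1]]
                if i > 0:
                    extend = T[1][i - 1][j][b]
                    opened = max(T[0][i - 1][j][b - 1], T[2][i - 1][j][b - 1]) if b > 0 else NEG
                    T[1][i][j][b] = max(extend, opened) + gap_cost
                if j > 0:
                    extend = T[2][i][j - 1][b]
                    opened = max(T[0][i][j - 1][b - 1], T[1][i][j - 1][b - 1]) if b > 0 else NEG
                    T[2][i][j][b] = max(extend, opened) + gap_cost
    return max(T[s][n][m][b] for s in range(3) for b in range(k + 1))

for k in range(3):
    print(k, global_cost_at_most_k_gap_blocks('AATAAT', 'AAGG', -5, substitution_matrix, k))
```

For AATAAT and AAGG with gap cost -5 this sketch gives 17 for k = 1 and 20 for k = 2, the latter matching the unrestricted global cost computed in Exercise 3.1. As written it uses O(nmk) time and O(nmk) space; since row i only depends on rows i and i-1, the space could be reduced to O(mk) by keeping two rows.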

Exercise 3.6
Running time in practice vs theory: Say that you have implemented an algorithm, e.g. the algorithm for computing the optimal cost of a global pairwise alignment with linear gap cost of two strings of lengths n and m, which has an asymptotic worst case time complexity of O(nm). What does this mean, and how would you verify that the running time of your implementation in practice is as expected cf. its theoretical (worst case) time complexity?
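
One way to approach the practical part, sketched under assumptions: O(nm) means that for sufficiently large inputs the running time is at most c*n*m for some constant c. So one can time the implementation on inputs of growing size and check that time/(n*m) levels off around a constant instead of growing. The helper names `global_cost` and `measure` are hypothetical, and the simple match/mismatch scores stand in for a full substitution matrix.

```python
import random
import time

def global_cost(seq1, seq2, gap_cost, match=10, mismatch=2):
    """Optimal global alignment cost with linear gap cost, keeping two rows."""
    prev = [gap_cost * j for j in range(len(seq2) + 1)]
    for i in range(1, len(seq1) + 1):
        cur = [gap_cost * i]
        for j in range(1, len(seq2) + 1):
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap_cost, cur[j - 1] + gap_cost))
        prev = cur
    return prev[-1]

def measure(sizes, repeats=3, seed=0):
    """Return (n, best time in seconds, time / (n*n)) for random inputs with n = m."""
    rng = random.Random(seed)
    rows = []
    for n in sizes:
        a = ''.join(rng.choice('ACGT') for _ in range(n))
        b = ''.join(rng.choice('ACGT') for _ in range(n))
        times = []
        for _ in range(repeats):
            t0 = time.perf_counter()
            global_cost(a, b, -5)
            times.append(time.perf_counter() - t0)
        best = min(times)  # the minimum is less sensitive to background noise
        rows.append((n, best, best / (n * n)))
    return rows

for n, seconds, ratio in measure([100, 200, 400]):
    print(f"n = m = {n:4d}  time = {seconds:.4f} s  time/(n*m) = {ratio:.3e}")
```

If the last column stays roughly constant as n doubles, the measurements are consistent with O(nm); plotting time against n*m and looking for a straight line is an equivalent check.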

