# Biological Sequence Alignment

In this notebook, you can find implementations of several algorithms for finding a **longest common subsequence** of two strings of arbitrary lengths, as well as a bottom-up implementation of the **Needleman-Wunsch algorithm**. We start with simple algorithms and gradually move to more complicated (and more efficient) ones.

In [7]:
import numpy as np

In [8]:
# Test example #1
A = 'ABCBDAB'
B = 'BDCABA'

# Test example #2 (taken from Wikipedia (https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm))
X = 'GATTACA'
Y = 'GCATGCU'

# Test example #3
S1 = 'GTGACTGCGATAAGCTTAGATCCTCTTAAAAT'
S2 = 'GAGGGAGACATGCGATACAAGGGATCCTTGTAGATCTGCGTCTTTAA'

### Longest Common Subsequence 

As mentioned above, we start with the simplest and, at the same time, least efficient algorithm. Our first algorithm for finding an LCS of two sequences generates all possible subsequences of one of the two sequences and checks, one by one, whether each of them is also a subsequence of the other sequence, keeping track of the longest common subsequence found.

To find all possible subsequences of a sequence (except the empty one, which is not interesting for us) we use a bitmask: we run from **1** to **2^n - 1** in binary and apply each number as a mask over the original string. In other words, we take the characters whose corresponding bit in the mask is **1** and omit the characters whose bit is **0**.

In [9]:
## Complexity = O(n*2^n), where n = len(S)
def find_all_subsequences(S):
    l = [] # empty list that will be returned full of subsequences (except the empty set)
    for e in range (1,2**len(S)):# e is the decimal number that runs from 1 to 2 to the power of length of the S
        x = ''
        b = format(e, '#0' + str(len(S)+2) + 'b')# this line of code converts e to a corresponding binary number
        for i in range(2,len(b)):
            if b[i] == '1': 
                x += S[i-2] # b starts with the prefix '0b', so position i of b corresponds to the char of S at index i-2
        l.append(x)
    return l

In [10]:
# testing our function
print(find_all_subsequences('ABC'))

['C', 'B', 'BC', 'A', 'AC', 'AB', 'ABC']


Now let's understand why the complexity of our function **find_all_subsequences()** is **O(n*2^n)**. We have two nested for loops: the outer one runs **2^n - 1** times and the inner one **n** times, giving **n*(2^n - 1)** operations in total, i.e. the running time of our function is **O(n*2^n)**.
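The count of non-empty subsequences can be checked directly. The following is a minimal sketch, not the notebook's own function: `subsequences_by_bitmask` is a hypothetical helper that tests bits with shifts instead of formatting a binary string, so the order of its output differs from that of `find_all_subsequences`.

```python
def subsequences_by_bitmask(S):
    """Return all non-empty subsequences of S, one per bitmask 1 .. 2**n - 1."""
    n = len(S)
    result = []
    for mask in range(1, 2 ** n):
        # Keep S[i] exactly when bit i of the mask is set.
        result.append(''.join(S[i] for i in range(n) if (mask >> i) & 1))
    return result

subs = subsequences_by_bitmask('ABC')
print(len(subs))  # 2**3 - 1 = 7
```

Each mask costs n bit tests, which is where the factor n in the bound comes from.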

In [11]:
## Complexity = O(n*2^n) + O(2^m*m), where n = len(A), m = len(B)
def exponential_LCS(A,B):
    x = set(find_all_subsequences(A)) # creating set containing all subsequences of the first string
    y = set(find_all_subsequences(B)) # creating set containing all subsequences of the second string
    z = x.intersection(y) # intersecting the above two to be left with only common subsequences
    l = 0
    LCS = ""
    for each in z: # running over the set of common subsequences to find the one with max length
        if len(each) > l: 
            l = len(each)
            LCS = each
    return LCS

In [12]:
# testing our function
# we do not recommend testing this function on long sequences because it has exponential running time :)
print(exponential_LCS(A,B))

BCAB


Below you can find the same algorithm with some improvements in terms of running time.
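The improved version depends on a check for whether one string is a subsequence of another. As a rough sketch of one way to do that check in a single left-to-right pass, the hypothetical helper `is_subsequence_iterative` below uses two pointers and avoids both string slicing and recursion. It is an alternative illustration, not the function used in this notebook.

```python
def is_subsequence_iterative(str1, str2):
    """Return True if str1 is a subsequence of str2 (two-pointer scan)."""
    i = 0  # position of the next character of str1 still to be matched
    for ch in str2:
        if i < len(str1) and str1[i] == ch:
            i += 1
    return i == len(str1)

print(is_subsequence_iterative('BCBA', 'ABCBDAB'))  # True
print(is_subsequence_iterative('ABD', 'BDCABA'))    # False
```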

In [13]:
## Complexity = O(n*2^n) + O(m*2^n), where n = min(len(A),len(B)) and m = max(len(A),len(B))
def exponential_LCS_b(A,B):
    if len(A)>len(B):
        s = find_all_subsequences(B) 
        S = A
    else:
        s = find_all_subsequences(A)
        S = B
    lcs = '' # the longest common subsequence found so far
    for elem in s:
        if is_subsequence(elem,S):
            if len(lcs)<len(elem):
                lcs = elem
    return lcs

# This function checks if str1 is a subsequence of str2 or not
def is_subsequence(str1,str2): ## O(n), n = len(str2)
    m = len(str1)
    n = len(str2)
    if m==0:
        return True
    if n==0:
        return False
    if str1[m-1]==str2[n-1]:
        return is_subsequence(str1[0:m-1],str2[0:n-1])
    else:
        return is_subsequence(str1,str2[0:n-1])

In [14]:
# testing the function
print(exponential_LCS_b(A,B)) # As you can see, here we got 'BCBA', but the result obtained from the other exponential 
# LCS function was 'BCAB'. This is not surprising because two sequences can have more than one longest common 
# subsequence and, depending on the nature of the function, the results can differ :)
print(exponential_LCS_b(X,Y))

BCBA
GATC


Below you can find a recursive approach to finding a longest common subsequence of two strings. We named this recursive algorithm **naive_recursive_LCS** because its worst-case running time is exponential: if we have strings of lengths **m** and **n**, the complexity is **O(2^(n+m))**. The reason is that during each recursive step, in the worst case, we call the same function twice, each time decreasing the length of one of the strings by just one.
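To get a feel for this growth, the sketch below counts how many calls the recursion makes when no pair of characters ever matches, so that every step branches twice. The helper `count_naive_calls` is hypothetical and works on lengths only. It mirrors the branching structure described above rather than computing an LCS.

```python
def count_naive_calls(m, n):
    """Number of calls made by the naive recursion on inputs of lengths m and n
    when no characters ever match (every non-base call branches twice)."""
    if m == 0 or n == 0:
        return 1
    return 1 + count_naive_calls(m, n - 1) + count_naive_calls(m - 1, n)

for k in range(1, 6):
    print(k, count_naive_calls(k, k))
```

For lengths 1, 2 and 3 this gives 3, 11 and 39 calls, already more than tripling with each extra character.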

In [15]:
def naive_recursive_LCS(A,B):
    if len(A)==0 or len(B)==0:
        return ""
    elif len(A)>0 and len(B)>0:
        if A[len(A)-1] == B[len(B)-1]:
            return naive_recursive_LCS(A[0:len(A)-1],B[0:len(B)-1]) + A[len(A)-1]
        else:
            x = naive_recursive_LCS(A,B[0:len(B)-1])
            y = naive_recursive_LCS(A[0:len(A)-1],B)
            if len(x) > len (y):
                return x
            else:
                return y

In [16]:
# testing our function
print(naive_recursive_LCS(A,B))

BCBA


The function below implements the recursive LCS with memoization, which significantly improves the running time: it transforms the exponential-time recursive solution into a polynomial-time one. In this case, the running time is **O(m*n)**, as we have **m*n** distinct subproblems and we solve each subproblem just once by keeping a matrix of size **(m+1) * (n+1)** (where m and n stand for the sizes of the input strings A and B respectively). This function only returns the length of a longest common subsequence. To print the longest common subsequence itself we need another function. We will write it for the bottom-up approach, since we prefer that one :)
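As a compact illustration of the same idea, here is a minimal sketch that memoizes on the prefix lengths (i, j) with `functools.lru_cache` from the standard library instead of an explicit matrix. The name `lcs_length_cached` is hypothetical, and the sketch is an alternative formulation rather than the notebook's implementation.

```python
from functools import lru_cache

def lcs_length_cached(A, B):
    """Length of a longest common subsequence, memoized on prefix lengths (i, j)."""
    @lru_cache(maxsize=None)
    def solve(i, j):
        # solve(i, j) is the LCS length of the prefixes A[:i] and B[:j].
        if i == 0 or j == 0:
            return 0
        if A[i - 1] == B[j - 1]:
            return 1 + solve(i - 1, j - 1)
        return max(solve(i, j - 1), solve(i - 1, j))
    return solve(len(A), len(B))

print(lcs_length_cached('ABCBDAB', 'BDCABA'))  # 4
```

Because each pair (i, j) is evaluated at most once, the number of evaluations is bounded by (m+1) * (n+1).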

In [17]:
def recursive_LCS_memoized(A,B):
    m = len(A)
    n = len(B)
    l = -np.ones((m+1,n+1),dtype=int) 
    l[0,1:] = np.zeros(n,dtype=int)
    l[:,0] = np.zeros(m+1,dtype=int)
    return LCS_aux(A,B,l)
    
def LCS_aux(A,B,l):
    if l[len(A)][len(B)]!=-1:
        return l[len(A)][len(B)]
    else:
        if A[len(A)-1] == B[len(B)-1]:
            res = 1 + LCS_aux(A[0:len(A)-1],B[0:len(B)-1],l)
            l[len(A)][len(B)] = res
        else:
            a = LCS_aux(A,B[0:len(B)-1],l)
            b = LCS_aux(A[0:len(A)-1],B,l)
            if a>b:
                res = a
            else:
                res = b
            l[len(A)][len(B)] = res # memoize the result of the current subproblem
        return res

In [18]:
# testing the function recursive_LCS_memoized
print(recursive_LCS_memoized(A,B))
print(recursive_LCS_memoized(X,Y))
print(recursive_LCS_memoized(S1,S2)) # We also test this function on S1 and S2, since we now have a much faster 
# function :) We will print out a longest common subsequence of S1 and S2 later. Coming soon!

4
4
28


Below you can find the bottom-up version of finding a longest common subsequence of two given sequences.
As in the recursive approach, we again keep a matrix that contains the values of already solved subproblems. In addition, we keep another matrix that records our path, that is, from which cell the current one was reached. Again the running time is **O(m x n)**, as we have 2 nested for loops that run m and n times.

In [19]:
def bottom_up_LCS_memoized(A,B):
    m = len(A)
    n = len(B)
    p = np.empty((m+1,n+1),dtype=str)# path matrix (np.str was removed in NumPy 1.24; the builtin str works)
    direc = np.empty((m+1,n+1),dtype=str)
    direc[1:,0] = np.array(list(A))
    direc[0,1:] = np.array(list(B))
    l = np.zeros((m+1,n+1),dtype=int) # the matrix that we keep for taking the values of already solved subproblems
    # when needed. Note that l[m][n], at the end of the algorithm, will contain the length of the longest common 
    # subsequence found. (initially, this matrix is filled with 0s)
    for i in range(1,m+1):
        for j in range(1,n+1):
            if A[i-1]==B[j-1]: 
                l[i][j] = l[i-1][j-1] + 1
                p[i][j] = "d" # 'd' stands for main-diagonally up
                direc[i][j] = '\u2196'
            elif l[i-1][j]<l[i][j-1]:
                l[i][j] = l[i][j-1] 
                p[i][j] = "l" # l stands for left
                direc[i][j] = '\u2190'
            elif l[i-1][j]>=l[i][j-1]: 
                l[i][j] = l[i-1][j] 
                p[i][j] = "u" # stands for up 
                direc[i][j] = '\u2191'
    return (l,p,direc)

In [20]:
# testing our function
l,p,d = bottom_up_LCS_memoized(A,B)

In [21]:
print(l)
# So the length of the LCS found is 4 as expected

[[0 0 0 0 0 0 0]
 [0 0 0 0 1 1 1]
 [0 1 1 1 1 2 2]
 [0 1 1 2 2 2 2]
 [0 1 1 2 2 3 3]
 [0 1 2 2 2 3 3]
 [0 1 2 2 3 3 4]
 [0 1 2 2 3 4 4]]


In [23]:
# path
print(d)

[['' 'B' 'D' 'C' 'A' 'B' 'A']
 ['A' '↑' '↑' '↑' '↖' '←' '↖']
 ['B' '↖' '←' '←' '↑' '↖' '←']
 ['C' '↑' '↑' '↖' '←' '↑' '↑']
 ['B' '↖' '↑' '↑' '↑' '↖' '←']
 ['D' '↑' '↖' '↑' '↑' '↑' '↑']
 ['A' '↑' '↑' '↑' '↖' '↑' '↖']
 ['B' '↖' '↑' '↑' '↑' '↖' '↑']]


The function **bottom_up_LCS_memoized()** constructs the path matrix, but we need another function, **print_LCS**, that traverses the matrix and recovers the LCS. The algorithm is simple: we start from the bottom-right cell, and if its value is **d** we move diagonally up and to the left, recording the letter that corresponds to that cell. For the values **u** and **l** we move one cell up or left respectively, without recording anything.
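The traversal just described can also be written as a loop that returns the LCS as a string instead of printing it. The sketch below is self-contained and makes two assumptions of its own: it uses plain Python lists rather than NumPy arrays, and it re-derives the direction from the length table instead of reading a separate path matrix. The name `lcs_by_traceback` is hypothetical.

```python
def lcs_by_traceback(A, B):
    """Build the length table bottom-up, then walk back from the bottom-right cell."""
    m, n = len(A), len(B)
    l = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if A[i - 1] == B[j - 1]:
                l[i][j] = l[i - 1][j - 1] + 1
            else:
                l[i][j] = max(l[i - 1][j], l[i][j - 1])
    chars = []
    i, j = m, n
    while i > 0 and j > 0:
        if A[i - 1] == B[j - 1]:         # the 'd' case: take the character, move diagonally
            chars.append(A[i - 1])
            i, j = i - 1, j - 1
        elif l[i - 1][j] < l[i][j - 1]:  # the 'l' case: move left
            j -= 1
        else:                            # the 'u' case: move up
            i -= 1
    return ''.join(reversed(chars))

print(lcs_by_traceback('ABCBDAB', 'BDCABA'))  # BCBA
```

The characters are collected from the end of the LCS to its beginning, which is why the list is reversed before joining.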

In [24]:
def print_LCS(p,A,i,j):
    if i==0 or j==0:
        return
    if p[i][j] == 'd':
        print_LCS(p,A,i-1,j-1)
        print(A[i-1],end='')
    elif p[i][j] == 'u': 
        print_LCS(p,A,i-1,j)
    else:
        print_LCS(p,A,i,j-1)

In [25]:
# printing LCS for A and B using print_LCS
print_LCS(p,A,len(A),len(B))

BCBA

In [26]:
# printing LCS for X and Y using print_LCS
l,p,_ = bottom_up_LCS_memoized(X,Y)
print_LCS(p,X,len(X),len(Y))

GATC

In [27]:
# printing LCS for S1 and S2 using print_LCS. Finally we know one LCS of S1 and S2! :)
l,p,_ = bottom_up_LCS_memoized(S1,S2)
print_LCS(p,S1,len(S1),len(S2))

GGACTGCGATAAGCTTAGATCCTCTTAA

In the block of code below, you can find a function that edits the input strings so that they show how the LCS is obtained. That is, "-" characters are placed between the characters of the input strings so that the elements of the LCS found line up with each other when the two input sequences are aligned.
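As an illustration of this gap-insertion idea, here is a minimal sketch that returns the two padded strings instead of filling global lists. It assumes plain Python lists for the table, and `lcs_alignment` is a hypothetical name. The loop continues until both strings are fully consumed, so any leading characters that are not part of the LCS are still emitted.

```python
def lcs_alignment(A, B):
    """Return A and B padded with '-' so that the characters of one LCS line up."""
    m, n = len(A), len(B)
    l = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if A[i - 1] == B[j - 1]:
                l[i][j] = l[i - 1][j - 1] + 1
            else:
                l[i][j] = max(l[i - 1][j], l[i][j - 1])
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and A[i - 1] == B[j - 1]:
            top.append(A[i - 1])      # matched pair: both characters are emitted
            bottom.append(B[j - 1])
            i, j = i - 1, j - 1
        elif j > 0 and (i == 0 or l[i - 1][j] < l[i][j - 1]):
            top.append('-')           # gap in A, consume a character of B
            bottom.append(B[j - 1])
            j -= 1
        else:
            top.append(A[i - 1])      # gap in B, consume a character of A
            bottom.append('-')
            i -= 1
    return ''.join(reversed(top)), ''.join(reversed(bottom))

row1, row2 = lcs_alignment('ABCBDAB', 'BDCABA')
print(row1)
print(row2)
```

Removing the "-" characters from each row gives back the original inputs, which is a convenient sanity check for any alignment routine.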

In [28]:
l1 = []
l2 = []
def find_best_alignment(p,A,B,i,j):
    if i==0 or j==0:
        if i==0 and j==0:
            pass
        elif i==0:
            find_best_alignment(p,A,B,i,j-1) # keep going so the rest of B is not dropped
            l2.append(B[j-1])
            l1.append('-')
        elif j==0:
            find_best_alignment(p,A,B,i-1,j) # keep going so the rest of A is not dropped
            l1.append(A[i-1])
            l2.append('-')
        return
    if p[i][j] == 'd':
        find_best_alignment(p,A,B,i-1,j-1)
        l1.append(A[i-1])
        l2.append(B[j-1])
    elif p[i][j] == 'u':
        find_best_alignment(p,A,B,i-1,j)
        l1.append(A[i-1])
        l2.append('-')
    else:
        find_best_alignment(p,A,B,i,j-1)
        l1.append('-')
        l2.append(B[j-1])

In [29]:
# Printing best alignment for S1 and S2
find_best_alignment(p,S1,S2,len(S1),len(S2))
str1 = ''.join(l1)
str2 = ''.join(l2)
print(str1)
print(str2)
l1=[]
l2=[]

----G-TGAC-TGCGAT--AA--G---C-T-TAGATC--C-TC-TTAAAAT
GAGGGA-GACATGCGATACAAGGGATCCTTGTAGATCTGCGTCTTT--AA-


### Needleman-Wunsch algorithm

Below you can see our bottom-up implementation of the Needleman-Wunsch algorithm.
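For orientation, here is a minimal score-only sketch of the textbook formulation, in which every cell takes the maximum over the diagonal, up and left moves. It is an assumption-laden illustration rather than this notebook's implementation: `needleman_wunsch_score` is a hypothetical name, it uses plain Python lists, and it does not record the path.

```python
def needleman_wunsch_score(A, B, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score using the three-way maximum in every cell."""
    m, n = len(A), len(B)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = gap * i  # aligning a prefix of A against nothing costs i gaps
    for j in range(1, n + 1):
        F[0][j] = gap * j  # aligning a prefix of B against nothing costs j gaps
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if A[i - 1] == B[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # align A[i-1] with B[j-1]
                          F[i - 1][j] + gap,    # gap in B
                          F[i][j - 1] + gap)    # gap in A
    return F[m][n]

print(needleman_wunsch_score('GATTACA', 'GCATGCU'))                             # 0
print(needleman_wunsch_score('ABCBDAB', 'BDCABA', match=1, mismatch=0, gap=0))  # 4
```

With match=1 and mismatch=gap=0 the score counts matched pairs only, so it equals the length of a longest common subsequence.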

In [30]:
# Worst case running time O(m*n), where m = len(A), n = len(B) 
def bottom_up_Needleman_Wunsch(A,B,match=1,mismatch=-1,gap=-1):
    m = len(A)
    n = len(B)
    p = np.empty((m+1,n+1),dtype=str) # path matrix (np.str was removed in NumPy 1.24; the builtin str works)
    direc = np.empty((m+1,n+1),dtype=str)
    direc[1:,0] = np.array(list(A))
    direc[0,1:] = np.array(list(B))
    l = np.zeros((m+1,n+1),dtype=int)
    l[0,1:] = np.array([gap*i for i in range(1,n+1)],dtype=int)
    l[:,0] = np.array([gap*i for i in range(m+1)],dtype=int)
    for i in range(1,m+1):
        for j in range(1,n+1):
            if A[i-1]==B[j-1]:
                l[i][j] = l[i-1][j-1] + match
                p[i][j] = 'd'
                direc[i][j] = '\u2196'
            else:
                if l[i-1][j-1] + mismatch > max(l[i][j-1] + gap,l[i-1][j] + gap):
                    l[i][j] = l[i-1][j-1] + mismatch
                    p[i][j] = 'd'
                    direc[i][j] = '\u2196'
                else:
                    if l[i][j-1] > l[i-1][j]:
                        l[i][j] = l[i][j-1] + gap
                        p[i][j] = 'l'
                        direc[i][j] = '\u2190'
                    else:
                        l[i][j] = l[i-1][j] + gap
                        p[i][j] = 'u'  
                        direc[i][j] = '\u2191'
    return (l,p,direc)

In [31]:
l,p,d = bottom_up_Needleman_Wunsch(X,Y)

In [32]:
# As we can see the best possible score is 0 (see the rightmost bottom cell value), given that match=1 and 
# mismatch=gap=-1
print(l)

[[ 0 -1 -2 -3 -4 -5 -6 -7]
 [-1  1  0 -1 -2 -3 -4 -5]
 [-2  0  0  1  0 -1 -2 -3]
 [-3 -1 -1  0  2  1  0 -1]
 [-4 -2 -2 -1  1  1  0 -1]
 [-5 -3 -3 -1  0  0  0 -1]
 [-6 -4 -2 -2 -1 -1  1  0]
 [-7 -5 -3 -1 -2 -2  0  0]]


In [33]:
# path
print(d)

[['' 'G' 'C' 'A' 'T' 'G' 'C' 'U']
 ['G' '↖' '←' '←' '←' '↖' '←' '←']
 ['A' '↑' '↖' '↖' '←' '←' '←' '←']
 ['T' '↑' '↑' '↑' '↖' '←' '←' '←']
 ['T' '↑' '↑' '↑' '↖' '↖' '←' '←']
 ['A' '↑' '↑' '↖' '↑' '↑' '↖' '←']
 ['C' '↑' '↖' '↑' '↑' '↑' '↖' '←']
 ['A' '↑' '↑' '↖' '↑' '↑' '↑' '↖']]


In [34]:
l,_,_ = bottom_up_Needleman_Wunsch(A,B,match=1,mismatch=0,gap=0) 
# Note that this is equivalent to just finding the length of an LCS of A and B :) since we take match=1 and mismatch=gap=0

In [36]:
print(l)

[[0 0 0 0 0 0 0]
 [0 0 0 0 1 1 1]
 [0 1 1 1 1 2 2]
 [0 1 1 2 2 2 2]
 [0 1 1 2 2 3 3]
 [0 1 2 2 2 3 3]
 [0 1 2 2 3 3 4]
 [0 1 2 2 3 4 4]]


In [37]:
l,p,d = bottom_up_Needleman_Wunsch(S1,S2)

In [39]:
# we can see that the best possible score is 7 (see the rightmost bottom cell value)
print(l)

[[  0  -1  -2 ..., -45 -46 -47]
 [ -1   1   0 ..., -43 -44 -45]
 [ -2   0   0 ..., -41 -42 -43]
 ..., 
 [-30 -28 -26 ...,   5   7   9]
 [-31 -29 -27 ...,   4   6   8]
 [-32 -30 -28 ...,   5   5   7]]


In [40]:
print(d)

[['' 'G' 'A' ..., 'T' 'A' 'A']
 ['G' '↖' '←' ..., '←' '←' '←']
 ['T' '↑' '↖' ..., '↖' '←' '←']
 ..., 
 ['A' '↑' '↖' ..., '↑' '↖' '↖']
 ['A' '↑' '↖' ..., '↑' '↖' '↖']
 ['T' '↑' '↑' ..., '↖' '↑' '↑']]


In [41]:
# As we can see, the alignment of these two sequences differs from the one we obtained when looking for an LCS,
# because now mismatch=gap=-1 and not 0
find_best_alignment(p,S1,S2,len(S1),len(S2))
str1 = ''.join(l1)
str2 = ''.join(l2)
print(str1)
print(str2)
l1=[]
l2=[]

GT----GAC-TGCGAT--AA--G---C-T-TAGATC--C-TCTTAAAAT
GAGGGAGACATGCGATACAAGGGATCCTTGTAGATCTGCGTCTTT-AA-


### Needleman-Wunsch algorithm for large sequences

We will use http://www.bioinformatics.org/sms2/mutate_dna.html to generate a DNA sequence and mutate it.
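If the web tool is not at hand, test data of a similar kind can be produced locally. The sketch below is only a rough stand-in and makes no claim about how the tool itself works: `mutate_dna`, its `rate` and `seed` parameters, and the substitution-only mutation model are all assumptions made for illustration.

```python
import random

def mutate_dna(seq, rate, seed=0):
    """Replace each base by a different base with probability `rate`.
    Only substitutions are made, so the length of the sequence is preserved."""
    rng = random.Random(seed)  # fixed seed so that the result is reproducible
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in 'ACGT' if b != base]))
        else:
            out.append(base)
    return ''.join(out)

sample = 'ATGTCTGATTCGCTAAATCATCC'
mutated = mutate_dna(sample, rate=0.2)
print(sample)
print(mutated)
```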

In [55]:
# original DNA sequence
Original = 'ATGTCTGATTCGCTAAATCATCCATCGAGTTCTACGGTGCATGCAGATGATGGATTCGAGCCACCAACATCTCCGGAAGACAACAACAAAAAACCGTCTTTAGAACAAATTAAACAGGAAAGAGAAGCGTTGTTTACGGATCTATTCGCAGATCGTCGACGAAGCGCTCGTTCTGTGATTGAAGAAGCTTTCCAAAACGAACTCATGAGTGCTGAACCAGTCCAGCCAAACGTGCCGAATCCACATTCGATTCCCATTCGTTTCCGTCATCAACCAGTTGCTGGACCTGCTCATGATGTTTTCGGAGACGCGGTGCATTCAATTTTTCAAAAAATAATGTCCAGAGGAGTGAACGCGGATTATAGTCATTGGATGTCATATTGGATCGCGTTGGGAATCGACAAAAAAACACAAATGAACTATCATATGAAACCGTTTTGCAAAGATACTTATGCAACTGAAGGCTCCTTAGAAGCGAAACAAACATTTACTGATAAAATCAGGTCAGCTGTTGAGGAAATTATCTGGAAGTCCGCTGAATATTGTGATATTCTTAGCGAGAAGTGGACAGGAATTCATGTGTCGGCCGACCAACTGAAAGGTCAAAGAAATAAGCAAGAAGATCGTTTTGTGGCTTATCCAAATGGACAATACATGAATCGTGGACAGAGTGACATTTCACTTCTTGCGGTGTTCGATGGGCATGGCGGACACGAGTGCTCTCAATATGCAGCTGCTCATTTCTGGGAAGCATGGTCCGATGCTCAACATCATCATTCACAAGATATGAAACTTGACGAACTCCTAGAAAAGGCTCTAGAAACATTGGACGAAAGAATGACAGTCAGAAGTGTTCGAGAATCTTGGAAAGGTGGAACCACTGCTGTCTGCTGTGCTGTTGATTTGAACACTAATCAAATCGCATTTGCCTGGCTTGGAGATTCACCAGGTTACATCATGTCAAACTTGGAGTTCCGCAAATTCACTACTGAACACTCCCCGTCTGACCCGGAGGAATGTCGACGAGTCGAAGAAGTCGGTGGCCAGATTTTTGTGATCGGTGGTGAGCTCCGTGTGAATGGAGTACTCAACCTGACGCGAGCACTAGGAGACGTACCTGGAAGACCAATGATATCCAACAAACCTGATACCTTACTGAAGACGATCGAACCTGCGGATTATCTTGTTTTGTTGGCCTGTGACGGGATTTCTGACGTCTTCAACACTAGTGATTTGTACAATTTGGTTCAGGCTTTTGTCAATGAATATGACGTAGAAGATTATCACGAACTTGCACGCTACATTTGCAATCAAGCAGTTTCAGCTGGAAGTGCTGACAATGTGACAGTAGTTATAGGTTTCCTCCGTCCACCAGAAGACGTTTGGCGTGTAATGAAAACAGACTCGGATGATGAAGAGAGCGAGCTCGAGGAAGAAGATGACAATGAATAG'

In [56]:
print(len(Original))

1452


In [57]:
print(Original)

ATGTCTGATTCGCTAAATCATCCATCGAGTTCTACGGTGCATGCAGATGATGGATTCGAGCCACCAACATCTCCGGAAGACAACAACAAAAAACCGTCTTTAGAACAAATTAAACAGGAAAGAGAAGCGTTGTTTACGGATCTATTCGCAGATCGTCGACGAAGCGCTCGTTCTGTGATTGAAGAAGCTTTCCAAAACGAACTCATGAGTGCTGAACCAGTCCAGCCAAACGTGCCGAATCCACATTCGATTCCCATTCGTTTCCGTCATCAACCAGTTGCTGGACCTGCTCATGATGTTTTCGGAGACGCGGTGCATTCAATTTTTCAAAAAATAATGTCCAGAGGAGTGAACGCGGATTATAGTCATTGGATGTCATATTGGATCGCGTTGGGAATCGACAAAAAAACACAAATGAACTATCATATGAAACCGTTTTGCAAAGATACTTATGCAACTGAAGGCTCCTTAGAAGCGAAACAAACATTTACTGATAAAATCAGGTCAGCTGTTGAGGAAATTATCTGGAAGTCCGCTGAATATTGTGATATTCTTAGCGAGAAGTGGACAGGAATTCATGTGTCGGCCGACCAACTGAAAGGTCAAAGAAATAAGCAAGAAGATCGTTTTGTGGCTTATCCAAATGGACAATACATGAATCGTGGACAGAGTGACATTTCACTTCTTGCGGTGTTCGATGGGCATGGCGGACACGAGTGCTCTCAATATGCAGCTGCTCATTTCTGGGAAGCATGGTCCGATGCTCAACATCATCATTCACAAGATATGAAACTTGACGAACTCCTAGAAAAGGCTCTAGAAACATTGGACGAAAGAATGACAGTCAGAAGTGTTCGAGAATCTTGGAAAGGTGGAACCACTGCTGTCTGCTGTGCTGTTGATTTGAACACTAATCAAATCGCATTTGCCTGGCTTGGAGATTCACCAGGTTACATCATGTCAAACTTGGAGTTCCGCAAATTCACTACTGAACACTCCC

In [58]:
# mutated sequence
Mutated = 'ATGACTGATTTGTTAAACCATCAAACGCGTCCCAGGGTGCATGCAGATGATGGATTGAAGCCACCGAAATTTCCGGTAAACAAGAACAAGAAACCGTCTCTAGAACAAATTAAACAGGAAAAACAGGCGTTGATAACGGTTGTTTTTGTTGGTCGTCGACGGAGCGCTCGTACTAGGATTGGAGACGCTTTCCGAAACGAAATCATGGGTGCAGAACCCGTCTCGCCAAACGTGCCGAAACCACGTTCGAGTCCCATTCGTTTCCGAAATCACACATATACCGTACCTGTTCATGATGTTTTCGGAGAGACGGTGCACTAACTGCTTCAAAAAGTAATGTCCATAGTAGCGAACGTTGGTTATAGTCATTGGATGTCGCATTGGTATGCGATGGGAAGCCACGAAATGATACAAATGAACTATGATATGAAACCGTCTTGCAAAGATACTTAGGCAACTGAAGGCTCCTTAGAATCGAAACAAACATTTATTGAAAAACTCGTGTTAGCTGTTGCGGAAATTAGTAGGAAGTCGGGTGAAAATTGCGAGATTCTTAGATAGCCGTAGACAGGAACTCAAGTTTCGGTCGACCAACTGAAGGGTCAAAGTAATAAGCAAGAATAACGTTTGGTGCCTTATCCGACTGGACCAGACATGTACCGTGGACTGACTGACCTATCACGACTTGCGGTGGCCGATGAGAATGGCGGTCACGAGTCCTCTCAGTCCGCAGCTGCTCCTTTCTGAGAAGCATGGTCCGATACTCGACATCATCATTTTCGAGATTTGAAACTTGACGGACGCCTAGAAAAGGCTCTAGAAAAATTGGACCAGCGAATGACAGTCTGTAGTATTCGAAAACTTGGGAAAGGAGGAACCCCTTCTGTCAGCTGTGCTGTCTATTTGATCACTAATCAAATAACATAGGCCTAGCATGAAGATTCAACAGGTGAGGTGACCTCAAAATTGGAGCTCCGCAAATTCACTATTTAACACTCCCAGTCTGACGCGGAGGAATGTCGAGGAGTCGAATAAGACGGTGATCAGATTTTTGTGATCGGTGGTGAGCTCCGTGTTAATGGGGTACTCAACCTGACGCGTGCACACGGAGACGTGCCTGGCAGCCCAGTGCTATTCAACAAACCTTATACCTTACTGAAGACGATTGAACCTGCGTATTATTTTGTTTTGTTGGCCTCTGACTGGATTTCTGACTTCTTTAGCATTAGTGATCTGTATAGTTTTTTGCGGGCTTTATTCAGCGAATACCACGTAGAAGATTATCAAGAACTTGAACGCTCCATTTTCAAGGAAGTCGTGTCAGTTGGCAGTGCAGACAATGTGACAGTACTTACTAGGCTCCTCTGTCCACCAGATGACGTTTGGCGTGTAATGGAAACAGATTCGTATGATGAAGAAATCGAGGTCGTAGAAGAGGATGAGAATGATTAC'

In [59]:
print(len(Mutated))

1452


In [60]:
print(Mutated)

ATGACTGATTTGTTAAACCATCAAACGCGTCCCAGGGTGCATGCAGATGATGGATTGAAGCCACCGAAATTTCCGGTAAACAAGAACAAGAAACCGTCTCTAGAACAAATTAAACAGGAAAAACAGGCGTTGATAACGGTTGTTTTTGTTGGTCGTCGACGGAGCGCTCGTACTAGGATTGGAGACGCTTTCCGAAACGAAATCATGGGTGCAGAACCCGTCTCGCCAAACGTGCCGAAACCACGTTCGAGTCCCATTCGTTTCCGAAATCACACATATACCGTACCTGTTCATGATGTTTTCGGAGAGACGGTGCACTAACTGCTTCAAAAAGTAATGTCCATAGTAGCGAACGTTGGTTATAGTCATTGGATGTCGCATTGGTATGCGATGGGAAGCCACGAAATGATACAAATGAACTATGATATGAAACCGTCTTGCAAAGATACTTAGGCAACTGAAGGCTCCTTAGAATCGAAACAAACATTTATTGAAAAACTCGTGTTAGCTGTTGCGGAAATTAGTAGGAAGTCGGGTGAAAATTGCGAGATTCTTAGATAGCCGTAGACAGGAACTCAAGTTTCGGTCGACCAACTGAAGGGTCAAAGTAATAAGCAAGAATAACGTTTGGTGCCTTATCCGACTGGACCAGACATGTACCGTGGACTGACTGACCTATCACGACTTGCGGTGGCCGATGAGAATGGCGGTCACGAGTCCTCTCAGTCCGCAGCTGCTCCTTTCTGAGAAGCATGGTCCGATACTCGACATCATCATTTTCGAGATTTGAAACTTGACGGACGCCTAGAAAAGGCTCTAGAAAAATTGGACCAGCGAATGACAGTCTGTAGTATTCGAAAACTTGGGAAAGGAGGAACCCCTTCTGTCAGCTGTGCTGTCTATTTGATCACTAATCAAATAACATAGGCCTAGCATGAAGATTCAACAGGTGAGGTGACCTCAAAATTGGAGCTCCGCAAATTCACTATTTAACACTCCC

In [62]:
print("Please wait...")
l,_,_ = bottom_up_Needleman_Wunsch(Original,Mutated,match=2,mismatch=-1,gap=0)
print("Done!")

Please wait...
Done!


In [63]:
# As we can see, if we use the scores match=2, mismatch=-1, and gap=0, the best score is 2444
print(l)

[[   0    0    0 ...,    0    0    0]
 [   0    2    2 ...,    2    2    2]
 [   0    2    4 ...,    4    4    4]
 ..., 
 [   0    2    4 ..., 2442 2442 2442]
 [   0    2    4 ..., 2442 2444 2444]
 [   0    2    4 ..., 2442 2444 2444]]


In [64]:
print("Please wait...")
l,_,_ = bottom_up_Needleman_Wunsch(Original,Mutated,match=1,mismatch=-2,gap=-1)
print("Done!")

Please wait...
Done!


In [65]:
# And if we use the scores match=1, mismatch=-2, and gap=-1, the best score is 762
print(l)

[[    0    -1    -2 ..., -1450 -1451 -1452]
 [   -1     1     0 ..., -1448 -1449 -1450]
 [   -2     0     2 ..., -1446 -1447 -1448]
 ..., 
 [-1450 -1448 -1446 ...,   763   762   761]
 [-1451 -1449 -1447 ...,   762   764   763]
 [-1452 -1450 -1448 ...,   761   763   762]]
