To generalize the alignment scoring model, we still award +1 for matches, but we also penalize mismatches by some positive constant µ (the mismatch penalty) and indels by some positive constant σ (the indel penalty). As a result, the score of an alignment is equal to

num matches - µ · num mismatches - σ · num indels
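The formula above can be checked on a small alignment. The following is a minimal sketch, assuming both aligned strings have equal length and use '-' as the gap symbol; the helper name `simple_alignment_score` and the example strings are hypothetical and not part of the original notebook.

```python
def simple_alignment_score(aligned_v, aligned_w, mu, sigma):
    """Score = num matches - mu * num mismatches - sigma * num indels."""
    score = 0
    for a, b in zip(aligned_v, aligned_w):
        if a == '-' or b == '-':
            score -= sigma   # a column containing a gap is an indel
        elif a == b:
            score += 1       # match
        else:
            score -= mu      # mismatch
    return score

# 4 matches, 2 mismatches, 3 indels with mu = 1, sigma = 2: 4 - 2*1 - 3*2
print(simple_alignment_score("AT-GTTATA", "ATCGT-C-C", mu=1, sigma=2))  # -4
```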

Biologists have further refined this cost function to allow for the fact that some mutations may be more likely than others, which calls for mismatch and indel penalties that differ depending on the specific symbols involved. If we penalize more likely mutations less and less likely mutations more, the best-scoring alignment will contain fewer unlikely mutations and more likely ones. We will extend the k-letter alphabet to include the space symbol and then construct a (k + 1) x (k + 1) scoring matrix Score holding the score of aligning every pair of symbols. The scoring matrix for comparing DNA sequences (k = 4) when all mismatches are penalized by µ and all indels are penalized by σ is shown below.

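|   | A  | C  | G  | T  | -  |
|---|----|----|----|----|----|
| A | +1 | -µ | -µ | -µ | -σ |
| C | -µ | +1 | -µ | -µ | -σ |
| G | -µ | -µ | +1 | -µ | -σ |
| T | -µ | -µ | -µ | +1 | -σ |
| - | -σ | -σ | -σ | -σ |    |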
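A scoring matrix of this shape can be built programmatically. The sketch below stores it as a dictionary keyed by symbol pairs rather than as a 2D array; the function name `build_score_matrix` is hypothetical, and leaving the gap/gap entry undefined is an assumption that mirrors the empty corner of the matrix.

```python
def build_score_matrix(alphabet, mu, sigma):
    """Return a dict mapping (symbol, symbol) to a score over alphabet + '-'."""
    symbols = list(alphabet) + ['-']
    score = {}
    for a in symbols:
        for b in symbols:
            if a == '-' and b == '-':
                continue                  # gap against gap is left undefined
            if a == '-' or b == '-':
                score[(a, b)] = -sigma    # indel
            elif a == b:
                score[(a, b)] = 1         # match
            else:
                score[(a, b)] = -mu       # mismatch
    return score

score = build_score_matrix("ACGT", mu=1, sigma=2)
print(score[('A', 'A')], score[('A', 'C')], score[('G', '-')])  # 1 -1 -2
```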

To solve the Global Alignment Problem, we still must find a longest path in the alignment graph after updating the edge weights to reflect the values in the scoring matrix.

Recalling that deletions correspond to vertical edges, insertions correspond to horizontal edges, and matches/mismatches correspond to diagonal edges, we obtain the following recurrence for $s_{i,j}$, the length of a longest path from (0, 0) to (i, j):

$$
s_{i,j} = \max
\begin{cases}
s_{i-1,j} + \mathit{Score}(v_i, -) \\
s_{i,j-1} + \mathit{Score}(-, w_j) \\
s_{i-1,j-1} + \mathit{Score}(v_i, w_j)
\end{cases}
$$
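The recurrence can be turned into a table-filling routine almost line by line. This is a minimal sketch under stated assumptions: the scoring matrix is a dictionary keyed by symbol pairs with '-' as the gap symbol, and the names `global_alignment_score` and `score` are hypothetical. It returns only the score of the best alignment, not the alignment itself.

```python
def global_alignment_score(v, w, score):
    """Length of a longest path from (0, 0) to (len(v), len(w))."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                      # first column: deletions only
        s[i][0] = s[i - 1][0] + score[(v[i - 1], '-')]
    for j in range(1, m + 1):                      # first row: insertions only
        s[0][j] = s[0][j - 1] + score[('-', w[j - 1])]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(
                s[i - 1][j] + score[(v[i - 1], '-')],           # vertical edge
                s[i][j - 1] + score[('-', w[j - 1])],           # horizontal edge
                s[i - 1][j - 1] + score[(v[i - 1], w[j - 1])],  # diagonal edge
            )
    return s[n][m]

# Illustrative DNA scoring with mu = 1 and sigma = 2
mu, sigma = 1, 2
score = {}
for a in "ACGT":
    score[(a, '-')] = -sigma
    score[('-', a)] = -sigma
    for b in "ACGT":
        score[(a, b)] = 1 if a == b else -mu

print(global_alignment_score("ACGT", "ACGT", score))  # 4
```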

When the match reward is +1, the mismatch penalty is µ, and the indel penalty is σ, the alignment recurrence can be written as follows:

$$
s_{i,j} = \max
\begin{cases}
s_{i-1,j} - \sigma \\
s_{i,j-1} - \sigma \\
s_{i-1,j-1} + 1, & \text{if } v_i = w_j \\
s_{i-1,j-1} - \mu, & \text{if } v_i \neq w_j
\end{cases}
$$

BLOSUM62 is a commonly used scoring matrix for aligning protein strings. It assigns different scores to amino acid substitutions depending on the particular pair of amino acids involved; the scores were computed from the relative frequency with which one amino acid is substituted for another in a collection of known alignments. In other words, less frequent substitutions are penalized more and more frequent substitutions are penalized less.

The scoring matrix is "symmetric", which in this case means that the score assigned to substituting symbol x for symbol y is the same as that of substituting y for x.
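The symmetry property is easy to verify for any square scoring matrix. The sketch below uses a small 3 x 3 matrix with illustrative values only, not the full BLOSUM62 table; the helper name `is_symmetric` is hypothetical. Because it relies only on `len` and `matrix[i][j]` indexing, it should also accept a 2D NumPy array.

```python
def is_symmetric(matrix):
    """True if the matrix is square and matrix[i][j] == matrix[j][i] for all i, j."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        return False
    return all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(i))

# Illustrative scores for three symbols
toy_matrix = [[ 4,  0, -2],
              [ 0,  9, -3],
              [-2, -3,  6]]
print(is_symmetric(toy_matrix))  # True
```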

In [11]:
import numpy as np

In [102]:
#read the BLOSUM62 scores: one matrix row per line, integers separated by whitespace
with open('/content/blosum_62.txt') as task_file:
  blosum_62_matrix_values = [[int(token) for token in line.split()] for line in task_file if line.strip()]
blosum_62_matrix = np.array(blosum_62_matrix_values)
assert blosum_62_matrix.shape == (20,20) #one row and one column per amino acid

In [103]:
def IndexToAminoacid(aminoacids,aminoacid):
  return aminoacids.index(aminoacid) #row/column index of the amino acid in the scoring matrix, i.e. its position in the aminoacids list

In [1]:
def LongestCommonSubsequenceBacktrack(v,w,sigma):
  #s[i][j] is the score of the best alignment of v[:i] with w[:j]; match reward +1 and mismatch penalty 1 are hard-coded
  s = np.full((len(v)+1,len(w)+1),-np.inf)
  backtrack = np.full((len(v)+1,len(w)+1),'')
  aminoacids = ['A','C','D','E','F','G','H','I','K','L','M','N','P','Q','R','S','T','V','W','Y'] #only used by the commented-out BLOSUM62 variant below
  s[0][0] = 0 #the longest path from source to source has length 0
  for i in range(1,len(v)+1):
    s[i][0] = i*(-sigma) #first column can only be reached by deletions
  for j in range(1,len(w)+1):
    s[0][j] = j*(-sigma) #first row can only be reached by insertions
  for i in range(1,len(v)+1):
    for j in range(1,len(w)+1):
      #BLOSUM62 variant of the diagonal term:
      #diagonal = s[i-1][j-1] + blosum_62_matrix[IndexToAminoacid(aminoacids,v[i-1])][IndexToAminoacid(aminoacids,w[j-1])]
      diagonal = s[i-1][j-1] + (1 if v[i-1] == w[j-1] else -1)
      s[i][j] = max(s[i-1][j] - sigma,s[i][j-1] - sigma,diagonal)
      if s[i][j] == s[i-1][j] - sigma:
        backtrack[i][j] = 'D' #we came to node (i,j) from node (i-1,j) --> deletion
      elif s[i][j] == s[i][j-1] - sigma:
        backtrack[i][j] = 'R' #we came to node (i,j) from node (i,j-1) --> insertion
      elif v[i-1] == w[j-1]:
        backtrack[i][j] = 'M' #we came to node (i,j) from node (i-1,j-1) --> match
      else:
        backtrack[i][j] = 'L' #we came to node (i,j) from node (i-1,j-1) --> mismatch
  return backtrack

In [2]:
def OutputLongestCommonSubsequence(backtrack,w,i,j,f):
  if i == 0 or j == 0: #border of the graph: from the first row only horizontal edges lead back to the source, from the first column only vertical edges, so no diagonal edge can follow
    if i == 0:
      f.write(j*'R')
    else: #j=0
      f.write(i*'D')
    return
  if backtrack[i][j] == 'D':
    f.write('D')
    OutputLongestCommonSubsequence(backtrack,w,i-1,j,f) #we came to node (i,j) from node (i-1,j)
  elif backtrack[i][j] == 'R':
    f.write('R')
    OutputLongestCommonSubsequence(backtrack,w,i,j-1,f) #we came to node (i,j) from node (i,j-1)
  elif backtrack[i][j] == 'M':
    f.write('M')
    OutputLongestCommonSubsequence(backtrack,w,i-1,j-1,f) #we came to node (i,j) from node (i-1,j-1)
  elif backtrack[i][j] == 'L':
    f.write('L')
    OutputLongestCommonSubsequence(backtrack,w,i-1,j-1,f) #we came to node (i,j) from node (i-1,j-1)
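For long strings the recursive walk can run deeper than Python's default recursion limit. One alternative is to walk the backtrack matrix in a loop. The sketch below assumes the same 'D', 'R', 'M', 'L' labels as above; the function name `backtrack_path_iterative` is hypothetical, and unlike the recursive version it returns the path as a string, already in source-to-sink order, instead of writing it to a file.

```python
def backtrack_path_iterative(backtrack, i, j):
    """Walk from node (i, j) back to (0, 0) and return the moves from source to sink."""
    path = []
    while i > 0 and j > 0:
        move = backtrack[i][j]
        path.append(move)
        if move == 'D':              # came from (i-1, j)
            i -= 1
        elif move == 'R':            # came from (i, j-1)
            j -= 1
        elif move in ('M', 'L'):     # came from (i-1, j-1)
            i -= 1
            j -= 1
        else:
            raise ValueError(f"unexpected backtrack label {move!r} at ({i}, {j})")
    # on the border only one kind of edge leads back to the source
    path.extend('R' * j if i == 0 else 'D' * i)
    return ''.join(reversed(path))

# Hand-built backtrack matrix for v = 'AB', w = 'B': delete A, then match B
backtrack_example = [['', ''],
                     ['', 'L'],
                     ['', 'M']]
print(backtrack_path_iterative(backtrack_example, 2, 1))  # DM
```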

deletions --> vertical edges 

insertions --> horizontal edges

matches/mismatches --> diagonal edges

insertion --> gap symbol in v

deletion --> gap symbol in w

In [3]:
def ReturnAlignment(backtrack_sequence,v,w):
  aligned_v = str()
  aligned_w = str()
  pointer_v = 0 #start of string v
  pointer_w = 0 #start of string w
  for i in range(len(backtrack_sequence)):
    if backtrack_sequence[i] == 'D': #deletion --> symbol of v aligned with a gap in w
      aligned_w = aligned_w + '-'
      aligned_v = aligned_v + v[pointer_v]
      pointer_v = pointer_v + 1
    elif backtrack_sequence[i] == 'R': #insertion --> symbol of w aligned with a gap in v
      aligned_v = aligned_v + '-'
      aligned_w = aligned_w + w[pointer_w]
      pointer_w = pointer_w + 1
    elif backtrack_sequence[i] == 'M' or backtrack_sequence[i] == 'L': #match or mismatch --> symbols of v and w aligned with each other
      aligned_v = aligned_v + v[pointer_v]
      aligned_w = aligned_w + w[pointer_w]
      pointer_v = pointer_v + 1
      pointer_w = pointer_w + 1
  return aligned_v,aligned_w

In [4]:
def AlignmentScore(aligned_v,aligned_w,blosum_62_matrix,aminoacids,sigma):
  alignment_score = 0
  for i in range(len(aligned_v)): #aligned_v and aligned_w have the same length
    if aligned_v[i] == '-' or aligned_w[i] == '-':
      alignment_score = alignment_score - sigma
    else:
      alignment_score = alignment_score + blosum_62_matrix[IndexToAminoacid(aminoacids,aligned_v[i])][IndexToAminoacid(aminoacids,aligned_w[i])]
  return alignment_score

In [38]:
def EditDistance(aligned_w,aligned_v): #counts the edit operations (substitution, insertion, deletion) implied by the given alignment; this equals the edit distance between v and w only if the alignment minimizes mismatches + indels
  edit_distance = 0
  for i in range(len(aligned_w)):
   if aligned_w[i] == '-' or aligned_v[i] == '-': #deletion or insertion
     edit_distance = edit_distance + 1
   elif aligned_v[i] != '-' and aligned_w[i] != '-' and aligned_w[i] != aligned_v[i]: #mismatch --> substitution
     edit_distance = edit_distance + 1
  return edit_distance
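As a cross-check, the edit distance can also be computed directly with a dynamic program that minimizes the number of operations, without building an alignment first. The following is a minimal sketch of that standard approach; the function name `edit_distance_dp` is hypothetical. The check is worthwhile because an alignment that maximizes a score with a +1 match reward does not always minimize the number of edits: for 'XXXAB' and 'ABYYY', five substitutions suffice, while the best-scoring alignment under +1 / -1 / -1 scoring uses six indels to gain two matches.

```python
def edit_distance_dp(v, w):
    """Minimum number of substitutions, insertions and deletions turning v into w."""
    previous = list(range(len(w) + 1))     # distances from the empty prefix of v
    for i in range(1, len(v) + 1):
        current = [i]                      # i deletions turn v[:i] into ''
        for j in range(1, len(w) + 1):
            substitution = previous[j - 1] + (v[i - 1] != w[j - 1])
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))        # match (cost 0) or mismatch (cost 1)
        previous = current
    return previous[len(w)]

print(edit_distance_dp("PLEASANTLY", "MEANLY"))  # 5
```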

In [144]:
v = 'PLEASANTLY'

In [145]:
w = 'MEANLY'

In [130]:
#sigma = 5

In [55]:
sigma = 1

In [56]:
backtrack = LongestCommonSubsequenceBacktrack(v,w,sigma)

In [70]:
#backtrack = LongestCommonSubsequenceBacktrack(v,w,sigma,blosum_62_matrix)

In [57]:
with open("task_result.txt","w") as f:
  OutputLongestCommonSubsequence(backtrack,w,len(v),len(w),f)
with open('task_result.txt') as task_file:
  backtrack_sequence = task_file.readline().rstrip()[::-1] #the path was written from sink to source, so reverse it
aligned_v,aligned_w = ReturnAlignment(backtrack_sequence,v,w)
with open("task_result.txt","w") as f:
  f.write(str(EditDistance(aligned_w,aligned_v)))

In [58]:
with open('/content/rosalind_ba5g_2_dataset.txt') as task_file:
  task_arguments = [line.rstrip() for line in task_file]

In [59]:
v = task_arguments[0]

In [60]:
w = task_arguments[1]

In [61]:
sigma = 1

In [62]:
backtrack = LongestCommonSubsequenceBacktrack(v,w,sigma)

In [63]:
import sys
sys.setrecursionlimit(10000) #the recursive backtracking can go as deep as len(v) + len(w) calls
with open("task_result.txt","w") as f:
  OutputLongestCommonSubsequence(backtrack,w,len(v),len(w),f)
with open('task_result.txt') as task_file:
  backtrack_sequence = task_file.readline().rstrip()[::-1] #the path was written from sink to source, so reverse it
aligned_v,aligned_w = ReturnAlignment(backtrack_sequence,v,w)
with open("task_result.txt","w") as f:
  f.write(str(EditDistance(aligned_w,aligned_v)))