Analysis of homeobox genes offers an example of a problem for which global alignment may fail to reveal biologically relevant similarities. Homeobox genes are long, and they differ greatly between species, but an approximately 60 amino acid-long region in each gene, called the homeodomain, is highly conserved. The immediate question is how to find this conserved segment within the much longer genes and ignore the flanking areas, which exhibit little similarity. Global alignment seeks similarities between two strings across their entire length; however, when searching for homeodomains, we are looking for smaller, local regions of similarity and do not need to align the entire strings. --> so, when we look for homeodomains in homeobox genes, we are looking for smaller, local regions of the homeobox genes and do not need to align the entire homeobox genes (e.g. the human and mouse homeobox genes) in order to find the homeodomains --> it makes no sense to look for a global alignment between the human and mouse homeobox genes, because global alignment seeks similarities between 2 strings across their entire length and the homeobox genes differ overall; they are similar mainly in the homeodomains, which are conserved across species

When biologically significant similarities are present in some parts of sequences v and w and absent from others, biologists attempt to ignore global alignment and instead align substrings of v and w, which yields a local alignment of the two strings. The problem of finding substrings that maximize the global alignment score over all substrings of v and w is called the Local Alignment Problem. The straightforward way to solve the Local Alignment Problem is to find the longest path connecting every pair of nodes in the alignment graph (rather than just those connecting the source and sink, as in the Global Alignment Problem), and then to select the path having maximum weight over all these longest paths.
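
The straightforward formulation above can be sketched directly, which may be useful as a reference for checking faster methods on tiny inputs. This is only a sketch under stated assumptions: it uses an illustrative scoring (match +1, mismatch -1, indel penalty 1) rather than a real scoring matrix, and `global_score` and `brute_force_local_score` are hypothetical helper names, not part of the notebook.

```python
def global_score(a, b, match=1, mismatch=-1, sigma=1):
    """Global alignment score of a and b (illustrative scoring, an assumption)."""
    prev = [-sigma * j for j in range(len(b) + 1)]  # row 0: only insertions
    for i in range(1, len(a) + 1):
        cur = [-sigma * i]                          # column 0: only deletions
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(prev[j] - sigma, cur[j - 1] - sigma, diag))
        prev = cur
    return prev[-1]

def brute_force_local_score(v, w):
    """Maximum global alignment score over all pairs of substrings of v and w."""
    best = 0  # the pair of empty substrings scores 0
    for i in range(len(v)):
        for k in range(i + 1, len(v) + 1):
            for j in range(len(w)):
                for l in range(j + 1, len(w) + 1):
                    best = max(best, global_score(v[i:k], w[j:l]))
    return best

print(brute_force_local_score("MEANLY", "PENALTY"))
```

Every pair of substrings is aligned globally, so the running time grows roughly with the cube of |v|·|w|; the sketch is practical only for very short strings.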

For a faster local alignment approach, imagine a “free taxi ride” from the source (0,0) to the start node of the conserved (red) interval, and another free taxi ride from the end node of the conserved interval to the sink. With such rides available, you could reach the start node of the conserved interval at no cost instead of incurring heavy penalties as in global alignment. You could then travel along the conserved interval to its end node, accumulating positive match scores, and finally ride from that end node to the sink, again at no cost. The score of the resulting path equals the alignment score of the conserved interval alone, as desired. Connecting the source (0,0) to every other node by a zero-weight edge and connecting every node to the sink (n,m) by a zero-weight edge yields a DAG perfectly suited to solving the Local Alignment Problem. Because of the free taxi rides, we no longer need to construct a longest path between every pair of nodes in the graph: the longest path from source to sink yields an optimal local alignment!
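
As a rough count of what the free rides add, write n = |v| and m = |w| (following the sink label (n,m) above), so that the alignment graph has (n+1)(m+1) nodes. The last term below is an upper bound, since the source-to-sink edge is counted twice; E is just a name for the total number of edges.

$$
E \le \underbrace{n(m+1)}_{\text{vertical}} + \underbrace{(n+1)m}_{\text{horizontal}} + \underbrace{nm}_{\text{diagonal}} + \underbrace{2\big((n+1)(m+1)-1\big)}_{\text{zero-weight}} = 5nm + 3n + 3m
$$

The zero-weight edges therefore change only the constant factor, not the order of growth in n·m.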

The total number of edges in the graph is O(|v|·|w|) (no condition on the relative lengths of v and w is needed), which is still small. Since the runtime of finding a longest path in a DAG is determined by the number of edges in the graph, the resulting local alignment algorithm will be fast. As for computing the values $s_{i,j}$, adding zero-weight edges from (0,0) to every node has made the source node (0,0) a predecessor of every node (i,j). Therefore, there are now four edges entering (i,j), which adds only one new term to the longest path recurrence relation:
$$
s_{i,j} = \max
\begin{cases}
0 \\
s_{i-1,j} + \text{Score}(v_i, -) \\
s_{i,j-1} + \text{Score}(-, w_j) \\
s_{i-1,j-1} + \text{Score}(v_i, w_j)
\end{cases}
$$
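
The four-term recurrence can be sketched in a few lines. The sketch below is a minimal illustration, not the notebook's implementation: `local_alignment_score` is a hypothetical name, and `score` and `sigma` are caller-supplied assumptions (a substitution score and a positive indel penalty, so that Score(v_i, -) = Score(-, w_j) = -sigma).

```python
def local_alignment_score(v, w, score, sigma):
    """Return (best score, end node) computed with the four-term recurrence."""
    n, m = len(v), len(w)
    # Row 0 and column 0 stay 0: the source reaches those nodes for free.
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_node = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(
                0,                                            # free ride from the source
                s[i - 1][j] - sigma,                          # vertical edge
                s[i][j - 1] - sigma,                          # horizontal edge
                s[i - 1][j - 1] + score(v[i - 1], w[j - 1]),  # diagonal edge
            )
            if s[i][j] > best:  # the free ride to the sink starts at the best node
                best, best_node = s[i][j], (i, j)
    return best, best_node

def simple(a, b):
    """Illustrative scoring (an assumption): +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

print(local_alignment_score("GATTACA", "TTAC", simple, 1))  # (4, (6, 4))
```

Tracking the best node during the fill plays the role of the zero-weight edges into the sink: the optimal local alignment ends wherever the table attains its maximum.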


In [376]:
import numpy as np

In [574]:
#each line of the file holds one row of 20 whitespace-separated integer scores
#rows and columns are assumed to follow the order of the aminoacids list used below (A, C, D, ..., Y)
with open('/content/pam_250.txt') as task_file:
  pam_250_matrix_values = [[int(token) for token in line.split()] for line in task_file if line.strip()]
pam_250_matrix = np.array(pam_250_matrix_values)

In [378]:
def IndexToAminoacid(aminoacids,aminoacid):
  return aminoacids.index(aminoacid) #returns the row/column index of the given amino acid in the scoring matrix (despite the name, the mapping goes from amino acid to index)

The sink node must be connected to all other nodes.

In [573]:
def ArgMax(s):
  return np.unravel_index(np.argmax(s),s.shape)

In [579]:
def LongestCommonSubsequenceBacktrack(v,w,sigma,pam_250_matrix):
  #s[i][j] is the score of the best local alignment ending at node (i,j); row 0 and column 0 stay 0
  s = np.zeros((len(v)+1,len(w)+1))
  #'N' (new zero-weight edge from the source) is the default, so node (0,0) and the borders stop the backtracking
  backtrack = np.full((len(v)+1,len(w)+1),'N')
  aminoacids = ['A','C','D','E','F','G','H','I','K','L','M','N','P','Q','R','S','T','V','W','Y']
  for i in range(1,len(v)+1):
    for j in range(1,len(w)+1):
      #the same PAM250 lookup scores both matches and mismatches
      diagonal = s[i-1][j-1] + pam_250_matrix[IndexToAminoacid(aminoacids,v[i-1])][IndexToAminoacid(aminoacids,w[j-1])]
      s[i][j] = max(0,s[i-1][j] - sigma,s[i][j-1] - sigma,diagonal)
      if s[i][j] == s[i-1][j] - sigma:
        backtrack[i][j] = 'D' #we came to node (i,j) from node (i-1,j) --> deletion
      elif s[i][j] == s[i][j-1] - sigma:
        backtrack[i][j] = 'R' #we came to node (i,j) from node (i,j-1) --> insertion
      elif s[i][j] == diagonal:
        backtrack[i][j] = 'M' if v[i-1] == w[j-1] else 'L' #we came to node (i,j) from node (i-1,j-1) --> match (M) or mismatch (L)
      #otherwise s[i][j] == 0: we came to node (i,j) from the source node and backtrack[i][j] stays 'N'
  s[len(v)][len(w)] = np.amax(s) #all nodes are connected with the sink node by zero-weight edges
  backtrack_starting_node = ArgMax(s) #first node that attains the maximum score
  return backtrack,backtrack_starting_node

In [583]:
def OutputLongestCommonSubsequence(backtrack,w,i,j,f):
  #walk back from node (i,j), writing one letter per edge, until a node reached by a free ride from the source
  #a loop is used instead of recursion so that long sequences cannot exceed the recursion limit
  while backtrack[i][j] in ('D','R','M','L'):
    f.write(backtrack[i][j])
    if backtrack[i][j] == 'D':
      i = i - 1 #we came to node (i,j) from node (i-1,j)
    elif backtrack[i][j] == 'R':
      j = j - 1 #we came to node (i,j) from node (i,j-1)
    else:
      i,j = i - 1,j - 1 #we came to node (i,j) from node (i-1,j-1)
  f.write('N') #we came to node (i,j) from the source node; 'N' always terminates the sequence

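deletions --> vertical edges ('D'): a symbol of v is aligned with a gap '-' placed in w

insertions --> horizontal edges ('R'): a symbol of w is aligned with a gap '-' placed in v

matches/mismatches --> diagonal edges ('M'/'L'): a symbol of v is aligned with a symbol of w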

In [621]:
def ReturnAlignment(backtrack_sequence,v,w,backtrack_starting_node):
  #backtrack_sequence lists the edges from the best-scoring node back toward the source and ends with 'N'
  aligned_v = str()
  aligned_w = str()
  pointer_v = backtrack_starting_node[0] - 1 #node (i,j) corresponds to symbols v[i-1] and w[j-1]
  pointer_w = backtrack_starting_node[1] - 1
  i = 0
  while i < len(backtrack_sequence) and backtrack_sequence[i] != 'N':
    if backtrack_sequence[i] == 'D': #deletion --> gap in w
      aligned_w = aligned_w + '-'
      aligned_v = aligned_v + v[pointer_v]
      pointer_v = pointer_v - 1
    elif backtrack_sequence[i] == 'R': #insertion --> gap in v
      aligned_v = aligned_v + '-'
      aligned_w = aligned_w + w[pointer_w]
      pointer_w = pointer_w - 1
    elif backtrack_sequence[i] == 'M' or backtrack_sequence[i] == 'L': #match or mismatch --> symbols of both v and w
      aligned_v = aligned_v + v[pointer_v]
      aligned_w = aligned_w + w[pointer_w]
      pointer_v = pointer_v - 1
      pointer_w = pointer_w - 1
    i = i + 1
  return aligned_v[::-1],aligned_w[::-1] #the alignment was built backwards, so reverse both rows

In [622]:
def AlignmentScore(aligned_v,aligned_w,pam_250_matrix,aminoacids,sigma):
  alignment_score = 0
  for i in range(len(aligned_v)): #aligned_v and aligned_w have the same length
    if aligned_v[i] == '-' or aligned_w[i] == '-':
      alignment_score = alignment_score - sigma
    else:
      alignment_score = alignment_score + pam_250_matrix[IndexToAminoacid(aminoacids,aligned_v[i])][IndexToAminoacid(aminoacids,aligned_w[i])]
  return alignment_score

In [623]:
v = 'MEANLY'

In [624]:
w = 'PENALTY'

In [627]:
# v = 'KYVILIVGN'

In [628]:
# w = 'DDVISLIVPL'

In [629]:
sigma = 5

In [630]:
backtrack,backtrack_starting_node = LongestCommonSubsequenceBacktrack(v,w,sigma,pam_250_matrix)
#write the backtracking letters (ending in 'N') to a file, then read them back as one string
with open("task_result.txt","w") as f:
  OutputLongestCommonSubsequence(backtrack,w,backtrack_starting_node[0],backtrack_starting_node[1],f)
with open('task_result.txt') as task_file:
  backtrack_sequence = task_file.readline().rstrip() #path from the best-scoring node back to the source; the zero-weight edges to the sink let the path end at whichever node scores highest
aligned_v,aligned_w = ReturnAlignment(backtrack_sequence,v,w,backtrack_starting_node) #returns alignment with highest score
aminoacids = ['A','C','D','E','F','G','H','I','K','L','M','N','P','Q','R','S','T','V','W','Y']
alignment_score = AlignmentScore(aligned_v,aligned_w,pam_250_matrix,aminoacids,sigma)
with open("task_result.txt","w") as f:
  f.write(str(alignment_score) + '\n')
  f.write(aligned_v + '\n')
  f.write(aligned_w)

In [631]:
with open('/content/rosalind_ba5f.txt') as task_file:
  task_arguments = [line.rstrip() for line in task_file]

In [632]:
v = task_arguments[0]

In [633]:
w = task_arguments[1]

In [634]:
sigma = 5

In [636]:
import sys
sys.setrecursionlimit(10000)
backtrack,backtrack_starting_node = LongestCommonSubsequenceBacktrack(v,w,sigma,pam_250_matrix)
#write the backtracking letters (ending in 'N') to a file, then read them back as one string
with open("task_result.txt","w") as f:
  OutputLongestCommonSubsequence(backtrack,w,backtrack_starting_node[0],backtrack_starting_node[1],f)
with open('task_result.txt') as task_file:
  backtrack_sequence = task_file.readline().rstrip() #path from the best-scoring node back to the source; the zero-weight edges to the sink let the path end at whichever node scores highest
aligned_v,aligned_w = ReturnAlignment(backtrack_sequence,v,w,backtrack_starting_node) #returns alignment with highest score
aminoacids = ['A','C','D','E','F','G','H','I','K','L','M','N','P','Q','R','S','T','V','W','Y']
alignment_score = AlignmentScore(aligned_v,aligned_w,pam_250_matrix,aminoacids,sigma)
with open("task_result.txt","w") as f:
  f.write(str(alignment_score) + '\n')
  f.write(aligned_v + '\n')
  f.write(aligned_w)