Even after read breaking, most assemblies still have gaps in k-mer coverage, which leaves the de Bruijn graph with missing edges, and so the search for an Eulerian path fails --> with missing edges the de Bruijn graph is no longer strongly connected --> Euler’s Theorem: every balanced, strongly connected directed graph is Eulerian, so the theorem no longer guarantees a solution --> an Eulerian graph is a graph that has an Eulerian cycle
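The two conditions of Euler's Theorem can be checked directly on an adjacency dictionary. The following is a minimal sketch, assuming the graph is stored as a dict from node to list of successors (the same shape that DefineGraphDict returns); the names is_balanced, reachable, is_strongly_connected and is_eulerian are hypothetical helpers, not part of the notebook.

```python
from collections import defaultdict

def is_balanced(adjacency):
    """Return True if every node has in-degree equal to out-degree."""
    in_degree = defaultdict(int)
    out_degree = defaultdict(int)
    for node, neighbours in adjacency.items():
        out_degree[node] += len(neighbours)
        for neighbour in neighbours:
            in_degree[neighbour] += 1
    nodes = set(adjacency) | set(in_degree)
    return all(in_degree[n] == out_degree[n] for n in nodes)

def reachable(adjacency, start):
    """Return the set of nodes reachable from start (iterative DFS)."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour in adjacency.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen

def is_strongly_connected(adjacency):
    """Return True if every node can reach, and be reached from, one start node."""
    nodes = set(adjacency)
    for neighbours in adjacency.values():
        nodes.update(neighbours)
    if not nodes:
        return True
    # Build the reversed graph to test reachability in the other direction.
    reverse = defaultdict(list)
    for node, neighbours in adjacency.items():
        for neighbour in neighbours:
            reverse[neighbour].append(node)
    start = next(iter(nodes))
    return reachable(adjacency, start) == nodes and reachable(reverse, start) == nodes

def is_eulerian(adjacency):
    """Euler's Theorem: balanced and strongly connected implies Eulerian."""
    return is_balanced(adjacency) and is_strongly_connected(adjacency)

print(is_eulerian({1: [2], 2: [3], 3: [1]}))                      # a simple cycle
print(is_eulerian({1: [2], 2: [3], 3: [4, 5], 6: [7], 7: [6]}))  # graph with gaps
```

The second graph is the sample graph used in the cells further down; it is neither balanced nor strongly connected, which is the situation in which contigs are assembled instead.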

In this case, biologists often settle on assembling contigs (long, contiguous segments of the genome) rather than entire chromosomes. For example, a typical bacterial sequencing project may result in about a hundred contigs, ranging in length from a few thousand to a few hundred thousand nucleotides. For most genomes, the order of these contigs along the genome remains unknown. Needless to say, biologists would prefer to have the entire genomic sequence, but the cost of ordering the contigs into a final assembly and closing the gaps using more expensive experimental methods is often prohibitive. 

Fortunately, we can derive contigs from the de Bruijn graph. A path in a graph is called non-branching if IN(v) = OUT(v) = 1 for each intermediate node v of this path, i.e., for each node except possibly the starting and ending node of a path. A maximal non-branching path is a non-branching path that cannot be extended into a longer non-branching path. We are interested in these paths because the strings of nucleotides that they spell out must be present in any assembly with a given k-mer composition. For this reason, contigs correspond to strings spelled by maximal non-branching paths in the de Bruijn graph.
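The definition above suggests a direct procedure: start a path at every node that is not 1-in-1-out, extend it while the last node is 1-in-1-out, and then collect isolated cycles separately. The following is a sketch of that idea under stated assumptions, not the notebook's own implementation further down: the graph is assumed to be a dict from node to list of successors with sortable node labels, and maximal_non_branching_paths and is_one_in_one_out are hypothetical names.

```python
from collections import defaultdict

def maximal_non_branching_paths(adjacency):
    """Return maximal non-branching paths as lists of nodes."""
    in_degree = defaultdict(int)
    out_degree = defaultdict(int)
    for node, neighbours in adjacency.items():
        out_degree[node] += len(neighbours)
        for neighbour in neighbours:
            in_degree[neighbour] += 1
    nodes = set(adjacency) | set(in_degree)

    def is_one_in_one_out(node):
        return in_degree[node] == 1 and out_degree[node] == 1

    paths = []
    visited = set()  # 1-in-1-out nodes already placed on some path
    # Paths that start at a branching node, a source, or a sink.
    for node in sorted(nodes):
        if is_one_in_one_out(node):
            continue
        for neighbour in adjacency.get(node, []):
            path = [node, neighbour]
            while is_one_in_one_out(path[-1]):
                visited.add(path[-1])
                path.append(adjacency[path[-1]][0])
            paths.append(path)
    # Remaining 1-in-1-out nodes can only lie on isolated cycles.
    for node in sorted(nodes):
        if is_one_in_one_out(node) and node not in visited:
            cycle = [node]
            visited.add(node)
            current = adjacency[node][0]
            while current != node:
                visited.add(current)
                cycle.append(current)
                current = adjacency[current][0]
            cycle.append(node)
            paths.append(cycle)
    return paths

sample = {1: [2], 2: [3], 3: [4, 5], 6: [7], 7: [6]}
for path in maximal_non_branching_paths(sample):
    print(' -> '.join(str(node) for node in path))
```

Because every path starts at a node that cannot be an intermediate node, no path found this way is a fragment of another, so no containment check is needed afterwards.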

contigs --> long, contiguous segments of the genome

non-branching path --> a path in the graph is non-branching if In(v) = Out(v) = 1 holds for every intermediate node v on the path, i.e. for every node except possibly the first and the last

maximal non-branching path --> a non-branching path that cannot be extended into a longer non-branching path, i.e. a non-branching path that is not part of any larger non-branching path. Maximal does not mean the longest in the graph: a graph usually contains several maximal non-branching paths with different numbers of nodes

In practice, biologists have no choice but to break genomes into contigs, even in the case of perfect coverage, since repeats prevent them from inferring a unique Eulerian path. --> Euler’s Theorem: every balanced, strongly connected directed graph is Eulerian --> an Eulerian graph is a graph that has an Eulerian cycle --> the theorem guarantees that such a cycle exists, not that it is unique

An Eulerian graph can have more than one Eulerian cycle --> an Eulerian path is formed from an Eulerian cycle by removing the edge between the second-to-last node of the cycle and the last node, which is also the starting node because the cycle starts and ends at the same node --> if the graph had only one Eulerian cycle, building the Eulerian path from it would give a single reconstruction --> with repeats there can be several Eulerian cycles and therefore several Eulerian paths; each of them uses every edge exactly once, so they all have the same length and none can be picked out as the longest
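A small check makes the ambiguity concrete: two different strings can share exactly the same k-mer composition, so their de Bruijn graphs are identical and each string corresponds to a different Eulerian path. The sketch below assumes two illustrative 17-nucleotide strings that differ only in the order of the segments between repeated copies of ATG; kmer_composition is a hypothetical helper name.

```python
from collections import Counter

def kmer_composition(text, k):
    """Return the multiset of k-mers in text as a Counter."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))

genome_a = "TAATGCCATGGGATGTT"
genome_b = "TAATGGGATGCCATGTT"

# Different strings, identical 3-mer composition.
same_composition = kmer_composition(genome_a, 3) == kmer_composition(genome_b, 3)
print(genome_a != genome_b and same_composition)  # prints True
```

Since the 3-mer composition cannot tell these two strings apart, only the parts common to both reconstructions can be reported with confidence, which is what the maximal non-branching paths capture.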


In [1]:
def DefineGraphDict(graph):
  graph_dict = {}
  for node_adjacent_nodes in graph:
    node_adjacent_nodes = node_adjacent_nodes.split(' -> ')
    graph_dict.setdefault(int(node_adjacent_nodes[0]),[])
    node_adjacent_nodes[1] = node_adjacent_nodes[1].split(',')
    for node in node_adjacent_nodes[1]:
      graph_dict[int(node_adjacent_nodes[0])].append(int(node))
  return graph_dict

In [12]:
def GraphDictEdges(graph_dict):
  graph_dict_edges = []
  for key,adjacent_nodes in graph_dict.items():
    for node in adjacent_nodes:
      graph_dict_edges.append((key,node))
  return graph_dict_edges

In [19]:
def NumberOfOccurences(graph_dict_edges,edge):
  # Counts the edges that leave the end node of `edge`, i.e. the out-degree of edge[1].
  count = 0
  for graph_edge in graph_dict_edges:
    if graph_edge[0] == edge[1]:
      count = count + 1
  return count

In [74]:
def GraphEdgeNumberOfOccurences(graph_dict_edges,edge):
  return graph_dict_edges.count(edge)

In [73]:
def NonBranchingPathNumberOfOccurences(non_branching_path,edge):
  return non_branching_path.count(edge)

In [75]:
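def NextEdge(graph_dict_edges,non_branching_path,edge):
  # Extend only through a 1-in-1-out node; the in-degree of edge[1] must also be 1.
  in_degree = sum(1 for graph_edge in graph_dict_edges if graph_edge[1] == edge[1])
  if in_degree != 1 or NumberOfOccurences(graph_dict_edges,edge) != 1:
    return []
  for graph_edge in graph_dict_edges:
    if graph_edge[0] == edge[1] and NonBranchingPathNumberOfOccurences(non_branching_path,graph_edge) < GraphEdgeNumberOfOccurences(graph_dict_edges,graph_edge):
      return [graph_edge]
  return []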

In [105]:
def FindAllKmers(non_branching_path,k):
  kmers_list = []
  i = 0
  while i + k - 1 <= len(non_branching_path) - 1:
    kmers_list.append(non_branching_path[i:i+k])
    i = i + 1
  return kmers_list

In [196]:
def CheckBiggerNonBranchingPaths(non_branching_paths,possible_non_branching_path):
  # Reject a candidate that is a fragment of a stored path, or a rotation of a
  # stored cycle (same edges in a different starting position).
  for non_branching_path in non_branching_paths:
    if possible_non_branching_path in FindAllKmers(non_branching_path,len(possible_non_branching_path)) or sorted(possible_non_branching_path) == sorted(non_branching_path):
      return False
  return True

In [197]:
def MaximalNonBranchingPaths(graph_dict_edges):
  maximal_non_branching_paths = []
  for edge in graph_dict_edges:
    non_branching_path = [edge]
    next_edge = NextEdge(graph_dict_edges,non_branching_path,non_branching_path[len(non_branching_path)-1])
    while len(next_edge):
      next_edge = next_edge[0]
      non_branching_path.append(next_edge)
      next_edge = NextEdge(graph_dict_edges,non_branching_path,non_branching_path[len(non_branching_path)-1])
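    if CheckBiggerNonBranchingPaths(maximal_non_branching_paths,non_branching_path):
      # Drop stored paths that turn out to be fragments of the new, longer path.
      maximal_non_branching_paths = [path for path in maximal_non_branching_paths if CheckBiggerNonBranchingPaths([non_branching_path],path)]
      maximal_non_branching_paths.append(non_branching_path)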
  return maximal_non_branching_paths

In [198]:
def PrintResult(maximal_non_branching_paths):
  for maximal_non_branching_path in maximal_non_branching_paths:
    string_to_print = ''
    string_to_print = string_to_print + str(maximal_non_branching_path[0][0]) + ' -> ' + str(maximal_non_branching_path[0][1]) + ' -> '
    for i in range(1,len(maximal_non_branching_path)):
      string_to_print = string_to_print + str(maximal_non_branching_path[i][1]) + ' -> '
    string_to_print = string_to_print[0:len(string_to_print) - 4]
    print(string_to_print)

In [199]:
def PrintResultToFile(maximal_non_branching_paths):
  # The with-block closes the file even if writing raises an exception.
  with open("task_result.txt","w") as f:
    for maximal_non_branching_path in maximal_non_branching_paths:
      string_to_print = ''
      string_to_print = string_to_print + str(maximal_non_branching_path[0][0]) + ' -> ' + str(maximal_non_branching_path[0][1]) + ' -> '
      for i in range(1,len(maximal_non_branching_path)):
        string_to_print = string_to_print + str(maximal_non_branching_path[i][1]) + ' -> '
      string_to_print = string_to_print[0:len(string_to_print) - 4]
      f.write(string_to_print + '\n')

In [200]:
graph = [
'1 -> 2',
'2 -> 3',
'3 -> 4,5',
'6 -> 7',
'7 -> 6']

In [201]:
graph_dict = DefineGraphDict(graph)
graph_dict

{1: [2], 2: [3], 3: [4, 5], 6: [7], 7: [6]}

In [202]:
graph_dict_edges = GraphDictEdges(graph_dict)
graph_dict_edges

[(1, 2), (2, 3), (3, 4), (3, 5), (6, 7), (7, 6)]

In [203]:
MaximalNonBranchingPaths(graph_dict_edges)

[[(1, 2), (2, 3)], [(3, 4)], [(3, 5)], [(6, 7), (7, 6)]]

In [204]:
PrintResult(MaximalNonBranchingPaths(graph_dict_edges))

1 -> 2 -> 3
3 -> 4
3 -> 5
6 -> 7 -> 6


In [182]:
with open('/content/rosalind_ba3m.txt') as task_file:
  graph = [line.rstrip() for line in task_file]

In [183]:
graph_dict = DefineGraphDict(graph)

In [184]:
graph_dict_edges = GraphDictEdges(graph_dict)

In [186]:
PrintResultToFile(MaximalNonBranchingPaths(graph_dict_edges))