Given a (k, d)-mer (a_1 ... a_k | b_1 ... b_k), we define its prefix and suffix as the following
(k - 1, d + 1)-mers:

PREFIX((a_1 ... a_k | b_1 ... b_k)) = (a_1 ... a_{k-1} | b_1 ... b_{k-1})

SUFFIX((a_1 ... a_k | b_1 ... b_k)) = (a_2 ... a_k | b_2 ... b_k)

We define COMPOSITIONGRAPHk,d(Text) as the graph consisting of |Text| - (k + d + k) + 1
isolated edges that are labeled by the (k, d)-mers in Text, and whose nodes are
labeled by the prefixes and suffixes of these labels. As you may have
guessed, gluing identically labeled nodes in COMPOSITIONGRAPHk,d(Text)
results in exactly the same de Bruijn graph as gluing identically labeled nodes in
PATHGRAPHk,d(Text) (defined below).

Of course, in practice, we will not know Text; however, we can form COMPOSITIONGRAPHk,d(Text) directly from the (k, d)-mer composition of Text, and the gluing step will result in the paired de Bruijn graph of this composition. The genome can be reconstructed by following an Eulerian path in this de Bruijn graph.
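The gluing step can be sketched as follows. This is only a minimal illustration, assuming read-pairs are given as 'Read1|Read2' strings; the helper names `isolated_edges` and `glue` are hypothetical and are not part of the notebook.

```python
def isolated_edges(paired_reads):
    """One isolated edge per (k, d)-mer, stored as (prefix node, suffix node)."""
    edges = []
    for pair in paired_reads:
        first, second = pair.split("|")
        prefix = (first[:-1], second[:-1])  # drop the last symbol of each read
        suffix = (first[1:], second[1:])    # drop the first symbol of each read
        edges.append((prefix, suffix))
    return edges

def glue(edges):
    """Merge identically labeled nodes into one adjacency dictionary."""
    graph = {}
    for prefix, suffix in edges:
        graph.setdefault(prefix, []).append(suffix)
        graph.setdefault(suffix, [])  # keep nodes that have no outgoing edge
    return graph

pairs = ["GAGA|TTGA", "TCGT|GATG", "CGTG|ATGT", "TGGT|TGAG", "GTGA|TGTT",
         "GTGG|GTGA", "TGAG|GTTG", "GGTC|GAGA", "GTCG|AGAT"]
edges = isolated_edges(pairs)
graph = glue(edges)
print(len(edges), "isolated edges with", 2 * len(edges), "node labels")
print(len(graph), "nodes after gluing")
```

Tuples are used for node labels here so that the two halves of a node stay separate; the notebook's own functions concatenate them into one string instead.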

In [45]:
def ReadPairPrefix(read_pair,k):
  read_pair_prefix = list()
  for kmers in read_pair:
    read_pair_prefix.append(kmers[0:k-1])
  return read_pair_prefix[0] + read_pair_prefix[1]

In [46]:
def ReadPairSuffix(read_pair,k):
  read_pair_suffix = []
  for kmers in read_pair:
    read_pair_suffix.append(kmers[1:k])
  return read_pair_suffix[0] + read_pair_suffix[1]

Given a string Text, we construct a graph PATHGRAPHk,d(Text) that represents a
path formed by |Text| - (k + d + k) + 1 edges corresponding to all (k, d)-mers in Text.
We label edges in this path by (k, d)-mers and label the starting and ending nodes of an
edge by its prefix and suffix, respectively. Note that the paired de
Bruijn graph is less tangled than the de Bruijn graph constructed from individual reads.
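When Text is known, the path graph can be written down directly. The sketch below is an illustration under that assumption (the function name `path_graph_edges` is hypothetical); it checks that the suffix of each edge equals the prefix of the next one, which is what makes the edges form a path.

```python
def path_graph_edges(text, k, d):
    """One (prefix, label, suffix) triple per (k, d)-mer of text, in order."""
    edges = []
    for i in range(len(text) - (2 * k + d) + 1):
        first = text[i:i + k]
        second = text[i + k + d:i + 2 * k + d]
        prefix = (first[:-1], second[:-1])
        suffix = (first[1:], second[1:])
        edges.append((prefix, (first, second), suffix))
    return edges

edges = path_graph_edges("GTGGTCGTGAGATGTTGA", 4, 2)
# consecutive edges share a node: suffix of edge i == prefix of edge i + 1
chained = all(edges[i][2] == edges[i + 1][0] for i in range(len(edges) - 1))
print(len(edges), chained)
```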

In [47]:
def PairedDeBruijnGraphKD(read_pairs_list,k):
  adjacency_dict = {}
  for read_pair in read_pairs_list:
    adjacency_dict.setdefault(ReadPairPrefix(read_pair,k),[])
    adjacency_dict.setdefault(ReadPairSuffix(read_pair,k),[])
  for read_pair in read_pairs_list:
    adjacency_dict[ReadPairPrefix(read_pair,k)].append(ReadPairSuffix(read_pair,k))
  return adjacency_dict

In the de Bruijn graph, identical nodes are glued together, so the list of nodes cannot contain duplicates. Only edges can be duplicated, i.e. reads that repeat.

In [48]:
def DeBruijnGraphNodes(de_bruijn_graph_dict):
  de_bruijn_graph_nodes = set()
  for key,values_list in de_bruijn_graph_dict.items():
    de_bruijn_graph_nodes.update([key] + values_list)
  return list(de_bruijn_graph_nodes) #the index of a node in this list serves as its integer ID in the relabeled graph

In [49]:
def DefineGraphDict(de_bruijn_graph_dict, de_bruijn_graph_nodes):
  node_index = {node: i for i,node in enumerate(de_bruijn_graph_nodes)} #node label --> integer ID, avoids a linear list.index search per edge
  graph_dict = {}
  for i in range(len(de_bruijn_graph_nodes)):
    graph_dict.setdefault(i,[])
  for key,values_list in de_bruijn_graph_dict.items():
    for value in values_list:
      graph_dict[node_index[key]].append(node_index[value])
  return graph_dict

In [50]:
def VisitedEdges(graph_dict):
  visited_edges_dict = {}
  for key,values_list in graph_dict.items():
    for value in values_list:
      visited_edges_dict.setdefault((key,value), 0)
  return visited_edges_dict

In [51]:
def NodeInDegree(graph_dict,node):
  node_indegree = 0
  for adjacent_nodes_list in graph_dict.values():
    node_indegree = node_indegree + adjacent_nodes_list.count(node) #count every parallel edge into node, not just one edge per source node
  return node_indegree

In [52]:
def NodeOutDegree(graph_dict,node):
  return len(graph_dict[node])

In [53]:
import numpy as np
from random import randint, randrange

In [54]:
def EulerianCycle(graph_dict):
  visited_edges_dict = VisitedEdges(graph_dict)
  starting_node = randint(min(graph_dict.keys()), max(graph_dict.keys()))
  cycle = [starting_node]
  while sum(visited_edges_dict.values()) < len(visited_edges_dict): #repeat until every edge is visited --> input is an Eulerian directed graph, so an Eulerian cycle can always be found
    #note: visited_edges_dict keys are (source, target) pairs, so parallel edges are treated as one edge
    #unvisited edges leaving the last node of the current walk
    possible_adjacent_nodes = [key[1] for key in visited_edges_dict.keys() if key[0] == cycle[-1] and visited_edges_dict[key] == 0]
    if len(possible_adjacent_nodes) == 0: #stuck at the starting node --> a cycle shorter than the Eulerian cycle is closed, but not all edges are visited
      #rejected: choosing any node with NodeOutDegree >= NodeOutDegree of cycle[0] ignores how many times the node already appears in the cycle
      #rejected: scanning visited_edges_dict for every node in the cycle caused efficiency problems
      #if NodeOutDegree(node) > number of occurrences of node in cycle, the node has unused outgoing edges, because every occurrence uses one outgoing edge
      possible_starting_nodes_list = [node for node in cycle if NodeOutDegree(graph_dict,node) > cycle.count(node)]
      starting_node = possible_starting_nodes_list[randrange(0,len(possible_starting_nodes_list))] #randomly choose a new starting node among nodes with unused outgoing edges
      cycle = cycle[cycle.index(starting_node):len(cycle)] + cycle[1:cycle.index(starting_node)+1] #rotate the previous cycle so that it starts and ends at the new starting node
    else: #the walk is not finished yet --> extend it along a randomly chosen unvisited edge
      next_node = possible_adjacent_nodes[randrange(0,len(possible_adjacent_nodes))]
      visited_edges_dict[(cycle[-1], next_node)] = 1
      cycle.append(next_node)
  return cycle
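EulerianCycle tracks edges in a dictionary keyed by (source, target) pairs, so repeated edges between the same two nodes share one key. A stack-based variant (Hierholzer's algorithm) that consumes the adjacency lists directly does not have this limitation. The following is a minimal sketch of that alternative, not the notebook's method; the name `eulerian_path_hierholzer` is hypothetical, and the graph is assumed to contain an Eulerian path or cycle.

```python
def eulerian_path_hierholzer(adjacency):
    """Return an Eulerian path (or cycle) as a list of nodes.

    adjacency maps each node to a list of successors; repeated entries
    stand for parallel edges. Assumes an Eulerian path or cycle exists.
    """
    remaining = {node: list(succ) for node, succ in adjacency.items()}
    balance = {}  # out-degree minus in-degree of every node
    for node, successors in adjacency.items():
        balance[node] = balance.get(node, 0) + len(successors)
        for successor in successors:
            balance[successor] = balance.get(successor, 0) - 1
    # start at the node with one surplus outgoing edge, else at the first key
    start = next((n for n, b in balance.items() if b == 1), next(iter(adjacency)))
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow and consume an unused edge
        else:
            path.append(stack.pop())  # no unused edges left: node is finished
    return path[::-1]

graph = {0: [1], 1: [2, 2], 2: [1, 3]}  # the edge 1 -> 2 occurs twice
print(eulerian_path_hierholzer(graph))  # [0, 1, 2, 1, 2, 3]
```

Because every edge is popped exactly once, the running time is linear in the number of edges, and no balancing edge or cycle rotation is needed when the path has distinct endpoints.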

In [55]:
def FindUnbalancedNodes(graph_dict):
  unbalanced_nodes = []
  for node in graph_dict.keys():
    if NodeInDegree(graph_dict,node) != NodeOutDegree(graph_dict,node):
      unbalanced_nodes.append(node)
  for adjacent_nodes_list in graph_dict.values():
    for adjacent_node in adjacent_nodes_list:
      if adjacent_node not in graph_dict.keys():
        unbalanced_nodes.append(adjacent_node)
  return unbalanced_nodes

In [56]:
def UnbalancedNodesOrder(graph_dict,unbalanced_nodes):
  ordered_unbalanced_nodes = [unbalanced_nodes[0]]
  if NodeInDegree(graph_dict, unbalanced_nodes[1]) < NodeOutDegree(graph_dict, unbalanced_nodes[1]): #node lacks one incoming edge --> node is starting node in Eulerian path
    ordered_unbalanced_nodes.insert(0, unbalanced_nodes[1])
  else: #NodeInDegree(graph_dict, unbalanced_nodes[1]) > NodeOutDegree(graph_dict, unbalanced_nodes[1]) --> node lacks one outgoing edge --> node is ending node in Eulerian path
    ordered_unbalanced_nodes.insert(1, unbalanced_nodes[1])
  return ordered_unbalanced_nodes

In [57]:
def BalanceUnbalancedNodes(graph_dict, ordered_unbalanced_nodes):
  #ordered_unbalanced_nodes = [starting_node, ending_node]
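  graph_dict.setdefault(ordered_unbalanced_nodes[1],[]).append(ordered_unbalanced_nodes[0]) #add edge ending_node --> starting_node without overwriting the existing outgoing edges of ending_node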
  return graph_dict

In [58]:
def FindEulerianPathInEulerianCycle(ordered_unbalanced_nodes,eulerian_cycle):
  #ordered_unbalanced_nodes = [starting_node,ending_node]
  eulerian_cycle = np.array(eulerian_cycle)
  eulerian_path = []
  eulerian_path_start_indices = list(np.where(eulerian_cycle == ordered_unbalanced_nodes[0])[0])
  eulerian_path_end_indices = list(np.where(eulerian_cycle == ordered_unbalanced_nodes[1])[0])
  eulerian_cycle = list(eulerian_cycle)
  for start_index in eulerian_path_start_indices:
    for end_index in eulerian_path_end_indices:
      if end_index < start_index:
        if (len(eulerian_cycle) - 1 - start_index + 1) + (end_index - 0 + 1) == len(eulerian_cycle):
          return eulerian_cycle[start_index:len(eulerian_cycle)] + eulerian_cycle[1:end_index+1]
      else:
        if (end_index - start_index + 1) == len(eulerian_cycle):
          return eulerian_cycle[start_index:end_index+1]

String Reconstruction from Read-Pairs Problem

Reconstruct a string from its paired composition.

Given: Integers k and d followed by a collection of paired k-mers PairedReads.

Return: A string Text with (k, d)-mer composition equal to PairedReads. (If multiple answers exist, you may return any one.)

PairedReads are the (k,d)-mers of the string Text. From the read-pairs we build PairedDeBruijnGraphKD: the edges of the graph are the (k,d)-mers, the starting node of an edge is the prefix of its (k,d)-mer, and the ending node of an edge is the suffix of its (k,d)-mer. Identical nodes are glued together. The graph contains an Eulerian path because Text can be reconstructed from its (k,d)-mers, i.e. from reads spanning k + d + k positions. For a linear string whose first and last nodes differ, the starting and ending nodes of that path are unbalanced, so the graph is not balanced, i.e. it is not an Eulerian graph and contains no Eulerian cycle. StringReconstruction therefore adds one edge from the ending node to the starting node, finds an Eulerian cycle in the balanced graph, and cuts the cycle at the added edge. Note that the graph may contain more than one Eulerian path; the problem accepts any string whose (k,d)-mer composition equals PairedReads.
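The starting and ending nodes can be read off from the degrees alone. The sketch below is one possible way to do this in a single pass with `collections.Counter`; the function name `path_endpoints` is hypothetical, and the adjacency dictionary is assumed to list parallel edges as repeated entries.

```python
from collections import Counter

def path_endpoints(adjacency):
    """Return (starts, ends): nodes with one surplus outgoing / incoming edge."""
    out_degree = Counter({node: len(succ) for node, succ in adjacency.items()})
    in_degree = Counter(s for succ in adjacency.values() for s in succ)
    nodes = set(out_degree) | set(in_degree)
    starts = [n for n in nodes if out_degree[n] - in_degree[n] == 1]
    ends = [n for n in nodes if in_degree[n] - out_degree[n] == 1]
    return starts, ends

graph = {"A": ["B"], "B": ["C", "C"], "C": ["B", "D"]}
starts, ends = path_endpoints(graph)
print(starts, ends)  # ['A'] ['D']
```

A graph with an Eulerian path between distinct endpoints has exactly one node in each list; two empty lists indicate a balanced graph.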

Let Reads be the collection of all 2N k-mer reads taken from N read-pairs. Note that a read-pair formed by k-mer reads Read1 and Read2 corresponds to two edges in the de Bruijn graph DEBRUIJNk(Reads). Since these reads are separated by distance d in the genome, there must be a path of length k + d + 1 in DEBRUIJNk(Reads) connecting the node at the beginning of the edge corresponding to Read1 with the node at the end of the edge corresponding to Read2. If there is only one path of length k + d + 1 connecting these nodes, or if all such paths spell out the same string, then we can transform a read-pair formed by reads Read1 and Read2 into a virtual read of length 2 · k + d that starts as Read1, spells out this path, and ends with Read2. 
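The virtual-read idea can be sketched as a bounded search in DEBRUIJNk(Reads). This is an illustration only: the function name `virtual_read` is hypothetical, Read1 is assumed to occur in Reads, and the search enumerates every path of k + d + 1 edges, which can be exponential in repetitive graphs.

```python
def virtual_read(reads, read1, read2, k, d):
    """Return the virtual read of length 2k + d, or None if absent or ambiguous."""
    # de Bruijn graph on (k-1)-mers with one edge per distinct k-mer
    graph = {}
    for kmer in set(reads):
        graph.setdefault(kmer[:-1], []).append(kmer[1:])

    spelled = set()
    # stack entries: (current node, string spelled so far, number of edges used)
    stack = [(read1[1:], read1, 1)]  # the path starts with the edge for read1
    while stack:
        node, text, used = stack.pop()
        if used == k + d + 1:
            if text.endswith(read2):  # the last edge must be read2
                spelled.add(text)
            continue
        for successor in graph.get(node, []):
            stack.append((successor, text + successor[-1], used + 1))
    # usable only if all qualifying paths spell the same string
    return spelled.pop() if len(spelled) == 1 else None

pairs = ["GAGA|TTGA", "TCGT|GATG", "CGTG|ATGT", "TGGT|TGAG", "GTGA|TGTT",
         "GTGG|GTGA", "TGAG|GTTG", "GGTC|GAGA", "GTCG|AGAT"]
reads = [read for pair in pairs for read in pair.split("|")]
print(virtual_read(reads, "GTGG", "GTGA", 4, 2))  # GTGGTCGTGA
```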

Although the idea of transforming read-pairs into long virtual reads is used in many
assembly programs, we have made an optimistic assumption: “If there is only one path of length k + d + 1 connecting these nodes, or if all such paths spell out the same string . . . ”. In practice, this assumption limits the application of the long virtual read approach to assembling read-pairs because highly repetitive genomic regions often contain multiple paths of the same length between two edges, and these paths often spell different strings.

In [59]:
def PairedDeBruijnGraphNodesToPairedRead(first_node,second_node,k):
  paired_read = []
  paired_read.append(first_node[0:k-1] + second_node[k-2])
  paired_read.append(first_node[k-1:len(first_node)] + second_node[len(second_node)-1])
  return paired_read

|Text| = num_of_read_pairs + (2*k + d) - 1

In [83]:
def AssembleStringFromNodes(eulerian_path,k,d,num_of_read_pairs):
  string_list = []
  for i in range(num_of_read_pairs + 2*k + d - 1):
    string_list.append(' ')
  i = 0
  while i + 1 <= len(eulerian_path) - 1: #i --> starting position of first kmer in read pair, starting position of second kmer is i + k + d
    read_pair = PairedDeBruijnGraphNodesToPairedRead(eulerian_path[i],eulerian_path[i+1],k)
    string_list[i:i+k] = list(read_pair[0])
    string_list[(i+k+d):(i+k+d)+k] = list(read_pair[1])
    i = i + 1
  return ''.join(string_list)
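AssembleStringFromNodes writes both reads of every pair into the output and lets later writes overwrite earlier ones, so it does not detect a path whose first and second reads disagree. A common alternative is to spell the first reads and the second reads separately and compare their overlap. The sketch below assumes the pairs are already ordered along a path; the function name is hypothetical.

```python
def string_spelled_by_gapped_patterns(pairs, k, d):
    """pairs: (first, second) k-mer pairs in path order. Returns the string or None."""
    prefix_string = pairs[0][0] + "".join(first[-1] for first, _ in pairs[1:])
    suffix_string = pairs[0][1] + "".join(second[-1] for _, second in pairs[1:])
    # prefix_string[i] and suffix_string[i - (k + d)] cover the same position of Text
    for i in range(k + d, len(prefix_string)):
        if prefix_string[i] != suffix_string[i - k - d]:
            return None  # the path does not spell a consistent string
    return prefix_string + suffix_string[-(k + d):]

ordered = [("GTGG", "GTGA"), ("TGGT", "TGAG"), ("GGTC", "GAGA"),
           ("GTCG", "AGAT"), ("TCGT", "GATG"), ("CGTG", "ATGT"),
           ("GTGA", "TGTT"), ("TGAG", "GTTG"), ("GAGA", "TTGA")]
print(string_spelled_by_gapped_patterns(ordered, 4, 2))  # GTGGTCGTGAGATGTTGA
```

Returning None for an inconsistent path makes it possible to try another Eulerian path instead of returning a string with the wrong composition.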

In [75]:
def StringReconstruction(read_pairs,k,d):
  paired_de_bruijn_graph_dict = PairedDeBruijnGraphKD(read_pairs,k)
  paired_de_bruijn_graph_nodes = DeBruijnGraphNodes(paired_de_bruijn_graph_dict)
  graph_dict = DefineGraphDict(paired_de_bruijn_graph_dict,paired_de_bruijn_graph_nodes)
  unbalanced_nodes = FindUnbalancedNodes(graph_dict)
  unbalanced_nodes = UnbalancedNodesOrder(graph_dict,unbalanced_nodes)
  graph_dict = BalanceUnbalancedNodes(graph_dict,unbalanced_nodes)
  eulerian_cycle = EulerianCycle(graph_dict)
  eulerian_path = FindEulerianPathInEulerianCycle(unbalanced_nodes,eulerian_cycle)
  for i in range(len(eulerian_path)):
    eulerian_path[i] = paired_de_bruijn_graph_nodes[eulerian_path[i]]
  string = AssembleStringFromNodes(eulerian_path,k,d,len(read_pairs))
  return string

The length of the string Text must be (2k + d) - 1 + len(read_pairs).

In [76]:
def FormatReadPairs(read_pairs):
  read_pairs_list = []
  for read_pair in read_pairs:
    read_pair = read_pair.split('|')
    read_pairs_list.append(read_pair)
  return read_pairs_list

In [77]:
k = 4

In [78]:
d = 2

In [79]:
read_pairs = [
'GAGA|TTGA',
'TCGT|GATG',
'CGTG|ATGT',
'TGGT|TGAG',
'GTGA|TGTT',
'GTGG|GTGA',
'TGAG|GTTG',
'GGTC|GAGA',
'GTCG|AGAT']

In [80]:
read_pairs = FormatReadPairs(read_pairs)

In [81]:
read_pairs

[['GAGA', 'TTGA'],
 ['TCGT', 'GATG'],
 ['CGTG', 'ATGT'],
 ['TGGT', 'TGAG'],
 ['GTGA', 'TGTT'],
 ['GTGG', 'GTGA'],
 ['TGAG', 'GTTG'],
 ['GGTC', 'GAGA'],
 ['GTCG', 'AGAT']]

In [84]:
StringReconstruction(read_pairs,k,d)

'GTGGTCGTGAGATGTTGA'
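The result can be checked against the problem statement by recomputing the (k, d)-mer composition of the returned string and comparing it, as a multiset, with the input. This is a small sketch with a hypothetical helper name, assuming read_pairs has the list-of-lists form produced by FormatReadPairs.

```python
from collections import Counter

def has_paired_composition(text, read_pairs, k, d):
    """True if the (k, d)-mer composition of text equals read_pairs as a multiset."""
    span = 2 * k + d
    composition = Counter(
        (text[i:i + k], text[i + k + d:i + span])
        for i in range(len(text) - span + 1)
    )
    return composition == Counter(tuple(pair) for pair in read_pairs)

sample_pairs = [['GAGA', 'TTGA'], ['TCGT', 'GATG'], ['CGTG', 'ATGT'],
                ['TGGT', 'TGAG'], ['GTGA', 'TGTT'], ['GTGG', 'GTGA'],
                ['TGAG', 'GTTG'], ['GGTC', 'GAGA'], ['GTCG', 'AGAT']]
print(has_paired_composition('GTGGTCGTGAGATGTTGA', sample_pairs, 4, 2))  # True
```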

In [85]:
k = 30

In [86]:
d = 100

In [87]:
with open('/content/rosalind_ba3j.txt') as task_file:
  read_pairs = [line.rstrip() for line in task_file]

In [88]:
read_pairs = FormatReadPairs(read_pairs)

In [None]:
StringReconstruction(read_pairs,k,d)
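If the downloaded file begins with a line holding k and d (an assumption about the file layout, not something the notebook shows), that line has to be separated from the read-pairs before they are split on '|'. A minimal sketch with a hypothetical `parse_read_pairs` helper, demonstrated on an in-memory file:

```python
import io

def parse_read_pairs(handle):
    """Assumed layout: first line 'k d', then one 'Read1|Read2' pair per line."""
    k, d = map(int, handle.readline().split())
    read_pairs = [line.strip().split("|") for line in handle if line.strip()]
    return k, d, read_pairs

sample = io.StringIO("4 2\nGAGA|TTGA\nTCGT|GATG\n")
k_value, d_value, parsed_pairs = parse_read_pairs(sample)
print(k_value, d_value, parsed_pairs)
```

With a real file, the same function could be called as `parse_read_pairs(open(path))`, where `path` is whatever location the dataset was saved to.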