# Class : De Bruijn graph

---
## Before Class
1. Review slides on De Bruijn graphs and Eulerian walks

---
## Learning Objectives
1. Understand and implement De Bruijn graphs for assembly


---
## De Bruijn graphs

In class today we will be implementing one of the primary assembly algorithms from short-read data that is used today. We will implement a simple form of the algorithm where we assume perfect sequencing. That is, everything is sequenced exactly once and there are no errors or variants in the sequencing. This graph will use a similar structure to our previous work with trees, but we will need to track edges between nodes in our graph. We have provided the basic class structure as well as functions to `add_edge` and `remove_edge` from the graph. 

```
build_debruijn_graph:
define substring length k and input string
For each k-length substring of input:
  split k mer into left and right k-1 mer
  add k-1 mers as nodes with a directed edge from left k-1 mer to right k-1 mer
```

---
## Eulerian walk

For the second part of the implementation, we will use our De Bruijn graph to output a valid sequence from the assembly. This is implemented as a recursive algorithm by considering all valid edges. You will notice that as you change k, we are able to better recapitulate our sequence depending on how repetitive it is. In a more complex implementation of a Eulerian walk there are heuristics and defined rules for determining the validity of traversing a specific edge in the graph to result in a full graph-traversal. One of these methods is to traverse the graph in a depth first manner (as we previously implemented in class) to avoid sectioning off any part of the graph in the traversal. In our implementation we will ignore these for simplicity.

```
eulerian_walk:
Beginning at first_node as node

For node:
    follow a random valid edge from node
    remove edge
    recurse
```


In [24]:
from collections import defaultdict
import random

class DeBruijnGraph():
    """Main class for De Bruijn graphs
    
    Private Attributes:
        graph (defaultdict of lists): Edges for De Bruijn graph
        first_node (str): starting position for traversing the graph
    """

    def __init__(self, input_string, k):
        self.graph = defaultdict(list)
        self.first_node = ''
        self.build_debruijn_graph(input_string, k)
        
    def add_edge(self, left, right):
        ''' This function adds a new edge to the graph
        
        Args:
            left (str): The k-1 mer for the left edge
            right (str): The k-1 mer for the right edge

        Updates graph attribute to add right to the list named left in defaultdict   
        '''
        self.graph[left].append(right)
        
    def remove_edge(self, left, right):
        ''' This function removed an edge from the graph
        
        Args:
            left (str): The k-1 mer for the left edge
            right (str): The k-1 mer for the right edge

        Updates graph attribute to remove right from the list named left in defaultdict
        '''
        matching_edges = []
        for i, key in enumerate(self.graph[left]):
            if key == right:
                self.graph[left].pop(i)
                break

        
    def build_debruijn_graph(self, input_string, k):
        ''' This function builds a De Buijn graph from a string
        
        Args:
            input_string (str): string to use for building the graph
            k (int): k-mer length for graph construction

        Updates graph attribute to add all valid edges from the string
        
        Example:
        >>> dbg = DeBruijnGraph("this this this is a test", 4)
        >>> print(dbg.graph) #doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
        defaultdict(<class 'list'>, {'thi': ['his', 'his', 'his'], 'his': ['is ', 'is ', 'is '], ...)
        '''
        
        for i in range( 0, len( input_string ) - k + 1 ):
            ai, bi = input_string[i:][:k][:-1], input_string[i:][:k][1:]
            self.graph[ ai ].append( bi )

            
    def print_eulerian_walk(self, seed=None):
        ''' This function starts the recursive walk function
        at the first node in the graph
        
        Args: None
        
        Returns:
            tour (list): list of k-1 mers traversed by the algorithm
        
        Example:
        >>> dbg = DeBruijnGraph("this this this is a test", 4)
        >>> dbg.print_eulerian_walk(seed=1) #doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
        ['thi', 'his', 'is ', 's i', ' is', 'is ', ...]
        '''
        tour = []
        tour = self.eulerian_walk(self.first_node, seed=seed)
        tour.append(self.first_node)
        return tour[::-1]
        
    def eulerian_walk(self, node, seed=None):
        ''' This is a recursive function that follows all edges from a node
        to traverse the graph
        
        Args: 
            node (str): current node to traverse from
            seed (int): seed for random selection of edge to follow
        
        Returns:
            tour (list): list of k-1 mers traversed so far by the algorithm
            Note: this will be reverse order because of recursion
            
        Example:
        >>> dbg = DeBruijnGraph("this this this is a test", 4)
        >>> dbg.eulerian_walk('thi', seed=1) #doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
        ['is ', 'his', 'thi', ' th', ...]
        '''
        
        out = [ node ]
        
        while len( self.graph[ out[-1] ] ) > 0:
            out.append( random.choice( self.graph[ out[-1] ] ) )
            self.graph[ out ].remove( out[-1] )
                
        return out
        
        



In [23]:
a = "this this this is a test"

k = 4
for i in range( 0, len( a ) - k + 1 ):
    ai,bi = a[i:][:k][:-1], a[i:][:k][1:]
    
    print( ai, bi )

thi his
his is 
is  s t
s t  th
 th thi
thi his
his is 
is  s t
s t  th
 th thi
thi his
his is 
is  s i
s i  is
 is is 
is  s a
s a  a 
 a  a t
a t  te
 te tes
tes est


In [12]:
a = [1,2,3]
a.remove( 2 )
a

[1, 3]

In [None]:
graph = DeBruijnGraph("fool me once shame on shame on you fool me", 6)
print(graph.graph)
walk = graph.print_eulerian_walk(seed=11)
walk[0] + ''.join(map(lambda x: x[-1], walk[1:]))

In [18]:
import doctest
doctest.testmod()

**********************************************************************
File "__main__", line 97, in __main__.DeBruijnGraph.eulerian_walk
Failed example:
    dbg.eulerian_walk('thi', seed=1) #doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
Exception raised:
    Traceback (most recent call last):
      File "/home/loganaw/miniconda3/lib/python3.7/doctest.py", line 1329, in __run
        compileflags, 1), test.globs)
      File "<doctest __main__.DeBruijnGraph.eulerian_walk[1]>", line 1, in <module>
        dbg.eulerian_walk('thi', seed=1) #doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
      File "<ipython-input-17-801edf4c5bdc>", line 105, in eulerian_walk
        self.graph[ out ].remove( out[-1] )
    TypeError: unhashable type: 'list'
**********************************************************************
File "__main__", line 75, in __main__.DeBruijnGraph.print_eulerian_walk
Failed example:
    dbg.print_eulerian_walk(seed=1) #doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
Exception raised:
    Traceba

TestResults(failed=2, attempted=6)