### Material sources
#### http://betatim.github.io/posts/genome-hackers/ 
This post gives an idea of how to solve the DNA assembly problem. It explains the approach in good detail, in case anything in today's lecture was unclear.

#### http://rosalind.info/problems/
The lecture follows the Rosalind problems that lead up to DNA assembly, specifically the problem tree for `Genome assembly with perfect coverage`. The idea is to solve the related problems in order, from easy to difficult, until you can assemble the DNA.

#### https://www.pnas.org/content/98/17/9748
This paper describes the Eulerian path approach to DNA assembly.

#### Transcribing DNA into RNA
##### (Source: http://rosalind.info/problems/rna/)

In [1]:
import re

dna = 'GATGGAACTTGACTACGTAAATT'
rna = 'GAUGGAACUUGACUACGUAAAUU'

def transcribe(dna):
    return re.sub(r'T', 'U', dna, flags=re.IGNORECASE)

print(transcribe(dna) == rna)

True


#### Complementing a Strand of DNA

In [2]:
def complementary_dna(dna):
    return ''.join([sub_dict[nt] for nt in dna.upper()[::-1]])
    
dna = 'AAAACCCGGT'
compl = 'ACCGGGTTTT'
sub_dict = {'A': 'T', 'G': 'C', 'T': 'A', 'C': 'G'}
print(complementary_dna(dna) == compl)

True
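
As an alternative sketch, the same reverse complement can be computed with a translation table instead of a dictionary lookup per character. The names `COMPLEMENT` and `reverse_complement` are illustrative and not part of the lecture code.

```python
# Translation table built once; str.maketrans and str.translate are built-in str methods.
COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def reverse_complement(dna):
    """Return the reverse complement of an upper- or lower-case DNA string."""
    # complement every base, then reverse the strand
    return dna.upper().translate(COMPLEMENT)[::-1]

print(reverse_complement('AAAACCCGGT'))  # ACCGGGTTTT
```

Unlike the dictionary version, characters outside `ACGT` (for example `N`) pass through unchanged rather than raising a `KeyError`.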


#### Finding a Motif in DNA
##### (Source: http://rosalind.info/problems/subs/, https://docs.python.org/3.6/library/re.html)

By default, a regular-expression search for a DNA substring is non-overlapping. To find overlapping matches with `re`, wrap the pattern in a capturing group inside a `lookahead assertion`: `(?=(text))`.

In [3]:
import re

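def find_motif(s, t):
    """
    Print the 0-based start and (exclusive) end index of every occurrence
    of t in s, including overlapping occurrences.
    """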
    result = re.finditer(r'(?=({}))'.format(t), s)
    for i in result:
        print(i.start(1), i.end(1))

        
s = 'GATATATATGCATATACTT'
t = 'ATAT'

find_motif(s, t)

1 5
3 7
5 9
11 15
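
To make the difference between the two search modes concrete, here is a small sketch that returns 1-based start positions (the convention Rosalind uses) with and without the lookahead. The helper name `motif_positions` and its `overlapping` flag are assumptions made for this illustration.

```python
import re

def motif_positions(s, t, overlapping=True):
    """Return the 1-based start positions of t in s (hypothetical helper)."""
    # re.escape guards against regex metacharacters in t
    literal = re.escape(t)
    pattern = '(?=({}))'.format(literal) if overlapping else literal
    return [m.start() + 1 for m in re.finditer(pattern, s)]

s = 'GATATATATGCATATACTT'
print(motif_positions(s, 'ATAT'))                     # [2, 4, 6, 12]
print(motif_positions(s, 'ATAT', overlapping=False))  # [2, 6, 12]
```

The plain pattern consumes each match, so the occurrence starting at position 4 is skipped; the lookahead matches zero characters and therefore reports it.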


#### Overlap graph
##### (Source: http://rosalind.info/problems/grph/)

N.B. The original overlap_suffix function allowed the overlap to occur in the middle of a sequence, which is incorrect. The new version only matches a suffix of one sequence against a prefix of the other.

In [4]:
def read_fasta(fasta_file):
    seq_sid = {}
    sid_seq = {} 
    try:
        with open(fasta_file, 'rt') as f:
            fasta_text = f.read()
    except (OSError, ValueError):
        # not a readable path, so treat the argument as the FASTA text itself
        fasta_text = fasta_file
    for record in fasta_text.split('>')[1:]:
        sid, seq, _ = record.split('\n')
        seq_sid.setdefault(seq, sid)
        sid_seq.setdefault(sid, seq)
    return seq_sid, sid_seq

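def overlap_suffix(x, y, l):
    """
    Return the length of a suffix of x that matches a prefix of y,
    or l (the minimum length) when no longer match is found.
    Note: both loops stop once the index is no larger than the best length
    found so far, so an overlap longer than the first match can be missed.
    """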
    overlap_len = [0]
    i = len(x)
    while i > max(overlap_len):
        j = len(y)
        while j > max(overlap_len):
            if x[i:] != y[:j]:
                overlap_len.append(l)
            else:
                overlap_len.append(len(x[i:]))
            j -= 1
        i -= 1
    return max(overlap_len)

def find_overlap(records, overlap_len):
    overlap_pair = []
    seqs = list(records.keys())
    for i in range(len(seqs)):
        for j in range(i+1, len(seqs)):
            mi = overlap_suffix(seqs[i], seqs[j], overlap_len) # seq[i] is seq X, seq[j] is seq Y
            mj = overlap_suffix(seqs[j], seqs[i], overlap_len) # seq[i] is seq Y, seq[j] is seq X
            # pick the max overlap length to determine seq XY order in assembling, whether it is XY or YX 
            if mi > mj and mi > overlap_len: 
                overlap_pair.append([records[seqs[i]], records[seqs[j]], mi])
            elif mj > mi and mj > overlap_len:          
                overlap_pair.append([records[seqs[j]], records[seqs[i]], mj])
    return overlap_pair


fasta_file = '>Rosalind_0498\nAAATAAA\n>Rosalind_2391\nAAATTTT\n>Rosalind_2323\nTTTTCCC\n>Rosalind_0442\nAAATCCC\n>Rosalind_5013\nGGGTGGG\n'
seq_sid, sid_seq = read_fasta(fasta_file)
overlap_len = 1
overlap_pair = find_overlap(seq_sid, overlap_len)
print(overlap_pair)

[['Rosalind_0498', 'Rosalind_2391', 3], ['Rosalind_0498', 'Rosalind_0442', 3], ['Rosalind_2391', 'Rosalind_2323', 3]]
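
For comparison, here is a minimal sketch of a direct longest suffix-prefix search that tries the longest candidate first. The name `longest_overlap` and the `min_len` parameter are assumptions for this example, not the lecture's implementation.

```python
def longest_overlap(x, y, min_len=1):
    """Length of the longest proper suffix of x equal to a prefix of y, else 0."""
    # try the longest candidate first; a full-length match is excluded
    for n in range(min(len(x), len(y)) - 1, min_len - 1, -1):
        if x.endswith(y[:n]):
            return n
    return 0

print(longest_overlap('AAATAAA', 'AAATTTT'))  # 3
print(longest_overlap('AAATTTT', 'TTTTCCC'))  # 4
print(longest_overlap('TTTTCCC', 'AAATAAA'))  # 0
```

For the pair `Rosalind_2391`/`Rosalind_2323` this sketch reports 4 (`TTTT`), whereas the cell above reports 3. The Rosalind overlap-graph problem fixes the overlap length at 3, so both versions produce the same edges for that exercise.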


#### Genome Assembly as Shortest Superstring
##### (Source: http://rosalind.info/problems/long/)
Then we combine all the overlap pairs into the shortest superstring.

In [5]:
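def shortest_superstring(overlap_pair, sid_seq):
    """
    Greedily chain overlap pairs, longest overlap first, into one superstring.
    After the first pair, only pairs that extend the most recently added read are used.
    """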
    overlap_pair.sort(key= lambda x: x[2], reverse=True)
    overlap_string = ''
    done_set = []  
    for item in overlap_pair:
        if done_set == []:
            overlap_string = sid_seq[item[0]] + sid_seq[item[1]][item[2]:]
            done_set.append(item[0])
            done_set.append(item[1])
        elif item[0] == done_set[-1] and item[1] not in done_set:
            overlap_string += sid_seq[item[1]][item[2]:] 
            done_set.append(item[1])
    return overlap_string
   
fasta_file = '>Rosalind_56\nATTAGACCTG\n>Rosalind_57\nCCTGCCGGAA\n>Rosalind_58\nAGACCTGCCG\n>Rosalind_59\nGCCGGAATAC\n'
records = read_fasta(fasta_file)
overlap_len = 1
seq_sid, sid_seq = read_fasta(fasta_file)
overlap_pair = find_overlap(seq_sid, overlap_len)
sss = shortest_superstring(overlap_pair, sid_seq)

print(sss)
print(sss == 'ATTAGACCTGCCGGAATAC')
print(overlap_pair)

ATTAGACCTGCCGGAATAC
True
[['Rosalind_56', 'Rosalind_58', 7], ['Rosalind_58', 'Rosalind_57', 7], ['Rosalind_57', 'Rosalind_59', 7], ['Rosalind_56', 'Rosalind_57', 4], ['Rosalind_58', 'Rosalind_59', 4]]
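
The merge step itself is small enough to check by hand. The sketch below glues the reads together in the chain order found above (56, 58, 57, 59, each overlapping the next by 7) and confirms that every read is contained in the result. `merge_chain` is a hypothetical helper that assumes the order and overlap lengths are already known.

```python
def merge_chain(reads, overlaps):
    """Merge reads in the given order; overlaps[i] is the overlap of reads[i] with reads[i+1]."""
    result = reads[0]
    for read, n in zip(reads[1:], overlaps):
        result += read[n:]  # append only the part that is not already covered
    return result

chain = ['ATTAGACCTG', 'AGACCTGCCG', 'CCTGCCGGAA', 'GCCGGAATAC']
superstring = merge_chain(chain, [7, 7, 7])
print(superstring)                                 # ATTAGACCTGCCGGAATAC
print(all(read in superstring for read in chain))  # True
```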


#### K-mer composition
##### (Source: http://rosalind.info/problems/kmer/)

In [6]:
import itertools

def ngram2index(ngram):
    a = [''.join(item) for item in itertools.product('GACT', repeat=ngram)] 
    # itertools.product follows the order of the given alphabet,
    # so pass 'ACGT' in alphabetical order to get lexicographic output,
    # or sort the result afterwards
    a.sort()
    print(a == [''.join(item) for item in itertools.product('ACGT', repeat=ngram)])
    return [''.join(item) for item in itertools.product('ACGT', repeat=ngram)]

                                                        
def kmer_composition(dna, k):
    """
    This k-mer composition function is the same as generating the spectrum of DNA n-grams
    """
    ngram = ngram2index(k)
    kmer = {}
    for i in range(len(dna) - k + 1):
        kmer[dna[i:i+k]] = kmer.get(dna[i:i+k], 0)+1
    return [kmer.get(item, 0) for item in ngram]

        
my_dna = 'CTTCGAAAGTTTGGGCCGAGTCTTACAGTCGGTCTTGAAGCAAAGTAACGAACTCCACGGCCCTGACTACCGAACCAGTTGTGAGTACTCAACTGGGTGAGAGTGCAGTCCCTATTGAGTTTCCGAGACTCACCGGGATTTTCGATCCAGCCTCAGTCCAGTCTTGTGGCCAACTCACCAAATGACGTTGGAATATCCCTGTCTAGCTCACGCAGTACTTAGTAAGAGGTCGCTGCAGCGGGGCAAGGAGATCGGAAAATGTGCTCTATATGCGACTAAAGCTCCTAACTTACACGTAGACTTGCCCGTGTTAAAAACTCGGCTCACATGCTGTCTGCGGCTGGCTGTATACAGTATCTACCTAATACCCTTCAGTTCGCCGCACAAAAGCTGGGAGTTACCGCGGAAATCACAG'
kmer = kmer_composition(my_dna, 4)   
print(kmer)

True
[4, 1, 4, 3, 0, 1, 1, 5, 1, 3, 1, 2, 2, 1, 2, 0, 1, 1, 3, 1, 2, 1, 3, 1, 1, 1, 1, 2, 2, 5, 1, 3, 0, 2, 2, 1, 1, 1, 1, 3, 1, 0, 0, 1, 5, 5, 1, 5, 0, 2, 0, 2, 1, 2, 1, 1, 1, 2, 0, 1, 0, 0, 1, 1, 3, 2, 1, 0, 3, 2, 3, 0, 0, 2, 0, 8, 0, 0, 1, 0, 2, 1, 3, 0, 0, 0, 1, 4, 3, 2, 1, 1, 3, 1, 2, 1, 3, 1, 2, 1, 2, 1, 1, 1, 2, 3, 2, 1, 1, 0, 1, 1, 3, 2, 1, 2, 6, 2, 1, 1, 1, 2, 3, 3, 3, 2, 3, 0, 3, 2, 1, 1, 0, 0, 1, 4, 3, 0, 1, 5, 0, 2, 0, 1, 2, 1, 3, 0, 1, 2, 2, 1, 1, 0, 3, 0, 0, 4, 5, 0, 3, 0, 2, 1, 1, 3, 0, 3, 2, 2, 1, 1, 0, 2, 1, 0, 2, 2, 1, 2, 0, 2, 2, 5, 2, 2, 1, 1, 2, 1, 2, 2, 2, 2, 1, 1, 3, 4, 0, 2, 1, 1, 0, 1, 2, 2, 1, 1, 1, 5, 2, 0, 3, 2, 1, 1, 2, 2, 3, 0, 3, 0, 1, 3, 1, 2, 3, 0, 2, 1, 2, 2, 1, 2, 3, 0, 1, 2, 3, 1, 1, 3, 1, 0, 1, 1, 3, 0, 2, 1, 2, 2, 0, 2, 1, 1]
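
On a short input the composition is easy to verify by eye. This sketch uses `collections.Counter` for the counting; the function name `kmer_counts` and the toy sequence are assumptions for illustration.

```python
from collections import Counter
from itertools import product

def kmer_counts(dna, k, alphabet='ACGT'):
    """Count every k-mer of dna, reported in lexicographic order of all possible k-mers."""
    counts = Counter(dna[i:i + k] for i in range(len(dna) - k + 1))
    # a Counter returns 0 for k-mers that never occur
    return [counts[''.join(p)] for p in product(alphabet, repeat=k)]

print(kmer_counts('ACGTACGT', 2))
# [0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 1, 0, 0, 0]
```

The non-zero entries correspond to `AC`, `CG`, `GT` (twice each) and `TA` (once).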


#### Constructing De Bruijn Graph
##### (Source: http://rosalind.info/problems/dbru/, http://www.langmead-lab.org/teaching-materials/)

In [7]:
def get_kmer(dna):
    """
    Split a read of length k+1 into its two k-mers: the prefix and the suffix.
    """
    k = len(dna) - 1
    return [dna[i:i+k] for i in range(len(dna) - k + 1)]
        
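def de_bruijn_graph(dna):
    """
    Generate a de Bruijn graph as a dictionary.
    dict.keys(): nodes (k-mers)
    dict.values(): sets of successor nodes, i.e. the outgoing edges
    Note: an edge is added whenever two k-mers overlap by k-1 characters,
    whether or not a read spans both of them.
    """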
    edges = {}
    all_kmer = set([kmer for item in dna for kmer in get_kmer(item)]) # get all the kmers
    for i in all_kmer:
        for j in all_kmer:
            if i[1:] == j[:len(j)-1]:
                edges.setdefault(i, set([])).add(j)
            elif j[1:] == i[:len(j)-1]:
                edges.setdefault(j, set([])).add(i)
    return edges

dna = 'TGAT\nCATG\nTCAT\nATGC\nCATC\nCATC'
paths = de_bruijn_graph(dna.split('\n'))
print(paths)

{'TCA': {'CAT'}, 'ATC': {'TCA'}, 'GAT': {'ATG', 'ATC'}, 'TGA': {'GAT'}, 'ATG': {'TGC', 'TGA'}, 'CAT': {'ATG', 'ATC'}}
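
For comparison, a common textbook construction adds exactly one edge per distinct read, from the read's prefix to its suffix. The sketch below follows that construction under the assumption that all reads have the same length; `de_bruijn_edges` is a hypothetical name, and the reverse complements that the Rosalind problem also requires are left out to keep the example short.

```python
from collections import defaultdict

def de_bruijn_edges(reads):
    """Map each read's prefix to the set of suffixes that follow it in some read."""
    graph = defaultdict(set)
    for read in set(reads):  # duplicate reads add nothing new
        graph[read[:-1]].add(read[1:])
    return dict(graph)

reads = ['TGAT', 'CATG', 'TCAT', 'ATGC', 'CATC', 'CATC']
for node, successors in sorted(de_bruijn_edges(reads).items()):
    print(node, '->', sorted(successors))
# ATG -> ['TGC']
# CAT -> ['ATC', 'ATG']
# TCA -> ['CAT']
# TGA -> ['GAT']
```

With this construction `GAT` has no outgoing edge, because no read starts with `GAT`.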


#### Genome assembly with perfect coverage
##### (Source: http://rosalind.info/problems/pcov/)

In [8]:
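def cyclic_superstring(edges):
    """
    Walk the graph from the first node in the dictionary, appending the last
    character of each newly visited node, until every node has been visited.
    """
    cyclic_string = ''
    all_edges = list(edges.keys())
    stack = []
    while set(stack) != set(all_edges): # make sure that we have visited all nodes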
        for k in edges.keys():
            if stack == []: # beginning of string, add both nodes 
                cyclic_string += k
                stack.append(k) # keep tracking which node we have already visited
                nodes = list(edges[k])
                cyclic_string += nodes[0][-1]
                stack.append(nodes[0]) # keep tracking which node we have already visited
            elif k in stack:
                nodes = list(edges[k])
                if nodes[0] not in stack:
                    cyclic_string += nodes[0][-1]
                    stack.append(nodes[0]) # keep tracking which node we have already visited
    return cyclic_string[len(all_edges[0])-1:] # remove the beginning of string of k-1 mer


dna = 'ATTAC\nTACAG\nGATTA\nACAGA\nCAGAT\nTTACA\nAGATT'
paths = de_bruijn_graph(dna.split('\n'))

cyclic_dna = cyclic_superstring(paths)
print(cyclic_dna)
answer = 'GATTACA'
print(answer == cyclic_dna, ': might not be True when compared while using linear representation for cyclic DNA')

ACAGATT
False : might not be True when compared while using linear representation for cyclic DNA
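
Because a cyclic string has no fixed starting point, a plain `==` on linear strings is too strict. One way to compare the two, sketched here with a hypothetical helper `same_cycle`, is to test whether one string is a rotation of the other.

```python
def same_cycle(a, b):
    """True if b is a rotation of a, i.e. both spell the same cyclic string."""
    # every rotation of a appears as a substring of a + a
    return len(a) == len(b) and b in a + a

print(same_cycle('ACAGATT', 'GATTACA'))  # True
print(same_cycle('ACAGATT', 'GATTACC'))  # False
```

This check treats both strings as being read in the same direction; it does not consider the reverse complement strand.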


#### Practical strategies for applying de Bruijn graphs
##### (Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5531759/)
##### (Source: https://en.wikipedia.org/wiki/Scaffolding_(bioinformatics))

Assembling a genome with the de Bruijn graph method rests on key assumptions that are not part of the lecture. Most established tools implement remedies for these issues in practice.

1. Generate all k-mers present in the sequenced genome by breaking the reads into shorter k-mers, e.g. a 100-nucleotide read into 46 overlapping 55-mers.
2. Apply error-correcting algorithms before assembling the genome to handle errors in reads.
3. Handle repeats by adding as many edges between a prefix and a suffix as the multiplicity of the corresponding k-mer.
4. Deal with linear and multiple chromosomes. (No extra algorithm is needed, as the Eulerian approach can handle them.)
5. Use scaffolding to determine the correct order and orientation of the contigs and to approximate the size of the gaps. Gap-filling algorithms can also be used to reduce the size of gaps between contigs.
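
Points 1 and 3 above can be sketched in a few lines: breaking a read into overlapping k-mers and counting how often each k-mer occurs. The helper `read_kmers` and the synthetic reads are assumptions for illustration only; real assemblers do considerably more.

```python
from collections import Counter

def read_kmers(read, k):
    """Return all overlapping k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

# synthetic 100-nucleotide read, used only to check the count from point 1
read = 'ACGT' * 25
print(len(read_kmers(read, 55)))  # 46

# multiplicity of each k-mer in a short repetitive read (point 3)
multiplicity = Counter(read_kmers('ATATAT', 2))
print(multiplicity['AT'], multiplicity['TA'])  # 3 2
```

A read of length L yields L - k + 1 k-mers, which is where the 46 in point 1 comes from (100 - 55 + 1).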