## Genome Assembly with Perfect Coverage 

# Cyclic Chromosomes

Recall that although chromosomes taken from eukaryotes have a linear structure, many bacterial chromosomes are actually circular. We represented a linear chromosome with a DNA string, so we only need to modify the definition of string to model circular chromosomes.

Perfect coverage is the phenomenon in fragment assembly of having a read (or k-mer) begin at every possible location in the genome. Unfortunately, perfect coverage is still difficult to achieve, but fragment assembly technology continues to improve by leaps and bounds, and perfect coverage is perhaps not the fantasy it once was.

# Problem

A circular string is a string that does not have an initial or terminal element; instead, the string is viewed as a necklace of symbols. We can represent a circular string as a string enclosed in parentheses. For example, consider the circular DNA string (ACGTAC), and note that because the string "wraps around" at the end, this circular string can equally be represented by (CGTACA), (GTACAC), (TACACG), (ACACGT), and (CACGTA). The definitions of substrings and superstrings are easy to generalize to the case of circular strings (keeping in mind that substrings are allowed to wrap around).
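The wrap-around behaviour described above can be sketched in a few lines of Python. The helper names `rotations` and `is_cyclic_substring` are illustrative assumptions, not part of the problem statement, and the substring check assumes the candidate is no longer than the circular string itself.

```python
def rotations(s):
    """Return every linear representation of the circular string (s)."""
    return [s[i:] + s[:i] for i in range(len(s))]


def is_cyclic_substring(sub, s):
    """Check whether sub occurs in the circular string (s), allowing wrap-around.

    Assumes len(sub) <= len(s), so the substring wraps at most once.
    """
    if len(sub) > len(s):
        return False
    # Appending the first len(sub) - 1 symbols exposes every wrap-around window.
    return sub in s + s[:len(sub) - 1]


print(rotations("ACGTAC"))
print(is_cyclic_substring("ACAC", "ACGTAC"))  # True: spans the end of ACGTAC
print(is_cyclic_substring("ACGG", "ACGTAC"))  # False
```

The first call lists the six representations given in the text; the second shows a substring that only exists because the string wraps around.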

Given: A collection of (error-free) DNA k-mers (k≤50) taken from the same strand of a circular chromosome. In this dataset, all k-mers from this strand of the chromosome are present, and their de Bruijn graph consists of exactly one simple cycle.

Return: A cyclic superstring of minimal length containing the reads (thus corresponding to a candidate cyclic chromosome).
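Because the de Bruijn graph is a single simple cycle, one possible approach is to treat each k-mer as an edge from its (k-1)-prefix to its (k-1)-suffix and walk the cycle once, emitting one symbol per node. The sketch below is an alternative to the greedy extension used later in this notebook; the function name `cyclic_superstring` is an assumption made for illustration, and the result is one linear representation of the circular string, not necessarily the rotation shown in the sample output.

```python
def cyclic_superstring(kmers):
    """Walk the single de Bruijn cycle and return one linear representation."""
    # Each k-mer is an edge from its (k-1)-prefix to its (k-1)-suffix.
    successor = {}
    for kmer in set(kmers):
        prefix, suffix = kmer[:-1], kmer[1:]
        if prefix in successor:
            raise ValueError(f"node {prefix!r} branches; graph is not a simple cycle")
        successor[prefix] = suffix

    start = kmers[0][:-1]
    node = start
    symbols = []
    for _ in range(len(successor)):
        if node not in successor:
            raise ValueError(f"node {node!r} has no outgoing edge")
        symbols.append(node[0])  # each node on the cycle contributes one symbol
        node = successor[node]
    if node != start:
        raise ValueError("walk did not return to its start; graph is not one cycle")
    return "".join(symbols)


reads = ["ATTAC", "TACAG", "GATTA", "ACAGA", "CAGAT", "TTACA", "AGATT"]
print(cyclic_superstring(reads))  # ATTACAG, a rotation of (GATTACA)
```

The dictionary lookup makes each step constant time, so the walk is linear in the number of reads, and the explicit checks fail loudly if the input violates the single-cycle assumption.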

Sample Dataset

ATTAC

TACAG

GATTA

ACAGA

CAGAT

TTACA

AGATT

Sample Output

GATTACA
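One way to sanity-check the sample output is to list the 5-mer that starts at each position of the circular string (GATTACA) and compare the result with the sample dataset. The helper name `cyclic_kmers` is an assumption made for this sketch, which also assumes k is no larger than the genome length.

```python
def cyclic_kmers(genome, k):
    """Return the k-mer starting at each position of the circular string (genome)."""
    extended = genome + genome[:k - 1]  # expose the windows that wrap around
    return [extended[i:i + k] for i in range(len(genome))]


reads = ["ATTAC", "TACAG", "GATTA", "ACAGA", "CAGAT", "TTACA", "AGATT"]
print(cyclic_kmers("GATTACA", 5))
print(sorted(cyclic_kmers("GATTACA", 5)) == sorted(reads))  # True
```

Seven positions give seven reads, which is why a perfectly covered circular genome has exactly as many symbols as there are distinct k-mers in the dataset.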

Note

The assumption made above that all reads derive from the same strand is unrealistic in practice; researchers generally do not know the strand of DNA from which a given read has been sequenced.

In [14]:
DNA = '''ATTAC
TACAG
GATTA
ACAGA
CAGAT
TTACA
AGATT'''

DNA = DNA.split('\n')

print(DNA)

['ATTAC', 'TACAG', 'GATTA', 'ACAGA', 'CAGAT', 'TTACA', 'AGATT']


In [2]:
path = r"C:\Users\Hi\Downloads\rosalind_pcov.txt"
with open(path, 'r') as file:
    # Strip newlines and skip blank lines so that every entry is a k-mer
    DNA = [line.strip() for line in file if line.strip()]
print(DNA)

['CGCGCTGGCCAAGTGCTTACTGATTAGAAAGTAGTCGGATTTTCATAAGG', 'TCTACGAAAGCATCTGTCTGCTTAGGTGAAGGCAGATATTATAGGGCGGC', 'GGTGCATCTTCTCGCTGCTCGCTCAGGTGGTTGATTCTACCTTAGCGTTT', 'TTAGGCTAACACCCAGAGTCTTAAGTTATCGGGCAACACACCAGGTGCAT', 'AACACCCAATGGCAGTCAATCTAAGCCAGACGCCTCGAACAAAAGTTCGG', 'AAGAACGTAATGCTCATTCGGCAAGGGTCAGTAAAGCCTCAAGGGATCTA', 'AATGCTCATTCGGCAAGGGTCAGTAAAGCCTCAAGGGATCTACGAAAGCA', 'AGTCAATCTAAGCCAGACGCCTCGAACAAAAGTTCGGGTATGAGCGCGAT', 'GCGGGAAATAGGCTAATTACCTAAAGGCCGCGGTTAAATATTCCTGCATC', 'TTGAGGGGAACTCCTCGCGATGGTGAAATATAATACTCTAGACACGGAAT', 'TTTCCTGATGGTCGACTCCCGGCATCAGCAACATGTATGGGGGGCGACGA', 'CACGAGTATTTGTCCGGTGCTGTTAGACCGAGACTACGGCTGATTCTCAA', 'GGCCAGCGAGGCAGCTAAGTCAGCGTTTCTACTCGAAAAAACACAATGCA', 'CTCATAGATCCCAAGAACGTAATGCTCATTCGGCAAGGGTCAGTAAAGCC', 'AGAAACGATGTATGTGCGGGCCTTGGATTAGGCTAACACCCAGAGTCTTA', 'TGTCCTGGGCTCTTAAGACTACCTCTAGGGAGATTAGATGTAGAGTCTAG', 'TCTGGGCGTCCAGGGAGACAGATTTTCGCGGGAAGTGACGGTACGATTGA', 'CCGGAACAGGGCGTATACTGTTCAGAATTCACGGTGATACCTCATAGATC', 'CGAATACTATTCCCGACTTCGTTCCA

In [15]:
N = len(DNA)
n = len(DNA[0])
print(n,N,DNA)

5 7 ['ATTAC', 'TACAG', 'GATTA', 'ACAGA', 'CAGAT', 'TTACA', 'AGATT']


In [16]:
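k = len(DNA[0])  # read length
superstring = DNA[0]
toCheck = DNA[1:]
print(superstring,toCheck)
# With one k-mer per position, the circular genome has exactly N symbols,
# so extend the seed read one symbol at a time until it reaches length N.
while len(superstring) < N:
    pre = superstring[:k-1]      # (k-1)-prefix of the current string
    post = superstring[-(k-1):]  # (k-1)-suffix of the current string
    for dna in toCheck:
        if dna[1:] == pre:
            # dna overlaps the left end: prepend its first symbol
            superstring = dna[0] + superstring
        elif dna[:-1] == post:
            # dna overlaps the right end: append its last symbol
            superstring += dna[-1]
        else:
            continue
        toCheck.remove(dna)
        print(superstring,toCheck)
        break
    else:
        # No read extends either end; stop instead of looping forever
        raise ValueError('no read overlaps the current superstring')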
    


ATTAC ['TACAG', 'GATTA', 'ACAGA', 'CAGAT', 'TTACA', 'AGATT']
GATTAC ['TACAG', 'ACAGA', 'CAGAT', 'TTACA', 'AGATT']
GATTACA ['TACAG', 'ACAGA', 'CAGAT', 'AGATT']
