Our goal now is to modify our previous algorithm for the Frequent Words Problem in order to find DnaA boxes by identifying frequent k-mers, possibly with mismatches. Given input strings Text and Pattern as well as an integer d, we extend the definition of PatternCount to the function ApproximatePatternCount(Pattern, Text, d). This function computes the number of occurrences of Pattern in Text with at most d mismatches. For example,

ApproximatePatternCount(AAAAA, AACAAGCATAAACATTAAAGAG, 1) = 4

because AAAAA appears four times in this string with at most one mismatch: AACAA, ATAAA, AAACA, and AAAGA. Notice that two of these occurrences, ATAAA and AAACA, overlap.
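To see where the four matches come from, the following minimal sketch lists every 0-based starting position at which a window of Text is within d mismatches of Pattern. The helper names `hamming_distance` and `approximate_match_positions` are illustrative only and are not part of the challenge.

```python
def hamming_distance(p, q):
    # Count the positions at which two equal-length strings differ.
    return sum(a != b for a, b in zip(p, q))


def approximate_match_positions(pattern, text, d):
    # Return (position, window) pairs for every window of text that is
    # within d mismatches of pattern. Positions are 0-based.
    k = len(pattern)
    return [
        (i, text[i:i + k])
        for i in range(len(text) - k + 1)
        if hamming_distance(pattern, text[i:i + k]) <= d
    ]


matches = approximate_match_positions("AAAAA", "AACAAGCATAAACATTAAAGAG", 1)
print(matches)
# [(0, 'AACAA'), (7, 'ATAAA'), (9, 'AAACA'), (16, 'AAAGA')]
```

The windows starting at positions 7 and 9 share three nucleotides, which is why overlapping occurrences must each be counted.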

Code Challenge (3 points): Implement the ApproximatePatternCount function in Python. (Hint: you will only need to make a slight change to your existing code to write this function.)

In [3]:
# Input:  Two strings p and q
# Output: An integer value representing the Hamming Distance between p and q.
def HammingDistance(p, q):
    assert len(p) == len(q), "p and q must have the same length"
    hamming = 0  # number of mismatched positions seen so far
    for ind in range(len(p)):
        if p[ind] != q[ind]:
            hamming += 1
    return hamming

# Input:  Strings Pattern and Text, and an integer d
# Output: The number of times Pattern appears in Text with at most d mismatches
def ApproximatePatternCount(Pattern, Text, d):
    count = 0  # number of windows within d mismatches of Pattern
    n_pattern = len(Pattern)
    # Slide a window of length n_pattern across Text; overlapping windows count separately
    for ind in range(len(Text) - n_pattern + 1):
        dist = HammingDistance(Pattern, Text[ind:ind + n_pattern])
        if dist <= d:
            count += 1
    return count

In [4]:
# Test 0: Sample Dataset (your code is not run on this dataset)
# Input:
Pattern = "GAGG"
Text = "TTTAGAGCCTTCAGAGG"
d = 2
# Output:
out = 4
assert ApproximatePatternCount(Pattern, Text, d) == out

# Test 1: Check that the output is > 0 and that you handle overlapping kmers
# Input:
Pattern = "AA"
Text = "AAA"
d = 0
# Output:
out = 2
assert ApproximatePatternCount(Pattern, Text, d) == out

# Test 2: Check that you find kmers with < d mismatches as well as with exactly d mismatches
# Input:
Pattern = "ATA"
Text = "ATA"
d = 1
# Output:
out = 1
assert ApproximatePatternCount(Pattern, Text, d) == out

# Test 3: Full dataset
# Input:
Pattern = "GTGCCG"
Text = "AGCGTGCCGAAATATGCCGCCAGACCTGCTGCGGTGGCCTCGCCGACTTCACGGATGCCAAGTGCATAGAGGAAGCGAGCAAAGGTGGTTTCTTTCGCTTTATCCAGCGCGTTAACCACGTTCTGTGCCGACTTT"
d = 3
# Output:
out = 24
assert ApproximatePatternCount(Pattern, Text, d) == out

We now make a final attempt to find DnaA boxes in the region of the E. coli genome that the minimum skew suggests is ori. Although the minimum of the skew diagram for E. coli is found at position 3923620, random fluctuations in the skew mean that we should not assume that ori lies exactly at this position. To remedy this issue, we could choose a larger window size (e.g., 1000), but expanding the window risks bringing in other clumped 9-mers that do not represent DnaA boxes but appear in this window more often than the true DnaA box. It makes more sense to try a small window either starting, ending, or centered at the position of minimum skew.
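The three window placements can be expressed as simple slices. The sketch below is only an illustration: the function name `candidate_windows` is hypothetical, positions are assumed to be 0-based, and the genome is treated as a linear string (wraparound on a circular chromosome is ignored).

```python
def candidate_windows(genome, position, length):
    # Return three windows of the given length relative to `position`:
    # one starting there, one ending there, and one centered on it.
    # Assumption: 0-based positions on a linear string; window starts
    # are clamped at 0 rather than wrapping around.
    starts = {
        "starting": position,
        "ending": max(0, position - length),
        "centered": max(0, position - length // 2),
    }
    return {name: genome[s:s + length] for name, s in starts.items()}


# Toy example; a real call would pass the genome string, the position of
# minimum skew, and a window length such as 500.
windows = candidate_windows("AACCGGTTAC", 6, 4)
print(windows)
# {'starting': 'TTAC', 'ending': 'CCGG', 'centered': 'GGTT'}
```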

Let’s cross our fingers and identify the most frequent 9-mers (with at most 1 mismatch) within a window of length 500 starting at position 3923620 of the E. coli genome. Bingo! The experimentally confirmed DnaA box in E. coli (TTATCCACA) is indeed a most frequent 9-mer, along with its reverse complement TGTGGATAA (with at most 1 mismatch):

aatgatgatgacgtcaaaaggatccggataaaacatggtgattgcctcgcataacgcggt atgaaaatggattgaagcccgggccgtggattctactcaactttgtcggcttgagaaaga cctgggatcctgggtattaaaaagaagatctatttatttagagatctgttctattgtgat ctcttattaggatcgcactgcccTGTGGATAAcaaggatccggcttttaagatcaacaac ctggaaaggatcattaactgtgaatgatcggtgatcctggaccgtataagctgggatcag aatgaggggTTATACACAactcaaaaactgaacaacagttgttcTTTGGATAActaccgg ttgatccaagcttcctgacagagTTATCCACAgtagatcgcacgatctgtatacttattt gagtaaattaacccacgatcccagccattcttctgccggatcttccggaatgtcgtgatc
aagaatgttgatcttcagtg

You will notice that we highlighted an interior interval of this sequence with darker text. This region is the experimentally verified ori of E. coli, which starts 37 nucleotides after position 3923620, where the skew reaches its minimum value.

We were very fortunate that the DnaA boxes of E. coli are captured in the window that we chose. Moreover, while TTATCCACA represents a most frequent 9-mer in this 500-nucleotide window, it is not the only one: GGATCCTGG, GATCCCAGC, GTTATCCAC, AGCTGGGAT, and CTGGGATCA (along with their reverse complements) also appear four times with 1 mismatch.
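The text does not show how these most frequent 9-mers were computed. One possible approach, sketched here as an assumption and not as the method actually used, scores each candidate k-mer by its approximate occurrences plus those of its reverse complement. All function names are illustrative, and the input is assumed to contain only the letters A, C, G, and T (in either case).

```python
from collections import Counter


def hamming_distance(p, q):
    # Count the positions at which two equal-length strings differ.
    return sum(a != b for a, b in zip(p, q))


def reverse_complement(pattern):
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(pattern))


def neighbors(pattern, d):
    # All DNA strings within Hamming distance d of pattern.
    if d == 0:
        return {pattern}
    if len(pattern) == 1:
        return set("ACGT")
    result = set()
    for suffix in neighbors(pattern[1:], d):
        if hamming_distance(pattern[1:], suffix) < d:
            # A mismatch is still available for the first position.
            result.update(base + suffix for base in "ACGT")
        else:
            result.add(pattern[0] + suffix)
    return result


def frequent_words_with_mismatches_and_rc(text, k, d):
    # Score every candidate by its approximate occurrences plus those of
    # its reverse complement; return the top candidates and their score.
    text = text.upper()
    counts = Counter()
    for i in range(len(text) - k + 1):
        kmer = text[i:i + k]
        counts.update(neighbors(kmer, d))
        counts.update(neighbors(reverse_complement(kmer), d))
    if not counts:
        return [], 0
    best = max(counts.values())
    return sorted(p for p, c in counts.items() if c == best), best


print(frequent_words_with_mismatches_and_rc("AAAA", 2, 0))
# (['AA', 'TT'], 3)
```

Calling the function on the 500-nucleotide window with k = 9 and d = 1 would list every 9-mer tied for the highest score, which is one way to check the ties reported above.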