## **WHERE IN THE GENOME DOES REPLICATION BEGIN? (CHAPTER 1)**

#### **FINDING THE ORIGIN OF REPLICATION 1.2**

The **Input** is a DNA string *Genome*, and the **Output** is the location of the origin of replication (*ori*) in *Genome*.<br>

However, this is not yet a well-defined "Computational Problem", because we have not specified what we are actually looking for. We therefore need a more precise definition of this "origin of replication".

One approach is to treat DNA as its own language and look for "words" that appear unusually often. Counting how often a given word appears is the "Pattern Counting Problem", which can be solved computationally. Pseudocode could be as follows:

**PatternCount**(*Text*, *Pattern*)<br>
count <- 0<br>
**for** i <- 0 to (|Text| - |Pattern|)<br>
**if** *Text(i, |Pattern|) = Pattern* <br>
count <- count + 1

return count



In [1]:
def PatternCount(genome: str, pattern: str) -> int:
  count = 0
  # range() excludes its upper bound, so we add 1 to include the last valid start index
  for i in range(len(genome) - len(pattern)+1):
    if genome[i:len(pattern)+i] == pattern:
      count += 1
  return count

In [None]:
# Testing our code
text = "ACTTGACT"
pattern = "ACT"

count = PatternCount(text, pattern)
print(count)

# after adding "+1" to our range of i, we have the correct answer

2


In [None]:
# Now we are testing overlaps
text = "GCGTAGCGCG"
pattern = "GCG"

count = PatternCount(text, pattern)
print(count)

3


In [None]:
# now testing a given txt file
txt_file = "GACATGGTTACCTATCCGTATGGTTAATGGTTAATGGAGATGGTTATTGTCGAATGGTTAACATGGTTAGGCCATGGTTAAAGATGGTTATATGGTTAGTATGGTTAATGGTTACATGGTTACATGGTTATCATGGTTAGTGTGCTGTATGGTTACGATGGTTACCCAGATGGTTAAATGGTTAATGGTTATATGGTTATCCAATGGTTAATATGGTTAATGGGGGAGATGGTTATGGTATGGTTAAGTAGGATGGTTATAATGGTTATTATGGTTAGTGCGCATTTTCCATGGTTATAATGGTTACAATGGTTATGTGGATGGTTACGTACACCATGGTTAGAATATCATGGTTAAAGAGATGGTTAGCCAATGGTTACCAAATGGTTAATGGTTAATGGTTAATGGTTAGATATGGTTAAATGGTTAATACAATGGTTAAATGGTTAGTAATGGTTAGATGGTTAAATGGTTATCATGGTTAATGATGGTTAATGGTTAGTCACGACATGGTTAATGGTTAAGATGGTTACATGGTTAATGGTTATACCCTGATATGGTTACAAATGGTTAATGGTTAGGATGGTTACCATCAAATGGTTAGAATGGTTAATGGTTATATGGTTAATGGTTAAGGCGTTATGGTTAAAAATGGTTATATGGTTATGTGATGGTTAATGGTTAACAAGAGATGGTTAATGGTTACTTCATGGTTATGGATGGTTAACTATGGTTAACCAATGGTTATGTGATACCGCAATGGTTAGAATGGTTAATGGTTAGAATGGTTAATGGTTATTCCTATATGGTTATAAATGGTTAAATGGTTATTATGGATGGTTAGACGAATGGTTACATGGTTATATGGTTAAAATCTAATGGTTACATGGTTACGCATGGTTAAGCATGGTTAATGGTTAGGGATGGTTAATGGTTATCGGTATGGTTAAGTATGGTTAATGGTTAATGGTTATTAATTATGGTTAATTTAATGGTTAAATGGTTATATGGTTAATGGTTAATGGTTA"
pattern = "ATGGTTAAT"

count = PatternCount(txt_file, pattern)
print(count)

28


In [None]:
# QUIZ QUESTION 1
txt = "ACTGTACGATGATGTGTGTCAAAG"
pattern = "TGT"
count = PatternCount(txt, pattern)
print(count)

4


In [None]:
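def PatternMatching(genome: str, pattern: str) -> list[int]:  # start indices of pattern in genome
  return [i for i in range(len(genome) - len(pattern) + 1) if genome[i:i + len(pattern)] == pattern]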

**Frequent Words Problem**<br>
Now that we have an algorithm for counting a specific pattern in a text sequence, we will look at finding the most frequent pattern in a given text sequence. This shows us which patterns appear often and may offer some insight into what can be found near an origin of replication.

Frequent Words takes as **Input** a string *Text* and an integer *k*, where *k* is the length of a possible pattern. The **Output** is a list of the most frequent k-mers in *Text*.

In [None]:
# FrequentWords builds a list "Count" where Count[i] is the number of times
# the k-mer starting at index i, genome[i:i+k], occurs in the genome.
def FrequentWords(genome: str, k: int) -> list[str]:
  FreqPatterns = []
  Count = []
  n = len(genome) - k + 1
  for i in range(n):
    pattern = genome[i:k+i]
    # this is the bottleneck:
    # we often recompute the count for a pattern we have already counted.
    Count.append(PatternCount(genome,pattern))
  MaxCount = max(Count)
  for i in range(len(Count)):
    if Count[i] == MaxCount:
      FreqPatterns.append(genome[i:k+i])
  # set() removes the duplicates produced by repeated occurrences
  return list(set(FreqPatterns))

In [None]:
genome = "CGGAGGACTCTAGGTAACGCTTATCAGGTCCATAGGACATTCA"
k = 3

FreqPatterns = FrequentWords(genome, k)
print(FreqPatterns)

['AGG']


In [None]:
# QUIZ QUESTION 2
txt = "CGGAGGACTCTAGGTAACGCTTATCAGGTCCATAGGACATTCA"
k = 3
print(FrequentWords(txt, 3))

['AGG']


In [None]:
# QUIZ QUESTION PATTERN MATCHING
def pattern_matching(pattern: str, genome: str) -> list[int]:
    start_positions = []
    if not pattern or not genome or len(pattern) > len(genome):
      return start_positions

    # searching down the genome for the pattern
    # the last starting position will be the length of genome minus length of the pattern
    for i in range(len(genome) - len(pattern)+1):
        # add the index to the list of start positions if pattern is found
        substring = genome[i:i+len(pattern)]
        if substring == pattern:
            start_positions.append(i)

    return start_positions

In [None]:
text = "AAACATAGGATCAAC"
pattern = "AA"
print(pattern_matching(pattern,text))

[0, 1, 12]
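
As a cross-check on the loop above, overlapping matches can also be located with a zero-width lookahead from Python's `re` module. This is only a minimal sketch, not the course's method; the helper name `pattern_matching_regex` is an assumption made for illustration.

```python
import re

def pattern_matching_regex(pattern: str, genome: str) -> list[int]:
    # Hypothetical helper: a lookahead (?=...) matches without consuming
    # characters, so overlapping occurrences are all reported.
    if not pattern:
        return []
    lookahead = re.compile(f"(?={re.escape(pattern)})")
    return [m.start() for m in lookahead.finditer(genome)]

# Same input as the test above
print(pattern_matching_regex("AA", "AAACATAGGATCAAC"))
```

If both approaches return the same positions on a few inputs, that is reasonable evidence the index arithmetic in the loop version is right.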


The function **FrequentWords** above works, but it is extremely slow. The limiting factor is the length of the genome: at each of the |Genome|-k+1 starting positions we run **PatternCount**.<br><br>
Each call to **PatternCount** scans the whole genome and compares the pattern of length k against the window genome(i, k) at every position i, so a single call takes roughly (|Genome|-k+1)⋅k steps.<br><br>
Because **PatternCount** is called n times, where n = |Genome|-k+1, the total runtime is (|Genome|-k+1)^2⋅k, which has an *upper bound* of |Genome|^2⋅k.
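
To make the quadratic growth concrete, the sketch below counts how many window comparisons the naive approach performs for a few genome lengths. It is an illustration only: `naive_comparisons` is a hypothetical helper, and it counts comparisons rather than measuring wall-clock time.

```python
def naive_comparisons(n: int, k: int) -> int:
    # FrequentWords calls PatternCount once per start position (n - k + 1 calls),
    # and each call compares the pattern against n - k + 1 windows.
    windows = n - k + 1
    return windows * windows

for n in (100, 1_000, 10_000):
    comparisons = naive_comparisons(n, 9)
    # Each comparison costs up to k character checks, giving the extra factor of k.
    print(n, comparisons, comparisons * 9)
```

Making the genome ten times longer raises the comparison count roughly a hundredfold, which is why the naive approach becomes impractical on real genomes.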

**BETTER FREQUENT WORDS**<br>
A more efficient approach to the Frequent Words problem uses a **frequency map**. We make a single pass through the genome; instead of calling **PatternCount** to sweep the whole genome for the pattern at each index, we simply add 1 to the count stored for that pattern (creating a new entry the first time we see it).
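
The frequency-map idea can also be sketched with `collections.Counter` from the standard library. This is one possible formulation, not necessarily how the course builds it; the function name `frequent_words_counter` is an assumption.

```python
from collections import Counter

def frequent_words_counter(text: str, k: int) -> list[str]:
    # One pass over the text: every k-mer window is counted exactly once.
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    if not counts:
        # text is shorter than k, so there are no k-mers at all
        return []
    top = max(counts.values())
    return [kmer for kmer, count in counts.items() if count == top]

print(frequent_words_counter("ACTCGAATCG", 3))
```

`Counter` handles the "first time seen" case internally, so no explicit membership check is needed.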

In [None]:
# PSEUDO-CODE FOR "BETTER FREQUENT WORDS"
"""

BetterFrequentWords(Text,k)
    frequentPatterns <- empty list
    freqMap <- FrequencyTable(Text, k)
    max <- MaxMap(freqMap)
    FOR all string "pattern" in freqMap:
        IF freqMap[pattern] = max
            append pattern to frequentPatterns

    return frequentPatterns

"""

# First we must make a sub-function -> MaxMap


In [None]:
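def MaxMap(freqMap: dict) -> int:
  # return the largest count stored in the frequency map (0 if the map is empty)
  max_count = 0
  for pattern in freqMap:
    if freqMap[pattern] > max_count:
      max_count = freqMap[pattern]
  return max_count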

In [None]:
# testing MaxMap with a simple freqMap -> SUCCESS

test = {"ATA":3, "CTG": 5, "AAT": 1}
# named max_count so the built-in max() (used by FrequentWords) is not overwritten
max_count = MaxMap(test)

print(max_count)

5


In [None]:
# Testing Duplicates -> SUCCESS
test = {"ATA":2, "CTG": 2, "AAT": 2}
# named max_count so the built-in max() (used by FrequentWords) is not overwritten
max_count = MaxMap(test)

print(max_count)

2


In [None]:
# Now we will create a frequencyTable function

def FrequencyTable(text: str, k:int) -> dict:
  frequencyTable = {}
  n = len(text)
  for i in range(n-k+1):
    pattern = text[i:i+k]
    if pattern not in frequencyTable:
      frequencyTable[pattern] = 1
    else:
      frequencyTable[pattern] = frequencyTable[pattern] + 1
  return frequencyTable

In [None]:
# simple test to see if FrequencyTable is functioning properly
test = "test"
k = 1

freqTable = FrequencyTable(test, k)
print(freqTable)

{'t': 2, 'e': 1, 's': 1}


In [None]:
# NOW we can start creating our bigger function: BetterFrequentWords()
# INPUT: string TEXT and int K.
# OUTPUT: list of most frequent k-mers in TEXT.

def BetterFrequentWords(text: str, k: int) -> list[str]:
  # using our sub-functions to create our table and maximum count in our table
  frequentWords = []
  freqTable = FrequencyTable(text,k)
  max_count = MaxMap(freqTable)

  # now we can compare the count of each "word" in our freqTable to max_count,
  # appending the patterns whose count equals max_count.
  for pattern in freqTable:
    if freqTable[pattern] == max_count:
      frequentWords.append(pattern)

  return frequentWords

In [None]:
# simple test to see if BetterFrequentWords is functioning properly
test = "test"
k = 1

freqWords = BetterFrequentWords(test,k)

print(freqWords)

['t']


In [None]:
# testing for multiple most frequent words
test = "tomato"
k = 1
freqWords = BetterFrequentWords(test,k)

print(freqWords)

['t', 'o']


In [None]:
# testing with a greater number for k
test = "ACTCGAATCG"
k = 3
freqWords = BetterFrequentWords(test, k)

print(freqWords)

['TCG']


In [None]:
# testing with greater k and duplicates
test = "GCGTTACTGCGTGA"
k = 3
freqWords = BetterFrequentWords(test,k)

print(freqWords)

['GCG', 'CGT']


In [None]:
test = "AGGCAGAGTTAATTGCGCCGGTTTAATTGAGGCAGAGTATTAAGGACTAAGAACATATTAAGGACTAAGAACATTAATTGTTAATTGTTAATTGTATTAAGGATTAATTGCTAAGAACATTAATTGTTAATTGCTAAGAACATATTAAGGAAGGCAGAGAGGCAGAGCGCCGGTTTAATTGCGCCGGTCGCCGGTAGGCAGAGAGGCAGAGTTAATTGCGCCGGTAGGCAGAGCTAAGAACAAGGCAGAGTATTAAGGATTAATTGTTAATTGCTAAGAACACGCCGGTCTAAGAACAAGGCAGAGAGGCAGAGAGGCAGAGTATTAAGGAAGGCAGAGCTAAGAACATTAATTGCGCCGGTTTAATTGCGCCGGTCTAAGAACACGCCGGTAGGCAGAGTATTAAGGATATTAAGGATATTAAGGACTAAGAACACTAAGAACATTAATTGCGCCGGTTTAATTGTTAATTGCGCCGGTCGCCGGTTATTAAGGATTAATTGCGCCGGTCTAAGAACACGCCGGTTTAATTGAGGCAGAGTATTAAGGATATTAAGGACGCCGGTTTAATTGTTAATTGTTAATTGTTAATTGTTAATTGAGGCAGAGAGGCAGAGAGGCAGAGTATTAAGGATATTAAGGATTAATTGTTAATTGTTAATTGAGGCAGAGAGGCAGAGTTAATTGCGCCGGTTATTAAGGATATTAAGGATTAATTGAGGCAGAGTATTAAGGACTAAGAACATATTAAGGACTAAGAACACTAAGAACATATTAAGGATTAATTGCGCCGGTTTAATTGAGGCAGAGTTAATTGTTAATTG"
k = 12

freqWords = BetterFrequentWords(test, k)

answer = " ".join(freqWords)
answer

'TTAATTGTTAAT TAATTGTTAATT AATTGTTAATTG'

#### **SOME HIDDEN MESSAGES ARE MORE SURPRISING THAN OTHERS 1.3**

As background, we will start using k = 9 when searching for k-mers, since bacterial DnaA boxes are usually 9 nucleotides long. We will also start looking at the **complementary** DNA strand to see whether the complements of frequently appearing k-mers also appear frequently.<br><br>

Knowing how often a k-mer occurs on both the primary and complementary strands gives us greater insight into which nucleotide sequences are common around known origins of replication. This in turn can help locate origins of replication in other bacteria that have similar DnaA boxes.
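
One way to act on this is to count a k-mer together with its reverse complement, since the same box may be read from either strand. The sketch below is an illustration under stated assumptions: the helper names `reverse_complement`, `count_occurrences`, and `count_with_reverse_complement` are hypothetical, and `str.translate` is used here in place of a character-by-character loop.

```python
# Translation table mapping each nucleotide to its complement.
COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")

def reverse_complement(pattern: str) -> str:
    # Complement every base, then reverse the result.
    return pattern.translate(COMPLEMENT_TABLE)[::-1]

def count_occurrences(genome: str, pattern: str) -> int:
    # Count overlapping occurrences of pattern in genome.
    k = len(pattern)
    return sum(genome[i:i + k] == pattern for i in range(len(genome) - k + 1))

def count_with_reverse_complement(genome: str, pattern: str) -> int:
    rc = reverse_complement(pattern)
    total = count_occurrences(genome, pattern)
    # A pattern equal to its own reverse complement must not be counted twice.
    if rc != pattern:
        total += count_occurrences(genome, rc)
    return total

print(reverse_complement("ATGATCAAG"))
print(count_with_reverse_complement("ATGATCAAGCTTGATCAT", "ATGATCAAG"))
```

The guard for self-complementary patterns such as `GATC` matters; without it those k-mers would be reported at twice their true frequency.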

In [None]:
def Complement(pattern: str) -> str:
    # creating an empty string
    complement = ""
    for i in range(len(pattern)):
        # check the nucleotide and concatenate its complement:
        # A <-> T and C <-> G
        if pattern[i] == "A":
            complement += "T"
        elif pattern[i] == "T":
            complement += "A"
        elif pattern[i] == "C":
            complement += "G"
        elif pattern[i] == "G":
            complement += "C"

    return complement

In [None]:
# testing complement function -> SUCCESS
test = "ACTTG"
complement = Complement(test)
print(complement)

TGAAC


In [None]:
# ReverseComplement consists of two steps: a subfunction that substitutes the complement nucleotides, and a reversal of the resulting string.
# INPUT: DNA string Pattern
# OUTPUT: reverse complement of Pattern

def ReverseComplement(pattern: str) -> str:
  complement = Complement(pattern)
  # python specific method for obtaining the reverse of a string
  reverse_complement = complement[::-1]

  return reverse_complement

In [None]:
# testing ReverseComplement function -> SUCCESS
test = "GCTAGCT"
rc_pattern = ReverseComplement(test)
print(rc_pattern)

AGCTAGC


In [None]:
# testing given data set on ReverseComplement
pattern = "CGAAGAAGGCTGGACACGAGCCAGATGGCCGGACCTGAAAGGGGACATGATAGACGTACTCGAGTCGCGTCGATTATTTTAGCAACTTTCCAAGTAATAAACGGTGGCTGTCGCGCGATAAGGCGCGTTACCGCCAGTAGAACTCCCGTGGCTCTCACGAATGGACGCCGAGTATCGCATGTAGAGCACCCCCCACCCATGATTTGCATGCGATCGCCTTGCATGAACGCCTTCATACATGCGGCAATAGCAAGCTCCTTAGTAGTGGTACTGGCGGTACGAATATTGGGCACCTAGGATAACAGATCCGACATCCTTCGAGACCAAACTCAGGTTACCTACAACGATGTAGAGTGTGGGTTAGCGAACGCAATTAAATCAAATCTATGCTCGCTCTCTTCTCCGTGACTGCTGTCGAATTCTTGGACTCTAAGAGAGCGCTCGATCTAAAGTCGCGTTACTGATTGCGCCCAAAGACGCCTTCCCGTATAGACCCCAGACAAAGATAGAGAGAACCCTCACACAACATACCAGAAGCCCTAAGGATCGGATACAGTCAGCTATAAACATAGCCTCACAGAGCTATTCCTCTGGCCCAGGATGAGAGGGTGGTCTCTATAGTAGCGCTTGTAGTATAGTCCCTACTTACCTGTGGATGCACGGGGCACTGTAGCGAGACTGCAGAGTAATTAAAGTCCCCGACATAGATACACCTGGTAAACGTGGAATAGTGATCGGATAATGACCCTAGGCTCTGCACCGAAAGCCGTTCTACCTAGCAAAGAAAGTGATAAGATAAGTACTCTAACGTTTCCCTCGACTGACACGGATCAGTGCACGTGCATGACAGGCTGTAACTCGCGGGTCAATAATACATACCAGGCTCGCGCTCATGAAACGTCATAGCAGAACTCATTGGAGTAACCCGAGCAGTATTTACATCGGTCTACCTGAGTAATAAGAAATTCCTCGAACACCTAGCGGGTATGTTACACGGCTGCGAGATTCTTCGCTAGCGAGTTTTCACTAACATACTCTATCCTCAGAGCTTGCTAGACACGGACCATAAATGCGCGAGAGGGTGTTTCACTACTAGGTAAGGTACATTTAACAGTACTTAACCCGAGCATATGACACCCTCAACGAAGATCACCGCAGTGCGTGGCTTCCTGGTCCAGCTCCTATCTCTAGACGTGCAGGCACGCTCGCCGCTGGGCTTAACAGGGGGAGGTTGAAATACCTAATTTCTTTAACCCAGATGACTACTCAGGGCATCCGTAGTTATCGTTCGAGTTTTGAGTGCTAGAATCGCTCCGAGTGCCTCTGGTCAACTTCAAAACCGCGTTACAAATATCAGAGCCAAATCGTATATCTAATAAGCTACCTAGGCCGGTCCACGATAGTACAGGCTTAACACGGGACACGCGGAAGCCGTACCTATTAAACGAAGAGTGCATCTCTTTCTTTTGAGCCAGACCGCCTGCCGGGACCGTTAAGATAGAGTTTCTTATCTCCTTTATTGCTGATTTTAGACCGGAATTTTAAATCTTCGTTTCCGAAGAATAATAAAGGTGCGCCCTATGCGTGACGACGCCGGATCGGAGAAGCATAAATTTGTCTTTCCACGATTGTCAGTGTTTGTATATTACAATTGGTCCGACGTGCCGACAATAGACGTGCGGCACACGATTAAAGCATGTATAGTTGCGACAAGTTTAGTACCATAATGCACCGCGACCAGTAATTCGTAGTGTGTCAGCTATAGCACATCTGTTTTCTTATACTGCAGGCGCCATGTATGTTTCCTGTTCCGCCGCCCCTAATGCACGCCTGACTATGCTTTACATACAGTCGGTATTTCTGTGGTGGCATATTACAGCCCGTGCCCACCTAAGGCACCTTACAAATGATGGGGGCAACGACACTTCTTTCTACATAAATCGCATGTAGCTCCGACTCTGGCCAACTCTCCAGCGATCACGCAAACCT
TTCTCGGACATTCAAATTTCCAGGTACAGGATGCGTCGGGAGCGCACCTCTGCAGTTGCCCTGTACACTGTTGCAAATCACTAACGGCATGCCTCCTTTACGGTCACCCCACTCTAGTGAACGGAATTGTTAAAGCGCGGTCAAGAGTCATTTGTCGGTCTTACTTATTCTATTGCTCGGGCGTGTCTAGTATGACTTACTTCCCTGGAACGTGTCGTTGCTGCCATCAATTCAGGCTGCGTTCCGACAAACGCCTCGGCACTCAGGTAGAGATTTAATGGGTTGCTGCTTCAGGGTGACATGGGGCACAATGTTTGATCGGTTTGAAGTATGAACCTTCCCGTACAGATGATAATGCTTCACAGATTGTCTAGACAATAATCACGGGCTTACGAATCATCTCGCAGGCCAACCGACCACAATATGTTACATGTCATCAAGGGTTCACGCCACTTGGAGGGGTCCGTCATAGGCTCGTGTACCACAAAGAGACATCTAGATTAAAATCTGACCTGGTGCCGGTCCCATACTGCCCTGTTACCAAGTGCCATTCTAATTTTTTATCATCGCACTCCAGGTCACTTAGACGGAGTTGGTTACTCTAACTGTTATTGCTCACCGAACAAGTCAGTAATGACTTGTTGTGACACTCCTAATAAATTGAGTGCGTAGATATCCTTGATAACGGAGTCGAGAGGGGGAGTTGTTGCTATCGAAAGACCGGCATTACAGGCTAAGCACGAATATTTCCTGGTCGATCCGGAAAGGATCGCAAGCGTATGTTGGTTCCGTGTTGCCTTTTGGTAAGTCTGTACCTGCCGTTAGCACGCTTGGATCGCTTATACTTAGCCAGGACACGCGCGACCGGTGAAATTACGATTGTTTCACTAATACCGGAGTACATGTCCGTAGCCATCGGTATCACATGCAGAAAGATGCAGCATAAATGTGATTTCGGATGTGTCACACGGTGGCGTGTTGGTCCAGGATTTCTAGCTCGTAGTGTGTCATCACTCCCGCGATCATAAGACCACACGCTGATGTGAACTACCGCTCACAGAAAGTTTTGGCCATTATAAACTCTCGAGAGGCGAAAGGTTCGAGGGCTTAGGGGATGAACGGCAGGTCTACGCTAAGTGGGGGTCTTCAGGAGTGCTCATCTGACCACATCCTTGAAGGAATCGCCAGAATTGGCCCCTTCGACGCCCGCTCCCTGCTGGACATTCGGCTACGAGTACCAAGCCAACACTGCGGCTTCCAGAAAATATAGTCTCCATTTCTGTCAACGTAGACCGTCTGCGCTATACGTCCTTGGCTGCTGTCCATTTACTCATGTTAAAGTTTGTTCCCCCCGTTCGGTTCATTGGCTTCCCCCTCGCTAGCTGGTATGGAAAATACGGCGTGGGAGTCAAGCCAATTGGTGAACTGTCTCCCGATGGTCCGTCTTCGCCCTTCTGTATGGTAGTCGAAGGCAATTCAGCTCATTCCGTTAACGGCCCCTGATGACCCGAGAAACTGCTTACTGAAGTATTGGCAGGTGAAGCTCTCTTTAACTAGCACGTTAGGGTGCAACGTGTTTCGGGCGCCTCCGGACAGTGAAGTGTGCGAGTTTCGAATCTGACAGTGCGAGATACACTGTTCTATGAAGCATCTATGGTCAGGGTCTATGCGGGGTGTCCTTCGTTGATACTCTTGAAACGGTTCCATGGTGTCGAAGGACTTTTGTGCTGGGACCCTGACACCGGCCAGAGTGCGACACCCTATTTCGCTGACCCCGTACACCGCATCCAAGGACCTAAATGCCATTGAAAAGTACCGTACGAGACGGTAAAAATAAAAGCAAGAGTGACACAGTGGTTCTGTGACGGGATTGTTGGCACGAGCGGTCAAACCTTGTCTTAGAACCTGGCCAGTTTTTGCTATTCACTAGTTTAGTATAGCCTTGTGCTTGACGTTCGGTGTTAAGCACTCTCTAGCGAATAAGGTGGCGCTCGCACGCC
CTACCGAGGTGTATATCAAACGGGAAAAAATCGGGTACCCTTCTTCAACCAAATGGAAGCCTCTAATAACAAGAACGAACATTCCCCTTTCGGAGATAGTTTTCGTCTACCGCGAGAGTATGCCGTACCAGTAGGCATAGGGTTGGAAATCTGCACTACAAGCCTCTGAGCTGCCTACAGTCTGCTTCCAGGGGGGAGACCCAGTGGGCAATTTGTTCAAGGTATGGCCTATGACTCGACAGTTACAGCGACCACGGTAGTTAGCTCTAAAAAAGCATGACTTCCATGATCGTCTGGTCGTACATGTTTATTAGCCAATTAGTCCGAACAAGGGTAACGACAGAAAATGTTGCCGCTTAGTGGCAAATTATATGGGGTCCTCAAGGACACTTATTGTATGAAAATATTAGCCTAGGGCCTTTCTTCATACTTGTATAGGGAGCTCCTTATGTCAAATTTCGAACCATCGAATCTCATACCCAGGCTGTGCGGCTAACCTTCCGAAACCGTCGGACTATATACGTACGTCATCGGCAACCACCGGACACTATAGGTTTCCACGAGACAATCCCACCGCATCCAGACAGTTGAAAACATCATGTCACAACTCGTCCCATGGGAGGTGACCGCTAGGCTGACTGATCTAGCAGTTGTCGAGTTCGTTGGAATAGAGAGGCAGGTTGTGTCAAGAGGGCGCCGATTCCGTGTGGATCTTTACTATATTGAGATCATGCGGTACCTATGACCAACGGGCTATGGGAAGAATACTAGAAGTCTGATAGTTCAACACACTCATCCAACCAGTCAAGAATACTGGGGGTATCCCCCCTAACCGAAACCGTGCCTCTCTTATAGTAGCAGAGGCGGGGATATTTGCAGTTTTTCCAGCACCACCCGGACAGGACGTGAATTTACGGTTTTAACACTGGCTCTGTTACATGCGTAGGGACTCCTCCCACTAACCTAATGTCATCCCTCGTAGAGTCCTTCCCGCTTGCGCGTTCTATATCTGAACAGAGTCGCGTCAACCCACACCGTCTCCTGTACTATTGACATCGAGACCTTGACGGCGAAAAAGGTGACGCGACGCCAGGGTAAGCCTAATCAATGCAACGCAAGCGTTCATGCAAGATAAGCACCCGACTTGAAACGCATGGTCATTCAGTGAACCATTACTGTATCTTACTCAGGTCAGCCCAGCTATTAGATATCGTCTATGTGTCGGCTGAACATTGACTCTCTTTCGGGCTCCGTGTGTCGCGTCGGCACGGAGGACGAGATTGCGCACTGATAAATCGCTATGGGTCGGAGTGCAAAAACCGTTTTGGGTATCTTACGGGTACGTCAGAAAGTCGGGAAGCGCACTAACACAAGGTGAGACATAGCCGCTAAGCAAATCAGAGAGCAGAGCTTCAGTATACGGAGTGAAAATGTCTGAGATTTGTACAGTGCCCCTGGCAGTGCTAACAAGCCAGTGGGTGCCATGTCACGGGGTTAGTATGTGTCTCCCGTGTCGGTTGTGGACACATTAGACTCATTGTAGCGCGGCGTATATTCCCTAGAATCTGCTTGCCGAGCGCGTGGCCCCGGGTCCGCAGTTCAAATTCTTTTACTATCTCTGGAGACTTTACCTTGAAGGGATTGGACCTATAACCAGGAACTCCATCGCCACACCCGCCAATCTATATGTTGCACGCCCCATGATCGTTGAAAATGTCGGCAAAATCCCCGGCGCCGTTACGGTGCCAGCATTCGTACAAAAATAGGTTAAGAAACCAAGGTCGGTATTTCCGTTACCGGCGACCCGGCTGCGCTTTCAGAAGATGATGATGCTCCATAGTATGGCTTGTCCACGAATTACATACGGTCCAGAAGGGTACCTTTCTGTAAACTAAAGCCCCCGGCGTCAATGAGCTCAATAAGGCCACCCCGAACCAGCAGTTGGCCCATTCGTTATTTTTAGGATTGTATACATTTATTCTCGGAGTAGTTGCAATAAA
TCACTCTACAACGTAATTACTCGCTCTTACGCACCTTTGCCTCTGAACTCGCAGATTGGGCGACAGGCTATCCAACATTCGGTGGCCCGTTGATCGACTATCGGTGGTTTGAAGATTGACTAATCGCATCCGCTCTTCGAGGGTGCCATGTTGGCAAAGTGTAGCCACGCCACCCCTCGTCCGAAGTACACACGACATACTTATTACCTGTTTTAGAGTCAGTCTGTGCTAATACCCAACATATTGTTGAGGCATGCATCACCGAAGAGGATGGAACATGTGATCGATCTAAACTCCTTTATTAACATGCAGACTGCTGTACTGAACAGGTGGCCGGCAGCGAGAGCCATCCCTCGTGTGAGCGGCTCAACCAAGATCCCACTTTGACCGTGTCTTACCTCACCAAGACCTATAGTCCAGGCTTATAGCGTTTCTCAAATATCCTCGGGTGCCCTCAACCAAGGTGTTACATGCCCAGCGGGGCTCGTAATCCAACTAGTTAGTAGTCTCGTGTTTTATCGCGATACAGCACGAGTATTTGGGTGCTCACGGGACCAGGGTCACGTTAAGTAGGTTGTTGCGGTTGAACAATACCCTATTTAGGACAACCCGGTTCGGTCCCTCTCGCGAGGGGCGAACTCCCGCACACAACCGAAGATAACTTGCACAAGCGGTCATGTCGCTAATTGAATGGTTCCTTTCTGGCCTCAGACGTAGCACAGGTAGAACGGGTTCCTACACTGTAGATCTGCGGCTGAATGATAATTTCGTGGATTAAAACGAGATTATAATTCAAGACGTGACGTAAGTTTCAGTATTAAGTATCAAACTAGCAGTTGGTGACAGATGGCGACTAATTCTTCATATTCTTTACTATCACTGCCTTCCATGAAAGCCTCAACTATCGCTGATAGAAACGAAGTCCGGGGCCCATCTCAGGGCGCGAGCTATATTATCCAGCGCATCGACTCAAAAGGTCAAACCTCAGATATTTTTGTACAATTGAAACACCGCGCATCTTATGTGGCCTGGGATACATCCAGATTTAAACGAAATACCTAGATTCCCGACATCTATTCGACGGAATAATTAGGGAAAAGTTTTGCGTGTAGAATCCGTTGATCTGTAAAGTAAAGATAAGTGTACTTGAGGGCGAAACGTGCGGGATTCGCCGGCGGGCCAAGGGCGTTGCAGAGGAACCCACACAAGTGTGCATTTTGCGCGGACTACCATGATTTGTATGGCATCTATTACGCGTCTAGTTCGGTTGCAGCCGGCACTGTGCTATTATACTATTTTACAAAATTCAGCCCCAGAGGGAGGTTATTAGTGAAGTAACAACACCTAACTAGCGATGGCCTAGCGTGGAACAAAGCGGCGAGAATCCCATTTAGATTGGTATAGCCTCCTTTGTCGCCCCTACGTTGACCCCTGGCCCGTCATTATAGCGAACTGGCCCGGATCGCGCGCTTAGTGGAACTAGACAGATTAACCCTTTACGGTAATGTCAATTTGCCACACCCGTGAGTAGTTGTGTGTTTACTTTTATCTCCCACCGTTACTCTCCCTGACTAATATCTGGCCTACAAAGGAACCTCTACTTCAATCTAAAAGATCGAAGGTCGCGATCTGGTGACGAAGTATTAAGTCGCCACAGAGGCCTCTGACCGATTAAGGGGTTCTCGGCTTGGGGTAAGAGGTTGTTTACGCCTAAGTGACACACCAAGCGCATCGTTTATTTAGGGGATTAACACCTCTGGGGCCACGTGAATGTCATCTCGGCGCCCGATGGTTCCGGATCGACTATAGATGTACCAAGATATGCTTGGGCATTTGGTGATACTTATTCCCTGAAGCAGTCCAGCGACAGCTTGCCACGAGAACGGCTGTCGGAGCCAGCTGATACCGGTACTTGGGACCTGCACGGCGTCTTAGTTCAGAGTTGCGGCGTGTGTGTAGAATCATAGCAAGAGCTCTCAATAAGTATCGGCATAACGGCT
TTGGGAACGCTCTCTCATAGCGCGTTTGATTTCAGCAACCACAGGCGTGACAGTCTGTTTAACAGCCTCCCACATTTGGAGTTACCTCAGTTCCTAGATAATACGTGACCAAGTGCTGGGACCTAACGCGGAGCCTTAGTGCTCCTCAGAGCATCCTGAAGGCGGGGTCTTCTGTCTCTCACTACTATGGTTTGGTAAAATCGTGATCAGCGGCCAGCCTGGTGTTCGGGCGTAATCACTAGATGGCATGGAGCGATCGTATACGTAAACGCGACGGTGTAGGTGGCGAGCCACCCGTTCAGGCCGGCTAGCAGATCCCCGGAGGCTGTGATTGTCGTTTACCCACGTGCGTGGTCCTGCCGCGACTCGAACATATCAAAATAGCAGGGGTAGGGTACTGCGAGTATACGGCTTGATTACGCAGCAACGACTGCGTCACGTGCACGTTTGTAACACGATCAATGCTGGGCTGCCACCGAGTCCCTCCGGAAAACAGCACGACGCGAAACCGGGTTCGGCGCGCCGACTAAAAGGCATTCAGTCTATTTGCCATTCCCCGTTCGCTCGTAACCGAAATAAGGATGTAGGCCGGGTTTGAGCGATTATCGTTCCTCAGGTCAGCCGGCTCGGAAAGACAGTATTATG"
rc_pattern = ReverseComplement(pattern)
display(rc_pattern)

'CATAATACTGTCTTTCCGAGCCGGCTGACCTGAGGAACGATAATCGCTCAAACCCGGCCTACATCCTTATTTCGGTTACGAGCGAACGGGGAATGGCAAATAGACTGAATGCCTTTTAGTCGGCGCGCCGAACCCGGTTTCGCGTCGTGCTGTTTTCCGGAGGGACTCGGTGGCAGCCCAGCATTGATCGTGTTACAAACGTGCACGTGACGCAGTCGTTGCTGCGTAATCAAGCCGTATACTCGCAGTACCCTACCCCTGCTATTTTGATATGTTCGAGTCGCGGCAGGACCACGCACGTGGGTAAACGACAATCACAGCCTCCGGGGATCTGCTAGCCGGCCTGAACGGGTGGCTCGCCACCTACACCGTCGCGTTTACGTATACGATCGCTCCATGCCATCTAGTGATTACGCCCGAACACCAGGCTGGCCGCTGATCACGATTTTACCAAACCATAGTAGTGAGAGACAGAAGACCCCGCCTTCAGGATGCTCTGAGGAGCACTAAGGCTCCGCGTTAGGTCCCAGCACTTGGTCACGTATTATCTAGGAACTGAGGTAACTCCAAATGTGGGAGGCTGTTAAACAGACTGTCACGCCTGTGGTTGCTGAAATCAAACGCGCTATGAGAGAGCGTTCCCAAAGCCGTTATGCCGATACTTATTGAGAGCTCTTGCTATGATTCTACACACACGCCGCAACTCTGAACTAAGACGCCGTGCAGGTCCCAAGTACCGGTATCAGCTGGCTCCGACAGCCGTTCTCGTGGCAAGCTGTCGCTGGACTGCTTCAGGGAATAAGTATCACCAAATGCCCAAGCATATCTTGGTACATCTATAGTCGATCCGGAACCATCGGGCGCCGAGATGACATTCACGTGGCCCCAGAGGTGTTAATCCCCTAAATAAACGATGCGCTTGGTGTGTCACTTAGGCGTAAACAACCTCTTACCCCAAGCCGAGAACCCCTTAATCGGTCAGAGGCCTCTGTGGCGACT

In [None]:
# PatternMatching lets us check other locations in a genome to see whether a pattern also occurs outside of the origin of replication
# INPUT: pattern and genome as strings
# OUTPUT: list of indices where the pattern starts in the genome

def PatternMatching(pattern: str, genome: str) -> list[int]:
  pattern_loc = []
  n = len(genome)
  k = len(pattern)

  for i in range(n-k+1):
    if genome[i:i+k] == pattern:
      pattern_loc.append(i)

  return pattern_loc

In [None]:
# simple test for PatternMatching function -> SUCCESS
pattern = "ATC"
genome = "CATCGTTATCCG"
pattern_locations = PatternMatching(pattern,genome)

display(pattern_locations)

[1, 7]

In [None]:
# # testing given data set for PatternMatching
# pattern = "GTAATACGT"

# path = '/content/sample_data/dataset_30273_5.txt'
# with open(path, 'r') as file:
#   next(file)
#   genome = file.read()

# pattern_locations = PatternMatching(pattern, genome)
# # turning our list of ints into a string separating each int with a space
# new_pattern_loc = " ".join(str(loc) for loc in pattern_locations)

# display(new_pattern_loc)

#### **AN EXPLOSION OF HIDDEN MESSAGES 1.4**

We will create a **Clump Finding** algorithm that locates k-mers, not specified in advance, that occur multiple times within a short region of the DNA with a nucleotide length of *L*. This region is our "sliding window", in which we search for k-mers that appear at least *t* times.
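
The clump definition above can also be checked without sliding a window at all: record the start positions of every k-mer, then ask whether any *t* consecutive occurrences fit inside *L* nucleotides. This is a sketch of an alternative technique, not the windowed algorithm described above; `find_clumps_by_positions` is a hypothetical name.

```python
from collections import defaultdict

def find_clumps_by_positions(genome: str, k: int, L: int, t: int) -> list[str]:
    # Map each k-mer to the (already sorted) list of its start positions.
    positions = defaultdict(list)
    for i in range(len(genome) - k + 1):
        positions[genome[i:i + k]].append(i)

    clumps = []
    for kmer, starts in positions.items():
        # t occurrences fit in a window of length L when the last one ends
        # no more than L characters after the first one starts.
        for j in range(len(starts) - t + 1):
            if starts[j + t - 1] + k - starts[j] <= L:
                clumps.append(kmer)
                break
    return clumps

# "ACG" starts at 0 and 3, "TTT" at 6 and 7; both pairs fit in 6 nucleotides
print(find_clumps_by_positions("ACGACGTTTTACG", 3, 6, 2))
```

Because each k-mer is visited once, this version makes a useful independent check on a sliding-window implementation.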

In [None]:
# CLUMP FINDING ALGORITHM
# INPUT: genome string, k int for length of k-mer, l int for length of the search window and t int for min number
# of occurrences to be considered a clump
# OUTPUT: A list of patterns that were found at least t times within some window.

def FindClumps(genome: str, k: int, l: int, t: int) -> list[str]:
  patterns = []
  n = len(genome)
  # range() excludes its upper bound, so +1 includes the last window start (n-l)
  for i in range(n-l+1):
    window = genome[i:l+i]
    freq_map = FrequencyTable(window, k)
    for pattern, count in freq_map.items():
      if count >= t and pattern not in patterns:
        patterns.append(pattern)
  return patterns

In [None]:
def BetterFrequencyTable(text: str, k: int) -> dict:
    """
    Creates a frequency table of all k-mers in the given text.
    """
    freq_map = {}
    n = len(text)
    for i in range(n - k + 1):
        kmer = text[i:i+k]
        freq_map[kmer] = freq_map.get(kmer, 0) + 1
    return freq_map

def FindClumps(genome: str, k: int, l: int, t: int) -> list[str]:
    """
    Finds all k-mers that form clumps in the genome.
    A clump is defined as a k-mer that appears at least t times in any window of length l.
    """
    patterns = set()  # Use a set to store unique patterns
    n = len(genome)

    # Initialize the frequency map for the first window
    window = genome[:l]
    freq_map = BetterFrequencyTable(window, k)

    # Check the first window for clumps
    for kmer, count in freq_map.items():
        if count >= t:
            patterns.add(kmer)

    # Slide the window through the genome
    for i in range(1, n - l + 1):
        # Remove the leftmost k-mer
        left_kmer = genome[i-1:i-1+k]
        freq_map[left_kmer] -= 1
        if freq_map[left_kmer] == 0:
            del freq_map[left_kmer]

        # Add the new right k-mer
        right_kmer = genome[i+l-k:i+l]
        freq_map[right_kmer] = freq_map.get(right_kmer, 0) + 1

        # Only right_kmer's count increased in this step, so it is the only
        # k-mer that can newly reach t occurrences in the current window
        if freq_map[right_kmer] >= t:
            patterns.add(right_kmer)

    return list(patterns)