## WHERE IN THE GENOME DOES REPLICATION BEGIN?

#### PECULIAR STATISTICS OF FORWARD AND REVERSE HALF-STRANDS 1.3

In [5]:
# SkewProblem
# INPUT: Text string of genome snippet
# OUTPUT: List of integers giving the running G-C skew (#G - #C) of each prefix of the string

def Skew(genome:str) -> list[int]:
  count = 0
  # skew[0] is always 0 (the empty prefix), so an empty genome simply returns [0]
  skew = [count]
  for nucleotide in genome:
    # if the nucleotide is G, we add 1 to our skew value
    if nucleotide == 'G':
      count += 1
    # if the nucleotide is C, we subtract 1 from our skew value
    elif nucleotide == 'C':
      count -= 1
    # A and T (or any other symbol) leave the skew unchanged,
    # but we still append the current count for every position.
    skew.append(count)
  return skew

In [6]:
# testing a given string
genome = "GAGCCACCGCGATA"
answer = (Skew(genome))
print(answer)

[0, 1, 1, 2, 1, 0, 0, -1, -2, -1, -2, -1, -1, -1, -1]


In [7]:
display(" ".join(map(str, answer)))

'0 1 1 2 1 0 0 -1 -2 -1 -2 -1 -1 -1 -1'

Next we find the minimum value within our skew list, which marks the turning point in the larger skew diagram.

In [8]:
# MinSkew
# INPUT: Genome text string
# OUTPUT: List of ints that are the indices minimizing the Skew values

def MinSkew(genome:str) -> list[int]:
  skewList = Skew(genome)
  minValue= min(skewList)
  minSkew = []

  # now we will cycle through skewList and find the indices that have the minVal
  for i in range(len(skewList)):
    if skewList[i] == minValue:
      minSkew.append(i)


  return minSkew

In [9]:
# using given test sample
genome = "TAAAGACTGCCGAGAGGCCAACACGAGTGCTAGAACGAGGGGCGTAAACGCGGGTCCGAT"
print(MinSkew(genome))

[11, 24]
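As a cross-check on the two-step approach above, the minimum can also be tracked in a single pass without storing the whole skew list. This is only a sketch; the name `min_skew_positions` is made up for illustration and is not part of the original notebook.

```python
def min_skew_positions(genome: str) -> list[int]:
    """Return every index i at which the skew (#G - #C of genome[:i]) is minimal."""
    skew = 0
    best = 0          # the skew of the empty prefix is 0
    positions = [0]   # index 0 is the first candidate minimum
    for i, nucleotide in enumerate(genome, start=1):
        if nucleotide == "G":
            skew += 1
        elif nucleotide == "C":
            skew -= 1
        if skew < best:
            # new strict minimum: discard the earlier candidates
            best, positions = skew, [i]
        elif skew == best:
            positions.append(i)
    return positions


print(min_skew_positions("TAAAGACTGCCGAGAGGCCAACACGAGTGCTAGAACGAGGGGCGTAAACGCGGGTCCGAT"))
```

For a whole bacterial genome this avoids keeping a list with one entry per base in memory, at the cost of no longer having the full skew values available for plotting.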


#### SOME HIDDEN MESSAGES ARE MORE ELUSIVE THAN OTHERS 1.4

We will be looking at DNA sequences that are SIMILAR to our target pattern, since single-nucleotide point mutations can occur within the genome. This makes our algorithm more robust by letting it find sequences that are close to our target sequence.

We can compute this difference by calculating the Hamming Distance between our target sequence and the sequence found in the genome: the number of positions i where p[i] does NOT equal q[i].

In [10]:
# HAMMING DISTANCE
# INPUT: Two strings of equal length
# OUTPUT: The hamming distance between the two strings

def HammingDistance(p: str, q:str) -> int:
  hd = 0
  for i in range(len(p)):
    if p[i] != q[i]:
      hd += 1
  return hd
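
The function above assumes both strings have the same length and raises an `IndexError` if `q` is shorter than `p`. A more defensive variant might look like the sketch below; the name `hamming_distance` and the choice to raise `ValueError` are assumptions, not something the exercise requires.

```python
def hamming_distance(p: str, q: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(p) != len(q):
        raise ValueError("Hamming distance is only defined for equal-length strings")
    # a != b is a bool, and sum() counts each True as 1
    return sum(a != b for a, b in zip(p, q))


print(hamming_distance("GGGCCGTTGGT", "GGACCGTTGAC"))  # -> 3
```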

Now we have the necessary functions to solve the "Approximate Pattern Matching" problem. We will search for patterns that have a Hamming Distance of at most "d" from our desired pattern.

In [11]:
# APPROX. Pattern Matching
# INPUT: Strings "Pattern" and "Text" along with integer "d"
# OUTPUT: All starting positions where Pattern appears with at most "d" mismatches as a substring in Text

def ApproxPatternMatching(genome:str, pattern:str, d:int) -> list[int]:
  patterns = []
  n = len(genome)
  k = len(pattern)

  for i in range(n-k+1):
    subString = genome[i:k+i]
    if HammingDistance(subString,pattern) <= d:
      patterns.append(i)

  return patterns

In [12]:
# test with given test set
pattern = "ATTCTGGA"
genome = "CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT"
d = 3

answer = ApproxPatternMatching(genome,pattern,d)
print(answer)

[6, 7, 26, 27]


In [13]:
# # file provided by cogniterra for testing
# file_path = "/content/sample_data/dataset_30278_4.txt"

# # reading each line and deleting the space between each in the txt file:
# with open(file_path,"r") as file:
#   lines = file.readlines()
#   # lines here is a whole string, with spaces separating each
#   lines = [line.rstrip("\n") for line in lines]

#   print(lines)

In [14]:
# pattern = lines[0]
# genome = lines[1]
# d = int(lines[2])

# answer = ApproxPatternMatching(genome,pattern,d)
# new_answer = " ".join(map(str,answer))

We now want to adapt our "FrequentWords" problem, which looks for the most frequent k-mers within a genome, so that it also counts k-mers with at most d mismatches. This makes our search more robust by accounting for single nucleotide polymorphisms, that is, single-character changes within our given Text string.

In [15]:
# ApproxPatternCount
# INPUT: Genome and Pattern as strings and "d" as integer
# OUTPUT: The number of k-mers in Genome that have at most d mismatches with Pattern

def ApproxPatternCount(genome: str, pattern: str, d: int) -> int:
  count = 0
  n = len(genome)
  k = len(pattern)

  # loop through every k-mer start position in the given genome.
  for i in range(n-k+1):
    # subString is the k-mer we are testing against our given pattern.
    subString = genome[i:i+k]
    # if the number of mismatches (i.e. the Hamming Distance)
    # is at most d, we count this occurrence.
    if HammingDistance(subString,pattern) <= d:
      count += 1

  return count

In [16]:
genome = "AACAAGCTGATAAACATTTAAAGAG"
pattern = "AAAAA"

answer = ApproxPatternCount(genome, pattern, 2)
print(answer)

11
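
Counting approximate occurrences is the same as counting the positions returned by approximate pattern matching, so the two functions can share one loop. The sketch below restates both under the hypothetical names `approx_positions` and `approx_count` so that it runs on its own.

```python
def approx_positions(genome: str, pattern: str, d: int) -> list[int]:
    """Start positions where pattern occurs in genome with at most d mismatches."""
    k = len(pattern)
    positions = []
    for i in range(len(genome) - k + 1):
        window = genome[i:i + k]
        mismatches = sum(a != b for a, b in zip(window, pattern))
        if mismatches <= d:
            positions.append(i)
    return positions


def approx_count(genome: str, pattern: str, d: int) -> int:
    """Number of approximate occurrences, i.e. the number of matching positions."""
    return len(approx_positions(genome, pattern, d))


print(approx_count("AACAAGCTGATAAACATTTAAAGAG", "AAAAA", 2))  # -> 11
```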


In [17]:
# dataset = "/content/sample_data/dataset_30278_7.txt"
# with open(dataset, "r") as file:
#   lines = file.readlines()
#   # using this technique because lines is a list, not just a string
#   new_lines = [string.replace("\n", "") for string in lines]
#   print(new_lines)

In [18]:
# pattern = new_lines[0]
# genome = new_lines[1]
# d = int(new_lines[2])

# answer = ApproxPatternCount(genome,pattern, d)
# print(answer) # -> CORRECT!

Now we can apply the logic of our Approx. Count function to an updated "Frequent Words" algorithm. In this new approach, we will also be looking for patterns that are at most d mismatches away from the patterns found in the given genome string.

In [19]:
"""
Pseudo Code for "ApproxFreqWords":
FrequentWordsWithMismatches(Text, k, d)
    Patterns <- Array of strings with length 0
    freqMap <- Empty Map
    n <- Length of given Text string
    FOR i <- 0 to n-k
        Pattern <- Text(i, k)
        Neighborhood <- Neighbors(Pattern, d)
        FOR j <- 0 to length of Neighborhood - 1
            neighbor <- Neighborhood[j]
            IF freqMap[neighbor] doesn't exist
                freqMap[neighbor] <- 1
            ELSE
                freqMap[neighbor] <- freqMap[neighbor] + 1

    m <- MaxMap(freqMap)
    FOR every Pattern in freqMap
        IF freqMap[Pattern] = m
            append Pattern to Patterns

    return Patterns
"""

In [20]:
# first we need to make our Neighborhood function.
# NEIGHBORS FUNCTION
# Recursive function that takes an input pattern (string) and outputs a collection of neighbor sequences that have a Hamming distance of at most d.
# INPUT: A string Pattern and an integer d (for the max Hamming distance)
# OUTPUT: A collection of strings Neighbors(pattern,d)

def Neighbors(pattern: str, d: int) -> list[str]:
  nucleotides = ['G', 'T', 'A','C']
  # for HammingDistance to be 0, neighbor == pattern
  if d == 0:
    return [pattern]
  # If the pattern is a single nucleotide (and d >= 1), its neighbors are all four nucleotides, itself included
  if len(pattern) == 1:
    return nucleotides
  neighborhood = [] # creating an empty list
  SuffixPattern = pattern[1:] # Dropping the first symbol in pattern to get the SuffixPattern
  SuffixNeighbors = Neighbors(SuffixPattern, d) # Recursion, will keep calling the function until a base case (d == 0 or len(pattern) == 1)

  for text in SuffixNeighbors:
    if HammingDistance(SuffixPattern, text) < d: # < d means one more mismatch is still allowed
      for nucleotide in nucleotides:
        # any nucleotide may be prepended, since the first position is free to mismatch
        neighbor = nucleotide + text
        neighborhood.append(neighbor)
    else: # HammingDistance is equal to d, so no more mismatches are allowed and the first symbol must match the original pattern
      neighbor = pattern[0] + text # pattern[0] is the first symbol of the original pattern
      neighborhood.append(neighbor)

  return neighborhood

In [21]:
# TESTING NEIGHBORS()

pattern = "ACG"
d = 1

neighborhood = Neighbors(pattern, d)

print(neighborhood)

['AGG', 'ATG', 'AAG', 'GCG', 'TCG', 'ACG', 'CCG', 'ACT', 'ACA', 'ACC']


In [23]:
answer = ["ACG", "ACT", "ACA", "ACC", "ATG" ,"AGG", "AAG" ,"TCG", "GCG", "CCG"]
# compare as sets so the check holds in both directions, regardless of order
check = set(neighborhood) == set(answer)

if check:
  print("Elements in neighborhood are all in answer")
else:
  print("Not all elements in neighborhood are in answer")

Elements in neighborhood are all in answer
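
Another way to sanity-check a neighborhood is by its size. A k-mer has exactly C(k, i) * 3^i strings at Hamming distance i (choose i positions, then one of 3 other nucleotides at each), so the d-neighborhood holds the sum of those terms for i = 0..d. The sketch below compares that formula with a brute-force enumeration; both helper names are hypothetical, and the brute-force version is only practical for small k.

```python
from itertools import product
from math import comb


def neighborhood_size(k: int, d: int) -> int:
    """Number of k-mers within Hamming distance d of any fixed k-mer."""
    # comb(k, i) is 0 when i > k, so d larger than k is handled automatically
    return sum(comb(k, i) * 3 ** i for i in range(d + 1))


def brute_force_neighbors(pattern: str, d: int) -> set[str]:
    """Enumerate all 4^k strings and keep those within distance d (small k only)."""
    result = set()
    for letters in product("ACGT", repeat=len(pattern)):
        mismatches = sum(a != b for a, b in zip(pattern, letters))
        if mismatches <= d:
            result.add("".join(letters))
    return result


print(neighborhood_size(3, 1), len(brute_force_neighbors("ACG", 1)))  # -> 10 10
```

For "ACG" with d = 1 this gives 1 + 3*3 = 10, matching the ten strings printed above.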


In [24]:
# recreating our MaxMap function from part one of this chapter:
# INPUT: a frequency table as a map (dict of pattern -> count)
# OUTPUT: The maximum value within a frequency table

def MaxMap(freqMap: dict[str, int]) -> int:
  # named maxCount to avoid shadowing Python's built-in max()
  maxCount = 0
  for pattern in freqMap:
    if freqMap[pattern] > maxCount:
      maxCount = freqMap[pattern]
  return maxCount

In [27]:
# Now we will find the most frequent k-mer WITH mismatches
# we want to return the kmer pattern, not the ones with mismatches

# INPUT: A string text (genome) and integers k for the length of the pattern and d for max Hamming Distance
# OUTPUT: A list of the most frequent k-mers with up to d mismatches
def FrequentWordsWithMismatches(text: str, k: int, d: int) -> list[str]:
  patterns = []
  freqMap = {}
  n = len(text)

  for i in range(n-k+1):
    pattern = text[i:i+k]
    neighborhood = Neighbors(pattern, d)
    for j in range(len(neighborhood)):
      neighbor = neighborhood[j]
      if neighbor not in freqMap:
        freqMap[neighbor] = 1
      else:
        freqMap[neighbor] += 1

  # m = MaxMap(freqMap) # -> can run a bit faster if we use Python's built-in "max" function
  m = max(freqMap.values())
  for pattern in freqMap:
    if freqMap[pattern] == m:
      patterns.append(pattern)


  return patterns

In [29]:
test_text = "AGGT"
k = 2
d = 1

answer = FrequentWordsWithMismatches(test_text, k , d)

print(answer)

# first attempt was incorrect.
# after writing the algorithm out on paper, I found that I did not account for zero-based indexing
# fixed the issue by adding +1 to the range for i in the first "for" loop.

['GG']
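
The frequency-map bookkeeping can also be delegated to `collections.Counter` from the standard library. The following is a self-contained sketch, not the notebook's implementation: `neighbors` and `frequent_words_with_mismatches` are illustrative names, the neighborhood is built as a set, and the result is sorted so the output order is deterministic.

```python
from collections import Counter


def neighbors(pattern: str, d: int) -> set[str]:
    """All strings within Hamming distance d of pattern (same recursion as above)."""
    if d == 0:
        return {pattern}
    if len(pattern) == 1:
        return set("ACGT")
    suffix = pattern[1:]
    result = set()
    for text in neighbors(suffix, d):
        mismatches = sum(a != b for a, b in zip(suffix, text))
        if mismatches < d:
            # budget left: the first symbol may be any nucleotide
            result.update(n + text for n in "ACGT")
        else:
            # budget used up: the first symbol must match the pattern
            result.add(pattern[0] + text)
    return result


def frequent_words_with_mismatches(text: str, k: int, d: int) -> list[str]:
    counts = Counter()
    for i in range(len(text) - k + 1):
        # update() adds 1 for every neighbor of the current k-mer
        counts.update(neighbors(text[i:i + k], d))
    if not counts:  # text shorter than k: nothing to report
        return []
    top = max(counts.values())
    return sorted(p for p, c in counts.items() if c == top)


print(frequent_words_with_mismatches("AGGT", 2, 1))  # -> ['GG']
```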


In [49]:
# # testing our function with the given text file:
# dataset = "/content/sample_data/dataset_30278_10.txt"

# with open(dataset, 'r') as file:
#   lines = file.readlines()
#   new_lines = [string.replace("\n","") for string in lines]
#   new_lines.extend(new_lines[1].split(" "))
#   new_lines.pop(1)

#   print(new_lines)

['TGCATAGGGTATAGTGCATGCATGCATAGGGTACACTAGTAGCGCTGCATGCAGGTATAGTGCAGGTATGCATGCATGCATAGCACGGTACGCCGCCGCTGCACGCTGCACACTAGTGCACACTAGCGCGGTAGGTATGCATAGTGCAGGTACACCGCGGTACACCACCACCGCGGTACACTAGTGCATAGGGTAGGTATAGTGCAGGTACGCCACCGCTAGCACCACGGTATGCACGCCGCCGCGGTACACGGTATAGGGTATAGTGCATAGTAGCACGGTAGGTACGCCGCTGCACGCTAGCACTAGCGCTAGGGTACACCGCTAGCACCACTAGGGTAGGTATGCATGCATAGCGCTGCATAGCGCTGCA', '7', '3']


In [51]:
# genome = new_lines[0]
# k = int(new_lines[1])
# d = int(new_lines[2])

# answer = FrequentWordsWithMismatches(genome,k,d)
# print(answer)

['GGCACGG']


Now that we have a working function for finding the most frequent k-mers with at most d mismatches, we can tackle finding the most frequent k-mers that take into account mismatches **AND** their reverse complements.

In [52]:
# Function for finding the Reverse Complement of a string

# ReverseComplement takes a string pattern as input, creates the complement strand, then reverses it
# INPUT: String Pattern
# OUTPUT: Reverse Complement of given Pattern
def ReverseComplement(pattern: str) -> str:
    complement = Complement(pattern)
    reverse_complement = complement[::-1]

    return reverse_complement



# Complement takes a pattern string and returns the complementary strand of that pattern string
# INPUT: String pattern
# OUTPUT: Complement of string pattern
def Complement(pattern: str) -> str:
    # start with an empty string
    complement = ""
    for i in range(len(pattern)):
        # check the nucleotide and concatenate its complement:
        # A pairs with T, C pairs with G, and vice versa
        if pattern[i] == "A":
            complement += "T"
        elif pattern[i] == "T":
            complement += "A"
        elif pattern[i] == "C":
            complement += "G"
        elif pattern[i] == "G":
            complement += "C"

    return complement
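
For long sequences, building the complement one character at a time can be replaced with Python's built-in `str.maketrans` / `str.translate`. This is a sketch under the hypothetical name `reverse_complement`. One behavioral difference from the version above: symbols other than A, C, G, T are kept unchanged here rather than dropped.

```python
# translation table mapping each nucleotide to its complement
COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")


def reverse_complement(pattern: str) -> str:
    """Complement every nucleotide, then reverse the strand."""
    return pattern.translate(COMPLEMENT_TABLE)[::-1]


print(reverse_complement("AAAACCCGGT"))  # -> ACCGGGTTTT
```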

In [77]:
# VERSION 1.1
# Frequent words problem WITH finding the reverse complement
# INPUT: A DNA string Text, integer k for length of pattern and integer d for max Hamming Distance.
# OUTPUT: All k-mers Pattern maximizing the sum Count(Text, Pattern) + Count(Text, Pattern_RC) over all possible k-mers

def FrequentWordsWithMismatchesReverseComplements(text: str, k: int, d: int) -> list[str]:
  patterns = []
  freqMap = {}
  n = len(text)

  # we will make a rc_neighborhood to search for possible neighbors of reverse complements of the patterns we find
  # we will then append the regular neighborhood and rc_neighborhood to find the most frequent k-mers in this bigger neighborhood
  for i in range(n-k+1):
    pattern = text[i:i+k]
    neighborhood = Neighbors(pattern, d)
    rc_pattern = ReverseComplement(pattern)
    rc_neighborhood = Neighbors(rc_pattern, d)
    neighborhood = neighborhood + rc_neighborhood
    for j in range(len(neighborhood)):
      neighbor = neighborhood[j]
      if neighbor not in freqMap:
        freqMap[neighbor] = 1
      else:
        freqMap[neighbor] += 1

  m = MaxMap(freqMap)
  for pattern in freqMap:
    if freqMap[pattern] == m:
      patterns.append(pattern)


  return patterns


In [79]:
test_sequence = "ACGTTGCATGTCGCATGATGCATGAGAGCT"
k = 4
d = 1

answer = FrequentWordsWithMismatchesReverseComplements(test_sequence, k, d)
print(answer)

['ATGT', 'ACAT']


In [89]:
# # testing our function with the given text file:
# dataset = "/content/sample_data/dataset_30278_12.txt"

# with open(dataset, 'r') as file:
#   lines = file.readlines()
#   new_lines = [string.replace("\n","") for string in lines]
#   new_lines.extend(new_lines[1].split(" "))
#   new_lines.pop(1)

#   print(new_lines)

new_lines = ['GATCGGAGATCGTCTACATCGGATCTGAACATCTGAACGATCGACATCTACAACAACGATCTGAACAACTCTGAACACATCTACTCTACAGAGATCTACTCGTCGTCGTCTACATCTTCTACTCGGAGAGAACTCGACACTCTGATCTTCTTCGTCGGATCTACATCGACATCTACAGATCTACATCTACAGAACTCGGATCGACACAACAACGAGATCGTCTTCGGAACATCTTCTACAAC', '7', '3']

In [90]:
genome = new_lines[0]
k = int(new_lines[1])
d = int(new_lines[2])

answer = FrequentWordsWithMismatchesReverseComplements(genome,k,d)

In [91]:
print(answer)

['TAGATAT', 'ATATCTA']


Although we are looking for 9-mers related to the location of the origin of replication, we also find other 9-mers clustered in various bacterial genomes, because the genome contains many hidden messages whose specific functions are not yet known.

The main takeaway is that the existing methods for finding the origin of replication are still imperfect and at times inconclusive. However, this work is not in vain, as it narrows the window of research for biologists and gives a great starting point for further experimentation and observation.

#### EPILOGUE: COMPLICATIONS IN ORI PREDICTIONS 1.6

- The skew diagram gives only a rough estimate of where the ori can be.
- The skew diagram is not always clear.


**Now we will attempt to find the DnaA Box of *Salmonella enterica*.**

In [97]:
se_dataset = "/content/sample_data/Salmonella_enterica.txt"
with open(se_dataset,'r') as file:
  next(file)
  s_enterica = file.read()
  s_enterica = s_enterica.replace("\n","")

In [98]:
s_enterica


In [None]:
# we will first look for the minimum skew
# with the location of the minimum skew, we will start our FrequentWords problem
# FreqWords will have a window (L) of 1000, k of 9 and a d of 2.
# we will run FreqWords 3 times: once at the min_skew, then once 500bp before, and once 500bp after.
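
One possible reading of the plan above is sketched below. It only computes the three window boundaries. The helper name `candidate_windows` is hypothetical, and treating each position as the *start* of its window is an assumption, since the plan does not say whether windows begin at or are centered on the position.

```python
def candidate_windows(genome_length: int, min_skew: int,
                      window: int = 1000, step: int = 500) -> list[tuple[int, int]]:
    """Return (start, end) slices for windows starting at min_skew - step,
    min_skew and min_skew + step, clamped to the genome boundaries."""
    windows = []
    for offset in (-step, 0, step):
        # clamp the start into [0, genome_length]
        start = max(0, min(min_skew + offset, genome_length))
        end = min(start + window, genome_length)
        windows.append((start, end))
    return windows


print(candidate_windows(10000, 3000))  # -> [(2500, 3500), (3000, 4000), (3500, 4500)]
```

Each `(start, end)` pair could then be used to slice the genome string, for example `s_enterica[start:end]`, before passing the slice to `FrequentWordsWithMismatchesReverseComplements` with `k = 9` and `d = 2`.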