## WHICH DNA PATTERNS SERVE AS MOLECULAR CLOCKS? - PART 1/2

### DO WE HAVE A "CLOCK" GENE? 1.1

We will turn our focus to the internal timekeeper of plants, animals, and some bacteria: the **Circadian Clock**. The genes involved let an organism's cells know what "time" it is, so that the cells can change their gene transcription, or **gene expression**, accordingly. <br>

For plants, each cell keeps track of the day and night cycle independently using three plant genes: LHY, CCA1, and TOC1.

These are regulatory genes that encode **regulatory proteins**, which are usually influenced by external factors (such as the presence of certain nutrients or even sunlight). TOC1 promotes the expression of LHY and CCA1, while LHY and CCA1 repress the expression of TOC1, resulting in a **negative feedback loop**. <br><br>
In the morning, the sunlight activates the transcription of LHY and CCA1, triggering the repression of TOC1 transcription. When the sunlight diminishes, the transcription of LHY and CCA1 diminishes as well, causing TOC1 to no longer be repressed. TOC1 peaks at night and starts promoting LHY and CCA1, continuing the cycle.

The regulatory proteins that these genes encode are known as **transcription factors**, which can turn other genes on or off. A transcription factor regulates a gene by binding to a specific short DNA interval, known as a **regulatory motif**, in the gene's upstream region.

### MOTIF FINDING IS MORE DIFFICULT THAN YOU THINK 1.2

In order to find motifs we may think to use our earlier "Frequent Words" algorithm, but the Frequent Words Problem is inadequate because it does not correctly model the biological problem of motif finding. <br> <br>
A DnaA box is a pattern that clumps, or appears frequently, within a relatively short interval of the genome. In contrast, a regulatory motif is a pattern that appears at least once (perhaps with variation) in each of many different regions that are scattered throughout the genome.

Let us work out a "brute force" method for finding these motifs. <br>
Given a collection of strings Dna and an integer d, a k-mer is a (k,d)-motif if it appears in every string from Dna with at most d mismatches. For example, the implanted 15-mer in the strings above represents a (15,4)-motif.

```
MotifEnumeration(Dna, k, d)
    Patterns ← an empty set
    for each k-mer Pattern in the first string in Dna
        for each k-mer Pattern' differing from Pattern by at most d mismatches
            if Pattern' appears in each string from Dna with at most d mismatches
                add Pattern' to Patterns
    remove duplicates from Patterns
    return Patterns
```



In [None]:
# # --- OLD VERSION ---
# # Implemented Motif Problem
# # INPUT: A collection of strings "DNA" and integers k and d.
# # OUTPUT: All (k,d)-motifs in DNA
# def MotifEnumeration(DNA: str, k:int, d:int) -> list[str]:
#   patterns = set()
#   first_patterns = set()
#   n = len(DNA[0])
#   first_string = DNA[0]

#   # we will create a list of patterns found in the first string of DNA
#   for i in range(n-k+1):
#     pattern = first_string[i:i+k]
#     first_patterns.add(pattern)

#   # now we will create a neighborhood for each pattern having at most d mismatches
#   neighborhood = []
#   # print(first_patterns) ->  WORKING

#   for pattern in first_patterns:
#     neighbors = Neighbors(pattern,d)
#     neighborhood.update(neighbors)
#     # checking if each neighbor is present in all of the other strings in DNA
#   for neighbor in neighborhood:
#     # PROBLEM here is that we are checking for EXACT matches in the other DNA strings
#     if all(neighbor in dna_substring for dna_substring in DNA[1:]) == True:
#       patterns.append(neighbor)


#   return patterns



In [None]:
# --- UPDATED VERSION ---
# Implemented Motif Problem
# INPUT: A collection of strings "DNA" (a list, or one space-separated string) and integers k and d.
# OUTPUT: The set of all (k,d)-motifs in DNA
def MotifEnumeration(DNA, k: int, d: int) -> set[str]:
  patterns = set()
  first_patterns = set()

  if isinstance(DNA,list):
    n = len(DNA[0])
  else:
    DNA = DNA.split()
    n = len(DNA[0])

  first_string = DNA[0]

  # we will create a list of patterns found in the first string of DNA
  for i in range(n-k+1):
    pattern = first_string[i:i+k]
    first_patterns.add(pattern)

  # now we will create a neighborhood for each pattern having at most d mismatches
  neighborhood = set()
  # print(first_patterns) ->  WORKING

  for pattern in first_patterns:
    neighbors = Neighbors(pattern,d)
    neighborhood.update(neighbors)
    # checking if each neighbor is present in all of the other strings in DNA
  for neighbor in neighborhood:
    isMotif = True
    for dna_string in DNA[1:]:
      found = False
      for i in range(len(dna_string)-k+1):
        kmer = dna_string[i:i+k]
        if HammingDistance(kmer,neighbor) <= d:
          found = True
          # found that the neighbor is in this dna_substring
          # exiting this for loop and moving on to next substring
          break
      if not found:
        # not in this substring, no longer qualifies for motif, next neighbor
        isMotif = False
        break
    if isMotif:
      # kmer was found in each DNA substring, can now add it to patterns
      patterns.add(neighbor)

  return patterns

In [None]:
# HAMMING DISTANCE
# INPUT: Two strings of equal length
# OUTPUT: The hamming distance between the two strings

def HammingDistance(p: str, q:str) -> int:
  hd = 0
  for i in range(len(p)):
    if p[i] != q[i]:
      hd += 1
  return hd


# first we need to make our Neighborhood function.
# NEIGHBORS FUNCTION
# Recursive Function that takes an input pattern (string) and outputs a collection of neighbor sequences that have a hamming distance of at most d.
# INPUT: A string Pattern and an integer d (for the max hamming distance)
# OUTPUT: A collection of strings Neighbors(pattern,d)

def Neighbors(pattern: str, d: int) -> list[str]:
  nucleotides = ['G', 'T', 'A','C']

  # for HammingDistance to be 0, neighbor == pattern
  if d == 0:
    return [pattern]
  # If length pattern = 1, then it is a single nucleotide, so (with d >= 1) every nucleotide is a neighbor
  if len(pattern) == 1:
    return nucleotides
  neighborhood = [] # creating empty list
  SuffixPattern = pattern[1:] # Dropping the first symbol in pattern to get the SuffixPattern
  SuffixNeighbors = Neighbors(SuffixPattern, d) # Recursion, will keep calling function until a base case (len(pattern) == 1)

  for text in SuffixNeighbors:
    if HammingDistance(SuffixPattern, text) < d: # < d = one more mismatch is still allowed
      for nucleotide in nucleotides:
        # any nucleotide can be the first symbol of the neighbor
        neighbor = nucleotide + text
        neighborhood.append(neighbor)
    else: # HammingDistance is equal to d, thus no more mismatching allowed and the first symbol must match the original pattern
      neighbor = pattern[0] + text # pattern[0] is the first symbol of the original pattern
      neighborhood.append(neighbor)

  return neighborhood

In [None]:
# testing MOTIF ENUMERATION function
DNA = "ATTTGGC TGCCTTA CGGTATC GAAAATT"
DNA_list = DNA.split()
k = 3
d = 1

answer = MotifEnumeration(DNA, k, d)
print("your motifs are:",answer)
# CORRECT

your motifs are: {'GTT', 'ATT', 'ATA', 'TTT'}


In [None]:
if isinstance(DNA,list):
  print("is a list")
else:
  print("is not a list")

is not a list


In [None]:
# test data set given from Coursera
DNA = "GATGGGCGAACTTCAGCGCGGCCGA TCAGACTATCGCGCGCGACATGGAC GCGAACCGCTTGTTTACTGACCGAC CGTTGGCAGAAGATTGCGCCTAGTT AAGCCGGGGCATAGTGCGTCTGTCC ATTGTGCGGTGACTTCTCAAGGGTC"
k = 5
d = 2

answer = MotifEnumeration(DNA, k,d)
display(" ".join(answer))

'CCTGA CTTCC AGCTC TACGG CCTGC TTGAC TTGCC TGTAG CTTGC TAGGG GCAGG TCGTA TTTGG ATAGA ATGAG GCGCC TGCTC CTCGC TGAGC GACCC TCTGC CCGTC ACTAT CTACG CTATA CGATA ACGGC ATAAC ATGCT AAGCG ATTAC ACGTA CGATG CGCTC GGCCC GTACC ATACG GAGGG CAGAA AACCT TTAGC CTGCA TACCT GTTGC CCGGA CAAAC TAACG CATGG GTGAA AGTAC GTGCG GACCT TGGTG AATCT GCTAG TAAGG CTGGT ATAGC CTGGA AGGGC TATGC AGCCG GAGTG TATAG ACAGC CCGGG TATCT ACTGA TGAAC CTTTC GTAGC GACTG TGAGG CCGTG GGTCT TCCCC TTACC CCGGC CTTGT GATCG GCAAA GCGGA AGACC GGATT GGCCT ACTCC AGGAC GTCAG TGCCA TATGA TGCTA GTATT CTTAC ATGGA GTTGG TACGC CAGTT TTCGA TGAGT GACCG TTGGC GTTCG CGCTG TCCGC GCGTG CTAAG CCCGA CGGCC CGCCA CGTGA TGTTC CGCAA ATGGT CGCGA GAGAA CGCGG TTCGT TCAGC GCGAA TCGCA TACGA GCCAA GCACG TCGCC AGCGT AAAGG TGAAT ATTGT GGATC TCGAC CACAT CACTG AGCTT CGTCT TCCGT CTAGC TATGG TGGTA ACCAG CGCTT GGAAA ACGAG CCGCG ATCAG GCTAA TGCGA GCCCA GTCTT TCACA GCAAG GCGTT ATCGC TGCCC GTCGT TGTGC TTCTA GTTTC GATCT TCCTG TAAGT GTCCG GCCTA CGGAA GAACG ATCCT CATCT GCC

Although our MotifEnumeration function seems to be working, its running time grows roughly exponentially with d. For every k-mer in the first string, Neighbors generates every pattern that can be reached by changing up to d of the k positions to one of the 3 other nucleotides, and each of those neighbors then has to be compared against every k-mer of every other string in DNA.<br><br>
The longer the target pattern (k) and the greater the number of allowable mismatches (d), the longer our function takes to run. This is one of the major downfalls of the **"Brute Force"** approach.
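
To get a feel for this growth, the size of a neighborhood can be counted directly: choose which i of the k positions change, then one of the 3 other nucleotides for each changed position. The sketch below is only illustrative, and `neighborhood_size` is a hypothetical helper name that is not used elsewhere in this notebook.

```python
from math import comb

def neighborhood_size(k: int, d: int) -> int:
    """Number of k-mers within Hamming distance d of one fixed k-mer."""
    # i mismatches: comb(k, i) ways to pick the positions, 3 choices per position
    return sum(comb(k, i) * 3**i for i in range(min(d, k) + 1))

# how the neighborhood of a single 15-mer grows as d increases
for d in range(5):
    print(d, neighborhood_size(15, d))
```

When d equals k the neighborhood contains every possible k-mer, so `neighborhood_size(k, k)` equals `4**k`.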

#### SCORING MOTIFS 1.3

An inherent issue our function has with real-world data is noise. It is normal for a dataset to contain sequences that have nothing to do with, say, the circadian clock but that are still amplified in the evening. For such noisy datasets, any algorithm for the Implanted Motif Problem would fail, because as long as a single sequence does not contain the transcription factor binding site, a (k, d)-motif does not exist!

We should instead look for sequences that are close to the "ideal" definition of a motif.

For our new approach to Motif Finding, we will build a matrix from possible motifs in the DNA sequences. How we find these motifs will be discussed later; what we do with them is Motif Scoring, then Motif Counting, followed by a Motif Profile and finally a Motif Consensus that will help us find a possible motif in the DNA sequences.

**MOTIF SCORING:** Completed by setting a matrix of n by k where n is the number of possible motifs and k is the length of each motif. We will then count the number of nucleotides that differ from the majority nucleotide in that column. We will add the totals up for each column and that total is the "Motif Score". We will aim to *minimize* this score.<br><br>

**MOTIF COUNTING:**  We will then add up the number of times each nucleotide appears in each *column* and create a new matrix. This new matrix will have four rows, one for each nucleotide, and k columns. Each entry holds the number of times a specific nucleotide was found in that column.<br><br>

**MOTIF PROFILE:** Now that we have the count for each nucleotide for the positions of the possible motifs, we can create a frequency table that has the frequency of the Ith nucleotide in the Jth column of the motifs.<br><br>

**MOTIF CONSENSUS:** We take the nucleotide with the highest frequency in each  column and put together an ideal candidate for a regulatory motif for these regions.
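
The count, score, and consensus steps can be sketched in a few lines. This is a minimal sketch rather than the notebook's own implementation: the helper names `motif_counts`, `motif_score`, and `motif_consensus` are hypothetical, and the motifs are assumed to be equal-length strings over A, C, G, T.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"

def motif_counts(motifs):
    """Count matrix stored column by column: one {nucleotide: count} dict per column."""
    counts = []
    for column in zip(*motifs):  # zip(*motifs) walks the motif matrix by columns
        tally = Counter(column)
        counts.append({n: tally[n] for n in NUCLEOTIDES})
    return counts

def motif_score(motifs):
    """Total number of symbols that differ from the most common symbol of their column."""
    t = len(motifs)
    return sum(t - max(column.values()) for column in motif_counts(motifs))

def motif_consensus(motifs):
    """Most common nucleotide of each column (the first in A, C, G, T order on a tie)."""
    return "".join(max(NUCLEOTIDES, key=column.get) for column in motif_counts(motifs))

motifs = ("TCGGGGGTTTTT CCGGTGACTTAC ACGGGGATTTTC TTGGGGACTTTT AAGGGGACTTCC "
          "TTGGGGACTTCC TCGGGGATTCAT TCGGGGATTCCT TAGGGGAACTAC TCGGGTATAACC").split()

print(motif_consensus(motifs))  # TCGGGGATTTCC
print(motif_score(motifs))      # 30
```

Dividing every count by the number of motifs turns the count matrix into the profile matrix.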

There is also the issue of accounting for how "conserved" each position in the motif is. For instance, a column with a Motif Count of (6 C, 4 T) should be scored lower than a column of (6 C, 2 A, 2 T), even though both columns contribute 4 to Score(Motifs). The biological reason could be that some positions tolerate a substitution of one specific nucleotide and still allow protein binding, while other positions need the correct nucleotide for proper binding. <br><br>
With all this in mind we will take a closer look at our "Motif Profile", which is essentially a **Probability Distribution** for each column.

##### CLOSER LOOK AT MOTIF PROFILE AND MOTIF SCORE

**Entropy** is essentially a measure of uncertainty in a probability distribution. <br><br>

It is calculated as $H(p_1, \ldots, p_N) = -\sum_{i=1}^{N} p_i \log_2 p_i$, i.e., the negative sum of each probability $p_i$ multiplied by the log base 2 of $p_i$, where i runs over the N possible outcomes in a column (for a profile column, N = 4, one outcome per nucleotide).<br><br>

We would find the entropy of more conserved columns to be lower than the entropy of columns that are less conserved, i.e., that have more variation in their probability distribution. <br><br>

*Note:* Log base 2 (0) is technically undefined, but we will assume that 0 times log base 2 (0) == 0.

The **Entropy** of the Motif Matrix would then be as simple as the sum of the Entropy of each column. However, for simplicity, we will continue with the Score(Motif) algorithm, knowing that an Entropy(Motif) algorithm is more often used in practice since the lower the entropy score on a column, the more conserved the column is.
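
As a small illustration of why entropy separates columns that Score treats the same, the sketch below computes the entropy of a single column. `column_entropy` is a hypothetical helper name, and the two inputs are the frequencies of a (6 C, 4 T) column and a (6 C, 2 A, 2 T) column.

```python
import math

def column_entropy(probabilities):
    """Entropy of one profile column; zero probabilities are skipped (0 * log2(0) is taken as 0)."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(column_entropy([0.6, 0.4]))       # about 0.971
print(column_entropy([0.6, 0.2, 0.2]))  # about 1.371
```

Both columns have 4 symbols that differ from the most common one, but the column that mixes three nucleotides has the higher entropy.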

In [None]:
# countMotifPercent
# INPUT: A list of DNA strings
# OUTPUT: A profile matrix, showing the probability of a nucleotide being present at different indices

def countMotifPercent(motifs):
    count = {}
    columns = []
    for i in range(len(motifs[0])):
        columns.append([motif[i] for motif in motifs])
    for i in range(len(columns)):
        count[i] = {'A': columns[i].count('A')/len(columns[i]), 'C': columns[i].count('C')/len(columns[i]), 'G': columns[i].count('G')/len(columns[i]), 'T': columns[i].count('T')/len(columns[i])}

    return count



In [None]:
strings = "TCGGGGGTTTTT CCGGTGACTTAC ACGGGGATTTTC TTGGGGACTTTT AAGGGGACTTCC TTGGGGACTTCC TCGGGGATTCAT TCGGGGATTCCT TAGGGGAACTAC TCGGGTATAACC"
string_list = strings.split()

In [None]:
answer = countMotifPercent(string_list)

In [None]:
display(answer)

{0: {'A': 0.2, 'C': 0.1, 'G': 0.0, 'T': 0.7},
 1: {'A': 0.2, 'C': 0.6, 'G': 0.0, 'T': 0.2},
 2: {'A': 0.0, 'C': 0.0, 'G': 1.0, 'T': 0.0},
 3: {'A': 0.0, 'C': 0.0, 'G': 1.0, 'T': 0.0},
 4: {'A': 0.0, 'C': 0.0, 'G': 0.9, 'T': 0.1},
 5: {'A': 0.0, 'C': 0.0, 'G': 0.9, 'T': 0.1},
 6: {'A': 0.9, 'C': 0.0, 'G': 0.1, 'T': 0.0},
 7: {'A': 0.1, 'C': 0.4, 'G': 0.0, 'T': 0.5},
 8: {'A': 0.1, 'C': 0.1, 'G': 0.0, 'T': 0.8},
 9: {'A': 0.1, 'C': 0.2, 'G': 0.0, 'T': 0.7},
 10: {'A': 0.3, 'C': 0.4, 'G': 0.0, 'T': 0.3},
 11: {'A': 0.0, 'C': 0.6, 'G': 0.0, 'T': 0.4}}

In [None]:
import math
def motifEntropy(motifs):
    entropy = 0
    percents = countMotifPercent(motifs)
    for i in range(len(percents)):
        for nucleotide in percents[i]:
            if percents[i][nucleotide] != 0:
                entropy += percents[i][nucleotide] * math.log2(percents[i][nucleotide])
    return -entropy

In [None]:
answer = motifEntropy(string_list)
display(answer)

9.916290005356972

#### FROM MOTIF FINDING TO FINDING A MEDIAN STRING 1.4

As stated in "Bioinformatics Algorithms" by Phillip Compeau and Pavel Pevzner:

> A brute force algorithm for the Motif Finding Problem (referred to as BruteForceMotifSearch) considers every possible choice of k-mers Motifs from Dna (one k-mer from each string of n nucleotides) and returns the collection Motifs having minimum score. Because there are n - k + 1 choices of k-mers in each of t sequences, there are (n - k + 1)ᵗ different ways to form Motifs. For each choice of Motifs, the algorithm calculates Score(Motifs), which requires k · t steps. Thus, assuming that k is smaller than n, the overall running time of the algorithm is O(nᵗ · k · t). We need to come up with a faster algorithm!
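
The brute force search described in the quote can be sketched with `itertools.product`, which enumerates one k-mer per string in every possible combination. This is a minimal sketch for tiny inputs only; the function names `score` and `brute_force_motif_search` and the toy `dna` list are assumptions made for illustration.

```python
from itertools import product

def score(motifs):
    """Number of symbols that differ from the most common symbol of their column."""
    total = 0
    for column in zip(*motifs):
        most_common = max(column.count(n) for n in "ACGT")
        total += len(column) - most_common
    return total

def brute_force_motif_search(dna, k):
    """Try every choice of one k-mer per string and keep a choice with the lowest score."""
    candidates = [[text[i:i + k] for i in range(len(text) - k + 1)] for text in dna]
    return min(product(*candidates), key=score)

dna = ["TTACG", "GACGT", "ACGAA"]
best = brute_force_motif_search(dna, 3)
print(best, score(best))  # ('ACG', 'ACG', 'ACG') 0
```

Even here there are 3 · 3 · 3 = 27 combinations to score, and that number is multiplied by (n - k + 1) for every additional string.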



In [None]:
# MOTIF FINDING PROBLEM
# INPUT: A collection of strings DNA, and integer k
# OUTPUT: A collection of Motifs of kmers, one from each string DNA, minimizing Score(Motifs), among all possible choices of Motifs.

Being able to compute the minimum HammingDistance between a given Pattern and the k-mers in a string (or row) of DNA reduces the complexity of our algorithm, since we no longer have to compare the nucleotide counts at each position across every string. Instead, we compute the minimum Hamming distance between the given pattern and a k-mer in each INDIVIDUAL row of DNA strings. <br><br>
The k-mer that minimizes d(Pattern, DNAᵢ) is Motif(Pattern, DNAᵢ): the k-mer in DNAᵢ with the smallest HammingDistance to Pattern.

To begin computing the MedianString, we need to create a function that calculates d(Pattern, DNA), i.e., the sum of the minimum HammingDistances between Pattern and each DNAᵢ.
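
Written out as formulas (a restatement of the definitions above, with t denoting the number of strings in DNA):

$$
d(\text{Pattern}, \text{DNA}_i) = \min_{\text{all } k\text{-mers Pattern}' \text{ in } \text{DNA}_i} \text{HammingDistance}(\text{Pattern}, \text{Pattern}')
$$

$$
d(\text{Pattern}, \text{DNA}) = \sum_{i=1}^{t} d(\text{Pattern}, \text{DNA}_i)
$$

For example, with Pattern = AAA and the strings TTACCTTAAC, GATATCTGTC, ACGGCGTTCG, CCCTAAAGAG, and CGTCAGAGGT, the closest k-mers are TAA, ATA, ACG, AAA, and AGA, so d(Pattern, DNA) = 1 + 1 + 2 + 0 + 1 = 5.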

In [None]:
# Finding the distance between a Pattern and a given string of DNA sequences.
# DistanceBetweenPatternAndStrings
# INPUT: A string pattern followed by a collection of space separated strings DNA
# OUTPUT: Distance of Pattern and DNA (d(Pattern,DNA))

def DistanceBetweenPatternAndStrings(pattern, DNA) -> int:
  distance = 0
  k = len(pattern)
  DNA_string = DNA.split()  # split on any whitespace, so stray newlines are ignored too

  for string in DNA_string:
    # we are looking to MINIMIZE hd for pattern and pattern' in DNA[i]
    # therefore, we will set hd to infinite (or simply k, since the max hd is
    # equal to the length of the pattern), and only keep an hd that is
    # smaller than the previous hd

    # resetting hd to be equal to k
    champion_hd = k
    n = len(string)

    for i in range(n-k+1):
      # looking at each pattern_prime from i to n-k
      pattern_prime = string[i:i+k]
      challenger_hd = HammingDistance(pattern,pattern_prime)
      # we want the champion_hd to be the minimal hd between pattern and
      # a pattern_prime in string
      if champion_hd > challenger_hd:
        champion_hd = challenger_hd

    # when we exit the nested for loop, we should have the minimum hd
    # between pattern and a given pattern within the string be
    # equal to champion hd, we can now add that minimized value to
    # the total distance

    distance += champion_hd

  return distance

In [None]:
# TESTING DistanceBetweenPatternAndStrings

pattern = "AAA"
DNA = "TTACCTTAAC GATATCTGTC ACGGCGTTCG CCCTAAAGAG CGTCAGAGGT"

answer = DistanceBetweenPatternAndStrings(pattern,DNA)

print(answer)

5


In [None]:
# url = "/content/sample_data/dataset_30312_1.txt"
# with open(url, "r") as file:
#   my_file = file.readlines()

#   new_file = [line.replace("\n","") for line in my_file]

In [None]:
pattern = "GTAGTAA"
DNA = "CGTTCCGACGTCCCACAAATCGAAGCACGAAGGCGCTCCAACCCAAAGTATCCGCTACCAATAACGGATGGACATAGATTTGAAAACGATTCTCTAGTC ATTAGGCAATTGTCTGCAAAATGTTAACTGTTGAGAAATCTGATTGATTATTCAAGTCAGCGGATTGAAGTAGATCGCGTGGCCTCGACGCGCAGTGGT CTCGAAAGCCTGTCACATCAAGGGGAGCGAGATCAAAGTGACGACTTGCACACCCCTCCGCGGATACTGACACACGATCGCTTGTGATGACGCGGTCGA ATATATATAAACCAATTTATTTAAGTATTACGTACTCCATCTGGATTAAAATTCTGATTGTTAGAACAGGTGCTAGTGCCGCATGTGGAGCGCGAGCAA ACCGGTTCTAGCTTTTATTGGTATGTTGGGGTTAGTAAGCCTGTAACACTATACGTATGGCTTGAGCTTAAATGTTGTGCATTTTCGGTCGGGAGCGCC CGCCCATGTATGTCGGTGCTTAGCCAAACGTTTGTATACGGATTAGTGTGTGTAATAAATCATCGAATCGCCGCAATACAAAGGACCCCATGGGAGCTA TCTTCTGCGGCTTTTTCGGGACTTACGAGGGAGGGGGACCCGTTATGTCTAGAAATGAAACGGTAATAAACAGCAGGGCTGGACAAGTGACGGTTCTTG CTTATTTCTTCAATAAAGATGGACTTTTCCCCCTGAGCTCGGGGCCATGGGTCAGGCACTATCGACTTCAACTCCCAATGACCGGTCTCGTCGCCGTTG GTTCTTTCACCAAGTTCACCCAAAGTAAGATGCGCTAAAAATAAAAATCCACTAGTTCAAGGACTTAACTATCTGATCGCCACGTTCATGTGCGAGCAT CGGGTCCGAAATACTTCGAGAAGGTAGCACCAATTGATATCCATAACTATTAGCCCAGGTTCCGTAATAGAGTCGATGGGAATCACAGATATGGGTTGT TTGATATAATTCAGTCTAGCACGTTGCTGCCCTTGATTTTAGGATCATCACATGACGCCCCGCGCGAATCTAGTGAATGGCTTTCCCAAGGTAGATTCT CAAGTGATAGAAGAGGAGTCCCCTTACTATGGGTGGAATGTTTGTGTTTCGACACAAAGGGAGTAGGCAAGCGGCCCAAAAATTACTAGGGTAGTGGAT ACGTGAACCCTGAGGAACCACGGCTAAGGCACGGCTCATGGAGAACTCGCCTTCCAGATCCAGTGCCCTAGAAGTTGGATTCCTTTGATTGTATAGCTA AGTATATAGTAAGTACACGTCTTACTAGAATCCTCATAGTAGCGGAAGTCCTCTCCGAAGTACAACCCCACTAAGTACTTCGCTCTGTAGAGCATTTAC AGAAATAAACGCTAAAATTCTATTGTCGTATTCACGATTATACTTCCCGAATGCACCACAAATGATTACCGGTTGGTGTTTCCTCATTCACGCAGCACA GCATCCGGCGAGACCCAATCCCGAGTCCGTAAATTACGCGTGTGTAACAAAAAGCTCGCTTGTTATAGTGTAGTGGCGGTTGGTTGCTCCATGGCTGTG TTTTATGCAGCGAGGCTGCTATACTCGTTATCAGAAGTCGATGGTGGCGGGGGCCCGCCTACCCTTTTGTGGACCCAGTTAGCTTCCAAAGGGGAGGTT GCTAACCACCATGATTTAGTGCTAACGGAACAGCATTCCTTCCCAGCCGAGACTAGCGACTGTAGTGATGCGACCGTACGATGTCAGCGATTGCTCAAT GGTAATTTGGGGAGGCCGGCTAAGAGGGAATATAACTGCTGATTTATAAGGGGCCCGGAAAAGTACCTCGGAATTTTACAAACGAATGTGATTGGAGGA 
CAGGCGTTGAGCAGTCCGCATGATTGGTAGAGCGTCCCTCATGAGCTAATAAATCGAGGAATATCAAGTGAGGGAACGGTTCGTCTGCCGGCTGCTGAT TGGCATTAACACGTTAACGTACCGGTATCTAGCCACCCCCGTCTTGCAGGGCGCGGCTCTATTTCGGGCAGCTGATGCTACCGCCTGATAGAATAAGGA TGGTAATCCGACCCTTGACATTGGGTTAACAACTGCAACGCATATGCGACCGGCGGTAAACAAGGGCACGGCATAATATCCAAAGACACTCATGATATA GAAGCCACGATCATAATACCGTCACACCATGTAAGGGTACTATGGTGTAAGGGCATGACGCATGATACGTGGGATCGGCGTTACACAAAACGGAGGAAT ACACTTATTTTCTTCGTAACCTATAAGTTTACACTCCTATAATGCTCAGCTATTCGCTCTTTCTGCCTCGTGGAGTTAGGACAGTCGAAGGTCTCGTGC CTTCTGTTTTAAGCCATGGCTACCTCCTGCAGCACGAAATTATCCGTCTAGCCATCCCTATTCAGGGGGGCGTGGCCGTACACTAACTAAGCCCTCCGA TGTACACCCATTATATGTCGGATTCTAAGGCAGTAGATTGATTTAGCTGGTCTACGTAAGCAACGATCTCTTGCTAGAGTTTTCCAATGTAGCATGTGT CAATATACCTGTGCGATCCCATAGTGTTAACCGTCGCAAGAGTGTTCTATACTGTCAATCAGTAGACGGCCGCGCTGTAGAGAACACCCATACCCCAGG CCATGGTAACCTGGTAGTGGTTCAACATCCCTATGACGCCCAGCTGAATGGGTTTGTCATTCAAACAATCTCGTAGTCTCTTTCAGCTTGGGCCCCTGC GGCGCGAGCGACGCCTTAAGTTAGGTGGTCAATCGACTGTCCGCGTGAACTTACAGCTGCTTCGCCGGTTGCACTATCAAAGTACCTACCGAACCTGTA CAATTACGCTTTGGGACACTATTTAATCTATCAGCTTAATTAAGATCACTAGATTTCAACCTAGGGGCTGGATGTTAGATTGATAATGAACCCCTAGCC GGAAGAGTGTCTTTATCCTCAACGTGAGGTGCAACCTTCAAGTTGATAGGCGCAGGATTGATCGACTCTGCAAGGATTCACATGAGATGTGGTTCGTTC"

answer = DistanceBetweenPatternAndStrings(pattern,DNA)
print(answer)

65


Now that we have a function to compute the distance between a Pattern and the closest pattern' in each string in DNA, we can start looking for the median string, i.e., the Pattern that minimizes this total distance. <br><br>
However, we are not given this Pattern, so we have to search all candidate k-mers for the one that minimizes d(Pattern,DNA).

In [None]:
my_string = "HELLO"

print(my_string[:0]+"B"+my_string[0+1:])

for i in range(len(my_string)+1):
  print(my_string[:i])

BELLO

H
HE
HEL
HELL
HELLO


In [None]:
# AllStrings1 creates all possible k-mers of length k.
# INPUT: integer k for the length of kmers
# OUTPUT: A list of all possible kmers of length k

def AllStrings1(k: int) -> list[str]:
  patterns = set()
  nucleotides = ["A","C","T","G"]
  pattern = ""
  loop = k
  # establishing first_pattern
  while loop > 0:
    pattern = pattern + "A"
    loop = loop - 1

  patterns.add(pattern)

  print(pattern)
  neighborhood = Neighbors(pattern, k)


  # now with our first string, "A...A" of length k, we remove

  # for i in range(len(pattern)):
  #   for replacement in nucleotides:
  #     print(replacement)
  #     new_pattern = pattern[:i] + replacement + pattern[i+1:]
  #     patterns.add(new_pattern)


  return neighborhood

In [None]:
# Improved version of AllStrings that essentially riffs off of our Neighbors function
# However, there is no limit on the HammingDistance we are looking for, so we will not have the if statement for HD

def AllStrings(k: int) -> list[str]:
  # our "ingredients" for different kmers
  nucleotides = ["A","T","C","G"]

  # if k is one, then all kmers are just our nucleotides,
  # this is our base case, once this is returned, the actual recursion begins
  if k == 1:
    return nucleotides

  # now we set up for our neighborhood
  neighborhood = [] # creating empty list (no duplicates can occur, so a set is not needed)

  # DECREMENT K, TO MEET OUR BASE CASE
  SuffixNeighbors = AllStrings(k - 1) # Recursion, will keep calling function until base case (k == 1)

  for text in SuffixNeighbors:
    for nucleotide in nucleotides:
      # we now add each nucleotide in front of each (k-1)-mer
      neighbor = nucleotide + text
      neighborhood.append(neighbor)

  return neighborhood

In [None]:
print(sorted(AllStrings(2)))

# NOT WORKING AS OF 3/10/25

['AA', 'AC', 'AG', 'AT', 'CA', 'CC', 'CG', 'CT', 'GA', 'GC', 'GG', 'GT', 'TA', 'TC', 'TG', 'TT']


Now that we have a function to compute all possible kmers of length k, and a function to minimize the distance (HammingDistance) between one of these found patterns and a set of sequences, we can now look for the MedianString.
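
As an aside, the same list of k-mers can also be produced without recursion. The sketch below uses `itertools.product` from the standard library; `all_kmers` is a hypothetical name chosen so that it does not clash with AllStrings.

```python
from itertools import product

def all_kmers(k: int) -> list[str]:
    """Every string of length k over the alphabet A, C, G, T, in lexicographic order."""
    return ["".join(letters) for letters in product("ACGT", repeat=k)]

print(all_kmers(2))
print(len(all_kmers(7)))  # 4**7 = 16384
```

Either version produces 4ᵏ strings, so the choice between them is mostly a matter of readability.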

In [None]:
# MEDIAN STRING
# INPUT: space separated string DNA and int k.
# OUTPUT: A k-mer Pattern that minimizes d(Pattern, Dna) among all possible choices of k-mers. (If there are multiple such strings Pattern, then you may return any one.)

def MedianString(DNA: str, k: int) -> str:
  patterns = AllStrings(k)
  median = ""

  # start above any possible distance: the max distance is k for each string in DNA
  median_distance = float("inf")

  for pattern in patterns:
    distance = DistanceBetweenPatternAndStrings(pattern, DNA)
    # keep only the best pattern found so far
    if distance < median_distance:
      median_distance = distance
      median = pattern

  return median




In [None]:
dna = "CTCGATGAGTAGGAAAGTAGTTTCACTGGGCGAACCACCCCGGCGCTAATCCTAGTGCCC GCAATCCTACCCGAGGCCACATATCAGTAGGAACTAGAACCACCACGGGTGGCTAGTTTC GGTGTTGAACCACGGGGTTAGTTTCATCTATTGTAGGAATCGGCTTCAAATCCTACACAG"
k = 7

answer = MedianString(dna,k)
print(answer)

AATCCTA


In [None]:
# TESTING MEDIAN STRING
k = 3
DNA = "AAATTGACGCAT GACGACCACGTT CGTCAGCGCCTG GCTGAGCACCGG AGTTCGGGACAG"

answer = MedianString(DNA,k)
print(answer)

GAC


In [None]:
# url = "/content/sample_data/dataset_30304_9.txt"

# with open(url, "r") as file:
#   my_file = file.readlines()

# k = int(my_file[0])
# DNA = my_file[1]

# answer = MedianString(DNA,k)
# print(answer)

# # IT WORKED!

Quotes from "Bioinformatics Algorithms: An Active Learning Approach" by Phillip Compeau and Pavel A. Pevzner:

> "To see why we reformulated the Motif Finding Problem as the equivalent Median String Problem, consider the runtime of MedianString and BruteForceMotifSearch. The former algorithm computes d(Pattern, Dna) for each of the 4ᵏ k-mers Pattern. Each computation of d(Pattern, Dna) requires a single pass over each string in Dna, which requires approximately k · n · t operations for t strings of length n in Dna. Therefore, MedianString has a running time of O(4ᵏ · n · k · t), which in practice compares favorably with the O(nᵗ · k · t) running time of BruteForceMotifSearch because the length of a motif (k) typically does not exceed 20 nucleotides, whereas t is measured in the thousands."

> "The Median String Problem teaches an important lesson, which is that sometimes rethinking how a problem is formulated can lead to dramatic improvements in the runtime required to solve it. In this case, our simple observation that Score(Motifs) could just as easily be computed row-by-row as column-by-column produced the faster MedianString algorithm."

> "Of course, the ultimate test of a bioinformatics algorithm is how it performs in practice. Unfortunately, since MedianString has to consider 4ᵏ k-mers, it becomes too slow for the Subtle Motif Problem, for which k = 15. We will run MedianString with k = 13 in the hope that it will capture a substring of the correct 15-mer motif. The algorithm still requires half a day to run on our computer and returns the median string AAAAAtAGaGGGG (with distance 29). This 13-mer is not a substring of the implanted pattern AAAAAAAAGGGGGGG, but it does come close."
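
To make the two running times from the quotes concrete, the sketch below plugs numbers into the two operation counts. The values of n, k, and t are assumed purely for illustration, and the function names are hypothetical; the counts ignore constant factors, so only their relative size is meaningful.

```python
def median_string_ops(n: int, k: int, t: int) -> int:
    """Rough operation count for MedianString: 4^k * n * k * t."""
    return 4**k * n * k * t

def brute_force_ops(n: int, k: int, t: int) -> int:
    """Rough operation count for BruteForceMotifSearch: n^t * k * t."""
    return n**t * k * t

# assumed sizes for illustration: 10 strings of 600 nucleotides, 15-mers
n, k, t = 600, 15, 10
print(f"MedianString:          {median_string_ops(n, k, t):.2e}")
print(f"BruteForceMotifSearch: {brute_force_ops(n, k, t):.2e}")
```

The exponent is what matters: MedianString is exponential in k, which stays small, while BruteForceMotifSearch is exponential in t, the number of strings.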

#### GREEDY MOTIF SEARCH 1.5

A "Greedy Algorithm" selects the "best" looking option at each *individual* step of an algorithm. However, this scope can be too narrow, leading to inaccurate solutions in the end. Greedy algorithms are often "Heuristic Algorithms": they trade some accuracy for speed in order to find an approximate solution much faster. Whether that trade-off is acceptable depends on the context.

We will use a "Greedy Approach" to find a possible Motif by using the probability matrix. The higher the probability of a kmer under the profile, the more likely it is to be similar to the consensus string.
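
The probability of a kmer under a profile is the product of the profile entries picked out by its nucleotides, one per column. The sketch below is a minimal illustration: `kmer_probability` is a hypothetical helper, and the profile values are the same ones used in the test profile later in this notebook.

```python
def kmer_probability(kmer: str, profile: dict) -> float:
    """Multiply the probability of each nucleotide of kmer in its own column."""
    prob = 1.0
    for column, nucleotide in enumerate(kmer):
        # a nucleotide that is missing from a column counts as probability 0
        prob *= profile[column].get(nucleotide, 0)
    return prob

# profile[column][nucleotide] = probability of that nucleotide in that column
profile = {0: {"A": 0.2, "C": 0.4, "G": 0.3, "T": 0.1},
           1: {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
           2: {"A": 0.3, "C": 0.1, "G": 0.5, "T": 0.1},
           3: {"A": 0.2, "C": 0.5, "G": 0.2, "T": 0.1},
           4: {"A": 0.3, "C": 0.1, "G": 0.4, "T": 0.2}}

print(kmer_probability("CCGAG", profile))  # 0.4 * 0.3 * 0.5 * 0.2 * 0.4, about 0.0048
print(kmer_probability("TTTTT", profile))  # 0.1 * 0.2 * 0.1 * 0.1 * 0.2, about 0.00004
```

A kmer that agrees with the most frequent nucleotide in many columns ends up with a much larger product than one that does not.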

In [None]:
# # --- OLD VERSION ---

# # ProfileMostProbableKmer -> We will find a Profile-Most probable kmer of a string
# # INPUT: a string TEXT, integer k, and a 4xk profile matrix (row x column)
# # OUTPUT: A Profile-most probable kmer in TEXT.

# # we are basing out rows from previous examples, order is as follows:
# # 0 -> A, 1 -> C, 2 -> G, 3 -> T
# def ProfileMostProbableKmer(text: str, k: int, profiles) -> str:
#   mostProbableKmer = text[:k]
#   kmers = set()
#   prob_table = {}
#   # lets list our kmers
#   for i in range(len(text)-k+1):
#     kmer = text[i:i+k]
#     kmers.add(kmer)

#   # for each kmer, we want to compute its probability based on the given probabilty matrix
#   for kmer in kmers:
#     # setting the base probability as 1, but this will change accordingly when multiplied by the actual probability.
#     prob_table[kmer] = 1

#     # for each nucleotide in kmer, we check which it is and then multiply the appropriate probability based on its position in the string.
#     for nucleotide in range(k):
#       if kmer[nucleotide] == "A":
#         prob_table[kmer] = (prob_table[kmer] * profiles[0][nucleotide])

#       if kmer[nucleotide] == "C":
#         prob_table[kmer] = (prob_table[kmer] * profiles[1][nucleotide])

#       if kmer[nucleotide] == "G":
#         prob_table[kmer] = (prob_table[kmer] * profiles[2][nucleotide])

#       if kmer[nucleotide] == "T":
#         prob_table[kmer] = (prob_table[kmer] * profiles[3][nucleotide])

#   # now we look for the max probability value in our table
#   max_val = max(prob_table.values())

#   # now we look for the kmer that has that max value
#   for kmer in prob_table:
#     if prob_table[kmer] == max_val:
#       mostProbableKmer = kmer

#   return mostProbableKmer

In [None]:
# -- UPDATED VERSION -- 03/25/25
# ProfileMostProbableKmer -> We will find a Profile-most probable kmer of a string
# INPUT: a string TEXT, integer k, and a profile indexed by column, where
#        profiles[column] maps each nucleotide to its probability in that column
#        (the format returned by countMotifPercent)
# OUTPUT: A Profile-most probable kmer in TEXT (the first one found if there is a tie).

def ProfileMostProbableKmer(text: str, k: int, profiles) -> str:
  # default to the first kmer so that it is returned when every kmer has probability 0
  mostProbableKmer = text[:k]
  max_prob = 0

  # scan the kmers from left to right so that ties keep the earliest kmer
  for i in range(len(text)-k+1):
    kmer = text[i:i+k]

    # start at 1 and multiply by the probability of each nucleotide in its column
    prob = 1
    for column in range(k):
      # .get() returns 0 if the nucleotide is missing from this column of the profile
      prob = prob * profiles[column].get(kmer[column],0)

    # strictly greater, so an equally probable kmer found later does not replace it
    if prob > max_prob:
      max_prob = prob
      mostProbableKmer = kmer

  return mostProbableKmer

In [None]:
# testing the different methods for grabbing indices and values from maps/dicts

profile = [[1,2,3,4],[5,6,7,8]]

for row, index_row in enumerate(profile):
  print("row:",row,"index row:",index_row)


my_dict = {"A":0.4,"C":0.6}
print(my_dict["A"])

for letter in my_dict:
  print("letter:",letter, "value:",my_dict[letter])

row: 0 index row: [1, 2, 3, 4]
row: 1 index row: [5, 6, 7, 8]
0.4
letter: A value: 0.4
letter: C value: 0.6


In [None]:
# testing MostProbableKmer function
text = "ACCTGTTTATTGCCTAAGTTCCGAACAAACCCAATATAGCCCGAGGGCCT"
k = 5
# prob_matrix_1 = [[0.2, 0.2, 0.3, 0.2, 0.3],
#                [0.4, 0.3, 0.1, 0.5, 0.1],
#                [0.3, 0.3, 0.5, 0.2, 0.4],
#                [0.1, 0.2, 0.1, 0.1, 0.2,]]
# changed the desired profile format to be a dictionary, where the keys are the
# column indices, and the values are the probabilities for each nucleotide in that column
prob_matrix_2 = {0:{"A":0.2,"C":0.4,"G":0.3,"T":0.1},
                1:{"A":0.2,"C":0.3,"G":0.3,"T":0.2},
                2:{"A":0.3,"C":0.1,"G":0.5,"T":0.1},
                3:{"A":0.2,"C":0.5,"G":0.2,"T":0.1},
                4:{"A":0.3,"C":0.1,"G":0.4,"T":0.2}}

answer = ProfileMostProbableKmer(text,k,prob_matrix_2)
print(answer)

CCGAG


In [None]:
prob = max(prob_matrix_2[0].values())
nucleotide = max(prob_matrix_2[0], key=prob_matrix_2[0].get)
print(prob)
print(nucleotide)
print(prob_matrix_2[0].get)

0.4
C
<built-in method get of dict object at 0x7ad019c0aa80>


In [None]:
# url = "/content/sample_data/dataset_30305_3.txt"

# with open(url, "r") as file:
#   my_file = file.readlines()

#   new_files = [line.replace("\n","") for line in my_file]

#   # k = int(new_files[0])
#   # text = new_files[1]

# text = new_files[0]
# k = int(new_files[1])

# prof_matrix = []
# for line in new_files[2:]:
#   prof_matrix.append([float(i) for i in line.split()])
# answer = ProfileMostProbableKmer(text,k,prof_matrix)
# print(answer)

`"Our proposed greedy motif search algorithm, GreedyMotifSearch, starts by forming a motif matrix from arbitrarily selected k-mers in each string from Dna (which in our specific implementation is the first k-mer in each string). It then attempts to improve this initial motif matrix by trying each of the k-mers in Dna1 as the first motif. For a given choice of k-mer Motif1 in Dna1, it builds a profile matrix Profile for this lone k-mer, and sets Motif2 equal to the Profile-most probable k-mer in Dna2. It then iterates by updating Profile as the profile matrix formed from Motif1 and Motif2, and sets Motif3 equal to the Profile-most probable k-mer in Dna3. In general, after finding i − 1 k-mers Motifs in the first i − 1 strings of Dna, GreedyMotifSearch constructs Profile(Motifs) and selects the Profile-most probable k-mer from Dnai based on this profile matrix. After obtaining a k-mer from each string to obtain a collection Motifs, GreedyMotifSearch tests to see whether Motifs outscores the current best scoring collection of motifs and then moves Motif1 one symbol over in Dna1, beginning the entire process of generating Motifs again."`

To run our greedy motif search, we start from an arbitrary collection of motifs, one per DNA string; in our implementation this is the first kmer of each string. <br>

The function then tries each kmer in the first DNA string as Motif1. For a given Motif1 it builds a profile matrix Profile from that single kmer and sets Motif2 to the Profile-most probable kmer in DNA2. It then rebuilds Profile from Motif1 and Motif2 and sets Motif3 to the Profile-most probable kmer in DNA3, and so on: Motif_i is always the Profile-most probable kmer in DNA_i, where Profile is built from the i − 1 motifs chosen so far. <br>

Once a motif has been chosen from every string, GreedyMotifSearch scores the resulting Motifs against the current best-scoring collection (BestMotifs) and keeps whichever scores lower. It then moves Motif1 one nucleotide down the first string and starts the whole process over, generating another collection of motifs to compare against BestMotifs.




```
# PseudoCode for GreedyMotifSearch
def GreedyMotifSearch(DNA, k , t):
  BestMotifs <- Motif Matrix formed by first kmer in each string of DNA
  FOR each kmer Motif in the first string of DNA
    Motif_1 <- Motif
    FOR i = 2 to t:
      form Profile from Motif_1...Motif_i-1
      Motif_i <- Profile-most probable kmer in the ith string of DNA
    Motifs <- (Motif_1...Motif_t)
    IF score(Motifs) < score(BestMotifs)
      BestMotifs <- Motifs

  return BestMotifs

```
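
To make the inner loop of the pseudocode concrete, here is a minimal, self-contained sketch that traces a single pass for one choice of Motif_1. The helper names `build_profile`, `most_probable_kmer`, and `greedy_pass` are hypothetical stand-ins written only for this illustration (they are not the notebook's own functions), and the profile is built without pseudocounts.

```python
from collections import Counter

def build_profile(motifs):
    """Column-wise nucleotide frequencies (no pseudocounts) for equal-length motifs."""
    t = len(motifs)
    profile = []
    for column in zip(*motifs):
        counts = Counter(column)
        profile.append({n: counts[n] / t for n in "ACGT"})
    return profile

def most_probable_kmer(text, k, profile):
    """First k-mer in text with the highest probability under profile."""
    best_kmer, best_prob = text[:k], -1.0
    for i in range(len(text) - k + 1):
        kmer = text[i:i + k]
        prob = 1.0
        for column, nucleotide in enumerate(kmer):
            prob *= profile[column][nucleotide]
        if prob > best_prob:  # strict '>' keeps the earliest k-mer on ties
            best_kmer, best_prob = kmer, prob
    return best_kmer

def greedy_pass(dna_strings, k, start):
    """One inner loop of the greedy search, seeded with the k-mer at index `start` of the first string."""
    motifs = [dna_strings[0][start:start + k]]
    for text in dna_strings[1:]:
        # the profile is rebuilt from every motif chosen so far
        motifs.append(most_probable_kmer(text, k, build_profile(motifs)))
    return motifs

sample = "GGCGTTCAGGCA AAGAATCAGTCA CAAGGAGTTCGC CACGTCAATCAC CAATAATATTCG".split()
trace = greedy_pass(sample, 3, 6)  # seed: sample[0][6:9] == "CAG"
print(trace)  # → ['CAG', 'CAG', 'CAA', 'CAA', 'CAA']
```

Note what happens at the third string: the profile built from `CAG` and `CAG` gives every kmer in `CAAGGAGTTCGC` a probability of 0, so the pass falls back to the first kmer, `CAA`. A single unseen nucleotide is enough to zero out a kmer.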


In [None]:
motif = ["ACCT"]
display(countMotifPercent(motif))

{0: {'A': 1.0, 'C': 0.0, 'G': 0.0, 'T': 0.0},
 1: {'A': 0.0, 'C': 1.0, 'G': 0.0, 'T': 0.0},
 2: {'A': 0.0, 'C': 1.0, 'G': 0.0, 'T': 0.0},
 3: {'A': 0.0, 'C': 0.0, 'G': 0.0, 'T': 1.0}}

In [None]:
# # GreedyMotifSearch
# # INPUT: integers k,t, followed by space separated collection of strings DNA
# # OUTPUT: A collection of strings BestMotifs resulting from applying GreedyMotifSearch(Dna, k, t).
# #         If at any step you find more than one Profile-most probable k-mer in a given string, use the one occurring first.

# def GreedyMotifSearch(DNA: str, k: int, t: int) -> list[str]:
#   bestMotifs = []
#   dnaList = DNA.split()
#   consensusMotifs = []

#   # # creating our first set of "best" motifs from the first kmers in each DNA string
#   # for string in dnaList:
#   #   bestMotifs.append(string[:k])

#   for i in range(len(dnaList[0])-k+1):
#     motif = dnaList[0][i:i+k]
#     bestMotifs = [motif]

#     for j in range(1,len(dnaList)):
#       profile = countMotifPercent(bestMotifs)
#       motif = ProfileMostProbableKmer(dnaList[j],k,profile)
#       bestMotifs.append(motif)
#


#       if j == len(dnaList)-1:
#         consensus_motif = ""
#         for column in range(k):
#           # append the the nucleotide with the highest probability in each column
#           consensus = max(profile[column], key=profile[column].get)
#           consensus_motif = consensus_motif[:column] + consensus + consensus_motif[column+1:]
#           if column == k-1:
#             consensusMotifs.append(consensus_motif)

            # NEEDED TO COMPUTE THE MOTIF SCORE HERE, THEN IF IT IS LOWER, HAVE THE CURRENT SET OF MOTIFS BE THE "BEST MOTIFS"

#   return consensusMotifs

#   # A C G T

#   return bestMotifs


In [None]:
# --- UPDATED VERSION ---
# 03/26/25

def GreedyMotifSearch(DNA: str, k: int, t: int) -> list[str]:
    dnaList = DNA.split()
    bestMotifs = [string[:k] for string in dnaList]  # Initialize with first k-mers
    bestScore = float('inf')  # Track the best score (lower is better)

    # Iterate over all possible k-mers in the first DNA string
    for i in range(len(dnaList[0]) - k + 1):
        motifs = [dnaList[0][i:i+k]]  # Start with the current k-mer from the first sequence

        # Build the profile and find motifs in the remaining sequences
        for j in range(1, t):
            profile = countMotifPercent(motifs)
            nextMotif = ProfileMostProbableKmer(dnaList[j], k, profile)
            motifs.append(nextMotif)

        # Compute the score of the current motif set
        currentScore = computeMotifScore(motifs)

        # Update bestMotifs if the current set is better
        if currentScore < bestScore:
            bestScore = currentScore
            bestMotifs = motifs.copy()

    return bestMotifs


def computeMotifScore(motifs: list[str]) -> int:
    """Compute the total number of mismatches against the consensus motif."""
    if not motifs:
        return 0

    consensus = []
    k = len(motifs[0])

    # Build the consensus string
    for i in range(k):
        column = [motif[i] for motif in motifs]
        mostCommon = max(set(column), key=column.count)
        consensus.append(mostCommon)
    consensusMotif = ''.join(consensus)

    # Calculate total mismatches
    score = 0
    # for motif in motifs:
    #     score += sum(1 for a, b in zip(motif, consensusMotif) if a != b)
    for motif in motifs:
      score += HammingDistance(motif,consensusMotif)

    return score

In [None]:
dna = "GGCGTTCAGGCA AAGAATCAGTCA CAAGGAGTTCGC CACGTCAATCAC CAATAATATTCG"
k = 3
t = 5

answer = GreedyMotifSearch(dna,k,t)

print(answer)
# expecting t = 5 motifs of length k = 3, one from each DNA string

['CAG', 'CAG', 'CAA', 'CAA', 'CAA']


In [None]:
# url = "/content/sample_data/dataset_30305_5.txt"
# with open (url, "r") as file:
#   my_file = file.readlines()

#   new_files = [line.replace("\n","") for line in my_file]

In [None]:
my_file = ['12 25',
 'ATGGGCAGCATGGAGTACACCTCACGAATGTTGGGTACGGGCTAGGGGAATGAACCTTTGGTCCGAGAGTTTGCCTACACCACAGTTTAACAGACTTAACTACCACGCTCGTACATAGAGTGTAGGCAGCTGGCTATGCTACGAACCGTAAGGTAG GGAAAGACACAATATCGTTTGGGTGTGCAAATTGTGTGGTGCTAGCATTGAAATGGAGCAGTACCGTGTACCCGAGGGGGTGTCGTTCAAATCCAGGACCCGCGAGGCGACTATCCCACACGGCGTATGAGAACTTTGGACAGCTAGGGATTGCCG GTGTTCACTTGGGAATAGTTTATCGGTATAAGACCATGTCTCGCAAACTGGTGTACAGCAGTCAGCAGACGGGGCTTGCCCACATCTACGTACTCCCTGTCTGTGGCTAGTGTAATCACCCGTAGCCGCGATCGATTATGGCGACAATGTTATGGA GGGCCGTTCTTTTCTCTCGCGTAAAACTGAGCGTGCGAAATGTTCCGGCAAAAGCTCCAGTTATCTGGGATCGACTGCCCCACAGGAGGGATTATCCAACGACTACATACGACAGGAGCAGAATAACGTGAGCCGAACCAGTACATGACGGCATCA CGGGATGCGGTATTTTACCGGTCAACACGTTTTCAGAATGCCTTCCAAAATCACCAAAGACAGCTTTGAAATCCACCCTCGTCGTGCTGGTCCGCGATGCGGGACTTTCATTTCGACCACTCCAATAAGCATTTCAGGGGCAGGGACTGGTCCACA GGCACGAAGTCTCAACTCATTACACTTAGGCCTAACCGACTATGGGACGACTGGACCACATACCGAGGGCTGTTAGACCATGTCGCGGAAACTCCTCGAGAATATTGTAGATGCTAATTCGCTCACGCTGCCGTGGTGAGTGGTTGATCCAGCGTT TACCACAGGGGTCGTGTTCGTGGCTGCGGAATGATTTAACAGGCTTGGCGTGGGTGCTAGGGCACTAAGGGCCTAAAAAAGTTCACTAACCGGCTGATAGGGGCGGTTGTTATGGCTTTGGTCTGGGCCACAATTCATGAGAAGTCGCATAGTAAC TTATACCTGACTTTGCTTGAATTTGTGATAGTAGAAGTCTCGCCCACATGTGGACATGTGGGTATTTCGAATAGGAACTGGGGGGTGTATTATCATGGAATCAATTAACATCGGTCAAAGGGGCTCACGCGAATACGAGCCGGTAGCATGCGTTGA CTAAGTATATTCGACTACACCACACTTTCTGCATACAGTCAACCGAGACCCGAACTTCACTATTGTAAGTAACGTTTGAGTTACGTGTAAAAGGTTGGACGCAAATCAGAGTGTGCGTGCCTCGTCACAGAATACCCGGACATGGCAAAATGTGAT GACTGAGCCACAAACATCTCACGTGCTTCTCCTACTAACCCACTGTGGCCTAATCCGAAGTCAGAATCTCTCGGTCCCTGAGATTTCGTGACAACGGTCTCGGCGTATTATTAGCAAAAACGGCGGCGTATAGAGTTATGACTGAACAGAAACATC TGAAAAGTACACAATCGATCGGTGGTAGTGATAGACAAAACTTACCACTCGCAGAGGGCATCCAAGAAAGCTGCCCAGAGAGCAATCAAAACACCATCAGTACGCTCTCCAAGAAGATACAGAAGTCCCACTGACTTGGCCACAAGCACTGTATGT TATCTAATGCTCTGCATTGAACGTTACCCAGTGTCGATATGAACAAGGGCCTCGTCCACACCGCAGAGCTGCCAACCTGAGAGCGGTAGGCAGTCCCTTGGACGGTGCAACTATGGCGCTGGGGTAACTAAGAGAAGGTCTATCGTGAAGCGCTAT 
AGTTGAATATTTTACTGAATCAATCGATAACTTCTATTCCCTGCCCTAATTGCCCCACGTGACTGTACCACACACGAGCTATCATACGGGTGTGCATGTTCCCTGACCGAGCAAGCGTTCGACAGCAAACCCCCCGGTGATAACATACCCCCTGCA GTAGGCCTGCAGGCCTTGGCCACAGGTGGCCGAGTGCAGCTCATGAGTAGCGCAACTCAGCGACGCGTCGTTATGTGAAGACACAAACCCCCAAACTTCCGCAACAACTCCAGCCTGGTCGCGGCACTACTGACATAAGACGGGTTCGGCTGGAAG AAGCCGATGTAAAAGCATATGGCCCTATGGTACACCAGAAGGTAACTAGTCTCCCGCATAACAGCGGACATCGGTGAACCACAAGAAGCATCGGCCTGCTCCCTAGCCTCCAAATTATCGGAGATCTATCGCGTCTAGCCCACAACACGGGAGTGT GTATCATGTTAAAAATGCGTGTCGTCACCCAACACCCCCACGGGCACCGTCTCAGCCACAAAAGGGTAGCGATTGTTGATCGGTGCTTAATGCGGAGTATTCCTGCTAACTTGTCCGAGGTAAAGGACCCAATGTGCTCTAAATCTACGTTGGATA TATACCGGTCGAGCTTACTGAATCCCAATGTTCGCTCGCGGGACAGGAATAGTGGAACGCTTGGCACTTGCGAATACTGTGCGGGTCTGACCCACAACGGCGAACAATTCGTGTCCGTTGATCTACGGTATCCGTTATAATCCCACGGTGGGCAAT CTGATTACTTACCGTGGCAGGTTTTGTGGCGTAGGCCTTTCATCGTACTTCTGCCGTTTTGACAAATATGATTGTAAAATCAAACTCGTGACATTCGCAGCATTAGTTTGGCGCGAATTCCTCGTTCACTGCGCCTTTCCCACAGAGGCGGGCAAT TATTATAGTCGGGCGCTTCGAATGAAAATATTGGTAAAAAGAGACAGCAGAGTTAAAATTTACCGCATTCCATATCCACTCTTCGGACATCCTCTACGGATATAATTCACTGAACAAGTTGCCTTAGCCACATCTATTTGCGGCTCTGTGGAAAAT AAAGCCGTGGATCCTGTGGGCGATAGTAGCCCTTCGTTTAAGGAGCGCAGCGCGAGGTGTCCCCACCCGGATGACTTCTCCACATATAGCCCCCCGCGCAATCACATAACGGAATCCGCGTAATGGTTGAGGTCTGTTTGTTGGGACGGGGGGAAG TGTTCCAGACCGGCTGTGCTAGTAGGCTCATCCACAGGTCCGTGACGCGGGTTCTCGGGTTTGAGTCTCCCCGGTCTTGCTGCGGACAACCCGCGAGTGTGTTGTCATTGCCGATAAAAGCCGCTGTCTCGTGTCACGGTACGTCTGAGCGGTCTA ACGCTGTATTTACAAGCGGCCTCTATCGGGGTGTCTAGGACTAATGTTTAGAGTGGTATGAACCTCATTCACGCAGCATACGTTGGAATTGTCCTGAGAACGCTCTACCGATCTCTACATACAAGCCCTGCAGCCAGCAGGGGTGCCTACACCACA ATGCTCGCCCTTAACCGGACCCATCGCATAAGTCCTATTGGTTAGGCTCGACTCTCGTTAGAGATACCTCACGACGCCTACCACTACCAGTACCTAACAGGTGACACAGCCGCTTACAAGGTCTGTCCCACAGGCACGGGGTGTCCGCGCGATTAT GCGGGTTTTGTGCGAAATTGTATAAACTAGCCTCAGTCAACGAACTGCACACAGAGAAGTGGATACGCCCCGACATCAACTCAGGCAACCTACAGGACAACGTCGCCGGGCTACTCCACACCCAGAATGCAGAAGTATGATGTATGTACTGTATAT 
TAGTCCGGCGGTACCCGCGTAAAGGGCGTTGTCATTTCTACGCACTCGATATGTAAGTACGGGACTTCGGCCCGTAGGCCCGTGGTCTGGTCCACATAAGAGCCATGGTCACACGCTCCAAATCCATGGGGCAGATTGAGAAAGGTAAGAGGTGTT']

In [None]:
# new_files[0] = new_files[0].split()
# k = int(new_files[0][0])
# t = int(new_files[0][1])
# dna = new_files[1]

# answer = GreedyMotifSearch(dna,k,t)

In [None]:
print(answer)

['TAGGGGAATGAA', 'GGAAAGACACAA', 'GTGTTCACTTGG', 'GGGCCGTTCTTT', 'CGGGATGCGGTA', 'GGGACGACTGGA', 'GGAATGATTTAA', 'GGAATCAATTAA', 'GTAACGTTTGAG', 'GAGATTTCGTGA', 'GAGAGCAATCAA', 'GTGTCGATATGA', 'TGAATCAATCGA', 'GTGAAGACACAA', 'CAAATTATCGGA', 'TAAAGGACCCAA', 'GAGCTTACTGAA', 'CTGCCGTTTTGA', 'TAATTCACTGAA', 'GTAATGGTTGAG', 'GTGTTGTCATTG', 'GAGTGGTATGAA', 'CAGGTGACACAG', 'GAAATTGTATAA', 'GGGCAGATTGAG']


In [None]:
display(answer)

['TAGGGGAATGAA',
 'GGAAAGACACAA',
 'GTGTTCACTTGG',
 'GGGCCGTTCTTT',
 'CGGGATGCGGTA',
 'GGGACGACTGGA',
 'GGAATGATTTAA',
 'GGAATCAATTAA',
 'GTAACGTTTGAG',
 'GAGATTTCGTGA',
 'GAGAGCAATCAA',
 'GTGTCGATATGA',
 'TGAATCAATCGA',
 'GTGAAGACACAA',
 'CAAATTATCGGA',
 'TAAAGGACCCAA',
 'GAGCTTACTGAA',
 'CTGCCGTTTTGA',
 'TAATTCACTGAA',
 'GTAATGGTTGAG',
 'GTGTTGTCATTG',
 'GAGTGGTATGAA',
 'CAGGTGACACAG',
 'GAAATTGTATAA',
 'GGGCAGATTGAG']

In [None]:
display(" ".join(answer))

'TAGGGGAATGAA GGAAAGACACAA GTGTTCACTTGG GGGCCGTTCTTT CGGGATGCGGTA GGGACGACTGGA GGAATGATTTAA GGAATCAATTAA GTAACGTTTGAG GAGATTTCGTGA GAGAGCAATCAA GTGTCGATATGA TGAATCAATCGA GTGAAGACACAA CAAATTATCGGA TAAAGGACCCAA GAGCTTACTGAA CTGCCGTTTTGA TAATTCACTGAA GTAATGGTTGAG GTGTTGTCATTG GAGTGGTATGAA CAGGTGACACAG GAAATTGTATAA GGGCAGATTGAG'

#### MOTIF FINDING MEETS OLIVER CROMWELL 1.6

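We will be taking a look at **Cromwell's Rule**, which advises against assigning a prior probability of exactly 1 or 0 to anything other than statements that are logically true or false.<br>

This is where **pseudocounts** come in. A profile built from only a few motifs assigns probability 0 to every nucleotide it has not yet seen in a column, and a single 0 wipes out the probability of an entire kmer. By adding 1 to the number of occurrences of ALL events (here, to the count of each nucleotide in each column) before computing the frequencies, we replace any zero probabilities with small non-zero probabilities.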
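
As a small worked example of the idea, the sketch below computes the probabilities for a single motif column with and without a pseudocount. The function name `column_probabilities` and its `pseudocount` parameter are assumptions made for this illustration only. Note that the denominator grows by 4 (one added count per nucleotide) so that the column still sums to 1.

```python
def column_probabilities(column, pseudocount=1):
    """Nucleotide probabilities for one motif column, with `pseudocount` added to every count."""
    alphabet = "ACGT"
    # each of the 4 nucleotides receives `pseudocount` extra observations
    total = len(column) + pseudocount * len(alphabet)
    return {n: (column.count(n) + pseudocount) / total for n in alphabet}

raw = column_probabilities("CCCT", pseudocount=0)
smoothed = column_probabilities("CCCT")

print(raw)       # → {'A': 0.0, 'C': 0.75, 'G': 0.0, 'T': 0.25}
print(smoothed)  # → {'A': 0.125, 'C': 0.5, 'G': 0.125, 'T': 0.25}
```

With the pseudocount, A and G are no longer impossible in this column; they are merely unlikely.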

In [None]:
# rewriting Greedy Motif Search to include pseudocounts
# --- UPDATED VERSION ---
# 03/26/25

def GreedyMotifSearchWithPseudocounts(DNA: str, k: int, t: int) -> list[str]:
    dnaList = DNA.split()
    bestMotifs = [string[:k] for string in dnaList]  # Initialize with first k-mers
    bestScore = float('inf')  # Track the best score (lower is better)

    # Iterate over all possible k-mers in the first DNA string
    for i in range(len(dnaList[0]) - k + 1):
        motifs = [dnaList[0][i:i+k]]  # Start with the current k-mer from the first sequence

        # Build the profile and find motifs in the remaining sequences
        for j in range(1, t):
            profile = countMotifPercentPseudocounts(motifs)
            nextMotif = ProfileMostProbableKmer(dnaList[j], k, profile)
            motifs.append(nextMotif)

        # Compute the score of the current motif set
        currentScore = computeMotifScore(motifs)

        # Update bestMotifs if the current set is better
        if currentScore < bestScore:
            bestScore = currentScore
            bestMotifs = motifs.copy()

    return bestMotifs


In [None]:
def countMotifPercentPseudocounts(motifs):
    count = {}
    columns = []
    for i in range(len(motifs[0])):
        columns.append([motif[i] for motif in motifs])
    for i in range(len(columns)):
        # adding a pseudocount (+1) to every nucleotide count to remove probabilities of zero;
        # the denominator grows by 4 (one per nucleotide) so each column still sums to 1
        total = len(columns[i]) + 4
        count[i] = {nucleotide: (columns[i].count(nucleotide) + 1) / total for nucleotide in 'ACGT'}

    return count

In [None]:
k = 3
t = 5
dna = "GGCGTTCAGGCA AAGAATCAGTCA CAAGGAGTTCGC CACGTCAATCAC CAATAATATTCG"

answer = GreedyMotifSearchWithPseudocounts(dna,k,t)
print(answer)

['TTC', 'ATC', 'TTC', 'ATC', 'TTC']


In [None]:
url = "/content/sample_data/dataset_30306_9.txt"

with open(url, "r") as file:
  my_file = file.readlines()

  new_files = [line.replace("\n","") for line in my_file]

k = int(new_files[0].split()[0])
t = int(new_files[0].split()[1])
dna = new_files[1]

answer = GreedyMotifSearchWithPseudocounts(dna,k,t)
print(answer)

['AAGACCCGTGCA', 'CAGCCCCCTGCA', 'TAGACCCTTACA', 'AAGACCCTTCCA', 'GAGTCCCATGCA', 'TAGGCCCGTACA', 'TAGCCCCATTCA', 'CAGACCCGTTCA', 'GAGACCCATCCA', 'TAGTCCCATCCA', 'GAGTCCCTTCCA', 'GAGTCCCCTTCA', 'CAGGCCCTTGCA', 'GAGGCCCCTTCA', 'GAGGCCCGTCCA', 'CAGGCCCTTCCA', 'GAGACCCGTTCA', 'GAGCCCCGTACA', 'CAGCCCCGTGCA', 'CAGCCCCCTTCA', 'CAGGCCCCTGCA', 'CAGCCCCATGCA', 'AAGGCCCATCCA', 'TAGCCCCATACA', 'CAGTCCCTTACA']


In [None]:
display(" ".join(answer))

'AAGACCCGTGCA CAGCCCCCTGCA TAGACCCTTACA AAGACCCTTCCA GAGTCCCATGCA TAGGCCCGTACA TAGCCCCATTCA CAGACCCGTTCA GAGACCCATCCA TAGTCCCATCCA GAGTCCCTTCCA GAGTCCCCTTCA CAGGCCCTTGCA GAGGCCCCTTCA GAGGCCCGTCCA CAGGCCCTTCCA GAGACCCGTTCA GAGCCCCGTACA CAGCCCCGTGCA CAGCCCCCTTCA CAGGCCCCTGCA CAGCCCCATGCA AAGGCCCATCCA TAGCCCCATACA CAGTCCCTTACA'

In [None]:
profile = {0: {'A': 0.4, 'C': 0.2, 'G': 0.1, 'T': 0.3},
           1: {'A': 0.3, 'C': 0.3, 'G': 0.3, 'T': 0.1},
           2: {'A': 0.0, 'C': 0.0, 'G': 1.0, 'T': 0.0},
           3: {'A': 0.1, 'C': 0.4, 'G': 0.1, 'T': 0.4},
           4: {'A': 0.0, 'C': 0.0, 'G': 0.5, 'T': 0.5},
           5: {'A': 0.9, 'C': 0.1, 'G': 0.0, 'T': 0.0}}

In [None]:
import math

entropy = 0
for i in range(len(profile)):
    for nucleotide in profile[i]:
        if profile[i][nucleotide] != 0:
            entropy += profile[i][nucleotide] * math.log2(profile[i][nucleotide])


display(-entropy)

6.932824877385981
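
For reference, the value above is the entropy of the profile summed over all of its columns. Writing $p_{j,n}$ (notation introduced here only for illustration) for the probability of nucleotide $n$ in column $j$, and treating terms with $p_{j,n} = 0$ as contributing 0:

$$
H(\text{Profile}) = -\sum_{j=0}^{k-1} \; \sum_{n \in \{A, C, G, T\}} p_{j,n} \, \log_2 p_{j,n}
$$

For example, column 2 of this profile (G with probability 1.0) contributes 0 bits, while column 4 (G and T at 0.5 each) contributes exactly 1 bit.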

In [None]:
mostCommon = max(profile[0], key=profile[0].get)
print(mostCommon)

A


In [None]:
consensus = []
for i in range(len(profile)):
  mostCommon = max(profile[i], key=profile[i].get)
  consensus.append(mostCommon)
consensusMotif = ''.join(consensus)
print(consensusMotif)

AAGCGA


In [None]:
motif_matrix = "CTCGATGAGTAGGAAAGTAGTTTCACTGGGCGAACCACCCCGGCGCTAATCCTAGTGCCC GCAATCCTACCCGAGGCCACATATCAGTAGGAACTAGAACCACCACGGGTGGCTAGTTTC GGTGTTGAACCACGGGGTTAGTTTCATCTATTGTAGGAATCGGCTTCAAATCCTACACAG"
k = 7

In [None]:
MedianString(motif_matrix,k)

['AATCCAG', 'CAATCGG', 'GTTTCAT', 'AGTTTCA', 'GTAGTTT', 'ACCACGG', 'AATCCTA']