Many algorithms are iterative procedures that must choose among many alternatives at each iteration. Some of these alternatives may lead to correct solutions, whereas others may not. Greedy algorithms select the “most attractive” alternative at each iteration. Most greedy algorithms fail to find an exact solution to the problem; instead, they are fast heuristics that trade accuracy for speed and return an approximate solution.

Our proposed greedy motif search algorithm, GREEDYMOTIFSEARCH, tries each of the k-mers in Dna1 as the first motif (this step is not greedy, since every k-mer is tried). For a given choice of k-mer Motif1 in Dna1, it builds a profile matrix Profile for this lone k-mer and sets Motif2 equal to the Profile-most probable k-mer in Dna2. It then updates Profile to be the profile matrix formed from Motif1 and Motif2, and sets Motif3 equal to the Profile-most probable k-mer in Dna3. In general, after finding i-1 k-mers Motifs in the first i-1 strings of Dna, GREEDYMOTIFSEARCH constructs Profile(Motifs) and selects the Profile-most probable k-mer from the i-th string Dnai based on this profile matrix. After obtaining a k-mer from each string to generate a collection Motifs, GREEDYMOTIFSEARCH tests whether Motifs outscores the current best-scoring collection of motifs, then moves Motif1 one symbol over in Dna1 and begins the entire process of generating Motifs again.
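
The first few greedy steps described above can be traced by hand. The following is only a minimal sketch, separate from the notebook's own functions: `profile_from_motifs` and `most_probable_kmer` are hypothetical helper names, the profile is stored as a list of plain dictionaries rather than a NumPy array, and Motif1 is fixed to `CAG` purely for illustration.

```python
from collections import Counter

def profile_from_motifs(motifs):
    """Column-wise nucleotide frequencies for a list of equal-length motifs."""
    t = len(motifs)
    return [{n: Counter(column)[n] / t for n in "ACGT"} for column in zip(*motifs)]

def most_probable_kmer(text, k, profile):
    """Leftmost k-mer in text with the highest probability under profile."""
    best_kmer, best_prob = text[:k], -1.0
    for i in range(len(text) - k + 1):
        kmer = text[i:i + k]
        prob = 1.0
        for column, nucleotide in zip(profile, kmer):
            prob *= column[nucleotide]
        if prob > best_prob:  # strict '>' keeps the first k-mer on ties
            best_kmer, best_prob = kmer, prob
    return best_kmer

motifs = ["CAG"]  # illustrative choice of Motif1
# Motif2: most probable k-mer in the second string under Profile(Motif1)
motifs.append(most_probable_kmer("AAGAATCAGTCA", 3, profile_from_motifs(motifs)))
# Motif3: most probable k-mer in the third string under Profile(Motif1, Motif2)
motifs.append(most_probable_kmer("CAAGGAGTTCGC", 3, profile_from_motifs(motifs)))
print(motifs)  # ['CAG', 'CAG', 'CAA']
```

In the last step every 3-mer of the third string has probability zero under the profile, so the leftmost one, `CAA`, is selected by the tie-breaking rule.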

In [194]:
import numpy as np

In [195]:
def FindAllKmers(dna_string, k):
  # All k-mers of dna_string, in order of their starting position
  return [dna_string[i:i+k] for i in range(len(dna_string) - k + 1)]

In [196]:
from collections import Counter

In [197]:
def IndexNucletoide(index):
  if index == 0:
    return 'A'
  elif index == 1:
    return 'C'
  elif index == 2:
    return 'G'
  else:
    return 'T'

In [198]:
def NucleotideIndex(nucleotide):
  if nucleotide == 'A':
    return 0
  elif nucleotide == 'C':
    return 1
  elif nucleotide == 'G':
    return 2
  else:
    return 3

In [199]:
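def KmerProbability(profile, kmer):
  # Product of the profile entries selected by each nucleotide of kmer
  # (row = nucleotide index, column = position within the k-mer)
  probability = 1
  for position, nucleotide in enumerate(kmer):
    probability = probability * profile[NucleotideIndex(nucleotide),position]
  return probability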

In general, if there are multiple Profile-most probable k-mers in Text, then we select the first such k-mer occurring in Text.
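
As a small sketch of this tie-breaking rule: Python's built-in `max` returns the first maximal item it encounters, so applying it to the k-mers in left-to-right order yields the first occurrence. The names `first_most_probable` and `toy_probability`, and the probabilities in `toy_weights`, are made up for illustration and are not part of the notebook.

```python
def first_most_probable(text, k, probability):
    """Return the leftmost k-mer of text that maximizes probability(kmer)."""
    kmers = [text[i:i + k] for i in range(len(text) - k + 1)]
    # max() keeps the first maximal item, so ties go to the earliest k-mer.
    return max(kmers, key=probability)

# Hypothetical per-nucleotide weights, the same at every position.
toy_weights = {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0}

def toy_probability(kmer):
    result = 1.0
    for nucleotide in kmer:
        result *= toy_weights[nucleotide]
    return result

# 'CA' and 'AC' both have probability 0.25; 'CA' occurs first.
print(first_most_probable("GCATAC", 2, toy_probability))  # CA
```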

In [201]:
from collections import OrderedDict

In [202]:
def ProfileMostProbableKmer(text,k,profile):
  best_kmer = None
  best_probability = -1
  for kmer in FindAllKmers(text, k):
    probability = KmerProbability(profile,kmer)
    # strict '>' keeps the first k-mer when several share the highest probability
    if probability > best_probability:
      best_kmer = kmer
      best_probability = probability
  return best_kmer

In [203]:
def FormProfile(motifs,k):
  # Count matrix of shape (4, k): row = nucleotide (A, C, G, T), column = position.
  # Counts are left unnormalized; dividing every entry by len(motifs) would not
  # change which k-mer is the most probable one.
  profile = np.zeros((4,k))
  for motif in motifs:
    for i in range(k):
      profile[NucleotideIndex(motif[i]),i] += 1
  return profile

In [218]:
def Score(motifs,profile,k):
  # Number of positions at which the motifs differ from the consensus string,
  # taken as the highest-count nucleotide in each column of profile
  score = 0
  for i in range(k):
    consensus_nucleotide = IndexNucletoide(np.argmax(profile[:,i]))
    score = score + sum(motif[i] != consensus_nucleotide for motif in motifs)
  return score
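
The score can be checked by hand on a small collection of motifs. The sketch below is a standalone illustration, not the notebook's `Score`: the hypothetical helper `mismatch_score` works directly on the motif strings and, in each column, counts the symbols that differ from the most frequent one.

```python
from collections import Counter

def mismatch_score(motifs):
    """Total number of symbols that differ from the column-wise consensus."""
    total = 0
    for column in zip(*motifs):
        # Count of the most frequent nucleotide in this column
        most_common_count = Counter(column).most_common(1)[0][1]
        total += len(column) - most_common_count
    return total

# Columns 1 and 2 are unanimous; column 3 is G, G, A, A, A (two mismatches).
print(mismatch_score(["CAG", "CAG", "CAA", "CAA", "CAA"]))  # 2
```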

In [205]:
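def GreedyMotifSearch(dna,k,t):
  # Start with the first k-mer of every string as the best collection so far
  best_motifs = []
  for i in range(len(dna)):
    best_motifs.append(dna[i][0:k])
  # Try every k-mer of the first string as Motif1
  for kmer in FindAllKmers(dna[0],k):
    motifs = [kmer]
    for i in range(1,t):
      # Profile of the motifs chosen so far selects the motif in the next string
      profile = FormProfile(motifs,k)
      motifs.append(ProfileMostProbableKmer(dna[i],k,profile))
    # Score against the profile of all t motifs, not only the first t-1
    if Score(motifs,FormProfile(motifs,k),k) < Score(best_motifs,FormProfile(best_motifs,k),k):
      best_motifs = motifs
  return best_motifs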

In [206]:
dna = [
'GGCGTTCAGGCA',
'AAGAATCAGTCA',
'CAAGGAGTTCGC',
'CACGTCAATCAC',
'CAATAATATTCG']

In [207]:
k = 3

In [208]:
t = 5

In [219]:
GreedyMotifSearch(dna,k,t)

['CAG', 'CAG', 'CAA', 'CAA', 'CAA']

In [220]:
for solution in GreedyMotifSearch(dna,k,t):
  print(solution)

CAG
CAG
CAA
CAA
CAA


In [225]:
with open('/content/rosalind_ba2d.txt') as task_file:
  task_arguments = [line.rstrip() for line in task_file]

In [227]:
dna = task_arguments[1:]

In [230]:
k = 12

In [231]:
t = 25

In [232]:
with open("task_result.txt", "w") as f:
  for solution in GreedyMotifSearch(dna,k,t):
    f.write(solution + '\n')