At first glance, GREEDYMOTIFSEARCH may seem like a reasonable algorithm, but
it is not! Let’s see whether GREEDYMOTIFSEARCH will find the (4, 1)-motif ACGT
implanted in the following strings Dna:

ttACCTtaac

gATGTctgtc

acgGCGTtag

ccctaACGAg

cgtcagAGGT

We will assume that the algorithm has already correctly chosen the implanted 4-mer ACCT from the first string in Dna and constructed the corresponding Profile:

A: 1 0 0 0

C: 0 1 1 0

G: 0 0 0 0

T: 0 0 0 1

The algorithm is now ready to search for a Profile-most probable 4-mer in the second sequence. The issue, however, is that there are so many zeros in the profile matrix that the probability of every 4-mer but ACCT is zero! (The only 4-mer with nonzero probability is ACCT because the Profile matrix was constructed from that single k-mer.) Thus, unless ACCT is present in every string in Dna, there is little chance that GREEDYMOTIFSEARCH will find the implanted motif. Zeroes in the profile matrix are not just a minor annoyance but rather a persistent problem that we must address.
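The claim can be checked numerically. The following is a minimal sketch, separate from the notebook's own functions: the helper names `profile_from_kmer` and `kmer_probability` are hypothetical, and the profile is stored as a plain dictionary rather than a NumPy array.

```python
# Illustrative sketch; helper names are hypothetical, not part of the notebook.
NUCLEOTIDES = "ACGT"

def profile_from_kmer(kmer):
    """Profile built from a single k-mer: 1 where the symbol matches, 0 elsewhere."""
    return {n: [1.0 if c == n else 0.0 for c in kmer] for n in NUCLEOTIDES}

def kmer_probability(profile, kmer):
    """Product of the profile entries selected by the k-mer's symbols."""
    probability = 1.0
    for position, nucleotide in enumerate(kmer):
        probability *= profile[nucleotide][position]
    return probability

profile = profile_from_kmer("ACCT")
second_string = "gATGTctgtc".upper()

# Probability of every 4-mer in the second string under the ACCT-only profile.
probabilities = {
    second_string[i:i + 4]: kmer_probability(profile, second_string[i:i + 4])
    for i in range(len(second_string) - 3)
}
print(probabilities)
print(all(p == 0.0 for p in probabilities.values()))
```

Every 4-mer of the second string, including the implanted ATGT, gets probability zero, so the profile gives no reason to prefer the implanted one.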

A relevant statistical maxim is Cromwell’s rule, which states that we should not use probabilities of 0 or 1 unless we are talking about logical statements that can only be true or false (such a statement has exactly 2 possible outcomes). In other words, we should allow a small probability for extremely unlikely events, such as “this book was written by aliens” or “the sun will not rise tomorrow”. --> So we should determine the cardinality of the set of elementary events in order to know when probabilities 0 and 1 may be used and when they may not. Since P(Ω) = 1, in a uniform probability model all n elementary events are equally likely and each has probability 1/n, so none of them has probability 0. If a random experiment with 2 possible outcomes (success or failure) is modeled by a Bernoulli random variable, then P(X=1) = p and P(X=0) = 1 - p = q; if p = 1, the experiment has only one outcome and it makes no sense to talk about a probability of 0, whereas if p = 0, the experiment also has only one outcome and it does make sense to talk about a probability of 0. Elementary events are all the possible, indivisible outcomes of a random experiment, so all of them must be taken into account.

**Cromwell’s rule is relevant to the calculation of the probability of a string based on a profile matrix**. For example, consider the following Profile:

A: .2 .2 .0 .0 .0 .0 .9 .1 .1 .1 .3 .0

C: .1 .6 .0 .0 .0 .0 .0 .4 .1 .2 .4 .6

G: .0 .0  1  1 .9 .9 .1 .0 .0 .0 .0 .0

T: .7 .2 .0 .0 .1 .1 .0 .5 .8 .7 .3 .4

Pr(TCGTGGATTTCC|Profile) = .7 · .6 · 1 · .0 · .9 · .9 · .9 · .5 · .8 · .7 · .4 · .6 = 0
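The product above can be reproduced with NumPy. This is a small sketch that copies the profile values from the text; the variable names are chosen here for illustration. It also derives the consensus string, so the positions where the 12-mer deviates from it can be listed.

```python
import numpy as np

# Rows are A, C, G, T; columns are motif positions (values copied from the text).
profile = np.array([
    [.2, .2, .0, .0, .0, .0, .9, .1, .1, .1, .3, .0],
    [.1, .6, .0, .0, .0, .0, .0, .4, .1, .2, .4, .6],
    [.0, .0, 1., 1., .9, .9, .1, .0, .0, .0, .0, .0],
    [.7, .2, .0, .0, .1, .1, .0, .5, .8, .7, .3, .4],
])

kmer = "TCGTGGATTTCC"
# One factor per position: the profile entry for the symbol found there.
factors = [profile["ACGT".index(n), i] for i, n in enumerate(kmer)]
probability = float(np.prod(factors))

# Consensus string: the most frequent nucleotide in every column.
consensus = "".join("ACGT"[j] for j in profile.argmax(axis=0))
# 1-based positions where the k-mer differs from the consensus.
mismatches = [i + 1 for i, (a, b) in enumerate(zip(kmer, consensus)) if a != b]

print(probability)   # 0.0
print(consensus)     # TCGGGGATTTCC
print(mismatches)    # [4]
```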

The fourth symbol of TCGTGGATTTCC causes Pr(TCGTGGATTTCC|Profile) to equal zero (an event with nonzero probability did not occur in the sample, so its observed frequency is zero; setting its probability to zero is an inaccurate oversimplification that may cause problems). As a result, the entire string is assigned a zero probability, even though TCGTGGATTTCC differs from the consensus string at only one position (here the oversimplification causes a problem). For that matter, TCGTGGATTTCC has the same low probability as AAATCTTGGAA, which is very different from the consensus string (again a consequence of the oversimplification). In order to improve this unfair scoring, bioinformaticians often substitute zeroes with small numbers called pseudocounts. In the case of motifs, pseudocounts often amount to adding 1 (or some other small number) to each element of COUNT(Motifs).
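As a minimal sketch of this idea (the function names `count_matrix` and `profile_with_pseudocounts` are hypothetical and independent of the notebook's own functions), the pseudocount is added to the count matrix first, and each column is then normalized so that it sums to 1. With a pseudocount of 1 and four nucleotides, a column built from n motifs is divided by n + 4.

```python
import numpy as np

NUCLEOTIDES = "ACGT"

def count_matrix(motifs):
    """4 x k matrix with the number of occurrences of A, C, G, T per column."""
    k = len(motifs[0])
    counts = np.zeros((4, k))
    for motif in motifs:
        for position, nucleotide in enumerate(motif):
            counts[NUCLEOTIDES.index(nucleotide), position] += 1
    return counts

def profile_with_pseudocounts(motifs, pseudocount=1):
    """Add the pseudocount to every count, then normalize each column."""
    counts = count_matrix(motifs) + pseudocount
    return counts / counts.sum(axis=0)

def kmer_probability(profile, kmer):
    """Product of the profile entries selected by the k-mer's symbols."""
    probability = 1.0
    for position, nucleotide in enumerate(kmer):
        probability *= profile[NUCLEOTIDES.index(nucleotide), position]
    return probability

# Profile built from the single 4-mer ACCT, now without any zero entries.
profile = profile_with_pseudocounts(["ACCT"])
print(profile)
print(kmer_probability(profile, "ATGT"))  # nonzero, roughly 4/625
```

With the single motif ACCT, every column holds one entry equal to 2/5 and three equal to 1/5, so a 4-mer such as ATGT is no longer ruled out by a single mismatch.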

In [25]:
import numpy as np

In [26]:
def FindAllKmers(dna_string, k):
  # All substrings of length k, in order of their starting position.
  return [dna_string[i:i+k] for i in range(len(dna_string) - k + 1)]

In [27]:
from collections import Counter

In [28]:
def IndexNucletoide(index):
  if index == 0:
    return 'A'
  elif index == 1:
    return 'C'
  elif index == 2:
    return 'G'
  else:
    return 'T'

In [29]:
def NucleotideIndex(nucleotide):
  if nucleotide == 'A':
    return 0
  elif nucleotide == 'C':
    return 1
  elif nucleotide == 'G':
    return 2
  else:
    return 3

In [30]:
def KmerProbability(profile, kmer):
  # Multiply the profile entries selected by each nucleotide of the k-mer.
  probability = 1
  for position, nucleotide in enumerate(kmer):
    probability = probability * profile[NucleotideIndex(nucleotide),position]
  return probability

In general, if there are multiple Profile-most probable k-mers in Text, then we select the first such k-mer occurring in Text.
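One way to get this tie-breaking rule is to rely on Python's built-in `max`, which returns the first maximal item it encounters. The sketch below uses hypothetical names (`kmer_probability`, `first_most_probable_kmer`) and a dictionary-based profile; it is an illustration only.

```python
def kmer_probability(profile, kmer):
    """Product of the profile entries selected by the k-mer's symbols."""
    probability = 1.0
    for position, nucleotide in enumerate(kmer):
        probability *= profile[nucleotide][position]
    return probability

def first_most_probable_kmer(text, k, profile):
    """Most probable k-mer in text; the earliest one wins when several tie."""
    kmers = [text[i:i + k] for i in range(len(text) - k + 1)]
    # max() keeps the first maximal element, so earlier k-mers win ties.
    return max(kmers, key=lambda kmer: kmer_probability(profile, kmer))

# Uniform profile: every 2-mer has probability 0.25 * 0.25, so all of them tie.
uniform_profile = {n: [0.25, 0.25] for n in "ACGT"}
print(first_most_probable_kmer("GATTACA", 2, uniform_profile))  # GA

# Skewed profile that favours T in the first position and A in the second.
skewed_profile = {"A": [0.1, 0.7], "C": [0.1, 0.1], "G": [0.1, 0.1], "T": [0.7, 0.1]}
print(first_most_probable_kmer("GATTACA", 2, skewed_profile))  # TA
```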

In [31]:
from collections import OrderedDict

In [32]:
def ProfileMostProbableKmer(text,k,profile):
  best_kmer = None
  best_probability = -1
  for kmer in FindAllKmers(text, k):
    probability = KmerProbability(profile,kmer)
    # The strict comparison keeps the first k-mer when several share the highest probability.
    if probability > best_probability:
      best_kmer = kmer
      best_probability = probability
  return best_kmer

In [44]:
def FormProfile(motifs,k,t):
  # Entry (j, i) is the number of motifs with nucleotide j at position i, divided by t.
  # Assumes the motifs contain only A, C, G and T.
  profile = np.zeros((4,k))
  for motif in motifs:
    for i in range(k):
      profile[NucleotideIndex(motif[i]),i] += 1
  return profile / t

In [43]:
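def LaplacesRuleOfSuccession(profile):
  # Adds a pseudocount of 1 to every entry. Pseudocounts are defined on COUNT(Motifs),
  # so the input is meant to be a count matrix; the result is not normalized.
  return profile + 1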

In [42]:
def Score(motifs,profile,k):
  # Number of symbols that differ from the most frequent nucleotide of their column.
  score = 0
  for i in range(k):
    consensus_nucleotide = IndexNucletoide(np.argmax(profile[:,i]))
    score = score + sum(motif[i] != consensus_nucleotide for motif in motifs)
  return score

In [41]:
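def GreedyMotifSearch(dna,k,t):
  # Start from the first k-mer of every string in dna.
  best_motifs = [text[0:k] for text in dna]
  best_score = Score(best_motifs,FormProfile(best_motifs,k,t),k)
  for kmer in FindAllKmers(dna[0],k):
    motifs = [kmer]
    for i in range(1,t):
      # FormProfile with divisor 1 yields raw counts; the pseudocount is added to the
      # counts, which are then normalized by (number of motifs so far + 4).
      counts = FormProfile(motifs,k,1)
      profile = LaplacesRuleOfSuccession(counts) / (len(motifs) + 4)
      motifs.append(ProfileMostProbableKmer(dna[i],k,profile))
    # Score the complete motif set against the profile of all t motifs.
    score = Score(motifs,FormProfile(motifs,k,t),k)
    if score < best_score:
      best_motifs = motifs
      best_score = score
  return best_motifs

In [49]:
dna = [
'GGCGTTCAGGCA',
'AAGAATCAGTCA',
'CAAGGAGTTCGC',
'CACGTCAATCAC',
'CAATAATATTCG']

In [50]:
k = 3

In [51]:
t = 5

In [52]:
GreedyMotifSearch(dna,k,t)

['TTC', 'ATC', 'TTC', 'ATC', 'TTC']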

In [55]:
with open('/content/rosalind_ba2e.txt') as task_file:
  task_arguments = [line.rstrip() for line in task_file]

In [57]:
dna = task_arguments[1:len(task_arguments)]

In [59]:
k = 12

In [60]:
t = 25

In [61]:
with open("task_result.txt", "w") as f:
  for solution in GreedyMotifSearch(dna,k,t):
    f.write(solution + '\n')