Randomized algorithms may seem nonintuitive because they lack the control of traditional algorithms: relying on a uniform random number generator introduces uncertainty. Some randomized algorithms are Las Vegas algorithms, which deliver solutions that are guaranteed to be exact even though they rely on random decisions. Yet most randomized algorithms, including the motif finding algorithms that we will consider in this chapter, are Monte Carlo algorithms. These algorithms are not guaranteed to return exact solutions, but they do quickly find approximate solutions; like greedy algorithms, they are fast heuristics that trade accuracy for speed. Because of their speed, they can be run many times, allowing us to choose the best approximation from thousands of runs.

In general, we can begin from a collection of randomly chosen k-mers Motifs in Dna (this random starting point is what makes the algorithm randomized), construct PROFILE(Motifs), and use this profile to generate a new collection of k-mers:

MOTIFS(PROFILE(Motifs),Dna)

Why would we do this? We hope that MOTIFS(PROFILE(Motifs), Dna) has a better score than the original collection of k-mers Motifs. We can then form the profile matrix of these k-mers, PROFILE(MOTIFS(PROFILE(Motifs), Dna)), and use it to form the most probable k-mers,

MOTIFS(PROFILE(MOTIFS(PROFILE(Motifs), Dna)), Dna).

We can continue to iterate 

... PROFILE(MOTIFS(PROFILE(MOTIFS(PROFILE(Motifs), Dna)), Dna))...

for as long as the score of the constructed motifs keeps improving.

Since a single run of RANDOMIZEDMOTIFSEARCH may generate a rather poor set of motifs, bioinformaticians usually run this algorithm thousands of times. On each run, they begin from a new randomly selected set of k-mers, and they keep the best set of k-mers found across all these runs.
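To see why thousands of restarts help, here is a minimal sketch. It assumes that the runs are independent and that each run has the same chance of finding a good motif set. The function name and the single-run probability of 0.005 are hypothetical and chosen only for illustration; they are not measured values.

```python
def probability_at_least_one_success(p_single_run, runs):
    """Chance that at least one of `runs` independent runs succeeds."""
    # All runs fail with probability (1 - p)^runs; success is the complement.
    return 1.0 - (1.0 - p_single_run) ** runs

# Hypothetical single-run success probability, for illustration only.
P_SINGLE_RUN = 0.005

for runs in (1, 100, 1000):
    print(runs, probability_at_least_one_success(P_SINGLE_RUN, runs))
```

Even when a single run rarely succeeds, the chance that every one of many runs fails shrinks geometrically with the number of runs.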


In [2]:
import numpy as np
from numpy.random import randint

In [3]:
def SelectRandomKmers(dna,k,t):
  # pick one k-mer uniformly at random from each string in dna
  random_kmers = []
  for dna_string in dna:
    start = randint(0, len(dna_string) - k + 1) # upper bound not included
    random_kmers.append(dna_string[start:start+k])
  return random_kmers

In [4]:
def DnaToArray(dna):
  dna_array = np.zeros((len(dna), len(dna[0])), dtype='str')
  for i in range(len(dna)):
    dna_array[i,:] = np.asarray(list(dna[i]), dtype='str')
  return dna_array

In [5]:
from collections import Counter

In [6]:
def IndexNucleotide(index):
  if index == 0:
    return 'A'
  elif index == 1:
    return 'C'
  elif index == 2:
    return 'G'
  else:
    return 'T'

In [7]:
def GenerateProfile(motifs,k,t):
  profile = np.zeros((4,k))
  motifs_array = DnaToArray(motifs)
  for i in range(k):
    nucleotide_frequency_dict = Counter(motifs_array[:,i])
    for nucleotide_index in range(4):
      profile[nucleotide_index][i] = nucleotide_frequency_dict[IndexNucleotide(nucleotide_index)]
  profile = profile / t
  return profile

In [16]:
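def LaplacesRuleOfSucccession(profile):
  # profile holds frequencies (counts / t), so adding 1 here acts like a pseudocount of t per nucleotide
  smoothed = profile + 1
  return smoothed / smoothed.sum(axis=0) # renormalize so that every column sums to 1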

In [9]:
def NucleotideIndex(nucleotide):
  if nucleotide == 'A':
    return 0
  elif nucleotide == 'C':
    return 1
  elif nucleotide == 'G':
    return 2
  else:
    return 3

In [10]:
def KmerProbability(profile,kmer):
  probability = 1
  for i in range(len(kmer)):
    probability = probability * profile[NucleotideIndex(kmer[i])][i]
  return probability

In [11]:
def FindAllKmers(dna_string,k):
  # every length-k window of dna_string, from left to right
  return [dna_string[i:i+k] for i in range(len(dna_string) - k + 1)]

In [41]:
def Motifs(profile,dna,k):
  # for each string in dna, keep the k-mer that is most probable under profile
  most_probable_kmers = []
  for dna_string in dna:
    kmers = FindAllKmers(dna_string,k)
    probabilities = [KmerProbability(profile, kmer) for kmer in kmers]
    most_probable_kmers.append(kmers[np.argmax(probabilities)])
  return most_probable_kmers

In [13]:
def Consensus(profile,k):
  consensus = str()
  for i in range(k):
    consensus = consensus + IndexNucleotide(np.argmax(profile[:,i]))
  return consensus

In [14]:
def HammingDistance(first_string,second_string):
  hamming_distance = 0
  for i in range(min(len(first_string),len(second_string))):
    if first_string[i] != second_string[i]:
      hamming_distance = hamming_distance + 1
  return hamming_distance + abs(len(first_string) - len(second_string))

In [15]:
def Score(motifs,k,t):
  score = 0
  profile = GenerateProfile(motifs,k,t)
  consensus = Consensus(profile,k)
  for motif in motifs:
    score = score + HammingDistance(consensus,motif)
  return score

**Cromwell’s rule is relevant to the calculation of the probability of a string based on a profile matrix**. For example, consider the following Profile:

A: .2 .2 .0 .0 .0 .0 .9 .1 .1 .1 .3 .0

C: .1 .6 .0 .0 .0 .0 .0 .4 .1 .2 .4 .6

G: .0 .0  1  1 .9 .9 .1 .0 .0 .0 .0 .0

T: .7 .2 .0 .0 .1 .1 .0 .5 .8 .7 .3 .4

Pr(TCGTGGATTTCC|Profile) = .7 · .6 · 1 · .0 · .9 · .9 · .9 · .5 · .8 · .7 · .4 · .6 = 0

The fourth symbol of TCGTGGATTTCC causes Pr(TCGTGGATTTCC|Profile) to equal zero. This is where Cromwell's rule applies: an event with non-zero probability simply did not occur in the sample, so its observed frequency is zero, and setting its probability to zero is an inaccurate oversimplification that may cause problems. Here, none of the k-mers used to build the profile matrix had T at position 4, but the true probability of T at position 4 is not 0. As a result, the entire string is assigned a zero probability, even though TCGTGGATTTCC differs from the consensus string at only one position. For that matter, TCGTGGATTTCC has the same low probability as AAATCTTGGAA, which is very different from the consensus string. In order to improve this unfair scoring, bioinformaticians often substitute zeroes with small numbers called pseudocounts. In the case of motifs, pseudocounts often amount to adding 1 (or some other small number) to each element of COUNT(Motifs).
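The effect of pseudocounts on this example can be checked with a short sketch. The text gives only the frequency profile, so the count matrix is reconstructed under the assumption that the profile was built from 10 motifs (every entry is a multiple of 0.1). The names `kmer_probability`, `laplace_profile` and `T_ASSUMED` are illustrative and are not part of the notebook's own functions.

```python
import numpy as np

NUCLEOTIDES = "ACGT"

# Profile from the text; rows are A, C, G, T and columns are positions 1..12.
profile = np.array([
    [.2, .2, .0, .0, .0, .0, .9, .1, .1, .1, .3, .0],
    [.1, .6, .0, .0, .0, .0, .0, .4, .1, .2, .4, .6],
    [.0, .0, 1., 1., .9, .9, .1, .0, .0, .0, .0, .0],
    [.7, .2, .0, .0, .1, .1, .0, .5, .8, .7, .3, .4],
])

def kmer_probability(profile, kmer):
    """Product of the profile entries selected by each symbol of kmer."""
    rows = [NUCLEOTIDES.index(symbol) for symbol in kmer]
    return float(np.prod(profile[rows, np.arange(len(kmer))]))

def laplace_profile(counts):
    """Add a pseudocount of 1 to every count, then renormalize each column."""
    smoothed = counts + 1
    return smoothed / smoothed.sum(axis=0)

T_ASSUMED = 10  # assumption: the profile was built from 10 motifs
counts = np.rint(profile * T_ASSUMED)
smoothed_profile = laplace_profile(counts)

p_raw = kmer_probability(profile, "TCGTGGATTTCC")
p_smoothed = kmer_probability(smoothed_profile, "TCGTGGATTTCC")
print(p_raw, p_smoothed)
```

With pseudocounts, the entry for T at position 4 becomes 1/14 instead of 0, so a string that differs from the consensus at a single position is no longer scored as impossible.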

In [42]:
def RandomizedMotifSearch(dna,k,t):
  motifs  = SelectRandomKmers(dna,k,t)
  best_motifs = motifs
  while True:
    # rebuild the profile from the current motifs, then pick the most probable k-mers
    profile = LaplacesRuleOfSucccession(GenerateProfile(motifs,k,t))
    motifs = Motifs(profile,dna,k)
    if Score(motifs,k,t) < Score(best_motifs,k,t):
      best_motifs = motifs
    else:
      # stop as soon as the score no longer improves
      return [best_motifs,Score(best_motifs,k,t)]

In [43]:
def RunRandomizedMotifSearch(dna,k,t):
  best_motifs = []
  best_score = None
  for i in range(1000): # 1000 independent random restarts
    motifs, motifs_score = RandomizedMotifSearch(dna,k,t)
    # keep the lowest-scoring (best) motif set seen so far
    if best_score is None or motifs_score < best_score:
      best_motifs = motifs
      best_score = motifs_score
  return best_motifs

Source of the random DNA sequences: https://www.google.com/search?q=random+dna+strings&rlz=1C1AVFC_enHR898HR898&oq=random+dna+strings&aqs=chrome..69i57j0i546l4.2474j0j7&sourceid=chrome&ie=UTF-8

In [11]:
Dna_random = ['gtcaatggcg', 'gaatgttccg', 'tcgttttttg', 'aggaagttac', 'tccgcctacc']

In [12]:
for i in range(len(Dna_random)):
  Dna_random[i] = Dna_random[i].upper()

In [13]:
Dna_random

['GTCAATGGCG', 'GAATGTTCCG', 'TCGTTTTTTG', 'AGGAAGTTAC', 'TCCGCCTACC']

In [14]:
Dna_random_array = DnaToArray(Dna_random)

In [15]:
Dna_random_array

array([['G', 'T', 'C', 'A', 'A', 'T', 'G', 'G', 'C', 'G'],
       ['G', 'A', 'A', 'T', 'G', 'T', 'T', 'C', 'C', 'G'],
       ['T', 'C', 'G', 'T', 'T', 'T', 'T', 'T', 'T', 'G'],
       ['A', 'G', 'G', 'A', 'A', 'G', 'T', 'T', 'A', 'C'],
       ['T', 'C', 'C', 'G', 'C', 'C', 'T', 'A', 'C', 'C']], dtype='<U1')

In [21]:
GenerateProfile(Dna_random_array,len(Dna_random_array),len(Dna_random_array))

array([[0.2, 0.2, 0.2, 0.4, 0.4],
       [0. , 0.4, 0.4, 0. , 0.2],
       [0.4, 0.2, 0.4, 0.2, 0.2],
       [0.4, 0.2, 0. , 0.4, 0.2]])

In [63]:
Dna_random2 = ['ctcacatgca', 'attgctctca', 'ctttaccaac', 'tcctcgccgg', 'caccccgggc']

In [64]:
for i in range(len(Dna_random2)):
  Dna_random2[i] = Dna_random2[i].upper()

In [65]:
Dna_random2

['CTCACATGCA', 'ATTGCTCTCA', 'CTTTACCAAC', 'TCCTCGCCGG', 'CACCCCGGGC']

In [66]:
Dna_random_array2 = DnaToArray(Dna_random2)

In [67]:
Dna_random_array2

array([['C', 'T', 'C', 'A', 'C', 'A', 'T', 'G', 'C', 'A'],
       ['A', 'T', 'T', 'G', 'C', 'T', 'C', 'T', 'C', 'A'],
       ['C', 'T', 'T', 'T', 'A', 'C', 'C', 'A', 'A', 'C'],
       ['T', 'C', 'C', 'T', 'C', 'G', 'C', 'C', 'G', 'G'],
       ['C', 'A', 'C', 'C', 'C', 'C', 'G', 'G', 'G', 'C']], dtype='<U1')

In [68]:
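# Note: this and the later two-argument calls appear to have been run against an earlier version of GenerateProfile;
# the recorded outputs are consistent with pseudocounts, (count + 1) / (t + 4), over the first 5 columns.
GenerateProfile(Dna_random_array2, len(Dna_random_array2))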

array([[0.22222222, 0.22222222, 0.11111111, 0.22222222, 0.22222222],
       [0.44444444, 0.22222222, 0.44444444, 0.22222222, 0.55555556],
       [0.11111111, 0.11111111, 0.11111111, 0.22222222, 0.11111111],
       [0.22222222, 0.44444444, 0.33333333, 0.33333333, 0.11111111]])

In [69]:
GenerateProfile(Dna_random_array, len(Dna_random_array)) == GenerateProfile(Dna_random_array2, len(Dna_random_array2))

array([[ True,  True, False, False, False],
       [False, False, False, False, False],
       [False, False, False,  True, False],
       [False, False, False,  True, False]])

In [70]:
Dna_random3 = ['tgtgaccccc', 'gttaggggtt', 'tgtgcggcat', 'tagtaacagc', 'atcttcgaaa']

In [71]:
for i in range(len(Dna_random3)):
  Dna_random3[i] = Dna_random3[i].upper()

In [72]:
Dna_random3

['TGTGACCCCC', 'GTTAGGGGTT', 'TGTGCGGCAT', 'TAGTAACAGC', 'ATCTTCGAAA']

In [73]:
Dna_random_array3 = DnaToArray(Dna_random3)

In [74]:
Dna_random_array3

array([['T', 'G', 'T', 'G', 'A', 'C', 'C', 'C', 'C', 'C'],
       ['G', 'T', 'T', 'A', 'G', 'G', 'G', 'G', 'T', 'T'],
       ['T', 'G', 'T', 'G', 'C', 'G', 'G', 'C', 'A', 'T'],
       ['T', 'A', 'G', 'T', 'A', 'A', 'C', 'A', 'G', 'C'],
       ['A', 'T', 'C', 'T', 'T', 'C', 'G', 'A', 'A', 'A']], dtype='<U1')

In [75]:
GenerateProfile(Dna_random_array3, len(Dna_random_array3))

array([[0.22222222, 0.22222222, 0.11111111, 0.22222222, 0.33333333],
       [0.11111111, 0.11111111, 0.22222222, 0.11111111, 0.22222222],
       [0.22222222, 0.33333333, 0.22222222, 0.33333333, 0.22222222],
       [0.44444444, 0.33333333, 0.44444444, 0.33333333, 0.22222222]])

In [76]:
GenerateProfile(Dna_random_array3, len(Dna_random_array3)) == GenerateProfile(Dna_random_array2, len(Dna_random_array2))

array([[ True,  True,  True,  True, False],
       [False, False, False, False, False],
       [False, False, False, False, False],
       [False, False, False,  True, False]])

In [77]:
GenerateProfile(Dna_random_array3, len(Dna_random_array3)) == GenerateProfile(Dna_random_array, len(Dna_random_array))

array([[ True,  True, False, False,  True],
       [ True, False, False,  True,  True],
       [False, False, False, False,  True],
       [False, False, False,  True,  True]])

https://birc.au.dk/~palle/php/fabox/random_sequence_generator.php

In [78]:
Dna_random4 = ['tgactgtcat', 'cggttatcat', 'gtacatttgc', 'tcgtatctag', 'atttgcgtac']
Dna_random4

['tgactgtcat', 'cggttatcat', 'gtacatttgc', 'tcgtatctag', 'atttgcgtac']

In [79]:
for i in range(len(Dna_random4)):
  Dna_random4[i] = Dna_random4[i].upper()

In [80]:
Dna_random4

['TGACTGTCAT', 'CGGTTATCAT', 'GTACATTTGC', 'TCGTATCTAG', 'ATTTGCGTAC']

In [81]:
Dna_random_array4 = DnaToArray(Dna_random4)

In [82]:
Dna_random_array4

array([['T', 'G', 'A', 'C', 'T', 'G', 'T', 'C', 'A', 'T'],
       ['C', 'G', 'G', 'T', 'T', 'A', 'T', 'C', 'A', 'T'],
       ['G', 'T', 'A', 'C', 'A', 'T', 'T', 'T', 'G', 'C'],
       ['T', 'C', 'G', 'T', 'A', 'T', 'C', 'T', 'A', 'G'],
       ['A', 'T', 'T', 'T', 'G', 'C', 'G', 'T', 'A', 'C']], dtype='<U1')

In [83]:
GenerateProfile(Dna_random_array4, len(Dna_random_array4))

array([[0.22222222, 0.11111111, 0.33333333, 0.11111111, 0.33333333],
       [0.22222222, 0.22222222, 0.11111111, 0.33333333, 0.11111111],
       [0.22222222, 0.33333333, 0.33333333, 0.11111111, 0.22222222],
       [0.33333333, 0.33333333, 0.22222222, 0.44444444, 0.33333333]])

In [84]:
GenerateProfile(Dna_random_array4, len(Dna_random_array4)) == GenerateProfile(Dna_random_array3, len(Dna_random_array3)) 

array([[ True, False, False, False,  True],
       [False, False, False, False, False],
       [ True,  True, False, False,  True],
       [False,  True, False, False, False]])

In [85]:
GenerateProfile(Dna_random_array4, len(Dna_random_array4)) == GenerateProfile(Dna_random_array2, len(Dna_random_array2)) 

array([[ True, False, False, False, False],
       [False,  True, False, False, False],
       [False, False, False, False, False],
       [False, False, False, False, False]])

In [86]:
GenerateProfile(Dna_random_array4, len(Dna_random_array4)) == GenerateProfile(Dna_random_array, len(Dna_random_array)) 

array([[ True, False, False, False,  True],
       [False, False, False, False, False],
       [False, False,  True, False,  True],
       [ True, False, False, False, False]])

Since these DNA strings are random, that is, they contain no pattern, there is no point in asking which k-mer is more probable than the others.

If the strings in Dna were truly random (that is, if they contained no pattern), then we would expect all nucleotides in the selected k-mers to be equally likely, resulting in an expected Profile in which every entry is approximately 0.25. The reasoning runs in both directions. If the strings are random, every nucleotide must be equally likely at every position, because otherwise some strings would be more probable than others; and if all strings are equally probable, every entry of the Profile matrix is 0.25. Conversely, if not all strings were equally probable, the Profile matrix would contain entries that, when multiplied together, give some string the highest probability, and that string would be the one occurring most frequently in Dna. Such a uniform profile is essentially useless for motif finding because no string is more probable than any other according to this profile and because it does not provide any clues on what an implanted motif looks like.

A: 0.25 0.25 0.25 0.25

C: 0.25 0.25 0.25 0.25

G: 0.25 0.25 0.25 0.25

T: 0.25 0.25 0.25 0.25
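A quick check of this claim, as a minimal sketch: under the uniform 4-column profile above, every one of the 256 possible 4-mers receives the same probability. The helper name `kmer_probability` and the dictionary layout are illustrative choices, not the notebook's own `KmerProbability`.

```python
from itertools import product

# Uniform 4 x 4 profile from the text: every entry is 0.25.
uniform_profile = {nucleotide: [0.25] * 4 for nucleotide in "ACGT"}

def kmer_probability(profile, kmer):
    """Multiply the profile entry of each nucleotide at its position."""
    probability = 1.0
    for position, nucleotide in enumerate(kmer):
        probability *= profile[nucleotide][position]
    return probability

# Enumerate all 4^4 = 256 possible 4-mers and score each one.
all_4mers = ["".join(symbols) for symbols in product("ACGT", repeat=4)]
probabilities = {kmer: kmer_probability(uniform_profile, kmer) for kmer in all_4mers}

print(len(all_4mers), set(probabilities.values()))  # prints: 256 {0.00390625}
```

Because the set of distinct probabilities has a single element, taking the most probable k-mer under this profile cannot prefer any string over another.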

Example

Dna 

ttACCT**taac**

gAT**GTct**gtc

**ccgG**CGTtag

c**acta**ACGAg

cgtcag**AGGT**


Motifs (the k-mers shown in bold above): taac, GTct, ccgG, acta, AGGT

PROFILE(Motifs):

A: 0.4 0.2 0.2 0.2

C: 0.2 0.4 0.2 0.2

G: 0.2 0.2 0.4 0.2

T: 0.2 0.2 0.2 0.4


We can now compute the probabilities of every 4-mer in Dna based on this profile matrix. For example, the probability of the first 4-mer in the first string of Dna is Pr(ttAC|Profile) = 0.2 · 0.2 · 0.2 · 0.2 = 0.0016.

ttAC tACC ACCT CCTt CTta Ttaa taac

.0016 .0016 .0128 .0064 .0016 .0016 .0016

gATG ATGT TGTc GTct Tctg ctgt tgtc

.0016 .0128 .0016 .0032 .0032 .0064 .0016

ccgG cgGC gGCG GCGT CGTt GTta Ttag

.0064 .0032 .0016 .0128 .0032 .0016 .0016

cact acta ctaA taAC aACG ACGA CGAg

.0032 .0064 .0016 .0016 .0032 .0128 .0016

cgtc gtca tcag cagA agAG gAGG AGGT

.0016 .0016 .0032 .0032 .0032 .0032 .0128
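The probabilities of the 4-mers in this example can be recomputed with a short script. This is a minimal sketch that uses the profile and the five Dna strings from the example; the helper names `kmer_probability` and `most_probable_kmer` are illustrative and independent of the notebook's own functions.

```python
# Profile and Dna strings copied from the example (bold markers removed).
profile = {
    "A": [0.4, 0.2, 0.2, 0.2],
    "C": [0.2, 0.4, 0.2, 0.2],
    "G": [0.2, 0.2, 0.4, 0.2],
    "T": [0.2, 0.2, 0.2, 0.4],
}
dna = ["ttACCTtaac", "gATGTctgtc", "ccgGCGTtag", "cactaACGAg", "cgtcagAGGT"]

def kmer_probability(profile, kmer):
    """Multiply the profile entry of each nucleotide at its position."""
    probability = 1.0
    for position, nucleotide in enumerate(kmer.upper()):
        probability *= profile[nucleotide][position]
    return probability

def most_probable_kmer(profile, dna_string, k):
    """Return the k-mer of dna_string with the highest probability."""
    kmers = [dna_string[i:i + k] for i in range(len(dna_string) - k + 1)]
    return max(kmers, key=lambda kmer: kmer_probability(profile, kmer))

for dna_string in dna:
    kmers = [dna_string[i:i + 4] for i in range(len(dna_string) - 3)]
    print([(kmer, round(kmer_probability(profile, kmer), 4)) for kmer in kmers])

new_motifs = [most_probable_kmer(profile, dna_string, 4) for dna_string in dna]
print(new_motifs)  # prints: ['ACCT', 'ATGT', 'GCGT', 'ACGA', 'AGGT']
```

Each selected k-mer is the implanted motif occurrence in its string, which is exactly the step MOTIFS(PROFILE(Motifs), Dna) performs.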

Certain k-mers are considerably more probable than the others, namely ACCT, ATGT, GCGT, ACGA, and AGGT. If the strings in Dna were truly random, we would expect all nucleotides in the selected k-mers to be equally likely, giving an expected Profile in which every entry is approximately 0.25; and if all nucleotides in every column of the Profile matrix are equally likely, then all k-mers are equally probable. Since some k-mers are considerably more probable than others here, the nucleotides in the columns are not all equally likely. Reading off the largest entry in each column of the Profile matrix, the most probable k-mer overall is ACGT.

One way to picture this is to treat each column of the Profile matrix as a separate 4-sided die. Rolling each die is a random experiment, and observing the outcome of rolling 4 different dice in sequence is a compound experiment. If the dice are biased, a particular nucleotide is more likely at a particular position, which is why rolling these 4 dice yields ACCT, ATGT, GCGT, ACGA, or AGGT with considerably higher probability than the other k-mers in Dna. Whether the dice are biased depends on which k-mers were used to construct the Profile matrix and on how many weakly conserved and strongly conserved positions the motif has.

A uniform profile, by contrast, gives us nothing to work with: no string is more probable than any other, so we cannot tell what the motif looks like.

We have already noticed that if the strings in Dna were random, then RANDOMIZEDMOTIFSEARCH would start from a nearly uniform profile, and there would be nothing to work with. However, the key observation is that the strings in Dna are not random because they include the implanted motif! If they were random, we could not explain the increased expression of a particular gene: the transcription factor would bind to any k-mer (any k-mer would be a transcription factor binding site), and we could not explain why the expression of certain genes follows a circadian rhythm.

An open question: do the k-mers drawn from Dna behave as if they were random when the selection misses the motif in every string of Dna? If we select, say, 5 k-mers that are all equally probable and build a Profile matrix from them, that matrix should be uniform, because no string is more probable than any other, and so it tells us nothing about what the motif looks like.

**These multiple occurrences of the same motif may create a bias in the profile matrix, directing it away from the uniform profile and toward the implanted motif**. If the motif has many strongly conserved positions and few weakly conserved positions, then the multiple occurrences of the motif introduce a bias into the Profile matrix.

In [44]:
dna = ['CGCCCCTCTCGGGGGTGTTCAGTAAACGGCCA', 'GGGCGAGGTATGTGTAAGTGCCAAGGTGCCAG', 'TAGTACCGAGACCGAAAGAAGTATACAGGCGT', 'TAGATCAAGTTTCAGGTGCACGTCGGTGAACC', 'AATCCACCAGCTCCACGTGCAATGTTGGCCTA']

In [45]:
k = 8

In [46]:
t = 5

In [48]:
RunRandomizedMotifSearch(dna,k,t)

['TCTCGGGG', 'CCAAGGTG', 'TACAGGCG', 'TTCAGGTG', 'TCCACGTG']

In [49]:
with open('/content/rosalind_ba2f.txt') as task_file:
  task_arguments = [line.rstrip() for line in task_file]

In [50]:
k = int(task_arguments[0].split()[0]) # first line holds "k t" separated by whitespace

In [51]:
t = int(task_arguments[0].split()[1]) # second number on the first line

In [52]:
dna = task_arguments[1:] # remaining lines are the DNA strings

In [57]:
# write one motif per line; the with-block closes the file automatically
with open("task_result.txt", "w") as f:
  for solution in RunRandomizedMotifSearch(dna,k,t):
    f.write(solution + '\n')