<h1><b>Statistics in Bioinformatics: </b> TME9 </h1><br>

The goal of this TME is to implement the Expectation-Maximisation method for motif discovery.

<div class="alert alert-warning" role="alert" style="margin: 10px">
<p><b>Submission</b></p>
<ul>
<li>Rename the file TME9.ipynb to NomEtudiant1_NomEtudiant2.ipynb (the names of the two students) </li>
<li>Submit via Moodle </li>
</ul>
</div>

<H1>Expectation-Maximisation Motif</H1>
<br>
The EM (Expectation-Maximisation) method detects motifs in a set of related, unaligned DNA or protein sequences. Specifically, given a group of sequences of length $L$ that are known to share a common motif of length $w$, the EM algorithm:
- infers a model $(\Theta,Z)$ for the motif;
- locates the occurrence of the motif in each sequence.

$\Theta$ represents the position weight matrix $p_{c,k}$ of the motif (where $c \in \{A,C,G,T\}$ or $c \in \{A,C,D,...,Y\}$ and $k \in \{0, \dots, w\}$); $p_{c,0}$ is the probability vector of the null, or "background", model.
$Z$ is the matrix of hidden variables, which give the start positions of the motif:
- $Z_{i,j} = 1$ if the motif starts at position $j$ of sequence $i$,
- $Z_{i,j} = 0$ otherwise.

The algorithm refines the model parameters iteratively by expectation-maximisation. Each iteration $t$ consists of two steps:
- (E) Compute the expected values $Z^{(t)}$ of $Z$, given $\Theta^{(t-1)}$
- (M) Estimate $\Theta^{(t)}$ from $Z^{(t)}$
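
For reference, the two steps can be written out as below. This is a sketch that assumes the usual formulation with exactly one motif occurrence per sequence and a uniform prior over start positions. The symbols $X_{i,l}$ (the character at position $l$ of sequence $i$), $n_{c,k}$ (expected counts), $n_c$ (total number of occurrences of $c$ in the data) and $d_{c,k}$ (pseudocounts) are notation introduced here for illustration.

E step, for $j \in \{1, \dots, L-w+1\}$:

$$\Pr(X_i \mid Z_{i,j}=1, \Theta) = \prod_{l=1}^{j-1} p_{X_{i,l},0} \; \prod_{k=1}^{w} p_{X_{i,j+k-1},k} \; \prod_{l=j+w}^{L} p_{X_{i,l},0}$$

$$Z_{i,j}^{(t)} = \frac{\Pr(X_i \mid Z_{i,j}=1, \Theta^{(t-1)})}{\sum_{j'=1}^{L-w+1} \Pr(X_i \mid Z_{i,j'}=1, \Theta^{(t-1)})}$$

M step, with pseudocounts:

$$n_{c,k} = \begin{cases} \sum_{i} \sum_{j \,:\, X_{i,j+k-1}=c} Z_{i,j}^{(t)} & \text{if } k \geq 1 \\ n_c - \sum_{k'=1}^{w} n_{c,k'} & \text{if } k = 0 \end{cases} \qquad p_{c,k}^{(t)} = \frac{n_{c,k} + d_{c,k}}{\sum_{b} \left(n_{b,k} + d_{b,k}\right)}$$

The first product pair in the E step covers the characters outside the motif window (background column $k=0$), and the middle product covers the $w$ characters inside it.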

1\. Implement a function `read_training_file` that reads the input file, which contains a set of unaligned DNA sequences. To keep things simple, we use the data seen in class, from the file `toyEx.txt`.

In [82]:
import numpy as np
import matplotlib.pyplot as plt

nts = ['A', 'C', 'G', 'T']

w = 3
input_f = "toyEx.txt"

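def read_training_file(input_f):
    """
    Read a file of unaligned sequences (one sequence per line)
    input input_f : file name
    output seqs : list of sequences
    """
    seqs = []

    # the context manager closes the file even if reading fails
    with open(input_f) as f:
        for line in f:
            line = line.strip()  # drop the newline without eating a real character
            if line:             # skip blank lines
                seqs.append(line)

    return seqs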

seqs = read_training_file(input_f)
print (seqs) #['GTCAGG', 'GAGAGT', 'ACGGAG', 'CCAGTC']

['GTCAGG', 'GAGAGT', 'ACGGAG', 'CCAGTC']


2\. Implement a function `initialiseP` that initialises the position weight matrix $p_{c,k}$. We use the default null model $p_0 = (0.25, 0.25, 0.25, 0.25)$. To initialise $p^{(t)}$, one usually picks a motif at random from one of the sequences, then sets the weight of the corresponding nucleotide to $0.5$ and the weights of the other three to $\frac{1-0.5}{3}$.

In [83]:
# def initialiseP(seqs, w, alph):
#     """
#     Initialise pc,k
#     input seqs : list of sequences
#     input w : motif length
#     input alph : alphabet (nucleotides or amino acids)
#     output P: position probability matrix
#     """
#     q=len(alph)
#     P = np.zeros((q, w+1))
    
#     for j in range (q): 
#         P[j][0]= 0.25
        
#     print(w)
    
#     alea1 =np.random.randint(len(seqs))
#     random_seq = seqs[alea1]
# #     print("random_seq", random_seq)
#     alea2 = np.random.randint(0,(len(random_seq)-w))
#     motif = random_seq[alea2:(alea2+w)]
# #     print("motif",motif)
    
    
#     for i in range(1,len(motif)):
#         print("i", i)
#         indice = alph.index(motif[i])
#         print("indice", indice)
#         P[indice][i] = 0.5
        
#         for j in  range (len(motif)+1): 
#             if j != indice: 
#                 P[j][i] = (1 - 0.5)/3
# #         if motif[i]==alpha
    

    
   
   
#     return P

# #test
# p = initialiseP(seqs, w, nts)
# print (p)
# print(nts)
# print(np.random.randint(0,4))

In [84]:
def initialiseP(seqs, w, alph):
    """
    Initialise pc,k
    input seqs : list of sequences
    input w : motif length
    input alph : alphabet (nucleotides or amino acids)
    output P: position probability matrix
    """
    q = len(alph)
    P = np.zeros((q, w+1))

    # column 0: uniform background model (0.25 for DNA)
    P[:, 0] = 1 / q

    # pick a random window of length w in a random sequence;
    # randint excludes the upper bound, hence the +1 so the last window can be drawn
    random_seq = seqs[np.random.randint(len(seqs))]
    start = np.random.randint(0, len(random_seq) - w + 1)
    motif = random_seq[start:start + w]

    # columns 1..w: 0.5 for the motif letter, the remaining 0.5 shared by the others
    P[:, 1:] = (1 - 0.5) / (q - 1)
    for k, letter in enumerate(motif):
        P[alph.index(letter)][k + 1] = 0.5

    return P

#test
p = initialiseP(seqs, w, nts)
print (p)
# print(nts)
# print(np.random.randint(0,4))

[[0.25       0.16666667 0.16666667 0.16666667]
 [0.25       0.16666667 0.16666667 0.5       ]
 [0.25       0.5        0.16666667 0.16666667]
 [0.25       0.16666667 0.5        0.16666667]]


3\. Implement a function `initialiseZ` that initialises the matrix $Z$ to ones. Recall that the dimension of $Z$ is $nbSeq \times (lenSeq -w +1)$, where $nbSeq$ is the number of sequences and $lenSeq$ is the length of the sequences.
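
A minimal sketch of the all-ones initialisation described above, assuming all sequences have the same length. The name `initialise_z_ones` is hypothetical, chosen so that it does not clash with the notebook's own `initialiseZ`.

```python
import numpy as np

def initialise_z_ones(seqs, w):
    """Return a Z matrix of ones with shape (nbSeq, lenSeq - w + 1)."""
    nb_seq = len(seqs)
    len_seq = len(seqs[0])  # assumption: all sequences have the same length
    return np.ones((nb_seq, len_seq - w + 1))

toy = ['GTCAGG', 'GAGAGT', 'ACGGAG', 'CCAGTC']  # the toy sequences of toyEx.txt
Z_ones = initialise_z_ones(toy, 3)
print(Z_ones.shape)  # (4, 4)
```

Since the E step recomputes every entry of $Z$ from $p_{c,k}$, the initial values mainly fix the shape of the matrix.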

In [85]:
# import re

# IND = []
# ici = ['GTCAGG', 'GAGAGT', 'ACGGAG', 'CCAGTC']
# for i in range (len(ici)): 
#     string = ici[i]
#     char = "GGA"
#     indices = [i.start() for i in re.finditer(char, string)]
#     IND.append(indices)
# print(indices)
# print(IND)

# for i in range (len(IND)): 
#     if IND[i]== []: 
#         print('empty')
#     else: 
#         print(IND[i],"found!")
        


In [86]:
import re

IND = []
ici = ['GTCAGG', 'GAGAGT', 'ACGGAG', 'CCAGTC']
# char = "GGA"
char = "GTC"
print(char)


longueur =len(char)
inverse=char[longueur::-1] 
print(inverse)

# indices = string.index(str(char))
# IND.append(indices)
# print(indices)
# print(IND)

# for i in range (len(IND)): 
#     if IND[i]== []: 
#         print('empty')
#     else: 
#         print(IND[i],"found!")
        

GTC
CTG


In [87]:
# def initialiseZ(seqs, w):
#     """
#     Initialise Z
#     input seqs : list of sequences
#     input w : motif length
#     output Z :  matrix of motif start positions
#     """
    
#     Z = np.zeros((len(seqs), len(seqs[0])-w+1))
    
#     alea1 =np.random.randint(len(seqs))
#     random_seq = seqs[alea1]
#     print("random_seq", random_seq)
    
#     alea2 = np.random.randint(0,(len(random_seq)-w))
#     motif = random_seq[alea2:(alea2+w)]
#     print("motif",motif)
    
    
#     print("seqs -->",seqs)
    
    
#     curseur = 0
#     for seq in seqs: 
#         print("\n====================> WORKING WITH", seq)
#         if motif in seq: 
#             print(seq)
#             ou = seq.index(motif)
#             print(seq.index(motif))
#             Z[curseur][ou] += 1
#         else: 
#             print(motif, "not in" ,seq)            
#         curseur +=1   
            
#     return Z

# Z = initialiseZ(seqs, w) # matrix of motif start positions
# print(Z)


In [88]:
# ANASTASIA 

def initialiseZ(seqs, w):
    """
    Initialise Z with one random start position per sequence
    input seqs : list of sequences
    input w : motif length
    output Z :  matrix of motif start positions
    """
    n_pos = len(seqs[0]) - w + 1  # number of possible start positions
    Z = np.zeros((len(seqs), n_pos))

    # mark one randomly chosen start position in each row with a 1
    for i in range(len(seqs)):
        Z[i][np.random.randint(0, n_pos)] = 1

    return Z

Z = initialiseZ(seqs, w)
print(Z)


[[0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]]


In [89]:
print(nts)

['A', 'C', 'G', 'T']


In [90]:
print(p) # lowercase p!

[[0.25       0.16666667 0.16666667 0.16666667]
 [0.25       0.16666667 0.16666667 0.5       ]
 [0.25       0.5        0.16666667 0.16666667]
 [0.25       0.16666667 0.5        0.16666667]]


4\. Write a function `E_step` for the Expectation step, which estimates $Z$ from $p_{c,k}$. 
Also write a function `normaliseZ` that normalises $Z$.
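
The products involved in the E step become very small for long sequences, so one option is to work with log-probabilities. The sketch below is one possible way to do this, not the notebook's implementation: `e_step_log` is a hypothetical name, and the matrix `P_demo` is built from an illustrative motif `TCA` with weight 0.5, following the initialisation rule of question 2.

```python
import numpy as np

def e_step_log(seqs, P, w, alph):
    """Return the row-normalised Z, computed from log-probabilities."""
    logP = np.log(P)
    idx = [[alph.index(c) for c in s] for s in seqs]  # letters -> row indices of P
    n_pos = len(seqs[0]) - w + 1
    logZ = np.zeros((len(seqs), n_pos))
    for i, s in enumerate(idx):
        background = sum(logP[c, 0] for c in s)       # whole sequence under the null model
        for j in range(n_pos):
            # swap the background terms of the window for the motif terms
            logZ[i, j] = background + sum(
                logP[s[j + k], k + 1] - logP[s[j + k], 0] for k in range(w))
    logZ -= logZ.max(axis=1, keepdims=True)           # stabilise before exponentiating
    Z = np.exp(logZ)
    return Z / Z.sum(axis=1, keepdims=True)

alph_demo = ['A', 'C', 'G', 'T']
P_demo = np.full((4, 4), 1 / 6)
P_demo[:, 0] = 0.25
for k, c in enumerate("TCA"):                         # illustrative motif, weight 0.5
    P_demo[alph_demo.index(c), k + 1] = 0.5

Z_demo = e_step_log(['GTCAGG', 'GAGAGT', 'ACGGAG', 'CCAGTC'], P_demo, 3, alph_demo)
print(Z_demo.round(3))
```

In the first sequence, `GTCAGG`, the window starting at index 1 is exactly `TCA`, so it receives weight $0.125 / (0.125 + 3/216) = 0.9$ after normalisation.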

In [10]:
def E_step(seqs, P, Z, w, alph):
    """
    Implement Expectation step
    input seqs : list of sequences
    input P : position probability matrix
    input Z :  matrix of motif start positions
    input w : motif length
    input alph : alphabet (nucleotides or amino acids)
    output Z :  matrix of motif start positions
    """
    n_pos = len(seqs[0]) - w + 1
    Z = np.zeros((len(seqs), n_pos))  # every entry is recomputed from P

    for seq_nb, seq in enumerate(seqs):  # seq_nb: index of the current sequence
        for i in range(n_pos):           # i: candidate start position of the motif
            val = 1
            for j in range(len(seq)):
                lettre = alph.index(seq[j])
                if i <= j < i + w:
                    # inside the motif window: column (j-i)+1 of P
                    val = val * P[lettre][(j-i)+1]
                else:
                    # outside the window: null (background) model
                    val = val * P[lettre][0]
            Z[seq_nb][i] = val

    return Z

def normaliseZ(z):
    """
    Normalise Z matrix
    input Z : unnormalised matrix
    output Zn : normalised matrix
    """
    return np.array([row/sum(row) for row in z])

Z = E_step(seqs, p, Z, w, nts)
print(Z)
z_norm = normaliseZ(Z)
print(z_norm)

[[7.23379630e-05 1.95312500e-03 7.23379630e-05 7.23379630e-05]
 [7.23379630e-05 2.17013889e-04 7.23379630e-05 7.23379630e-05]
 [2.17013889e-04 7.23379630e-05 2.17013889e-04 7.23379630e-05]
 [6.51041667e-04 7.23379630e-05 7.23379630e-05 7.23379630e-05]]
[[0.03333333 0.9        0.03333333 0.03333333]
 [0.16666667 0.5        0.16666667 0.16666667]
 [0.375      0.125      0.375      0.125     ]
 [0.75       0.08333333 0.08333333 0.08333333]]


In [11]:
print(Z)

[[7.23379630e-05 1.95312500e-03 7.23379630e-05 7.23379630e-05]
 [7.23379630e-05 2.17013889e-04 7.23379630e-05 7.23379630e-05]
 [2.17013889e-04 7.23379630e-05 2.17013889e-04 7.23379630e-05]
 [6.51041667e-04 7.23379630e-05 7.23379630e-05 7.23379630e-05]]


In [12]:
print(z_norm)

[[0.03333333 0.9        0.03333333 0.03333333]
 [0.16666667 0.5        0.16666667 0.16666667]
 [0.375      0.125      0.375      0.125     ]
 [0.75       0.08333333 0.08333333 0.08333333]]


In [13]:
for i in range(4): 
    print(np.sum(z_norm[i]))

1.0000000000000002
0.9999999999999999
1.0
1.0


5\. Implement a function `M_step` for the Maximisation step, which estimates $p_{c,k}$ from $Z$. 
Use pseudocounts to avoid probabilities equal to zero.
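
One possible way to write this step is sketched below. It assumes that $Z$ is row-normalised, that every window contributes its weight $Z_{i,j}$ to the counts of the letters it covers, and that a pseudocount of 1 is added to every cell. The name `m_step_sketch`, the `pseudo` parameter and the two-sequence data set are made up for illustration.

```python
import numpy as np

def m_step_sketch(seqs, Z, w, alph, pseudo=1.0):
    """Estimate P (shape q x (w+1)) from the expected start positions Z."""
    q = len(alph)
    n = np.zeros((q, w + 1))
    for i, seq in enumerate(seqs):
        for j in range(len(seq) - w + 1):      # j: window start position
            for k in range(w):                 # k: offset inside the window
                n[alph.index(seq[j + k]), k + 1] += Z[i, j]
    # background counts: all occurrences of each letter minus those credited to the motif
    total = np.array([sum(s.count(c) for s in seqs) for c in alph], dtype=float)
    n[:, 0] = total - n[:, 1:].sum(axis=1)
    n += pseudo                                # pseudocounts keep every entry positive
    return n / n.sum(axis=0, keepdims=True)    # each column sums to 1

alph_m = ['A', 'C', 'G', 'T']
seqs_m = ['AAGTC', 'GTCAA']                    # made-up data: motif GTC at starts 2 and 0
Z_m = np.array([[0., 0., 1.], [1., 0., 0.]])
P_m = m_step_sketch(seqs_m, Z_m, 3, alph_m)
print(P_m)
```

Here `G` is seen twice in motif column 1, so its probability is $(2+1)/(2+4) = 0.5$, while the other three letters get $1/6$ each thanks to the pseudocounts.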

In [14]:
def totalNumberofCH(seqs,alph):

    q = len(alph)
    totalN = np.zeros((q))
    for i in range(q):
        e = alph[i]
        for seq in seqs:
            for s in seq:
                if(s == e):
                    totalN[i]+=1
    return totalN
                    
# def M_step(seqs, Z, w, alph):
#     """
#     Implement Maximisation step
#     input seqs : list of sequences
#     input Z :  matrix of motif start positions
#     input w : motif length
#     input alph : alphabet (nucleotides or amino acids)
#     output P : position probability matrix
#     """

# #     P = []
#     P = np.zeros((len(alph), len(seqs[0])-w+1))
    
#     for i in range(len(seqs[0])-w+1): # loop over the columns
#         print("==========================column i =",i,"==================================")
#         for j in range(len(alph)): # loop over the rows
#             if i==0: 
#                 P[j][i] = 0.25
                
#             else: 
#                 print(" j =", j,)
#                 print("==> we want to look at", alph[j], " at position",i)
#                 for seq in seqs: 
#                     print("WORKING WITH SEQUENCE: ",seq)
#                     for ntd in seq: 
#                         print(ntd)
#                         if ntd == alph[j]: 
#                             print(" ntd == alph[j] for j =", j, "possible? ")
#                 print("")
                
                
    
    
    
#     return P

# count the TOTAL number of A, C, G, T over the 4 sequences combined
totalN = totalNumberofCH(seqs, nts)
print("totalN \n",totalN)

# P = M_step(seqs, z_norm, w, nts)
# print(P)

totalN 
 [ 6.  5. 10.  3.]


In [15]:
# ANASTASIA 

# def totalNumberofCH(seqs,alph):

#     q = len(alph)
#     totalN = np.zeros((q))
#     for i in range(q):
#         e = alph[i]
#         for seq in seqs:
#             for s in seq:
#                 if(s == e):
#                     totalN[i]+=1
#     return totalN

def M_step(seqs, Z, w, alph):
    """
    Implement Maximisation step
    input seqs : list of sequences
    input Z :  matrix of motif start positions
    input w : motif length
    input alph : alphabet (nucleotides or amino acids)
    output P : position probability matrix
    """
    dico={}
    
    for seq in seqs:
        for i in range(len(seq)-w+1):
            if seq not in dico:
                dico[seq]=[seq[i:i+w]]
            else:
                dico[seq].append(seq[i:i+w])

    P = []
#     print(dico)
    dicopos={}
    
    for key,val in dico.items():
        for v in val:
            for i in range(len(v)):
                if v[i]=='A':
                    nt='A'
                if v[i]=='C':
                    nt='C'
                if v[i]=='G':
                    nt='G'
                if v[i]=='T':
                    nt='T'

                if(v[i],i) not in dicopos:
                    dicopos[(v[i],i)]=[(v,key)]
                else:
                    dicopos[(v[i],i)].append((v,key))
       
#     print("dicopos ",dicopos)
    dicoPup={}
    for key,val in dicopos.items():
        nt,pos=key
        for v in val:
            m,seq=v
            idnt=alph.index(nt)
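            iseq=seqs.index(seq)
            # NOTE: Z has one column per window start position, but it is indexed
            # below by the nucleotide index idnt, so these sums are not the
            # expected counts of nt at motif position pos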
            if key not in dicoPup:
                dicoPup[key]=Z[iseq][idnt]
            else:
                dicoPup[key]+=Z[iseq][idnt]
                
#     print("DICO P UP",dicoPup)
    
    som=0
    for i in range(len(seqs)):
        for j in range(len(seqs[0])-w+1):
            som+=Z[i][j]
    
    dicoFinal={}
    for key,val in dicoPup.items():
        dicoFinal[key]=(val/som)
        
#     print("DICOFINAL",dicoFinal)
    P=initialiseP(seqs, w, alph)
    for k,v in dicoFinal.items():
        nt,pos=k
        idnt=alph.index(nt)
        P[idnt][pos+1]=v
        
    
        
        
    
    return P


P = M_step(seqs, z_norm, w, nts)
print(P)

[[0.25       0.37291667 0.37291667 0.33125   ]
 [0.25       0.29791667 0.27708333 0.24583333]
 [0.25       0.3        0.3        0.40208333]
 [0.25       0.00833333 0.02916667 0.0625    ]]


In [16]:
print(P)

[[0.25       0.37291667 0.37291667 0.33125   ]
 [0.25       0.29791667 0.27708333 0.24583333]
 [0.25       0.3        0.3        0.40208333]
 [0.25       0.00833333 0.02916667 0.0625    ]]


In [17]:
print(0.3266369+ 0.3578869 +0.44122024 +0.046875)

1.17261904


In [18]:
# normalise P column-wise

import copy 
P_copied = copy.copy(P)
print(P_copied)

ligne, colonne = P_copied.shape
print(ligne, colonne)

SOMME = []
for c in range (colonne):
    somme = np.sum(P_copied[:,c])
    SOMME.append(somme)
print(SOMME)




for c in range (colonne):
    somme = np.sum(P_copied[:,c])
    for l in range(ligne): 
        P_copied[l][c] = P_copied[l][c] / somme 
    
print("\nP_copied = P normalisée \n",P_copied)
SOMME = []
for c in range (colonne):
    somme = np.sum(P_copied[:,c])
    SOMME.append(somme)
print(SOMME)

        
        

[[0.25       0.37291667 0.37291667 0.33125   ]
 [0.25       0.29791667 0.27708333 0.24583333]
 [0.25       0.3        0.3        0.40208333]
 [0.25       0.00833333 0.02916667 0.0625    ]]
4 4
[1.0, 0.9791666666666664, 0.9791666666666665, 1.0416666666666665]

P_copied = P normalisée 
 [[0.25       0.38085106 0.38085106 0.318     ]
 [0.25       0.30425532 0.28297872 0.236     ]
 [0.25       0.30638298 0.30638298 0.386     ]
 [0.25       0.00851064 0.02978723 0.06      ]]
[1.0, 1.0, 1.0, 1.0]


6\. Write a function `likelihood` that computes the log-likelihood of the whole set of sequences.
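
Assuming a uniform prior over the $L-w+1$ start positions, the quantity to compute is $\log \Pr(D \mid \Theta) = \sum_i \log\left(\frac{1}{L-w+1}\sum_j \Pr(X_i \mid Z_{i,j}=1,\Theta)\right)$. The sketch below evaluates it in log space to limit underflow; `log_likelihood_sketch` is a hypothetical name and the uniform matrix is only a sanity check.

```python
import numpy as np

def log_likelihood_sketch(seqs, P, w, alph):
    """Sum over sequences of log(mean over start positions of Pr(X_i | start))."""
    logP = np.log(P)
    n_pos = len(seqs[0]) - w + 1
    total = 0.0
    for seq in seqs:
        s = [alph.index(c) for c in seq]
        terms = np.array([
            # log Pr(X_i | motif starts at j): motif columns inside the window,
            # background column 0 elsewhere
            sum(logP[c, k - j + 1] if j <= k < j + w else logP[c, 0]
                for k, c in enumerate(s))
            for j in range(n_pos)])
        m = terms.max()
        # log-sum-exp over start positions, then the uniform prior 1/(L-w+1)
        total += m + np.log(np.exp(terms - m).sum()) - np.log(n_pos)
    return total

toy_ll = ['GTCAGG', 'GAGAGT', 'ACGGAG', 'CCAGTC']
P_uniform = np.full((4, 4), 0.25)
ll_uniform = log_likelihood_sketch(toy_ll, P_uniform, 3, ['A', 'C', 'G', 'T'])
print(ll_uniform)
```

With a fully uniform matrix every character contributes $\log 0.25$, so the result for the four toy sequences of length 6 is $24 \log 0.25 \approx -33.27$, which gives a quick check of the implementation.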

In [19]:
print(seqs)

['GTCAGG', 'GAGAGT', 'ACGGAG', 'CCAGTC']


In [49]:
def likelihood(seqs, Z, P, w, alph):
    """
    Implement log likelihood function of P
    input seqs : list of sequences
    input Z :  matrix of motif start positions
    input P : position probability matrix
    input w : motif length
    input alph : alphabet (nucleotides or amino acids)
    output lLikelihood : log likelihood of P 

    """
    M = len(seqs)
    L = len(seqs[0])
    
    Likelihood =( L - w + 1)**(-M)
    print(Likelihood)
    ligne, colonne = Z.shape
    
    
    new_Z = E_step(seqs, P, Z, w, alph)
#     print(new_Z)
    new_Z_norm = normaliseZ(new_Z) 
#     print(new_Z_norm)
    
    for i in range (ligne): 
# #         print(" np.sum(new_Z[i]) ====>",  np.sum(new_Z[i]))
# #         print(" (new_Z[i]) ====>",  (new_Z[i]))
        Likelihood =Likelihood * np.sum(new_Z[i])
        
#         print(" np.sum(new_Z_norm[i]) ====>",  np.sum(new_Z_norm[i]))
#         Likelihood = Likelihood * np.sum(new_Z_norm[i])

            
            
#     lLikelihood = np.log((L-w+1)**-M)
    
    lLikelihood = np.log(Likelihood)
#     lLikelihood = (Likelihood)
    
    print ("Likelihood = ", Likelihood)
    print("Llikelihood = ", lLikelihood)
#     re = np.exp(lLikelihood) # to get back to the likelihood
#     print(re)

    
      
    return lLikelihood

logvraisemblance = likelihood(seqs, Z, P, w, nts)
print(logvraisemblance)

0.00390625
Likelihood =  3.569752257333432e-14
Llikelihood =  -30.963695104238006
-30.963695104238006


In [50]:
# def likelihood(seqs, Z, P, w, alph):
#     """
#     Implement log likelihood function of P
#     input seqs : list of sequences
#     input Z :  matrix of motif start positions
#     input p : position probability matrix
#     input w : motif length
#     input alph : alphabet (nucleotides or amino acids)
#     output lLikelihood : log likelihood of P 

#     """
#     M = len(seqs)
#     L = len(seqs[0])
     
#     lLikelihood = 
    
#     print("Llikelihood = ", lLikelihood)

    
      
#     return lLikelihood

# logvraisemblance = likelihood(seqs, Z, P, w, nts)
# print(logvraisemblance)

In [51]:
# # if we use the NORMALISED Z (output)

# P
# [[0.25       0.38712798 0.38712798 0.34546131]
#  [0.25       0.28258929 0.25915179 0.18415179]
#  [0.25       0.29471726 0.29471726 0.42328869]
#  [0.25       0.01785714 0.02566964 0.04947917]]
#  np.sum(new_Z_norm[i]) ====> 1.0
#  np.sum(new_Z_norm[i]) ====> 1.0
#  np.sum(new_Z_norm[i]) ====> 0.9999999999999999
#  np.sum(new_Z_norm[i]) ====> 1.0
# -5.545177444479562

In [52]:
# # if we use the NON-normalised Z (output)

# P
# [[0.25       0.38712798 0.38712798 0.34546131]
#  [0.25       0.28258929 0.25915179 0.18415179]
#  [0.25       0.29471726 0.29471726 0.42328869]
#  [0.25       0.01785714 0.02566964 0.04947917]]
#  np.sum(new_Z[i]) ====> 0.001524895107729391
#  np.sum(new_Z[i]) ====> 0.002213263527992516
#  np.sum(new_Z[i]) ====> 0.002437814382410838
#  np.sum(new_Z[i]) ====> 0.0012288246552664348
# -30.86264475807634

In [53]:
sup = 0 
inf = 0
cpt = 0

print(P)

for p in P:
    for pp in p: 
        cpt += 1
        print("==========================>",pp, type(pp))
        if pp > 1e-5: 
            print("sup")
            sup += 1
        else: 
            print("inf")
            inf += 1
    print(sup)
    
if sup == cpt: 
    print("FINI")

[[0.25       0.37291667 0.37291667 0.33125   ]
 [0.25       0.29791667 0.27708333 0.24583333]
 [0.25       0.3        0.3        0.40208333]
 [0.25       0.00833333 0.02916667 0.0625    ]]
sup
sup
sup
sup
4
sup
sup
sup
sup
8
sup
sup
sup
sup
12
sup
sup
sup
sup
16
FINI


7\. Implement the Expectation-Maximisation algorithm. Compute the total log-likelihood of the model at each iteration; the algorithm stops when $\Delta \log \text{Pr}(D | \Theta) < \varepsilon$. Use $\varepsilon = 10^{-4}$. Your implementation must return the model parameters ($p$ and the associated log-likelihood), as well as the list of the best motif positions in each sequence (matrix $Z$). Be careful to use the non-normalised $Z$ when computing the log-likelihood!
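
The loop described above can be sketched as follows. This is a self-contained illustration under stated assumptions, not the notebook's implementation: the helper names (`window_probs`, `update_P`, `em_sketch`), the pseudocount of 1, the `max_iter` safeguard and the fixed `seed` are all choices made here. The stopping rule is the one from the statement, a change in log-likelihood below `eps`, and the motif positions are taken as the argmax of each row of the normalised $Z$.

```python
import numpy as np

def window_probs(seqs, P, w, alph):
    """Unnormalised Z: Pr(X_i | motif starts at j) for every i, j."""
    n_pos = len(seqs[0]) - w + 1
    Z = np.zeros((len(seqs), n_pos))
    for i, seq in enumerate(seqs):
        s = [alph.index(c) for c in seq]
        for j in range(n_pos):
            Z[i, j] = np.prod([P[c, k - j + 1] if j <= k < j + w else P[c, 0]
                               for k, c in enumerate(s)])
    return Z

def update_P(seqs, Zn, w, alph, pseudo=1.0):
    """Pseudocount-smoothed M step computed from the normalised Z."""
    n = np.zeros((len(alph), w + 1))
    for i, seq in enumerate(seqs):
        for j in range(Zn.shape[1]):
            for k in range(w):
                n[alph.index(seq[j + k]), k + 1] += Zn[i, j]
    total = np.array([sum(s.count(c) for s in seqs) for c in alph], dtype=float)
    n[:, 0] = total - n[:, 1:].sum(axis=1)     # background = everything outside the motif
    n += pseudo
    return n / n.sum(axis=0, keepdims=True)

def em_sketch(seqs, w, alph, eps=1e-4, max_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    q = len(alph)
    # initialise P from a random window, as in question 2
    P = np.full((q, w + 1), 0.5 / (q - 1))
    P[:, 0] = 1 / q
    seq = seqs[rng.integers(len(seqs))]
    start = rng.integers(len(seq) - w + 1)
    for k, c in enumerate(seq[start:start + w]):
        P[alph.index(c), k + 1] = 0.5
    history = []
    for _ in range(max_iter):
        Z = window_probs(seqs, P, w, alph)               # E step (unnormalised)
        history.append(np.log(Z.mean(axis=1)).sum())     # log-likelihood of the current P
        Zn = Z / Z.sum(axis=1, keepdims=True)
        if len(history) > 1 and history[-1] - history[-2] < eps:
            break                                        # converged
        P = update_P(seqs, Zn, w, alph)                  # M step
    pos_motif = Zn.argmax(axis=1)                        # most likely start per sequence
    return P, history, Zn, pos_motif

toy_em = ['GTCAGG', 'GAGAGT', 'ACGGAG', 'CCAGTC']
P_em, hist_em, Z_em, pos_em = em_sketch(toy_em, 3, ['A', 'C', 'G', 'T'])
print(pos_em, hist_em[-1])
```

The tuple is returned in the order `(P, log-likelihood history, Z, positions)`, which is the order read by `analyse_results`. Changing `seed` changes the initial window and may lead to a different solution.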

In [67]:
A = np.asarray([[1,2,3],[4,5,6]])
B = np.asarray([[1,2,3],[4,5,6]])



if np.array_equal(A, B):
    print("The two matrices are equal.")
else:
    print("The two matrices are not equal.")
    
    
    

The two matrices are equal.


In [75]:
matrix1 = np.asarray([[1.0000000001, 2], [3, 4]])
matrix2 = np.asarray([[1, 2], [3, 4]])

threshold = 0.0001  # arbitrary value

# Compare the two matrices for equality with threshold
comparison = np.abs(matrix1 - matrix2) <= threshold
if np.all(comparison):
    print("The two matrices are equal with threshold.")
else:
    print("The two matrices are not equal with threshold.")
    

The two matrices are equal with threshold.


In [77]:
def ExpectationMaximisation(seqs, w, alph, eps):
    """
    Implement Expectation Maximisation algorithm
    input seqs : list of sequences
    input w : motif length
    input alph : alphabet (nucleotides or amino acids)
    input eps : threshold 
    output P : position probability matrix
    output Z :  matrix of motif start positions
    output lLikelihood : log likelihood of P 
    output pos_motif : positions of motifs in seqs
    """
    P = initialiseP(seqs, w, alph)
    Z = initialiseZ(seqs, w)
    
    lLikelihood = []
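    # NOTE: pos_motif is never filled in below, so results[3][i] in
    # analyse_results raises the IndexError shown in the output of this cell
    pos_motif = []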
    P_avant = P 
    
    Z =  E_step(seqs, P, Z, w, alph)
    P = M_step(seqs, Z, w, alph)
    LL = likelihood(seqs, Z, P, w, alph) #LogLikelihood
    lLikelihood.append(LL)
    L = np.exp(LL) # likelihood
    
    print(eps)
    print(L)
    if L > eps: 
        print("okooooooo")
        

 
#     while L > eps or P_avant == P :
# #         print("entered")
#         P_avant = P
#         Z =  E_step(seqs, P, Z, w, alph)
#         P = M_step(seqs, Z, w, alph)
#         LL = likelihood(seqs, Z, P, w, alph) #LogLikelihood
#         lLikelihood.append(LL)
#         L = np.exp(LL) # likelihood
    
       
          
#     for i in range (50):
#         print("================================", i)
#         if L < eps: 
#             P_avant = P
#             Z =  E_step(seqs, P, Z, w, alph)
#             print("Z\n",Z)
#             P = M_step(seqs, Z, w, alph)
#             print("P\n", P)
#             LL = likelihood(seqs, Z, P, w, alph) #LogLikelihood
#             lLikelihood.append(LL)
#             L = np.exp(LL) # likelihood
#         else: 
#             continue


    for i in range (50): 
        threshold = eps
        # Compare the two matrices for equality with threshold
        comparison = np.abs(P_avant - P) <= threshold
        if np.all(comparison):
            print("The two matrices are equal with threshold.")
            break  # P no longer changes, so the remaining iterations would do nothing
        else:
            print("The two matrices are not equal with threshold.")
            P_avant = P
            Z =  E_step(seqs, P, Z, w, alph)
            print("Z\n",Z)
            P = M_step(seqs, Z, w, alph)
            print("P\n", P)
            LL = likelihood(seqs, Z, P, w, alph) #LogLikelihood
            lLikelihood.append(LL)
            L = np.exp(LL) # likelihood
    
    

    return (P, lLikelihood, Z, pos_motif)

def analyse_results(results, seqs):
    print('Model parameters:')
    print('p:\n',results[0])
    print('\n')
    print('associated log-likelihood:',results[1][-1])
    print('\n')
    print('Z:\n',results[2])
    print('\nMotifs:')
    for i in range(len(seqs)):
        print('seq'+str(i+1)+': '+ str(results[3][i]+1) + '  ' + seqs[i][results[3][i]:results[3][i]+w])

    plt.plot(results[1])
    plt.xlabel('Iteration')
    plt.ylabel('Log-likelihood')
    plt.show()

eps = 10**-4
EMResults = ExpectationMaximisation(seqs, w, nts, eps)
print(EMResults[0])
print("\n\n\n\n\n")
print(EMResults[1])
print("\n\n\n\n\n")
print(EMResults[2])
print("\n\n\n\n\n")
print(EMResults[3])
print("\n\n\n\n\n")
analyse_results(EMResults,seqs)

0.00390625
Likelihood =  6.0883237536228386e-15
Llikelihood =  -32.73240359678647
0.0001
6.088323753622834e-15
The two matrices are not equal with threshold.
Z
 [[6.32693554e-05 8.77862306e-05 5.67526118e-04 2.96100583e-04]
 [2.96100583e-04 8.22501620e-05 2.96100583e-04 3.29000648e-05]
 [4.56488399e-04 1.04774052e-03 1.51846453e-04 2.96100583e-04]
 [4.48685178e-04 5.67526118e-04 3.29000648e-05 6.32693554e-05]]
P
 [[0.25       0.32604505 0.32604505 0.26418461]
 [0.25       0.4743625  0.35579666 0.13690588]
 [0.25       0.31260671 0.31260671 0.46289585]
 [0.25       0.06186044 0.07507848 0.02009142]]
0.00390625
Likelihood =  8.705960774609827e-14
Llikelihood =  -30.072183364354938
The two matrices are not equal with threshold.
Z
 [[5.02060314e-05 9.08536573e-05 1.11864242e-03 7.37189638e-04]
 [7.37189638e-04 4.20729971e-04 7.37189638e-04 3.19968090e-05]
 [8.39040257e-04 1.07253621e-03 4.03389073e-04 7.37189638e-04]
 [6.96690305e-04 1.11864242e-03 3.19968090e-05 5.02060314e-05]]
P
 [[0.25

IndexError: list index out of range

8\. What do you observe when running the EM algorithm several times? Justify your answer.

Answer:

<font color="blue">
When the EM algorithm is run several times, we observe that the matrix P converges.
</font>

9\. To avoid the problem identified in the previous point, write a function `EM_iteratif` that runs the `EM` algorithm $N$ times ($N=10$) and keeps the parameters associated with the best log-likelihood. Do you find the right motifs?

In [None]:
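def EM_iteratif(N, seqs, w, alph, eps):
    """
    Implement an iterative version of Expectation Maximisation algorithm
    input N : number of iterations
    input seqs : list of sequences
    input w : motif length
    input alph : alphabet (nucleotides or amino acids)
    input eps : threshold 
    output bestModel : the parameters of the best model
    """
    bestModel = None
    for _ in range(N):
        # each run starts from a fresh random initialisation of P
        model = ExpectationMaximisation(seqs, w, alph, eps)
        # model = (P, lLikelihood, Z, pos_motif), the order read by analyse_results;
        # keep the run whose final log-likelihood is the highest
        if bestModel is None or model[1][-1] > bestModel[1][-1]:
            bestModel = model

    return bestModel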


eps = 10**-4
N=10
meilleurEM = EM_iteratif(N, seqs, w, nts, eps)
analyse_results(meilleurEM,seqs)

10\. Apply your `EM` algorithm to the set of sequences in the file `trainingSequences.txt`, using $w=10$. 

In [78]:
w= 10
input_f = "trainingSequences.txt"
seqs_train = read_training_file(input_f)
print (seqs_train)
eps = 10**-4
N=10
meilleurEM = EM_iteratif(N, seqs_train, w, nts, eps)
analyse_results(meilleurEM,seqs_train)

['ACAACCATATATAGTAGCCACTGAAT', 'CCACCCCATATATAGTACGGGTGGTG', 'CCATAAATAGAGCAGACTGTCGCTGT', 'GTAAACATAAAACCCCATAAATAGGA', 'TTCAAGAAACTGCCATAAATAGCGAT', 'TAGAGGTTTTTGTGCCATAAATAGGT', 'CCCCATAAATAGGAATATCGGCCTGA', 'TTGCCATTAAATTATACCATATATGG', 'TATCAACAACGATAACCCATATATGG', 'TTTCCAAATATAGAAGGTGTGGAAAG', 'TCCAAATATAGTAAAATCGAGTCGAT', 'GACTGGGGCCCAAATATAGCATGTTC', 'ATCATTAGCTTTTACTCCATAAATGG', 'ATTCTTTTGCCATAAATGGTAACTCG', 'CCATAAATGGCAAGTCTGTCGAATAA', 'CCCATAAATGGCAGGGTATTAGCACG', 'CCAAAAATAGTGTGTCGTAACAGCTT', 'CCAAAAATAGGGGAATGGAAGTGGGG', 'CCAAAAATAGGCCAGAGTTTACAACG', 'CCAAAAATAGTTAAATAATATACATT', 'CTACACCTTCCAAAAATAGTATATCT', 'TTGCCAAATATGGGGTTAGAGTGTTC', 'GTCTTTACCAAAAATGGTGATCCTGT', 'TTGCCAAAAATGGAGCGTTTACCAAT', 'ATCCACCATTTATAGATTCAGGAGGC', 'GCATAAGAGAACATTCCATTTATAGG', 'TCAACCCCATTTATAGCCACGTCAGT', 'CATCCATTAATAGTAGCCTAATGGCG', 'GGAGTAGGCCCATTAATAGTATCTTT', 'CCATTAATAGACAAAATCGACTCAAG', 'CCAATTATAGAAAGTGGCTGGTCGTC', 'AACTATTATTTCTCACCCATTAATGG', 'ATGCTTTACCAATAATAGAGCTGCAA', 'GGTCAGTT

NameError: name 'EM_iteratif' is not defined

11\. Build a LOGO for the predicted motif with the <i>WebLogo</i> service. To do so, identify the motif in each sequence, use <i>ClustalOmega</i> to align them, and then <i>WebLogo</i> to generate the LOGO. Add the LOGO to your answer.

In [79]:
# write the predicted motifs in FASTA format; the file is closed automatically
with open('motifs.txt','w') as fhandler:
    for i in range(len(seqs_train)):
        fhandler.write('>motif'+str(i+1)+'\n')
        fhandler.write(seqs_train[i][meilleurEM[3][i]:meilleurEM[3][i]+w])
        fhandler.write('\n')

NameError: name 'meilleurEM' is not defined

12\. Compare the motifs found by your program with the motifs in the file `testingSequences.txt`, where the true motifs are shown in uppercase letters. How well does your program perform? 
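
One way to carry out this comparison is sketched below. The exact layout of `testingSequences.txt` is not shown here, so the sketch assumes one sequence per line, written in lowercase except for a single uppercase run marking the true motif. The function names and the two example lines are made up, and "performance" is taken to mean the fraction of sequences whose predicted start equals the true start, which is a strict criterion.

```python
import re

def true_motif_starts(lines):
    """Return (start, motif) for the first uppercase run of each line, or None."""
    out = []
    for line in lines:
        m = re.search(r"[A-Z]+", line.strip())
        out.append((m.start(), m.group()) if m else None)
    return out

def accuracy(predicted_starts, truth):
    """Fraction of sequences whose predicted start equals the true start."""
    hits = sum(1 for p, t in zip(predicted_starts, truth)
               if t is not None and p == t[0])
    return hits / len(truth)

example_lines = ["acgtCCATATATAGgt", "CCAAAAATAGttgaca"]  # made-up lines, not the real file
truth = true_motif_starts(example_lines)
print(truth)                    # [(4, 'CCATATATAG'), (0, 'CCAAAAATAG')]
print(accuracy([4, 2], truth))  # 0.5
```

A more lenient score could count a prediction as correct when the predicted and true windows overlap by some minimum number of positions.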

In [80]:
import re

motifs_extracted = []
motifs_real = []


input_f = "testingSequences.txt"
