<h1><b>Statistics in Bioinformatics: </b> TME2 </h1>
<br>
The objectives of this TME are:
<br>
<ul>
<li> objective 1: understand the differences between global and local alignment, </li>
<li> objective 2: rebuild a BLOSUM-type substitution matrix. </li>
</ul>
<br>
<div class="alert alert-warning" role="alert" style="margin: 10px">
<p><b>Submission</b></p>
<ul>
<li>Rename the file TME2.ipynb to NomEtudiant1_NomEtudiant2.ipynb (the names of the two students) </li>
<li>Send it by email to nikaabdollahi@gmail.com, with the subject line [SBAS-2019] TME2</li>
</ul>
</div>

Student name 1:
<br>
Student name 2:
<br>

<b> Exercise 1 </b>: We study a “difficult” alignment between the 50S ribosomal protein L20 from A. aeolicus
and the ligase UBR5 from human (the structural alignment is shown below).
<br>

<img src="bacthum.png" alt="Structural alignment of 50S ribosomal protein L20 and UBR5" height="210" width="202"> 



<br><br>
<b>A.</b> Retrieve the sequences of the human protein “E3 ubiquitin ligase UBR5” and of the A. aeolicus protein “50S ribosomal L20”
from the <a href="http://www.uniprot.org/">Uniprot</a> site, in .fasta format.

<br>
<b>B.</b> Using the <a href="https://www.ebi.ac.uk/Tools/psa/">psa</a> site, align these sequences. Comment on the low percent identity despite the good alignment of the structures.


Answer: The second sequence is much shorter than the first, so a high number of identical amino acids is simply improbable.

<b>C.</b> Retrieve the sequences of the proteins “metL Bifunctional aspartokinase/homoserine dehydrogenase 2” and “lysC Lysine-sensitive aspartokinase 3” from E. coli from the <a href="http://www.uniprot.org/"> Uniprot </a> site, in .fasta format. Using the <a href="https://www.ebi.ac.uk/Tools/psa/"> psa</a> site, reproduce the global and local alignments seen in the tutorial session (TD). Observe how the results change when you change the substitution matrix and/or the gap penalties (“Gap_penalty” and “Extend_penalty”).

Answer: Local alignment gives a much more relevant comparison, since it does not show the long run of amino acids aligned to gaps.
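The contrast between the two alignment modes can be reproduced on a toy case. The sketch below is not the EMBOSS implementation behind the psa site: it assumes a simple match/mismatch scheme and a linear gap penalty (the values +1, -1 and -2 are hypothetical), and it only computes the optimal scores, not the alignments themselves.

```python
def align_scores(s, t, match=1, mismatch=-1, gap=-2):
    """Return (global_score, local_score) for sequences s and t.

    Toy scoring scheme with a linear gap penalty (hypothetical values,
    not BLOSUM62 and not affine gaps).
    """
    n, m = len(s), len(t)
    # Needleman-Wunsch (global): the first row and column pay for leading gaps.
    G = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        G[i][0] = i * gap
    for j in range(1, m + 1):
        G[0][j] = j * gap
    # Smith-Waterman (local): cells are floored at 0, so an alignment
    # may start and end anywhere in either sequence.
    L = [[0] * (m + 1) for _ in range(n + 1)]
    best_local = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            G[i][j] = max(G[i - 1][j - 1] + sub, G[i - 1][j] + gap, G[i][j - 1] + gap)
            L[i][j] = max(0, L[i - 1][j - 1] + sub, L[i - 1][j] + gap, L[i][j - 1] + gap)
            best_local = max(best_local, L[i][j])
    return G[n][m], best_local


# A short sequence embedded in a much longer one (toy data).
long_seq = "TTTTTTACGTACGTTTTTTT"
short_seq = "ACGTACG"
global_score, local_score = align_scores(long_seq, short_seq)
print(global_score, local_score)  # -19 7
```

The global score is dragged down by the 13 gap columns needed to span the longer sequence, while the local score only reflects the 7 matching residues.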

<b>Exercise 2</b>: Global and local alignment scores <br>
Write a function to compute: 1) the percent identity, 2) the percent similarity, 3) the score of an alignment using the BLOSUM62 matrix. The similarity calculation must account for amino acids that share the same physicochemical property; here, that means all amino acid pairs with a value greater than zero in the BLOSUM62 matrix. The BLOSUM-based score must also account for the two gap penalties, opening and extension. Test your functions using the file test.fasta.


In [2]:
import numpy as np

#variable initiation
aa = ['A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V', 'X']

#parameters
q = 21
gap_open = -5
gap_ext = 0.5

#files
input_test_f = 'test.fasta'
input_blosum_f = 'BLOSUM62.txt'

#For a simple test use:
#input_test_f = 'testToy.fasta'



In [3]:
from Bio import SeqIO

#Read the test.fasta file
def read_fasta (input_f):
    seqs = []
    with open(input_f, 'r') as handle:
        for record in SeqIO.parse(handle, 'fasta'):
            seqs.append( str(record.seq) )
    return seqs
    
testAln = read_fasta(input_test_f)
print (testAln)

['EADINIAFYQRDHGDNSPFDGPNGILAHAFQPGQGIGGDAHFDAEETWT', 'MADILVVFARGAHGDFHAFDGKGG-LAHAFGPGSGIGGDAHFDEDEFWT']


In [4]:
#read Blosum
def read_blosum (input_f):
    
    with open(input_f, 'r') as handle:
        it = 0
        for line in handle:
            if(line[0] == '#'):
                continue
            elif(line[0] == ' '):
                indices = line.strip().split(' ')
                indices = list ( filter(None,indices) )
                mat = np.zeros( (len(indices), len(indices)) )
                it = 0
                continue
            else:
                line = line[1:].strip()
                separ = list ( filter(None,line.split(' ')) ) 
                separ = np.array(separ)
                mat[it] = separ
                it+=1
    return mat, indices
    
matr_62, indices = read_blosum (input_blosum_f)
#print(matr_62)
print(indices)

#manual fix: relabel the last symbol ('*') as the gap character '-'
indices[23] = '-'

['A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V', 'B', 'Z', 'X', '*']


In [5]:
#id: ~0.61
#sim: ~0.67
#score: ~165


#1) percent identity
# -> fraction of alignment columns holding identical amino acids
def ident(seq1, seq2):
    iden = 0
    for i in range(len(seq1)):
        if (seq1[i] == seq2[i]):
            iden+=1
    return iden / len(seq1)

#2) percent similarity
# -> fraction of alignment columns whose pair has a positive BLOSUM score
def simil(seq1, seq2, mat):
    sim = 0
    for i in range(len(seq1)):
        a = indices.index(seq1[i])
        b = indices.index(seq2[i])
        if(mat[a][b] > 0):
            sim += 1
    return sim / len(seq1)


#3) score of an alignment using the BLOSUM62 matrix
def align(seq1, seq2, gap_open, gap_ext, mat):
    score = 0
    extend = False
    for i in range (len(seq1)):
        #a column is a gap only if one of the two symbols is '-'
        if(seq1[i] == '-' or seq2[i] == '-'):
            #penalties are subtracted as magnitudes, whatever their sign
            if(extend):
                score -= abs(gap_ext)
            else:
                score -= abs(gap_open)
                extend = True
            continue
            
        #aligned residues (match or mismatch) close any open gap
        extend = False
        a = indices.index(seq1[i])
        b = indices.index(seq2[i])
        score += mat[a][b]
        
    return score


def indentite_calcul(seq1, seq2, gap_open, gap_ext, matr_62):
    identitee = ident(seq1, seq2)
    similarity = simil(seq1,seq2,matr_62)
    score = align(seq1, seq2, gap_open, gap_ext, matr_62)
    
    return identitee, similarity, score


identitee,similarity,score = indentite_calcul(testAln[0], testAln[1], gap_open, gap_ext, matr_62)
print ("identitee= ", identitee , " similarity= ", similarity, " score= ", score )

identitee=  0.6122448979591837  similarity=  0.673469387755102  score=  155.0
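To see how the opening and extension penalties interact, it can help to score a toy alignment by hand. The sketch below is self-contained and independent of the notebook's variables: the function name `affine_score`, the dictionary `toy_sub` and its values are hypothetical, and it assumes the convention that the first column of a gap run pays the opening penalty and each following column pays the extension penalty. As a simplification, a gap in one sequence directly followed by a gap in the other is treated as a single run.

```python
def affine_score(seq1, seq2, sub, gap_open=-5.0, gap_ext=-0.5):
    """Score an existing pairwise alignment with affine gap penalties.

    sub maps (residue, residue) tuples to substitution scores
    (toy values here, not BLOSUM62).
    """
    score = 0.0
    in_gap = False
    for a, b in zip(seq1, seq2):
        if a == '-' or b == '-':
            # First column of a run pays gap_open, the rest pay gap_ext.
            score += gap_ext if in_gap else gap_open
            in_gap = True
        else:
            score += sub[(a, b)]
            in_gap = False  # an aligned pair closes the current gap run
    return score


toy_sub = {('A', 'A'): 4, ('C', 'C'): 9, ('A', 'C'): 0, ('C', 'A'): 0}

# One gap of length 3: 4 - 5 - 0.5 - 0.5 + 4
print(affine_score("AACCA", "A---A", toy_sub))  # 2.0
# Two separate gaps of length 1: 4 - 5 + 4 - 5
print(affine_score("ACAC", "A-A-", toy_sub))    # -2.0
```

The two calls show why the state must be reset on every aligned pair: without the reset, the second gap would be charged as an extension of the first.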


<b>Exercise 3</b>: Substitution matrix <br>
Write a program (several functions will be needed) that produces a substitution matrix like BLOSUM. Use the alignment in the file <b>blocks.dat</b>. 
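The quantities computed in the cells that follow can be summarized as below. This is a reading of the notebook's own code, using its names (fij, T, qij, pi, eij, sij) for a block of n sequences of width w. It is not a restatement of the official BLOSUM procedure: published BLOSUM matrices additionally cluster similar sequences, then scale and round the log-odds to integers, which this notebook does not do.

```latex
\begin{align}
T      &= \frac{w\,n\,(n-1)}{2}
        && \text{total number of residue pairs over all columns} \\
q_{ij} &= \frac{f_{ij}}{T}
        && \text{observed probability of the unordered pair } (i,j) \\
p_i    &= q_{ii} + \frac{1}{2}\sum_{j \neq i} q_{ij}
        && \text{probability of residue } i \\
e_{ij} &= \begin{cases} p_i^{2} & \text{if } i = j \\ 2\,p_i\,p_j & \text{if } i \neq j \end{cases}
        && \text{expected probability of the pair under independence} \\
s_{ij} &= \log_2 \frac{q_{ij}}{e_{ij}}
        && \text{log-odds score}
\end{align}
```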


In [57]:
from itertools import combinations, combinations_with_replacement
from collections import Counter


input_block_f = 'block.dat'

#For a simple test do:
#input_block_f = 'blockToy.dat'
#q = 3
#aa = ['A', 'B', 'C']

#generate all aa combination
aa.sort()
pairs_freq_dict = {x:0 for x in combinations_with_replacement(aa,2)}

print(pairs_freq_dict)

{('A', 'A'): 0, ('A', 'C'): 0, ('A', 'D'): 0, ('A', 'E'): 0, ('A', 'F'): 0, ('A', 'G'): 0, ('A', 'H'): 0, ('A', 'I'): 0, ('A', 'K'): 0, ('A', 'L'): 0, ('A', 'M'): 0, ('A', 'N'): 0, ('A', 'P'): 0, ('A', 'Q'): 0, ('A', 'R'): 0, ('A', 'S'): 0, ('A', 'T'): 0, ('A', 'V'): 0, ('A', 'W'): 0, ('A', 'X'): 0, ('A', 'Y'): 0, ('C', 'C'): 0, ('C', 'D'): 0, ('C', 'E'): 0, ('C', 'F'): 0, ('C', 'G'): 0, ('C', 'H'): 0, ('C', 'I'): 0, ('C', 'K'): 0, ('C', 'L'): 0, ('C', 'M'): 0, ('C', 'N'): 0, ('C', 'P'): 0, ('C', 'Q'): 0, ('C', 'R'): 0, ('C', 'S'): 0, ('C', 'T'): 0, ('C', 'V'): 0, ('C', 'W'): 0, ('C', 'X'): 0, ('C', 'Y'): 0, ('D', 'D'): 0, ('D', 'E'): 0, ('D', 'F'): 0, ('D', 'G'): 0, ('D', 'H'): 0, ('D', 'I'): 0, ('D', 'K'): 0, ('D', 'L'): 0, ('D', 'M'): 0, ('D', 'N'): 0, ('D', 'P'): 0, ('D', 'Q'): 0, ('D', 'R'): 0, ('D', 'S'): 0, ('D', 'T'): 0, ('D', 'V'): 0, ('D', 'W'): 0, ('D', 'X'): 0, ('D', 'Y'): 0, ('E', 'E'): 0, ('E', 'F'): 0, ('E', 'G'): 0, ('E', 'H'): 0, ('E', 'I'): 0, ('E', 'K'): 0, ('E', 'L'

In [58]:
#read alignment file
from io import StringIO 
import sys

def readAlnFile(input_f):
    seq = []
    with open(input_f, 'r') as f:
        for l in f:
            seq.append(l.strip())
    return seq

sequences = readAlnFile(input_block_f)

In [64]:
#compute fij frequencies (counts of unordered residue pairs, column by column)

n = len(sequences)
w = len(sequences[0])

fij = Counter()
for i in range(w):
    array = []
    for j in range(n):
        array.append(sequences[j][i])    
    comb = list(combinations(array,2))  
    
    simplerList = []
    for c in comb:
        if(c not in pairs_freq_dict):
            simplerList.append( (c[1],c[0]) )
        else:
            simplerList.append(c)
    comb.clear()
    
    
    fij += Counter(simplerList)
    
print(fij)

Counter({('G', 'G'): 104073, ('A', 'A'): 94758, ('I', 'V'): 48969, ('A', 'G'): 41768, ('I', 'I'): 39302, ('L', 'L'): 33742, ('A', 'V'): 32900, ('V', 'V'): 31535, ('K', 'K'): 31294, ('D', 'D'): 30644, ('K', 'R'): 27384, ('A', 'L'): 24526, ('G', 'L'): 23962, ('N', 'N'): 22091, ('A', 'S'): 19608, ('R', 'R'): 18292, ('E', 'E'): 17423, ('A', 'K'): 17129, ('F', 'L'): 17124, ('L', 'V'): 16870, ('A', 'I'): 16673, ('G', 'V'): 16138, ('I', 'L'): 16051, ('A', 'C'): 14516, ('E', 'K'): 14128, ('F', 'F'): 14000, ('A', 'E'): 13023, ('D', 'E'): 12745, ('T', 'T'): 12030, ('G', 'I'): 11808, ('S', 'S'): 10619, ('A', 'M'): 10093, ('K', 'L'): 9838, ('G', 'S'): 9166, ('A', 'T'): 8316, ('A', 'D'): 8313, ('E', 'L'): 8188, ('A', 'R'): 8108, ('F', 'V'): 8108, ('G', 'M'): 8025, ('D', 'T'): 7895, ('D', 'N'): 7825, ('K', 'S'): 7289, ('A', 'F'): 7278, ('L', 'S'): 7186, ('E', 'S'): 6765, ('C', 'F'): 6276, ('I', 'T'): 6225, ('A', 'Y'): 6200, ('D', 'S'): 6004, ('E', 'Q'): 5969, ('L', 'M'): 5960, ('K', 'Q'): 5934, ('D'
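Enumerating every pair of rows in each column costs on the order of n² operations per column. The same counts can be obtained from the per-column residue counts alone. The sketch below is an alternative, not the method used above: the function name `column_pair_counts` and the block `toy_block` are hypothetical, and pair keys are assumed to be alphabetically ordered tuples, as in `pairs_freq_dict`.

```python
from collections import Counter


def column_pair_counts(column):
    """Count unordered residue pairs in one alignment column.

    A residue seen c times contributes c*(c-1)/2 identical pairs;
    two different residues seen ca and cb times contribute ca*cb pairs.
    """
    counts = Counter(column)
    letters = sorted(counts)
    pairs = Counter()
    for x, a in enumerate(letters):
        ca = counts[a]
        if ca > 1:
            pairs[(a, a)] = ca * (ca - 1) // 2
        for b in letters[x + 1:]:
            pairs[(a, b)] = ca * counts[b]
    return pairs


# Toy block: 4 sequences of width 3 over the alphabet A, B, C.
toy_block = ["AAC", "ABC", "BBC", "ABA"]
toy_fij = Counter()
for column in zip(*toy_block):  # iterate over columns
    toy_fij += column_pair_counts(column)

print(toy_fij[('A', 'B')])    # 6
print(sum(toy_fij.values()))  # 18, i.e. w*n*(n-1)/2 with w=3, n=4
```

Checking that the counts sum to w·n·(n-1)/2 is a cheap sanity test for either implementation.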

In [60]:
#compute T
T = (w*n*(n-1))/2

In [61]:
#compute qij (observed pair probabilities) and pi (residue probabilities)

qij = fij.copy()
for k in qij:
    qij[k] /= T
    
print(qij)
    
pi = []
for i in aa:
    p = 0 
    val = 0
    for key in pairs_freq_dict:
        if( i in key ):
            if( key[0] == key[1]):
                p += qij[key]
            else:
                val += qij[key]
    p += val/2
    pi.append(p)
    

print(qij)
print(pi)


Counter({('G', 'G'): 0.07776071728775567, ('A', 'A'): 0.07080078453348276, ('I', 'V'): 0.03658840011207621, ('A', 'G'): 0.03120799476977678, ('I', 'I'): 0.029365461847389557, ('L', 'L'): 0.02521117026244513, ('A', 'V'): 0.024582049126739516, ('V', 'V'): 0.02356215559914075, ('K', 'K'): 0.02338208648547679, ('D', 'D'): 0.02289642290090595, ('K', 'R'): 0.020460633230596804, ('A', 'L'): 0.018325207807976092, ('G', 'L'): 0.0179038012515177, ('N', 'N'): 0.01650583730269917, ('A', 'S'): 0.014650602409638554, ('R', 'R'): 0.013667320444568974, ('E', 'E'): 0.013018025590735033, ('A', 'K'): 0.012798356215559914, ('F', 'L'): 0.012794620341832446, ('L', 'V'): 0.012604837956477072, ('A', 'I'): 0.01245764453161483, ('G', 'V'): 0.012057906042775754, ('I', 'L'): 0.011992901839917811, ('A', 'C'): 0.010845988605585132, ('E', 'K'): 0.010556084804333614, ('F', 'F'): 0.010460446436910433, ('A', 'E'): 0.009730456710563183, ('D', 'E'): 0.009522742131315962, ('T', 'T'): 0.008988512188288035, ('G', 'I'): 0.008

In [62]:
#compute eij (expected pair probabilities) and sij (log-odds scores)
eij =[]
for key in pairs_freq_dict:
    if( key[0] == key[1]):
        i = aa.index(key[0])
        eij.append(pi[i]**2)
    else:
        i = aa.index(key[0])
        j = aa.index(key[1])
        eij.append(2*(pi[i]*pi[j]))
        
print(eij)
        
sij = []
#eij was filled in the key order of pairs_freq_dict, so positions match
for k, key in enumerate(pairs_freq_dict):
    sij.append(np.log2(qij[key]/eij[k]))
    

print(sij)

[0.026591749053542457, 0.0060677122769064356, 0.01626146890210925, 0.016443500270416442, 0.012347794483504601, 0.044021252568956185, 0.002882163331530557, 0.025969808545159546, 0.022298842617631154, 0.027668767982693358, 0.006947530557057869, 0.012044408869659278, 0.00403502866414278, 0.005673310978907519, 0.013257951325040563, 0.013500659816116822, 0.010739850730124394, 0.028305877771768528, 0.00039440129799891835, 0.0008798182801514331, 0.0032158875067604107, 0.00034613304488912924, 0.0018552731206057328, 0.0018760411032990808, 0.0014087614926987565, 0.005022390481341264, 0.0003288263926446728, 0.0029628988642509463, 0.0025440778799351005, 0.00315673336938886, 0.0007926446727961059, 0.0013741481882098435, 0.00046035694970254196, 0.0006472687939426719, 0.0015126014061654948, 0.0015402920497566254, 0.0012253109789075176, 0.0032294213088155765, 4.4997295835586805e-05, 0.00010037858301784747, 0.000366901027582477, 0.002486065981611682, 0.005027790156841536, 0.003775480800432667, 0.013460
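The chain from pair counts to log-odds scores can be checked on numbers small enough to verify by hand. The sketch below is a minimal version under stated assumptions: the function name `log_odds` and the counts in `toy_counts` are hypothetical, pair keys are alphabetically ordered tuples, and a pair that was never observed is given a score of minus infinity instead of being passed to the logarithm.

```python
import math
from itertools import combinations_with_replacement


def log_odds(pair_counts, alphabet):
    """Return (p, s): residue probabilities and log2-odds scores.

    pair_counts maps alphabetically ordered pairs to counts; their sum
    plays the role of T when every pair of the block has been counted.
    """
    total = sum(pair_counts.values())
    q = {pair: pair_counts.get(pair, 0) / total
         for pair in combinations_with_replacement(sorted(alphabet), 2)}
    p = {}
    for a in alphabet:
        off_diag = sum(v for (x, y), v in q.items() if x != y and a in (x, y))
        p[a] = q[(a, a)] + off_diag / 2
    s = {}
    for (a, b), qab in q.items():
        expected = p[a] ** 2 if a == b else 2 * p[a] * p[b]
        # Guard: log2(0) is undefined, so unobserved pairs get -inf.
        s[(a, b)] = math.log2(qab / expected) if qab > 0 else float("-inf")
    return p, s


toy_counts = {('A', 'A'): 3, ('A', 'B'): 6, ('A', 'C'): 3,
              ('B', 'B'): 3, ('C', 'C'): 3}
p, s = log_odds(toy_counts, "ABC")
print(p)                # p['A'] = 5/12, p['B'] = 1/3, p['C'] = 1/4
print(s[('C', 'C')])    # log2((1/6) / (1/16)) = log2(8/3)
print(s[('B', 'C')])    # -inf: the pair was never observed
```

The residue probabilities sum to 1, which is a useful check before taking logarithms.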

In [63]:
#print Matrix
matrice = np.zeros((q,q))
for i in aa:
    for j in aa:
        key = (i,j)
        if( key not in  pairs_freq_dict ):
            key = (j,i)
        l = aa.index(i)
        c = aa.index(j)
        ij = list(pairs_freq_dict.keys()).index(key)
        matrice[l][c] = sij[ij]
        
print(matrice)


[[ 1.41278667  0.83793698 -1.38849887 -0.75693801 -1.18312182 -0.49628453
  -0.45552494 -1.05980407 -0.80101031 -0.59442885  0.11830048 -1.56187313
  -0.20378105 -0.65175435 -1.12992227  0.11793007 -0.78949468 -0.20349648
  -1.52044094 -1.09563566  0.52656887]
 [ 0.83793698  3.19961716 -3.20643591 -2.97654551  1.73493557 -1.28157712
  -0.53373718 -1.72806863 -2.90686103 -0.15671696 -0.20238993 -3.01825579
  -1.10218459 -1.38366411 -2.52386693 -0.48297137 -0.23233141 -0.28515874
  -1.9122488   1.69508151  2.29448914]
 [-1.38849887 -3.20643591  3.20318573  0.92145263 -2.43102183 -2.76836743
   0.102164   -1.96346856 -0.65132007 -1.46236048 -2.31842776  0.66683776
   0.26123138 -0.21443109 -1.20096041  0.12000465  0.84507592 -1.80715985
   0.12494982 -0.01224281 -0.57896433]
 [-0.75693801 -2.97654551  0.92145263  2.35645446 -1.16480171 -2.87100371
   1.37856027 -1.48994776  0.61456725 -0.48369656  0.48761635 -0.49928037
   0.74066504  1.34627714 -0.33412856  0.27611076  0.2318502  -1.2060