# From Statistical Physics to Data-Driven Modelling with Applications to Quantitative Biology
Tutorial 5: Analysis of Protein Sequence Data to Infer Protein Structure.

This Tutorial is based on the work:

Faruck Morcos, Andrea Pagnani, Bryan Lunt, Arianna Bertolino, Debora S Marks, Chris Sander, Riccardo Zecchina, Jose Onuchic, Terence Hwa, and Martin Weigt. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proceedings of the National Academy of Sciences, 108(49):E1293–E1301, 2011.

For a review of Direct Coupling Analysis see:
Simona Cocco, Christoph Feinauer, Matteo Figliuzzi, Remi Monasson, and Martin Weigt. Inverse statistical physics of protein sequences: a key issues review. Reports on Progress in Physics, 81(3):032601, 2018.

This tutorial is long but can be split into two parts: Question 9 can be treated in a later tutorial.

Starting Notebook SC RM FZ.


In [1]:
import numpy as np
import math
import matplotlib.pyplot as plt
from scipy.sparse import coo_matrix
import numpy.matlib
#%matplotlib inline
#plt.rcParams["font.family"] = "serif"
#plt.rcParams["mathtext.fontset"] = "dejavuserif"
#plt.rcParams["figure.figsize"] = (10, 8)
#plt.rcParams["font.size"] = 26

import numpy.linalg as LA
from numpy.linalg import inv

Useful functions for the tutorial.

In [3]:
# Function to convert the amino acids letters into integer numbers from 0 to 20
def letter2number(a): 
    
    switcher = {
        '-': 0,
        'A': 1,
        'C': 2,
        'D':3,
        'E':4,
        'F':5,
        'G':6,
        'H':7,
        'I':8,
        'K':9,
        'L':10,
        'M':11,
        'N':12,
        'P':13,
        'Q':14,
        'R':15,
        'S':16,
        'T':17,
        'V':18,
        'W':19,
        'Y':20,     
    }
    #return switcher.get(a, "nothing")
    return switcher.get(a,0)

In [4]:
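# Function to transform the couplings to the zero-sum gauge and compute the Frobenius norm of each (i,j) block.
# It relies on the globals L (sequence length), Np (number of pairs i<j) and q (number of states without the gauge symbol).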
def Frobenius(J):
    F=np.zeros((L,L))
    Fv=np.zeros(Np)
    ind=np.zeros((Np,2)).astype(int)
    indv=np.zeros((L,L)).astype(int)
    l=0
    for i in range (L):
        for j in range(i+1,L):
            #matrix  21x21 including also the gauge symbol
            jinf2=np.zeros((q+1,q+1))
            jinf2[0:q,0:q]=J[i*q:(i+1)*q,j*q:(j+1)*q]
            #zero-sum matrix
            #J_norm=np.transpose(np.transpose(jinf2 - np.mean(jinf2,0))
            #                     - np.mean(jinf2,1)) + np.mean(jinf2)
            J_norm=jinf2 - np.mean(jinf2,0)
            J_norm=J_norm - np.mean(J_norm,1,keepdims=True)
            #Frobenius norm
            F[i,j] = LA.norm(J_norm)
            F[j,i]=F[i,j]
            ind[l,0]=i
            ind[l,1]=j
            Fv[l]=F[i,j]
            indv[i,j]=l
            l=l+1        
    return [Fv,indv,F,ind]

In [5]:
#Function to compute the Positive Predictive Value (PPV) of predicted contacts when ranking residue pairs
#in decreasing order of their Frobenius norms and comparing with residue distances in the real structure 5PTI
#(downloaded from the PDB database)

def ppv(Finput,indv,ind):
    #Sort Finput in reverse order 
    Fsort=np.sort(Finput)[::-1]
    Fsort_index=np.argsort(Finput)[::-1]

    #Read structural Distances
    backmap_distances=np.loadtxt('Data/backmap_distances_e_PF00014.txt')
    distv=np.ones(Np)*50  # default distance assigned to pairs that are missing from the file
    lb=np.size(backmap_distances,0)
    dist=np.zeros((L,L))
    for l in range (lb):
        i=int(backmap_distances[l,0]-1)
        j=int(backmap_distances[l,1]-1)
        dist[i,j]=backmap_distances[l,4]
        leff=indv[i,j]
        distv[leff]=backmap_distances[l,4]

    #Positive Predictive Value for all pairs
    sux=np.zeros(Np+1)
    suxn=np.zeros(Np+1)
    #Two residues are defined as in contact when their distance is < 8 Angstrom in the crystallographic structure
    dc2=8
    sux[0]=0
    for i in range (Np):
        if distv[Fsort_index[i]]<dc2:
            sux[i+1]=sux[i]+1
        else:
            sux[i+1]=sux[i]
        suxn[i+1]=sux[i+1]/(i+1)

    #Positive Predictive Value for pairs distant along the backbone (j-i>4)
    sud=np.zeros(Np+1)
    sudn=np.zeros(Np+1)
    dc2=8
    sud[0]=0
    ni=0
    for i in range (Np):
        if ind[Fsort_index[i],1]-ind[Fsort_index[i],0] >4:
            if distv[Fsort_index[i]]<dc2:
                sud[ni+1]=sud[ni]+1
            else:
                sud[ni+1]=sud[ni]
            sudn[ni+1]=sud[ni+1]/(ni+1)
            ni=ni+1
    return [suxn,sudn]
    

In [6]:
#Function to apply the Average Product Correction to the Frobenius norm of the couplings
def Apc(F):
    #Average Product Correction which improves contact Predictions
    Fapc=np.zeros(Np)
    avcoupl1=np.sum(F,1)/L
    sumj=np.sum(avcoupl1)/L
    l=0
    for i in range (L):
        for j in range (i+1,L):
            Fapc[l]=F[i,j]-avcoupl1[i]*avcoupl1[j]/sumj
            l=l+1         
    return Fapc           

Question 1: Read the data: PF00014 sequences from PFAM (2013).

In [6]:
# Open the file and store all of its lines in the list "seqs" (one aligned sequence per line)
with open('Data/seqPF14.txt', 'r') as data:
    # readlines returns every line of the file, each one ending with a newline character
    seqs = data.readlines()

In [7]:
#Show data
#seqs[0][:]

Question 2: Convert the MSA alphabet from letters to numbers 0...20.

In [8]:
#Extract M and L and convert the MSA into a numerical matrix
M=len(seqs)
L=len(seqs[0])-1   # each line ends with a newline character, which is not a residue
Np=L*(L-1)//2      # number of residue pairs i<j
print( 'Number of Sequences in the MSA M:',M,'Number of Residues in the sequence L:', L)
align=np.zeros((M,L)).astype(int)
for m in range (M):
    for i in range (L):
        align[m,i]=letter2number(seqs[m][i])  

Number of Sequences in the MSA M: 2143 Number of Residues in the sequence L: 53


Question 3: One-hot encoding of the alignment in a binary (M,20xL) array.
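One possible way to build the encoding is sketched below. It assumes that the gap symbol (code 0 in `letter2number`) is the reference state and is encoded as an all-zero block, so that each site uses q=20 columns; the function name `one_hot_encode` and the toy alignment are illustrative and not part of the original notebook.

```python
import numpy as np

def one_hot_encode(align, q=20):
    """Return a binary (M, q*L) array; gaps (code 0) map to an all-zero block."""
    align = np.asarray(align)
    M, L = align.shape
    X = np.zeros((M, q * L), dtype=int)
    # positions (sequence, site) that hold a real amino acid (code 1..q)
    rows, sites = np.nonzero(align > 0)
    # amino acid a at site i switches on column i*q + (a-1)
    X[rows, sites * q + align[rows, sites] - 1] = 1
    return X

# Toy alignment (assumption for illustration): 2 sequences, 3 sites, q = 3 symbols plus the gap
toy_align = np.array([[1, 0, 3],
                      [2, 2, 0]])
X_toy = one_hot_encode(toy_align, q=3)
print(X_toy)
```

On the real data the call would be `one_hot_encode(align, q=20)`, which gives an array of shape (M, 20*L).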

Question 4: Extract from the data the frequencies and connected correlations. 
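A minimal sketch of this step, assuming a one-hot array `X` of shape (M, q*L) is available and that all sequences carry the same weight (no sequence reweighting is applied here). The function name and the toy array are hypothetical.

```python
import numpy as np

def frequencies_and_correlations(X):
    """Single-site frequencies, pair frequencies and connected correlations of a one-hot array."""
    X = np.asarray(X, dtype=float)
    M = X.shape[0]
    f1 = X.mean(axis=0)            # f_i(a): frequency of amino acid a at site i
    f2 = X.T @ X / M               # f_ij(a,b): joint frequency of (a at i, b at j)
    C = f2 - np.outer(f1, f1)      # connected correlation c_ij(a,b) = f_ij(a,b) - f_i(a) f_j(b)
    return f1, f2, C

# Toy binary data (assumption for illustration): 4 samples, 2 columns
X_toy = np.array([[1, 0],
                  [1, 1],
                  [0, 1],
                  [1, 1]])
f1_toy, f2_toy, C_toy = frequencies_and_correlations(X_toy)
print(f1_toy)
print(C_toy)
```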

Question 5: Add a pseudocount to regularize the data, and invert the covariance matrix to obtain the coupling matrix.
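The sketch below shows one common way to do this, under stated assumptions: the empirical frequencies are mixed with a uniform distribution over the q+1 symbols with weight `alpha`, and the couplings are taken as minus the inverse of the regularized covariance matrix (Gaussian or mean-field approximation). The function name, the default `alpha=0.5` and the toy data are illustrative choices, not values prescribed by the notebook.

```python
import numpy as np

def gaussian_couplings(X, L, q, alpha=0.5):
    """Pseudocount-regularized covariance of a one-hot array X (M, q*L) and couplings J = -C^{-1}."""
    X = np.asarray(X, dtype=float)
    M = X.shape[0]
    f1 = X.mean(axis=0)
    f2 = X.T @ X / M
    # Mix the data with a uniform distribution over the q+1 symbols (gap included)
    f1_pc = (1 - alpha) * f1 + alpha / (q + 1)
    f2_pc = (1 - alpha) * f2 + alpha / (q + 1) ** 2
    # Same-site blocks: two different symbols never co-occur, and f_ii(a,a) = f_i(a)
    for i in range(L):
        block = slice(i * q, (i + 1) * q)
        f2_pc[block, block] = np.diag(f1_pc[block])
    C = f2_pc - np.outer(f1_pc, f1_pc)
    J = -np.linalg.inv(C)
    return J, C

# Toy data (assumption for illustration): 5 random sequences, 3 sites, q = 2 symbols plus the gap
rng = np.random.default_rng(0)
L_toy, q_toy = 3, 2
align_toy = rng.integers(0, q_toy + 1, size=(5, L_toy))
X_toy = np.zeros((5, L_toy * q_toy))
for m in range(5):
    for i in range(L_toy):
        if align_toy[m, i] > 0:
            X_toy[m, i * q_toy + align_toy[m, i] - 1] = 1
J_toy, C_toy = gaussian_couplings(X_toy, L_toy, q_toy, alpha=0.5)
print(J_toy.shape)
```

Without the pseudocount the empirical covariance matrix is typically not invertible, because many amino acids are never observed at a given site; mixing in the uniform distribution makes the matrix positive definite.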

Question 6: Compute the Frobenius norm to estimate the coupling strength for each pair of residues.

Question 7: Plot the positive predictive value (PPV) of the contact predictions for all pairs of residues and for pairs distant along the sequence (|i-j|>4).

Question 8: Apply the Average Product Correction to the Frobenius norm of the couplings and plot the PPV for nearby and distant pairs.

Question 10 (Bonus): Rank the contacts directly from the Mutual Information (with the Average Product Correction) between the amino acids occurring at sites i and j, computed directly from the MSA, and plot the PPV for contact prediction.
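A possible sketch of the mutual information computation is given below. It assumes an integer alignment with codes 0..q-1 (q=21 with the gap), equal sequence weights and no pseudocount; terms with zero joint frequency are skipped. The correction follows the same convention as the `Apc` function above, applied to the full matrix. All names and the toy alignment are illustrative.

```python
import numpy as np

def mutual_information(align, q=21):
    """Mutual information between every pair of columns of an integer alignment (codes 0..q-1)."""
    align = np.asarray(align)
    M, L = align.shape
    onehot = np.zeros((M, L, q))
    onehot[np.arange(M)[:, None], np.arange(L)[None, :], align] = 1.0
    f1 = onehot.mean(axis=0)                       # (L, q) single-site frequencies
    MI = np.zeros((L, L))
    for i in range(L):
        for j in range(i + 1, L):
            fij = onehot[:, i, :].T @ onehot[:, j, :] / M   # (q, q) joint frequencies
            indep = np.outer(f1[i], f1[j])
            mask = fij > 0                         # 0*log(0) terms contribute nothing
            MI[i, j] = MI[j, i] = np.sum(fij[mask] * np.log(fij[mask] / indep[mask]))
    return MI

def apc_matrix(S):
    """Average Product Correction: S_ij - (sum_k S_ik)(sum_k S_kj) / sum_kl S_kl."""
    row_sum = S.sum(axis=1)
    corrected = S - np.outer(row_sum, row_sum) / S.sum()
    np.fill_diagonal(corrected, 0.0)
    return corrected

# Toy alignment (assumption for illustration): columns 0 and 1 are identical, column 2 is independent of them
toy_align = np.array([[0, 0, 1],
                      [1, 1, 1],
                      [0, 0, 0],
                      [1, 1, 0]])
MI_toy = mutual_information(toy_align, q=2)
MI_apc_toy = apc_matrix(MI_toy)
print(MI_toy)
```

The upper triangle of the corrected matrix can then be flattened in the same pair order as `Fv` and passed to `ppv`.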

# Part 2. Question 9A: Implement the Graphical Lasso algorithm and use it to infer the couplings.

Fix the regularization strength gamma to the expected optimal value and infer couplings.
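As a starting point, here is a minimal sketch of a block coordinate descent Graphical Lasso in the style of Friedman, Hastie and Tibshirani. It is an illustration under stated assumptions rather than the reference solution: the diagonal is penalized as well (W is initialized to S + gamma*I), the inner lasso problems are solved by plain coordinate descent, and `gamma` is left as a free parameter. The function names and the toy matrix are hypothetical.

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-thresholding operator sign(x) * max(|x| - t, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def graphical_lasso(S, gamma, n_sweeps=100, n_inner=100, tol=1e-6):
    """Return (W, Theta): estimated covariance and sparse precision matrix for empirical covariance S."""
    S = np.asarray(S, dtype=float)
    p = S.shape[0]
    W = S + gamma * np.eye(p)
    B = np.zeros((p, p))           # B[rest, j] holds the lasso coefficients of column j
    for sweep in range(n_sweeps):
        W_old = W.copy()
        for j in range(p):
            rest = np.arange(p) != j
            W11 = W[np.ix_(rest, rest)]
            s12 = S[rest, j]
            beta = B[rest, j].copy()
            # Lasso problem: min 1/2 beta^T W11 beta - beta^T s12 + gamma * |beta|_1
            for it in range(n_inner):
                beta_old = beta.copy()
                for k in range(p - 1):
                    r = s12[k] - W11[k] @ beta + W11[k, k] * beta[k]
                    beta[k] = soft_threshold(r, gamma) / W11[k, k]
                if np.max(np.abs(beta - beta_old)) < tol:
                    break
            B[rest, j] = beta
            w12 = W11 @ beta
            W[rest, j] = w12
            W[j, rest] = w12
        if np.max(np.abs(W - W_old)) < tol:
            break
    # Recover the precision matrix column by column from the block-inverse formulas
    Theta = np.zeros((p, p))
    for j in range(p):
        rest = np.arange(p) != j
        theta_jj = 1.0 / (W[j, j] - W[rest, j] @ B[rest, j])
        Theta[j, j] = theta_jj
        Theta[rest, j] = -B[rest, j] * theta_jj
    return W, 0.5 * (Theta + Theta.T)

# Toy covariance (assumption for illustration)
S_toy = np.array([[2.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.5]])
W_dense, Theta_dense = graphical_lasso(S_toy, gamma=0.0)    # no penalty: Theta is the inverse of S
W_sparse, Theta_sparse = graphical_lasso(S_toy, gamma=1.0)  # strong penalty: Theta becomes diagonal
print(Theta_sparse)
```

With this convention the couplings would be read off as `J = -Theta`. The pure-Python loops are slow for a (20*L)x(20*L) matrix, so a vectorized inner loop or an existing library routine may be preferable on the full data set.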

Question 9B: Compare the Frobenius norm of the Gaussian couplings with that of the GLasso couplings.

Question 9C: Plot the positive predictive value (PPV) with GLasso for the contact predictions for all pairs of residues and for pairs distant along the sequence (|i-j|>4).

Question 9D: Apply the Average Product Correction to the Frobenius norm of the GLasso couplings and plot the PPV for nearby and distant pairs.