# From Statistical Physics to Data-Driven Modelling with Applications to Quantitative Biology

Tutorial 7: Learning a Perceptron for the prediction of binding between PDZ proteins and peptides.

The data are taken from:

Michael A. Stiffler, Jiunn R. Chen, Viara P. Grantcharova, Ying Lei, Daniel Fuchs, John E. Allen, Lioudmila A. Zaslavskaia, Gavin MacBeath, PDZ Domain Binding Selectivity Is Optimized Across the Mouse Proteome. Science 317, 364 (2007)


Starting Notebook SC RM FZ.


In [6]:
import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
%matplotlib inline
#plt.rcParams["font.family"] = "serif"
#plt.rcParams["mathtext.fontset"] = "dejavuserif"
#plt.rcParams["figure.figsize"] = (10, 8)
#plt.rcParams["font.size"] = 26

from scipy.sparse import coo_matrix
import numpy.matlib
import numpy.linalg as LA
from numpy.linalg import inv

Useful functions to read and process the data.

In [7]:
# Convert an amino-acid letter into an integer from 0 to 20 (0 is the gap symbol '-')
def letter2number(a):
    switcher = {
        '-': 0,
        'A': 1,
        'C': 2,
        'D': 3,
        'E': 4,
        'F': 5,
        'G': 6,
        'H': 7,
        'I': 8,
        'K': 9,
        'L': 10,
        'M': 11,
        'N': 12,
        'P': 13,
        'Q': 14,
        'R': 15,
        'S': 16,
        'T': 17,
        'V': 18,
        'W': 19,
        'Y': 20,
    }
    # Unknown symbols are mapped to 0, like the gap
    return switcher.get(a, 0)

# Convert a whole sequence (string) into a list of integers
def seq2number(a):
    return [letter2number(letter) for letter in a]
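
As a quick check of the letter-to-integer convention, the sketch below re-implements the same mapping in compact form and applies it to the Cftr peptide sequence that appears later in this notebook. The names `ALPHABET`, `AA_TO_INT` and `encode_sequence` are illustrative and not part of the original notebook.

```python
# Compact, illustrative equivalent of letter2number/seq2number.
# The position of a letter in ALPHABET is its integer code ('-' -> 0, ..., 'Y' -> 20).
ALPHABET = "-ACDEFGHIKLMNPQRSTVWY"
AA_TO_INT = {aa: k for k, aa in enumerate(ALPHABET)}

def encode_sequence(seq):
    """Map each letter to its integer code; unknown symbols map to 0, like the gap."""
    return [AA_TO_INT.get(aa, 0) for aa in seq]

codes = encode_sequence("TEEEVQETRL")
print(codes)
```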

Read the PDZ-peptide interaction matrix.

In [8]:
int_matrix = pd.read_excel('./Data/fp_interaction_matrix.xlsx', index_col=0)
PDZ = np.array(int_matrix.index)
NPDZ=len(PDZ)
print(NPDZ)

74


In [9]:
int_matrix

Unnamed: 0,AcvR1,AcvR2,AcvR2b,AN2,APC,Aquaporin4,ASIC2,AXL,Cacna1a,Caspr2,...,TRPV3,TRPV4,TRPV6,TYRO3,VEGFR2,VEGFR3,Unnamed: 218,Unnamed: 219,Unnamed: 220,Unnamed: 221
Cipp (03/10),0,0.0,-1.0,-1.0,-1,-1.0,-1,-1,-1.0,4162.31648,...,-1.0,-1,-1,-1,-1,-1,,,,
Cipp (05/10),0,0.0,0.0,0.0,0,0.0,0,0,0.0,0.00000,...,0.0,0,-1,0,0,0,,,,
Cipp (08/10),0,0.0,0.0,0.0,-1,-1.0,0,0,0.0,0.00000,...,0.0,0,0,0,0,0,,,,
Cipp (09/10),0,0.0,0.0,-1.0,0,-1.0,0,0,0.0,0.00000,...,0.0,0,0,-1,0,0,,,,
Cipp (10/10),0,0.0,0.0,0.0,-1,0.0,0,-1,0.0,0.00000,...,0.0,0,0,0,0,0,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Whirlin (3/3),0,0.0,0.0,0.0,0,0.0,0,0,0.0,0.00000,...,0.0,0,0,0,0,0,,,,
ZO-1 (1/3),-1,0.0,0.0,0.0,0,0.0,0,0,0.0,0.00000,...,0.0,0,0,0,0,0,,,,
ZO-1 (2/3),0,0.0,0.0,-1.0,0,0.0,0,0,0.0,0.00000,...,0.0,0,0,0,0,0,,,,
ZO-2 (1/3),0,0.0,0.0,-1.0,-1,0.0,0,-1,0.0,0.00000,...,0.0,0,0,0,0,0,,,,


Read the sequences of the Peptides

In [10]:
# ATTENTION: the peptides in peptides.free are not in the same order as in the interaction matrix
pep = []
with open('./Data/peptides.free') as f:
    for line in f:
        x = line.split()
        pep.append(x)
Npep=len(pep)
print(Npep)
# Print any peptide whose name is missing from the columns of the interaction matrix
# (nothing is printed if all peptides are found)
for i in range(Npep):
    if pep[i][0] not in int_matrix.columns:
        print(pep[i])

217


In [11]:
# Check that the binding is read correctly.
# (DataFrame.get_value was removed in pandas 1.0; .at is its replacement.)
i=11
j=6
print('PDZ: ',PDZ[i],'\nPeptide: ',pep[j][0],'with sequence',pep[j][1],
      '\nInteraction:',int_matrix.at[PDZ[i],pep[j][0]])

PDZ:  Gm1582 (2/3) 
Peptide:  Cftr with sequence TEEEVQETRL 
Interaction: 47845.88612


In [12]:
# Construct the (Npep, Nbase) integer matrix of peptide sequences
pep_seq=np.asarray([seq2number(p[1]) for p in pep])
print(np.shape(pep_seq))
Nbase=np.shape(pep_seq)[1]   # peptide length (number of sites)

(217, 10)


In [13]:
# Expand the matrix of peptide sequences into a binary (Npep, 19*10+1) array X by one-hot encoding.
# Gauge: the last amino-acid symbol (code q=20, 'Y') is dropped, i.e. encoded as all zeros at its site.
# A last column of ones is appended to provide a constant term in dot(X,J).
q=20
#X=-np.ones((Npep,Nbase*(q-1)+1))    ### USE {-1,1} CONVENTION FOR INPUT
X=np.zeros((Npep,Nbase*(q-1)+1))     ### USE {0,1} CONVENTION FOR INPUT
for m in range(Npep):
    X[m,Nbase*(q-1)]=1
    for i in range(Nbase):
        # Skip the gauge symbol q. Code 0 (gap/unknown) is skipped as well:
        # it would otherwise write into column i*(q-1)-1, which belongs to another site.
        if (0 < pep_seq[m,i] < q):
            X[m,i*(q-1)+pep_seq[m,i]-1]=1
print(np.shape(X))

(217, 191)
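
The encoding above can be checked on a small example. The following is a vectorised sketch of the same one-hot scheme, applied to two made-up sequences of length 3; the function name `one_hot_with_gauge` and the toy array are assumptions introduced for illustration only.

```python
import numpy as np

def one_hot_with_gauge(seqs, q=20):
    """One-hot encode integer sequences (codes 1..q) with the last symbol dropped.

    seqs: (M, L) integer array. Returns an (M, L*(q-1)+1) array whose last
    column is a constant 1 (bias input). Codes equal to q (gauge) or 0
    (gap/unknown) leave their site all-zero.
    """
    seqs = np.asarray(seqs)
    M, L = seqs.shape
    X = np.zeros((M, L * (q - 1) + 1))
    X[:, -1] = 1.0
    rows, sites = np.nonzero((seqs >= 1) & (seqs < q))
    X[rows, sites * (q - 1) + seqs[rows, sites] - 1] = 1.0
    return X

# Toy input (hypothetical): two sequences of three sites each.
toy = np.array([[17, 4, 20],
                [1, 19, 10]])
X_toy = one_hot_with_gauge(toy)
print(X_toy.shape)
print(X_toy.sum(axis=1))
```

Each row sums to the number of non-gauge sites plus one for the constant column, which is a convenient sanity check on the real `X` as well.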


In [14]:
# For a given PDZ, build the vector Y of labels over all peptides: +1 (binding) or -1 (non-binding)
def getY(iPDZ):
    Y = -np.ones(Npep)
    for j in range(Npep):
        Kd=int_matrix.at[PDZ[iPDZ],pep[j][0]]
        # Binding: a positive measured value below the 100000 cutoff;
        # entries equal to 0 or -1, and larger values, count as non-binding
        if (Kd>0 and Kd<100000):
            Y[j]=1
    return Y

In [15]:
# Count how many peptides bind to each PDZ
binding=[]
for j in range(NPDZ):
    Y=getY(j)
    binding.append(int(np.sum(Y>0)))   # plain int, so the list prints without NumPy scalar wrappers
print(binding)

[2, 5, 10, 8, 3, 2, 1, 3, 2, 4, 2, 16, 5, 1, 6, 2, 41, 29, 5, 3, 3, 19, 4, 3, 2, 1, 12, 2, 13, 1, 25, 2, 1, 4, 8, 2, 1, 5, 3, 4, 3, 1, 3, 2, 1, 11, 7, 20, 2, 1, 14, 11, 20, 13, 13, 9, 5, 3, 21, 7, 17, 18, 3, 6, 15, 4, 1, 1, 6, 1, 20, 1, 13, 1]


Question 1: Implement the perceptron learning algorithm, using a large positive value for the classification condition:

c = 1000.
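
One possible starting point for Question 1 is sketched below. It assumes the classification condition means that every pattern must satisfy `Y[mu] * dot(X[mu], J) >= c`, and that at each step the least stable pattern is used for the update `J += Y[mu] * X[mu]`; other update orders are equally valid. The function name `train_perceptron` and the toy data (the logical OR with a constant input) are illustrative assumptions, not the tutorial's reference solution.

```python
import numpy as np

def train_perceptron(X, Y, c=1000.0, max_iter=100000):
    """Perceptron rule with threshold c (a sketch under the assumptions above).

    Repeats J += Y[mu] * X[mu] on the least stable pattern mu until all
    patterns satisfy Y[mu] * (X[mu] @ J) >= c, or max_iter is reached.
    Returns the weights and the number of updates performed.
    """
    J = np.zeros(X.shape[1])
    for t in range(max_iter):
        fields = Y * (X @ J)
        mu = np.argmin(fields)          # least stable pattern
        if fields[mu] >= c:
            return J, t                 # condition met for every pattern
        J += Y[mu] * X[mu]
    return J, max_iter

# Toy, linearly separable data (hypothetical): OR of two bits, last column is the constant input.
X_toy = np.array([[1., 0., 1.],
                  [0., 1., 1.],
                  [1., 1., 1.],
                  [0., 0., 1.]])
Y_toy = np.array([1., 1., 1., -1.])

J_toy, n_updates = train_perceptron(X_toy, Y_toy, c=1000.0)
print(n_updates, J_toy)
```

The loop terminates only if the data are linearly separable, hence the `max_iter` guard.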

Question 2: Choose PDZ 11 and split the data into a training set (containing 150 sequences) and a test set (containing 67 sequences).
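
A random split can be sketched as follows. The helper `split_train_test`, the fixed seed and the placeholder arrays are assumptions; in the notebook one would pass the real `X` and `getY(11)` instead.

```python
import numpy as np

def split_train_test(X, Y, n_train, seed=0):
    """Shuffle the examples and return (X_train, Y_train, X_test, Y_test)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(Y))
    train, test = perm[:n_train], perm[n_train:]
    return X[train], Y[train], X[test], Y[test]

# Placeholder data with the shapes used in this tutorial (not the real data).
X_demo = np.zeros((217, 191))
Y_demo = np.arange(217, dtype=float)

X_tr, Y_tr, X_te, Y_te = split_train_test(X_demo, Y_demo, n_train=150)
print(X_tr.shape, X_te.shape)
```

Because binders are rare for most PDZ domains, a purely random split can leave very few binding peptides in the test set; checking the number of positive labels in each part is a sensible precaution.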

Question 2: Plot the normalised stability parameter as a function of the iteration parameter.
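
The sketch below records the stability after every update, assuming the normalised stability is defined as the smallest value of `Y[mu] * dot(X[mu], J)` over the training set divided by the norm of `J`. The function names and the toy data are illustrative assumptions.

```python
import numpy as np

def normalised_stability(J, X, Y):
    """min_mu Y[mu] * (X[mu] @ J) / |J| (assumed definition); 0 if J is the zero vector."""
    norm = np.linalg.norm(J)
    if norm == 0.0:
        return 0.0
    return float(np.min(Y * (X @ J)) / norm)

def stability_history(X, Y, c=1000.0, max_iter=100000):
    """Run the perceptron updates and record the normalised stability after each one."""
    J = np.zeros(X.shape[1])
    history = []
    for _ in range(max_iter):
        fields = Y * (X @ J)
        mu = np.argmin(fields)
        if fields[mu] >= c:
            break
        J += Y[mu] * X[mu]
        history.append(normalised_stability(J, X, Y))
    return J, history

# Toy separable data (hypothetical): OR of two bits plus a constant input.
X_toy = np.array([[1., 0., 1.], [0., 1., 1.], [1., 1., 1.], [0., 0., 1.]])
Y_toy = np.array([1., 1., 1., -1.])

J_toy, kappa = stability_history(X_toy, Y_toy, c=1000.0)
print(len(kappa), kappa[-1])
```

Passing `kappa` to `plt.plot` gives the requested curve. A negative value means that at least one training pattern is still misclassified at that iteration.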

Question 3: Verify that for the test set the network reproduces the input-output association. Quantify the test error.
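
The test error can be quantified as the fraction of test peptides whose predicted label differs from the measured one. In the sketch below the prediction is taken to be the sign of `dot(X, J)`, with ties assigned to the non-binding class; the helper names and the toy weights are assumptions.

```python
import numpy as np

def predict(J, X):
    """Predicted labels in {-1, +1}; a field of exactly 0 is assigned to -1."""
    return np.where(X @ J > 0, 1.0, -1.0)

def error_rate(J, X, Y):
    """Fraction of examples whose predicted label differs from Y."""
    return float(np.mean(predict(J, X) != Y))

# Toy check (hypothetical weights and data): OR of two bits plus a constant input.
X_toy = np.array([[1., 0., 1.], [0., 1., 1.], [1., 1., 1.], [0., 0., 1.]])
J_toy = np.array([2., 2., -1.])

err_ok = error_rate(J_toy, X_toy, np.array([1., 1., 1., -1.]))
err_one_wrong = error_rate(J_toy, X_toy, np.array([1., -1., 1., -1.]))
print(err_ok, err_one_wrong)
```

The labels are strongly imbalanced: the PDZ at index 11 binds 16 of the 217 peptides, so always predicting "non-binding" already gives an overall error of about 7%. Reporting the errors on binders and non-binders separately is therefore more informative than a single number.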

Question 4: Test the perceptron performances for all the PDZs.

Question 5 (Bonus 1): Use the Lasso algorithm to fit the training set. Compare the quality of the predictions on the test set with the perceptron result.
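
For the Lasso bonus question, a library solver is the natural choice. As a dependency-free illustration, the sketch below minimises the Lasso objective `|X w - y|^2 / (2 M) + alpha * |w|_1` with a plain proximal-gradient (ISTA) iteration, which is a swapped-in technique and not necessarily what the tutorial intends. The function names, the value of `alpha` and the synthetic data are assumptions. Note that this version also penalises the weight of the constant column.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * |.|_1: shrink each entry towards zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, alpha=0.01, n_iter=5000):
    """Minimise |X w - y|^2 / (2 M) + alpha * |w|_1 by proximal gradient descent."""
    M = X.shape[0]
    step = M / np.linalg.norm(X, 2) ** 2    # inverse Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / M
        w = soft_threshold(w - step * grad, step * alpha)
    return w

# Synthetic classification data (hypothetical): labels depend on two of ten features.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(50, 10))
w_true = np.zeros(10)
w_true[:2] = [2.0, -3.0]
Y_syn = np.sign(X_syn @ w_true)

w_fit = lasso_ista(X_syn, Y_syn, alpha=0.01)
accuracy = float(np.mean(np.sign(X_syn @ w_fit) == Y_syn))
print(accuracy, np.round(w_fit, 2))
```

Labels are predicted from the sign of `dot(X, w)`, so the test error can be compared directly with the perceptron's. Larger `alpha` values drive more weights to exactly zero.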

Question 5 (Bonus 2): Use the Keras package to train a perceptron with a sigmoid activation function and a binary cross-entropy loss (with the binding labels in a {0, 1} representation).
