# Computational biology and bioinformatics - <span style="color:#1CA766">INFO-F-439</span>
# Assignment 1: <span style="color:#1CA766">GOR III secondary structure prediction</span>

> ## <span style="color:#2E66A7"> Alberto Parravicini</span>

*****

# <span style="color:#2E66A7">Part 1:</span> Introduction

Proteins are for the most part defined by the amino acid sequences that compose them. However, due to the nature of the amino acids, proteins also assume more complex structures that characterize them.

Hydrogen bonds between the amino acids in the sequence cause proteins to assume three-dimensional structures of various kinds. Being able to recognize these structures is useful to compare proteins, and to gain better insight into their properties.

Correctly classifying which portion of a protein assumes a given secondary structure isn't an easy task, but there are links between specific patterns of amino acids and given secondary structures.
As an example, in $\alpha$-helices (one type of secondary structure) every third amino acid along the sequence will tend to be hydrophobic. Also, regions rich in *alanine (A)* and *leucine (L)* will tend to form helices. (see *Gollery, Martin. "Bioinformatics: Sequence and Genome Analysis", David W. Mount. pp. 387-388*)

***

### <span style="color:#1CA766">GOR I & II</span>:

One idea that can be exploited to predict secondary structures from amino acid sequences is that the secondary structure at a given position is influenced not only by the amino acid at that position, but also by the amino acids close to it.

Given a dataset of proteins in which the secondary structure of each amino acid is known, this idea can be used to build a model that predicts secondary structures.

This is the core idea behind the **GOR I** and **GOR II** algorithms.

For a given position $j$, the predicted secondary structure $S_j$ is the one with the highest *information* value, as a function of the amino acid sequence.
Assuming that our secondary structures are in the set $S = \{\alpha$-helices (**H**), $\beta$-sheets (**B**), and coils (**C**)$\}$, the one that will be chosen is $$S_j = \arg\max_{S}{I(\Delta S_j;\ a_1, \ldots, a_n)}$$ where $a_1, \ldots, a_n$ is the amino acid sequence, and $S_j$ is picked among the secondary structures in $S$.

**GOR** approximates the previous computation by considering a window of size $17$ centered on position $j$, which gives the formula $$S_j = \arg\max_{S}{I(\Delta S_j;\ a_1, \ldots, a_n)} \approx \arg\max_{S}{\sum_{m=-8}^{8}{I(\Delta S_j;\ a_{j+m})}}$$

In this formulation, $I(\Delta S_j;\ a_{j+m})$ is a measure of *self-information*, and is given by
$$I(\Delta S_j;\ a_{j+m}) = \log\left(\frac{c_{S_j,\ a_{j+m}}}{c_{\neg S_j,\ a_{j+m}}}\right) + \log\left(\frac{c_{\neg S_j}}{c_{S_j}}\right)$$

where $c$ is the number of occurrences (in a given dataset) of what is specified in the subscript; for instance, $c_{\neg S_j}$ is the number of amino acids with a secondary structure different from $S_j$.

In Python, we can compute $c_{S_j,\ a_{j+m}}$ and $I(\Delta S_j;\ a_{j+m})$ in the following way:

In [4]:
def count_residues_gor_1(s_list, a_list, s_j, a_jm, lag):
    """
    Count the number of times that structure "s_j" appears,
    and residue "a_jm" is at distance "lag" from them.
    :param s_list: a list of secondary structures; pandas Series.
    :param a_list: a list of aminoacids; pandas Series.
    :param s_j: a secondary structure name.
    :param a_jm: an aminoacid name.
    :param lag: distance from a_j at which a_jm is searched.
    :return: int
    """
    n = len(s_list)
    count = 0
    for i in range(max(0, -lag), min(n, n - lag)):
        if s_list.iloc[i] == s_j and a_list.iloc[i + lag] == a_jm:
            count += 1
    return count

In [5]:
def info_value_gor_1(s_list, a_list, s_j, a_jm, lag, a_occ):
    """
    Compute the information value associated to a secondary structure s_j, in position j,
    and to an aminoacid a_jm, at distance m from s_j.
    :param s_list: a list of secondary structures; pandas Series.
    :param a_list: a list of aminoacids; pandas Series.
    :param s_j: a secondary structure name.
    :param a_jm: an aminoacid name.
    :param lag: distance from a_j at which a_jm is searched.
    :param a_occ: number of occurrencies of each aminoacid; pandas Series.
    :return: double
    """
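    # Times s_j appears while a_jm is at distance lag.
    num_1 = count_residues_gor_1(s_list, a_list, s_j, a_jm, lag)
    # Remaining occurrences of a_jm, i.e. those not counted in num_1.
    den_1 = a_occ[a_jm] - num_1
    # Occurrences of s_j. Note: s_occ is not a parameter, it is the global
    # Series of secondary structure counts computed in Part 2.
    den_2 = s_occ[s_j]
    # Residues with a secondary structure different from s_j.
    num_2 = len(s_list) - den_2
    return np.log(num_1 * num_2) - np.log(den_1 * den_2)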
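
The window sum and the $\arg\max$ described above can be put together in a few lines. The following is a minimal sketch, not the implementation used in this report: it assumes that the information values have already been stored in a dictionary `info` keyed by `(structure, aminoacid, lag)`. Both the function name `predict_gor_1` and that dictionary layout are hypothetical, introduced here only for illustration.

```python
def predict_gor_1(sequence, info, structures=("h", "b", "c"), window=8):
    """
    Predict one secondary structure per residue (illustrative sketch).
    :param sequence: string of one-letter aminoacid codes.
    :param info: dict mapping (structure, aminoacid, lag) to an information value;
                 hypothetical layout, missing entries count as 0.
    :param structures: candidate secondary structures.
    :param window: half-size of the window (8 gives a window of 17 residues).
    :return: string with one predicted structure per residue.
    """
    n = len(sequence)
    prediction = []
    for j in range(n):
        scores = {}
        for s in structures:
            total = 0.0
            for m in range(-window, window + 1):
                # Skip window positions that fall outside the sequence.
                if 0 <= j + m < n:
                    total += info.get((s, sequence[j + m], m), 0.0)
            scores[s] = total
        # arg max over the candidate structures.
        prediction.append(max(scores, key=scores.get))
    return "".join(prediction)


# Toy information values, made up for the example.
toy_info = {("h", "a", 0): 1.0, ("b", "v", 0): 1.0, ("c", "g", 0): 1.0}
toy_prediction = predict_gor_1("avg", toy_info)
print(toy_prediction)
```

Window positions that fall outside the sequence are simply skipped, which is one possible way to handle the ends of a protein.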

***

### <span style="color:#1CA766">GOR III</span>:

It is possible to improve the accuracy of **GOR I & II** by considering not only the occurrences of the amino acids close to the position whose secondary structure we want to predict, but also their occurrences relative to the amino acid found at that position.

This means that the predicted secondary structure $S_j$ will be approximated as 

$$S_j = \arg\max_{S}{I(\Delta S_j;\ a_1, \ldots, a_n)} \approx \arg\max_{S}{\sum_{m=-8}^{8}{I(\Delta S_j;\ a_{j+m}\ |\ a_j)}}$$ 
$$= \arg\max_{S}\left[I(\Delta S_j;\ a_j) + \sum_{m=-8,\ m \neq 0}^{8}{I(\Delta S_j;\ a_{j+m}\ |\ a_j)}\right]$$

where the term for $m = 0$ is taken to be $I(\Delta S_j;\ a_j)$, and for $m \neq 0$ $$I(\Delta S_j;\ a_{j+m}\ |\ a_j) = \log\left(\frac{c_{S_j,\ a_{j+m},\ a_j}}{c_{\neg S_j,\ a_{j+m},\ a_j}}\right) + \log\left(\frac{c_{\neg S_j,\ a_j}}{c_{S_j,\ a_j}}\right)$$
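
To make the role of each term concrete, the score of every secondary structure at one position can be sketched with NumPy. This is only an illustration under several assumptions: the counts are stored in a hypothetical 4-dimensional array `counts[s, a_j, a_jm, m + 8]`, the sequence is encoded as integers, and a pseudocount `pseudo` is added to every count to avoid taking the logarithm of zero. The pseudocount is an assumption of this sketch and is not part of the formulas above.

```python
import numpy as np


def gor_3_scores(seq, j, counts, pseudo=1.0):
    """
    Score every secondary structure for position j (illustrative sketch).
    :param seq: integer-encoded aminoacid sequence (hypothetical encoding).
    :param j: position to score.
    :param counts: array with counts[s, a_j, a_jm, m + w] = c_{S, a_{j+m}, a_j}.
    :param pseudo: pseudocount added to each count (assumption of this sketch).
    :return: array with one score per secondary structure.
    """
    w = (counts.shape[3] - 1) // 2      # half window size (8 in the text)
    a_j = seq[j]
    lag0 = counts[:, :, :, w]           # counts at m = 0
    # c_S: residues with structure S; c_{S, a_j}: those that are also a_j.
    c_s = np.trace(lag0, axis1=1, axis2=2) + pseudo
    c_s_aj = lag0[:, a_j, a_j] + pseudo
    c_not_s = c_s.sum() - c_s
    c_not_s_aj = c_s_aj.sum() - c_s_aj
    # Term for m = 0: I(dS_j; a_j).
    scores = np.log(c_s_aj / c_not_s_aj) + np.log(c_not_s / c_s)
    for m in range(-w, w + 1):
        if m == 0 or not 0 <= j + m < len(seq):
            continue
        # c_{S, a_{j+m}, a_j} for every structure S.
        c = counts[:, a_j, seq[j + m], m + w] + pseudo
        c_not = c.sum() - c
        scores += np.log(c / c_not) + np.log(c_not_s_aj / c_s_aj)
    return scores


# Toy data: random counts for 3 structures, 20 aminoacids, 17 lags.
rng = np.random.default_rng(0)
toy_counts = rng.integers(1, 50, size=(3, 20, 20, 17))
toy_seq = rng.integers(0, 20, size=30)
toy_scores = gor_3_scores(toy_seq, 10, toy_counts)
predicted_index = int(np.argmax(toy_scores))
```

The predicted structure is the one with the highest score, i.e. the index returned by `np.argmax`.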

In Python, we can compute $c_{S_j,\ a_{j+m},\ a_j}$ and $I(\Delta S_j;\ a_{j+m}\ |\ a_j)$ as:

In [3]:
def count_residues_gor_3(s_list, a_list, s_j, a_j, a_jm, lag):
    """
    Count the number of times that structure "s_j" appears together with residue "a_j",
    and residue "a_jm" is at distance "lag" from them.
    :param s_list: a list of secondary structures; pandas Series.
    :param a_list: a list of aminoacids; pandas Series.
    :param s_j: a secondary structure name
    :param a_j: an aminoacid name
    :param a_jm: an aminoacid name
    :param lag: distance from a_j at which a_jm is searched
    :return: int
    """
    n = len(s_list)
    count = 0
    for i in range(max(0, -lag), min(n, n - lag)):
        if s_list.iat[i] == s_j and a_list.iat[i] == a_j and a_list.iat[i + lag] == a_jm:
            count += 1
    return count

In [4]:
def info_value_gor_3(s_list, a_list, s_j, a_j, a_jm, lag, a_occ):
    """
    Compute the information value associated to an aminoacid a_j in position j,
    with secondary structure s_j, in position j,
    and to an aminoacid a_jm, at distance m from s_j.
    :param s_list: a list of secondary structures; pandas Series.
    :param a_list: a list of aminoacids; pandas Series.
    :param s_j: a secondary structure name
    :param a_j: an aminoacid name
    :param a_jm: an aminoacid name
    :param lag: distance from a_j at which a_jm is searched
    :param a_occ: number of occurrencies of each aminoacid
    :return: double
    """
    # Times s_j appears with a_j while a_jm is lag positions away.
    num_1 = count_residues_gor_3(s_list, a_list, s_j, a_j, a_jm, lag)
    # Times a_j appears with a secondary structure different from s_j
    # while a_jm is lag positions away.
    # count_aminoacid_lag is a helper expected to count a_j/a_jm pairs at the given lag.
    den_1 = count_aminoacid_lag(a_list, a_j, a_jm, lag) - num_1

    # Times aminoacid a_j appears together with structure s_j.
    den_2 = count_residues_gor_3(s_list, a_list, s_j, a_j, a_j, 0)
    # Times aminoacid a_j appears with a structure different from s_j.
    num_2 = a_occ[a_j] - den_2

    info = np.log(num_1) - np.log(den_1) + np.log(num_2) - np.log(den_2)
    return info
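
The helper `count_aminoacid_lag` is called by `info_value_gor_3`, but its definition does not appear in this section. The block below is a hypothetical reconstruction, written by analogy with `count_residues_gor_3`: it assumes that the helper counts the positions where `a_j` is found and `a_jm` is `lag` positions away, regardless of the secondary structure. The actual definition may differ.

```python
import pandas as pd


def count_aminoacid_lag(a_list, a_j, a_jm, lag):
    """
    Hypothetical reconstruction: count the positions i where a_list[i] == a_j
    and a_list[i + lag] == a_jm, whatever the secondary structure is.
    :param a_list: a list of aminoacids; pandas Series.
    :param a_j: an aminoacid name.
    :param a_jm: an aminoacid name.
    :param lag: distance from a_j at which a_jm is searched.
    :return: int
    """
    n = len(a_list)
    count = 0
    # Only visit positions i for which i + lag is inside the sequence.
    for i in range(max(0, -lag), min(n, n - lag)):
        if a_list.iat[i] == a_j and a_list.iat[i + lag] == a_jm:
            count += 1
    return count


# Toy sequence, made up for the example.
toy_sequence = pd.Series(list("abab"))
print(count_aminoacid_lag(toy_sequence, "a", "b", 1))  # prints 2
```

With this definition, subtracting `num_1` leaves the pairs in which `a_j` has a secondary structure different from `s_j`, which matches the comment on `den_1`.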

*****

# <span style="color:#2E66A7">Part 2:</span> Data pre-processing

To compute the information values for each combination of amino acids and secondary structures, it is necessary to have a large dataset of proteins with known secondary structures. We are provided with a set of 498 proteins, where the secondary structures have been determined by **DSSP** and **STRIDE**.

Before **GOR III** can be applied, the dataset must be preprocessed: inconsistencies that might be present are cleaned, and the amino acid sequences are stored in an appropriate way.

A few rows contain amino acids that are not part of the usual 20. As the number of these rows is very limited compared to the total, we can just remove them entirely; this should not noticeably affect the overall results of the algorithm.

In [5]:
#%% IMPORT DATA

import pandas as pd
import numpy as np
import timeit
import pickle

aminoacid_list = "arndceqghilkmfpstwyv"

# Dictionary that maps aminoacid codes to single letters.
aminoacid_codes = {"ala":  "a",
                   "arg":  "r",
                   "asn":  "n",
                   "asp":  "d",
                   "cys":  "c",
                   "gln":  "q",
                   "glu":  "e",
                   "gly":  "g",
                   "his":  "h",
                   "ile":  "i",
                   "leu":  "l",
                   "lys":  "k",
                   "met":  "m",
                   "phe":  "f",
                   "pro":  "p",
                   "ser":  "s",
                   "thr":  "t",
                   "trp":  "w",
                   "tyr":  "y",
                   "val":  "v"}

In [6]:
def preprocess_input(file_name, aminoacid_codes):
    """
    Load and preprocess a file containing protein sequences.
    The output will be a DataFrame ready to be used by GOR.
    :param file_name: string, name of the file to be opened.
    :param aminoacid_codes: dictionary that maps aminoacid codes to single letters.
    :return: DataFrame
    """
    # Load the data.
    input_data = pd.read_csv(file_name,
                             header=None, sep="\t",
                             names=["PDB_code", "PDB_chain_code", "PDB_seq_code",
                                    "residue_name", "secondary_structure"])

    # CLEAN DATA - AMINOACIDS

    # Some residue names are unusual, like "a", "b", ...
    # Some of those can be interpreted as correct aminoacids.

    # Strip surrounding whitespace first, then remove the rows with "X", "b", "UNK".
    input_data["residue_name"] = input_data["residue_name"].str.strip()
    unknown_residue = input_data["residue_name"].isin(["X", "b", "UNK"])
    input_data = input_data[~unknown_residue].reset_index(drop=True)


    # Add a column for the aminoacids, while preserving the original one
    input_data["a"] = input_data["residue_name"]
    input_data["a"] = input_data["a"].str.lower()


    # replace the codes with something shorter
    input_data["a"] = input_data["a"].map(lambda x: aminoacid_codes[x] if x in aminoacid_codes else x)

    # Create a new column
    input_data["s"] = input_data["secondary_structure"]
    input_data["s"] = input_data["s"].str.lower()
    # Replace "other" with "coil"
    input_data.loc[input_data["s"] == "other", "s"] = "coil"
    # Shorten values
    input_data["s"] = input_data["s"].str[0]
    
    # Add a composite key to the dataset.
    # Append the PDB chain code to the PDB code, to obtain unique protein identifier.
    input_data["PDB_code_and_chain"] = input_data.PDB_code + "_" + input_data.PDB_chain_code

    return input_data


We can load and work on the *STRIDE* dataset. The *DSSP* dataset is similar, and will be compared to *STRIDE* later in the report. 

In [7]:
# Type of the data to read ("stride", "dssp")
data_type = "stride"
# File name
file_name = "../data/" + data_type + "_info.txt"
# Read the data
input_data = preprocess_input(file_name, aminoacid_codes)

# Look at the aminoacid values
aminoacids = set(input_data.residue_name)
print(aminoacids)
# Do the same for the secondary structures
secondary_structures = set(input_data.secondary_structure)
print(secondary_structures)

{'CYS', 'LYS', 'VAL', 'TYR', 'MET', 'PHE', 'ARG', 'SER', 'GLY', 'GLN', 'ASN', 'GLU', 'ALA', 'THR', 'PRO', 'HIS', 'ILE', 'ASP', 'LEU', 'TRP'}
{'Other', 'Beta', 'Helix'}


In [8]:
input_data.iloc[0:5, ]

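| | PDB_code | PDB_chain_code | PDB_seq_code | residue_name | secondary_structure | a | s | PDB_code_and_chain |
|---|---|---|---|---|---|---|---|---|
| 0 | 1w0n | A | 12 | ILE | Other | i | c | 1w0n_A |
| 1 | 1w0n | A | 13 | THR | Beta | t | b | 1w0n_A |
| 2 | 1w0n | A | 14 | LYS | Beta | k | b | 1w0n_A |
| 3 | 1w0n | A | 15 | VAL | Beta | v | b | 1w0n_A |
| 4 | 1w0n | A | 16 | GLU | Beta | e | b | 1w0n_A |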


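**PDB_code** and **PDB_chain_code** together identify the protein chain that a row belongs to (they are combined in **PDB_code_and_chain**), while **PDB_seq_code** gives the number of the residue within the chain. The columns **a** and **s** are just shortened versions of the amino acid codes and of the secondary structure codes.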

We can also compute some other values, that will be useful later.

In [9]:
# Number of residues
n_res = len(input_data.index)

# Compute the number of occurrences of each aminoacid
a_occ = input_data["a"].value_counts()

# Compute the number of occurrences of each secondary structure
s_occ = input_data["s"].value_counts()

In [10]:
n_res

71073

In [11]:
a_occ

l    6449
a    5963
g    5119
v    4960
e    4816
d    4244
k    4213
s    4148
i    4034
t    3937
r    3665
p    3181
n    3053
f    2890
q    2849
y    2467
h    1715
m    1260
c    1103
w    1007
Name: a, dtype: int64

In [12]:
s_occ

c    29170
h    25755
b    16148
Name: s, dtype: int64

Now, to compute the information values of **GOR III**, a few more functions are required.

The first computes a matrix that stores, for each amino acid, the occurrences of each secondary structure.  
The second builds a dictionary where each key is a secondary structure, and each value is a tensor (technically, a `pandas.Panel`) that stores the number of times an amino acid $a_j$ appears with secondary structure $S_j$ while another amino acid $a_{j+m}$ appears at a distance $m$ from it, with $m$ in $[-8, 8]$. Note that `pandas.Panel` was deprecated in pandas 0.20 and removed in pandas 0.25, so this code requires an older pandas version.

These matrices must be precomputed, as counting the occurrences from scratch for each prediction would be very inefficient.

In [13]:
def build_sj_aj_matrix(s_list, a_list):
    """
    Count the number of times that each aminoacid a_j has secondary structure s_j.
    :param s_list: a list of secondary structures; pandas Series.
    :param a_list: a list of aminoacids; pandas Series.
    :return: pandas.DataFrame
    """
    # All secondary structures, as a list: sets are unordered,
    # so a list is a safer index for the DataFrame.
    sec_structure_set = list(set(s_list))
    # All aminoacids, as a list.
    aminoacid_set = list(set(a_list))
    # Occurrence matrix.
    sj_aj_matrix = pd.DataFrame(0, index=sec_structure_set, columns=aminoacid_set, dtype=int)
    # Zip together the lists of secondary structures and aminoacids, it makes counting easier.
    s_a_list = [list(x) for x in zip(s_list, a_list)]
    # Count the occurrences.
    for s in sec_structure_set:
        for a in aminoacid_set:
            sj_aj_matrix.at[s, a] = s_a_list.count([s, a])

    return sj_aj_matrix
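
The same table of co-occurrences can also be obtained with `pandas.crosstab`, which avoids the explicit double loop. The snippet below is a small sketch on toy data standing in for `input_data.s` and `input_data.a`; the variable names are made up for the example, and the row and column order may differ from the one produced by `build_sj_aj_matrix`.

```python
import pandas as pd

# Toy data, standing in for input_data.s and input_data.a.
toy_s = pd.Series(list("hhbcch"), name="s")
toy_a = pd.Series(list("aavgga"), name="a")

# Rows: secondary structures; columns: aminoacids; values: co-occurrence counts.
sj_aj_crosstab = pd.crosstab(toy_s, toy_a)
print(sj_aj_crosstab)
```

A single count is then read with `sj_aj_crosstab.at["h", "a"]`, exactly as with the matrix built above.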

In [14]:
start_time = timeit.default_timer()
sj_aj_matrix = build_sj_aj_matrix(input_data.s, input_data.a)
end_time = timeit.default_timer()
print("! -> EXECUTION TIME OF build_sj_aj_matrix:", (end_time - start_time), "\n")

! -> EXECUTION TIME OF build_sj_aj_matrix: 0.26465835296067575 



In [15]:
sj_aj_matrix

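| | i | v | n | l | f | c | t | p | a | s | g | e | k | h | q | y | r | d | w | m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| b | 1572 | 2115 | 447 | 1573 | 938 | 301 | 1135 | 335 | 1007 | 771 | 804 | 773 | 809 | 400 | 493 | 827 | 742 | 530 | 294 | 282 |
| h | 1468 | 1601 | 916 | 3071 | 1012 | 305 | 1089 | 467 | 2992 | 1204 | 850 | 2348 | 1795 | 570 | 1369 | 850 | 1586 | 1338 | 377 | 547 |
| c | 994 | 1244 | 1690 | 1805 | 940 | 497 | 1713 | 2379 | 1964 | 2173 | 3465 | 1695 | 1609 | 745 | 987 | 790 | 1337 | 2376 | 336 | 431 |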
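
These counts are enough to work out an information value by hand. As a small worked example, the snippet below evaluates $I(\Delta S_j;\ a_j)$ for helices at lag $m = 0$, using the counts reported above for alanine and glycine. The helper name `info_m0` is made up for this illustration.

```python
import numpy as np

n_res = 71073   # total number of residues
c_h = 25755     # residues in a helix, from s_occ


def info_m0(c_s_a, c_a, c_s, n):
    """
    Information value at lag 0, from raw counts.
    :param c_s_a: times the aminoacid appears with the structure.
    :param c_a: total occurrences of the aminoacid.
    :param c_s: total occurrences of the structure.
    :param n: total number of residues.
    """
    return np.log(c_s_a / (c_a - c_s_a)) + np.log((n - c_s) / c_s)


# Alanine: 2992 of its 5963 occurrences are in a helix.
info_h_ala = info_m0(2992, 5963, c_h, n_res)
# Glycine: 850 of its 5119 occurrences are in a helix.
info_h_gly = info_m0(850, 5119, c_h, n_res)
print(info_h_ala, info_h_gly)
```

The value for alanine is positive (roughly $0.57$) and the value for glycine is negative (roughly $-1.05$), which agrees with the remark in Part 1 that regions rich in alanine tend to form helices.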


In [None]:
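def build_sj_aj_ajm(input_data, print_details=True):
    """
    Count the times where aminoacid a_j appears with secondary structure s_j,
    and aminoacid a_jm appears at a distance m from them, with m in [-8, 8].
    :param input_data: DataFrame produced by preprocess_input.
    :param print_details: if True, print the progress of the computation.
    :return: dictionary that maps each secondary structure to a pandas.Panel of counts.
    """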
    # Set of all proteins.
    protein_set = list(set(input_data.PDB_code_and_chain))
    # Set of all secondary structures.
    sec_structure_set = list(set(input_data.s))
    # Set of all aminoacids.
    aminoacid_set = list(set(input_data.a))
    # Build a dictionary where the keys are the secondary structures, and the values are 3D tensors
    # with index a_j, a_jm, m.
    sj_aj_ajm_dict = {s: pd.Panel(data=0, items=aminoacid_set, major_axis=aminoacid_set, minor_axis=np.arange(-8, 9), dtype=int) for s in sec_structure_set}

    # The counting must be done separately for each protein,
    # otherwise the aminoacids at the end of one protein would be counted
    # as part of the next one!
    for i_p, p in enumerate(protein_set):
        s_list = input_data.loc[input_data['PDB_code_and_chain'] == p].s
        a_list = input_data.loc[input_data['PDB_code_and_chain'] == p].a
        n = len(s_list)
        for i_aj, a_j in enumerate(aminoacid_set):
            for i_ajm, a_jm in enumerate(aminoacid_set):
                for i_m, m in enumerate(np.arange(-8, 9)):
                    if print_details:
                        print(p, a_j, a_jm, m, "-------", i_p / len(protein_set), i_aj / 20, i_ajm / 20)

                    for i in range(max(0, -m), min(n, n - m)):
                        if a_list.iat[i] == a_j and a_list.iat[i + m] == a_jm:
                            # Increment the count of the right dictionary.
                            sj_aj_ajm_dict[s_list.iat[i]].iat[i_aj, i_ajm, i_m] += 1

    return sj_aj_ajm_dict
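
Most of the running time of this function is spent looping over every pair of amino acids for every protein. As a possible alternative, the same counts can be accumulated in a plain NumPy array, visiting each residue only once per lag. The block below is a sketch under stated assumptions, not the code used for the results in this report: the function name `build_counts_array` and the array layout `counts[s, a_j, a_jm, m + window]` are hypothetical, and the rows of each chain are assumed to be stored in sequence order.

```python
import numpy as np
import pandas as pd


def build_counts_array(input_data, window=8):
    """
    Count, for each chain, the times a_j appears with structure s_j while
    a_jm is at distance m (illustrative sketch, hypothetical layout).
    :param input_data: DataFrame with columns "a", "s", "PDB_code_and_chain".
    :param window: half-size of the window.
    :return: counts array, plus the structure, aminoacid and lag labels.
    """
    aminoacids = sorted(input_data["a"].unique())
    structures = sorted(input_data["s"].unique())
    a_index = {a: i for i, a in enumerate(aminoacids)}
    s_index = {s: i for i, s in enumerate(structures)}
    lags = np.arange(-window, window + 1)
    shape = (len(structures), len(aminoacids), len(aminoacids), len(lags))
    counts = np.zeros(shape, dtype=np.int64)

    # Count each chain separately, so that windows never span two proteins.
    for _, chain in input_data.groupby("PDB_code_and_chain", sort=False):
        a = chain["a"].map(a_index).to_numpy()
        s = chain["s"].map(s_index).to_numpy()
        n = len(a)
        for i_m, m in enumerate(lags):
            # Positions j for which j + m is still inside the chain.
            j = np.arange(max(0, -m), min(n, n - m))
            if j.size == 0:
                continue
            # np.add.at accumulates correctly even with repeated indices.
            np.add.at(counts, (s[j], a[j], a[j + m], np.full(j.size, i_m)), 1)
    return counts, structures, aminoacids, lags


# Toy dataset with two chains, made up for the example.
toy_data = pd.DataFrame({
    "PDB_code_and_chain": ["P", "P", "P", "Q"],
    "a": ["a", "b", "a", "b"],
    "s": ["h", "h", "c", "h"],
})
toy_counts, toy_structures, toy_aminoacids, toy_lags = build_counts_array(toy_data, window=1)
```

Each slice `toy_counts[k]` plays the role of the tensor stored under one secondary structure in `sj_aj_ajm_dict`, with the axes in the same order ($a_j$, $a_{j+m}$, $m$).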

In [16]:
# Count the times where aminoacid a_j appears with secondary structure s_j,
# and aminoacid a_jm appears at a distance m from them, with m in [-8, 8].
#   WARNING: this can take a bit of time, usually about 20-30 minutes;
#            do it at your own risk!
start_time = timeit.default_timer()
##########################
# UNCOMMENT IF NEEDED ####
##########################
# sj_aj_ajm_dict = build_sj_aj_ajm(input_data)

end_time = timeit.default_timer()
print("! -> EXECUTION TIME OF build_sj_aj_ajm:", (end_time - start_time), "\n")

! -> EXECUTION TIME OF build_sj_aj_ajm: 5.784613011439177e-05 



As the computation of the dictionary can take a fair bit of time, it is a good idea to store the result, and load it when needed.

Note: comment/uncomment the following lines depending on your needs!

In [17]:
# Save data
dict_file_name = "sj_aj_ajm_dict_"+ data_type + ".p"

# # Open the file for writing.
# file_object = open(dict_file_name.encode('utf-8').strip(), 'wb')

# # Save data
# pickle.dump(sj_aj_ajm_dict, file_object)
#
# file_object.close()

Data can also be loaded with:

In [18]:
# Load data; the "with" block closes the file once loading is done.
with open(dict_file_name, 'rb') as file_object:
    sj_aj_ajm_dict = pickle.load(file_object)

Here we can visualize a slice of **sj_aj_ajm_dict**: for instance, we can consider the occurrences of $\beta$-sheets at an amino acid distance of $-8$, which is the first position along the distance axis. 

In [20]:
sj_aj_ajm_dict["b"].iloc[:, :, 0]

Unnamed: 0,g,d,f,w,r,q,c,n,a,e,p,t,v,y,l,i,k,s,m,h
g,67,51,69,25,42,37,23,24,79,58,30,77,156,59,114,114,64,63,29,33
d,43,41,62,18,43,24,14,24,62,39,22,56,139,47,84,88,38,42,16,21
f,23,21,43,14,31,15,9,19,33,30,15,37,65,33,56,53,33,28,15,8
w,13,8,18,4,11,4,6,3,11,9,3,14,21,13,22,13,12,11,3,6
r,30,28,31,17,45,22,10,17,40,56,16,64,106,47,91,64,35,35,8,24
q,28,16,25,9,32,12,12,20,36,27,8,51,81,33,60,56,23,30,6,19
c,14,5,12,4,4,12,11,7,15,11,4,7,41,10,16,15,8,11,4,6
n,38,20,35,18,42,30,14,22,45,28,11,56,74,35,65,78,40,24,10,15
a,63,29,62,20,44,29,17,37,95,58,22,89,178,59,131,133,48,54,27,34
e,53,30,59,19,45,36,19,32,63,43,21,64,147,42,104,101,49,46,15,23
