# Computational biology and bioinformatics - <span style="color:#1CA766">INFO-F-439</span>
# Assignment 1: <span style="color:#1CA766">GOR III secondary structure prediction</span>

> ## <span style="color:#2E66A7"> Alberto Parravicini</span>

*****

# <span style="color:#2E66A7">Part 1:</span> Introduction

Proteins are for the most part defined by the amino acid sequences that compose them. However, due to the nature of amino acids, proteins also have more complex structures that characterize them.

Hydrogen bonds between the amino acids in the sequence cause proteins to assume three-dimensional structures of various kinds. Being able to recognize these structures is useful to compare proteins, and to gain better insight into their properties.

Correctly classifying which portion of a protein assumes a given secondary structure isn't an easy task, but there are known links between specific amino acid patterns and given secondary structures.
As an example, in $\alpha$-helices (one type of secondary structure) every third amino acid along the sequence will tend to be hydrophobic. Also, regions rich in *alanine (A)* and *leucine (L)* will tend to form helices (see *Gollery, Martin. "Bioinformatics: Sequence and Genome Analysis", David W. Mount. pp. 387-388*).

***

### <span style="color:#1CA766">GOR I & II</span>:

One idea that can be exploited to predict secondary structures from amino acid sequences is that the secondary structure at a given position is influenced not only by the amino acid at that position, but also by the amino acids close to it.

Given a dataset of proteins where the secondary structure of each amino acid is known, it is possible to build a model that predicts secondary structures by using the idea above.

This is the core idea behind the **GOR I** and **GOR II** algorithms.

For a given position $j$, the predicted secondary structure $S_j$ is the one with the highest value of *information*, as a function of the amino acid sequence.
Assuming that our secondary structures are in the set $S = \{\alpha$-helices (**H**), $\beta$-sheets (**B**), and coils (**C**)$\}$, the one that will be chosen is $$S_j = \arg\max_{S}{I(\Delta S_j;\ a_1, \ldots, a_n)}$$ where $a_1, \ldots, a_n$ is the amino acid sequence, and $S_j$ is picked among the secondary structures in $S$.

The previous computation is approximated by **GOR** by considering a window of size $17$ around position $j$, which gives the formula $$S_j = \arg\max_{S}{I(\Delta S_j;\ a_1, \ldots, a_n)} \approx \arg\max_{S}{\sum_{m = -8}^{m = 8}{I(\Delta S_j;\ a_{j+m})}}$$ 

In the previous formulation, $I(\Delta S_j;\ a_{j+m})$ is a measure of *self-information*, and is given by
$$I(\Delta S_j;\ a_{j+m}) = \log\left(\frac{c_{S_j,\ a_{j+m}}}{c_{\neg S_j,\ a_{j+m}}}\right) + \log\left(\frac{c_{\neg S}}{c_{S}}\right)$$

where $c$ refers to the number of occurrences (in a given dataset) of what is specified in the subscript; for instance, $c_{\neg S}$ is the number of amino acids with a secondary structure different from $S$.
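
As a quick numeric check of the formula above, the following sketch evaluates it on made-up counts. The function name `info_from_counts` and all the numbers are illustrative assumptions, not values taken from the dataset.

```python
import math


def info_from_counts(c_s_a, c_not_s_a, c_s, c_not_s):
    """
    Information value from raw counts (hypothetical helper):
    log(c_{S,a} / c_{not S,a}) + log(c_{not S} / c_S).
    """
    return math.log(c_s_a / c_not_s_a) + math.log(c_not_s / c_s)


# Made-up counts: the amino acid is seen 30 times with structure S and
# 20 times with another structure; S covers 400 of 1000 residues overall.
value = info_from_counts(c_s_a=30, c_not_s_a=20, c_s=400, c_not_s=600)
print(round(value, 3))  # 0.811
```

A positive value means that the amino acid is observed with structure $S$ more often than the overall frequency of $S$ would suggest; a negative value means the opposite.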

In Python, we can compute $c_{S_j,\ a_{j+m}}$ and $I(\Delta S_j;\ a_{j+m})$ in the following way:

In [4]:
def count_residues_gor_1(s_list, a_list, s_j, a_jm, lag):
    """
    Count the number of times that structure "s_j" appears,
    and residue "a_jm" is at distance "lag" from them.
    :param s_list: a list of secondary structures; pandas Series.
    :param a_list: a list of aminoacids; pandas Series.
    :param s_j: a secondary structure name.
    :param a_jm: an aminoacid name.
    :param lag: distance from a_j at which a_jm is searched.
    :return: int
    """
    n = len(s_list)
    count = 0
    for i in range(max(0, -lag), min(n, n - lag)):
        if s_list.iloc[i] == s_j and a_list.iloc[i + lag] == a_jm:
            count += 1
    return count

In [5]:
def info_value_gor_1(s_list, a_list, s_j, a_jm, lag, a_occ):
    """
    Compute the information value associated to a secondary structure s_j, in position j,
    and to an aminoacid a_jm, at distance m from s_j.
    :param s_list: a list of secondary structures; pandas Series.
    :param a_list: a list of aminoacids; pandas Series.
    :param s_j: a secondary structure name.
    :param a_jm: an aminoacid name.
    :param lag: distance from a_j at which a_jm is searched.
    :param a_occ: number of occurrences of each aminoacid; pandas Series.
    :return: double
    """
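    # Times a_jm is found at distance lag from s_j, and from any other structure.
    num_1 = count_residues_gor_1(s_list, a_list, s_j, a_jm, lag)
    den_1 = a_occ[a_jm] - num_1
    # Number of residues with structure s_j, and with any other structure.
    den_2 = (s_list == s_j).sum()
    num_2 = len(s_list) - den_2
    return np.log(num_1 * num_2) - np.log(den_1 * den_2)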
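
The window sum and the $\arg\max$ described earlier can be sketched as follows. This is only an illustration: `predict_structure_gor_1` and `info_table` (a dictionary of precomputed information values keyed by structure, lag and amino acid) are hypothetical names, the toy values are made up, and skipping window positions that fall outside the sequence is an assumption about how the edges are handled.

```python
STRUCTURES = ("h", "b", "c")
HALF_WINDOW = 8  # window of 17 residues: positions j-8 ... j+8


def predict_structure_gor_1(sequence, j, info_table):
    """
    Predict the structure at position j by summing, for each candidate
    structure, the information values of the residues in the window.
    :param sequence: string of single-letter aminoacids.
    :param j: position to predict.
    :param info_table: hypothetical dict (structure, lag, aminoacid) -> value;
                       missing entries count as 0.
    :return: (best structure, dict of scores)
    """
    scores = {}
    for s in STRUCTURES:
        total = 0.0
        for m in range(-HALF_WINDOW, HALF_WINDOW + 1):
            # Assumption: positions outside the sequence are skipped.
            if 0 <= j + m < len(sequence):
                total += info_table.get((s, m, sequence[j + m]), 0.0)
        scores[s] = total
    return max(scores, key=scores.get), scores


# Made-up table: alanine favours helices and valine favours sheets, at every lag.
toy_table = {}
for m in range(-HALF_WINDOW, HALF_WINDOW + 1):
    toy_table[("h", m, "a")] = 0.5
    toy_table[("b", m, "v")] = 0.4

best, scores = predict_structure_gor_1("aaavaaa", 3, toy_table)
print(best, scores)
```

In the toy sequence the six alanines around position 3 outweigh the single valine, so the helix score is the largest.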

***

### <span style="color:#1CA766">GOR III</span>:

It is possible to improve the accuracy of **GOR I & II** by considering not only the occurrences of amino acids close to the position where we want to predict the secondary structure, but also their occurrences relative to the amino acid found at that position.

This means that the predicted secondary structure $S_j$ will be approximated as

$$S_j = \arg\max_{S}{I(\Delta S_j;\ a_1, \ldots, a_n)} \approx \arg\max_{S}{\sum_{m = -8}^{m = 8}{I(\Delta S_j;\ a_{j+m}\ |\ a_j)}}$$
$$= \arg\max_{S}{\left[ I(\Delta S_j;\ a_j) + \sum_{m = -8,\ m \neq 0}^{m = 8}{I(\Delta S_j;\ a_{j+m}\ |\ a_j)} \right]}$$

Here the term for $m = 0$ reduces to $I(\Delta S_j;\ a_j)$, the information carried by the amino acid at position $j$ itself.

where $$I(\Delta S_j;\ a_{j+m}\ |\ a_j) = \log\left(\frac{c_{S_j,\ a_{j+m},\ a_j}}{c_{\neg S_j,\ a_{j+m},\ a_j}}\right) + \log\left(\frac{c_{\neg S_j,\ a_j}}{c_{S_j,\ a_j}}\right)$$

In Python, we can compute $c_{S_j,\ a_{j+m},\ a_j}$ and $I(\Delta S_j;\ a_{j+m}\ |\ a_j)$ as:

In [6]:
def count_residues_gor_3(s_list, a_list, s_j, a_j, a_jm, lag):
    """
    Count the number of times that structure "s_j" appears together with residue "a_j",
    and residue "a_jm" is at distance "lag" from them.
    :param s_list: a list of secondary structures; pandas Series.
    :param a_list: a list of aminoacids; pandas Series.
    :param s_j: a secondary structure name
    :param a_j: an aminoacid name
    :param a_jm: an aminoacid name
    :param lag: distance from a_j at which a_jm is searched
    :return: int
    """
    n = len(s_list)
    count = 0
    for i in range(max(0, -lag), min(n, n - lag)):
        if s_list.iat[i] == s_j and a_list.iat[i] == a_j and a_list.iat[i + lag] == a_jm:
            count += 1
    return count

In [7]:
def info_value_gor_3(s_list, a_list, s_j, a_j, a_jm, lag, a_occ):
    """
    Compute the information value associated to an aminoacid a_j in position j,
    with secondary structure s_j, in position j,
    and to an aminoacid a_jm, at distance m from s_j.
    :param s_list: a list of secondary structures; pandas Series.
    :param a_list: a list of aminoacids; pandas Series.
    :param s_j: a secondary structure name
    :param a_j: an aminoacid name
    :param a_jm: an aminoacid name
    :param lag: distance from a_j at which a_jm is searched
    :param a_occ: number of occurrences of each aminoacid; pandas Series.
    :return: double
    """
    # Number of times s_j appears with a_j, and a_jm is at distance lag.
    num_1 = count_residues_gor_3(s_list, a_list, s_j, a_j, a_jm, lag)
    # Number of times a_j appears with a secondary structure different from s_j,
    # and a_jm is at distance lag. count_aminoacid_lag is assumed to count the
    # occurrences of a_j with a_jm at distance lag, regardless of the structure.
    den_1 = count_aminoacid_lag(a_list, a_j, a_jm, lag) - num_1

    # Number of times aminoacid a_j appears together with structure s_j.
    den_2 = count_residues_gor_3(s_list, a_list, s_j, a_j, a_j, 0)
    # Number of times aminoacid a_j appears with a different structure from s_j.
    num_2 = a_occ[a_j] - den_2

    info = np.log(num_1) - np.log(den_1) + np.log(num_2) - np.log(den_2)
    return info
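
The function above calls `count_aminoacid_lag`, which is not shown in this part of the report. A minimal sketch of what it might look like is given below, assuming that it counts how often `a_j` occurs with `a_jm` at distance `lag`, whatever the secondary structure. The name `count_aminoacid_lag_sketch` is hypothetical, and the actual helper may differ.

```python
import pandas as pd


def count_aminoacid_lag_sketch(a_list, a_j, a_jm, lag):
    """
    Count the number of times that residue "a_j" appears,
    and residue "a_jm" is at distance "lag" from it.
    Hypothetical stand-in for count_aminoacid_lag.
    :param a_list: a list of aminoacids; pandas Series.
    :param a_j: an aminoacid name.
    :param a_jm: an aminoacid name.
    :param lag: distance from a_j at which a_jm is searched.
    :return: int
    """
    n = len(a_list)
    count = 0
    # Same bounds as in count_residues_gor_3: i and i + lag must both be valid.
    for i in range(max(0, -lag), min(n, n - lag)):
        if a_list.iat[i] == a_j and a_list.iat[i + lag] == a_jm:
            count += 1
    return count


# Toy sequence: "a" is followed by "l" twice, and by "v" once.
toy = pd.Series(list("alalav"))
print(count_aminoacid_lag_sketch(toy, "a", "l", 1))  # 2
print(count_aminoacid_lag_sketch(toy, "a", "v", 1))  # 1
```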

*****

# <span style="color:#2E66A7">Part 2:</span> Data pre-processing

To compute the information values for each combination of amino acids and secondary structures, it is necessary to have a large dataset of proteins with known secondary structures. We are provided with a set of 498 proteins, where the secondary structures have been determined by **DSSP** and **STRIDE**.

Before **GOR III** can be applied, the dataset must be preprocessed: any inconsistencies have to be cleaned up, and the amino acid sequences have to be stored in an appropriate way.

In [8]:
#%% IMPORT DATA

import pandas as pd
import numpy as np
import timeit
import pickle

aminoacid_list = "arndceqghilkmfpstwyv"

# Dictionary that maps aminoacid codes to single letters.
aminoacid_codes = {"ala":  "a",
                   "arg":  "r",
                   "asn":  "n",
                   "asp":  "d",
                   "cys":  "c",
                   "gln":  "q",
                   "glu":  "e",
                   "gly":  "g",
                   "his":  "h",
                   "ile":  "i",
                   "leu":  "l",
                   "lys":  "k",
                   "met":  "m",
                   "phe":  "f",
                   "pro":  "p",
                   "ser":  "s",
                   "thr":  "t",
                   "trp":  "w",
                   "tyr":  "y",
                   "val":  "v"}

In [9]:
def preprocess_input(file_name, aminoacid_codes):
    """
    Load and preprocess a file containing protein sequences.
    The output will be a DataFrame ready to be used by GOR.
    :param file_name: string, name of the file to be opened.
    :param aminoacid_codes: dictionary that maps aminoacid codes to single letters.
    :return: DataFrame
    """
    # Load the data
    input_data = pd.read_csv(file_name,
                         header=None, sep="\t",
                         names=["PDB_code", "PDB_chain_code", "PDB_seq_code", "residue_name", "secondary_structure"])

    # CLEAN DATA - AMINOACIDS

    # Some residue names are unusual, like "a", "b", ...
    # Some of those can still be interpreted as valid aminoacids.

    # First, remove any leading or trailing spaces,
    # then remove the rows with "X", "b", "UNK".
    input_data["residue_name"] = input_data["residue_name"].str.strip()
    input_data = input_data.drop(input_data[input_data["residue_name"].isin(["X", "b", "UNK"])].index).reset_index(drop=True)

    # Add a column for the aminoacids, while preserving the original one.
    input_data["a"] = input_data["residue_name"]
    input_data["a"] = input_data["a"].str.lower()

    # Replace the three-letter codes with single letters;
    # values that are not in the dictionary are left unchanged.
    input_data["a"] = input_data["a"].map(lambda x: aminoacid_codes[x] if x in aminoacid_codes else x)

    # Add a column for the secondary structures, while preserving the original one.
    input_data["s"] = input_data["secondary_structure"]
    input_data["s"] = input_data["s"].str.lower()
    # Replace "other" with "coil".
    input_data.loc[input_data["s"] == "other", "s"] = "coil"
    # Keep only the first letter: "h", "b" or "c".
    input_data["s"] = input_data["s"].str[0]

    return input_data
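
To see what the cleaning steps do, the sketch below applies them to a few made-up rows held in memory instead of a file. The PDB code `1abc`, the rows themselves and the reduced `codes` dictionary are illustrative assumptions; the column names are the ones used in `preprocess_input`.

```python
import io
import pandas as pd

# Made-up rows in the tab-separated layout read by preprocess_input.
raw = ("1abc\tA\t1\t ALA\tHelix\n"
       "1abc\tA\t2\tUNK\tOther\n"
       "1abc\tA\t3\tVAL\tBeta\n"
       "1abc\tA\t4\tGLY\tOther\n")
columns = ["PDB_code", "PDB_chain_code", "PDB_seq_code",
           "residue_name", "secondary_structure"]
toy = pd.read_csv(io.StringIO(raw), header=None, sep="\t", names=columns)

# Subset of the aminoacid_codes dictionary, enough for this example.
codes = {"ala": "a", "val": "v", "gly": "g"}

# Strip spaces, then drop the residues that cannot be interpreted.
toy["residue_name"] = toy["residue_name"].str.strip()
toy = toy[~toy["residue_name"].isin(["X", "b", "UNK"])].reset_index(drop=True)

# Single-letter aminoacids and single-letter structures.
toy["a"] = toy["residue_name"].str.lower().map(lambda x: codes.get(x, x))
toy["s"] = toy["secondary_structure"].str.lower().replace("other", "coil").str[0]

print(toy[["residue_name", "secondary_structure", "a", "s"]])
```

The row with `UNK` is dropped, the leading space in front of `ALA` is removed, and `Other` ends up as `c` (coil).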


We can load and work on the *STRIDE* dataset. The *DSSP* dataset is similar, and will be compared to *STRIDE* later in the report. 

In [14]:
# Type of the data to read ("stride", "dssp")
data_type = "stride"
# File name
file_name = "../data/" + data_type + "_info.txt"
# Read the data
input_data = preprocess_input(file_name, aminoacid_codes)

# Look at the aminoacid values
aminoacids = set(input_data.residue_name)
print(aminoacids)
# Do the same for the secondary structures
secondary_structures = set(input_data.secondary_structure)
print(secondary_structures)

{'SER', 'ASP', 'ASN', 'ALA', 'GLU', 'LYS', 'ILE', 'GLY', 'PRO', 'PHE', 'ARG', 'MET', 'VAL', 'TYR', 'TRP', 'HIS', 'LEU', 'GLN', 'THR', 'CYS'}
{'Helix', 'Other', 'Beta'}


In [15]:
input_data.iloc[0:5, ]

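| | PDB_code | PDB_chain_code | PDB_seq_code | residue_name | secondary_structure | a | s |
|---|---|---|---|---|---|---|---|
| 0 | 1w0n | A | 12 | ILE | Other | i | c |
| 1 | 1w0n | A | 13 | THR | Beta | t | b |
| 2 | 1w0n | A | 14 | LYS | Beta | k | b |
| 3 | 1w0n | A | 15 | VAL | Beta | v | b |
| 4 | 1w0n | A | 16 | GLU | Beta | e | b |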


**PDB_code** and **PDB_chain_code** identify the protein chain that each row belongs to, and **PDB_seq_code** gives the position of the residue within that chain. The columns **a** and **s** are just shortened versions of the amino acid codes and of the secondary structure codes.
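
Since **PDB_code** and **PDB_chain_code** identify a chain, they can be used to rebuild one amino acid string and one structure string per chain. The sketch below shows one possible way to do so with a `groupby`; the second chain (`9xyz`, `B`) is made up, and this is not necessarily how the sequences are stored in the rest of the report.

```python
import pandas as pd

# The first three rows mimic the table above; the chain "9xyz"/"B" is made up.
toy = pd.DataFrame({
    "PDB_code": ["1w0n", "1w0n", "1w0n", "9xyz", "9xyz"],
    "PDB_chain_code": ["A", "A", "A", "B", "B"],
    "a": list("itkav"),
    "s": list("cbbhh"),
})

# One row per chain, with the residues and the structures joined into strings.
# sort=False keeps the chains in their order of appearance.
chains = (toy.groupby(["PDB_code", "PDB_chain_code"], sort=False)[["a", "s"]]
             .agg(lambda column: "".join(column)))
print(chains)
```

Each row of `chains` then holds aligned strings, so the structure at position $j$ of a chain is the $j$-th character of its **s** string.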