# Protein-Protein Interaction Prediction

## Amino Acid sequence generation:

The `dbs/HPRD/pos_inter.csv` file contains a database of known positive protein-protein interactions. However, these proteins are recorded as gene symbols, whereas feature extraction requires the complete amino acid sequence of each protein.

The goal of this notebook is to extract all the gene symbols present in the HPRD database (`data/symbol_data.csv`) and map them to their amino acid (AA) sequences using the NCBI repository.

In [7]:
from utils.db_utils import DBUtils
from utils.pssm_utils import PSSMUtils
import os
from tqdm import tqdm
import numpy as np

To do so, we call the function `generate_protein_fasta()`, which generates a `.fasta` file containing the proteins' RefSeq IDs and their corresponding AA sequences.
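For readers unfamiliar with the format, here is a minimal sketch of what writing such a FASTA file involves. It is not the implementation of `generate_protein_fasta()`: the helper name `write_fasta`, the identifier `NP_000000.1`, and the short sequence are all made up for illustration.

```python
import io


def write_fasta(records, handle, width=60):
    """Write (identifier, sequence) pairs to an open text handle in FASTA format."""
    for identifier, sequence in records:
        # Each record starts with a '>' header line carrying the identifier
        handle.write(f">{identifier}\n")
        # The sequence follows, wrapped so that no line exceeds `width` residues
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")


# Hypothetical record: the identifier and the sequence are placeholders
buffer = io.StringIO()
write_fasta([("NP_000000.1", "MANQVNGNAVQLKEEEEPMD")], buffer, width=10)
fasta_text = buffer.getvalue()
print(fasta_text)
```

Writing to a file instead only requires passing a handle opened with `open(path, "w")` in place of the in-memory buffer.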

In [8]:
# create an instance of the DBUtils class
db_utils = DBUtils()

In [2]:
db_utils.generate_protein_fasta()

Removed 2160 rows - (5.5%)
Current progress: 5.22%
Current progress: 10.43%
Current progress: 15.64%
Current progress: 20.57%
Current progress: 25.74%
Current progress: 30.94%
Current progress: 36.13%
Current progress: 41.34%
Current progress: 46.58%
Current progress: 51.71%
Current progress: 56.56%
Current progress: 61.77%
Current progress: 66.93%
Current progress: 72.12%
Current progress: 76.97%
Current progress: 81.86%
Current progress: 87.02%
Current progress: 92.11%
Current progress: 97.13%
Current progress: 97.34%
Total number of sequences: 9268


Now that we have such a file, we can go ahead with the feature extraction, starting with the PSSM matrix.

## PSSM matrix generation:

To generate the PSSM matrix of each protein sequence, we first need a database to use with the PSI-BLAST tool.

For this project, we chose the SwissProt database for two reasons:
<ul>
    <li>The database is peer-reviewed and contains only valid information</li>
    <li>The database is relatively small (~200 MB) compared to the other options</li>
</ul>

To build it, we call the function `generate_swissprot_db()`, which uses the `makeblastdb` command-line tool.

This should generate some additional files in the `../dbs/SwissProt/` directory.
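As a rough illustration of what a `makeblastdb` call for a protein database can look like, the sketch below assembles the command line from Python. It is an assumption about the general shape of the step, not the body of `generate_swissprot_db()`; the helper name `build_blast_db` and both file names are hypothetical.

```python
import subprocess


def build_blast_db(fasta_path, out_prefix, run=False):
    """Assemble (and optionally run) a makeblastdb command for a protein FASTA file."""
    cmd = [
        "makeblastdb",
        "-in", fasta_path,    # input FASTA file
        "-dbtype", "prot",    # protein database, as opposed to "nucl"
        "-out", out_prefix,   # prefix of the generated database files
    ]
    if run:
        # Requires the BLAST+ tools to be installed and available on PATH
        subprocess.run(cmd, check=True)
    return cmd


# Hypothetical file names, chosen only for this example
cmd = build_blast_db("../dbs/SwissProt/swissprot.fasta", "../dbs/SwissProt/swissprot")
print(" ".join(cmd))
```

Keeping `run=False` by default lets the command be inspected before anything is written to disk.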

In [3]:
db_utils.generate_swissprot_db()

Once this step is complete, we can generate the PSSM matrices of our generated sequences in `./dbs/generated/sequences.fasta` with the `generate_pssm_matrices()` function, which uses the `PSI-BLAST` command-line tool.

We use the following default values as our parameters:

<ul>
    <li>E-value: 0.001</li>
    <li>Number of iterations: 3</li>
    <li>Number of threads: 16</li>
</ul>
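To make these parameters concrete, here is a sketch of how they could map onto `psiblast` command-line options for a single query. This is an assumption, not the body of `generate_pssm_matrices()`: in particular, the notebook does not say whether the E-value of 0.001 is passed as `-evalue` or as the PSSM inclusion threshold, and the helper name `build_psiblast_command` and all paths are hypothetical.

```python
import subprocess


def build_psiblast_command(query_fasta, db_prefix, pssm_out,
                           evalue=0.001, iterations=3, threads=16, run=False):
    """Assemble (and optionally run) a psiblast command that writes an ASCII PSSM."""
    cmd = [
        "psiblast",
        "-query", query_fasta,
        "-db", db_prefix,
        "-evalue", str(evalue),              # assumed mapping of the E-value parameter
        "-num_iterations", str(iterations),
        "-num_threads", str(threads),
        "-out_ascii_pssm", pssm_out,         # text file holding the PSSM
    ]
    if run:
        # Requires the BLAST+ tools to be installed and available on PATH
        subprocess.run(cmd, check=True)
    return cmd


# Hypothetical paths, chosen only for this example
cmd = build_psiblast_command(
    "query.fasta", "../dbs/SwissProt/swissprot", "query.pssm"
)
print(" ".join(cmd))
```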

In [None]:
db_utils.generate_pssm_matrices()

This will generate multiple text files in the `../dbs/generated/pssms` directory, one for each protein sequence. We will read the PSSM matrix from each of these files and store it in a NumPy array.
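The sketch below shows one way such a file can be read. It is not the implementation of `pssm_file_to_2Darray()`: the function name `parse_ascii_pssm` is hypothetical, and it assumes that each data row consists of whitespace-separated columns starting with a position index and a one-letter residue, followed by the 20 scores. The sample text is synthetic.

```python
def parse_ascii_pssm(text):
    """Return (sequence, matrix) from the text of an ASCII PSSM file.

    The matrix is a list of rows, each holding the 20 integer scores of one residue.
    """
    residues = []
    matrix = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows: position index, one-letter residue, then at least 20 scores.
        # Header and footer lines do not match this pattern and are skipped.
        if len(parts) >= 22 and parts[0].isdigit() and len(parts[1]) == 1:
            residues.append(parts[1])
            matrix.append([int(value) for value in parts[2:22]])
    return "".join(residues), matrix


# Synthetic two-residue example, not taken from a real PSI-BLAST run
sample = "\n".join([
    "Last position-specific scoring matrix computed",
    "            A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V",
    "    1 M    " + " ".join(["-1"] * 20),
    "    2 A    " + " ".join(["2"] * 20),
    "",
    "                      K         Lambda",
])
seq, pssm = parse_ascii_pssm(sample)
print(seq, len(pssm), len(pssm[0]))
```

Returning plain lists rather than arrays keeps the result directly serializable to JSON.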

In [2]:
# create an instance of the PSSMUtils class
pssm_utils = PSSMUtils()

In [15]:
# get all the files in the pssm directory
pssm_files = os.listdir('../dbs/generated/pssms/')
parsed_pssms = {}
for file in tqdm(pssm_files):

    # parse the pssm file to get the sequence and the pssm matrix
    # the matrix is parsed into a list of lists so that it can be serialized into a json file
    # we will convert it to a numpy array when we use it
    seq, pssm = pssm_utils.pssm_file_to_2Darray('../dbs/generated/pssms/' + file)
    seq_id = int(file.split('.')[1])
    
    # check if the sequence has already been parsed
    # if so, keep the parsing with the larger id
    # this ensures that we keep the pssm matrix generated after all the PSI-BLAST iterations
    # (PSI-BLAST generates multiple pssm files for the same sequence)
    if seq in parsed_pssms:
        if seq_id > parsed_pssms[seq][1]:
            parsed_pssms[seq] = (pssm, seq_id)
    else:
        parsed_pssms[seq] = (pssm, seq_id)

100%|██████████| 17173/17173 [00:53<00:00, 319.04it/s]


In [16]:
# print the number of parsed pssms
print(len(parsed_pssms))

# we can get rid of the seq_id now
parsed_pssms = {k: v[0] for k, v in parsed_pssms.items()}

# save the parsed pssms as json file, so that we don't have to parse them again
pssm_utils.save_parsed_pssms(parsed_pssms)

9074


## Feature extraction:

Now that we have the protein sequences and their corresponding PSSM matrices, we can start extracting the features used to train our model.

First we start by loading our JSON parsed data into numpy arrays so that we can perform mathematical computations on them more easily.

The result is a dictionary in the format `{seq : pssm}` that we can use to extract features from either the sequence or the PSSM matrix.
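The round trip from JSON-friendly lists to NumPy arrays can be sketched as follows. This is an illustration under assumptions, not the code of `load_parsed_pssms_into_nparrays()`: the toy matrix has 3 columns instead of 20, and the variable names are hypothetical.

```python
import json

import numpy as np

# Toy data in the {seq: pssm} format, with the matrix stored as a list of lists
parsed = {"MA": [[-1, 2, 0], [3, -4, 1]]}

# Lists of lists serialize to JSON directly, which NumPy arrays do not
serialized = json.dumps(parsed)

# After loading, each matrix is converted back to a NumPy array for computation
seq_pssm_toy = {
    seq: np.array(rows, dtype=np.int64)
    for seq, rows in json.loads(serialized).items()
}

print(seq_pssm_toy["MA"].shape)
```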

In [5]:
seq_pssm = pssm_utils.load_parsed_pssms_into_nparrays()

# print the number of parsed pssms
print("Number of sequences loaded: ", len(seq_pssm))

Loaded PSSM data from JSON file.


100%|██████████| 9074/9074 [00:05<00:00, 1585.64it/s]

Converted PSSM values to numpy arrays.
9074





In [10]:
# print a random example of a parsed pssm
example = list(seq_pssm.items())[np.random.randint(0, len(seq_pssm))]

print("Sequence:")
print(example[0])
print()
print("PSSM:")
print(example[1])

Sequence:
MANQVNGNAVQLKEEEEPMDTSSVTHTEHYKTLIEAGLPQKVAERLDEIFQTGLVAYVDLDERAIDALREFNEEGALSVLQQFKESDLSHVQNKSAFLCGVMKTYRQREKQGSKVQESTKGPDEAKIKALLERTGYTLDVTTGQRKYGGPPPDSVYSGVQPGIGTEVFVGKIPRDLYEDELVPLFEKAGPIWDLRLMMDPLSGQNRGYAFITFCGKEAAQEAVKLCDSYEIRPGKHLGVCISVANNRLFVGSIPKNKTKENILEEFSKVTGLTEGLVDVILYHQPDDKKKNRGFCFLEYEDHKSAAQARRRLMSGKVKVWGNVVTVEWADPVEEPDPEVMAKVKVLFVRNLATTVTEEILEKSFSEFGKLERVKKLKDYAFVHFEDRGAAVKAMDEMNGKEIEGEEIEIVLAKPPDKKRKERQAARQASRSTAYEDYYYHPPPRMPPPIRGRGRGGGRGGYGYPPDYYGYEDYYDDYYGYDYHDYRGGYEDPYYGYDDGYAVRGRGGGRGGRGAPPPPRGRGAPPPRGRAGYSQRGAPLGPPRGSRGGRGGPAQQQRGRGSRGSRGNRGGNVGGKRKADGYNQPDSKRRQTNNQQNWGSQPIAQQPLQQGGDYSGNYGYNNDNQEFYQDTYGQQWK

PSSM:
[[-1 -1 -2 ... -2 -1  1]
 [ 2 -2 -1 ... -3 -2  2]
 [-1  0  3 ... -3 -2  1]
 ...
 [-1  0  0 ... -3 -2 -2]
 [-3 -3 -4 ... 11  2 -3]
 [-1  2  0 ... -3 -2 -2]]
