# Protein-Protein Interaction Prediction

(Put this file in the `.src/` folder to avoid relative import errors.)

## Amino Acid sequence generation:

The `dbs/HPRD/pos_inter.csv` file contains a database of known positive protein-protein interactions. The problem is that these proteins are recorded as gene symbols, whereas any feature extraction operation requires the complete amino acid (AA) sequence of each protein.

The goal of this notebook is to extract all the gene symbols present in the HPRD database (`data/symbol_data.csv`) and map them to their AA sequences using the NCBI repository.

In [1]:

from utils.db_utils import DBUtils
from utils.pssm_utils import PSSMUtils
import os
from tqdm import tqdm
import numpy as np
from dotenv.main import load_dotenv
import psycopg2
from urllib.parse import urlparse
import json

To do so, we call `generate_protein_fasta()`, which generates a `.fasta` file containing the proteins' RefSeq IDs and their corresponding AA sequences.

In [2]:
# create an instance of the DBUtils class
db_utils = DBUtils()

In [2]:
db_utils.generate_protein_fasta()

Removed 2160 rows - (5.5%)
Current progress: 5.22%
Current progress: 10.43%
Current progress: 15.64%
Current progress: 20.57%
Current progress: 25.74%
Current progress: 30.94%
Current progress: 36.13%
Current progress: 41.34%
Current progress: 46.58%
Current progress: 51.71%
Current progress: 56.56%
Current progress: 61.77%
Current progress: 66.93%
Current progress: 72.12%
Current progress: 76.97%
Current progress: 81.86%
Current progress: 87.02%
Current progress: 92.11%
Current progress: 97.13%
Current progress: 97.34%
Total number of sequences: 9268
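
The body of `generate_protein_fasta()` is not shown in this notebook. As a rough illustration of its last step only, the sketch below writes identifier/sequence pairs in FASTA format. The function name `write_fasta`, the 60-character line width, and the example IDs and sequences are assumptions made for the example, not the project's actual code or data.

```python
import io


def write_fasta(records, handle, width=60):
    """Write {identifier: sequence} pairs to a file-like object in FASTA format."""
    for seq_id, sequence in records.items():
        handle.write(f">{seq_id}\n")
        # split long sequences into fixed-width lines, as is conventional in FASTA
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")


# hypothetical identifiers and toy sequences, for illustration only
records = {
    "NP_000001.1": "MRSKGRARKL" * 7,  # 70 residues, so it wraps onto two lines
    "NP_000002.1": "MKV",
}

buffer = io.StringIO()
write_fasta(records, buffer)
fasta_text = buffer.getvalue()
print(fasta_text)
```

Passing an open file instead of the `StringIO` buffer would write the same content to disk.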


Now that we have such a file, we can go ahead with the feature extraction, starting with the PSSM matrix.

## PSSM matrix generation:

To generate the PSSM matrix of each protein sequence, we first need a database to use with the PSI-BLAST tool.

For this project, we chose the SwissProt database for two reasons:
<ul>
    <li>The database is peer reviewed and contains only validated information</li>
    <li>The database is relatively small (~200 MB) compared to other options</li>
</ul>

To build it, we call `generate_swissprot_db()`, which uses the `makeblastdb` command line tool.

This should generate some additional files in the `../dbs/SwissProt/` directory.

In [3]:
db_utils.generate_swissprot_db()
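
For reference, a call like the one above can be approximated by invoking `makeblastdb` directly. The sketch below only assembles the command line. The helper name `build_makeblastdb_cmd` and both file paths are assumptions, since the internals of `generate_swissprot_db()` and the exact name of the SwissProt FASTA file are not shown here.

```python
import shlex


def build_makeblastdb_cmd(fasta_path, db_path):
    """Return the argument list for building a protein BLAST database."""
    return [
        "makeblastdb",
        "-in", fasta_path,   # input FASTA file
        "-dbtype", "prot",   # build a protein database
        "-out", db_path,     # base name of the generated database files
    ]


# hypothetical file names; adjust to the actual SwissProt download
makeblastdb_cmd = build_makeblastdb_cmd(
    "../dbs/SwissProt/swissprot.fasta",
    "../dbs/SwissProt/swissprot",
)
print(shlex.join(makeblastdb_cmd))
# to actually run it (requires BLAST+ on PATH):
#   import subprocess; subprocess.run(makeblastdb_cmd, check=True)
```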

Once this step is complete, we can generate the PSSM matrices of the sequences in `./dbs/generated/sequences.fasta` with the `generate_pssm_matrices()` function, which uses the `PSI-BLAST` command line tool.

We use the following default values as our parameters:

<ul>
    <li>E-value: 0.001</li>
    <li>Number of iterations: 3</li>
    <li>Number of threads: 16</li>
</ul>
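
The three defaults listed above correspond to standard `psiblast` options. The sketch below only assembles such a command for a single query file. The helper name, the file paths, and the one-query-per-call layout are assumptions, because the body of `generate_pssm_matrices()` is not shown in this notebook.

```python
def build_psiblast_cmd(query, db, pssm_out, evalue=0.001, iterations=3, threads=16):
    """Return the argument list for one PSI-BLAST run that writes an ASCII PSSM."""
    return [
        "psiblast",
        "-query", query,                     # FASTA file with the query sequence
        "-db", db,                           # database built with makeblastdb
        "-evalue", str(evalue),              # E-value threshold
        "-num_iterations", str(iterations),  # number of PSI-BLAST rounds
        "-num_threads", str(threads),        # CPU threads to use
        "-out_ascii_pssm", pssm_out,         # where to write the text PSSM
    ]


# hypothetical paths, for illustration only
psiblast_cmd = build_psiblast_cmd(
    "query.fasta",
    "../dbs/SwissProt/swissprot",
    "query.pssm",
)
print(" ".join(psiblast_cmd))
```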

In [None]:
db_utils.generate_pssm_matrices()

This will generate multiple txt files in the `../dbs/generated/pssms` directory, one for each protein sequence. We will read the PSSM matrix from each of these files and later convert it to a numpy array.
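
To make the parsing step concrete, here is a minimal sketch of reading an ASCII PSSM. It is not the project's `pssm_file_to_2Darray()`; the function name `parse_ascii_pssm` is hypothetical, and it assumes each data row starts with a position number and a one-letter residue followed by 20 whitespace-separated integer scores. The sample text is a shortened toy input: real files carry extra columns and footer lines, which this sketch ignores.

```python
def parse_ascii_pssm(text):
    """Return (sequence, scores) parsed from the text of an ASCII PSSM file."""
    residues = []
    scores = []
    for line in text.splitlines():
        parts = line.split()
        # data rows look like: <position> <residue> <20 scores> [more columns]
        if len(parts) < 22 or not parts[0].isdigit() or len(parts[1]) != 1:
            continue  # header, footer, or blank line
        residues.append(parts[1])
        scores.append([int(value) for value in parts[2:22]])
    return "".join(residues), scores


# toy two-residue sample, for illustration only
sample = "\n".join([
    "Last position-specific scoring matrix computed",
    "            A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V",
    "    1 M    -1 -2 -2 -3 -2 -1 -2 -3 -2  1  2 -2  6  0 -3 -2 -1 -2 -1  1",
    "    2 K    -1  2  0 -1 -3  1  1 -2 -1 -3 -3  5 -2 -4 -1  0 -1 -3 -2 -3",
])

seq, pssm = parse_ascii_pssm(sample)
print(seq, len(pssm), len(pssm[0]))
```

Keeping the matrix as a list of lists means it can be written to JSON directly and wrapped in a numpy array only when it is needed.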

In [3]:
# create an instance of the PSSMUtils class
pssm_utils = PSSMUtils()

In [15]:
# get all the files in the pssm directory
pssm_files = os.listdir('../dbs/generated/pssms/')
parsed_pssms = {}
for file in tqdm(pssm_files):

    # parse the pssm file to get the sequence and the pssm matrix
    # the matrix is parsed into a list of lists so that it can be serialized to a json file
    # it will be converted to a numpy array when it is used
    seq, pssm = pssm_utils.pssm_file_to_2Darray('../dbs/generated/pssms/' + file)
    seq_id = int(file.split('.')[1])
    
    # check if the sequence has already been parsed
    # if yes, keep the parsing with the bigger id
    # this ensures that we keep the pssm matrix generated after all the PSI-BLAST iterations
    # (PSI-BLAST generated multiple pssm files for the same sequence)
    if seq in parsed_pssms:
        if seq_id > parsed_pssms[seq][1]:
            parsed_pssms[seq] = (pssm, seq_id)
    else:
        parsed_pssms[seq] = (pssm, seq_id)

100%|██████████| 17173/17173 [00:53<00:00, 319.04it/s]


In [16]:
# print the number of parsed pssms
print(len(parsed_pssms))

# we can get rid of the seq_id now
parsed_pssms = {k: v[0] for k, v in parsed_pssms.items()}

# save the parsed pssms as json file, so that we don't have to parse them again
pssm_utils.save_parsed_pssms(parsed_pssms)

9074


## Dealing with missing sequences:

Throughout the whole operation, some proteins did not make it into the `parsed_pssm.json` file, either because their sequences were not found in the `RefSeq ID --> Sequence` step, or because the `PSI-BLAST` CLI tool raised an error while processing them.

These sequences have been compiled into a file named `missing.fasta` and are run through PSI-BLAST one last time to minimize the amount of missing data.
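
The notebook does not show how `missing.fasta` was assembled. One possible approach, sketched below under that caveat, is to compare the sequences in the generated FASTA file with the keys of the parsed PSSM dictionary. Both helper names and the toy data are hypothetical, and the sketch only covers sequences that exist in the FASTA file but have no PSSM yet.

```python
def read_fasta(text):
    """Parse FASTA text into {identifier: sequence}."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            # use the first token of the header as the identifier
            current_id = header.split()[0] if header else ""
            records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line
    return records


def find_missing(fasta_records, parsed_pssms):
    """Return the FASTA records whose sequence has no parsed PSSM yet."""
    return {
        seq_id: seq
        for seq_id, seq in fasta_records.items()
        if seq not in parsed_pssms
    }


# toy data, for illustration only
fasta_text = ">id1 first protein\nMK\n>id2 second protein\nMKV\nA\n"
parsed = {"MK": [[0] * 20, [0] * 20]}

missing = find_missing(read_fasta(fasta_text), parsed)
print(missing)
```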

In [4]:
# blasting the missing sequences against the swissprot database
db_utils.generate_pssm_matrices(input_file="../dbs/generated/missing.fasta", output_dir="../dbs/generated/missing_pssms/")

Running psiblast...
Error running psiblast: FASTA-Reader: Ignoring invalid residues at position(s): On line 2: 554, 557-565
FASTA-Reader: Ignoring invalid residues at position(s): On line 1553: 891, 894-902



In [4]:
# get all the files in the pssm directory
missing_pssm_files = os.listdir('../dbs/generated/missing_pssms/')
parsed_missing_pssms = {}
for file in tqdm(missing_pssm_files):

    # parse the pssm file to get the sequence and the pssm matrix
    # the matrix is parsed into a list of lists so that it can be serialized to a json file
    # it will be converted to a numpy array when it is used
    seq, pssm = pssm_utils.pssm_file_to_2Darray('../dbs/generated/missing_pssms/' + file)
    seq_id = int(file.split('.')[1])
    
    # check if the sequence has already been parsed
    # if yes, keep the parsing with the bigger id
    # this ensures that we keep the pssm matrix generated after all the PSI-BLAST iterations
    # (PSI-BLAST generated multiple pssm files for the same sequence)
    if seq in parsed_missing_pssms:
        if seq_id > parsed_missing_pssms[seq][1]:
            parsed_missing_pssms[seq] = (pssm, seq_id)
    else:
        parsed_missing_pssms[seq] = (pssm, seq_id)

100%|██████████| 1848/1848 [00:08<00:00, 205.76it/s]


In [11]:
# updating the parsed pssms with the missing ones
pssm_utils.update_parsed_pssms(parsed_missing_pssms)

Loaded PSSM data from JSON file.
Number of PSSMs added: 963
Updated PSSM data saved to JSON file.
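
The log above suggests a load, merge, and save cycle. The sketch below shows one way such an update could work; it is not the actual `update_parsed_pssms()` implementation. The function name and file layout are assumptions, and it expects the dictionary values to be plain matrices, so `(pssm, seq_id)` tuples would need to be reduced to the matrix first.

```python
import json
import os
import tempfile


def merge_pssms_into_json(json_path, new_pssms):
    """Add sequences that are not stored yet and return how many were added."""
    with open(json_path) as f:
        stored = json.load(f)
    added = 0
    for seq, pssm in new_pssms.items():
        if seq not in stored:  # never overwrite an existing entry
            stored[seq] = pssm
            added += 1
    with open(json_path, "w") as f:
        json.dump(stored, f)
    return added


# toy demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "parsed_pssm.json")
    with open(path, "w") as f:
        json.dump({"MK": [[1, 2], [3, 4]]}, f)
    added = merge_pssms_into_json(
        path, {"MK": [[9, 9], [9, 9]], "AV": [[0, 1], [1, 0]]}
    )
    with open(path) as f:
        merged = json.load(f)

print(f"Number of PSSMs added: {added}")
```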


## Migrating to a cloud PostgreSQL DB:

Using a DBMS to manage our data instead of a JSON file makes database operations easier and more efficient. Querying and fetching data from a large JSON file can take time, so we decided to migrate to a PostgreSQL database.

The DB will be hosted on Render, to make group work easier.

In [2]:
# Reading the DB API info from the .env file
load_dotenv()
URI = urlparse(os.getenv("DB_URI"))

In [25]:
# Connecting to the database
with psycopg2.connect(URI.geturl()) as conn:
    print("Connected to the database successfully!")

    # create the table to hold the sequences and pssm matrices
    cur = conn.cursor()
    cur.execute(
        """
        CREATE TABLE IF NOT EXISTS PSSMS (
            id SERIAL PRIMARY KEY,
            sequence TEXT NOT NULL,
            pssm smallint[][] NOT NULL);
        """
    )

    # print out the tables in the database
    cur.execute(
        """
        SELECT table_name
        FROM information_schema.tables
        WHERE table_schema = 'public'
        ORDER BY table_name;
        """
    )
    print(cur.fetchall())

    # Make sure that the db is empty
    cur.execute(
        """
        SELECT COUNT(*)
        FROM PSSMS;
        """
    )
    print(cur.fetchall())

Connected to the database successfully!
[('pssms',)]
[(0,)]


In [4]:
with open("../dbs/generated/parsed_pssm.json", "r") as f:
    parsed_pssms = json.load(f)
    print("Parsed pssms loaded successfully!")

Parsed pssms loaded successfully!


In [27]:
with psycopg2.connect(URI.geturl()) as conn:
    print("Connected to the database successfully!")

    
    # insert the parsed pssms into the database
    cur = conn.cursor()
    for seq, pssm in tqdm(parsed_pssms.items()):

        # convert the pssm matrix to a postgresql array
        pssm_string = pssm_utils.pssm_to_postgresql(pssm)

        # insert the sequence and the pssm matrix into the database
        cur.execute(
            """
            INSERT INTO PSSMS (sequence, pssm)
            VALUES (%s, %s);
            """,
            (seq, pssm_string)
        )

print("Inserted parsed pssms into the database successfully!")

Connected to the database successfully!


100%|██████████| 10037/10037 [14:00<00:00, 11.95it/s]

Inserted parsed pssms into the database successfully!
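
The `pssm_to_postgresql()` helper used in the insert loop is not shown in this notebook. A minimal sketch of what such a conversion could look like, assuming it renders the nested list as a PostgreSQL array literal (the name `to_pg_array_literal` is hypothetical):

```python
def to_pg_array_literal(matrix):
    """Render a rectangular list of lists of ints as a PostgreSQL array literal."""
    rows = ("{" + ",".join(str(int(value)) for value in row) + "}" for row in matrix)
    return "{" + ",".join(rows) + "}"


literal = to_pg_array_literal([[-1, 2, 0], [3, -4, 5]])
print(literal)  # {{-1,2,0},{3,-4,5}}
```

psycopg2 also adapts Python lists to PostgreSQL arrays on its own, so passing the nested list directly as a query parameter may work as well. Since the loop above issues one `INSERT` per row over a remote connection, batching the rows, for example with `psycopg2.extras.execute_values`, would likely shorten the 14-minute run; that has not been measured here.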





In [29]:
# test the pssm matrix retrieval
with psycopg2.connect(URI.geturl()) as conn:
    print("Connected to the database successfully!")

    # print the number of rows in the table
    cur = conn.cursor()
    cur.execute(
        """
        SELECT COUNT(*)
        FROM PSSMS;
        """
    )
    print(f"Number of rows in the table: {cur.fetchone()[0]}")

    # get the pssm matrix for the sequence with id 1
    cur = conn.cursor()
    cur.execute(
        """
        SELECT sequence, pssm FROM PSSMS WHERE id = 1;
        """
    )

    seq, pssm = cur.fetchone()

    # get the pssm matrix
    print("Sequence:")
    print(seq, end="\n\n")
    print("PSSM:")
    print(pssm)

Connected to the database successfully!
Number of rows in the table: 10037
Sequence:
MRSKGRARKLATNNECVYGNYPEIPLEEMPDADGVASTPSLNIQEPCSPATSSEAFTPKEGSPYKAPIYIPDDIPIPAEFELRESNMPGAGLGIWTKRKIEVGEKFGPYVGEQRSNLKDPSYGWEILDEFYNVKFCIDASQPDVGSWLKYIRFAGCYDQHNLVACQINDQIFYRVVADIAPGEELLLFMKSEDYPHETMAPDIHEERQYRCEDCDQLFESKAELADHQKFPCSTPHSAFSMVEEDFQQKLESENDLQEIHTIQECKECDQVFPDLQSLEKHMLSHTEEREYKCDQCPKAFNWKSNLIRHQMSHDSGKHYECENCAKVFTDPSNLQRHIRSQHVGARAHACPECGKTFATSSGLKQHKHIHSSVKPFICEVCHKSYTQFSNLCRHKRMHADCRTQIKCKDCGQMFSTTSSLNKHRRFCEGKNHFAAGGFFGQGISLPGTPAMDKTSMVNMSHANPGLADYFGANRHPAGLTFPTAPGFSFSFPGLFPSGLYHRPPLIPASSPVKGLSSTEQTNKSQSPLMTHPQILPATQDILKALSKHPSVGDNKPVELQPERSSEERPFEKISDQSESSDLDDVSTPSGSDLETTSGSDLESDIESDKEKFKENGKMFKDKVSPLQNLASINNKKEYSNHSIFSPSLEEQTAVSGAVNDSIKAIASIAEKYFGSTGLVGLQDKKVGALPYPSMFPLPFFPAFSQSMYPFPDRDLRSLPLKMEPQSPGEVKKLQKGSSESPFDLTTKRKDEKPLTPVPSKPPVTPATSQDQPLDLSMGSRSRASGTKLTEPRKNHVFGGKKGSNVESRPASDGSLQHARPTPFFMDPIYRVEKRKLTDPLEALKEKYLRPSPGFLFHPQFQLPDQRTWMSAIENMAEKLESFSALKPEASELLQSVPSMFNFRAPPNALPENLLR