# Protein-Protein Interaction Prediction

## Feature extraction:

Now that we have the protein sequences and their corresponding PSSMs (position-specific scoring matrices), we can start extracting the features used to train our model.

First, we load the parsed JSON data into NumPy arrays so that we can perform mathematical computations on it more easily.

The result is a dictionary in the format `{seq : pssm}` that we can use to extract features from either the sequence or the PSSM.
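
As a minimal sketch of that loading step, the snippet below builds a `{seq : pssm}` dictionary from JSON text. The record layout and the field names `sequence` and `pssm` are assumptions made for illustration, as is the helper name `load_pssms`; the real data may be stored differently.

```python
import json
import numpy as np

# Hypothetical JSON payload: each record holds a sequence and its PSSM
# (one row per residue, 20 columns, one per amino acid).
raw = json.dumps([
    {"sequence": "MK", "pssm": [[1] * 20, [2] * 20]},
    {"sequence": "A", "pssm": [[-1] * 20]},
])

def load_pssms(json_text):
    """Build a {sequence: PSSM array} dictionary from JSON text."""
    records = json.loads(json_text)
    return {rec["sequence"]: np.array(rec["pssm"]) for rec in records}

seq_to_pssm = load_pssms(raw)
print(seq_to_pssm["MK"].shape)  # (2, 20)
```

Keying the dictionary by sequence keeps both inputs together, so a feature can be computed from the string, the matrix, or both.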

In [5]:
import psycopg2
import os
import numpy as np
from tqdm import tqdm
from dotenv import load_dotenv
from urllib.parse import urlparse

In [3]:
# Reading the database connection URI from the .env file
load_dotenv()
URI = urlparse(os.getenv("DB_URI"))

## First feature vector: Estimation of the distribution of the protein

In [None]:
# connecting to the database
with psycopg2.connect(URI.geturl()) as conn:
    with conn.cursor() as cur:

        # one-time setup: add a new column to the table (uncomment on the first run)
        # cur.execute("ALTER TABLE PSSMS ADD COLUMN pssm_sum NUMERIC[]")
        
        # get all the proteins in the database
        print("Fetching all the proteins...")
        cur.execute("SELECT sequence, pssm FROM PSSMS")

        # fetch all the proteins
        proteins = cur.fetchall()
        print("Done!")

        # for each protein in the DB, get the sum column wise
        for protein in tqdm(proteins):
            seq, pssm = protein

            # converting the pssm to a numpy array for easier manipulation
            pssm_np = np.array(pssm)

            # summing the columns
            pssm_sum = np.sum(pssm_np, axis=0)

            # converting the numpy array to a list
            pssm_sum = pssm_sum.tolist()

            # updating the database
            cur.execute("UPDATE PSSMS SET pssm_sum = %s WHERE sequence = %s", (pssm_sum, seq))
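
To make the column-wise sum concrete without a database, here is a small sketch on a toy matrix. The values are invented for illustration, and the matrix is truncated to 4 columns for readability (a real PSSM has 20 columns, one per amino acid).

```python
import numpy as np

# Toy PSSM for a 3-residue protein: one row per residue, one column per amino acid
toy_pssm = np.array([
    [ 2, -1,  0,  3],
    [-2,  4,  1, -3],
    [ 1,  1, -1,  0],
])

# axis=0 collapses the rows (residues), leaving one total per column
column_sums = np.sum(toy_pssm, axis=0)
print(column_sums.tolist())  # [1, 4, 0, 0]
```

Whatever the protein length, the result has one entry per column, which gives every protein a feature vector of the same size.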

### Normalizing the data:

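To avoid biases, such as longer sequences yielding larger sums, we normalize each 20-element vector using the formula `d_i = (d_i - min_i) / (L * max_i)`, where `min_i` and `max_i` are the minimum and maximum of column `i` across all proteins and `L` is the length of the protein's sequence.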
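
The normalization formula above can be sketched as a standalone function, independent of the database. The function name `normalize_pssm_sums` and the toy inputs are hypothetical; the sketch takes the minimum and maximum per column across all proteins and assumes no column maximum is zero.

```python
import numpy as np

def normalize_pssm_sums(pssm_sums, lengths):
    """Apply d_i = (d_i - min_i) / (L * max_i) to each row.

    pssm_sums: (n_proteins, n_columns) array of column-wise PSSM sums
    lengths:   (n_proteins,) array of sequence lengths L
    min_i and max_i are taken per column over all proteins.
    """
    pssm_sums = np.asarray(pssm_sums, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    col_min = pssm_sums.min(axis=0)
    col_max = pssm_sums.max(axis=0)
    # lengths[:, None] broadcasts each protein's L across its row
    return (pssm_sums - col_min) / (lengths[:, None] * col_max)

# Two toy proteins with 2 columns each (real vectors have 20 entries)
sums = np.array([[2.0, 4.0], [6.0, 8.0]])
lengths = np.array([2, 4])
normalized = normalize_pssm_sums(sums, lengths)
print(normalized.tolist())
```

Here the first protein holds the minimum of both columns, so its normalized vector is all zeros. The second row becomes `4 / (4 * 6)` and `4 / (4 * 8)`.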

In [15]:
# connecting to the database
with psycopg2.connect(URI.geturl()) as conn:
    with conn.cursor() as cur:

        # get all the proteins in the database
        print("Fetching all the proteins...")
        cur.execute("SELECT sequence, pssm_sum FROM PSSMS")

        # fetch all the proteins
        proteins = cur.fetchall()
        print("Done!")
        
        # get the max and min values of each column across all proteins
        pssm_sum_matrix = np.array([protein[1] for protein in proteins])
        pssm_sum_max = np.max(pssm_sum_matrix, axis=0)
        pssm_sum_min = np.min(pssm_sum_matrix, axis=0)
        
        # normalize the pssm_sum vector
        for protein in tqdm(proteins):
            seq, pssm_sum = protein
            L = len(seq)

            # converting the pssm_sum to a numpy array for easier manipulation
            pssm_sum = np.array(pssm_sum)

            # normalizing the values (assumes no column maximum is zero)
            pssm_sum_norm = (pssm_sum - pssm_sum_min) / (pssm_sum_max * L)

            # converting the numpy array to a list
            pssm_sum_norm = pssm_sum_norm.tolist()

            # updating the database; this overwrites the raw sums in place,
            # so re-running this cell would normalize already-normalized values
            cur.execute("UPDATE PSSMS SET pssm_sum = %s WHERE sequence = %s", (pssm_sum_norm, seq))

Fetching all the proteins...
Done!


100%|██████████| 10037/10037 [13:17<00:00, 12.58it/s]


In [16]:
# connecting to the database
with psycopg2.connect(URI.geturl()) as conn:
    with conn.cursor() as cur:

        # get one protein (an arbitrary row, not a random sample) and show its pssm_sum
        cur.execute("SELECT sequence, pssm_sum FROM PSSMS LIMIT 1")

        for protein in cur.fetchall():
            
            # fetch the protein
            seq, pssm_sum = protein

            # print the protein
            print(seq)
            print()
            print(pssm_sum)


MEEVVIAGMSGKLPESENLQEFWDNLIGGVDMVTDDDRRWKAGLYGLPRRSGKLKDLSRFDASFFGVHPKQAHTMDPQLRLLLEVTYEAIVDGGINPDSLRGTHTGVWVGVSGSETSEALSRDPETLVGYSMVGCQRAMMANRLSFFFDFRGPSIALDTACSSSLMALQNAYQAIHSGQCPAAIVGGINVLLKPNTSVQFLRLGMLSPEGTCKAFDTAGNGYCRSEGVVAVLLTKKSLARRVYATILNAGTNTDGFKEQGVTFPSGDIQEQLIRSLYQSAGVAPESFEYIEAHGTGTKVGDPQELNGITRALCATRQEPLLIGSTKSNMGHPEPASGLAALAKVLLSLEHGLWAPNLHFHSPNPEIPALLDGRLQVVDQPLPVRGGNVGINSFGFGGSNVHIILRPNTQPPPAPAPHATLPRLLRASGRTPEAVQKLLEQGLRHSQDLAFLSMLNDIAAVPATAMPFRGYAVLGGERGGPEVQQVPAGERPLWFICSGMGTQWRGMGLSLMRLDRFRDSILRSDEAVKPFGLKVSQLLLSTDESTFDDIVHSFVSLTAIQIGLIDLLSCMGLRPDGIVGHSLGEVACGYADGCLSQEEAVLAAYWRGQCIKEAHLPPGAMAAVGLSWEECKQRCPPGVVPACHNSKDTVTISGPQAPVFEFVEQLRKEGVFAKEVRTGGMAFHSYFMEAIAPPLLQELKKVIREPKPRSARWLSTSIPEAQWHSSLARTSSAEYNVNNLVSPVLFQEALWHVPEHAVVLEIAPHALLQAVLKRGLKPSCTIIPLMKKDHRDNLEFFLAGIGRLHLSGIDANPNALFPPVEFPAPRGTPLISPLIKWDHSLAWDVPAAEDFPNGSGSPSAAIYNIDTSSESPDHYLVDHTLDGRVLFPATGYLSIVWKTLARALGLGVEQLPVVFEDVVLHQATILPKTGTVSLEVRLLEASRAFEVSENGNLVVSGKVYQWDDPDPRLFDHPESPTPNPTEPLFLAQAEVYKELRLRGYD

## Second feature vector: (to be determined)