# Protein-Protein Interaction Prediction

## Feature extraction:

Now that we have the protein sequences and their corresponding PSSMs (position-specific scoring matrices), we can start extracting the features used to train our model.

First, we load the JSON-parsed data into NumPy arrays so that we can perform mathematical computations on it more easily.

The result is a dictionary in the format `{seq : pssm}` that we can use to extract features from either the sequence or its PSSM.
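
As a rough illustration of that conversion, the sketch below builds a `{seq : pssm}` dictionary from `(sequence, pssm)` pairs. The helper name `rows_to_pssm_dict` and the toy rows are hypothetical, and it assumes each PSSM arrives either as a JSON string or as an already-parsed nested list with one row per residue; the actual storage format in the database may differ.

```python
import json

import numpy as np


def rows_to_pssm_dict(rows):
    """Map each sequence to its PSSM as a 2-D float array.

    Assumption: every row is a (sequence, pssm) pair, where pssm is either
    a JSON string or an already-parsed nested list (one row per residue).
    """
    pssm_by_seq = {}
    for seq, pssm in rows:
        # Parse only when the driver handed back raw JSON text.
        if isinstance(pssm, str):
            pssm = json.loads(pssm)
        pssm_by_seq[seq] = np.asarray(pssm, dtype=float)
    return pssm_by_seq


# Toy rows standing in for the result of a database query; values are made up.
toy_rows = [
    ("MRS", "[[1, -2, 0], [0, 3, -1], [2, 0, 1]]"),
    ("KG", [[0, 1, 2], [-1, 0, 4]]),
]
pssm_by_seq = rows_to_pssm_dict(toy_rows)
print(pssm_by_seq["MRS"].shape)
```

Keying the dictionary by sequence means duplicate sequences collapse into a single entry, which is usually what we want when each sequence has exactly one PSSM.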

In [1]:
import psycopg2
import os
from dotenv import load_dotenv
from urllib.parse import urlparse
import numpy as np
import json

In [2]:
# Reading the DB API info from the .env file
load_dotenv()
URI = urlparse(os.getenv("DB_URI"))

In [3]:
# connecting to the database
with psycopg2.connect(URI.geturl()) as conn:
    with conn.cursor() as cur:
        
        # getting the tables
        cur.execute("SELECT table_name FROM information_schema.tables WHERE table_schema='public'")

        # getting the table names
        print("Available tables:")
        for table in cur.fetchall():
            print(table[0])
        
        # getting the number of rows in the PSSMS table
        cur.execute("SELECT COUNT(*) FROM pssms")
        print(f"\nNumber of rows in the PSSMS table: {cur.fetchone()[0]}", end="\n\n")

        # Printing an example of a sequence and its corresponding PSSM
        cur.execute("SELECT sequence, pssm FROM pssms LIMIT 1")
        seq, pssm = cur.fetchone()
        print("Example of a PSSM:")
        print("Sequence:")
        print(seq)
        print("PSSM:")
        print(pssm)

Available tables:
pssms

Number of rows in the PSSMS table: 10037

Example of a PSSM:
Sequence:
MRSKGRARKLATNNECVYGNYPEIPLEEMPDADGVASTPSLNIQEPCSPATSSEAFTPKEGSPYKAPIYIPDDIPIPAEFELRESNMPGAGLGIWTKRKIEVGEKFGPYVGEQRSNLKDPSYGWEILDEFYNVKFCIDASQPDVGSWLKYIRFAGCYDQHNLVACQINDQIFYRVVADIAPGEELLLFMKSEDYPHETMAPDIHEERQYRCEDCDQLFESKAELADHQKFPCSTPHSAFSMVEEDFQQKLESENDLQEIHTIQECKECDQVFPDLQSLEKHMLSHTEEREYKCDQCPKAFNWKSNLIRHQMSHDSGKHYECENCAKVFTDPSNLQRHIRSQHVGARAHACPECGKTFATSSGLKQHKHIHSSVKPFICEVCHKSYTQFSNLCRHKRMHADCRTQIKCKDCGQMFSTTSSLNKHRRFCEGKNHFAAGGFFGQGISLPGTPAMDKTSMVNMSHANPGLADYFGANRHPAGLTFPTAPGFSFSFPGLFPSGLYHRPPLIPASSPVKGLSSTEQTNKSQSPLMTHPQILPATQDILKALSKHPSVGDNKPVELQPERSSEERPFEKISDQSESSDLDDVSTPSGSDLETTSGSDLESDIESDKEKFKENGKMFKDKVSPLQNLASINNKKEYSNHSIFSPSLEEQTAVSGAVNDSIKAIASIAEKYFGSTGLVGLQDKKVGALPYPSMFPLPFFPAFSQSMYPFPDRDLRSLPLKMEPQSPGEVKKLQKGSSESPFDLTTKRKDEKPLTPVPSKPPVTPATSQDQPLDLSMGSRSRASGTKLTEPRKNHVFGGKKGSNVESRPASDGSLQHARPTPFFMDPIYRVEKRKLTDPLEALKEKYLRPSPGFLFHPQFQLPDQRTWMSAIENMAEKLESFSALKPEASELLQSVPSMFNFRA