### 38615: Computational Modelling and Machine Learning

### Section 1: Preparing the data for binary classification

In [2]:

#importing necessary libraries
import numpy as np
import pandas as pd

import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')

In [3]:
#loading the dataset
train_Y = pd.read_csv("/Users/srinidhi/Desktop/Datasetnew/train_y_new.csv")
train_X = pd.read_csv("/Users/srinidhi/Desktop/Datasetnew/train_X_new.csv") 
test_X = pd.read_csv("/Users/srinidhi/Desktop/Datasetnew/test_X_new.csv") 

In [3]:
#checking for missing values
print("Number of null values in Train X are - ",train_X.isnull().values.any().sum())
print("Number of null values in Train Y are - ",train_Y.isnull().values.any().sum())
print("Number of null values in Test X are - ",test_X.isnull().values.any().sum())

Number of null values in Train X are -  0
Number of null values in Train Y are -  0
Number of null values in Test X are -  0


From the output above, we can see that there is no missing values. Hence, not necessary to remove any missing values from our dataset. 

In [4]:
train_X = train_X.drop('Unnamed: 0', axis=1)
train_X = train_X.set_index(['Id'])
train_X.head()

Unnamed: 0_level_0,ConstructedAASeq_cln
Id,Unnamed: 1_level_1
11328,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...
5781,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...
13681,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...
30804,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...
30813,SKGEELFTGVVPILVELDGDVNGHTFSVSGEGEGDATYGELTLKFI...


In [5]:
test_X = test_X.set_index(['Id'])
test_X.head()

Unnamed: 0_level_0,ConstructedAASeq_cln
Id,Unnamed: 1_level_1
50579,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...
37987,SKGEELFTGVVPILVELDGDVSGHKFSVSGEGEGDATYGKLTLKFI...
53977,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...
10677,SKGEELFTGVVPILVELDGDVNGHKLSVSGEGEGDATYGKLTLKFI...
35653,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...


In [6]:
train_Y = train_Y.drop('Unnamed: 0', axis=1)
train_Y = train_Y.set_index(['Id'])
train_Y.head()

Unnamed: 0_level_0,Brightness_Class
Id,Unnamed: 1_level_1
11328,0
5781,0
13681,0
30804,0
30813,0


### Section 2: Featurization

Descriptors:
1) DPPS
2) MS-WHIM
3) Physical
4) ST-scale
5) T-scale
6) VHSE-scale
7) Z-scale

I have chose 4 predictors that I believe are the most useful for this assignment. 

Z-scale: Z-scale, which encodes various physicochemical properties of amino acids, can be a valuable choice as it offers a comprehensive representation of the amino acid sequence. It is valuable for capturing the underlying factors that influence GFP fluorescence.

Physical descriptors: Specific physical properties like charge (polarity), hydrophobicity, and size are essential for understanding how mutations in the amino acid sequence can impact the fluorescence properties of GFP.

DPPS: DPPS can be useful for capturing dipeptide-level interactions and how they influence the overall GFP structure and function, which can be important for predicting brightness.

MS-WHIM: MS-WHIM is a molecular descriptor that takes into account the three-dimensional arrangement of atoms in a molecule. It's designed to capture spatial information and is useful as the spatial arrangement of amino acids plays a key role in influencing the brightness level of GFP.

#### 1) Z-Scale

In [53]:
z_scale = pd.read_csv("/Users/srinidhi/Desktop/Datasetnew/Z-scale.csv")
z_scale.head(10)

Unnamed: 0,#from,10.1021/jm00390a003,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,Name,Z-scale,,,
1,AA_3,AA_1,Z(1),Z(2),Z(3)
2,Ala,A,0.07,-1.73,0.09
3,Arg,R,2.88,2.52,-3.44
4,Asn,N,3.22,1.45,0.84
5,Asp,D,3.64,1.13,2.36
6,Cys,C,0.71,-0.97,4.13
7,Gln,Q,2.18,0.53,-1.14
8,Glu,E,3.08,0.39,-0.07
9,Gly,G,2.23,-5.36,0.3


In [54]:
z_scale = z_scale.iloc[2:]
z_scale = z_scale.drop(columns=['#from '])
z_scale = z_scale.rename(columns={'10.1021/jm00390a003': 'Amino Acid'})

print(z_scale)

   Amino Acid Unnamed: 2 Unnamed: 3 Unnamed: 4
2           A       0.07      -1.73       0.09
3           R       2.88       2.52      -3.44
4           N       3.22       1.45       0.84
5           D       3.64       1.13       2.36
6           C       0.71      -0.97       4.13
7           Q       2.18       0.53      -1.14
8           E       3.08       0.39      -0.07
9           G       2.23      -5.36        0.3
10          H       2.41       1.74       1.11
11          I      -4.44      -1.68      -1.03
12          L      -4.19      -1.03      -0.98
13          K       2.84       1.41      -3.14
14          M      -2.49      -0.27      -0.41
15          F      -4.92        1.3       0.45
16          P      -1.22       0.88       2.23
17          S       1.96      -1.63       0.57
18          T       0.92      -2.09       -1.4
19          W      -4.75       3.65       0.85
20          Y      -1.39       2.32       0.01
21          V      -2.69      -2.53      -1.29


##### A) Combined Scores

Since there are more than 1 value for each amino acid, we will be using a combined score for each. 

This will help to combine their collective influence on the prediction, since each value represent different but complementary information about the amino acid sequences

In [55]:
z_scale['Unnamed: 2'] = pd.to_numeric(z_scale['Unnamed: 2'])
z_scale['Unnamed: 3'] = pd.to_numeric(z_scale['Unnamed: 3'])
z_scale['Unnamed: 4'] = pd.to_numeric(z_scale['Unnamed: 4'])

z_scale['Combined Score'] = z_scale['Unnamed: 2'] + z_scale['Unnamed: 3'] + z_scale['Unnamed: 4']


z_scale.head(20)

Unnamed: 0,Amino Acid,Unnamed: 2,Unnamed: 3,Unnamed: 4,Combined Score
2,A,0.07,-1.73,0.09,-1.57
3,R,2.88,2.52,-3.44,1.96
4,N,3.22,1.45,0.84,5.51
5,D,3.64,1.13,2.36,7.13
6,C,0.71,-0.97,4.13,3.87
7,Q,2.18,0.53,-1.14,1.57
8,E,3.08,0.39,-0.07,3.4
9,G,2.23,-5.36,0.3,-2.83
10,H,2.41,1.74,1.11,5.26
11,I,-4.44,-1.68,-1.03,-7.15


In [155]:
# Function to encode a sequence
def encode_sequence(sequence, df):
    encoded_sequence = np.array([df.loc[z_scale['Amino Acid'] == aa, 'Combined Score'].values[0] for aa in sequence])
    return encoded_sequence

In [180]:
def encode_sequence1(sequence, df, columns):
    encoded_sequence = []

    for aa in sequence:
        values_for_aa = df.loc[z_scale['Amino Acid'] == aa, columns].values
        encoded_sequence.extend(values_for_aa)
        
    return np.array(encoded_sequence)

In [156]:
# Apply the encoding function to each sequence in df2
train_X['Zscore_Encoded Sequence'] = train_X['ConstructedAASeq_cln'].apply(lambda x: encode_sequence(x, z_scale))
train_X['Zscore_Encoded Sequence'] = train_X['Zscore_Encoded Sequence'].apply(lambda x: [round(value,2) for value in x])

train_X.head(10)

Unnamed: 0.1,Unnamed: 0,ConstructedAASeq_cln,Id,Zscore_Encoded Sequence
0,0,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,11328,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5..."
1,1,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,5781,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5..."
2,2,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,13681,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5..."
3,3,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,30804,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5..."
4,4,SKGEELFTGVVPILVELDGDVNGHTFSVSGEGEGDATYGELTLKFI...,30813,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5..."
5,5,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,5983,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5..."
6,6,SKGEELFTGVVPILVELDGDVNGHKFSESGEGEGDATYGKLTLKFI...,20374,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5..."
7,7,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,22332,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5..."
8,8,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,17800,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5..."
9,9,SKGEELLTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,22064,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -6.2, -2.57..."


In [204]:
train_X['Zscore_NewEncoded'] = train_X['ConstructedAASeq_cln'].apply(lambda x: encode_sequence1(x, z_scale, columns=["Unnamed: 2","Unnamed: 3","Unnamed: 4"]))

##### B) Individual Scores

In [56]:
# Create a ZSCALE dictionary
zscale_dict = {}

# Iterate through unique AA in the 'Amino Acid' column
for aa in z_scale['Amino Acid']:
    # Filter rows corresponding to the current letter and extract values from other columns
    zscale_list = z_scale[z_scale['Amino Acid'] == aa][['Unnamed: 2', 'Unnamed: 3','Unnamed: 4']].values.tolist()
    
    # Add the letter and its corresponding values to the dictionary
    zscale_dict[aa] = zscale_list

print(zscale_dict)

{'A': [[0.07, -1.73, 0.09]], 'R': [[2.88, 2.52, -3.44]], 'N': [[3.22, 1.45, 0.84]], 'D': [[3.64, 1.13, 2.36]], 'C': [[0.71, -0.97, 4.13]], 'Q': [[2.18, 0.53, -1.14]], 'E': [[3.08, 0.39, -0.07]], 'G': [[2.23, -5.36, 0.3]], 'H': [[2.41, 1.74, 1.11]], 'I': [[-4.44, -1.68, -1.03]], 'L': [[-4.19, -1.03, -0.98]], 'K': [[2.84, 1.41, -3.14]], 'M': [[-2.49, -0.27, -0.41]], 'F': [[-4.92, 1.3, 0.45]], 'P': [[-1.22, 0.88, 2.23]], 'S': [[1.96, -1.63, 0.57]], 'T': [[0.92, -2.09, -1.4]], 'W': [[-4.75, 3.65, 0.85]], 'Y': [[-1.39, 2.32, 0.01]], 'V': [[-2.69, -2.53, -1.29]]}


In [57]:
zscale_dictsequence = []

for i in train_X['ConstructedAASeq_cln']:
    
    aasequence = []
    for aa in i:
        aasequence.append(zscale_dict[aa])
        
    zscale_dictsequence.append(np.array(aasequence).flatten())
    
zscale_df = pd.DataFrame(zscale_dictsequence)
zscale_df = zscale_df.set_index(train_X.index)

zscale_df.head()

Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,701,702,703,704,705,706,707,708,709,710
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
11328,1.96,-1.63,0.57,2.84,1.41,-3.14,2.23,-5.36,0.3,3.08,...,-0.07,-4.19,-1.03,-0.98,-1.39,2.32,0.01,2.84,1.41,-3.14
5781,1.96,-1.63,0.57,2.84,1.41,-3.14,2.23,-5.36,0.3,3.08,...,-0.07,-4.19,-1.03,-0.98,-1.39,2.32,0.01,2.84,1.41,-3.14
13681,1.96,-1.63,0.57,2.84,1.41,-3.14,2.23,-5.36,0.3,3.08,...,-0.07,-4.19,-1.03,-0.98,-1.39,2.32,0.01,2.84,1.41,-3.14
30804,1.96,-1.63,0.57,2.84,1.41,-3.14,2.23,-5.36,0.3,3.08,...,-0.07,-4.19,-1.03,-0.98,-1.39,2.32,0.01,2.84,1.41,-3.14
30813,1.96,-1.63,0.57,2.84,1.41,-3.14,2.23,-5.36,0.3,3.08,...,-0.07,-4.19,-1.03,-0.98,-1.39,2.32,0.01,2.84,1.41,-3.14


#### 2) Physical

In [157]:
physical = pd.read_csv("/Users/srinidhi/Desktop/Datasetnew/Physical.csv")
physical.head(10)

Unnamed: 0,#from,10.1021/acs.jcim.7b00488,Unnamed: 2,Unnamed: 3
0,Name,Physical,,
1,AA_3,AA_1,Vol,Hydro
2,Ala,A,-2.9,-1.03
3,Arg,R,2.41,1.31
4,Asn,N,-0.68,0.79
5,Asp,D,-0.92,1.23
6,Cys,C,-1.89,0.15
7,Gln,Q,0.36,1.09
8,Glu,E,0.16,1.28
9,Gly,G,-4.04,0.01


In [158]:
physical = physical.iloc[2:]
physical = physical.drop(columns=['#from '])
physical = physical.rename(columns={'10.1021/acs.jcim.7b00488': 'Amino Acid'})

physical.head(10)

Unnamed: 0,Amino Acid,Unnamed: 2,Unnamed: 3
2,A,-2.9,-1.03
3,R,2.41,1.31
4,N,-0.68,0.79
5,D,-0.92,1.23
6,C,-1.89,0.15
7,Q,0.36,1.09
8,E,0.16,1.28
9,G,-4.04,0.01
10,H,0.83,1.15
11,I,0.51,-1.32


##### A) Combined Scores

In [159]:
physical['Unnamed: 2'] = pd.to_numeric(physical['Unnamed: 2'])
physical['Unnamed: 3'] = pd.to_numeric(physical['Unnamed: 3'])

physical['Combined Score'] = physical['Unnamed: 2'] + physical['Unnamed: 3']
physical.head(10)


Unnamed: 0,Amino Acid,Unnamed: 2,Unnamed: 3,Combined Score
2,A,-2.9,-1.03,-3.93
3,R,2.41,1.31,3.72
4,N,-0.68,0.79,0.11
5,D,-0.92,1.23,0.31
6,C,-1.89,0.15,-1.74
7,Q,0.36,1.09,1.45
8,E,0.16,1.28,1.44
9,G,-4.04,0.01,-4.03
10,H,0.83,1.15,1.98
11,I,0.51,-1.32,-0.81


In [160]:
# Apply the encoding function to each sequence in df2
train_X['Physical_Encoded Sequence'] = train_X['ConstructedAASeq_cln'].apply(lambda x: encode_sequence(x, physical))
train_X['Physical_Encoded Sequence'] = train_X['Physical_Encoded Sequence'].apply(lambda x: [round(value,2) for value in x])

train_X.head(10)

Unnamed: 0.1,Unnamed: 0,ConstructedAASeq_cln,Id,Zscore_Encoded Sequence,Physical_Encoded Sequence
0,0,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,11328,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, 0.75, ..."
1,1,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,5781,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, 0.75, ..."
2,2,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,13681,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, 0.75, ..."
3,3,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,30804,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, 0.75, ..."
4,4,SKGEELFTGVVPILVELDGDVNGHTFSVSGEGEGDATYGELTLKFI...,30813,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, 0.75, ..."
5,5,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,5983,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, 0.75, ..."
6,6,SKGEELFTGVVPILVELDGDVNGHKFSESGEGEGDATYGKLTLKFI...,20374,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, 0.75, ..."
7,7,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,22332,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, 0.75, ..."
8,8,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,17800,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, 0.75, ..."
9,9,SKGEELLTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,22064,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -6.2, -2.57...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, -0.88,..."


##### B) Individual Scores

In [181]:
train_X['Physical_NewEncoded'] = train_X['ConstructedAASeq_cln'].apply(lambda x: encode_sequence1(x, physical, columns=["Unnamed: 2","Unnamed: 3"]))

In [182]:
train_X.head()

Unnamed: 0.1,Unnamed: 0,ConstructedAASeq_cln,Id,Zscore_Encoded Sequence,Physical_Encoded Sequence,MSWHIM_Encoded Sequence,Physical_NewEncoded
0,0,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,11328,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, 0.75, ...","[-1.19, 0.17, -1.34, -0.19, -0.19, -0.18, 1.27...","[[-2.36, 0.38], [0.92, 1.23], [-4.04, 0.01], [..."
1,1,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,5781,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, 0.75, ...","[-1.19, 0.17, -1.34, -0.19, -0.19, -0.18, 1.27...","[[-2.36, 0.38], [0.92, 1.23], [-4.04, 0.01], [..."
2,2,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,13681,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, 0.75, ...","[-1.19, 0.17, -1.34, -0.19, -0.19, -0.18, 1.27...","[[-2.36, 0.38], [0.92, 1.23], [-4.04, 0.01], [..."
3,3,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,30804,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, 0.75, ...","[-1.19, 0.17, -1.34, -0.19, -0.19, -0.18, 1.27...","[[-2.36, 0.38], [0.92, 1.23], [-4.04, 0.01], [..."
4,4,SKGEELFTGVVPILVELDGDVNGHTFSVSGEGEGDATYGELTLKFI...,30813,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, 0.75, ...","[-1.19, 0.17, -1.34, -0.19, -0.19, -0.18, 1.27...","[[-2.36, 0.38], [0.92, 1.23], [-4.04, 0.01], [..."


#### 3) DPPS

In [8]:
dpps = pd.read_csv("/Users/srinidhi/Desktop/Datasetnew/DPPS.csv")
dpps.head(10)

Unnamed: 0,#from,﻿10.2174/092986608786071120,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11
0,Name,DPPS,,,,,,,,,,
1,AA_3,AA_1,D1,D2,D3,D4,D5,D6,D7,D8,D9,D10
2,Ala,A,-1.02,-2.88,-0.56,0.36,-6.15,-1.68,0.04,-2.51,-1.94,-0.01
3,Arg,R,1.99,4.13,-4.41,-1.02,4.78,3.04,-9.06,6.71,4.41,0.07
4,Asn,N,-2.19,1.86,0.38,-0.13,-2.3,1.41,-5.71,-1.11,1.73,-0.19
5,Asp,D,-6.6,3.32,1.61,0.36,-3.25,1.95,-7.36,0.14,1.24,-0.15
6,Cys,C,0.21,1.12,3.42,-0.68,-2.27,-1.22,3.11,-2.98,-1.7,1.57
7,Gln,Q,-0.47,1.16,-0.57,0.69,0.39,1.93,-5.46,-0.84,1.93,0.85
8,Glu,E,-5.39,0.65,-0.98,1.39,-0.23,2.51,-6.84,-0.68,1.41,1.28
9,Gly,G,-2.86,-5,-2.97,0.53,-11.45,1.89,-2.11,-3.99,-2.16,-0.76


In [9]:
dpps = dpps.iloc[2:]
dpps = dpps.drop(columns=['#from '])
dpps.columns = ['Amino Acid', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4',
       'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8', 'Unnamed: 9',
       'Unnamed: 10', 'Unnamed: 11']

dpps.head(10)

Unnamed: 0,Amino Acid,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11
2,A,-1.02,-2.88,-0.56,0.36,-6.15,-1.68,0.04,-2.51,-1.94,-0.01
3,R,1.99,4.13,-4.41,-1.02,4.78,3.04,-9.06,6.71,4.41,0.07
4,N,-2.19,1.86,0.38,-0.13,-2.3,1.41,-5.71,-1.11,1.73,-0.19
5,D,-6.6,3.32,1.61,0.36,-3.25,1.95,-7.36,0.14,1.24,-0.15
6,C,0.21,1.12,3.42,-0.68,-2.27,-1.22,3.11,-2.98,-1.7,1.57
7,Q,-0.47,1.16,-0.57,0.69,0.39,1.93,-5.46,-0.84,1.93,0.85
8,E,-5.39,0.65,-0.98,1.39,-0.23,2.51,-6.84,-0.68,1.41,1.28
9,G,-2.86,-5.0,-2.97,0.53,-11.45,1.89,-2.11,-3.99,-2.16,-0.76
10,H,0.73,2.68,-0.66,-1.89,1.6,1.13,-1.94,-0.11,0.44,0.15
11,I,1.91,-3.13,0.01,1.14,2.7,-4.55,8.93,0.18,-1.1,-0.76


##### A) Combined Scores

In [14]:
dpps['Unnamed: 2'] = pd.to_numeric(dpps['Unnamed: 2'])
dpps['Unnamed: 3'] = pd.to_numeric(dpps['Unnamed: 3'])
dpps['Unnamed: 4'] = pd.to_numeric(dpps['Unnamed: 4'])
dpps['Unnamed: 5'] = pd.to_numeric(dpps['Unnamed: 5'])
dpps['Unnamed: 6'] = pd.to_numeric(dpps['Unnamed: 6'])
dpps['Unnamed: 7'] = pd.to_numeric(dpps['Unnamed: 7'])
dpps['Unnamed: 8'] = pd.to_numeric(dpps['Unnamed: 8'])
dpps['Unnamed: 9'] = pd.to_numeric(dpps['Unnamed: 9'])
dpps['Unnamed: 10'] = pd.to_numeric(dpps['Unnamed: 10'])
dpps['Unnamed: 11'] = pd.to_numeric(dpps['Unnamed: 11'])

dpps['Combined Score'] = dpps['Unnamed: 2'] + dpps['Unnamed: 3']+ dpps['Unnamed: 4']+ dpps['Unnamed: 5'] + dpps['Unnamed: 6'] + dpps['Unnamed: 7'] + dpps['Unnamed: 8'] + dpps['Unnamed: 9'] + dpps['Unnamed: 10'] + dpps['Unnamed: 11']
dpps.head(10)

Unnamed: 0,Amino Acid,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Combined Score
2,A,-1.02,-2.88,-0.56,0.36,-6.15,-1.68,0.04,-2.51,-1.94,-0.01,-16.35
3,R,1.99,4.13,-4.41,-1.02,4.78,3.04,-9.06,6.71,4.41,0.07,10.64
4,N,-2.19,1.86,0.38,-0.13,-2.3,1.41,-5.71,-1.11,1.73,-0.19,-6.25
5,D,-6.6,3.32,1.61,0.36,-3.25,1.95,-7.36,0.14,1.24,-0.15,-8.74
6,C,0.21,1.12,3.42,-0.68,-2.27,-1.22,3.11,-2.98,-1.7,1.57,0.58
7,Q,-0.47,1.16,-0.57,0.69,0.39,1.93,-5.46,-0.84,1.93,0.85,-0.39
8,E,-5.39,0.65,-0.98,1.39,-0.23,2.51,-6.84,-0.68,1.41,1.28,-6.88
9,G,-2.86,-5.0,-2.97,0.53,-11.45,1.89,-2.11,-3.99,-2.16,-0.76,-28.88
10,H,0.73,2.68,-0.66,-1.89,1.6,1.13,-1.94,-0.11,0.44,0.15,2.13
11,I,1.91,-3.13,0.01,1.14,2.7,-4.55,8.93,0.18,-1.1,-0.76,5.33


In [64]:
# Function to encode a sequence
def encode_sequence(sequence, df):
    encoded_sequence = np.array([df.loc[z_scale['Amino Acid'] == aa, 'Combined Score'].values[0] for aa in sequence])
    return encoded_sequence

In [194]:
# Apply the encoding function to each sequence in df2
train_X['DPPS_Encoded Sequence'] = train_X['ConstructedAASeq_cln'].apply(lambda x: encode_sequence(x, dpps))
train_X['DPPS_Encoded Sequence'] = train_X['DPPS_Encoded Sequence'].apply(lambda x: [round(value,2) for value in x])

train_X.head(10)

Unnamed: 0.1,Unnamed: 0,ConstructedAASeq_cln,Id,Zscore_Encoded Sequence,Physical_Encoded Sequence,MSWHIM_Encoded Sequence,Physical_NewEncoded,DPPS_Encoded Sequence
0,0,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,11328,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, 0.75, ...","[-1.19, 0.17, -1.34, -0.19, -0.19, -0.18, 1.27...","[[-2.36, 0.38], [0.92, 1.23], [-4.04, 0.01], [...","[-13.16, 0.11, -28.88, -6.88, -6.88, 5.32, 19...."
1,1,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,5781,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, 0.75, ...","[-1.19, 0.17, -1.34, -0.19, -0.19, -0.18, 1.27...","[[-2.36, 0.38], [0.92, 1.23], [-4.04, 0.01], [...","[-13.16, 0.11, -28.88, -6.88, -6.88, 5.32, 19...."
2,2,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,13681,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, 0.75, ...","[-1.19, 0.17, -1.34, -0.19, -0.19, -0.18, 1.27...","[[-2.36, 0.38], [0.92, 1.23], [-4.04, 0.01], [...","[-13.16, 0.11, -28.88, -6.88, -6.88, 5.32, 19...."
3,3,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,30804,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, 0.75, ...","[-1.19, 0.17, -1.34, -0.19, -0.19, -0.18, 1.27...","[[-2.36, 0.38], [0.92, 1.23], [-4.04, 0.01], [...","[-13.16, 0.11, -28.88, -6.88, -6.88, 5.32, 19...."
4,4,SKGEELFTGVVPILVELDGDVNGHTFSVSGEGEGDATYGELTLKFI...,30813,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, 0.75, ...","[-1.19, 0.17, -1.34, -0.19, -0.19, -0.18, 1.27...","[[-2.36, 0.38], [0.92, 1.23], [-4.04, 0.01], [...","[-13.16, 0.11, -28.88, -6.88, -6.88, 5.32, 19...."
5,5,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,5983,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, 0.75, ...","[-1.19, 0.17, -1.34, -0.19, -0.19, -0.18, 1.27...","[[-2.36, 0.38], [0.92, 1.23], [-4.04, 0.01], [...","[-13.16, 0.11, -28.88, -6.88, -6.88, 5.32, 19...."
6,6,SKGEELFTGVVPILVELDGDVNGHKFSESGEGEGDATYGKLTLKFI...,20374,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, 0.75, ...","[-1.19, 0.17, -1.34, -0.19, -0.19, -0.18, 1.27...","[[-2.36, 0.38], [0.92, 1.23], [-4.04, 0.01], [...","[-13.16, 0.11, -28.88, -6.88, -6.88, 5.32, 19...."
7,7,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,22332,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, 0.75, ...","[-1.19, 0.17, -1.34, -0.19, -0.19, -0.18, 1.27...","[[-2.36, 0.38], [0.92, 1.23], [-4.04, 0.01], [...","[-13.16, 0.11, -28.88, -6.88, -6.88, 5.32, 19...."
8,8,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,17800,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, 0.75, ...","[-1.19, 0.17, -1.34, -0.19, -0.19, -0.18, 1.27...","[[-2.36, 0.38], [0.92, 1.23], [-4.04, 0.01], [...","[-13.16, 0.11, -28.88, -6.88, -6.88, 5.32, 19...."
9,9,SKGEELLTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,22064,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -6.2, -2.57...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, -0.88,...","[-1.19, 0.17, -1.34, -0.19, -0.19, -0.18, -0.1...","[[-2.36, 0.38], [0.92, 1.23], [-4.04, 0.01], [...","[-13.16, 0.11, -28.88, -6.88, -6.88, 5.32, 5.3..."


##### B) Individual Scores

In [15]:
# Create a DPPS dictionary
dpps_dict = {}

# Iterate through unique AA in the 'Amino Acid' column
for aa in dpps['Amino Acid']:
    # Filter rows corresponding to the current letter and extract values from other columns
    dpps_list = dpps[dpps['Amino Acid'] == aa][['Unnamed: 2','Unnamed: 3','Unnamed: 4','Unnamed: 5','Unnamed: 6','Unnamed: 7','Unnamed: 8','Unnamed: 9','Unnamed: 10','Unnamed: 11' ]].values.flatten().tolist()
    
    # Add the letter and its corresponding values to the dictionary
    dpps_dict[aa] = dpps_list

print(dpps_dict)

{'A': [-1.02, -2.88, -0.56, 0.36, -6.15, -1.68, 0.04, -2.51, -1.94, -0.01], 'R': [1.99, 4.13, -4.41, -1.02, 4.78, 3.04, -9.06, 6.71, 4.41, 0.07], 'N': [-2.19, 1.86, 0.38, -0.13, -2.3, 1.41, -5.71, -1.11, 1.73, -0.19], 'D': [-6.6, 3.32, 1.61, 0.36, -3.25, 1.95, -7.36, 0.14, 1.24, -0.15], 'C': [0.21, 1.12, 3.42, -0.68, -2.27, -1.22, 3.11, -2.98, -1.7, 1.57], 'Q': [-0.47, 1.16, -0.57, 0.69, 0.39, 1.93, -5.46, -0.84, 1.93, 0.85], 'E': [-5.39, 0.65, -0.98, 1.39, -0.23, 2.51, -6.84, -0.68, 1.41, 1.28], 'G': [-2.86, -5.0, -2.97, 0.53, -11.45, 1.89, -2.11, -3.99, -2.16, -0.76], 'H': [0.73, 2.68, -0.66, -1.89, 1.6, 1.13, -1.94, -0.11, 0.44, 0.15], 'I': [1.91, -3.13, 0.01, 1.14, 2.7, -4.55, 8.93, 0.18, -1.1, -0.76], 'L': [1.64, -2.57, 0.0, 1.35, 2.62, -2.65, 7.72, 0.05, -1.03, -1.81], 'K': [2.47, 1.54, -4.28, -0.86, 2.77, 2.06, -6.18, 2.05, 2.19, -1.65], 'M': [1.93, -0.01, 1.21, 0.99, 2.79, -0.56, 5.33, -0.87, -0.99, -1.09], 'F': [2.68, 0.84, 2.22, 0.71, 5.02, -0.3, 8.6, 1.13, -1.4, -0.28], 'P':

In [75]:
dpps_dictsequence = []

for i in train_X['ConstructedAASeq_cln']:
    
    aasequence = []
    for aa in i:
        aasequence.append(dpps_dict[aa])
        
    dpps_dictsequence.append(np.array(aasequence).flatten())

In [11]:
dpps_df = pd.DataFrame(dpps_dictsequence)
dpps_df = dpps_df.set_index(train_X.index)

In [12]:
dpps_df.head()

Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,2360,2361,2362,2363,2364,2365,2366,2367,2368,2369
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
11328,-1.76,-0.19,1.06,-0.69,-5.72,0.14,-4.14,-2.42,-0.13,0.69,...,2.47,1.54,-4.28,-0.86,2.77,2.06,-6.18,2.05,2.19,-1.65
5781,-1.76,-0.19,1.06,-0.69,-5.72,0.14,-4.14,-2.42,-0.13,0.69,...,2.47,1.54,-4.28,-0.86,2.77,2.06,-6.18,2.05,2.19,-1.65
13681,-1.76,-0.19,1.06,-0.69,-5.72,0.14,-4.14,-2.42,-0.13,0.69,...,2.47,1.54,-4.28,-0.86,2.77,2.06,-6.18,2.05,2.19,-1.65
30804,-1.76,-0.19,1.06,-0.69,-5.72,0.14,-4.14,-2.42,-0.13,0.69,...,2.47,1.54,-4.28,-0.86,2.77,2.06,-6.18,2.05,2.19,-1.65
30813,-1.76,-0.19,1.06,-0.69,-5.72,0.14,-4.14,-2.42,-0.13,0.69,...,2.47,1.54,-4.28,-0.86,2.77,2.06,-6.18,2.05,2.19,-1.65


In [16]:
dpps_dictsequence1 = []

for i in train_X['ConstructedAASeq_cln']:
    
    aasequence = []
    for aa in i:
        aasequence.extend(dpps_dict[aa])
    
    dpps_dictsequence1.append(aasequence)
    
dpps_dictsequence2 = np.array(dpps_dictsequence1)

In [17]:
dpps_dictsequence2

array([[-1.76, -0.19,  1.06, ...,  2.05,  2.19, -1.65],
       [-1.76, -0.19,  1.06, ...,  2.05,  2.19, -1.65],
       [-1.76, -0.19,  1.06, ...,  2.05,  2.19, -1.65],
       ...,
       [-1.76, -0.19,  1.06, ...,  2.05,  2.19, -1.65],
       [-1.76, -0.19,  1.06, ...,  2.05,  2.19, -1.65],
       [-1.76, -0.19,  1.06, ...,  2.05,  2.19, -1.65]])

#### 4) MS-WHIM

In [13]:
mswhim = pd.read_csv("/Users/srinidhi/Desktop/Datasetnew/MS-WHIM.csv")
mswhim.head(10)

Unnamed: 0,#from,10.1021/ci980211b,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,Name,MS-WHIM,,,
1,AA_3,AA_1,Ist,2nd,3rd
2,Ala,A,-0.73,0.2,-0.62
3,Arg,R,-0.22,0.27,1
4,Asn,N,0.14,0.2,-0.66
5,Asp,D,0.11,-1,-0.96
6,Cys,C,-0.66,0.26,-0.27
7,Gln,Q,0.3,1,-0.3
8,Glu,E,0.24,-0.39,-0.04
9,Gly,G,-0.31,-0.28,-0.75


In [14]:
mswhim = mswhim.iloc[2:]
mswhim = mswhim.drop(columns=['#from '])
mswhim.columns = ['Amino Acid', 'Unnamed: 1', 'Unnamed: 2', 'Unnamed: 3']

mswhim.head(10)

Unnamed: 0,Amino Acid,Unnamed: 1,Unnamed: 2,Unnamed: 3
2,A,-0.73,0.2,-0.62
3,R,-0.22,0.27,1.0
4,N,0.14,0.2,-0.66
5,D,0.11,-1.0,-0.96
6,C,-0.66,0.26,-0.27
7,Q,0.3,1.0,-0.3
8,E,0.24,-0.39,-0.04
9,G,-0.31,-0.28,-0.75
10,H,0.84,0.67,-0.78
11,I,-0.91,0.83,-0.25


##### A) Combined Scores

In [163]:
mswhim['Unnamed: 1'] = pd.to_numeric(mswhim['Unnamed: 1'])
mswhim['Unnamed: 2'] = pd.to_numeric(mswhim['Unnamed: 2'])
mswhim['Unnamed: 3'] = pd.to_numeric(mswhim['Unnamed: 3'])

mswhim['Combined Score'] = mswhim['Unnamed: 1'] + mswhim['Unnamed: 2']+ mswhim['Unnamed: 3']
mswhim.head(10)

Unnamed: 0,Amino Acid,Unnamed: 1,Unnamed: 2,Unnamed: 3,Combined Score
2,A,-0.73,0.2,-0.62,-1.15
3,R,-0.22,0.27,1.0,1.05
4,N,0.14,0.2,-0.66,-0.32
5,D,0.11,-1.0,-0.96,-1.85
6,C,-0.66,0.26,-0.27,-0.67
7,Q,0.3,1.0,-0.3,1.0
8,E,0.24,-0.39,-0.04,-0.19
9,G,-0.31,-0.28,-0.75,-1.34
10,H,0.84,0.67,-0.78,0.73
11,I,-0.91,0.83,-0.25,-0.33


In [164]:
train_X['MSWHIM_Encoded Sequence'] = train_X['ConstructedAASeq_cln'].apply(lambda x: encode_sequence(x, mswhim))
train_X['MSWHIM_Encoded Sequence'] = train_X['MSWHIM_Encoded Sequence'].apply(lambda x: [round(value,2) for value in x])

train_X.head(10)

Unnamed: 0.1,Unnamed: 0,ConstructedAASeq_cln,Id,Zscore_Encoded Sequence,Physical_Encoded Sequence,MSWHIM_Encoded Sequence
0,0,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,11328,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, 0.75, ...","[-1.19, 0.17, -1.34, -0.19, -0.19, -0.18, 1.27..."
1,1,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,5781,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, 0.75, ...","[-1.19, 0.17, -1.34, -0.19, -0.19, -0.18, 1.27..."
2,2,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,13681,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, 0.75, ...","[-1.19, 0.17, -1.34, -0.19, -0.19, -0.18, 1.27..."
3,3,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,30804,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, 0.75, ...","[-1.19, 0.17, -1.34, -0.19, -0.19, -0.18, 1.27..."
4,4,SKGEELFTGVVPILVELDGDVNGHTFSVSGEGEGDATYGELTLKFI...,30813,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, 0.75, ...","[-1.19, 0.17, -1.34, -0.19, -0.19, -0.18, 1.27..."
5,5,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,5983,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, 0.75, ...","[-1.19, 0.17, -1.34, -0.19, -0.19, -0.18, 1.27..."
6,6,SKGEELFTGVVPILVELDGDVNGHKFSESGEGEGDATYGKLTLKFI...,20374,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, 0.75, ...","[-1.19, 0.17, -1.34, -0.19, -0.19, -0.18, 1.27..."
7,7,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,22332,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, 0.75, ...","[-1.19, 0.17, -1.34, -0.19, -0.19, -0.18, 1.27..."
8,8,SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,17800,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -3.17, -2.5...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, 0.75, ...","[-1.19, 0.17, -1.34, -0.19, -0.19, -0.18, 1.27..."
9,9,SKGEELLTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFI...,22064,"[0.9, 1.11, -2.83, 3.4, 3.4, -6.2, -6.2, -2.57...","[-1.98, 2.15, -4.03, 1.44, 1.44, -0.88, -0.88,...","[-1.19, 0.17, -1.34, -0.19, -0.19, -0.18, -0.1..."


##### B) Individual Scores

In [15]:
# Create a MSWHIM dictionary
mswhim_dict = {}

# Iterate through unique AA in the 'Amino Acid' column
for aa in mswhim['Amino Acid']:
    # Filter rows corresponding to the current letter and extract values from other columns
    mswhim_list = mswhim[mswhim['Amino Acid'] == aa][['Unnamed: 1', 'Unnamed: 2','Unnamed: 3']].values.tolist()
    
    # Add the letter and its corresponding values to the dictionary
    mswhim_dict[aa] = mswhim_list

print(mswhim_dict)

{'A': [['-0.73', '0.2', '-0.62']], 'R': [['-0.22', '0.27', '1']], 'N': [['0.14', '0.2', '-0.66']], 'D': [['0.11', '-1', '-0.96']], 'C': [['-0.66', '0.26', '-0.27']], 'Q': [['0.3', '1', '-0.3']], 'E': [['0.24', '-0.39', '-0.04']], 'G': [['-0.31', '-0.28', '-0.75']], 'H': [['0.84', '0.67', '-0.78']], 'I': [['-0.91', '0.83', '-0.25']], 'L': [['-0.74', '0.72', '-0.16']], 'K': [['-0.51', '0.08', '0.6']], 'M': [['-0.7', '1', '-0.32']], 'F': [['0.76', '0.85', '-0.34']], 'P': [['-0.43', '0.73', '-0.6']], 'S': [['-0.8', '0.61', '-1']], 'T': [['-0.58', '0.85', '-0.89']], 'W': [['1', '0.98', '-0.47']], 'Y': [['0.97', '0.66', '-0.16']], 'V': [['-1', '0.79', '-0.58']]}


In [16]:
mswhim_dictsequence = []

for i in train_X['ConstructedAASeq_cln']:
    
    aasequence = []
    for aa in i:
        aasequence.append(mswhim_dict[aa])
        
    mswhim_dictsequence.append(np.array(aasequence).flatten())

In [17]:
mswhim_df = pd.DataFrame(mswhim_dictsequence)
mswhim_df = mswhim_df.set_index(train_X.index)


In [18]:
mswhim_df.head()

Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,701,702,703,704,705,706,707,708,709,710
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
11328,-0.8,0.61,-1,-0.51,0.08,0.6,-0.31,-0.28,-0.75,0.24,...,-0.04,-0.74,0.72,-0.16,0.97,0.66,-0.16,-0.51,0.08,0.6
5781,-0.8,0.61,-1,-0.51,0.08,0.6,-0.31,-0.28,-0.75,0.24,...,-0.04,-0.74,0.72,-0.16,0.97,0.66,-0.16,-0.51,0.08,0.6
13681,-0.8,0.61,-1,-0.51,0.08,0.6,-0.31,-0.28,-0.75,0.24,...,-0.04,-0.74,0.72,-0.16,0.97,0.66,-0.16,-0.51,0.08,0.6
30804,-0.8,0.61,-1,-0.51,0.08,0.6,-0.31,-0.28,-0.75,0.24,...,-0.04,-0.74,0.72,-0.16,0.97,0.66,-0.16,-0.51,0.08,0.6
30813,-0.8,0.61,-1,-0.51,0.08,0.6,-0.31,-0.28,-0.75,0.24,...,-0.04,-0.74,0.72,-0.16,0.97,0.66,-0.16,-0.51,0.08,0.6


#### Trying MS-WHIM for Logistic Regression

##### A) Using Individual Scores:

In [24]:
scaler = StandardScaler() 

MSWHIM_TrainX = scaler.fit_transform(mswhim_df)
MSWHIM_Xtrain, MSWHIM_Xvalid, MSWHIM_Ytrain, MSWHIM_Yvalid = train_test_split(MSWHIM_TrainX, train_Y, test_size=0.2, random_state=42)

#### Ridge Model

In [26]:
from sklearn.model_selection import cross_val_score

# Models to evaluate
models = [
    ('Logistic Regression (L2)', LogisticRegression(penalty='l2', solver='lbfgs', random_state=42)),
    ('Logistic Regression (L2)', LogisticRegression(penalty='l2', solver='liblinear', random_state=42)),
    ('Logistic Regression (L2)', LogisticRegression(penalty='l2', solver='newton-cg', random_state=42)),
    ('Logistic Regression (L2)', LogisticRegression(penalty='l2', solver='saga', random_state=42)) 

]
# Evaluate each model
for model_name, model in models:
    scores = cross_val_score(model, MSWHIM_Xtrain, MSWHIM_Ytrain, cv=5, scoring='f1')
    print(f'{model_name} - F1 Score: {np.mean(scores)} (+/- {np.std(scores)})')

Logistic Regression (L2) - F1 Score: 0.8345162079514049 (+/- 0.005149539895373402)
Logistic Regression (L2) - F1 Score: 0.8340966508810718 (+/- 0.005279736945303123)
Logistic Regression (L2) - F1 Score: 0.8341375584334472 (+/- 0.005207035780747201)
Logistic Regression (L2) - F1 Score: 0.832117472781122 (+/- 0.0047036511967988815)


In [28]:
#trying for lbfgs solver first
c_values = [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]

# Initialize variables to store the best hyperparameters and scores for L1 penalty
best_c_log = None
best_f1_log = 0.0

# Iterate through the range of C values
for c in c_values:
    # Create a logistic regression model with L1 penalty and the current C value
    model = LogisticRegression(C=c, solver='lbfgs', random_state=42)

    # Train the model on the training data
    model.fit(MSWHIM_Xtrain, MSWHIM_Ytrain)

    # Make predictions on the validation set
    y_pred = model.predict(MSWHIM_Xvalid)

    # Compute accuracy and F1 score on the validation set
    f1 = f1_score(MSWHIM_Yvalid, y_pred)
    
    # Check if the current model has better performance
    if f1 > best_f1_log:
        best_c_log = c
        best_f1_log = f1

# Print the best hyperparameters and scores for L1 penalty
print("Best C for Logistic Regression:", best_c_log)
print("Best F1 Score on Validation Set for Logistic Regression:", best_f1_log)

Best C for Logistic Regression: 0.5
Best F1 Score on Validation Set for Logistic Regression: 0.8354781231439319


#### Elastic Model

In [52]:
c_values = [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
l1_ratio_values = [0.1, 0.3, 0.5, 0.7, 0.9]

# Initialize variables to store the best hyperparameters and scores for Elastic Net
best_c_elastic = None
best_l1_ratio_elastic = None
best_f1_elastic = 0.0
elasticf1_scores = []

# Iterate through the range of C values and l1_ratio values
for c in c_values:
    for l1_ratio in l1_ratio_values:
        # Create a logistic regression model with Elastic Net penalty and the current C and l1_ratio values
        model = LogisticRegression(C=c, penalty='elasticnet', random_state=42, solver="saga", l1_ratio=l1_ratio)

        # Train the model on the training data
        model.fit(MSWHIM_Xtrain, MSWHIM_Ytrain)

        # Make predictions on the validation set
        y_pred = model.predict(MSWHIM_Xvalid)

        # Compute accuracy and F1 score on the validation set
        f1_elastic = f1_score(MSWHIM_Yvalid, y_pred)
        elasticf1_scores.append(f1_elastic)


        # Check if the current model has better performance
        if f1_elastic > best_f1_elastic:
            best_c_elastic = c
            best_l1_ratio_elastic = l1_ratio
            best_f1_elastic = f1_elastic

# Print the best hyperparameters and scores for Elastic Net penalty
print("Best C for Elastic Net Penalty:", best_c_elastic)
print("Best l1_ratio for Elastic Net Penalty:", best_l1_ratio_elastic)
print("Best F1 Score on Validation Set for Elastic Net:", best_f1_elastic)

Best C for Elastic Net Penalty: 0.1
Best l1_ratio for Elastic Net Penalty: 0.5
Best F1 Score on Validation Set for Elastic Net: 0.8352640879638721


#### Trying Z-Score for Logistic Regression

##### A) Using Individual Scores

In [58]:
scaler = StandardScaler() 

ZSCALE_TrainX = scaler.fit_transform(zscale_df)
ZSCALE_Xtrain, ZSCALE_Xvalid, ZSCALE_Ytrain, ZSCALE_Yvalid = train_test_split(ZSCALE_TrainX, train_Y, test_size=0.2, random_state=42)

##### Ridge Model

In [59]:
# Models to evaluate
models = [
    ('Logistic Regression (L2)', LogisticRegression(penalty='l2', solver='lbfgs', random_state=42)),
    ('Logistic Regression (L2)', LogisticRegression(penalty='l2', solver='liblinear', random_state=42)),
    ('Logistic Regression (L2)', LogisticRegression(penalty='l2', solver='newton-cg', random_state=42)),
    ('Logistic Regression (L2)', LogisticRegression(penalty='l2', solver='saga', random_state=42)) 

]
# Evaluate each model
for model_name, model in models:
    scores = cross_val_score(model, ZSCALE_Xtrain, ZSCALE_Ytrain, cv=5, scoring='f1')
    print(f'{model_name} - F1 Score: {np.mean(scores)} (+/- {np.std(scores)})')

Logistic Regression (L2) - F1 Score: 0.852743767719421 (+/- 0.005042035865738226)
Logistic Regression (L2) - F1 Score: 0.8525140557045363 (+/- 0.005279906452773745)
Logistic Regression (L2) - F1 Score: 0.8526578743349411 (+/- 0.005376983304256841)
Logistic Regression (L2) - F1 Score: 0.8505125299380524 (+/- 0.004790112692413259)


In [60]:
#trying for lbfgs solver first
c_values = [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]

# Initialize variables to store the best hyperparameters and scores for L1 penalty
best_c_log = None
best_f1_log = 0.0

# Iterate through the range of C values
for c in c_values:
    # Create a logistic regression model with L1 penalty and the current C value
    model = LogisticRegression(C=c, solver='lbfgs', random_state=42)

    # Train the model on the training data
    model.fit(ZSCALE_Xtrain, ZSCALE_Ytrain)

    # Make predictions on the validation set
    y_pred = model.predict(ZSCALE_Xvalid)

    # Compute accuracy and F1 score on the validation set
    f1 = f1_score(ZSCALE_Yvalid, y_pred)
    
    # Check if the current model has better performance
    if f1 > best_f1_log:
        best_c_log = c
        best_f1_log = f1

# Print the best hyperparameters and scores for L1 penalty
print("Best C for Logistic Regression:", best_c_log)
print("Best F1 Score on Validation Set for Logistic Regression:", best_f1_log)

Best C for Logistic Regression: 0.3
Best F1 Score on Validation Set for Logistic Regression: 0.8542332268370607


##### Elastic Model

In [61]:
c_values = [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
l1_ratio_values = [0.1, 0.3, 0.5, 0.7, 0.9]

# Initialize variables to store the best hyperparameters and scores for Elastic Net
best_c_elastic = None
best_l1_ratio_elastic = None
best_f1_elastic = 0.0
elasticf1_scores = []

# Iterate through the range of C values and l1_ratio values
for c in c_values:
    for l1_ratio in l1_ratio_values:
        # Create a logistic regression model with Elastic Net penalty and the current C and l1_ratio values
        model = LogisticRegression(C=c, penalty='elasticnet', random_state=42, solver="saga", l1_ratio=l1_ratio)

        # Train the model on the training data
        model.fit(MSWHIM_Xtrain, MSWHIM_Ytrain)

        # Make predictions on the validation set
        y_pred = model.predict(MSWHIM_Xvalid)

        # Compute accuracy and F1 score on the validation set
        f1_elastic = f1_score(MSWHIM_Yvalid, y_pred)
        elasticf1_scores.append(f1_elastic)


        # Check if the current model has better performance
        if f1_elastic > best_f1_elastic:
            best_c_elastic = c
            best_l1_ratio_elastic = l1_ratio
            best_f1_elastic = f1_elastic

# Print the best hyperparameters and scores for Elastic Net penalty
print("Best C for Elastic Net Penalty:", best_c_elastic)
print("Best l1_ratio for Elastic Net Penalty:", best_l1_ratio_elastic)
print("Best F1 Score on Validation Set for Elastic Net:", best_f1_elastic)

Best C for Elastic Net Penalty: 0.1
Best l1_ratio for Elastic Net Penalty: 0.5
Best F1 Score on Validation Set for Elastic Net: 0.8352640879638721


##### B) Using Combined Scores:

In [165]:
Zscore_list = train_X['Zscore_Encoded Sequence'].tolist()
Train1_X = np.array(Zscore_list)

In [166]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler() 

Train1_X = scaler.fit_transform(Train1_X)

In [174]:
from sklearn.model_selection import train_test_split
X1_train, X1_valid, Y1_train, Y1_valid = train_test_split(Train1_X, train_Y, test_size=0.2, random_state=42)

In [175]:
X1_train = X1_train.reshape(X1_train.shape[0], -1)
X1_valid = X1_valid.reshape(X1_valid.shape[0], -1)

#### Ridge Model

In [176]:
# Define the hyperparameter grid for C values
param_grid = {
    'C': [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10],
    'solver': ['liblinear', 'newton-cg', 'lbfgs'],
}

# Create a Logistic Regression model
model = LogisticRegression(penalty='l2', random_state=42)

# Create a RandomizedSearchCV object
random_search = RandomizedSearchCV(model, param_distributions=param_grid, n_iter=100, cv=5, scoring='f1', random_state=42)

# Fit the RandomizedSearchCV object to your training data
random_search.fit(X1_train, Y1_train)

# Get the best hyperparameters
best_params = random_search.best_params_

# Get the best estimator (model)
ridge_model = random_search.best_estimator_

# Make predictions on the validation set using the best model
y_pred = ridge_model.predict(X1_valid)

# Compute F1 score on the validation set
f1_ridge = f1_score(Y1_valid, y_pred)

# Print the best hyperparameters and scores
print("Best C for Ridge Regression:", best_params['C'])
print("Best Solver for Ridge Regression:", best_params['solver'])

print("Best F1 Score on Validation Set for Ridge Regression:", f1_ridge)

Best C for Ridge Regression: 0.5
Best Solver for Ridge Regression: liblinear
Best F1 Score on Validation Set for Ridge Regression: 0.7326132272124313


#### Lasso Model

In [177]:
from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid for C values
param_grid = {
    'C': [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10],
}

# Create a Logistic Regression model
lasso_model = LogisticRegression(penalty='l1', solver='liblinear', random_state=42)

# Create a GridSearchCV object
grid_search = GridSearchCV(model, param_grid=param_grid, cv=5, scoring='f1')

# Fit the GridSearchCV object to your training data
grid_search.fit(X1_train, Y1_train)

# Get the best hyperparameters
best_params = grid_search.best_params_

# Get the best estimator (model)
lasso_model = grid_search.best_estimator_

# Make predictions on the validation set using the best model
y_pred = lasso_model.predict(X1_valid)

# Compute F1 score on the validation set
f1_z_lasso = f1_score(Y1_valid, y_pred)

# Print the best hyperparameters and scores
print("Best C for Logistic Regression:", best_params['C'])

print("Best F1 Score on Validation Set for Logistic Regression:", f1_z_lasso)

Best C for Logistic Regression: 0.5
Best F1 Score on Validation Set for Logistic Regression: 0.7327520849128127


#### Elastic Model

In [178]:
from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid for C values
param_grid = {
    'C': [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10],
    'l1_ratio' : [0.1,0.3,0.5,0.7,0.9] ,
}

# Create a Logistic Regression model
model = LogisticRegression(penalty='elasticnet', solver='saga', random_state=42)

# Create a GridSearchCV object
grid_search = GridSearchCV(model, param_grid=param_grid, cv=5, scoring='f1')

# Fit the GridSearchCV object to your training data
grid_search.fit(X1_train, Y1_train)

# Get the best hyperparameters
best_params = grid_search.best_params_

# Get the best estimator (model)
elastic_model = grid_search.best_estimator_

# Make predictions on the validation set using the best model
y_pred = elastic_model.predict(X1_valid)

# Compute F1 score on the validation set
f1_z_elastic = f1_score(Y1_valid, y_pred)

# Print the best hyperparameters and scores
print("Best C for Elastic-Net Regression:", best_params['C'])
print("Best L1 Ratio for Elastic-Net Regression:", best_params['l1_ratio']) 

print("Best F1 Score on Validation Set for Elastic-Net Regression:", f1_z_elastic)

Best C for Elastic-Net Regression: 0.5
Best L1 Ratio for Elastic-Net Regression: 0.5
Best F1 Score on Validation Set for Elastic-Net Regression: 0.7327145292669067


##### B) Using the individual values:

In [205]:
Zscore_newlist = train_X['Zscore_NewEncoded'].tolist()
Train11_X = np.array(Zscore_newlist)

In [206]:
Train11_X = Train11_X.reshape(Train11_X.shape[0], -1)

In [207]:
Train11_X = scaler.fit_transform(Train11_X)
X11_train, X11_valid, Y11_train, Y11_valid = train_test_split(Train11_X, train_Y, test_size=0.2, random_state=42)

#### Ridge Model

In [208]:
# Define the hyperparameter grid for C values
param_grid = {
    'C': [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10],
    'solver': ['liblinear', 'newton-cg', 'lbfgs'],
}

# Create a Logistic Regression model
model = LogisticRegression(penalty='l2', random_state=42)

# Create a RandomizedSearchCV object
random_search = RandomizedSearchCV(model, param_distributions=param_grid, n_iter=100, cv=5, scoring='f1', random_state=42)

# Fit the RandomizedSearchCV object to your training data
random_search.fit(X11_train, Y11_train)

# Get the best hyperparameters
best_params = random_search.best_params_

# Get the best estimator (model)
ridge_model = random_search.best_estimator_

# Make predictions on the validation set using the best model
y_pred = ridge_model.predict(X11_valid)

# Compute F1 score on the validation set
f1_ridge1 = f1_score(Y11_valid, y_pred)

# Print the best hyperparameters and scores
print("Best C for Ridge Regression:", best_params['C'])
print("Best Solver for Ridge Regression:", best_params['solver'])

print("Best F1 Score on Validation Set for Ridge Regression:", f1_ridge1)

Best C for Ridge Regression: 5
Best Solver for Ridge Regression: lbfgs
Best F1 Score on Validation Set for Ridge Regression: 0.8536


#### Trying Physical for Logistic Regression

In [183]:
Physical_list = train_X['Physical_Encoded Sequence'].tolist()
Train2_X = np.array(Physical_list)

In [184]:
Train2_X = scaler.fit_transform(Train2_X)

In [185]:
X2_train, X2_valid, Y2_train, Y2_valid = train_test_split(Train2_X, train_Y, test_size=0.2, random_state=42)

In [186]:
X2_train = X2_train.reshape(X2_train.shape[0], -1)
X2_valid = X2_valid.reshape(X2_valid.shape[0], -1)

#### Ridge Model

In [187]:
# Define the hyperparameter grid for C values
param_grid = {
    'C': [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10],
    'solver': ['liblinear', 'newton-cg', 'lbfgs'],
}

# Create a Logistic Regression model
model = LogisticRegression(penalty='l2', random_state=42)

# Create a RandomizedSearchCV object
random_search = RandomizedSearchCV(model, param_distributions=param_grid, n_iter=100, cv=5, scoring='f1', random_state=42)

# Fit the RandomizedSearchCV object to your training data
random_search.fit(X2_train, Y2_train)

# Get the best hyperparameters
best_params = random_search.best_params_

# Get the best estimator (model)
ridge_model = random_search.best_estimator_

# Make predictions on the validation set using the best model
y_pred = ridge_model.predict(X2_valid)

# Compute F1 score on the validation set
f1_phy_ridge = f1_score(Y2_valid, y_pred)

# Print the best hyperparameters and scores
print("Best C for Ridge Regression:", best_params['C'])
print("Best Solver for Ridge Regression:", best_params['solver'])

print("Best F1 Score on Validation Set for Ridge Regression:", f1_phy_ridge)

Best C for Ridge Regression: 10
Best Solver for Ridge Regression: liblinear
Best F1 Score on Validation Set for Ridge Regression: 0.6316


#### Lasso Model

In [188]:
from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid for C values
param_grid = {
    'C': [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10],
}

# Create a Logistic Regression model
lasso_model = LogisticRegression(penalty='l1', solver='liblinear', random_state=42)

# Create a GridSearchCV object
grid_search = GridSearchCV(model, param_grid=param_grid, cv=5, scoring='f1')

# Fit the GridSearchCV object to your training data
grid_search.fit(X2_train, Y2_train)

# Get the best hyperparameters
best_params = grid_search.best_params_

# Get the best estimator (model)
lasso_model = grid_search.best_estimator_

# Make predictions on the validation set using the best model
y_pred = lasso_model.predict(X2_valid)

# Compute F1 score on the validation set
f1_phy_lasso = f1_score(Y2_valid, y_pred)

# Print the best hyperparameters and scores
print("Best C for Logistic Regression:", best_params['C'])

print("Best F1 Score on Validation Set for Logistic Regression:", f1_phy_lasso)

Best C for Logistic Regression: 5
Best F1 Score on Validation Set for Logistic Regression: 0.6318527410964386


#### Elastic Model

In [189]:
from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid for C values
param_grid = {
    'C': [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10],
    'l1_ratio' : [0.1,0.3,0.5,0.7,0.9] ,
}

# Create a Logistic Regression model
model = LogisticRegression(penalty='elasticnet', solver='saga', random_state=42)

# Create a GridSearchCV object
grid_search = GridSearchCV(model, param_grid=param_grid, cv=5, scoring='f1')

# Fit the GridSearchCV object to your training data
grid_search.fit(X2_train, Y2_train)

# Get the best hyperparameters
best_params = grid_search.best_params_

# Get the best estimator (model)
elastic_model = grid_search.best_estimator_

# Make predictions on the validation set using the best model
y_pred = elastic_model.predict(X2_valid)

# Compute F1 score on the validation set
f1_phy_elastic = f1_score(Y2_valid, y_pred)

# Print the best hyperparameters and scores
print("Best C for Elastic-Net Regression:", best_params['C'])
print("Best L1 Ratio for Elastic-Net Regression:", best_params['l1_ratio']) 

print("Best F1 Score on Validation Set for Elastic-Net Regression:", f1_phy_elastic)

Best C for Elastic-Net Regression: 0.1
Best L1 Ratio for Elastic-Net Regression: 0.9
Best F1 Score on Validation Set for Elastic-Net Regression: 0.6317058468957204


#### Trying DPPS for Logistic Regression

#### A) Using Combined Scores

In [196]:
DPPS_list = train_X['DPPS_Encoded Sequence'].tolist()
Train3_X = np.array(DPPS_list)

In [197]:
Train3_X = scaler.fit_transform(Train3_X)

In [198]:
X3_train, X3_valid, Y3_train, Y3_valid = train_test_split(Train3_X, train_Y, test_size=0.2, random_state=42)

In [199]:
X3_train = X3_train.reshape(X3_train.shape[0], -1)
X3_valid = X3_valid.reshape(X3_valid.shape[0], -1)

#### Ridge Model

In [200]:
# Define the hyperparameter grid for C values
param_grid = {
    'C': [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10],
    'solver': ['liblinear', 'newton-cg', 'lbfgs'],
}

# Create a Logistic Regression model
model = LogisticRegression(penalty='l2', random_state=42)

# Create a RandomizedSearchCV object
random_search = RandomizedSearchCV(model, param_distributions=param_grid, n_iter=100, cv=5, scoring='f1', random_state=42)

# Fit the RandomizedSearchCV object to your training data
random_search.fit(X3_train, Y3_train)

# Get the best hyperparameters
best_params = random_search.best_params_

# Get the best estimator (model)
ridge_model = random_search.best_estimator_

# Make predictions on the validation set using the best model
y_pred = ridge_model.predict(X3_valid)

# Compute F1 score on the validation set
f1_dpps_ridge = f1_score(Y3_valid, y_pred)

# Print the best hyperparameters and scores
print("Best C for Ridge Regression:", best_params['C'])
print("Best Solver for Ridge Regression:", best_params['solver'])

print("Best F1 Score on Validation Set for Ridge Regression:", f1_dpps_ridge)

Best C for Ridge Regression: 7
Best Solver for Ridge Regression: lbfgs
Best F1 Score on Validation Set for Ridge Regression: 0.7470295132234572


#### Lasso Model

In [201]:
from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid for C values
param_grid = {
    'C': [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10],
}

# Create a Logistic Regression model
lasso_model = LogisticRegression(penalty='l1', solver='liblinear', random_state=42)

# Create a GridSearchCV object
grid_search = GridSearchCV(model, param_grid=param_grid, cv=5, scoring='f1')

# Fit the GridSearchCV object to your training data
grid_search.fit(X3_train, Y3_train)

# Get the best hyperparameters
best_params = grid_search.best_params_

# Get the best estimator (model)
lasso_model = grid_search.best_estimator_

# Make predictions on the validation set using the best model
y_pred = lasso_model.predict(X3_valid)

# Compute F1 score on the validation set
f1_dpps_lasso = f1_score(Y3_valid, y_pred)

# Print the best hyperparameters and scores
print("Best C for Logistic Regression:", best_params['C'])

print("Best F1 Score on Validation Set for Logistic Regression:", f1_dpps_lasso)

Best C for Logistic Regression: 7
Best F1 Score on Validation Set for Logistic Regression: 0.7470295132234572


#### Elastic Model

In [202]:
from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid for C values
param_grid = {
    'C': [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10],
    'l1_ratio' : [0.1,0.3,0.5,0.7,0.9] ,
}

# Create a Logistic Regression model
model = LogisticRegression(penalty='elasticnet', solver='saga', random_state=42)

# Create a GridSearchCV object
grid_search = GridSearchCV(model, param_grid=param_grid, cv=5, scoring='f1')

# Fit the GridSearchCV object to your training data
grid_search.fit(X3_train, Y3_train)

# Get the best hyperparameters
best_params = grid_search.best_params_

# Get the best estimator (model)
elastic_model = grid_search.best_estimator_

# Make predictions on the validation set using the best model
y_pred = elastic_model.predict(X3_valid)

# Compute F1 score on the validation set
f1_dpps_elastic = f1_score(Y3_valid, y_pred)

# Print the best hyperparameters and scores
print("Best C for Elastic-Net Regression:", best_params['C'])
print("Best L1 Ratio for Elastic-Net Regression:", best_params['l1_ratio']) 

print("Best F1 Score on Validation Set for Elastic-Net Regression:", f1_dpps_elastic)

Best C for Elastic-Net Regression: 1
Best L1 Ratio for Elastic-Net Regression: 0.3
Best F1 Score on Validation Set for Elastic-Net Regression: 0.7469833365255698


#### B) Using Individual Scores

In [29]:
DPPS_TrainX = scaler.fit_transform(dpps_df)
DPPS_Xtrain, DPPS_Xvalid, DPPS_Ytrain, DPPS_Yvalid = train_test_split(DPPS_TrainX, train_Y, test_size=0.2, random_state=42)

##### Ridge Regression

In [30]:
# Models to evaluate
models = [
    ('Logistic Regression (L2)', LogisticRegression(penalty='l2', solver='lbfgs', random_state=42)),
    ('Logistic Regression (L2)', LogisticRegression(penalty='l2', solver='liblinear', random_state=42)),
    ('Logistic Regression (L2)', LogisticRegression(penalty='l2', solver='newton-cg', random_state=42)),
    ('Logistic Regression (L2)', LogisticRegression(penalty='l2', solver='saga', random_state=42)) 

]
# Evaluate each model
for model_name, model in models:
    scores = cross_val_score(model, DPPS_Xtrain, DPPS_Ytrain, cv=5, scoring='f1')
    print(f'{model_name} - F1 Score: {np.mean(scores)} (+/- {np.std(scores)})')

Logistic Regression (L2) - F1 Score: 0.8643703629980031 (+/- 0.0032604373089134124)
Logistic Regression (L2) - F1 Score: 0.8652959827673701 (+/- 0.003273814741658868)
Logistic Regression (L2) - F1 Score: 0.8652382801012974 (+/- 0.0032761812821317533)
Logistic Regression (L2) - F1 Score: 0.8675099237559806 (+/- 0.002624734097712019)


In [32]:
#trying for lbfgs solver first
c_values = [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]

# Initialize variables to store the best hyperparameters and scores for L1 penalty
best_c_log = None
best_f1_log = 0.0

# Iterate through the range of C values
for c in c_values:
    # Create a logistic regression model with L1 penalty and the current C value
    model = LogisticRegression(C=c, solver='saga', random_state=42)

    # Train the model on the training data
    model.fit(DPPS_Xtrain, DPPS_Ytrain)

    # Make predictions on the validation set
    y_pred = model.predict(DPPS_Xvalid)

    # Compute accuracy and F1 score on the validation set
    f1 = f1_score(DPPS_Yvalid, y_pred)
    
    # Check if the current model has better performance
    if f1 > best_f1_log:
        best_c_log = c
        best_f1_log = f1

# Print the best hyperparameters and scores for L1 penalty
print("Best C for Logistic Regression:", best_c_log)
print("Best F1 Score on Validation Set for Logistic Regression:", best_f1_log)

Best C for Logistic Regression: 0.1
Best F1 Score on Validation Set for Logistic Regression: 0.8688458434221147


##### Lasso Regression

In [46]:
c_values = [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]

# Initialize variables to store the best hyperparameters and scores for L1 penalty
best_c_l1 = None
best_f1_l1 = 0.0
lassof1_scores = []

# Iterate through the range of C values
for c in c_values:
    # Create a logistic regression model with L1 penalty and the current C value
    model = LogisticRegression(C=c, penalty='l1', random_state=42, solver="liblinear")

    # Train the model on the training data
    model.fit(DPPS_Xtrain, DPPS_Ytrain)

    # Make predictions on the validation set
    y_pred = model.predict(DPPS_Xvalid)

    # Compute accuracy and F1 score on the validation set
    f1_lasso = f1_score(DPPS_Yvalid, y_pred)
    lassof1_scores.append(f1_lasso)

    # Check if the current model has better performance
    if f1_lasso > best_f1_l1:
        best_c_l1 = c
        best_f1_l1 = f1_lasso

# Print the best hyperparameters and scores for L1 penalty
print("Best C for L1 Penalty:", best_c_l1)
print("Best F1 Score on Validation Set for L1 Regression:", best_f1_l1)

Best C for L1 Penalty: 0.1
Best F1 Score on Validation Set for L1 Regression: 0.8733974358974359


##### Elastic Regression

In [47]:
c_values = [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
l1_ratio_values = [0.1, 0.3, 0.5, 0.7, 0.9]

# Initialize variables to store the best hyperparameters and scores for Elastic Net
best_c_elastic = None
best_l1_ratio_elastic = None
best_f1_elastic = 0.0
elasticf1_scores = []

# Iterate through the range of C values and l1_ratio values
for c in c_values:
    for l1_ratio in l1_ratio_values:
        # Create a logistic regression model with Elastic Net penalty and the current C and l1_ratio values
        model = LogisticRegression(C=c, penalty='elasticnet', random_state=42, solver="saga", l1_ratio=l1_ratio)

        # Train the model on the training data
        model.fit(DPPS_Xtrain, DPPS_Ytrain)

        # Make predictions on the validation set
        y_pred = model.predict(DPPS_Xvalid)

        # Compute accuracy and F1 score on the validation set
        f1_elastic = f1_score(DPPS_Yvalid, y_pred)
        elasticf1_scores.append(f1_elastic)


        # Check if the current model has better performance
        if f1_elastic > best_f1_elastic:
            best_c_elastic = c
            best_l1_ratio_elastic = l1_ratio
            best_f1_elastic = f1_elastic

# Print the best hyperparameters and scores for Elastic Net penalty
print("Best C for Elastic Net Penalty:", best_c_elastic)
print("Best l1_ratio for Elastic Net Penalty:", best_l1_ratio_elastic)
print("Best F1 Score on Validation Set for Elastic Net:", best_f1_elastic)

Best C for Elastic Net Penalty: 0.1
Best l1_ratio for Elastic Net Penalty: 0.9
Best F1 Score on Validation Set for Elastic Net: 0.8738522954091815


#### B) Using Individual Scores - with lists

In [18]:
dpps_dictsequence2

array([[-1.76, -0.19,  1.06, ...,  2.05,  2.19, -1.65],
       [-1.76, -0.19,  1.06, ...,  2.05,  2.19, -1.65],
       [-1.76, -0.19,  1.06, ...,  2.05,  2.19, -1.65],
       ...,
       [-1.76, -0.19,  1.06, ...,  2.05,  2.19, -1.65],
       [-1.76, -0.19,  1.06, ...,  2.05,  2.19, -1.65],
       [-1.76, -0.19,  1.06, ...,  2.05,  2.19, -1.65]])

In [20]:
scaler = StandardScaler() 
DPPS_list_TrainX = scaler.fit_transform(dpps_dictsequence2)
DPPSlist_Xtrain, DPPSlist_Xvalid, DPPSlist_Ytrain, DPPSlist_Yvalid = train_test_split(DPPS_list_TrainX, train_Y, test_size=0.2, random_state=42)

##### Ridge Regression

In [101]:
# Models to evaluate
models = [
    ('Logistic Regression (L2)', LogisticRegression(penalty='l2', solver='lbfgs', random_state=42)),
    ('Logistic Regression (L2)', LogisticRegression(penalty='l2', solver='liblinear', random_state=42)),
    ('Logistic Regression (L2)', LogisticRegression(penalty='l2', solver='newton-cg', random_state=42)),
    ('Logistic Regression (L2)', LogisticRegression(penalty='l2', solver='saga', random_state=42)) 

]
# Evaluate each model
for model_name, model in models:
    scores = cross_val_score(model, DPPSlist_Xtrain, DPPSlist_Ytrain, cv=5, scoring='f1')
    print(f'{model_name} - F1 Score: {np.mean(scores)} (+/- {np.std(scores)})')

Logistic Regression (L2) - F1 Score: 0.8649773922606794 (+/- 0.0031259597730482714)
Logistic Regression (L2) - F1 Score: 0.8652959827673701 (+/- 0.003273814741658868)
Logistic Regression (L2) - F1 Score: 0.8652382801012974 (+/- 0.0032761812821317533)
Logistic Regression (L2) - F1 Score: 0.8675099237559806 (+/- 0.002624734097712019)


In [102]:
#trying for lbfgs solver first
c_values = [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]

# Initialize variables to store the best hyperparameters and scores for L1 penalty
best_c_log = None
best_f1_log = 0.0

# Iterate through the range of C values
for c in c_values:
    # Create a logistic regression model with L1 penalty and the current C value
    model = LogisticRegression(C=c, solver='saga', random_state=42)

    # Train the model on the training data
    model.fit(DPPSlist_Xtrain, DPPSlist_Ytrain)

    # Make predictions on the validation set
    y_pred = model.predict(DPPSlist_Xvalid)

    # Compute accuracy and F1 score on the validation set
    f1 = f1_score(DPPSlist_Yvalid, y_pred)
    
    # Check if the current model has better performance
    if f1 > best_f1_log:
        best_c_log = c
        best_f1_log = f1

# Print the best hyperparameters and scores for L1 penalty
print("Best C for Logistic Regression:", best_c_log)
print("Best F1 Score on Validation Set for Logistic Regression:", best_f1_log)

Best C for Logistic Regression: 0.1
Best F1 Score on Validation Set for Logistic Regression: 0.8688458434221147


##### Lasso Regression

In [106]:
c_values = [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]

# Initialize variables to store the best hyperparameters and scores for L1 penalty
best_c_l1 = None
best_f1_l1 = 0.0
lassof1_scores = []

# Iterate through the range of C values
for c in c_values:
    # Create a logistic regression model with L1 penalty and the current C value
    model = LogisticRegression(C=c, penalty='l1', random_state=42, solver="liblinear")

    # Train the model on the training data
    model.fit(DPPSlist_Xtrain, DPPSlist_Ytrain)

    # Make predictions on the validation set
    y_pred = model.predict(DPPSlist_Xvalid)

    # Compute accuracy and F1 score on the validation set
    f1_lasso = f1_score(DPPSlist_Yvalid, y_pred)
    lassof1_scores.append(f1_lasso)

    # Check if the current model has better performance
    if f1_lasso > best_f1_l1:
        best_c_l1 = c
        best_f1_l1 = f1_lasso

# Print the best hyperparameters and scores for L1 penalty
print("Best C for L1 Penalty:", best_c_l1)
print("Best F1 Score on Validation Set for L1 Regression:", best_f1_l1)

Best C for L1 Penalty: 0.1
Best F1 Score on Validation Set for L1 Regression: 0.8733466933867736


##### Elastic Regression

In [107]:
c_values = [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
l1_ratio_values = [0.1, 0.3, 0.5, 0.7, 0.9]

# Initialize variables to store the best hyperparameters and scores for Elastic Net
best_c_elastic = None
best_l1_ratio_elastic = None
best_f1_elastic = 0.0
elasticf1_scores = []

# Iterate through the range of C values and l1_ratio values
for c in c_values:
    for l1_ratio in l1_ratio_values:
        # Create a logistic regression model with Elastic Net penalty and the current C and l1_ratio values
        model = LogisticRegression(C=c, penalty='elasticnet', random_state=42, solver="saga", l1_ratio=l1_ratio)

        # Train the model on the training data
        model.fit(DPPSlist_Xtrain, DPPSlist_Ytrain)

        # Make predictions on the validation set
        y_pred = model.predict(DPPSlist_Xvalid)

        # Compute accuracy and F1 score on the validation set
        f1_elastic = f1_score(DPPSlist_Yvalid, y_pred)
        elasticf1_scores.append(f1_elastic)


        # Check if the current model has better performance
        if f1_elastic > best_f1_elastic:
            best_c_elastic = c
            best_l1_ratio_elastic = l1_ratio
            best_f1_elastic = f1_elastic

# Print the best hyperparameters and scores for Elastic Net penalty
print("Best C for Elastic Net Penalty:", best_c_elastic)
print("Best l1_ratio for Elastic Net Penalty:", best_l1_ratio_elastic)
print("Best F1 Score on Validation Set for Elastic Net:", best_f1_elastic)

Best C for Elastic Net Penalty: 0.1
Best l1_ratio for Elastic Net Penalty: 0.9
Best F1 Score on Validation Set for Elastic Net: 0.8738522954091815


### Choosing the Best Model

In [209]:
summary = [
    ['Z-Scale','Ridge Regression Model', f1_ridge],
    ['Z-Scale','Lasso Regression Model', f1_z_lasso],
    ['Z-Scale','Elastic Regression Model', f1_z_elastic],
    ['Physical','Ridge Regression Model', f1_phy_ridge],
    ['Physical','Lasso Regression Model', f1_phy_lasso],
    ['Physical','Elastic Regression Model', f1_phy_elastic],    
    ['DPPS','Ridge Regression Model', f1_dpps_ridge],
    ['DPPS','Lasso Regression Model', f1_dpps_lasso],
    ['DPPS','Elastic Regression Model', f1_dpps_elastic],
]

columns = ['Descriptor', 'Model Type', 'Best F1 Score']

summary = pd.DataFrame(summary, columns=columns)
print(summary)

  Descriptor                Model Type  Best F1 Score
0    Z-Scale    Ridge Regression Model       0.732613
1    Z-Scale    Lasso Regression Model       0.732752
2    Z-Scale  Elastic Regression Model       0.732715
3   Physical    Ridge Regression Model       0.631600
4   Physical    Lasso Regression Model       0.631853
5   Physical  Elastic Regression Model       0.631706
6       DPPS    Ridge Regression Model       0.747030
7       DPPS    Lasso Regression Model       0.747030
8       DPPS  Elastic Regression Model       0.746983


In [210]:
max_value_f1 = summary['Best F1 Score'].max()
row_max_f1 = summary[summary['Best F1 Score'] == max_value_f1]

print("\nMaximum F1 Value is observed for:")
print(row_max_f1)


Maximum F1 Value is observed for:
  Descriptor              Model Type  Best F1 Score
6       DPPS  Ridge Regression Model        0.74703
7       DPPS  Lasso Regression Model        0.74703
