### Demonstrative notebook for encoding protein sequences

This notebook demonstrates encoding strategies for protein sequences. Specifically, it shows how to use:

- One Hot
- Ordinal Encoder
- Physicochemical Properties
- FFT-based encoder
- Frequency-based methods
- KMers-based approaches

In [1]:
import warnings
warnings.filterwarnings("ignore")

- Loading libraries

In [2]:
import sys
sys.path.insert(0, "../src/")

In [3]:
import pandas as pd
from encoder_methods.FrequencyEncoders import FrequencyEncoder
from encoder_methods.FFTEncoders import FFTEncoder
from encoder_methods.KMersEncoders import KMersEncoders
from encoder_methods.OneHotEncoder import OneHotEncoder
from encoder_methods.OrdinalEncoder import OrdinalEncoder
from encoder_methods.PhysicochemicalEncoder import PhysicochemicalEncoder

- Reading input data

The examples in this notebook use the antimicrobial dataset.

In [4]:
df_data = pd.read_csv("../raw_data/Antimicrobial/train_data.csv")
df_data.head(5)

Unnamed: 0,sequence,label
0,QEDCELCINVACTGC,0
1,MAATTTATSLFSSRLHFQNQNQGYGFPAKTPNSLQVNQIIDGRKMR...,0
2,SKGKKANKDVELARG,1
3,ADLEVVAATYVLVA,1
4,MAESPSESTSDSLSTTTSTKPAQSGTVSISSPQSHHVVFPEIPIEIVS,0


- One hot and ordinal encoder

These two methods apply simple dummy (categorical) encodings to each sequence. Here we demonstrate them with only a few examples.

In [5]:
one_hot_instance = OneHotEncoder(
    dataset=df_data, 
    sequence_column="sequence", 
    ignore_columns=["label"],
    max_length=50
)

one_hot_instance.run_process()
one_hot_instance.coded_dataset.head()

Checking canonical residues in the dataset
Estimating lenght in protein sequences
Evaluating length in protein sequences
Start encoding


Unnamed: 0,p_0,p_1,p_2,p_3,p_4,p_5,p_6,p_7,p_8,p_9,...,p_991,p_992,p_993,p_994,p_995,p_996,p_997,p_998,p_999,label
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
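
As a rough illustration of what a one-hot encoding with zero padding looks like, the sketch below maps each residue to a 20-entry indicator vector and flattens the result, which gives `max_length * 20` values (1000 for `max_length=50`, matching the `p_0` to `p_999` columns above). The alphabetical residue ordering, the truncation of long sequences, and the function name `one_hot_encode` are assumptions made for this example; the `OneHotEncoder` class may work differently internally.

```python
# Canonical residues in alphabetical order (an assumed ordering).
RESIDUES = "ACDEFGHIKLMNPQRSTVWY"
RESIDUE_INDEX = {aa: i for i, aa in enumerate(RESIDUES)}


def one_hot_encode(sequence, max_length):
    """Return a flat list of max_length * 20 zeros and ones."""
    n_residues = len(RESIDUES)
    vector = [0] * (max_length * n_residues)
    # Positions beyond the sequence end stay at zero (zero padding);
    # residues beyond max_length are dropped in this sketch.
    for position, residue in enumerate(sequence[:max_length]):
        vector[position * n_residues + RESIDUE_INDEX[residue]] = 1
    return vector


vector = one_hot_encode("QEDCELCINVACTGC", max_length=50)
print(len(vector))  # 1000
print(sum(vector))  # 15, one active entry per residue
```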


In [6]:
ordinal_encoder_instance = OrdinalEncoder(
    dataset=df_data, 
    sequence_column="sequence", 
    ignore_columns=["label"],
    max_length=50
)

ordinal_encoder_instance.run_process()
ordinal_encoder_instance.coded_dataset.head()

Checking canonical residues in the dataset
Estimating lenght in protein sequences
Evaluating length in protein sequences
Start encoding


Unnamed: 0,p_0,p_1,p_2,p_3,p_4,p_5,p_6,p_7,p_8,p_9,...,p_42,p_43,p_44,p_45,p_46,p_47,p_48,p_49,p_50,label
0,13,3,2,1,3,10,1,7,8,17,...,0,0,0,0,0,0,0,0,0.0,0
1,15,9,5,9,9,0,8,9,2,17,...,0,0,0,0,0,0,0,0,0.0,1
2,0,2,10,3,17,17,0,0,16,19,...,0,0,0,0,0,0,0,0,0.0,1
3,11,0,3,15,12,15,3,15,16,15,...,12,7,3,7,17,15,0,0,0.0,0
4,11,10,14,4,16,6,17,10,8,8,...,0,0,0,0,0,0,0,0,0.0,0
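
For comparison, an ordinal encoding replaces each residue with a single integer and pads with zeros up to `max_length`. The sketch below assumes an alphabetical residue-to-integer mapping and a hypothetical function name `ordinal_encode`; the mapping used by `OrdinalEncoder` may differ. Note that in this sketch the code 0 is shared by alanine and by padding.

```python
# Canonical residues in alphabetical order (an assumed ordering).
RESIDUES = "ACDEFGHIKLMNPQRSTVWY"
RESIDUE_INDEX = {aa: i for i, aa in enumerate(RESIDUES)}


def ordinal_encode(sequence, max_length):
    """Return a list of max_length integer codes, zero-padded."""
    encoded = [0] * max_length
    # Residues beyond max_length are dropped in this sketch.
    for position, residue in enumerate(sequence[:max_length]):
        encoded[position] = RESIDUE_INDEX[residue]
    return encoded


codes = ordinal_encode("QEDCELCINVACTGC", max_length=50)
print(codes[:5])   # [13, 3, 2, 1, 3]
print(len(codes))  # 50
```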


- KMers-based encoder

In [7]:
kmer_instance = KMersEncoders(
    dataset=df_data, 
    sequence_column="sequence", 
    ignore_columns=["label"],
    size_kmer=3
)

kmer_instance.process_dataset()
kmer_instance.coded_dataset

Checking canonical residues in the dataset
Estimating lenght in protein sequences
Evaluating length in protein sequences


Unnamed: 0,AAA,AAC,AAD,AAE,AAF,AAG,AAH,AAI,AAK,AAL,...,YYN,YYP,YYQ,YYR,YYS,YYT,YYV,YYW,YYY,label
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21215,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
21216,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
21217,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
21218,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
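
The 8000 columns above (`AAA` to `YYY`) correspond to all possible 3-mers over the 20 canonical residues. A minimal sketch of one way to build such a feature vector is shown below, using the relative frequency of overlapping k-mers. Whether `KMersEncoders` stores raw counts, relative frequencies, or another weighting is not stated here, so the normalization and the function name `kmer_frequencies` are assumptions.

```python
from collections import Counter
from itertools import product

RESIDUES = "ACDEFGHIKLMNPQRSTVWY"


def kmer_frequencies(sequence, k=3):
    """Map every possible k-mer to its relative frequency in the sequence."""
    vocabulary = ["".join(p) for p in product(RESIDUES, repeat=k)]
    # Overlapping windows of size k.
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(kmers)
    total = len(kmers)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in vocabulary}


features = kmer_frequencies("SKGKKANKDVELARG", k=3)
print(len(features))             # 8000
print(features["SKG"] == 1 / 13)  # True, 13 overlapping 3-mers in total
```

Because most of the 8000 possible 3-mers never occur in a short peptide, the resulting vectors are very sparse, which is consistent with the zeros in the table above.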


- Frequency based encoder

The parameter max_length must be set so that shorter sequences can be zero-padded to a common length. For protein variants, where all sequences have the same length, it is enough to use the length of the wild-type sequence.

In [8]:
frequency_instance = FrequencyEncoder(
    dataset=df_data, 
    sequence_column="sequence", 
    ignore_columns=["label"],
    max_length=50
)

frequency_instance.run_process()
frequency_instance.coded_dataset.head(5)

Checking canonical residues in the dataset
Estimating lenght in protein sequences
Evaluating length in protein sequences
Start encoding


Unnamed: 0,p_0,p_1,p_2,p_3,p_4,p_5,p_6,p_7,p_8,p_9,...,p_42,p_43,p_44,p_45,p_46,p_47,p_48,p_49,p_50,label
0,0.066667,0.133333,0.066667,0.266667,0.133333,0.066667,0.266667,0.066667,0.066667,0.066667,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
1,0.066667,0.266667,0.133333,0.266667,0.266667,0.133333,0.066667,0.266667,0.066667,0.066667,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
2,0.285714,0.071429,0.142857,0.071429,0.285714,0.285714,0.285714,0.285714,0.071429,0.071429,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
3,0.020833,0.041667,0.083333,0.270833,0.104167,0.270833,0.083333,0.270833,0.125,0.270833,...,0.104167,0.083333,0.083333,0.083333,0.083333,0.270833,0.0,0.0,0.0,0
4,0.055556,0.138889,0.138889,0.055556,0.027778,0.055556,0.055556,0.138889,0.055556,0.055556,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
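
The first row above is consistent with each residue being replaced by its relative frequency within its own sequence, followed by zero padding: for `QEDCELCINVACTGC` (15 residues), `Q` occurs once (1/15 ≈ 0.066667) and `C` occurs four times (4/15 ≈ 0.266667). The sketch below reproduces that behaviour; the function name `frequency_encode` is hypothetical, and the actual `FrequencyEncoder` implementation may differ in details.

```python
from collections import Counter


def frequency_encode(sequence, max_length):
    """Replace each residue by its relative frequency in the sequence."""
    counts = Counter(sequence)
    encoded = [0.0] * max_length
    # Positions beyond the sequence end stay at zero (zero padding).
    for position, residue in enumerate(sequence[:max_length]):
        encoded[position] = counts[residue] / len(sequence)
    return encoded


vector = frequency_encode("QEDCELCINVACTGC", max_length=50)
print([round(v, 6) for v in vector[:4]])  # [0.066667, 0.133333, 0.066667, 0.266667]
```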


- Physicochemical property encoders and FFT-based encoder

Physicochemical encoders depend on a CSV file that defines the encodings. In this case, the folder "input_config" contains two datasets of encoders: the first holds the AAIndex properties, and the second holds encoders generated by semantic clustering of the AAIndex property descriptions.

- Reading the input encoders. In this case, we will use the AAIndex physicochemical properties.

In [11]:
input_encoder = pd.read_csv("../input_config/aaindex_encoders.csv")
input_encoder.index = input_encoder["residue"].values
input_encoder.head()

Unnamed: 0,residue,ANDN920101,ARGP820101,ARGP820102,ARGP820103,BEGF750101,BEGF750102,BEGF750103,BHAR880101,BIGC670101,...,KARS160113,KARS160114,KARS160115,KARS160116,KARS160117,KARS160118,KARS160119,KARS160120,KARS160121,KARS160122
A,A,4.35,0.61,1.18,1.56,1.0,0.77,0.37,0.357,52.6,...,6.0,6.0,6.0,6.0,12.0,6.0,12.0,0.0,6.0,0.0
L,L,4.17,1.53,3.23,2.93,1.0,0.83,0.53,0.365,102.0,...,12.0,15.6,12.0,18.0,30.0,6.0,25.021,0.0,9.6,3.113
R,R,4.38,0.6,0.2,0.45,0.52,0.72,0.84,0.529,109.1,...,19.0,31.444,20.0,38.0,45.0,5.0,23.343,0.0,10.667,4.2
K,K,4.36,1.15,0.06,0.15,0.6,0.55,0.75,0.466,105.1,...,12.0,24.5,18.0,31.0,37.0,6.17,22.739,-0.179,10.167,1.372
N,N,4.75,0.06,0.23,0.27,0.35,0.55,0.97,0.463,75.7,...,12.0,16.5,14.0,20.0,33.007,6.6,27.708,0.0,10.0,3.0


In [12]:
physicochemical_instance = PhysicochemicalEncoder(
    dataset=df_data, 
    sequence_column="sequence", 
    ignore_columns=["label"],
    max_length=50,
    name_property="ARGP820101",
    df_properties=input_encoder
)

physicochemical_instance.run_process()
physicochemical_instance.df_data_encoded.head(5)

Checking canonical residues in the dataset
Estimating lenght in protein sequences
Evaluating length in protein sequences
Encoding and Processing results
Creating dataset
Export dataset


Unnamed: 0,p_0,p_1,p_2,p_3,p_4,p_5,p_6,p_7,p_8,p_9,...,p_42,p_43,p_44,p_45,p_46,p_47,p_48,p_49,p_50,label
0,0.47,0.46,1.07,0.47,1.53,1.07,2.22,0.06,1.32,0.61,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0
1,0.05,1.15,0.07,1.15,1.15,0.61,0.06,1.15,0.46,1.32,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1
2,0.61,0.46,1.53,0.47,1.32,1.32,0.61,0.61,0.05,1.88,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1
3,1.18,0.61,0.47,0.05,1.95,0.05,0.47,0.05,0.05,0.05,...,0.47,2.22,1.32,0.05,0.0,0.0,0.0,0.0,0,0
4,1.18,1.53,0.6,2.02,0.05,0.61,1.32,1.53,0.06,0.06,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0
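
Conceptually, a physicochemical encoding replaces each residue with its value for the chosen property and zero-pads the result to `max_length`. The sketch below uses only the five `ARGP820101` values visible in the encoder table above (A, L, R, K, N); a real run needs all 20 canonical residues. The function name `property_encode` and the toy sequence are hypothetical, and `PhysicochemicalEncoder` may apply additional steps.

```python
# ARGP820101 values for five residues, copied from the table above.
ARGP820101 = {"A": 0.61, "L": 1.53, "R": 0.60, "K": 1.15, "N": 0.06}


def property_encode(sequence, property_values, max_length):
    """Replace each residue by its property value, zero-padded."""
    encoded = [0.0] * max_length
    # Residues beyond max_length are dropped in this sketch.
    for position, residue in enumerate(sequence[:max_length]):
        encoded[position] = property_values[residue]
    return encoded


vector = property_encode("KLAN", ARGP820101, max_length=8)
print(vector)  # [1.15, 1.53, 0.61, 0.06, 0.0, 0.0, 0.0, 0.0]
```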


- Once the physicochemical encoder has been applied, an FFT-based transform can be applied to the encoded vectors to produce an alternative representation.

In [13]:
fft_instance = FFTEncoder(
    dataset=physicochemical_instance.df_data_encoded, 
    sequence_column="sequence", 
    ignore_columns=["label"]
)

fft_instance.encoding_dataset()
fft_instance.df_fft.head()

Removing columns data
Get near pow 2 value
Apply zero padding
Creating dataset
Export dataset


Unnamed: 0,p_0,p_1,p_2,p_3,p_4,p_5,p_6,p_7,p_8,p_9,...,p_23,p_24,p_25,p_26,p_27,p_28,p_29,p_30,p_31,label
0,11.54,10.877525,9.072347,6.617258,4.179077,2.433426,1.711014,1.420103,1.028057,0.732624,...,0.888172,1.319507,1.671883,2.144617,2.846879,3.585711,4.082729,4.204985,4.065425,0
1,10.45,9.734584,7.758625,4.986741,2.073977,0.872207,2.372149,3.023772,2.784621,1.966112,...,1.601796,1.887878,2.041305,1.798725,1.103576,0.716607,1.904474,3.24252,4.215803,1
2,13.64,12.654231,9.937319,6.148215,2.276297,1.859492,3.914161,4.749295,4.414119,3.400864,...,0.883138,1.609954,2.254046,2.742977,3.035804,3.079051,2.801255,2.157088,1.182592,1
3,35.59,15.184089,13.460313,8.597916,5.842053,5.081926,2.723396,3.536184,8.077508,5.816234,...,1.316577,5.831917,6.065166,5.170851,5.365212,7.449582,4.939812,3.213988,7.27859,0
4,28.44,14.080404,5.977206,4.80973,6.238447,1.91139,5.90314,3.390447,3.869664,0.804363,...,5.362877,3.878827,3.916763,2.238962,2.61777,2.291193,3.375534,5.546504,7.179156,0
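
The log messages and the 32 output columns are consistent with the following procedure: zero-pad each 51-value encoded vector to the next power of two (64), apply the FFT, and keep the magnitudes of the first half of the spectrum. The sketch below follows that reading using NumPy; it is an inference from the output, not a description of the `FFTEncoder` source, and the function name `fft_encode` is hypothetical.

```python
import numpy as np


def fft_encode(signal):
    """Zero-pad to a power of two and return the half magnitude spectrum."""
    signal = np.asarray(signal, dtype=float)
    # Smallest power of two that is >= the signal length.
    padded_length = 1 << (len(signal) - 1).bit_length()
    padded = np.zeros(padded_length)
    padded[:len(signal)] = signal
    # Magnitude of the complex spectrum; the second half mirrors the first
    # for real-valued input, so only the first half is kept.
    spectrum = np.abs(np.fft.fft(padded))
    return spectrum[:padded_length // 2]


# Toy input: five property values followed by zero padding (51 values).
signal = [0.47, 0.46, 1.07, 0.47, 1.53] + [0.0] * 46
spectrum = fft_encode(signal)
print(spectrum.shape)  # (32,)
```

In this sketch the first component equals the sum of the input values, which agrees with the first row above: the `ARGP820101` values of that sequence add up to 11.54, the value shown in `p_0`.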
