<h3> Training classification models </h3>

First, import the packages for the encoding strategy that converts protein sequences into numerical representations.

In [1]:
# add the parent folder to the module search path
import sys
sys.path.insert(0, '../')

#modules for numerical representations
from numerical_representation_strategies.physicochemical_properties import physicochemical_encoder
from numerical_representation_strategies.constant_values import constant_values

#pandas
import pandas as pd

The example dataset is the absorption dataset obtained from [this paper](https://academic.oup.com/bioinformatics/article-abstract/34/15/2642/4951834).

In [2]:
df_data = pd.read_csv("../../dataset_examples/demo_df_sequences.csv")


Next, prepare the encoder. The encoders are stored in the folder "inputs_encoder", and two encoding strategies are available:

- Physicochemical properties
- Topics of the physicochemical properties

For physicochemical properties, several options are available. Each property has an ID, which the method uses to find the matching column in the encoder dataset before applying the encoding.

For topics of physicochemical properties, there are only 8 options ([see the paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9329607/)).
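As a rough illustration of how a property ID can select a column that is then applied residue by residue, here is a minimal sketch. It is not the implementation of `physicochemical_encoder`: the table `toy_encoders`, the property names `PROP_A` and `PROP_B`, their values, the function `encode_sequence`, the zero padding, and the skipping of unknown residues are all assumptions made for this example.

```python
import pandas as pd

# Hypothetical encoder table: rows are residues, columns are property IDs.
# The values are placeholders, not real AAindex entries.
toy_encoders = pd.DataFrame(
    {"PROP_A": [1.0, 2.0, 3.0], "PROP_B": [0.1, 0.2, 0.3]},
    index=["A", "C", "G"],
)


def encode_sequence(sequence, encoders, property_id, max_length):
    """Map each residue to its property value and zero-pad to max_length."""
    column = encoders[property_id]  # the property ID selects the column
    # Assumption: residues missing from the table are skipped.
    values = [float(column[res]) for res in sequence if res in column.index]
    values = values[:max_length]  # truncate sequences that are too long
    return values + [0.0] * (max_length - len(values))


encoded = encode_sequence("ACGA", toy_encoders, "PROP_A", max_length=6)
print(encoded)  # [1.0, 2.0, 3.0, 1.0, 0.0, 0.0]
```

Each position of the fixed-length vector becomes one feature column, in the same spirit as the `p_0`, `p_1`, ... columns produced later in this notebook.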


In [3]:
# Use the property with ID ANDN920101. To use topics instead, pass a property
# name such as "Group_XXX"; see the input encoders for more details.
property_to_use = 'ANDN920101'
encoders = pd.read_csv("../../inputs_encoder/aaindex_encoders.csv")
encoders.index = encoders['residue']

column_with_seq = 'sequence'  # name of the dataset column that holds the sequences
column_with_response = 'm'  # name of the response column; used here only for demonstration

To instantiate the object, provide:

1. The input dataset
2. Property to use
3. The selected encoder.
4. An object instance of constant_values
5. Column with the sequence (name of column)
6. Column with the response (name of column)

In [4]:
print("Instance object")
encoding_instance = physicochemical_encoder(df_data,
                 property_to_use,
                 encoders,
                 constant_values(),
                 column_with_seq,
                 column_with_response)

Instance object


The method encoding_dataset of the class physicochemical_encoder returns the processed dataset.

In [5]:
df_data_encoding = encoding_instance.encoding_dataset()
df_data_encoding

Processing results
Creating dataset
Export dataset


Unnamed: 0,m,p_0,p_1,p_2,p_3,p_4,p_5,p_6,p_7,p_8,...,p_288,p_289,p_290,p_291,p_292,p_293,p_294,p_295,p_296,p_297
0,0,4.52,4.17,4.52,4.35,3.95,4.66,4.5,4.5,4.35,...,4.35,4.75,4.36,4.5,3.97,3.97,4.5,4.17,3.95,4.5
1,1,4.52,4.17,4.52,4.35,3.95,4.66,4.5,4.5,4.35,...,4.35,4.75,4.36,4.5,3.97,3.97,4.5,4.17,3.95,4.5
2,1,4.52,4.17,4.52,4.35,3.95,4.66,4.5,4.5,4.35,...,4.35,4.75,4.36,4.5,3.97,3.97,4.5,4.17,3.95,4.5
3,1,4.52,4.17,4.52,4.35,3.95,4.66,4.5,4.5,4.35,...,4.35,4.75,4.36,4.5,3.97,3.97,4.5,4.17,3.95,4.5
4,1,4.52,4.17,4.52,4.35,3.95,4.66,4.5,4.5,4.35,...,4.35,4.75,4.36,4.5,3.97,3.97,4.5,4.17,3.95,4.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
76,2,4.52,4.17,4.52,4.35,3.95,4.66,4.5,4.5,4.35,...,4.35,4.75,4.36,4.5,3.97,3.97,4.5,4.17,3.95,4.5
77,3,4.52,4.17,4.52,4.35,3.95,4.66,4.5,4.5,4.35,...,4.35,4.75,4.36,4.5,3.97,3.97,4.5,4.17,3.95,4.5
78,3,4.52,4.17,4.52,4.35,3.95,4.66,4.5,4.5,4.35,...,4.35,4.75,4.36,4.5,3.97,3.97,4.5,4.17,3.95,4.5
79,3,4.52,4.17,4.52,4.35,3.95,4.66,4.5,4.5,4.35,...,4.35,4.75,4.36,4.5,3.97,3.97,4.5,4.17,3.95,4.5


<h4> Training classification models </h4>

To train a classification model, first separate the dataset into the response column and the feature data used for training.

In [6]:
response_column = df_data_encoding['m']
df_data_to_train = df_data_encoding.drop(columns=['m'])

Once the input dataset is split, you can train a predictive model with the classification_models class:

1. Instantiate the object, passing the training data and the responses as parameters
2. Split the dataset into training and testing sets
3. Fit the model. Three options are available:
   1. Option 1: Random Forest
   2. Option 2: Decision Tree
   3. Option 3: Bagging
4. Obtain the performance metrics and evaluate the model
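For readers who want to see what these steps look like without the wrapper class, the sketch below performs a comparable split and fit directly with scikit-learn. It assumes that the three options correspond to the scikit-learn estimators of the same name; the internals of `classification_models` are not shown here, so the dictionary `MODEL_OPTIONS` and the synthetic data are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed mapping from option number to estimator (hypothetical, for illustration).
MODEL_OPTIONS = {
    1: RandomForestClassifier,
    2: DecisionTreeClassifier,
    3: BaggingClassifier,
}

# Synthetic stand-in for the encoded dataset: 80 samples, 3 classes.
X, y = make_classification(
    n_samples=80, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# Hold out 20% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the estimator selected by option 1 and predict on the held-out samples.
model = MODEL_OPTIONS[1](random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions.shape)
```

Switching the key in `MODEL_OPTIONS` swaps the estimator while the rest of the workflow stays the same.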

In [7]:
from training_classic_ml.class_ml_models import classification_models

class_models = classification_models(df_data_to_train, response_column)
X_train, X_test, y_train, y_test = class_models.split_to_train(test_size=0.2)

class_models.apply_model(X_train=X_train, y_train=y_train, option=1)

performances = class_models.get_performances(X_test=X_test, y_test=y_test)

# the performance metrics are returned as a dictionary
performances

  _warn_prf(average, modifier, msg_start, len(result))


{'accuracy': 0.7058823529411765,
 'precision': 0.5390374331550802,
 'recall': 0.7058823529411765,
 'f_score': 0.6088235294117647}
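The `_warn_prf` line in the output appears to be the tail of a scikit-learn warning, which is typically raised when a class receives no predicted samples and its precision is therefore undefined. The accuracy and recall values are also identical, which is what support-weighted averaging produces. The sketch below reproduces both effects on hand-made labels. It assumes weighted averaging and is not taken from `get_performances`; the labels are invented for the example.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy labels (invented): class 2 is never predicted, so its precision is undefined.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1]

# zero_division=0 sets the undefined precision to 0 without raising the warning.
precision, recall, f_score, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
accuracy = accuracy_score(y_true, y_pred)

metrics = {
    "accuracy": accuracy,
    "precision": precision,
    "recall": recall,
    "f_score": f_score,
}
print(metrics)
```

With weighted averaging, recall always equals accuracy, while precision drops because the never-predicted class contributes zero. A result like this suggests checking the class balance of the test split before trusting the scores.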