### Demonstration notebook for developing predictive models

This notebook demonstrates how to build predictive models with machine learning. Specifically, it trains classification models for antimicrobial peptides and predictive models for protein solubility.

The notebook also shows how to export models and how to use them with new data.

In [110]:
import warnings

warnings.filterwarnings("ignore")

- Loading libraries

In [111]:
import sys

sys.path.insert(0, "../../utils/")

In [112]:
import pandas as pd
import seaborn as sns

from training_models.classification_models import ClassificationModels
from training_models.performance_models import *
from training_models.regression_models import RegressionModels

- Loading datasets: the training, validation, and testing datasets are all encoded with physicochemical properties. In all cases, we remove the label (the target column) and the source column from the dataset to obtain a matrix with the model inputs and an array with the response.

In [113]:
df_train = pd.read_csv("../../data/split_data/train_data.csv")
df_val = pd.read_csv("../../data/split_data/val_data.csv")

In [114]:
train_values = df_train.drop(columns=["target", "source"]).values
train_response = df_train["target"].values

validation_values = df_val.drop(columns=["target", "source"]).values
validation_response = df_val["target"].values
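The same feature/response split is applied to every dataset, so it can be wrapped in a small helper. The following is a minimal sketch: the function name `split_features_response` and the toy columns `p0` and `p1` are hypothetical, while `target` and `source` are the column names used above.

```python
import pandas as pd


def split_features_response(df, label="target", drop=("source",)):
    """Return (X, y): the feature matrix and the response array."""
    features = df.drop(columns=[label, *drop])
    return features.values, df[label].values


# Toy frame standing in for one of the encoded datasets (hypothetical values).
toy = pd.DataFrame(
    {"p0": [9, 4], "p1": [9, 7], "target": [0, 2], "source": ["a", "b"]}
)

X, y = split_features_response(toy)
print(X.shape, y.shape)  # (2, 2) (2,)
```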

- Instantiate a ClassificationModels object to facilitate training a model

In [115]:
validation_values

array([[ 0, 10, 12, ...,  0,  0,  0],
       [ 9,  9,  9, ...,  0,  0,  0],
       [10,  9,  0, ...,  0,  0,  0],
       ...,
       [ 9,  9,  9, ...,  0,  0,  0],
       [ 4,  7,  9, ...,  0,  0,  0],
       [ 9,  9,  9, ...,  0,  0,  0]])

In [116]:
clf_model = ClassificationModels(
    X_train=train_values, X_val=validation_values, y_train=train_response, y_val=validation_response
)

- We will train a linear SVC model using k-fold cross-validation with k=5

In [117]:
clf_model.instance_linear_svc()
clf_model.process_model(kfold=True, k=5)
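The internals of `process_model` are not shown in this notebook. As a rough sketch of what 5-fold cross-validation with a linear SVC involves, the example below uses scikit-learn directly. The use of `cross_validate`, the stratified folds, the scoring names, and the synthetic dataset are all assumptions for illustration, not the helper's confirmed implementation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.svm import LinearSVC

# Synthetic three-class data standing in for the encoded matrix (assumption).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8, n_classes=3, random_state=0
)

scoring = ["f1_weighted", "recall_weighted", "precision_weighted", "accuracy"]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Fit and score the model once per fold.
scores = cross_validate(LinearSVC(max_iter=5000), X, y, cv=cv, scoring=scoring)

# Average each metric over the five folds.
cv_metrics = {name: float(scores[f"test_{name}"].mean()) for name in scoring}
print(cv_metrics)
```

Each sample is used for validation exactly once, so the averaged scores are less sensitive to one particular split than a single hold-out estimate.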

- We can inspect the performance metrics

In [118]:
clf_model.performances

{'training_metrics': {'f1_weighted': 0.41689481622576263,
  'recall_weighted': 0.45714285714285713,
  'precision_weighted': 0.44178460082015975,
  'accuracy': 0.45714285714285713},
 'validation_metrics': {'Accuracy': 0.5019011406844106,
  'Precision': 0.44000287432153,
  'Recall': 0.5019011406844106,
  'F1-score': 0.44593373887398063,
  'MCC': 0.10087367162448285,
  'Confusion Matrix': [[0.8175182481751825,
    0.11678832116788321,
    0.06569343065693431],
   [0.7411764705882353, 0.11764705882352941, 0.1411764705882353],
   [0.5609756097560976, 0.1951219512195122, 0.24390243902439024]]}}
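In the output above, every row of the confusion matrix sums to 1 and Recall equals Accuracy, which is consistent with a row-normalized matrix and support-weighted averaging. The sketch below shows how such validation metrics could be computed with scikit-learn. The toy labels and the choice of `average="weighted"` and `normalize="true"` are assumptions, since the helper's code is not shown here.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    matthews_corrcoef,
    precision_score,
    recall_score,
)

# Toy three-class labels (hypothetical, not the notebook's data).
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 0, 1, 2, 0, 2, 2]

metrics = {
    "Accuracy": accuracy_score(y_true, y_pred),
    "Precision": precision_score(y_true, y_pred, average="weighted", zero_division=0),
    "Recall": recall_score(y_true, y_pred, average="weighted", zero_division=0),
    "F1-score": f1_score(y_true, y_pred, average="weighted", zero_division=0),
    "MCC": matthews_corrcoef(y_true, y_pred),
    # normalize="true" divides each row by the number of true samples in that class.
    "Confusion Matrix": confusion_matrix(y_true, y_pred, normalize="true").tolist(),
}
print(metrics)
```

Support-weighted recall always equals accuracy, because weighting each class recall by its share of samples reduces to the overall fraction of correct predictions.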

In [119]:
train = clf_model.performances["training_metrics"]
valid = clf_model.performances["validation_metrics"]
# pop() returns the confusion matrix and removes it from the stored metrics in place
valid.pop("Confusion Matrix", None)

[[0.8175182481751825, 0.11678832116788321, 0.06569343065693431],
 [0.7411764705882353, 0.11764705882352941, 0.1411764705882353],
 [0.5609756097560976, 0.1951219512195122, 0.24390243902439024]]

In [120]:
rename_map = {
    "f1_weighted": "F1-score",
    "recall_weighted": "Recall",
    "precision_weighted": "Precision",
    "accuracy": "Accuracy"
}

In [121]:
train_r = {rename_map.get(k, k): v for k, v in train.items()}

In [122]:
clf_model.performances

{'training_metrics': {'f1_weighted': 0.41689481622576263,
  'recall_weighted': 0.45714285714285713,
  'precision_weighted': 0.44178460082015975,
  'accuracy': 0.45714285714285713},
 'validation_metrics': {'Accuracy': 0.5019011406844106,
  'Precision': 0.44000287432153,
  'Recall': 0.5019011406844106,
  'F1-score': 0.44593373887398063,
  'MCC': 0.10087367162448285}}
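This second `clf_model.performances` output no longer lists the confusion matrix: `valid` refers to the same dictionary that is stored inside `performances`, so `pop` removes the entry there as well. If the matrix is still needed later, one option is to work on a copy. A minimal sketch with placeholder values (the `performances` dictionary and its numbers below are illustrative, not the notebook's results):

```python
import copy

# Placeholder results with the same layout as clf_model.performances.
performances = {
    "training_metrics": {"accuracy": 0.46},
    "validation_metrics": {
        "Accuracy": 0.50,
        "Confusion Matrix": [[1.0, 0.0], [0.0, 1.0]],
    },
}

# Work on a deep copy so the stored results stay intact.
valid = copy.deepcopy(performances["validation_metrics"])
confusion = valid.pop("Confusion Matrix", None)

print("Confusion Matrix" in valid)  # False
print("Confusion Matrix" in performances["validation_metrics"])  # True
```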

In [123]:
valid

{'Accuracy': 0.5019011406844106,
 'Precision': 0.44000287432153,
 'Recall': 0.5019011406844106,
 'F1-score': 0.44593373887398063,
 'MCC': 0.10087367162448285}

In [124]:
train_r

{'F1-score': 0.41689481622576263,
 'Recall': 0.45714285714285713,
 'Precision': 0.44178460082015975,
 'Accuracy': 0.45714285714285713}

In [125]:
df_metrics = pd.DataFrame({
    "Training": train_r,
    "Validation": valid
})

In [126]:
df_metrics

| Metric | Training | Validation |
|---|---|---|
| F1-score | 0.416895 | 0.445934 |
| Recall | 0.457143 | 0.501901 |
| Precision | 0.441785 | 0.440003 |
| Accuracy | 0.457143 | 0.501901 |
| MCC | NaN | 0.100874 |
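The MCC entry is missing for Training because the training metrics dictionary has no MCC key. When a DataFrame is built from a dictionary of dictionaries, pandas aligns the rows on the union of the inner keys and fills the gaps with NaN. A small sketch with rounded placeholder values shows the behavior, along with one way to keep only the metrics available for both splits:

```python
import pandas as pd

# Rounded placeholder metrics (illustrative values).
train_r = {"F1-score": 0.42, "Accuracy": 0.46}
valid = {"F1-score": 0.45, "Accuracy": 0.50, "MCC": 0.10}

# Inner keys become the index; keys missing from one dictionary yield NaN.
df = pd.DataFrame({"Training": train_r, "Validation": valid})
print(df)

# Keep only the metrics reported for both splits.
shared = df.dropna()
print(shared)
```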
