# MLACP Case Study

This section presents a comparative analysis that demonstrates the application and performance of ProPythia on sequence-based prediction problems. The second case study concerns anticancer peptides (ACPs) and attempts to replicate the study by Manavalan et al., “MLACP: machine-learning-based prediction of anticancer peptides”.

In that publication, support vector machine (SVM) and random forest (RF) models were developed. The features used to predict ACPs were calculated from the amino acid sequence, including amino acid composition (AAC), dipeptide composition (DPC), and atomic composition (ATC), and from physicochemical properties (PCP). The Tyagi-B datasets were used to train the models (freely available at: http://www.thegleelab.org/MLACPData.html).

B. Manavalan, S. Basith, T. Hwan Shin, S. Choi, M. Ok Kim, and G. Lee, “MLACP: machine-learning-based prediction of anticancer peptides,” Oncotarget, vol. 8, no. 44, pp. 77121–77136, 2017.

In [4]:
import csv
import pandas as pd
from propythia.sequence import ReadSequence
from propythia.descriptors import Descriptor
from propythia.machine_learning import MachineLearning
from sklearn.metrics import make_scorer, accuracy_score, recall_score, confusion_matrix

1. CREATION OF DATASETS

The two Tyagi-B files (187 ACP and 398 non-ACP sequences) are merged into a single file that contains only the sequences, with the '>' header lines removed.

In [20]:
def create_dataset():
    acp_data = r'datasets/Tyagi-B-positive_ori.txt'  # 187 sequences
    non_acp_data = r'datasets/Tyagi-B-negative_ori.txt'  # 398 sequences
    dataset_out = r'datasets/test_MLACP.csv'

    # Write the ACP sequences first, then the non-ACP ones;
    # add_features_article() relies on this order when it assigns the labels.
    with open(dataset_out, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=' ',
                            quotechar='|', quoting=csv.QUOTE_MINIMAL)
        for fasta_file in (acp_data, non_acp_data):
            with open(fasta_file, newline='') as infile:
                reader = csv.reader(infile, delimiter=' ', quotechar='|')
                for row in reader:
                    # keep only sequences: skip blank lines and '>' header lines
                    if row and not row[0].startswith('>'):
                        writer.writerow(row)

    # sanity check: count the sequences written
    with open(dataset_out, 'r', newline='') as csvfile:
        reader = csv.reader(csvfile, delimiter=' ', quotechar='|')
        row_count = sum(1 for row in reader)
        print(row_count)

create_dataset()

585
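
The same filtering can be sketched without the csv module. The snippet below is a minimal illustration rather than the notebook's code: `read_sequences` is a hypothetical helper, and the two inline strings are made-up stand-ins for the Tyagi-B files. It keeps each label next to its sequence, which avoids depending on row order when the labels are assigned later.

```python
import io

def read_sequences(handle):
    """Return the sequence lines of a FASTA-like file, skipping '>' headers."""
    sequences = []
    for line in handle:
        line = line.strip()
        if line and not line.startswith('>'):
            sequences.append(line)
    return sequences

# Made-up miniature inputs standing in for the positive and negative files.
positive = io.StringIO(">acp_1\nGLFDIVKKVV\n>acp_2\nFAKKLAKKLL\n")
negative = io.StringIO(">non_acp_1\nMKTAYIAKQR\n")

# Pair every sequence with its label instead of relying on row order.
records = [(seq, 'ACP') for seq in read_sequences(positive)]
records += [(seq, 'non_ACP') for seq in read_sequences(negative)]
print(len(records))
```

With the real files, the two `io.StringIO` objects would be replaced by opened file handles.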


2. ADD FEATURES


Because the physicochemical properties defined in the article differ from those available in the package, only the AAC, DPC, and ATC features were considered when comparing results.
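
For orientation, AAC and DPC can be sketched in plain Python. This is an illustration under assumptions, not ProPythia's implementation: the function names `aac` and `dpc` are hypothetical, and the normalization (percentage, rounded to two decimals) is inferred from the feature values the notebook prints for the first sequence, such as 'AC': 6.9 and 'AG': 3.45.

```python
from itertools import product

# Same residue order as the feature names printed by the notebook.
AMINO_ACIDS = 'ARNDCEQGHILKMFPSTWYV'

def aac(sequence):
    """Amino acid composition: percentage of each residue in the sequence."""
    return {aa: round(100 * sequence.count(aa) / len(sequence), 2)
            for aa in AMINO_ACIDS}

def dpc(sequence):
    """Dipeptide composition: percentage of each overlapping residue pair."""
    pairs = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    return {a + b: round(100 * pairs.count(a + b) / len(pairs), 2)
            for a, b in product(AMINO_ACIDS, repeat=2)}

seq = 'ACYCRIPACIAGERRYGTCIYQGRLWAFCC'
features = dpc(seq)
print(len(aac(seq)), len(features))    # 20 AAC features, 400 DPC features
print(features['AC'], features['AG'])  # 6.9 3.45
```

The sequence has 30 residues and therefore 29 overlapping dipeptides, so a dipeptide seen once contributes 100/29 ≈ 3.45 and one seen twice contributes 6.9.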


In [21]:
def add_features_article():
    dataset_in = r'datasets/test_MLACP.csv'
    rows_list = []  # one dict of features per sequence

    # read the sequences, one per row (same delimiter used to write the file)
    with open(dataset_in) as csvfile:
        reader = csv.reader(csvfile, delimiter=' ', quotechar='|')
        for row in reader:
            res = {'sequence': row[0]}
            sequence = ReadSequence()  # sequence reader object
            ps = sequence.read_protein_sequence(row[0])
            protein = Descriptor(ps)  # object used to calculate descriptors
            # calculate AAC, DPC and ATC; the PCP of the article are replaced
            # by the ones available in the package
            feature = protein.adaptable([21, 6])
            res.update(feature)
            print(res)
            rows_list.append(res)

    df = pd.DataFrame(rows_list)
    df.set_index(['sequence'], inplace=True)

    # add labels; relies on the file listing the 187 ACPs before the 398 non-ACPs
    labels = ['ACP'] * 187 + ['non_ACP'] * 398
    df['labels'] = labels

    # index=False drops the sequence index, so only features and labels are saved
    dataset_out = r'datasets/test_MLACP_ART_dpc_atc.csv'
    df.to_csv(dataset_out, index=False)
    print(df.shape)


add_features_article()

{'sequence': 'ACYCRIPACIAGERRYGTCIYQGRLWAFCC', 'AA': 0.0, 'AR': 0.0, 'AN': 0.0, 'AD': 0.0, 'AC': 6.9, 'AE': 0.0, 'AQ': 0.0, 'AG': 3.45, 'AH': 0.0, 'AI': 0.0, 'AL': 0.0, 'AK': 0.0, 'AM': 0.0, 'AF': 3.45, 'AP': 0.0, 'AS': 0.0, 'AT': 0.0, 'AW': 0.0, 'AY': 0.0, 'AV': 0.0, 'RA': 0.0, 'RR': 3.45, 'RN': 0.0, 'RD': 0.0, 'RC': 0.0, 'RE': 0.0, 'RQ': 0.0, 'RG': 0.0, 'RH': 0.0, 'RI': 3.45, 'RL': 3.45, 'RK': 0.0, 'RM': 0.0, 'RF': 0.0, 'RP': 0.0, 'RS': 0.0, 'RT': 0.0, 'RW': 0.0, 'RY': 3.45, 'RV': 0.0, 'NA': 0.0, 'NR': 0.0, 'NN': 0.0, 'ND': 0.0, 'NC': 0.0, 'NE': 0.0, 'NQ': 0.0, 'NG': 0.0, 'NH': 0.0, 'NI': 0.0, 'NL': 0.0, 'NK': 0.0, 'NM': 0.0, 'NF': 0.0, 'NP': 0.0, 'NS': 0.0, 'NT': 0.0, 'NW': 0.0, 'NY': 0.0, 'NV': 0.0, 'DA': 0.0, 'DR': 0.0, 'DN': 0.0, 'DD': 0.0, 'DC': 0.0, 'DE': 0.0, 'DQ': 0.0, 'DG': 0.0, 'DH': 0.0, 'DI': 0.0, 'DL': 0.0, 'DK': 0.0, 'DM': 0.0, 'DF': 0.0, 'DP': 0.0, 'DS': 0.0, 'DT': 0.0, 'DW': 0.0, 'DY': 0.0, 'DV': 0.0, 'CA': 0.0, 'CR': 3.45, 'CN': 0.0, 'CD': 0.0, 'CC': 3.45, 'CE': 0.0,

3. CONSTRUCTION OF MODELS

RF and SVM models were built using AAC and DPC as features, together with a hybrid model using AAC+DPC.

A grid search with 10-fold cross-validation was performed, as described in the article.
For SVM, an ‘rbf’ kernel was used and the parameters C and gamma were optimized (parameter range: 0.001, 0.01, 0.1, 1).
For the RF models, the grid search covered the number of estimators (10, 100, 200, 300, 400, 500), the maximum number of features (‘sqrt’, 2, 3, 5, 7), and the minimum number of samples required to split a node (3, 6, 7, 9, 10). The tuned parameters were the same as in the article; however, due to computational limitations, a considerably narrower range of values was tested.
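
The grid search described here can be sketched directly with scikit-learn. This is a minimal illustration under assumptions: the data are synthetic (`make_classification`), the scaling step is an arbitrary choice, and whether ProPythia's `train_best_model` builds its pipeline this way is not confirmed by the source. The `clf__` prefix in the parameter names only suggests a pipeline step named `clf`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a feature table (not the Tyagi-B data).
X, y = make_classification(n_samples=120, n_features=20, random_state=0)

# A step named 'clf' is what makes the 'clf__' parameter prefix valid.
pipeline = Pipeline([('scl', StandardScaler()), ('clf', SVC())])

param_range = [0.001, 0.01, 0.1, 1.0]
svm_grid = [{'clf__C': param_range,
             'clf__kernel': ['rbf'],
             'clf__gamma': param_range}]

# 10-fold cross-validated grid search, selecting on ROC AUC.
search = GridSearchCV(pipeline, svm_grid, scoring='roc_auc', cv=10)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))

# Size of the RF grid in the notebook's code: 6 * 5 * 5 combinations,
# each of which is fitted once per fold.
rf_grid = {'clf__n_estimators': [10, 100, 200, 300, 400, 500],
           'clf__max_features': ['sqrt', 2, 3, 5, 7],
           'clf__min_samples_split': [3, 6, 7, 9, 10]}
print(len(ParameterGrid(rf_grid)))
```

To select on MCC instead, the `scoring` argument could be set to `make_scorer(matthews_corrcoef)`, with both names imported from `sklearn.metrics`.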

In [22]:
def machine_learning_rf(dataset_in):
    dataset = pd.read_csv(dataset_in, delimiter=',')
    x_original=dataset.loc[:, dataset.columns != 'labels']
    labels=dataset['labels']

    ml=MachineLearning(x_original, labels,classes=['ACP','non_ACP'])

    # same parameters as tuned in the article, over a narrower range of values
    param_grid = [{'clf__n_estimators': [10, 100, 200, 300, 400, 500],
                   'clf__max_features': ['sqrt', 2, 3, 5, 7],
                   'clf__min_samples_split': [3, 6, 7, 9, 10]}]

    # grid search optimizing ROC AUC, then evaluation on the test set
    best_rf_model = ml.train_best_model('rf', score='roc_auc', param_grid=param_grid)
    print(ml.score_testset(best_rf_model))

In [24]:
def machine_learning_svm(dataset_in):
    dataset = pd.read_csv(dataset_in, delimiter=',')
    x_original=dataset.loc[:, dataset.columns != 'labels']
    labels=dataset['labels']

    ml=MachineLearning(x_original, labels,classes=['ACP','non_ACP'])

    #with grid search
    param_range = [0.001, 0.01, 0.1, 1.0]

    param_grid = [{'clf__C': param_range,
                   'clf__kernel': ['rbf'],
                   'clf__gamma': param_range
                   }]

    # grid search optimizing ROC AUC, then evaluation on the test set
    best_svm_model = ml.train_best_model('svm',score='roc_auc',param_grid=param_grid)
    print(ml.score_testset(best_svm_model))

In [25]:
dataset_aac=r'datasets/test_MLACP_ART_aac.csv'
dataset_dpc=r'datasets/test_MLACP_ART_dac.csv'
dataset_aac_dpc=r'datasets/test_MLACP_ART_aac_dpc.csv'

machine_learning_rf(dataset_aac)
machine_learning_rf(dataset_dpc)
machine_learning_rf(dataset_aac_dpc)

machine_learning_svm(dataset_aac)
machine_learning_svm(dataset_dpc)
machine_learning_svm(dataset_aac_dpc)

performing grid search...
Best score rf (scorer: roc_auc) and parameters from a 10-fold cross validation:
MCC score:	0.878
Parameters:	{'clf__max_features': 'sqrt', 'clf__min_samples_split': 3, 'clf__n_estimators': 200}
             Scores
MCC            0.62
accuracy       0.83
precision      0.83
recall         0.93
f1             0.88
roc_auc        0.79
TN            41.00
FP            22.00
FN             8.00
TP           105.00
FDR            0.17
sensitivity    0.93
specificity    0.65
performing grid search...
Best score rf (scorer: roc_auc) and parameters from a 10-fold cross validation:
MCC score:	0.871
Parameters:	{'clf__max_features': 2, 'clf__min_samples_split': 7, 'clf__n_estimators': 300}
             Scores
MCC            0.64
accuracy       0.84
precision      0.81
recall         0.96
f1             0.88
roc_auc        0.78
TN            38.00
FP            25.00
FN             4.00
TP           109.00
FDR            0.19
sensitivity    0.96
specificity    0.60
perfo
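
The rows of each score table follow from the four confusion-matrix counts. The sketch below recomputes them with the standard definitions; `scores_from_confusion` is a hypothetical helper, not part of ProPythia. For both tables above, the reported roc_auc equals (sensitivity + specificity) / 2, which is the value ROC AUC takes when it is computed from hard class predictions; whether the package computes it that way is an assumption.

```python
from math import sqrt

def scores_from_confusion(tn, fp, fn, tp):
    """Derive the summary metrics from the four confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        'MCC': round(mcc, 2),
        'accuracy': round((tp + tn) / (tp + tn + fp + fn), 2),
        'precision': round(precision, 2),
        'sensitivity': round(sensitivity, 2),
        'specificity': round(specificity, 2),
        'f1': round(2 * tp / (2 * tp + fp + fn), 2),
        'FDR': round(fp / (fp + tp), 2),
        # ROC AUC of a single operating point (hard class predictions)
        'balanced_accuracy': round((sensitivity + specificity) / 2, 2),
    }

# Counts taken from the first RF result table.
first = scores_from_confusion(tn=41, fp=22, fn=8, tp=105)
print(first)
```

In these tables the positive counts (TP + FN = 113) outnumber the negative ones (TN + FP = 63), whereas the dataset has fewer ACPs (187) than non-ACPs (398). This suggests, without confirming it, that non_ACP is treated as the positive class, which matters when sensitivity and specificity are compared with the article.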

The largest differences with respect to the article are in the sensitivity and specificity values. MCC and accuracy are very similar, with the values obtained with the ProPythia package being slightly worse. The differences may be due to the different parameter ranges used or to the use of different packages to build the models. As with the AmPEP article, the authors did not specify which measure they used to select the best models. Here, the grid search was scored with ROC AUC (score='roc_auc' in the code above).
Overall, this comparative analysis supports the performance of the package described here and helps to validate it.