# AMPEP Case Study 

This section presents a comparative analysis to demonstrate the application and performance
of ProPythia in addressing sequence-based prediction problems. The first case study concerns antimicrobial peptides (AMPs) and tries to replicate the study by P. Bhadra et al., “AmPEP: Sequence-based prediction of antimicrobial peptides using distribution patterns of amino acid properties and random forest”, which is reported to perform strongly among AMP prediction methods.

In the publication, Bhadra et al. trained a random forest (RF) model with 10-fold cross-validation on a dataset with a positive:negative (AMP:non-AMP) ratio of 1:3, using features based on the distribution patterns of amino acid properties along the sequence (CTD features). The collection of AMP and non-AMP data is freely available at https://sourceforge.net/projects/axpep/files/. Their model obtained a sensitivity of 0.95, a specificity and accuracy of 0.96, an MCC of 0.9 and an AUC-ROC of 0.98.


P. Bhadra, J. Yan, J. Li, S. Fong, and S. W. Siu, “AmPEP: Sequence-based prediction
of antimicrobial peptides using distribution patterns of amino acid properties and
random forest,” Scientific Reports, vol. 8, no. 1, pp. 1–10, 2018.

In [1]:
import csv
import pandas as pd
from propythia.sequence import ReadSequence
from propythia.descriptors import Descriptor
from propythia.preprocess import Preprocess
from propythia.feature_selection import FeatureSelection
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from propythia.machine_learning import MachineLearning



1. Construction of datasets

First, from the available collection of AMP and non-AMP data, a dataset with a 1:3 ratio was built.


In [11]:
def create_dataset():
    AMP_data=r'datasets/M_model_train_AMP_sequence.fasta'
    #AMP 3268  sequences
    non_AMP_data=r'datasets/M_model_train_nonAMP_sequence.fasta'
    #non-AMP 166791 sequences

    with open('datasets/test_AmPEP.csv', 'w', newline='') as csvfile:
        spamwriter = csv.writer(csvfile, delimiter=' ',
                                quotechar='|', quoting=csv.QUOTE_MINIMAL)
        with open(AMP_data, newline='') as csvfile_AMP:
            spamreader = csv.reader(csvfile_AMP, delimiter=' ', quotechar='|')
            for row in spamreader:
                if len(row[0])>1: #just sequences. not '>' character
                    spamwriter.writerow(row)

        with open(non_AMP_data, newline='') as csvfile_nonAMP:
            spamreader = csv.reader(csvfile_nonAMP, delimiter=' ', quotechar='|')
            for _ in range(5001):  # skip the first 5001 lines (arbitrary offset, to not start at the beginning)
                next(spamreader)
            count=0
            n_non_AMP=9805 #index of the last non-AMP sequence to add (9806 sequences in total)

            for row in spamreader:
                if count>n_non_AMP:
                    break #enough non-AMP sequences were written
                if len(row[0])>1: #just sequences. not '>' character
                    spamwriter.writerow(row)
                    count+=1

    with open(r'datasets/test_AmPEP.csv', 'r', newline='') as csvfile:
        spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')
        row_count = sum(1 for row in spamreader)
        print(row_count)
create_dataset()

13074
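
The same construction can be sketched without the `csv` module, keeping each sequence together with its label instead of relying on row order. This is a minimal sketch under stated assumptions, not the notebook's code: `read_fasta` and `build_dataset` are hypothetical helper names, and the toy records stand in for the two FASTA files.

```python
def read_fasta(lines):
    """Yield sequences from FASTA-formatted lines, joining wrapped sequence lines."""
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):      # a header line starts a new record
            if chunks:
                yield ''.join(chunks)
                chunks = []
        else:
            chunks.append(line)
    if chunks:                        # the last record has no following header
        yield ''.join(chunks)


def build_dataset(amp_lines, non_amp_lines, ratio=3, skip=0):
    """Return (sequence, label) pairs with `ratio` non-AMP sequences per AMP one."""
    amp = list(read_fasta(amp_lines))
    non_amp = list(read_fasta(non_amp_lines))
    wanted = ratio * len(amp)
    chosen = non_amp[skip:skip + wanted]   # skip some leading records (arbitrary offset)
    return [(s, 'AMP') for s in amp] + [(s, 'non_AMP') for s in chosen]


# Toy input standing in for the two FASTA files (made-up sequences)
amp_fasta = ['>a1', 'GLFDIVKK', '>a2', 'KWKLFKKI']
non_amp_fasta = []
for i in range(8):
    non_amp_fasta += ['>n%d' % i, 'ACDEFGHIK' + 'L' * i]

rows = build_dataset(amp_fasta, non_amp_fasta, ratio=3, skip=1)
print(len(rows))
```

With 3268 AMP sequences, an exact 1:3 ratio corresponds to 9804 non-AMP sequences; the cell above writes 9806 (13074 rows in total), which is the count the label lists in the later cells assume.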


2. Calculation of features / protein sequence descriptors
Taking this dataset as a base, two datasets were assembled.
For the first one, CTD descriptors were calculated. A derived dataset was constructed by restricting the features to the D feature. These two datasets were used to mimic the published model.

To understand whether adding features would alter the performance of the model, a second dataset was built.
Physicochemical (15), AAC and DPC (420), CTD (147) and CTriad (343) descriptors were calculated. To reduce the number of features and select the most important ones, invariable columns were removed and a univariate feature selector (mutual info classif as scoring function) kept the best k=250 features. This dataset was standard scaled. Afterwards, an L1 logistic regression model (C=0.01) was applied, leaving a final dataset of 160 selected features.
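
The selection steps described above can be sketched directly with scikit-learn. This is a minimal sketch under stated assumptions, not what ProPythia's `Preprocess` and `FeatureSelection` classes do internally: the data are synthetic, and `k=10` and `C=0.5` are illustrative values chosen for the small toy matrix (the text uses k=250 and C=0.01 on the real dataset).

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectFromModel, SelectKBest,
                                       VarianceThreshold, mutual_info_classif)
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the descriptor matrix (assumption: not the AmPEP data)
X, y = make_classification(n_samples=400, n_features=40, n_informative=8,
                           random_state=0)
X = np.hstack([X, np.ones((X.shape[0], 1))])   # add one invariable column

# 1) drop invariable (zero-variance) columns
X_var = VarianceThreshold(threshold=0.0).fit_transform(X)

# 2) univariate selection: keep the k best columns by mutual information
k = 10
score_func = partial(mutual_info_classif, random_state=0)
X_kbest = SelectKBest(score_func=score_func, k=k).fit_transform(X_var, y)

# 3) standard scaling
X_scaled = StandardScaler().fit_transform(X_kbest)

# 4) L1-penalised logistic regression: keep columns with non-zero coefficients
l1_model = LogisticRegression(penalty='l1', C=0.5, solver='liblinear')
X_final = SelectFromModel(l1_model).fit_transform(X_scaled, y)

print(X.shape, X_var.shape, X_kbest.shape, X_final.shape)
```

The `liblinear` solver is set explicitly because the default solver of `LogisticRegression` does not support an L1 penalty.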


In [12]:
def add_features_CTD():
    dataset_in=r'datasets/test_AmPEP.csv'
    rows_list = [] #creating an empty list of dataset rows

    #opening dataset
    with open(dataset_in) as csvfile:
        spamreader = csv.reader(csvfile, delimiter=',', quotechar='|')
        for row in spamreader:
            res={'sequence':row[0]}
            sequence=ReadSequence() #creating sequence object
            ps=sequence.read_protein_sequence(row[0])
            protein = Descriptor(ps) #creating object to calculate descriptors
            feature=protein.adaptable([32]) #CTD feature
            res.update(feature)
            rows_list.append(res)

    df = pd.DataFrame(rows_list)
    df.set_index(['sequence'],inplace=True)
    labels=['AMP']*3268 + ['non_AMP']*9806 #adding labels to dataset


    #select only the D (distribution) features
    d_cols = [col for col in df.columns if 'D' in col]
    #the 'D' in 'VDWV' also matches these C and T columns, so they are dropped explicitly
    ignore=['_NormalizedVDWVC1','_NormalizedVDWVC2','_NormalizedVDWVC3','_NormalizedVDWVT12','_NormalizedVDWVT13','_NormalizedVDWVT23']

    df=df[df.columns.intersection(d_cols)]
    df=df.drop(columns=ignore)
    df['labels'] = labels
    dataset_out=r'datasets/test_AmPEP_CTD_D.csv' #same path that the model cells read from
    df.to_csv(dataset_out,index=False)
    print(df.shape)
    print(df.head(10))

add_features_CTD()

(13074, 106)
                                                    _PolarizabilityD1001  \
sequence                                                                   
AACSDRAHGHICESFKSFCKDSGRNGVKLRANCKKTCGLC                           6.667   
AAEFPDFYDSEEQMGPHQEAEDEKDRADQRVLTEEEKKELENLAAMD...                 5.882   
AAFFAQQKGLPTQQQNQVSPKAVSMIVNLEGCVRNPYKCPADVWTNG...                 2.000   
AAFRGCWTKNYSPKPCL                                                 20.000   
AAGMGFFGAR                                                        16.667   
AAGNPSETGGAVATYSTAVGSFLDGTVKVVATGGASRVPGNCGTAAV...                 2.041   
AAKNKKEKGKKGASDCTEWTWGSCIPNSKDCGAGTREGTCKEETRKL...                 2.222   
AAKNKKEKGKKGASDCTEWTWGSCIPNSKDCGAGTREGTCKEETRKL...                 2.222   
AAKPMGITCDLLSLWKVGHAACAAHCLVLGDVGGYCTKEGLCVCKE                     5.882   
AALKGCWTKSIPPKPCFGKR                                              16.667   

                                                    _PolarizabilityD1025  
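
Selecting the D columns with `'D' in col` also matches the `VDWV` columns, which is why they have to be dropped by hand in the cell above. A stricter filter can be sketched with a regular expression. This is a sketch under an assumption: that every D column ends in `D` followed by four digits, as the two column names shown in the output do. The toy frame and the names `_NormalizedVDWVD1001` and `_HydrophobicityC1` are made up for illustration.

```python
import pandas as pd

# Toy frame with column names in the style of the output above
df = pd.DataFrame([[0.0] * 6], columns=[
    '_PolarizabilityD1001', '_PolarizabilityD1025', '_NormalizedVDWVD1001',
    '_NormalizedVDWVC1', '_NormalizedVDWVT12', '_HydrophobicityC1',
])

# Keep only the columns whose name ends in 'D' followed by four digits
d_only = df.filter(regex=r'D\d{4}$')
print(list(d_only.columns))
```

With this pattern the C and T columns of the `NormalizedVDWV` property are excluded without listing them explicitly.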

In [4]:
def add_features_all():
    dataset_in=r'datasets/test_AmPEP.csv'
    rows_list = [] #creating an empty list of dataset rows

    #opening dataset
    with open(dataset_in) as csvfile:
        spamreader = csv.reader(csvfile, delimiter=',', quotechar='|')
        for row in spamreader:
            res={'sequence':row[0]}
            sequence=ReadSequence() #creating sequence object
            ps=sequence.read_protein_sequence(row[0])
            protein = Descriptor(ps) #creating object to calculate descriptors
            feature=protein.adaptable([19,20,21,32,33]) #descriptor groups for the larger dataset (32 is CTD, as in add_features_CTD)
            res.update(feature)
            rows_list.append(res)

    df = pd.DataFrame(rows_list)
    df.to_csv(r'datasets/test_AmPEP_all__BACKUP.csv',index=False)

    df.set_index(['sequence'],inplace=True)
    labels=['AMP']*3268 + ['non_AMP']*9806 #adding labels to dataset
    df['labels'] = labels

    dataset_out=r'datasets/test_AmPEP_all.csv'
    df.to_csv(dataset_out,index=False)
    print(df.shape)
add_features_all()

In [13]:
def select_features():
    dataset_in=r'datasets/test_AmPEP_all.csv'
    dataset=pd.read_csv(dataset_in, delimiter=',')
    #separate labels
    labels=dataset['labels']
    dataset=dataset.loc[:, dataset.columns != 'labels']

    prepro=Preprocess() #Create Preprocess object

    #do the preprocessing
    dataset_clean,columns_deleted=prepro.preprocess(dataset, columns_names=True, threshold=0, standard=True)

    dataset_clean['labels']=labels #put labels back

    print('dataset original',dataset.shape)
    print('dataset after preprocess',dataset_clean.shape)

    pd.DataFrame(dataset_clean).to_csv(r'datasets/test_AmPEP_all_clean.csv',index=False)
    
    x_original=dataset_clean.loc[:, dataset_clean.columns != 'labels']
    fselect=FeatureSelection(dataset_clean, x_original, labels)

    # # #KBest com *mutual info classif*
    X_fit_univariate, X_transf_univariate,column_selected,scores,dataset_features= \
        fselect.univariate(score_func=mutual_info_classif, mode='k_best', param=250)

    # # Select from model (the text describes an L1 penalty with C=0.01; L2 with C=0.1 is what is set here)
    # model_svc=LinearSVC(C=0.1, penalty="l1", dual=False)
    model_lr=LogisticRegression(C=0.1, penalty="l2", dual=False)
    #model= logistic regression
    X_fit_model, X_transf_model,column_selected,feature_importances,feature_importances_DF,dataset_features= \
        fselect.select_from_model_feature_elimination( model=model_lr)

    pd.DataFrame(dataset_features).to_csv(r'datasets/test_AmPEP_all_selected.csv',index=False)
    #print(df.head(10))
select_features()

dataset original (13074, 1104)
dataset after preprocess (13074, 581)


RF models were built combining:
            the parameters of the article, or parameters chosen by grid search
            the same features as in the article (D from CTD)
            all the CTD features
            a considerably larger set of features
            
            
To mimic the published model, an RF model was built using the D descriptors from CTD, with 100 estimators, sqrt as the maximum number of features and 10-fold CV. 
This model obtained a sensitivity of 0.97, a specificity of 0.93, an accuracy of 0.96, an MCC of 0.90 and an AUC-ROC of 0.95 against the test set. 


With the same descriptors but using a grid search approach, the model yielded the same results.


Using all the CTD features with the grid search approach, the model obtained a sensitivity of 0.98, a specificity of 0.93, an accuracy of 0.96, an MCC of 0.90 and an AUC-ROC of 0.95, yielding, as described in the article, slightly better results. 


Using the dataset with more features, the resultant model achieved the same results.
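
The RF setup described above (fixed article parameters, 10-fold cross-validation, evaluation on a held-out test set) can be sketched in plain scikit-learn. This is a sketch under stated assumptions, not ProPythia's `train_best_model`: the data are synthetic, the pipeline step name `clf` is inferred from the `clf__` parameter names in the notebook, and the 30% test split is an assumption consistent with the 3923 test samples in the printed confusion matrix.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic, imbalanced stand-in for the descriptor dataset (assumption)
X, y = make_classification(n_samples=300, n_features=20,
                           weights=[0.25, 0.75], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

pipe = Pipeline([('clf', RandomForestClassifier(random_state=0))])

# Article parameters expressed as a one-point grid
param_grid = [{'clf__n_estimators': [100], 'clf__max_features': ['sqrt']}]

# Two candidate selection criteria; the choice changes which model wins
scorers = {'roc_auc': 'roc_auc', 'mcc': make_scorer(matthews_corrcoef)}

search = GridSearchCV(pipe, param_grid, scoring=scorers['roc_auc'], cv=10)
search.fit(X_train, y_train)

print(search.best_params_)
print('test MCC:', matthews_corrcoef(y_test, search.predict(X_test)))
```

Passing `scorers['mcc']` instead selects the best model on MCC rather than ROC-AUC; with more than one candidate in the grid, the two criteria may pick different parameter sets.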
            
 

In [14]:
def machine_learning_rf(dataset_in, grid=None):
    dataset = pd.read_csv(dataset_in, delimiter=',')
    x_original=dataset.loc[:, dataset.columns != 'labels']

    labels=dataset['labels']

    ml=MachineLearning(x_original, labels,classes=['AMP','non_AMP'])

    if grid == 'AmPEP':
        #with parameters defined by article
        param_grid = [{'clf__n_estimators': [100],
                   'clf__max_features': ['sqrt']}]

        #optimize MCC
        #best_rf_model_AMPEPparameters=ml.train_best_model('rf',score=make_scorer(matthews_corrcoef),param_grid=param_grid)

        #optimize ROC_AUC
        best_rf_model_AMPEPparameters=ml.train_best_model('rf',param_grid=param_grid)
        print(ml.score_testset(best_rf_model_AMPEPparameters))

    else: 
        #with grid search
        #optimize MCC
        #best_rf_model = ml.train_best_model('rf')

        #optimize ROC-AUC
        best_rf_model = ml.train_best_model('rf')
        print(ml.score_testset(best_rf_model))

    
# RF with only D features (AmPEP parameters)
machine_learning_rf('datasets/test_AmPEP_CTD_D.csv', grid='AmPEP')
# RF with only D features (grid search)
machine_learning_rf('datasets/test_AmPEP_CTD_D.csv')

# RF with more features (AmPEP parameters)
machine_learning_rf(r'datasets/test_AmPEP_all_selected.csv', grid='AmPEP')
# RF with more features (grid search)
machine_learning_rf(r'datasets/test_AmPEP_all_selected.csv')

performing grid search...
Best score rf (scorer: roc_auc) and parameters from a 10-fold cross validation:
MCC score:	0.989
Parameters:	{'clf__max_features': 'sqrt', 'clf__n_estimators': 100}
              Scores
MCC             0.89
accuracy        0.96
precision       0.97
recall          0.97
f1              0.97
roc_auc         0.95
TN            916.00
FP             81.00
FN             80.00
TP           2846.00
FDR             0.03
sensitivity     0.97
specificity     0.92
performing grid search...
Best score rf (scorer: roc_auc) and parameters from a 10-fold cross validation:
MCC score:	0.989
Parameters:	{'clf__bootstrap': True, 'clf__criterion': 'gini', 'clf__max_features': 'sqrt', 'clf__n_estimators': 500}
              Scores
MCC             0.90
accuracy        0.96
precision       0.97
recall          0.97
f1              0.97
roc_auc         0.95
TN            921.00
FP             76.00
FN             75.00
TP           2851.00
FDR             0.03
sensitivity     0.97
s

To test whether an SVM model would outperform an RF-based one, SVM models using grid search were
also trained.

With the models built using:
            only the CTD features, the model obtained a sensitivity of 0.96, a specificity of 0.89, an accuracy of 0.94, an MCC of 0.86 and an AUC-ROC of 0.93 against the test set

            the dataset with more features, the model obtained a sensitivity of 0.96, a specificity of 0.91, an accuracy of 0.95, an MCC of 0.87 and an AUC-ROC of 0.94 against the test set.
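
One detail of SVM grid searches is worth illustrating: in scikit-learn's `SVC`, `gamma` only affects the `rbf`, `poly` and `sigmoid` kernels, so pairing it with a linear kernel multiplies the number of fits without producing different models. The sketch below searches `gamma` only where it matters. It is an illustration under assumptions: the data are synthetic, the `rbf` grid and the scaling step are additions that the notebook's SVM cell does not have.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

param_range = [0.001, 0.01, 0.1, 1.0]

# One sub-grid per kernel, so gamma is varied only for the rbf kernel
param_grid = [
    {'clf__kernel': ['linear'], 'clf__C': param_range},
    {'clf__kernel': ['rbf'], 'clf__C': param_range, 'clf__gamma': param_range},
]
n_candidates = len(ParameterGrid(param_grid))
print(n_candidates)   # 4 linear + 16 rbf candidates

# Synthetic stand-in for the descriptor dataset (assumption)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pipe = Pipeline([('scl', StandardScaler()), ('clf', SVC())])
search = GridSearchCV(pipe, param_grid, scoring='roc_auc', cv=5)
search.fit(X, y)
print(search.best_params_)
```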

In [None]:
def machine_learning_svm(dataset_in):
    dataset = pd.read_csv(dataset_in, delimiter=',')
    x_original=dataset.loc[:, dataset.columns != 'labels']

    labels=dataset['labels']

    ml=MachineLearning(x_original, labels,classes=['AMP','non_AMP'])

    #with grid search
    param_range = [0.001, 0.01, 0.1, 1.0]

    param_grid = [{'clf__C': param_range,
                       'clf__kernel': ['linear'],
                       'clf__gamma': param_range
                       }]

    best_svm_model = ml.train_best_model('svm',param_grid=param_grid, scaler=None)
    print(ml.score_testset(best_svm_model))

# SVM with only CTD features
machine_learning_svm(r'datasets/test_AmPEP_CTD_D.csv')
# SVM with all features
machine_learning_svm(r'datasets/test_AmPEP_all_selected.csv')

performing grid search...


The model mimicking the AmPEP predictive model yielded slightly different results,
achieving more sensitivity but less specificity and having the same accuracy and MCC scores.

This result shows that ProPythia can be used to build models whose performance is similar
to the best ones described in the literature. Taking the results into account, it is
also noticeable that adding more sequence descriptors to the RF models did not lead to better
models, whereas in the SVM models more features led to better performance. Both SVM
models performed worse than any RF model, which agrees with the article,
which reports RF models performing better than SVM ones. The small differences observed
when using the same model may be due to the methods used to build the RF or to the scoring
functions used to choose the best-performing models. In the article, the authors did not
specify which measure they took into account to select the best models. Here, the grid
search runs shown above report ROC-AUC as the scorer, with MCC as a commented-out alternative.

