Classification of DNA-Binding Proteins Using Sequence Based Features and Feature Selection

The Datasets

The datasets folder contains all the features for the experiments. All the feature archives need to be unzipped and kept in the datasets folder for the code to run properly.
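The unzipping step can be scripted. The following is a minimal sketch, not part of the repository: the helper name `extract_archives` is hypothetical, and it assumes the archives sit directly inside the datasets folder and should be extracted in place.

```python
import zipfile
from pathlib import Path


def extract_archives(datasets_dir):
    """Extract every .zip archive found directly inside datasets_dir, in place.

    Hypothetical helper: returns the names of all extracted members.
    """
    datasets_dir = Path(datasets_dir)
    extracted = []
    for archive in sorted(datasets_dir.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            # Members are written next to the archive, inside datasets_dir.
            zf.extractall(datasets_dir)
            extracted.extend(zf.namelist())
    return extracted
```

Calling `extract_archives("datasets")` from the repository root would unpack each archive into the datasets folder, assuming that is where the scripts expect to find the files.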

"All_32620_Features_Test_and_Train.zip" contains all the features extratced from both the train and the test datasets, these were used for the Recursive feature Selection.
"Group Test Dataset.zip" and "Group Train Dataset.zip" contains test and train files for the Grouped feature Selection. The features groups are separated in different csv files.

The Codes

Grouped Feature Selection

This technique was coded manually and separately for each combination of feature groups. We carried out all the experiments and stored the results in "Grouped_Feature_Selection_All_Results.xlsx". After running all the experiments, we identified the best group combination and then tested it on the train dataset. "Grouped_Feature_Selection_Final_GCEF_Test_Train.py" contains the final code, which computes both the train and the test results.
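The idea of comparing combinations of feature groups can be sketched in a few lines. This is an illustration only and not the repository's code, which was written manually per combination. It assumes scikit-learn, a Random Forest scored by cross-validated accuracy, and synthetic stand-in data; the function name and the group names are hypothetical.

```python
from itertools import combinations

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def best_group_combination(groups, y, cv=3, scoring="accuracy"):
    """Score every non-empty combination of feature groups; return the best.

    groups: dict mapping a group name to a 2-D array (samples x features).
    """
    best_names, best_score = None, -np.inf
    names = sorted(groups)
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Concatenate the columns of the chosen groups side by side.
            X = np.hstack([groups[name] for name in combo])
            clf = RandomForestClassifier(n_estimators=50, random_state=0)
            score = cross_val_score(clf, X, y, cv=cv, scoring=scoring).mean()
            if score > best_score:
                best_names, best_score = combo, score
    return best_names, best_score


# Synthetic stand-in data (assumption): one group carries the class signal.
rng = np.random.default_rng(0)
y = np.array([0, 1] * 60)
groups = {
    "informative": rng.normal(size=(120, 3)) + 2.0 * y[:, None],
    "noise_a": rng.normal(size=(120, 3)),
    "noise_b": rng.normal(size=(120, 3)),
}
combo, score = best_group_combination(groups, y)
print(combo, round(score, 3))
```

With real data, each group would be loaded from its own CSV file. The number of combinations doubles with every added group, so an exhaustive search like this is only practical for a small number of groups.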

Recursive Feature Selection

In this technique, we ranked the features using a Random Forest classifier, identified the least important feature, and removed it from the train dataset. We ran the loop 32620 times, once for each feature, and chose the feature set with the best accuracy. After choosing the optimal feature set, we tested it on the test dataset.
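The elimination loop described above can be sketched as follows. This is a simplified illustration on synthetic data, not the repository's script. It assumes scikit-learn and measures accuracy by cross-validation on the training data, which the text does not specify; the function name is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def recursive_elimination(X, y, cv=3, random_state=0):
    """Drop the least important feature one at a time; keep the best set."""
    remaining = list(range(X.shape[1]))
    best_features, best_score = list(remaining), -np.inf
    while remaining:
        clf = RandomForestClassifier(n_estimators=50, random_state=random_state)
        # Assumption: accuracy is estimated by cross-validation on the train set.
        score = cross_val_score(clf, X[:, remaining], y, cv=cv).mean()
        if score > best_score:
            best_features, best_score = list(remaining), score
        # Rank the current features and remove the least important one.
        clf.fit(X[:, remaining], y)
        weakest = int(np.argmin(clf.feature_importances_))
        del remaining[weakest]
    return best_features, best_score


# Synthetic stand-in data (assumption): features 0 and 1 carry the signal.
rng = np.random.default_rng(0)
y = np.array([0, 1] * 60)
X = rng.normal(size=(120, 6))
X[:, :2] += 2.0 * y[:, None]
selected, score = recursive_elimination(X, y)
print(selected, round(score, 3))
```

The loop runs once per feature, so with 32620 features the full run refits the forest that many times. Only the selected column indices need to be stored in order to apply the same feature set to the test dataset afterwards.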

"Recursive_Feature_Selection.py" contains the entire code for recursive feature selecton.
"Recursive_Best_Feature_Set_Train_Test.py" contains the code where we only ran the code till the optimal set of features was reached and then tested the feature set on the testing data.

Classifiers Used

The following classifiers were used in the experiments:

Random Forest
Extra Tree Classifier
Support Vector Machine
Logistic Regression
AdaBoost
Decision Tree
Gaussian Naive Bayes
K-Nearest Neighbour
Linear Discriminant Analysis
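For reference, the list above could be instantiated as follows. This is a sketch that assumes scikit-learn with mostly default hyperparameters; the settings actually used in the experiments are not stated here. In particular, mapping "Extra Tree Classifier" to the ensemble `ExtraTreesClassifier` is an assumption, since scikit-learn also offers a single-tree `ExtraTreeClassifier`.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import (
    AdaBoostClassifier,
    ExtraTreesClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def make_classifiers(random_state=0):
    """Return one estimator per listed classifier (hypothetical helper)."""
    return {
        "Random Forest": RandomForestClassifier(random_state=random_state),
        # Assumption: the ensemble variant, not the single-tree ExtraTreeClassifier.
        "Extra Tree Classifier": ExtraTreesClassifier(random_state=random_state),
        # probability=True enables predict_proba, needed for auROC and auPR.
        "Support Vector Machine": SVC(probability=True, random_state=random_state),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "AdaBoost": AdaBoostClassifier(random_state=random_state),
        "Decision Tree": DecisionTreeClassifier(random_state=random_state),
        "Gaussian Naive Bayes": GaussianNB(),
        "K-Nearest Neighbour": KNeighborsClassifier(),
        "Linear Discriminant Analysis": LinearDiscriminantAnalysis(),
    }
```

Keeping the estimators in a dictionary makes it straightforward to loop over them and report the same metrics for each.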

Performance Metrics

The classifiers were evaluated using the following metrics:

Accuracy
Sensitivity or Recall
Specificity
Matthews correlation coefficient (MCC)
Area under the receiver operating characteristic curve (auROC)
Area under the precision-recall curve (auPR)
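The four threshold-based metrics follow directly from the confusion matrix. The sketch below uses only the standard library and assumes binary labels with 1 for the positive (DNA-binding) class and 0 for the negative class; the function names are hypothetical.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def threshold_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and MCC from predicted labels."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        # MCC is conventionally set to 0 when the denominator is 0.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


metrics = threshold_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
print(metrics)
```

auROC and auPR are computed from continuous scores rather than hard labels, so they are not covered by this sketch. With scikit-learn, `roc_auc_score` and `average_precision_score` are one option, the latter being a common estimate of the area under the precision-recall curve.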
