# Feature Selection in Catalytic Site Prediction

by **[Tony Kabilan Okeke](mailto:tko35@drexel.edu)**

In this assignment, you are asked to perform feature selection for a catalytic 
site prediction problem, with the goal of improving prediction accuracy 
and/or simplifying the predictive model. You may use either a filter or a 
wrapper method for feature selection, and any classification method.
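
To make the filter/wrapper distinction concrete, here is a minimal sketch on synthetic data. The dataset from `make_classification`, the choice of `k=5`, and the names `X_demo`, `y_demo`, `filt`, and `wrap` are assumptions for illustration only; they are not part of the assignment data. A filter scores each feature on its own, while a wrapper repeatedly refits a model and discards the weakest features.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the catalytic site dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, n_redundant=0,
    random_state=0,
)

# Filter: score each feature independently with an ANOVA F-test
filt = SelectKBest(score_func=f_classif, k=5).fit(X_demo, y_demo)

# Wrapper: repeatedly fit a linear SVM and drop the lowest-weighted feature
wrap = RFE(estimator=SVC(kernel='linear'), n_features_to_select=5)
wrap.fit(X_demo, y_demo)

filter_mask = filt.get_support()
wrapper_mask = wrap.get_support()
print('Filter picks: ', filter_mask.nonzero()[0])
print('Wrapper picks:', wrapper_mask.nonzero()[0])
```

The two methods need not agree: the filter ignores interactions between features, whereas the wrapper depends on the chosen classifier.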

## Setup

In [1]:
# Import the necessary libraries
from sklearn.model_selection import cross_validate
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC
import pandas as pd

# BMES module
import sys, os
sys.path.append(os.environ['BMESAHMETDIR'])
import bmes

# Patch scikit-learn
from sklearnex import patch_sklearn
patch_sklearn()

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


In [2]:
# Download the data
URL = ('http://sacan.biomed.drexel.edu/lib/exe/fetch.php?rev=&media=course:'
       'ml:featureselect:hwfeatureselect.catsite:catsitedata_withrandfeats.tab')
datafile = bmes.downloadurl(URL, 'catsitedata_withrandfeats.tab')

# Load the data
df = pd.read_csv(datafile, sep='\t')

# Split the data into features and labels
feature_df = df.drop(columns='class')
X = feature_df.values
y = df['class'].values
features = feature_df.columns.to_numpy()

## Baseline Prediction
Perform classification using all features. What is the performance?

In [3]:
# Classification using all features
clf = SVC(kernel='linear')
scores = cross_validate(
    clf, X, y, cv=5, scoring='accuracy', n_jobs=-1
)['test_score']

# Print the accuracy
print('Accuracy: {:.2f} +/- {:.2f}'.format(scores.mean(), scores.std()))

Accuracy: 0.84 +/- 0.03
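
An accuracy figure is easier to interpret next to a trivial reference point. The sketch below, on synthetic imbalanced data, shows how a majority-class predictor could be scored with the same cross-validation call. The class ratio, the data, and the names `X_demo`, `y_demo`, and `dummy` are assumptions; the class balance of the catalytic site dataset is not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

# Synthetic, imbalanced stand-in data (an assumption; roughly 80% negatives)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Always predict the most frequent class seen during training
dummy = DummyClassifier(strategy='most_frequent')
dummy_scores = cross_validate(
    dummy, X_demo, y_demo, cv=5, scoring='accuracy'
)['test_score']

print('Majority-class accuracy: {:.2f}'.format(dummy_scores.mean()))
```

On the real data, the same pattern would apply with `X` and `y` in place of the synthetic arrays.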


## Select Features
You can decide which feature selection method to use. After performing feature selection, report the names of the selected features.

In [4]:
# Use Recursive Feature Elimination with Cross-Validation to select features
clf = SVC(kernel='linear')
rfecv = RFECV(estimator=clf, cv=5, scoring='accuracy', n_jobs=-1)
rfecv.fit(X, y)

# Report selected features
print(f'Selected features (n={rfecv.n_features_}):')
print(', '.join(features[rfecv.get_support()]))

Selected features (n=21):
A, R, N, D, C, E, G, H, I, L, K, M, F, P, T, W, Y, V, nearest_cleft_distance, HB_main_chain_protein, ScoreConsScore

## Prediction with the Selected Features
What is the performance with the selected features?

In [5]:
# Classification using selected features
X_sel = rfecv.transform(X)
clf = SVC(kernel='linear')
scores = cross_validate(
    clf, X_sel, y, cv=5, scoring='accuracy', n_jobs=-1
)['test_score']

# Print the accuracy
print('Accuracy: {:.2f} +/- {:.2f}'.format(scores.mean(), scores.std()))

Accuracy: 0.84 +/- 0.01
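
One caveat: when features are selected using all samples and the model is then cross-validated on those same samples, the accuracy estimate can be optimistic. A possible remedy, sketched below on synthetic data, is to place the selector inside a `Pipeline` so that selection is repeated within each training fold. The data, the inner `cv=3`, and the step names `select` and `clf` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the catalytic site dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=12, n_informative=4, n_redundant=0,
    random_state=0,
)

# Feature selection is refit on the training portion of every outer fold
pipe = Pipeline([
    ('select', RFECV(estimator=SVC(kernel='linear'), cv=3, scoring='accuracy')),
    ('clf', SVC(kernel='linear')),
])

nested_scores = cross_validate(
    pipe, X_demo, y_demo, cv=5, scoring='accuracy'
)['test_score']

print('Accuracy: {:.2f} +/- {:.2f}'.format(
    nested_scores.mean(), nested_scores.std()))
```

With this arrangement, each held-out fold plays no part in choosing the features used to classify it.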
