# Wrapper Methods

## The goals of feature selection are listed below
- Reduce the number of features used to train the ML model
- Improve the accuracy of the trained model by removing the least important features
- Improve the analysis of features
- Reduce the model training time

## The following wrapper methods are used
- [**_Forward selection_**](https://www.analyticsvidhya.com/blog/2016/12/introduction-to-feature-selection-methods-with-an-example-or-how-to-select-the-right-variables/#:~:text=Forward%20Selection%3A%20Forward%20selection%20is,the%20performance%20of%20the%20model.)
    - Forward selection is an iterative method that starts with no features in the model. Each iteration adds the feature that best improves the model, until adding a new feature no longer improves its performance.
- [**_Backward elimination_**](https://www.analyticsvidhya.com/blog/2016/12/introduction-to-feature-selection-methods-with-an-example-or-how-to-select-the-right-variables/#:~:text=Forward%20Selection%3A%20Forward%20selection%20is,the%20performance%20of%20the%20model.)
    - Backward elimination starts with all the features and, at each iteration, removes the least significant feature, that is, the one whose removal most improves the performance of the model. This repeats until removing a feature yields no further improvement.
- [**_Step-wise selection_**](https://bookdown.org/max/FES/greedy-stepwise-selection.html)
    - Stepwise selection was originally developed as a feature selection technique for linear regression models. The forward stepwise regression approach uses a sequence of steps to allow features to enter or leave the regression model one at a time. Often this procedure converges to a subset of features.
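
The three strategies above can be sketched without any ML library. This is a minimal illustration, not the mlxtend implementation used in the cells below: `score` stands for any "higher is better" evaluation of a feature subset (for example a cross-validated accuracy), and `TOY_GAIN` and `toy_score` are made-up stand-ins for a real model. Unlike these sketches, the selector in the notebook stops at a fixed `k_features` rather than at the first step that brings no improvement.

```python
from typing import Callable, FrozenSet, List, Optional, Sequence, Tuple

ScoreFn = Callable[[FrozenSet[str]], float]
Step = Optional[Tuple[float, str]]


def best_addition(selected: List[str], features: Sequence[str], score: ScoreFn) -> Step:
    """Return (score, feature) for the best single feature to add, or None."""
    candidates = [(score(frozenset(selected + [f])), f)
                  for f in features if f not in selected]
    return max(candidates, key=lambda c: c[0]) if candidates else None


def best_removal(selected: List[str], score: ScoreFn) -> Step:
    """Return (score, feature) for the best single feature to drop, or None."""
    if len(selected) < 2:
        return None
    candidates = [(score(frozenset(x for x in selected if x != f)), f)
                  for f in selected]
    return max(candidates, key=lambda c: c[0])


def forward_selection(features: Sequence[str], score: ScoreFn) -> List[str]:
    """Start empty; keep adding the best feature while the score improves."""
    selected: List[str] = []
    best = float("-inf")
    while True:
        step = best_addition(selected, features, score)
        if step is None or step[0] <= best:
            return selected
        best = step[0]
        selected.append(step[1])


def backward_elimination(features: Sequence[str], score: ScoreFn) -> List[str]:
    """Start with everything; keep dropping a feature while the score improves."""
    selected = list(features)
    best = score(frozenset(selected))
    while True:
        step = best_removal(selected, score)
        if step is None or step[0] <= best:
            return selected
        best = step[0]
        selected.remove(step[1])


def stepwise_selection(features: Sequence[str], score: ScoreFn) -> List[str]:
    """Alternate add and drop steps so features can enter or leave the subset."""
    selected: List[str] = []
    best = float("-inf")
    changed = True
    while changed:
        changed = False
        step = best_addition(selected, features, score)
        if step is not None and step[0] > best:
            best = step[0]
            selected.append(step[1])
            changed = True
        step = best_removal(selected, score)
        if step is not None and step[0] > best:
            best = step[0]
            selected.remove(step[1])
            changed = True
    return selected


# Hypothetical per-feature gains; each extra feature costs 0.05.
TOY_GAIN = {"a": 0.30, "b": 0.20, "c": 0.01, "d": 0.00}


def toy_score(subset: FrozenSet[str]) -> float:
    return sum(TOY_GAIN[f] for f in sorted(subset)) - 0.05 * len(subset)


print(forward_selection(list(TOY_GAIN), toy_score))
print(backward_elimination(list(TOY_GAIN), toy_score))
print(stepwise_selection(list(TOY_GAIN), toy_score))
```

On this toy score all three strategies keep `a` and `b`. On real data they can disagree, because each one explores a different greedy path through the feature subsets.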

In [67]:
_leukemia_dataset_file = '../Datasets/Leukemia_GSE9476.csv'
_rna_dataset_file = '../Datasets/METABRIC_RNA_Mutation.csv'

In [68]:
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

In [69]:
import json
import matplotlib.pyplot as plt

# Univariate Selection
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

# Recursive Feature Elimination
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# PCA
from sklearn.decomposition import PCA

# Feature Importance
from sklearn.ensemble import ExtraTreesClassifier

# Support Vector Machines
from sklearn.svm import SVC

# Decision Tree
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# KNN
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Naive Bayes
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

In [70]:
df = pd.read_csv(_leukemia_dataset_file)
df.head(5)

|  | samples | type | 1007_s_at | 1053_at | 117_at | 121_at | 1255_g_at | 1294_at | 1316_at | 1320_at | ... | AFFX-r2-Hs28SrRNA-5_at | AFFX-r2-Hs28SrRNA-M_at | AFFX-r2-P1-cre-3_at | AFFX-r2-P1-cre-5_at | AFFX-ThrX-3_at | AFFX-ThrX-5_at | AFFX-ThrX-M_at | AFFX-TrpnX-3_at | AFFX-TrpnX-5_at | AFFX-TrpnX-M_at |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | Bone_Marrow_CD34 | 7.745245 | 7.81121 | 6.477916 | 8.841506 | 4.546941 | 7.957714 | 5.344999 | 4.673364 | ... | 5.058849 | 6.810004 | 12.80006 | 12.718612 | 5.391512 | 4.666166 | 3.974759 | 3.656693 | 4.160622 | 4.139249 |
| 1 | 12 | Bone_Marrow_CD34 | 8.087252 | 7.240673 | 8.584648 | 8.983571 | 4.548934 | 8.011652 | 5.579647 | 4.828184 | ... | 4.436153 | 6.751471 | 12.472706 | 12.333593 | 5.379738 | 4.656786 | 4.188348 | 3.792535 | 4.204414 | 4.1227 |
| 2 | 13 | Bone_Marrow_CD34 | 7.792056 | 7.549368 | 11.053504 | 8.909703 | 4.549328 | 8.237099 | 5.406489 | 4.615572 | ... | 4.392061 | 6.086295 | 12.637384 | 12.499038 | 5.316604 | 4.600566 | 3.845561 | 3.635715 | 4.174199 | 4.067152 |
| 3 | 14 | Bone_Marrow_CD34 | 7.767265 | 7.09446 | 11.816433 | 8.994654 | 4.697018 | 8.283412 | 5.582195 | 4.903684 | ... | 4.633334 | 6.375991 | 12.90363 | 12.871454 | 5.179951 | 4.641952 | 3.991634 | 3.704587 | 4.149938 | 3.91015 |
| 4 | 15 | Bone_Marrow_CD34 | 8.010117 | 7.405281 | 6.656049 | 9.050682 | 4.514986 | 8.377046 | 5.493713 | 4.860754 | ... | 5.305192 | 6.700453 | 12.949352 | 12.782515 | 5.341689 | 4.560315 | 3.88702 | 3.629853 | 4.127513 | 4.004316 |


In [71]:
LE = LabelEncoder()
df['type'] = LE.fit_transform(df['type'])

In [72]:
df.head()

|  | samples | type | 1007_s_at | 1053_at | 117_at | 121_at | 1255_g_at | 1294_at | 1316_at | 1320_at | ... | AFFX-r2-Hs28SrRNA-5_at | AFFX-r2-Hs28SrRNA-M_at | AFFX-r2-P1-cre-3_at | AFFX-r2-P1-cre-5_at | AFFX-ThrX-3_at | AFFX-ThrX-5_at | AFFX-ThrX-M_at | AFFX-TrpnX-3_at | AFFX-TrpnX-5_at | AFFX-TrpnX-M_at |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 2 | 7.745245 | 7.81121 | 6.477916 | 8.841506 | 4.546941 | 7.957714 | 5.344999 | 4.673364 | ... | 5.058849 | 6.810004 | 12.80006 | 12.718612 | 5.391512 | 4.666166 | 3.974759 | 3.656693 | 4.160622 | 4.139249 |
| 1 | 12 | 2 | 8.087252 | 7.240673 | 8.584648 | 8.983571 | 4.548934 | 8.011652 | 5.579647 | 4.828184 | ... | 4.436153 | 6.751471 | 12.472706 | 12.333593 | 5.379738 | 4.656786 | 4.188348 | 3.792535 | 4.204414 | 4.1227 |
| 2 | 13 | 2 | 7.792056 | 7.549368 | 11.053504 | 8.909703 | 4.549328 | 8.237099 | 5.406489 | 4.615572 | ... | 4.392061 | 6.086295 | 12.637384 | 12.499038 | 5.316604 | 4.600566 | 3.845561 | 3.635715 | 4.174199 | 4.067152 |
| 3 | 14 | 2 | 7.767265 | 7.09446 | 11.816433 | 8.994654 | 4.697018 | 8.283412 | 5.582195 | 4.903684 | ... | 4.633334 | 6.375991 | 12.90363 | 12.871454 | 5.179951 | 4.641952 | 3.991634 | 3.704587 | 4.149938 | 3.91015 |
| 4 | 15 | 2 | 8.010117 | 7.405281 | 6.656049 | 9.050682 | 4.514986 | 8.377046 | 5.493713 | 4.860754 | ... | 5.305192 | 6.700453 | 12.949352 | 12.782515 | 5.341689 | 4.560315 | 3.88702 | 3.629853 | 4.127513 | 4.004316 |


In [73]:
df['type'].unique()

array([2, 1, 0, 3, 4])
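
The integer codes above are assigned by `LabelEncoder`, which numbers the class names in sorted order, so `Bone_Marrow_CD34` ends up as `2` because two class names sort before it. A dependency-free sketch of that mapping follows; `encode_labels` and the toy labels are hypothetical and are not the dataset's real class names.

```python
from typing import List, Sequence, Tuple


def encode_labels(labels: Sequence[str]) -> Tuple[List[int], List[str]]:
    """Map each label to its position in the sorted list of unique labels."""
    classes = sorted(set(labels))
    code_of = {name: code for code, name in enumerate(classes)}
    return [code_of[name] for name in labels], classes


# Made-up labels, used only to show how the codes are derived.
toy_labels = ["tumor", "normal", "tumor", "control"]
codes, classes = encode_labels(toy_labels)
print(classes)  # ['control', 'normal', 'tumor']
print(codes)    # [2, 1, 2, 0]
```

In the notebook itself, `LE.classes_` lists the original names in code order and `LE.inverse_transform` converts codes back to names.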

In [74]:
# Use every column except the target and the 'samples' identifier as a feature
X = df.drop(columns=['type', 'samples'])
y = df['type']

In [87]:
# Number of selected features tried at each step (x-axis of the plot)
x_axis = []
# Test accuracy per classifier (y-axes); only the forward-selection
# decision tree is filled in by the loop below
y_axis_dtc_forward = []
y_axis_dtc_backward = []
y_axis_gnb = []
y_axis_etc = []
y_axis_rgc = []
y_axis_svc = []
y_axis_knn = []

for i in range(2, X.columns.size, 1000):
    x_axis.append(i)

    # Forward selection of the best i features, fitted on the full dataset
    sfs_forward = SFS(
        LinearRegression(),
        k_features=i,
        forward=True,
        floating=False
    )
    sfs_forward.fit(X, y)
    best_features = df[list(sfs_forward.k_feature_names_)]

    # Hold out 20% of the samples to evaluate the selected subset
    train, test, train_labels, test_labels = train_test_split(
        best_features, y, test_size=0.20, random_state=42
    )

    # Train a decision tree on the selected features and record its test accuracy
    dtcforward = DecisionTreeClassifier()
    dtcforward.fit(train, train_labels)
    y_pred = dtcforward.predict(test)
    dtc_res = metrics.accuracy_score(test_labels, y_pred)
    y_axis_dtc_forward.append(dtc_res)

    

In [None]:
plt.figure(figsize=(20, 5))
plt.plot(x_axis, y_axis_dtc_forward, label='forward')
# plt.plot(x_axis, y_axis_dtc_backward, label='backward')
plt.legend()
plt.show()
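
The loop that feeds this plot refits the selector from scratch for every value of `k_features`, and the cost of each fit grows quickly with `k`. A rough count of estimator fits for plain (non-floating) forward selection can be sketched as below. It assumes mlxtend's default 5-fold cross-validation, since the notebook does not set `cv`. `forward_selection_fits` and `N_FEATURES` are hypothetical, and the real feature count of the dataset is not stated here.

```python
def forward_selection_fits(n_features: int, k: int, cv: int = 5) -> int:
    """Approximate number of estimator fits to greedily select k features."""
    if not 0 <= k <= n_features:
        raise ValueError("k must be between 0 and n_features")
    # Step j (0-based) scores each of the n_features - j unused features,
    # and every score costs `cv` fits.
    return cv * sum(n_features - j for j in range(k))


N_FEATURES = 20_000  # hypothetical; substitute X.columns.size

# The first three values of k visited by range(2, N_FEATURES, 1000)
for k in range(2, N_FEATURES, 1000)[:3]:
    print(k, forward_selection_fits(N_FEATURES, k))
```

Because non-floating forward selection is greedy, the subset chosen for a smaller `k` is a prefix of the one chosen for a larger `k`. A single fit at the largest `k` therefore records every intermediate subset in `subsets_`, which may be cheaper than refitting once per `k`.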

In [75]:
# Select the two best features; k_features=2 matches the fitted selector shown below
sfs_forward = SFS(
    LinearRegression(),
    k_features=2,
    forward=True,
    floating=False
)

In [76]:
sfs_forward.fit(X, y)

SequentialFeatureSelector(estimator=LinearRegression(), k_features=2)

In [78]:
# Keep only the columns chosen by the fitted selector
best_features = df[list(sfs_forward.k_feature_names_)]

In [79]:
train, test, train_labels, test_labels = train_test_split(
    best_features, y, test_size=0.20, random_state=42
)

In [81]:
best_features

|  | 207233_s_at | 221556_at |
| --- | --- | --- |
| 0 | 5.227508 | 4.473222 |
| 1 | 5.559323 | 5.142558 |
| 2 | 5.327038 | 4.650552 |
| 3 | 5.170414 | 4.765737 |
| 4 | 5.280610 | 4.575199 |
| ... | ... | ... |
| 59 | 5.100100 | 5.527422 |
| 60 | 5.246819 | 5.875831 |
| 61 | 5.183198 | 5.669177 |
| 62 | 5.377604 | 5.953397 |


In [83]:
len(train_labels)

51

In [84]:
dtcforward = DecisionTreeClassifier()
dtcforward = dtcforward.fit(train, train_labels)

In [85]:
y_pred = dtcforward.predict(test)
dtc_res = metrics.accuracy_score(test_labels, y_pred)

In [86]:
dtc_res

0.5384615384615384
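
The accuracy above is the fraction of test samples predicted correctly. The sketch below reproduces the arithmetic under one inference: 51 training labels with a 20% test share implies 64 samples in total and 13 in the test set, because `train_test_split` rounds the test share up. The score is then consistent with 7 correct predictions out of 13. The `accuracy` helper and the toy label lists are hypothetical and are not the notebook's actual predictions.

```python
import math
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


n_samples = 64                        # inferred, not stated in the notebook
n_test = math.ceil(0.20 * n_samples)  # the test share is rounded up
n_train = n_samples - n_test
print(n_train, n_test)

# Made-up labels in which 7 of the 13 predictions are correct.
toy_true = [2, 1, 0, 3, 4, 2, 1, 0, 3, 4, 2, 1, 0]
toy_pred = [2, 1, 0, 3, 4, 2, 1, 1, 0, 0, 0, 0, 1]
print(accuracy(toy_true, toy_pred))
```

With only 13 test samples, each prediction moves the accuracy by about 0.077, so the score is coarse. It can also change between runs, because `DecisionTreeClassifier()` is created without a `random_state`.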