# Performing Feature Selection

## Recursive Feature Elimination

NOTE: The feature selection methods here are applied to a classification problem. The dataset is the Wisconsin Diagnostic Breast Cancer (WDBC) dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data

In [6]:
import pandas as pd
import numpy as np

# Read in the file from UCI (save it locally and load it from disk if your connectivity is unreliable)

# Loading the file over the internet
#filename = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data" 

# Loading the file locally in the same folder as the Python Notebook
filename = "wdbc.data"
names = ['ID','Diagnosis',
         'Mean-Radius','Mean-Texture','Mean-Perimeter',
         'Mean-Area','Mean-Smoothness','Mean-Compactness',
         'Mean-Concavity','Mean-ConcavePoints',
         'Mean-Symmetry','Mean-FractalDimension', 
         'StdErr-Radius','StdErr-Texture','StdErr-Perimeter',
         'StdErr-Area','StdErr-Smoothness','StdErr-Compactness',
         'StdErr-Concavity','StdErr-ConcavePoints',
         'StdErr-Symmetry','StdErr-FractalDimension',
         'Worst-Radius','Worst-Texture','Worst-Perimeter',
         'Worst-Area','Worst-Smoothness','Worst-Compactness',
         'Worst-Concavity','Worst-ConcavePoints',
         'Worst-Symmetry','Worst-FractalDimension']

# loading the file into a dataframe
data = pd.read_csv(filename, names=names, header=None) 

# Convert the Diagnosis to a numeric variable
data['Diagnosis'] = data['Diagnosis'].map({'M': 1, 'B': 0})
# Malignant tumors = 1 or True and Benign tumors = 0 or False

# Split the dataframe into features (X) and target (Y)
X = data.iloc[:, 2:32]   # the 30 feature columns, as a DataFrame
Y = data.iloc[:, 1]      # the Diagnosis column, as a Series

# Get the number of rows and columns of the feature DataFrame
(nRows, nCols) = X.shape
#X.head(0).T
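
If neither the UCI download nor a local copy of `wdbc.data` is available, scikit-learn bundles a copy of the same dataset. The sketch below is an alternative loader, not the one this notebook uses. The names `X_alt` and `Y_alt` are made up for illustration, and the bundled column names (for example `mean radius`) differ from the `names` list above. scikit-learn also encodes malignant as 0 and benign as 1, so the target is flipped here to match the mapping used in this notebook.

```python
from sklearn.datasets import load_breast_cancer

# Load scikit-learn's bundled copy of the WDBC data as pandas objects
bunch = load_breast_cancer(as_frame=True)

X_alt = bunch.data           # 30 feature columns, no ID column
# scikit-learn uses 0 = malignant, 1 = benign; flip so malignant = 1 as above
Y_alt = 1 - bunch.target

print(X_alt.shape)
print(Y_alt.value_counts())
```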

## Recursive Feature Elimination
Recursive Feature Elimination (RFE) repeatedly removes attributes and rebuilds the model on those that remain. It trains the chosen model on the full feature set, reads the feature importances from that model, and prunes the least important feature. It then refits on the remaining features and prunes again, continuing until only the requested number of features is left. By default, one feature is removed per iteration (`step=1`).

Let's see which features RFE retains; `numFeat` in the cell below sets how many to keep.
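
To make the elimination loop concrete, here is a minimal sketch that writes it out by hand and runs scikit-learn's `RFE` on the same data. It uses synthetic data from `make_classification` rather than the WDBC file, and `manual_rfe` is a hypothetical helper written for this illustration. It assumes importance is measured by the squared logistic regression coefficients, which is how `RFE` scores linear models by default.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration, not the WDBC file)
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

def manual_rfe(X, y, n_keep):
    """Hypothetical helper: refit and drop the weakest feature until n_keep remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        model = LogisticRegression(max_iter=1000).fit(X[:, remaining], y)
        importances = model.coef_[0] ** 2        # squared coefficients
        weakest = int(np.argmin(importances))    # position within `remaining`
        del remaining[weakest]
    return sorted(remaining)

kept_manual = manual_rfe(X_demo, y_demo, n_keep=3)

# The same procedure via scikit-learn, removing one feature per round
rfe_demo = RFE(LogisticRegression(max_iter=1000),
               n_features_to_select=3, step=1).fit(X_demo, y_demo)
kept_sklearn = sorted(np.flatnonzero(rfe_demo.support_).tolist())

print("manual loop keeps columns:", kept_manual)
print("sklearn RFE keeps columns:", kept_sklearn)
```

Both approaches drop the feature with the smallest squared coefficient each round, so the two lists should agree on this data. A larger `step` removes several features per round, which is faster but coarser.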

In [1]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

numFeat = 20 # change here to set number of features to "keep"

# feature selection
model = LogisticRegression() 
# n_features_to_select must be passed by keyword in current scikit-learn releases
rfe = RFE(model, n_features_to_select=numFeat) # the number of features retained
rfe = rfe.fit(X,Y) 

ranking = rfe.ranking_
selected = rfe.support_

ranking = np.vstack((ranking, selected))

(rows, cols) = ranking.shape

# This dataframe doesn't hold the columns selected, 
# it is only for pretty printing the selected features
rfe_selected = pd.DataFrame(data=ranking,
          index=np.array(range(1, rows+1)),
          columns=np.array(range(1, cols+1)))


array = rfe_selected.T # transpose so that each row is one feature
array.columns = ['rank', 'selected']
# keep only the rows that RFE marked as selected
dfSelect = array[array['selected'] == 1]
dfSelect




In [10]:
# Look at all of the features and find the worst of the worst
rankAll = pd.DataFrame(data=ranking,
          index=np.array(range(1, rows+1)),
          columns=np.array(range(1, cols+1)))
rankAll.columns = names[2:32] 

array = rankAll.T # transpose
array.columns = ['Rank', 'Selected']
array.sort_values(by='Rank')

| Feature | Rank | Selected |
|---|---|---|
| Mean-Radius | 1 | 1 |
| Worst-ConcavePoints | 1 | 1 |
| Worst-Concavity | 1 | 1 |
| Worst-Compactness | 1 | 1 |
| Worst-Smoothness | 1 | 1 |
| Worst-Perimeter | 1 | 1 |
| Worst-Texture | 1 | 1 |
| Worst-Radius | 1 | 1 |
| StdErr-Compactness | 1 | 1 |
| Worst-Symmetry | 1 | 1 |
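
The ranking table can also be built in one step, without stacking arrays and transposing. The sketch below indexes `ranking_` and `support_` by the feature names. It uses synthetic data with made-up column names (`feat_0` and so on) so that it runs on its own; with the notebook's data, the fitted `rfe` and `X.columns` would take their place.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with hypothetical column names
X_arr, y_demo = make_classification(n_samples=200, n_features=6,
                                    n_informative=3, n_redundant=0,
                                    random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(6)])

rfe_demo = RFE(LogisticRegression(max_iter=1000),
               n_features_to_select=3).fit(X_demo, y_demo)

# One row per feature: rank 1 means the feature was retained
summary = pd.DataFrame({"Rank": rfe_demo.ranking_,
                        "Selected": rfe_demo.support_},
                       index=X_demo.columns).sort_values("Rank")
print(summary)
```

Every retained feature has rank 1. Eliminated features get ranks 2, 3, and so on, and a higher rank means the feature was dropped earlier.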


Copyright (c) 2019 Kristin Tolle