# Performing Feature Selection

## Example using SelectKBest

NOTE: These examples are for classification, and the dataset is the extended Wisconsin Breast Cancer dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data.

In [1]:
import pandas as pd
import numpy as np

# read in the file from UCI <recommend you save locally and load it if your connectivity is iffy>

# Loading the file over the internet
#filename = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data" 

# Loading the file locally in the same folder as the Python Notebook
filename = "wi_breast_cancer.csv"
names = ['ID','Diagnosis',
         'Mean-Radius','Mean-Texture','Mean-Perimeter',
         'Mean-Area','Mean-Smoothness','Mean-Compactness',
         'Mean-Concavity','Mean-ConcavePoints',
         'Mean-Symmetry','Mean-FractalDimension', 
         'StdErr-Radius','StdErr-Texture','StdErr-Perimeter',
         'StdErr-Area','StdErr-Smoothness','StdErr-Compactness',
         'StdErr-Concavity','StdErr-ConcavePoints',
         'StdErr-Symmetry','StdErr-FractalDimension',
         'Worst-Radius','Worst-Texture','Worst-Perimeter',
         'Worst-Area','Worst-Smoothness','Worst-Compactness',
         'Worst-Concavity','Worst-ConcavePoints',
         'Worst-Symmetry','Worst-FractalDimension']

# loading the file into a dataframe
data = pd.read_csv(filename, names=names, header=None) 

# Convert the Diagnosis to a numeric variable
data['Diagnosis'] = data['Diagnosis'].map({'M': 1, 'B': 0})
# Malignant tumors = 1 or True and Benign tumors = 0 or False

# Split the data into the feature matrix X and the target vector Y
X = data.iloc[:, 2:32]   # load the 30 feature columns into the X dataframe
Y = data.iloc[:, 1]      # load the target (Diagnosis) into the Y series

# Get the number of rows and columns of the feature dataframe
(nRows, nCols) = X.shape
#X.head(0).T

## SelectKBest Features 
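We test SelectKBest to check that we are using the right features for our dataset. The example below uses the chi-squared $(\chi^2)$ statistical test, which requires non-negative features, to score every feature and keep the `k` highest-scoring ones. Note that $\chi^2$ is not the one-way ANOVA F-test: in scikit-learn the F-test is a separate score function, `f_classif`, which is what SelectKBest uses by default when no `score_func` is given.

With $\chi^2$, a large score suggests that the feature is not independent of the class label. With the ANOVA F-test, a large score suggests that the means of the ${K}$ groups are not all equal; that interpretation holds only when the input variables come from normally distributed populations and the population variances of the ${K}$ groups are the same.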
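To make the difference between the two score functions concrete, here is a minimal sketch that computes both on the same data. It assumes scikit-learn's bundled copy of the Wisconsin diagnostic data (`load_breast_cancer`) as a stand-in for the local CSV, so the column names differ from the `names` list used in this notebook, and `X_demo`, `y_demo`, and `comparison` are illustrative names only.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import chi2, f_classif

# Assumption: the bundled dataset stands in for wi_breast_cancer.csv
bunch = load_breast_cancer(as_frame=True)
X_demo, y_demo = bunch.data, bunch.target

# chi2 only accepts non-negative feature values, so check that first
assert (X_demo >= 0).all().all()

# Each score function returns a (scores, p-values) pair, one entry per feature
chi2_scores, chi2_pvalues = chi2(X_demo, y_demo)
f_scores, f_pvalues = f_classif(X_demo, y_demo)

# Put both sets of scores side by side, one row per feature
comparison = pd.DataFrame(
    {"chi2": chi2_scores, "f_classif": f_scores},
    index=X_demo.columns,
)
print(comparison.sort_values("chi2", ascending=False).head())
```

The two columns are on different scales and need not rank the features in the same order, so it is worth deciding up front which test matches the data.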

In [2]:
# Feature selection with univariate statistical tests (chi-squared for classification)
from sklearn.feature_selection import SelectKBest 
from sklearn.feature_selection import chi2

# Setting precision for display
pd.options.display.float_format = '{:,.2f}'.format
np.set_printoptions(precision = 2)

fitScores1 = []

# feature selection; k is the number of features you want to keep
test = SelectKBest(score_func=chi2, k=10)
fit = test.fit(X, Y)

# Find the scores for every feature so that you know which were selected
fitScores1 = fit.scores_

# Convert the numpy array of scores into a DF indexed by the feature names
features1 = pd.DataFrame(fitScores1, index=names[2:32])
print(features1) # one row per feature makes it easier to read

                                 0
Mean-Radius                 266.10
Mean-Texture                 93.90
Mean-Perimeter            2,011.10
Mean-Area                53,991.66
Mean-Smoothness               0.15
Mean-Compactness              5.40
Mean-Concavity               19.71
Mean-ConcavePoints           10.54
Mean-Symmetry                 0.26
Mean-FractalDimension         0.00
StdErr-Radius                34.68
StdErr-Texture                0.01
StdErr-Perimeter            250.57
StdErr-Area               8,758.50
StdErr-Smoothness             0.00
StdErr-Compactness            0.61
StdErr-Concavity              1.04
StdErr-ConcavePoints          0.31
StdErr-Symmetry               0.00
StdErr-FractalDimension       0.01
Worst-Radius                491.69
Worst-Texture               174.45
Worst-Perimeter           3,665.04
Worst-Area              112,598.43
Worst-Smoothness              0.40
Worst-Compactness            19.31
Worst-Concavity              39.52
Worst-ConcavePoints 

Eyeballing the scores, we can see that the variables with the highest scores are (in descending order of score): Worst-Area, Mean-Area, StdErr-Area, Worst-Perimeter and Mean-Perimeter.
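Rather than reading the ranking off by eye, the scores can be sorted directly. A small sketch, using a handful of the scores copied from the output above (the `scores` and `top5` names are illustrative):

```python
import pandas as pd

# A subset of the chi-squared scores printed above, keyed by feature name
scores = pd.Series({
    "Mean-Radius": 266.10,
    "Mean-Perimeter": 2011.10,
    "Mean-Area": 53991.66,
    "StdErr-Perimeter": 250.57,
    "StdErr-Area": 8758.50,
    "Worst-Radius": 491.69,
    "Worst-Perimeter": 3665.04,
    "Worst-Area": 112598.43,
})

# nlargest returns the k highest scores in descending order
top5 = scores.nlargest(5)
print(top5)
```

In the notebook itself, the same call should work on the scores dataframe, for example `features1[0].nlargest(5)`, assuming `features1` was built as in the cell above.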

Let's create a dataframe of just the selected variables. 

In [3]:
# Hand coding the headers, but will show later how to do this automatically
colHeads = ['Mean-Perimeter','Mean-Area','StdErr-Area','Worst-Perimeter','Worst-Area']

# perform the selection of fields so we have them for later analysis
kSelect = SelectKBest(chi2, k=5).fit_transform(X, Y)
(rows, cols) = kSelect.shape 

# Create a dataframe to hold the selected values (only) for later processing;
# rows are indexed from 1 and columns take the hand-coded headers
selected = pd.DataFrame(data=kSelect,
          index=np.array(range(1, rows+1)),
          columns=colHeads)
selected.head(1)

|   | Mean-Perimeter | Mean-Area | StdErr-Area | Worst-Perimeter | Worst-Area |
|---|----------------|-----------|-------------|-----------------|------------|
| 1 | 122.8          | 1001.0    | 153.4       | 184.6           | 2019.0     |
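The column headers do not have to be hand-coded: a fitted selector exposes a boolean mask of the kept columns through `get_support()`. The sketch below assumes scikit-learn's bundled copy of the dataset (`load_breast_cancer`) in place of the local CSV, so the feature names are scikit-learn's rather than the `names` list above; `selector`, `selected_cols`, and `selected_demo` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2

# Assumption: the bundled dataset stands in for wi_breast_cancer.csv
bunch = load_breast_cancer(as_frame=True)
X_demo, y_demo = bunch.data, bunch.target

# Fit the selector once, then reuse it for both the mask and the transform
selector = SelectKBest(score_func=chi2, k=5).fit(X_demo, y_demo)

# get_support() returns a boolean mask with True for each selected column
mask = selector.get_support()
selected_cols = X_demo.columns[mask]

# Build the reduced dataframe with the recovered column names
selected_demo = pd.DataFrame(
    selector.transform(X_demo),
    columns=selected_cols,
    index=X_demo.index,
)
print(list(selected_cols))
```

Taking the names from the mask keeps the headers in step with the selection if `k` or the score function changes later.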


Let's take a look at the p-values returned to see what light this might shed...

In [None]:
fitPValues1 = fit.pvalues_

# Convert the numpy array of p-values back into a DF with the correct column names
pValues1 = pd.DataFrame(fitPValues1.reshape(-1, len(fitPValues1)),columns=names[2:32]).T
print(pValues1)

What we find is that MANY of these would be considered significant. So perhaps we should use this technique not to pick a fixed number of features to keep, but to eliminate features. But if the data are heavily skewed, this method might eliminate good variables simply because they are normally distributed.

What this tells us is that the following features should probably be considered for removal, because they have a p-value above 0.5 (not to be confused with 0.05):
* Mean-Smoothness
* Mean-Symmetry
* Mean-FractalDimension*
* StdErr-Texture*
* StdErr-Smoothness*
* StdErr-Compactness
* StdErr-ConcavePoints
* StdErr-Symmetry*
* StdErr-FractalDimension*
* Worst-Smoothness
* Worst-FractalDimension

This would *remove* the 11 least predictive features. If we choose to be even more careful and use a threshold of 0.9 or higher, we'd drop only the 5 features marked with asterisks above.
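The thresholding described above can also be done programmatically instead of by reading the printed p-values. A minimal sketch, again assuming scikit-learn's bundled copy of the dataset in place of the local CSV; `features_above` is a hypothetical helper written for this example, not a scikit-learn function.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import chi2

# Assumption: the bundled dataset stands in for wi_breast_cancer.csv
bunch = load_breast_cancer(as_frame=True)
X_demo, y_demo = bunch.data, bunch.target

# chi2 returns (scores, p-values); keep the p-values, labelled by feature
_, pvalues = chi2(X_demo, y_demo)
pvalues = pd.Series(pvalues, index=X_demo.columns)

def features_above(pvals, threshold):
    """Return the names of features whose p-value exceeds the threshold."""
    return pvals[pvals > threshold].index.tolist()

# Candidates for removal at the two thresholds discussed in the text
drop_05 = features_above(pvalues, 0.5)
drop_09 = features_above(pvalues, 0.9)

# Drop the weaker candidates and keep the rest for later analysis
X_reduced = X_demo.drop(columns=drop_05)
print(len(drop_05), len(drop_09), X_reduced.shape)
```

Every feature flagged at the stricter 0.9 threshold is also flagged at 0.5, so the second list is always a subset of the first.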

Copyright (c) 2019 Kristin Tolle