# Diabetes Data: Feature Selection
## Notes
Project Links:
* https://machinelearningmastery.com/feature-selection-machine-learning-python/
* https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
* https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/
* https://towardsdatascience.com/feature-selection-with-pandas-e3690ad8504b

Feature selection is the process of keeping only the most useful and relevant features from your source dataset. The main advantages include:
* **Improving accuracy** by only using the most strongly correlated or impactful variables.
* **Reducing overfitting** by removing redundant/irrelevant variables, which would otherwise add noise to the data and let the model fit patterns that aren't relevant.
* **Improving the speed** of the process by simply reducing the number of features and therefore the data that needs to be processed.

There are many ways of conducting feature selection, so we will run through a few of the key ones here on a diabetes dataset, which contains a number of numerical variables (features) as well as a class variable (0 or 1 to signify not having or having diabetes respectively). This makes it a classification problem. Note that the appropriate statistical method for selecting features depends on whether your problem is classification or regression, and on the type of data you're using (e.g. numerical, binary, categorical).

## Types of Feature Selection Methods
There are 3 main categories of feature selection methods:
* Filter methods
* Wrapper methods
* Embedded methods

**Filter methods** look at one feature variable at a time (either in isolation or in the context of the target variable) and determine a score for each variable before keeping the variables with the highest scores. An example of this would be the **univariate feature selection** section below, which commonly uses chi-squared tests or correlation coefficients (e.g. in linear regressions).

**Wrapper methods** look at combinations of features, build a model and then assess the score of the model as a whole. They try different combinations and eliminations of features to find the best model score possible. The search can be **methodical** (e.g. best first search), **stochastic** (i.e. random, e.g. random hill-climbing) or **heuristic** (i.e. guided by a rule of thumb, e.g. recursive feature elimination).

**Embedded methods** are built into the training of the model itself. They learn the best combination of features whilst the model is being fitted, using weighting and penalising to adjust the model on the fly. Most commonly these include **regularization methods** such as lasso, elastic net and ridge regression, which introduce specific constraints on the model (of these, only lasso and elastic net can shrink coefficients to exactly zero and so drop features outright).
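As a rough illustration of an embedded method, the sketch below fits an L1-penalised linear classifier (the same kind of penalty lasso uses) and keeps only the features whose coefficients survive. The synthetic data from `make_classification`, the choice of `LinearSVC` and the value `C=0.05` are all assumptions made for illustration; none of them come from the diabetes analysis.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# synthetic stand-in data (assumption): 8 numeric features, 3 informative
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)
X_scaled = StandardScaler().fit_transform(X_demo)  # put features on one scale

# the L1 penalty pushes weak coefficients to exactly zero during fitting
l1_model = LinearSVC(penalty='l1', dual=False, C=0.05,
                     max_iter=5000, random_state=0)
selector = SelectFromModel(l1_model).fit(X_scaled, y_demo)

embedded_mask = selector.get_support()  # True where a feature was kept
print(selector.estimator_.coef_)        # fitted coefficients, one per feature
print(embedded_mask)
```

A smaller `C` means a stronger penalty and therefore fewer surviving features, so in practice `C` would need tuning.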

## When to Apply Feature Selection
It is important to fit your feature selection method on the training data only, right **before you fit the model**. This means that you should not fit your feature selection on the full raw dataset and then split it into train, test etc., because in that case the selection will already have seen the test data and your performance estimates will likely be **overfitted and biased**.

Instead, you should split your dataset into its relevant subsets (e.g. train/test or the folds within cross-validation) and then **fit the feature selection on each training subset right before fitting the model**. This way the selection only ever sees the training portion, and the held-out portion gives a less biased estimate of performance.
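One way to follow this advice in scikit-learn is to place the selector inside a `Pipeline`, so that it is refitted on the training portion of every cross-validation fold. The sketch below is only an illustration: the synthetic data, the step names and the choice of `k=4` with a logistic regression classifier are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data (assumption): 8 numeric features, binary target
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

# every step is fitted on the training folds only, then applied to the held-out fold
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('select', SelectKBest(score_func=f_classif, k=4)),
    ('clf', LogisticRegression(max_iter=1000)),
])

cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=5)  # accuracy per fold
print(cv_scores)
print(cv_scores.mean())
```

Because the selector lives inside the pipeline, different folds may end up keeping different features, which is the intended behaviour here.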

## 1. Univariate Feature Selection
This method simply selects the most relevant features from our dataset for use in the final model. It uses a selected test or threshold to determine which feature variables are the most effective predictors of our target variable and removes the others. There are different ways of doing this, such as setting a score threshold and removing all features which don't meet that score, but here we will use **SelectKBest** which selects a specified number of features **k** from our dataset using a specified statistical test to score each variable.

There are a number of different statistical tests you can use. The choice depends on the type of problem (e.g. regression or classification) and the data you're running it on (e.g. numerical, binary...). In this case, we have numerical input variables and a categorical target, so we will use the **ANOVA F-value** via the **f_classif()** function in scikit-learn.
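The threshold-based alternative mentioned above can be sketched by calling `f_classif` directly, which returns an F-value and a p-value per feature. The synthetic data and the 0.05 cut-off below are assumptions for illustration, not values taken from this analysis.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif

# synthetic stand-in data (assumption): 8 numeric features, binary target
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

# one ANOVA F-value and one p-value per feature
f_scores, p_values = f_classif(X_demo, y_demo)

# keep features whose p-value is below a chosen threshold (0.05 is an assumption)
keep_mask = p_values < 0.05
X_kept = X_demo[:, keep_mask]

print(f_scores)
print(keep_mask)
print(X_kept.shape)
```

Unlike `SelectKBest`, this approach does not fix the number of features in advance; how many survive depends on the threshold.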

**Further notes:** https://machinelearningmastery.com/an-introduction-to-feature-selection/

In [2]:
# load libraries
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# load data
df = pd.read_csv('Diabetes.csv')

# peek at data
df.head()

| | Number of times pregnant | Plasma glucose concentration | Diastolic blood pressure (mm Hg) | Triceps skin fold thickness (mm) | 2-hour serum insulin (mm U/ml) | BMI | Diabetes pedigree function | Age (years) | Diabetes (Binary Y/N) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |


In [11]:
# extract X and y variables
array = df.values
X = array[:, :8]
y = array[:, 8]

# check shape before
print('Input shape: ', X.shape)

# define feature selection (pick 4 using ANOVA F-value)
test = SelectKBest(score_func=f_classif, k=4) # build object
fit = test.fit(X, y) # score each variable in X vs. y

# extract features
print(fit.scores_) # show test scores per variable
features = fit.transform(X) # extract 4 features
print(features[:5,:]) # show selected features top 5 rows
print('Output shape: ', features.shape)

Input shape:  (768, 8)
[ 39.67022739 213.16175218   3.2569504    4.30438091  13.28110753
  71.7720721   23.8713002   46.14061124]
[[  6.  148.   33.6  50. ]
 [  1.   85.   26.6  31. ]
 [  8.  183.   23.3  32. ]
 [  1.   89.   28.1  21. ]
 [  0.  137.   43.1  33. ]]
Output shape:  (768, 4)


You can see that our original input had 8 features whilst the output following feature selection has just 4, as specified. The scores for each variable are shown in the first array above; the method selected the 4 highest-scoring variables (the 1st, 2nd, 6th and 8th columns) and produced the output array 'features' above.
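Since the transformed array carries no column labels, it can help to map the selection back to feature names with `get_support()`. The sketch below shows one way to do this; the synthetic data and the `feat_0` ... `feat_7` column names are hypothetical stand-ins for the real dataframe.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# synthetic stand-in data with hypothetical column names (assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)
demo_df = pd.DataFrame(X_demo, columns=[f'feat_{i}' for i in range(8)])

selector = SelectKBest(score_func=f_classif, k=4).fit(demo_df, y_demo)

# label each score with its column name and sort from best to worst
score_table = pd.Series(selector.scores_, index=demo_df.columns)
score_table = score_table.sort_values(ascending=False)

# get_support() returns a boolean mask over the original columns
selected_names = list(demo_df.columns[selector.get_support()])

print(score_table)
print(selected_names)
```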

For help deciding which statistical test to use, see this link: https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/

## 2. Recursive Feature Elimination (RFE)
This method works by recursively eliminating variables from the dataset: it fits the model, scores each variable (e.g. via **coef_** for linear models or **feature_importances_** for tree-based models), removes the variable(s) with the lowest scores and then refits, continuing until the requested number of features remains. A cross-validated variant (**RFECV** in scikit-learn) can additionally choose how many features to keep based on the model score.

The method below uses RFE with a logistic regression model to find the 3 best variables in our dataset. The choice of estimator matters less than using it consistently, although different estimators can rank features differently *(further reading needed here)*.
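For the cross-validated form of RFE, scikit-learn provides `RFECV`, which picks the number of features to keep by comparing cross-validated scores. The following is a minimal sketch under stated assumptions: the data is synthetic, and the 5-fold setup with accuracy scoring is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# synthetic stand-in data (assumption): 8 numeric features, binary target
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

# drop one feature per round; score each subset size with 5-fold CV
rfecv = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5,
              scoring='accuracy')
rfecv.fit(X_demo, y_demo)

print(rfecv.n_features_)  # number of features chosen by cross-validation
print(rfecv.support_)     # True where a feature was kept
print(rfecv.ranking_)     # 1 = kept, larger = eliminated earlier
```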

In [12]:
# load libraries
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# build model
model = LogisticRegression(solver='lbfgs') # logistic regression model
rfe = RFE(model, n_features_to_select=3) # pick 3 best features using logistic regression scoring (keyword argument required in recent scikit-learn)

# fit model
fit = rfe.fit(X, y)

# analyse output
print("Num Features: %d" % fit.n_features_) # show final # of features
print("Selected Features: %s" % fit.support_) # show T/F inclusion of features
print("Feature Ranking: %s" % fit.ranking_) # show ranking of each feature

Num Features: 3
Selected Features: [ True False False False False  True  True False]
Feature Ranking: [1 2 4 5 6 1 1 3]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


The recursive model eliminated 5 features as specified to leave us with the 3 it deemed optimal.
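The convergence warning in the output above suggests scaling the data or raising `max_iter`. A minimal sketch of doing both before running RFE is shown below; the synthetic data, the per-column scale factors and `max_iter=1000` are assumptions, and the selected features on the real dataset may differ once it is scaled.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data, stretched so columns have very different scales (assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)
X_raw = X_demo * np.array([1, 100, 50, 20, 500, 30, 0.5, 40])

# standardise each column to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X_raw)

# give the solver more iterations and run RFE on the scaled data
rfe_scaled = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe_scaled.fit(X_scaled, y_demo)

print(rfe_scaled.support_)
print(rfe_scaled.ranking_)
```

Scaling also makes the coefficients comparable across features, which matters here because RFE ranks features by coefficient size.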

## 3. Principal Component Analysis (PCA)
PCA is technically a **dimensionality reduction** process and so I will cover it in more detail in its own notebook, but for now here's a basic overview of the code and concept.

PCA is designed to derive a small number of components from your dataset, given a variance threshold or a final number of components chosen by you. Each component is a linear combination of the original features rather than one of the original features itself. The core concept of PCA is that it compresses your dataset into fewer variables and then uses these as predictors.

In [13]:
# load libraries
from sklearn.decomposition import PCA

# feature extraction
pca = PCA(n_components=3) # create model object, keep 3 components
fit = pca.fit(X) # fit PCA (unsupervised, so y is not needed)

# summarize components
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)

Explained Variance: [0.88854663 0.06159078 0.02579012]
[[-2.02176587e-03  9.78115765e-02  1.60930503e-02  6.07566861e-02
   9.93110844e-01  1.40108085e-02  5.37167919e-04 -3.56474430e-03]
 [-2.26488861e-02 -9.72210040e-01 -1.41909330e-01  5.78614699e-02
   9.46266913e-02 -4.69729766e-02 -8.16804621e-04 -1.40168181e-01]
 [-2.24649003e-02  1.43428710e-01 -9.22467192e-01 -3.07013055e-01
   2.09773019e-02 -1.32444542e-01 -6.39983017e-04 -1.25454310e-01]]


The output above looks nothing like the original input: the first array gives the share of variance explained by each component, and each row of the second array holds the weights that combine the 8 original variables into one component. Again though, this isn't pure feature selection; it's dimensionality reduction, as the input variables have been transformed rather than simply selected.
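To make the transformation concrete, the sketch below projects data onto 3 components and checks that each new column is just a weighted sum of the centred inputs. The data is synthetic and standardised first, both of which are assumptions; standardising is a common choice because otherwise a single large-scale variable can dominate the first component, as the 0.993 weight on the 5th variable above suggests.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data (assumption): 8 numeric features
X_demo, _ = make_classification(n_samples=300, n_features=8,
                                n_informative=3, n_redundant=2,
                                random_state=0)
X_scaled = StandardScaler().fit_transform(X_demo)

pca = PCA(n_components=3).fit(X_scaled)
X_reduced = pca.transform(X_scaled)  # 8 columns compressed into 3

# the same projection done by hand: centre, then multiply by the component weights
manual = (X_scaled - pca.mean_) @ pca.components_.T

print(X_reduced.shape)
print(np.allclose(X_reduced, manual))
print(pca.explained_variance_ratio_)
```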

## 4. Feature Importance
Finally, you can calculate an **importance** score for each variable and use this to select the most valuable variables in your analysis. This comes naturally with **bagged decision tree** models such as Random Forest or Extra Trees, which expose an importance score for every feature once fitted.

Here, we can use the **ExtraTreesClassifier** to calculate the importance of each variable in our dataset.

In [16]:
# load libraries
from sklearn.ensemble import ExtraTreesClassifier

# feature extraction
model = ExtraTreesClassifier(n_estimators=10) # create model
model.fit(X, y) # fit model
print(model.feature_importances_) # show importance of each feature

[0.10620699 0.24336721 0.09617079 0.08446838 0.07060376 0.14199293
 0.11640966 0.14078028]


The higher the score, the higher the feature's importance. The above results suggest that the 2nd, 6th and 8th columns are the most important, although you may want to interrogate the scores and define your own thresholds.
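As a small worked example of such a threshold, the sketch below ranks the importances printed above and keeps those above the mean. The mean cut-off is an assumption chosen for illustration; any other rule could be substituted.

```python
import numpy as np

# importances copied from the ExtraTreesClassifier output above
importances = np.array([0.10620699, 0.24336721, 0.09617079, 0.08446838,
                        0.07060376, 0.14199293, 0.11640966, 0.14078028])

# 1-based column positions, most important first
ranked_columns = np.argsort(importances)[::-1] + 1
top_three = ranked_columns[:3]

# alternative rule: keep every column scoring above the mean importance
above_mean = np.flatnonzero(importances > importances.mean()) + 1

print(top_three)   # [2 6 8]
print(above_mean)  # [2 6 8]
```

Extra Trees is a randomised method, so the scores will vary between runs unless `random_state` is fixed; it is worth checking that the ranking is stable before relying on it.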

**Documentation:** https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html