# Feature Selection
Having too many features is a problem because of:

* NumFeatures > NumInstances
* Curse of dimensionality
* Worse-than-linear algorithm complexity
* Model confusion from irrelevant features
* Overfitting to variance in irrelevant features

After FEATURE EXTRACTION,   
we need to reduce complexity by  
FEATURE SELECTION or DIMENSIONALITY REDUCTION.

But finding the optimal subset is NP-hard.
Also, we cannot just keep the features with high individual correlation to the label (predictive value),
because that would miss important feature-interaction effects. 
So in practice we rely on heuristics and domain-specific rules to answer:

* Which features can we do without? 
* Which features can we ignore? 

## Separability
See our notebook on separability.  
Separability over all features sets the upper limit on the learning accuracy.  
Max-Min problem: maximally reduce features while minimally reducing separability.

## Correlation
Find features that are highly correlated with the labels.

Pearson's correlation between feature f and class y:   
$R = \frac {\mathrm{cov}(f,y)} {\sqrt{\mathrm{var}(f)\,\mathrm{var}(y)}}$   
$R = \frac {\sum(f-\bar{f})(y-\bar{y})} {\sqrt{\sum(f-\bar{f})^2 \sum(y-\bar{y})^2}}$   
Usually use $|R|$ or $R^2$.  

Problem: Correlation only detects linear relationships.  
Important features may be predictive of class collectively 
but uncorrelated to class individually.
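
As a minimal sketch of the formula and the problem above, the snippet below computes Pearson's R in plain Python on a made-up XOR data set. The data and the names `pearson_r`, `x1`, `x2`, and `x_inter` are illustrative assumptions, not part of any library.

```python
import math

def pearson_r(f, y):
    """Pearson's correlation between a feature column f and labels y.

    Assumes both columns have non-zero variance.
    """
    n = len(f)
    f_mean = sum(f) / n
    y_mean = sum(y) / n
    cov = sum((fi - f_mean) * (yi - y_mean) for fi, yi in zip(f, y))
    var_f = sum((fi - f_mean) ** 2 for fi in f)
    var_y = sum((yi - y_mean) ** 2 for yi in y)
    return cov / math.sqrt(var_f * var_y)

# Toy data, made up for illustration: y = x1 XOR x2.
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
y = [0, 1, 1, 0]

r1 = pearson_r(x1, y)  # x1 alone is uncorrelated with y
r2 = pearson_r(x2, y)  # so is x2

# An explicit interaction feature exposes the relationship.
x_inter = [a ^ b for a, b in zip(x1, x2)]
r_inter = pearson_r(x_inter, y)

print(r1, r2, r_inter)
```

Here each feature has R = 0 on its own, yet the pair determines the class exactly, so ranking features by |R| alone would discard both.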

## Mutual Information
Find a feature subset that shares as much information as possible 
with the labels. 

Compared to correlation, 
mutual information detects non-linear relationships,
but is harder to compute.

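Information of an outcome x, and its expected value (entropy):  
$I(x) = -\ln p(x)$  
$H(X) = -\sum_x p(x) \ln p(x)$  

Mutual information between feature $x_i$ and class $y$
compares the joint probability to the product of the marginal probabilities:  
$I(x_i,y) = \sum_{x_i} \sum_y p(x_i,y) \ln \frac{p(x_i,y)}{p(x_i)\,p(y)}$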

With discrete variables, 
and using frequencies as empirical estimates of probabilities:  
$I(X,Y) = \sum_x \sum_y P(X=x,Y=y) \ln \frac{P(X=x,Y=y)}{P(X=x)\,P(Y=y)}$
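
The frequency-based estimate can be sketched in a few lines of plain Python. The function name `mutual_information` and the XOR toy data are assumptions made for illustration. The result is in nats because the natural log is used.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Estimate I(X,Y) in nats from paired discrete samples,
    using observed frequencies as probability estimates."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of (x, y) pairs
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# Toy data, made up for illustration: y = x1 XOR x2.
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
y = [0, 1, 1, 0]

mi_x1 = mutual_information(x1, y)                   # one feature alone
mi_pair = mutual_information(list(zip(x1, x2)), y)  # both features jointly

print(mi_x1, mi_pair)
```

On this toy data a single feature carries no information about the class, while the pair carries ln 2 nats (one full bit). Scoring features one at a time can therefore still miss interactions, even with mutual information.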

## Random Forest
Random forest ranks features by importance,
and you can try using the top-ranked features.

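A random forest computes 
the decrease in Gini impurity (or the information gain) at each split node of each decision tree.
A feature's overall importance combines these decreases, 
typically by averaging over all the nodes and trees that split on that feature.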

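This is a form of random guessing (random subsets of samples and features) 
plus hill climbing (greedy choice of the best split). 
It is supervised, since it needs the labels.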
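
To make the per-node quantity concrete, here is a minimal sketch of Gini impurity and the impurity decrease for one candidate split. It is not a random forest: a real forest would accumulate such decreases over many nodes and trees. The helper names, the threshold, and the toy data are assumptions made for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2 over the class frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(feature, labels, threshold):
    """Drop in Gini impurity from splitting on feature <= threshold,
    with each child weighted by its share of the samples."""
    left = [y for f, y in zip(feature, labels) if f <= threshold]
    right = [y for f, y in zip(feature, labels) if f > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted

# Toy data, made up for illustration.
labels = [0, 0, 0, 1, 1, 1]
informative = [1.0, 1.2, 0.9, 3.1, 2.8, 3.3]  # separates the classes
noisy = [1.0, 3.0, 2.9, 1.1, 3.2, 0.8]        # nearly unrelated to the class

gain_informative = impurity_decrease(informative, labels, 2.0)
gain_noisy = impurity_decrease(noisy, labels, 2.0)

print(gain_informative, gain_noisy)
```

The informative feature removes all of the impurity at this node, while the noisy one removes very little. Averaged over a forest, that difference is what pushes a feature up or down the importance ranking.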

## Filter vs Wrapper methods
Empirically, try classification while leaving out some features.

### Filter methods
These are independent of any model.   
Sometimes loosely called unsupervised, although some filters do use the labels.   
May not be optimal for a chosen model.    

Examples: 
* Choose features that improve clustering and reduce outliers.
* Ignore some features that correlate with other features.
* Choose features that correlate with the label (supervised).

### Wrapper methods
These are specific to a model.   
Also called supervised.   
Test goodness by cross-validation using the model of interest.  

Methods for feature selection with a wrapper:
* Forward selection ... start with one feature and iteratively add more.   
* Reverse selection (backward elimination) ... start with the full feature set and iteratively remove a feature. 
* Bidirectional selection ... after each feature addition, check whether any previously added feature can now be removed.

Note these are greedy and compute-intensive.
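
Forward selection from the list above can be sketched as follows. The wrapper is written against a hypothetical `score_fn(subset)` callable, which in practice would be something like the mean cross-validated accuracy of the model of interest. The `toy_score` function here is made up so that the example runs on its own.

```python
def forward_selection(feature_names, score_fn, min_gain=0.0):
    """Greedy forward selection.

    score_fn(subset) -> float is a hypothetical callable supplied by
    the caller, e.g. cross-validated accuracy on that feature subset.
    """
    selected = []
    best_score = float("-inf")
    remaining = list(feature_names)
    while remaining:
        # Score every one-feature extension of the current subset.
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        score, feature = max(scored)
        if score - best_score <= min_gain:
            break  # no candidate improves the score enough
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score

def toy_score(subset):
    """Made-up stand-in for cross-validation: 'a' and 'b' help,
    and every extra feature costs a small penalty."""
    useful = {"a": 0.30, "b": 0.20}
    return 0.5 + sum(useful.get(f, 0.0) for f in subset) - 0.01 * len(subset)

selected, best = forward_selection(["a", "b", "c", "d"], toy_score)
print(selected, best)
```

Each round calls `score_fn` once per remaining feature, so n features can need on the order of n² model evaluations, which is why these methods are expensive. Being greedy, the loop never revisits an earlier choice. That is the gap bidirectional selection tries to close.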