# Feature Scaling

[All you need to know](https://en.wikipedia.org/wiki/Feature_scaling)

In [1]:
from sklearn.preprocessing import MinMaxScaler

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler()
print(scaler.fit(data))

MinMaxScaler(copy=True, feature_range=(0, 1))


In [2]:
print(scaler.data_max_)

[  1.  18.]


In [3]:
print(scaler.transform(data))

[[ 0.    0.  ]
 [ 0.25  0.25]
 [ 0.5   0.5 ]
 [ 1.    1.  ]]


In [4]:
print(scaler.transform([[2, 2]]))

[[ 1.5  0. ]]
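
The outputs above follow from the min-max formula `(x - min) / (max - min)`, applied per feature with the minimum and maximum taken from the data passed to `fit`. A minimal NumPy sketch that reproduces them by hand (the helper name `min_max_scale` is illustrative and not part of scikit-learn):

```python
import numpy as np

data = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]], dtype=float)

# Per-feature (column-wise) minimum and maximum of the training data.
data_min = data.min(axis=0)
data_max = data.max(axis=0)

def min_max_scale(x):
    """Rescale x with the training min/max: (x - min) / (max - min)."""
    return (np.asarray(x, dtype=float) - data_min) / (data_max - data_min)

scaled_train = min_max_scale(data)
scaled_new = min_max_scale([[2, 2]])

print(scaled_train)
print(scaled_new)  # the first value exceeds 1 because 2 is above the training max of 1
```

Values outside the range seen during fitting are not clipped, which is why the new sample maps to 1.5 on the first feature.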


## Algorithms that need rescaling

In general, algorithms that exploit distances or similarities between data samples (e.g. in the form of a scalar product), such as k-NN and SVM, are sensitive to feature transformations such as rescaling. Examples:
* **SVM with RBF Kernel**
* **K-Means Clustering**
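
To see why distance-based methods are sensitive to scale, here is a small sketch with made-up data (the height and income columns and the helper `nearest_neighbor` are assumptions for illustration). The nearest neighbour of the first sample changes once both features are min-max scaled:

```python
import numpy as np

# Hypothetical samples: column 0 is height in metres, column 1 is income in dollars.
X = np.array([[1.60, 50_000.0],
              [1.95, 50_500.0],
              [1.61, 50_600.0],
              [1.80, 60_000.0]])

def nearest_neighbor(X, query_idx):
    """Index of the sample closest (Euclidean) to X[query_idx], excluding itself."""
    dists = np.linalg.norm(X - X[query_idx], axis=1)
    dists[query_idx] = np.inf
    return int(np.argmin(dists))

# Raw features: income differences dominate the distance.
raw_nn = nearest_neighbor(X, 0)

# Min-max scaled features: both columns live on the same [0, 1] scale.
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
scaled_nn = nearest_neighbor(X_scaled, 0)

print(raw_nn, scaled_nn)
```

On the raw data the income column decides the neighbour almost alone; after scaling, the sample with nearly identical height and similar income becomes the closest one.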


[Useful resource](http://www.dataschool.io/comparing-supervised-learning-algorithms/)

# Feature Selection

Start from all of your features and find the subset that matters most.

Why do we care?
* Curse of dimensionality
* Knowledge discovery & insight


## Types of feature selection

![image.png](attachment:image.png)

**NOTE**: 
* The filter method uses a criterion that is independent of the learner, while the wrapper method has a feedback mechanism: the learner's performance feeds back into the selection criterion.
* Wrappers use the bias of the learner to find the best features while filters don't.

Decision trees can act as a kind of filtering algorithm, since they use the information gain of each feature to decide which features to split on.

**Different Kinds of Metrics for Filtering**
* Information Gain
* Variance
* Entropy
* Non-redundant
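
As a rough illustration of filtering, the sketch below scores each feature by variance and by information gain without consulting any learner. The toy data and the helper names `entropy` and `information_gain` are assumptions made for this example:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of discrete labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(feature, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    gain = entropy(labels)
    for value in np.unique(feature):
        mask = feature == value
        gain -= mask.mean() * entropy(labels[mask])
    return gain

# Toy data (made up): three binary features, one binary label.
X = np.array([[0, 1, 0],
              [0, 0, 0],
              [1, 1, 0],
              [1, 0, 0]])
y = np.array([0, 0, 1, 1])

variances = X.var(axis=0)
gains = np.array([information_gain(X[:, j], y) for j in range(X.shape[1])])

# Rank features from most to least informative (ties keep their original order).
ranking = [int(j) for j in np.argsort(-gains, kind="stable")]
print(variances, gains, ranking)
```

Here the first feature predicts the label perfectly, the second varies but tells us nothing about the label, and the third is constant, so variance alone would not separate the first two.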

**Different Kinds of Search Strategies for Wrapping**
* Hill climbing
* Randomized optimization
* Forward / backward search

### Forward Search

Build up the feature set one feature at a time by evaluating combinations of features. We start from the group of all candidate features. First we evaluate every feature individually and determine that feature 1 is best. We then look at feature 1 combined with each of the other features, one at a time, and find that features 1 and 5 give the best performance. Next we find that features 1, 5, 3 give the best performance among the three-feature sets, but the difference in performance between 1, 5 and 1, 5, 3 is negligible, so we stop.
![image.png](attachment:image.png)
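
The procedure above can be sketched as a greedy loop. This is a minimal sketch under stated assumptions: `forward_search`, `min_gain`, and `toy_score` are hypothetical names, and the score table is invented to mirror the features 1, 5, 3 example. In a real wrapper, `score` would be the learner's measured performance on the chosen subset.

```python
def forward_search(features, score, min_gain=0.01):
    """Greedy forward selection.

    features: iterable of candidate feature ids.
    score:    callable mapping a tuple of feature ids to a number (higher is better).
    min_gain: stop when the best addition improves the score by less than this.
    """
    selected = []
    remaining = list(features)
    best_score = float("-inf")
    while remaining:
        # Try adding each remaining feature to the current subset; keep the best.
        candidate = max(remaining, key=lambda f: score(tuple(selected + [f])))
        candidate_score = score(tuple(selected + [candidate]))
        if candidate_score - best_score < min_gain:
            break  # negligible improvement: keep the current subset
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score

# Invented per-feature contributions standing in for a learner's performance.
WEIGHTS = {1: 0.60, 2: 0.0, 3: 0.005, 4: 0.0, 5: 0.25}

def toy_score(subset):
    return sum(WEIGHTS[f] for f in subset)

selected, best = forward_search([1, 2, 3, 4, 5], toy_score)
print(selected, best)
```

With these invented numbers the search picks feature 1, then feature 5, and stops because adding feature 3 improves the score by less than `min_gain`.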

### Backward Search

Works similarly to forward search, but in reverse: start with all the features and look at what you can eliminate. E.g. in the example below, there is a very large degradation in performance between steps 3 and 4, so we stick with step 3.
![image.png](attachment:image.png)
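
A matching sketch for the backward direction, again under assumptions: `backward_search`, `max_drop`, and `toy_score` are hypothetical, and the scores are invented. It repeatedly removes the feature whose removal hurts least and stops before a large degradation.

```python
def backward_search(features, score, max_drop=0.05):
    """Greedy backward elimination.

    Starts from all features and removes one at a time, stopping when even the
    least harmful removal would lower the score by more than max_drop.
    """
    selected = list(features)
    current_score = score(tuple(selected))
    while len(selected) > 1:
        # Find the feature whose removal leaves the highest score.
        candidate = max(
            selected,
            key=lambda f: score(tuple(x for x in selected if x != f)),
        )
        reduced = [x for x in selected if x != candidate]
        reduced_score = score(tuple(reduced))
        if current_score - reduced_score > max_drop:
            break  # large degradation: keep the previous step
        selected, current_score = reduced, reduced_score
    return selected, current_score

# Invented per-feature contributions standing in for a learner's performance.
WEIGHTS = {1: 0.60, 2: 0.0, 3: 0.005, 4: 0.0, 5: 0.25}

def toy_score(subset):
    return sum(WEIGHTS[f] for f in subset)

kept, kept_score = backward_search([1, 2, 3, 4, 5], toy_score)
print(kept, kept_score)
```

With these numbers, features 2, 4, and 3 are dropped at almost no cost, while removing either of the remaining two would cause a large drop, so the search stops there.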

## Relevance

Relevance measures a feature's effect on the B.O.C.

Why do we pick the B.O.C.? It computes the best achievable label, so a feature's effect on it reflects true information gain.

**B.O.C.** - Bayes optimal classifier

![image.png](attachment:image.png)

## Relevance vs Usefulness

* Relevance measures the effect on the B.O.C.
* Usefulness measures the effect on a particular predictor

i.e.
* Relevance ~ Information gain/loss
* Usefulness ~ Error given the model/learner
  * like $y=w^Tx$
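
One way to picture the distinction is a constant feature, which carries no information about the label yet helps a linear model of the form $y=w^Tx$ that has no intercept term. The sketch below is an illustration under assumptions: the data, the thresholded labels, and the helper names are all made up for this example.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of discrete labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(feature, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    gain = entropy(labels)
    for value in np.unique(feature):
        mask = feature == value
        gain -= mask.mean() * entropy(labels[mask])
    return gain

def linear_mse(X, y):
    """Mean squared error of the least-squares fit y ~ X @ w (no intercept)."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.mean((X @ w - y) ** 2))

x = np.array([0.0, 1.0, 2.0, 3.0])
const = np.ones_like(x)            # constant feature
y = 2 * x + 3                      # target with an offset
labels = (y > 6).astype(int)       # discretised target for information gain

gain_const = information_gain(const, labels)                  # relevance proxy
mse_x_only = linear_mse(x[:, None], y)                        # usefulness: error without it
mse_x_and_const = linear_mse(np.column_stack([x, const]), y)  # error with it

print(gain_const, mse_x_only, mse_x_and_const)
```

The constant column has zero information gain, so it is not relevant in the sense above, but it lets the linear model represent the offset and removes essentially all of the error, so it is useful to this particular learner.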