# The Feature Ranking Approach

As explained earlier, the ranking approach scores every feature by some criterion and selects those whose score is above a defined threshold. The algorithm is generic: you only need to decide which ranking criterion is the best one to use. In the FSelector package, each criterion is represented by a function that calculates weights for your features according to a model.
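The generic ranking algorithm described above can be sketched in base R as follows. This is only an illustration: `rank_features`, `select_above`, and the `r_squared` criterion are hypothetical names introduced here, not FSelector functions, and the 0.9 threshold is an arbitrary choice.

```r
# Score every feature with `score_fn`, then sort from best to worst.
rank_features <- function(data, target, score_fn) {
  features <- setdiff(names(data), target)
  scores <- vapply(features,
                   function(f) score_fn(data[[f]], data[[target]]),
                   numeric(1))
  sort(scores, decreasing = TRUE)
}

# Keep the names of the features whose score exceeds the threshold.
select_above <- function(scores, threshold) {
  names(scores)[scores > threshold]
}

# Illustrative criterion (an assumption, not an FSelector function):
# share of a numeric feature's variance explained by the class label.
r_squared <- function(x, y) summary(lm(x ~ y))$r.squared

scores <- rank_features(iris, "Species", r_squared)
print(round(scores, 3))

selected <- select_above(scores, threshold = 0.9)
print(selected)
```

Swapping `r_squared` for another scoring function changes the criterion without touching the ranking or thresholding steps, which is the point of the approach.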

# R Packages

In [1]:
library(MASS)
library(pROC)
library(RWeka)

Type 'citation("pROC")' for a citation.

Attaching package: ‘pROC’

The following objects are masked from ‘package:stats’:

    cov, smooth, var



In [24]:
library(FSelector)
data(iris)

weights <- information.gain(Species~., iris)
print(weights)

subset <- cutoff.k(weights, 2)

f <- as.simple.formula(subset, "Species")
print(f)

             attr_importance
Sepal.Length       0.4521286
Sepal.Width        0.2672750
Petal.Length       0.9402853
Petal.Width        0.9554360
Species ~ Petal.Width + Petal.Length
<environment: 0x5567e80390d0>


The idea behind FSelector and its functions is to find the most useful subset of the attributes in a data set. Some attributes may be unnecessary, but that depends on the data set you are dealing with.

information.gain is a function that computes a weight for each attribute based on its information gain with respect to the class. A cutoff function such as cutoff.k then selects the attributes with the highest weights.
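To make the step from weights to a selected subset concrete, here is a base-R sketch of two cutoff rules applied to the iris weights printed in the output above. `top_k` and `biggest_gap` are hypothetical stand-ins written for this note. They mimic what cutoff.k and cutoff.biggest.diff appear to do and are not the package's own code.

```r
# Information-gain weights for iris, copied from the printed output above.
weights <- data.frame(
  attr_importance = c(0.4521286, 0.2672750, 0.9402853, 0.9554360),
  row.names = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
)

# Hypothetical stand-in for cutoff.k: the k highest-weighted attributes.
top_k <- function(w, k) {
  ranked <- w[order(w[[1]], decreasing = TRUE), , drop = FALSE]
  head(rownames(ranked), k)
}

# Hypothetical stand-in for cutoff.biggest.diff: cut the ranking at the
# largest drop between two consecutive weights.
biggest_gap <- function(w) {
  ranked <- w[order(w[[1]], decreasing = TRUE), , drop = FALSE]
  if (nrow(ranked) < 2) return(rownames(ranked))
  drops <- -diff(ranked[[1]])
  rownames(ranked)[seq_len(which.max(drops))]
}

print(top_k(weights, 2))
print(biggest_gap(weights))
```

Both rules keep Petal.Width and Petal.Length here, because the largest drop in the ranking falls right after the second attribute.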

In [52]:
#calculate weights for each attribute using some function

weights <- information.gain(BCA~., mydata)
print(weights)

        attr_importance
STIL         0.22057783
MSMB         0.17549995
TSPAN13      0.17383600
AGR2         0.11872998
PECI         0.11438001
ABHD12       0.10643349
SLC26A2      0.09992717
GDF15        0.09812053
EIF2B5       0.07533550
MAD1L1       0.06429559
TM4SF1       0.06213647
COL6A32      0.04495090
TNFSF10      0.04162775


In [56]:
data(iris)

# Information gain; the default unit is the natural logarithm
weights <- information.gain(Species~., iris)
print(weights)
subset <- cutoff.k(weights, 2)            # keep the 2 highest-weighted attributes
f <- as.simple.formula(subset, "Species")
print(f)

# Same measure expressed in bits
weights <- information.gain(Species~., iris, unit = "log2")
print(weights)

# Gain ratio
weights <- gain.ratio(Species~., iris)
print(weights)
subset <- cutoff.k(weights, 2)
f <- as.simple.formula(subset, "Species")
print(f)

# Symmetrical uncertainty, cut at the largest gap between ranked weights
weights <- symmetrical.uncertainty(Species~., iris)
print(weights)
subset <- cutoff.biggest.diff(weights)
f <- as.simple.formula(subset, "Species")
print(f)

             attr_importance
Sepal.Length       0.4521286
Sepal.Width        0.2672750
Petal.Length       0.9402853
Petal.Width        0.9554360
Species ~ Petal.Width + Petal.Length
<environment: 0x5567e8070160>
             attr_importance
Sepal.Length       0.6522837
Sepal.Width        0.3855963
Petal.Length       1.3565450
Petal.Width        1.3784027
             attr_importance
Sepal.Length       0.4196464
Sepal.Width        0.2472972
Petal.Length       0.8584937
Petal.Width        0.8713692
Species ~ Petal.Width + Petal.Length
<environment: 0x5567e65a6488>
             attr_importance
Sepal.Length       0.4155563
Sepal.Width        0.2452743
Petal.Length       0.8571872
Petal.Width        0.8705214
Species ~ Petal.Width + Petal.Length
<environment: 0x5567e7c528e8>


## Feature Selection using Information Gain in R


When considering a predictive model, you might be interested in knowing which features of your data provide the most information about the target variable of interest. For example, suppose we’d like to predict the species of Iris based on sepal length and width as well as petal length and width (using the iris dataset in R).

Which of these 4 features provides the “purest” segmentation with respect to the target? Or put differently, if you were to place a bet on the correct species, and could only ask for the value of 1 feature, which feature would give you the greatest likelihood of winning your bet?

While there are many R packages out there for attribute selection, I’ve coded a few basic functions for my own usage for selecting attributes based on Information Gain (and hence on Shannon Entropy).

For starters, let’s define what we mean by Entropy and Information Gain.

$$
\begin{array} { c } { \text { Shannon Entropy } } \\ { H \left( p _ { 1 } \ldots p _ { n } \right) = - \sum _ { i = 1 } ^ { n } p _ { i } \log _ { 2 } p _ { i } } \end{array}
$$


Where $p_i$ is the probability of value $i$ and $n$ is the number of possible values. For example, in the iris dataset we have 3 possible values for Species (setosa, versicolor, virginica), each representing $\frac{1}{3}$ of the data. Therefore

$$
H = - \sum _ { i = 1 } ^ { 3 } \frac { 1 } { 3 } \log _ { 2 } \frac { 1 } { 3 } = \log _ { 2 } 3 \approx 1.59
$$
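The entropy of the Species column can be checked in a couple of lines of base R. The helper name `entropy` is my own choice for this sketch.

```r
# Shannon entropy (in bits) of a vector of class labels.
entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]            # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

h_species <- entropy(iris$Species)
print(h_species)           # equals log2(3), roughly 1.585
```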


$$
\begin{array} { c } { \text { Information Gain } } \\ { I G = H _ { p } - \sum _ { i = 1 } ^ { n } p _ { c i } H _ { c i } } \end{array}
$$

Where $H_p$ is the entropy of the parent (the complete, unsegmented dataset), $n$ is the number of child segments (one per value, or bin, of the feature we split on), $p_{ci}$ is the probability that an observation is in child $i$ (the weighting), and $H_{ci}$ is the entropy of child (segment) $i$.

Continuing with our iris example, we could ask the following: “Can we improve (reduce) the entropy of the parent dataset by segmenting on Sepal Length?”

In this case, Sepal Length is numeric. You’ll notice the code provides functions for both numeric and categorical variables. For categorical variables, we simply segment on each possible value. However in the numeric case, we will bin the data according to the desired number of breaks (which is set to 4 by default).

If we segment using 5 breaks, we get 5 children. In the segment summary, e is the computed entropy for the subset, p is the proportion of records it contains, N is the number of records, and min and max are the minimum and maximum values in the segment.

We improve on the entropy of the parent in each child. In fact, segment 5 is perfectly pure, though weighted lightly due to the low proportion of records it contains. We can formalize this using the information gain formula noted above. Calling the IG_numeric function, we see that IG(Sepal.Length) = 0.64 using 5 breaks.

Note that the categorical and numeric functions are called as follows:

IG_numeric(data, feature, target, bins=4)

IG_cat(data, feature, target)

Both functions return the IG value; however, you can change return(IG) to return(dd_data) to return the summary of the segments as a data.frame for investigation.
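The original function bodies are not reproduced here, so the following is only a minimal sketch of how functions with these signatures might compute information gain. The `_sketch` names are hypothetical, and the equal-width binning via `cut()` is an assumption, so the numeric result need not match the 0.64 quoted above.

```r
# Shannon entropy (in bits) of a vector of class labels.
entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# IG = H(parent) - sum over children of p_child * H(child),
# with one child per distinct value of `feature`.
ig_cat_sketch <- function(data, feature, target) {
  children <- split(data[[target]], data[[feature]], drop = TRUE)
  child_weight <- lengths(children) / nrow(data)
  child_entropy <- vapply(children, entropy, numeric(1))
  entropy(data[[target]]) - sum(child_weight * child_entropy)
}

# Numeric features are first cut into `bins` equal-width intervals
# (an assumption), then treated as categorical.
ig_numeric_sketch <- function(data, feature, target, bins = 4) {
  binned <- data
  binned[[feature]] <- cut(data[[feature]], breaks = bins)
  ig_cat_sketch(binned, feature, target)
}

ig_sepal <- ig_numeric_sketch(iris, "Sepal.Length", "Species", bins = 5)
ig_self <- ig_cat_sketch(iris, "Species", "Species")
print(ig_sepal)
print(ig_self)
```

Splitting on the target itself yields perfectly pure children, so `ig_self` equals the parent entropy. That gives a quick sanity check on the upper bound of any feature's information gain.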


In [67]:
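gini_process <- function(classes, splitvar = NULL) {
  # Gini impurity of `classes`. If `splitvar` (a logical vector) is given,
  # returns the weighted impurity of the two child nodes instead.
  # Assumes splitvar contains both TRUE and FALSE values.
  if (is.null(splitvar)) {
    base_prob <- table(classes) / length(classes)
    return(1 - sum(base_prob^2))
  }
  base_prob <- table(splitvar) / length(splitvar)  # node weights (FALSE, TRUE)
  crosstab <- table(classes, splitvar)
  crossprob <- prop.table(crosstab, 2)             # class shares within each node
  No_Node_Gini <- 1 - sum(crossprob[, 1]^2)
  Yes_Node_Gini <- 1 - sum(crossprob[, 2]^2)
  return(sum(base_prob * c(No_Node_Gini, Yes_Node_Gini)))
}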

In [68]:
data(iris)
gini_process(iris$Species) #0.6667
gini_process(iris$Species,iris$Petal.Length<2.45) #0.3333
gini_process(iris$Species,iris$Petal.Length<5) #0.4086
gini_process(iris$Species,iris$Sepal.Length<6.4) #0.5578
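
For reference, the quantity that gini_process computes can be written as follows. The notation is mine: $p_k$ is the share of class $k$ in a node, and $w_{no}$ and $w_{yes}$ are the shares of records that fall in each child node.

$$
G = 1 - \sum _ { k } p _ { k } ^ { 2 } , \qquad G _ { split } = w _ { no } G _ { no } + w _ { yes } G _ { yes }
$$

For the unsplit iris data, each species has $p_k = \frac{1}{3}$, so $G = 1 - 3 \left( \frac{1}{3} \right)^2 = \frac{2}{3} \approx 0.6667$. Splitting on Petal.Length < 2.45 puts the 50 setosa records in a pure node ($G_{yes} = 0$) and leaves versicolor and virginica evenly mixed in the other ($G_{no} = 0.5$), giving

$$
G _ { split } = \frac { 100 } { 150 } \cdot 0.5 + \frac { 50 } { 150 } \cdot 0 = \frac { 1 } { 3 } \approx 0.3333
$$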