## Using Boruta on the Madelon Data Set

[Original notebook](https://github.com/scikit-learn-contrib/boruta_py/blob/master/boruta/examples/Madalon_Data_Set.ipynb)

This example demonstrates using Boruta to find all relevant features in the Madelon dataset, an artificial dataset from the NIPS 2003 feature selection challenge that is also cited in the Boruta paper.

This dataset has 2000 observations and 500 features. We will use Boruta to identify the features that are relevant to the classification task.



In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy

In [2]:
def load_data():
    # URLs for the dataset on the UCI repository
    train_data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/madelon/MADELON/madelon_train.data'
    train_label_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/madelon/MADELON/madelon_train.labels'

    X_data = pd.read_csv(train_data_url, sep=" ", header=None)
    y_data = pd.read_csv(train_label_url, sep=" ", header=None)
    # Keep the 500 feature columns; copy so that adding 'target'
    # does not write into a slice of X_data.
    data = X_data.loc[:, :499].copy()
    data['target'] = y_data[0]
    return data

data = load_data()


data.head()

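|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 491 | 492 | 493 | 494 | 495 | 496 | 497 | 498 | 499 | target |
|---|---|---|---|---|---|---|---|---|---|---|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|--------|
| 0 | 485 | 477 | 537 | 479 | 452 | 471 | 491 | 476 | 475 | 473 | ... | 481 | 477 | 485 | 511 | 485 | 481 | 479 | 475 | 496 | -1 |
| 1 | 483 | 458 | 460 | 487 | 587 | 475 | 526 | 479 | 485 | 469 | ... | 478 | 487 | 338 | 513 | 486 | 483 | 492 | 510 | 517 | -1 |
| 2 | 487 | 542 | 499 | 468 | 448 | 471 | 442 | 478 | 480 | 477 | ... | 481 | 492 | 650 | 506 | 501 | 480 | 489 | 499 | 498 | -1 |
| 3 | 480 | 491 | 510 | 485 | 495 | 472 | 417 | 474 | 502 | 476 | ... | 480 | 474 | 572 | 454 | 469 | 475 | 482 | 494 | 461 | 1 |
| 4 | 484 | 502 | 528 | 489 | 466 | 481 | 402 | 478 | 487 | 468 | ... | 479 | 452 | 435 | 486 | 508 | 481 | 504 | 495 | 511 | 1 |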
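The loader keeps only columns 0 to 499, presumably because each row of the raw file ends with a trailing space, so splitting on a single space produces one extra, empty field. The sketch below reproduces that effect with the standard-library `csv` module on a made-up sample (the sample text is illustrative and not taken from the dataset). `pd.read_csv(..., sep=" ")` would be expected to show the same extra field as an all-NaN column.

```python
import csv
import io

# Hypothetical two-row sample; each row ends with a trailing space,
# mimicking the assumed layout of madelon_train.data.
sample = "485 477 537 \n483 458 460 \n"

rows = list(csv.reader(io.StringIO(sample), delimiter=" "))

# Each row parses into three values plus one empty trailing field.
n_fields = [len(row) for row in rows]
last_fields = [row[-1] for row in rows]

# Dropping the last field recovers the real feature values.
features = [[int(v) for v in row[:-1]] for row in rows]

print(n_fields)
print(features)
```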


In [3]:
y = data.pop('target')
X = data.copy(deep=True).values

print(X.shape, y.shape)

(2000, 500) (2000,)



Boruta conforms to the scikit-learn API and can be used in a Pipeline as well as on its own. Here we will demonstrate standalone operation.

First we will instantiate an estimator for Boruta to use. Then we will instantiate a BorutaPy object.

In [4]:
rf = RandomForestClassifier(n_jobs=-1, class_weight=None, max_depth=7, random_state=0)

# Define Boruta feature selection method

feature_selector = BorutaPy(estimator=rf, n_estimators='auto', verbose=2, random_state=0)

Once built, we can use this object to identify the relevant features in our dataset.



In [6]:
feature_selector.fit(X, y)

Iteration: 	1 / 100
Confirmed: 	0
Tentative: 	500
Rejected: 	0
Iteration: 	2 / 100
Confirmed: 	0
Tentative: 	500
Rejected: 	0
Iteration: 	3 / 100
Confirmed: 	0
Tentative: 	500
Rejected: 	0
Iteration: 	4 / 100
Confirmed: 	0
Tentative: 	500
Rejected: 	0
Iteration: 	5 / 100
Confirmed: 	0
Tentative: 	500
Rejected: 	0
Iteration: 	6 / 100
Confirmed: 	0
Tentative: 	500
Rejected: 	0
Iteration: 	7 / 100
Confirmed: 	0
Tentative: 	500
Rejected: 	0
Iteration: 	8 / 100
Confirmed: 	0
Tentative: 	22
Rejected: 	478
Iteration: 	9 / 100
Confirmed: 	19
Tentative: 	3
Rejected: 	478
Iteration: 	10 / 100
Confirmed: 	19
Tentative: 	3
Rejected: 	478
Iteration: 	11 / 100
Confirmed: 	19
Tentative: 	3
Rejected: 	478
Iteration: 	12 / 100
Confirmed: 	19
Tentative: 	3
Rejected: 	478
Iteration: 	13 / 100
Confirmed: 	19
Tentative: 	2
Rejected: 	479
Iteration: 	14 / 100
Confirmed: 	19
Tentative: 	2
Rejected: 	479
Iteration: 	15 / 100
Confirmed: 	19
Tentative: 	2
Rejected: 	479
Iteration: 	16 / 100
Confirmed: 	19
Tenta

BorutaPy(alpha=0.05,
     estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=7, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=92, n_jobs=-1,
            oob_score=False,
            random_state=<mtrand.RandomState object at 0x00000206C1B77688>,
            verbose=0, warm_start=False),
     max_iter=100, n_estimators='auto', perc=100,
     random_state=<mtrand.RandomState object at 0x00000206C1B77688>,
     two_step=True, verbose=2)


Boruta has confirmed only a small fraction of the 500 features as useful (19 by iteration 16 in the log above). When our run ended, Boruta was undecided on 2 features.

We can interrogate .support_ to see which features were selected. .support_ returns an array of booleans that we can use to slice our feature matrix so that it includes only the relevant columns. Of course, .transform can also be used, as expected in the scikit-learn API.

In [10]:
# Check selected features
print(feature_selector.support_)
# Select the chosen features from our feature matrix (a NumPy array).
selected = X[:, feature_selector.support_]
print("")
print("Selected Feature Matrix Shape")
print(selected.shape)

[False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False  True False False False False False False False
 False False False False False False False False False False False False
  True False False False False False False False False False False False
 False False False False  True False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False  True False False
 False False False False False False False False False False False False
 False False False False False False False False  True False False False
 False False False False False False False False False False False False
 False False False False False False False False False  True False False
 False False False False False False False False Fa
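A long boolean mask is hard to read by eye, so it can help to convert it into column positions. The following is a minimal sketch using a small made-up matrix and mask (`X_demo` and `support_demo` are hypothetical stand-ins for `X` and `feature_selector.support_`, not values from the run above).

```python
import numpy as np

# Hypothetical stand-ins: a 4x6 feature matrix and a boolean mask of the
# kind stored in `support_` (True marks a confirmed feature).
X_demo = np.arange(24).reshape(4, 6)
support_demo = np.array([False, True, False, False, True, False])

# Column positions of the confirmed features.
selected_idx = np.flatnonzero(support_demo)

# Number of confirmed features (True counts as 1).
n_selected = int(support_demo.sum())

# Boolean indexing keeps only the confirmed columns.
X_selected = X_demo[:, support_demo]

print(selected_idx, n_selected, X_selected.shape)
```

With a fitted selector, the same calls applied to `feature_selector.support_` and `X` should give the confirmed column positions and the reduced matrix.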

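We can also interrogate the ranking of the unselected features with .ranking_. Selected features are assigned rank 1 and tentative features rank 2; higher ranks belong to rejected features.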



In [12]:
feature_selector.ranking_

array([429, 359, 124, 370,  20, 272, 160, 475, 421, 366,   2, 216, 262,
       107, 305, 339, 288, 177, 113, 416, 348, 266, 439, 284, 302, 137,
       189, 422,   1, 231, 440,  56, 134, 231, 359, 151, 213, 296,  79,
       465, 454,  95, 255,  44,  42, 262, 149, 181,   1, 116, 165, 171,
       400, 426, 198,  30,  27, 198, 387, 161,  65, 318, 378, 455,   1,
       181, 246, 231, 168, 412, 350, 238, 195,  36, 304, 293, 330,  91,
       149, 119, 425, 128, 355, 430, 270,  59, 402, 251, 335,  57, 482,
       245, 368, 282, 206,  88, 450, 271, 113, 171, 207, 385, 424, 269,
       203,   1,  39, 401, 363, 237, 422, 126, 399, 327, 144, 310,  23,
       357, 286,  82, 461, 272, 125, 212, 337, 142, 308,  74,   1, 133,
        73,  66, 316, 448, 156, 154,  12,  15, 426, 106,  97, 301, 327,
       345, 299, 333, 312,  91, 373,  61, 240, 321, 107,   1, 472, 195,
       268, 242, 331, 313, 393,  68, 407, 416, 100, 217, 464, 440, 467,
       358, 227, 122, 127, 457, 371,  43, 368, 208, 248, 319,  6

In [18]:
feature_selector.ranking_[feature_selector.ranking_==2]

array([2])
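The expression above returns the matching rank values, not the positions of the features that hold them. To find out which columns have a given rank, or to order the columns from most to least relevant, index positions are more useful. The sketch below uses a small made-up ranking array (`ranking_demo` is a hypothetical stand-in for `feature_selector.ranking_`).

```python
import numpy as np

# Hypothetical ranking for six features (not taken from the run above):
# 1 = confirmed, 2 = tentative, 3 and above = rejected.
ranking_demo = np.array([4, 1, 3, 2, 1, 5])

# Column positions of confirmed and tentative features.
confirmed_idx = np.flatnonzero(ranking_demo == 1)
tentative_idx = np.flatnonzero(ranking_demo == 2)

# Column positions ordered from best to worst rank; the stable sort
# keeps the original column order among tied ranks.
order = np.argsort(ranking_demo, kind="stable")

print(confirmed_idx, tentative_idx, order)
```

On a fitted selector, `np.flatnonzero(feature_selector.ranking_ == 2)` would give the position of the tentative feature; BorutaPy also exposes the tentative mask directly as `support_weak_`.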