This is feature elimination based on **[Boruta](https://m2.icm.edu.pl/boruta/)**. Thanks to **[@olivier](https://www.kaggle.com/ogrellier)** for **[discussions](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/discussion/41595#233852)** and for **[this notebook](https://www.kaggle.com/ogrellier/noise-analysis-of-porto-seguro-s-features)** that got me going in this direction.

Note that olivier used LightGBM as his **[base estimator](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/discussion/41595#234273)** while I am using Random Forest. Because of that, the results are different. I have no way of telling which feature selection is better as I haven't tested either one yet. If you do test them, please leave a note here. I will do the same.
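
For readers new to Boruta, here is a minimal sketch of the shadow-feature idea it builds on. It is not the BorutaPy implementation: the importance score is a plain absolute Pearson correlation standing in for Random Forest importances, and the helper names `pearson` and `shadow_hits` are made up for this illustration.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation, used here as a stand-in importance score."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def shadow_hits(columns, target, rng):
    """One Boruta-style round: a feature scores a hit if it beats the best shadow."""
    # a shadow is a shuffled copy of a real column, so any link to the target is destroyed
    shadows = [rng.sample(col, len(col)) for col in columns]
    threshold = max(abs(pearson(s, target)) for s in shadows)
    return [abs(pearson(col, target)) > threshold for col in columns]

rng = random.Random(0)
n_rows = 500
signal = [rng.gauss(0, 1) for _ in range(n_rows)]  # drives the target
noise = [rng.gauss(0, 1) for _ in range(n_rows)]   # unrelated to the target
target = [1 if s > 0 else 0 for s in signal]

hits = shadow_hits([signal, noise], target, rng)
print(hits[0])  # the informative column should beat every shadow
```

Boruta repeats such rounds many times and keeps a hit count per feature; a single round, as here, says little about a borderline feature.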

In [1]:
from __future__ import print_function

import pandas as pd
import numpy as np
from datetime import datetime
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy

pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)

def timer(start_time=None):
    """Return the current time when called without an argument; otherwise print the time elapsed since start_time."""
    if start_time is None:
        return datetime.now()
    # split the elapsed seconds into hours, minutes and seconds
    thour, temp_sec = divmod((datetime.now() - start_time).total_seconds(), 3600)
    tmin, tsec = divmod(temp_sec, 60)
    print('\n Time taken: %i hours %i minutes and %s seconds.' % (thour, tmin, round(tsec, 2)))

Loading files.

In [2]:
train = pd.read_csv('../input/train.csv', dtype={'target': np.int8, 'id': np.int32})
X = train.drop(['id','target'], axis=1).values
y = train['target'].values
tr_ids = train['id'].values
n_train = len(X)
test = pd.read_csv('../input/test.csv', dtype={'id': np.int32})
X_test = test.drop(['id'], axis=1).values
te_ids = test['id'].values

It is worth playing with **[RFC parameters](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)**. Initially, I had *n_estimators=100* and *max_depth=10*, which did not select enough features. Boruta parameters are explained **[here](https://github.com/scikit-learn-contrib/boruta_py)**.
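
As a rough illustration of why Boruta needs several iterations before it can decide anything, the sketch below treats each iteration as a coin flip (a feature either beats the best shadow or it does not) and asks when a perfect hit record becomes statistically significant. The significance level of 0.05 and the division of that level by the number of iterations are assumptions made for this sketch, and any correction across features is ignored; the function names are hypothetical.

```python
from math import comb

def binom_tail(hits, trials, p=0.5):
    """P(X >= hits) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, k) * p ** k * (1 - p) ** (trials - k)
               for k in range(hits, trials + 1))

ALPHA = 0.05  # assumed significance level

def first_decisive_iteration(max_iter=100):
    """First iteration at which a perfect hit record passes the assumed threshold."""
    for n in range(1, max_iter + 1):
        # assumed correction: divide alpha by the number of iterations run so far
        if binom_tail(n, n) < ALPHA / n:
            return n
    return None

print(first_decisive_iteration())  # 8
```

Under these assumptions no feature can be confirmed or rejected before the eighth iteration, which is consistent with the log of the run in this notebook, where the first decisions appear at iteration 8.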

In [3]:
rfc = RandomForestClassifier(n_estimators=200, n_jobs=4, class_weight='balanced', max_depth=6)
boruta_selector = BorutaPy(rfc, n_estimators='auto', verbose=2)
start_time = timer(None)
boruta_selector.fit(X, y)
timer(start_time)

Iteration: 	1 / 100
Confirmed: 	0
Tentative: 	57
Rejected: 	0
Iteration: 	2 / 100
Confirmed: 	0
Tentative: 	57
Rejected: 	0
Iteration: 	3 / 100
Confirmed: 	0
Tentative: 	57
Rejected: 	0
Iteration: 	4 / 100
Confirmed: 	0
Tentative: 	57
Rejected: 	0
Iteration: 	5 / 100
Confirmed: 	0
Tentative: 	57
Rejected: 	0
Iteration: 	6 / 100
Confirmed: 	0
Tentative: 	57
Rejected: 	0
Iteration: 	7 / 100
Confirmed: 	0
Tentative: 	57
Rejected: 	0
Iteration: 	8 / 100
Confirmed: 	24
Tentative: 	3
Rejected: 	30




Iteration: 	9 / 100
Confirmed: 	24
Tentative: 	3
Rejected: 	30




Iteration: 	10 / 100
Confirmed: 	24
Tentative: 	3
Rejected: 	30




Iteration: 	11 / 100
Confirmed: 	24
Tentative: 	3
Rejected: 	30




Iteration: 	12 / 100
Confirmed: 	24
Tentative: 	3
Rejected: 	30




Iteration: 	13 / 100
Confirmed: 	24
Tentative: 	3
Rejected: 	30




Iteration: 	14 / 100
Confirmed: 	24
Tentative: 	3
Rejected: 	30




Iteration: 	15 / 100
Confirmed: 	24
Tentative: 	3
Rejected: 	30




Iteration: 	16 / 100
Confirmed: 	24
Tentative: 	3
Rejected: 	30




Iteration: 	17 / 100
Confirmed: 	24
Tentative: 	3
Rejected: 	30




Iteration: 	18 / 100
Confirmed: 	24
Tentative: 	3
Rejected: 	30




Iteration: 	19 / 100
Confirmed: 	24
Tentative: 	3
Rejected: 	30




Iteration: 	20 / 100
Confirmed: 	24
Tentative: 	2
Rejected: 	31




Iteration: 	21 / 100
Confirmed: 	24
Tentative: 	2
Rejected: 	31




Iteration: 	22 / 100
Confirmed: 	24
Tentative: 	2
Rejected: 	31




Iteration: 	23 / 100
Confirmed: 	24
Tentative: 	2
Rejected: 	31




Iteration: 	24 / 100
Confirmed: 	24
Tentative: 	2
Rejected: 	31




Iteration: 	25 / 100
Confirmed: 	24
Tentative: 	2
Rejected: 	31




Iteration: 	26 / 100
Confirmed: 	24
Tentative: 	2
Rejected: 	31




Iteration: 	27 / 100
Confirmed: 	24
Tentative: 	2
Rejected: 	31




Iteration: 	28 / 100
Confirmed: 	24
Tentative: 	2
Rejected: 	31




Iteration: 	29 / 100
Confirmed: 	24
Tentative: 	2
Rejected: 	31




Iteration: 	30 / 100
Confirmed: 	24
Tentative: 	2
Rejected: 	31




Iteration: 	31 / 100
Confirmed: 	24
Tentative: 	2
Rejected: 	31




Iteration: 	32 / 100
Confirmed: 	24
Tentative: 	2
Rejected: 	31




Iteration: 	33 / 100
Confirmed: 	24
Tentative: 	2
Rejected: 	31




Iteration: 	34 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	35 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	36 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	37 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	38 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	39 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	40 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	41 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	42 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	43 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	44 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	45 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	46 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	47 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	48 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	49 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	50 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	51 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	52 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	53 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	54 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	55 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	56 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	57 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	58 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	59 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	60 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	61 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	62 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	63 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	64 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	65 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	66 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	67 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	68 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	69 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	70 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	71 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	72 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	73 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32




Iteration: 	74 / 100
Confirmed: 	24
Tentative: 	1
Rejected: 	32
Iteration: 	75 / 100
Confirmed: 	24
Tentative: 	0
Rejected: 	33


BorutaPy finished running.

Iteration: 	76 / 100
Confirmed: 	24
Tentative: 	0
Rejected: 	33

 Time taken: 0 hours 45 minutes and 29.8 seconds.




The summary of the whole run is shown below. A couple of attributes at the end are commented out. Finally, we save the train and test datasets restricted to the selected features.
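
To make the attributes used in the next cell concrete, here is a small sketch of how a ranking relates to the selection masks, assuming the BorutaPy convention that rank 1 means confirmed and rank 2 means tentative. The column names and ranks are toy values, not results from this run, and plain lists stand in for the NumPy arrays that BorutaPy exposes.

```python
# toy values, not taken from the run above
columns = ['feat_a', 'feat_b', 'feat_c', 'feat_d']
ranking = [1, 3, 2, 1]  # assumed convention: 1 = confirmed, 2 = tentative, 3+ = rejected

# boolean masks playing the role of support_ and support_weak_
support = [r == 1 for r in ranking]
support_weak = [r == 2 for r in ranking]

# keep the column names whose mask entry is True
selected = [c for c, keep in zip(columns, support) if keep]
tentative = [c for c, keep in zip(columns, support_weak) if keep]
print(selected, tentative)
```

In the real cell the same masking is done by indexing the pandas column index with `boruta_selector.support_`.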

In [4]:
feature_names = train.drop(['id', 'target'], axis=1).columns
print('\n Initial features: ', feature_names.tolist())

# number of selected features
print('\n Number of selected features:')
print(boruta_selector.n_features_)

# confirmed features have rank 1, so they come first after sorting by rank
feature_df = pd.DataFrame(feature_names.tolist(), columns=['features'])
feature_df['rank'] = boruta_selector.ranking_
feature_df = feature_df.sort_values('rank', ascending=True).reset_index(drop=True)
print('\n Top %d features:' % boruta_selector.n_features_)
print(feature_df.head(boruta_selector.n_features_))
feature_df.to_csv('boruta-feature-ranking.csv', index=False)

# check ranking of features
print('\n Feature ranking:')
print(boruta_selector.ranking_)

# check selected features
# print('\n Selected features:')
# print(boruta_selector.support_)

# check weak features
# print('\n Support for weak features:')
# print(boruta_selector.support_weak_)

# keep only the confirmed features; .copy() avoids pandas' SettingWithCopyWarning
selected = feature_names[boruta_selector.support_]
train = train[selected].copy()
train['id'] = tr_ids
train['target'] = y
train = train.set_index('id')
train.to_csv('train_boruta_filtered.csv', index_label='id')
test = test[selected].copy()
test['id'] = te_ids
test = test.set_index('id')
test.to_csv('test_boruta_filtered.csv', index_label='id')


 Initial features:  ['ps_ind_01', 'ps_ind_02_cat', 'ps_ind_03', 'ps_ind_04_cat', 'ps_ind_05_cat', 'ps_ind_06_bin', 'ps_ind_07_bin', 'ps_ind_08_bin', 'ps_ind_09_bin', 'ps_ind_10_bin', 'ps_ind_11_bin', 'ps_ind_12_bin', 'ps_ind_13_bin', 'ps_ind_14', 'ps_ind_15', 'ps_ind_16_bin', 'ps_ind_17_bin', 'ps_ind_18_bin', 'ps_reg_01', 'ps_reg_02', 'ps_reg_03', 'ps_car_01_cat', 'ps_car_02_cat', 'ps_car_03_cat', 'ps_car_04_cat', 'ps_car_05_cat', 'ps_car_06_cat', 'ps_car_07_cat', 'ps_car_08_cat', 'ps_car_09_cat', 'ps_car_10_cat', 'ps_car_11_cat', 'ps_car_11', 'ps_car_12', 'ps_car_13', 'ps_car_14', 'ps_car_15', 'ps_calc_01', 'ps_calc_02', 'ps_calc_03', 'ps_calc_04', 'ps_calc_05', 'ps_calc_06', 'ps_calc_07', 'ps_calc_08', 'ps_calc_09', 'ps_calc_10', 'ps_calc_11', 'ps_calc_12', 'ps_calc_13', 'ps_calc_14', 'ps_calc_15_bin', 'ps_calc_16_bin', 'ps_calc_17_bin', 'ps_calc_18_bin', 'ps_calc_19_bin', 'ps_calc_20_bin']

 Number of selected features:
24

 Top 24 features:
         features  rank
0       ps_ind_0