## TalkingData AdTracking Fraud Detection Challenge
https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection


------------------


### This notebook demonstrates a feature selection algorithm: the Boruta all-relevant feature selection method

Boruta repo: [boruta_py GitHub link](https://github.com/scikit-learn-contrib/boruta_py)

Boruta is an all-relevant feature selection method, whereas most others are minimal-optimal. This means it tries to find all features carrying information usable for prediction, rather than a possibly compact subset of features on which some classifier has minimal error.


There are three types of feature selection methods in general:

* Filter Methods: filter methods are generally used as a preprocessing step. The selection of features is independent of any machine learning algorithm. Instead, features are selected on the basis of their scores in statistical tests of their correlation with the outcome variable. Common filter methods include correlation metrics (Pearson, Spearman, distance), the chi-squared test, ANOVA, and Fisher's score.


* Wrapper Methods: in wrapper methods, you train a model on a subset of features. Based on the inferences you draw from that model, you decide to add features to the subset or remove features from it. Forward selection and backward elimination are examples of wrapper methods.


* Embedded Methods: these are algorithms that have feature selection built in. LASSO regression is one such example.
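
As a rough illustration of the three families, the sketch below applies one scikit-learn selector of each kind to synthetic data. The dataset, the choice of `SelectKBest`, `RFE`, and `SelectFromModel`, and all parameter values are assumptions made for this example; none of them come from the competition pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import Lasso, LogisticRegression

# Synthetic data (an assumption for illustration): 10 features, 3 informative.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0,
    random_state=0,
)

# Filter: score each feature with an ANOVA F-test, independent of any model.
filter_sel = SelectKBest(score_func=f_classif, k=3).fit(X_demo, y_demo)

# Wrapper: recursive feature elimination refits a model repeatedly and drops
# the weakest feature each round (a form of backward elimination).
wrapper_sel = RFE(
    LogisticRegression(max_iter=1000), n_features_to_select=3
).fit(X_demo, y_demo)

# Embedded: the L1 penalty in LASSO shrinks some coefficients to exactly zero
# during training, so selection happens inside the model fit.
embedded_sel = SelectFromModel(Lasso(alpha=0.02)).fit(X_demo, y_demo)

for name, sel in [('filter', filter_sel),
                  ('wrapper', wrapper_sel),
                  ('embedded', embedded_sel)]:
    print(name, np.flatnonzero(sel.get_support()))
```

The three selectors need not agree: each one answers a slightly different question about the same data.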


### The Boruta Algorithm
The Boruta algorithm is a wrapper built around the random forest classification algorithm. It tries to capture all the important, interesting features you might have in your dataset with respect to an outcome variable.


* First, it duplicates the dataset and shuffles the values in each column of the copy. These shuffled columns are called shadow features.


* Then, it trains a classifier, such as a random forest, on the dataset extended with the shadow features. This yields an importance score (via Mean Decrease Accuracy or Mean Decrease Impurity) for each feature in the data set. The higher the score, the more important the feature.


* Then, the algorithm checks whether each of your real features has a higher importance than the shadow features. That is, whether the feature has a higher Z-score than the maximum Z-score among the shadow features. If it does, this is recorded in a vector; these records are called hits. The algorithm then continues with another iteration. After a predefined number of iterations, you end up with a table of these hits. Remember: a Z-score is the number of standard deviations a data point lies from the mean.


* At every iteration, the algorithm compares the Z-scores of the original features with those of their shuffled copies to see whether the originals performed better. If a feature did, the algorithm counts this in its favor. In essence, the algorithm validates the importance of each feature by comparing it with randomly shuffled copies, which increases robustness. The final decision comes from comparing the number of times a feature outperformed the shadow features against a binomial distribution.


* If a feature hasn't been recorded as a hit in, say, 15 iterations, you reject it and also remove it from the original matrix. After a set number of iterations (or once all the features have been either confirmed or rejected) you stop.
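
The steps above can be sketched in a few lines. This is a simplified illustration and not the boruta_py implementation: it uses raw impurity-based importances instead of Z-scores, never removes rejected features, and applies no multiple-testing correction. The toy data, the iteration count, and the significance level `alpha` are assumptions chosen for the example.

```python
import numpy as np
from scipy.stats import binom
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy data (an assumption): with shuffle=False the 3 informative features
# are columns 0-2 and the remaining 5 columns are pure noise.
X_toy, y_toy = make_classification(
    n_samples=400, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
n_features = X_toy.shape[1]
n_iter = 20
hits = np.zeros(n_features, dtype=int)

for i in range(n_iter):
    # Shadow features: a copy of X whose columns are each shuffled
    # independently, destroying any relationship with the target.
    X_shadow = rng.permuted(X_toy, axis=0)
    forest = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=i)
    forest.fit(np.hstack([X_toy, X_shadow]), y_toy)
    importance = forest.feature_importances_
    real, shadow = importance[:n_features], importance[n_features:]
    # A real feature scores a hit when it beats the best shadow feature.
    hits += real > shadow.max()

# Null hypothesis: a feature is no better than a shadow feature,
# so its hit count follows Binomial(n_iter, 0.5).
p_confirm = binom.sf(hits - 1, n_iter, 0.5)  # P(at least this many hits)
p_reject = binom.cdf(hits, n_iter, 0.5)      # P(at most this many hits)
alpha = 0.05
decision = np.where(p_confirm < alpha, 'confirmed',
                    np.where(p_reject < alpha, 'rejected', 'tentative'))

for j in range(n_features):
    print(j, hits[j], decision[j])
```

Features whose hit count is neither convincingly high nor convincingly low stay tentative, which mirrors the three counters printed in BorutaPy's verbose output.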

In [1]:
import os
import psutil
import time
import pandas as pd
import gc
# sklearn imports
from sklearn.ensemble import RandomForestClassifier
# boruta imports
from boruta import BorutaPy

# memory
process = psutil.Process(os.getpid())
memused = process.memory_info().rss
print('Total memory in use before reading data: {:.02f} GB'.format(memused/(2**30)))

Total memory in use before reading data: 0.10 GB


In [2]:
# read data: keep the last 100,000 rows and fill missing values with 0
df_train = pd.read_hdf('../data/train_v3.hdf').tail(int(1e5)).fillna(0)
# target and feature columns
target = 'is_attributed'
features = [
    'app',
    'device',
    'os',
    'channel',
    'hour',
    'in_test_hh',
    'ip_day_hour_clicks',
    'ip_app_day_hour_clicks',
    'ip_os_day_hour_clicks',
    'ip_device_day_hour_clicks',
    'ip_day_test_hh_clicks',
    'ip_app_device_clicks',
    'ip_app_device_day_clicks',
    'ip_day_nunique_app',
    'ip_day_nunique_device',
    'ip_day_nunique_channel',
    'ip_day_nunique_hour',
    'ip_nunique_app',
    'ip_nunique_device',
    'ip_nunique_channel',
    'ip_nunique_hour',
    'app_day_nunique_channel',
    'app_nunique_channel',
    'ip_app_day_nunique_os',
    'ip_app_nunique_os',
    'ip_device_os_day_nunique_app',
    'ip_device_os_nunique_app',
    'ip_app_day_var_hour',
    'ip_device_day_var_hour',
    'ip_os_day_var_hour',
    'ip_channel_day_var_hour',
    'ip_app_os_var_hour',
    'ip_app_channel_var_day',
    'ip_app_channel_mean_hour',
    'ip_day_cumcount',
    'ip_cumcount',
    'ip_app_day_cumcount',
    'ip_app_cumcount',
    'ip_device_os_day_cumcount',
    'ip_device_os_cumcount',
    'next_click',
    'previous_click',
]
# split into feature matrix X and target y
X = df_train[features]
y = df_train[target]
# clean up
del df_train
gc.collect()
# memory status
memused = process.memory_info().rss
print('Total memory in use after reading data: {:.02f} GB '
      ''.format(memused / (2 ** 30)))

Total memory in use after reading data: 1.50 GB 


### Before feature pruning

In [15]:
X.head()

| | app | device | os | channel | hour | in_test_hh | ip_day_hour_clicks | ip_app_day_hour_clicks | ip_os_day_hour_clicks | ip_device_day_hour_clicks | ... | ip_app_channel_var_day | ip_app_channel_mean_hour | ip_day_cumcount | ip_cumcount | ip_app_day_cumcount | ip_app_cumcount | ip_device_os_day_cumcount | ip_device_os_cumcount | next_click | previous_click |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 184803890 | 56 | 1 | 30 | 406 | 15 | 2 | 11373 | 135 | 66 | 10208 | ... | 0.032738 | 11.377953 | 64128 | 48473 | 1188 | 1231 | 653 | 1609 | 0.0 | 3000.0 |
| 184803891 | 12 | 1 | 11 | 481 | 15 | 2 | 256 | 43 | 15 | 252 | ... | 1.303955 | 9.483334 | 4830 | 16741 | 530 | 1899 | 27 | 134 | 0.0 | 373.0 |
| 184803892 | 3 | 1 | 19 | 137 | 15 | 2 | 29386 | 3328 | 6651 | 26395 | ... | 1.465981 | 11.08274 | 45828 | 58202 | 52702 | 4786 | 7756 | 51924 | 1.0 | 5.0 |
| 184803893 | 26 | 1 | 19 | 477 | 15 | 2 | 232 | 2 | 20 | 231 | ... | 0.641897 | 8.576923 | 6041 | 19346 | 107 | 314 | 1523 | 4047 | 0.0 | 5743.0 |
| 184803894 | 15 | 1 | 3 | 386 | 15 | 2 | 104 | 8 | 36 | 104 | ... | 0.0 | 8.0 | 586 | 586 | 30 | 30 | 50 | 50 | 22.0 | 18.0 |


In [18]:
X.columns

Index(['app', 'device', 'os', 'channel', 'hour', 'in_test_hh',
       'ip_day_hour_clicks', 'ip_app_day_hour_clicks', 'ip_os_day_hour_clicks',
       'ip_device_day_hour_clicks', 'ip_day_test_hh_clicks',
       'ip_app_device_clicks', 'ip_app_device_day_clicks',
       'ip_day_nunique_app', 'ip_day_nunique_device', 'ip_day_nunique_channel',
       'ip_day_nunique_hour', 'ip_nunique_app', 'ip_nunique_device',
       'ip_nunique_channel', 'ip_nunique_hour', 'app_day_nunique_channel',
       'app_nunique_channel', 'ip_app_day_nunique_os', 'ip_app_nunique_os',
       'ip_device_os_day_nunique_app', 'ip_device_os_nunique_app',
       'ip_app_day_var_hour', 'ip_device_day_var_hour', 'ip_os_day_var_hour',
       'ip_channel_day_var_hour', 'ip_app_os_var_hour',
       'ip_app_channel_var_day', 'ip_app_channel_mean_hour', 'ip_day_cumcount',
       'ip_cumcount', 'ip_app_day_cumcount', 'ip_app_cumcount',
       'ip_device_os_day_cumcount', 'ip_device_os_cumcount', 'next_click',
       'previous_

### Run BorutaPy

In [6]:
# define random forest classifier, using all cores and
# weighting classes inversely to their frequency in y
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,
    class_weight='balanced',
    n_jobs=-1
)

# define Boruta feature selection method
feat_selector = BorutaPy(
    rf, 
    n_estimators=100, 
    verbose=2, 
    random_state=1
)

# find all relevant features
feat_selector.fit(X.values, y.values)

# boolean mask of the confirmed features
feat_selector.support_

# ranking of features (1 = confirmed, 2 = tentative)
feat_selector.ranking_

# call transform() on X to filter it down to selected features
X_filtered = feat_selector.transform(X.values)

Iteration: 	1 / 100
Confirmed: 	0
Tentative: 	42
Rejected: 	0
Iteration: 	2 / 100
Confirmed: 	0
Tentative: 	42
Rejected: 	0
Iteration: 	3 / 100
Confirmed: 	0
Tentative: 	42
Rejected: 	0
Iteration: 	4 / 100
Confirmed: 	0
Tentative: 	42
Rejected: 	0
Iteration: 	5 / 100
Confirmed: 	0
Tentative: 	42
Rejected: 	0
Iteration: 	6 / 100
Confirmed: 	0
Tentative: 	42
Rejected: 	0
Iteration: 	7 / 100
Confirmed: 	0
Tentative: 	42
Rejected: 	0
Iteration: 	8 / 100
Confirmed: 	21
Tentative: 	17
Rejected: 	4


  hits = np.where(cur_imp[0] > imp_sha_max)[0]


Iteration: 	9 / 100
Confirmed: 	21
Tentative: 	17
Rejected: 	4


Iteration: 	10 / 100
Confirmed: 	21
Tentative: 	17
Rejected: 	4


Iteration: 	11 / 100
Confirmed: 	21
Tentative: 	17
Rejected: 	4


Iteration: 	12 / 100
Confirmed: 	25
Tentative: 	13
Rejected: 	4


Iteration: 	13 / 100
Confirmed: 	25
Tentative: 	13
Rejected: 	4


Iteration: 	14 / 100
Confirmed: 	25
Tentative: 	13
Rejected: 	4


Iteration: 	15 / 100
Confirmed: 	25
Tentative: 	13
Rejected: 	4


Iteration: 	16 / 100
Confirmed: 	25
Tentative: 	13
Rejected: 	4


Iteration: 	17 / 100
Confirmed: 	25
Tentative: 	13
Rejected: 	4


Iteration: 	18 / 100
Confirmed: 	25
Tentative: 	13
Rejected: 	4


Iteration: 	19 / 100
Confirmed: 	26
Tentative: 	12
Rejected: 	4


Iteration: 	20 / 100
Confirmed: 	26
Tentative: 	11
Rejected: 	5


Iteration: 	21 / 100
Confirmed: 	26
Tentative: 	11
Rejected: 	5


Iteration: 	22 / 100
Confirmed: 	26
Tentative: 	11
Rejected: 	5


Iteration: 	23 / 100
Confirmed: 	26
Tentative: 	10
Rejected: 	6


Iteration: 	24 / 100
Confirmed: 	26
Tentative: 	10
Rejected: 	6


Iteration: 	25 / 100
Confirmed: 	26
Tentative: 	10
Rejected: 	6


Iteration: 	26 / 100
Confirmed: 	26
Tentative: 	9
Rejected: 	7


Iteration: 	27 / 100
Confirmed: 	26
Tentative: 	9
Rejected: 	7


Iteration: 	28 / 100
Confirmed: 	26
Tentative: 	9
Rejected: 	7


Iteration: 	29 / 100
Confirmed: 	26
Tentative: 	9
Rejected: 	7


Iteration: 	30 / 100
Confirmed: 	26
Tentative: 	9
Rejected: 	7


Iteration: 	31 / 100
Confirmed: 	26
Tentative: 	9
Rejected: 	7


Iteration: 	32 / 100
Confirmed: 	26
Tentative: 	8
Rejected: 	8


Iteration: 	33 / 100
Confirmed: 	26
Tentative: 	8
Rejected: 	8


Iteration: 	34 / 100
Confirmed: 	26
Tentative: 	7
Rejected: 	9


Iteration: 	35 / 100
Confirmed: 	26
Tentative: 	7
Rejected: 	9


Iteration: 	36 / 100
Confirmed: 	26
Tentative: 	7
Rejected: 	9


Iteration: 	37 / 100
Confirmed: 	26
Tentative: 	7
Rejected: 	9


Iteration: 	38 / 100
Confirmed: 	26
Tentative: 	7
Rejected: 	9


Iteration: 	39 / 100
Confirmed: 	26
Tentative: 	7
Rejected: 	9


Iteration: 	40 / 100
Confirmed: 	26
Tentative: 	6
Rejected: 	10


Iteration: 	41 / 100
Confirmed: 	26
Tentative: 	6
Rejected: 	10


Iteration: 	42 / 100
Confirmed: 	26
Tentative: 	6
Rejected: 	10


Iteration: 	43 / 100
Confirmed: 	26
Tentative: 	6
Rejected: 	10


Iteration: 	44 / 100
Confirmed: 	26
Tentative: 	6
Rejected: 	10


Iteration: 	45 / 100
Confirmed: 	26
Tentative: 	6
Rejected: 	10


Iteration: 	46 / 100
Confirmed: 	26
Tentative: 	6
Rejected: 	10


Iteration: 	47 / 100
Confirmed: 	26
Tentative: 	6
Rejected: 	10


Iteration: 	48 / 100
Confirmed: 	26
Tentative: 	6
Rejected: 	10


Iteration: 	49 / 100
Confirmed: 	26
Tentative: 	6
Rejected: 	10


Iteration: 	50 / 100
Confirmed: 	26
Tentative: 	6
Rejected: 	10


Iteration: 	51 / 100
Confirmed: 	26
Tentative: 	6
Rejected: 	10


Iteration: 	52 / 100
Confirmed: 	26
Tentative: 	6
Rejected: 	10


Iteration: 	53 / 100
Confirmed: 	26
Tentative: 	6
Rejected: 	10


Iteration: 	54 / 100
Confirmed: 	26
Tentative: 	6
Rejected: 	10


Iteration: 	55 / 100
Confirmed: 	26
Tentative: 	6
Rejected: 	10


Iteration: 	56 / 100
Confirmed: 	26
Tentative: 	6
Rejected: 	10


Iteration: 	57 / 100
Confirmed: 	26
Tentative: 	6
Rejected: 	10


Iteration: 	58 / 100
Confirmed: 	26
Tentative: 	6
Rejected: 	10


Iteration: 	59 / 100
Confirmed: 	26
Tentative: 	6
Rejected: 	10


Iteration: 	60 / 100
Confirmed: 	26
Tentative: 	6
Rejected: 	10


Iteration: 	61 / 100
Confirmed: 	26
Tentative: 	6
Rejected: 	10


Iteration: 	62 / 100
Confirmed: 	26
Tentative: 	5
Rejected: 	11


Iteration: 	63 / 100
Confirmed: 	26
Tentative: 	5
Rejected: 	11


Iteration: 	64 / 100
Confirmed: 	26
Tentative: 	5
Rejected: 	11


Iteration: 	65 / 100
Confirmed: 	26
Tentative: 	5
Rejected: 	11


Iteration: 	66 / 100
Confirmed: 	26
Tentative: 	5
Rejected: 	11


Iteration: 	67 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11


Iteration: 	68 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11


Iteration: 	69 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11


Iteration: 	70 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11


Iteration: 	71 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11


Iteration: 	72 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11


Iteration: 	73 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11


Iteration: 	74 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11


Iteration: 	75 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11


Iteration: 	76 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11


Iteration: 	77 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11


Iteration: 	78 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11


Iteration: 	79 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11


Iteration: 	80 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11


Iteration: 	81 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11


Iteration: 	82 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11


Iteration: 	83 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11


Iteration: 	84 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11


Iteration: 	85 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11


Iteration: 	86 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11


Iteration: 	87 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11


Iteration: 	88 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11


Iteration: 	89 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11


Iteration: 	90 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11


Iteration: 	91 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11


Iteration: 	92 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11


Iteration: 	93 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11


Iteration: 	94 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11


Iteration: 	95 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11


Iteration: 	96 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11


Iteration: 	97 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11


Iteration: 	98 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11
Iteration: 	99 / 100
Confirmed: 	27
Tentative: 	4
Rejected: 	11


BorutaPy finished running.

Iteration: 	100 / 100
Confirmed: 	27
Tentative: 	2
Rejected: 	11


### After feature pruning

In [13]:
X.loc[:, feat_selector.support_].head(5)

| | device | ip_day_hour_clicks | ip_os_day_hour_clicks | ip_device_day_hour_clicks | ip_day_test_hh_clicks | ip_app_device_clicks | ip_app_device_day_clicks | ip_day_nunique_app | ip_day_nunique_channel | ip_day_nunique_hour | ... | ip_device_os_day_nunique_app | ip_device_os_nunique_app | ip_app_day_var_hour | ip_os_day_var_hour | ip_day_cumcount | ip_cumcount | ip_app_day_cumcount | ip_app_cumcount | ip_device_os_day_cumcount | ip_device_os_cumcount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 184803890 | 1 | 11373 | 66 | 10208 | 28481 | 1252 | 1209 | 130 | 144 | 24 | ... | 36 | 38 | 6.353009 | 31.078529 | 64128 | 48473 | 1188 | 1231 | 653 | 1609 |
| 184803891 | 1 | 256 | 15 | 252 | 721 | 2424 | 624 | 51 | 96 | 24 | ... | 17 | 23 | 37.54578 | 24.670393 | 4830 | 16741 | 530 | 1899 | 27 | 134 |
| 184803892 | 1 | 29386 | 6651 | 26395 | 14086 | 38496 | 58967 | 188 | 133 | 24 | ... | 60 | 95 | 40.062443 | 37.24754 | 45828 | 58202 | 52702 | 4786 | 7756 | 51924 |
| 184803893 | 1 | 232 | 20 | 231 | 1145 | 446 | 128 | 45 | 97 | 24 | ... | 34 | 45 | 43.902988 | 29.981924 | 6041 | 19346 | 107 | 314 | 1523 | 4047 |
| 184803894 | 1 | 104 | 36 | 104 | 150 | 32 | 32 | 33 | 98 | 15 | ... | 17 | 17 | 21.286291 | 6.6 | 586 | 586 | 30 | 30 | 50 | 50 |


In [14]:
X.loc[:, feat_selector.support_].columns

Index(['device', 'ip_day_hour_clicks', 'ip_os_day_hour_clicks',
       'ip_device_day_hour_clicks', 'ip_day_test_hh_clicks',
       'ip_app_device_clicks', 'ip_app_device_day_clicks',
       'ip_day_nunique_app', 'ip_day_nunique_channel', 'ip_day_nunique_hour',
       'ip_nunique_app', 'ip_nunique_channel', 'ip_nunique_hour',
       'app_day_nunique_channel', 'app_nunique_channel',
       'ip_app_day_nunique_os', 'ip_app_nunique_os',
       'ip_device_os_day_nunique_app', 'ip_device_os_nunique_app',
       'ip_app_day_var_hour', 'ip_os_day_var_hour', 'ip_day_cumcount',
       'ip_cumcount', 'ip_app_day_cumcount', 'ip_app_cumcount',
       'ip_device_os_day_cumcount', 'ip_device_os_cumcount'],
      dtype='object')
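
To inspect the outcome beyond the confirmed columns, the fitted selector's masks can be turned into a per-feature summary. The helper below is a hypothetical sketch: `summarize_selection` is not part of boruta_py. It assumes the BorutaPy attributes `support_`, `support_weak_`, and `ranking_`, and it runs on made-up stand-in arrays with placeholder feature names, so the printed result says nothing about the real data.

```python
import numpy as np
import pandas as pd


def summarize_selection(feature_names, support, support_weak, ranking):
    """Label each feature as confirmed, tentative, or rejected."""
    status = np.where(support, 'confirmed',
                      np.where(support_weak, 'tentative', 'rejected'))
    summary = pd.DataFrame(
        {'feature': feature_names, 'status': status, 'rank': ranking}
    )
    # Best-ranked features first; a stable sort keeps ties in input order.
    return summary.sort_values('rank', kind='stable').reset_index(drop=True)


# Stand-in inputs for illustration only (not results from this notebook).
demo = summarize_selection(
    ['feat_a', 'feat_b', 'feat_c', 'feat_d'],
    support=np.array([False, True, False, False]),
    support_weak=np.array([False, False, True, False]),
    ranking=np.array([4, 1, 2, 3]),
)
print(demo)
```

With the selector fitted above, the call would presumably be `summarize_selection(features, feat_selector.support_, feat_selector.support_weak_, feat_selector.ranking_)`.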