# Hands-On Guide To Automated Feature Selection Using Boruta

https://analyticsindiamag.com/hands-on-guide-to-automated-feature-selection-using-boruta/

Feature selection is one of the most crucial and time-consuming phases of the machine learning process, second only to data cleaning. What if we could automate it? That is exactly what Boruta does. Boruta is an algorithm designed to take the “all-relevant” approach to feature selection, i.e., it tries to find all features in the dataset that carry information relevant to a given task. The counterpart to this is the “minimal-optimal” approach, which seeks the smallest subset of features that is sufficient for a given model.
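To make the “all-relevant” idea concrete: Boruta is generally described as comparing each real feature against “shadow” copies of the features whose values have been shuffled, and keeping the features that repeatedly score higher than the best shadow. The sketch below shows one such round in plain Python. It is only an illustration under stated assumptions: absolute Pearson correlation stands in for the random-forest importance that BorutaPy actually uses, and the names `abs_corr`, `shadow_round`, `signal` and `noise` are made up for this example.

```python
import random


def abs_corr(xs, ys):
    """Absolute Pearson correlation; a stand-in for a real importance score."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / (var_x * var_y) ** 0.5


def shadow_round(columns, target, rng):
    """Return the names of features that beat the best shuffled (shadow) copy."""
    shadow_scores = []
    for values in columns.values():
        shadow = values[:]      # copy the column ...
        rng.shuffle(shadow)     # ... and destroy its link to the target
        shadow_scores.append(abs_corr(shadow, target))
    threshold = max(shadow_scores)
    return {name for name, values in columns.items()
            if abs_corr(values, target) > threshold}


rng = random.Random(1)
target = [rng.random() for _ in range(200)]
columns = {
    "signal": [t + 0.1 * rng.random() for t in target],  # tracks the target
    "noise": [rng.random() for _ in range(200)],          # unrelated to it
}

hits = shadow_round(columns, target, rng)
print(hits)
```

Boruta repeats this comparison over many iterations and applies a statistical test to the hit counts, which is why the log further down reports features as Confirmed, Tentative or Rejected.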

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score 

In [2]:
URL = "https://raw.githubusercontent.com/Aditya1001001/English-Premier-League/master/pos_modelling_data.csv"

In [3]:
data = pd.read_csv(URL)
data.info()
X = data.drop('Position', axis = 1)
y = data['Position']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 1) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1793 entries, 0 to 1792
Data columns (total 35 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Position             1793 non-null   object 
 1   Clean sheets         1793 non-null   float64
 2   Goals conceded       1793 non-null   float64
 3   Tackles              1793 non-null   float64
 4   Tackle success %     1793 non-null   int64  
 5   Blocked shots        1793 non-null   float64
 6   Interceptions        1793 non-null   float64
 7   Clearances           1793 non-null   float64
 8   Recoveries           1793 non-null   float64
 9   Successful 50/50s    1793 non-null   float64
 10  Own goals            1793 non-null   float64
 11  Assists              1793 non-null   int64  
 12  Passes               1793 non-null   int64  
 13  Passes per match     1793 non-null   float64
 14  Big chances created  1793 non-null   float64
 15  Crosses              1793 non-null   f

In [4]:
data.head()

Unnamed: 0,Position,Clean sheets,Goals conceded,Tackles,Tackle success %,Blocked shots,Interceptions,Clearances,Recoveries,Successful 50/50s,...,Shots,Shooting accuracy %,Saves,Penalties saved,age,value_eur,overall,Arial Saves,Duels %,Aerial battles %
0,Midfielder,0.0,0.0,4.0,100,0.0,1.0,0.0,9.0,4.0,...,2.0,50,0.0,0.0,21,4400000,72,0.0,46.153846,25.0
1,Defender,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,...,0.0,0,0.0,0.0,22,10500000,77,0.0,0.0,0.0
2,Forward,0.0,0.0,10.0,0,11.0,1.0,19.0,0.0,0.0,...,42.0,36,0.0,0.0,19,7500000,73,0.0,0.0,0.0
3,Midfielder,0.0,0.0,9.0,56,3.0,9.0,14.0,40.0,12.0,...,10.0,30,0.0,0.0,31,4800000,74,0.0,55.384615,58.333333
4,Midfielder,0.0,0.0,22.0,59,5.0,14.0,0.0,58.0,6.0,...,11.0,18,0.0,0.0,28,0,83,0.0,40.869565,36.666667


###### Creating a baseline RandomForestClassifier model with all the features.

In [5]:
rf_all_features = RandomForestClassifier(random_state=1, n_estimators=1000, max_depth=5)
rf_all_features.fit(X_train, y_train) 

RandomForestClassifier(max_depth=5, n_estimators=1000, random_state=1)

In [6]:
accuracy_score(y_test, rf_all_features.predict(X_test))

0.7298050139275766

4. Creating a BorutaPy object with RandomForestClassifier as the estimator and ranking the features. 

One important thing to note here is that BorutaPy works on NumPy arrays only, so the pandas objects are converted with `np.array()` before fitting.

In [8]:
rfc = RandomForestClassifier(random_state=1, n_estimators=1000, max_depth=5)
boruta_selector = BorutaPy(rfc, n_estimators='auto', verbose=2, random_state=1)
boruta_selector.fit(np.array(X_train), np.array(y_train))

Iteration: 	1 / 100
Confirmed: 	0
Tentative: 	34
Rejected: 	0
Iteration: 	2 / 100
Confirmed: 	0
Tentative: 	34
Rejected: 	0
Iteration: 	3 / 100
Confirmed: 	0
Tentative: 	34
Rejected: 	0
Iteration: 	4 / 100
Confirmed: 	0
Tentative: 	34
Rejected: 	0
Iteration: 	5 / 100
Confirmed: 	0
Tentative: 	34
Rejected: 	0
Iteration: 	6 / 100
Confirmed: 	0
Tentative: 	34
Rejected: 	0
Iteration: 	7 / 100
Confirmed: 	0
Tentative: 	34
Rejected: 	0
Iteration: 	8 / 100
Confirmed: 	29
Tentative: 	5
Rejected: 	0
Iteration: 	9 / 100
Confirmed: 	29
Tentative: 	3
Rejected: 	2
Iteration: 	10 / 100
Confirmed: 	29
Tentative: 	3
Rejected: 	2
Iteration: 	11 / 100
Confirmed: 	29
Tentative: 	3
Rejected: 	2
Iteration: 	12 / 100
Confirmed: 	30
Tentative: 	2
Rejected: 	2
Iteration: 	13 / 100
Confirmed: 	30
Tentative: 	2
Rejected: 	2
Iteration: 	14 / 100
Confirmed: 	30
Tentative: 	2
Rejected: 	2
Iteration: 	15 / 100
Confirmed: 	30
Tentative: 	2
Rejected: 	2
Iteration: 	16 / 100
Confirmed: 	30
Tentative: 	2
Rejected: 	2
I

BorutaPy(estimator=RandomForestClassifier(max_depth=5, n_estimators=160,
                                          random_state=RandomState(MT19937) at 0x1EDACE366A8),
         n_estimators='auto',
         random_state=RandomState(MT19937) at 0x1EDACE366A8, verbose=2)


In [9]:
print("Ranking: ",boruta_selector.ranking_)          
print("No. of significant features: ", boruta_selector.n_features_) 

Ranking:  [1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 4 1 1 1 1 1 1 1 2 1 1 1 1 1 1]
No. of significant features:  31


Boruta has selected 31 features; the features with rank 1 are the selected ones. Let’s create a table to see exactly which features were rejected.

In [10]:
selected_rf_features = pd.DataFrame({'Feature':list(X_train.columns),
                                       'Ranking':boruta_selector.ranking_})
selected_rf_features.sort_values(by='Ranking')

Unnamed: 0,Feature,Ranking
0,Clean sheets,1
31,Arial Saves,1
30,overall,1
29,value_eur,1
28,age,1
26,Saves,1
25,Shooting accuracy %,1
24,Shots,1
23,Goals per match,1
22,Goals,1
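Since the table output is cut off before the lower-ranked rows, the positions of the features that were not selected can also be read straight from the ranking array printed earlier. The snippet below is a small sketch that copies that printed array into a plain list; `not_selected` and `n_selected` are names chosen for this example.

```python
# Ranking copied from the printed output of boruta_selector.ranking_
ranking = [1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4,
           1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1]

# Map column position -> rank for every feature that is not rank 1
not_selected = {i: r for i, r in enumerate(ranking) if r != 1}
n_selected = sum(1 for r in ranking if r == 1)

print(not_selected)  # {9: 3, 19: 4, 27: 2}
print(n_selected)    # 31
```

In the live session, `X_train.columns[boruta_selector.ranking_ != 1]` should give the corresponding column names, since `ranking_` is aligned with the column order of `X_train`.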


5. Using the BorutaPy object to transform the features in the dataset.

In [11]:
X_important_train = boruta_selector.transform(np.array(X_train))
X_important_test = boruta_selector.transform(np.array(X_test))

6. Creating another RandomForestClassifier model with the same parameters as the baseline classifier and training it with the selected features.

In [12]:
rf_boruta = RandomForestClassifier(random_state=1, n_estimators=1000, max_depth=5)
rf_boruta.fit(X_important_train, y_train)

RandomForestClassifier(max_depth=5, n_estimators=1000, random_state=1)

In [13]:
accuracy_score(y_test, rf_boruta.predict(X_important_test))

0.7325905292479109
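To put the two scores in perspective, the accuracies can be converted back into counts of correct predictions. The sketch below assumes that `train_test_split` rounds the test-set size up, so that 20% of 1,793 rows gives 359 test rows; the variable names are illustrative.

```python
import math

n_rows = 1793                     # from data.info() above
n_test = math.ceil(n_rows * 0.2)  # assumed rounding rule for test_size=.2

acc_all = 0.7298050139275766      # baseline model, all 34 features
acc_boruta = 0.7325905292479109   # model on the 31 Boruta-selected features

# Convert each accuracy into a whole number of correct test predictions
correct_all = round(acc_all * n_test)
correct_boruta = round(acc_boruta * n_test)

print(n_test, correct_all, correct_boruta)
print(correct_boruta - correct_all)
```

Under that assumption, the model trained on the selected features gets 263 of 359 test rows right versus 262 for the baseline, so the gain on this split is a single additional correct prediction, achieved with three fewer features.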