# __Automatic Feature Selection with Boruta__

Let's use Boruta to learn about automatic feature selection.

## Step 1: Import Required Libraries

- Import pandas, NumPy, RandomForestClassifier, BorutaPy, train_test_split, and accuracy_score


`pip install boruta`

In [10]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Step 2: Load and Prepare the Dataset

- Load the dataset using pandas
- Split the dataset into features and target variables
- Split the data into training and testing sets


In [11]:
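import ssl
# Disable HTTPS certificate verification so the CSV download works in environments
# with certificate problems. Avoid this for untrusted sources.
ssl._create_default_https_context = ssl._create_unverified_context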


__Observations__
- As the `data.info()` output below shows, the dataset has 35 columns and 1793 rows.
- There are no missing values: every column has 1793 non-null entries.

In [12]:
URL = "http://raw.githubusercontent.com/Aditya1001001/English-Premier-League/master/pos_modelling_data.csv"
data = pd.read_csv(URL)
data.info()
X = data.drop('Position', axis = 1)
y = data['Position']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1793 entries, 0 to 1792
Data columns (total 35 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Position             1793 non-null   object 
 1   Clean sheets         1793 non-null   float64
 2   Goals conceded       1793 non-null   float64
 3   Tackles              1793 non-null   float64
 4   Tackle success %     1793 non-null   int64  
 5   Blocked shots        1793 non-null   float64
 6   Interceptions        1793 non-null   float64
 7   Clearances           1793 non-null   float64
 8   Recoveries           1793 non-null   float64
 9   Successful 50/50s    1793 non-null   float64
 10  Own goals            1793 non-null   float64
 11  Assists              1793 non-null   int64  
 12  Passes               1793 non-null   int64  
 13  Passes per match     1793 non-null   float64
 14  Big chances created  1793 non-null   float64
 15  Crosses              1793 non-null   f

Now, let's split the data into features (X) and the target (y), and then into training and testing sets.

In [13]:
X = data.drop('Position', axis = 1)
y = data['Position']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

## Step 4: Train a RandomForest Classifier

- Train a RandomForest classifier using all features
- Calculate the accuracy score on the test set


In [14]:
rf_all_features = RandomForestClassifier(random_state=1, n_estimators=1000, max_depth=5)
rf_all_features.fit(X_train, y_train)

accuracy_score(y_test, rf_all_features.predict(X_test))

0.7298050139275766

__Observation__
- The accuracy score for the model using all features is approximately 73.0%.

## Step 5: Perform Feature Selection Using Boruta

- Train a RandomForest classifier for Boruta feature selection
- Perform feature selection using Boruta
- Display the ranking and the number of significant features


In [15]:
rfc = RandomForestClassifier(random_state=1, n_estimators=1000, max_depth=5)
boruta_selector = BorutaPy(rfc, n_estimators='auto', verbose=2, random_state=1)

In [16]:
#compatibility fix - from https://github.com/scikit-learn-contrib/boruta_py/issues/122
np.int = np.int32
np.float = np.float64
np.bool = np.bool_
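
The aliases restored above were removed in newer NumPy releases, which is why older BorutaPy code fails without them. A slightly more defensive variant is sketched below: it only adds an alias when the installed NumPy lacks it. Mapping the aliases to the Python built-ins (rather than `np.int32` and so on, as in the cell above) is an assumption made for this sketch.

```python
import numpy as np

# Restore each legacy alias only if this NumPy version does not provide it.
# Using the Python built-ins as replacements is an assumption of this sketch.
for alias, replacement in (("int", int), ("float", float), ("bool", bool)):
    if not hasattr(np, alias):
        setattr(np, alias, replacement)
```

Run this before calling `fit` on the selector, just like the original compatibility cell.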


In [17]:
boruta_selector.fit(X_train.values, y_train.values)

Iteration: 	1 / 100
Confirmed: 	0
Tentative: 	34
Rejected: 	0
Iteration: 	2 / 100
Confirmed: 	0
Tentative: 	34
Rejected: 	0
Iteration: 	3 / 100
Confirmed: 	0
Tentative: 	34
Rejected: 	0
Iteration: 	4 / 100
Confirmed: 	0
Tentative: 	34
Rejected: 	0
Iteration: 	5 / 100
Confirmed: 	0
Tentative: 	34
Rejected: 	0
Iteration: 	6 / 100
Confirmed: 	0
Tentative: 	34
Rejected: 	0
Iteration: 	7 / 100
Confirmed: 	0
Tentative: 	34
Rejected: 	0
Iteration: 	8 / 100
Confirmed: 	29
Tentative: 	5
Rejected: 	0
Iteration: 	9 / 100
Confirmed: 	29
Tentative: 	3
Rejected: 	2
Iteration: 	10 / 100
Confirmed: 	29
Tentative: 	3
Rejected: 	2
Iteration: 	11 / 100
Confirmed: 	29
Tentative: 	3
Rejected: 	2
Iteration: 	12 / 100
Confirmed: 	30
Tentative: 	2
Rejected: 	2
Iteration: 	13 / 100
Confirmed: 	30
Tentative: 	2
Rejected: 	2
Iteration: 	14 / 100
Confirmed: 	30
Tentative: 	2
Rejected: 	2
Iteration: 	15 / 100
Confirmed: 	30
Tentative: 	2
Rejected: 	2
Iteration: 	16 / 100
Confirmed: 	30
Tentative: 	2
Rejected: 	2
I

__Observations__
- The log shows Boruta's progress through its iterations; BorutaPy runs at most 100 iterations by default.
- Each iteration reports how many features are currently confirmed, tentative, or rejected, and the final ranking is based on these decisions.
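
The confirmed, tentative, and rejected counts come from comparing each real feature against shuffled "shadow" copies of the features. Below is a minimal sketch of a single comparison round, assuming synthetic data from `make_classification` and a plain random forest. BorutaPy itself repeats this over many iterations and applies a statistical test to the accumulated hits, which this sketch omits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# Synthetic data (an assumption for illustration). With shuffle=False the
# 3 informative columns come first, followed by 5 pure-noise columns.
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           n_redundant=0, n_repeated=0, shuffle=False,
                           random_state=1)

# Shadow features: a copy of X in which every column is permuted independently,
# which destroys any relationship between the column and the target.
X_shadow = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
X_extended = np.hstack([X, X_shadow])

forest = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=1)
forest.fit(X_extended, y)

n_real = X.shape[1]
real_importance = forest.feature_importances_[:n_real]
shadow_importance = forest.feature_importances_[n_real:]

# A real feature scores a "hit" in this round if it beats the best shadow feature.
hits = real_importance > shadow_importance.max()
print(hits)
```

Features that keep scoring hits across rounds end up confirmed, while features that rarely beat the shadows end up rejected.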

In [18]:
print("Ranking: ",boruta_selector.ranking_)
print("No. of significant features: ", boruta_selector.n_features_)

Ranking:  [1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 4 1 1 1 1 1 1 1 2 1 1 1 1 1 1]
No. of significant features:  31


__Observation__

- Out of 34 features (35 columns minus the `Position` target), 31 are confirmed as significant and have a ranking of 1.
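
To make the ranking easier to read, the array printed above can be tallied by category. This sketch assumes BorutaPy's documented convention that rank 1 means confirmed and rank 2 means tentative, and treats higher ranks as rejected; the array is copied from the output above.

```python
import numpy as np

# Ranking array copied from the printed output above
ranking = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1,
                    1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1])

n_confirmed = int((ranking == 1).sum())   # rank 1: confirmed
n_tentative = int((ranking == 2).sum())   # rank 2: tentative
n_rejected = int((ranking > 2).sum())     # rank 3 and above: rejected

print(len(ranking), n_confirmed, n_tentative, n_rejected)
```

The confirmed count matches `boruta_selector.n_features_`, and the remaining three features are the ones Boruta did not confirm.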

## Step 6: Display the Ranking of Features

- Create a DataFrame with the feature ranking
- Sort the DataFrame based on the ranking


In [19]:
selected_rf_features = pd.DataFrame({
                                        'Feature':list(X_train.columns),
                                        'Ranking':boruta_selector.ranking_
                                        })
selected_rf_features.sort_values(by='Ranking')

| | Feature | Ranking |
|---|---|---|
| 0 | Clean sheets | 1 |
| 31 | Arial Saves | 1 |
| 30 | overall | 1 |
| 29 | value_eur | 1 |
| 28 | age | 1 |
| 26 | Saves | 1 |
| 25 | Shooting accuracy % | 1 |
| 24 | Shots | 1 |
| 23 | Goals per match | 1 |
| 22 | Goals | 1 |


__Observation__
- The sorted table shows which features have rank 1 (confirmed) and which have a higher rank number and were not confirmed.

## Step 7: Train a RandomForest Classifier Using the Selected Features

- Transform the training and testing sets using the Boruta selector
- Train a RandomForest classifier using the selected features
- Calculate the accuracy score on the test set

In [20]:
X_important_train = boruta_selector.transform(np.array(X_train))
X_important_test = boruta_selector.transform(np.array(X_test))
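
Conceptually, `transform` keeps only the columns that the fitted selector marked as confirmed. A minimal NumPy sketch of that column masking follows, with a hypothetical matrix and a hypothetical boolean mask standing in for the real data and for `boruta_selector.support_`:

```python
import numpy as np

# Hypothetical 4-sample, 5-feature matrix (stand-in for X_train)
X = np.arange(20).reshape(4, 5)

# Hypothetical mask (stand-in for boruta_selector.support_): True = keep column
support = np.array([True, False, True, True, False])

# Boolean indexing on the column axis drops the unselected features
X_selected = X[:, support]
print(X_selected.shape)
```

Because the same mask is applied to the training and testing sets, both end up with identical columns in the same order.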

- Using only the features selected by Boruta, let's fit the RandomForestClassifier model.

In [21]:
rf_boruta = RandomForestClassifier(random_state=1, n_estimators=1000, max_depth=5)
rf_boruta.fit(X_important_train, y_train)

In [22]:
accuracy_score(y_test, rf_boruta.predict(X_important_test))

0.7325905292479109

__Observation__

- As seen above, the accuracy is marginally higher (about 73.3% versus 73.0%), even though we eliminated 3 features.
- The advantage here is that we were able to determine which features can be removed without hurting model performance, which reduces the computational cost of training and prediction.
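
To put the size of that difference in perspective, the two accuracy scores can be converted back into counts of correct predictions. This sketch assumes the test set size follows from `test_size=0.2` on 1793 rows, rounded up as `train_test_split` does for a fractional test size.

```python
import math

n_rows = 1793                        # row count from data.info()
n_test = math.ceil(0.2 * n_rows)     # fractional test size is rounded up

acc_all = 0.7298050139275766         # model trained on all features
acc_boruta = 0.7325905292479109      # model trained on Boruta-selected features

# Accuracy times test-set size gives the number of correct predictions
correct_all = round(acc_all * n_test)
correct_boruta = round(acc_boruta * n_test)

print(n_test, correct_all, correct_boruta)
```

Under these assumptions the gap is a single extra correct prediction (263 versus 262 out of 359), so the two models are best described as performing equally well.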

___

In [23]:

import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns


In [24]:
df = pd.read_csv("data/wisconsin_breast_cancer_dataset.csv")


FileNotFoundError: [Errno 2] No such file or directory: 'data/wisconsin_breast_cancer_dataset.csv'

In [None]:

print(df.describe().T)  #Values need to be normalized before fitting.
print(df.isnull().sum())
#df = df.dropna()

#Rename the Diagnosis column to Label to make it easy to understand
df = df.rename(columns={'Diagnosis':'Label'})
print(df.dtypes)

#Understand the data
#sns.countplot(x="Label", data=df) #M - malignant   B - benign


####### Replace categorical values with numbers########
df['Label'].value_counts()

#Define the dependent variable that needs to be predicted (labels)
y = df["Label"].values

# Encoding categorical data
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
Y = labelencoder.fit_transform(y) # M=1 and B=0
#################################################################
#Define X, split into train and test, then normalize values

#Define the independent variables by dropping the label and the ID column
X = df.drop(labels = ["Label", "ID"], axis=1)

import numpy as np
feature_names = np.array(X.columns)  #Keep the column names; scaling returns a plain NumPy array

##Split data into train and test to verify accuracy after fitting the model.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=42)

#Fit the scaler on the training data only, so no test-set statistics leak into training
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

###########################################################################
# Define XGBOOST classifier to be used by Boruta
import xgboost as xgb
model = xgb.XGBClassifier()  #For Boruta

"""
Create shadow features – random features and shuffle values in columns
Train Random Forest / XGBoost and calculate feature importance via mean decrease impurity
Check if real features have higher importance compared to shadow features
Repeat this for every iteration
If original feature performed better, then mark it as important
"""

from boruta import BorutaPy

# define Boruta feature selection method
feat_selector = BorutaPy(model, n_estimators='auto', verbose=2, random_state=1)

# find all relevant features
feat_selector.fit(X_train, y_train)


# check selected features
print(feat_selector.support_)  #Should we accept the feature

# check ranking of features
print(feat_selector.ranking_) #Rank 1 is the best

# call transform() on X to filter it down to selected features
X_filtered = feat_selector.transform(X_train)  #Apply feature selection and return transformed data

"""
Review the features
"""
# zip feature names, ranks, and decisions
feature_ranks = list(zip(feature_names,
                         feat_selector.ranking_,
                         feat_selector.support_))

# print the results
for feat in feature_ranks:
    print('Feature: {:<30} Rank: {},  Keep: {}'.format(feat[0], feat[1], feat[2]))


############################################################
#Now use the subset of features to fit XGBoost model on training data
import xgboost as xgb
xgb_model = xgb.XGBClassifier()

xgb_model.fit(X_filtered, y_train)

#Now predict on test data using the trained model.

#First apply feature selector transform to make sure same features are selected from test data
X_test_filtered = feat_selector.transform(X_test)
prediction_xgb = xgb_model.predict(X_test_filtered)


#Print overall accuracy
from sklearn import metrics
print ("Accuracy = ", metrics.accuracy_score(y_test, prediction_xgb))

#Confusion Matrix - verify accuracy of each class
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, prediction_xgb)
#print(cm)
sns.heatmap(cm, annot=True)

#######################################################


