# Boruta Feature Selection

## Problem Statement

#### To check the power of the Boruta feature selection function in Python.

#### Apply Boruta feature selection on top of the Random Forest algorithm for every change made in the dataset and compare the results.

#### Features should be labeled 'confirmed', 'tentative', or 'rejected'.

##### In this project we will see a practical application of Boruta, a feature selection technique that can be applied on top of the 
##### Random Forest algorithm. Let us install the Boruta package and some useful libraries.
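
To make the idea behind Boruta concrete, here is a minimal sketch of a single Boruta-style round. It is not the BorutaPy implementation: the synthetic data, the variable names, and the forest settings are assumptions chosen for illustration. Each real column gets a shuffled "shadow" copy, a Random Forest is fitted on both, and a real feature scores a "hit" when its importance beats the best shadow importance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical synthetic data: column 0 drives the label, columns 1 and 2 are noise.
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)

# Shadow features: every column shuffled independently, which breaks any link to y.
X_shadow = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
X_extended = np.hstack([X, X_shadow])

# Fit one forest on the real and shadow columns together.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_extended, y)

importances = forest.feature_importances_
real_importances = importances[: X.shape[1]]
best_shadow = importances[X.shape[1]:].max()

# A real feature scores a "hit" when it beats the best shadow feature.
hits = real_importances > best_shadow
print(hits)
```

Boruta repeats this kind of round many times and counts the hits per feature; a statistical test on those counts then decides whether a feature is confirmed, rejected, or still tentative.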

In [1]:
! pip install boruta



In [2]:
# Import the libraries

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Library Boruta 
from boruta import BorutaPy

In [3]:
# Read the file 

cr = pd.read_csv(r"D:\sushma\data sets\CreditRisk.csv")

In [4]:
# Check the shape.

cr.shape

(981, 13)

In [5]:
# Check the first 5 records  

cr.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0.0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1.0,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0.0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0.0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0.0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [6]:
# Check the nulls 

cr.isnull().sum()

Loan_ID               0
Gender               24
Married               3
Dependents           25
Education             0
Self_Employed        55
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           27
Loan_Amount_Term     20
Credit_History       79
Property_Area         0
Loan_Status           0
dtype: int64

## Data Handling

#### Removing Nulls

In [7]:
cr.Gender.value_counts()
cr.Gender = cr.Gender.fillna("Male")

In [8]:
cr.Married.value_counts()
cr.Married = cr.Married.fillna("Yes")

In [9]:
cr.Dependents.value_counts()
cr.Dependents = cr.Dependents.fillna(0)

In [10]:
cr.Self_Employed.value_counts()
cr.Self_Employed = cr.Self_Employed.fillna("No")

In [11]:
cr.LoanAmount.value_counts()
cr.LoanAmount = cr.LoanAmount.fillna(cr.LoanAmount.mean())

In [12]:
cr.Loan_Amount_Term.value_counts()
cr.Loan_Amount_Term = cr.Loan_Amount_Term.fillna(cr.Loan_Amount_Term.mean())

In [13]:
cr.Credit_History.value_counts()
cr.Credit_History = cr.Credit_History.fillna(1)
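
The cells above fill each column by hand after inspecting `value_counts()`. As a rough alternative, the same kind of imputation can be sketched as a loop. The toy data frame below is hypothetical, and the rule used here (mode for non-numeric columns, mean for numeric ones) is a simplification that does not exactly reproduce the constants chosen above for `Dependents` and `Credit_History`.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column.
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [100.0, None, 140.0, 120.0],
})

for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        # Numeric column: fill with the column mean.
        toy[col] = toy[col].fillna(toy[col].mean())
    else:
        # Non-numeric column: fill with the most frequent value.
        toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy.isnull().sum())
```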

In [14]:
cr.isnull().sum()

Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64

In [15]:
# Drop the Loan_ID column, which is not needed for modelling.

cr = cr.iloc[: , 1:13]

#### Converting categorical data into numeric form using Label Encoder

In [16]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [17]:
cr[cr.select_dtypes(include=['object']).columns] = cr[cr.select_dtypes(include=['object']).columns].apply(le.fit_transform)
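
One thing to keep in mind with the cell above: a single `LabelEncoder` is refitted for every column, so afterwards it only remembers the classes of the last column it saw. If the mappings are needed later, one option is to keep one encoder per column. The sketch below uses a hypothetical toy frame, and the `encoders` dictionary is an illustrative name.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame with two categorical columns.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Property_Area": ["Urban", "Rural", "Semiurban"],
})

encoders = {}
for col in ["Gender", "Property_Area"]:
    enc = LabelEncoder()
    # Classes are sorted alphabetically and mapped to 0, 1, 2, ...
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# The stored encoder can translate codes back to the original labels.
print(encoders["Gender"].inverse_transform([1]))
```

Because the classes are sorted alphabetically, "Female" maps to 0 and "Male" to 1, which agrees with the encoded `Gender` column shown in the next output.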

In [18]:
cr.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,1,0,0.0,0,0,5849,0.0,142.51153,360.0,1.0,2,1
1,1,1,1.0,0,0,4583,1508.0,128.0,360.0,1.0,0,0
2,1,1,0.0,0,1,3000,0.0,66.0,360.0,1.0,2,1
3,1,1,0.0,1,0,2583,2358.0,120.0,360.0,1.0,2,1
4,1,0,0.0,0,0,6000,0.0,141.0,360.0,1.0,2,1


###### Divide data into x and y

In [19]:
cr_x = cr.iloc[: , 0:11]
cr_y = cr.iloc[: , -1]

In [20]:
# Create a backup copy for later use (the column names are needed after the conversion to arrays).

cr_x_backup = cr_x.copy()

In [21]:
# Convert the data frames into NumPy arrays before applying Boruta

cr_x = np.array(cr_x)
cr_y = np.array(cr_y)
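
The positional slices above depend on the exact column order. As a small sketch of an alternative, the split can be done by column name, keeping the feature names in a list for later. The toy frame and the names `feature_names`, `X`, and `y` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy frame with two features and the target column.
toy = pd.DataFrame({
    "ApplicantIncome": [5849, 4583, 3000],
    "Credit_History": [1.0, 1.0, 0.0],
    "Loan_Status": [1, 0, 1],
})

# Everything except the target is treated as a feature.
feature_names = [c for c in toy.columns if c != "Loan_Status"]

# Convert to NumPy arrays, keeping the names in a separate list.
X = toy[feature_names].to_numpy()
y = toy["Loan_Status"].to_numpy()

print(X.shape, y.shape)
```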

#### Here we have divided the data into two parts: the independent features and the target variable.
#### Now we will create the Random Forest classifier and the Boruta feature selector.
#### We will allow a maximum of 25 iterations.

#### Create the RandomForest classifier

In [22]:
rfc = RandomForestClassifier()

In [23]:
# Create the Boruta feature selector

boruta_feature_selector = BorutaPy(rfc , max_iter=25 , verbose= 2)

In [24]:
boruta_feature_selector

BorutaPy(estimator=RandomForestClassifier(), max_iter=25, verbose=2)

In [25]:
boruta_feature_selector.fit(cr_x , cr_y)

Iteration: 	1 / 25
Confirmed: 	0
Tentative: 	11
Rejected: 	0
Iteration: 	2 / 25
Confirmed: 	0
Tentative: 	11
Rejected: 	0
Iteration: 	3 / 25
Confirmed: 	0
Tentative: 	11
Rejected: 	0
Iteration: 	4 / 25
Confirmed: 	0
Tentative: 	11
Rejected: 	0
Iteration: 	5 / 25
Confirmed: 	0
Tentative: 	11
Rejected: 	0
Iteration: 	6 / 25
Confirmed: 	0
Tentative: 	11
Rejected: 	0
Iteration: 	7 / 25
Confirmed: 	0
Tentative: 	11
Rejected: 	0
Iteration: 	8 / 25
Confirmed: 	2
Tentative: 	1
Rejected: 	8
Iteration: 	9 / 25
Confirmed: 	2
Tentative: 	1
Rejected: 	8
Iteration: 	10 / 25
Confirmed: 	2
Tentative: 	1
Rejected: 	8
Iteration: 	11 / 25
Confirmed: 	2
Tentative: 	1
Rejected: 	8
Iteration: 	12 / 25
Confirmed: 	2
Tentative: 	1
Rejected: 	8
Iteration: 	13 / 25
Confirmed: 	2
Tentative: 	1
Rejected: 	8
Iteration: 	14 / 25
Confirmed: 	2
Tentative: 	1
Rejected: 	8
Iteration: 	15 / 25
Confirmed: 	2
Tentative: 	1
Rejected: 	8
Iteration: 	16 / 25
Confirmed: 	2
Tentative: 	1
Rejected: 	8
Iteration: 	17 / 25
[output truncated]

BorutaPy(estimator=RandomForestClassifier(n_estimators=1000,
                                          random_state=RandomState(MT19937) at 0x5871540),
         max_iter=25, random_state=RandomState(MT19937) at 0x5871540,
         verbose=2)
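
The log above keeps every feature tentative for several iterations before any decision is made. A rough way to see why is to model the hit count of a feature as a binomial variable with success probability 0.5 and test it after each round. The sketch below is a simplified illustration: the function names are hypothetical, and the plain Bonferroni correction across features is an assumption, so BorutaPy's own multiple-testing correction may differ.

```python
from math import comb

def binom_tail_ge(hits, n, p=0.5):
    """P(X >= hits) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(hits, n + 1))

def binom_tail_le(hits, n, p=0.5):
    """P(X <= hits) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(0, hits + 1))

def decide(hits, n, n_features, alpha=0.05):
    """Label a feature after n rounds, with a simple Bonferroni correction (assumed)."""
    threshold = alpha / n_features
    if binom_tail_ge(hits, n) < threshold:
        return "confirmed"
    if binom_tail_le(hits, n) < threshold:
        return "rejected"
    return "tentative"

# Smallest number of rounds after which a feature that hits every time is confirmed.
first_n = next(n for n in range(1, 26) if decide(n, n, n_features=11) == "confirmed")
print(first_n)
print(decide(7, 7, n_features=11), decide(0, 8, n_features=11), decide(4, 8, n_features=11))
```

Under these assumptions, even a feature that beats the shadows in every round needs several rounds before the evidence is strong enough, so a long run of "Tentative: 11" at the start of the log is not surprising.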

#### We will check which features are important and which are not.

In [26]:
# Check whether each feature is confirmed (True) or not (False) using the 'support_' attribute

boruta_feature_selector.support_

array([False, False, False, False, False,  True, False, False, False,
        True, False])

In [27]:
# Read the column names from the backup to build the data frame in the next step

cr_x_backup.columns

Index(['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
       'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area'],
      dtype='object')

In [28]:
# Create a data frame pairing each feature with its Boruta support flag (True = confirmed).

feature_importance = pd.DataFrame({"feature" : cr_x_backup.columns , "Importance" : boruta_feature_selector.support_})

In [29]:
feature_importance

Unnamed: 0,feature,Importance
0,Gender,False
1,Married,False
2,Dependents,False
3,Education,False
4,Self_Employed,False
5,ApplicantIncome,True
6,CoapplicantIncome,False
7,LoanAmount,False
8,Loan_Amount_Term,False
9,Credit_History,True
10,Property_Area,False

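The table above only separates confirmed features from everything else, while the problem statement asks for three labels. BorutaPy also exposes a `support_weak_` attribute for tentative features (and `ranking_` for the feature ranks), so a three-way label can be sketched as below. The helper `label_features` and the toy inputs are hypothetical; with the fitted selector one would pass `boruta_feature_selector.support_` and `boruta_feature_selector.support_weak_` instead.

```python
import pandas as pd

def label_features(names, support, support_weak):
    """Combine the two boolean masks into confirmed / tentative / rejected labels."""
    labels = [
        "confirmed" if confirmed else "tentative" if tentative else "rejected"
        for confirmed, tentative in zip(support, support_weak)
    ]
    return pd.DataFrame({"feature": list(names), "decision": labels})

# Hypothetical toy masks, not the results of the run above.
toy_labels = label_features(
    names=["feature_a", "feature_b", "feature_c"],
    support=[True, False, False],
    support_weak=[False, True, False],
)
print(toy_labels)
```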

### Conclusion:

#### The data frame above shows which features Boruta confirmed and which it did not.
#### As we can see above, "Credit_History" and "ApplicantIncome" are the two important features identified by Boruta.
#### The rest of the features, such as "Gender", are least important. With this observation we can say that Gender did not play an important role in loan approval in this dataset, which suggests there was no gender bias.

### ------------------------Thank You------------------------