# **Feature Selection:** AutoML
* **Name:** Arsalan Ali
* **Email:** arslanchaos@gmail.com

#### **What is Feature Selection?**
Feature selection techniques are used to find the most relevant features while constructing ML model. <br>
There are a lot of ways to do it but the following libraries automate that process.
* **mRMR:** It is a minimal-optimal feature selection algorithm.
* **Boruta:** It's a wrapper around a Random Forest classification algorithm. It iteratively removes the features which are proved by a statistical test to be less relevant.

### **Importing Libraries**

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

### **Importing Dataset**

In [23]:
df = sns.load_dataset("titanic")
df.drop(columns=['deck'], axis=1, inplace=True)
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,Southampton,no,True


### **Dataset Info**

In [17]:
print(df.shape)
df.info()

(891, 14)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  embark_town  889 non-null    object  
 12  alive        891 non-null    object  
 13  alone        891 non-null    bool    
dtypes: bool(2), category(1), float64(2), int64(4), object(5)
memory usage: 79.4+ KB


### **Check for Missing Data**

In [24]:
df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
embark_town      2
alive            0
alone            0
dtype: int64

In [25]:
df = df.dropna()
df.isnull().sum()

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64

In [26]:
df.shape

(712, 14)

In [33]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 0 to 890
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   survived     712 non-null    int64  
 1   pclass       712 non-null    int64  
 2   sex          712 non-null    float64
 3   age          712 non-null    float64
 4   sibsp        712 non-null    int64  
 5   parch        712 non-null    int64  
 6   fare         712 non-null    float64
 7   embarked     712 non-null    float64
 8   class        712 non-null    float64
 9   adult_male   712 non-null    bool   
 10  embark_town  712 non-null    float64
 11  alive        712 non-null    float64
 12  alone        712 non-null    bool   
dtypes: bool(2), float64(7), int64(4)
memory usage: 68.1 KB


## **mRMR**: minimum Redundancy - Maximum Relevance
It's important to apply feature encoding before passing in the features and the target

In [70]:
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
# create some pandas data
X = df.drop("who", axis=1)
y = df["who"]

# # Feature Encoding
for key,val in X.items():
  datatype = X.loc[:,key].dtypes
  if datatype == "object":
    function_X = (X.groupby(key).size()) / len(X)
    X[key] = X[key].apply(lambda i:function_X[i])
  elif datatype == "timedelta64[ns]":
    X[key] = X[key].dt.components['minutes']
  elif datatype == "category":
    ordinal_encoder = OrdinalEncoder()
    X[key] = ordinal_encoder.fit_transform(X[key].values.reshape(-1, 1))

# Label Encoding
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

In [38]:
# select top 5 features using mRMR for classification
from mrmr import mrmr_classif
selected_features_classification = mrmr_classif(X=X, y=y, K=5)

# select top 5 features using mRMR for regression
y = df["age"]
from mrmr import mrmr_regression
selected_features_regression = mrmr_regression(X=X, y=y, K=5)

print(f"\nFeatures for Classification (Label 'who') : {selected_features_classification}\n")
print(f"Features for Regression (Label 'age') : {selected_features_regression}\n")

  0%|          | 0/5 [00:00<?, ?it/s]

100%|██████████| 5/5 [00:00<00:00, 35.06it/s]
100%|██████████| 5/5 [00:00<00:00, 28.98it/s]


Features for Classification (Label 'who') : ['adult_male', 'sex', 'age', 'sibsp', 'alive']

Features for Regression (Label 'age') : ['class', 'sibsp', 'adult_male', 'pclass', 'parch']






## **Boruta**
It's important to apply feature encoding before passing in the features and the target

In [71]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy

# load X and y
# NOTE BorutaPy accepts numpy arrays only, hence the .values attribute
X = X.values
y = y
y = y.ravel()

# define random forest classifier, with utilising all cores and
# sampling in proportion to y labels
rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)

# define Boruta feature selection method
feat_selector = BorutaPy(rf, n_estimators=5, verbose=0, random_state=1)

# find all relevant features - 5 features should be selected
feat_selector.fit(X, y)

# check selected features - first 5 features are selected
feat_selector.support_

# check ranking of features
feat_selector.ranking_

# call transform() on X to filter it down to selected features
X_filtered = feat_selector.transform(X)

selected_features = df.drop("who", axis=1).columns[feat_selector.support_].to_list()[:5]
rejected_features = df.drop("who", axis=1).columns[feat_selector.support_weak_].to_list()
print('Selected Features:', selected_features)
print('Rejected Features:', rejected_features)

Selected Features: ['survived', 'sex', 'age', 'sibsp', 'parch']
Rejected Features: ['pclass']
