## Feature Selection

Feature selection is primarily focused on removing non-informative or redundant predictors from the model. When you are done creating hundreds of thousands of features, it’s time to select a few of them. Well, we should never create hundreds of thousands of useless features. Having too many features pose a problem well known as the curse of dimensionality. If you have a lot of features, you must also have a lot of training samples to capture all the features.

### Step for forwarding Selection
- Start with the empty feature set
- Try the remaining feature
- Estimate classification/regression error for adding each feature in the model
- Select a feature that gives maximum improvement
- Stop when there is no significant improvement

### Step for Backward Selection

- Start with a complete feature set
- Try the remaining feature
- Estimate classification/regression error for adding each feature in the model
- Drop feature that gives less improvement
- Stop when there is no significant improvement

### Methods/Technique

#### Univariate

- Pearson Correlation
- F-score
- Chi-square
- Signal to noise ratio

#### Multivariate

- Compute 'w' on all features
- Remove feature with smallest 'w'
- Recompute 'w' on reduced data

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv("salary.csv")
data.sample(10)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
8522,79,Private,333230,HS-grad,9,Married-spouse-absent,Prof-specialty,Not-in-family,White,Male,0,0,6,United-States,<=50K
21360,41,Private,126622,11th,7,Divorced,Handlers-cleaners,Unmarried,White,Female,0,0,40,United-States,<=50K
3269,21,?,224209,HS-grad,9,Married-civ-spouse,?,Wife,Black,Female,0,0,30,United-States,<=50K
24967,26,Private,151551,Some-college,10,Separated,Sales,Own-child,Amer-Indian-Eskimo,Male,2597,0,48,United-States,<=50K
27874,55,Private,119751,Masters,14,Never-married,Prof-specialty,Other-relative,Asian-Pac-Islander,Female,0,0,40,Thailand,<=50K
31373,50,Local-gov,231725,HS-grad,9,Married-civ-spouse,Handlers-cleaners,Husband,White,Male,0,0,40,United-States,<=50K
24205,69,Private,130413,Bachelors,13,Widowed,Exec-managerial,Not-in-family,White,Female,2346,0,15,United-States,<=50K
18924,19,?,133983,Some-college,10,Never-married,?,Own-child,White,Female,0,0,20,United-States,<=50K
1387,40,Self-emp-not-inc,266324,Some-college,10,Divorced,Exec-managerial,Other-relative,White,Male,0,1564,70,Iran,>50K
5651,62,Private,122246,Some-college,10,Never-married,Craft-repair,Not-in-family,White,Female,8614,0,39,United-States,>50K


In [3]:
data.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'salary'],
      dtype='object')

In [4]:
data.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
salary            0
dtype: int64

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  salary          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


### Data Preprocessing

In [6]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler

In [7]:
data.nunique()

age                  73
workclass             9
fnlwgt            21648
education            16
education-num        16
marital-status        7
occupation           15
relationship          6
race                  5
sex                   2
capital-gain        119
capital-loss         92
hours-per-week       94
native-country       42
salary                2
dtype: int64

In [8]:
categorial_data = []
numerical_data = []
for col in data.columns:
    if data[col].dtype == "O":
        categorial_data.append(col)
    else:
        numerical_data.append(col)

In [9]:
le = LabelEncoder()

In [10]:
for category in categorial_data:
    data[category] = le.fit_transform(data[category])

In [11]:
data.sample(3)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
8159,27,2,74056,9,13,4,13,1,4,1,0,0,50,39,0
697,31,4,118710,12,14,2,13,0,4,1,0,1902,40,39,1
32396,56,4,135458,11,9,0,13,1,2,0,0,0,40,39,0


In [12]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             32561 non-null  int64
 1   workclass       32561 non-null  int64
 2   fnlwgt          32561 non-null  int64
 3   education       32561 non-null  int64
 4   education-num   32561 non-null  int64
 5   marital-status  32561 non-null  int64
 6   occupation      32561 non-null  int64
 7   relationship    32561 non-null  int64
 8   race            32561 non-null  int64
 9   sex             32561 non-null  int64
 10  capital-gain    32561 non-null  int64
 11  capital-loss    32561 non-null  int64
 12  hours-per-week  32561 non-null  int64
 13  native-country  32561 non-null  int64
 14  salary          32561 non-null  int64
dtypes: int64(15)
memory usage: 3.7 MB


In [13]:
from sklearn.model_selection import train_test_split

In [14]:
X = data.drop('salary',axis=1)
y = data.salary

In [15]:
minmax = MinMaxScaler()
X = minmax.fit_transform(X)

## Feature Selection Methods

In [42]:
def model_train(X,y):
    from sklearn.linear_model import LogisticRegression
    x_train,x_test,y_train,y_test = train_test_split(X,y,train_size=0.78,random_state=42)
    model = LogisticRegression()
    model = model.fit(X,y)
    return model,x_test,y_test

In [43]:
def scoring(y_actual,y_pred):
    from sklearn.metrics import accuracy_score,roc_auc_score,precision_score
    print("Accuracy:",accuracy_score(y_pred=y_pred,y_true=y_actual))
    print("Precision:",precision_score(y_pred=y_pred,y_true=y_actual))
    print("ROC AUC Score:",roc_auc_score(y_true=y_actual,y_score=y_pred))

In [44]:
from sklearn.feature_selection import chi2,mutual_info_classif,f_classif,VarianceThreshold

In [45]:
from sklearn.feature_selection import SelectKBest,SelectPercentile

### 1. Variance Threshold

In [46]:
varThresh = VarianceThreshold()
tranform_data = varThresh.fit_transform(X)

In [47]:
tranform_data.shape

(32561, 14)

In [48]:
model,x_test,y_test = model_train(tranform_data,y)

In [49]:
predict = model.predict(x_test)

In [50]:
scoring(y_test,predict)

Accuracy: 0.8234226689000559
Precision: 0.7129455909943715
ROC AUC Score: 0.6929595815364497


### 2. Chi-square

In [51]:
selectK = SelectKBest(chi2,k=5) 
#k is hyperparameter should be equal more than 0 and less than n_features in the dataset

In [52]:
#in our case we have 14 features

In [53]:
selectK.fit(X,y)
x_trans = selectK.transform(X)

In [54]:
model2,x_test,y_test = model_train(x_trans,y)

In [55]:
predict2 = model2.predict(x_test)

In [56]:
scoring(y_test,predict2)

Accuracy: 0.8055555555555556
Precision: 0.7595541401273885
ROC AUC Score: 0.624877523449632


### 3. F score for classification

In [57]:
f_score_method = SelectKBest(f_classif,k=6) 

In [58]:
f_score_method.fit(X,y)
x_f_score = f_score_method.transform(X)

In [59]:
model3,x_test,y_test = model_train(x_f_score,y)

In [60]:
predict3 = model3.predict(x_test)

In [61]:
scoring(y_test,predict3)

Accuracy: 0.8218872138470128
Precision: 0.7152575315840622
ROC AUC Score: 0.6871725344833388


### 4. Mutual Information for classification

In [62]:
mutu_method = SelectKBest(mutual_info_classif,k=8) 

In [63]:
mutu_method.fit(X,y)
x_mut = mutu_method.transform(X)

In [64]:
model4,x_test,y_test = model_train(x_mut,y)

In [65]:
predict4 = model4.predict(x_test)

In [66]:
scoring(y_test,predict4)

Accuracy: 0.818676716917923
Precision: 0.7034883720930233
ROC AUC Score: 0.6830701109139948


Now you know how to do Feature Selection. Read this article to understand when and how to choose [Feature Selection](https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/)