## Dealing with a Large Number of Features

Human resources are critical to any organization. Organizations spend a huge amount of time and money hiring and nurturing their employees, so it is a significant loss when employees leave, especially key people. If HR can predict whether an employee is at risk of leaving, it can identify attrition risks early and either provide the support needed to retain that employee or hire preventively to minimize the impact on the organization.

This dataset is taken from Kaggle: https://www.kaggle.com/ludobenistant/hr-analytics

Fields in the dataset include:

- Employee satisfaction level
- Last evaluation
- Number of projects
- Average monthly hours
- Time spent at the company
- Whether they have had a work accident
- Whether they have had a promotion in the last 5 years
- Department
- Salary
- Whether the employee has left


In [1]:
import sklearn
print(sklearn.__version__)

1.0.2


In [2]:
import pandas as pd
import numpy as np

In [3]:
hr_df = pd.read_csv('https://drive.google.com/uc?export=download&id=1XwDeBvO2VtO7z6TXifQTDJfsDeU3Lw0x')


In [None]:
hr_df.head(10)

In [None]:
hr_df.shape

In [None]:
hr_df['left'].value_counts()

In [None]:
hr_df.info()

## Encoding Categorical variables

**Note**: `pd.get_dummies()` is used here for a quick demonstration of feature selection. In a real-world implementation, prefer scikit-learn's `OneHotEncoder`, which is fitted on the training data and can then apply the same encoding to new data.

In [None]:
# Note: the 'sales' column holds the employee's department.
encoded_hr_df = pd.get_dummies(hr_df,
                               columns=['Work_accident', 'promotion_last_5years', 'sales', 'salary'])

In [None]:
encoded_hr_df.info()

## Split Dataset

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train_df, test_df = train_test_split(encoded_hr_df,
                                     train_size = 0.8,
                                     random_state = 100)

In [None]:
x_features = list(train_df.columns)

In [None]:
x_features.remove('left')

## L1 Based Feature Selection

Linear models penalized with the L1 norm have sparse solutions: many of their estimated coefficients are exactly zero. When the goal is to reduce the dimensionality of the data before using another classifier, such a model can be combined with `SelectFromModel` to keep only the features with non-zero coefficients.
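A minimal sketch of that combination follows. It uses synthetic data from `make_classification` rather than the HR dataset, and the names `l1_model` and `selector` are chosen only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 20 features, 5 informative.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=0, n_clusters_per_class=1,
                           random_state=0)
X = StandardScaler().fit_transform(X)  # put features on a common scale

# L1-penalized logistic regression drives many coefficients to exactly zero.
l1_model = LogisticRegression(penalty="l1", C=0.05, solver="liblinear")

# For L1-penalized estimators, SelectFromModel keeps the features whose
# absolute coefficient exceeds a very small default threshold.
selector = SelectFromModel(l1_model)
X_selected = selector.fit_transform(X, y)

mask = selector.get_support()  # boolean mask over the original columns
print(f"Kept {X_selected.shape[1]} of {X.shape[1]} features")
```

The reduced matrix `X_selected` can then be passed to any other classifier.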

In [None]:
from sklearn.linear_model import LogisticRegression

- **C, default=1.0** - Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
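The effect of `C` on sparsity can be checked with a short experiment. This is a sketch on synthetic data, not the HR dataset, and the particular values of `C` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 20 features, 5 informative.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=0, n_clusters_per_class=1,
                           random_state=0)
X = StandardScaler().fit_transform(X)

# Fit the same L1-penalized model with strong, moderate and weak regularization
# and count how many coefficients end up exactly zero.
zero_counts = {}
for C in [0.001, 0.1, 10.0]:
    model = LogisticRegression(penalty="l1", C=C, solver="liblinear")
    model.fit(X, y)
    zero_counts[C] = int(np.sum(model.coef_[0] == 0))
    print(f"C={C}: {zero_counts[C]} of {X.shape[1]} coefficients are zero")
```

Smaller values of `C` should leave fewer non-zero coefficients, which is why `C` acts as the knob for how aggressively features are dropped.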


In [None]:
logreg = LogisticRegression( penalty = 'l1', C = .1, solver = 'liblinear' )

In [None]:
logreg.fit(train_df[x_features], train_df['left'])

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report( test_df['left'], 
                            logreg.predict(test_df[x_features])))

In [None]:
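# Coefficients are rounded to 2 decimals, so very small non-zero
# coefficients will also appear as 0.0 in this table.
l1_selection_df = pd.DataFrame( {"features": x_features,
                                 "coef": np.round(logreg.coef_[0], 2)} )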

In [None]:
l1_selection_df[l1_selection_df.coef == 0.0]

## Sequential Feature Selection

In [None]:
import sklearn
print(sklearn.__version__)

In [None]:
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

In [None]:
tree = DecisionTreeClassifier( max_depth = 10 )

In [None]:
sfs = SequentialFeatureSelector(tree, n_features_to_select=10)

In [None]:
sfs.fit(train_df[x_features], train_df['left'])

In [None]:
sfs_features = [feature for feature, selected in zip(x_features, sfs.support_) if selected]

In [None]:
sfs_features

## Embedded Methods

- Embedded methods perform feature selection as part of model training: the learning algorithm scores features with a statistical criterion such as information gain, and the features with the highest importance are kept.

- Unlike RFE, embedded methods do not repeatedly refit the model on shrinking feature sets.
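As a rough illustration of the points above, the sketch below fits a single decision tree once and keeps the features whose importance is at least the mean importance. The synthetic data and the `"mean"` threshold are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): 10 features, 3 informative.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# The tree computes impurity-based importances while it is being trained.
tree_model = DecisionTreeClassifier(max_depth=5, random_state=0)

# Keep the features whose importance is at least the mean importance.
selector = SelectFromModel(tree_model, threshold="mean")
selector.fit(X, y)

importances = selector.estimator_.feature_importances_
selected = np.flatnonzero(selector.get_support())
print("importances:", np.round(importances, 3))
print("selected column indices:", selected)
```

The model is trained only once here; the selection falls out of the importances computed during that single fit.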

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Note: despite its name, rf_reg is a single decision tree classifier.
rf_reg = DecisionTreeClassifier(max_depth = 5, criterion = 'gini')
rf_reg.fit(train_df[x_features], train_df['left'])

In [None]:
features_rf_imp = pd.DataFrame({"features": list(x_features),
                                "importance": rf_reg.feature_importances_})
features_rf_imp = features_rf_imp.sort_values("importance", ascending=False).reset_index(drop=True)
features_rf_imp

In [None]:
features_rf_imp['cumsum'] = features_rf_imp.importance.cumsum()
features_rf_imp
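One possible use of the cumulative sum is to keep the smallest set of top features that together reach a chosen share of the total importance. The sketch below uses made-up importance values and placeholder feature names, and the 0.85 cutoff is an arbitrary assumption.

```python
import pandas as pd

# Made-up importances with placeholder feature names (for illustration only).
imp = pd.DataFrame({
    "features": ["feature_a", "feature_b", "feature_c", "feature_d", "feature_e"],
    "importance": [0.5, 0.25, 0.125, 0.0625, 0.0625],
})
imp = imp.sort_values("importance", ascending=False).reset_index(drop=True)
imp["cumsum"] = imp["importance"].cumsum()

cutoff = 0.85  # hypothetical share of total importance to retain

# Count the rows that are still below the cutoff, then add the one row
# that crosses it.
n_keep = int((imp["cumsum"] < cutoff).sum()) + 1
top_features = imp["features"].iloc[:n_keep].tolist()
print(top_features)
```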


## Recursive Feature Elimination (RFE)

- RFE uses a machine learning algorithm as a black-box evaluator to find the best subset of features, so the result depends on the chosen estimator.
- It trains the model iteratively and, each time, removes the least important feature according to the model's weights (coefficients or feature importances).
- It is a multivariate method: it evaluates the relevance of several features considered jointly.
- When used as a ranker, the feature removed in each iteration is pushed onto a stack until all features have been tested.
- More than one feature can be removed at a single step for computational efficiency.
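The elimination loop described above can be sketched by hand. This is a simplified illustration of the idea on synthetic data, not scikit-learn's actual `RFE` implementation, and the names `remaining` and `eliminated` are chosen for this example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): 10 features, 3 informative.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

estimator = DecisionTreeClassifier(max_depth=5, random_state=0)
n_features_to_select = 5

remaining = list(range(X.shape[1]))  # column indices still in play
eliminated = []                      # removed features, earliest first

while len(remaining) > n_features_to_select:
    # Refit a fresh copy of the estimator on the surviving features only.
    model = clone(estimator).fit(X[:, remaining], y)
    # Drop the single least important feature (equivalent to step=1).
    weakest = int(np.argmin(model.feature_importances_))
    eliminated.append(remaining.pop(weakest))

print("kept:", remaining)
print("eliminated (in order):", eliminated)
```

Features eliminated earlier would receive a worse rank, which mirrors how a ranking can be read off the order of removal.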

In [None]:
from sklearn.feature_selection import RFE

In [None]:
rfe_selector = RFE(tree, 
                   n_features_to_select=5, 
                   step=1, 
                   verbose=1)
rfe_selector.fit(train_df[x_features], train_df['left'])

In [None]:
features_rfe = pd.DataFrame({"features": list(x_features),
                             "rank": rfe_selector.ranking_})
features_rfe.sort_values("rank", ascending=True)