# Predict HR Stay or Leave

Here, the [HR Analytics](https://www.kaggle.com/giripujar/hr-analytics) dataset by [Giri Pujar](https://www.kaggle.com/giripujar) is used to build a classifier that predicts whether an employee will stay or leave.

The company's employee data is an `imbalanced dataset`, and the goal is to predict which employees might stay or leave. `SMOTE` (synthetic minority oversampling technique), one of the most commonly used `oversampling` methods, is used to deal with the imbalance.
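
To make the idea behind `SMOTE` concrete, here is a minimal NumPy sketch of the interpolation step: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the line between them. The function name `smote_like` and the toy array are hypothetical, and this is a simplified illustration rather than the `imblearn` implementation used later in this kernel.

```python
import numpy as np


def smote_like(x_minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = x_minority.shape
    synthetic = np.empty((n_new, n_cols))

    for row in range(n_new):
        # Pick a random minority sample
        i = rng.integers(n_rows)
        # Distances from that sample to every minority sample
        dist = np.linalg.norm(x_minority - x_minority[i], axis=1)
        # Position 0 of the sorted order is the sample itself (distance 0), so skip it
        neighbours = np.argsort(dist)[1:k + 1]
        j = rng.choice(neighbours)
        # Random point on the segment between sample i and neighbour j
        gap = rng.random()
        synthetic[row] = x_minority[i] + gap * (x_minority[j] - x_minority[i])

    return synthetic


# Hypothetical minority-class rows with two features
toy_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
toy_synthetic = smote_like(toy_minority, n_new=6)
print(toy_synthetic)
```

Because every synthetic row is a blend of two real minority rows, the new points stay inside the region already covered by the minority class.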

![](https://media.giphy.com/media/l0DAI7ZQCXxSZzaO4/giphy.gif)

In [None]:
import itertools

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
from imblearn.over_sampling import SMOTE

from sklearn import linear_model
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import (
    GridSearchCV, StratifiedKFold, cross_val_score, learning_curve,
    train_test_split
)
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix, f1_score, log_loss,
    precision_score, recall_score, roc_curve, roc_auc_score, precision_recall_curve, 
    auc
)
from sklearn.pipeline import Pipeline

# Models
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression, SGDClassifier

from joblib import dump

In [None]:
# For seaborn colors
sns.set(style='whitegrid', color_codes=True)

In [None]:
# Loading the dataset
df = pd.read_csv('/kaggle/input/hr-analytics/HR_comma_sep.csv')
df.sample(5)

## Data preparation

In [None]:
df.info()

`df.info()` shows no missing data.

In [None]:
def plot_countplot(column, ax=None):
    with sns.axes_style('ticks'):
        sns.countplot(x=column, palette=sns.color_palette('rocket'), ax=ax)
        sns.despine(offset=6)

In [None]:
# Checking how imbalanced the dataset is

num_of_stay = round(len(df[df.left == 0]) / len(df) * 100, 2)
print(f'HR stay - {num_of_stay}%')
print(f'HR leave - {round(100 - num_of_stay, 2)}%')

plot_countplot(df.left)

`Nominal data` assigns names to each data point without placing it in some sort of order. For example, the results of a test could be each classified nominally as a **pass** or **fail**.

`Ordinal data` groups data according to some sort of ranking system: it orders the data. For example, test results could be grouped in descending order by grade: **A, B, C, D, E and F**

More on difference between `nominal data` and `ordinal data` 👉 [Source](https://sciencing.com/difference-between-nominal-ordinal-data-8088584.html)
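
As a small illustration of the difference, ordinal values can be mapped to integers that keep their order, while nominal values are usually expanded into one indicator column per category. The toy frame below is made up for this sketch; only the column names mirror the dataset.

```python
import pandas as pd

# Hypothetical rows; the column names mirror the dataset, the values are invented
toy = pd.DataFrame({
    'salary': ['low', 'high', 'medium'],
    'Department': ['sales', 'IT', 'hr'],
})

# Ordinal: the integers preserve the ranking low < medium < high
salary_order = {'low': 0, 'medium': 1, 'high': 2}
toy['salary_code'] = toy['salary'].map(salary_order)

# Nominal: one 0/1 column per category, with no implied order
dept_dummies = pd.get_dummies(toy['Department'], dtype=int)

print(toy)
print(dept_dummies)
```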

### Working with ordinal data like the salary column

In [None]:
replacement = {
    'low': 0, 
    'medium': 1, 
    'high': 2
}

df.salary = df.salary.apply(lambda x: replacement[x])
df.salary[:5]

### Working with nominal data like the department column

In [None]:
ohe = OneHotEncoder()

dept_ohe_df = pd.DataFrame(df.Department)
dept_ohe_df = pd.DataFrame(
    ohe.fit_transform(dept_ohe_df[['Department']]).toarray()
)

print(f'Unique Departments: {len(df.Department.unique())}')

In [None]:
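col_names = []
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
for col_name in ohe.get_feature_names_out():
    # Split on the first underscore only, so department names that
    # themselves contain an underscore are not truncated
    col_name = col_name.split('_', 1)[1]
    col_names.append(col_name)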

col_names


In [None]:
dept_ohe_df.columns = col_names
dept_ohe_df.head()

Removing one column from `dept_ohe_df` to avoid multicollinearity

In [None]:
dept_ohe_df = dept_ohe_df.drop(['IT'], axis='columns')
dept_ohe_df.head()
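
Why dropping one indicator column helps can be checked numerically. In the sketch below (toy data, hypothetical variable names), the full set of one-hot columns always sums to 1 per row, so together with an intercept column the design matrix is rank deficient; removing one indicator restores full column rank.

```python
import numpy as np

# Hypothetical one-hot encoding of a 3-category feature for 5 rows
dummies = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
], dtype=float)
intercept = np.ones((len(dummies), 1))

# All indicators plus an intercept: the indicator columns add up to the intercept
full = np.hstack([intercept, dummies])
# Same matrix with the first indicator dropped
reduced = np.hstack([intercept, dummies[:, 1:]])

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)

print(f'full:    {full.shape[1]} columns, rank {rank_full}')
print(f'reduced: {reduced.shape[1]} columns, rank {rank_reduced}')
```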

In [None]:
# Target column
y = df[['left']]
y.head()

In [None]:
# Adding the ohe results and removing `left` column

df = df.drop(['Department', 'left'], axis='columns')
df = pd.concat([dept_ohe_df, df], axis='columns')
df.head()

In [None]:
x = df.copy()
x.head()

## Modelling

In [None]:
# Scaling the dataset
for column in x.columns:
    x[column] = StandardScaler().fit_transform(x[column].values.reshape(-1, 1))
    
x.head()

### Balancing the unbalanced data

In [None]:
# Creating train and test datasets using x and y
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=10)

# Creating train and cross-validation datasets using the x_train and y_train
x_train, x_cv, y_train, y_cv = train_test_split(x_train, y_train, test_size=0.2, random_state=0)

print(f'Training set size: {len(x_train)}')
print(f'Validation set size: {len(x_cv)}')
print(f'Test set size: {len(x_test)}')

The train dataset is split into train and cross-validation sets before oversampling, so that `oversampled data does not bleed` into the data used for scoring in `cross_val_score`.
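
The leak is easy to see on toy data. The sketch below uses plain random duplication instead of `SMOTE` to keep it short (an assumption made for illustration; with `SMOTE` the leak is subtler, because synthetic rows are interpolated from rows that may end up in the held-out set). All helper names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x_toy = rng.normal(size=(200, 3))
y_toy = np.array([0] * 180 + [1] * 20)


def oversample(x, y, rng):
    """Duplicate random minority rows until both classes have the same size."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    keep = np.concatenate([majority, minority, extra])
    return x[keep], y[keep]


def split(x, y, rng, test_frac=0.3):
    """Shuffle the rows and cut them into train and test parts."""
    order = rng.permutation(len(x))
    cut = int(len(x) * (1 - test_frac))
    return x[order[:cut]], x[order[cut:]], y[order[:cut]], y[order[cut:]]


def shared_rows(a, b):
    """Count rows of b that also appear in a."""
    a_rows = {tuple(r) for r in a}
    return sum(tuple(r) in a_rows for r in b)


# Oversample first, then split: copies of the same row land on both sides
x_os, y_os = oversample(x_toy, y_toy, rng)
x_tr, x_te, _, _ = split(x_os, y_os, rng)
leaked_before = shared_rows(x_tr, x_te)

# Split first, then oversample only the training part: no overlap
x_tr, x_te, y_tr, y_te = split(x_toy, y_toy, rng)
x_tr, y_tr = oversample(x_tr, y_tr, rng)
leaked_after = shared_rows(x_tr, x_te)

print(f'Test rows also present in train (oversample, then split): {leaked_before}')
print(f'Test rows also present in train (split, then oversample): {leaked_after}')
```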

In [None]:
# Oversampling to balance the data

_smote = SMOTE(random_state=0)

sm_cols = x_train.columns

x_train, y_train = _smote.fit_resample(x_train, y_train)
x_train = pd.DataFrame(data=x_train, columns=sm_cols)
y_train = pd.DataFrame(data=y_train, columns=['left'])

# Check the class counts after oversampling
print(f'Length of oversampled data is {len(x_train)}')

print(f'Number of left no {len(y_train[y_train.left == 0])}')
print(f'Number of left yes {len(y_train[y_train.left == 1])}')

print(f'Proportion of left no data in oversampled data is {len(y_train[y_train.left == 0])/len(x_train)}')
print(f'Proportion of left yes data in oversampled data is {len(y_train[y_train.left == 1])/len(x_train)}')

### Feature Selection

In [None]:
# Using Pearson Correlation

plt.figure(figsize=(22, 12))
cor = x.corr()
sns.heatmap(cor, annot=True, cmap=sns.cubehelix_palette(start=.5, rot=-.5, as_cmap=True))
plt.show()

In [None]:
# For cross validation
skf = StratifiedKFold(n_splits=10, random_state=0, shuffle=True)

In [None]:
col_names.remove('IT') # since IT is dropped
col_names

In [None]:
x_cv = np.array(x_cv)  # keep the scaled float values; casting to int would truncate them
y_cv = np.array(y_cv).ravel()  # 1-D labels, as scikit-learn and SMOTE expect

In [None]:
models = [
    LogisticRegression(), 
    SGDClassifier(), 
    KNeighborsClassifier(), 
    GaussianNB(), 
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    SVC(),
]

In [None]:
def cross_val_score(models, x_cv, y_cv):
    for model in models:
        scores = []
        for train, test in skf.split(x_cv, y_cv):
            x_train, x_test = x_cv[train], x_cv[test]
            y_train, y_test = y_cv[train], y_cv[test]

            _smote = SMOTE(random_state=0)
            x_train_sm, y_train_sm = _smote.fit_resample(x_train, y_train)

            model.fit(x_train_sm, y_train_sm)

            score = model.score(x_test, y_test)
            scores.append(score)

        print(f'== {model} ==')
        # Average over all folds, not just the last fold's score
        print(f'Cross-Validation mean-score: {np.mean(scores)}')
        print()


cross_val_score(models, x_cv, y_cv)

### Recursive Feature Elimination

In [None]:
rfe = RandomForestClassifier()

rfe = RFE(rfe, n_features_to_select=5)
rfe.fit(x_train, y_train.values.ravel())

selector = rfe.support_

print(rfe.support_)
print(rfe.ranking_)

Here we used a RandomForestClassifier with 5 features and RFE gave the feature ranking above, but the choice of 5 was arbitrary. Now we need to find the optimum number of features, the one for which the accuracy is highest. We do that with a loop that starts with 1 feature and goes up to 17 (the upper bound used in the code below), and then keep the number for which the accuracy is highest.

In [None]:
len(x.columns)

In [None]:
def rfe(model, x_cv, y_cv):
    # number of features
    nof_list = np.arange(1, 17 + 1)
    high_score = 0

    # variable to store the optimum features
    nof = 0
    score_list = []

    for n in range(len(nof_list)):
        x_train, x_test, y_train, y_test = train_test_split(
            x_cv, y_cv, test_size=0.3, random_state=0
        )

        _smote = SMOTE(random_state=0)
        x_train_sm, y_train_sm = _smote.fit_resample(x_train, y_train)

        rfe = RFE(model, n_features_to_select=nof_list[n])
        x_train_rfe = rfe.fit_transform(x_train_sm, y_train_sm)
        x_test_rfe = rfe.transform(x_test)

        model.fit(x_train_rfe, y_train_sm)

        score = model.score(x_test_rfe, y_test)
        score_list.append(score)

        if score > high_score:
            high_score = score
            nof = nof_list[n]

    return (nof, high_score)


nof, high_score = rfe(RandomForestClassifier(), x_cv, y_cv)

print("Optimum number of features: %d" % nof)
print("Score with %d features: %f" % (nof, high_score))
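
As a cross-check on a manual search like this, scikit-learn also provides `RFECV`, which runs recursive feature elimination with cross-validation and picks the number of features itself. The sketch below uses synthetic data and hypothetical variable names, and unlike the loop above it does not apply `SMOTE` inside the folds, so treat it as an illustration of the API rather than an equivalent procedure.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic data with 8 features, 3 of them informative
x_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)

toy_selector = RFECV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    step=1,  # remove one feature per elimination round
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring='accuracy',
)
toy_selector.fit(x_toy, y_toy)

toy_best_n = toy_selector.n_features_
print(f'Number of features chosen by RFECV: {toy_best_n}')
print(f'Selected mask: {toy_selector.support_}')
```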

As the code above shows, the optimum number of features is `nof`. We now pass `nof` as the number of features to RFE and get the final set of features selected by the RFE method, as follows:

### Performing Feature Elimination

In [None]:
model = RandomForestClassifier()

rfe = RFE(model, n_features_to_select=nof)
rfe.fit(x_train, y_train.values.ravel())

selector = rfe.support_

print(rfe.support_)
print(rfe.ranking_)

# support_ is a boolean mask over all features; count only the True entries
num_of_selected_features = int(rfe.support_.sum())
print(f'\nNumber of features selected: {num_of_selected_features}')

In [None]:
# Selected features

col = (x_train.columns)
result = itertools.compress(col, selector)

col_names = []
for c in result:
    col_names.append(c)
    print(c)

In [None]:
x_train = x_train[col_names]
x_test = x_test[col_names]

len(col_names)

In [None]:
# Implementing the model

cols = col_names.copy()

x_train = x_train[cols]
y_train = y_train['left']

logit_model = sm.Logit(y_train, x_train)

result = logit_model.fit()

print(result.summary2())

Every feature that we got from `Recursive Feature Elimination` is kept, since no feature's `p-value is greater than 0.05`.

### Creating the model

In [None]:
def rt_param_selection(x, y, nfolds):
    criterion = ['gini', 'entropy']
    # 'auto' was removed in scikit-learn 1.3; for classifiers it was an alias of 'sqrt'
    max_features = ['sqrt', 'log2']
    param_grid = {'criterion': criterion, 'max_features': max_features}

    # Try every combination in the grid with the given cross-validation splitter
    grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=nfolds)
    grid_search.fit(x, y)
    return grid_search.best_estimator_


skf = StratifiedKFold(n_splits=10)
best_estimator_ = rt_param_selection(x_train, y_train, skf)
best_estimator_

In [None]:
# The helper defined earlier loops over a list of models, so wrap the estimator in a list
cross_val_score([best_estimator_], x_cv, y_cv)

In [None]:
# Plotting learning curve

_size = np.arange(0.01, 1.01, 0.01)
train_sizes = np.array(_size)
scoring = 'neg_mean_squared_error'

train_sizes_abs, train_scores, cv_scores = learning_curve(
    RandomForestClassifier(criterion='entropy'), 
    x_train, y_train, 
    train_sizes=train_sizes, cv=skf, scoring=scoring
)

In [None]:
# Average over the folds (one row per training size) and flip the sign,
# since the scorer returns the negative mean squared error
train_scores_mean = -train_scores.mean(axis=1)
cv_scores_mean = -cv_scores.mean(axis=1)

In [None]:
f, ax = plt.subplots(figsize=(10, 5))

ax.plot(train_sizes_abs, train_scores_mean, label='Train')
ax.plot(train_sizes_abs, cv_scores_mean, label='Cross Validation')

ax.legend()

In [None]:
# Fitting the model
model = best_estimator_
model.fit(x_train, y_train)

## Evaluation

In [None]:
y_test_pred = model.predict(x_test)
print(y_test_pred)
print(f"\nPrediction: \n{pd.DataFrame(y_test_pred)[0].value_counts()}")

In [None]:
print(y_test.values.reshape(1, -1)[0])
print()
print(f"Actual: \n{pd.DataFrame(y_test)['left'].value_counts()}")

In [None]:
y_test_prob = model.predict_proba(x_test)
y_test_prob

In [None]:
print(f'Model Score: {model.score(x_test, y_test)}')
print(f'f1-score: {f1_score(y_test, y_test_pred, average="weighted")}')
print(f'precision score: {precision_score(y_test, y_test_pred, average="weighted")}')
print(f'recall score: {recall_score(y_test, y_test_pred, average="weighted")}')

In [None]:
def plot_confusion_matrix(
    cm, classes, normalize=False, title='Confusion matrix', cmap=plt.cm.Blues
):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(
            j,
            i,
            format(cm[i, j], fmt),
            horizontalalignment="center",
            color="white" if cm[i, j] > thresh else "black"
        )

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')


print(confusion_matrix(y_test, y_test_pred, labels=[1, 0]))

In [None]:
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_test_pred, labels=[1,0])
np.set_printoptions(precision=2)


# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['Leave=1','Stay=0'], normalize= False,  title='Confusion matrix')

In [None]:
print(classification_report(y_test, y_test_pred))

In [None]:
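# Use the predicted probabilities for both the curve and the area under it
model_probs = model.predict_proba(x_test)[:, 1]
rf_roc_auc = roc_auc_score(y_test, model_probs)
fpr, tpr, thresholds = roc_curve(y_test, model_probs)

plt.figure()
# The fitted model is a random forest, so label it accordingly
plt.plot(fpr, tpr, label='Random Forest (area = %0.2f)' % rf_roc_auc)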
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

# The farther the blue curve is from the red dashed diagonal, the better the model
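
One detail worth knowing when reading an AUC value: `roc_auc_score` gives different results depending on whether it receives predicted probabilities or hard 0/1 predictions, because hard predictions collapse the curve to a single operating point. A small sketch with made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities of the positive class
toy_true = np.array([0, 0, 0, 1, 1, 1])
toy_scores = np.array([0.1, 0.2, 0.6, 0.4, 0.7, 0.9])

# AUC from the probabilities: uses the full ranking of the samples
auc_from_probs = roc_auc_score(toy_true, toy_scores)

# AUC from hard predictions at a 0.5 cut-off: only one threshold is represented
toy_hard = (toy_scores > 0.5).astype(int)
auc_from_labels = roc_auc_score(toy_true, toy_hard)

print(f'AUC from probabilities: {auc_from_probs:.3f}')
print(f'AUC from hard labels:   {auc_from_labels:.3f}')
```

The probability-based value is the one that matches the area under a curve drawn with `roc_curve` from the same probabilities.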

### Pipeline

In [None]:
scaling = ('scale', StandardScaler())
model = ('model', RandomForestClassifier(criterion='entropy'))

# Steps in the pipeline
steps = [scaling, model]

pipe = Pipeline(steps=steps)

# Fitting the model
model = pipe.fit(x_train, y_train)

# Out-Of-Sample Forecast
y_test_pred = model.predict(x_test)

# Evaluation
print(f'Model Score: {model.score(x_test, y_test)}')
print(f'f1-score: {f1_score(y_test, y_test_pred, average="weighted")}')
print(f'precision score: {precision_score(y_test, y_test_pred, average="weighted")}')
print(f'recall score: {recall_score(y_test, y_test_pred, average="weighted")}')

### Precision-Recall vs Threshold Chart

In [None]:
log_pred_y = model.predict(x_test) 
log_probs_y = model.predict_proba(x_test) 

precision, recall, thresholds = precision_recall_curve(y_test, log_probs_y[:, 1]) 
pr_auc = auc(recall, precision)

plt.title("Precision-Recall vs Threshold Chart")
plt.plot(thresholds, precision[: -1], "b--", label="Precision")
plt.plot(thresholds, recall[: -1], "r--", label="Recall")
plt.ylabel("Precision, Recall")
plt.xlabel("Threshold")
plt.legend(loc="lower left")
plt.ylim([0,1])

### Controlling the probability threshold above which a prediction is considered positive

In [None]:
THRESHOLD = 0.45
preds = np.where(model.predict_proba(x_test)[:,1] > THRESHOLD, 1, 0)

results_data = [
    accuracy_score(y_test, preds), 
    recall_score(y_test, preds), 
    precision_score(y_test, preds), 
    f1_score(y_test, preds), 
    roc_auc_score(y_test, preds)
]
results_indexes = ["accuracy", "recall", "precision", "f1_score", "roc_auc_score"]
results = pd.DataFrame(data=results_data, index=results_indexes)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, preds, labels=[1,0])
np.set_printoptions(precision=2)


# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['Leave=1','Stay=0'], normalize= False,  title='Confusion matrix')

print(results)
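
The trade-off behind the threshold can be seen on a handful of made-up predictions: lowering the threshold flags more samples as positive, which tends to raise recall and lower precision. The helper `precision_recall_at` and the toy arrays are hypothetical and only serve this illustration.

```python
import numpy as np


def precision_recall_at(y_true, probs, threshold):
    """Precision and recall when probs above the threshold count as positive.

    Assumes at least one sample is predicted positive and one is truly positive.
    """
    preds = (probs > threshold).astype(int)
    tp = np.sum((preds == 1) & (y_true == 1))
    fp = np.sum((preds == 1) & (y_true == 0))
    fn = np.sum((preds == 0) & (y_true == 1))
    return tp / (tp + fp), tp / (tp + fn)


# Hypothetical labels and predicted probabilities of leaving
toy_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
toy_probs = np.array([0.1, 0.3, 0.45, 0.6, 0.4, 0.55, 0.7, 0.9])

for toy_threshold in (0.5, 0.35):
    p, r = precision_recall_at(toy_true, toy_probs, toy_threshold)
    print(f'threshold={toy_threshold}: precision={p:.2f}, recall={r:.2f}')
```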

In [None]:
# Saving the model
dump(model, 'model.joblib')
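
For completeness, a saved model can be restored with `joblib.load`. The sketch below does a round trip with a tiny stand-in classifier and a temporary file, so it runs without the data from this kernel; the stand-in model and the file name are assumptions made for the example.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.dummy import DummyClassifier

# Stand-in model that always predicts the most frequent training label
toy_model = DummyClassifier(strategy='most_frequent')
toy_model.fit([[0], [1], [2]], [0, 0, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy_model.joblib')
    dump(toy_model, toy_path)    # save to disk
    restored = load(toy_path)    # read it back

restored_pred = restored.predict([[5]])
print(restored_pred)
```

Loading should happen in an environment with compatible library versions, since the file stores the pickled estimator object.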

---

I'll wrap things up there. If you want to explore other questions, go ahead and `edit` this kernel. If you have any `questions`, do let me know.

If this kernel helped you then don't forget to 🔼 `upvote` and share your 🎙 `feedback` on improvements of the kernel.

![](https://media.giphy.com/media/qatu2fd5vCi7C/giphy.gif)

---