# Employee Satisfaction Analysis


In this notebook, we use a random forest classifier to analyze the 2021 Federal Employee Viewpoint Survey (FEVS). The aim is to predict from an employee's survey answers whether that employee would leave the job, and to guide the employer toward the areas where improvement matters most.

## I. Import Initial Survey Answers

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv(r'Datasets/2021_OPM_FEVS_PRDF.csv')


## II. Initial Cleaning/Pre-Processing

Before analyzing the data, we need to clean and pre-process it.

### 1. Cleaning the data

During this stage, we drop duplicate rows and rows that contain missing (NA) values.

In [3]:
import numpy as np

In [4]:
df = df.drop_duplicates()  # drop_duplicates returns a new DataFrame, so assign it back
df


Unnamed: 0,RandomID,agency,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,...,DRNO,DHISP,DDIS,DAGEGRP,DSUPER,DFEDTEN,DSEX,DMIL,DLEAVING,POSTWT
0,112970976817,XX,5.0,5.0,5.0,5.0,5,4,4,4,...,A,B,B,B,B,A,A,A,A,2.209652
1,194868625278,XX,3.0,2.0,4.0,3.0,4,2,4,2,...,,A,B,B,B,B,A,A,C,2.209652
2,152966380283,XX,5.0,5.0,4.0,4.0,3,4,5,4,...,B,B,B,B,A,B,A,B,C,1.858874
3,193041162980,XX,5.0,5.0,5.0,5.0,5,5,5,5,...,B,B,A,B,B,B,A,A,A,1.228573
4,146655962451,XX,4.0,5.0,5.0,4.0,4,3,5,4,...,B,B,B,B,B,B,A,A,A,1.735842
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
292515,154057939422,ST,3.0,4.0,4.0,3.0,4,4,2,4,...,B,B,B,A,A,A,B,B,A,3.004992
292516,151758964104,ST,3.0,4.0,2.0,4.0,2,2,3,3,...,B,B,B,A,A,A,B,B,D,4.427855
292517,143492802997,ST,3.0,3.0,,4.0,4,4,4,3,...,,,,,,A,A,,A,4.202227
292518,110267537558,ST,2.0,4.0,5.0,5.0,2,1,5,4,...,B,B,B,A,A,A,B,B,C,3.523113


In [5]:
df_c  = df.copy()
# Collect the columns stored as strings (dtype "object"); the survey items among them should be numeric.
objects = [column for column, is_type in (df_c.dtypes == "object").items() if is_type]
# The 'agency' column is not used as a feature, so it is left out of the list of columns to convert.
objects.remove('agency')

In [6]:
df_c = df_c.dropna()
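
The two cleaning steps can be illustrated on a toy frame. This is only a sketch with made-up values (the column names are borrowed from the survey, the data are not); it shows that both `drop_duplicates` and `dropna` return new DataFrames rather than modifying the original in place.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey data (values are invented for illustration).
toy = pd.DataFrame({
    "Q1": [5.0, 5.0, 3.0, np.nan],
    "DSEX": ["A", "A", "B", "A"],
})

deduped = toy.drop_duplicates()  # new frame without the repeated row; `toy` is unchanged
cleaned = deduped.dropna()       # drops every row with at least one missing value

print(len(toy), len(deduped), len(cleaned))  # prints: 4 3 2
```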

### 2. Pre-Processing Data

Next, we replace the unanswered items with the mode (the most frequent value) of each column.

In [7]:
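for i in objects:
    # 'X' is treated as "no answer": recode it (and any remaining NaN) as 0
    df_c[i] = df_c[i].replace(['X'], np.nan)
    df_c[i] = df_c[i].replace(np.nan, 0)
# Cast the listed columns to float, except the last nine, which stay as strings
for i in objects[:-9]:
    df_c[i] = df_c[i].astype(float)

# Replace the 0 placeholder with each column's most frequent value
df_ca = df_c.copy()
for i in df_ca.columns.values:
    df_ca[i] = df_ca[i].replace(0, df_ca[i].value_counts().idxmax())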
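
The mode replacement can be sketched on a small example. The toy values below are invented, and the sketch makes one deliberate change that is an assumption on our part, not the notebook's code: it computes the mode over the answered values only. In the cell above the mode is taken over all values including the 0 placeholder, so a column where 0 is the most frequent value would be left unchanged.

```python
import pandas as pd

# Toy data: 0 stands for "no answer" (values are invented for illustration).
toy = pd.DataFrame({
    "Q1": [4.0, 4.0, 0.0, 2.0],
    "DSEX": ["A", 0, "B", "B"],
})

imputed = toy.copy()
for col in imputed.columns:
    is_missing = imputed[col] == 0
    # Most frequent value among the answered rows only
    mode = imputed.loc[~is_missing, col].value_counts().idxmax()
    # Overwrite the placeholder rows with that value
    imputed[col] = imputed[col].mask(is_missing, mode)

print(imputed)
```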

Now we encode the demographic columns as numbers so that they can also serve as features in our analysis.

In [8]:
# Binary target: 1 if the DLEAVING answer is B, C, or D; 0 otherwise
df_ca['leaveYN'] = df_ca['DLEAVING'].isin(['B', 'C', 'D']).astype(int)

df_rf = df_ca.iloc[:, 2:69].copy()

# Letter-to-code mapping for each demographic column; any other value is coded 0
demo_codes = {
    'DRNO': {'B': 1, 'C': 2, 'D': 3},
    'DHISP': {'B': 1},
    'DDIS': {'B': 1},
    'DAGEGRP': {'B': 1},
    'DSUPER': {'B': 1},
    'DFEDTEN': {'B': 1, 'C': 2},
    'DSEX': {'B': 1},
    'DMIL': {'B': 1},
}
for col, codes in demo_codes.items():
    df_rf[col] = df_ca[col].map(codes).fillna(0).astype('float')

## III. Exploratory Data Analysis

In [9]:
import matplotlib.pyplot as plt
import seaborn as sns

### 1. Correlation Analysis

### 2. Agency specific Analysis

### 3. PCA and Factor Analysis

## IV. Classification with Random Forest

In [11]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

First, we tune the hyperparameters of the random forest, using the results of our factor analysis.

In [12]:
from Data_Analysis.Tune import RFtune

In [13]:
bestF, X_test, y_test = RFtune(depth = range(1,2), split=[2], leaf = range(1,2), estimators=[50], df = df_rf, y = df_ca.leaveYN)

Fitting 3 folds for each of 1 candidates, totalling 3 fits
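
The source of `RFtune` is not shown in this notebook. Judging only from its arguments, its three return values, and the "Fitting 3 folds" message, it plausibly wraps a train/test split and scikit-learn's `GridSearchCV`. The function below, `rf_tune_sketch`, is a hypothetical stand-in written under that assumption; the name, the `test_size` value, and the stratified split are our guesses, and the real helper appears to set further options (the printed estimator shows `max_features=9`) that are omitted here. It runs on synthetic data, not the survey.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split


def rf_tune_sketch(depth, split, leaf, estimators, df, y,
                   test_size=0.2, random_state=20):
    """Hypothetical stand-in for RFtune: hold out a test set, then grid-search a forest."""
    X_train, X_test, y_train, y_test = train_test_split(
        df, y, test_size=test_size, random_state=random_state, stratify=y
    )
    param_grid = {
        "max_depth": list(depth),
        "min_samples_split": list(split),
        "min_samples_leaf": list(leaf),
        "n_estimators": list(estimators),
    }
    search = GridSearchCV(
        RandomForestClassifier(random_state=random_state),
        param_grid,
        cv=3,       # three folds, matching the log message above
        verbose=1,  # prints the "Fitting ... folds" line
    )
    search.fit(X_train, y_train)
    return search, X_test, y_test


# Synthetic data so the sketch runs on its own
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(10)])
best, X_te, y_te = rf_tune_sketch(range(1, 3), [2], range(1, 2), [20], X, pd.Series(y))
print(best.best_params_)
```

Note that the call in `In [13]` passes a single value for every hyperparameter, so the grid contains one candidate and no real tuning takes place; wider ranges such as `depth=range(1, 10)` would be needed for the search to compare alternatives.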


In [14]:
# print best parameter after tuning
print(bestF.best_params_)
  
# print how our model looks after hyper-parameter tuning
print(bestF.best_estimator_)


{'max_depth': 1, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}
RandomForestClassifier(max_depth=1, max_features=9, n_estimators=50,
                       random_state=20)


In [15]:
grid_predictions = bestF.predict(X_test)
  
# print classification report
print(classification_report(y_test, grid_predictions))
print("Accuracy:",accuracy_score(y_test, grid_predictions))

              precision    recall  f1-score   support

           0       0.73      0.96      0.83     12655
           1       0.71      0.23      0.35      5915

    accuracy                           0.72     18570
   macro avg       0.72      0.59      0.59     18570
weighted avg       0.72      0.72      0.67     18570

Accuracy: 0.7239095315024232
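
Since the stated aim is to point the employer toward areas of improvement, one common way to read a fitted forest is through its `feature_importances_` attribute. The sketch below uses synthetic data and invented column names; applying it to this notebook would mean using `bestF.best_estimator_` and the columns of `df_rf`, which assumes that `RFtune` returns a fitted grid search trained on those columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the survey features (column names are illustrative only)
X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)
X = pd.DataFrame(X, columns=[f"Q{i}" for i in range(1, 7)])

forest = RandomForestClassifier(n_estimators=50, random_state=20).fit(X, y)

# Impurity-based importances sum to 1; larger values mean the feature drove more splits
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Any such ranking should be read alongside the report above: recall for class 1 is 0.23, so the model identifies fewer than a quarter of the employees labeled as leaving.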


## V. Conclusions
