## TABLE OF CONTENTS
>• Introduction.<br>
• Exploratory Data Analysis / Descriptive Analysis.<br>
• Inferential Analysis.<br>

# INTRODUCTION
***
> ## Data Introduction
This dataset contains information about employees who have left or stayed at a company. Employees are among a company's most important assets: good employees can bring innovation, increase profits and help the business grow, so retaining them matters. In this notebook I analyze the data to find insights that can help the company understand its employees and prevent them from leaving, and I build a model to predict whether an employee will leave based on several factors. The dataset includes the following information:<br>
1.Satisfaction_level - employee satisfaction level ranging from 0 to 1.<br>
2.Last_Evaluation - last employee evaluation by their supervisor ranging from 0 to 1.<br>
3.Number_project - total of projects completed.<br>
4.Average_montly_hours - average working time per month.<br>
5.Time_spend_company - years spent in the company.<br>
6.Work_Accident - whether the employee has ever had a work accident.<br>
7.Left - employee status (left or stayed).<br>
8.Promotion_last_5_years - whether the employee was promoted in the past 5 years.<br>
9.Department - employee's working department.<br>
10.Salary - employee's salary.

>## Objective
1.Find valuable insights or information in the dataset.<br>
2.Create a good machine learning model to predict whether an employee will leave or not.

# Data Preparation (Import libraries, data cleaning & data wrangling)
***

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.stats.weightstats import ztest as ztest
from treeinterpreter import treeinterpreter as ti
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import gaussian_kde
import matplotlib_inline
#matplotlib_inline.backend_inline.set_matplotlib_formats('svg') 
sns.set()

In [2]:
df = pd.read_csv('HR_comma_sep.csv')
df.shape

FileNotFoundError: [Errno 2] No such file or directory: 'HR_comma_sep.csv'

In [None]:
#checking data type and null values
df.info() 

In [None]:
#convert data types, rename columns and drop duplicates.
df = df.convert_dtypes(convert_floating=False, convert_integer=False)
df.rename(columns = {'satisfaction_level':'satisfaction', 'last_evaluation':'evaluation', 'number_project':'projects','average_montly_hours':'hours','time_spend_company':'timespend','Work_accident':'accident','promotion_last_5years':'promotion','sales':'department'}, inplace=True)
df.drop_duplicates(inplace=True)

> All column values appear to be valid.

In [None]:
dfnumeric = df.select_dtypes(include=['int64','float64'])
dfcat = df.drop(columns = dfnumeric.columns)

In [None]:
# Check for outliers in the first five numeric columns (one boxplot per axis).
fig, axarr = plt.subplots(1,5, figsize=(10, 4))
for ax, col in zip(axarr, dfnumeric.columns):
    ax.boxplot(dfnumeric[col])
    ax.set_xlabel(col)
plt.suptitle("Outliers checking on numeric columns")
fig.tight_layout(pad=1)
plt.show()

> There are outliers in the timespend column, but they are valid data: only a few employees out of thousands have actually worked at the company for more than 5 years.
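
To make the boxplot reading concrete, here is a minimal sketch of the 1.5 × IQR rule that boxplots conventionally use to draw outlier points. The `years` values are hypothetical and not taken from the dataset; the sketch only shows how a long tenure ends up flagged even though it is a perfectly valid record.

```python
import statistics

# Hypothetical years-at-company values, NOT taken from the dataset.
years = [2, 3, 3, 3, 4, 4, 5, 6, 8, 10]

# Quartiles with linear interpolation between data points.
q1, _, q3 = statistics.quantiles(years, n=4, method="inclusive")
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [y for y in years if y < lower_fence or y > upper_fence]

print(q1, q3, outliers)
```

A flagged point is only unusual relative to the bulk of the data, so whether to drop it is a judgment call about validity, not something the rule decides.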

# Exploratory Data Analysis
***
## Data Overview:

In [None]:
# Share of employees who stayed (0) and left (1), in percent.
df['left'].value_counts(normalize=True).sort_index() * 100

>About 17% of employees left the company, i.e. a turnover ratio of about 0.17.

In [None]:
dfnumeric.describe()

In [None]:
x = pd.DataFrame([dfnumeric[dfnumeric.left == 1].describe().loc['mean',:],dfnumeric[dfnumeric.left == 0].describe().loc['mean',:]]).transpose()
x.columns = ['Mean-Left','Mean-Stay']
x

>The mean values of satisfaction, accident and promotion differ considerably between the left and stay groups. I will explore them later.

In [None]:
dfcat.describe()

In [None]:
dfnumeric

In [None]:
df.groupby('department')['satisfaction'].agg(lambda x: x.count()).sort_values(ascending=False).plot(kind='bar',figsize=(7,4),rot=45,color='mediumseagreen',width=0.35)
plt.title("Total employee per department", fontweight='bold', size=14)
plt.show()

>Sales, technical, support and IT are the largest departments. The remaining departments are roughly equal in size.

In [None]:
# Plot employee counts (left vs stay) per department.
fig, axarr = plt.subplots(1,2, figsize=(10, 4))
df.groupby('department')[['left']].agg(Left = ('left', lambda x: x.sum()), Stay = ('left',lambda x: (x == 0).sum())).plot(kind='barh', ax=axarr[0])
axarr[0].set_title("Total employee attrition", fontweight='bold', size=11)
axarr[0].set(ylabel = None)
df.groupby('department')[['left']].agg(Left = ('left', lambda x: x.sum() * 100 / x.count()), Stay = ('left',lambda x: (x == 0).sum() * 100 / x.count())).plot(kind='barh',rot=True, ax=axarr[1])
axarr[1].set_title("Employee attrition comparison by %", fontweight='bold', size=11)
axarr[1].set(ylabel = None)
axarr[1].set_xticks(range(0,90,10))
fig.tight_layout(pad=1)

>As the right-hand chart shows, the attrition rate is fairly similar across all departments, at around 15%.

In [None]:
sns.pairplot(df, hue='left')

>From the pair plot above, satisfaction, evaluation, projects and hours look like the most interesting features. I will analyze them in depth later.

## Correlation

In [None]:
sns.heatmap(df.drop(columns=['department','salary']).corr().round(2), annot=True)
plt.show()

>As the heatmap above shows, evaluation, projects and hours have a weak positive correlation with each other. One plausible reading is that more projects take more time and tend to come with better evaluations, although correlation alone does not establish this. There is also a negative correlation between satisfaction and left: the lower the satisfaction level, the more likely the employee is to quit.
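
As a reminder of what each heatmap cell contains, here is a small sketch of the Pearson correlation coefficient computed by hand (Pearson is the default method of pandas `corr()`). The `satisfaction` and `left` values are made up for illustration, and `pearson_r` is a hypothetical helper, not part of the notebook.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values: low satisfaction paired with left == 1.
satisfaction = [0.9, 0.8, 0.4, 0.1]
left = [0, 0, 1, 1]

r = pearson_r(satisfaction, left)
print(round(r, 3))  # negative: lower satisfaction goes with leaving
```

A value near -1 or 1 means a strong linear relationship, a value near 0 a weak one. It says nothing about which variable drives the other.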

## Satisfaction analysis
First, let's make a function to plot distribution.

In [None]:
def pdf_plot(df,colom,x):
    # Mean and median of the column being plotted (not always satisfaction).
    means = df[colom].mean()
    medians = df[colom].median()
    data = df[df.left == 1][colom]
    data1 = df[df.left == 0][colom]
    
    kde = gaussian_kde(data)
    kde1 = gaussian_kde(data1)
    dist_space = np.linspace( min(data), max(data), 200)
    dist_space1 = np.linspace( min(data1), max(data1), 200)
    axarr[x].plot( dist_space, kde(dist_space), label='Left', color='orange' )
    axarr[x].plot( dist_space1, kde1(dist_space1), label='Stay', color='blue')
    axarr[x].axvline(x = means, linestyle = '--', color='g', label='Mean')
    axarr[x].axvline(x = medians, linestyle = '--', color='r', label='Median')
    axarr[x].set_title('Probability', fontweight='bold', size=12)
    axarr[x].set(ylabel = 'Probability', xlabel = colom)
    axarr[x].legend()

In [None]:
fig, axarr = plt.subplots(1,2, figsize=(12, 4))
g = sns.histplot(data=df, x = 'satisfaction',hue='left', ax=axarr[0], multiple='stack')
axarr[0].set_title('Distribution', fontweight='bold', size=12)
pdf_plot(df,'satisfaction',1)
plt.show()

>From the plot above alone, we can see that employees who stay have higher satisfaction levels. But is it certain that satisfaction level affects employee attrition?

>Let's run a hypothesis test on satisfaction between the left and stay groups.<br>
I will conduct a two-tailed z-test at the 95% confidence level (significance level 0.05).<br>
Ho : There is no significant difference in mean satisfaction between the left and stay groups.<br>
Ha : There is a significant difference in mean satisfaction between the left and stay groups.
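
To show what the test computes, here is a dependency-free sketch of a two-sided, two-sample z-test. It assumes a pooled variance estimate, which to my understanding is also the default in the statsmodels `ztest` function, but treat that as an assumption. `two_sample_ztest` and both sample lists are hypothetical, and four values per group is far too few for a real z-test; they only keep the arithmetic visible.

```python
import math
import statistics

def two_sample_ztest(a, b):
    """Two-sided z-test for equal means using a pooled variance estimate."""
    n_a, n_b = len(a), len(b)
    pooled_var = (
        (n_a - 1) * statistics.variance(a) + (n_b - 1) * statistics.variance(b)
    ) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    z = (statistics.mean(a) - statistics.mean(b)) / std_err
    # Two-sided tail area of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Made-up satisfaction scores for the two groups.
left_group = [0.1, 0.2, 0.3, 0.4]
stay_group = [0.6, 0.7, 0.8, 0.9]

z, p_value = two_sample_ztest(left_group, stay_group)
print(z, p_value)
```

Ho is rejected when `p_value` falls below the 0.05 significance level.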

In [None]:
print(f"Z score, P value : {ztest(df[df.left == 1]['satisfaction'],df[df.left == 0]['satisfaction'])}")

>Overall the p-value is very small, so we can reject the null hypothesis. Let's also run the z-test per department.

In [None]:
y = ['sales','technical','support','IT','RandD','product_mng','marketing','accounting','hr','management']
dfleft = df[df.left == 1]
dfnotleft = df[df.left == 0]
m = []
for x in y:
    # Index [1] of the ztest result is the p-value.
    a = ztest(dfleft[dfleft.department == x]['satisfaction'], dfnotleft[dfnotleft.department == x]['satisfaction'])[1]
    m.append(a)
b = pd.DataFrame({'department': y, 'P-val': m})
b["SigniLevel"] = 0.05
# Compare the p-value itself (not a truncated integer) with the significance level.
b["Status"] = np.where(b["P-val"] < b["SigniLevel"], "Reject Ho Hypothesis", "Do not Reject Ho Hypothesis")
b

>>We ran the z-test for every department, and in each one the p-value is below the significance level. We can therefore reject the null hypothesis in every department and conclude that satisfaction level is significantly associated with employee attrition.

## Evaluation, projects, hours vs satisfaction
As discussed above, evaluation, projects and hours seem to be correlated. First, let's look at the distribution of each of them.

In [None]:
fig, axarr = plt.subplots(1,2, figsize=(12, 4))
g = sns.histplot(data=df, x = 'evaluation',hue='left', ax=axarr[0], multiple='stack')
axarr[0].set_title('Distribution', fontweight='bold', size=12)
pdf_plot(df,'evaluation',1)
plt.show()

In [None]:
fig, axarr = plt.subplots(1,2, figsize=(12, 4))
g = sns.histplot(data=df, x = 'projects',hue='left', ax=axarr[0], binwidth=0.15, multiple='stack')
df.groupby('projects')['left'].agg(lambda x: x.sum() * 100 / x.count()).plot(kind='bar', width=0.3,rot=True, color='purple', ax=axarr[1])
axarr[0].set_title('Distribution', fontweight='bold', size=12)
axarr[1].set_title('Left probability', fontweight='bold', size=12)
plt.show()

In [None]:
fig, axarr = plt.subplots(1,2, figsize=(12, 4))
g = sns.histplot(data=df, x = 'hours',hue='left', ax=axarr[0], multiple='stack')
axarr[0].set_title('Distribution', fontweight='bold', size=12)
pdf_plot(df,'hours',1)
plt.show()

In [None]:
fig, axarr = plt.subplots(1, 3, figsize=(10.3, 4))
axarr[0].scatter(df['projects'], df['satisfaction'], c=df['left'] == 1, cmap='coolwarm', alpha=0.5, s=3)
axarr[0].set_xlabel('projects')
axarr[0].set_ylabel('satisfaction')
axarr[0].set_title("Satisfaction vs Projects")

axarr[1].scatter(df['evaluation'], df['satisfaction'], c=df['left'] == 1, cmap='coolwarm', alpha=0.5, s=3)
axarr[1].set_xlabel('evaluation')
axarr[1].set_ylabel('satisfaction')
axarr[1].set_title("Satisfaction vs Evaluation")

axarr[2].scatter(df['hours'], df['satisfaction'], c=df['left'] == 1, cmap='coolwarm', alpha=0.5, s=3)
axarr[2].set_xlabel('hours')
axarr[2].set_ylabel('satisfaction')
axarr[2].set_title("Satisfaction vs Hours")

fig.tight_layout(pad=1)

>>Here are the interesting findings:<br>
>>1. Employees tend to leave when they have done very few projects or a lot of projects.<br>
2. Almost 100% of employees with a very low satisfaction level (below 0.1) left the company.<br>
3. 100% of employees who worked over 300 hours a month left. <br>
4. Some groups of employees leave the company when their evaluations or hours are low or high, even when their satisfaction levels are high.<br>
5. In short, some employees tend to leave the company because they are underworked or overworked, regardless of their satisfaction level.

## TimeSpendCompany Analysis

In [None]:
fig, axarr = plt.subplots(1,2, figsize=(12, 4))
g = sns.histplot(data=df, x = 'timespend', hue='left',ax=axarr[0], binwidth=0.15, color='orange', multiple='stack')
df.groupby('timespend')['left'].agg(lambda x: x.sum() * 100 / x.count()).plot(kind='bar', width=0.3,rot =True, color='purple', ax=axarr[1])
axarr[0].set_title('Distribution', fontweight='bold', size=12)
axarr[1].set_title('Left probability', fontweight='bold', size=12)
plt.show()

>Employees who have worked at the company for 5 years have the highest probability of leaving.

In [None]:
dict1 = {
    'low' : 1,
    'medium' : 2,
    'high' : 3
}
df['salary'] = df['salary'].map(dict1)
df.groupby('timespend')['salary'].agg('median')

In [None]:
fig, axarr = plt.subplots(2,2, figsize=(11, 7))
sns.barplot(data=df, hue='left', x='timespend', y='salary',ax = axarr[0,0])
sns.barplot(data=df, hue='left', x='timespend', y='hours',ax = axarr[0,1])
sns.barplot(data=df, hue='left', x='timespend', y='projects',ax = axarr[1,0])
sns.barplot(data=df, hue='left', x='timespend', y='evaluation',ax = axarr[1,1])
plt.show()

>> Salary seems fairly similar regardless of how many years an employee has worked, but the workload seems to increase for employees who have worked more than 3 years.

In [None]:
fig, axarr = plt.subplots(1,2,figsize=(10,3))
df[df.timespend < 5].groupby('department')['timespend'].mean().sort_values().plot(kind='barh', ax = axarr[0])
df[df.timespend >= 5].groupby('department')['timespend'].mean().sort_values().plot(kind='barh', ax = axarr[1], color ='c')
fig.tight_layout(pad=1)
plt.show()

>>For employees who have worked fewer than 5 years, the mean tenure is fairly similar across all departments (around 3 years). For employees who have worked 5 years or more, management employees are the most loyal (around 7 years), well ahead of every other department.

## Accident analysis

In [None]:
fig, axarr = plt.subplots(1,2,figsize=(9,4))
df.groupby('left')['accident'].agg(lambda x: x.sum() * 100 / x.count()).plot(kind='bar', ax = axarr[0], rot=True)
df.groupby('department')['accident'].agg(lambda x: x.sum() * 100 / x.count()).plot(kind='bar', ax = axarr[1], rot = 45, color='purple')
fig.tight_layout(pad=1)
plt.show()

>>It seems that employees who left have a lower accident rate. Also, at least 12.5% of employees in every department have had an accident at some point.

## Conclusion
>>Before we go any further, here's a recap:<br>
>>1. Employees with a very low satisfaction level (around 0.1) almost always leave. These employees also have a very high evaluation score and long working hours (low satisfaction level + overwork). <br>
2. Some employees tend to leave when they are overworked or underworked, regardless of their satisfaction level. <br>
3. Salary is fairly similar between junior and senior employees.<br>
4. Senior employees tend to work more than juniors.<br>
5. Senior employees tend to leave the company more often than junior employees.<br>
6. Very few employees in the data have worked at the company for more than 5 years. Maybe the company is still new or expanding, the data is incomplete, or employees don't like to stay longer than 5 years.<br>
7. Accidents are fairly common, but they do not appear to increase employee attrition.

In [None]:
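def teladan(x):
    # Return 1 when evaluation, projects and hours ALL fall outside the
    # interquartile range of their own column, otherwise return 0.
    x = list(x)
    k = 0
    if x[0] > df.evaluation.quantile(q=0.75) or x[0] < df.evaluation.quantile(q=0.25):
        k += 1
    # Compare projects (x[1]) against the projects quartiles on both sides.
    if x[1] > df.projects.quantile(q=0.75) or x[1] < df.projects.quantile(q=0.25):
        k += 1
    if x[2] > df.hours.quantile(q=0.75) or x[2] < df.hours.quantile(q=0.25):
        k += 1
    return 1 if k == 3 else 0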

In [None]:
df['teladan'] = df[['evaluation','projects','hours']].apply(teladan, axis=1)

In [None]:
df.teladan.value_counts()

In [None]:
df

# Stop

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

In [None]:
dfrun = df.drop(columns=['salary','department'])
dfrunp = dfrun.drop(columns='left')
dfrunt = dfrun['left'].copy()

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(dfrunp, dfrunt, test_size = 0.5, random_state=123)

In [None]:
clf = RandomForestClassifier(n_estimators=50)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

In [None]:
metrics.f1_score(y_test,y_pred)

In [None]:
metrics.accuracy_score(y_test,y_pred)

In [None]:
metrics.confusion_matrix(y_test,y_pred)

In [None]:
threshold_tuning(clf, X_test, y_test, np.arange(0.1,1.1,0.1), 20,1).sort_values('score')

In [None]:
param_grid = {
 'bootstrap': [True, False],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
 # 'auto' is no longer accepted by recent scikit-learn; 'sqrt' is its equivalent for classifiers.
 'max_features': ['sqrt', None],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [50, 100, 150, 200, 400, 600, 800, 1000]
}

In [None]:
randomcv = RandomizedSearchCV(RandomForestClassifier() , param_grid, cv=5, n_jobs=-1, n_iter=5, scoring='f1')
randomcv.fit(dfrunp , dfrunt)

In [None]:
pd.DataFrame(randomcv.cv_results_)

In [None]:
pd.DataFrame(randomcv.cv_results_)['params'].value_counts()

In [None]:
clf1 = RandomForestClassifier(n_estimators= 50, min_samples_split= 5, min_samples_leaf= 2, max_features= None, max_depth= 70, bootstrap= False)
clf1.fit(X_train, y_train) #mean f1 0.9271

In [None]:
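# 'sqrt' replaces 'auto', which recent scikit-learn versions no longer accept (same behavior for classifiers).
clf2 = RandomForestClassifier(n_estimators= 400, min_samples_split= 10, min_samples_leaf= 2, max_features= 'sqrt', max_depth= 30, bootstrap= False)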
clf2.fit(X_train, y_train) #mean f1 0.950215

In [None]:
clf = RandomForestClassifier(n_estimators=5)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

In [None]:
def threshold_tuning(model, xtest, ytest, thres, fnvalues, fpvalues):
    # For each threshold, predict "stay" (0) when P(stay) > threshold, else "left" (1),
    # then record the metrics and a cost score = FN * fnvalues + FP * fpvalues.
    recall = []
    precision = []
    f1_score = []
    accuracy_score = []
    score = []
    # Use the model passed in (not the global clf); column 0 is P(stay).
    proba_stay = model.predict_proba(xtest)[:, 0]
    for k in thres:
        prediction = np.where(proba_stay > k, 0, 1)
        recall.append(metrics.recall_score(ytest, prediction))
        precision.append(metrics.precision_score(ytest, prediction))
        f1_score.append(metrics.f1_score(ytest, prediction))
        accuracy_score.append(metrics.accuracy_score(ytest, prediction))
        cm = metrics.confusion_matrix(ytest, prediction)  # [[TN, FP], [FN, TP]]
        score.append(cm[1][0] * fnvalues + cm[0][1] * fpvalues)
    result = pd.DataFrame([thres,recall,precision,f1_score,accuracy_score,score]).transpose()
    result.columns = ['threshold','recall','precision','f1-score','accuracy-score','score']
    return result
    

In [None]:
threshold_tuning(clf2, X_test, y_test, np.arange(0.1,1.1,0.1), 20,1).sort_values('score')
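
The `score` column weighs the two kinds of mistakes differently: each missed leaver (false negative) costs `fnvalues`, each false alarm (false positive) costs `fpvalues`. Here is a minimal pure-Python sketch of that idea; the labels, probabilities and the helper `cost_at_threshold` are hypothetical and only mirror the scoring rule, not the notebook's models.

```python
def cost_at_threshold(y_true, proba_stay, threshold, fn_cost, fp_cost):
    """Cost of predicting 'stay' (0) only when P(stay) > threshold."""
    y_pred = [0 if p > threshold else 1 for p in proba_stay]
    # False negative: employee left (1) but we predicted stay (0).
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # False positive: employee stayed (0) but we predicted left (1).
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return fn * fn_cost + fp * fp_cost

# Made-up labels (1 = left) and predicted probabilities of staying.
y_true = [1, 1, 0, 0, 0, 1]
proba_stay = [0.2, 0.6, 0.9, 0.55, 0.8, 0.4]

costs = {
    t: cost_at_threshold(y_true, proba_stay, t, fn_cost=20, fp_cost=1)
    for t in (0.3, 0.5, 0.7)
}
best = min(costs, key=costs.get)
print(costs, best)
```

With a missed leaver costing 20 times a false alarm, the cheapest threshold is the one that flags more employees as likely to leave, even though it produces a false alarm.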

In [None]:
threshold_tuning(clf1, X_test, y_test, np.arange(0.1,1.1,0.1), 10, 1).sort_values('score')

In [None]:
metrics.confusion_matrix(y_test, y.prediction)[1][0] * 1000000

In [None]:
df.sort_values('score', ascending=False)

In [None]:
metrics.confusion_matrix(y_test, y.prediction)

In [None]:
metrics.precision_score(y_test, y.prediction)

In [None]:
920 / (920 + 30)

In [None]:
y['prediction'] = y['%No'].apply(lambda x: 0 if x > 0.5 else 1 )
y

In [None]:
#metrics.plot_roc_curve(clf, X_test, y_test)

In [None]:
k = pd.Series(y_pred)
k.value_counts()

In [None]:
y_test.value_counts()

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
metrics.accuracy_score(y_test, y_pred)

In [None]:
metrics.precision_score(y_test, y_pred)

In [None]:
metrics.f1_score(y_test, y_pred)

In [None]:
#for k in clf.estimators_:
    #print (k.predict_proba(X_test.loc[[11938],:]))

In [None]:
m = X_test.loc[[153],:].copy()
# y_test is a Series, so it takes a single indexer (no column slice).
n = y_test.loc[[153]].copy()

In [None]:
y_test.loc[[153]]

In [None]:
0.9836557705136758
0.9869913275517012

In [None]:
df[df.timespend >= 5].groupby('department')['timespend'].mean().sort_values()

In [None]:
fig, axarr = plt.subplots(5, 2,figsize=(15,20))

In the satisfaction level distribution, employees who left show far more low values than high ones, which is the opposite of employees who stayed.

# Inferential Analysis

In [None]:
dfnew = df.copy()
dfnew.drop(columns='department',inplace=True)
# Encode salary as 1/2/3; values already encoded earlier in the notebook are kept as they are.
dfnew['salary'] = dfnew['salary'].map(lambda s: {'low': 1, 'medium': 2, 'high': 3}.get(s, s))
dfnew.corr()[['left']]

Above is the correlation table between the predictor variables and the response variable. The closer a value is to 0, the weaker the relationship; the further it is from 0, the stronger the relationship ( -1 < PV < 1). Satisfaction level is the predictor with the strongest relationship.

In [None]:
# Build the parameter grid for hyperparameter tuning.
param_grid = {
 'bootstrap': [True, False],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
 # 'auto' is no longer accepted by recent scikit-learn; 'sqrt' is its equivalent for classifiers.
 'max_features': ['sqrt', None],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [50, 100, 150, 200, 400, 600, 800, 1000]
}

In [None]:
# Use RandomizedSearchCV for hyperparameter tuning & model validation.
randomcv = RandomizedSearchCV(RandomForestClassifier() , param_grid, cv=5, n_jobs=-1, n_iter=5)
randomcv.fit(dfnew.drop(columns='left') , dfnew.left)

df2 = df[df.left == 1]
x = df2.groupby('timespend')[['satisfaction']].agg('median')
x.plot(kind='barh')
tampung = list(np.round(df2.groupby('timespend')[['satisfaction']].agg('median')['satisfaction'], decimals=2))
for index, value in enumerate(tampung):
    plt.text(value, index,
             str(value))
plt.show()

In [None]:
pd.DataFrame(randomcv.cv_results_)

The mean test score of every model is very good, in the range of 98% - 99%.
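
When reading an accuracy figure, it helps to know what a trivial model would score. The sketch below assumes the roughly 17% attrition rate reported in the data overview and uses 100 made-up employees, with a "model" that always predicts stay. It is only a yardstick, not a result from the notebook.

```python
# 100 hypothetical employees: 17 left (1), 83 stayed (0).
y_true = [1] * 17 + [0] * 83

# A trivial baseline that predicts "stay" for everyone.
y_pred = [0] * len(y_true)

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

# Recall on the "left" class: how many actual leavers were caught.
true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
recall = true_pos / (true_pos + false_neg)

print(accuracy, recall)
```

The baseline is right most of the time without catching a single leaver, which is why a class-sensitive metric such as recall or F1 is worth checking next to accuracy.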

In [None]:
# Take the random forest model with the best mean accuracy.
model = RandomForestClassifier(n_estimators=400,min_samples_split=10,min_samples_leaf=1,max_features=None,max_depth = 70, bootstrap=True) 

In [None]:
# Train the model.
model.fit(dfnew.drop(columns='left'), dfnew.left)

In [None]:
# The column name and the sort key must match, otherwise sort_values raises a KeyError.
pd.DataFrame(model.feature_importances_ * 100 , dfnew.drop(columns='left').columns , columns = ['Importance']).sort_values('Importance').plot(kind='barh',figsize=(10,5))
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.title("Importance of each predictor variable for the response variable", fontweight='bold')
plt.show()

The plot above shows the relationship between the predictor variables and the response variable. satisfaction_level is the factor that most determines whether an employee leaves or stays, while promotion_last_5years and salary have the least influence on whether an employee leaves.

In [None]:
# Function that uses treeinterpreter to explain how the model arrives at each prediction.
def interpreterr(data,rf):
    df1 = pd.DataFrame()
    df2 = pd.DataFrame()
    df3 = pd.DataFrame()
    for x in range(0,len(data)):
        # reshape(1, -1) keeps one row regardless of how many feature columns the data has.
        prediction, bias, contributions = ti.predict(rf, data.to_numpy()[x].reshape(1, -1))
        prediction = pd.DataFrame(prediction[0]).T[[1]]
        bias = pd.DataFrame(bias[0]).T[[1]]
        df1 = pd.concat([df1, pd.DataFrame(contributions[0])[[1]].T], axis=0,ignore_index=True)
        df2 = pd.concat([df2, prediction],axis=0,ignore_index=True)
        df3 = pd.concat([df3, bias],axis=0,ignore_index=True)
    df1 = df1 * 100
    df1.columns = data.columns
    df2.columns = ['predict']
    df3.columns = ['bias']
    df1['bias'] = df3.bias * 100
    df1['predict-to-left'] = df2.predict * 100
    return df1
        

In [None]:
a = interpreterr(dfnew.drop(columns='left').head(10), model)
a

The dataframe above shows how the prediction is built up for each employee. Prediction = bias + feature-1 + feature-2 + ... + feature-n. For example, in the row with index 3 the largest contribution comes from time_spend_company, at 40.6. We can conclude that for the employee at index 3, the main driver of the predicted resignation is time_spend_company.
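
The additive formula can be traced by hand for a single decision tree. In this sketch the bias is the probability at the root, and each split's contribution is the change in probability from parent node to child node, credited to the feature used in the split. This is my understanding of how such decompositions work, so treat it as an assumption. The path, the numbers and the helper `decompose` are hypothetical.

```python
def decompose(bias, path):
    """Split a tree prediction into per-feature contributions along one path."""
    contributions = {}
    parent_value = bias
    for feature, node_value in path:
        # Credit the change in P(left) to the feature used at this split.
        change = node_value - parent_value
        contributions[feature] = contributions.get(feature, 0.0) + change
        parent_value = node_value
    return contributions

# Hypothetical root probability and root-to-leaf path for one employee:
# (feature used at the split, P(left) in the node reached).
bias = 0.17
path = [("satisfaction", 0.60), ("hours", 0.95)]

contributions = decompose(bias, path)
prediction = bias + sum(contributions.values())
top_feature = max(contributions, key=contributions.get)

print(contributions, prediction, top_feature)
```

For a forest, the same decomposition is averaged over all trees, which is why each row of the dataframe sums to that employee's predicted probability.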

In [None]:
confidence = pd.DataFrame(([k.predict_proba(dfnew.drop(columns='left').values)[:,1] for k in model.estimators_])).T
dfnew['pred_mean'] = confidence.apply(axis=1 , func=lambda x: x.mean()).round(2).to_numpy() * 100
dfnew['pred_stdD'] = confidence.apply(axis=1 , func=lambda x: x.std()).round(3).to_numpy()
dfnew

Confidence is how much trust we can place in a prediction made by the model. A random forest prediction is the mean of the predictions of many decision trees. If the prediction is the mean of the trees' predictions, then the standard deviation of those predictions serves as the confidence measure: the higher the standard deviation, the lower the confidence.
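
A small sketch of this idea, using made-up per-tree probabilities for two hypothetical employees: the forest prediction is the mean across trees, and the spread across trees shows how much the trees agree. The sample standard deviation is used here because that is what pandas `std()` computes by default.

```python
import statistics

# Made-up P(left) from four trees for two hypothetical employees.
trees_a = [0.9, 1.0, 0.95, 1.0]   # trees agree
trees_b = [0.1, 0.9, 0.4, 1.0]    # trees disagree

mean_a = statistics.mean(trees_a)
mean_b = statistics.mean(trees_b)

# Sample standard deviation across the trees' predictions.
spread_a = statistics.stdev(trees_a)
spread_b = statistics.stdev(trees_b)

print(mean_a, spread_a)
print(mean_b, spread_b)
```

Both employees are predicted to leave if the cut-off is 0.5, but the second prediction rests on trees that disagree strongly, so it deserves less confidence.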

In [None]:
dfnew['pred_stdD'].plot(kind='hist')
plt.title("Variation / std deviation across decision trees", fontweight='bold')
plt.show()

The standard deviations are fairly small, which means the confidence of most predictions is fairly high.

In [None]:
# Plot feature importances per department.
fig, axarr = plt.subplots(5, 2,figsize=(15,18))
z = list(df['department'].value_counts().index)
salary_codes = {'low': 1, 'medium': 2, 'high': 3}
# axarr.flat walks the grid row by row, one axis per department.
for ax, x in zip(axarr.flat, z):
    dfnew1 = df[df['department'] == x].copy()
    dfnew1.drop('department', axis=1,inplace=True)
    # Encode salary from this department's own rows; already-encoded values are kept.
    dfnew1['salary'] = dfnew1['salary'].map(lambda s: salary_codes.get(s, s))
    model.fit(dfnew1.drop(columns='left'), dfnew1.left)
    pd.DataFrame(model.feature_importances_ * 100 , dfnew1.drop(columns='left').columns , columns = ['Importance']).sort_values('Importance').plot(kind='barh',ax = ax)
    ax.set_title(x, fontsize=12)
    ax.get_legend().remove()
plt.suptitle('Feature importance per department', fontsize=20, fontweight='bold')
plt.tight_layout()
plt.show()


In [None]:
# Plot the satisfaction distribution per department.
fig, axarr = plt.subplots(5, 2,figsize=(15,20))
z = list(df['department'].value_counts().index)
salary_codes = {'low': 1, 'medium': 2, 'high': 3}
# axarr.flat walks the grid row by row, one axis per department.
for ax, x in zip(axarr.flat, z):
    dfnew1 = df[df['department'] == x].copy()
    dfnew1.drop('department', axis=1,inplace=True)
    # Encode salary from this department's own rows; already-encoded values are kept.
    dfnew1['salary'] = dfnew1['salary'].map(lambda s: salary_codes.get(s, s))
    # The column was renamed from satisfaction_level to satisfaction during data preparation.
    sns.histplot(data=dfnew1, x = 'satisfaction', hue='left',ax = ax)
    ax.set_title(x, fontsize=15)
    ax.set(xlabel=None)
    ax.legend(['left','stay'],loc='upper left')
plt.suptitle('Satisfaction level per department', fontsize=23, fontweight='bold')
plt.tight_layout()
plt.show()
