 # **<span style="color:deepskyblue">TABLE OF CONTENTS</span>** 

**Importing Libraries** 

**Loading Data**

    * Checking basic information about the dataset
    * Checking missing values
    * Checking whether the dataset is balanced
    * Transform target to numeric values
    * Drop ID feature from the dataset
    
**EDA and Visualization**

    * Attrition Rate Visualization
    * Univariate Feature Analysis
    * Bivariate Feature Analysis
    * Multivariate Feature Analysis

**Data Preprocessing**
  
    * Defining Target and Independent Features
    * Remove Features with Zero Variance
    * Separate feature into numerical and categorical 
    * Outlier Analysis of Numerical Features
    
**Feature Selection - Numerical Features**

    * Checking correlation

**Feature Selection - Categorical Features**

    * K Best for Selecting Categorical Features
    
**Imbalanced Dataset to Balanced Dataset**

**Split the Dataset for Training & Testing**

**Data Transformation using Standardization**

**Model Building**

**Decision Tree Model**

**Hyper-Parameter Optimization using GridSearchCV for Decision Tree Model**

**K-Nearest Neighbours Classifier model**

**Random Forest Classifier**

**Hyper-Parameter Optimization using GridSearchCV for Random Forest Classifier**

**SVC Model**


 # **<span style="color:deepskyblue">Importing  Libraries</span>** 


In [2266]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# %matplotlib inline

import time

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split,GridSearchCV

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier 
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.svm import SVC

from sklearn.metrics import classification_report
from sklearn import metrics
from sklearn.metrics import confusion_matrix

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import warnings
warnings.filterwarnings('ignore')

# **<span style="color:deepskyblue">Loading Data</span>** 


In [2267]:
# reading data from file
df = pd.read_csv("/kaggle/input/hr-employee-attrition/HR-Employee-Attrition.csv")


In [2268]:
# checking shape of data
df.shape

In [2269]:
# reading 5 rows from data
df.head()

In [2270]:
# checking null values
df.isnull().sum()

In [2271]:
# checking types
df.dtypes

In [2272]:
# checking unique values
df.nunique()

In [2273]:
# Percentage of 'Attrition' unique value 
df.Attrition.value_counts()/len(df)*100

**Notable points about the dataset:**
1. There are 1470 observations and 35 features.
2. The dataset contains two types of data: object and integer.
3. There is no missing data.
4. Attrition is the dependent (target) variable; the other features are independent variables.
5. The dataset is imbalanced: in about 16% of cases the employee left the company, while in about 84% of cases the employee stayed.
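
The imbalance matters for how the models are judged later. As a minimal sketch, using toy labels in the 84/16 proportions quoted above (not the real rows), a classifier that always predicts "stay" already reaches 84% accuracy while finding none of the leavers, which is why precision and recall are worth reporting next to accuracy.

```python
# Toy labels in the proportions quoted above: 84 stayers (0) and 16 leavers (1).
y_toy = [0] * 84 + [1] * 16

# A trivial "model" that always predicts the majority class.
y_always_stay = [0] * len(y_toy)

# Accuracy: share of predictions that match the label.
accuracy = sum(t == p for t, p in zip(y_toy, y_always_stay)) / len(y_toy)

# Recall: share of actual leavers that were predicted as leavers.
recall = sum(t == 1 and p == 1 for t, p in zip(y_toy, y_always_stay)) / sum(y_toy)

print(accuracy, recall)   # 0.84 0.0
```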



# **<span style="color:deepskyblue">Transform target to numeric values (0,1)</span>** 

In [2274]:
# Encode the target 'Attrition' as numeric: 1 for 'Yes', 0 otherwise
df['Attrition'] = np.where(df['Attrition']=='Yes',1,0)
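
Note that `np.where` sends every value other than `'Yes'` to 0, including misspelled or missing labels. The sketch below uses a hypothetical toy Series (`attrition_demo`, not part of the dataset) to show a stricter alternative: `Series.map` with an explicit dictionary leaves unexpected labels as NaN so they can be spotted. Since this dataset has no missing values, both approaches should give the same result here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; the last label is deliberately malformed.
attrition_demo = pd.Series(['Yes', 'No', 'No', 'yes'])

# np.where silently maps anything that is not exactly 'Yes' to 0.
with_where = np.where(attrition_demo == 'Yes', 1, 0)

# An explicit mapping turns unknown labels into NaN instead.
with_map = attrition_demo.map({'Yes': 1, 'No': 0})

print(with_where.tolist())        # [1, 0, 0, 0]
print(int(with_map.isna().sum())) # 1
```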


# **<span style="color:deepskyblue">Drop ID Feature from the dataset</span>** 

In [2275]:
df=df.drop(['EmployeeNumber'],axis=1)

# **<span style="color:deepskyblue">EDA and Visualization</span>** 

In [2276]:
# color palette 
pal_2 = sns.color_palette("GnBu",n_colors=2)

In [2277]:
#  function for an annotated count plot
def count_Plot(feature, data,xl,yl,axs,hu=None):
    ax = sns.countplot(x=feature,palette=pal_2, data=data,hue=hu,ax=axs)
    for p in ax.patches:
        ax.annotate(f'\n{p.get_height()}',(p.get_x()+0.2,p.get_height()),  ha='center', va='center', size=18)
    axs.set(xlabel=xl, ylabel=yl)
    

In [2278]:
#  function for pie plot

def pie_plot(feature,data,xl,axs):
    # value_counts() sorts by frequency, so the majority class (0, stay) comes first
    co = data[feature].value_counts(normalize=True)
    labels = ['Employee Stay','Employee Quit']
    axs.pie(co, labels = labels, colors=pal_2,autopct='%.0f%%')
    axs.set(xlabel=xl)


# **<span style="color:deepskyblue">Attrition Rate Visualization</span>** 

In [2279]:
#  Attrition rate in dataset
fig, axes = plt.subplots(1,2, figsize=(16, 6))
count_Plot("Attrition",df,"Employee Attrition in number","amount",axes[0])
pie_plot('Attrition',df,"Employee Attrition in Percentage",axes[1])
# plt.show()

# **<span style="color:deepskyblue">Part 1 - Univariate Numerical Feature Analysis</span>**

In [2280]:
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
sns.distplot(x=df['DailyRate'], kde=True,ax=axes[0,0],axlabel='Dailyrate',color='lightseagreen')
sns.distplot(x=df['MonthlyIncome'], kde=True,ax=axes[0,1],axlabel='MonthlyIncome',color='steelblue')
sns.distplot(x=df['MonthlyRate'], kde=True,ax=axes[0,2],axlabel='MonthlyRate',color='lightseagreen')
sns.distplot(x=df['Age'], kde=True,ax=axes[1,0],axlabel='age',color='lightseagreen')
sns.distplot(x=df['DistanceFromHome'], kde=True,ax=axes[1,1],axlabel='DistanceFromHome',color='steelblue')
sns.distplot(x=df['YearsAtCompany'], kde=True,ax=axes[1,2],axlabel='YearsAtCompany',color='steelblue')
plt.show()

**Note: The plots above show that the MonthlyIncome, DistanceFromHome and YearsAtCompany features have outliers, which are handled in a later section of the notebook.**


# **<span style="color:deepskyblue"> Part 2 - Bivariate Categorical Feature Analysis</span>**

In [2281]:
# color_pall = sns.color_palette("GnBu",n_colors=4)
# color palette 
pal_7 = sns.color_palette("GnBu",n_colors=7)

In [2282]:
# selecting categorical features: more than 1 and fewer than 10 distinct values
cata=[]
feat = df.drop(['Attrition'],axis=1)
for colu in feat:
    pa = df[colu].nunique()
    if 1 < pa < 10:
        cata.append(colu)


In [2283]:

for col in cata:        
    fig, axes = plt.subplots(1,2, figsize=(16, 6))
    sns.barplot(x=col,y='Attrition',data=df,ax=axes[0],palette=pal_7)
    st ="No of Attrition in "+col
    count_Plot(col,df,st,"Count",axes[1],'Attrition')
plt.show()
    

**Notable points: the above analysis shows that**
* The attrition rate is almost equal for male and female employees in the Gender feature.
* The attrition rate is almost equal for ratings 3 and 4 in PerformanceRating.

Later, we will drop these two features.


# **<span style="color:deepskyblue"> Part 3 - Multivariate Feature Analysis</span>**

In [2284]:
plt.figure(figsize=(15,7))
sns.boxplot(x="Gender", y="MonthlyIncome", data=df,hue='Attrition',palette=pal_2)

In [2285]:
plt.figure(figsize=(15,7))
sns.boxplot(x="Gender", y="Age", data=df,hue='Attrition',palette=pal_2)


In [2286]:
df.groupby('Gender')['Attrition'].mean().to_frame()

In [2287]:
df.groupby(['Gender','Attrition'])['MonthlyIncome'].mean().to_frame()

In [2288]:
df.groupby(['Gender','Attrition'])['MonthlyIncome'].median().to_frame()

In [2289]:
df.groupby(['Gender','Attrition'])['Age'].mean().to_frame()

**Notable points:**

1. The above analysis shows that the attrition rate of male employees is slightly higher than that of female employees.
2. MonthlyIncome and Age are almost equal for both genders.

Therefore, we can conclude that Gender does not have much influence on employee attrition.

In [2290]:
#Plotting Age vs monthly income
plt.figure(figsize = (16,6))
sns.regplot(x= 'Age', y = 'MonthlyIncome' , data = df,color='lightseagreen')
plt.show()

"""There is a linear relation between Age and Monthly income"""

In [2291]:

# plt.figure(figsize = (16,6))
sns.jointplot(x='Age',y='MonthlyIncome',data=df,hue='Attrition',palette=pal_2)
plt.show()

In [2292]:
df.groupby(['Attrition'])['Age'].mean().to_frame()

In [2293]:
df.groupby(['Attrition'])['MonthlyIncome'].mean().to_frame()

**Notable point: the attrition rate is higher among employees who have a lower income and are younger.**

In [2294]:
plt.figure(figsize=(15,7))
sns.swarmplot( x="Department", y='MonthlyIncome',data=df,hue='Attrition',palette=pal_2)


In [2295]:
df.groupby('Department')['Attrition'].mean().to_frame()

In [2296]:
df.groupby(['Department','Attrition'])['MonthlyIncome'].mean().to_frame()

**Notable point: both the attrition rate and MonthlyIncome are high in the Sales department.**

In [2297]:
plt.figure(figsize=(15,7))
sns.barplot(x='Department',y='DistanceFromHome',data=df,hue='Attrition',palette=pal_2)

**Notable point: employees who live far from the company are more likely to leave it.**

In [2298]:
plt.figure(figsize=(15,7))
sns.swarmplot( x="Education", y='HourlyRate',data=df,hue='Attrition',palette=pal_2)


In [2299]:
df.groupby('Education')['Attrition'].mean().to_frame()

In [2300]:
df.groupby(['Education','Attrition'])['HourlyRate'].mean().to_frame()

**Notable point: the above analysis shows that employees at education level 5 are less likely to leave the company, and those who do leave have a high HourlyRate.**

In [2301]:
plt.figure(figsize=(15,7))
sns.violinplot( x="EducationField", y='DailyRate',data=df,hue='Attrition',palette=pal_2)


In [2302]:
df.groupby('EducationField')['Attrition'].mean().to_frame()

In [2303]:
df.groupby(['EducationField','Attrition'])['DailyRate'].mean().to_frame()

**Notable point: the above analysis shows that employees in the Human Resources and Technical Degree education fields are more likely to leave the company, and employees with a Technical Degree background who leave have a high DailyRate.**

In [2304]:
plt.figure(figsize=(15,7))
sns.swarmplot( x="JobLevel", y='YearsAtCompany',data=df,hue='Attrition',size=7,palette=pal_2)


In [2305]:
df.groupby('JobLevel')['Attrition'].mean().to_frame()

In [2306]:
df.groupby(['JobLevel','Attrition'])['YearsAtCompany'].median().to_frame()

**Notable point:**
* **Employees who have worked at the company for 2 years and are in JobLevel 1 are more likely to leave the company.**


In [2307]:
plt.figure(figsize=(15,7))
sns.boxplot(x='BusinessTravel',y='MonthlyIncome',data=df,hue='Attrition',palette=pal_2)


In [2308]:
df.groupby('BusinessTravel')['Attrition'].mean().to_frame()

In [2309]:
df.groupby(['BusinessTravel','Attrition'])['MonthlyIncome'].median().to_frame()

**Notable point: employees who have a low MonthlyIncome and travel frequently are more likely to leave the company.**

In [2310]:
plt.figure(figsize=(18,7))
sns.barplot(x='JobRole',y='DailyRate',data=df,hue='Attrition',palette=pal_2)

**Notable point: in each JobRole, employees who leave the company have a lower DailyRate than employees who stay.**

In [2311]:
plt.subplots(figsize=(15,7))
sns.boxplot(x='JobSatisfaction',
             y='MonthlyIncome',
             data=df,              
             hue='Attrition',palette=pal_2)
plt.show()

In [2312]:
df.groupby('JobSatisfaction')['Attrition'].mean().to_frame()

In [2313]:
df.groupby(['JobSatisfaction','Attrition'])['MonthlyIncome'].mean().to_frame()

**Notable point: employees with a low MonthlyIncome and JobSatisfaction level 1 are more likely to leave the company.**

In [2314]:
plt.subplots(figsize=(15,7))

sns.swarmplot(x='MaritalStatus',
             y='Age',
             data=df,
             hue='Attrition',
             size=8,palette=pal_2)
plt.show()

In [2315]:
df.groupby('MaritalStatus')['Attrition'].mean().to_frame()

In [2316]:
df.groupby(['MaritalStatus','Attrition'])['Age'].mean().to_frame()

In [2317]:
plt.subplots(figsize=(15,7))
sns.swarmplot(x='MaritalStatus',
             y='MonthlyIncome',
             data=df,
             hue='Attrition',
             size=8,palette=pal_2)
plt.show()

In [2318]:
df.groupby(['MaritalStatus','Attrition'])['MonthlyIncome'].mean().to_frame()

**Notable point: employees who are single and have a low MonthlyIncome are more likely to leave the company.**

# **<span style="color:deepskyblue"> Data Preprocessing</span>** 

# **<span style="color:deepskyblue">       Defining Target and Independent Features</span>** 

In [2319]:
Y=df[['Attrition']]
X=df.drop(['Attrition'],axis=1)

# **<span style="color:deepskyblue"> Remove Features with Zero Variance</span>** 

In [2320]:
def unique_level(m):
    m = m.value_counts().count()
    return m
feature_val_count= pd.DataFrame(X.apply(lambda m: unique_level(m)))

In [2321]:
feature_val_count.columns=['uni_level']
feat_level = feature_val_count.loc[feature_val_count['uni_level']>1]
feat_level_index = feat_level.index
X = X.loc[:,feat_level_index]
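
The same zero-variance filter can be sketched more compactly with `DataFrame.nunique()`. The frame below is a hypothetical toy example (column names are made up); like `value_counts().count()`, `nunique()` counts distinct non-null values per column.

```python
import pandas as pd

# Hypothetical toy frame: 'const_num' and 'const_txt' never vary.
toy = pd.DataFrame({
    'age': [25, 32, 47, 51],
    'const_num': [80, 80, 80, 80],
    'const_txt': ['Y', 'Y', 'Y', 'Y'],
    'dept': ['Sales', 'HR', 'Sales', 'R&D'],
})

# Number of distinct non-null values in each column.
levels = toy.nunique()

# Keep only the columns that take more than one value.
kept = toy.loc[:, levels > 1]

print(list(kept.columns))   # ['age', 'dept']
```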


# **<span style="color:deepskyblue"> Separate feature into numerical and categorical </span>** 

In [2322]:
num = X.select_dtypes(include='number')
char = X.select_dtypes(include='object')

In [2323]:
feature_level_val = pd.DataFrame(num.nunique())
feature_level_val.columns = ['unique_level']


In [2324]:
# Numeric columns with at most 20 distinct values are treated as categorical
cat_feat = feature_level_val[feature_level_val['unique_level']<=20]
cat_fet_index = cat_feat.index
cat_column = num.loc[:,cat_fet_index]
cat_column.columns

In [2325]:
# Numeric columns with more than 20 distinct values stay numerical
num_feat = feature_level_val[feature_level_val['unique_level']>20]
num_fet_index = num_feat.index
numerical = num.loc[:,num_fet_index]
numerical.columns

In [2326]:
# All categorical columns after separating (object columns plus low-cardinality numeric columns)
catagorical = pd.concat([char,cat_column],axis=1,join='inner')

# **<span style="color:deepskyblue"> Outlier Analysis of Numerical Features </span>** 

In [2327]:
numerical.describe(percentiles=[0.01,0.05,0.10,0.25,0.50,0.75,0.85,0.88,0.9,0.99])

# **<span style="color:deepskyblue"> Capping and Flooring of outliers </span>** 

In [2328]:
def outlier_cap(x):
    x=x.clip(lower=x.quantile(0.01))
    x=x.clip(upper=x.quantile(0.99))
    return(x)

In [2329]:
numerical=numerical.apply(lambda x : outlier_cap(x))
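
To make the effect of capping and flooring concrete, here is a minimal sketch on a hypothetical toy column. The helper `cap_at_quantiles` is an illustrative name, not part of the notebook; it computes both bounds first and clips once, which should give the same result as the two-step `outlier_cap` above.

```python
import numpy as np
import pandas as pd

def cap_at_quantiles(s, lower_q=0.01, upper_q=0.99):
    # Compute both bounds from the original values, then clip once.
    lo, hi = s.quantile(lower_q), s.quantile(upper_q)
    return s.clip(lower=lo, upper=hi)

# Hypothetical toy column: the values 0..99 plus one extreme value.
income = pd.Series(np.append(np.arange(100), 10_000), dtype=float)
capped = cap_at_quantiles(income)

# The extreme value is pulled down to the 99th percentile,
# and the smallest value is raised to the 1st percentile.
print(income.max(), capped.max())   # 10000.0 99.0
print(income.min(), capped.min())   # 0.0 1.0
```

Only the tails change: values between the two percentiles pass through untouched, and the number of rows stays the same.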

In [2330]:
numerical.describe(percentiles=[0.01,0.05,0.10,0.25,0.50,0.75,0.85,0.9,0.99])

# **<span style="color:deepskyblue"> Feature Selection - Numerical Features  </span>** 

In [2331]:
# Checking correlation between numerical features
plt.figure(figsize=(18,18))
cmap =sns.color_palette("GnBu",n_colors=6)
cor =numerical.corr()
sns.heatmap(cor,annot=True,vmax=0.8,cmap=cmap,fmt='.2f',linecolor='green',linewidths=0.7,square=True)

In [2332]:
#  function for finding correlated features
def correlMatrix(data,thres):
    correlated_features = set()
    correlation_matrix = data.corr()
    # walk the lower triangle so that each pair is checked once
    for i in range(len(correlation_matrix.columns)):
        for j in range(i):
            if abs(correlation_matrix.iloc[i, j]) > thres:
                # flag the later column of the pair; the earlier one is kept
                colname = correlation_matrix.columns[i]
                correlated_features.add(colname)
    return correlated_features
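
A small usage sketch may help show which column of a correlated pair gets flagged. The data is synthetic and the function is restated under a hypothetical name (`correlated_columns`) so the block runs on its own; the logic mirrors `correlMatrix`.

```python
import numpy as np
import pandas as pd

def correlated_columns(data, thres):
    # Same logic as correlMatrix: scan the lower triangle of the
    # correlation matrix and flag the later column of each pair.
    corr = data.corr()
    flagged = set()
    for i in range(len(corr.columns)):
        for j in range(i):
            if abs(corr.iloc[i, j]) > thres:
                flagged.add(corr.columns[i])
    return flagged

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': 2 * a + rng.normal(scale=0.01, size=200),  # almost a copy of 'a'
    'c': rng.normal(size=200),                      # unrelated noise
})

print(correlated_columns(toy, 0.75))   # {'b'}
```

Because only the later column is flagged, `'a'` survives and `'b'` is dropped, so the outcome depends on column order.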

In [2333]:
#  show correlated columns where the absolute correlation is above 0.75
correlated_feat = correlMatrix(numerical,0.75)
correlated_feat


In [2334]:
# drop correlated features
numerical.drop(correlated_feat,axis=1,inplace=True)


# **<span style="color:deepskyblue"> Feature Selection - Categorical Features   </span>** 

In [2335]:
# Dropping the Gender and PerformanceRating features (see the bivariate analysis above)
catagorical.drop(columns=['Gender','PerformanceRating'],inplace=True)

In [2336]:
# converting all columns to object dtype so that get_dummies encodes every one of them
catagorical = catagorical.astype(object)

In [2337]:
catagorical.head()

In [2338]:
# Create dummy features with n-1 levels for all categorical columns
catag_dum = pd.get_dummies(catagorical,drop_first = True)
catag_dum.shape

In [2339]:
# K Best for Selecting Categorical Features using k=20
from sklearn.feature_selection import SelectKBest, chi2
selector = SelectKBest(chi2, k=20)
selector.fit_transform(catag_dum, Y)

cols = selector.get_support(indices=True)
X_ca= catag_dum.iloc[:,cols]
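
To illustrate what `SelectKBest` with `chi2` keeps, here is a minimal sketch on hypothetical toy dummy columns (the names `strong`, `noise` and `mid` are made up). The chi-squared score is higher when a dummy column is switched on unevenly across the two classes, and `get_support()` reports which columns made the cut.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy target and dummy features.
y_demo = pd.Series([0, 0, 0, 0, 1, 1, 1, 1])
dummies_demo = pd.DataFrame({
    'strong': [0, 0, 0, 0, 1, 1, 1, 1],  # identical to the target
    'noise':  [0, 1, 0, 1, 0, 1, 0, 1],  # evenly spread over both classes
    'mid':    [0, 0, 0, 1, 1, 1, 1, 1],  # mostly follows the target
})

# Fit the selector and keep the 2 highest-scoring columns.
selector_demo = SelectKBest(chi2, k=2).fit(dummies_demo, y_demo)
scores = pd.Series(selector_demo.scores_, index=dummies_demo.columns)
selected = list(dummies_demo.columns[selector_demo.get_support()])

print(selected)   # ['strong', 'mid']
```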

In [2340]:
# joining all independent features
X_all=pd.concat([numerical,X_ca],axis=1,join='inner')

In [2341]:
#  independent feature shape
X_all.shape

# **<span style="color:deepskyblue"> Imbalanced dataset to balanced dataset   </span>**

In [2342]:
from imblearn.over_sampling import  RandomOverSampler

ros = RandomOverSampler(sampling_strategy='minority',random_state=1)

X_s,Y_s = ros.fit_resample(X_all, Y)
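
Because the oversampling above runs before the train/test split, copies of the same minority row can end up in both sets, which tends to make test scores look better than they would on unseen data. The sketch below shows the alternative order on hypothetical toy data: split first, then oversample only the training part. It uses plain NumPy resampling as a stand-in for `RandomOverSampler`, so it is an illustration of the idea rather than the notebook's method.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Hypothetical toy data: 100 rows, 20% positives.
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 20 + [0] * 80)

# 1) Split first, stratified so that both parts keep the 20% rate.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=1)

# 2) Oversample the minority class inside the training part only.
minority_idx = np.flatnonzero(y_tr == 1)
majority_idx = np.flatnonzero(y_tr == 0)
extra = rng.choice(minority_idx,
                   size=len(majority_idx) - len(minority_idx),
                   replace=True)
keep = np.concatenate([majority_idx, minority_idx, extra])
X_tr_bal, y_tr_bal = X_tr[keep], y_tr[keep]

# Training data is balanced; the test data keeps the original rate.
print(y_tr_bal.mean(), y_te.mean())   # 0.5 0.2
```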

# **<span style="color:deepskyblue"> Split the Dataset for Training & Testing   </span>** 

In [2343]:
X_train,x_test,y_train,y_test=train_test_split(X_s.values,Y_s.values,test_size=0.2,random_state=1)

In [2344]:
print("Shape of Training Data",X_train.shape)
print("Shape of Testing Data",x_test.shape)
print("Attrition Rate in Training Data",y_train.mean())
print("Attrition Rate in Testing Data",y_test.mean())

# **<span style="color:deepskyblue"> Data Transformation using Standardization   </span>**

In [2345]:
standard_Scaler=StandardScaler()
X_train = standard_Scaler.fit_transform(X_train)  
x_test = standard_Scaler.transform(x_test)
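
The scaler is fitted on the training data only and then reused on the test data. As a minimal sketch with toy numbers, the same computation written out in NumPy is z = (x - mean) / std, where the mean and the population standard deviation (ddof=0, which is what `StandardScaler` uses) come from the training rows.

```python
import numpy as np

# Toy single-feature data.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

# Statistics are learned from the training rows only.
mu = train.mean(axis=0)
sigma = train.std(axis=0)   # population std (ddof=0)

# The same mu and sigma are applied to both parts.
train_z = (train - mu) / sigma
test_z = (test - mu) / sigma

print(train_z.ravel())
print(test_z.ravel())
```

The scaled training column has mean 0 and standard deviation 1; the test value is expressed on that same scale and does not influence it.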

# **<span style="color:deepskyblue"> Model Building   </span>** 

In [2346]:
#  function for training and evaluating a model

def modelEval(xtr,ytr,xte,yte,model):
    
    model.fit(xtr,ytr)
    
    # Prediction for Test and Train Dataset
    test_pred=model.predict(xte)
    train_pred =model.predict(xtr)
    
    tpr_score = metrics.precision_score(ytr, train_pred)
    trc_score = metrics.recall_score(ytr, train_pred)
    tac_score =metrics.accuracy_score(ytr,train_pred)
    print("For Training Dataset.")   
    print(f'Accuracy: {tac_score:.2f}, Precision: {tpr_score:.2f}, Recall: {trc_score:.2f}')
    print("===============================")
    pr_score = metrics.precision_score(yte, test_pred)
    rc_score = metrics.recall_score(yte, test_pred)
    ac_score = metrics.accuracy_score(yte, test_pred)
    print("===============================")
    print("===============================")
    print("For Testing Dataset")
    print("===============================")
    print("F1:",metrics.f1_score(yte, test_pred))
    print(f'Accuracy: {ac_score:.2f}, Precision: {pr_score:.2f}, Recall: {rc_score:.2f}')
    print("===============================")
    # plot_confusion_matrix was removed in scikit-learn 1.2; from_estimator (scikit-learn >= 1.0) replaces it
    metrics.ConfusionMatrixDisplay.from_estimator(model,xte,yte,cmap='GnBu')
    print(classification_report(yte,test_pred))
# print(classification_report(y_train,fit_rf.predict(X_train)))
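
For readers who want to see what the reported scores mean, here is a minimal sketch that computes precision, recall and F1 by hand from toy predictions. The helper name `precision_recall_f1` and the label lists are illustrative only.

```python
def precision_recall_f1(y_true, y_pred):
    # Count true positives, false positives and false negatives.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Precision: of the predicted leavers, how many really left.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of the real leavers, how many were found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy labels: 1 = employee left, 0 = employee stayed.
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_demo = [1, 0, 1, 0, 1, 0, 0, 1]

print(precision_recall_f1(y_true_demo, y_pred_demo))   # (0.75, 0.75, 0.75)
```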

# **<span style="color:deepskyblue"> Decision Tree Model   </span>** 

In [2347]:
model_dt = DecisionTreeClassifier(random_state=11,max_depth=10, criterion = "gini")
print("Model Name : Decision Tree")
modelEval(X_train,y_train,x_test,y_test,model_dt)

# **<span style="color:deepskyblue"> Hyper-Parameter Optimization using GridSearchCV for Decision Tree Model   </span>** 

In [2348]:
# Set the random state for reproducibility
dt_model2 = DecisionTreeClassifier(random_state=3)

In [2349]:
#  selecting the best parameters using GridSearchCV
start = time.time()

param_dist = {'max_depth': [7,8,9],
              'min_samples_split':[9,11,15],
              'min_samples_leaf':[9,11,13],
              'criterion': ['gini']}

cv_rf = GridSearchCV(estimator=dt_model2, cv = 10,
                     param_grid=param_dist, 
                     n_jobs = 3)

cv_rf.fit(X_train, y_train)
print('Best Parameters using grid search: \n', cv_rf.best_params_)
end = time.time()
print('Time taken in grid search: {0: .2f}'.format(end - start))
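
A rough way to see where the grid search time goes: every parameter combination is fitted once per cross-validation fold. The sketch below counts the fits for the decision tree grid above using only the standard library (the variable names are illustrative).

```python
from itertools import product

# Same grid as the decision tree search above.
param_grid_demo = {
    'max_depth': [7, 8, 9],
    'min_samples_split': [9, 11, 15],
    'min_samples_leaf': [9, 11, 13],
    'criterion': ['gini'],
}
cv_folds = 10

# Every combination of the listed values is one candidate model.
candidates = list(product(*param_grid_demo.values()))

# Each candidate is trained once per fold.
n_fits = len(candidates) * cv_folds

print(len(candidates), n_fits)   # 27 270
```

With the default `refit=True`, `GridSearchCV` adds one more fit of the best candidate on the full training data.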

In [2350]:
# initializing best parameter using grid search
criti = cv_rf.best_params_['criterion']
sample_split=cv_rf.best_params_['min_samples_split']
sample_leaf = cv_rf.best_params_['min_samples_leaf']
depth = cv_rf.best_params_['max_depth']

In [2351]:
#  model with best parameter
dt_model2.set_params(criterion = criti,max_depth=depth,min_samples_leaf=sample_leaf,min_samples_split=sample_split)

print("Model Name : Decision Tree with Hyper parameter :")
modelEval(X_train,y_train,x_test,y_test,dt_model2)                

# **<span style="color:deepskyblue">  K-Nearest Neighbours Classifier model   </span>** 

# **<span style="color:deepskyblue">  Detection of Optimal Value for K Neighbours  </span>**

In [2352]:
error = []

# Calculating error for K values from 1 to 39
for i in range(1, 40):  
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(x_test)
    error.append(np.mean(pred_i != y_test))

In [2363]:
plt.figure(figsize=(12, 6))  
plt.plot(range(1, 40), error, color='seagreen', linestyle='dashed', marker='o',  
         markerfacecolor='blue', markersize=8)
plt.title('Error Rate K Value')  
plt.xlabel('K Value')  
plt.ylabel('Mean Error')  
plt.show()
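
Instead of reading the best K off the plot, it can be taken directly from the error list. This is a minimal sketch with a hypothetical error curve standing in for `error`. Keep in mind that in the loop above the error is measured on `x_test`, so a K chosen this way is tuned on the test set and its test score will be somewhat optimistic.

```python
import numpy as np

# K values 1..39, matching the loop above.
k_values = list(range(1, 40))

# Hypothetical error curve standing in for the `error` list.
error_demo = [0.20 + 0.01 * abs(k - 7) for k in k_values]

# Position of the smallest error, mapped back to its K value.
best_k = k_values[int(np.argmin(error_demo))]

print(best_k)   # 7
```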

In [2354]:
classifier = KNeighborsClassifier(n_neighbors=5)  
print("Model Name : KNeighborsClassifier:")
modelEval(X_train,y_train,x_test,y_test,classifier)                


# **<span style="color:deepskyblue">  Random Forest Classifier  </span>** 

In [2355]:
rf = RandomForestClassifier(n_estimators=50,random_state=1,max_depth=6)
print("Model Name : Random Forest")
modelEval(X_train,y_train,x_test,y_test,rf)

# **<span style="color:deepskyblue">  Hyper-Parameter Optimization using GridSearchCV for Random Forest Classifier  </span>** 

In [2356]:
# Set the random state for reproducibility
fit_rf = RandomForestClassifier(random_state=1)

In [2357]:
start = time.time()

param_dist = {'max_depth': [7,8,9],
              'max_features': ['sqrt'],  # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
              'criterion': ['gini','entropy'],
              'min_samples_split':[8,9,11,12],
              'min_samples_leaf':[8,9,11,13]}

cv_rf = GridSearchCV(fit_rf, cv = 10,
                     param_grid=param_dist, 
                     n_jobs = 3)

cv_rf.fit(X_train, y_train.ravel())
print('Best Parameters using grid search: \n', cv_rf.best_params_)
end = time.time()
print('Time taken in grid search: {0: .2f}'.format(end - start))

In [2358]:
# initializing best parameter using grid search
criti = cv_rf.best_params_['criterion']
sample_split=cv_rf.best_params_['min_samples_split']
sample_leaf = cv_rf.best_params_['min_samples_leaf']
depth = cv_rf.best_params_['max_depth']

In [2359]:
# Set best parameters given by grid search 
fit_rf.set_params(n_estimators=200,bootstrap=True,criterion = criti,
                  max_features = 'sqrt',  # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
                  max_depth = depth,
                  min_samples_leaf=sample_leaf,
                  min_samples_split=sample_split)

In [2360]:
print("Model Name : Random Forest hyper parameter optimization")
modelEval(X_train,y_train,x_test,y_test,fit_rf)

# **<span style="color:deepskyblue"> SVC Model  </span>** 

In [2361]:
svc_model = SVC(kernel='rbf', gamma='scale')
print("Model Name : SVC")
modelEval(X_train,y_train,x_test,y_test,svc_model)
svc_model.fit(X_train, y_train.ravel())


