# Decision Tree Classifier with Employee Attrition Dataset

In this notebook, we will build a decision tree classifier using the scikit-learn library. We will use a hypothetical employee attrition dataset for this example.

## Import Libraries
First, let's import the necessary libraries.

In [38]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score


## Load and Explore the Dataset
Next, we will load the employee attrition dataset ('employee_attrition_small.csv') and explore its contents.

In [39]:
df = pd.read_csv('employee_attrition_small.csv')
display(df)

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,EducationField,Gender,HourlyRate,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,MonthlyRate,NumCompaniesWorked,OverTime,TotalWorkingYears
0,41,Yes,Travel_Rarely,1102,Sales,Life Sciences,Female,94,Sales Executive,4,Single,5993,19479,8,Yes,8
1,49,No,Travel_Frequently,279,Research & Development,Life Sciences,Male,61,Research Scientist,2,Married,5130,24907,1,No,10
2,37,Yes,Travel_Rarely,1373,Research & Development,Other,Male,92,Laboratory Technician,3,Single,2090,2396,6,Yes,7
3,33,No,Travel_Frequently,1392,Research & Development,Life Sciences,Female,56,Research Scientist,3,Married,2909,23159,1,Yes,8
4,27,No,Travel_Rarely,591,Research & Development,Medical,Male,40,Laboratory Technician,2,Married,3468,16632,9,No,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1465,36,No,Travel_Frequently,884,Research & Development,Medical,Male,41,Laboratory Technician,4,Married,2571,12290,4,No,17
1466,39,No,Travel_Rarely,613,Research & Development,Medical,Male,42,Healthcare Representative,1,Married,9991,21457,4,No,9
1467,27,No,Travel_Rarely,155,Research & Development,Life Sciences,Male,87,Manufacturing Director,2,Married,6142,5174,1,Yes,6
1468,49,No,Travel_Frequently,1023,Sales,Medical,Male,63,Sales Executive,2,Married,5390,13243,2,No,17
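
Beyond displaying the frame, a few quick checks can make the exploration step more informative. The sketch below is illustrative only: it uses a small hypothetical frame named `sample`, built from four columns of the first five rows shown above, and looks at column types, the balance of the target, and summary statistics.

```python
import pandas as pd

# Tiny hypothetical sample mirroring a few columns of the rows displayed above
sample = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27],
    "Attrition": ["Yes", "No", "Yes", "No", "No"],
    "OverTime": ["Yes", "No", "Yes", "Yes", "No"],
    "MonthlyIncome": [5993, 5130, 2090, 2909, 3468],
})

# Column types: text columns will need encoding before modelling
print(sample.dtypes)

# Class balance of the target, expressed as proportions
class_balance = sample["Attrition"].value_counts(normalize=True)
print(class_balance)

# Summary statistics for the numeric columns
print(sample.describe())
```

On the full dataset the same calls would be made on `df` instead of `sample`. The class balance is worth noting early, because it affects how accuracy should be read later on.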


## Preprocess the Data
We need to preprocess the data: first check for missing values, then encode the categorical variables as integers so the classifier can use them.

In [40]:
# check for missing values
print(df.isnull().sum())

Age                   0
Attrition             0
BusinessTravel        0
DailyRate             0
Department            0
EducationField        0
Gender                0
HourlyRate            0
JobRole               0
JobSatisfaction       0
MaritalStatus         0
MonthlyIncome         0
MonthlyRate           0
NumCompaniesWorked    0
OverTime              0
TotalWorkingYears     0
dtype: int64
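
The check above reports no missing values, so no imputation is needed here. For reference, the sketch below shows one common way gaps could be handled if they did appear: the median for a numeric column and the most frequent value for a categorical one. The frame `toy` and its gaps are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps; the real dataset above has none
toy = pd.DataFrame({
    "MonthlyIncome": [5993.0, np.nan, 2090.0, 2909.0],
    "Department": ["Sales", "Sales", None, "Research & Development"],
})

# Numeric column: fill with the median of the observed values
toy["MonthlyIncome"] = toy["MonthlyIncome"].fillna(toy["MonthlyIncome"].median())

# Categorical column: fill with the most frequent value (the mode)
toy["Department"] = toy["Department"].fillna(toy["Department"].mode()[0])

print(toy)
print(toy.isnull().sum())
```

Median and mode are only one possible choice; whether they are appropriate depends on why the values are missing.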


In [41]:
from sklearn.preprocessing import LabelEncoder
# drop the outcome column to get the features
x = df.drop(columns=['Attrition'])
y = df['Attrition']

# perform encoding
# take note that encoding should be done with values from x and not df

# use a dictionary of LabelEncoders (one per column) for safe, consistent encoding
encoders = {}
categorical_cols = ['BusinessTravel','Department',
                    'EducationField','Gender','JobRole',
                    'MaritalStatus','OverTime']
for col in categorical_cols:
    encoders[col] = LabelEncoder()
    x[col] = encoders[col].fit_transform(x[col])

# also encode y, since it holds 'Yes'/'No' rather than 0 and 1
target_encoder = LabelEncoder()
y = target_encoder.fit_transform(y)

display(x)

Unnamed: 0,Age,BusinessTravel,DailyRate,Department,EducationField,Gender,HourlyRate,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,MonthlyRate,NumCompaniesWorked,OverTime,TotalWorkingYears
0,41,2,1102,2,1,0,94,7,4,2,5993,19479,8,1,8
1,49,1,279,1,1,1,61,6,2,1,5130,24907,1,0,10
2,37,2,1373,1,4,1,92,2,3,2,2090,2396,6,1,7
3,33,1,1392,1,1,0,56,6,3,1,2909,23159,1,1,8
4,27,2,591,1,3,1,40,2,2,1,3468,16632,9,0,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1465,36,1,884,1,3,1,41,2,4,1,2571,12290,4,0,17
1466,39,2,613,1,3,1,42,0,1,1,9991,21457,4,0,9
1467,27,2,155,1,1,1,87,4,2,1,6142,5174,1,1,6
1468,49,1,1023,2,3,1,63,7,2,1,5390,13243,2,0,17
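
To see where the integer codes in the table come from, the sketch below fits a `LabelEncoder` on a toy column and prints the resulting mapping. `LabelEncoder` assigns codes in sorted order of the labels, which matches `Travel_Frequently` becoming 1 and `Travel_Rarely` becoming 2 above. The third label, `Non-Travel`, is an assumption, since the displayed rows only show two categories.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column; "Non-Travel" is an assumed third category
toy = pd.DataFrame({
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently",
                       "Non-Travel", "Travel_Rarely"],
})

# Keep the fitted encoder so the mapping can be reused or reversed later
encoders = {}
enc = LabelEncoder()
toy["BusinessTravel_code"] = enc.fit_transform(toy["BusinessTravel"])
encoders["BusinessTravel"] = enc

# LabelEncoder assigns codes in sorted order of the class labels
mapping = {str(label): code for code, label in enumerate(enc.classes_)}
print(mapping)

# The stored encoder can turn codes back into the original labels
decoded = encoders["BusinessTravel"].inverse_transform(toy["BusinessTravel_code"])
print(list(decoded))
```

Keeping one fitted encoder per column is what makes this reverse lookup possible. Note that the integer codes imply an ordering that the categories do not really have; a decision tree can still split on them, but the codes should not be read as magnitudes.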


## Split the Dataset
We will split the dataset into training (80%) and testing (20%) sets, fixing the random state so the split is reproducible.

In [42]:
# Split the dataset into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
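
When one class is much rarer than the other, a plain random split can leave the test set with a different class mix than the training set. As an optional variation (not what the cell above does), `train_test_split` accepts a `stratify` argument. The sketch below uses synthetic data, and the names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 80 negatives and 20 positives (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 80 + [1] * 20)

# stratify=y_demo keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("positive share, train:", y_tr.mean())
print("positive share, test: ", y_te.mean())
```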

## Train and Evaluate the Decision Tree Model
Please note that the maximum depth should not be greater than 3.
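
As a side note on this constraint: a binary tree of depth 3 can have at most 2^3 = 8 leaves, so a leaf limit of 10 would never be the binding constraint when the depth is capped at 3. The sketch below checks this on synthetic stand-in data (not the attrition dataset) using the fitted tree's `get_depth()` and `get_n_leaves()` methods.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the attrition dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, max_leaf_nodes=10, random_state=42)
tree.fit(X_demo, y_demo)

# A binary tree of depth d has at most 2**d leaves, so depth 3 allows at most 8
print("depth: ", tree.get_depth())
print("leaves:", tree.get_n_leaves())
```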

In [43]:
# Create and train the decision tree classifier
dt_classifier = DecisionTreeClassifier(max_depth=3, max_leaf_nodes=10, random_state=42)
dt_classifier.fit(x_train,y_train)

# Make predictions on the test set
y_pred_dt = dt_classifier.predict(x_test)
print(y_pred_dt)

# Calculate the accuracy of the decision tree model
accuracy_dt = accuracy_score(y_test, y_pred_dt)
print(f'Decision Tree Accuracy: {accuracy_dt:.2f}')

[0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Decision Tree Accuracy: 0.87
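
The printed predictions are almost all 0, so accuracy on its own may overstate how useful the model is. One way to put the number in context is to compare it with a baseline that always predicts the most frequent class, and to look at the confusion matrix and the recall for class 1. The sketch below uses small hypothetical label arrays rather than the notebook's actual `y_test` and `y_pred_dt`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 17 negatives and 3 positives (illustrative only)
y_true = np.array([0] * 17 + [1] * 3)
# Hypothetical predictions that catch only one of the three positives
y_hat = np.array([0] * 17 + [1, 0, 0])

# Baseline that always predicts the most frequent class; features are ignored
dummy_features = np.zeros((len(y_true), 1))
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(dummy_features, y_true)
y_base = baseline.predict(dummy_features)

model_acc = accuracy_score(y_true, y_hat)
base_acc = accuracy_score(y_true, y_base)
model_recall = recall_score(y_true, y_hat)
cm = confusion_matrix(y_true, y_hat)

print("model accuracy:   ", model_acc)
print("baseline accuracy:", base_acc)
print("model recall (class 1):", model_recall)
print(cm)  # rows are true classes, columns are predicted classes
```

In the notebook itself, the same metrics could be computed by passing `y_test` and `y_pred_dt` in place of the hypothetical arrays.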
