# Decision Tree Classifier with Employee Attrition Dataset

In this notebook, we will build a decision tree classifier using the scikit-learn library. We will use a hypothetical employee attrition dataset for this example.

## Import Libraries
First, let's import the necessary libraries.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score


## Load and Explore the Dataset
Next, we will load the employee attrition dataset ('employee_attrition_small.csv') and explore its contents.

In [2]:
df = pd.read_csv("employee_attrition_small.csv")

## Preprocess the Data
Before training, we inspect the data, check for missing values, and encode the categorical variables as integers.

In [3]:
df.head()


Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,EducationField,Gender,HourlyRate,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,MonthlyRate,NumCompaniesWorked,OverTime,TotalWorkingYears
0,41,Yes,Travel_Rarely,1102,Sales,Life Sciences,Female,94,Sales Executive,4,Single,5993,19479,8,Yes,8
1,49,No,Travel_Frequently,279,Research & Development,Life Sciences,Male,61,Research Scientist,2,Married,5130,24907,1,No,10
2,37,Yes,Travel_Rarely,1373,Research & Development,Other,Male,92,Laboratory Technician,3,Single,2090,2396,6,Yes,7
3,33,No,Travel_Frequently,1392,Research & Development,Life Sciences,Female,56,Research Scientist,3,Married,2909,23159,1,Yes,8
4,27,No,Travel_Rarely,591,Research & Development,Medical,Male,40,Laboratory Technician,2,Married,3468,16632,9,No,6


In [4]:
print(df['BusinessTravel'].unique())
print(df['Department'].unique())

['Travel_Rarely' 'Travel_Frequently' 'Non-Travel']
['Sales' 'Research & Development' 'Human Resources']


In [5]:
# Check for missing values
print(df.isnull().sum())

Age                   0
Attrition             0
BusinessTravel        0
DailyRate             0
Department            0
EducationField        0
Gender                0
HourlyRate            0
JobRole               0
JobSatisfaction       0
MaritalStatus         0
MonthlyIncome         0
MonthlyRate           0
NumCompaniesWorked    0
OverTime              0
TotalWorkingYears     0
dtype: int64


In [6]:
# Separate the features (X) from the target column (Y)
X = df.drop(columns=['Attrition'])
Y = df['Attrition']

In [13]:
from sklearn.preprocessing import LabelEncoder

# One encoder is refit for every column, so `le` only remembers the last mapping
le = LabelEncoder()

# Encode categorical features as integer codes
X['BusinessTravel'] = le.fit_transform(df['BusinessTravel'])
X['Department'] = le.fit_transform(df['Department'])
X['EducationField'] = le.fit_transform(df['EducationField'])
X['Gender'] = le.fit_transform(df['Gender'])
X['JobRole'] = le.fit_transform(df['JobRole'])
X['MaritalStatus'] = le.fit_transform(df['MaritalStatus'])
X['OverTime'] = le.fit_transform(df['OverTime'])
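
Refitting one `LabelEncoder` per column discards each earlier category-to-integer mapping. The sketch below shows one possible alternative, assuming a small hypothetical DataFrame (`toy`, not the real dataset): `OrdinalEncoder` encodes several columns at once and keeps the categories of each column, so the codes can be decoded later.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame standing in for two categorical columns of X
toy = pd.DataFrame({
    "BusinessTravel": ["Travel_Rarely", "Travel_Frequently", "Non-Travel", "Travel_Rarely"],
    "OverTime": ["Yes", "No", "Yes", "No"],
})

# Fit one encoder on all categorical columns at once
encoder = OrdinalEncoder()
encoded = pd.DataFrame(
    encoder.fit_transform(toy), columns=toy.columns, index=toy.index
)

# categories_ stores the sorted categories per column; position = integer code
mappings = {
    col: {cat: code for code, cat in enumerate(cats)}
    for col, cats in zip(toy.columns, encoder.categories_)
}
print(mappings)
print(encoded)
```

Categories are sorted alphabetically, the same order `LabelEncoder` uses, so the integer codes match those produced by the cell above (returned as floats rather than integers).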


## Split the Dataset
We will split the dataset into training and testing sets.

In [14]:
# Hold out 10% of the rows as the test set (the target was already separated into Y)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.1, random_state=42)
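
If the `Attrition` classes are imbalanced, a purely random 10% split may over- or under-represent the minority class. A minimal sketch, assuming hypothetical toy data (`toy_X`, `toy_y`), of how the `stratify` argument of `train_test_split` keeps the class proportions the same in both parts:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 16 "No" and 4 "Yes"
toy_X = pd.DataFrame({"Age": range(20, 40)})
toy_y = pd.Series(["No"] * 16 + ["Yes"] * 4, name="Attrition")

# stratify=toy_y preserves the 80/20 class ratio in both train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=42, stratify=toy_y
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```

Whether stratification matters for this notebook depends on the actual class balance, which `Y.value_counts()` would show.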

## Train and Evaluate the Decision Tree Model
Please note that the maximum depth should not be greater than 3.
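
To check that a depth limit is respected, the fitted tree can be inspected directly. The sketch below uses synthetic data from `make_classification` as a stand-in for the encoded features (an assumption, not the attrition dataset) and prints the learned rules with `export_text`.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data standing in for the encoded feature matrix
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

clf = DecisionTreeClassifier(max_depth=3, max_leaf_nodes=10, random_state=42)
clf.fit(X_demo, y_demo)

# Depth and leaf count of the fitted tree
print(clf.get_depth(), clf.get_n_leaves())

# Human-readable dump of the split rules (feature names are hypothetical)
print(export_text(clf, feature_names=[f"f{i}" for i in range(5)]))
```

A binary tree of depth 3 has at most 2^3 = 8 leaves, so with `max_depth=3` the `max_leaf_nodes=10` setting never becomes the binding constraint.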

In [17]:
# Create and train the decision tree classifier
dt_classifier = DecisionTreeClassifier(max_depth=3, max_leaf_nodes=10, random_state=42)
dt_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = dt_classifier.predict(X_test)
#test_arr = np.array([[0,1,3,1,......]])
#print(dt_classifier.predict(test_arr))

# Calculate the accuracy of the decision tree model
accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy is  {accuracy}")


accuracy is  0.8503401360544217
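
An accuracy figure is easier to interpret next to a baseline. The sketch below uses hypothetical labels and predictions (not the results above) to compare a model's accuracy with a `DummyClassifier` that always predicts the most frequent class, and to break the errors down with a confusion matrix.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and model predictions, for illustration only
y_true = np.array(["No"] * 8 + ["Yes"] * 2)
y_hat = np.array(["No"] * 8 + ["Yes", "No"])

# Baseline that ignores the features and always predicts the majority class
dummy_features = np.zeros((len(y_true), 1))
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(dummy_features, y_true)
baseline_acc = accuracy_score(y_true, baseline.predict(dummy_features))

model_acc = accuracy_score(y_true, y_hat)

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_true, y_hat, labels=["No", "Yes"])

print(f"baseline accuracy: {baseline_acc}")
print(f"model accuracy:    {model_acc}")
print(cm)
```

In the notebook itself, the same comparison could be made by fitting the baseline on `X_train`, `y_train` and scoring it on `X_test`, `y_test`.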
