# Decision Tree Classifier with Employee Attrition Dataset

In this notebook, we will build a decision tree classifier using the scikit-learn library. We will use a hypothetical employee attrition dataset for this example.

## Import Libraries
First, let's import the necessary libraries.

In [9]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')

## Load and Explore the Dataset
Next, we will load the employee attrition dataset ('employee_attrition_small.csv') and explore its contents.

In [8]:
df = pd.read_csv('employee_attrition_small.csv')

print("Dataset shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
print("\nBasic info:")
df.info()  # info() prints directly and returns None, so it is not wrapped in print()
print("\nSummary statistics:")
print(df.describe())
print("\nTarget variable distribution:")
print(df['Attrition'].value_counts())


Dataset shape: (1470, 16)

First 5 rows:
   Age Attrition     BusinessTravel  DailyRate              Department  \
0   41       Yes      Travel_Rarely       1102                   Sales   
1   49        No  Travel_Frequently        279  Research & Development   
2   37       Yes      Travel_Rarely       1373  Research & Development   
3   33        No  Travel_Frequently       1392  Research & Development   
4   27        No      Travel_Rarely        591  Research & Development   

  EducationField  Gender  HourlyRate                JobRole  JobSatisfaction  \
0  Life Sciences  Female          94        Sales Executive                4   
1  Life Sciences    Male          61     Research Scientist                2   
2          Other    Male          92  Laboratory Technician                3   
3  Life Sciences  Female          56     Research Scientist                3   
4        Medical    Male          40  Laboratory Technician                2   

  MaritalStatus  MonthlyIncome  M
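
The raw counts from `value_counts()` are often easier to read as proportions. A minimal sketch of that, using a tiny made-up frame (`toy` is hypothetical and does not come from the CSV):

```python
import pandas as pd

# Hypothetical toy data standing in for df['Attrition']; the values are made up.
toy = pd.DataFrame({"Attrition": ["No", "No", "No", "Yes"]})

# normalize=True returns class fractions instead of raw counts.
proportions = toy["Attrition"].value_counts(normalize=True)
print(proportions)
```

On the notebook's data, the equivalent call would be `df['Attrition'].value_counts(normalize=True)`.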

## Preprocess the Data
We preprocess the data by label-encoding the categorical columns (including the target, `Attrition`) and checking for missing values.

In [17]:
data = df.copy()

label_encoders = {}
categorical_columns = ['BusinessTravel', 'Department', 'EducationField', 'Gender', 
                      'JobRole', 'MaritalStatus', 'OverTime', 'Attrition']

for col in categorical_columns:
    le = LabelEncoder()
    data[col] = le.fit_transform(data[col])
    label_encoders[col] = le
    print(f"{col} {dict(zip(le.classes_, le.transform(le.classes_)))}")

print("\nMissing values:")
print(data.isnull().sum())


BusinessTravel {'Non-Travel': 0, 'Travel_Frequently': 1, 'Travel_Rarely': 2}
Department {'Human Resources': 0, 'Research & Development': 1, 'Sales': 2}
EducationField {'Human Resources': 0, 'Life Sciences': 1, 'Marketing': 2, 'Medical': 3, 'Other': 4, 'Technical Degree': 5}
Gender {'Female': 0, 'Male': 1}
JobRole {'Healthcare Representative': 0, 'Human Resources': 1, 'Laboratory Technician': 2, 'Manager': 3, 'Manufacturing Director': 4, 'Research Director': 5, 'Research Scientist': 6, 'Sales Executive': 7, 'Sales Representative': 8}
MaritalStatus {'Divorced': 0, 'Married': 1, 'Single': 2}
OverTime {'No': 0, 'Yes': 1}
Attrition {'No': 0, 'Yes': 1}

Missing values:
Age                   0
Attrition             0
BusinessTravel        0
DailyRate             0
Department            0
EducationField        0
Gender                0
HourlyRate            0
JobRole               0
JobSatisfaction       0
MaritalStatus         0
MonthlyIncome         0
MonthlyRate           0
NumCompaniesWorke
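
The mappings printed above follow from how `LabelEncoder` works: it sorts the distinct values and uses each value's position as its code. A minimal sketch on a few made-up values (the list `travel` is hypothetical, not read from the CSV), which also shows how `inverse_transform` recovers the original strings:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values; the real column comes from the CSV.
travel = ["Travel_Rarely", "Non-Travel", "Travel_Frequently", "Travel_Rarely"]

le = LabelEncoder()
codes = le.fit_transform(travel)

# classes_ is sorted, so each integer code is the position in that sorted array.
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the integer codes.
decoded = le.inverse_transform(codes)
print(list(decoded))
```

The integer order is alphabetical and carries no meaning of its own. scikit-learn documents `LabelEncoder` as intended for target labels, with `OrdinalEncoder` and `OneHotEncoder` as the feature-side alternatives, so the loop above is best read as a convenient shortcut rather than the only option.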

## Split the Dataset
We will split the dataset into training and testing sets.

In [15]:
# Separate the features (X) from the target column (y)
X = data.drop('Attrition', axis=1)
y = data['Attrition']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
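
The `stratify=y` argument keeps the class proportions the same in both splits. A minimal sketch of that effect on synthetic labels (`X_toy` and `y_toy` are made up for illustration and are not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels (made up): 80 negatives and 20 positives.
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify, the train and test splits both keep the 20% positive rate.
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

Without `stratify`, the positive rate in a small test set can drift from the overall rate purely by chance.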

## Train and Evaluate the Decision Tree Model
Please note that the maximum depth should not be greater than 3.
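
The depth limit can be confirmed after fitting with `get_depth()`. A minimal sketch on synthetic data from `make_classification` (not the attrition dataset; the names `capped` and `uncapped` are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (not the attrition dataset), used only to illustrate the depth cap.
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

capped = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_syn, y_syn)
uncapped = DecisionTreeClassifier(random_state=0).fit(X_syn, y_syn)

# get_depth() reports the depth the fitted tree actually reached.
print("capped depth:", capped.get_depth())
print("uncapped depth:", uncapped.get_depth())
```

A shallow tree is easier to read and less prone to memorizing the training set, which is one plausible reason for a cap like this.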

In [16]:
# Create and train the decision tree classifier
dt_classifier = DecisionTreeClassifier(
    max_depth=3,
    random_state=42
)
dt_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = dt_classifier.predict(X_test)

# Calculate the accuracy of the decision tree model
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy: {accuracy:.4f}")


Accuracy: 0.8265
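
An accuracy figure is easier to judge next to a trivial baseline, especially if one class is much more common than the other. A minimal sketch with made-up labels (`y_true` and `X_dummy` are hypothetical, not the notebook's test set), using `DummyClassifier` to always predict the majority class:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: 84 negatives and 16 positives (not the notebook's test set).
y_true = np.array([0] * 84 + [1] * 16)
X_dummy = np.zeros((100, 1))  # this baseline ignores the feature values

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_base = baseline.predict(X_dummy)

baseline_accuracy = accuracy_score(y_true, y_base)
print("baseline accuracy:", baseline_accuracy)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_base)
print(cm)
```

In the notebook, the same comparison would fit the baseline on `X_train, y_train` and score it on `X_test, y_test`, alongside `confusion_matrix(y_test, y_pred)` for the decision tree.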
