# Decision Tree Classifier with Employee Attrition Dataset

In this notebook, we will build a decision tree classifier using the scikit-learn library. We will use a hypothetical employee attrition dataset for this example.

## Import Libraries
First, let's import the necessary libraries.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score


## Load and Explore the Dataset
Next, we will load the employee attrition dataset ('employee_attrition_small.csv') and explore its contents.

In [2]:
df = pd.read_csv('employee_attrition_small.csv')
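The cell above only loads the file. A minimal sketch of a first exploration pass could look like the following. The `summarize` helper and the toy DataFrame are hypothetical stand-ins invented for illustration (the `Age` column is an assumption). Only the `Attrition` and `OverTime` column names come from this notebook.

```python
import pandas as pd


def summarize(df: pd.DataFrame, target: str = "Attrition") -> dict:
    """Collect a few basic facts about a DataFrame (hypothetical helper)."""
    return {
        "shape": df.shape,                                   # (rows, columns)
        "missing_per_column": df.isna().sum().to_dict(),     # NaN count per column
        "target_share": df[target].value_counts(normalize=True).to_dict(),
    }


# Toy stand-in for the real CSV; the values are invented for illustration only.
toy = pd.DataFrame({
    "Age": [34, 45, 29, 51],
    "OverTime": ["Yes", "No", "No", None],
    "Attrition": ["No", "Yes", "No", "No"],
})

summary = summarize(toy)
print(summary["shape"])
print(summary["missing_per_column"])
print(summary["target_share"])
```

On the real data, `summarize(df)` would report the row count, any columns with missing values, and how balanced the `Attrition` classes are.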


## Preprocess the Data
We need to preprocess the data before training. The cell below label-encodes every categorical column and prints the mapping used for each one; it does not perform any missing-value handling.

In [7]:
from sklearn.preprocessing import LabelEncoder
df_processed = df.copy()

categorical_columns = df_processed.select_dtypes(include=['object']).columns.tolist()
print("Categorical columns:", categorical_columns)


label_encoders = {}
for col in categorical_columns:
    le = LabelEncoder()
    df_processed[col] = le.fit_transform(df_processed[col])
    label_encoders[col] = le
    print(f"\n{col} encoding:")
    for i, label in enumerate(le.classes_):
        print(f"  {label} -> {i}")



Categorical columns: ['Attrition', 'BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus', 'OverTime']

Attrition encoding:
  No -> 0
  Yes -> 1

BusinessTravel encoding:
  Non-Travel -> 0
  Travel_Frequently -> 1
  Travel_Rarely -> 2

Department encoding:
  Human Resources -> 0
  Research & Development -> 1
  Sales -> 2

EducationField encoding:
  Human Resources -> 0
  Life Sciences -> 1
  Marketing -> 2
  Medical -> 3
  Other -> 4
  Technical Degree -> 5

Gender encoding:
  Female -> 0
  Male -> 1

JobRole encoding:
  Healthcare Representative -> 0
  Human Resources -> 1
  Laboratory Technician -> 2
  Manager -> 3
  Manufacturing Director -> 4
  Research Director -> 5
  Research Scientist -> 6
  Sales Executive -> 7
  Sales Representative -> 8

MaritalStatus encoding:
  Divorced -> 0
  Married -> 1
  Single -> 2

OverTime encoding:
  No -> 0
  Yes -> 1
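The preprocessing cell encodes categories but does not touch missing values. If the dataset did contain gaps (the notebook does not show whether it does), one simple option is to fill numeric columns with the median and categorical columns with the most frequent value. The sketch below assumes that strategy. `fill_missing` is a hypothetical helper and `MonthlyIncome` is an assumed column name used only for illustration.

```python
import pandas as pd


def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with NaNs filled: median for numeric, mode otherwise."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


# Toy data; 'MonthlyIncome' is an assumed column, values are invented.
toy = pd.DataFrame({
    "MonthlyIncome": [3000.0, None, 5000.0],
    "Department": ["Sales", "Sales", None],
})

filled = fill_missing(toy)
print(filled)
```

Any imputation of this kind would need to run before the label-encoding loop. Strictly, the fill values should be computed from the training split only, to avoid leaking information from the test set.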


## Split the Dataset
We will split the dataset into training and testing sets.

In [8]:
# Separate the features from the outcome column
X = df_processed.drop('Attrition', axis=1)
y = df_processed['Attrition']

# Split into training (80%) and testing (20%) sets, stratified on the outcome
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
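Passing `stratify=y` asks `train_test_split` to keep the class proportions roughly equal in both splits. The sketch below illustrates this on synthetic labels rather than on the attrition data. The 80/20 class mix is an assumption chosen only to make the effect easy to see.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 80 zeros and 20 ones (invented for illustration).
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Share of the positive class in each split
train_share = y_train.mean()
test_share = y_test.mean()
print(f"Positive share in train: {train_share:.2f}")
print(f"Positive share in test:  {test_share:.2f}")
```

The same check can be run in the notebook by comparing `y_train.mean()` and `y_test.mean()` after the split, since `Attrition` is encoded as 0 and 1.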


## Train and Evaluate the Decision Tree Model
Please note that the maximum depth should not be greater than 3.

In [10]:
# Create and train the decision tree classifier
dt_classifier = DecisionTreeClassifier(
    max_depth=3,
    random_state=42,
    criterion='gini',
    min_samples_split=2,
    min_samples_leaf=1
)
dt_classifier.fit(X_train, y_train)

# Make predictions on the test and training sets
y_pred = dt_classifier.predict(X_test)
y_train_pred = dt_classifier.predict(X_train)

# Calculate the accuracy of the decision tree model
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Training Accuracy: {train_accuracy:.4f} ({train_accuracy*100:.2f}%)")
print(f"Testing Accuracy: {test_accuracy:.4f} ({test_accuracy*100:.2f}%)")

Training Accuracy: 0.8614 (86.14%)
Testing Accuracy: 0.8265 (82.65%)
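Accuracy on its own can be hard to interpret when one class is much more common than the other, which is often the case for attrition data (the class balance of this dataset is not shown above). One way to put the number in context is to compare it against a majority-class baseline and to look at the confusion matrix. The sketch below does this on synthetic data from `make_classification`. The sample size, feature count, and 85/15 class weights are assumptions for illustration and are not properties of the attrition dataset.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (assumed 85/15 split), not the attrition dataset.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.85, 0.15], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Baseline that always predicts the most frequent training class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

# Depth-limited tree, matching the max_depth=3 constraint used above
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
tree_pred = tree.predict(X_test)
tree_acc = accuracy_score(y_test, tree_pred)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, tree_pred)

print(f"Baseline accuracy: {baseline_acc:.4f}")
print(f"Tree accuracy:     {tree_acc:.4f}")
print(cm)
```

In the notebook, the same comparison would use the existing `X_train`, `X_test`, `y_train`, and `y_test`. A tree that barely beats the baseline is adding little beyond predicting the majority class.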
