# Decision Tree Classifier with Employee Attrition Dataset

In this notebook, we will build a decision tree classifier using the scikit-learn library. We will use a hypothetical employee attrition dataset for this example.

## Import Libraries
First, let's import the necessary libraries.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score


## Load and Explore the Dataset
Next, we will load the employee attrition dataset ('employee_attrition_small.csv') and explore its contents.

In [2]:
# Load the dataset and preview the first five rows
data = pd.read_csv('employee_attrition_small.csv')
print(data.head())
# Display basic information about the dataset
# (info() prints directly and returns None, so it is not wrapped in print)
data.info()
# Check for missing values
print(data.isnull().sum())
# Display summary statistics
print(data.describe())

   Age Attrition     BusinessTravel  DailyRate              Department  \
0   41       Yes      Travel_Rarely       1102                   Sales   
1   49        No  Travel_Frequently        279  Research & Development   
2   37       Yes      Travel_Rarely       1373  Research & Development   
3   33        No  Travel_Frequently       1392  Research & Development   
4   27        No      Travel_Rarely        591  Research & Development   

  EducationField  Gender  HourlyRate                JobRole  JobSatisfaction  \
0  Life Sciences  Female          94        Sales Executive                4   
1  Life Sciences    Male          61     Research Scientist                2   
2          Other    Male          92  Laboratory Technician                3   
3  Life Sciences  Female          56     Research Scientist                3   
4        Medical    Male          40  Laboratory Technician                2   

  MaritalStatus  MonthlyIncome  MonthlyRate  NumCompaniesWorked OverTime  
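
Before modeling, it can help to check how balanced the target is, because accuracy is only meaningful relative to the share of the most common class. The sketch below uses a small made-up `Attrition` column (the values are hypothetical, not taken from `employee_attrition_small.csv`) to show the idea.

```python
import pandas as pd

# Hypothetical stand-in for the 'Attrition' column; the values are made up.
toy = pd.DataFrame(
    {"Attrition": ["Yes", "No", "No", "No", "Yes", "No", "No", "No"]}
)

# Share of each class; normalize=True returns fractions instead of counts.
class_share = toy["Attrition"].value_counts(normalize=True)
print(class_share)

# Accuracy of a model that always predicts the most common class.
baseline_accuracy = class_share.max()
print(f"Majority-class baseline: {baseline_accuracy:.2f}")
```

On the real data, the same two lines applied to `data['Attrition']` would give the baseline that the classifier's accuracy should be compared against.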

## Preprocess the Data
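scikit-learn's decision tree expects numeric inputs, so we one-hot encode the categorical columns with `pd.get_dummies`. With `drop_first=True`, the first level of each categorical column is dropped, so the `Attrition` target becomes a single `Attrition_Yes` column. This step does not impute missing values; check the `isnull()` counts above before relying on it.

In [3]:
# One-hot encode the categorical columns, dropping the first level of each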
data = pd.get_dummies(data, drop_first=True)
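
The cell above only encodes categorical columns. If the missing-value check had reported gaps, one simple option is to fill them before encoding. The following is a minimal sketch on a made-up frame (the rows and the choice of median and mode imputation are assumptions for illustration, not part of the original notebook).

```python
import pandas as pd

# Hypothetical rows with one missing numeric and one missing categorical value.
toy = pd.DataFrame(
    {
        "Age": [41, 49, None, 33],
        "OverTime": ["Yes", "No", "Yes", None],
        "Attrition": ["Yes", "No", "Yes", "No"],
    }
)

# Fill the numeric gap with the column median.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
# Fill the categorical gap with the most frequent value.
toy["OverTime"] = toy["OverTime"].fillna(toy["OverTime"].mode()[0])

# One-hot encode; drop_first=True keeps one indicator per two-level column.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
print(encoded.isnull().sum().sum())
```

Median and mode imputation are only one possible choice; whether they are appropriate depends on why the values are missing.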

## Split the Dataset
We will split the dataset into training and testing sets.

In [4]:
# Separate the features (X) from the target column (y)
X = data.drop('Attrition_Yes', axis=1)
y = data['Attrition_Yes']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
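
When the target classes are imbalanced, a plain random split can leave the test set with a different class ratio than the training set. `train_test_split` accepts a `stratify` argument that preserves the ratio. The sketch below uses synthetic arrays (`X_toy`, `y_toy` are hypothetical names and values) rather than the attrition data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 3 features, 20% positive class (made-up values).
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([1] * 20 + [0] * 80)

# stratify=y_toy keeps the positive-class share the same in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(f"Train positive share: {y_tr.mean():.2f}")
print(f"Test positive share:  {y_te.mean():.2f}")
```

Applied to this notebook, the equivalent change would be passing `stratify=y` in the call above; that is a suggestion, not something the original cell does.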

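## Train and Evaluate the Decision Tree Model
Please note that the maximum depth of the tree should not be greater than 3. In scikit-learn, this limit is set with the `max_depth` parameter of `DecisionTreeClassifier`; the cell below does not set it, so the tree grows without a depth limit.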

In [5]:
# Create and train the decision tree classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Calculate the accuracy of the decision tree model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.80
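
The note on maximum depth can be honored by passing `max_depth=3` when creating the classifier. The sketch below shows this on synthetic data from `make_classification` (an assumption made so the example runs without the CSV file), so its accuracy is not comparable to the figure above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the attrition features and target.
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# max_depth=3 stops the tree from growing deeper than three levels of splits.
shallow_clf = DecisionTreeClassifier(max_depth=3, random_state=42)
shallow_clf.fit(X_tr, y_tr)

# get_depth() reports the depth the fitted tree actually reached.
print(f"Tree depth: {shallow_clf.get_depth()}")

shallow_accuracy = accuracy_score(y_te, shallow_clf.predict(X_te))
print(f"Accuracy (synthetic data): {shallow_accuracy:.2f}")
```

A shallower tree is easier to interpret and less prone to overfitting, at the possible cost of some accuracy; rerunning the notebook cell with `max_depth=3` would show the effect on the attrition data.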
