# Scenario: Predict Employee Attrition using Classification Algorithms  
Employee attrition is one of the most important metrics a company should keep in mind when planning for growth. It becomes a problem when the total strength of the company drops sharply because more employees leave than expected.
So, what is **Attrition**? It is basically the turnover rate of employees in a particular organization.
* Reasons for *Attrition*:
  - Employees looking for better opportunities
  - A negative working environment
  - Bad management
  - Sickness of an employee
  - Excessive working hours

# Problem Statement 
Uncover the factors that lead to employee attrition, explore the reasons why people are leaving the organization, and predict whether an employee will leave the company by creating a web app using Streamlit that takes input from users online.

# Dataset Description 
The dataset contains the following attributes:

* satisfaction_level 
* last_evaluation 
* number_project 
* average_montly_hours (spelled this way in the source file)
* time_spend_company 
* Work_accident 
* left 
* promotion_last_5years
* Departments
* salary

### Import the required libraries for data manipulation, visualization, modeling and evaluation, as well as tools that help us visually render the resulting structure with interactive controls.

In [5]:
# Data Manipulation and Visualization 
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 
import requests

# Modeling and evaluation tools 
from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier 
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

from sklearn import tree
from sklearn.tree import export_graphviz # Converts the trained tree into DOT format.
from graphviz import Source # Renders the DOT format into a visualization.
from IPython.display import display, SVG, Image # Tools to display the visualization in the notebook.
from ipywidgets import interactive, IntSlider, FloatSlider, interact # Tools for interactive parameter tuning.
import ipywidgets
from subprocess import call # Used to execute external commands.
import matplotlib.image as mpimg # For image file handling.

# Suppress warnings 
import warnings
warnings.filterwarnings('ignore')
import os 

print('Libraries imported')

Libraries imported


In [6]:
# Load the dataset into a Dataframe  
df = pd.read_csv('datasets/employee_data.csv') 
df.head()

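|   | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | Departments | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | 0.8 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |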


In [7]:
display(df.info())  # info() prints its summary directly and returns None, hence the trailing None below

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   satisfaction_level     14999 non-null  float64
 1   last_evaluation        14999 non-null  float64
 2   number_project         14999 non-null  int64  
 3   average_montly_hours   14999 non-null  int64  
 4   time_spend_company     14999 non-null  int64  
 5   Work_accident          14999 non-null  int64  
 6   left                   14999 non-null  int64  
 7   promotion_last_5years  14999 non-null  int64  
 8   Departments            14999 non-null  object 
 9   salary                 14999 non-null  object 
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB


None

In [8]:
# Basic descriptive statistics
df.describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| satisfaction_level | 14999.0 | 0.612834 | 0.248631 | 0.09 | 0.44 | 0.64 | 0.82 | 1.0 |
| last_evaluation | 14999.0 | 0.716102 | 0.171169 | 0.36 | 0.56 | 0.72 | 0.87 | 1.0 |
| number_project | 14999.0 | 3.803054 | 1.232592 | 2.0 | 3.0 | 4.0 | 5.0 | 7.0 |
| average_montly_hours | 14999.0 | 201.050337 | 49.943099 | 96.0 | 156.0 | 200.0 | 245.0 | 310.0 |
| time_spend_company | 14999.0 | 3.498233 | 1.460136 | 2.0 | 3.0 | 3.0 | 4.0 | 10.0 |
| Work_accident | 14999.0 | 0.14461 | 0.351719 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| left | 14999.0 | 0.238083 | 0.425924 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| promotion_last_5years | 14999.0 | 0.021268 | 0.144281 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


# 📊 KEY INSIGHTS 
-----
**1. DATA QUALITY & COMPLETENESS**:
* Every column contains 14,999 `non-null entries`, so there are no `missing values` to handle.

**2. ATTRITION RATE** (Target Variable 'left'):
* The mean of the `'left'` column is 0.238. This indicates that 23.8% of employees left the company, highlighting a high attrition rate.

**3. EMPLOYEE SATISFACTION** ('satisfaction_level'):
* The average satisfaction level is 0.61 (out of 1.0).
* Crucially, the minimum satisfaction is 0.09 and a quarter of employees score 0.44 or below, suggesting a group of dissatisfied employees who may contribute heavily to attrition.

**4. WORKLOAD** ('average_montly_hours'):
* `Average monthly hours` stand at 201.
* There is an extremely wide range, from a minimum of 96 hours to a maximum of 310 hours, pointing to potential burnout issues among some employees.

**5. TENURE** ('time_spend_company') and **PROJECTS** ('number_project'):
* The average `tenure` is 3.5 years.
* At least half of all employees work on 3 to 5 `projects` (the interquartile range).
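
One way to check the hypothesis that low satisfaction goes together with attrition is to compare the attrition rate across satisfaction bands. The sketch below is only an illustration: the `toy` frame, its values, and the band cut-offs are made-up assumptions, not results from the dataset. The same `pd.cut` plus `groupby` pattern should carry over to `df`.

```python
import pandas as pd

# Hypothetical mini-sample; the real analysis would use df instead
toy = pd.DataFrame({
    "satisfaction_level": [0.10, 0.20, 0.45, 0.60, 0.85, 0.90],
    "left": [1, 1, 0, 1, 0, 0],
})

# Assumed cut-offs for illustration: (0, 0.4], (0.4, 0.7], (0.7, 1.0]
bins = [0.0, 0.4, 0.7, 1.0]
labels = ["low", "mid", "high"]
toy["satisfaction_band"] = pd.cut(toy["satisfaction_level"], bins=bins, labels=labels)

# The mean of a 0/1 column is the share of employees who left in each band
rate_by_band = toy.groupby("satisfaction_band", observed=True)["left"].mean()
print(rate_by_band)
```

A clearly higher rate in the lowest band on the real data would support the insight; similar rates across bands would argue against it.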


In [10]:
# Rename the columns to a standardized and easy to manipulate format 
df.columns = df.columns.str.lower() 
df.columns

Index(['satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company', 'work_accident', 'left',
       'promotion_last_5years', 'departments ', 'salary'],
      dtype='object')
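
Note that `'departments '` in the output above still carries a trailing space, which later shows up in dummy names such as `departments _IT`. A minimal sketch of how the header could be cleaned in the same step, using a small made-up frame (`toy` and its values are hypothetical) with the same quirks:

```python
import pandas as pd

# Toy frame that mimics the header quirks seen above (values are illustrative only)
toy = pd.DataFrame({"Work_accident": [0, 1], "Departments ": ["sales", "hr"]})

# Strip surrounding whitespace first, then lowercase
toy.columns = toy.columns.str.strip().str.lower()
print(toy.columns.tolist())
```

Applying this to `df` would change the column names used in later cells, so the rest of the notebook would need to refer to `departments` without the space.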

In [11]:
# Rename the `left` column to `target` 
df = df.rename(columns={'left': 'target'}) 
df.columns

Index(['satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company', 'work_accident', 'target',
       'promotion_last_5years', 'departments ', 'salary'],
      dtype='object')

# Feature Engineering
> To prepare the data for the model, we need to convert the **categorical features** into **dummy variables**. 

In [13]:
# Categorical variables
cat_vars = df.select_dtypes(include='O').columns.tolist()
print('Categorical Columns: ', cat_vars)

Categorical Columns:  ['departments ', 'salary']


In [14]:
# Transform categorical variables into dummy variables 
for var in cat_vars:
    cat_list = pd.get_dummies(df[var], prefix=var) 
    df = df.join(cat_list) 

df = df.drop(cat_vars, axis=1) 
df.head()

|   | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | work_accident | target | promotion_last_5years | departments _IT | departments _RandD | ... | departments _hr | departments _management | departments _marketing | departments _product_mng | departments _sales | departments _support | departments _technical | salary_high | salary_low | salary_medium |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | False | False | ... | False | False | False | False | True | False | False | False | True | False |
| 1 | 0.8 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | False | False | ... | False | False | False | False | True | False | False | False | False | True |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | False | False | ... | False | False | False | False | True | False | False | False | False | True |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | False | False | ... | False | False | False | False | True | False | False | False | True | False |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | False | False | ... | False | False | False | False | True | False | False | False | True | False |
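
The loop above can also be written as a single `pd.get_dummies` call with the `columns` argument. The sketch below checks, on a small made-up frame (`toy`, `cat_cols`, and the values are illustrative assumptions), that both routes produce the same result:

```python
import pandas as pd

# Hypothetical mini-sample with one numeric and one categorical column
toy = pd.DataFrame({
    "hours": [157, 262, 272],
    "salary": ["low", "medium", "medium"],
})
cat_cols = ["salary"]

# Route 1: the loop used in the notebook (join the dummies, then drop the originals)
looped = toy.copy()
for col in cat_cols:
    looped = looped.join(pd.get_dummies(looped[col], prefix=col))
looped = looped.drop(cat_cols, axis=1)

# Route 2: a single call that encodes and drops the listed columns
one_call = pd.get_dummies(toy, columns=cat_cols)

print(list(one_call.columns))
print(list(looped.columns) == list(one_call.columns))
```

For linear models it is common to also pass `drop_first=True` to avoid redundant columns; tree-based models like the ones used here generally do not need that.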


# Model Preparation
* Defining Features and Target variables 
* Split Data into Training and Testing Sets

In [16]:
# Check class balance for the target variable 
display(df['target'].value_counts()) 
display(df['target'].value_counts(normalize=True))

target
0    11428
1     3571
Name: count, dtype: int64

target
0    0.761917
1    0.238083
Name: proportion, dtype: float64

> There's definitely a class imbalance in the target variable, but 24% is an acceptable minority ratio. However, this imbalance means that Accuracy alone is insufficient to evaluate the Decision Tree model. Instead, I will prioritize Precision, Recall, and the F1-Score for the minority class (employees who left) to ensure the model is genuinely predictive, rather than simply biased towards the majority class.
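
To make the point about accuracy concrete, the sketch below scores a hypothetical baseline that always predicts "stayed". The class counts come from the `value_counts()` output above; the baseline itself and the variable names are assumptions for illustration.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Class counts taken from the value_counts() output above
n_stayed, n_left = 11428, 3571
y_true = [0] * n_stayed + [1] * n_left

# Hypothetical baseline that always predicts "stayed" (class 0)
y_baseline = [0] * (n_stayed + n_left)

acc = accuracy_score(y_true, y_baseline)
# zero_division=0 avoids a warning when no positive predictions exist
prec = precision_score(y_true, y_baseline, zero_division=0)
rec = recall_score(y_true, y_baseline, zero_division=0)
f1 = f1_score(y_true, y_baseline, zero_division=0)

print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```

The baseline reaches roughly 76% accuracy while never identifying a single leaver, which is why the minority-class metrics are the more informative ones here.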

In [18]:
# Split the data into features (X) and the target variable (y)
X = df.drop('target', axis=1) 

y = df['target']

In [19]:
# Split into training and test datasets 
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.25,   # Test size usually ranges between 20% and 30% of the data
                                                    random_state=42,   # Ensures reproducibility
                                                    stratify=y # Ensures the same class proportion
                                                   ) 
X_train.shape, X_test.shape, y_train.shape, y_test.shape                            

((11249, 20), (3750, 20), (11249,), (3750,))
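
The shapes follow from `test_size=0.25`: scikit-learn rounds the test share of 14,999 rows up to 3,750 and leaves 11,249 for training. To confirm that `stratify=y` keeps the class proportion in both parts, a check along these lines can help. It is a sketch on synthetic labels (the 76/24 split and all `_demo` names are assumptions chosen to resemble the dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the same imbalance as the target column
y_demo = np.array([0] * 760 + [1] * 240)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# With stratification, the positive rate should be nearly identical in both parts
print("train positive rate:", y_tr.mean())
print("test positive rate: ", y_te.mean())
```

On the real data, comparing `y_train.mean()` with `y_test.mean()` gives the same check.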

In [20]:
@interact # Convert the function into an interactive one 
def plot_tree(
    crit=['gini', 'entropy'],
    split=['best', 'random'],
    depth=IntSlider(min=1, max=25, value=2, continuous_update=False),
    min_split = IntSlider(min=2, max=5, value=2, continuous_update=False),  # min_samples_split must be at least 2
    min_leaf = IntSlider(min=1, max=5, value=1, continuous_update=False)):

    # Create an instance of the decision tree classifier
    estimator = DecisionTreeClassifier(criterion=crit, 
                                       splitter=split, 
                                       max_depth=depth, 
                                       min_samples_leaf=min_leaf, 
                                       min_samples_split=min_split)

    
    estimator.fit(X_train, y_train) # fit the model on the training data 

    # Predictions 
    train_preds = estimator.predict(X_train)
    test_preds = estimator.predict(X_test)

    # Evaluation metrics 
    a = round(accuracy_score(y_train, train_preds), 2)
    b = round(accuracy_score(y_test, test_preds), 2)
    
    print("Decision Tree Training Accuracy: ", a) 
    print("Decision Tree Testing Accuracy: ", b) 

    print("Decision Tree Training F1 Score: ", round(f1_score(y_train, train_preds), 2)) 
    print("Decision Tree Testing F1 Score: ", round(f1_score(y_test, test_preds), 2)) 

    if a > 0.99:
         print("Decision Tree Training Accuracy: ", a, "Decision Tree Testing Accuracy: ", b)
         print("Criterion: ", crit, "Splitter: ", split, "Max Depth: ", depth, "Min Samples Leaf: ", min_leaf, "Min Samples Split: ", min_split) 


    # Use GraphViz to export the model and display it as an image on the screen 
    graph = Source(tree.export_graphviz(estimator, out_file=None, 
                                        feature_names=X_train.columns, 
                                        class_names=['stayed', 'quit'],
                                        filled=True ))
                   
    display(Image(data=graph.pipe(format='png')))

Decision Tree Training Accuracy:  0.85
Decision Tree Testing Accuracy:  0.85
Decision Tree Training F1 Score:  0.58
Decision Tree Testing F1 Score:  0.57


ExecutableNotFound: failed to execute WindowsPath('dot'), make sure the Graphviz executables are on your systems' PATH

interactive(children=(Dropdown(description='crit', options=('gini', 'entropy'), value='gini'), Dropdown(descri…
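
The `ExecutableNotFound` error above means the Graphviz `dot` program is not on the PATH, so `graph.pipe(...)` cannot render the tree. One possible workaround, sketched here on synthetic data rather than the notebook's `X_train` (the `_demo` names and feature labels are assumptions), is scikit-learn's own `plot_tree`, which draws with matplotlib and does not need the Graphviz executables:

```python
import shutil

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=42)
clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_demo, y_demo)

# Report whether the Graphviz executable is available at all
print("Graphviz 'dot' on PATH:", shutil.which("dot") is not None)

# Draw the fitted tree with matplotlib only
fig, ax = plt.subplots(figsize=(10, 5))
artists = plot_tree(
    clf,
    feature_names=[f"feature_{i}" for i in range(X_demo.shape[1])],
    class_names=["stayed", "quit"],
    filled=True,
    ax=ax,
)
```

In a notebook with the inline backend the figure is shown at the end of the cell. Inside the interactive function, the `Source(...)` and `display(Image(...))` lines could be swapped for a `plot_tree` call of this kind.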

In [41]:
print("--- Reinstalling ipywidgets and jupyterlab-widgets ---")
!conda install -c conda-forge ipywidgets jupyterlab-widgets -y

--- Reinstalling ipywidgets and jupyterlab-widgets ---
Channels:
 - conda-forge
 - defaults
Platform: win-64
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... failed



PackagesNotFoundError: The following packages are not available from current channels:

  - jupyterlab-widgets

Current channels:

  - https://conda.anaconda.org/conda-forge
  - defaults
  - https://repo.anaconda.com/pkgs/main
  - https://repo.anaconda.com/pkgs/r
  - https://repo.anaconda.com/pkgs/msys2

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.
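
Before reinstalling anything, it can help to see what is actually missing. The helper below is a hypothetical diagnostic (`check_environment` is not part of any library): it reports whether the relevant Python modules are importable and whether Graphviz's `dot` executable is on the PATH. As far as I know, the conda-forge package is published under the name `jupyterlab_widgets` (with an underscore), which may explain the `PackagesNotFoundError` above; treat that as an assumption to verify on anaconda.org.

```python
import importlib.util
import shutil


def check_environment(modules=("ipywidgets", "jupyterlab_widgets", "graphviz")):
    """Report which Python modules are importable and whether `dot` is on PATH."""
    report = {name: importlib.util.find_spec(name) is not None for name in modules}
    # The `dot` program ships with Graphviz itself, not with the Python wrapper
    report["dot (Graphviz executable)"] = shutil.which("dot") is not None
    return report


for name, ok in check_environment().items():
    print(f"{name:28s} {'found' if ok else 'MISSING'}")
```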


