# Predicting Employee Productivity Using Tree Models

The garment industry is one of the key examples of the industrial globalization of the modern era.

It is a highly labour-intensive industry with lots of manual processes. Satisfying the huge global demand for garment products is mostly dependent on the production and delivery performance of the employees in the garment manufacturing companies.

So, it is highly desirable among the decision-makers in the garments industry to track, analyze, and predict the productivity performance of the working teams in their factories.

This dataset can be used for regression purposes by predicting the productivity range (0-1) or for classification purposes by transforming the productivity range (0-1) into different classes.
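As a minimal sketch of the classification framing, the productivity range could be binned into classes with pandas. The values, bin edges, and labels below are illustrative assumptions and are not part of the dataset.

```python
import pandas as pd

# Hypothetical productivity values in the 0-1 range (not taken from the dataset)
productivity = pd.Series([0.25, 0.55, 0.74, 0.80, 0.95])

# Assumed bin edges and labels, chosen only for illustration
bins = [0.0, 0.5, 0.75, 1.0]
labels = ['low', 'medium', 'high']

# pd.cut assigns each value to the right-closed interval that contains it;
# include_lowest=True makes the first interval include 0.0 as well
productivity_class = pd.cut(productivity, bins=bins, labels=labels, include_lowest=True)
print(productivity_class.tolist())
```

Different edges would give different class balances, so the choice of bins is itself a modelling decision.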

The following is the dataset's official column information:

- date: date in MM-DD-YYYY
- quarter: a portion of the month — month was divided into four quarters
- department: associated department with the instance
- day: day of the week
- team: associated team number with the instance
- targeted_productivity: targeted productivity set by the authority for each team for each day
- smv: standard minute value — the allocated time for a task
- wip: work in progress — includes the number of unfinished items for products
- over_time: represents the amount of overtime by each team in minutes
- incentive: represents the amount of financial incentive (in BDT) that enables or motivates a particular course of action
- idle_time: the duration of time when the production was interrupted due to several reasons
- idle_men: the number of workers who were idle due to production interruption
- no_of_style_change: number of changes in the style of a particular product
- no_of_workers: number of workers on each team
- actual_productivity: the actual % of productivity that was delivered by the workers — it ranges from 0 to 1.

In [57]:
# Importing all necessary libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [58]:
gwd = pd.read_csv('garments_worker_productivity.csv')

gwd.head()  # Display the first few rows of the dataset

,date,quarter,department,day,team,targeted_productivity,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers,actual_productivity
0,1/1/2015,Quarter1,sweing,Thursday,8,0.8,26.16,1108.0,7080,98,0.0,0,0,59.0,0.940725
1,1/1/2015,Quarter1,finishing,Thursday,1,0.75,3.94,,960,0,0.0,0,0,8.0,0.8865
2,1/1/2015,Quarter1,sweing,Thursday,11,0.8,11.41,968.0,3660,50,0.0,0,0,30.5,0.80057
3,1/1/2015,Quarter1,sweing,Thursday,12,0.8,11.41,968.0,3660,50,0.0,0,0,30.5,0.80057
4,1/1/2015,Quarter1,sweing,Thursday,6,0.8,25.9,1170.0,1920,50,0.0,0,0,56.0,0.800382


In [59]:
gwd.shape  # Number of observations and features

(1197, 15)

In [60]:
gwd.columns

Index(['date', 'quarter', 'department', 'day', 'team', 'targeted_productivity',
       'smv', 'wip', 'over_time', 'incentive', 'idle_time', 'idle_men',
       'no_of_style_change', 'no_of_workers', 'actual_productivity'],
      dtype='object')

In [61]:
gwd.isna().sum() # Checking for missing values

date                       0
quarter                    0
department                 0
day                        0
team                       0
targeted_productivity      0
smv                        0
wip                      506
over_time                  0
incentive                  0
idle_time                  0
idle_men                   0
no_of_style_change         0
no_of_workers              0
actual_productivity        0
dtype: int64

Only one feature has missing values: the wip column, with 506 missing entries.
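To see how large a gap is and whether it is concentrated in one group, a check along the following lines can help. This is only a sketch on a hypothetical toy frame; on the real data, gwd would take the place of toy.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values that mimics two of the columns
toy = pd.DataFrame({
    'department': ['sweing', 'finishing', 'sweing', 'finishing'],
    'wip': [1108.0, np.nan, 968.0, np.nan],
})

# Share of missing values per column (mean of a boolean mask)
missing_share = toy.isna().mean()

# Number of missing wip values within each department
missing_by_department = toy['wip'].isna().groupby(toy['department']).sum()

print(missing_share)
print(missing_by_department)
```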

In [62]:
gwd.dtypes

date                      object
quarter                   object
department                object
day                       object
team                       int64
targeted_productivity    float64
smv                      float64
wip                      float64
over_time                  int64
incentive                  int64
idle_time                float64
idle_men                   int64
no_of_style_change         int64
no_of_workers            float64
actual_productivity      float64
dtype: object

In [63]:
gwd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1197 entries, 0 to 1196
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   date                   1197 non-null   object 
 1   quarter                1197 non-null   object 
 2   department             1197 non-null   object 
 3   day                    1197 non-null   object 
 4   team                   1197 non-null   int64  
 5   targeted_productivity  1197 non-null   float64
 6   smv                    1197 non-null   float64
 7   wip                    691 non-null    float64
 8   over_time              1197 non-null   int64  
 9   incentive              1197 non-null   int64  
 10  idle_time              1197 non-null   float64
 11  idle_men               1197 non-null   int64  
 12  no_of_style_change     1197 non-null   int64  
 13  no_of_workers          1197 non-null   float64
 14  actual_productivity    1197 non-null   float64
dtypes: float64(6), int64(5), object(4)

In [64]:
gwd.describe()

,team,targeted_productivity,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers,actual_productivity
count,1197.0,1197.0,1197.0,691.0,1197.0,1197.0,1197.0,1197.0,1197.0,1197.0,1197.0
mean,6.426901,0.729632,15.062172,1190.465991,4567.460317,38.210526,0.730159,0.369256,0.150376,34.609858,0.735091
std,3.463963,0.097891,10.943219,1837.455001,3348.823563,160.182643,12.709757,3.268987,0.427848,22.197687,0.174488
min,1.0,0.07,2.9,7.0,0.0,0.0,0.0,0.0,0.0,2.0,0.233705
25%,3.0,0.7,3.94,774.5,1440.0,0.0,0.0,0.0,0.0,9.0,0.650307
50%,6.0,0.75,15.26,1039.0,3960.0,0.0,0.0,0.0,0.0,34.0,0.773333
75%,9.0,0.8,24.26,1252.5,6960.0,50.0,0.0,0.0,0.0,57.0,0.850253
max,12.0,0.8,54.56,23122.0,25920.0,3600.0,300.0,45.0,2.0,89.0,1.120437


In [65]:
gwd['department'].value_counts()

sweing        691
finishing     257
finishing     249
Name: department, dtype: int64

The department column contains two inconsistent spellings of the same value: 'finishing ' (with a trailing space) and 'finishing'. We replace the former with the latter.

In [66]:
gwd.loc[gwd['department'] == 'finishing ', 'department'] = 'finishing'
gwd['department'].value_counts()

sweing       691
finishing    506
Name: department, dtype: int64
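A more general way to handle stray whitespace, shown here as a sketch on a hypothetical toy column, is to strip every label rather than match one specific variant.

```python
import pandas as pd

# Toy column with hypothetical rows, including a trailing-space variant
department = pd.Series(['sweing', 'finishing ', 'finishing', 'sweing'])

# Remove leading and trailing whitespace from every label
department_clean = department.str.strip()
print(department_clean.value_counts())
```

This also catches variants that were not spotted by eye, such as leading spaces.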

In [67]:
gwd['quarter'].value_counts()

Quarter1    360
Quarter2    335
Quarter4    248
Quarter3    210
Quarter5     44
Name: quarter, dtype: int64

The column description defines quarter as a portion of the month, divided into four quarters, yet a fifth label, Quarter5, appears with only 44 observations. Since it is small compared to the others, we merge all Quarter5 rows into Quarter4.

In [68]:
gwd.loc[gwd['quarter'] == 'Quarter5', 'quarter'] = 'Quarter4'
gwd['quarter'].value_counts()

Quarter1    360
Quarter2    335
Quarter4    292
Quarter3    210
Name: quarter, dtype: int64
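Because quarter is described as a portion of the month, one way to understand an unexpected label is to compare it with the day of the month. The sketch below uses hypothetical rows; the dates and labels are assumptions for illustration, and the real check would use the date and quarter columns of gwd.

```python
import pandas as pd

# Hypothetical rows (not taken from the dataset)
toy = pd.DataFrame({
    'date': ['1/1/2015', '1/8/2015', '1/29/2015', '1/31/2015'],
    'quarter': ['Quarter1', 'Quarter2', 'Quarter5', 'Quarter5'],
})

# Parse the month/day/year strings and extract the day of the month
toy['day_of_month'] = pd.to_datetime(toy['date'], format='%m/%d/%Y').dt.day

# Smallest and largest day of the month covered by each quarter label
day_range = toy.groupby('quarter')['day_of_month'].agg(['min', 'max'])
print(day_range)
```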

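Next, we drop the columns we do not need: 'date', 'wip', 'idle_time', 'idle_men', and 'no_of_style_change'. The wip column is missing in 506 of 1,197 rows, and the summary statistics above show that idle_time, idle_men, and no_of_style_change are zero in at least 75% of rows.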

In [69]:
gwd = gwd.drop(['date', 'wip', "idle_time", "idle_men", "no_of_style_change"], axis=1)

Since we are preparing the dataset for a machine learning classification project, we need to convert the object columns to numeric data by encoding their values as integers.

In [70]:
gwd['quarter'] = gwd['quarter'].replace({'Quarter1': 1, 'Quarter2': 2, 
                                         'Quarter3': 3, 'Quarter4': 4})
gwd['quarter'].value_counts()

1    360
2    335
4    292
3    210
Name: quarter, dtype: int64

In [71]:
gwd['quarter'].dtypes

dtype('int64')

In [72]:
# Create the boolean classification target: True when actual productivity meets or exceeds the target
gwd['productivity'] = gwd['actual_productivity'] >= gwd['targeted_productivity']
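Before modelling, it is usually worth checking how balanced such a target is. The following sketch builds the same kind of boolean target on hypothetical rows and computes the share of rows that met their target.

```python
import pandas as pd

# Hypothetical rows with the two productivity columns
toy = pd.DataFrame({
    'actual_productivity': [0.94, 0.70, 0.80, 0.60],
    'targeted_productivity': [0.80, 0.75, 0.80, 0.70],
})

# True when actual productivity meets or exceeds the target
toy['productivity'] = toy['actual_productivity'] >= toy['targeted_productivity']

# The mean of a boolean column is the share of True values
share_met = toy['productivity'].mean()
print(share_met)
```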

In [73]:
gwd['department'] = gwd['department'].replace({'sweing': 1, 'finishing': 0})
gwd['department'].value_counts()

1    691
0    506
Name: department, dtype: int64

In [74]:
day_dummy = pd.get_dummies(gwd['day'], prefix=None)
gwd = pd.concat([gwd, day_dummy], axis=1)

In [75]:
quarter_dummy = pd.get_dummies(gwd['quarter'], prefix='q')
gwd = pd.concat([gwd, quarter_dummy], axis=1)

In [76]:
team_dummy = pd.get_dummies(gwd['team'], prefix='team')
gwd = pd.concat([gwd, team_dummy], axis=1)

In [77]:
gwd = gwd.drop(["day", "quarter", 'team'], axis=1)
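The three encode, concatenate, and drop steps above can also be written as a single call. This is a sketch on hypothetical rows; unlike the cells above, it gives the day columns a 'day' prefix, which is an assumption made here to keep the column names unambiguous.

```python
import pandas as pd

# Hypothetical rows with the three categorical columns and one numeric column
toy = pd.DataFrame({
    'day': ['Thursday', 'Saturday', 'Thursday'],
    'quarter': [1, 2, 1],
    'team': [8, 1, 11],
    'smv': [26.16, 3.94, 11.41],
})

# Encode all three columns at once; the original columns are removed automatically
encoded = pd.get_dummies(
    toy,
    columns=['day', 'quarter', 'team'],
    prefix={'day': 'day', 'quarter': 'q', 'team': 'team'},
)
print(encoded.columns.tolist())
```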

In [80]:
# Cast to integer; fractional values such as 30.5 are truncated toward zero (30)
gwd['no_of_workers'] = gwd['no_of_workers'].astype('int64')

In [81]:
gwd.head()

,department,targeted_productivity,smv,over_time,incentive,no_of_workers,actual_productivity,productivity,Monday,Saturday,...,team_3,team_4,team_5,team_6,team_7,team_8,team_9,team_10,team_11,team_12
0,1,0.8,26.16,7080,98,59,0.940725,True,0,0,...,0,0,0,0,0,1,0,0,0,0
1,0,0.75,3.94,960,0,8,0.8865,True,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0.8,11.41,3660,50,30,0.80057,True,0,0,...,0,0,0,0,0,0,0,0,1,0
3,1,0.8,11.41,3660,50,30,0.80057,True,0,0,...,0,0,0,0,0,0,0,0,0,1
4,1,0.8,25.9,1920,50,56,0.800382,True,0,0,...,0,0,0,1,0,0,0,0,0,0


The dataset is now clean and ready for training and testing machine learning models.
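As a rough sketch of the next step, the features and target could be separated and split as follows. The toy frame, the 80/20 ratio, and the pandas-only split are assumptions for illustration, standing in for a library splitter. The sketch excludes actual_productivity from the features because the target was derived from it; whether to do so is a modelling choice.

```python
import pandas as pd

# Hypothetical cleaned frame with a few of the final columns
toy = pd.DataFrame({
    'department': [1, 0, 1, 1, 0, 1, 0, 1, 1, 0],
    'smv': [26.16, 3.94, 11.41, 11.41, 25.9, 3.94, 2.9, 15.26, 24.26, 4.15],
    'actual_productivity': [0.94, 0.89, 0.80, 0.80, 0.80, 0.70, 0.60, 0.75, 0.85, 0.65],
    'productivity': [True, True, True, True, True, False, False, False, True, False],
})

# Separate features from the target, leaving out the column the target came from
X = toy.drop(columns=['productivity', 'actual_productivity'])
y = toy['productivity']

# Simple 80/20 random split using pandas only (assumed ratio and seed)
X_train = X.sample(frac=0.8, random_state=0)
X_test = X.drop(index=X_train.index)
y_train, y_test = y.loc[X_train.index], y.loc[X_test.index]
print(X_train.shape, X_test.shape)
```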