# Tree Models: Predicting Employee Productivity 

## Introduction

For this project, we'll be introducing the dataset *Productivity Prediction of Garment Employees*. The original dataset is in the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Productivity+Prediction+of+Garment+Employees). Below is a description of the dataset, according to its official summary:

>The garment industry is one of the key examples of the industrial globalization of the modern era. It is a highly labour-intensive industry with lots of manual processes. 
>
>Satisfying the huge global demand for garment products is mostly dependent on the production and delivery performance of the employees in the garment manufacturing companies.
>
>So, it is highly desirable among the decision-makers in the garments industry to track, analyze, and predict the productivity performance of the working teams in their factories.

![garment manufacturing](https://s3.amazonaws.com/dq-content/755/garment-factory-unsplash.jpg)

What's interesting about the dataset is that we can use it with both regression and classification algorithms, as is clearly stated in the final sentence of the official summary:

>This dataset can be used for regression purposes by predicting the productivity range (0-1) or for classification purposes by transforming the productivity range (0-1) into different classes.

In this project, we will focus on working with a classification tree. 
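As a rough illustration of what "transforming the productivity range into different classes" can look like, the sketch below bins a few made-up productivity scores with `pd.cut`. The bin edges and the labels `low`, `medium`, and `high` are assumptions chosen for the example; the dataset summary does not prescribe any particular classes.

```python
import pandas as pd

# Hypothetical productivity scores in the 0-1 range (illustrative values only).
productivity = pd.Series([0.35, 0.62, 0.80, 0.94])

# Assumed bin edges and labels; the official summary does not define them.
classes = pd.cut(
    productivity,
    bins=[0.0, 0.5, 0.75, 1.0],
    labels=["low", "medium", "high"],
    include_lowest=True,
)

print(classes.tolist())
```

Note that `pd.cut` returns `NaN` for any value that falls outside the given bins, so the edges should cover the full range of the column being transformed.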

Let's start by importing *pandas* and loading the dataset. To make sure that the data was loaded successfully, we will use the `.head()` method to display the column headers and the first five observations.

Don't worry about understanding what the different columns are telling us yet, because that's exactly what we will be doing in the following section.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("garments_worker_productivity.csv")
df.head()

| | date | quarter | department | day | team | targeted_productivity | smv | wip | over_time | incentive | idle_time | idle_men | no_of_style_change | no_of_workers | actual_productivity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1/1/2015 | Quarter1 | sweing | Thursday | 8 | 0.8 | 26.16 | 1108.0 | 7080 | 98 | 0.0 | 0 | 0 | 59.0 | 0.940725 |
| 1 | 1/1/2015 | Quarter1 | finishing | Thursday | 1 | 0.75 | 3.94 | NaN | 960 | 0 | 0.0 | 0 | 0 | 8.0 | 0.8865 |
| 2 | 1/1/2015 | Quarter1 | sweing | Thursday | 11 | 0.8 | 11.41 | 968.0 | 3660 | 50 | 0.0 | 0 | 0 | 30.5 | 0.80057 |
| 3 | 1/1/2015 | Quarter1 | sweing | Thursday | 12 | 0.8 | 11.41 | 968.0 | 3660 | 50 | 0.0 | 0 | 0 | 30.5 | 0.80057 |
| 4 | 1/1/2015 | Quarter1 | sweing | Thursday | 6 | 0.8 | 25.9 | 1170.0 | 1920 | 50 | 0.0 | 0 | 0 | 56.0 | 0.800382 |


## Dataset Exploration (EDA)

It is important to first understand what the dataset is telling us, along with its structure and general characteristics.

Let's start by getting the dataset's shape, where the first value indicates the number of observations and the second one the number of columns.

In [3]:
df.shape

(1197, 15)

Now let's explore the columns: their names, how many non-null observations each one has, and their respective data types (dtypes).

In *pandas*, the "object" dtype means the observations of that specific column are treated as strings/text.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1197 entries, 0 to 1196
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   date                   1197 non-null   object 
 1   quarter                1197 non-null   object 
 2   department             1197 non-null   object 
 3   day                    1197 non-null   object 
 4   team                   1197 non-null   int64  
 5   targeted_productivity  1197 non-null   float64
 6   smv                    1197 non-null   float64
 7   wip                    691 non-null    float64
 8   over_time              1197 non-null   int64  
 9   incentive              1197 non-null   int64  
 10  idle_time              1197 non-null   float64
 11  idle_men               1197 non-null   int64  
 12  no_of_style_change     1197 non-null   int64  
 13  no_of_workers          1197 non-null   float64
 14  actual_productivity    1197 non-null   float64
dtypes: float64(6), int64(5), object(4)
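The `info()` output shows that `wip` has only 691 non-null values out of 1197, which leaves 1197 - 691 = 506 missing entries. A minimal sketch of how missing values can be counted per column is shown below; the `sample` frame is a small stand-in built from the first rows displayed earlier, not the full dataset.

```python
import numpy as np
import pandas as pd

# Stand-in frame with a few of the rows shown by df.head(); not the full dataset.
sample = pd.DataFrame({
    "department": ["sweing", "finishing", "sweing"],
    "wip": [1108.0, np.nan, 968.0],
    "over_time": [7080, 960, 3660],
})

# isna() flags missing cells; summing the flags counts them per column.
missing = sample.isna().sum()
print(missing)
```

Running `df.isna().sum()` on the full DataFrame should report the same gap for `wip` that `info()` reveals.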

The following is the dataset's official column information:

`date`: date in MM-DD-YYYY

`quarter`: a portion of the month — month was divided into four quarters

`department`: associated department with the instance

`day`: day of the week

`team`: associated team number with the instance

`targeted_productivity`: targeted productivity set by the authority for each team for each day

`smv`: standard minute value — the allocated time for a task

`wip`: work in progress — includes the number of unfinished items for products

`over_time`: represents the amount of overtime by each team in minutes

`incentive`: represents the amount of financial incentive (in BDT) that enables or motivates a particular course of action

`idle_time`: the duration of time when the production was interrupted due to several reasons

`idle_men`: the number of workers who were idle due to production interruption

`no_of_style_change`: number of changes in the style of a particular product

`no_of_workers`: number of workers on each team

`actual_productivity`: the actual % of productivity that was delivered by the workers — it ranges from 0 to 1.

We will now get general statistics about the numerical columns.

Remember that *std* stands for standard deviation, and the *percentages* represent percentiles. *min* and *max* indicate the minimum and maximum values of every column, so they are particularly useful for detecting outliers.

In [5]:
df.describe()

| | team | targeted_productivity | smv | wip | over_time | incentive | idle_time | idle_men | no_of_style_change | no_of_workers | actual_productivity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1197.0 | 1197.0 | 1197.0 | 691.0 | 1197.0 | 1197.0 | 1197.0 | 1197.0 | 1197.0 | 1197.0 | 1197.0 |
| mean | 6.426901 | 0.729632 | 15.062172 | 1190.465991 | 4567.460317 | 38.210526 | 0.730159 | 0.369256 | 0.150376 | 34.609858 | 0.735091 |
| std | 3.463963 | 0.097891 | 10.943219 | 1837.455001 | 3348.823563 | 160.182643 | 12.709757 | 3.268987 | 0.427848 | 22.197687 | 0.174488 |
| min | 1.0 | 0.07 | 2.9 | 7.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.233705 |
| 25% | 3.0 | 0.7 | 3.94 | 774.5 | 1440.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.0 | 0.650307 |
| 50% | 6.0 | 0.75 | 15.26 | 1039.0 | 3960.0 | 0.0 | 0.0 | 0.0 | 0.0 | 34.0 | 0.773333 |
| 75% | 9.0 | 0.8 | 24.26 | 1252.5 | 6960.0 | 50.0 | 0.0 | 0.0 | 0.0 | 57.0 | 0.850253 |
| max | 12.0 | 0.8 | 54.56 | 23122.0 | 25920.0 | 3600.0 | 300.0 | 45.0 | 2.0 | 89.0 | 1.120437 |


We can see, for instance, that the `actual_productivity` column actually surpasses the limit of 1 that was indicated on the dataset description!

Also, the maximum `wip` (Work in Progress) value is 23122. This means there is an observation where the number of unfinished items for products is 23122!

In addition, time management in this factory appears to be fairly efficient: `idle_time` and `idle_men` are both 0 up to at least the 75th percentile. It appears there was either a single incident or a small number of incidents where production was stopped.
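Observations like these can be checked directly with boolean masks rather than read off the summary table. The sketch below uses a small stand-in frame with illustrative values (not actual rows from the dataset) to count rows where productivity exceeds 1 and rows where any idleness was recorded.

```python
import pandas as pd

# Stand-in rows with illustrative values; not actual rows from the dataset.
sample = pd.DataFrame({
    "actual_productivity": [0.940725, 1.00023, 0.80057, 1.120437],
    "idle_time": [0.0, 0.0, 300.0, 0.0],
    "idle_men": [0, 0, 45, 0],
})

# Rows whose productivity exceeds the documented upper limit of 1.
above_one = sample[sample["actual_productivity"] > 1]
print(len(above_one))

# Rows where production was interrupted (idle time or idle workers recorded).
idle_rows = sample[(sample["idle_time"] > 0) | (sample["idle_men"] > 0)]
print(len(idle_rows))
```

Applying the same masks to `df` would give the exact number of affected observations in the full dataset.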

There are a lot of other interesting facts that we can discover by carefully examining the *describe()* table. It's always important to understand what the dataset is telling us to avoid confusion during subsequent steps in the process.

In the next subsections, we will explore every column individually.

### "date" column

We can also use the *head()* method on single columns to see the first five values:

In [6]:
df["date"].head()

0    1/1/2015
1    1/1/2015
2    1/1/2015
3    1/1/2015
4    1/1/2015
Name: date, dtype: object

Also, it's useful to select a number of random observations to get a general idea of the data in the column. In this case, we will choose 20.

One important clarification: although the *sample()* method returns random observations, we've set the **random_state** parameter here so that we always get the same observations, which ensures reproducibility.

In [7]:
df["date"].sample(20, random_state = 14)

959     2/26/2015
464     1/27/2015
672      2/8/2015
321     1/19/2015
282     1/17/2015
307     1/18/2015
609      2/4/2015
1123     3/8/2015
877     2/22/2015
950     2/26/2015
692     2/10/2015
51       1/4/2015
505     1/29/2015
554      2/1/2015
801     2/16/2015
1017     3/2/2015
340     1/20/2015
732     2/12/2015
616      2/4/2015
806     2/17/2015
Name: date, dtype: object
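The effect of **random_state** can be verified with a short sketch: two calls to *sample()* with the same seed return identical observations. The `dates` Series here is a small stand-in made of date strings like those in the column, not the full dataset.

```python
import pandas as pd

# Stand-in Series of date strings in the same format as the "date" column.
dates = pd.Series(["1/1/2015", "1/4/2015", "1/17/2015", "2/8/2015", "3/2/2015"])

# The same seed yields the same rows, in the same order, on every call.
first = dates.sample(3, random_state=14)
second = dates.sample(3, random_state=14)

print(first.equals(second))
```

Omitting `random_state` would make each call draw a different random selection.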

### "quarter" column

This column's title is pretty peculiar, in the sense that when we say "quarter", we are usually referring to part of a year. But here, it's actually referring to part of a month.

This teaches us a valuable lesson: never make assumptions about the data based purely on the title of a column! It's always a good idea to keep a dataset's description close at hand to refresh our memory if we need to.

Let's use the *value_counts()* method to see how many observations we have per quarter:

In [8]:
df["quarter"].value_counts()

Quarter1    360
Quarter2    335
Quarter4    248
Quarter3    210
Quarter5     44
Name: quarter, dtype: int64

Interestingly, we see there are 44 observations with a *Quarter5* classification. Let's explore them specifically by using a mask on our dataset:

In [9]:
df[df["quarter"] == "Quarter5"]

| | date | quarter | department | day | team | targeted_productivity | smv | wip | over_time | incentive | idle_time | idle_men | no_of_style_change | no_of_workers | actual_productivity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 498 | 1/29/2015 | Quarter5 | sweing | Thursday | 2 | 0.8 | 22.52 | 1416.0 | 6840 | 113 | 0.0 | 0 | 0 | 57.0 | 1.00023 |
| 499 | 1/29/2015 | Quarter5 | finishing | Thursday | 4 | 0.8 | 4.3 | NaN | 1200 | 0 | 0.0 | 0 | 0 | 10.0 | 0.989 |
| 500 | 1/29/2015 | Quarter5 | sweing | Thursday | 3 | 0.8 | 22.52 | 1287.0 | 6840 | 100 | 0.0 | 0 | 0 | 57.0 | 0.950186 |
| 501 | 1/29/2015 | Quarter5 | sweing | Thursday | 4 | 0.8 | 22.52 | 1444.0 | 6900 | 88 | 0.0 | 0 | 0 | 57.5 | 0.9008 |
| 502 | 1/29/2015 | Quarter5 | sweing | Thursday | 10 | 0.8 | 22.52 | 1088.0 | 6720 | 88 | 0.0 | 0 | 0 | 56.0 | 0.90013 |
| 503 | 1/29/2015 | Quarter5 | finishing | Thursday | 6 | 0.5 | 2.9 | NaN | 1200 | 0 | 0.0 | 0 | 0 | 10.0 | 0.899 |
| 504 | 1/29/2015 | Quarter5 | finishing | Thursday | 8 | 0.65 | 4.15 | NaN | 960 | 0 | 0.0 | 0 | 0 | 8.0 | 0.877552 |
| 505 | 1/29/2015 | Quarter5 | finishing | Thursday | 11 | 0.6 | 2.9 | NaN | 960 | 0 | 0.0 | 0 | 0 | 8.0 | 0.864583 |
| 506 | 1/29/2015 | Quarter5 | finishing | Thursday | 10 | 0.8 | 3.94 | NaN | 1200 | 0 | 0.0 | 0 | 0 | 10.0 | 0.85695 |
| 507 | 1/29/2015 | Quarter5 | finishing | Thursday | 1 | 0.75 | 3.94 | NaN | 1200 | 0 | 0.0 | 0 | 0 | 10.0 | 0.853667 |


If we check the "date" column, we can see that "Quarter5" always comprises observations where the date falls on either the 29th or the 31st of the month.
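One way to confirm this kind of pattern programmatically is to parse the date strings and list the distinct days of the month that appear under "Quarter5". The sketch below is a minimal example under stated assumptions: `sample` contains only a few date and quarter pairs copied from the rows displayed above, and `day_of_month` is a helper column introduced for illustration.

```python
import pandas as pd

# Stand-in frame with date/quarter pairs taken from rows shown above.
sample = pd.DataFrame({
    "date": ["1/1/2015", "1/1/2015", "1/29/2015", "1/29/2015"],
    "quarter": ["Quarter1", "Quarter1", "Quarter5", "Quarter5"],
})

# Parse the month/day/year strings and keep only the day of the month.
sample["day_of_month"] = pd.to_datetime(sample["date"], format="%m/%d/%Y").dt.day

# Distinct days of the month that fall under Quarter5 in this sample.
is_quarter5 = sample["quarter"] == "Quarter5"
quarter5_days = sorted(sample.loc[is_quarter5, "day_of_month"].unique().tolist())

print(quarter5_days)
```

Because the stand-in frame only includes rows from the 29th, it cannot show the 31st; running the same steps on `df` would list every day that the full dataset assigns to "Quarter5".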