In [1]:
import pandas as pd

# Garments Dataset

This dataset captures important characteristics of the garment manufacturing process, along with the productivity of the workers, which was collected manually and validated by industry experts. Its features can serve as indicators of productivity over a given period of time. The data was extracted by Rahim et al. (2021) using data mining techniques such as the tree ensemble model and the gradient boosted tree model, which were the best-performing candidates among the methods considered.

A tree ensemble is a machine learning model that makes predictions by combining several decision trees instead of relying on just one. A gradient boosted tree model is an ensemble learning technique in which each new model is built to predict the errors (residuals) of the previous models, and the models are then combined to produce the final prediction. It uses gradient descent to minimize the loss as new models are added.
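
To make the residual-fitting idea behind gradient boosting concrete, here is a minimal sketch on synthetic data. It is not the procedure used by Rahim et al.; the data, the tree depth, the learning rate, and the number of rounds are all illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption, unrelated to the garments dataset).
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1                      # illustrative value
prediction = np.full_like(y, y.mean())   # start from a constant model
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(50):                      # illustrative number of boosting rounds
    # For squared-error loss, the negative gradient is simply the residual.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    # Add a damped version of the new tree's output to the running prediction.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(f"Training MSE before boosting: {initial_mse:.4f}")
print(f"Training MSE after boosting:  {final_mse:.4f}")
```

Each round fits a small tree to what the current ensemble still gets wrong, so the training error shrinks as trees are added. In practice, a library implementation such as scikit-learn's `GradientBoostingRegressor` would normally be used instead of a hand-written loop.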

In [19]:
df = pd.read_csv('Dataset/garments.csv')

In [20]:
print(df.head())

       date   quarter  department       day  team  targeted_productivity  \
0  1/1/2015  Quarter1      sweing  Thursday     8                   0.80   
1  1/1/2015  Quarter1  finishing   Thursday     1                   0.75   
2  1/1/2015  Quarter1      sweing  Thursday    11                   0.80   
3  1/1/2015  Quarter1      sweing  Thursday    12                   0.80   
4  1/1/2015  Quarter1      sweing  Thursday     6                   0.80   

     smv     wip  over_time  incentive  idle_time  idle_men  \
0  26.16  1108.0       7080         98        0.0         0   
1   3.94     NaN        960          0        0.0         0   
2  11.41   968.0       3660         50        0.0         0   
3  11.41   968.0       3660         50        0.0         0   
4  25.90  1170.0       1920         50        0.0         0   

   no_of_style_change  no_of_workers  actual_productivity  
0                   0           59.0             0.940725  
1                   0            8.0        

In [41]:
print(df['day'].unique())

['Thursday' 'Saturday' 'Sunday' 'Monday' 'Tuesday' 'Wednesday']


The dataset consists of 1197 rows (entries) and 15 columns (features).
The 15 features are:
- date
>  Date in M/D/YYYY format (for example, 1/1/2015)
- quarter
> A portion of the month. A month was divided into four quarters.
- department
> Department associated with the instance.
- day
> Day of the week
- team
> Team number associated with the instance.
- targeted_productivity
> Targeted productivity set by the authority for each team for each day.
- smv
> Standard Minute Value; the allocated time for a task
- wip
> Work in progress. The number of unfinished items for products.
- over_time
> Represents the amount of overtime by each team in minutes.
- incentive
> Represents the amount of financial incentive that enables or motivates a particular course of action.
- idle_time
> The amount of time during which production was interrupted, for any of several reasons.
- idle_men
> The number of workers who were idle due to production interruption.
- no_of_style_change
> Number of changes in the style of a particular product
- no_of_workers
> Number of workers in each team
- actual_productivity
> The actual productivity that was delivered by the workers, expressed as a fraction. It ranges from 0 to 1.
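
As a quick check on the **date** and **day** descriptions, the following sketch parses a couple of the date strings shown in the `df.head()` output and compares the resulting weekday with the recorded day name. The `sample` frame and the `day_matches` column are hypothetical names used only for illustration, and the month-first format is an assumption.

```python
import pandas as pd

# Sample values copied from the df.head() output shown earlier.
sample = pd.DataFrame({
    "date": ["1/1/2015", "1/1/2015"],
    "day": ["Thursday", "Thursday"],
})

# Assumption: the month comes first (M/D/YYYY); adjust `format` if it does not.
sample["date"] = pd.to_datetime(sample["date"], format="%m/%d/%Y")

# Cross-check the parsed dates against the recorded day names.
sample["day_matches"] = sample["date"].dt.day_name() == sample["day"]
print(sample)
```

Running the same comparison on the full `df` would flag any rows where the date and the day name disagree.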

## Cleaning and preprocessing

Before we do any machine learning, we need to prepare our data and ensure that it is as clean as possible to improve our model's accuracy.

In [5]:
from sklearn.impute import SimpleImputer

In [25]:
imputer_num = SimpleImputer(strategy='mean')

In [24]:
null_count = df.isnull().sum()
print(null_count)

date                       0
quarter                    0
department                 0
day                        0
team                       0
targeted_productivity      0
smv                        0
wip                      506
over_time                  0
incentive                  0
idle_time                  0
idle_men                   0
no_of_style_change         0
no_of_workers              0
actual_productivity        0
dtype: int64


As the output above shows, every column is free of null values except **wip**, which has 506. To handle this, we will use `SimpleImputer` from scikit-learn with the mean strategy, which replaces each missing value with the mean of the column's non-missing values.
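
To illustrate what mean imputation does, here is a small sketch using only the five **wip** values visible in the `df.head()` output. The `sample` frame and the `wip_imputed` column are hypothetical names for illustration; the mean on the full dataset will differ.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# The five wip values visible in the df.head() output, one of them missing.
sample = pd.DataFrame({"wip": [1108.0, np.nan, 968.0, 968.0, 1170.0]})

imputer = SimpleImputer(strategy="mean")
# fit_transform returns a 2D array, so flatten it before assigning to a column.
sample["wip_imputed"] = imputer.fit_transform(sample[["wip"]]).ravel()

print(sample)
print("Learned mean:", imputer.statistics_[0])
```

The missing entry is filled with the mean of the four observed values, (1108 + 968 + 968 + 1170) / 4 = 1053.5, while the observed values are left unchanged.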

In [26]:
df['wip'] = imputer_num.fit_transform(df[['wip']])

In [27]:
null_count = df.isnull().sum()
print(null_count)

date                     0
quarter                  0
department               0
day                      0
team                     0
targeted_productivity    0
smv                      0
wip                      0
over_time                0
incentive                0
idle_time                0
idle_men                 0
no_of_style_change       0
no_of_workers            0
actual_productivity      0
dtype: int64


After applying the SimpleImputer, the dataset contains no null values.

Next, let's look for misrepresented data or noise that could be the result of typos, input errors, data leakage, etc.

In [38]:
print(df['department'].unique())

['sweing' 'finishing ' 'finishing']


In the **department** column, **finishing** appears as two different values: one with a trailing space and one without.

In [39]:
df['department'] = df['department'].replace('finishing ', 'finishing')

In [40]:
print(df['department'].unique())

['sweing' 'finishing']
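
The replacement above targets one known value. A more general sketch, assuming stray whitespace could appear on any value, strips it from the whole column with `str.strip()`. The last step, renaming `sweing` to `sewing`, is optional and rests on the assumption that `sweing` is a misspelling; it is not something the notebook itself does. The `sample` frame and `cleaned` variable are hypothetical names.

```python
import pandas as pd

# The three distinct department values seen before cleaning.
sample = pd.DataFrame({"department": ["sweing", "finishing ", "finishing"]})

# Remove leading and trailing whitespace from every value in the column.
sample["department"] = sample["department"].str.strip()

# Optional (assumption): treat 'sweing' as a misspelling of 'sewing'.
sample["department"] = sample["department"].replace("sweing", "sewing")

cleaned = sorted(sample["department"].unique())
print(cleaned)
```

Stripping the whole column avoids having to list every whitespace variant by hand, which matters if other values turn out to have the same problem.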


In [None]:
Now looking at normalization.