<h1 style="color:#ffc0cb;font-size:70px;font-family:Georgia;text-align:center;"><strong>Worker Productivity Problem</strong></h1>

### <b>Author: Nguyen Dang Huynh Chau</b>

<h1 style="color:#ffc0cb;font-size:40px;font-family:Georgia;text-align:center;"><strong> 📜 Table of Contents</strong></h1>

### 1. [Data Preparation](#1)

1.1 [Importing Necessary Libraries and datasets](#1.1)

1.2 [Data Retrieving](#1.2)

1.3 [Data information](#1.3)

<br>

### 2. [Data Cleaning](#2)

2.1 [About This Dataset](#2.1)

2.2 [Drop column](#2.2)

2.3 [Typo check](#2.3)

2.4 [Missing Values](#2.4)

> - 2.4.1 [Check missing values](#2.4.1)
> - 2.4.2 [Fill missing values](#2.4.1)  
>> - 2.4.2.a [Filling missing values for Embarked Feature](#2.4.2.a)
>> - 2.4.2.b [Filling missing values for Cabin Feature](#2.4.2.b)
>> - 2.4.2.c [Filling missing values for Fare Feature](#2.4.2.c)
>> - 2.4.2.d [Filling missing values for Age Feature](#2.4.2.d)

2.5 [Data type](#2.5)

2.6 [Upper Case the content](#2.6)

2.7 [Extra-whitespaces](#2.7)

2.8 [Descriptive statistics for Central Tendency](#2.8)

2.9 [Detect Outlier](#2.9)

2.10 [Save The Intermediate Data](#2.10)

<br>

### 3. [Data Exploration (EDA)](#3)

3.1 [Overall look on target variable](#3.1)

3.2 [Frequency of each corresponding Target variable type](#3.2)

3.3 [Statistical Overview](#3.3)

3.4 [Correlation Matrix and Heatmap](#3.4)

<br>

### 4. [Feature Engineering](#4)

4.1 [Separating dependent and independent variables](#4.1)

4.2 [Encoding](#4.2)

> - 4.2.1 [Binary Encoding for Name and Ticket Feature:](#4.2.1)
> - 4.2.2 [Binary Encoding for Embarked Feature:](#4.2.2)

4.3 [Separating dependent and independent variables](#4.3)

4.4 [Splitting the training data](#4.4)

4.5 [Feature Scaling](#4.5)

<br>

### 5. [Model Building](#5) 

5.1 [Logistic Regression](#5.1)

> - 5.1.1 [Logistic Regression without GridSearch](#5.1.1)
>> - 5.1.1.a [Train model](#5.1.1.a) 
>> - 5.1.1.b [Evaluating a classification model](#5.1.1.b) 
> - 5.1.2 [Logistic Regression with GridSearch](#5.1.2)
>> - 5.1.2.a [Train model](#5.1.1.a) 
>> - 5.1.2.b [Evaluating a classification model](#5.1.1.b) 


5.2 [Random Forest](#5.2)
> - 5.2.1 [Random Forest with Pipelines](#5.2.1)
>> - 5.2.1.a [Train model](#5.2.1.a) 
>> - 5.2.1.b [Evaluating a classification model](#5.2.1.b) 
> - 5.2.2 [Combining GridSearch+Random Forest with Pipelines](#5.2.2)
>> - 5.2.2.a [Train model](#5.2.1.a) 
>> - 5.2.2.b [Evaluating a classification model](#5.2.1.b)

5.3 [K-Nearest Neighbors with GridSearchCV](#5.3)
>> - 5.3.1 [Train model](#5.2.1.a) 
>> - 5.3.2 [Evaluating a classification model](#5.2.1.b)

5.4 [Ensemble Learning](#5.4)
> - 5.4.1 [Bagging Classifier](#5.4.1)
>> - 5.4.1.a [Train model](#5.4.1.a) 
>> - 5.4.1.b [Evaluating a classification model](#5.4.1.b) 
>> - 5.4.1.c [Compare Pro and Cons](#5.4.1.c) 
> - 5.4.2 [AdaBoost Classifier](#5.4.2)
>> - 5.4.2.a [Train model](#5.4.1.a) 
>> - 5.4.2.b [Evaluating a classification model](#5.4.1.b) 
>> - 5.4.2.c [Compare Pro and Cons](#5.4.1.b) 

5.5 [Extra Trees Classifier](#5.5)
>> - 5.5.1 [Train model](#5.5.1) 
>> - 5.5.2 [Evaluating a classification model](#5.5.2) 

5.6 [Random Forest](#5.6)


<br>

### 6. [Conclusions](#6)

<br>

### 7. [References](#7)

<br>

### 8. [Appendix](#8)

<hr>

<a id="1"></a>
<h1 style="color:#ffc0cb;font-size:40px;font-family:Georgia;text-align:center;"><strong> ✍️ 1. Data Preparation</strong></h1>

<a id="1.1"></a>
# ✴️ 1.1 Importing Necessary Libraries and datasets

In [1]:
# Install the required packages with pip in the current Jupyter kernel
import sys
!{sys.executable} -m pip install missingno
!{sys.executable} -m pip install scikit-learn
!{sys.executable} -m pip install xgboost
!{sys.executable} -m pip install statsmodels
!{sys.executable} -m pip install imbalanced-learn
!{sys.executable} -m pip install category_encoders


# standard-library helpers
from datetime import time
import os

# work with data in tabular representation
import pandas as pd
# numerical operations (e.g. rounding the data in the correlation matrix)
import numpy as np


# Modules for data visualization
import seaborn as sns
import missingno as msno
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
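# NOTE: plot_confusion_matrix was removed in scikit-learn 1.2; newer versions provide ConfusionMatrixDisplay instead
from sklearn.metrics import plot_confusion_matrix, classification_report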
from sklearn.neighbors import KNeighborsClassifier

# encoding
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
import category_encoders as ce

# import LogisticRegression model in python. 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_absolute_error, accuracy_score

# for saving the pipeline
import joblib

# from Scikit-learn (Pipeline is already imported above)
from sklearn.linear_model import Lasso
from sklearn.preprocessing import MinMaxScaler, Binarizer

pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 200)

# Ensure that our plots are shown and embedded within the Jupyter notebook itself. Without this command, sometimes plots may show up in pop-up windows
%matplotlib inline

# overwrite the style of all the matplotlib graphs
sns.set()

# suppress all warning messages (including DeprecationWarning)
import warnings
warnings.filterwarnings('ignore')



In [2]:
# check the version of the packages
print("Numpy version: ", np.__version__)
print("Pandas version: ",pd.__version__)
! python --version

Numpy version:  1.20.3
Pandas version:  1.3.4
Python 3.9.7


<a id="1.2"></a>
# 📲 1.2 Data Retrieving
***
To load the data properly, the CSV file has to be examined carefully first. The fields are separated by ",", and any extra whitespace at the beginning of a field is stripped by setting "skipinitialspace = True".

In [3]:
## Importing the datasets
df = pd.read_csv("Data/garments_worker_productivity.csv", delimiter=',', skipinitialspace = True)

df.columns = df.columns.str.replace(' ', '') #strip the extra-whitespaces out

print("The shape of the ORIGINAL data is (row, column):", str(df.shape))

# preview the first three rows
df.head(3)

The shape of the ORIGINAL data is (row, column): (1197, 15)


Unnamed: 0,date,quarter,department,day,team,targeted_productivity,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers,actual_productivity
0,1/1/2015,Quarter1,sweing,Thursday,8,0.8,26.16,1108.0,7080,98,0.0,0,0,59.0,0.940725
1,1/1/2015,Quarter1,finishing,Thursday,1,0.75,3.94,,960,0,0.0,0,0,8.0,0.8865
2,1/1/2015,Quarter1,sweing,Thursday,11,0.8,11.41,968.0,3660,50,0.0,0,0,30.5,0.80057
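
As a rough illustration of what `skipinitialspace` changes, the sketch below reads the same small CSV text twice. The text in `raw` is made up for this example (it is not taken from the data file) and deliberately has a space after every comma.

```python
import io

import pandas as pd

# Hypothetical CSV text with a space after each delimiter
raw = "date, department, team\n1/1/2015, sweing, 8\n1/1/2015, finishing, 1\n"

# Default behaviour: the leading space stays part of the column names
plain = pd.read_csv(io.StringIO(raw))

# skipinitialspace=True drops whitespace that directly follows the delimiter
stripped = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

print(list(plain.columns))
print(list(stripped.columns))
print(stripped["department"].tolist())
```

Without the option, the column has to be addressed as `' department'`, which is easy to overlook later in the notebook.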


<a id="1.3"></a>
# 🔈 1.3 Data Information
*****

In [5]:
print("The shape of the data is (row, column): " + str(df.shape))
# df.info() prints its report directly and returns None, so it is not wrapped in print()
df.info()

The shape of the data is (row, column): (1197, 15)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1197 entries, 0 to 1196
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   date                   1197 non-null   object 
 1   quarter                1197 non-null   object 
 2   department             1197 non-null   object 
 3   day                    1197 non-null   object 
 4   team                   1197 non-null   int64  
 5   targeted_productivity  1197 non-null   float64
 6   smv                    1197 non-null   float64
 7   wip                    691 non-null    float64
 8   over_time              1197 non-null   int64  
 9   incentive              1197 non-null   int64  
 10  idle_time              1197 non-null   float64
 11  idle_men               1197 non-null   int64  
 12  no_of_style_change     1197 non-null   int64  
 13  no_of_workers          1197 non-null   float64
 14  
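
The `Non-Null Count` column above shows 691 non-null values for `wip` out of 1197 rows, which means 1197 - 691 = 506 entries are missing. A minimal sketch of how such counts can be computed directly is given below. The `sample` frame is a stand-in built from the `wip` and `over_time` values of the three preview rows, not the full dataset.

```python
import numpy as np
import pandas as pd

# Stand-in frame using the three preview rows (wip is missing in the second row)
sample = pd.DataFrame({
    "wip": [1108.0, np.nan, 968.0],
    "over_time": [7080, 960, 3660],
})

# Number of missing values per column
missing_count = sample.isna().sum()

# Share of missing values per column (between 0 and 1)
missing_share = sample.isna().mean()

print(missing_count)
print(missing_share.round(3))
```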

<a id="2"></a>
<h1 style="color:#ffc0cb;font-size:40px;font-family:Georgia;text-align:center;"><strong> 🧹 2. Data Cleaning</strong></h1>

<a id="2.2"></a>
# ❌ 2.2 Drop column
***

In [7]:
corrMatrix = df.corr()
corrMatrix.style.background_gradient(cmap='Blues')

Unnamed: 0,team,targeted_productivity,smv,wip,over_time,incentive,idle_time,idle_men,no_of_style_change,no_of_workers,actual_productivity
team,1.0,0.030274,-0.110011,-0.033474,-0.096737,-0.007674,0.003796,0.026974,-0.011194,-0.075113,-0.148753
targeted_productivity,0.030274,1.0,-0.069489,0.062054,-0.088557,0.032768,-0.056181,-0.053818,-0.209294,-0.084288,0.421594
smv,-0.110011,-0.069489,1.0,-0.037837,0.674887,0.032629,0.056863,0.105901,0.315388,0.912176,-0.122089
wip,-0.033474,0.062054,-0.037837,1.0,0.022302,0.16721,-0.026299,-0.048718,-0.072357,0.030383,0.131147
over_time,-0.096737,-0.088557,0.674887,0.022302,1.0,-0.004793,0.031038,-0.017913,0.05979,0.734164,-0.054206
incentive,-0.007674,0.032768,0.032629,0.16721,-0.004793,1.0,-0.012024,-0.02114,-0.026607,0.049222,0.076538
idle_time,0.003796,-0.056181,0.056863,-0.026299,0.031038,-0.012024,1.0,0.559146,-0.011598,0.058049,-0.080851
idle_men,0.026974,-0.053818,0.105901,-0.048718,-0.017913,-0.02114,0.559146,1.0,0.133632,0.106946,-0.181734
no_of_style_change,-0.011194,-0.209294,0.315388,-0.072357,0.05979,-0.026607,-0.011598,0.133632,1.0,0.327787,-0.207366
no_of_workers,-0.075113,-0.084288,0.912176,0.030383,0.734164,0.049222,0.058049,0.106946,0.327787,1.0,-0.057991
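
One possible way to read the matrix above is to list the feature pairs whose absolute correlation exceeds a cut-off, since such pairs carry largely redundant information. The sketch below copies three rows and columns from the matrix. The `threshold` of 0.9 is an assumption made for this example, not a value stated in this notebook.

```python
import pandas as pd

# Sub-matrix copied from the correlation output above
cols = ["smv", "over_time", "no_of_workers"]
corr = pd.DataFrame(
    [
        [1.0, 0.674887, 0.912176],
        [0.674887, 1.0, 0.734164],
        [0.912176, 0.734164, 1.0],
    ],
    index=cols,
    columns=cols,
)

threshold = 0.9  # hypothetical cut-off for "highly correlated"

# Visit each unordered pair once (upper triangle, diagonal excluded)
high_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]

for a, b, value in high_pairs:
    print(f"{a} vs {b}: {value:.3f}")
```

With these values only `smv` and `no_of_workers` exceed the cut-off. Whether one of them should be dropped remains a modelling decision.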


<a id="2.1"></a>
# 📝 2.1 Typo check:
***
To check for typos, the categories of every column are listed with the value_counts() function, so that misspelled or inconsistent entries can be spotted. Columns with a long list of distinct values are printed with a for loop so that each value can be checked carefully. The questions to ask of each column are listed below, to help decide whether its values are valid.
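
The check described above could be sketched as follows. The `sample` frame and its values are invented for illustration (including the trailing space in `"finishing "`), and `categorical_columns` is a hypothetical list of the text columns to inspect.

```python
import pandas as pd

# Invented sample with one deliberately inconsistent entry ("finishing " with a trailing space)
sample = pd.DataFrame({
    "department": ["sweing", "finishing ", "finishing", "sweing"],
    "day": ["Thursday", "Thursday", "Saturday", "Sunday"],
    "team": [8, 1, 11, 12],
})

# Hypothetical list of text columns to inspect
categorical_columns = ["department", "day"]

# Print the frequency of every distinct value, including missing ones
for column in categorical_columns:
    print(f"--- {column} ---")
    print(sample[column].value_counts(dropna=False))

# Flag values that differ from their whitespace-stripped form
suspicious = {
    column: [v for v in sample[column].dropna().unique() if v != v.strip()]
    for column in categorical_columns
}
print(suspicious)
```

Values that look identical in the printed counts but appear on separate rows are usually the first hint of stray whitespace or inconsistent casing.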

In [6]:
categories = list(df['date'].value_counts().index)

for category in categories:
    print(category)

3/11/2015
1/31/2015
1/11/2015
3/10/2015
1/12/2015
1/24/2015
1/8/2015
1/10/2015
1/7/2015
1/13/2015
1/5/2015
3/9/2015
3/8/2015
3/3/2015
1/22/2015
2/25/2015
2/26/2015
2/28/2015
1/3/2015
1/4/2015
1/28/2015
1/27/2015
3/4/2015
1/25/2015
1/17/2015
1/14/2015
1/6/2015
2/18/2015
1/29/2015
2/17/2015
3/2/2015
3/1/2015
2/22/2015
2/19/2015
3/5/2015
3/7/2015
2/24/2015
2/23/2015
1/1/2015
2/3/2015
2/15/2015
1/15/2015
1/18/2015
1/19/2015
1/21/2015
1/26/2015
2/1/2015
2/2/2015
2/4/2015
2/7/2015
2/8/2015
2/10/2015
2/11/2015
2/12/2015
2/5/2015
2/9/2015
2/16/2015
2/14/2015
1/20/2015
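
Reading a long list of dates by eye is error-prone, so a complementary check is to let pandas parse them and report whatever fails. The sketch below assumes the month/day/year layout seen in the output above. The last entry of `dates` is a deliberately invalid value added for this example and does not come from the dataset.

```python
import pandas as pd

# Three dates from the list above plus one made-up invalid date
dates = pd.Series(["1/1/2015", "2/28/2015", "3/11/2015", "2/30/2015"])

# errors="coerce" turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")

# Entries that could not be parsed as month/day/year
invalid = dates[parsed.isna()]

print("Invalid entries:", invalid.tolist())
print("Date range:", parsed.min().date(), "to", parsed.max().date())
```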
