# Employee Engagement Index Analysis - EDA and Feature Engineering

#### The overall objective is to understand which dimensions help explain the selected target variable (either overall satisfaction or job status, to be decided in the course of the analysis).

#### Overall pipeline
> * Data cleaning
* Interactive EDA using pandas_profiling.ProfileReport
* Analysis of categorical and numerical variables
* Feature engineering & standardization

#### **Important remark**: To respect the copyright on the data, this public project discloses the full process, the code, and the data visualizations, but access to the original dataset remains confidential.

### Import libraries

In [64]:
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.fftpack as sp
import matplotlib.pyplot as plt
import pandas_profiling

import warnings
warnings.simplefilter("ignore")

## Load the file for analysis

### My computer

In [2]:
# display all columns
pd.set_option("display.max_columns", None)

In [3]:
# Importing the file and creating a dataframe
master_modeling = pd.read_csv(
    "file name",
    low_memory=False,
    skipinitialspace=True,
)  # , sep='\t'

In [4]:
master_modeling.head()

Unnamed: 0,respondent_ID,channel,area_responsibility,industry,ndustry_group,Nb employees,Company size,Gender,Professional_experience,Satisfaction,job_situation,Q3a_dev_opp,Q3b_manager,Q3c_perso_contribution,Q3d_remuneration,Q3e_colleagues,Q3f_working_conditions,Q3g_align_comp_values,Q3h_sense_meaning,Q3i_job_freedom,Q3j_work_life_balance,Q3k_training_tools,Q3l_feel_challenged,workload,Q5a_job_achievements,Q5b_feedback,Q5c_teamwork,Q5d_opportunities_growth,Q5e_work_life_balance,Q5f_customer focus,Q5g_purpose_direction,Q5h_fairness,Q5i_respect_for_management,Q5j_comp_ben,Q5k_workplace,Q5l_communication,Q5m_performance,Q5n_diversity,Q5o_respect_for_employees
0,10882908921,Lausanne,Marketing and communication,Banking,Banking_financial_insurance,100 to 149,1 to 149,Female,16 to 20 years,Satisfied,Active,2.0,3.0,5.0,3.0,4.0,4.0,2.0,3.0,3.0,2.0,2.0,2.0,Too heavy,10.0,8.0,1.0,9.0,3.0,13.0,11.0,4.0,2.0,12.0,14.0,6.0,15.0,7.0,5.0
1,10879775971,Lausanne,Tax and accounting,Pharmaceutical,Healthcare_pharma,1000,>1000,Female,11 to 15 years,Not satisfied,Active,4.0,4.0,1.0,1.0,5.0,1.0,4.0,2.0,2.0,1.0,3.0,5.0,Too heavy,9.0,13.0,10.0,3.0,1.0,6.0,7.0,8.0,5.0,4.0,12.0,15.0,11.0,14.0,2.0
2,10879710816,Lausanne,Information Technology,Computer/software/technology,Info_tech_telco,10 to 49,1 to 149,Male,Less than 10 years,Satisfied,Active,5.0,4.0,5.0,4.0,5.0,5.0,4.0,5.0,,3.0,5.0,4.0,Good,10.0,12.0,5.0,7.0,2.0,6.0,13.0,11.0,15.0,8.0,3.0,1.0,14.0,9.0,4.0
3,10867564537,Lausanne,Human resources,Pharmaceutical,Healthcare_pharma,150 to 299,150 to 999,Male,+20 years,Satisfied,Planning,1.0,5.0,5.0,4.0,5.0,5.0,3.0,4.0,4.0,2.0,4.0,4.0,Good,1.0,8.0,5.0,15.0,10.0,3.0,2.0,14.0,7.0,4.0,12.0,13.0,11.0,6.0,9.0
4,10862147414,Lausanne,Human resources,Health care,Healthcare_pharma,1000,>1000,Male,11 to 15 years,OK,Active,2.0,4.0,4.0,2.0,5.0,4.0,2.0,3.0,3.0,4.0,3.0,2.0,Too light,3.0,7.0,5.0,6.0,1.0,15.0,12.0,10.0,8.0,2.0,4.0,11.0,13.0,9.0,14.0


In [5]:
# Check the shape of the dataframe
master_modeling.shape

(621, 39)

In [6]:
# Check data types
master_modeling.dtypes

respondent_ID                   int64
channel                        object
area_responsibility            object
industry                       object
ndustry_group                  object
Nb employees                   object
Company size                   object
Gender                         object
Professional_experience        object
Satisfaction                   object
job_situation                  object
Q3a_dev_opp                   float64
Q3b_manager                   float64
Q3c_perso_contribution        float64
Q3d_remuneration              float64
Q3e_colleagues                float64
Q3f_working_conditions        float64
Q3g_align_comp_values         float64
Q3h_sense_meaning             float64
Q3i_job_freedom               float64
Q3j_work_life_balance         float64
Q3k_training_tools            float64
Q3l_feel_challenged           float64
workload                       object
Q5a_job_achievements          float64
Q5b_feedback                  float64
Q5c_teamwork

## Data cleaning
> * The strategy is to keep the data consistent. We therefore remove the rows with NaN in the ranking question (Q5, 177 NaN), which correspond to respondents who stopped the questionnaire at the ranking question.
* The alternative, trying to predict the ranking, is not relevant: it would require many different hypotheses and thus introduce too many biases.
* We would keep more than 440 respondents, which should be sufficient for decision tree and clustering analyses.
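
The claim that the Q5 NaN come from respondents who stopped at the ranking question can be checked by testing whether missingness in the Q5 columns is all-or-nothing per row. The following is a minimal sketch on made-up toy data (the real dataset is confidential); the names `toy`, `q5_cols`, `all_or_nothing` and `n_kept` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the confidential survey (values are made up).
toy = pd.DataFrame(
    {
        "Q5a_job_achievements": [10.0, np.nan, 3.0, np.nan],
        "Q5b_feedback": [8.0, np.nan, 7.0, np.nan],
        "Q5c_teamwork": [1.0, np.nan, 5.0, np.nan],
    }
)

# Select the ranking columns by their shared prefix.
q5_cols = [c for c in toy.columns if c.startswith("Q5")]

# Rows with at least one missing rank vs. rows with every rank missing.
missing_any = toy[q5_cols].isnull().any(axis=1)
missing_all = toy[q5_cols].isnull().all(axis=1)

# If both masks match, each respondent ranked either everything or nothing.
all_or_nothing = bool((missing_any == missing_all).all())

# Number of respondents that would survive dropping the incomplete rankings.
n_kept = int((~missing_any).sum())
print(all_or_nothing, n_kept)
```

If the two masks differed on the real data, some respondents would have ranked only part of the items, and dropping them would deserve a separate discussion.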

In [7]:
# Identify NaN
master_modeling.isnull().sum()

respondent_ID                   0
channel                         0
area_responsibility           198
industry                      199
ndustry_group                 199
Nb employees                  199
Company size                  199
Gender                        200
Professional_experience       201
Satisfaction                    0
job_situation                   0
Q3a_dev_opp                    76
Q3b_manager                    75
Q3c_perso_contribution         75
Q3d_remuneration               75
Q3e_colleagues                 76
Q3f_working_conditions         74
Q3g_align_comp_values          80
Q3h_sense_meaning              76
Q3i_job_freedom                76
Q3j_work_life_balance          78
Q3k_training_tools             76
Q3l_feel_challenged            75
workload                       73
Q5a_job_achievements          177
Q5b_feedback                  177
Q5c_teamwork                  177
Q5d_opportunities_growth      177
Q5e_work_life_balance         177
Q5f_customer f

In [8]:
# Remove rows with NaN in Q5 or in Professional_experience.
# Both columns are passed in one call so that neither filter is lost.
df_intermediate = master_modeling.dropna(
    subset=["Q5a_job_achievements", "Professional_experience"]
)
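
When rows are filtered on several columns, the filters only accumulate if each `dropna` call starts from the previous result, or if the columns are passed together in `subset`. The sketch below illustrates the difference on made-up toy data; `toy`, `restarted` and `combined` are hypothetical names used only for this example.

```python
import numpy as np
import pandas as pd

# Toy data (made up): one row misses Q5, another misses the experience field.
toy = pd.DataFrame(
    {
        "Q5a_job_achievements": [10.0, np.nan, 3.0, 4.0],
        "Professional_experience": [
            "+20 years",
            "11 to 15 years",
            None,
            "16 to 20 years",
        ],
    }
)

# Restarting from the full frame: the second assignment overwrites the first,
# so only the Professional_experience filter is actually applied.
restarted = toy.dropna(subset=["Q5a_job_achievements"])
restarted = toy.dropna(subset=["Professional_experience"])

# Passing both columns at once drops rows missing either value.
combined = toy.dropna(subset=["Q5a_job_achievements", "Professional_experience"])

print(len(restarted), len(combined))
```

On the survey data the two approaches appear to give the same rows, since the NaN check that follows shows no missing Q5 values, but the combined form does not rely on that coincidence.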

In [9]:
# Check whether any NaN remain
df_intermediate.isnull().sum()

respondent_ID                 0
channel                       0
area_responsibility           0
industry                      0
ndustry_group                 0
Nb employees                  0
Company size                  0
Gender                        0
Professional_experience       0
Satisfaction                  0
job_situation                 0
Q3a_dev_opp                   2
Q3b_manager                   2
Q3c_perso_contribution        1
Q3d_remuneration              1
Q3e_colleagues                3
Q3f_working_conditions        1
Q3g_align_comp_values         4
Q3h_sense_meaning             1
Q3i_job_freedom               3
Q3j_work_life_balance         4
Q3k_training_tools            2
Q3l_feel_challenged           2
workload                      0
Q5a_job_achievements          0
Q5b_feedback                  0
Q5c_teamwork                  0
Q5d_opportunities_growth      0
Q5e_work_life_balance         0
Q5f_customer focus            0
Q5g_purpose_direction         0
Q5h_fair

In [10]:
# Remove remaining NaN
df_final = df_intermediate.dropna()

In [11]:
# Remove duplicate variables (Nb employees and industry) and features that are not useful.
# Kept commented out (not executed in this run). Note that drop(..., inplace=True)
# returns None, so its result must not be assigned; drop without inplace instead.
# df_final = df_final.drop(
#     columns=["industry", "Nb employees", "channel", "respondent_ID"]
# )
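
Assigning the result of `drop(..., inplace=True)` would replace the dataframe with `None`, because an in-place drop returns nothing. A minimal sketch of the two behaviours, on made-up toy data with hypothetical names (`toy`, `scratch`, `trimmed`):

```python
import pandas as pd

# Toy data (made up) with a few of the survey's column names.
toy = pd.DataFrame(
    {
        "respondent_ID": [1, 2],
        "channel": ["CRM", "Geneva"],
        "industry": ["Banking", "Education"],
        "Satisfaction": ["OK", "Satisfied"],
    }
)

# With inplace=True the frame itself is modified and the call returns None.
scratch = toy.copy()
result = scratch.drop("industry", axis=1, inplace=True)

# Without inplace, drop returns a new dataframe and leaves the input untouched.
trimmed = toy.drop(columns=["respondent_ID", "channel", "industry"])

print(result, list(trimmed.columns))
```

Returning a new dataframe also makes the cell safe to re-run, since the source frame still contains the columns being dropped.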

In [12]:
df_final.head()

Unnamed: 0,respondent_ID,channel,area_responsibility,industry,ndustry_group,Nb employees,Company size,Gender,Professional_experience,Satisfaction,job_situation,Q3a_dev_opp,Q3b_manager,Q3c_perso_contribution,Q3d_remuneration,Q3e_colleagues,Q3f_working_conditions,Q3g_align_comp_values,Q3h_sense_meaning,Q3i_job_freedom,Q3j_work_life_balance,Q3k_training_tools,Q3l_feel_challenged,workload,Q5a_job_achievements,Q5b_feedback,Q5c_teamwork,Q5d_opportunities_growth,Q5e_work_life_balance,Q5f_customer focus,Q5g_purpose_direction,Q5h_fairness,Q5i_respect_for_management,Q5j_comp_ben,Q5k_workplace,Q5l_communication,Q5m_performance,Q5n_diversity,Q5o_respect_for_employees
0,10882908921,Lausanne,Marketing and communication,Banking,Banking_financial_insurance,100 to 149,1 to 149,Female,16 to 20 years,Satisfied,Active,2.0,3.0,5.0,3.0,4.0,4.0,2.0,3.0,3.0,2.0,2.0,2.0,Too heavy,10.0,8.0,1.0,9.0,3.0,13.0,11.0,4.0,2.0,12.0,14.0,6.0,15.0,7.0,5.0
1,10879775971,Lausanne,Tax and accounting,Pharmaceutical,Healthcare_pharma,1000,>1000,Female,11 to 15 years,Not satisfied,Active,4.0,4.0,1.0,1.0,5.0,1.0,4.0,2.0,2.0,1.0,3.0,5.0,Too heavy,9.0,13.0,10.0,3.0,1.0,6.0,7.0,8.0,5.0,4.0,12.0,15.0,11.0,14.0,2.0
3,10867564537,Lausanne,Human resources,Pharmaceutical,Healthcare_pharma,150 to 299,150 to 999,Male,+20 years,Satisfied,Planning,1.0,5.0,5.0,4.0,5.0,5.0,3.0,4.0,4.0,2.0,4.0,4.0,Good,1.0,8.0,5.0,15.0,10.0,3.0,2.0,14.0,7.0,4.0,12.0,13.0,11.0,6.0,9.0
4,10862147414,Lausanne,Human resources,Health care,Healthcare_pharma,1000,>1000,Male,11 to 15 years,OK,Active,2.0,4.0,4.0,2.0,5.0,4.0,2.0,3.0,3.0,4.0,3.0,2.0,Too light,3.0,7.0,5.0,6.0,1.0,15.0,12.0,10.0,8.0,2.0,4.0,11.0,13.0,9.0,14.0
5,10861974271,Geneva,Sales,Hospitality,Others,100 to 149,1 to 149,Female,16 to 20 years,OK,Active,3.0,2.0,3.0,4.0,3.0,2.0,4.0,4.0,4.0,1.0,3.0,4.0,Too heavy,1.0,6.0,2.0,5.0,4.0,7.0,9.0,10.0,15.0,12.0,11.0,3.0,13.0,14.0,8.0


In [13]:
df_final.isnull().sum()

respondent_ID                 0
channel                       0
area_responsibility           0
industry                      0
ndustry_group                 0
Nb employees                  0
Company size                  0
Gender                        0
Professional_experience       0
Satisfaction                  0
job_situation                 0
Q3a_dev_opp                   0
Q3b_manager                   0
Q3c_perso_contribution        0
Q3d_remuneration              0
Q3e_colleagues                0
Q3f_working_conditions        0
Q3g_align_comp_values         0
Q3h_sense_meaning             0
Q3i_job_freedom               0
Q3j_work_life_balance         0
Q3k_training_tools            0
Q3l_feel_challenged           0
workload                      0
Q5a_job_achievements          0
Q5b_feedback                  0
Q5c_teamwork                  0
Q5d_opportunities_growth      0
Q5e_work_life_balance         0
Q5f_customer focus            0
Q5g_purpose_direction         0
Q5h_fair

## Save df_final for use in visualization after the cluster analysis

In [62]:
# name of the file
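
The saving step itself is not shown in the notebook. One way it could look, as a sketch only: the file name `df_final_demo.csv`, the temporary directory, and the toy dataframe are all assumptions made for this example, since the original file name is not disclosed.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for df_final (values are made up).
toy = pd.DataFrame(
    {"Satisfaction": ["OK", "Satisfied"], "Q3a_dev_opp": [2.0, 5.0]}
)

# Hypothetical output path; the notebook's real file name is not disclosed.
out_path = os.path.join(tempfile.gettempdir(), "df_final_demo.csv")

# index=False keeps the row index out of the file, which avoids an extra
# "Unnamed: 0" column when the file is read back with read_csv.
toy.to_csv(out_path, index=False)

reloaded = pd.read_csv(out_path)
print(reloaded.shape)
```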

## Interactive EDA
> * The features related to satisfaction are correlated with one another, which is logical given that Q3 aims to evaluate different degrees of satisfaction.
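
The correlation among the Q3 items can also be inspected directly, outside the profiling report. The sketch below uses made-up toy scores and Spearman rank correlation, which is an assumption on our part (chosen because the Q3 answers are ordinal 1 to 5 scores); `toy`, `q3_cols` and `corr` are illustrative names.

```python
import pandas as pd

# Toy scores on the 1-5 scale (made up), using three of the Q3 column names.
toy = pd.DataFrame(
    {
        "Q3a_dev_opp": [2, 4, 5, 1, 2, 3],
        "Q3b_manager": [3, 4, 4, 5, 4, 2],
        "Q3h_sense_meaning": [3, 2, 5, 4, 3, 4],
    }
)

# Select the satisfaction items by their shared prefix.
q3_cols = [c for c in toy.columns if c.startswith("Q3")]

# Pairwise rank correlation between the selected items.
corr = toy[q3_cols].corr(method="spearman")
print(corr.round(2))
```

The resulting matrix can be passed to `seaborn.heatmap` for a visual check if a plot is preferred.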

In [65]:
# Generate the interactive profiling report
pandas_profiling.ProfileReport(df_final)

0,1
Number of variables,40
Number of observations,404
Total Missing (%),0.0%
Total size in memory,126.4 KiB
Average record size in memory,320.3 B

0,1
Numeric,29
Categorical,11
Boolean,0
Date,0
Text (Unique),0
Rejected,0
Unsupported,0

0,1
Distinct count,404
Unique (%),100.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,322.02
Minimum,0
Maximum,619
Zeros (%),0.2%

0,1
Minimum,0.0
5-th percentile,23.3
Q1,163.5
Median,329.5
Q3,483.5
95-th percentile,597.85
Maximum,619.0
Range,619.0
Interquartile range,320.0

0,1
Standard deviation,184.23
Coef of variation,0.57211
Kurtosis,-1.2319
Mean,322.02
MAD,160.11
Skewness,-0.10456
Sum,130096
Variance,33942
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
619,1,0.2%,
223,1,0.2%,
203,1,0.2%,
204,1,0.2%,
205,1,0.2%,
206,1,0.2%,
207,1,0.2%,
213,1,0.2%,
214,1,0.2%,
217,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
0,1,0.2%,
1,1,0.2%,
3,1,0.2%,
4,1,0.2%,
5,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
615,1,0.2%,
616,1,0.2%,
617,1,0.2%,
618,1,0.2%,
619,1,0.2%,

0,1
Distinct count,404
Unique (%),100.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,10702000000
Minimum,10586459612
Maximum,10882908921
Zeros (%),0.0%

0,1
Minimum,10586459612
5-th percentile,10642000000
Q1,10667000000
Median,10709000000
Q3,10712000000
95-th percentile,10747000000
Maximum,10882908921
Range,296449309
Interquartile range,44808000

0,1
Standard deviation,39042000
Coef of variation,0.0036482
Kurtosis,5.6497
Mean,10702000000
MAD,25897000
Skewness,1.2129
Sum,4323462750081
Variance,1524300000000000
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
10665148415,1,0.2%,
10737379686,1,0.2%,
10717961077,1,0.2%,
10709425477,1,0.2%,
10670181703,1,0.2%,
10663923413,1,0.2%,
10637559125,1,0.2%,
10709747807,1,0.2%,
10710318474,1,0.2%,
10709722972,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
10586459612,1,0.2%,
10586537449,1,0.2%,
10586543895,1,0.2%,
10587617603,1,0.2%,
10587627893,1,0.2%,

Value,Count,Frequency (%),Unnamed: 3
10861974271,1,0.2%,
10862147414,1,0.2%,
10867564537,1,0.2%,
10879775971,1,0.2%,
10882908921,1,0.2%,

0,1
Distinct count,3
Unique (%),0.7%
Missing (%),0.0%
Missing (n),0

0,1
CRM,332
Lausanne,59
Geneva,13

Value,Count,Frequency (%),Unnamed: 3
CRM,332,82.2%,
Lausanne,59,14.6%,
Geneva,13,3.2%,

0,1
Distinct count,12
Unique (%),3.0%
Missing (%),0.0%
Missing (n),0

0,1
Finance,99
Other,48
Human resources,48
Other values (9),209

Value,Count,Frequency (%),Unnamed: 3
Finance,99,24.5%,
Other,48,11.9%,
Human resources,48,11.9%,
Information Technology,45,11.1%,
Marketing and communication,38,9.4%,
Sales,36,8.9%,
Strategy,21,5.2%,
Company management,20,5.0%,
Supply chain,19,4.7%,
Legal,13,3.2%,

0,1
Distinct count,26
Unique (%),6.4%
Missing (%),0.0%
Missing (n),0

0,1
Banking,74
Other,56
Financial services,42
Other values (23),232

Value,Count,Frequency (%),Unnamed: 3
Banking,74,18.3%,
Other,56,13.9%,
Financial services,42,10.4%,
Consumer goods,35,8.7%,
Health care,30,7.4%,
Manufacturing,24,5.9%,
Computer/software/technology,21,5.2%,
Pharmaceutical,16,4.0%,
Education,12,3.0%,
Automotive,10,2.5%,

0,1
Distinct count,9
Unique (%),2.2%
Missing (%),0.0%
Missing (n),0

0,1
Banking_financial_insurance,121
Others,100
Healthcare_pharma,46
Other values (6),137

Value,Count,Frequency (%),Unnamed: 3
Banking_financial_insurance,121,30.0%,
Others,100,24.8%,
Healthcare_pharma,46,11.4%,
Consumer goods,35,8.7%,
Auto and manufacturing,34,8.4%,
Info_tech_telco,33,8.2%,
Energy_utilities,13,3.2%,
Education,12,3.0%,
Agriculture and food,10,2.5%,

0,1
Distinct count,8
Unique (%),2.0%
Missing (%),0.0%
Missing (n),0

0,1
1000,109
10 to 49,73
1 to 9,60
Other values (5),162

Value,Count,Frequency (%),Unnamed: 3
1000,109,27.0%,
10 to 49,73,18.1%,
1 to 9,60,14.9%,
500 to 999,40,9.9%,
300 to 499,35,8.7%,
50 to 99,34,8.4%,
150 to 299,33,8.2%,
100 to 149,20,5.0%,

0,1
Distinct count,3
Unique (%),0.7%
Missing (%),0.0%
Missing (n),0

0,1
1 to 149,187
>1000,109
150 to 999,108

Value,Count,Frequency (%),Unnamed: 3
1 to 149,187,46.3%,
>1000,109,27.0%,
150 to 999,108,26.7%,

0,1
Distinct count,2
Unique (%),0.5%
Missing (%),0.0%
Missing (n),0

0,1
Male,257
Female,147

Value,Count,Frequency (%),Unnamed: 3
Male,257,63.6%,
Female,147,36.4%,

0,1
Distinct count,4
Unique (%),1.0%
Missing (%),0.0%
Missing (n),0

0,1
+20 years,185
16 to 20 years,86
11 to 15 years,77

Value,Count,Frequency (%),Unnamed: 3
+20 years,185,45.8%,
16 to 20 years,86,21.3%,
11 to 15 years,77,19.1%,
Less than 10 years,56,13.9%,

0,1
Distinct count,3
Unique (%),0.7%
Missing (%),0.0%
Missing (n),0

0,1
Satisfied,195
OK,127
Not satisfied,82

Value,Count,Frequency (%),Unnamed: 3
Satisfied,195,48.3%,
OK,127,31.4%,
Not satisfied,82,20.3%,

0,1
Distinct count,3
Unique (%),0.7%
Missing (%),0.0%
Missing (n),0

0,1
Active,220
Planning,105
Stay,79

Value,Count,Frequency (%),Unnamed: 3
Active,220,54.5%,
Planning,105,26.0%,
Stay,79,19.6%,

0,1
Distinct count,5
Unique (%),1.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,2.7178
Minimum,1
Maximum,5
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,1
Q1,2
Median,3
Q3,4
95-th percentile,5
Maximum,5
Range,4
Interquartile range,2

0,1
Standard deviation,1.2718
Coef of variation,0.46794
Kurtosis,-1.056
Mean,2.7178
MAD,1.0952
Skewness,0.18533
Sum,1098
Variance,1.6174
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
2.0,100,24.8%,
3.0,95,23.5%,
1.0,87,21.5%,
4.0,84,20.8%,
5.0,38,9.4%,

Value,Count,Frequency (%),Unnamed: 3
1.0,87,21.5%,
2.0,100,24.8%,
3.0,95,23.5%,
4.0,84,20.8%,
5.0,38,9.4%,

Value,Count,Frequency (%),Unnamed: 3
1.0,87,21.5%,
2.0,100,24.8%,
3.0,95,23.5%,
4.0,84,20.8%,
5.0,38,9.4%,

0,1
Distinct count,5
Unique (%),1.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,3.2104
Minimum,1
Maximum,5
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,1
Q1,2
Median,3
Q3,4
95-th percentile,5
Maximum,5
Range,4
Interquartile range,2

0,1
Standard deviation,1.3869
Coef of variation,0.43199
Kurtosis,-1.1932
Mean,3.2104
MAD,1.2036
Skewness,-0.28047
Sum,1297
Variance,1.9234
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
4.0,113,28.0%,
5.0,86,21.3%,
3.0,73,18.1%,
1.0,68,16.8%,
2.0,64,15.8%,

Value,Count,Frequency (%),Unnamed: 3
1.0,68,16.8%,
2.0,64,15.8%,
3.0,73,18.1%,
4.0,113,28.0%,
5.0,86,21.3%,

Value,Count,Frequency (%),Unnamed: 3
1.0,68,16.8%,
2.0,64,15.8%,
3.0,73,18.1%,
4.0,113,28.0%,
5.0,86,21.3%,

0,1
Distinct count,5
Unique (%),1.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,3.5347
Minimum,1
Maximum,5
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,1
Q1,3
Median,4
Q3,5
95-th percentile,5
Maximum,5
Range,4
Interquartile range,2

0,1
Standard deviation,1.2705
Coef of variation,0.35944
Kurtosis,-0.80268
Mean,3.5347
MAD,1.0882
Skewness,-0.53399
Sum,1428
Variance,1.6142
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
4.0,126,31.2%,
5.0,110,27.2%,
3.0,73,18.1%,
2.0,60,14.9%,
1.0,35,8.7%,

Value,Count,Frequency (%),Unnamed: 3
1.0,35,8.7%,
2.0,60,14.9%,
3.0,73,18.1%,
4.0,126,31.2%,
5.0,110,27.2%,

Value,Count,Frequency (%),Unnamed: 3
1.0,35,8.7%,
2.0,60,14.9%,
3.0,73,18.1%,
4.0,126,31.2%,
5.0,110,27.2%,

0,1
Distinct count,5
Unique (%),1.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,3.3416
Minimum,1
Maximum,5
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,1
Q1,2
Median,4
Q3,4
95-th percentile,5
Maximum,5
Range,4
Interquartile range,2

0,1
Standard deviation,1.2088
Coef of variation,0.36174
Kurtosis,-0.81131
Mean,3.3416
MAD,1.0373
Skewness,-0.4342
Sum,1350
Variance,1.4612
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
4.0,152,37.6%,
3.0,77,19.1%,
2.0,72,17.8%,
5.0,66,16.3%,
1.0,37,9.2%,

Value,Count,Frequency (%),Unnamed: 3
1.0,37,9.2%,
2.0,72,17.8%,
3.0,77,19.1%,
4.0,152,37.6%,
5.0,66,16.3%,

Value,Count,Frequency (%),Unnamed: 3
1.0,37,9.2%,
2.0,72,17.8%,
3.0,77,19.1%,
4.0,152,37.6%,
5.0,66,16.3%,

0,1
Distinct count,5
Unique (%),1.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,3.8936
Minimum,1
Maximum,5
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,2
Q1,3
Median,4
Q3,5
95-th percentile,5
Maximum,5
Range,4
Interquartile range,2

0,1
Standard deviation,1.0918
Coef of variation,0.28042
Kurtosis,0.12454
Mean,3.8936
MAD,0.8404
Skewness,-0.9026
Sum,1573
Variance,1.1921
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
4.0,150,37.1%,
5.0,139,34.4%,
3.0,63,15.6%,
2.0,37,9.2%,
1.0,15,3.7%,

Value,Count,Frequency (%),Unnamed: 3
1.0,15,3.7%,
2.0,37,9.2%,
3.0,63,15.6%,
4.0,150,37.1%,
5.0,139,34.4%,

Value,Count,Frequency (%),Unnamed: 3
1.0,15,3.7%,
2.0,37,9.2%,
3.0,63,15.6%,
4.0,150,37.1%,
5.0,139,34.4%,

0,1
Distinct count,5
Unique (%),1.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,3.4703
Minimum,1
Maximum,5
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,1
Q1,3
Median,4
Q3,4
95-th percentile,5
Maximum,5
Range,4
Interquartile range,1

0,1
Standard deviation,1.1453
Coef of variation,0.33004
Kurtosis,-0.60228
Mean,3.4703
MAD,0.97326
Skewness,-0.54423
Sum,1402
Variance,1.3118
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
4.0,169,41.8%,
3.0,71,17.6%,
5.0,70,17.3%,
2.0,69,17.1%,
1.0,25,6.2%,

Value,Count,Frequency (%),Unnamed: 3
1.0,25,6.2%,
2.0,69,17.1%,
3.0,71,17.6%,
4.0,169,41.8%,
5.0,70,17.3%,

Value,Count,Frequency (%),Unnamed: 3
1.0,25,6.2%,
2.0,69,17.1%,
3.0,71,17.6%,
4.0,169,41.8%,
5.0,70,17.3%,

0,1
Distinct count,5
Unique (%),1.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,3.4406
Minimum,1
Maximum,5
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,1
Q1,3
Median,4
Q3,4
95-th percentile,5
Maximum,5
Range,4
Interquartile range,1

0,1
Standard deviation,1.1867
Coef of variation,0.34493
Kurtosis,-0.75026
Mean,3.4406
MAD,1.0146
Skewness,-0.41294
Sum,1390
Variance,1.4084
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
4.0,135,33.4%,
3.0,91,22.5%,
5.0,83,20.5%,
2.0,67,16.6%,
1.0,28,6.9%,

Value,Count,Frequency (%),Unnamed: 3
1.0,28,6.9%,
2.0,67,16.6%,
3.0,91,22.5%,
4.0,135,33.4%,
5.0,83,20.5%,

Value,Count,Frequency (%),Unnamed: 3
1.0,28,6.9%,
2.0,67,16.6%,
3.0,91,22.5%,
4.0,135,33.4%,
5.0,83,20.5%,

0,1
Distinct count,5
Unique (%),1.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,3.3589
Minimum,1
Maximum,5
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,1
Q1,2
Median,4
Q3,4
95-th percentile,5
Maximum,5
Range,4
Interquartile range,2

0,1
Standard deviation,1.2552
Coef of variation,0.3737
Kurtosis,-0.88602
Mean,3.3589
MAD,1.0784
Skewness,-0.40773
Sum,1357
Variance,1.5756
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
4.0,135,33.4%,
5.0,80,19.8%,
3.0,80,19.8%,
2.0,68,16.8%,
1.0,41,10.1%,

Value,Count,Frequency (%),Unnamed: 3
1.0,41,10.1%,
2.0,68,16.8%,
3.0,80,19.8%,
4.0,135,33.4%,
5.0,80,19.8%,

Value,Count,Frequency (%),Unnamed: 3
1.0,41,10.1%,
2.0,68,16.8%,
3.0,80,19.8%,
4.0,135,33.4%,
5.0,80,19.8%,

0,1
Distinct count,5
Unique (%),1.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,3.6015
Minimum,1
Maximum,5
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,1
Q1,3
Median,4
Q3,5
95-th percentile,5
Maximum,5
Range,4
Interquartile range,2

0,1
Standard deviation,1.1922
Coef of variation,0.33104
Kurtosis,-0.5302
Mean,3.6015
MAD,0.99917
Skewness,-0.62182
Sum,1455
Variance,1.4214
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
4.0,145,35.9%,
5.0,103,25.5%,
3.0,75,18.6%,
2.0,54,13.4%,
1.0,27,6.7%,

Value,Count,Frequency (%),Unnamed: 3
1.0,27,6.7%,
2.0,54,13.4%,
3.0,75,18.6%,
4.0,145,35.9%,
5.0,103,25.5%,

Value,Count,Frequency (%),Unnamed: 3
1.0,27,6.7%,
2.0,54,13.4%,
3.0,75,18.6%,
4.0,145,35.9%,
5.0,103,25.5%,

0,1
Distinct count,5
Unique (%),1.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,3.401
Minimum,1
Maximum,5
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,1
Q1,2
Median,4
Q3,4
95-th percentile,5
Maximum,5
Range,4
Interquartile range,2

0,1
Standard deviation,1.2612
Coef of variation,0.37084
Kurtosis,-0.85509
Mean,3.401
MAD,1.085
Skewness,-0.48379
Sum,1374
Variance,1.5907
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
4.0,147,36.4%,
5.0,82,20.3%,
3.0,67,16.6%,
2.0,67,16.6%,
1.0,41,10.1%,

Value,Count,Frequency (%),Unnamed: 3
1.0,41,10.1%,
2.0,67,16.6%,
3.0,67,16.6%,
4.0,147,36.4%,
5.0,82,20.3%,

Value,Count,Frequency (%),Unnamed: 3
1.0,41,10.1%,
2.0,67,16.6%,
3.0,67,16.6%,
4.0,147,36.4%,
5.0,82,20.3%,

0,1
Distinct count,5
Unique (%),1.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,2.9356
Minimum,1
Maximum,5
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,1
Q1,2
Median,3
Q3,4
95-th percentile,5
Maximum,5
Range,4
Interquartile range,2

0,1
Standard deviation,1.2326
Coef of variation,0.41989
Kurtosis,-0.98127
Mean,2.9356
MAD,1.0073
Skewness,-0.06081
Sum,1186
Variance,1.5194
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
3.0,111,27.5%,
4.0,103,25.5%,
2.0,83,20.5%,
1.0,65,16.1%,
5.0,42,10.4%,

Value,Count,Frequency (%),Unnamed: 3
1.0,65,16.1%,
2.0,83,20.5%,
3.0,111,27.5%,
4.0,103,25.5%,
5.0,42,10.4%,

Value,Count,Frequency (%),Unnamed: 3
1.0,65,16.1%,
2.0,83,20.5%,
3.0,111,27.5%,
4.0,103,25.5%,
5.0,42,10.4%,

0,1
Distinct count,5
Unique (%),1.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,3.1584
Minimum,1
Maximum,5
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,1
Q1,2
Median,3
Q3,4
95-th percentile,5
Maximum,5
Range,4
Interquartile range,2

0,1
Standard deviation,1.2815
Coef of variation,0.40575
Kurtosis,-1.0524
Mean,3.1584
MAD,1.0918
Skewness,-0.22767
Sum,1276
Variance,1.6423
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
4.0,122,30.2%,
3.0,86,21.3%,
2.0,78,19.3%,
5.0,64,15.8%,
1.0,54,13.4%,

Value,Count,Frequency (%),Unnamed: 3
1.0,54,13.4%,
2.0,78,19.3%,
3.0,86,21.3%,
4.0,122,30.2%,
5.0,64,15.8%,

Value,Count,Frequency (%),Unnamed: 3
1.0,54,13.4%,
2.0,78,19.3%,
3.0,86,21.3%,
4.0,122,30.2%,
5.0,64,15.8%,

0,1
Distinct count,3
Unique (%),0.7%
Missing (%),0.0%
Missing (n),0

0,1
Good,228
Too heavy,117
Too light,59

Value,Count,Frequency (%),Unnamed: 3
Good,228,56.4%,
Too heavy,117,29.0%,
Too light,59,14.6%,

0,1
Distinct count,15
Unique (%),3.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,6.4827
Minimum,1
Maximum,15
Zeros (%),0.0%

0,1
Minimum,1.0
5-th percentile,1.0
Q1,2.0
Median,5.0
Q3,10.25
95-th percentile,14.0
Maximum,15.0
Range,14.0
Interquartile range,8.25

0,1
Standard deviation,4.5668
Coef of variation,0.70446
Kurtosis,-1.2317
Mean,6.4827
MAD,4.0281
Skewness,0.38324
Sum,2619
Variance,20.856
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
1.0,69,17.1%,
2.0,42,10.4%,
3.0,34,8.4%,
4.0,32,7.9%,
12.0,29,7.2%,
5.0,28,6.9%,
14.0,27,6.7%,
7.0,25,6.2%,
10.0,25,6.2%,
8.0,18,4.5%,

Value,Count,Frequency (%),Unnamed: 3
1.0,69,17.1%,
2.0,42,10.4%,
3.0,34,8.4%,
4.0,32,7.9%,
5.0,28,6.9%,

Value,Count,Frequency (%),Unnamed: 3
11.0,17,4.2%,
12.0,29,7.2%,
13.0,14,3.5%,
14.0,27,6.7%,
15.0,14,3.5%,

0,1
Distinct count,15
Unique (%),3.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,9.1807
Minimum,1
Maximum,15
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,3
Q1,6
Median,9
Q3,12
95-th percentile,15
Maximum,15
Range,14
Interquartile range,6

0,1
Standard deviation,3.6774
Coef of variation,0.40056
Kurtosis,-1.0173
Mean,9.1807
MAD,3.1401
Skewness,-0.1112
Sum,3709
Variance,13.523
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
8.0,46,11.4%,
6.0,40,9.9%,
13.0,36,8.9%,
10.0,33,8.2%,
12.0,32,7.9%,
15.0,31,7.7%,
11.0,30,7.4%,
14.0,30,7.4%,
9.0,28,6.9%,
7.0,26,6.4%,

Value,Count,Frequency (%),Unnamed: 3
1.0,2,0.5%,
2.0,8,2.0%,
3.0,18,4.5%,
4.0,22,5.4%,
5.0,22,5.4%,

Value,Count,Frequency (%),Unnamed: 3
11.0,30,7.4%,
12.0,32,7.9%,
13.0,36,8.9%,
14.0,30,7.4%,
15.0,31,7.7%,

0,1
Distinct count,15
Unique (%),3.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,7.9827
Minimum,1
Maximum,15
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,2
Q1,5
Median,8
Q3,11
95-th percentile,14
Maximum,15
Range,14
Interquartile range,6

0,1
Standard deviation,4.074
Coef of variation,0.51036
Kurtosis,-1.0954
Mean,7.9827
MAD,3.4537
Skewness,0.024292
Sum,3225
Variance,16.598
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
7.0,41,10.1%,
8.0,33,8.2%,
3.0,31,7.7%,
13.0,31,7.7%,
5.0,30,7.4%,
10.0,29,7.2%,
6.0,27,6.7%,
12.0,26,6.4%,
11.0,25,6.2%,
9.0,24,5.9%,

Value,Count,Frequency (%),Unnamed: 3
1.0,19,4.7%,
2.0,24,5.9%,
3.0,31,7.7%,
4.0,21,5.2%,
5.0,30,7.4%,

Value,Count,Frequency (%),Unnamed: 3
11.0,25,6.2%,
12.0,26,6.4%,
13.0,31,7.7%,
14.0,23,5.7%,
15.0,20,5.0%,

0,1
Distinct count,15
Unique (%),3.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,7.4208
Minimum,1
Maximum,15
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,1
Q1,3
Median,7
Q3,11
95-th percentile,15
Maximum,15
Range,14
Interquartile range,8

0,1
Standard deviation,4.3764
Coef of variation,0.58974
Kurtosis,-1.256
Mean,7.4208
MAD,3.8504
Skewness,0.17876
Sum,2998
Variance,19.153
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
3.0,43,10.6%,
2.0,36,8.9%,
11.0,33,8.2%,
4.0,30,7.4%,
1.0,28,6.9%,
9.0,28,6.9%,
14.0,27,6.7%,
5.0,27,6.7%,
6.0,25,6.2%,
10.0,24,5.9%,

Value,Count,Frequency (%),Unnamed: 3
1.0,28,6.9%,
2.0,36,8.9%,
3.0,43,10.6%,
4.0,30,7.4%,
5.0,27,6.7%,

Value,Count,Frequency (%),Unnamed: 3
11.0,33,8.2%,
12.0,20,5.0%,
13.0,19,4.7%,
14.0,27,6.7%,
15.0,22,5.4%,

0,1
Distinct count,15
Unique (%),3.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,6.1634
Minimum,1
Maximum,15
Zeros (%),0.0%

0,1
Minimum,1.0
5-th percentile,1.0
Q1,2.75
Median,5.0
Q3,9.0
95-th percentile,14.0
Maximum,15.0
Range,14.0
Interquartile range,6.25

0,1
Standard deviation,4.3834
Coef of variation,0.7112
Kurtosis,-0.85902
Mean,6.1634
MAD,3.7303
Skewness,0.61632
Sum,2490
Variance,19.214
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
1.0,61,15.1%,
5.0,49,12.1%,
3.0,43,10.6%,
2.0,40,9.9%,
4.0,35,8.7%,
9.0,25,6.2%,
7.0,22,5.4%,
6.0,20,5.0%,
15.0,20,5.0%,
12.0,18,4.5%,

Value,Count,Frequency (%),Unnamed: 3
1.0,61,15.1%,
2.0,40,9.9%,
3.0,43,10.6%,
4.0,35,8.7%,
5.0,49,12.1%,

Value,Count,Frequency (%),Unnamed: 3
11.0,10,2.5%,
12.0,18,4.5%,
13.0,16,4.0%,
14.0,18,4.5%,
15.0,20,5.0%,

0,1
Distinct count,15
Unique (%),3.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,8.4035
Minimum,1
Maximum,15
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,1
Q1,5
Median,9
Q3,12
95-th percentile,15
Maximum,15
Range,14
Interquartile range,7

0,1
Standard deviation,4.3264
Coef of variation,0.51483
Kurtosis,-1.138
Mean,8.4035
MAD,3.6995
Skewness,-0.10864
Sum,3395
Variance,18.718
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
15.0,37,9.2%,
11.0,34,8.4%,
6.0,34,8.4%,
9.0,32,7.9%,
7.0,32,7.9%,
2.0,30,7.4%,
14.0,28,6.9%,
10.0,27,6.7%,
13.0,27,6.7%,
8.0,26,6.4%,

Value,Count,Frequency (%),Unnamed: 3
1.0,22,5.4%,
2.0,30,7.4%,
3.0,25,6.2%,
4.0,18,4.5%,
5.0,12,3.0%,

Value,Count,Frequency (%),Unnamed: 3
11.0,34,8.4%,
12.0,20,5.0%,
13.0,27,6.7%,
14.0,28,6.9%,
15.0,37,9.2%,

0,1
Distinct count,15
Unique (%),3.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,7.1807
Minimum,1
Maximum,15
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,1
Q1,3
Median,7
Q3,11
95-th percentile,14
Maximum,15
Range,14
Interquartile range,8

0,1
Standard deviation,4.4765
Coef of variation,0.62341
Kurtosis,-1.2822
Mean,7.1807
MAD,3.9039
Skewness,0.11761
Sum,2901
Variance,20.039
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
1.0,55,13.6%,
2.0,36,8.9%,
8.0,34,8.4%,
12.0,30,7.4%,
14.0,27,6.7%,
11.0,27,6.7%,
4.0,26,6.4%,
9.0,26,6.4%,
3.0,24,5.9%,
5.0,24,5.9%,

Value,Count,Frequency (%),Unnamed: 3
1.0,55,13.6%,
2.0,36,8.9%,
3.0,24,5.9%,
4.0,26,6.4%,
5.0,24,5.9%,

Value,Count,Frequency (%),Unnamed: 3
11.0,27,6.7%,
12.0,30,7.4%,
13.0,20,5.0%,
14.0,27,6.7%,
15.0,15,3.7%,

0,1
Distinct count,15
Unique (%),3.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,8.2401
Minimum,1
Maximum,15
Zeros (%),0.0%

0,1
Minimum,1.0
5-th percentile,1.15
Q1,5.0
Median,8.0
Q3,11.0
95-th percentile,15.0
Maximum,15.0
Range,14.0
Interquartile range,6.0

0,1
Standard deviation,4.0486
Coef of variation,0.49133
Kurtosis,-0.96587
Mean,8.2401
MAD,3.388
Skewness,-0.043916
Sum,3329
Variance,16.391
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
8.0,41,10.1%,
6.0,36,8.9%,
10.0,34,8.4%,
9.0,31,7.7%,
5.0,28,6.9%,
13.0,28,6.9%,
7.0,28,6.9%,
15.0,28,6.9%,
11.0,25,6.2%,
2.0,25,6.2%,

Value,Count,Frequency (%),Unnamed: 3
1.0,21,5.2%,
2.0,25,6.2%,
3.0,12,3.0%,
4.0,23,5.7%,
5.0,28,6.9%,

Value,Count,Frequency (%),Unnamed: 3
11.0,25,6.2%,
12.0,22,5.4%,
13.0,28,6.9%,
14.0,22,5.4%,
15.0,28,6.9%,

0,1
Distinct count,15
Unique (%),3.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,7.9505
Minimum,1
Maximum,15
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,1
Q1,4
Median,8
Q3,12
95-th percentile,15
Maximum,15
Range,14
Interquartile range,8

0,1
Standard deviation,4.3166
Coef of variation,0.54293
Kurtosis,-1.2378
Mean,7.9505
MAD,3.7502
Skewness,0.026889
Sum,3212
Variance,18.633
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
4.0,34,8.4%,
13.0,30,7.4%,
11.0,30,7.4%,
3.0,29,7.2%,
9.0,29,7.2%,
5.0,27,6.7%,
15.0,26,6.4%,
7.0,26,6.4%,
2.0,26,6.4%,
12.0,25,6.2%,

Value,Count,Frequency (%),Unnamed: 3
1.0,25,6.2%,
2.0,26,6.4%,
3.0,29,7.2%,
4.0,34,8.4%,
5.0,27,6.7%,

Value,Count,Frequency (%),Unnamed: 3
11.0,30,7.4%,
12.0,25,6.2%,
13.0,30,7.4%,
14.0,25,6.2%,
15.0,26,6.4%,

0,1
Distinct count,15
Unique (%),3.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,7.1163
Minimum,1
Maximum,15
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,1
Q1,3
Median,6
Q3,11
95-th percentile,15
Maximum,15
Range,14
Interquartile range,8

0,1
Standard deviation,4.3835
Coef of variation,0.61597
Kurtosis,-1.169
Mean,7.1163
MAD,3.8305
Skewness,0.32905
Sum,2875
Variance,19.215
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
5.0,40,9.9%,
4.0,39,9.7%,
3.0,38,9.4%,
2.0,35,8.7%,
1.0,33,8.2%,
10.0,28,6.9%,
6.0,25,6.2%,
15.0,24,5.9%,
13.0,23,5.7%,
14.0,22,5.4%,

Value,Count,Frequency (%),Unnamed: 3
1.0,33,8.2%,
2.0,35,8.7%,
3.0,38,9.4%,
4.0,39,9.7%,
5.0,40,9.9%,

Value,Count,Frequency (%),Unnamed: 3
11.0,20,5.0%,
12.0,17,4.2%,
13.0,23,5.7%,
14.0,22,5.4%,
15.0,24,5.9%,

0,1
Distinct count,15
Unique (%),3.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,9.4356
Minimum,1
Maximum,15
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,2
Q1,6
Median,10
Q3,13
95-th percentile,15
Maximum,15
Range,14
Interquartile range,7

0,1
Standard deviation,3.9586
Coef of variation,0.41954
Kurtosis,-0.95649
Mean,9.4356
MAD,3.3878
Skewness,-0.32871
Sum,3812
Variance,15.671
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
15.0,43,10.6%,
12.0,41,10.1%,
13.0,36,8.9%,
8.0,33,8.2%,
10.0,33,8.2%,
14.0,32,7.9%,
6.0,30,7.4%,
11.0,30,7.4%,
9.0,25,6.2%,
7.0,24,5.9%,

Value,Count,Frequency (%),Unnamed: 3
1.0,8,2.0%,
2.0,14,3.5%,
3.0,14,3.5%,
4.0,21,5.2%,
5.0,20,5.0%,

Value,Count,Frequency (%),Unnamed: 3
11.0,30,7.4%,
12.0,41,10.1%,
13.0,36,8.9%,
14.0,32,7.9%,
15.0,43,10.6%,

0,1
Distinct count,15
Unique (%),3.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,8.9257
Minimum,1
Maximum,15
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,2
Q1,5
Median,9
Q3,13
95-th percentile,15
Maximum,15
Range,14
Interquartile range,8

0,1
Standard deviation,4.1119
Coef of variation,0.46068
Kurtosis,-1.0954
Mean,8.9257
MAD,3.5145
Skewness,-0.20983
Sum,3606
Variance,16.908
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
13.0,36,8.9%,
15.0,36,8.9%,
10.0,34,8.4%,
11.0,33,8.2%,
9.0,32,7.9%,
14.0,31,7.7%,
5.0,30,7.4%,
12.0,26,6.4%,
8.0,26,6.4%,
4.0,24,5.9%,

Value,Count,Frequency (%),Unnamed: 3
1.0,13,3.2%,
2.0,15,3.7%,
3.0,22,5.4%,
4.0,24,5.9%,
5.0,30,7.4%,

Value,Count,Frequency (%),Unnamed: 3
11.0,33,8.2%,
12.0,26,6.4%,
13.0,36,8.9%,
14.0,31,7.7%,
15.0,36,8.9%,

0,1
Distinct count,15
Unique (%),3.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,8.5965
Minimum,1
Maximum,15
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,2
Q1,5
Median,9
Q3,12
95-th percentile,15
Maximum,15
Range,14
Interquartile range,7

0,1
Standard deviation,4.0869
Coef of variation,0.47541
Kurtosis,-1.1684
Mean,8.5965
MAD,3.5363
Skewness,-0.10386
Sum,3473
Variance,16.703
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
7.0,38,9.4%,
9.0,35,8.7%,
13.0,35,8.7%,
11.0,33,8.2%,
3.0,31,7.7%,
6.0,29,7.2%,
12.0,29,7.2%,
15.0,29,7.2%,
14.0,27,6.7%,
4.0,27,6.7%,

Value,Count,Frequency (%),Unnamed: 3
1.0,8,2.0%,
2.0,22,5.4%,
3.0,31,7.7%,
4.0,27,6.7%,
5.0,18,4.5%,

Value,Count,Frequency (%),Unnamed: 3
11.0,33,8.2%,
12.0,29,7.2%,
13.0,35,8.7%,
14.0,27,6.7%,
15.0,29,7.2%,

0,1
Distinct count,15
Unique (%),3.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,9.7104
Minimum,1
Maximum,15
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,2
Q1,7
Median,10
Q3,13
95-th percentile,15
Maximum,15
Range,14
Interquartile range,6

0,1
Standard deviation,3.8724
Coef of variation,0.39879
Kurtosis,-0.81254
Mean,9.7104
MAD,3.2787
Skewness,-0.39909
Sum,3923
Variance,14.995
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
15.0,50,12.4%,
12.0,42,10.4%,
11.0,39,9.7%,
14.0,33,8.2%,
13.0,31,7.7%,
8.0,31,7.7%,
9.0,31,7.7%,
7.0,30,7.4%,
10.0,27,6.7%,
6.0,25,6.2%,

Value,Count,Frequency (%),Unnamed: 3
1.0,5,1.2%,
2.0,18,4.5%,
3.0,9,2.2%,
4.0,15,3.7%,
5.0,18,4.5%,

Value,Count,Frequency (%),Unnamed: 3
11.0,39,9.7%,
12.0,42,10.4%,
13.0,31,7.7%,
14.0,33,8.2%,
15.0,50,12.4%,

0,1
Distinct count,15
Unique (%),3.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,7.2104
Minimum,1
Maximum,15
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,1
Q1,4
Median,7
Q3,11
95-th percentile,14
Maximum,15
Range,14
Interquartile range,7

0,1
Standard deviation,4.2587
Coef of variation,0.59064
Kurtosis,-1.2115
Mean,7.2104
MAD,3.7054
Skewness,0.20139
Sum,2913
Variance,18.137
Memory size,3.3 KiB

Value,Count,Frequency (%),Unnamed: 3
4.0,37,9.2%,
1.0,35,8.7%,
2.0,33,8.2%,
14.0,32,7.9%,
3.0,31,7.7%,
5.0,31,7.7%,
6.0,30,7.4%,
10.0,27,6.7%,
12.0,27,6.7%,
8.0,25,6.2%,

Value,Count,Frequency (%),Unnamed: 3
1.0,35,8.7%,
2.0,33,8.2%,
3.0,31,7.7%,
4.0,37,9.2%,
5.0,31,7.7%,

Value,Count,Frequency (%),Unnamed: 3
11.0,18,4.5%,
12.0,27,6.7%,
13.0,22,5.4%,
14.0,32,7.9%,
15.0,9,2.2%,

Unnamed: 0,respondent_ID,channel,area_responsibility,industry,ndustry_group,Nb employees,Company size,Gender,Professional_experience,Satisfaction,job_situation,Q3a_dev_opp,Q3b_manager,Q3c_perso_contribution,Q3d_remuneration,Q3e_colleagues,Q3f_working_conditions,Q3g_align_comp_values,Q3h_sense_meaning,Q3i_job_freedom,Q3j_work_life_balance,Q3k_training_tools,Q3l_feel_challenged,workload,Q5a_job_achievements,Q5b_feedback,Q5c_teamwork,Q5d_opportunities_growth,Q5e_work_life_balance,Q5f_customer focus,Q5g_purpose_direction,Q5h_fairness,Q5i_respect_for_management,Q5j_comp_ben,Q5k_workplace,Q5l_communication,Q5m_performance,Q5n_diversity,Q5o_respect_for_employees
0,10882908921,Lausanne,Marketing and communication,Banking,Banking_financial_insurance,100 to 149,1 to 149,Female,16 to 20 years,Satisfied,Active,2.0,3.0,5.0,3.0,4.0,4.0,2.0,3.0,3.0,2.0,2.0,2.0,Too heavy,10.0,8.0,1.0,9.0,3.0,13.0,11.0,4.0,2.0,12.0,14.0,6.0,15.0,7.0,5.0
1,10879775971,Lausanne,Tax and accounting,Pharmaceutical,Healthcare_pharma,1000,>1000,Female,11 to 15 years,Not satisfied,Active,4.0,4.0,1.0,1.0,5.0,1.0,4.0,2.0,2.0,1.0,3.0,5.0,Too heavy,9.0,13.0,10.0,3.0,1.0,6.0,7.0,8.0,5.0,4.0,12.0,15.0,11.0,14.0,2.0
3,10867564537,Lausanne,Human resources,Pharmaceutical,Healthcare_pharma,150 to 299,150 to 999,Male,+20 years,Satisfied,Planning,1.0,5.0,5.0,4.0,5.0,5.0,3.0,4.0,4.0,2.0,4.0,4.0,Good,1.0,8.0,5.0,15.0,10.0,3.0,2.0,14.0,7.0,4.0,12.0,13.0,11.0,6.0,9.0
4,10862147414,Lausanne,Human resources,Health care,Healthcare_pharma,1000,>1000,Male,11 to 15 years,OK,Active,2.0,4.0,4.0,2.0,5.0,4.0,2.0,3.0,3.0,4.0,3.0,2.0,Too light,3.0,7.0,5.0,6.0,1.0,15.0,12.0,10.0,8.0,2.0,4.0,11.0,13.0,9.0,14.0
5,10861974271,Geneva,Sales,Hospitality,Others,100 to 149,1 to 149,Female,16 to 20 years,OK,Active,3.0,2.0,3.0,4.0,3.0,2.0,4.0,4.0,4.0,1.0,3.0,4.0,Too heavy,1.0,6.0,2.0,5.0,4.0,7.0,9.0,10.0,15.0,12.0,11.0,3.0,13.0,14.0,8.0


In [16]:
df_final.dtypes

respondent_ID                   int64
channel                        object
area_responsibility            object
industry                       object
ndustry_group                  object
Nb employees                   object
Company size                   object
Gender                         object
Professional_experience        object
Satisfaction                   object
job_situation                  object
Q3a_dev_opp                   float64
Q3b_manager                   float64
Q3c_perso_contribution        float64
Q3d_remuneration              float64
Q3e_colleagues                float64
Q3f_working_conditions        float64
Q3g_align_comp_values         float64
Q3h_sense_meaning             float64
Q3i_job_freedom               float64
Q3j_work_life_balance         float64
Q3k_training_tools            float64
Q3l_feel_challenged           float64
workload                       object
Q5a_job_achievements          float64
Q5b_feedback                  float64
Q5c_teamwork

## Review of numerical and categorical features
> * Create 2 dataframes for numerical and categorical features

### **A/ Categorical features**
> * We will transform the "Job situation" feature into a category type in order to use it as Target Variable
* We have 3 ordinal features: Company size, professional experience and workload
* We will manage the other ones as nominal with One-Hot encoding

In [17]:
df_final.shape

(404, 39)

In [18]:
df_final.head()

Unnamed: 0,respondent_ID,channel,area_responsibility,industry,ndustry_group,Nb employees,Company size,Gender,Professional_experience,Satisfaction,job_situation,Q3a_dev_opp,Q3b_manager,Q3c_perso_contribution,Q3d_remuneration,Q3e_colleagues,Q3f_working_conditions,Q3g_align_comp_values,Q3h_sense_meaning,Q3i_job_freedom,Q3j_work_life_balance,Q3k_training_tools,Q3l_feel_challenged,workload,Q5a_job_achievements,Q5b_feedback,Q5c_teamwork,Q5d_opportunities_growth,Q5e_work_life_balance,Q5f_customer focus,Q5g_purpose_direction,Q5h_fairness,Q5i_respect_for_management,Q5j_comp_ben,Q5k_workplace,Q5l_communication,Q5m_performance,Q5n_diversity,Q5o_respect_for_employees
0,10882908921,Lausanne,Marketing and communication,Banking,Banking_financial_insurance,100 to 149,1 to 149,Female,16 to 20 years,Satisfied,Active,2.0,3.0,5.0,3.0,4.0,4.0,2.0,3.0,3.0,2.0,2.0,2.0,Too heavy,10.0,8.0,1.0,9.0,3.0,13.0,11.0,4.0,2.0,12.0,14.0,6.0,15.0,7.0,5.0
1,10879775971,Lausanne,Tax and accounting,Pharmaceutical,Healthcare_pharma,1000,>1000,Female,11 to 15 years,Not satisfied,Active,4.0,4.0,1.0,1.0,5.0,1.0,4.0,2.0,2.0,1.0,3.0,5.0,Too heavy,9.0,13.0,10.0,3.0,1.0,6.0,7.0,8.0,5.0,4.0,12.0,15.0,11.0,14.0,2.0
3,10867564537,Lausanne,Human resources,Pharmaceutical,Healthcare_pharma,150 to 299,150 to 999,Male,+20 years,Satisfied,Planning,1.0,5.0,5.0,4.0,5.0,5.0,3.0,4.0,4.0,2.0,4.0,4.0,Good,1.0,8.0,5.0,15.0,10.0,3.0,2.0,14.0,7.0,4.0,12.0,13.0,11.0,6.0,9.0
4,10862147414,Lausanne,Human resources,Health care,Healthcare_pharma,1000,>1000,Male,11 to 15 years,OK,Active,2.0,4.0,4.0,2.0,5.0,4.0,2.0,3.0,3.0,4.0,3.0,2.0,Too light,3.0,7.0,5.0,6.0,1.0,15.0,12.0,10.0,8.0,2.0,4.0,11.0,13.0,9.0,14.0
5,10861974271,Geneva,Sales,Hospitality,Others,100 to 149,1 to 149,Female,16 to 20 years,OK,Active,3.0,2.0,3.0,4.0,3.0,2.0,4.0,4.0,4.0,1.0,3.0,4.0,Too heavy,1.0,6.0,2.0,5.0,4.0,7.0,9.0,10.0,15.0,12.0,11.0,3.0,13.0,14.0,8.0


#### a1/ Job situation and Satisfaction as category features

In [19]:
job_situation = df_final[["job_situation"]].astype("category")
satisfaction = df_final[["Satisfaction"]].astype("category")

In [20]:
job_situation.head()

Unnamed: 0,job_situation
0,Active
1,Active
3,Planning
4,Active
5,Active


#### a2/ Nominal categorical features

In [21]:
# Nominal features for One-Hot encoding
df_cat_nom = df_final[["area_responsibility", "ndustry_group", "Gender"]]

In [22]:
encode_nom = pd.get_dummies(df_cat_nom)

In [23]:
encode_nom.shape

(404, 23)

In [24]:
encode_nom.dtypes

area_responsibility_Company management             uint8
area_responsibility_Finance                        uint8
area_responsibility_Human resources                uint8
area_responsibility_Information Technology         uint8
area_responsibility_Legal                          uint8
area_responsibility_Manufacturing                  uint8
area_responsibility_Marketing and communication    uint8
area_responsibility_Other                          uint8
area_responsibility_Sales                          uint8
area_responsibility_Strategy                       uint8
area_responsibility_Supply chain                   uint8
area_responsibility_Tax and accounting             uint8
ndustry_group_Agriculture and food                 uint8
ndustry_group_Auto and manufacturing               uint8
ndustry_group_Banking_financial_insurance          uint8
ndustry_group_Consumer goods                       uint8
ndustry_group_Education                            uint8
ndustry_group_Energy_utilities 

In [25]:
# Transform dtype from uint8 to int64
encode_nom = encode_nom.astype(np.int64)

In [26]:
# Ordinal features
df_cat_ordi = df_final[["Company size", "Professional_experience", "workload"]]

In [27]:
df_cat_ordi.head()

Unnamed: 0,Company size,Professional_experience,workload
0,1 to 149,16 to 20 years,Too heavy
1,>1000,11 to 15 years,Too heavy
3,150 to 999,+20 years,Good
4,>1000,11 to 15 years,Too light
5,1 to 149,16 to 20 years,Too heavy


In [28]:
df_cat_ordi = df_cat_ordi.replace(
    {"Company size": {"1 to 149": 1, "150 to 999": 2, ">1000": 3}}
)

In [29]:
df_cat_ordi = df_cat_ordi.replace(
    {
        "Professional_experience": {
            "Less than 10 years": 1,
            "11 to 15 years": 2,
            "16 to 20 years": 3,
            "+20 years": 4,
        }
    }
)

In [30]:
df_cat_ordi = df_cat_ordi.replace(
    {"workload": {"Too light": 1, "Good": 2, "Too heavy": 3}}
)

In [31]:
df_cat_ordi.head()

Unnamed: 0,Company size,Professional_experience,workload
0,1,3,3
1,3,2,3
3,2,4,2
4,3,2,1
5,1,3,3
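
The three `replace` calls above turn ordered labels into integers. As a minimal sketch of an alternative, assuming a hypothetical toy frame rather than the confidential survey data, an ordered `pd.Categorical` makes the ordering explicit and flags unexpected labels:

```python
import pandas as pd

# Hypothetical toy frame reusing the workload labels shown above.
toy = pd.DataFrame({"workload": ["Too heavy", "Good", "Too light", "Too heavy"]})

# Declare the order explicitly; codes then run from 0 to n-1 in that order.
workload_order = ["Too light", "Good", "Too heavy"]
workload_cat = pd.Categorical(toy["workload"], categories=workload_order, ordered=True)

# Shift by 1 to match the 1-based encoding used in the notebook.
toy["workload_code"] = workload_cat.codes + 1

# A label missing from `workload_order` would get code -1 (0 after the shift),
# which makes unexpected values easy to spot.
print(toy)
```

With `replace`, a label that is missing from the mapping stays in the column as a string, so the categorical route can surface data-entry problems earlier.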


### Concatenate categorical variables in one dataframe

In [32]:
# Merge encoded nominal and ordinal features
encode_cat_1 = pd.merge(encode_nom, df_cat_ordi, right_index=True, left_index=True)
encode_cat_1.shape

(404, 26)
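
The merge above aligns the two frames on their index. A small sketch on hypothetical toy frames (the column values are made up) suggests that `DataFrame.join` and `pd.concat(axis=1)` give the same result when both frames share the same unique index labels:

```python
import pandas as pd

# Hypothetical toy frames sharing a non-contiguous index (label 2 is absent),
# mirroring the index of the frames above.
left = pd.DataFrame({"Gender_Female": [1, 1, 0]}, index=[0, 1, 3])
right = pd.DataFrame({"workload": [3, 3, 2]}, index=[0, 1, 3])

merged = pd.merge(left, right, left_index=True, right_index=True)
joined = left.join(right)
concatenated = pd.concat([left, right], axis=1)

# All three give the same frame when both indexes hold the same unique labels.
print(merged.equals(joined), merged.equals(concatenated))
```

The choice matters only when the indexes differ: `pd.merge` defaults to an inner join, `join` to a left join, and `concat` to an outer join.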

In [33]:
# Add Job_situation to the dataframe
cat_tot1 = pd.merge(encode_cat_1, job_situation, right_index=True, left_index=True)

In [34]:
cat_tot = pd.merge(cat_tot1, satisfaction, right_index=True, left_index=True)

In [35]:
cat_tot.dtypes

area_responsibility_Company management                int64
area_responsibility_Finance                           int64
area_responsibility_Human resources                   int64
area_responsibility_Information Technology            int64
area_responsibility_Legal                             int64
area_responsibility_Manufacturing                     int64
area_responsibility_Marketing and communication       int64
area_responsibility_Other                             int64
area_responsibility_Sales                             int64
area_responsibility_Strategy                          int64
area_responsibility_Supply chain                      int64
area_responsibility_Tax and accounting                int64
ndustry_group_Agriculture and food                    int64
ndustry_group_Auto and manufacturing                  int64
ndustry_group_Banking_financial_insurance             int64
ndustry_group_Consumer goods                          int64
ndustry_group_Education                 

In [36]:
cat_tot.head()

Unnamed: 0,area_responsibility_Company management,area_responsibility_Finance,area_responsibility_Human resources,area_responsibility_Information Technology,area_responsibility_Legal,area_responsibility_Manufacturing,area_responsibility_Marketing and communication,area_responsibility_Other,area_responsibility_Sales,area_responsibility_Strategy,area_responsibility_Supply chain,area_responsibility_Tax and accounting,ndustry_group_Agriculture and food,ndustry_group_Auto and manufacturing,ndustry_group_Banking_financial_insurance,ndustry_group_Consumer goods,ndustry_group_Education,ndustry_group_Energy_utilities,ndustry_group_Healthcare_pharma,ndustry_group_Info_tech_telco,ndustry_group_Others,Gender_Female,Gender_Male,Company size,Professional_experience,workload,job_situation,Satisfaction
0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,3,3,Active,Satisfied
1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,3,2,3,Active,Not satisfied
3,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,2,4,2,Planning,Satisfied
4,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,3,2,1,Active,OK
5,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,3,3,Active,OK


### **B/ Numerical features**
> * We will transform the 'Satisfaction' variable into a Target Variable (3 bins, ordinal)
* We will log-transform the numerical values, as the Profiling Report indicates that the data are not normally distributed
* Q5-related variables capture a ranking from 1st to 5th or 15th position

In [37]:
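# Select the numeric (int64 and float64) columns; this also picks up respondent_ID,
# which is an identifier rather than a survey feature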
df_num = df_final.select_dtypes(include=[np.int64, np.float64])

In [38]:
# pandas_profiling.ProfileReport(df_num)

In [39]:
df_num.shape

(404, 28)

In [40]:
df_num.head()

Unnamed: 0,respondent_ID,Q3a_dev_opp,Q3b_manager,Q3c_perso_contribution,Q3d_remuneration,Q3e_colleagues,Q3f_working_conditions,Q3g_align_comp_values,Q3h_sense_meaning,Q3i_job_freedom,Q3j_work_life_balance,Q3k_training_tools,Q3l_feel_challenged,Q5a_job_achievements,Q5b_feedback,Q5c_teamwork,Q5d_opportunities_growth,Q5e_work_life_balance,Q5f_customer focus,Q5g_purpose_direction,Q5h_fairness,Q5i_respect_for_management,Q5j_comp_ben,Q5k_workplace,Q5l_communication,Q5m_performance,Q5n_diversity,Q5o_respect_for_employees
0,10882908921,2.0,3.0,5.0,3.0,4.0,4.0,2.0,3.0,3.0,2.0,2.0,2.0,10.0,8.0,1.0,9.0,3.0,13.0,11.0,4.0,2.0,12.0,14.0,6.0,15.0,7.0,5.0
1,10879775971,4.0,4.0,1.0,1.0,5.0,1.0,4.0,2.0,2.0,1.0,3.0,5.0,9.0,13.0,10.0,3.0,1.0,6.0,7.0,8.0,5.0,4.0,12.0,15.0,11.0,14.0,2.0
3,10867564537,1.0,5.0,5.0,4.0,5.0,5.0,3.0,4.0,4.0,2.0,4.0,4.0,1.0,8.0,5.0,15.0,10.0,3.0,2.0,14.0,7.0,4.0,12.0,13.0,11.0,6.0,9.0
4,10862147414,2.0,4.0,4.0,2.0,5.0,4.0,2.0,3.0,3.0,4.0,3.0,2.0,3.0,7.0,5.0,6.0,1.0,15.0,12.0,10.0,8.0,2.0,4.0,11.0,13.0,9.0,14.0
5,10861974271,3.0,2.0,3.0,4.0,3.0,2.0,4.0,4.0,4.0,1.0,3.0,4.0,1.0,6.0,2.0,5.0,4.0,7.0,9.0,10.0,15.0,12.0,11.0,3.0,13.0,14.0,8.0


### Log transform numerical values

In [41]:
# Passing a list of functions yields two-level columns of the form (column, "log")
df_num_log = df_num.transform(func=["log"])

In [42]:
df_num_log.head()

Unnamed: 0_level_0,respondent_ID,Q3a_dev_opp,Q3b_manager,Q3c_perso_contribution,Q3d_remuneration,Q3e_colleagues,Q3f_working_conditions,Q3g_align_comp_values,Q3h_sense_meaning,Q3i_job_freedom,Q3j_work_life_balance,Q3k_training_tools,Q3l_feel_challenged,Q5a_job_achievements,Q5b_feedback,Q5c_teamwork,Q5d_opportunities_growth,Q5e_work_life_balance,Q5f_customer focus,Q5g_purpose_direction,Q5h_fairness,Q5i_respect_for_management,Q5j_comp_ben,Q5k_workplace,Q5l_communication,Q5m_performance,Q5n_diversity,Q5o_respect_for_employees
Unnamed: 0_level_1,log,log,log,log,log,log,log,log,log,log,log,log,log,log,log,log,log,log,log,log,log,log,log,log,log,log,log,log
0,23.110459,0.693147,1.098612,1.609438,1.098612,1.386294,1.386294,0.693147,1.098612,1.098612,0.693147,0.693147,0.693147,2.302585,2.079442,0.0,2.197225,1.098612,2.564949,2.397895,1.386294,0.693147,2.484907,2.639057,1.791759,2.70805,1.94591,1.609438
1,23.110171,1.386294,1.386294,0.0,0.0,1.609438,0.0,1.386294,0.693147,0.693147,0.0,1.098612,1.609438,2.197225,2.564949,2.302585,1.098612,0.0,1.791759,1.94591,2.079442,1.609438,1.386294,2.484907,2.70805,2.397895,2.639057,0.693147
3,23.109048,0.0,1.609438,1.609438,1.386294,1.609438,1.609438,1.098612,1.386294,1.386294,0.693147,1.386294,1.386294,0.0,2.079442,1.609438,2.70805,2.302585,1.098612,0.693147,2.639057,1.94591,1.386294,2.484907,2.564949,2.397895,1.791759,2.197225
4,23.10855,0.693147,1.386294,1.386294,0.693147,1.609438,1.386294,0.693147,1.098612,1.098612,1.386294,1.098612,0.693147,1.098612,1.94591,1.609438,1.791759,0.0,2.70805,2.484907,2.302585,2.079442,0.693147,1.386294,2.397895,2.564949,2.197225,2.639057
5,23.108534,1.098612,0.693147,1.098612,1.386294,1.098612,0.693147,1.386294,1.386294,1.386294,0.0,1.098612,1.386294,0.0,1.791759,0.693147,1.609438,1.386294,1.94591,2.197225,2.302585,2.70805,2.484907,2.397895,1.098612,2.564949,2.639057,2.079442
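
Whether a log transform brings a column closer to a symmetric shape depends on how that column is distributed. The sketch below uses hypothetical toy data (not the survey data) to compare skewness before and after `np.log`, which also keeps single-level column names:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one right-skewed column and one symmetric, rank-like column.
toy = pd.DataFrame(
    {
        "right_skewed": [1, 1, 1, 2, 2, 3, 5, 8, 15],
        "rank_like": [1, 3, 5, 7, 8, 9, 11, 13, 15],
    }
)

# np.log is applied element-wise and keeps the original column names.
toy_log = np.log(toy)

# Compare skewness before and after the transform, column by column.
skew_report = pd.DataFrame({"raw": toy.skew(), "log": toy_log.skew()})
print(skew_report)
```

On this toy data the log reduces the skew of the right-skewed column but pushes the symmetric, rank-like column towards a negative skew, so a per-column check of this kind may be worth running before transforming every feature.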


### **C/ Consolidate categorical and numerical variables**

In [43]:
data_model = pd.merge(cat_tot, df_num_log, right_index=True, left_index=True)

In [44]:
data_model.dtypes

area_responsibility_Company management                int64
area_responsibility_Finance                           int64
area_responsibility_Human resources                   int64
area_responsibility_Information Technology            int64
area_responsibility_Legal                             int64
area_responsibility_Manufacturing                     int64
area_responsibility_Marketing and communication       int64
area_responsibility_Other                             int64
area_responsibility_Sales                             int64
area_responsibility_Strategy                          int64
area_responsibility_Supply chain                      int64
area_responsibility_Tax and accounting                int64
ndustry_group_Agriculture and food                    int64
ndustry_group_Auto and manufacturing                  int64
ndustry_group_Banking_financial_insurance             int64
ndustry_group_Consumer goods                          int64
ndustry_group_Education                 

In [45]:
data_model.shape

(404, 56)

## Feature engineering and standardization

In [46]:
# Transform the categorical variable satisfaction into an ordinal feature
df_jobsit = data_model.replace(
    {"Satisfaction": {"Not satisfied": 0, "OK": 1, "Satisfied": 2}}
)

In [47]:
# Transform the categorical variable job_situation into an ordinal feature
df_jobsit = df_jobsit.replace(
    {"job_situation": {"Stay": 0, "Planning": 1, "Active": 2}}
)

In [48]:
# Transform job_situation type into int64
df_jobsit["job_situation"] = df_jobsit["job_situation"].astype(np.int64)
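
The `replace` and `astype` steps above assume that every label appears in the mapping. A minimal sketch on a hypothetical toy series shows one way to make that assumption explicit with `Series.map`, which returns NaN for unmapped labels:

```python
import pandas as pd

# Hypothetical toy series reusing the job_situation labels from the mapping above.
toy = pd.Series(["Active", "Planning", "Stay", "Active"], name="job_situation")
job_order = {"Stay": 0, "Planning": 1, "Active": 2}

# Series.map returns NaN for any label that is missing from the dict.
encoded = toy.map(job_order)

# Collect labels that were not mapped before casting to an integer dtype.
unmapped = toy[encoded.isna()].unique()
if len(unmapped) > 0:
    raise ValueError(f"Unmapped job_situation labels: {list(unmapped)}")

encoded = encoded.astype("int64")
print(encoded.tolist())
```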

In [49]:
df_jobsit.head()

Unnamed: 0,area_responsibility_Company management,area_responsibility_Finance,area_responsibility_Human resources,area_responsibility_Information Technology,area_responsibility_Legal,area_responsibility_Manufacturing,area_responsibility_Marketing and communication,area_responsibility_Other,area_responsibility_Sales,area_responsibility_Strategy,area_responsibility_Supply chain,area_responsibility_Tax and accounting,ndustry_group_Agriculture and food,ndustry_group_Auto and manufacturing,ndustry_group_Banking_financial_insurance,ndustry_group_Consumer goods,ndustry_group_Education,ndustry_group_Energy_utilities,ndustry_group_Healthcare_pharma,ndustry_group_Info_tech_telco,ndustry_group_Others,Gender_Female,Gender_Male,Company size,Professional_experience,workload,job_situation,Satisfaction,"(respondent_ID, log)","(Q3a_dev_opp, log)","(Q3b_manager, log)","(Q3c_perso_contribution, log)","(Q3d_remuneration, log)","(Q3e_colleagues, log)","(Q3f_working_conditions, log)","(Q3g_align_comp_values, log)","(Q3h_sense_meaning, log)","(Q3i_job_freedom, log)","(Q3j_work_life_balance, log)","(Q3k_training_tools, log)","(Q3l_feel_challenged, log)","(Q5a_job_achievements, log)","(Q5b_feedback, log)","(Q5c_teamwork, log)","(Q5d_opportunities_growth, log)","(Q5e_work_life_balance, log)","(Q5f_customer focus, log)","(Q5g_purpose_direction, log)","(Q5h_fairness, log)","(Q5i_respect_for_management, log)","(Q5j_comp_ben, log)","(Q5k_workplace, log)","(Q5l_communication, log)","(Q5m_performance, log)","(Q5n_diversity, log)","(Q5o_respect_for_employees, log)"
0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,3,3,2,2,23.110459,0.693147,1.098612,1.609438,1.098612,1.386294,1.386294,0.693147,1.098612,1.098612,0.693147,0.693147,0.693147,2.302585,2.079442,0.0,2.197225,1.098612,2.564949,2.397895,1.386294,0.693147,2.484907,2.639057,1.791759,2.70805,1.94591,1.609438
1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,3,2,3,2,0,23.110171,1.386294,1.386294,0.0,0.0,1.609438,0.0,1.386294,0.693147,0.693147,0.0,1.098612,1.609438,2.197225,2.564949,2.302585,1.098612,0.0,1.791759,1.94591,2.079442,1.609438,1.386294,2.484907,2.70805,2.397895,2.639057,0.693147
3,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,2,4,2,1,2,23.109048,0.0,1.609438,1.609438,1.386294,1.609438,1.609438,1.098612,1.386294,1.386294,0.693147,1.386294,1.386294,0.0,2.079442,1.609438,2.70805,2.302585,1.098612,0.693147,2.639057,1.94591,1.386294,2.484907,2.564949,2.397895,1.791759,2.197225
4,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,3,2,1,2,1,23.10855,0.693147,1.386294,1.386294,0.693147,1.609438,1.386294,0.693147,1.098612,1.098612,1.386294,1.098612,0.693147,1.098612,1.94591,1.609438,1.791759,0.0,2.70805,2.484907,2.302585,2.079442,0.693147,1.386294,2.397895,2.564949,2.197225,2.639057
5,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,3,3,2,1,23.108534,1.098612,0.693147,1.098612,1.386294,1.098612,0.693147,1.386294,1.386294,1.386294,0.0,1.098612,1.386294,0.0,1.791759,0.693147,1.609438,1.386294,1.94591,2.197225,2.302585,2.70805,2.484907,2.397895,1.098612,2.564949,2.639057,2.079442


In [50]:
df_jobsit.dtypes

area_responsibility_Company management               int64
area_responsibility_Finance                          int64
area_responsibility_Human resources                  int64
area_responsibility_Information Technology           int64
area_responsibility_Legal                            int64
area_responsibility_Manufacturing                    int64
area_responsibility_Marketing and communication      int64
area_responsibility_Other                            int64
area_responsibility_Sales                            int64
area_responsibility_Strategy                         int64
area_responsibility_Supply chain                     int64
area_responsibility_Tax and accounting               int64
ndustry_group_Agriculture and food                   int64
ndustry_group_Auto and manufacturing                 int64
ndustry_group_Banking_financial_insurance            int64
ndustry_group_Consumer goods                         int64
ndustry_group_Education                              int

## Standardize the numerical (float64) features; the integer-encoded categorical features are left unchanged

In [51]:
df_jobsit.shape

(404, 56)

In [52]:
df_jobsit.columns

Index([         'area_responsibility_Company management',
                           'area_responsibility_Finance',
                   'area_responsibility_Human resources',
            'area_responsibility_Information Technology',
                             'area_responsibility_Legal',
                     'area_responsibility_Manufacturing',
       'area_responsibility_Marketing and communication',
                             'area_responsibility_Other',
                             'area_responsibility_Sales',
                          'area_responsibility_Strategy',
                      'area_responsibility_Supply chain',
                'area_responsibility_Tax and accounting',
                    'ndustry_group_Agriculture and food',
                  'ndustry_group_Auto and manufacturing',
             'ndustry_group_Banking_financial_insurance',
                          'ndustry_group_Consumer goods',
                               'ndustry_group_Education',
              

In [53]:
def standardization(dataset):
    """Standardize float64 columns to a mean of zero and a standard deviation of one (z-score).

    Args:
      dataset: A `pandas.DataFrame`.

    Returns:
      A standardized copy of `dataset`; the input frame is left unchanged.
    """
    # Work on a copy so that the caller's dataframe is not modified in place.
    dataset = dataset.copy()
    dtypes = list(zip(dataset.dtypes.index, map(str, dataset.dtypes)))
    # Standardize float64 columns only; integer-encoded columns are skipped.
    for column, dtype in dtypes:
        if dtype == "float64":
            dataset[column] -= dataset[column].mean()
            dataset[column] /= dataset[column].std()
    return dataset

In [54]:
df_jobsit_norm = standardization(df_jobsit)

In [55]:
df_jobsit_norm.shape

(404, 56)
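
The z-score above divides by `Series.std()`, which in pandas defaults to the sample standard deviation (`ddof=1`). NumPy's `np.std` and scikit-learn's `StandardScaler` use the population form (`ddof=0`), so results differ slightly. A sketch on a hypothetical toy column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for one log-transformed feature.
values = pd.Series([0.0, 0.693147, 1.098612, 1.386294, 1.609438])

# pandas' Series.std() defaults to the sample standard deviation (ddof=1).
z_sample = (values - values.mean()) / values.std()

# NumPy's std defaults to the population standard deviation (ddof=0).
z_population = (values - values.mean()) / np.std(values.to_numpy())

# Each version has unit spread under its own definition of the standard deviation.
print(z_sample.mean(), z_sample.std())
print(z_population.std(ddof=0))
```

With 404 rows the two versions differ by a factor close to 1, so the choice mostly matters for reproducing numbers across libraries.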

In [56]:
df_jobsit_norm.head()

Unnamed: 0,area_responsibility_Company management,area_responsibility_Finance,area_responsibility_Human resources,area_responsibility_Information Technology,area_responsibility_Legal,area_responsibility_Manufacturing,area_responsibility_Marketing and communication,area_responsibility_Other,area_responsibility_Sales,area_responsibility_Strategy,area_responsibility_Supply chain,area_responsibility_Tax and accounting,ndustry_group_Agriculture and food,ndustry_group_Auto and manufacturing,ndustry_group_Banking_financial_insurance,ndustry_group_Consumer goods,ndustry_group_Education,ndustry_group_Energy_utilities,ndustry_group_Healthcare_pharma,ndustry_group_Info_tech_telco,ndustry_group_Others,Gender_Female,Gender_Male,Company size,Professional_experience,workload,job_situation,Satisfaction,"(respondent_ID, log)","(Q3a_dev_opp, log)","(Q3b_manager, log)","(Q3c_perso_contribution, log)","(Q3d_remuneration, log)","(Q3e_colleagues, log)","(Q3f_working_conditions, log)","(Q3g_align_comp_values, log)","(Q3h_sense_meaning, log)","(Q3i_job_freedom, log)","(Q3j_work_life_balance, log)","(Q3k_training_tools, log)","(Q3l_feel_challenged, log)","(Q5a_job_achievements, log)","(Q5b_feedback, log)","(Q5c_teamwork, log)","(Q5d_opportunities_growth, log)","(Q5e_work_life_balance, log)","(Q5f_customer focus, log)","(Q5g_purpose_direction, log)","(Q5h_fairness, log)","(Q5i_respect_for_management, log)","(Q5j_comp_ben, log)","(Q5k_workplace, log)","(Q5l_communication, log)","(Q5m_performance, log)","(Q5n_diversity, log)","(Q5o_respect_for_employees, log)"
0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,3,3,2,2,4.615814,-0.327198,0.108528,0.930459,-0.040687,0.222855,0.51225,-1.064467,-0.036605,-0.244088,-0.902278,-0.534205,-0.687065,0.848267,-0.066752,-2.721796,0.560728,-0.46887,0.869257,0.823915,-0.777973,-1.558809,0.985903,0.895972,-0.368523,1.131725,-0.37019,-0.152787
1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,3,2,3,2,0,4.536723,0.958618,0.629394,-2.493131,-2.416771,0.821714,-2.77441,0.528577,-0.882524,-1.178912,-2.339714,0.262541,1.121287,0.733049,0.905184,0.595272,-0.852763,-1.742182,-0.177475,0.309696,0.219703,-0.331304,-0.41048,0.633699,1.050988,0.641999,0.867049,-1.31761
3,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,2,4,2,1,2,4.228228,-1.613014,1.033409,0.930459,0.581514,0.821714,1.041284,-0.132596,0.563584,0.419179,-0.902278,0.827841,0.680901,-1.669756,-0.066752,-0.403265,1.217965,0.926557,-1.115846,-1.11556,1.025181,0.119449,-0.41048,0.633699,0.829298,0.641999,-0.645343,0.59443
4,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,3,2,1,2,1,4.091265,-0.327198,0.629394,0.455789,-0.917629,0.821714,0.51225,-1.064467,-0.036605,-0.244088,0.535158,0.262541,-0.687065,-0.468354,-0.334068,-0.403265,0.039051,-1.742182,1.062985,0.922907,0.540882,0.298334,-1.2915,-1.235486,0.570499,0.905773,0.078395,1.156104
5,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,3,3,2,1,4.086886,0.424956,-0.625592,-0.156167,0.581514,-0.549207,-1.13108,0.528577,0.563584,0.419179,-2.339714,0.262541,0.680901,-1.669756,-0.642662,-1.723259,-0.195527,-0.135441,0.031212,0.595614,0.540882,1.140447,0.985903,0.485658,-1.442342,0.905773,0.867049,0.444699


In [57]:
df_jobsit_norm.columns

Index([         'area_responsibility_Company management',
                           'area_responsibility_Finance',
                   'area_responsibility_Human resources',
            'area_responsibility_Information Technology',
                             'area_responsibility_Legal',
                     'area_responsibility_Manufacturing',
       'area_responsibility_Marketing and communication',
                             'area_responsibility_Other',
                             'area_responsibility_Sales',
                          'area_responsibility_Strategy',
                      'area_responsibility_Supply chain',
                'area_responsibility_Tax and accounting',
                    'ndustry_group_Agriculture and food',
                  'ndustry_group_Auto and manufacturing',
             'ndustry_group_Banking_financial_insurance',
                          'ndustry_group_Consumer goods',
                               'ndustry_group_Education',
              

In [58]:
df_jobsit_norm_final = df_jobsit_norm.rename(
    {
        ("respondent_ID", "log"): "respondent_ID_logstd",
        ("Q3a_dev_opp", "log"): "Q3a_dev_opp_logstd",
        ("Q3b_manager", "log"): "Q3b_manager_logstd",
        ("Q3c_perso_contribution", "log"): "Q3c_perso_contribution_logstd",
        ("Q3d_remuneration", "log"): "Q3d_remuneration_logstd",
        ("Q3e_colleagues", "log"): "Q3e_colleagues_logstd",
        ("Q3f_working_conditions", "log"): "Q3f_working_conditions_logstd",
        ("Q3g_align_comp_values", "log"): "Q3g_align_comp_values_logstd",
        ("Q3h_sense_meaning", "log"): "Q3h_sense_meaning_logstd",
        ("Q3i_job_freedom", "log"): "Q3i_job_freedom_logstd",
        ("Q3j_work_life_balance", "log"): "Q3j_work_life_balance_logstd",
        ("Q3k_training_tools", "log"): "Q3k_training_tools_logstd",
        ("Q3l_feel_challenged", "log"): "Q3l_feel_challenged_logstd",
        ("Q5a_job_achievements", "log"): "Q5a_job_achievements_logstd",
        ("Q5b_feedback", "log"): "Q5b_feedback_logstd",
        ("Q5c_teamwork", "log"): "Q5c_teamwork_logstd",
        ("Q5d_opportunities_growth", "log"): "Q5d_opportunities_growth_logstd",
        ("Q5e_work_life_balance", "log"): "Q5e_work_life_balance_logstd",
        ("Q5f_customer focus", "log"): "Q5f_customer focus_logstd",
        ("Q5g_purpose_direction", "log"): "Q5g_purpose_direction_logstd",
        ("Q5h_fairness", "log"): "Q5h_fairness_logstd",
        ("Q5i_respect_for_management", "log"): "Q5i_respect_for_management_logstd",
        ("Q5j_comp_ben", "log"): "Q5j_comp_ben_logstd",
        ("Q5k_workplace", "log"): "Q5k_workplace_logstd",
        ("Q5l_communication", "log"): "Q5l_communication_logstd",
        ("Q5m_performance", "log"): "Q5m_performance_logstd",
        ("Q5n_diversity", "log"): "Q5n_diversity_logstd",
        ("Q5o_respect_for_employees", "log"): "Q5o_respect_for_employees_logstd",
    },
    axis=1,
)
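
The mapping above lists each column by hand. As a minimal sketch, the same mapping could probably be generated from the column labels themselves, assuming every log-standardized column is labelled with a `(name, "log")` tuple and no other tuple labels need to be kept. The `demo` dataframe and `mapping` below are hypothetical stand-ins, not the project data.

```python
import pandas as pd

# Hypothetical miniature of df_jobsit_norm: plain string labels for the
# encoded columns, (name, "log") tuples for the log-standardized columns.
demo = pd.DataFrame(
    [[0, 1, 4.615814, -0.327198]],
    columns=[
        "Gender_Female",
        "Gender_Male",
        ("respondent_ID", "log"),
        ("Q3a_dev_opp", "log"),
    ],
)

# Build the rename mapping from the labels instead of typing it out.
# Assumption: only (name, "log") tuples should be renamed.
mapping = {
    col: f"{col[0]}_logstd"
    for col in demo.columns
    if isinstance(col, tuple) and len(col) == 2 and col[1] == "log"
}

renamed = demo.rename(columns=mapping)
print(list(renamed.columns))
```

Generating the mapping this way avoids typos in the 28 hand-written entries, but the explicit dictionary has the advantage of documenting exactly which columns are expected.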

In [59]:
df_jobsit_norm_final.head()

Unnamed: 0,area_responsibility_Company management,area_responsibility_Finance,area_responsibility_Human resources,area_responsibility_Information Technology,area_responsibility_Legal,area_responsibility_Manufacturing,area_responsibility_Marketing and communication,area_responsibility_Other,area_responsibility_Sales,area_responsibility_Strategy,area_responsibility_Supply chain,area_responsibility_Tax and accounting,ndustry_group_Agriculture and food,ndustry_group_Auto and manufacturing,ndustry_group_Banking_financial_insurance,ndustry_group_Consumer goods,ndustry_group_Education,ndustry_group_Energy_utilities,ndustry_group_Healthcare_pharma,ndustry_group_Info_tech_telco,ndustry_group_Others,Gender_Female,Gender_Male,Company size,Professional_experience,workload,job_situation,Satisfaction,respondent_ID_logstd,Q3a_dev_opp_logstd,Q3b_manager_logstd,Q3c_perso_contribution_logstd,Q3d_remuneration_logstd,Q3e_colleagues_logstd,Q3f_working_conditions_logstd,Q3g_align_comp_values_logstd,Q3h_sense_meaning_logstd,Q3i_job_freedom_logstd,Q3j_work_life_balance_logstd,Q3k_training_tools_logstd,Q3l_feel_challenged_logstd,Q5a_job_achievements_logstd,Q5b_feedback_logstd,Q5c_teamwork_logstd,Q5d_opportunities_growth_logstd,Q5e_work_life_balance_logstd,Q5f_customer focus_logstd,Q5g_purpose_direction_logstd,Q5h_fairness_logstd,Q5i_respect_for_management_logstd,Q5j_comp_ben_logstd,Q5k_workplace_logstd,Q5l_communication_logstd,Q5m_performance_logstd,Q5n_diversity_logstd,Q5o_respect_for_employees_logstd
0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,3,3,2,2,4.615814,-0.327198,0.108528,0.930459,-0.040687,0.222855,0.51225,-1.064467,-0.036605,-0.244088,-0.902278,-0.534205,-0.687065,0.848267,-0.066752,-2.721796,0.560728,-0.46887,0.869257,0.823915,-0.777973,-1.558809,0.985903,0.895972,-0.368523,1.131725,-0.37019,-0.152787
1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,3,2,3,2,0,4.536723,0.958618,0.629394,-2.493131,-2.416771,0.821714,-2.77441,0.528577,-0.882524,-1.178912,-2.339714,0.262541,1.121287,0.733049,0.905184,0.595272,-0.852763,-1.742182,-0.177475,0.309696,0.219703,-0.331304,-0.41048,0.633699,1.050988,0.641999,0.867049,-1.31761
3,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,2,4,2,1,2,4.228228,-1.613014,1.033409,0.930459,0.581514,0.821714,1.041284,-0.132596,0.563584,0.419179,-0.902278,0.827841,0.680901,-1.669756,-0.066752,-0.403265,1.217965,0.926557,-1.115846,-1.11556,1.025181,0.119449,-0.41048,0.633699,0.829298,0.641999,-0.645343,0.59443
4,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,3,2,1,2,1,4.091265,-0.327198,0.629394,0.455789,-0.917629,0.821714,0.51225,-1.064467,-0.036605,-0.244088,0.535158,0.262541,-0.687065,-0.468354,-0.334068,-0.403265,0.039051,-1.742182,1.062985,0.922907,0.540882,0.298334,-1.2915,-1.235486,0.570499,0.905773,0.078395,1.156104
5,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,3,3,2,1,4.086886,0.424956,-0.625592,-0.156167,0.581514,-0.549207,-1.13108,0.528577,0.563584,0.419179,-2.339714,0.262541,0.680901,-1.669756,-0.642662,-1.723259,-0.195527,-0.135441,0.031212,0.595614,0.540882,1.140447,0.985903,0.485658,-1.442342,0.905773,0.867049,0.444699


## Save the final dataframe as a CSV file (data_model.csv)

In [60]:
# df_jobsit_norm_final.to_csv('data_model.csv')
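
With its default settings, `to_csv` writes the row index as an unnamed first column. Since some rows were dropped earlier (the index above jumps from 1 to 3), it may be worth reading the file back with `index_col=0` so the original row labels are restored rather than turned into an extra `Unnamed: 0` column. The snippet below is a sketch on a hypothetical three-row stand-in, using an in-memory buffer instead of a file on disk.

```python
import io

import pandas as pd

# Hypothetical stand-in for df_jobsit_norm_final, with a gap in the index
demo = pd.DataFrame(
    {
        "Satisfaction": [2, 0, 2],
        "Q3a_dev_opp_logstd": [-0.327198, 0.958618, -1.613014],
    },
    index=[0, 1, 3],
)

# Write to an in-memory buffer; the index becomes an unnamed first column
buffer = io.StringIO()
demo.to_csv(buffer)

# Read it back, telling pandas that the first column holds the index
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

If the row labels are not needed downstream, passing `index=False` to `to_csv` is the alternative.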