## Workplace Absenteeism Project
### 365 Data Science: Complete Data Analysis Bootcamp

This notebook preprocesses a dataset that will later be used to predict absenteeism from work. The goal is to help the company determine how to preserve high productivity rates.

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [2]:
raw_df = pd.read_csv("Absenteeism-data.csv")

In [3]:
#Making a copy of the original df
df = raw_df.copy()

In [4]:
#Checking for missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   ID                         700 non-null    int64  
 1   Reason for Absence         700 non-null    int64  
 2   Date                       700 non-null    object 
 3   Transportation Expense     700 non-null    int64  
 4   Distance to Work           700 non-null    int64  
 5   Age                        700 non-null    int64  
 6   Daily Work Load Average    700 non-null    float64
 7   Body Mass Index            700 non-null    int64  
 8   Education                  700 non-null    int64  
 9   Children                   700 non-null    int64  
 10  Pets                       700 non-null    int64  
 11  Absenteeism Time in Hours  700 non-null    int64  
dtypes: float64(1), int64(10), object(1)
memory usage: 65.8+ KB


### ID Column

Because the ID column contains nominal data that does not provide relevant information for the purpose of this task, I will drop it. 

In [5]:
df.drop(columns = 'ID', inplace = True)

### Reason for Absence Column

In [6]:
sorted(df['Reason for Absence'].unique())

[0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28]

Here I can see that the value '20' is missing.
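
The gap can also be found programmatically instead of by reading the list. The sketch below rebuilds the unique codes from the output above and assumes the legend is a contiguous range starting at 0; the names `observed`, `expected` and `missing` are mine, not part of the notebook.

```python
import pandas as pd

# Unique codes as listed in the output above (0-19 and 21-28).
observed = pd.Series(list(range(0, 20)) + list(range(21, 29)))

# Assumption: the legend is a contiguous range from 0 to the largest code.
expected = set(range(int(observed.max()) + 1))

# Codes that the legend allows but that never occur in the data.
missing = sorted(expected - set(observed.tolist()))
print(missing)
```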

####  Dummy variables
This column contains categorical nominal values that correspond to a legend of various reasons for being absent. In order to obtain numerical meaning from this column, I will turn the values into dummy variables. 

In [7]:
reason_col = pd.get_dummies(df['Reason for Absence'])

In [8]:
# Checking to ensure there are no missing values or multiple values for one person
reason_col['Check'] = reason_col.sum(axis = 1)
reason_col

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,19,21,22,23,24,25,26,27,28,Check
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
3,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
696,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
697,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
698,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1


In [9]:
reason_col['Check'].sum(axis = 0)

700

In [10]:
reason_col['Check'].unique()

array([1])

Here, I have ensured there are no missing or multiple values by doing the following:
- Created a new column containing the sum of each row's values, which should equal 1
- Checked that the sum of that column equals 700, the total number of rows
- Used the .unique() method to ensure that there are no other values, such as 0 and 2, that could still add up to 700 while indicating an error
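
The three checks above can be condensed into assertions that fail loudly if a row has no dummy or more than one. This is a minimal sketch on a five-row, made-up column; `reasons`, `dummies` and `row_totals` are illustrative names.

```python
import pandas as pd

# Small illustrative column; the values are made up for the demo.
reasons = pd.Series([26, 0, 23, 7, 23], name='Reason for Absence')
dummies = pd.get_dummies(reasons)

# Number of active dummies per row.
row_totals = dummies.sum(axis=1)

# Every row must have exactly one active dummy.
assert (row_totals == 1).all()
# Consequently, the grand total equals the number of rows.
assert row_totals.sum() == len(reasons)
```

Checking that every row total equals 1 already implies the grand total, so the second assertion is only a sanity check.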

In [11]:
reason_col = reason_col.drop(['Check'], axis = 1)

To avoid potential multicollinearity issues, I will re-create the dummy variables without the column for reason 0, which becomes the baseline category.

In [12]:
reason_col = pd.get_dummies(df['Reason for Absence'], drop_first = True)
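
To illustrate why a full set of dummies is a problem: the dummy columns always sum to 1, which duplicates the intercept column of a regression. The sketch below uses made-up codes and a hypothetical helper, `with_intercept`, to compare the matrix rank with and without the first dummy.

```python
import numpy as np
import pandas as pd

reasons = pd.Series([0, 1, 2, 1, 0, 2])  # made-up codes

full = pd.get_dummies(reasons).astype(int)
reduced = pd.get_dummies(reasons, drop_first=True).astype(int)

def with_intercept(frame):
    # Hypothetical helper: prepend a column of ones, as a model with an intercept would.
    ones = np.ones((len(frame), 1))
    return np.hstack([ones, frame.to_numpy()])

# 4 columns (intercept + 3 dummies), but one is redundant.
rank_full = np.linalg.matrix_rank(with_intercept(full))
# 3 columns (intercept + 2 dummies), all linearly independent.
rank_reduced = np.linalg.matrix_rank(with_intercept(reduced))
print(rank_full, rank_reduced)
```

Both matrices have rank 3, so the fourth column in the full version adds no information.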

#### Grouping the Reasons for Absence

Based on the legend of reasons for absence, the following grouping is the most meaningful:

- Type 1, Reasons 1-14: Various diseases
- Type 2, Reasons 15-17: Pregnancy/Giving birth
- Type 3, Reasons 18-21: Signs not classified elsewhere
- Type 4, Reasons 22-28: "Light" medical care visits

In [13]:
reason_type_1 = reason_col.loc[:, 1:14].max(axis = 1)
reason_type_2 = reason_col.loc[:, 15:17].max(axis = 1)
reason_type_3 = reason_col.loc[:, 18:21].max(axis = 1)
reason_type_4 = reason_col.loc[:, 22:].max(axis = 1)
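
An alternative that skips the intermediate dummies is to map each code straight to its group. This is only a sketch on made-up codes; `reason_group` is a hypothetical helper, and it returns 0 for the baseline code 0 or anything outside 1-28.

```python
import pandas as pd

reasons = pd.Series([26, 0, 23, 7, 16, 19, 21])  # made-up sample

def reason_group(code):
    # Hypothetical helper: group number per the legend, 0 for the baseline.
    if 1 <= code <= 14:
        return 1
    if 15 <= code <= 17:
        return 2
    if 18 <= code <= 21:
        return 3
    if 22 <= code <= 28:
        return 4
    return 0

groups = reasons.map(reason_group)

# One 0/1 flag column per group, equivalent to taking the row-wise max of the dummies.
reason_flags = pd.DataFrame({
    f'Reason_{g}': (groups == g).astype(int) for g in range(1, 5)
})
```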

#### Concatenate Column Values

In [14]:
df.drop(['Reason for Absence'], axis = 1, inplace = True)

In [16]:
df = pd.concat([df, reason_type_1, reason_type_2, reason_type_3, reason_type_4], axis = 1)  
df

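|  | Date | Transportation Expense | Distance to Work | Age | Daily Work Load Average | Body Mass Index | Education | Children | Pets | Absenteeism Time in Hours | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 07/07/2015 | 289 | 36 | 33 | 239.554 | 30 | 1 | 2 | 1 | 4 | 0 | 0 | 0 | 1 |
| 1 | 14/07/2015 | 118 | 13 | 50 | 239.554 | 31 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 15/07/2015 | 179 | 51 | 38 | 239.554 | 31 | 1 | 0 | 0 | 2 | 0 | 0 | 0 | 1 |
| 3 | 16/07/2015 | 279 | 5 | 39 | 239.554 | 24 | 1 | 2 | 0 | 4 | 1 | 0 | 0 | 0 |
| 4 | 23/07/2015 | 289 | 36 | 33 | 239.554 | 30 | 1 | 2 | 1 | 2 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 695 | 23/05/2018 | 179 | 22 | 40 | 237.656 | 22 | 2 | 2 | 0 | 8 | 1 | 0 | 0 | 0 |
| 696 | 23/05/2018 | 225 | 26 | 28 | 237.656 | 24 | 1 | 1 | 2 | 3 | 1 | 0 | 0 | 0 |
| 697 | 24/05/2018 | 330 | 16 | 28 | 237.656 | 25 | 2 | 0 | 0 | 8 | 1 | 0 | 0 | 0 |
| 698 | 24/05/2018 | 235 | 16 | 32 | 237.656 | 25 | 3 | 0 | 0 | 2 | 0 | 0 | 0 | 1 |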


In [17]:
df.columns.values

array(['Date', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours', 0, 1, 2, 3],
      dtype=object)

In [18]:
column_names = ['Date', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours', 'Reason_1', 'Reason_2','Reason_3', 'Reason_4']

In [19]:
df.columns = column_names

#### Reorder Columns

In [20]:
column_names_reordered = ['Reason_1', 'Reason_2','Reason_3', 'Reason_4','Date', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours']

In [21]:
df = df[column_names_reordered]
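
The renaming and reordering steps could also be folded into the concatenation itself by naming each Series first and listing the flags before the other columns. A rough sketch on made-up data, where `base`, `flag_a` and `flag_b` are stand-ins for df and the reason columns:

```python
import pandas as pd

base = pd.DataFrame({'Age': [33, 50, 38]})  # made-up stand-in for df

# Series.rename with a scalar sets the Series name, which becomes the column name.
flag_a = pd.Series([0, 1, 0]).rename('Reason_1')
flag_b = pd.Series([1, 0, 0]).rename('Reason_2')

# Put the named flags first so no later reordering is needed.
combined = pd.concat([flag_a, flag_b, base], axis=1)
print(list(combined.columns))
```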

In [22]:
df.head()

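|  | Reason_1 | Reason_2 | Reason_3 | Reason_4 | Date | Transportation Expense | Distance to Work | Age | Daily Work Load Average | Body Mass Index | Education | Children | Pets | Absenteeism Time in Hours |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 1 | 07/07/2015 | 289 | 36 | 33 | 239.554 | 30 | 1 | 2 | 1 | 4 |
| 1 | 0 | 0 | 0 | 0 | 14/07/2015 | 118 | 13 | 50 | 239.554 | 31 | 1 | 1 | 0 | 0 |
| 2 | 0 | 0 | 0 | 1 | 15/07/2015 | 179 | 51 | 38 | 239.554 | 31 | 1 | 0 | 0 | 2 |
| 3 | 1 | 0 | 0 | 0 | 16/07/2015 | 279 | 5 | 39 | 239.554 | 24 | 1 | 2 | 0 | 4 |
| 4 | 0 | 0 | 0 | 1 | 23/07/2015 | 289 | 36 | 33 | 239.554 | 30 | 1 | 2 | 1 | 2 |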


In [23]:
df_checkpoint = df.copy()

### Date Column

The dates are stored as strings, so I will convert the column to the datetime data type.

In [24]:
type(df['Date'][0])

str

In [25]:
df['Date'] = pd.to_datetime(df['Date'], format = '%d/%m/%Y')

In [26]:
type(df['Date'][0])

pandas._libs.tslibs.timestamps.Timestamp
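
Passing an explicit format matters here because the dates are written day first. In the sketch below the third value is made up to show an ambiguous case: with '%d/%m/%Y' it is read as 5 August, not May 8, and a value that does not match the format raises an error rather than being silently misread.

```python
import pandas as pd

# First two values come from the data above; the third is a made-up ambiguous date.
raw = pd.Series(['07/07/2015', '14/07/2015', '05/08/2015'])

# Day first, then month, then four-digit year.
parsed = pd.to_datetime(raw, format='%d/%m/%Y')
print(parsed.dt.month.tolist())
```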

#### Extract Month Value

In [27]:
list_months = []
list_months

[]

In [28]:
for i in range(df.shape[0]):
    list_months.append(df['Date'][i].month)

In [29]:
df['Month_Value'] = list_months
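
The loop works, but the same values can be obtained in one step with the .dt accessor, which is available on datetime columns. A small sketch comparing both approaches on three dates taken from the tables above (`dates`, `looped` and `vectorized` are illustrative names):

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['07/07/2015', '14/07/2015', '23/05/2018']),
                       format='%d/%m/%Y')

# Loop version, mirroring the cells above.
looped = [dates[i].month for i in range(dates.shape[0])]

# Vectorized version: one attribute access for the whole column.
vectorized = dates.dt.month
```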

In [30]:
df.head()

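|  | Reason_1 | Reason_2 | Reason_3 | Reason_4 | Date | Transportation Expense | Distance to Work | Age | Daily Work Load Average | Body Mass Index | Education | Children | Pets | Absenteeism Time in Hours | Month_Value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 1 | 2015-07-07 | 289 | 36 | 33 | 239.554 | 30 | 1 | 2 | 1 | 4 | 7 |
| 1 | 0 | 0 | 0 | 0 | 2015-07-14 | 118 | 13 | 50 | 239.554 | 31 | 1 | 1 | 0 | 0 | 7 |
| 2 | 0 | 0 | 0 | 1 | 2015-07-15 | 179 | 51 | 38 | 239.554 | 31 | 1 | 0 | 0 | 2 | 7 |
| 3 | 1 | 0 | 0 | 0 | 2015-07-16 | 279 | 5 | 39 | 239.554 | 24 | 1 | 2 | 0 | 4 | 7 |
| 4 | 0 | 0 | 0 | 1 | 2015-07-23 | 289 | 36 | 33 | 239.554 | 30 | 1 | 2 | 1 | 2 | 7 |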


#### Extract Day of the Week

In [31]:
#Return the weekday from the data in the "Date" column 

def date_to_weekday(date_value):
    return date_value.weekday()

In [32]:
# Applying the date_to_weekday function to the values in the 'Date' column, and placing that info in a new column "Day of the Week"
df['Day of the Week'] = df['Date'].apply(date_to_weekday)
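
For reference, weekday() counts from Monday = 0 to Sunday = 6. The sketch below, on three dates from the tables above, shows the equivalent .dt.weekday attribute together with .dt.day_name(), which can help when sanity-checking the numbers.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['07/07/2015', '15/07/2015', '16/07/2015']),
                       format='%d/%m/%Y')

# Monday = 0 ... Sunday = 6, same convention as Timestamp.weekday().
weekday_numbers = dates.dt.weekday

# Human-readable names for cross-checking the numeric codes.
weekday_names = dates.dt.day_name()
```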

In [33]:
df.head()

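|  | Reason_1 | Reason_2 | Reason_3 | Reason_4 | Date | Transportation Expense | Distance to Work | Age | Daily Work Load Average | Body Mass Index | Education | Children | Pets | Absenteeism Time in Hours | Month_Value | Day of the Week |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 1 | 2015-07-07 | 289 | 36 | 33 | 239.554 | 30 | 1 | 2 | 1 | 4 | 7 | 1 |
| 1 | 0 | 0 | 0 | 0 | 2015-07-14 | 118 | 13 | 50 | 239.554 | 31 | 1 | 1 | 0 | 0 | 7 | 1 |
| 2 | 0 | 0 | 0 | 1 | 2015-07-15 | 179 | 51 | 38 | 239.554 | 31 | 1 | 0 | 0 | 2 | 7 | 2 |
| 3 | 1 | 0 | 0 | 0 | 2015-07-16 | 279 | 5 | 39 | 239.554 | 24 | 1 | 2 | 0 | 4 | 7 | 3 |
| 4 | 0 | 0 | 0 | 1 | 2015-07-23 | 289 | 36 | 33 | 239.554 | 30 | 1 | 2 | 1 | 2 | 7 | 3 |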


In [34]:
df = df.drop(['Date'], axis = 1)

In [35]:
df.columns

Index(['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4',
       'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education', 'Children',
       'Pets', 'Absenteeism Time in Hours', 'Month_Value', 'Day of the Week'],
      dtype='object')

In [36]:
new_column_names = ['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Month_Value', 'Day of the Week',
       'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education', 'Children',
       'Pets', 'Absenteeism Time in Hours']

In [37]:
df = df[new_column_names]

In [38]:
df.head()

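|  | Reason_1 | Reason_2 | Reason_3 | Reason_4 | Month_Value | Day of the Week | Transportation Expense | Distance to Work | Age | Daily Work Load Average | Body Mass Index | Education | Children | Pets | Absenteeism Time in Hours |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 1 | 7 | 1 | 289 | 36 | 33 | 239.554 | 30 | 1 | 2 | 1 | 4 |
| 1 | 0 | 0 | 0 | 0 | 7 | 1 | 118 | 13 | 50 | 239.554 | 31 | 1 | 1 | 0 | 0 |
| 2 | 0 | 0 | 0 | 1 | 7 | 2 | 179 | 51 | 38 | 239.554 | 31 | 1 | 0 | 0 | 2 |
| 3 | 1 | 0 | 0 | 0 | 7 | 3 | 279 | 5 | 39 | 239.554 | 24 | 1 | 2 | 0 | 4 |
| 4 | 0 | 0 | 0 | 1 | 7 | 3 | 289 | 36 | 33 | 239.554 | 30 | 1 | 2 | 1 | 2 |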


The following columns do not need manipulation and are relevant to our analysis, so they will remain in their current form:

- Transportation Expense
- Distance to Work
- Age
- Daily Work Load Average
- Body Mass Index

### Education Column

In its current form, this column contains categorical features that do not have numerical meaning. Therefore, I will transform it into a dummy variable. 

In [39]:
df['Education'].unique()

array([1, 3, 2, 4])

In [40]:
df['Education'].value_counts()

1    583
3     73
2     40
4      4
Name: Education, dtype: int64

In this dataset, 1 = high school, 2 = undergraduate, 3 = graduate, and 4 = postgraduate. For the purposes of our analysis we will categorize individuals as having completed high school (0) or having completed some level higher than high school (1).

In [41]:
df['Education']= df['Education'].map({1:0, 2:1, 3:1, 4:1})

In [42]:
df['Education'].unique()

array([0, 1])

In [43]:
df['Education'].value_counts()

0    583
1    117
Name: Education, dtype: int64
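
One caveat with a dictionary mapping: .map() returns NaN for any value that is not a key, so an unexpected education code would silently become missing. The sketch below, on a made-up sample, shows that behaviour and a threshold comparison that gives the same result for codes 1-4; `partial` and `thresholded` are illustrative names.

```python
import pandas as pd

education = pd.Series([1, 3, 2, 4, 1])  # made-up sample using the four codes

# Same mapping as above: high school -> 0, anything higher -> 1.
mapped = education.map({1: 0, 2: 1, 3: 1, 4: 1})

# Equivalent for codes 1-4: anything above high school becomes 1.
thresholded = (education > 1).astype(int)

# An incomplete dictionary leaves NaN for the unmapped codes.
partial = education.map({1: 0})
print(partial.isna().sum())
```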

In [44]:
df.to_csv('cleaned_Workplace_Absenteeism_df.csv')
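
Note that to_csv writes the row index by default, so reading the file back would add an 'Unnamed: 0' column. The sketch below demonstrates this on two made-up rows, using an in-memory buffer instead of a file; passing index = False is one way to avoid the extra column if the index carries no information.

```python
import io
import pandas as pd

frame = pd.DataFrame({'Education': [0, 1], 'Pets': [1, 0]})  # made-up rows

# Default behaviour: the index is written as an unnamed first column.
with_index = pd.read_csv(io.StringIO(frame.to_csv()))

# index=False keeps only the data columns.
without_index = pd.read_csv(io.StringIO(frame.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```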

### Conclusion

In this project I have used dummy variables to give numerical meaning to categorical data, grouped the reasons for absence into four types, and transformed the date column into two columns ("Month_Value" and "Day of the Week") that will be more useful for analysis.