## Problem statement

In the following I will attempt to create a dataset by simulating a real-world phenomenon.
The data is not collected but will be modelled and synthesised using Python.

The following points will be investigated:

1. I will choose a real-world phenomenon which can be measured and for which I would be able to collect at least 100 data points across at least four different variables.

2. I will investigate the types of variables involved, their likely distributions and how they are related to each other.

3. I will simulate/synthesise the dataset matching their properties as closely as possible.

4. I will detail my research and implement the simulation in this Jupyter notebook. 

## Explanation of the real-world phenomenon to be investigated

This notebook investigates absenteeism at work in customer support, using the following four variables: Day of the Week, Education, Owner of a pet, and Shift Work.
The idea comes from the following dataset:
https://archive.ics.uci.edu/ml/machine-learning-databases/00445/
but I am using different significant variables. The data could be collected in one of the multinational companies that require shift work, specifically a pattern in which the employee works a normal day shift for 4 weeks and then a shift from 16:00 to midnight.

## 2. The variables involved, likely distributions and relations

Several studies suggest that irregular shift work has a negative impact on the health of employees: specifically, they have a higher chance of developing type 2 diabetes, cognitive impairment, cardiovascular diseases, lung cancer, depression, ulcers, and sleep problems [1, 2, 3, 4]. This would be a boolean variable: either the person does shift work or not.

The day of the week also seems to have an impact on the level of absenteeism. Monday tends to have the most absences, and Friday either the fewest or the same level as Monday, depending on the study; the researchers referred to their index as the "Blue Monday syndrome" [5, 6, 7]. Across the week, the research suggests a distribution with two peaks, so in this case a bimodal distribution with an inverse bell (U) shape would be most appropriate.
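The U-shaped weekday pattern described above can be sketched with SciPy's arcsine distribution, whose density is highest near both ends of its range. This is a minimal sketch rather than a fitted model: the sample size of 100, the seed, and the mapping of the unit interval onto five weekday codes are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import arcsine

rng = np.random.default_rng(0)  # assumed seed, only for reproducibility
n = 100                         # assumed sample size

# Arcsine samples lie in [0, 1] and cluster near both ends
u = arcsine.rvs(size=n, random_state=rng)

# Map the unit interval onto weekday codes 0 (Monday) to 4 (Friday)
dow = np.minimum((u * 5).astype(int), 4)

# Count simulated absences per weekday
counts = np.bincount(dow, minlength=5)
for day, c in zip(["Mon", "Tue", "Wed", "Thu", "Fri"], counts):
    print(day, c)
```

Under this mapping Monday and Friday are equally likely on average; a study that finds Friday to have the fewest absences would need a different, asymmetric distribution.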

The level of education also seems to have an impact on the level of absenteeism: the more highly educated a person is, the fewer days of absence they have [8]. The variable would be divided into no formal education, Leaving Certificate, Bachelor, Master, and Doctorate. For this, an exponential distribution would make the most sense.
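One possible reading of the exponential suggestion is that the lowest education levels are the most common and each higher level is rarer. The sketch below draws from an exponential distribution and truncates the values to the five level codes. The scale of 1.0, the seed, the sample size, and the capping at the top level are all assumptions, not values taken from the cited study.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, only for reproducibility
n = 100                         # assumed sample size
levels = ["No formal education", "Leaving Certificate",
          "Bachelor", "Master", "Doctorate"]

# Draw from an exponential distribution; scale=1.0 is an assumed value
raw = rng.exponential(scale=1.0, size=n)

# Truncate to integer codes and cap at the highest level (4 = Doctorate)
loe = np.minimum(raw.astype(int), len(levels) - 1)

# Show how many simulated employees fall into each level
for code, name in enumerate(levels):
    print(name, int((loe == code).sum()))
```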

Several studies suggest that being a pet owner is beneficial to a person's health and also significantly reduces stress [9, 10]. Here we have another boolean variable. It could be subdivided by type of pet, but I do not think that is necessary.
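The two boolean variables (shift work and pet ownership) can each be drawn as independent yes/no trials. The sketch below is illustrative only: the probabilities `p_shift` and `p_pet` are placeholder assumptions, since the sources cited here do not give the share of shift workers or pet owners in a customer support workforce.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, only for reproducibility
n = 100                         # assumed sample size
p_shift = 0.3                   # assumed share of employees doing shift work
p_pet = 0.5                     # assumed share of employees owning a pet

# A uniform draw below p is True with probability p (a Bernoulli trial)
sw = rng.random(n) < p_shift
po = rng.random(n) < p_pet

print("Share doing shift work:", sw.mean())
print("Share owning a pet:", po.mean())
```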

In terms of relations between the variables, I would expect shift work to go together with lower education, and Mondays and Fridays to be the most likely days for a person to call in sick. I would also suggest that pet ownership may be negatively related to these: a person doing shift work is less likely to have a pet, and pet owners are less likely to be sick on Mondays and Fridays than non-pet owners.
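One way to build such relations into a simulation is to draw the variables in sequence, letting each probability depend on the variable drawn before it. The sketch below does this for education, shift work, and pet ownership. Every probability in it is an assumption chosen only to point in the direction described above; none comes from the cited studies.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # assumed seed, only for reproducibility
n = 100                         # assumed sample size

# Education code 0..4, drawn uniformly here for simplicity (assumption)
loe = rng.integers(0, 5, size=n)

# Assumed: the probability of shift work falls as education rises
p_shift_by_loe = np.array([0.7, 0.6, 0.4, 0.3, 0.2])
sw = rng.random(n) < p_shift_by_loe[loe]

# Assumed: shift workers are less likely to own a pet
p_pet = np.where(sw, 0.3, 0.6)
po = rng.random(n) < p_pet

# Collect the variables and inspect their pairwise Pearson correlations
df = pd.DataFrame({"loe": loe, "sw": sw.astype(int), "po": po.astype(int)})
print(df.corr(method="pearson").round(2))
```

Drawing the variables independently instead would give correlations close to zero, so a dependency structure of some kind is needed for the relations to show up in the correlation matrix.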

## 3. Simulation of the data

In [93]:
# Code for each variable
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import arcsine

np.random.seed(0)  # fixed seed so the output is reproducible
n = 100            # number of simulated records

# Variable for shift work: boolean (0 = no, 1 = yes); equal odds are a placeholder
sw = np.random.choice([0, 1], size=n)

# Day of the week (0 = Monday ... 4 = Friday), U-shaped via the arcsine distribution
# (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.arcsine.html)
dow = np.minimum((arcsine.rvs(size=n) * 5).astype(int), 4)

# Level of education as a code from 0 (no formal education) to 4 (Doctorate);
# uniform is a placeholder, section 2 argues for an exponential shape
loe = np.random.uniform(0, 5, n).astype(int)

# Variable for pet ownership: boolean (0 = no pet, 1 = pet); equal odds are a placeholder
po = np.random.choice([0, 1], size=n)

def absenteeism(sw, dow, loe, po):
    # True for the profile that section 2 links to more absence: shift work,
    # Monday or Friday, education level of 2 or lower, and no pet
    return (sw == 1) & np.isin(dow, [0, 4]) & (loe <= 2) & (po == 0)

# Create the dataframe with one column per variable
df = pd.DataFrame({'sw': sw, 'dow': dow, 'loe': loe, 'po': po})
print(df.head())
print('Share of records matching the high-absence profile:',
      absenteeism(sw, dow, loe, po).mean())

# Calculate the correlation between the four variables using the Pearson coefficient
print(df.corr(method='pearson'))

# Create a heatmap that uses colour to display the correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(), cmap='BuGn')
plt.show()

## 4. Detailing the research and implementation

As mentioned before, the level of absenteeism can be affected by many different factors. Shift work is associated with many adverse long-term effects: several chronic health conditions are more prevalent in shift workers, such as heart conditions, sleep problems, diabetes, and ulcers, to name a few [1, 2, 3, 4]. This in turn would have an impact on the level of absenteeism, whether due to illness or due to calling in sick without being sick. The level of absenteeism is also associated with the day of the week: Mondays and Fridays are the most popular days to be absent from work [5, 6, 7].

## References:

[1] The impact of shift work on health: https://www.medicalnewstoday.com/articles/288310.php

[2] Shift work health risks: Heart Disease, ulcers, obesity, diabetes, depression, accidents: https://www.webmd.com/sleep-disorders/features/shift-work

[3] Shiftwork - health effects - OHS Reps: http://www.ohsrep.org.au/hazards/fatigue,-impairment-and-shift-work/shiftwork-health-effects

[4] Health effects of shift work and extended hours of work: https://oem.bmj.com/content/58/1/68

[5] Statistically, what is the most popular day for people to miss work: https://www.quora.com/Statistically-what-is-the-most-popular-day-for-people-to-miss-work

[6] How to deal with Employee Absence Surge on Mondays and Fridays: https://theundercoverrecruiter.com/employee-absence-mondays-fridays/

[7] http://news.bbc.co.uk/2/hi/health/8347332.stm

[8] Hatletveit, JM. Education as a Determinant of Temporary Absenteeism: www.nsd.uib.no/data/individ/publikasjoner/Hatletveit.pdf

[9] https://www.bmj.com/content/331/7527/1252?ecoll=

[10] Social factors of sickness absence and ways of coping: https://www.tandfonline.com/doi/abs/10.1080/14623730.2012.696350

## End