<div style="text-align: center;">
    <a href="https://www.dataia.eu/">
        <img border="0" src="https://github.com/ramp-kits/template-kit/raw/main/img/DATAIA-h.png" width="90%"></a>
</div>

# Template Kit for RAMP challenge

<i> Thomas Moreau (Inria) </i>

## Introduction
This data challenge focuses on predicting the reason for absence of employees at a Brazilian courier company from personal, work-related, and health-related factors. The motivation is to let companies and public administrations compare the reason for absence declared by an employee with the reason predicted by a machine learning model. A mismatch between the two could help detect early signs of burnout or other issues, and so help protect employees' health.

The database was created with records of absenteeism at work from July 2007 to July 2010 at a courier company in Brazil.
It contains 21 columns: 20 input features and 1 categorical target variable, `Reason_for_absence`. Each row represents one absence, described by attributes covering personal demographics, work conditions, and health status.

Creators of the dataset: Martiniano, A. & Ferreira, R. (2012). Absenteeism at work [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5X882. The dataset is provided in **CSV format**.

# Exploratory data analysis

## Features
1. **Individual ID** (`ID`): Unique identifier for each employee.
2. **Reason for absence** (`Reason_for_absence`): Categorized according to the International Classification of Diseases (ICD), plus additional reasons such as consultations, physiotherapy, and unjustified absence. (TARGET VARIABLE)
3. **Month of absence** (`Month_of_absence`): Month in which the absence occurred.
4. **Day of the week** (`Day_of_the_week`): Encoded as Monday (2) to Friday (6).
5. **Seasons** (`Seasons`): Encoded as Spring (1), Summer (2), Autumn (3), Winter (4).
6. **Transportation expense** (`Transportation_expense`): Employee's transportation cost.
7. **Distance from Residence to Work** (`Distance_from_Residence_to_Work`): Distance in kilometers.
8. **Service time** (`Service_time`): Number of years the employee has been with the company.
9. **Age** (`Age`): Employee’s age in years.
10. **Work load Average/day** (`Work_load_Average/day_`): Average workload per day.
11. **Hit target** (`Hit_target`): Performance target hit percentage.
12. **Disciplinary failure** (`Disciplinary_failure`): 1 if the employee has disciplinary failures, otherwise 0.
13. **Education** (`Education`): Education level - High school (1), Graduate (2), Postgraduate (3), Master & Doctor (4).
14. **Son** (`Son`): Number of children.
15. **Social drinker** (`Drinker`): 1 if the employee drinks socially, otherwise 0.
16. **Social smoker** (`Smoker`): 1 if the employee smokes socially, otherwise 0.
17. **Pet** (`Pet`): Number of pets owned.
18. **Weight** (`Weight`): Employee’s weight.
19. **Height** (`Height`): Employee’s height.
20. **Body mass index** (`Body_mass_index`): Calculated BMI.
21. **Absenteeism time in hours** (`Absenteeism_time_in_hours`): Total absence hours.
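
The integer encodings listed above can be turned back into readable labels with simple lookup tables. The sketch below is illustrative only: the table names and the `decode_record` helper are hypothetical, and the codes for Tuesday to Thursday are inferred from the stated range "Monday (2) to Friday (6)".

```python
# Code-to-label tables transcribed from the feature list above.
# Codes 3 to 5 (Tuesday to Thursday) are an assumption inferred from the range.
DAY_NAMES = {2: "Monday", 3: "Tuesday", 4: "Wednesday", 5: "Thursday", 6: "Friday"}
SEASON_NAMES = {1: "Spring", 2: "Summer", 3: "Autumn", 4: "Winter"}
EDUCATION_LEVELS = {
    1: "High school",
    2: "Graduate",
    3: "Postgraduate",
    4: "Master & Doctor",
}


def decode_record(record):
    """Return a copy of `record` with coded fields replaced by their labels.

    `record` is a plain dict keyed by column name. Codes that are missing
    from a table are left unchanged.
    """
    tables = {
        "Day_of_the_week": DAY_NAMES,
        "Seasons": SEASON_NAMES,
        "Education": EDUCATION_LEVELS,
    }
    decoded = dict(record)
    for column, table in tables.items():
        if column in decoded:
            decoded[column] = table.get(decoded[column], decoded[column])
    return decoded


example = {"Day_of_the_week": 3, "Seasons": 1, "Education": 1, "Age": 33}
print(decode_record(example))
```

On a pandas DataFrame the same tables can be applied column by column, for example `X["Seasons"].map(SEASON_NAMES)`.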



## Target definitions



0. **Unknown**

### Group 1: Infectious, Neoplastic, and Immune Diseases
1. **Certain infectious and parasitic diseases**
2. **Neoplasms**
3. **Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism**

### Group 2: Chronic and Metabolic Conditions
4. **Endocrine, nutritional and metabolic diseases**
9. **Diseases of the circulatory system**
10. **Diseases of the respiratory system**
11. **Diseases of the digestive system**

### Group 3: Neurological, Psychiatric, and Sensory Disorders
5. **Mental and behavioural disorders**
6. **Diseases of the nervous system**
7. **Diseases of the eye and adnexa**
8. **Diseases of the ear and mastoid process**

### Group 4: Musculoskeletal, Dermatological, and Genitourinary Conditions
12. **Diseases of the skin and subcutaneous tissue**
13. **Diseases of the musculoskeletal system and connective tissue**
14. **Diseases of the genitourinary system**
15. **Pregnancy, childbirth and the puerperium**

### Group 5: Injuries, External Causes, Pregnancy, and Other Conditions
16. **Certain conditions originating in the perinatal period**
17. **Congenital malformations, deformations and chromosomal abnormalities**
18. **Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified**
19. **Injury, poisoning and certain other consequences of external causes**
20. **External causes of morbidity and mortality**
21. **Factors influencing health status and contact with health services**

### Group 6: Non-Disease Absences (Administrative & Follow-up)
22. **Patient follow-up**
23. **Medical consultation**
24. **Blood donation**
25. **Laboratory examination**
26. **Unjustified absence**
27. **Physiotherapy**
28. **Dental consultation**
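
The grouping above can be written down as a lookup from reason code to group number. This is a minimal sketch that follows the code numbers exactly as listed. The names `GROUPS`, `REASON_TO_GROUP`, and `reason_group` are hypothetical, and placing code 0 (Unknown) in a group 0 of its own is an assumption. Whether the kit trains on the raw codes or on these groups is not stated here.

```python
# Group membership transcribed from the numbered codes listed above.
# Group 0 for the "Unknown" code is an assumption made for this sketch.
GROUPS = {
    0: [0],
    1: [1, 2, 3],
    2: [4, 9, 10, 11],
    3: [5, 6, 7, 8],
    4: [12, 13, 14, 15],
    5: [16, 17, 18, 19, 20, 21],
    6: [22, 23, 24, 25, 26, 27, 28],
}

# Invert the table: one entry per reason code.
REASON_TO_GROUP = {
    code: group for group, codes in GROUPS.items() for code in codes
}


def reason_group(code):
    """Return the group number for a `Reason_for_absence` code (0 to 28)."""
    return REASON_TO_GROUP[code]


print(reason_group(10))  # respiratory diseases fall in group 2
print(reason_group(23))  # medical consultation falls in group 6
```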
  

```python
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Show every column when displaying a DataFrame
pd.set_option('display.max_columns', None)

# Load the training data through the kit's `problem` module
import problem

X, y = problem.get_train_data()
```

## Evaluation Metric
Given the difficulty of predicting the exact reason for absence among many possible classes, the evaluation metric is the top-4 F1 score.
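
The exact definition of the top-4 F1 score is not given here; the kit's scoring code is the authority. One plausible reading, sketched below under that assumption, is to count a sample as correctly classified when its true label is among the 4 most probable classes, and to fall back to the most probable class otherwise, before computing a macro-averaged F1. The function name `top_k_f1` and the choice of macro averaging are hypothetical.

```python
import numpy as np
from sklearn.metrics import f1_score


def top_k_f1(y_true, y_proba, classes, k=4, average="macro"):
    """Hypothetical top-k F1 score.

    A sample counts as correctly classified when its true label is among
    the k most probable classes; otherwise the top-1 class is its prediction.
    """
    y_true = np.asarray(y_true)
    y_proba = np.asarray(y_proba)
    classes = np.asarray(classes)

    # Labels of the k highest-probability classes for each sample.
    top_k_labels = classes[np.argsort(y_proba, axis=1)[:, -k:]]
    # Label of the single most probable class, used when the top k miss.
    top_1_labels = classes[np.argmax(y_proba, axis=1)]

    hit = (top_k_labels == y_true[:, None]).any(axis=1)
    y_pred = np.where(hit, y_true, top_1_labels)
    return f1_score(y_true, y_pred, average=average, zero_division=0)


# Toy check: three classes, the same probability row for every sample.
classes = [0, 1, 2]
y_true = [0, 1, 2]
y_proba = [[0.5, 0.4, 0.1]] * 3
print(top_k_f1(y_true, y_proba, classes, k=1))
print(top_k_f1(y_true, y_proba, classes, k=3))
```

With a fitted scikit-learn classifier, `y_proba` would come from `predict_proba` and `classes` from the `classes_` attribute, so that column order and labels match.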

# Submission format

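A submission is a folder under `submissions/` (for example `submissions/starting_kit/`) containing an `estimator.py` file. This file must define a `get_estimator()` function that returns a scikit-learn estimator. This is the format participants should follow to submit their predictions on the RAMP platform.

This section also shows how to test a submission locally. From the kit's root directory, the `ramp-workflow` command `ramp-test --submission starting_kit` runs the starting kit end to end.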

## The pipeline workflow

The input data are stored in a pandas DataFrame. A scikit-learn column transformer can convert the DataFrame to a NumPy array, for instance by selecting only the subset of columns to work with. The starting-kit estimator below skips this step and passes every column directly to a random forest.

```python
# %load submissions/starting_kit/estimator.py

from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier


def get_estimator():
    # Single-step pipeline: a random forest with default hyperparameters
    pipe = make_pipeline(RandomForestClassifier())

    return pipe
```
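
A variant that first selects a subset of columns with a column transformer might look like the sketch below. It is not part of the starting kit: the function name `get_column_subset_estimator`, the chosen columns, and the toy data are all hypothetical and serve only to show the mechanics.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical subset of columns, chosen only for illustration.
SELECTED_COLUMNS = [
    "Month_of_absence",
    "Day_of_the_week",
    "Seasons",
    "Age",
    "Absenteeism_time_in_hours",
]


def get_column_subset_estimator():
    # Pass the selected columns through unchanged; every other column is
    # dropped because remainder="drop" is the default.
    preprocessor = make_column_transformer(("passthrough", SELECTED_COLUMNS))
    return make_pipeline(preprocessor, RandomForestClassifier(random_state=0))


# Tiny synthetic frame standing in for the real training data.
toy_X = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5, 6],
        "Month_of_absence": [7, 7, 8, 8, 9, 9],
        "Day_of_the_week": [2, 3, 4, 5, 6, 2],
        "Seasons": [1, 1, 1, 4, 4, 4],
        "Age": [33, 50, 38, 39, 33, 28],
        "Absenteeism_time_in_hours": [4, 0, 2, 4, 2, 8],
    }
)
toy_y = [26, 0, 23, 7, 23, 23]

model = get_column_subset_estimator().fit(toy_X, toy_y)
print(model.predict(toy_X).shape)
```

Here the `ID` column never reaches the classifier, which keeps the model from memorizing individual employees.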

## Testing using a scikit-learn pipeline

```python
from sklearn.model_selection import cross_val_score

# 3-fold cross-validation of the starting-kit estimator, scored by accuracy
scores = cross_val_score(get_estimator(), X, y, cv=3, scoring="accuracy")
print(scores)
```

```
[0.65151515 0.66497462 0.6751269 ]
```


## Submission

To submit your code, you can refer to the [online documentation](https://paris-saclay-cds.github.io/ramp-docs/ramp-workflow/stable/using_kits.html).