NannyML provides a synthetic dataset describing a binary classification problem, to make it easier to test and document its features.
To find out what requirements NannyML has for datasets, check out Data Requirements<data_requirements>
.
The dataset describes a machine learning model that tries to predict whether an employee will work from home on the next day.
A sample of the dataset can be seen below.
>>> import nannyml as nml
>>> reference, analysis, analysis_targets = nml.datasets.load_synthetic_binary_classification_dataset()
>>> display(reference.head(3))
distance_from_office | salary_range | gas_price_per_litre | public_transportation_cost | wfh_prev_workday | workday | tenure | identifier | work_home_actual | timestamp | y_pred_proba | y_pred | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
40K - 60K € |
|
|
False | Friday | 0.212653 |
|
|
2014-05-09 22:27:20 |
|
|
|
|
40K - 60K € |
|
|
True | Tuesday | 4.92755 |
|
|
2014-05-09 22:59:32 |
|
|
|
|
40K - 60K € |
|
|
False | Monday | 0.520817 |
|
|
2014-05-09 23:48:25 |
|
|
The model uses 7 features:
- `distance_from_office`: A numerical feature. The distance in kilometers from the employee's house to the workplace.
- `salary_range`: A categorical feature with 4 categories that identify the range the employee's yearly income falls within.
- `gas_price_per_litre`: A numerical feature. The price of gas per litre close to the employee's residence.
- `public_transportation_cost`: A numerical feature. The price, in euros, of public transportation from the employee's residence to the workplace.
- `wfh_prev_workday`: A categorical feature with 2 categories, stating whether the employee worked from home the previous workday.
- `workday`: A categorical feature with 5 categories. The day of the week where we want to predict whether the employee will work from home.
- `tenure`: A numerical feature describing how many years the employee has been at the company.
The model predicts the probability of the employee working from home, recorded in the y_pred_proba column. A binary prediction is also available from the y_pred column. The work_home_actual is the Target
column describing what actually happened.
There are also three auxiliary columns that are helpful but not used by the monitored model:
- `identifier`: A unique number referencing each employee. This is very useful for joining the target results on the analysis dataset, when we want to
compare estimated with realized performace.<compare_estimated_and_realized_performance>
. - `timestamp`: A date column informing us of the date the prediction was made.