docs: add figures to basic tutorial
MartinBernstorff committed Dec 8, 2022
1 parent cb85255 commit 5eb069f
Showing 5 changed files with 92 additions and 22 deletions.
114 changes: 92 additions & 22 deletions tutorials/01_basic.ipynb
@@ -1,19 +1,24 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"TimeseriesFlattener flattens timeseries. This is especially helpful if you have complicated timeseries but want to train simple models.\n",
"\n",
"For terminology, see the docs (elaborate here based on Lasses draft?).\n",
"To specify how to do this, we need a shared vocabulary:\n",
"\n",
"# Application\n",
"Now for application!\n",
"\n",
"Applying it consists of 3 steps:\n",
"\n",
"1. [Loading data](#loading-data) (prediction times, predictor(s), and outcome(s))\n",
"2. [Specifying how to flatten the data](#specifying-how-to-flatten-the-data) and\n",
"3. [Flattening](#flattening)\n",
"\n",
"\n",
"The simplest case is adding one predictor and one outcome.\n",
"\n",
"First, we'll load the timestamps for every time we want to issue a prediction:"
@@ -807,6 +812,28 @@
"### Outcome specification"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"![](img/term_a.png)\n",
"\n",
"The main decision to make for outcomes is the size of the **lookahead** window. It determines how far into the future from a given prediction time to look for outcome values. \n",
"A **prediction time** indicates at which point the model issues a prediction, and is used as a reference for the *lookahead*. "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Outcome labelling\n",
"![](img/term_b.png)\n",
"\n",
"We want labels for prediction times to be 0 if the outcome never occurs, or if the outcome happens outside the lookahead window. Labels should only be 1 if the outcome occurs inside the lookahead window. Let's specify this in code."
]
},
{
"cell_type": "code",
"execution_count": 5,
@@ -830,14 +857,16 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Interval days is how far ahead to look from the prediction time for an outcome. Since our outcome is binary, we want each prediction time to be labelled with 0 for the outcome if none is present within interval days. Therefore, we set fallback to 0. \n",
"Since our outcome is binary, we want each prediction time to be labelled with 0 for the outcome if none is present within interval days. Therefore, we set fallback to 0. \n",
"\n",
"How to handle multiple outcome values within interval days depends on your use case. In this case, we choose that any prediction time with at least one outcome (a timestamp labelled 1) within interval days is \"positive\". I.e., if there is both a 0 and a 1 within interval days, the prediction time should be labelled with a 1. We set resolve_multiple_fn = maximum to accomplish this.\n",
"\n",
"We also specify that the outcome is not incident. This means that each entity id (dw_ek_borger) can experience the outcome more than once. \n",
"\n",
"If the outcome was marked as incident, all prediction times after the entity experiences the outcome are dropped.\n",
"\n",
"Lastly, we specify a name of the outcome which'll be used when generating its column."
@@ -876,6 +905,16 @@
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"![](img/term_c.png)\n",
"\n",
"Values within the *lookbehind* window are aggregated using `resolve_multiple_fn`, for example the mean (as shown here), the maximum, or the minimum."
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -998,7 +1037,10 @@
],
"source": [
"sex_predictor_spec = StaticSpec(\n",
" values_df=df_synth_sex, feature_name=\"female\", prefix=\"pred\", input_col_name_override=\"female\"\n",
" values_df=df_synth_sex,\n",
" feature_name=\"female\",\n",
" prefix=\"pred\",\n",
" input_col_name_override=\"female\",\n",
")\n",
"\n",
"df_synth_sex"
@@ -1025,21 +1067,61 @@
"# Flattening"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"from timeseriesflattener import TimeseriesFlattener\n",
"\n",
"ts_flattener = TimeseriesFlattener(\n",
" prediction_times_df=df_prediction_times,\n",
" id_col_name=\"dw_ek_borger\",\n",
" timestamp_col_name=\"timestamp\",\n",
" n_workers=1,\n",
" drop_pred_times_with_insufficient_look_distance=True,\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We set `drop_pred_times_with_insufficient_look_distance` to `True`. This means prediction times are dropped if the *lookbehind* window extends further back in time than the start of the dataset, or if the *lookahead* window extends beyond the end of the dataset. \n",
"\n",
"![](img/term_d.png)\n",
"\n",
"\n",
"For most applications, this should be `True`: you do not want a feature to claim it looks a year into the future if you only have a month of data. This would compromise generalisability. However, in some edge cases you might want it to be `False`; see the advanced tutorial for a brief discussion."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"ts_flattener.add_spec([sex_predictor_spec, temporal_predictor_spec, outcome_spec])"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2022-12-07 13:52:04 [INFO] There were unprocessed specs, computing...\n",
"100%|██████████| 2/2 [00:00<00:00, 2.37it/s]\n",
"2022-12-07 13:52:05 [INFO] Processing complete, concatenating\n",
"2022-12-07 13:52:05 [INFO] Starting concatenation. Will take some time on performant systems, e.g. 30s for 100 features. This is normal.\n",
"2022-12-07 13:52:05 [INFO] Concatenation took 0.002 seconds\n",
"2022-12-07 13:52:05 [INFO] Merging with original df\n"
"2022-12-08 13:26:51 [INFO] There were unprocessed specs, computing...\n",
"2022-12-08 13:26:51 [INFO] _drop_pred_time_if_insufficient_look_distance: Dropped 5999 (0.6%) rows\n",
"100%|██████████| 2/2 [00:00<00:00, 3.03it/s]\n",
"2022-12-08 13:26:52 [INFO] Processing complete, concatenating\n",
"2022-12-08 13:26:52 [INFO] Starting concatenation. Will take some time on performant systems, e.g. 30s for 100 features. This is normal.\n",
"2022-12-08 13:26:52 [INFO] Concatenation took 0.001 seconds\n",
"2022-12-08 13:26:52 [INFO] Merging with original df\n"
]
},
{
@@ -1294,24 +1376,12 @@
"[4001 rows x 6 columns]"
]
},
"execution_count": 9,
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from timeseriesflattener import TimeseriesFlattener\n",
"\n",
"ts_flattener = TimeseriesFlattener(\n",
" prediction_times_df=df_prediction_times,\n",
" id_col_name=\"dw_ek_borger\",\n",
" timestamp_col_name=\"timestamp\",\n",
" n_workers=1,\n",
" drop_pred_times_with_insufficient_look_distance=True,\n",
")\n",
"\n",
"ts_flattener.add_spec([sex_predictor_spec, temporal_predictor_spec, outcome_spec])\n",
"\n",
"df = ts_flattener.get_df()\n",
"\n",
"skim(df)\n",
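The labelling rule described in the tutorial text of this diff (label 1 only if an outcome falls inside the lookahead window, otherwise the fallback of 0, with multiple values resolved by taking the maximum) can be sketched in plain pandas. This is an illustration only and not the timeseriesflattener implementation: the function `label_with_lookahead`, the `value` column name, and the choice of an exclusive lower and inclusive upper window boundary are all assumptions made for the example.

```python
import pandas as pd


def label_with_lookahead(prediction_times, outcomes, lookahead_days, fallback=0):
    """Hypothetical helper: label each prediction time with the maximum outcome
    value found inside its lookahead window, or `fallback` if none is found."""
    window = pd.Timedelta(days=lookahead_days)
    labels = []
    for row in prediction_times.itertuples(index=False):
        # Outcomes for the same entity, strictly after the prediction time and
        # no later than the end of the window (boundary choice is an assumption).
        in_window = outcomes[
            (outcomes["dw_ek_borger"] == row.dw_ek_borger)
            & (outcomes["timestamp"] > row.timestamp)
            & (outcomes["timestamp"] <= row.timestamp + window)
        ]
        labels.append(int(in_window["value"].max()) if len(in_window) else fallback)
    return prediction_times.assign(label=labels)


prediction_times = pd.DataFrame(
    {
        "dw_ek_borger": [1, 1, 2],
        "timestamp": pd.to_datetime(["2021-01-01", "2021-06-01", "2021-01-01"]),
    }
)
outcomes = pd.DataFrame(
    {
        "dw_ek_borger": [1, 2],
        "timestamp": pd.to_datetime(["2021-01-10", "2021-12-01"]),
        "value": [1, 1],
    }
)

labelled = label_with_lookahead(prediction_times, outcomes, lookahead_days=30)
print(labelled)
```

Only the first prediction time has an outcome within 30 days, so it is the only one labelled 1. The other two fall back to 0 because their entity's outcome lies before the prediction time or beyond the window.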
Binary file added tutorials/img/term_a.png
Binary file added tutorials/img/term_b.png
Binary file added tutorials/img/term_c.png
Binary file added tutorials/img/term_d.png
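The effect of `drop_pred_times_with_insufficient_look_distance=True`, as the notebook text in this commit describes it, can be approximated with a simple filter: a prediction time is kept only if its lookbehind window starts on or after the first observation and its lookahead window ends on or before the last one. The sketch below is a hedged approximation in plain pandas, not the library's code. The function name, its parameters, and the inclusive boundaries are hypothetical.

```python
import pandas as pd


def drop_insufficient_look_distance(
    prediction_times, data_start, data_end, lookbehind_days, lookahead_days
):
    """Hypothetical helper: keep only prediction times whose lookbehind and
    lookahead windows both fit inside [data_start, data_end]."""
    earliest_allowed = data_start + pd.Timedelta(days=lookbehind_days)
    latest_allowed = data_end - pd.Timedelta(days=lookahead_days)
    # Series.between is inclusive on both ends by default.
    keep = prediction_times["timestamp"].between(earliest_allowed, latest_allowed)
    return prediction_times[keep].reset_index(drop=True)


prediction_times = pd.DataFrame(
    {
        "dw_ek_borger": [1, 1, 1],
        "timestamp": pd.to_datetime(["2020-01-10", "2020-06-01", "2020-12-20"]),
    }
)

kept = drop_insufficient_look_distance(
    prediction_times,
    data_start=pd.Timestamp("2020-01-01"),
    data_end=pd.Timestamp("2020-12-31"),
    lookbehind_days=30,
    lookahead_days=30,
)
print(kept)
```

With 30-day windows on a dataset spanning 2020, the January prediction time lacks enough history and the December one lacks enough future, so only the June row survives.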
