docs: add figures to basic tutorial

Aarhus-Psychiatry-Research · Dec 8, 2022 · 5eb069f
1 parent cb85255
Showing 5 changed files with 92 additions and 22 deletions.
diff --git a/tutorials/01_basic.ipynb b/tutorials/01_basic.ipynb
@@ -1,19 +1,24 @@
 {
  "cells": [
   {
+   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "TimeseriesFlattener flattens timeseries. This is especially helpful if you have complicated timeseries but want to train simple models.\n",
     "\n",
-    "For terminology, see the docs (elaborate here based on Lasses draft?).\n",
+    "To specify how to do this, we need a shared vocabulary, which the figures in the sections below introduce.\n",
+    "\n",
+    "# Application\n",
+    "Now for application!\n",
     "\n",
     "Applying it consists of 3 steps:\n",
     "\n",
     "1. [Loading data](#loading-data) (prediction times, predictor(s), and outcome(s))\n",
     "2. [Specifying how to flatten the data](#specifying-how-to-flatten-the-data) and\n",
     "3. [Flattening](#flattening)\n",
     "\n",
+    "\n",
     "The simplest case is adding one predictor and one outcome.\n",
     "\n",
     "First, we'll load the timestamps for every time we want to issue a prediction:"
@@ -807,6 +812,28 @@
     "### Outcome specification"
    ]
   },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "![](img/term_a.png)\n",
+    "\n",
+    "The main decision to make for outcomes is the size of the **lookahead** window. It determines how far into the future from a given prediction time to look for outcome values.\n",
+    "A **prediction time** is the point at which the model issues a prediction; it serves as the reference point for the *lookahead* window."
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Outcome labelling\n",
+    "![](img/term_b.png)\n",
+    "\n",
+    "We want the label for a prediction time to be 0 if the outcome never occurs, or if it occurs outside the lookahead window. The label should be 1 only if the outcome occurs inside the lookahead window. Let's specify this in code."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 5,
@@ -830,14 +857,16 @@
    ]
   },
   {
+   "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Interval days is how far ahead to look from the prediction time for an outcome. Since our outcome is binary, we want each prediction time to be labelled with 0 for the outcome if none is present within interval days. Therefore, we set fallback to 0. \n",
+    "Since our outcome is binary, we want each prediction time to be labelled with 0 if no outcome is present within interval days (the lookahead window). Therefore, we set fallback to 0.\n",
     "\n",
     "How to handle multiple outcome values within interval days depends on your use case. In this case, we choose that any prediction time with at least one outcome (a timestamp labelled 1) within interval days is \"positive\". I.e., if there is both a 0 and a 1 within interval days, the prediction time should be labelled with a 1. We set resolve_multiple_fn = maximum to accomplish this.\n",
     "\n",
     "We also specify that the outcome is not incident. This means that each entity id (dw_ek_borger) can experience the outcome more than once. \n",
+    "\n",
     "If the outcome was marked as incident, all prediction times after the entity experiences the outcome are dropped.\n",
     "\n",
     "Lastly, we specify a name of the outcome which'll be used when generating its column."
@@ -876,6 +905,16 @@
     ")"
    ]
   },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "![](img/term_c.png)\n",
+    "\n",
+    "Values within the *lookbehind* window are aggregated using `resolve_multiple_fn`, for example the mean, as shown here, or the maximum or minimum."
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -998,7 +1037,10 @@
    ],
    "source": [
     "sex_predictor_spec = StaticSpec(\n",
-    "   values_df=df_synth_sex, feature_name=\"female\", prefix=\"pred\", input_col_name_override=\"female\"\n",
+    "    values_df=df_synth_sex,\n",
+    "    feature_name=\"female\",\n",
+    "    prefix=\"pred\",\n",
+    "    input_col_name_override=\"female\",\n",
     ")\n",
     "\n",
     "df_synth_sex"
@@ -1025,21 +1067,61 @@
     "# Flattening"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from timeseriesflattener import TimeseriesFlattener\n",
+    "\n",
+    "ts_flattener = TimeseriesFlattener(\n",
+    "    prediction_times_df=df_prediction_times,\n",
+    "    id_col_name=\"dw_ek_borger\",\n",
+    "    timestamp_col_name=\"timestamp\",\n",
+    "    n_workers=1,\n",
+    "    drop_pred_times_with_insufficient_look_distance=True,\n",
+    ")"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We set `drop_pred_times_with_insufficient_look_distance` to `True`. This means a prediction time is dropped if its *lookbehind* extends further back in time than the start of the dataset, or if its *lookahead* extends further forward than the end of the dataset.\n",
+    "\n",
+    "![](img/term_d.png)\n",
+    "\n",
+    "\n",
+    "For most applications, this should be `True`: you do not want a feature to claim it looks a year into the future if you only have a month of data, as this would compromise generalisability. However, in some edge cases you might want it to be `False`; see the advanced tutorial for a brief discussion."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 9,
    "metadata": {},
+   "outputs": [],
+   "source": [
+    "ts_flattener.add_spec([sex_predictor_spec, temporal_predictor_spec, outcome_spec])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
    "outputs": [
     {
      "name": "stderr",
      "output_type": "stream",
      "text": [
-      "2022-12-07 13:52:04 [INFO] There were unprocessed specs, computing...\n",
-      "100%|██████████| 2/2 [00:00<00:00,  2.37it/s]\n",
-      "2022-12-07 13:52:05 [INFO] Processing complete, concatenating\n",
-      "2022-12-07 13:52:05 [INFO] Starting concatenation. Will take some time on performant systems, e.g. 30s for 100 features. This is normal.\n",
-      "2022-12-07 13:52:05 [INFO] Concatenation took 0.002 seconds\n",
-      "2022-12-07 13:52:05 [INFO] Merging with original df\n"
+      "2022-12-08 13:26:51 [INFO] There were unprocessed specs, computing...\n",
+      "2022-12-08 13:26:51 [INFO] _drop_pred_time_if_insufficient_look_distance: Dropped 5999 (0.6%) rows\n",
+      "100%|██████████| 2/2 [00:00<00:00,  3.03it/s]\n",
+      "2022-12-08 13:26:52 [INFO] Processing complete, concatenating\n",
+      "2022-12-08 13:26:52 [INFO] Starting concatenation. Will take some time on performant systems, e.g. 30s for 100 features. This is normal.\n",
+      "2022-12-08 13:26:52 [INFO] Concatenation took 0.001 seconds\n",
+      "2022-12-08 13:26:52 [INFO] Merging with original df\n"
      ]
     },
     {
@@ -1294,24 +1376,12 @@
        "[4001 rows x 6 columns]"
       ]
      },
-     "execution_count": 9,
+     "execution_count": 10,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
-    "from timeseriesflattener import TimeseriesFlattener\n",
-    "\n",
-    "ts_flattener = TimeseriesFlattener(\n",
-    "    prediction_times_df=df_prediction_times,\n",
-    "    id_col_name=\"dw_ek_borger\",\n",
-    "    timestamp_col_name=\"timestamp\",\n",
-    "    n_workers=1,\n",
-    "    drop_pred_times_with_insufficient_look_distance=True,\n",
-    ")\n",
-    "\n",
-    "ts_flattener.add_spec([sex_predictor_spec, temporal_predictor_spec, outcome_spec])\n",
-    "\n",
     "df = ts_flattener.get_df()\n",
     "\n",
     "skim(df)\n",

diff --git a/tutorials/img/term_a.png b/tutorials/img/term_a.png
diff --git a/tutorials/img/term_b.png b/tutorials/img/term_b.png
diff --git a/tutorials/img/term_c.png b/tutorials/img/term_c.png
diff --git a/tutorials/img/term_d.png b/tutorials/img/term_d.png
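
The labelling and dropping rules described in the new markdown cells can be sketched in plain Python. This is an illustrative sketch only, not timeseriesflattener's implementation: the helper names `label_prediction_time` and `has_sufficient_look_distance` are hypothetical, and the choice to exclude the prediction time itself and include the window end is an assumption.

```python
from datetime import datetime, timedelta


def label_prediction_time(prediction_time, outcome_events, lookahead_days, fallback=0):
    """Label one prediction time from (timestamp, value) outcome events.

    Values inside the lookahead window are resolved with max(); if the
    window contains no values, the fallback is returned.
    Assumption: the window is (prediction_time, prediction_time + lookahead].
    """
    window_end = prediction_time + timedelta(days=lookahead_days)
    values_in_window = [
        value
        for timestamp, value in outcome_events
        if prediction_time < timestamp <= window_end
    ]
    return max(values_in_window) if values_in_window else fallback


def has_sufficient_look_distance(
    prediction_time, data_start, data_end, lookbehind_days, lookahead_days
):
    """True if both windows fit entirely inside the period covered by the data."""
    lookbehind_start = prediction_time - timedelta(days=lookbehind_days)
    lookahead_end = prediction_time + timedelta(days=lookahead_days)
    return lookbehind_start >= data_start and lookahead_end <= data_end


# Hypothetical outcome events for a single entity: one 0 and one 1.
events = [(datetime(2020, 3, 1), 0), (datetime(2020, 6, 1), 1)]

# Both events fall inside the 365-day lookahead, so max() resolves to 1.
label_early = label_prediction_time(datetime(2020, 1, 1), events, lookahead_days=365)

# No events after this prediction time, so the fallback (0) is used.
label_late = label_prediction_time(datetime(2020, 7, 1), events, lookahead_days=365)

# A prediction time whose lookahead runs past the end of the data would be dropped.
keep_late_december = has_sufficient_look_distance(
    datetime(2020, 12, 15),
    data_start=datetime(2020, 1, 1),
    data_end=datetime(2020, 12, 31),
    lookbehind_days=30,
    lookahead_days=30,
)
```

Resolving with `max()` mirrors the tutorial's choice that a window containing both a 0 and a 1 is labelled 1; swapping in another aggregation function would change only the last line of `label_prediction_time`.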