# Introductory Tutorial


TimeseriesFlattener flattens time series: it aggregates timestamped observations into one row of features per prediction time. This is especially helpful if you have complicated and irregular time series but want to train simple models.

We explain terminology as needed in this tutorial. If you need a reference, see the [docs](https://aarhus-psychiatry-research.github.io/timeseriesflattener/#functionality).

Applying it consists of 3 steps:

1. Loading data (prediction times, predictor(s), and outcome(s))
2. Specifying how to flatten the data and
3. Flattening

The simplest case is adding one predictor and one outcome.

First, we'll load the timestamps at which we want to issue predictions.


## Loading data


### Loading prediction times

Prediction times consist of two elements:

1. The entity id. This is the entity about which the prediction is issued. In medical contexts, this is frequently a patient.
2. The timestamp at which the prediction is to be issued.


In [1]:
from __future__ import annotations

from skimpy import skim
from timeseriesflattener.testing.load_synth_data import load_synth_prediction_times

df_prediction_times = load_synth_prediction_times()

skim(df_prediction_times)
df_prediction_times.sort(["entity_id"])

entity_id,timestamp
i64,datetime[μs]
0,1969-01-11 09:55:00
1,1965-03-15 07:16:00
2,1969-09-13 23:18:00
3,1968-02-04 16:16:00
4,1965-01-28 12:33:00
5,1967-10-09 06:22:00
7,1969-11-17 02:50:00
8,1965-12-11 02:17:00
9,1965-08-21 22:00:00
10,1965-12-26 16:45:00


Here, `entity_id` represents a patient ID and `timestamp` refers to the time when we want to issue a prediction. Note that each ID can have multiple prediction times.


### Loading a temporal predictor

Then, we'll load the values for our temporal predictor. Temporal predictors are predictors that can have a different value at different timepoints.


In [None]:
from timeseriesflattener.testing.load_synth_data import load_synth_predictor_float

df_synth_predictors = load_synth_predictor_float()

skim(df_synth_predictors)
df_synth_predictors.sort(["entity_id"])

entity_id,timestamp,value
i64,datetime[μs],f64
0,1967-06-12 14:06:00,0.174793
0,1968-04-15 01:45:00,3.072293
0,1968-12-09 05:42:00,1.315754
0,1969-06-20 18:07:00,2.812481
0,1967-11-26 01:59:00,2.981185
0,1968-09-07 01:45:00,0.173205
0,1969-02-21 03:29:00,9.943505
0,1967-11-26 13:45:00,5.470792
0,1967-05-12 12:44:00,0.970382
0,1965-05-03 05:23:00,6.630007


Once again, note that there can be multiple values for each ID.


### Loading a static predictor

Frequently, you'll have one or more static predictors describing each entity. In this case, an entity is a patient, and an example of a static predictor could be their sex. It doesn't change over time (it's static), but can be used as a predictor for each prediction time. Let's load it in!


In [None]:
from timeseriesflattener.testing.load_synth_data import load_synth_sex

df_synth_sex = load_synth_sex()

skim(df_synth_sex)
df_synth_sex

entity_id,female
i64,i64
0,0
1,1
2,1
3,1
4,0
5,0
6,0
7,0
8,1
9,0


As the predictor is static, there should only be a single value for each ID in this dataframe.
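
A quick sanity check for this one-row-per-ID requirement can be sketched in plain Python. The helper `duplicated_ids` is a hypothetical name used for illustration and is not part of timeseriesflattener; the rows mirror the first entries of the table above.

```python
from __future__ import annotations

from collections import Counter


def duplicated_ids(rows: list[tuple[int, int]]) -> list[int]:
    """Return entity ids that appear in more than one (entity_id, value) row."""
    counts = Counter(entity_id for entity_id, _ in rows)
    return sorted(entity_id for entity_id, n in counts.items() if n > 1)


# The first rows of df_synth_sex as (entity_id, female) pairs.
sex_rows = [(0, 0), (1, 1), (2, 1), (3, 1), (4, 0)]
print(duplicated_ids(sex_rows))  # []

# A hypothetical malformed input where entity 1 has two rows.
print(duplicated_ids(sex_rows + [(1, 0)]))  # [1]
```

An empty result means every entity has exactly one static value.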


### Loading a temporal outcome


And, lastly, our outcome values. We've chosen a binary outcome and only stored the timestamps at which a patient experiences the outcome. Patients without a timestamp in this dataframe do not experience the outcome. We handle this by setting a fallback of 0; more on that in the following section.


In [None]:
from timeseriesflattener.testing.load_synth_data import load_synth_outcome

df_synth_outcome = load_synth_outcome()

skim(df_synth_outcome)
df_synth_outcome

entity_id,timestamp,value
i64,datetime[μs],i64
1727,1966-08-01 07:08:00,1
2269,1965-09-07 11:14:00,1
6050,1969-10-28 22:11:00,1
2230,1966-10-06 16:26:00,1
6449,1966-02-06 14:18:00,1
3165,1965-03-03 18:44:00,1
9602,1968-07-01 03:24:00,1
9569,1968-02-07 11:17:00,1
12,1965-04-17 07:17:00,1
9224,1967-12-01 09:23:00,1


This dataframe should contain at most one row per ID: the first time that patient experiences the outcome.
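
If your raw data holds several outcome events per patient, reducing it to the first event could look like the sketch below. It uses plain Python rather than the library, `first_outcome_per_entity` is a hypothetical helper, and the second event for entity 12 is invented for illustration.

```python
from __future__ import annotations

import datetime as dt


def first_outcome_per_entity(
    events: list[tuple[int, dt.datetime]],
) -> dict[int, dt.datetime]:
    """Keep only the earliest outcome timestamp for each entity id."""
    first: dict[int, dt.datetime] = {}
    for entity_id, timestamp in events:
        if entity_id not in first or timestamp < first[entity_id]:
            first[entity_id] = timestamp
    return first


events = [
    (12, dt.datetime(1966, 1, 1, 0, 0)),  # hypothetical later event
    (12, dt.datetime(1965, 4, 17, 7, 17)),
    (1727, dt.datetime(1966, 8, 1, 7, 8)),
]
first_outcomes = first_outcome_per_entity(events)
print(first_outcomes[12])  # 1965-04-17 07:17:00
```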

We now have 4 dataframes loaded: df_prediction_times, df_synth_predictors, df_synth_sex and df_synth_outcome.


## Specifying how to flatten the data

We'll have to specify how to flatten predictors and outcomes. To do this, we use the feature specification objects as "recipes" for each column in our finished dataframe. First, we'll define the outcome specification.


### Temporal outcome specification


![](img/term_a.png)

The main decision to make for outcomes is the size of the **lookahead** window. It determines how far into the future from a given prediction time to look for outcome values.
A **prediction time** indicates at which point the model issues a prediction, and is used as a reference for the _lookahead_.


![](img/term_b.png)

We want labels for prediction times to be 0 if the outcome never occurs, or if the outcome happens outside the lookahead window. Labels should only be 1 if the outcome occurs inside the lookahead window. Let's specify this in code.


In [None]:
import datetime as dt

import pandas as pd
from timeseriesflattener import BooleanOutcomeSpec, TimestampValueFrame, ValueFrame
from timeseriesflattener.aggregators import MaxAggregator

test_df = pd.DataFrame({"entity_id": [0], "timestamp": [pd.Timestamp("2020-01-01")]})

outcome_spec = BooleanOutcomeSpec(
    init_frame=TimestampValueFrame(
        entity_id_col_name="entity_id", init_df=test_df, value_timestamp_col_name="timestamp"
    ),
    lookahead_distances=[dt.timedelta(days=365)],
    aggregators=[MaxAggregator()],
    output_name="outcome",
    column_prefix="outc",
)
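
The labeling rule described before this specification (1 only if the outcome occurs inside the lookahead window, otherwise 0) can be sketched independently of the library. This is a minimal illustration, not the library's implementation: `label_prediction_time` is a hypothetical helper, and whether the window boundaries are inclusive is an assumption made here.

```python
from __future__ import annotations

import datetime as dt


def label_prediction_time(
    prediction_time: dt.datetime,
    outcome_times: list[dt.datetime],
    lookahead: dt.timedelta,
) -> int:
    """Return 1 if any outcome falls inside the lookahead window, else 0."""
    window_end = prediction_time + lookahead
    # Assumption: the window excludes the prediction time and includes its end.
    return int(any(prediction_time < t <= window_end for t in outcome_times))


pred_time = dt.datetime(1968, 1, 1)
lookahead = dt.timedelta(days=365)

inside = label_prediction_time(pred_time, [dt.datetime(1968, 6, 1)], lookahead)
outside = label_prediction_time(pred_time, [dt.datetime(1970, 1, 1)], lookahead)
never = label_prediction_time(pred_time, [], lookahead)
print(inside, outside, never)  # 1 0 0
```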


Since our outcome is binary, we want each prediction time to be labeled with 0 for the outcome if none is present within the lookahead window. This is the role of the fallback, the default value used when no values are found in the outcome frame within the lookahead window. For `BooleanOutcomeSpec`, the fallback is hardcoded to 0.

Your use case determines how you want to handle multiple outcome values within lookahead days. In this case, we decide that any prediction time with at least one outcome (a timestamp in the loaded outcome data with a corresponding value of 1) within the specified lookahead days is "positive". I.e., if there is both a 0 and a 1 within lookahead days, the prediction time should be labeled with a 1. We set `aggregators = [MaxAggregator()]` to accomplish this.
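
As a rough illustration of why a max aggregation gives the "at least one outcome" behaviour, consider the sketch below. `aggregate_max` is a hypothetical stand-in written for this example and is not the library's `MaxAggregator`.

```python
from __future__ import annotations


def aggregate_max(values_in_window: list[int], fallback: int = 0) -> int:
    """Return the largest value in the window, or the fallback if it is empty."""
    return max(values_in_window) if values_in_window else fallback


mixed = aggregate_max([0, 1, 0])  # a 0 and a 1 in the window
only_zeros = aggregate_max([0, 0])
empty = aggregate_max([])  # nothing in the window, so the fallback applies
print(mixed, only_zeros, empty)  # 1 0 0
```

A different aggregation would encode a different decision: for instance, taking the minimum would label a window as positive only if every value in it is 1.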

Here, we specify that we want to look 365 days forward from the prediction time to search for outcomes. If we wanted to require a certain period of time to pass after the prediction time before we look for outcome values, we could instead pass an interval to `lookahead_distances` as a (min, max) tuple of `dt.timedelta` values.

Lastly, we specify a name for the outcome via `output_name`, which is used when generating its column.


### Temporal predictor specification


Specifying a predictor is almost identical to specifying an outcome. The only difference is that a predictor looks a given number of days into the past from each prediction time instead of ahead.


In [None]:
import numpy as np
from timeseriesflattener import PredictorSpec, StaticSpec
from timeseriesflattener.aggregators import MeanAggregator

temporal_predictor_spec = PredictorSpec(
    value_frame=ValueFrame(
        entity_id_col_name="entity_id",
        init_df=df_synth_predictors.rename({"value": "value_1"}),
        value_timestamp_col_name="timestamp",
    ),
    aggregators=[MeanAggregator()],
    column_prefix="pred",
    fallback=np.nan,
    lookbehind_distances=[dt.timedelta(days=730)],
)


![](img/term_c.png)

Values within the _lookbehind_ window are aggregated using `aggregators`, for example the mean as shown in this example, or max/min etc.
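
To make the lookbehind aggregation concrete, here is a plain-Python sketch of a mean over a 730-day window with a NaN fallback. `lookbehind_mean` is a hypothetical helper, the values are invented, and the inclusive window boundaries are an assumption rather than the library's documented behaviour.

```python
from __future__ import annotations

import datetime as dt
import math
from statistics import mean


def lookbehind_mean(
    prediction_time: dt.datetime,
    observations: list[tuple[dt.datetime, float]],
    lookbehind: dt.timedelta,
    fallback: float = math.nan,
) -> float:
    """Average the values observed within the lookbehind window."""
    window_start = prediction_time - lookbehind
    # Assumption: both ends of the window are inclusive.
    in_window = [v for t, v in observations if window_start <= t <= prediction_time]
    return mean(in_window) if in_window else fallback


observations = [
    (dt.datetime(1967, 6, 12), 1.0),
    (dt.datetime(1968, 4, 15), 3.0),
    (dt.datetime(1969, 6, 20), 5.0),  # after the prediction time, so ignored
]
pred_time = dt.datetime(1969, 1, 11)

result = lookbehind_mean(pred_time, observations, dt.timedelta(days=730))
no_data = lookbehind_mean(pred_time, [], dt.timedelta(days=730))
print(result, no_data)  # 2.0 nan
```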


Note that we rename the value column to value_1. The value column's name determines the name of the output column after aggregation. To avoid multiple output columns with the same name, all input value columns must have unique names.
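
The sketch below shows why unique value column names matter. It imitates the naming pattern visible in the flattened output later in this tutorial (for example `pred_value_1_within_0_to_730_days_mean_fallback_nan`); `output_column_name` is a hypothetical helper and not a function exposed by the library.

```python
from __future__ import annotations


def output_column_name(
    prefix: str,
    value_col: str,
    min_days: int,
    max_days: int,
    aggregator: str,
    fallback: str,
) -> str:
    """Compose a column name following the pattern seen in the tutorial output."""
    return (
        f"{prefix}_{value_col}_within_{min_days}_to_{max_days}"
        f"_days_{aggregator}_fallback_{fallback}"
    )


renamed = output_column_name("pred", "value_1", 0, 730, "mean", "nan")
print(renamed)  # pred_value_1_within_0_to_730_days_mean_fallback_nan

# Two inputs that both keep the name "value" and share all other settings collide.
first = output_column_name("pred", "value", 0, 730, "mean", "nan")
second = output_column_name("pred", "value", 0, 730, "mean", "nan")
print(first == second)  # True
```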


Temporal predictors can also be specified to look for values within a certain time range from the prediction time, similar to outcome specifications. For instance, you might want to create multiple predictors, where one looks for values within (0, 30) days, and another within (31, 182) days.

This can easily be specified by passing a (min, max) tuple of `dt.timedelta` values to the `lookbehind_distances` parameter.


In [None]:
temporal_interval_predictor_spec = PredictorSpec(
    value_frame=ValueFrame(
        entity_id_col_name="entity_id",
        init_df=df_synth_predictors.rename({"value": "value_2"}),
        value_timestamp_col_name="timestamp",
    ),
    aggregators=[MeanAggregator()],
    column_prefix="pred",
    fallback=np.nan,
    lookbehind_distances=[(dt.timedelta(days=10), dt.timedelta(days=365))],
)
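
To see what interval windows do to the data, the following plain-Python sketch splits observations into the (0, 30) and (31, 182) day windows mentioned above and averages each. `values_in_window` is a hypothetical helper, the observations are invented, and treating both interval ends as inclusive is an assumption.

```python
from __future__ import annotations

import datetime as dt
from statistics import mean


def values_in_window(
    prediction_time: dt.datetime,
    observations: list[tuple[dt.datetime, float]],
    interval: tuple[dt.timedelta, dt.timedelta],
) -> list[float]:
    """Return values whose distance behind the prediction time lies in the interval."""
    min_dist, max_dist = interval
    # Assumption: both interval ends are inclusive.
    return [v for t, v in observations if min_dist <= prediction_time - t <= max_dist]


pred_time = dt.datetime(1969, 1, 1)
observations = [
    (pred_time - dt.timedelta(days=5), 2.0),
    (pred_time - dt.timedelta(days=20), 4.0),
    (pred_time - dt.timedelta(days=100), 10.0),
]

recent = (dt.timedelta(days=0), dt.timedelta(days=30))
older = (dt.timedelta(days=31), dt.timedelta(days=182))

recent_mean = mean(values_in_window(pred_time, observations, recent))
older_mean = mean(values_in_window(pred_time, observations, older))
print(recent_mean, older_mean)  # 3.0 10.0
```

In this sketch, an observation between 30 and 31 days before the prediction time falls into neither window, so adjacent intervals may need to share a boundary if that matters for your data.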


### Static predictor specification

Static features should be specified using `StaticSpec`, as they are handled slightly differently. As in the previous specifications, we provide a frame containing the values, here a `StaticFrame`, and we set a column prefix. By default, `PredictorSpec` prefixes columns with "pred" and `OutcomeSpec` prefixes columns with "outc" to make filtering easy.
As `StaticSpec` can be used for generating both predictors and outcomes, we manually set the prefix to "pred", as sex is used as a predictor in this case.


In [None]:
from timeseriesflattener import StaticFrame

sex_predictor_spec = StaticSpec(
    value_frame=StaticFrame(init_df=df_synth_sex), column_prefix="pred", fallback=np.nan
)

df_synth_sex

entity_id,female
i64,i64
0,0
1,1
2,1
3,1
4,0
5,0
6,0
7,0
8,1
9,0


Note that we don't need to specify which columns to aggregate. Timeseriesflattener aggregates all columns that are not `entity_id_col_name` or `value_timestamp_col_name` and uses the name(s) of the column(s) for the output.
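
Conceptually, a static predictor is looked up once per entity and repeated for each of that entity's prediction times, with the fallback used when no value exists. A minimal sketch of that idea, using a plain dictionary instead of the library and a hypothetical entity 42 that has no recorded value:

```python
from __future__ import annotations

import math

# entity_id -> female, taken from the first rows of df_synth_sex.
sex_by_entity = {0: 0, 1: 1, 2: 1}

# Entity ids of four prediction times; entity 0 has two of them,
# and entity 42 is a hypothetical id without a static value.
prediction_entities = [0, 0, 2, 42]

pred_female = [sex_by_entity.get(entity_id, math.nan) for entity_id in prediction_entities]
print(pred_female)  # [0, 0, 1, nan]
```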

Now we're ready to flatten our dataset!


## Flattening

Flattening is as easy as instantiating the `Flattener` class with the prediction times wrapped in a `PredictionTimeFrame` and calling `aggregate_timeseries` with the list of specifications. `n_workers` can be set to parallelize operations across multiple cores.


In [9]:
from timeseriesflattener import Flattener, PredictionTimeFrame

flattener = Flattener(
    predictiontime_frame=PredictionTimeFrame(
        init_df=df_prediction_times, entity_id_col_name="entity_id", timestamp_col_name="timestamp"
    )
)

In [16]:
df = flattener.aggregate_timeseries(
    specs=[
        sex_predictor_spec,
        temporal_predictor_spec,
        temporal_interval_predictor_spec,
        outcome_spec,
    ]
).df.collect()

skim(df)

list(df.columns)



['entity_id',
 'timestamp',
 'pred_time_uuid',
 'pred_female_fallback_nan',
 'pred_value_1_within_0_to_730_days_mean_fallback_nan',
 'pred_value_2_within_10_to_365_days_mean_fallback_nan',
 'outc_value_within_0_to_365_days_max_fallback_0']

In [17]:
# For displayability, shorten col names
shortened_pred = "predX"
shortened_predinterval = "predX_10_to_365"
shortened_outcome = "outc_Y"

display_df = df.rename(
    {
        "pred_value_1_within_0_to_730_days_mean_fallback_nan": shortened_pred,
        "pred_value_2_within_10_to_365_days_mean_fallback_nan": shortened_predinterval,
        "outc_outcome_within_0_to_365_days_max_fallback_0": shortened_outcome,
    }
)
display_df

entity_id,timestamp,pred_time_uuid,pred_female_fallback_nan,predX,predX_10_to_365,outc_Y
i64,datetime[μs],str,i64,f64,f64,i32
9903,1968-05-09 21:24:00,"""9903-1968-05-0…",0,3.197213,,0
7465,1966-05-24 01:23:00,"""7465-1966-05-2…",1,4.243969,,0
6447,1967-09-25 18:08:00,"""6447-1967-09-2…",1,5.260492,,0
2121,1966-05-05 20:52:00,"""2121-1966-05-0…",0,4.798062,,0
4927,1968-06-30 12:13:00,"""4927-1968-06-3…",0,4.040067,,0
5475,1967-01-09 03:09:00,"""5475-1967-01-0…",0,5.953548,,0
3157,1969-10-07 05:01:00,"""3157-1969-10-0…",1,5.068696,,0
9793,1968-12-15 12:59:00,"""9793-1968-12-1…",0,6.93591,,0
5962,1965-11-08 17:03:00,"""5962-1965-11-0…",0,4.112929,,0
9768,1967-07-04 23:09:00,"""9768-1967-07-0…",1,5.053019,,0


And there we go! A dataframe ready for classification, containing:

1. The entity IDs
2. Timestamps for each prediction time
3. A unique identifier for each prediction time
4. Our predictor columns, prefixed with `pred_` and
5. Our outcome columns, prefixed with `outc_`
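
Because of these prefixes, splitting the flattened columns into model inputs and labels can be done with simple string matching. The sketch below works on the column names printed by `list(df.columns)` above; note that `pred_time_uuid` also starts with `pred_`, so identifier columns are excluded explicitly. The variable names are illustrative only.

```python
from __future__ import annotations

columns = [
    "entity_id",
    "timestamp",
    "pred_time_uuid",
    "pred_female_fallback_nan",
    "pred_value_1_within_0_to_730_days_mean_fallback_nan",
    "pred_value_2_within_10_to_365_days_mean_fallback_nan",
    "outc_value_within_0_to_365_days_max_fallback_0",
]

# Identifier columns are not features, even if their name starts with "pred_".
id_columns = {"entity_id", "timestamp", "pred_time_uuid"}

predictor_columns = [
    c for c in columns if c.startswith("pred_") and c not in id_columns
]
outcome_columns = [c for c in columns if c.startswith("outc_")]

print(len(predictor_columns), len(outcome_columns))  # 3 1
```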
