TimeseriesFlattener flattens timeseries. This is especially helpful if you have complicated timeseries but want to train simple models.

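To specify how to do this, we need a shared vocabulary. The key terms (prediction time, lookahead and lookbehind) are introduced where they are first used below.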

# Application
Now for application!

Applying it consists of 3 steps:

1. [Loading data](#loading-data) (prediction times, predictor(s), and outcome(s))
2. [Specifying how to flatten the data](#specifying-how-to-flatten-the-data) and
3. [Flattening](#flattening)


The simplest case is adding one predictor and one outcome.

First, we'll load the timestamps for every time we want to issue a prediction:

# Loading data

### Loading prediction times
Prediction times consist of two elements:
1. The entity id. This is the entity about which the prediction is issued. In medical contexts, this is frequently a patient.
2. The timestamp at which the prediction is to be issued.
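
As a rough illustration of that structure, here is a minimal sketch in plain Python rather than pandas. The rows, the helper `validate_prediction_times` and the uniqueness check are assumptions made for this example and are not part of timeseriesflattener.

```python
from datetime import datetime

# Hypothetical prediction times: one row per (entity, timestamp) pair.
# An entity can appear several times, once per prediction we want to issue.
prediction_times = [
    {"entity_id": 1, "timestamp": datetime(1968, 5, 9, 21, 24)},
    {"entity_id": 1, "timestamp": datetime(1969, 1, 3, 8, 0)},
    {"entity_id": 2, "timestamp": datetime(1966, 5, 24, 1, 23)},
]


def validate_prediction_times(rows):
    """Check both elements are present and count distinct prediction times.

    Rejecting repeated (entity_id, timestamp) pairs is an assumption of this sketch.
    """
    seen = set()
    for row in rows:
        if not isinstance(row["timestamp"], datetime):
            raise TypeError("timestamp must be a datetime")
        key = (row["entity_id"], row["timestamp"])
        if key in seen:
            raise ValueError(f"duplicate prediction time: {key}")
        seen.add(key)
    return len(seen)


n_prediction_times = validate_prediction_times(prediction_times)
```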

In [1]:
from skimpy import skim
from timeseriesflattener.testing.load_synth_data import load_synth_prediction_times

df_prediction_times = load_synth_prediction_times()

skim(df_prediction_times)
df_prediction_times

|      | entity_id | timestamp           |
|------|-----------|---------------------|
| 0    | 9903      | 1968-05-09 21:24:00 |
| 1    | 7465      | 1966-05-24 01:23:00 |
| 2    | 6447      | 1967-09-25 18:08:00 |
| 3    | 2121      | 1966-05-05 20:52:00 |
| 4    | 4927      | 1968-06-30 12:13:00 |
| ...  | ...       | ...                 |
| 9995 | 7159      | 1966-12-12 16:32:00 |
| 9996 | 147       | 1965-03-12 05:32:00 |
| 9997 | 1421      | 1968-04-15 15:53:00 |
| 9998 | 3353      | 1966-01-15 10:04:00 |


### Loading a temporal predictor
Then we'll load the values for our predictor:

In [2]:
from timeseriesflattener.testing.load_synth_data import load_synth_predictor_float

df_synth_predictors = load_synth_predictor_float()

skim(df_synth_predictors)
df_synth_predictors

|       | entity_id | timestamp           | value    |
|-------|-----------|---------------------|----------|
| 0     | 9476      | 1969-03-05 08:08:00 | 0.816995 |
| 1     | 4631      | 1967-04-10 22:48:00 | 4.818074 |
| 2     | 3890      | 1969-12-15 14:07:00 | 2.503789 |
| 3     | 1098      | 1965-11-19 03:53:00 | 3.515041 |
| 4     | 1626      | 1966-05-03 14:07:00 | 4.353115 |
| ...   | ...       | ...                 | ...      |
| 99995 | 4542      | 1968-06-01 17:09:00 | 9.616722 |
| 99996 | 4839      | 1966-11-24 01:13:00 | 0.235124 |
| 99997 | 8168      | 1969-07-30 01:45:00 | 0.929738 |
| 99998 | 9328      | 1965-12-22 10:53:00 | 5.124424 |


### Loading static predictor

Frequently, you'll have a static predictor describing each entity. In this case, an entity is a patient, and an example of a static predictor could be their sex. It doesn't change over time (it's static), but it can be used as a predictor for each prediction time. Let's load it!

In [3]:
from timeseriesflattener.testing.load_synth_data import load_synth_sex

df_synth_sex = load_synth_sex()

skim(df_synth_sex)
df_synth_sex

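|      | entity_id | female |
|------|-----------|--------|
| 0    | 0         | 0      |
| 1    | 1         | 1      |
| 2    | 2         | 1      |
| 3    | 3         | 1      |
| 4    | 4         | 0      |
| ...  | ...       | ...    |
| 9994 | 9995      | 0      |
| 9995 | 9996      | 0      |
| 9996 | 9997      | 1      |
| 9997 | 9998      | 1      |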


### Loading temporal outcome

And, lastly, our outcome values. We've chosen a binary outcome and only stored rows for the timestamps at which the outcome occurs. Anything without a stored timestamp is inferred not to have experienced the outcome. We handle this by setting a fallback of 0, see below.

In [4]:
from timeseriesflattener.testing.load_synth_data import load_synth_outcome

df_synth_outcome = load_synth_outcome()

skim(df_synth_outcome)
df_synth_outcome

|      | entity_id | timestamp           | value |
|------|-----------|---------------------|-------|
| 1    | 4         | 1965-06-20 16:29:00 | 1     |
| 3    | 7         | 1968-10-16 01:46:00 | 1     |
| 6    | 12        | 1965-04-17 07:17:00 | 1     |
| 7    | 13        | 1969-08-10 20:10:00 | 1     |
| 10   | 18        | 1969-02-02 09:16:00 | 1     |
| ...  | ...       | ...                 | ...   |
| 6253 | 9964      | 1966-04-14 09:44:00 | 1     |
| 6255 | 9966      | 1969-06-05 13:00:00 | 1     |
| 6256 | 9968      | 1968-09-06 13:15:00 | 1     |
| 6257 | 9970      | 1967-01-24 08:52:00 | 1     |


We now have 4 dataframes loaded: df_prediction_times, df_synth_predictors, df_synth_sex and df_synth_outcome.

# Specifying how to flatten the data
We'll have to specify how to flatten predictors and outcomes. To do this, we use the feature specification objects as "recipes" for each column in our finished dataframe. First, we'll define the outcome specification.

### Outcome specification

![](img/term_a.png)

The main decision to make for outcomes is the size of the **lookahead** window. It determines how far into the future from a given prediction time to look for outcome values. 
A **prediction time** indicates at which point the model issues a prediction, and is used as a reference for the *lookahead*.  

### Outcome labelling
![](img/term_b.png)

We want labels for prediction times to be 0 if the outcome never occurs, or if the outcome happens outside the lookahead window. Labels should only be 1 if the outcome occurs inside the lookahead window. Let's specify this in code.

In [5]:
from timeseriesflattener.feature_spec_objects import OutcomeSpec
from timeseriesflattener.resolve_multiple_functions import maximum

outcome_spec = OutcomeSpec(
    values_df=df_synth_outcome,
    lookahead_days=365,
    fallback=0,
    resolve_multiple_fn=maximum,
    incident=False,
    feature_name="outcome_name",
)

Since our outcome is binary, we want each prediction time to be labelled with 0 for the outcome if none is present within `lookahead_days`. Therefore, we set `fallback` to 0.

How to handle multiple outcome values within the lookahead window depends on your use case. Here, we treat any prediction time with at least one outcome (a timestamp labelled 1) within the window as "positive". That is, if there is both a 0 and a 1 within the window, the prediction time should be labelled 1. We set `resolve_multiple_fn=maximum` to accomplish this.
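
To make the labelling rule concrete, here is a minimal sketch in plain Python. It is not the library's implementation: `label_outcome` is a hypothetical helper, and the boundary handling (window start exclusive, window end inclusive) is an assumption.

```python
from datetime import datetime, timedelta


def label_outcome(prediction_time, outcome_events, lookahead_days, fallback=0):
    """Return the maximum outcome value inside the lookahead window, else `fallback`.

    Assumption: the window is (prediction_time, prediction_time + lookahead_days].
    """
    window_end = prediction_time + timedelta(days=lookahead_days)
    values_in_window = [
        value
        for timestamp, value in outcome_events
        if prediction_time < timestamp <= window_end
    ]
    # "maximum" resolves several values to one; no values means the fallback applies.
    return max(values_in_window) if values_in_window else fallback


# Hypothetical (timestamp, value) outcome rows for a single entity.
events = [(datetime(1966, 3, 1), 0), (datetime(1966, 6, 1), 1)]

# Both a 0 and a 1 fall inside the window, so the label is 1.
label_inside = label_outcome(datetime(1966, 1, 1), events, lookahead_days=365)
# No outcome rows fall inside the window, so the fallback of 0 is used.
label_outside = label_outcome(datetime(1964, 1, 1), events, lookahead_days=365)
```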

We also specify that the outcome is not incident. This means that each entity can experience the outcome more than once.

If the outcome were marked as incident, all prediction times after the entity experiences the outcome would be dropped.
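
The effect of an incident outcome can be sketched as follows. This is an illustration in plain Python, not the library's code: `drop_after_first_outcome` is a hypothetical helper, and keeping a prediction time that falls exactly on the outcome timestamp is an assumption.

```python
from datetime import datetime


def drop_after_first_outcome(prediction_times, outcome_times):
    """Keep only prediction times at or before each entity's first outcome.

    Both arguments are lists of (entity_id, timestamp) tuples.
    """
    first_outcome = {}
    for entity_id, timestamp in outcome_times:
        if entity_id not in first_outcome or timestamp < first_outcome[entity_id]:
            first_outcome[entity_id] = timestamp
    return [
        (entity_id, timestamp)
        for entity_id, timestamp in prediction_times
        # Entities that never experience the outcome keep all prediction times.
        if entity_id not in first_outcome or timestamp <= first_outcome[entity_id]
    ]


prediction_times = [
    (1, datetime(1966, 1, 1)),
    (1, datetime(1967, 1, 1)),  # after entity 1's first outcome, so dropped
    (2, datetime(1968, 1, 1)),
]
outcome_times = [(1, datetime(1966, 6, 1)), (1, datetime(1968, 6, 1))]

kept = drop_after_first_outcome(prediction_times, outcome_times)
```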

Lastly, we specify a name for the outcome, which will be used when generating its column.

### Temporal predictor specification

Specifying a predictor is almost entirely identical, except it looks into the past from each prediction time.

In [6]:
from timeseriesflattener.feature_spec_objects import PredictorSpec, StaticSpec
from timeseriesflattener.resolve_multiple_functions import mean
import numpy as np

temporal_predictor_spec = PredictorSpec(
    values_df=df_synth_predictors,
    lookbehind_days=730,
    fallback=np.nan,
    resolve_multiple_fn=mean,
    feature_name="predictor_name",
)

![](img/term_c.png)

Values within the *lookbehind* window are aggregated using `resolve_multiple_fn`, for example the mean as shown in this example, or max/min etc. 
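
As a rough sketch of what this aggregation means for a single prediction time, consider the plain-Python example below. `aggregate_lookbehind` is a hypothetical helper, and the window boundaries (start inclusive, prediction time exclusive) are an assumption rather than the library's documented behaviour.

```python
import math
from datetime import datetime, timedelta
from statistics import mean


def aggregate_lookbehind(prediction_time, observations, lookbehind_days, fallback=math.nan):
    """Average the values observed in the lookbehind window, else return `fallback`.

    Assumption: the window is [prediction_time - lookbehind_days, prediction_time).
    """
    window_start = prediction_time - timedelta(days=lookbehind_days)
    values = [
        value
        for timestamp, value in observations
        if window_start <= timestamp < prediction_time
    ]
    return mean(values) if values else fallback


# Hypothetical (timestamp, value) rows for a single entity.
observations = [
    (datetime(1965, 1, 1), 2.0),  # more than 730 days before the prediction time
    (datetime(1966, 6, 1), 4.0),
    (datetime(1967, 3, 1), 6.0),
]

feature = aggregate_lookbehind(datetime(1967, 6, 1), observations, lookbehind_days=730)
# No observations precede this prediction time, so the NaN fallback is returned.
empty_feature = aggregate_lookbehind(datetime(1964, 1, 1), observations, lookbehind_days=730)
```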

### Static predictor specification

In [7]:
sex_predictor_spec = StaticSpec(
    values_df=df_synth_sex,
    feature_name="female",
    prefix="pred",
    input_col_name_override="female",
)

df_synth_sex

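|      | entity_id | female |
|------|-----------|--------|
| 0    | 0         | 0      |
| 1    | 1         | 1      |
| 2    | 2         | 1      |
| 3    | 3         | 1      |
| 4    | 4         | 0      |
| ...  | ...       | ...    |
| 9994 | 9995      | 0      |
| 9995 | 9996      | 0      |
| 9996 | 9997      | 1      |
| 9997 | 9998      | 1      |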


Note that we also specify `input_col_name_override`, because the df_synth_sex dataframe has its values in the "female" column. By default, timeseriesflattener looks for a column named "value".
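
Conceptually, a static predictor is a lookup by entity id that is repeated on every prediction time for that entity. The sketch below illustrates this in plain Python. `attach_static_value` is a hypothetical helper, its `value_col` argument only mimics the role of the column override, and the resulting column name `pred_female` is an assumption based on the prefix.

```python
def attach_static_value(prediction_times, static_rows, feature_name, value_col="value", prefix="pred"):
    """Copy each entity's static value onto all of its prediction times."""
    # Read the value from `value_col`; the default mirrors a "value" column.
    lookup = {row["entity_id"]: row[value_col] for row in static_rows}
    column = f"{prefix}_{feature_name}"
    return [
        {**pred_time, column: lookup.get(pred_time["entity_id"])}
        for pred_time in prediction_times
    ]


# Hypothetical rows; the static values live in "female", not in "value".
static_rows = [{"entity_id": 0, "female": 0}, {"entity_id": 1, "female": 1}]
prediction_times = [
    {"entity_id": 1, "timestamp": "1966-01-01"},
    {"entity_id": 1, "timestamp": "1967-01-01"},
    {"entity_id": 0, "timestamp": "1968-01-01"},
]

flattened = attach_static_value(
    prediction_times, static_rows, feature_name="female", value_col="female"
)
static_values = [row["pred_female"] for row in flattened]
```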

Now we're ready to flatten our dataset!

# Flattening

In [8]:
from timeseriesflattener import TimeseriesFlattener

ts_flattener = TimeseriesFlattener(
    prediction_times_df=df_prediction_times,
    entity_id_col_name="entity_id",
    timestamp_col_name="timestamp",
    n_workers=1,
    drop_pred_times_with_insufficient_look_distance=True,
)

We set `drop_pred_times_with_insufficient_look_distance` to `True`. This means prediction times are dropped if the *lookbehind* extends further back in time than the start of the dataset or if the *lookahead* extends further than the end of the dataset.

![](img/term_d.png)


For most applications, this should be `True`: you do not want features to claim they look a year into the future if you only have a month of data. This would compromise generalisability. However, there are some edge cases where you might want this to be `False`; see the advanced tutorial for a brief discussion of this.
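
The rule can be sketched as a simple filter. This is a plain-Python illustration under stated assumptions, not the library's implementation: `has_sufficient_look_distance` is a hypothetical helper, and `data_start` and `data_end` stand in for the earliest and latest timestamps available in the data.

```python
from datetime import datetime, timedelta


def has_sufficient_look_distance(prediction_time, data_start, data_end, lookbehind_days, lookahead_days):
    """Return True if both windows fit inside the period covered by the data."""
    covers_lookbehind = prediction_time - timedelta(days=lookbehind_days) >= data_start
    covers_lookahead = prediction_time + timedelta(days=lookahead_days) <= data_end
    return covers_lookbehind and covers_lookahead


# Hypothetical dataset boundaries and prediction times.
data_start = datetime(1965, 1, 1)
data_end = datetime(1969, 12, 31)
prediction_times = [
    datetime(1965, 6, 1),  # lookbehind would start before the data does
    datetime(1967, 6, 1),  # both windows fit
    datetime(1969, 6, 1),  # lookahead would run past the end of the data
]

kept = [
    t
    for t in prediction_times
    if has_sufficient_look_distance(t, data_start, data_end, lookbehind_days=730, lookahead_days=365)
]
```

With a long lookbehind and lookahead relative to the span of the data, a large share of prediction times can fail this check.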

In [9]:
ts_flattener.add_spec([sex_predictor_spec, temporal_predictor_spec, outcome_spec])

In [10]:
df = ts_flattener.get_df()

skim(df)
df

2022-12-09 10:30:57 [INFO] There were unprocessed specs, computing...
2022-12-09 10:30:57 [INFO] _drop_pred_time_if_insufficient_look_distance: Dropped 5999 (59.99%) rows

And there we go! A dataframe ready for classification, containing:
1. The entity IDs
2. Timestamps for each prediction time
3. A unique identifier for each prediction time
4. Our predictor columns, prefixed with `pred_` and
5. Our outcome columns, prefixed with `outc_`
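
Because of these prefixes, predictors and outcomes can be separated by column name before training a model. The sketch below assumes a hypothetical list of column names; the real names generated by the flattener will differ and typically carry more detail.

```python
# Hypothetical column names for a flattened dataframe.
columns = [
    "entity_id",
    "timestamp",
    "prediction_time_uuid",
    "pred_female",
    "pred_predictor_name",
    "outc_outcome_name",
]

# Match on the full prefix including the underscore, so that
# "prediction_time_uuid" is not mistaken for a predictor.
predictor_cols = [col for col in columns if col.startswith("pred_")]
outcome_cols = [col for col in columns if col.startswith("outc_")]
```

With a pandas dataframe, the same lists could be used to select the feature matrix and the label column, for example `df[predictor_cols]` and `df[outcome_cols]`.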