# Introductory Tutorial

TimeseriesFlattener turns time series into a flat, tabular dataset with one row per prediction time. This is especially helpful if you have complicated and irregular time series but want to train simple models.

We explain terminology as needed in this tutorial. If you need a reference, see the [docs](https://aarhus-psychiatry-research.github.io/timeseriesflattener/#functionality).

Applying it consists of 3 steps:

1. Loading data (prediction times, predictor(s), and outcome(s))
2. Specifying how to flatten the data and
3. Flattening

The simplest case is adding one predictor and one outcome.

First, we'll load the timestamps for every time we want to issue a prediction.

## Loading data

### Loading prediction times
Prediction times consist of two elements:
1. The entity id. This is the entity about which the prediction is issued. In medical contexts, this is frequently a patient.
2. The timestamp at which the prediction is to be issued.

In [1]:
from skimpy import skim
from timeseriesflattener.testing.load_synth_data import load_synth_prediction_times

df_prediction_times = load_synth_prediction_times()

skim(df_prediction_times)
df_prediction_times.sort_values(by=["entity_id"])

Unnamed: 0,entity_id,timestamp
628,0,1969-01-11 09:55:00
2005,1,1965-03-15 07:16:00
4370,2,1969-09-13 23:18:00
6152,3,1968-02-04 16:16:00
6873,4,1965-01-28 12:33:00
...,...,...
9688,9996,1965-07-18 17:12:00
1463,9996,1965-01-30 17:19:00
3952,9997,1967-06-08 07:52:00
7926,9999,1968-02-07 22:24:00


Here, "entity_id" represents a patient ID and "timestamp" refers to the time when we want to issue a prediction. Note that each ID can have multiple prediction times.

### Loading a temporal predictor
Then, we'll load the values for our temporal predictor. Temporal predictors are predictors that can have a different value at different timepoints.

In [None]:
from timeseriesflattener.testing.load_synth_data import load_synth_predictor_float

df_synth_predictors = load_synth_predictor_float()

skim(df_synth_predictors)
df_synth_predictors.sort_values(by=["entity_id"])

Unnamed: 0,entity_id,timestamp,value
0,9476,1969-03-05 08:08:00,0.816995
1,4631,1967-04-10 22:48:00,4.818074
2,3890,1969-12-15 14:07:00,2.503789
3,1098,1965-11-19 03:53:00,3.515041
4,1626,1966-05-03 14:07:00,4.353115
...,...,...,...
99995,4542,1968-06-01 17:09:00,9.616722
99996,4839,1966-11-24 01:13:00,0.235124
99997,8168,1969-07-30 01:45:00,0.929738
99998,9328,1965-12-22 10:53:00,5.124424


Once again, note that there can be multiple values for each ID.

### Loading a static predictor

Frequently, you'll have one or more static predictors describing each entity. In this case, an entity is a patient, and an example of a static predictor could be their sex. It doesn't change over time (it's static), but can be used as a predictor for each prediction time. Let's load it in!

In [3]:
from timeseriesflattener.testing.load_synth_data import load_synth_sex

df_synth_sex = load_synth_sex()

skim(df_synth_sex)
df_synth_sex

Unnamed: 0,entity_id,female
0,0,0
1,1,1
2,2,1
3,3,1
4,4,0
...,...,...
9994,9995,0
9995,9996,0
9996,9997,1
9997,9998,1


As the predictor is static, there should only be a single value for each ID in this dataframe.

### Loading a temporal outcome

And, lastly, our outcome values. We've chosen a binary outcome and only stored rows for the timestamps at which a patient experiences the outcome. Patients without a row in this dataframe never experience the outcome. We handle this by setting a fallback of 0; more on that in the following section.

In [None]:
from timeseriesflattener.testing.load_synth_data import load_synth_outcome

df_synth_outcome = load_synth_outcome()

skim(df_synth_outcome)
df_synth_outcome

Unnamed: 0,entity_id,timestamp,value
1,4,1965-06-20 16:29:00,1
3,7,1968-10-16 01:46:00,1
6,12,1965-04-17 07:17:00,1
7,13,1969-08-10 20:10:00,1
10,18,1969-02-02 09:16:00,1
...,...,...,...
6253,9964,1966-04-14 09:44:00,1
6255,9966,1969-06-05 13:00:00,1
6256,9968,1968-09-06 13:15:00,1
6257,9970,1967-01-24 08:52:00,1


This dataframe should contain at most 1 row per ID, which is the first time they experience the outcome.
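The "at most 1 row per ID" expectation can be checked before flattening. The following is a minimal sketch in plain Python, using the first three rows of the output above as `(entity_id, timestamp)` pairs; the helper name `entities_with_multiple_rows` is hypothetical and not part of timeseriesflattener.

```python
from collections import Counter

# First rows of the outcome output above, as (entity_id, timestamp) pairs.
outcome_rows = [
    (4, "1965-06-20 16:29:00"),
    (7, "1968-10-16 01:46:00"),
    (12, "1965-04-17 07:17:00"),
]


def entities_with_multiple_rows(rows):
    """Return the sorted entity ids that appear in more than one row."""
    counts = Counter(entity_id for entity_id, _ in rows)
    return sorted(entity_id for entity_id, n in counts.items() if n > 1)


# An empty list means every entity has at most one outcome row.
duplicated = entities_with_multiple_rows(outcome_rows)
print(duplicated)
```

On the loaded dataframe itself, an equivalent check would be `df_synth_outcome["entity_id"].is_unique`.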

We now have 4 dataframes loaded: df_prediction_times, df_synth_predictors, df_synth_sex and df_synth_outcome.

## Specifying how to flatten the data
We'll have to specify how to flatten predictors and outcomes. To do this, we use the feature specification objects as "recipes" for each column in our finished dataframe. Firstly, we'll specify the outcome specification.

### Temporal outcome specification

![](img/term_a.png)

The main decision to make for outcomes is the size of the **lookahead** window. It determines how far into the future from a given prediction time to look for outcome values. 
A **prediction time** indicates at which point the model issues a prediction, and is used as a reference for the *lookahead*.  

![](img/term_b.png)

We want labels for prediction times to be 0 if the outcome never occurs, or if the outcome happens outside the lookahead window. Labels should only be 1 if the outcome occurs inside the lookahead window. Let's specify this in code.

In [5]:
from timeseriesflattener.feature_spec_objects import OutcomeSpec
from timeseriesflattener.resolve_multiple_functions import maximum

outcome_spec = OutcomeSpec(
    values_df=df_synth_outcome,
    lookahead_days=365,
    fallback=0,
    resolve_multiple_fn=maximum,
    incident=False,
    feature_name="outcome_name",
)

Since our outcome is binary, we want each prediction time to be labeled 0 if no outcome is present within `lookahead_days`. To do this, we use the `fallback` argument, which specifies the default value to use if no values are found in `values_df` within the lookahead window. In this case, we set it to 0.

Your use case determines how you want to handle multiple outcome values within lookahead days. In this case, we decide that any prediction time with at least one outcome (a timestamp in the loaded outcome data with a corresponding value of 1) within the specified lookahead days is "positive". I.e., if there is both a 0 and a 1 within lookahead days, the prediction time should be labeled with a 1. We set `resolve_multiple_fn = maximum` to accomplish this.
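To make the interplay between the lookahead window, `fallback`, and `maximum` concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the function name `label_prediction_time` is hypothetical, and treating the window as open at the prediction time and closed at its end is an assumption. The timestamps are those shown above for entity 4.

```python
from datetime import datetime, timedelta


def label_prediction_time(prediction_time, outcomes, lookahead_days, fallback=0):
    """Label one prediction time from (timestamp, value) outcome pairs.

    Assumption: the window is (prediction_time, prediction_time + lookahead];
    the library's exact boundary handling may differ.
    """
    window_end = prediction_time + timedelta(days=lookahead_days)
    in_window = [v for ts, v in outcomes if prediction_time < ts <= window_end]
    # Resolve multiple values with the maximum; use the fallback if none exist.
    return max(in_window) if in_window else fallback


# Entity 4 experiences the outcome on 1965-06-20 (from the outcome output above).
outcomes = [(datetime(1965, 6, 20, 16, 29), 1)]

# Prediction time from the data above: the outcome falls inside 365 days.
inside = label_prediction_time(datetime(1965, 1, 28, 12, 33), outcomes, 365)
# Illustrative earlier prediction time: the outcome is beyond the window.
outside = label_prediction_time(datetime(1963, 1, 1), outcomes, 365)
print(inside, outside)  # 1 0
```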

We also specify that the outcome is not incident. This means that each patient (identified by `entity_id`) can experience the outcome more than once. If the outcome were marked as incident, all prediction times after the patient experiences the outcome would be dropped. This is useful for cases where an event is permanent, for example whether a patient has type 1 diabetes or not.
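The effect of marking an outcome as incident can be sketched as a filter on one patient's prediction times. This is an illustration of the behaviour described above, not the library's code; the function name `drop_after_first_outcome` is hypothetical, and whether a prediction time exactly at the outcome timestamp is kept is an assumption (here it is dropped).

```python
from datetime import datetime


def drop_after_first_outcome(prediction_times, first_outcome_time):
    """Keep only prediction times strictly before the first outcome.

    Patients who never experience the outcome (None) keep all prediction times.
    """
    if first_outcome_time is None:
        return list(prediction_times)
    return [t for t in prediction_times if t < first_outcome_time]


# Illustrative prediction times for a single patient.
prediction_times = [datetime(1965, 1, 28, 12, 33), datetime(1966, 3, 1)]
first_outcome = datetime(1965, 6, 20, 16, 29)

kept = drop_after_first_outcome(prediction_times, first_outcome)
print(kept)  # only the 1965-01-28 prediction time remains
```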

Lastly, we specify a name for the outcome, which will be used when generating its column.

### Temporal predictor specification

Specifying a predictor is almost entirely identical to specifying an outcome. The only exception is that it looks a given number of days into the past from each prediction time instead of ahead.

In [6]:
from timeseriesflattener.feature_spec_objects import PredictorSpec, StaticSpec
from timeseriesflattener.resolve_multiple_functions import mean
import numpy as np

temporal_predictor_spec = PredictorSpec(
    values_df=df_synth_predictors,
    lookbehind_days=730,
    fallback=np.nan,
    resolve_multiple_fn=mean,
    feature_name="predictor_name",
)

![](img/term_c.png)

Values within the *lookbehind* window are aggregated using `resolve_multiple_fn`, for example the mean as shown in this example, or max/min etc. 
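A minimal sketch of this aggregation in plain Python, assuming a window that includes its start and excludes the prediction time itself (the library's boundary handling may differ). The function name `aggregate_lookbehind` and the example values are hypothetical.

```python
import math
from datetime import datetime, timedelta
from statistics import mean


def aggregate_lookbehind(prediction_time, values, lookbehind_days, fallback=math.nan):
    """Average the (timestamp, value) pairs that fall in the lookbehind window.

    Assumption: the window is [prediction_time - lookbehind, prediction_time).
    """
    window_start = prediction_time - timedelta(days=lookbehind_days)
    in_window = [v for ts, v in values if window_start <= ts < prediction_time]
    # Fall back to the default (NaN here) if the window holds no values.
    return mean(in_window) if in_window else fallback


# Illustrative measurements for one patient.
values = [
    (datetime(1960, 1, 1), 100.0),  # too old, outside the 730-day window
    (datetime(1967, 1, 1), 2.0),
    (datetime(1967, 6, 1), 4.0),
]

with_values = aggregate_lookbehind(datetime(1968, 1, 1), values, 730)
without_values = aggregate_lookbehind(datetime(1950, 1, 1), values, 730)
print(with_values, without_values)  # 3.0 nan
```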

### Static predictor specification
Static features should be specified using `StaticSpec`, as they are handled slightly differently. As in the previous specifications, we provide a `values_df` containing the values and we set the feature name. However, now we also add a prefix. By default, `PredictorSpec` prefixes columns with "pred" and `OutcomeSpec` prefixes columns with "outc" to make filtering easy.
As `StaticSpec` can be used for generating both predictors and outcomes, we manually set the prefix to "pred", as sex is used as a predictor in this case.

In [7]:
sex_predictor_spec = StaticSpec(
    values_df=df_synth_sex,
    feature_name="female",
    prefix="pred",
    input_col_name_override="female",
)

df_synth_sex

Unnamed: 0,entity_id,female
0,0,0
1,1,1
2,2,1
3,3,1
4,4,0
...,...,...
9994,9995,0
9995,9996,0
9996,9997,1
9997,9998,1


Note that we also specify `input_col_name_override`, because the `df_synth_sex` dataframe has its values in the "female" column. By default, timeseriesflattener looks for a column named "value".
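Conceptually, a static predictor is attached to every prediction time of the matching entity, with no time window involved. The sketch below illustrates this with a plain dictionary lookup, using rows taken from the outputs above; it is not how the library performs the join.

```python
# Static values keyed by entity id (rows from the df_synth_sex output above).
female_by_entity = {0: 0, 1: 1, 9996: 0}

# Prediction times as (entity_id, timestamp); entity 9996 has two of them.
prediction_times = [
    (0, "1969-01-11 09:55:00"),
    (1, "1965-03-15 07:16:00"),
    (9996, "1965-07-18 17:12:00"),
    (9996, "1965-01-30 17:19:00"),
]

# Every prediction time gets the single static value of its entity.
rows = [
    {"entity_id": entity_id, "timestamp": ts, "pred_female": female_by_entity.get(entity_id)}
    for entity_id, ts in prediction_times
]

for row in rows:
    print(row)
```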

Now we're ready to flatten our dataset!

## Flattening
Flattening is as easy as instantiating the `TimeseriesFlattener` class with the prediction times dataframe along with dataset-specific metadata, and then calling `add_spec` with the feature specifications. `n_workers` can be set to parallelize operations across multiple cores.

In [8]:
from timeseriesflattener import TimeseriesFlattener

ts_flattener = TimeseriesFlattener(
    prediction_times_df=df_prediction_times,
    entity_id_col_name="entity_id",
    timestamp_col_name="timestamp",
    n_workers=1,
    drop_pred_times_with_insufficient_look_distance=True,
)

We set `drop_pred_times_with_insufficient_look_distance` to `True`. This means prediction times are dropped if the *lookbehind* extends further back in time than the start of the dataset or if the *lookahead* extends further than the end of the dataset.

![](img/term_d.png)


For most applications, this should be `True`: you do not want features to say they're looking a year into the future if you only have a month of data. This would compromise generalisability. However, there are some edge cases where you might want this to be `False`; see the advanced tutorial for a brief discussion of this.
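The dropping rule can be sketched as a check on each prediction time. This is an illustration only: the function name `has_sufficient_look_distance` is hypothetical, the dataset start and end are passed in explicitly (how the library determines them is not covered here), and the inclusive comparisons are an assumption.

```python
from datetime import datetime, timedelta


def has_sufficient_look_distance(
    prediction_time, data_start, data_end, lookbehind_days, lookahead_days
):
    """Return True if both windows fit entirely inside the dataset's time span."""
    covers_lookbehind = prediction_time - timedelta(days=lookbehind_days) >= data_start
    covers_lookahead = prediction_time + timedelta(days=lookahead_days) <= data_end
    return covers_lookbehind and covers_lookahead


# Assumed dataset span for illustration.
data_start = datetime(1965, 1, 1)
data_end = datetime(1969, 12, 31)

# Prediction times taken from the outputs in this tutorial.
for prediction_time in [
    datetime(1968, 5, 9, 21, 24),   # both windows fit
    datetime(1965, 3, 15, 7, 16),   # lookbehind of 730 days starts before the data
    datetime(1969, 9, 13, 23, 18),  # lookahead of 365 days ends after the data
]:
    keep = has_sufficient_look_distance(prediction_time, data_start, data_end, 730, 365)
    print(prediction_time, keep)
```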

In [9]:
ts_flattener.add_spec([sex_predictor_spec, temporal_predictor_spec, outcome_spec])

In [16]:
df = ts_flattener.get_df()

skim(df)

list(df.columns)

['entity_id',
 'timestamp',
 'prediction_time_uuid',
 'outc_outcome_name_within_365_days_maximum_fallback_0_dichotomous',
 'pred_predictor_name_within_730_days_mean_fallback_nan',
 'pred_female']

In [14]:
# For displayability, shorten col names
shortened_pred = "pred_X"
shortened_outcome = "outc_Y"

df = df.rename(
    {
        "pred_predictor_name_within_730_days_mean_fallback_nan": shortened_pred,
        "outc_outcome_name_within_365_days_maximum_fallback_0_dichotomous": shortened_outcome,
    },
    axis=1,
)

df[0:10].style.set_table_attributes('style="font-size: 14px"')

Unnamed: 0,entity_id,timestamp,prediction_time_uuid,outc_Y,pred_X,pred_female
0,9903,1968-05-09 21:24:00,9903-1968-05-09-21-24-00,0.0,0.990763,0
1,6447,1967-09-25 18:08:00,6447-1967-09-25-18-08-00,0.0,5.582745,1
2,4927,1968-06-30 12:13:00,4927-1968-06-30-12-13-00,0.0,4.957251,0
3,5475,1967-01-09 03:09:00,5475-1967-01-09-03-09-00,0.0,5.999336,0
4,9793,1968-12-15 12:59:00,9793-1968-12-15-12-59-00,0.0,7.294038,0
5,9768,1967-07-04 23:09:00,9768-1967-07-04-23-09-00,0.0,4.326286,1
6,7916,1968-12-20 03:38:00,7916-1968-12-20-03-38-00,0.0,4.629502,0
7,33,1967-07-28 03:16:00,33-1967-07-28-03-16-00,0.0,4.6285,0
8,2883,1968-01-28 21:50:00,2883-1968-01-28-21-50-00,0.0,8.257742,1
9,1515,1968-07-18 08:28:00,1515-1968-07-18-08-28-00,0.0,2.973084,0


And there we go! A dataframe ready for classification, containing:
1. The entity IDs
2. Timestamps for each prediction time
3. A unique identifier for each prediction time
4. Our predictor columns, prefixed with `pred_` and
5. Our outcome columns, prefixed with `outc_`
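The prefixes make it easy to separate predictors from outcomes before training. A minimal sketch, using the column names returned by `get_df()` above; note that the trailing underscore matters, since `prediction_time_uuid` also begins with "pred".

```python
# Column names as returned by ts_flattener.get_df() above (before renaming).
columns = [
    "entity_id",
    "timestamp",
    "prediction_time_uuid",
    "outc_outcome_name_within_365_days_maximum_fallback_0_dichotomous",
    "pred_predictor_name_within_730_days_mean_fallback_nan",
    "pred_female",
]

# Match on the full prefix including the underscore, so that
# "prediction_time_uuid" is not mistaken for a predictor.
predictor_cols = [c for c in columns if c.startswith("pred_")]
outcome_cols = [c for c in columns if c.startswith("outc_")]

print(predictor_cols)
print(outcome_cols)
```

With the flattened dataframe, `df[predictor_cols]` and `df[outcome_cols[0]]` would then give the feature matrix and the label column.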

## Handling long-format data frames with multiple value types

Often, you may wish to create predictors or outcomes from a long data frame which contains values for multiple different value types, e.g. values from different types of blood value measurements. Below you can view an example of such a data frame.

In [None]:
from timeseriesflattener.testing.utils_for_testing import load_long_df_with_multiple_values
from timeseriesflattener.utils import split_df_and_register_to_dict

In [None]:
long_df = load_long_df_with_multiple_values()
long_df

Unnamed: 0,dw_ek_borger,timestamp,value_names,value
0,3824,1968-10-30 00:01:00,value_name_1,1
1,3986,1967-04-08 04:15:00,value_name_1,0
2,3703,1968-09-06 09:43:00,value_name_1,0
3,3596,1967-12-30 10:24:00,value_name_1,0
4,4678,1967-10-29 09:21:00,value_name_1,1
...,...,...,...,...
19995,2149,1966-09-19 09:51:00,value_name_2,0
19996,571,1966-03-15 19:20:00,value_name_2,1
19997,4028,1967-08-03 00:44:00,value_name_2,1
19998,4454,1965-11-29 06:18:00,value_name_2,0


To parallelise the generation of features, timeseriesflattener needs a separate dataframe for each type of raw data. Instead of requiring you to split the long data frame into separate data frames manually, the package can split it automatically.
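For intuition, the split amounts to grouping rows by the `value_names` column and dropping that column from each group. The sketch below does this in plain Python on four rows from the output above; the helper name `split_by_value_name` is hypothetical and this is not the package's implementation.

```python
from collections import defaultdict

# Four rows from the long data frame shown above.
long_rows = [
    {"dw_ek_borger": 3824, "timestamp": "1968-10-30 00:01:00", "value_names": "value_name_1", "value": 1},
    {"dw_ek_borger": 3986, "timestamp": "1967-04-08 04:15:00", "value_names": "value_name_1", "value": 0},
    {"dw_ek_borger": 2149, "timestamp": "1966-09-19 09:51:00", "value_names": "value_name_2", "value": 0},
    {"dw_ek_borger": 571, "timestamp": "1966-03-15 19:20:00", "value_names": "value_name_2", "value": 1},
]


def split_by_value_name(rows, name_key="value_names"):
    """Group rows by value name, removing the name column from each row."""
    split = defaultdict(list)
    for row in rows:
        remaining = {k: v for k, v in row.items() if k != name_key}
        split[row[name_key]].append(remaining)
    return dict(split)


split_rows = split_by_value_name(long_rows)
print(sorted(split_rows))  # ['value_name_1', 'value_name_2']
```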

Specifically, when importing `split_df_and_register_to_dict`, a variable holding an empty dict is added to your namespace. When you call `split_df_and_register_to_dict`, it splits the dataframe and assigns the resulting dataframes to this variable. Let's see an example:

In [None]:
split_df_and_register_to_dict(df=long_df)

# This import is only needed to show the registered dfs in the docs
# You don't need to import it to use the function
from timeseriesflattener.feature_spec_objects import split_dfs

split_dfs

{'value_name_1':       dw_ek_borger           timestamp  value
 0             3824 1968-10-30 00:01:00      1
 1             3986 1967-04-08 04:15:00      0
 2             3703 1968-09-06 09:43:00      0
 3             3596 1967-12-30 10:24:00      0
 4             4678 1967-10-29 09:21:00      1
 ...            ...                 ...    ...
 9995          2149 1966-09-19 09:51:00      0
 9996           571 1966-03-15 19:20:00      1
 9997          4028 1967-08-03 00:44:00      1
 9998          4454 1965-11-29 06:18:00      0
 9999           910 1967-03-30 23:39:00      1
 
 [10000 rows x 3 columns],
 'value_name_2':        dw_ek_borger           timestamp  value
 10000          3824 1968-10-30 00:01:00      1
 10001          3986 1967-04-08 04:15:00      0
 10002          3703 1968-09-06 09:43:00      0
 10003          3596 1967-12-30 10:24:00      0
 10004          4678 1967-10-29 09:21:00      1
 ...             ...                 ...    ...
 19995          2149 1966-09-19 09:51:0

Once the variable has been populated, the separate data frames can be fetched when specifying features. You simply pass the value name, which is a key in the dictionary, to the `values_name` keyword argument. When the specification is initialized, it searches through the keys in the dictionary and extracts the corresponding data frame:

In [None]:
pred_spec = PredictorSpec(
    values_name="value_name_1",
    lookbehind_days=365,
    fallback=np.nan,
    resolve_multiple_fn=mean,
)

In [None]:
pred_spec.values_df

Unnamed: 0,dw_ek_borger,timestamp,value
0,3824,1968-10-30 00:01:00,1
1,3986,1967-04-08 04:15:00,0
2,3703,1968-09-06 09:43:00,0
3,3596,1967-12-30 10:24:00,0
4,4678,1967-10-29 09:21:00,1
...,...,...,...
9995,2149,1966-09-19 09:51:00,0
9996,571,1966-03-15 19:20:00,1
9997,4028,1967-08-03 00:44:00,1
9998,4454,1965-11-29 06:18:00,0


Now we know how to use data loader functions and how to handle a long data frame with multiple different value types to more efficiently handle the data from which we want to create features. We also know how to create a bunch of feature specifications quickly! But with more features comes more computation. Let's look at caching next, so we can iterate on our datasets more quickly.