# Advanced Tutorial

In the basic tutorial, we covered how to add static features, predictors and outcomes.
In this tutorial, we'll expand on that and cover how to add many features efficiently by:
1. Creating feature combinations from specifications.
2. Using caching, so you can iterate on your datasets without having to redo the full computation every time.


## Creating feature combinations
Manually specifying a handful of features one at a time is straightforward, but what if you want to generate hundreds of features? Or use several lookbehind windows, e.g. a month, 6 months and a year? The amount of code you have to write then grows substantially, and it becomes time-consuming to write and hard to navigate.

To solve this problem, we implemented feature group specifications. They allow you to combinatorially create features. Let's look at an example:


In [24]:
from pprint import pprint

import numpy as np
from timeseriesflattener.aggregation_fns import maximum, mean
from timeseriesflattener.feature_specs.group_specs import (
    NamedDataframe,
    PredictorGroupSpec,
)
from timeseriesflattener.testing.load_synth_data import load_synth_predictor_float

In [25]:
pred_spec_batch = PredictorGroupSpec(
    named_dataframes=[
        NamedDataframe(df=load_synth_predictor_float(), name="synth_predictor_float")
    ],
    lookbehind_days=[365, 730],
    fallback=[np.nan],
    aggregation_fns=[mean, maximum],
).create_combinations()

You'll note that:

1. All attributes are now required to be lists. This makes iteration easier when creating the combinations.
2. We require a named_dataframes sequence. A NamedDataframe is exactly that: a dataframe and a name. The name is used when we create the features in the output, e.g. for a predictor, the output feature using load_synth_predictor_float will be called pred_synth_predictor_float_<metadata>, because that's the name given in the NamedDataframe.

Let's check that the results look good.

In [26]:
# Create a small summary to highlight the generated predictors
pred_spec_batch_summary = [
    {
        "feature_name": pred_spec.feature_base_name,
        "lookbehind_days": pred_spec.lookbehind_days,
        "aggregation_fn": pred_spec.aggregation_fn.__name__,
    }
    for pred_spec in pred_spec_batch
]
print(
    f"––––––––– We created {len(pred_spec_batch)} combinations of predictors. ––––––––––"
)
pprint(pred_spec_batch_summary)

––––––––– We created 4 combinations of predictors. ––––––––––
[{'aggregation_fn': 'mean',
  'feature_name': 'synth_predictor_float',
  'lookbehind_days': 365.0},
 {'aggregation_fn': 'maximum',
  'feature_name': 'synth_predictor_float',
  'lookbehind_days': 365.0},
 {'aggregation_fn': 'mean',
  'feature_name': 'synth_predictor_float',
  'lookbehind_days': 730.0},
 {'aggregation_fn': 'maximum',
  'feature_name': 'synth_predictor_float',
  'lookbehind_days': 730.0}]
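
The four combinations above are what you would get from a Cartesian product of the input lists: 1 dataframe × 2 lookbehind windows × 1 fallback × 2 aggregation functions. As a rough illustration of that idea, here is a minimal sketch in plain Python. It does not use timeseriesflattener, and the dictionary keys and variable names are hypothetical; how create_combinations() is implemented internally is not shown in this tutorial.

```python
from itertools import product

# Illustrative values mirroring the example above. The dictionary keys are
# hypothetical and are not part of the timeseriesflattener API.
names = ["synth_predictor_float"]
lookbehind_days = [365, 730]
fallbacks = [float("nan")]
aggregation_fns = ["mean", "maximum"]

# One entry per element of the Cartesian product of the four lists.
combinations = [
    {
        "feature_name": name,
        "lookbehind_days": days,
        "fallback": fallback,
        "aggregation_fn": agg,
    }
    for name, days, fallback, agg in product(
        names, lookbehind_days, fallbacks, aggregation_fns
    )
]

print(len(combinations))
for combination in combinations:
    print(combination["lookbehind_days"], combination["aggregation_fn"])
```

Adding one more lookbehind window to the list would yield 6 combinations, and adding a second dataframe on top of that would yield 12, which is why group specifications scale much better than writing each specification by hand.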


Now we know how to create a bunch of feature specifications quickly! But with more features comes more computation. Let's look at caching next, so we can iterate on our datasets more quickly.

## Caching

Timeseriesflattener ships with a class that allows for caching to disk. Let's look at an example of that:

In [27]:
from skimpy import skim
from timeseriesflattener.testing.load_synth_data import load_synth_prediction_times
from timeseriesflattener.feature_cache.cache_to_disk import DiskCache
from timeseriesflattener.flattened_dataset import TimeseriesFlattener
from pathlib import Path

In [28]:
ts_flattener = TimeseriesFlattener(
    prediction_times_df=load_synth_prediction_times(),
    entity_id_col_name="entity_id",
    timestamp_col_name="timestamp",
    n_workers=4,
    cache=DiskCache(
        feature_cache_dir=Path(".tmp") / "feature_cache",
    ),
    drop_pred_times_with_insufficient_look_distance=True,
)


2023-06-14 16:19:04 [INFO] Overriding pred_time_uuid_col_name in cache with pred_time_uuid_col_name passed to init of flattened dataset


All we need to specify is that we use the DiskCache class, and which directory to save the feature cache to.

The first time we create features, they are simply saved to disk, so caching makes no difference to performance. But say we later want to add two more features: the features that have already been computed are then loaded from disk, and only the two new features are computed.
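
To illustrate the "only compute what is missing" behaviour, here is a minimal, library-independent sketch. It is not how DiskCache is implemented; the function names (get_features, toy_compute) and the one-JSON-file-per-feature layout are assumptions made purely for illustration.

```python
import json
import tempfile
from pathlib import Path


def get_features(feature_names, compute_fn, cache_dir):
    """Return ({name: values}, newly_computed), computing only uncached features."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    features, newly_computed = {}, []
    for name in feature_names:
        cache_file = cache_dir / f"{name}.json"
        if cache_file.exists():
            # Cache hit: load the stored values instead of recomputing them.
            features[name] = json.loads(cache_file.read_text())
        else:
            # Cache miss: compute the feature, then persist it for the next run.
            features[name] = compute_fn(name)
            cache_file.write_text(json.dumps(features[name]))
            newly_computed.append(name)
    return features, newly_computed


def toy_compute(name):
    # Stand-in for an expensive feature computation.
    return [len(name), len(name) * 2]


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp) / "feature_cache"
    _, first_run = get_features(["a", "b"], toy_compute, cache_dir)
    _, second_run = get_features(["a", "b", "c", "d"], toy_compute, cache_dir)

print(first_run)   # ['a', 'b']
print(second_run)  # ['c', 'd']
```

On the second call, the two features from the first run are read back from disk and only the two new ones are computed, which is the pattern described above.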

Note that DiskCache is a subclass of the abstract class FeatureCache. If you want to implement your own cache, for example using Redis or SQL, all you need to do is implement the 3 methods in that class. Now, let's compute a dataframe to check that everything works.

In [29]:
ts_flattener.add_spec(pred_spec_batch)

In [30]:
df = ts_flattener.get_df()

2023-06-14 16:19:04 [INFO] There were unprocessed specs, computing...
2023-06-14 16:19:04 [INFO] _drop_pred_time_if_insufficient_look_distance: Dropped 4038 (40.38%) rows
2023-06-14 16:19:04 [INFO] Processing 4 temporal features in parallel with 4 workers. Chunksize is 1. If this is above 1, it may take some time for the progress bar to move, as processing is batched. However, this makes for much faster total performance.
100%|██████████| 4/4 [00:01<00:00,  2.75it/s]
2023-06-14 16:19:05 [INFO] Checking alignment of dataframes - this might take a little while (~2 minutes for 1.000 dataframes with 2.000.000 rows).
2023-06-14 16:19:05 [INFO] Starting concatenation. Will take some time on performant systems, e.g. 30s for 100 features and 2_000_000 prediction times. This is normal.
2023-06-14 16:19:05 [INFO] Concatenation took 0.007 seconds
2023-06-14 16:19:05 [INFO] Merging with original df


In [31]:
skim(df)

list(df.columns)

['entity_id',
 'timestamp',
 'prediction_time_uuid',
 'pred_synth_predictor_float_within_365_days_mean_fallback_nan',
 'pred_synth_predictor_float_within_730_days_maximum_fallback_nan',
 'pred_synth_predictor_float_within_365_days_maximum_fallback_nan',
 'pred_synth_predictor_float_within_730_days_mean_fallback_nan']
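
Judging from the column names above, each predictor column appears to combine a pred prefix, the name from the NamedDataframe, the lookbehind window, the aggregation function and the fallback. The helper below is a hypothetical reconstruction of that pattern, inferred only from this output; it is not a function from timeseriesflattener, and the library's actual naming logic may differ.

```python
def make_col_name(name, lookbehind_days, aggregation_fn, fallback, prefix="pred"):
    # Hypothetical helper: mirrors the naming pattern seen in df.columns above.
    return (
        f"{prefix}_{name}_within_{lookbehind_days}_days_"
        f"{aggregation_fn}_fallback_{fallback}"
    )


col_name = make_col_name("synth_predictor_float", 365, "mean", "nan")
print(col_name)
```

A helper like this can be handy for selecting a specific feature column by its specification instead of typing the full column name.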

In [32]:
# For displayability, shorten col names
pred_cols = [c for c in df.columns if c.startswith("pred_")]
rename_dict = {c: f"pred_{i+1}" for i, c in enumerate(pred_cols)}
df_renamed = df.rename(rename_dict, axis=1)

# Print a dataframe
base_cols = ["entity_id", "timestamp", "prediction_time_uuid"]
renamed_cols = list(rename_dict.values())

df_renamed[0:10][base_cols + renamed_cols].style.set_table_attributes(
    'style="font-size: 14px"'
)


| | entity_id | timestamp | prediction_time_uuid | pred_1 | pred_2 | pred_3 | pred_4 |
|---|---|---|---|---|---|---|---|
| 0 | 9903 | 1968-05-09 21:24:00 | 9903-1968-05-09-21-24-00 | 0.154981 | 2.194319 | 0.154981 | 0.990763 |
| 1 | 6447 | 1967-09-25 18:08:00 | 6447-1967-09-25-18-08-00 | 5.396017 | 9.77405 | 8.930256 | 5.582745 |
| 2 | 4927 | 1968-06-30 12:13:00 | 4927-1968-06-30-12-13-00 | 4.957251 | 6.730694 | 6.730694 | 4.957251 |
| 3 | 5475 | 1967-01-09 03:09:00 | 5475-1967-01-09-03-09-00 | 6.081539 | 9.497229 | 9.497229 | 5.999336 |
| 4 | 3157 | 1969-10-07 05:01:00 | 3157-1969-10-07-05-01-00 | 5.068323 | 5.243176 | 5.243176 | 5.068323 |
| 5 | 9793 | 1968-12-15 12:59:00 | 9793-1968-12-15-12-59-00 | 8.091755 | 9.708976 | 9.708976 | 7.294038 |
| 6 | 9768 | 1967-07-04 23:09:00 | 9768-1967-07-04-23-09-00 | 4.959419 | 5.729441 | 5.729441 | 4.326286 |
| 7 | 9861 | 1969-01-22 17:34:00 | 9861-1969-01-22-17-34-00 | 3.130283 | 5.491415 | 3.130283 | 3.279378 |
| 8 | 657 | 1969-04-14 15:47:00 | 657-1969-04-14-15-47-00 | nan | 7.903614 | nan | 7.903614 |
| 9 | 7916 | 1968-12-20 03:38:00 | 7916-1968-12-20-03-38-00 | 3.901992 | 6.084523 | 4.318586 | 4.629502 |
