# Advanced Tutorial

In the basic tutorial, we covered how to add static features, predictors and outcomes.
In this tutorial, we'll expand on that and cover how to add many features effectively by:
1. Utilising data loaders in a data_loaders registry,
2. Populating a dictionary from a long format dataframe,
3. Creating feature combinations from specifications,
4. Using caching, so you can iterate on your datasets without having to complete full computations every time.


## Using data loaders
Until now, we've loaded our data first and then created combinations. But what if your data lies in an SQL database, and you don't want to save it to disk?

Time to introduce feature loaders. All feature spec objects (PredictorSpec, OutcomeSpec and StaticSpec) can resolve from a loader function. The only requirement of that loader function is that it returns a values dataframe, which should contain an ID column, a timestamp column and a value column. This means you can have loaders that load from Redis, SQL databases, or just from disk. Whatever you prefer.

This function is then called when you initialise a feature specification.
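As a rough illustration of the requirement above, here is a minimal sketch of a custom loader. It assumes a hypothetical `lab_values` table and uses an in-memory SQLite database as a stand-in for a real SQL server; the function name and the data are made up for the example. The column names `entity_id`, `timestamp` and `value` follow the synthetic data used in this tutorial.

```python
import sqlite3

import pandas as pd


def load_lab_values_from_sql() -> pd.DataFrame:
    """Hypothetical loader returning one row per measurement."""
    # An in-memory SQLite database stands in for a real SQL server here.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE lab_values (entity_id INTEGER, timestamp TEXT, value REAL)"
    )
    conn.executemany(
        "INSERT INTO lab_values VALUES (?, ?, ?)",
        [
            (1, "1969-03-05 08:08:00", 0.82),
            (1, "1969-04-01 10:00:00", 1.75),
            (2, "1967-04-10 22:48:00", 4.82),
        ],
    )
    # Parse the timestamp column so it arrives as a datetime, not a string.
    df = pd.read_sql_query(
        "SELECT entity_id, timestamp, value FROM lab_values",
        conn,
        parse_dates=["timestamp"],
    )
    conn.close()
    return df


values_df = load_lab_values_from_sql()
```

Because the loader takes no required arguments and returns the values dataframe directly, nothing has to be written to disk before the data is used.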

The loader is passed via the values_loader argument, like so:


In [1]:
from skimpy import skim
from timeseriesflattener.testing.load_synth_data import load_synth_predictor_float
from timeseriesflattener.resolve_multiple_functions import mean
from timeseriesflattener.feature_spec_objects import PredictorSpec
from pprint import pprint
import numpy as np

In [2]:
pred_spec_batch = PredictorSpec(
    values_loader=load_synth_predictor_float,
    lookbehind_days=730,
    fallback=np.nan,
    resolve_multiple_fn=mean,
    feature_name="predictor_name",
)

pred_spec_batch.values_df

| | entity_id | timestamp | value |
|---|---|---|---|
| 0 | 9476 | 1969-03-05 08:08:00 | 0.816995 |
| 1 | 4631 | 1967-04-10 22:48:00 | 4.818074 |
| 2 | 3890 | 1969-12-15 14:07:00 | 2.503789 |
| 3 | 1098 | 1965-11-19 03:53:00 | 3.515041 |
| 4 | 1626 | 1966-05-03 14:07:00 | 4.353115 |
| ... | ... | ... | ... |
| 99995 | 4542 | 1968-06-01 17:09:00 | 9.616722 |
| 99996 | 4839 | 1966-11-24 01:13:00 | 0.235124 |
| 99997 | 8168 | 1969-07-30 01:45:00 | 0.929738 |
| 99998 | 9328 | 1965-12-22 10:53:00 | 5.124424 |


### The data loaders registry
If you inspect the source code of load_synth_predictor_float, you'll see that it is decorated with @data_loaders.register("synth_predictor_float").

```python
@data_loaders.register("synth_predictor_float")
def load_synth_predictor_float(
    n_rows: Optional[int] = None,
) -> pd.DataFrame:
    """Load synth predictor data.

    Args:
        n_rows: Number of rows to return. Defaults to None, which returns all rows.

    Returns:
        pd.DataFrame
    """
    return load_raw_test_csv("synth_raw_float_1.csv", n_rows=n_rows)
```

This registers it in the data_loaders registry under the "synth_predictor_float" key.

When you initialise a feature specification, it will look at the type of its `values_loader` attribute. If its type is a string, it will look for that string as a key in the data loaders registry. If it finds it, it'll resolve it to the value, in this case the `load_synth_predictor_float` function, and call that function.

The same concept applies to the resolve multiple functions, which can also be specified as strings.
This is handy if, for example, you want to parse a config file and therefore prefer to specify your data loaders as strings.
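To make the lookup concrete, here is a minimal sketch of a string-keyed registry in plain Python. It is not the library's implementation: `LOADERS`, `register`, `resolve_loader` and `load_toy_values` are hypothetical names chosen for this example. It only illustrates the idea of resolving a string to a registered function.

```python
from typing import Callable, Dict, Union

# Hypothetical stand-in for the library's data_loaders registry.
LOADERS: Dict[str, Callable] = {}


def register(name: str) -> Callable:
    """Return a decorator that stores the decorated function under `name`."""

    def decorator(fn: Callable) -> Callable:
        LOADERS[name] = fn
        return fn

    return decorator


@register("toy_values")
def load_toy_values() -> list:
    # A real loader would return a values dataframe; a list keeps the sketch short.
    return [0.8, 4.8, 2.5]


def resolve_loader(loader: Union[str, Callable]) -> Callable:
    """Look strings up in the registry; pass callables through unchanged."""
    if isinstance(loader, str):
        return LOADERS[loader]
    return loader


# Both the string key and the function itself resolve to the same loader.
loader_fn = resolve_loader("toy_values")
values = loader_fn()
```

With this pattern, a config file only needs to contain the string `"toy_values"`; the function is looked up when the specification is built.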

## Creating feature combinations
Manually specifying a handful of features one at a time is rather straightforward, but what if you want to generate hundreds of features? Or want multiple lookbehind windows, e.g. a month, 6 months and a year? Then the amount of code you have to write grows substantially, and it becomes time-consuming to write and hard to navigate.

To solve this problem, we implemented feature group specifications. They allow you to combinatorially create features. Let's look at an example:


In [3]:
from timeseriesflattener.feature_spec_objects import PredictorGroupSpec
from timeseriesflattener.resolve_multiple_functions import maximum

In [4]:
pred_spec_batch = PredictorGroupSpec(
    values_loader=["synth_predictor_float"],
    lookbehind_days=[365, 730],
    fallback=[np.nan],
    resolve_multiple_fn=[mean, maximum],
).create_combinations()

You'll note that:

1. All attributes are now required to be lists. This makes iteration easier when creating the combinations.
2. We require the entries in values_loader to be strings that can be resolved from the registry. Each string is also used when creating the column names; otherwise we wouldn't know what to call the columns.

Let's check that the results look good.

In [5]:
# Create a small summary to highlight the generated predictors
pred_spec_batch_summary = [
    {
        "feature_name": pred_spec.feature_name,
        "lookbehind_days": pred_spec.lookbehind_days,
        "resolve_multiple_fn": pred_spec.key_for_resolve_multiple,
    }
    for pred_spec in pred_spec_batch
]
print(
    f"––––––––– We created {len(pred_spec_batch)} combinations of predictors. ––––––––––"
)
pprint(pred_spec_batch_summary)

––––––––– We created 4 combinations of predictors. ––––––––––
[{'feature_name': 'synth_predictor_float',
  'lookbehind_days': 365,
  'resolve_multiple_fn': 'mean'},
 {'feature_name': 'synth_predictor_float',
  'lookbehind_days': 730,
  'resolve_multiple_fn': 'mean'},
 {'feature_name': 'synth_predictor_float',
  'lookbehind_days': 365,
  'resolve_multiple_fn': 'maximum'},
 {'feature_name': 'synth_predictor_float',
  'lookbehind_days': 730,
  'resolve_multiple_fn': 'maximum'}]
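The four specifications above correspond to the Cartesian product of the input lists (1 loader × 2 lookbehinds × 1 fallback × 2 resolve functions). Below is a minimal sketch of that idea using `itertools.product`. It is not how `create_combinations()` is implemented, and the dictionaries are a stand-in for real spec objects. The name pattern mirrors the column names the flattener produces, for example `pred_synth_predictor_float_within_365_days_mean_fallback_nan`.

```python
from itertools import product

# Same inputs as the PredictorGroupSpec, written as plain Python values.
values_loaders = ["synth_predictor_float"]
lookbehind_days = [365, 730]
fallbacks = [float("nan")]
resolve_multiple_fns = ["mean", "maximum"]

# One dictionary per combination of the four lists.
combinations = [
    {
        "values_loader": loader,
        "lookbehind_days": days,
        "fallback": fallback,
        "resolve_multiple_fn": fn,
    }
    for loader, fn, days, fallback in product(
        values_loaders, resolve_multiple_fns, lookbehind_days, fallbacks
    )
]

# Build a column name per combination (illustrative pattern only).
col_names = [
    f"pred_{c['values_loader']}_within_{c['lookbehind_days']}_days_"
    f"{c['resolve_multiple_fn']}_fallback_{c['fallback']}"
    for c in combinations
]
```

Adding a third lookbehind window to the list would yield six combinations without any further code.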


Now we know how to create a bunch of feature specifications quickly! But with more features comes more computation. Let's look at caching next, so we can iterate on our datasets more quickly.

## Caching

Timeseriesflattener ships with a class that allows for caching to disk. Let's look at an example of that:

In [6]:
from skimpy import skim
from timeseriesflattener.testing.load_synth_data import load_synth_prediction_times
from timeseriesflattener.feature_cache.cache_to_disk import DiskCache
from timeseriesflattener.flattened_dataset import TimeseriesFlattener
from pathlib import Path

In [7]:
ts_flattener = TimeseriesFlattener(
    prediction_times_df=load_synth_prediction_times(),
    entity_id_col_name="entity_id",
    timestamp_col_name="timestamp",
    n_workers=4,
    cache=DiskCache(feature_cache_dir=Path(".tmp") / "feature_cache"),
    drop_pred_times_with_insufficient_look_distance=True,
)

All we need to specify is that we use the DiskCache class, and which directory to save the feature cache to.

The first time we create features, this will just save them to disk and won't make any difference to performance. But say we want to add two more features - then it'll load the features that it has already computed from disk, and then only compute the two new features.
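The behaviour described above can be sketched as a "compute only what is missing" loop. The class and function below are hypothetical and kept in memory for brevity; the method names are illustrative assumptions and may not match the library's cache interface.

```python
from typing import Callable, Dict, List, Tuple


class InMemoryCache:
    """Hypothetical cache keyed by feature name; not the library's class."""

    def __init__(self) -> None:
        self._store: Dict[str, List[float]] = {}

    def feature_exists(self, name: str) -> bool:
        return name in self._store

    def read_feature(self, name: str) -> List[float]:
        return self._store[name]

    def write_feature(self, name: str, values: List[float]) -> None:
        self._store[name] = values


def get_features(
    names: List[str],
    compute: Callable[[str], List[float]],
    cache: InMemoryCache,
) -> Tuple[Dict[str, List[float]], List[str]]:
    """Return all features, plus the names that had to be computed."""
    computed: List[str] = []
    result: Dict[str, List[float]] = {}
    for name in names:
        if not cache.feature_exists(name):
            # Cache miss: compute once and store for later runs.
            cache.write_feature(name, compute(name))
            computed.append(name)
        result[name] = cache.read_feature(name)
    return result, computed


def toy_compute(name: str) -> List[float]:
    # Placeholder for an expensive feature computation.
    return [float(len(name))]


cache = InMemoryCache()
_, first_run = get_features(["mean_365", "max_365"], toy_compute, cache)
_, second_run = get_features(["mean_365", "max_365", "mean_730"], toy_compute, cache)
```

On the first call both features are computed; on the second call only the newly added `mean_730` is, which is the saving a cache gives when you iterate on a dataset.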

Note that DiskCache is an implementation of the abstract class FeatureCache. If you want to implement your own cache, for example using Redis or SQL, all you need to do is implement the 3 methods in that class.

Now, let's compute a dataframe to check that everything works.

In [8]:
ts_flattener.add_spec(pred_spec_batch)

In [9]:
df = ts_flattener.get_df()

2022-12-09 11:44:44 [INFO] There were unprocessed specs, computing...
2022-12-09 11:44:44 [INFO] _drop_pred_time_if_insufficient_look_distance: Dropped 4038 (40.38%) rows
100%|██████████| 4/4 [00:01<00:00,  2.91it/s]
2022-12-09 11:44:46 [INFO] Starting concatenation. Will take some time on performant systems, e.g. 30s for 100 features. This is normal.
2022-12-09 11:44:46 [INFO] Concatenation took 0.019 seconds
2022-12-09 11:44:46 [INFO] Merging with original df


In [10]:
skim(df)

list(df.columns)

['entity_id',
 'timestamp',
 'prediction_time_uuid',
 'pred_synth_predictor_float_within_365_days_maximum_fallback_nan',
 'pred_synth_predictor_float_within_365_days_mean_fallback_nan',
 'pred_synth_predictor_float_within_730_days_mean_fallback_nan',
 'pred_synth_predictor_float_within_730_days_maximum_fallback_nan']

In [15]:
# For displayability, shorten col names
pred_cols = [c for c in df.columns if c.startswith("pred_")]
rename_dict = {c: f"pred_{i+1}" for i, c in enumerate(pred_cols)}
df_renamed = df.rename(rename_dict, axis=1)

# Print a dataframe
base_cols = ["entity_id", "timestamp", "prediction_time_uuid"]
renamed_cols = list(rename_dict.values())

df_renamed[0:10][base_cols + renamed_cols].style.\
    set_table_attributes('style="font-size: 14px"')

| | entity_id | timestamp | prediction_time_uuid | pred_1 | pred_2 | pred_3 | pred_4 |
|---|---|---|---|---|---|---|---|
| 0 | 9903 | 1968-05-09 21:24:00 | 9903-1968-05-09-21-24-00 | 0.154981 | 0.154981 | 0.990763 | 2.194319 |
| 1 | 6447 | 1967-09-25 18:08:00 | 6447-1967-09-25-18-08-00 | 8.930256 | 5.396017 | 5.582745 | 9.77405 |
| 2 | 4927 | 1968-06-30 12:13:00 | 4927-1968-06-30-12-13-00 | 6.730694 | 4.957251 | 4.957251 | 6.730694 |
| 3 | 5475 | 1967-01-09 03:09:00 | 5475-1967-01-09-03-09-00 | 9.497229 | 6.081539 | 5.999336 | 9.497229 |
| 4 | 3157 | 1969-10-07 05:01:00 | 3157-1969-10-07-05-01-00 | 5.243176 | 5.068323 | 5.068323 | 5.243176 |
| 5 | 9793 | 1968-12-15 12:59:00 | 9793-1968-12-15-12-59-00 | 9.708976 | 8.091755 | 7.294038 | 9.708976 |
| 6 | 9768 | 1967-07-04 23:09:00 | 9768-1967-07-04-23-09-00 | 5.729441 | 4.959419 | 4.326286 | 5.729441 |
| 7 | 9861 | 1969-01-22 17:34:00 | 9861-1969-01-22-17-34-00 | 3.130283 | 3.130283 | 3.279378 | 5.491415 |
| 8 | 657 | 1969-04-14 15:47:00 | 657-1969-04-14-15-47-00 | | | 7.903614 | 7.903614 |
| 9 | 7916 | 1968-12-20 03:38:00 | 7916-1968-12-20-03-38-00 | 4.318586 | 3.901992 | 4.629502 | 6.084523 |
