# Seglearn User Guide

From here: https://dmbee.github.io/seglearn/user_guide.html

## What this Package Includes

The main contributions of this package are:

* `SegmentX` - transformer class for performing the time series / sequence sliding window segmentation when the target is **contextual**.
    * ✍🏻 I think that "contextual" here means "static".
* `SegmentXY` - transformer class for performing the time series / sequence sliding window segmentation when the target is a **time series or sequence**.
* `SegmentXYForecast` - transformer class for performing the time series / sequence sliding window segmentation when the target is future values of a time series or sequence.
* `PadTrunc` - transformer class for fixing time series / sequence length using a combination of *padding and truncation*.
* `Interp` - transformer class for *resampling* time series data.
* `InterpLongToWide` - transformer class for interpolating **long format** time series to **wide format** used by `seglearn`.
* `FeatureRep` - transformer class for computing a **feature representation** from segment data.
* `FeatureRepMix` - transformer class for computing **feature representations** where a different FeatureRep can be applied to *each time series variable*.
* `Pype` - sklearn compatible **pipeline class** that can *handle transforms that change X, y, and number of samples*.
* `TS_Data` - an indexable / iterable class for **storing time series & contextual data**.
* `split` - a module for splitting time series or sequences along the temporal axis.
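
To make "sliding window segmentation" concrete, here is a minimal `numpy` sketch of the idea. It is not seglearn's implementation: `sliding_segments`, `width`, and `step` are hypothetical names chosen for illustration, and the sketch assumes the series is at least `width` rows long.

```python
import numpy as np

def sliding_segments(series, width, step):
    """Hypothetical helper (not part of seglearn): cut one
    (timesteps, features) array into windows of `width` rows,
    advancing `step` rows between consecutive windows."""
    starts = range(0, series.shape[0] - width + 1, step)
    return np.stack([series[s:s + width] for s in starts])

# One series with 10 timesteps and 2 variables.
series = np.arange(20, dtype=float).reshape(10, 2)

# Windows start at rows 0, 2, 4 and 6, so 4 overlapping segments result.
segments = sliding_segments(series, width=4, step=2)
print(segments.shape)  # (4, 4, 2) = (n_segments, width, features)
```

Each segment then becomes one sample for the downstream feature representation or learner, which is why the segmenters change the number of samples.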

## What this Package Doesn’t Include

For now, this package does not include tools to help label time series data - which is a separate challenge.


## Valid Sequence Data Representations

Time series data can be represented as a *list* or *array of arrays* as follows:

In [1]:
from numpy import array
from numpy.random import rand

In [2]:
# multivariate time series data: (N = 3, variables = 5)
X = [rand(100, 5), rand(200, 5), rand(50, 5)]

In [3]:
# or equivalently as a numpy array; dtype=object is needed because the series have different lengths
X = array([rand(100, 5), rand(200, 5), rand(50, 5)], dtype=object)


‼️ **Note:** 

This "array of arrays" is **not** the usual 3-D array whose first dimension is `n_samples` (followed by `timesteps, features`).

It is similar, but it is actually a 1-D array of `object` dtype and length `n_samples`; each element is itself an array of shape `(timesteps, features)`. Because the series have different lengths, `numpy` cannot stack them into a regular array: without `dtype=object`, older versions emit a deprecation warning and newer versions raise an error.

In [5]:
X.shape

(3,)

In [6]:
X[0].shape

(100, 5)

In [9]:
feature_dims = []
for idx in range(X.shape[0]):
    print(f"{idx}: {X[idx].shape}")
    feature_dims.append(X[idx].shape[1])
assert all([d == 5 for d in feature_dims])
print("Confirmed all samples have the same number of features (of 5).")

0: (100, 5)
1: (200, 5)
2: (50, 5)
Confirmed all samples have the same number of features (of 5).


<br/>

The target, as a contextual variable (again N = 3) is represented as an array or list:

In [11]:
y = [2, 1, 3]
# or
y = array([2, 1, 3])

In [13]:
# Note that y *is* an ordinary 1-D array with shape[0] == n_samples == 3.
y.shape

(3,)

The target, **as a continuous variable** (again N = 3), will have the same length along the time axis as the corresponding time series:

In [14]:
y_cont = [rand(100), rand(200), rand(50)]

In [17]:
for idx in range(len(y_cont)):
    print(f"{idx}: y_cont[idx].shape - {y_cont[idx].shape}")
    print(f"   X[idx].shape -      {X[idx].shape}")
    assert y_cont[idx].shape[0] == X[idx].shape[0]
    print("Confirmed matching time length.")

0: y_cont[idx].shape - (100,)
   X[idx].shape -      (100, 5)
Confirmed matching time length.
1: y_cont[idx].shape - (200,)
   X[idx].shape -      (200, 5)
Confirmed matching time length.
2: y_cont[idx].shape - (50,)
   X[idx].shape -      (50, 5)
Confirmed matching time length.


The `TS_Data` class is provided as an **indexable / iterable that can store `time series` & `contextual` data**:

In [18]:
from seglearn.base import TS_Data

In [20]:
# Time series part:
Xt = array([rand(100,5), rand(200,5), rand(50,5)], dtype=object)

# Create 2 context variables:
Xc = rand(3, 2)

X = TS_Data(Xt, Xc)

In [21]:
type(X)

seglearn.base.TS_Data

A quick look at various methods / attributes of `TS_Data`:

In [24]:
class_attrs = [x for x in dir(X) if "__" not in x]
class_attrs

['N', 'context_data', 'from_df', 'index', 'shape', 'ts_data']

In [25]:
X.N

3

In [27]:
X.shape

[3]

In [31]:
help(X.from_df)

Help on method from_df in module seglearn.base:

from_df(df) method of builtins.type instance



In [35]:
X.index

0

In [36]:
X.context_data

array([[0.38629216, 0.3848893 ],
       [0.06379743, 0.92788826],
       [0.55308507, 0.06523256]])

In [43]:
print("X.ts_data:")
print(f"shape={X.ts_data.shape}", end="\n\n")
print(
    str(X.ts_data)[:300] + " ..."
)

X.ts_data:
shape=(3,)

[array([[0.60055822, 0.1479646 , 0.31193028, 0.82904598, 0.99366586],
        [0.43682286, 0.44301719, 0.50046637, 0.75321285, 0.69766529],
        [0.73308697, 0.10755424, 0.85839365, 0.45548365, 0.23042428],
        [0.82291378, 0.15064949, 0.70469326, 0.01067503, 0.33340476],
        [0.94172176, ...


`TS_Data` can be initialized from a `pandas` dataframe using column `'ts_data'` for the time series:

In [44]:
import pandas as pd

In [45]:
df = pd.DataFrame(Xc)
df

|   | 0 | 1 |
|---|---|---|
| 0 | 0.386292 | 0.384889 |
| 1 | 0.063797 | 0.927888 |
| 2 | 0.553085 | 0.065233 |


In [47]:
df['ts_data'] = Xt
df

|   | 0 | 1 | ts_data |
|---|---|---|---|
| 0 | 0.386292 | 0.384889 | [[0.6005582152336411, 0.14796459990134914, 0.3... |
| 1 | 0.063797 | 0.927888 | [[0.05357667712581771, 0.9481113284716495, 0.9... |
| 2 | 0.553085 | 0.065233 | [[0.2730056524386082, 0.17913637707892582, 0.8... |


**‼️ Note:**

This `pandas` helper *isn't very useful*, as the time series data has to be pre-collected as a list/array of arrays, one per sample, in the `'ts_data'` column.

In [50]:
X = TS_Data.from_df(df)
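
One way to pre-collect the series is sketched below, assuming the raw data arrives as a flat table with one row per sample and timestep. The column names `sample_id`, `a`, and `b` are hypothetical, and the sketch only builds the nested `'ts_data'` column; it does not call seglearn.

```python
import numpy as np
import pandas as pd

# Hypothetical flat table: one row per (sample, timestep).
flat = pd.DataFrame({
    "sample_id": [0, 0, 0, 1, 1],
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [0.1, 0.2, 0.3, 0.4, 0.5],
})

# Collect one (timesteps, features) array per sample.
series = [g[["a", "b"]].to_numpy() for _, g in flat.groupby("sample_id")]

# Fill an object array element by element so numpy never tries to
# stack the (possibly ragged) series into a regular array.
ts = np.empty(len(series), dtype=object)
for i, s in enumerate(series):
    ts[i] = s

# One row per sample, with the series nested in the 'ts_data' column.
nested = pd.DataFrame({"ts_data": ts})
print([s.shape for s in nested["ts_data"]])  # [(3, 2), (2, 2)]
```

Any per-sample context variables would be added to `nested` as ordinary columns.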

There is a caveat for datasets that are a **single time series**. 

For compatibility with the seglearn segmenter classes, they need to be represented as a list:

In [51]:
X = [rand(1000,10)]
y = [rand(1000)]

If you want to split a single time series for train / test or cross validation - make sure to use one of the temporal splitting tools in `split`. 

If you have many time series in the dataset, you can use the sklearn splitters to split the data by series. This is demonstrated in the examples.
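
For the single-series case, the key point is that the split must respect time order rather than shuffle rows. The following is a plain `numpy` sketch of that idea, not the API of the `split` module; `temporal_holdout` and `test_fraction` are hypothetical names.

```python
import numpy as np

def temporal_holdout(x, y, test_fraction=0.25):
    """Hypothetical helper (not part of seglearn): train on the first
    part of one series and test on the final `test_fraction`, unshuffled."""
    cut = int(round(len(x) * (1 - test_fraction)))
    return x[:cut], x[cut:], y[:cut], y[cut:]

# One series with 20 timesteps and 2 variables, plus a continuous target.
x = np.arange(40, dtype=float).reshape(20, 2)
y = np.arange(20, dtype=float)

x_train, x_test, y_train, y_test = temporal_holdout(x, y)
print(x_train.shape, x_test.shape)  # (15, 2) (5, 2)

# Wrap the pieces in lists again before passing them to a segmenter.
X_train, X_test = [x_train], [x_test]
```

Shuffling individual timesteps instead would leak future information into the training set, because neighbouring samples in a series are strongly correlated.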

Irregularly sampled **"long" format** time series data (with timestamps) can be **interpolated and transformed** to the wide format used by seglearn with the `InterpLongToWide` transformer:

In [53]:
import numpy as np
Xlong = pd.DataFrame({'time': np.arange(20), 'sensor': np.random.choice([1,2,3], 20), 'value': np.random.rand(20)})

In [54]:
Xlong

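|   | time | sensor | value |
|---|---|---|---|
| 0 | 0 | 2 | 0.670529 |
| 1 | 1 | 2 | 0.435911 |
| 2 | 2 | 3 | 0.342993 |
| 3 | 3 | 2 | 0.063419 |
| 4 | 4 | 2 | 0.88368 |
| 5 | 5 | 1 | 0.375351 |
| 6 | 6 | 1 | 0.543495 |
| 7 | 7 | 2 | 0.411129 |
| 8 | 8 | 2 | 0.272273 |
| 9 | 9 | 1 | 0.24877 |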


As shown above, the "long" format has one row per observation, with the columns `time`, `variable` (here, "sensor"), and `value`.

In [56]:
from seglearn import InterpLongToWide
interp = InterpLongToWide(
    sample_period=1.0, 
    kind='linear', 
    assume_sorted=False
)

In [57]:
interp

InterpLongToWide(assume_sorted=False, sample_period=1.0)

In [58]:
Xwide, _ , _ = interp.transform([Xlong.values])
Xwide

[array([[0.37535102, 0.72616307, 0.35909666],
        [0.54349493, 0.56864586, 0.36446442],
        [0.4452532 , 0.41112865, 0.36983219],
        [0.34701147, 0.27227317, 0.37519995],
        [0.24876973, 0.37294161, 0.38056771],
        [0.61826328, 0.47361005, 0.38593547],
        [0.63496667, 0.57427849, 0.39130323],
        [0.65167006, 0.67494692, 0.42922395],
        [0.53171741, 0.77561536, 0.46714467],
        [0.41176475, 0.8762838 , 0.87593014]])]

Interpolation like this can be incorporated into a seglearn pipeline.
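
To illustrate what a long-to-wide interpolation involves, here is a rough `numpy` sketch using linear interpolation. It is written under assumptions and may differ from `InterpLongToWide` in details such as how the output time range is chosen; `long_to_wide` is a hypothetical helper and the data values are made up.

```python
import numpy as np

# Hypothetical long-format data: columns are (time, sensor id, value).
long_data = np.array([
    [0.0, 1, 10.0],
    [0.0, 2, 1.0],
    [2.0, 1, 30.0],
    [4.0, 2, 5.0],
    [4.0, 1, 50.0],
])

def long_to_wide(long_data, sample_period):
    """Hypothetical helper: resample every sensor onto one regular
    time grid and return (grid, wide) with one column per sensor."""
    times, sensors, values = long_data[:, 0], long_data[:, 1], long_data[:, 2]
    ids = np.unique(sensors)
    # Assumption: only resample where all sensors have data, so that
    # nothing has to be extrapolated.
    t_min = max(times[sensors == s].min() for s in ids)
    t_max = min(times[sensors == s].max() for s in ids)
    grid = np.arange(t_min, t_max + sample_period / 2, sample_period)
    columns = []
    for s in ids:
        t, v = times[sensors == s], values[sensors == s]
        order = np.argsort(t)  # np.interp needs increasing sample times
        columns.append(np.interp(grid, t[order], v[order]))
    return grid, np.column_stack(columns)

grid, wide = long_to_wide(long_data, sample_period=1.0)
print(wide.shape)  # (5, 2): 5 regular timesteps, one column per sensor
```

The result has the same `(timesteps, features)` layout as the series used throughout this guide, so it can be placed in a list and passed to a segmenter.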