In [21]:
# set up an absolute path to the project 
# not needed in case of `pip install`
%run -i ../tools/setup_env.py

## Usage examples of `torchcnnbuilder.preprocess`

This submodule contains tensor preprocessing functions. At the moment, it only provides functions for splitting a time series into X and Y tensors, since the main functionality was originally developed for forecasting N-dimensional time series.

### Submodule `torchcnnbuilder.preprocess.time_series`

First, let's create synthetic data. The generation script is located in `../tools/generating_time_series.py`. The data consists of 210 numpy matrices of size 100x100 that form a 2-dimensional time series: a square moving in a circle. A frame-by-frame animation of the time series is shown below:
<img src="../tools/media/time_series_animation.gif" alt="animation" style="width:40%; display: block; margin-left: auto; margin-right: auto;">

In [22]:
%%capture
from examples.tools.generating_time_series import synthetic_time_series

# the first object is an animation class of the whole time series
_, data = synthetic_time_series()

In [3]:
print(f'Dataset len: {len(data)}, One matrix shape: {data[0].shape}')

#### Function `single_output_tensor`

In [4]:
from torchcnnbuilder.preprocess import single_output_tensor

Params:

- **data**: N-dimensional arrays, lists, numpy arrays, tensors etc.
- **forecast_len**: length of prediction for each y-train future tensor (target)
- **additional_x**: extra x-train data. Default: None
- **additional_is_array**: if additional x-train is an array of x_i data like other time series. Default: False
- **additional_x_stack**: if True stack each additional_x_i to x-train. Default: True
- **threshold**: binarization threshold for each y-tensor. Default: False
- **x_binarize**: binarization with threshold for each x-tensor. Default: False

Returns:
TensorDataset of X-train and y-train

This function splits an n-dimensional time series into a single X part and a single Y part and returns a `TensorDataset`. Let's say we want to predict the next 30 states based on the rest of the data; then `forecast_len=30`. The function accepts any sequence or array type. The library has no dependencies other than `torch`, so the input is converted to a tensor inside the function, and for `numpy.array` input you may see a warning related to that conversion. The result of the function can be visualized as follows:
<img src="../tools/media/single_output_tensor.png" alt="single tensor" style="width:70%; display: block; margin-left: auto; margin-right: auto;">
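
Conceptually, the split can be sketched in plain `torch`. This is only an illustration, under the assumption that X is everything before the last `forecast_len` frames and Y is the last `forecast_len` frames. The names `series`, `x`, `y` and `sketch_dataset` are hypothetical, and the library's actual implementation may differ.

```python
import torch
from torch.utils.data import TensorDataset

# Stand-in for the synthetic data: 210 frames of 100x100 (values are irrelevant here)
series = torch.zeros(210, 100, 100)
forecast_len = 30

# Assumed split: history is everything except the last `forecast_len` frames
x = series[:-forecast_len]
y = series[-forecast_len:]

# There is a single (X, Y) pair, so add a leading sample dimension of size 1
sketch_dataset = TensorDataset(x.unsqueeze(0), y.unsqueeze(0))

for x_part, y_part in sketch_dataset:
    print(f'X shape: {x_part.shape}\nY shape: {y_part.shape}')
```

Because the sketch builds the tensor up front, no list-to-tensor conversion happens later. Passing an already converted tensor to the library function may likewise avoid the conversion warning, though that is an assumption to verify.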

In [5]:
dataset = single_output_tensor(data=data, 
                               forecast_len=30)

In [6]:
# checking data shapes
for batch in dataset:
    print(f'X shape: {batch[0].shape}\nY shape: {batch[1].shape}')

If you want to predict from several data sources, you can pass an extra X via `additional_x`. In this case the two X series are stacked into one tensor: an additional dimension appears *(in our case the channel, as if it were a two-channel image)* after the value of `X.shape[0]`
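
As a rough illustration of what "stacking" means, the sketch below uses `torch.stack`, which inserts a new dimension at the position given by `dim`. Which axis the library actually uses is not assumed here, so both common placements are shown; compare them with the shape printed in the next cell. The tensor names are hypothetical.

```python
import torch

# Two stand-in series of identical shape: (time, height, width)
main_x = torch.zeros(180, 100, 100)
extra_x = torch.ones(180, 100, 100)

# torch.stack creates a NEW dimension at position `dim`
stacked_first = torch.stack([main_x, extra_x], dim=0)   # new axis in front
stacked_second = torch.stack([main_x, extra_x], dim=1)  # new axis after the time axis

print(f'dim=0: {stacked_first.shape}\ndim=1: {stacked_second.shape}')
```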

In [7]:
dataset = single_output_tensor(data=data,
                               additional_x=data.copy(),
                               forecast_len=30)

In [8]:
for batch in dataset:
    print(f'new stacked X shape: {batch[0].shape}\nY shape: {batch[1].shape}') 

You can avoid stacking the two X's by setting the `additional_x_stack` parameter to `False` *(default is `True`)*, which yields two separate X's
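
The reason each batch then holds three items can be seen from `TensorDataset` itself: it returns one entry per tensor it was built from. The sketch below is a minimal stand-alone illustration with hypothetical tensors, not the library's code.

```python
import torch
from torch.utils.data import TensorDataset

# Hypothetical tensors with a leading sample dimension of size 1
x1 = torch.zeros(1, 180, 100, 100)
x2 = torch.ones(1, 180, 100, 100)
target = torch.zeros(1, 30, 100, 100)

# A TensorDataset built from three tensors yields 3-tuples
separate_dataset = TensorDataset(x1, x2, target)
sample = separate_dataset[0]

print(f'items per sample: {len(sample)}')
print(f'X1 shape: {sample[0].shape}\nX2 shape: {sample[1].shape}\nY shape: {sample[2].shape}')
```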

In [9]:
dataset = single_output_tensor(data=data,
                               additional_x=data.copy(),
                               additional_x_stack=False,
                               forecast_len=30)

In [10]:
for batch in dataset:
    print(f'X1 shape: {batch[0].shape}\nX2 shape: {batch[1].shape}\nY shape: {batch[2].shape}') 

If you want to create a dataset from several X's *(more than 2)*, use the following template with `additional_is_array=True`. In this case all X's are stacked along a new dimension *(this is the only supported behavior for multiple X's)*

In [11]:
dataset = single_output_tensor(data=data,
                               additional_x=[data.copy(), data.copy(), data.copy()],
                               additional_is_array=True,
                               forecast_len=30)

In [12]:
for batch in dataset:
    print(f'new stacked X shape: {batch[0].shape}\nY shape: {batch[1].shape}')  

You can also use `threshold` to binarize your data. By default, binarization occurs only for Y, but it can also be done for X using the parameter `x_binarize=True` *(all X's or new stacked X will be binarized)*
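
Binarization with a threshold can be sketched as an elementwise comparison. The rule below (strictly greater than the threshold maps to 1, everything else to 0) is an assumption for illustration; the library may treat values equal to the threshold differently.

```python
import torch

threshold = 0.5
values = torch.tensor([[-1.2, 0.3],
                       [0.5, 1.7]])

# Assumed rule: strictly greater than the threshold -> 1.0, otherwise 0.0
binary = (values > threshold).float()

print(binary)
print(f'max: {binary.max()} | min: {binary.min()}')
```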

In [13]:
import numpy as np

# the Gaussian noise matrix 
gaussian_noise_matrix = np.random.normal(loc=0, scale=1, size=(100, 100))
noise_data = data - gaussian_noise_matrix

print(f'data max: {noise_data.max()} | min: {noise_data.min()}')

In [14]:
dataset = single_output_tensor(data=noise_data,
                               additional_x=[noise_data.copy(), noise_data.copy(), noise_data.copy()],
                               additional_is_array=True,
                               forecast_len=30,
                               threshold=0.5,
                               x_binarize=True)

In [15]:
for batch in dataset:
    print(f'new stacked X shape: {batch[0].shape}\nY shape: {batch[1].shape}',
          f'new stacked X max: {batch[0].max()} | min: {batch[0].min()}\nY max: {batch[1].max()} | min: {batch[1].min()}',
          sep='\n\n') 

#### Function `multi_output_tensor`

In [16]:
from torchcnnbuilder.preprocess._dynamic_window import multi_output_tensor

Params:

- **data**: N-dimensional arrays, lists, numpy arrays, tensors etc.
- **forecast_len**: length of prediction for each y-train future tensor (target)
- **pre_history_len**: length of pre-history for each x-train future tensor
- **additional_x**: extra x-train data. Default: None
- **additional_is_array**: if additional x-train is an array of x_i data like other time series. Default: False
- **additional_x_stack**: if True stack each additional_x_i to x-train. Default: True
- **threshold**: binarization threshold for each y-tensor. Default: False
- **x_binarize**: binarization with threshold for each x-tensor. Default: False

Returns: TensorDataset of X-train and y-train

This function preprocesses an n-dimensional time series into tensors with a **sliding window** for the X and Y parts and returns a `TensorDataset`. In addition to the prediction range `forecast_len`, you now need to specify the period on which the prediction is based via the parameter `pre_history_len`. Otherwise, the functionality is the same as in `single_output_tensor` *(so it is still possible to binarize and stack different X-windows)*. This behavior lets you train and predict over a given period of time, rather than on the entire dataset at once

Let's assume that we will predict 30 frames of our 2-dimensional time series, based on the previous 60. Basically, you can visualize the result of the function as follows:
<img src="../tools/media/multi_output_tensor.png" alt="single tensor" style="width:70%; display: block; margin-left: auto; margin-right: auto;">
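
The sliding-window idea can be sketched in plain `torch` as below. This is an illustration under stated assumptions: a step of 1 and one window per valid start index. The library may count windows differently (for example, off by one), so check `len(dataset)` on the real output. The frames are shrunk to 4x4 to keep memory small, and all names are hypothetical.

```python
import torch
from torch.utils.data import TensorDataset

# Stand-in series: 210 frames, shrunk to 4x4 to keep the sketch lightweight
toy_series = torch.zeros(210, 4, 4)
pre_history_len, forecast_len = 60, 30
window = pre_history_len + forecast_len

# Assumed convention: one window per valid start index, step 1
n_windows = toy_series.shape[0] - window + 1

# Each X window holds `pre_history_len` frames; its Y holds the next `forecast_len` frames
x_windows = torch.stack([toy_series[i:i + pre_history_len] for i in range(n_windows)])
y_windows = torch.stack([toy_series[i + pre_history_len:i + window] for i in range(n_windows)])

window_dataset = TensorDataset(x_windows, y_windows)
print(f'windows: {len(window_dataset)}')
print(f'X shape: {window_dataset[0][0].shape}\nY shape: {window_dataset[0][1].shape}')
```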

In [17]:
dataset = multi_output_tensor(data=data, 
                              forecast_len=30,
                              pre_history_len=60)

In [18]:
for i, batch in enumerate(dataset):
    print(f'batch number: {i}',
          f'X shape: {batch[0].shape}\nY shape: {batch[1].shape}',
          sep='\n',
          end='\n\n')
    if i == 1:
        break
print(f'Dataset len (number of batches/X-windows): {len(dataset)}')

Using `threshold` and several X's, as in the last example of the previous function. Now each X should have 4 channels, because we use 4 different 2-dimensional time series in total

In [19]:
dataset = multi_output_tensor(data=noise_data,
                              additional_x=[noise_data.copy(), noise_data.copy(), noise_data.copy()],
                              additional_is_array=True,
                              forecast_len=30,
                              pre_history_len=60,
                              threshold=0.5,
                              x_binarize=True)

In [20]:
for i, batch in enumerate(dataset):
    print(f'batch number: {i}',
          f'new stacked X shape: {batch[0].shape}\nY shape: {batch[1].shape}',
          f'new stacked X max: {batch[0].max()} | min: {batch[0].min()}\nY max: {batch[1].max()} | min: {batch[1].min()}',
          sep='\n',
          end='\n\n')
    if i == 1:
        break
print(f'Dataset len (number of batches/X-windows): {len(dataset)}')