### Loading libraries

In [None]:
import sys
import os
import matplotlib.pyplot as plt
import numpy as np

import torch
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader

sys.path.insert(1, '..')
os.chdir('..')

from data_formatters.weinstock2016 import *
from dataset import TSDataset
from conf import Conf

### Code walk-through

The major parts of the code that need to be defined for each data set are:
1. config file in `.yaml` format,
2. data formatter script.

For now, you can study the `.yaml` files in the `./conf` folder to see what a config file should look like. You can skip the hyperparameter definitions and the model parameters; the main focus is on defining the dataset parameters.

We do not interact with the `.yaml` file directly but instead through the `Conf` class, which handles the following:
1. defines some defaults if not specified in `.yaml`,
2. sets save paths,
3. allows for nice colored printing.

Technically, we could do all of this in the `.yaml` file directly. However, every time we re-ran the experiment, we would then have to manually modify the `.yaml` file to reset save paths and redefine some variables, which would be inconvenient.
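
To make the idea concrete, here is a minimal sketch of the "defaults plus derived save path" pattern described above. It is not the actual `Conf` implementation: the `DEFAULTS` keys, the `build_config` helper, and the `save_path` layout are all hypothetical, and a plain dict stands in for the parsed `.yaml` contents.

```python
from pathlib import Path

# Hypothetical defaults; the real Conf class may use different keys.
DEFAULTS = {"seed": 0, "log": False, "out_dir": "./output"}


def build_config(user_cfg, exp_name, seed=None):
    """Merge user settings over defaults and derive a per-run save path."""
    cfg = {**DEFAULTS, **user_cfg}  # user values override defaults
    if seed is not None:
        cfg["seed"] = seed  # call-time arguments win over file values
    cfg["exp_name"] = exp_name
    # The save path is computed per run, so the config file never needs editing.
    cfg["save_path"] = str(Path(cfg["out_dir"]) / exp_name / f"seed_{cfg['seed']}")
    return cfg


# Stands in for the parsed contents of a .yaml file (assumed keys).
user_cfg = {"out_dir": "./runs", "ds_name": "weinstock"}
cfg = build_config(user_cfg, exp_name="Weinstock", seed=15)
print(cfg["save_path"])
```

Because the merged dict is built fresh on every call, re-running with a different `seed` or `exp_name` changes the save path without touching the file on disk.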


In [None]:
# loading the config file, setting the experiment name, and the seed for random pre-processing parts (like splitting)
cnf = Conf(conf_file_path='./conf/weinstock.yaml', seed=15, exp_name="Weinstock", log=False)

In [None]:
# let's print out the config file
print(f'\nDefault configuration parameters: \n{cnf}')

Now let's move on to the data formatter. This is the part that should handle:
1. loading the data and setting types,
2. **interpolating** segments,
3. splitting the data into train / val / test sets,
4. setting scalers and encoders for numerical and categorical variables, respectively.

**Note:** the data formatter is ultimately what handles **all pre-processing** steps. We must handle it carefully and verify that everything happens correctly.
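
As an illustration of the interpolation step, the sketch below splits a series into segments wherever the gap between readings is too long and linearly fills the shorter gaps. It is a simplified stand-in, not the logic of `WeinstockFormatter`: the function name, the 5-minute grid, and the 15-minute gap threshold are assumptions chosen for the example, and plain lists are used in place of a data frame.

```python
def interpolate_segments(times, values, step=5, max_gap=15):
    """Split readings into segments at large gaps; linearly fill small ones.

    `times` are minutes on a nominal `step`-minute grid (an assumption).
    Returns a list of segments, each a list of (time, value) pairs.
    """
    segments = [[(times[0], values[0])]]
    for t_prev, v_prev, t, v in zip(times, values, times[1:], values[1:]):
        gap = t - t_prev
        if gap > max_gap:
            segments.append([(t, v)])  # gap too long: start a new segment
            continue
        # Fill any missing grid points between the two readings.
        for k in range(1, gap // step):
            frac = k * step / gap
            segments[-1].append((t_prev + k * step, v_prev + frac * (v - v_prev)))
        segments[-1].append((t, v))
    return segments


# Toy readings: one missing point at t=10 and one long gap between t=20 and t=60.
times = [0, 5, 15, 20, 60, 65]
values = [100.0, 110.0, 130.0, 128.0, 90.0, 95.0]
segments = interpolate_segments(times, values)
for seg_id, seg in enumerate(segments):
    print(seg_id, seg)
```

The missing reading at `t=10` is filled by interpolation, while the 40-minute gap starts a second segment instead of being filled. Checking a few segments by hand like this is one way to verify that the pre-processing behaves as intended.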

In [None]:
# call the data formatter directly
data_formatter = WeinstockFormatter(cnf)

In [None]:
# plot the first 5 segments inside the data formatter
for counter, (id_segment, df) in enumerate(data_formatter.data.groupby('id_segment')):
    if counter == 5:
        break
    # print(id_segment)
    df.plot(x='time', y='gl', title=f'Segment {id_segment}')
    plt.show()

In [None]:
# let's see train, val, and test numbers
print(f'Train / val / test indices: {len(data_formatter.train_idx)}, {len(data_formatter.val_idx)}, {len(data_formatter.test_idx)}')
# let's see proportions
print(f'Train / val / test proportions: {len(data_formatter.train_idx) / len(data_formatter.data)}, {len(data_formatter.val_idx) / len(data_formatter.data)}, {len(data_formatter.test_idx) / len(data_formatter.data)}')

In [None]:
data_formatter.data

Finally, let's work with the `TSDataset` class. This is the main part of the code, as it ties together all of our previous steps. In the end, it is the `TSDataset` that calls the splitters, scalers, and encoders. **Importantly**, the model only interacts with the data through this class.

**NOTE:** we must now create a separate dataset object for each of the train, validation, and test sets.
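
The sketch below shows the general shape of such a class: a sliding-window dataset that returns a dict with `inputs` and `outputs` for each index, with one object per split. It is only an illustration under simplifying assumptions. The `WindowDataset` name and its parameters are hypothetical, it works on a plain list, and it omits the scaling and encoding that the real `TSDataset` is described as handling.

```python
class WindowDataset:
    """Index a series by sliding windows of input_len + output_len points."""

    def __init__(self, series, input_len, output_len):
        self.series = series
        self.input_len = input_len
        self.output_len = output_len

    def __len__(self):
        # Number of complete windows that fit in the series.
        return max(0, len(self.series) - self.input_len - self.output_len + 1)

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        split = i + self.input_len
        return {
            "inputs": self.series[i:split],
            "outputs": self.series[split:split + self.output_len],
        }


series = list(range(20))
# One dataset object per split, mirroring the note above.
train = WindowDataset(series[:12], input_len=3, output_len=2)
val = WindowDataset(series[12:], input_len=3, output_len=2)
print(len(train), train[0])
print(len(val), val[0])
```

Building each split from its own slice of the series means no window straddles the boundary between train and validation data.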

In [None]:
# we are going to pass our data formatter and the config file to the TSDataset class
train_dataset = TSDataset(cnf, data_formatter, split='train')
val_dataset = TSDataset(cnf, data_formatter, split='val')
test_dataset = TSDataset(cnf, data_formatter, split='test')

In [None]:
# now let's see how we can sample minibatches from our dataset that we can then pass to the model to train on
for i in range(10):
    # inputs: one input window (time steps x input features)
    x = train_dataset[i]['inputs']
    # outputs: the matching target window (time steps x target features)
    y = train_dataset[i]['outputs']
    print(f'Example #{i}: x.shape={x.shape}, y.shape={y.shape}')

In [None]:
for i in range(5):
    # inputs: one input window (time steps x input features)
    x = test_dataset[i]['inputs']
    print(x[0, :])
    # outputs: the matching target window (time steps x target features)
    y = test_dataset[i]['outputs']

    # plt.plot(x[:, 5])
    plt.plot(y)
    plt.show()