In mathematics, a time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time ([Wikipedia](https://en.wikipedia.org/wiki/Time_series)). For any machine learning task on time series, getting the data into the correct format is important. Below, we describe the data format each task requires and show how to generate it.  


1. Forecasting
2. Classification and Regression
3. Anomaly detection

### 1. Forecasting Data Format


The required input data format is a two-dimensional structure (``pandas.DataFrame``), which should contain a timestamp column (``time_col``) and one or more variable columns (``var_col_0``, ``var_col_1``, ``var_col_2``, ..., ``var_col_n``). See example below: 

#### 1.1 Data Format

```python
     time_col          var_col_0        var_col_1        var_col_2     ...    var_col_n
xxxx-xx-xx xx:xx:xx        x                x                x                    x
xxxx-xx-xx xx:xx:xx        x                x                x                    -
xxxx-xx-xx xx:xx:xx        x                x                x                    x
xxxx-xx-xx xx:xx:xx        -                x                x                    x
xxxx-xx-xx xx:xx:xx        x                x                x                    x
xxxx-xx-xx xx:xx:xx        x                -                x                    x
xxxx-xx-xx xx:xx:xx        x                -                x                    x
xxxx-xx-xx xx:xx:xx        x                x                -                    x
xxxx-xx-xx xx:xx:xx        x                x                x                    x
xxxx-xx-xx xx:xx:xx        x                x                -                    x
xxxx-xx-xx xx:xx:xx        -                -                -                    -
xxxx-xx-xx xx:xx:xx        x                x                x                    x
        -                  -                -                -                    -
xxxx-xx-xx xx:xx:xx        x                x                x                    x
```

where *xxxx-xx-xx xx:xx:xx* stands for datetime. Ideally, the date format is ``YYYY-MM-DD`` and the time is ``HH:MM:SS``. In variable columns, (*x*) represents a certain value and (*-*) means the value is missing. 

**Note**

  - The input data must contain a time column. Its name is not restricted: ds, TS, TIMESTAMP, timestamp, or any other name works. The datetime can be in any format that ``pandas.to_datetime`` can parse.  
  - HyperTS can sort the data in time order if the rows are out of order. 
  - HyperTS supports input data with various time frequencies, for example second (S), minute (T), hour (H), day (D), week (W), month (M), and year (Y).
  - HyperTS can impute missing time points and segments during the preprocessing stage.
  - HyperTS can drop duplicate rows during the preprocessing stage.
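
The notes above list what HyperTS handles during preprocessing. The snippet below is only a rough pandas illustration of what those steps mean (parsing, de-duplicating, sorting, filling missing time points). It is not HyperTS's internal implementation, and the column names ``ds`` and ``value`` as well as the choice of linear interpolation are assumptions made for the example.

```python
import pandas as pd

# Hypothetical raw data: unordered timestamps, one duplicated row, one missing day.
raw = pd.DataFrame({
    'ds': ['2022/02/03', '2022/02/01', '2022/02/05', '2022/02/01', '2022/02/02'],
    'value': [3.0, 1.0, 5.0, 1.0, 2.0],
})

# 1. Parse the time column; any format understood by pandas.to_datetime works.
raw['ds'] = pd.to_datetime(raw['ds'])

# 2. Drop repeated rows, then 3. sort by time.
clean = raw.drop_duplicates().sort_values('ds')

# 4. Re-index on a regular daily grid so the missing day (2022-02-04) shows up as NaN.
full_range = pd.date_range(clean['ds'].min(), clean['ds'].max(), freq='D')
clean = clean.set_index('ds').reindex(full_range).rename_axis('ds').reset_index()

# Fill the gap by linear interpolation (one of many possible imputation choices).
clean['value'] = clean['value'].interpolate()
print(clean)
```

After these steps the frame has one row per day, in time order, with no duplicates and no gaps.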

Sometimes, extra variables are generated during modeling; these are called covariates. They are added as columns alongside the input variables. See example below:  

```python
     time_col          var_col_0   var_col_1 ... var_col_n     covar_col_0    covar_col_1 ... covar_col_m
xxxx-xx-xx xx:xx:xx        x          x              x              x              x               x 
xxxx-xx-xx xx:xx:xx        x          x              -              x              x               x
xxxx-xx-xx xx:xx:xx        x          x              x              x              -               x
xxxx-xx-xx xx:xx:xx        -          x              x              x              x               -
xxxx-xx-xx xx:xx:xx        x          x              x              x              x               x
xxxx-xx-xx xx:xx:xx        x          -              x              x              x               x
xxxx-xx-xx xx:xx:xx        x          -              x              x              x               x
xxxx-xx-xx xx:xx:xx        x          x              x              x              -               x
xxxx-xx-xx xx:xx:xx        x          x              x              x              x               x
xxxx-xx-xx xx:xx:xx        x          x              x              x              x               x
xxxx-xx-xx xx:xx:xx        -          -              -              x              x               x
xxxx-xx-xx xx:xx:xx        x          x              x              x              x               x
        -                  -          -              -              -              -               -
xxxx-xx-xx xx:xx:xx        x          x              x              x              x               x
```
where *covar_col_i (i = 0, 1, ..., m)* stand for covariates.

**Note**

  - Covariates can be continuous values or discrete values. 
  - Covariates can contain repeated and missing values.

#### 1.2 Example

##### 1.2.1 Generate a Random Dataset without Covariates 

In [1]:
import numpy as np
import pandas as pd

size=5

# without covariates
df_no_covariates = pd.DataFrame({
    'timestamp': pd.date_range(start='2022-02-01', periods=size, freq='H'),
    'val_0': np.random.normal(size=size),
    'val_1': [0.5, 0.2, np.nan, 0.9, 0.0],
    'val_2': np.random.normal(size=size),
})

df_no_covariates

| | timestamp | val_0 | val_1 | val_2 |
|---|---|---|---|---|
| 0 | 2022-02-01 00:00:00 | 0.599713 | 0.5 | 1.538076 |
| 1 | 2022-02-01 01:00:00 | 2.199247 | 0.2 | -0.007808 |
| 2 | 2022-02-01 02:00:00 | -0.316841 | NaN | 0.680783 |
| 3 | 2022-02-01 03:00:00 | -0.74725 | 0.9 | 0.203446 |
| 4 | 2022-02-01 04:00:00 | -0.404355 | 0.0 | -0.922243 |


The output shows that:

- The name of the timestamp column is 'timestamp';
- The names of the target columns are 'val_0',  'val_1',  'val_2';
- The time frequency is hourly: 'H';
- The dataset contains missing values;
- It's a multivariate time series forecasting task.
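
These properties can also be checked programmatically. The following is a small sketch using plain pandas on hypothetical data shaped like the example above; it is not a HyperTS API.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly data mirroring the structure of the example above.
df = pd.DataFrame({
    'timestamp': pd.date_range(start='2022-02-01', periods=5, freq=pd.Timedelta(hours=1)),
    'val_0': [0.6, 2.2, -0.3, -0.7, -0.4],
    'val_1': [0.5, 0.2, np.nan, 0.9, 0.0],
})

# Infer the sampling frequency from the timestamps (needs at least three points).
freq = pd.infer_freq(pd.DatetimeIndex(df['timestamp']))

# Count missing values per variable column.
missing = df.drop(columns='timestamp').isna().sum()

print(freq)     # hourly alias; spelled 'H' or 'h' depending on the pandas version
print(missing)
```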

##### 1.2.2 Generate a Random Dataset with Covariates

In [2]:
# with covariates
df_with_covariates = pd.DataFrame({
    'timestamp': pd.date_range(start='2022-02-01', periods=size, freq='D'),
    'val_0': np.random.normal(size=size),
    'val_1': [12, 52, 34, np.nan, 100],
    'val_2': [0.5, 0.2, np.nan, 0.9, 0.0],
    'covar_0': [0.2, 0.4, 0.2, 0.7, 0.1],
    'covar_1': ['a', 'a', 'b', 'b', 'b'],
    'covar_2': [1, 2, 2, None, 3], 
})

df_with_covariates

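| | timestamp | val_0 | val_1 | val_2 | covar_0 | covar_1 | covar_2 |
|---|---|---|---|---|---|---|---|
| 0 | 2022-02-01 | -0.921455 | 12.0 | 0.5 | 0.2 | a | 1.0 |
| 1 | 2022-02-02 | -0.300891 | 52.0 | 0.2 | 0.4 | a | 2.0 |
| 2 | 2022-02-03 | 0.245227 | 34.0 | NaN | 0.2 | b | 2.0 |
| 3 | 2022-02-04 | 1.29943 | NaN | 0.9 | 0.7 | b | NaN |
| 4 | 2022-02-05 | -0.131246 | 100.0 | 0.0 | 0.1 | b | 3.0 |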


The output shows that:

- The name of the timestamp column is 'timestamp';
- The names of the target columns are 'val_0',  'val_1',  'val_2';
- The names of the covariate columns are 'covar_0',  'covar_1',  'covar_2';
- The time frequency is daily: 'D';
- The dataset contains missing values;
- It's a multivariate time series forecasting task.
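
To make the column roles explicit, the sketch below separates target columns from covariates and splits the covariates into continuous and discrete ones. It uses plain pandas on hypothetical data; the rule "every column that is neither the time column nor a covariate is a target" is an assumption adopted for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'timestamp': pd.date_range(start='2022-02-01', periods=5, freq='D'),
    'val_0': [0.1, 0.2, 0.3, 0.4, 0.5],
    'val_1': [12, 52, 34, np.nan, 100],
    'covar_0': [0.2, 0.4, 0.2, 0.7, 0.1],
    'covar_1': ['a', 'a', 'b', 'b', 'b'],
})

timestamp = 'timestamp'
covariates = ['covar_0', 'covar_1']

# Assumed rule: every column that is neither the time column nor a covariate is a target.
targets = [c for c in df.columns if c != timestamp and c not in covariates]

# Split covariates into continuous (numeric) and discrete (everything else).
continuous = [c for c in covariates if pd.api.types.is_numeric_dtype(df[c])]
discrete = [c for c in covariates if c not in continuous]

print(targets, continuous, discrete)
```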

##### 1.2.3 Import Built-in Forecasting Datasets

In [3]:
from hyperts.datasets import (load_random_univariate_forecast_dataset, 
                              load_random_multivariate_forecast_dataset,
                              load_network_traffic)

In [4]:
df0 = load_random_univariate_forecast_dataset(return_X_y=False)
df0.head()

| | ds | id | value |
|---|---|---|---|
| 0 | 2013-01-01 | 1.0 | 0.619163 |
| 1 | 2013-01-02 | 0.0 | 0.600649 |
| 2 | 2013-01-03 | 1.0 | 0.236368 |
| 3 | 2013-01-04 | 1.0 | 0.534748 |
| 4 | 2013-01-05 | 0.0 | 0.940991 |


In this dataset, ```ds``` is the name of the time column, and the time frequency is daily (D). The target variable (i.e., forecast variable) is the ```value``` column, and the ```id``` column is a covariate. With a single target variable, it is a univariate time series task.

In [5]:
df1 = load_random_multivariate_forecast_dataset(return_X_y=False)
df1.head()

| | ds | Var_1 | Var_2 |
|---|---|---|---|
| 0 | 2022-10-27 12:07:27.572441 | 0.266213 | 1.098948 |
| 1 | 2022-10-28 12:07:27.572441 | 1.632756 | 1.829807 |
| 2 | 2022-10-29 12:07:27.572441 | 2.871863 | 3.123874 |
| 3 | 2022-10-30 12:07:27.572441 | 3.669629 | 3.95761 |
| 4 | 2022-10-31 12:07:27.572441 | 4.353291 | 5.086246 |


In this dataset, ```ds``` is the name of the time column, and the time frequency is daily (D). The target variables (i.e., forecast variables) are the two columns ```Var_1``` and ```Var_2```, and there are no covariates. Therefore, it is a multivariate time series task.

In [6]:
df2 = load_network_traffic(return_X_y=False)
df2.head()

| | TimeStamp | Var_1 | Var_2 | Var_3 | Var_4 | Var_5 | Var_6 | HourSin | WeekCos | CBWD |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-03-01 00:00:00 | 0.7534 | 3.375 | 10.195 | 1.449 | 19174.977 | 286443.88 | 0.0 | 1.0 | NW |
| 1 | 2021-03-01 01:00:00 | 0.3376 | 2.414 | 3.92 | 0.4065 | 7529.263 | 178930.45 | 0.258819 | 1.0 | NW |
| 2 | 2021-03-01 02:00:00 | 0.2032 | 1.654 | 3.318 | 0.2142 | 3310.539 | 42296.164 | 0.5 | 1.0 | NW |
| 3 | 2021-03-01 03:00:00 | 0.242 | 1.393 | 3.148 | 0.2312 | 4535.464 | 26220.232 | 0.707107 | 1.0 | NW |
| 4 | 2021-03-01 04:00:00 | 0.194 | 1.429 | 3.215 | 0.2157 | 2732.911 | 27990.348 | 0.866025 | 1.0 | NW |


In this dataset, ```TimeStamp``` is the name of the time column, and the time frequency is hourly (H). The target variables (i.e., forecast variables) are the six columns ```Var_1```, ```Var_2```, ```Var_3```, ```Var_4```, ```Var_5``` and ```Var_6```. The covariates are ```HourSin```, ```WeekCos```, and ```CBWD```, where ```HourSin``` and ```WeekCos``` are numeric variables and ```CBWD``` is a categorical variable. Therefore, it is a multivariate time series task with covariates.
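
The printed ```HourSin``` and ```WeekCos``` values are consistent with cyclic encodings of the hour of day and the day of week. This reading is inferred from the first five rows only; how the dataset was actually built is not documented here. Under that assumption, such covariates could be derived from a timestamp column as in the sketch below.

```python
import numpy as np
import pandas as pd

ts = pd.date_range(start='2021-03-01 00:00:00', periods=5, freq=pd.Timedelta(hours=1))

df = pd.DataFrame({'TimeStamp': ts})
# Assumed encodings: hour of day on a 24-hour circle, day of week on a 7-day circle.
df['HourSin'] = np.sin(2 * np.pi * df['TimeStamp'].dt.hour / 24)
df['WeekCos'] = np.cos(2 * np.pi * df['TimeStamp'].dt.dayofweek / 7)

print(df)
```

Cyclic encodings keep neighbouring times close together (for example 23:00 and 00:00), which a raw hour number would not.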

<br>

<br>

### 2. Classification/Regression Data Format

#### 2.1 Data Format

Unlike the forecasting task, classification and regression tasks take a ```nested DataFrame``` as input, in which the values of a variable over a time segment are stored in a single cell. See example below:

```python
 var_col_0       var_col_1   ...    var_col_n     target
x,x,x,...,x     x,x,x,...,x        x,x,x,...,x      0
x,x,x,...,x     x,x,x,...,x        x,x,x,...,x      0
x,x,x,...,x     x,x,x,...,x        x,x,x,...,x      1
x,x,x,...,x     x,x,x,...,x        x,x,x,...,x      1
x,x,x,...,x     x,x,x,...,x        x,x,x,...,x      1
x,x,x,...,x     x,x,x,...,x        x,x,x,...,x      2
x,x,x,...,x     x,x,x,...,x        x,x,x,...,x      2
x,x,x,...,x     x,x,x,...,x        x,x,x,...,x      2
x,x,x,...,x     x,x,x,...,x        x,x,x,...,x      2
```

where *x,x,x,...,x* is the series of one variable for one sample: it records how the variable fluctuates over a time segment of length len(x,x,x,...,x). Each *x* is the value of the variable at one time point. ```target``` is the label of the sample.

**Note**
 
 - The goal of classification and regression tasks is to identify the behavior of a sample, so the data format differs from that of the forecasting task. Each row of forecasting data holds the value of each variable at one time point, while each row of classification or regression data is a sample. Within a row, each cell *x, x, x, ..., x* holds the series of one variable over a time segment of length len(x, x, x, ..., x). The category (classification) or value (regression) of the target is determined from the sequence behavior of the feature variables.
 - Intuitively, a ```pandas.DataFrame``` is a two-dimensional table in which each cell stores a single value. Here each cell stores a sequence instead, so three-dimensional data is nested in a two-dimensional table, which is why we call it a ```nested DataFrame```.
 - Because these tasks discriminate the category or behavior of each sample, the trend of the data is the key characteristic. For simplicity, the timestamp information is omitted when storing the data.
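
As a minimal sketch of the nesting idea, the snippet below builds a tiny nested frame and inspects one cell. The sizes and column names are arbitrary choices for the illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_samples, series_length = 4, 10

# Each cell holds a whole series (a pandas.Series of length `series_length`).
df = pd.DataFrame({
    'var_0': [pd.Series(rng.normal(size=series_length)) for _ in range(n_samples)],
    'var_1': [pd.Series(rng.normal(size=series_length)) for _ in range(n_samples)],
    'target': [0, 0, 1, 1],
})

cell = df.loc[0, 'var_0']
print(type(cell).__name__, len(cell))   # one cell is a full series
print(df.shape)                         # two-dimensional table: (samples, columns)
```

The table itself stays two-dimensional; the third dimension (time) lives inside each cell.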

#### 2.2 Example

##### 2.2.1 Generate a Random Dataset for Time Series Classification

In [7]:
import numpy as np
import pandas as pd

size=10

df = pd.DataFrame({
    'var_0': [pd.Series(np.random.normal(size=size)), pd.Series(np.random.normal(size=size)),
              pd.Series(np.random.normal(size=size)), pd.Series(np.random.normal(size=size)),
              pd.Series(np.random.normal(size=size)), pd.Series(np.random.normal(size=size))],
    'var_1': [pd.Series(np.random.normal(size=size)), pd.Series(np.random.normal(size=size)),
              pd.Series(np.random.normal(size=size)), pd.Series(np.random.normal(size=size)),
              pd.Series(np.random.normal(size=size)), pd.Series(np.random.normal(size=size))],
    'var_2': [pd.Series(np.random.normal(size=size)), pd.Series(np.random.normal(size=size)),
              pd.Series(np.random.normal(size=size)), pd.Series(np.random.normal(size=size)),
              pd.Series(np.random.normal(size=size)), pd.Series(np.random.normal(size=size))],
    'y': [0, 0, 1, 1, 2, 2], 
})

df

| | var_0 | var_1 | var_2 | y |
|---|---|---|---|---|
| 0 | 0 0.887165 1 1.340856 2 0.325697 3 ... | 0 -0.289942 1 0.919971 2 -1.867462 3 ... | 0 -1.354841 1 0.083280 2 0.214305 3 ... | 0 |
| 1 | 0 1.950219 1 0.016051 2 -1.621092 3 ... | 0 -0.803164 1 0.576465 2 0.010610 3 ... | 0 -0.248131 1 -0.923721 2 -0.691555 3 ... | 0 |
| 2 | 0 1.216039 1 -0.127643 2 -0.337693 3 ... | 0 0.106087 1 1.853251 2 -1.428858 3 ... | 0 -1.169346 1 0.207621 2 0.137396 3 ... | 1 |
| 3 | 0 -1.258965 1 -0.151087 2 -0.150937 3 ... | 0 0.132244 1 -0.230362 2 -0.040706 3 ... | 0 1.060503 1 0.634884 2 0.011422 3 ... | 1 |
| 4 | 0 0.797132 1 1.775923 2 -1.491301 3 ... | 0 0.553663 1 1.031838 2 -0.555002 3 ... | 0 -0.660046 1 1.278639 2 -1.011061 3 ... | 2 |
| 5 | 0 -0.253653 1 -0.159006 2 0.378620 3 ... | 0 -0.079186 1 0.423877 2 -0.805016 3 ... | 0 -1.534051 1 -0.101453 2 0.279004 3 ... | 2 |


The output shows that:

- The name of the target variable is 'y'; 
- The names of the feature variables are 'var_0',  'var_1',  'var_2';
- It's a multivariate classification task.

##### 2.2.2 Import Built-in Classification Datasets

In [8]:
from hyperts.datasets import load_arrow_head, load_basic_motions

In [9]:
df0 = load_arrow_head(return_X_y=False)
df0.head() 

| | Var_1 | target |
|---|---|---|
| 0 | 0 -1.9630 1 -1.9578 2 -1.9561 3 ... | 0 |
| 1 | 0 -1.7746 1 -1.7740 2 -1.7766 3 ... | 1 |
| 2 | 0 -1.8660 1 -1.8420 2 -1.8350 3 ... | 2 |
| 3 | 0 -2.0738 1 -2.0733 2 -2.0446 3 ... | 0 |
| 4 | 0 -1.7463 1 -1.7413 2 -1.7227 3 ... | 1 |


In [10]:
df0.target.unique()

array(['0', '1', '2'], dtype=object)

In this dataset, ```target``` is the target column, containing three classes: ['0', '1', '2']. ```Var_1``` is the only feature variable, so it is a univariate multiclass classification task.

In [11]:
df1 = load_basic_motions(return_X_y=False)
df1.head()

Unnamed: 0,Var_1,Var_2,Var_3,Var_4,Var_5,Var_6,target
0,0 0.079106 1 0.079106 2 -0.903497 3...,0 0.394032 1 0.394032 2 -3.666397 3...,0 0.551444 1 0.551444 2 -0.282844 3...,0 0.351565 1 0.351565 2 -0.095881 3...,0 0.023970 1 0.023970 2 -0.319605 3...,0 0.633883 1 0.633883 2 0.972131 3...,standing
1,0 0.377751 1 0.377751 2 2.952965 3...,0 -0.610850 1 -0.610850 2 0.970717 3...,0 -0.147376 1 -0.147376 2 -5.962515 3...,0 -0.103872 1 -0.103872 2 -7.593275 3...,0 -0.109198 1 -0.109198 2 -0.697804 3...,0 -0.037287 1 -0.037287 2 -2.865789 3...,standing
2,0 -0.813905 1 -0.813905 2 -0.424628 3...,0 0.825666 1 0.825666 2 -1.305033 3...,0 0.032712 1 0.032712 2 0.826170 3...,0 0.021307 1 0.021307 2 -0.372872 3...,0 0.122515 1 0.122515 2 -0.045277 3...,0 0.775041 1 0.775041 2 0.383526 3...,standing
3,0 0.289855 1 0.289855 2 -0.669185 3...,0 0.284130 1 0.284130 2 -0.210466 3...,0 0.213680 1 0.213680 2 0.252267 3...,0 -0.314278 1 -0.314278 2 0.018644 3...,0 0.074574 1 0.074574 2 0.007990 3...,0 -0.079901 1 -0.079901 2 0.237040 3...,standing
4,0 -0.123238 1 -0.123238 2 -0.249547 3...,0 0.379341 1 0.379341 2 0.541501 3...,0 -0.286006 1 -0.286006 2 0.208420 3...,0 -0.098545 1 -0.098545 2 -0.023970 3...,0 0.058594 1 0.058594 2 0.175783 3...,0 -0.074574 1 -0.074574 2 0.114525 3...,standing


In [12]:
df1.target.unique()

array(['standing', 'running', 'walking', 'badminton'], dtype=object)

In this dataset, ```target``` is the target column, containing four classes: ['standing', 'running', 'walking', 'badminton']. There are six feature variables: ```Var_1```, ```Var_2```, ```Var_3```, ```Var_4```, ```Var_5``` and ```Var_6```. Therefore, it is a multivariate multiclass classification task.
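
The same reading (number of feature variables, number of classes) can be automated. The helper below is hypothetical, written only for this illustration, and runs on a toy nested frame rather than on the built-in datasets.

```python
import pandas as pd

def describe_task(df, target='target'):
    """Hypothetical helper: summarise a nested classification frame."""
    n_features = df.shape[1] - 1            # every non-target column is a feature
    n_classes = df[target].nunique()
    variate = 'univariate' if n_features == 1 else 'multivariate'
    kind = 'binary' if n_classes == 2 else 'multiclass'
    return f'{variate} {kind} classification'

# Toy nested frame with two feature variables and three classes.
toy = pd.DataFrame({
    'Var_1': [pd.Series([0.1, 0.2, 0.3]) for _ in range(3)],
    'Var_2': [pd.Series([1.0, 0.9, 0.8]) for _ in range(3)],
    'target': ['standing', 'running', 'walking'],
})
print(describe_task(toy))
```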

#### 2.3 3-D Numpy.array to 2-D pandas.DataFrame

Often, the acquired data is in the form of a ``numpy.array``, which needs to be converted to a nested ``pandas.DataFrame``. For example, suppose we have the following numpy data: 

In [13]:
import numpy as np

nb_samples = 100
series_length = 72
nb_variables = 6
nb_classes = 4

X = np.random.normal(size=(nb_samples, series_length, nb_variables))
y = np.random.randint(low=0, high=nb_classes, size=nb_samples)

In [14]:
X.shape, y.shape, np.unique(y)

((100, 72, 6), (100,), array([0, 1, 2, 3]))

This dataset contains 100 samples. Each sample has 6 feature variables, and each variable is measured at 72 time indices. The target variable *y* has 4 categories.

HyperTS provides a function, ``from_3d_array_to_nested_df``, that automatically converts a 3-D array to the required nested DataFrame. See example below:

In [15]:
import pandas as pd
from hyperts.toolbox import from_3d_array_to_nested_df

df_X = from_3d_array_to_nested_df(data=X)
df_y = pd.DataFrame({'y': y})
df = pd.concat([df_X, df_y], axis=1)

In [16]:
df.head()

Unnamed: 0,Var_0,Var_1,Var_2,Var_3,Var_4,Var_5,y
0,0 -0.038029 1 2.407684 2 1.000662 3...,0 0.947376 1 -0.666230 2 -0.673983 3...,0 0.294960 1 -0.188034 2 0.791917 3...,0 -0.260460 1 1.190756 2 0.183264 3...,0 1.456938 1 0.773571 2 1.834219 3...,0 0.028061 1 -1.742055 2 -0.520803 3...,2
1,0 0.943434 1 0.710932 2 -0.717793 3...,0 -0.816563 1 -0.206350 2 -0.022789 3...,0 1.604838 1 -1.233925 2 0.068006 3...,0 1.471666 1 0.215581 2 -1.120550 3...,0 0.426381 1 0.911374 2 1.202849 3...,0 -0.362679 1 -0.585743 2 -0.212023 3...,1
2,0 -0.957144 1 0.126544 2 -0.962839 3...,0 0.085112 1 0.704056 2 1.869679 3...,0 -2.551469 1 -2.025863 2 -1.088073 3...,0 -0.790131 1 0.756131 2 0.676036 3...,0 -1.184995 1 -0.160961 2 -0.864118 3...,0 0.630483 1 1.281456 2 -1.885355 3...,2
3,0 0.745456 1 -0.608358 2 -2.248564 3...,0 -0.708104 1 -0.323268 2 -1.456815 3...,0 0.259860 1 0.600146 2 1.458288 3...,0 0.628735 1 0.064484 2 1.336053 3...,0 0.568531 1 -0.059870 2 -0.230826 3...,0 -0.845890 1 0.848568 2 -0.662570 3...,2
4,0 1.377579 1 -0.915978 2 -1.185956 3...,0 1.276868 1 -1.068383 2 -0.765125 3...,0 -0.252029 1 -1.035151 2 -0.677545 3...,0 -0.033123 1 0.387280 2 -0.658591 3...,0 -0.007119 1 -0.546834 2 -1.624636 3...,0 0.060272 1 1.340591 2 0.678717 3...,1
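
Going the other way can be useful for checking a conversion. The sketch below stacks the cells of a nested frame back into a 3-D array of shape (samples, series length, variables). The helper name ``nested_df_to_3d_array`` is made up for this example and is not presented as part of HyperTS; the nested frame is built by hand so the snippet runs with pandas and numpy alone.

```python
import numpy as np
import pandas as pd

def nested_df_to_3d_array(df_X):
    """Hypothetical helper: stack nested cells into (samples, length, variables)."""
    samples = []
    for i in range(df_X.shape[0]):
        cells = [df_X.iloc[i, j] for j in range(df_X.shape[1])]
        # Each cell is a series; place the variables side by side as columns.
        samples.append(np.column_stack([np.asarray(cell) for cell in cells]))
    return np.stack(samples)

# Build a small nested frame by hand from a known 3-D array.
X = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)   # (samples, length, variables)
df_X = pd.DataFrame({
    f'Var_{j}': [pd.Series(X[i, :, j]) for i in range(X.shape[0])]
    for j in range(X.shape[2])
})

X_back = nested_df_to_3d_array(df_X)
print(X_back.shape)
```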


### 3. Anomaly Detection Data Format

Similar to forecasting, the required input data format is a two-dimensional structure (``pandas.DataFrame``), which should contain a timestamp column (``time_col``), one or more variable columns (``var_col_0``, ``var_col_1``, ``var_col_2``, ..., ``var_col_n``), and sometimes one or more covariate columns (``covar_col_0``, ``covar_col_1``, ``covar_col_2``, ..., ``covar_col_m``). See example below: 

```python
     time_col          var_col_0   var_col_1 ... var_col_n     covar_col_0    covar_col_1 ... covar_col_m
xxxx-xx-xx xx:xx:xx        x          x              x              x              x               x 
xxxx-xx-xx xx:xx:xx        x          x              -              x              x               x
xxxx-xx-xx xx:xx:xx        x          x              x              x              -               x
xxxx-xx-xx xx:xx:xx        -          x              x              x              x               -
xxxx-xx-xx xx:xx:xx        x          x              x              x              x               x
xxxx-xx-xx xx:xx:xx        x          -              x              x              x               x
xxxx-xx-xx xx:xx:xx        x          -              x              x              x               x
xxxx-xx-xx xx:xx:xx        x          x              x              x              -               x
xxxx-xx-xx xx:xx:xx        x          x              x              x              x               x
xxxx-xx-xx xx:xx:xx        x          x              x              x              x               x
xxxx-xx-xx xx:xx:xx        -          -              -              x              x               x
xxxx-xx-xx xx:xx:xx        x          x              x              x              x               x
        -                  -          -              -              -              -               -
xxxx-xx-xx xx:xx:xx        x          x              x              x              x               x
```

In addition, the above data can also contain ``ground truth`` labels, which help in model selection and hyperparameter search. The format is as follows:

```python
     time_col          var_col_0   var_col_1 ... var_col_n     covar_col_0    covar_col_1 ... covar_col_m       anomaly       
xxxx-xx-xx xx:xx:xx        x          x              x              x              x               x               1 
xxxx-xx-xx xx:xx:xx        x          x              -              x              x               x               0
xxxx-xx-xx xx:xx:xx        x          x              x              x              -               x               0
xxxx-xx-xx xx:xx:xx        -          x              x              x              x               -               0
xxxx-xx-xx xx:xx:xx        x          x              x              x              x               x               1
xxxx-xx-xx xx:xx:xx        x          -              x              x              x               x               0
xxxx-xx-xx xx:xx:xx        x          -              x              x              x               x               0
xxxx-xx-xx xx:xx:xx        x          x              x              x              -               x               1
xxxx-xx-xx xx:xx:xx        x          x              x              x              x               x               0
xxxx-xx-xx xx:xx:xx        x          x              x              x              x               x               0
xxxx-xx-xx xx:xx:xx        -          -              -              x              x               x               1
xxxx-xx-xx xx:xx:xx        x          x              x              x              x               x               0
        -                  -          -              -              -              -               -               -
xxxx-xx-xx xx:xx:xx        x          x              x              x              x               x               0
```


where ``anomaly`` is the anomaly label column.

**Note**

   When the data has ground-truth labels, the optimization evaluation uses them. Otherwise, generated pseudo-labels are used.
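
To illustrate how a ground-truth ``anomaly`` column can be used for evaluation, the sketch below scores a toy detector against the labels. Both the detector (a median-deviation threshold) and the metrics (precision, recall, F1) are assumptions chosen for this example; they do not describe the models or the evaluation metric HyperTS itself uses.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'time_col': pd.date_range('2022-01-01', periods=8, freq='D'),
    'var_col_0': [1.0, 1.1, 0.9, 9.0, 1.0, 1.2, -7.5, 1.1],
    'anomaly':   [0,   0,   0,   1,   0,   0,    1,   0],
})

# Toy detector (an assumption, not a HyperTS model): flag points far from the median.
values = df['var_col_0']
deviation = (values - values.median()).abs()
pred = (deviation > 5 * deviation.median()).astype(int)

# Compare predictions with the ground-truth labels.
tp = int(((pred == 1) & (df['anomaly'] == 1)).sum())
fp = int(((pred == 1) & (df['anomaly'] == 0)).sum())
fn = int(((pred == 0) & (df['anomaly'] == 1)).sum())
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(precision, recall, f1)
```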