# Loading and working with data in sktime

From: https://www.sktime.org/en/latest/examples/loading_data.html

## Overview

Python provides a variety of useful ways to represent data, but NumPy arrays and pandas DataFrames are commonly used for data analysis. When NumPy 2d-arrays or pandas DataFrames are used to analyze tabular data, the rows commonly represent the instances (e.g. cases or observations) of the data, while the columns represent the features (e.g. variables or dimensions) of an observation.

Since timeseries data also has a time dimension for a given instance and feature, several alternative data formats could be used to represent this data, including:
* nested pandas DataFrame structures, 
* NumPy 3d-arrays, or 
* multi-indexed pandas DataFrames.

**Sktime is designed to work with timeseries data stored as nested pandas DataFrame objects.** As with tabular data in a pandas DataFrame, instances are represented by rows and each dimension of a problem (e.g. a variable or feature) is stored in a DataFrame column. To accomplish this, the timepoints for each instance-feature combination are stored in a single cell of the input pandas DataFrame.

📚 See [Sktime pandas DataFrame format](https://www.sktime.org/en/latest/examples/loading_data.html#sktime_df_format) for more details.

Users can load or convert data into sktime’s format in a variety of ways. 
* Data can be loaded directly from a bespoke sktime file format (`.ts`) (see [Representing data with `.ts` files](https://www.sktime.org/en/latest/examples/loading_data.html#ts_files)).
* Data can also be loaded from supported file formats provided by [other existing data sources](https://www.sktime.org/en/latest/examples/loading_data.html#other_file_types) (such as Weka `ARFF` and `.tsv`).

Sktime also provides functions to convert data to and from sktime’s nested pandas DataFrame format and several other common ways for representing timeseries data using NumPy arrays or pandas DataFrames.

📚 See [Converting between sktime and alternative timeseries formats](https://www.sktime.org/en/latest/examples/loading_data.html#convert).

The rest of this tutorial describes the sktime pandas DataFrame format in more detail, briefly describes the `.ts` file format, shows how to load data from other supported formats, and shows how to convert between sktime's format and other common ways of representing timeseries data in NumPy arrays or pandas DataFrames.

## Sktime pandas DataFrame format

The core data structure for storing datasets in sktime is a nested pandas DataFrame, where:
* rows of the dataframe correspond to instances (cases or observations), and 
* columns correspond to dimensions of the problem (features or variables).

The multiple *timepoints and their corresponding values* for each instance-feature pair are stored as a *pandas Series* object nested within the applicable DataFrame cell.

For example, for a problem with $n$ cases that each have data across $c$ timeseries dimensions:

```text
DataFrame:
index |   dim_0   |   dim_1   |    ...    |  dim_c-1
   0  | pd.Series | pd.Series | pd.Series | pd.Series
   1  | pd.Series | pd.Series | pd.Series | pd.Series
  ... |    ...    |    ...    |    ...    |    ...
  n-1 | pd.Series | pd.Series | pd.Series | pd.Series
```

Representing timeseries data in this way makes it easy to align the timeseries features for a given instance *with non-timeseries information*. For example, in a classification problem, it is easy to align the timeseries features for an observation with its *(index-aligned)* target class label:

```text
index | class_val
  0   |   int
  1   |   int
 ...  |   ...
 n-1  |   int
```

While sktime’s format uses pandas Series objects in its nested DataFrame structure, other data structures such as NumPy arrays could be used to hold the timeseries values in each cell. However, pandas Series objects make it simple to store sparse data and to accommodate series with non-integer timestamps (such as dates).
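As a minimal sketch of this layout, a small nested DataFrame can be built by hand with plain pandas. The column names `dim_0` and `dim_1`, the values, and the labels in `y` are made up for illustration, and this is not necessarily how sktime constructs its data internally.

```python
import pandas as pd

# Two instances (rows) and two dimensions (columns).
# Every cell holds a complete pd.Series of timepoints.
X = pd.DataFrame(
    {
        "dim_0": [pd.Series([2.0, 3.0, 2.0, 4.0]), pd.Series([13.0, 12.0, 32.0, 12.0])],
        "dim_1": [pd.Series([4.0, 3.0, 2.0, 2.0]), pd.Series([22.0, 23.0, 12.0, 32.0])],
    }
)

# Hypothetical class labels, aligned with X through the shared row index.
y = pd.Series(["a", "b"], index=X.index, name="class_val")

# A cell's Series may also carry a non-integer index, such as dates.
dated = pd.Series([0.5, 0.7], index=pd.to_datetime(["2021-01-01", "2021-01-03"]))

print(X.shape)  # (2, 2)
print(isinstance(X.iloc[0, 0], pd.Series))  # True
```

Because the labels share the row index of `X`, selecting or reordering instances by index keeps features and targets aligned.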

## The `.ts` file format

One common use case is to load locally stored data. To make this easy, the `.ts` file format has been created for representing problems in a standard format for use with sktime.

### Representing data with .ts files

A `.ts` file includes two main parts: 
* header information 
* data.

The header information is used to facilitate simple representation of the data through including metadata about the structure of the problem. The header contains the following:

```text
@problemName <problem name>
@timeStamps <true/false>
@univariate <true/false>
@classLabel <true/false> <space delimited list of possible class values>
@data
```

The data for the problem should begin after the `@data` tag. In the simplest case, where `@timeStamps` is false, the values of a series are written as a comma-separated list and the index of each value is its position in the list (`0, 1, …, m`). An *instance* may contain one or more dimensions; instances are line-delimited and the dimensions within an instance are colon (`:`) delimited. For example:

```text
2,3,2,4:4,3,2,2
13,12,32,12:22,23,12,32
4,4,5,4:3,2,3,2
```

This example data has 3 *instances*, corresponding to the three lines shown above. Each *instance* has 2 *dimensions* with 4 *observations* per dimension. For example, the initial instance’s first dimension has the timepoint values `2, 3, 2, 4` and its second dimension has the values `4, 3, 2, 2`.

Missing readings can be specified using `?`. For example,

```text
2,?,2,4:4,3,2,2
13,12,32,12:22,23,12,32
4,4,5,4:3,2,3,2
```

would indicate the second timepoint value of the initial instance’s first dimension is missing.
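The delimiter rules above can be sketched as a few lines of Python. The helper `parse_ts_line` is hypothetical and only handles unlabelled data lines without timestamps; it is an illustration of the layout, not sktime's own parser.

```python
import math


def parse_ts_line(line):
    """Split one data line into dimensions (on ':') and values (on ',')."""
    dimensions = []
    for dim_text in line.strip().split(":"):
        # A '?' marks a missing reading and is mapped to NaN here.
        values = [
            float("nan") if token.strip() == "?" else float(token)
            for token in dim_text.split(",")
        ]
        dimensions.append(values)
    return dimensions


instance = parse_ts_line("2,?,2,4:4,3,2,2")
print(len(instance))              # 2
print(instance[1])                # [4.0, 3.0, 2.0, 2.0]
print(math.isnan(instance[0][1])) # True
```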

Alternatively, for sparse datasets, readings can be specified by setting `@timeStamps` to true in the header and representing the data with tuples of the form `(timestamp, value)`, given only for the observed readings. For example, the first instance in the example above could be specified in this representation as:

```text
(0,2),(1,3),(2,2),(3,4):(0,4),(1,3),(2,2),(3,2)
```

Equivalently, the sparser example

```text
2,5,?,?,?,?,?,5,?,?,?,?,4
```

could be represented with just the non-missing timestamps as:

```text
(0,2),(1,5),(7,5),(12,4)
```
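The correspondence between the dense and the timestamped representation can be checked with a short sketch. The helper name `dense_to_timestamp_tuples` is made up for this illustration; it keeps each non-missing reading together with its position in the list.

```python
def dense_to_timestamp_tuples(dense):
    """Convert a dense comma-separated series with '?' gaps to (timestamp, value) tuples."""
    readings = dense.split(",")
    return ",".join(
        f"({position},{value})"
        for position, value in enumerate(readings)
        if value.strip() != "?"
    )


sparse = dense_to_timestamp_tuples("2,5,?,?,?,?,?,5,?,?,?,?,4")
print(sparse)  # (0,2),(1,5),(7,5),(12,4)
```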

When using the `.ts` file format to store data for timeseries classification problems, the class label for an instance should be specified in the last dimension and `@classLabel` should be set to true in the header information and be followed by the set of possible class values. For example, if a case consists of a single dimension and has a class value of 1 it would be specified as:

```text
1,4,23,34:1
```
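Putting the header and the data rules together, the text of a small labelled problem could be assembled as sketched below. The function `to_ts_text` and the problem name `Example` are hypothetical, and the sketch has not been checked against sktime's loader; it only follows the layout described in this section.

```python
def to_ts_text(problem_name, instances, labels, class_values):
    """Render labelled instances as text in the simple (no timestamps) layout.

    instances: list of instances, each a list of dimensions, each a list of values.
    labels: one class label (as a string) per instance.
    class_values: the possible class values listed in the header.
    """
    univariate = all(len(dims) == 1 for dims in instances)
    lines = [
        f"@problemName {problem_name}",
        "@timeStamps false",
        f"@univariate {'true' if univariate else 'false'}",
        "@classLabel true " + " ".join(class_values),
        "@data",
    ]
    for dims, label in zip(instances, labels):
        dim_texts = [",".join(str(value) for value in dim) for dim in dims]
        # The class label is appended after the final colon.
        lines.append(":".join(dim_texts + [label]))
    return "\n".join(lines)


ts_text = to_ts_text("Example", [[[1, 4, 23, 34]]], ["1"], ["0", "1"])
print(ts_text.splitlines()[-1])  # 1,4,23,34:1
```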

### Loading from `.ts` file to pandas DataFrame

A dataset can be loaded from a `.ts` file using the following function in `sktime.utils.data_io`:

```python
load_from_tsfile_to_dataframe(full_file_path_and_name, replace_missing_vals_with='NaN')
```

This can be demonstrated using the ArrowHead problem that is included in sktime under `sktime/datasets/data`:

```python
import os

import sktime
from sktime.utils.data_io import load_from_tsfile_to_dataframe
```

```python
DATA_PATH = os.path.join(os.path.dirname(sktime.__file__), "datasets/data")
print("DATA_PATH:", DATA_PATH)
```

```text
DATA_PATH: /mnt/space/miniconda3/envs/py38_playaround-sk_sktime-source/lib/python3.8/site-packages/sktime/datasets/data
```

```python
train_x, train_y = load_from_tsfile_to_dataframe(
    os.path.join(DATA_PATH, "ArrowHead/ArrowHead_TRAIN.ts")
)
test_x, test_y = load_from_tsfile_to_dataframe(
    os.path.join(DATA_PATH, "ArrowHead/ArrowHead_TEST.ts")
)
```

Train and test partitions of the ArrowHead problem have been loaded into nested dataframes with an associated array of class values. As an example, below are the first 5 rows from the `train_x` and `train_y`:

```python
train_x.head()
```

```text
  | dim_0
0 | 0 -1.9630 1 -1.9578 2 -1.9561 3 ...
1 | 0 -1.7746 1 -1.7740 2 -1.7766 3 ...
2 | 0 -1.8660 1 -1.8420 2 -1.8350 3 ...
3 | 0 -2.0738 1 -2.0733 2 -2.0446 3 ...
4 | 0 -1.7463 1 -1.7413 2 -1.7227 3 ...
```

```python
train_y[0:5]
```

```text
array(['0', '1', '2', '0', '1'], dtype='<U1')
```

## Loading other file formats

Researchers who publish timeseries data commonly use two other formats:
* Weka `ARFF` files
* UCR `.tsv` files

### Loading from Weka ARFF files

It is also possible to load data from Weka’s attribute-relation file format (ARFF) files. Data for timeseries problems are made available in this format by researchers at the University of East Anglia (among others) at www.timeseriesclassification.com. The `load_from_arff_to_dataframe` method in `sktime.utils.data_io` supports reading data for *both univariate and multivariate timeseries problems*.

The univariate functionality is demonstrated below using data on the ArrowHead problem again (this time loading from `ARFF` file).

```python
from sktime.utils.data_io import load_from_arff_to_dataframe
```

```python
X, y = load_from_arff_to_dataframe(
    os.path.join(DATA_PATH, "ArrowHead/ArrowHead_TRAIN.arff")
)
X.head()
```

```text
  | dim_0
0 | 0 -1.963009 1 -1.957825 2 -1.95614...
1 | 0 -1.774571 1 -1.774036 2 -1.77658...
2 | 0 -1.866021 1 -1.841991 2 -1.83502...
3 | 0 -2.073758 1 -2.073301 2 -2.04460...
4 | 0 -1.746255 1 -1.741263 2 -1.72274...
```


The multivariate BasicMotions problem is used below to illustrate the ability to read multivariate timeseries data from `ARFF` files into the sktime format.

```python
X, y = load_from_arff_to_dataframe(
    os.path.join(DATA_PATH, "BasicMotions/BasicMotions_TRAIN.arff")
)
X.head()
```

```text
  | dim_0 | dim_1 | dim_2 | dim_3 | dim_4 | dim_5
0 | 0 0.079106 1 0.079106 2 -0.903497 3... | 0 0.394032 1 0.394032 2 -3.666397 3... | 0 0.551444 1 0.551444 2 -0.282844 3... | 0 0.351565 1 0.351565 2 -0.095881 3... | 0 0.023970 1 0.023970 2 -0.319605 3... | 0 0.633883 1 0.633883 2 0.972131 3...
1 | 0 0.377751 1 0.377751 2 2.952965 3... | 0 -0.610850 1 -0.610850 2 0.970717 3... | 0 -0.147376 1 -0.147376 2 -5.962515 3... | 0 -0.103872 1 -0.103872 2 -7.593275 3... | 0 -0.109198 1 -0.109198 2 -0.697804 3... | 0 -0.037287 1 -0.037287 2 -2.865789 3...
2 | 0 -0.813905 1 -0.813905 2 -0.424628 3... | 0 0.825666 1 0.825666 2 -1.305033 3... | 0 0.032712 1 0.032712 2 0.826170 3... | 0 0.021307 1 0.021307 2 -0.372872 3... | 0 0.122515 1 0.122515 2 -0.045277 3... | 0 0.775041 1 0.775041 2 0.383526 3...
3 | 0 0.289855 1 0.289855 2 -0.669185 3... | 0 0.284130 1 0.284130 2 -0.210466 3... | 0 0.213680 1 0.213680 2 0.252267 3... | 0 -0.314278 1 -0.314278 2 0.018644 3... | 0 0.074574 1 0.074574 2 0.007990 3... | 0 -0.079901 1 -0.079901 2 0.237040 3...
4 | 0 -0.123238 1 -0.123238 2 -0.249547 3... | 0 0.379341 1 0.379341 2 0.541501 3... | 0 -0.286006 1 -0.286006 2 0.208420 3... | 0 -0.098545 1 -0.098545 2 -0.023970 3... | 0 0.058594 1 0.058594 2 0.175783 3... | 0 -0.074574 1 -0.074574 2 0.114525 3...
```


### Loading from UCR .tsv Format Files

A further option is to load data into sktime from tab-separated value (`.tsv`) files. Researchers at the University of California, Riverside make a variety of timeseries data available in this format at https://www.cs.ucr.edu/~eamonn/time_series_data_2018.

The `load_from_ucr_tsv_to_dataframe` method in `sktime.utils.data_io` supports reading univariate problems. An example with ArrowHead is given below to demonstrate equivalence with loading from the `.ts` and `ARFF` file formats.

```python
from sktime.utils.data_io import load_from_ucr_tsv_to_dataframe
```

```python
X, y = load_from_ucr_tsv_to_dataframe(
    os.path.join(DATA_PATH, "ArrowHead/ArrowHead_TRAIN.tsv")
)
X.head()
```

```text
  | dim_0
0 | 0 -1.963009 1 -1.957825 2 -1.95614...
1 | 0 -1.774571 1 -1.774036 2 -1.77658...
2 | 0 -1.866021 1 -1.841991 2 -1.83502...
3 | 0 -2.073758 1 -2.073301 2 -2.04460...
4 | 0 -1.746255 1 -1.741263 2 -1.72274...
```
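For readers who want to see what such a file contains, the sketch below reads a tiny tab-separated sample with plain pandas. It assumes the layout used by the UCR archive files, where each row is one series, the first column is the class label and there is no header row. The two rows of numbers are made up, and this is not sktime's loader.

```python
import io

import pandas as pd

# Made-up sample in the assumed layout: label, then the series values, tab-separated.
tsv_text = "0\t-1.96\t-1.95\t-1.94\n1\t-1.77\t-1.77\t-1.78\n"

raw = pd.read_csv(io.StringIO(tsv_text), sep="\t", header=None)

# First column holds the class labels; the remaining columns are the timepoints.
y = raw.iloc[:, 0].astype(str).to_numpy()
values = raw.iloc[:, 1:].to_numpy()

# Wrap each row of values in a pd.Series to obtain a one-column nested DataFrame.
X = pd.DataFrame({"dim_0": [pd.Series(row) for row in values]})

print(X.shape)  # (2, 1)
print(list(y))  # ['0', '1']
```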


## Converting between other NumPy and pandas formats

It is also possible to use data from sources other than `.ts` and `.arff` files by manually shaping the data into the format described above.

Functions for converting between these formats and sktime’s nested DataFrame format are provided in `sktime.datatypes._panel._convert`.

### Using tabular data with sktime

One approach to representing timeseries data is a tabular DataFrame. As usual, each row represents an instance. In the tabular setting, each timepoint of the univariate timeseries measured for an instance is treated as a feature and stored as a primitive data type in the DataFrame’s cells.

In a **univariate setting**, where there are `n` instances of the series and each univariate timeseries has `t` timepoints, this would yield a pandas DataFrame with shape `(n, t)`. In practice, this could be used to represent sensors measuring the same signal over time (features) on different machines (instances) or the same economic variable over time (features) for different countries (instances).

The function `from_2d_array_to_nested` converts `(n, t)` tabular data to a nested DataFrame with shape `(n, 1)`. To convert from a nested DataFrame back to a tabular array, the function `from_nested_to_2d_array` can be used.

The example below uses 50 instances with 20 timepoints each.

```python
from numpy.random import default_rng

from sktime.datatypes._panel._convert import (
    from_2d_array_to_nested,
    from_nested_to_2d_array,
    is_nested_dataframe,
)
```

```python
rng = default_rng()
X_2d = rng.standard_normal((50, 20))
print(f"The tabular data has the shape {X_2d.shape}")
```

```text
The tabular data has the shape (50, 20)
```


The `from_2d_array_to_nested` function makes it easy to convert this to a nested DataFrame.

```python
X_nested = from_2d_array_to_nested(X_2d)

print(f"X_nested is a nested DataFrame: {is_nested_dataframe(X_nested)}")
print(f"The cell contains a {type(X_nested.iloc[0,0])}.")
print(f"The nested DataFrame has shape {X_nested.shape}")

X_nested.head()
```

```text
X_nested is a nested DataFrame: True
The cell contains a <class 'pandas.core.series.Series'>.
The nested DataFrame has shape (50, 1)

  | 0
0 | 0 -1.446857 1 0.523231 2 1.706278 3...
1 | 0 -0.155030 1 0.991564 2 -1.084761 3...
2 | 0 0.701685 1 -0.238664 2 -0.467291 3...
3 | 0 -1.514999 1 0.921751 2 0.577310 3...
4 | 0 1.401442 1 0.853261 2 1.513764 3...
```


This nested DataFrame can also be converted back to tabular form using `from_nested_to_2d_array`.

```python
X_2d = from_nested_to_2d_array(X_nested)
print(f"The tabular data has the shape {X_2d.shape}")
```

```text
The tabular data has the shape (50, 20)
```
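To make the reshaping concrete, the round trip between a `(n, t)` array and a `(n, 1)` nested DataFrame can be sketched with plain NumPy and pandas. This is an illustration of the idea under the assumption of equal-length series, not the implementation behind the sktime functions used above.

```python
import numpy as np
import pandas as pd

# 2 instances with 3 timepoints each.
X_2d = np.arange(6, dtype=float).reshape(2, 3)

# Tabular -> nested: one pd.Series per row, all placed in a single column.
X_nested = pd.DataFrame({0: [pd.Series(row) for row in X_2d]})

# Nested -> tabular: stack the values of each cell back into rows.
X_back = np.vstack([cell.to_numpy() for cell in X_nested[0]])

print(X_nested.shape)  # (2, 1)
print(X_back.shape)    # (2, 3)
```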


### Using long-format data with sktime

Timeseries data can also be represented in *long* format where each row identifies the value for a single timepoint for a given dimension for a given instance.

This format may be encountered in a database where each row stores a single measurement identified by several identification columns. For example, `case_id` identifies a specific instance in the data, `dim_id` is an integer between $0$ and $d-1$ for the $d$ dimensions in the data, `reading_id` is the index of the timepoint for the associated `case_id` and `dim_id`, and `value` is the actual value of the observation. E.g.:

```text
     | case_id | dim_id | reading_id | value
------------------------------------------------
  0  |   int   |  int   |    int     | double
  1  |   int   |  int   |    int     | double
  2  |   int   |  int   |    int     | double
  3  |   int   |  int   |    int     | double
```

Sktime provides functions to convert to and from the long data format in `sktime.datatypes._panel._convert`.

The `from_long_to_nested` function converts from a long format DataFrame to sktime’s nested format (with assumptions made on how the data is initially formatted). Conversely, `from_nested_to_long` converts from a sktime nested DataFrame into a long format DataFrame.

To demonstrate this functionality the method below creates a dataset with 
* 50 instances (cases), 
* 5 dimensions and 
* 20 timepoints per dimension.

```python
from sktime.utils.data_io import generate_example_long_table

X = generate_example_long_table(num_cases=50, series_len=20, num_dims=5)

X.head()
```

```text
  | case_id | dim_id | reading_id | value
0 | 0       | 0      | 0          | 0.4593
1 | 0       | 0      | 1          | 0.314455
2 | 0       | 0      | 2          | 0.661846
3 | 0       | 0      | 3          | 0.682826
4 | 0       | 0      | 4          | 0.978559
```

```python
X.tail()
```

```text
     | case_id | dim_id | reading_id | value
4995 | 49      | 4      | 15         | 0.04301
4996 | 49      | 4      | 16         | 0.655298
4997 | 49      | 4      | 17         | 0.897434
4998 | 49      | 4      | 18         | 0.506377
4999 | 49      | 4      | 19         | 0.625479
```


As shown below, applying the `from_long_to_nested` method returns a sktime-formatted dataset with individual dimensions represented by columns of the output dataframe.

```python
from sktime.datatypes._panel._convert import from_long_to_nested, from_nested_to_long

X_nested = from_long_to_nested(X)
X_nested.head()
```

```text
  | var_0 | var_1 | var_2 | var_3 | var_4
0 | 0 0.459300 1 0.314455 2 0.661846 3... | 0 0.926045 1 0.758544 2 0.223308 3... | 0 0.806523 1 0.260597 2 0.507627 3... | 0 0.349107 1 0.626810 2 0.724004 3... | 0 0.714901 1 0.442377 2 0.236769 3...
1 | 0 0.101307 1 0.112887 2 0.568397 3... | 0 0.812085 1 0.123327 2 0.892039 3... | 0 0.892351 1 0.675009 2 0.699763 3... | 0 0.165541 1 0.690658 2 0.221861 3... | 0 0.521603 1 0.687145 2 0.834008 3...
2 | 0 0.863829 1 0.651000 2 0.026650 3... | 0 0.437041 1 0.549427 2 0.236153 3... | 0 0.933271 1 0.447072 2 0.255453 3... | 0 0.874927 1 0.789391 2 0.077620 3... | 0 0.833595 1 0.746720 2 0.857255 3...
3 | 0 0.898896 1 0.695837 2 0.959954 3... | 0 0.891670 1 0.175724 2 0.613661 3... | 0 0.868691 1 0.508563 2 0.670314 3... | 0 0.096364 1 0.136814 2 0.399626 3... | 0 0.831894 1 0.329698 2 0.923800 3...
4 | 0 0.374509 1 0.721724 2 0.968902 3... | 0 0.590241 1 0.996262 2 0.883930 3... | 0 0.132159 1 0.936840 2 0.840720 3... | 0 0.673192 1 0.173712 2 0.727276 3... | 0 0.183425 1 0.162356 2 0.038333 3...
```


As expected the result is a nested DataFrame and the cells include nested pandas Series objects.

```python
print(f"X_nested is a nested DataFrame: {is_nested_dataframe(X_nested)}")
print(f"The cell contains a {type(X_nested.iloc[0, 0])}.")
print(f"The nested DataFrame has shape {X_nested.shape}")
X_nested.iloc[0, 0].head()
```

```text
X_nested is a nested DataFrame: True
The cell contains a <class 'pandas.core.series.Series'>.
The nested DataFrame has shape (50, 5)

0    0.459300
1    0.314455
2    0.661846
3    0.682826
4    0.978559
Name: 0, dtype: float64
```

As shown below, the `from_nested_to_long` function can be used to convert the resulting nested DataFrame (or any nested DataFrame) to a long format DataFrame.

```python
X_long = from_nested_to_long(
    X_nested,
    instance_column_name="case_id",
    time_column_name="reading_id",
    dimension_column_name="dim_id",
)
X_long.head()
```

```text
  | case_id | reading_id | dim_id | value
0 | 0       | 0          | var_0  | 0.4593
1 | 0       | 1          | var_0  | 0.314455
2 | 0       | 2          | var_0  | 0.661846
3 | 0       | 3          | var_0  | 0.682826
4 | 0       | 4          | var_0  | 0.978559
```

```python
X_long.tail()
```

```text
     | case_id | reading_id | dim_id | value
4995 | 49      | 15         | var_4  | 0.04301
4996 | 49      | 16         | var_4  | 0.655298
4997 | 49      | 17         | var_4  | 0.897434
4998 | 49      | 18         | var_4  | 0.506377
4999 | 49      | 19         | var_4  | 0.625479
```
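The idea behind the long-to-nested conversion can be sketched with plain pandas: for every case and dimension, the matching rows are collected into one Series indexed by `reading_id`. The tiny table and the `var_` column prefix below are chosen for illustration, and the sketch is not the implementation of `from_long_to_nested`.

```python
import pandas as pd

# Made-up long table: 2 cases, 2 dimensions, 2 readings each.
long_df = pd.DataFrame(
    {
        "case_id": [0, 0, 0, 0, 1, 1, 1, 1],
        "dim_id": [0, 0, 1, 1, 0, 0, 1, 1],
        "reading_id": [0, 1, 0, 1, 0, 1, 0, 1],
        "value": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    }
)

case_ids = sorted(long_df["case_id"].unique())
dim_ids = sorted(long_df["dim_id"].unique())

data = {}
for dim in dim_ids:
    column = []
    for case in case_ids:
        mask = (long_df["case_id"] == case) & (long_df["dim_id"] == dim)
        # One Series per (case, dimension) pair, indexed by the reading id.
        series = long_df.loc[mask].set_index("reading_id")["value"].sort_index()
        column.append(series)
    data[f"var_{dim}"] = column

X_nested = pd.DataFrame(data, index=case_ids)
print(X_nested.shape)  # (2, 2)
```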


### Using multi-indexed pandas DataFrames

Pandas deprecated its Panel object in version 0.20.1. Since that time pandas has recommended representing 3-dimensional data using a multi-indexed DataFrame.

Storing timeseries data in a pandas multi-indexed DataFrame is a natural option since many timeseries problems include data over the instance, feature and time dimensions.

Sktime provides the functions `from_multi_index_to_nested` and `from_nested_to_multi_index` in `sktime.datatypes._panel._convert` to easily convert between pandas multi-indexed DataFrames and sktime’s nested DataFrame structure.

The example below illustrates how these functions can be used to convert to and from the nested structure given data with 
* 50 instances, 
* 5 features (columns) and 
* 20 timepoints per feature. 

In the multi-indexed DataFrame a row represents a unique combination of the instance and timepoint indices. Therefore, the resulting multi-indexed DataFrame should have the shape `(1000, 5)`.

```python
from sktime.datatypes._panel._convert import (
    from_multi_index_to_nested,
    from_nested_to_multi_index,
)
from sktime.utils.data_io import make_multi_index_dataframe
```

```python
X_mi = make_multi_index_dataframe(n_instances=50, n_columns=5, n_timepoints=20)
```

```python
print(f"The multi-indexed DataFrame has shape {X_mi.shape}")
print(f"The multi-index names are {X_mi.index.names}")

X_mi.head(40)
```

```text
The multi-indexed DataFrame has shape (1000, 5)
The multi-index names are ['case_id', 'reading_id']

case_id | reading_id | var_0    | var_1    | var_2    | var_3    | var_4
0       | 0          | 0.201087 | 0.217185 | 0.431738 | 0.557439 | 0.236857
0       | 1          | 0.276634 | 0.052594 | 0.085917 | 0.133247 | 0.018223
0       | 2          | 0.713826 | 0.20676  | 0.716341 | 0.456849 | 0.273336
0       | 3          | 0.812919 | 0.626376 | 0.342847 | 0.187723 | 0.508156
0       | 4          | 0.772934 | 0.792448 | 0.432539 | 0.039289 | 0.325395
0       | 5          | 0.080726 | 0.43596  | 0.669647 | 0.797285 | 0.055294
0       | 6          | 0.699128 | 0.716829 | 0.512246 | 0.533898 | 0.085029
0       | 7          | 0.770979 | 0.793112 | 0.261544 | 0.188832 | 0.823408
0       | 8          | 0.72297  | 0.276612 | 0.336821 | 0.209239 | 0.187726
0       | 9          | 0.014251 | 0.34619  | 0.924567 | 0.338723 | 0.65059
```


The multi-indexed DataFrame can be easily converted to a nested DataFrame with shape (50, 5). Note that the conversion to the nested DataFrame has preserved the column names (it has also preserved the values of the instance index and the pandas Series objects nested in each cell have preserved the time index).

```python
X_nested = from_multi_index_to_nested(X_mi, instance_index="case_id")
print(f"X_nested is a nested DataFrame: {is_nested_dataframe(X_nested)}")
print(f"The cell contains a {type(X_nested.iloc[0,0])}.")
print(f"The nested DataFrame has shape {X_nested.shape}")
X_nested.head()
```

```text
X_nested is a nested DataFrame: True
The cell contains a <class 'pandas.core.series.Series'>.
The nested DataFrame has shape (50, 5)

  | var_0 | var_1 | var_2 | var_3 | var_4
0 | 0 0.201087 1 0.276634 2 0.713826 3... | 0 0.217185 1 0.052594 2 0.206760 3... | 0 0.431738 1 0.085917 2 0.716341 3... | 0 0.557439 1 0.133247 2 0.456849 3... | 0 0.236857 1 0.018223 2 0.273336 3...
1 | 0 0.870719 1 0.652456 2 0.086153 3... | 0 0.980800 1 0.831189 2 0.623619 3... | 0 0.024212 1 0.914950 2 0.739753 3... | 0 0.719208 1 0.001141 2 0.490113 3... | 0 0.458700 1 0.510090 2 0.214420 3...
2 | 0 0.689103 1 0.034383 2 0.124666 3... | 0 0.795183 1 0.896130 2 0.432838 3... | 0 0.489351 1 0.468372 2 0.480265 3... | 0 0.922938 1 0.560929 2 0.365568 3... | 0 0.210437 1 0.370099 2 0.884629 3...
3 | 0 0.271332 1 0.723495 2 0.036326 3... | 0 0.487779 1 0.490938 2 0.148760 3... | 0 0.351946 1 0.256900 2 0.985641 3... | 0 0.914225 1 0.829495 2 0.092312 3... | 0 0.827499 1 0.942766 2 0.128163 3...
4 | 0 0.675934 1 0.964763 2 0.166523 3... | 0 0.557622 1 0.162007 2 0.934558 3... | 0 0.655348 1 0.385278 2 0.452725 3... | 0 0.693904 1 0.668770 2 0.583794 3... | 0 0.632956 1 0.210487 2 0.628082 3...
```


Nested DataFrames can also be converted to a multi-indexed pandas DataFrame:

```python
X_mi = from_nested_to_multi_index(
    X_nested, instance_index="case_id", time_index="reading_id"
)
X_mi.head()
```

```text
case_id | reading_id | var_0    | var_1    | var_2    | var_3    | var_4
0       | 0          | 0.201087 | 0.217185 | 0.431738 | 0.557439 | 0.236857
0       | 1          | 0.276634 | 0.052594 | 0.085917 | 0.133247 | 0.018223
0       | 2          | 0.713826 | 0.20676  | 0.716341 | 0.456849 | 0.273336
0       | 3          | 0.812919 | 0.626376 | 0.342847 | 0.187723 | 0.508156
0       | 4          | 0.772934 | 0.792448 | 0.432539 | 0.039289 | 0.325395
```
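The relationship between the two layouts can also be sketched with plain pandas. The small sizes and values below are made up, and the loop only illustrates the idea (one cross-section per instance and column becomes one cell); it is not the implementation of `from_multi_index_to_nested`.

```python
import numpy as np
import pandas as pd

n_instances, n_columns, n_timepoints = 3, 2, 4

# One row per (instance, timepoint) pair, one column per feature.
index = pd.MultiIndex.from_product(
    [range(n_instances), range(n_timepoints)], names=["case_id", "reading_id"]
)
values = np.arange(n_instances * n_timepoints * n_columns, dtype=float)
X_mi = pd.DataFrame(
    values.reshape(-1, n_columns), index=index, columns=["var_0", "var_1"]
)
print(X_mi.shape)  # (12, 2), i.e. (n_instances * n_timepoints, n_columns)

# Multi-index -> nested: the cross-section for one instance becomes one cell.
case_ids = X_mi.index.get_level_values("case_id").unique()
data = {
    col: [X_mi[col].xs(case_id, level="case_id") for case_id in case_ids]
    for col in X_mi.columns
}
X_nested = pd.DataFrame(data, index=case_ids)
print(X_nested.shape)  # (3, 2)
```

Each nested cell keeps its `reading_id` index, which matches the note above that the time index is preserved by the conversion.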


### Using NumPy 3d-arrays with sktime

Another common approach for representing timeseries data is to use a 3-dimensional NumPy array with shape `(n_instances, n_columns, n_timepoints)`.

Sktime provides the functions `from_3d_numpy_to_nested` and `from_nested_to_3d_numpy` in `sktime.datatypes._panel._convert` to let users easily convert between NumPy 3d-arrays and nested pandas DataFrames.

This is demonstrated using a 3d-array with 50 instances, 5 features (columns) and 20 timepoints, resulting in a 3d-array with shape (50, 5, 20).

```python
from sktime.datatypes._panel._convert import (
    from_3d_numpy_to_nested,
    from_multi_index_to_3d_numpy,
    from_nested_to_3d_numpy,
)
```

```python
X_mi = make_multi_index_dataframe(n_instances=50, n_columns=5, n_timepoints=20)
X_mi
```

```text
case_id | reading_id | var_0    | var_1    | var_2    | var_3    | var_4
0       | 0          | 0.324847 | 0.340168 | 0.278412 | 0.025561 | 0.835052
0       | 1          | 0.188937 | 0.774495 | 0.343728 | 0.265234 | 0.522983
0       | 2          | 0.951976 | 0.001649 | 0.112727 | 0.918253 | 0.706452
0       | 3          | 0.072100 | 0.650150 | 0.926596 | 0.582814 | 0.826831
0       | 4          | 0.738414 | 0.059886 | 0.847103 | 0.886019 | 0.792002
...     | ...        | ...      | ...      | ...      | ...      | ...
49      | 15         | 0.565125 | 0.627685 | 0.499117 | 0.181669 | 0.227346
49      | 16         | 0.999581 | 0.143561 | 0.491609 | 0.626787 | 0.476629
49      | 17         | 0.050536 | 0.540079 | 0.147250 | 0.206087 | 0.713255
49      | 18         | 0.894012 | 0.184404 | 0.163263 | 0.682406 | 0.249869
```

```python
X_3d = from_multi_index_to_3d_numpy(
    X_mi,
    instance_index="case_id",
    time_index="reading_id",
)

print(f"The 3d-array has shape {X_3d.shape}")
```

```text
The 3d-array has shape (50, 5, 20)
```


The 3d-array can be easily converted to a nested DataFrame with shape (50, 5). 

Note that *since a NumPy array doesn’t have indices*, the *instance index is the numerical range over the number of instances* and the column names are assigned automatically. Users can optionally supply their own column names via the `column_names` parameter.

```python
X_nested = from_3d_numpy_to_nested(X_3d)
print(f"X_nested is a nested DataFrame: {is_nested_dataframe(X_nested)}")
print(f"The cell contains a {type(X_nested.iloc[0,0])}.")
print(f"The nested DataFrame has shape {X_nested.shape}")
X_nested.head()
```

```text
X_nested is a nested DataFrame: True
The cell contains a <class 'pandas.core.series.Series'>.
The nested DataFrame has shape (50, 5)

  | var_0 | var_1 | var_2 | var_3 | var_4
0 | 0 0.324847 1 0.188937 2 0.951976 3... | 0 0.340168 1 0.774495 2 0.001649 3... | 0 0.278412 1 0.343728 2 0.112727 3... | 0 0.025561 1 0.265234 2 0.918253 3... | 0 0.835052 1 0.522983 2 0.706452 3...
1 | 0 0.985660 1 0.977662 2 0.467585 3... | 0 0.619371 1 0.803110 2 0.688869 3... | 0 0.462537 1 0.346833 2 0.181819 3... | 0 0.145575 1 0.283602 2 0.890452 3... | 0 0.754588 1 0.704418 2 0.481646 3...
2 | 0 0.836644 1 0.523001 2 0.502637 3... | 0 0.338380 1 0.014664 2 0.083368 3... | 0 0.890057 1 0.045099 2 0.301357 3... | 0 0.092954 1 0.819357 2 0.323108 3... | 0 0.708904 1 0.079426 2 0.141756 3...
3 | 0 0.032805 1 0.114444 2 0.374490 3... | 0 0.417094 1 0.069967 2 0.829096 3... | 0 0.225188 1 0.857164 2 0.225471 3... | 0 0.676540 1 0.604369 2 0.417143 3... | 0 0.981333 1 0.391352 2 0.282223 3...
4 | 0 0.634913 1 0.562715 2 0.424784 3... | 0 0.471011 1 0.257612 2 0.793463 3... | 0 0.974301 1 0.126887 2 0.402059 3... | 0 0.660631 1 0.747228 2 0.368066 3... | 0 0.544897 1 0.456153 2 0.810456 3...
```


Nested DataFrames can also be converted to NumPy 3d-arrays.

```python
X_3d = from_nested_to_3d_numpy(X_nested)
print(f"The resulting object is a {type(X_3d)}")
print(f"The shape of the 3d-array is {X_3d.shape}")
```

```text
The resulting object is a <class 'numpy.ndarray'>
The shape of the 3d-array is (50, 5, 20)
```
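The axis order `(n_instances, n_columns, n_timepoints)` is easy to mix up, so the sketch below spells out the round trip with plain NumPy and pandas. It assumes all series have equal length and uses made-up data and `var_` column names; it illustrates the layout rather than sktime's implementation.

```python
import numpy as np
import pandas as pd

# Shape is (n_instances, n_columns, n_timepoints).
X_3d = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
n_instances, n_columns, n_timepoints = X_3d.shape

# 3d -> nested: the cell for instance i and column j holds X_3d[i, j, :].
X_nested = pd.DataFrame(
    {
        f"var_{j}": [pd.Series(X_3d[i, j, :]) for i in range(n_instances)]
        for j in range(n_columns)
    }
)

# Nested -> 3d: stack the cells of each row, then stack the rows.
X_back = np.stack(
    [
        np.stack([X_nested.iloc[i, j].to_numpy() for j in range(n_columns)])
        for i in range(n_instances)
    ]
)

print(X_nested.shape)  # (2, 3)
print(X_back.shape)    # (2, 3, 4)
```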


### Converting between NumPy 3d-arrays and pandas multi-indexed DataFrame

Although an example is not provided here, sktime lets users convert data between NumPy 3d-arrays and multi-indexed pandas DataFrames using the functions `from_3d_numpy_to_multi_index` and `from_multi_index_to_3d_numpy` in `sktime.datatypes._panel._convert`.
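The sktime functions themselves are not called here, but the reshaping they perform can be sketched with plain NumPy and pandas. The sketch assumes a 3d-array of shape `(n_instances, n_columns, n_timepoints)` and reuses the index names `case_id` and `reading_id` from the earlier examples; the data and the `var_` column names are made up.

```python
import numpy as np
import pandas as pd

X_3d = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
n_instances, n_columns, n_timepoints = X_3d.shape

index = pd.MultiIndex.from_product(
    [range(n_instances), range(n_timepoints)], names=["case_id", "reading_id"]
)

# Move the column axis last, giving (instances, timepoints, columns),
# then flatten the first two axes into the rows of the DataFrame.
flat = X_3d.transpose(0, 2, 1).reshape(n_instances * n_timepoints, n_columns)
X_mi = pd.DataFrame(flat, index=index, columns=[f"var_{j}" for j in range(n_columns)])

# Reverse the two steps to recover the 3d-array.
X_back = (
    X_mi.to_numpy()
    .reshape(n_instances, n_timepoints, n_columns)
    .transpose(0, 2, 1)
)

print(X_mi.shape)                     # (8, 3)
print(np.array_equal(X_back, X_3d))   # True
```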