# Loading data in sktime

Data for use with sktime should be stored in pandas DataFrame objects, with cases represented by rows and the series data for each dimension of a problem stored in columns (the data structure is described in more detail below). Data can be loaded into the sktime format in several ways: directly from the bespoke sktime file format (.ts), or from supported file formats provided by other existing data sources (such as ARFF and .tsv). Data can also be loaded by other means into a long-table format and then converted to the sktime format using a provided function.

Below is a brief description of the .ts file format, an introduction to how data are stored in dataframes for sktime, and examples of loading data from a variety of file formats.


## Representing data with .ts files

The most typical use case is to load data from a locally stored .ts file. The .ts file format has been created for representing problems in a standard format for use with sktime. These files include two main parts:
* header information
* data

The header information is used to facilitate simple representation of the data through including metadata about the structure of the problem. The header contains the following:

    @problemName <problem name>
    @timeStamps <true/false>
    @univariate <true/false>
    @classLabel <true/false> <space delimited list of possible class values>
    @data
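
As a rough illustration of how these tags could be read, the sketch below collects the header lines that precede `@data` into a dictionary. It is not sktime's own parser: the function name `parse_ts_header`, the lower-cased dictionary keys and the example header are assumptions made for this illustration only.

```python
def parse_ts_header(lines):
    """Collect the @-tags that precede @data into a dict (illustrative only)."""
    header = {}
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines between tags
        if not line.startswith("@"):
            raise ValueError(f"expected a header tag, got: {line!r}")
        tag, *values = line[1:].split()
        tag = tag.lower()
        if tag == "data":
            return header  # everything after @data is series data
        if tag == "problemname":
            header[tag] = " ".join(values)
        elif tag == "classlabel":
            # first value is the true/false flag, the rest are the class values
            header[tag] = values[0].lower() == "true"
            header["class_values"] = values[1:]
        else:
            # @timeStamps and @univariate carry a single true/false flag
            header[tag] = values[0].lower() == "true"
    raise ValueError("no @data tag found")


# Hypothetical header, written for this example only.
example_header = [
    "@problemName Demo",
    "@timeStamps false",
    "@univariate true",
    "@classLabel true 0 1 2",
    "@data",
]
header = parse_ts_header(example_header)
print(header["class_values"])  # ['0', '1', '2']
```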

The data for the problem should begin after the @data tag. In the simplest case, where @timeStamps is false, the values of a series are expressed as a comma-separated list and the index of each value is its position in the list (0, 1, ..., m). A _case_ may contain one or more dimensions; cases are line-delimited and dimensions within a case are colon (:) delimited. For example:

    2,3,2,4:4,3,2,2
    13,12,32,12:22,23,12,32
    4,4,5,4:3,2,3,2

This example data has 3 _cases_, where each case has 2 _dimensions_ with 4 observations per dimension. Missing readings can be specified using ?. Alternatively, for sparse datasets, readings can be specified by setting @timeStamps to true and representing the data with tuples of the form (timestamp, value). For example, the first case in the example above could be specified in this representation as:

    (0,2),(1,3),(2,2),(3,4):(0,4),(1,3),(2,2),(3,2)

Equivalently,

    2,5,?,?,?,?,?,5,?,?,?,?,4

could be represented with timestamps as:

    (0,2),(1,5),(7,5),(12,4)
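
The correspondence between the two notations is mechanical: each reading that is not missing is paired with its position in the list. A minimal sketch of that conversion for a single dimension is shown below. The helper name `dense_to_timestamped` is made up for this example and is not part of sktime.

```python
def dense_to_timestamped(dense, missing="?"):
    """Rewrite a comma-separated series as (timestamp, value) tuples,
    dropping readings that are marked as missing."""
    values = [v.strip() for v in dense.split(",")]
    pairs = [
        f"({index},{value})"
        for index, value in enumerate(values)
        if value != missing
    ]
    return ",".join(pairs)


# First dimension of the first case in the example data above.
print(dense_to_timestamped("2,3,2,4"))  # (0,2),(1,3),(2,2),(3,4)
```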

For classification problems, the class label for a case should be specified in the last dimension and @classLabel should be in the header information to specify the set of possible class values. For example, if a case consists of a single dimension and has a class value of 1 it would be specified as:

    1,4,23,34:1
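
To make the layout of a case concrete, the sketch below splits one data line into its dimensions and, optionally, its class label. It assumes @timeStamps is false and is only an illustration of the format described here; `parse_case` and its `has_class_label` parameter are hypothetical names, not sktime functions.

```python
def parse_case(line, has_class_label=True):
    """Split one line of .ts data (no timestamps) into dimensions and a label."""
    parts = line.strip().split(":")
    # When @classLabel is true, the last colon-delimited field is the class value.
    label = parts.pop() if has_class_label else None
    dimensions = [
        # '?' marks a missing reading; represent it as NaN
        [float("nan") if v.strip() == "?" else float(v) for v in part.split(",")]
        for part in parts
    ]
    return dimensions, label


dims, label = parse_case("1,4,23,34:1")
print(dims, label)  # [[1.0, 4.0, 23.0, 34.0]] 1
```

The label is kept as a string here, since the header lists class values as space-delimited tokens.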

## Storing data in a pandas DataFrame

The core data structure for storing datasets in sktime is a pandas DataFrame, where rows of the dataframe correspond to cases and columns correspond to dimensions of the problem. The readings within each column of the dataframe are stored as pandas Series objects; the use of Series facilitates simple storage of sparse data or series with non-integer timestamps (such as dates). Further, if the loaded problem is a classification problem, the standard loading functionality within sktime will return the class values in a separate index-aligned numpy array (with an option to combine X and y into a single dataframe for high-level task construction). For example, for a problem with n cases that each have data across c dimensions:

    DataFrame:
    index |   dim_0   |   dim_1   |    ...    |  dim_c-1
       0  | pd.Series | pd.Series | pd.Series | pd.Series
       1  | pd.Series | pd.Series | pd.Series | pd.Series
      ... |    ...    |    ...    |    ...    |    ...
     n-1  | pd.Series | pd.Series | pd.Series | pd.Series

And if the data is a classification problem, a separate (index-aligned) array will be returned with the class labels:

    index | class_val
      0   |   int
      1   |   int
     ...  |   ...
     n-1  |   int
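
The nested layout can also be built by hand, which may help when checking what a loader returned. The sketch below is an illustration under stated assumptions rather than sktime code: it reuses the values of the three-case, two-dimension .ts example shown earlier, and the class labels in `y` are made up.

```python
import numpy as np
import pandas as pd

# Three cases, two dimensions, four readings each.
raw_cases = [
    ([2, 3, 2, 4], [4, 3, 2, 2]),
    ([13, 12, 32, 12], [22, 23, 12, 32]),
    ([4, 4, 5, 4], [3, 2, 3, 2]),
]

# One column per dimension; every cell holds a complete pd.Series.
X = pd.DataFrame(
    {
        f"dim_{d}": [pd.Series(case[d], dtype=float) for case in raw_cases]
        for d in range(2)
    }
)

# Index-aligned class values (made up for this example).
y = np.array(["0", "1", "0"])

print(X.shape)                  # (3, 2)
print(X.iloc[1, 1].tolist())    # [22.0, 23.0, 12.0, 32.0]
```

Each cell is itself a Series, so a single series is retrieved by selecting a row and a column, as in the last line.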


## Loading from .ts file to pandas DataFrame

A dataset can be loaded from a .ts file using the following function in the `sktime.utils.load_data` module:

    load_from_tsfile_to_dataframe(full_file_path_and_name, replace_missing_vals_with='NaN')

This can be demonstrated using the ArrowHead problem, which is included in sktime under sktime/datasets/data:

In [1]:
from sktime.utils.load_data import load_from_tsfile_to_dataframe
import os
import sktime

DATA_PATH = os.path.join(os.path.dirname(sktime.__file__), "datasets/data")

train_x, train_y = load_from_tsfile_to_dataframe(os.path.join(DATA_PATH, "ArrowHead/ArrowHead_TRAIN.ts"))
test_x, test_y = load_from_tsfile_to_dataframe(os.path.join(DATA_PATH, "ArrowHead/ArrowHead_TEST.ts"))

Train and test partitions of the ArrowHead problem have been loaded into dataframes with associated arrays for class values. As an example, the first 5 rows of train_x and train_y are shown below:

In [2]:
train_x.head()

    index | dim_0
      0   | 0 -1.9630 1 -1.9578 2 -1.9561 3 ...
      1   | 0 -1.7746 1 -1.7740 2 -1.7766 3 ...
      2   | 0 -1.8660 1 -1.8420 2 -1.8350 3 ...
      3   | 0 -2.0738 1 -2.0733 2 -2.0446 3 ...
      4   | 0 -1.7463 1 -1.7413 2 -1.7227 3 ...


In [3]:
train_y[0:5]

array(['0', '1', '2', '0', '1'], dtype='<U1')

## Loading from Weka ARFF files

It is also possible to load data from Weka's attribute-relation file format (ARFF) files. This is the data format used by researchers at the University of East Anglia (available from www.timeseriesclassification.com). The `load_from_arff_to_dataframe` function in `sktime.utils.load_data` supports reading both univariate and multivariate problems. Examples are shown below using the ArrowHead problem again (this time loading from ARFF) and also the multivariate BasicMotions problem.

### Loading the univariate ArrowHead problem from ARFF

In [4]:
from sktime.utils.load_data import load_from_arff_to_dataframe

X, y = load_from_arff_to_dataframe(os.path.join(DATA_PATH, "ArrowHead/ArrowHead_TRAIN.arff"))
X.head()

    index | dim_0
      0   | 0 -1.963009 1 -1.957825 2 -1.95614...
      1   | 0 -1.774571 1 -1.774036 2 -1.77658...
      2   | 0 -1.866021 1 -1.841991 2 -1.83502...
      3   | 0 -2.073758 1 -2.073301 2 -2.04460...
      4   | 0 -1.746255 1 -1.741263 2 -1.72274...


### Loading the multivariate BasicMotions problem from ARFF

In [5]:
X, y = load_from_arff_to_dataframe(os.path.join(DATA_PATH, "BasicMotions/BasicMotions_TRAIN.arff"))
X.head()

    index | dim_0                                    | dim_1                                    | dim_2                                    | dim_3                                    | dim_4                                    | dim_5
      0   | 0 0.079106 1 0.079106 2 -0.903497 3...   | 0 0.394032 1 0.394032 2 -3.666397 3...   | 0 0.551444 1 0.551444 2 -0.282844 3...   | 0 0.351565 1 0.351565 2 -0.095881 3...   | 0 0.023970 1 0.023970 2 -0.319605 3...   | 0 0.633883 1 0.633883 2 0.972131 3...
      1   | 0 0.377751 1 0.377751 2 2.952965 3...    | 0 -0.610850 1 -0.610850 2 0.970717 3...  | 0 -0.147376 1 -0.147376 2 -5.962515 3... | 0 -0.103872 1 -0.103872 2 -7.593275 3... | 0 -0.109198 1 -0.109198 2 -0.697804 3... | 0 -0.037287 1 -0.037287 2 -2.865789 3...
      2   | 0 -0.813905 1 -0.813905 2 -0.424628 3... | 0 0.825666 1 0.825666 2 -1.305033 3...   | 0 0.032712 1 0.032712 2 0.826170 3...    | 0 0.021307 1 0.021307 2 -0.372872 3...   | 0 0.122515 1 0.122515 2 -0.045277 3...   | 0 0.775041 1 0.775041 2 0.383526 3...
      3   | 0 0.289855 1 0.289855 2 -0.669185 3...   | 0 0.284130 1 0.284130 2 -0.210466 3...   | 0 0.213680 1 0.213680 2 0.252267 3...    | 0 -0.314278 1 -0.314278 2 0.018644 3...  | 0 0.074574 1 0.074574 2 0.007990 3...    | 0 -0.079901 1 -0.079901 2 0.237040 3...
      4   | 0 -0.123238 1 -0.123238 2 -0.249547 3... | 0 0.379341 1 0.379341 2 0.541501 3...    | 0 -0.286006 1 -0.286006 2 0.208420 3...  | 0 -0.098545 1 -0.098545 2 -0.023970 3... | 0 0.058594 1 0.058594 2 0.175783 3...    | 0 -0.074574 1 -0.074574 2 0.114525 3...


## Loading from UCR .tsv format files

A further option is to load data into sktime from tab-separated value (.tsv) files, as used by researchers at the University of California, Riverside (available at https://www.cs.ucr.edu/~eamonn/time_series_data_2018 ). The `load_from_ucr_tsv_to_dataframe` function in `sktime.utils.load_data` supports reading univariate problems. An example with ArrowHead is given below to demonstrate equivalence with loading from ARFF and .ts file formats.

### Loading the univariate ArrowHead problem from .tsv

In [6]:
from sktime.utils.load_data import load_from_ucr_tsv_to_dataframe

X, y = load_from_ucr_tsv_to_dataframe(os.path.join(DATA_PATH, "ArrowHead/ArrowHead_TRAIN.tsv"))
X.head()

    index | dim_0
      0   | 0 -1.963009 1 -1.957825 2 -1.95614...
      1   | 0 -1.774571 1 -1.774036 2 -1.77658...
      2   | 0 -1.866021 1 -1.841991 2 -1.83502...
      3   | 0 -2.073758 1 -2.073301 2 -2.04460...
      4   | 0 -1.746255 1 -1.741263 2 -1.72274...


## Using long-format data with sktime

It is also possible to use data from sources other than .ts and .arff files by manually shaping the data into the format described above. For convenience, the `from_long_to_nested` helper function in `sktime.utils.load_data` converts long-format data into sktime-formatted data (with assumptions made on how the data is initially formatted).

The method converts rows from a long-table schema data frame assuming each row contains information for:

`case_id, dim_id, reading_id, value`

where `case_id` is an id to identify a specific case in the data, `dim_id` is an integer between 0 and d-1 for d dimensions in the data, `reading_id` is the index of this observation for the associated `case_id` and `dim_id`, and `value` is the actual value of the observation. E.g.:

          | case_id | dim_id | reading_id | value
     ------------------------------------------------
       0  |   int   |  int   |    int     | double
       1  |   int   |  int   |    int     | double
       2  |   int   |  int   |    int     | double
       3  |   int   |  int   |    int     | double
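
The grouping implied by this schema can be sketched in plain Python: rows that share a `case_id` and `dim_id` form one series, ordered by `reading_id`. The function below is only an illustration of that idea. It is not how `from_long_to_nested` is implemented, and the name `group_long_rows` and the sample rows are made up.

```python
from collections import defaultdict


def group_long_rows(rows):
    """Group (case_id, dim_id, reading_id, value) rows into
    {case_id: {"dim_<k>": [values ordered by reading_id]}}."""
    cells = defaultdict(dict)
    for case_id, dim_id, reading_id, value in rows:
        cells[(case_id, dim_id)][reading_id] = value

    nested = defaultdict(dict)
    for (case_id, dim_id), readings in cells.items():
        # sort by reading_id so the series order does not depend on row order
        nested[case_id][f"dim_{dim_id}"] = [readings[r] for r in sorted(readings)]
    return dict(nested)


# Two cases, two dimensions, two readings each (made-up values).
rows = [
    (0, 0, 0, 0.1), (0, 0, 1, 0.2),
    (0, 1, 0, 0.3), (0, 1, 1, 0.4),
    (1, 0, 0, 0.5), (1, 0, 1, 0.6),
    (1, 1, 0, 0.7), (1, 1, 1, 0.8),
]
nested = group_long_rows(rows)
print(nested[0]["dim_1"])  # [0.3, 0.4]
```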

To demonstrate `from_long_to_nested`, the function below creates a dataset with a given number of cases, dimensions and observations:

In [7]:
import numpy as np
import pandas as pd

def generate_example_long_table(num_cases=50, series_len=20, num_dims=2):
    # Each case contributes series_len readings for each of its num_dims dimensions.
    rows_per_case = series_len * num_dims
    total_rows = num_cases * rows_per_case

    # np.int was removed in NumPy 1.24; the builtin int is the replacement.
    case_ids = np.empty(total_rows, dtype=int)
    idxs = np.empty(total_rows, dtype=int)
    dims = np.empty(total_rows, dtype=int)
    vals = np.random.rand(total_rows)

    for i in range(total_rows):
        case_ids[i] = i // rows_per_case  # case this row belongs to
        rem = i % rows_per_case  # position of the row within its case
        dims[i] = rem // series_len  # dimension within the case
        idxs[i] = rem % series_len  # reading index within the dimension

    df = pd.DataFrame()
    df['case_id'] = pd.Series(case_ids)
    df['dim_id'] = pd.Series(dims)
    df['reading_id'] = pd.Series(idxs)
    df['value'] = pd.Series(vals)
    return df
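
The index arithmetic in the loop can also be expressed without a loop. The variant below is an optional sketch, not part of sktime, that builds the same three id columns with `np.repeat` and `np.tile`. The name `generate_example_long_table_vectorised` is made up for this example.

```python
import numpy as np
import pandas as pd


def generate_example_long_table_vectorised(num_cases=50, series_len=20, num_dims=2):
    rows_per_case = series_len * num_dims
    total_rows = num_cases * rows_per_case
    return pd.DataFrame(
        {
            # every case id is repeated once per row of that case
            "case_id": np.repeat(np.arange(num_cases), rows_per_case),
            # within a case, each dimension id is repeated series_len times
            "dim_id": np.tile(np.repeat(np.arange(num_dims), series_len), num_cases),
            # reading ids cycle 0..series_len-1 for every (case, dimension) pair
            "reading_id": np.tile(np.arange(series_len), num_cases * num_dims),
            "value": np.random.rand(total_rows),
        }
    )


small = generate_example_long_table_vectorised(num_cases=2, series_len=3, num_dims=2)
print(small["dim_id"].tolist())  # [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]
```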

The following example generates a long-format table with 50 cases, each with 4 dimensions of length 20:

In [8]:
X = generate_example_long_table(num_cases=50, series_len=20, num_dims=4)
X.head()

    index | case_id | dim_id | reading_id | value
      0   |    0    |   0    |     0      | 0.40664
      1   |    0    |   0    |     1      | 0.329423
      2   |    0    |   0    |     2      | 0.489994
      3   |    0    |   0    |     3      | 0.652857
      4   |    0    |   0    |     4      | 0.599822


In [9]:
X.tail()

    index | case_id | dim_id | reading_id | value
    3995  |   49    |   3    |     15     | 0.268304
    3996  |   49    |   3    |     16     | 0.060808
    3997  |   49    |   3    |     17     | 0.440772
    3998  |   49    |   3    |     18     | 0.938703
    3999  |   49    |   3    |     19     | 0.504476


As shown below, applying the `from_long_to_nested` method returns a sktime-formatted dataset with individual dimensions represented by columns of the output dataframe:

In [10]:
from sktime.utils.load_data import from_long_to_nested
X_nested = from_long_to_nested(X)
X_nested.head()

    index | dim_0                                 | dim_1                                 | dim_2                                 | dim_3
      0   | 0 0.406640 1 0.329423 2 0.489994 3... | 0 0.022645 1 0.033930 2 0.858262 3... | 0 0.517712 1 0.694270 2 0.142443 3... | 0 0.230392 1 0.789417 2 0.844258 3...
      1   | 0 0.755324 1 0.073751 2 0.026336 3... | 0 0.310249 1 0.236033 2 0.585196 3... | 0 0.649923 1 0.621822 2 0.957423 3... | 0 0.070929 1 0.875311 2 0.690331 3...
      2   | 0 0.943030 1 0.296144 2 0.866710 3... | 0 0.546582 1 0.888466 2 0.673372 3... | 0 0.239932 1 0.809456 2 0.861843 3... | 0 0.067123 1 0.871830 2 0.791407 3...
      3   | 0 0.291802 1 0.777566 2 0.026162 3... | 0 0.745934 1 0.268174 2 0.460356 3... | 0 0.732961 1 0.626981 2 0.305343 3... | 0 0.054998 1 0.934383 2 0.681715 3...
      4   | 0 0.624083 1 0.841714 2 0.278674 3... | 0 0.479008 1 0.065146 2 0.482564 3... | 0 0.672805 1 0.268156 2 0.345608 3... | 0 0.707487 1 0.422607 2 0.110538 3...


In [11]:
X_nested.iloc[0, 0].head()

0    0.406640
1    0.329423
2    0.489994
3    0.652857
4    0.599822
dtype: float64