# MVTS Data Toolkit

## Demo

This demo gives the user a quick tour of the software's functionalities. It covers the following:
 - Downloading a dataset of 2000 multivariate time series (mvts) instances.
 - Getting some basic statistics about your data.
 - Extracting a list of statistical features from the mvts instances.
 - ...

In [1]:
import os
import yaml
from data_retriever import DataRetriever  # for downloading data
import CONSTANTS as CONST

## Download the Dataset
There is a dictionary of datasets available in `./datasets_configs.yml`. Let's download the one with id `1`.

In [2]:
dr = DataRetriever(1)
print('URL:\t\t{}'.format(dr.dataset_url))
print('NAME:\t\t{}'.format(dr.dataset_name))
print('TYPE:\t\t{}'.format(dr.get_compression_type()))
print('SIZE:\t\t{}'.format(dr.get_total_size()))

URL:		https://bitbucket.org/gsudmlab/mvtsdata_toolkit/downloads/petdataset_01.zip
NAME:		petdataset_01.zip
TYPE:		application/zip
SIZE:		32M


Ready to download? (This may take a few seconds, depending on your internet bandwidth.)

In [3]:
where_to = 'data/petdataset_01/'
dr.retrieve(target_path=where_to)

Extracting: 100%|██████████| 2001/2001 [00:00<00:00, 2862.20it/s]


OK. Let's see how many files are available to us now:

In [4]:
dr.get_total_number_of_files()

2000

## MVTS Data Analysis

- #### How many files? How large of a dataset?

Using `data_analysis.mvts_data_analysis.py`, we can get an idea of the dataset we are going to work with. We start by creating an instance of `MVTSDataAnalysis`.

In [5]:
from data_analysis.mvts_data_analysis import MVTSDataAnalysis
path_to_config = './configs/feature_extraction_configs.yml'
mvda = MVTSDataAnalysis(path_to_config)
mvda.print_stat_of_directory()

----------------------------------------
Directory:					/home/azim/CODES/PyCode/mvtsdata_toolkit/data/petdataset_01/
Total number of mvts files:	2000
Total size:					468M
Total average:				240K
----------------------------------------
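The directory statistics above (file count, total size, average size) can be approximated with the standard library alone. The following is a minimal sketch, not the toolkit's implementation: `directory_stats` is a hypothetical helper, and it assumes the mvts files are stored as `.csv` files directly inside the given directory.

```python
import tempfile
from pathlib import Path


def directory_stats(directory: str, pattern: str = "*.csv") -> dict:
    """Count the matching files and report their total and average size in bytes."""
    sizes = [p.stat().st_size for p in Path(directory).glob(pattern) if p.is_file()]
    total = sum(sizes)
    return {
        "n_files": len(sizes),
        "total_bytes": total,
        "average_bytes": total / len(sizes) if sizes else 0.0,
    }


# Demonstration on a throwaway directory holding three 8-byte files.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        (Path(tmp) / f"file_{i}.csv").write_bytes(b"a,b\n1,2\n")
    stats = directory_stats(tmp)

print(stats)
```

Pointing `directory_stats` at `data/petdataset_01/` should give numbers comparable to the report above, with sizes in bytes rather than in the rounded `M`/`K` units.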


- #### Summary Stats of the data.

Let's now get some stats from the content of the files. To keep the demo short, we analyze only 3 parameters (namely `'TOTUSJH'`, `'TOTBSQ'`, and `'TOTPOT'`), and only the first 50 mvts files.

In [6]:
params = ['TOTUSJH', 'TOTBSQ', 'TOTPOT']
n = 50
mvda.compute_summary(parameters_list=params, first_k=n)

-->	[50/50] 		 File: lab[NF]_id[3364]_st[2013-11-09T09:24:00]_et[2013-11-09T21:12:00].csv

In [7]:
mvda.print_summary()

  Parameter-Name Count Null-Count           min          25th          50th          75th           max
0        TOTUSJH  3000          0  4.341081e+00  5.959299e+01  3.498416e+02  1.124406e+03  3.746423e+03
1         TOTBSQ  3000          0  2.376532e+07  4.004768e+08  4.547558e+09  1.447428e+10  6.957558e+10
2         TOTPOT  3000          0  1.750878e+20  2.814514e+21  3.870160e+22  2.625937e+23  1.425084e+24


... which says that, across the 50 mvts files, each parameter has 3000 values in total, none of which is `NA/NaN` or missing. In addition, the `min`, the `max`, and the three quartiles (25th, 50th, and 75th percentiles) are calculated for each parameter.
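To make the meaning of these columns concrete, here is a minimal pandas sketch of such a summary. It is an illustration under stated assumptions, not the code behind `compute_summary`: `summarize_parameters` is a hypothetical function, the data is synthetic (50 frames of 60 made-up values each), and `Count` is assumed to mean the number of non-null values.

```python
import numpy as np
import pandas as pd


def summarize_parameters(frames, parameters):
    """Pool the values of each parameter across all frames and summarize them."""
    pooled = pd.concat(frames, ignore_index=True)
    rows = []
    for name in parameters:
        col = pooled[name]
        rows.append({
            "Parameter-Name": name,
            "Count": int(col.count()),            # non-null values (assumption)
            "Null-Count": int(col.isna().sum()),
            "min": col.min(),
            "25th": col.quantile(0.25),
            "50th": col.quantile(0.50),
            "75th": col.quantile(0.75),
            "max": col.max(),
        })
    return pd.DataFrame(rows)


# Synthetic stand-in data: 50 "files" with 60 time steps each (made-up values).
rng = np.random.default_rng(0)
frames = [pd.DataFrame({"TOTUSJH": rng.random(60), "TOTBSQ": rng.random(60)})
          for _ in range(50)]
summary = summarize_parameters(frames, ["TOTUSJH", "TOTBSQ"])
print(summary)
```

With 50 frames of 60 rows, each parameter ends up with a count of 3000, which mirrors the shape of the table above.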

## Feature Extraction

- #### What statistical features are available?

Now that we have an idea about our raw data, let's extract some features from it. A list of ~50 features is implemented in `features.feature_collection.py`. Let's take a look.

In [8]:
import features.feature_collection as fc
help(fc)

Help on module features.feature_collection in features:

NAME
    features.feature_collection

FUNCTIONS
    get_average_absolute_change(uni_ts:pandas.core.series.Series) -> float
        :return: the average absolute first difference of a univariate time series.
    
    get_average_absolute_derivative_change(uni_ts:pandas.core.series.Series) -> float
        :return: the average absolute first difference of a derivative of univariate time series.
    
    get_avg_mono_decrease_slope(uni_ts:pandas.core.series.Series) -> float
        :return: the average slope of monotonically decreasing segments.
    
    get_avg_mono_increase_slope(uni_ts:pandas.core.series.Series) -> float
        :return: the average slope of monotonically increasing segments.
    
    get_dderivative_kurtosis(uni_ts:pandas.core.series.Series, step_size:int=1) -> float
        :return: the kurtosis of the difference derivative of univariate time series within the
                 function we use step_size to find 
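As an illustration of what one of these features computes, the sketch below re-implements the "average absolute first difference" described for `get_average_absolute_change`. It follows the docstring only; the function name `average_absolute_change` is hypothetical, and the toolkit's own implementation may differ in details such as how missing values are handled.

```python
import pandas as pd


def average_absolute_change(uni_ts: pd.Series) -> float:
    """Mean of |x[t+1] - x[t]| over a univariate time series."""
    # diff() leaves a NaN in the first position; mean() skips it by default.
    return float(uni_ts.diff().abs().mean())


ts = pd.Series([1.0, 3.0, 2.0, 6.0])
# First differences are 2, -1, 4, so the result is (2 + 1 + 4) / 3.
result = average_absolute_change(ts)
print(result)
```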

- #### How to extract these features from the data?

Time to extract a set of these features from the dataset we downloaded. Let's extract 3 statistical features (`min`, `max`, and `median`) from 3 parameters (`TOTUSJH`, `TOTBSQ`, and `TOTPOT`). To speed things up, we process only the first 50 mvts files.

In [9]:
from features.feature_extractor import FeatureExtractor

fe = FeatureExtractor(path_to_config)
fe.do_extraction(parameters_list=['TOTUSJH', 'TOTBSQ', 'TOTPOT'],
                 statistical_features_list=['get_min', 'get_max', 'get_median'],
                 first_k=50)
fe.df_all_features

/home/azim/CODES/PyCode/mvtsdata_toolkit/data/petdataset_01/


	-----------------------------------
		Total No. of time series:	50
		Total No. of Parameters:		3
		Total No. of Features:		3
		Total No. of Metadata Pieces:		4
		Output dimensionality (N:50 X (F:3 X P:3 + T:4)):	650
	-----------------------------------

	 >>> Total Processed: 50 / 50 <<<
	50 files have been processed.
	As a result, a dataframe of dimension 50 X 13 is created.


|   | id | lab | st | et | TOTUSJH_min | TOTUSJH_max | TOTUSJH_median | TOTBSQ_min | TOTBSQ_max | TOTBSQ_median | TOTPOT_min | TOTPOT_max | TOTPOT_median |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3894 | M | 2014-03-29T22:00:00 | 2014-03-30T09:48:00 | 967.434905 | 1183.833149 | 1124.41025 | 10467860000.0 | 11886630000.0 | 11533210000.0 | 1.228731e+23 | 1.716971e+23 | 1.49233e+23 |
| 0 | 3364 | NF | 2013-11-10T16:24:00 | 2013-11-11T04:12:00 | 785.515878 | 898.4197 | 841.636672 | 13844620000.0 | 15057130000.0 | 14495480000.0 | 2.55856e+23 | 2.903506e+23 | 2.712413e+23 |
| 0 | 3401 | NF | 2013-11-19T15:00:00 | 2013-11-20T02:48:00 | 43.795702 | 79.40546 | 61.347579 | 300694500.0 | 424908100.0 | 388349100.0 | 1.854126e+21 | 3.078696e+21 | 2.787184e+21 |
| 0 | 3595 | NF | 2014-01-12T09:24:00 | 2014-01-12T21:12:00 | 48.64734 | 65.213699 | 54.220981 | 327602000.0 | 444708900.0 | 353981000.0 | 2.278193e+21 | 3.308322e+21 | 2.707839e+21 |
| 0 | 3452 | NF | 2013-12-06T18:24:00 | 2013-12-07T06:12:00 | 54.910546 | 89.248031 | 64.519227 | 339513500.0 | 630044100.0 | 460598800.0 | 2.030967e+21 | 4.775595e+21 | 2.968826e+21 |
| 0 | 3401 | NF | 2013-11-23T00:00:00 | 2013-11-23T11:48:00 | 16.406938 | 25.926148 | 20.134504 | 109869200.0 | 176691700.0 | 143057900.0 | 6.961236e+20 | 1.36865e+21 | 9.542322e+20 |
| 0 | 3313 | NF | 2013-10-29T06:48:00 | 2013-10-29T18:36:00 | 25.213735 | 45.237145 | 37.858018 | 164015200.0 | 304008600.0 | 236411400.0 | 1.07953e+21 | 2.1785e+21 | 1.615138e+21 |
| 0 | 3394 | NF | 2013-11-16T12:48:00 | 2013-11-17T00:36:00 | 37.198119 | 105.665566 | 74.004088 | 442247400.0 | 1066918000.0 | 793794900.0 | 2.419303e+21 | 6.886243e+21 | 4.586577e+21 |
| 0 | 3779 | C | 2014-02-26T06:24:00 | 2014-02-26T18:12:00 | 761.638915 | 987.804217 | 901.833056 | 5706467000.0 | 7378234000.0 | 6651778000.0 | 5.153352e+22 | 7.974288e+22 | 7.048366e+22 |
| 0 | 3364 | C | 2013-11-17T20:24:00 | 2013-11-18T08:12:00 | 3410.057526 | 3746.42311 | 3583.958457 | 48323980000.0 | 52589520000.0 | 51174750000.0 | 7.56675e+23 | 8.037784e+23 | 7.870867e+23 |


In [10]:
extracted_features = fe.df_all_features
extracted_features.shape

(50, 13)
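The 13 columns are the 4 metadata pieces plus one column per (parameter, feature) pair, i.e. 3 X 3 = 9. The sketch below shows one way such a row could be assembled for a single mvts file. It is a guess at the general idea, not the `FeatureExtractor` code: `parse_metadata`, `extract_row`, and the `FEATURES` mapping are hypothetical, and the file name format (`tag[value]` pairs) is inferred from the file name printed during the summary step. The mvts values are made up.

```python
import re

import pandas as pd

# Matches tag[value] pairs such as lab[NF] or id[3364] in a file name.
TAG_PATTERN = re.compile(r"([A-Za-z]+)\[([^\]]*)\]")

# Feature name -> function applied to one univariate time series.
FEATURES = {"min": pd.Series.min, "max": pd.Series.max, "median": pd.Series.median}


def parse_metadata(file_name: str) -> dict:
    """Return the tag[value] pairs embedded in the file name as a dict."""
    return dict(TAG_PATTERN.findall(file_name))


def extract_row(file_name: str, mvts: pd.DataFrame, parameters) -> dict:
    """Build one output row: metadata first, then PARAMETER_feature columns."""
    row = parse_metadata(file_name)
    for param in parameters:
        for feat_name, feat_fn in FEATURES.items():
            row[f"{param}_{feat_name}"] = feat_fn(mvts[param])
    return row


name = "lab[NF]_id[3364]_st[2013-11-09T09:24:00]_et[2013-11-09T21:12:00].csv"
mvts = pd.DataFrame({                       # made-up values for illustration
    "TOTUSJH": [1.0, 2.0, 4.0],
    "TOTBSQ": [10.0, 30.0, 20.0],
    "TOTPOT": [5.0, 5.0, 8.0],
})
row = extract_row(name, mvts, ["TOTUSJH", "TOTBSQ", "TOTPOT"])
features_df = pd.DataFrame([row])
print(features_df.shape)
```

Repeating `extract_row` over N files and stacking the rows gives an N X 13 dataframe for this choice of parameters and features.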

## Extracted Features Analysis

- #### A quick look over the results?

Let's get a glimpse of what the extracted features look like.

In [13]:
from data_analysis.extracted_features_analysis import ExtractedFeaturesAnalysis

efa = ExtractedFeaturesAnalysis(extracted_features, exclude=['id'])
efa.compute_summary()
efa.summary

|   | Feature Name | count | Null Count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TOTUSJH_min | 50.0 | 0 | 734.1838 | 966.7145 | 4.341081 | 48.79846 | 314.4178 | 927.632 | 3410.058 |
| 1 | TOTUSJH_max | 50.0 | 0 | 874.6654 | 1112.331 | 12.26119 | 79.40546 | 398.1758 | 1137.049 | 3746.423 |
| 2 | TOTUSJH_median | 50.0 | 0 | 803.3427 | 1041.983 | 7.426589 | 60.63404 | 352.2123 | 1068.766 | 3583.958 |
| 3 | TOTBSQ_min | 50.0 | 0 | 10766560000.0 | 15741540000.0 | 23765320.0 | 338837600.0 | 4200314000.0 | 13405200000.0 | 57660120000.0 |
| 4 | TOTBSQ_max | 50.0 | 0 | 12100670000.0 | 17449100000.0 | 65487040.0 | 460381700.0 | 4922122000.0 | 14702040000.0 | 69575580000.0 |
| 5 | TOTBSQ_median | 50.0 | 0 | 11585190000.0 | 16972730000.0 | 35577510.0 | 406411500.0 | 4513584000.0 | 14098180000.0 | 67512910000.0 |
| 6 | TOTPOT_min | 50.0 | 0 | 1.778611e+23 | 2.921604e+23 | 1.750878e+20 | 2.100021e+21 | 3.426977e+22 | 2.431402e+23 | 1.369838e+24 |
| 7 | TOTPOT_max | 50.0 | 0 | 1.971327e+23 | 3.104716e+23 | 6.645951e+20 | 3.355905e+21 | 4.175589e+22 | 2.739604e+23 | 1.425084e+24 |
| 8 | TOTPOT_median | 50.0 | 0 | 1.872583e+23 | 3.006618e+23 | 3.158317e+20 | 2.79775e+21 | 3.834246e+22 | 2.577412e+23 | 1.398754e+24 |


... which gives summary statistics for every extracted feature. For instance, row `0` describes the distribution of the minimum values of the parameter `TOTUSJH` across the 50 mvts files in terms of `mean`, `std`, etc. It also shows that no `NA/NaN` or missing value was generated in the process.
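A table of this shape can be produced with plain pandas. The following is a minimal sketch, assuming the summary is essentially `DataFrame.describe()` plus a null count. `summarize_features` is a hypothetical function, not the `ExtractedFeaturesAnalysis` implementation, and the small dataframe is made up.

```python
import pandas as pd


def summarize_features(features: pd.DataFrame, exclude=()) -> pd.DataFrame:
    """Describe every numeric feature column and add a null count per feature."""
    numeric = features.drop(columns=list(exclude)).select_dtypes("number")
    summary = numeric.describe().T              # one row per feature
    summary.insert(1, "Null Count", numeric.isna().sum())
    summary.index.name = "Feature Name"
    return summary.reset_index()


# Made-up feature table with one missing value.
features = pd.DataFrame({
    "id": [1, 2, 3],
    "lab": ["NF", "C", "M"],
    "TOTUSJH_min": [1.0, 2.0, 3.0],
    "TOTUSJH_max": [4.0, None, 6.0],
})
summary = summarize_features(features, exclude=["id"])
print(summary)
```

Here the non-numeric `lab` column is dropped by `select_dtypes`, while `id` has to be excluded explicitly because it is numeric but not a feature.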