# MVTS Data Toolkit

## Demo

This demo gives a quick tour of the software's functionalities. It covers the following:
 - Downloading a dataset of 2000 multivariate time series (mvts) instances.
 - Getting some basic statistics about your data.
 - Extracting a list of statistical features from the mvts instances.
 - ...

In [None]:
import os
import yaml
from data_retriever import DataRetriever  # for downloading data
import CONSTANTS as CONST

## Download the Dataset
This demo uses an example dataset, which the following cells download automatically. In case something goes wrong, here is the direct link:
https://bitbucket.org/gsudmlab/mvtsdata_toolkit/downloads/petdataset_01.zip
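
Should the automatic download fail, the archive behind the direct link can also be fetched by hand. The following is a minimal sketch that uses only the Python standard library; `download_archive` and `extract_archive` are hypothetical helper names and are not part of the toolkit.

```python
import urllib.request
import zipfile
from pathlib import Path

# Direct link quoted in the text above.
DATASET_URL = "https://bitbucket.org/gsudmlab/mvtsdata_toolkit/downloads/petdataset_01.zip"


def download_archive(url: str, target_dir: str) -> Path:
    """Download `url` into `target_dir` and return the path of the saved file."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    archive_path = target / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, str(archive_path))
    return archive_path


def extract_archive(archive_path: Path, target_dir: str) -> list:
    """Unpack a zip archive into `target_dir` and return the member names."""
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(target_dir)
        return zf.namelist()


# Usage (needs network access, so it is left commented out):
# archive = download_archive(DATASET_URL, "temp/")
# print(len(extract_archive(archive, "temp/")), "entries extracted")
```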

Before we download it, let's take a quick look:

In [None]:
dr = DataRetriever(1)
print('URL:\t\t{}'.format(dr.dataset_url))
print('NAME:\t\t{}'.format(dr.dataset_name))
print('TYPE:\t\t{}'.format(dr.get_compression_type()))
print('SIZE:\t\t{}'.format(dr.get_total_size()))

Ready to download? This may take a few seconds, depending on your internet bandwidth. Wait for the progress bar.

In [2]:
where_to = 'temp/'
dr.retrieve(target_path=where_to)


OK. Let's see how many files are available to us now.

In [1]:
dr.get_total_number_of_files()


## Setup Configurations
For tasks such as feature extraction (by `feature_extractor`) and data analysis (by `mvts_data_analysis` and `extracted_features_analysis`), a configuration file must be provided. We provide one inside this package, but you can create your own and place it anywhere you wish. Let's take a look at ours, which is located at `./configs/feature_extraction_configs.yml`:

In [3]:
path_to_config = './configs/feature_extraction_configs.yml'
with open(path_to_config, 'r') as f:
    print(f.read())

#PATH_TO_MVTS: 'data/petdataset_01/'
PATH_TO_MVTS: 'temp/petdataset_01/'
#PATH_TO_EXTRACTED_FEATURES: 'data/extracted_features/'
PATH_TO_EXTRACTED_FEATURES: 'temp/extracted_features/'
META_DATA_TAGS: ['id', 'lab', 'st', 'et']
MVTS_PARAMETERS:
  - 'TOTUSJH'
  - 'TOTBSQ'
  - 'TOTPOT'
  - 'TOTUSJZ'
  - 'ABSNJZH'
  - 'SAVNCPP'
  - 'USFLUX'
  - 'TOTFZ'
  - 'MEANPOT'
  - 'EPSZ'
  - 'MEANSHR'
  - 'SHRGT45'
  - 'MEANGAM'
  - 'MEANGBT'
  - 'MEANGBZ'
  - 'MEANGBH'
  - 'MEANJZH'
  - 'TOTFY'
  - 'MEANJZD'
  - 'MEANALP'
  - 'TOTFX'
  - 'EPSY'
  - 'EPSX'
  - 'R_VALUE'
STATISTICAL_FEATURES:
  - 'get_min'
  - 'get_max'
  - 'get_median'
  - 'get_mean'
  - 'get_stddev'
  - 'get_var'
  - 'get_skewness'
  - 'get_kurtosis'
  - 'get_no_local_maxima'
  - 'get_no_local_minima'
  - 'get_no_local_extrema'
  - 'get_no_zero_crossings'
  - 'get_mean_local_maxima_value'
  - 'get_mean_local_minima_value'
  - 'get_no_mean_local_maxima_upsurges'
  - 'get_no_mean_local_minima_downslides'
  - 'get_difference_of_mins'
  

Here is the break-down of the pieces:
 - `PATH_TO_MVTS`: relative path to where the mvts data is stored.
 - `PATH_TO_EXTRACTED_FEATURES`: relative path to where the extracted features are, or will be, stored.
 - `META_DATA_TAGS`: a list of strings present in mvts file-names with specific meanings. See the README.md file for more details.
 - `MVTS_PARAMETERS`: an enumerated list of the parameter names (column names) in the mvts. Comment out those that are not needed using the `#` symbol.
 - `STATISTICAL_FEATURES`: an enumerated list of the methods available through this package. This example has all features. Comment out those that are not needed using the `#` symbol.

In the following cells, you will see how this can be used.
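
When writing your own configuration file, it can help to check the loaded result before handing it to the toolkit. The sketch below assumes the file has already been parsed into a plain dictionary (for example with `yaml.safe_load`); `check_config` and its rules are hypothetical and only mirror the five keys described above, not any validation the toolkit itself performs.

```python
# Keys named in the break-down above.
REQUIRED_KEYS = (
    "PATH_TO_MVTS",
    "PATH_TO_EXTRACTED_FEATURES",
    "META_DATA_TAGS",
    "MVTS_PARAMETERS",
    "STATISTICAL_FEATURES",
)
# The last three keys are expected to hold lists.
LIST_KEYS = REQUIRED_KEYS[2:]


def check_config(cfg: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in cfg:
            problems.append("missing key: " + key)
    for key in LIST_KEYS:
        value = cfg.get(key)
        if key in cfg and (not isinstance(value, list) or not value):
            problems.append(key + " must be a non-empty list")
    return problems


# A trimmed-down configuration, as a YAML loader would return it when
# most list entries are commented out with `#`.
example_cfg = {
    "PATH_TO_MVTS": "temp/petdataset_01/",
    "PATH_TO_EXTRACTED_FEATURES": "temp/extracted_features/",
    "META_DATA_TAGS": ["id", "lab", "st", "et"],
    "MVTS_PARAMETERS": ["TOTUSJH", "TOTBSQ", "TOTPOT"],
    "STATISTICAL_FEATURES": ["get_min", "get_max", "get_median"],
}
print(check_config(example_cfg))  # prints []
```

A YAML list whose entries are all commented out loads as `None`, which this check reports as a problem.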

## Analysis of Raw Data (MVTS Data Analysis)

- #### How many files? How large of a dataset?

Using `mvts_data_analysis`, we can get an idea of the dataset we are going to work on. We start by creating an instance of `MVTSDataAnalysis`.

In [4]:
from data_analysis.mvts_data_analysis import MVTSDataAnalysis
path_to_config = './configs/feature_extraction_configs.yml'
mvda = MVTSDataAnalysis(path_to_config)
mvda.print_stat_of_directory()

----------------------------------------
Directory:			/home/azim/CODES/PyWorkspace/mvtsdata_toolkit/temp/petdataset_01/
Total no. of files:	2000
Total size:			445M
Total average:		228K
----------------------------------------
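
For readers who want the same figures without the toolkit, here is a rough standard-library sketch of the idea. The name `directory_stats` is hypothetical, and the function reports sizes in bytes rather than the human-readable units printed above; it is not the implementation behind `print_stat_of_directory`.

```python
import os


def directory_stats(path: str) -> dict:
    """Count the regular files directly inside `path` and sum their sizes in bytes."""
    with os.scandir(path) as entries:
        sizes = [entry.stat().st_size for entry in entries if entry.is_file()]
    total = sum(sizes)
    return {
        "n_files": len(sizes),
        "total_bytes": total,
        # Guard against an empty directory to avoid dividing by zero.
        "average_bytes": total / len(sizes) if sizes else 0.0,
    }


# Usage, assuming the dataset was extracted to the path used in this demo:
# print(directory_stats("temp/petdataset_01/"))
```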


- #### Get summary statistics of the data.

Let's now get some statistics from the content of the files. To speed up the demo, we analyze only 3 parameters (namely `TOTUSJH`, `TOTBSQ`, and `TOTPOT`), and only the first 50 mvts files.

In [5]:
params = ['TOTUSJH', 'TOTBSQ', 'TOTPOT']
n = 50
mvda.compute_summary(params_name=params, first_k=n)
mvda.summary

Unnamed: 0,Parameter-Name,Val-Count,Null-Count,mean,min,25th,50th,75th,max
0,TOTUSJH,3000,0,437.5038,3.494185,33.93432,80.07922,365.1932,3162.777
1,TOTBSQ,3000,0,5476429000.0,19832680.0,255063800.0,686148200.0,7434932000.0,38482840000.0
2,TOTPOT,3000,0,8.365032e+22,1.205181e+20,1.853597e+21,4.812895e+21,3.875232e+22,7.108347e+23


... which says that, across the 50 mvts files, each parameter has 3000 values in total, with no `NA/NAN` or missing values. In addition, `mean`, `min`, `max`, and the three quartiles (the 25th, 50th, and 75th percentiles) are calculated for each parameter.
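
To make the columns of this table concrete, the sketch below computes the same kind of summary for one parameter in plain Python. It assumes that `Val-Count` counts the non-missing values and that the percentiles use linear interpolation; both are assumptions, and `summarize` is a hypothetical name, not the toolkit's code.

```python
import math
import statistics


def summarize(values: list) -> dict:
    """Summarize one parameter's values pooled from several mvts files."""
    # Treat both None and NaN as missing.
    present = [v for v in values if v is not None and not math.isnan(v)]
    # method="inclusive" interpolates linearly between the sample's min and max.
    q25, q50, q75 = statistics.quantiles(present, n=4, method="inclusive")
    return {
        "Val-Count": len(present),
        "Null-Count": len(values) - len(present),
        "mean": statistics.fmean(present),
        "min": min(present),
        "25th": q25,
        "50th": q50,
        "75th": q75,
        "max": max(present),
    }


# Made-up values with two missing entries.
print(summarize([1.0, 2.0, None, 3.0, 4.0, 5.0, float("nan")]))
```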

- #### You have a LARGE dataset?

A parallel version of this function is also provided to help process much larger datasets efficiently. Below, we use 4 processes to do the same thing.

In [6]:
mvda.compute_summary_in_parallel(n_jobs=4, first_k=50, verbose=False,
                                 params_name=['TOTUSJH', 'TOTBSQ', 'TOTPOT'])
mvda.summary

Unnamed: 0,Parameter-Name,Val-Count,Null-Count,mean,min,25th,50th,75th,max
0,TOTUSJH,3000,0,431.5482,3.494185,33.96461,79.94485,365.8299,3162.777
1,TOTBSQ,3000,0,5409258000.0,19832680.0,254128700.0,686561200.0,7436396000.0,38482840000.0
2,TOTPOT,3000,0,8.257123e+22,1.205181e+20,1.850762e+21,4.783393e+21,3.845567e+22,7.108347e+23


**Note**: The results of the parallel and sequential versions of `mvts_data_analysis` are not exactly identical. This discrepancy arises because the parallel version is designed to avoid loading the entire dataset into memory, so that it is not confined to any particular data size. It therefore relies on statistical estimators to approximate the percentiles within an acceptable error. The error decreases significantly as the number of mvts files increases.
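
A small illustration of why percentiles are the hard part: `min` and `max` can be merged exactly from per-chunk results (they agree in the two tables above), whereas a median cannot. The weighted average of per-chunk medians used below is only a stand-in chosen for illustration; the estimator the toolkit actually uses is not shown in this demo, and the numbers are made up.

```python
import statistics

# Two chunks, as two workers might see them (made-up numbers).
chunk_a = [1.0, 2.0, 9.0]
chunk_b = [3.0, 4.0, 5.0, 6.0, 7.0]

# Minimum and maximum merge exactly from per-chunk results.
merged_min = min(min(chunk_a), min(chunk_b))
merged_max = max(max(chunk_a), max(chunk_b))

# A median does not: combining per-chunk medians only approximates it.
weights = [len(chunk_a), len(chunk_b)]
medians = [statistics.median(chunk_a), statistics.median(chunk_b)]
approx_median = sum(w * m for w, m in zip(weights, medians)) / sum(weights)
exact_median = statistics.median(chunk_a + chunk_b)

print(merged_min, merged_max)       # 1.0 9.0
print(approx_median, exact_median)  # 3.875 4.5
```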

## Feature Extraction

- #### What statistical features are available?

Now that we have an idea about our raw data, let's extract some features from it. About 50 statistical features are implemented in `feature_collection`. Let's take a look at them.

In [7]:
import features.feature_collection as fc
help(fc)

Help on module features.feature_collection in features:

NAME
    features.feature_collection

FUNCTIONS
    get_average_absolute_change(uni_ts:Union[pandas.core.series.Series, numpy.ndarray]) -> numpy.float64
        :return: the average absolute first difference of a univariate time series.
    
    get_average_absolute_derivative_change(uni_ts:Union[pandas.core.series.Series, numpy.ndarray]) -> numpy.float64
        :return: the average absolute first difference of a derivative of univariate time series.
    
    get_avg_mono_decrease_slope(uni_ts:Union[pandas.core.series.Series, numpy.ndarray]) -> numpy.float64
        :return: the average slope of monotonically decreasing segments.
    
    get_avg_mono_increase_slope(uni_ts:Union[pandas.core.series.Series, numpy.ndarray]) -> numpy.float64
        :return: the average slope of monotonically increasing segments.
    
    get_dderivative_kurtosis(uni_ts:Union[pandas.core.series.Series, numpy.ndarray], step_size:int=1) -> numpy.float
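
To make one of these docstrings concrete, here is a plain-Python sketch of "the average absolute first difference of a univariate time series". The function name `average_absolute_change` is hypothetical and the code is not taken from `feature_collection`; it only illustrates the definition.

```python
def average_absolute_change(uni_ts) -> float:
    """Mean of |x[i+1] - x[i]| over a univariate time series."""
    values = list(uni_ts)
    if len(values) < 2:
        raise ValueError("need at least two observations")
    # Absolute differences between consecutive observations.
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return sum(diffs) / len(diffs)


print(average_absolute_change([1.0, 3.0, 2.0, 5.0]))  # 2.0
```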

- #### How to extract these features from the data?

Time to extract a set of these features from the dataset we downloaded. Let's extract 3 simple statistical features, namely `min`, `max`, and `median`, from 3 parameters, namely `TOTUSJH`, `TOTBSQ`, and `TOTPOT`. Again, to speed up the process in this demo, we only process the first 50 mvts files.

In [8]:
from features.feature_extractor import FeatureExtractor

fe = FeatureExtractor(path_to_config)
fe.do_extraction(features_name=['get_min', 'get_max', 'get_median'],
                 params_name=['TOTUSJH', 'TOTBSQ', 'TOTPOT'], first_k=50)
fe.df_all_features

Unnamed: 0,id,lab,st,et,TOTUSJH_min,TOTUSJH_max,TOTUSJH_median,TOTBSQ_min,TOTBSQ_max,TOTBSQ_median,TOTPOT_min,TOTPOT_max,TOTPOT_median
0,3497,NF,2013-12-17T10:24:00,2013-12-17T22:12:00,1360.686972,1796.834808,1560.47818,13415850000.0,19548390000.0,16689330000.0,2.224181e+23,3.640731e+23,3.096907e+23
1,3591,NF,2014-01-09T20:00:00,2014-01-10T07:48:00,67.583341,122.519078,80.033856,394546200.0,639198800.0,510650400.0,2.383901e+21,4.309764e+21,3.241898e+21
2,3497,C,2013-12-19T16:24:00,2013-12-20T04:12:00,2387.389401,2796.874544,2688.736459,31431760000.0,35632840000.0,35143120000.0,5.537445e+23,6.266628e+23,5.895503e+23
3,3576,NF,2014-01-04T18:12:00,2014-01-05T06:00:00,24.722524,35.882308,29.838575,194622700.0,257020700.0,236292600.0,1.378299e+21,2.018457e+21,1.788971e+21
4,3457,NF,2013-12-04T12:36:00,2013-12-05T00:24:00,772.739695,884.33042,853.054633,14159260000.0,15250840000.0,14600090000.0,1.82452e+23,2.048443e+23,1.933204e+23
5,3265,NF,2013-10-08T12:36:00,2013-10-09T00:24:00,9.809842,37.376417,17.536759,69985400.0,240796000.0,121395100.0,4.128342e+20,1.829502e+21,9.217824e+20
6,3334,NF,2013-11-01T13:00:00,2013-11-02T00:48:00,53.759144,74.536177,63.63074,719447000.0,928712400.0,784116500.0,2.794357e+21,4.111046e+21,3.50106e+21
7,3420,NF,2013-11-24T16:24:00,2013-11-25T04:12:00,546.477132,621.885367,602.787326,6944610000.0,8502122000.0,7709941000.0,6.834644e+22,9.394178e+22,8.067205e+22
8,3335,NF,2013-11-02T01:36:00,2013-11-02T13:24:00,6.218266,15.386502,10.868499,44913300.0,94991890.0,73944070.0,2.976223e+20,7.357936e+20,5.09094e+20
9,3362,NF,2013-11-08T12:48:00,2013-11-09T00:36:00,46.615346,83.850691,73.101949,285394300.0,716471900.0,670851800.0,2.99505e+21,6.036306e+21,4.764406e+21


... where each row corresponds to one mvts file, and the first 4 columns represent the extracted information from the file names.
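
The column names in this table follow a `<parameter>_<feature>` pattern. The sketch below shows how one such row could be assembled for a single mvts instance; the `FEATURES` mapping and `extract_row` are hypothetical stand-ins, and the data is made up.

```python
import statistics

# Stand-ins for the toolkit's feature functions (hypothetical mapping).
FEATURES = {"min": min, "max": max, "median": statistics.median}


def extract_row(mvts: dict, params: list) -> dict:
    """Build one output row: one `<parameter>_<feature>` entry per pair."""
    row = {}
    for param in params:
        for feature_name, func in FEATURES.items():
            row[param + "_" + feature_name] = func(mvts[param])
    return row


# One made-up mvts instance with two parameters.
mvts = {"TOTUSJH": [3.0, 1.0, 2.0], "TOTBSQ": [10.0, 30.0, 20.0]}
print(extract_row(mvts, ["TOTUSJH", "TOTBSQ"]))
```

Iterating over parameters first and features second reproduces the column order seen above (`TOTUSJH_min`, `TOTUSJH_max`, `TOTUSJH_median`, then the next parameter).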

- #### You have a LARGE dataset?

No worries. Using the parallel implementation of feature extraction, the process can be significantly sped up.

In [9]:
fe.do_extraction_in_parallel(n_jobs=4,
                             features_name=['get_min', 'get_max', 'get_median'],
                             params_name=['TOTUSJH', 'TOTBSQ', 'TOTPOT'], first_k=50)
fe.df_all_features

Unnamed: 0,id,lab,st,et,TOTUSJH_min,TOTUSJH_max,TOTUSJH_median,TOTBSQ_min,TOTBSQ_max,TOTBSQ_median,TOTPOT_min,TOTPOT_max,TOTPOT_median
0,3441,NF,2013-11-29T20:48:00,2013-11-30T08:36:00,62.106379,138.264236,76.474256,552503000.0,1171653000.0,675424700.0,4.961458e+21,1.265386e+22,5.925591e+21
1,3423,NF,2013-11-24T03:36:00,2013-11-24T15:24:00,12.442314,26.558753,18.469005,99636180.0,205510200.0,128365000.0,6.267416e+20,1.48035e+21,8.779774e+20
2,3441,NF,2013-11-27T19:48:00,2013-11-28T07:36:00,74.039773,134.406747,91.712821,438692100.0,1000045000.0,713694300.0,4.586793e+21,9.34305e+21,7.125342e+21
3,3540,NF,2013-12-28T04:24:00,2013-12-28T16:12:00,31.691903,51.25754,44.40663,230067800.0,335713400.0,287260600.0,1.688539e+21,2.370427e+21,2.034959e+21
4,3601,NF,2014-01-13T23:24:00,2014-01-14T11:12:00,599.674564,698.39554,660.366529,11704560000.0,13021880000.0,12477830000.0,2.24366e+23,2.488918e+23,2.385261e+23
5,3367,NF,2013-11-12T16:24:00,2013-11-13T04:12:00,136.075445,222.925977,158.815598,1195805000.0,1729421000.0,1433371000.0,8.713321e+21,1.531291e+22,1.052656e+22
6,3367,NF,2013-11-10T07:24:00,2013-11-10T19:12:00,70.232592,94.93336,84.694634,714065400.0,943984500.0,859429900.0,5.626249e+21,7.600602e+21,6.678711e+21
7,3515,C,2013-12-17T23:24:00,2013-12-18T11:12:00,61.522212,418.694789,78.091793,551899000.0,3458566000.0,630650200.0,3.523414e+21,3.9821e+22,4.137569e+21
8,3557,NF,2014-01-03T00:36:00,2014-01-03T12:24:00,7.942284,15.189924,11.344946,52209220.0,98139100.0,74257390.0,3.702885e+20,6.762099e+20,5.348884e+20
9,3595,NF,2014-01-14T13:24:00,2014-01-15T01:12:00,49.277672,63.224432,54.891372,439569000.0,586273600.0,505935600.0,2.919412e+21,3.610718e+21,3.310528e+21
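
One common way to spread such work over `n_jobs` processes is to partition the list of files into nearly equal chunks and hand each chunk to a worker. Whether the toolkit partitions the files this way is an assumption; `split_into_chunks` and the file names below are hypothetical.

```python
def split_into_chunks(items: list, n_jobs: int) -> list:
    """Split `items` into `n_jobs` contiguous chunks whose sizes differ by at most one."""
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    base, extra = divmod(len(items), n_jobs)
    chunks, start = [], 0
    for i in range(n_jobs):
        # The first `extra` chunks take one additional item each.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


files = ["file_%02d.csv" % i for i in range(50)]  # hypothetical file names
print([len(c) for c in split_into_chunks(files, 4)])  # [13, 13, 12, 12]
```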


## Extracted Features Analysis

- #### A quick look over the results?

The extracted features can be easily summarized using descriptive statistics such as `mean`, `std`, `min`, `max`, and the first, second, and third quartiles. In addition, any missing values can be spotted.

In [10]:
from data_analysis.extracted_features_analysis import ExtractedFeaturesAnalysis

efa = ExtractedFeaturesAnalysis(fe.df_all_features, exclude=['id'])
efa.compute_summary()
efa.summary

Unnamed: 0,Feature-Name,Val-Count,Null-Count,mean,std,min,25th,50th,75th,max
0,TOTUSJH_min,50.0,0,391.2862,727.5349,3.494185,24.79555,64.84486,317.8858,2848.177
1,TOTUSJH_max,50.0,0,481.8235,836.4453,10.0513,45.66104,115.1369,405.5212,3162.777
2,TOTUSJH_median,50.0,0,435.8394,789.7342,7.689264,34.85637,79.06282,340.6959,3054.06
3,TOTBSQ_min,50.0,0,5016375000.0,9280343000.0,19832680.0,197775600.0,528467600.0,5826083000.0,36221100000.0
4,TOTBSQ_max,50.0,0,5892475000.0,10383410000.0,65693430.0,341735400.0,885456800.0,6921814000.0,38482840000.0
5,TOTBSQ_median,50.0,0,5477166000.0,9979458000.0,46633240.0,272106100.0,673138200.0,6322094000.0,37126460000.0
6,TOTPOT_min,50.0,0,7.611869e+22,1.608855e+23,1.205181e+20,1.372809e+21,3.377568e+21,2.771062e+22,6.798905e+23
7,TOTPOT_max,50.0,0,9.07041e+22,1.80654e+23,4.556192e+20,2.521684e+21,6.689676e+21,4.333502e+22,7.108347e+23
8,TOTPOT_median,50.0,0,8.366257e+22,1.716708e+23,3.034555e+20,1.967947e+21,4.545931e+21,3.299998e+22,6.972036e+23


... which gives summary statistics for every extracted feature. For instance, row `0` describes the distribution of the minimum values of the parameter `TOTUSJH` across the 50 mvts files in terms of `mean`, `std`, etc. This also indicates that no `NA/NAN` or missing value was generated in the process.

## Data Normalization

The extracted features can then be normalized using four different methods. 

In [12]:
from normalizing import normalizer

df_norm = normalizer.zero_one_normalize(df=fe.df_all_features, excluded_colnames=['id'])
df_norm

Unnamed: 0,id,lab,st,et,TOTUSJH_min,TOTUSJH_max,TOTUSJH_median,TOTBSQ_min,TOTBSQ_max,TOTBSQ_median,TOTPOT_min,TOTPOT_max,TOTPOT_median
0,3441,NF,2013-11-29T20:48:00,2013-11-30T08:36:00,0.020604,0.040667,0.022579,0.014714,0.028788,0.016958,0.007121,0.017171,0.008067
1,3423,NF,2013-11-24T03:36:00,2013-11-24T15:24:00,0.003146,0.005236,0.003539,0.002204,0.003639,0.002204,0.000745,0.001443,0.000824
2,3441,NF,2013-11-27T19:48:00,2013-11-28T07:36:00,0.024799,0.039444,0.027582,0.01157,0.024321,0.01799,0.00657,0.012511,0.009789
3,3540,NF,2013-12-28T04:24:00,2013-12-28T16:12:00,0.009912,0.01307,0.012053,0.005807,0.007029,0.006489,0.002307,0.002695,0.002485
4,3601,NF,2014-01-13T23:24:00,2014-01-14T11:12:00,0.209577,0.218333,0.214247,0.322771,0.33725,0.335255,0.329884,0.349723,0.341832
5,3367,NF,2013-11-12T16:24:00,2013-11-13T04:12:00,0.046607,0.067521,0.049609,0.032484,0.043307,0.037399,0.012641,0.020915,0.014669
6,3367,NF,2013-11-10T07:24:00,2013-11-10T19:12:00,0.023461,0.026923,0.025278,0.019177,0.022862,0.02192,0.008099,0.010058,0.009148
7,3515,C,2013-12-17T23:24:00,2013-12-18T11:12:00,0.020399,0.129616,0.02311,0.014697,0.088317,0.01575,0.005006,0.055415,0.005502
8,3557,NF,2014-01-03T00:36:00,2014-01-03T12:24:00,0.001564,0.00163,0.0012,0.000894,0.000845,0.000745,0.000367,0.000311,0.000332
9,3595,NF,2014-01-14T13:24:00,2014-01-15T01:12:00,0.016094,0.016866,0.015495,0.011595,0.013551,0.012387,0.004117,0.004441,0.004315


**Note**: The argument `excluded_colnames` was used to keep the column `id` intact during normalization. Moreover, all other columns with non-numeric values were preserved in the output.
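
The values above are consistent with min-max scaling, `(x - min) / (max - min)`: for example, the `TOTUSJH_min` value 12.442314, with the column minimum 3.494185 and maximum 2848.177 from the summary table, maps to about 0.003146. The following plain-Python sketch illustrates that idea on a made-up table; `zero_one` and `normalize_table` are hypothetical names, and the handling of constant columns is an assumption, not the toolkit's documented behavior.

```python
from numbers import Number


def zero_one(column: list) -> list:
    """Rescale numeric values to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant column maps to all zeros.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def normalize_table(table: dict, excluded: list) -> dict:
    """Normalize every all-numeric column that is not listed in `excluded`."""
    out = {}
    for name, column in table.items():
        numeric = all(isinstance(x, Number) and not isinstance(x, bool) for x in column)
        out[name] = zero_one(column) if numeric and name not in excluded else list(column)
    return out


table = {"id": [1, 2, 3], "lab": ["NF", "C", "NF"], "TOTUSJH_min": [2.0, 4.0, 6.0]}
print(normalize_table(table, excluded=["id"]))
# {'id': [1, 2, 3], 'lab': ['NF', 'C', 'NF'], 'TOTUSJH_min': [0.0, 0.5, 1.0]}
```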