## Implementing TimeSeriesForestClassifier
* create a `RandomIntervalFeatureExtractor` transformer
* inherit from `ForestClassifier` and adapt it to use pipelines as the individual trees, modifying the `fit` and `predict_proba` methods
* adapt `Pipeline` to accept `random_state` and propagate it to all pipeline components that use randomness, and to accept `check_input` so that input checks can be deactivated for all of its components
* adapt the sklearn helper functions `_parallel_build_trees` (bootstrapping) and `_accumulate_predictions`
* make a minor modification to `set_oob_scores` so that it handles a pandas DataFrame as input
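
To make the first item concrete, here is a minimal sketch of what random interval feature extraction does. It is not the sktime implementation: `random_interval_features` is a hypothetical helper, it takes a plain 2D NumPy array rather than the nested DataFrame used in this notebook, and it assumes that `n_intervals='sqrt'` means the floor of the square root of the series length.

```python
import numpy as np


def random_interval_features(X, features, n_intervals="sqrt", random_state=None):
    """Hypothetical helper: summarise random intervals of each series.

    X is a 2D array of shape (n_series, series_length); the real transformer
    works on a nested DataFrame, which is not reproduced here.
    """
    rng = np.random.RandomState(random_state)
    n_series, series_length = X.shape
    if n_intervals == "sqrt":
        # Assumed rule: use floor(sqrt(series_length)) intervals.
        n_intervals = int(np.sqrt(series_length))

    columns = []
    intervals = []
    for _ in range(n_intervals):
        # Draw a start point, then an end point at least one step further on.
        start = rng.randint(0, series_length - 1)
        end = rng.randint(start + 1, series_length + 1)
        intervals.append((start, end))
        for func in features:
            # Apply the summary function to the same interval of every series.
            columns.append(np.array([func(row[start:end]) for row in X]))
    return np.column_stack(columns), intervals


X = np.random.RandomState(0).normal(size=(5, 150))
Xt, intervals = random_interval_features(X, [np.mean, np.std], random_state=444)
print(Xt.shape)  # (5, 24): 12 intervals x 2 features
```

Because every tree draws its own intervals, each pipeline in the forest sees a different tabular view of the same series, which is why `random_state` has to reach the transformer as well as the tree.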

To do:
* write unit tests
* extend the `RandomIntervalFeatureExtractor` interface to accept arbitrary functions, optionally with additional arguments passed to each function
* ideally replace the tabular-data input checks with time-series/panel-data checks, possibly implemented as part of the data container
* adapt `feature_importances_` (e.g. as a temporal importance curve, as in the paper)
* adapt `decision_path()`
* adapt the `apply` method
* parallelise: consider changing from threading to multi-processing, since there is currently no speed-up when using multiple CPUs
* implement the "entrance" criterion (as in the paper)
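
One possible way to approach the second to-do item (arbitrary functions with additional arguments) is to bind the extra arguments up front with `functools.partial`. The sketch below is only an illustration of that idea: `resolve_features` and the `(callable, kwargs)` tuple format are hypothetical and not part of the current interface.

```python
from functools import partial

import numpy as np


def resolve_features(features):
    """Hypothetical helper: turn feature specs into one-argument callables.

    Each spec is either a callable or a (callable, kwargs) tuple.
    """
    resolved = []
    for spec in features:
        if callable(spec):
            resolved.append(spec)
        else:
            func, kwargs = spec
            # Bind the extra keyword arguments now, so the extractor can
            # later call every feature the same way: func(interval).
            resolved.append(partial(func, **kwargs))
    return resolved


features = resolve_features([
    np.mean,
    (np.std, {"ddof": 1}),
    (np.percentile, {"q": 90}),
])
interval = np.array([1.0, 2.0, 3.0, 4.0])
values = [f(interval) for f in features]
print(values)
```

With this design the extractor itself would not need to know about extra arguments at all; it would only ever call one-argument functions on an interval.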

In [1]:
%load_ext autoreload
%autoreload 2

from sktime.classifiers.ensemble import TimeSeriesForestClassifier
from sktime.transformers.series_to_tabular import RandomIntervalFeatureExtractor
from sklearn.tree import DecisionTreeClassifier
from sktime.pipeline import TSPipeline
from sktime.utils.load_data import load_from_web_to_xdataframe
import pandas as pd
import numpy as np
from numba import jit

In [2]:
cache_path = 'data/'
dataset_name = 'GunPoint'
X_train, y_train = load_from_web_to_xdataframe(dataset_name, is_train_file=True,
                                               cache_path=cache_path) 
X_test, y_test = load_from_web_to_xdataframe(dataset_name, is_test_file=True,
                                             cache_path=cache_path)
X_train = pd.DataFrame(X_train)
y_train = pd.Series(y_train)
X_test = pd.DataFrame(X_test)
y_test = pd.Series(y_test)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(50, 1) (50,) (150, 1) (150,)


In [3]:
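@jit  # compile with numba; a simple but effective optimisation
def time_series_slope(y):
    """Least-squares slope of y regressed on the time index 1, ..., n."""
    n = y.shape[0]
    if n < 2:
        # A single point has no slope; return a float to match the other branch.
        return 0.0
    else:
        x = np.arange(n) + 1
        x_mu = x.mean()
        # slope = cov(x, y) / var(x)
        return (((x * y).mean() - x_mu * y.mean())
                / ((x ** 2).mean() - x_mu ** 2))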
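
The expression in `time_series_slope` is the ordinary least-squares slope of `y` on the time index `x = 1, ..., n`, written as `(mean(x*y) - mean(x)*mean(y)) / (mean(x**2) - mean(x)**2)`, that is, the covariance of `x` and `y` divided by the variance of `x`. As a sanity check, the sketch below restates the same formula without the numba decorator (`slope_closed_form` is a name chosen here for illustration) and compares it with the slope from `np.polyfit` on synthetic data.

```python
import numpy as np


def slope_closed_form(y):
    """Same formula as time_series_slope above, without the numba decorator."""
    n = y.shape[0]
    if n < 2:
        return 0.0
    x = np.arange(n) + 1
    x_mu = x.mean()
    return ((x * y).mean() - x_mu * y.mean()) / ((x ** 2).mean() - x_mu ** 2)


# Synthetic series: a line with slope 0.5 plus a little noise.
rng = np.random.RandomState(0)
y = 0.5 * np.arange(1, 21) + rng.normal(scale=0.1, size=20)

# np.polyfit with degree 1 returns [slope, intercept] of the least-squares line.
reference_slope = np.polyfit(np.arange(1, 21), y, 1)[0]
print(np.isclose(slope_closed_form(y), reference_slope))
```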

In [4]:
features = [np.mean, np.std, time_series_slope]
steps = [('transform', RandomIntervalFeatureExtractor(n_intervals='sqrt', features=features)), 
         ('clf', DecisionTreeClassifier())]
base_estimator = TSPipeline(steps)
clf = TimeSeriesForestClassifier(base_estimator=base_estimator, 
                                 n_estimators=500, 
                                 bootstrap=False,
                                 oob_score=False,
                                 n_jobs=1,
                                 random_state=444)

In [5]:
clf.fit(X_train, y_train)

TimeSeriesForestClassifier(base_estimator=TSPipeline(check_input=False, memory=None, random_state=444,
      steps=[('transform', RandomIntervalFeatureExtractor(check_input=False,
                features=[<function mean at 0x10d333ae8>, <function std at 0x10d333b70>, CPUDispatcher(<function time_series_slope at 0x11a47fd90>)],
                n_intervals='sqrt', random_state=444)), ('clf', DecisionTreeCla...       min_weight_fraction_leaf=0.0, presort=False, random_state=444,
            splitter='best'))]),
              bootstrap=False, check_input=True, class_weight=None,
              criterion=None, max_depth=None, max_features=None,
              max_leaf_nodes=None, min_impurity_decrease=None,
              min_impurity_split=None, min_samples_leaf=None,
              min_samples_split=None, min_weight_fraction_leaf=None,
              n_estimators=500, n_jobs=1, oob_score=False,
              random_state=444, verbose=0, warm_start=False)

In [6]:
clf.score(X_test, y_test)

0.9666666666666667