# CompEngine dataset analysis
## Analysis #4: Get TS-Pymfe meta-features accuracy

**Project URL:** https://www.comp-engine.org/

**Get data in:** https://www.comp-engine.org/#!browse

**Date:** May 31 2020

### Objectives:
1. Extract meta-features from the train and test data using ts-pymfe.
2. Drop meta-features with NaN values.
3. Apply PCA to the train meta-dataset.
4. Use a simple machine learning model to predict the test set.

### Results (please check the analysis date):
1. All meta-features from all methods, combined with all summary functions in pymfe, were extracted from both train and test data. This totals ?? candidate meta-features.
2. ??
3. The next step is to apply PCA retaining 95% of variance explained by the original meta-features. Before applying PCA we need to choose a normalization strategy. Two methods were considered:
    1. (Pipeline A) Standard Scaler (traditional standardization): ?? of ?? dimensions were kept. This corresponds to a dimension reduction of ??%.
    2. (Pipeline B) Robust Sigmoid Scaler (see reference [1]): ?? of ?? dimensions were kept. This corresponds to a dimension reduction of ??%.
4. Now it is time for some predictions. I'm using a sklearn RandomForestClassifier model with default hyper-parameters and a fixed random seed.
    1. The expected accuracy of random guessing is 2.17%.
    2. (Pipeline A) An accuracy score of ??% was obtained.
    3. (Pipeline B) An accuracy score of ??% was obtained.
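
For reference, the robust sigmoid scaling used in Pipeline B can be sketched as below. This is a minimal sketch assuming the transformation follows the form described in [1], a sigmoid centered at the median and scaled by the interquartile range divided by 1.35. The function name `robust_sigmoid_sketch` and the constant-column guard are assumptions made for illustration; the local `robust_sigmoid` module used in this notebook may differ.

```python
import numpy as np


def robust_sigmoid_sketch(X: np.ndarray) -> np.ndarray:
    """Column-wise outlier-robust sigmoid scaling (illustrative sketch only)."""
    X = np.asarray(X, dtype=float)
    med = np.median(X, axis=0)
    q75, q25 = np.percentile(X, [75, 25], axis=0)
    iqr = q75 - q25
    # Assumption: constant columns (IQR == 0) fall back to a unit scale
    scale = np.where(iqr > 0, iqr / 1.35, 1.0)
    return 1.0 / (1.0 + np.exp(-(X - med) / scale))


rng = np.random.default_rng(16)
X_demo = rng.normal(size=(200, 3))
X_demo[0, 0] = 1e3  # a single extreme outlier
X_scaled = robust_sigmoid_sketch(X_demo)

print("Min/max after scaling:", X_scaled.min(), X_scaled.max())
```

Unlike standardization, the output is bounded to the unit interval, so a single extreme value cannot dominate the variance seen by PCA.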
    

## References:

.. [1] Fulcher, Ben D. and Little, Max A. and Jones, Nick S., "Highly comparative time-series analysis: the empirical structure of time series and their methods" (Supplemental material #1, page 11), Journal of The Royal Society Interface, 2013, doi: 10.1098/rsif.2013.0048, https://royalsocietypublishing.org/doi/abs/10.1098/rsif.2013.0048.

In [1]:
%matplotlib inline
import typing
import warnings

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import sklearn.decomposition
import sklearn.pipeline
import sklearn.preprocessing
import sklearn.ensemble
import sklearn.metrics

import robust_sigmoid
import pymfe.tsmfe

NotImplementedError: Ts-Pymfe core not implemented.

In [None]:
# Note: using only groups that have at least one meta-feature that can be extracted
# from an unsupervised dataset
groups = "all"
summary = "all"

extractor = pymfe.tsmfe.TSMFE(features="all",
                              summary=summary,
                              groups=groups)

In [None]:
data_train = pd.read_csv("../2_exploring_subsample/subsample_train.csv", header=0, index_col="timeseries_id")
data_test = pd.read_csv("../2_exploring_subsample/subsample_test.csv", header=0, index_col="timeseries_id")

In [None]:
assert data_train.shape[0] > data_test.shape[0]

data_train.head()

In [None]:
# Note: using at most the last 1024 observations of each time-series
size_threshold = 1024

# Number of iterations between consecutive saves of the results to .csv
to_csv_it_num = 16

# Note: using dummy data to get the metafeature names
mtf_names = extractor.fit(np.arange(16).reshape(-1, 2),
                          suppress_warnings=True).extract(suppress_warnings=True)[0]

# Note: filepath to store the results
filename_train = "metafeatures_pymfe_train.csv"
filename_test = "metafeatures_pymfe_test.csv"

def recover_data(filepath: str,
                 index: typing.Collection[str],
                 def_shape: typing.Tuple[int, int]) -> typing.Tuple[pd.DataFrame, int]:
    """Recover data from the previous experiment run."""
    filled_len = 0
    
    try:
        results = pd.read_csv(filepath, index_col=0)
        
        assert results.shape == def_shape

        # Note: find the index where the previous run was interrupted
        while filled_len < results.shape[0] and not results.iloc[filled_len, :].isnull().all():
            filled_len += 1

    except (AssertionError, FileNotFoundError):
        results = pd.DataFrame(index=index, columns=mtf_names)
    
    return results, filled_len


results_train, start_ind_train = recover_data(filepath=filename_train,
                                              index=data_train.index,
                                              def_shape=(data_train.shape[0], len(mtf_names)))

results_test, start_ind_test = recover_data(filepath=filename_test,
                                            index=data_test.index,
                                            def_shape=(data_test.shape[0], len(mtf_names)))
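
The resume logic above treats the first row whose values are all missing as the point where the previous run stopped. A vectorized sketch of the same idea on a toy frame is shown below; the helper name `first_unfilled_row` and the toy column names are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd


def first_unfilled_row(results: pd.DataFrame) -> int:
    """Return the position of the first all-NaN row, or the row count if none exists."""
    all_nan = results.isnull().all(axis=1).to_numpy()
    return int(all_nan.argmax()) if all_nan.any() else results.shape[0]


# Toy results table: 5 time-series, 2 meta-features, initially empty
demo = pd.DataFrame(index=range(5), columns=["mtf_a", "mtf_b"], dtype=float)
demo.iloc[:3, :] = 1.0     # first three rows were already processed
demo.iloc[1, 0] = np.nan   # a partially missing row still counts as processed

resume_at = first_unfilled_row(demo)
print("Resume extraction at row:", resume_at)
```

A row where every meta-feature legitimately failed would also be read as the interruption point, so the extraction would simply restart from that row.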

In [None]:
assert results_train.shape == (data_train.shape[0], len(mtf_names))
assert results_test.shape == (data_test.shape[0], len(mtf_names))

print("Train start index:", start_ind_train)
print("Test start index:", start_ind_test)

In [None]:
print("Number of candidate meta-features per dataset:", len(mtf_names))

In [None]:
def extract_metafeatures(data: pd.DataFrame, results: pd.DataFrame, start_ind: int, output_file: str) -> None:
    print(f"Starting extraction from index {start_ind}...")
    for i, (cls, _, vals) in enumerate(data.iloc[start_ind:, :].values, start_ind):
        # Note: keep at most the last 'size_threshold' observations of the series
        ts = np.asarray(vals.split(",")[-size_threshold:], dtype=float)
        extractor.fit(ts, suppress_warnings=True)
        
        with warnings.catch_warnings():
            warnings.filterwarnings("ignore")
            res = extractor.extract(suppress_warnings=True)
        
        results.iloc[i, :] = res[1]

        if i % to_csv_it_num == 0:
            results.to_csv(output_file)
            print(f"Saved results at index {i} in file {output_file}.")
    
    results.to_csv(output_file)
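
The parsing step inside the loop can be checked in isolation. The sketch below assumes each time-series is stored as a single comma-separated string, as the `vals.split(",")` call suggests; the helper name `parse_series` and the sample string are made up for illustration.

```python
import numpy as np


def parse_series(raw: str, max_len: int) -> np.ndarray:
    """Parse a comma-separated series and keep at most its last `max_len` values."""
    return np.asarray(raw.split(",")[-max_len:], dtype=float)


ts_short = parse_series("0.5,1.0,1.5,2.0,2.5", max_len=3)
ts_full = parse_series("0.5,1.0,1.5,2.0,2.5", max_len=1024)

print(ts_short)      # the last three observations
print(ts_full.size)  # series shorter than the threshold are kept whole
```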

In [None]:
extract_metafeatures(data=data_train,
                     results=results_train,
                     start_ind=start_ind_train,
                     output_file=filename_train)

extract_metafeatures(data=data_test,
                     results=results_test,
                     start_ind=start_ind_test,
                     output_file=filename_test)

In [None]:
# Note: analysing the NaN count.
nan_count = results_train.isnull().sum()

In [None]:
pd_nan_count = nan_count.iloc[nan_count.to_numpy().nonzero()].value_counts()
pd_nan_count = pd.concat([pd_nan_count, pd_nan_count / results_train.shape[1]], axis=1)
# Note: set the column names directly, since the default names produced by
# 'value_counts' and 'concat' depend on the pandas version
pd_nan_count.columns = ["Number of meta-features", "Proportion of meta-features"]
pd_nan_count.index = list(map("{} (missing on {:.2f}% of all train time-series)".format,
                              pd_nan_count.index,
                              100. * pd_nan_count.index / results_train.shape[0]))
pd_nan_count.index.name = "Missing values count"
pd_nan_count

In [None]:
# Note: suspicious meta-feature with all values missing. Which one is it?
ind = (nan_count == data_train.shape[0]).to_numpy().nonzero()
print(results_train.columns[ind])

# Note afterwards: the result ("num_to_cat") seems reasonable, since no
# time-series should have categorical values.

In [None]:
results_train.dropna(axis=1, inplace=True)
print("Train shape after dropping NaN columns:", results_train.shape)
print(f"Dropped {len(mtf_names) - results_train.shape[1]} of {len(mtf_names)} meta-features "
      f"({100 * (1 - results_train.shape[1] / len(mtf_names)):.2f}% from the total).")
results_test = results_test.loc[:, results_train.columns]

# Note: sanity check that the columns were dropped correctly
assert np.all(results_train.columns == results_test.columns)
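
The column selection above is driven by the train set only. The toy sketch below (frames and column names are invented for illustration) shows the same two steps and highlights one consequence: a meta-feature that is complete in train but missing for some test series is kept, so the aligned test frame may still contain NaN values that need handling before PCA.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({
    "mtf_a": [1.0, 2.0, 3.0],
    "mtf_b": [np.nan, 1.0, 2.0],  # has a NaN in train, so it gets dropped
    "mtf_c": [0.1, 0.2, 0.3],
})
test = pd.DataFrame({
    "mtf_a": [4.0, 5.0],
    "mtf_b": [3.0, 4.0],
    "mtf_c": [np.nan, 0.5],       # NaN only in test, so it survives
})

train_clean = train.dropna(axis=1)
test_aligned = test.loc[:, train_clean.columns]

print("Kept columns:", list(train_clean.columns))
print("Remaining NaN values in test:", int(test_aligned.isnull().sum().sum()))
```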

In [None]:
def get_accuracy(pipeline: sklearn.pipeline.Pipeline,
                 X_train: np.ndarray,
                 X_test: np.ndarray,
                 y_train: np.ndarray,
                 y_test: np.ndarray) -> float:
    # Note: fit on the train data given as argument, not on a global variable
    pipeline.fit(X_train)
    
    X_subset_train = pipeline.transform(X_train)
    X_subset_test = pipeline.transform(X_test)
    
    assert X_subset_train.shape[1] == X_subset_test.shape[1]
    
    # Note: sanity check that the train projection is zero-centered
    assert np.allclose(X_subset_train.mean(axis=0), 0.0)

    print("Train shape after PCA:", X_subset_train.shape)
    print("Test shape after PCA :", X_subset_test.shape)
    print(f"Total of {X_subset_train.shape[1]} of {X_train.shape[1]} "
          "dimensions kept for 95% variance explained "
          f"(reduction of {100. * (1 - X_subset_train.shape[1] / X_train.shape[1]):.2f}%).")
    
    rf = sklearn.ensemble.RandomForestClassifier(random_state=16)
    rf.fit(X=X_subset_train, y=y_train)
    
    y_pred = rf.predict(X_subset_test)

    # Note: since the test set is balanced, we can use the traditional accuracy
    test_acc = sklearn.metrics.accuracy_score(y_test, y_pred)
    
    return test_acc

In [None]:
pipeline_a = sklearn.pipeline.Pipeline([
    ("zscore", sklearn.preprocessing.StandardScaler()),
    ("pca", sklearn.decomposition.PCA(n_components=0.95, random_state=16)),
])

test_acc_a = get_accuracy(pipeline=pipeline_a,
                          X_train=results_train.values,
                          X_test=results_test.values,
                          y_train=data_train.category.values,
                          y_test=data_test.category.values)

# This is equivalent to always guessing the majority class, which can be any class
# in this case since the dataset is perfectly balanced
print(f"Expected accuracy by random guessing: {1 / data_test.category.unique().size:.4f}")
print(f"Test accuracy (pipeline A - StandardScaler): {test_acc_a:.4f}")
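
To illustrate how the "dimensions kept" and "reduction" figures come about, the sketch below runs the same kind of pipeline on synthetic data: 40 correlated features generated from 5 latent factors. All data and names here are assumptions for illustration, not results from this analysis. The last line reproduces the chance-level figure under the assumption of 46 balanced classes, which is what the reported 2.17% implies.

```python
import numpy as np
import sklearn.decomposition
import sklearn.pipeline
import sklearn.preprocessing

rng = np.random.default_rng(16)

# Synthetic meta-dataset: 300 samples, 40 features driven by 5 latent factors
latent = rng.normal(size=(300, 5))
mixing = rng.normal(size=(5, 40))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(300, 40))

pipeline_demo = sklearn.pipeline.Pipeline([
    ("zscore", sklearn.preprocessing.StandardScaler()),
    # A float n_components in (0, 1) keeps enough components to explain
    # at least that fraction of the variance
    ("pca", sklearn.decomposition.PCA(n_components=0.95, svd_solver="full")),
])

X_proj = pipeline_demo.fit_transform(X_demo)
kept = X_proj.shape[1]
explained = pipeline_demo.named_steps["pca"].explained_variance_ratio_.sum()
reduction = 100.0 * (1 - kept / X_demo.shape[1])

print(f"Kept {kept} of {X_demo.shape[1]} dimensions "
      f"({explained:.4f} variance explained, reduction of {reduction:.2f}%).")

# Chance-level accuracy, assuming 46 balanced classes
chance = 1 / 46
print(f"Expected accuracy by random guessing: {100 * chance:.2f}%")
```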

In [None]:
pipeline_b = sklearn.pipeline.Pipeline([
    ("robsigmoid", robust_sigmoid.RobustSigmoid()),
    ("pca", sklearn.decomposition.PCA(n_components=0.95, random_state=16)),
])

test_acc_b = get_accuracy(pipeline=pipeline_b,
                          X_train=results_train.values,
                          X_test=results_test.values,
                          y_train=data_train.category.values,
                          y_test=data_test.category.values)

# This is equivalent to always guessing the majority class, which can be any class
# in this case since the dataset is perfectly balanced
print(f"Expected accuracy by random guessing: {1 / data_test.category.unique().size:.4f}")
print(f"Test accuracy (pipeline B - RobustSigmoid) : {test_acc_b:.4f}")