# CompEngine dataset analysis
## Analysis #3: Explore balanced subset

**Project URL:** https://www.comp-engine.org/

**Get the data at:** https://www.comp-engine.org/#!browse

**Date:** May 29 2020

### Objectives:
1. Extract the meta-features from the train and test data using the unsupervised methods in pymfe.
2. Drop meta-features with NaN values (since PCA can't be applied to them).
3. Apply PCA to the train set meta-features.
4. Use a simple machine learning model to predict the test set.

### Results (please check the analysis date):
1. All meta-features from all unsupervised methods, combined with all summary functions in pymfe, were extracted from both train and test data. This totals 1407 candidate meta-features.
    1. Before extraction, every time-series was embedded using the appropriate lag (the first non-significant lag of the autocorrelation function) and the appropriate dimension (found with Cao's algorithm).
    2. The minimum embedding dimension was set to 2 to avoid losing too many meta-features in the summarization process.
    3. Only the 1024 most recent observations (at most) of each time-series were used.
2. We want to apply PCA to the data, but first we need to get rid of the missing values in the training meta-data. A total of 121 meta-features (8.60%) have at least one missing value. After dropping all meta-features with missing values, 1286 remain.
    1. There were 112 meta-features with 310 missing values (missing on 42.12% of all train time-series);
    2. There were 8 meta-features with 584 missing values (missing on 79.35% of all train time-series);
    3. There was 1 meta-feature ("num_to_cat") with 736 missing values (missing on all train time-series). As a side note, this result seems very reasonable, since no time-series should have categorical values.
3. The next step is to apply PCA retaining 95% of variance explained by the original meta-features. Before applying PCA we need to choose a normalization strategy. Two methods were considered:
    1. (Pipeline A) Standard Scaler (traditional standardization): 105 of 1286 dimensions were kept. This corresponds to a dimension reduction of 91.84%.
    2. (Pipeline B) Robust Sigmoid Scaler (see reference [1]): 63 of 1286 dimensions were kept. This corresponds to a dimension reduction of 95.10%.
4. Now it is time for some predictions. I'm using a sklearn RandomForestClassifier model with default hyper-parameters and a fixed random seed.
    1. The expected accuracy of random guessing is 2.17%.
    2. (Pipeline A) An accuracy score of 47.28% was obtained.
    3. (Pipeline B) An accuracy score of 55.43% was obtained.
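
The percentages quoted above follow directly from the reported counts. As a quick sanity check, the sketch below recomputes them in plain Python. Note that the class count of 46 is not stated explicitly in this notebook; it is an assumption inferred from the random-guessing accuracy of 0.0217 (about 1/46).

```python
# Recompute the percentages reported in the results from the raw counts.
n_candidates = 1407   # candidate meta-features
n_dropped = 121       # meta-features with at least one NaN
n_train = 736         # train time-series
n_classes = 46        # assumption: inferred from 1 / 0.0217, not stated in the notebook


def pct(x: float) -> str:
    """Format a proportion as a percentage with two decimals."""
    return f"{100.0 * x:.2f}%"


n_kept = n_candidates - n_dropped

report = {
    "dropped meta-features": pct(n_dropped / n_candidates),
    "missing on 310 time-series": pct(310 / n_train),
    "missing on 584 time-series": pct(584 / n_train),
    "reduction (pipeline A)": pct(1 - 105 / n_kept),
    "reduction (pipeline B)": pct(1 - 63 / n_kept),
    "random guessing": pct(1 / n_classes),
}

for name, value in report.items():
    print(f"{name}: {value}")
```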
    

## References:

.. [1] Fulcher, Ben D., Little, Max A., and Jones, Nick S., "Highly comparative time-series analysis: the empirical structure of time series and their methods" (Supplemental material #1, page 11), Journal of The Royal Society Interface, 2013, doi: 10.1098/rsif.2013.0048, https://royalsocietypublishing.org/doi/abs/10.1098/rsif.2013.0048.

In [1]:
%matplotlib inline
import typing
import warnings

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import sklearn.decomposition
import sklearn.pipeline
import sklearn.preprocessing
import sklearn.ensemble
import sklearn.metrics

import robust_sigmoid
import pymfe.mfe
import tspymfe._embed

In [2]:
# Note: using only groups that have at least one meta-feature that can be
# extracted from an unsupervised dataset
groups = ("general", "statistical", "info-theory", "complexity", "itemset", "concept")
summary = "all"

extractor = pymfe.mfe.MFE(features="all",
                          summary=summary,
                          groups=groups)

   Please use only the updated version available at: https://github.com/ealcobaca/pymfe


In [3]:
data_train = pd.read_csv("../2_exploring_subsample/subsample_train.csv", header=0, index_col="timeseries_id")
data_test = pd.read_csv("../2_exploring_subsample/subsample_test.csv", header=0, index_col="timeseries_id")

In [4]:
assert data_train.shape[0] > data_test.shape[0]

data_train.head()

timeseries_id,category,inst_ind,datapoints
e0b36e39-3872-11e8-8680-0242ac120002,Beta noise,25254,"0.73617,0.99008,0.71331,0.87094,0.75527,0.9912..."
81db0cf2-3883-11e8-8680-0242ac120002,Relative humidity,14878,"95.5,79,86.75,8.75,62.75,98.75,79.74,44.75,92...."
380eb353-387a-11e8-8680-0242ac120002,RR,6577,"0.6328,0.6328,0.625,0.6328,0.625,0.625,0.6172,..."
f33f461c-3871-11e8-8680-0242ac120002,Tremor,27821,"-0.6,1.5,1.5,0.1,0.9,0.6,0.3,-0.2,0.7,1,0.1,1...."
7bcad309-3874-11e8-8680-0242ac120002,Noisy sinusoids,14226,"0.38553,0.2014,1.8705,0.47883,0.33958,0.009558..."


In [5]:
# Note: using at most the last 1024 observations of each time-series
size_threshold = 1024

# Note: the partial results are saved to .csv every 'to_csv_it_num' iterations
to_csv_it_num = 16

# Note: using dummy data to get the metafeature names
mtf_names = extractor.fit(np.arange(16).reshape(-1, 2),
                          suppress_warnings=True).extract(suppress_warnings=True)[0]

# Note: filepath to store the results
filename_train = "metafeatures_pymfe_train.csv"
filename_test = "metafeatures_pymfe_test.csv"

def recover_data(filepath: str,
                 index: typing.Collection[str],
                 def_shape: typing.Tuple[int, int]) -> typing.Tuple[pd.DataFrame, int]:
    """Recover data from the previous experiment run."""
    filled_len = 0
    
    try:
        results = pd.read_csv(filepath, index_col=0)
        
        assert results.shape == def_shape

        # Note: find the index where the previous run was interrupted
        while filled_len < results.shape[0] and not results.iloc[filled_len, :].isnull().all():
            filled_len += 1

    except (AssertionError, FileNotFoundError):
        results = pd.DataFrame(index=index, columns=mtf_names)
    
    return results, filled_len


results_train, start_ind_train = recover_data(filepath=filename_train,
                                              index=data_train.index,
                                              def_shape=(data_train.shape[0], len(mtf_names)))

results_test, start_ind_test = recover_data(filepath=filename_test,
                                            index=data_test.index,
                                            def_shape=(data_test.shape[0], len(mtf_names)))

In [6]:
assert results_train.shape == (data_train.shape[0], len(mtf_names))
assert results_test.shape == (data_test.shape[0], len(mtf_names))

print("Train start index:", start_ind_train)
print("Test start index:", start_ind_test)

Train start index: 736
Test start index: 184
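
A start index equal to the number of rows (736 for train, 184 for test) means that both result files were already complete when this cell ran. The resume position found by the `while` loop in `recover_data` can also be computed without an explicit loop. Here is a minimal sketch on a toy frame; the helper name `first_unfilled_row` and the column names `mtf_a` and `mtf_b` are made up for illustration.

```python
import numpy as np
import pandas as pd


def first_unfilled_row(results: pd.DataFrame) -> int:
    """Return the position of the first row whose values are all missing.

    If every row has at least one value, return the number of rows.
    """
    all_missing = results.isnull().all(axis=1).to_numpy()
    if not all_missing.any():
        return len(results)
    # argmax on a boolean array gives the position of the first True
    return int(np.argmax(all_missing))


# Toy checkpoint: rows 0 and 1 were extracted, rows 2 and 3 were not.
toy = pd.DataFrame(
    [[1.0, 2.0], [3.0, np.nan], [np.nan, np.nan], [np.nan, np.nan]],
    columns=["mtf_a", "mtf_b"],
)

print(first_unfilled_row(toy))           # 2
print(first_unfilled_row(toy.iloc[:2]))  # 2 (nothing left to extract)
```

Like the original loop, this treats a row with some (but not all) missing values as already extracted.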


In [7]:
print("Number of candidate meta-features per dataset:", len(mtf_names))

Number of candidate meta-features per dataset: 1407


In [8]:
def extract_metafeatures(data: pd.DataFrame, results: pd.DataFrame, start_ind: int, output_file: str) -> None:
    print(f"Starting extraction from index {start_ind}...")
    for i, (cls, _, vals) in enumerate(data.iloc[start_ind:, :].values, start_ind):
        ts = np.asarray(vals.split(",")[-size_threshold:], dtype=float)

        embed_lag = tspymfe._embed.embed_lag(ts=ts, max_nlags=16)

        embed_dim = max(2, tspymfe._embed.ft_emb_dim_cao(ts=ts,
                                                         lag=embed_lag,
                                                         dims=16,
                                                         tol_threshold=0.2))

        ts_embed = tspymfe._embed.embed_ts(ts=ts,
                                           dim=embed_dim,
                                           lag=embed_lag)
        
        extractor.fit(ts_embed, suppress_warnings=True)
        
        with warnings.catch_warnings():
            warnings.filterwarnings("ignore")
            res = extractor.extract(suppress_warnings=True)
        
        results.iloc[i, :] = res[1]

        if i % to_csv_it_num == 0:
            results.to_csv(output_file)
            print(f"Saved results at index {i} in file {output_file}.")
    
    results.to_csv(output_file)
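
The functions in `tspymfe._embed` are not shown in this notebook. As a rough illustration of what the lag selection and the embedding steps do, here is a minimal NumPy sketch. It assumes the usual white-noise significance band of 1.96/sqrt(n) for the autocorrelation and the standard time-delay layout. The helper names `first_nonsignificant_lag` and `delay_embed` are hypothetical, and the actual tspymfe implementation may differ in both the criterion and the column ordering. Cao's algorithm for the embedding dimension is not reproduced here.

```python
import numpy as np


def first_nonsignificant_lag(ts: np.ndarray, max_nlags: int = 16) -> int:
    """First lag whose autocorrelation lies inside the approximate 95% band.

    Assumption: significance is judged with the +-1.96 / sqrt(n) band for
    white noise; tspymfe may use a different criterion.
    """
    ts = np.asarray(ts, dtype=float)
    centered = ts - ts.mean()
    denom = np.dot(centered, centered)
    bound = 1.96 / np.sqrt(ts.size)
    for lag in range(1, max_nlags + 1):
        acf = np.dot(centered[:-lag], centered[lag:]) / denom
        if abs(acf) < bound:
            return lag
    # No lag inside the band: fall back to the largest lag tested
    return max_nlags


def delay_embed(ts: np.ndarray, dim: int, lag: int) -> np.ndarray:
    """Time-delay embedding: row t is [x_t, x_{t+lag}, ..., x_{t+(dim-1)*lag}]."""
    ts = np.asarray(ts, dtype=float)
    n_rows = ts.size - (dim - 1) * lag
    if n_rows <= 0:
        raise ValueError("Time-series too short for this dim and lag.")
    return np.column_stack([ts[i * lag : i * lag + n_rows] for i in range(dim)])


ts = np.arange(10, dtype=float)
embedded = delay_embed(ts, dim=3, lag=2)
print(embedded.shape)  # (6, 3)
print(embedded[0])     # [0. 2. 4.]
```

Each row of the embedded array is then treated by the extractor as one instance of an unsupervised dataset, which is why the minimum dimension of 2 matters for the summary functions.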

In [9]:
extract_metafeatures(data=data_train,
                     results=results_train,
                     start_ind=start_ind_train,
                     output_file=filename_train)

extract_metafeatures(data=data_test,
                     results=results_test,
                     start_ind=start_ind_test,
                     output_file=filename_test)

Starting extraction from index 736...
Starting extraction from index 184...


In [10]:
# Note: analysing the NaN count.
nan_count = results_train.isnull().sum()

In [11]:
pd_nan_count = nan_count.iloc[nan_count.to_numpy().nonzero()].value_counts()
pd_nan_count = pd.concat([pd_nan_count, pd_nan_count / results_train.shape[1]], axis=1)
pd_nan_count = pd_nan_count.rename(columns={0: "Number of meta-features", 1: "Proportion of meta-features"})
pd_nan_count.index = [
    "{} (missing on {:.2f}% of all train time-series)".format(count, 100. * count / results_train.shape[0])
    for count in pd_nan_count.index
]
pd_nan_count.index.name = "Missing values count"
pd_nan_count

Missing values count,Number of meta-features,Proportion of meta-features
310 (missing on 42.12% of all train time-series),112,0.079602
584 (missing on 79.35% of all train time-series),8,0.005686
736 (missing on 100.00% of all train time-series),1,0.000711
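
The grouping above can be reproduced on a small example. The sketch below builds the same kind of summary on a toy frame with made-up column names (`mtf_a` to `mtf_d`), and names the output columns explicitly instead of renaming integer labels, since the labels produced by `value_counts` and `concat` may depend on the pandas version.

```python
import numpy as np
import pandas as pd

# Toy meta-feature table: 4 time-series (rows) x 4 meta-features (columns).
toy = pd.DataFrame({
    "mtf_a": [1.0, 2.0, 3.0, 4.0],        # no missing values
    "mtf_b": [np.nan, 1.0, np.nan, 2.0],  # 2 missing
    "mtf_c": [np.nan, 5.0, np.nan, 6.0],  # 2 missing
    "mtf_d": [np.nan] * 4,                # missing everywhere
})

nan_per_feature = toy.isnull().sum()

# Keep only meta-features with at least one NaN, then count how many
# meta-features share each distinct NaN count.
summary = (
    nan_per_feature[nan_per_feature > 0]
    .value_counts()
    .sort_index()
    .rename("n_metafeatures")
    .to_frame()
)
summary["prop_metafeatures"] = summary["n_metafeatures"] / toy.shape[1]
summary["pct_series_missing"] = 100.0 * summary.index.to_numpy() / toy.shape[0]

print(summary)
```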


In [12]:
# Note: suspicious meta-feature with all values missing. Which one is it?
ind = (nan_count == data_train.shape[0]).to_numpy().nonzero()
print(results_train.columns[ind])

# Note afterwards: the result ("num_to_cat") seems reasonable, since no
# time-series should have categorical values.

Index(['num_to_cat'], dtype='object')


In [13]:
results_train.dropna(axis=1, inplace=True)
print("Train shape after dropping NaN column:", results_train.shape)
print(f"Dropped {len(mtf_names) - results_train.shape[1]} of {len(mtf_names)} meta-features "
      f"({100 * (1 - results_train.shape[1] / len(mtf_names)):.2f}% from the total).")
results_test = results_test.loc[:, results_train.columns]

# Note: sanity check that the columns were dropped correctly
assert np.all(results_train.columns == results_test.columns)

Train shape after dropping NaN column: (736, 1286)
Dropped 121 of 1407 meta-features (8.60% from the total).


In [14]:
def get_accuracy(pipeline: sklearn.pipeline.Pipeline,
                 X_train: np.ndarray,
                 X_test: np.ndarray,
                 y_train: np.ndarray,
                 y_test: np.ndarray) -> float:
    # Note: fit the scaler and PCA on the train data passed as argument
    pipeline.fit(X_train)
    
    X_subset_train = pipeline.transform(X_train)
    X_subset_test = pipeline.transform(X_test)
    
    assert X_subset_train.shape[1] == X_subset_test.shape[1]
    
    # Note: sanity check that the train projection is zero-centered
    assert np.allclose(X_subset_train.mean(axis=0), 0.0)

    print("Train shape after PCA:", X_subset_train.shape)
    print("Test shape after PCA :", X_subset_test.shape)
    print(f"Total of {X_subset_train.shape[1]} of {X_train.shape[1]} "
          "dimensions kept for 95% variance explained "
          f"(reduction of {100. * (1 - X_subset_train.shape[1] / X_train.shape[1]):.2f}%).")
    
    rf = sklearn.ensemble.RandomForestClassifier(random_state=16)
    rf.fit(X=X_subset_train, y=y_train)
    
    y_pred = rf.predict(X_subset_test)

    # Note: since the test set is balanced, we can use the traditional accuracy
    test_acc = sklearn.metrics.accuracy_score(y_test, y_pred)
    
    return test_acc
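
Both pipelines pass `n_components=0.95` to PCA, so the number of kept dimensions is determined by the cumulative explained variance. The NumPy sketch below illustrates that selection rule on synthetic data. It mirrors my understanding of how scikit-learn resolves a fractional `n_components` (the smallest number of components whose cumulative ratio exceeds the threshold), but it is only an illustration; the helper name `n_components_for_variance` is hypothetical.

```python
import numpy as np


def n_components_for_variance(X: np.ndarray, threshold: float = 0.95) -> int:
    """Smallest number of principal components whose cumulative
    explained-variance ratio exceeds `threshold`."""
    X_centered = X - X.mean(axis=0)
    # Squared singular values of the centered data are proportional
    # to the variance captured by each principal component.
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    explained_variance = singular_values ** 2
    ratio_cumsum = np.cumsum(explained_variance / explained_variance.sum())
    return int(np.searchsorted(ratio_cumsum, threshold, side="right") + 1)


rng = np.random.default_rng(16)

# Synthetic data: 3 strong latent directions spread over 20 noisy features.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 20))

n_kept = n_components_for_variance(X, threshold=0.95)
print(f"Kept {n_kept} of {X.shape[1]} dimensions.")
```

Because the rule depends on the variance of each scaled feature, the choice of scaler changes how many components are kept, which is consistent with the different dimension counts reported for pipelines A and B.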

In [15]:
# Note: sklearn documents 'steps' as a list of (name, transformer) tuples
pipeline_a = sklearn.pipeline.Pipeline([
    ("zscore", sklearn.preprocessing.StandardScaler()),
    ("pca", sklearn.decomposition.PCA(n_components=0.95, random_state=16))
])

test_acc_a = get_accuracy(pipeline=pipeline_a,
                          X_train=results_train.values,
                          X_test=results_test.values,
                          y_train=data_train.category.values,
                          y_test=data_test.category.values)

# This is equivalent to always guessing the majority class, which can be any
# class in this case since the dataset is perfectly balanced
print(f"Expected accuracy by random guessing: {1 / data_test.category.unique().size:.4f}")
print(f"Test accuracy (pipeline A - StandardScaler): {test_acc_a:.4f}")

Train shape after PCA: (736, 105)
Test shape after PCA : (184, 105)
Total of 105 of 1286 dimensions kept for 95% variance explained (reduction of 91.84%).
Expected accuracy by random guessing: 0.0217
Test accuracy (pipeline A - StandardScaler): 0.4728


In [16]:
# Note: sklearn documents 'steps' as a list of (name, transformer) tuples
pipeline_b = sklearn.pipeline.Pipeline([
    ("robsigmoid", robust_sigmoid.RobustSigmoid()),
    ("pca", sklearn.decomposition.PCA(n_components=0.95, random_state=16))
])

test_acc_b = get_accuracy(pipeline=pipeline_b,
                          X_train=results_train.values,
                          X_test=results_test.values,
                          y_train=data_train.category.values,
                          y_test=data_test.category.values)

# This is equivalent to always guessing the majority class, which can be any
# class in this case since the dataset is perfectly balanced
print(f"Expected accuracy by random guessing: {1 / data_test.category.unique().size:.4f}")
print(f"Test accuracy (pipeline B - RobustSigmoid) : {test_acc_b:.4f}")

Train shape after PCA: (736, 63)
Test shape after PCA : (184, 63)
Total of 63 of 1286 dimensions kept for 95% variance explained (reduction of 95.10%).
Expected accuracy by random guessing: 0.0217
Test accuracy (pipeline B - RobustSigmoid) : 0.5543
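
The `robust_sigmoid` module used in pipeline B is a local file that is not shown in this notebook. For readers who want a rough idea of what such a scaler does, here is a minimal NumPy sketch of an outlier-robust sigmoid transform. It assumes each column is centered on its median and scaled by IQR / 1.35 before the logistic function is applied; this scale convention is an assumption on my part and should be checked against reference [1]. The function name `robust_sigmoid_scale` is hypothetical, and the real `RobustSigmoid` class may differ.

```python
import numpy as np


def robust_sigmoid_scale(X: np.ndarray) -> np.ndarray:
    """Column-wise outlier-robust sigmoid scaling into the [0, 1] range.

    Assumption: center on the median and scale by IQR / 1.35
    (roughly the standard deviation for Gaussian data).
    """
    X = np.asarray(X, dtype=float)
    median = np.median(X, axis=0)
    q75, q25 = np.percentile(X, [75, 25], axis=0)
    scale = (q75 - q25) / 1.35
    # Avoid division by zero for columns with zero interquartile range
    scale = np.where(scale == 0.0, 1.0, scale)
    # Extreme negative outliers can overflow exp(); the result is still 0.0
    with np.errstate(over="ignore"):
        return 1.0 / (1.0 + np.exp(-(X - median) / scale))


# One column with a large outlier: the median maps to 0.5 and the
# outlier saturates near 1 instead of stretching the whole scale.
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])
out = robust_sigmoid_scale(X)
print(out.ravel())
```

Unlike standardization, a single extreme value has little effect on the median and the IQR, so the remaining values keep their spread after scaling.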
