# CompEngine dataset analysis
## Analysis #1: create balanced train and test subsets

**Project URL:** https://www.comp-engine.org/

**Get data from:** https://www.comp-engine.org/#!browse

**Date:** May 18 2020

### Objectives:
1. Select the most popular (most frequent) classes from the original data.
2. Build a class-balanced subset of those classes and split it into train and test subsets.

### Results (please check the analysis date):
1. The 25% most popular classes were chosen (46 distinct classes in total).
2. A balanced subset of 920 instances (20 per class) was built and split by stratified hold-out into a train subset (736 instances, 80% of the total) and a test subset (184 instances, 20% of the total). The train subset therefore has 16 instances of each class (from the 25% most popular classes of the original data), and the test subset has 4 instances of each class. The instance indices of the train and test subsets were saved to the CSV files "inds_train.csv" and "inds_test.csv", respectively.
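The sizes quoted above follow from simple arithmetic. The sketch below only re-derives them from the three numbers stated in the results (46 classes, 920 instances, 20% test fraction); the variable names are chosen for illustration.

```python
# Numbers taken from the results above.
n_classes = 46
sample_size = 920
test_frac = 0.20

# Balanced subset: every class contributes the same number of instances.
inst_per_class = sample_size // n_classes

# Stratified hold-out keeps the per-class proportions in both subsets.
test_size = round(sample_size * test_frac)
train_size = sample_size - test_size
test_per_class = test_size // n_classes
train_per_class = train_size // n_classes

print(inst_per_class, train_size, test_size, train_per_class, test_per_class)
```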

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import sklearn.model_selection

In [2]:
# Note: read only the class labels ('category') and the time-series IDs.
y = pd.read_csv("../data/comp-engine-export-metadata.20200503.csv",
                usecols=["category", "timeseries_id"])

In [3]:
class_freqs = y["category"].value_counts(ascending=False)
ind_threshold = int(np.ceil(0.25 * class_freqs.size))
selected_classes = class_freqs[:ind_threshold]
print(selected_classes)
print("Total classes selected:", selected_classes.size)

ECG                                 6540
Finance                             2611
Model M5a                           1944
Micoeconomics                       1530
Medical                             1091
Autoregressive with noise            888
Postural sway                        840
Unassigned                           710
Industry                             600
Text                                 548
Macroeconomics                       491
Birdsong                             484
Precipitation rate                   466
Astrophysics                         464
Powerlaw noise                       451
Moving average process               360
Music                                328
Model M10a                           324
RR                                   312
Beta noise                           294
Nonstationary autoregressive         285
Air pressure                         276
Relative humidity                    271
Air temperature                      270
Opening prices  
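To make the selection rule easier to follow, here is a minimal sketch of the same "top 25% of classes by frequency" logic on toy labels. The labels `a` to `e` and their counts are made up for illustration and are not taken from the CompEngine data.

```python
import numpy as np
import pandas as pd

# Toy labels (hypothetical): five classes with distinct frequencies.
labels = pd.Series(
    ["a"] * 5 + ["b"] * 4 + ["c"] * 3 + ["d"] * 2 + ["e"] * 1,
    name="category",
)

# value_counts() sorts classes from most to least frequent by default.
class_freqs = labels.value_counts()

# Rounding up guarantees at least one class is kept:
# 25% of 5 classes is 1.25, so 2 classes are selected.
n_selected = int(np.ceil(0.25 * class_freqs.size))
selected_classes = class_freqs.iloc[:n_selected]

print(selected_classes)
```

Because the threshold is rounded up, the number of selected classes can slightly exceed a quarter of the total.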

In [4]:
sample_size = 920 # 20 instances per class selected

# Note: sanity check that every class can receive exactly the same
# number of instances in the subsample
assert sample_size % selected_classes.size == 0

inst_per_class = sample_size // selected_classes.size

In [5]:
candidates_inst = y[y["category"].isin(selected_classes.index)]

subsample = candidates_inst.groupby("category").apply(
    lambda group: group.sample(inst_per_class, random_state=16))
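For comparison, pandas 1.1 and later also provide `DataFrameGroupBy.sample`, which draws a fixed number of rows per group directly. The sketch below uses a toy frame with made-up values. It is an alternative formulation, not the method used in this notebook: it keeps the original row index instead of the `(category, row)` MultiIndex that `apply` produces, so the later cells that rely on that MultiIndex would need adapting.

```python
import pandas as pd

# Toy data (hypothetical): three classes of unequal size.
toy = pd.DataFrame({
    "category": ["a"] * 6 + ["b"] * 4 + ["c"] * 5,
    "timeseries_id": range(15),
})

# Draw the same number of rows from every class, without replacement.
balanced = toy.groupby("category").sample(n=3, random_state=16)

counts = balanced["category"].value_counts()
print(counts)
```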

In [6]:
# Note: sanity check if subsample has the desired size
assert subsample.shape[0] == sample_size

# Note: sanity check that every class was subsampled with
# exactly the expected number of instances per class
assert np.allclose(subsample["category"].value_counts(), inst_per_class)

# Note: check if random seed was not modified
assert subsample["category"].index[0] == ('Air pressure', 8016), "Random seed changed! (use random_state=16)"

In [7]:
test_frac = 0.20

inds_train, inds_test = sklearn.model_selection.train_test_split(
    subsample.index.rename(["category", "inst_ind"]),
    test_size=int(subsample.shape[0] * test_frac),
    stratify=subsample["category"],
    random_state=16)

# Note: sanity check if train and test set both have the expected size
assert inds_train.shape[0] == int(np.ceil((1 - test_frac) * subsample.shape[0]))
assert inds_test.shape[0] == int(test_frac * subsample.shape[0])

# Note: sanity check that the train and test sets are both perfectly balanced
assert np.allclose(subsample.loc[inds_train]["category"].value_counts(), inst_per_class * (1 - test_frac))
assert np.allclose(subsample.loc[inds_test]["category"].value_counts(), inst_per_class * test_frac)

# Note: check if the random seed was not modified
assert inds_train[0] == ('Beta noise', 25254), "Random seed changed! (use random_state=16)"
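The effect of `stratify` can be seen on a small example. The sketch below assumes a toy frame with three equally sized, made-up classes; with a 20% test split, each class contributes the same number of rows to the train set and to the test set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): three classes with five instances each.
toy = pd.DataFrame({
    "category": ["a"] * 5 + ["b"] * 5 + ["c"] * 5,
    "timeseries_id": range(15),
})

# An integer test_size gives the absolute number of test rows (20% of 15).
train, test = train_test_split(
    toy,
    test_size=3,
    stratify=toy["category"],
    random_state=16,
)

print(train["category"].value_counts())
print(test["category"].value_counts())
```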

In [8]:
print(f"train data size: {inds_train.size} "
      f"({100. * inds_train.size / sample_size}% from the total)")
print(f"test data size: {inds_test.size} "
      f"({100. * inds_test.size / sample_size}% from the total)")

train data size: 736 (80.0% from the total)
test data size: 184 (20.0% from the total)


In [9]:
pd.DataFrame(subsample["timeseries_id"].loc[inds_train], index=inds_train).to_csv("inds_train.csv")
pd.DataFrame(subsample["timeseries_id"].loc[inds_test], index=inds_test).to_csv("inds_test.csv")
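A saved index file can later be loaded back with its two index levels restored. The sketch below round-trips a toy frame through an in-memory buffer instead of touching the real files. It assumes the CSV header holds the index columns `category` and `inst_ind` followed by `timeseries_id`; the toy labels and IDs are made up.

```python
import io

import pandas as pd

# Toy frame (hypothetical values) shaped like the saved index files.
index = pd.MultiIndex.from_tuples(
    [("a", 3), ("b", 7)], names=["category", "inst_ind"]
)
saved = pd.DataFrame({"timeseries_id": ["ts-1", "ts-2"]}, index=index)

# Write to an in-memory buffer; a file path works the same way.
buffer = io.StringIO()
saved.to_csv(buffer)
buffer.seek(0)

# Restore both index levels by naming them in index_col.
loaded = pd.read_csv(buffer, index_col=["category", "inst_ind"])
print(loaded)
```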