<a href="https://colab.research.google.com/github/PhilippMatthes/diplom/blob/master/src/shl-deep-learning-timeseries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using a deep CNN to directly classify SHL timeseries data

In [1]:
# Free up some disk space on colab
!rm -rf /usr/local/lib/python2.7
!rm -rf /swift
!rm -rf /usr/local/lib/python3.6/dist-packages/torch
!rm -rf /usr/local/lib/python3.6/dist-packages/pystan
!rm -rf /usr/local/lib/python3.6/dist-packages/spacy
!rm -rf /tensorflow-1.15.2/

In [2]:
# Get needed auxiliary files for colab
!git clone https://github.com/philippmatthes/diplom

Cloning into 'diplom'...
remote: Enumerating objects: 1747, done.
remote: Counting objects: 100% (1084/1084), done.
remote: Compressing objects: 100% (726/726), done.
remote: Total 1747 (delta 537), reused 816 (delta 301), pack-reused 663
Receiving objects: 100% (1747/1747), 34.52 MiB | 23.64 MiB/s, done.
Resolving deltas: 100% (913/913), done.


In [3]:
# Change into the src directory and create a folder for our datasets
%cd /content/diplom/src
!mkdir shl-dataset

/content/diplom/src


In [4]:
# Download training datasets
!wget -nc -O shl-dataset/challenge-2019-train_torso.zip http://www.shl-dataset.org/wp-content/uploads/SHLChallenge2019/challenge-2019-train_torso.zip
!wget -nc -O shl-dataset/challenge-2019-train_bag.zip http://www.shl-dataset.org/wp-content/uploads/SHLChallenge2019/challenge-2019-train_bag.zip
!wget -nc -O shl-dataset/challenge-2019-train_hips.zip http://www.shl-dataset.org/wp-content/uploads/SHLChallenge2019/challenge-2019-train_hips.zip
!wget -nc -O shl-dataset/challenge-2020-train_hand.zip http://www.shl-dataset.org/wp-content/uploads/SHLChallenge2020/challenge-2020-train_hand.zip
# Download validation dataset
!wget -nc -O shl-dataset/challenge-2020-validation.zip http://www.shl-dataset.org/wp-content/uploads/SHLChallenge2020/challenge-2020-validation.zip

--2021-08-17 16:59:30--  http://www.shl-dataset.org/wp-content/uploads/SHLChallenge2019/challenge-2019-train_torso.zip
Resolving www.shl-dataset.org (www.shl-dataset.org)... 37.187.125.22
Connecting to www.shl-dataset.org (www.shl-dataset.org)|37.187.125.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5852446972 (5.5G) [application/zip]
Saving to: ‘shl-dataset/challenge-2019-train_torso.zip’


2021-08-17 17:08:35 (10.2 MB/s) - ‘shl-dataset/challenge-2019-train_torso.zip’ saved [5852446972/5852446972]

--2021-08-17 17:08:36--  http://www.shl-dataset.org/wp-content/uploads/SHLChallenge2019/challenge-2019-train_bag.zip
Resolving www.shl-dataset.org (www.shl-dataset.org)... 37.187.125.22
Connecting to www.shl-dataset.org (www.shl-dataset.org)|37.187.125.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5628524721 (5.2G) [application/zip]
Saving to: ‘shl-dataset/challenge-2019-train_bag.zip’


2021-08-17 17:17:10 (10.4 MB/s) - ‘shl-datas

In [5]:
# Unzip training datasets
!unzip -n -d shl-dataset/challenge-2019-train_torso shl-dataset/challenge-2019-train_torso.zip
!rm shl-dataset/challenge-2019-train_torso.zip
!unzip -n -d shl-dataset/challenge-2019-train_bag shl-dataset/challenge-2019-train_bag.zip
!rm shl-dataset/challenge-2019-train_bag.zip
!unzip -n -d shl-dataset/challenge-2019-train_hips shl-dataset/challenge-2019-train_hips.zip
!rm shl-dataset/challenge-2019-train_hips.zip
!unzip -n -d shl-dataset/challenge-2020-train_hand shl-dataset/challenge-2020-train_hand.zip
!rm shl-dataset/challenge-2020-train_hand.zip
# Unzip validation dataset
!unzip -n -d shl-dataset/challenge-2020-validation shl-dataset/challenge-2020-validation.zip
!rm shl-dataset/challenge-2020-validation.zip

Archive:  shl-dataset/challenge-2019-train_torso.zip
   creating: shl-dataset/challenge-2019-train_torso/train/Torso/
  inflating: shl-dataset/challenge-2019-train_torso/train/Torso/Acc_x.txt  
  inflating: shl-dataset/challenge-2019-train_torso/train/Torso/Acc_y.txt  
  inflating: shl-dataset/challenge-2019-train_torso/train/Torso/Acc_z.txt  
  inflating: shl-dataset/challenge-2019-train_torso/train/Torso/Gra_x.txt  
  inflating: shl-dataset/challenge-2019-train_torso/train/Torso/Gra_y.txt  
  inflating: shl-dataset/challenge-2019-train_torso/train/Torso/Gra_z.txt  
  inflating: shl-dataset/challenge-2019-train_torso/train/Torso/Gyr_x.txt  
  inflating: shl-dataset/challenge-2019-train_torso/train/Torso/Gyr_y.txt  
  inflating: shl-dataset/challenge-2019-train_torso/train/Torso/Gyr_z.txt  
  inflating: shl-dataset/challenge-2019-train_torso/train/Torso/Label.txt  
  inflating: shl-dataset/challenge-2019-train_torso/train/Torso/LAcc_x.txt  
  inflating: shl-dataset/challenge-2019-train

In [6]:
%cd /content/diplom/src
%tensorflow_version 2.x

/content/diplom/src


In [7]:
# Define all datasets to train our model on

from pathlib import Path

TRAIN_DATASET_DIRS = [
    Path('shl-dataset/challenge-2019-train_torso/train/Torso'),
    Path('shl-dataset/challenge-2019-train_bag/train/Bag'),
    Path('shl-dataset/challenge-2019-train_hips/train/Hips'),
    Path('shl-dataset/challenge-2020-train_hand/train/Hand'),
]

VALIDATION_DATASET_DIRS = [
    Path('shl-dataset/challenge-2020-validation/validation/Torso'),
    Path('shl-dataset/challenge-2020-validation/validation/Bag'),
    Path('shl-dataset/challenge-2020-validation/validation/Hips'),
    Path('shl-dataset/challenge-2020-validation/validation/Hand'),
]

In [8]:
# Define more useful constants about our dataset

LABEL_ORDER = [
    'Null',
    'Still',
    'Walking',
    'Run',
    'Bike',
    'Car',
    'Bus',
    'Train',
    'Subway',
]

SAMPLE_LENGTH = 500

In [9]:
# Results from data analysis

CLASS_WEIGHTS = {
    0: 0.0, # NULL label
    1: 1.0021671573438011, 
    2: 0.9985739895697523, 
    3: 2.8994439843842423, 
    4: 1.044135815617944, 
    5: 0.7723505499007343, 
    6: 0.8652474758172704, 
    7: 0.7842127155793044, 
    8: 1.0283208861290594
}
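
The notebook only states that these weights come from data analysis. As a rough illustration of how weights of this kind can be produced, the sketch below computes inverse-frequency ("balanced") weights from a label array and forces the NULL class to zero. The function name `balanced_class_weights`, the `ignore` parameter and the toy labels are hypothetical and not taken from the original analysis.

```python
import numpy as np

def balanced_class_weights(labels, n_classes, ignore=(0,)):
    """Inverse-frequency weights; classes listed in `ignore` get weight 0."""
    labels = np.asarray(labels)
    counts = np.bincount(labels, minlength=n_classes)
    kept = [c for c in range(n_classes) if c not in ignore]
    n_kept_samples = counts[kept].sum()
    weights = {c: 0.0 for c in range(n_classes)}
    for c in kept:
        if counts[c] > 0:
            # Rare classes receive proportionally larger weights
            weights[c] = float(n_kept_samples / (len(kept) * counts[c]))
    return weights

# Hypothetical toy labels: class 1 is frequent, class 3 is rare
toy_labels = np.array([0, 1, 1, 1, 1, 2, 2, 3])
toy_weights = balanced_class_weights(toy_labels, n_classes=4)
print(toy_weights)
```

With this scheme a class that appears half as often gets twice the weight, which matches the pattern above where the rarest class (index 3, 'Run') carries the largest weight.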

In [10]:
# Define features for our dataset

from collections import OrderedDict

import numpy as np

# Attributes to load from our dataset
X_attributes = [
    'acc_x', 'acc_y', 'acc_z',
    'mag_x', 'mag_y', 'mag_z',
    'gyr_x', 'gyr_y', 'gyr_z',
]

# Files within the dataset that contain our attributes
X_files = [
    'Acc_x.txt', 'Acc_y.txt', 'Acc_z.txt',
    'Mag_x.txt', 'Mag_y.txt', 'Mag_z.txt',
    'Gyr_x.txt', 'Gyr_y.txt', 'Gyr_z.txt',
]

# Features to generate from our loaded attributes
# Note that `a` is going to be a dict of attribute tracks
X_features = OrderedDict({
    'acc_mag': lambda a: np.sqrt(a['acc_x']**2 + a['acc_y']**2 + a['acc_z']**2),
    'mag_mag': lambda a: np.sqrt(a['mag_x']**2 + a['mag_y']**2 + a['mag_z']**2),
    'gyr_mag': lambda a: np.sqrt(a['gyr_x']**2 + a['gyr_y']**2 + a['gyr_z']**2),
})

# Define where to find our labels for supervised learning
y_file = 'Label.txt'
y_attribute = 'labels'
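
To make the feature definitions concrete, here is a small sketch of what one of the magnitude lambdas computes on a dict of attribute tracks. The toy values and the helper name `acc_magnitude` are made up for illustration. Each track has shape (samples, timesteps), and the magnitude is taken element-wise, so the output keeps that shape.

```python
import numpy as np

def acc_magnitude(a):
    # Same formula as the 'acc_mag' feature: Euclidean norm over the three axes
    return np.sqrt(a['acc_x']**2 + a['acc_y']**2 + a['acc_z']**2)

# Hypothetical toy tracks: one sample with two timesteps
toy_attrs = {
    'acc_x': np.array([[3.0, 0.0]]),
    'acc_y': np.array([[4.0, 0.0]]),
    'acc_z': np.array([[0.0, 2.0]]),
}
toy_mag = acc_magnitude(toy_attrs)
print(toy_mag)  # [[5. 2.]]

# Swapping axes, as a differently oriented phone would, leaves the magnitude unchanged
swapped = {'acc_x': toy_attrs['acc_y'], 'acc_y': toy_attrs['acc_z'], 'acc_z': toy_attrs['acc_x']}
swapped_mag = acc_magnitude(swapped)
```

Because the magnitude ignores how the device is oriented, the three axis tracks per sensor collapse into a single channel, which is why the model input below has `len(X_features)` channels rather than nine.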

In [11]:
# Load pretrained power transformers for feature scaling

import joblib

X_feature_scalers = OrderedDict({})
for feature_name, _ in X_features.items():
    scaler = joblib.load(f'models/shl-scalers/{feature_name}.scaler.joblib')
    scaler.copy = False # Save memory
    X_feature_scalers[feature_name] = scaler



In [12]:
from tensorflow import keras

# Check that we can use our GPU, to not wait forever during training
from tensorflow.python.client import device_lib
device_lib.list_local_devices()

[name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 10049394344732522062, name: "/device:GPU:0"
 device_type: "GPU"
 memory_limit: 16185556992
 locality {
   bus_id: 1
   links {
   }
 }
 incarnation: 12363299317392124159
 physical_device_desc: "device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0"]

In [13]:
!pip install keras-tuner -q

[?25l[K     |███▍                            | 10 kB 35.1 MB/s eta 0:00:01[K     |██████▉                         | 20 kB 42.4 MB/s eta 0:00:01[K     |██████████▏                     | 30 kB 36.8 MB/s eta 0:00:01[K     |█████████████▋                  | 40 kB 24.0 MB/s eta 0:00:01[K     |█████████████████               | 51 kB 17.2 MB/s eta 0:00:01[K     |████████████████████▍           | 61 kB 14.8 MB/s eta 0:00:01[K     |███████████████████████▊        | 71 kB 16.1 MB/s eta 0:00:01[K     |███████████████████████████▏    | 81 kB 17.5 MB/s eta 0:00:01[K     |██████████████████████████████▋ | 92 kB 14.7 MB/s eta 0:00:01[K     |████████████████████████████████| 96 kB 5.2 MB/s 
[?25h

In [18]:
# Define helper functions for model creation

from tensorflow.keras import layers, models

def make_activation(activation):
    # Map the hyperparameter choice to a fresh Keras activation layer
    if activation == 'lrelu':
        return layers.LeakyReLU(alpha=0.2)
    if activation == 'relu':
        return layers.ReLU() # ReLU has no `alpha` argument
    raise ValueError(f'Unknown activation: {activation}')


def make_resnet_block(
    input_layer, 
    block_height, 
    use_kernel_regularizer, 
    activation,
    kernel_sizes=(8, 5, 3)
):
    conv_kwargs = { 
        'filters': block_height, 
        'padding': 'same', 
        'kernel_regularizer': 'l2' if use_kernel_regularizer else None
    }

    # Stack one Conv1D + BatchNormalization per kernel size
    conv = input_layer
    for i, kernel_size in enumerate(kernel_sizes):
        conv = layers.Conv1D(kernel_size=kernel_size, **conv_kwargs)(conv)
        conv = layers.BatchNormalization()(conv)
        # The last convolution is activated only after the shortcut is added
        if i < len(kernel_sizes) - 1:
            conv = make_activation(activation)(conv)

    # 1x1 convolution so that the shortcut matches the number of filters
    shortcut = layers.Conv1D(kernel_size=1, **conv_kwargs)(input_layer)
    shortcut = layers.BatchNormalization()(shortcut)

    output_block = layers.add([shortcut, conv])
    output_block = make_activation(activation)(output_block)

    return output_block


def make_resnet(hp):
    input_shape = (SAMPLE_LENGTH, len(X_features))
    input_layer = layers.Input(input_shape)

    endpoint_layer = input_layer # Will be built now
    for i in range(hp.Int('n_layers', 1, 5)):
        endpoint_layer = make_resnet_block(
            endpoint_layer, 
            hp.Int(f'block_{i}_maps', 64, 256, step=64),
            hp.Boolean(f'block_{i}_regularizer', default=True),
            hp.Choice(f'block_{i}_activation', ['lrelu', 'relu']),
        )
    
    gap_layer = layers.GlobalAveragePooling1D()(endpoint_layer)
    output_layer = layers.Dense(len(LABEL_ORDER), activation='softmax')(gap_layer)

    model = models.Model(inputs=input_layer, outputs=output_layer)
    model.compile(
        loss='sparse_categorical_crossentropy',
        optimizer='adam',
        metrics=['f1']
    )

    return model
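
The model is compiled with `metrics=['f1']`, and the tuner and callbacks later monitor `val_f1`. Keras in TF 2.x does not, to our knowledge, register a built-in metric under the string `'f1'`, so a metric with that name presumably has to be supplied separately. The NumPy sketch below only illustrates what an F1 score over integer class labels measures. Macro averaging and the function name `macro_f1` are assumptions; the notebook does not say which averaging is used.

```python
import numpy as np

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for label in labels:
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = np.sum((y_pred != label) & (y_true == label))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 if the class never occurs
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

# Hypothetical predictions for three classes
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]
score = macro_f1(y_true, y_pred, labels=[1, 2, 3])
print(round(score, 4))
```

Unlike accuracy, this score weighs every class equally, which matters here because the classes are not equally frequent.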

In [19]:
# Define our preprocessing pipeline

import pandas as pd

from tqdm import tqdm

class SHLDatasetGenerator(keras.utils.Sequence):    
    def __init__(
        self, 
        dataset_dirs, 
        batch_size=pow(2, 6), # 64
        prefetch_size_per_dataset=pow(2, 14),
        xdtype=np.float32, # Use float16 with caution, can lead to overflows
        ydtype=np.int64 # The np.int alias was removed in NumPy 1.24
    ):
        self.dataset_dirs = dataset_dirs
        self.batch_size = batch_size
        
        self.prefetch_size_per_dataset = prefetch_size_per_dataset
        self.prefetched_X_batches = None
        self.prefetched_y_batches = None
        self.prefetched_base_step_idx = None

        self.xdtype = xdtype
        self.ydtype = ydtype

        # Count samples in datasets
        self.dataset_samples = [] # (dim datasets)
        for dataset_dir in dataset_dirs:
            # Every file in the dataset has the same length, use the labels file
            samples = 0
            with open(dataset_dir / y_file) as f:
                for _ in tqdm(f, desc=f'Counting samples in {dataset_dir}'):
                    samples += 1
            assert samples > prefetch_size_per_dataset
            self.dataset_samples.append(samples)
        
        self._setup_new_chunked_readers()

    def __len__(self):
        # Datasets should be of equal length, but in case some dataset
        # is shorter than the others, we need to truncate our samples
        max_n_samples = min(self.dataset_samples) * len(self.dataset_dirs)
        # Datasets need to be truncated by the prefetch size
        padding = self.prefetch_size_per_dataset * len(self.dataset_dirs)
        max_n_samples = max_n_samples - (max_n_samples % padding)
        return int(np.floor(max_n_samples / self.batch_size))                  

    def _setup_new_chunked_readers(self):
        # Throw away potentially existing readers
        self.X_attr_readers_for_datasets = [] # (dim datasets x readers)
        self.y_attr_reader_for_datasets = [] # (dim datasets)

        # Initialize new chunked csv readers
        read_csv_kwargs = { 'sep': ' ', 'header': None, 'chunksize': self.prefetch_size_per_dataset }
        for dirname in self.dataset_dirs:
            X_readers = []
            for filename in X_files:
                X_reader = pd.read_csv(dirname / filename, dtype=self.xdtype, **read_csv_kwargs)
                X_readers.append(X_reader)
            self.X_attr_readers_for_datasets.append(X_readers)
            y_reader = pd.read_csv(dirname / y_file, dtype=self.ydtype, **read_csv_kwargs)
            self.y_attr_reader_for_datasets.append(y_reader)

    def on_epoch_end(self):
        self._setup_new_chunked_readers()

    def _prefetch_batches(self):
        # Load raw X attribute tracks
        X_raw_attrs = OrderedDict({})
        for X_attr_readers in self.X_attr_readers_for_datasets:
            for X_attribute, X_attr_reader in zip(X_attributes, X_attr_readers):
                X_attr_track = next(X_attr_reader)
                X_attr_track = np.nan_to_num(X_attr_track.to_numpy())
                if X_attribute in X_raw_attrs:
                    X_raw_attrs[X_attribute] = np.concatenate((X_raw_attrs[X_attribute], X_attr_track))
                else:
                    X_raw_attrs[X_attribute] = X_attr_track
        
        # Calculate X features
        X_feature_tracks = None
        for X_feature_name, X_feature_func in X_features.items():
            X_feature_track = X_feature_func(X_raw_attrs)
            X_feature_track = X_feature_scalers[X_feature_name].transform(X_feature_track)
            if X_feature_tracks is None:
                X_feature_tracks = X_feature_track
            else:
                X_feature_tracks = np.dstack((X_feature_tracks, X_feature_track))

        # Load labels, using the first label of each sample as its class
        y_combined = None # dim (None,)
        for y_attr_reader in self.y_attr_reader_for_datasets:
            y_attr_track = next(y_attr_reader) # dim (None, sample_length)
            y_attr_track = np.nan_to_num(y_attr_track.to_numpy()) # dim (None, sample_length)
            y_attr_track = y_attr_track[:, 0] # dim (None,)
            y_combined = y_attr_track if y_combined is None else np.concatenate((y_combined, y_attr_track), axis=0)
        
        # Shuffle data points
        assert len(X_feature_tracks) == len(y_combined)
        p = np.random.permutation(len(y_combined))
        X_feature_tracks = X_feature_tracks[p]
        y_combined = y_combined[p]

        # Pack the prefetched data into batches
        self.prefetched_X_batches = np.split(X_feature_tracks, len(X_feature_tracks) // self.batch_size, axis=0)
        self.prefetched_y_batches = np.split(y_combined, len(y_combined) // self.batch_size, axis=0)

    def __getitem__(self, step_idx):
        if self.prefetched_y_batches is None:
            is_prefetched = False
        else:
            n_prefetched_batches = len(self.prefetched_y_batches)
            is_above_prefetch = step_idx > (self.prefetched_base_step_idx + n_prefetched_batches - 1)
            is_below_prefetch = step_idx < self.prefetched_base_step_idx
            is_prefetched = (not is_above_prefetch) and (not is_below_prefetch)

        if not is_prefetched:
            self._prefetch_batches()
            self.prefetched_base_step_idx = step_idx

        scoped_idx = step_idx - self.prefetched_base_step_idx
        X = self.prefetched_X_batches[scoped_idx]
        y = self.prefetched_y_batches[scoped_idx]

        return X, y
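
The shuffle-and-split step inside `_prefetch_batches` can be tried in isolation. The sketch below uses made-up toy arrays and a small batch size; it shows that indexing `X` and `y` with the same permutation keeps samples and labels aligned, and that `np.split` then cuts the chunk into equally sized batches.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size = 4

# Hypothetical prefetched chunk: 16 samples, 5 timesteps, 3 features
X = np.arange(16 * 5 * 3, dtype=np.float32).reshape(16, 5, 3)
y = np.arange(16)  # label i belongs to sample i

# One permutation, applied to both arrays, keeps the pairs together
p = rng.permutation(len(y))
X_shuffled, y_shuffled = X[p], y[p]

# np.split raises a ValueError if the chunk cannot be divided evenly
X_batches = np.split(X_shuffled, len(X_shuffled) // batch_size, axis=0)
y_batches = np.split(y_shuffled, len(y_shuffled) // batch_size, axis=0)

print(len(X_batches), X_batches[0].shape, y_batches[0])
```

Since `np.split` needs an even division, the number of rows in a prefetched chunk has to be a multiple of the batch size, which holds for the power-of-two prefetch sizes used in this notebook.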

In [20]:
import keras_tuner as kt

tuner = kt.Hyperband(
    hypermodel=make_resnet, 
    objective=kt.Objective("val_f1", direction="max"), 
    max_epochs=15, 
    overwrite=True,
    directory='models',
    project_name='shl-resnet-gridsearch',
)

tuner.search_space_summary()

Search space summary
Default search space size: 4
n_layers (Int)
{'default': None, 'conditions': [], 'min_value': 1, 'max_value': 5, 'step': 1, 'sampling': None}
block_0_maps (Int)
{'default': None, 'conditions': [], 'min_value': 64, 'max_value': 256, 'step': 64, 'sampling': None}
block_0_regularizer (Boolean)
{'default': True, 'conditions': []}
block_0_activation (Choice)
{'default': 'lrelu', 'conditions': [], 'values': ['lrelu', 'relu'], 'ordered': False}


In [21]:
# Define callbacks for our training

from tensorflow.keras import callbacks

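decay_lr = callbacks.ReduceLROnPlateau(
    monitor='val_f1',
    mode='max', # F1 should increase; 'auto' would treat it like a loss
    factor=0.5, 
    patience=5, # Epochs
    min_lr=0.0001, 
    verbose=1
)

stop_early = callbacks.EarlyStopping(
    monitor='val_f1', 
    mode='max', # F1 should increase; 'auto' would treat it like a loss
    patience=10, # Epochs
    verbose=1
)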
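
To get a feeling for the decay settings, the following sketch simulates a simplified plateau rule in plain Python. It is not the Keras implementation (it ignores `min_delta` and `cooldown`), and the function name `simulate_plateau_decay` is made up. The starting rate of 0.001 is the Keras default for Adam.

```python
def simulate_plateau_decay(scores, initial_lr=0.001, factor=0.5, patience=5, min_lr=0.0001):
    """Return the learning rate in effect after each epoch."""
    lr = initial_lr
    best = float('-inf')
    wait = 0
    history = []
    for score in scores:
        if score > best:
            # Improvement: remember it and reset the patience counter
            best = score
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # Plateau: scale the rate down, but never below min_lr
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

# A validation score that never improves after the first epoch
lrs = simulate_plateau_decay([0.5] * 21)
print(lrs[4], lrs[5], lrs[10], lrs[20])
```

Under these simplified rules the rate is halved every five stalled epochs. Because the tuner below runs at most 15 epochs per trial and early stopping has a patience of 10, only the first one or two reductions could take effect within a trial.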

In [22]:
# Use batch generators to not preprocess the whole dataset at once   
train_generator = SHLDatasetGenerator(
    TRAIN_DATASET_DIRS, 
    prefetch_size_per_dataset=pow(2, 16),
)
validation_generator = SHLDatasetGenerator(
    VALIDATION_DATASET_DIRS,
    prefetch_size_per_dataset=pow(2, 14),
)

Counting samples in shl-dataset/challenge-2019-train_torso/train/Torso: 196072it [00:02, 74668.36it/s]
Counting samples in shl-dataset/challenge-2019-train_bag/train/Bag: 196072it [00:02, 72856.26it/s]
Counting samples in shl-dataset/challenge-2019-train_hips/train/Hips: 196072it [00:02, 72574.18it/s]
Counting samples in shl-dataset/challenge-2020-train_hand/train/Hand: 196072it [00:02, 72874.70it/s]
Counting samples in shl-dataset/challenge-2020-validation/validation/Torso: 28789it [00:00, 1008441.84it/s]
Counting samples in shl-dataset/challenge-2020-validation/validation/Bag: 28789it [00:00, 92226.69it/s]
Counting samples in shl-dataset/challenge-2020-validation/validation/Hips: 28789it [00:00, 999402.57it/s]
Counting samples in shl-dataset/challenge-2020-validation/validation/Hand: 28789it [00:00, 994046.56it/s]
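
With the sample counts printed above, the arithmetic in `__len__` can be worked through by hand. The sketch below repeats that calculation as a standalone function; the name `steps_per_epoch` is made up, the formula is the one from the generator.

```python
def steps_per_epoch(samples_per_dataset, prefetch_size_per_dataset, batch_size=64):
    n_datasets = len(samples_per_dataset)
    # Truncate all datasets to the length of the shortest one
    max_n_samples = min(samples_per_dataset) * n_datasets
    # Only full prefetch chunks across all datasets are used
    padding = prefetch_size_per_dataset * n_datasets
    max_n_samples -= max_n_samples % padding
    return max_n_samples // batch_size

train_steps = steps_per_epoch([196072] * 4, prefetch_size_per_dataset=2**16)
val_steps = steps_per_epoch([28789] * 4, prefetch_size_per_dataset=2**14)
print(train_steps, val_steps)  # 8192 1024
```

The truncation to full prefetch chunks means that each epoch uses 131072 of the 196072 training samples per dataset and 16384 of the 28789 validation samples per dataset. A smaller prefetch size would discard fewer samples at the cost of shuffling within smaller chunks.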


In [None]:
# Hyperparameter search with Keras Tuner (Hyperband)

tuner.search(
    train_generator,
    callbacks=[decay_lr, stop_early],
    validation_data=validation_generator,
    verbose=1,
    shuffle=False, # Shuffling doesn't work with our prefetching
    class_weight=CLASS_WEIGHTS
)


Search: Running Trial #1

Hyperparameter    |Value             |Best Value So Far 
n_layers          |2                 |?                 
block_0_maps      |256               |?                 
block_0_regular...|True              |?                 
block_0_activation|lrelu             |?                 
tuner/epochs      |2                 |?                 
tuner/initial_e...|0                 |?                 
tuner/bracket     |2                 |?                 
tuner/round       |0                 |?                 



In [None]:
import shutil

from google.colab import files

# make_archive returns the path of the archive it created
zip_filename = shutil.make_archive('models/shl-resnet-gridsearch', 'zip', 'models/shl-resnet-gridsearch')
files.download(zip_filename) # Download to control machine