<a href="https://colab.research.google.com/github/PhilippMatthes/diplom/blob/master/src/shl-deep-learning-timeseries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using a deep CNN to directly classify SHL timeseries data

In [1]:
# Free up some disk space on colab
!rm -rf /usr/local/lib/python2.7
!rm -rf /swift
!rm -rf /usr/local/lib/python3.6/dist-packages/torch
!rm -rf /usr/local/lib/python3.6/dist-packages/pystan
!rm -rf /usr/local/lib/python3.6/dist-packages/spacy
!rm -rf /tensorflow-1.15.2/

In [2]:
# Get needed auxiliary files for colab
!git clone https://github.com/philippmatthes/diplom

Cloning into 'diplom'...
remote: Enumerating objects: 1680, done.
remote: Counting objects: 100% (1017/1017), done.
remote: Compressing objects: 100% (683/683), done.
remote: Total 1680 (delta 494), reused 783 (delta 287), pack-reused 663
Receiving objects: 100% (1680/1680), 34.50 MiB | 26.81 MiB/s, done.
Resolving deltas: 100% (870/870), done.


In [3]:
# Change into src dir and load our datasets
%cd /content/diplom/src
!mkdir shl-dataset

/content/diplom/src


In [4]:
# Download training datasets
!wget -nc -O shl-dataset/challenge-2019-train_torso.zip http://www.shl-dataset.org/wp-content/uploads/SHLChallenge2019/challenge-2019-train_torso.zip
!wget -nc -O shl-dataset/challenge-2019-train_bag.zip http://www.shl-dataset.org/wp-content/uploads/SHLChallenge2019/challenge-2019-train_bag.zip
!wget -nc -O shl-dataset/challenge-2019-train_hips.zip http://www.shl-dataset.org/wp-content/uploads/SHLChallenge2019/challenge-2019-train_hips.zip
!wget -nc -O shl-dataset/challenge-2020-train_hand.zip http://www.shl-dataset.org/wp-content/uploads/SHLChallenge2020/challenge-2020-train_hand.zip
# Download validation dataset
!wget -nc -O shl-dataset/challenge-2020-validation.zip http://www.shl-dataset.org/wp-content/uploads/SHLChallenge2020/challenge-2020-validation.zip

--2021-08-17 08:24:08--  http://www.shl-dataset.org/wp-content/uploads/SHLChallenge2019/challenge-2019-train_torso.zip
Resolving www.shl-dataset.org (www.shl-dataset.org)... 37.187.125.22
Connecting to www.shl-dataset.org (www.shl-dataset.org)|37.187.125.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5852446972 (5.5G) [application/zip]
Saving to: ‘shl-dataset/challenge-2019-train_torso.zip’


2021-08-17 08:32:38 (11.0 MB/s) - ‘shl-dataset/challenge-2019-train_torso.zip’ saved [5852446972/5852446972]

--2021-08-17 08:32:38--  http://www.shl-dataset.org/wp-content/uploads/SHLChallenge2019/challenge-2019-train_bag.zip
Resolving www.shl-dataset.org (www.shl-dataset.org)... 37.187.125.22
Connecting to www.shl-dataset.org (www.shl-dataset.org)|37.187.125.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5628524721 (5.2G) [application/zip]
Saving to: ‘shl-dataset/challenge-2019-train_bag.zip’


2021-08-17 08:40:55 (10.8 MB/s) - ‘shl-datas

In [5]:
# Unzip training datasets
!unzip -n -d shl-dataset/challenge-2019-train_torso shl-dataset/challenge-2019-train_torso.zip
!rm shl-dataset/challenge-2019-train_torso.zip
!unzip -n -d shl-dataset/challenge-2019-train_bag shl-dataset/challenge-2019-train_bag.zip
!rm shl-dataset/challenge-2019-train_bag.zip
!unzip -n -d shl-dataset/challenge-2019-train_hips shl-dataset/challenge-2019-train_hips.zip
!rm shl-dataset/challenge-2019-train_hips.zip
!unzip -n -d shl-dataset/challenge-2020-train_hand shl-dataset/challenge-2020-train_hand.zip
!rm shl-dataset/challenge-2020-train_hand.zip
# Unzip validation dataset
!unzip -n -d shl-dataset/challenge-2020-validation shl-dataset/challenge-2020-validation.zip
!rm shl-dataset/challenge-2020-validation.zip

Archive:  shl-dataset/challenge-2019-train_torso.zip
   creating: shl-dataset/challenge-2019-train_torso/train/Torso/
  inflating: shl-dataset/challenge-2019-train_torso/train/Torso/Acc_x.txt  
  inflating: shl-dataset/challenge-2019-train_torso/train/Torso/Acc_y.txt  
  inflating: shl-dataset/challenge-2019-train_torso/train/Torso/Acc_z.txt  
  inflating: shl-dataset/challenge-2019-train_torso/train/Torso/Gra_x.txt  
  inflating: shl-dataset/challenge-2019-train_torso/train/Torso/Gra_y.txt  
  inflating: shl-dataset/challenge-2019-train_torso/train/Torso/Gra_z.txt  
  inflating: shl-dataset/challenge-2019-train_torso/train/Torso/Gyr_x.txt  
  inflating: shl-dataset/challenge-2019-train_torso/train/Torso/Gyr_y.txt  
  inflating: shl-dataset/challenge-2019-train_torso/train/Torso/Gyr_z.txt  
  inflating: shl-dataset/challenge-2019-train_torso/train/Torso/Label.txt  
  inflating: shl-dataset/challenge-2019-train_torso/train/Torso/LAcc_x.txt  
  inflating: shl-dataset/challenge-2019-train

In [1]:
%cd /content/diplom/src
%tensorflow_version 2.x

/content/diplom/src


In [2]:
# Define all datasets to train our model on

from pathlib import Path

TRAIN_DATASET_DIRS = [
    Path('shl-dataset/challenge-2019-train_torso/train/Torso'),
    Path('shl-dataset/challenge-2019-train_bag/train/Bag'),
    Path('shl-dataset/challenge-2019-train_hips/train/Hips'),
    Path('shl-dataset/challenge-2020-train_hand/train/Hand'),
]

VALIDATION_DATASET_DIRS = [
    Path('shl-dataset/challenge-2020-validation/validation/Torso'),
    Path('shl-dataset/challenge-2020-validation/validation/Bag'),
    Path('shl-dataset/challenge-2020-validation/validation/Hips'),
    Path('shl-dataset/challenge-2020-validation/validation/Hand'),
]

In [3]:
# Define more useful constants about our dataset

LABEL_ORDER = [
    'Null',
    'Still',
    'Walking',
    'Run',
    'Bike',
    'Car',
    'Bus',
    'Train',
    'Subway',
]

SAMPLE_LENGTH = 500

In [4]:
# Results from data analysis

CLASS_WEIGHTS = {
    0: 0.0, # NULL label
    1: 1.0021671573438011, 
    2: 0.9985739895697523, 
    3: 2.8994439843842423, 
    4: 1.044135815617944, 
    5: 0.7723505499007343, 
    6: 0.8652474758172704, 
    7: 0.7842127155793044, 
    8: 1.0283208861290594
}
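
The analysis that produced these weights is not part of this notebook. One common way to obtain weights of this kind is inverse-frequency balancing, `n_samples / (n_classes * count)`, with the Null label excluded; whether the values above were derived exactly this way is an assumption. The helper name `balanced_class_weights` and the toy labels below are made up for illustration.

```python
from collections import Counter

def balanced_class_weights(labels, ignored_labels=(0,)):
    """Inverse-frequency class weights; ignored labels get a weight of 0.0."""
    kept = [label for label in labels if label not in ignored_labels]
    counts = Counter(kept)
    n_classes = len(counts)
    # Rare classes receive weights above 1, frequent classes below 1
    weights = {label: len(kept) / (n_classes * count) for label, count in counts.items()}
    for label in ignored_labels:
        weights[label] = 0.0
    return weights

# Toy labels: class 1 is frequent, class 3 is rare, 0 is the Null label
toy_labels = [0, 1, 1, 1, 1, 2, 2, 3]
weights = balanced_class_weights(toy_labels)
print(weights)
```

With a weight of 0.0, windows carrying the Null label contribute nothing to the loss when the dict is passed as `class_weight` to `model.fit`.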

In [5]:
# Define features for our dataset

from collections import OrderedDict

import numpy as np

# Attributes to load from our dataset
X_attributes = [
    'acc_x', 'acc_y', 'acc_z',
    'mag_x', 'mag_y', 'mag_z',
    'gyr_x', 'gyr_y', 'gyr_z',
]

# Files within the dataset that contain our attributes
X_files = [
    'Acc_x.txt', 'Acc_y.txt', 'Acc_z.txt',
    'Mag_x.txt', 'Mag_y.txt', 'Mag_z.txt',
    'Gyr_x.txt', 'Gyr_y.txt', 'Gyr_z.txt',
]

# Features to generate from our loaded attributes
# Note that `a` is going to be a dict of attribute tracks
X_features = OrderedDict({
    'acc_mag': lambda a: np.sqrt(a['acc_x']**2 + a['acc_y']**2 + a['acc_z']**2),
    'mag_mag': lambda a: np.sqrt(a['mag_x']**2 + a['mag_y']**2 + a['mag_z']**2),
    'gyr_mag': lambda a: np.sqrt(a['gyr_x']**2 + a['gyr_y']**2 + a['gyr_z']**2),
})

# Define where to find our labels for supervised learning
y_file = 'Label.txt'
y_attribute = 'labels'
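
To make the feature definitions concrete, here is a small sketch of the magnitude computation on made-up data. It assumes, as the loader further down does, that each attribute is a 2D array of shape `(windows, samples)`; the toy windows hold four samples instead of 500.

```python
import numpy as np

def acc_magnitude(a):
    # Same formula as the 'acc_mag' feature: Euclidean norm over the three axes
    return np.sqrt(a['acc_x']**2 + a['acc_y']**2 + a['acc_z']**2)

# Two made-up windows with four samples each
toy_attributes = {
    'acc_x': np.array([[3.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 6.0]]),
    'acc_y': np.array([[4.0, 0.0, 2.0, 0.0], [0.0, 5.0, 0.0, 8.0]]),
    'acc_z': np.array([[0.0, 1.0, 2.0, 0.0], [0.0, 12.0, 0.0, 0.0]]),
}

magnitudes = acc_magnitude(toy_attributes)
print(magnitudes.shape)  # (2, 4)
print(magnitudes)        # rows: [5, 1, 3, 0] and [0, 13, 0, 10]
```

The output keeps the `(windows, samples)` shape, so three axis tracks collapse into one track per sensor. The magnitude does not change when the phone is rotated, which is a plausible reason for preferring it over the raw axes.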

In [6]:
# Load pretrained power transformers for feature scaling

import joblib

X_feature_scalers = OrderedDict({})
for feature_name, _ in X_features.items():
    scaler = joblib.load(f'models/shl-scalers/{feature_name}.scaler.joblib')
    scaler.copy = False # Save memory
    X_feature_scalers[feature_name] = scaler



In [7]:
from tensorflow import keras

# Check that we can use our GPU, to not wait forever during training
from tensorflow.python.client import device_lib
device_lib.list_local_devices()

[name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 2350749061444566413, name: "/device:GPU:0"
 device_type: "GPU"
 memory_limit: 16183459840
 locality {
   bus_id: 1
   links {
   }
 }
 incarnation: 5182926104826231397
 physical_device_desc: "device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0"]

In [8]:
# Define helper functions for model creation

from tensorflow.keras import layers, models

def make_resnet_block(input_layer, block_height, kernel_sizes=(8, 5, 3)):
    # Three stacked convolutions with decreasing kernel size
    conv_x = layers.Conv1D(filters=block_height, kernel_size=kernel_sizes[0], padding='same')(input_layer)
    conv_x = layers.BatchNormalization()(conv_x)
    conv_x = layers.Activation('relu')(conv_x)

    conv_y = layers.Conv1D(filters=block_height, kernel_size=kernel_sizes[1], padding='same')(conv_x)
    conv_y = layers.BatchNormalization()(conv_y)
    conv_y = layers.Activation('relu')(conv_y)

    conv_z = layers.Conv1D(filters=block_height, kernel_size=kernel_sizes[2], padding='same')(conv_y)
    conv_z = layers.BatchNormalization()(conv_z)

    shortcut = layers.Conv1D(filters=block_height, kernel_size=1, padding='same')(input_layer)
    shortcut = layers.BatchNormalization()(shortcut)

    output_block = layers.add([shortcut, conv_z])
    output_block = layers.Activation('relu')(output_block)

    return output_block


def make_resnet(block_heights):
    input_shape = (SAMPLE_LENGTH, len(X_features))
    input_layer = layers.Input(input_shape)

    endpoint_layer = input_layer # Will be built now
    for height in block_heights:
        endpoint_layer = make_resnet_block(endpoint_layer, height)
    
    gap_layer = layers.GlobalAveragePooling1D()(endpoint_layer)
    output_layer = layers.Dense(len(LABEL_ORDER), activation='softmax')(gap_layer)

    model = models.Model(inputs=input_layer, outputs=output_layer)
    model.compile(
        loss='sparse_categorical_crossentropy',
        optimizer='adam',
        metrics=['acc']
    )

    return model

In [9]:
# Define our preprocessing pipeline

import pandas as pd

from tqdm import tqdm

class SHLDatasetGenerator(keras.utils.Sequence):    
    def __init__(
        self, 
        dataset_dirs, 
        batch_size=pow(2, 5), # 32
        prefetch_size_per_dataset=pow(2, 14), # 16384, can be 2^16 for train dataset
        xdtype=np.float64, # Use float16 with caution, can lead to overflows
        ydtype=np.int64 # The np.int alias was removed in NumPy 1.24
    ):
        self.dataset_dirs = dataset_dirs
        self.batch_size = batch_size
        
        self.prefetch_size_per_dataset = prefetch_size_per_dataset
        self.prefetched_X_batches = None
        self.prefetched_y_batches = None
        self.prefetched_base_step_idx = None

        self.xdtype = xdtype
        self.ydtype = ydtype

        # Count samples in datasets
        self.dataset_samples = [] # (dim datasets)
        for dataset_dir in dataset_dirs:
            # Every file in the dataset has the same length, use the labels file
            samples = 0
            with open(dataset_dir / y_file) as f:
                for _ in tqdm(f, desc=f'Counting samples in {dataset_dir}'):
                    samples += 1
            assert samples > prefetch_size_per_dataset
            self.dataset_samples.append(samples)
        
        self._setup_new_chunked_readers()

    def __len__(self):
        # Datasets should be of equal length, but in case some dataset
        # is shorter than the others, we need to truncate our samples
        max_n_samples = min(self.dataset_samples) * len(self.dataset_dirs)
        # Datasets need to be truncated by the prefetch size
        padding = self.prefetch_size_per_dataset * len(self.dataset_dirs)
        max_n_samples = max_n_samples - (max_n_samples % padding)
        return max_n_samples // self.batch_size

    def _setup_new_chunked_readers(self):
        # Throw away potentially existing readers
        self.X_attr_readers_for_datasets = [] # (dim datasets x readers)
        self.y_attr_reader_for_datasets = [] # (dim datasets)

        # Initialize new chunked csv readers
        read_csv_kwargs = { 'sep': ' ', 'header': None, 'chunksize': self.prefetch_size_per_dataset }
        for dirname in self.dataset_dirs:
            X_readers = []
            for filename in X_files:
                X_reader = pd.read_csv(dirname / filename, dtype=self.xdtype, **read_csv_kwargs)
                X_readers.append(X_reader)
            self.X_attr_readers_for_datasets.append(X_readers)
            y_reader = pd.read_csv(dirname / y_file, dtype=self.ydtype, **read_csv_kwargs)
            self.y_attr_reader_for_datasets.append(y_reader)

    def on_epoch_end(self):
        self._setup_new_chunked_readers()

    def _prefetch_batches(self):
        # Load raw X attribute tracks
        X_raw_attrs = OrderedDict({})
        for X_attr_readers in self.X_attr_readers_for_datasets:
            for X_attribute, X_attr_reader in zip(X_attributes, X_attr_readers):
                X_attr_track = next(X_attr_reader)
                X_attr_track = np.nan_to_num(X_attr_track.to_numpy())
                if X_attribute in X_raw_attrs:
                    X_raw_attrs[X_attribute] = np.concatenate((X_raw_attrs[X_attribute], X_attr_track))
                else:
                    X_raw_attrs[X_attribute] = X_attr_track
        
        # Calculate X features
        X_feature_tracks = None
        for X_feature_name, X_feature_func in X_features.items():
            X_feature_track = X_feature_func(X_raw_attrs)
            X_feature_track = X_feature_scalers[X_feature_name].transform(X_feature_track)
            if X_feature_tracks is None:
                X_feature_tracks = X_feature_track
            else:
                X_feature_tracks = np.dstack((X_feature_tracks, X_feature_track))

        y_combined = None # dim (None,)
        for y_attr_reader in self.y_attr_reader_for_datasets:
            y_attr_track = next(y_attr_reader) # dim (None, sample_length)
            y_attr_track = np.nan_to_num(y_attr_track.to_numpy()) # dim (None, sample_length)
            y_attr_track = y_attr_track[:, 0] # dim (None,), label of the first sample in each window
            y_combined = y_attr_track if y_combined is None else np.concatenate((y_combined, y_attr_track), axis=0)
        
        # Shuffle data points
        assert len(X_feature_tracks) == len(y_combined)
        p = np.random.permutation(len(y_combined))
        X_feature_tracks = X_feature_tracks[p]
        y_combined = y_combined[p]

        # Pack the prefetched data into batches
        self.prefetched_X_batches = np.split(X_feature_tracks, len(X_feature_tracks) // self.batch_size, axis=0)
        self.prefetched_y_batches = np.split(y_combined, len(y_combined) // self.batch_size, axis=0)

    def __getitem__(self, step_idx):
        if self.prefetched_y_batches is None:
            is_prefetched = False
        else:
            n_prefetched_batches = len(self.prefetched_y_batches)
            is_above_prefetch = step_idx > (self.prefetched_base_step_idx + n_prefetched_batches - 1)
            is_below_prefetch = step_idx < self.prefetched_base_step_idx
            is_prefetched = (not is_above_prefetch) and (not is_below_prefetch)

        if not is_prefetched:
            self._prefetch_batches()
            self.prefetched_base_step_idx = step_idx

        scoped_idx = step_idx - self.prefetched_base_step_idx
        X = self.prefetched_X_batches[scoped_idx]
        y = self.prefetched_y_batches[scoped_idx]

        return X, y

In [21]:
# Define a helper function to mutate genes

def mutate(gene, mutate_block_height=True, mutate_new_deep_blocks=True):
    assert len(gene) > 0
    assert mutate_block_height or mutate_new_deep_blocks

    mutants = []

    if mutate_block_height:
        # Create mutants that have a changed block height
        for index_to_mutate, _ in enumerate(gene):
            for mutation_factor in [0.5, 2.0]:
                mutant = []
                for i, block_height in enumerate(gene):
                    if i == index_to_mutate:
                        mutant.append(int(block_height * mutation_factor))
                    else:
                        mutant.append(block_height)
                mutants.append(mutant)
    if mutate_new_deep_blocks:           
        # Create mutants that have a new deep block
        # by copying and modifying the last deep block
        for mutation_factor in [0.5, 1.0, 2.0]:
            mutant = gene + [int(gene[-1] * mutation_factor)]
            mutants.append(mutant)

    return mutants

def gene_str(gene):
    return '-'.join([str(h) for h in gene])
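
As an illustration of what one generation looks like, the following sketch restates the mutation rules compactly (`mutate_sketch` is a stand-in written for this example, not the helper defined above) and lists the children of the gene `[64, 128, 128]`.

```python
def mutate_sketch(gene):
    mutants = []
    # Halve or double each block height in turn
    for index in range(len(gene)):
        for factor in (0.5, 2.0):
            mutant = list(gene)
            mutant[index] = int(gene[index] * factor)
            mutants.append(mutant)
    # Append a new block derived from the last block height
    for factor in (0.5, 1.0, 2.0):
        mutants.append(gene + [int(gene[-1] * factor)])
    return mutants

children = mutate_sketch([64, 128, 128])
for child in children:
    print('-'.join(str(h) for h in child))
# 32-128-128
# 128-128-128
# 64-64-128
# 64-256-128
# 64-128-64
# 64-128-256
# 64-128-128-64
# 64-128-128-128
# 64-128-128-256
```

A gene of length n therefore yields 2n + 3 children. Note that `int(1 * 0.5)` is 0, so a block height of 1 would mutate into an invalid filter count; the heights used here are far from that case.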

In [39]:
# Mutation-based architecture search: train every child of an ancestor gene

import os
import json
import shutil

from tensorflow.keras import callbacks

def train_child(child_gene):
    # Check if our model was already tested earlier
    child_dir = f'models/shl-resnet-{gene_str(child_gene)}'
    if not os.path.isdir(child_dir):
        os.mkdir(child_dir)
    elif os.path.isdir(f'{child_dir}/checkpoint'):
        print(f'Mutant with gene {child_gene} was already trained.')
        return

    # Create model and print a description to the filesystem
    model = make_resnet(child_gene)
    with open(f'{child_dir}/model.txt', 'w') as f:
        model.summary(print_fn=lambda x: f.write(x + '\n'))
    
    # Define hyperparameters and callbacks for our training
    n_epochs = 5

    decay_lr = callbacks.ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.5, 
        patience=2, # Epochs
        min_lr=0.0001, 
        verbose=1
    )
    stop_early = callbacks.EarlyStopping(
        monitor='val_loss', 
        patience=3, # Epochs
        verbose=1
    )

    log = callbacks.CSVLogger(f'{child_dir}/train.log', append=False)
    checkpoint = callbacks.ModelCheckpoint(
        f'{child_dir}/checkpoint', save_best_only=True, monitor='val_loss', verbose=1
    )

    # Use batch generators to not preprocess the whole dataset at once   
    train_generator = SHLDatasetGenerator(TRAIN_DATASET_DIRS, prefetch_size_per_dataset=pow(2, 15))
    validation_generator = SHLDatasetGenerator(VALIDATION_DATASET_DIRS)

    print(f'Training mutant with gene {child_gene}.')
    history = model.fit(
        train_generator,
        epochs=n_epochs,
        callbacks=[decay_lr, stop_early, log, checkpoint],
        validation_data=validation_generator,
        verbose=1,
        shuffle=False, # Shuffling doesn't work with our prefetching
        class_weight=CLASS_WEIGHTS
    )

    # make_archive appends the '.zip' extension itself, so pass the base name
    zip_filename = shutil.make_archive(child_dir, 'zip', child_dir)
    print(f'Finished training. Zipped our results under {zip_filename}. Make sure to download everything!')


def train_new_generation(ancestor_gene):
    print(f'Training a new generation with the ancestor gene {ancestor_gene}.')
    child_genes = mutate(ancestor_gene)
    for child_gene in child_genes:
        train_child(child_gene)
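
The archive step at the end of `train_child` relies on `shutil.make_archive`, whose first argument is a base name without extension: the function appends `.zip` itself and returns the final path. A minimal sketch in a temporary directory, where the directory and file names are placeholders:

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Placeholder model directory with a single file in it
    child_dir = os.path.join(tmp, 'shl-resnet-64-128-128')
    os.makedirs(child_dir)
    with open(os.path.join(child_dir, 'model.txt'), 'w') as f:
        f.write('placeholder summary\n')

    # base_name carries no extension; make_archive returns the path it wrote
    archive_path = shutil.make_archive(child_dir, 'zip', child_dir)
    archive_name = os.path.basename(archive_path)
    archive_exists = os.path.isfile(archive_path)

print(archive_name)  # shl-resnet-64-128-128.zip
```

Using the returned path in log messages avoids any mismatch between the announced and the actual file name.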

In [None]:
ancestor_gene = [64, 128, 128]
train_new_generation(ancestor_gene)

Training a new generation with the ancestor gene [64, 128, 128].
Mutant with gene [32, 128, 128] was already trained.


Counting samples in shl-dataset/challenge-2019-train_torso/train/Torso: 196072it [00:02, 93726.97it/s] 
Counting samples in shl-dataset/challenge-2019-train_bag/train/Bag: 196072it [00:01, 103653.39it/s]
Counting samples in shl-dataset/challenge-2019-train_hips/train/Hips: 196072it [00:02, 93124.42it/s] 
Counting samples in shl-dataset/challenge-2020-train_hand/train/Hand: 196072it [00:02, 89528.66it/s] 
Counting samples in shl-dataset/challenge-2020-validation/validation/Torso: 28789it [00:00, 94088.81it/s] 
Counting samples in shl-dataset/challenge-2020-validation/validation/Bag: 28789it [00:00, 73729.95it/s]
Counting samples in shl-dataset/challenge-2020-validation/validation/Hips: 28789it [00:00, 95332.47it/s]
Counting samples in shl-dataset/challenge-2020-validation/validation/Hand: 28789it [00:00, 74688.84it/s]


Training mutant with gene [128, 128, 128].
