# Experiment PAMAP2 with mcfly

This tutorial walks you through the functionality of mcfly. As an example we use the publicly available PAMAP2 dataset, which contains time series data from movement sensors worn by nine individuals. The data is labelled with the activity types these individuals performed, and the aim is to train and evaluate a classifier.

Before you start, please make sure you have installed all the dependencies of mcfly (listed in requirements.txt) and that your Jupyter notebook has a Python 3 kernel.

## Import required Python modules

In [1]:
import sys
import os
sys.path.insert(0, os.path.abspath('../..'))
import numpy as np
import pandas as pd
# mcfly
from mcfly import tutorial_pamap2, modelgen, find_architecture, storage
# The Keras module is used for the deep learning
import keras
from keras.utils.np_utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, Activation, Convolution1D, Flatten, MaxPooling1D
from keras.optimizers import Adam
# We can set some backend options to avoid NaNs
from keras import backend as K

Using Theano backend.


## Load the data

In [2]:
datapath = '/media/sf_VBox_Shared/timeseries/PAMAP2_Dataset/slidingwindow512cleaned/'
Xs = []
ys = []

ext = '.npy'
for i in range(9):
    Xs.append(np.load(datapath+'X_train_'+str(i)+ext))
    ys.append(np.load(datapath+'y_train_binary'+str(i)+ext))
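
Before going further it can help to confirm that the per-subject arrays line up. The sketch below is not part of mcfly: `check_subject_arrays` is a hypothetical helper, and the toy arrays stand in for the real files, assuming each `X` has shape (windows, time steps, channels) and each `y` is one-hot encoded.

```python
import numpy as np

def check_subject_arrays(X_list, y_list):
    """Return (timesteps, channels, classes) after checking every subject agrees."""
    if len(X_list) != len(y_list):
        raise ValueError("need one label array per data array")
    shapes = set()
    for X, y in zip(X_list, y_list):
        if X.ndim != 3 or y.ndim != 2:
            raise ValueError("expected 3-D windows and 2-D one-hot labels")
        if X.shape[0] != y.shape[0]:
            raise ValueError("number of windows and labels differ")
        if not np.all(y.sum(axis=1) == 1):
            raise ValueError("labels are not one-hot")
        shapes.add((X.shape[1], X.shape[2], y.shape[1]))
    if len(shapes) != 1:
        raise ValueError("subjects disagree on window or label shape")
    return shapes.pop()

# Toy stand-ins for two subjects: 4 and 6 windows of 512 steps x 9 channels, 3 classes.
rng = np.random.RandomState(0)
toy_Xs = [rng.randn(4, 512, 9), rng.randn(6, 512, 9)]
toy_ys = [np.eye(3)[rng.randint(0, 3, size=n)] for n in (4, 6)]
summary = check_subject_arrays(toy_Xs, toy_ys)
print(summary)
```

With the real data you would pass `Xs` and `ys` instead of the toy lists.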

## Generate models

The first step is to create a model architecture. As we do not know which architecture is best for our data, we create a set of models and investigate which one is most suitable for our data and classification task. You need to specify how many models to create with the argument 'number_of_models', the type of model, which can be 'CNN' or 'DeepConvLSTM', and the maximum number of layers per model type. For a full overview of the optional arguments, see the function documentation of modelgen.generate_models.

In [13]:
num_classes = ys[0].shape[1]
np.random.seed(123)
models = modelgen.generate_models(Xs[0].shape,
                                  number_of_classes=num_classes,
                                  number_of_models = 25)
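
Generating many candidates with random settings is a form of random search. The sketch below only illustrates the idea and is not how mcfly samples architectures: `sample_hyperparameters` and all value ranges are assumptions, with keys borrowed from the parameter dictionary that mcfly reports for its DeepConvLSTM models.

```python
import numpy as np

def sample_hyperparameters(rng, max_conv_layers=10, max_lstm_layers=5):
    """Draw one hypothetical DeepConvLSTM-style configuration."""
    n_conv = rng.randint(1, max_conv_layers + 1)   # upper bound is exclusive
    n_lstm = rng.randint(1, max_lstm_layers + 1)
    return {
        'filters': rng.randint(10, 101, size=n_conv),
        'lstm_dims': rng.randint(10, 101, size=n_lstm),
        # Log-uniform draws spread candidates evenly across orders of magnitude.
        'learning_rate': 10 ** rng.uniform(-4, -1),
        'regularization_rate': 10 ** rng.uniform(-4, -1),
    }

rng = np.random.RandomState(123)
candidates = [sample_hyperparameters(rng) for _ in range(25)]
print(len(candidates), sorted(candidates[0]))
```

Fixing the seed, as the notebook does with `np.random.seed(123)`, makes the set of candidates reproducible.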

## Compare models
Now that the model architectures have been generated, it is time to compare the models by training them on a subset of the training data and evaluating them on a validation subset. This helps us choose the best candidate model. Performance results are stored in a json file.

In [14]:
# Define directory where the results, e.g. json file, will be stored
resultpath = '/media/sf_VBox_Shared/timeseries/PAMAP2_Dataset/results_test/' 
if not os.path.exists(resultpath):
    os.makedirs(resultpath)

In [15]:
def split_train_test(X_list, y_list, j):
    X_train = np.concatenate(X_list[0:j]+X_list[j+1:])
    X_test = X_list[j]
    y_train = np.concatenate(y_list[0:j]+y_list[j+1:])
    y_test = y_list[j]
    return X_train, y_train, X_test, y_test

def split_train_small_val(X_list, y_list, j, trainsize=500, valsize=500):
    X = np.concatenate(X_list[0:j]+X_list[j+1:])
    y = np.concatenate(y_list[0:j]+y_list[j+1:])
    rand_ind = np.random.choice(X.shape[0], trainsize+valsize, replace=False)
    X_train = X[rand_ind[:trainsize]]
    y_train = y[rand_ind[:trainsize]]
    X_val = X[rand_ind[trainsize:]]
    y_val = y[rand_ind[trainsize:]]
    return X_train, y_train, X_val, y_val
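
To see what the leave-one-subject-out split does, here is a minimal, self-contained sketch on toy data. It repeats `split_train_test` from the cell above so that it runs on its own. The three toy "subjects" and their sizes are made up for illustration.

```python
import numpy as np

def split_train_test(X_list, y_list, j):
    """Hold out subject j for testing and train on all other subjects."""
    X_train = np.concatenate(X_list[0:j] + X_list[j+1:])
    y_train = np.concatenate(y_list[0:j] + y_list[j+1:])
    return X_train, y_train, X_list[j], y_list[j]

# Three toy subjects with 2, 3 and 4 windows; every value encodes the subject index.
sizes = [2, 3, 4]
toy_Xs = [np.full((n, 5, 2), float(s)) for s, n in enumerate(sizes)]
toy_ys = [np.full((n, 1), s) for s, n in enumerate(sizes)]

X_train, y_train, X_test, y_test = split_train_test(toy_Xs, toy_ys, 1)
print(X_train.shape, X_test.shape)            # (6, 5, 2) (3, 5, 2)
print(np.unique(y_train), np.unique(y_test))  # [0 2] [1]
```

No window of the held-out subject ends up in the training set, so the test score reflects performance on an unseen person.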

In [17]:
import time
t = time.time()
np.random.seed(123)
histories_list, val_accuracies_list, val_losses_list = [], [], []
for j in range(len(Xs)):
    X_train, y_train, X_val, y_val = split_train_small_val(Xs, ys, j, trainsize=10, valsize=10)
    histories, val_accuracies, val_losses = find_architecture.train_models_on_samples(X_train, y_train,
                                                                           X_val, y_val,
                                                                           models[:2],
                                                                           nr_epochs=1,
                                                                           subset_size=500,
                                                                           verbose=True,
                                                                           outputfile=resultpath+\
                                                                                  'experiment'+str(j)+'.json')
    histories_list.append(histories)
    val_accuracies_list.append(val_accuracies)
    val_losses_list.append(val_losses)
print(time.time()-t)

Training model 0 DeepConvLSTM
Train on 10 samples, validate on 10 samples
Epoch 1/1
Training model 1 DeepConvLSTM
Train on 10 samples, validate on 10 samples
Epoch 1/1
Training model 0 DeepConvLSTM
Train on 10 samples, validate on 10 samples
Epoch 1/1
Training model 1 DeepConvLSTM
Train on 10 samples, validate on 10 samples
Epoch 1/1
Training model 0 DeepConvLSTM
Train on 10 samples, validate on 10 samples
Epoch 1/1
Training model 1 DeepConvLSTM
Train on 10 samples, validate on 10 samples
Epoch 1/1
Training model 0 DeepConvLSTM
Train on 10 samples, validate on 10 samples
Epoch 1/1
Training model 1 DeepConvLSTM
Train on 10 samples, validate on 10 samples
Epoch 1/1
Training model 0 DeepConvLSTM
Train on 10 samples, validate on 10 samples
Epoch 1/1
Training model 1 DeepConvLSTM
Train on 10 samples, validate on 10 samples
Epoch 1/1
Training model 0 DeepConvLSTM
Train on 10 samples, validate on 10 samples
Epoch 1/1
Training model 1 DeepConvLSTM
Train on 10 samples, validate on 10 samples
Ep

In [20]:
val_accuracies = np.array(val_accuracies_list)
val_accuracies_avg = val_accuracies.mean(axis=0)
val_accuracies_avg

array([ 0.04444445,  0.11111111])
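
The averaging step relies on the layout of the array: one row per held-out subject (fold) and one column per model, so `mean(axis=0)` yields one score per model. A small sketch with made-up accuracies shows this, and also why the best model should be picked from the averaged vector and not from the raw matrix.

```python
import numpy as np

# Made-up validation accuracies: 3 folds (rows) x 2 models (columns).
fold_acc = np.array([[0.50, 0.25],
                     [0.25, 0.50],
                     [0.00, 0.75]])

avg_per_model = fold_acc.mean(axis=0)        # one value per model
best_by_average = int(np.argmax(avg_per_model))
flat_index = int(np.argmax(fold_acc))        # index into the flattened 3x2 matrix

print(avg_per_model)
print(best_by_average)   # 1
print(flat_index)        # 5, which is not a valid model index
```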

In [31]:
train_acc = np.array([[history.history['acc'][-1] for history in histories] for histories in histories_list])
train_loss = np.array([[history.history['loss'][-1] for history in histories] for histories in histories_list])
val_acc = np.array([[history.history['val_acc'][-1] for history in histories] for histories in histories_list])
val_loss = np.array([[history.history['val_loss'][-1] for history in histories] for histories in histories_list])

Another way of comparing model performance is by putting all the information in a pandas dataframe, which we can store in a csv file.

In [32]:
modelcomparisons = pd.DataFrame({'model':[str(params) for model, params, model_types in models[:2]],
                       'train_acc': train_acc.mean(axis=0),
                       'train_loss': train_loss.mean(axis=0),
                       'val_acc': val_acc.mean(axis=0),
                       'val_loss': val_loss.mean(axis=0),
                       })
modelcomparisons

| | model | train_acc | train_loss | val_acc | val_loss |
|---|---|---|---|---|---|
| 0 | {'lstm_dims': array([35]), 'filters': array([2... | 0.144444 | 8.131043 | 0.044444 | 2.546965 |
| 1 | {'lstm_dims': array([10, 43, 98]), 'filters': ... | 0.1 | 3.805315 | 0.111111 | 2.940969 |
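
The text above mentions storing the comparison in a csv file, which pandas supports through `DataFrame.to_csv`. A minimal sketch, assuming a small hand-made table in place of `modelcomparisons` and an in-memory buffer in place of a real file path:

```python
import io
import pandas as pd

# Hand-made stand-in for the comparison table; the numbers are illustrative only.
toy_comparisons = pd.DataFrame({'model': ['model_0', 'model_1'],
                                'val_acc': [0.25, 0.5],
                                'val_loss': [2.5, 3.0]})

buffer = io.StringIO()                 # replace with a file path to write to disk
toy_comparisons.to_csv(buffer, index=False)

# Read the table back to confirm the round trip.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded)
```

Passing `index=False` keeps the row index out of the file. Otherwise it comes back as an extra `Unnamed: 0` column when the file is read again.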


It is also possible to visualize the performance of the various models using our visualization tool, as explained in the mcfly repository README file: https://github.com/NLeSC/mcfly/blob/master/README.md

Check which model is the best

In [33]:
# Use the per-model average; val_accuracies is a (folds x models) matrix here
best_model_index = np.argmax(val_accuracies_avg)
best_model, best_params, best_model_types = models[best_model_index]
print('Model type and parameters of the best model:')
print(best_model_types)
print(best_params)

Model type and parameters of the best model:
DeepConvLSTM
{'lstm_dims': array([10, 43, 98]), 'filters': array([39, 36, 17, 30, 30, 66, 73]), 'learning_rate': 0.03529000836062186, 'regularization_rate': 0.0022288077746968685}


In [13]:
modelname = 'bestmodel_sample'
storage.savemodel(best_model,resultpath,modelname)

('/media/sf_VBox_Shared/timeseries//PAMAP2/PAMAP2_Dataset/results/bestmodel_sample_architecture.json',
 '/media/sf_VBox_Shared/timeseries//PAMAP2/PAMAP2_Dataset/results/bestmodel_sample_weights')

## Train the best model for real

Now that we have identified the best model architecture out of our random pool of models, we can continue by training the model on the full training sample. To speed up the example, we train the model only on the first 'datasize' samples (set to 20 in the code below). Replace this with 'datasize = X_train.shape[0]' in a real-world example.

In [75]:
from keras.models import model_from_json

def get_fresh_copy(model, lr):
    model_json = model.to_json()
    model_copy = model_from_json(model_json)
    model_copy.compile(loss='categorical_crossentropy',
                  optimizer=Adam(lr=lr),
                  metrics=['accuracy'])
    #for layer in model_copy.layers:
    #    layer.build(layer.input_shape)
    return model_copy

In [79]:
nr_epochs = 1

np.random.seed(123)
histories, test_accuracies_list, models = [], [], []
for j in range(len(Xs)):
    X_train, y_train, X_test, y_test = split_train_test(Xs, ys, j)
    model_copy = get_fresh_copy(best_model, best_params['learning_rate'])
    datasize = 20 #X_train.shape[0]
    
    history = model_copy.fit(X_train[:datasize,:,:], y_train[:datasize,:],
              nb_epoch=nr_epochs, validation_data=(X_test, y_test))
    
    histories.append(history)
    test_accuracies_list.append(history.history['val_acc'][-1] )
    models.append(model_copy)

Train on 20 samples, validate on 2155 samples
Epoch 1/1
Train on 20 samples, validate on 2290 samples
Epoch 1/1


KeyboardInterrupt: 

In [39]:
# Plot the training process:
find_architecture.plotTrainingProcess(histories[0])

In [41]:
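# Assumption: use the copy trained last in the loop above (best_model_copy is not defined in this notebook)
best_model_fullytrained = model_copy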

In [24]:
score_val = best_model_fullytrained.evaluate(X_val, y_val_binary, verbose=False)
print('Score of best model: ' + str(score_val))

Score of best model: [0.28872849385374272, 0.98405580468360743]
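
The two numbers returned by `evaluate` are the loss followed by the metrics the model was compiled with, here categorical cross-entropy and accuracy. As a rough illustration of what they measure (not Keras's implementation), the sketch below computes both from made-up one-hot labels and predicted probabilities. `loss_and_accuracy` is a hypothetical helper.

```python
import numpy as np

def loss_and_accuracy(y_true, y_prob, eps=1e-7):
    """Mean categorical cross-entropy and accuracy for one-hot labels."""
    y_prob = np.clip(y_prob, eps, 1.0)   # avoid log(0)
    loss = -np.mean(np.sum(y_true * np.log(y_prob), axis=1))
    accuracy = np.mean(np.argmax(y_prob, axis=1) == np.argmax(y_true, axis=1))
    return [float(loss), float(accuracy)]

y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])
y_prob = np.array([[0.5, 0.25, 0.25],
                   [0.25, 0.5, 0.25],
                   [0.25, 0.25, 0.5],
                   [0.5, 0.25, 0.25]])   # the last window is misclassified
score = loss_and_accuracy(y_true, y_prob)
print(score)
```

Three of the four toy windows are classified correctly, so the accuracy is 0.75. The loss also penalises how little probability was given to the true class of the fourth window.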


### Saving, loading and comparing reloaded model with original model

The model can be saved for future use. The savemodel function will save two separate files: a json file for the architecture and a npy (numpy array) file for the weights.

In [7]:
modelname = 'my_bestmodel'

In [18]:
storage.savemodel(best_model_fullytrained,resultpath,modelname)

('/media/sf_VBox_Shared/timeseries//PAMAP2/PAMAP2_Dataset/results/my_bestmodel_architecture.json',
 '/media/sf_VBox_Shared/timeseries//PAMAP2/PAMAP2_Dataset/results/my_bestmodel_weights')

In [8]:
best_model_fullytrained = storage.loadmodel(resultpath,modelname)

In [17]:
best_model_sample = storage.loadmodel(resultpath, 'bestmodel_sample')

FileNotFoundError: [Errno 2] No such file or directory: '/media/sf_VBox_Shared/timeseries/PAMAP2_Dataset/results/bestmodel_sample_architecture.json'

In [9]:
learning_rate = 0.0013927354361231595
best_model_fullytrained.compile(loss='categorical_crossentropy',
                  optimizer=Adam(lr=learning_rate),
                  metrics=['accuracy'])

In [10]:
best_model_sample.compile(loss='categorical_crossentropy',
                  optimizer=Adam(lr=learning_rate),
                  metrics=['accuracy'])

The model has been reloaded. Let's investigate whether it gives the same probability estimates as the original model on a small subset of the validation data.
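
One way to run that check is to compare the outputs of `predict` from both models with `np.allclose`. The sketch below uses stand-in arrays instead of real predictions, and `same_predictions` is a hypothetical helper, so treat it as an outline and not as a tested recipe for this notebook.

```python
import numpy as np

def same_predictions(probs_a, probs_b, atol=1e-6):
    """True when two arrays of class probabilities agree within a tolerance."""
    probs_a, probs_b = np.asarray(probs_a), np.asarray(probs_b)
    return probs_a.shape == probs_b.shape and bool(np.allclose(probs_a, probs_b, atol=atol))

# Stand-ins for the predict() output of the original and the reloaded model.
probs_original = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
probs_reloaded = probs_original + 1e-9      # tiny numerical noise
probs_other = probs_original[:, ::-1]       # a clearly different model

print(same_predictions(probs_original, probs_reloaded))  # True
print(same_predictions(probs_original, probs_other))     # False
```

A tolerance is used because saving and reloading weights can introduce small floating-point differences.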

## Advanced model inspection

Although it is beyond the scope of mcfly, it is worth highlighting that the objects 'models', 'best_model_fullytrained' and 'best_model' are Keras objects. This means that you can use Keras functions like .predict and .evaluate on these objects to run advanced analyses. These functions are all documented in the Keras documentation.

In [12]:
# Evaluate on the test set
score_test = best_model_fullytrained.evaluate(X_test, y_test_binary, verbose=False)
print('Score of best model: ' + str(score_test))

Score of best model: [1.5646683828375569, 0.52981849611063092]


In [13]:
X_test.shape

(2314, 512, 9)

In [25]:
best_model_sample.evaluate(X_test, y_test_binary, verbose=False)

[1.9208478596103324, 0.46672428694900603]

In [15]:
best_model_fullytrained.evaluate(X_val, y_val_binary, verbose=True)



[0.28872849385374272, 0.98405580468360743]

In [11]:
find_architecture.kNN_accuracy(X_train[:500,:,:], y_train_binary[:500,], X_test, y_test_binary, k=1)

0.3595505617977528

In [12]:
find_architecture.kNN_accuracy(X_train[:500,:,:], y_train_binary[:500,], X_val, y_val_binary, k=1)

0.44294967613353264

In [15]:
print(find_architecture.kNN_accuracy(X_train[:1000,:,:], y_train_binary[:1000,], X_test, y_test_binary, k=1))
print(find_architecture.kNN_accuracy(X_train[:1000,:,:], y_train_binary[:1000,], X_val, y_val_binary, k=1))

0.359982713915
0.470852017937
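
The `kNN_accuracy` calls above give a simple baseline to compare the neural network against. As an illustration of the idea only (this is not mcfly's implementation, and `one_nn_accuracy` is a hypothetical helper), a 1-nearest-neighbour classifier on flattened windows with Euclidean distance can be sketched in NumPy:

```python
import numpy as np

def one_nn_accuracy(X_train, y_train, X_test, y_test):
    """Accuracy of a 1-nearest-neighbour classifier with Euclidean distance."""
    A = X_train.reshape(len(X_train), -1)   # flatten each window to a vector
    B = X_test.reshape(len(X_test), -1)
    # Squared distances between every test window and every training window.
    d2 = (B ** 2).sum(axis=1)[:, None] - 2 * B @ A.T + (A ** 2).sum(axis=1)[None, :]
    nearest = d2.argmin(axis=1)
    predicted = y_train[nearest].argmax(axis=1)
    return float(np.mean(predicted == y_test.argmax(axis=1)))

def make_toy_data(n, rng):
    """Two well-separated classes, windows of 4 time steps x 2 channels."""
    labels = np.arange(n) % 2
    X = rng.randn(n, 4, 2) * 0.1 + labels[:, None, None] * 5.0
    return X, np.eye(2)[labels]

rng = np.random.RandomState(0)
X_tr, y_tr = make_toy_data(20, rng)
X_te, y_te = make_toy_data(10, rng)
baseline = one_nn_accuracy(X_tr, y_tr, X_te, y_te)
print(baseline)
```

On real sensor data the classes overlap far more than in this toy set, which is why the baseline scores reported above are much lower than those of the trained network on the validation set.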
