# Tutorial PAMAP2 with mcfly

This tutorial is intended to talk you through the functionalities of mcfly. As an example dataset we use the publicly available [PAMAP2 dataset](https://archive.ics.uci.edu/ml/datasets/PAMAP2+Physical+Activity+Monitoring). It contains time series data from movement sensors worn by nine individuals. The data is labelled with the activity types that these individuals performed, and the aim is to train and evaluate a *classifier*.

Before you start, please make sure you have installed all the dependencies of mcfly (listed in requirements.txt) and that your Jupyter notebook has a Python 3 kernel.

## Import required Python modules

In [1]:
import sys
import os
import numpy as np
import pandas as pd
# mcfly
from mcfly import tutorial_pamap2, modelgen, find_architecture, storage
# The Keras module is used for the deep learning
import keras
from keras.utils.np_utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, Activation, Convolution1D, Flatten, MaxPooling1D
from keras.optimizers import Adam
np.random.seed(2)

Using TensorFlow backend.


In [2]:
sys.path.insert(0, os.path.abspath('../..'))
import tutorial

## Download and pre-process data

We have created a function for you to fetch and pre-process the data. Please specify the 'directory_to_extract_to' in the code below and then execute the cell. This will download the data into the directory and create a subdirectory 'PAMAP2'. The output of the function is 'outputpath', which indicates where the data was stored.

In [3]:
# Specify in which directory you want to store the data:
directory_to_extract_to = 'data'

In [5]:
# Specify which columns to use. You can leave this as it is.
columns_to_use = ['hand_acc_16g_x', 'hand_acc_16g_y', 'hand_acc_16g_z',
                 'ankle_acc_16g_x', 'ankle_acc_16g_y', 'ankle_acc_16g_z',
                 'chest_acc_16g_x', 'chest_acc_16g_y', 'chest_acc_16g_z']
exclude_activities = [9, 10, 11, 18, 19, 20, 0]
outputpath = tutorial.fetch_and_preprocess(directory_to_extract_to, columns_to_use,
                                           exclude_activities=exclude_activities,
                                           val_test_size=(100, 1000))

Data previously downloaded and stored in data/PAMAP2/
Start pre-processing all 9 files...
Stored data/PAMAP2/PAMAP2_Dataset/output/X_train y_train
Stored data/PAMAP2/PAMAP2_Dataset/output/X_val y_val
Stored data/PAMAP2/PAMAP2_Dataset/output/X_test y_test
Processed data succesfully stored in data/PAMAP2/PAMAP2_Dataset/output


## Load the pre-processed data

Load the pre-processed data as stored in NumPy files. Please note that the data has already been split into training (train), validation (val), and test subsets. It is common practice to call the input data X and the labels y.

In [6]:
X_train, y_train_binary, X_val, y_val_binary, X_test, y_test_binary = tutorial_pamap2.load_data(outputpath)

We can inspect the shape of the data. The shape of X is a tuple of the number of samples, the length of the time series, and the number of channels for each sample. The shape of y is a tuple of the number of samples and the number of classes. Labels are formatted as a binary array where only the correct class for each sample is assigned a 1. This is called one-hot encoding.

In [7]:
print('x shape:', X_train.shape)
print('y shape:', y_train_binary.shape)

x shape: (15718, 512, 9)
y shape: (15718, 12)
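
To make the one-hot format of the labels concrete, here is a minimal sketch using NumPy only. The integer labels and the number of classes are made-up values for illustration; they are not taken from PAMAP2.

```python
import numpy as np

# Hypothetical integer class labels for five samples and four classes.
labels = np.array([0, 2, 1, 3, 2])
num_classes = 4

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing with the labels yields one row per sample.
one_hot = np.eye(num_classes, dtype=int)[labels]
print(one_hot.shape)  # (number of samples, number of classes)

# Taking the argmax along the class axis recovers the integer labels.
recovered = one_hot.argmax(axis=1)
print(recovered)
```

Each row contains exactly one 1, in the column of the correct class, which is the same layout as `y_train_binary`.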


The data is split into training, validation, and test sets.

In [8]:
print('train set size:', X_train.shape[0])
print('validation set size:', X_val.shape[0])
print('test set size:', X_test.shape[0])

train set size: 15718
validation set size: 100
test set size: 1000


## Generate models

The first step is to create a model architecture. Because we do not know which architecture is best for our data, we create a set of models and investigate which one is most suitable for our data and classification task. You need to specify how many models you want to create with the argument 'number_of_models'. For a full overview of the optional arguments, see the function documentation of modelgen.generate_models by running `modelgen.generate_models?`.

##### What number of models to select?
This number differs per dataset. Evaluating more models increases the chance of finding a good architecture, but it takes longer. For the purpose of this tutorial we recommend trying only 2 models to begin with. If you have enough time you can try a larger number: because mcfly uses random search, sampling more models covers more of the search space.

In [9]:
num_classes = y_train_binary.shape[1]
#%pdb on
models = modelgen.generate_models(X_train.shape,
                                  number_of_classes=num_classes,
                                  number_of_models=2)  # 2 models, as recommended above; increase if you have time
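
The idea behind random search can be sketched in a few lines of plain Python. This is an illustration only, not mcfly's implementation: the function name `sample_hyperparameters` and all value ranges are hypothetical.

```python
import random

def sample_hyperparameters(rng):
    """Draw one random configuration (hypothetical ranges, for illustration)."""
    return {
        # Sample the exponent uniformly so each order of magnitude is equally likely.
        "learning_rate": 10 ** rng.uniform(-4, -1),
        "regularization_rate": 10 ** rng.uniform(-4, -1),
        # Random depth (1 to 5 layers), then a random filter count per layer.
        "filters": [rng.randint(10, 100) for _ in range(rng.randint(1, 5))],
    }

rng = random.Random(2)  # fixed seed so the draws are reproducible
candidates = [sample_hyperparameters(rng) for _ in range(2)]
for config in candidates:
    print(config)
```

Every extra candidate is one more independent draw from the search space, which is why a larger number of models improves the odds of hitting a good configuration.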



## Inspect the models
We can have a look at the models that were generated. The layers are shown as table rows. The most common layer types are 'Convolution', 'LSTM', and 'Dense'. For more information see the [mcfly user manual](https://github.com/NLeSC/mcfly/wiki/User-manual). The summary also shows the shape of each layer's output and the number of parameters that are trained within that layer.

In [10]:
models_to_print = range(len(models))
for i, item in enumerate(models):
    if i in models_to_print:
        model, params, model_types = item
        print("-------------------------------------------------------------------------------------------------------")
        print("Model " + str(i))
        print(" ")
        print("Hyperparameters:")
        print(params)
        print(" ")
        print("Model description:")
        model.summary()
        print(" ")
        print("Model type:")
        print(model_types)
        print(" ")

-------------------------------------------------------------------------------------------------------
Model 0
 
Hyperparameters:
{'regularization_rate': 0.000871406843381289, 'learning_rate': 0.05241610537437882, 'filters': array([76, 19, 18, 96]), 'fc_hidden_nodes': 1391}
 
Model description:
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
batchnormalization_1 (BatchNorma (None, 512, 9)        36          batchnormalization_input_1[0][0] 
____________________________________________________________________________________________________
convolution1d_1 (Convolution1D)  (None, 512, 76)       2128        batchnormalization_1[0][0]       
____________________________________________________________________________________________________
batchnormalization_2 (BatchNorma (None, 512, 76)       304         convolution1d_1[0][0]         

## Compare models
Now that the model architectures have been generated, it is time to compare the models by training them on a subset of the training data and evaluating them on the validation subset. This will help us choose the best candidate model. Performance results are stored in a JSON file.

In [11]:
# Define directory where the results, e.g. json file, will be stored

resultpath = os.path.join(outputpath, 'models')
if not os.path.exists(resultpath):
    os.makedirs(resultpath)

We are now going to train each of the models that we generated. On the one hand we want to train them as quickly as possible in order to be able to pick the best one as soon as possible. On the other hand we have to train each model long enough to get a good impression of its potential.

We can influence the training time by adjusting the number of data samples that are used. This can be set with the argument 'subset_size'. We can also adjust the number of times the subsample is looped over; each full pass over the subsample is called an epoch. We recommend starting with no more than 5 epochs and a maximum subset size of 300. You can experiment with these numbers.

In [12]:
outputfile = os.path.join(resultpath, 'modelcomparison.json')
histories, val_accuracies, val_losses = find_architecture.train_models_on_samples(X_train, y_train_binary,
                                                                           X_val, y_val_binary,
                                                                           models, nr_epochs=5,
                                                                           subset_size=300,
                                                                           verbose=True,
                                                                           outputfile=outputfile)
print('Details of the training process were stored in ',outputfile)

Training model 0 CNN
Train on 300 samples, validate on 100 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Training model 1 DeepConvLSTM
Train on 300 samples, validate on 100 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Details of the training process were stored in  data/PAMAP2/PAMAP2_Dataset/output/models/modelcomparison.json
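
To illustrate what training on a subset means, the following sketch draws 300 random samples from a pair of stand-in arrays. It is an assumption-laden illustration: the arrays are random placeholders with a shorter time series than the real data, and mcfly may select its subset differently (for example, by taking the first samples).

```python
import numpy as np

rng = np.random.RandomState(2)

# Stand-in arrays with the same layout as X_train and y_train_binary,
# but much smaller so the example runs instantly.
X = rng.normal(size=(1000, 16, 9))
y = np.eye(12)[rng.randint(0, 12, size=1000)]

subset_size = 300
# Pick 300 distinct sample indices at random (no sample is used twice).
idx = rng.choice(X.shape[0], size=subset_size, replace=False)
X_subset, y_subset = X[idx], y[idx]

print(X_subset.shape, y_subset.shape)
```

Using the same index array for both X and y keeps every sample paired with its own label.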


## Inspect model performance (visualization)

Details about the learning process can be visualized. To use mcfly's visualization, navigate to the html folder and start a web server, for example with `python3 -m http.server`.
Note the port number the web server is serving on. This is usually 8000.
With a web browser, navigate to [localhost:8000](http://localhost:8000). There you can upload the JSON file that contains the details of the training process.

## Inspect model performance (table)

The performance of the models can also be viewed in a table.

In [13]:
modelcomparisons = pd.DataFrame({'model':[str(params) for model, params, model_types in models],
                       'train_acc': [history.history['acc'][-1] for history in histories],
                       'train_loss': [history.history['loss'][-1] for history in histories],
                       'val_acc': [history.history['val_acc'][-1] for history in histories],
                       'val_loss': [history.history['val_loss'][-1] for history in histories]
                       })
modelcomparisons.to_csv(os.path.join(resultpath, 'modelcomparisons.csv'))

modelcomparisons

| | model | train_acc | train_loss | val_acc | val_loss |
|---|---|---|---|---|---|
| 0 | {'regularization_rate': 0.000871406843381289, ... | 0.36 | 189.750765 | 0.2 | 133.420563 |
| 1 | {'regularization_rate': 0.0006385919929132891,... | 0.426667 | 1.935624 | 0.43 | 2.11442 |


## Choose the best model
Now that we have compared the architectures, we can choose the most promising model. For example, we can choose the model with the highest accuracy on the validation data set. To maximize this model's performance, we will train it on more data and for more epochs.

In [14]:
best_model_index = np.argmax(val_accuracies)
best_model, best_params, best_model_types = models[best_model_index]
print('Model type and parameters of the best model:')
print(best_model_types)
print(best_params)

Model type and parameters of the best model:
DeepConvLSTM
{'regularization_rate': 0.0006385919929132891, 'learning_rate': 0.0013206889428859167, 'lstm_dims': array([47]), 'filters': array([68, 44, 31, 62, 67])}
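
As a small worked example of the selection step, the sketch below applies `np.argmax` to the two validation accuracies and `np.argmin` to the two validation losses reported in the model comparison table. The values are copied by hand here; in the notebook they come from `val_accuracies` and `val_losses`.

```python
import numpy as np

# Validation accuracies and losses of model 0 and model 1 (from the table).
val_accuracies = [0.2, 0.43]
val_losses = [133.420563, 2.11442]

# Index of the model with the highest validation accuracy.
best_by_accuracy = int(np.argmax(val_accuracies))
# Alternative criterion: index of the model with the lowest validation loss.
best_by_loss = int(np.argmin(val_losses))

print(best_by_accuracy, best_by_loss)  # 1 1
```

Here both criteria agree on model 1. Note that `np.argmax` returns the first index when several models tie, so ties are resolved in favour of the model generated earliest.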


## Train the best model on the full dataset

Now that we have identified the best model architecture out of our random pool of models we can continue by training the model on the full training set.

In [16]:
# best_model is trained in place, starting from its current weights (no copy is made)
nr_epochs = 1  # 1 epoch matches the output shown below; more epochs take longer
datasize = X_train.shape[0] #We're going to train the model on the complete data set
history = best_model.fit(X_train[:datasize,:,:], y_train_binary[:datasize,:],
              nb_epoch=nr_epochs, validation_data=(X_val, y_val_binary))

Train on 15718 samples, validate on 100 samples
Epoch 1/1


In [17]:
# Plot the training process:
find_architecture.plotTrainingProcess(history)

### Saving, loading and comparing reloaded model with original model

The model can be saved for future use. The savemodel function saves two separate files: a JSON file for the architecture and a npy (NumPy array) file for the weights.

In [19]:
modelname = 'my_bestmodel'

In [20]:
storage.savemodel(best_model,resultpath,modelname)

('data/PAMAP2/PAMAP2_Dataset/output/modelsmy_bestmodel_architecture.json',
 'data/PAMAP2/PAMAP2_Dataset/output/modelsmy_bestmodel_weights')

In [24]:
model_reloaded = storage.loadmodel(resultpath,modelname)



The model has been reloaded. Let's investigate whether it gives the same probability estimates as the original model on a small subset of the validation data.

In [25]:
datasize = 10
probs_original = best_model.predict_proba(X_val[:datasize,:,:],batch_size=1)
probs_reloaded = model_reloaded.predict_proba(X_val[:datasize,:,:],batch_size=1)



In [29]:
(probs_reloaded == probs_original).all()

True
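
Exact equality worked in the run above. If the comparison ever fails because of tiny floating-point differences (for instance on different hardware), a tolerance-based check is a possible alternative. The sketch below uses made-up probabilities to show the difference between `==` and `np.allclose`; it is a suggestion, not part of mcfly.

```python
import numpy as np

# Hypothetical probability estimates for one sample and three classes.
probs_a = np.array([[0.1, 0.7, 0.2]], dtype=np.float32)
# Simulate a tiny numerical difference in one entry.
probs_b = probs_a + np.array([[0.0, 1e-6, 0.0]], dtype=np.float32)

exactly_equal = bool((probs_a == probs_b).all())
# allclose accepts differences within a small relative/absolute tolerance.
approximately_equal = bool(np.allclose(probs_a, probs_b))

print(exactly_equal, approximately_equal)  # False True
```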

## Advanced model inspection

Although it is beyond the scope of mcfly, it is worth highlighting that the models in 'models' and 'best_model' are Keras objects. This means that you can use Keras functions such as .predict and .evaluate on them to run advanced analyses. These functions are all documented in the Keras documentation.

In [30]:
## Inspect model predictions
datasize = X_val.shape[0]
probs = best_model.predict_proba(X_val[:datasize,:,:],batch_size=1)



In [31]:
print(np.round(probs,decimals=2))

[[ 0.          0.          0.         ...,  0.          0.          0.01      ]
 [ 0.          0.01        0.         ...,  0.01        0.          0.04      ]
 [ 0.          0.          0.02       ...,  0.          0.          0.02      ]
 ..., 
 [ 0.01        0.          0.01       ...,  0.          0.          0.02      ]
 [ 0.          0.57999998  0.02       ...,  0.01        0.06        0.01      ]
 [ 0.          0.01        0.         ...,  0.01        0.          0.03      ]]
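
One common way to use such a probability matrix is to take the most probable class per sample and compare it with the true class. The sketch below does this for a small hypothetical example with four samples and three classes; with the real data you would use `probs` and `y_val_binary` instead.

```python
import numpy as np

# Hypothetical probabilities: one row per sample, one column per class.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.6, 0.3, 0.1]])
# One-hot encoded true labels for the same four samples.
y_true_binary = np.array([[1, 0, 0],
                          [0, 1, 0],
                          [0, 0, 1],
                          [0, 1, 0]])

predicted = probs.argmax(axis=1)       # most probable class per sample
actual = y_true_binary.argmax(axis=1)  # true class per sample

# Confusion matrix: rows are true classes, columns are predicted classes.
num_classes = probs.shape[1]
confusion = np.zeros((num_classes, num_classes), dtype=int)
np.add.at(confusion, (actual, predicted), 1)
print(confusion)
```

Off-diagonal entries show which classes are mistaken for which; in this toy example one sample of class 1 is predicted as class 0.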


In [33]:
## Evaluate on the test set
score_test = best_model.evaluate(X_test, y_test_binary, verbose=True)
print('Score of best model: ' + str(score_test))

Score of best model: [0.51429932212829588, 0.86799999999999999]
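
The two numbers are presumably the loss followed by the accuracy, which is the order Keras uses when a model is compiled with an accuracy metric; this is an assumption about how mcfly compiles its models. The sketch below shows how both quantities can be computed from predicted probabilities, using hypothetical values. A loss reported by Keras may also include regularization terms, so it need not match a manual calculation exactly.

```python
import numpy as np

# Hypothetical probabilities and one-hot labels (four samples, three classes).
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.6, 0.3, 0.1]])
y_true_binary = np.array([[1, 0, 0],
                          [0, 1, 0],
                          [0, 0, 1],
                          [0, 1, 0]])

# Accuracy: fraction of samples whose most probable class is the true class.
accuracy = np.mean(probs.argmax(axis=1) == y_true_binary.argmax(axis=1))

# Categorical cross-entropy: mean negative log-probability of the true class.
# Clipping avoids log(0) when a true class gets probability zero.
true_class_probs = np.sum(probs * y_true_binary, axis=1)
cross_entropy = -np.mean(np.log(np.clip(true_class_probs, 1e-7, 1.0)))

print(accuracy, cross_entropy)
```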
