# Autoencoders

After an initial analysis and some classification attempts with PCA and LDA, we found that our feature set did not differentiate well between users. To improve on the manually engineered features, we will use an autoencoder to generate features automatically. We will then apply some machine learning algorithms and compare the results with the other techniques to evaluate whether the newly generated feature set is an improvement.

## The data

Because an autoencoder is a neural network, we need to process our data into a form that can be fed to the network. As in the LDA analysis, we segment the data into equal time intervals, but instead of calculating features, we split each interval into a fixed number of time bins and take the mean of the values from a given sensor that fall into each bin.

In [4]:
generate_csv_data(cont_bins=20, segment_intervals=[2, 10, 30])
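The binning step described above can be sketched in isolation. This is a minimal illustration, not the notebook's actual pipeline: `bin_means` is a hypothetical helper, and returning zeros when an interval holds fewer records than bins is our assumption (the notebook instead produces NaN means and replaces them with 0 later).

```python
import numpy as np


def bin_means(values, n_bins):
    """Split values into n_bins consecutive chunks of equal size and average each."""
    values = np.asarray(values, dtype=float)
    # Integer chunk size; any trailing remainder is dropped, as in generate_csv_data.
    step = len(values) // n_bins
    if step == 0:
        # Assumption: too few records for the requested bins -> fill with zeros.
        return [0.0] * n_bins
    return [float(values[j * step:(j + 1) * step].mean()) for j in range(n_bins)]


# Eight readings from one sensor within one time interval, compressed to 4 bins.
readings = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
binned = bin_means(readings, n_bins=4)
print(binned)  # [0.5, 2.5, 4.5, 6.5]
```

With `cont_bins=20` and ten binned sensors, each row contributes 200 such values to the network input.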

## Feature extraction

Once the data is calculated and stored in CSV files, we can build our autoencoder, train it, and extract the encoder that we will later use to generate features from new examples.

### Data preparation

Before we do any actual work with autoencoders, we need to define which data we want, subsample it accordingly (time intervals, users), and potentially perform some preprocessing, such as normalization, if it provides a significant increase in accuracy.

In [45]:
experiment = 1
interval = 2
use_bins = True

data = read_csv("jupyter/data/raw_data_experiment_{}_segment_{}_seconds.csv".format(experiment, interval)).fillna(0)
X_train, X_test = split_csv_data(data, use_bins=use_bins)

def extend_bins(data):
    """
    Extend list of lists into a list and transform 
    the string list into an actual list.
    """
    final_data = []
    for row in data:
        data_row = []
        for x in row:
            if isinstance(x, str):
                data_row += [float(y) for y in x[1:-1].split(",")]
            else:
                data_row.append(x)
        final_data.append(data_row)
    return final_data


def split_csv_data(data, use_bins=True):
    """
    Split the data into a training and a testing set, with each
    user having one seance in training and another in testing.
    """
    users = get_users_data(data)
    train_seances = [x[0] for x in users.values()]
    test_seances = [x[1] for x in users.values()]
    training_set = data[data["seance"].isin(train_seances)]
    testing_set = data[data["seance"].isin(test_seances)]
    if use_bins:
        X_train = training_set.iloc[:, 19:].values
        X_test = testing_set.iloc[:, 19:].values
    else:
        X_train = training_set.iloc[:, 3:20].values
        X_test = testing_set.iloc[:, 3:20].values

    return nan_to_num(array(extend_bins(X_train))), nan_to_num(array(extend_bins(X_test)))

print(X_train.shape)
print(X_test.shape)

(4390, 201)
(3583, 201)
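The normalization mentioned under data preparation could look roughly like the following min-max sketch. The helper names `fit_min_max` and `apply_min_max` and the demo arrays are hypothetical; the scaling statistics are fitted on the training set only and then reused for the test set, which is one common choice rather than the method used in this notebook.

```python
import numpy as np


def fit_min_max(X):
    """Compute per-column minimum and range from the training data."""
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Constant columns would divide by zero; leave them unscaled.
    col_range[col_range == 0] = 1.0
    return col_min, col_range


def apply_min_max(X, col_min, col_range):
    """Scale X with statistics computed on the training data."""
    return (X - col_min) / col_range


# Tiny stand-ins for X_train and X_test.
X_train_demo = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
X_test_demo = np.array([[2.5, 40.0]])

col_min, col_range = fit_min_max(X_train_demo)
train_scaled = apply_min_max(X_train_demo, col_min, col_range)
test_scaled = apply_min_max(X_test_demo, col_min, col_range)
print(train_scaled)
print(test_scaled)  # test values may fall outside [0, 1]
```

Scaling matters here because a decoder ending in a sigmoid can only output values between 0 and 1, so unscaled inputs outside that range cannot be reconstructed exactly.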


### Building the autoencoder

With the data in hand, we can define the autoencoder layers, with dimensions that match the data.

In [50]:
# input_img = Input(shape=(784,))
# encoded = Dense(128, activation='relu')(input_img)
# encoded = Dense(64, activation='relu')(encoded)
# encoded = Dense(32, activation='relu')(encoded)

# decoded = Dense(64, activation='relu')(encoded)
# decoded = Dense(128, activation='relu')(decoded)
# decoded = Dense(784, activation='sigmoid')(decoded)



# this is the size of our encoded representations
encoding_dim = 32  # 32 floats -> compression factor of about 6.3, assuming the input is 201 floats

# this is our input placeholder
input_data = Input(shape=(X_train.shape[1],))

encoded = Dense(128, activation='relu')(input_data)
encoded = Dense(64, activation='relu')(encoded)
encoded = Dense(32, activation='relu')(encoded)

decoded = Dense(64, activation='relu')(encoded)
decoded = Dense(128, activation='relu')(decoded)
decoded = Dense(201, activation='sigmoid')(decoded)

# this model maps an input to its reconstruction
autoencoder = Model(input_data, decoded)

# this model maps an input to its encoded representation
encoder = Model(input_data, encoded)

# # create a placeholder for an encoded (32-dimensional) input
# encoded_input = Input(shape=(encoding_dim,))
# # retrieve the last layer of the autoencoder model
# decoder_layer = autoencoder.layers[-1]
# # create the decoder model
# decoder = Model(encoded_input, decoder_layer(encoded_input))

autoencoder.compile(optimizer='adadelta', loss='msle')
autoencoder.fit(X_train, X_train,
                epochs=20,
                batch_size=128,
                shuffle=True,
                validation_data=(X_test, X_test))



Train on 4390 samples, validate on 3583 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.callbacks.History at 0x7f302c7cf320>
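The `msle` loss used for training is the mean squared logarithmic error. A NumPy sketch of the formula helps when inspecting reconstruction quality per sample. The function name `msle_per_sample` and the `eps` value are our assumptions; to our understanding Keras clips both inputs to a small positive epsilon before taking the logarithm, which the sketch mirrors, but this should be checked against the installed version.

```python
import numpy as np


def msle_per_sample(y_true, y_pred, eps=1e-7):
    """Mean squared logarithmic error for each row of y_true / y_pred."""
    # Assumption: values below eps are clipped, so negative inputs act like ~0.
    y_true = np.maximum(np.asarray(y_true, dtype=float), eps)
    y_pred = np.maximum(np.asarray(y_pred, dtype=float), eps)
    return np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2, axis=-1)


# Stand-ins for a batch of inputs and their reconstructions.
y_true = np.array([[0.0, 1.0, 3.0], [0.0, 1.0, 3.0]])
y_pred = np.array([[0.0, 1.0, 3.0], [0.0, 3.0, 1.0]])

errors = msle_per_sample(y_true, y_pred)
print(errors)  # first row is a perfect reconstruction, second is not
```

Under this clipping assumption, negative sensor values (such as some of the accelerometer bins) are all treated as roughly zero by the loss, which is another reason to consider rescaling the inputs.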

In [36]:
print(X_test[:1])
print(encoder.predict(X_test[:2]))

[[ 2.44545000e+04  2.22222222e-02  2.11111111e-02  2.00000000e-02
   2.05555556e-02  2.00000000e-02  2.11111111e-02  2.11111111e-02
   2.27777778e-02  2.11111111e-02  2.11111111e-02  2.05555556e-02
   1.94444444e-02  1.94444444e-02  2.05555556e-02  2.11111111e-02
   2.05555556e-02  2.16666667e-02  2.11111111e-02  2.27777778e-02
   2.00000000e-02 -2.50000000e-02 -2.77777778e-02 -2.66666667e-02
  -2.77777778e-02 -2.77777778e-02 -2.66666667e-02 -2.72222222e-02
  -2.55555556e-02 -2.55555556e-02 -2.50000000e-02 -2.44444444e-02
  -2.50000000e-02 -2.94444444e-02 -2.83333333e-02 -2.72222222e-02
  -2.50000000e-02 -2.55555556e-02 -2.83333333e-02 -2.77777778e-02
  -2.61111111e-02  8.78888889e-01  8.71666667e-01  8.77222222e-01
   8.75555556e-01  8.78333333e-01  8.76111111e-01  8.75555556e-01
   8.77777778e-01  8.77777778e-01  8.76111111e-01  8.75000000e-01
   8.77777778e-01  8.80000000e-01  8.80000000e-01  8.75555556e-01
   8.78333333e-01  8.77777778e-01  8.77222222e-01  8.78888889e-01
   8.77777

## Helpers

Functions used in this document, moved here to reduce clutter.

In [28]:
from datetime import timedelta
from keras.layers import Input, Dense
from keras.models import Model
from numpy import mean, array, isnan, unique, nan_to_num
from pandas import read_csv, DataFrame


def get_users_data(data):
    """
    Get the sorted seance ids of each user as a dict.
    """
    users = {}
    for user in list(set(data["user"])):
        x = data[data["user"] == user]
        users.update({user: sorted(list(set(x["seance"])))})
    return users


def generate_csv_data(cont_bins=20, segment_intervals=(2, 10, 30, 60, 90, 120)):
    sensors = {
        "ax": 60,
        "ay": 61,
        "az": 62,
        "gx": 63,
        "gy": 64,
        "gz": 65,
        "fa": 77,
        "fb": 76,
        "fc": 54,
        "fd": 55,
        "ca": 78,
        "cb": 79,
        "cc": 80,
        "cd": 81,
        "me": 82,
        "nr": 84,
        "ns": 83,
    }
    for interval in segment_intervals:
        # Experiments
        for ex in [1, 2, 3]:
            data = {
                "user": [],
                "seance": [],
                "time": [],
                "ax": [],
                "ay": [],
                "az": [],
                "gx": [],
                "gy": [],
                "gz": [],
                "fa": [],
                "fb": [],
                "fc": [],
                "fd": [],
                "ca": [],
                "cb": [],
                "cc": [],
                "cd": [],
                "me": [],
                "nr": [],
                "ns": [],
                "ax_b": [],
                "ay_b": [],
                "az_b": [],
                "gx_b": [],
                "gy_b": [],
                "gz_b": [],
                "fa_b": [],
                "fb_b": [],
                "fc_b": [],
                "fd_b": [],
            }
            seances = Seance.objects.filter(
                experiment__sequence_number=ex, valid=True
            ).order_by("created")
            seance_count = seances.count()
            print("Processing {} seances with experiment {}".format(seance_count, ex))
            curr_seance = 1
            for seance in seances:
                print("{} of {}".format(curr_seance, seance_count))
                print(seance)
                curr_seance += 1
                start = seance.start

                # Seconds from seance start
                i = 0
                # Iterate through seance
                while start < seance.end:
                    data["user"].append(seance.user.id)
                    data["seance"].append(seance.id)
                    data["time"].append(i * interval)
                    # Get records for all sensors
                    records = SensorRecord.objects.filter(
                        timestamp__range=(start, start + timedelta(seconds=interval)),
                        seance=seance,
                    ).order_by("timestamp")
                    # Calculate final data on per sensor basis
                    for sensor in sensors:
                        sensor_records = [
                            x.value for x in records.filter(sensor__id=sensors[sensor])
                        ]
                        # Create multiple bins of data, if data from accelerometer, gyroscope or force sensor
                        if sensor in [
                            "ax",
                            "ay",
                            "az",
                            "gx",
                            "gy",
                            "gz",
                            "fa",
                            "fb",
                            "fc",
                            "fd",
                        ]:
                            step = int(len(sensor_records) / cont_bins)
                            bins = []
                            for j in range(0, cont_bins):
                                sub_records = sensor_records[j * step : (j + 1) * step]
                                bins.append(mean(sub_records))
                            data[sensor + "_b"].append(bins)
                        if not sensor_records:
                            data[sensor].append(0)
                        else:
                            data[sensor].append(mean(sensor_records))
                    i += 1
                    start += timedelta(seconds=interval)
            df = DataFrame(data)
            df.to_csv(
                "raw_data_experiment_{}_segment_{}_seconds.csv".format(ex, interval),
                index=False,
            )

In [8]:
from numpy import mean, std

from seances.models import Seance
from sensors.models import SensorRecord, Sensor

SENSORS = ["fsr_01","fsr_02","fsr_03","fsr_04","accel01_x","accel01_y","accel01_z","gyro01_x","gyro01_y","gyro01_z","cpuusage_01","cpuusage_02","cpuusage_03","cpuusage_04","mempercentage_01","netpacketssent_01","netpacketsreceived_01"]

def process_data():
    """
    Shape our data in a way that is usable by the autoencoder.
    """
    data = []
    seances = Seance.objects.filter(valid=True, experiment__sequence_number=1)
    sensors = Sensor.objects.filter(topic__in=SENSORS)

    for seance in seances:
        print(seance)
        row_data = {"user_id": seance.user.id, "seance_id": seance.id}
        valid = True
        for sensor in sensors:
            try:
                sensor_data = to_n_points(
                    [
                        x.value
                        for x in SensorRecord.objects.filter(
                            seance=seance, sensor=sensor
                        )
                    ],
                    50,
                )
            except ValueError:
                print("Missing data in seance... skipping.")
                valid = False
                break
            row_data.update({sensor.topic: sensor_data})
        if valid:
            data.append(row_data)
    return data

def to_n_points(data: list, n: int):
    """
    Take the provided list of values and compress it to a length of n elements.
    This is achieved by averaging elements.
    """
    if len(data) < n:
        raise ValueError("Not enough data to compress to {} elements.".format(n))

    step = len(data) / n
    i = 0
    result = []
    for _ in range(0, n):
        row = data[round(i) : round(i + step)]
        result.append(mean(row))
        i += step
    return result

data = process_data()


Completed seance started at: 2019-09-20 10:19:29 with user test_subject_04
Completed seance started at: 2019-09-13 08:44:53 with user test_subject_01


KeyboardInterrupt: 
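The compression performed by `to_n_points` can also be expressed with `numpy.array_split`, which divides a sequence into `n` nearly equal chunks. This is an alternative sketch rather than a drop-in replacement: `to_n_points_split` is a hypothetical name, and its chunk boundaries can differ slightly from the rounding-based boundaries of the original function.

```python
import numpy as np


def to_n_points_split(values, n):
    """Compress values to n elements by averaging n nearly equal chunks."""
    values = np.asarray(values, dtype=float)
    if len(values) < n:
        raise ValueError("Not enough data to compress to {} elements.".format(n))
    # array_split gives the first len(values) % n chunks one extra element.
    return [float(chunk.mean()) for chunk in np.array_split(values, n)]


# Ten readings compressed to four points (chunk sizes 3, 3, 2, 2).
compressed = to_n_points_split(list(range(10)), 4)
print(compressed)  # [1.0, 4.0, 6.5, 8.5]
```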

In [None]:
X = []
for row in data:
    x = []
    for sensor in SENSORS:
        x += row[sensor]
    X.append(x)

print(X[0])

In [None]:
from keras.layers import Input, Dense
from keras.models import Model

# this is the size of our encoded representations
encoding_dim = 16  # 16 floats -> compression factor of 49, assuming the input is 784 floats

# this is our input placeholder
input_img = Input(shape=(784,))
# "encoded" is the encoded representation of the input
encoded = Dense(encoding_dim, activation="relu")(input_img)
# "decoded" is the lossy reconstruction of the input
decoded = Dense(784, activation="sigmoid")(encoded)

# this model maps an input to its reconstruction
autoencoder = Model(input_img, decoded)

# this model maps an input to its encoded representation
encoder = Model(input_img, encoded)

# create a placeholder for an encoded (encoding_dim-dimensional) input
encoded_input = Input(shape=(encoding_dim,))
# retrieve the last layer of the autoencoder model
decoder_layer = autoencoder.layers[-1]
# create the decoder model
decoder = Model(encoded_input, decoder_layer(encoded_input))

autoencoder.compile(optimizer="adadelta", loss="binary_crossentropy")

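from keras.datasets import mnist
import numpy as np

# load MNIST, scale pixels to [0, 1] and flatten each 28x28 image to 784 floats
(x_train, _), (x_test, _) = mnist.load_data()
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0
x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:])))
x_test = x_test.reshape((len(x_test), np.prod(x_test.shape[1:])))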

autoencoder.fit(
    x_train,
    x_train,
    epochs=50,
    batch_size=256,
    shuffle=True,
    validation_data=(x_test, x_test),
)

# encode and decode some digits
# note that we take them from the *test* set
encoded_imgs = encoder.predict(x_test)
decoded_imgs = decoder.predict(encoded_imgs)