# Linear Discriminant Analysis

In this document we discuss potential uses of linear discriminant analysis (LDA) and use it to perform some basic analysis and classification on our dataset.

<!-- At the beginning of every session (when running code), first run [helpers](#section_helpers) and then [get csv data](#section_csv) to load all the libraries, functions and data used by this notebook. -->

## Data preparation
As we only possess 2 experiments of the same type per user, we need to generate more data points. To accomplish that, we segmented the data into bins lasting $m$ seconds, calculated basic statistical features (mean, sd, mcr) for each bin, and saved the results into separate csv files, where $m \in \{10, 15, 30, 45, 60, 75, 90, 120, 150, 180\}$.
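
The segmentation step can be sketched roughly as follows. This is a minimal illustration and not the `generate_segmented_data_csv` helper used below, whose implementation is not shown here. The function names, the single-signal input, and the reading of mcr as "mean crossing rate" are all assumptions.

```python
import numpy as np


def mean_crossing_rate(values):
    """Share of consecutive sample pairs lying on opposite sides of the mean.

    Assumption: "mcr" stands for mean crossing rate.
    """
    values = np.asarray(values, dtype=float)
    if values.size < 2:
        return 0.0
    signs = np.sign(values - values.mean())
    crossings = np.count_nonzero(signs[:-1] * signs[1:] < 0)
    return crossings / (values.size - 1)


def segment_features(timestamps, values, seconds):
    """Cut one signal into consecutive bins of `seconds` and summarise each bin."""
    timestamps = np.asarray(timestamps, dtype=float)
    values = np.asarray(values, dtype=float)
    # Bin index of every sample, counted from the first timestamp
    bin_ids = ((timestamps - timestamps[0]) // seconds).astype(int)
    rows = []
    for bin_id in np.unique(bin_ids):
        chunk = values[bin_ids == bin_id]
        rows.append(
            {
                "segment": int(bin_id),
                "me": float(chunk.mean()),
                "sd": float(chunk.std()),
                "mcr": float(mean_crossing_rate(chunk)),
            }
        )
    return rows


# Hypothetical signal: one sample per second for 20 seconds, alternating +1 / -1
t = np.arange(20)
signal = np.where(t % 2 == 0, 1.0, -1.0)
features = segment_features(t, signal, seconds=10)
```

Shorter bins give more rows per experiment but noisier statistics per row, which is the trade-off explored in the rest of this notebook.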

In contrast to PCA, which is an unsupervised method, LDA uses class labels to separate the data. As our data is labelled, this method should distinguish much better between different users. We will look at different time segment intervals, different numbers of users, and a comparison to PCA on the same data.

In [None]:
import warnings

# Not recommended in general; here it only silences a flood of
# "Mean of empty slice" and similar warnings.
warnings.filterwarnings("ignore")

for seconds in [10, 15, 30, 45, 60, 75, 90, 120, 150, 180]:
    generate_segmented_data_csv(seconds=seconds, experiment=1)
    generate_segmented_data_csv(seconds=seconds, experiment=2)
    generate_segmented_data_csv(seconds=seconds, experiment=3)

In [2]:
segment_intervals = [10, 15, 30, 45, 60, 75, 90, 120, 150, 180]
datas_1 = []
datas_2 = []
datas_3 = []

for interval in segment_intervals:
    datas_1.append(read_csv("data/segmented_data_{}_seconds_experiment_1.csv".format(interval)).fillna(0))
    datas_2.append(read_csv("data/segmented_data_{}_seconds_experiment_2.csv".format(interval)).fillna(0))
    datas_3.append(read_csv("data/segmented_data_{}_seconds_experiment_3.csv".format(interval)).fillna(0))

## Clustering qualities of LDA

Before performing any kind of classification, we will look at how LDA performs compared to PCA, and at the relation between the length of each time segment and the number of users.

### Evaluation of cluster quality

The result of either method, LDA or PCA, is a set of points in n-dimensional space. Our goal is to maximize the distance between the clusters of points belonging to different users, while keeping the points of the same user close together. For that reason, we use the Calinski-Harabasz index, also known as the Variance Ratio Criterion (VRC). It is defined as the ratio of the between-cluster dispersion to the within-cluster dispersion, so a higher score indicates better separated clusters. More information [here](https://scikit-learn.org/stable/modules/clustering.html).
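
For $n$ points in $k$ clusters, the index can be written as $s = \frac{\mathrm{tr}(B_k)}{\mathrm{tr}(W_k)} \cdot \frac{n - k}{k - 1}$, where $B_k$ and $W_k$ are the between-cluster and within-cluster dispersion matrices. The sketch below computes it by hand with NumPy to make the definition concrete. The function name is ours; scikit-learn's `sklearn.metrics.calinski_harabasz_score` provides a library implementation of the same index.

```python
import numpy as np


def variance_ratio_criterion(points, labels):
    """Calinski-Harabasz index: between-cluster over within-cluster dispersion."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    n_samples = points.shape[0]
    clusters = np.unique(labels)
    overall_mean = points.mean(axis=0)

    between = 0.0
    within = 0.0
    for cluster in clusters:
        members = points[labels == cluster]
        centre = members.mean(axis=0)
        # Dispersion of the cluster centre around the overall centre
        between += members.shape[0] * np.sum((centre - overall_mean) ** 2)
        # Dispersion of the members around their own centre
        within += np.sum((members - centre) ** 2)

    k = clusters.size
    return (between / (k - 1)) / (within / (n_samples - k))


# Two tight clusters that lie far apart give a high score
points = [[0, 0], [0, 2], [10, 0], [10, 2]]
labels = [0, 0, 1, 1]
score = variance_ratio_criterion(points, labels)  # 50.0
```

In this toy example the between-cluster dispersion is 100 and the within-cluster dispersion is 4, so the score is $(100 / 1) / (4 / 2) = 50$.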

### Comparison to PCA

In the graphs below, we show the difference in the spread of the data after taking the 2 most prominent components of LDA and PCA. We can observe a much higher explained variance ratio of the first 2 components with LDA, as well as a much cleaner separation between classes.

You can play with the parameters, but the result will not change much: we found LDA to be much more appropriate for the data we possess, mostly because the data is labelled. We will discuss the relation between the number of users and the length of the time segments in the following sections.
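
Why labels matter can be illustrated on synthetic data that is unrelated to our dataset. In this sketch, which assumes scikit-learn is installed, one feature has a large variance but no class information and the other has a small variance but separates the classes. PCA follows the variance, while LDA follows the labels. The `separation` helper is our own ad hoc measure, not a library function.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_per_class = 200

# Feature 0: large variance, carries no class information.
# Feature 1: small variance, but its mean differs between the two classes.
noise = rng.normal(0.0, 10.0, size=2 * n_per_class)
signal = np.concatenate(
    [rng.normal(0.0, 0.5, size=n_per_class), rng.normal(3.0, 0.5, size=n_per_class)]
)
X = np.column_stack([noise, signal])
y = np.repeat([0, 1], n_per_class)


def separation(projection, labels):
    """Distance between the class means in units of the pooled standard deviation."""
    a = projection[labels == 0]
    b = projection[labels == 1]
    pooled_sd = np.sqrt((a.var() + b.var()) / 2)
    return abs(a.mean() - b.mean()) / pooled_sd


pca_axis = PCA(n_components=1).fit_transform(X).ravel()
lda_axis = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y).ravel()

pca_separation = separation(pca_axis, y)
lda_separation = separation(lda_axis, y)
```

Here `lda_separation` comes out far larger than `pca_separation`, because the first principal component is dominated by the uninformative high-variance feature.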

In [335]:
datas = datas_1
# Parameters
number_of_users = 7
segment_interval_length = 60

# Get n random users and segment interval length
users = get_random_users(n=number_of_users)
interval = segment_intervals.index(segment_interval_length)
# Perform and visualize lda
lda, score = get_lda(datas[interval], users, score=True, stats=False)
fig = scatter(
    lda,
    x="component_1",
    y="component_2",
    color="user",
    color_continuous_scale="Rainbow",
    title="LDA TECHNIQUE; number of users: {}, segment interval: {} seconds, score: {}".format(
        number_of_users, segment_intervals[interval], score
    ),
)
fig.show()
# Perform and visualize pca
pca, score = get_pca(datas[interval], users, score=True, stats=False)
fig = scatter(
    pca,
    x="component_1",
    y="component_2",
    color="user",
    color_continuous_scale="Rainbow",
    title="PCA TECHNIQUE; number of users: {}, segment interval: {} seconds, score: {}".format(
        number_of_users, segment_intervals[interval], score
    ),
)
fig.show()

Users: [7, 14, 16, 17, 18, 24, 26]


### Time intervals vs user number graphs with LDA

Below is a series of graphs for different numbers of users and different time intervals, annotated with VRC scores. We can confirm that with an increasing number of users, longer time intervals are needed to properly discern between users. The interesting part is that the quality of the separation is concave in the interval length: for each number of users (each set of graphs) there is a peak in the score, which diminishes on both sides.

In [126]:
datas = datas_3
warnings.filterwarnings("ignore")

for user_number in [5, 9, 13]:
    users = get_random_users(n=user_number)
    # Compute each projection once and reuse it for the subplot titles and the traces
    projections = [get_lda(data, users, stats=False, score=True) for data in datas]
    fig = make_subplots(
        rows=2,
        cols=5,
        subplot_titles=[
            "{} seconds, {}".format(segment_intervals[i], int(score))
            for i, (_, score) in enumerate(projections)
        ],
    )
    for i, (lda, _) in enumerate(projections):
        fig.add_trace(
            go.Scatter(
                x=lda["component_1"],
                y=lda["component_2"],
                mode="markers",
                marker=dict(color=lda["user"], colorscale="rainbow"),
            ),
            row=i // 5 + 1,
            col=i % 5 + 1,
        )
    fig.update_layout(title="Number of users: {}".format(user_number))
    fig.show()

Users: [7, 12, 16, 23, 26]


Users: [11, 12, 13, 15, 18, 21, 23, 26, 27]


Users: [8, 9, 11, 14, 15, 16, 18, 19, 21, 22, 23, 24, 27]


### Mean score user number vs time intervals with LDA

To better understand how LDA behaves with different numbers of users and different time intervals, we ran LDA 50 times for each number of users and took the mean VRC score for every time interval. The results are shown in the graphs below. They confirm our assumption that with an increasing number of users, we need longer time intervals to discern between users.
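
The averaging and the optional rescaling to the 0-1 interval can be sketched as follows. This is not the `calculate_scores` helper used below, whose implementation is not shown here. The function name, the array layout, and the choice of min-max scaling are assumptions.

```python
import numpy as np


def mean_normalised_scores(scores, norm=True):
    """Average VRC scores over repetitions, optionally rescaled to [0, 1].

    `scores` has shape (repetitions, time_intervals).
    """
    mean_scores = np.asarray(scores, dtype=float).mean(axis=0)
    if not norm:
        return mean_scores
    low, high = mean_scores.min(), mean_scores.max()
    if high == low:
        # A flat curve has no peak; return zeros rather than divide by zero
        return np.zeros_like(mean_scores)
    return (mean_scores - low) / (high - low)


# Hypothetical scores: 2 repetitions for 4 time intervals
scores = [[10.0, 30.0, 50.0, 20.0], [20.0, 50.0, 70.0, 40.0]]
curve = mean_normalised_scores(scores)
peak_index = int(np.argmax(curve))
```

Rescaling every curve to the same range makes it easier to compare where the peak lies for each user count, at the cost of hiding the absolute score values.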

<!--Results are much more representative if run on 500 epochs or more, but that takes a bit longer. Use norm flag to scale all score data on 0-1 interval and see at which time interval there is a peak for each user count.-->

In [14]:
datas = datas_1
e = 50
norm = True
verbose = False

for datas in [datas_1, datas_2, datas_3]:
    print("Performing {} epochs".format(e))
    results = []
    for i in range(3, 22):
        results.append(calculate_scores(i, datas, epochs=e, norm=norm, verbose=verbose))

    df = concat(results)
    fig = line(df, x="time_interval", y="score", color="users")
    fig.show()

Performing 50 epochs


Performing 50 epochs


Performing 50 epochs


## Understanding which features are most informative

To better understand which sensors generate informative data, we would like to find out which features are most prominent in each component of the LDA. The values below are calculated from the eigenvectors of the LDA. We took one vector at a time, looked at the entries with the highest absolute values, which represent the features that contribute the most information to the given component, and weighted them by the explained variance ratio of that component.

All of the prominent features in all of the components belong to the accelerometer, and it seems that all other sensors are redundant. This implies that we need to do further analysis by taking features from only one sensor at a time and comparing the results. We may also use feature selection techniques to validate the calculated results.
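
The weighting described above can be sketched in isolation. This is a simplified variant of the cell below, operating on plain arrays instead of a fitted model. The function name and the example numbers are hypothetical.

```python
import numpy as np


def rank_features(scalings, explained_variance_ratio, feature_names, threshold=0.05):
    """For each component, list (feature, weighted share) pairs, largest first."""
    scalings = np.abs(np.asarray(scalings, dtype=float))
    ranking = []
    for j, ratio in enumerate(explained_variance_ratio):
        # Share of each feature in the component; the shares sum to 1
        shares = scalings[:, j] / scalings[:, j].sum()
        order = np.argsort(shares)[::-1]
        ranking.append(
            [
                (feature_names[i], round(float(shares[i] * ratio), 2))
                for i in order
                if shares[i] > threshold
            ]
        )
    return ranking


# Hypothetical 3-feature, 2-component example
scalings = [[6.0, 0.0], [-3.0, 1.0], [1.0, 3.0]]
ranking = rank_features(scalings, [0.8, 0.2], ["a", "b", "c"])
```

Using `argsort` keeps the feature index next to each value, so two features with identical weights are not mistaken for one another.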

In [130]:
datas = datas_2

# Parameters
number_of_users = 5
segment_interval_length = 45
n_components = 2

# Get n random users and segment interval length
users = get_random_users(n=number_of_users)
interval = segment_intervals.index(segment_interval_length)

data = datas[interval]
data = data[data.user.isin(users)]
X = data.iloc[:, 2:].values
y = data.iloc[:, 0].values.ravel()
z = data.iloc[:, 1].values.ravel()

lda = LinearDiscriminantAnalysis(n_components=n_components)
X_lda = lda.fit_transform(X, y)
print("Explained variance ratio: {}".format(lda.explained_variance_ratio_))

# Eigenvectors (scalings_): when transforming, the data points are multiplied
# by these vectors, so the higher the absolute value for a specific feature,
# the more informative that feature is for the component. This is loosely
# similar to the weights of a neuron in a neural net. Put another way, every
# element of an eigenvector is the weight of one feature in that component.
for j in range(0, n_components):
    print("============================")
    print("Component {}".format(j + 1))
    print("============================")
    eigen_vector = abs(lda.scalings_[:, j])
    eigen_vector = eigen_vector / sum(eigen_vector)

    eigen_sorted = eigen_vector.copy()
    eigen_sorted[::-1].sort()
    i = 0
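    # Guard against running past the end when every entry exceeds the threshold
    while i < len(eigen_sorted) and eigen_sorted[i] > 0.05: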
        value = round(eigen_sorted[i] * lda.explained_variance_ratio_[j], 2)
        if value < 0.01:
            break
        print(features_list(index=np.where(eigen_vector == eigen_sorted[i])[0][0]))
        print(value)
        i += 1

Users: [7, 8, 11, 15, 26]
Explained variance ratio: [0.856549   0.09609886]
Component 1
az_me
0.27
a_me
0.26
a_mai
0.11
az_mai
0.11
Component 2
az_me
0.03
a_me
0.03
az_mai
0.01
a_mai
0.01


## Classification

The scikit-learn `LinearDiscriminantAnalysis` class provides a handy method called `predict` which, given new unlabelled data, predicts the class of each example.

### Per seance classification

Take n random users. The first session (`seance` in the data) of each user goes into the training set, and the second session of each user goes into the testing set. For now we only measure classification accuracy and compare it to the majority classifier.

This is a tough task, as the training and testing data are split exactly between the two sessions of each user. The classification accuracy depends on the parameters and is not perfect, but it is still about 3 times higher than that of the majority classifier, from which we can conclude that already at this stage our algorithm is learning something.
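
The split and the baseline can be sketched on a toy table. The `user` and `seance` column names follow the data used in this notebook; the function names, the example values, and the assumption that sessions are ordered by their `seance` identifier are ours. The cells below use the `get_users_data` helper instead, which is not shown here.

```python
from collections import Counter

import pandas as pd


def split_by_session(data):
    """Put each user's first session in the training set and the second in the test set.

    Assumption: sessions are ordered by their `seance` identifier.
    """
    train_ids, test_ids = [], []
    for _, sessions in data.groupby("user")["seance"]:
        ordered = sorted(sessions.unique())
        train_ids.append(ordered[0])
        test_ids.append(ordered[1])
    return data[data["seance"].isin(train_ids)], data[data["seance"].isin(test_ids)]


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Hypothetical data: 2 users, 2 sessions each
data = pd.DataFrame(
    {
        "user": [1, 1, 1, 2, 2, 2, 2, 2],
        "seance": [10, 10, 11, 20, 21, 21, 21, 21],
        "feature": [0.1, 0.2, 0.3, 0.9, 0.8, 0.7, 0.6, 0.5],
    }
)
training_set, testing_set = split_by_session(data)
baseline = majority_baseline(list(testing_set["user"]))
```

Splitting on whole sessions rather than on random rows keeps segments from the same recording out of both sets, which is what makes the task harder and the estimate more honest. Note that this sketch fails for a user with fewer than two sessions.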

#### All sensor data
First, let's take a look at what happens if we use data from all sensors.

In [275]:
datas = datas_3

# Users
number_of_users = 7
users = get_random_users(n=number_of_users)

# Time intervals
segment_interval_length = 60
interval = segment_intervals.index(segment_interval_length)

# Components
n_components = 2

data = datas[interval]
data = data[data.user.isin(users)]
seances = get_users_data(data)

train_seances = [x[0] for x in seances.values()]
test_seances = [x[1] for x in seances.values()]

training_set = data[data["seance"].isin(train_seances)]
testing_set = data[data["seance"].isin(test_seances)]

X_train = training_set.iloc[:, 2:].values
y_train = training_set.iloc[:, 0].values.ravel()

X_test = testing_set.iloc[:, 2:].values
y_test = testing_set.iloc[:, 0].values.ravel()

lda = LinearDiscriminantAnalysis(n_components=n_components)
lda.fit(X_train, y_train)
y_pred = lda.predict(X_test)
    
print("Confusion matrix")
print(metrics.confusion_matrix(y_test, y_pred))
print()
print("Majority classifier")
print(round(list(y_test).count(max(set(y_test), key = list(y_test).count))/len(y_test),2))
print("Classification accuracy")
print(round(metrics.accuracy_score(y_test, y_pred), 2))


Users: [7, 9, 11, 15, 18, 22, 26]
Confusion matrix
[[6 0 0 0 0 0 0]
 [0 5 0 0 0 0 0]
 [0 0 9 0 0 0 0]
 [0 0 0 6 0 0 0]
 [0 0 0 1 5 0 0]
 [0 0 0 0 0 5 0]
 [0 1 0 0 0 0 4]]

Majority classifier
0.21
Classification accuracy
0.95


#### Number of users vs accuracy

Below we display how the number of users used affects the classification accuracy of the predictions.

In [None]:
epochs = 500
results = {"n_users": [], "accuracy": [], "time_interval": []}

datas = datas_3

# Components
n_components = 2
# for n in segment_intervals:
for n in [15, 45, 90, 180]:
    interval = segment_intervals.index(n)
    print(n)
    for i in range(3, 19, 2):
        ca = []
        for _ in range(0, epochs):
            users = get_random_users(n=i, stats=False, remove=[14])
            data = datas[interval]
            data = data[data.user.isin(users)]
            seances = get_users_data(data)

            train_seances = [x[0] for x in seances.values()]
            test_seances = [x[1] for x in seances.values()]

            training_set = data[data["seance"].isin(train_seances)]
            testing_set = data[data["seance"].isin(test_seances)]

            X_train = training_set.iloc[:, 2:].values
            y_train = training_set.iloc[:, 0].values.ravel()

            X_test = testing_set.iloc[:, 2:].values
            y_test = testing_set.iloc[:, 0].values.ravel()

            lda = LinearDiscriminantAnalysis(n_components=n_components)
            lda.fit(X_train, y_train)
            y_pred = lda.predict(X_test)

            majority_classifier = round(list(y_test).count(max(set(y_test), key = list(y_test).count))/len(y_test),2)

            ca.append(metrics.accuracy_score(y_test, y_pred))
        results["n_users"].append(i)
        results["accuracy"].append(mean(ca))
        results["time_interval"].append(n)
        

df = DataFrame(results)
graphs = []
# Only plot the intervals that were actually evaluated above
for n in sorted(set(results["time_interval"])):
    data = df[df["time_interval"] == n]
    graphs.append(go.Bar(name="{} seconds".format(n), x=data["n_users"], y=data["accuracy"]))
    
fig = go.Figure(data=graphs)
# Change the bar mode
fig.update_layout(barmode='group')
fig.show()

15
45
90
180


#### Sensor subset data

To determine the best possible combination of sensors, we will try datasets with different sensor combinations and compare the results.
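
Conceptually, combining sensors means joining their feature columns for the same rows. The sketch below shows one way to do that with pandas. It is not the `combine_sensor_data` helper used below, whose implementation is not shown here. The function name, the single-frame layout (the notebook keeps one frame per time interval), and the example values are assumptions.

```python
import pandas as pd


def combine_sensors(sensors, names, id_columns=("user", "seance")):
    """Join the feature columns of several sensors under an "a + b" style key.

    Assumption: all frames share the same rows and the same id columns.
    """
    id_columns = list(id_columns)
    parts = [sensors[names[0]][id_columns]]
    for name in names:
        parts.append(sensors[name].drop(columns=id_columns))
    return {" + ".join(names): pd.concat(parts, axis=1)}


# Hypothetical per-sensor frames with one feature each
ids = {"user": [1, 1, 2], "seance": [10, 10, 20]}
sensors = {
    "accelerometer": pd.DataFrame({**ids, "a_me": [0.1, 0.2, 0.3]}),
    "gyroscope": pd.DataFrame({**ids, "g_me": [1.0, 2.0, 3.0]}),
}
sensors.update(combine_sensors(sensors, ["accelerometer", "gyroscope"]))
combined = sensors["accelerometer + gyroscope"]
```

Returning a one-entry dictionary lets the result be merged into the existing collection with `update`, so a combined set can itself be combined further.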

In [255]:
epochs = 150

sresults = [[], [], []]
# Users
number_of_users = 7
users_set = [get_random_users(n=number_of_users, stats=False) for _ in range(0, epochs)]

# Time intervals
segment_interval_length = 60
interval = segment_intervals.index(segment_interval_length)

# Components
n_components = 2

for ex in [1,2,3]:
    sensors = get_per_sensor_data(experiment=ex)

    # Combine different sensor data
    sensors.update(combine_sensor_data(sensors, ["accelerometer", "gyroscope"]))
    sensors.update(combine_sensor_data(sensors, ["accelerometer + gyroscope", "force sensors"]))
    sensors.update(combine_sensor_data(sensors, ["gyroscope", "force sensors"]))
    sensors.update(combine_sensor_data(sensors, ["cpu", "memory"]))
    sensors.update(combine_sensor_data(sensors, ["cpu + memory", "network"]))
    sensors.update(combine_sensor_data(sensors, ["cpu", "accelerometer"]))

    names = [x for x in sensors]
    names.sort()
    print(names)

    ca = {"majority_classifier": []}
    for name in names:
        ca.update({name: []})

    for users in users_set:
        for key in names:
            sensor = sensors[key][interval]
            sensor = sensor[sensor.user.isin(users)]
            seances = get_users_data(sensor)

            train_seances = [x[0] for x in seances.values()]
            test_seances = [x[1] for x in seances.values()]

            training_set = sensor[sensor["seance"].isin(train_seances)]
            testing_set = sensor[sensor["seance"].isin(test_seances)]

            X_train = training_set.iloc[:, 2:].values
            y_train = training_set.iloc[:, 0].values.ravel()

            X_test = testing_set.iloc[:, 2:].values
            y_test = testing_set.iloc[:, 0].values.ravel()

            lda = LinearDiscriminantAnalysis(n_components=n_components)
            lda.fit(X_train, y_train)
            y_pred = lda.predict(X_test)

            #     print("Data from {}:".format(key))
            #     print(round(metrics.accuracy_score(y_test, y_pred), 2))
            ca[key].append(metrics.accuracy_score(y_test, y_pred))

        # print("Majority classifier")
        # print(round(list(y_test).count(max(set(y_test), key = list(y_test).count))/len(y_test),2))
        ca["majority_classifier"].append(
            list(y_test).count(max(set(y_test), key=list(y_test).count)) / len(y_test)
        )


    results = [[], [], []]
    for key in ca:
        results[0].append(key)
        results[1].append(mean(ca[key]))
        results[2].append(ex)

    indices = np.argsort(results[1])
    for i in indices:
        sresults[0].append(results[0][i])
        sresults[1].append(results[1][i])
        sresults[2].append(results[2][i])

df = DataFrame({"sensors": sresults[0], "accuracy": sresults[1], "experiment": sresults[2]})
graphs = []
for i in range(1,4):
    data = df[df["experiment"] == i]
    graphs.append(go.Bar(name="Experiment {}".format(i), x=data["sensors"], y=data["accuracy"]))
    
fig = go.Figure(data=graphs)
# Change the bar mode
fig.update_layout(barmode='group')
fig.show()

['accelerometer', 'accelerometer + gyroscope', 'accelerometer + gyroscope + force sensors', 'all data', 'cpu', 'cpu + accelerometer', 'cpu + memory', 'cpu + memory + network', 'force sensors', 'gyroscope', 'gyroscope + force sensors', 'memory', 'network']
['accelerometer', 'accelerometer + gyroscope', 'accelerometer + gyroscope + force sensors', 'all data', 'cpu', 'cpu + accelerometer', 'cpu + memory', 'cpu + memory + network', 'force sensors', 'gyroscope', 'gyroscope + force sensors', 'memory', 'network']
['accelerometer', 'accelerometer + gyroscope', 'accelerometer + gyroscope + force sensors', 'all data', 'cpu', 'cpu + accelerometer', 'cpu + memory', 'cpu + memory + network', 'force sensors', 'gyroscope', 'gyroscope + force sensors', 'memory', 'network']


#### Per sensor features

To better understand which features within a single sensor's data are the most informative, we perform the same classification process as before, but using only a single feature group or a subset of features taken from one sensor (selected with `sensor_name`, the gyroscope in the run below).

In [166]:
epochs = 150
sensor_name = "gyroscope"

sresults = [[], [], []]
# Users
number_of_users = 7
users_set = [get_random_users(n=number_of_users, stats=False) for _ in range(0, epochs)]

# Time intervals
segment_interval_length = 60
interval = segment_intervals.index(segment_interval_length)

# Components
n_components = 2

for ex in [1, 2, 3]:
    sensors = get_per_sensor_data(experiment=ex)
    sensors = sensors[sensor_name]

    cols = sensors[0].columns
    names = {
        "c_me": ["gx_me", "gy_me", "gz_me"],
        "me": [x for x in cols if "me" in x],
        "sd": [x for x in cols if "sd" in x],
        "mcr": [x for x in cols if "mcr" in x],
#         "c_mai": ["ax_mai", "ay_mai", "az_mai"],
#         "mai": [x for x in cols if "mai" in x]
    }
    print(names)

    ca = {"majority_classifier": []}
    for name in names:
        ca.update({name: []})

    for users in users_set:
        for key in names:
            sensor = sensors[interval]
            sensor = sensor[sensor.user.isin(users)]
            seances = get_users_data(sensor)

            train_seances = [x[0] for x in seances.values()]
            test_seances = [x[1] for x in seances.values()]

            training_set = sensor[sensor["seance"].isin(train_seances)]
            testing_set = sensor[sensor["seance"].isin(test_seances)]

            X_train = training_set[names[key]].values
            y_train = training_set.iloc[:, 0].values.ravel()

            X_test = testing_set[names[key]].values
            y_test = testing_set.iloc[:, 0].values.ravel()

            lda = LinearDiscriminantAnalysis(n_components=n_components)
            lda.fit(X_train, y_train)
            y_pred = lda.predict(X_test)

            ca[key].append(metrics.accuracy_score(y_test, y_pred))

        ca["majority_classifier"].append(
            list(y_test).count(max(set(y_test), key=list(y_test).count)) / len(y_test)
        )

    results = [[], [], []]
    for key in ca:
        results[0].append(key)
        results[1].append(mean(ca[key]))
        results[2].append(ex)

    indices = np.argsort(results[1])
    for i in indices:
        sresults[0].append(results[0][i])
        sresults[1].append(results[1][i])
        sresults[2].append(results[2][i])

df = DataFrame(
    {"sensors": sresults[0], "accuracy": sresults[1], "experiment": sresults[2]}
)
graphs = []
for i in range(1, 4):
    data = df[df["experiment"] == i]
    graphs.append(
        go.Bar(name="Experiment {}".format(i), x=data["sensors"], y=data["accuracy"])
    )

fig = go.Figure(data=graphs)
# Change the bar mode
fig.update_layout(barmode="group", title="{} features breakdown".format(sensor_name))
fig.show()

{'c_me': ['gx_me', 'gy_me', 'gz_me'], 'me': ['gx_me', 'gy_me', 'gz_me', 'g_me'], 'sd': ['gx_sd', 'gy_sd', 'gz_sd', 'g_sd'], 'mcr': ['gx_mcr', 'gy_mcr', 'gz_mcr', 'g_mcr']}
{'c_me': ['gx_me', 'gy_me', 'gz_me'], 'me': ['gx_me', 'gy_me', 'gz_me', 'g_me'], 'sd': ['gx_sd', 'gy_sd', 'gz_sd', 'g_sd'], 'mcr': ['gx_mcr', 'gy_mcr', 'gz_mcr', 'g_mcr']}
{'c_me': ['gx_me', 'gy_me', 'gz_me'], 'me': ['gx_me', 'gy_me', 'gz_me', 'g_me'], 'sd': ['gx_sd', 'gy_sd', 'gz_sd', 'g_sd'], 'mcr': ['gx_mcr', 'gy_mcr', 'gz_mcr', 'g_mcr']}


#### PC monitor improvements

As proposed, we only use data recorded while the PC was turned on, to see whether that improves the performance of the cpu, memory and network data.

In [272]:
epochs = 150

sresults = [[], [], []]
# Users
number_of_users = 7
users_set = [get_random_users(n=number_of_users, stats=False, remove=[12, 20]) for _ in range(0, epochs)]

# Time intervals
segment_interval_length = 60
interval = segment_intervals.index(segment_interval_length)

# Components
n_components = 2

for ex in [1, 2, 3]:
    sensors = get_per_sensor_data(experiment=ex, pc_monitor=True)

    # Combine different sensor data
    sensors.update(combine_sensor_data(sensors, ["accelerometer", "gyroscope"]))
    sensors.update(combine_sensor_data(sensors, ["accelerometer + gyroscope", "force sensors"]))
    sensors.update(combine_sensor_data(sensors, ["gyroscope", "force sensors"]))
    sensors.update(combine_sensor_data(sensors, ["cpu", "memory"]))
    sensors.update(combine_sensor_data(sensors, ["cpu + memory", "network"]))
    sensors.update(combine_sensor_data(sensors, ["cpu", "accelerometer"]))

    names = [x for x in sensors]
    names.sort()
    print(names)

    ca = {"majority_classifier": []}
    for name in names:
        ca.update({name: []})

    for users in users_set:
        for key in names:
            sensor = sensors[key][interval]
            sensor = sensor[sensor.user.isin(users)]
            seances = get_users_data(sensor)
            train_seances = [x[0] for x in seances.values()]
            test_seances = [x[1] for x in seances.values()]

            training_set = sensor[sensor["seance"].isin(train_seances)]
            testing_set = sensor[sensor["seance"].isin(test_seances)]

            X_train = training_set.iloc[:, 2:].values
            y_train = training_set.iloc[:, 0].values.ravel()

            X_test = testing_set.iloc[:, 2:].values
            y_test = testing_set.iloc[:, 0].values.ravel()

            lda = LinearDiscriminantAnalysis(n_components=n_components)
            lda.fit(X_train, y_train)
            y_pred = lda.predict(X_test)

            ca[key].append(metrics.accuracy_score(y_test, y_pred))

        ca["majority_classifier"].append(
            list(y_test).count(max(set(y_test), key=list(y_test).count)) / len(y_test)
        )


    results = [[], [], []]
    for key in ca:
        results[0].append(key)
        results[1].append(mean(ca[key]))
        results[2].append(ex)

    indices = np.argsort(results[1])
    for i in indices:
        sresults[0].append(results[0][i])
        sresults[1].append(results[1][i])
        sresults[2].append(results[2][i])

df = DataFrame({"sensors": sresults[0], "accuracy": sresults[1], "experiment": sresults[2]})
graphs = []
for i in range(1,4):
    data = df[df["experiment"] == i]
    graphs.append(go.Bar(name="Experiment {}".format(i), x=data["sensors"], y=data["accuracy"]))
    
fig = go.Figure(data=graphs)
# Change the bar mode
fig.update_layout(barmode='group')
fig.show()

['accelerometer', 'accelerometer + gyroscope', 'accelerometer + gyroscope + force sensors', 'all data', 'cpu', 'cpu + accelerometer', 'cpu + memory', 'cpu + memory + network', 'force sensors', 'gyroscope', 'gyroscope + force sensors', 'memory', 'network']
['accelerometer', 'accelerometer + gyroscope', 'accelerometer + gyroscope + force sensors', 'all data', 'cpu', 'cpu + accelerometer', 'cpu + memory', 'cpu + memory + network', 'force sensors', 'gyroscope', 'gyroscope + force sensors', 'memory', 'network']
['accelerometer', 'accelerometer + gyroscope', 'accelerometer + gyroscope + force sensors', 'all data', 'cpu', 'cpu + accelerometer', 'cpu + memory', 'cpu + memory + network', 'force sensors', 'gyroscope', 'gyroscope + force sensors', 'memory', 'network']


### Iterative time steps learning

As our goal is to perform continuous authentication, one of the key dimensions we have to keep in mind is time. Because the complete dataset was collected in advance, we can see into the future, which is beneficial for our algorithms but useless in practice. So in this section we look at what happens when we iteratively take one 10 second time interval at a time, learn on all previous intervals for a subset of users, and see how much time it takes to distinguish between them.
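
The evaluation scheme, stripped of plotting and interaction, is an expanding window: train on the first `i` segments of every user, test on segment `i`, then grow the window by one. The sketch below runs it on synthetic, deliberately well separated data and assumes scikit-learn is installed. The data layout and variable names are hypothetical.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_users, n_segments = 3, 20

# Hypothetical data: segments[u] holds the time-ordered feature rows of user u.
# The user means are far apart so that the example is easy to separate.
segments = [
    rng.normal(loc=[10.0 * u, -10.0 * u], scale=1.0, size=(n_segments, 2))
    for u in range(n_users)
]

accuracies = []
for i in range(2, n_segments):
    # Train on the first i segments of every user ...
    X_train = np.concatenate([s[:i] for s in segments])
    y_train = np.repeat(np.arange(n_users), i)
    # ... and test on the segment that comes right after them
    X_test = np.stack([s[i] for s in segments])
    y_test = np.arange(n_users)

    lda = LinearDiscriminantAnalysis()
    lda.fit(X_train, y_train)
    accuracies.append(float(np.mean(lda.predict(X_test) == y_test)))
```

Because each test segment lies strictly after its training segments, the model never sees the future, which mirrors the continuous authentication setting.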

<!-- After each iteration, the user is prompted to press enter to continue. The output is then reset and new graphs should appear. If you write "stop" and press enter in the prompt, the program will stop outputting graphs and will calculate classification accuracy for all time steps (takes a few seconds) and show a line graph with the result. -->

In [133]:
datas = datas_2

# Users
number_of_users = 7
users = get_random_users(n=number_of_users)
    
# Time intervals
segment_interval_length = 10
interval = segment_intervals.index(segment_interval_length)

# Components
n_components = 2

data = datas[interval]
data = data[data.user.isin(users)]
seances = get_users_data(data)
train_seances = [x[0] for x in seances.values()]
training_set = data[data["seance"].isin(train_seances)]
seances = list(set(training_set["seance"]))

graph = True
i = 2
results = [[], []]
while True:
    X = []
    y = []
    Xt = []
    yt = []
    for seance in seances:
        X.append(training_set[training_set["seance"] == seance].iloc[0:i, 2:].values)
        y.append(
            training_set[training_set["seance"] == seance].iloc[0:i, 0].values.ravel()
        )
        Xt.append(
            training_set[training_set["seance"] == seance].iloc[i : i + 1, 2:].values
        )
        yt.append(
            training_set[training_set["seance"] == seance]
            .iloc[i : i + 1, 0]
            .values.ravel()
        )
    X = np.concatenate(X)
    y = np.concatenate(y)
    Xt = np.concatenate(Xt)
    yt = np.concatenate(yt)
    if Xt.shape[0] == 0:
        break
    lda = LinearDiscriminantAnalysis(n_components=n_components)
    X_lda = lda.fit_transform(X, y)
    yp = lda.predict(Xt)
    results[0].append(i * segment_interval_length)
    results[1].append(metrics.accuracy_score(yt, yp))

    if graph:
        print("After {} seconds.".format(i * segment_interval_length))
        print("Classification accuracy: {}".format(metrics.accuracy_score(yt, yp)))
        X_ldat = lda.transform(Xt)
        X_lda = np.concatenate((X_lda, X_ldat))
        y = np.concatenate((y, yt + 100))
        df = DataFrame(
            [[y] + list(x) for x, y in zip(X_lda, y)],
            columns=["user", "component_1", "component_2"],
        )
        fig = scatter(
            df,
            x="component_1",
            y="component_2",
            color="user",
            color_continuous_scale="picnic",
        )

        fig.show()

    i += 1
    if i > 100000:
        print("Ran out of data")
        break
    if graph:
        command = input("Press Enter to continue...")
        
        clear_output()
        if command == "stop":
            graph = False

df = DataFrame({"time from start [s]": results[0], "classification accuracy": results[1]})
fig = line(df, x="time from start [s]", y="classification accuracy")
fig.show()

# FIXME: This is still broken, fix it on a good day
# fig.write_image("image_01.jpg")

#### Per sensor iterative learning

To better understand how the classification accuracy fluctuates over time, we will look at the classification accuracy for each sensor separately and try to determine why there is a dip in accuracy at the beginning of the experiment.

In [121]:
# Data
sensors = get_per_sensor_data(experiment=3)
names = [x for x in sensors]
print(names)

results = [[], [], []]

# Users
number_of_users = 7
users = get_random_users(n=number_of_users)
    
# Time intervals
segment_interval_length = 10
interval = segment_intervals.index(segment_interval_length)

# Components
n_components = 2

for sensor_name in names:
    
    data = sensors[sensor_name][interval]
    data = data[data.user.isin(users)]
    seances = get_users_data(data)
    train_seances = [x[0] for x in seances.values()]
    training_set = data[data["seance"].isin(train_seances)]
    seances = list(set(training_set["seance"]))

    i = 2
    while True:
        X = []
        y = []
        Xt = []
        yt = []
        for seance in seances:
            X.append(training_set[training_set["seance"] == seance].iloc[0:i, 2:].values)
            y.append(
                training_set[training_set["seance"] == seance].iloc[0:i, 0].values.ravel()
            )
            Xt.append(
                training_set[training_set["seance"] == seance].iloc[i : i + 1, 2:].values
            )
            yt.append(
                training_set[training_set["seance"] == seance]
                .iloc[i : i + 1, 0]
                .values.ravel()
            )
        X = np.concatenate(X)
        y = np.concatenate(y)
        Xt = np.concatenate(Xt)
        yt = np.concatenate(yt)
        if Xt.shape[0] == 0:
            break
        lda = LinearDiscriminantAnalysis(n_components=n_components)
        if mean(X) != 0:
            X_lda = lda.fit_transform(X, y)
            yp = lda.predict(Xt)
            results[1].append(metrics.accuracy_score(yt, yp))
        else:
            results[1].append(0)
        results[0].append(i * segment_interval_length)
        results[2].append(sensor_name)

        i += 1
        if i > 100000:
            print("Run out of data")
            break

df = DataFrame({"time from start [s]": results[0], "classification accuracy": results[1], "sensor": results[2]})
fig = line(df, x="time from start [s]", y="classification accuracy", color="sensor")
fig.show()


['all data', 'accelerometer', 'gyroscope', 'force sensors', 'cpu', 'memory', 'network']
Users: [13, 15, 16, 17, 22, 25, 26]
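The loop in the cell above can be read as an expanding-window evaluation: at step `i` the model is trained on the first `i` segments of every seance and tested on the segment that follows. A minimal sketch of that idea as a reusable function is given below. It runs on synthetic data, and the function name `expanding_window_accuracy` and the dictionary input format are assumptions made for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score


def expanding_window_accuracy(segments_by_user, start=2):
    """Train on the first i segments of every user, test on the next one.

    segments_by_user maps a user label to a 2-D array (segments x features).
    Returns a list of (i, accuracy) pairs, one per step.
    """
    curve = []
    i = start
    while True:
        X, y, Xt, yt = [], [], [], []
        for user, segments in segments_by_user.items():
            X.append(segments[:i])
            y.extend([user] * len(segments[:i]))
            Xt.append(segments[i:i + 1])
            yt.extend([user] * len(segments[i:i + 1]))
        if not yt:  # no user has a segment left to test on
            break
        lda = LinearDiscriminantAnalysis()
        lda.fit(np.concatenate(X), y)
        predicted = lda.predict(np.concatenate(Xt))
        curve.append((i, accuracy_score(yt, predicted)))
        i += 1
    return curve


# Synthetic stand-in data: three users of different lengths in a 4-D feature space.
rng = np.random.default_rng(0)
toy = {
    user: rng.normal(loc=10.0 * k, scale=1.0, size=(8 + k, 4))
    for k, user in enumerate(["u1", "u2", "u3"])
}
curve = expanding_window_accuracy(toy)
```

Multiplying the step index by the segment length (10 seconds in the cell above) gives the "time from start" axis used in the plots.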


#### Combining sensor data

As one of our goals is to determine the sensor set with the highest classification accuracy, we will build sets of data with different combinations of sensors and compare the results.

In [124]:
# Data
sensors = get_per_sensor_data(experiment=3)

# Combine sensor data
sensors.update(combine_sensor_data(sensors, ["accelerometer", "gyroscope"]))
sensors.update(
    combine_sensor_data(sensors, ["accelerometer + gyroscope", "force sensors"])
)
sensors.update(
    combine_sensor_data(sensors, ["accelerometer + gyroscope + force sensors", "cpu"])
)
sensors.update(
    combine_sensor_data(sensors, ["accelerometer + gyroscope + force sensors", "network"])
)
sensors.update(
    combine_sensor_data(sensors, ["accelerometer + gyroscope + force sensors", "memory"])
)

# Restrict the comparison to the sensor sets of interest.
names = [
    "all data",
    "accelerometer",
    "gyroscope",
    "accelerometer + gyroscope",
    "accelerometer + gyroscope + force sensors"
]
print("Sensors: {}".format(names))

results = [[], [], []]

# Users
number_of_users = 7
users = get_random_users(n=number_of_users)

# Time intervals
segment_interval_length = 10
interval = segment_intervals.index(segment_interval_length)

# Components
n_components = 2

for sensor_name in names:

    data = sensors[sensor_name][interval]
    data = data[data.user.isin(users)]
    seances = get_users_data(data)
    train_seances = [x[0] for x in seances.values()]
    training_set = data[data["seance"].isin(train_seances)]
    seances = list(set(training_set["seance"]))

    i = 2
    while True:
        X = []
        y = []
        Xt = []
        yt = []
        for seance in seances:
            X.append(
                training_set[training_set["seance"] == seance].iloc[0:i, 2:].values
            )
            y.append(
                training_set[training_set["seance"] == seance]
                .iloc[0:i, 0]
                .values.ravel()
            )
            Xt.append(
                training_set[training_set["seance"] == seance]
                .iloc[i : i + 1, 2:]
                .values
            )
            yt.append(
                training_set[training_set["seance"] == seance]
                .iloc[i : i + 1, 0]
                .values.ravel()
            )
        X = np.concatenate(X)
        y = np.concatenate(y)
        Xt = np.concatenate(Xt)
        yt = np.concatenate(yt)
        if Xt.shape[0] == 0:
            break
        lda = LinearDiscriminantAnalysis(n_components=n_components)
        if mean(X) != 0:
            X_lda = lda.fit_transform(X, y)
            yp = lda.predict(Xt)
            results[1].append(metrics.accuracy_score(yt, yp))
        else:
            results[1].append(0)
        results[0].append(i * segment_interval_length)
        results[2].append(sensor_name)

        i += 1
        if i > 100000:
            print("Run out of data")
            break

df = DataFrame({"time from start [s]": results[0], "classification accuracy": results[1], "sensor": results[2]})
fig = line(df, x="time from start [s]", y="classification accuracy", color="sensor")
fig.show()

Sensors: ['all data', 'accelerometer', 'gyroscope', 'accelerometer + gyroscope', 'accelerometer + gyroscope + force sensors']
Users: [7, 8, 9, 16, 17, 20, 25]


#### Per sensor features

Again we would like to find out more about how certain features of the accelerometer and the gyroscope behave.

In [200]:
datas = datas_1

results = [[], [], []]
# Users
number_of_users = 7
users = get_random_users(n=number_of_users, stats=True)

# Time intervals
segment_interval_length = 10
interval = segment_intervals.index(segment_interval_length)

# Components
n_components = 2

# names = {
# #     "c_me": ["ax_me", "ay_me", "az_me"],
#     "me": ["ax_me", "ay_me", "az_me", "a_me"],
#     "sd": ["ax_sd", "ay_sd", "az_sd", "a_sd"],
#     "mcr": ["ax_mcr", "ay_mcr", "az_mcr", "a_mcr"],
# #     "c_mai": ["ax_mai", "ay_mai", "az_mai"],
#     "mai": ["ax_mai", "ay_mai", "az_mai", "a_mai"]
# }

names = {
    "c_me": ["gx_me", "gy_me", "gz_me"],
    "me": ["gx_me", "gy_me", "gz_me", "g_me"],
    "sd": ["gx_sd", "gy_sd", "gz_sd", "g_sd"],
    "mcr": ["gx_mcr", "gy_mcr", "gz_mcr", "g_mcr"],
}

for sensor_name in names:

    data = datas[interval][["user", "seance"] + names[sensor_name]]
    data = data[data.user.isin(users)]
    seances = get_users_data(data)
    train_seances = [x[0] for x in seances.values()]
    training_set = data[data["seance"].isin(train_seances)]
    seances = list(set(training_set["seance"]))
    
#     print(training_set.head())

    i = 2
    while True:
        X = []
        y = []
        Xt = []
        yt = []
        for seance in seances:
            X.append(
                training_set[training_set["seance"] == seance].iloc[0:i, 2:].values
            )
            y.append(
                training_set[training_set["seance"] == seance]
                .iloc[0:i, 0]
                .values.ravel()
            )
            Xt.append(
                training_set[training_set["seance"] == seance]
                .iloc[i : i + 1, 2:]
                .values
            )
            yt.append(
                training_set[training_set["seance"] == seance]
                .iloc[i : i + 1, 0]
                .values.ravel()
            )
        X = np.concatenate(X)
        y = np.concatenate(y)
        Xt = np.concatenate(Xt)
        yt = np.concatenate(yt)
        if Xt.shape[0] == 0:
            break
        lda = LinearDiscriminantAnalysis(n_components=n_components)
        if mean(X) != 0 and len(X) > 0:
            X_lda = lda.fit_transform(X, y)
            yp = lda.predict(Xt)
            results[1].append(metrics.accuracy_score(yt, yp))
        else:
            results[1].append(0)
        results[0].append(i * segment_interval_length)
        results[2].append(sensor_name)

        i += 1
        if i > 100000:
            print("Run out of data")
            break

df = DataFrame({"time from start [s]": results[0], "classification accuracy": results[1], "sensor": results[2]})
fig = line(df, x="time from start [s]", y="classification accuracy", color="sensor")
fig.show()

Users: [7, 8, 12, 14, 16, 18, 27]


### Trajectory classification

Another potentially fruitful approach is to look at a point's trajectory through time in LDA space. As with per seance classification, we will take the first seance of different users and apply trajectory classification to the second seance, step by step, to see how much time it takes to discern between users.

In [None]:
# TODO
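
The cell above is still a placeholder. One possible way to approach it, sketched here under our own assumptions and not taken from an existing implementation, is to compare a trajectory in LDA space against one reference trajectory per user with dynamic time warping (DTW) and assign the closest user. The helpers cell imports `fastdtw`, which hints at this direction, but the sketch below uses a plain NumPy DTW so that it runs on its own. All names and the toy trajectories are hypothetical.

```python
import numpy as np


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping with Euclidean step cost."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    cost = np.full((len(a) + 1, len(b) + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = step + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1]
            )
    return cost[-1, -1]


def classify_trajectory(query, references):
    """Return the label of the reference trajectory closest to `query`."""
    return min(references, key=lambda user: dtw_distance(query, references[user]))


# Hypothetical 2-D trajectories (component_1, component_2) for two users.
references = {
    "user_a": [[0, 0], [1, 1], [2, 2], [3, 3]],
    "user_b": [[0, 5], [1, 5], [2, 5], [3, 5]],
}
query = [[0, 0.2], [0.9, 1.1], [2.1, 1.9]]
print(classify_trajectory(query, references))
```

Calling `classify_trajectory` on a growing prefix of the second seance would show how many segments are needed before the predicted user stops changing.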

### Classification conclusions
Below are the conclusions from the analysis of the classification performed so far.

#### Per seance classification
What we learned so far is that the classification is heavily dependent on the subset of users and the choice of the segment time interval. For some subsets of users the classification accuracy reaches up to 70%, while other users are quite similar in their behavior and our algorithm struggles to distinguish between them for now.

We also discovered that when taking only the accelerometer data our accuracy increases by 20% compared to all the sensor data. The same, but to a lesser degree, can be said for the gyroscope data. On the other hand, we observed a 20% decrease in accuracy with the force sensor data, while the data from the pc (cpu, memory, network) struggled even to stay above the benchmark of the majority classifier.
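
For reference, the majority classifier benchmark mentioned above always predicts the most frequent label in the training data. A minimal sketch of how such a baseline accuracy can be computed follows; the function name and the toy labels are assumptions for illustration only.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return hits / len(test_labels)


# Hypothetical labels: user 7 contributes the most training segments.
train = [7, 7, 7, 8, 8, 9]
test = [7, 8, 9, 7]
print(majority_baseline_accuracy(train, test))  # 0.5
```

A sensor set is only informative if its accuracy stays clearly above this value for the same subset of users.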

The issue with the pc data is that there is time at the beginning and the end of each experiment where this data is not being collected, as the test subject was required to turn the pc on and off during the experiment. A more "fair" experiment would be to take only data from the time interval in which the computer was turned on and redo this analysis.

When combining the data from different sensors, the resulting classification accuracy is between the classification accuracy of the data from sensors that were combined. So our algorithm learns less efficiently when being provided with data beyond the data from the accelerometer.

Surprisingly, we discovered that when taking data from the second and the third experiments, there is an increase in the overall classification accuracy and in most subsets of sensor data. The only notable exception here is the accelerometer, which performed increasingly worse from the first to the third experiment.

As the accelerometer and the gyroscope are the most promising sensors at the moment, we divided the data of those sensors into feature groups and performed classification with only the data of those features. In both cases the features with the best classification accuracy were the means of the individual axes, combined with the combined mean value. Accelerometer data was pretty consistent accuracy-wise, while the gyroscope data kept improving from the first to the third experiment.

As suggested above, we performed the analysis with the subset of the data in which the computer was turned on. Contrary to expectations, the classification accuracy did not change much, neither for the pc data nor for any other. Another possible explanation is that our features, engineered from the pc data, were not the best and, as proposed several times already, we should use an autoencoder to improve the feature set that we currently possess.

#### Iterative time steps learning
After running the program multiple times, with different numbers of users, there is a recognizable pattern emerging from the data. From the very beginning to around the 40 second mark, the accuracy is usually very high. Then it quickly falls and stays pretty low for about a minute, and then finally it climbs back up again. When this happens, visual clusters begin to emerge if you look at the data. Usually there is some noise at the end, as the length of the experiments varies from test subject to test subject.

This pattern is consistent, no matter how many users we take. It takes about 180 seconds to recognize all or at worst 90% of all the users from the subset. Segment time interval in this subsection was locked at 10 seconds, to better understand the dynamic and changes through time. We also tried to use other time intervals, with similar results, but with fewer time steps to evaluate the progress of the algorithm.
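
To make the "time to recognize" figure above reproducible, one could read it off the accuracy curve as the first time step after which the accuracy no longer drops below a chosen threshold. The sketch below assumes a data frame with the same two column names as the plots in this notebook; the function name, the threshold and the toy values are hypothetical.

```python
from pandas import DataFrame


def time_to_stable_accuracy(df, threshold=0.9):
    """First time after which accuracy never drops below `threshold` again."""
    time_col = "time from start [s]"
    ordered = df.sort_values(time_col)
    below = ordered[ordered["classification accuracy"] < threshold]
    if below.empty:
        return ordered[time_col].iloc[0]
    later = ordered[ordered[time_col] > below[time_col].iloc[-1]]
    # None signals that the curve ends below the threshold.
    return None if later.empty else later[time_col].iloc[0]


# Hypothetical curve: high start, dip, then recovery.
toy_curve = DataFrame(
    {
        "time from start [s]": [20, 30, 40, 50, 60, 70],
        "classification accuracy": [1.0, 1.0, 0.4, 0.5, 0.95, 1.0],
    }
)
print(time_to_stable_accuracy(toy_curve))  # 60
```

Because the end of each curve is noisy, it may be necessary to cut off the last few steps before applying such a measure.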

Contrary to the per seance classification, taking the data from only a single sensor does not improve but worsens the classification accuracy, and the highest accuracy was achieved with the combination of the accelerometer, gyroscope and force sensor data. The accuracy of the pc monitor data remained low and the same conclusions can be drawn as before.

The above conclusions hold as well when introducing the second and the third experiment to the algorithm. The only notable difference is in the classification accuracy progression pattern. The dip at the beginning of the experiment is still present, however it occurs a bit later and it is not as pronounced as in the first experiment.

We suspect that the initial dip occurs because of the way the experiment was set up. At the start of each experiment, the test subject enters the room and the sensor data is pretty much constant. When the subject starts using the computer, that is a change in behavioral pattern, which "confuses" the algorithm that has already adjusted to the peaceful state. After a while, it relearns the new patterns and the accuracy rises again.

Similarly to the per seance classification, the most informative feature from the accelerometer and the gyroscope is again the mean. As the mean value was meant more as a starting point, we should automatically engineer features using an autoencoder or a similar technique.

## Helpers

<a id="section_helpers"></a>

Functions that were implemented in the above sections, but moved down here to reduce clutter. Run the cell below to enable the functionality of this notebook.
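
As a quick illustration of one of the features computed by these helpers, the mean crossing rate counts how often consecutive samples move from one side of the signal mean to the other, divided by the number of consecutive pairs. The vectorized variant below is only a sketch for experimenting with the feature outside the notebook; the name `mean_crossing_rate_np` is hypothetical and, unlike the helper in the cell below, it computes the mean itself.

```python
import numpy as np


def mean_crossing_rate_np(signal):
    """Fraction of consecutive sample pairs that cross the signal mean."""
    signal = np.asarray(signal, dtype=float)
    if signal.size < 2:
        return 0.0
    m = signal.mean()
    prev, curr = signal[:-1], signal[1:]
    upward = (prev <= m) & (m < curr)
    downward = (prev > m) & (m >= curr)
    return float(np.count_nonzero(upward | downward)) / (signal.size - 1)


print(mean_crossing_rate_np([0, 2, 0, 2, 0]))  # 1.0
```

A signal that never leaves one side of its mean, such as a constant signal, has a rate of 0.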

In [257]:
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import random
import warnings

from datetime import datetime, timedelta
from fastdtw import fastdtw
from IPython.display import clear_output
from math import sqrt, log10
from numpy import mean, std, array
from pandas import DataFrame, read_csv, concat
from plotly.express import scatter, line, scatter_3d, bar
from plotly.subplots import make_subplots
from scipy.spatial.distance import euclidean
from scipy.signal import find_peaks
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn import metrics
from sklearn.preprocessing import StandardScaler


def load_data(seance_id, sens):
    seance = Seance.objects.get(id=seance_id)
    if sens == "accelerometer":
        sensor_ids = [60, 61, 62]

        sensors = Sensor.objects.filter(id__in=sensor_ids).order_by("id")
        return (
            SensorRecord.objects.filter(seance=seance, sensor=sensors[0]).order_by(
                "timestamp"
            ),
            SensorRecord.objects.filter(seance=seance, sensor=sensors[1]).order_by(
                "timestamp"
            ),
            SensorRecord.objects.filter(seance=seance, sensor=sensors[2]).order_by(
                "timestamp"
            ),
        )
    elif sens == "gyroscope":
        sensor_ids = [63, 64, 65]
        sensors = Sensor.objects.filter(id__in=sensor_ids).order_by("id")
        return (
            SensorRecord.objects.filter(seance=seance, sensor=sensors[0]).order_by(
                "timestamp"
            ),
            SensorRecord.objects.filter(seance=seance, sensor=sensors[1]).order_by(
                "timestamp"
            ),
            SensorRecord.objects.filter(seance=seance, sensor=sensors[2]).order_by(
                "timestamp"
            ),
        )
    elif sens == "force":
        sensor_ids = [54, 55, 76, 77]
        sensors = Sensor.objects.filter(id__in=sensor_ids).order_by("topic")
        return (
            SensorRecord.objects.filter(
                seance=seance, sensor=sensors[0], value__gte=50
            ).order_by("timestamp"),
            SensorRecord.objects.filter(
                seance=seance, sensor=sensors[1], value__gte=50
            ).order_by("timestamp"),
            SensorRecord.objects.filter(
                seance=seance, sensor=sensors[2], value__gte=50
            ).order_by("timestamp"),
            SensorRecord.objects.filter(
                seance=seance, sensor=sensors[3], value__gte=50
            ).order_by("timestamp"),
        )
    elif sens == "cpu":
        sensor_ids = [78, 79, 80, 81]
        sensors = Sensor.objects.filter(id__in=sensor_ids).order_by("topic")
        return (
            SensorRecord.objects.filter(seance=seance, sensor=sensors[0]).order_by(
                "timestamp"
            ),
            SensorRecord.objects.filter(seance=seance, sensor=sensors[1]).order_by(
                "timestamp"
            ),
            SensorRecord.objects.filter(seance=seance, sensor=sensors[2]).order_by(
                "timestamp"
            ),
            SensorRecord.objects.filter(seance=seance, sensor=sensors[3]).order_by(
                "timestamp"
            ),
        )
    elif sens == "ram":
        sensor_ids = [82]
        sensors = Sensor.objects.filter(id__in=sensor_ids).order_by("topic")
        return SensorRecord.objects.filter(seance=seance, sensor=sensors[0]).order_by(
            "timestamp"
        )
    elif sens == "net":
        sensor_ids = [83, 84]
        sensors = Sensor.objects.filter(id__in=sensor_ids).order_by("id")
        return (
            SensorRecord.objects.filter(seance=seance, sensor=sensors[0]).order_by(
                "timestamp"
            ),
            SensorRecord.objects.filter(seance=seance, sensor=sensors[1]).order_by(
                "timestamp"
            ),
        )
    elif sens == "pir":
        sensor_ids = [58, 59, 66, 67, 68, 69]
        sensors = Sensor.objects.filter(id__in=sensor_ids).order_by("id")
        return (
            SensorRecord.objects.filter(seance=seance, sensor__in=sensors).order_by(
                "timestamp"
            ),
            seance.start,
            seance.end,
        )
    else:
        raise ValueError("Invalid sensor string.")


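def process_signal(records):
    """
    Take Django query and do basic signal processing.
    Returns raw values, timestamps, normalised values, mean and standard deviation.
    """
    values = [x.value for x in records]
    times = [x.timestamp for x in records]
    m = mean(values)
    s = std(values)
    # A constant signal has zero deviation; avoid dividing by zero in that case.
    norm = [(x - m) / s for x in values] if s else [0.0 for _ in values]

    return values, times, norm, m, s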


def join_accelerometer_signals(x, y, z):
    """
    Join accelerometer signals, based simply on concurrence. 
    We can do this, as only one controller sends data in loop for all axis.
    """
    result = []
    n = min(len(x), len(y), len(z))
    for a, b, c in zip(x[:n], y[:n], z[:n]):
        result.append(sqrt(a ** 2 + b ** 2 + c ** 2))
    return result, mean(result), std(result)


def mean_crossing_rate(signal, m):
    """
    Calculate mean crossing rate from signal.
    Rate of mean crossings vs. the signal length.
    """
    try:
        prev = signal[0]
    except IndexError:
        return 0
    crosses = 0
    length = len(signal) - 1

    for curr in signal[1:]:
        if prev <= m < curr or prev > m >= curr:
            crosses += 1
        prev = curr
    if length < 1:
        return 0
    return crosses / length


def mean_acceleration_intensity(signal):
    """
    Mean absolute difference between consecutive samples of a signal.
    """
    # Fewer than two samples give no differences to average.
    if len(signal) < 2:
        return 0
    prev = signal[0]
    derv = []

    for curr in signal[1:]:
        derv.append(abs(curr - prev))
        prev = curr

    return mean(derv)


def join_cpu_signals(a, b, c, d):
    """
    Similar to accelerometer one.
    """
    result = []
    n = min(len(a), len(b), len(c), len(d))
    for w, x, y, z in zip(a[:n], b[:n], c[:n], d[:n]):
        result.append(sqrt(w ** 2 + x ** 2 + y ** 2 + z ** 2))
    return result, mean(result), std(result)


def get_cpu_stats(val):
    if not val:
        return 0, 0, 0
    return min(val), max(val), mean_crossing_rate(val, mean(val))


def find_ram_jump(signal):
    if not signal:
        return [], {}
    derivative = []
    prev = signal[0]
    for curr in signal[1:]:
        derivative.append(abs(curr - prev))
        prev = curr
    peaks, _ = find_peaks(derivative, threshold=0.25)

    p = {"position": [], "magnitude": []}
    for x in peaks:
        p["position"].append(x)
        p["magnitude"].append(derivative[x])
    return derivative, p


def get_mem_stats(val, peaks, derivatives):
    # A dict with empty lists is truthy, so check the positions explicitly.
    has_jumps = bool(peaks and peaks["position"])
    # Calculate average inter jump interval
    intervals = []
    if has_jumps:
        prev = peaks["position"][0]
        for curr in peaks["position"][1:]:
            intervals.append(curr - prev)
            prev = curr
    if val:
        avg_load = round(mean(val), 2)
        min_load = min(val)
        max_load = max(val)
    else:
        avg_load = 0
        min_load = 0
        max_load = 0
    if has_jumps:
        jump_count = len(peaks["position"])
        if derivatives:
            jump_rate = round(len(peaks["position"]) / len(derivatives), 2)
        else:
            jump_rate = 0
        avg_jump_value = round(mean(peaks["magnitude"]), 2)
        # A single jump has no inter jump interval, so fall back to 0.
        avg_inter_jump_interval = round(mean(intervals), 2) if intervals else 0
    else:
        jump_count = 0
        jump_rate = 0
        avg_jump_value = 0
        avg_inter_jump_interval = 0
    return (
        avg_load,
        min_load,
        max_load,
        jump_count,
        jump_rate,
        avg_jump_value,
        avg_inter_jump_interval,
    )


def find_net_jump(signal):
    if not signal:
        return [], {}
    derivative = []
    prev = signal[0]
    for curr in signal[1:]:
        derivative.append(abs(curr - prev))
        prev = curr
    peaks, _ = find_peaks(derivative, threshold=mean(derivative))

    p = {"position": [], "magnitude": []}
    for x in peaks:
        p["position"].append(x)
        p["magnitude"].append(derivative[x])
    return derivative, p


def get_net_stats(val, peaks, derivatives):
    # A dict with empty lists is truthy, so check the positions explicitly.
    has_jumps = bool(peaks and peaks["position"])
    # Calculate average inter jump interval
    intervals = []
    if has_jumps:
        prev = peaks["position"][0]
        for curr in peaks["position"][1:]:
            intervals.append(curr - prev)
            prev = curr
    try:
        sum_load = val[-1] - val[0]
    except IndexError:
        sum_load = 0
    if has_jumps:
        jump_count = len(peaks["position"])
        if derivatives:
            jump_rate = round(len(peaks["position"]) / len(derivatives), 2)
        else:
            jump_rate = 0
        avg_jump_value = round(mean(peaks["magnitude"]), 2)
        # A single jump has no inter jump interval, so fall back to 0.
        avg_inter_jump_interval = round(mean(intervals), 2) if intervals else 0
    else:
        jump_count = 0
        jump_rate = 0
        avg_jump_value = 0
        avg_inter_jump_interval = 0
    return sum_load, jump_count, jump_rate, avg_jump_value, avg_inter_jump_interval


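def get_random_users(n=3, stats=True, remove=()):
    # User ids are drawn from 7 to 27 (inclusive), minus the excluded ones.
    candidates = [user for user in range(7, 28) if user not in remove]
    if n > len(candidates):
        raise ValueError("Requested more users than are available.")
    users = random.sample(candidates, n)
    if stats:
        print("Users: {}".format(sorted(users)))
    return users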


def get_lda(data, users, comp_num=2, stats=True, score=False):
    data = data[data.user.isin(users)]
    X = data.iloc[:, 2:].values
    y = data.iloc[:, 0].values.ravel()
    z = data.iloc[:, 1].values.ravel()

    lda = LinearDiscriminantAnalysis(n_components=comp_num)
    X_lda = lda.fit_transform(X, y)
    # Keep the computed value in its own name so it does not shadow the `score` flag.
    ch_score = round(metrics.calinski_harabasz_score(X_lda, y))
    if stats:
        print(lda.explained_variance_ratio_)
        print(ch_score)

    # One row per segment: user, seance ("try") and the LDA components.
    components = ["component_{}".format(k + 1) for k in range(X_lda.shape[1])]
    df = DataFrame(X_lda, columns=components)
    df.insert(0, "try", z)
    df.insert(0, "user", y)
    if score:
        return df, ch_score
    return df


def get_pca(data, users, comp_num=2, stats=True, score=False):
    data = data[data.user.isin(users)]
    X = data.iloc[:, 2:].values
    y = data.iloc[:, 0].values.ravel()
    z = data.iloc[:, 1].values.ravel()

    # PCA is scale sensitive, so standardise the features first.
    X = StandardScaler().fit_transform(X)
    pca = PCA(n_components=comp_num)
    X_pca = pca.fit_transform(X)
    # Keep the computed value in its own name so it does not shadow the `score` flag.
    ch_score = round(metrics.calinski_harabasz_score(X_pca, y))
    if stats:
        print(pca.explained_variance_ratio_)
        print(ch_score)

    # One row per segment: user, seance ("try") and the principal components.
    components = ["component_{}".format(k + 1) for k in range(X_pca.shape[1])]
    df = DataFrame(X_pca, columns=components)
    df.insert(0, "try", z)
    df.insert(0, "user", y)
    if score:
        return df, ch_score
    return df


def generate_segmented_data_csv(seconds: int, experiment: int = 1):
    """
    Generate csv file with calculated features, from data subsampled to the given time interval.
    """
    step = timedelta(seconds=seconds)
    seances = Seance.objects.filter(
        experiment__sequence_number=experiment, valid=True
    ).order_by("start")

    print(
        "Generating segmented data csv file with {} seconds intervals for {} seances.".format(
            seconds, seances.count()
        )
    )

    file_name = "segmented_data_{}_seconds_experiment_{}.csv".format(
        seconds, experiment
    )
    with open(file_name, "w") as csv_data:
        csv_data.write(
            "user,seance,ax_me,ax_sd,ax_mcr,ax_mai,ay_me,ay_sd,ay_mcr,ay_mai,az_me,az_sd,az_mcr,az_mai,a_me,a_sd,a_mcr,a_mai,gx_me,gx_sd,gx_mcr,gy_me,gy_sd,gy_mcr,gz_me,gz_sd,gz_mcr,g_me,g_sd,g_mcr,fa_me,fa_sd,fa_mcr,fb_me,fb_sd,fb_mcr,fc_me,fc_sd,fc_mcr,fd_me,fd_sd,fd_mcr,ca_me,ca_sd,ca_min,ca_max,ca_mcr,cb_me,cb_sd,cb_min,cb_max,cb_mcr,cc_me,cc_sd,cc_min,cc_max,cc_mcr,cd_me,cd_sd,cd_min,cd_max,cd_mcr,c_me,c_sd,c_min,c_max,c_mcr,m_me,m_sd,m_min,m_max,m_jc,m_jr,m_jv,m_iji,ns_me,ns_sd,ns_sum,ns_jc,ns_jr,ns_jv,ns_iji,nr_me,nr_sd,nr_sum,nr_jc,nr_jr,nr_jv,nr_iji\n"
        )
        count = 1
        for seance in seances[:]:
            print(
                "-----------------------------------------------------------------------------"
            )
            print("{} of {}".format(count, seances.count()))
            count += 1
            print(seance)
            start = seance.start
            end = seance.end
            data = (
                list(load_data(seance.id, "accelerometer"))
                + list(load_data(seance.id, "gyroscope"))
                + list(load_data(seance.id, "force"))
                + list(load_data(seance.id, "cpu"))
                + [load_data(seance.id, "ram")]
                + list(load_data(seance.id, "net"))
            )
            while start < end:
                sub_data = []
                for x in data:
                    try:
                        x[0].sensor
                        sub_data.append(
                            x.filter(timestamp__range=(start, start + step))
                        )
                    except IndexError:
                        sub_data.append([])

                # accelerometer
                ax_val, _, _, ax_me, ax_sd = process_signal(sub_data[0])
                ay_val, _, _, ay_me, ay_sd = process_signal(sub_data[1])
                az_val, _, _, az_me, az_sd = process_signal(sub_data[2])
                a_val, a_me, a_sd = join_accelerometer_signals(ax_val, ay_val, az_val)
                ax_mcr = mean_crossing_rate(ax_val, ax_me)
                ay_mcr = mean_crossing_rate(ay_val, ay_me)
                az_mcr = mean_crossing_rate(az_val, az_me)
                a_mcr = mean_crossing_rate(a_val, a_me)
                ax_mai = mean_acceleration_intensity(ax_val)
                ay_mai = mean_acceleration_intensity(ay_val)
                az_mai = mean_acceleration_intensity(az_val)
                a_mai = mean_acceleration_intensity(a_val)

                # gyroscope
                gx_val, _, _, gx_me, gx_sd = process_signal(sub_data[3])
                gy_val, _, _, gy_me, gy_sd = process_signal(sub_data[4])
                gz_val, _, _, gz_me, gz_sd = process_signal(sub_data[5])
                g_val, g_me, g_sd = join_accelerometer_signals(gx_val, gy_val, gz_val)
                gx_mcr = mean_crossing_rate(gx_val, gx_me)
                gy_mcr = mean_crossing_rate(gy_val, gy_me)
                gz_mcr = mean_crossing_rate(gz_val, gz_me)
                g_mcr = mean_crossing_rate(g_val, g_me)

                # force
                fa_val, _, _, fa_me, fa_sd = process_signal(sub_data[6])
                fb_val, _, _, fb_me, fb_sd = process_signal(sub_data[7])
                fc_val, _, _, fc_me, fc_sd = process_signal(sub_data[8])
                fd_val, _, _, fd_me, fd_sd = process_signal(sub_data[9])
                fa_mcr = mean_crossing_rate(fa_val, fa_me)
                fb_mcr = mean_crossing_rate(fb_val, fb_me)
                fc_mcr = mean_crossing_rate(fc_val, fc_me)
                fd_mcr = mean_crossing_rate(fd_val, fd_me)

                # cpu
                ca_val, _, _, ca_me, ca_sd = process_signal(sub_data[10])
                cb_val, _, _, cb_me, cb_sd = process_signal(sub_data[11])
                cc_val, _, _, cc_me, cc_sd = process_signal(sub_data[12])
                cd_val, _, _, cd_me, cd_sd = process_signal(sub_data[13])
                c_val, c_me, c_sd = join_cpu_signals(ca_val, cb_val, cc_val, cd_val)
                ca_min, ca_max, ca_mcr = get_cpu_stats(ca_val)
                cb_min, cb_max, cb_mcr = get_cpu_stats(cb_val)
                cc_min, cc_max, cc_mcr = get_cpu_stats(cc_val)
                cd_min, cd_max, cd_mcr = get_cpu_stats(cd_val)
                c_min, c_max, c_mcr = get_cpu_stats(c_val)

                # ram
                m_val, _, _, m_me, m_sd = process_signal(sub_data[14])
                derivatives, peaks = find_ram_jump(m_val)
                m_me, m_min, m_max, m_jc, m_jr, m_jv, m_iji = get_mem_stats(
                    m_val, peaks, derivatives
                )

                # net
                ns_val, _, _, ns_me, ns_sd = process_signal(sub_data[15])
                nr_val, _, _, nr_me, nr_sd = process_signal(sub_data[16])
                ns_der, ns_pe = find_net_jump(ns_val)
                nr_der, nr_pe = find_net_jump(nr_val)
                ns_sum, ns_jc, ns_jr, ns_jv, ns_iji = get_net_stats(
                    ns_val, ns_pe, ns_der
                )
                nr_sum, nr_jc, nr_jr, nr_jv, nr_iji = get_net_stats(
                    nr_val, nr_pe, nr_der
                )

                write_row = ",".join(
                    [
                        str(x)
                        for x in [
                            seance.user.id,
                            seance.id,
                            ax_me,
                            ax_sd,
                            ax_mcr,
                            ax_mai,
                            ay_me,
                            ay_sd,
                            ay_mcr,
                            ay_mai,
                            az_me,
                            az_sd,
                            az_mcr,
                            az_mai,
                            a_me,
                            a_sd,
                            a_mcr,
                            a_mai,
                            gx_me,
                            gx_sd,
                            gx_mcr,
                            gy_me,
                            gy_sd,
                            gy_mcr,
                            gz_me,
                            gz_sd,
                            gz_mcr,
                            g_me,
                            g_sd,
                            g_mcr,
                            fa_me,
                            fa_sd,
                            fa_mcr,
                            fb_me,
                            fb_sd,
                            fb_mcr,
                            fc_me,
                            fc_sd,
                            fc_mcr,
                            fd_me,
                            fd_sd,
                            fd_mcr,
                            ca_me,
                            ca_sd,
                            ca_min,
                            ca_max,
                            ca_mcr,
                            cb_me,
                            cb_sd,
                            cb_min,
                            cb_max,
                            cb_mcr,
                            cc_me,
                            cc_sd,
                            cc_min,
                            cc_max,
                            cc_mcr,
                            cd_me,
                            cd_sd,
                            cd_min,
                            cd_max,
                            cd_mcr,
                            c_me,
                            c_sd,
                            c_min,
                            c_max,
                            c_mcr,
                            m_me,
                            m_sd,
                            m_min,
                            m_max,
                            m_jc,
                            m_jr,
                            m_jv,
                            m_iji,
                            ns_me,
                            ns_sd,
                            ns_sum,
                            ns_jc,
                            ns_jr,
                            ns_jv,
                            ns_iji,
                            nr_me,
                            nr_sd,
                            nr_sum,
                            nr_jc,
                            nr_jr,
                            nr_jv,
                            nr_iji,
                        ]
                    ]
                )
                csv_data.write(write_row + "\n")
                start += step


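def calculate_scores(n, datas, epochs=50, verbose=True, norm=False):
    """
    Average the score returned by get_lda over `epochs` random draws of `n` users,
    once per segment interval. `datas` must be ordered like `segment_intervals`.
    With norm=True the averages are divided by the best average, so the top interval scores 1.
    """
    if verbose:
        print("Calculating for {} users.".format(n))
    scores = [[] for _ in range(epochs)]
    for i in range(epochs):
        # Draw one user sample per epoch and reuse it for every segment interval
        users = get_random_users(n=n, stats=False)
        for data in datas:
            _, score = get_lda(data, users, stats=False, score=True)
            scores[i].append(score)
    # Transpose to shape (intervals, epochs) so that each row holds one interval's scores
    scores = array(scores).T
    result = []
    max_score = max([mean(x) for x in scores])
    for i in range(len(datas)):
        if norm:
            result.append([segment_intervals[i], mean(scores[i]) / max_score, n])
        else:
            result.append([segment_intervals[i], mean(scores[i]), n])
    return DataFrame(result, columns=["time_interval", "score", "users"])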


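def features_list(index=-1):
    """
    Return the full list of feature names, or the single name at `index`.
    The default index=-1 means "return everything", not "return the last feature".
    """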
    features = [
        "ax_me",
        "ax_sd",
        "ax_mcr",
        "ax_mai",
        "ay_me",
        "ay_sd",
        "ay_mcr",
        "ay_mai",
        "az_me",
        "az_sd",
        "az_mcr",
        "az_mai",
        "a_me",
        "a_sd",
        "a_mcr",
        "a_mai",
        "gx_me",
        "gx_sd",
        "gx_mcr",
        "gy_me",
        "gy_sd",
        "gy_mcr",
        "gz_me",
        "gz_sd",
        "gz_mcr",
        "g_me",
        "g_sd",
        "g_mcr",
        "fa_me",
        "fa_sd",
        "fa_mcr",
        "fb_me",
        "fb_sd",
        "fb_mcr",
        "fc_me",
        "fc_sd",
        "fc_mcr",
        "fd_me",
        "fd_sd",
        "fd_mcr",
        "ca_me",
        "ca_sd",
        "ca_min",
        "ca_max",
        "ca_mcr",
        "cb_me",
        "cb_sd",
        "cb_min",
        "cb_max",
        "cb_mcr",
        "cc_me",
        "cc_sd",
        "cc_min",
        "cc_max",
        "cc_mcr",
        "cd_me",
        "cd_sd",
        "cd_min",
        "cd_max",
        "cd_mcr",
        "c_me",
        "c_sd",
        "c_min",
        "c_max",
        "c_mcr",
        "m_me",
        "m_sd",
        "m_min",
        "m_max",
        "m_jc",
        "m_jr",
        "m_jv",
        "m_iji",
        "ns_me",
        "ns_sd",
        "ns_sum",
        "ns_jc",
        "ns_jr",
        "ns_jv",
        "ns_iji",
        "nr_me",
        "nr_sd",
        "nr_sum",
        "nr_jc",
        "nr_jr",
        "nr_jv",
        "nr_iji",
    ]
    if index == -1:
        return features
    else:
        return features[index]


def show_graphs():
    """
    Not meant to be run directly; it is an example of how to visualize the results.
    It lives here to keep it out of the way and reads a global `datas` list
    ordered like `segment_intervals`.
    """

    # Parameters
    number_of_users = 7
    segment_interval_length = 60

    # Pick n random users and look up the index of the chosen segment interval
    users = get_random_users(n=number_of_users)
    interval = segment_intervals.index(segment_interval_length)

    data = datas[interval]
    data = data[data.user.isin(users)]
    X = data.iloc[:, 2:].values
    y = data.iloc[:, 0].values.ravel()
    z = data.iloc[:, 1].values.ravel()

    lda = LinearDiscriminantAnalysis(n_components=2)
    X_lda = lda.fit_transform(X, y)
    score = round(metrics.calinski_harabasz_score(X_lda, y))
    y = y.reshape(len(y), 1)
    z = z.reshape(len(z), 1)
    df = DataFrame(
        [list(y) + list(z) + list(x) for x, y, z in zip(X_lda, y, z)],
        columns=["user", "try", "component_1", "component_2"],
    )
    fig = scatter(
        df,
        x="component_1",
        y="component_2",
        color="user",
        color_continuous_scale="Rainbow",
        title="number of users: {}, segment interval: {} seconds, score: {}".format(
            number_of_users, segment_intervals[interval], score
        ),
    )
    fig.show()

    lda = LinearDiscriminantAnalysis(n_components=3)
    # y was reshaped to a column vector above; scikit-learn expects 1-D labels
    X_lda = lda.fit_transform(X, y.ravel())
    score = round(metrics.calinski_harabasz_score(X_lda, y.ravel()))
    y = y.reshape(len(y), 1)
    z = z.reshape(len(z), 1)
    df = DataFrame(
        [list(y) + list(z) + list(x) for x, y, z in zip(X_lda, y, z)],
        columns=["user", "try", "component_1", "component_2", "component_3"],
    )
    fig = scatter_3d(
        df,
        x="component_1",
        y="component_2",
        z="component_3",
        color="user",
        color_continuous_scale="Rainbow",
        title="number of users: {}, segment interval: {} seconds, score: {}".format(
            number_of_users, segment_intervals[interval], score
        ),
    )
    fig.show()


def get_users_data(data):
    """
    Return a dict that maps each user to the sorted list of their seance ids.
    """
    users = {}
    for user in list(set(data["user"])):
        x = data[data["user"] == user]
        users.update({user: sorted(list(set(x["seance"])))})
    return users


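def get_per_sensor_data(experiment=1, pc_monitor=False):
    """
    Load the segmented csv files of one experiment and split the features by sensor.
    Returns a dict that maps a sensor name to a list of DataFrames, one per segment interval.
    With pc_monitor=True, only rows whose cpu, memory and network features have a positive mean are kept.
    """
    segment_intervals = [10, 15, 30, 45, 60, 75, 90, 120, 150, 180]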
    com = []
    acc = []
    gyr = []
    frc = []
    cpu = []
    ram = []
    net = []

    for interval in segment_intervals:
        data_name = "data/segmented_data_{}_seconds_experiment_{}.csv".format(interval, experiment)
        data = read_csv(data_name).fillna(0)
        if pc_monitor:
            remain = []
            i = 0
            for x in data[
                [
                    "ca_me",
                    "ca_sd",
                    "ca_min",
                    "ca_max",
                    "ca_mcr",
                    "cb_me",
                    "cb_sd",
                    "cb_min",
                    "cb_max",
                    "cb_mcr",
                    "cc_me",
                    "cc_sd",
                    "cc_min",
                    "cc_max",
                    "cc_mcr",
                    "cd_me",
                    "cd_sd",
                    "cd_min",
                    "cd_max",
                    "cd_mcr",
                    "c_me",
                    "c_sd",
                    "c_min",
                    "c_max",
                    "c_mcr",
                    "m_me",
                    "m_sd",
                    "m_min",
                    "m_max",
                    "m_jc",
                    "m_jr",
                    "m_jv",
                    "m_iji",
                    "ns_me",
                    "ns_sd",
                    "ns_sum",
                    "ns_jc",
                    "ns_jr",
                    "ns_jv",
                    "ns_iji",
                    "nr_me",
                    "nr_sd",
                    "nr_sum",
                    "nr_jc",
                    "nr_jr",
                    "nr_jv",
                    "nr_iji",
                ]
            ].values:
                if mean(x) > 0:
                    remain.append(i)
                i += 1
            data = data.iloc[remain, :]
        label = data.iloc[:, 0:2]
        com.append(data)
        acc.append(label.merge(data.iloc[:, 2:18], left_index=True, right_index=True))
        gyr.append(label.merge(data.iloc[:, 18:30], left_index=True, right_index=True))
        frc.append(label.merge(data.iloc[:, 30:42], left_index=True, right_index=True))
        cpu.append(label.merge(data.iloc[:, 42:67], left_index=True, right_index=True))
        ram.append(label.merge(data.iloc[:, 67:75], left_index=True, right_index=True))
        net.append(label.merge(data.iloc[:, 75:89], left_index=True, right_index=True))
    return {
        "all data": com,
        "accelerometer": acc,
        "gyroscope": gyr,
        "force sensors": frc,
        "cpu": cpu,
        "memory": ram,
        "network": net,
    }


def combine_sensor_data(sensors, names):
    """
    Merge the data of two sensors, names[0] and names[1], interval by interval.
    The label columns are taken from the first sensor only.
    """
    return {
        "{} + {}".format(names[0], names[1]): [
            x.merge(y.iloc[:, 2:], left_index=True, right_index=True)
            for x, y in zip(sensors[names[0]], sensors[names[1]])
        ]
    }


def center_graphs():
    """
    Center the notebook's output areas (and so the plotly graphs) with a CSS rule.
    """
    from IPython.display import display, HTML
    from plotly.offline import init_notebook_mode

    init_notebook_mode(connected=True)
    display(
        HTML("""<style>.output_area {display: flex;justify-content: center;}</style>""")
    )


center_graphs()