# Linear Discriminant Analysis

In this document we discuss potential uses of linear discriminant analysis (LDA) and apply it to our dataset for some basic analysis and classification.

<!-- At the begging of every session (when running code), first run [helpers](#section_helpers) and then [get csv data](#section_csv) to load all the libraries, functions and data, used by this notebook. -->

## Data preparation
Our dataset consists of 21 users. Each user performed 3 predetermined tasks and repeated each task twice, which gives 6 seances per user and $6 \times 21 = 126$ seances in total. As we only possess 2 recordings of the same task per user, we need to generate more data points before we can apply any kind of machine learning technique.

To accomplish that, we segmented the data into bins lasting $m$ seconds, calculated basic statistical features for each bin (complete list below) and saved the results into separate csv files, one per value of $m \in \{10, 15, 30, 45, 60, 75, 90, 120, 150, 180\}$.
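The segmentation step can be sketched as follows. This is only an illustration, not the notebook's `generate_segmented_data_csv` helper: the function name `segment_features`, the input layout (sorted timestamps in seconds plus values) and the two features shown are assumptions.

```python
import numpy as np


def segment_features(timestamps, values, seconds):
    """Split a signal into consecutive bins of `seconds` length and
    return (mean, standard deviation) for each non-empty bin.

    Assumes `timestamps` are sorted and expressed in seconds.
    """
    timestamps = np.asarray(timestamps, dtype=float)
    values = np.asarray(values, dtype=float)
    # Index of the bin each sample falls into, counted from the first sample.
    bin_index = ((timestamps - timestamps[0]) // seconds).astype(int)
    features = []
    for b in range(bin_index.max() + 1):
        chunk = values[bin_index == b]
        if chunk.size:  # skip bins that contain no samples
            features.append((chunk.mean(), chunk.std()))
    return features


# Hypothetical signal: one sample per second for 30 seconds.
t = np.arange(30)
v = np.arange(30, dtype=float)
rows = segment_features(t, v, seconds=10)  # three bins of ten samples each
```

Shorter bins yield more data points per seance but fewer samples per bin, which is the trade-off explored later in this document.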

The features we extracted from the data:
- Accelerometer:
    - mean of x, y, z and combined,
    - standard deviation of x, y, z and combined,
    - mean crossing rate of x, y, z and combined,
    - mean derivative value of x, y, z and combined
- Gyroscope:
    - mean of x, y, z and combined,
    - standard deviation of x, y, z and combined,
    - mean crossing rate of x, y, z and combined
- Force sensors:
    - mean of all four sensors,
    - standard deviation of all four sensors,
    - mean crossing rate of all four sensors
- Processor usage:
    - mean of each core and combined,
    - standard deviation of each core and combined, 
    - minimum value of each core and combined,
    - maximum value of each core and combined,
    - mean crossing rate of each core and combined
- Memory usage:
    - mean,
    - standard deviation,
    - minimum value,
    - maximum value,
    - number of significant jumps,
    - rate of significant jumps,
    - mean jump value,
    - mean inter jump intervals
- Network:
    - mean of cumulative sent and received packages,
    - standard deviation of cumulative sent and received packages,
    - difference in value between start and end of cumulative sent and received packages,
    - number of significant jumps of cumulative sent and received packages,
    - rate of significant jumps of cumulative sent and received packages,
    - mean jump value of cumulative sent and received packages,
    - mean inter jump intervals of cumulative sent and received packages
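
As an illustration of one of the less common features in the list above, the mean crossing rate can be sketched as the fraction of consecutive sample pairs that lie on opposite sides of the signal mean. The function below is a hypothetical re-implementation under that assumed definition; the notebook's own `mean_crossing_rate` helper may differ in details.

```python
import numpy as np


def mean_crossing_rate_sketch(values):
    """Fraction of consecutive sample pairs that cross the signal mean."""
    values = np.asarray(values, dtype=float)
    if values.size < 2:
        return 0.0
    centered = values - values.mean()
    signs = np.sign(centered)
    # A crossing is a strict sign change; samples exactly on the mean
    # (sign 0) are not counted as crossings in this sketch.
    crossings = np.count_nonzero(signs[:-1] * signs[1:] < 0)
    return crossings / (values.size - 1)


alternating = mean_crossing_rate_sketch([1, 3, 1, 3, 1])  # crosses at every step
single_step = mean_crossing_rate_sketch([1, 1, 3, 3])     # crosses once
```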

In [None]:
import warnings

# Yes, yes, do not use this... we only have bunch of mean of empty slice and similar warnings
warnings.filterwarnings("ignore")

for seconds in [10, 15, 30, 45, 60, 75, 90, 120, 150, 180]:
    generate_segmented_data_csv(seconds=seconds, experiment=1)
    generate_segmented_data_csv(seconds=seconds, experiment=2)
    generate_segmented_data_csv(seconds=seconds, experiment=3)

In [9]:
segment_intervals = [10, 15, 30, 45, 60, 75, 90, 120, 150, 180]
datas_1 = []
datas_2 = []
datas_3 = []

for interval in segment_intervals:
    datas_1.append(read_csv("jupyter/data/segmented_data_{}_seconds_experiment_1.csv".format(interval)).fillna(0))
    datas_2.append(read_csv("jupyter/data/segmented_data_{}_seconds_experiment_2.csv".format(interval)).fillna(0))
    datas_3.append(read_csv("jupyter/data/segmented_data_{}_seconds_experiment_3.csv".format(interval)).fillna(0))

### DOING Adding frequency domain features

To calculate the power of the signal at each spectral frequency, we perform a discrete Fourier transform on the data from the accelerometer, the gyroscope and the force sensors.
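
The idea can be sketched on a synthetic signal before applying it to sensor data. The example below assumes an evenly sampled signal at 200 Hz (one sample every 5 ms, matching the interpolation step used in the cell below); the 10 Hz sine and all variable names are made up for illustration.

```python
import numpy as np

sampling_rate = 200.0  # Hz, i.e. one sample every 5 ms (assumed)
n_samples = 400        # two seconds of data
t = np.arange(n_samples) / sampling_rate
signal = np.sin(2 * np.pi * 10.0 * t)  # synthetic 10 Hz sine

# rfft returns only the non-negative frequencies of a real-valued signal.
spectrum = np.fft.rfft(signal)
# Passing the sample spacing makes rfftfreq return frequencies in Hertz.
freqs = np.fft.rfftfreq(n_samples, d=1.0 / sampling_rate)
power = np.abs(spectrum) ** 2 / n_samples

# Skip the DC term (index 0) when looking for the dominant frequency.
peak_frequency = freqs[np.argmax(power[1:]) + 1]
```

Real sensor records are not evenly spaced in time, which is why the notebook interpolates them to a fixed step before transforming.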

In [72]:
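from datetime import datetime, timedelta
from math import sqrt

from numpy.fft import fft, fftfreq
from scipy.interpolate import interp1d
from sklearn.metrics.pairwise import euclidean_distances

# `mean`, the plotting functions and the ORM models used below are assumed
# to be provided by the helpers cell.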

# Interpolation
def interpolate(data, step: int):
    """
    Interpolate a list of sensor records to the given millisecond interval.
    Return the interpolated values and their associated times.
    """
    # Sort the records that were passed in, not the global `acc_x`.
    data = sorted(data, key=lambda x: x.timestamp)
    delta = timedelta(milliseconds=step)
    current_step = data[0].timestamp + delta
    values = [x.value for x in data]
    tim = [datetime.timestamp(x.timestamp) for x in data]
    interpolator = interp1d(tim, values)
    values_inter = []
    tim_inter = []
    # Sample the interpolated signal every `step` milliseconds.
    while current_step <= data[-1].timestamp:
        values_inter.append(interpolator(datetime.timestamp(current_step)))
        tim_inter.append(current_step)
        current_step += delta
    return values_inter, tim_inter

results = {}
frequencies = []
u = 5
for user in ["test_subject_05", "test_subject_06", "test_subject_07", "test_subject_08", "test_subject_09", "test_subject_10", "test_subject_11", "test_subject_12"]:
    print("u: {}".format(u))
    for s in [0, 3]:
        print("s: {}".format(s))
        seance = Seance.objects.filter(user__username=user, valid=True).order_by("created")[s]
        records = SensorRecord.objects.filter(seance=seance,)

        acc_x = records.filter(sensor__topic="accel01_z").order_by("timestamp")
        values = [x.value for x in acc_x]
        tim = [x.timestamp for x in acc_x]
        fig = line(x=tim, y=values)
        fig.update_layout(title="Original signal")
#         fig.show()

        step = 5
        values_inter, tim_inter = interpolate(acc_x, step)
        fig = line(x=tim_inter, y=values_inter)
        fig.update_layout(title="Interpolated signal")
#         fig.show()

        # Frequency domain analysis
        n = len(values_inter)
        y = fft(values_inter)
        f = fftfreq(len(y))
        # Multiply frequency by sampling rate, to obtain Hertz
        f = f * (1000 / step)

        # Take only first half of the fft
        y = y[1 : len(y) // 2]
        f = f[1 : len(f) // 2]
        mag = [sqrt(r ** 2 + i ** 2) for r, i in zip(y.real, y.imag)]

        fig = line(x=f, y=mag)
#         fig.show()

        # Average the magnitudes in 1 Hz wide bins up to 100 Hz.
        prev = 0
        bin_width = 1
        f = list(f)
        squashed = {"freq": [], "value": []}
        for freq_step in range(bin_width, 101, bin_width):
            indices = [k for k, x in enumerate(f) if prev < x < freq_step]
            # Store the bin centre before moving the lower edge forward.
            squashed["freq"].append((prev + freq_step) / 2)
            squashed["value"].append(mean([mag[k] for k in indices]))
            prev = freq_step
        frequencies = squashed["freq"][10:]
        fig = line(x=squashed["freq"], y=squashed["value"])
        fig.update_layout(title="User {}".format(user))
        results.update({"user_{}_try_{}".format(u, s): squashed["value"][10:]})
#         fig.show()
    u += 1
#     break
X = []
for x in results:
    print(x)
    X.append(results[x])
print(euclidean_distances(X))

# data.append([sensor, f, mag])
#         fig = make_subplots(
#             rows=2,
#             cols=3,
#             subplot_titles=[x[0] for x in data],
#         )
#         i = 0
#         for x in data:
#             fig.add_trace(
#                 go.Scatter(
#                     x=x[1],
#                     y=x[2],
#                     mode="lines",
# #                     marker=dict(color=lda["user"], colorscale="rainbow"),
#                 ),
#                 row=int(i / 3) + 1,
#                 col=i % 3 + 1,
#             )
#             i += 1
#         fig.update_layout(title="Fft magnitudes at different frequencies, task 1, user {}".format(user))
#         fig.show()


def generate_segmented_frequency_domaing_data_csv(seconds: int, experiment: int = 1):
    """
    Generate csv file with calculated features, from data subsampled to the given time interval.
    """
    step = timedelta(seconds=seconds)
    seances = Seance.objects.filter(
        experiment__sequence_number=experiment, valid=True
    ).order_by("start")

    print(
        "Generating segmented data csv file with {} seconds intervals for {} seances.".format(
            seconds, seances.count()
        )
    )

    file_name = "segmented_data_{}_seconds_experiment_{}.csv".format(
        seconds, experiment
    )
    with open(file_name, "w") as csv_data:
        csv_data.write(
            "user,seance,ax_me,ax_sd,ax_mcr,ax_mai,ay_me,ay_sd,ay_mcr,ay_mai,az_me,az_sd,az_mcr,az_mai,a_me,"
            + "a_sd,a_mcr,a_mai,gx_me,gx_sd,gx_mcr,gy_me,gy_sd,gy_mcr,gz_me,gz_sd,gz_mcr,g_me,g_sd,g_mcr,fa_me,"
            + "fa_sd,fa_mcr,fb_me,fb_sd,fb_mcr,fc_me,fc_sd,fc_mcr,fd_me,fd_sd,fd_mcr,ca_me,ca_sd,ca_min,ca_max,"
            + "ca_mcr,cb_me,cb_sd,cb_min,cb_max,cb_mcr,cc_me,cc_sd,cc_min,cc_max,cc_mcr,cd_me,cd_sd,cd_min,cd_max,"
            + "cd_mcr,c_me,c_sd,c_min,c_max,c_mcr,m_me,m_sd,m_min,m_max,m_jc,m_jr,m_jv,m_iji,ns_me,ns_sd,ns_sum,"
            + "ns_jc,ns_jr,ns_jv,ns_iji,nr_me,nr_sd,nr_sum,nr_jc,nr_jr,nr_jv,nr_iji,ir_me\n"
        )
        count = 1
        for seance in seances[:]:
            print(
                "-----------------------------------------------------------------------------"
            )
            print("{} of {}".format(count, seances.count()))
            count += 1
            print(seance)
            start = seance.start
            end = seance.end
            data = (
                list(load_data(seance.id, "accelerometer"))
                + list(load_data(seance.id, "gyroscope"))
                + list(load_data(seance.id, "force"))
                + list(load_data(seance.id, "cpu"))
                + [load_data(seance.id, "ram")]
                + list(load_data(seance.id, "net"))
            )
            pir_data = load_data(seance.id, "pir")
            pir_datas = process_pir_data(
                pir_data, seance.start, seance.end, seconds, interval=10
            )

            i = 0
            while start < end:
                sub_data = []
                for x in data:
                    try:
                        x[0].sensor
                        sub_data.append(
                            x.filter(timestamp__range=(start, start + step))
                        )
                    except IndexError:
                        sub_data.append([])

                # accelerometer
                ax_val, _, _, ax_me, ax_sd = process_signal(sub_data[0])
                ay_val, _, _, ay_me, ay_sd = process_signal(sub_data[1])
                az_val, _, _, az_me, az_sd = process_signal(sub_data[2])
                a_val, a_me, a_sd = join_accelerometer_signals(ax_val, ay_val, az_val)
                ax_mcr = mean_crossing_rate(ax_val, ax_me)
                ay_mcr = mean_crossing_rate(ay_val, ay_me)
                az_mcr = mean_crossing_rate(az_val, az_me)
                a_mcr = mean_crossing_rate(a_val, a_me)
                ax_mai = mean_acceleration_intensity(ax_val)
                ay_mai = mean_acceleration_intensity(ay_val)
                az_mai = mean_acceleration_intensity(az_val)
                a_mai = mean_acceleration_intensity(a_val)

                # gyroscope
                gx_val, _, _, gx_me, gx_sd = process_signal(sub_data[3])
                gy_val, _, _, gy_me, gy_sd = process_signal(sub_data[4])
                gz_val, _, _, gz_me, gz_sd = process_signal(sub_data[5])
                g_val, g_me, g_sd = join_accelerometer_signals(gx_val, gy_val, gz_val)
                gx_mcr = mean_crossing_rate(gx_val, gx_me)
                gy_mcr = mean_crossing_rate(gy_val, gy_me)
                gz_mcr = mean_crossing_rate(gz_val, gz_me)
                g_mcr = mean_crossing_rate(g_val, g_me)

                # force
                fa_val, _, _, fa_me, fa_sd = process_signal(sub_data[6])
                fb_val, _, _, fb_me, fb_sd = process_signal(sub_data[7])
                fc_val, _, _, fc_me, fc_sd = process_signal(sub_data[8])
                fd_val, _, _, fd_me, fd_sd = process_signal(sub_data[9])
                fa_mcr = mean_crossing_rate(fa_val, fa_me)
                fb_mcr = mean_crossing_rate(fb_val, fb_me)
                fc_mcr = mean_crossing_rate(fc_val, fc_me)
                fd_mcr = mean_crossing_rate(fd_val, fd_me)

                # cpu
                ca_val, _, _, ca_me, ca_sd = process_signal(sub_data[10])
                cb_val, _, _, cb_me, cb_sd = process_signal(sub_data[11])
                cc_val, _, _, cc_me, cc_sd = process_signal(sub_data[12])
                cd_val, _, _, cd_me, cd_sd = process_signal(sub_data[13])
                c_val, c_me, c_sd = join_cpu_signals(ca_val, cb_val, cc_val, cd_val)
                ca_min, ca_max, ca_mcr = get_cpu_stats(ca_val)
                cb_min, cb_max, cb_mcr = get_cpu_stats(cb_val)
                cc_min, cc_max, cc_mcr = get_cpu_stats(cc_val)
                cd_min, cd_max, cd_mcr = get_cpu_stats(cd_val)
                c_min, c_max, c_mcr = get_cpu_stats(c_val)

                # ram
                m_val, _, _, m_me, m_sd = process_signal(sub_data[14])
                derivatives, peaks = find_ram_jump(m_val)
                m_me, m_min, m_max, m_jc, m_jr, m_jv, m_iji = get_mem_stats(
                    m_val, peaks, derivatives
                )

                # net
                ns_val, _, _, ns_me, ns_sd = process_signal(sub_data[15])
                nr_val, _, _, nr_me, nr_sd = process_signal(sub_data[16])
                ns_der, ns_pe = find_net_jump(ns_val)
                nr_der, nr_pe = find_net_jump(nr_val)
                ns_sum, ns_jc, ns_jr, ns_jv, ns_iji = get_net_stats(
                    ns_val, ns_pe, ns_der
                )
                nr_sum, nr_jc, nr_jr, nr_jv, nr_iji = get_net_stats(
                    nr_val, nr_pe, nr_der
                )

                # ir sensors
                ir_me = pir_datas[i]

                write_row = ",".join(
                    [
                        str(x)
                        for x in [
                            seance.user.id,
                            seance.id,
                            ax_me,
                            ax_sd,
                            ax_mcr,
                            ax_mai,
                            ay_me,
                            ay_sd,
                            ay_mcr,
                            ay_mai,
                            az_me,
                            az_sd,
                            az_mcr,
                            az_mai,
                            a_me,
                            a_sd,
                            a_mcr,
                            a_mai,
                            gx_me,
                            gx_sd,
                            gx_mcr,
                            gy_me,
                            gy_sd,
                            gy_mcr,
                            gz_me,
                            gz_sd,
                            gz_mcr,
                            g_me,
                            g_sd,
                            g_mcr,
                            fa_me,
                            fa_sd,
                            fa_mcr,
                            fb_me,
                            fb_sd,
                            fb_mcr,
                            fc_me,
                            fc_sd,
                            fc_mcr,
                            fd_me,
                            fd_sd,
                            fd_mcr,
                            ca_me,
                            ca_sd,
                            ca_min,
                            ca_max,
                            ca_mcr,
                            cb_me,
                            cb_sd,
                            cb_min,
                            cb_max,
                            cb_mcr,
                            cc_me,
                            cc_sd,
                            cc_min,
                            cc_max,
                            cc_mcr,
                            cd_me,
                            cd_sd,
                            cd_min,
                            cd_max,
                            cd_mcr,
                            c_me,
                            c_sd,
                            c_min,
                            c_max,
                            c_mcr,
                            m_me,
                            m_sd,
                            m_min,
                            m_max,
                            m_jc,
                            m_jr,
                            m_jv,
                            m_iji,
                            ns_me,
                            ns_sd,
                            ns_sum,
                            ns_jc,
                            ns_jr,
                            ns_jv,
                            ns_iji,
                            nr_me,
                            nr_sd,
                            nr_sum,
                            nr_jc,
                            nr_jr,
                            nr_jv,
                            nr_iji,
                            ir_me,
                        ]
                    ]
                )
                csv_data.write(write_row + "\n")
                start += step
                i += 1

u: 5
s: 0
s: 3
u: 6
s: 0
s: 3
u: 7
s: 0
s: 3
u: 8
s: 0
s: 3
u: 9
s: 0
s: 3
u: 10
s: 0
s: 3
u: 11
s: 0
s: 3
u: 12
s: 0
s: 3
user_5_try_0
user_5_try_3
user_6_try_0
user_6_try_3
user_7_try_0
user_7_try_3
user_8_try_0
user_8_try_3
user_9_try_0
user_9_try_3
user_10_try_0
user_10_try_3
user_11_try_0
user_11_try_3
user_12_try_0
user_12_try_3
[[  0.          18.62459736  31.43411991  22.43800261  73.15287435
   68.17670593  17.29736719  12.70558719  24.45037444  88.08044882
   89.66519488  45.53095732  19.49680652  16.09203714  17.35932019
   13.0800218 ]
 [ 18.62459736   0.          48.31919592  38.90551919  90.08212921
   85.16996251  30.99363151  23.55600136  15.31415273 105.296568
   73.65376453  30.75243654  35.54531332  31.32293447  32.09523524
   27.7638164 ]
 [ 31.43411991  48.31919592   0.          15.77836393  46.13077033
   41.7025005   26.70446382  32.13942501  52.41641162  59.47413893
  119.27411537  74.5986758   16.77643971  21.06183073  21.62498111
   23.61540836]
 [ 22.43800261

In [75]:
users = [x for x in results]
fig = go.Figure(data=go.Heatmap(x=users, y=users, z=euclidean_distances(X), colorscale='Viridis'))
fig.show()

## Clustering qualities of LDA

LDA is primarily a dimensionality reduction technique and is often used to make data easier to visualize. We therefore start by visualizing the data in this low-dimensional (2D) space and performing some basic analysis.

### Evaluation of cluster quality

The result of either method, LDA or PCA, is a set of points in n-dimensional space. Our goal is to maximize the distance between the clusters of points belonging to different users while keeping the points of the same user close together. To measure this we use the Calinski-Harabasz index, also known as the Variance Ratio Criterion (VRC). It is defined as the ratio of the between-cluster dispersion to the within-cluster dispersion, each normalized by its degrees of freedom, so a higher score indicates better separated clusters. More information is available in the [scikit-learn clustering documentation](https://scikit-learn.org/stable/modules/clustering.html).
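
For reference, the index can be computed by hand in a few lines. The sketch below follows the standard definition (between-cluster dispersion divided by $k - 1$, over within-cluster dispersion divided by $n - k$); the function name `variance_ratio_criterion` and the toy data are made up for illustration. scikit-learn provides the same measure as `sklearn.metrics.calinski_harabasz_score`.

```python
import numpy as np


def variance_ratio_criterion(X, labels):
    """Calinski-Harabasz index for points X with cluster labels."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    n, k = X.shape[0], classes.size
    overall_mean = X.mean(axis=0)
    between = 0.0
    within = 0.0
    for c in classes:
        members = X[labels == c]
        center = members.mean(axis=0)
        # Squared distance of the cluster centre from the overall centre,
        # weighted by the cluster size.
        between += members.shape[0] * np.sum((center - overall_mean) ** 2)
        # Squared distances of the members from their own centre.
        within += np.sum((members - center) ** 2)
    return (between / (k - 1)) / (within / (n - k))


# Two tight clusters that are far apart give a high score.
points = [[0, 0], [0, 1], [10, 0], [10, 1]]
score = variance_ratio_criterion(points, [0, 0, 1, 1])
```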

### Comparison to PCA
In contrast to PCA, which is unsupervised, LDA uses class labels to separate the data. As our data is labelled, LDA should distinguish much better between different users. The graphs below show the difference in the spread of the data after taking the 2 most prominent components of LDA and of PCA. With LDA we observe a much higher explained variance ratio for the first 2 components, as well as a much cleaner separation between classes. The results remain consistent even when we change the number of users, the time interval or both.
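
The effect of using labels can be reproduced on synthetic data. In the sketch below (not the notebook's dataset), three hypothetical users differ only in one low-variance feature, while the remaining features are high-variance noise; PCA follows the noise, whereas LDA finds the direction that separates the users. All data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(0)

# Synthetic stand-in: 3 "users" with 40 segments each and 6 features.
labels = np.repeat([0, 1, 2], 40)
X = rng.normal(scale=5.0, size=(120, 6))            # noise unrelated to the user
X[:, 0] = rng.normal(loc=3.0 * labels, scale=0.5)   # the only informative feature

# LDA is supervised (uses labels); PCA only looks at the overall variance.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, labels)
X_pca = PCA(n_components=2).fit_transform(X)

lda_score = calinski_harabasz_score(X_lda, labels)
pca_score = calinski_harabasz_score(X_pca, labels)
print("LDA score: {:.1f}, PCA score: {:.1f}".format(lda_score, pca_score))
```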

In [226]:
# Data
datas = datas_1
# Parameters
number_of_users = 7
segment_interval_length = 60

# Get n random users and segment interval length
users = get_random_users(n=number_of_users)
interval = segment_intervals.index(segment_interval_length)
# Perform and visualize lda
lda, score = get_lda(datas[interval], users, score=True, stats=False)
fig = scatter(
    lda,
    x="component_1",
    y="component_2",
    color="user",
    color_continuous_scale="Rainbow",
    title="LDA TECHNIQUE; number of users: {}, segment interval: {} seconds, score: {}".format(
        number_of_users, segment_intervals[interval], score
    ),
)
fig.show()
# Perform and visualize pca
pca, score = get_pca(datas[interval], users, score=True, stats=False)
fig = scatter(
    pca,
    x="component_1",
    y="component_2",
    color="user",
    color_continuous_scale="Rainbow",
    title="PCA TECHNIQUE; number of users: {}, segment interval: {} seconds, score: {}".format(
        number_of_users, segment_intervals[interval], score
    ),
)
fig.show()

Users: [11, 17, 18, 20, 21, 22, 24]


### Time intervals vs user number graphs with LDA

Below is a series of graphs for different numbers of users and different time intervals, each annotated with its VRC score. We can confirm that as the number of users increases, longer time intervals are needed to properly discern between users. The interesting part is that the quality of separation is concave in the interval length: for each number of users there is a peak in the score, which diminishes on both sides.

In [227]:
# Data
datas = datas_3
warnings.filterwarnings("ignore")

for user_number in [5, 9, 13]:
    users = get_random_users(n=user_number)
    # Run LDA once per segment interval and reuse it for titles and plots.
    lda_results = [get_lda(data, users, stats=False, score=True) for data in datas]
    fig = make_subplots(
        rows=2,
        cols=5,
        subplot_titles=[
            "{} seconds, {}".format(segment_intervals[i], int(score))
            for i, (_, score) in enumerate(lda_results)
        ],
    )
    for i, (lda, _) in enumerate(lda_results):
        fig.add_trace(
            go.Scatter(
                x=lda["component_1"],
                y=lda["component_2"],
                mode="markers",
                marker=dict(color=lda["user"], colorscale="rainbow"),
            ),
            row=i // 5 + 1,
            col=i % 5 + 1,
        )
    fig.update_layout(title="Number of users: {}".format(user_number))
    fig.show()

Users: [8, 14, 15, 23, 24]


Users: [7, 10, 11, 13, 14, 20, 21, 23, 25]


Users: [9, 12, 13, 14, 15, 16, 19, 20, 21, 22, 25, 26, 27]


### Score vs time intervals with number of users

To better understand how LDA behaves for different numbers of users and different time intervals, we ran 50 LDA calculations for each number of users and took the mean VRC score for every time interval (the number of repetitions is set by `e` in the cell below). The data is shown in the graph below. It confirms our assumption that as the number of users increases, we need longer time intervals to discern between users. Scores are scaled to the [0, 1] interval for easier comparison, as scores are generally lower for higher numbers of users. The graph is interactive: you can add or remove series by clicking on the corresponding name in the legend.
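
The averaging and scaling described above can be sketched as follows. How `calculate_scores` normalizes internally is not shown in this document, so min-max scaling is an assumption here, and `mean_scaled_scores` is a hypothetical name.

```python
import numpy as np


def mean_scaled_scores(score_runs):
    """Average scores over repetitions and scale the result to [0, 1].

    `score_runs` has one row per repetition and one column per time interval.
    """
    mean_scores = np.asarray(score_runs, dtype=float).mean(axis=0)
    low, high = mean_scores.min(), mean_scores.max()
    if high == low:  # avoid dividing by zero for a flat curve
        return np.zeros_like(mean_scores)
    return (mean_scores - low) / (high - low)


# Two hypothetical repetitions over three time intervals.
runs = [[10.0, 30.0, 20.0], [14.0, 34.0, 20.0]]
scaled = mean_scaled_scores(runs)  # the peak interval maps to 1, the worst to 0
```

Scaling each curve separately makes the position of the peak comparable across user counts, at the cost of hiding the absolute score levels.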

<!--Results are much more representative if run on 500 epochs or more, but that takes a bit longer. Use norm flag to scale all score data on 0-1 interval and see at which time interval there is a peak for each user count.-->

In [None]:
# Data
datas = datas_1
e = 500
norm = True
verbose = False
warnings.filterwarnings("ignore")

for datas in [datas_1]:
    print("Performing {} epochs".format(e))
    results = []
    for i in range(3, 22):
        results.append(calculate_scores(i, datas, epochs=e, norm=norm, verbose=verbose))

    df = concat(results)
    print(df.to_string())
    fig = line(df, x="time_interval", y="score", color="users")
    fig.show()

Performing 500 epochs


In [None]:
# Cached data
fig = line(df, x="time_interval", y="score", color="users")
fig.show()

## Understanding which features are the most informative

There are two approaches we would like to take to determine the informativeness of our data. The first is to extract the eigenvectors after fitting the LDA model. The second is to use feature selection algorithms to validate the first approach and to find the best possible subset of our manually engineered feature set.

### Eigenvectors

To better understand which sensors generate informative data, we would like to find out which features are most prominent in each LDA component. The values below are calculated from the eigenvectors of the LDA. We took one vector at a time, normalized its absolute values, selected the largest ones (the features that contribute the most to that component) and weighted them by the explained variance ratio of the component.

All of the prominent features in both components belong to the accelerometer, which suggests that the other sensors are largely redundant. We will use feature selection algorithms to determine the informative value of each feature more precisely, and later select the feature subset that gives the highest possible accuracy.
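
The weighting scheme can be written compactly with array operations. The sketch below takes a scalings matrix (features by components) and the explained variance ratios as plain inputs; the function name and the numbers are hypothetical and only illustrate the calculation.

```python
import numpy as np


def weighted_feature_contributions(scalings, explained_variance_ratio, names):
    """Rank features per component by |weight| share times explained variance."""
    scalings = np.abs(np.asarray(scalings, dtype=float))
    # Normalise each component's weights so that they sum to 1.
    shares = scalings / scalings.sum(axis=0, keepdims=True)
    # Weight every column by the explained variance ratio of its component.
    weighted = shares * np.asarray(explained_variance_ratio, dtype=float)
    ranked = []
    for j in range(weighted.shape[1]):
        order = np.argsort(weighted[:, j])[::-1]  # largest contribution first
        ranked.append([(names[i], weighted[i, j]) for i in order])
    return ranked


# Hypothetical scalings for two features and two components.
scalings = np.array([[3.0, -1.0], [-1.0, 3.0]])
ranked = weighted_feature_contributions(scalings, [0.8, 0.2], ["a_me", "az_me"])
```

Note that this ranking is only meaningful when the features are on comparable scales, since the size of a weight also depends on the units of its feature.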

In [11]:
# Data
datas = datas_1

# Parameters
number_of_users = 7
segment_interval_length = 60
n_components = 2

# Get n random users and segment interval length
users = get_random_users(n=number_of_users)
interval = segment_intervals.index(segment_interval_length)

data = datas[interval]
data = data[data.user.isin(users)]
X = data.iloc[:, 2:].values
y = data.iloc[:, 0].values.ravel()
z = data.iloc[:, 1].values.ravel()

lda = LinearDiscriminantAnalysis(n_components=n_components)
X_lda = lda.fit_transform(X, y)
print("Explained variance ratio: {}".format(lda.explained_variance_ratio_))

# Eigenvectors (scalings_): when transforming, new data points are
# multiplied by this matrix, so the larger the absolute value for a
# feature, the more that feature contributes to the component. This is
# somewhat similar to the weights of a neuron in a neural net: every
# element of an eigenvector is the weight of one feature.
for j in range(0, n_components):
    print("============================")
    print("Component {}".format(j + 1))
    print("============================")
    eigen_vector = abs(lda.scalings_[:, j])
    eigen_vector = eigen_vector / sum(eigen_vector)

    eigen_sorted = eigen_vector.copy()
    eigen_sorted[::-1].sort()
    i = 0
    while eigen_sorted[i] > 0.05:
        value = round(eigen_sorted[i] * lda.explained_variance_ratio_[j], 2)
        if value < 0.01:
            break
        print(features_list(index=np.where(eigen_vector == eigen_sorted[i])[0][0]))
        print(value)
        i += 1

Users: [7, 14, 15, 17, 18, 24, 27]
Explained variance ratio: [0.76131601 0.13055217]
Component 1
a_me
0.27
az_me
0.26
ax_mai
0.05
ay_mai
0.04
Component 2
a_me
0.04
ay_mai
0.03
ax_mai
0.02
az_me
0.01
ay_sd
0.01


### Feature selection algorithms
<a id='feature_selection'></a>

As some of the features are useless or even harmful to our accuracy score, we would like to obtain an optimal feature subset. Besides improving accuracy, we also expect this approach to show which features, and consequently which sensors, are most informative for continuous authentication in an IoT environment.

We will use feature selection techniques implemented in the sklearn library, as they should more than suffice for our needs.

In [231]:
# Data
datas = datas_1

users = get_users_data(datas[0])

t = 13
i = 0
X = datas[i].iloc[:, 2:].values
y = datas[i].iloc[:, 0].values.ravel()
cols = list(datas[i].iloc[:, 2:].columns)

sel = VarianceThreshold()
sel.fit(X)
max_score = max(sel.variances_)
variances_scores = [x/max_score for x in sel.variances_]

sel = SelectKBest(f_classif, k=2)
sel.fit(X, y)
max_score = max(sel.scores_)
f_classif_scores = [x/max_score for x in sel.scores_]
# indices = sorted(np.argpartition(f_classif_scores, -t)[-t:])
# print("F classif:")
# print([cols[x] for x in indices])

sel = SelectKBest(mutual_info_classif, k=2)
sel.fit(X, y)
max_score = max(sel.scores_)
mutual_info_classif_scores = [x/max_score for x in sel.scores_]
# indices = sorted(np.argpartition(mutual_info_classif_scores, -t)[-t:])
# print("Mutual info classif:")
# print([cols[x] for x in indices])

fig = go.Figure(
    data=[
#         go.Bar(name="variances", x=cols, y=variances_scores),
        go.Bar(name="f_classif", x=cols, y=f_classif_scores),
        go.Bar(name="mutual_info_classif", x=cols, y=mutual_info_classif_scores),
    ]
)
# Change the bar mode
fig.update_layout(barmode="group")
fig.show()

The above graph presents the scores of each manually engineered feature later used in the classification tasks. The scores were calculated with the [univariate feature selection](https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection) part of the `sklearn.feature_selection` module. We used two different scoring functions (`f_classif` and `mutual_info_classif`) and scaled the resulting scores to the [0, 1] interval so that they can be shown on the same graph. In both cases, the best performing feature was the mean value of the x axis of the accelerometer. In general, mean values make up the majority of the better rated features. We can also observe that most of the more complex features calculated from the memory and network data are the lowest performing ones and should not be used.

We discovered that we cannot evaluate the informativeness of a sensor based on all the features we possess. Instead, we should evaluate features separately, select the best ones and only then talk about per-sensor informativeness.
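
One way to move from per-feature scores to per-sensor informativeness is to count which sensors the best features belong to. The sketch below is a hypothetical illustration: the prefix-to-sensor mapping follows the csv column names used earlier in this notebook, while the scores and the function name are made up.

```python
from collections import Counter

# First letter of a column name -> sensor, following the csv header.
SENSOR_BY_PREFIX = {
    "a": "accelerometer",
    "g": "gyroscope",
    "f": "force",
    "c": "cpu",
    "m": "memory",
    "n": "network",
    "i": "pir",
}


def sensors_of_top_features(scores, k):
    """Count how many of the k best scoring features each sensor provides."""
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return Counter(SENSOR_BY_PREFIX[name[0]] for name in top)


# Hypothetical, already scaled feature scores.
scores = {"ax_me": 1.0, "az_me": 0.8, "gx_sd": 0.5, "m_jc": 0.1, "ns_iji": 0.05}
counts = sensors_of_top_features(scores, k=3)
```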

## Classification

LDA is used not only as a dimensionality reduction technique, but also as a classifier. The scikit-learn implementation of LDA provides a `predict` method that serves exactly this purpose. In the following subsections, we use this method to predict users as accurately as possible.

### Per seance classification

Per seance classification focuses on taking whole seances and splitting them into a training and a testing set. As our dataset consists of 2 performances of the same task per user, we took the collection of the first performances of the tasks and used them as a training set, similarly the second performances were used as a testing set. Each task was split separately, so we end up with 3 training and 3 testing sets.
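
The split can be sketched without any of the notebook's helpers. The example assumes that every user has exactly two seances for the task and that the earlier seance has the smaller id; the function name and the rows are hypothetical.

```python
def split_by_seance(rows):
    """Split rows into training (first seance) and testing (second seance).

    `rows` is a list of (user, seance, features) tuples.
    """
    seances_per_user = {}
    for user, seance, _ in rows:
        seances_per_user.setdefault(user, set()).add(seance)
    # Assumption: the smaller seance id is the first performance of the task.
    train_ids = {min(ids) for ids in seances_per_user.values()}
    test_ids = {max(ids) for ids in seances_per_user.values()}
    train = [row for row in rows if row[1] in train_ids]
    test = [row for row in rows if row[1] in test_ids]
    return train, test


# Hypothetical segments: user 1 has seances 10 and 11, user 2 has 12 and 13.
rows = [
    (1, 10, [0.1]),
    (1, 10, [0.2]),
    (1, 11, [0.3]),
    (2, 12, [0.4]),
    (2, 13, [0.5]),
]
train, test = split_by_seance(rows)
```

Splitting by whole seances, rather than by random segments, keeps segments from the same recording out of both sets, so the test score reflects how well a user is recognized in a new recording.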

#### Overview

This is a tough task, as the training and testing data are split right between the seances of each user. The classification accuracy depends on the parameters and is not the best, but on average it is still about 3 times higher than that of the majority classifier, from which we can conclude that already at this stage our algorithm is learning something. The accuracy is highly dependent on the user subset and, to a lesser degree, on the number of users and the time interval.
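
The majority classifier used as a baseline always predicts the most frequent user, so its accuracy is simply the share of that user in the test labels. A minimal sketch, with a hypothetical function name and made-up labels:

```python
from collections import Counter


def majority_baseline_accuracy(y_test):
    """Accuracy of always predicting the most frequent label in y_test."""
    counts = Counter(y_test)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(y_test)


# Hypothetical test labels: user 7 appears in 3 of 6 segments.
y_test_example = [7, 7, 7, 9, 9, 11]
baseline = majority_baseline_accuracy(y_test_example)
```

With 7 users and roughly balanced test sets, this baseline sits slightly above $1/7 \approx 0.14$, which is consistent with the values of about 0.2 reported below.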

For some subsets of users the classification accuracy reaches up to 80%, while other users are quite similar in their behavior and our algorithm struggles to distinguish between them for now. Below are two examples, one with a favorable and one with an unfavorable user set, with the number of users set to 7 and the time interval to 75 seconds.

In [383]:
# Data
datas = datas_1

# Users
number_of_users = 7
for users in [[7, 9, 11, 12, 19, 22, 24], [13, 14, 15, 18, 23, 25, 27]]:
    print("Users: {}".format(users))
    segment_interval_length = 75
    interval = segment_intervals.index(segment_interval_length)
    n_components = 2
    data = datas[interval]
    data = data[data.user.isin(users)]
    seances = get_users_data(data)
    train_seances = [x[0] for x in seances.values()]
    test_seances = [x[1] for x in seances.values()]
    training_set = data[data["seance"].isin(train_seances)]
    testing_set = data[data["seance"].isin(test_seances)]
    X_train = training_set.iloc[:, 2:].values
    y_train = training_set.iloc[:, 0].values.ravel()
    X_test = testing_set.iloc[:, 2:].values
    y_test = testing_set.iloc[:, 0].values.ravel()
    lda = LinearDiscriminantAnalysis(n_components=n_components)
    lda.fit(X_train, y_train)
    y_pred = lda.predict(X_test)
    print("Confusion matrix:")
    print(metrics.confusion_matrix(y_test, y_pred))
    print("Majority classifier: {}".format(round(list(y_test).count(max(set(y_test), key = list(y_test).count))/len(y_test),2)))
    print("Classification accuracy: {}".format(round(metrics.accuracy_score(y_test, y_pred), 2)))
    print()

Users: [7, 9, 11, 12, 19, 22, 24]
Confusion matrix:
[[5 0 0 0 0 0 0]
 [0 6 0 0 0 0 0]
 [0 0 3 0 0 0 2]
 [0 2 0 2 0 1 0]
 [0 0 0 0 3 0 0]
 [0 0 0 0 0 4 0]
 [0 0 0 0 0 1 3]]
Majority classifier: 0.19
Classification accuracy: 0.81

Users: [13, 14, 15, 18, 23, 25, 27]
Confusion matrix:
[[0 0 0 0 0 4 0]
 [0 1 0 0 0 2 4]
 [0 0 2 0 0 0 0]
 [0 0 4 1 0 0 0]
 [0 0 0 0 3 1 0]
 [0 0 0 0 0 6 0]
 [0 2 0 0 0 3 2]]
Majority classifier: 0.2
Classification accuracy: 0.43
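
Both reported numbers can be recomputed from a confusion matrix alone. The sketch below uses the matrix of the first (favorable) user set from the output above: the row sums give the number of test segments per true user, so the largest row sum divided by the total is the majority classifier accuracy, and the diagonal holds the correct predictions.

```python
import numpy as np

# Confusion matrix of the first user set, copied from the output above.
# Rows are true users, columns are predicted users.
confusion = np.array([
    [5, 0, 0, 0, 0, 0, 0],
    [0, 6, 0, 0, 0, 0, 0],
    [0, 0, 3, 0, 0, 0, 2],
    [0, 2, 0, 2, 0, 1, 0],
    [0, 0, 0, 0, 3, 0, 0],
    [0, 0, 0, 0, 0, 4, 0],
    [0, 0, 0, 0, 0, 1, 3],
])

n_test = confusion.sum()
# Always predicting the most frequent true user gets this share right.
majority_baseline = confusion.sum(axis=1).max() / n_test
# Correct predictions lie on the diagonal.
accuracy = np.trace(confusion) / n_test

print(round(float(majority_baseline), 2), round(float(accuracy), 2))  # 0.19 0.81
```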



#### Number of users vs accuracy

Below we display how the number of users affects the classification accuracy of the predictions. As anticipated, the accuracy decreases as the number of users increases, but there is no point where the decrease is sudden and significant.

We performed 500 iterations per time interval, per number of users. At each iteration the users were chosen at random.
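
The random choice of users can be sketched as below. The `sample_users` function is a hypothetical stand-in for the notebook's `get_random_users` helper, whose implementation is not shown here, and the user ids are illustrative.

```python
import random


def sample_users(all_users, n, remove=(), seed=None):
    """Draw n distinct users at random, skipping any listed in `remove`."""
    rng = random.Random(seed)
    excluded = set(remove)
    candidates = [u for u in all_users if u not in excluded]
    return sorted(rng.sample(candidates, n))


all_users = list(range(7, 28))  # 21 hypothetical user ids
subset = sample_users(all_users, n=7, remove=[14], seed=0)
print(subset)
```

Passing a seed makes a run repeatable, which helps when comparing accuracies across time intervals on the same user subsets.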

In [386]:
# Epochs
epochs = 500
results = {"n_users": [], "accuracy": [], "time_interval": []}

datas = datas_1

# Components
n_components = 2
# for n in segment_intervals:
for n in [15, 180]:
    interval = segment_intervals.index(n)
    for i in range(3, 19, 2):
        ca = []
        for _ in range(0, epochs):
            users = get_random_users(n=i, stats=False, remove=[14])
            data = datas[interval]
            data = data[data.user.isin(users)]
            seances = get_users_data(data)

            train_seances = [x[0] for x in seances.values()]
            test_seances = [x[1] for x in seances.values()]

            training_set = data[data["seance"].isin(train_seances)]
            testing_set = data[data["seance"].isin(test_seances)]

            X_train = training_set.iloc[:, 2:].values
            y_train = training_set.iloc[:, 0].values.ravel()

            X_test = testing_set.iloc[:, 2:].values
            y_test = testing_set.iloc[:, 0].values.ravel()

            lda = LinearDiscriminantAnalysis(n_components=n_components)
            lda.fit(X_train, y_train)
            y_pred = lda.predict(X_test)

            majority_classifier = round(
                list(y_test).count(max(set(y_test), key=list(y_test).count))
                / len(y_test),
                2,
            )

            ca.append(metrics.accuracy_score(y_test, y_pred))
        results["n_users"].append(i)
        results["accuracy"].append(mean(ca))
        results["time_interval"].append(n)


df = DataFrame(results)
graphs = []
# Plot only the time intervals that were actually evaluated above
for n in sorted(df["time_interval"].unique()):
    data = df[df["time_interval"] == n]
    graphs.append(
        go.Bar(name="{} seconds".format(n), x=data["n_users"], y=data["accuracy"])
    )

fig = go.Figure(data=graphs)
# Change the bar mode
fig.update_layout(
    barmode="group",
    title="Classification accuracy for different number of users",
    xaxis_title="number of users",
    yaxis_title="classification accuracy",
)
fig.show()

#### Sensor subset data

To determine the best possible combinations of sensors, we will take datasets with different combinations of sensors and compare the results. As the tasks were meant to stimulate response from different sensors, we will plot the data from all 3 tasks and compare it for potential differences.

Again we will perform 500 iterations per task, with the number of users set to 7 and the time interval to 75 seconds.
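
One way such sensor subsets could be assembled is sketched below. It assumes that every feature column name starts with a sensor prefix, a naming convention invented for this example. The notebook's own `get_per_sensor_data` and `combine_sensor_data` helpers are not shown here and may work differently.

```python
import pandas as pd

KEY_COLUMNS = ["user", "seance"]


def sensor_subset(frame, prefixes):
    """Keep the key columns plus every feature whose name starts with a given prefix."""
    wanted = tuple(prefixes)
    features = [
        c for c in frame.columns
        if c not in KEY_COLUMNS and c.startswith(wanted)
    ]
    return frame[KEY_COLUMNS + features]


# Toy table with hypothetical, prefix-named feature columns.
frame = pd.DataFrame({
    "user": [1, 2],
    "seance": [10, 20],
    "acc_mean_x": [0.1, 0.9],
    "gyro_mean_x": [0.3, 0.4],
    "cpu_mean": [12.0, 30.0],
})

acc_gyro = sensor_subset(frame, ["acc_", "gyro_"])
print(list(acc_gyro.columns))  # ['user', 'seance', 'acc_mean_x', 'gyro_mean_x']
```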

In [393]:
# Epochs
epochs = 500
sresults = [[], [], []]

# Users
number_of_users = 7
users_set = [get_random_users(n=number_of_users, stats=False) for _ in range(0, epochs)]

# Time intervals
segment_interval_length = 75
interval = segment_intervals.index(segment_interval_length)

# Components
n_components = 2

for ex in [1, 2, 3]:
    sensors = get_per_sensor_data(experiment=ex)

    # Combine different sensor data
    sensors.update(combine_sensor_data(sensors, ["accelerometer", "gyroscope"]))
    sensors.update(
        combine_sensor_data(sensors, ["accelerometer + gyroscope", "force sensors"])
    )
    sensors.update(combine_sensor_data(sensors, ["gyroscope", "force sensors"]))
    sensors.update(combine_sensor_data(sensors, ["cpu", "memory"]))
    sensors.update(combine_sensor_data(sensors, ["cpu + memory", "network"]))
    sensors.update(combine_sensor_data(sensors, ["cpu", "accelerometer"]))

    names = [x for x in sensors]
    names.sort()

    ca = {"majority_classifier": []}
    for name in names:
        ca.update({name: []})

    for users in users_set:
        for key in names:
            sensor = sensors[key][interval]
            sensor = sensor[sensor.user.isin(users)]
            seances = get_users_data(sensor)

            train_seances = [x[0] for x in seances.values()]
            test_seances = [x[1] for x in seances.values()]

            training_set = sensor[sensor["seance"].isin(train_seances)]
            testing_set = sensor[sensor["seance"].isin(test_seances)]

            X_train = training_set.iloc[:, 2:].values
            y_train = training_set.iloc[:, 0].values.ravel()

            X_test = testing_set.iloc[:, 2:].values
            y_test = testing_set.iloc[:, 0].values.ravel()

            lda = LinearDiscriminantAnalysis(n_components=n_components)
            lda.fit(X_train, y_train)
            y_pred = lda.predict(X_test)
            ca[key].append(metrics.accuracy_score(y_test, y_pred))

        ca["majority_classifier"].append(
            list(y_test).count(max(set(y_test), key=list(y_test).count)) / len(y_test)
        )

    results = [[], [], []]
    for key in ca:
        results[0].append(key)
        results[1].append(mean(ca[key]))
        results[2].append(ex)

    indices = np.argsort(results[1])
    for i in indices:
        sresults[0].append(results[0][i])
        sresults[1].append(results[1][i])
        sresults[2].append(results[2][i])

df = DataFrame(
    {"sensors": sresults[0], "accuracy": sresults[1], "experiment": sresults[2]}
)
graphs = []
for i in range(1, 4):
    data = df[df["experiment"] == i]
    graphs.append(
        go.Bar(name="Task {}".format(i), x=data["sensors"], y=data["accuracy"])
    )

fig = go.Figure(data=graphs)
fig.update_layout(
    barmode="group",
    title="Classification accuracy for different combinations of sensor data",
    xaxis_title="sensor combination",
    yaxis_title="classification accuracy",
)
fig.show()

In all 3 tasks the worst performing classifier is the majority classifier, even if in some cases by a small margin. This means that, regardless of the sensor set, there were some patterns in the data and we were able to detect them. We can also observe that the best performing sensors are the accelerometer and the gyroscope, with some discrepancy between the tasks.

In the first task, taking only the accelerometer data seems to be the best strategy. Task number 2 is interesting in that only the combination of accelerometer and gyroscope data beats the accuracy of taking all available sensor data. In the last task the most accurate sensor is the gyroscope, followed by some combinations of other data with it. Note that the combination of gyroscope and force sensor data achieves a higher accuracy without the accelerometer data than with it.

Altogether, the accelerometer and the gyroscope are the sensors with the most informative features, with the force sensors lagging roughly 20% behind and the pc data another 15%. A potential issue with the pc data is that there is time at the beginning and the end of each experiment where this data is not being collected, as the test subject was required to turn the pc on and off during the experiment. A more "fair" experiment would be to take only data from the time interval in which the computer was turned on and redo this analysis, which we will do later on.

Surprisingly, we discovered that when taking data from the second and the third experiments, there is an increase in overall classification accuracy and in most subsets of sensor data. The only notable exception is the accelerometer, which performed increasingly worse from the first to the third experiment.

#### Optimal feature set

As shown in the [Feature selection algorithms section](#feature_selection), the optimal feature set is not sensor dependent, but rather spread across different sensors. We will take some of the best feature sets of data and compare the classification accuracy.

To achieve that, we performed 5000 iterations of data subsampling, taking 7 random users and 75 second long time intervals for each iteration. Within each iteration we performed feature selection on the training data (first seance of each task of every user), took the n best features, where $n = [5, 10, 15, 20, 25, 30, 35]$, performed prediction on the testing data and calculated the classification accuracy. We also included some sensor feature subsets that we believe perform best.
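
The selection step described above can be sketched as follows: score the features on the training split only, sort the scores and keep the last n entries. The synthetic data and the feature names are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic training split (assumption, not the notebook's data).
X_train, y_train = make_classification(
    n_samples=150, n_features=12, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=1,
)
feature_names = ["f{}".format(i) for i in range(X_train.shape[1])]

# Fit the scorer on the training split only, so the test split never
# influences which features are selected.
selector = SelectKBest(f_classif, k="all").fit(X_train, y_train)
order = np.argsort(selector.scores_)  # ascending: the best features are at the end

# Nested feature sets: the top 5 are always contained in the top 10.
top_sets = {n: [feature_names[j] for j in order[-n:]] for n in (5, 10)}
print(top_sets[5])
```

Since only `scores_` is read, the value of `k` passed to `SelectKBest` does not influence which features end up in the sets.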

In [None]:
# Init
preset_sets = ["accelerometer", "gyroscope", "force_sensors", "acc + gyro + force"]
# for datas, i in zip([datas_1], [1]):
for datas, i in zip([datas_1, datas_2, datas_3], [1, 2, 3]):
    # Parameters
    epochs = 5000
    number_of_users = 7
    segment_interval_length = 75
    interval = segment_intervals.index(segment_interval_length)
    n_components = 2
    sets = {}
    
    data = datas[interval]
    cols = list(data.columns)
    feature_sets = {
        "accelerometer": cols[0:2] + cols[2:18],
        "gyroscope": cols[0:2] + cols[18:30],
        "force_sensors": cols[0:2] + cols[30:42],
        "acc + gyro + force": cols[0:2] + cols[2:42],
        "f_classif_5": [],
        "mutual_classif_5": [],
        "f_classif_10": [],
        "mutual_classif_10": [],
        "f_classif_15": [],
        "mutual_classif_15": [],
        "f_classif_20": [],
        "mutual_classif_20": [],
        "f_classif_25": [],
        "mutual_classif_25": [],
        "f_classif_30": [],
        "mutual_classif_30": [],
        "f_classif_35": [],
        "mutual_classif_35": [],
    }
    results = {
        "accelerometer": [],
        "gyroscope": [],
        "force_sensors": [],
        "acc + gyro + force": [],
        "f_classif_5": [],
        "mutual_classif_5": [],
        "f_classif_10": [],
        "mutual_classif_10": [],
        "f_classif_15": [],
        "mutual_classif_15": [],
        "f_classif_20": [],
        "mutual_classif_20": [],
        "f_classif_25": [],
        "mutual_classif_25": [],
        "f_classif_30": [],
        "mutual_classif_30": [],
        "f_classif_35": [],
        "mutual_classif_35": [],
    }
    for e in range(0, epochs):
        # Data subset
        users = get_random_users(n=number_of_users, stats=False)
        # take all or just training data?
        sub_data = data
        # Subset users
        sub_data = sub_data[sub_data.user.isin(users)]
        # Split the data
        seances = get_users_data(sub_data)
        train_seances = [x[0] for x in seances.values()]
        test_seances = [x[1] for x in seances.values()]
        training_set = sub_data[sub_data["seance"].isin(train_seances)]
        testing_set = sub_data[sub_data["seance"].isin(test_seances)]
        X = training_set.iloc[:, 2:].values
        y = training_set.iloc[:, 0].values.ravel()

        # Feature selection
        f_cols = list(data.iloc[:, 2:].columns)
        sel = SelectKBest(f_classif, k=2)
        sel.fit(X, y)
        indices = np.argsort(sel.scores_)
        feature_sets["f_classif_5"] = cols[0:2] + [f_cols[x] for x in indices[-5:]]
        feature_sets["f_classif_10"] = cols[0:2] + [f_cols[x] for x in indices[-10:]]
        feature_sets["f_classif_15"] = cols[0:2] + [f_cols[x] for x in indices[-15:]]
        feature_sets["f_classif_20"] = cols[0:2] + [f_cols[x] for x in indices[-20:]]
        feature_sets["f_classif_25"] = cols[0:2] + [f_cols[x] for x in indices[-25:]]
        feature_sets["f_classif_30"] = cols[0:2] + [f_cols[x] for x in indices[-30:]]
        feature_sets["f_classif_35"] = cols[0:2] + [f_cols[x] for x in indices[-35:]]
        sel = SelectKBest(mutual_info_classif, k=2)
        sel.fit(X, y)
        indices = np.argsort(sel.scores_)
        feature_sets["mutual_classif_5"] = cols[0:2] + [f_cols[x] for x in indices[-5:]]
        feature_sets["mutual_classif_10"] = cols[0:2] + [f_cols[x] for x in indices[-10:]]
        feature_sets["mutual_classif_15"] = cols[0:2] + [f_cols[x] for x in indices[-15:]]
        feature_sets["mutual_classif_20"] = cols[0:2] + [f_cols[x] for x in indices[-20:]]
        feature_sets["mutual_classif_25"] = cols[0:2] + [f_cols[x] for x in indices[-25:]]
        feature_sets["mutual_classif_30"] = cols[0:2] + [f_cols[x] for x in indices[-30:]]
        feature_sets["mutual_classif_35"] = cols[0:2] + [f_cols[x] for x in indices[-35:]]
        
        # Calculate ca for given users for all feature sets
        for feature_set in feature_sets:
            X_train = training_set[feature_sets[feature_set]].iloc[:, 2:].values
            y_train = training_set[feature_sets[feature_set]].iloc[:, 0].values.ravel()
            X_test = testing_set[feature_sets[feature_set]].iloc[:, 2:].values
            y_test = testing_set[feature_sets[feature_set]].iloc[:, 0].values.ravel()

            lda = LinearDiscriminantAnalysis(n_components=n_components)
            lda.fit(X_train, y_train)
            y_pred = lda.predict(X_test)
            results[feature_set].append(metrics.accuracy_score(y_test, y_pred))
    
    # Calculate mean accuracy per feature set
    for x in results:
        results[x] = mean(results[x])
        
    # Get previous best 
    reference = max(
        [
            results[key]
            for key in results.keys()
            if key
            in preset_sets
        ]
    )
    print(results)
    print(reference)
    
    # Draw the figure
    fig = go.Figure([go.Bar(x=list(results.keys()), y=list(results.values()))])
    fig.add_shape(
        go.layout.Shape(
            type="line",
            xref="paper",
            yref="y",
            x0=0,
            y0=reference,
            x1=1,
            y1=reference,
            line=dict(color="Red", width=1,),
        ),
    )
    fig.update_layout(
        title="Task {}: feature set vs classification accuracy".format(i),
        xaxis_title="feature set",
        yaxis_title="classification accuracy",
    )
#     fig.show()

![title](jupyter/resources/4.1.4/01.png)
![title](jupyter/resources/4.1.4/02.png)
![title](jupyter/resources/4.1.4/03.png)

In [30]:
# Cached data

# Task 1
results =
reference =
fig = go.Figure([go.Bar(x=list(results.keys()), y=list(results.values()))])
fig.add_shape(
    go.layout.Shape(
        type="line",
        xref="paper",
        yref="y",
        x0=0,
        y0=reference,
        x1=1,
        y1=reference,
        line=dict(color="Red", width=1,),
    ),
)
fig.update_layout(
    title="Task 1: feature set vs classification accuracy",
    xaxis_title="feature set",
    yaxis_title="classification accuracy",
)
fig.show()

# Task 2
results =
reference =
fig = go.Figure([go.Bar(x=list(results.keys()), y=list(results.values()))])
fig.add_shape(
    go.layout.Shape(
        type="line",
        xref="paper",
        yref="y",
        x0=0,
        y0=reference,
        x1=1,
        y1=reference,
        line=dict(color="Red", width=1,),
    ),
)
fig.update_layout(
    title="Task 2: feature set vs classification accuracy",
    xaxis_title="feature set",
    yaxis_title="classification accuracy",
)
fig.show()

# Task 3
results =
reference =
fig = go.Figure([go.Bar(x=list(results.keys()), y=list(results.values()))])
fig.add_shape(
    go.layout.Shape(
        type="line",
        xref="paper",
        yref="y",
        x0=0,
        y0=reference,
        x1=1,
        y1=reference,
        line=dict(color="Red", width=1,),
    ),
)
fig.update_layout(
    title="Task 3: feature set vs classification accuracy",
    xaxis_title="feature set",
    yaxis_title="classification accuracy",
)
fig.show()


For the second and the third task, we can observe an improvement of at most 10% and 4% respectively. Both achieved the best accuracy when using the 10 best features selected with the [mutual information scoring function](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html#sklearn.feature_selection.mutual_info_classif). The first task, on the other hand, still performed best using only the data from the accelerometer, achieving an accuracy 1% higher than the best scoring selected feature subset, which was once again the 10 feature, mutual information approach.

In addition, we ran the algorithm with different numbers of users and different time intervals, but the results stayed consistent, the only variable being the absolute classification accuracy.

Finally, to evaluate the best performing features, we took the parameters with the highest achieved accuracy (10 features, mutual information scoring function) and plotted how many times each feature was selected over the course of 5000 iterations. Again we used 7 users and 75 second time intervals.
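
Counting how often each feature is selected can be sketched with `collections.Counter`. The four selections below are hypothetical and the feature names are only illustrative.

```python
from collections import Counter

# Hypothetical selections from four iterations (feature names are illustrative).
selected_per_iteration = [
    ["acc_mean_x", "gyro_mean_y", "force_mean_1"],
    ["acc_mean_x", "gyro_mean_y", "memory_mean"],
    ["acc_mean_x", "acc_mcr_z", "memory_mean"],
    ["acc_mean_x", "gyro_mean_y", "memory_max"],
]

# Add one count per feature for every iteration in which it was selected.
frequency = Counter()
for features in selected_per_iteration:
    frequency.update(features)

n_iterations = len(selected_per_iteration)
for name, count in frequency.most_common(3):
    print("{}: selected in {:.0%} of iterations".format(name, count / n_iterations))
```

Dividing the counts by the number of iterations gives the share of runs in which a feature was selected, which is easier to compare across tasks than raw counts.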

In [None]:
# Init
for datas, i in zip([datas_1, datas_2, datas_3], [1, 2, 3]):
    s = datetime.now()
    # Parameters
    epochs = 5000
    number_of_users = 7
    segment_interval_length = 75
    interval = segment_intervals.index(segment_interval_length)
    n_components = 2
    sets = {}

    data = datas[interval]
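    cols = list(data.columns)
    # Feature column names (everything after the user and seance columns)
    f_cols = list(data.iloc[:, 2:].columns)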
    feature_sets = {
        "mutual_classif_10": [],
    }
    results = {
        "mutual_classif_10": [],
    }
    for e in range(0, epochs):
        # Data subset
        users = get_random_users(n=number_of_users, stats=False)
        sub_data = data
        # Subset users
        sub_data = sub_data[sub_data.user.isin(users)]
        # Split the data
        seances = get_users_data(sub_data)
        train_seances = [x[0] for x in seances.values()]
        test_seances = [x[1] for x in seances.values()]
        training_set = sub_data[sub_data["seance"].isin(train_seances)]
        testing_set = sub_data[sub_data["seance"].isin(test_seances)]
        X = training_set.iloc[:, 2:].values
        y = training_set.iloc[:, 0].values.ravel()

        # Feature selection
        sel = SelectKBest(mutual_info_classif, k=2)
        sel.fit(X, y)
        indices = np.argsort(sel.scores_)
        feature_sets["mutual_classif_10"] = cols[0:2] + [
            f_cols[x] for x in indices[-10:]
        ]

        # Calculate ca for given users for all feature sets
        for feature_set in feature_sets:
            if feature_set not in sets:
                sets.update({feature_set: {}})
            for feature in feature_sets[feature_set]:
                if feature in sets[feature_set]:
                    sets[feature_set][feature] += 1
                else:
                    sets[feature_set].update({feature: 1})

#     print("Running time: {}".format(datetime.now() - s))
    # Draw the figure
    print(sets)
    for key in sets:
        features = sets[key]
        features.pop("user")
        features.pop("seance")
        features = Counter(features).most_common()
        fig = bar(x=[x[0] for x in features], y=[x[1] for x in features])
        fig.update_layout(
            title="Task {}: feature frequency".format(i),
            xaxis_title="features",
            yaxis_title="frequency",
        )
#         fig.show()

In [None]:
# Cached data

# Task 1
sets = 
for key in sets:
    features = sets[key]
    features.pop("user")
    features.pop("seance")
    features = Counter(features).most_common()
    fig = bar(x=[x[0] for x in features], y=[x[1] for x in features])
    fig.update_layout(
        title="Task 1: feature frequency",
        xaxis_title="features",
        yaxis_title="frequency",
    )
fig.show()

# Task 2
sets = 
for key in sets:
    features = sets[key]
    features.pop("user")
    features.pop("seance")
    features = Counter(features).most_common()
    fig = bar(x=[x[0] for x in features], y=[x[1] for x in features])
    fig.update_layout(
        title="Task 2: feature frequency",
        xaxis_title="features",
        yaxis_title="frequency",
    )
fig.show()

# Task 3
sets = 
for key in sets:
    features = sets[key]
    features.pop("user")
    features.pop("seance")
    features = Counter(features).most_common()
    fig = bar(x=[x[0] for x in features], y=[x[1] for x in features])
    fig.update_layout(
        title="Task 3: feature frequency",
        xaxis_title="features",
        yaxis_title="frequency",
    )
fig.show()

![title](jupyter/resources/4.1.4/04.png)
![title](jupyter/resources/4.1.4/05.png)
![title](jupyter/resources/4.1.4/06.png)

As we can see from the graphs, the most commonly selected features are the means of the accelerometer and gyroscope axes, followed by some of the means of the force sensors, mean and maximum memory usage, as well as the mean crossing rate of the accelerometer axes in the second and the third task.

We also discovered why feature selection did not improve the classification accuracy in the first task. If you take a look at the graph, there are more features that were present in at least 10% of all runs, which means that, depending on the case, different features were selected. This makes the prediction model less stable and consequently less accurate than with a fixed feature set.

### Iterative time steps learning

As our goal is to perform continuous authentication, one of the key dimensions we have to keep in mind is time. As the complete dataset is precollected, we can see into the future, which is beneficial for our algorithms but useless in practice. So in this section we will look at what happens when we iteratively take one 10 second time interval at a time, learn on all the previous ones for a subset of users and see how much time it takes to distinguish between them.
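
The iterative scheme can be sketched as below: at step i the model is trained on the first i segments of every user and tested on the single segment that follows. The synthetic segments, with one fixed mean per user, are an assumption made for this example and will not reproduce the accuracy pattern of the real data.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_users, n_steps, n_features = 3, 12, 3

# Synthetic segments: segments[u, t] is the feature vector of user u at time step t.
centers = rng.normal(scale=5.0, size=(n_users, 1, n_features))
segments = centers + rng.normal(size=(n_users, n_steps, n_features))

accuracy_per_step = []
for i in range(2, n_steps):
    # Train on the first i segments of every user ...
    X_train = segments[:, :i].reshape(-1, n_features)
    y_train = np.repeat(np.arange(n_users), i)
    # ... and test on the one segment that comes next.
    X_test = segments[:, i]
    y_test = np.arange(n_users)

    lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
    accuracy_per_step.append((i, lda.score(X_test, y_test)))

for step, acc in accuracy_per_step:
    print("after {} segments: accuracy {:.2f}".format(step, acc))
```

With only one test segment per user, the accuracy at each step can take just a few distinct values, so single steps are noisy and the trend over time is the more informative signal.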

#### Iterative visualization

To better understand the process that occurs with each iterative step of our algorithm, we will first take a look at the LDA clustering after each time step and at the classification accuracy trend through time.

<!-- if graph=True and generate=False: After each iteration, the user is prompted to press enter to continue. The output is then reset and new graphs should appear. If you write "stop" and press enter in the prompt, the program will stop outputting graphs and will calculate classification accuracy for all time steps (takes a few seconds) and show a line graph with the result. -->

In [234]:
# Data
datas = datas_2

# Users
number_of_users = 13
users = [8, 11, 13, 15, 16, 17, 19, 21, 23, 24, 25, 26, 27]
# users = get_random_users(n=number_of_users)

# Time intervals
segment_interval_length = 10
interval = segment_intervals.index(segment_interval_length)

# Components
n_components = 2

warnings.filterwarnings("ignore")
data = datas[interval]
data = data[data.user.isin(users)]
seances = get_users_data(data)
train_seances = [x[0] for x in seances.values()]
training_set = data[data["seance"].isin(train_seances)]
seances = list(set(training_set["seance"]))

graph = False
generate = True
i = 2
results = [[], []]
while True:
    X = []
    y = []
    Xt = []
    yt = []
    for seance in seances:
        X.append(training_set[training_set["seance"] == seance].iloc[0:i, 2:].values)
        y.append(
            training_set[training_set["seance"] == seance].iloc[0:i, 0].values.ravel()
        )
        Xt.append(
            training_set[training_set["seance"] == seance].iloc[i : i + 1, 2:].values
        )
        yt.append(
            training_set[training_set["seance"] == seance]
            .iloc[i : i + 1, 0]
            .values.ravel()
        )
    X = np.concatenate(X)
    y = np.concatenate(y)
    Xt = np.concatenate(Xt)
    yt = np.concatenate(yt)
    if Xt.shape[0] == 0:
        break
    lda = LinearDiscriminantAnalysis(n_components=n_components)
    X_lda = lda.fit_transform(X, y)
    yp = lda.predict(Xt)
    results[0].append(i * segment_interval_length)
    results[1].append(metrics.accuracy_score(yt, yp))

    if graph or (generate and i in [2, 5, 11, 18, 24]):
        print("After {} seconds.".format(i * segment_interval_length))
        print("Classification accuracy: {}%".format(round(metrics.accuracy_score(yt, yp) * 100)))
        X_ldat = lda.transform(Xt)
        X_lda = np.concatenate((X_lda, X_ldat))
        # Offset the test labels by 100 so they stand out from the training points
        y = np.concatenate((y, yt + 100))
        df = DataFrame(
            [[label] + list(point) for point, label in zip(X_lda, y)],
            columns=["user", "component_1", "component_2"],
        )
        fig = scatter(
            df,
            x="component_1",
            y="component_2",
            color="user",
            color_continuous_scale="picnic",
        )
        fig.show()

    i += 1
    if i > 100000:
        print("Run out of data")
        break
    if graph and not generate:
        command = input("Press Enter to continue...")

        clear_output()
        if command == "stop":
            graph = False

df = DataFrame(
    {"time from start [s]": results[0], "classification accuracy": results[1]}
)
fig = line(df, x="time from start [s]", y="classification accuracy")
fig.show()

After 20 seconds.
Classification accuracy: 92.0%


After 50 seconds.
Classification accuracy: 38.0%


After 110 seconds.
Classification accuracy: 92.0%


After 180 seconds.
Classification accuracy: 100.0%


After 240 seconds.
Classification accuracy: 100.0%


After running the program multiple times with different numbers of users and for all 3 performed tasks, a recognizable pattern emerges from the data. At the very beginning, the accuracy of the algorithm is high (> 90%). Then after some time (30s - 60s) it quickly falls to about 40% and fluctuates around that value until around 180s, and then finally climbs back up to the previous accuracy, mostly reaching 100%. When this happens, visual clusters begin to emerge if you look at the data more closely. Usually there is some noise at the end, as the length of the experiments varies from test subject to test subject.

This pattern is consistent, no matter how many users we take. It takes about 180 seconds to recognize all, or at worst 90%, of the users from the subset. The segment time interval in this subsection was fixed at 10 seconds to better understand the dynamics and changes through time. We also tried other time intervals, with similar results, but with fewer time steps to evaluate the progress of the algorithm.

#### Sensor subset data

To better understand how the classification accuracy fluctuates through time, we will next look at the classification accuracy for each sensor separately and for some sensor combinations, and try to determine why there is a dip in accuracy at the beginning of the experiment. The graph below is interactive, which means you can add or remove any line by clicking on the appropriate name in the legend.

In [235]:
# Data
sensors = get_per_sensor_data(experiment=3)

# Combine sensor data
sensors.update(combine_sensor_data(sensors, ["accelerometer", "gyroscope"]))
sensors.update(
    combine_sensor_data(sensors, ["accelerometer + gyroscope", "force sensors"])
)
sensors.update(
    combine_sensor_data(sensors, ["accelerometer + gyroscope + force sensors", "cpu"])
)
sensors.update(
    combine_sensor_data(sensors, ["accelerometer + gyroscope + force sensors", "network"])
)
sensors.update(
    combine_sensor_data(sensors, ["accelerometer + gyroscope + force sensors", "memory"])
)

names = [x for x in sensors]
names = [
    "all data",
    "accelerometer",
    "gyroscope",
    "accelerometer + gyroscope",
    "accelerometer + gyroscope + force sensors",
    "cpu",
    "memory",
    "network"
]
print("Sensors: {}".format(names))

results = [[], [], []]

# Users
number_of_users = 7
users = [8, 12, 13, 16, 21, 23, 27]
# users = get_random_users(n=number_of_users)

# Time intervals
segment_interval_length = 10
interval = segment_intervals.index(segment_interval_length)

# Components
n_components = 2

for sensor_name in names:

    data = sensors[sensor_name][interval]
    data = data[data.user.isin(users)]
    seances = get_users_data(data)
    train_seances = [x[0] for x in seances.values()]
    training_set = data[data["seance"].isin(train_seances)]
    seances = list(set(training_set["seance"]))

    i = 2
    while True:
        X = []
        y = []
        Xt = []
        yt = []
        for seance in seances:
            X.append(
                training_set[training_set["seance"] == seance].iloc[0:i, 2:].values
            )
            y.append(
                training_set[training_set["seance"] == seance]
                .iloc[0:i, 0]
                .values.ravel()
            )
            Xt.append(
                training_set[training_set["seance"] == seance]
                .iloc[i : i + 1, 2:]
                .values
            )
            yt.append(
                training_set[training_set["seance"] == seance]
                .iloc[i : i + 1, 0]
                .values.ravel()
            )
        X = np.concatenate(X)
        y = np.concatenate(y)
        Xt = np.concatenate(Xt)
        yt = np.concatenate(yt)
        if Xt.shape[0] == 0:
            break
        lda = LinearDiscriminantAnalysis(n_components=n_components)
        if mean(X) != 0:
            X_lda = lda.fit_transform(X, y)
            yp = lda.predict(Xt)
            results[1].append(metrics.accuracy_score(yt, yp))
        else:
            results[1].append(0)
        results[0].append(i * segment_interval_length)
        results[2].append(sensor_name)

        i += 1
        if i > 100000:
            print("Run out of data")
            break

df = DataFrame({"time from start [s]": results[0], "classification accuracy": results[1], "sensor": results[2]})
fig = line(df, x="time from start [s]", y="classification accuracy", color="sensor")
fig.show()

Sensors: ['all data', 'accelerometer', 'gyroscope', 'accelerometer + gyroscope', 'accelerometer + gyroscope + force sensors', 'cpu', 'memory', 'network']


Contrary to the per seance classification, taking the data from only a single sensor does not improve but worsens the classification accuracy, and the highest accuracy was achieved with the combination of the accelerometer, the gyroscope and the force sensor data. The accuracy of the pc monitor data (cpu, memory, network) remained low and the same conclusions can be drawn as before.

The above conclusions also hold when introducing the second and the third experiment to the algorithm. The only notable difference is in the classification accuracy progression pattern. The dip at the beginning of the experiment is still present, however it occurs a bit later and is not as large as in the first experiment.

We suspect that the initial dip is caused by the way the experiment was set up. At the start of each experiment, the test subject enters the room and the sensor data is nearly constant. When the subject starts using the computer, the behavioral pattern changes, which "confuses" the algorithm that has already adjusted to the idle state. After a while, it learns the new patterns and the accuracy rises again.

#### Optimal feature set

As we determined with the current feature set, the highest classification accuracy is achieved by combining the data from the accelerometer, the gyroscope and the force sensors. Now we will evaluate the feature sets that were ranked best by the feature selection algorithms implemented in the section about [feature selection](#feature_selection).
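
The ranking step behind the selected feature sets can be illustrated without any library. The sketch below assumes that a selector exposes one score per feature and that the best `k` features are the ones with the highest scores. The feature names and scores are made up for illustration, and `top_k_features` is a hypothetical helper, not part of the notebook.

```python
import math


def top_k_features(names, scores, k):
    """Return the k feature names with the highest scores, best last."""
    if k <= 0:
        return []
    # Treat NaN scores (e.g. from constant features) as 0.
    cleaned = [0.0 if math.isnan(s) else s for s in scores]
    # Indices ordered by ascending score.
    order = sorted(range(len(cleaned)), key=lambda idx: cleaned[idx])
    return [names[idx] for idx in order[-k:]]


# Hypothetical feature names and scores, for illustration only.
names = ["acc_mean_x", "acc_std_x", "gyro_mean_x", "force_mean_1"]
scores = [3.2, float("nan"), 7.5, 5.1]
selected = top_k_features(names, scores, k=2)
print(selected)  # ['force_mean_1', 'gyro_mean_x']
```

If `k` exceeds the number of features, the slice simply returns all of them, so larger feature sets are capped at the number of available columns.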

In [None]:
# Data generation
comb_results = []
for datas in [datas_1, datas_2, datas_3]:
    epochs = 500
    results = {"feature_set": [], "accuracy": [], "time_step": [], "epoch": []}

    for e in range(0, epochs):
        # Parameters
        number_of_users = 7
        users = get_random_users(n=number_of_users, stats=False)
        segment_interval_length = 10
        interval = segment_intervals.index(segment_interval_length)
        n_components = 2

        comp_num = [5, 10, 15, 20, 25, 30, 35]
        data = datas[interval]
        data = data[data.user.isin(users)]

        # Get only data for relevant users
        seances = get_users_data(data)
        seances = [x[0] for x in seances.values()]
        data = data[data["seance"].isin(seances)]
        cols = list(data.columns)

        # Iteratively calculate the classification accuracy (ca) for each feature subset
        i = 2
        while True:
            # Subset and split the data
            # Perform feature selection
            # Subset features for each feature set
            # Calculate ca
            training_set = []
            testing_set = []
            for seance in seances:
                training_set.append(data[data["seance"] == seance].iloc[0:i, :])
                testing_set.append(data[data["seance"] == seance].iloc[i : i + 1, :])
            training_set = concat(training_set)
            testing_set = concat(testing_set)
            X_train = training_set.iloc[:, 2:]
            y_train = training_set.iloc[:, 0]

            f_cols = list(training_set.iloc[:, 2:].columns)
            sel = SelectKBest(f_classif, k=2)
            sel.fit(X_train, y_train)
            indices = np.argsort(np.nan_to_num(sel.scores_))
            feature_sets = {
                "accelerometer": cols[:18],
                "gyroscope": cols[:2] + cols[18:30],
                "force_sensors": cols[:2] + cols[30:42],
                "acc_gyro_force": cols[:42],
            }
            for x in comp_num:
                feature_sets.update(
                    {
                        "f_classif_{}".format(x): ["user", "seance"]
                        + [f_cols[y] for y in indices[-x:]]
                    }
                )
            sel = SelectKBest(mutual_info_classif, k=2)
            sel.fit(X_train, y_train)
            indices = np.argsort(np.nan_to_num(sel.scores_))
            for x in comp_num:
                feature_sets.update(
                    {
                        "mutual_classif_{}".format(x): ["user", "seance"]
                        + [f_cols[y] for y in indices[-x:]]
                    }
                )
            done = False
            for feature_set in feature_sets:
                training_sub = training_set[feature_sets[feature_set]]
                testing_sub = testing_set[feature_sets[feature_set]]
                X_sub_train = training_sub.iloc[:, 2:]
                y_sub_train = training_sub.iloc[:, 0]
                X_sub_test = testing_sub.iloc[:, 2:]
                y_sub_test = testing_sub.iloc[:, 0]
                if X_sub_test.shape[0] == 0:
                    done = True
                    break
                lda = LinearDiscriminantAnalysis(n_components=n_components)
                X_lda = lda.fit_transform(X_sub_train, y_sub_train)
                y_pred = lda.predict(X_sub_test)

                results["time_step"].append(i * segment_interval_length)
                results["accuracy"].append(metrics.accuracy_score(y_sub_test, y_pred))
                results["feature_set"].append(feature_set)
                results["epoch"].append(e)
            if i > 1000 or done:
                break
            i += 1
    preset_sets = ["accelerometer", "gyroscope", "force_sensors", "acc_gyro_force"]
    df = DataFrame(results)
    feature_sets = [x for x in feature_sets]
    accuracies = {"feature_set": [], "accuracy": []}
    reference = []
    for x in feature_sets:
        avg = mean(df[df["feature_set"] == x]["accuracy"])
        accuracies["feature_set"].append(x)
        accuracies["accuracy"].append(avg) 
        if x in preset_sets:
            reference.append(avg)
    reference = max(reference)
    print(accuracies)
    print(reference)
    
print("DONE")

In [None]:
# Cached data

# Task 1
accuracies = 
reference = 
fig = go.Figure([go.Bar(x=accuracies["feature_set"], y=accuracies["accuracy"])])
fig.add_shape(
    go.layout.Shape(
        type="line",
        xref="paper",
        yref="y",
        x0=0,
        y0=reference,
        x1=1,
        y1=reference,
        line=dict(color="Red", width=1,),
    ),
)
fig.update_layout(
    title="Mean accuracy",
    xaxis_title="feature set",
    yaxis_title="classification accuracy",
)
fig.show()

# Task 2
accuracies = 
reference = 
fig = go.Figure([go.Bar(x=accuracies["feature_set"], y=accuracies["accuracy"])])
fig.add_shape(
    go.layout.Shape(
        type="line",
        xref="paper",
        yref="y",
        x0=0,
        y0=reference,
        x1=1,
        y1=reference,
        line=dict(color="Red", width=1,),
    ),
)
fig.update_layout(
    title="Mean accuracy",
    xaxis_title="feature set",
    yaxis_title="classification accuracy",
)
fig.show()

# Task 3
accuracies = 
reference = 
fig = go.Figure([go.Bar(x=accuracies["feature_set"], y=accuracies["accuracy"])])
fig.add_shape(
    go.layout.Shape(
        type="line",
        xref="paper",
        yref="y",
        x0=0,
        y0=reference,
        x1=1,
        y1=reference,
        line=dict(color="Red", width=1,),
    ),
)
fig.update_layout(
    title="Mean accuracy",
    xaxis_title="feature set",
    yaxis_title="classification accuracy",
)
fig.show()

![title](jupyter/resources/4.2.3/01.png)
![title](jupyter/resources/4.2.3/02.png)

As the accuracy of the already chosen sensors was high, we expected only marginal improvements here, and the graphs confirm this. With the exception of f_classif_5, f_classif_10 and mutual_classif_5, all of the selected feature sets performed within the margin of error and on par with our previous best performer, the combination of the accelerometer, gyroscope and force sensor data.

The graphs were produced with 500 iterations of 7 random test subjects, with the time interval set to 10 seconds. We also varied the number of users and obtained similar results.
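
The comparison against the previous best performer boils down to averaging the accuracy per feature set and taking the best average among the preset sensor sets as the reference. A minimal sketch of that arithmetic follows. The accuracy values are invented, and `summarize` is a hypothetical helper rather than the notebook's own code.

```python
from collections import defaultdict
from statistics import mean


def summarize(records, preset_sets):
    """Average accuracy per feature set, plus the best average among the preset sets."""
    grouped = defaultdict(list)
    for feature_set, accuracy in records:
        grouped[feature_set].append(accuracy)
    averages = {name: mean(values) for name, values in grouped.items()}
    reference = max(averages[name] for name in preset_sets if name in averages)
    return averages, reference


# Made-up (feature set, accuracy) pairs, for illustration only.
records = [
    ("accelerometer", 0.5), ("accelerometer", 1.0),
    ("acc_gyro_force", 0.75), ("acc_gyro_force", 1.0),
    ("f_classif_5", 0.25), ("f_classif_5", 0.5),
]
averages, reference = summarize(records, ["accelerometer", "acc_gyro_force"])
print(averages)
print(reference)
```

A selected feature set is then worth considering only if its average is above the reference line.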

#### Sliding window

<a id="sliding_window"></a>

Another potentially beneficial approach to classifying users more accurately is to train the prediction model on only the last n time segments instead of all previously available data. By doing so, we hope to eliminate the discrepancy induced by the changing actions of the user.
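
The difference between the two training schemes is easiest to see on plain indices. The following sketch assumes that segments of one seance are ordered in time and that the segment at index `i` is the one being predicted. Both function names are hypothetical.

```python
def expanding_window(segments, i):
    """Train on all segments before index i, test on segment i."""
    return segments[0:i], segments[i:i + 1]


def sliding_window(segments, i, n):
    """Train on the n segments immediately before index i, test on segment i."""
    start = max(0, i - n)
    return segments[start:i], segments[i:i + 1]


# A hypothetical seance of eight 10-second segments, labelled by position.
segments = list(range(8))
exp_train, exp_test = expanding_window(segments, 5)
sli_train, sli_test = sliding_window(segments, 5, n=3)
print(exp_train, exp_test)  # [0, 1, 2, 3, 4] [5]
print(sli_train, sli_test)  # [2, 3, 4] [5]
```

With `n` at least as large as `i`, the sliding window degenerates into the expanding one, which is why long windows reproduce the earlier results.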

In [None]:
# Data generation
datas = datas_1
epochs = 50
results = {"feature_set": [], "accuracy": [], "time_step": [], "epoch": [], "window": []}

for e in range(0, epochs):
    # Parameters
    number_of_users = 7
    users = get_random_users(n=number_of_users, stats=False)
    segment_interval_length = 10
    interval = segment_intervals.index(segment_interval_length)
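    n_components = 2
    # Feature-set sizes; defined here so the cell does not rely on state from the previous cell
    comp_num = [5, 10, 15, 20, 25, 30, 35]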

    data = datas[interval]
    data = data[data.user.isin(users)]

    # Get only data for relevant users
    seances = get_users_data(data)
    seances = [x[0] for x in seances.values()]
    data = data[data["seance"].isin(seances)]
    cols = list(data.columns)

    # Iteratively calculate the classification accuracy (ca) for each feature subset
    i = 2
    while True:
        if i % 10 == 0:
            print(i)
        for j in range(0, i-1):
            training_set = []
            testing_set = []
            for seance in seances:
                training_set.append(data[data["seance"] == seance].iloc[j:i, :])
                testing_set.append(data[data["seance"] == seance].iloc[i : i + 1, :])
            training_set = concat(training_set)
            testing_set = concat(testing_set)
            X_train = training_set.iloc[:, 2:]
            y_train = training_set.iloc[:, 0]

            f_cols = list(training_set.iloc[:, 2:].columns)
            sel = SelectKBest(f_classif, k=2)
            sel.fit(X_train, y_train)
            indices = np.argsort(np.nan_to_num(sel.scores_))
            feature_sets = {
                "accelerometer": cols[:18],
                "gyroscope": cols[:2] + cols[18:30],
                "force_sensors": cols[:2] + cols[30:42],
                "acc_gyro_force": cols[:42],
            }
            for x in comp_num:
                feature_sets.update(
                    {
                        "f_classif_{}".format(x): ["user", "seance"]
                        + [f_cols[y] for y in indices[-x:]]
                    }
                )
            sel = SelectKBest(mutual_info_classif, k=2)
            sel.fit(X_train, y_train)
            indices = np.argsort(np.nan_to_num(sel.scores_))
            for x in comp_num:
                feature_sets.update(
                    {
                        "mutual_classif_{}".format(x): ["user", "seance"]
                        + [f_cols[y] for y in indices[-x:]]
                    }
                )
            done = False
            for feature_set in feature_sets:
                training_sub = training_set[feature_sets[feature_set]]
                testing_sub = testing_set[feature_sets[feature_set]]
                X_sub_train = training_sub.iloc[:, 2:]
                y_sub_train = training_sub.iloc[:, 0]
                X_sub_test = testing_sub.iloc[:, 2:]
                y_sub_test = testing_sub.iloc[:, 0]
                if X_sub_test.shape[0] == 0:
                    done = True
                    break
                lda = LinearDiscriminantAnalysis(n_components=n_components)
                X_lda = lda.fit_transform(X_sub_train, y_sub_train)
                y_pred = lda.predict(X_sub_test)

                results["time_step"].append(i * segment_interval_length)
                results["accuracy"].append(metrics.accuracy_score(y_sub_test, y_pred))
                results["feature_set"].append(feature_set)
                results["window"].append(i - j)
                results["epoch"].append(e)
            if i > 50 or done:
                break
        if i > 50 or done:
            break
        i += 1
print("DONE")

windows = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30]
df = DataFrame(results)
df = df[df["window"].isin(windows)]
accuracies = {"window": [], "accuracy": [], "feature_set": []}
for feature_set in ["accelerometer", "gyroscope", "acc_gyro_force", "mutual_classif_10"]:
    for x in windows:
        sub = df[(df["window"] == x) & (df["feature_set"] == feature_set)]
        accuracies["window"].append(x)
        accuracies["accuracy"].append(mean(sub["accuracy"]))
        accuracies["feature_set"].append(feature_set)
print(accuracies)

In [None]:
# Cached data
accuracies = 
acc = DataFrame(accuracies)
accelerometer = acc[acc["feature_set"] == "accelerometer"]
gyroscope = acc[acc["feature_set"] == "gyroscope"]
force_sensors = acc[acc["feature_set"] == "acc_gyro_force"]
mutual_classif_10 = acc[acc["feature_set"] == "mutual_classif_10"]
fig = go.Figure(
    data=[
        go.Bar(
            name="accelerometer", x=accelerometer["window"], y=accelerometer["accuracy"]
        ),
        go.Bar(name="gyroscope", x=gyroscope["window"], y=gyroscope["accuracy"]),
        go.Bar(
            name="acc_gyro_force", x=force_sensors["window"], y=force_sensors["accuracy"]
        ),
        go.Bar(
            name="mutual_classif_10",
            x=mutual_classif_10["window"],
            y=mutual_classif_10["accuracy"],
        ),
    ]
)
# Change the bar mode
fig.update_layout(
    barmode="group",
    title="Window length vs classification accuracy",
    xaxis_title="window length",
    yaxis_title="classification accuracy",
)
fig.show()

![title](jupyter/resources/4.2.4/01.png)

As there is a lot more processing involved, we only performed a few runs of 10 iterations with 7 random users and time intervals set to 10 seconds. The results were consistent throughout. When the 10 or more most recent time segments are used, the accuracy is in line with the previous findings. Contrary to our expectations, using fewer time steps reduces the accuracy, so to achieve the best possible performance, at least the 10 most recent time steps should be used when training the model.

#### DOING Distant future prediction

In this subsection, we will look at how far into the future we can accurately predict the user class.
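
As a sketch of the indexing involved, assume the model is trained on the first `i` segments of a seance and evaluated on the segment that lies `horizon` steps after the first unseen one. The helper name `future_split` is hypothetical, and a horizon of 0 means "the next segment".

```python
def future_split(segments, i, horizon):
    """Train on the first i segments; test on the segment `horizon` steps
    after the first unseen one. The test list is empty once the seance runs out."""
    target = i + horizon
    return segments[:i], segments[target:target + 1]


# A hypothetical seance of ten segments, labelled by position.
segments = list(range(10))
print(future_split(segments, 4, 0))  # ([0, 1, 2, 3], [4])
print(future_split(segments, 4, 3))  # ([0, 1, 2, 3], [7])
print(future_split(segments, 4, 6))  # ([0, 1, 2, 3], [])
```

An empty test list signals that the requested horizon lies beyond the end of the recording, so that seance contributes nothing at that horizon.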

In [None]:
# Data generation
for datas in [datas_1, datas_2, datas_3]:
    epochs = 50
    results = {"feature_set": [], "accuracy": [], "time_step": [], "epoch": [], "window": []}

    for e in range(0, epochs):
        print("e: {}".format(e))
        # Parameters
        number_of_users = 7
        users = get_random_users(n=number_of_users, stats=False)
        segment_interval_length = 10
        interval = segment_intervals.index(segment_interval_length)
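        n_components = 2
        # Feature-set sizes; defined here so the cell does not rely on state from an earlier cell
        comp_num = [5, 10, 15, 20, 25, 30, 35]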

        data = datas[interval]
        data = data[data.user.isin(users)]

        # Get only data for relevant users
        seances = get_users_data(data)
        seances = [x[0] for x in seances.values()]
        data = data[data["seance"].isin(seances)]
        cols = list(data.columns)

        # Iteratively calculate the classification accuracy (ca) for each feature subset
        i = 2
        while True:
            training_set = []
            testing_set = []
            for seance in seances:
                training_set.append(data[data["seance"] == seance].iloc[0:i, :])
                testing_set.append(data[data["seance"] == seance].iloc[i:, :])
            training_set = concat(training_set)
            testing_set = concat(testing_set)
            X_train = training_set.iloc[:, 2:]
            y_train = training_set.iloc[:, 0]

            # Feature selection
            f_cols = list(training_set.iloc[:, 2:].columns)

            feature_sets = {
                "accelerometer": cols[:18],
                "gyroscope": cols[:2] + cols[18:30],
                "force_sensors": cols[:2] + cols[30:42],
                "acc_gyro_force": cols[:42],
            }

            sel = SelectKBest(f_classif, k=2)
            sel.fit(X_train, y_train)
            indices = np.argsort(np.nan_to_num(sel.scores_))
            for x in comp_num:
                feature_sets.update(
                    {
                        "f_classif_{}".format(x): ["user", "seance"]
                        + [f_cols[y] for y in indices[-x:]]
                    }
                )
            sel = SelectKBest(mutual_info_classif, k=2)
            sel.fit(X_train, y_train)
            indices = np.argsort(np.nan_to_num(sel.scores_))
            for x in comp_num:
                feature_sets.update(
                    {
                        "mutual_classif_{}".format(x): ["user", "seance"]
                        + [f_cols[y] for y in indices[-x:]]
                    }
                )
            done = False
            j = i
            while True:
                for feature_set in feature_sets:
                    training_sub = training_set[feature_sets[feature_set]]
                    testing_sub = testing_set[feature_sets[feature_set]]
                    testing_sub_sub = []
                    for seance in seances:
                        try:
                            # testing_set already starts at segment i, so step j of the seance is row j - i
                            testing_sub_sub.append(testing_sub[testing_sub["seance"] == seance].iloc[j - i, :])
                        except IndexError:
                            continue
                    testing_sub = DataFrame(testing_sub_sub)
                    try:
                        X_sub_train = training_sub.iloc[:, 2:]
                        y_sub_train = training_sub.iloc[:, 0]
                        X_sub_test = testing_sub.iloc[:, 2:]
                        y_sub_test = testing_sub.iloc[:, 0]
                    except IndexError:
                        done = True
                        break
                    lda = LinearDiscriminantAnalysis(n_components=n_components)
                    X_lda = lda.fit_transform(X_sub_train, y_sub_train)
                    y_pred = lda.predict(X_sub_test)

                    results["time_step"].append(i * segment_interval_length)
                    results["accuracy"].append(metrics.accuracy_score(y_sub_test, y_pred))
                    results["feature_set"].append(feature_set)
                    results["epoch"].append(e)
                    results["window"].append(j - i)
                if done or j - i >= 20:
                    done = False
                    break
                j += 1
            if i > 100 or done:
                break
            i += 1
    windows = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
    df = DataFrame(results)
    df = df[df["window"].isin(windows)]
    accuracies = {"window": [], "accuracy": [], "feature_set": []}
    for feature_set in ["accelerometer", "gyroscope", "force_sensors", "mutual_classif_10"]:
        for x in windows:
            sub = df[(df["window"] == x) & (df["feature_set"] == feature_set)]
            accuracies["window"].append(x)
            accuracies["accuracy"].append(mean(sub["accuracy"]))
            accuracies["feature_set"].append(feature_set)
    print(accuracies)
print("DONE")

In [254]:
# Cached Data
acc = DataFrame(accuracies)
accelerometer = acc[acc["feature_set"] == "accelerometer"]
gyroscope = acc[acc["feature_set"] == "gyroscope"]
force_sensors = acc[acc["feature_set"] == "force_sensors"]
mutual_classif_10 = acc[acc["feature_set"] == "mutual_classif_10"]
fig = go.Figure(
    data=[
        go.Bar(
            name="accelerometer", x=accelerometer["window"], y=accelerometer["accuracy"]
        ),
        go.Bar(name="gyroscope", x=gyroscope["window"], y=gyroscope["accuracy"]),
        go.Bar(
            name="force sensors", x=force_sensors["window"], y=force_sensors["accuracy"]
        ),
        go.Bar(
            name="mutual_classif_10",
            x=mutual_classif_10["window"],
            y=mutual_classif_10["accuracy"],
        ),
    ]
)
# Change the bar mode
fig.update_layout(
    barmode="group",
    title="Steps into future vs classification accuracy",
    xaxis_title="steps into future",
    yaxis_title="classification accuracy",
)
fig.show()

Similarly to the [previous section](#sliding_window), computation time was high, so again we only performed a few runs of 10 iterations with 7 random users and time segments set to 10 seconds. The discrepancy in accuracy between runs was minimal.

### Trajectory classification

Another potentially fruitful approach is to look at the trajectory of the data points through time in LDA space. As with per-seance classification, we take the first seance of different users for training and apply trajectory classification to the second seance, step by step, to see how much time it takes to discern between users.

#### Wasserstein distance

There are many distance metrics for comparing trajectory similarity. We chose the Wasserstein metric, also known as the earth mover's distance. We first transform both the training and the testing data with LDA, then compute the Wasserstein distances pairwise per component and inspect the results.
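
For intuition, the first Wasserstein distance between two 1D samples is the area between their empirical cumulative distribution functions. The notebook relies on a library routine for this. The dependency-free sketch below is only an illustration, assumes non-empty samples with uniform weights, and uses a hypothetical function name.

```python
from bisect import bisect_right


def wasserstein_1d(u, v):
    """First Wasserstein distance between two non-empty 1D samples with uniform weights."""
    u_sorted, v_sorted = sorted(u), sorted(v)
    all_values = sorted(u_sorted + v_sorted)
    total = 0.0
    for left, right in zip(all_values, all_values[1:]):
        # Empirical CDFs are constant on [left, right), so evaluate them at `left`.
        u_cdf = bisect_right(u_sorted, left) / len(u_sorted)
        v_cdf = bisect_right(v_sorted, left) / len(v_sorted)
        total += abs(u_cdf - v_cdf) * (right - left)
    return total


same = wasserstein_1d([1, 2, 3], [1, 2, 3])
shifted = wasserstein_1d([0, 1, 3], [5, 6, 8])
print(same, shifted)
```

Shifting a sample by a constant moves every point by that amount, so the second call should come out at 5, while identical samples have distance 0. The order of the observations within a sample does not matter, which means this metric compares the distribution of a trajectory's points rather than their sequence.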

In [None]:
# Data
datas = datas_1

# Users
number_of_users = 5
for users in [[10, 12, 15, 17, 23], [13, 8, 26, 14, 27]]:
    print("Users: {}".format(users))
    segment_interval_length = 10
    interval = segment_intervals.index(segment_interval_length)
    n_components = 4
    data = datas[interval]
    data = data[data.user.isin(users)]
    seances = get_users_data(data)
    train_seances = [x[0] for x in seances.values()]
    test_seances = [x[1] for x in seances.values()]
    training_set = data[data["seance"].isin(train_seances)]
    testing_set = data[data["seance"].isin(test_seances)]
    X_train = training_set.iloc[:, 2:].values
    y_train = training_set.iloc[:, 0].values.ravel()
    X_test = testing_set.iloc[:, 2:].values
    y_test = testing_set.iloc[:, 0].values.ravel()
    lda = LinearDiscriminantAnalysis(n_components=n_components)
    X_tr = lda.fit_transform(X_train, y_train)
    X_te = lda.transform(X_test)

    train_frame = DataFrame(
        {
            "user": y_train,
            "component_1": X_tr[:, 0],
            "component_2": X_tr[:, 1],
            "component_3": X_tr[:, 2],
            "component_4": X_tr[:, 3],
        }
    )
    test_frame = DataFrame(
        {
            "user": y_test,
            "component_1": X_te[:, 0],
            "component_2": X_te[:, 1],
            "component_3": X_te[:, 2],
            "component_4": X_te[:, 3],
        }
    )
    users.sort()
    results = {
        "train": [],
        "test": [],
        "dist_1": [],
        "dist_2": [],
        "dist_3": [],
        "dist_4": [],
    }
    fake_users = [x for x in range(1, len(users) + 1)]
    i = 0
    for train_user in users:
        j = 0
        for test_user in users:
            x = train_frame[train_frame["user"] == train_user]
            y = test_frame[test_frame["user"] == test_user]
            results["train"].append(fake_users[i])
            results["test"].append(fake_users[j])
            results["dist_1"].append(
                wasserstein_distance(x["component_1"], y["component_1"])
            )
            results["dist_2"].append(
                wasserstein_distance(x["component_2"], y["component_2"])
            )
            results["dist_3"].append(
                wasserstein_distance(x["component_3"], y["component_3"])
            )
            results["dist_4"].append(
                wasserstein_distance(x["component_4"], y["component_4"])
            )
            j += 1
        i += 1
    results = DataFrame(results)
    dists = results[["dist_1", "dist_2", "dist_3", "dist_4"]].values
    results["dist"] = np.sum(dists * lda.explained_variance_ratio_.T, axis=1)

    scores = []
    for user in fake_users:
        data = results[results["train"] == user]
        scores.append(data["dist"].values)
    scores = array(scores).T
    matching = 0
    y_pred = np.argmin(array(scores), axis=1) + 1
    for i in range(0, len(fake_users)):
        if fake_users[i] == y_pred[i]:
            matching += 1
    print("Majority classifier: {}".format(round(1 / len(users), 2)))
    print("Classification accuracy: {}".format(round(matching / len(users), 2)))

    graphs = []
    for user in fake_users:
        data = results[results["train"] == user]
        graphs.append(
            go.Bar(name="train user " + str(user), x=data["test"], y=data["dist"])
        )
    fig = go.Figure(data=graphs)
    fig.update_layout(
        barmode="group",
        title="Per test user distances",
        xaxis_title="test user class",
        yaxis_title="combined wasserstein distance",
    )
    fig.show()

Similarly to the other two approaches, we discovered that the user subset is the most influential factor for classification accuracy. After applying the LDA transform to both the training and the testing data, the trajectories we observed were, at first glance, nicely clustered per class, but the individual projected paths seemed random.

As the Wasserstein distance only accepts 1D data, we took the first 4 components of the LDA transformation of both the training and the testing data and calculated the distances pairwise for each component separately. To combine the distances into a single metric per user pair, we weighted the component distances by the explained variance ratio of each component, which increases the contribution of the more important components to the final distance score. To classify each testing seance, we took its distances to all training classes and assigned it to the training class with the minimum distance.
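
The weighting and the minimum-distance rule described above can be sketched on a toy example. All numbers below are invented, the weights merely stand in for explained variance ratios, and both helper names are hypothetical.

```python
def combine_distances(component_distances, weights):
    """Weighted sum of the per-component distances for one train/test pair."""
    return sum(d * w for d, w in zip(component_distances, weights))


def classify_by_min_distance(distances, weights):
    """distances[test][train] holds the per-component distances of one pair.

    Returns, for each test seance, the index of the closest training class.
    """
    predictions = []
    for per_train in distances:
        combined = [combine_distances(d, weights) for d in per_train]
        predictions.append(combined.index(min(combined)))
    return predictions


# Invented example: 2 test seances, 2 training classes, 2 components.
weights = [0.8, 0.2]  # stand-in for the explained variance ratios
distances = [
    [[0.1, 2.0], [1.0, 0.1]],  # test seance 0 vs training classes 0 and 1
    [[1.5, 0.2], [0.3, 0.9]],  # test seance 1 vs training classes 0 and 1
]
predicted = classify_by_min_distance(distances, weights)
accuracy = sum(p == t for t, p in enumerate(predicted)) / len(predicted)
print(predicted, accuracy)  # [0, 1] 1.0
```

In this example the second component alone would assign test seance 0 to class 1. Its low weight keeps it from overriding the first component, which is the purpose of the weighting.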

In the graph above, we can see the distances plotted for a subset of 5 users. Depending on the subset, the classification accuracy ranges between 0.4 and 1, with an average of around 0.7. Below we use a more thorough approach to properly evaluate our algorithm.

In [316]:
# Epochs
e = 50
datas = datas_1
user_numbers = [5, 9, 13, 17]
intervals = [10, 90, 180]
final_result = {"user_number": [], "interval": [], "accuracy": []}
warnings.filterwarnings("ignore")

# Users
number_of_users = 5

for number_of_users in user_numbers:
#     print("=============================")
#     print(number_of_users)
#     print("-----------------------------")
    for segment_interval_length in intervals:
#         print(segment_interval_length)
        accuracies = []
        for _ in range(0, e):
            users = get_random_users(n=number_of_users, stats=False)
            # Time intervals
            interval = segment_intervals.index(segment_interval_length)
            # Components
            n_components = 4

            data = datas[interval]
            data = data[data.user.isin(users)]
            seances = get_users_data(data)
            train_seances = [x[0] for x in seances.values()]
            test_seances = [x[1] for x in seances.values()]
            training_set = data[data["seance"].isin(train_seances)]
            testing_set = data[data["seance"].isin(test_seances)]
            X_train = training_set.iloc[:, 2:].values
            y_train = training_set.iloc[:, 0].values.ravel()
            X_test = testing_set.iloc[:, 2:].values
            y_test = testing_set.iloc[:, 0].values.ravel()
            lda = LinearDiscriminantAnalysis(n_components=n_components)
            X_tr = lda.fit_transform(X_train, y_train)
            X_te = lda.transform(X_test)
            train_frame = DataFrame(
                {
                    "user": y_train,
                    "component_1": X_tr[:, 0],
                    "component_2": X_tr[:, 1],
                    "component_3": X_tr[:, 2],
                    "component_4": X_tr[:, 3],
                }
            )
            test_frame = DataFrame(
                {
                    "user": y_test,
                    "component_1": X_te[:, 0],
                    "component_2": X_te[:, 1],
                    "component_3": X_te[:, 2],
                    "component_4": X_te[:, 3],
                }
            )
            users.sort()
            results = {
                "train": [],
                "test": [],
                "dist_1": [],
                "dist_2": [],
                "dist_3": [],
                "dist_4": [],
            }
            fake_users = [x for x in range(1, len(users) + 1)]
            i = 0
            for train_user in users:
                j = 0
                for test_user in users:
                    x = train_frame[train_frame["user"] == train_user]
                    y = test_frame[test_frame["user"] == test_user]
                    results["train"].append(fake_users[i])
                    results["test"].append(fake_users[j])
                    results["dist_1"].append(
                        wasserstein_distance(x["component_1"], y["component_1"])
                    )
                    results["dist_2"].append(
                        wasserstein_distance(x["component_2"], y["component_2"])
                    )
                    results["dist_3"].append(
                        wasserstein_distance(x["component_3"], y["component_3"])
                    )
                    results["dist_4"].append(
                        wasserstein_distance(x["component_4"], y["component_4"])
                    )
                    j += 1
                i += 1
            results = DataFrame(results)

            dists = results[["dist_1", "dist_2", "dist_3", "dist_4"]].values
            results["dist"] = np.sum(dists * lda.explained_variance_ratio_.T, axis=1)

            scores = []
            for user in fake_users:
                data = results[results["train"] == user]
                scores.append(data["dist"].values)
            scores = array(scores).T
            matching = 0
            y_pred = np.argmin(array(scores), axis=1) + 1
            for i in range(0, len(fake_users)):
                if fake_users[i] == y_pred[i]:
                    matching += 1
            accuracies.append(matching/number_of_users)
        
        final_result["user_number"].append(number_of_users)
        final_result["interval"].append(segment_interval_length)
        final_result["accuracy"].append(mean(accuracies))

df = DataFrame(final_result)

graphs = []
for interval in intervals:
    data = df[df["interval"] == interval]
    graphs.append(go.Bar(name=str(interval) + " seconds", x=data["user_number"], y=data["accuracy"]))

fig = go.Figure(data=graphs)
fig.update_layout(barmode="group", title="User number vs accuracy", xaxis_title="number of users")
fig.show()

To evaluate the overall accuracy of our method, we took different time intervals $ti = [10, 90, 180]$ and different numbers of users $nu = [5, 9, 13, 17]$, and calculated the accuracy over 50 random subsets of users for each $ti$-$nu$ pair.
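
The repeated random sampling can be sketched as follows. The notebook's own user-sampling helper is not shown in this section, so `random_user_subsets` is only a hypothetical stand-in, and the user IDs are placeholders.

```python
import random


def random_user_subsets(user_ids, n, repeats, seed=0):
    """Draw `repeats` random subsets of `n` distinct users each."""
    rng = random.Random(seed)  # fixed seed so that a run can be reproduced
    return [sorted(rng.sample(user_ids, n)) for _ in range(repeats)]


user_ids = list(range(1, 22))  # 21 users; these IDs are placeholders
subsets = random_user_subsets(user_ids, n=5, repeats=50)
print(len(subsets), subsets[0])
```

Averaging the accuracy over such subsets reduces the influence of any single favourable or unfavourable combination of users.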

As expected, we can see a drop-off in accuracy as the number of users grows, regardless of the length of the time intervals. The accuracy itself is lower than with per-seance classification, but still much higher than the majority classifier for any number of users.
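
For reference, the baseline is simple arithmetic. Assuming one test seance per user, so that all classes are equally frequent, a majority classifier is correct for exactly one user out of `n`.

```python
def chance_level(number_of_users):
    """Accuracy of a majority classifier when every user has one test seance."""
    return 1 / number_of_users


baselines = {n: round(chance_level(n), 2) for n in [5, 9, 13, 17]}
print(baselines)  # {5: 0.2, 9: 0.11, 13: 0.08, 17: 0.06}
```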

### Context based approach

Instead of naively taking all available data, we will concentrate on specific contexts that occurred during each experiment, e.g. the time interval when the computer was turned on or when the test subject was typing on the keyboard. We expect to see an improved classification accuracy, as the data will be more consistent and less noisy.

#### DOING Pc turned on

One of the simplest data subsets we can extract from our dataset is the data from the time intervals when the computer was turned on. As the experiment was designed so that test subjects started and ended the experiment with the computer turned off, we introduced some less consistent data that hurts our classification accuracy. In this section, we explore how the accuracy changes when only the aforementioned data is taken into account and put through the classification methods presented in the previous sections.

The method of extracting this subset is simple: we only need to look at any of the PC monitor features (cpu, memory, network activity), as those only provide non-zero data when the computer is turned on.
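
A minimal sketch of such a filter is shown below. The column names and values are invented, and the actual generation routine used in the notebook may work differently. The only assumption is that a segment belongs to the "PC turned on" context when at least one monitor feature is non-zero.

```python
def pc_turned_on(segment, monitor_features):
    """Return True when any PC monitor feature of the segment is non-zero."""
    return any(segment.get(name, 0) != 0 for name in monitor_features)


# Hypothetical column names and values, for illustration only.
monitor_features = ["cpu_mean", "memory_mean", "network_mean"]
segments = [
    {"time": 0, "cpu_mean": 0.0, "memory_mean": 0.0, "network_mean": 0.0},
    {"time": 10, "cpu_mean": 12.5, "memory_mean": 40.1, "network_mean": 0.0},
    {"time": 20, "cpu_mean": 8.0, "memory_mean": 41.3, "network_mean": 2.2},
    {"time": 30, "cpu_mean": 0.0, "memory_mean": 0.0, "network_mean": 0.0},
]
active = [s for s in segments if pc_turned_on(s, monitor_features)]
print([s["time"] for s in active])  # [10, 20]
```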

In [None]:
# Generating csv data again, only taking intervals, when the pc was turned on
import warnings

# Suppressing all warnings is generally discouraged; here it only hides "mean of empty slice" and similar warnings
warnings.filterwarnings("ignore")

for seconds in [10, 15, 30, 45, 60, 75, 90, 120, 150, 180]:
    generate_segmented_data_pc_monitor(seconds=seconds, experiment=1)
    generate_segmented_data_pc_monitor(seconds=seconds, experiment=2)
    generate_segmented_data_pc_monitor(seconds=seconds, experiment=3)

In [38]:
# Importing csv data
segment_intervals = [10, 15, 30, 45, 60, 75, 90, 120, 150, 180]
pc_datas_1 = []
pc_datas_2 = []
pc_datas_3 = []

for interval in segment_intervals:
    pc_datas_1.append(
        read_csv(
            "jupyter/data/segmented_data_{}_seconds_experiment_1_pc_monitor.csv".format(
                interval
            )
        ).fillna(0)
    )
    pc_datas_2.append(
        read_csv(
            "jupyter/data/segmented_data_{}_seconds_experiment_2_pc_monitor.csv".format(
                interval
            )
        ).fillna(0)
    )
    pc_datas_3.append(
        read_csv(
            "jupyter/data/segmented_data_{}_seconds_experiment_3_pc_monitor.csv".format(
                interval
            )
        ).fillna(0)
    )

In [None]:
# Data classification
i = 1
for datas, pc_datas, remove in zip(
    [datas_1, datas_2, datas_3], [pc_datas_1, pc_datas_2, pc_datas_3], [[9, 12], [12], [12, 14, 20]]
):
    epochs = 5000
    all_data = {"accuracy": [], "majority_classifier": []}
    pc_data = {"accuracy": [], "majority_classifier": []}

    for e in range(0, epochs):
        if e % 1000 == 0:
            print(e)
        # Users
        number_of_users = 7
        users = get_random_users(n=number_of_users, stats=False, remove=remove)
        segment_interval_length = 75
        interval = segment_intervals.index(segment_interval_length)
        n_components = 2
        sub = False
        for data in [datas[interval], pc_datas[interval]]:
            data = data[data.user.isin(users)]
            seances = get_users_data(data)
            train_seances = [x[0] for x in seances.values()]
            test_seances = [x[1] for x in seances.values()]
            training_set = data[data["seance"].isin(train_seances)]
            testing_set = data[data["seance"].isin(test_seances)]
            X_train = training_set.iloc[:, 2:].values
            y_train = training_set.iloc[:, 0].values.ravel()
            X_test = testing_set.iloc[:, 2:].values
            y_test = testing_set.iloc[:, 0].values.ravel()
            lda = LinearDiscriminantAnalysis(n_components=n_components)
            lda.fit(X_train, y_train)
            y_pred = lda.predict(X_test)
            if not sub:
                all_data["majority_classifier"].append(
                    list(y_test).count(max(set(y_test), key=list(y_test).count))
                    / len(y_test)
                )
                all_data["accuracy"].append(metrics.accuracy_score(y_test, y_pred))
            else:
                pc_data["majority_classifier"].append(
                    list(y_test).count(max(set(y_test), key=list(y_test).count))
                    / len(y_test)
                )
                pc_data["accuracy"].append(metrics.accuracy_score(y_test, y_pred))
            sub = True

    fig = go.Figure(
        data=[
            go.Bar(
                name="All data",
                y=[mean(all_data["majority_classifier"]), mean(all_data["accuracy"])],
                x=["majority classifier", "accuracy"],
            ),
            go.Bar(
                name="Pc data",
                y=[mean(pc_data["majority_classifier"]), mean(pc_data["accuracy"])],
                x=["majority classifier", "accuracy"],
            ),
        ]
    )
    fig.update_layout(
        barmode="group",
        title="Task {}: pc turned on data comparison".format(i),
        # Axis titles belong to update_layout, not to str.format
        xaxis_title="classifier",
        yaxis_title="classification accuracy",
    )
    print("x: {}".format(["majority classifier", "accuracy"]))
    print(
        "y: {}".format(
            [mean(all_data["majority_classifier"]), mean(all_data["accuracy"])]
        )
    )
    print("x: {}".format(["majority classifier", "accuracy"]))
    print(
        "y: {}".format(
            [mean(pc_data["majority_classifier"]), mean(pc_data["accuracy"])]
        )
    )
    fig.show()
    i += 1

In [None]:
# Cached results
# 7 random users
# 75 seconds intervals
# 5000 epochs

# Task 1
fig = go.Figure(
    data=[
        go.Bar(
            name="All data",
            x=['majority classifier', 'accuracy'],
            y=[0.20253805058932847, 0.5919559171587843],
        ),
        go.Bar(
            name="Pc data",
            x=['majority classifier', 'accuracy'],
            y=[0.2381656563313647, 0.5690095794482245],
        ),
    ]
)
fig.update_layout(
    barmode="group",
    title="Task 1: pc turned on vs all data comparison",
    xaxis_title="classifier",
    yaxis_title="classification accuracy",
)
fig.show()

# Task 2
fig = go.Figure(
    data=[
        go.Bar(
            name="All data",
            x=['majority classifier', 'accuracy'],
            y=[0.26491006303020836, 0.6573337291589266],
        ),
        go.Bar(
            name="Pc data",
            x=['majority classifier', 'accuracy'],
            y=[0.28683472733258203, 0.7200323437425099],
        ),
    ]
)
fig.update_layout(
    barmode="group",
    title="Task 2: pc turned on vs all data comparison",
    xaxis_title="classifier",
    yaxis_title="classification accuracy",
)
fig.show()

# Task 3
fig = go.Figure(
    data=[
        go.Bar(
            name="All data",
            x=['majority classifier', 'accuracy'],
            y=[0.1970789031319191, 0.7162078150603199],
        ),
        go.Bar(
            name="Pc data",
            x=['majority classifier', 'accuracy'],
            y=[0.2367723519823066, 0.7495415738451454],
        ),
    ]
)
fig.update_layout(
    barmode="group",
    title="Task 3: pc turned on vs all data comparison",
    xaxis_title="classifier",
    yaxis_title="classification accuracy",
)
fig.show()

By subsetting the data we improved the classification accuracy of the second and the third task by roughly 6 and 3 percentage points respectively. The first task was again more disappointing, as we can observe a drop of about 2 percentage points in accuracy, which we did not expect. As before, we performed 5000 iterations per task, with 7 random users and 75-second time intervals for each iteration.
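
The differences can be read directly off the cached results above. A short check, using only the mean accuracies listed in the cached cell (the variable names `cached` and `deltas` are introduced here for illustration):

```python
# Mean accuracies copied from the cached results (7 users, 75 s intervals, 5000 epochs)
cached = {
    "Task 1": {"all": 0.5919559171587843, "pc": 0.5690095794482245},
    "Task 2": {"all": 0.6573337291589266, "pc": 0.7200323437425099},
    "Task 3": {"all": 0.7162078150603199, "pc": 0.7495415738451454},
}

# Difference in percentage points between the pc-on subset and all data
deltas = {
    task: round(100 * (acc["pc"] - acc["all"]), 1) for task, acc in cached.items()
}
for task, delta in deltas.items():
    print("{}: {:+.1f} percentage points".format(task, delta))
```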

In the next step we will observe how the feature selection algorithms select features, now that the pc data should be more relevant.

In [None]:
# Data classification
for dataset, remove in zip(
    # Each task is paired with its own pc-on subset
    [[datas_1, pc_datas_1], [datas_2, pc_datas_2], [datas_3, pc_datas_3]],
    [[9, 12], [12], [12, 14, 20]],
):
    comp_results = []
    for datas, i in zip(dataset, [1, 2]):
        # Parameters
        epochs = 5000
        number_of_users = 7
        segment_interval_length = 75
        interval = segment_intervals.index(segment_interval_length)
        n_components = 2
        sets = {}

        data = datas[interval]
        cols = list(data.columns)
        feature_sets = {
            "accelerometer": cols[0:2] + cols[2:18],
            "gyroscope": cols[0:2] + cols[18:30],
            "force_sensors": cols[0:2] + cols[30:42],
            "acc + gyro + force": cols[0:2] + cols[2:42],
            "cpu": cols[0:2] + cols[42:67],
            "mem": cols[0:2] + cols[67:75],
            "net": cols[0:2] + cols[75:89],
            "cpu + mem + net": cols[0:2] + cols[42:89],
            "f_classif_5": [],
            "mutual_classif_5": [],
            "f_classif_10": [],
            "mutual_classif_10": [],
            "f_classif_15": [],
            "mutual_classif_15": [],
            "f_classif_20": [],
            "mutual_classif_20": [],
            "f_classif_25": [],
            "mutual_classif_25": [],
            "f_classif_30": [],
            "mutual_classif_30": [],
            "f_classif_35": [],
            "mutual_classif_35": [],
        }
        results = {
            "accelerometer": [],
            "gyroscope": [],
            "force_sensors": [],
            "acc + gyro + force": [],
            "cpu": [],
            "mem": [],
            "net": [],
            "cpu + mem + net": [],
            "f_classif_5": [],
            "mutual_classif_5": [],
            "f_classif_10": [],
            "mutual_classif_10": [],
            "f_classif_15": [],
            "mutual_classif_15": [],
            "f_classif_20": [],
            "mutual_classif_20": [],
            "f_classif_25": [],
            "mutual_classif_25": [],
            "f_classif_30": [],
            "mutual_classif_30": [],
            "f_classif_35": [],
            "mutual_classif_35": [],
        }
        for e in range(0, epochs):
            if e % 10 == 0:
                print(e)
            # Data subset
            users = get_random_users(n=7, stats=False, remove=remove)
            sub_data = data
            # Subset users
            sub_data = sub_data[sub_data.user.isin(users)]
            # Split the data
            seances = get_users_data(sub_data)
            train_seances = [x[0] for x in seances.values()]
            test_seances = [x[1] for x in seances.values()]
            training_set = sub_data[sub_data["seance"].isin(train_seances)]
            testing_set = sub_data[sub_data["seance"].isin(test_seances)]
            X = training_set.iloc[:, 2:].values
            y = training_set.iloc[:, 0].values.ravel()

            # Feature selection
            f_cols = list(data.iloc[:, 2:].columns)
            sel = SelectKBest(f_classif, k=2)
            sel.fit(X, y)
            indices = np.argsort(sel.scores_)
            feature_sets["f_classif_5"] = cols[0:2] + [f_cols[x] for x in indices[-5:]]
            feature_sets["f_classif_10"] = cols[0:2] + [
                f_cols[x] for x in indices[-10:]
            ]
            feature_sets["f_classif_15"] = cols[0:2] + [
                f_cols[x] for x in indices[-15:]
            ]
            feature_sets["f_classif_20"] = cols[0:2] + [
                f_cols[x] for x in indices[-20:]
            ]
            feature_sets["f_classif_25"] = cols[0:2] + [
                f_cols[x] for x in indices[-25:]
            ]
            feature_sets["f_classif_30"] = cols[0:2] + [
                f_cols[x] for x in indices[-30:]
            ]
            feature_sets["f_classif_35"] = cols[0:2] + [
                f_cols[x] for x in indices[-35:]
            ]
            sel = SelectKBest(mutual_info_classif, k=2)
            sel.fit(X, y)
            indices = np.argsort(sel.scores_)
            feature_sets["mutual_classif_5"] = cols[0:2] + [
                f_cols[x] for x in indices[-5:]
            ]
            feature_sets["mutual_classif_10"] = cols[0:2] + [
                f_cols[x] for x in indices[-10:]
            ]
            feature_sets["mutual_classif_15"] = cols[0:2] + [
                f_cols[x] for x in indices[-15:]
            ]
            feature_sets["mutual_classif_20"] = cols[0:2] + [
                f_cols[x] for x in indices[-20:]
            ]
            feature_sets["mutual_classif_25"] = cols[0:2] + [
                f_cols[x] for x in indices[-25:]
            ]
            feature_sets["mutual_classif_30"] = cols[0:2] + [
                f_cols[x] for x in indices[-30:]
            ]
            feature_sets["mutual_classif_35"] = cols[0:2] + [
                f_cols[x] for x in indices[-35:]
            ]

            # Calculate ca for given users for all feature sets
            for feature_set in feature_sets:
                X_train = training_set[feature_sets[feature_set]].iloc[:, 2:].values
                y_train = (
                    training_set[feature_sets[feature_set]].iloc[:, 0].values.ravel()
                )
                X_test = testing_set[feature_sets[feature_set]].iloc[:, 2:].values
                y_test = (
                    testing_set[feature_sets[feature_set]].iloc[:, 0].values.ravel()
                )

                lda = LinearDiscriminantAnalysis(n_components=n_components)
                lda.fit(X_train, y_train)
                y_pred = lda.predict(X_test)
                results[feature_set].append(metrics.accuracy_score(y_test, y_pred))

        # Calculate mean accuracy per feature set
        for x in results:
            results[x] = mean(results[x])
        comp_results.append(results)

    # Draw the figure
    fig = go.Figure(
        [
            go.Bar(
                name="All data",
                x=list(comp_results[0].keys()),
                y=list(comp_results[0].values()),
            ),
            go.Bar(
                name="Pc data",
                x=list(comp_results[1].keys()),
                y=list(comp_results[1].values()),
            ),
        ]
    )
    fig.update_layout(
        title="Feature set vs classification accuracy",
        xaxis_title="feature set",
        yaxis_title="classification accuracy",
    )
    print("All data")
    print("x: {}".format(list(comp_results[0].keys())))
    print("y: {}".format(list(comp_results[0].values())))
    print("Pc data")
    print("x: {}".format(list(comp_results[1].keys())))
    print("y: {}".format(list(comp_results[1].values())))
    fig.show()

In [None]:
# Cached data

# Task 1
fig = go.Figure(
    [
        go.Bar(
            name="All data",

        ),
        go.Bar(
            name="Pc data",

        ),
    ]
)
fig.update_layout(
    title="Feature set vs classification accuracy",
    xaxis_title="feature set",
    yaxis_title="classification accuracy",
)
fig.show()

# Task 2
fig = go.Figure(
    [
        go.Bar(
            name="All data",

        ),
        go.Bar(
            name="Pc data",

        ),
    ]
)
fig.update_layout(
    title="Feature set vs classification accuracy",
    xaxis_title="feature set",
    yaxis_title="classification accuracy",
)
fig.show()

# Task 3
fig = go.Figure(
    [
        go.Bar(
            name="All data",

        ),
        go.Bar(
            name="Pc data",

        ),
    ]
)
fig.update_layout(
    title="Feature set vs classification accuracy",
    xaxis_title="feature set",
    yaxis_title="classification accuracy",
)
fig.show()

Taking only the data from when the pc was turned on yields little to no improvement for any of the selected feature sets; moreover, for the feature sets chosen by the feature selection algorithms the accuracy worsens. To investigate, we will look at the specific features chosen by the feature selection algorithm to see whether any of the pc monitor features are chosen more often than with all data, consequently worsening the results.
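
The counting itself can be illustrated on synthetic data. The sketch below uses `f_classif` rather than `mutual_info_classif` purely to keep the example fast and deterministic; the feature names, the data and the number of repetitions are made up.

```python
from collections import Counter

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
feature_names = ["informative", "noise_a", "noise_b", "noise_c"]
top_k = 1
counts = Counter()

for _ in range(20):
    # Fresh synthetic sample per repetition, standing in for a random user subset
    y = rng.integers(0, 3, size=90)
    X = rng.normal(size=(90, len(feature_names)))
    X[:, 0] += 3.0 * y  # only the first column depends on the class
    selector = SelectKBest(f_classif, k="all").fit(X, y)
    # argsort is ascending, so the last top_k indices are the best-scoring features
    best = np.argsort(selector.scores_)[-top_k:]
    counts.update(feature_names[i] for i in best)

print(counts.most_common())
```

A feature that is selected in nearly every repetition is one the selector consistently ranks highly, regardless of which users were drawn.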

In [None]:
# Data classification
for datas, i, remove in zip([pc_datas_1, pc_datas_2, pc_datas_3], [1, 2, 3], [[9, 12], [12], [12, 14, 20]]):
    s = datetime.now()
    # Parameters
    epochs = 5000
    number_of_users = 7
    segment_interval_length = 75
    interval = segment_intervals.index(segment_interval_length)
    n_components = 2
    sets = {}

    data = datas[interval]
    cols = list(data.columns)
    feature_sets = {
        "mutual_classif_10": [],
    }
    results = {
        "mutual_classif_10": [],
    }
    for e in range(0, epochs):
        if e % 1000 == 0:
            print(e)
        # Data subset
        users = get_random_users(n=number_of_users, stats=False, remove=remove)
        sub_data = data
        # Subset users
        sub_data = sub_data[sub_data.user.isin(users)]
        # Split the data
        seances = get_users_data(sub_data)
        train_seances = [x[0] for x in seances.values()]
        test_seances = [x[1] for x in seances.values()]
        training_set = sub_data[sub_data["seance"].isin(train_seances)]
        testing_set = sub_data[sub_data["seance"].isin(test_seances)]
        X = training_set.iloc[:, 2:].values
        y = training_set.iloc[:, 0].values.ravel()

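        # Feature selection
        # f_cols was only defined in an earlier cell; define it here so this cell runs on its own
        f_cols = list(data.iloc[:, 2:].columns)
        sel = SelectKBest(mutual_info_classif, k=2)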
        sel.fit(X, y)
        indices = np.argsort(sel.scores_)
        feature_sets["mutual_classif_10"] = cols[0:2] + [
            f_cols[x] for x in indices[-10:]
        ]

        # Calculate ca for given users for all feature sets
        for feature_set in feature_sets:
            if feature_set not in sets:
                sets.update({feature_set: {}})
            for feature in feature_sets[feature_set]:
                if feature in sets[feature_set]:
                    sets[feature_set][feature] += 1
                else:
                    sets[feature_set].update({feature: 1})

#     print("Running time: {}".format(datetime.now() - s))
    # Draw the figure
    print(sets)
    for key in sets:
        features = sets[key]
        features.pop("user")
        features.pop("seance")
        features = Counter(features).most_common()
        print("x: {}".format([x[0] for x in features]))
        print("y: {}".format([x[1] for x in features]))
        fig = bar(x=[x[0] for x in features], y=[x[1] for x in features])
        fig.update_layout(
            title="Task {}: feature frequency".format(i),
            xaxis_title="features",
            yaxis_title="frequency",
        )
        fig.show()

In [None]:
# Cached data

# Task 1
sets = 
for key in sets:
        features = sets[key]
        features.pop("user")
        features.pop("seance")
        features = Counter(features).most_common()
        fig = bar(x=[x[0] for x in features], y=[x[1] for x in features])
        fig.update_layout(
            title="Task 1: feature frequency",
            xaxis_title="features",
            yaxis_title="frequency",
        )
        fig.show()

# Task 2
sets = 
for key in sets:
        features = sets[key]
        features.pop("user")
        features.pop("seance")
        features = Counter(features).most_common()
        fig = bar(x=[x[0] for x in features], y=[x[1] for x in features])
        fig.update_layout(
            title="Task 2: feature frequency",
            xaxis_title="features",
            yaxis_title="frequency",
        )
        fig.show()

# Task 3
sets = 
for key in sets:
        features = sets[key]
        features.pop("user")
        features.pop("seance")
        features = Counter(features).most_common()
        fig = bar(x=[x[0] for x in features], y=[x[1] for x in features])
        fig.update_layout(
            title="Task 3: feature frequency",
            xaxis_title="features",
            yaxis_title="frequency",
        )
        fig.show()

#### TODO Typing data

To build on the previous section, we will extract the subset of data from when the test subjects were actually typing. We can achieve this by combining data from the accelerometer, gyroscope and force sensors and watching for increased activity.
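
One possible way to detect increased activity is to threshold the rolling standard deviation of a combined signal. This is only a sketch of one reading of the idea, not the method used in this notebook; the function `flag_active_windows`, its `window` and `factor` parameters, and the synthetic signal are all assumptions.

```python
import numpy as np
import pandas as pd


def flag_active_windows(signal, window=5, factor=3.0):
    """Flag samples whose rolling standard deviation exceeds factor * median."""
    rolling_sd = pd.Series(signal).rolling(window, min_periods=window).std()
    threshold = factor * rolling_sd.median()
    # Comparisons against NaN (the first window - 1 samples) evaluate to False
    return (rolling_sd > threshold).to_numpy()


# Synthetic signal: quiet, then a burst of activity, then quiet again
rng = np.random.default_rng(1)
quiet_before = rng.normal(0.0, 0.05, size=100)
typing = rng.normal(0.0, 1.0, size=50)
quiet_after = rng.normal(0.0, 0.05, size=100)
signal = np.concatenate([quiet_before, typing, quiet_after])

active = flag_active_windows(signal)
print(active[:100].mean(), active[110:150].mean())
```

Using the median as the reference level assumes that the subject is inactive for most of the seance; if typing dominated the recording, a fixed threshold would be the safer choice.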

In [None]:
# Data generation
datas = datas_1

# Parameters
number_of_users = 7
users = get_random_users(n=number_of_users, stats=False)
segment_interval_length = 10
interval = segment_intervals.index(segment_interval_length)
n_components = 2
data = datas[interval]
data = data[data.user.isin(users)]

## Helpers

<a id="section_helpers"></a>

Functions that were implemented in the above sections, but moved down here to reduce clutter. Run the cell below to enable the functionality of this notebook.

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import random
import warnings

from collections import Counter
from datetime import datetime, timedelta
from fastdtw import fastdtw
from IPython.display import clear_output
from math import sqrt, log10
from numpy import mean, std, array
from numpy.fft import fft, fftfreq, ifftshift
from pandas import DataFrame, read_csv, concat
from plotly.express import scatter, line, scatter_3d, bar, line_3d
from plotly.subplots import make_subplots
from scipy.spatial.distance import euclidean
from scipy.signal import find_peaks
from scipy.stats import wasserstein_distance
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif, mutual_info_classif
from sklearn.preprocessing import StandardScaler


def load_data(seance_id, sens):
    seance = Seance.objects.get(id=seance_id)
    if sens == "accelerometer":
        sensor_ids = [60, 61, 62]

        sensors = Sensor.objects.filter(id__in=sensor_ids).order_by("id")
        return (
            SensorRecord.objects.filter(seance=seance, sensor=sensors[0]).order_by(
                "timestamp"
            ),
            SensorRecord.objects.filter(seance=seance, sensor=sensors[1]).order_by(
                "timestamp"
            ),
            SensorRecord.objects.filter(seance=seance, sensor=sensors[2]).order_by(
                "timestamp"
            ),
        )
    elif sens == "gyroscope":
        sensor_ids = [63, 64, 65]
        sensors = Sensor.objects.filter(id__in=sensor_ids).order_by("id")
        return (
            SensorRecord.objects.filter(seance=seance, sensor=sensors[0]).order_by(
                "timestamp"
            ),
            SensorRecord.objects.filter(seance=seance, sensor=sensors[1]).order_by(
                "timestamp"
            ),
            SensorRecord.objects.filter(seance=seance, sensor=sensors[2]).order_by(
                "timestamp"
            ),
        )
    elif sens == "force":
        sensor_ids = [54, 55, 76, 77]
        sensors = Sensor.objects.filter(id__in=sensor_ids).order_by("topic")
        return (
            SensorRecord.objects.filter(
                seance=seance, sensor=sensors[0], value__gte=50
            ).order_by("timestamp"),
            SensorRecord.objects.filter(
                seance=seance, sensor=sensors[1], value__gte=50
            ).order_by("timestamp"),
            SensorRecord.objects.filter(
                seance=seance, sensor=sensors[2], value__gte=50
            ).order_by("timestamp"),
            SensorRecord.objects.filter(
                seance=seance, sensor=sensors[3], value__gte=50
            ).order_by("timestamp"),
        )
    elif sens == "cpu":
        sensor_ids = [78, 79, 80, 81]
        sensors = Sensor.objects.filter(id__in=sensor_ids).order_by("topic")
        return (
            SensorRecord.objects.filter(seance=seance, sensor=sensors[0]).order_by(
                "timestamp"
            ),
            SensorRecord.objects.filter(seance=seance, sensor=sensors[1]).order_by(
                "timestamp"
            ),
            SensorRecord.objects.filter(seance=seance, sensor=sensors[2]).order_by(
                "timestamp"
            ),
            SensorRecord.objects.filter(seance=seance, sensor=sensors[3]).order_by(
                "timestamp"
            ),
        )
    elif sens == "ram":
        sensor_ids = [82]
        sensors = Sensor.objects.filter(id__in=sensor_ids).order_by("topic")
        return SensorRecord.objects.filter(seance=seance, sensor=sensors[0]).order_by(
            "timestamp"
        )
    elif sens == "net":
        sensor_ids = [83, 84]
        sensors = Sensor.objects.filter(id__in=sensor_ids).order_by("id")
        return (
            SensorRecord.objects.filter(seance=seance, sensor=sensors[0]).order_by(
                "timestamp"
            ),
            SensorRecord.objects.filter(seance=seance, sensor=sensors[1]).order_by(
                "timestamp"
            ),
        )
    elif sens == "pir":
        sensor_ids = [58, 59, 66, 67, 68, 69]
        sensors = Sensor.objects.filter(id__in=sensor_ids).order_by("id")
        return SensorRecord.objects.filter(seance=seance, sensor__in=sensors).order_by(
            "timestamp"
        )
    else:
        raise ValueError("Invalid sensor string.")


def process_signal(records):
    """
    Take Django query and do basic signal processing.
    """
    values = [x.value for x in records]
    times = [x.timestamp for x in records]
    m = mean(values)
    s = std(values)
    norm = [(x - m) / s for x in values]

    return values, times, norm, m, s


def join_accelerometer_signals(x, y, z):
    """
    Join accelerometer signals, based simply on concurrence. 
    We can do this, as only one controller sends data in loop for all axis.
    """
    result = []
    n = min(len(x), len(y), len(z))
    for a, b, c in zip(x[:n], y[:n], z[:n]):
        result.append(sqrt(a ** 2 + b ** 2 + c ** 2))
    return result, mean(result), std(result)


def mean_crossing_rate(signal, m):
    """
    Calculate mean crossing rate from signal.
    Rate of mean crossings vs. the signal length.
    """
    try:
        prev = signal[0]
    except IndexError:
        return 0
    crosses = 0
    length = len(signal) - 1

    for curr in signal[1:]:
        if prev <= m < curr or prev > m >= curr:
            crosses += 1
        prev = curr
    if length < 1:
        return 0
    return crosses / length


def mean_acceleration_intensity(signal):
    """
    Mean derivative of a signal.
    """
    try:
        prev = signal[0]
    except IndexError:
        return 0
    length = len(signal) - 1
    derv = []

    for curr in signal[1:]:
        derv.append(abs(curr - prev))
        prev = curr

    return mean(derv)


def join_cpu_signals(a, b, c, d):
    """
    Similar to accelerometer one.
    """
    result = []
    n = min(len(a), len(b), len(c), len(d))
    for w, x, y, z in zip(a[:n], b[:n], c[:n], d[:n]):
        result.append(sqrt(w ** 2 + x ** 2 + y ** 2 + z ** 2))
    return result, mean(result), std(result)


def get_cpu_stats(val):
    if not val:
        return 0, 0, 0
    return min(val), max(val), mean_crossing_rate(val, mean(val))


def find_ram_jump(signal):
    if not signal:
        return [], {}
    derivative = []
    prev = signal[0]
    for curr in signal[1:]:
        derivative.append(abs(curr - prev))
        prev = curr
    peaks, _ = find_peaks(derivative, threshold=0.25)

    p = {"position": [], "magnitude": []}
    for x in peaks:
        p["position"].append(x)
        p["magnitude"].append(derivative[x])
    return derivative, p


def get_mem_stats(val, peaks, derivatives):
    # Calculate average inter jump interval
    intervals = []
    if peaks and peaks["position"]:
        prev = peaks["position"][0]
        for curr in peaks["position"][1:]:
            intervals.append(curr - prev)
            prev = curr
    if val:
        avg_load = round(mean(val), 2)
        min_load = min(val)
        max_load = max(val)
    else:
        avg_load = 0
        min_load = 0
        max_load = 0
    if peaks:
        jump_count = len(peaks["position"])
        if derivatives:
            jump_rate = round(len(peaks["position"]) / len(derivatives), 2)
        else:
            jump_rate = 0
        avg_jump_value = round(mean(peaks["magnitude"]), 2)
        avg_inter_jump_interval = round(mean(intervals), 2)
    else:
        jump_count = 0
        jump_rate = 0
        avg_jump_value = 0
        avg_inter_jump_interval = 0
    return (
        avg_load,
        min_load,
        max_load,
        jump_count,
        jump_rate,
        avg_jump_value,
        avg_inter_jump_interval,
    )


def find_net_jump(signal):
    if not signal:
        return [], {}
    derivative = []
    prev = signal[0]
    for curr in signal[1:]:
        derivative.append(abs(curr - prev))
        prev = curr
    peaks, _ = find_peaks(derivative, threshold=mean(derivative))

    p = {"position": [], "magnitude": []}
    for x in peaks:
        p["position"].append(x)
        p["magnitude"].append(derivative[x])
    return derivative, p


def get_net_stats(val, peaks, derivatives):
    # Calculate average inter jump interval
    intervals = []
    if peaks and peaks["position"]:
        prev = peaks["position"][0]
        for curr in peaks["position"][1:]:
            intervals.append(curr - prev)
            prev = curr
    try:
        sum_load = val[-1] - val[0]
    except IndexError:
        sum_load = 0
    if peaks:
        jump_count = len(peaks["position"])
        if derivatives:
            jump_rate = round(len(peaks["position"]) / len(derivatives), 2)
        else:
            jump_rate = 0
        avg_jump_value = round(mean(peaks["magnitude"]), 2)
        avg_inter_jump_interval = round(mean(intervals), 2)
    else:
        jump_count = 0
        jump_rate = 0
        avg_jump_value = 0
        avg_inter_jump_interval = 0
    return sum_load, jump_count, jump_rate, avg_jump_value, avg_inter_jump_interval


def get_random_users(n=3, stats=True, remove=None):
    # Avoid a mutable default argument; None means no user is excluded
    remove = remove or []
    users = []
    while len(users) < n:
        user = random.randint(7, 27)
        if user not in users and user not in remove:
            users.append(user)
    if stats:
        print("Users: {}".format(sorted(users)))
    return users


def get_lda(data, users, comp_num=2, stats=True, score=False):
    data = data[data.user.isin(users)]
    X = data.iloc[:, 2:].values
    y = data.iloc[:, 0].values.ravel()
    z = data.iloc[:, 1].values.ravel()

    lda = LinearDiscriminantAnalysis(n_components=comp_num)
    X_lda = lda.fit_transform(X, y)
    # Keep the computed value separate from the `score` flag, which it used to overwrite
    ch_score = round(metrics.calinski_harabasz_score(X_lda, y))
    if stats:
        print(lda.explained_variance_ratio_)
        print(ch_score)

    y = y.reshape(len(y), 1)
    z = z.reshape(len(z), 1)
    df = DataFrame(
        [list(y) + list(z) + list(x) for x, y, z in zip(X_lda, y, z)],
        columns=["user", "try", "component_1", "component_2"],
    )
    if score:
        return df, ch_score
    return df


def get_pca(data, users, comp_num=2, stats=True, score=False):
    data = data[data.user.isin(users)]
    X = data.iloc[:, 2:].values
    y = data.iloc[:, 0].values.ravel()
    z = data.iloc[:, 1].values.ravel()

    scaler = StandardScaler()
    scaler.fit(X)
    X = scaler.transform(X)
    pca = PCA(n_components=comp_num)
    X_pca = pca.fit_transform(X)
    # Keep the computed value separate from the `score` flag, which it used to overwrite
    ch_score = round(metrics.calinski_harabasz_score(X_pca, y))
    if stats:
        print(pca.explained_variance_ratio_)
        print(ch_score)

    y = y.reshape(len(y), 1)
    z = z.reshape(len(z), 1)
    df = DataFrame(
        [list(y) + list(z) + list(x) for x, y, z in zip(X_pca, y, z)],
        columns=["user", "try", "component_1", "component_2"],
    )
    if score:
        return df, ch_score
    return df


def calculate_scores(n, datas, epochs=50, verbose=True, norm=False):
    if verbose:
        print("Calculating for {} users.".format(n))
    scores = [[] for _ in range(0, epochs)]
    for i in range(0, epochs):
        users = get_random_users(n=n, stats=False)
        for data in datas:
            _, score = get_lda(data, users, stats=False, score=True)
            scores[i].append(score)
    scores = array(scores).T
    result = []
    max_score = max([mean(x) for x in scores])
    for i in range(0, len(datas)):
        if norm:
            result.append([segment_intervals[i], mean(scores[i]) / max_score, n])
        else:
            result.append([segment_intervals[i], mean(scores[i]), n])
    return DataFrame(result, columns=["time_interval", "score", "users"])


def features_list(index=-1):
    features = [
        "ax_me",
        "ax_sd",
        "ax_mcr",
        "ax_mai",
        "ay_me",
        "ay_sd",
        "ay_mcr",
        "ay_mai",
        "az_me",
        "az_sd",
        "az_mcr",
        "az_mai",
        "a_me",
        "a_sd",
        "a_mcr",
        "a_mai",
        "gx_me",
        "gx_sd",
        "gx_mcr",
        "gy_me",
        "gy_sd",
        "gy_mcr",
        "gz_me",
        "gz_sd",
        "gz_mcr",
        "g_me",
        "g_sd",
        "g_mcr",
        "fa_me",
        "fa_sd",
        "fa_mcr",
        "fb_me",
        "fb_sd",
        "fb_mcr",
        "fc_me",
        "fc_sd",
        "fc_mcr",
        "fd_me",
        "fd_sd",
        "fd_mcr",
        "ca_me",
        "ca_sd",
        "ca_min",
        "ca_max",
        "ca_mcr",
        "cb_me",
        "cb_sd",
        "cb_min",
        "cb_max",
        "cb_mcr",
        "cc_me",
        "cc_sd",
        "cc_min",
        "cc_max",
        "cc_mcr",
        "cd_me",
        "cd_sd",
        "cd_min",
        "cd_max",
        "cd_mcr",
        "c_me",
        "c_sd",
        "c_min",
        "c_max",
        "c_mcr",
        "m_me",
        "m_sd",
        "m_min",
        "m_max",
        "m_jc",
        "m_jr",
        "m_jv",
        "m_iji",
        "ns_me",
        "ns_sd",
        "ns_sum",
        "ns_jc",
        "ns_jr",
        "ns_jv",
        "ns_iji",
        "nr_me",
        "nr_sd",
        "nr_sum",
        "nr_jc",
        "nr_jr",
        "nr_jv",
        "nr_iji",
    ]
    if index == -1:
        return features
    else:
        return features[index]


def show_graphs():
    """
    Not meant to be run directly; an example of how to visualize the results.
    It is kept here so that it stays out of the way.
    """

    # Parameters
    number_of_users = 7
    segment_interval_length = 60

    # Get n random users and segment interval length
    users = get_random_users(n=number_of_users)
    interval = segment_intervals.index(segment_interval_length)

    data = datas[interval]
    data = data[data.user.isin(users)]
    X = data.iloc[:, 2:].values
    y = data.iloc[:, 0].values.ravel()
    z = data.iloc[:, 1].values.ravel()

    lda = LinearDiscriminantAnalysis(n_components=2)
    X_lda = lda.fit_transform(X, y)
    score = round(metrics.calinski_harabasz_score(X_lda, y))
    # Assemble the plotting frame: labels first, then the two LDA components
    df = DataFrame(
        {
            "user": y,
            "try": z,
            "component_1": X_lda[:, 0],
            "component_2": X_lda[:, 1],
        }
    )
    fig = scatter(
        df,
        x="component_1",
        y="component_2",
        color="user",
        color_continuous_scale="Rainbow",
        title="number of users: {}, segment interval: {} seconds, score: {}".format(
            number_of_users, segment_intervals[interval], score
        ),
    )
    fig.show()

    lda = LinearDiscriminantAnalysis(n_components=3)
    # scikit-learn expects 1-D label arrays, so flatten them before fitting
    y = y.ravel()
    z = z.ravel()
    X_lda = lda.fit_transform(X, y)
    score = round(metrics.calinski_harabasz_score(X_lda, y))
    # Assemble the plotting frame: labels first, then the three LDA components
    df = DataFrame(
        {
            "user": y,
            "try": z,
            "component_1": X_lda[:, 0],
            "component_2": X_lda[:, 1],
            "component_3": X_lda[:, 2],
        }
    )
    fig = scatter_3d(
        df,
        x="component_1",
        y="component_2",
        z="component_3",
        color="user",
        color_continuous_scale="Rainbow",
        title="number of users: {}, segment interval: {} seconds, score: {}".format(
            number_of_users, segment_intervals[interval], score
        ),
    )
    fig.show()
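

# Illustration only (not part of the original helpers): a minimal pure-Python sketch of
# the Calinski-Harabasz score shown in the graph titles above, assuming the standard
# definition: (between-cluster dispersion / (k - 1)) / (within-cluster dispersion / (n - k)),
# with k clusters and n points. The notebook itself takes the value from
# metrics.calinski_harabasz_score; the name `example_calinski_harabasz` is hypothetical.
def example_calinski_harabasz(points, labels):
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    # Group the points by their label (here: by user)
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    overall = centroid(points)
    between = 0.0
    within = 0.0
    for rows in clusters.values():
        center = centroid(rows)
        between += len(rows) * sq_dist(center, overall)
        within += sum(sq_dist(row, center) for row in rows)

    n, k = len(points), len(clusters)
    return (between / (k - 1)) / (within / (n - k))


# Usage with made-up points: two tight groups that lie far apart give a high score.
# example_calinski_harabasz([(0, 0), (0, 2), (4, 0), (4, 2)], [0, 0, 1, 1]) returns 8.0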


def get_users_data(data):
    """
    Get the sorted seance ids of each user as a dict of the form {user: [seance, ...]}.
    """
    users = {}
    for user in list(set(data["user"])):
        x = data[data["user"] == user]
        users.update({user: sorted(list(set(x["seance"])))})
    return users


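def get_per_sensor_data(experiment=1, pc_monitor=False):
    """
    Load the segmented csv files of the given experiment (one per segment interval) and
    split their feature columns per sensor. With pc_monitor=True, rows without any
    PC monitor data (cpu, memory, network) are dropped first.
    """
    segment_intervals = [10, 15, 30, 45, 60, 75, 90, 120, 150, 180]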
    sensor_data = []
    com = []
    acc = []
    gyr = []
    frc = []
    cpu = []
    ram = []
    net = []

    for interval in segment_intervals:
        data_name = "jupyter/data/segmented_data_{}_seconds_experiment_{}.csv".format(
            interval, experiment
        )
        data = read_csv(data_name).fillna(0)
        if pc_monitor:
            # Keep only the rows that contain some PC monitor data (cpu, memory, network).
            # features_list()[40:] are exactly the columns from "ca_me" to "nr_iji".
            pc_columns = features_list()[40:]
            data = data[data[pc_columns].mean(axis=1) > 0]
        label = data.iloc[:, 0:2]
        com.append(data)
        acc.append(label.merge(data.iloc[:, 2:18], left_index=True, right_index=True))
        gyr.append(label.merge(data.iloc[:, 18:30], left_index=True, right_index=True))
        frc.append(label.merge(data.iloc[:, 30:42], left_index=True, right_index=True))
        cpu.append(label.merge(data.iloc[:, 42:67], left_index=True, right_index=True))
        ram.append(label.merge(data.iloc[:, 67:75], left_index=True, right_index=True))
        net.append(label.merge(data.iloc[:, 75:89], left_index=True, right_index=True))
    return {
        "all data": com,
        "accelerometer": acc,
        "gyroscope": gyr,
        "force sensors": frc,
        "cpu": cpu,
        "memory": ram,
        "network": net,
    }
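

# Illustration only (not part of the original helpers): a minimal sketch showing where
# the iloc column ranges used in get_per_sensor_data come from. The first two columns
# are the labels (user, seance); after them every sensor occupies as many columns as it
# has features. The name `example_sensor_column_ranges` is hypothetical; the widths in
# the usage comment are counted from the csv header written by this notebook.
def example_sensor_column_ranges(widths, first_column=2):
    ranges = {}
    start = first_column
    for name, width in widths:
        ranges[name] = (start, start + width)  # half-open range, as used by iloc
        start += width
    return ranges


# Usage:
# example_sensor_column_ranges(
#     [("accelerometer", 16), ("gyroscope", 12), ("force sensors", 12),
#      ("cpu", 25), ("memory", 8), ("network", 14)]
# )
# returns {"accelerometer": (2, 18), "gyroscope": (18, 30), "force sensors": (30, 42),
#          "cpu": (42, 67), "memory": (67, 75), "network": (75, 89)}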


def combine_sensor_data(sensors, names):
    """
    Merge the data of two sensors, names[0] and names[1], for every segment interval.
    """
    return {
        "{} + {}".format(names[0], names[1]): [
            x.merge(y.iloc[:, 2:], left_index=True, right_index=True)
            for x, y in zip(sensors[names[0]], sensors[names[1]])
        ]
    }


def center_graphs():
    from IPython.display import display, HTML
    from plotly.offline import init_notebook_mode

    init_notebook_mode(connected=True)
    display(
        HTML("""<style>.output_area {display: flex;justify-content: center;}</style>""")
    )


center_graphs()


def process_pir_data(records, start, end, segment_interval, interval=1):
    """
    Classify every time step (its length is set by the interval parameter) into one of
    three categories:
    - moving (a sensor other than pir_01 is active), encoded as 2
    - sitting (only pir_01 is active), encoded as 1
    - nothing (no sensor activity), encoded as 0
    Return the mean of these values for every segment of segment_interval seconds.
    """
    bins = []
    while start < end:
        bins.append((start, start + timedelta(seconds=interval)))
        start += timedelta(seconds=interval)

    data = []
    active = set()
    activity = 0
    # Iterate through time steps
    for step in bins:
        activity = 0
        # Check for previously active sensors
        if active:
            activity = 2 if max(active) > 1 else 1
        # Iterate through each sensor record in a given time interval
        for record in records.filter(timestamp__range=step).order_by("timestamp"):
            if record.value:
                active.add(int(record.sensor.topic[-1]))
                if int(record.sensor.topic[-1]) > 1:
                    activity = 2
                elif activity < 2:
                    activity = 1
            else:
                try:
                    active.remove(int(record.sensor.topic[-1]))
                except KeyError:
                    # Removing what is already not present
                    pass
        # Add to results
        data.append(activity)

    means = []
    segment_step = int(segment_interval / interval)
    for i in range(0, len(data), segment_step):
        means.append(mean(data[i : i + segment_step]))

    #     # Fill parts where zeros between ones - the subject was not moving enough, but still sitting
    #     prev = data[0]
    #     for i in range(1, len(data)):
    #         if prev == 1 and data[i] == 0:
    #             j = i
    #             while data[j] == 0:
    #                 j += 1
    #                 if j + 1 >= len(data):
    #                     j -= 1
    #                     break
    #             if data[j] == 1:
    #                 for k in range(i, j):
    #                     data[k] = 1
    #         prev = data[i]

    #     # Count number of intervals with specific class
    #     class_interval_count = [0, 0, 0]
    #     lengths = []
    #     prev = data[0]
    #     curr_len = 1
    #     for x in data[1:]:
    #         if x != prev:
    #             class_interval_count[prev] += 1
    #             if curr_len > 0:
    #                 lengths.append(curr_len)
    #                 curr_len = 1
    #         else:
    #             curr_len += 1
    #         prev = x
    #     class_interval_count[data[-1]] += 1
    #     lengths.append(curr_len)

    return means
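

# Illustration only (not part of the original helpers): a minimal sketch of the two
# ideas in process_pir_data, without the database queries. It assumes the set of active
# sensor numbers is already known for every time step, whereas the original derives it
# from the records and carries it over between steps. Both function names are
# hypothetical and the data in the usage comment is made up.
def example_activity_level(active_sensors):
    # 0 = nothing, 1 = sitting (only sensor 1 active), 2 = moving (another sensor active)
    if not active_sensors:
        return 0
    return 2 if max(active_sensors) > 1 else 1


def example_segment_means(levels, segment_interval, interval):
    # As in the original, the number of steps per segment is truncated to an integer,
    # so e.g. segment_interval=15 with interval=10 gives segments of a single step.
    step = int(segment_interval / interval)
    means = []
    for i in range(0, len(levels), step):
        chunk = levels[i : i + step]
        means.append(sum(chunk) / len(chunk))
    return means


# Usage:
# levels = [example_activity_level(s) for s in [set(), {1}, {1, 3}, {2}]]
# levels is [0, 1, 2, 2]
# example_segment_means(levels, segment_interval=20, interval=10) returns [0.5, 2.0]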


def generate_segmented_data_csv(seconds: int, experiment: int = 1):
    """
    Generate a csv file with the features calculated on consecutive segments of the given
    length (in seconds), for all valid seances of the given experiment.
    """
    step = timedelta(seconds=seconds)
    seances = Seance.objects.filter(
        experiment__sequence_number=experiment, valid=True
    ).order_by("start")

    print(
        "Generating segmented data csv file with {} seconds intervals for {} seances.".format(
            seconds, seances.count()
        )
    )

    file_name = "segmented_data_{}_seconds_experiment_{}.csv".format(
        seconds, experiment
    )
    with open(file_name, "w") as csv_data:
        csv_data.write(
            "user,seance,ax_me,ax_sd,ax_mcr,ax_mai,ay_me,ay_sd,ay_mcr,ay_mai,az_me,az_sd,az_mcr,az_mai,a_me,"
            + "a_sd,a_mcr,a_mai,gx_me,gx_sd,gx_mcr,gy_me,gy_sd,gy_mcr,gz_me,gz_sd,gz_mcr,g_me,g_sd,g_mcr,fa_me,"
            + "fa_sd,fa_mcr,fb_me,fb_sd,fb_mcr,fc_me,fc_sd,fc_mcr,fd_me,fd_sd,fd_mcr,ca_me,ca_sd,ca_min,ca_max,"
            + "ca_mcr,cb_me,cb_sd,cb_min,cb_max,cb_mcr,cc_me,cc_sd,cc_min,cc_max,cc_mcr,cd_me,cd_sd,cd_min,cd_max,"
            + "cd_mcr,c_me,c_sd,c_min,c_max,c_mcr,m_me,m_sd,m_min,m_max,m_jc,m_jr,m_jv,m_iji,ns_me,ns_sd,ns_sum,"
            + "ns_jc,ns_jr,ns_jv,ns_iji,nr_me,nr_sd,nr_sum,nr_jc,nr_jr,nr_jv,nr_iji,ir_me\n"
        )
        count = 1
        for seance in seances[:]:
            print(
                "-----------------------------------------------------------------------------"
            )
            print("{} of {}".format(count, seances.count()))
            count += 1
            print(seance)
            start = seance.start
            end = seance.end
            data = (
                list(load_data(seance.id, "accelerometer"))
                + list(load_data(seance.id, "gyroscope"))
                + list(load_data(seance.id, "force"))
                + list(load_data(seance.id, "cpu"))
                + [load_data(seance.id, "ram")]
                + list(load_data(seance.id, "net"))
            )
            pir_data = load_data(seance.id, "pir")
            pir_datas = process_pir_data(
                pir_data, seance.start, seance.end, seconds, interval=10
            )

            i = 0
            while start < end:
                sub_data = []
                for x in data:
                    try:
                        x[0].sensor
                        sub_data.append(
                            x.filter(timestamp__range=(start, start + step))
                        )
                    except IndexError:
                        sub_data.append([])

                # accelerometer
                ax_val, _, _, ax_me, ax_sd = process_signal(sub_data[0])
                ay_val, _, _, ay_me, ay_sd = process_signal(sub_data[1])
                az_val, _, _, az_me, az_sd = process_signal(sub_data[2])
                a_val, a_me, a_sd = join_accelerometer_signals(ax_val, ay_val, az_val)
                ax_mcr = mean_crossing_rate(ax_val, ax_me)
                ay_mcr = mean_crossing_rate(ay_val, ay_me)
                az_mcr = mean_crossing_rate(az_val, az_me)
                a_mcr = mean_crossing_rate(a_val, a_me)
                ax_mai = mean_acceleration_intensity(ax_val)
                ay_mai = mean_acceleration_intensity(ay_val)
                az_mai = mean_acceleration_intensity(az_val)
                a_mai = mean_acceleration_intensity(a_val)

                # gyroscope
                gx_val, _, _, gx_me, gx_sd = process_signal(sub_data[3])
                gy_val, _, _, gy_me, gy_sd = process_signal(sub_data[4])
                gz_val, _, _, gz_me, gz_sd = process_signal(sub_data[5])
                g_val, g_me, g_sd = join_accelerometer_signals(gx_val, gy_val, gz_val)
                gx_mcr = mean_crossing_rate(gx_val, gx_me)
                gy_mcr = mean_crossing_rate(gy_val, gy_me)
                gz_mcr = mean_crossing_rate(gz_val, gz_me)
                g_mcr = mean_crossing_rate(g_val, g_me)

                # force
                fa_val, _, _, fa_me, fa_sd = process_signal(sub_data[6])
                fb_val, _, _, fb_me, fb_sd = process_signal(sub_data[7])
                fc_val, _, _, fc_me, fc_sd = process_signal(sub_data[8])
                fd_val, _, _, fd_me, fd_sd = process_signal(sub_data[9])
                fa_mcr = mean_crossing_rate(fa_val, fa_me)
                fb_mcr = mean_crossing_rate(fb_val, fb_me)
                fc_mcr = mean_crossing_rate(fc_val, fc_me)
                fd_mcr = mean_crossing_rate(fd_val, fd_me)

                # cpu
                ca_val, _, _, ca_me, ca_sd = process_signal(sub_data[10])
                cb_val, _, _, cb_me, cb_sd = process_signal(sub_data[11])
                cc_val, _, _, cc_me, cc_sd = process_signal(sub_data[12])
                cd_val, _, _, cd_me, cd_sd = process_signal(sub_data[13])
                c_val, c_me, c_sd = join_cpu_signals(ca_val, cb_val, cc_val, cd_val)
                ca_min, ca_max, ca_mcr = get_cpu_stats(ca_val)
                cb_min, cb_max, cb_mcr = get_cpu_stats(cb_val)
                cc_min, cc_max, cc_mcr = get_cpu_stats(cc_val)
                cd_min, cd_max, cd_mcr = get_cpu_stats(cd_val)
                c_min, c_max, c_mcr = get_cpu_stats(c_val)

                # ram
                m_val, _, _, m_me, m_sd = process_signal(sub_data[14])
                derivatives, peaks = find_ram_jump(m_val)
                m_me, m_min, m_max, m_jc, m_jr, m_jv, m_iji = get_mem_stats(
                    m_val, peaks, derivatives
                )

                # net
                ns_val, _, _, ns_me, ns_sd = process_signal(sub_data[15])
                nr_val, _, _, nr_me, nr_sd = process_signal(sub_data[16])
                ns_der, ns_pe = find_net_jump(ns_val)
                nr_der, nr_pe = find_net_jump(nr_val)
                ns_sum, ns_jc, ns_jr, ns_jv, ns_iji = get_net_stats(
                    ns_val, ns_pe, ns_der
                )
                nr_sum, nr_jc, nr_jr, nr_jv, nr_iji = get_net_stats(
                    nr_val, nr_pe, nr_der
                )

                # ir sensors
                ir_me = pir_datas[i]

                write_row = ",".join(
                    [
                        str(x)
                        for x in [
                            seance.user.id,
                            seance.id,
                            ax_me,
                            ax_sd,
                            ax_mcr,
                            ax_mai,
                            ay_me,
                            ay_sd,
                            ay_mcr,
                            ay_mai,
                            az_me,
                            az_sd,
                            az_mcr,
                            az_mai,
                            a_me,
                            a_sd,
                            a_mcr,
                            a_mai,
                            gx_me,
                            gx_sd,
                            gx_mcr,
                            gy_me,
                            gy_sd,
                            gy_mcr,
                            gz_me,
                            gz_sd,
                            gz_mcr,
                            g_me,
                            g_sd,
                            g_mcr,
                            fa_me,
                            fa_sd,
                            fa_mcr,
                            fb_me,
                            fb_sd,
                            fb_mcr,
                            fc_me,
                            fc_sd,
                            fc_mcr,
                            fd_me,
                            fd_sd,
                            fd_mcr,
                            ca_me,
                            ca_sd,
                            ca_min,
                            ca_max,
                            ca_mcr,
                            cb_me,
                            cb_sd,
                            cb_min,
                            cb_max,
                            cb_mcr,
                            cc_me,
                            cc_sd,
                            cc_min,
                            cc_max,
                            cc_mcr,
                            cd_me,
                            cd_sd,
                            cd_min,
                            cd_max,
                            cd_mcr,
                            c_me,
                            c_sd,
                            c_min,
                            c_max,
                            c_mcr,
                            m_me,
                            m_sd,
                            m_min,
                            m_max,
                            m_jc,
                            m_jr,
                            m_jv,
                            m_iji,
                            ns_me,
                            ns_sd,
                            ns_sum,
                            ns_jc,
                            ns_jr,
                            ns_jv,
                            ns_iji,
                            nr_me,
                            nr_sd,
                            nr_sum,
                            nr_jc,
                            nr_jr,
                            nr_jv,
                            nr_iji,
                            ir_me,
                        ]
                    ]
                )
                csv_data.write(write_row + "\n")
                start += step
                i += 1
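

# Illustration only (not part of the original helpers): a minimal sketch of how the
# `while start < end` loop above cuts a seance into consecutive windows of `seconds`
# seconds. The name `example_time_windows` is hypothetical and the timestamps in the
# usage comment are made up. Note that the last window may reach past `end`.
def example_time_windows(start, end, seconds):
    from datetime import timedelta  # local import keeps the sketch self-contained

    step = timedelta(seconds=seconds)
    windows = []
    while start < end:
        windows.append((start, start + step))
        start += step
    return windows


# Usage: a seance of 150 seconds, cut into 60 second windows, gives three windows:
# 12:00:00-12:01:00, 12:01:00-12:02:00 and 12:02:00-12:03:00.
# example_time_windows(datetime(2020, 1, 1, 12, 0, 0), datetime(2020, 1, 1, 12, 2, 30), 60)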
                
                
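def generate_segmented_data_pc_monitor(seconds, experiment):
    """
    Same as generate_segmented_data_csv, but every seance is limited to the time span
    covered by its memory (PC monitor) records. Users with a seance that has no such
    records are removed from the results.
    """
    step = timedelta(seconds=seconds)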
    seances = Seance.objects.filter(
        experiment__sequence_number=experiment, valid=True
    ).order_by("start")

    # One empty list per output column, in the same order as the csv header
    columns = ["user", "seance"] + features_list() + ["ir_me"]
    df = {column: [] for column in columns}
    print("Generating segmented data csv file with {} seconds intervals for {} seances.".format(seconds, seances.count()))
    count = 1
    bad_users = []
    for seance in seances:
        print("-----------------------------------------------------------------------------")
        print("{} of {}".format(count, seances.count()))
        count += 1
        print(seance)
        if seance.user.id in bad_users:
            print("Bad user, skipping...")
            continue
        data = (
            list(load_data(seance.id, "accelerometer"))
            + list(load_data(seance.id, "gyroscope"))
            + list(load_data(seance.id, "force"))
            + list(load_data(seance.id, "cpu"))
            + [load_data(seance.id, "ram")]
            + list(load_data(seance.id, "net"))
        )
        pir_data = load_data(seance.id, "pir")
        pir_datas = process_pir_data(
            pir_data, seance.start, seance.end, seconds, interval=10
        )
        try:
            start = data[14].order_by("timestamp")[0].timestamp
            end = data[14].order_by("-timestamp")[0].timestamp
        except IndexError:
            print("BAD USER... REMOVING FROM RESULTS")
            bad_users.append(seance.user.id)
            continue
        i = 0
        while start < end:
            sub_data = []
            for x in data:
                try:
                    x[0].sensor
                    sub_data.append(
                        x.filter(timestamp__range=(start, start + step))
                    )
                except IndexError:
                    sub_data.append([])

            # accelerometer
            ax_val, _, _, ax_me, ax_sd = process_signal(sub_data[0])
            ay_val, _, _, ay_me, ay_sd = process_signal(sub_data[1])
            az_val, _, _, az_me, az_sd = process_signal(sub_data[2])
            a_val, a_me, a_sd = join_accelerometer_signals(ax_val, ay_val, az_val)
            ax_mcr = mean_crossing_rate(ax_val, ax_me)
            ay_mcr = mean_crossing_rate(ay_val, ay_me)
            az_mcr = mean_crossing_rate(az_val, az_me)
            a_mcr = mean_crossing_rate(a_val, a_me)
            ax_mai = mean_acceleration_intensity(ax_val)
            ay_mai = mean_acceleration_intensity(ay_val)
            az_mai = mean_acceleration_intensity(az_val)
            a_mai = mean_acceleration_intensity(a_val)

            # gyroscope
            gx_val, _, _, gx_me, gx_sd = process_signal(sub_data[3])
            gy_val, _, _, gy_me, gy_sd = process_signal(sub_data[4])
            gz_val, _, _, gz_me, gz_sd = process_signal(sub_data[5])
            g_val, g_me, g_sd = join_accelerometer_signals(gx_val, gy_val, gz_val)
            gx_mcr = mean_crossing_rate(gx_val, gx_me)
            gy_mcr = mean_crossing_rate(gy_val, gy_me)
            gz_mcr = mean_crossing_rate(gz_val, gz_me)
            g_mcr = mean_crossing_rate(g_val, g_me)

            # force
            fa_val, _, _, fa_me, fa_sd = process_signal(sub_data[6])
            fb_val, _, _, fb_me, fb_sd = process_signal(sub_data[7])
            fc_val, _, _, fc_me, fc_sd = process_signal(sub_data[8])
            fd_val, _, _, fd_me, fd_sd = process_signal(sub_data[9])
            fa_mcr = mean_crossing_rate(fa_val, fa_me)
            fb_mcr = mean_crossing_rate(fb_val, fb_me)
            fc_mcr = mean_crossing_rate(fc_val, fc_me)
            fd_mcr = mean_crossing_rate(fd_val, fd_me)

            # cpu
            ca_val, _, _, ca_me, ca_sd = process_signal(sub_data[10])
            cb_val, _, _, cb_me, cb_sd = process_signal(sub_data[11])
            cc_val, _, _, cc_me, cc_sd = process_signal(sub_data[12])
            cd_val, _, _, cd_me, cd_sd = process_signal(sub_data[13])
            c_val, c_me, c_sd = join_cpu_signals(ca_val, cb_val, cc_val, cd_val)
            ca_min, ca_max, ca_mcr = get_cpu_stats(ca_val)
            cb_min, cb_max, cb_mcr = get_cpu_stats(cb_val)
            cc_min, cc_max, cc_mcr = get_cpu_stats(cc_val)
            cd_min, cd_max, cd_mcr = get_cpu_stats(cd_val)
            c_min, c_max, c_mcr = get_cpu_stats(c_val)

            # ram
            m_val, _, _, m_me, m_sd = process_signal(sub_data[14])
            derivatives, peaks = find_ram_jump(m_val)
            m_me, m_min, m_max, m_jc, m_jr, m_jv, m_iji = get_mem_stats(
                m_val, peaks, derivatives
            )

            # net
            ns_val, _, _, ns_me, ns_sd = process_signal(sub_data[15])
            nr_val, _, _, nr_me, nr_sd = process_signal(sub_data[16])
            ns_der, ns_pe = find_net_jump(ns_val)
            nr_der, nr_pe = find_net_jump(nr_val)
            ns_sum, ns_jc, ns_jr, ns_jv, ns_iji = get_net_stats(
                ns_val, ns_pe, ns_der
            )
            nr_sum, nr_jc, nr_jr, nr_jv, nr_iji = get_net_stats(
                nr_val, nr_pe, nr_der
            )

            # ir sensors
            ir_me = pir_datas[i]

            df["user"].append(seance.user.id)
            df["seance"].append(seance.id)
            df["ax_me"].append(ax_me)
            df["ax_sd"].append(ax_sd)
            df["ax_mcr"].append(ax_mcr)
            df["ax_mai"].append(ax_mai)
            df["ay_me"].append(ay_me)
            df["ay_sd"].append(ay_sd)
            df["ay_mcr"].append(ay_mcr)
            df["ay_mai"].append(ay_mai)
            df["az_me"].append(az_me)
            df["az_sd"].append(az_sd)
            df["az_mcr"].append(az_mcr)
            df["az_mai"].append(az_mai)
            df["a_me"].append(a_me)
            df["a_sd"].append(a_sd)
            df["a_mcr"].append(a_mcr)
            df["a_mai"].append(a_mai)
            df["gx_me"].append(gx_me)
            df["gx_sd"].append(gx_sd)
            df["gx_mcr"].append(gx_mcr)
            df["gy_me"].append(gy_me)
            df["gy_sd"].append(gy_sd)
            df["gy_mcr"].append(gy_mcr)
            df["gz_me"].append(gz_me)
            df["gz_sd"].append(gz_sd)
            df["gz_mcr"].append(gz_mcr)
            df["g_me"].append(g_me)
            df["g_sd"].append(g_sd)
            df["g_mcr"].append(g_mcr)
            df["fa_me"].append(fa_me)
            df["fa_sd"].append(fa_sd)
            df["fa_mcr"].append(fa_mcr)
            df["fb_me"].append(fb_me)
            df["fb_sd"].append(fb_sd)
            df["fb_mcr"].append(fb_mcr)
            df["fc_me"].append(fc_me)
            df["fc_sd"].append(fc_sd)
            df["fc_mcr"].append(fc_mcr)
            df["fd_me"].append(fd_me)
            df["fd_sd"].append(fd_sd)
            df["fd_mcr"].append(fd_mcr)
            df["ca_me"].append(ca_me)
            df["ca_sd"].append(ca_sd)
            df["ca_min"].append(ca_min)
            df["ca_max"].append(ca_max)
            df["ca_mcr"].append(ca_mcr)
            df["cb_me"].append(cb_me)
            df["cb_sd"].append(cb_sd)
            df["cb_min"].append(cb_min)
            df["cb_max"].append(cb_max)
            df["cb_mcr"].append(cb_mcr)
            df["cc_me"].append(cc_me)
            df["cc_sd"].append(cc_sd)
            df["cc_min"].append(cc_min)
            df["cc_max"].append(cc_max)
            df["cc_mcr"].append(cc_mcr)
            df["cd_me"].append(cd_me)
            df["cd_sd"].append(cd_sd)
            df["cd_min"].append(cd_min)
            df["cd_max"].append(cd_max)
            df["cd_mcr"].append(cd_mcr)
            df["c_me"].append(c_me)
            df["c_sd"].append(c_sd)
            df["c_min"].append(c_min)
            df["c_max"].append(c_max)
            df["c_mcr"].append(c_mcr)
            df["m_me"].append(m_me)
            df["m_sd"].append(m_sd)
            df["m_min"].append(m_min)
            df["m_max"].append(m_max)
            df["m_jc"].append(m_jc)
            df["m_jr"].append(m_jr)
            df["m_jv"].append(m_jv)
            df["m_iji"].append(m_iji)
            df["ns_me"].append(ns_me)
            df["ns_sd"].append(ns_sd)
            df["ns_sum"].append(ns_sum)
            df["ns_jc"].append(ns_jc)
            df["ns_jr"].append(ns_jr)
            df["ns_jv"].append(ns_jv)
            df["ns_iji"].append(ns_iji)
            df["nr_me"].append(nr_me)
            df["nr_sd"].append(nr_sd)
            df["nr_sum"].append(nr_sum)
            df["nr_jc"].append(nr_jc)
            df["nr_jr"].append(nr_jr)
            df["nr_jv"].append(nr_jv)
            df["nr_iji"].append(nr_iji)
            df["ir_me"].append(ir_me)

            start += step
            i += 1

    # Write the dataframe to a csv file
    file_name = "segmented_data_{}_seconds_experiment_{}_pc_monitor.csv".format(
        seconds, experiment
    )
    df = DataFrame(df)
    for x in bad_users:
        df = df[df["user"] != x]
    df.to_csv(file_name, index=False)