*This material is a joint work of TAs and the instructor from IC Lab at KAIST, including Woohyeok Choi, Sangjun Park, Yunjo Han, Soowon Kang, Auk Kim, Inyeob Kim, Minhyung Kim, Hansoo Lee, Uichin Lee, Cheul Y. Park, Eunji Park, and Panyu Zhang. This work is licensed under CC BY-SA 4.0.*

# What is a Feature?
* A *feature* in machine learning is defined as *an individual measurable property or characteristic of a phenomenon being observed*.
* The purpose of feature engineering (or feature extraction) is to choose informative and discriminative features that describe different datasets, patterns, or classes well.
* Features can be numeric values, strings, or categorical values.
* Raw data can be used directly as features; however, such features are highly sensitive to noise and outliers.
* Typically, features are extracted from several *subsets of the entire data*, where each subset corresponds to a point of interest (e.g., a point with a *class label* in classification).



## Feature Extraction on Sensory Data
* Typically, IMU sensors (e.g., an accelerometer, a gyroscope, and a magnetometer) or bio-physiological sensors (e.g., a heart rate tracker, an electrocardiogram) provide their readings as *numerical values* in the *time domain*.
* In other words, each datum consists of a value and its collection time: $\mathbf{X} = x_t$
* In addition, each point of interest also consists of a value (e.g., *class labels* in classification, *continuous values* in regression) and a time: $\mathbf{Y}=y_t$
* For a given $y_t$, feature extraction is conducted on the subset of data collected during a certain amount of time (i.e., a time window), $\lambda$, before $t$: $$X_t = \{x_s \mid s\in[t-\lambda, t)\}$$
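
The window definition above can be sketched on a toy series. This is a minimal illustration, not part of the lab code: `select_window`, `ts`, and `xs` are hypothetical names, and the timestamps are made-up values.

```python
import numpy as np

def select_window(timestamps, values, t, lam):
    """Return the values whose timestamps fall in the half-open interval [t - lam, t)."""
    timestamps = np.asarray(timestamps)
    values = np.asarray(values)
    mask = (timestamps >= t - lam) & (timestamps < t)
    return values[mask]

# Toy data (assumed): one reading every 10 time units
ts = np.array([0, 10, 20, 30, 40, 50])
xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Window of length 30 ending (exclusively) at t = 40
window = select_window(ts, xs, t=40, lam=30)
print(window)
```

The sample collected exactly at $t$ is excluded, which matches the half-open interval in the definition.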



## Distinct vs. Overlapped Time Window
* To split the entire data into windows, we can make each window not overlap its neighboring windows: a *distinct window*. This approach is simple, but it loses information at the window boundaries.

![Distinct (non-overlapping) windows](http://drive.google.com/uc?export=view&id=1BDE_8CbVTYsqX-9MoUf6dfqV2qIQYGdb)

* An alternative approach is to allow each time window to overlap with the next one: a *sliding (or overlapped) window*.
  * The example below uses an overlap of 50% of the window size between consecutive windows.

![Overlapped windows with 50% overlap](http://drive.google.com/uc?export=view&id=1r93X2gPdj0kC-L9RMCietzDsbA54uBax)
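
As a rough sketch of the two windowing schemes, the helper below generates window boundaries for a given overlap ratio. The function name `window_bounds` and the toy numbers are assumptions for illustration; an overlap ratio of 0 gives distinct windows.

```python
import numpy as np

def window_bounds(start, end, win_size, overlap_ratio):
    """Return a list of (win_start, win_end) pairs.

    Consecutive windows are shifted by win_size * (1 - overlap_ratio),
    so overlap_ratio = 0 yields distinct (non-overlapping) windows.
    """
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    step = win_size * (1 - overlap_ratio)
    win_ends = np.arange(start + win_size, end, step)
    return [(float(e - win_size), float(e)) for e in win_ends]

distinct = window_bounds(0, 10, win_size=4, overlap_ratio=0.0)
overlapped = window_bounds(0, 10, win_size=4, overlap_ratio=0.5)
print(distinct)
print(overlapped)
```

With 50% overlap, every sample (except near the two ends of the recording) belongs to two windows, so an event near a boundary of one window sits in the middle of another.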

# Preliminary Setting

## Install Dependencies

To make importing the datasets easier, we install a dataset module that includes the CrowdSignals and KAIST datasets.

In [64]:
%pip install kse801-dataset==0.1.0



## Dataset
* In this lab example, we will use CrowdSignals.io's **accelerometer data** collected from a smartphone.
* The data include x, y, and z readings across different activities.


In [65]:
from kse801 import load_crowdsignal_accel_phone, load_crowdsignal_label

ACCEL = load_crowdsignal_accel_phone()
LABEL = load_crowdsignal_label()

print('Accelerometer:\n- Shape: {}\n- Head: \n{}'.format(ACCEL.shape, ACCEL.head()))
print('\n')
print('Label:\n- Shape: {}\n- Head: \n{}'.format(LABEL.shape, LABEL.head()))

Accelerometer:
- Shape: (1604606, 6)
- Head: 
     sensor_type device_type    timestamps      x      y      z
0  accelerometer  smartphone  1.454957e+12 -0.002  0.104  9.605
1  accelerometer  smartphone  1.454957e+12  0.001  0.123  9.622
2  accelerometer  smartphone  1.454957e+12  0.003  0.129  9.647
3  accelerometer  smartphone  1.454957e+12 -0.002  0.119  9.603
4  accelerometer  smartphone  1.454957e+12 -0.002  0.119  9.603


Label:
- Shape: (25, 7)
- Head: 
      sensor_type device_type          label   label_start  \
0  interval_label  smartphone       On Table  1.454956e+12   
1  interval_label  smartphone        Driving  1.454961e+12   
2  interval_label  smartphone  Washing Hands  1.454958e+12   
3  interval_label  smartphone  Washing Hands  1.454960e+12   
4  interval_label  smartphone       Standing  1.454962e+12   

  label_start_datetime     label_end   label_end_datetime  
0  08/02/2016 10:33:13  1.454957e+12  08/02/2016 10:36:18  
1  08/02/2016 11:44:02  1.454961e+12  08/0

## Preparation
* Each dimension in accelerometer data represents its own direction (x, y, z). In order to understand characteristics regardless of directions, we add a *magnitude* of those readings: $$Mag = \sqrt{(x^2 + y^2 + z^2)}$$

* We will use a simple lambda function for this. If you're not familiar with the lambda function, please read [this tutorial](https://www.w3schools.com/python/python_lambda.asp). Simply put, a lambda function is a small anonymous function that takes any number of arguments, but can only have one expression.
* Syntax:
```python
lambda arguments : expression
```
* Example:
```python
y = lambda a : a + 10 # adds 10 to a
y = lambda a, b : a * b # multiply a and b
y = lambda a, b, c : a + b + c # add three numbers
```




In [66]:
import numpy as np

# The code below assigns a new column (i.e., mag) to a DataFrame and returns the new DataFrame.
# Here, x in the lambda function is the ACCEL DataFrame itself.
# Therefore, the code below is equivalent to ACCEL.assign(mag=np.sqrt(ACCEL['x'] ** 2 + ACCEL['y'] ** 2 + ACCEL['z'] ** 2))
# If a value passed to assign is a callable, it is evaluated on the current DataFrame:

ACCEL_W_MAG = ACCEL.assign(
    mag=lambda x: np.sqrt(x['x'] ** 2 + x['y'] ** 2 + x['z'] ** 2)
)

ACCEL_W_MAG.head()

Unnamed: 0,sensor_type,device_type,timestamps,x,y,z,mag
0,accelerometer,smartphone,1454957000000.0,-0.002,0.104,9.605,9.605563
1,accelerometer,smartphone,1454957000000.0,0.001,0.123,9.622,9.622786
2,accelerometer,smartphone,1454957000000.0,0.003,0.129,9.647,9.647863
3,accelerometer,smartphone,1454957000000.0,-0.002,0.119,9.603,9.603738
4,accelerometer,smartphone,1454957000000.0,-0.002,0.119,9.603,9.603738


* To make analysis easier, we add a label to each accelerometer sample based on our label data.


In [67]:
ACCEL_LABELED = ACCEL_W_MAG.assign(label='Undefined')

# iterate over the rows of the LABEL DataFrame and assign each label to the accelerometer rows inside its time interval
for index, row in LABEL.iterrows():
  label_start = row['label_start']
  label_end = row['label_end']
  label = row['label']
  tmp = ACCEL_LABELED.query("timestamps < @label_end and timestamps > @label_start").index
  ACCEL_LABELED.loc[tmp,'label'] = [label]*len(tmp)

ACCEL_LABELED.head()

Unnamed: 0,sensor_type,device_type,timestamps,x,y,z,mag,label
0,accelerometer,smartphone,1454957000000.0,-0.002,0.104,9.605,9.605563,Undefined
1,accelerometer,smartphone,1454957000000.0,0.001,0.123,9.622,9.622786,Undefined
2,accelerometer,smartphone,1454957000000.0,0.003,0.129,9.647,9.647863,Undefined
3,accelerometer,smartphone,1454957000000.0,-0.002,0.119,9.603,9.603738,Undefined
4,accelerometer,smartphone,1454957000000.0,-0.002,0.119,9.603,9.603738,Undefined
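
To see the effect of the labeling step, here is a small sketch on made-up data, using a boolean mask in place of `query`. The toy frames `accel` and `label` and their values are assumptions; only the column names follow the lab data.

```python
import pandas as pd

# Toy accelerometer samples and label intervals (assumed values)
accel = pd.DataFrame({
    'timestamps': [0, 5, 10, 15, 20, 25],
    'mag': [9.8, 9.9, 10.4, 9.7, 9.8, 9.8],
})
label = pd.DataFrame({
    'label_start': [4, 18],
    'label_end': [12, 30],
    'label': ['Walking', 'Standing'],
})

labeled = accel.assign(label='Undefined')
for row in label.itertuples(index=False):
    # strict inequalities, as in the lab code: samples exactly on a boundary stay unlabeled
    inside = (labeled['timestamps'] > row.label_start) & (labeled['timestamps'] < row.label_end)
    labeled.loc[inside, 'label'] = row.label

print(labeled)
```

Samples that fall outside every interval keep the placeholder label `Undefined`, which is why the first rows of the real data above are still `Undefined`.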


* Because the accelerometer data are too large (so plotting all of them may fail), we use only a subset of the entire data.

In [68]:
# (1) use the query function to select all the rows between 1454958500000 (GMT Feb 8, 2016 7:08:20 PM) and 1454958700000 (GMT Feb 8, 2016 7:11:40 PM) (i.e., 200 seconds, or roughly 3 minutes)
# (2) use a lambda function to re-number timestamps by subtracting the minimum value of the timestamps (i.e., min(x['timestamps'])) from the original values
# Note: Again, x in the lambda is the DataFrame itself (i.e., the DataFrame returned by ACCEL_LABELED.query(...))

ACCEL_SUB = ACCEL_LABELED.query(
    "timestamps >= 1454958500000 and timestamps < 1454958700000"
).assign(
    timestamps=lambda x: x['timestamps'] - min(x['timestamps'])
).sort_values('timestamps')

ACCEL_SUB.head()


Unnamed: 0,sensor_type,device_type,timestamps,x,y,z,mag,label
1187481,accelerometer,smartphone,0.0,-1.79,-9.621,1.089,9.846505,Standing
1187482,accelerometer,smartphone,5.983398,-1.788,-9.579,1.075,9.803561,Standing
1187483,accelerometer,smartphone,10.168701,-1.779,-9.594,1.076,9.816693,Standing
1187484,accelerometer,smartphone,16.220215,-1.778,-9.592,1.092,9.816324,Standing
1187485,accelerometer,smartphone,20.37793,-1.763,-9.615,1.086,9.835435,Standing


## Visualize

In [69]:
import plotly.graph_objs as go
from plotly.subplots import make_subplots

fig = go.Figure()

for var in ['x', 'y', 'z', 'mag']:
  fig.add_trace(
    go.Scatter(x=ACCEL_SUB.loc[:, 'timestamps'], y=ACCEL_SUB.loc[:, var], name=var)
  )

# For label traces, select the rows that either start a new label (or end an existing label)
# x          x.shift(1)  x.shift(-1)
# Standing   NaN        Standing
# Standing   Standing   Undefined
# Undefined  Standing   Undefined
# Undefined  Undefined  Walking
# Walking    Undefined  Walking
# Walking    Walking    NaN

act_start = ACCEL_SUB.loc[lambda x: x['label'] != x.shift(1)['label'], :]
act_end = ACCEL_SUB.loc[lambda x: x['label'] != x.shift(-1)['label'], :]

fig.update_layout(
  shapes=[
    go.layout.Shape(type="rect",
      x0=s.timestamps, x1=e.timestamps, y0=0, y1=1, yref='paper'
        # y-reference is assigned to the plot paper [0,1] see more: https://plot.ly/python/shapes/
    ) for s, e in zip(act_start.itertuples(), act_end.itertuples())
  ],
  annotations=[
    go.layout.Annotation(
      text=s.label, x=s.timestamps + (e.timestamps - s.timestamps) / 2, y=1, yref='paper'
    ) for s, e in zip(act_start.itertuples(), act_end.itertuples())
  ],
  xaxis_title_text='$Timestamps~(ms)$',
  yaxis_title_text='$ Acceleration~(m/s^2) $',
  title='Accelerometer traces'
)

fig.show()

# Handling Numerical Data in the Time Domain



## Candidate Features
* Simple but widely used statistics are: ***mean, max, min, and standard deviation***.
  * In addition, we can consider other candidates, including *kurtosis*, *skewness*, *discrete wavelet transform (DWT)*, and *slope*.
* We extract features by calculating such statistics on each window $X_t$, where $|X_t|$ is the number of samples in the window.
  * Mean: $\mu_t = \frac{1}{|X_t|} \sum_{x \in X_t} x$
  * Max: $M_t = \max_{x \in X_t} x$
  * Min: $m_t = \min_{x \in X_t} x$
  * Std. Dev: $\sigma_t = \sqrt{\frac{1}{|X_t|} \sum_{x \in X_t} (x - \mu_t)^2}$

* (Optional) After extracting features, we transform all features to the same scale, because some machine learning models (e.g., SVM, linear regression) are sensitive to differences in feature scale.
  * `sklearn.preprocessing` provides several scalers (e.g., StandardScaler, MinMaxScaler, RobustScaler).
  * [https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)
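
The four window statistics listed above can be checked on a single toy window. This is a minimal sketch; `time_domain_features` is a hypothetical helper and the input values are made up. The standard deviation here divides by the number of samples (the population form), which is also what `np.std` computes by default.

```python
import numpy as np

def time_domain_features(window):
    """Compute min, max, mean, and population standard deviation of one window."""
    window = np.asarray(window, dtype=float)
    mean = window.mean()
    return {
        'min': window.min(),
        'max': window.max(),
        'mean': mean,
        # population standard deviation: divide by the number of samples
        'std': np.sqrt(np.mean((window - mean) ** 2)),
    }

feats = time_domain_features([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(feats)
```

For this window the mean is 5 and the squared deviations sum to 32 over 8 samples, giving a standard deviation of 2.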

## Practice on CrowdSignal's Data

### Calculate Features
For feature extraction, we consider an overlapped window; see an example below with the overlap of 50% of a window size on consecutive windows.

![Overlapped windows with 50% overlap](http://drive.google.com/uc?export=view&id=1r93X2gPdj0kC-L9RMCietzDsbA54uBax)

In [70]:
import numpy as np
import pandas as pd

WIN_SIZE_IN_MS = 2000
OVERLAP_RATIO = 0.5
START_TIME, END_TIME = ACCEL_SUB.loc[:, 'timestamps'].min(), ACCEL_SUB.loc[:, 'timestamps'].max()

# find the end time of each window by considering the overlapping ratio
WINDOWS = np.arange(START_TIME + WIN_SIZE_IN_MS, END_TIME, WIN_SIZE_IN_MS * (1 - OVERLAP_RATIO))

# Create a DataFrame with column names: timestamps, followed by Min/Max/Mean/Std for each of x, y, z, and mag
FEATURES_TIME = pd.DataFrame()
columns = ['timestamps']
for var in ['x','y','z','mag']:
  columns.append('{}-{}'.format('Min', var))
  columns.append('{}-{}'.format('Max', var))
  columns.append('{}-{}'.format('Mean', var))
  columns.append('{}-{}'.format('Std', var))
FEATURES_TIME = FEATURES_TIME.reindex(columns=columns)

for w in WINDOWS:
  # for a given window, set the start and end time stamps
  win_start, win_end = w - WIN_SIZE_IN_MS, w

  row = [w]
  for var in ['x', 'y', 'z', 'mag']:
    # select the rows that belong to the current window, w
    value = ACCEL_SUB.query(
        'timestamps >= @win_start and timestamps < @win_end'
    )[var]

    # extract basic features
    min_v = np.min(value) # min
    max_v = np.max(value) # max
    mean_v = np.mean(value) # mean
    std_v = np.std(value) # std. dev.

    # other time domain features can be extracted by referring to these docs:
    # https://numpy.org/doc/stable/reference/routines.statistics.html
    # https://docs.scipy.org/doc/scipy/reference/stats.html#summary-statistics

    # append each result (w: current window's end-timestamp, extracted feature) as a new row
    row.append(min_v)
    row.append(max_v)
    row.append(mean_v)
    row.append(std_v)

  # insert row to the dataframe
  FEATURES_TIME.loc[len(FEATURES_TIME)] = row

FEATURES_TIME.head()

Unnamed: 0,timestamps,Min-x,Max-x,Mean-x,Std-x,Min-y,Max-y,Mean-y,Std-y,Min-z,Max-z,Mean-z,Std-z,Min-mag,Max-mag,Mean-mag,Std-mag
0,2000.0,-1.841,-1.659,-1.763983,0.024045,-9.66,-9.568,-9.608985,0.016099,1.015,1.202,1.119505,0.029516,9.79444,9.880019,9.833566,0.01432
1,3000.0,-1.848,-1.659,-1.771111,0.025884,-9.656,-9.561,-9.606545,0.015985,1.015,1.24,1.131218,0.031517,9.798668,9.869561,9.833815,0.013617
2,4000.0,-1.848,-1.724,-1.777358,0.021161,-9.641,-9.561,-9.60424,0.014904,1.055,1.24,1.132723,0.028879,9.792495,9.869561,9.832843,0.013422
3,5000.0,-1.856,-1.727,-1.78543,0.023023,-9.64,-9.564,-9.602785,0.014336,1.055,1.206,1.126686,0.025967,9.792495,9.867519,9.832187,0.013451
4,6000.0,-1.856,-1.714,-1.786752,0.024782,-9.649,-9.563,-9.602604,0.015059,1.05,1.206,1.129431,0.027753,9.79021,9.867519,9.832575,0.013715


### Scaling
* In this example, we scale each feature using *MinMaxScaler*, which maps the values of each feature to the range from 0 to 1.

* For a given input $x$, $x_{minmax} = (x-x_{min}) / (x_{max} - x_{min})$.

* For details about scaling, please read [this blog article](https://sebastianraschka.com/Articles/2014_about_feature_scaling.html).    
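
The min-max formula above can be worked through by hand with NumPy. The sketch below is an illustration of the formula only, not the scikit-learn implementation; `min_max_scale` is a hypothetical helper, and mapping constant columns to 0 is an assumption made here to avoid division by zero.

```python
import numpy as np

def min_max_scale(features):
    """Scale each column of a 2-D array to [0, 1] using (x - min) / (max - min)."""
    features = np.asarray(features, dtype=float)
    col_min = features.min(axis=0)
    col_range = features.max(axis=0) - col_min
    # assumption: a constant column has zero range; divide by 1 so it maps to 0
    col_range[col_range == 0] = 1.0
    return (features - col_min) / col_range

# Two toy features on very different scales
X = np.array([[1.0, 10.0],
              [2.0, 30.0],
              [3.0, 20.0]])
X_scaled = min_max_scale(X)
print(X_scaled)
```

Each column is scaled independently, so after scaling both features span the same range regardless of their original units.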


In [71]:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaled = MinMaxScaler().fit_transform(FEATURES_TIME.drop('timestamps', axis=1).to_numpy())
# fit: compute the minimum and maximum to be used for later scaling.
# transform: scale the features of X according to feature_range.
# fit_transform: fit and transform at the same time.
# both the input and output here are NumPy arrays, so the DataFrame is converted to a NumPy array (by calling to_numpy())

# np.column_stack() takes a sequence of 1-D or 2-D arrays as input and returns a 2-D array with those arrays stacked as columns.
FEATURES_TIME_SCALED = pd.DataFrame(
  np.column_stack([FEATURES_TIME.loc[:, 'timestamps'].values, scaled]),
  columns=FEATURES_TIME.columns
)

FEATURES_TIME_SCALED.head()

Unnamed: 0,timestamps,Min-x,Max-x,Mean-x,Std-x,Min-y,Max-y,Mean-y,Std-y,Min-z,Max-z,Mean-z,Std-z,Min-mag,Max-mag,Mean-mag,Std-mag
0,2000.0,0.939997,0.003379,0.547914,0.000579,0.452924,0.000983,0.067671,0.000294,0.676071,0.002837,0.213967,0.000703,0.999116,0.00085,0.141494,0.000236
1,3000.0,0.939597,0.003379,0.547297,0.000948,0.453106,0.001388,0.067846,0.000275,0.676071,0.005466,0.21507,0.001099,0.999694,0.000192,0.141602,8.4e-05
2,4000.0,0.939597,0.000149,0.546757,0.0,0.453788,0.001388,0.068013,9.5e-05,0.67769,0.005466,0.215211,0.000577,0.99885,0.000192,0.141178,4.2e-05
3,5000.0,0.939139,0.0,0.54606,0.000374,0.453834,0.001215,0.068117,0.0,0.67769,0.003114,0.214643,0.0,0.99885,6.3e-05,0.140892,4.8e-05
4,6000.0,0.939139,0.000646,0.545945,0.000727,0.453424,0.001273,0.06813,0.000121,0.677487,0.003114,0.214901,0.000354,0.998538,6.3e-05,0.141061,0.000105


### Visualize Features on Time Domain

In [72]:
import plotly.graph_objs as go

fig = go.Figure()

# For feature plotting
for var in FEATURES_TIME_SCALED.drop('timestamps', axis=1).columns:
  fig.add_trace(
      go.Scatter(
          x=FEATURES_TIME_SCALED.loc[:, 'timestamps'],
          y=FEATURES_TIME_SCALED.loc[:, var],
          name=var
      )
  )

# For label annotations
act_start = ACCEL_SUB.loc[lambda x: x['label'] != x.shift(1)['label'], :]
act_end = ACCEL_SUB.loc[lambda x: x['label'] != x.shift(-1)['label'], :]

fig.update_layout(
  shapes=[
    go.layout.Shape(
      x0=s.timestamps, x1=e.timestamps, y0=0, y1=1, yref='paper'
    ) for s, e in zip(act_start.itertuples(), act_end.itertuples())
  ],
  annotations=[
    go.layout.Annotation(
      text=s.label, x=s.timestamps + (e.timestamps - s.timestamps) / 2, y=1, yref='paper'
    ) for s, e in zip(act_start.itertuples(), act_end.itertuples())
  ],
  xaxis_title_text='Timestamps (ms)',
  yaxis_title_text='Feature value',
  title='Feature value traces'
)

fig.show()

### (Optional) Visualize Features with Heatmap
* For more intuitive visualization, we can use a heatmap.

In [73]:
import plotly.graph_objs as go

# Here we use go.Heatmap() instead of go.Scatter()
# you need to specify 3-axes for heatmap plotting
fig = go.Figure(
    go.Heatmap(
        x=FEATURES_TIME_SCALED.loc[:, 'timestamps'],
        y=FEATURES_TIME_SCALED.drop('timestamps', axis=1).columns,
        z=FEATURES_TIME_SCALED.drop('timestamps', axis=1).to_numpy().transpose()
    )
)

# for annotation
act_start = ACCEL_SUB.loc[lambda x: x['label'] != x.shift(1)['label'], :]
act_end = ACCEL_SUB.loc[lambda x: x['label'] != x.shift(-1)['label'], :]

fig.update_layout(
  shapes=[
    go.layout.Shape(
      x0=s.timestamps, x1=e.timestamps, y0=0, y1=1, yref='paper'
    ) for s, e in zip(act_start.itertuples(), act_end.itertuples())
  ],
  annotations=[
    go.layout.Annotation(
      text=s.label, x=s.timestamps + (e.timestamps - s.timestamps) / 2, y=1, yref='paper'
    ) for s, e in zip(act_start.itertuples(), act_end.itertuples())
  ],
  xaxis_title_text='Timestamps (ms)',
  yaxis_title_text='Feature value',
  title='Feature value heatmap'
)

fig.show()

# Handling Numerical Data in the Frequency Domain
* One important characteristic of our data is the periodicity of sensor readings, especially in walking activities.
* In other words, if we can extract periodic characteristics, it becomes easier to distinguish activities.
* To this end, we can use features in the *frequency domain*.

## Feature Extraction on Frequency Domain
* Features in the frequency domain are based on the *fast Fourier transform (FFT)*, as introduced in the previous lecture.
* After conducting the FFT, we get the frequency component at the $k$-th frequency bin, as follows:
$$X_k = \sum_{n=0}^{N-1} x_n \cdot e^{-\mathbf{i}\frac{2\pi}{N} kn} = \sum_{n=0}^{N-1} x_n \cdot \bigg[ \cos \big(\frac{2\pi}{N}kn \big) - \mathbf{i} \cdot \sin \big(\frac{2\pi}{N}kn\big) \bigg] = \mathbf{Re}(X_k) + \mathbf{i} \cdot \mathbf{Im}(X_k)$$
* From this, we can calculate a normalized amplitude:
$$ A_k = \frac{1}{N}\sqrt{\mathbf{Re}(X_k)^2 + \mathbf{Im}(X_k)^2}$$
* Or, a power spectrum:
$$ P_k = \frac{1}{N}\bigg(\mathbf{Re}(X_k)^2 + \mathbf{Im}(X_k)^2\bigg)$$
* Typically, feature extraction on the frequency domain is conducted on $A_k$.
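
As a quick numerical check of the amplitude and power formulas above, the sketch below applies them to a toy sine wave. The signal parameters (64 samples at 64 Hz, an 8 Hz sine with amplitude 2) are assumptions chosen so that the sine falls exactly on one frequency bin.

```python
import numpy as np

N = 64                      # number of samples (and FFT bins)
f_s = 64.0                  # assumed sampling frequency in Hz
n = np.arange(N)
x = 2.0 * np.sin(2 * np.pi * 8.0 * n / f_s)   # 8 Hz sine with amplitude 2

X = np.fft.fft(x)
A = np.sqrt(X.real ** 2 + X.imag ** 2) / N    # normalized amplitude A_k
P = (X.real ** 2 + X.imag ** 2) / N           # power spectrum P_k
freqs = np.fft.fftfreq(N, d=1.0 / f_s)        # frequency of each bin in Hz

peak = int(np.argmax(A[:N // 2]))             # search the non-negative frequencies only
print(peak, freqs[peak], A[peak], P[peak])
```

The peak sits at the 8 Hz bin with $A_k = 1$ rather than 2, because the energy of a real sine is split evenly between the positive and negative frequency bins.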

## Candidate Features
* Frequency of the maximum amplitude, where $f_k$ denotes the frequency of the $k$-th bin:
$$f_{k^*}~\text{where}~k^* = \mathrm{argmax}_{k \in [0, N)} A_k$$
* Frequency-weighted average of amplitudes (or power):
$$\frac{\sum_{k=0}^{N-1} A_k \cdot f_k}{\sum_{k=0}^{N-1} A_k} ~\mathrm{or}~ \frac{\sum_{k=0}^{N-1} A_k^2 \cdot f_k}{\sum_{k=0}^{N-1} A_k^2}$$
* Power spectral entropy:
$$- \sum_{k=0}^{N-1} p_k \ln p_k~\text{where}~p_k = \frac{A_k^2}{\sum_{k=0}^{N-1} A_k^2} $$

* Note that the 0-th frequency component (i.e., DC component) is excluded when calculating those features.
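
The candidate features above can be collected in one helper, sketched below under a few assumptions: `frequency_features` is a hypothetical name, no window function is applied, and only the bins from 1 to $N/2 - 1$ are used (dropping the DC component and the mirrored negative frequencies).

```python
import numpy as np

def frequency_features(signal, f_s):
    """Frequency-domain features of a 1-D signal sampled at f_s Hz (DC excluded)."""
    signal = np.asarray(signal, dtype=float)
    n = len(signal)
    spectrum = np.fft.fft(signal)[1:n // 2]           # skip bin 0 (DC)
    freqs = np.fft.fftfreq(n, d=1.0 / f_s)[1:n // 2]
    amp = np.abs(spectrum) / n
    power = amp ** 2
    p = power / power.sum()
    p = p[p > 0]                                      # avoid log(0) for empty bins
    return {
        'freq_max_amp': freqs[np.argmax(amp)],
        'weighted_avg_amp': np.sum(amp * freqs) / np.sum(amp),
        'weighted_avg_power': np.sum(power * freqs) / np.sum(power),
        'power_spectral_entropy': -np.sum(p * np.log(p)),
    }

f_s = 100.0
t = np.arange(100) / f_s                              # 1 s of samples
feats = frequency_features(np.sin(2 * np.pi * 10.0 * t), f_s)
print(feats)
```

For a pure 10 Hz sine, the dominant frequency and both weighted averages are close to 10 Hz, and the entropy is close to 0 because almost all power sits in one bin. A signal with power spread over many bins would give a larger entropy.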

## Toy Practice
* Before handling CrowdSignal's data, we practice feature extraction on a toy example.
* In this example, we assume that:
  * Entire collection time is $10.0s$.
  * Sampling frequency, $f_s = 100 \mathrm{Hz}$
  * Sample signal: sinusoidal waves at $5, 10, 20 \mathrm{Hz}$.
  * Window size in time: $2.56s$ (meaning $256$ samples)
  * Window overlap-ratio: $50\%$
  * Bin size of FFT: $256$


### FFT Analysis

In [74]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

F_s = 100.0
N = 256
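# WIN_SIZE = 2.56  # the setting listed above (256 samples)
# Custom setting used in this run: a 1 s window (100 samples), zero-padded to N = 256 bins by np.fft.fft
WIN_SIZE = 1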
P = 1.0 / F_s
T = np.arange(0, 10, P)
S = np.sin(2 * np.pi * 5.0 * T) + np.sin(2 * np.pi * 10.0 * T) + np.sin(2 * np.pi * 20.0 * T)
DF = pd.DataFrame(np.column_stack([T, S]), columns=['timestamp', 'value'])

fig = px.line(DF, x='timestamp', y='value')
fig.update_layout(height=300)
fig.show()

In [75]:
W = np.arange(WIN_SIZE, 10.0, WIN_SIZE * 0.5)
PLOTS = []

for w in W:
  sub = DF.query('timestamp >= @w-@WIN_SIZE and timestamp < @w')['value']
  fft = np.fft.fft(sub.to_numpy() * np.hamming(len(sub.index)), n=N)[:N//2]
  freq = np.fft.fftfreq(N, P)[0:N//2]

  bar = go.Bar(x=freq, y=np.abs(fft), name='Frequency domain at time {:0.2f}'.format(w))
  PLOTS.append(bar)

fig = make_subplots(rows=len(PLOTS), cols=1)
for idx, plot in enumerate(PLOTS):
  fig.add_trace(plot, row=idx+1, col=1)
fig.update_layout(height=600)
fig.show()

* The results show that the components at 5Hz, 10Hz, and 20Hz have the highest amplitudes, as expected from our setting.
* In addition, windows with different start points show the same frequency results.
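
One way to see why the start point does not matter here: when every component completes a whole number of cycles inside the window, moving the start is a circular shift of the samples, and a circular shift changes only the phase of the FFT, not its magnitude. The sketch below checks this on the toy signal, without the Hamming window used above; the 13-sample shift is an arbitrary choice.

```python
import numpy as np

f_s = 100.0
t = np.arange(100) / f_s                       # one 1 s window
x = (np.sin(2 * np.pi * 5.0 * t)
     + np.sin(2 * np.pi * 10.0 * t)
     + np.sin(2 * np.pi * 20.0 * t))

x_shifted = np.roll(x, 13)                     # circular shift by 13 samples

fft_x = np.fft.fft(x)
fft_shifted = np.fft.fft(x_shifted)

same_magnitude = np.allclose(np.abs(fft_x), np.abs(fft_shifted))
bins = [5, 10, 20]                             # bins of the 5, 10, and 20 Hz components
same_phase = np.allclose(np.angle(fft_x[bins]), np.angle(fft_shifted[bins]))
print(same_magnitude, same_phase)
```

The magnitudes agree while the phases differ, which is one reason the features in this lab are computed from amplitudes rather than from the complex FFT values.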

### Feature Extraction
* Now, we extract features as stated above and visualize how informative such features are.

In [76]:
import numpy as np
import pandas as pd
import plotly.graph_objects as go

FEATURES_FREQ_TOY = pd.DataFrame(columns=['timestamp','Freq. Max. Amp', 'Weighted Avg. Amp', 'Weighted Avg. Energy', 'Power Spec. Entropy'])

for w in W:
  sub = DF.query('timestamp >= @w-@WIN_SIZE and timestamp < @w')['value']

  # Important: Excluding the 0-th frequency components
  fft = np.fft.fft(sub.to_numpy() * np.hamming(len(sub.index)), n=N)[1:N//2]
  freq = np.fft.fftfreq(N, P)[1:N//2]

  amp = np.abs(fft)
  energy = amp ** 2
  amp_norm = amp / N
  energy_norm = energy / N

  freq_max_amp = freq[np.argmax(amp_norm)]
  weight_amp_avg = np.sum(amp * freq) / np.sum(amp)
  weight_energy_avg = np.sum(energy * freq) / np.sum(energy)
  power_entropy = - np.sum((energy / np.sum(energy)) * np.log(energy / np.sum(energy)))

  row = [w]
  row.append(freq_max_amp)
  row.append(weight_amp_avg)
  row.append(weight_energy_avg)
  row.append(power_entropy)

  FEATURES_FREQ_TOY.loc[len(FEATURES_FREQ_TOY)] = row

FEATURES_FREQ_TOY.head()

Unnamed: 0,timestamp,Freq. Max. Amp,Weighted Avg. Amp,Weighted Avg. Energy,Power Spec. Entropy
0,1.0,19.921875,11.967086,11.66372,2.825343
1,1.5,5.078125,11.947589,11.66656,2.823889
2,2.0,19.921875,11.967086,11.66372,2.825343
3,2.5,5.078125,11.947589,11.66656,2.823889
4,3.0,19.921875,11.967086,11.66372,2.825343


In [77]:
import plotly.graph_objects as go

fig = go.Figure()

for var in FEATURES_FREQ_TOY.drop('timestamp', axis=1).columns:
  fig.add_trace(
      go.Scatter(
          x=FEATURES_FREQ_TOY.loc[:, 'timestamp'],
          y=FEATURES_FREQ_TOY.loc[:, var],
          name=var
      )
  )
fig.show()

## Practice on CrowdSignal's Data


### Calculate Features

In [78]:
import numpy as np
import pandas as pd

WIN_SIZE_IN_MS = 5000
OVERLAP_RATIO = 0.5
BIN_SIZE = 256

START_TIME, END_TIME = ACCEL_SUB.loc[:, 'timestamps'].min(), ACCEL_SUB.loc[:, 'timestamps'].max()
DURATION_IN_SEC = (END_TIME - START_TIME) / 1000
F_s = len(ACCEL_SUB.index) / DURATION_IN_SEC
P = 1.0 / F_s

WINDOWS = np.arange(START_TIME + WIN_SIZE_IN_MS, END_TIME, WIN_SIZE_IN_MS * (1 - OVERLAP_RATIO))

FEATURES_FREQ = pd.DataFrame()
columns = ['timestamps']
for var in ['x','y','z','mag']:
  columns.append('{}-{}'.format('Freq. Max. Amp', var))
  columns.append('{}-{}'.format('Weighted Avg. Amp', var))
  columns.append('{}-{}'.format('Weighted Avg. Energy', var))
  columns.append('{}-{}'.format('Power Spec. Entropy', var))
FEATURES_FREQ = FEATURES_FREQ.reindex(columns=columns)

for w in WINDOWS:
  win_start, win_end = w - WIN_SIZE_IN_MS, w

  row = [w]
  for var in ['x', 'y', 'z', 'mag']:
    value = ACCEL_SUB.query('timestamps >= @win_start and timestamps < @win_end')[var]

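    # Note: np.fft.fft(a, n=BIN_SIZE) crops the input to its first BIN_SIZE samples
    # when the window holds more than BIN_SIZE samples, and zero-pads it otherwise.
    fft = np.fft.fft(value * np.hamming(value.shape[0]), n=BIN_SIZE)[1:BIN_SIZE//2]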
    freq = np.fft.fftfreq(BIN_SIZE, P)[1:BIN_SIZE//2]
    amp = np.abs(fft)
    energy = amp ** 2
    amp_norm = amp / BIN_SIZE
    energy_norm = energy / BIN_SIZE

    freq_max_amp = freq[np.argmax(amp_norm)]
    weight_amp_avg = np.sum(amp * freq) / np.sum(amp)
    weight_energy_avg = np.sum(energy * freq) / np.sum(energy)
    power_entropy = - np.sum((energy / np.sum(energy)) * np.log(energy / np.sum(energy)))

    row.append(freq_max_amp)
    row.append(weight_amp_avg)
    row.append(weight_energy_avg)
    row.append(power_entropy)

  FEATURES_FREQ.loc[len(FEATURES_FREQ)] = row

FEATURES_FREQ.head()

Unnamed: 0,timestamps,Freq. Max. Amp-x,Weighted Avg. Amp-x,Weighted Avg. Energy-x,Power Spec. Entropy-x,Freq. Max. Amp-y,Weighted Avg. Amp-y,Weighted Avg. Energy-y,Power Spec. Entropy-y,Freq. Max. Amp-z,Weighted Avg. Amp-z,Weighted Avg. Energy-z,Power Spec. Entropy-z,Freq. Max. Amp-mag,Weighted Avg. Amp-mag,Weighted Avg. Energy-mag,Power Spec. Entropy-mag
0,5000.0,0.787638,20.308242,2.617002,1.481421,0.787638,20.176548,2.616731,1.48956,0.787638,20.640869,2.774198,1.523448,0.787638,20.185116,2.617792,1.489512
1,7500.0,0.787638,19.996352,2.586463,1.47499,0.787638,20.165093,2.611165,1.48754,0.787638,19.903791,2.630683,1.497862,0.787638,20.15406,2.609716,1.487093
2,10000.0,0.787638,20.197207,2.678724,1.514014,0.787638,20.18944,2.612878,1.487,0.787638,19.806268,2.625071,1.504867,0.787638,20.183312,2.614066,1.487827
3,12500.0,0.787638,19.947864,2.570056,1.476727,0.787638,20.147884,2.610725,1.487867,0.787638,20.184546,2.652194,1.497557,0.787638,20.140611,2.609299,1.487497
4,15000.0,0.787638,20.219898,2.621945,1.492026,0.787638,20.111027,2.605362,1.486344,0.787638,20.204695,2.622672,1.478724,0.787638,20.113941,2.60531,1.486253


### Scaling

In [79]:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaled = MinMaxScaler().fit_transform(FEATURES_FREQ.drop('timestamps', axis=1).to_numpy())

FEATURES_FREQ_SCALED = pd.DataFrame(
  np.column_stack([FEATURES_FREQ.loc[:, 'timestamps'].values, scaled]),
  columns=FEATURES_FREQ.columns
)
FEATURES_FREQ_SCALED.head()

Unnamed: 0,timestamps,Freq. Max. Amp-x,Weighted Avg. Amp-x,Weighted Avg. Energy-x,Power Spec. Entropy-x,Freq. Max. Amp-y,Weighted Avg. Amp-y,Weighted Avg. Energy-y,Power Spec. Entropy-y,Freq. Max. Amp-z,Weighted Avg. Amp-z,Weighted Avg. Energy-z,Power Spec. Entropy-z,Freq. Max. Amp-mag,Weighted Avg. Amp-mag,Weighted Avg. Energy-mag,Power Spec. Entropy-mag
0,5000.0,0.0,0.749742,0.118341,0.233306,0.0,0.588782,0.150393,0.222475,0.0,0.644465,0.021019,0.032984,0.0,0.533681,0.136827,0.167802
1,7500.0,0.0,0.72548,0.112562,0.229825,0.0,0.587678,0.149352,0.221221,0.0,0.592435,0.009821,0.020991,0.0,0.530836,0.135,0.166126
2,10000.0,0.0,0.741104,0.130021,0.250949,0.0,0.590024,0.149672,0.220886,0.0,0.585551,0.009383,0.024274,0.0,0.533516,0.135984,0.166635
3,12500.0,0.0,0.721708,0.109457,0.230765,0.0,0.58602,0.14927,0.221424,0.0,0.612253,0.011499,0.020847,0.0,0.529603,0.134905,0.166406
4,15000.0,0.0,0.742869,0.119277,0.239047,0.0,0.582469,0.148267,0.220479,0.0,0.613676,0.009196,0.012019,0.0,0.52716,0.134003,0.165545


### Visualize Features on Time Domain

In [80]:
import plotly.graph_objs as go

fig = go.Figure()

for var in FEATURES_FREQ_SCALED.drop('timestamps', axis=1).columns:
  fig.add_trace(
      go.Scatter(
          x=FEATURES_FREQ_SCALED.loc[:, 'timestamps'],
          y=FEATURES_FREQ_SCALED.loc[:, var],
          name=var
      )
  )

# For label traces:
act_start = ACCEL_SUB.loc[lambda x: x['label'] != x.shift(1)['label'], :]
act_end = ACCEL_SUB.loc[lambda x: x['label'] != x.shift(-1)['label'], :]

fig.update_layout(
  shapes=[
    go.layout.Shape(
      x0=s.timestamps, x1=e.timestamps, y0=0, y1=1, yref='paper'
    ) for s, e in zip(act_start.itertuples(), act_end.itertuples())
  ],
  annotations=[
    go.layout.Annotation(
      text=s.label, x=s.timestamps + (e.timestamps - s.timestamps) / 2, y=1, yref='paper'
    ) for s, e in zip(act_start.itertuples(), act_end.itertuples())
  ],
  xaxis_title_text='Timestamps (ms)',
  yaxis_title_text='Feature value',
  title='Feature value traces'
)

fig.show()

### (Optional) Visualize Features with Heatmap

In [81]:
import plotly.graph_objs as go

fig = go.Figure(
    go.Heatmap(
        x=FEATURES_FREQ_SCALED.loc[:, 'timestamps'],
        y=FEATURES_FREQ_SCALED.drop('timestamps', axis=1).columns,
        z=FEATURES_FREQ_SCALED.drop('timestamps', axis=1).to_numpy().transpose()
    )
)

act_start = ACCEL_SUB.loc[lambda x: x['label'] != x.shift(1)['label'], :]
act_end = ACCEL_SUB.loc[lambda x: x['label'] != x.shift(-1)['label'], :]

fig.update_layout(
  shapes=[
    go.layout.Shape(
      x0=s.timestamps, x1=e.timestamps, y0=0, y1=1, yref='paper'
    ) for s, e in zip(act_start.itertuples(), act_end.itertuples())
  ],
  annotations=[
    go.layout.Annotation(
      text=s.label, x=s.timestamps + (e.timestamps - s.timestamps) / 2, y=1, yref='paper'
    ) for s, e in zip(act_start.itertuples(), act_end.itertuples())
  ],
  xaxis_title_text='Timestamps (ms)',
  yaxis_title_text='Feature value',
  title='Feature value heatmap'
)

fig.show()

# Homework: Feature Extraction on KAIST Dataset


## Get Dataset
* To load the datasets for Q1 and Q2, run the following:


In [96]:
from kse801 import load_kaist_accel, load_kaist_gsr, load_kaist_skin_temperature

accel = load_kaist_accel() # accelerometer data
gsr = load_kaist_gsr() # galvanic skin response

print(accel.head())
print(gsr.head())

       timestamp         Y         X         Z
0  1557710351729  0.537354  0.038330  0.871094
1  1557710351863  0.521484  0.036865  0.875488
2  1557710351999  0.540039  0.036865  0.869629
3  1557710352131  0.535645  0.035889  0.870850
4  1557710352246  0.533447  0.035400  0.871338
       timestamp  Resistance
0  1557710352248         643
1  1557710352460         644
2  1557710352660         649
3  1557710352844         652
4  1557710353070         654


* To get label data, do as follows:

In [97]:
import pandas as pd
from kse801 import load_kaist_activity

activity = load_kaist_activity()

labels = activity.loc[lambda x: x.groupby('timestamp').idxmax(numeric_only=True).loc[:, 'confidence'], :]
labels_start = labels.loc[lambda x: x['type'] != x.shift(1)['type'], :]
labels_end = labels.loc[lambda x: x['type'] != x.shift(-1)['type'], :]

labels = pd.DataFrame([
    (s.timestamp, e.timestamp, s.type)
    for s, e in zip(labels_start.itertuples(), labels_end.itertuples())
], columns=['label_start', 'label_end', 'activity'])

labels.head()

Unnamed: 0,label_start,label_end,activity
0,1557708047750,1557710384597,STILL
1,1557710451577,1557710451577,TILTING
2,1557710505762,1557710571204,UNKNOWN
3,1557710585803,1557710619613,ON_FOOT
4,1557710642457,1557710712178,UNKNOWN


* Because some data are missing, we use only the middle segment.

In [98]:
gsr = gsr.iloc[22321:135565,:]
accel = accel.iloc[33916:206121,:]
accel.head()

Unnamed: 0,timestamp,Y,X,Z
33916,1557723221072,0.87085,-0.18457,-0.433105
33917,1557723221232,0.875732,-0.185547,-0.435547
33918,1557723221359,0.878662,-0.185303,-0.432129
33919,1557723221479,0.866699,-0.184082,-0.422607
33920,1557723221594,0.864502,-0.185059,-0.437256


## **Q1**

Extract features in the time domain and compare the feature values of each activity. (1.5 pt)
* For the *accelerometer* data of the KAIST dataset, conduct feature extraction in the time domain (0.8pt).
  * Use a window size of **60sec** and an overlap ratio of **0.5**.
  * Use **MinMaxScaler** for scaling.

* Add activity labels to the extracted features (0.5pt).
  * The activity label of each feature row is the **most frequently occurring label** within the window of that row.
  * HINT: Use the statistics library in Python [link](https://docs.python.org/ko/3.7/library/statistics.html)

* Print the statistics (`.describe()`) of the features with **ON_BICYCLE** labels. (0.2pt)

**The result should be printed as shown below. The statistics may differ depending on how the most frequent label is chosen, because a window can contain more than one most frequent label with the same frequency.**

![](https://drive.google.com/uc?id=1leIM6TDDxe6TE7uz0ZxH5kx6Ec-nKePb)
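
The tie case mentioned above can be inspected directly. The following is a minimal sketch on made-up labels (the list `window_labels` is hypothetical and not taken from the dataset): `statistics.multimode` (Python 3.8+) returns every label that shares the highest count, while `statistics.mode` returns only the first one it encounters.

```python
from statistics import mode, multimode

# Hypothetical labels observed inside one window (illustrative data only)
window_labels = ['STILL', 'ON_FOOT', 'STILL', 'ON_FOOT', 'TILTING']

# All labels tied for the highest count, in order of first appearance
tied = multimode(window_labels)

# mode() resolves the tie by returning the first of them (Python 3.8+)
chosen = mode(window_labels)

print(tied, chosen)
```

In Python 3.7 and earlier, `mode` raises `StatisticsError` on such a tie, so the tie-breaking rule has to be chosen explicitly there.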

In [99]:
# Write your code here
# Define window size and overlap ratio
WIN_SIZE_IN_MS = 60000  # 60 seconds
OVERLAP_RATIO = 0.5

START_TIME, END_TIME = accel.loc[:, 'timestamp'].min(), accel.loc[:, 'timestamp'].max()

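# Generate window end times; consecutive windows are shifted by
# WIN_SIZE_IN_MS * (1 - OVERLAP_RATIO) ms, so neighbors overlap by half a window
WINDOWS = np.arange(START_TIME + WIN_SIZE_IN_MS, END_TIME, WIN_SIZE_IN_MS * (1 - OVERLAP_RATIO))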

# Initialize DataFrame for features
FEATURES_TIME = pd.DataFrame()
columns = ['timestamp']
for var in ['X', 'Y', 'Z' ]:
    columns.extend([f'Min-{var}', f'Max-{var}', f'Mean-{var}', f'Std-{var}'])

FEATURES_TIME = FEATURES_TIME.reindex(columns=columns)

# Extract features for each window; w is the (exclusive) end time of the window
for w in WINDOWS:
    win_start, win_end = w - WIN_SIZE_IN_MS, w
    # Select the samples in [win_start, win_end) once and reuse them for every axis
    window = accel.query('timestamp >= @win_start and timestamp < @win_end')
    row = [w]
    for var in ['X', 'Y', 'Z']:
        value = window[var]

        min_v = np.min(value)
        max_v = np.max(value)
        mean_v = np.mean(value)
        std_v = np.std(value)  # population standard deviation (ddof=0)

        row.extend([min_v, max_v, mean_v, std_v])

    FEATURES_TIME.loc[len(FEATURES_TIME)] = row

FEATURES_TIME.head()

Unnamed: 0,timestamp,Min-X,Max-X,Mean-X,Std-X,Min-Y,Max-Y,Mean-Y,Std-Y,Min-Z,Max-Z,Mean-Z,Std-Z
0,1557723000000.0,-1.144287,1.118652,-0.465611,0.384077,-0.844971,1.580322,0.525473,0.40838,-0.933594,0.97876,0.196729,0.454356
1,1557723000000.0,-1.157715,-0.0979,-0.698913,0.182196,-0.844971,1.596436,0.242297,0.424804,-0.752686,0.97876,0.409253,0.355806
2,1557723000000.0,-1.274658,-0.116455,-0.768235,0.158409,-0.779053,1.596436,0.268706,0.395077,-0.752686,0.908447,0.329625,0.315595
3,1557723000000.0,-1.425781,0.258301,-0.71327,0.24891,-0.720459,1.275635,0.459545,0.334082,-0.909424,0.811035,0.10825,0.349683
4,1557723000000.0,-1.425781,0.356445,-0.40254,0.443243,-1.433838,1.967041,0.536601,0.454668,-0.909424,0.811035,-0.12408,0.408436
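
The window arithmetic used above can be checked on a toy time range. This is a small sketch under assumed values (`start_time` and `end_time` are made up for illustration); it only shows how the window end times and the half-window overlap arise from the window size and overlap ratio.

```python
import numpy as np

WIN_SIZE_IN_MS = 60000
OVERLAP_RATIO = 0.5

# Toy time range (assumed for illustration): five minutes starting at t = 0 ms
start_time, end_time = 0, 300000

# Window end times, spaced WIN_SIZE_IN_MS * (1 - OVERLAP_RATIO) apart
step = WIN_SIZE_IN_MS * (1 - OVERLAP_RATIO)
window_ends = np.arange(start_time + WIN_SIZE_IN_MS, end_time, step)

# Each window covers [end - WIN_SIZE_IN_MS, end)
windows = [(int(e - WIN_SIZE_IN_MS), int(e)) for e in window_ends]
print(windows[:3])
```

Each window starts 30 sec after the previous one, so every sample away from the edges of the range falls into two windows.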


In [100]:
# scaling
scaled = MinMaxScaler().fit_transform(FEATURES_TIME.drop(['timestamp'], axis=1).to_numpy())

FEATURES_TIME_SCALED = pd.DataFrame(
    np.column_stack([FEATURES_TIME['timestamp'], scaled]),
    columns=FEATURES_TIME.columns
)

FEATURES_TIME_SCALED.head()

Unnamed: 0,timestamp,Min-X,Max-X,Mean-X,Std-X,Min-Y,Max-Y,Mean-Y,Std-Y,Min-Z,Max-Z,Mean-Z,Std-Z
0,1557723000000.0,0.64111,0.228426,0.238964,0.49473,0.63421,0.186745,0.58128,0.711881,0.562631,0.21958,0.50333,0.653894
1,1557723000000.0,0.638518,0.086553,0.121146,0.232923,0.63421,0.188883,0.304427,0.740736,0.60363,0.21958,0.632048,0.511561
2,1557723000000.0,0.615946,0.084389,0.086138,0.202075,0.648521,0.188883,0.330246,0.688509,0.60363,0.208462,0.58382,0.453485
3,1557723000000.0,0.586777,0.128093,0.113896,0.31944,0.661242,0.146319,0.516824,0.58135,0.568109,0.193059,0.449742,0.502718
4,1557723000000.0,0.586777,0.139538,0.270815,0.571458,0.506361,0.238055,0.592159,0.793203,0.568109,0.193059,0.309028,0.587573
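
With its default `feature_range=(0, 1)`, `MinMaxScaler` rescales each column independently as $x' = (x - x_{min}) / (x_{max} - x_{min})$. The sketch below reproduces that computation in plain NumPy on a toy matrix (the values are assumed for illustration). It does not handle constant columns, where the denominator would be zero.

```python
import numpy as np

# Toy feature matrix (assumed values): rows are windows, columns are features
features = np.array([[1.0, 10.0],
                     [2.0, 30.0],
                     [4.0, 20.0]])

col_min = features.min(axis=0)
col_max = features.max(axis=0)

# Column-wise min-max scaling to [0, 1]
scaled = (features - col_min) / (col_max - col_min)
print(scaled)
```

Because the minimum and maximum are taken over all windows, a subset such as the rows of a single activity does not necessarily span the full range from 0 to 1.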


In [101]:
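from statistics import mode

# Label each window with the activity that occurs most often among the label
# segments overlapping it. On ties, statistics.mode returns the first mode
# encountered (Python 3.8+); earlier versions raise StatisticsError.
for idx, feature_row in FEATURES_TIME_SCALED.iterrows():
    win_start, win_end = feature_row['timestamp'] - WIN_SIZE_IN_MS, feature_row['timestamp']

    labels_in_window = labels.query('label_start <= @win_end and label_end > @win_start')['activity']

    if labels_in_window.empty:
        FEATURES_TIME_SCALED.at[idx, 'activity'] = 'Undefined'
    else:
        FEATURES_TIME_SCALED.at[idx, 'activity'] = mode(labels_in_window)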

FEATURES_TIME_SCALED.head()

Unnamed: 0,timestamp,Min-X,Max-X,Mean-X,Std-X,Min-Y,Max-Y,Mean-Y,Std-Y,Min-Z,Max-Z,Mean-Z,Std-Z,activity
0,1557723000000.0,0.64111,0.228426,0.238964,0.49473,0.63421,0.186745,0.58128,0.711881,0.562631,0.21958,0.50333,0.653894,STILL
1,1557723000000.0,0.638518,0.086553,0.121146,0.232923,0.63421,0.188883,0.304427,0.740736,0.60363,0.21958,0.632048,0.511561,STILL
2,1557723000000.0,0.615946,0.084389,0.086138,0.202075,0.648521,0.188883,0.330246,0.688509,0.60363,0.208462,0.58382,0.453485,STILL
3,1557723000000.0,0.586777,0.128093,0.113896,0.31944,0.661242,0.146319,0.516824,0.58135,0.568109,0.193059,0.449742,0.502718,STILL
4,1557723000000.0,0.586777,0.139538,0.270815,0.571458,0.506361,0.238055,0.592159,0.793203,0.568109,0.193059,0.309028,0.587573,STILL
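
The labeling loop above counts each label segment that intersects a window once, regardless of how long the segment lasts. One possible alternative, shown here only as a sketch and not as the expected solution, is to weight each label by the time it overlaps the window. The helper `dominant_label` and the segments below are hypothetical.

```python
from collections import defaultdict

def dominant_label(segments, win_start, win_end):
    """Return the label with the longest total overlap with [win_start, win_end).

    segments: iterable of (label_start, label_end, activity) tuples.
    Returns None when no segment overlaps the window.
    """
    overlap_by_label = defaultdict(int)
    for label_start, label_end, activity in segments:
        overlap = min(label_end, win_end) - max(label_start, win_start)
        if overlap > 0:
            overlap_by_label[activity] += overlap
    if not overlap_by_label:
        return None
    return max(overlap_by_label, key=overlap_by_label.get)

# Hypothetical segments (ms): two short ON_FOOT bursts and one long STILL period
segments = [(0, 5000, 'ON_FOOT'), (5000, 50000, 'STILL'), (50000, 60000, 'ON_FOOT')]
print(dominant_label(segments, 0, 60000))
```

Counting segments would pick ON_FOOT here (two segments against one), whereas weighting by duration picks STILL. Zero-length segments, where the start and end timestamps coincide, contribute nothing under this scheme.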


In [102]:
# bicycle labels
on_bicycle_features = FEATURES_TIME_SCALED.query("activity == 'ON_BICYCLE'")
on_bicycle_features.describe()


Unnamed: 0,timestamp,Min-X,Max-X,Mean-X,Std-X,Min-Y,Max-Y,Mean-Y,Std-Y,Min-Z,Max-Z,Mean-Z,Std-Z
count,28.0,28.0,28.0,28.0,28.0,28.0,28.0,28.0,28.0,28.0,28.0,28.0,28.0
mean,1557734000000.0,0.476147,0.488597,0.752761,0.657318,0.513302,0.270375,0.313115,0.569536,0.600537,0.518876,0.793093,0.495439
std,6574258.0,0.201338,0.204984,0.051605,0.200031,0.194388,0.163121,0.135844,0.148124,0.174599,0.200955,0.141807,0.201023
min,1557724000000.0,0.0,0.288472,0.631742,0.153943,0.0,0.099284,0.157204,0.251467,0.179927,0.247954,0.360877,0.177148
25%,1557731000000.0,0.350125,0.338183,0.734035,0.527818,0.466713,0.166775,0.210341,0.480076,0.512905,0.365069,0.782895,0.338208
50%,1557738000000.0,0.485038,0.420166,0.759325,0.644567,0.577997,0.221308,0.273546,0.561004,0.661005,0.46736,0.854443,0.47076
75%,1557738000000.0,0.578531,0.505239,0.779297,0.809492,0.639099,0.326301,0.352835,0.666265,0.756722,0.612483,0.865542,0.644741
max,1557741000000.0,0.924697,1.0,0.892323,0.977725,0.73545,0.707298,0.594491,0.828361,0.794069,1.0,0.962449,0.929113


## **Q2**

Extract frequency-domain features and compare the feature values across activities. (1.5 pt)
* For the *galvanic skin response* data of the KAIST dataset, extract frequency-domain features (0.8 pt).
  * Use a window size of **60 sec**, an overlap ratio of **0.5**, and a bin size of **512**.
  * Use **MinMaxScaler** for scaling.

* Add activity labels to the extracted features (0.5 pt).
  * The activity label for each feature row is the **most frequently occurring label** within that row's window.
  * HINT: Use the `statistics` library in Python ([link](https://docs.python.org/ko/3.7/library/statistics.html)).

* Print the statistics (`.describe()`) of the features with **STILL** labels (0.2 pt).

**The result should be printed as shown below. The statistics may differ depending on how the most frequent label is chosen, because several labels can be tied for the highest frequency within a window.**

![](https://drive.google.com/uc?id=1qK_EPs-SAh432H7GOHSUzPK37eiqpMQJ)

In [103]:
gsr.head()

Unnamed: 0,timestamp,Resistance
22321,1557723221233,2571
22322,1557723221515,2002
22323,1557723221712,2000
22324,1557723221906,1987
22325,1557723222113,1976


In [104]:
# Write your code here
# Define window size and overlap ratio
WIN_SIZE_IN_MS = 60000  # 60 seconds
OVERLAP_RATIO = 0.5
BIN_SIZE = 512

START_TIME, END_TIME = gsr['timestamp'].min(), gsr['timestamp'].max()

# Generate windows
WINDOWS = np.arange(START_TIME + WIN_SIZE_IN_MS, END_TIME, WIN_SIZE_IN_MS * (1 - OVERLAP_RATIO))

# Initialize DataFrame for features
FEATURES_TIME = pd.DataFrame()
columns = ['timestamp']
columns.append("Freq. Max. Amp-resistance")
columns.append("Weighted Avg. Amp-resistance")
columns.append("Weighted Avg. Energy-resistance")
columns.append("Power spec. Entropy-resistance")

FEATURES_TIME = FEATURES_TIME.reindex(columns=columns)

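for w in WINDOWS:
    win_start, win_end = w - WIN_SIZE_IN_MS, w
    row = [w]

    value = gsr.query('timestamp >= @win_start and timestamp < @win_end')['Resistance']

    # Hamming-windowed FFT, zero-padded (or truncated) to BIN_SIZE points;
    # keep the positive-frequency bins and drop the DC component
    fft = np.fft.fft(value * np.hamming(value.shape[0]), n=BIN_SIZE)[1:BIN_SIZE//2]
    # P is the sample spacing in seconds passed to fftfreq; it is not defined
    # in this cell and is assumed to come from an earlier cell of the notebook
    freq = np.fft.fftfreq(BIN_SIZE, P)[1:BIN_SIZE//2]
    amp = np.abs(fft)
    energy = amp ** 2
    amp_norm = amp / BIN_SIZE
    energy_norm = energy / BIN_SIZE

    # Frequency of the bin with the largest amplitude
    freq_max_amp = freq[np.argmax(amp_norm)]
    # Amplitude-weighted and energy-weighted average frequency
    weight_amp_avg = np.sum(amp * freq) / np.sum(amp)
    weight_energy_avg = np.sum(energy * freq) / np.sum(energy)
    # Shannon entropy of the normalized power spectrum
    power_entropy = - np.sum((energy / np.sum(energy)) * np.log(energy / np.sum(energy)))

    row.extend([freq_max_amp, weight_amp_avg, weight_energy_avg, power_entropy])

    FEATURES_TIME.loc[len(FEATURES_TIME)] = row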


FEATURES_TIME.head()

Unnamed: 0,timestamp,Freq. Max. Amp-resistance,Weighted Avg. Amp-resistance,Weighted Avg. Energy-resistance,Power spec. Entropy-resistance
0,1557723000000.0,0.393819,48.406322,40.171153,5.011778
1,1557723000000.0,0.393819,49.716867,48.79289,5.528686
2,1557723000000.0,0.393819,15.852445,1.012442,0.724258
3,1557723000000.0,0.393819,12.886086,0.545325,0.45976
4,1557723000000.0,0.393819,27.425809,1.289763,0.645686
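
To see what these spectral features measure, it can help to compute them on a signal whose frequency content is known. The sketch below uses a synthetic 1 Hz sine sampled at an assumed rate of 8 Hz (both values are illustrative and unrelated to the GSR data) and follows the same steps as the cell above.

```python
import numpy as np

fs = 8.0        # assumed sampling rate in Hz (illustrative)
n_bins = 512    # FFT size
t = np.arange(0, 60, 1 / fs)              # 60 s of samples (480 points)
signal = np.sin(2 * np.pi * 1.0 * t)      # synthetic 1 Hz sine

# Hamming-windowed FFT, zero-padded to n_bins; keep positive frequencies, drop DC
spectrum = np.fft.fft(signal * np.hamming(signal.shape[0]), n=n_bins)[1:n_bins // 2]
freq = np.fft.fftfreq(n_bins, 1 / fs)[1:n_bins // 2]

amp = np.abs(spectrum)
energy = amp ** 2

# Frequency of the bin with the largest amplitude
dominant_freq = freq[np.argmax(amp)]

# Energy-weighted average frequency
energy_centroid = np.sum(energy * freq) / np.sum(energy)

# Shannon entropy of the normalized power spectrum (zero bins are skipped)
p = energy / np.sum(energy)
p = p[p > 0]
entropy = -np.sum(p * np.log(p))

print(dominant_freq, energy_centroid, entropy)
```

For a single sine, the dominant frequency and the energy-weighted average both land near 1 Hz, and the entropy stays far below its upper bound of `np.log(len(freq))`, which a flat spectrum would reach.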


In [105]:
# scaling
scaled = MinMaxScaler().fit_transform(FEATURES_TIME.drop(['timestamp'], axis=1).to_numpy())

FEATURES_TIME_SCALED = pd.DataFrame(
    np.column_stack([FEATURES_TIME['timestamp'], scaled]),
    columns=FEATURES_TIME.columns
)

FEATURES_TIME_SCALED.head()

Unnamed: 0,timestamp,Freq. Max. Amp-resistance,Weighted Avg. Amp-resistance,Weighted Avg. Energy-resistance,Power spec. Entropy-resistance
0,1557723000000.0,0.0,0.954151,0.793114,0.897587
1,1557723000000.0,0.0,0.982821,0.965277,0.997578
2,1557723000000.0,0.0,0.242006,0.011177,0.068205
3,1557723000000.0,0.0,0.177114,0.001849,0.01704
4,1557723000000.0,0.0,0.495184,0.016715,0.053006


In [92]:
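from statistics import mode

# Label each window with the activity that occurs most often among the label
# segments overlapping it. On ties, statistics.mode returns the first mode
# encountered (Python 3.8+); earlier versions raise StatisticsError.
for idx, feature_row in FEATURES_TIME_SCALED.iterrows():
    win_start, win_end = feature_row['timestamp'] - WIN_SIZE_IN_MS, feature_row['timestamp']

    labels_in_window = labels.query('label_start <= @win_end and label_end > @win_start')['activity']

    if labels_in_window.empty:
        FEATURES_TIME_SCALED.at[idx, 'activity'] = 'Undefined'
    else:
        FEATURES_TIME_SCALED.at[idx, 'activity'] = mode(labels_in_window)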

FEATURES_TIME_SCALED.head()

Unnamed: 0,timestamp,Freq. Max. Amp-resistance,Weighted Avg. Amp-resistance,Weighted Avg. Energy-resistance,Power spec. Entropy-resistance,activity
0,1557723000000.0,0.000107,0.22662,0.207172,0.897587,STILL
1,1557723000000.0,0.000107,0.233429,0.252143,0.997578,STILL
2,1557723000000.0,0.000107,0.057479,0.00292,0.068205,STILL
3,1557723000000.0,0.000107,0.042066,0.000483,0.01704,STILL
4,1557723000000.0,0.000161,0.118084,0.004388,0.053006,STILL


In [93]:
# still labels
still_features = FEATURES_TIME_SCALED.query("activity == 'STILL'")
still_features.describe()

Unnamed: 0,timestamp,Freq. Max. Amp-resistance,Weighted Avg. Amp-resistance,Weighted Avg. Energy-resistance,Power spec. Entropy-resistance
count,207.0,207.0,207.0,207.0,207.0
mean,1557732000000.0,0.013362,0.08419,0.050103,0.340919
std,4545804.0,0.085428,0.080333,0.073118,0.386067
min,1557723000000.0,0.000107,0.0,0.0,0.00359
25%,1557728000000.0,0.000107,0.005847,5.6e-05,0.010403
50%,1557732000000.0,0.000107,0.058929,0.001254,0.050933
75%,1557735000000.0,0.000161,0.149912,0.089683,0.786335
max,1557740000000.0,1.0,0.342056,0.261213,0.999836
