# Recognizing hand gestures from EMG signals

Author: Eitan Hemed, PhD

## Table of contents
- [Introduction](#Introduction)
- [Preparation](#Preparation)
    - [Imports](#Imports)
    - [Defining helper functions](#Defining-helper-functions)
    - [Getting the dataset](#Getting-the-dataset)
    - [Load the data](#Load-the-data)
- [Wrangling and tidying](#Wrangling-and-tidying)
    - [Segment data to trials](#Segment-data-to-trials)
    - [General tidying](#General-tidying)
        - [Odd and missing gesture information](#Odd-and-missing-gesture-information)
- [Exploring channel data](#Exploring-channel-data)
- [Feature engineering](#Feature-engineering)
- [Save the data](#Save-the-data)

## Introduction
This notebook serves as a starter template for working with data from the paper **Latent Factors Limiting the Performance of sEMG-Interfaces** [Lobov et al., (2018)](https://www.mdpi.com/1424-8220/18/4/1122).



### The study

In this study, the authors recorded surface electromyography (sEMG) signals from subjects. sEMG is a non-invasive method for recording electrical activity of muscles. The authors recorded the sEMG signals while subjects performed several gestures, and while they were at rest.

The recorded data was then used to train a classifier to recognize the performed gesture from the sEMG signal. The classifier was later used to control the sprite in a PC game.

### This notebook
This notebook pools and tidies the data from the paper, then explores it, and finally performs some basic feature engineering. Use it as a starting point for your own analysis.



## Preparation

### Imports

In [None]:
import re
import tqdm
import os
import glob
import requests
import shutil

import seaborn as sns

import matplotlib.pyplot as plt

import numpy as np
import pandas as pd

np.random.seed(999)

### Defining helper functions


In [None]:
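def root_mean_square(s: np.ndarray) -> float:
    """Return the root-mean-square of a signal."""
    return np.sqrt(np.mean(np.power(s, 2)))

def mean_abs_val(s: np.ndarray) -> float:
    """Return the mean absolute value of a signal."""
    return np.mean(np.abs(s))

def prep_single_df(fname: str) -> pd.DataFrame:
    """Read one raw recording and tag it with its subject and session IDs."""
    _df = pd.read_csv(fname, sep='\t')
    # The numbers in the file path encode the subject, the session, and a timestamp
    data_info = re.findall(r'\d+', fname)
    subject, session = data_info[:2]
    timestamp = data_info[2:]  # Parsed but currently unused
    _df = _df.assign(subject=subject, session=session)
    return _df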
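To make the two feature helpers concrete, here is a small worked example on a made-up signal (the values are hypothetical and not taken from the dataset). It repeats the same formulas used in `root_mean_square` and `mean_abs_val`, so it can be run on its own.

```python
import numpy as np

# Hypothetical toy signal, not taken from the dataset
toy_signal = np.array([3.0, -4.0, 0.0, 0.0])

# Root-mean-square: sqrt(mean(x^2)) = sqrt((9 + 16 + 0 + 0) / 4) = 2.5
toy_rms_value = np.sqrt(np.mean(np.power(toy_signal, 2)))

# Mean absolute value: mean(|x|) = (3 + 4 + 0 + 0) / 4 = 1.75
toy_mav_value = np.mean(np.abs(toy_signal))

print(toy_rms_value, toy_mav_value)
```

Both features ignore the sign of the signal, but the root-mean-square gives more weight to large deflections than the mean absolute value does.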

### Getting the dataset

Download the dataset (a zip archive) into the parent directory.

In [None]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00481/EMG_data_for_gestures-master.zip'
zip_data_path = f"../{url.split('/')[-1]}"

with requests.get(url, stream=True) as r:
    r.raise_for_status()  # Stop early if the download failed
    with open(zip_data_path, 'wb') as f:
        shutil.copyfileobj(r.raw, f)

Extract the dataset into a new folder, and remove the zip file.

In [None]:
# List the contents of the parent directory (works on any operating system)
os.listdir('..')

In [None]:
shutil.unpack_archive(zip_data_path, '..')
if not os.path.exists('../input-data'):
    os.rename(os.path.splitext(zip_data_path)[0], '../input-data')
os.remove(zip_data_path)

In [None]:
# Print the description of the dataset
!cat ../input-data/README.txt

### Load the data

In [None]:
raw_data_file_paths = glob.glob('../input-data/*/*.txt')

In [None]:
df = pd.concat([prep_single_df(fname) for fname in tqdm.tqdm(raw_data_file_paths)],
                ignore_index=True).rename({'class': 'gesture'}, axis=1)
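The subject and session IDs come from the numbers in each file path, as done in `prep_single_df`. The sketch below shows the parsing on a single hypothetical path; the exact file name is an assumption about the naming scheme, used here for illustration only.

```python
import re

# Hypothetical path (assumed layout: <subject folder>/<session>_raw_data_<timestamp>.txt)
toy_fname = '../input-data/01/1_raw_data_13-12_22.03.16.txt'

# Collect every run of digits in the path, in order of appearance
toy_info = re.findall(r'\d+', toy_fname)

# The first two numbers are taken as the subject and the session
toy_subject, toy_session = toy_info[:2]
toy_timestamp = toy_info[2:]

print(toy_subject, toy_session, toy_timestamp)
```

Note that this approach relies on the first two numbers in the path being the subject and the session. If a parent folder contained digits, they would be picked up first, so it is worth checking the parsed IDs after loading.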

## Wrangling and tidying

### Segment data to trials

The data is not segmented into epochs or trials, but we know that participants should perform a gesture, rest for an unmarked inter-trial interval, then perform another gesture.

We partition each session into trials, where each trial is a continuous run of the same gesture label (including the unmarked label). Finding trial onsets is a matter of finding a change in the gesture label between consecutive samples.
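The onset-detection idea can be sketched on a small toy table (hypothetical values, two made-up sessions). Within each session, a sample is an onset if its gesture label differs from the previous one. The first sample of each session has no previous sample, so its difference is missing and it is also counted as an onset.

```python
import pandas as pd

# Hypothetical toy data: two subjects, one session each
toy = pd.DataFrame({
    'subject': ['01'] * 6 + ['02'] * 3,
    'session': ['1'] * 9,
    'gesture': [0, 0, 1, 1, 0, 2, 0, 0, 1],
})

# diff() is NaN on the first row of each group, and NaN != 0 evaluates to True
toy_onset = toy.groupby(['subject', 'session'])['gesture'].diff().ne(0)

print(toy.assign(onset=toy_onset))
```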

-----------------

There is, however, one missing value in the gesture column. If it is not removed, it will result in very odd data down the line.

In [None]:
df['gesture'].isna().sum()

**For educational purposes, skip the next cell (which drops this value) on your first pass, and continue up to the analysis of trial durations. Then come back, re-run the cells from data loading onwards, this time including the removal, and continue.**

In [None]:
df = df.dropna(subset=['gesture'])
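To see why a single missing label matters, consider this toy series (hypothetical values). A missing value breaks the comparison with both of its neighbours, so it creates spurious trial onsets in the middle of what should be one continuous period.

```python
import numpy as np
import pandas as pd

# Hypothetical run of unmarked data with one missing label in the middle
toy_with_nan = pd.Series([0, 0, np.nan, 0, 0])

# Differences involving NaN are NaN, and NaN != 0 evaluates to True
onsets_with_nan = int(toy_with_nan.diff().ne(0).sum())

# After dropping the missing value, only the first sample counts as an onset
onsets_without_nan = int(toy_with_nan.dropna().diff().ne(0).sum())

print(onsets_with_nan, onsets_without_nan)
```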

In [None]:
df = df.sort_values(['subject', 'session', 'time'])

# A trial starts wherever the gesture label changes within a session.
# The first frame of each session also counts, as its diff is NaN.
df['first_frame_on_trial'] = df.groupby(['subject', 'session'])['gesture'].diff().ne(0)

# Keep only the rows that mark the beginning of a trial
onsets = df.loc[df['first_frame_on_trial']]
numof_onsets = onsets.shape[0]

Next, assign trial numbers to each onset, and forward fill the trial numbers (to indicate consecutive frames belonging to the same trial).

In [None]:
df.loc[onsets.index.values, 'trial_num'] = np.concatenate(
    onsets.groupby(['subject', 'session']).apply(lambda s: np.arange(s.shape[0])).values)
# Forward fill the trial numbers
df['trial_num'] = df['trial_num'].ffill().values.astype(int)
# This column is no longer needed
df = df.drop('first_frame_on_trial', axis=1)
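As an alternative sketch (not the approach used in the cell above), the same numbering can be obtained in one step by taking a cumulative sum of the onset flags within each session. The toy data below is hypothetical.

```python
import pandas as pd

# Hypothetical toy data: two subjects, one session each
toy = pd.DataFrame({
    'subject': ['01'] * 5 + ['02'] * 4,
    'session': ['1'] * 9,
    'gesture': [0, 0, 1, 1, 0, 0, 2, 2, 0],
})

# Flag the first frame of every trial
toy_onset = toy.groupby(['subject', 'session'])['gesture'].diff().ne(0)

# Count the onsets seen so far within each session; subtract 1 to start at 0
toy['trial_num'] = toy_onset.astype(int).groupby(
    [toy['subject'], toy['session']]).cumsum() - 1

print(toy)
```

Because every frame inherits the number of onsets seen so far, no separate forward fill is needed in this variant.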

Compute the time elapsed since the start of each trial, to be used later for aggregating the data into 200 ms bins.

In [None]:
df['time_within_trial'] = df.groupby(['subject', 'session', 'trial_num'])['time'].transform(
    lambda s: s - s.min())
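A minimal sketch of the same computation on hypothetical timestamps is shown below. It also includes an equivalent form that uses the built-in `'min'` aggregation instead of a lambda, which tends to be faster on large tables.

```python
import pandas as pd

# Hypothetical timestamps (ms) for two short trials
toy = pd.DataFrame({
    'trial_num': [0, 0, 0, 1, 1],
    'time': [100, 101, 103, 250, 252],
})

# Same logic as above: subtract the first timestamp of each trial
toy['time_within_trial'] = toy.groupby('trial_num')['time'].transform(
    lambda s: s - s.min())

# Equivalent form using the built-in 'min' aggregation
toy_alternative = toy['time'] - toy.groupby('trial_num')['time'].transform('min')

print(toy)
```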

What is the distribution of the duration of trials?

In [None]:
df.groupby(['subject', 'session', 'trial_num'])['time_within_trial'].describe()[['mean', 'max']].hist()

Exploring the duration of unmarked data vs. gestures.

In [None]:
df['marked_data'] = df['gesture'].ne(0)
df.groupby(['marked_data', 'subject', 'session', 'trial_num'])['time_within_trial'].max().reset_index().groupby('marked_data')['time_within_trial'].agg(
    ['mean', 'max', lambda s: s.quantile(0.95), lambda s: s.quantile(0.99)])

It seems that gesture trials last up to 6000 milliseconds or so, but unmarked periods might stretch up to almost 35000 milliseconds. We will truncate every trial at 5000 milliseconds, which mostly trims the unmarked periods and brings their duration in line with that of the marked data.

-----------------

Compare the trial-duration histogram above when the missing value is removed and when it is not.

In [None]:
df = df.loc[df['time_within_trial'] < 5000]

In [None]:
# df.loc[df['time_within_trial'] > 6000].groupby(['subject', 'session', 'gesture', 'trial_num'])['time_within_trial'].max()

We can see now that we have a lot of `unmarked-data` epochs. They usually have longer duration than the marked epochs.

In [None]:
sns.FacetGrid(col='gesture', col_wrap=4, data=df.groupby(['subject', 'session', 'trial_num', 'gesture']
                                                         )['time_within_trial'].max().reset_index()).map(
    sns.histplot, 'time_within_trial', bins=30)

#### What is the sampling frequency of data?

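To estimate the sampling rate, we can divide the duration of each trial (in milliseconds) by the number of samples in it. This gives the average time between samples, where 1 ms per sample corresponds to 1000 Hz.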

In [None]:
trial_duration_data = df.groupby(['subject', 'session', 'trial_num'])[
    'time_within_trial'].agg(['max', 'count']).rename({'count': 'numof_samples'}, axis=1).reset_index()

In [None]:
# Duration (ms) divided by the number of samples gives the mean sampling period in ms
trial_duration_data['sampling_period_ms'] = (trial_duration_data['max'] /
    trial_duration_data['numof_samples']).round(2)
trial_duration_data.head()

In [None]:
g = sns.FacetGrid(trial_duration_data, col='session', col_wrap=3)
g.map(plt.hist, 'sampling_period_ms')

It seems that the sampling period is not constant, but most values are around 1 ms per sample (i.e., 1000 Hz, as expected from the paper).
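The relation between the time per sample and the rate in Hz can be sketched with two hypothetical trials (made-up numbers, not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical trials: same duration (ms), different numbers of samples
toy_trials = pd.DataFrame({'max': [2000, 2000], 'numof_samples': [2000, 400]})

# Mean time between samples, in milliseconds
toy_period_ms = toy_trials['max'] / toy_trials['numof_samples']

# There are 1000 ms in a second, so the rate in Hz is 1000 / period
toy_rate_hz = 1000 / toy_period_ms

print(toy_period_ms.tolist(), toy_rate_hz.tolist())
```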

### General tidying

Subject and session are represented as strings, but they should be integers.

In [None]:
df.dtypes.tail()

In [None]:
df[['subject', 'session', 'gesture']] = df[['subject', 'session', 'gesture']].astype(int).values

#### Odd and missing gesture information

Some subjects have more than 7 different gesture labels, because the extended palm gesture (label 7) was not recorded for most subjects.

In [None]:
df.groupby('subject')['gesture'].nunique().value_counts()

We can either remove these subjects, or remove only the periods with the extra gesture. Either way, here are their subject and session IDs.

In [None]:
more_than_seven_gestures = df.groupby(['subject', 'session'])['gesture'].nunique().reset_index(
    ).rename({'gesture': 'numof_classes'}, axis=1).query(
    'numof_classes > 7')

more_than_seven_gestures

Here we choose to remove these periods from the data.

In [None]:
df = df.loc[df['gesture'] != 7]

We can visualize the allocation of gestures throughout sessions.

In [None]:
g = sns.FacetGrid(df.groupby(['subject', 'session', 'trial_num', 'gesture'])[
                  'time_within_trial'].first().reset_index(), col='session', col_wrap=2, hue='subject')
g.map(sns.lineplot, 'trial_num', 'gesture')

We can see that almost all subjects performed the gestures in the same order (0-1-0-2 and so on), and then repeated all gestures again within that session. This is not ideal, as it introduces the order of the gestures as a confounding factor. For example, gesture identity might be confounded with fatigue for the later gestures in the sequence.

One last thing we can do when tidying the dataset is to further segment each trial into periods of 200 ms. This is the period used in the paper to calculate the features.

In [None]:
df['epoch'] = (df['time_within_trial'] // 200).values

In [None]:
df.loc[(df['subject'] == 1) & (df['session'] == 1)].groupby(['trial_num', 'epoch']).first().reset_index()[['time', 'time_within_trial', 'epoch']]
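The binning relies on floor division: every timestamp is mapped to the index of the 200 ms window it falls into. A short sketch on hypothetical timestamps:

```python
import pandas as pd

# Hypothetical times (ms) since trial onset
toy_times = pd.Series([0, 150, 199, 200, 399, 400, 4999])

# Floor division by the window length gives the window index
toy_epochs = toy_times // 200

print(toy_epochs.tolist())
```

Since trials were truncated at 5000 ms, epoch indices run from 0 to 24 at most.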

## Exploring channel data



We can finally look at the EMG data. It is specified in micro-volts (uV), which is 1-millionth of a volt. We will multiply it by 1e6, for convenience. The data was recorded on 8 channels.

In [None]:
# Use a raw string for the regex, so that \d is not treated as an invalid escape sequence
df[df.filter(regex=r'channel\d$').columns] = (df.filter(regex=r'channel\d$') * 1e6).round(3)

Here we show data from different gestures.

In [None]:
g = sns.FacetGrid(df.loc[(df['gesture'] > 0) & (df['subject'] == 1)].groupby(['subject', 'time_within_trial', 'gesture']
                                                                             )[['channel1', 'channel3', 'channel5']].mean().reset_index(),
                  hue='subject', col='gesture', col_wrap=3)
g.map(sns.lineplot, 'time_within_trial', 'channel1', color='black')
g.map(sns.lineplot, 'time_within_trial', 'channel3', color='red', alpha=0.5)
# Add the legend to first plot
legend_handles, legend_labels = g.axes[0].get_legend_handles_labels()
g.axes[0].legend(legend_handles, ['Channel #1', 'Channel #3'])
# Label the y-axis on the left column and the x-axis on the bottom row
for ax in g.axes[[0, 3]]:
    ax.set_ylabel('Amplitude (uV * 1e6)')
for ax in g.axes[[3, 4, 5]]:
    ax.set_xlabel('Time (ms)')

We can see that it is quite noisy, but at least the scale of the noise is different for different gestures.

## Feature engineering

Common EMG features are root-mean-square and mean-absolute-value, calculated on each channel. Here they are calculated per period (200 ms) within each trial. They could also be calculated over whole trials, by dropping `epoch` from the grouping.

In [None]:
# Prepare the arguments for the aggregation - a dictionary mapping each channel column to the functions to apply
agg_args = {k: [root_mean_square, mean_abs_val] for k in df.filter(regex=r'channel\d$').columns}

aggregated_df = df.groupby(['subject', 'session', 'trial_num', 'epoch'])[
    list(agg_args.keys())].agg(
    agg_args)

In [None]:
aggregated_df.head()

Accessing the MultiIndex

In [None]:
aggregated_df.head().loc[(1, 1, 0), 'channel1']
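Aggregating with a list of functions per column yields two-level column labels (channel, function name). The sketch below reproduces this on a hypothetical one-channel table, and shows one possible way to flatten the labels into plain strings, which some modelling tools may find easier to work with. The helpers `rms` and `mav` are stand-ins defined here so the example runs on its own.

```python
import numpy as np
import pandas as pd

def rms(s):
    """Root-mean-square of a signal."""
    return np.sqrt(np.mean(np.power(s, 2)))

def mav(s):
    """Mean absolute value of a signal."""
    return np.mean(np.abs(s))

# Hypothetical toy data: one channel, two epochs
toy = pd.DataFrame({'epoch': [0, 0, 1, 1],
                    'channel1': [3.0, -4.0, 1.0, -1.0]})

# Two-level columns: ('channel1', 'rms') and ('channel1', 'mav')
toy_features = toy.groupby('epoch').agg({'channel1': [rms, mav]})

# Optional: join the two levels into single string labels
toy_flat = toy_features.copy()
toy_flat.columns = ['_'.join(col) for col in toy_features.columns]

print(toy_flat)
```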

## Save the data

The saved data will be used in later lessons (plotting, modelling, etc.).

The file is quite large, so we will save it as a zipped dataframe (can still be read with `pd.read_csv`).

In [None]:
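# Make sure the target folder exists before writing
os.makedirs('../output/output-data', exist_ok=True)
# The '.zip' extension (without trailing spaces) lets pandas infer the compression
df.to_csv('../output/output-data/emg-data-clean.zip', index=False)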
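As a quick check of the claim that a zipped file can be read back directly, here is a round trip on a small hypothetical table, written to a temporary folder so that it does not touch the real output.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy table
toy = pd.DataFrame({'gesture': [0, 1, 2], 'channel1': [0.5, 1.25, -2.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy-data.zip')
    # Compression is inferred from the '.zip' extension on both write and read
    toy.to_csv(toy_path, index=False)
    toy_back = pd.read_csv(toy_path)

print(toy_back.equals(toy))
```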