
## Use case Intro: EMG signal classification

#### Imagine you are a Decision Scientist at Boston Dynamics (a robotics company)

- Your team is building a **robotic arm that can be controlled by the electrical signals of the user's muscles**.
- These signals are recorded through **EMG**.

#### Problem Statement:
- Your task is to classify these EMG signals into 20 different physical actions.
- The predicted action will then be used to control the robotic arm.

#### What is EMG (electromyography) ?
  - A technique for studying the electrical signals produced by muscular movement.

#### Dataset
- You have a dataset of EMG signals from 4 subjects/people.

#### How was the data collected ?
  - Each subject was asked to perform specific physical actions.
  - The signals produced by that movement were recorded over time.
  - 8 channels were used to record the signals.
  - Each channel corresponds to a muscle\
    e.g. the right-hand bicep.
  - Sampling frequency : 10 samples per millisecond (10 $ms^{-1}$)

Source: https://archive.ics.uci.edu/ml/datasets/EMG+Physical+Action+Data+Set

Now, let's import some libraries first.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import sklearn

#### Extracting data

In [None]:
!gdown 1h86M8si2YT-aI4Zec1MeMP_mPYsLPy5F

Downloading...
From: https://drive.google.com/uc?id=1h86M8si2YT-aI4Zec1MeMP_mPYsLPy5F
To: /content/emg.rar
100% 18.6M/18.6M [00:00<00:00, 172MB/s]


In [None]:
# `x` tells unrar to extract the files with their full paths

!unrar x "./emg.rar" "./"


UNRAR 5.50 freeware      Copyright (c) 1993-2017 Alexander Roshal


Extracting from ./emg.rar

Creating    ./EMG Physical Action Data Set                            OK
Extracting  ./EMG Physical Action Data Set/readme.txt                      0%  OK 
Creating    ./EMG Physical Action Data Set/sub1                       OK
Creating    ./EMG Physical Action Data Set/sub1/Aggressive            OK
Creating    ./EMG Physical Action Data Set/sub1/Aggressive/log        OK
Extracting  ./EMG Physical Action Data Set/sub1/Aggressive/log/Elbowing.log       0%  OK 
Extracting  ./EMG Physical Action Data Set/sub1/Aggressive/log/FrontKicking.log       0%  1%  OK 
Extracting  ./EMG Physical Action Data Set/sub1/Aggressive/log/Hamering.log       1%  OK 
Extracting  ./EMG Physical Action Data Set/sub1/Aggressive/log/Headering.log       1%  2%  OK 
Extracting  ./EMG Physical Action Data Set/sub1/Aggressive/log/Kneeing.log       2%  OK 
Extra

#### Visualizing file structure

In [None]:
!sudo apt install tree

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'sudo apt autoremove' to remove it.
The following NEW packages will be installed:
  tree
0 upgraded, 1 newly installed, 0 to remove and 29 not upgraded.
Need to get 40.7 kB of archives.
After this operation, 105 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 tree amd64 1.7.0-5 [40.7 kB]
Fetched 40.7 kB in 0s (82.3 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 1.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure:

In [None]:
!tree "./EMG Physical Action Data Set/sub1"

./EMG Physical Action Data Set/sub1
├── Aggressive
│   ├── log
│   │   ├── Elbowing.log
│   │   ├── FrontKicking.log
│   │   ├── Hamering.log
│   │   ├── Headering.log
│   │   ├── Kneeing.log
│   │   ├── Pulling.log
│   │   ├── Punching.log
│   │   ├── Pushing.log
│   │   ├── SideKicking.log
│   │   └── Slapping.log
│   └── txt
│       ├── Elbowing.txt
│       ├── Frontkicking.txt
│       ├── Hamering.txt
│       ├── Headering.txt
│       ├── Kneeing.txt
│       ├── Pulling.txt
│       ├── Punching.txt
│       ├── Pushing.txt
│       ├── Sidekicking.txt
│       └── Slapping.txt
└── Normal
    ├── log
    │   ├── Bowing.log
    │   ├── Clapping.log
    │   ├── Handshaking.log
    │   ├── Hugging.log
    │   ├── Jumping.log
    │   ├── Running.log
    │   ├── Seating.log
    │   ├── Standing.log
    │   ├── Walking.log
    │   └── Waving.log
    └── txt
        ├── Bowing.txt
        ├── Clapping.txt
        ├── Handshaking.txt
        ├── Hugging.txt
        ├── Jumping.txt
        ├── 

For subject 1, we have two subfolders:
- Aggressive
- Normal

These folders contain the aggressive and normal activities respectively, each with corresponding log and txt files.

We will use the txt files.


Let's look at one of the folders above.

In [None]:
!ls -lrt ./EMG\ Physical\ Action\ Data\ Set/sub1/Aggressive/txt/

total 3768
-rw-r--r-- 1 root root 361096 Feb  7  2010 Slapping.txt
-rw-r--r-- 1 root root 388912 Feb  7  2010 Sidekicking.txt
-rw-r--r-- 1 root root 379428 Feb  7  2010 Pushing.txt
-rw-r--r-- 1 root root 379597 Feb  7  2010 Punching.txt
-rw-r--r-- 1 root root 387656 Feb  7  2010 Pulling.txt
-rw-r--r-- 1 root root 398523 Feb  7  2010 Kneeing.txt
-rw-r--r-- 1 root root 350285 Feb  7  2010 Headering.txt
-rw-r--r-- 1 root root 402363 Feb  7  2010 Hamering.txt
-rw-r--r-- 1 root root 390158 Feb  7  2010 Frontkicking.txt
-rw-r--r-- 1 root root 398095 Feb  7  2010 Elbowing.txt


#### Reading data

Now, let's see what data Slapping.txt contains.

In [None]:
!cat ./EMG\ Physical\ Action\ Data\ Set/sub1/Aggressive/txt/Slapping.txt

Streaming output truncated to the last 5000 lines.
-814	-49	-25	64	-746	174	-1034	-301
-1344	252	-90	90	158	107	-998	-506
-1674	498	-129	162	988	77	-1136	-602
-1801	537	-86	-120	692	67	-1458	-778
-1735	461	-45	-397	320	143	-1781	-1404
-1674	363	8	-82	-295	90	-2096	-2259
-1581	254	69	111	-274	58	-2223	-2456
-1296	108	12	-76	-11	42	-2312	-1387
-1010	-4	-52	-15	-217	-11	-2577	-130
-740	-75	-21	122	-408	-33	-2707	716
-554	-21	18	221	-525	-37	-2573	1085
-444	-12	1	65	772	-94	-2385	1056
-272	-99	33	-10	2822	-141	-2223	1270
-218	-130	64	38	3851	-143	-2079	1452
-156	-92	61	-21	3725	-126	-2039	1332
-117	-87	62	-23	3428	-105	-2076	1150
-66	-98	36	-72	2968	-125	-2009	710
2	-137	23	-18	2125	-116	-1962	389
67	-203	15	6	2243	-55	-1962	448
103	-231	52	12	2378	-22	-2043	731
100	-215	93	10	2321	-26	-1781	817
55	-122	83	4	2813	-63	-1262	621
33	-106	14	2	3474	-68	-941	535
59	-98	-4	18	2711	-55	-774	451
90	-145	-35	32	1721	-74	-375	408
216	-277	-56	35	1474	-66	-158	3

**Key observations**

* There are eight columns of data, one per electrode (channel).
* Data is collected 10 times per millisecond, so each row is a reading taken 0.1 milliseconds after the previous one.
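
To make the layout concrete, here is a minimal sketch that parses three of the rows shown above and attaches a time axis. The column names `ch1` to `ch8` follow the loading code used later in this notebook. The `time_ms` column is a hypothetical addition, and the 0.1 ms spacing is an assumption based on the sampling rate stated in this notebook.

```python
import io

import pandas as pd

# Three rows copied from the Slapping.txt output above (tab-separated).
raw = (
    "-814\t-49\t-25\t64\t-746\t174\t-1034\t-301\n"
    "-1344\t252\t-90\t90\t158\t107\t-998\t-506\n"
    "-1674\t498\t-129\t162\t988\t77\t-1136\t-602\n"
)

channel_names = ["ch" + str(i) for i in range(1, 9)]
sample = pd.read_csv(io.StringIO(raw), sep="\t", header=None, names=channel_names)

# Assumed sampling rate: 10 samples per millisecond, so rows are 0.1 ms apart.
samples_per_ms = 10
sample["time_ms"] = sample.index / samples_per_ms

print(sample)
```

Each row is one time step and each column is one electrode, which is the shape the rest of the preprocessing relies on.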

# **Loading data**

While importing the data we are also going to chunk it.

#### What does chunking of data mean ?
  - Pick continuous intervals of a fixed size from data.
  - Replace those intervals with their mean/median/max etc.

#### What size of interval to choose ?
  - Depends on dataset and application
  - For this case, interval size = 10
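
As a rough illustration of chunking, the sketch below reduces a toy signal with an interval size of 10. The two-channel `signal` frame is made up for the example; only the idea of replacing each interval with a summary statistic comes from the text above.

```python
import numpy as np
import pandas as pd

# Toy signal: 25 samples on 2 hypothetical channels.
signal = pd.DataFrame({
    "ch1": np.arange(25),
    "ch2": np.arange(25)[::-1],
})

chunk_size = 10

# Rows 0-9 get chunk id 0, rows 10-19 get id 1, rows 20-24 get id 2.
chunk_ids = np.arange(len(signal)) // chunk_size

# Replace every interval with a single summary row.
chunk_max = signal.groupby(chunk_ids).max()
chunk_mean = signal.groupby(chunk_ids).mean()

print(chunk_max)
print(chunk_mean)
```

The 25 input rows collapse to 3 rows. The last chunk is shorter than the others because 25 is not a multiple of 10.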

#### But doesn't this result in a loss of data ?
  - Yes
  - #### Then why should we chunk a dataset ?
    - Whether the trade-off is worth it depends on
      1. Data acquisition techniques (e.g. how densely the signal was sampled)
      2. Application

#### Why are we chunking our data ?
  - EMG signals suffer from the problem of duplication.
  - #### What does duplication mean in EMG?
    - Consecutive samples are similar to one another.
  - #### Why can this be a problem ?
    - Redundant data leads to :
      1. Longer training time
      2. More memory usage

Now let's import the data and chunk it.

In [None]:
actions = {}

data_dirs = ["./EMG Physical Action Data Set/sub1/Aggressive/txt",
             "./EMG Physical Action Data Set/sub1/Normal/txt"]

ind = 0
frames = []  # one chunked DataFrame per action file

for data_dir in data_dirs :

  for file_name in os.listdir(data_dir):

    temp = pd.read_csv(os.path.join(data_dir, file_name),
                       sep = "\t",
                       header = None,
                       names = ["ch" + str(i) for i in range(1, 9)] # 8 input channels
                       )

    # Chunking: take the max of every 10 sequential rows.
    # np.arange(len(temp)) // 10 puts rows 0-9 in chunk 0, rows 10-19 in chunk 1, and so on.
    # (DataFrame.append was removed in pandas 2.0, so groupby is used instead.)
    temp_chunked = temp.groupby(np.arange(len(temp)) // 10).max().astype(float)

    action = file_name[:-4] # remove the last 4 characters (".txt") from the filename
    actions[action] = ind

    temp_chunked["Action"] = action

    frames.append(temp_chunked)

    ind += 1

# Concatenate once at the end instead of growing a DataFrame inside the loop.
data = pd.concat(frames)

print(actions)

{'Sidekicking': 0, 'Punching': 1, 'Frontkicking': 2, 'Kneeing': 3, 'Headering': 4, 'Elbowing': 5, 'Hamering': 6, 'Slapping': 7, 'Pulling': 8, 'Pushing': 9, 'Jumping': 10, 'Walking': 11, 'Standing': 12, 'Clapping': 13, 'Running': 14, 'Waving': 15, 'Handshaking': 16, 'Hugging': 17, 'Bowing': 18, 'Seating': 19}


In [None]:
data.head()

| | ch1 | ch2 | ch3 | ch4 | ch5 | ch6 | ch7 | ch8 | Action |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -506.0 | -391.0 | -73.0 | 363.0 | 933.0 | 4000.0 | 4000.0 | 4000.0 | Sidekicking |
| 1 | -43.0 | -84.0 | 218.0 | 341.0 | 1720.0 | 3333.0 | 3726.0 | -2429.0 | Sidekicking |
| 2 | 1246.0 | 119.0 | 2539.0 | 354.0 | 2960.0 | 1864.0 | -4000.0 | -4000.0 | Sidekicking |
| 3 | 838.0 | 590.0 | 2551.0 | 716.0 | 4000.0 | 4000.0 | -4000.0 | -2254.0 | Sidekicking |
| 4 | 430.0 | 190.0 | -255.0 | 329.0 | 4000.0 | 3670.0 | -4000.0 | 4000.0 | Sidekicking |


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19711 entries, 0 to 999
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ch1     19711 non-null  float64
 1   ch2     19711 non-null  float64
 2   ch3     19711 non-null  float64
 3   ch4     19711 non-null  float64
 4   ch5     19711 non-null  float64
 5   ch6     19711 non-null  float64
 6   ch7     19711 non-null  float64
 7   ch8     19711 non-null  float64
 8   Action  19711 non-null  object 
dtypes: float64(8), object(1)
memory usage: 1.5+ MB


#### What can we see from this data ?

- The data contains :
  1. 8 features and 1 target variable
  2. No null values
  3. Around 20,000 samples (19,711 rows)

- We will use "Action" as the target attribute.

In [None]:
Y = data["Action"]
X = data.drop(columns = ["Action"])


Now, let's analyze the target variable.


In [None]:
print(Y.unique())

['Sidekicking' 'Punching' 'Frontkicking' 'Kneeing' 'Headering' 'Elbowing'
 'Hamering' 'Slapping' 'Pulling' 'Pushing' 'Jumping' 'Walking' 'Standing'
 'Clapping' 'Running' 'Waving' 'Handshaking' 'Hugging' 'Bowing' 'Seating']


#### What can you tell about target variable from this info ?
  - Target variable contains 20 unique values.
  - It is categorical.
  - But the values are textual.
  
#### How is this going to be a problem ?
- Most ML algorithms work only with numeric values.

<br>
  
#### **How should we transform target variable to numerical ?**

  - It has 20 distinct values.
    #### 1. Can we use Binary Encoding ?
      - No - Why ?
        - It works only with variables that have exactly 2 values.
    
    #### 2. Can we use One Hot Encoding ?
      - No - Why ?
        - It creates one column per class (20 columns here), which costs extra memory, while a multi-class classifier only needs a single label column as its target.

    #### 3. Can we use Label Encoding ?
      - Yes - Why ?
        - Doesn't require extra memory.
        - Works with any number of unique values.
      
      - But the target variable does not have any order.
      - #### Why won't this be a problem ?
        - The target is not fed to the algorithm as an input feature. The classifier treats the integers only as class IDs, so their numeric order carries no meaning.
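
A small sketch of label encoding on made-up labels follows. It shows two equivalent routes: an explicit dictionary (the approach this notebook takes with `actions`) and scikit-learn's `LabelEncoder`, which is an alternative not used in the original notebook. The `labels` series is hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical target column with three distinct actions.
labels = pd.Series(["Slapping", "Bowing", "Slapping", "Waving"])

# Option 1: an explicit dictionary mapping each name to an integer.
mapping = {name: idx for idx, name in enumerate(sorted(labels.unique()))}
encoded_with_map = labels.map(mapping)

# Option 2: LabelEncoder, which assigns integers to the sorted class names.
encoder = LabelEncoder()
encoded_with_sklearn = encoder.fit_transform(labels)

print(mapping)
print(encoded_with_map.tolist())
print(encoder.inverse_transform(encoded_with_sklearn))
```

Either way the target stays a single column of integers, and `inverse_transform` (or the inverted dictionary) recovers the action names when predictions need to be reported.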
    


In [None]:
Y = Y.map(actions)
Y.head()

0    0
1    0
2    0
3    0
4    0
Name: Action, dtype: int64


Now let's check if the dataset is balanced.
#### How can we check for data balance ?
  - Check the value counts of the target classes.

In [None]:
print(Y.value_counts())

10    1000
6     1000
15    1000
13    1000
11    1000
19    1000
4     1000
3     1000
14     997
18     983
0      983
2      982
7      979
5      978
17     976
12     973
9      968
8      966
1      964
16     962
Name: Action, dtype: int64


#### What can we see from this information ?
  - Every class has between 962 and 1000 samples, so the classes are almost equally represented.
  - i.e. The dataset is balanced.
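
One way to put a number on "balanced" is the ratio between the largest and the smallest class. The sketch below uses the counts printed above; the variable names and the choice of this particular ratio are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Per-class counts copied from the value_counts() output above.
counts = pd.Series(
    [1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 997, 983,
     983, 982, 979, 978, 976, 973, 968, 966, 964, 962],
    index=[10, 6, 15, 13, 11, 19, 4, 3, 14, 18, 0, 2, 7, 5, 17, 12, 9, 8, 1, 16],
)

# A ratio close to 1 means the classes are similar in size.
imbalance_ratio = counts.max() / counts.min()

# Fraction of the dataset taken up by each class.
share = counts / counts.sum()

print(f"Largest / smallest class: {imbalance_ratio:.3f}")
print(f"Class share ranges from {share.min():.3%} to {share.max():.3%}")
```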


## **Domain specific preprocessing - Rectification**

Our EMG signals should also be rectified.

#### What does rectification of EMG signals mean ?

<img src='https://drive.google.com/uc?id=14vHbNx-gTTkkI-ey1bNABT0JxPuLmLGK' width = 400>


  - Our data contains both negative and positive values.
  
  - When such a signal is averaged, the positive and negative values cancel out and the mean ends up close to 0.
  
  - #### How can we deal with this problem ?
  
    1. Half wave rectification:
      - Discard either the negative or the positive values
  
    2. Full wave rectification:
      - Take the absolute value of every sample

#### How should we rectify our EMG signals ?
  - Full wave rectification
  - #### Why not do half-wave rectification ?
    - To minimize loss of data: half-wave rectification throws away one polarity of the signal.
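
The difference between the two options can be seen on a tiny made-up signal. This is only a sketch: the values in `signal` are invented, and the half-wave variant shown here zeroes out the negative samples, which is one of the two possible choices.

```python
import numpy as np

# Made-up signal whose positive and negative samples cancel exactly.
signal = np.array([-3.0, 2.0, -1.0, 4.0, -2.0])

# Half-wave rectification (one variant): negative samples are set to zero.
half_wave = np.where(signal > 0, signal, 0.0)

# Full-wave rectification: every sample is replaced by its absolute value.
full_wave = np.abs(signal)

print("raw mean:      ", signal.mean())
print("half-wave mean:", half_wave.mean())
print("full-wave mean:", full_wave.mean())
```

The raw mean is zero even though the signal is clearly active. Full-wave rectification keeps the magnitude of every sample, while the half-wave version loses the three negative ones.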

Let's rectify our data now.

In [None]:
X = abs(X)
X.head()

| | ch1 | ch2 | ch3 | ch4 | ch5 | ch6 | ch7 | ch8 |
|---|---|---|---|---|---|---|---|---|
| 0 | 506.0 | 391.0 | 73.0 | 363.0 | 933.0 | 4000.0 | 4000.0 | 4000.0 |
| 1 | 43.0 | 84.0 | 218.0 | 341.0 | 1720.0 | 3333.0 | 3726.0 | 2429.0 |
| 2 | 1246.0 | 119.0 | 2539.0 | 354.0 | 2960.0 | 1864.0 | 4000.0 | 4000.0 |
| 3 | 838.0 | 590.0 | 2551.0 | 716.0 | 4000.0 | 4000.0 | 4000.0 | 2254.0 |
| 4 | 430.0 | 190.0 | 255.0 | 329.0 | 4000.0 | 3670.0 | 4000.0 | 4000.0 |


## **Handling Noise**

There can also be a lot of noise in EMG signals.

#### Why does noise occur in EMG data ?
  - Faulty equipment
  - The recording technique is sensitive and also picks up unwanted signals

So, we need to remove this noise.
#### Why is it important to remove noise ?
  - Noise is unwanted data
  - It hampers model performance
  - It can lead to longer training time.

#### How can we remove noise from EMG signals ?
  - By taking a moving average.

<img src='https://drive.google.com/uc?id=1mgd7NwvNcL0Shs-yLVunt6yWHyzj1zVz' width = 800>

#### Why do we take a moving average ?
1. Good smoother i.e. reduces oscillations
2. Simple to implement.

So, let's remove the noise in our data.


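#### What about points till t = 9 ?

With a moving-average window of 10, the earliest points do not have a full window of past values. There are various strategies for handling that:

- Leave them as NaN
- Use the average of as many points as are available
- Use the values as is, until enough points are available.

The default behaviour depends on the library being used.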
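
The three strategies can be sketched with pandas `rolling`. The toy `signal` and the window of 3 are chosen only to keep the output short; the question above suggests a window of 10 for the real data.

```python
import pandas as pd

signal = pd.Series([4.0, 8.0, 6.0, 2.0, 10.0])
window = 3  # small window for readability

# Strategy 1: leave the first (window - 1) points as NaN.
strict = signal.rolling(window=window).mean()

# Strategy 2: average as many points as are available so far.
partial = signal.rolling(window=window, min_periods=1).mean()

# Strategy 3: keep the raw values until a full window is available.
passthrough = strict.fillna(signal)

print(pd.DataFrame({"strict": strict, "partial": partial, "passthrough": passthrough}))
```

All three columns agree from the third point onward; they differ only in how the first `window - 1` points are filled.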


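We will be using the pandas [ewm](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.ewm.html) function

- It calculates an exponentially weighted mean column-wise, i.e. a moving average in which recent samples get larger weights than older ones.

We will also set the parameter ```com``` = 10
- It specifies the decay in terms of center of mass: the smoothing factor is $\alpha = 1 / (1 + com)$, so ```com``` = 10 gives $\alpha \approx 0.09$.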
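
To see what `ewm` computes, the sketch below reproduces it by hand on a four-point toy series. It assumes the pandas default `adjust=True`, where each output is a weighted average of all past samples with weights $(1 - \alpha)^i$ for a sample that is $i$ steps old. The series `x` and the helper list `manual` are illustrative only.

```python
import numpy as np
import pandas as pd

com = 10
alpha = 1 / (1 + com)  # smoothing factor implied by the center of mass

x = pd.Series([1.0, 2.0, 3.0, 4.0])
smoothed = x.ewm(com=com).mean()

# Manual version: weights[i] is the weight of the sample that is i steps old.
weights = (1 - alpha) ** np.arange(len(x))
manual = []
for t in range(len(x)):
    past = x.to_numpy()[t::-1]  # x[t], x[t-1], ..., x[0]
    w = weights[: t + 1]
    manual.append(np.sum(w * past) / np.sum(w))

print(smoothed.to_numpy())
print(np.array(manual))
```

Because the weights are normalized by their sum, the first output equals the first sample, so no NaN values appear at the start of the series.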

In [None]:
# Exponentially weighted mean with center of mass com = 10, applied to each column
X = X.ewm(com = 10).mean()
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19711 entries, 0 to 999
Data columns (total 8 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ch1     19711 non-null  float64
 1   ch2     19711 non-null  float64
 2   ch3     19711 non-null  float64
 3   ch4     19711 non-null  float64
 4   ch5     19711 non-null  float64
 5   ch6     19711 non-null  float64
 6   ch7     19711 non-null  float64
 7   ch8     19711 non-null  float64
dtypes: float64(8)
memory usage: 1.4 MB


#### Splitting data

Now our dataset is ready for training.

#### But do we feed the entire dataset into our algo ?
  - No - Why ?
    - We need data the model has never seen to estimate how well it generalizes.

#### So how should we split our data ?
  - We split it into a train set and a test set.
  
  - #### What should be the ratios for splitting ?
    - 80%:20% for train/test set is good enough.

  - #### But what about validation set?
    - We will use the k-fold cross-validation technique.
  
  - #### Why use cross-validation ?
      - Prevents overfitting on dev-set.
      - Gives an estimate of how much the model's evaluation score varies across folds.
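
A minimal sketch of this workflow on synthetic data is shown below. Everything in it is a stand-in: `make_classification` replaces the EMG features, a shallow `DecisionTreeClassifier` replaces whichever model is trained later, and `cv=5` is an arbitrary choice of k.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the EMG features: 8 columns, 4 classes.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=6, n_redundant=0,
    n_classes=4, random_state=0,
)

# 80%:20% train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

model = DecisionTreeClassifier(max_depth=5, random_state=0)

# 5-fold cross-validation on the training portion only; the test set stays untouched.
scores = cross_val_score(model, X_tr, y_tr, cv=5)

print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The mean of `scores` plays the role of a validation score, and the spread across folds indicates how stable that estimate is. The held-out test set is used once, after model selection is finished.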

Let's split the data now.


In [None]:
from sklearn.model_selection import train_test_split

# Convert the pandas objects to NumPy arrays
X = X.to_numpy()
Y = Y.to_numpy()

# Hold out 20% of the samples as the test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, shuffle = True)

print(f"Sizes of the sets created are:\nTraining set:{X_train.shape[0]}\nTest set:{X_test.shape[0]}")

Sizes of the sets created are:
Training set:15768
Test set:3943


#### Preprocessed Data

In [None]:
# import pickle

# !gdown 171Yoe_GSapyrmOnD9oBzHWNOD_OnQs0F
# !gdown 1hnIlTPW3AMeB69EbeaXCRIrpMVT1Vwmc
# !gdown 1nZtB_RtxMg_MgoRczb8UWQX-AEK_l3qE
# !gdown 1zLDUErwKdmF-RacOyHEuI_z_46LssQtP


# with open('X_train.pickle', 'rb') as handle:
#     X_train = pickle.load(handle)

# with open('X_test.pickle', 'rb') as handle:
#     X_test = pickle.load(handle)

# with open('Y_train.pickle', 'rb') as handle:
#     Y_train = pickle.load(handle)

# with open('Y_test.pickle', 'rb') as handle:
#     Y_test = pickle.load(handle)