Michael O'Hanlon\
Professor Monogioudis\
CS301101\
11/17/2022

Assignment #3: Electromyography and Gradient Boosting

# Background and Information

Gradient Boosting is a popular and effective machine learning technique. It combines several weak models, or learners, into a single strong model, where a weak model is one with poor accuracy. If a model struggles to perform better than random guessing, it can be classified as a weak model. So, the motivation behind Gradient Boosting is to take several models with poor accuracy and combine them into a single strong model that has good accuracy, which is what we want, of course!
\
\
\
What follows are the four steps of the Gradient Boosting Algorithm.
\
\
\
The Gradient Boosting Algorithm:\
![](https://drive.google.com/uc?export=view&id=1-4Ai4qRnqSM6I74AYL9qPcihuX9cLGct)

When M is sufficiently large, the result is a strong composite model which can be used to make predictions.
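
As a rough illustration of how such a loop can look in code, here is a minimal NumPy sketch of a generic boosting loop for binary classification. It is not necessarily identical to the four steps in the figure: the helper names `fit_stump` and `boost`, the one-split regression stump used as the weak learner, and the synthetic one-feature data are all assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_stump(x, r):
    """Fit a one-split regression stump to residuals r on a single feature x."""
    best = None
    for t in np.unique(x)[:-1]:              # candidate thresholds (both sides non-empty)
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return lambda q: np.where(q <= t, left_value, right_value)

def boost(x, y, n_estimators=50, learning_rate=0.1):
    p0 = np.clip(y.mean(), 1e-5, 1 - 1e-5)
    f0 = np.log(p0 / (1 - p0))               # constant initial model (log-odds of the mean label)
    F = np.full(len(y), f0)
    stumps = []
    for _ in range(n_estimators):
        residual = y - sigmoid(F)            # negative gradient of the log-loss
        h = fit_stump(x, residual)           # weak learner fitted to the residuals
        F = F + learning_rate * h(x)         # shrunken additive update
        stumps.append(h)
    # The composite model is the initial constant plus all scaled weak learners
    return lambda q: f0 + learning_rate * sum(h(q) for h in stumps)

# Synthetic data that no single threshold can separate
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = (np.abs(x) > 1.5).astype(float)

model = boost(x, y)
accuracy = np.mean((sigmoid(model(x)) > 0.5) == y)
```

Each stump on its own can only split the feature once, yet the sum of many stumps can represent the two-sided decision rule in the toy data, which is the sense in which weak learners combine into a stronger model.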
\
\
\
For the classification task, the loss function is defined as:\
![](https://drive.google.com/uc?export=view&id=1MIuSSvHCDbI2H7IH3H5ru-FkVu2rYAGi)

However, in the case of multiclass classification, which will be applied later, the loss function is defined as:\
![](https://drive.google.com/uc?export=view&id=10T9G2VYHVJR-CgvRmb4kRAROfkfvV4U-)
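
For readers who prefer code to notation, the following is a small NumPy sketch of a multiclass loss, assuming the figure above shows the usual categorical cross-entropy over softmax probabilities. The function names `softmax` and `multiclass_cross_entropy` are chosen for this example, and the clipping constant simply mirrors the one used in the notebook's binary loss.

```python
import numpy as np

def softmax(scores):
    """Turn one row of raw scores per sample into class probabilities."""
    shifted = scores - scores.max(axis=1, keepdims=True)   # subtract the row max for stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def multiclass_cross_entropy(y_onehot, scores):
    """Sum over samples of -log(probability assigned to the true class)."""
    proba = np.clip(softmax(scores), 1e-5, 1.0)            # avoid log(0)
    return -np.sum(y_onehot * np.log(proba))

# One sample, three classes, equal scores: every class gets probability 1/3
y_onehot = np.array([[0.0, 1.0, 0.0]])
uniform_loss = multiclass_cross_entropy(y_onehot, np.zeros((1, 3)))

# A confident, correct prediction gives a much smaller loss
confident_loss = multiclass_cross_entropy(y_onehot, np.array([[0.0, 10.0, 0.0]]))
```

With equal scores the loss equals log(3), the cost of guessing uniformly among three classes, and it shrinks toward zero as the score of the true class grows.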
\
\
\
The trick behind Gradient Boosting is to fit each new model to the *residual errors* left by the previous model. As expected from its name, Gradient Boosting calculates the gradient of the loss at every step. The negative gradient indicates how each prediction should change to reduce the loss, so fitting the next weak learner to it ensures the ensemble is actually heading in the right direction.
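
The link between gradients and residuals can be checked numerically. The sketch below assumes the binary log-loss is written in terms of raw scores (log-odds) rather than probabilities, which differs from the `CrossEntropy` function used later in this notebook. Under that assumption, the negative gradient with respect to the scores is exactly the residual between the labels and the predicted probabilities.

```python
import jax
import jax.numpy as jnp

def log_loss(y_true, raw_score):
    """Binary cross-entropy expressed in terms of raw scores (log-odds)."""
    # log(1 + exp(F)) - y * F is the negative log-likelihood of a Bernoulli label
    return jnp.sum(jnp.logaddexp(0.0, raw_score) - y_true * raw_score)

y = jnp.array([0.0, 1.0, 1.0, 0.0])      # true labels
F = jnp.array([0.3, -1.2, 2.0, 0.0])     # current raw scores of the ensemble

# Negative gradient of the loss with respect to the current scores
neg_gradient = -jax.grad(log_loss, argnums=1)(y, F)

# Plain residual: label minus predicted probability
residual = y - jax.nn.sigmoid(F)
```

Here `neg_gradient` and `residual` agree up to floating-point error, which is why fitting a weak learner to the negative gradient amounts to fitting it to the residual errors.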
\
\
\
An important part of Gradient Boosting is tuning its two main hyperparameters, `learning_rate` and `n_estimators`. The first, `learning_rate`, scales how much each weak learner contributes to the ensemble. The second, `n_estimators`, sets the number of models in the ensemble. These two hyperparameters must be tuned together to prevent overfitting and to ensure the resulting model is accurate.
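
The interaction between the two hyperparameters can be seen in a deliberately idealized calculation. The sketch below assumes a squared-error setting in which every weak learner predicts the current residual exactly, so only the `learning_rate` limits progress. The function `remaining_residual` is a hypothetical helper for this illustration, not part of any library.

```python
def remaining_residual(learning_rate, n_estimators, initial_residual=1.0):
    """Residual left after boosting when each learner predicts the residual exactly."""
    r = initial_residual
    for _ in range(n_estimators):
        # Only a fraction `learning_rate` of the learner's prediction is applied
        r -= learning_rate * r
    return r

one_big_step = remaining_residual(learning_rate=1.0, n_estimators=1)
ten_small_steps = remaining_residual(learning_rate=0.1, n_estimators=10)
hundred_small_steps = remaining_residual(learning_rate=0.1, n_estimators=100)
```

In this idealized case the leftover residual after M rounds is (1 - `learning_rate`)^M, so a smaller `learning_rate` needs a larger `n_estimators` to reach the same fit. On real data the smaller steps usually generalize better, which is why the two values are tuned as a pair.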
\
\
\
In this notebook, Gradient Boosting will be implemented from scratch using JAX libraries. The implementation will then be applied to an Electromyography dataset to see its performance in action.

# Coding from scratch using JAX

In [1]:
"""
Implementation of Gradient Boosting from scratch.
Utilizes JAX libraries to perform classification.

Video Reference: 
https://www.youtube.com/watch?v=SstuvS-tVc0&list=WL&index=2&t=7s&ab_channel=AleksaGordi%C4%87-TheAIEpiphany

Useful Links:
https://towardsdatascience.com/gradient-boosting-classification-explained-through-python-60cc980eeb3d
https://www.simplilearn.com/gradient-boosting-algorithm-in-python-article
https://github.com/groverpr/Machine-Learning/blob/master/notebooks/01_Gradient_Boosting_Scratch.ipynb

https://gkaissis.github.io/post/2020-03-15-rfgb/
https://github.com/eriklindernoren/ML-From-Scratch/blob/master/mlfromscratch/supervised_learning/gradient_boosting.py
"""

"""
#Necessary imports
import jax
import jax.numpy as jnp
import numpy as np
from sklearn.tree import DecisionTreeRegressor

#Loss function for classification task (binary cross-entropy)
def CrossEntropy(y_true:jnp.ndarray, y_proba:jnp.ndarray):
    #Clip the probabilities to avoid taking log(0)
    y_proba = jnp.clip(y_proba, 1e-5, 1 - 1e-5)
    return jnp.sum(- y_true * jnp.log(y_proba) - (1 - y_true) * jnp.log(1 - y_proba))

#Class for Gradient Boosting
class GradientBooster:
    #Construct a new GradientBooster
    def __init__(self, n_estimators, learning_rate, **kwargs):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.loss = CrossEntropy

        #Create all of the estimators to use together.
        #Regression trees are used because each tree is fit to
        #continuous gradient values rather than to class labels.
        self.estimators = []
        for _ in range(self.n_estimators):
            self.estimators.append(DecisionTreeRegressor(**kwargs))

    #Function to train the classifier with given Xs and ys
    def fit(self, X:np.ndarray, y:np.ndarray):
        #Start from a constant prediction equal to the mean label
        y_pred = np.full(np.shape(y), np.mean(y), dtype=np.float32)
        for estimator in self.estimators:
            #Gradient of the loss with respect to the current predictions
            gradient = jax.grad(self.loss, argnums=1)(y.astype(np.float32), y_pred)
            #Convert the JAX array to NumPy before handing it to scikit-learn
            estimator.fit(X, np.asarray(gradient))
            update = estimator.predict(X)
            y_pred -= (self.learning_rate * update)

    #Function to make predictions based on X data
    def predict(self, X:np.ndarray):
        y_pred = np.zeros(X.shape[0], dtype=np.float32)
        for estimator in self.estimators:
            y_pred -= (self.learning_rate * estimator.predict(X))

        #Return the prediction (1 if the sigmoid exceeds 0.5, else 0)
        return np.where(1/(1 + np.exp(-y_pred))>.5, 1, 0)
"""


# EMG Dataset

In [2]:
"""
Mount Google Drive
"""

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
"""
Import the data into the environment and unzip
"""

import sys
import os

! cp "/content/drive/MyDrive/Fall_2022/CS301_F22/Homework/Homework 3/EMG Physical Action Data Set.rar" "EMG Physical Action Data Set.rar"

! pip install unrar
! unrar x "EMG Physical Action Data Set.rar"

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting unrar
  Downloading unrar-0.4-py3-none-any.whl (25 kB)
Installing collected packages: unrar
Successfully installed unrar-0.4

UNRAR 5.50 freeware      Copyright (c) 1993-2017 Alexander Roshal


Extracting from EMG Physical Action Data Set.rar

Creating    EMG Physical Action Data Set                              OK
Extracting  EMG Physical Action Data Set/readme.txt                        0%  OK 
Creating    EMG Physical Action Data Set/sub1                         OK
Creating    EMG Physical Action Data Set/sub1/Aggressive              OK
Creating    EMG Physical Action Data Set/sub1/Aggressive/log          OK
Extracting  EMG Physical Action Data Set/sub1/Aggressive/log/Elbowing.log       0%  OK 
Extracting  EMG Physical Action Data Set/sub1/Aggressive/log/FrontKicking.log       0%  1%  OK 
Extracting  EMG Physical Action Data Set/sub1/Aggressive

In [4]:
"""
Load all of the data and append a 0 or 1
indicating if the data is aggressive or normal
"""

import numpy as np

directory = "EMG Physical Action Data Set"
data = np.empty((0, 9), dtype="float32")

#Append a 0 or 1 to every row and then append to the data np array
for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    if os.path.isdir(f):
        for subdir in os.listdir(f):
          subdir = os.path.join(f, subdir)
          if os.path.isdir(subdir):
            for txtdir in os.listdir(subdir):
              txtdir = os.path.join(subdir, txtdir)
              #Only interested in the txt files containing the data
              if os.path.isdir(txtdir) and "log" not in txtdir:
                for txtfile in os.listdir(txtdir):
                  txtfile = os.path.join(txtdir, txtfile)
                  if os.path.isfile(txtfile):
                    filepath = txtfile
                    with open(filepath) as fp:
                        lines = fp.read().splitlines()
                    if "Aggressive" in filepath:  #This file contains aggressive data
                      with open(filepath, "w") as fp:
                          for line in lines:
                              newline = line + "\t0"
                              print(newline, file=fp)
                    else: #This file contains normal data
                      with open(filepath, "w") as fp:
                          for line in lines:
                              newline = line + "\t1"
                              print(newline, file=fp)
                    txtdata = np.genfromtxt(txtfile,delimiter='\t')
                    data = np.append(data, txtdata, axis=0)
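
Note that the cell above rewrites every data file in place, so running it a second time would append a second label column. One possible alternative, sketched below, is to attach the label in memory after reading each file. The helper `load_labeled` and the tiny stand-in file are assumptions for illustration and are not part of the dataset or the original workflow. The label convention (0 for aggressive, 1 for normal) follows the cell above.

```python
import os
import tempfile
import numpy as np

def load_labeled(path, label):
    """Read a tab-delimited EMG file and append `label` as a final column."""
    signals = np.genfromtxt(path, delimiter="\t", dtype="float32")
    signals = np.atleast_2d(signals)                 # keep 2-D even for one-row files
    labels = np.full((signals.shape[0], 1), label, dtype="float32")
    return np.hstack([signals, labels])              # the file on disk is left untouched

# Tiny stand-in file with two rows of eight channels (hypothetical data)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Aggressive_demo.txt")
    with open(path, "w") as fp:
        fp.write("\t".join(["1"] * 8) + "\n")
        fp.write("\t".join(["2"] * 8) + "\n")
    label = 0 if "Aggressive" in path else 1         # same convention as the loop above
    rows = load_labeled(path, label)
```

Collecting the per-file arrays in a list and calling `np.concatenate` once at the end would also avoid the repeated copying that `np.append` performs inside a loop.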

In [5]:
"""
Split the data into train, test, as well as X and y
"""

#Necessary imports
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import GradientBoostingClassifier

#Function to remove any NaN or inf values from the data (just in case)
def clean_dataset(df):
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    df.dropna(inplace=True)
    #Keep only the rows that contain no NaN or infinite values
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
    return df[indices_to_keep].astype(np.float64)

#Create the pandas dataframe and then clean it
df = pd.DataFrame(data)
df = clean_dataset(df)

#Split the data into two datasets, train and test
train, test = train_test_split(df, test_size=0.2, random_state=1234)

#Split the training and test dataframes into X and y, where y is the aggressive/normal label
X_train_df = pd.DataFrame(train, columns=[0, 1, 2, 3, 4, 5, 6, 7])
Y_train_df = pd.DataFrame(train, columns=[8])
X_test_df = pd.DataFrame(test, columns=[0, 1, 2, 3, 4, 5, 6, 7])
Y_test_df = pd.DataFrame(test, columns=[8])

#Convert the dataframes into numpy representations
X_train_np = X_train_df.values
y_train_np = Y_train_df.values
X_test_np = X_test_df.values
y_test_np = Y_test_df.values

#Reshape the y numpy arrays
y_train_np = y_train_np.reshape(X_train_np.shape[0])
y_test_np = y_test_np.reshape(X_test_np.shape[0])

"""
Shape of X_train_np :  (637940, 8)
Shape of Y_train_np :  (637940,)
Shape of X_test_np :  (159486, 8)
Shape of Y_test_np :  (159486,)
"""


In [6]:
"""
Apply Gradient Boosting to the EMG data using scikit-learn's GradientBoostingClassifier.
"""

#Create the GradientBoostingClassifier
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
    max_depth=1, random_state=0).fit(X_train_np, y_train_np)

#Print its accuracy on the test set
clf.score(X_test_np, y_test_np)

0.8190938389576514