# Classification of Grasp and Lift Action - EEG Dataset

In this notebook we load the data from MongoDB into an EMR cluster and train several machine learning models to predict the event type from the EEG signals.
We compare the models in terms of both area under the ROC curve (AUC) and the wall-clock time taken to fit each model and score the validation data.
The raw EEG signals have been pre-processed and stored in MongoDB along with their corresponding event data. The pre-processing step is part of another notebook.
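
For intuition about the AUC metric used throughout, the sketch below computes it in plain Python as the fraction of (positive, negative) pairs in which the positive sample receives the higher score. This is a pairwise formulation for illustration only, not Spark's implementation, and `toy_labels` and `toy_scores` are made-up values rather than EEG data.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up example: three of the four positive/negative pairs are ordered correctly.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
print(toy_auc)  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is the scale on which the per-event results below should be read.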


In [1]:
# Tell PySpark to load the MongoDB Spark connector package when the session starts

import os
pyspark_submit_args = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.0 pyspark-shell'
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args


In [2]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import *
spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.mongodb.input.uri", "mongodb://34.219.77.22/msds697.eeg") \
    .config("spark.executor.memory", "22g") \
    .config("spark.driver.memory", "10g") \
    .config("spark.memory.offHeap.enabled", True) \
    .config("spark.memory.offHeap.size", "3g") \
    .getOrCreate()



# 1. Reading the data from MongoDB

In [3]:
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()

In [4]:
from pyspark.ml.feature import VectorAssembler
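# Columns "0".."5" hold the six event labels; columns "6".."25" hold the 20 features.
target_cols = [str(x) for x in range(6)]
feature_cols = [str(x) for x in range(6, 26)]

# Pack the feature columns into a single vector column, as Spark ML estimators expect.
va = VectorAssembler(outputCol="features", inputCols=feature_cols)

# Keep the feature vector and rename each label column to its event name.
lpoints = va.transform(df).select("features",
                                  df['0'].alias('HandStart'),
                                  df['1'].alias('FirstDigitTouch'),
                                  df['2'].alias('BothStartLoadPhase'),
                                  df['3'].alias('LiftOff'),
                                  df['4'].alias('Replace'),
                                  df['5'].alias('BothReleased'))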
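
The six aliases above follow a fixed column-to-event mapping, so they could also be generated from a single list. The sketch below builds that mapping in plain Python; `EVENT_NAMES` and `column_to_event` are illustrative names that do not appear elsewhere in the notebook, and the Spark call in the trailing comment assumes the `df` and `va` objects defined above.

```python
# Event names in the same order as label columns "0".."5" in the cell above.
EVENT_NAMES = ['HandStart', 'FirstDigitTouch', 'BothStartLoadPhase',
               'LiftOff', 'Replace', 'BothReleased']

# Map each raw label column name to its event name.
column_to_event = {str(i): name for i, name in enumerate(EVENT_NAMES)}

for column, event in column_to_event.items():
    print(column, '->', event)

# With the notebook's Spark objects, the select could then be written as:
#   lpoints = va.transform(df).select(
#       "features", *[df[c].alias(e) for c, e in column_to_event.items()])
```

Keeping the event names in one list also avoids retyping them in each modeling cell.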

In [5]:
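# 80/20 train/validation split with a fixed seed; both parts are cached because every model reuses them
splits = lpoints.randomSplit([0.8, 0.2], seed=42)
eeg_train = splits[0].cache()
eeg_valid = splits[1].cache()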

# 2. Modeling

## 2.1 Logistic Regression

In [6]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
import time

In [7]:

# Labels are the events that we are trying to classify
labels = ['HandStart', 'FirstDigitTouch', 'BothStartLoadPhase', 'LiftOff', 'Replace', 'BothReleased']
# Fit and evaluate one binary logistic regression model per event
start = time.time()
for label in labels:
    lr = LogisticRegression(regParam=0.01, maxIter=100, fitIntercept=True, labelCol=label)
    lrmodel = lr.fit(eeg_train.select('features', label))
    validpredicts = lrmodel.transform(eeg_valid.select('features', label))
    # The evaluator's default metric is areaUnderROC
    bceval = BinaryClassificationEvaluator(labelCol=label)
    auc = bceval.evaluate(validpredicts)
    print('area Under ROC ' + label + " : " + str(auc))
# Total wall-clock time for fitting and evaluating all six models
duration = time.time() - start
print('Time taken for logistic regression ' + str(duration) + 's')

area Under ROC HandStart : 0.5209817175371723
area Under ROC FirstDigitTouch : 0.5654686820170102
area Under ROC BothStartLoadPhase : 0.5626207148258029
area Under ROC LiftOff : 0.5770476860632803
area Under ROC Replace : 0.5965438220639617
area Under ROC BothReleased : 0.5812007968808708
Time taken for logistic regression 829.0077323913574s
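
The time printed above covers fitting and evaluating all six models together. To look at scoring time separately from training time, the per-label loop could be wrapped in a helper such as the one sketched below. `time_fit_and_score` and the `_Stub*` classes are hypothetical names introduced here; the stubs only stand in for Spark objects so that the sketch can be run without a cluster.

```python
import time


def time_fit_and_score(make_estimator, make_evaluator, train, valid, labels):
    """Return {label: {"auc", "fit_s", "score_s"}}, timing fit and scoring separately."""
    results = {}
    for label in labels:
        t0 = time.perf_counter()
        model = make_estimator(label).fit(train.select("features", label))
        t1 = time.perf_counter()
        # Spark's transform() is lazy, so the scoring work runs inside evaluate();
        # both calls are therefore timed together.
        predictions = model.transform(valid.select("features", label))
        auc = make_evaluator(label).evaluate(predictions)
        t2 = time.perf_counter()
        results[label] = {"auc": auc, "fit_s": t1 - t0, "score_s": t2 - t1}
    return results


# Stand-in objects (not Spark classes) used only to show the call shape.
class _StubFrame:
    def select(self, *cols):
        return self


class _StubModel:
    def transform(self, frame):
        return frame


class _StubEstimator:
    def fit(self, frame):
        return _StubModel()


class _StubEvaluator:
    def evaluate(self, predictions):
        return 0.5


demo = time_fit_and_score(lambda label: _StubEstimator(),
                          lambda label: _StubEvaluator(),
                          _StubFrame(), _StubFrame(), ["HandStart"])
print(demo)
```

With the notebook's objects, the call would pass `lambda l: LogisticRegression(regParam=0.01, maxIter=100, labelCol=l)` and `lambda l: BinaryClassificationEvaluator(labelCol=l)` together with `eeg_train`, `eeg_valid` and `labels`.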


## 2.2 Random Forest 

In [8]:
from pyspark.ml.classification import RandomForestClassifier


In [9]:
# Labels are the events that we are trying to classify
labels = ['HandStart', 'FirstDigitTouch', 'BothStartLoadPhase', 'LiftOff', 'Replace', 'BothReleased']
start = time.time()
# Fit and evaluate one binary random forest model per event
for label in labels:
    rf = RandomForestClassifier(maxDepth=10, labelCol=label)
    rfmodel = rf.fit(eeg_train.select('features', label))
    validpredicts = rfmodel.transform(eeg_valid.select('features', label))
    bceval = BinaryClassificationEvaluator(labelCol=label)
    auc = bceval.evaluate(validpredicts)
    print('area Under ROC ' + label + " : " + str(auc))
# Total wall-clock time for fitting and evaluating all six models
duration = time.time() - start
print('Time taken for Random Forest ' + str(duration) + 's')

area Under ROC HandStart : 0.878859706306801
area Under ROC FirstDigitTouch : 0.8473881048005849
area Under ROC BothStartLoadPhase : 0.8486340197761922
area Under ROC LiftOff : 0.8415285260668661
area Under ROC Replace : 0.8382513282108403
area Under ROC BothReleased : 0.8610798973492925
Time taken for Random Forest 1761.7716419696808s


## 2.3 Linear Support Vector Classifier

In [10]:
from pyspark.ml.classification import LinearSVC

In [11]:
# Labels are the events that we are trying to classify
labels = ['HandStart', 'FirstDigitTouch', 'BothStartLoadPhase', 'LiftOff', 'Replace', 'BothReleased']
start = time.time()
# Fit and evaluate one binary linear SVC model per event
for label in labels:
    svc = LinearSVC(labelCol=label)
    svcmodel = svc.fit(eeg_train.select('features', label))
    validpredicts = svcmodel.transform(eeg_valid.select('features', label))
    bceval = BinaryClassificationEvaluator(labelCol=label)
    auc = bceval.evaluate(validpredicts)
    print('areaUnderROC ' + label + " : " + str(auc))
# Total wall-clock time for fitting and evaluating all six models
duration = time.time() - start
print('Time taken for Linear SVC  ' + str(duration) + 's')

areaUnderROC HandStart : 0.5176370865239852
areaUnderROC FirstDigitTouch : 0.5469576112336251
areaUnderROC BothStartLoadPhase : 0.5352378380935301
areaUnderROC LiftOff : 0.536106045267316
areaUnderROC Replace : 0.5353413731113085
areaUnderROC BothReleased : 0.5303661353569502
Time taken for Linear SVC  6194.433146238327s


## 2.4 Gradient Boosted Trees

In [12]:
from pyspark.ml.classification import GBTClassifier

In [13]:
# Labels are the events that we are trying to classify
labels = ['HandStart', 'FirstDigitTouch', 'BothStartLoadPhase', 'LiftOff', 'Replace', 'BothReleased']
start = time.time()
# Fit and evaluate one binary gradient boosted trees model per event
for label in labels:
    gbt = GBTClassifier(maxIter=10, maxDepth=10, labelCol=label)
    gbtmodel = gbt.fit(eeg_train.select('features', label))
    validpredicts = gbtmodel.transform(eeg_valid.select('features', label))
    bceval = BinaryClassificationEvaluator(labelCol=label)
    auc = bceval.evaluate(validpredicts)
    print('areaUnderROC ' + label + " : " + str(auc))
# Total wall-clock time for fitting and evaluating all six models
duration = time.time() - start
print('Time taken for Gradient Boosted Tree  ' + str(duration) + 's')

areaUnderROC HandStart : 0.850708214941604
areaUnderROC FirstDigitTouch : 0.8312993115269941
areaUnderROC BothStartLoadPhase : 0.8184384403506808
areaUnderROC LiftOff : 0.8233671866533202
areaUnderROC Replace : 0.794779023843368
areaUnderROC BothReleased : 0.8525921633912922
Time taken for Gradient Boosted Tree  3941.667410850525s
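
To put the four models side by side, the AUC values and run times printed above can be gathered into one structure and summarized. The sketch below copies the numbers from the cell outputs; the `results` dictionary and the use of the mean AUC over the six events as a single summary figure are choices made here for illustration, not part of the original analysis.

```python
# Per-event AUC values and total run times copied from the cell outputs above.
results = {
    "Logistic Regression": {
        "auc": [0.5209817175371723, 0.5654686820170102, 0.5626207148258029,
                0.5770476860632803, 0.5965438220639617, 0.5812007968808708],
        "seconds": 829.0077323913574,
    },
    "Random Forest": {
        "auc": [0.878859706306801, 0.8473881048005849, 0.8486340197761922,
                0.8415285260668661, 0.8382513282108403, 0.8610798973492925],
        "seconds": 1761.7716419696808,
    },
    "Linear SVC": {
        "auc": [0.5176370865239852, 0.5469576112336251, 0.5352378380935301,
                0.536106045267316, 0.5353413731113085, 0.5303661353569502],
        "seconds": 6194.433146238327,
    },
    "Gradient Boosted Trees": {
        "auc": [0.850708214941604, 0.8312993115269941, 0.8184384403506808,
                0.8233671866533202, 0.794779023843368, 0.8525921633912922],
        "seconds": 3941.667410850525,
    },
}

# Mean AUC over the six events for each model.
mean_auc = {name: sum(r["auc"]) / len(r["auc"]) for name, r in results.items()}
best_by_auc = max(mean_auc, key=mean_auc.get)
fastest = min(results, key=lambda name: results[name]["seconds"])

# Print the models from highest to lowest mean AUC, with run time in minutes.
for name in sorted(mean_auc, key=mean_auc.get, reverse=True):
    minutes = results[name]["seconds"] / 60
    print(f"{name:<24} mean AUC {mean_auc[name]:.3f}   {minutes:6.1f} min")
```

On these numbers, Random Forest has the highest mean AUC (about 0.85) and Logistic Regression the shortest run time, while both linear models stay close to 0.5 AUC. The run times cover fitting and evaluation together, so they are not pure classification times.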
