<a href="https://colab.research.google.com/github/OungKennedy/EG3301R-Data-Classification/blob/master/Data_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Classification for Spectacle Dongle

This project is part of EG3301R. A dongle designed to attach to a pair of spectacles collects data such as time of flight (TOF), acceleration, yaw, pitch, roll and brightness.

This notebook documents the process of developing a model that classifies the user's actions from the collected data.

In [33]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


First Iteration
data by Bryan LMK, collected on 23 Jun 2020

In [34]:
# !pip install fastai==0.7.0
# !pip install scikit-learn==0.21.3

In [35]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [36]:
import os
os.chdir('/content/gdrive/My Drive/EG3301R EIM-328/Machine Learning')

In [37]:
# from fastai.imports import *
# from fastai.structured import *

# from pandas_summary import DataFrameSummary
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from IPython.display import display

from sklearn import metrics
from sklearn import model_selection

import joblib

In [47]:
data_path = 'data/23 Jun 2020'

## Compiling Data

In [9]:
data_path = 'data/23 Jun 2020'
os.listdir(data_path)

['Watching TV.txt',
 'Exercise_Jumping Jacks.txt',
 'UsingComputer&OccasionallyPhone.txt',
 'WalkingAroundHouse.txt']

### Examine Data

In [10]:
test_file = os.listdir(data_path)[0]
test_data = pd.read_csv(os.path.join(data_path,test_file), skiprows=0,delimiter=',',header=1)
# add label to data
label = test_file.split('.')[0]
test_data['label'] = label
test_data.head()

Unnamed: 0,Epoch,TOF_1,TOF_2,TOF_3,Accel,Yaw,Pitch,LDR,label
0,1592923701,178,180,185,10,93,95,101,Watching TV
1,1592923702,255,255,255,9,98,90,94,Watching TV
2,1592923704,255,255,255,10,92,90,101,Watching TV
3,1592923705,255,255,255,9,92,12,81,Watching TV
4,1592923707,255,255,255,9,91,6,86,Watching TV


In [11]:
# Read every recording, label it with its file name, and combine the results
frames = []
for txt_file in os.listdir(data_path):
    txt_path = os.path.join(data_path, txt_file)
    partial_data = pd.read_csv(txt_path, skiprows=0, delimiter=',', header=1)
    # the file name (without extension) is the activity label
    partial_data['label'] = txt_file.split('.')[0]
    frames.append(partial_data)
# DataFrame.append was removed in pandas 2.0; concat also rebuilds a clean 0..n-1 index
data = pd.concat(frames, ignore_index=True)
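
The feature-importance output later in this notebook lists names such as ` TOF_1` with a leading space, which suggests the text files separate fields with a comma followed by a space. The sketch below, assuming that is the case, shows one way to normalise column names at load time. The inline sample is hypothetical and only mimics the assumed layout; it is not taken from the recordings.

```python
import io

import pandas as pd

# Hypothetical sample mimicking the assumed ", "-separated layout (not real data).
raw = (
    "Epoch, TOF_1, TOF_2, LDR \n"
    "1592923701, 178, 180, 101\n"
    "1592923702, 255, 255, 94\n"
)

# skipinitialspace drops the blank after each comma, in the header and in the rows
df = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

# strip() removes any remaining whitespace, e.g. a trailing blank in the header
df.columns = df.columns.str.strip()

print(list(df.columns))
```

With clean names, columns can be selected as `df['TOF_1']` without having to remember the leading space.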

In [50]:
data.describe()

Unnamed: 0,Epoch,TOF_1,TOF_2,TOF_3,Accel,Yaw,Pitch,LDR
count,765.0,765.0,765.0,765.0,765.0,765.0,765.0,765.0
mean,1592923000.0,151.952941,150.972549,123.988235,9.371242,84.227451,63.662745,58.963399
std,914.5831,81.762764,83.891715,86.672239,0.709479,38.72325,41.602032,77.091679
min,1592922000.0,19.0,12.0,1.0,7.0,1.0,1.0,3.0
25%,1592922000.0,91.0,88.0,63.0,9.0,91.0,8.0,18.0
50%,1592922000.0,102.0,98.0,97.0,9.0,102.0,91.0,43.0
75%,1592924000.0,255.0,255.0,252.0,10.0,106.0,93.0,58.0
max,1592924000.0,255.0,255.0,255.0,15.0,176.0,122.0,569.0


Store compiled data

In [52]:
os.makedirs('compiled', exist_ok=True)
data.to_feather('compiled/compiled_data')
# can use pd.read_feather('compiled/compiled_data') to load data in the future

## Creating a classifier

### Random Forest

In [38]:
df_raw = pd.read_feather('compiled/compiled_data')
df_raw.head()

Unnamed: 0,Epoch,TOF_1,TOF_2,TOF_3,Accel,Yaw,Pitch,LDR,label
0,1592923701,178,180,185,10,93,95,101,Watching TV
1,1592923702,255,255,255,9,98,90,94,Watching TV
2,1592923704,255,255,255,10,92,90,101,Watching TV
3,1592923705,255,255,255,9,92,12,81,Watching TV
4,1592923707,255,255,255,9,91,6,86,Watching TV


Split the features and the dependent variable (the activity label) into separate variables.



In [39]:
X = df_raw.drop('label',axis=1)
y = df_raw['label']
print(y.head(), X.head())

0    Watching TV
1    Watching TV
2    Watching TV
3    Watching TV
4    Watching TV
Name: label, dtype: object         Epoch   TOF_1   TOF_2   TOF_3   Accel   Yaw   Pitch   LDR 
0  1592923701     178     180     185      10    93      95    101
1  1592923702     255     255     255       9    98      90     94
2  1592923704     255     255     255      10    92      90    101
3  1592923705     255     255     255       9    92      12     81
4  1592923707     255     255     255       9    91       6     86


Generate train and test sets

In [40]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3)
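
The split above is random and unseeded, so the reported accuracy can change from run to run, and the class balance of the test set is left to chance. A minimal sketch of a reproducible, stratified split follows. The `X_demo` and `y_demo` frames are hypothetical stand-ins for `X` and `y`, and the seed value is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for X and y: 40 rows, two unevenly sized classes.
X_demo = pd.DataFrame({"TOF_1": range(40), "LDR": range(40, 80)})
y_demo = pd.Series(["Watching TV"] * 30 + ["WalkingAroundHouse"] * 10)

# stratify keeps the class proportions equal in both subsets;
# random_state makes the shuffle repeatable across runs
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

print(y_te.value_counts())
```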

Create classifier and fit.

In [41]:
# Create a random forest classifier
classifier = RandomForestClassifier(n_estimators=100)

# Train the model on the training set
classifier.fit(X_train, y_train)

# Predict labels for the held-out test set
y_pred = classifier.predict(X_test)

View results

In [42]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 1.0
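
A single accuracy figure hides which activities, if any, are confused with each other. The sketch below shows how a confusion matrix and a per-class report could be produced with `sklearn.metrics`. The label lists are hypothetical and hand-written for illustration; in the notebook they would be `y_test` and `y_pred`.

```python
from sklearn import metrics

# Hypothetical true and predicted labels (not results from this notebook).
y_true_demo = ["Watching TV"] * 3 + ["WalkingAroundHouse"] * 3
y_pred_demo = ["Watching TV"] * 2 + ["WalkingAroundHouse"] * 4

labels = ["Watching TV", "WalkingAroundHouse"]

# Rows are the true classes, columns the predicted classes, in `labels` order
cm = metrics.confusion_matrix(y_true_demo, y_pred_demo, labels=labels)
print(cm)

# Precision, recall and F1 for each activity
print(metrics.classification_report(y_true_demo, y_pred_demo, labels=labels))
```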


Analyse features by importance.

In [43]:
feature_imp = pd.Series(classifier.feature_importances_,index=X.columns).sort_values(ascending=False)
feature_imp

Epoch     0.329434
 TOF_1    0.179682
 TOF_2    0.151009
 TOF_3    0.105195
 LDR      0.102931
 Yaw      0.100352
 Pitch    0.020341
 Accel    0.011057
dtype: float64

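The perfect accuracy is probably caused by the epoch number leaking the label rather than by genuine learning. Because the data was acquired in chronological order, presumably one activity after another, the epoch value alone is enough to identify the label, so the classifier can use it as a shortcut.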
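
This effect can be reproduced on synthetic data. The sketch below is not the notebook's dataset: it assumes two activities, `A` and `B`, recorded back to back, and a single sensor column that is pure noise. The helper `holdout_accuracy` is hypothetical and exists only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200

# Synthetic recording: the first half is activity "A", the second half "B"
demo = pd.DataFrame({
    "Epoch": np.arange(n),         # timestamp, increases with row order
    "noise": rng.normal(size=n),   # carries no information about the label
})
labels = np.array(["A"] * (n // 2) + ["B"] * (n // 2))


def holdout_accuracy(features):
    """Train a random forest on a random 70/30 split and return test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.3, random_state=0
    )
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_tr, y_tr)
    return model.score(X_te, y_te)


acc_with_epoch = holdout_accuracy(demo[["Epoch", "noise"]])
acc_without_epoch = holdout_accuracy(demo[["noise"]])
print(acc_with_epoch, acc_without_epoch)
```

If the timestamp is what drives the score, accuracy should be close to perfect with `Epoch` and close to chance without it, even though the only sensor column is uninformative.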

### Point for Consideration
Does this mean the epoch number is not a good feature for prediction? If it is not used, should it even be collected?

### Retraining without epoch number

In [44]:
X = df_raw.drop(['label','Epoch'],axis=1)
y = df_raw['label']
print(y.head(), X.head())

0    Watching TV
1    Watching TV
2    Watching TV
3    Watching TV
4    Watching TV
Name: label, dtype: object     TOF_1   TOF_2   TOF_3   Accel   Yaw   Pitch   LDR 
0     178     180     185      10    93      95    101
1     255     255     255       9    98      90     94
2     255     255     255      10    92      90    101
3     255     255     255       9    92      12     81
4     255     255     255       9    91       6     86


In [45]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3)
# Create a random forest classifier
classifier = RandomForestClassifier(n_estimators=100)

# Train the model on the training set
classifier.fit(X_train, y_train)

# Predict labels for the held-out test set
y_pred = classifier.predict(X_test)
# Model Accuracy, how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.9826086956521739
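
This accuracy comes from one random split of a small dataset, so it may shift noticeably with a different split. Cross-validation gives a mean and a spread instead of a single number. The sketch below uses a synthetic dataset from `make_classification` as a stand-in for `X` and `y`; the sample size, fold count and seed are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in: 7 features and 4 classes, loosely mirroring the notebook's shape
X_demo, y_demo = make_classification(
    n_samples=200, n_features=7, n_informative=4,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Five stratified folds; shuffling mixes the rows before they are assigned to folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)

# One accuracy score per fold
scores = cross_val_score(model, X_demo, y_demo, cv=cv)
print("mean accuracy:", scores.mean(), "std:", scores.std())
```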


In [46]:
feature_imp = pd.Series(classifier.feature_importances_,index=X.columns).sort_values(ascending=False)
feature_imp

 TOF_1    0.234034
 TOF_2    0.205942
 Yaw      0.183185
 TOF_3    0.176737
 LDR      0.166019
 Pitch    0.020783
 Accel    0.013299
dtype: float64

The TOF sensors differ in importance. This could reflect the user's habits, such as a tendency to lean to one side while performing a certain action like watching TV. Collecting data from more people would help to check this.
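
One way to look into this is to compare the average reading of each TOF sensor per activity. The sketch below uses a small hypothetical frame with made-up values, not the collected data; on the real data the same `groupby` could be applied to `df_raw`.

```python
import pandas as pd

# Hypothetical readings for illustration only (values are made up).
demo = pd.DataFrame({
    "TOF_1": [100, 110, 250, 240],
    "TOF_2": [200, 210, 250, 250],
    "label": ["Watching TV", "Watching TV",
              "WalkingAroundHouse", "WalkingAroundHouse"],
})

# Mean reading of each sensor for each activity
means = demo.groupby("label")[["TOF_1", "TOF_2"]].mean()

# A large difference between sensors for one activity hints at a one-sided posture
asymmetry = means["TOF_1"] - means["TOF_2"]
print(means)
print(asymmetry)
```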

### Saving model

In [48]:
save_filename = 'classifier_rf_160720'
joblib.dump(classifier,os.path.join(data_path, save_filename))
# Can use joblib.load() to load this model in the future

['data/23 Jun 2020/classifier_rf_160720']
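
A saved model can later be restored with `joblib.load` and used on new readings. The sketch below trains a tiny hypothetical model, writes it to a temporary directory, and reloads it; the file name `classifier_demo.joblib` and the two feature columns are assumptions for illustration. Note that the cell above saves the model inside `data_path`, so a later `os.listdir(data_path)` would also list the model file alongside the recordings.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Tiny hypothetical training set (values are made up).
train = pd.DataFrame({"TOF_1": [100, 105, 250, 255], "LDR": [20, 25, 90, 95]})
train_labels = ["Watching TV", "Watching TV",
                "WalkingAroundHouse", "WalkingAroundHouse"]
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(train, train_labels)

# Save to a temporary directory and load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "classifier_demo.joblib")
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# New readings must have the same columns, in the same order, as the training data
new_reading = pd.DataFrame({"TOF_1": [102], "LDR": [22]})
prediction = restored.predict(new_reading)
print(prediction)
```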