# Human Activity Recognizer

[![Twitter Follow](https://img.shields.io/twitter/follow/dialhaseeb?style=social)](https://www.twitter.com/dialhaseeb)

![Logo](https://github.com/zenyc/zenyc/blob/master/logo-small.png)

## 🕯 About
**human-activity-recognizer** is a *machine learning model* that predicts human activity from smartphone sensor data. It was trained using XGBoost (eXtreme Gradient Boosting).


## Before we begin, let's configure a few things so that the notebook runs both on your local machine and on *Google's Colaboratory*

1- If you are running locally, run the following cell:

In [31]:
proj_dir = "proj-dir/"

2- If you are running on *Colab*:
- Make sure you have uploaded all the project files to your *Google Drive*. Then, mount your drive by running the following cell:

In [6]:
from google.colab import drive
drive.mount("/content/drive")

- Then write out the path to the project files relative to your drive's root directory after `/content/drive/My Drive/` in the following cell:

In [7]:
# Keep the trailing slash: later cells build paths as proj_dir + "<file name>".
proj_dir = "/content/drive/My Drive/Projects/human-activity-recognizer/" + "proj-dir/"
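
The two cases above could also be selected automatically. The following is a minimal sketch, not part of the original notebook: the helper name `resolve_proj_dir` and its arguments are hypothetical, and it assumes the common convention that `google.colab` is already present in `sys.modules` inside a Colab runtime.

```python
import sys


def resolve_proj_dir(
    local_dir="proj-dir/",
    colab_dir="/content/drive/My Drive/Projects/human-activity-recognizer/proj-dir/",
    modules=None,
):
    """Return the project directory for the current environment (hypothetical helper)."""
    modules = sys.modules if modules is None else modules
    # Assumption: Colab runtimes preload the google.colab package.
    chosen = colab_dir if "google.colab" in modules else local_dir
    # Later cells build paths as proj_dir + "file.txt", so keep a trailing slash.
    return chosen if chosen.endswith("/") else chosen + "/"


proj_dir = resolve_proj_dir()
print(proj_dir)
```

On Colab the drive still has to be mounted first, as in the cell above; the helper only chooses the path.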

## Next up, let's import everything we need

If you don't have XGBoost installed, run the following first:

In [None]:
!pip install --upgrade xgboost



In [112]:
import pandas as pd
from tensorflow.keras import models
from tensorflow.keras import layers
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.metrics import accuracy_score
import xgboost as xgb
import numpy as np
import joblib

## We will solve this problem using different models. We will start by using simple Decision Trees and then go about trying Support Vector Machines, then XGBoost, and finally we will do Grid Search with XGBoost

Let's first have a quick review of how the sklearn API works. We will do this by using the Iris dataset.

In [7]:
data = datasets.load_iris()

In [8]:
data.keys()

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

In [9]:
print(data['DESCR'])

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

In [10]:
X,y = data["data"], data["target"]

In [11]:
X.shape

(150, 4)

In [12]:
y.shape

(150,)

In [14]:
clf = DecisionTreeClassifier()

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)

In [18]:
X_train.shape

(120, 4)
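
Because `train_test_split` shuffles the rows randomly, the split (and therefore the scores further down) can change from run to run. The sketch below shows one way to make the split repeatable. The `random_state=42` and `stratify=y` arguments are choices made here for illustration, not settings used in the original run.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)


def split(seed):
    # stratify=y keeps the three classes equally represented in both parts.
    return train_test_split(X, y, test_size=0.2, random_state=seed, stratify=y)


X_train_a, X_test_a, y_train_a, y_test_a = split(42)
X_train_b, X_test_b, y_train_b, y_test_b = split(42)

print(X_train_a.shape, X_test_a.shape)     # sizes of the 80/20 split
print(np.array_equal(X_test_a, X_test_b))  # same seed gives the same split
print(np.bincount(y_test_a))               # class counts in the test part
```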

In [19]:
clf.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [20]:
clf.score(X_test, y_test)

0.9666666666666667

## We got about 96.7% accuracy! Now let's try XGBoost:

In [22]:
Dtrain = xgb.DMatrix(X_train, y_train)

In [23]:
Dtrain

<xgboost.core.DMatrix at 0x7fd3de6c49e8>

In [24]:
Dtest = xgb.DMatrix(X_test, y_test)

In [25]:
param = {
    'eta': 0.3, 
    'max_depth': 3,  
    'objective': 'multi:softprob',  
    'num_class': 3}

In [26]:
steps = 20

In [27]:
model = xgb.train(param, Dtrain, steps)

In [30]:
preds = model.predict(Dtest)

In [31]:
preds.shape

(30, 3)

In [32]:
best_preds = np.asarray([np.argmax(line) for line in preds])

In [33]:
best_preds

array([1, 1, 1, 0, 0, 0, 2, 2, 2, 2, 2, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 2,
       1, 1, 1, 0, 0, 1, 2, 0])

In [34]:
best_preds.shape

(30,)

In [35]:
print(f"Accuracy = {accuracy_score(y_test, best_preds)}")

Accuracy = 1.0
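
With `multi:softprob`, each row of `preds` holds one probability per class, which is why its shape is `(30, 3)`. The list comprehension above picks the most probable class row by row. As a small sketch, the same step can be written in vectorized form; the probability values below are made up for illustration and are not model output.

```python
import numpy as np

# Illustrative class probabilities for 4 samples and 3 classes (made-up values).
preds = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
])
y_true = np.array([0, 1, 2, 1])

# Row-wise argmax, as in the notebook cell.
best_loop = np.asarray([np.argmax(row) for row in preds])
# Same result in a single vectorized call.
best_preds = np.argmax(preds, axis=1)

accuracy = np.mean(best_preds == y_true)  # fraction of correct predictions
print(best_preds, accuracy)
```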


## That's 100 percent! Okay, now let's get back to our original dataset:

In [77]:
data = pd.read_csv(proj_dir+"X_train.txt", header=None, sep=" ")

In [79]:
data

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,552,553,554,555,556,557,558,559,560,561
0,,0.288585,-0.020294,-0.132905,-0.995279,-0.983111,-0.913526,-0.995112,-0.983185,-0.923527,...,-0.074323,-0.298676,-0.710304,-0.112754,0.030400,-0.464761,-0.018446,-0.841247,0.179941,-0.058627
1,,0.278419,-0.016411,-0.123520,-0.998245,-0.975300,-0.960322,-0.998807,-0.974914,-0.957686,...,0.158075,-0.595051,-0.861499,0.053477,-0.007435,-0.732626,0.703511,-0.844788,0.180289,-0.054317
2,,0.279653,-0.019467,-0.113462,-0.995380,-0.967187,-0.978944,-0.996520,-0.963668,-0.977469,...,0.414503,-0.390748,-0.760104,-0.118559,0.177899,0.100699,0.808529,-0.848933,0.180637,-0.049118
3,,0.279174,-0.026201,-0.123283,-0.996091,-0.983403,-0.990675,-0.997099,-0.982750,-0.989302,...,0.404573,-0.117290,-0.482845,-0.036788,-0.012892,0.640011,-0.485366,-0.848649,0.181935,-0.047663
4,,0.276629,-0.016570,-0.115362,-0.998139,-0.980817,-0.990482,-0.998321,-0.979672,-0.990441,...,0.087753,-0.351471,-0.699205,0.123320,0.122542,0.693578,-0.615971,-0.847865,0.185151,-0.043892
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7347,,0.299665,-0.057193,-0.181233,-0.195387,0.039905,0.077078,-0.282301,0.043616,0.060410,...,-0.070157,-0.588433,-0.880324,-0.190437,0.829718,0.206972,-0.425619,-0.791883,0.238604,0.049819
7348,,0.273853,-0.007749,-0.147468,-0.235309,0.004816,0.059280,-0.322552,-0.029456,0.080585,...,0.165259,-0.390738,-0.680744,0.064907,0.875679,-0.879033,0.400219,-0.771840,0.252676,0.050053
7349,,0.273387,-0.017011,-0.045022,-0.218218,-0.103822,0.274533,-0.304515,-0.098913,0.332584,...,0.195034,0.025145,-0.304029,0.052806,-0.266724,0.864404,0.701169,-0.779133,0.249145,0.040811
7350,,0.289654,-0.018843,-0.158281,-0.219139,-0.111412,0.268893,-0.310487,-0.068200,0.319473,...,0.013865,0.063907,-0.344314,-0.101360,0.700740,0.936674,-0.589479,-0.785181,0.246432,0.025339


## Doing some data cleaning:

In [80]:
with open(proj_dir+"features.txt") as file:
    columns = file.readlines()

In [81]:
df = data.dropna(axis=1)

In [82]:
df

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,552,553,554,555,556,557,558,559,560,561
0,0.288585,-0.020294,-0.132905,-0.995279,-0.983111,-0.913526,-0.995112,-0.983185,-0.923527,-0.934724,...,-0.074323,-0.298676,-0.710304,-0.112754,0.030400,-0.464761,-0.018446,-0.841247,0.179941,-0.058627
1,0.278419,-0.016411,-0.123520,-0.998245,-0.975300,-0.960322,-0.998807,-0.974914,-0.957686,-0.943068,...,0.158075,-0.595051,-0.861499,0.053477,-0.007435,-0.732626,0.703511,-0.844788,0.180289,-0.054317
2,0.279653,-0.019467,-0.113462,-0.995380,-0.967187,-0.978944,-0.996520,-0.963668,-0.977469,-0.938692,...,0.414503,-0.390748,-0.760104,-0.118559,0.177899,0.100699,0.808529,-0.848933,0.180637,-0.049118
3,0.279174,-0.026201,-0.123283,-0.996091,-0.983403,-0.990675,-0.997099,-0.982750,-0.989302,-0.938692,...,0.404573,-0.117290,-0.482845,-0.036788,-0.012892,0.640011,-0.485366,-0.848649,0.181935,-0.047663
4,0.276629,-0.016570,-0.115362,-0.998139,-0.980817,-0.990482,-0.998321,-0.979672,-0.990441,-0.942469,...,0.087753,-0.351471,-0.699205,0.123320,0.122542,0.693578,-0.615971,-0.847865,0.185151,-0.043892
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7347,0.299665,-0.057193,-0.181233,-0.195387,0.039905,0.077078,-0.282301,0.043616,0.060410,0.210795,...,-0.070157,-0.588433,-0.880324,-0.190437,0.829718,0.206972,-0.425619,-0.791883,0.238604,0.049819
7348,0.273853,-0.007749,-0.147468,-0.235309,0.004816,0.059280,-0.322552,-0.029456,0.080585,0.117440,...,0.165259,-0.390738,-0.680744,0.064907,0.875679,-0.879033,0.400219,-0.771840,0.252676,0.050053
7349,0.273387,-0.017011,-0.045022,-0.218218,-0.103822,0.274533,-0.304515,-0.098913,0.332584,0.043999,...,0.195034,0.025145,-0.304029,0.052806,-0.266724,0.864404,0.701169,-0.779133,0.249145,0.040811
7350,0.289654,-0.018843,-0.158281,-0.219139,-0.111412,0.268893,-0.310487,-0.068200,0.319473,0.101702,...,0.013865,0.063907,-0.344314,-0.101360,0.700740,0.936674,-0.589479,-0.785181,0.246432,0.025339
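
The all-NaN column 0 that `dropna(axis=1)` removes suggests that each line of the file begins with a separator, so `sep=" "` produces an empty first field. As a hedged alternative, a whitespace regex separator avoids the empty column in the first place. The sample text below is made up to mimic that layout.

```python
import io

import pandas as pd

# Two made-up rows: a leading space, space-separated floats, one double space.
sample = " 0.288585 -0.020294 -0.132905\n 0.278419  -0.016411 -0.123520\n"

# sep=r"\s+" treats any run of whitespace as one separator and skips leading
# whitespace, so no empty column is produced.
df_sample = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None)
print(df_sample.shape)
```

For the real file this would look like `pd.read_csv(proj_dir + "X_train.txt", header=None, sep=r"\s+")`, after which the `dropna` step should not be needed.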


In [83]:
cols = []
for c in columns:
    cols.append(c.strip("\n"))

In [84]:
df.columns = cols
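
The loop that builds `cols` can be written more compactly. This is a minimal sketch using an in-memory stand-in for `features.txt`; the two feature names are illustrative assumptions about the file's contents.

```python
import io

# Stand-in for open(proj_dir + "features.txt"); the two names are illustrative.
fake_file = io.StringIO("1 tBodyAcc-mean()-X\n2 tBodyAcc-mean()-Y\r\n")

# strip() removes the trailing newline (and any stray "\r" or spaces).
cols = [line.strip() for line in fake_file]
print(cols)
```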

In [85]:
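# Note: without header=None, pandas treats the first label as a column header,
# so one label is dropped from the data.
labels = pd.read_csv(proj_dir+"y_train.txt")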

In [86]:
labels = labels.to_numpy()

In [87]:
x_train = df.to_numpy()

In [88]:
y_train = labels

## Now let's define the model:

In [89]:
model = DecisionTreeClassifier()

In [90]:
model.fit(x_train[0:7351], np.squeeze(y_train))

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [91]:
x_train.shape

(7352, 561)

In [92]:
y_train.shape

(7351, 1)
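
The shapes above differ by one row (7352 feature rows, 7351 labels). The likely cause is that `pd.read_csv` uses the first line of `y_train.txt` as a header by default. The sketch below reproduces the effect on three made-up labels and shows how `header=None` keeps every line.

```python
import io

import pandas as pd

label_text = "5\n5\n4\n"  # three made-up labels, one per line

# Default behaviour: the first line becomes the column header, so one label is lost.
with_header = pd.read_csv(io.StringIO(label_text))
# header=None keeps every line as data.
no_header = pd.read_csv(io.StringIO(label_text), header=None)

print(with_header.shape, no_header.shape)
y = no_header.to_numpy().ravel()  # 1-D label vector, no np.squeeze needed
```

With every label kept, the `x_train[0:7351]` slice would no longer be needed. Note also that the lost label is the first one, while the slice drops the last feature row, so in the run shown here each feature row appears to be paired with the following row's label.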

## Now let's evaluate it:

In [94]:
data = pd.read_csv(proj_dir+"X_test.txt", header=None, sep=" ")

In [95]:
data

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,552,553,554,555,556,557,558,559,560,561
0,,0.257178,-0.023285,-0.014654,-0.938404,-0.920091,-0.667683,-0.952501,-0.925249,-0.674302,...,0.071645,-0.330370,-0.705974,0.006462,0.162920,-0.825886,0.271151,-0.720009,0.276801,-0.057978
1,,0.286027,-0.013163,-0.119083,-0.975415,-0.967458,-0.944958,-0.986799,-0.968401,-0.945823,...,-0.401189,-0.121845,-0.594944,-0.083495,0.017500,-0.434375,0.920593,-0.698091,0.281343,-0.083898
2,,0.275485,-0.026050,-0.118152,-0.993819,-0.969926,-0.962748,-0.994403,-0.970735,-0.963483,...,0.062891,-0.190422,-0.640736,-0.034956,0.202302,0.064103,0.145068,-0.702771,0.280083,-0.079346
3,,0.270298,-0.032614,-0.117520,-0.994743,-0.973268,-0.967091,-0.995274,-0.974471,-0.968897,...,0.116695,-0.344418,-0.736124,-0.017067,0.154438,0.340134,0.296407,-0.698954,0.284114,-0.077108
4,,0.274833,-0.027848,-0.129527,-0.993852,-0.967445,-0.978295,-0.994111,-0.965953,-0.977346,...,-0.121711,-0.534685,-0.846595,-0.002223,-0.040046,0.736715,-0.118545,-0.692245,0.290722,-0.073857
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2942,,0.310155,-0.053391,-0.099109,-0.287866,-0.140589,-0.215088,-0.356083,-0.148775,-0.232057,...,0.074472,-0.376278,-0.750809,-0.337422,0.346295,0.884904,-0.698885,-0.651732,0.274627,0.184784
2943,,0.363385,-0.039214,-0.105915,-0.305388,0.028148,-0.196373,-0.373540,-0.030036,-0.270237,...,0.101859,-0.320418,-0.700274,-0.736701,-0.372889,-0.657421,0.322549,-0.655181,0.273578,0.182412
2944,,0.349966,0.030077,-0.115788,-0.329638,-0.042143,-0.250181,-0.388017,-0.133257,-0.347029,...,-0.066249,-0.118854,-0.467179,-0.181560,0.088574,0.696663,0.363139,-0.655357,0.274479,0.181184
2945,,0.237594,0.018467,-0.096499,-0.323114,-0.229775,-0.207574,-0.392380,-0.279610,-0.289477,...,-0.046467,-0.205445,-0.617737,0.444558,-0.819188,0.929294,-0.008398,-0.659719,0.264782,0.187563


In [97]:
# Note: as with y_train.txt, the first label is consumed as a header here.
labels = pd.read_csv(proj_dir+"y_test.txt")

In [98]:
labels = labels.to_numpy()

In [99]:
x_test = data.dropna(axis=1).to_numpy()

In [100]:
y_test = labels

In [107]:
# Drop the last feature row so that x_test has as many rows as y_test.
x_test = x_test[:-1]

In [108]:
x_test.shape

(2946, 561)

In [109]:
y_test.shape

(2946, 1)

In [113]:
joblib.dump(model,proj_dir+"./trained_tree.sav")

['proj-dir/./trained_tree.sav']

In [110]:
model.score(x_test, np.squeeze(y_test))

0.7905634758995248

## 79% accuracy, fair enough. But we can do better...

## Let's try SVC:

In [114]:
model = SVC()

In [115]:
model.fit(x_train[0:7351], np.squeeze(y_train))

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [116]:
model.score(x_test, np.squeeze(y_test))

0.9164969450101833

In [117]:
joblib.dump(model,proj_dir+"./trained_svc.sav")

['proj-dir/./trained_svc.sav']
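
A model saved this way can be restored later with `joblib.load`. The following sketch shows the round trip; a plain dictionary stands in for the trained model and a temporary directory stands in for `proj_dir`, so the example runs on its own.

```python
import os
import tempfile

import joblib

# A plain dict stands in for the trained model so the sketch is self-contained.
model = {"kind": "placeholder", "C": 1.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "trained_svc.sav")
    joblib.dump(model, path)      # returns the list of files written
    restored = joblib.load(path)  # reads the object back from disk

print(restored == model)
```

A restored scikit-learn model can then be scored directly, for example with `restored.score(x_test, np.squeeze(y_test))`.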

## Wow, we got 91.6% accuracy! But let's see what XGBoost has to offer

In [None]:
# multi:softprob is a multi-class objective; reg:linear is a regression objective.
xg_clf = xgb.XGBClassifier(objective='multi:softprob', colsample_bytree=0.3, learning_rate=0.1,
                           max_depth=5, reg_alpha=10, n_estimators=10)

In [25]:
param = {
    'eta': 0.3, 
    'max_depth': 3,  
    'objective': 'multi:softprob',  
    'num_class': 3}

In [26]:
steps = 20

In [27]:
model = xgb.train(param, Dtrain, steps)

In [30]:
preds = model.predict(Dtest)
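
The cells above still use the Iris setup (`'num_class': 3` and the Iris `Dtrain`). To train on the activity data with `xgb.train`, the labels need to be integers in the range `0` to `num_class - 1`, and `num_class` has to match the number of distinct activities. The sketch below shows one way to prepare such labels. The label values are made up, and it is an assumption that the real labels do not start at 0.

```python
import numpy as np

# Made-up activity labels; assumed to be integers that do not start at 0.
y_raw = np.array([5, 5, 4, 1, 6, 2, 3, 1])

# classes holds the sorted distinct labels; y_encoded indexes into it (0-based).
classes, y_encoded = np.unique(y_raw, return_inverse=True)
num_class = len(classes)

# After predicting 0-based class indices, map them back to the original labels.
decoded = classes[y_encoded]
print(num_class, y_encoded)
```

The encoded labels and `num_class` could then be passed to `xgb.DMatrix(x_train, y_encoded)` and to the `param` dictionary, respectively.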

# The End?

## 👀 Contact

If you want to contact me you can reach me at <zenyc@live.com>.