## Final Project

### Introduction
In this final project, available to Verified learners only, we'll attempt to predict the type of physical activity (e.g., walking, climbing stairs) from tri-axial smartphone accelerometer data. Smartphone accelerometers are very precise, and different physical activities give rise to different patterns of acceleration.

(Note on project availability: while project submission is only available to Verified learners, all learners are welcome to work on the project on their own and have access to the instructions describing the project.)

### Input Data
The input data used for training in this project consists of two files. The first file, train_time_series.csv, contains the raw accelerometer data, which has been collected using the Beiwe research platform, and it has the following format:
```
timestamp, UTC time, accuracy, x, y, z
```
You can use the timestamp column as your time variable; you'll also need the last three columns, here labeled x, y, and z, which correspond to measurements of linear acceleration along each of the three orthogonal axes.

The second file, train_labels.csv, contains the activity labels, and you'll be using these labels to train your model. Different activities have been numbered with integers. We use the following encoding: 1 = standing, 2 = walking, 3 = stairs down, 4 = stairs up. Because the accelerometers are sampled at high frequency, the labels in train_labels.csv are only provided for every 10th observation in train_time_series.csv.
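
Because only every 10th observation carries a label, one possible way to use all raw samples is to join the two tables on `timestamp` and propagate each label to the unlabeled rows before it. The sketch below uses made-up values in place of the real files, and the choice to back-fill (each label applies to the samples leading up to it) is an assumption, not something the project instructions specify.

```python
import pandas as pd

# Synthetic stand-ins for train_time_series.csv and train_labels.csv
# (values are made up for illustration only).
series = pd.DataFrame({
    "timestamp": range(1000, 1030),
    "x": [0.01 * i for i in range(30)],
    "y": [-1.0] * 30,
    "z": [0.1] * 30,
})
labels = pd.DataFrame({
    "timestamp": [1009, 1019, 1029],  # every 10th observation
    "label": [1, 1, 2],
})

# Left join keeps every raw sample; unlabeled rows get NaN in "label".
merged = series.merge(labels, on="timestamp", how="left")

# Assumption: a label describes the samples leading up to it, so back-fill.
merged["label"] = merged["label"].bfill().astype(int)

print(merged["label"].value_counts().sort_index())
```

In real data, any samples recorded after the final label would stay unlabeled after back-filling and would need to be dropped before converting to integers.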

### Activity Classification
Your goal is to classify different physical activities as accurately as possible. To test your code, you're also provided a file called test_time_series.csv, and at the end of the project you're asked to provide the activity labels predicted by your code for this test data set. Only the course staff have the corresponding true labels for the test data, and the accuracy of your code will be determined as the percentage of correct classifications. Note that in both cases, for training and testing, the input file consists of a single (3-dimensional) time series. To test the accuracy of your code, you'll be asked to upload your predictions as a CSV file. This file, test_labels.csv, is provided to you, but it contains only the timestamps needed for prediction; you'll need to augment it by adding the corresponding class predictions (1, 2, 3, 4).
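
As a rough illustration of the two steps described above, the following sketch computes accuracy as a percentage of correct classifications and fills the empty `label` column of a table laid out like test_labels.csv. The helper `accuracy_percent`, the predicted values, and the "true" labels are all hypothetical; the true test labels are held by the course staff.

```python
import io

import numpy as np
import pandas as pd

def accuracy_percent(y_true, y_pred):
    """Percentage of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return 100.0 * np.mean(y_true == y_pred)

# A few rows mimicking the layout of test_labels.csv (label column empty).
submission = pd.DataFrame({
    "timestamp": [1565110307041, 1565110308043, 1565110309046, 1565110310048],
    "label": [np.nan] * 4,
})

predicted = [2, 2, 3, 1]          # hypothetical model output
submission["label"] = predicted   # fill in the class predictions (1-4)

buffer = io.StringIO()            # stands in for a file on disk
submission.to_csv(buffer, index=False)

# Made-up "true" labels, only to exercise the accuracy helper.
pct = accuracy_percent([2, 3, 3, 1], predicted)
print(pct)
```

To write an actual file, the buffer can be swapped for a path such as `"test_labels.csv"`.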

### Code Run Time
In addition to providing the predictions, you're also asked to measure the running time of your code, starting at the moment you load the test data set and ending at the moment you finish computing your predictions. You'll be asked to enter this running time, and the goal is to see how fast your code runs compared to the code of others. Because computing speeds vary for several reasons, including hardware and implementation of the code, these numbers aren't directly comparable, and for this reason your grading will not be affected by them. However, it may still be interesting to you to see how long the code of other participants takes to solve the problem.
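
One simple way to take this measurement is to wrap the load-and-predict step in a function and time it with `time.perf_counter`. In the sketch below, `timed` and `load_and_predict` are hypothetical names, and the function body is a placeholder for the real loading and prediction code.

```python
import time

def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

def load_and_predict():
    # Placeholder for: read test_time_series.csv, build features, predict.
    return [2, 2, 3]

predictions, seconds = timed(load_and_predict)
print(f"Run time: {seconds:.4f} s")
```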

### Project Submission
You're expected to implement your solution using a Jupyter notebook. Once you're done, you're asked to upload the notebook, which will be peer reviewed by other course participants. This review will impact your final course grade, so you should write your code as clearly as possible, include comments, and use meaningful variable names.

You can approach this problem any way you'd like. It may be beneficial to search the web for possible solutions, or you may try to solve this problem from scratch. We recommend that you build your code in stages, encapsulating different parts (tasks) of the problem into functions. There are many ways to solve this problem. Good luck!
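
To illustrate what building the code in stages might look like, here is a minimal sketch that splits feature preparation into small functions. The function names and the particular features (acceleration magnitude plus a rolling mean and standard deviation) are assumptions chosen for illustration; they are not prescribed by the project.

```python
import numpy as np
import pandas as pd

def add_magnitude(df):
    """Add the Euclidean norm of the (x, y, z) acceleration vector."""
    out = df.copy()
    out["magnitude"] = np.sqrt(out["x"] ** 2 + out["y"] ** 2 + out["z"] ** 2)
    return out

def add_rolling_features(df, window=10):
    """Add rolling mean and standard deviation of the magnitude."""
    out = df.copy()
    rolling = out["magnitude"].rolling(window, min_periods=1)
    out["mag_mean"] = rolling.mean()
    out["mag_std"] = rolling.std().fillna(0.0)  # std of one sample is undefined
    return out

def build_features(df, window=10):
    """Chain the individual stages into one feature table."""
    return add_rolling_features(add_magnitude(df), window=window)

# Tiny made-up input, just to show the pipeline running end to end.
demo = pd.DataFrame({"x": [3.0, 0.0], "y": [4.0, 0.0], "z": [0.0, 1.0]})
features = build_features(demo, window=2)
print(features[["magnitude", "mag_mean", "mag_std"]])
```

Keeping each stage in its own function makes it easy to apply exactly the same transformation to the training and the test data.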

#### Reading the Data

In [1]:
import scipy.stats as ss
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


In [2]:
train_time = pd.read_csv("train_time_series.csv", index_col=0)
train_time

Unnamed: 0,timestamp,UTC time,accuracy,x,y,z
20586,1565109930787,2019-08-06T16:45:30.787,unknown,-0.006485,-0.934860,-0.069046
20587,1565109930887,2019-08-06T16:45:30.887,unknown,-0.066467,-1.015442,0.089554
20588,1565109930987,2019-08-06T16:45:30.987,unknown,-0.043488,-1.021255,0.178467
20589,1565109931087,2019-08-06T16:45:31.087,unknown,-0.053802,-0.987701,0.068985
20590,1565109931188,2019-08-06T16:45:31.188,unknown,-0.054031,-1.003616,0.126450
...,...,...,...,...,...,...
24325,1565110305638,2019-08-06T16:51:45.638,unknown,0.024384,-0.710709,0.030304
24326,1565110305738,2019-08-06T16:51:45.738,unknown,0.487228,-1.099136,-0.015213
24327,1565110305838,2019-08-06T16:51:45.838,unknown,0.369446,-0.968506,0.036713
24328,1565110305939,2019-08-06T16:51:45.939,unknown,0.167877,-0.802826,0.049805


In [3]:
train_labels = pd.read_csv("train_labels.csv", index_col=0)
train_labels

Unnamed: 0,timestamp,UTC time,label
20589,1565109931087,2019-08-06T16:45:31.087,1
20599,1565109932090,2019-08-06T16:45:32.090,1
20609,1565109933092,2019-08-06T16:45:33.092,1
20619,1565109934094,2019-08-06T16:45:34.094,1
20629,1565109935097,2019-08-06T16:45:35.097,1
...,...,...,...
24289,1565110302030,2019-08-06T16:51:42.030,4
24299,1565110303032,2019-08-06T16:51:43.032,4
24309,1565110304034,2019-08-06T16:51:44.034,4
24319,1565110305037,2019-08-06T16:51:45.037,4


#### Manipulating the Data

In [4]:
time_labels=train_time.loc[train_labels.index]
time_labels

Unnamed: 0,timestamp,UTC time,accuracy,x,y,z
20589,1565109931087,2019-08-06T16:45:31.087,unknown,-0.053802,-0.987701,0.068985
20599,1565109932090,2019-08-06T16:45:32.090,unknown,0.013718,-0.852371,-0.000870
20609,1565109933092,2019-08-06T16:45:33.092,unknown,0.145584,-1.007843,-0.036819
20619,1565109934094,2019-08-06T16:45:34.094,unknown,-0.099380,-1.209686,0.304489
20629,1565109935097,2019-08-06T16:45:35.097,unknown,0.082794,-1.001434,-0.025375
...,...,...,...,...,...,...
24289,1565110302030,2019-08-06T16:51:42.030,unknown,-0.641953,-1.469177,0.301041
24299,1565110303032,2019-08-06T16:51:43.032,unknown,-0.171616,-0.366074,-0.059082
24309,1565110304034,2019-08-06T16:51:44.034,unknown,0.401810,-1.077698,0.258911
24319,1565110305037,2019-08-06T16:51:45.037,unknown,0.330338,-1.470062,0.303894


In [5]:
time_labels["label"] = train_labels["label"]
time_labels

Unnamed: 0,timestamp,UTC time,accuracy,x,y,z,label
20589,1565109931087,2019-08-06T16:45:31.087,unknown,-0.053802,-0.987701,0.068985,1
20599,1565109932090,2019-08-06T16:45:32.090,unknown,0.013718,-0.852371,-0.000870,1
20609,1565109933092,2019-08-06T16:45:33.092,unknown,0.145584,-1.007843,-0.036819,1
20619,1565109934094,2019-08-06T16:45:34.094,unknown,-0.099380,-1.209686,0.304489,1
20629,1565109935097,2019-08-06T16:45:35.097,unknown,0.082794,-1.001434,-0.025375,1
...,...,...,...,...,...,...,...
24289,1565110302030,2019-08-06T16:51:42.030,unknown,-0.641953,-1.469177,0.301041,4
24299,1565110303032,2019-08-06T16:51:43.032,unknown,-0.171616,-0.366074,-0.059082,4
24309,1565110304034,2019-08-06T16:51:44.034,unknown,0.401810,-1.077698,0.258911,4
24319,1565110305037,2019-08-06T16:51:45.037,unknown,0.330338,-1.470062,0.303894,4


#### Model

In [6]:
X_train=time_labels[["x","y","z"]]
X_train

Unnamed: 0,x,y,z
20589,-0.053802,-0.987701,0.068985
20599,0.013718,-0.852371,-0.000870
20609,0.145584,-1.007843,-0.036819
20619,-0.099380,-1.209686,0.304489
20629,0.082794,-1.001434,-0.025375
...,...,...,...
24289,-0.641953,-1.469177,0.301041
24299,-0.171616,-0.366074,-0.059082
24309,0.401810,-1.077698,0.258911
24319,0.330338,-1.470062,0.303894


In [7]:
Y_train=time_labels["label"]
Y_train

20589    1
20599    1
20609    1
20619    1
20629    1
        ..
24289    4
24299    4
24309    4
24319    4
24329    4
Name: label, Length: 375, dtype: int64

In [8]:
from sklearn.model_selection import train_test_split
# Hold out 20% of the labeled rows; time_labels is the labeled frame built above
df_train_full, df_test = train_test_split(time_labels, test_size=0.2, random_state=1)

In [None]:
forest_classifier = RandomForestClassifier(random_state=1)
forest_classifier.fit(X_train,Y_train)

RandomForestClassifier(random_state=1)

#### Test

In [None]:
test_time = pd.read_csv("test_time_series.csv", index_col=0)
test_time

Unnamed: 0,timestamp,UTC time,accuracy,x,y,z
24330,1565110306139,2019-08-06T16:51:46.139,unknown,0.034286,-1.504456,0.157623
24331,1565110306239,2019-08-06T16:51:46.239,unknown,0.409164,-1.038544,0.030975
24332,1565110306340,2019-08-06T16:51:46.340,unknown,-0.234390,-0.984558,0.124771
24333,1565110306440,2019-08-06T16:51:46.440,unknown,0.251114,-0.787003,0.054810
24334,1565110306540,2019-08-06T16:51:46.540,unknown,0.109924,-0.169510,0.235550
...,...,...,...,...,...,...
25575,1565110430975,2019-08-06T16:53:50.975,unknown,0.036499,-0.724823,0.553802
25576,1565110431075,2019-08-06T16:53:51.075,unknown,-0.159241,0.307022,0.142410
25577,1565110431175,2019-08-06T16:53:51.175,unknown,-0.037964,-0.673706,1.065445
25578,1565110431275,2019-08-06T16:53:51.275,unknown,0.255707,-1.485397,-0.013336


In [None]:
test_labels = pd.read_csv("test_labels.csv", index_col=0)
test_labels

Unnamed: 0,timestamp,UTC time,label
24339,1565110307041,2019-08-06T16:51:47.041,
24349,1565110308043,2019-08-06T16:51:48.043,
24359,1565110309046,2019-08-06T16:51:49.046,
24369,1565110310048,2019-08-06T16:51:50.048,
24379,1565110311050,2019-08-06T16:51:51.050,
...,...,...,...
25539,1565110427366,2019-08-06T16:53:47.366,
25549,1565110428369,2019-08-06T16:53:48.369,
25559,1565110429371,2019-08-06T16:53:49.371,
25569,1565110430373,2019-08-06T16:53:50.373,


In [None]:
y_eval=forest_classifier.predict_proba(X_train)
y_eval


array([[0.76, 0.04, 0.18, 0.02],
       [0.68, 0.29, 0.  , 0.03],
       [0.71, 0.25, 0.04, 0.  ],
       ...,
       [0.  , 0.28, 0.01, 0.71],
       [0.  , 0.09, 0.12, 0.79],
       [0.  , 0.18, 0.12, 0.7 ]])

In [None]:
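# Columns of predict_proba follow the classes 1-4 in order,
# so the index of the largest probability plus one is the predicted label.
lbls_eval = []
for i in y_eval:
	lbls_eval.append(np.argmax(i)+1)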
lbls_eval

[1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 4,
 4,
 4,
 4,
 4,
 4,
 4,
 4,
 4,
 4,
 4,
 4,
 4,
 4,
 4,
 4,
 4,
 4,
 4,
 4,
 4,
 4,
 4,
 4,
 4,
 4,
 4,
 4,
 4,
 4,
 4,
 4,
 4,
 4,
 4,
 4,
 4,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,
 3,


In [None]:
# Fraction classified correctly, measured on the same data the model was fit on
(lbls_eval == Y_train).mean()

1.0
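
An accuracy of 1.0 measured on the training data says little about performance on unseen data, since a random forest can largely memorize its training set. The sketch below, which uses synthetic data with labels deliberately unrelated to the features, shows how a held-out split can expose that gap. All variable names here are illustrative and the data is not from the project files.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic 3-axis "accelerometer" values with labels unrelated to them,
# so nothing learned from the fit set can generalize.
X = rng.normal(size=(400, 3))
y = rng.integers(1, 5, size=400)  # labels 1..4

X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

clf = RandomForestClassifier(random_state=1).fit(X_fit, y_fit)

train_acc = accuracy_score(y_fit, clf.predict(X_fit))
val_acc = accuracy_score(y_val, clf.predict(X_val))
print(f"training accuracy: {train_acc:.2f}, held-out accuracy: {val_acc:.2f}")
```

For time series such as the accelerometer data, a random split can still be optimistic because neighbouring samples are similar; holding out a contiguous block of time is one alternative worth considering.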

In [None]:
copy_test_time=test_time.loc[test_labels.index]
copy_test_time

Unnamed: 0,timestamp,UTC time,accuracy,x,y,z
24339,1565110307041,2019-08-06T16:51:47.041,unknown,0.098282,-0.833771,0.118042
24349,1565110308043,2019-08-06T16:51:48.043,unknown,0.348465,-0.946701,-0.051041
24359,1565110309046,2019-08-06T16:51:49.046,unknown,0.377335,-0.849243,-0.026474
24369,1565110310048,2019-08-06T16:51:50.048,unknown,0.110077,-0.520325,0.312714
24379,1565110311050,2019-08-06T16:51:51.050,unknown,0.283478,-0.892548,-0.085876
...,...,...,...,...,...,...
25539,1565110427366,2019-08-06T16:53:47.366,unknown,-0.043915,-0.242416,0.068802
25549,1565110428369,2019-08-06T16:53:48.369,unknown,0.118271,-1.212097,0.357468
25559,1565110429371,2019-08-06T16:53:49.371,unknown,0.667404,-0.978851,0.171906
25569,1565110430373,2019-08-06T16:53:50.373,unknown,0.371384,-1.021927,-0.244446


In [None]:
X_test=copy_test_time[["x","y","z"]]
y_predict=forest_classifier.predict_proba(X_test)
y_predict

array([[0.1 , 0.5 , 0.28, 0.12],
       [0.03, 0.54, 0.27, 0.16],
       [0.  , 0.64, 0.1 , 0.26],
       [0.  , 0.52, 0.45, 0.03],
       [0.12, 0.54, 0.25, 0.09],
       [0.06, 0.55, 0.29, 0.1 ],
       [0.  , 0.57, 0.25, 0.18],
       [0.01, 0.3 , 0.29, 0.4 ],
       [0.  , 0.49, 0.11, 0.4 ],
       [0.  , 0.79, 0.06, 0.15],
       [0.  , 0.41, 0.43, 0.16],
       [0.  , 0.17, 0.76, 0.07],
       [0.17, 0.66, 0.07, 0.1 ],
       [0.06, 0.41, 0.15, 0.38],
       [0.04, 0.49, 0.16, 0.31],
       [0.17, 0.54, 0.12, 0.17],
       [0.  , 0.54, 0.35, 0.11],
       [0.  , 0.24, 0.76, 0.  ],
       [0.28, 0.51, 0.15, 0.06],
       [0.01, 0.8 , 0.14, 0.05],
       [0.04, 0.74, 0.07, 0.15],
       [0.03, 0.39, 0.43, 0.15],
       [0.07, 0.64, 0.16, 0.13],
       [0.05, 0.78, 0.09, 0.08],
       [0.08, 0.79, 0.06, 0.07],
       [0.  , 0.12, 0.69, 0.19],
       [0.01, 0.74, 0.1 , 0.15],
       [0.07, 0.77, 0.15, 0.01],
       [0.03, 0.72, 0.21, 0.04],
       [0.  , 0.43, 0.35, 0.22],
       [0.

In [None]:
labels = []
for i in y_predict:
	labels.append(np.argmax(i)+1)
labels

[2,
 2,
 2,
 2,
 2,
 2,
 2,
 4,
 2,
 2,
 3,
 3,
 2,
 2,
 2,
 2,
 2,
 3,
 2,
 2,
 2,
 3,
 2,
 2,
 2,
 3,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 2,
 1,
 1,
 2,
 3,
 3,
 3,
 3,
 2,
 2,
 1,
 3,
 2,
 3,
 4,
 3,
 3,
 2,
 2,
 2,
 2,
 2,
 3,
 3,
 3,
 2,
 3,
 2,
 2,
 2,
 2,
 2,
 3,
 3,
 3,
 3,
 3,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 3,
 4,
 2,
 3,
 2,
 2,
 3,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 2,
 3,
 2,
 2,
 2,
 2,
 2,
 2,
 2]
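
The `np.argmax(i)+1` step used above works because the classes happen to be the integers 1 to 4 in order. A slightly more general way to express the same idea is to index the fitted model's `classes_` attribute, or simply to call `predict`. The sketch below checks on synthetic data that the two routes agree; the data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))     # synthetic x, y, z values
y = rng.integers(1, 5, size=200)  # synthetic labels 1..4

clf = RandomForestClassifier(n_estimators=20, random_state=1).fit(X, y)

proba = clf.predict_proba(X)

# Column k of proba belongs to clf.classes_[k], so map argmax through classes_.
via_argmax = clf.classes_[np.argmax(proba, axis=1)]
via_predict = clf.predict(X)

print(np.array_equal(via_argmax, via_predict))
```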

In [None]:
test_labels["labels"] =labels
test_labels["labels"]

24339    2
24349    2
24359    2
24369    2
24379    2
        ..
25539    2
25549    2
25559    2
25569    2
25579    2
Name: labels, Length: 125, dtype: int64