<h1><center>Project: Physical Activity Monitoring</center></h1>
<h3><div style="text-align: right"> Course: Big Data Tools and Methods<br>Kernel: Apache Toree - Scala<br>Author: Xin Zhou</div></h3>
## Introduction
The paper "Introducing a New Benchmarked Dataset for Activity Monitoring" was published in 2012, and its dataset, PAMAP2, can be downloaded from the UCI Machine Learning Repository. The German Research Center for Artificial Intelligence (DFKI) established this dataset for research purposes and reported high accuracy on physical activity monitoring problems. In this project, the raw PAMAP2 data was preprocessed with a Python script named "csvConvert.py". To reach the benchmark classification accuracy reported in the paper, several preprocessing methods were tried in order to find the most important attributes. A decision tree classifier was used first to test the classification accuracy, and several other classifiers were then used for model comparison. Some problems came up during the experiments; they are discussed in this project.<br>

### Dataset Subjects Information
All data were recorded from a total of 9 subjects following a protocol. The minimum age is 23 and the maximum age is 31. Judging from the height and weight information, the subjects are in healthy shape. More specific information is included in "PAMAP2_Dataset", which can be downloaded at https://archive.ics.uci.edu/ml/datasets/PAMAP2+Physical+Activity+Monitoring <br>
A total of 54 attributes were collected from each subject, as shown below.
### PAMAP2 Dataset Attribute Information
The 54 columns in the data files are organized as follows: <br>
1.	timestamp (s) <br>
2.	activityID (see below for the mapping to the activities) <br>
3.	heart rate (bpm) <br>
4. ~ 20.	IMU hand <br>
21. ~ 37.	IMU chest <br>
38. ~ 54.	IMU ankle <br>

The IMU sensory data contains the following columns: <br>
1.	temperature (°C) <br>
2. ~ 4.	3D-acceleration data (m/s²), scale: ±16g, resolution: 13-bit <br>
5. ~ 7.	3D-acceleration data (m/s²), scale: ±6g, resolution: 13-bit <br>
8. ~ 10.	3D-gyroscope data (rad/s) <br>
11. ~ 13.	3D-magnetometer data (μT) <br>
14. ~ 17.	orientation (invalid in this data collection) <br>
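
The two lists above can be combined into a single lookup from a 1-based column number to the sensor and reading it holds. The following is a minimal sketch in plain Scala (no Spark needed). The object name `Pamap2Layout` and the field labels such as `acc16_x` are hypothetical names chosen for illustration; they are not the column names produced by "csvConvert.py".

```scala
object Pamap2Layout {
  // First column (1-based) of each 17-column IMU block, as listed above.
  val imuStart: Map[String, Int] = Map("hand" -> 4, "chest" -> 21, "ankle" -> 38)

  // Labels for the 17 columns inside one IMU block (hypothetical names).
  val imuFields: Vector[String] =
    Vector("temperature") ++
      Seq("acc16", "acc6", "gyro", "mag").flatMap(p => Seq("x", "y", "z").map(a => s"${p}_$a")) ++
      (1 to 4).map(i => s"orientation_$i")

  // Describe a 1-based column number of the raw .dat file.
  def describe(col: Int): String = col match {
    case 1 => "timestamp"
    case 2 => "activityID"
    case 3 => "heart_rate"
    case c if c >= 4 && c <= 54 =>
      // Pick the IMU block with the largest start column not exceeding c.
      val (sensor, start) = imuStart.toSeq.filter(_._2 <= c).maxBy(_._2)
      s"${sensor}_${imuFields(c - start)}"
    case _ =>
      throw new IllegalArgumentException(s"column $col is outside 1..54")
  }
}
```

For example, `Pamap2Layout.describe(11)` resolves to the first gyroscope axis of the hand IMU, since column 11 is the 8th column of the block that starts at column 4.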

List of activityIDs and corresponding activities: <br>
1	lying, 2	sitting, 3	standing, 4	walking, 5	running, 6	cycling, 7	Nordic walking, 9	watching TV, 10	computer work, 11	car driving, 12	ascending stairs, 13	descending stairs, 16	vacuum cleaning, 17	ironing, 18	folding laundry, 19	house cleaning, 20	playing soccer, 24	rope jumping, 0	other (transient activities)
### Preprocessing 
#### CSV Data File Generation
The Python script "csvConvert.py" generates a total of 10 .csv files from the 9 .dat files. One file, named "subjectsAll.csv", contains the attributes of all 9 subjects. The other 9 files are simply converted from subject101.dat ~ subject109.dat. During conversion, 5 extra attributes are added: age, height, weight, minBPM, and maxBPM. <br>
#### Drop Useless Attributes
The attributes in columns 17, 18, 19, and 20 always hold the same values, [1, 0, 0, 0]. The same is true of columns 34, 35, 36, 37 and columns 51, 52, 53, 54. These 12 columns are the orientation readings of the three IMU sensors, which are invalid in this data collection, so they are dropped. Column 1, the timestamp, is also dropped because no sequence models such as RNN or LSTM are used on this dataset. <br>
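
As a quick check on the column arithmetic, the dropped column numbers can be derived from the IMU block layout instead of being listed by hand. This is a small sketch in plain Scala; the object name `DroppedColumns` is hypothetical, and the block start columns (4, 21, 38) are taken from the attribute list above.

```scala
object DroppedColumns {
  // First column (1-based) of the hand, chest, and ankle IMU blocks.
  val imuStarts: Seq[Int] = Seq(4, 21, 38)

  // Orientation occupies positions 14..17 of each 17-column block,
  // i.e. offsets 13..16 from the block's first column.
  val orientationCols: Seq[Int] = imuStarts.flatMap(s => (s + 13) to (s + 16))

  // Orientation columns plus the timestamp in column 1.
  val dropped: Set[Int] = orientationCols.toSet + 1

  // Columns that remain: activityID, heart rate, and 13 readings per IMU.
  val kept: Seq[Int] = (1 to 54).filterNot(dropped)
}
```

Here `orientationCols` comes out as 17 to 20, 34 to 37, and 51 to 54. `kept` has 41 entries: the label column plus 40 feature columns (heart rate and 3 × 13 IMU readings).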
#### Dealing with NULL in the Heart Rate Attribute (bpm)
Many "null" values are recorded in the heart rate attribute. Several methods were tested, as listed below. <br>
1. Average all attributes within a time window chosen so that only one row in the window has valid bpm data.<br>
2. Drop the heart rate attribute "bpm".<br>
3. Drop only the rows with "null", and keep the rows that contain a valid bpm. <br><br>
With option 1, Apache Toree - Scala did not infer the double type from the csv file containing the averaged data, and specifying the schema with an explicit struct produced nulls, so option 1 was abandoned. <br>
Option 2 gives a low accuracy of about 87% in basic activity classification, whereas option 3 gives 99.9%, which means bpm is a very important attribute in this dataset. <br>
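
The difference between options 2 and 3 can be illustrated on a few in-memory rows. This is only a sketch in plain Scala under simplifying assumptions: the `Sample` case class, its fields, and the two helper functions are hypothetical, and a single `imu` vector stands in for all IMU readings. In the notebook itself, option 3 is done by `na.drop("any")` on the Spark DataFrame.

```scala
object BpmNullHandling {
  // One simplified row: bpm is None where the raw data holds "null".
  final case class Sample(label: Int, bpm: Option[Double], imu: Vector[Double])

  // Option 2: keep every row but leave the bpm attribute out of the features.
  def dropBpmColumn(rows: Seq[Sample]): Seq[(Int, Vector[Double])] =
    rows.map(r => (r.label, r.imu))

  // Option 3: keep only rows with a valid bpm and use bpm as the first feature.
  def dropNullRows(rows: Seq[Sample]): Seq[(Int, Vector[Double])] =
    rows.collect { case Sample(label, Some(bpm), imu) => (label, bpm +: imu) }
}
```

Option 3 keeps fewer rows, but every remaining row carries the heart rate, which is the attribute the accuracy comparison above identifies as important.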

### Decision Tree Classification

In [54]:
//Author: Xin Zhou
// Set up training files and decision tree model parameters
// subjectNumber can be an Int from 101 to 109, or the String "All" to load the data of all subjects
// activityMin and activityMax specify the range of activity IDs
// Basic Activity: activityMin = 1, activityMax = 5
// Background Activity: activityMin = 6, activityMax = 12
// "All activities" in the paper means the first 12 activities: activityMin = 1, activityMax = 12
val subjectNumber = "All"
val activityMin = 1
val activityMax = 5

In [55]:
//Author: Xin Zhou
val dir = new java.io.File(".").getCanonicalPath
val rawDF = spark.read.format("csv").option("inferSchema",true).option("header",true).load(dir+"/data/subject"+subjectNumber.toString+".csv")
// val df = spark.read.format("csv").schema(schema).option("header",false).load(dir+"/data/subject101.csv")
val dropDF = rawDF.na.drop("any").drop("o0", "o1", "o2", "o3", "o4", "o5", "o6", "o7", "o8", "o9",
                                  "o10", "o11", "time")
val df = dropDF.filter(dropDF("label") <= activityMax).filter(dropDF("label") >= activityMin)
// df.show(1)
// df.printSchema
import org.apache.spark.ml.feature.RFormula

// add extra subject information here in formula
// age+height+weight+minBPM+maxBPM+
val formula = new RFormula().setFormula("label ~ bpm+h0+h1+h2+h3+h4+h5+h6+h7+h8+h9+h10+h11+h12+c0+c1+c2+c3+c4+c5+c6+c7+c8+c9+c10+c11+c12+a0+a1+a2+a3+a4+a5+a6+a7+a8+a9+a10+a11+a12")
val preparedDF = formula.fit(df).transform(df)

val Array(trainingData, testData) = preparedDF.randomSplit(Array(0.7, 0.3))

//train a decision tree model
//Code below original from Spark 2.2.0 Documentation
//https://spark.apache.org/docs/2.2.0/ml-classification-regression.html
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

//training start time:
val t0 = System.currentTimeMillis()

val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel").fit(preparedDF)
val featureIndexer = new VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").setMaxCategories(activityMax+1-activityMin).fit(preparedDF)

val dt = new DecisionTreeClassifier().setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures")

val labelConverter = new IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)

val pipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, dt, labelConverter))

val model = pipeline.fit(trainingData)

//training end time:
val t1 = System.currentTimeMillis()

//evaluate
val predictions = model.transform(testData)

predictions.select("predictedLabel", "label", "features").show(5)

val evaluator = new MulticlassClassificationEvaluator().setLabelCol("indexedLabel").setPredictionCol("prediction").setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println("Decision Tree Test Accuracy = " + accuracy)
println("Test Error = " + (1.0 - accuracy))
println("Decision Tree Training Elapsed time: " + (t1 - t0)/1000 + "s")

+--------------+-----+--------------------+
|predictedLabel|label|            features|
+--------------+-----+--------------------+
|             1|    1|[78.0,31.4375,3.6...|
|             1|    1|[78.0,31.4375,3.6...|
|             1|    1|[78.0,31.4375,3.6...|
|             1|    1|[78.0,31.4375,3.7...|
|             1|    1|[78.0,31.4375,3.7...|
+--------------+-----+--------------------+
only showing top 5 rows

Decision Tree Test Accuracy = 0.9372873713981227
Test Error = 0.06271262860187732
Decision Tree Training Elapsed time: 40s


### Decision Tree Classification without Subjects Information
The 12 Activities setting covers activity IDs 1 to 12: 1 lying, 2 sitting, 3 standing, 4 walking, 5 running, 6 cycling, 7 Nordic walking, 9 watching TV, 10 computer work, 11 car driving, 12 ascending stairs <br>
Including label 0, "other (transient activities)", lowers the classification accuracy by 12%.

|  | Basic Activity | Background Activity | 12 Activities | 24 Activities |
|---|---|---|---|---|
|Subject 101| 99.99% | 100% | 99.49% | 85.78% |
|Subject 102| 99.53% | 100% | 98.34% | 92.55% |
|Subject 103| 99.96% | 100% | 100%   | 92.99% |
|Subject 104| 99.47% | 100% | 99.95% | 93.29% |
|Subject 105| 99.02% | | | |
|Subject 106| 99.96% | | | |
|Subject 107| 100%   | | | |
|Subject 108| 99.62% | | | |
|Subject 109| No Basic Activity | | | |
|All Subjects| 94.50% | 94.04% | 73.45% | 64.02% |


#### Comment: 
A 100% background activity accuracy is suspicious, so testing the remaining subjects was judged unnecessary. In the paper, the authors state that "all activities" means all 12 activities, but they do not say which 12, while the activity IDs in this dataset run up to 24. Moreover, the authors say that "the background activity is the other 6 of the 12 activities", which still leaves the background activity labels unclear. Comparing the paper's "all activities" with my "12 Activities", or comparing the background activity accuracies, is therefore not reasonable.

### Decision Tree Classification with Subjects Information
Added age, weight, height, resting bpm, max bpm

|  | Basic Activity | Background Activity | 12 Activities | 24 Activities |
|---|---|---|---|---|
|Subject 101| 99.99% | 100% | 99.63% | 85.31% |
|Subject 102| 99.53% | 100% | 98.34% | 81.35% |
|All Subjects|94.14% |94.34%| 72.80% | 63.90% |

#### Comment:
Across the 12 experiments shown above, the standard subject-dependent classification reaches almost 100% accuracy on basic and background activities, but classification over all subjects reaches only about 94%. The classifier performs very well on basic activities and background activities, but not on the first 12 activities or on all activities. For this reason, basic activity classification with the data of all subjects is used for the other classifiers. Compared with the accuracies obtained without subject information, adding subject information gives no meaningful difference or improvement in classification accuracy.
<br><br>
### Comparing to Paper Reiss2012b using Decision Tree Classifier
The standard accuracy is the average of the test accuracies of the 8 subjects that have basic activity data. The LOSO (leave-one-subject-out) column uses the accuracy from the "All Subjects" row, where the model was trained on data from all subjects. In this comparison, my results are very close to those of the paper.

| | Standard Basic Activity Accuracy | LOSO Basic Activity Accuracy |
|--- |---|---|
|Reiss2012| 99.70% | 94.47% | 
|Mine| 99.69% | 94.50% |
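
The 99.69% standard accuracy in the "Mine" row can be reproduced from the per-subject basic activity accuracies in the table without subject information. A short sketch in plain Scala follows; the object and value names are hypothetical.

```scala
object StandardAccuracy {
  // Basic activity test accuracy (%) per subject; subject 109 has no basic activity data.
  val basicActivityAccuracy: Seq[(Int, Double)] = Seq(
    101 -> 99.99, 102 -> 99.53, 103 -> 99.96, 104 -> 99.47,
    105 -> 99.02, 106 -> 99.96, 107 -> 100.0, 108 -> 99.62
  )

  // Unweighted mean over the 8 subjects.
  val mean: Double = basicActivityAccuracy.map(_._2).sum / basicActivityAccuracy.size

  // Rounded to two decimals for the comparison table.
  val rounded: Double = math.round(mean * 100) / 100.0
}
```

The accuracies sum to 797.55, so the mean is 99.69375, which rounds to the 99.69% reported above.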

## Other Classifiers on Apache Toree - Scala
1. Gradient-boosted tree: not usable on this dataset because it only supports binary classification
2. Naive Bayes: not usable because it cannot handle negative feature values
3. LSVM: not a good classifier for this nonlinear dataset
4. Random Forest: works slightly better than the decision tree

In [None]:
//Original from Spark 2.2.0 Documentation
//https://spark.apache.org/docs/2.2.0/ml-classification-regression.html
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

// using the same preparedDF in code block 2
//training starts
val t0 = System.currentTimeMillis()

val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel").fit(preparedDF)
// Automatically identify categorical features, and index them.
// maxCategories is set to activityMax+1-activityMin (5 for the basic activity range);
// features with more distinct values than that are treated as continuous.
val featureIndexer = new VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").setMaxCategories(activityMax+1-activityMin).fit(preparedDF)

// Split the data into training and test sets (30% held out for testing).
//val Array(trainingData, testData) = preparedDF.randomSplit(Array(0.7, 0.3))

// Train a RandomForest model.
val rf = new RandomForestClassifier().setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures").setNumTrees(10)

// Convert indexed labels back to original labels.
val labelConverter = new IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)

// Chain indexers and forest in a Pipeline.
val pipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))

// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)

//training ends
val t1 = System.currentTimeMillis()

// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5)

// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator().setLabelCol("indexedLabel").setPredictionCol("prediction").setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println("Random Forest Test Accuracy = " + accuracy)
println("Random Forest Training Elapsed time: " + (t1 - t0)/1000 + "s")

## Models Performance Comparison
The table below gives the range, from minimum to maximum, recorded over different validation runs. Decision Tree and Random Forest have almost the same accuracy and almost the same training time.

| | LOSO Basic Activity Accuracy | Training Time (s) |
|--- |---|---|
|Decision Tree| 93.72%~94.50% | 39~45 | 
|Random Forest| 93.98%~95.42% | 40~45 |

## Conclusion
Many classifiers were tested on the PAMAP2 dataset, such as Decision Tree, Random Forest, Naive Bayes, and so on. Only two of them, Decision Tree and Random Forest, were workable for this dataset on the Apache Toree - Scala kernel. Although some types of classifiers work better on specific datasets, the choice between these two classifiers made no apparent difference in classification accuracy on PAMAP2. The experiments show that preprocessing the raw data is the key to reaching the benchmark accuracy of the 2012 paper "Introducing a New Benchmarked Dataset for Activity Monitoring". Another observation is that the heart rate attribute plays a big role in this physical activity classification problem. However, other problems came up. For example, a person's activities always include transient activities. If label 0 (transient activities) is excluded from the test dataset, the accuracy is very high. If transient activities are included in the test dataset, the classifiers do not classify them well. A classifier like this would therefore not perform well in real time while a person is doing transient activities.