`r knitr::opts_chunk$set(cache = TRUE)`
Predictive Model of Quality of Weight-Lifting Exercises
=======================================================
## Synopsis
A Random Forest model is trained to recognise different variations in the quality of weight-lifting activity. The model achieves an estimated out-of-sample accuracy of approximately 99%.
<br><br>
## I. Data Processing
First, we download a dataset containing sensor measurement output from a number of volunteers performing five variations of a barbell weight lift.
```{r}
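# Download data and load 'caret' package for modeling.
# setInternet2() exists only on Windows builds of older R versions.
if (exists('setInternet2')) setInternet2(TRUE)
# stringsAsFactors = TRUE reads text columns as factors (the default before
# R 4.0), which the column clean-up below relies on.
dat <- read.csv('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv',
                stringsAsFactors = TRUE)
library(caret)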
```
The downloaded dataset contains a total of **`r nrow(dat)`** labeled observations.
To clean up this dataset for analysis and modeling, we remove columns that:
* Are irrelevant features, such as timestamps and subject names
* Contain mostly "NA" values
* Are predominantly blank
``` {r}
# Remove unnecessary non-feature columns
nonFeatureColNames <-
  c('X', 'user_name', 'raw_timestamp_part_1', 'raw_timestamp_part_2',
    'cvtd_timestamp', 'new_window', 'num_window')
dat <- dat[, -match(nonFeatureColNames, names(dat))]
rm(nonFeatureColNames)
# Remove columns with over 30% of values being NAs
naPropUnacceptable <- 0.3
naProp <- colSums(is.na(dat)) / nrow(dat)
naColNums <- (1:ncol(dat))[naProp > naPropUnacceptable]
dat <- dat[, -naColNums]
rm(naPropUnacceptable, naProp, naColNums)
# Remove columns with many blanks (read as "factor" class)
colClasses <- sapply(dat, class)
factorColNums <- (1:ncol(dat))[colClasses == 'factor']
# Keep the last factor column: it is the outcome variable 'classe'
factorColNums <- factorColNums[1 : (length(factorColNums) - 1)]
dat <- dat[, -factorColNums]
rm(colClasses, factorColNums)
numFeats <- ncol(dat) - 1
```
The number of features used for modelling is **`r numFeats`**.
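The NA-proportion filter can be illustrated on a small, made-up data frame. The sketch below is only an illustration under assumptions: `toy`, `toyClean`, and the column values are hypothetical, and it uses logical indexing rather than negative column numbers, a variant that still keeps every column when none exceeds the threshold.

```r
# Hypothetical toy data: column 'b' is mostly NA, 'a' and 'c' are complete
toy <- data.frame(a = 1:5,
                  b = c(NA, NA, NA, 4, 5),
                  c = c(2.5, 3.1, 0.7, 1.8, 4.4))

naPropUnacceptable <- 0.3
# Proportion of NA values in each column
naProp <- colSums(is.na(toy)) / nrow(toy)

# Keep only the columns at or below the threshold (logical indexing)
toyClean <- toy[, naProp <= naPropUnacceptable, drop = FALSE]
names(toyClean)  # "a" "c"
```

Here column `b` is 60% NA and is dropped, while `a` and `c` are kept.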
Lastly, we randomly split the dataset into a Training set (60%) and a Testing set (40%).
```{r}
# Split data set into Training & Testing set
set.seed(313)
indcsTrain <- createDataPartition(y = dat$classe, p = 0.6, list = FALSE)
datTrain <- dat[indcsTrain, ]
datTest <- dat[-indcsTrain, ]
rm(dat, indcsTrain)
```
The Training set contains **`r nrow(datTrain)`** observations, and the Testing set contains **`r nrow(datTest)`**.
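For a factor outcome, `createDataPartition` samples within each class, so both subsets keep roughly the original class proportions. The idea can be sketched in base R as follows. This is not caret's actual implementation, only a simplified illustration; the built-in `iris` data and the names `rowsByClass`, `trainIdx`, `toyTrain`, and `toyTest` are stand-ins chosen for the example.

```r
set.seed(313)
# Group row indices by class, then sample 60% of the rows within each class
rowsByClass <- split(seq_len(nrow(iris)), iris$Species)
trainIdx <- unlist(lapply(rowsByClass, function(rows) {
  sample(rows, size = ceiling(0.6 * length(rows)))
}))

toyTrain <- iris[trainIdx, ]
toyTest  <- iris[-trainIdx, ]

# Class counts in each subset: the class balance is preserved in both
table(toyTrain$Species)
table(toyTest$Species)
```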
<br><br>
## II. Training a Random Forest Classification Model
A Random Forest model is fit to the Training set to predict the activity quality variable **classe** from the `r numFeats` selected features. Training uses **4-fold cross-validation** to select the model's tuning parameter.
``` {r, results = 'hide'}
# Train a Random Forest model
model <- train(classe ~ ., data = datTrain, method = 'rf',
               trControl = trainControl(method = "cv", number = 4,
                                        allowParallel = TRUE))
```
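To make the resampling scheme concrete, the following base R sketch runs 4-fold cross-validation by hand. It is an illustration under assumptions, not what `train` does internally: it uses the built-in `iris` data, and `predictNearestCentroid` is a hypothetical toy classifier standing in for Random Forest so that the example needs no extra packages.

```r
set.seed(313)
k <- 4
# Randomly assign each row to one of k folds of (nearly) equal size
foldId <- sample(rep(seq_len(k), length.out = nrow(iris)))

# Toy stand-in classifier (NOT Random Forest): predict the class whose
# feature centroid is nearest to each test row
predictNearestCentroid <- function(train, test) {
  featCols <- names(train) != 'Species'
  # One column of feature means per class
  centroids <- sapply(split(train[, featCols], train$Species), colMeans)
  x <- as.matrix(test[, featCols])
  # Squared Euclidean distance from every test row to every centroid
  dists <- apply(centroids, 2, function(ctr) rowSums(sweep(x, 2, ctr)^2))
  factor(colnames(centroids)[max.col(-dists, ties.method = 'first')],
         levels = levels(train$Species))
}

# Hold out each fold in turn, fit on the rest, and score the held-out fold
foldAcc <- sapply(seq_len(k), function(i) {
  held <- iris[foldId == i, ]
  rest <- iris[foldId != i, ]
  mean(predictNearestCentroid(rest, held) == held$Species)
})

foldAcc        # accuracy on each held-out fold
mean(foldAcc)  # cross-validated accuracy estimate
```

Every observation is used for validation exactly once, and the average over folds is the cross-validated estimate that is compared across candidate tuning values.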
<br>
## III. Estimated Out-of-Sample Accuracy
The trained model achieves high predictive accuracy on the separate Testing set, with a Balanced Accuracy of **over 99%** for each of the five activity quality variations A-E.
``` {r}
predTest <- predict(model, datTest)
confusionMatrix(predTest, datTest$classe)
```
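Balanced Accuracy for a class is the mean of its one-versus-rest sensitivity and specificity. The sketch below computes it from a confusion table in base R; the labels in `actual` and `predicted` are hypothetical and are not results from this analysis.

```r
# Hypothetical reference and predicted labels for three classes
actual    <- factor(c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'))
predicted <- factor(c('A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'A'),
                    levels = levels(actual))
confTab <- table(Predicted = predicted, Actual = actual)

# One-versus-rest balanced accuracy for each class
balancedAcc <- sapply(levels(actual), function(cls) {
  tp <- confTab[cls, cls]
  fn <- sum(confTab[, cls]) - tp     # actually cls, predicted otherwise
  fp <- sum(confTab[cls, ]) - tp     # predicted cls, actually otherwise
  tn <- sum(confTab) - tp - fn - fp
  sensitivity <- tp / (tp + fn)
  specificity <- tn / (tn + fp)
  (sensitivity + specificity) / 2
})
balancedAcc
```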
Thus, we expect an **out-of-sample error rate of only approximately 1%**.
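The error rate is simply one minus the accuracy on held-out data, and an exact binomial interval gives a sense of its uncertainty. The counts below (`nTotal`, `nCorrect`) are hypothetical placeholders, not the figures from this analysis.

```r
# Hypothetical counts, NOT the results of this analysis
nTotal   <- 1000
nCorrect <- 990

accuracy  <- nCorrect / nTotal
errorRate <- 1 - accuracy

# Exact binomial 95% confidence interval for the accuracy
accCI <- binom.test(nCorrect, nTotal)$conf.int
# The matching interval for the error rate (bounds swap when subtracting)
errCI <- rev(1 - accCI)

errorRate
errCI
```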
Finally, the model also performs well on the *Practical Machine Learning* course's separate testing dataset, correctly classifying 20 out of 20 examples.