---
title: "09_Practical; Wk. 10; PCA"
author: "Ryan Greenup 17805315"
date: "27 September 2019"
output:
  html_notebook: default
  # html_document:
  #   code_folding: hide
  #   keep_md: yes
  #   theme: flatly
  #   toc: yes
  #   toc_depth: 4
  #   toc_float: yes
  pdf_document:
    toc: yes
always_allow_html: yes
# Shiny can be good, but {.tabset} is more compatible with PDF output;
# HTML can be submitted in Turnitin, so it doesn't really matter.
# If a floating TOC is used in the document, only use {.tabset} on more-or-less
# copy/pasted sections with different datasets.
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# Load packages, installing pacman first if it is missing
if (!require('pacman')) {
  install.packages('pacman')
  library('pacman')
}
pacman::p_load(caret, scales, ggplot2, rmarkdown, shiny, ISLR, class, BiocManager,
               corrplot, plotly, tidyverse, latex2exp, stringr, reshape2, cowplot,
               ggpubr, rstudioapi, wesanderson, RColorBrewer, colorspace, gridExtra,
               grid, car, boot, colourpicker, tree, ggtree, mise, rpart, rpart.plot,
               knitr, MASS, magrittr, EnvStats, tidyr, devtools, bookdown, leaps,
               clipr, tikzDevice, e1071, ggbiplot)
# install.packages("ggbiplot")
mise()
set.seed(23)
# Set Working Directory
# setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
# setwd(getSrcDirectory()[1])
```
# (Wk 10) Introduction to Data Science
Material of Tue 13 May 2019, week 11
## Iris Data
### Load and inspect the Data Set
```{r}
irisTB <- as_tibble(iris)
head(iris)
#str(iris)
#summary(iris)
#dim(iris)
```
This data set has 4 predictive features and a categorical output variable that represents the plant species.
The mean and variance of each variable may be computed directly:
```{r}
#means <- summary(iris)[4,-5]
means <- data.frame(SL = mean(irisTB[["Sepal.Length"]]), SW = mean(irisTB[["Sepal.Width"]]), PL = mean(irisTB[["Petal.Length"]]), PW = mean(irisTB[["Petal.Width"]]))
vars <- data.frame(SL = var(irisTB[["Sepal.Length"]]), SW = var(irisTB[["Sepal.Width"]]), PL = var(irisTB[["Petal.Length"]]), PW = var(irisTB[["Petal.Width"]]))
desc.stats <- rbind(means, vars)
desc.stats
```
### Descriptive Statistics
```{r}
desc.stats <- data.frame(
Mean = apply(irisTB[-5], 2, mean), # 1 is rows, 2 is cols p. 401 ISL TB
Variance = apply(irisTB[-5], 2, var) # 1 is rows, 2 is cols p. 401 ISL TB
)
desc.stats$variable <- row.names(desc.stats)
descStatsTidy <- pivot_longer(desc.stats, cols = c(Mean, Variance), names_to = "Statistic", values_to = "Value")
ggplot(descStatsTidy, aes(x = variable, y = Value, fill = Statistic)) +
geom_col(position = "dodge")
```
PCA centres each variable at 0 by default. However, the variables differ greatly in variance, so most of the principal components would be driven by `Petal.Length`, which has a much higher variance than `Sepal.Width`. To prevent this, each variable should also be scaled to unit standard deviation so that only the correlation structure between the variables is considered; this is specified with the parameter `scale = TRUE`.
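As a quick sanity check of this point (a sketch, not part of the original practical), `prcomp(..., scale = TRUE)` should agree with running PCA on data that has been standardised by hand with `scale()`:

```r
# Quick check: prcomp with scale = TRUE should match PCA on data
# that has been standardised by hand with scale()
X <- as.matrix(iris[, -5])
pc_scaled <- prcomp(X, scale = TRUE)
pc_manual <- prcomp(scale(X))  # centre each column at 0, scale to SD 1
all.equal(pc_scaled$rotation, pc_manual$rotation)  # should be TRUE
```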
### Perform a Principal Component Analysis
When performing PCA, remember to remove any non-numeric variables first (here, the `Species` factor).
```{r}
irisTB <- irisTB[-5]
pca.mod <- prcomp(irisTB, scale = TRUE)
pca.mod
```
### Display the Principal Component Loading vectors
The loading vectors correspond to the `rotation` matrix in the model:
```{r}
print(pca.mod$rotation, digits = 2)
```
These values are the $\phi_{j,m}$ used to form the linear combinations that define the principal components in the $p$-dimensional feature space. For example, $\phi_{3,2}$ is the loading of the 3rd feature on the 2nd principal component, and corresponds to (row, col) = (3, 2) in the matrix above.
Notation:
* $i$ indexes the observation
* $j$ indexes the feature
* $m$ indexes the principal component
* $\phi_{j,m}$ is the loading of the $j$th feature on the $m$th principal component
* $z_{i,m}$ is the score of the $i$th observation on the $m$th principal component
    - In this case $m$ acts like a new feature index, while $i$ still indexes the observation.
* $x_{i,j}$ is the $i$th observation of the $j$th feature
* $x_{i,j} \approx \sum^{M}_{m=1} z_{i,m} \cdot \phi_{j,m}$
    - The trick to remembering which index is which is that everything follows alphabetical order, e.g. $\phi_{j,m} \in \Phi$ is a matrix with the features as rows and the principal components as columns.
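The relationship between scores and loadings can be checked numerically (a quick sketch): the scores $Z$ returned by `prcomp` are exactly the centred-and-scaled data multiplied by the loading matrix $\Phi$.

```r
# The scores Z are the standardised data times the loading matrix:
# z_{i,m} = sum_j x_{i,j} * phi_{j,m}
X  <- scale(as.matrix(iris[, -5]))       # standardise: mean 0, SD 1
pc <- prcomp(iris[, -5], scale = TRUE)
Z  <- X %*% pc$rotation
all.equal(Z, pc$x, check.attributes = FALSE)  # should be TRUE
```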
### Return the first two Principal Components
$$
\begin{align}
&\textbf{PC1:} \\
Z_{i,1} &= \begin{bmatrix} SL_i & SW_i & PL_i & PW_i \end{bmatrix} \times \begin{bmatrix} 0.52 \\ -0.26 \\ 0.58 \\ 0.56 \end{bmatrix} \\
&= 0.52 \cdot SL_i - 0.26 \cdot SW_i + 0.58 \cdot PL_i + 0.56 \cdot PW_i \\
\ \\
&\textbf{PC2:} \\
Z_{i,2} &= \begin{bmatrix} SL_i & SW_i & PL_i & PW_i \end{bmatrix} \times \begin{bmatrix} -0.37 \\ -0.92 \\ -0.02 \\ -0.07 \end{bmatrix} \\
&= -0.37 \cdot SL_i - 0.92 \cdot SW_i - 0.02 \cdot PL_i - 0.07 \cdot PW_i
\end{align}
$$
### Draw the biplot
Make sure to specify the `scale = 0` argument; this ensures that the arrows are scaled to represent the loadings. Other values of `scale` give slightly different biplots with different interpretations.
```{r}
biplot(pca.mod, scale = 0, cex = 0.5)
#ggbiplot(pca.mod, scale = 0, cex = 0.5, )
```
### (9) Rotate the Plot and Compare
Principal components are unique only up to sign, so the loadings and scores may be negated and the plot rotated:
```{r}
pca.modR <- pca.mod
pca.modR$rotation <- -pca.mod$rotation
pca.modR$x <- -pca.mod$x
biplot(pca.modR, scale = 0, cex = 0.5)
```
This plot is identical; negating the loadings and scores has merely rotated it 180 degrees.
### (10) Discuss and graph the proportion of variance explained by the principal components
```{r}
pcaVar <- pca.mod$sdev^2
pcaVarpr <- enframe(pcaVar / sum(pcaVar),
                    name = "Principal.Component",
                    value = "Proportion.Variance")
pcaVarpr$Principal.Component <- paste0("PC", pcaVarpr$Principal.Component)
ggplot(data = pcaVarpr, aes(x = Principal.Component, y = Proportion.Variance, group = 1)) +
  geom_point(size = 3, alpha = 0.7, col = "RoyalBlue") +
  geom_line(col = "IndianRed") +
  labs(x = "Principal Component", y = "Proportion of Variance",
       title = "Variance Explained by PC") +
  theme_classic2() +
  geom_vline(xintercept = 3, col = "purple", lty = 2)
print(pcaVarpr, digits = 1)
```
This shows that the first two principal components together explain about 96% of the variation within the data, and hence would be sufficient to describe it.
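The cumulative proportions can be checked directly (the model is refitted here so the chunk stands alone; `summary(pca.mod)` reports the same figures):

```r
# Cumulative proportion of variance explained by the iris PCs
pca.mod <- prcomp(iris[, -5], scale = TRUE)
pve <- pca.mod$sdev^2 / sum(pca.mod$sdev^2)
round(cumsum(pve), 3)
```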
## Question 2: Using USArrest dataset built-in R.
### 1. Install the package “base” and attach the dataset “USArrests”.
```{r}
head(USArrests)  # "base" and "datasets" ship with R, so no installation is needed
Crim <- USArrests  # keep as a data frame so the state row names are preserved
```
### 2. Give a short description of the dataset.
This data set contains arrest rates per 100,000 residents for murder, assault, and rape in each of the 50 US states, along with the percentage of the population living in urban areas (`UrbanPop`).
### 3/4. Examine the mean/variance of the variables in the dataset.
```{r}
dStatsCrim <- data.frame(
  Mean = apply(Crim, 2, mean),
  Variance = apply(Crim, 2, var)
)
dStatsCrim$Variable <- row.names(dStatsCrim)
dStatsCrimTidy <- pivot_longer(dStatsCrim, cols = c("Mean", "Variance"),
                               names_to = "Statistic", values_to = "Value")
ggplot(dStatsCrimTidy, aes(x = Variable, y = Value, fill = Statistic)) +
  geom_col(position = 'dodge')
```
`Assault` has a very large variance and would therefore dominate the principal components merely because of its scale. Instead, the data should be standardised to a mean of 0 and a standard deviation of 1, which is specified with the `scale = TRUE` parameter.
### 5. Perform Principal Component Analysis for USArrests.
```{r}
pcaCrimMod <- prcomp(Crim, scale = TRUE)
```
### 6. Display and explain the principal component loading vectors for the given dataset.
The principal component loading vectors are given by the `rotation` matrix within the model:
```{r}
print(pcaCrimMod$rotation, digits = 2)
```
These are the coefficients used to create linear combinations of the predictive features in order to maximise the variance in creating the Principal Component lines.
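The loading vectors are also constrained to be unit-length and mutually orthogonal, which can be verified by checking that $\Phi^T \Phi$ is the identity matrix (a quick sketch):

```r
# The loading matrix is orthonormal: t(Phi) %*% Phi = I
Phi <- prcomp(USArrests, scale = TRUE)$rotation
round(crossprod(Phi), 10)  # should print the 4x4 identity matrix
```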
### 7. Give the first two Principal components using loading parameters in part 6.
$$
\begin{align}
&\textbf{PC1:} \\
Z_{i,1} &= \begin{bmatrix} \text{Murder}_i & \text{Assault}_i & \text{UrbanPop}_i & \text{Rape}_i \end{bmatrix} \times \begin{bmatrix} -0.54 \\ -0.58 \\ -0.28 \\ -0.54 \end{bmatrix} \\
&= -0.54 \cdot \text{Murder}_i - 0.58 \cdot \text{Assault}_i - 0.28 \cdot \text{UrbanPop}_i - 0.54 \cdot \text{Rape}_i \\
\ \\
&\textbf{PC2:} \\
Z_{i,2} &= \begin{bmatrix} \text{Murder}_i & \text{Assault}_i & \text{UrbanPop}_i & \text{Rape}_i \end{bmatrix} \times \begin{bmatrix} 0.42 \\ 0.19 \\ -0.87 \\ -0.17 \end{bmatrix} \\
&= 0.42 \cdot \text{Murder}_i + 0.19 \cdot \text{Assault}_i - 0.87 \cdot \text{UrbanPop}_i - 0.17 \cdot \text{Rape}_i
\end{align}
$$
### 8. Draw the biplot and interpret it.
```{r}
biplot(pcaCrimMod, scale = 0, cex = 0.6)
```
The first loading vector places roughly equal weight on `Murder`, `Assault`, and `Rape`, so PC1 broadly measures overall serious crime; the second places most of its weight on `UrbanPop`, so PC2 broadly measures the level of urbanisation.
### 9. Rotate the graph and compare the results.
The direction of a principal component is non-unique: the loadings and scores may be negated, rotating the plot by 180 degrees while leaving it otherwise identical. For example:
```{r}
pcaCrimModR <- pcaCrimMod
pcaCrimModR$rotation <- -pcaCrimMod$rotation
pcaCrimModR$x <- -pcaCrimMod$x
biplot(pcaCrimModR, scale = 0, cex = 0.7)  # scale = 0 to match the plot above
```
### 10. Explain the proportion of variance explained by each PCs using graph and summarise your results.
```{r}
pcaVarCrim <- pcaCrimMod$sdev^2
pcaVarCrimProp <- enframe(pcaVarCrim / sum(pcaVarCrim),
                          name = "Principal.Component",
                          value = "Proportion.Variance")
pcaVarCrimProp$Principal.Component <- paste0("PC", pcaVarCrimProp$Principal.Component)
ggplot(pcaVarCrimProp, aes(x = Principal.Component, y = Proportion.Variance, group = 1)) +
  geom_point(col = "IndianRed", size = 3, alpha = 0.7) +
  geom_line(col = "RoyalBlue") +
  theme_classic2() +
  labs(x = "Principal Component", y = "Proportion of Variance",
       title = "Proportion of Variance Explained by Principal Component") +
  geom_vline(xintercept = 3, col = "purple", lty = 2)
print(pcaVarCrimProp, digits = 2)
```
Over 95% of the variance is explained by the first three principal components (and about 87% by the first two), and hence the data may be sufficiently summarised by those PCs as predictive features.
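As with the iris data, the cumulative proportions can be computed directly (refitting the model so the chunk stands alone):

```r
# Cumulative proportion of variance explained by the USArrests PCs
pcaCrimMod <- prcomp(USArrests, scale = TRUE)
pveCrim <- pcaCrimMod$sdev^2 / sum(pcaCrimMod$sdev^2)
round(cumsum(pveCrim), 3)
```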