Lab7_Demo_InClass_complete.Rmd
---
title: "Clustering Lab"
author: "Mateo Robbins"
date: "2024-02-29"
output: html_document
---
```{r, echo = FALSE, eval = TRUE}
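library(tidyverse)    # data wrangling and plotting
library(cluster)      # cluster analysis
library(factoextra)   # cluster visualization
library(tidymodels)   # attaches broom, which provides augment() and tidy()
library(readr)        # read data (also attached by tidyverse)
library(RColorBrewer) # color palettes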
```
We'll start with simulated data whose structure is amenable to cluster analysis: three groups of points, each drawn from normal distributions centered at a different location.
```{r init_sim}
# Set the parameters of our simulated data
set.seed(101)

cents <- tibble(
  cluster = factor(1:3),
  num_points = c(100, 150, 50),
  x1 = c(5, 0, -3),
  x2 = c(-1, 1, -2)
)
```
```{r sim}
# Simulate the data by passing n and mean to rnorm() using map2()
labelled_pts <-
  cents %>%
  mutate(
    x1 = map2(num_points, x1, rnorm),
    x2 = map2(num_points, x2, rnorm)
  ) %>%
  select(-num_points) %>%
  unnest(cols = c(x1, x2))

ggplot(labelled_pts, aes(x1, x2, color = cluster)) +
  geom_point(alpha = 0.4)
```
```{r kmeans}
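# Drop the true labels so k-means only sees the coordinates
points <-
  labelled_pts %>%
  select(-cluster)

# nstart = 25 runs 25 random initializations and keeps the best fit
kclust <- kmeans(points, centers = 3, nstart = 25)
kclust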
```
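Because the data are simulated, the fitted clusters can be checked against the groups that generated them. The chunk below is a minimal sketch of that check, not part of the original lab: it rebuilds a stand-in data set in base R (the names `demo_pts`, `demo_fit`, and `true_cluster` are hypothetical) so that it runs on its own.

```{r kmeans_check}
# Hypothetical stand-in for the simulated data above: three Gaussian blobs
set.seed(101)
n_per <- c(100, 150, 50)                      # points per true cluster
true_cluster <- rep(1:3, times = n_per)       # generating group of each point

demo_pts <- data.frame(
  x1 = rnorm(sum(n_per), mean = rep(c(5, 0, -3), times = n_per)),
  x2 = rnorm(sum(n_per), mean = rep(c(-1, 1, -2), times = n_per))
)

# Fit k-means with 3 centers and 25 random starts
demo_fit <- kmeans(demo_pts, centers = 3, nstart = 25)

demo_fit$centers   # estimated cluster centers
demo_fit$size      # number of points assigned to each cluster

# Cross-tabulate generating group against fitted cluster.
# Cluster numbers are arbitrary, so look for one dominant cell per row.
table(true = true_cluster, fitted = demo_fit$cluster)
```

The fitted cluster numbers need not match the generating labels, so the table should be read by its pattern rather than by its diagonal.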
```{r syst_k}
# Now let's try a systematic method for setting k
kclusts <-
  tibble(k = 1:9) %>%
  mutate(
    kclust = map(k, ~ kmeans(points, .x)),
    augmented = map(kclust, augment, points)
  )
```
```{r assign}
# Append cluster assignments to the tibble
assignments <-
  kclusts %>%
  unnest(cols = c(augmented))
```
```{r plot_9_clust}
# Plot each model
p1 <-
  ggplot(assignments, aes(x = x1, y = x2)) +
  geom_point(aes(color = .cluster), alpha = 0.8) +
  scale_color_brewer(palette = "Set1") +
  facet_wrap(~k)
p1
```
```{r elbow}
# Use a function from {factoextra} to plot the total within-cluster sum of squares (WSS) for each k
fviz_nbclust(points, kmeans, method = "wss")
```
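To see what the elbow plot summarizes, the total within-cluster sum of squares can also be computed by hand. The chunk below is a rough base R sketch under stated assumptions: `elbow_pts`, `k_values`, and `total_wss` are hypothetical names, the data are a stand-in simulated inside the chunk, and the range 1 to 9 simply mirrors the values of k used earlier.

```{r elbow_by_hand}
# Hypothetical stand-in data: three Gaussian blobs, as in the simulation above
set.seed(101)
n_per <- c(100, 150, 50)
elbow_pts <- data.frame(
  x1 = rnorm(sum(n_per), mean = rep(c(5, 0, -3), times = n_per)),
  x2 = rnorm(sum(n_per), mean = rep(c(-1, 1, -2), times = n_per))
)

# Fit k-means for each k and keep the total within-cluster sum of squares
k_values <- 1:9
total_wss <- sapply(k_values, function(k) {
  kmeans(elbow_pts, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is where adding another cluster stops reducing WSS by much
plot(k_values, total_wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

With k = 1 the total WSS equals the total sum of squares around the overall mean, which gives a baseline for judging how much each additional cluster helps.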
```{r more_fviz}
# Another plotting method
k3 <- kmeans(points, centers = 3, nstart = 25)
p3 <- fviz_cluster(k3, geom = "point", data = points) + ggtitle("k=3")
p3
```
In-class assignment!
Now it's your turn to partition a dataset. For this round we'll use data from Roberts et al. (2008) on bio-contaminants in Port Jackson Bay in Sydney, Australia. The data are measurements of metal content in two types of co-occurring algae at 10 sample sites around the bay.
```{r data}
#Read in data
metals_dat <- readr::read_csv(here::here("data/Harbour_metals.csv"))
# Inspect the data
#Grab pollutant variables
metals_dat2 <- metals_dat[, 4:8]
```
1. Start with k-means clustering using kmeans(). Use fviz_nbclust() first to identify the best value of k, then plot the model you obtain with that value.
Do you notice anything different about the spacing between clusters? Why might this be?
Run summary() on your model object. Does anything stand out?
2. Good, now let's move to the hierarchical clustering we saw in lecture. The first step is to calculate a distance matrix on the data using dist(). Euclidean distance is a good choice for the method.
3. Use tidy() on the distance matrix so you can see what is going on. What does each row in the resulting table represent?
4. Then apply hierarchical clustering with hclust().
5. Now plot the clustering object. You can use something of the form plot(as.dendrogram()), or check out the cool visual options here: https://rpubs.com/gaston/dendrograms
How does the plot look? Do you see any outliers? How can you tell?
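
The hierarchical clustering steps listed above follow a short, fixed pattern. As a rough sketch of that pattern only, the chunk below runs it on the first 15 rows of R's built-in USArrests data, which is a stand-in assumption and not the harbour metals data; the object names (`demo_dat`, `demo_dist`, `demo_hc`) are hypothetical.

```{r hclust_sketch}
# Stand-in data (an assumption): first 15 rows of the built-in USArrests data
demo_dat <- USArrests[1:15, ]

# Distance matrix using Euclidean distance
demo_dist <- dist(demo_dat, method = "euclidean")

# Peek at a corner of the distance matrix in square form
round(as.matrix(demo_dist)[1:3, 1:3], 1)

# Hierarchical clustering; hclust() uses complete linkage by default
demo_hc <- hclust(demo_dist)

# Plot the result as a dendrogram
plot(as.dendrogram(demo_hc), main = "Hierarchical clustering (stand-in data)")
```

The same sequence of calls should carry over to metals_dat2, though whether to rescale the variables first is a choice this sketch does not make for you.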