-
Notifications
You must be signed in to change notification settings - Fork 0
/
ClusteringCountry2.Rmd
628 lines (525 loc) · 33.9 KB
/
ClusteringCountry2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
---
title: "Clustering & PCA Analysis - Country data"
author: "Ahmet Zamanis"
output:
github_document:
toc: TRUE
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(error = FALSE)
knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(warning = FALSE)
library(rmarkdown)
library(tidyverse)
library(Cairo)
library(gt)
library(GGally)
library(corrplot)
library(ggthemes)
library(patchwork)
library(RColorBrewer)
library(hopkins)
library(scatterplot3d)
library(NbClust)
library(cluster)
library(factoextra)
library(fpc)
options(scipen=999)
```
## Purpose
Clustering is an unsupervised learning method that aims to split the observations in a dataset into clusters, based on their similarity. Clustering can be used as an exploratory data analysis tool, to help discover the structures and relationships in the data. The resulting cluster ID's can be considered as a "summary" of the variables in the data, and can be used to train supervised learning models with large amounts of data and variables.
\
\
Clustering is used for real-world applications such as customer segmenting, anomaly detection and medical imaging. In this example analysis, we have a dataset of country statistics, and we will see if we can cluster them meaningfully. The dataset is sourced from [Kaggle](https://www.kaggle.com/datasets/rohan0301/unsupervised-learning-on-country-data), uploaded by the user [Rohan kokkula](https://www.kaggle.com/rohan0301).
## Data preparation
Let's load our dataset, and make a table of the first five rows.
\
```{r}
df_org <- read.csv("country_data.csv", header=TRUE, encoding="UTF-8")
tb1 <-
gt(data=df_org[1:5,], rownames_to_stub = TRUE) %>%
tab_header(title="Country data") %>%
tab_style(locations=cells_column_labels(columns=everything()),
style=list(
cell_borders(sides="bottom", weight=px(3)),
cell_text(weight="bold"))) %>%
opt_table_font(font=list(google_font("Calibri"), default_fonts()))
as_raw_html(tb1)
```
\
The variables in our dataset are:
- Country name
- child_mort: Deaths of childer under the age of 5, per 1000 live births.
- exports: Exports of goods and services per capita, as a % of GDP per capita.
- health: Health spending per capita, as a % of GDP per capita.
- imports: Imports of goods and services per capita, as a % of GDP per capita.
- income: Net income per person.
- inflation: Annual growth as % of the total GDP.
- life_expect: Average expected lifespan of a newborn child, if mortality patterns stay the same.
- total_fert: Number of children expected per woman, if age-fertility patterns stay the same.
- gdpp: GDP per capita.
All variables in our dataset except country are numeric variables. Some variables with a range between 0-100 are actually percentages, but we don't need to modify them now, as we will standardize all variables before we apply our algorithms.
## Exploratory analysis
Let's visualize and explore the distributions of variables, and their correlations with eachother. Since clustering doesn't make predictions, we don't have an outcome variable, so we don't need to look for a specific relationship between any variables.
### Distributions
```{r, include=FALSE}
df <- df_org[,2:10]
for (i in 1:ncol(df)) {
x_val <- df[,i]
x_lab <- colnames(df)[i]
t <- paste0("Histogram of ", x_lab)
q25 <- quantile(x_val, 0.25)
q75 <- quantile(x_val, 0.75)
medi <- median(x_val)
h <- ggplot(df, aes(x=!!x_val)) +
geom_histogram(color="cyan4", fill="cyan4", size=0.75, alpha=0.5, bins=30) +
geom_vline(xintercept=medi, color="#CA0020", size=1, linetype="dashed") +
geom_vline(xintercept=q25, color="#FDB863", size=1, linetype="dashed") +
geom_vline(xintercept=q75, color="#FDB863", size=1, linetype="dashed") +
geom_point(aes(x=!!medi, y=0, color="#CA0020")) +
geom_point(aes(x=!!q25, y=0, color="#FDB863")) +
geom_point(aes(x=!!q75, y=0, color="#FDB863")) +
scale_color_identity(guide="legend", name="Stats", breaks=c("#CA0020", "#FDB863"), labels=c("Median", "IQR")) +
theme_bw() +
labs(x=x_lab, y="Count")
assign(paste0("hist", i), h)
}
```
Let's create and visualize histograms for all our variables, starting with the first four.
\
```{r}
(hist1+hist2+hist3+hist4) + plot_layout(guides="collect") +
plot_annotation(title="Histograms of child_mort, exports, health, imports", subtitle="N=167", theme=theme_bw())
```
\
We see that all 4 variables are right skewed, with numerous outliers. Especially child mortality is very right skewed: Most countries have low child mortality rates, but a small number of countries have very high rates. The median is very close to the 25th quartile, and far from the 75th. This is likely to be a key factor in splitting countries into clusters.
\
\
Let's look at the next 2 variables.
\
```{r}
(hist5+hist6) + plot_layout(guides="collect") +
plot_annotation(title="Histograms of income, inflation", subtitle="N=167", theme=theme_bw())
```
\
Income per capita and inflation are both very right skewed with outliers, especially inflation. Most countries have low inflation rates, but a few countries have very high rates. Most countries have an income per capita lower than roughly 25,000.
\
\
Let's look at our last 3 variables.
\
```{r}
(hist7 | hist8 / hist9) + plot_layout(guides="collect") +
plot_annotation(title="Histograms of life_expect, total_fert, gdpp", subtitle="N=167", theme=theme_bw())
```
- Life expectancy, unlike other variables, has a left skewed distribution, but is also susceptible to outliers. Most countries have high life expectancy, roughly over 65, but a few countries have very low life expectancy such as around 30 or 50. Possibly another key factor to split the data.
- Fertility rates are right skewed, but with a considerable number of observations at the right tail of the data. Many countries have relatively low fertility rates around 2, but quite a few countries have higher rates around 4 and 6.
- GDP per capita is very right skewed: Most countries are below roughly 15,000, while numerous countries have a GDPP of more than 2-3x that amount.
Overall, the distributions in our data are very right skewed, with a lot of outliers both for "positive" variables such as income, as well as "negative" variables such as child mortality. We have no reason to remove any outliers from our analysis, as they all represent real countries we need to account for, and are potentially more interesting for our analysis than countries with moderate statistics.
### Correlations
Highly correlated variables can have a strong impact on clustering analyses, especially if the algorithm is distance based: Highly correlated variables can exaggerate the similarity or dissimilarity between observations. Dropping them, unless there is a serious reason to keep them, can be a good idea.
\
\
Let's calculate and plot the correlations of all variables in our dataset.
\
```{r}
ggpairs(df)
corrplot(cor(df), method="square", type="upper", addCoef.col = "black", number.cex=0.75, mar=c(0,0,0,0))
```
\
In the square plot, larger blue squares indicate stronger positive correlation, and larger red squares indicate stronger negative correlation. We have numerous strong correlations between variables:
- Income and GDPP have a very strong positive correlation of 0.89, as we would expect.
- The pattern in the scatterplot from the first plot suggests there may be two different groups in respect to the relationship between income and GDPP. This is possibly because there may be some disparity between income and GDPP for some countries.
- Child mortality and life expectancy have a strong negative correlation of 0.88, again, an expected relationship.
- Child mortality and fertility have a strong positive correlation of 0.84.
- Looking at the scatterplot of these two variables, we can say this correlation may have been even higher if not for one outlier dragging the line down: Haiti has a high child mortality of 208, yet a relatively low fertility of 3.33.
- Fertility and life expectancy have a negative correlation of 0.76.
- Again highly affected by the same outlier: Haiti has a relatively low fertility of 3.33, yet also a low life expectancy of 32.1.
- Imports and exports have a positive correlation of 0.73.
Since some of these variables are highly correlated, and record very similar features, we should drop some of them from our clustering analysis. To help with the elimination of variables, and to reduce the dimensions in the dataset, we can make use of a principal component analysis (PCA).
### Principal Component Analysis (PCA)
Principal component analysis basically aims to reduce the dimensions of a dataset with minimal information loss. Information from all variables is combined into Principal Components while minimizing redundancy, i.e correlation. A dataset with 10 variables yields 10 PCs, but each PC aims to maximize the information summarized in one dimension, and is uncorrelated with the other PCs. We can then decide to discard some of the PCs while using the remaining ones as our variables. This way, we reduce the number of dimensions, with minimal information loss and no correlation between the new variables.
\
\
Before we can compute a PCA, let's standardize our variables into z-scores. Our variables have different scales, some in the thousands while others represent percentages, and this can lead to biased results both for PCA and for clustering.
```{r, echo=TRUE}
df_z <- as.data.frame(scale(df))
```
Now let's compute a PCA for our standardized dataframe, and display the calculated PCs.
```{r, echo=TRUE}
pca <- prcomp(df_z)
eival <- get_eigenvalue(pca)
```
```{r}
tb2 <-
gt(data=eival, rownames_to_stub = TRUE) %>%
tab_header(title="Principal components") %>%
tab_style(locations=cells_column_labels(columns=everything()),
style=list(
cell_borders(sides="bottom", weight=px(3)),
cell_text(weight="bold"))) %>%
opt_table_font(font=list(google_font("Calibri"), default_fonts()))
as_raw_html(tb2)
```
\
The variance percent column shows the percentage of variance captured by each single PC. The cumulative variance column shows the percentage of variance captured by each PC, plus the previous PCs. We can see that only the first 4 PCs capture 87% of the variance, the fifth PC brings it to 94%, and the sixth to 97%. These likely represent a good tradeoff between fewer dimensions and variance explained.
\
\
We can also use the results of our PCA to get insight on the effects of our original variables on the first two PCs, by plotting them as vectors in a two-dimensional space.
\
```{r}
p_var <- fviz_pca_var(pca,
col.var = "contrib",
gradient.cols = c("#0571B0", "#FDB863", "#CA0020"),
repel = TRUE)
p_var + labs(x="Dimension 1 (46% of the variance)", y="Dimension 2 (17.2% of the variance)", title="Original variables and PCA dimensions",
subtitle="Directions and sizes of their effects") +
theme(plot.margin = unit(c(15, 15, 15, 15), "mm")) + theme_bw()
```
- In the above plot, variables with longer vectors and closer directions to a PC axis have a bigger effect on that PC.
- We can see PC1 is most strongly influenced by income and GDPP.
- If the angle between two variable vectors are close to 0, they are strongly positively correlated.
- We can see income and GDPP have a strong positive correlation.
- If the angles between two variable vectors are close to 180, they are strongly negatively correlated.
- Child mortality and life expectancy have a strong negative correlation.
- A 90 degree angle between two vectors represents close to zero correlation.
- Imports and life expectancy are close to having zero correlation.
If we wanted to use the original variables for clustering while dropping some of the highly correlated ones, the intuition from this plot could help us make our choice. Since we want the variables with the biggest impact, without correlating with one another, a good choice could be the following variables, in order of importance: life_expect, income, imports and health.
## Clustering Analysis
### Cluster tendencies
```{r, include=FALSE}
pc5 <- prcomp(df_z, rank.=5)
df_pc5 <- as.data.frame(pc5$x)
df1 <- dplyr::select(df_z, c(7, 5, 4, 3))
set.seed(1)
```
Before we start our clustering analysis, we can test the cluster tendency of our data. An issue with clustering algorithms is that they will cluster any given dataset, even if there are no meaningful structures present. Metrics of cluster tendency can sometimes indicate whether the data is likely to contain meaningful clusters or not.
\
\
We apply two methods to test cluster tendency: The Hopkins statistic, and a visual assessment. Let's test the cluster tendency of a dataset with the first 5 PCs from our PCA.
```{r, echo=TRUE}
set.seed(1)
hopkins(df_pc5, m=nrow(df_pc5)-1)
get_clust_tendency(data=df_pc5, n=50, gradient=list(low="white",high="blue"), seed=1)
```
\
A clear consensus on the interpretation of clustering tendency does not seem to exist, and we can see that one function returned a Hopkins statistic of 0.95, while the other returned 0.79.
- Generally, a Hopkins statistic close to 0 indicates a very non-uniform dataset, which may be interpreted as a tendency to be clustered. A statistic close to 1 indicates a very uniform dataset, which may be interpreted as low tendency to be clustered.
- The plot visualizes the dissimilarity matrix of the observations in our dataset, with white representing low dissimilarity, and blue representing high. Ideally, we would have distinct, blue squares across the diagonal, indicating the presence of strongly dissimilar clusters.
Let's apply the same tests to a dataset of the four original variables we chose using the PCA: life_expect, income, imports and health.
```{r, include=TRUE}
hopkins(df1, m=nrow(df1)-1)
get_clust_tendency(data=df1, n=50, gradient=list(low="white",high="blue"), seed=1)
```
\
The results are a higher Hopkins stat of 0.99 or 0.85, but we also have more dissimilar areas in the dissimilarity plot.
\
\
The results for both datasets suggest a low clustering tendency for our data, but intuitively, we would expect countries to cluster at least somewhat decently based on statistics of income and life quality. Since there is no clear consensus on the usage and interpretation of clustering tendency statistics, let's attempt to cluster both datasets and evaluate the results.
### K-medoids clustering
The most well-known and used clustering algorithm is probably the k-means algorithm. This algorithm takes a K number of random centroids as centers of the K number of clusters in the data, assigns all data points to their closest centroid, recalculates the centroids of all clusters as the means of clusters, reassigns the data points to the closest centroid, and repeats this until convergence. As this approach is based on means, it is sensitive to outliers.
\
\
The variables in our data generally follow a very right skewed distribution, with a lot of outliers. Because of this, we will prefer a k-medoids clustering algorithm: K-medoids takes an actual K number of observations from the data as centers of the clusters, instead of calculating means. These centroids are chosen to minimize the dissimilarity within each cluster. K-medoids is more robust against outliers, as it does not rely on means.
\
\
To apply a non-hierarchical clustering method, such as k-medoids, we need to determine the number of clusters we want. Choosing the optimal number of clusters has a strong impact on the quality of our clustering, and can be subjective. One method to choose the optimal number of clusters is by using average silhouette width:
\
```{r, echo=TRUE}
fviz_nbclust(df_pc5, cluster::pam, method="silhouette")
```
\
This plot displays the average silhouette width for each cluster. The closer the average silhouette width to 1, the more efficient the clustering. A negative silhouette width for an observation indicates that the observation was likely put in the wrong cluster. According to our plot, 2 clusters is optimal, but 3 clusters also comes close, and intuitively makes more sense instead of just splitting the countries in two groups.
\
\
Another method we can use is the elbow method:
\
```{r, echo=TRUE}
fviz_nbclust(df_pc5, cluster::pam, method="wss")
```
\
This plot displays the unexplained variance for each number of clusters. As we divide the data into more clusters, we can expect to explain more of the variance, but also expect to have smaller, less meaningful and less generalizable clusters, similar to the overfitting problem in regression analysis. The elbow method refers to choosing the number of clusters that creates an "elbow" in this plot: A point of diminishing returns. We don't have a very clear elbow in this case, but 3 and 6 are the candidates.
\
\
Another method we can use is the gap statistic:
\
```{r, echo=TRUE}
fviz_nbclust(df_pc5, cluster::pam, method="gap_stat")
```
\
Gap statistic is a measure of the explained variance for each number of clusters. Here, we want to choose the smallest number of K clusters that has a gap statistic within 1 standard deviation of the gap statistic of K+1 clusters. This can be considered the "inverse" of the previous method, we are again choosing the point of diminishing returns. In this case, 3 clusters emerges as the ideal choice again.
#### Clustering attempt: pam1
Let's fit and plot our first clustering, with 3 clusters, using the PAM k-medoids algorithm. We choose Manhattan distance as our metric of distance, as it's known to be more robust to outliers.
\
```{r, echo=TRUE}
pam1 <- pam(df_pc5, k=3, metric="manhattan")
fviz_cluster(pam1, data=df_pc5, geom="point", ellipse.type="norm", repel=TRUE, ggtheme=theme_bw()) +
labs(title="Cluster plot, pam1", subtitle="k=3 clusters", x="Dim1, 20% variance", y="Dim2, 20% variance" )
```
\
The first and second clusters are decently separated, but the third cluster appears to overlap with the other two quite a bit. Some observations in cluster 3 also seem to be quite distant from one another. This is likely not a very effective clustering, but in the same time, it suggests at least some structure in our data. We should also keep in mind that the plot only represents the observations and clusters in respect to two dimensions, which have 20% effect each, so the remaining dimensions may still separate the data more clearly.
\
\
Let's plot our clusters against the first 3 dimensions, in a 3D scatterplot:
```{r}
pam1_3d <- as.data.frame(pam1$data)
pam1_3d <- pam1_3d[,1:3]
pam1_3d[,4] <- as.factor(pam1$clustering)
sp_clr1 <- c("#CA0020", "#4DAC26","#0571B0")
sp_clr1 <- sp_clr1[as.numeric(pam1_3d$V4)]
shapes <- c(18,17,15)
shapes <- shapes[as.numeric(pam1_3d$V4)]
sp1 <- scatterplot3d(x=pam1_3d$PC1, y=pam1_3d$PC2, z=pam1_3d$PC3, color=sp_clr1, pch=shapes, main="3D clustering plot, pam1", sub="N=167. Dimensions are principal components", xlab="PC1, 46% of the variance", ylab="PC2, 17% of the variance", zlab="PC3, 13% of the variance")
legend("topright", col=c("#CA0020", "#4DAC26","#0571B0"), pch=c(18,17,15), legend=c("1","2","3"), title="Clusters")
sp1.1 <- scatterplot3d(x=pam1_3d$PC1, y=pam1_3d$PC2, z=pam1_3d$PC3, color=sp_clr1, pch=shapes, main="3D clustering plot, pam1", sub="N=167. Dimensions are principal components", xlab="PC1, 46% of the variance", ylab="PC2, 17% of the variance", zlab="PC3, 13% of the variance", type="h")
legend("topright", col=c("#CA0020", "#4DAC26","#0571B0"), pch=c(18,17,15), legend=c("1","2","3"), title="Clusters")
```
\
The 3D plot shows that our clusters are actually much more separated than the 2D plot would suggest.
\
\
Let's plot the silhouette widths for our clusters.
```{r, echo=TRUE}
fviz_silhouette(pam1)
```
\
The average silhouette width of all clusters is 0.33, while the smallest average width per cluster is 0.25, and maximum is 0.36, indicative of a somewhat ineffective clustering. Some observations in cluster 2 and 3 have negative silhouette widths, indicating they may have been wrongly placed in these clusters. However, some observations exceed a silhouette width of 0.5, again indicating some structure in the data may be present at least partially.
\
\
Let's see the 3 countries that form the center of each cluster.
```{r, echo=TRUE}
pam1$id.med
```
```{r}
df_tb3 <- df_org[95,]
df_tb3[2,] =df_org[144,]
df_tb3[3,] =df_org[54,]
tb3 <-
gt(data=df_tb3, rownames_to_stub = TRUE) %>%
tab_header(title="Center observations of pam1 clusters") %>%
tab_style(locations=cells_column_labels(columns=everything()),
style=list(
cell_borders(sides="bottom", weight=px(3)),
cell_text(weight="bold"))) %>%
opt_table_font(font=list(google_font("Calibri"), default_fonts()))
as_raw_html(tb3)
```
- The first cluster's center is Malawi. We can see a high child mortality rate, low income, and relatively lower life expectancy.
- The second cluster's center is Suriname. A relatively lower child mortality rate, higher life expectancy, and higher income, closer to the median amount.
- The third cluster's center is Finland. Very low child mortality, very high life expectancy and income.
This may suggest that our clusters 1, 2 and 3 respectively may represent the least developed countries, developing countries and highly developed countries, even if some observations may be poorly clustered.
#### Clustering attempt: pam2
Let's also try clustering using the dataset with the four key variables we chose using our PCA: life_expect, income, imports and health. Again, we start with testing the optimal number of clusters.
\
```{r, echo=TRUE}
fviz_nbclust(df1, cluster::pam, method="silhouette")
fviz_nbclust(df1, cluster::pam, method="wss")
fviz_nbclust(df1, cluster::pam, method="gap_stat")
```
\
Both the silhouette and gap methods suggest 2 as the optimal number, with 3 being a close runner-up. The elbow method doesn't yield any number that can be considered a point of diminishing returns. This suggests this data is even less suitable for clustering.
\
\
Let's try fitting and 3D plotting 2 clusters.
\
```{r}
pam2 <- pam(df1, k=2, metric="manhattan")
pam2_3d <- as.data.frame(pam2$data)
pam2_3d <- pam2_3d[,1:3]
pam2_3d[,4] <- as.factor(pam2$clustering)
sp_clr2 <- c("#CA0020", "#4DAC26")
sp_clr2 <- sp_clr2[as.numeric(pam2_3d$V4)]
shapes2 <- c(18,17)
shapes2 <- shapes2[as.numeric(pam2_3d$V4)]
sp2 <- scatterplot3d(x=pam2_3d$life_expect, y=pam2_3d$income, z=pam2_3d$imports, color=sp_clr2,
pch=shapes2, main="3D clustering plot, pam2", sub="Dimensions are z-scores of original variables", xlab="life_expect", ylab="income", zlab="imports")
legend("topright", pch=c(18,17), col=c("#CA0020", "#4DAC26"), legend=levels(pam2_3d$V4), title="Clusters")
fviz_silhouette(pam2)
```
\
With 2 clusters, the average silhouette width is 0.3, slightly lower than pam1. This time, one cluster performs a bit better, with 0.4 width, while the other performs a bit worse, with 0.21 width. We see some observations with a width of more than 0.5, but also more observations with negative widths in cluster 2.
\
\
Let's see the medoids for each cluster.
```{r, echo=TRUE}
pam2$id.med
```
\
```{r}
df_tb4 <- df_org[60,]
df_tb4[2,] =df_org[122,]
tb4 <-
gt(data=df_tb4, rownames_to_stub = TRUE) %>%
tab_header(title="Center observations of pam2 clusters") %>%
tab_style(locations=cells_column_labels(columns=everything()),
style=list(
cell_borders(sides="bottom", weight=px(3)),
cell_text(weight="bold"))) %>%
opt_table_font(font=list(google_font("Calibri"), default_fonts()))
as_raw_html(tb4)
```
- Cluster 1's center is Ghana, which has high child mortality, low income, and a somewhat medium life expectancy. Not as extreme an observation as Malawi from the previous clustering.
- Cluster 2's center is Poland, which has low child mortality, higher income and life expectancy. Again, not as extreme an observation as Finland from the previous clustering.
This clustering is likely a more generalized version of the previous one, which may still be meaningful, but is likely to ignore the extreme observations, such as very underdeveloped or very developed countries. Indeed, we see numerous observations with negative silhouette width in cluster 2: These are likely the highly developed countries such as Finland, which were previously in cluster 3.
#### Clustering attempt: pam2.1
Let's try 3 clusters with the same dataset of 4 original variables.
\
```{r}
pam2.1 <- pam(df1, k=3, metric="manhattan")
pam2.1_3d <- as.data.frame(pam2.1$data)
pam2.1_3d <- pam2.1_3d[,1:3]
pam2.1_3d[,4] <- as.factor(pam2.1$clustering)
sp_clr3 <- c("#CA0020", "#4DAC26", "#0571B0")
sp_clr3 <- sp_clr3[as.numeric(pam2.1_3d$V4)]
shapes3 <- c(18,17,15)
shapes3 <- shapes3[as.numeric(pam2.1_3d$V4)]
sp3 <- scatterplot3d(x=pam2.1_3d$life_expect, y=pam2.1_3d$income, z=pam2.1_3d$imports, color=sp_clr3,
pch=shapes3, main="3D clustering plot, pam2.1", sub="Dimensions are z-scores of original variables", xlab="life_expect", ylab="income", zlab="imports")
legend("topright", legend=levels(pam2.1_3d$V4), pch=c(18,17,15), col=c("#CA0020", "#4DAC26", "#0571B0"),
title="Clusters")
fviz_silhouette(pam2.1)
```
\
The result is somewhat similar to pam1's clusters. Cluster 1 and 3 are completely separated, while cluster 2 is somewhat caught in the middle. The average silhouette width is 0.27, slightly lower than the previous attempts, but the difference between average widths of clusters is lower: They are closer to eachother, all between 0.25 and 0.29. Each cluster has some observations with negative widths, and some observations close to 0.5 width.
\
\
Let's see the medoids for each cluster.
```{r, echo=TRUE}
pam2.1$id.med
```
```{r}
df_tb5 <- df_org[60,]
df_tb5[2,] =df_org[25,]
df_tb5[3,] =df_org[159,]
tb5 <-
gt(data=df_tb5, rownames_to_stub = TRUE) %>%
tab_header(title="Center observations of pam2.1 clusters") %>%
tab_style(locations=cells_column_labels(columns=everything()),
style=list(
cell_borders(sides="bottom", weight=px(3)),
cell_text(weight="bold"))) %>%
opt_table_font(font=list(google_font("Calibri"), default_fonts()))
as_raw_html(tb5)
```
- The first cluster's center is Ghana again: An example of an underdeveloped country, not as extreme as Malawi from the first clustering attempt.
- The second cluster's center is Bulgaria: A developing country with middle levels of income and life quality statistics, probably more representative of developing countries compared to Suriname from the first attempt.
- The third cluster's center is United Kingdom: A developed country with high income and good life quality statistics, though not as extreme as Finland, and possibly more representative of developed countries as a whole.
These results suggest that this clustering attempt "distributes" the error more evenly between clusters, and may offer a more balanced degree of representativeness compared to the first attempt. There is no objective way to determine if one is better than the other: If we are interested in the most "accurate" results overall, we may prefer the first clustering, but if we prefer to minimize the differences in accuracy between clusters, we may prefer the third clustering. Another upside of the third clustering is that it uses original variables instead of PCs, so the plots and values are more directly interpretable.
## Analysis with clustering results
Now, we can group our original dataset into 3 clusters, and see if the variables differ strongly by clusters. First, let's add the cluster to the original database as a new categorical variable, and see some of the countries in each cluster:
```{r, include=FALSE}
df_clus <- df_org %>% mutate(cluster=pam1$clustering)
df_clus$cluster <- as.factor(df_clus$cluster)
df_clus1 <- subset(df_clus, cluster==1)
df_clus2 <- subset(df_clus, cluster==2)
df_clus3 <- subset(df_clus, cluster==3)
```
```{r, out.width="75%", out.height="75%"}
clus1 <-
gt(data=df_clus1[c(1,12,27,35,38),], rownames_to_stub = TRUE) %>%
tab_header(title="Cluster 1") %>%
tab_style(locations=cells_column_labels(columns=everything()),
style=list(
cell_borders(sides="bottom", weight=px(3)),
cell_text(weight="bold"))) %>%
opt_table_font(font=list(google_font("Calibri"), default_fonts()))
gtsave(data=clus1, filename="./ClusteringCountry2_files/figure-gfm/clus1.png")
knitr::include_graphics("./ClusteringCountry2_files/figure-gfm/clus1.png", dpi=NA)
```
\
We can see cluster 1 consists of generally underdeveloped countries. This cluster has 41 observations.
\
\
```{r, out.width="75%", out.height="75%"}
clus2 <-
gt(data=df_clus2[c(4, 27, 38, 89, 92),], rownames_to_stub = TRUE) %>%
tab_header(title="Cluster 2") %>%
tab_style(locations=cells_column_labels(columns=everything()),
style=list(
cell_borders(sides="bottom", weight=px(3)),
cell_text(weight="bold"))) %>%
opt_table_font(font=list(google_font("Calibri"), default_fonts()))
gtsave(data=clus2, filename="./ClusteringCountry2_files/figure-gfm/clus2.png")
knitr::include_graphics("./ClusteringCountry2_files/figure-gfm/clus2.png", dpi=NA)
```
\
Cluster 2 generally consists of developing countries across the world, but there are some observations that may have fit into the first cluster as well. We knew that cluster 2 had some overlap with the other 2 clusters. This is the largest cluster with 96 observations.
\
\
```{r, out.width="75%", out.height="75%"}
clus3 <-
gt(data=df_clus3[c(1, 4, 12, 24, 28),], rownames_to_stub = TRUE) %>%
tab_header(title="Cluster 3") %>%
tab_style(locations=cells_column_labels(columns=everything()),
style=list(
cell_borders(sides="bottom", weight=px(3)),
cell_text(weight="bold"))) %>%
opt_table_font(font=list(google_font("Calibri"), default_fonts()))
gtsave(data=clus3, filename="./ClusteringCountry2_files/figure-gfm/clus3.png")
knitr::include_graphics("./ClusteringCountry2_files/figure-gfm/clus3.png", dpi=NA)
```
\
Cluster 3 seems to be the more developed countries in the world. This is the smallest cluster with 30 observations.
\
```{r, include=FALSE}
for (i in 2:10) {
y_val <- df_clus[,i]
y_lab <- colnames(df_clus)[i]
h <- ggplot(df_clus, aes(x=cluster, y=!!y_val, fill=cluster)) +
geom_boxplot(width=0.125, lwd=0.75, fatten=1) +
stat_boxplot(geom="errorbar", width=0.125, lwd=0.75) +
theme_bw() +
theme(legend.position = "none") +
labs(x="Clusters", y=y_lab) +
scale_fill_manual(values=c("#CA0020", "#4DAC26","#0571B0"))
assign(paste0("box", i-1), h)
}
```
Since all our variables are numeric, let's see boxplots for each one, grouped by clusters, to determine if there are significant differences between clusters. Let's start with the first 4 variables:
\
```{r}
(box1+box2+box3+box4) + plot_annotation(title="Boxplots of child_mort, exports, health, imports, grouped by cluster", subtitle="N=167, N1=41, N2=96, N3=30", theme=theme_bw())
```
\
From the boxplots, we see that:
- Cluster 1 has very high child mortality, lowest exports, lowest health spending, a middle amount of imports.
- Cluster 2 has low child mortality, medium amounts of exports and health spending, and the highest amount of imports.
- Cluster 3 has very low child mortality, the highest exports and health spending, and the lowest imports.
- A few cluster 3 countries have a much lower health spending proportion than almost all countries in the data. Qatar and Kuwait are the most extreme examples, with only 1.81% and 2.63% respectively.
These differences are in line with what we'd expect for underdeveloped, developing and developed countries.
\
\
Let's look at the next 2 variables.
\
```{r}
box5+box6 + plot_annotation(title="Boxplots of income and inflation, grouped by cluster", subtitle="N=167, N1=41, N2=96, N3=30", theme=theme_bw())
```
- Cluster 1 has very low income, and higher inflation.
- Cluster 2 has a middle amount of income, and lower inflation.
- Cluster 3 has much higher income, and the lowest inflation.
Let's see the next 2 variables.
\
```{r}
box7 + box8 +
plot_annotation(title="Boxplots of life_expect, total_fert, grouped by cluster", subtitle="N=167, N1=41, N2=96, N3=30", theme=theme_bw())
```
- Cluster 1 has the lowest life expectancy, and the highest fertility rate.
- There is a considerably low outlier for both variables: Haiti with 32.1 life expectancy and 3.33 fertility.
- Cluster 2 has a high life expectancy, and lower fertility rate.
- Cluster 3 has the highest life expectancy, and the lowest fertility rate.
And our final variable.
\
```{r}
box9 + plot_annotation(title="Boxplot of gdpp, grouped by cluster", subtitle="N=167, N1=41, N2=96, N3=30", theme=theme_bw())
```
\
Cluster 1 has a very low GDPP, cluster 2 is a bit higher, and cluster 3 is much higher, as expected due to the income inequality between developed and underdeveloped countries.
## Conclusion
The suitability and performance of clustering algorithms is not always easy to assess: In our example, cluster tendency and performance statistics indicated a low suitability and performance for our clustering, but exploring the results show us that we have clearly identified three meaningful clusters in the data:
- Underdeveloped countries, with the lowest income and quality of life statistics,
- Developing countries with moderate income and quality of life statistics,
- Highly developed countries with high income and quality of life statistics.
The first and third clusters are smaller, with more extreme observations, and the middle cluster is larger, with less extreme observations, as we'd expect. Clustering yielded to be an effective and insightful exploratory analysis tool, even though the pure statistics and metrics suggested otherwise. Visualizing the clustering results in 3D helped us see a clearer picture of the separation between clusters, which looked deceptively overlapped in 2D plots.
\
\
Another important step was the preprocessing: If we hadn't applied PCA and/or removed the highly correlated variables, our results may have been less effective: Some variables in our dataset were highly correlated measures of similar features, and they could have exaggerated the similarity or dissimilarity of observations and clusters.