# Clustering Examples

This notebook contains some examples of clustering to show how to look for clusters in data and how to visualize several aspects of clustering. 


We will start with the simple iris data set. 

In [None]:
options(repr.plot.width=10, repr.plot.height=10)


head(iris)

It already comes with the class labels; we know the Species of each observation. Below is the original data: 

In [None]:
library(ggplot2)
ggplot(iris, aes(x=Petal.Length, y=Petal.Width, color=Species)) + geom_point() 

Let's create a data frame with only two variables: Petal.Length and Petal.Width. We leave out the Species variable, so this data frame has no labels.

In [None]:
iris_c <- iris[, c("Petal.Length", "Petal.Width")]

In [None]:
head(iris_c)

In the unsupervised setting, we don't know which group each observation belongs to:

In [None]:
ggplot(iris_c, aes(x=Petal.Length, y=Petal.Width)) + geom_point() 

Let's create a clustering using K-Means. From the plot above, we can tell that there are probably 2 or 3 clusters. Since we know that we are dealing with three species, we can set k to 3. Also, make sure to set `nstart`: k-means depends on its random initial centers, so `nstart = 20` runs the algorithm from 20 random starts and keeps the solution with the lowest total within-cluster sum of squares.

In [None]:
set.seed(100)
i_clust <- kmeans(iris_c, 3, nstart = 20)
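The effect of `nstart` can be checked directly. The following is a minimal sketch, not part of the original analysis: it compares the objective value (`tot.withinss`) of several single-start runs with one multi-start run. The names `single_runs` and `multi` are introduced here only for illustration.

```r
# Same two petal variables as above
iris_c <- iris[, c("Petal.Length", "Petal.Width")]

set.seed(100)

# 20 separate runs, each with a single random start
single_runs <- sapply(1:20, function(i) {
  kmeans(iris_c, centers = 3, nstart = 1)$tot.withinss
})

# One call with 20 random starts; kmeans keeps the best of them
multi <- kmeans(iris_c, centers = 3, nstart = 20)

range(single_runs)   # spread of the single-start solutions
multi$tot.withinss   # objective value of the multi-start solution
```

If the two numbers printed by `range()` differ, some single starts ended in a worse local optimum, which is what `nstart` guards against.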

In [None]:
i_clust # Look at the return value; it contains several structures 


In [None]:

# These are the centroids of the clusters 
i_clust$centers

# These are the cluster labels assigned to each observation 
i_clust$cluster

In [None]:
dfc <- data.frame(i_clust$centers)


Let's plot the clusters and the cluster centers. 

In [None]:

ggplot() + 
  geom_point(data=iris_c, aes(x=Petal.Length, y=Petal.Width, color=factor(i_clust$cluster))) + 
  geom_point(data=dfc, aes(x=Petal.Length, y=Petal.Width), color="black", size=5)

Let's compute a confusion table. 

**IMPORTANT!** Remember that this is an unsupervised modeling example. We do not know what the actual labels are. All we can do is create clusters and label them arbitrarily as 1, 2, 3, etc. What the clustering algorithm labels as 1, for example, may actually correspond to label 3. So a confusion table read row by row can show gross errors that are not real. Since the cluster labels are arbitrary, we are free to permute them and use the assignment that gives the highest accuracy.

In [None]:
table(i_clust$cluster, iris$Species)
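One simple way to line up arbitrary cluster numbers with the known species is to map each cluster to the species that is most frequent in it. This is only a sketch under that assumption (two clusters could in principle map to the same species); the names `ctab`, `mapping`, and `predicted` are introduced here for illustration.

```r
iris_c <- iris[, c("Petal.Length", "Petal.Width")]
set.seed(100)
i_clust <- kmeans(iris_c, 3, nstart = 20)

# Rows are cluster numbers, columns are species
ctab <- table(i_clust$cluster, iris$Species)

# For each cluster (row), pick the species (column) with the largest count
mapping <- colnames(ctab)[apply(ctab, 1, which.max)]

# Translate cluster numbers into species names and compare with the truth
predicted <- factor(mapping[i_clust$cluster], levels = levels(iris$Species))
table(predicted, iris$Species)
mean(predicted == iris$Species)   # accuracy under the majority mapping
```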

Here, the clustering algorithm created clusters that mostly conform with the actual groups. 

We can visualize clustering results in the following ways, too: 

In [None]:
library(fpc)
library(cluster)

In [None]:
plotcluster(iris_c, i_clust$cluster)


In [None]:
clusplot(iris_c, i_clust$cluster, color=TRUE, shade=TRUE, labels=2)


What you see above is a generalized way of showing clusters: when the data has more than two dimensions, these functions project the observations onto two derived axes (discriminant coordinates for `plotcluster`, principal components for `clusplot`), much like a PCA plot. 


---

Let's look at the US arrests data set. **Remember that scaling data is good practice;** we don't want our analysis affected by the different scales of the variables. Otherwise, variables measured on larger scales would dominate the distance computations. 

In [None]:
df <- USArrests
df <- na.omit(df)
df <- scale(df)
head(df)
dim(df)
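To see what scaling changes, the sketch below compares the spread of the raw variables with the standardized ones. The names `raw` and `scaled` are introduced here for illustration.

```r
raw <- USArrests

# Standard deviations differ widely across the raw variables
sapply(raw, sd)

# scale() centers each column to mean 0 and rescales it to standard deviation 1
scaled <- scale(raw)
round(colMeans(scaled), 10)
apply(scaled, 2, sd)
```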

One way to analyze the data is to look for natural groupings (clusters) in it. Observations that share similar characteristics can be grouped into clusters, and by analyzing those clusters as sub-groups we can get insights from the data. 


First, we should get some idea of whether the data is suitable for clustering. The following code computes a distance matrix and plots it. Blocks in the plot indicate that there are clusters in the data. 

In [None]:
library(factoextra)
distance <- get_dist(df)
fviz_dist(distance, gradient = list(low = "blue", mid = "white", high = "red"), order=TRUE)

The above distance matrix suggests that there are some clusters in the data. Small distances are blue, large distances are red. 
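For reference, the entries of such a matrix are ordinary pairwise distances. The sketch below assumes the default Euclidean method and uses base R's `dist()` to show one entry computed by hand; `d` and `manual` are names introduced here.

```r
df <- scale(USArrests)

# Pairwise Euclidean distances between all states
d <- dist(df, method = "euclidean")
as.matrix(d)[1:3, 1:3]

# The same distance for one pair, computed from the definition
manual <- sqrt(sum((df["Alabama", ] - df["Alaska", ])^2))
manual
all.equal(manual, as.matrix(d)["Alabama", "Alaska"])
```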


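We can also use the following function to get the same idea. It reports the Hopkins statistic; in recent versions of factoextra, a value well above 0.5 (close to 1) suggests clusterable data. Older versions reported the statistic on a reversed scale, so check the documentation of your installed version. 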

In [None]:
get_clust_tendency(data = df, n=10)
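To make the idea behind the statistic concrete, here is a rough sketch of one common formulation. It is an illustration only and not necessarily the exact computation used by factoextra; `hopkins_sketch` and its internals are hypothetical names. It compares nearest-neighbour distances of uniformly generated points (`u`) with those of sampled real observations (`w`).

```r
hopkins_sketch <- function(x, n = 10) {
  x <- as.matrix(x)
  # n real observations sampled without replacement
  idx <- sample(nrow(x), n)
  # n artificial points drawn uniformly within each variable's observed range
  u_pts <- apply(x, 2, function(col) runif(n, min(col), max(col)))
  # Distance from point p to its nearest neighbour among the rows of m
  nn_dist <- function(p, m) min(sqrt(rowSums(sweep(m, 2, p)^2)))
  # Real points: nearest neighbour among the other observations
  w <- sapply(idx, function(i) nn_dist(x[i, ], x[-i, , drop = FALSE]))
  # Artificial points: nearest neighbour among all observations
  u <- sapply(seq_len(n), function(j) nn_dist(u_pts[j, ], x))
  sum(u) / (sum(u) + sum(w))
}

set.seed(123)
h <- hopkins_sketch(scale(USArrests), n = 10)
h
```

Under this formulation, clustered data tends to give values closer to 1, because real points sit much nearer to their neighbours than uniformly scattered points do.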

Let's do a kmeans clustering with two clusters. 

In [None]:
kclust <- kmeans(df, centers = 2, nstart = 20)
str(kclust)

In [None]:
kclust$centers

In [None]:
#Let's visualize the clusters:

clusplot(df, kclust$cluster, color=TRUE, shade=TRUE, labels=2)


In [None]:
# We can also use this function to visualize. 

fviz_cluster(kclust, data=df)

We do not know the optimal number of clusters; we can try a few methods to see if we can justify two clusters. 

In [None]:
fviz_nbclust(df, kmeans, method = "wss")
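The "wss" (elbow) method is based on the total within-cluster sum of squares for each candidate k. A minimal sketch of the same quantity computed directly with `kmeans` follows; the `nstart` value and the name `wss` are choices made here.

```r
df <- scale(USArrests)

set.seed(123)
# Total within-cluster sum of squares for k = 1, ..., 10
wss <- sapply(1:10, function(k) {
  kmeans(df, centers = k, nstart = 20)$tot.withinss
})

round(wss, 1)
# Drop in WSS gained by each additional cluster; the "elbow" is where this flattens
round(-diff(wss), 1)
```

Plotting `wss` against k gives the elbow curve.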

In [None]:
fviz_nbclust(df, kmeans, method = "silhouette")
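The silhouette method picks the k with the highest average silhouette width. As a sketch of what is being computed, the average width can be obtained with `silhouette()` from the cluster package; `avg_sil` is a name introduced here, and the `nstart` value is an assumption.

```r
library(cluster)

df <- scale(USArrests)
d <- dist(df)

set.seed(123)
# Average silhouette width for k = 2, ..., 10 (not defined for k = 1)
avg_sil <- sapply(2:10, function(k) {
  km <- kmeans(df, centers = k, nstart = 20)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- 2:10

round(avg_sil, 3)
names(avg_sil)[which.max(avg_sil)]   # k with the largest average width
```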


In [None]:
gap_stat <- clusGap(df, FUN = kmeans, nstart = 20, K.max = 10, B = 50)

In [None]:
fviz_gap_stat(gap_stat)


In [None]:
library(NbClust)
nb <- NbClust(df, distance = "euclidean", min.nc = 2,
        max.nc = 10, method = "complete", index ="all")

As suggested by `NbClust`, this data seems to have two clusters. 

Let's apply `pamk` and see what it finds for the number of clusters and the clusters themselves. 

In [None]:
help(pamk)

In [None]:
pamclust <- pamk(df, krange=1:5, critout=TRUE)

pamclust

`pamk` also suggests two clusters. Below are the clusters produced by `pamk`:

In [None]:
clusplot(df, pamclust$pamobject$clustering, color=TRUE, shade=TRUE, labels=2)


You can further analyze these clusters to find out what they have in common. You can look at the univariate statistics, figure out if factor analysis groups the variables for each cluster in some meaningful way, or see if there are associations between the variables that only exist in the clusters. 
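As a starting point for the univariate look, the sketch below summarizes each cluster on the original (unscaled) variables. It uses a two-cluster k-means solution for illustration; `profile` is a name introduced here.

```r
set.seed(123)
kclust <- kmeans(scale(USArrests), centers = 2, nstart = 20)

# Mean of each original variable within each cluster
profile <- aggregate(USArrests, by = list(cluster = kclust$cluster), FUN = mean)
profile

# Number of states in each cluster
table(kclust$cluster)
```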

---

Let's create another data set: we will read the red and white wine-quality data sets and combine them into one set. We will create a new variable named `type` and assign 1 for white and 2 for red wines. Then, we will try to cluster the data without looking at the `type` column and see if the unsupervised methods can find two natural clusters which conform with the red vs. white wine groups. 


In [None]:
wine.r <- read.csv("/dsa/data/all_datasets/wine-quality/winequality-red.csv", sep=";")

wine.w <- read.csv("/dsa/data/all_datasets/wine-quality/winequality-white.csv", sep=";")

# combine the two data sets and add a type column: 1 = white, 2 = red
wdf <- rbind(
  cbind(wine.w, type = rep(1, nrow(wine.w))),
  cbind(wine.r, type = rep(2, nrow(wine.r)))
)

In [None]:
head(wdf)

In [None]:
# remove the type column
wdf_c_ <-  wdf[, which(names(wdf) != "type")]

wdf_c <- scale(wdf_c_) # try and see what happens if you do NOT scale.. 
head(wdf_c)

Since we want to find out if we can differentiate between red and white wines, we'll take k=2: 

In [None]:
wclust <- kmeans(wdf_c, centers = 2, nstart = 20)


In [None]:
fviz_cluster(wclust, data=wdf_c)

Let's create a confusion table and compute an accuracy. **Remember** that we discussed how these cluster numbers are arbitrary; you should try both assignments (1 vs. 2) and see which one produces the higher accuracy. 

In [None]:
ctable_k <- table(wclust$cluster, wdf$type)
ctable_k
sum(diag(ctable_k))/dim(wdf_c)[1]



In this run, what the clustering algorithm calls 1 and 2 actually corresponds to types 2 and 1, so we compute the accuracy with that in mind: 

In [None]:


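# With the labels swapped, the correct matches sit on the off-diagonal of the table.
# In the original run these two counts were 4804 and 1574.
(ctable_k[1, 2] + ctable_k[2, 1]) / nrow(wdf_c)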



Considering that this is an unsupervised way of guessing red and white wines, it's actually pretty good. 
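With only two clusters, the label swap can be handled without reading the table by eye: compute the accuracy for both assignments and keep the larger one. The helper `best_accuracy_2` and the toy vectors below are hypothetical and only illustrate the idea; they are not the wine data.

```r
# Accuracy of a two-cluster solution, ignoring which cluster is called 1 or 2
best_accuracy_2 <- function(cluster, truth) {
  tab <- table(cluster, truth)
  stopifnot(all(dim(tab) == c(2, 2)))
  as_is   <- sum(diag(tab))          # cluster 1 = class 1, cluster 2 = class 2
  swapped <- tab[1, 2] + tab[2, 1]   # cluster 1 = class 2, cluster 2 = class 1
  max(as_is, swapped) / sum(tab)
}

# Made-up labels in which the cluster numbers are mostly swapped
toy_cluster <- c(1, 1, 1, 2, 2, 2, 2, 1)
toy_truth   <- c(2, 2, 2, 1, 1, 1, 2, 1)
best_accuracy_2(toy_cluster, toy_truth)   # 0.75
```

The same function could be called as `best_accuracy_2(wclust$cluster, wdf$type)`.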

Remember that kmeans is not robust to outliers; let's try `pamk` and see what happens. 
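The robustness point can be seen on a toy example before running the larger computation. With a single cluster, k-means places its center at the mean, while PAM must pick an actual observation (a medoid). The data below is made up for illustration.

```r
library(cluster)

# Four nearby values and one extreme outlier, as a one-column matrix
x <- matrix(c(1, 2, 3, 4, 100), ncol = 1)

km <- kmeans(x, centers = 1)
pm <- pam(x, k = 1)

km$centers   # the mean, pulled toward the outlier
pm$medoids   # an observed value from the bulk of the data
```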

**It takes a while to compute:**

In [None]:
pamclust2 <- pamk(wdf_c, krange=1:5, critout=TRUE)


`pamk` also suggests two clusters although the criteria are very close to each other. Let's look at the confusion table: 

In [None]:
ctable_p <- table(pamclust2$pamobject$clustering, wdf$type)
ctable_p

In [None]:
sum(diag(ctable_p))/dim(wdf_c)[1]

Both methods are able to separate red and white wines successfully. This suggests that the variables form natural, and most probably convex, separable clusters for red and white wines. 

You can use a similar approach to find natural clusters in data and then analyze what they have in common. 