Clustering
===

When a data set doesn’t have labels, we can use unsupervised learning to find structure in the data, which in turn allows us to discover patterns or groups.

Cluster analysis is a method of finding groups, known as **clusters**, in datasets. As the datasets are unlabelled, cluster analysis aims to group similar samples based on their input features.

**K-means clustering** separates samples into `k` clusters. Each cluster is represented by the average (mean) of its samples, and every sample is assigned to the cluster whose mean is nearest. So if we state that `k = 5`, k-means clustering will divide the samples into 5 clusters, each built around its own mean.
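To make the idea concrete, here is a minimal sketch of the two alternating steps behind k-means (assign each sample to its nearest mean, then recompute each mean). It is written in base R for illustration only: `simple_kmeans` and the `toy` data are hypothetical names made up for this example, and the built-in `kmeans` function used in the rest of this exercise defaults to a more refined algorithm.

```r
# Minimal k-means sketch in base R (illustration only).
# `simple_kmeans` is a hypothetical helper, not part of any package.
simple_kmeans <- function(x, k, iterations = 20) {
  x <- as.matrix(x)
  # Start from k randomly chosen samples as the initial means
  centres <- x[sample(nrow(x), k), , drop = FALSE]
  for (i in seq_len(iterations)) {
    # Squared distance from every sample to every centre (n x k matrix)
    d <- vapply(seq_len(k),
                function(j) rowSums(sweep(x, 2, centres[j, ])^2),
                numeric(nrow(x)))
    # Assignment step: each sample joins its nearest centre
    cluster <- max.col(-d, ties.method = "first")
    # Update step: each centre moves to the mean of its samples
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centres[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centres = centres)
}

set.seed(1)
# Two well-separated toy blobs (made-up data for illustration)
toy <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(toy, k = 2)

# Number of samples assigned to each cluster
table(fit$cluster)
```

Because the starting means are picked at random, different seeds can give different results, which is one reason the exercise sets a seed before generating data.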

Step 1
---

In this exercise, we will use k-means clustering to analyse a few different datasets.

First, we need to load the required packages for this session.

**Run the code below**

In [None]:
# Run this box to load the required packages

# Load the required libraries for this session
suppressMessages(install.packages("tidyverse"))
suppressMessages(library("tidyverse"))
suppressMessages(install.packages("clusterGeneration"))
suppressMessages(library("clusterGeneration"))
suppressMessages(install.packages("kernlab"))
suppressMessages(library("kernlab"))
suppressMessages(install.packages("mlbench"))
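suppressMessages(library("mlbench"))
# plotly provides `plot_ly`, which is used for the 3D plot in Step 4
suppressMessages(install.packages("plotly"))
suppressMessages(library("plotly"))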

Now let's create a dataset with a known number of clusters to demonstrate how k-means clustering would handle the data.

Below, we will change the `numClust` argument within the `genRandomClust` function to generate a random data set with 3 clusters.

#### Replace `<clusterNumber>` with `3` and run the code.

In [None]:
# Set the seed to be able to reproduce the same random cluster data
set.seed(365)

# Generate random data set with 3 clusters
###
# REPLACE <clusterNumber> WITH 3
###
clust_three <- genRandomClust(numClust = <clusterNumber>, sepVal = 0.1, numReplicate = 1, clustszind = 1, clustSizeEq = 175, 
                              outputDatFlag = FALSE, outputLogFlag = FALSE, outputEmpirical = FALSE, 
                              outputInfo = FALSE)
###

# Save x and y values to a data frame
clust_three <- as.data.frame(clust_three$datList$test_1) %>% 
rename(., x = x1, y = x2)

# Create scatter plot
ggplot(clust_three, aes(x, y)) +
geom_point(alpha = 0.75) +
ggtitle("Data set n = 3 clusters") +
theme(plot.title = element_text(hjust = 0.5))

Alright, we just made a dataset with 3 clusters and graphed it.

Let's see how k-means performs on this dataset, already knowing we have 3 clusters.

### In the cell below replace:
#### 1. `<clusterVariableName>` with `clust_three`
#### 2. `<numberOfCenters>` with `3`
#### then __run the code__.

In [None]:
###
# REPLACE <clusterVariableName> WITH clust_three AND <numberOfCenters> WITH 3
###
clust3_kmeans <- kmeans(x = <clusterVariableName>, centers = <numberOfCenters>)
###

clust3 <- as.data.frame(clust3_kmeans$cluster) %>% 
rename(., Cluster_kmeans = `clust3_kmeans$cluster`) %>% 
mutate(., Cluster_kmeans = as.factor(Cluster_kmeans)) %>%
bind_cols(., clust_three)

# Check output
# str(clust3)
head(clust3)

# Plot the results
clust3 %>% 
ggplot(aes(x, y, colour = Cluster_kmeans)) +
geom_point(alpha = 0.75) +
labs(title = "Data set n = 3 clusters: k-means clustering analysis", colour = "Cluster\ngroup") +
theme(plot.title = element_text(hjust = 0.5))

K-means clustering performs rather well, by the looks of it!

But we already knew that our data set had three clusters. Sometimes the number of clusters is not so clear...

## Step 2

Let's generate another dataset in which it may be a little less obvious how many clusters it contains.

Below, we will generate a random data set with `4` clusters and change the `sepVal` argument to reduce the separation between the clusters.

### In the cell below replace:
#### 1. `<numberOfClusters>` with `4`
#### 2. `<clusterSeperationValue>` with `-0.01`
#### then __run the code__.

In [None]:
# Set seed to reproduce this code
set.seed(365)

# Generate random data set with 4 clusters
###
# REPLACE <numberOfClusters> WITH 4 AND <clusterSeperationValue> WITH -0.01
###
four_clust <- genRandomClust(numClust = <numberOfClusters>, sepVal = <clusterSeperationValue>, numReplicate = 1, clustszind = 1, clustSizeEq = 175,
                             outputDatFlag = FALSE, outputLogFlag = FALSE, outputEmpirical = FALSE, 
                             outputInfo = FALSE)
###

# Obtain cluster data x and y values
clust4_data <- as.data.frame(four_clust$datList$test_1) %>% 
rename(., x = x1, y = x2)

# Create scatter plot
ggplot(clust4_data, aes(x, y)) +
geom_point(alpha = 0.75) +
ggtitle("Data set n = 4 clusters") +
theme(plot.title = element_text(hjust = 0.5))

In instances where we do not know how many clusters to expect, we can run k-means clustering multiple times with different *k* values to see how the data is partitioned. Let's try that now.

The following code block creates a custom function named `cluster_kvalue`. This function performs k-means clustering, saves the cluster membership, then creates a scatter plot of the data coloured by the cluster membership. The `cluster_kvalue` function takes two arguments:

- `data_input` numeric matrix/data frame
- `kvalue` number of clusters to partition into (k)

You do not need to edit the following code block. However, you will need to call this custom function later!

**Run the code below to prepare the function for later use**

In [None]:
# Run this block to prepare the function for later

# But don't edit it!

# Create own function to run k-means clustering, save cluster membership, then plot results
cluster_kvalue <- function(data_input, kvalue) {
    clust_kmeans <- kmeans(x = data_input, centers = kvalue) #, algorithm = "Hartigan-Wong"
    as.data.frame(clust_kmeans$cluster) %>% 
    rename(., Cluster_kmeans = `clust_kmeans$cluster`) %>% 
    mutate(., Cluster_kmeans = as.factor(Cluster_kmeans)) %>%
    bind_cols(., data_input) %>% 
    ggplot(aes(x, y, colour = Cluster_kmeans)) +
    geom_point(alpha = 0.75) +
    ggtitle(paste("Cluster analysis using k-means clustering: k = ", kvalue)) +
    theme(plot.title = element_text(hjust = 0.5)) +
    labs(colour = "Cluster\ngroup")
}

Now let's run our custom function on `clust4_data`, changing the number of clusters (`kvalue`) each time. This will show us how k-means partitions the data for each choice of *k*.

#### Below, replace each `<numberOfClusters>` as directed.

In [None]:
# Run this box to test k = 2
cluster_kvalue(clust4_data, 2)

In [None]:
###
# REPLACE <numberOfClusters> WITH 3
###
cluster_kvalue(clust4_data, <numberOfClusters>)
###

In [None]:
###
# REPLACE <numberOfClusters> WITH 4
###
cluster_kvalue(clust4_data, <numberOfClusters>)
###

In [None]:
###
# REPLACE <numberOfClusters> WITH 5
###
cluster_kvalue(clust4_data, <numberOfClusters>)
###

In [None]:
###
# REPLACE <numberOfClusters> WITH 6
###
cluster_kvalue(clust4_data, <numberOfClusters>)
###

Which value of *k* do you think best splits the data?
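Judging the plots by eye is one option. Another common heuristic, sketched below as an optional extra, is to compare the total within-cluster sum of squares that `kmeans` reports (`tot.withinss`) across values of *k* and look for the point where the curve stops dropping sharply. The `toy` data here is made up to stand in for `clust4_data`, and `nstart = 10` is an assumed setting that simply repeats each fit from several random starts.

```r
set.seed(42)

# Made-up data: four Gaussian blobs, standing in for `clust4_data`
centres <- rbind(c(0, 0), c(4, 0), c(0, 4), c(4, 4))
toy <- do.call(rbind, lapply(seq_len(nrow(centres)), function(i) {
  cbind(x = rnorm(50, centres[i, 1], 0.6),
        y = rnorm(50, centres[i, 2], 0.6))
}))

# Total within-cluster sum of squares for k = 1 to 8
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# Plot the curve; a visible "elbow" suggests a reasonable k
plot(k_values, wss, type = "b",
     xlab = "k", ylab = "Total within-cluster sum of squares")
```

This is only a guide: when clusters overlap heavily, the curve may bend gradually and give no clear answer.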

## Step 3

K-means clustering performs well enough on data that falls into compact groups like that, but let's try it out on a data set whose groups cannot be separated by a straight line.

Let's create a data set that contains two rings of data. 

We need to change the arguments in the `ggplot` code to plot the data.

### In the cell below replace:
#### 1. `<dataset>` with `ring_data`
#### 2. `<variableNames>` with `x, y`

#### and then __run the code__.

In [None]:
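# Draw 250 random points in 2D
x <- matrix(rnorm(500), ncol = 2)
# Divide each point by its Euclidean norm so that it lies on a circle of radius 1
ring1_data <- x/sqrt(rowSums(x^2))
# Halve the coordinates to make a second, inner ring of radius 0.5
ring2_data <- ring1_data/2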

ring1_data <- as.data.frame(ring1_data)
ring2_data <- as.data.frame(ring2_data)

# Check structure
str(ring1_data)
str(ring2_data)

ring_data <- bind_rows(ring1_data, ring2_data) %>% 
rename(x = V1, y = V2)
str(ring_data)

###
# REPLACE <dataset> WITH ring_data AND <variableNames> WITH x, y
###
ggplot(data = <dataset>, aes(<variableNames>)) +
###
geom_point(alpha = 0.5) +
ggtitle("Two ring data set") +
theme(plot.title = element_text(hjust = 0.5))

We can clearly distinguish two "clusters", that is, the two rings of datapoints.

Let's see how k-means handles a dataset like this. We can use our previous custom function to perform k-means clustering on `ring_data` and plot the results.

#### Replace the `<clusterFunctionParameters>` with `ring_data, 2` then run the code.

In [None]:
###
# REPLACE <clusterFunctionParameters> WITH ring_data, 2
###
(ring_kmeans <- cluster_kvalue(<clusterFunctionParameters>))
###

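K-means clustering clearly has difficulty solving this. Both rings share the same centre, and k-means assigns each point to its nearest mean, which divides the plane with a straight line. As we are currently using it, there is no way for k-means clustering to place two means to label this data set correctly.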
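The short sketch below illustrates the problem numerically. It rebuilds two rings in the same way as the earlier cell (the names `outer_ring`, `inner_ring` and `ring_label` are made up for this example), shows that the mean of each ring lies close to the origin, and cross-tabulates the k-means clusters against the true ring of each point.

```r
set.seed(1)

# Rebuild two concentric rings (radius 1 and 0.5)
pts <- matrix(rnorm(500), ncol = 2)
outer_ring <- pts / sqrt(rowSums(pts^2))
inner_ring <- outer_ring / 2

# The mean of each ring sits close to the origin,
# so the two "true" cluster means almost coincide
colMeans(outer_ring)
colMeans(inner_ring)

# Cluster the combined data and compare with the true ring of each point
rings <- rbind(outer_ring, inner_ring)
ring_label <- rep(c("outer", "inner"), each = nrow(pts))
fit <- kmeans(rings, centers = 2, nstart = 10)
tab <- table(ring = ring_label, cluster = fit$cluster)
tab
```

If k-means had recovered the rings, each row of the table would fall into a single column. Instead, each cluster should contain points from both rings.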

Step 4
---

We can try to run k-means clustering another way. Let's add another feature to our two ring data set: the distance of each point away from the centre.

Let's see if k-means is able to classify the two data clusters with this new feature.

We will change the arguments in the `ggplot` call to plot the ring data in 2D. This will be coloured by the new feature, `z`.

### In the cell below replace:
#### 1. `<variablesToPlot>` with `x, y`
#### 2. `<variableForColour>` with `z`

#### and then __run the code__.

In [None]:
# Calculate the distance from the centre for each data point, scaled by 4
ring_data_z <- ring_data %>% 
mutate(z = 4 * sqrt(x^2 + y^2))

head(ring_data_z)
tail(ring_data_z)

# Plot in 2D first
ring_data_z %>%

###
# REPLACE <variablesToPlot> WITH x, y and <variableForColour> WITH z
###
ggplot(aes(<variablesToPlot> , colour = <variableForColour> )) +
###
geom_point(alpha = 0.75) +
labs(title = "Two ring data coloured by distance from centre") +
theme(plot.title = element_text(hjust = 0.5))

Now let's plot all three features `x, y, z` coloured by the feature `z` in 3D using the `plot_ly` function from the `plotly` package.

#### Run the code below.

In [None]:
plot_ly(ring_data_z, x = ~x, y = ~y, z = ~z, color = ~z) %>% 
add_markers(opacity = 0.25)

How does k-means clustering deal with our ring dataset now that it has 3 features, and 2 clusters?

### In the cell below replace:
#### 1. `<dataset>` with `ring_data_z`
#### 2. `<numberOfClusers>` with `2`

#### and then __run the code__.

In [None]:
###
# REPLACE <dataset> WITH ring_data_z AND <numberOfClusers> WITH 2
###
cluster_kvalue(<dataset>, <numberOfClusers>)
###

Looks good! When we add a third feature `z` to our two ring data set, k-means clustering can better discern the cluster membership.
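To check this with numbers rather than by eye, the sketch below cross-tabulates the k-means clusters against the ring each point came from. It rebuilds the rings so that it can run on its own; `compare_to_rings` is a hypothetical helper written for this example, and the unscaled run (`scale = 1`) is included only as a comparison to show how much the weight given to `z` can matter.

```r
set.seed(1)

# Rebuild two concentric rings (radius 1 and 0.5) with known membership
pts <- matrix(rnorm(500), ncol = 2)
outer_ring <- pts / sqrt(rowSums(pts^2))
rings <- rbind(outer_ring, outer_ring / 2)
colnames(rings) <- c("x", "y")
ring_label <- rep(c("outer", "inner"), each = nrow(pts))

# Distance of every point from the centre
radius <- sqrt(rowSums(rings^2))

# Hypothetical helper: cluster on x, y and a scaled distance feature z,
# then cross-tabulate the clusters against the true rings
compare_to_rings <- function(scale) {
  features <- cbind(rings, z = scale * radius)
  fit <- kmeans(features, centers = 2, nstart = 25)
  table(ring = ring_label, cluster = fit$cluster)
}

tab_scaled <- compare_to_rings(4)    # same scaling as `ring_data_z`
tab_unscaled <- compare_to_rings(1)  # raw distance, for comparison

tab_scaled
tab_unscaled
```

K-means treats all features equally when measuring distance, so a new feature only changes the result if its spread is large enough relative to the existing ones.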

Step 5
---

Some data cannot be manipulated like that. Let's have a look at a different type of data distribution, spirals.

We will create a data set in the shape of spirals using the function `mlbench.spirals` from the package `mlbench`.

#### Replace `<spiralFunction>` with `mlbench.spirals` and run the code.

In [None]:
# Set the seed to reproduce the random data set
set.seed(123)

###
# REPLACE <spiralFunction> WITH mlbench.spirals
###
spiral_data <- <spiralFunction>(n = 500, cycles = 1, sd = 0.025)
###

# Save `spiral_data` to a data frame to allow plotting
spiral_data <- data.frame(x = spiral_data$x[, 1], y = spiral_data$x[, 2], classes = spiral_data$classes)

# Create scatter plot of the data
spiral_data %>% 
ggplot(aes(x, y)) +
geom_point(alpha = 0.75) +
labs(title = "Spiral data set", colour = "Spiral\nnumber") +
theme(plot.title = element_text(hjust = 0.5))

Let's try running k-means clustering on `spiral_data` using our custom function.

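#### Replace `<dataset>` with `select(spiral_data, -classes)` and `<numberOfClusters>` with `2`, then run the code.

In [None]:
###
# REPLACE <dataset> WITH select(spiral_data, -classes) AND <numberOfClusters> WITH 2
# (the known `classes` labels are dropped so that only x and y are clustered)
###
cluster_kvalue(<dataset>, <numberOfClusters>)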
###

Again, k-means clustering faces a similar issue as with the ring data. But k-means clustering is just one method for clustering; other clustering methods may be more suitable to partition spiral data appropriately.

Step 6
---

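**Spectral clustering** is a clustering method that groups samples by how they are connected to one another, rather than by their distance to a cluster mean. It starts from a measure of similarity between pairs of samples and uses it to find groups, so that samples in the same group are similar, and samples in different groups are dissimilar to each other.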
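For readers curious about what happens under the hood, the following is a rough base R sketch of one common form of spectral clustering: build a similarity matrix, normalise it, take its leading eigenvectors, and run k-means on those. It is not the `kernlab` implementation; the toy ring data, the Gaussian similarity and the bandwidth `sigma = 0.1` are all assumptions chosen for illustration.

```r
set.seed(1)

# Made-up data: two concentric rings, radius 1 and 0.5
angle <- runif(200, 0, 2 * pi)
r <- rep(c(1, 0.5), each = 100)
pts <- cbind(r * cos(angle), r * sin(angle))

# 1. Pairwise similarity with a Gaussian kernel; `sigma` is an assumed bandwidth
sigma <- 0.1
affinity <- exp(-as.matrix(dist(pts))^2 / (2 * sigma^2))
diag(affinity) <- 0

# 2. Symmetric normalisation: D^(-1/2) A D^(-1/2), where D holds the row sums
deg <- rowSums(affinity)
norm_affinity <- affinity / sqrt(outer(deg, deg))

# 3. The k leading eigenvectors give each sample new coordinates
k <- 2
eig <- eigen(norm_affinity, symmetric = TRUE)
embedding <- eig$vectors[, 1:k]
# Rescale each row of the embedding to unit length
embedding <- embedding / sqrt(rowSums(embedding^2))

# 4. Ordinary k-means on the embedded coordinates
spectral_cluster <- kmeans(embedding, centers = k, nstart = 10)$cluster

# Compare the clusters with the ring radius of each point
table(ring = r, cluster = spectral_cluster)
```

The key design choice is that k-means is applied to the eigenvector coordinates rather than to `x` and `y` directly, so points that are linked through chains of near neighbours can end up together even when the group is not a compact blob. The result is sensitive to the bandwidth, so other values of `sigma` may behave differently.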

We will run spectral clustering using the `specc` function from the `kernlab` package. We will set the number of centers to 2, since we expect two groups where the samples in each group belong to a different spiral. Our dataset is the `spiral_data` we have been using previously.

#### Replace `<spectralClusteringFunction>` with `specc` and run the code.

In [None]:
###
# REPLACE <spectralClusteringFunction> WITH specc
###
spiral_data_specc <- <spectralClusteringFunction>(as.matrix(select(spiral_data, -classes)), centers = 2)
###
spiral_data %>% 
mutate(Specc = spiral_data_specc@.Data) %>%
ggplot(aes(x, y, colour = as.factor(Specc))) +
geom_point(alpha = 0.75) +
labs(title = "Spectral clustering of spiral data", colour = "Spectral\ncluster") +
theme(plot.title = element_text(hjust = 0.5))

Excellent! Spectral clustering works for the spiral data.

Let's see how spectral clustering performs on our previous data set, the two ring data, based on just two features, x and y co-ordinates.

### In the cell below replace:
#### 1. `<dataset>` with `ring_data`
#### 2. `<numberOfCenters>` with `2`
#### then __run the code__.

In [None]:
# Use spectral clustering algorithm on ring_data
head(ring_data)
class(ring_data)

###
# REPLACE <dataset> WITH ring_data and <numberOfCenters> WITH 2
###
ring_data_specc <- specc(as.matrix(<dataset>), centers = <numberOfCenters>)
###

ring_data %>% 
mutate(Specc = ring_data_specc@.Data) %>%
ggplot(aes(x, y, colour = as.factor(Specc))) +
geom_point(alpha = 0.75) +
labs(title = "Spectral clustering of two ring data", colour = "Spectral\ncluster") +
theme(plot.title = element_text(hjust = 0.5))

Does spectral clustering classify the two ring data into the correct clusters?

## Conclusion

We have learnt two important clustering methods, *k-means clustering* and *spectral clustering*, and used them on a variety of datasets. Remember, one clustering method might suit a data set better than another, especially straight out of the box. In some cases, adding or transforming features, as we did with the distance from the centre for the ring data, may allow a method to be used where it would otherwise fail.