## Exercise - Train and evaluate a clustering model by using Tidymodels and friends

### Clustering introduction

In contrast to *supervised* machine learning, *unsupervised* machine learning is used when there is no "ground truth" from which to train and validate label predictions. The most common form of unsupervised learning is *clustering*. Clustering is similar to *classification*, except that the training data doesn't include known values for the class label to be predicted.

Clustering works by separating the training cases based on similarities that can be determined from their feature values. The numeric features of a particular entity can be thought of as vector coordinates that define the entity's position in n-dimensional space. A clustering model identifies groups, or *clusters*, of entities that are close to one another, while being separated from other clusters.
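To make the idea of "closeness" concrete, here is a minimal base R sketch that treats each entity's numeric features as coordinates and measures the Euclidean distance between them. The `entities` matrix and its values are hypothetical and chosen only for illustration.

```r
# Three hypothetical entities described by two numeric features each
entities <- rbind(
  a = c(1.0, 2.0),
  b = c(1.5, 2.5),
  c = c(8.0, 9.0)
)

# Pairwise Euclidean distances between the feature vectors
distances <- as.matrix(dist(entities, method = "euclidean"))
round(distances, 2)

# a and b are close together, while c is far from both
distances["a", "b"] < distances["a", "c"]
```

A clustering algorithm would be expected to group `a` and `b` together and keep `c` apart, because the distance between `a` and `b` is much smaller than the distance from either of them to `c`.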

For example, let's take a look at a dataset that contains measurements of different species of wheat seed.

> **Citation**: The seeds dataset used in this exercise was originally published by the Institute of Agrophysics of the Polish Academy of Sciences in Lublin, and can be downloaded from the UCI dataset repository (Dua, D. and Graff, C. (2019). [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science).


In [None]:
# Load the core tidyverse and make it available in your current R session
library(tidyverse)

# Read the csv file into a tibble
seeds <- read_csv(file = "https://raw.githubusercontent.com/MicrosoftDocs/ml-basics/master/data/seeds.csv")

# Print the first 5 rows of the data
seeds %>% 
  slice_head(n = 5)


Sometimes, you might want more information on your data. You can look at the data, its structure, and the data type of its features by using the [*glimpse()*](https://pillar.r-lib.org/reference/glimpse.html) function, as follows:



In [None]:
# Explore dimension and type of columns
seeds %>% 
  glimpse()


You can also use `skimr::skim()` to take a look at the summary statistics for the data:



In [None]:
library(skimr)

# Obtain Summary statistics
seeds %>% 
  skim()


Now take a moment to go through the quick data exploration you just performed. Any missing values? What's the dimension of your data (rows and columns)? What are the different column types? How are the values in your columns distributed?
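These questions can also be answered with a few base R calls. The following is a minimal sketch on a small hypothetical data frame (`toy` and its values are made up for illustration and aren't taken from the seeds data):

```r
# A small hypothetical data frame with one missing value
toy <- data.frame(
  area = c(15.3, 14.9, NA, 13.8),
  perimeter = c(14.8, 14.6, 14.1, 14.0),
  species = c(0L, 0L, 1L, 2L)
)

# Dimensions: number of rows and columns
dim(toy)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))
missing_per_column

# Type of each column
column_types <- sapply(toy, class)
column_types

# Basic distribution summary (min, quartiles, mean, max) of each column
summary(toy)
```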

For this module, you'll work with the first six *feature* columns. For plotting purposes, let's encode the *label* column as categorical. Tidymodels provides a neat way of excluding this variable when fitting a model to your data. Remember, you're dealing with unsupervised learning, which doesn't make use of previously known label values to train a model.


In [None]:
# Narrow down to desired features
seeds_select <- seeds %>% 
  select(!groove_length) %>% 
  mutate(species = factor(species))

# View first 5 rows of the data
seeds_select %>% 
  slice_head(n = 5)


As you can see, you now have six data points (or *features*) for each seed instance (*observation*). You can interpret these as coordinates that describe each seed's location in six-dimensional space.

Now, of course six-dimensional space is difficult to visualize in a three-dimensional world, or on a two-dimensional plot. So you take advantage of a mathematical technique called *principal component analysis* (PCA) to analyze the relationships between the features, and to summarize each observation as coordinates for two principal components. In other words, you translate the six-dimensional feature values into two-dimensional coordinates.
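As a rough illustration of this translation, the following sketch uses base R's `prcomp()` rather than the recipes workflow used in this exercise. The simulated `features` data frame and its column names are assumptions made for the example.

```r
set.seed(42)

# Simulate 100 hypothetical observations with six numeric features
n <- 100
base <- rnorm(n)
features <- data.frame(
  f1 = base + rnorm(n, sd = 0.2),
  f2 = 2 * base + rnorm(n, sd = 0.2),
  f3 = -base + rnorm(n, sd = 0.2),
  f4 = rnorm(n),
  f5 = rnorm(n),
  f6 = rnorm(n)
)

# Center and scale the features, then estimate the principal components
pca <- prcomp(features, center = TRUE, scale. = TRUE)

# Keep the first two components: one (PC1, PC2) coordinate pair per observation
scores_2d <- pca$x[, 1:2]
head(scores_2d, n = 5)
dim(scores_2d)
```

Each of the 100 six-dimensional observations is now summarized by two coordinates, which can be drawn on an ordinary scatter plot.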

> PCA is a dimension reduction method that aims to reduce the feature space, so that most of the information (variability) in the dataset can be explained by fewer, uncorrelated features.

Let's see this in action by creating a specification of a `recipe` that estimates the principal components based on our six variables. You'll then prep and bake the recipe to apply the computations.

> PCA works well when the variables are normalized (*centered* and *scaled*).


In [None]:
# Load the core tidymodels and make it available in your current R session
library(tidymodels)


# Specify a recipe for pca
pca_rec <- recipe(~ ., data = seeds_select) %>% 
  update_role(species, new_role = "ID") %>% 
  step_normalize(all_predictors()) %>% 
  step_pca(all_predictors(), num_comp = 2, id = "pca")

# Print out recipe
pca_rec


Unlike a recipe for a supervised learning task, this recipe has no `outcome` variable.

By updating the role of the `species` column to `ID`, you instruct the recipe to keep the variable but not use it as either an outcome or predictor.

By calling `prep()`, which estimates the statistics required by PCA, and by applying them to `seeds_select` using `bake(new_data = NULL)`, you can get the fitted PC transformation of our features.
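The "estimate first, then apply" pattern can be sketched in base R for the normalization step alone. This is only a conceptual analogy and not how the recipes package is implemented; the `training` data frame and its values are hypothetical.

```r
# Hypothetical training data with two numeric features
training <- data.frame(x1 = c(2, 4, 6, 8), x2 = c(10, 20, 30, 40))

# "Prep" phase: estimate the statistics needed for normalization
centers <- colMeans(training)
scales <- apply(training, 2, sd)

# "Bake" phase: apply the stored statistics to the data
normalized <- scale(training, center = centers, scale = scales)
round(colMeans(normalized), 10)  # each column now has mean 0
apply(normalized, 2, sd)         # and standard deviation 1
```

Because the statistics are stored separately from the data, the same `centers` and `scales` could later be applied to new observations without re-estimating them.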


In [None]:
# Estimate required statistics
pca_estimates <- prep(pca_rec)

# Return preprocessed data using bake
features_2d <- pca_estimates %>% 
  bake(new_data = NULL)

# Print baked data set
features_2d %>% 
  slice_head(n = 5)


These two components capture the maximum amount of information (that is, the variance) in the original variables. From the output of the prepped recipe `pca_estimates`, you can examine how much variance each component accounts for:



In [None]:
# Examine how much variance each PC accounts for
pca_estimates %>% 
  tidy(id = "pca", type = "variance") %>% 
  filter(str_detect(terms, "percent"))


theme_set(theme_light())
# Plot how much variance each PC accounts for
pca_estimates %>% 
  tidy(id = "pca", type = "variance") %>% 
  filter(terms == "percent variance") %>% 
  ggplot(mapping = aes(x = component, y = value)) +
  geom_col(fill = "midnightblue", alpha = 0.7) +
  ylab("% of total variance")


The tibble and the plot show how much of the variance in the original six variables each principal component explains. For example, the first principal component (PC1) explains about 72 percent of the variance. The second principal component (PC2) explains an additional 16.97 percent, giving a cumulative percent variance of 89.11. This means that the first two principal components do a good job of summarizing the original six variables.

Naturally, PC1 captures the most variance, followed by PC2 and PC3.
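The percent and cumulative percent figures can be derived by hand from the component standard deviations. The following sketch assumes base R's `prcomp()` and the built-in `USArrests` dataset, which is used purely for illustration and is unrelated to the seeds data.

```r
# Built-in dataset with four numeric columns, used purely for illustration
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# The variance of each component is its squared standard deviation
variance <- pca$sdev^2

# Percent of total variance per component, and the running total
percent_variance <- 100 * variance / sum(variance)
cumulative_percent <- cumsum(percent_variance)

data.frame(
  component = seq_along(variance),
  percent_variance = round(percent_variance, 2),
  cumulative_percent = round(cumulative_percent, 2)
)
```

The percentages are sorted from largest to smallest, and the cumulative column reaches 100 once every component is included.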

Now that you have the data points translated to two dimensions (PC1 and PC2), you can visualize them in a plot:


In [None]:
# Visualize PC scores
features_2d %>% 
  ggplot(mapping = aes(x = PC1, y = PC2)) +
  geom_point(size = 2, color = "dodgerblue3")


Hopefully you can see at least two, arguably three, reasonably distinct groups of data points. But here lies one of the fundamental problems with clustering: without knowing class labels, how do you know how many clusters to separate your data into?

One way is to use a data sample to create a series of clustering models with an incrementing number of clusters. Then you can measure how tightly the data points are grouped within each cluster. A metric often used to measure this tightness is the *within-cluster sum of squares* (WCSS), with lower values meaning that the data points are closer together. You can then plot the WCSS for each model.
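To show what WCSS measures, the following sketch computes it by hand and compares it with the `tot.withinss` value reported by base R's `kmeans()`. The simulated `points` matrix is hypothetical and stands in for real feature data.

```r
set.seed(2056)

# Simulate two hypothetical groups of points in two dimensions
points <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 5), ncol = 2)
)

# Fit a k-means model with two clusters
model <- kmeans(points, centers = 2, nstart = 20)

# WCSS: for every point, the squared distance to its own cluster center,
# summed over all points
assigned_centers <- model$centers[model$cluster, ]
wcss_manual <- sum((points - assigned_centers)^2)

# The manual value matches the one reported by kmeans()
all.equal(wcss_manual, model$tot.withinss)
```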

You'll use the built-in `kmeans()` function, which accepts a data frame with all numeric columns as its primary argument to perform clustering. This means that you'll have to drop the *species* column. For clustering, it's recommended that the features be on the same scale, because distance calculations are otherwise dominated by the features with the largest ranges. You can use the recipes package to perform these transformations.


In [None]:
# Drop target column and normalize data
seeds_features <- recipe(~ ., data = seeds_select) %>% 
  step_rm(species) %>% 
  step_normalize(all_predictors()) %>% 
  prep() %>% 
  bake(new_data = NULL)

# Print out data
seeds_features %>% 
  slice_head(n = 5)


Now, let's explore the WCSS of different numbers of clusters.

You use `map()` from the [purrr](https://purrr.tidyverse.org/) package to apply a function to each element of a list.
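As a quick, standalone illustration of `map()`, the following sketch applies a simple squaring function to the numbers 1 to 3. It assumes the purrr package is installed, and the variable names are arbitrary.

```r
library(purrr)

# Apply a function to each element of a vector; map() always returns a list
squares <- map(1:3, ~ .x^2)
squares

# map_dbl() returns a numeric vector instead of a list
squares_dbl <- map_dbl(1:3, ~ .x^2)
squares_dbl
```

In the `~ .x^2` shorthand, `.x` stands for the current element, which is the same convention used to pass each value of `k` to `kmeans()`.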


In [None]:
set.seed(2056)
# Create 10 models with 1 to 10 clusters
kclusts <- tibble(k = 1:10) %>% 
  mutate(
    model = map(k, ~ kmeans(x = seeds_features, centers = .x,
                            nstart = 20)),
    glanced = map(model, glance)) %>% 
  unnest(cols = c(glanced))



# Plot Total within-cluster sum of squares (tot.withinss)
kclusts %>% 
  ggplot(mapping = aes(x = k, y = tot.withinss)) +
  geom_line(linewidth = 1.2, alpha = 0.5, color = "dodgerblue3") +
  geom_point(size = 2, color = "dodgerblue3")


The plot shows a large reduction in WCSS (so, greater *tightness*) as the number of clusters increases from one to two, and a further noticeable reduction from two to three clusters. After that, the reduction is less pronounced, resulting in an *elbow* in the chart at around three clusters. This is a good indication that there are two to three reasonably well-separated clusters of data points.
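The elbow can also be read from the numbers rather than the plot, by looking at how much the WCSS drops each time a cluster is added. The following is a minimal sketch on simulated data; the three hypothetical groups and their `centers` are assumptions chosen so that the elbow is easy to see.

```r
set.seed(2056)

# Simulate three hypothetical, well-separated groups of 50 points each
centers <- rbind(c(0, 0), c(10, 0), c(5, 10))
points <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(50, mean = centers[i, 1]), rnorm(50, mean = centers[i, 2]))
}))

# Total WCSS for models with 1 to 6 clusters
wcss <- sapply(1:6, function(k) {
  kmeans(points, centers = k, nstart = 20)$tot.withinss
})

# Reduction in WCSS gained by adding each extra cluster
drops <- -diff(wcss)
data.frame(from_k = 1:5, to_k = 2:6, wcss_reduction = round(drops, 1))
```

With data like this, the reductions for the first two steps should be far larger than the later ones, which is the numeric counterpart of the bend in the chart. On real data the elbow is often less clear-cut, so treat it as a guide rather than a rule.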

### Summary

In this notebook exercise, you looked at what clustering means, and how to determine whether clustering might be appropriate for your data. In the next notebook, you examine two ways of labeling the data automatically.
