# Unsupervised Learning

https://scikit-learn.org/stable/modules/clustering.html

## Introduction to Clustering

Suppose that Shila is visiting a zoo for the first time, and she does not know much about animals. While observing various animals, she sees some living in water, some with wings, some living in trees, and some living underground. Although she does not know precisely what animals they are, she notices some similarities and differences. Using the similarities she notes, she is able to group the animals, as shown in the figure below.

![image.png](attachment:f2e5dfc3-66b6-49ff-9db0-dded35e83008.png)

Figure 1: Grouping of animals in zoo

The task that Shila performed is known as clustering. Put formally, clustering is the process of partitioning observations into groups, or clusters, such that observations in the same cluster tend to be similar and those in different clusters tend to be dissimilar.

### Clustering is an Unsupervised Learning Method

Clustering seems similar to classification. In both classification and clustering, we categorize a given data point as belonging to some group/class. But unlike classification where we have the true label to guide the model training process, in clustering we do not have the true label to guide or supervise the training. Therefore, clustering is an unsupervised machine learning method.
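
The difference shows up directly in how a model is trained. The sketch below is a minimal illustration, assuming scikit-learn is installed; the synthetic dataset, the choice of `LogisticRegression` and `KMeans`, and all parameter values are assumptions made for this example and do not come from the text above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import adjusted_rand_score

# Synthetic data (assumed for illustration): X holds features, y holds true labels.
X, y = make_blobs(
    n_samples=150,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.8,
    random_state=0,
)

# Classification (supervised): the true labels y guide the training.
classifier = LogisticRegression(max_iter=1000).fit(X, y)

# Clustering (unsupervised): only X is passed; y is never seen by the model.
clusterer = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Cluster ids are arbitrary, so compare the groupings with a score that
# ignores how the groups are numbered.
agreement = adjusted_rand_score(y, clusterer.labels_)
print(f"Classifier training accuracy: {classifier.score(X, y):.2f}")
print(f"Agreement between clusters and true labels (ARI): {agreement:.2f}")
```

Here the true labels are used only afterwards, to check how well the discovered groups line up with them. In a real clustering problem they would not be available at all.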

### Clustering in Machine Learning

![image.png](attachment:ad239c3e-dc48-4342-b9e2-1cfc17528c45.png)

Figure 2(a) above shows what our data looks like before applying any clustering algorithm. Figure 2(b), on the right, shows the desired outcome after applying a clustering algorithm. In the next chapter, you will learn to synthesize a dataset like the one in Figure 2(a) and apply a clustering algorithm to it so that the data is grouped as shown in Figure 2(b).

In real-world applications, most observations involve a large number of features $(x_1, x_2, \dots , x_n)$, which makes the dataset difficult to plot and visualize. The underlying intuition, however, is the same.


### Types of Clustering Algorithms

Based on their clustering approach, clustering algorithms can be divided into the following categories:

1. **Partitioning-Based Clustering:**  
Partitioning-based clustering algorithms work by subdividing the dataset into a chosen number of clusters (_K_). Each cluster is generally identified by a centroid, such as the mean of its points. A data point belongs to the cluster whose centroid is closest to it, i.e. the centroid at minimum distance. With partitioning-based clustering, the data shown in Figure 2(a) can be partitioned as follows:

![image.png](attachment:b475931d-80e9-4e06-83f9-0f0509cf449a.png)

Figure 3: Partitioning-based clustering. The elliptical boundaries show the partition of the data points into different clusters, which is also highlighted with color.
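
The nearest-centroid rule described above can be sketched with K-means, one well-known partitioning-based algorithm. This is a minimal illustration that assumes scikit-learn and NumPy are installed; the synthetic data and every parameter value are assumptions chosen for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three well-separated groups (illustrative values).
X, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.8,
    random_state=0,
)

# Partition the data into K = 3 clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

# Check the rule by hand: compute the distance from every point to every
# centroid and pick the centroid with the minimum distance.
distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
nearest = distances.argmin(axis=1)

print("Centroids:\n", centroids)
print("Labels match nearest centroid:", np.array_equal(nearest, labels))
```

Note that _K_ has to be chosen before fitting; here it is set to 3 because the synthetic data was generated with three groups.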


2. **Density-Based Clustering:**   
In density-based clustering, a cluster is a region of densely connected data points. If there is more than one densely connected region, then there is more than one cluster. Density-based clustering makes no assumption about the shape of the clusters, so they can take any form as long as they are separated by sparse regions.

![image.png](attachment:b2740d41-8f69-4c8e-8cc2-c38a891c91f2.png)

Figure 5: Density-based clustering. (a) Dataset to apply clustering to. (b) Cluster assignment with a density-based clustering algorithm. Different colors indicate different clusters.
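
As a rough sketch of this idea, the example below applies DBSCAN, one density-based algorithm available in scikit-learn, to two interleaved half-moon shapes. The dataset and the values of `eps` and `min_samples` are assumptions picked for illustration; other data would usually need different values.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two crescent-shaped groups: non-spherical clusters separated by a sparse gap.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# eps: neighborhood radius; min_samples: points needed to call a region dense.
# Both values are illustrative assumptions.
dbscan = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = dbscan.labels_

# DBSCAN marks points in sparse regions with the label -1 (noise).
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())

print("Clusters found:", n_clusters)
print("Points labelled as noise:", n_noise)
```

Unlike the partitioning-based approach, the number of clusters is not given in advance; it follows from how many dense regions the algorithm finds.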

3. **Hierarchical Clustering:**   
Hierarchical clustering is a connectivity-based approach in which either two similar smaller clusters are merged into a bigger cluster, or a bigger cluster is split into two smaller clusters. Because this creates a hierarchy of connectivity, it is called hierarchical clustering. The figure below shows a simple dataset to which hierarchical clustering is applied.

![image.png](attachment:dbfd3383-7a23-460f-bc9f-f8aede9b57b2.png)

Figure 4: Hierarchical clustering. (a) Dataset to apply clustering to. (b) Hierarchy of clusters. Clusters close to each other are combined to form a bigger cluster; the process is shown from bottom to top.
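
The bottom-up variant can be sketched with SciPy's hierarchical clustering routines. This is a minimal illustration under stated assumptions: the five points, the `average` linkage method, and the choice to cut the hierarchy into three clusters are all made up for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Five illustrative points: two close pairs and one isolated point.
X = np.array([
    [0.0, 0.0],
    [0.0, 1.0],
    [5.0, 5.0],
    [5.0, 6.0],
    [10.0, 0.0],
])

# Build the hierarchy bottom-up. Each row of Z records one merge:
# [cluster index a, cluster index b, distance between them, size of new cluster].
Z = linkage(X, method="average")
print(Z)

# Cut the hierarchy so that at most three clusters remain.
labels = fcluster(Z, t=3, criterion="maxclust")
print("Cluster labels:", labels)
```

With n points there are n - 1 merges, so `Z` has four rows here. Cutting the same hierarchy at a different level gives a coarser or finer grouping without refitting.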
    

4. **Distribution-Based Clustering:**   
Distribution-based clustering assumes that the data points are generated by a mixture of probability distributions. The goal is to assign each data point a probability of belonging to each of these distributions.

![image.png](attachment:946cc7b1-8abc-4b9f-9c5e-e057b05c3ec3.png)

Figure 6: Distribution-based clustering. (a) Dataset to apply clustering to. (b) Two normal distributions fitted to the data.

Figure 6(a) above shows one-dimensional data points concentrated at two locations. In Figure 6(b), we have fitted two Gaussian distributions (one for each concentrated region) to represent the data. In distribution-based clustering, we use these two probability distributions as clusters.
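
A setup like the one in Figure 6 can be sketched with a Gaussian mixture model. The example assumes scikit-learn and NumPy are installed; the two locations (-4 and 4), the spread, and the sample sizes are assumptions chosen for illustration, not values taken from the figure.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# One-dimensional data concentrated at two locations (illustrative values).
x = np.concatenate([
    rng.normal(-4.0, 1.0, 200),
    rng.normal(4.0, 1.0, 200),
]).reshape(-1, 1)

# Fit two Gaussian distributions, one per concentrated region.
gmm = GaussianMixture(n_components=2, random_state=0).fit(x)

# Each row gives the probability that a point belongs to each distribution.
probs = gmm.predict_proba(x)

# A hard cluster label is the distribution with the highest probability.
labels = gmm.predict(x)

print("Fitted means:", np.sort(gmm.means_.ravel()))
print("Membership probabilities of the first point:", probs[0])
```

The output is a probability per cluster rather than only a hard assignment, which is useful for points that lie between the two concentrated regions.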



5. **Grid-Based Clustering:**   
Grid-based clustering divides the data space into a finite number of cells that form a grid. The density of the data in each cell is then used to classify the cell as either noise or part of a cluster. Neighboring dense cells are then grouped together to form clusters, and the remaining cells form the empty regions.

![image.png](attachment:317c7e4e-f556-478f-bd29-4ec889dc4ddc.png)

Figure 7: Grid-based clustering. (a) Data space divided into discrete grids. (b) Cluster assignment by combining grids with relatively dense data points.


Figure 7(a) above shows the dataset divided into grids. The highly dense grids are then combined to form clusters, as shown in Figure 7(b).
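
The three steps above (divide, threshold, merge) can be sketched directly with NumPy and SciPy. This is a simplified illustration of the grid idea and not an implementation of any particular published grid-based algorithm; the synthetic data, the grid resolution `n_bins`, and the density threshold `min_points` are all hypothetical values chosen for the example.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): two dense blobs plus sparse noise.
blob_a = rng.normal(loc=[2.0, 2.0], scale=0.3, size=(200, 2))
blob_b = rng.normal(loc=[7.0, 7.0], scale=0.3, size=(200, 2))
noise = rng.uniform(0.0, 10.0, size=(20, 2))
X = np.vstack([blob_a, blob_b, noise])

n_bins = 10      # hypothetical grid resolution: 10 x 10 cells
min_points = 10  # hypothetical density threshold per cell

# Step 1: divide the data space into cells and count the points in each cell.
counts, x_edges, y_edges = np.histogram2d(
    X[:, 0], X[:, 1], bins=n_bins, range=[[0, 10], [0, 10]]
)

# Step 2: classify each cell as dense (cluster) or sparse (noise).
dense = counts >= min_points

# Step 3: merge neighboring dense cells into clusters (label 0 means no cluster).
cell_labels, n_clusters = ndimage.label(dense)

# Map every point to the label of the cell it falls in.
ix = np.clip(np.digitize(X[:, 0], x_edges) - 1, 0, n_bins - 1)
iy = np.clip(np.digitize(X[:, 1], y_edges) - 1, 0, n_bins - 1)
point_labels = cell_labels[ix, iy]

print("Clusters found:", n_clusters)
print("Points in sparse cells:", int((point_labels == 0).sum()))
```

Because the work is done per cell rather than per point, the cost of the merging step depends on the number of cells, not on the number of data points. The result is, however, sensitive to the chosen cell size and threshold.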

Many other clustering algorithms have been developed beyond the ones that fall into these five categories.