# Unsupervised Learning Techniques

Although most applications of machine learning today are based on supervised learning (& as a result, this is where most of the investment goes), the vast majority of the available data is unlabeled: we have the input features $X$, but we do not have the labels $y$. The computer scientist Yann LeCun famously said that "if intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, & reinforcement learning would be the cherry on the cake." In other words, there is a huge potential in unsupervised learning that we have only barely started to sink our teeth into.

Say you want to create a system that will take a few pictures of each item on a manufacturing production line & detect which items are defective. You can fairly easily create a system that will take pictures automatically, & this might give you thousands of pictures every day. You can then build a reasonably large dataset in just a few weeks. But wait, there are no labels! If you want to train a regular binary classifier that will predict whether an item is defective or not, you will need to label every single picture as "defective" or "normal". This will generally require human experts to sit down & manually go through all the pictures. This is a long, costly, & tedious task, so it will usually only be done on a small subset of the available pictures. As a result, the labeled dataset will be quite small, & the classifier's performance will be disappointing. Moreover, every time the company makes any change to its products, the whole process will need to be started over from scratch. Wouldn't it be great if the algorithm could just exploit the unlabeled data without needing humans to label every picture? Enter unsupervised learning.

In this lesson, we will look at a few unsupervised learning tasks & algorithms:

* *Clustering*
   - The goal is to group similar instances together into *clusters*. Clustering is a great tool for data analysis, customer segmentation, recommender systems, search engines, image segmentation, semi-supervised learning, dimensionality reduction, & more.
* *Anomaly detection*
   - The objective is to learn what "normal" data looks like, & then use that to detect abnormal instances, such as defective items on a production line or a new trend in a time series.
* *Density estimation*
   - This is the task of estimating the *probability density function* (PDF) of the random process that generated the dataset. Density estimation is commonly used for anomaly detection: instances located in very low-density regions are likely to be anomalies. It is also useful for data analysis & visualisation.

---

# Clustering

As you enjoy a hike in the mountains, you stumble upon a plant you have never seen before. You look around & you notice a few more. They are not identical, yet they are sufficiently similar for you to know that they most likely belong to the same species (or at least the same genus). You may need a botanist to tell you what species that is, but you certainly don't need an expert to identify groups of similar-looking objects. This is called *clustering*: it is the task of identifying similar instances & assigning them to *clusters*, or groups of similar instances.

Just like in classification, each instance gets assigned to a group. However, unlike classification, clustering is an unsupervised task. 

<img src = "Images/Classification vs Clustering.png" width = "600" style = "margin:auto"/>

Consider the figure: on the left is the iris dataset, where each instance's species is represented with a different marker. It is a labeled dataset, for which classification algorithms such as logistic regression, SVMs, or random forest classifiers are well suited. On the right is the same dataset, but without labels, so you cannot use a classification algorithm anymore. This is where clustering algorithms step in: many of them can easily detect the lower-left cluster. It is also quite easy to see with your own eyes, but it is not so obvious that the upper-right cluster is composed of two distinct sub-clusters. That said, the dataset has two additional features (sepal length & width), not represented here, & clustering algorithms can make good use of all features, so in fact they identify the three clusters fairly well (e.g., using a Gaussian mixture model, only 5 instances out of 150 are assigned to the wrong cluster).
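The Gaussian mixture claim above can be checked with a short sketch. It assumes scikit-learn's `GaussianMixture` with three components fitted on all four iris features; the `n_init` & `random_state` values are arbitrary choices, & the mapping from clusters to species uses the true labels only to evaluate the result, not to train the model. The exact count may vary with the library version & settings.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

iris = load_iris()
X_iris, y_iris = iris.data, iris.target

# Fit a 3-component Gaussian mixture on all four features, ignoring the labels.
gm = GaussianMixture(n_components=3, n_init=10, random_state=42)
cluster_ids = gm.fit_predict(X_iris)

# Map each cluster to the most common species among its members.
# The labels are used here only to evaluate the clustering afterwards.
mapping = {}
for cluster_id in np.unique(cluster_ids):
    members = y_iris[cluster_ids == cluster_id]
    mapping[cluster_id] = np.bincount(members).argmax()

y_mapped = np.array([mapping[c] for c in cluster_ids])
n_correct = int((y_mapped == y_iris).sum())
print(f"{n_correct} of {len(y_iris)} instances match their species")
```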

Clustering is used in a wide variety of applications, including these:

* *For customer segmentation*
   - You can cluster your customers based on their purchases & their activity on your website. This is useful to understand who your customers are & what they need, so you can adapt your products & marketing campaigns to each segment. For example, customer segmentation can be useful in *recommender systems* to suggest content that other users in the same cluster enjoyed.
* *For data analysis*
   - When you analyse a new dataset, it can be helpful to run a clustering algorithm, & then analyse each cluster separately.
* *As a dimensionality reduction technique*
   - Once a dataset has been clustered, it is usually possible to measure each instance's *affinity* with each cluster (affinity is a measure of how well an instance fits into a cluster). Each instance's feature vector $x$ can then be replaced with the vector of its cluster affinities. If there are *k* clusters, then this vector is *k*-dimensional. This vector is typically much lower-dimensional than the original feature vector, but it can preserve enough information for further processing.
* *For anomaly detection (also called outlier detection)*
   - Any instance that has a low affinity to all the clusters is likely to be an anomaly. For example, if you have clustered the users of your website based on their behaviour, you can detect users with unusual behaviour, such as an unusual number of requests per second. Anomaly detection is particularly useful in detecting defects in manufacturing, or for *fraud detection*.
* *For semi-supervised learning*
   - If you only have a few labels, you could perform clustering & propagate the labels to all the instances in the same cluster. This technique can greatly increase the number of labels available for a subsequent supervised learning algorithm, & thus improve its performance.
* *For search engines*
   - Some search engines let you search for images that are similar to a reference image. To build such a system, you would first apply a clustering algorithm to all the images in your database; similar images would end up in the same cluster. Then when a user provides a reference image, all you need to do is use the trained clustering model to find this image's cluster, & you can then simply return all the images from this cluster.
* *To segment an image*
   - By clustering pixels according to their colour, then replacing each pixel's colour with the mean colour of its cluster, it is possible to considerably reduce the number of different colours in the image. Image segmentation is used in many object detection & tracking systems, as it makes it easier to detect the contour of each object.
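
As an illustration of the last item, here is a minimal sketch of colour-based segmentation with scikit-learn's `KMeans`. The image is synthetic (three noisy colour bands built with NumPy, standing in for a real photo), & the choice of three clusters is an assumption made for this toy example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic RGB image (hypothetical stand-in for a real photo):
# three flat colour bands plus a little Gaussian noise.
rng = np.random.default_rng(42)
image = np.zeros((60, 90, 3))
image[:, :30] = [0.9, 0.1, 0.1]    # reddish band
image[:, 30:60] = [0.1, 0.8, 0.2]  # greenish band
image[:, 60:] = [0.1, 0.2, 0.9]    # bluish band
image = np.clip(image + rng.normal(scale=0.05, size=image.shape), 0, 1)

# Treat every pixel as a 3-D instance (R, G, B) & cluster the colours.
pixels = image.reshape(-1, 3)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(pixels)

# Replace each pixel's colour with the mean colour (centroid) of its cluster.
segmented = kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape)
n_colours = len(np.unique(segmented.reshape(-1, 3), axis=0))
print(f"distinct colours after segmentation: {n_colours}")
```

The segmented image has the same shape as the original but contains only as many distinct colours as there are clusters.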
  
There is no universal definition of what a cluster is: it really depends on the context, & different algorithms will capture different kinds of clusters. Some algorithms look for instances centered around a particular point, called a *centroid*. Others look for continuous regions of densely packed instances: these clusters can take on any shape. Some algorithms are hierarchical, looking for clusters of clusters. And the list goes on.

In this section, we'll look at two popular clustering algorithms, K-means & DBSCAN, & explore some of their applications, such as nonlinear dimensionality reduction, semi-supervised learning, & anomaly detection.

## K-Means

Consider the unlabeled dataset represented in this figure: you can clearly see five blobs of instances.

<img src = "Images/Unlabeled Blobs.png" width = "600" style = "margin:auto"/>

The K-means algorithm is a simple algorithm capable of clustering this kind of dataset very quickly & efficiently, often in just a few iterations. It was proposed by Stuart Lloyd at Bell Labs in 1957 as a technique for pulse-code modulation, but it was only published outside of the company in 1982. In 1965, Edward W. Forgy had published virtually the same algorithm, so K-means is sometimes referred to as Lloyd-Forgy.

Let's train a K-means clusterer on this dataset. It will try to find each blob's center & assign each instance to the closest blob:

In [2]:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import numpy as np

blob_centers = np.array([[ 0.2,  2.3],
                         [-1.5 ,  2.3],
                         [-2.8,  1.8],
                         [-2.8,  2.8],
                         [-2.8,  1.3]])
blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
X, y = make_blobs(n_samples = 2000, centers = blob_centers,
                  cluster_std = blob_std, random_state = 32)

k = 5
kmeans = KMeans(n_clusters = k)
y_pred = kmeans.fit_predict(X)

Note that you have to specify the number of clusters *k* that the algorithm must find. In this example, it is pretty obvious from looking at the data that *k* should be set to 5, but in general it is not that easy.

Each instance was assigned to one of the five clusters. In the context of clustering, an instance's label is the index of the cluster that this instance gets assigned to by the algorithm: this is not to be confused with the class labels in classification (remember that clustering is an unsupervised learning task). The `KMeans` instance preserves a copy of the labels of the instances it was trained on, available via the `labels_` instance variable.

In [3]:
y_pred

array([3, 1, 0, ..., 0, 2, 1], dtype=int32)

In [4]:
y_pred is kmeans.labels_

True

We can also take a look at the five centroids that the algorithm found:

In [5]:
kmeans.cluster_centers_

array([[-1.48628063,  2.26054806],
       [-2.78301719,  2.80833515],
       [ 0.21050882,  2.34183312],
       [-2.80254262,  1.29573614],
       [-2.80720418,  1.80874278]])

You can easily assign new instances to the cluster whose centroid is closest:

In [6]:
X_new = np.array([[0, 2], [3, 2], [-3, 3], [-3, 2.5]])
kmeans.predict(X_new)

array([2, 2, 1, 1], dtype=int32)
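
To make the "closest centroid" rule concrete, the sketch below recomputes the assignment by hand with NumPy & compares it with `predict()`. It rebuilds the same blobs so that it runs on its own; the `n_init` & `random_state` values are arbitrary choices, so the cluster indices may differ from the ones shown above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

blob_centers = np.array([[0.2, 2.3], [-1.5, 2.3], [-2.8, 1.8],
                         [-2.8, 2.8], [-2.8, 1.3]])
blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
X, _ = make_blobs(n_samples=2000, centers=blob_centers,
                  cluster_std=blob_std, random_state=32)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)

X_new = np.array([[0, 2], [3, 2], [-3, 3], [-3, 2.5]])

# Euclidean distance from every new instance to every centroid, shape (4, 5).
diffs = X_new[:, np.newaxis, :] - kmeans.cluster_centers_[np.newaxis, :, :]
distances = np.linalg.norm(diffs, axis=2)

# The index of the nearest centroid is the predicted cluster.
manual_pred = distances.argmin(axis=1)
print(manual_pred, kmeans.predict(X_new))
```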

If you plot the clusters' decision boundaries, you get a Voronoi tessellation, where each centroid is represented with an X.

<img src = "Images/K-Means Decision Boundaries.png" width = "600" style = "margin:auto"/>

The vast majority of the instances were clearly assigned to the appropriate cluster, but a few instances were probably mislabeled (especially near the boundary between the top-left cluster & the central cluster). Indeed, the K-means algorithm does not behave very well when the blobs have very different diameters because all it cares about when assigning an instance to a cluster is the distance to the centroid.

Instead of assigning each instance to a single cluster, which is called *hard clustering*, it can be useful to give each instance a score per cluster, which is called *soft clustering*. The score can be the distance between the instance & the centroid; conversely, it can be a similarity score (or affinity), such as the Gaussian radial basis function. In the `KMeans` class, the `transform()` method measures the distance from each instance to every centroid:

In [7]:
kmeans.transform(X_new)

array([[1.50894513, 2.89803216, 0.40145218, 2.88967693, 2.8137119 ],
       [4.49384014, 5.83923741, 2.81035779, 5.84512519, 5.8103528 ],
       [1.68467667, 0.28951158, 3.27727792, 1.71566451, 1.20675763],
       [1.53254153, 0.37703064, 3.21440254, 1.22034457, 0.71763972]])

In this example, the first instance in `X_new` is located at a distance of 1.51 from the first centroid, 2.90 from the second centroid, 0.40 from the third centroid, 2.89 from the fourth centroid, & 2.81 from the fifth centroid. If you have a high-dimensional dataset & you transform it this way, you end up with a *k*-dimensional dataset: this transformation can be a very efficient nonlinear dimensionality reduction technique.
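
The distances returned by `transform()` can also be turned into similarity scores. The sketch below applies a Gaussian RBF, $\exp(-\gamma d^2)$, to each distance $d$; the value `gamma = 1.0` is an arbitrary assumption, & the blobs are rebuilt so that the example runs on its own.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

blob_centers = np.array([[0.2, 2.3], [-1.5, 2.3], [-2.8, 1.8],
                         [-2.8, 2.8], [-2.8, 1.3]])
blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
X, _ = make_blobs(n_samples=2000, centers=blob_centers,
                  cluster_std=blob_std, random_state=32)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)

X_new = np.array([[0, 2], [3, 2], [-3, 3], [-3, 2.5]])

# Soft clustering, option 1: one distance per centroid, shape (4, 5).
distances = kmeans.transform(X_new)

# Soft clustering, option 2: Gaussian RBF affinities in (0, 1].
gamma = 1.0  # assumed width parameter, to be tuned for real data
affinities = np.exp(-gamma * distances ** 2)

# The highest affinity corresponds to the nearest centroid.
print(affinities.argmax(axis=1), kmeans.predict(X_new))
```

Because the RBF is a decreasing function of the distance, the cluster with the highest affinity is always the one whose centroid is closest.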

### The K-Means Algorithm

So, how does the algorithm work? Well, suppose you were given the centroids. You could easily label all the instances in the dataset by assigning each of them to the cluster whose centroid is closest. Conversely, if you were given all the instance labels, you could easily locate all the centroids by computing the mean of the instances for each cluster. But you are given neither the labels nor the centroids, so how can you proceed? Well, just start by placing the centroids randomly (e.g., by picking *k* instances at random & using their locations as centroids). Then label the instances, update the centroids, label the instances, update the centroids, & so on until the centroids stop moving. The algorithm is guaranteed to converge in a finite number of steps (usually quite small); it will not oscillate forever.

You can see the algorithm in action here: the centroids are initialised randomly (top left), then the instances are labeled (top right), then the centroids are updated (center left), the instances are relabeled (center right), & so on. As you can see, in just three iterations, the algorithm has reached a clustering that seems close to optimal.

<img src = "Images/K-Means Algorithm.png" width = "600" style = "margin:auto"/>
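
The label/update loop described above can be sketched in plain NumPy. This is a teaching sketch under simple assumptions, not scikit-learn's implementation: the function name `kmeans_lloyd`, its parameters, & the rule of leaving an empty cluster's centroid where it is are all choices made for this example.

```python
import numpy as np
from sklearn.datasets import make_blobs

def kmeans_lloyd(X, k, max_iter=100, seed=0):
    """Plain K-means loop (hypothetical helper for illustration)."""
    rng = np.random.default_rng(seed)
    # Random initialisation: pick k distinct instances as centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    inertia_history = []
    for _ in range(max_iter):
        # Label step: assign each instance to its closest centroid.
        sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        inertia_history.append(sq_dists[np.arange(len(X)), labels].sum())
        # Update step: move each centroid to the mean of its instances.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # the centroids stopped moving
        centroids = new_centroids
    return centroids, labels, inertia_history

blob_centers = np.array([[0.2, 2.3], [-1.5, 2.3], [-2.8, 1.8],
                         [-2.8, 2.8], [-2.8, 1.3]])
blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
X, _ = make_blobs(n_samples=2000, centers=blob_centers,
                  cluster_std=blob_std, random_state=32)

centroids, labels, history = kmeans_lloyd(X, k=5)
print(f"stopped after {len(history)} iterations")
```

The recorded sum of squared distances never increases from one iteration to the next, which is why the loop cannot oscillate forever.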

Although the algorithm is guaranteed to converge, it may not converge to the right solution (i.e., it may converge to a local optimum): whether it does or not depends on the centroid initialisation. The figure below shows two suboptimal solutions that the algorithm can converge to if you are unlucky with the random initialisation step.

<img src = "Images/Suboptimal Solution K-Means.png" width = "600" style = "margin:auto"/>
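
One way to observe this sensitivity is to run K-means once per seed with purely random initialisation & compare the results. The sketch below does so using the sum of squared distances to the closest centroid (exposed by scikit-learn as `inertia_`) as the quality measure; the five seeds are arbitrary, & how much the values differ depends on the seeds & the library version.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

blob_centers = np.array([[0.2, 2.3], [-1.5, 2.3], [-2.8, 1.8],
                         [-2.8, 2.8], [-2.8, 1.3]])
blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
X, _ = make_blobs(n_samples=2000, centers=blob_centers,
                  cluster_std=blob_std, random_state=32)

# A single run per seed, each starting from k randomly picked instances.
inertias = []
for seed in range(5):
    km = KMeans(n_clusters=5, init="random", n_init=1, random_state=seed)
    km.fit(X)
    inertias.append(km.inertia_)

print(np.round(inertias, 1))
```

Lower values indicate tighter clusters; runs that end at a noticeably higher value have converged to a local optimum.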

Let's look at a few ways you can mitigate this risk by improving the centroid initialisation.

### Centroid Initialisation Methods

If you happen to know approximately where the centroids should be (e.g., if you ran another clustering algorithm earlier), then you can set the `init` hyperparameter to a NumPy array containing the list of centroids, & set `n_init` to 1.

In [10]:
good_init = np.array([[-3, 3], [-3, 2], [-3, 1], [-1, 2], [0, 2]])
kmeans = KMeans(n_clusters = 5, init = good_init, n_init = 1)

Another solution is to run the algorithm multiple times with different random initialisations & keep the best solution. The number of random initialisations is controlled by the `n_init` hyperparameter: with a value of 10 (the default in older scikit-learn releases; since version 1.4 the default is `"auto"`), the whole algorithm described earlier runs 10 times when you call `fit()`, & scikit-learn keeps the best solution. But how exactly does it know which solution is the best? It uses a performance metric. That metric is called the model's *inertia*, which is the sum of the squared distances between each instance & its closest centroid. It is roughly equal to 223.3 & 237.5 for the models on the left & right of the figure above, respectively. The `KMeans` class runs the algorithm `n_init` times & keeps the model with the lowest inertia. In this example, the model from the decision boundaries figure shown earlier will be selected (unless we are very unlucky with `n_init` consecutive random initialisations). If you are curious, a model's inertia is accessible via the `inertia_` instance variable:

In [12]:
kmeans.fit(X)
kmeans.inertia_

216.0712975215635

The `score()` method returns the negative inertia. Why negative? Because a predictor's `score()` method must always respect scikit-learn's "greater is better" rule: if a predictor is better than another, its `score()` method should return a greater score.

In [13]:
kmeans.score(X)

-216.0712975215635
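
As a sanity check, the inertia can be recomputed by hand from the distances returned by `transform()`. The sketch below assumes the same blobs as before & arbitrary `n_init` & `random_state` values, so the exact number may differ slightly from the one printed above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

blob_centers = np.array([[0.2, 2.3], [-1.5, 2.3], [-2.8, 1.8],
                         [-2.8, 2.8], [-2.8, 1.3]])
blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
X, _ = make_blobs(n_samples=2000, centers=blob_centers,
                  cluster_std=blob_std, random_state=32)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)

# Distance from each instance to its closest centroid.
min_dists = kmeans.transform(X).min(axis=1)

# Inertia: the sum of those distances squared.
manual_inertia = (min_dists ** 2).sum()

print(manual_inertia, kmeans.inertia_, kmeans.score(X))
```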

An important improvement to the K-means algorithm, *K-Means++*, was proposed in a 2006 paper by David Arthur & Sergei Vassilvitskii. They introduced a smarter initialisation step that tends to select centroids that are distant from one another, & this improvement makes the K-means algorithm much less likely to converge to a suboptimal solution. They showed that the additional computation required for the smarter initialisation step is well worth it because it makes it possible to drastically reduce the number of times the algorithm needs to be run to find the optimal solution. Here is the K-Means++ initialisation algorithm:

1. Take one centroid $c^{(1)}$, chosen uniformly at random from the dataset.
2. Take a new centroid $c^{(i)}$, choosing an instance $x^{(i)}$ with probability $D(x^{(i)})^2/\sum^{m}_{j = 1}D(x^{(j)})^2$, where $D(x^{(i)})$ is the distance between the instance $x^{(i)}$ & the closest centroid that was already chosen. This probability distribution ensures that instances farther away from already chosen centroids are much more likely to be selected as centroids.
3. Repeat the previous step until all *k* centroids have been chosen.
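
The three steps above can be sketched in NumPy as follows. The function name `kmeans_pp_init` & its parameters are hypothetical, & this is the plain version of the procedure as listed; scikit-learn's own implementation may differ in its details.

```python
import numpy as np
from sklearn.datasets import make_blobs

def kmeans_pp_init(X, k, seed=0):
    """K-Means++ seeding as listed above (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    m = len(X)
    # Step 1: the first centroid is an instance chosen uniformly at random.
    centroids = [X[rng.integers(m)]]
    # Squared distance from each instance to its closest chosen centroid.
    sq_dists = ((X - centroids[0]) ** 2).sum(axis=1)
    # Steps 2 & 3: keep sampling until k centroids have been chosen.
    while len(centroids) < k:
        probs = sq_dists / sq_dists.sum()  # D(x)^2 / sum_j D(x_j)^2
        idx = rng.choice(m, p=probs)
        centroids.append(X[idx])
        # A new centroid can only shrink each instance's closest distance.
        sq_dists = np.minimum(sq_dists, ((X - X[idx]) ** 2).sum(axis=1))
    return np.array(centroids)

blob_centers = np.array([[0.2, 2.3], [-1.5, 2.3], [-2.8, 1.8],
                         [-2.8, 2.8], [-2.8, 1.3]])
blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
X, _ = make_blobs(n_samples=2000, centers=blob_centers,
                  cluster_std=blob_std, random_state=32)

init_centroids = kmeans_pp_init(X, k=5)
print(init_centroids)
```

An instance that is already a centroid has a squared distance of zero, so it can never be picked twice. The resulting array could then be passed to `KMeans` through its `init` hyperparameter, with `n_init = 1`.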

The `KMeans` class uses this initialisation method by default. If you want to force it to use the original method (i.e., picking *k* instances randomly to define the initial centroids), then you can set the `init` hyperparameter to `"random"`. You will rarely need to do this.

### Accelerated K-means & Mini-Batch K-Means