# Chapter 9: Unsupervised Learning Techniques

In [12]:
import numpy as np

- **Clustering**:
    - Group similar instances together into *clusters*.

- **Anomaly detection**:
    - Learn what "normal" data looks like, and then use that to detect abnormal instances.

- **Density estimation**:
    - Estimate the *probability density function (PDF)* of the random process that generated the dataset.

## 9.1 Clustering

**Clustering** - The task of identifying similar instances and assigning them to *clusters*, or groups of similar instances.

Various applications of clustering:
- Customer segmentation
    - Cluster your customers based on their purchases and their activity on your website.
    - Understand who your customers are and their needs so you can adapt your products and marketing to each segment.
    - Useful in *recommender systems* to suggest content that others in the same cluster enjoyed.
    
- Data analysis
    - When you analyze a new dataset, it can be helpful to run a clustering algorithm, and then analyze each cluster separately.

- Dimensionality reduction technique
    - Once a dataset has been clustered, it's possible to measure each instance's affinity with each cluster.
    - **Affinity** is any measure of how well an instance fits into a cluster.
    - Each instance's feature vector can then be replaced with the vector of its cluster affinities.
    - If there are *k* clusters, then the vector will be *k*-dimensional.
    - The vector is typically much lower-dimensional than the original feature vector, but preserves enough information for further processing.

- Anomaly detection (also called outlier detection)
    - Any instance that has a low affinity to all the clusters is likely to be an anomaly.
    - Detect unusual behavior, such as an unusual number of requests per second.
    - Particularly useful for detecting defects in manufacturing and for fraud detection.

- Semi-supervised learning
    - If you only have a few labels, you can propagate the labels to all the instances in the same cluster.
    - Greatly increases the number of labels available for a subsequent supervised learning algorithm, which can improve its performance.

- Search engines
    - Some search engines let you search for images that are similar to a reference image.
    - Perform clustering and then return all the images from the same cluster.

- Segment an image
    - By clustering pixels according to their color and replacing each pixel's color with the mean color of its cluster, it's possible to reduce the number of different colors in the image.
    - Used in object detection and tracking systems.
    - Makes it easier to detect the contour of each object.

### 9.1.1 K-Means

The K-Means algorithm is a simple algorithm capable of clustering clearly defined "blobs" of instances.

> Note: See Figure 9-2 in book.

In [13]:
# FROM BOOK NOTEBOOK

from sklearn.datasets import make_blobs

blob_centers = np.array(
    [[ 0.2,  2.3],
     [-1.5 ,  2.3],
     [-2.8,  1.8],
     [-2.8,  2.8],
     [-2.8,  1.3]])
blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
X, y = make_blobs(n_samples=2000, centers=blob_centers,
                  cluster_std=blob_std, random_state=7)

In [14]:
from sklearn.cluster import KMeans

In [15]:
k = 5
kmeans = KMeans(n_clusters=k)
y_pred = kmeans.fit_predict(X)

> Note: You have to specify the number of clusters *k* that the algorithm must find.

In the context of clustering, an instance's *"label"* is the index of the cluster that the instance gets assigned to.

In [16]:
y_pred

array([4, 0, 1, ..., 3, 1, 0])

In [17]:
y_pred is kmeans.labels_

True

In [18]:
kmeans.cluster_centers_

array([[-2.80389616,  1.80117999],
       [ 0.20876306,  2.25551336],
       [-1.46679593,  2.28585348],
       [-2.79290307,  2.79641063],
       [-2.80037642,  1.30082566]])

You can easily assign new instances to the cluster whose centroid is closest.

In [19]:
X_new = np.array([[0, 2], [3, 2], [-3, 3], [-3, 2.5]])
kmeans.predict(X_new)

array([1, 1, 3, 3])

> Note: K-Means does not behave well when the blobs have very different diameters, because the only thing it considers when assigning an instance to a cluster is the distance to the centroid.

*Hard clustering* - Assigning each instance to a single cluster.  
*Soft clustering* - Give each instance a score per cluster.

In the `KMeans` class, `transform()` measures the distance from each instance to every centroid.

In [20]:
kmeans.transform(X_new)

array([[2.81093633, 0.32995317, 1.49439034, 2.9042344 , 2.88633901],
       [5.80730058, 2.80290755, 4.4759332 , 5.84739223, 5.84236351],
       [1.21475352, 3.29399768, 1.69136631, 0.29040966, 1.71086031],
       [0.72581411, 3.21806371, 1.54808703, 0.36159148, 1.21567622]])

The first instance in `X_new` is located at a distance of:
- 2.81 from the 1st centroid
- 0.33 from the 2nd centroid
- 1.49 from the 3rd centroid
- 2.90 from the 4th centroid
- 2.89 from the 5th centroid

If the original dataset is high-dimensional, transforming it this way yields a *k*-dimensional dataset, which can be a very efficient nonlinear dimensionality reduction technique.
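As a rough illustration of hard versus soft clustering, the sketch below turns the distances from `transform()` into per-cluster scores. The Gaussian RBF similarity and the `gamma` bandwidth are assumptions chosen for this example, not something `KMeans` does itself, and the toy dataset `X_demo` is not the one used above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data (assumed for illustration only).
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=42)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_demo)

# transform() returns one column per centroid: shape (n_samples, k).
distances = km.transform(X_demo)

# Hard clustering: each instance goes to its closest centroid.
hard_labels = distances.argmin(axis=1)

# Soft clustering (illustrative choice): Gaussian RBF similarity,
# normalized so that each row sums to 1.
gamma = 1.0  # hypothetical bandwidth
d2 = distances ** 2
# Subtract the row minimum before exponentiating for numerical stability.
affinity = np.exp(-gamma * (d2 - d2.min(axis=1, keepdims=True)))
soft_scores = affinity / affinity.sum(axis=1, keepdims=True)

print(hard_labels[:5])
print(soft_scores[:5].round(3))
```

Other similarity functions would work just as well; the only requirement is that the score decreases as the distance to the centroid grows.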

#### 9.1.1.1 The K-Means algorithm

How does the algorithm work?

1. Start by placing the centroids randomly (e.g., pick *k* instances at random and use their locations as centroids).
2. Label the instances.
3. Update the centroids.
4. Repeat steps 2 & 3 until centroids stop moving.
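The four steps above can be sketched in plain NumPy. This is a minimal illustration, not Scikit-Learn's implementation: the function name `kmeans_sketch`, the `max_iter` cap, and the handling of empty clusters are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Hypothetical helper: plain K-Means on an (m, n) array."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct instances at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: label each instance with the index of its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its instances
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated toy blobs (assumed data).
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(loc=c, scale=0.2, size=(50, 2))
                    for c in ([0, 0], [5, 5])])
centroids, labels = kmeans_sketch(X_demo, k=2)
print(centroids.round(2))
```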

> Note: The algorithm is guaranteed to converge in a finite number of steps because the mean squared distance between the instances and their closest centroid can only go down at each step. It does not oscillate forever.

> Note: The computational complexity is generally linear in the number of instances *m*, the number of clusters *k*, and the number of dimensions *n*. In the worst case it can grow exponentially with the number of instances, but in practice K-Means is generally one of the fastest clustering algorithms.

Although the algorithm is guaranteed to converge, it may not converge to the right solution (e.g., it may converge to a local optimum). Whether it does or not depends on the centroid initialization.

#### 9.1.1.2 Centroid initialization methods

If you happen to know approximately where the centroids should be, you can set the `init` hyperparameter to a NumPy array containing the list of centroids and set `n_init=1`.

In [21]:
good_init = np.array([[-3, 3], [-3, 2], [-3, 1], [-1, 2], [0, 2]])
kmeans = KMeans(n_clusters=5, init=good_init, n_init=1)

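Another solution is to run the algorithm multiple times with different random initializations and keep the best solution. The number of random initializations is controlled by the `n_init` hyperparameter (default `=10` in the Scikit-Learn version used by the book; since Scikit-Learn 1.4 the default is `"auto"`, which runs 10 initializations for `init="random"` but only 1 for `init="k-means++"`).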

To decide which solution is best, `KMeans` uses a performance metric called the model's **inertia**: the sum of the squared distances between each instance and its closest centroid.

In [22]:
k = 5
kmeans = KMeans(n_clusters=k)
y_pred = kmeans.fit_predict(X)

kmeans.inertia_ # Used 1st example of kmeans, Figure 9-3 in book

211.59853725816828

In [23]:
kmeans.score(X)

-211.5985372581683

> Note: Negative because Scikit-Learn's "greater is better" rule:  
if a predictor is better than another, its `score()` method should return a greater score.
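To make the definition of inertia concrete, the following sketch recomputes it by hand from the distances returned by `transform()` and compares it with `inertia_` and `score()`. The toy dataset `X_demo` and the explicit `n_init=10` and `random_state=0` settings are assumptions for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data (assumed for illustration only).
X_demo, _ = make_blobs(n_samples=500, centers=4, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_demo)

# Distance from each instance to its closest centroid.
closest_dist = km.transform(X_demo).min(axis=1)

# Inertia: squared distances summed over all instances.
manual_inertia = np.sum(closest_dist ** 2)

print(manual_inertia)
print(km.inertia_)        # should match up to floating-point error
print(-km.score(X_demo))  # score() is the negative inertia
```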

The K-Means++ algorithm uses a smarter initialization step that tends to select centroids that are distant from one another:

1. Take one centroid $\mathbf{c}^{(1)}$, chosen uniformly at random from the dataset.
2. Take a new centroid $\mathbf{c}^{(i)}$, choosing an instance $\mathbf{x}^{(i)}$ with probability $ D(\mathbf{x}^{(i)})^2 / \sum_{j=1}^{m} D(\mathbf{x}^{(j)})^2$, where $ D(\mathbf{x}^{(i)})$ is the distance between the instance and closest centroid that was already chosen.  
This probability distribution ensures that instances farther away from already chosen centroids are much more likely to be selected as centroids.
3. Repeat the previous step until all k centroids have been chosen.

`KMeans` uses this method by default.
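The K-Means++ initialization steps can be sketched as follows. This is a simplified illustration under stated assumptions: `kmeans_pp_init` is a hypothetical helper, and Scikit-Learn's own implementation differs in its details.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Hypothetical helper: K-Means++ style seeding on an (m, n) array."""
    rng = np.random.default_rng(seed)
    # Step 1: first centroid chosen uniformly at random from the dataset.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # D(x)^2: squared distance from each instance to its closest
        # already-chosen centroid.
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Step 2: sample the next centroid with probability
        # D(x)^2 / sum_j D(x_j)^2 (assumes the instances are not all identical).
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    # Step 3 is the loop itself: repeat until k centroids are chosen.
    return np.array(centroids)

# Three toy blobs (assumed data).
rng = np.random.default_rng(42)
X_demo = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
                    for c in ([0, 0], [4, 0], [2, 4])])
init = kmeans_pp_init(X_demo, k=3)
print(init.round(2))
```

The resulting array could be passed as `init` to `KMeans` together with `n_init=1`, as in the `good_init` cell earlier.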

#### 9.1.1.3 Accelerated K-Means and mini-batch K-Means

Accelerated K-Means speeds up the algorithm by avoiding many unnecessary distance calculations. It does so by exploiting the triangle inequality (i.e., a straight line is always the shortest distance between two points) and by keeping track of lower and upper bounds for distances between instances and centroids.

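`KMeans` used this method by default in the Scikit-Learn version used by the book. Recent Scikit-Learn releases default to `algorithm="lloyd"`, so set `algorithm="elkan"` to request the accelerated variant.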

Mini-batch K-Means uses mini-batches, moving the centroids just slightly at each iteration, instead of using the full dataset at each iteration.

In [24]:
from sklearn.cluster import MiniBatchKMeans

In [25]:
minibatch_kmeans = MiniBatchKMeans(n_clusters=5)
minibatch_kmeans.fit(X)

MiniBatchKMeans(n_clusters=5)

> Note: Mini-batch K-Means algorithm trains much faster than regular K-Means algorithm, but its inertia is generally slightly worse, especially as the number of clusters increases. See Figure 9-6 in book.
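When the dataset does not fit in memory, one option is to feed `MiniBatchKMeans` one batch at a time with `partial_fit()`. The sketch below simulates this by slicing an in-memory toy array; the batch size of 100 and the dataset are assumptions for this example.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Toy data standing in for a dataset that would be streamed from disk.
X_demo, _ = make_blobs(n_samples=3000, centers=5, random_state=42)

mbk = MiniBatchKMeans(n_clusters=5, n_init=3, random_state=42)
batch_size = 100  # assumed value; must be at least n_clusters
for start in range(0, len(X_demo), batch_size):
    # Each call nudges the centroids using only the current batch.
    mbk.partial_fit(X_demo[start:start + batch_size])

labels = mbk.predict(X_demo)
print(mbk.cluster_centers_.round(2))
```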

#### 9.1.1.4 Finding the optimal number of clusters

The inertia is not a good performance metric when trying to choose *k* because it keeps getting lower as we increase *k*.

Plotting the inertia as a function of *k* shows an "elbow" at k = 4, where the rapid drop gives way to a slower decrease. Picking 4 clusters would therefore be a reasonable guess, but this method is not very precise.

> Note: See Figure 9-8 in book.
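The values behind such an elbow plot can be collected with a simple loop. This is a sketch on an assumed toy dataset with four blobs, not the dataset from the figure; plot `inertias` against `ks` to look for the elbow.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with four blobs (assumed for illustration only).
X_demo, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.6,
                       random_state=0)

ks = range(1, 9)
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo).inertia_
    for k in ks
]

for k, inertia in zip(ks, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```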

A more precise approach is to use the **silhouette score**, which is the mean *silhouette coefficient* (between -1 and +1) over all the instances. A coefficient:
- Close to +1 => Instance is well inside its own cluster and far from others.
- Close to 0 => Instance is close to a cluster boundary.
- Close to -1 => Instance may have been assigned to the wrong cluster.

In [26]:
from sklearn.metrics import silhouette_score

In [27]:
silhouette_score(X, kmeans.labels_)

0.655517642572828

Plotting the silhouette score for different numbers of clusters provides much more information than the inertia plot: k = 4 is a very good choice, and k = 5 is good too - much better than k = 6 or 7.
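A comparison like this can be produced by computing the silhouette score for a range of *k* values. The sketch below uses an assumed toy dataset, so its scores will differ from those in the book's figure.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data (assumed for illustration only).
X_demo, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.6,
                       random_state=0)

scores = {}
for k in range(2, 9):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("highest score at k =", best_k)
```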

An even better visualization called a **silhouette diagram** is a plot of every instance's silhouette coefficient, sorted by the cluster they are assigned to and by the value of the coefficient.

The dashed line indicates the mean silhouette coefficient (the silhouette score). We want most instances in a cluster to extend to the right of this line; when many of them stop short of it, they are too close to other clusters and the cluster is rather bad. When k = 3 and k = 6, we get bad clusters.

When k = 4, the cluster at index 1 is fairly big, so picking k = 5 may be a better idea to get clusters of similar sizes.

> Note: See Figure 9-10 in book.
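The data behind a silhouette diagram comes from `silhouette_samples()`, which returns one coefficient per instance. The sketch below, on an assumed toy dataset, sorts the coefficients within each cluster and reports the fraction that falls below the overall mean (the dashed line).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Toy data (assumed for illustration only).
X_demo, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.6,
                       random_state=0)
k = 4
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)

coeffs = silhouette_samples(X_demo, labels)   # one coefficient per instance
mean_score = silhouette_score(X_demo, labels)  # position of the dashed line

for cluster in range(k):
    # Sorted coefficients give the shape of one "knife" in the diagram.
    cluster_coeffs = np.sort(coeffs[labels == cluster])
    below = np.mean(cluster_coeffs < mean_score)
    print(f"cluster {cluster}: size={len(cluster_coeffs)}, "
          f"fraction below mean={below:.2f}")
```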

### 9.1.2 Limits of K-Means

As mentioned already, K-Means isn't perfect.

- Needs to be run several times to avoid suboptimal solutions
- Requires you to specify the number of clusters
- Does not perform well when clusters have varying sizes, different densities, or nonspherical shapes

> Note: It is important to scale the input features before running K-Means, or the clusters may be very stretched and K-Means will perform poorly.
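The effect of scaling can be seen on a small synthetic example. In the sketch below (all data and names are assumptions for illustration), the two groups differ only in the first feature, while the second feature is pure noise on a much larger scale.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=400)  # true group of each instance
X_demo = np.column_stack([
    group * 10.0 + rng.normal(scale=0.5, size=400),  # informative feature
    rng.normal(scale=1000.0, size=400),              # large-scale noise
])

# Without scaling, the noisy feature dominates the distances.
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_demo)

# With scaling, both features contribute on a comparable scale.
scaled_kmeans = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
scaled_labels = scaled_kmeans.fit_predict(X_demo)

def agreement(labels, truth):
    # Cluster indices are arbitrary, so check both possible matchings.
    acc = np.mean(labels == truth)
    return max(acc, 1 - acc)

print("raw:   ", agreement(raw_labels, group))
print("scaled:", agreement(scaled_labels, group))
```

Scaling does not guarantee good clusters, but it generally helps K-Means.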

### 9.1.3 Using Clustering for Image Segmentation

**Image segmentation** - The task of partitioning an image into multiple segments.  

**Semantic segmentation** - All pixels that are part of the same object type get assigned to the same segment (e.g., one segment for all pedestrians).  

**Instance segmentation** - All pixels that are part of the same individual object are assigned to the same segment (e.g., a different segment for each pedestrian).

**Color segmentation** - All pixels that have a similar color are assigned to the same segment.

In [28]:
from matplotlib.image import imread

In [29]:
image = imread("Images/Unsupervised_learning/ladybug.png")
image.shape

(533, 800, 3)

The image is represented as a 3D array.
- 1st dimension's size = Height
- 2nd dimension's size = Width
- 3rd dimension's size = Number of color channels (RGB)

=> For each pixel, there is a 3D vector containing the intensities of red, green, and blue, each between 0.0 and 1.0.

In [31]:
X = image.reshape(-1, 3) # Reshape array to a long list of RGB colors
kmeans = KMeans(n_clusters=8).fit(X) # Clusters these colors using K-Means
segmented_img = kmeans.cluster_centers_[kmeans.labels_]
segmented_img = segmented_img.reshape(image.shape)

K-Means prefers clusters of similar sizes. The ladybug is small - much smaller than the rest of the image - so even though its color is flashy, K-Means fails to dedicate a cluster to it.

> Note: See Figure 9-12 in book for ladybug reference.
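To experiment with different numbers of colors without the ladybug file, the same steps can be run on a synthetic image. The image below (a noisy green background with a small red patch) is an assumed stand-in, and the cluster counts are arbitrary choices for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic 40x60 RGB image with values in [0.0, 1.0] (assumed data).
image = np.zeros((40, 60, 3))
image[...] = [0.1, 0.6, 0.1]           # green background
image[15:20, 25:30] = [0.9, 0.1, 0.1]  # small red patch
image = np.clip(image + rng.normal(scale=0.03, size=image.shape), 0.0, 1.0)

pixels = image.reshape(-1, 3)  # long list of RGB colors

segmented = {}
for n_colors in (2, 4, 8):
    km = KMeans(n_clusters=n_colors, n_init=10, random_state=0).fit(pixels)
    # Replace each pixel's color with the mean color of its cluster.
    segmented[n_colors] = km.cluster_centers_[km.labels_].reshape(image.shape)

for n_colors, img in segmented.items():
    n_unique = len(np.unique(img.reshape(-1, 3), axis=0))
    print(f"{n_colors} clusters -> {n_unique} distinct colors")
```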

### 9.1.4 Using Clustering for Preprocessing

Clustering can be an efficient approach to dimensionality reduction.

In [34]:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [38]:
X_digits, y_digits = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X_digits, y_digits)

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

log_reg.score(X_test, y_test)

0.9688888888888889

Now try using K-Means as a preprocessing step. Create a pipeline that first clusters the training set into 50 clusters and replaces the images with their distances to these 50 clusters, then applies a Logistic Regression model.

In [39]:
from sklearn.pipeline import Pipeline

In [53]:
pipeline = Pipeline([
    ("kmeans", KMeans(n_clusters=50)),
    ("log_reg", LogisticRegression())
])
pipeline.fit(X_train, y_train)

Pipeline(steps=[('kmeans', KMeans(n_clusters=50)),
                ('log_reg', LogisticRegression())])

> Note: Since there are 10 different digits, it's tempting to set the number of clusters to 10. However, each digit can be written several different ways, so it is preferable to use a larger number of clusters, such as 50.

In [54]:
pipeline.score(X_test, y_test)

0.9622222222222222

Since K-Means is used as a preprocessing step before classification, it's easy to find the best value of *k* by using `GridSearchCV`.

In [55]:
from sklearn.model_selection import GridSearchCV

In [57]:
param_grid = dict(kmeans__n_clusters=range(2, 100))
grid_clf = GridSearchCV(pipeline, param_grid, cv=3, verbose=2)
grid_clf.fit(X_train, y_train)

grid_clf.best_params_

...
[CV] END ..............................kmeans__n_clusters=17; total time=   0.2s
[CV] END ..............................kmeans__n_clusters=18; total time=   0.2s
[CV] END ..............................kmeans__n_clusters=18; total time=   0.2s
[CV] END ..............................kmeans__n_clusters=18; total time=   0.3s
[CV] END ..............................kmeans__n_clusters=19; total time=   0.2s
[CV] END ..............................kmeans__n_clusters=19; total time=   0.3s
[CV] END ..............................kmeans__n_clusters=19; total time=   0.2s
[CV] END ..............................kmeans__n_clusters=20; total time=   0.2s
[CV] END ..............................kmeans__n_clusters=20; total time=   0.2s
[CV] END ..............................kmeans__n_clusters=20; total time=   0.3s
[CV] END ..............................kmeans__n_clusters=21; total time=   0.3s
[CV] END ..............................kmeans__n_clusters=21; total time=   0.3s
...

{'kmeans__n_clusters': 97}

In [58]:
grid_clf.score(X_test, y_test)

0.9644444444444444

### 9.1.5 Using Clustering for Semi-Supervised Learning

### 9.1.6 DBSCAN

### 9.1.7 Other Clustering Algorithms

## 9.2 Gaussian Mixtures

### 9.2.1 Anomaly Detection Using Gaussian Mixtures

### 9.2.2 Selecting the Number of Clusters

### 9.2.3 Bayesian Gaussian Mixture Models

### 9.2.4 Other Algorithms for Anomaly and Novelty Detection