# Chapter 9: Unsupervised Learning Techniques Exercises

## 1.

> How would you define clustering?

> Can you name a few clustering algorithms?

Clustering is the task of identifying similar instances and assigning them to clusters, or groups of similar instances.

Popular clustering algorithms include K-Means, DBSCAN, agglomerative clustering, BIRCH, Mean-Shift, affinity propagation, and spectral clustering.

## 2.

> What are some of the main applications of clustering algorithms?

Some of the main applications of clustering algorithms are:
- Customer segmentation
- Data analysis
- Dimensionality reduction
- Anomaly detection
- Semi-supervised learning
- Search engines
- Image segmentation

## 3.

> Describe two techniques to select the right number of clusters when using K-Means.

To select the right number of clusters when using K-Means, we can use:
- The model's **inertia**: the sum of the squared distances between each instance and its closest centroid.
    - Plot the inertia as a function of the number of clusters *k*.
    - Find the curve's "elbow", the point where the drop in inertia changes from rapid to gradual.
    - The inertia keeps decreasing as *k* increases, so you can't simply pick the *k* with the lowest inertia.

- The model's **silhouette score**: the mean silhouette coefficient over all the instances.
    - The silhouette coefficient of an instance is $(b - a) / \max(a, b)$, where $a$ is the mean distance to the other instances in its own cluster and $b$ is the mean distance to the instances of the nearest other cluster. It tells how well the instance belongs to its own cluster:
        - $\approx +1$ => it's well inside its own cluster,
        - $\approx 0$ => it's close to a cluster boundary,
        - $\approx -1$ => it may have been assigned to the wrong cluster.
    - Plot the silhouette score as a function of the number of clusters *k*.
    - Pick the number of clusters with the highest score
        - => on average, instances sit most firmly inside their own cluster.
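
As a rough illustration of both techniques, the sketch below computes the inertia and the silhouette score for several values of *k*. The synthetic four-blob dataset, the range of *k*, and the variable names are assumptions made for this example, not part of the exercise.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Assumption: synthetic data with 4 well-separated blobs, for illustration only.
X, _ = make_blobs(
    n_samples=600,
    centers=[[-6, -6], [-6, 6], [6, -6], [6, 6]],
    cluster_std=0.8,
    random_state=42,
)

ks = range(2, 9)
inertias, silhouettes = [], []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)  # sum of squared distances to closest centroid
    silhouettes.append(silhouette_score(X, km.labels_))

# Silhouette technique: take the k with the highest mean silhouette coefficient.
best_k = ks[int(np.argmax(silhouettes))]
print("inertia per k:", dict(zip(ks, np.round(inertias, 1))))
print("best k by silhouette score:", best_k)
```

For the elbow technique, you would plot `inertias` against `ks` (for example with Matplotlib) and look for the bend by eye; the silhouette score gives a more direct answer at a higher computational cost.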

## 4.

> What is label propagation?

> Why would you implement it, and how?

Label propagation is a technique that copies the labels of a few labeled instances to the unlabeled instances in the same cluster.

This is useful for semi-supervised learning, where you have a few labeled instances and many unlabeled ones. Label propagation increases the number of labeled instances, which usually improves the performance of a model trained on them.

To do this,

1. Cluster the training set.
2. For each cluster, find the instance closest to the centroid, the "representative instance".
3. Manually label these representative instances, so that their labels can be trusted.
4. Propagate each representative's label to all the other instances in its cluster.
5. Alternatively, propagate the label only to the percentage of instances closest to the centroid, since instances near cluster boundaries are more likely to be mislabeled.
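
The steps above can be sketched as follows. This is a minimal sketch under stated assumptions: it uses the small digits dataset bundled with scikit-learn, 50 clusters, and a 20% cutoff for partial propagation, and it simulates the manual labeling step by reading the true labels. None of these choices come from the exercise.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

# Step 1: cluster the training set (k = 50 is an assumption).
k = 50
kmeans = KMeans(n_clusters=k, n_init=5, random_state=42)
X_dist = kmeans.fit_transform(X)  # distance from each instance to each centroid

# Step 2: the representative of each cluster is the instance closest to its centroid.
rep_idx = np.argmin(X_dist, axis=0)

# Step 3: "manual" labeling, simulated here by looking up the true labels.
y_rep = y[rep_idx]

# Step 4: propagate each representative's label to its whole cluster.
y_propagated = y_rep[kmeans.labels_]

# Step 5: keep only the 20% of instances closest to their own centroid.
dist_to_own = X_dist[np.arange(len(X)), kmeans.labels_]
keep = np.zeros(len(X), dtype=bool)
for c in range(k):
    in_cluster = kmeans.labels_ == c
    cutoff = np.percentile(dist_to_own[in_cluster], 20)
    keep |= in_cluster & (dist_to_own <= cutoff)

print("accuracy of fully propagated labels:", (y_propagated == y).mean())
print("accuracy of partially propagated labels:", (y_propagated[keep] == y[keep]).mean())
```

A classifier can then be trained on `X[keep]` and `y_propagated[keep]` instead of on the 50 representatives alone.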

## 5.

> Can you name two clustering algorithms that can scale to large datasets?

> And two that look for regions of high density?

Two clustering algorithms that can scale to large datasets are:
- Mini-batch K-Means:
    - With the caveat that the data must have a clustering structure.
    - Uses mini-batches, moving the centroids slightly at each iteration.
    - Capable of clustering huge datasets that don't fit into memory.

- BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies):
    - With the caveat that the number of features is not too large (fewer than 20).
    - Builds a tree structure with just enough information to quickly assign each new instance to a cluster.
    - Does not have to store all the instances in the tree.
    - Uses limited memory while handling huge datasets.
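
Both algorithms can be fed data incrementally in scikit-learn through `partial_fit()`. The sketch below assumes a small synthetic dataset split into 30 chunks to imitate data that arrives in pieces; the chunk count, the BIRCH `threshold`, and the blob centers are illustrative choices.

```python
import numpy as np
from sklearn.cluster import Birch, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Assumption: three well-separated synthetic blobs stand in for a huge dataset.
X, _ = make_blobs(
    n_samples=3000,
    centers=[[-8, 0], [0, 8], [8, 0]],
    cluster_std=1.0,
    random_state=42,
)
batches = np.array_split(X, 30)  # pretend each chunk is loaded from disk

# Mini-batch K-Means: the centroids move a little with each chunk.
mbk = MiniBatchKMeans(n_clusters=3, n_init=3, random_state=42)
for batch in batches:
    mbk.partial_fit(batch)
mbk_labels = mbk.predict(X)

# BIRCH: the tree of subclusters grows with each chunk.
birch = Birch(n_clusters=3, threshold=0.5)
for batch in batches:
    birch.partial_fit(batch)
birch_labels = birch.predict(X)

print("Mini-batch K-Means cluster sizes:", np.bincount(mbk_labels))
print("BIRCH cluster sizes:", np.bincount(birch_labels))
```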

Two clustering algorithms that look for regions of high density are:
- DBSCAN:
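    - For each instance, counts how many instances are located within a small distance $\epsilon$ of it (its $\epsilon$-neighborhood).
    - If an instance has at least the minimum number of instances in its $\epsilon$-neighborhood (including itself), it is labeled a core instance.
    - All the instances in the neighborhood of a core instance belong to the same cluster.
    - Any instance that is not a core instance and does not have a core instance in its neighborhood is considered an anomaly.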

- Mean-Shift:
    - Places a circle centered on each instance.
    - For each circle, computes the mean of all instances located within it.
    - Shifts the circle so that it's centered on the mean.
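    - Repeats this, so each circle keeps shifting in the direction of higher density until it settles on a local density maximum; instances whose circles settle in the same place are assigned to the same cluster.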
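
A short sketch of both density-based algorithms on the two-moons toy dataset follows. The dataset and the hyperparameter values (`eps=0.2`, `min_samples=5`, `bandwidth=0.5`) are assumptions picked for illustration; they would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN, MeanShift
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)

# DBSCAN: eps is the neighborhood radius, min_samples the core-instance threshold.
dbscan = DBSCAN(eps=0.2, min_samples=5).fit(X)
n_clusters = len(set(dbscan.labels_) - {-1})  # label -1 marks anomalies
n_core = len(dbscan.core_sample_indices_)
n_anomalies = int(np.sum(dbscan.labels_ == -1))
print(f"DBSCAN: {n_clusters} clusters, {n_core} core instances, {n_anomalies} anomalies")

# Mean-Shift: bandwidth is the radius of the circle placed on each instance.
mean_shift = MeanShift(bandwidth=0.5).fit(X)
print("Mean-Shift found", len(mean_shift.cluster_centers_), "clusters")
```

Note that Mean-Shift tends to chop elongated shapes such as the moons into several clusters, whereas DBSCAN can follow clusters of arbitrary shape.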

## 6.

> Can you think of a use case where active learning would be useful?

> How would you implement it?

Active learning is helpful in semi-supervised settings where there are a few labeled instances but many unlabeled ones and labeling is costly, for example the handwritten digits of the MNIST dataset if only a few images had been labeled.

Since each digit can be written in many ways, a few labels cannot cover them all. To implement it, train the model on the labeled instances and use it to make predictions on the unlabeled ones. Then have the instances with the lowest estimated probability, the ones the model is least certain about, labeled manually. Repeat the process until the improvement is minuscule compared to the labeling effort.
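
One common way to implement this loop is uncertainty sampling, sketched below. The sketch makes several assumptions: it uses scikit-learn's bundled digits dataset as a small stand-in for MNIST, a logistic regression classifier, an initial pool of 50 labels, 5 rounds of 20 queries, and it simulates the human expert by looking up the true labels.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
X = X / 16.0  # scale pixel values to [0, 1]

# Start with a small random pool of labeled instances (assumption: 50).
rng = np.random.default_rng(42)
is_labeled = np.zeros(len(X), dtype=bool)
is_labeled[rng.choice(len(X), size=50, replace=False)] = True

for _ in range(5):
    clf = LogisticRegression(max_iter=1000).fit(X[is_labeled], y[is_labeled])

    # Confidence = estimated probability of the predicted class.
    unlabeled_idx = np.flatnonzero(~is_labeled)
    confidence = clf.predict_proba(X[unlabeled_idx]).max(axis=1)

    # Query the 20 least confident instances; the "expert" is simulated by y.
    query = unlabeled_idx[np.argsort(confidence)[:20]]
    is_labeled[query] = True

# Retrain on everything labeled so far and check the remaining instances.
clf = LogisticRegression(max_iter=1000).fit(X[is_labeled], y[is_labeled])
accuracy = clf.score(X[~is_labeled], y[~is_labeled])
print(f"{is_labeled.sum()} labels, accuracy on the rest: {accuracy:.3f}")
```

In practice the loop would stop when the accuracy gain per round no longer justifies the labeling cost, rather than after a fixed number of rounds.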

## 7.

> What is the difference between anomaly detection and novelty detection?

The difference lies in the training data. Novelty detection assumes the model is trained on a "clean" dataset with no outliers, and then flags new instances that differ from it. Anomaly (outlier) detection does not make this assumption: the training set itself may contain the outliers to be found.
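
scikit-learn's `LocalOutlierFactor` exposes this distinction through its `novelty` parameter, which makes it a convenient illustration. The synthetic data below (a Gaussian cloud plus five far-away points) is an assumption made for the example.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
X_clean = rng.normal(0, 1, size=(500, 2))      # "normal" instances
X_outliers = rng.uniform(8, 10, size=(5, 2))   # far from the normal cloud

# Anomaly detection: the training set itself is contaminated,
# and the model labels the training instances (-1 = outlier, +1 = inlier).
X_dirty = np.vstack([X_clean, X_outliers])
dirty_pred = LocalOutlierFactor(n_neighbors=20).fit_predict(X_dirty)

# Novelty detection: fit on clean data only, then judge unseen instances.
novelty_detector = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_clean)
new_pred = novelty_detector.predict(X_outliers)

print("outliers flagged in the dirty training set:", int((dirty_pred[-5:] == -1).sum()))
print("novelties flagged among new instances:", int((new_pred == -1).sum()))
```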

## 8.

> What is a Gaussian mixture?

> What tasks can you use it for?

A Gaussian mixture is a probabilistic model that assumes that the instances were generated from a mixture of several Gaussian distributions whose parameters are unknown.

It can be used for density estimation, clustering, and anomaly detection.
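
The three tasks map onto methods of scikit-learn's `GaussianMixture` class, as the sketch below shows. The synthetic three-blob dataset and the 2% density threshold for anomalies are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Assumption: three synthetic Gaussian blobs.
X, _ = make_blobs(
    n_samples=900,
    centers=[[-5, 0], [0, 5], [5, 0]],
    cluster_std=1.0,
    random_state=42,
)

gm = GaussianMixture(n_components=3, n_init=5, random_state=42).fit(X)

# Clustering: hard assignments and soft (per-cluster probability) assignments.
labels = gm.predict(X)
probas = gm.predict_proba(X)

# Density estimation: log of the probability density at each instance.
log_density = gm.score_samples(X)

# Anomaly detection: flag instances in the lowest-density 2% (threshold is an assumption).
threshold = np.percentile(log_density, 2)
anomalies = X[log_density < threshold]

# The model is generative, so it can also sample new instances.
X_new, y_new = gm.sample(5)
print("anomalies flagged:", len(anomalies))
```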

## 9.

> Can you name two techniques to find the right number of clusters when using a Gaussian mixture model?

Two techniques to find the right number of clusters when using a Gaussian mixture model are:
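- Minimizing a theoretical information criterion, such as the Bayesian information criterion (BIC) or the Akaike information criterion (AIC):
    - Both use the maximized value $\hat{L}$ of the model's likelihood function and penalize the number of parameters $p$: $BIC = \log(m)\,p - 2\log(\hat{L})$ and $AIC = 2p - 2\log(\hat{L})$, where $m$ is the number of instances.
    - Plot the BIC/AIC as a function of the number of clusters.
    - Pick the number of clusters that gives the lowest BIC/AIC.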

- Running a Bayesian Gaussian mixture model to find the number automatically:
    - Use the `BayesianGaussianMixture` class.
    - Set the number of clusters (`n_components`) to a value greater than the number you expect to be optimal.
    - The model gives weights equal (or close) to zero to the unnecessary clusters.
    - The cluster parameters (weights, means, covariance matrices) are treated as latent random variables rather than fixed parameters.
    - Bayes' theorem is used to update their probability distribution after observing the data.
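
Both techniques can be tried in a few lines. In this sketch the synthetic three-blob dataset, the candidate range of 1 to 6 clusters, the initial 10 components, and the 0.01 weight cutoff are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture, GaussianMixture

# Assumption: three synthetic Gaussian blobs.
X, _ = make_blobs(
    n_samples=900,
    centers=[[-6, 0], [0, 6], [6, 0]],
    cluster_std=1.0,
    random_state=42,
)

# Technique 1: fit one model per candidate k and keep the lowest BIC.
ks = range(1, 7)
bics = [
    GaussianMixture(n_components=k, n_init=5, random_state=42).fit(X).bic(X)
    for k in ks
]
best_k_bic = ks[int(np.argmin(bics))]
print("k with the lowest BIC:", best_k_bic)

# Technique 2: start with too many components and let the model prune them.
bgm = BayesianGaussianMixture(
    n_components=10, n_init=5, max_iter=500, random_state=42
).fit(X)
n_effective = int(np.sum(bgm.weights_ > 0.01))  # cutoff of 0.01 is an assumption
print("component weights:", np.round(bgm.weights_, 2))
print("components with non-negligible weight:", n_effective)
```

Replacing `.bic(X)` with `.aic(X)` gives the AIC variant; the two criteria often, but not always, agree.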

## 10.

> The classic Olivetti faces dataset contains 400 grayscale 64x64-pixel images of faces. Each image is flattened to a 1D vector of size 4,096. 40 different people were photographed (10 times each).

> And the usual task is to train a model that can predict which person is represented in each picture.

> 1. Load the dataset using the `sklearn.datasets.fetch_olivetti_faces()` function.

> 2. Then split it into a training set, a validation set, and a test set.

>> Note: The dataset is already scaled between 0 and 1.

> 3. Since the dataset is quite small, you probably want to use stratified sampling to ensure that there are the same number of images per person in each set.

> 4. Next, cluster the images using K-Means.

> 5. Ensure that you have a good number of clusters (using one of the techniques discussed in this chapter).

> 6. Visualize the clusters.

> Do you see similar faces in each cluster?

## 11.

> Continuing with the Olivetti faces dataset,

> 1. Train a classifier to predict which person is represented in each picture.

> 2. Evaluate it on the validation set.

> 3. Next, use K-Means as a dimensionality reduction tool.

> 4. Train a classifier on the reduced set.

> 5. Search for the number of clusters that allows the classifier to get the best performance.

> What performance can you reach?

> What if you append the features from the reduced set to the original features (again, searching for the best number of clusters)?

## 12.

> 1. Train a Gaussian mixture model on the Olivetti faces dataset.

> 2. To speed up the algorithm, you should probably reduce the dataset's dimensionality (e.g., use PCA, preserving 99% of the variance).

> 3. Use the model to generate some new faces (using the `sample()` method).

> 4. Visualize them (if you used PCA, you will need to use its `inverse_transform()` method).

> 5. Try to modify some images (e.g., rotate, flip, darken).

> 6. See if the model can detect the anomalies (i.e., compare the output of the `score_samples()` method for normal images and for anomalies).

## 13.

> Some dimensionality reduction techniques can also be used for anomaly detection. For example,

> 1. Take the Olivetti faces dataset and reduce it with PCA, preserving 99% of the variance.

> 2. Then compute the reconstruction error for each image.

> 3. Next, take some of the modified images you built in the previous exercise, and look at their reconstruction error.

> 4. Notice how much larger the reconstruction error is.

> 5. If you plot a reconstructed image, you will see why: it tries to reconstruct a normal face.
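
The reconstruction-error idea can be sketched as follows. To stay runnable without a download, this sketch swaps in scikit-learn's bundled 8x8 digits dataset for the Olivetti faces, and uses a 90-degree rotation as the only modification; the helper name `reconstruction_errors` is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Assumption: the bundled digits dataset stands in for the Olivetti faces.
X, _ = load_digits(return_X_y=True)
X = X / 16.0  # scale pixel values to [0, 1]

# Reduce with PCA, preserving 99% of the variance.
pca = PCA(n_components=0.99, random_state=42).fit(X)


def reconstruction_errors(pca, X):
    """Mean squared error between each image and its PCA reconstruction."""
    X_reconstructed = pca.inverse_transform(pca.transform(X))
    return np.mean((X - X_reconstructed) ** 2, axis=1)


# Build some "anomalies" by rotating the first 20 images by 90 degrees.
images = X[:20].reshape(-1, 8, 8)
X_rotated = np.rot90(images, axes=(1, 2)).reshape(-1, 64)

err_normal = reconstruction_errors(pca, X[:20])
err_rotated = reconstruction_errors(pca, X_rotated)
print("mean error, normal images: ", err_normal.mean())
print("mean error, rotated images:", err_rotated.mean())
```

The same function applies unchanged to the Olivetti data after reshaping the images to 64x64 instead of 8x8.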