### Faces dataset decompositions
Dataset: the Olivetti faces dataset.

Here we apply several unsupervised matrix decomposition (dimensionality reduction) methods to the dataset.

In [2]:
from numpy.random import RandomState
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_olivetti_faces
from sklearn import cluster
from sklearn import decomposition

#### Dataset

In [3]:
rng = RandomState(0)
faces, _ = fetch_olivetti_faces(return_X_y=True, shuffle=True, random_state=rng)
n_samples, n_features = faces.shape

In [142]:
plt.imshow(np.reshape(faces[0], (64,64)), cmap=plt.cm.gray);

<img src='./plots/face-0.png'>

In [141]:
# global centering

# the mean over axis=0 has shape (4096,); broadcasting subtracts it from every sample (row)
faces_centered = faces - faces.mean(axis=0)
plt.imshow(np.reshape(faces_centered[0], (64,64)), cmap=plt.cm.gray);

<img src='./plots/global-center.png'>

In [140]:
# local centering  

# the mean over axis=1 has shape (400,); reshape it to (400, 1) so broadcasting subtracts it from every pixel of that sample
faces_centered -= faces_centered.mean(axis=1).reshape(n_samples, -1)
plt.imshow(np.reshape(faces_centered[0], (64,64)), cmap=plt.cm.gray);

<img src='./plots/local-center.png'>

In [139]:
# after both centering steps, the per-feature and per-sample means are (numerically) zero
x0 = range(len(faces_centered.mean(axis=0)))
x1 = range(len(faces_centered.mean(axis=1)))
y0 = faces_centered.mean(axis=0)
y1 = faces_centered.mean(axis=1)

plt.figure(figsize=(15,5))
plt.subplot(121)
plt.scatter( x0, y0 )
plt.subplot(122)
plt.scatter( x1, y1 )

<img src='./plots/centering-global-local.png'>
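The two centering steps above can be checked numerically on a small array. This is a minimal sketch that uses random data as a stand-in for the faces matrix; the array `X` and its shape are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((5, 8))  # stand-in for `faces`: 5 samples, 8 features

# Global centering: subtract the per-feature (column) mean from every sample.
X_centered = X - X.mean(axis=0)

# Local centering: subtract each sample's own (row) mean.
# keepdims=True keeps the shape (5, 1), so broadcasting works row by row.
X_centered = X_centered - X_centered.mean(axis=1, keepdims=True)

# Both the column means and the row means are now numerically zero.
col_means = X_centered.mean(axis=0)
row_means = X_centered.mean(axis=1)
print(np.allclose(col_means, 0.0), np.allclose(row_means, 0.0))
```

Subtracting the row means does not disturb the column means: after global centering, the row means themselves average to zero, so every column loses a quantity whose mean is zero.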

#### Utility to plot a gallery of faces
* Initialise different estimators for decomposition, fit each of them on all images and plot some results.
* Each estimator extracts 6 components as vectors.
* We display these vectors as 64x64 pixel images, which is a more human-friendly visualisation.

In [90]:
def plot_faces(images, title='Faces', rows=2, cols=3, cmap=plt.cm.gray):
    """Show the first rows*cols rows of `images` as 64x64 pixel images in a grid."""
    fig, ax = plt.subplots(nrows=rows, ncols=cols, figsize=(15,10), constrained_layout=True)
    ax = ax.ravel()  # flatten the grid of axes so a single index walks over it

    for i, frame in enumerate(ax):
        frame.imshow(images[i].reshape(64,64), cmap=cmap)

    fig.suptitle(title, size=24)

### Eigenfaces - PCA using randomized SVD
* PCA performs linear dimensionality reduction, using the Singular Value Decomposition (SVD) of the data to project it onto a lower-dimensional space.
* It decomposes a multivariate dataset into a set of successive orthogonal components, each explaining a maximum amount of the remaining variance.
* We use PCA to reduce the dimension of the data. A closely related preprocessing step, called whitening (or sphering in some of the literature), is needed by some algorithms.
* When training on images, the raw input is redundant, since adjacent pixel values are highly correlated.
* The goal of whitening is to make the input less redundant: the transformed features are uncorrelated and all have the same variance.
* To center each input feature, we subtract its mean.
* To give each PCA component unit variance, we rescale component `x(i)` by `1/sqrt(λi)`, where `λi` is the variance (eigenvalue) along that component.
* PCA-whitened version of the data: the different components of `xPCAwhite` are uncorrelated and have unit variance.
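The whitening steps listed above can be sketched in plain NumPy. This is an illustrative sketch on synthetic correlated data, not the code path that `decomposition.PCA(whiten=True)` uses internally; the names `mixing`, `X_pca` and `X_pca_white` are assumptions introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated synthetic data: 500 samples, 3 features (stand-in for pixels).
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.5, 1.0, 0.0],
                   [0.5, 0.3, 0.2]])
X = rng.standard_normal((500, 3)) @ mixing.T

# Center each feature.
Xc = X - X.mean(axis=0)

# Eigendecomposition of the covariance matrix: the columns of U are the
# principal directions, lam holds the variance along each of them.
cov = Xc.T @ Xc / (len(Xc) - 1)
lam, U = np.linalg.eigh(cov)

# Rotate into the PCA basis, then rescale component i by 1/sqrt(lambda_i).
X_pca = Xc @ U
X_pca_white = X_pca / np.sqrt(lam)

# The whitened components are uncorrelated and have unit variance.
cov_white = np.cov(X_pca_white, rowvar=False)
print(np.allclose(cov_white, np.eye(3)))
```

The rotation alone already removes the correlations; the division by `sqrt(lam)` is what equalises the variances.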



In [43]:
pca_estimator = decomposition.PCA(n_components=6, whiten=True, svd_solver='randomized')

pca_estimator.fit(faces_centered)

pca_estimator.components_.shape

(6, 4096)

In [138]:
plot_faces(pca_estimator.components_, title='PCA components')

<img src='./plots/face_dataset_decomposition--pca.png'>

### Non-negative components - NMF
* Approximate the non-negative original data as the product of two non-negative matrices.

* Find two non-negative matrices, i.e. matrices with all non-negative elements, (W, H) whose product approximates the non-negative matrix X. 

* This factorization can be used for dimensionality reduction, source separation or topic extraction.

* **Here we pass the original non-negative dataset to the `fit` method**, because the centered data contains negative values, which NMF does not accept.
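To make the factorization concrete, here is a minimal sketch of NMF with the classic multiplicative update rules for the squared Frobenius error. It is an illustration on random data, not the solver scikit-learn runs by default (its default is coordinate descent); `W`, `H`, `k` and `eps` are names assumed for this sketch, with `H` playing the role of `components_`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((20, 12))      # non-negative data, stand-in for `faces`
k = 3                         # number of components
W = rng.random((20, k))       # non-negative initial weights
H = rng.random((k, 12))       # non-negative initial components
eps = 1e-10                   # guards against division by zero

errors = []
for _ in range(200):
    # Multiplicative updates only multiply by non-negative ratios,
    # so W and H stay non-negative throughout.
    H *= (W.T @ X) / (W.T @ W @ H + eps)
    W *= (X @ H.T) / (W @ H @ H.T + eps)
    errors.append(np.linalg.norm(X - W @ H))

print(W.min() >= 0, H.min() >= 0, errors[-1] < errors[0])
```

Because the updates never subtract anything, no explicit projection onto the non-negative orthant is needed.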

In [45]:
non_neg_mat = decomposition.NMF(n_components=6, tol=5e-3)
non_neg_mat.fit(faces)

In [137]:
plot_faces(non_neg_mat.components_, 'Non Negative Components\n')

<img src='./plots/face_dataset_decomposition--NMF.png'>

### FastICA: a fast algorithm for Independent Component Analysis.
Independent component analysis separates a multivariate signal into additive subcomponents that are maximally independent.
* Typically, ICA is not used for reducing dimensionality but for separating superimposed signals.
* Since the ICA model does not include a noise term, whitening must be applied for the model to be correct.
* This can be done internally using the `whiten` argument or manually using one of the PCA variants.

It is classically used to separate mixed signals (a problem known as blind source separation).
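Blind source separation can be sketched with a small NumPy version of the FastICA fixed-point iteration. This is a simplified illustration (symmetric decorrelation, `tanh` nonlinearity), not scikit-learn's implementation; the two sources, the mixing matrix `A` and the helper `decorrelate` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Two independent, non-Gaussian sources (assumed for illustration).
S = np.column_stack([rng.uniform(-1, 1, n), rng.laplace(0, 1, n)])
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])         # mixing matrix
X = S @ A.T                        # observed mixtures

# Whitening: center, decorrelate and rescale to unit variance.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
Z = Xc @ eigvecs / np.sqrt(eigvals)

def decorrelate(W):
    """Symmetric orthogonalisation: W <- (W W^T)^(-1/2) W."""
    u, _, vt = np.linalg.svd(W)
    return u @ vt

W = decorrelate(rng.standard_normal((2, 2)))
for _ in range(200):
    Y = Z @ W.T
    g = np.tanh(Y)
    # Fixed-point update: E[z g(w^T z)] - E[g'(w^T z)] w for every row w.
    W_new = decorrelate(g.T @ Z / n - np.diag((1 - g**2).mean(axis=0)) @ W)
    done = np.max(np.abs(np.abs(np.sum(W_new * W, axis=1)) - 1)) < 1e-10
    W = W_new
    if done:
        break

S_est = Z @ W.T   # recovered sources, up to order, sign and scale

# Absolute correlation between every true and every estimated source.
corr = np.abs(np.corrcoef(S.T, S_est.T)[:2, 2:])
print(corr.round(2))
```

Each row of `corr` should contain one entry close to 1, which identifies the estimated component that matches that source.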

In [49]:
fast_ica = decomposition.FastICA(n_components=6, whiten='unit-variance', max_iter=500)
fast_ica.fit(faces_centered)

In [136]:
plot_faces(fast_ica.components_,'FAST ICA')

<img src='./plots/face_dataset_decomposition--fast-ica.png'>

### Sparse principal components analysis (SparsePCA and MiniBatchSparsePCA)

* SparsePCA is a variant of PCA, with the goal of extracting the set of sparse components that best reconstruct the data.

* Mini-batch sparse PCA (MiniBatchSparsePCA) is a variant of SparsePCA that is faster but less accurate. The increased speed is reached by iterating over small chunks of the set of features, for a given number of iterations.

In [56]:
sparse_pca = decomposition.SparsePCA(n_components=6, alpha=0.2)
sparse_pca.fit(faces_centered)

In [134]:
plot_faces(sparse_pca.components_, 'Sparse PCA')

<img src='./plots/face_dataset_decomposition--sparse-pca.png'>

### Sparse components - MiniBatchSparsePCA
Mini-batch sparse PCA (MiniBatchSparsePCA) extracts the set of sparse components that best reconstruct the data. This variant is faster but less accurate than the similar `sklearn.decomposition.SparsePCA`.


In [52]:
# alpha: sparsity-controlling parameter. Higher values lead to sparser components.
# max_iter (int, left at its default here): maximum number of iterations over the
#              complete dataset before stopping, independently of any early-stopping heuristic.
mini_sparse_pca = decomposition.MiniBatchSparsePCA(n_components=6, alpha=0.1, batch_size=32)
mini_sparse_pca.fit(faces_centered)

In [132]:
plot_faces(mini_sparse_pca.components_,'Sparse PCA Mini Batch')

<img src='./plots/face_dataset_decomposition--MiniBatchSparsePCA.png'>

### Dictionary learning
* By default, MiniBatchDictionaryLearning divides the data into mini-batches and optimizes in an online manner by cycling over the mini-batches for the specified number of iterations.
* Finds a dictionary (a set of atoms) that performs well at sparsely encoding the fitted data.
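What "sparsely encoding" means can be illustrated by fixing a dictionary and solving for the code of one sample. The sketch below uses ISTA (iterative soft-thresholding) on a random dictionary; it is not the algorithm `MiniBatchDictionaryLearning` runs, and `sparse_code`, `soft_threshold`, `D` and `c_true` are hypothetical names introduced for this example.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink every entry of v towards zero by t; small entries become exactly 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(x, D, alpha, n_iter=500):
    """Minimise 0.5 * ||x - c @ D||^2 + alpha * ||c||_1 over the code c with ISTA."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    c = np.zeros(D.shape[0])
    for _ in range(n_iter):
        residual = x - c @ D
        c = soft_threshold(c + residual @ D.T / L, alpha / L)
    return c

rng = np.random.default_rng(0)
D = rng.standard_normal((8, 20))                  # 8 atoms, 20 features
D /= np.linalg.norm(D, axis=1, keepdims=True)     # unit-norm atoms (rows)

c_true = np.zeros(8)
c_true[[1, 5]] = [2.0, -1.5]                      # only two atoms are active
x = c_true @ D

c_small = sparse_code(x, D, alpha=0.01)
c_large = sparse_code(x, D, alpha=0.5)
print(np.count_nonzero(c_small), np.count_nonzero(c_large))
print(np.abs(c_small).sum(), np.abs(c_large).sum())
```

A larger `alpha` shrinks the code more strongly, which trades reconstruction accuracy for a smaller L1 norm. Full dictionary learning alternates a coding step like this one with an update of the atoms in `D`.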

In [61]:
dict_learning = decomposition.MiniBatchDictionaryLearning(n_components=6, alpha=0.1, max_iter=100, batch_size=256)
dict_learning.fit(faces_centered)

In [128]:
plot_faces(dict_learning.components_, 'MiniBatchDictionaryLearning\n')

<img src='./plots/face_dataset_minibatchdict.png'>

### Cluster centers - MiniBatchKMeans
MiniBatchKMeans is computationally efficient and implements online learning with a `partial_fit` method. This makes it a useful replacement for the clustering step in otherwise time-consuming algorithms.
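The online flavour of the algorithm can be sketched as follows: every mini-batch nudges the centers towards its samples with a per-center step size that decays as the center sees more data. This is a simplified NumPy illustration on synthetic 2-D blobs, not scikit-learn's implementation; the data, the farthest-point initialisation and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
true_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((200, 2)) for c in true_centers])
k = 3

# Farthest-point initialisation: start from one sample, then repeatedly
# add the sample that is farthest from the centers chosen so far.
init = [X[0]]
for _ in range(k - 1):
    d = ((X[:, None, :] - np.array(init)[None, :, :]) ** 2).sum(-1).min(axis=1)
    init.append(X[np.argmax(d)])
centers = np.array(init, dtype=float)

counts = np.zeros(k)
for _ in range(100):
    batch = X[rng.choice(len(X), size=32, replace=False)]
    # Assign every batch sample to its nearest current center.
    dist = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    labels = dist.argmin(axis=1)
    # Move each center towards its samples with step size 1 / count.
    for x, j in zip(batch, labels):
        counts[j] += 1
        eta = 1.0 / counts[j]
        centers[j] = (1 - eta) * centers[j] + eta * x

print(centers.round(2))
```

With step size `1 / count`, each center is the running mean of all samples assigned to it so far, and only one small batch has to be held in memory at a time.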

In [66]:
mini_kmeans = cluster.MiniBatchKMeans(n_clusters=6, batch_size=512, n_init='auto')
mini_kmeans.fit(faces_centered)

In [127]:
plot_faces(mini_kmeans.cluster_centers_, 'Mini Batch Kmeans\n')

<img src='./plots/face_dataset_minibatchkmeans.png'>

### Factor Analysis components - FA
* Factor Analysis is similar to PCA but has the advantage of modelling the variance in every direction of the input space independently (heteroscedastic noise).
* Factor analysis can produce components (the columns of its loading matrix) similar to those of PCA. However, one cannot make any general statement about these components (e.g. whether they are orthogonal).
* Modelling the noise variance of each feature separately allows better model selection than probabilistic PCA in the presence of heteroscedastic noise.

In [71]:
factor_analysis = decomposition.FactorAnalysis(n_components=6)
factor_analysis.fit(faces_centered)

In [125]:
plot_faces(factor_analysis.components_, 'Factor Analysis\n')

<img src='./plots/face_dataset_decomposition--factor-analysis.png'>

#### Factor Analysis (FA).

A simple linear generative model with Gaussian latent variables.

The observations are assumed to be caused by a linear transformation of lower dimensional latent factors and added Gaussian noise. Without loss of generality the factors are distributed according to a Gaussian with zero mean and unit covariance. The noise is also zero mean and has an arbitrary diagonal covariance matrix.

* If we restricted the model further by assuming that the Gaussian noise is isotropic (all diagonal entries are the same), we would obtain probabilistic PCA.
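The generative model described above implies that the data covariance equals `W @ W.T + diag(psi)`, where `W` is the loading matrix and `psi` the per-feature noise variances. The sketch below samples from such a model and compares the empirical covariance with this expression; the values of `W` and `psi` are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

W = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0],
              [1.0, -1.0],
              [0.3, 0.7]])                    # loadings: 5 features, 2 factors
psi = np.array([0.1, 0.5, 1.0, 0.2, 2.0])     # per-feature noise variances

z = rng.standard_normal((n, 2))                       # latent factors ~ N(0, I)
noise = rng.standard_normal((n, 5)) * np.sqrt(psi)    # diagonal Gaussian noise
X = z @ W.T + noise                                   # observations

model_cov = W @ W.T + np.diag(psi)
sample_cov = np.cov(X, rowvar=False)
max_gap = np.abs(sample_cov - model_cov).max()
print(max_gap)
```

Setting every entry of `psi` to the same value gives the isotropic-noise special case mentioned above.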

In [124]:
# The estimated noise variance for each feature.
plt.title("Pixelwise variance from \n Factor Analysis (FA)", size=16, wrap=True)
plt.imshow(
    factor_analysis.noise_variance_.reshape(64,64), 
    cmap=plt.cm.gray, 
    interpolation='nearest',
    vmin=-factor_analysis.noise_variance_.max(),
    vmax=factor_analysis.noise_variance_.max()
);

<img src='./plots/face_dataset_decomposition--factor-analysis-pixelwise-variance.png'>

### Dictionary learning
* Dictionary learning is a problem that amounts to finding a sparse representation of the input data as a combination of simple elements. 
* These simple elements form a dictionary. 
* It is possible to constrain the dictionary and/or coding coefficients to be positive to match constraints that may be present in the data.

* MiniBatchDictionaryLearning implements a faster, but less accurate version of the dictionary learning algorithm that is better suited for large datasets. 

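* Plot the same samples from our dataset but with another colormap. In `RdBu_r`, blue marks the low end of the value range and red the high end. White corresponds to zero only when the color limits are symmetric around zero, which `plot_faces` does not enforce.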

In [123]:
plot_faces(faces, 'Faces Dataset', cmap=plt.cm.RdBu_r)

<img src='./plots/face_dataset_plot-cmap_rdbu_r.png'>

### Dictionary learning - positive dictionary
* Here we enforce positivity of the atoms when finding the dictionary.

In [112]:
dict_learning_positive_dict = decomposition.MiniBatchDictionaryLearning(
    n_components=6, alpha=0.1, max_iter=50, batch_size=256, positive_dict=True)
dict_learning_positive_dict.fit(faces_centered)    

In [122]:
plot_faces(
    dict_learning_positive_dict.components_, 
    'Dictionary Learning - Positive Dictionary\n', cmap=plt.cm.RdBu_r)

<img src='./plots/face_dataset_decomposition--positive-dict.png'>

### Dictionary learning - positive code

In [114]:
dict_learning_positive_code = decomposition.MiniBatchDictionaryLearning(
    n_components=6, alpha=0.1, max_iter=50, batch_size=256,
    positive_code=True, fit_algorithm='cd')  # constrain the code, not the dictionary
dict_learning_positive_code.fit(faces_centered)

In [121]:
plot_faces(dict_learning_positive_code.components_, 'Dictionary Learning - Positive Code\n', cmap=plt.cm.RdBu_r)

<img src='./plots/face_dataset_decomposition--positive-code.png'>

### Dictionary learning - positive dictionary & code

In [116]:
dict_learning_positive_dict_and_code = decomposition.MiniBatchDictionaryLearning(
    n_components=6, alpha=0.1, max_iter=50, batch_size=256,
    positive_dict=True, positive_code=True, fit_algorithm='cd')
dict_learning_positive_dict_and_code.fit(faces_centered)    

In [120]:
plot_faces(
    dict_learning_positive_dict_and_code.components_,
    'Dictionary Learning - Positive Dictionary and Positive Code\n', cmap=plt.cm.RdBu_r)

<img src='./plots/face_dataset_decomposition--positive-dict-&-code.png'>