## Robust PCA with Sparse Tensors

This demonstration shows that the robust PCA algorithm can work with sparse tensor input.

**<div style="color:red;">WARNING:</div>** Even though the input and output will be of a sparse tensor type, they will likely be fully filled (dense). This is because the robust PCA algorithm necessarily densifies its intermediate results through tensor subtractions.
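The densification can be illustrated with a small sketch. Plain NumPy arrays stand in for `sparse.COO` objects here, and the shape, the 5% fill rate, and the rank-1 construction are arbitrary choices made for illustration, not values taken from the algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (20, 15, 10)  # arbitrary toy shape

# A sparse tensor: roughly 5% of the entries are non-zero.
sparse_part = np.zeros(shape)
hits = rng.random(shape) < 0.05
sparse_part[hits] = rng.normal(size=int(hits.sum()))

# A rank-1 (hence low-rank) tensor built from an outer product.
# Every factor entry lies in [1, 2), so the tensor has no zeros at all.
a, b, c = (rng.uniform(1.0, 2.0, size=n) for n in shape)
low_rank = np.einsum('i,j,k->ijk', a, b, c)

# Subtracting the dense low-rank tensor fills in every position.
residual = sparse_part - low_rank


def density(t):
    """Fraction of entries that are non-zero."""
    return np.count_nonzero(t) / t.size


print(f"sparse_part density: {density(sparse_part):.3f}")
print(f"residual density:    {density(residual):.3f}")
```

Any step that subtracts a low-rank estimate from the data behaves like `residual` above, so the result is dense no matter how sparse the other operand is.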

### YaleB Dataset

Here we will obtain the [YaleB dataset](http://vision.ucsd.edu/~leekc/ExtYaleDatabase/Yale%20Face%20Database.htm), which is a collection of human face images. The following cell downloads the data and comes from [this notebook](http://jeankossaifi.com/blog/rpca.html).

In [1]:
from pathlib import Path
from urllib.request import urlretrieve
import zipfile
from imageio import imread
from scipy.ndimage import zoom  # the scipy.ndimage.interpolation namespace is deprecated
import numpy as np


def fetch_cropped_yaleb(data_folder, zooming=0.5, max_n_subjects=None):
    """Returns a dictionary of images, keyed by subject folder name
    
    Parameters
    ----------
    data_folder: string
    zooming: float, optional, default is 0.5
        factor by which to resize the images
    max_n_subjects: {None, int}, optional, default is None
        if not None, only the first max_n_subjects are returned
    
    Returns
    -------
    dict: {
        subjects_1: {'images': [image_1, ... image_N],
               'ambient': image_ambient,
        }
    }
    
    images are stored as numpy arrays
    """
    url = 'http://vision.ucsd.edu/extyaleb/CroppedYaleBZip/CroppedYale.zip'
    yaleb_path = Path(data_folder).joinpath('cropped_yaleb')
    
    # Create the target folder if needed (no error if it already exists)
    yaleb_path.mkdir(parents=True, exist_ok=True)
    
    # If not already unzipped, do it
    if not yaleb_path.joinpath('CroppedYale').exists():
        zip_path = yaleb_path.joinpath('yaleb.zip')
        
        # If zip not already downloaded, download it
        if not zip_path.exists():
            urlretrieve(url, zip_path.as_posix())
        
        # Extract the archive, closing the file handle afterwards
        with zipfile.ZipFile(zip_path.as_posix()) as zfile:
            zfile.extractall(path=yaleb_path.as_posix())

    yaleb = {}
    for folder_path in yaleb_path.joinpath('CroppedYale').iterdir():
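        # Note: because the comparison is a strict `>`, up to
        # max_n_subjects + 1 subjects are loaded before returning
        if max_n_subjects is not None and len(yaleb) > max_n_subjects:
            return yaleb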
        
        if not folder_path.is_dir():
            continue
            
        video_name = folder_path.name
        paths = sorted(list(folder_path.glob('*.pgm')))
        images = []
        for path in paths:
            if 'Ambient' in path.name:
                ambient = imread(path.as_posix())
            else:
                images.append(zoom(imread(path.as_posix()), zooming)[None, ...])
                
        data = {'images': np.concatenate(images),
                'ambient': ambient}
        yaleb[video_name] = data

    return yaleb


In [2]:
data = fetch_cropped_yaleb('yaleb/', zooming=0.3, max_n_subjects=5)



In [3]:
X = np.concatenate([data[key]['images'] for key in data], axis=0)
print(X.shape)

(384, 58, 50)


In [4]:
X = X.astype(np.float64)
X -= X.mean()

In [5]:
from tensorly.random import check_random_state
import sparse

random_state = 1234

## Sparse Implementation

At this point, `X` is a densely stored `np.ndarray`. Let's convert it to a `sparse.COO` array and run it through the sparse version of `robust_pca()`.

In [6]:
from tensorly.contrib.sparse.decomposition import robust_pca

In [7]:
Y = sparse.COO(X)
Y

<COO: shape=(384, 58, 50), dtype=float64, nnz=1113600, fill_value=0.0>

In [8]:
%%time
low_rank_part, sparse_part = robust_pca(Y, reg_E=0.04, learning_rate=1.2, n_iter_max=20)

CPU times: user 15min 45s, sys: 4min 25s, total: 20min 10s
Wall time: 20min 3s
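Robust PCA solvers of this kind typically obtain the sparse part by soft-thresholding a residual, and the sketch below assumes that is what `reg_E` controls in the call above. The helper `soft_threshold` and the toy `residual` values are hypothetical, and the effective threshold inside TensorLy may also depend on `learning_rate`, so treat the `0.04` used here as illustrative only.

```python
import numpy as np


def soft_threshold(x, threshold):
    """Shrink every entry toward zero by `threshold`.

    Entries whose magnitude is at most `threshold` become exactly zero,
    which is what makes the result sparse.
    """
    return np.sign(x) * np.maximum(np.abs(x) - threshold, 0.0)


# Hypothetical residual values, chosen to straddle the threshold.
residual = np.array([-0.50, -0.03, 0.00, 0.02, 0.10, 0.90])
shrunk = soft_threshold(residual, threshold=0.04)

print(shrunk)
print(np.count_nonzero(shrunk), "of", shrunk.size, "entries are non-zero")
```

Under this assumption, a larger `reg_E` zeroes out more entries and yields a sparser result, while a smaller one lets more of the residual through.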


In [9]:
print('X.shape={} == low_rank_part.shape={} == sparse_part.shape={}.'.format(Y.shape, low_rank_part.shape, sparse_part.shape))

X.shape=(384, 58, 50) == low_rank_part.shape=(384, 58, 50) == sparse_part.shape=(384, 58, 50).


Note that the low-rank part is completely filled: all 1,113,600 entries (384 × 58 × 50) are stored explicitly.

In [10]:
low_rank_part

<COO: shape=(384, 58, 50), dtype=float64, nnz=1113600, fill_value=0.0>

The sparse part, however, does get the benefit of actually being sparse: only about a quarter of its entries are non-zero.

In [11]:
sparse_part

<COO: shape=(384, 58, 50), dtype=float64, nnz=278385, fill_value=0.0>
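To put the two `nnz` values in perspective, here is a back-of-the-envelope storage estimate. It assumes that COO format stores one 8-byte value plus one 8-byte integer coordinate per dimension for every stored entry. The `sparse` library may use smaller index types or carry extra overhead, so the numbers are rough, and the helpers `dense_bytes` and `coo_bytes` are hypothetical.

```python
shape = (384, 58, 50)
n_entries = shape[0] * shape[1] * shape[2]

# nnz values reported in the outputs above
nnz_low_rank = 1113600
nnz_sparse = 278385


def dense_bytes(n, itemsize=8):
    """Bytes needed for a dense float64 array with n entries."""
    return n * itemsize


def coo_bytes(nnz, ndim, itemsize=8, index_size=8):
    """Rough COO cost: one value plus `ndim` coordinates per stored entry.

    Assumes 64-bit coordinates; real implementations may differ.
    """
    return nnz * (itemsize + ndim * index_size)


dense = dense_bytes(n_entries)
low_rank_coo = coo_bytes(nnz_low_rank, ndim=len(shape))
sparse_coo = coo_bytes(nnz_sparse, ndim=len(shape))

print(f"dense array:       {dense:>12,d} bytes")
print(f"low-rank as COO:   {low_rank_coo:>12,d} bytes ({low_rank_coo / dense:.2f}x dense)")
print(f"sparse part (COO): {sparse_coo:>12,d} bytes ({sparse_coo / dense:.2f}x dense)")
```

Under these assumptions, a fully filled 3D COO array costs about four times as much as its dense counterpart, and the sparse part at roughly 25% density only breaks even.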

## Missing Values

The robust PCA implementation in TensorLy supports missing data, and so does the sparse version of the algorithm.

First, let's create a mask representing the missing data. Note that the mask is False for missing values and True everywhere else.

In [12]:
mask = np.ones(X.shape, dtype=bool)
mask[:100, :20, :10] = False
mask = sparse.COO(mask, fill_value=True)

In [13]:
mask

<COO: shape=(384, 58, 50), dtype=bool, nnz=20000, fill_value=True>
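The `nnz=20000` reported above follows from the choice of `fill_value=True`: only entries that differ from the fill value need to be stored, and those are the `False` (missing) positions. A minimal NumPy-only check of that count, rebuilding the same dense mask without the `sparse` library:

```python
import numpy as np

shape = (384, 58, 50)

# Same dense mask as above: True means observed, False means missing.
mask = np.ones(shape, dtype=bool)
mask[:100, :20, :10] = False

# With a fill value of True, only the False entries must be stored.
n_missing = np.count_nonzero(~mask)
fraction_missing = n_missing / mask.size

print(n_missing)                   # 100 * 20 * 10 entries
print(f"{fraction_missing:.1%}")   # share of the tensor that is missing
```

Storing the mask this way keeps it compact, since fewer than 2% of the entries are marked missing.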

In [14]:
%%time
low_rank_part, sparse_part = robust_pca(Y, mask=mask, reg_E=0.04, learning_rate=1.2, n_iter_max=20)

CPU times: user 17min 47s, sys: 5min 5s, total: 22min 53s
Wall time: 22min 53s


In [15]:
low_rank_part

<COO: shape=(384, 58, 50), dtype=float64, nnz=1113600, fill_value=0.0>

In [16]:
sparse_part

<COO: shape=(384, 58, 50), dtype=float64, nnz=297346, fill_value=0.0>