# Sparse PARAFAC with missing values

This notebook is based on [sparse_demo.ipynb](sparse_demo.ipynb#parafac). 

As before, we start with a random sparse tensor, constructed so that it has a tensor factorization of rank 5.

Because masked PARAFAC can take longer to converge than non-masked PARAFAC, we will use a smaller tensor than in the other notebook.

In [1]:
shape = (1000, 1001, 1002)
rank = 5

import sparse
starting_factors = [sparse.random((i, rank)) for i in shape]
starting_factors

[<COO: shape=(1000, 5), dtype=float64, nnz=50, fill_value=0.0>,
 <COO: shape=(1001, 5), dtype=float64, nnz=50, fill_value=0.0>,
 <COO: shape=(1002, 5), dtype=float64, nnz=50, fill_value=0.0>]

In [2]:
from tensorly.contrib.sparse.kruskal_tensor import kruskal_to_tensor
tensor = kruskal_to_tensor(starting_factors)
tensor

<COO: shape=(1000, 1001, 1002), dtype=float64, nnz=5125, fill_value=0.0>
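To make the construction concrete, here is a small dense NumPy sketch of what it means to build a tensor from rank-`r` factors: the tensor is a sum of `r` outer products, one per factor column. The helper name `factors_to_dense` and the tiny shapes are illustrative assumptions; the notebook itself uses `kruskal_to_tensor` from the sparse backend.

```python
import numpy as np

def factors_to_dense(factors):
    """Sum of outer products of matching columns of three factor matrices."""
    a, b, c = factors
    rank = a.shape[1]
    out = np.zeros((a.shape[0], b.shape[0], c.shape[0]))
    for r in range(rank):
        # The outer product of the r-th columns gives one rank-1 component
        out += np.einsum('i,j,k->ijk', a[:, r], b[:, r], c[:, r])
    return out

rng = np.random.default_rng(0)
small_shape, small_rank = (4, 5, 6), 2
small_factors = [rng.random((n, small_rank)) for n in small_shape]
small_tensor = factors_to_dense(small_factors)
print(small_tensor.shape)
```

By construction, such a tensor has rank at most `small_rank`, which is why PARAFAC with that rank can in principle recover it exactly.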

In [3]:
tensor.nbytes / 1e9                # Actual memory usage in GB

0.000164

In [4]:
import numpy as np
np.prod(tensor.shape) * 8 / 1e9    # Memory usage if the array were dense, in GB

8.024016
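The two figures above can be reproduced by hand. The sketch below assumes that the COO format stores one 8-byte `float64` value plus three 8-byte integer coordinates per nonzero entry; that layout is an assumption, though it is consistent with the `nbytes` value reported above.

```python
# Back-of-the-envelope memory estimate (the storage layout is an assumption)
nnz = 5125                       # nonzero count reported for `tensor` above
ndim = 3
bytes_per_value = 8              # float64
bytes_per_coord = 8              # assumed 64-bit integer coordinates

coo_bytes = nnz * (bytes_per_value + ndim * bytes_per_coord)
dense_bytes = 1000 * 1001 * 1002 * bytes_per_value

print(coo_bytes / 1e9)           # sparse estimate in GB
print(dense_bytes / 1e9)         # dense size in GB
print(dense_bytes / coo_bytes)   # how many times larger the dense array is
```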

Now let's construct a random mask. The mask is a boolean array with the same shape as the tensor: `False` (`0`) marks a missing value and `True` (`1`) marks a value that is not missing.

The mask array must have a fill value of `True`, so that the zero entries of the original `tensor` count as non-missing. This matters because, internally, the PARAFAC algorithm generates dense arrays with as many elements as there are `False` entries in the mask, so a mask that is mostly `False` would require a dense array nearly as large as the full tensor.

In [5]:
import sparse
# Fraction of the tensor's nonzero entries to treat as missing. A larger value makes it
# harder for PARAFAC to reconstruct the factors (it may need more iterations to converge).
missing_p = 0.3

random_data = np.random.choice([False, True], size=tensor.nnz, p=[missing_p, 1 - missing_p])
mask = sparse.COO(coords=tensor.coords, data=random_data, shape=tensor.shape, fill_value=True)
# An identity elemwise op drops the stored True entries (equal to fill_value) from mask.data
mask = sparse.elemwise(lambda x: x, mask)
mask

<COO: shape=(1000, 1001, 1002), dtype=bool, nnz=1526, fill_value=True>
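The same mask convention can be illustrated on a tiny dense array, where every entry is easy to inspect. This is a hedged NumPy analogue of the cell above, not the sparse code path; the names `toy` and `toy_mask` and the shapes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
missing_fraction = 0.3

# A small tensor that is mostly zeros, like the sparse one above
toy = np.zeros((3, 4, 5))
nonzero_idx = rng.choice(toy.size, size=10, replace=False)
toy.flat[nonzero_idx] = rng.random(10)

# Start with everything observed (the analogue of fill_value=True) ...
toy_mask = np.ones(toy.shape, dtype=bool)
# ... then mark a random subset of the nonzero entries as missing
is_missing = rng.random(10) < missing_fraction
toy_mask.flat[nonzero_idx[is_missing]] = False

print(toy_mask.sum(), "observed entries out of", toy_mask.size)
```

Every zero entry of `toy` stays `True` in `toy_mask`, so only nonzero entries can ever be marked missing.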

Now we factor the tensor. To show that there are no tricks up our sleeve, we multiply the tensor by the mask to zero out the "missing" values before factoring. The mask is passed as the `mask` keyword argument to `parafac()`.

Note that, at this time, you must use the `parafac` function from the sparse backend when using a sparse mask, to avoid memory blowups.
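For intuition about what a masked factorization does, here is a minimal dense sketch of one common approach: before each sweep of alternating least squares, the missing entries are filled in with the current model's prediction, so they never pull the fit toward zero. This is an illustrative toy under that assumption, not the code in `tensorly.contrib.sparse`; all function names below are hypothetical.

```python
import numpy as np

def khatri_rao(a, b):
    """Column-wise Kronecker product of (I, R) and (J, R) into (I*J, R)."""
    return (a[:, None, :] * b[None, :, :]).reshape(-1, a.shape[1])

def unfold(x, mode):
    """Matricize x so that `mode` indexes the rows."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)

def reconstruct(factors):
    """Dense 3-way tensor from its factor matrices."""
    return np.einsum('ir,jr,kr->ijk', *factors)

def masked_cp_als(x, mask, rank, n_iter=50, seed=0):
    """Toy masked CP-ALS for dense 3-way arrays (imputation variant)."""
    rng = np.random.default_rng(seed)
    factors = [rng.random((n, rank)) for n in x.shape]
    errors = []
    for _ in range(n_iter):
        # Replace missing entries with the current model's prediction
        filled = np.where(mask, x, reconstruct(factors))
        for mode in range(3):
            others = [factors[m] for m in range(3) if m != mode]
            kr = khatri_rao(others[0], others[1])
            # Least-squares update of one factor with the others held fixed
            factors[mode] = unfold(filled, mode) @ np.linalg.pinv(kr).T
        # Track the error on observed entries only
        errors.append(np.linalg.norm((x - reconstruct(factors)) * mask))
    return factors, errors

# Tiny demo: a rank-2 tensor with roughly 20% of its entries hidden
rng = np.random.default_rng(1)
true_factors = [rng.random((n, 2)) for n in (6, 7, 8)]
full = reconstruct(true_factors)
observed = rng.random(full.shape) > 0.2
est_factors, errors = masked_cp_als(full * observed, observed, rank=2)
print(errors[0], errors[-1])
```

The dense `filled` array is only workable here because the toy tensor is tiny; for tensors of the size used in this notebook, that is exactly the memory blowup the sparse backend is meant to avoid.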

In [6]:
import time
%load_ext memory_profiler
from tensorly.contrib.sparse.decomposition import parafac

The memory_profiler extension is already loaded. To reload it, use:
  %reload_ext memory_profiler


In [7]:
%%memit
start_time = time.time()
factors = parafac(tensor*mask, rank=rank, init='random', verbose=True, mask=mask)
end_time = time.time()
total_time = end_time - start_time
print('Took %d mins %d secs' % (divmod(total_time, 60)))

Starting iteration 0
Mode 0 of 3
 Rank 0 of 5
 Rank 1 of 5
 Rank 2 of 5
 Rank 3 of 5
 Rank 4 of 5
Mode 1 of 3
 Rank 0 of 5
 Rank 1 of 5
 Rank 2 of 5
 Rank 3 of 5
 Rank 4 of 5
Mode 2 of 3
 Rank 0 of 5
 Rank 1 of 5
 Rank 2 of 5
 Rank 3 of 5
 Rank 4 of 5
reconstruction error=0.897274887796118
Starting iteration 1
Mode 0 of 3
 Rank 0 of 5
 Rank 1 of 5
 Rank 2 of 5
 Rank 3 of 5
 Rank 4 of 5
Mode 1 of 3
 Rank 0 of 5
 Rank 1 of 5
 Rank 2 of 5
 Rank 3 of 5
 Rank 4 of 5
Mode 2 of 3
 Rank 0 of 5
 Rank 1 of 5
 Rank 2 of 5
 Rank 3 of 5
 Rank 4 of 5
reconstruction error=0.5784042008587181, variation=0.31887068693739984.
Starting iteration 2
Mode 0 of 3
 Rank 0 of 5
 Rank 1 of 5
 Rank 2 of 5
 Rank 3 of 5
 Rank 4 of 5
Mode 1 of 3
 Rank 0 of 5
 Rank 1 of 5
 Rank 2 of 5
 Rank 3 of 5
 Rank 4 of 5
Mode 2 of 3
 Rank 0 of 5
 Rank 1 of 5
 Rank 2 of 5
 Rank 3 of 5
 Rank 4 of 5
reconstruction error=0.2585023950982153, variation=0.3199018057605028.
Starting iteration 3
Mode 0 of 3
 Rank 0 of 5
 Rank 1 of 5
 Ra

Let's look at one of the values that was masked out.

In [8]:
mask.coords.T[0]

array([  8,  58, 741])

In [9]:
mask[tuple(mask.coords.T[0])]

False

In [10]:
orig_val = tensor[tuple(mask.coords.T[0])]
orig_val

0.08614559752892052

See [sparse_demo.ipynb](sparse_demo.ipynb) for how to calculate individual values from the factors.

In [11]:
computed_val = np.sum(np.prod(sparse.stack([factors[i][idx] for i, idx in enumerate(tuple(mask.coords.T[0]))], 0), 0))
computed_val

0.08614559755825321

In [12]:
np.abs(orig_val - computed_val)

2.933268905547237e-11
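The expression in cell 11 evaluates a single entry of the factorized tensor without expanding the whole thing: it takes one row from each factor matrix, multiplies the rows elementwise, and sums over the rank dimension. A dense NumPy sketch of the same idea follows; the helper `entry_from_factors` is a hypothetical name, not part of `tensorly` or `sparse`.

```python
import numpy as np

def entry_from_factors(factors, index):
    """Value at `index` of the tensor represented by `factors`."""
    # One row per mode, each of length `rank`
    rows = [f[i] for f, i in zip(factors, index)]
    # Elementwise product across modes, then sum over the rank dimension
    return np.sum(np.prod(np.stack(rows, axis=0), axis=0))

rng = np.random.default_rng(0)
demo_factors = [rng.random((n, 3)) for n in (4, 5, 6)]
demo_full = np.einsum('ir,jr,kr->ijk', *demo_factors)

value = entry_from_factors(demo_factors, (1, 2, 3))
print(np.isclose(value, demo_full[1, 2, 3]))
```

The cost of one lookup is proportional to the rank times the number of modes, independent of the tensor's size.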

This is a constructed example, in which we know that the unmasked tensor has an exact factorization. Still, it demonstrates that, given a tensor with missing values that we have reason to believe is well represented by a rank $r$ decomposition, we should expect the decomposition to do a decent job of reconstructing those missing values. This may not hold if the missing values are not randomly distributed across the tensor, as they are here.

Let's compare this to a value that was not masked.

In [13]:
# Find the first nonzero entry of the tensor that is not masked out
for i in tensor.coords.T:
    non_missing_coord = tuple(i)
    if mask[non_missing_coord]:
        break

mask[non_missing_coord]

True

In [14]:
orig_val = tensor[non_missing_coord]
orig_val

0.08860903077760515

In [15]:
computed_val = np.sum(np.prod(sparse.stack([factors[i][idx] for i, idx in enumerate(non_missing_coord)], 0), 0))
computed_val

0.0886090308131763

In [16]:
np.abs(orig_val - computed_val)

3.557114325314359e-11

As before, we should not in general try to recompose a sparse factorization unless we can represent it densely, but since this was constructed explicitly from sparse factors, we are able to do it (being careful to use the `kruskal_to_tensor` from the sparse backend).

In [17]:
expanded = kruskal_to_tensor(factors)
expanded

<COO: shape=(1000, 1001, 1002), dtype=float64, nnz=108288, fill_value=0.0>

Now let's look at the absolute error, first over the non-missing values only and then over all values, including the missing ones.

In [18]:
from tensorly.contrib.sparse import norm
norm((tensor - expanded)*mask) # Absolute error of the non-missing values

5.5240250002835305

In [19]:
norm(tensor - expanded) # Absolute error including missing values

6.554612462548368
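Absolute errors are easier to interpret next to the norm of the data they are measured against. The sketch below shows one way to turn the two quantities above into relative errors, using small dense NumPy arrays; `relative_errors` is a hypothetical helper and the arrays are synthetic, so the printed numbers say nothing about the run above.

```python
import numpy as np

def relative_errors(tensor, estimate, mask):
    """Relative Frobenius error on observed entries and on all entries."""
    observed = np.linalg.norm((tensor - estimate) * mask) / np.linalg.norm(tensor * mask)
    overall = np.linalg.norm(tensor - estimate) / np.linalg.norm(tensor)
    return observed, overall

rng = np.random.default_rng(0)
truth = rng.random((3, 4, 5))
noisy = truth + 0.01 * rng.standard_normal(truth.shape)
obs_mask = rng.random(truth.shape) > 0.3

obs_err, all_err = relative_errors(truth, noisy, obs_mask)
print(obs_err, all_err)
```

In the notebook's setting, the analogous quantities would divide the two norms above by `norm(tensor*mask)` and `norm(tensor)` respectively.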