# Sparse PARAFAC with missing values

This notebook is based on [sparse_demo.ipynb](sparse_demo.ipynb#parafac). 

As before, we start with a random sparse tensor, constructed so that it has a tensor factorization of rank 5.

Because masked PARAFAC can take longer to converge than non-masked PARAFAC, we will use a smaller tensor than in the other notebook.

In [1]:
import sparse

shape = (1000, 1001, 1002)
rank = 5

# One random sparse factor matrix per mode, each of shape (dimension, rank)
starting_factors = [sparse.random((i, rank)) for i in shape]
starting_weights = sparse.ones(rank)

In [3]:
from tensorly.contrib.sparse.kruskal_tensor import kruskal_to_tensor
tensor = kruskal_to_tensor((starting_weights, starting_factors))
tensor

0,1
Format,coo
Data Type,float64
Shape,"(1000, 1001, 1002)"
nnz,5094
Density,5.0787535817475934e-06
Read-only,True
Size,159.2K
Storage ratio,0.0


In [4]:
tensor.nbytes / 1e9                # Actual memory usage in GB

0.000163008

In [5]:
import numpy as np
np.prod(tensor.shape) * 8 / 1e9    # Memory usage if array was dense, in GB

8.024016

Now let's construct a random mask. A mask is a boolean array with the same shape as the tensor: `False` (`0`) marks missing values and `True` (`1`) marks values that are present.

The mask array must have a fill value of `True`, so that the zero entries of the original `tensor` count as non-missing. This matters because the parafac algorithm internally generates dense arrays with as many elements as there are `False` entries in the mask.
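As a small dense illustration of this convention, here is a sketch that uses plain NumPy arrays as a stand-in for the sparse objects. The names `toy_tensor` and `toy_mask` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: a small dense array standing in for the sparse tensor.
toy_tensor = rng.random((2, 3, 4))

# Mask convention: True = value is present, False = value is missing.
toy_mask = np.ones(toy_tensor.shape, dtype=bool)
toy_mask[0, 1, 2] = False
toy_mask[1, 0, 3] = False

# Multiplying by the mask zeroes the missing entries and leaves the rest untouched.
masked = toy_tensor * toy_mask

n_missing = np.count_nonzero(~toy_mask)
print(n_missing)  # 2
```

In the sparse setting, a fill value of `True` means only the `False` positions need to be stored, so the mask stays small when few values are missing.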

In [6]:
import sparse

# Fraction of the nonzero entries of `tensor` to treat as missing. The larger it is, the harder
# it is for PARAFAC to reconstruct the factors (meaning it may take more iterations to converge).
missing_p = 0.3

mask = sparse.COO(coords=tensor.coords,
                  data=np.random.choice([False, True], size=tensor.nnz, p=[missing_p, 1 - missing_p]),
                  shape=tensor.shape, fill_value=True)
# The identity elemwise op drops stored True values (equal to the fill value) from mask.data
mask = sparse.elemwise(lambda x: x, mask)
mask

0,1
Format,coo
Data Type,bool
Shape,"(1000, 1001, 1002)"
nnz,1503
Density,1.4985014985014984e-06
Read-only,True
Size,36.7K
Storage ratio,0.0


Now we factor the tensor. To demonstrate that there are no tricks up our sleeve, we multiply the tensor by the mask, which zeroes out the "missing" values before the decomposition sees them. The mask is passed to `parafac()` as the `mask` keyword argument.

Note that at this time, you have to use the `parafac` function from the sparse backend when using a sparse mask to avoid memory blowups.
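For intuition about what a mask does during the decomposition, here is a minimal dense NumPy sketch of one common approach: before each alternating least squares sweep, the missing entries are filled in from the current model. This illustrates that assumption only and is not the code of `tensorly.contrib.sparse`. `masked_cp_als` and its helpers are hypothetical names, the tensor is assumed to be 3-way, and the weights are fixed at 1.

```python
import numpy as np

def khatri_rao(a, b):
    """Column-wise Kronecker product of an (n, r) and an (m, r) matrix."""
    r = a.shape[1]
    return (a[:, None, :] * b[None, :, :]).reshape(-1, r)

def unfold(t, mode):
    """Mode-`mode` unfolding: rows index `mode`, remaining axes keep C order."""
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

def cp_to_tensor(factors):
    """Rebuild a 3-way tensor from its factor matrices (weights taken as 1)."""
    a, b, c = factors
    return np.einsum('ir,jr,kr->ijk', a, b, c)

def masked_cp_als(tensor, mask, rank, n_iter=50, seed=0):
    """Hypothetical sketch of masked CP-ALS on a dense 3-way array."""
    rng = np.random.default_rng(seed)
    factors = [rng.random((n, rank)) for n in tensor.shape]
    errors = []
    for _ in range(n_iter):
        # Impute: keep present entries, fill missing ones from the current model.
        filled = np.where(mask, tensor, cp_to_tensor(factors))
        for mode in range(3):
            others = [factors[m] for m in range(3) if m != mode]
            kr = khatri_rao(others[0], others[1])
            # Least-squares update of one factor with the other two held fixed.
            factors[mode] = unfold(filled, mode) @ np.linalg.pinv(kr).T
        # Error measured on the present entries only.
        errors.append(np.linalg.norm((tensor - cp_to_tensor(factors)) * mask))
    return factors, errors

# Toy usage: a rank-2 tensor with roughly 30% of its entries hidden.
rng = np.random.default_rng(1)
true_factors = [rng.random((n, 2)) for n in (6, 7, 8)]
full = cp_to_tensor(true_factors)
observed = rng.random(full.shape) > 0.3
factors, errors = masked_cp_als(full * observed, observed, rank=2)
print(errors[0], errors[-1])
```

In this scheme the error on the present entries cannot increase from one sweep to the next, because each factor update is an exact least-squares solve on the filled-in tensor.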

In [7]:
import time
%load_ext memory_profiler
from tensorly.contrib.sparse.decomposition import parafac

In [15]:
%%memit
start_time = time.time()
sparse_kruskal = parafac(tensor*mask, rank=rank, init='random', verbose=True, mask=mask)
end_time = time.time()
total_time = end_time - start_time
print('Took %d mins %d secs' % (divmod(total_time, 60)))

reconstruction error=0.8175833368555955
iteration 1, reconstruction error: 0.4582569541516581, decrease = 0.35932638270393735, unnormalized = 6.059446087313438
iteration 2, reconstruction error: 0.37173227790593805, decrease = 0.08652467624572008, unnormalized = 5.0898443578716845
iteration 3, reconstruction error: 0.3433681340822257, decrease = 0.02836414382371233, unnormalized = 4.777701968504902
iteration 4, reconstruction error: 0.33827195176982416, decrease = 0.005096182312401554, unnormalized = 4.738296921651478
iteration 5, reconstruction error: 0.33671694244574846, decrease = 0.0015550093240757068, unnormalized = 4.730505969337963
iteration 6, reconstruction error: 0.3360651876425429, decrease = 0.000651754803205562, unnormalized = 4.728142499505644
iteration 7, reconstruction error: 0.33575261109873816, decrease = 0.0003125765438047323, unnormalized = 4.7273045932585855
iteration 8, reconstruction error: 0.33558942206153664, decrease = 0.00016318903720152766, unnormalized = 4.

Let's look at one of the values that was masked out.

In [16]:
mask.coords.T[0]

array([ 10,  59, 292])

In [17]:
mask[tuple(mask.coords.T[0])]

False

In [18]:
orig_val = tensor[tuple(mask.coords.T[0])]
orig_val

0.09704512781634266

See [sparse_demo.ipynb](sparse_demo.ipynb) for how to calculate individual values from the factors.
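In short, a single entry is the sum over components of the weight times the product of the matching row entries of each factor. The following is a minimal dense NumPy sketch of that formula. The toy factors and the helper `kruskal_entry` are made up for illustration and are not part of TensorLy.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_rank = 3
toy_weights = np.ones(toy_rank)
toy_factors = [rng.random((n, toy_rank)) for n in (4, 5, 6)]

def kruskal_entry(weights, factors, index):
    """Value at `index`: sum over components of weight * product of factor-row entries."""
    rows = np.stack([f[i] for f, i in zip(factors, index)])  # shape (n_modes, rank)
    return np.sum(weights * np.prod(rows, axis=0))

# Cross-check against the full dense reconstruction, which is only feasible for small tensors.
dense = np.einsum('r,ir,jr,kr->ijk', toy_weights, *toy_factors)
value = kruskal_entry(toy_weights, toy_factors, (1, 2, 3))
print(np.isclose(value, dense[1, 2, 3]))
```

When all weights equal 1, the weight term drops out and only the product of factor rows remains.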

In [19]:
weights, factors = sparse_kruskal
computed_val = np.sum(np.prod(sparse.stack([factors[i][idx] for i, idx in enumerate(tuple(mask.coords.T[0]))], 0), 0))
computed_val

0.09704496725694943

In [20]:
np.abs(orig_val - computed_val)

1.6055939322523471e-07

Obviously this is a constructed example, where we know the unmasked tensor has an exact factorization. But it demonstrates that, given a tensor with missing values that we have reason to believe is well represented by a rank $r$ tensor decomposition, we should expect the decomposition to do a decent job of reconstructing those missing values. This may not hold if the missing values are not randomly distributed across the tensor, as they are here.

Let's compare this to a value that was not masked.

In [21]:
for i in tensor.coords.T:
    non_missing_coord = tuple(i)
    if mask[non_missing_coord]:
        break
        
mask[non_missing_coord]

True

In [22]:
orig_val = tensor[non_missing_coord]
orig_val

0.0704013608737617

In [23]:
computed_val = np.sum(np.prod(sparse.stack([factors[i][idx] for i, idx in enumerate(non_missing_coord)], 0), 0))
computed_val

0.07040150152751701

In [24]:
np.abs(orig_val - computed_val)

1.4065375530947222e-07

As before, we should not in general try to recompose a sparse factorization unless we can represent it densely, but since this was constructed explicitly from sparse factors, we are able to do it (being careful to use the `kruskal_to_tensor` from the sparse backend).

In [26]:
expanded = kruskal_to_tensor((weights, factors))
expanded

0,1
Format,coo
Data Type,float64
Shape,"(1000, 1001, 1002)"
nnz,111648
Density,0.00011131383586473407
Read-only,True
Size,3.4M
Storage ratio,0.0


Now let's look at the absolute error, both including and not including missing values.

In [27]:
from tensorly.contrib.sparse import norm
norm((tensor - expanded)*mask) # Absolute error of the non-missing values

4.7267674362759795

In [28]:
norm(tensor - expanded) # Absolute error including missing values

5.606543216758244
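The two norms above differ only in which entries are counted. A small dense NumPy sketch shows the relationship, along with a relative error obtained by dividing by the norm of the reference data. It uses toy arrays with made-up names, not the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(0)
reference = rng.random((3, 4, 5))
estimate = reference + 0.01 * rng.standard_normal(reference.shape)
present = rng.random(reference.shape) > 0.3    # True where a value is present

diff = reference - estimate
err_present = np.linalg.norm(diff * present)   # non-missing entries only
err_all = np.linalg.norm(diff)                 # every entry, missing included
rel_present = err_present / np.linalg.norm(reference * present)
print(err_present <= err_all)  # True
```

The masked error can never exceed the full error, since masking only removes terms from the sum of squares.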