# Sparse PARAFAC with missing values

This notebook is based on [sparse_demo.ipynb](sparse_demo.ipynb#parafac). 

As before, we start with a random sparse tensor, constructed so that it has a tensor factorization of rank 5.
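
For reference, a rank-$R$ CP (PARAFAC) factorization of a three-way tensor can be written entrywise as below. The symbols $\mathcal{X}$, $\lambda$, $A$, $B$ and $C$ are notation assumed here for illustration and do not appear in the notebook; in the cells that follow, $R$ corresponds to `rank`, $\lambda$ to `starting_weights` (all ones), and $A$, $B$, $C$ to the three matrices in `starting_factors`.

$$
\mathcal{X}_{ijk} = \sum_{r=1}^{R} \lambda_r \, A_{ir} \, B_{jr} \, C_{kr}
$$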

Because masked PARAFAC can take longer to converge than non-masked PARAFAC, we will use a smaller tensor than in the other notebook.

In [1]:
shape = (1000, 1001, 1002)
rank = 5

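import sparse
# One random sparse factor matrix per mode, each of shape (dimension, rank)
starting_factors = [sparse.random((i, rank)) for i in shape]
# Unit weights for the rank-one components
starting_weights = sparse.ones(rank)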

In [2]:
from tensorly.contrib.sparse.cp_tensor import cp_to_tensor
tensor = cp_to_tensor((starting_weights, starting_factors))
tensor

| Property | Value |
|---|---|
| Format | coo |
| Data Type | float64 |
| Shape | (1000, 1001, 1002) |
| nnz | 4675 |
| Density | 4.661007655019631e-06 |
| Read-only | True |
| Size | 146.1K |
| Storage ratio | 0.0 |


Let's write a small convenience function to check the size of a tensor.

In [3]:
def format_size(size_bytes):
    """Format a byte count using the first binary unit in which the value is below 1024."""
    size = size_bytes
    for unit in ['B', 'KiB', 'MiB', 'GiB', 'TiB']:
        if not int(size/1024):
            return f'{round(size)}.{unit}'
        else:
            size /= 1024

In [4]:
format_size(tensor.nbytes)                      # Actual memory usage

'146.KiB'

In [5]:
import numpy as np
format_size(np.prod(tensor.shape) * 8)    # Memory usage if the array were dense (8 bytes per float64)

'7.GiB'

Now let's construct a random mask. A mask should be a boolean array of the same shape as the tensor, which is `False` (`0`) where values are missing and `True` (`1`) where they are not.

It is important that the mask array have a fill value of `True`, so that the zero entries of the original `tensor` are treated as non-missing. Internally, the parafac algorithm generates dense arrays with as many elements as there are `False` entries in the mask, so the number of `False` entries has to stay small.
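
The mask convention can be illustrated on a tiny dense array. This is only a sketch in plain NumPy, not the sparse construction used in the next cell; the names `missing_indices`, `observed` and `n_missing` are hypothetical and chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
tensor = rng.random((3, 4, 5))

# True everywhere (the "fill value"), False only at the missing positions
mask = np.ones(tensor.shape, dtype=bool)
missing_indices = [(0, 1, 2), (2, 3, 4)]  # hypothetical missing entries
for idx in missing_indices:
    mask[idx] = False

# Multiplying by the mask zeroes out the missing entries and keeps the rest
observed = tensor * mask

# Only the False entries need to be stored when the fill value is True
n_missing = int(np.count_nonzero(~mask))
print(n_missing, "of", mask.size, "entries are marked missing")
```

With a fill value of `True`, a sparse mask only has to store the `False` positions, which is why the mask built below ends up with far fewer stored entries than the tensor has elements.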

In [6]:
import sparse
# Fraction of the nonzero entries of `tensor` to treat as missing. Larger values make it
# harder for PARAFAC to reconstruct the factors (it may need more iterations to converge).
missing_p = 0.3

mask = sparse.COO(coords=tensor.coords, data=np.random.choice([False, True], size=tensor.nnz, p=[missing_p, 1-missing_p]), shape=tensor.shape, fill_value=True)
# The identity map prunes the stored True values (equal to the fill value), leaving only the False entries in mask.data
mask = sparse.elemwise(lambda x: x, mask)
mask

| Property | Value |
|---|---|
| Format | coo |
| Data Type | bool |
| Shape | (1000, 1001, 1002) |
| nnz | 1414 |
| Density | 1.409767876833745e-06 |
| Read-only | True |
| Size | 34.5K |
| Storage ratio | 0.0 |


Now we factor the tensor. In order to demonstrate that there are no tricks up our sleeve, we multiply the tensor by the mask to clear the "missing" values. The mask is passed in as a keyword argument to `parafac()`. 

Note that at this time, you have to use the `parafac` function from the sparse backend when using a sparse mask to avoid memory blowups.

In [7]:
import time
%load_ext memory_profiler
from tensorly.contrib.sparse.decomposition import parafac

In [8]:
%%memit
start_time = time.time()
sparse_kruskal = parafac(tensor*mask, rank=rank, init='random', verbose=True, mask=mask)
end_time = time.time()
total_time = end_time - start_time
print('Took %d mins %d secs' % (divmod(total_time, 60)))

reconstruction error=0.6476812069469653
iteration 1, reconstruction error: 0.3149141332267099, decrease = 0.3327670737202554, unnormalized = 3.4417445973414806
iteration 2, reconstruction error: 0.10859513912412147, decrease = 0.20631899410258844, unnormalized = 1.2399762718299734
iteration 3, reconstruction error: 0.04297596063149542, decrease = 0.06561917849262605, unnormalized = 0.5005543219059398
iteration 4, reconstruction error: 0.020878679522241667, decrease = 0.022097281109253752, unnormalized = 0.24529692252918345
iteration 5, reconstruction error: 0.010889872512359632, decrease = 0.009988807009882035, unnormalized = 0.12844693809963953
iteration 6, reconstruction error: 0.005971615533669401, decrease = 0.0049182569786902315, unnormalized = 0.07056765288707954
iteration 7, reconstruction error: 0.0033970658618460692, decrease = 0.0025745496718233315, unnormalized = 0.04018083121088336
iteration 8, reconstruction error: 0.001986732417403245, decrease = 0.0014103334444428243, un

Let's look at one of the values that was masked out.

In [9]:
mask.coords.T[0]

array([10, 59, 99])

In [10]:
mask[tuple(mask.coords.T[0])]

False

In [11]:
orig_val = tensor[tuple(mask.coords.T[0])]
orig_val

0.04802876721150133

See [sparse_demo.ipynb](sparse_demo.ipynb) for how to calculate individual values from the factors.

In [12]:
weights, factors = sparse_kruskal
computed_val = np.sum(np.prod(sparse.stack([factors[i][idx] for i, idx in enumerate(tuple(mask.coords.T[0]))], 0), 0))
computed_val

0.048028767210974516

In [13]:
np.abs(orig_val - computed_val)

5.268147029724446e-13
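
The single-entry computation above can be checked on a small dense example. The sketch below uses plain NumPy rather than `sparse`, and the helper `cp_entry` is a hypothetical name introduced here; it also multiplies by the weights, which the cell above leaves out because they are not needed there.

```python
import numpy as np

def cp_entry(weights, factors, index):
    """Value of one entry of a CP tensor: sum_r w_r * prod_n factors[n][index[n], r]."""
    # One row per mode, each of length `rank`
    rows = np.stack([f[i] for f, i in zip(factors, index)])
    return float(np.sum(weights * np.prod(rows, axis=0)))

rng = np.random.default_rng(0)
shape, rank = (4, 5, 6), 3
weights = np.ones(rank)
factors = [rng.random((n, rank)) for n in shape]

# Full dense reconstruction, only feasible because the example is tiny
dense = np.einsum('r,ir,jr,kr->ijk', weights, *factors)

value = cp_entry(weights, factors, (1, 2, 3))
print(abs(value - dense[1, 2, 3]))
```

Computing one entry this way touches only one row of each factor matrix, so it never requires materializing the full tensor.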

Obviously this is a constructed example, where we know the unmasked tensor has an exact factorization. But it demonstrates that, given a tensor with missing values that we have reason to believe is well represented by a rank-$r$ tensor decomposition, we should expect the decomposition to do a decent job of reconstructing those missing values. This may not be the case if the missing values are not randomly distributed across the tensor, as they are here.
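
To make the idea concrete without any tensor library, here is a minimal dense sketch of one common way to fit a CP model with missing entries: fill the missing entries with the current estimate, then run ordinary alternating least squares. This is an illustrative assumption, not necessarily what `parafac` does internally, and the names `cp_reconstruct` and `masked_als` are hypothetical.

```python
import numpy as np

def cp_reconstruct(factors):
    """Dense reconstruction from three factor matrices (weights assumed to be 1)."""
    A, B, C = factors
    return np.einsum('ir,jr,kr->ijk', A, B, C)

def masked_als(tensor, mask, rank, n_iter=200, seed=0):
    """Imputation-based ALS for a 3-way tensor; mask is True where entries are observed."""
    rng = np.random.default_rng(seed)
    factors = [rng.random((n, rank)) for n in tensor.shape]
    # einsum specs for the matricized-tensor-times-Khatri-Rao product of each mode
    specs = ['ijk,jr,kr->ir', 'ijk,ir,kr->jr', 'ijk,ir,jr->kr']
    for _ in range(n_iter):
        # Replace missing entries with the current model estimate
        filled = np.where(mask, tensor, cp_reconstruct(factors))
        for mode in range(3):
            others = [factors[m] for m in range(3) if m != mode]
            mttkrp = np.einsum(specs[mode], filled, *others)
            # Hadamard product of the Gram matrices of the other two factors
            gram = (others[0].T @ others[0]) * (others[1].T @ others[1])
            factors[mode] = mttkrp @ np.linalg.pinv(gram)
    return factors

rng = np.random.default_rng(42)
true_factors = [rng.random((n, 2)) for n in (8, 9, 10)]
tensor = cp_reconstruct(true_factors)
mask = rng.random(tensor.shape) > 0.3  # roughly 30% of entries marked missing

factors = masked_als(tensor * mask, mask, rank=2)
estimate = cp_reconstruct(factors)
print("error on missing entries:", np.linalg.norm((tensor - estimate)[~mask]))
```

The fit only ever sees `tensor * mask`, so any agreement on the missing entries comes from the low-rank structure, not from the hidden values.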

Let's compare this to a value that was not masked.

In [14]:
for i in tensor.coords.T:
    non_missing_coord = tuple(i)
    if mask[non_missing_coord]:
        break
        
mask[non_missing_coord]

True

In [15]:
orig_val = tensor[non_missing_coord]
orig_val

0.002143451454775147

In [16]:
computed_val = np.sum(np.prod(sparse.stack([factors[i][idx] for i, idx in enumerate(non_missing_coord)], 0), 0))
computed_val

0.002143451454763028

In [17]:
np.abs(orig_val - computed_val)

1.2119211884042969e-14

As before, we should not in general try to recompose a sparse factorization unless we can represent it densely, but since this was constructed explicitly from sparse factors, we are able to do it (being careful to use the `cp_to_tensor` from the sparse backend).

In [18]:
expanded = cp_to_tensor((weights, factors))
expanded

| Property | Value |
|---|---|
| Format | coo |
| Data Type | float64 |
| Shape | (1000, 1001, 1002) |
| nnz | 4675 |
| Density | 4.661007655019631e-06 |
| Read-only | True |
| Size | 146.1K |
| Storage ratio | 0.0 |


Now let's look at the absolute error, both including and not including missing values.

In [19]:
from tensorly.contrib.sparse import norm
norm((tensor - expanded)*mask) # Absolute error of the non-missing values

1.56344373736724e-07

In [20]:
norm(tensor - expanded) # Absolute error including missing values

2.676220096895096e-07
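
The two error measures above differ only in whether the difference is multiplied by the mask before taking the norm. A small dense NumPy sketch of the same comparison follows; `masked_errors` is a hypothetical helper, and the perturbed `estimate` merely stands in for a reconstructed tensor.

```python
import numpy as np

def masked_errors(tensor, estimate, mask):
    """Frobenius-norm error on the observed entries only, and on all entries."""
    diff = tensor - estimate
    observed_error = np.linalg.norm(diff * mask)  # missing entries contribute 0
    total_error = np.linalg.norm(diff)
    return observed_error, total_error

rng = np.random.default_rng(1)
tensor = rng.random((4, 5, 6))
estimate = tensor + 1e-3 * rng.standard_normal(tensor.shape)  # stand-in reconstruction
mask = rng.random(tensor.shape) > 0.3  # True where observed

observed_error, total_error = masked_errors(tensor, estimate, mask)
relative_error = total_error / np.linalg.norm(tensor)
print(observed_error, total_error, relative_error)
```

The error over all entries can never be smaller than the error over the observed entries, since it sums the same squared differences plus those at the missing positions.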