Skip to content

Sparse matrices

jcanny edited this page Jun 16, 2014 · 6 revisions

Table of Contents

Creating Sparse Matrices

Sparse matrices can be created in several ways: by converting a dense matrix to sparse:

scala> val a = bernrnd(0.5,5,5)
a: BIDMat.IMat =
   0   0   0   1   1
   1   0   1   1   0
   1   0   1   1   1
   0   1   0   1   1
   1   0   0   0   0


scala> val b = sparse(a)
b: BIDMat.SMat =
(   1,   0)   1
(   2,   0)   1
(   4,   0)   1
(   3,   1)   1
(   1,   2)   1
(   2,   2)   1
(   0,   3)   1
(   1,   3)   1
  ...  ...  ...

from which you can see that sparse matrices print out as (row, column, value) tuples. Note that a was actually an IMat, but was implicitly cast to FMat, which is the argument type for sparse().

There is a sparse random number generator, sprand() whose arguments are nrows, ncols, and probability of element presence. The full() function converts back from a sparse matrix to a dense one.

scala> val a = sprand(5,5,0.5)
a: BIDMat.SMat =
(        0,        0)  0.32618
(        2,        0)  0.12405
(        3,        0)  0.48652
(        0,        1)  0.87441
(        1,        1)  0.44249
(        2,        1)  0.16163
(        0,        2)  0.56278
(        1,        2)  0.56792
       ...       ...       ...

scala> full(a)
res0: BIDMat.FMat =
  0.32618  0.87441  0.56278  0.50188        0
        0  0.44249  0.56792  0.75591  0.58401
  0.12405  0.16163        0  0.91592        0
  0.48652        0        0  0.82078        0
        0        0  0.97430        0        0

The operators \ and on work as they do for dense matrices. Its important to know that both horizontal and vertical concatenation is an efficient operation for sparse matrices. In both cases, the time taken is proportional to the size of the output: no additional sorting is needed.

scala> val b = sprand(5,3,0.3)
b: BIDMat.SMat =
(         1,         0)   0.78089
(         0,         1)   0.87985
(         2,         1)   0.13684
(         1,         2)   0.38906
(         2,         2)  0.078547
(         3,         2)   0.63639

scala> a \ b
res1: BIDMat.SMat =
(        0,        0)  0.32618
(        2,        0)  0.12405
(        3,        0)  0.48652
(        0,        1)  0.87441
(        1,        1)  0.44249
(        2,        1)  0.16163
(        0,        2)  0.56278
(        1,        2)  0.56792
       ...       ...       ...

scala> full(a\b)
res2: BIDMat.FMat =
   0.32618   0.87441   0.56278   0.50188         0         0   0.87985         0
         0   0.44249   0.56792   0.75591   0.58401   0.78089         0   0.38906
   0.12405   0.16163         0   0.91592         0         0   0.13684  0.078547
   0.48652         0         0   0.82078         0         0         0   0.63639
         0         0   0.97430         0         0         0         0         0

scala> full(a)
res4: BIDMat.FMat =
  0.32618  0.87441  0.56278  0.50188        0
        0  0.44249  0.56792  0.75591  0.58401
  0.12405  0.16163        0  0.91592        0
  0.48652        0        0  0.82078        0
        0        0  0.97430        0        0

scala> full(b)
res5: BIDMat.FMat =
         0   0.87985         0
   0.78089         0   0.38906
         0   0.13684  0.078547
         0         0   0.63639
         0         0         0

Find Functions

find functions work exactly as they do for dense matrices. i.e. find(A) returns the single indices of non-zeros of A, find2(A) returns a two-tuple with row and column indices of non-zeros, and find3(A) returns a three-tuple with row, column and value as column matrices. It should always be the case that findx(A) = findx(sparse(A)) and findx(B) = findx(full(B)) for findx one of find(), find2() or find3().

Just as for Dense matrices, the find functions for sparse matrices return the indices in column-major order. That means that the indices returned by find() will be strictly increasing.

The find() functions play a special role for sparse matrices however, because they allow access to alternative representations of sparse arrays. Sparse matrices in BIDMat, as in Fortran and Matlab, are internally stored in compressed sparse column (CSC) format. CSC is a column-major format where column indices are compressed. For a matrix in column-major order, the column indices are non-decreasing, and e.g. all the elements with column index j occur in one contiguous block in this ordering. Rather than store them explicitly, in CSC you store only the start and end indices of this block. Specifically, in a CSC matrix there is an array jc of length ncols+1 where jc(j) is the index of the first element whose column index is j in the column-major list of all non-zero elements. jc(ncols) is equal to the number of non-zeros in the matrix.

The find3() function effectively converts the matrix to coordinate format, where all of row, column and value matrices are available. It allows the matrix elements to be reordered or filtered in various ways. All the find functions are efficient, and take linear time.

Creating Sparse Matrices from Elements

The sparse function is overloaded to construct (CSC) sparse matrices from vectors of indices and value. So if we had indices defined like this:

scala> val ii = icol(1,0,2,1,2,3)
ii: BIDMat.IMat =
   1
   0
   2
   1
   2
   3

scala> val jj = icol(0,1,1,2,2,2)
jj: BIDMat.IMat =
   0
   1
   1
   2
   2
   2

scala> val vv = col(0.78, 0.88, 0.13, 0.39, 0.08, 0.64)
vv: BIDMat.FMat =
   0.78000
   0.88000
   0.13000
   0.39000
  0.080000
   0.64000

scala> val a = sparse(ii,jj,vv)
a: BIDMat.SMat =
(         1,         0)   0.78000
(         0,         1)   0.88000
(         2,         1)   0.13000
(         1,         2)   0.39000
(         2,         2)  0.080000
(         3,         2)   0.64000

scala> full(a)
res0: BIDMat.FMat =
         0   0.88000         0
   0.78000         0   0.39000
         0   0.13000  0.080000
         0         0   0.64000

Note that creating sparse matrices in this way is expensive. Its necessary to sort the indices to column-major order, to reduce (sum values) over duplicate indices, and to remove zero values. In general this is an O(n log n) process.

Accessing Sparse Matrix Elements

Sparse array elements can be accessed directly as A(row,column)

scala> full(a)
res0: BIDMat.FMat =
         0   0.88000         0
   0.78000         0   0.39000
         0   0.13000  0.080000
         0         0   0.64000


scala> a(0,1)
res1: Double = 0.88

but this is a slow operation. Since the row indices are stored in a sorted array per column, a binary search is needed to find the row index if it exists. Its better to use "batch" calculations on sparse matrices that work on all their elements at once, or on blocks of values.

Updating is possible on sparse matrix elements, but only if the element exists. Adding a new non-zero would require returning a new array, and this is not supported:

scala> full(a)
res0: BIDMat.FMat =
         0   0.88000         0
   0.78000         0   0.39000
         0   0.13000  0.080000
         0         0   0.64000

scala> a(0,1)=0.2
res2: Double = 0.2

scala> full(a)
res4: BIDMat.FMat =
         0   0.20000         0
   0.78000         0   0.39000
         0   0.13000  0.080000
         0         0   0.64000

scala> a(0,0)=1
java.lang.RuntimeException: Can't set missing values

Slicing

Slicing is an important operation for sparse matrices because it allows operations on sparse matrices in blocks. Of particular importance is column slicing (where the row index is a wildcard), which is always efficient:

scala> val a = sprand(5,10,0.4)
a: BIDMat.SMat =
(        0,        0)  0.97430
(        2,        0)  0.50188
(        0,        1)  0.75591
(        1,        1)  0.91592
(        3,        1)  0.82078
(        4,        1)  0.58401
(        1,        2)  0.51697
(        3,        2)  0.19109
       ...       ...       ...

scala> full(a)
res7: BIDMat.FMat =
   0.97430   0.75591         0         0   0.33410   0.25824         0         0         0   0.38906
         0   0.91592   0.51697         0         0   0.58547         0         0         0         0
   0.50188         0         0         0         0   0.25645   0.88392         0   0.87985         0
         0   0.82078   0.19109  0.066242         0   0.77816         0   0.12139   0.13684  0.078547
         0   0.58401   0.93623         0   0.64584         0         0   0.78089         0         0

scala> val b = a(?, 1 \ 2 \ 4 \ 4)
b: BIDMat.SMat =
(        0,        0)  0.75591
(        1,        0)  0.91592
(        3,        0)  0.82078
(        4,        0)  0.58401
(        1,        1)  0.51697
(        3,        1)  0.19109
(        4,        1)  0.93623
(        0,        2)  0.33410
       ...       ...       ...

scala> full(b)
res8: BIDMat.FMat =
  0.75591        0  0.33410  0.33410
  0.91592  0.51697        0        0
        0        0        0        0
  0.82078  0.19109        0        0
  0.58401  0.93623  0.64584  0.64584

On the other hand, row slicing in general requires a sort of the result and can be O(n log n). If Regular row slicing is needed, it may be better to compute the transpose of the matrix first, and then slice by columns.

Sparse Matrix Contents

We have found it very useful to expose the non-zero values of a sparse matrix as a dense vector. This allows transformations of the non-zeros without use of the indices. e.g. you can transform counts using log, sqrt or signum functions, or compute quotients of sparse matrices with identical non-zeros.

Sparse matrix contents are deliberately aliased since the intention is to be able to change the values, so should be used with care. To access the contents of a sparse matrix, call its contents method:

> val a = sprand(4,4,0.3)
(        1,        0)  0.46293
(        1,        1)  0.12652
(        0,        3)  0.45308
(        1,        3)  0.73676
> val b = a.contents
  0.46293
  0.12652
  0.45308
  0.73676
> b(?) = 1
   1
   1
   1
   1
> a
(   1,   0)   1
(   1,   1)   1
(   0,   3)   1
(   1,   3)   1
The contents of an SMat will be an FMat vector, for an SDMat it will be a DMat, and for GSMat it will be a GMat. The contents method is defined also for dense matrices so that fully generic programs can use contents. In that case returns a vector with the same storage as the dense matrix, and whose length is the product of nrows and ncols for the matrix.