What levels of sparsity is this useful for? #752

Open

SimLeek opened this issue Jan 16, 2024 · 4 comments
SimLeek commented Jan 16, 2024

I'm currently working with matrices where >99% of the values are 0, and it's very unlikely that most of the blocks will be larger than 1x1 in size. However, the libsmm_acc code makes me think this library could still be useful there, since SMM is used for the smaller blocks and there is some CUDA optimization.

I was able to create large matrices by modifying the example dbcsr_example_3.cpp and got 1 teraflop performance, but only at an occupancy of 1 (i.e. fully dense). It looks like when I assign values with c_dbcsr_iterator_next_2d_block_d, it treats the blocks like full 2D matrices. I could still calculate the 1x1 blocks and their resulting locations, but I feel like that wouldn't be very fast.

So, is this library useful when all matrices have >99% sparsity? If so, how would I set things up for that? If not, what specific types of sparsity or sparsity percentages does it help with?


Here is my main modification to the example, with density set to 0.01:

while (c_dbcsr_iterator_blocks_left(iter)) {
  // ...
  c_dbcsr_iterator_next_2d_block_d(iter, &i, &j, &blk, &tr, &nblk, &rsize, &csize, &roff, &coff);
  // Fill each element of the current block with probability `density`;
  // a density >= 1.0 fills every element.
  for (int g = 0; g < rsize * csize; g++) {
    double r = static_cast<double>(std::rand()) / RAND_MAX;
    if (density >= 1.0 || r < density) {
      blk[g] = static_cast<double>(std::rand()) / RAND_MAX;
    }
  }
}
alazzaro (Member)

> I'm currently working with matrices where >99% of the values are 0, and it's very unlikely that most of the blocks will be larger than 1x1 in size. However, the libsmm_acc code makes me think this library could still be useful there, since SMM is used for the smaller blocks and there is some CUDA optimization.

The <1% occupancy is fine for the library. However, as you pointed out, the library is really optimized for block-sparsity (DBCSR = distributed block CSR), so you will not get any special optimization for the multiplication of such small blocks (they will simply run on the CPU, i.e. no GPU involved).

> I was able to create large matrices by modifying the example dbcsr_example_3.cpp and got 1 teraflop performance, but only at an occupancy of 1 (i.e. fully dense).

For this particular case of dense matrices, we run a "densification" algorithm.

> It looks like when I assign values with c_dbcsr_iterator_next_2d_block_d, it treats the blocks like full 2D matrices. I could still calculate the 1x1 blocks and their resulting locations, but I feel like that wouldn't be very fast.

This is correct: we always assume blocks, so single elements are treated as 1x1 blocks. We don't run any special optimization for that (it is a bit out of scope for DBCSR). Still, the library will work, but I would assume the performance would be low.

In conclusion, sparsity is fine (we used to run tests with 0.01% occupancy), but we do expect block sparsity to get the best performance out of the library.
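For a rough sense of the trade-off, here is a minimal standalone counting sketch (plain C++, not DBCSR code), assuming a 4096x4096 matrix with ~1% uniformly random occupancy: it counts how many blocks would need to be tracked with 1x1 blocks versus 64x64 blocks, and how much explicit zero fill-in the larger blocks would introduce.

#include <iostream>
#include <random>
#include <set>
#include <utility>

int main() {
  const int n = 4096;          // matrix dimension (assumed for illustration)
  const double density = 0.01; // ~1% of the values are non-zero
  const int bs = 64;           // candidate block size

  std::mt19937 gen(42);
  std::uniform_real_distribution<double> dist(0.0, 1.0);

  long long nnz = 0;                     // non-zeros, i.e. the number of 1x1 blocks
  std::set<std::pair<int, int>> blocks;  // which bs x bs blocks are touched
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
      if (dist(gen) < density) {
        ++nnz;
        blocks.insert({i / bs, j / bs});
      }

  // With 1x1 blocks every non-zero is its own block (per-element dimension
  // lookup and offset bookkeeping); with bs x bs blocks far fewer blocks are
  // tracked, but each one stores bs*bs values even if only a few are non-zero.
  const long long stored = static_cast<long long>(blocks.size()) * bs * bs;
  std::cout << "non-zeros (= 1x1 blocks): " << nnz << '\n'
            << "64x64 blocks touched:     " << blocks.size() << '\n'
            << "values stored as 64x64:   " << stored
            << " (fill-in ratio " << static_cast<double>(stored) / nnz << ")\n";
  return 0;
}

With uniformly random non-zeros, essentially every 64x64 block is touched, so the fill-in ratio comes out near 1/density; it only drops if the non-zeros cluster, which is exactly the block-sparse case the library targets.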

SimLeek (Author) commented Jan 18, 2024

Fair. I guess the 'maximum local sparsity' for a block should be something like 75%, and it might be useful to do a Game of Life-style convolution ('if 3 of the 4 values in a 2x2 square containing this location are non-zero, count this zero as non-zero') and then build an R-tree of 100% dense rectangles, if you know there's going to be grouping but you don't know exactly where.
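For concreteness, a minimal sketch of that promotion step (plain C++, no DBCSR calls), using the 2x2-square / 3-of-4 rule from above; the R-tree step for extracting dense rectangles is not shown:

#include <iostream>
#include <vector>

// Promote a zero cell to "occupied" if some 2x2 square containing it already
// holds at least 3 occupied cells, so nearly dense regions become fully dense
// and can later be grouped into rectangular blocks.
std::vector<std::vector<int>> promote(const std::vector<std::vector<int>>& occ) {
  const int rows = occ.size(), cols = occ[0].size();
  std::vector<std::vector<int>> out = occ;
  for (int i = 0; i < rows; ++i) {
    for (int j = 0; j < cols; ++j) {
      if (occ[i][j]) continue;
      // The 2x2 squares containing cell (i, j) have top-left corners in (i-1..i, j-1..j).
      for (int di = -1; di <= 0 && !out[i][j]; ++di) {
        for (int dj = -1; dj <= 0 && !out[i][j]; ++dj) {
          int count = 0;
          for (int a = 0; a < 2; ++a)
            for (int b = 0; b < 2; ++b) {
              const int r = i + di + a, c = j + dj + b;
              if (r >= 0 && r < rows && c >= 0 && c < cols && occ[r][c]) ++count;
            }
          if (count >= 3) out[i][j] = 1;
        }
      }
    }
  }
  return out;
}

int main() {
  // Tiny occupancy pattern: a nearly dense 2x2 corner plus an isolated non-zero.
  const std::vector<std::vector<int>> occ = {
      {1, 1, 0, 0},
      {1, 0, 0, 0},
      {0, 0, 0, 0},
      {0, 0, 0, 1},
  };
  for (const auto& row : promote(occ)) {
    for (int v : row) std::cout << v << ' ';
    std::cout << '\n';
  }
  return 0;
}

The zero at (1,1) gets promoted because the 2x2 corner around it holds three non-zeros, while the isolated non-zero at (3,3) stays alone.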

One of my matrices might cluster like that in a best-case scenario, with most non-zero elements along the diagonal, and I might be able to just set some grouping on the diagonal ahead of time, but the other matrices will actually avoid clumping.

Anyway, I think that answers the question in this issue: libsmm_acc does not optimize 1x1 blocks, nor does any other part of the library, since that's outside the scope of DBCSR. If that's correct, I think this can be closed.

alazzaro (Member)

> Anyway, I think that answers the question in this issue: libsmm_acc does not optimize 1x1 blocks, nor does any other part of the library, since that's outside the scope of DBCSR. If that's correct, I think this can be closed.

This is correct. Note that the library will work for 1x1 elements, but there will be a massive overhead from assuming "blocks" (lookup of the dimensions and calculation of the position), which is useless for single elements... On the other hand, we could think about building the matrices with blocks of a given dimension, where the blocks are sparse inside. Then what we need is a way to group 1x1 blocks into larger blocks...
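As a rough illustration of that grouping step (not part of the DBCSR API), here is a minimal standalone sketch that buckets scattered (row, col, value) entries into fixed-size bs x bs blocks, padding each touched block with explicit zeros; the resulting dense blocks could then be handed to the library as ordinary blocks. The block size of 4 and the std::map layout are arbitrary choices for the illustration.

#include <iostream>
#include <map>
#include <utility>
#include <vector>

struct Entry { int row, col; double val; };

// Map from (block_row, block_col) to a dense bs*bs buffer in row-major order.
std::map<std::pair<int, int>, std::vector<double>>
group_into_blocks(const std::vector<Entry>& entries, int bs) {
  std::map<std::pair<int, int>, std::vector<double>> blocks;
  for (const Entry& e : entries) {
    auto& blk = blocks[{e.row / bs, e.col / bs}];
    if (blk.empty()) blk.assign(bs * bs, 0.0);      // explicit zero fill-in
    blk[(e.row % bs) * bs + (e.col % bs)] = e.val;  // drop the 1x1 entry into place
  }
  return blocks;
}

int main() {
  const int bs = 4;  // assumed block size for the illustration
  const std::vector<Entry> entries = {{0, 1, 1.5}, {2, 3, -0.5}, {9, 9, 2.0}};
  for (const auto& kv : group_into_blocks(entries, bs))
    std::cout << "block (" << kv.first.first << ", " << kv.first.second
              << ") holds " << kv.second.size() << " values\n";
  return 0;
}

The first two entries land in block (0, 0) and the third in block (2, 2), so three 1x1 entries become two 4x4 blocks with 16 stored values each.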

Out of curiosity, what's your use case?

SimLeek (Author) commented Jan 20, 2024

I'm trying to get optimized sparsity working for linear layers in PyTorch, preferably with each layer being variational to maintain sparsity, and then for conv layers. I'm looking at smaller SpGEMM and SpMSpV libraries now that look hopeful, but I'll have to test them.

Unfortunately, that variational rule is likely what's going to make most of the blocks, even 2x2 ones, sparse. And now that I think about it, that might apply to the synapses as well as the input, since having meaningfully different input connections would help keep one output from activating at the same time as another.
