
matrixmultiply #38

Closed
bluss opened this issue Apr 7, 2016 · 20 comments · Fixed by #43

@bluss

bluss commented Apr 7, 2016

matrixmultiply does f32 and f64 matrix multiplication quite competently for larger matrices. It's a simpler dependency than blas, so it might be nice to just use it. It does need type-specific dispatch (unstable specialization, TypeId, or a specific trait) at this point though.

What do you think, could I help you integrate with it, or are you developing something equivalent?

@AtheMathmo
Owner

Hey!

Thanks for creating this issue. I was planning on reaching out to you soon to ask about it too.

I think it would be really good to use the matrixmultiply library! I had started playing around with some divide-and-conquer parallelism, and after wrapping up my current work I was going to compare the two. I'd definitely like to see how big the gains are compared to my current (in-the-works) parallelized approach.

Do you have plans to add multithreading to the library? I'd be happy to try and help with that!

@bluss
Author

bluss commented Apr 7, 2016

I didn't have any plans for that, even though the code is shaped so that it should be possible. I've already plugged in rayon to try it, though. It seems it needs quite big matrices for it to pay off.

@AtheMathmo
Owner

Hmm, I found that I had some nice performance gains even for matrices that were as small as n x 64. I imagine even without parallelism your library will achieve higher performance. I'd be happy to play around with parallelism a little in matrixmultiply but for now I'll experiment with it as is.

Before starting any work I need to break up the matrix module into smaller submodules (all the operator overloads etc. are in one file right now, and without macros!). I'll definitely need some help integrating it and will keep this issue open.

@bluss
Author

bluss commented Apr 7, 2016

The packed matrix multiply will already be many times faster than a simpler approach at small sizes, though. (See this comparison, for example: bluss/matrixmultiply#6 (comment). The "ref" version is a simple reference triple-loop matrix multiplication implementation.)
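
For reference, the "ref" baseline there is essentially the textbook triple loop; a minimal sketch over row-major f64 slices:

/// Naive reference matrix multiplication: C = A * B.
/// A is m x k, B is k x n, C is m x n, all row-major.
fn ref_gemm(m: usize, k: usize, n: usize, a: &[f64], b: &[f64], c: &mut [f64]) {
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0;
            for l in 0..k {
                acc += a[i * k + l] * b[l * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}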

@AtheMathmo
Owner

For tracking:

I'm pretty convinced that swapping out the current multiplication methods for matrixmultiply is an easy and smart move. As for multithreading I can experiment with divide and conquer which calls matrixmultiply at the base level.

@AtheMathmo
Owner

Did a quick test using matrixmultiply:

Before:

test linalg::matrix::mat_mul_128_1000 ... bench: 14,231,183 ns/iter (+/- 833,129)

After:

test linalg::matrix::mat_mul_128_1000 ... bench: 3,352,754 ns/iter (+/- 412,408)

Pretty nice! This is without any divide and conquer parallelism as well (which I will need to investigate).

The only thing holding me back now is that my current multiplication is generic while matrixmultiply supports only f32 and f64. To solve this for now we could use:

http://rust-num.github.io/num/num/traits/trait.ToPrimitive.html

But I still need to cover both f32 and f64. @bluss I guess this is what you were talking about with type-specific dispatch. Do you have any suggestions for handling this?

@bluss
Author

bluss commented Apr 11, 2016

I'm not surprised it was a big improvement 😉

There's no perfect solution to type-specific dispatch without specialization (that young unstable feature). ndarray uses Any and TypeId to call the f32/f64 matrixmultiply routines for those types only, but it requires an Any trait bound. You can also define your own trait, but it would not be as inclusive as the current trait bounds.
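
A minimal sketch of that ad-hoc dispatch over row-major slices (gemm_dispatch and its fallback are hypothetical, not ndarray's actual code; sgemm/dgemm are matrixmultiply's real entry points, taking raw pointers plus row and column strides):

use std::any::TypeId;

/// True when the two type parameters are the same concrete type.
fn same_type<A: 'static, B: 'static>() -> bool {
    TypeId::of::<A>() == TypeId::of::<B>()
}

/// Hypothetical generic entry point: C = A * B over row-major slices.
/// Dispatches to matrixmultiply for f32/f64 and falls back to a naive
/// triple loop for any other numeric type.
fn gemm_dispatch<T>(m: usize, k: usize, n: usize, a: &[T], b: &[T], c: &mut [T])
    where T: 'static + Copy + Default + std::ops::AddAssign + std::ops::Mul<Output = T>
{
    if same_type::<T, f64>() {
        unsafe {
            matrixmultiply::dgemm(m, k, n, 1.0,
                                  a.as_ptr() as *const f64, k as isize, 1,
                                  b.as_ptr() as *const f64, n as isize, 1,
                                  0.0,
                                  c.as_mut_ptr() as *mut f64, n as isize, 1);
        }
    } else if same_type::<T, f32>() {
        unsafe {
            matrixmultiply::sgemm(m, k, n, 1.0,
                                  a.as_ptr() as *const f32, k as isize, 1,
                                  b.as_ptr() as *const f32, n as isize, 1,
                                  0.0,
                                  c.as_mut_ptr() as *mut f32, n as isize, 1);
        }
    } else {
        // Naive fallback; this is the inclusiveness trade-off mentioned above.
        for i in 0..m {
            for j in 0..n {
                let mut acc = T::default();
                for l in 0..k {
                    acc += a[i * k + l] * b[l * n + j];
                }
                c[i * n + j] = acc;
            }
        }
    }
}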

I have experimented some with scoped_threadpool and rayon in matrixmultiply, maybe not with big enough matrices. It's easy to have overhead dominate. I think the sweet spot for dividing across multiple threads is the loops marked 3 and 2 in that code, which requires some extra coding since each thread then needs its own packing buffer.

Oh, by the way, there's another alternative: writing a more generic version of matrixmultiply. It's not my favourite solution, for several reasons including compilation time: the matrix multiply would then need to be recompiled every time its usage site is recompiled.

@AtheMathmo
Owner

I was going to see if I could get away with simply dividing up the matrix before feeding it to the BLIS algorithm. If that doesn't work I'll try to play with the algorithm itself - but the gains are big enough that I'm not super motivated to improve it :).

I think it is best not to have a more generic version of matrixmultiply if it impacts compilation heavily.

I'll take a look at how you solve the dispatch in ndarray - no point reinventing the wheel ;). I haven't played with Any or TypeId yet so that may be fun...

Thanks

@bluss
Author

bluss commented Apr 11, 2016

I call the TypeId solution "ad-hoc specialization". It's crude but it works just fine.

@bluss
Author

bluss commented Apr 11, 2016

Dividing up the matrix sounds just fine actually.

@bluss
Author

bluss commented Apr 11, 2016

(Just make sure the threads are not writing to the same output location in the C matrix)
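
To make that concrete: a minimal sketch (not the rusty-machine code) that splits C into disjoint row blocks with split_at_mut, so the two tasks provably cannot alias, and hands each block to matrixmultiply's dgemm with row-major strides:

/// C = A * B with the top and bottom halves of C computed in parallel.
/// All matrices are row-major slices: A is m x k, B is k x n, C is m x n.
fn par_dgemm(m: usize, k: usize, n: usize, a: &[f64], b: &[f64], c: &mut [f64]) {
    let split = m / 2;
    // Disjoint mutable halves of C: the borrow checker now guarantees the
    // two closures never write to the same output location.
    let (c_top, c_bot) = c.split_at_mut(split * n);
    let (a_top, a_bot) = a.split_at(split * k);
    rayon::join(
        || unsafe {
            matrixmultiply::dgemm(split, k, n, 1.0,
                                  a_top.as_ptr(), k as isize, 1,
                                  b.as_ptr(), n as isize, 1,
                                  0.0, c_top.as_mut_ptr(), n as isize, 1)
        },
        || unsafe {
            matrixmultiply::dgemm(m - split, k, n, 1.0,
                                  a_bot.as_ptr(), k as isize, 1,
                                  b.as_ptr(), n as isize, 1,
                                  0.0, c_bot.as_mut_ptr(), n as isize, 1)
        },
    );
}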

@bluss
Author

bluss commented Apr 11, 2016

@AtheMathmo
Owner

Thanks for the source - that doesn't look too bad at all :).

It could be a challenge to get the divide and conquer right - as you pointed out, providing the right views of C to dgemm will take some careful thought.

@bluss
Author

bluss commented Apr 11, 2016

Pen and paper solves everything 😉

@AtheMathmo
Owner

A quick and dirty implementation shows we can get some speedup:

test linalg::matrix::mat_mul_128_1000 ... bench: 3,154,849 ns/iter (+/- 243,448)
test linalg::matrix::mat_paramul_128_1000 ... bench: 2,308,833 ns/iter (+/- 970,111)

This splits along an axis with length greater than 256 (I tried 64 and 128 as well). This implementation also has huge overhead, as I'm copying data to split up the matrices (I need to rewrite it using MatrixSlice). I'm also just creating new C matrices for each actual dgemm call. This can also be improved :).

@bluss
Author

bluss commented Apr 11, 2016

Cool. So you're the parallelism wizard.

@bluss
Author

bluss commented Apr 11, 2016

Which crate do you use for threading by the way?

@AtheMathmo
Owner

I'm not sure about wizard! It's a simple but inefficient implementation. I'm using rayon for the threading; it looks like this:

pub fn paramul(&self, m: &Matrix<T>) -> Matrix<T> {
    let n = self.rows();
    let p = self.cols();
    let q = m.cols();

    // The largest of the three dimensions decides where we split.
    let max_dim = std::cmp::max(n, std::cmp::max(p, q));

    if max_dim < 256 {
        // If the largest dimension is below the threshold, just multiply.
        self * m
    } else {
        // Otherwise, split in half along the largest axis.
        let split_point = max_dim / 2;

        if max_dim == n {
            // Split `self` along its rows; stack the results vertically.
            let top = self.select_rows(&(0..split_point).collect::<Vec<usize>>()[..]);
            let bottom = self.select_rows(&(split_point..n).collect::<Vec<usize>>()[..]);

            let (a_1_b, a_2_b) = rayon::join(|| Matrix::paramul(&top, m),
                                             || Matrix::paramul(&bottom, m));

            a_1_b.vcat(&a_2_b)
        } else if max_dim == p {
            // Split `self` along its columns and `m` along its rows;
            // the two partial products sum to the full result.
            let a_left = self.select_cols(&(0..split_point).collect::<Vec<usize>>()[..]);
            let a_right = self.select_cols(&(split_point..p).collect::<Vec<usize>>()[..]);

            let b_top = m.select_rows(&(0..split_point).collect::<Vec<usize>>()[..]);
            let b_bottom = m.select_rows(&(split_point..p).collect::<Vec<usize>>()[..]);

            let (a_1_b_1, a_2_b_2) = rayon::join(|| Matrix::paramul(&a_left, &b_top),
                                                 || Matrix::paramul(&a_right, &b_bottom));

            a_1_b_1 + a_2_b_2
        } else {
            // max_dim == q: split `m` along its columns; stack horizontally.
            let left = m.select_cols(&(0..split_point).collect::<Vec<usize>>()[..]);
            let right = m.select_cols(&(split_point..q).collect::<Vec<usize>>()[..]);

            let (a_b_1, a_b_2) = rayon::join(|| Matrix::paramul(self, &left),
                                             || Matrix::paramul(self, &right));

            a_b_1.hcat(&a_b_2)
        }
    }
}

The select methods are copying data right now. I need to change them to use split methods and get slices.

The basic idea of the algorithm: split in half along the largest axis and then bring the results back together.
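
One property that helps with that rewrite: matrixmultiply takes explicit strides, so a column block of B and C can be handed to dgemm without copying, by offsetting the pointer and keeping the parent matrix's row stride. A hypothetical helper (not the rusty-machine API), sketched on row-major f64 slices:

/// Writes A * B[:, split..] into C[:, split..] in place, without copying.
/// A is m x k, B is k x n, C is m x n, all row-major. The "view" trick:
/// point at the block's first element and keep the parent's row stride n.
unsafe fn dgemm_col_block(m: usize, k: usize, n: usize, split: usize,
                          a: &[f64], b: &[f64], c: &mut [f64]) {
    matrixmultiply::dgemm(m, k, n - split, 1.0,
                          a.as_ptr(), k as isize, 1,
                          b.as_ptr().add(split), n as isize, 1,
                          0.0,
                          c.as_mut_ptr().add(split), n as isize, 1);
}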

@bluss
Author

bluss commented Apr 11, 2016

Nice, so there are some easy wins possible by using views for the inputs to the multiplication.

@AtheMathmo
Owner

I suspect so :).

Sadly I need to do some code maintenance before I can get to work on this properly. I'll probably poke you at some point for some advice with that too...

Once I've wrapped that up I'll try to get this going. It should make the performance of most algorithms pretty acceptable!
