
matrixmultiply #38

Closed
bluss opened this issue Apr 7, 2016 · 20 comments · Fixed by #43

@bluss

bluss commented Apr 7, 2016

matrixmultiply does f32 and f64 matrix multiplication quite competently for larger matrices. It's a simpler dependency than blas, so it might be nice to just use it. It does need type-specific dispatch (unstable specialization, TypeId, or a specific trait) at this point though.

What do you think, could I help you integrate with it, or are you developing something equivalent?

@AtheMathmo
Owner

Hey!

Thanks for creating this issue. I was planning on reaching out to you soon to ask about it too.

I think it would be really good to use the matrixmultiply library! I had started playing around with some divide-and-conquer parallelism, and after wrapping up my current work I was going to compare the two. I'd definitely like to see how big the gains are compared to my current (in-the-works) parallelized approach.

Do you have plans to add multithreading to the library? I'd be happy to try and help with that!

@bluss
Author

bluss commented Apr 7, 2016

I didn't have any plans for that, even though the code is shaped so that it should be possible. I've already plugged in rayon to try it, though. It seems it needs quite big matrices for it to pay off.

@AtheMathmo
Owner

Hmm, I found that I had some nice performance gains even for matrices that were as small as n x 64. I imagine even without parallelism your library will achieve higher performance. I'd be happy to play around with parallelism a little in matrixmultiply but for now I'll experiment with it as is.

Before starting any work I need to break up the matrix module into smaller submodules (all the operator overloads etc. are in one file right now, and without macros!). I'll definitely need some help integrating it and will keep this issue open.

@bluss
Author

bluss commented Apr 7, 2016

The packed matrix multiply will already be many times faster than a simpler approach at small sizes, though. (See this comparison, for example: bluss/matrixmultiply#6 (comment). The "ref" version is a simple reference triple-loop matrix multiplication implementation.)
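
For reference, the "ref" baseline there is essentially the textbook triple loop; a minimal sketch over row-major f64 slices:

/// Naive reference matrix multiplication: C = A * B.
/// A is m x k, B is k x n, C is m x n, all row-major.
fn ref_gemm(m: usize, k: usize, n: usize, a: &[f64], b: &[f64], c: &mut [f64]) {
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0;
            for l in 0..k {
                acc += a[i * k + l] * b[l * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}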

@AtheMathmo
Owner

For tracking:

I'm pretty convinced that swapping out the current multiplication methods for matrixmultiply is an easy and smart move. As for multithreading I can experiment with divide and conquer which calls matrixmultiply at the base level.

@AtheMathmo
Owner

Did a quick test using matrixmultiply:

Before:

test linalg::matrix::mat_mul_128_1000 ... bench: 14,231,183 ns/iter (+/- 833,129)

After:

test linalg::matrix::mat_mul_128_1000 ... bench: 3,352,754 ns/iter (+/- 412,408)

Pretty nice! This is without any divide and conquer parallelism as well (which I will need to investigate).

The only thing holding me back now is that my current multiplication is generic while matrixmultiply supports only f32 and f64. To solve this for now we could use:

http://rust-num.github.io/num/num/traits/trait.ToPrimitive.html

But I still need to cover both f32 and f64. @bluss I guess this is what you were talking about with type-specific dispatch. Do you have any suggestions for handling this?

@bluss
Author

bluss commented Apr 11, 2016

I'm not surprised it was a big improvement 😉

There's no perfect solution to type-specific dispatch without specialization (that young unstable feature). ndarray uses Any and TypeId to call the f32/f64 matrixmultiply routines for those types only, but it requires an Any trait bound. You can also define your own trait, but it would not be as inclusive as the current trait bounds.
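
A minimal sketch of that ad-hoc dispatch over row-major slices (gemm_dispatch and its fallback are hypothetical, not ndarray's actual code; sgemm/dgemm are matrixmultiply's real entry points, taking raw pointers plus row and column strides):

use std::any::TypeId;

/// True when the two type parameters are the same concrete type.
fn same_type<A: 'static, B: 'static>() -> bool {
    TypeId::of::<A>() == TypeId::of::<B>()
}

/// Hypothetical generic entry point: C = A * B over row-major slices.
/// Dispatches to matrixmultiply for f32/f64 and falls back to a naive
/// triple loop for any other numeric type.
fn gemm_dispatch<T>(m: usize, k: usize, n: usize, a: &[T], b: &[T], c: &mut [T])
    where T: 'static + Copy + Default + std::ops::AddAssign + std::ops::Mul<Output = T>
{
    if same_type::<T, f64>() {
        unsafe {
            matrixmultiply::dgemm(m, k, n, 1.0,
                                  a.as_ptr() as *const f64, k as isize, 1,
                                  b.as_ptr() as *const f64, n as isize, 1,
                                  0.0,
                                  c.as_mut_ptr() as *mut f64, n as isize, 1);
        }
    } else if same_type::<T, f32>() {
        unsafe {
            matrixmultiply::sgemm(m, k, n, 1.0,
                                  a.as_ptr() as *const f32, k as isize, 1,
                                  b.as_ptr() as *const f32, n as isize, 1,
                                  0.0,
                                  c.as_mut_ptr() as *mut f32, n as isize, 1);
        }
    } else {
        // Naive fallback; this is the inclusiveness trade-off mentioned above.
        for i in 0..m {
            for j in 0..n {
                let mut acc = T::default();
                for l in 0..k {
                    acc += a[i * k + l] * b[l * n + j];
                }
                c[i * n + j] = acc;
            }
        }
    }
}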

I have experimented some with scoped_threadpool and rayon in matrixmultiply, maybe not with big enough matrices. It's easy to have overhead dominate. I think the sweet spot for dividing across multiple threads is the loops marked 3 and 2 in that code, which requires some extra coding since each thread then needs its own packing buffer.

Oh, by the way, there's another alternative: writing a more generic version of matrixmultiply. It's not my favourite solution, for several reasons including compilation time: the matrix multiply would then need to be recompiled every time its usage site is recompiled.

@AtheMathmo
Owner

I was going to see if I could get away with simply dividing up the matrix before feeding it to the BLIS algorithm. If that doesn't work I'll try to play with the algorithm itself - but the gains are big enough that I'm not super motivated to improve it :).

I think it is best not to have a more generic version of matrixmultiply if it impacts compilation heavily.

I'll take a look at how you solve the dispatch in ndarray - no point reinventing the wheel ;). I haven't played with Any or TypeId yet so that may be fun...

Thanks

@bluss
Author

bluss commented Apr 11, 2016

I call the TypeId solution "ad-hoc specialization". It's crude but it works just fine.

@bluss
Author

bluss commented Apr 11, 2016

Dividing up the matrix sounds just fine actually.

@bluss
Author

bluss commented Apr 11, 2016

(Just make sure the threads are not writing to the same output location in the C matrix)
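
To make that concrete: a minimal sketch (not the rusty-machine code) that splits C into disjoint row blocks with split_at_mut, so the two tasks provably cannot alias, and hands each block to matrixmultiply's dgemm with row-major strides:

/// C = A * B with the top and bottom halves of C computed in parallel.
/// All matrices are row-major slices: A is m x k, B is k x n, C is m x n.
fn par_dgemm(m: usize, k: usize, n: usize, a: &[f64], b: &[f64], c: &mut [f64]) {
    let split = m / 2;
    // Disjoint mutable halves of C: the borrow checker now guarantees the
    // two closures never write to the same output location.
    let (c_top, c_bot) = c.split_at_mut(split * n);
    let (a_top, a_bot) = a.split_at(split * k);
    rayon::join(
        || unsafe {
            matrixmultiply::dgemm(split, k, n, 1.0,
                                  a_top.as_ptr(), k as isize, 1,
                                  b.as_ptr(), n as isize, 1,
                                  0.0, c_top.as_mut_ptr(), n as isize, 1)
        },
        || unsafe {
            matrixmultiply::dgemm(m - split, k, n, 1.0,
                                  a_bot.as_ptr(), k as isize, 1,
                                  b.as_ptr(), n as isize, 1,
                                  0.0, c_bot.as_mut_ptr(), n as isize, 1)
        },
    );
}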

@bluss
Author

bluss commented Apr 11, 2016

@AtheMathmo
Owner

Thanks for the source - that doesn't look too bad at all :).

It could be a challenge to get the divide and conquer right - as you pointed out, providing the right views of C to dgemm will take some careful thought.

@bluss
Author

bluss commented Apr 11, 2016

Pen and paper solves everything 😉

@AtheMathmo
Owner

A quick and dirty implementation shows we can get some speedup:

test linalg::matrix::mat_mul_128_1000 ... bench: 3,154,849 ns/iter (+/- 243,448)
test linalg::matrix::mat_paramul_128_1000 ... bench: 2,308,833 ns/iter (+/- 970,111)

This splits along an axis with length greater than 256 (I tried 64 and 128 as well). This implementation also has huge overhead, as I'm copying data to split up the matrices (I need to rewrite it using MatrixSlice). I'm also just creating new C matrices for each actual dgemm call. This can also be improved :).

@bluss
Author

bluss commented Apr 11, 2016

Cool. So you're the parallelism wizard.

@bluss
Author

bluss commented Apr 11, 2016

Which crate do you use for threading by the way?

@AtheMathmo
Owner

I'm not sure about wizard! It's a simple but inefficient implementation. I'm using rayon for the threading; it looks like this:

pub fn paramul(&self, m: &Matrix<T>) -> Matrix<T> {
    let n = self.rows();
    let p = self.cols();
    let q = m.cols();

    // The largest of the three dimensions decides where we split.
    let max_dim = std::cmp::max(n, std::cmp::max(p, q));

    if max_dim < 256 {
        // If the largest dimension is below the threshold, just multiply.
        self * m
    } else {
        // Otherwise, split in half along the largest axis.
        let split_point = max_dim / 2;

        if max_dim == n {
            // Split `self` along its rows; stack the results vertically.
            let top = self.select_rows(&(0..split_point).collect::<Vec<usize>>()[..]);
            let bottom = self.select_rows(&(split_point..n).collect::<Vec<usize>>()[..]);

            let (a_1_b, a_2_b) = rayon::join(|| Matrix::paramul(&top, m),
                                             || Matrix::paramul(&bottom, m));

            a_1_b.vcat(&a_2_b)
        } else if max_dim == p {
            // Split `self` along its columns and `m` along its rows;
            // the two partial products sum to the full result.
            let a_left = self.select_cols(&(0..split_point).collect::<Vec<usize>>()[..]);
            let a_right = self.select_cols(&(split_point..p).collect::<Vec<usize>>()[..]);

            let b_top = m.select_rows(&(0..split_point).collect::<Vec<usize>>()[..]);
            let b_bottom = m.select_rows(&(split_point..p).collect::<Vec<usize>>()[..]);

            let (a_1_b_1, a_2_b_2) = rayon::join(|| Matrix::paramul(&a_left, &b_top),
                                                 || Matrix::paramul(&a_right, &b_bottom));

            a_1_b_1 + a_2_b_2
        } else {
            // max_dim == q: split `m` along its columns; stack horizontally.
            let left = m.select_cols(&(0..split_point).collect::<Vec<usize>>()[..]);
            let right = m.select_cols(&(split_point..q).collect::<Vec<usize>>()[..]);

            let (a_b_1, a_b_2) = rayon::join(|| Matrix::paramul(self, &left),
                                             || Matrix::paramul(self, &right));

            a_b_1.hcat(&a_b_2)
        }
    }
}

The select methods are copying data right now. I need to change them to use split methods and get slices.

The basic idea of the algorithm: split in half along the largest axis and then bring the results back together.
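
One property that helps with that rewrite: matrixmultiply takes explicit strides, so a column block of B and C can be handed to dgemm without copying, by offsetting the pointer and keeping the parent matrix's row stride. A hypothetical helper (not the rusty-machine API), sketched on row-major f64 slices:

/// Writes A * B[:, split..] into C[:, split..] in place, without copying.
/// A is m x k, B is k x n, C is m x n, all row-major. The "view" trick:
/// point at the block's first element and keep the parent's row stride n.
unsafe fn dgemm_col_block(m: usize, k: usize, n: usize, split: usize,
                          a: &[f64], b: &[f64], c: &mut [f64]) {
    matrixmultiply::dgemm(m, k, n - split, 1.0,
                          a.as_ptr(), k as isize, 1,
                          b.as_ptr().add(split), n as isize, 1,
                          0.0,
                          c.as_mut_ptr().add(split), n as isize, 1);
}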

@bluss
Author

bluss commented Apr 11, 2016

Nice, so there are some easy wins possible by using views for the inputs to the multiplication.

@AtheMathmo
Owner

I suspect so :).

Sadly I need to do some code maintenance before I can get to work on this properly. I'll probably poke you at some point for some advice with that too...

Once I've wrapped that up I'll try to get this going. It should make the performance of most algorithms pretty acceptable!
