matrixmultiply #38
Hey! Thanks for creating this issue. I was planning on reaching out to you soon to ask about it too. I think it would be really good to use the matrixmultiply library! I had started playing around with some divide and conquer parallelism, and after wrapping up my current work I was going to compare the two. I'd definitely like to see how big the gains are compared to my current (in-the-works) parallelized approach. Do you have plans to add multithreading to the library? I'd be happy to try and help with that!
I didn't have any plans for that, even if the shape of it should be set up for it. I've already tried plugging in rayon, though. It seems it needs quite big matrices for it to pay off.
Hmm, I found that I had some nice performance gains even for matrices that were as small as n x 64. I imagine even without parallelism your library will achieve higher performance. I'd be happy to play around with parallelism a little in matrixmultiply, but for now I'll experiment with it as is. Before starting any work I need to break up the matrix module into smaller submodules (all the operator overloading etc. is in one file right now, and without macros!). I'll definitely need some help integrating it and will keep this issue open.
The packed matrix multiply will already be many times faster than a simpler approach, though, even at small sizes. (See for example the comparison in bluss/matrixmultiply#6 (comment); the "ref" version is a reference simple triple-loop matrix multiplication implementation.)
For tracking: I'm pretty convinced that swapping out the current multiplication methods for matrixmultiply is an easy and smart move. As for multithreading, I can experiment with divide and conquer which calls matrixmultiply at the base level.
Did a quick test using matrixmultiply.

Before:

```
test linalg::matrix::mat_mul_128_1000 ... bench: 14,231,183 ns/iter (+/- 833,129)
```

After:

```
test linalg::matrix::mat_mul_128_1000 ... bench: 3,352,754 ns/iter (+/- 412,408)
```

Pretty nice! This is without any divide and conquer parallelism as well (which I will need to investigate). The only thing holding me back now is that my current multiplication is generic, while matrixmultiply supports only f32 and f64. To solve this for now we could use http://rust-num.github.io/num/num/traits/trait.ToPrimitive.html, but I would still need to cover both f32 and f64. @bluss I guess this is what you were talking about with type-specific dispatch. Do you have any suggestions for handling this?
I'm not surprised it was a big improvement 😉

There's no perfect solution to type-specific dispatch without specialization (that young unstable feature). ndarray uses Any and TypeId to use the f32/f64 matrixmultiply for those types only, but it requires an Any trait bound. You can also define your own trait, but it would not be as inclusive as the current trait bounds.

I have experimented some with scoped_threadpool and rayon in matrixmultiply, maybe not with big enough matrices; it's easy to have overhead dominate. I think the sweet spot for dividing across multiple threads is the loops marked 3 and 2 in that code, which requires some extra coding since each thread then needs its own packing buffer.

Oh, by the way, there's another alternative: write a more generic version of matrixmultiply. It's not my favourite solution for several reasons, including compilation time; the matrix multiply would then need to be recompiled every time its usage site is recompiled.
I was going to see if I could get away with simply dividing up the matrix before feeding it to the BLIS algorithm. If that doesn't work I'll try to play with the algorithm itself, but the gains are big enough that I'm not super motivated to improve it :). I think it is best not to have a more generic version of matrixmultiply if it impacts compilation heavily. I'll take a look at how you solve the dispatch in ndarray; no point reinventing the wheel ;). I haven't played with Any or TypeId yet so that may be fun... Thanks!
I call the TypeId solution "ad-hoc specialization". It's crude but it works just fine.
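For anyone following along, the TypeId trick can be sketched in a few lines. This is a minimal illustration and not ndarray's actual code; the `dot` function and its trait bounds are made up for the example. The idea: compare `TypeId::of::<T>()` against the concrete type, and only reinterpret the data when they match.

```rust
use std::any::{Any, TypeId};

// Hypothetical generic entry point: take an f64 fast path when T is f64,
// otherwise fall back to a generic loop. The `Any` bound (hence `'static`)
// is the price of the TypeId check, as mentioned above.
fn dot<T>(a: &[T], b: &[T]) -> T
where
    T: Any + Copy + Default + std::ops::Add<Output = T> + std::ops::Mul<Output = T>,
{
    if TypeId::of::<T>() == TypeId::of::<f64>() {
        // Sound only because we just checked that T == f64.
        let af = unsafe { std::slice::from_raw_parts(a.as_ptr() as *const f64, a.len()) };
        let bf = unsafe { std::slice::from_raw_parts(b.as_ptr() as *const f64, b.len()) };
        let r: f64 = af.iter().zip(bf).map(|(x, y)| x * y).sum();
        // Convert the f64 result back to T via Any (T == f64 here).
        return *(&r as &dyn Any).downcast_ref::<T>().unwrap();
    }
    // Generic fallback for every other T.
    a.iter().zip(b).fold(T::default(), |acc, (&x, &y)| acc + x * y)
}

fn main() {
    // f64 takes the "specialized" branch; i32 takes the generic fallback.
    assert_eq!(dot(&[1.0f64, 2.0, 3.0], &[4.0, 5.0, 6.0]), 32.0);
    assert_eq!(dot(&[1i32, 2, 3], &[4, 5, 6]), 32);
    println!("ok");
}
```

In a real integration the fast branch would call into matrixmultiply rather than a hand-written loop; the dispatch shape is the same.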
Dividing up the matrix sounds just fine, actually.
(Just make sure the threads are not writing to the same output location in the C matrix.)
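To make that constraint concrete, here is a minimal sketch of a row-split parallel multiply. It uses `std::thread::scope` instead of rayon so it is self-contained, and a naive triple-loop kernel stands in for matrixmultiply; the names are illustrative. The point is that each thread gets a disjoint `&mut` slice of C via `split_at_mut`, so overlapping writes to the output are ruled out at compile time.

```rust
use std::thread;

// Naive row-major kernel: C (m×n) = A (m×k) · B (k×n).
fn matmul(a: &[f64], b: &[f64], c: &mut [f64], m: usize, k: usize, n: usize) {
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0;
            for l in 0..k {
                acc += a[i * k + l] * b[l * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}

// Split the output rows in two. Each thread owns a disjoint &mut slice of C,
// so the borrow checker statically prevents racing writes.
fn par_matmul(a: &[f64], b: &[f64], c: &mut [f64], m: usize, k: usize, n: usize) {
    let split = m / 2;
    let (c_top, c_bot) = c.split_at_mut(split * n);
    let (a_top, a_bot) = a.split_at(split * k);
    thread::scope(|s| {
        s.spawn(|| matmul(a_top, b, c_top, split, k, n));
        matmul(a_bot, b, c_bot, m - split, k, n); // current thread does the rest
    });
}

fn main() {
    let a = [1.0, 2.0, 3.0, 4.0]; // 2×2
    let b = [5.0, 6.0, 7.0, 8.0]; // 2×2
    let mut c = [0.0; 4];
    par_matmul(&a, &b, &mut c, 2, 2, 2);
    assert_eq!(c, [19.0, 22.0, 43.0, 50.0]);
    println!("ok");
}
```

Splitting along k instead would make both halves target the same C entries, which is exactly the case the warning above is about; that split needs a reduction (sum of partial products) rather than disjoint writes.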
Thanks for the source - that doesn't look too bad at all :). It could be a challenge to get the divide and conquer right - as you pointed out, providing the right views of
Pen and paper solves everything 😉
A quick and dirty implementation shows we can get some speedup:

```
test linalg::matrix::mat_mul_128_1000 ... bench: 3,154,849 ns/iter (+/- 243,448)
```

This is splitting along any axis with length greater than 256 (I tried 64 and 128 as well). This implementation also has huge overhead, as I'm copying data to split up the matrices (I need to rewrite it using MatrixSlice). I'm also just creating new
Cool. So you're the parallelism wizard! Which crate do you use for threading, by the way?
I'm not sure about wizard! It is a simple but inefficient implementation. I'm using rayon for the threading; it looks like this:

```rust
pub fn paramul(&self, m: &Matrix<T>) -> Matrix<T> {
    let n = self.rows();
    let p = self.cols();
    let q = m.cols();
    let max_dim = cmp::max(n, cmp::max(p, q));
    if max_dim < 256 {
        // If the largest dimension is below the threshold, just multiply.
        self * m
    } else {
        // Otherwise split along the longest axis.
        let split_point = max_dim / 2;
        if max_dim == n {
            // Split `self` along its rows.
            let top = self.select_rows(&(0..split_point).collect::<Vec<usize>>()[..]);
            let bottom = self.select_rows(&(split_point..n).collect::<Vec<usize>>()[..]);
            let (a_1_b, a_2_b) = rayon::join(|| Matrix::paramul(&top, m),
                                             || Matrix::paramul(&bottom, m));
            a_1_b.vcat(&a_2_b)
        } else if max_dim == p {
            // Split `self` along its cols and `m` along its rows.
            let a_left = self.select_cols(&(0..split_point).collect::<Vec<usize>>()[..]);
            let a_right = self.select_cols(&(split_point..p).collect::<Vec<usize>>()[..]);
            let b_top = m.select_rows(&(0..split_point).collect::<Vec<usize>>()[..]);
            let b_bottom = m.select_rows(&(split_point..p).collect::<Vec<usize>>()[..]);
            let (a_1_b_1, a_2_b_2) = rayon::join(|| Matrix::paramul(&a_left, &b_top),
                                                 || Matrix::paramul(&a_right, &b_bottom));
            a_1_b_1 + a_2_b_2
        } else {
            // max_dim == q: split `m` along its cols.
            let left = m.select_cols(&(0..split_point).collect::<Vec<usize>>()[..]);
            let right = m.select_cols(&(split_point..q).collect::<Vec<usize>>()[..]);
            let (a_b_1, a_b_2) = rayon::join(|| Matrix::paramul(self, &left),
                                             || Matrix::paramul(self, &right));
            a_b_1.hcat(&a_b_2)
        }
    }
}
```

The basic idea of the algorithm is described here: split in half along the largest axis, then bring the results back together.
Nice, so there are some easy wins possible by using views for the inputs to the multiplication.
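To illustrate what views buy here: a sub-view is just an offset into the backing storage plus new extents, so the divide step of the recursion copies nothing. This is a hypothetical `MatView` type made up for the example, not rulinalg's `MatrixSlice` API.

```rust
// A borrowed, non-owning view over a row-major matrix. Splitting one is
// O(1): no `collect`, no copies, just pointer arithmetic on the slice.
#[derive(Clone, Copy)]
struct MatView<'a> {
    data: &'a [f64],
    rows: usize,
    cols: usize,
    row_stride: usize, // distance between row starts in the backing slice
}

impl<'a> MatView<'a> {
    fn get(&self, i: usize, j: usize) -> f64 {
        self.data[i * self.row_stride + j]
    }

    // Top/bottom halves at row r: the bottom view just starts r rows in.
    fn split_rows(self, r: usize) -> (MatView<'a>, MatView<'a>) {
        let top = MatView { rows: r, ..self };
        let bottom = MatView {
            data: &self.data[r * self.row_stride..],
            rows: self.rows - r,
            ..self
        };
        (top, bottom)
    }

    // Left/right halves at column c: same stride, base offset by c.
    fn split_cols(self, c: usize) -> (MatView<'a>, MatView<'a>) {
        let left = MatView { cols: c, ..self };
        let right = MatView { data: &self.data[c..], cols: self.cols - c, ..self };
        (left, right)
    }
}

fn main() {
    let data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]; // 2×3, row-major
    let m = MatView { data: &data, rows: 2, cols: 3, row_stride: 3 };
    let (_, bottom) = m.split_rows(1);
    let (_, right) = m.split_cols(1);
    assert_eq!(bottom.get(0, 0), 4.0); // row 1, col 0 of the original
    assert_eq!(right.get(1, 1), 6.0);  // row 1, col 2 of the original
    println!("ok");
}
```

The same shape (pointer, extents, strides) is what matrixmultiply's gemm entry points accept, which is why feeding it views of the split operands directly should avoid the copying overhead mentioned above.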
I suspect so :). Sadly I need to do some code maintenance before I can get to work on this properly. I'll probably poke you at some point for some advice with that too... Once I've wrapped that up I'll try to get this going. It should make the performance of most algorithms pretty acceptable!
matrixmultiply does f32 and f64 matrix multiplication quite competently for larger matrices. It's a simpler dependency than blas, so it might be nice to just use. It does need type-specific dispatch (unstable specialization, TypeId, or a specific trait) at this point, though.
What do you think, could I help you integrate with it, or are you developing something equivalent?