When working with CPU tensors on Linux, transparent huge pages (THP) can provide a significant speedup. For example, I see a 15% increase in speed for my code when I turn THP on. However, on many distributions such as Ubuntu, the default THP mode is "madvise", which means madvise must be called with the proper flags on each memory segment we want THP for. NumPy enables THP via madvise for any array of 4 MB or larger on Linux 4.6+ (initial commit: numpy/numpy@7180479). It would be great to have similar behavior in candle. I'm not exactly sure how this would be implemented, since CpuStorage takes a Vec that may have already been paged in, but calling madvise for all usages of CpuStorage in cpu_backend.rs would probably cover the broad strokes.
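For illustration, here is a rough sketch of what such a hint could look like in Rust. This is not candle's API; the function name, the 4 MB cutoff (borrowed from NumPy), and the hardcoded page size are all assumptions, and the `madvise` binding is declared directly rather than pulled from the `libc` crate to keep the example self-contained:

```rust
use std::ffi::c_void;

// Linux-only advice value for THP, from <linux/mman.h> (assumption: Linux 4.6+).
const MADV_HUGEPAGE: i32 = 14;
// Minimum allocation size before we bother asking for huge pages,
// matching NumPy's 4 MB threshold.
const THP_THRESHOLD: usize = 4 << 20;

extern "C" {
    fn madvise(addr: *mut c_void, length: usize, advice: i32) -> i32;
}

/// Hint the kernel that `buf` should be backed by transparent huge pages.
/// madvise requires a page-aligned address, so round the start of the
/// buffer down to its containing page and extend the length to match.
fn advise_huge_pages(buf: &mut [u8]) {
    if buf.len() < THP_THRESHOLD {
        return;
    }
    let page = 4096usize; // could be queried via sysconf(_SC_PAGESIZE)
    let start = buf.as_mut_ptr() as usize;
    let aligned = start & !(page - 1);
    let len = buf.len() + (start - aligned);
    unsafe {
        // madvise is only a hint; ignore failures (e.g. THP disabled).
        let _ = madvise(aligned as *mut c_void, len, MADV_HUGEPAGE);
    }
}

fn main() {
    // An 8 MB buffer, large enough to cross the hypothetical threshold.
    let mut v = vec![0u8; 8 << 20];
    advise_huge_pages(&mut v);
    println!("advised {} bytes", v.len());
}
```

The catch mentioned above still applies: by the time a `Vec` reaches `CpuStorage`, its pages may already be faulted in as small pages, so the hint helps most when applied right after allocation, before the buffer is written.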
Interesting, could you maybe provide a way to replicate your 15% speedup? I'm pretty curious about which parts actually get accelerated by THP: whether it's more the loading of the tensors vs. the actual ops, and if it's the ops, which ones benefit the most from it.
Here are some operations and their speeds without THP (left) and with THP (right):

- Tensor::ones((5000, 5000), ...): 22 vs. 63 iters/sec
- a + a, where a is a 5,000x5,000 tensor: 19 vs. 42 iters/sec
- a.matmul(a), where a is a 5,000x5,000 tensor: 1.65 vs. 1.73 iters/sec