This feature arguably falls outside the scope of the library, since the problem really sits two dependencies down the food chain, but I will suggest it anyway because I think it makes a difference in quality of life for users. The linear algebra backend called by NumPy sometimes multithreads very naively: a single small matrix multiplication can be parallelized across all cores of the machine, so the computation takes longer while occupying more cores. I am running on a 20-core machine, and computing the GW distance between two 100x100 matrices engages all 20 cores and takes 20 ms; forcing NumPy to use a single thread brings it down to about 12 ms.
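For reference, a minimal sketch of how such a comparison can be made, assuming `threadpoolctl` is installed. The plain matrix product is only a stand-in for the GW computation, not the library's actual code path, and `time_matmul` is a hypothetical helper; timings will differ by machine and BLAS build.

```python
import time

import numpy as np
from threadpoolctl import threadpool_limits


def time_matmul(a, b, repeats=50):
    """Return the mean wall-clock time of a @ b in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    return (time.perf_counter() - start) / repeats * 1e3


rng = np.random.default_rng(0)
a = rng.standard_normal((100, 100))
b = rng.standard_normal((100, 100))

# Whatever thread count the BLAS backend picks on its own.
default_ms = time_matmul(a, b)

# Same work with every BLAS pool capped at one thread.
with threadpool_limits(limits=1, user_api="blas"):
    single_ms = time_matmul(a, b)

print(f"default: {default_ms:.3f} ms, single thread: {single_ms:.3f} ms")
```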
This problem is quite subtle to debug, because NumPy (or really its BLAS backend) launches 20 threads at the start of the program no matter how large the matrices are, and then decides dynamically whether to use all of them or leave some idle. This is a clever design decision, since the cost of creating the threads is paid once up front, but it makes parallelization-related performance issues confusing to debug: the system debugger reports 20 threads as active when in reality 1 is active and 19 are idle.
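One way to see what the backend has configured, rather than what the debugger shows, is `threadpoolctl.threadpool_info()`. This is a small sketch assuming `threadpoolctl` is installed; note that `num_threads` is the pool's configured maximum, not the number of threads busy at any given moment, so it does not by itself resolve the active-versus-idle question.

```python
import numpy as np  # importing NumPy loads its BLAS backend
from threadpoolctl import threadpool_info

# One dict per native thread pool found in the current process.
pools = threadpool_info()

for pool in pools:
    # user_api is e.g. "blas" or "openmp"; internal_api names the implementation.
    print(pool["user_api"], pool["internal_api"], pool["num_threads"])
```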
My feature request is that you integrate a tool such as joblib/threadpoolctl into the library and provide options to the user to control or limit thread-level parallelization, together with sensible defaults.
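To make the request concrete, here is a rough sketch of the kind of option I have in mind. `run_with_thread_limit` and its `max_threads` parameter are hypothetical names, not an existing API of this or any other library, and the default of 1 is only an assumption about what a sensible default might be. `threadpoolctl` is treated as an optional dependency, so the call falls through unchanged when it is missing.

```python
from contextlib import nullcontext

try:
    from threadpoolctl import threadpool_limits
except ImportError:  # threadpoolctl is optional in this sketch
    threadpool_limits = None


def run_with_thread_limit(func, *args, max_threads=1, **kwargs):
    """Call func(*args, **kwargs) with BLAS thread pools capped at max_threads.

    Hypothetical helper for illustration only. Passing max_threads=None
    leaves the backend's own default untouched.
    """
    if threadpool_limits is None or max_threads is None:
        ctx = nullcontext()
    else:
        ctx = threadpool_limits(limits=max_threads, user_api="blas")
    with ctx:
        return func(*args, **kwargs)


# Any callable works; a library would wrap its own solver entry points.
print(run_with_thread_limit(sum, [1, 2, 3]))
```

The limit is scoped to the call, so the user's global BLAS settings are restored afterwards.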