Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restore the number of threads for OpenBLAS to what it used to be on 1.6.4 and 1.7. #3996

Closed
KristofferC opened this issue Dec 2, 2021 · 6 comments · Fixed by #3997, #3999 or #4000
Closed

Comments

@KristofferC
Copy link
Member

KristofferC commented Dec 2, 2021

As a "regression test" on 1.8 the number of threads on OpenLAS was experimentally bumped to a large number (JuliaLang/julia#42584):

My thought was that since we are early enough in the 1.8 cycle, we can see how things feel for now with the large limits and get some regression testing in, and introduce the 32 max threads a little closer to release (or at least wait a few weeks).

However, this thread bump has managed to sneak into both the LTS and the latest 1.7 release(!). This has caused multiple regressions. Looking at #3732, the reason seems to be that there is a shared file OpenBLAS/common.jl that determines a bunch of settings for ALL builds and the thread bump then caused every successive OpenBLAS builds (even those with bug fixes for old OpenBLAS versions), to use this experimental high thread count.

This needs to be put back to the original thread count, new OpenBLAS releases need to be built and we need to release 1.7.1 and 1.6.5 with the fix.

As a note, I think the presence of this common.jl file is quite dangerous for reproducibility reasons. Imo, each build should be more or less contained and there should not be stuff like default arguments that are updated for every build.

cc @ViralBShah

@giordano
Copy link
Member

giordano commented Dec 2, 2021

As a note, I think the presence of this common.jl file is quite dangerous for reproducibility reasons. Imo, each build should be more or less contained and there should not be stuff like default arguments that are updated for every build.

Maintainability is also a factor though: we have to keep something like 6 (or 9 if including the HighCoreCount variant) builds at the same time, having each of them with their own script is error-prone. LLVM is even worse, keeping each version with their own independent build script (as we used to do before) and remember to propagate all changes everywehre is really hard.

That said, introducing new features without changing the current default behaviour would be a sensible approach.

@KristofferC
Copy link
Member Author

I guess the common file could be there but not have any default arguments (so that each build specifies that fully). Although, that doesn't protect against changes in the actual script, it avoids the case where default input arguments are changed at least.

@fingolfin
Copy link
Member

To be clear, the following is not intended to assign "blame" to anyone, just to reflect on what happened: I think the crucial mistake was to modify the default argument value (and to accept it past review); I am not sure why that was done instead of just actually passing alternate values? After all, it exists precisely to allow overriding it.

Of course mistakes happen: I can absolutely see myself both of making that mistake, and also overlooking it in a review. So, what can we do to reduce the risk of mistakes?

Perhaps we can somehow detect the situation were a common.jl file (i.e. a jl file which is not a builder recipe itself) is modified, and in that case, put up a big fat warning somewhere "careful, your change might affect other variants of this recipe. Are you sure this is OK? Consider guarding your changes so that they only affect the variants you care about, but not others.". That would not be able to prevent any and all mistake, but perhaps it would have helped here?

@ViralBShah
Copy link
Contributor

Having a common.jl, but passing defaults from each openblas version is the right way forward.

@ViralBShah
Copy link
Contributor

ViralBShah commented Dec 2, 2021

It is also likely the case that the openblas version we have on 1.8-dev is newer enough and better at dealing with more threads than the versions of openblas that we ship with 1.6 and 1.7.

In any case, we should add a test to linear algebra to test lu on a matrix of size 2000. I am a bit surprised that nobody ran into this in the RC process, and it works fine on my mac. (Note: I was only trying Float64 that doesn't seem to trigger this).

@giordano
Copy link
Member

Fixed by #3997, #3999, #4000

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment