

Improve switch time complexity for constant cases #862

Closed
wants to merge 13 commits into from

Conversation

XmiliaH
Contributor

@XmiliaH XmiliaH commented May 10, 2024

First version of an improvement to the switch statement to change the time complexity from O(n) to O(log n) (see #860).

This currently does not pass the debug tests because of the extra code generated in the main chunk.

@XmiliaH
Contributor Author

XmiliaH commented May 18, 2024

Some notes for the current implementation.

The generated bytecode is not stable due to the use of Lua tables. I do not know how important this point is.

It would also be possible to move the lookup-table initialization into the chunk. This would remove about two instructions from the switch for the initialization test. However, it would initialize all switches, even ones that are never used.

@XmiliaH XmiliaH marked this pull request as ready for review May 18, 2024 17:41
@Sainan
Member

Sainan commented May 18, 2024

Rebased this branch because we treat Git as a tree, not a directed graph.


The switchimpl function is now incredibly long, so I'm assuming you're taking ownership and I can ping you if there are any issues with it? :P


The generated bytecode is not stable due to the use of Lua tables. I do not know how important this point is.

Could you elaborate on this some more? Would this PR cause issues with the upgrade to Lua 5.5?

@XmiliaH
Contributor Author

XmiliaH commented May 18, 2024

You can ping me when there are issues with the switch implementation.

By "not stable" I mean that string.dump will produce different results on different runs for functions with switch statements, due to the differing order of the table initialization.
I use a Lua table during compilation to gather all the constants, but Lua table traversal order is not defined, which can result in different orderings.

@Sainan
Member

Sainan commented May 18, 2024

I see. That does sound a bit ugly, but I also can't imagine a better solution than putting a std::stack<std::map<...>> on LexState.

@XmiliaH
Contributor Author

XmiliaH commented May 18, 2024

I could use a std::vector in the switchimpl function, but that would just be there for the ordering. So if a stable output is important, this should be simple to add.

@Sainan
Member

Sainan commented May 18, 2024

You cannot stack-allocate such a thing because RAII is not guaranteed here.

@XmiliaH
Contributor Author

XmiliaH commented May 18, 2024

You are right, then I need to look into case_pc too as it does this currently.

@Sainan
Member

Sainan commented May 18, 2024

Just tested this extensively against our integrated Pluto environment, and no obvious crashes or bugs at least.

Tho, it seems to actually decrease performance in a contrived test like this:

local start = os.clock()
local cc = 0
for i = 1, 100000000 do
  local value = 3
  switch value do
    case 1:
    case 2:
    case 3:
    cc = cc + 1
    break
    case 4:
    case 5:
    break
  end
end
print(os.clock() - start)
return print(cc)

@XmiliaH
Contributor Author

XmiliaH commented May 18, 2024

Yes, with the table lookup, check for initialization, and other overhead the binary search will only be used from around 16 cases upwards.
Try with the switch speed test:

local N = 2000
local code = "return function(val)\nswitch val do"
for i=1, N do
    code = code .. "\ncase " .. i .. ".5: return " .. i
end
code = code .. "\ndefault: return -1\nend\nend"
local f = load(code)()
-- Initialize, this might take more time
assert(f(0) == -1)
for i=1, N do
    assert(f(i+0.5) == i)
end
assert(f(0) == -1)

@Sainan
Member

Sainan commented May 18, 2024

The problem is that I think >95% of switch blocks are gonna have <16 cases.

@XmiliaH
Contributor Author

XmiliaH commented May 18, 2024

And are the other stack-allocated std::vector instances in the parser then not also buggy?

@Sainan
Member

Sainan commented May 18, 2024

That depends. If there is a chance a Lua error could be thrown in their scope, then it would not be deallocated when a longjump is used.

@XmiliaH
Contributor Author

XmiliaH commented May 18, 2024

The problem is that I think >95% of switch blocks are gonna have <16 cases.

Yes, but they are hard to improve. When multiple case values map to the same case body, the crossover can happen earlier.

int average_time_linear = num_consts / 2;
int average_time_table = luaO_ceillog2((unsigned int)case_pc.size()) + 4; /* Load and check cache, load value, check nil case, then do the binary search */
if (average_time_linear <= average_time_table) {
  /* Not worth the effort, just do a linear scan */
}
That depends. If there is a chance a Lua error could be thrown in their scope, then it would not be deallocated when a longjump is used.

Yes, that could happen in almost all the cases I saw, so I assumed that Pluto is using exceptions and not longjmp.

@Sainan
Member

Sainan commented May 18, 2024

Yes, but they are hard to improve. When multiple case values map to the same case body, the crossover can happen earlier.

I'm not asking for you to pull a miracle and make them better, but it's bad when this change makes them worse.

Yes, that could happen in almost all the cases I saw, so I assumed that Pluto is using exceptions and not longjmp.

We do use exceptions instead of longjmp by default, but longjmp is still an option, and the goal is not to leak memory in that case either. It's possible some older code was unaware of this pitfall and unfortunately still has this issue.

@XmiliaH
Contributor Author

XmiliaH commented May 18, 2024

I fixed the memory leak issues with std::vector in the switch statement; I still need to check std::function.
I will also look into the speed decrease in the example you mentioned.

@Sainan
Member

Sainan commented May 18, 2024

I would advise just using a void* for user-data and casting it as before instead of std::function.

Also nothing wrong with std::vector, as long as it's somewhere on LexState, because that does actually get cleaned up in all cases.

@XmiliaH
Contributor Author

XmiliaH commented May 18, 2024

When putting the vector on LexState, it would need to handle reentrancy. Also, the tables use the correct allocation function from the Lua state.

@XmiliaH
Contributor Author

XmiliaH commented May 18, 2024

I guess the speed decrease in the simple case is due to the switch always starting with a jump and not with a conditional jump like before. This results in one extra instruction per loop iteration. I will look into emitting the first case in place and jumping on false to the remaining logic.

@Sainan
Member

Sainan commented May 18, 2024

Still a pessimisation, which really shouldn't happen since you seem to have logic to "not bother" for small switches? Maybe this patch would have been better if it had focused on adding the new functionality without effectively rewriting the whole thing.

@XmiliaH
Contributor Author

XmiliaH commented May 18, 2024

When just looking at

local value = 3
switch value do
  case 1:
  case 2:
  case 3:
  cc = cc + 1
  break
  case 4:
  case 5:
  break
end

I get the same code from main and this branch now, so I have no clue where the slowdown would come from.

@Sainan
Member

Sainan commented May 19, 2024

You are able to reproduce it, tho, right?

@XmiliaH
Contributor Author

XmiliaH commented May 19, 2024

No, I am not able to reproduce the slowdown. Over multiple runs sometimes the main branch is faster, sometimes this one.

@Sainan
Member

Sainan commented May 19, 2024

Well, it seems to only occur on MSVC, and I guess it is "within margin of error," but now I've tried a benchmark where your implementation should undoubtedly be faster, and again it's the same speed at best:

local start = os.clock()
local cc = 0
for i = 1, 100000000 do
  local value = 20
  switch value do
    case 1: break
    case 2: break
    case 3: break
    case 4: break
    case 5: break
    case 6: break
    case 7: break
    case 8: break
    case 9: break
    case 10: break
    case 11: break
    case 12: break
    case 13: break
    case 14: break
    case 15: break
    case 16: break
    case 17: break
    case 18: break
    case 19: break
    case 20: cc = cc + 1 break
  end
end
print(os.clock() - start)
return print(cc)
  • main
    • MSVC: ~3.4
    • Clang: ~3.4
  • improve-switch
    • MSVC: ~3.6
    • Clang: ~3.4

Or am I misunderstanding what your optimisation is supposed to do?

@XmiliaH
Contributor Author

XmiliaH commented May 19, 2024

With the 20 cases, average_time_linear and average_time_table are both 9, so it will still use the linear lookup. Either add a case 21 or remove the break from all cases but 19.

@Sainan
Member

Sainan commented May 19, 2024

Okay, now we have ~3.5 on main, and ~2.4 on this branch (using Clang). That is cool, but...

  • It's an optimisation targeted at a very small minority of switch blocks.
  • It is slower on MSVC in the remaining (overwhelming) majority of switch blocks.
  • At the cost of severely increasing the code complexity of the parser's switch functions.

@XmiliaH
Contributor Author

XmiliaH commented May 19, 2024

I don't know why the same bytecode, with changes only to the parser, makes the interpreter slower on MSVC. Maybe something gets aligned differently, which reduces the speed. But that could happen with any change.

As for complexity versus benefit, that is something you need to decide. I have no problem with closing this. I had my fun implementing it and am fine with not maintaining it any further.

@well-in-that-case
Member

well-in-that-case commented May 20, 2024

This is a very impressive optimization; however, I share some of the concerns already raised in this thread. The biggest concern is the maintainability of the switch statement. The complexity increase is great enough that the lessons learned over time with our switch statement (the bugs fixed, the time-tested factor, the overall polish applied) would be reset. While the performance increase is quite good, the scope in which it applies is too limited to justify those costs.
