Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Current state of multi-threading... #145

Open
nrader95 opened this issue Dec 1, 2021 · 12 comments
Open

Current state of multi-threading... #145

nrader95 opened this issue Dec 1, 2021 · 12 comments
Labels
enhancement New feature or request

Comments

@nrader95
Copy link
Contributor

nrader95 commented Dec 1, 2021

When multi-threading with default runner, it seems there is big overhead for me. When i profiled with Unity profiler, system updates sometimes can take twice more time to execute for some reason. (and it was in middle of various methods, both mine and ecs methods(including entity.Get<>, for example)). I tried to write my own IParallelRunner based on System.Threading.Task<T> and result was the same. So if degree of paralelism is low, it can actually result in loss of performance.

Then i tried to use raw Threads from System.Threading instead of Tasks, and overhead was finally gone. So using Tasks seems to be the bad way in at least some cases.

Even if overhead i'm observing has anything to do with syncronization, current limitations on multi-threading for working with framework already prohibit changing entity composition. There is still that scenario where you write data into component after getting it from 2+ threads, but that's not always the case.

So i think you should consider changing default runner, or maybe alternative fast&unsafe one for safe-write systems.

@Doraku
Copy link
Owner

Doraku commented Dec 1, 2021

eh really? new Task(Update, index, TaskCreationOptions.LongRunning) from DefaultParallelRunner is supposed to create dedicated threads for the dotnet framework implementation but I guess this is not true for the mono implementation used in Unity. I'll see if I can change the implementation for targets above netstandard1.1 (which does not have Thread hence why I stuck with Task).

@Doraku Doraku added the enhancement New feature or request label Dec 1, 2021
@nrader95
Copy link
Contributor Author

nrader95 commented Dec 1, 2021

Hmm perhaps i was a bit too quick about saying Threads are good.

Because it seems my updates on threads were not time-synced and thus accessed data at a bit different times.
Because when i finally writed my thread-based runner to get rid of this small time-desync, surprise, overheads are back.
And a main thread that just so happened to run its chunk of data almost after threads had overhead on first few entities that time-overlapper with thread data processing, but did not have it after.

@Doraku
Copy link
Owner

Doraku commented Dec 1, 2021

Keep in mind that if the number of entities is too low using multithread can be detrimental because of the overhead of splitting/synchronization, hence the minEntityCountByRunnerIndex parameter of the system constructors. Since this number is dependent on your actual system process and the degree of parallelism of your runner, that's why I didn't put any default but maybe I should put 100 or something just in case.

Doraku added a commit that referenced this issue Dec 1, 2021
@Doraku
Copy link
Owner

Doraku commented Dec 1, 2021

I pushed the change to use thread just in case, can you check if it changes anything?

@nrader95
Copy link
Contributor Author

nrader95 commented Dec 2, 2021

Tried it just now, no luck, so it seems to be really linked to operation running at the same time.

And another weird thing - overhead is bigger, the more cores i add.

UPD: and i observe it in both linux and windows Unity.

@Doraku
Copy link
Owner

Doraku commented Dec 2, 2021

I will probably roll it back then. How many entities do you have for your tests?

@nrader95
Copy link
Contributor Author

nrader95 commented Dec 2, 2021

Yeah its should be rolled back.

The only idea i have at the moment is that deep profiler is messing with me, but i would need to write some heavy-lift test code to see difference in performance without it.

Doraku added a commit that referenced this issue Dec 3, 2021
…d for targets above netstandard1.1 (may help with #145)"

This reverts commit 6e539f2.
@nrader95
Copy link
Contributor Author

nrader95 commented Dec 5, 2021

Okay, so i tried to measure performance without using Unity deep profiling, and it seems deep profiler really was the culprit here for weird data (now i compared time since between PreUpdate and PostUpdate)

So here is results i obtained with different runners but on same dataset (30 units (root entities) and the system updated its children entities(didnt count how much per units, but anyway, it always stays the same for same unit) local positions using quaternions and vector3's):

DefaultParallerRunner (your usual task version) is indeed has troubles with performance, it's basically same as single-thread or worse on this dataset & entities count.
My dumb task runner that basically comes down to "start the tasks, calculate own chunk of runnable in single thread, then wait for tasks in case they still didnt finish" showed the same results, so Task is indeed has its problems on Unity.

My thread runner showed performance improvement on most of frames, although not to the degree i was hoping (around 2x on most frames, the problem is i specified 4 cores for runner), so there is still room for improvement i guess.

What surprised me is that your Thread-version of DefaultParallerRunner didn't fare better then it's Task version, so there are some other overheads inside DefaultParallerRunner it seems.

@nrader95
Copy link
Contributor Author

By the way, another thing worth considering: currently, the entity count threshold for triggering runner usage with minEntityCountByRunnerIndex scales with the CPU core count.
Which means, you cant predict maximum number of entities in a single-threaded execution before runner is triggered in compile time, since users have different CPUs with different number of cores.

@Doraku
Copy link
Owner

Doraku commented Feb 11, 2022

What surprised me is that your Thread-version of DefaultParallerRunner didn't fare better then it's Task version, so there are some other overheads inside DefaultParallerRunner it seems.

Not sure if related but maybe unity doesn't like the ManualResetEventSlim used internally for synchronization. I know that at the beginning I was using ThreadLocal<> which had terrible performance on mobile.

By the way, another thing worth considering: currently, the entity count threshold for triggering runner usage with minEntityCountByRunnerIndex scales with the CPU core count.
Which means, you cant predict maximum number of entities in a single-threaded execution before runner is triggered in compile time, since users have different CPUs with different number of cores.

It is kinda hard to find a middle ground on this. The more cores you have, the more you will pay for synchronisation but the more you can process in parallel, ending in a net gain past a certain number of entities. I choose (from gut and nothing else so I may be completely wrong!) that the minimum number of entities to be processed per core for it to be beneficial is less dependent on the actual number of cores instead of the total number of entities to be processed. Add that every systems are different with their own threshold.

@nrader95
Copy link
Contributor Author

By the way, another thing worth considering: currently, the entity count threshold for triggering runner usage with minEntityCountByRunnerIndex scales with the CPU core count.
Which means, you cant predict maximum number of entities in a single-threaded execution before runner is triggered in compile time, since users have different CPUs with different number of cores.

It is kinda hard to find a middle ground on this. The more cores you have, the more you will pay for synchronisation but the more you can process in parallel, ending in a net gain past a certain number of entities. I choose (from gut and nothing else so I may be completely wrong!) that the minimum number of entities to be processed per core for it to be beneficial is less dependent on the actual number of cores instead of the total number of entities to be processed. Add that every systems are different with their own threshold.

Maybe, tho currently i'm using a system which trigger runner after fixed amount of enities in filter and it seems at least not worse.

I've been thinking tho: Since we cant determine overall net gain from threading with all cores due to combined sync time & actual gain balance, maybe its worth trying non-fixed number of threads used. This way, you can specify how many entities thread has to lift from main one workload to justify the sync time for add one more thread.

Something like this: we set it to 100 entities for example, so when you have 200 you use 2 thread(main one included), for 300 - three, and so on until you hit your thread cap specified inside the runner's DegreeOfParallelism
This way, the perf gains will always be bigger than sync times, assuming the system was configured properly.

@hhyyrylainen
Copy link

I just ran into this situation when trying to enable threading. I had just finished converting basically all of my game logic to systems and entities. I then turned on threading for all systems that didn't do operations only allowed on the main thread, and to my surprise I lost about 50% of the performance. So then searching a bit I found this issue. Turns out that the documentation doesn't mention minEntityCountByRunnerIndex at all (on this page where I read about multithreading: https://github.com/Doraku/DefaultEcs/blob/master/README.md#threading), which I think is pretty much key to solving my issue. So at the very least it'd be great if the main readme mentioned the threading caveats and that one constructor parameter that should be used to tweak things.

Something like this: we set it to 100 entities for example, so when you have 200 you use 2 thread(main one included), for 300 - three, and so on until you hit your thread cap specified inside the runner's DegreeOfParallelism
This way, the perf gains will always be bigger than sync times, assuming the system was configured properly.

Yes, this would be awesome. This is how I did threading before switching over to DefaultEcs and I think it had a lot more consistency and ease of use. Now I need to try do something like estimating the number of entities that use each system, pick the number of threads based on that that are reported to DefaultECS and then also use minEntityCountByRunnerIndex to limit systems that have a lot less entities to process than other systems. So I basically need to do things in reverse to nudge things so that a reasonable number of entities to process per thread is selected. The good news is that based on my initial testing it is possible to get more performance from multithreading than the thread delegating overhead adds, but it requires quite a bit of manual tweaking.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants