New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regionized Entity (and Tile) Ticking #1001

Open
aikar opened this Issue Feb 7, 2018 · 12 comments

Comments

8 participants
@aikar
Copy link
Member

aikar commented Feb 7, 2018

Goal

This issue is to serve as a note holder and idea registry for an idea I had today on being able to tick entities in a parallel fashion, safely (at least, safe enough) - Please read entire issue before you go jumping "THIS ISN'T SAFE!!!"

This design is intended to benefit servers that have players more spread out where they are nowhere near each other.

The goal of this idea is to not be 100% thread safe, but to do things that under most conditions, would be practically safe, that in realistic scenarios are unlikely to cause issues.

Implementation Design

Break world ticking into a 3 stage process

Stage 1: doTick method - All World Operations that would happen for each world pre entity/tile entity ticking, sequentially, no parallelism, as it happens today.
Stage 2: Iterate all players Chunk Maps, iterate the chunks that they are participating in.

  • If none of their chunks are shared with another player, nor the chunk doesn't border a chunk that's also part of another region, create a region
  • If the chunk either is already part of a region, or borders a chunk as part of a region, that entire players chunk map joins the other region
    • This ensures that no 2 regions will ever border each other. All connected chunks are part of the same region.
  • Once all players chunk maps are marked and regions created, scan all chunks and look for any outlier "floating" chunks that are not part of any players chunk map (these are pending unload). Find nearest region and join that chunk to that region.
    • Consider: Config options to simply not even worry about outliers and skip ticking them? This is likely desirable.
  • Each region should create and immediately acquire a write lock on itself. This will be used by notes in Concurrency section at bottom.
  • Once all chunks are marked to a region, iterate the world entity and tile entity list (this is important to use the world list and not the chunk list in order to keep ticking order correct), and find their current chunk and then by retrospect their region, and build an Entity/Tile Entity tick list for that region. (and likely need the "Global" list which is used by lightning ?)
  • Dispatch every region to a Region Thread pool (configurable # of threads, default to # of cores)
  • Each of these threads should set a static ThreadLocal for "isSafeThread", to true
    • this system should be generic so we can reuse it for other sections of paper.
  • Potentially add a "safer" config option, that creates a Region Thread Pool per world with a max of 1 thread, allowing us to use 1 global region but tick the worlds in parallel, but leaving the worlds itself in parallel.
  • Entity Addition/Removal Lists need to be made concurrent, or per-region instead of per world, to avoid concurrency issues with multiple threads adding to them
  • The main thread will await on the region thread pools for them to fully flush and empty. This will keep the main thread paused until all regions are done executing.
  • Repeat all of this, but for Tile Entities. There is no benefit to doing this at the same time as Entities, as we would hopefully already be maximizing CPU potential, and will just be creating more queued work. This would also increase the likelihood of behavioral changes if these happened at the same time.
  • The main thread will await on the region thread pools for them to fully flush and empty. This will keep the main thread paused until all regions are done executing.
  • flush any per-region add/removal list into the world list. unset the region markers on every entity and tile entity

Stage 3: All post tick entity operations per world such as Entity Tracker, done sequentially, with no parallel (we can investigate applying parallel behavior to Entity Tracker in a later change)

Concurrency Concerns

During Parallel Region Execution, the main thread will be suspended, blocked waiting for the thread pools to completely empty.

This reduces risk of concurrency issues down to if a plugin itself acts on an event based on an Entity or Tile Entity, and then finds a relationship or scans the entire world for another entity, that happens to be in a different region.

This scenario is of course possible to occur, but not going to be super common.

If a plugin does end up responding to 1 entity event by acting on a completely different entity event, we then have to consider what is the plugin doing to the other entity.

I see it as very unlikely that even if a plugin does act on another regions entity, that it would even do something that would have a negative outcome.

The closest risky scenario I could see is:

  • Region 1 and a Region 2 Creeper is exploding at the exact same tick. Plugin listens to event, and teleports creeper 1 to the same location as creeper 2.
  • Both creepers now are going to explode at the same location, modifying the world,

This is the worst case scenario I can see, as you have 2 threads operating in the same location of the world, both removing blocks, and an entity may be receiving damage while mid its own tick, causing weird results.

However, the likelyhood that the server would crash here should be pretty minimal. What ultimately would come out of this bad scenario is behavioral differences.
The teleported creeper may of killed the attacking skeleton, that still fired an arrow, killing a player.
In a sequential scenario, creeper killed skeleton, skeleton never shoots arrow, player doesnt die.
In this risky scenario, both skeleton and player die.

Ultimately, you would of never even knew the outcome would of been different without regionized ticking.
Many server owners would be perfectly ok with this rare cased risk of behavior change.

In the event that state does get in a weird way, any error thrown would simply remove the entity, it would not crash the server.
Server owners could then make their own judgement to turn off this enhancement if its causing them issues.

Concurrency of Plugin Events

For any event fired on these threads, we should synchronize ensuring that no 2 events are firing at the exact same time.
This ensures that plugins do not have any local state being manipulated concurrently for any event they expected to be sync.

For plugins that are checking Bukkit.isPrimaryThread, we would update the isPrimaryThread to check the ThreadLocal isSafeThread mentioned in stage 3. If this boolean is set to true, we return true.

This will handle all of the Async Catcher operations too.

Entity Teleportation Mid Event - A resolution to teleporting Concerns (Creeper Scenario)

If an event causes an Entity to Teleport mid event, causing it to change regions, such as the Creeper scenario, we can even improve risk here by:

  • in Entity.setLocation, if the region marker is set (ie: we are mid tick), perform a post teleport region check to make sure the entity is still in the same region as before.
  • If the new location was not part of any region, consider it the same region
  • If the new location is in a chunk of another region, try to acquire the regions write lock object, which will result in blocking this regions execution until the new locations region has fully finished ticking. This isn't the "most idea", but it should be rare and trivial in delays compared to the bigger picture.
  • This avoids having 2 entities tick in the same area at the same time, completely resolving the Creeper Scenario concern!

Closing / Feedback

This idea isn't "final". I am opening this now to collect feedback and more risk scenarios from the community so we can design ideas to solve them.

If we can pull this off, this will pretty much be the largest performance improvement we will ever do for the server.

There is no ETA on this. This is an idea I had, and I would love to implement myself, but I am also extremely busy with work for the next few months... so I don't know when.

I don't want to hear any "This won't work" or "This isn't safe". I perfectly know this isn't 100% safe. That's not my goal. My goal is practically safe.
Meaning the server runs as expected under vanilla expectations and players would never even notice anything difference.

Give me scenarios that could result in noticeable negative behavior that could be considered "Deal breaking" that would force people to keep this improvement off.

@electronicboy

This comment has been minimized.

Copy link
Member

electronicboy commented Feb 8, 2018

This honestly sounds look a cool concept, the practice of regions should keep us reasonably safe, especially as beyond teleporting, it's kinda rare that a plugin does something outside of its own "region" based upon some event.

The concept of regions should do a lot in terms of preventing issues around concurrency in "vanilla", only real concerns for me would be teleports (which you've covered).

my only real concern around this (beyond the obvious) is the plugin manager and operations such as teleporting where you intend to block.
With your proposal for handling teleporting it sounds like you might actually be causing a deadlock, if thread A calls an event and then teleports an entity and blocks waiting for thread B to finish processing the target region, you'd end up causing a deadlock of the two threads should an event try to be called from the other; This would fall onto my other concern, what is the potential that a server ends up in a state where region threads are blocking each other enough that the gains become negligible or lost? e.g. servers heavy with redstone/TEs would serve to see a massive bonus from this, but at the same point, these do cause a lot of events to be fired, could this have an impact on performance?

@aikar

This comment has been minimized.

Copy link
Member Author

aikar commented Feb 8, 2018

Good catch. We would need to detect that scenario and then avoid the extra block, and "hope for best"

Redstone would be part of block ticking and thats not in scope of this attempt.

TE's are a "phase 2" goal, I first want us to get just Entity Ticking done (batch TE's into Stage 3)

As Command Blocks in particular cause a lot of concerns for this ,as they can execute commands that move entities and world data.

But now I just realized we have Command Block Minecarts too... so orz thats an issue.

Looking for ideas on how to handle command blocks? I'm thinking we can skip them in region phase and do them in stage 3.

@Brokkonaut

This comment has been minimized.

Copy link
Contributor

Brokkonaut commented Feb 14, 2018

What about plugins that use custom entities or custom AI for entities? I know that's not API, but there are a lot of widely used plugins that do things like that.. (Citizens, pets, a lot of minigames, ...) And those plugins often have a global state that would not be accessed synchronized

@aikar

This comment has been minimized.

Copy link
Member Author

aikar commented Feb 14, 2018

@Brokkonaut Plugin events would be synchronized, which would be the primary way to react to those entities.

Timers for a entity would not be parallel.

any custom aI rules would be accessing local entity state for the most part.

The only way you could run into issues here is if the plugin extends PathfinderGoal NMS level, and inserts the custom object into the goal selectors, and then that causes that goal to be ticked in parallel.

And we can resolve that too, by making any custom goal selector (ie, selector is not located in the NMS package), synchronize on execution, preventing any 2 custom goal selectors from ticking at same time.
This is pretty trivial to do.

@khlfkk

This comment has been minimized.

Copy link

khlfkk commented Feb 14, 2018

Amazing, 2018 is a big change and hope year

@melinstagibson

This comment has been minimized.

Copy link

melinstagibson commented Mar 26, 2018

This sounds like an awesome idea, i would drop most of my plugins in favor of an improvement like this for my survival servers, maybe you can implement it as an experimental feature (if you do and there are issues)

@daboluo666

This comment has been minimized.

Copy link

daboluo666 commented Jul 9, 2018

Amazing! What's the progress now? I can't wait for it!

@aikar

This comment has been minimized.

Copy link
Member Author

aikar commented Jul 9, 2018

Sadly 0%, I'm very busy.

@M-AJ

This comment has been minimized.

Copy link

M-AJ commented Jul 17, 2018

I've attempted to implement (roughly) this in the past few days and fixed some small concurrency issues along the way, but a large issue remains: Chunk Loading.

Plugins have the ability to load chunks during events, which can be called during entity ticking. The chunk provider class and possibly the chunk would have to be made thread safe when creating chunks async and populating them to a map while another thread in parallel reads it. Accessing the chunk map is already a very hot method. Not sure if there is another alternative method without implementing some locking mechanism.

That's one notable issue I encountered on a ~100 player server for a few hours.

@aikar

This comment has been minimized.

Copy link
Member Author

aikar commented Aug 4, 2018

@M-AJ good catch. We would need to synchronize on chunk loading (but not getting if its already in map),

might have to handle delaying any unload called during this phase if a plugin did something really weird?

I removed some synchronization in 1.13, but maybe if we reduce how much sync is done by not using an object other stuff uses, we can reduce contention on main, especially compared to the gains received from PRT.

@aikar

This comment has been minimized.

Copy link
Member Author

aikar commented Aug 4, 2018

per irc discussion, players would need to be ticked outside of the parallel loop.

@GotoFinal

This comment has been minimized.

Copy link

GotoFinal commented Jan 31, 2019

If a plugin does end up responding to 1 entity event by acting on a completely different entity event, we then have to consider what is the plugin doing to the other entity.
I see it as very unlikely that even if a plugin does act on another regions entity, that it would even do something that would have a negative outcome.

I would say it can be pretty common (unless I misunderstood you here), maybe not all plugins do it, but I see many places where this can be used, especially plugins that adds own entities, like pets. And event about pet might affect other pets or owning player, even if they are not in same tick region.

For plugins that are checking Bukkit.isPrimaryThread, we would update the isPrimaryThread to check the ThreadLocal isSafeThread mentioned in stage 3. If this boolean is set to true, we return true.

But events will be still executed on different threads? isn't that risky due to how JVM/jit works? Especially if someone will register event with simple non-volatile variables that are modified (set) by event. If I'm not mistaken there is a chance that change in this variable will not be visible from other thread as it will be cached.

Tho we can always add a bit of black magic and cover most basic cases like that by transforming plugins code on load. But it does not sounds right.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment