Feature/save bulk transactions #827

Closed · RekkasGit wants to merge 3 commits

Conversation

RekkasGit (Contributor)

Note: I'm sorry this PR also includes the 'stuck' characters feature. I forgot to make a branch before opening the pull request, so the PR is on master and I didn't want to mess it up. I can strip that feature out if need be.

For this feature: I have enabled multi-statement results on the MySQL connection and changed the Save method in client.cpp to batch all operations into one query.
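(For reference, a minimal sketch of what enabling multi-statement support typically looks like with the MySQL C API; the PR's actual diff may differ, and the function name here is hypothetical:)

#include <mysql.h>
#include <string>

// Sketch: run a ';'-separated batch and drain every result it produces.
// Assumes the connection was opened with the CLIENT_MULTI_STATEMENTS flag, e.g.
//   mysql_real_connect(conn, host, user, pass, db, port, nullptr, CLIENT_MULTI_STATEMENTS);
bool RunBulkQuery(MYSQL *conn, const std::string &bulk_sql) {
	if (mysql_real_query(conn, bulk_sql.c_str(), bulk_sql.size()) != 0)
		return false;
	do {
		MYSQL_RES *result = mysql_store_result(conn);
		if (result)
			mysql_free_result(result); // UPDATE/DELETE statements return no row data
	} while (mysql_next_result(conn) == 0); // 0 means another result is pending
	return true;
}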

In very small benchmarks this is over 2x faster, and I believe under load it will be much higher. Right now my testing is under zero load and it's already 2x faster.
Example:
ZoneDatabase::SaveCharacterData_OldWay 685273, done... Took 0.013000 seconds
ZoneDatabase::SaveCharacterData_NewWay 685273, done... Took 0.005000 seconds

Note: two indexes need to be added to the pet tables, as part of the update hits these columns and the queries are currently doing full table scans.

ALTER TABLE character_pet_inventory
ADD INDEX char_idIndex (char_id);

ALTER TABLE character_pet_info
ADD INDEX char_idIndex (char_id);

Note: this is only a stop-gap measure, and while it limits "fsyncs" during transactions, it may have to be rethought. IMO a disruptor ring buffer plus a 1-second heartbeat is a better way to handle this: publish all bulk events to the ring buffer, where they are serialized and appended to a queue per character; either the 1-second event or the size of the buffer will 'flush' the data to MySQL. If multiple saves happen during that period (for whatever reason, simply camping a character causes 3x saves!), the last character save wins. So instead of hitting the database 3x, only 1x does.
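A rough sketch of that idea (all names hypothetical; a mutex-guarded map stands in for a real lock-free disruptor ring):

#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>

class SaveCoalescer {
public:
	// Publish a serialized bulk-save, keyed by character so the last save wins.
	void Publish(uint32_t char_id, std::string query) {
		std::lock_guard<std::mutex> lock(mutex_);
		pending_[char_id] = std::move(query);
		if (pending_.size() >= kMaxPending)
			FlushLocked();
	}

	// Driven by a 1-second heartbeat from the zone's main loop.
	void OnHeartbeat() {
		std::lock_guard<std::mutex> lock(mutex_);
		FlushLocked();
	}

private:
	void FlushLocked() {
		if (pending_.empty())
			return;
		// Send everything in one transaction so the batch costs one fsync,
		// e.g. database.QueryDatabase("START TRANSACTION"); ... "COMMIT";
		for (auto &entry : pending_) {
			// database.QueryDatabase(entry.second);
		}
		pending_.clear();
	}

	static const size_t kMaxPending = 256; // size threshold that also triggers a flush
	std::mutex mutex_;
	std::unordered_map<uint32_t, std::string> pending_;
};

With this in place, the three saves fired by a single camp collapse into one database write per character per flush window.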

zone/client.cpp Outdated
@@ -606,10 +607,118 @@ void Client::RemoveExpendedAA(int aa_id)
database.QueryDatabase(StringFormat("DELETE from `character_alternate_abilities` WHERE `id` = %d and `aa_id` = %d", character_id, aa_id));
}

//bool Client::Save(uint8 iCommitNow) {
Member

Don't do this. This code is saved in the version control software :P

RekkasGit (Author)

Ahh, I left that there for ease of switching back and forth between the old and new code. I assumed I would do a full cleanup before this was even accepted. This is my first PR to this project.

zone/tasks.cpp Outdated
@@ -280,7 +280,138 @@ bool TaskManager::LoadTasks(int singleTask)

return true;
}
std::string TaskManager::SaveClientStateQuery(Client *c, ClientTaskState *state)
Member

Ignoring the fact that VS lets this compile for some reason: this is also written in a way that may fail NRVO, and since this is a performance-focused commit, that is a very relevant point :P
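(For illustration only, not the PR's actual code: a common way NRVO is lost, and the shape that keeps it.)

#include <string>

// Returning different named locals on different paths blocks NRVO; the
// compiler must move/copy the (potentially large) string on return.
std::string BuildQueryNoNRVO(bool use_new_path) {
	std::string a = "...";
	std::string b = "...";
	if (use_new_path)
		return a;
	return b;
}

// A single named return object keeps NRVO applicable on every path; the
// string is constructed directly in the caller's storage.
std::string BuildQueryNRVO(bool use_new_path) {
	std::string query;
	query = use_new_path ? "..." : "...";
	return query;
}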

RekkasGit (Author)

Hopefully this is resolved with pass-by-reference.

mackal (Member) commented Mar 8, 2019

From a performance standpoint, I think it might be better for all of these to take an out parameter taken by reference. This would allow us to just continue to grow a single std::string object instead of having a bunch of independently growing std::string objects that ultimately get combined into a single std::string.
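A minimal sketch of that pattern (helper names hypothetical; StringFormat and the pet-table names appear elsewhere in this PR):

// Each helper appends into a caller-owned buffer instead of returning a string.
void AppendPetInfoQuery(std::string &out, uint32_t char_id) {
	out += StringFormat("DELETE FROM `character_pet_info` WHERE `char_id` = %u; ", char_id);
}

void AppendPetInventoryQuery(std::string &out, uint32_t char_id) {
	out += StringFormat("DELETE FROM `character_pet_inventory` WHERE `char_id` = %u; ", char_id);
}

// The caller reserves once, and every helper grows the same allocation:
std::string bulk;
bulk.reserve(16 * 1024); // avoid repeated reallocation as the batch grows
AppendPetInfoQuery(bulk, character_id);
AppendPetInventoryQuery(bulk, character_id);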

RekkasGit (Author) commented Mar 8, 2019

From a performance standpoint, I think it might be better for all of these to take an out parameter taken by reference. This would allow us to just continue to grow a single std::string object instead of having a bunch of independently growing std::string objects that ultimately get combined into a single std::string.

I fully agree and was already thinking about that. Think of this as an initial rough draft / proof of concept. I wrote most of this last night.

I will have edits up shortly.

Akkadius (Member) commented Mar 8, 2019

@livermana, thank you so much for spending the time and putting together a PR; we always appreciate folks who take the initiative to contribute to the project.

There are quite a few things I want to highlight about this PR and why it definitely won't make it to mainline:

  1. We don't want to enable multiple statements per query; this leaves our non-parameterized statements open to SQL injection and is a security risk
  2. You don't need multiple statements in one query to perform a transaction; we have methods to start transactions and commit them, with examples in the code where it makes sense
  3. Multiple statements can be wrapped inside said transaction to reduce the overall parsing and execution time, because the server does it all in one fell atomic swoop (see the sketch after this list)
  4. EQEmu is not database heavy at all. There was once a time when we did all kinds of stupid things, but hundreds of hours have been poured into reducing our performance bottlenecks across the board in CPU, I/O and even network. To illustrate, here is PEQ: http://peq.akkadius.com:19999/#menu_disk_submenu_sda;theme=slate;help=true
  5. Don't mix two separate PRs together
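A minimal sketch of points 2 and 3, assuming raw SQL through the QueryDatabase() call seen elsewhere in this diff (table and column names illustrative; the dedicated transaction helpers referenced above may look different):

database.QueryDatabase("START TRANSACTION");
database.QueryDatabase(StringFormat("UPDATE `character_data` SET `level` = %u WHERE `id` = %u", level, character_id));
database.QueryDatabase(StringFormat("DELETE FROM `character_pet_inventory` WHERE `char_id` = %u", character_id));
// ... the remaining per-table saves, each still a single statement ...
database.QueryDatabase("COMMIT"); // the whole batch becomes one atomic flush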

On the Performance Standpoint

Since I'm assuming this is related to your chiming in on the lag-problems thread, I will just restate the following:

With PEQ at 800-1,000 toons daily, we barely break 100 IOPS, with barely 1MB/s in writes and minimal spikes here and there. That is virtually nothing.

Also, against your benchmark, the current stock code on some middle-of-the-line server hardware produces the following timings:

[Debug] ZoneDatabase::SaveCharacterData 1, done... Took 0.000107 seconds
[Debug] ZoneDatabase::SaveCharacterData 1, done... Took 0.000091 second

This is a sub-1ms operation that is not going to hurt even when it is synchronous.

These operations also happen so infrequently that it is not going to even matter.

There are many, many factors that play into the overall performance of a server. Since the server is essentially an infinite loop, anything within that loop can influence the amount of time the CPU is not spent idling (network, I/O, overly CPU-intensive operations, etc.). Hardware is an influence, your MySQL server configuration is an influence, and most of all software is an influence.

EQEmu used to be way, way more resource intensive, and we've come a long way to where that is not even an issue anymore. We have one outstanding bug isolated to the networking layer that made its way through because we never saw it on PEQ during our normal QA routines.

We are currently working on code to measure application-layer network stats so folks can give us metric dumps and we can produce a proper fix. We've seen what happens during the network anomaly under a CPU profile, and there's not much it is going to show on its own beyond where the time is being spent.

We folks at EQEmu definitely have jobs that have been a deterrent to resolving said bug, but we will have it on lockdown soon enough, as we know exactly what we need to do; the only thing in our way is time as a resource.

As far as your other 'stuck' PR goes, that could definitely be useful, but I defer to @KimLS to review that PR.

As far as this PR is concerned, though, the footprint of these queries is so incredibly small that it is not worth trying to optimize further, unless you wanted to wrap a handful of saves between transaction begin and end calls, but that is about it.

Akkadius closed this Mar 8, 2019
RekkasGit (Author)

A well thought out and detailed reply, though with a touch of sting in it.
I would still suggest adding the indexes to the two tables.

  1. We don't want to enable multiple statements per query; this leaves our non-parameterized statements open to SQL injection and is a security risk

This was a new method for QueryMulti; other statements were not impacted (though I apologize if I am wrong about this). I checked that the statements in question were not parameterized, just raw SQL sent down the stack.

  2. You don't need multiple statements in one query to perform a transaction; we have methods to start transactions and commit them, with examples in the code where it makes sense

That was very much understood; I was just trying to limit the network hits. There's no reason to make multiple network round trips when you can make one, given the cost of crossing kernel boundaries, network switches, etc.

  4. EQEmu is not database heavy at all. There was once a time when we did all kinds of stupid things, but hundreds of hours have been poured into reducing our performance bottlenecks across the board in CPU, I/O and even network. To illustrate, here is PEQ: http://peq.akkadius.com:19999/#menu_disk_submenu_sda;theme=slate;help=true

Thank you for the charts, very nice.

  5. Don't mix two separate PRs together

I stated in the PR itself that this was an issue and that I would resolve it, so no need to bring it back up, I would think.

Ultimately I think I stepped on a sore point, and I apologize for any interference on my part. I simply created this last night after seeing a post about how disabling some of the save queries helped with someone's lag, so I looked into said function. I saw the missing indexes, saw the multiple transactions being issued (which is usually bad practice for throughput and data consistency), and saw the lock on a single database connection.
So I tried a proof of concept, saw a marked improvement (the database was separate from the eqemu instance) while keeping data consistency, and thus made the PR to see if it could help out.

Akkadius (Member) commented Mar 8, 2019

No worries. I just wanted to bring things into context; I've spent ridiculous amounts of time on the low-level aspects of our codebase and its performance, so it is easy to focus in on things that aren't exactly glaring performance issues.

If there are obvious indexes that we hit a lot and they could save some time, by all means present the case, but I would also want to see it backed up with numbers: how often the query is called, the depth of the records involved, benchmarks, etc. Indexes also carry the overhead of being maintained, but since we don't insert intensively, and when we do we insert in bulk, the new indexes you would likely introduce wouldn't be much overhead to maintain.

For example, here is some work done last year where we ran a 50+ toon burn against a raid mob with massive amounts of packets flying around and realized there were indeed some glaring performance issues at the network level: we were negligently sending way too many packet updates. Hard numbers clearly illustrated a massive benefit, where previously the zone would choke and collapse upon itself from the sheer number of packets it was trying to keep up with.

https://gist.github.com/Akkadius/28b7ad2fdd82bdd15ea737c68f404346

	- Before and after packet differences under similar load/tests (packets per second):
		- OP_Animation:    before 7,000 - 8,000 pps,   after 600-800 pps
		- OP_MobHealth:    before 13,000 - 17,000 pps, after 1-10 pps
		- OP_ClientUpdate: before 15,000 - 20,000 pps, after 500-1,000 pps

You can also review just about any changelog entry that has a performance tag to see other work done there as well.

I apologize if I came off with a sting in any way. I just have a very good understanding of where our current technical expenditures lie, and I have spent way more time than I care to admit so folks can have a performant and stable server experience.

By all means hit up our Discord channel; we are happy to discuss other ideas for improvement as well, or you can open a GitHub issue.
