Less queries #4412

Merged
merged 3 commits into SimpleMachines:release-2.1 from albertlast:LessQuery on Jan 4, 2018

Conversation

5 participants
@albertlast
Collaborator

albertlast commented Dec 18, 2017

I took some time to find some unnecessary queries, but found only two (they are hard to search for).

MessageIndex: I was able to fold the query into the existing one by using a correlated subquery.

ProfileView: avoid a useless roundtrip when both trips do the same thing.

albertlast added some commits Dec 18, 2017

correlated subquery for Participation
Signed-off-by: albertlast albertlast@hotmail.de
run once when range_limit is empty
Signed-off-by: albertlast albertlast@hotmail.de

@Gwenwyfar Gwenwyfar added the Database label Dec 18, 2017

@jdarwood007

Member

jdarwood007 commented Dec 19, 2017

How would this affect queries for big forums?

@albertlast

Collaborator

albertlast commented Dec 19, 2017

I assume you mean the MessageIndex change:

You can answer the question yourself.
The old query uses a GROUP BY + LIMIT 20,
which means that on the disk side you select between 0 and unlimited rowsets (a user can post more than once in a topic).
The subquery runs 20 times (when the limit is 20) and selects between 0 and 20 rowsets (because of the LIMIT 1 without GROUP BY).

On top of that, the overhead of the second query (build query, run query, evaluate result) is gone in this solution.

In short:
this improvement is especially for big boards.

In my tests the "big" query takes the same time as before,
and I have a mid-size test board with ~500k messages.
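
To make that concrete, here is a rough sketch of the two query shapes being compared. This is illustrative only, not the literal MessageIndex.php code; the table, column, and placeholder values (smf_messages, smf_topics, the member ID 42, the board ID 7) are assumptions for the example.

```php
<?php
// Illustrative only -- not the actual SMF queries.

// Old approach: a second, separate query with GROUP BY. It may scan every
// post the member wrote in the listed topics before grouping them down.
$old_participation_sql = <<<'SQL'
SELECT id_topic
FROM smf_messages
WHERE id_topic IN (1, 2, 3)   -- the topics shown on the page
  AND id_member = 42          -- the current member
GROUP BY id_topic
SQL;

// New approach: a correlated subquery folded into the existing topic query.
// It is evaluated once per returned topic row and stops at the first match,
// so it reads at most one row per topic (LIMIT 1, no GROUP BY).
$new_topic_list_sql = <<<'SQL'
SELECT t.id_topic, t.id_last_msg,
    (SELECT id_msg
     FROM smf_messages
     WHERE id_topic = t.id_topic
       AND id_member = 42
     LIMIT 1) AS participated
FROM smf_topics AS t
WHERE t.id_board = 7
ORDER BY t.id_last_msg DESC
LIMIT 20
SQL;
```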

@jdarwood007

Member

jdarwood007 commented Dec 19, 2017

Yes, I was referring to the MessageIndex change. That is a heavy and frequently run query, so on big boards they will notice if it becomes unoptimized.
Generally, because of how heavy things like this and the who's-viewing queries can be on a database, most big boards don't use them.

Sounds good if we're improving it, then; big boards should see an improvement.

@colinschoen colinschoen self-requested a review Jan 1, 2018

@colinschoen

Member

colinschoen commented Jan 1, 2018

I'll review this soon.

Sources/MessageIndex.php
@@ -699,7 +699,7 @@ function showPosts($memID)
 		}
 		// Make sure we quit this loop.
-		if ($smcFunc['db_num_rows']($request) === $maxIndex || $looped)
+		if ($smcFunc['db_num_rows']($request) === $maxIndex || $looped || $range_limit == '')

@colinschoen

colinschoen Jan 4, 2018

Member

This is more of a question about the existing code than about your change:

Am I missing something? The body of this loop only ever seems to run the same query 1-2 times. Why is this while loop in place?

@albertlast

albertlast Jan 4, 2018

Collaborator

The first run tries something;
when it doesn't succeed, it tries the same thing again, but with range_limit = ''.
I don't know how to get into a state where range_limit is not empty on the first run.
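
A minimal sketch of that control flow, assuming a stubbed-out query helper (this is a hypothetical simplification, not the real showPosts() code; run_posts_query() and the starting $range_limit value are invented for illustration):

```php
<?php
// Hypothetical simplification of the retry loop discussed above.
// First pass: run the query with a narrowing $range_limit clause; if that
// pass does not fill the page, retry once with $range_limit = ''.

// Stand-in for the real $smcFunc['db_query'](...) call.
function run_posts_query(string $range_limit, int $max_index): array
{
	// ... build and run the real query here ...
	return array();
}

$maxIndex = 20;                    // posts per page
$looped = false;
$range_limit = 'id_msg > 1000';    // assumed starting restriction, for illustration

while (true)
{
	$rows = run_posts_query($range_limit, $maxIndex);

	// Make sure we quit this loop: the page is full, we already retried,
	// or there was no restriction to relax in the first place.
	if (count($rows) === $maxIndex || $looped || $range_limit == '')
		break;

	// Second (and last) pass: drop the range restriction and try again.
	$looped = true;
	$range_limit = '';
}
```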

added as
Signed-off-by: albertlast albertlast@hotmail.de
@colinschoen

LGTM approval

I can take a closer look at the existing code later. Your changes look fine.

@colinschoen colinschoen merged commit eb20223 into SimpleMachines:release-2.1 Jan 4, 2018

2 checks passed

Scrutinizer: No new issues
continuous-integration/travis-ci/pr: The Travis CI build passed
@Arantor

Contributor

Arantor commented Jan 19, 2018

Does this actually make it faster? I can imagine that on Postgres it would, but my initial testing says it makes things slower on MySQL, sometimes more than 3 times slower (with SQL_NO_CACHE to keep the cache from affecting things). Not to mention that a user can configure how many rows per page the message index shows, so the subquery can run more like 50 times rather than 20.

In addition, the single query should also be able to be satisfied on MySQL solely from the index, which it now won't be. I'd be interested in real-world numbers from MySQL, seeing how it makes up 99.99% of the install base...
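
For context on the index-only point: on MySQL, a query whose referenced columns are all contained in a single index can be answered without reading table rows, and EXPLAIN reports this as "Using index" in the Extra column. A hypothetical way to check (the table, columns, and values are illustrative, not necessarily SMF's schema or indexes):

```php
<?php
// Hypothetical illustration of checking whether a query is index-only on MySQL.
// If every referenced column is covered by one index, EXPLAIN shows
// "Using index" in the Extra column and no table rows need to be read.
$db = new mysqli('localhost', 'user', 'password', 'forum');

$explain = $db->query(
	'EXPLAIN
	 SELECT id_topic
	 FROM smf_messages
	 WHERE id_topic IN (1, 2, 3)
	   AND id_member = 42
	 GROUP BY id_topic'
);

foreach ($explain as $row)
	echo $row['Extra'], "\n";   // look for "Using index"
```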

@albertlast

Collaborator

albertlast commented Jan 20, 2018

  1. When you benchmark a database you always expect the data to be warm,
    so using SQL_NO_CACHE is a mistake from the start.

  2. Since the old solution uses a GROUP BY, the number of selected rowsets >= the number of shown rowsets.
    To get a better feeling for it:
    the second query in the old solution selects between 0 and max_amount_row_of_messages rowsets;
    in the new version the subquery works without GROUP BY, so selected rowsets = shown rowsets
    -> the amount of selected data is between 0 and the limit (20 or 50).

  3. An internal single database select is faster than a single select issued from PHP.

  4. When you calculate the improvement you have to sum the times of all queries plus the PHP overhead:
    old: runtime qry 1 + php overhead of qry 1 + runtime qry 2 + php overhead of qry 2 = run_time_old
    new: runtime qry 1 + php overhead of qry 1 = run_time_new

In short, I guess on a small setup the solution could maybe run slower,
but on a medium/large setup it's faster.
When I make improvements I have medium/large setups in mind, not small ones.
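
A minimal sketch of how point 4 could be measured, assuming plain mysqli and placeholder statements standing in for the real MessageIndex queries (hypothetical code, not SMF's):

```php
<?php
// Hypothetical sketch: total cost = query runtime plus the PHP overhead of
// issuing the query and reading its result set.

function timed_query(mysqli $db, string $sql): float
{
	$start = microtime(true);
	$result = $db->query($sql);        // build + run the query
	while ($result->fetch_assoc()) {}  // read every row, as the PHP layer would
	$result->free();
	return microtime(true) - $start;   // runtime + PHP overhead for this query
}

$db = new mysqli('localhost', 'user', 'password', 'forum');

// Placeholder statements standing in for the real MessageIndex queries.
$qry1 = 'SELECT 1';
$qry2 = 'SELECT 2';

// old: runtime qry 1 + overhead + runtime qry 2 + overhead
$run_time_old = timed_query($db, $qry1) + timed_query($db, $qry2);

// new: a single query with the correlated subquery folded in
$run_time_new = timed_query($db, $qry1);

printf("old: %.4fs  new: %.4fs\n", $run_time_old, $run_time_new);
```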

@Arantor

Contributor

Arantor commented Jan 20, 2018

The problem with benchmarking on the assumption that the data is warm is that with MySQL's query cache you need to make sure the cache isn't hiding the real performance of the query. The same query run 50 times in a row will run slowly once and considerably faster every subsequent time. You need to eliminate that because you can't rely on the query cache.

I'd still want to see numbers showing that on a MySQL setup this will actually be faster, and ideally numbers from a real forum rather than a constructed benchmark, because the reality is that a constructed benchmark forum has odd distributions of data that don't match a real forum (virtually every benchmark forum DB I've ever seen has a far more evenly distributed user:post ratio than ever exists on a real forum, for example).

What happens on a multi-million post forum? Does it still scale on that, on MySQL, in the expected way? Sadly I can't get hold of the 80-odd million post database any more, but an anonymised dump from sm.org would be an interesting dataset in terms of proving it does or doesn't scale better. I share SleePy's concerns about it.

I've asked a friend to give me a copy of their 111k post database to see how that kind of thing scales; will see how that goes.
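
To illustrate the cache effect described above, a hypothetical sketch (the table and column names are invented): with the MySQL (pre-8.0) query cache enabled, only the first run of an identical query pays the real cost; adding SQL_NO_CACHE, or disabling the cache, makes every run pay it.

```php
<?php
// Hypothetical sketch: run the identical query repeatedly and watch the
// timings. With the MySQL (< 8.0) query cache on, only the first run is slow;
// SQL_NO_CACHE would force every run to pay the full execution cost.
$db = new mysqli('localhost', 'user', 'password', 'forum');
$sql = 'SELECT id_msg FROM smf_messages WHERE id_topic = 123 AND id_member = 42 LIMIT 1';

for ($i = 0; $i < 5; $i++)
{
	$start = microtime(true);
	$db->query($sql);
	printf("run %d: %.4fs\n", $i + 1, microtime(true) - $start);
}
```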

@Arantor

Contributor

Arantor commented Jan 20, 2018

OK, so I've been playing with that database (it shows 111k posts on the board index, but it turns out there are 226k posts in total) to try to get a feel for the improvement. Interestingly, with a larger database on MySQL the difference between the two isn't so huge; they seem to perform about the same, give or take 0.01 seconds.

I'd still love to see some benchmarking on real MySQL installs, but that's the biggest I can get my hands on right now.

@albertlast albertlast deleted the albertlast:LessQuery branch Mar 9, 2018
