Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DB performance: remove distinct statements in subqueries #486

Closed
2 tasks done
Michiel-s opened this issue Jul 22, 2016 · 14 comments
Closed
2 tasks done

DB performance: remove distinct statements in subqueries #486

Michiel-s opened this issue Jul 22, 2016 · 14 comments
Assignees
Labels
bug Indicates an unexpected problem or unintended behavior priority:normal

Comments

@Michiel-s
Copy link
Member

Michiel-s commented Jul 22, 2016

Achtergrond
Zoals de laatste tijd wel vaker loop ik tegen performance problemen van de database aan, die voortkomen uit de manier waarop de queries zijn opgebouwd. Voor kleine populaties spelen dergelijke issues niet omdat de boete in termen van tijd enkele miliseconden betreft. Wanneer de populaties echter toenemen (in mijn geval > 10k records voor bepaalde tabellen), dan loopt de boete voor niet geoptimaliseerde queries snel op (van enkele seconden tot minuten).

Gerelateerde issues
#454 removing subquery in left join
#426 remove outer subquery which is added by backend
#217 reduce number of queries for getting interface data

Dit issue
Dit issue verzoekt het verwijderen van het distinct statement dat aan alle (sub)queries standaard wordt toegevoegd.

Analyse
Het distinct statement wordt toegevoegd omdat we in Ampersand unique paren in verzamelingen willen hebben. Distinct zorgt daarvoor. Echter, bij subqueries en joins, welke we nogal vaak nodig hebben, zorgt het distinct statement op een subquery ervoor dat er een derived tabel nodig is voor het verwijderen van de dubbelingen in de tussenresultaten. Dit heeft een significante impact op de performance. Waar het ons echter om gaat is dat het eindresultaat geen dubbelingen bevat. Aan de andere kant voorkomt het distinct stament een explosie van tussenresultaten in subqueries. Dit is weer goed voor de performance. Dit voordeel weegt in generieke zin echter niet op tegen het nadeel dat de database subqueries niet kan optimaliseren.

Vereiste actie

  • verwijder alle distinct statement voor queries incl. subqueries
  • voeg distinct to op alle buitenste queries

LET op. Voor interface queries met '_SRCATOM' placeholder -> distinct op buitenste query laten staan!!

Een voorbeeld ter illustratie van performance winst (factor 100!!)
Gegeven deze rule uit een script: subElement |- propertyDomain~ \/ subClassOf;subElement
Wordt onderstaande sql query gegenereerd voor het verkrijgen van de violations:
Merk op dat de query bezaaid is met distinct statements

select distinct t1.src as src, t1.tgt as tgt 
from 
    (   select distinct t1.src as src, t1.tgt as tgt 
        from 
            (   select distinct "Class" as src, "Property" as tgt 
                from "subElement" 
                where ("Class" is not null) 
                and ("Property" is not null)
            ) as t1 
        left join 
            (   /* Flipped: EDcD propertyDomain[Property*Class] */ 
                select distinct "propertyDomain" as src, "Property" as tgt 
                from "Property" 
                where ("Property" is not null) 
                and ("propertyDomain" is not null)
            ) as t2 
        on (t1.src = t2.src) and (t1.tgt = t2.tgt) 
        where (t2.src is null) or (t2.tgt is null)
    ) as t1 
left join 
    (   select distinct fence0.src as src, fence1.tgt as tgt 
        from 
            (   select distinct "SrcClass" as src, "TgtClass" as tgt 
                from "subClassOf" 
                where ("SrcClass" is not null) and ("TgtClass" is not null)
            ) as fence0, 
            (   select distinct "Class" as src, "Property" as tgt 
                from "subElement" 
                where ("Class" is not null) and ("Property" is not null)
            ) as fence1 
        where (fence0.tgt = fence1.src)
    ) as t2 
on (t1.src = t2.src) and (t1.tgt = t2.tgt) 
where (t2.src is null) or (t2.tgt is null)

Gegeven de volgende populatie in de database:
subElement: 10.046 paren
propertyDomain: 6.171 paren
subClassOf: 910 paren

Is dit wat de database uitvoerd (uitvoering op mijn machine duurt +/- 20 sec):

image

Verwijderen we alle distinct statements uit de subqueries, dan duurt het slechts 0,2 sec. Analyse van de query geeft ook een veel geruststellender resultaat:

image

@Michiel-s
Copy link
Member Author

@hanjoosten I've tried to implement this myself, but this one is too hard for me (yet) :)

@stefjoosten
Copy link
Contributor

@Michiel-s,
I did an attempt in SHA 40accc7. Could you please test it?

@Michiel-s
Copy link
Member Author

Michiel-s commented Jul 26, 2016

For conjuncts queries b330310 looks good (separate branch, not commit of @stefjoosten).

For our information, I've noticed that for queries with a UNION statement the DISTINCT is at the UNION statement (which is also the default of the database). Therefore the DISTINCT statement is not necessary on both outer SELECTS statements. See example below.. this is an ok implementation!

select t1.src as src, t1.tgt as tgt 
from    (   blabla
        ) as t1 
union distinct 
select t1.src as src, t1.tgt as tgt 
from    (   blabla
        ) as t1

@Michiel-s
Copy link
Member Author

I've reverted the addition of ALL statements b330310. This is the default of the database and not necessary.

"The ALL and DISTINCT options specify whether duplicate rows should be returned. ALL (the default) specifies that all matching rows should be returned, including duplicates. DISTINCT specifies removal of duplicate rows from the result set. It is an error to specify both options. DISTINCTROW is a synonym for DISTINCT" see mysql doc

@Michiel-s
Copy link
Member Author

@hanjoosten please let me know when you have time to look into detail into this issue. I can show you the thing in interfaces.json that are not yet consistent.

Michiel-s pushed a commit that referenced this issue Jul 26, 2016
Issues not yet fully fixed, but merged for now.

Commits:
* this should fix issue #486 
* Fix issue #488
* Minor code cleanup
* Releasenotes
* fix bug. The default was not as I expected.
* Bugfix e5386cf
* Revert "fix bug. The default was not as I expected."
@stefjoosten
Copy link
Contributor

@hanjoosten and @Michiel-s
can you decide whether this issue can be closed?

@Michiel-s
Copy link
Member Author

Not all actions (see initial comment) are done or checked. I'll look into that.

@Michiel-s
Copy link
Member Author

Tested the solution and it is not yet fixed properly. There are still interface queries generated where no DISTINCT statement is added to the outer query. This causes duplicated.

@hanjoosten let's discuss this asap via Skype.

hanjoosten added a commit that referenced this issue Jan 2, 2017
@hanjoosten
Copy link
Member

@Michiel-s We didn't discuss this issue via Skype, but I looked into it myself and found an obvious cause. As far as I can see, this should solve the problem. I've made a pull request. Could you verify that the problem is solved? (and then pull the branch?)

@Michiel-s
Copy link
Member Author

I'll look at this PR Thursday morning

@RieksJ
Copy link
Contributor

RieksJ commented Jan 3, 2017

This issue was referenced by @hanjoosten in commit AmpersandTarski/ampersand-models@f5f266d (related to ticket #599)

Michiel-s pushed a commit that referenced this issue Jan 6, 2017
@Michiel-s
Copy link
Member Author

I've merged PR #611. Priority can be lowered now, but there is still something not consistent yet regarding the distinct statements. I would like to have a online work session with @hanjoosten to look into this later. For now please leave this issue open.

@RieksJ
Copy link
Contributor

RieksJ commented Apr 25, 2017

It would be nice if we could get the XPaths demo working again. How are things coming along?

@stefjoosten
Copy link
Contributor

@Michiel-s can this issue be closed already?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Indicates an unexpected problem or unintended behavior priority:normal
Projects
None yet
Development

No branches or pull requests

4 participants