Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Save shard-specific query plans to metrics #1229

Open
jwomeara opened this issue Jul 26, 2021 · 5 comments · May be fixed by #2027
Open

Save shard-specific query plans to metrics #1229

jwomeara opened this issue Jul 26, 2021 · 5 comments · May be fixed by #2027
Assignees
Labels
enhancement New feature or request

Comments

@jwomeara
Copy link
Collaborator

Shard-specific query plans are created by the visitor function. We should add these query plans to the metrics for the query, and make the metrics viewable via the metrics endpoint.

The metrics endpoint should be updated to accept a shard query parameter which can be used to determine which query plan is displayed. We could/should also display a column with the shards associated with the query, and potentially make each one a hyperlink to the metrics entry specific to that shard.

@jwomeara jwomeara added the enhancement New feature or request label Jul 26, 2021
@foster33
Copy link
Collaborator

foster33 commented Oct 25, 2022

I have been working on this and currently have the subPlans created by the VisitorFunction added to the metrics.

@ivakegg had mentioned that it would be a good idea to open up discussion regarding what route we might want to go design wise for a few different aspects.

I wanted to get some design based opinions on:

  • How it would be best to store the subPlans in accumulo.
  • Thoughts on how to pull them back out.
  • And how to present the subPlans to the user.

*Edit - My branch in the Datawave repo: datawave/tree/bugfix/DATAWAVE-1609

@ivakegg
Copy link
Collaborator

ivakegg commented Oct 28, 2022

So when the metric makes it to the query metric service, there we need to add fields to the query event being stored in accumulo that encode this information. I am thinking that we might add fields as follows:

    For the following map:
        "20220101_23" -> "SOME QUERY PLAN"
        "20220103_10/twmaze/12389.235235.25252" -> "SOME OTHER QUERY PLAN"
    The fields to add to the query event may be
        RANGE-20220101_23 -> hash1
        RANGE-20220103_10-twmaze-12389.235235.25252 -> hash2
        SUBPLAN-hash1 -> SOME QUERY PLAN
        SUBPLAN-hash2 -> SOME OTHER QUERY PLAN

Where hash1 and hash2 are the hashes of the respective query plan strings. The reason for this somewhat awkward storage is that we expect the query plans to be potentially huge and the same sub-plan will exist for many of the ranges. This will avoid storing those sub-plans multiple times for the same value.

@jwomeara @billoley Can you guys chime in on this please?

@jwomeara
Copy link
Collaborator Author

jwomeara commented Nov 4, 2022

I talked with @ivakegg and I think we settled on something a little different...

    For the following map:
        "20220101_23" -> "SOME QUERY PLAN WHO HASHES TO 'HASH1'"
        "20220101_17" -> "SOME OTHER QUERY PLAN WHO HASHES TO 'HASH2'"
        "20220103_10/twmaze/12389.235235.25252" -> "SOME OTHER QUERY PLAN THAT HASHES TO 'HASH2'"
    The fields to add to the query event would be
        RANGES -> HASH1:20220101_23
        RANGES -> HASH2:20220101_17
        RANGES -> HASH2:20220103_10/twmaze/12389.235235.25252
        SUBPLAN-HASH1 -> SOME QUERY PLAN
        SUBPLAN-HASH2 -> SOME OTHER QUERY PLAN

What I put above roughly represents what the mutations going to the metrics shard table should look like. You'll notice that the RANGES field has multiple values. This is intentional, and part of this effort will be to write an accumulo combining iterator to combine the values of those entries into a single value.

So, what we eventually will write to the table will look more like this:

    For the following map:
        "20220101_23" -> "SOME QUERY PLAN WHO HASHES TO 'HASH1'"
        "20220101_17" -> "SOME OTHER QUERY PLAN WHO HASHES TO 'HASH2'"
        "20220103_10/twmaze/12389.235235.25252" -> "SOME OTHER QUERY PLAN THAT HASHES TO 'HASH2'"
    The fields to add to the query event would be
        RANGES -> HASH1:20220101_23;HASH2:20220101_17,20220103_10/twmaze/12389.235235.25252
        SUBPLAN-HASH1 -> SOME QUERY PLAN
        SUBPLAN-HASH2 -> SOME OTHER QUERY PLAN

Combining the RANGES fields should help to save space since the idea is that we'll be writing a lot of plans.

A follow-on task would be to update the query metrics html endpoint to display the contents of the RANGES field as a set of clickable links. When you click on one of these links, we should reload the metrics for given query and swap the displayed PLAN with the selected SUBPLAN. For now though I would just work on the first part - we can work out the UI details at a later time.

@billoley
Copy link
Collaborator

billoley commented Nov 4, 2022

Wouldn't it make lookups faster if we kept the RANGES separate?
Metrics use a dot, (.) to indicate multiple entries for the same indexed field.

RANGES.HASH1:20220101_23
RANGES.HASH2:20220101_17
RANGES.HASH2:20220103_10/twmaze/12389.235235.25252
SUBPLAN.HASH1 -> SOME QUERY PLAN
SUBPLAN.HASH2 -> SOME OTHER QUERY PLAN

@ivakegg
Copy link
Collaborator

ivakegg commented Nov 4, 2022

I suppose there may be an issue when we have so many ranges that we blow out the size of a Key in accumulo...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
4 participants