Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add bulk requests and grouping to the whisp count query #349

Closed
MeStrak opened this issue Apr 7, 2021 · 20 comments
Closed

Add bulk requests and grouping to the whisp count query #349

MeStrak opened this issue Apr 7, 2021 · 20 comments
Labels
enhancement New feature or request

Comments

@MeStrak
Copy link
Contributor

MeStrak commented Apr 7, 2021

Is your feature request related to a problem? Please describe

Currently the whispsCount query takes a single mongo format query, and returns the count of documents which match that query. In cases where an application needs to get the count of whisps in bulk, a separate request is required for each query, resulting in a high number of requests from the browser (e.g. one for every row of a table).

Describe the solution you'd like

  • Extend whispsCount to allow bulk requests, and grouping by specific criteria in the mongo query
  • Stop using the mongo countDocuments() function, and use db.collection.aggregate instead which will allow us to have a grouping functionality (explained more in detail below)
  • Grouping criteria would be optional - not necessary if the consuming app wants a total document count based on the matched query

Currently the query runs the mongo command countDocuments() (https://docs.mongodb.com/manual/reference/method/db.collection.countDocuments/). countDocuments is actually just a wrapper for the following command:

db.collection.aggregate([
--
{ $match: <query> },
{ $group: { _id: null, n: { $sum: 1 } } }
])

We can easily replace countDocuments() with that command to allow us to group, and therefore manage bulk requests.

For example this query:

 db.whisps.aggregate([
    { $match: {"applicationID": "myapp"} },
    { $group: { _id: { criteria1: "$data.thing1", criteria2: "$data.thing2" }, count: { $sum: 1 } } }

Would return something like this, which satisfies the requirements:

{ "_id" : { "criteria1" : "abc", "criteria2" : "123" }, "count" : 3 }
{ "_id" : { "criteria1" : "def", "criteria2" : "123" }, "count" : 5 }
{ "_id" : { "criteria1" : "def", "criteria2" : "345" }, "count" : 7 }
{ "_id" : { "criteria1" : "xyz", "criteria2" : "999" }, "count" : 9 }

In that example the match criteria is very simple, but it could also be a more complex query with multiple things to match.

BREAKING CHANGE(S)

  • If we use exactly the same function for a simple query with no grouping (like the implementation today), the response would contain an object with an _id field, instead of just the number of records.
  • This query will also never return a record count of 0, because for the grouping functionality to work there must be at least one record matched
  • We could manage both of those by using the existing logic if no grouping criteria is specified, I'm just not sure if the change is critical enough to require both of these to be maintained (added to open points for discussion)

Describe alternatives you've considered

Consuming app provides an array of queries, with an ID for each query, whispr executes individual documentCount queries but returns a single response

  • Not an optimal solution as we would be sending many separate queries to the database

Use a mongo aggregate function so that we can add grouping functionality

(this is the proposed solution described in detail above)

  • Allows flexible grouping on a single field, or an object
  • Would potentially introduce a breaking change for single ungrouped queries, depending on how we manage it
 db.whisps.aggregate([
    { $match: {"applicationID": "myapp"} },
    { $group: { _id: { criteria1: "$data.thing1", criteria2: "$data.thing2" }, count: { $sum: 1 } } }

Additional context

Open points for discussion:

  • whispsCount - should it be called whispCount instead? - if it should then it allows us to introduce this feature without a breaking change (for now) as we could add the new query and decommission the old one later
  • Should we accept an array of queries to be matched, or have the consuming app specify a single query to match in mongo format? (either we construct the query with an OR for every item in the array, or the consuming app does that for us)
  • Recommended maximum number of queries in the bulk request?
@MeStrak MeStrak added the enhancement New feature or request label Apr 7, 2021
@Babbafett
Copy link
Contributor

Babbafett commented Apr 7, 2021

whispsCount - should it be called whispCount instead? - if it should then it allows us to introduce this feature without a breaking change (for now) as we could add the new query and decommission the old one later

I agree we should rename it and then use it to decommission the old query later.

Should we accept an array of queries to be matched, or have the consuming app specify a single query to match in mongo format? (either we construct the query with an OR for every item in the array, or the consuming app does that for us)

I would prefer to provide an array of criteria arguments to a single query and let whispr do the OR combination before executing the aggregate function

Recommended maximum number of queries in the bulk request?
On the one hand we should think about limiting it, otherwise it will end up in long running times and an overloaded whispr backend. On the other hand we could say that this is in responsibility of the consuming apps to use a proper value, but that might me a too optimistic way.

For a rate limiting option we should use the nestjs throttler https://docs.nestjs.com/security/rate-limiting to do a rate limiting based on the query. To identify how much request we want to allow it might be worth to implement a k6 load test to get this number.

@MeStrak
Copy link
Contributor Author

MeStrak commented Apr 7, 2021

In a separate chat @DennisWue also mentioned that we do actually have some basic grouping capability already, which I was not aware of!

query combinedCounts {
  thing1Whisps: whispsCount(filter: { applicationID: "thing1" })
  thing2Whisps: whispsCount(filter: { applicationID: "thing2" })
}

Would return:

{
  "data": {
    "thing1Whisps": 16,
    "thing2Whisps": 0
  }
}

It's pretty simple, but becomes complex when you want to group on more than one criteria (which we do in at least one of our apps). I can't see an elegant way of doing that - I think it would be necessary to separate the grouping by an underscore in the query name.

Thoughts?

@DennisWue
Copy link

Yes it's called 'Field alias' read more here https://spec.graphql.org/June2018/#sec-Field-Alias

@MeStrak
Copy link
Contributor Author

MeStrak commented Apr 7, 2021

Yes it's called 'Field alias' read more here https://spec.graphql.org/June2018/#sec-Field-Alias

If it works then I like it! Can you give an example of what the query would need to look like using the grouping by thing1, thing2 example?

@DennisWue
Copy link

Do you mean a combined grouping?

query combinedCounts($filter: JSONObject) {
thing1Whisps: whispsCount(filter: { applicationID: "thing1" })
thing2Whisps: whispsCount(filter: { applicationID: "thing2" })
combinedWhisps: whispsCount(filter: $filter)
}

with variable:
{"filter": { "applicationID": {"$in": ["thing1","thing2"]} }}

which would return:
{
"data": {
"thing1Whisps": 16,
"thing2Whisps": 2,
"combinedWhisps": 18
}
}

@MeStrak
Copy link
Contributor Author

MeStrak commented Apr 7, 2021

Do you mean a combined grouping?

query combinedCounts($filter: JSONObject) {
thing1Whisps: whispsCount(filter: { applicationID: "thing1" })
thing2Whisps: whispsCount(filter: { applicationID: "thing2" })
combinedWhisps: whispsCount(filter: $filter)
}

with variable:
{"filter": { "applicationID": {"$in": ["thing1","thing2"]} }}

which would return:
{
"data": {
"thing1Whisps": 16,
"thing2Whisps": 2,
"combinedWhisps": 18
}
}

Nope, I mean where the grouping is on 2 or more criteria (like the composite key in the original example) to return something equivalent to this:

{ "_id" : { "criteria1" : "abc", "criteria2" : "123" }, "count" : 3 }
{ "_id" : { "criteria1" : "def", "criteria2" : "123" }, "count" : 5 }
{ "_id" : { "criteria1" : "def", "criteria2" : "345" }, "count" : 7 }
{ "_id" : { "criteria1" : "xyz", "criteria2" : "999" }, "count" : 9 }

Do you mean that the consuming application should manage its own mapping to manage that, either by building the composite key in the query name, or a separate hash to lookup the query name vs the composite key?

@MeStrak
Copy link
Contributor Author

MeStrak commented Apr 7, 2021

Another thought (which was not the original aim of this issue) is that the new proposal would provide some enhanced functionality for grouping when you don't know the exact query filter - for example to return the count of all whisps in the database grouped by applicationID, without knowing all distinct applicationIDs.

@DennisWue
Copy link

DennisWue commented Apr 7, 2021

Do you mean a combined grouping?
query combinedCounts($filter: JSONObject) {
thing1Whisps: whispsCount(filter: { applicationID: "thing1" })
thing2Whisps: whispsCount(filter: { applicationID: "thing2" })
combinedWhisps: whispsCount(filter: $filter)
}
with variable:
{"filter": { "applicationID": {"$in": ["thing1","thing2"]} }}
which would return:
{
"data": {
"thing1Whisps": 16,
"thing2Whisps": 2,
"combinedWhisps": 18
}
}

Nope, I mean where the grouping is on 2 or more criteria (like the composite key in the original example) to return something equivalent to this:

{ "_id" : { "criteria1" : "abc", "criteria2" : "123" }, "count" : 3 }
{ "_id" : { "criteria1" : "def", "criteria2" : "123" }, "count" : 5 }
{ "_id" : { "criteria1" : "def", "criteria2" : "345" }, "count" : 7 }
{ "_id" : { "criteria1" : "xyz", "criteria2" : "999" }, "count" : 9 }

Okay this depends on what you want to achieve. Do you want to have the count the matching criteria1 AND criteria2 or do you want the criteria1 OR criteria2 counted.
The first case AND should be as easy as:

query combinedCounts() {
thing1Whisps: whispsCount(filter: { applicationID: "thing1", type: "abc" })
thing2Whisps: whispsCount(filter: { applicationID: "thing1", type: "def" })
thing3Whisps: whispsCount(filter: { applicationID: "thing2", type: "abc" })
thing4Whisps: whispsCount(filter: { applicationID: "thing2", type: "def" })
}

For the OR case you would have to use the mongodb $or operator https://docs.mongodb.com/manual/reference/operator/query/or/

Do you mean that the consuming application should manage its own mapping to manage that, either by building the composite key in the query name, or a separate hash to lookup the query name vs the composite key?

Yes this would be part of the requesting application

@DennisWue
Copy link

Another thought (which was not the original aim of this issue) is that the new proposal would provide some enhanced functionality for grouping when you don't know the exact query filter - for example to return the count of all whisps in the database grouped by applicationID, without knowing all distinct applicationIDs.

I like the idea maybe we can even achieve something right now with a filter like
"filter": { "applicationID": {"$ne": ""} }
which filters for 'not equal'.

@MeStrak
Copy link
Contributor Author

MeStrak commented Apr 7, 2021

Okay this depends on what you want to achieve. Do you want to have the count the matching criteria1 AND criteria2 or do you want the criteria1 OR criteria2 counted.
The first case AND should be as easy as:

query combinedCounts() {
thing1Whisps: whispsCount(filter: { applicationID: "thing1", type: "abc" })
thing2Whisps: whispsCount(filter: { applicationID: "thing1", type: "def" })
thing2Whisps: whispsCount(filter: { applicationID: "thing2", type: "abc" })
thing2Whisps: whispsCount(filter: { applicationID: "thing2", type: "def" })
}

I don't think I was clear enough in my initial question, yes it is an AND case, but the challenge was more around how we return it with a unique key (which I think you addressed with a lookup in the app with your comment below).

Maybe it was a typo but that query gives this error because the query alias' are duplicated:

"Fields \"thing2Whisps\" conflict because they have differing arguments.
Use different aliases on the fields to fetch both if this was intentional."

Do you mean that the consuming application should manage its own mapping to manage that, either by building the composite key in the query name, or a separate hash to lookup the query name vs the composite key?

Yes this would be part of the requesting application

Ok - so that means that the app itself would generate a unique hash for the query name, which would solve the issue above.

@DennisWue
Copy link

Sorry typo on my side, corrected the initial comment

@DennisWue
Copy link

Do you mean that the consuming application should manage its own mapping to manage that, either by building the composite key in the query name, or a separate hash to lookup the query name vs the composite key?

Yes this would be part of the requesting application

Ok - so that means that the app itself would generate a unique hash for the query name, which would solve the issue above.

Yes I think a hash as the alias and a dynamic filter could be a solution for your usecase.

@MeStrak
Copy link
Contributor Author

MeStrak commented Apr 7, 2021

@DennisWue I've just been testing your proposal with one of the queries from our apps and it does work well, even if the mapping within the consuming app adds a bit of complexity.

I do still like the grouping idea for helping with cases where you want to group without knowing the filter, but that's a different use case which could be added a bit later.

For this use case I think I'm now more in favour of @DennisWue's proposal and just doing a few documentation updates now to explain how to use the current functionality, and then consider the additional grouping option a bit later.

@Babbafett @DennisWue @David-Wennemaring @schroedermaxi what are your thoughts? Is the need for the mapping in the consuming app an issue?

@nikola-kovacevic
Copy link
Contributor

I like the simplicity of the approach that we have. However, I do have a question. The current implementation still hits the database multiple times. What I mean is for each whispsCount in the query we would request new countDocuments from MongoDB. Then API rate limiting the request itself loses its purpose, as the database/app can be brought down or be made unusable (at least partially - depends on the infra) by query with unlimited whispsCounts. @Babbafett Next to the rate limiting of the API itself, should we limit the amount of whispsCount that can be requested at once?

@MeStrak
Copy link
Contributor Author

MeStrak commented Apr 8, 2021

I like the simplicity of the approach that we have. However, I do have a question. The current implementation still hits the database multiple times. What I mean is for each whispsCount in the query we would request new countDocuments from MongoDB. Then API rate limiting the request itself loses its purpose, as the database/app can be brought down or be made unusable (at least partially - depends on the infra) by query with unlimited whispsCounts. @Babbafett Next to the rate limiting of the API itself, should we limit the amount of whispsCount that can be requested at once?

It's a good point - I think we need to do some testing of that part as I'm not certain how it works and at what level we can apply the rate limiting, and what level of visibility we have in the resolver of the number of alias' in the request. I'll take a look at that when I get some time.

@MeStrak
Copy link
Contributor Author

MeStrak commented Apr 19, 2021

@nikola-kovacevic I haven't forgotten this point - I'm doing some other cleanup work this week and hope to test it at the same time.

@MeStrak
Copy link
Contributor Author

MeStrak commented Apr 21, 2021

Ok so I did some work on this today. The good news is that we could still rate limit this type of request if we want to. I did a naive implementation of NestJS throttler, with a limit of 10 requests every 10 seconds.

With this type of alias, we basically just hit the resolver x times for x aliased queries, so the throttling works in the same way as if we did x totally separate requests.

Works fine with 10 aliased requests
image

I then waited for the limit to reset (10 seconds and tried with more aliased requests ...

Fails instantly with 11 aliased requests
image

So that is good news, because when we implement a rate limiter (not in this task) it will work for this type of request.

However, this steers me back towards the original proposal because we have a genuine case today where clients are hitting the count endpoint several hundred times over a few seconds. Combining the count of all clients this is actually thousands of times per second at peak times. I think that the original proposal is likely to work much better in this case because the aggregation is done by the db, and we're not launching lots of countDocuments operations which is an expensive op.

I'm going to try a basic implementation and create some load tests to try to prove this.

Thanks @nikola-kovacevic for catching this possible issue.

@MeStrak
Copy link
Contributor Author

MeStrak commented May 11, 2021

I started working on the grouping feature, and created a draft PR for it: #420.

Some more testing to do, but so far it looks promising. One thing that I'm unsure of right now is how we will manage subscriptions. I need to do a bit more investigation on that, and will possibly consider it as an additional PR depending on the scope.

@MeStrak
Copy link
Contributor Author

MeStrak commented May 14, 2021

Simple load testing done, and the performance of the new query is a lot better, especially when the filter size is limited and we rely on grouping instead.

I also realised that my point about subscriptions is not relevant for this issue as there's no current subscription for the which count query. It might be possible to implement a subscription for the count query once this is done: #425.

@MeStrak
Copy link
Contributor Author

MeStrak commented May 17, 2021

Closed by #420, feature included in https://github.com/Sanofi-IADC/whispr/releases/tag/v2.4.0.

@MeStrak MeStrak closed this as completed May 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants