API: Filter parameter (GQL filter queries spec) #5604
Comments
The GQL repository has a PR with an initial implementation: TryGhost/GQL#2
I put this in slack, but going to put it here too as it's a design change in the API: for the old-style requests, we include some additional data in the response. This was, I think, always a bit of a hack so we didn't have to do multiple requests to the API to build the page.

I'm not intending to rebuild that behaviour for the new-style filter parameter - trying to figure out what extra objects need to be fetched and included is incredibly difficult and, I think, unnecessary fragility. I think the frontend controller needs to ask for the data it wants; much like getting rid of the ability to fetch both posts & pages from the API, we need to remove all the 'hacks' from the API that we built in to make our own lives easier. E.g. part of #5943 will be rewriting the frontend controller to be able to perform multiple requests.
refs TryGhost#5604, refs TryGhost#5463 - deps: ghost-gql@0.0.2 - adds code to wire up the filtering to a paginated query - updated pagination plugin count query to use 'distinct' so it's more robust - rename paginationUtils.query to addLimitAndOffset to be more explicit and make the code clearer - add a new 'advanced browsing spec' set of tests for tracking these features as they are built out
refs TryGhost#5604 - `filter` is missing from tags & users - add it in and add tests which show it works
@ErisDS I read over the requirements, examples and the GQL source. I put together a bunch of examples along with sample code for an API that I think will support everything you're looking for. I reworked filters (aka "the JSON format"). I don't think the current implementation is flexible enough to achieve what you're looking for. I borrowed ideas from mongo on how to structure the JSON. I also changed filters around to support aggregate queries like what you're going for. And I made one small change to how null/not-null is handled, making it more concise. I wrote all of this up on a fork of the README and I forked the gist with samples as well. Here's the gist and here's the README. In the gist I also show what a javascript API for GQL might look like. I think it would be amazing to have that library available to work with knexify independently from the blogging platform. We can talk more about that if you're interested. I think going forward with the actual search implementation through bookshelf is probably best. That way you'll tie in with the event emitters, validation, and the pipelining via promises, etc. I have some ideas on joins/projections that I can offer as well. But first things first. Let's see what you think of the filters. Once we're on the same page there and have a working implementation we can add to the pile. Cheers!
I've read everything you've done, but I feel like I'm missing the beginning of your thought process. GQL is currently out there in the wild; it's kinda working, it's not the prettiest, but it gets most of the job done. There are some bugs which I haven't had a chance to really look into yet, so maybe you're basing your thoughts on one of those, but I don't see the explanation written anywhere. Sooo could you go back a few steps and explain in much more detail why you're proposing these changes?
Hi Hannah. Sure. A few pieces came together to prompt my ideas. I'm working on a blog. I want to add search. Google search might work but really isn't what I'm looking for. I don't want to pay for Swiftype. So I started looking around for search options for Ghost. I saw you're looking for a search lead and read some comments around the code which led me to believe that there was some work to be done and room still for implementation design. I read through the models and saw how the relations fit together. I found your sample use cases. A common problem stood out to me when I looked at them. APIs like this tend to start out with simple matching. People start to use the API and then want to do more and more complex queries. Before too long the API falls down and can't support pretty basic things like date range queries, aggregations or really anything other than exact matches. You did a fantastic job on the spec. So it was easy to see that not only had you thought ahead to some of those more advanced query types, but you took the time to document them beautifully so that we all could understand what you're looking for :) One thing that came up a lot in your examples was the use of At this point the Having worked a lot with mongo and elasticsearch, I had those models as references. In ES, queries based on The way With this in mind I went back to This immediately brought into scope the way mongo has implemented query operators. Having worked a lot with meteor, I know how expressive, natural and powerful this syntax is as you work with it. At this point With all of this as context, I thought:
Heck, why don't I just build this thing! So I swapped my architect hat in favor of my builder hat and I started to look at how I would implement an API to turn GQL into JSON filters which I could then pass directly into a call to posts.fetchAll(). It seemed natural that there should be a javascript API to do this transparently. This led me to the I iterated on this for a couple of hours by working through a lot of the use cases in the original README. As I did, I found that the filter mechanism I had in mind felt a lot more fluid than the original Statement Processing recommendation. And I knew it would support the more advanced aggregations that will be valuable to me and to others. I ran into one or two things such as the recommendation for support for whitespace in GQL. It seemed to prevent me from running the query And there was the obvious conflict in how nulls were handled. And lastly, to wrap this up, the filter model that I came up with is less verbose than the JSON format defined here. For example: This:
becomes this
I see a huge amount of potential for GQL. It solves a common problem that primarily Lucene solves. But the GQL syntax is even better. It's more fluid and more concise. I really like that. Does this help? I sure hope so. 'cus it's all I've got :)
Sorry for the slow response here! Too many exciting things happening at once 😁 Pretty much everything you've said here makes 100% sense and seems like a great improvement to GQL. I kinda rolled with what I had as a beta implementation - the adage that you should build one to throw away. I don't think it's quite that drastic, but what I mean is, I'm definitely not precious about the implementation. (With the exception of the fact we need to use an actual parser and not regex 😉) I am, however, a little bit more precious about the user-facing syntax. I'm open to changing it but I feel that needs to be considered from more angles than the internal workings because it's what users come into contact with. Re: the null behaviour, I'm not 100% sure what the right thing is but I do agree it needs revisiting. I'm really interested in the distinction you drew between a "count API" and a "count aggregate". This distinction makes it much easier to reason about some of the behaviour of counts in Ghost's API as a whole. However, I don't think that the user-facing filter syntax needs to change here? I may be missing something, but at first glance it appears to me that in Ghost's API
Hey Hannah. No worries :) So I worked on this a bunch over the last couple weeks and pretty much have it where I think it makes the most sense. It's all in laran/GQL. To touch on your comments:

Syntax: I changed it slightly. You can see what I did here. You'll see a couple differences in there from the spec defined here in 5604. Adding support for

Nulls: What I suggested before was harder work on the parser than I felt was necessary. So I left it as you originally had it, which works very well. My suggestion was more clever than it needs to be and more work than I think is worth it. It doesn't prevent a query from containing the actual string

Aggregates: Yeah, I took an even stronger departure from the original syntax, and for good reason. I took some time to document how I actually implemented aggregates. But I'll explain a bit more here just for reinforcement. Aggregate queries are a lot more tricky. And there's more than just the filter. There's table joins, grouping, etc. and it's (imho) just too complicated to cram all of that into the query. Hacks like this are required (which restrict the portability of the library). And there's a certain amount of context from the domain model that has to be configured outside of GQL. In the case of Ghost this comes from bookshelf. But it's a general concern. So, what I did instead was just create the API in a way that allowed all of the grouping, aggregate function calls, and joining to be done outside of gql, and then reference the aggregate value (something like I did this by introducing a Aggregates also gets us into the topic of joins, so I'll mention that I also added some smarts to help enable that.

Joins: Joins are another thing that need some support separately from the filters. There are actually three parts to the complete definition of a query.
I standardized the API on knex. So defining what fields to return is done that way (using Determining what rows to return is exactly what the GQL string syntax defines. So this is the problem that GQL solves. Re: how tables relate, it's something that also has to happen outside of GQL. But, because the GQL query tells us what tables are referenced (e.g. I would love to have you take a look at what I've put together. It's pretty complete from the perspective of what I was trying to achieve, which is basically building a GQL library that's complete enough to support full search capabilities within Ghost, while keeping it general-purpose enough to make GQL broadly viable and useful outside of Ghost. I think it does that nicely. I would have submitted a PR to review it. But it's such a complete overhaul that it didn't seem to make sense. Talk soon. |
This paragraph touches on a lot of different ideas, that I think are most poignant to the direction GQL needs to go in:
There are a couple of key things to pull out here I think:
You're absolutely correct here that the When I was originally designing these different query parameters, I didn't pay quite enough attention to the detail of how many aspects of the query filters would need to change. I had where & join covered (just about), plus the special rules for order & group, but I totally missed needing to work with

However, I disagree that it's too complicated to cram it all in. Fundamentally that's the purpose of GQL - to abstract all of that craziness of how SQL works away from the user. The GQL query language is primarily intended for use by theme creators = designers & frontend developers; the MO here is people who don't know how SQL works. The fact that asking for all the tags with at least one post has such wide-ranging effects on the query structure is an SQL implementation detail that I want to hide away, even if it is ridiculously hard to do that. I don't want the end user to have to think about any of that. That's why I don't think that count should have a
The portability of the library is definitely not a primary concern. Knex itself allows for various different underlying databases, so we've already got quite a flexible tool. I think we need to solve problems with the querying possibilities first, and then, when all that is nailed, if there's someone interested in making GQL more portable, that'll be great. It's also worth noting that all of the context from the domain model that's currently configured in GQL is deliberately kept apart from the internal workings, because it's clearly a problem that needs to be solved; but again, getting the internals of GQL & the querying working 100% first is more important - we might find the domain model needs to change yet, and once we know what it looks like, it'll be easier to move. Secondarily, the way that GQL interacts with knex is also identified as not being quite right just yet and there's even an open issue for it. Again, it should be much easier to solve when the internals of GQL are closer to being complete. In short, I really want to focus on getting the internals right before addressing the interface problems.
Hey Hannah. I appreciate the comments. I'm with you on all points in terms of where we want to end up. I think we can have a conversation about how we get there. I think we're actually super, super close to being able to just do straight up

Bottom line is I 100% share your feelings in terms of building GQL to be right for the intended audience. We'll make that happen. One point I want to bring up quickly is that the JSON in the previous comment on this thread is no longer valid. Here's what it would be given the current GQL change I'm proposing: This was the old JSON that you posted:
I translated that to this GQL string: Which becomes this JSON using the new
I'll follow up on your other comments later today when I have some time. I'm looking forward to working on this with you Hannah. Cheers! |
I 100% agree that this is all going in the right direction 👍 The updated internal JSON spec looks good to me. I think it makes a lot more sense, but it would be great to have it outlined in an issue over GQL |
Super. We're off to a great start :) I'll create an issue with an outline of the changes and we can take it from there. |
Just wanted to point out some changes to the string syntax which I've outlined here:
OK, most of those things are improvements to my pretty buggy handling of the intended syntax. There are two major exceptions:
This should not change. It was a deliberate design choice based on Gmail, Google, and other filter interfaces that users have exposure to.
This is new. I think that
Just to be sure you noticed it, I'll mention that the negation is also done now by putting the negation character before the property, not the value. So, now it's

One of the observations that I made early on was that with Then, as I looked at handling negation of not just single values but also clauses, putting the negation before the property instead of before the value would allow the syntax to stay consistent. For example: Likewise: This wasn't possible in a consistent way with the negation character coming before the value, because there is no value for clauses. There's just the clause itself as a unit. I also moved away from And, in fact, this is the way Google does negation for search conditions with a property (e.g. Re: Also, GitHub uses "NOT" for negation. It doesn't use Lucene, on the other hand [uses both](https://lucene.apache.org/core/2_9_4/queryparsersyntax.html#Boolean%20operators) NOT and

Another thing to consider is that someone who does a lot of front-end work, but who isn't a super-technical person, and who isn't a 'walking compiler', is probably used to seeing For someone with absolutely zero context of boolean logic or programming syntax, I would agree with you that And then, of course, you have developers who DO intuitively understand

Another thing is that you'll end up with queries like this. For instance, we want a query that means "is featured and was not created after 2016-01-01". The correct syntax (using the But a less technical person could very easily try to collapse the I think having the NOT character be the inverse of the AND character will lead to confusion like this. Using

So, I think it's just worth noting these things as we think things through. What are your thoughts?
@laran I'm sorry for dropping the ball on GQL big time over the last.. um 4 months. We're looking to pick API & OAuth up again after the current project to get Ghost 1.0.0 shipped. If you still want to be involved I'd love to have you 😄 I'm closing most API issues temporarily with the JSON API Overhaul & OAuth access are currently scheduled next on the roadmap |
This spec/issue describes the `filter` parameter originally mentioned in issue #5463; however, it contains a more in-depth spec than usual, due to the complexity of the feature. This issue serves as a basis for documentation whilst the feature is in active development, as well as giving a basis for testing.
Summary
The idea behind filter queries is including additional parameters in Browse API requests, causing the result to be filtered in some useful way. This provides an additional level of power and flexibility for users interacting with their data in themes, apps, and in the admin UI.

The initial focus is on the ability to do advanced queries against Posts, Tags, and Users. The following document outlines a detailed proposal for a 'query language' exposed via a top-level `filter` property available when accessing the API from various tools.

Key Contextual Information
The Ghost JSON API supports full BREAD operations; however, filtering only applies to Browse requests.
Resources
There are 3 main resources in the Ghost JSON API which are 'public' resources, and have clear use cases which require advanced filtering:
Relations
The following relations need special treatment when querying across resources.
Additionally, the relations & business logic in a blog reveal the following important relational data concepts:
API Access Tools
There are 3 key tools for requesting data from the Ghost JSON API which need to be considered:
GET /api/posts?limit=5
api.posts.browse({limit: 5})
{{#get "posts" limit="5"}}
Proposal
Adding advanced filtering to Ghost's JSON API via a top-level `filter` parameter which accepts a string representation of an SQL query, highly optimised for the kinds of queries which we expect will be common to Ghost, but flexible enough to grow outside of anticipated use cases.

Top-Level Filter Parameter
A top-level `filter` parameter will provide advanced querying capabilities available in all 3 tools in a standard way, such that the parameter's value is always a string:

GET /api/posts?limit=5&filter=???
api.posts.browse({limit: 5, filter: "???"})
{{#get "posts" limit="5" filter="???"}}
In both the method call and get helper syntaxes, the filter parameter value will be surrounded by double-quotes. Avoiding double-quotes inside the string is therefore a requirement.
In the HTTP version, the string will need to be URL-safe, so URL encoding & decoding will be required.
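A minimal sketch of the encoding step, using JavaScript's standard `encodeURIComponent` (the filter value is one of the examples used later in this spec):

```javascript
// Build the HTTP form of a request: write the filter string once,
// then URL-encode it for use as a query parameter value.
const filter = 'tags:photo,featured:true';
const url = '/api/posts?limit=5&filter=' + encodeURIComponent(filter);
// ':' becomes %3A and ',' becomes %2C:
// /api/posts?limit=5&filter=tags%3Aphoto%2Cfeatured%3Atrue
```

`decodeURIComponent` on the server side recovers the original string unchanged.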
Filter Parameter Values
The key requirement of the `filter` parameter is to accept values which can represent the rough equivalent of the `where` clause of an SQL statement (language diagram). The fundamentals of this proposal are therefore the definition of a language which fits Ghost's use cases and can be converted into SQL through Ghost's Bookshelf/Knex model layer.

SQL's capabilities are extensive. To start with, the focus for Ghost is on supporting logical and & or combinations, comparisons using equals, greater than & less than (with support for not). The importance of the relation between Post & Tags also drives the need to support in.
Filter Anatomy
An individual filter is composed of 3 parts: the property that is being filtered, the operator (currently always `=`) and the value. Filters need to be combined in groups of or & and rules.

The property needs to support path matching: e.g. `tag` currently really means `tag.slug`.

The operators need to include equals, not equals, and numerical comparisons.

The values are strings, numbers, nulls etc. that are matched/compared against.
Filter Style / Syntax
There are a number of common styles for combining properties, operators and values into expressions. This proposal borrows heavily from the style used in GitHub, Gmail & Slack amongst others, e.g. `label:api` or `from:-hannah` or `stars:<10`.

The syntax of this style is the use of property-operator-value triplets in the form property:operatorvalue, where the colon is a separator, and the operator is optional.
This syntax is short and compact for the common use cases in Ghost, for example: `featured:true`, `tags:photo`, `author:john`, `image:-null`, `posts.count:>10`.
This syntax is also flexible enough to be extended to support multiple complex expressions by combining the property-operator-value triplets together with other symbols to represent and & or. Following on from using `-` for negation, the proposal is to use `+` for and, as well as `,` for or, e.g. `featured:true+tags.count:>10`, `tags:photo,tags:video`.
This is then used inside of the `filter=""` parameter, e.g.:

GET /api/posts?limit=5&filter=tags:photo,featured:true
api.posts.browse({limit: 5, filter: "tags:photo,featured:true"})
{{#get "posts" limit="5" filter="tags:photo,featured:true"}}
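To make the triplet anatomy concrete, here is a minimal sketch of splitting a single property-operator-value expression. This is not the real GQL parser (which is grammar-based, see Implementation below), and it ignores quoting, grouping and `[]` sets:

```javascript
// Split one property:operatorvalue triplet into its three parts.
// Assumes a bare expression with no quotes, groups or [] sets.
function splitTriplet(expr) {
    const colon = expr.indexOf(':');
    const property = expr.slice(0, colon);
    const rest = expr.slice(colon + 1);
    // the operator prefix is optional; ':' on its own is roughly '='
    const match = rest.match(/^(>=|<=|>|<|-)?(.*)$/);
    return { property: property, operator: match[1] || '=', value: match[2] };
}

splitTriplet('posts.count:>10'); // { property: 'posts.count', operator: '>', value: '10' }
splitTriplet('featured:true');   // { property: 'featured', operator: '=', value: 'true' }
```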
Filter Query Language Spec
Syntax
Filter Expressions
A filter expression is a string which provides the property, operator and value in the form property:operatorvalue:

- `:` - separator between property and an operator-value expression
- `:` on its own is roughly `=`
Property
Matches: `[a-zA-Z_][a-zA-Z0-9_.]`

- may contain `_`
- `.` separated paths, e.g. `author.name` or `posts.count`
Filter expressions can be combined using the following:

- `+` - represents and
- `,` - represents or
- `(`filter expression`)` - overrides operator precedence

Operator
The following operators are spec'd for the first version:

- `-` - not operator
- `>` - greater than operator
- `>=` - greater than or equals operator
- `<` - less than operator
- `<=` - less than or equals operator

Value
Can be one of the following:

- may not start with `-` but may contain it, and may contain `'"+,()><=[]` only when escaped
- anything that can be parsed with JavaScript's `new Date()`
Special characters:

- `:` - filter expression separator
- `+` - and symbol
- `,` - or symbol
- `(` - left parenthesis (operator precedence)
- `)` - right parenthesis (operator precedence)
- `-` - not operator
- `>` - greater than operator
- `>=` - greater than or equals operator
- `<` - less than operator
- `<=` - less than or equals operator
- `[` - left bracket (in operator)
- `]` - right bracket (in operator)
- `'` - single quote (string syntax)
- `"` - double quote - NOT PERMITTED UNESCAPED

Whitespace rules
Whitespace is not permitted inside of a property, or inside a value unless the value is surrounded with quotes. Everywhere else, whitespace is ignored and therefore becomes optional.
This allows for whitespace to be used to improve readability if necessary, as any of the following are equivalent and will be accepted.
featured:true+posts.count:>10,tags:photo,tags:video

featured:true + posts.count:>10,tags:photo, tags:video

featured: true + posts.count: >10,tags: photo, tags: video

featured: true + posts.count :> 10,tags: photo, tags: video
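The whitespace rule can be sketched as a small normalisation pass, assuming a hypothetical `stripOptionalWhitespace` helper; whitespace survives only inside quoted values (escape sequences are ignored for brevity):

```javascript
// Remove optional whitespace from a filter string, preserving anything
// inside single- or double-quoted values.
function stripOptionalWhitespace(filter) {
    let out = '';
    let quote = null;
    for (const ch of filter) {
        if (quote) {
            out += ch;                        // inside quotes: keep everything
            if (ch === quote) { quote = null; }
        } else if (ch === "'" || ch === '"') {
            quote = ch;                       // entering a quoted value
            out += ch;
        } else if (!/\s/.test(ch)) {
            out += ch;                        // outside quotes: drop whitespace
        }
    }
    return out;
}
```

All four equivalent examples above normalise to `featured:true+posts.count:>10,tags:photo,tags:video`.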
Implementation
Hold on to your hats, this is where shit gets real...
Creating a Parser
As far as I can determine, the set of rules above describes a language that can be represented by an unambiguous context-free grammar and therefore understood by an LALR(1) parser.
What this means is that we can represent the set of rules in a standard way, and then use JISON to generate a parser in JavaScript, which can in turn be used in Ghost to understand filter query strings, and convert them into something more useful. Essentially, all that has to be done is write a set of rules & the hard part of this is done. Neat.
The parser should live in its own repository, with a full suite of tests, and become a dependency of Ghost. The repository for it is here: https://github.com/TryGhost/GQL
Parser Output
The next step is to determine what the 'something more useful' that the parser is supposed to output looks like. The essence of what we're doing is representing SQL, so we could output the SQL equivalent of a filter query string and then pass that to `knex.whereRaw()`, which will then run the query for us inside the model layer of Ghost.

In fact, this is what I did for a first test that all this works. I wrote a rudimentary version of the rules, output raw SQL, and ran it through knex. And it worked.
The downside to parsing straight to SQL is two-fold: firstly, by translating from one string format to another, there is little opportunity for semantic analysis - that is, to validate or operate on the property and value parts of the query to check they exist, or are permitted (filtering by the `password` property of the User resource should not be permitted, for example). Secondly, it seems to me that running raw SQL queries, even through knex, is likely to result in security problems.

Instead of churning out SQL, the parser needs to output an intermediary format that we can use to get all of the useful and contextual information out of the query. This format should be something we can easily pass to knex's query builder functions, like `.where()` and `.orWhere()` (which take a property, an operator and a value), so that we can run the queries through the query builder.

The intermediary format must not lose any information from the query.
JSON Format
The proposal is to use an array. Each filter expression will be converted into an object with key value pairs representing the property, operator, and value.
Arrays are ordered, and so the order of the array will match the order in which the query should be built.
All except the first item of the array should also indicate whether the expression is to be combined with and or or.
To maintain precedence overrides, a single object in the array can alternatively represent a group that contains a sub-array of objects. The group object must also indicate whether it is combined using and or or.
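As a hedged sketch of this intermediary format and of building a query from it: the key names (`property`, `operator`, `value`, `combination`, `group`) and the `applyFilter` helper are illustrative assumptions, not the final GQL output, while `where`/`orWhere` (including the grouped callback form) are real knex query builder methods:

```javascript
// `featured:true+posts.count:>10,tags:photo` as a flat array - each item
// after the first says how it combines with what came before:
const flat = [
    { property: 'featured', operator: '=', value: true },
    { combination: 'and', property: 'posts.count', operator: '>', value: 10 },
    { combination: 'or', property: 'tags.slug', operator: '=', value: 'photo' }
];

// `featured:true+(tags:photo,tags:video)` - a parenthesised group becomes
// a nested array on a group object, preserving precedence:
const grouped = [
    { property: 'featured', operator: '=', value: true },
    {
        combination: 'and',
        group: [
            { property: 'tags.slug', operator: '=', value: 'photo' },
            { combination: 'or', property: 'tags.slug', operator: '=', value: 'video' }
        ]
    }
];

// Walk the array (recursing into groups) and apply it to a knex query
// builder `qb`; groups use knex's grouped-where callback form, in which
// `this` is a nested query builder.
function applyFilter(qb, filters) {
    filters.forEach(function (filter) {
        const method = filter.combination === 'or' ? 'orWhere' : 'where';
        if (filter.group) {
            qb[method](function () {
                applyFilter(this, filter.group);
            });
        } else {
            qb[method](filter.property, filter.operator, filter.value);
        }
    });
    return qb;
}
```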
This format lends itself nicely to building queries using the knex query builder functions, as we can cycle through the array (and recursively through groups) calling `where` or `orWhere` as necessary. Having a JSON format with named key-value pairs for property, operator, and value makes it easy to use lodash magic to pull out parts of the query (e.g. all properties or all values) and validate them.

Post-Processing
Once we have our filter in a JSON format, and we can pull out properties (table & field names) and values, we can start to do some validation and processing of the information in order to ensure the queries are safe and sensible.
Public vs Private
Certain properties should not be filtered upon, e.g. `user.email`, `user.password`, `user.status` and `post.status`, as all of these properties are affected by permissions / not intended to be available to public endpoints.

For now, as we're only intending to offer 'official' public-only beta API access via HTTP and the get helper, these properties could be sanitised out entirely if it's easier, or otherwise they should be sanitised out if the context is public (see the change coming with public permissions for more context for what I mean here).
Standard validation
All of the properties should be known attributes of the models, and all of the values should be valid values for those attributes, according to our standard validation mechanisms.
Joins
Any reference to a related table needs to be translated into a join.
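For example, a filter referencing `tags.slug` on a posts query implies two joins; a sketch using knex's real `innerJoin` signature, with table and column names taken from the SQL example later in this spec:

```javascript
// A filter on `tags.slug` implies joining posts -> posts_tags -> tags.
// `qb` is a knex query builder for the posts table.
function joinTags(qb) {
    return qb
        .innerJoin('posts_tags', 'posts_tags.post_id', 'posts.id')
        .innerJoin('tags', 'posts_tags.tag_id', 'tags.id');
}
```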
Dates
Any string value used along with a date property, e.g. `created_at`, `updated_at`, `published_at`, `last_login`, should be converted into a Date that SQL can understand, providing it is something that can be parsed using JavaScript's `new Date()`.
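A sketch of that conversion; `DATE_PROPERTIES` mirrors the list above, and the `coerceDate` helper name is a placeholder:

```javascript
// Convert string values for known date properties into JS Dates before
// they reach the query builder; other values pass through unchanged.
const DATE_PROPERTIES = ['created_at', 'updated_at', 'published_at', 'last_login'];

function coerceDate(property, value) {
    if (DATE_PROPERTIES.indexOf(property) === -1 || typeof value !== 'string') {
        return value;
    }
    const parsed = new Date(value);
    // new Date() yields an Invalid Date (NaN time value) for unparseable input
    if (isNaN(parsed.getTime())) {
        throw new Error('Invalid date value for ' + property + ': ' + value);
    }
    return parsed;
}
```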
Some care needs to be taken here to check this will work across SQLite, MySQL and pg, as they treat dates differently and Bookshelf/Knex is not good at handling this (yet).

Special Rules
There are a few queries which should have special rules for how they affect other parts of the query. The single example at the moment is filtering posts by tags where tag.slug is `IN` a set (or in fact, any use of `IN`).

When using `IN` we need to be sure that we still return distinct items in the set (and the correct number). When using `IN` the default order should naturally change to be ordered by the number of matches in descending order.

E.g. when requesting posts with tags `IN` a group, the query output should be similar to (or have the same effect as) this:

select * from "posts"
inner join "posts_tags" on "posts_tags"."post_id" = "posts"."id"
inner join "tags" on "posts_tags"."tag_id" = "tags"."id"
where "page" = false and "status" = 'published'
and "posts"."id" != 11
and "tags"."slug" in ('markdown', 'list')
group by "posts"."id"
order by count(tags.id) DESC, "posts"."status" ASC, "posts"."published_at" DESC, "posts"."updated_at" DESC, "posts"."id" DESC
limit 5
Note the `group by` and `order by count(tags.id)` parts.
parts.Examples
Here is a separate Gist outlining some example queries in HTTP, method call & get helper form: https://gist.github.com/ErisDS/f516a859355d515aa6ad, feel free to suggest more example use cases in the comments.