Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: Filter parameter (GQL filter queries spec) #5604

Closed
ErisDS opened this issue Jul 26, 2015 · 7 comments
Closed

API: Filter parameter (GQL filter queries spec) #5604

ErisDS opened this issue Jul 26, 2015 · 7 comments
Assignees
Labels
affects:api Affects the Ghost API

Comments

@ErisDS
Copy link
Member

ErisDS commented Jul 26, 2015

This spec/issue describes the filter parameter originally mentioned in issue #5463, however it contains a more in-depth spec than usual, due to the complexity of the feature. This issue serves as a basis for documentation whilst the feature is in active development, as well as giving a basis for testing.

Table of Contents:


Summary

The idea behind filter queries is including additional parameters in Browse API requests, causing the result to be filtered in some useful way. This provides an additional level of power and flexibility for users interacting with their data in themes, apps, and in the admin UI.

The initial focus is on the ability to do advanced queries against Posts, Tags, and Users. The following document outlines a detailed proposal for a 'query language' exposed via a top-level filter property available when accessing the API from various tools.

Key Contextual Information

The Ghost JSON API supports full BREAD operations, however filtering only applies to Browse requests.

Resources

There are 3 main resources in the Ghost JSON API which are 'public' resources, and have clear use cases which require advanced filtering:

  • Posts
  • Tags
  • Users

Relations

The following relations need special treatment when querying across resources.

  • Posts have many Tags
  • Posts have one Author (Users)
  • Tags have many Posts
  • Users have many Posts

Additionally, the relations & business logic in a blog reveal the following important relational data concepts:

  • Each Post has a Next Post and a Previous Post.
  • Each Post has a Tag Count
  • Each User has a Post Count
  • Each Tag has a Post Count

API Access Tools

There are 3 key tools which can be used to request data from the Ghost JSON API which need to be considered:

  1. RESTful HTTP calls - E.g. GET /api/posts?limit=5
  2. Method calls (used internally & in apps) - E.g. api.posts.browse({limit: 5})
  3. The (in progress) 'get' handlebars helper - E.g. {{#get "posts" limit="5"}}

Proposal

Adding advanced filtering to Ghost's JSON API via a top-level filter parameter which accepts a string representation of an SQL query, highly optimised for the kinds of queries which we expect will be common to Ghost, but flexible enough to grow outside of anticipated use cases.

Top-Level Filter Parameter

A top-level filter parameter will provide advanced querying capabilities available in all 3 tools in a standard way, such that the parameter's value is always a string:

  • HTTP: GET /api/posts?limit=5&filter=???
  • Method Call: api.posts.browse({limit: 5, filter: "???"})
  • Get Helper: {{#get "posts" limit="5" filter="???"}}

In both the method call and get helper syntaxes, the filter parameter value will be surrounded by double-quotes. Avoiding double-quotes inside the string is therefore a requirement.

In the HTTP version, the string will need to be url safe. Url encoding & decoding will be required.

Filter Parameter Values

The key requirement of the filter parameter is to accept values which can represent the rough equivalent of the where clause of an SQL statement (language diagram). The fundamentals of this proposal are therefore the definition of a language which fits Ghost's use cases and can be converted into SQL through Ghost's Bookshelf/Knex model layer.

SQL's capabilities are extensive. To start with, the focus for Ghost is on supporting logical and & or combinations, comparisons using equals, greater than & less than (with support for not). The importance of the relation between Post & Tags also drives the need to support in.

Filter Anatomy

An individual filter is composed of 3 parts, the property that is being filtered, the operator (currently always =) and the value. Filters need to be combined in groups of or & and rules.

The property needs to support path matching: e.g. tag currently really means tag.slug.

The operators need to include equals, not equals, and numerical comparisons.

The values are strings, numbers, nulls etc that are matched/compared against.

Filter Style / Syntax

There are a number of common styles for combining properties, operators and values into expressions. This proposal borrows heavily from the style used in GitHub, Gmail & Slack amongst others: E.g.label:api or from:-hannah or stars:<10.

The syntax of this style is the use of property-operator-value triplets in the form: property:operatorvalue, where the colon is a separator, and operator is optional.

This syntax is short and compact for the common use cases in Ghost, for example: featured:true, tags:photo, author:john, image:-null, posts.count:>10

This syntax is also flexible enough to be extended to support multiple complex expressions by combining the property-operator-value triplets together with other symbols to represent and & or. Following on from using - for negation, the proposal is to use + for and as well as , for or e.g: featured:true+tags.count:>10, tags:photo,tags:video.

This is then used inside of the filter="" parameter, E.g:

  • HTTP: GET /api/posts?limit=5&filter=tags:photo,featured:true
  • Method Call: api.posts.browse({limit: 5, filter: "tags:photo,featured:true"})
  • Get Helper: {{#get "posts" limit="5" filter="tags:photo,featured:true"}}

Filter Query Language Spec

Syntax

Filter Expressions

A filter expression is a string which provides the property, operator and value in the form property:operatorvalue:

  • property - a path representing the key to filter on
  • : - separator between property and an operator-value expression
    • operator is optional, so : on its own is roughly =

Property

Matches: [a-zA-Z_][a-zA-Z0-9_.]

  • can contain only alpha-numeric characters and _
  • cannot contain whitespace
  • must start with a letter
  • supports . separated paths, E.g. author.name or posts.count
  • is always lowercase, but accepts and converts uppercase

Filter expressions can be combined using the following:

  • + - represents and
  • , - represents or
  • ( filter expression ) - overrides operator precedence

Operator

The following operators are spec'd for the first version:

  • - - not operator
  • > - greater than operator
  • >= - greater than or equals operator
  • < - less than operator
  • <= - less than or equals operator

Value

Can be one of the following

  • null
  • true
  • false
  • a *number *(integer)
  • a literal
    • Any character string which follows these rules:
    • Cannot start with - but may contain it
    • Cannot contain any of these symbols: '"+,()><=[] unless they are escaped
    • Cannot contain whitespace
  • a string surrounded by single quotes
    • Any character except a single or double quote surrounded by single quotes
    • Single or Double quote *_MUST _be escaped
    • Can contain whitespace
    • A string can contain a date any format that can be understood by new Date()

Special characters:

  • : - filter expression separator
  • + - *_and *_symbol
  • , - *_or *_symbol
  • ( - left parenthesis (operator precedence)
  • ) - right parenthesis (operator precedence)
  • - - not operator
  • > - greater than operator
  • >= - greater than or equals operator
  • < - less than operator
  • <= - less than or equals operator
  • [ - left bracket (*_in *_operator)
  • ] - right bracket (*_in *_operator)
  • ' - single quote (*_string *_syntax)
  • " - double quote - NOT PERMITTED UNESCAPED

Whitespace rules

Whitespace is not permitted inside of a property, or inside a value unless the value is surrounded with quotes. Everywhere else, whitespace is ignored and therefore becomes optional.

This allows for whitespace to be used to improve readability if necessary, as any of the following are equivalent and will be accepted.

  • featured:true+posts.count:>10, tags:photo,tags:video
  • featured:true + posts.count:>10, tags:photo, tags:video
  • featured: true + posts.count: >10, tags: photo, tags: video
  • featured: true + posts.count :> 10, tags: photo, tags: video

Implementation

Hold on to your hats, this is where shit gets real...

Creating a Parser

As far as I can determine, the set of rules above describes a language that can be represented by an unambiguous context-free grammar and therefore understood by an LALR(1) parser.

What this means is that we can represent the set of rules in a standard way, and then use JISON to generate a parser in JavaScript, which can in turn be used in Ghost to understand filter query strings, and convert them into something more useful. Essentially, all that has to be done is write a set of rules & the hard part of this is done. Neat.

The parser should live in its own repository, with a full suite of tests, and become a dependency of Ghost. The repository for it is here: https://github.com/TryGhost/GQL

Parser Output

The next step is to determine what the 'something more useful' that the parser is supposed to output looks like. The essence of what we're doing is representing SQL, so we could output the SQL equivalent of a filter query string and then pass that to knex.whereRaw() , which will then run the query for us inside the model layer of Ghost.

In fact, this is what I did for a first test that all this works. I wrote a rudimentary version of the rules, output raw SQL, and ran it through knex. And it worked.

The downside to parsing straight to SQL is two-fold: firstly, by translating from one string format to another, there is little opportunity for semantic analysis - that is to validate or operate on the property and value parts of the query to check they exist, or are permitted (filtering by the password property of the User resource should not be permitted, for example). Secondly, it seems to me that running raw sql queries, even through knex, is likely to result in security problems.

Instead of churning out SQL, the parser needs to output an intermediary format that we can use to get all of the useful and contextual information out of the query. This format should be something we can easily pass to knex's query builder functions, like .where() and .orWhere(), (which take a property, and operator and a value) so that we can run the queries through the query builder.

The intermediary format must not lose any information from the query.

JSON Format

The proposal is to use an array. Each filter expression will be converted into an object with key value pairs representing the property, operator, and value.

Arrays are ordered, and so the order of the array will match the order in which the query should be built.

All except the first item of the array should also indicate whether the expression is to be combined with and or or. Something like this:

{statements: [
    {prop: 'tags.slug', op: '=', value: 'photo'},
    {prop: 'featured', op: '=', value: true, func: 'and'},
    {prop: 'tag.count', op: '>', value: 5, func: 'or'}
]}

To maintain precedence overrides, a single object in the array can alternatively represent a group that contains a sub-array of objects. The group object must also indicate whether it is combined using and or or.

{statements: [
    {prop: 'author.slug', op: '!=', value: 'joe'},
    {
        group: [
            {prop: 'tags.slug', op: '=', value: 'photo'},
            {prop: 'image', op: 'IS NOT', value: null, func: 'or'},
            {prop: 'featured', op: '=', value: true, func: 'or'}
        ], func: 'and'
     }
]}

This format lends itself nicely to building queries using the knex query builder functions, as we can cycle through the array (and recursively through groups) calling where or orWhere as necessary. Having a JSON format with named key-value pairs for property, operator, and value, makes it easy to use lodash magic to pull out parts of the query (e.g. all properties or all values) and validate them.

Post-Processing

Once we have our filter in a JSON format, and we can pull out properties (table & field names) and values, we can start to do some validation and processing of the information in order to ensure the queries are safe and sensible.

Public vs Private

Certain properties should not be filtered upon, e.g. user.email, user.password, user.status and post.status as all of these properties are affected by permissions / not intended to be available to public endpoints.

For now, as we're only intending to offer 'official' public-only beta API access via HTTP and the get helper, these properties could be santised out entirely if it's easier, or otherwise they should be santised out if the context is public (see the change coming with public permissions for more context for what I mean here).

Standard validation

All of the properties should be known attributes of the models, and all of the values should be valid values for those attributes, according to our standard validation mechanisms.

Joins

Any reference to a related table needs to be translated into a join.

Dates

Any String value used along with a date property E.g. created_at, updated_at, published_at, last_login should be converted into a Date that SQL can understand, providing it is something that can be parsed using JavaScript's new Date(). Some care needs to be taken here to check this will work across SQLite, MySQL and pg as they treat dates differently and Bookshelf/Knex is not good at handling this (yet).

Special Rules

There are a few queries which should have special rules for how they affect other parts of the query. The single example at the moment is filtering posts by tags where tag.slug is IN a set (or in fact, any use of IN).

Where using IN we need to be sure that we still return distinct items in the set (and the correct number). When using IN the default order should naturally change to be ordered by the number of matches in descending order.

E.g. when requesting posts with tags IN a group, the query output should be similar to (or have the same affect as) this:

select * from "posts" inner join "posts_tags" on "posts_tags"."post_id" = "posts"."id" inner join "tags" on "posts_tags"."tag_id" = "tags"."id" where "page" = false and "status" = 'published' and "posts"."id" != 11 and "tags"."slug" in ('markdown', 'list') group by "posts"."id" order by count(tags.id) DESC, "posts"."status" ASC, "posts"."published_at" DESC, "posts"."updated_at" DESC, "posts"."id" DESC limit 5

Note the group by and order by count(tags.id) parts.

Examples

Here is a separate Gist outlining some example queries in HTTP, method call & get helper form: https://gist.github.com/ErisDS/f516a859355d515aa6ad, feel free to suggest more example use cases in the comments.

@ErisDS ErisDS added the affects:api Affects the Ghost API label Jul 26, 2015
@ErisDS
Copy link
Member Author

ErisDS commented Jul 27, 2015

The GQL repository has a PR with an initial implementation: TryGhost/GQL#2
Working on a PR to include this in Ghost, along with the {{#get}} helper so that people can start to play with it.

@ErisDS
Copy link
Member Author

ErisDS commented Oct 19, 2015

I put this in slack, but going to put it here too as it's a design change in the API:

For the old style requests, we include in the response some data under meta.filter which represents the object we filtered by, so posts.browse({tag: 'getting-started'}) would include meta.filter.tagscontaining an array with the getting started tag object inside it.

This was, I think, always a bit of a hack so we didn't have to do multiple requests to the API to build the page

I'm not intending to rebuild that behaviour for the new-style filter parameter - trying to figure out what extra objects need to be fetched and included is incredibly difficult, and I think, unnecessary fragility.

I think the frontend controller needs to ask for the data it wants, much like getting rid of the ability to fetch both posts & pages from the API, we need to remove all the 'hacks' from the API that we built in to make our own lives easier. E.g. part of #5943 will be rewriting the frontend controller to be able to perform multiple requests.

@ErisDS ErisDS mentioned this issue Oct 20, 2015
24 tasks
ErisDS added a commit to ErisDS/Ghost that referenced this issue Oct 22, 2015
refs TryGhost#5604, refs TryGhost#5463

- deps: ghost-gql@0.0.2
- adds code to wire up the filtering to a paginated query
- updated pagination plugin count query to use 'distinct' so it's more robust
- rename paginationUtils.query to addLimitAndOffset to be more explicit and make the code clearer
- add a new 'advanced browsing spec' set of tests for tracking these features as they are built out
ErisDS added a commit to ErisDS/Ghost that referenced this issue Nov 16, 2015
refs TryGhost#5604

- `filter` is missing from tags & users - add it in and add tests which show it works
@laran
Copy link

laran commented Mar 5, 2016

@ErisDS I read over the requirements, examples and the GQL source. I put together a bunch of examples along with sample code for an api that I think will support everything you're looking for.

I reworked filters (aka "the JSON format"). I don't think the current implementation is flexible enough to achieve what you're looking for. I borrowed ideas from mongo on how to structure the JSON.

I also changed filters around to support aggregate queries like what you're going for with posts.count. I took some ideas from how ElasticSearch handles aggregates.

And I made one small change to how null/not-null is handled. Made it more concise.

I wrote all of this up on a fork of the README and I forked the gist with samples as well. Here's the gist and here the README.

In the gist I also show what a javascript API for GQL might look like. I think it would be amazing to have that library available to work with knexify independently from the blogging platform. We can talk more about that if you're interested.

I think going forward with the actual search implementation through bookshelf is probably best. That way you'll tie in with the event emitters, validation, and the pipelining via promises, etc.

I have some ideas on joins/projections as well that I can offer as well. But first things first. Let's see what you think of the filters. Once we're on the same page there and have a working implementation we can add to the pile.

Cheers!

@ErisDS
Copy link
Member Author

ErisDS commented Mar 5, 2016

I've read everything you've done, but I feel like I'm missing the beginning of your thought process. GQL is currently out there in the wild, it's kinda working, it's not the prettiest, but it gets most of the job done. There are some bugs which I haven't had chance to really look into yet so maybe you're basing your thoughts on one of those, but I don't see the explanation written anywhere.

Sooo could you go back a few steps and explain in much more detail why you're proposing these changes?

@laran
Copy link

laran commented Mar 5, 2016

Hi Hannah. Sure. A few pieces came together to prompt my ideas.

I'm working on a blog. I want to add search. Google search might work but really isn't what I'm looking for. I don't want to pay for Swiftype. So I started looking around for search options for Ghost.

I saw you're looking for a search lead and read some comments around the code which led me to believe that there was some work to be done and room still for implementation design.

I read through the models and saw how the relations fit together.

I found your sample use cases. A common problem stood out to me when I looked at them. APIs like this tend to start out with simple matching. People start to use the API and then want to do more an more complex queries. Before too long the API falls down and can't support pretty basic things like date range queries, aggregations or really anything other than exact matches.

You did a fantastic job on the spec. So it was easy to see that not only had you thought ahead to some of those more advanced query types, but you took the time to document them beautifully so that we all could understand what you're looking for :)

One thing that came up a lot in your examples was the use of .count. I translated some of the sample the GQL queries into SQL and tried to achieve the intended results, but couldn't. In order to get counts you need to use a GROUP BY. In order to achieve the results intended by posts.count:>1 you need to use GROUP BY HAVING.

At this point the statements: {prop, op, value} API fell down for me because it's not in sync with the way I have to think to get .count. The query to get .count requires a bit at the beginning and also at the end of the query. So it's more of an 'around' than an inline thought process.

Having worked a lot with mongo and elasticsearch, I had those models as references.

In ES, queries based on .count are implemented as an aggregation. And aggregations are specified separately from basic boolean or comparison match operators. It was a little weird for me at first to think of them differently. But it came to make sense quickly, feels natural after the initial learning curve and works very well.

The way .count was modeled in GQL was more like the ES count API. The difference between the count API and the count aggregation is that the count API just tells you the number of results that matched. But to use the number of matches to filter results, you need to use the aggregation.

With this in mind I went back to .count. With the idea that an aggregation mechanism is the way to go, and having worked with mongo a lot lately it seemed obvious to change posts.count to posts.$count.

This immediately brought into scope the way the mongo has implemented query operators. Having worked a lot with meteor, I know how expressive, natural and powerful this syntax is as you work with it.

At this point statements: {prop, op, value} seemed kinda clunky and verbose.

With all of this as context, I thought:

  • This seems straightforward.
  • I really want that search capability on my blog.
  • I'm going to need it if I'm going to be able to use Ghost for the larger help site that I need to build.
  • And I sure would like to use Ghost because:
  • - I've already built the custom theme for the blog
  • - and I want to have the same basic theme for the help site
  • - without having to go and use another tool like Wordpress just because Ghost doesn't support search.
  • And Swiftype is still a lot more than I want to spend for something as fundamental as search.
  • And Ghost would be a lot more amazing than it already is
  • - if it had things like analytics such as number of views, number of clicks, related posts
  • - and they're going to need a more robust API to be able to provide things like that
  • And I bet the Ghost team feels pretty hamstrung by not having great search in spite of all of the great features Ghost does have.
  • And they're already looking for a search lead.

Heck, why don't I just build this thing!

So I swapped my architect hat in favor of my builder hat and I started to look at how I would implement an API to turn GQL into JSON filters which I could then pass directly into a call to posts.fetchAll().

It seemed natural that there should be a javascript API to do this transparently. This led me to the gql.findAll('posts').filter() API that I modeled in my fork of your gist.

I iterated on this for a couple of hours by working through a lot of the use cases in the the original README. As I did I found that the filter mechanism that I had in mind felt a lot more fluid than the original Statement Processing recommendation. And I knew it would support the more advanced aggregations that will be valuable to me and to others.

I ran into one or two things such as the recommendation for support for whitespace in GQL. It seemed to prevent me from running the query title:Hello World! against posts. But the space doesn't break any real rules in terms of the query logic. So I allowed it.

And there was the obvious conflict in how nulls were handled. image:null makes a bit more sense than title:null because image is a blob. But it prevents a query where title is actually the string 'null'. This may seem contrived. But I can avoid the conflict altogether by stating that image means WHERE image NOT NULL and image: means WHERE image IS NULL. With this syntax there's no special handling. And I felt like I was on the right track when I saw that others had stumbled on similar small inconsistencies. This syntax is also more concise.

And lastly, to wrap this up, the filter model that I came up with is less verbose than the JSON format defined here.

For example:

This:

{statements: [
    {prop: 'author.slug', op: '!=', value: 'joe'},
    {
        group: [
            {prop: 'tags.slug', op: '=', value: 'photo'},
            {prop: 'image', op: 'IS NOT', value: null, func: 'or'},
            {prop: 'featured', op: '=', value: true, func: 'or'}
        ], func: 'and'
     }
]}

becomes this

{
    'author.slug': { $ne: 'joe' },
    {
        'tags.slug': 'photo',
        'image',
        'features': true
    }
} 

I see a huge amount of potential for GQL. It solves a common problem that primarily Lucene solves. But the GQL syntax is even better. It's more fluid and more concise. I really like that.

Does this help? I sure hope so. 'cus it's all I've got :)

@ErisDS
Copy link
Member Author

ErisDS commented Mar 24, 2016

Sorry for the slow response here! Too many exciting things happening at once 😁

Pretty much everything you've said here makes 100% sense and seems like a great improvement to GQL. I kinda rolled with what I had as a beta implementation - the adage that you should build one to throw away. I don't think it's quite that drastic, but what I mean is, I'm definitely not precious about the implementation. (With the exception of the fact we need to use an actual parser and not regex 😉)

I am, however, a little bit more precious about the user-facing syntax. I'm open to changing it but I feel that needs to be considered from more angles than the internal workings because it's what users come into contact with. Re: the null behaviour, I'm not 100% what the right thing is but I do agree it needs revisiting.

I'm really interested in the distinction you drew between a "count API" and a "count aggregate". This distinction makes it much easier to reason about some of the behaviour of counts in Ghost's API as a whole. However, I don't think that the user-facing filter syntax needs to change here?

I may be missing something, but at first glance it appears to me that in Ghost's API include="posts.count" is the equivalent of a count API, and filter="post.count>1" is the equivalent of an aggregate. As count in a filter is always an aggregate there seems little need for adding the $?

@laran
Copy link

laran commented Mar 26, 2016

Hey Hannah. No worries :)

So I worked on this a bunch over the last couple weeks and pretty much have it where I think it makes the most sense. It's all in laran/GQL.

To touch on your comments:

Syntax: I changed it slightly. You can see what I did here. You'll see a couple differences in there from the spec defined here in 5604. Adding support for LIKE queries was one in particular that I thought was valuable.

Nulls: What I suggested before was harder work on the parser than I felt was necessary. So I left it as you originally had it, which works very well. My suggestion was more clever than it needs to be and more work that I think is worth it. It doesn't prevent a query from containing the actual string 'null' either because you can just quote it.

Aggregates: Yeah, I took an even stronger departure from the original syntax, and for good reason. I took some time to document how I actually implemented aggregates. But I'll explain a bit more here just for reinforcement.

Aggregate queries are a lot more tricky. And there's more than just the filter. There's table joins, grouping, etc. and it's (imho) just too complicated to cram all of that into the query. Hacks like this are required (which restrict the portability of the library). And there's a certain amount of context from the domain model that has to be configured outside of GQL. In the case of Ghost this comes from bookshelf. But it's a general concern.

So, what I did instead was just create the API in a way that allowed all of the grouping, aggregate function calls, and joining to be done outside of gql, and then reference the aggregate value (something like posts_count, which would have been posts.count in what I'll refer to as the 'old' syntax, and posts.$count in the syntax I proposed a couple weeks ago).

I did this by introducing a $having operator that works like this: $having.post_count. I wrote a full example including all the logic for joining, etc. here. Seeing the example will probably help to understand how $having works.

Aggregates also gets us into the topic of joins, so I'll mention that I also added some smarts to help enable that.

Joins: Joins are another thing that need some support separately from the filters.

There's actually three parts to the complete definition of a query.

  1. What fields to return (the * in select *)
  2. What rows to return (determined by where and having clauses)
  3. How tables relate to one another (foreign keys, joins, etc.) within the context of points 1 and 2.

I standardized the API on knex. So defining what fields to return is done that way (using knex.select().

Determining what rows to return is exactly what the GQL string syntax defines. So this is the problem that GQL solves.

Re: how tables relate, it's something that also has to happen outside of GQL. But, because the GQL query tells us what tables are referenced (e.g. comments.author_id or posts.slug, etc.), I added smarts to extract that information from the query. So I wrote a function that returns an array of the tables referenced in the GQL query. There's a full example of how to use that function on the wiki.

I would love to have you take a look at what I've put together. It's pretty complete from the perspective of what I was trying to achieve, which is basically building a GQL library that's complete enough to support full search capabilities within Ghost, while keeping it general-purpose enough to make GQL broadly viable and useful outside of Ghost. I think it does that nicely.

I would have submitted a PR to review it. But it's such a complete overhaul that it didn't seem to make sense.

Talk soon.

@ErisDS
Copy link
Member Author

ErisDS commented Mar 28, 2016

This paragraph touches on a lot of different ideas, that I think are most poignant to the direction GQL needs to go in:

Aggregate queries are a lot more tricky. And there's more than just the filter. There's table joins, grouping, etc. and it's (imho) just too complicated to cram all of that into the query. Hacks like this are required (which restrict the portability of the library). And there's a certain amount of context from the domain model that has to be configured outside of GQL. In the case of Ghost this comes from bookshelf. But it's a general concern.

There are a couple of key things to pull out here I think:

Aggregate queries are a lot more tricky. And there's more than just the filter. There's table joins, grouping, etc. and it's (imho) just too complicated to cram all of that into the query.

You're absolutely correct here that the filter parameter of the API has wide reaching consequences on the SQL query. I've got a document I've been working on locally which is trying to work out the various details from both the perspective of doing count filtering (aggregates) and which I was also using to try to understand where permissions fit in (other issue entirely) - this is the relevant snippet from that doc: https://gist.github.com/ErisDS/f3e817e97f50ec039128

When I was originally designing these different query parameters, I didn't pay quite enough attention to the detail of how many aspects of the query filters would need to change. I had where & join covered (just about), plus the special rules for order & group, but I totally missed needing to work with having clauses. Therefore I'm quite certain the issues you're working through with this are things I missed in the original spec, and that need thinking through and solving.

However, I disagree that it's too complicated to cram it all in. Fundamentally that's the purpose of GQL - to abstract all of that craziness of how SQL works away from the user. The GQL query language is primarily intended for use by theme creators = designers & frontend developers, the MO here is people who don't know how SQL works.

The fact that asking for all the tags with at least one post has such wide ranging effects on the query structure is an SQL implementation detail that I want to hide away, even if it is ridiculously hard to do that. I don't want the end user to have to think about any of that. That's why I don't think that count should have a $ and I definitely don't think that the keyword "having" has any place in the GQL syntax.

Hacks like this are required (which restrict the portability of the library). And there's a certain amount of context from the domain model that has to be configured outside of GQL. In the case of Ghost this comes from bookshelf. But it's a general concern.

The portability of the library is definitely not a primary concern. Knex itself allows for various different underlying databases, so we've already got quite a flexible tool. I think we need to solve problems with the querying possibilities first, and then when all that is nailed, if there's someone interested in making GQL more portable, that'll be great.

It's also worth noting that all of the context from the domain model that's currently configured in GQL is deliberately kept apart from the internal workings because it's clearly a problem that needs to be solved, but again, getting the internals of GQL & the querying working 100% first is more important - we might find the domain model needs to change yet, once we know what it looks like, it'll be easier to move.

Secondarily, the way that GQL interacts with knex is also identified as not being quite right just yet and there's even an open issue for it. Again, it should be much easier to solve when the internals of GQL is closer to being complete.

In short, I really want to focus on getting the internals right before addressing the interface problems.

@laran
Copy link

laran commented Mar 28, 2016

Hey Hannah. I appreciate the comments.

I'm with you on all points in terms of where we want to end up. I think we can have a conversation about how we get there. I think we're actually super, super close to being able to just do straight up tags.count for Ghost. And I think we can get 100% there using the internals that I've built without taking away the portability. So let's have a conversation about that. I think we're in really good shape there.

Bottom line is I 100% share your feelings in terms of building GQL to be right for the intended audience. We'll make that happen.

One point I want to bring up quickly is that the JSON in the previous comment on this thread is no longer valid. Here's what it would be given the current GQL change I'm proposing:

This was the old JSON that you posted:

{statements: [
    {prop: 'author.slug', op: '!=', value: 'joe'},
    {
        group: [
            {prop: 'tags.slug', op: '=', value: 'photo'},
            {prop: 'image', op: 'IS NOT', value: null, func: 'or'},
            {prop: 'featured', op: '=', value: true, func: 'or'}
        ], func: 'and'
     }
]}

I translated that to this GQL string:
'!author.slug:joe+(tags.slug:photo, !image:null, featured:true)'

Which becomes this JSON using the new gql.parse():

[
  {$not: {'author.slug': 'joe'}}
  [
    {'tags.slug': "photo"}
    {$or: {$not: {image: null}}}
    {$or: {featured: true}}
  ]
]

I'll follow up on your other comments later today when I have some time.

I'm looking forward to working on this with you Hannah. Cheers!

@ErisDS
Copy link
Member Author

ErisDS commented Mar 28, 2016

I 100% agree that this is all going in the right direction 👍

The updated internal JSON spec looks good to me. I think it makes a lot more sense, but it would be great to have it outlined in an issue over GQL

@laran
Copy link

laran commented Mar 29, 2016

Super. We're off to a great start :)

I'll create an issue with an outline of the changes and we can take it from there.

@laran
Copy link

laran commented Mar 29, 2016

Just wanted to point out some changes the the string syntax which I've outlined here:
https://github.com/laran/GQL/wiki/GQL-Syntax

  • Literals and strings are now almost interchangeable. In literals you have to escape , and +. In strings you only need to escape '. Literals in IN clauses also need ] to be escaped.
  • Whitespace is allowed anywhere. Leading and trailing space is trimmed off literals.
  • Negation is done with ! instead of -
  • Double quotes are no longer a special character
  • A 'LIKE' operator was added with ~. Should be paired with % appropriately.
  • Dates and values starting with true, false or null can be written as literals

@ErisDS
Copy link
Member Author

ErisDS commented Mar 29, 2016

OK, most of those things are improvements to my pretty buggy handling of the intended syntax.

There are two major exceptions:

Negation is done with ! instead of -

This should not change. It was a deliberate design choice based on gmail, google, and other filter interfaces that users have exposure to. ! only means 'not' to a developer.

A 'LIKE' operator was added with ~. Should be paired with % appropriately.

This is new. I think that ~ makes sense for "LIKE" but I'm a developer, so it needs consideration from a user perspective.

@laran
Copy link

laran commented Mar 29, 2016

Just to be sure you noticed it I'll mention that the negation is also done now by putting the negation character before the property, not the value. So, now it's !created_at:<2016-01-01 instead of what it used to be, which was created_at:-<2016-01-01

One of the observations that I made early on was that with - and the old syntax there was no way to indicate that you want something to equal a negative number. Saying quantity equals negative 1 would have been written quantity:-1. But that would have been interpreted as quantity does not equal 1. Changing that to ! solved that problem.

Then, as I looked at handling negation of not just single values but also clauses putting the negation before the property instead of before the value would allow the syntax to stay consistent.

For example: !created_at:2016-01-01 translates to {$not: {created_at: '2016-01-01'}}.

Likewise: !(featured:true, created_at:>2016-03-01) translates to {$not: [{featured: true}, {$or: {created_at: {$gt: '2016-01-01'}}}]}. The whole clause is negated.

This wasn't possible in a consistent way with the negation character coming before the value because there is no value for clauses. There's just the clause itself as a unit.

I also moved away from not equals via either != or $ne again to keep the idea of negation consistent. Having those in the mix would introduce is not, does not equal or other variations of the language around a simple not depending if it was an IN clause, a simple value or a set of conditions (using parens). The most consistent approach was to put the not operator before the property. That way the syntax is consistent and applies equally to all types of negation.

And, in fact, this is the way google does negation for search conditions with a property (e.g. -site:wordpress.org).

Re: - vs !. I understand where you're coming from. Google does use - for negation. I think it's relevant to point out though that Google searches are never anything but text. The search terms are never the name of a column, for instance. For anything more complicated than a simple match (clauses of conditions, IN queries, counts or other aggregations would fall outside what I'd consider "simple" Google directs you to use the advanced search page. So, with GQL can do we're already pushing the boundaries past what Google's search syntax can do.

Also, Github uses "NOT" for negation. It doesn't use -. So we're already different from Github syntax there.

Lucene, on the other hand [uses both](https://lucene.apache.org/core/2_9_4/queryparsersyntax.html#Boolean operators) NOT and -.

Another thing to consider is that someone that does a lot of front-end work, but who isn't a super-technical person, and who isn't a 'walking compiler' is probably used to seeing - in the context of CSS selectors and HTML attributes. To them - is a separator. It doesn't conceptually translate automatically to boolean logic. So in a way, using - is overloading the character, which makes it less straightforward of a choice for a boolean operator for that audience. It's the kind of thing where they're so used to glossing over hyphens as separators that they might gloss over a hyphen in a GQL string as well. And since they are probably less used to looking at code with a critical eye for syntax, it might be harder for them to spot a bug in their query.

For someone with absolutely zero context of boolean logic or programming syntax I would agree with you that - is probably a better choice though. Their mental model would likely gravitate to thinking of addition and subtraction operators.

And then, of course, you have developers who DO intuitively understand !. I can't think of a language that doesn't use ! to mean NOT.

Another thing is that you'll end up with queries like this featured:true+-created_at:<2016-03-01. There you have the + and the - right next to each other. They seem like inverses because addition and subtraction are inverses. But they mean totally different things. People could get confused into thinking that they should use EITHER the + OR the -.

For instance, we want a query that means "is featured and was not created after 2016-01-01".

The correct syntax (using the - operator) would be featured:true+-created_at:<2016-01-01.

But a less technical person could very easily try to collapse the + and - together to think that - means "AND NOT" and try to write the intended query like this instead featured:true-created_at:<2016-01-01, which would be invalid.

I think having the NOT character being the inverse of the AND character will lead to confusion like this. Using ! better distinguishes NOT from AND.

So, I think it's just worth noting these things as we think things through. What are your thoughts?

@ErisDS ErisDS added the later [triage] Things we intend to work but are not immediate priority label Sep 20, 2016
@ErisDS
Copy link
Member Author

ErisDS commented Sep 20, 2016

@laran I'm sorry for dropping the ball on GQL big time over the last.. um 4 months. We're looking to pick API & OAuth up again after the current project to get Ghost 1.0.0 shipped. If you still want to be involved I'd love to have you 😄


I'm closing most API issues temporarily with the later label.

JSON API Overhaul & OAuth access are currently scheduled next on the roadmap

@ErisDS ErisDS closed this as completed Sep 20, 2016
@ErisDS ErisDS removed later [triage] Things we intend to work but are not immediate priority labels Jan 23, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects:api Affects the Ghost API
Projects
None yet
Development

No branches or pull requests

2 participants