
Provide drill across #245

Open
grugnog opened this issue Dec 11, 2014 · 8 comments

Comments

grugnog commented Dec 11, 2014

Right now, to drill across, users need to implement this logic as part of the analytics/reporting system, which is fine - but can be quite a bit of work in practice. It would be really convenient if there was a function that could assist with this.

In terms of architecture, it seems like this could perhaps be implemented as an additional layer on or beside the backend that could perform multiple queries and aggregate the results. This would make reuse easier (compared with having the cubes viewer make multiple queries directly, for example).

Rough sketch of how this might work:
/cubes/<cube1>+<cube2>+.../<aggr-options>/<query>

This could potentially be any query that is accepted by the "cube" endpoint. The logic would probably first need to parse the query and validate that each cube has all of the dimensions used in that query (it would be nice if cubes didn't need to have identical dimensions, as in #169).

Then the backend would perform the queries on each cube and aggregate the results. Assuming there is more than one way to aggregate the results (beyond just extra columns for each cube), an argument might be useful to allow the user to tweak that.
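A minimal sketch of that merge step, assuming each per-cube query already returns rows keyed by the shared drilldown values; the row shape, function name and column names are made up for illustration:

from collections import defaultdict

def drill_across(results_by_cube):
    """Merge per-cube aggregation rows on their shared drilldown key.

    results_by_cube maps a cube name to rows of the form
    (drilldown_key, {aggregate_name: value}) - a hypothetical shape,
    just to illustrate the "extra columns per cube" idea.
    """
    merged = defaultdict(dict)
    for cube_name, rows in results_by_cube.items():
        for key, aggregates in rows:
            for agg_name, value in aggregates.items():
                # prefix each aggregate with its cube so columns don't clash
                merged[key]["%s.%s" % (cube_name, agg_name)] = value
    return dict(merged)

# Example: two cubes drilled down by month
sales = [(("2014-12",), {"amount_sum": 100})]
costs = [(("2014-12",), {"amount_sum": 40})]
print(drill_across({"sales": sales, "costs": costs}))
# {('2014-12',): {'sales.amount_sum': 100, 'costs.amount_sum': 40}}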

I can see more complex stuff like cross-tables that wouldn't fit the above pattern, but I am not sure if it makes sense to try to include them (and, if it does, whether they could sensibly use this format without making it more complex for simple stuff).

Stiivi (Member) commented Jan 13, 2015

This feature can indeed be added to the SQL query generator; however, it has to be refactored first. See the reasons in this post.

As for the API, my suggestion would be to just extend the original interface with one more argument: across. The + in the list of cubes is an interesting idea, but what would the server say to /cube/foo+bar+baz/model instead of .../aggregate?

My suggestion would be: /cube/foo/aggregate?across=bar,baz&drilldown=something and in Python: browser.aggregate(foo, drilldown=something, across=[bar,baz]).

How does that interface sound?
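For illustration only, here is how a caller might use the proposed keyword; the across= argument is just the suggestion above (it does not exist in the library), and the cube and dimension names are made up:

from cubes import Workspace

workspace = Workspace("slicer.ini")
browser = workspace.browser("sales")   # the "primary" cube

# Proposed (not yet existing) drill-across keyword: aggregate "sales"
# together with the "costs" cube, drilled down by the shared date dimension.
result = browser.aggregate(drilldown=["date"], across=["costs"])

for cell in result.cells:
    print(cell)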

As with joining, I'm not quite sure yet how difficult it would be to implement, but I'll just throw a few pointers and notes here for future reference:

  • the QueryBuilder (to be dissolved, as described in the post linked above) will have to consider SnowflakeSchemas of other cubes as well
  • there should be something on top of the join_expression() that would take the already joined snowflakes

Stiivi modified the milestone: Refactoring Jan 14, 2015
Stiivi (Member) commented Jan 28, 2015

A few notes based on the current work on the sqlrefactor branch:

  • cubes have to come from the same store
  • store should be the place to provide/create StarSchema objects, not the Browser
  • there should be one StarSchema per (cube, locale[, user]?) tuple combo
  • StarSchema.star() and StarSchema.relevant_joins() should accept an argument across= with another star schema
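A rough sketch of how a store could lazily provide one StarSchema per cube combination, as outlined in the notes above; the class shape and the _build_star_schema() factory are assumptions for illustration, not code from the sqlrefactor branch:

class Store(object):
    """Illustrative only: caches one star-schema object per cube combination."""

    def __init__(self):
        self._star_schemas = {}

    def star_schema(self, cubes, locale=None):
        # one schema per (cube combination, locale); the order of cubes
        # does not matter, hence the frozenset
        key = (frozenset(cube.name for cube in cubes), locale)
        if key not in self._star_schemas:
            self._star_schemas[key] = self._build_star_schema(cubes, locale)
        return self._star_schemas[key]

    def _build_star_schema(self, cubes, locale):
        # hypothetical: would inspect the cubes' mappings and joins and
        # return an object that knows about all tables and joins involved
        raise NotImplementedError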

grugnog (Author) commented Jan 30, 2015

Thanks for looking into this. I think the interface sounds OK.

One question I have is whether there is some semantic or functional meaning in having a "primary" vs. "secondary" cube in a drill-across query. The cube (in the path element) feels more "primary" relative to the ones in the "across" argument, which might suggest that it is treated somehow differently. So if there is a semantic or functional difference between them, I think that distinction is a benefit - if not, then I think it complicates the mental model a bit, since it suggests there might be one.

Also, to clarify - my suggestion was to use a route "/cubes" (plural) that only handles drill-across queries, and perhaps other operations involving more than one cube. This would be in addition to the "/cube" endpoint. Hence I don't think the /cube/foo+bar+baz/model issue would come up, since that operation would only be available as /cube/*/model.

Stiivi (Member) commented Jan 30, 2015

As for "primary" vs "secondary" – I'm just following current interface which might also impose some limitations on the "secondary" cubes. This distinction is definitely not a benefit and as you say, it complicates the mental model a bit. I'm not in favor for those limitations, but I take it as an intermediate step.

I was thinking briefly about what would have to be changed to accommodate queries across multiple cubes. The current AggregationBrowser (in browser.py and its subclasses) is based on the premise that there will always be only one cube queried at a time. It holds a reference to the localized cube being browsed, the cube's mapper, and a reference to the cube's store. Based on this assumption, the Cell also has a reference to a cube. There is quite a lot of historical tight coupling in there that has to be untangled.

Browser

As for the Browser, we can either accept cubes instead of cube and have the browser tied to the multi-cube schema. Another option would be to use the browser as a pure functional controller, tied only to a store, that receives the cube(s) as an argument to every query. I have no opinion on either of the two right now. If you do, please let me know.
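To make the two options concrete, here is a hypothetical sketch of the two shapes; names and signatures are illustrative only, not the actual classes:

# Option 1: browser bound to a fixed set of cubes (multi-cube schema)
class BoundBrowser(object):
    def __init__(self, cubes, store):
        self.cubes = cubes
        self.store = store

    def aggregate(self, cell=None, drilldown=None):
        # would query the store's star schema for self.cubes
        raise NotImplementedError

# Option 2: browser as a stateless controller tied only to a store;
# the cube(s) are passed to every query
class StatelessBrowser(object):
    def __init__(self, store):
        self.store = store

    def aggregate(self, cubes, cell=None, drilldown=None):
        # the same query, but the cube combination arrives per call
        raise NotImplementedError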

We need to have an object which knows about all the joins and tables. The object currently called StarSchema on the sqlrefactor branch provides that functionality (decoupled from the Cubes model complexity). This object can easily be changed to accommodate multiple cubes. But then we would need to create (lazily) one star schema object per cube combination.

Cell

As for the Cell – there are some methods that were meant to be of an "explorative" nature, such as drill-down, which use the Cell-linked cube. We can either:

A. make the Cell dumb and use only string-based references, and move the explorative functions to some other utility object that keeps state (†). No cube reference in a cell. The Cell would be "portable" to any other cube with the same dimensions.
B. have the Cell include multiple cubes. Keep the explorative functions.
C. keep the Cell single-cube. No changes needed, but it introduces another mental-model complexity.

(A) feels lighter, but will require additional methods for checking the validity of the dimensions in the cell.
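A small sketch of option A, assuming cuts are represented as plain dicts with a "dimension" key and that a cube exposes name and dimensions attributes; the class and helper names are hypothetical:

# Option A sketched: a "dumb", cube-free cell that only carries string
# references, plus a separate helper that checks it against a cube.
class PortableCell(object):
    def __init__(self, cuts=None):
        # each cut references its dimension by name only
        self.cuts = list(cuts or [])

    def dimension_names(self):
        return [cut["dimension"] for cut in self.cuts]

def validate_cell(cell, cube):
    """Check that every dimension referenced by the cell exists in the cube."""
    known = set(dim.name for dim in cube.dimensions)
    missing = [name for name in cell.dimension_names() if name not in known]
    if missing:
        raise ValueError("Unknown dimensions for cube %s: %s"
                         % (cube.name, ", ".join(missing)))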

Model

Another, more radical option would be to bring back the concept of a "model" as a closed set of physically related cubes. There would be one Schema object with all tables and joins pre-analysed that would generate the underlying star JOIN compositions. Every Cube would belong to a Model and a Model would be tied to a Store. The Browser, regardless of the design decision from the options above, would use the store to provide the appropriate schema composition for a given query.

This might reduce some complexity, but would require more work as it would affect the Workspace as well.

Summary

We have a design chain reaction here, for historical reasons.

Besides the SQL backend query generator lacking the ability to include multiple fact tables, there is a bit more in the library that was based on the single-cube query assumption. This definitely needs to be redesigned.

I will keep this in mind during this refactoring. I don't think that any of the above will happen in this round, if I want to deliver a more understandable query generator, but it has to happen sooner or later. For the time being we will have to stick with primary/secondary cubes in the drill-across.

Anyway, any input – thoughts, pics, questions – on this matter is appreciated.


(†) Off-topic note: Here I kind of regret calling the Browser a Browser as it is in the current implementation. It would be nice to have a Browser as an explorative class holding together a cube and the currently viewed cell, usable from within environments such as the IPython Notebook. The aggregation/query-providing class would have a different name and the functionality of the current AggregationBrowser.

Stiivi (Member) commented Jan 30, 2015

@grugnog (forgot to respond) I see your point with /cubes/foo+bar+baz. Agreed.

dtheodor commented

IMO, this project is young enough that you should push class-name and API-specification changes sooner (in version 2) rather than later (= closer to never).

Stiivi (Member) commented Feb 27, 2015

@dtheodor good point.

How should the concept of the AggregationBrowser be changed, though?

Suggestion A

  • completely drop the AggregationBrowser as it is now
  • backends provide aggregation (and other methods) through the store, requiring that all objects queried through those methods belong to the store: store.aggregate(["cube1.agg1", "cube2.agg2"], drilldown=...)
  • the workspace provides aggregation (and other methods) and delegates the actual computation to the appropriate store for the given attributes. Early versions would reject inter-store queries; later versions might provide some result-combining functionality

The usage would be:

cube = workspace.cube("sales")
# or
cube = provider.cube("sales")

result = workspace.aggregate(cube.aggregate["value"])

# or as a string `cube.aggregate`
result = workspace.aggregate(["sales.value"])

# or from a store:
result = store.aggregate(cube.aggregate["value"])

Workspace can work with string references, since it can resolve them into namespaced cubes. Store would require Aggregate objects from a cube.

All attribute objects would have to be back-linked to their owners (model objects) such as cubes or dimensions.
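A minimal sketch of such back-linking; these tiny Cube and Aggregate stand-ins are illustrative only, not the library's actual model classes:

class Aggregate(object):
    def __init__(self, name, cube=None):
        self.name = name
        self.cube = cube          # back-link to the owning cube

    @property
    def ref(self):
        # fully qualified string reference, e.g. "sales.value"
        return "%s.%s" % (self.cube.name, self.name) if self.cube else self.name

class Cube(object):
    def __init__(self, name, aggregate_names):
        self.name = name
        self.aggregates = dict((n, Aggregate(n, cube=self))
                               for n in aggregate_names)

sales = Cube("sales", ["value", "count"])
print(sales.aggregates["value"].ref)   # -> "sales.value"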

HTTP API

With this change the HTTP API will not have to be changed. The cube-oriented API will stay as:
GET /cube/sales/aggregate
and the aggregate-oriented API (implicit drill-across between cubes within the same store) would be:
GET /aggregate?aggregates=sales.agg1,sales.agg2

To list all aggregates: GET /aggregates, which will make aggregates appear as top-level objects to the front-ends.

Therefore instead of GET /cube/cube1+cube2/aggregate I would go with GET /aggregate.
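For illustration, here is how a client might call the proposed aggregate-oriented endpoint; the route is just the proposal above, not an existing slicer endpoint, and the host, port, and aggregate names are made up:

import requests

# proposed drill-across call: aggregates from two cubes of the same store,
# drilled down by a shared dimension
response = requests.get(
    "http://localhost:5000/aggregate",
    params={
        "aggregates": "sales.amount_sum,costs.amount_sum",
        "drilldown": "date",
    },
)
print(response.json())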

Suggestion B

The first suggestion has several issues. The main one is that we can't have multiple models using the same store (database). Therefore we need an object that will keep the original model together:

  • bring back concept of Model as a package of dimensions and cubes
  • bind Browser to Model and Store
  • browser stays backend-specific

Side effects:

  • drop Namespace, the dimension lookup and linking would be done ONLY within the model
  • have functionality such as Model.merge(model) or Model.append(model_object), in the model specified as "import": ["model2", "model3.cube1", "base_model.base_date"]

I'm not going to develop this part further here; more should be discussed in #256.

A cube can be requested in a similar way as in suggestion A. Aggregation would be done through the workspace or the browser – the same principle as in A.

HTTP API would be the same as in A.

Conclusion

Regardless of the implementation, the best way to implement drill-across is to hide the concept of cubes completely in the aggregation call and use just a list of aggregates. Cubes would be resolved in the aggregating objects, and the aggregation would be computed or rejected based on the feasibility of the given combination of aggregates and dimensions (same store, shared dimensions, other rules ...).
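A sketch of that compute-or-reject rule, assuming each cube object knows its store name and dimension names (the attribute names here are hypothetical):

def check_drill_across(aggregate_refs, cubes_by_name, drilldown):
    """Resolve "cube.aggregate" references and reject invalid combinations."""
    cubes = set(cubes_by_name[ref.split(".", 1)[0]] for ref in aggregate_refs)

    stores = set(cube.store_name for cube in cubes)
    if len(stores) > 1:
        raise ValueError("Aggregates span multiple stores: %s" % sorted(stores))

    for cube in cubes:
        missing = set(drilldown) - set(cube.dimension_names)
        if missing:
            raise ValueError("Cube %s lacks dimensions: %s"
                             % (cube.name, ", ".join(sorted(missing))))
    return cubes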

Thoughts?

Stiivi modified the milestones: Refactoring, 1.3 Mar 24, 2015
Stiivi removed this from the 1.y - Drill-across milestone Mar 10, 2017
Stiivi (Member) commented Mar 14, 2017

See also #273

Stiivi added this to the 3.0 milestone Mar 24, 2017