Skip to content

research_optimization

nebula0717 edited this page Nov 1, 2016 · 59 revisions

Heuristic and Cost-based Optimization for Provenance in GProM

GProM supports Heuristic and Cost-based optimization for provenance computation. Heuristic optimization applies relational algebra equivalences technique to speed up provenance computations by rewriting a query into an equivalent one which with the expectation to have a lower cost. Cost-base optimization chooses between alternative options for instrumenting a query for provenance capture and controls the application of some of the algebraic equivalence rules we support.

  • -heuristic_opt
    Apply heuristic application of all relational algebra optimization rules. Default value: FALSE.

  • -cbo
    Apply cost-based optimization. Default value: FALSE.

Heuristic Optimization

  • -optimization.*optimization_option*
    Activate optimization option. Most options correspond to equivalence preserving relational algebra transformations. -optimization.*optimization_option* activates the option. To deactivate an option use -optimization.*optimization_option* FALSE. For example, -optimization.merge_operators activates a rule that merges adjacent selections and projections in a query.

All Options

  • -optimization.push_selections -push down selections
  • -optimization.merge_operators -merge adjacent projections and selections
  • -optimization.factor_proj_attr_in_expr -factorize projection attributes in expression
  • -optimization.materialize_merge_unsafe_proj -materialize merge unsafe projection
  • -optimization.remove_redundant_projections -remove redundant projections
  • -optimization.remove_redundant_duplicate_operator -remove redundant duplicate removals
  • -optimization.remove_unnecessary_window_operators -remove unnecessary window operators
  • -optimization.remove_unnecessary_columns -remove unnecessary columns
  • -optimization.pull_up_duplicate_remove_operators -pull up duplicate removal operators
  • -optimization.pulling_up_provenance_proj -pull up provenance projections
  • -optimization.push_selections_through_joins -push down selections through joins
  • -optimization.selection_move_around -selection move around

Detail

Selection Move Around

This optimization applies standard selection move-around techniques.For example, in JOIN(A=B)(SELECTION(A=5)(R), S) if schemas are R(A) and S(B) then we can pull-up SELECTION(A=5) through the join to get SELECTION(A=5)(JOIN(A=B)(SELECTION(A=5)(R), S)), then based on the equivalence A=B enforced by the join add the predicate B=5 to get SELECTION(A=5 AND B=5)(JOIN(A=B)(SELECTION(A=5)(R), S)) and then push down B=5 into the right input of the join to get SELECTION(A=5 AND B=5)(JOIN(A=B)(SELECTION(A=5)(R), SELECTION(B=5)(S))). Finally we remove the redundant predicate SELECTION(A=5 AND B=5) to get JOIN(A=B)(SELECTION(A=5)(R), SELECTION(B=5)(S)).


Example

Query:

 "SELECT * FROM R,(SELECT * FROM S WHERE C=5) sub WHERE A=C"

Result:

---->
Before Optimization After Optimization
***

Remove Redundant Duplicate Removals

Removes duplicate removal operators if the application of duplicate removal has no effect on the query result. We check for two cases here: 1) if the input relation has at least one candidate key, then there are no duplicates and the operator has no effect and 2) if the result of the duplicate removal is later subjected to duplicate removal by a downstream operator and none of the operators on the path to this downstream operator are sensitive to the number of duplicates then the operator can be safely removed.


Example

Query:

 "SELECT DISTINCT A,B FROM (SELECT DISTINCT A,B FROM R) sub"

Result:

----> ---->
Before Optimization After Optimization By Set After Optimization By Key
***

Pull up Provenance Projections

The provenance instrumentation used by GProM duplicates attributes of input tables using projection and propagates them to produce results annotated with provenance. This optimization tries to pull up such projections to delay the increase of schema sized caused by duplicating attributes.


Example

Query:

 "PROVENANCE OF (SELECT * FROM (SELECT * FROM R,S WHERE A=C) sub)"

Result:

---->
Before Optimization After Optimization
***

Merge Operators

Merge adjacent projection and selection operators. Selections will always be merged. However, merging projections can lead to an explosion of projection expression size. We actively check for such cases and avoid merging if this would increase the expression size dramatically. For example, consider a projection A + A AS B followed by a projection B + B AS C. Merging these two projections would result in the projection expression A + A + A + A AS C which has double the number of A references as the original projection. This optimization is important when computing transaction provenance. For a thorough explanation see the publications referenced on the GProM webpage.


Example

Query:

 "SELECT C+C AS D FROM (SELECT A+A AS C FROM R) sub"

Result:

---->
Before Optimization After Optimization
***

Factorize Attributes

Try to factor attributes in projection expressions to reduce the number of references to attributes. We currently support addition and multiplication expressions in CASE constructs. For example, CASE WHEN cond THEN A + 2 ELSE A END AS A can be refactored into A + CASE WHEN cond THEN 2 ELSE 0 END AS A to reduce the number of references to attribute A from 2 to 1.

Materialize Unsafe Projections

Force the backend database to materialize projections that could lead to uncontrolled expression growth if they would be merged with adjacent projections (as explained above for merge_ops).

Remove Redundant Projections

Removes projections that are unnecessary from a query, e.g., a projection on A, B over a table R(A,B) is redundant and should be removed to simplify the query.


Example

Query:

 "SELECT A FROM (SELECT * FROM R) sub"

Result:

---->
Before Optimization After Optimization
***

Remove Redundant Window Operators

Remove window operators (corresponding to SQL OVER clause expressions) which produce an output that is not used by any downstream operators.


Example

Query:

 "ELECT A FROM (SELECT SUM(A) OVER(PARTITION BY B ORDER BY B desc) AS W,A FROM R JOIN S ON A=C) sub"

Result:

---->
Before Optimization After Optimization
***

Remove Unnecessary Columns

Based on an analysis of which columns of the relation produced by an operator are used by downstream operators, we add additional projections to remove unused columns.


Example

Query:

 "SELECT A FROM (SELECT * FROM R,S WHERE A=C) sub"

Result:

---->
Before Optimization After Optimization
***

Pull up Duplicate Removals

This optimization tries to pull up duplicate removal operators.

Cost-based Optimization

The following options control the behavior of GProM’s cost-based optimizer:

-cbo_choice_point_remove_duplicate_removal - makes a cost-based choice of whether to remove a duplicate removal operator when possible.

-cbo_max_considered_plans num_plans - stop cost-based optimization after num_plans have been considered.

-cbo_sim_ann_const c - Set the constant c used by the simulated annealing search strategy to calculate ap, e.g., c = 10, 20, 50 or 100.

-cbo_sim_ann_cooldown_rate -
Set the cooling down rate used by simulated annealing. Value has to be between 0.1 and 0.9.

-cbo_num_heuristic_opt_iterations num_iter - Apply each heuristic optimization rule num_iter times.