Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving the interpreter #3

Open
ashalkhakov opened this issue Sep 18, 2015 · 15 comments
Open

Improving the interpreter #3

ashalkhakov opened this issue Sep 18, 2015 · 15 comments

Comments

@ashalkhakov
Copy link

Below are some thoughts on improving the interpreter. Posting mainly for discussion.

Proposal 1 (on nil handling in the interpreter):

  1. add a new characteristic to operators, IsNilPropagating, which means that if any arguments evaluate to a nil, then the operator will also inevitably evaluate to a nil (my guess is that most operators are nil-propagating)
  2. use a new kind of exception (say NilException) for signalling if a nil is the result of evaluating a plan node; said exception will be caught in the following situations:
    1. when storing a nil value to a variable (assignment, update/insert/delete statements)
    2. when encountering a nil in a restrict (or the like), where it is considered same as false
    3. when exiting the expression context (e.g. in a program begin Foo(); Bar() end, where Foo() is a function, a call to Foo() is evaluated and results in a nil)
    4. inside if-condition, case-condition, for/while-condition, where-condition evaluation (and the like)

The upside is that a lot of repetitive conditional logic is replaced by exception handling, which should also be faster at runtime.

Proposal 2 (avoiding scalar type boxing when calling operators):

The PlanNode.Execute() function works only on boxed types. Avoiding boxing of return types and of arguments should improve performance considerably (by lowering GC stress).

  1. introduce static non-virtual Evaluate methods
    1. .NET type will be along these lines: A (Program, A1,A2,...,An) if the operator is nil-propagating, and MakeNullable<A> (Program, MakeNullable<A1>,MakeNullable<A2>,...,MakeNullable<An>) if the operator is not nil-propagating, where MakeNullable is just like Nullable if T is a value type, and is just T otherwise
  2. for every PlanNode, keep a Type _returnType for the .NET type it's Evaluate method evaluates to (or, introduce TypedPlanNode)
  3. at the CallNode, figure out if the operator's defining class defines a typed static Evaluate method (if it does, then the CallNode will invoke the method via a DynamicMethod)

Note that we could probably dispense with the Nodes array in the plan node for making every node typed on its arguments. For instance, IntegerPowerNode will have members _left of type TypedPlanNode<int> and _right of type TypedPlanNode<int>. There doesn't seem to be much code that goes over the Nodes array in some involved computation.

Is there a characteristic for operators which could help in deciding if Program argument is necessary for the operator to evaluate?

Obviously, some unboxing/boxing will still be necessary (e.g. when reading from/storing into variables).

Proposal 3 (avoid boxing of rows):

This one I haven't figured out yet. Currently, all rows are generically boxed (see Runtime.Data.Row and Runtime.Data.NativeRow).

  1. generate descendants of a Row/NativeRow which implement the existing API and typed getters/setters (currently, values are persisted as objects...)
  2. keep a cache somewhere that will ensure there is only one dynamic type for every DAE type
    http://www.codeproject.com/Articles/13337/Introduction-to-Creating-Dynamic-Types-with-Reflec

I guess we could specialize some interpreter code with the exact row types?

  • for instance, specialize PlanNodes by the respective types of rows they can handle
@ashalkhakov
Copy link
Author

Regarding proposal 3: I think I've implemented a drop-in replacement for Runtime.Data.NativeRow that will store the elements in unboxed form, and will allow getting/setting their values using type-parameterized methods.

@ashalkhakov
Copy link
Author

Attempted proposal 1.

Issue: NilException is really pervasive! There are many places where nil is a perfectly fine result of evaluation, and there are way too many places where Evaluate is called on a plan subnode, so it's hard (for me) to decide how to rearrange the code. Proposal 1 seems out of reach.

@brynrhodes
Copy link
Contributor

Sorry for the delayed response, but regarding proposal 1, when we first built Dataphor, we had exceptions for accessing nils, so that nil was effectively treated just like a null in C#, but we found that it caused many more problems than it solved, and ended up allowing nil propagation in general.

Having said that, I like the suggestion that an operator be able to inform the compiler about it's nil propagation behavior so that the nilable characteristic could be inferred more generally, but I'm not sure what would be gained by throwing exceptions when nils are encountered? We can already use constraints to prevent assignment of nil, what problem is being addressed by adding the NilException?

Regards,
Bryn

@brynrhodes
Copy link
Contributor

Regarding proposals 2 and 3, I agree, that would be a useful change, and I have been working quite a bit in that area lately. I would definitely be interested in seeing what you are doing there.

Specifically, I've been looking at dealing with all values natively, rather than having to have DataValues. To do this, I'm introducing interfaces to support access to data values of different categories of types. IDataValue being the most general, which only specifies an IDataType. Not that all native types would have to implement IDataValue, but if they did, the engine could take advantage of it.

Also looking at being able to support IList directly (instead of ListValue), and introducing ITable and IRow to deal with Table and Row.

I have committed some of these changes to trunk already as part of adding more RESTful functionality, and some integration work with FHIR.

Regards,
Bryn

@ashalkhakov
Copy link
Author

Having said that, I like the suggestion that an operator be able to inform the compiler about it's nil propagation behavior so that the nilable characteristic could be inferred more generally, but I'm not sure what would be gained by throwing exceptions when nils are encountered?
We can already use constraints to prevent assignment of nil, what problem is being addressed by adding the NilException?

It is currently the case that the interpreter code is littered with null-checking. I was under the impression that in null-heavy programs, using exceptions instead of explicit branching was faster (no idea where I got that, but probably read in some paper...).

Regarding the nil-propogation characteristic: if nils are properly accounted for by an operator (e.g., we can say that it can never return a nil if its inputs are non-nil), then we could use the native .NET representations for the argument and result types, it should supposedly help with performance of arithmetic and comparison operators. However, I have yet to implement this characteristic.

Regarding proposals 2 and 3, I agree, that would be a useful change, and I have been working quite a bit in that area lately. I would definitely be interested in seeing what you are doing there.

I'm basically following the paper Self-optimizing AST interpreters, in particular: subsection 5.3 (return type specialization). I'm planning to go a bit further: let every plan node implement static methods with the native input types and the result type, then we could generate some IL code at call sites to avoid boxing completely. Much of IL code generation would then consist of a generic "apply function to arguments" routine.

@ashalkhakov
Copy link
Author

Please find an example of proposal 2.

Obviously it needs more polish (I don't follow coding conventions..), but it shows concretely what I am proposing to do. I would like to know what you think (and especially, if you foresee any issues with this approach that would limit its usefulness).

A simple expression like select 1+2 does evaluate via IL code (only the result is boxed!), but there are issues with:

  1. debugging such code (breakpoints, etc.)
  2. creating symbolic information for such code

Specifically, I've been looking at dealing with all values natively, rather than having to have DataValues. To do this, I'm introducing interfaces to support access to data values of different categories of types. IDataValue being the most general, which only specifies an IDataType. Not that all native types would have to implement IDataValue, but if they did, the engine could take advantage of it.

That would be great. For instance, row types could be generated at runtime, and if they support some interface (say IRowType), then it would make it much easier to use from non-generated code. Also, tables could use different runtime representations as long as they conform to an interface.

@ashalkhakov
Copy link
Author

Stumbled on a generic nil. In an expression like nil+5, nil is assigned the type Generic. How should it be treated? Its type is unknown at compile-time.

How is generic supposed to work? (We don't have any generic code in D4 at the moment.) Is it like a C# dynamic type? Or is it like a C# parametric type (i.e., T in void Method<T>(T arg))?

@brynrhodes
Copy link
Contributor

Nils are typed as generic, but when attempting to resolve an operator reference, the compiler will look for a "compatible" type, and a type that is only "generic" is considered compatible with any type, similar to the way an untyped null is handled in C#.

So in that respect, generic types in D4 are more like parametric types, but without the ability to declare anything other than the generics: "generic", "row", "list", "table", and "cursor".

@ashalkhakov
Copy link
Author

Thanks for the note on generic.

I've implemented some basic imperative node compilation. Currently using the interpreter stack (so every assignment and variable reference involve boxing/unboxing), but I'd like to change that (eventually).

Some sample programs that work:

select if 1=2 then 1 else 2
select case when 1+1=2 then 1 when 1=1 then 3 else 2 end

begin
end

begin
1+1;
end

begin
var x:Integer;
end

begin
var x:Integer;
x:=2+34+if 1=2 then 4 else 555;
select x;
end

Planning to tackle operator compilation and CallNode next.

@brynrhodes
Copy link
Contributor

I think I've implemented a drop-in replacement for Runtime.Data.NativeRow

Very cool, I like the approach. I am concerned about having to create .NET types for every row type though, but it does have a lot of advantages. I will definitely take a look and see if this approach could work for native row representation.

@brynrhodes
Copy link
Contributor

Please find an example of proposal 2.

Also very cool, and yes, I think the approach will work. We started on IL emission a long time ago, but never completed the effort. I am working on general memory and performance improvements now though, and IL compilation is close to the top of the list.

@ashalkhakov
Copy link
Author

Note that I've been working on this some more, and the progress so far is as follows.

  1. added more operators and more types
  2. ensured that existing Dataphor tests pass with the compiled code:

BooleanLibrary
ByteLibrary
IntegerLibrary
MathLibrary
MoneyLibrary
ShortLibrary
DecimalLibrary
GuidLibrary
LongLibrary
Conversions (quite a lot of money, decimal, date/datetime/time/timespan nodes involve US-specific conversions...)
MinMaxOperators (except device-specific tests)
most of LanguageConstructs
most of StringLibrary

  1. added compilation of operators and CallNode

Currently, the expression like 3+45*2 will be compiled into, basically: IntegerAdd.InternalExecute(new Nullable<int>(3), IntegerMultiply.InternalExecute(new Nullable<int>(45), new Nullable<int>(2)), where InternalExecute is a static method. However, when reading from variables or writing to variables, the stack is still used.

Issue: DDL nodes, like CreateOperatorNode, cannot be truly compiled, because these nodes rely on some member variables when executed. I haven't been able to find an easy way to access these variables from the generated code. I'm wondering if you have something to propose about this?

Currently working on removing the use of interpreter stack in compiled operator bodies.

These are the issues I'm planning to work on after that:

  1. emission of debugging symbols in IL code
  2. extending the supported D4 subset: row types, list types, generic type, table types and cursors

@brynrhodes
Copy link
Contributor

One fairly major concern I have with actually building .NET types for rows is that there is no way to unload assemblies (without unloading the entire app domain), so building types for each row type effectively constitutes an unbounded cache. The system could remember types to reduce the number of unnecessarily created new row types, but it would still represent a potential memory issue. At the same time, I love the idea that a row type is just a class with properties for each attribute of the row, so I'm torn here.

@ashalkhakov
Copy link
Author

One fairly major concern I have with actually building .NET types for rows is that there is no way to unload assemblies (without unloading the entire app domain), so building types for each row type effectively constitutes an unbounded cache. The system could remember types to reduce the number of unnecessarily created new row types, but it would still represent a potential memory issue.

I've picked the idea of canonicalizing row types from some paper on extensible records. Basically, {Y:Integer, X:String} is the same as {X:String, Y:Integer} (from the type equality point of view), therefore, we could always use the type where column names are sorted lexicographically. The difference in column ordering will then be represented by a "folder" object (which could be a simple array of integers, giving column indices in the desired order). For instance, a D4 row type {X: Integer, C: String, L:Boolean} will be represented by .NET class type (with columns ordered like C, L, X), coupled with an array of indices {2,0,1}, which says that, say, 3rd member of the class type is first, followed by the 1st member, followed by the 2nd member. This doesn't solve the issue of unloading types, though.

@ashalkhakov
Copy link
Author

I've just pushed an update: dropped the use of interpreter stack in compiled code. TestFramework.Coverage.Scripts.Timing runs in about 1 second on my machine (at 10000000 iterations). Quite sure it could be further improved.

One realization I had while working on this is that D4 programs use an unnamed De Bruijn representation for variables. Then, it occured to me that we could still introduce some local variables as needed (e.g. for nil checking), provided that the VariableContext stack is maintained in 1:1 correspondence to the Symbol stack. The alternative would be to introduce a let-insertion phase in the D4 compiler, such a phase would name every subexpression with a fresh variable name. The input program like:

var x := 5;
var y := x+x+10;
select y;

would get transformed into:

var x := 5;
var tmp1 := x+x;
var y := tmp1+10;
select y;

I guess that the let-insertion phase is much harder to implement.

Not using Program.Stack at all breaks batches (since a variable declaration at top level is just another batch, it gets compiled to a meaningless static function with 1 local). Any ideas on solving this issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants