Dynamic shapes #3

Open
botev opened this issue Dec 21, 2017 · 44 comments

@botev commented Dec 21, 2017

Reading the current spec, it seems that NNEF supports only fixed (constant) shape sizes. Have I understood that correctly, and if so, do you intend to support dynamic/variable shapes?

@rgiduthuri

NNEF supports dynamic shapes for inputs. See https://www.khronos.org/registry/NNEF/specs/1.0/nnef-1.0.html#external-op

@botev commented Dec 21, 2017

Yes, however this is true only for "external". For parameters (e.g. variable), one could allow variable shapes in order to produce a more "generic" graph expression, whose shapes are then specified for the graph later. The correctness of the shapes can be verified using polynomial integer expressions. I would guess this is not really on your radar at the moment, though.

@rgiduthuri

Since the variables in NNEF come out of training, like weights, and need to be stored in the container using the Tensor File Format, they can't be dynamic in NNEF 1.0.

Do you have a use case that requires dynamic shape support for variables coming out of any training frameworks?

@botev commented Dec 21, 2017

Not for post-training, no. The only use case would be for compilers that optimize graphs and cache the result, since changing the shape would then (most likely) not change the optimized code.

@rgiduthuri

Interesting idea. Such compilers could use a “0” value in the shape to indicate a dynamic shape for variables. Since such behavior falls outside the scope of NNEF 1.0, it could be a vendor extension for now.

@gyenesvi

@botev just to clarify that I get it right: are you saying that the structure description could be generic, not exactly specifying the shape of variables, and the shape would get fixed when it is bound to some actual data for the weights? For example, I could have the same general network structure, but in one data set I have 3x3 convolutions and in another I have 5x5, or the channel counts may be different. As @rgiduthuri says, this could be expressed syntactically (with 0s in the shape, just as in the case of externals), but we'd have to see clearly its implications on semantics, such as shape propagation (verification of correctness, as you mention). Can you give some more info (maybe links) on how polynomial integer expressions are used for this in practice?

@botev commented Dec 21, 2017

@gyenesvi Well, the main issue is that nobody uses polynomial integer expressions for now. I have argued a few times for people to adopt them, as they can make graph optimization significantly more generic and easy. This is something I think would be a nice feature. To clarify, imagine you have the following network (pseudocode):

n = symbolic_int("n")
x = input(2*n, 784)
x_half = x[:(n // 2)]
h1 = dense(x, num_units="a")
h2 = dense(h1, num_units="b")
h3 = dense(h2, nonlinearity=softmax, num_units=10)
loss = (x - h3)**2 / 2.0
// Raises an error since the first axes are different - n and n/2
loss_err = (x_half - h3) **2 / 2.0 

Now when you initialize/construct the model you will need to bind a and b (as they affect parameters).
Also, note that if you use polynomials rather than just a generic 0/None, you can detect faults such as the last line in the above (as polynomials, n is not equal to n/2). This additionally provides shape verification at compile/construction time rather than at runtime (e.g. lots of Theano/TF errors would not even exist). An example of how this could be used is here. Note that the package there is not maintained, however it uses polynomials to infer shapes correctly at construction time. It is a bit like providing better static inference in compilers (e.g. template-like).

@gyenesvi

Okay, I understand what you mean now. I played around a bit with a similar idea, but thought it would be too complicated to implement (I had not thought it through to the point of ending up with polynomial expressions). It would be easy to express in NNEF syntax, since it can have compile-time symbols, something like the following along the lines of your example (this is not valid now, because graphs cannot have compile-time parameters (only fragments can), it's just for the sake of the example, but otherwise the syntax is legal):

graph net( input: tensor, n: extent, a: extent, b: extent )
{
    x = external(shape = [2 * n,  784])
    [y, z] = split(x, axis = 0, ratios = [1,1])
    h1 = linear_layer(x, channels = a)
    h2 = linear_layer(h1, channels = b)
    h3 = linear_layer(h2, channels = 10)
    loss = (x - h3) ^ 2.0 / 2.0
    loss_err = (y - h3) ^ 2.0 / 2.0
}

So theoretically, it would be possible to deduce from the syntax that the last line is not valid, and raise an error, however, it would require somewhat complicated evaluation and comparison of the shape expressions (essentially creating the polynomial expressions in some canonical form). And as you say, it would require a binding mechanism for the compile-time parameters n, a, b.
So I'd say that this is definitely not a basic feature, but could be an interesting extension for the future if needed. As always, the question is whether its utility would outweigh the complexity required to implement it.

@botev commented Dec 21, 2017

Yes, this is exactly it. In terms of the complexity, I have already implemented the integer polynomials module for this, which supports operations like floor and ceil (for convolutions where we have bigger strides) as well as max and min. The implementation is in Rust, however it should not be too difficult to port it to C++ - here.

I do agree, however, that it is up to you guys to weigh whether that is something you want. Tbh I'm happy to help on this if you do think it is an interesting feature to add in the future. What I personally like about it is that in very complicated models, having something that verifies your shapes at construction time, rather than run time, is really amazing for research debugging. Similar benefits to having statically-typed variables compared to dynamically-typed ones.

@gyenesvi

Thanks, I'll let you know if we consider this in the future.

@gyenesvi

@botev, some updates on this one. I believe that, from the point of view of NNEF, enough information is present to implement what you discuss above. Previously, I thought you would need some symbolic integers to declare parametric shapes (like 'n' above), but in fact I have realized that in order to perform the checks, it does not matter how those variables are named. One only needs to be able to express that a dimension is not a fixed value, and that is possible by setting some components of the shape to 0 in the external definition, which is allowed. So the below is valid, and declares that the batch size is unknown:

graph net( input ) -> ( output )
{
    x = external(shape = [0,  784])
    [y, z] = split(x, axis = 0, ratios = [1,1])
    h1 = linear_layer(x, channels = 20)
    h2 = linear_layer(h1, channels = 30)
    h3 = linear_layer(h2, channels = 10)
    loss = (x - h3) ^ 2.0 / 2.0
    loss_err = (y - h3) ^ 2.0 / 2.0
}

So the real question is whether the NNEF specification should mandate such checks as you mention. I'd say no, it should not mandate them, to allow for simpler parser implementations; however, since all the information is present, an NNEF parser/graph compiler may further check whatever it wants to. So I'd say the above NNEF document would be declared valid by the spec, but parser/validator tools are free to implement further checking. For example, a tool that realizes/proves that the document will actually never result in a valid graph, no matter what you substitute for the batch dimension, could be useful.

How does that sound to you?

@botev commented May 22, 2018

I think that is a valid point and it makes things a lot easier. However, note that the inverse problem (given all of the graph operations, deciding which values of the 0s are valid) is a lot harder than the forward one, where you specify any divisibility constraints the original shapes must satisfy and verify that they work. E.g. verifying that 2*n is divisible by two is a lot easier than inferring that the minimal irreducible condition is that shapes divisible by 2 satisfy the constraints of the graph.

@gyenesvi

Sure, I can imagine that the problem may become easier if you express the unknown (here batch) dimension with a formula that guides the validation. However, what if in the above example, instead of writing 2*n, the user only puts n? You'd still want to validate that, and the result would be the same: the last line will always be invalid, no matter what you substitute for n. So I guess you need to prepare your algorithm for the general case anyway (no guidance, just a single symbolic integer for the extent in the shape).

(Actually, I believe that your original example above should contain n instead of 2*n, or the definition of x_half should be x[:n] instead of x[:(n // 2)], right?)

@botev commented May 23, 2018

You are right, my example is actually quite bad at showing the issue I described. So let me remake the example to make this clearer:

n = symbolic_int("n")
x = input(2*n, 784)
y = input(2*n, 10)
x_half = x[:n]
h1 = dense(x_half, num_units="a")
h2 = dense(h1, num_units="b")
h3 = dense(h2, nonlinearity=softmax, num_units=10)
// Easy to check inconsistent shapes since second dimensions are constant - 10 vs 784
loss_err = (x_half - h3) **2 / 2.0
// Hard error to check since the first dimensions are inconsistent - n vs 2n
loss = (y - h3)**2 / 2.0

I think what you are referring to is the first error, which I think is easy to check as long as you have constant shapes. However, ones like the second are more general and difficult to catch. Hope that makes it a bit clearer. I personally would prefer to be able to solve both of them, not only the first. However, that does require the user to specify some constraints on the unknown shapes.

@gyenesvi

I was actually referring to the second error (sure, the first is easy). My point was that the example is the same if you write the following (n instead of 2*n and n//2 instead of n). Are you saying that the following case would not be detectable because the user did not specify enough constraints on the unknown shapes?

n = symbolic_int("n")
x = input(n, 784)
y = input(n, 10)
x_half = x[:(n//2)]
h1 = dense(x_half, num_units="a")
h2 = dense(h1, num_units="b")
h3 = dense(h2, nonlinearity=softmax, num_units=10)
// Easy to check inconsistent shapes since second dimensions are constant - 10 vs 784
loss_err = (x_half - h3) **2 / 2.0
// Hard error to check since the first dimensions are inconsistent - (n//2) vs n
loss = (y - h3)**2 / 2.0

I guess if I was the user, I would expect these two formulations to be equivalent, coming to the same conclusions about validity, which means that you have to prepare your implementation for cases where you do not have the luxury of constraints on the unknown shapes. And if you have to prepare for the hard cases, you don't win too much by having some cases simplified. On the other hand, not preparing for the hard cases feels like a half solution.

I think if your intention is exactly to catch errors made by the user, you cannot rely on the user providing you some hints to ease validation; that sounds kind of counter-intuitive to me. Or am I still missing something here?

@botev commented May 23, 2018

So if there is a symbolic expression like n//2, I guess you can assume that this must be a valid division and then directly internally redefine n -> 2n, so maybe this should be ok. However, having this inferred may lead to hard cases where you have something like a + 2n = c + d, and then it is very difficult to reason about.
As for allowing the user to specify it, I think that given how much 3rd-party libraries are being used, the actual end users would potentially have an easier time doing this, in my opinion, as usually these constraints are very natural and easy to specify when you have variable shapes, yet they can prevent you from making mistakes.

@gyenesvi

So how would the a + 2n = c + d case be solved by the approach you mention? If I understand correctly, here a, n, c, d may come from different unknown shapes, and the equality is what you want to test. By proper hinting, these could actually become related, for example the explicit symbolic variables could state that in fact a and c are derived from the same symbolic integer and that d is actually equal to 2n. Is that what you are proposing?

To see this in a real example:

graph G( input1, input2 ) -> ( output )
{
    input1 = external(shape = [0,10]);
    input2 = external(shape = [0,10]);
    output = add(input1, input2);
}

Here you can't reason about the addition being valid, because the two input shapes could be unrelated, however, if we could write:

graph G( input1: tensor, input2: tensor, batch: extent ) -> ( output: tensor )
{
    input1 = external(shape = [batch,10]);
    input2 = external(shape = [batch,10]);
    output = add(input1, input2);
}

Then the user explicitly states that the two are the same (although unknown), so you can reason about it and perform more checks.

Is this the essential difference that you'd need to let you perform all checks that you'd like?

@botev commented May 24, 2018

Yes, that is indeed what I mean. And I think that these constraints are quite reasonable to specify, as well as being very simple. E.g. things like both tensors being from the same batch, or my convolution kernel dimension being equal to that of the input, etc. Libraries on top can also use the shapes to create new tensors with matching, but unknown, shapes. This is a bit like compile-time checks vs runtime checks in programming languages.

@gyenesvi

Okay, finally we have got to the essence of it; I agree that such a description carries more information and is something that users or high-level libraries could naturally express.

For now, the graph can only have tensor parameters (that's why it does not need to be marked as such), so this would require allowing non-tensor parameters as well. We'd have to see what that entails, but at least now it is clear in which direction the syntax would require extension.

@gyenesvi

I have recently been experimenting with describing and validating output shapes in NNEF, and not surprisingly, arrived at integer polynomial expressions, so it might be interesting to understand this in more detail.

The first question we have to answer is how exactly this helps compilation. The reason I have been looking at this is validation of a parameterized description (such as a single, possibly compound, operation, or alternatively a whole graph) for structural correctness. You also mentioned above that you know of some use cases for optimization; could you elaborate on that, just to have a motivating example?

Second, in order to estimate the complexity of implementation, it would be good to clarify what properties/operations are imaginable/required on integer polynomials (for example in mathematical, implementation independent terms). From my experiments, I have gathered the following:

  • a basic polynomial may be a single identifier (a, b, ...) or a single integer constant (1, 2, ...)
  • we can add/subtract two polynomials (a + b == a+b, a - 2 == a-2)
  • we can multiply a polynomial by a constant (2 * (a+1) == 2a + 2)
  • we can divide a polynomial by a constant, however, care must be taken for proper rounding ((a + 1) / 2 is not necessarily equal to a/2 + 1/2 in the case of integer division). How to define this properly? For example by a separate constant divisor?

For the examples I have encountered so far, multiplication of two polynomials was not required, so each monomial could be of degree 1. Can you give an example where multiplication of polynomials is required? Any further operations that you found necessary? For example, looking at the symbolic polynomials library you linked, I don't quite get the meaning/utility of the floor/ceil and min/max operations, if they would be required here. Could you explain these?

Also, maybe share some implementation tricks, if any. A canonical form of polynomials with maximal degree 1 can be easily implemented as a map from strings to signed integers. These maps can then simply be compared for equality. How much more complicated would it get if higher degree / polynomial multiplication / division is needed?
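
For what it's worth, here is a minimal Python sketch of that degree-1 canonical form (a map from identifiers to signed coefficients, with the constant term stored under an empty key); the dict representation and helper names are just illustrative assumptions, not part of any NNEF tooling:

# A linear (degree-1) shape polynomial: {"": constant, "a": coeff_of_a, ...}
def poly(const=0, **coeffs):
    p = {"": const}
    p.update(coeffs)
    return p

def add(p, q):
    r = dict(p)
    for k, v in q.items():
        r[k] = r.get(k, 0) + v
    return r

def scale(p, c):
    return {k: c * v for k, v in p.items()}

def equal(p, q):
    # canonical comparison: drop zero coefficients, then compare the maps
    norm = lambda x: {k: v for k, v in x.items() if v != 0}
    return norm(p) == norm(q)

# example: 2 * (a + 1) == 2a + 2
assert equal(scale(add(poly(1), poly(a=1)), 2), poly(2, a=2))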

@gyenesvi

It turns out that for the fully general case, we do need multiplication and division of polynomials, which leads us to multivariate rational polynomial expressions, i.e. the ratio of two multivariate (integer) polynomials. So in the end we would need to test equality of two such expressions.

Building those polynomials from arithmetic expressions seems not too complicated. Furthermore, a seemingly simple way to perform the comparison is taking the difference of the two expressions, transforming it to common denominator form, and seeing if the numerator becomes zero (after terms cancel each other out). This seems doable.
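
As a rough illustration of that comparison (purely a sketch in Python with a made-up dict representation of multivariate integer polynomials, not any existing parser code): to test p1/q1 == p2/q2, bring the difference to the common denominator q1*q2 and check that its numerator p1*q2 - p2*q1 cancels to zero.

# A multivariate integer polynomial: {monomial: coefficient}, where a
# monomial is a sorted tuple of (variable, power) pairs and () is the
# constant monomial.
def mul(p, q):
    r = {}
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            powers = dict(m1)
            for v, e in m2:
                powers[v] = powers.get(v, 0) + e
            m = tuple(sorted(powers.items()))
            r[m] = r.get(m, 0) + c1 * c2
    return r

def sub(p, q):
    r = dict(p)
    for m, c in q.items():
        r[m] = r.get(m, 0) - c
    return r

def is_zero(p):
    return all(c == 0 for c in p.values())

def ratios_equal(p1, q1, p2, q2):
    # p1/q1 == p2/q2 iff the numerator of the difference over the common
    # denominator q1*q2, i.e. p1*q2 - p2*q1, cancels to zero
    return is_zero(sub(mul(p1, q2), mul(p2, q1)))

# example: (x + 1) / 2 == (2x + 2) / 4
x = (("x", 1),)
assert ratios_equal({x: 1, (): 1}, {(): 2}, {x: 2, (): 2}, {(): 4})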

Any experience on this?

@botev commented Dec 20, 2018

Indeed, as you have arrived at, the main purpose of it is validation of parameter (and graph) descriptors, checking that they conform to correct shapes. In terms of optimization, this can be used for an optimizing memory allocator, as it will potentially allow you to have an explicit polynomial expression for the storage of all tensors in the computation; together with an arithmetic scheduler, this can build an explicit division of memory ahead of execution. Even if that is dynamic (e.g. input shapes can change), the division of the memory can still be done. This is quite involved and most likely you don't need to be bothered with it.

So multiplication and division are usually used during reshaping. I'm not sure that you need rational polynomial expressions though; what would be the case where you have something that is not an integer?
In terms of the division example, that (a+1)/2 is not equal to (a/2 + 1/2), my personal view is that none of this should be allowed. If some variable at some point requires the division (a+1)/2, it should be the user's responsibility to define that in fact a = 2k + 1, hence the result of the division is k + 1, where k is an integer. This essentially just means specifying explicitly any assumptions about the variables which the user knows. Note that this is pretty much how people debug this by hand - we know that certain variables have certain properties of their shape, so we are allowed to perform a specific operation.
As for the library I linked, the reason for the floor and ceil functions is that some standard operations actually contain such expressions (e.g. convolutions with a size that is not exactly divisible by the filter size). This leads to a bit of corner-casing, but essentially avoids the need for any rational polynomial expressions and remains purely in the integer realm. In that library a polynomial is just an ordered list of monomials. Generally, these structures are really tiny in terms of memory footprint (although I have not measured it).
Similarly, for shapes that are polynomial under division, one would require that the division gives an actual integer polynomial, e.g. (a^2 + 2a + 1)/(a+1) = a + 1, while (a^2 + a + 1)/(a + 1) is not allowed (it throws an error).

@gyenesvi

Just to clarify: in a rational polynomial expression (or function) everything is still integer. It is not the same as a rational polynomial, where the coefficients are rational; it is instead the ratio of two integer polynomials.

It is required exactly for things like stride in convolution, for example a simple formula for the output size of a convolution is:

output_size = (input_size + padding - filter_size + stride) / stride

which is the ratio of two integer polynomials. Here / means integer division. It is required to make the formulas work for all cases, not just when exact integer division is possible, because in most networks out there this flexibility is allowed.

Your approach to formulating divisibility assumptions seems interesting, although I find it a bit unnatural on the user side, and it may lead to parsing difficulties as well, since it would require the declaration of a shape to be an expression, like a = 2k+1, where the value of k must be deduced, and that can become very complicated depending on what expressions are permitted (maybe it's okay if only simple expressions are allowed, such as a single variable with multiplicative and additive constants).

Divisibility assumptions may also be given in the form a % b == 0. Furthermore, an integer division can be rewritten using the following identity: (a + c) / b = a / b + (a % b + c) / b. Then, substituting the above assumption can be useful for proving equivalence of two formulas. For example, let's say we have a formula that describes the output size of a network as x / 4 where x is the input size, and we have the assumption that x % 4 == 0. Let's say we have a 'same' convolution with a stride of 4. Then the convolution formula says that the output size should be (x + 3) / 4 (the + 3 achieving the ceil operation usual in this case). Then using the divisibility assumption we have

(x + 3) / 4 == x / 4 + (x % 4 + 3) / 4 == x / 4 + 3 / 4 == x / 4

and so we can see that the convolution output shape is as we expected. That was my idea for approaching this.
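
A quick numeric spot-check of that rewrite identity and of the worked example (nothing symbolic here, just plain integer arithmetic to convince ourselves the rules are sound):

# (a + c) / b == a / b + (a % b + c) / b  for integer division with b > 0
for a in range(0, 50):
    for b in range(1, 10):
        for c in range(0, 10):
            assert (a + c) // b == a // b + (a % b + c) // b

# 'same' convolution with stride 4: output is (x + 3) / 4, which reduces
# to x / 4 whenever the divisibility assumption x % 4 == 0 holds
for x in range(0, 100, 4):
    assert (x + 3) // 4 == x // 4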

@botev commented Jan 5, 2019

It is required to make the formulas work for all cases, not just when exact integer division is possible, because in most networks out there this flexibility is allowed.

This is the main reason why that library has ceil and floor operations, which allow representing expressions like the one you showed in a "valid form". Whether you decide to call the data structure for this a rational polynomial or a ceil expression is, I think, equivalent; it's just a way of keeping the correct information around.

like a = 2k+1, where the value of k must be deduced

Deducing integer polynomial constraints is (I think) generally NP-hard, hence why I think for any real purposes this has to be something that the user provides. However, as a researcher and user of libraries, I'm very comfortable (and even prefer) to provide all the assumptions about the data that a model needs for correctly inferring shapes (e.g. that some size needs to be of the form 3k+1). Rather than being unnatural, I would say it is similar to what a strongly-typed language provides compared to a weakly-typed one - being explicit allows for much better static checks.

Divisibility assumptions may also be given in the form a % b == 0

I think this is equivalent to a = b * k where k is an integer, hence the two are fundamentally the same. The benefit of keeping everything in polynomial form is that it allows all operations (divisions, multiplications, etc.) to be implemented only once, for polynomials, and there is nothing else. Of course, it is very subjective whether this is "a better" choice, but I've implemented those libraries in about a week or two using this perspective, without any extra "deduction" engine.

E.g. using your example, the user says that the input shape of the convolution is 4k, then the result of the convolution is ceil(4k, 4) = k. More or less the main motivation for keeping things in polynomials is that it makes the implementation and understanding of the concepts easier (for me at least), since everything is just an operation on a polynomial.

@gyenesvi commented Jan 6, 2019

I agree that it's okay if the user provides the assumptions about the model; what I don't see clearly is the actual means to do so with your approach. Can you give me more concrete examples of how this would work out in practice? Here are the bits/cases I want to be able to describe.
For each (primitive) operation, there is a formula that describes the output shape given the input shape and the attributes of the operation. Given that, let's suppose we have a higher level: either a compound operation that uses those primitives, or the actual top-level graph that has some shape parameters, and we want to reason about the validity of shapes at that higher level. For that we need to represent intermediate shapes as polynomials of the parameters, and compare those polynomials for equality. In general, this seems to be okay; the problematic bit is when division is required and when there may be additional knowledge in the form of divisibility constraints. How do we introduce/represent these in the system? All other operations are clear, because addition/multiplication is simply defined on polynomials, but division is not.

Let's take a concrete operation, like conv. Let's assume it's a strided 'same' convolution, with a parameter called stride. In my approach, the output size is then defined as

output_size = (input_size + stride - 1) / stride

How would that look in your approach, given that it must be defined for all input shapes, not just the cases when there is a divisibility constraint? Would it be something like

output_size = ceil(input_size / stride)

where / means real division? Or alternatively, with an expression ceil_div(input_size, stride)? Is that what you mean by a ceil operation in your approach?

The next bit is when I have a graph, and I want to define its input size with some divisibility constraint, and I want to express the output size with a formula. Say I have a single strided conv op in the graph, I substitute 2 for the value of stride, and I want to express that the input is divisible by 2. I can write that the input size is 2 * k, and the output size is k. Now the system should be able to check that when I substitute 2 * k for the input size of conv, the output size based on the formula of conv matches my expectation of k. It should compare whether the polynomial k equals ceil(2 * k / 2). Is that right?

How is the floor and ceil represented in your approach? Is it actually the ratio of two polynomials, with the additional info of knowing whether the division is floor or ceil? Or is the key here that in the end when you have to do equality comparisons, the division always cancels out, and you are left with comparing simple polynomials?

@botev commented Jan 6, 2019

where / means real division? Or alternatively, with an expression ceil_div(input_size, stride)? Is that what you mean by a ceil operation in your approach?

How is the floor and ceil represented in your approach? Is it actually the ratio of two polynomials, with the additional info of knowing whether the division is floor or ceil?

Indeed, in the approach taken in the library there are several "primitives" - a monomial, a constant, and a ceil/floor operation. Essentially, you create a special "monomial" primitive that contains information about the dividend and divisor and whether it is a floor or a ceil (I think they are actually two different types).

It should compare if the polynom k == ceil(2 * k / 2). Is that right?

Even as it is at the moment, whenever you call ceil/floor with two polynomials, the system always first checks whether they are divisible, in which case it returns the exact polynomial division, and if not it constructs the new primitive. E.g.:

ceil(2 * k, 2) -> k
ceil(2 * k + 1, 2) -> ceil_primitive(ref 2*k + 1, ref 2)

This basically allows such a comparison to be done correctly. I even think that it simplifies the expressions, e.g. ceil(6 * k + 3, 6) -> ceil_primitive(2 * k + 1, 2) (it can find common polynomial divisors, not only constants).
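
Roughly, and only for the case of a constant divisor (the library itself handles polynomial divisors and, as mentioned, common polynomial factors), the constructor behaviour could be sketched in Python like this; all names here are made up:

from math import gcd
from functools import reduce

# Linear integer polynomials as {"k": coeff, "": const}; the divisor d is a
# positive integer constant in this simplified sketch.
def ceil_div(poly, d):
    if all(c % d == 0 for c in poly.values()):
        # exact division, e.g. ceil(2k, 2) -> k
        return {v: c // d for v, c in poly.items()}
    # otherwise reduce by the common constant factor and keep an
    # irreducible "ceil" primitive, e.g. ceil(6k + 3, 6) -> ceil(2k + 1, 2)
    g = reduce(gcd, list(poly.values()) + [d])
    reduced = tuple(sorted((v, c // g) for v, c in poly.items()))
    return ("ceil", reduced, d // g)

print(ceil_div({"k": 2, "": 0}, 2))   # {'k': 1, '': 0}
print(ceil_div({"k": 2, "": 1}, 2))   # ('ceil', (('', 1), ('k', 2)), 2)
print(ceil_div({"k": 6, "": 3}, 6))   # ('ceil', (('', 1), ('k', 2)), 2)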

@botev commented Jan 6, 2019

As an example I've added to the README in https://github.com/Metadiff/symbolic_polynomials:

// (5b + 2)
let poly1 = 5 * b + 2;
// (5b + 2)^2
let poly11 = &poly1 * &poly1;
// floor((5b + 2)^2, 5b + 2) = 5b + 2
let poly12 = floor(&poly11, &poly1);
// ceil((5b + 2)^2, 5b + 2) = 5b + 2
let poly13 = ceil(&poly11, &poly1);
println!("(5b + 2)^2 = {}", poly11);
println!("floor((5b + 2)^2, 5b + 2) = {}", poly12);
println!("ceil((5b + 2)^2, 5b + 2) = {}", poly13);

You get:

(5b + 2)^2 = 25b^2 + 20b + 4
floor((5b + 2)^2, 5b + 2) = 5b + 2
ceil((5b + 2)^2, 5b + 2) = 5b + 2

There also exist max and min operations, which are sometimes needed for certain tensor operations.

@gyenesvi commented Jan 7, 2019

Thanks, that clarifies some things about the floor/ceil operations. However, I have some further bits that are not clear. If floor and ceil are separate things, how do you define operations on floor/ceil objects? Say, addition/multiplication of a floor/ceil with a regular polynomial, or even addition/multiplication of a floor object with a ceil object, like floor(a,b) + ceil(c,d) or floor(a,b) * ceil(c,d)? I haven't thought this through; it may be possible to do, but I feel that there might be nontrivial details here.

What I like about a simple ratio of polynomials (essentially a floor(a,b) in your approach) is that every polynomial can be represented as the same kind of ratio object (trivial ones with a denominator of 1), and so you only have to define all operations on these ratio objects, which is trivial. The ceil operation seems to complicate that scheme. Apart from the above questions about operations on floor/ceil, the ceil operation is redundant, because it can also be expressed as floor(a + b - 1, b), and it should be possible to detect the equivalence of the two. Can this be done?

Furthermore, ratio polynomials can be compared for equality without the need to do actual polynomial division and simplification, which would require algorithms, like polynomial gcd, that are quite complicated, especially in the case of multivariate polynomials. I understand that it is possible to implement it, but it may be too much of a burden for an NNEF consumer.

To make this work in NNEF, it should be possible to clearly define how expressions can be converted to polynomials, how operations are defined on those polynomials, and prove that it is always possible to decide equality of two expressions via polynomials. With ratio polynomials, this seems to be true in my view. Have you done some analysis in this direction using the floor/ceil formulation?

@botev commented Jan 8, 2019

Thanks, that clarifies some things about floor/ceil operations. However, I have some further bits that are not clear. If floor and ceil are separate things, how do you define operations on floor/ceil objects? Say, addition/multiplication of floor/ceil with a regular poly, or even addition/multiplication of a floor object with a ceil object, like floor(a,b) + ceil(c,d) or floor(a,b) * ceil(c,d)? I haven't thought this through, it may be possible to do, but I feel that there might be nontrivial details here.

You can imagine that the Floor and Ceil entities are treated as separate irreducible variables. Mathematically it is like saying: if we have variables a, b, then we let c = floor(a, b) and now we have three totally separate variables a, b, c. Implementation-wise there is a tagged union type that looks like this (simplified):

pub enum Composite {
    Variable(I),
    Floor(Polynomial, Polynomial),
    Ceil(Polynomial, Polynomial),
}

Then a monomial is defined as

pub struct Monomial {
    pub coefficient: f32,
    pub powers: Vec<(Composite, i32)>,
}

And a polynomial is just a vector of monomials, representing their sum. Hence, you can build polynomials of these composite expressions in arbitrary form.

The ceil operation is redundant, because it can also be expressed as floor(a + b - 1, b), and it should be possible to detect the equivalence of the two. Can this be done?

That's a very good point which had not crossed my mind before. Hence, it is probably enough to define only one of these as a Composite, with the other being constructed in the manner you described.

Furthermore, ratio polynomials can be compared for equality without the need to do actual polynomial division and simplification, which would require algorithms, like polynomial gcd, that are quite complicated, especially in the case of multivariate polynomials.

This is a good point, however you will need to represent standard polynomials somehow separately from irreducible ratios (similar to how the floor is separate), since b * floor(a, b) is not a, and hence if floor(a,b) is represented just as a ratio you cannot multiply it without precautions. Could you elaborate on how this would be done when we represent things as pure ratios?

@gyenesvi commented Jan 9, 2019

Okay, now I understand that you treat those sub-expressions as irreducibles, and that means that you cannot further simplify the expression. I agree that it is probably necessary to have such an irreducible, so you need to have two kinds of ratios: one reducible (a / b that can participate in multiplication, because we know that a % b == 0) and one irreducible (floor(a,b)). So pure ratios alone are not enough, that's true.

However, I still want to establish some relation between the two, because it is needed in some cases to prove equalities. For example, imagine two operations, one downsamples the input by a factor, and the other upsamples it. Supposing the input size is x, output size of the downsampling is defined as floor(x,factor), and the output size of upsampling is x * factor. Suppose we apply the two after each other (downsample, then upsample), and we want to create a compound operation from this sequence. However, we want the compound to have identical output shape as the input shape (x). For this, the definition of the compound provides the assumption that x % factor == 0.

When checking the validity of that compound definition, combining the formulas of the two ops, we get floor(x,factor) * factor, and we want to prove that given x % factor == 0, this actually equals (x / factor) * factor, which can be reduced to x.

We could use the identity floor(a,b) == a / b - (a % b) / b to rewrite the floor primitive, maybe having a % b as an irreducible primitive, and then the divisibility assumption in the form a % b == 0 could easily be used during the reduction, simply by substituting the value of such an irreducible, resulting in floor(a,b) simplified to a / b.

Or alternatively, as you proposed, we can state in our compound operation definition that the input and output size is in the form x = factor * y, and then floor(factor * y, factor) * factor can be simplified to y * factor.

I find the first approach (a % b as an irreducible) slightly simpler, because you only need to do substitution, whereas understanding when an expression floor(expr1, expr2) can be reduced seems to be more involved algorithmically (you need to simplify polynomials by gcd).
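
A sketch of that first approach, using sympy as a stand-in for the polynomial engine (illustrative only; the symbol r plays the role of the irreducible x % factor):

from sympy import symbols, simplify

x, factor = symbols('x factor', positive=True, integer=True)
r = symbols('r', integer=True)   # irreducible stand-in for x % factor

# floor(x, factor) rewritten as (x - x % factor) / factor
floor_x = (x - r) / factor

# downsample then upsample: floor(x, factor) * factor
roundtrip = floor_x * factor

# divisibility assumption x % factor == 0 is just the substitution r -> 0,
# after which the round trip simplifies back to x
assert simplify(roundtrip.subs(r, 0) - x) == 0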

@gyenesvi commented Jan 9, 2019

Actually, in the a % b as irreducible case, floor(a,b) is only needed when writing expressions, and under the hood, the system can always rewrite it to (a - a % b) / b meaning a ratio polynomial that can now be used in multiplication, and may simplify if there is a divisibility assumption. So a % b is the only irreducible expression required.

@botev commented Jan 9, 2019

I think that the two approaches are on some level "equivalent", hence I'm not convinced that one of them leads to a very hard problem (polynomial gcd) and the other does not.
E.g. note that figuring out that floor(factor * y, factor) = y does not require gcd, but rather simple polynomial division, which is easier and fast.

PS: I closed this by mistake

botev closed this as completed Jan 9, 2019
botev reopened this Jan 9, 2019
@botev commented Jan 9, 2019

Actually, maybe you can clarify in what circumstances you envision GCD being required in my approach, to clear that up, as I think we have some form of miscommunication here.

@gyenesvi commented Jan 9, 2019

I agree that the two approaches may be equivalent, and it would be good to see the difference, if any. I guess you still wanted to format this example, but something went wrong; I cannot interpret it, there seem to be unfinished lines and unused names.

@botev commented Jan 9, 2019

Yeah, for the example I started writing, I thought that it would need GCD in my approach, but after I ran it in the library I realized I was wrong and it still worked. I actually don't think you will ever need GCD even in my approach; instead, polynomial division is enough.

@gyenesvi commented Jan 9, 2019

When you say you can get away with simple polynomial division, do you mean that you perform the division and throw away the remainder? And carry on calculating with the quotient? How easy is that to do in the multivariate case? I have never implemented it, but it did not seem as simple as addition and multiplication.

@botev commented Jan 9, 2019

Ok so maybe this is a somewhat reasonable example; however, it still works, and I explain below why:

// (x-1)(x+1)(x+3)
p1 = x^3 + 3x^2 - x - 3
// (x-1)(x+2)
p2 = x^2 + x - 2
// Floor division
p3 = floor_div(p1, p2)
// Equivalent but different
p4 = floor_div((x+1)*(x+3), x+2)
// Make sure these are equal
assert p3 == p4

Now the first expression floor_div(p1, p2) indeed requires GCD if we want to receive the exact same struct as p4. In my framework though it will still be represented as the irreducible Floor(p1, p2). However, the equality operator for comparing Floor(a,b) and Floor(c,d) is defined as checking whether a * d = b * c, and hence things work out correctly.

Could you maybe show how you see this working out in your construction?

@gyenesvi commented Jan 9, 2019

I defined equality for ratio polynomials exactly the same way as you do above, to avoid requiring the exact same structure for the two polynomials, but I define it for the true ratio, not for the floored one, and I believe that is when you can actually do that trick, and not as you apply it. To represent Floor(p1, p2) I'd use the ratio (p1 - p1 % p2) / p2, with p1 % p2 being an irreducible, just so that I can do multiplication on it, which is required for that trick to hold. I don't understand how you employ the a * d = b * c trick for the Floor expression, since it is not always equivalent; there is a Floor(a,b) * (b * d) expression hidden in there that cannot be reduced to a * d. As a concrete counter-example, Floor(5,3) == Floor(4,3), but 5 * 3 != 4 * 3. This trick only holds if we are doing real division, not for integer division. With my approach Floor(5,3) == Floor(4,3) is rewritten to (5 - 5 % 3) / 3 == (4 - 4 % 3) / 3, which properly evaluates to true.

Does that make sense, or am I mistaken somewhere?

@botev commented Jan 10, 2019

I think I am starting to see your point, and you might be right that this is an easier and potentially better representation. Let me experiment a bit with it in that library during the weekend and see if it comes out as expected.

As a concrete counter-example Floor(5,3) == Floor(4,3), but 5 * 3 != 4 * 3. This trick only holds if we are doing real division, not for integer division. With my approach Floor(5,3) == Floor(4,3) is rewritten to (5 - 5 % 3) / 3 == (4 - 4 % 3) / 3, which evaluates to true, properly.

This is true, however since we work with polynomials, which have variables, one cannot do this in general, except with constants. The equality I have defined guarantees that Floor(a, b) = Floor(c, d) for all possible values of the variables involved. It is a stronger notion than equality for some specific set of values. I think for constants the floor operation can actually perform the floor and return the correct integer.

@gyenesvi

I think the essence of the above counter-example is that since a constant is a special case of a polynomial, if I find a counter-example for constants, that is also a counter-example for polynomials. But here's a counter-example using proper polynomials. Suppose I am given that a % b == 0, and I want to prove that

floor(a + b - 1, b) == floor(a, b)

If I follow your rules, then I can substitute a = k * b, and do the cross multiplication:

floor(k * b + b - 1, b) == floor(k * b, b)
k * b + b - 1 == k * b
b - 1 == 0

which does not hold in general (for all b).

Furthermore, I believe you can do the same rewriting and calculations as above with polynomials too, not only with constants (since all operations, including modulo, are defined, and are generalizations of the corresponding integer operations). And that stronger notion will hold; the equality will hold for all possible values substituted for the variables:

floor(a,b) == floor(c,d)

is rewritten to

(a - a % b) / b == (c - c % d) / d

which holds iff

(a - a % b) * d == (c - c % d) * b

Doing that for the above example, we have (given that a % b == 0):

floor(a + b - 1, b) == floor(a, b)
(a + b - 1 - (a + b - 1) % b) / b == (a - a % b) / b
a + b - 1 - (a + b - 1) % b == a
b - 1 == (a + b - 1) % b
b - 1 == (a - 1) % b

which seems to be true, assuming a % b == 0, but to prove it, it might not be enough to treat modulo expressions as irreducibles, and we need rules of modulo arithmetic to be applied.
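
For what it's worth, a quick numeric spot-check of that last identity:

# b - 1 == (a - 1) % b whenever a is a positive multiple of b (a % b == 0)
for b in range(1, 20):
    for a in range(b, 200, b):
        assert b - 1 == (a - 1) % b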

Let me know what comes out of your experiments!

@botev commented Jan 11, 2019

So one of my concerns regarding this is that when you have more complicated expressions, it would be very difficult to "prove" that something is still an integer. Consider the same example with a = b * k. You can basically show that floor(a + b - 1, b) = k, that is easy. Now consider floor(a + c - 1, b) = k + floor(c - 1, b). This will be rewritten as a ratio:

k + floor(c-1, b) = k + (c - 1 - remainder(c-1,b))/b = [k*b + c - 1 - remainder(c-1,b)]/b

Let's add to this floor(c + b - 1, c) = [c + b - 1 - remainder(b-1,c)] / c

L1 + L2 = [k*b*c + c^2 - c - c*remainder(c-1,b) + b*c + b^2 - b - b*remainder(b-1,c)] / (b*c)

How do we check that this expression is indeed an integer? Removing all parts of the dividend that are divisible by b*c, we are left with c^2 - c - c*remainder(c-1,b) + b^2 - b - b*remainder(b-1,c). It is easy for us to do now, as mathematically we can revert the floors and see that this is true; however, in more general and complicated situations, reverting the floors might not be possible, hence one would need some form of complicated procedure to verify that this is indeed an integer. How would that algorithm look?

As a more concrete example: is (2ab + 3b - remainder(1, b) + 1)/b an integer or not?

@botev commented Jan 11, 2019

Maybe it would be nice if we had several test cases which I could test against. Other than that, I think I have implemented the approach you were describing, but at this point it is a bit hard to reason about what the system could or could not infer without having explicit examples.

@gyenesvi

Sorry for not responding for so long, I have been out of office for a while and forgot a bit about this thread.

I have to agree that things can get wild quite easily, and we may end up with problematic cases (as I said previously, we might need more rewrite rules related to modulo calculations to prove equivalences; how many, I don't know). Basically, as I see it, the cases where we cannot decide whether something is an integer are cases where there might be some simplifications via equivalences that we do not discover, and hence we fail to derive that two formulas are the same.

In general, we have two options: either aim for a representation and rewriting rules that provably always decide the equivalence of two formulas (the existence of which I highly doubt at this point), or just embrace the fact that some equivalences cannot be proven, and work with that. This could still be useful if all implementations used the same rewriting rules and the same algorithm for simplification, and declared two formulas equivalent or not in the same cases. Then we could just say that a network description is valid if the algorithm can prove the required equivalences, and invalid otherwise (meaning that it is truly invalid or its validity cannot be proven by the tools we have).

I agree that it would be good to see a set of examples that we think the system should be able to handle (decide equivalences of formulas). What I wanted to use this for is to decide whether the definition of intermediate or output shapes in a (sub-)graph is valid. Suppose that each operation has an output shape definition (via a math formula). Take a sequence of operations, and compose a shape formula at the end of the (sub-)graph by composing the shape formulas of the sequence of operations. Is the math formula for the composed shape consistent with some, probably simpler, definition of that shape at the output of the (sub-)graph? I think it should be easy to enumerate quite a few interesting cases by composing just a few operations into a graph, defining the output shape with a formula, and seeing whether that definition can be decided to be consistent with the sequence of operations inside the graph. My example above (Jan 9) of downsampling and then upsampling an input is a small sub-graph of two operations like this.

@botev commented Feb 14, 2019

So, for pretty much similar reasons, I had my own view on this with the "extra entity" for floor: essentially it can handle some easy cases, but eventually you have to embrace that some equivalences are going to be too hard.

One can potentially resort to numerics when things get too hairy - e.g. compare the results for 10-20 random "base" integer values. Of course this will never be fully correct, but it can potentially still catch wrong shapes often.
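
A minimal sketch of that fallback (illustrative only, not claiming this is what either library does): evaluate both formulas at a handful of random positive integer assignments and compare.

import random

# Probabilistic fallback: declare two shape formulas "probably equivalent"
# if they agree on a number of random positive integer assignments.
def probably_equal(f, g, names, trials=20, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        env = {n: rng.randint(1, 1000) for n in names}
        if f(**env) != g(**env):
            return False
    return True

# equivalent: two ways of writing ceil(x / k) with integer arithmetic
ceil_a = lambda x, k: (x + k - 1) // k
ceil_b = lambda x, k: -((-x) // k)
assert probably_equal(ceil_a, ceil_b, ["x", "k"])

# floor vs ceil division differ as soon as k does not divide x
floor_div = lambda x, k: x // k
print(probably_equal(ceil_a, floor_div, ["x", "k"]))   # almost surely False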
