
Fresh start: Next steps #49

Open · ayush-1506 opened this issue Apr 12, 2021 · 51 comments

@ayush-1506 (Member)

Opening an issue here to keep track of this. @dfdx @DrChainsaw Would be great if you could add details about what needs to be done to add complete support for Flux here.

For context, the first set of changes were done in #46.

@DrChainsaw (Contributor)

A somewhat lengthy issue here with some thoughts on how to proceed.

what needs to be done to add complete support for Flux here.

One idea is to use the package name ONNX.jl for things which are ML framework agnostic and do Flux-to-ONNX import/export in e.g. an OnnxFlux.jl package. As discussed in the lengthy issue above, there are some other parts which could be put in separate packages (for instance NNlib <-> ONNX primitives). One obvious drawback is the overhead from many packages, including potential confusion for end users, but the reuse could be beneficial when it comes to adding and testing ops.

This issue here has a small, concrete potential next step which could fit in a framework-agnostic ONNX package: opus111/BaseOnnx.jl#2, but I think it needs some other eyes than mine to judge whether it is the right choice.

@ToucheSir (Member)

I think I have a grasp on what you're proposing in that issue, but it would be great to have a couple of the motivating examples mentioned to play with :)

@dfdx (Collaborator) commented Jun 18, 2021

Let's revive this issue. If I understand correctly, we are stuck on a Julia-friendly representation of ONNX graphs, either in the form of a computational graph or in the form of high-level serialization/deserialization primitives. I don't know the status of the NaiveNASlib / CompGraph, but one thing I've been working on recently is the Ghost.jl package, which implements the Tape type that is intentionally designed to be similar (though not identical) to an ONNX graph (e.g. see Tape anatomy). Ghost currently supports tracing functions, tape execution, compilation to Julia code and most needed node (operation) types, including loops. The long-term plan is to have ONNX interop in ~6-9 months, but whether the mapping should be direct or go via another representation (e.g. CompGraph) is an open question.
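A minimal sketch of the tracing workflow, to make the Tape idea a bit more concrete (exact signatures may differ between Ghost versions):

using Ghost

f(x) = 2x + sin(x)

_, tape = Ghost.trace(f, 1.0)   # record the execution of f onto a Tape
print(tape)                     # lists inputs and recorded calls, much like an ONNX graph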

I'll be happy to hear your current thoughts on the representation of ONNX graphs in Julia!

@DrChainsaw (Contributor)

Sorry for being inactive here. I have had a bit of a dip w.r.t energy and motivation lately but I’m working on it :)

Ghost looks awesome and I absolutely see how I would love to throw out the ad-hoc tracing mechanism in ONNXmutable for it when it comes to exporting. Would it be fair to say it does roughly what Mjolnir would do when it comes to exporting ONNX?

How do you see the relation to import?

To me, the import requires normal (but domain specific) Julia code to just implement what the natural language and type-definitions of the ONNX spec says. For instance, to “trace” through a .onnx file, you need to do things like deserialize the strings which are the names of the inputs to a Node, look through all other nodes and initializers for those exact strings and figure out what to do with that info. Or is this something that one could/should use Ghost for?

Some verbose thoughts on representation of ONNX graphs in Julia

I would say that the CompGraph in NaiveNASlib does not require any extra work w.r.t ONNX. It can be used as is, but it might not be the ideal choice in every situation (or rather, it is probably very rarely the ideal choice). I think that the Flux crowd seem to prefer that Flux import of ONNX models return a Flux.Chain, even if this means failing to import models which can’t be expressed with Chain.

Making some stand-alone CompGraph (perhaps one which is static and therefore can be typesafe) in a separate package might be useful, but it is such a trivial thing to do that it might feel silly to make a stand alone package out of it.

This is probably stating the obvious, but both CompGraph and Chain are just ways to express a function with Julia objects, thereby making it possible to write normal Julia code which reasons about and/or modifies them, where a normal Julia function would require one to work with IR or some other ‘lower’ and more cumbersome representation. One reason why I think they happen to be useful for ONNX import is that the high-level goal of ONNX import is to generate a function at runtime, i.e. f = import(“model.onnx”), and to the extent of my knowledge (please correct me if I’m wrong) it is not possible to generate a normal Julia function like this without hitting the top level or using tools like RuntimeGeneratedFunctions.jl or GeneralizedGenerated.jl, which come with a bunch of other restrictions.

All in all, I think that the choice of how to represent the model needs to be up to each individual package and I think it should be possible to do so without having to duplicate the lion's share of work, which is to define the ONNX primitives. Therefore it is not a central thing which needs agreement.

At least to me, there is value in being able to import an onnx-file as a model built by building blocks from ones favourite framework of choice, i.e that a Node with the operator type set to Conv in an .onnx file can be turned into a Flux.Conv or an Avalon.Conv2d and it seems obvious that this is done by separate packages.

One needs to be aware that getting full support for ONNX this way is going to be very cumbersome. For instance, a Conv operator which has both the weights and the input feature maps as trainable parameters but where the bias is computed by another node is a perfectly valid ONNX model. One could also take the approach to make a package which aims at giving full support to the ONNX spec without trying to shoehorn things into structs predefined by some other framework. This is also a valid approach but I somehow see less value in this as it would basically be a brand new framework.

The reason why I brought this up is that it does relate a bit to what assumptions one can make if one is to put some generic import code in this package. I suppose that if we do, we want to allow for full-blown ONNX support, not just what seems to be the norm in existing frameworks, even if it means that the code (and the use of it) will be more complicated.

@ToucheSir (Member) commented Jun 19, 2021

TL;DR copy and paste over CompGraph, hack off anything JuMP-related, add Functors.jl and let's get shipping :)

I can't speak for any of the other Flux contributors, but to respond to the extended thoughts above:

I would say that the CompGraph in NaiveNASlib does not require any extra work w.r.t ONNX. It can be used as is, but it might not be the ideal choice in every situation (or rather, it is probably very rarely the ideal choice). I think that the Flux crowd seem to prefer that Flux import of ONNX models return a Flux.Chain, even if this means failing to import models which can’t be expressed with Chain.

Au contraire, model import should not rely on any framework's functionality. There's a reason both TF and PyTorch still lack any ONNX import functionality: the impedance mismatch is kind of unavoidable.

Thankfully, the Julia ecosystem means we don't have to do this! What's required to make a model that works with Flux?

  1. Needs to be callable. This is handled by NNlib and its various backends.
  2. Needs to be deconstructable to and reconstructable from a list of parameter arrays. This is required for functionality like eltype conversion and accelerator support. Functors.jl is a slim package that handles this, and Flux takes a dependency on it for its own layers.

ONNXMutable's CompGraph already fulfills 1), and adding support for 2) would be trivial AFAICT.
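As a rough illustration of what 2) amounts to, a minimal sketch with a hypothetical vertex type (not the actual NaiveNASlib types):

using Functors

# Hypothetical compute vertex, only to illustrate requirement 2).
struct SimpleVertex{F, I}
    op::F       # e.g. an NNlib-based closure holding the parameter arrays
    inputs::I   # upstream vertices
end

# Opting in to Functors.jl: fmap can now recurse into op (and hence its
# parameter arrays) for things like eltype conversion or cpu/gpu movement.
Functors.@functor SimpleVertex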

Making some stand-alone CompGraph (perhaps one which is static and therefore can be typesafe) in a separate package might be useful, but it is such a trivial thing to do that it might feel silly to make a stand alone package out of it.

Why not put it in this package? Honestly, the only thing I see preventing this from happening now is the JuMP dependency. If that and all the functionality around graph re-balancing were removed, it could be copied over pretty much wholesale.

This is probably stating the obvious, but both CompGraph and Chain are just ways to express a function with Julia objects, thereby making it possible to write normal Julia code which reasons about and/or modifies them, where a normal Julia function would require one to work with IR or some other ‘lower’ and more cumbersome representation. One reason why I think they happen to be useful for ONNX import is that the high-level goal of ONNX import is to generate a function at runtime, i.e. f = import(“model.onnx”), and to the extent of my knowledge (please correct me if I’m wrong) it is not possible to generate a normal Julia function like this without hitting the top level or using tools like RuntimeGeneratedFunctions.jl or GeneralizedGenerated.jl, which come with a bunch of other restrictions.

This is true, but the allure of runtime function generation shouldn't stop us from getting something out the door. Most users don't care about what their model looks like as long as they can import and run it. I'm not even sure that a dynamic DAG like CompGraph would be any slower than a fully generated function given how much NNlib et al. dominate computation time. Given how many questions come up on Slack/Zulip/Discourse about it, I would argue there is a bigger opportunity cost in still not having anything workable.

One needs to be aware that getting full support for ONNX this way is going to be very cumbersome. For instance, a Conv operator which has both the weights and the input feature maps as trainable parameters but where the bias is computed by another node is a perfectly valid ONNX model. One could also take the approach to make a package which aims at giving full support to the ONNX spec without trying to shoehorn things into structs predefined by some other framework. This is also a valid approach but I somehow see less value in this as it would basically be a brand new framework.

IMO full support is more important than feeling idiomatic. One of the biggest frustrations with ONNX implementations is hitting unsupported operators, because there's no good way to fix it as a user. Aesthetic concerns, on the other hand, are much easier to handle.

I also don't think such a package would be a brand new framework. Rather, it would be little more than plumbing between ONNX protobufs and thin wrappers over NNlib functions (with as much metadata as possible for higher level frameworks to use). If it turns out some op doesn't exist in NNlib, then we should add it there. Note that this and Functors.jl support would also allow frameworks like Flux to pick and choose layers to convert to more "idiomatic" representations, while still maintaining a broad level of support. ONNX.jl and CompGraph then become more of a common substrate than a separate framework.

@DrChainsaw (Contributor)

Rather, it would be little more than plumbing between ONNX protobufs and thin wrappers over NNlib functions (with as much metadata as possible for higher level frameworks to use)

While I think this is a fully valid and perhaps somewhat ideal ONNX solution, I do fear that it will be quite a lot of code to write and maintain. It will certainly be more code per op than the layer definitions in Flux since it has more features per op (right?). More features also mean more maintenance, and looking at Flux, the layer definitions in there do not look low maintenance.

Some more minor comments and clarifications on CompGraph

ONNXMutable's CompGraph already fulfills 1), and adding support for 2) would be trivial AFAICT.

NaiveNASflux adds Functors.jl support to CompGraph for cpu<->gpu movement and datatype changes, and (barring any misconceptions I might have) it is unfortunately non-trivial; I was forced to rely on mutating the model instead of creating a new copy.

The crux is that a vertex also contains its input vertices, and the same vertex might be the input to several vertices. When recursing through the graph with fmap, two vertices sharing the same input vertex would end up with two separate copies of that input afterwards.

NaiveNASlib has a copy function which kinda functions like fmap but does not have this issue, but somehow I think there has to be a reason why it wouldn't work to just implement something like fmap(f, g::CompGraph) = copy(g, f) or else I would have used it already.

Perhaps this is a reason to consider another design, e.g. one where the structure of the graph is inside the CompGraph and not inside the vertices. I went back and forth a few times on this in the prototyping stage and settled on the current solution because it makes it possible to do the size rebalancing stuff from the perspective of a single vertex.

If that and all the functionality around graph re-balancing were removed, it could be copied over pretty much wholesale.

Yeah, this would be very easy to do as the compute-only parts of CompGraph are in files/structs which don't have any dependencies on JuMP or the re-balancing stuff. The LightGraphs stuff is obsolete and can just be deleted, and the same goes for the logging stuff.
https://github.com/DrChainsaw/NaiveNASlib.jl/blob/master/src/compgraph.jl
https://github.com/DrChainsaw/NaiveNASlib.jl/blob/master/src/vertex.jl

This is true, but the allure of runtime function generation shouldn't stop us from getting something out the door.

Fully agree. Fwiw, I have not been able to spot a significant difference between CompGraph and Chain in benchmarking. As you implied, whatever ns -> µs overhead happens due to type instability seems to disappear in the noise when the whole function is in the millisecond range. I suppose that if someone were to use e.g. NaiveGAflux for the same purpose as SymbolicRegression.jl, they might find the type instability to be the bottleneck, but I don't lose any sleep over that.

@ToucheSir (Member)

I would distinguish between the amount of code and the overhead of maintenance. One reason the per-layer maintenance burden is higher for Flux is because layers need to be ergonomic, robust and somewhat general. An under the hood representation needs none of those things as long as it's not exposed to users (which it shouldn't be).

As for the repeated vertex issue, I'd totally forgotten about it! This is something we've been trying to work out recently because it affects all weight tying. See FluxML/Flux.jl#1592, FluxML/Flux.jl#1504, FluxML/Zygote.jl#991 and the last couple of days of discussion on Slack's #autodiff. I guess tracing would be a more optimal approach until we resolve weight tying, but that brings with it the issue of playing nice with source-to-source ADs like Zygote. If that's not a concern though, I say we should go for it.

Given the awkwardness of sending big paragraphs back and forth here, might I suggest @dfdx and @DrChainsaw join us this coming Tuesday for the ML Community Call? The link is at https://calendar.google.com/calendar/event?eid=X2I5N2t1ajlwNnNxMzBlOXA2b3FqMmQxbl8yMDIxMDYyMlQxNjAwMDBaIGp1bGlhbGFuZy5vcmdfa29tYXVhcWV0MTRlb2c5b2l2M3A2bzdwbWdAZw and you can copy a calendar invite into your own calendar via the links on https://julialang.org/community/.

@dfdx (Collaborator) commented Jun 19, 2021

As for the repeated vertex issue, I'd totally forgotten about it! This is something we've been trying to work out recently because it affects all weight tying. See FluxML/Flux.jl#1592, FluxML/Flux.jl#1504, FluxML/Zygote.jl#991 and the last couple of days of discussion on Slack's #autodiff.

I'm not sure this is the same issue. I think the core of the problem with applying functors to CompGraph is that CompVertex contains direct (as opposed to symbolic) references to other vertices. I had this issue in the early design of Ghost (back when it was part of Yota) and resolved it by having both operations (i.e. vertices) and variables (i.e. references to operations). ONNX seems to follow the same design - its Node holds inputs as names of other nodes, not direct references. Applying a function to the whole graph then boils down to a for loop over the list of nodes instead of a recursive traversal.
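A toy sketch of the symbolic-reference idea (made-up types, not Ghost's actual ones):

# Nodes refer to their inputs by position in a list, not by direct reference,
# so transforming the graph is a single pass over a vector; no recursion needed.
struct Op
    f::Function
    inputs::Vector{Int}   # ids of the producing ops
end

graph = [Op(identity, Int[]), Op(sin, [1]), Op(+, [1, 2])]

# e.g. swapping one operation touches each node exactly once and cannot
# accidentally duplicate shared inputs:
rewritten = [op.f === sin ? Op(cos, op.inputs) : op for op in graph]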

Given the awkwardness of sending big paragraphs back and forth here, might I suggest @dfdx and @DrChainsaw join us this coming Tuesday for the ML Community Call?

Unfortunately this is not an option for me - my open-source-working-hours are usually between 0am and 2am local time (I'm at UTC+3), and my family wouldn't really enjoy me talking aloud that late 😃 However I'll be happy to see the recording (if any).

@ToucheSir (Member)

AIUI the main concern is with mutable structs shifting identity after fmap in a way that invalidates external references? The case of a node having multiple parents (i.e. DAG vs tree structure) should be handled by fmap already, because it maintains an IdDict cache of previously visited nodes. I presume that removing the requirement for mutable vertices and any operations that would modify the graph in place (e.g. rebalancing) would make this a non-issue. Is that a fair assessment?
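A small sketch of that cache behaviour, assuming a hypothetical mutable node type:

using Functors

mutable struct Node
    w::Vector{Float64}
end
Functors.@functor Node

shared = Node([1.0, 2.0])
model = (a = shared, b = shared)   # the same node referenced twice, i.e. a DAG rather than a tree

mapped = fmap(x -> 2 .* x, model)
# Thanks to fmap's IdDict cache, both fields still refer to one and the same
# new Node rather than two independent copies.
@assert mapped.a === mapped.b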

Unfortunately this is not an option for me...

No worries! I think it would be nice to have a touch base some time this summer, but that'll likely require some more scheduling.

@DrChainsaw (Contributor)

I'm unsure as to whether I can join as it collides with dinner over here.

I suppose one good agenda item could be whether shoehorning into existing frameworks or making it fully ONNX compliant is preferable. One might want to leave both doors open, which I guess is doable if we proceed with the ecosystem route.

The case of a node having multiple parents (i.e DAG vs tree structure) should be handled by fmap already, because it maintains an IdDict cache of previously visited nodes.

You're right! I should probably just give it a try now and see if it works. Perhaps I could just replace the whole copy mechanism in NaiveNASlib with Functors.jl too.

FWIW, fmap works in NaiveNASflux now too (would be pretty useless if it didn't); it is just that it has the surprising side effect of mutating the input.

@dfdx (Collaborator) commented Jun 22, 2021

In a quickly evolving discussion I forgot to answer @DrChainsaw 's questions!

Ghost looks awesome and I absolutely see how I would love to throw out the ad-hoc tracing mechanism in ONNXmutable for it when it comes to exporting. Would it be fair to say it does roughly what Mjolnir would do when it comes to exporting ONNX?

Kind of, with the exception that Ghost.jl is maintained :) At some point I tried to migrate to IRTools.IR for tape representation and Mjolnir as a tracer. It didn't go well. Even with all the nice APIs and safeguards, Julia IR (be it CodeInfo, IRTools.IR or other representations I tried) is hard to work with. Simple operations like inserting an instruction or replacing a constant with a function call may suddenly result in a segfault. Changing the tracer behavior at runtime is quite problematic (e.g. adding primitives or switching loop tracing on and off), and in general Mjolnir is designed more for partial evaluation than for tracing. But the worst thing is the lack of a maintainer - relying on a package which is unlikely to ever see any updates is quite an unreliable strategy.

To me, the import requires normal (but domain specific) Julia code to just implement what the natural language and type-definitions of the ONNX spec says.

It really depends on what your domain is. If you have a simple ONNX definition and want to construct Flux.Chain from it, then yes, direct translation sounds like a good plan. In more complex scenarios - collapsing several nodes into one or vice versa, dealing with loops, code generation, etc. - you may want an intermediate representation as a graph. Ghost provides such a graph with semantics close to that in ONNX as well as the tools for modifying this graph, execution, code generation, etc.


Obviously, I'm more familiar with Ghost.Tape so I keep thinking (and saying) in its terms, but I guess CompGraph should work equally well. If you have in mind the next steps which I can help with, please let me know and I'll be happy to jump in!

@ToucheSir (Member) commented Jun 22, 2021

Yup, I doubt Mjolnir will ever go beyond the experiment phase or see the light of day again (we should probably archive it). The work there is now being picked up and more robustly developed as part of https://github.com/JuliaCompilerPlugins and the Symbolics.jl ecosystem. I don't have a single authoritative link, but #compiler-plugins on Slack has a lot of great discussion. Unlike last time, there is both upstream/core support and many downstream users (e.g. JET) involved, so it's very much worth a look to see what infrastructure can be shared.

Edit: I should add that this new effort is able to learn from the experience of IRTools/Cassette/Mjolnir development and has a much better bus factor!

@darsnack (Member)

Flux crowd seem to prefer that Flux import of ONNX models return a Flux.Chain, even if this means failing to import models which can’t be expressed with Chain.

Potentially speaking as "the crowd" 😉, I want to echo @ToucheSir's comments and reassure you that this is not quite the case. I think eventually it would be nice to have an ONNX2Flux.jl package that can translate an ONNX model to a Flux model when possible, for the simple reason that it will be in the representation that Flux.jl users expect, and it will work most easily with other packages that operate on the Flux models themselves.

But I do not think that our first step should be reading ONNX models into Flux layers. I think the approach outlined here is the right one. ONNX.jl should not rely on Flux.jl in any way. We want complete support for arbitrary ONNX structures, which means using CompGraph (or Ghost.Tape or whatever we decide) as the structure to store and run the model. If we can make it Functors.jl compatible, then that would basically make ONNX2Flux.jl an aesthetic nicety in almost every case.

I am happy to help review/write code to push this effort forward. In particular, fmap should work for CompGraphs, but if you run into Functors.jl issues, then ping me/file an issue so that we can fix it.

@DrChainsaw (Contributor) commented Aug 15, 2021

Alright, I suppose the 'impedance mismatch' problem makes the separate framework the least risky way to go if the goal is to eventually support the whole spec, although it does seem like a pretty gargantuan task to me.

Between CompGraph and Ghost, the latter seems much more worked through and generic so my vote goes for it.

Do we want the operators in this repo? I suppose they might be dead weight for e.g. ONNX2Flux.jl unless one tries to first load the model in the generic framework and then translates what can be translated from that to Flux layers.

Just some things which might be good to think about before starting:

A lot of the impedance mismatch comes from ONNX being built with row major in mind (and even when it's not, models from the Python frameworks will most likely be row major). Both the old ONNX and ONNXNaiveNASflux (ex-ONNXmutable) try to import as column major, and the latter exports as row major. I suppose a lot of risk can be mitigated if one does not attempt this.

Maintaining the versioning of the opsets might become quite burdensome if one does not think about it early (i.e. before implementing a bunch of ops). I tried asking the question in the onnx forum but did not get any replies; this post is roughly the same and has some discussion in it: onnx/onnx#2872

@dfdx (Collaborator) commented Aug 16, 2021

Do we want the operators in this repo? I suppose they might be dead weight for e.g. ONNX2Flux.jl unless one tries to first load the model in the generic framework and then translates what can be translated from that to Flux layers.

I imagine 2-step import for most frameworks. Assuming Ghost.Tape for graph representation:

  1. ONNX file -> raw Tape with only NNlib functions.
  2. Raw Tape -> Tape with layers -> target model, e.g. Flux.Chain.

We can implement (1) in this repo and delegate (2) to the framework-specific repo. Transformations like conv(w, x) + σ -> Conv(w; σ) should be pretty easy on the already constructed graph; we can even introduce generic machinery for operation rewriting.
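For a Conv node, a hypothetical sketch of what (2) boils down to (weight/bias values invented; only the output shapes are checked here):

using Flux, NNlib

w = randn(Float32, 3, 3, 3, 16)   # kernel, as it would come from an ONNX initializer
b = randn(Float32, 16)            # bias initializer

# What stage (1) would leave on the tape, expressed with NNlib primitives:
raw(x) = NNlib.relu.(NNlib.conv(x, w; pad=1) .+ reshape(b, 1, 1, :, 1))

# What stage (2) collapses it into, as an idiomatic Flux layer:
layer = Flux.Conv(w, b, relu; pad=1)

x = randn(Float32, 32, 32, 3, 1)
@assert size(raw(x)) == size(layer(x))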

[...] try to import as column major and the latter exports as row major. I suppose a lot of risk can be mitigated if one does not attempt this.

If we don't permute dimensions, would it still be possible to use the usual operations like * or NNlib.conv()?

@darsnack (Member)

Having recently worked with Ghost.Tape, I can say it would be a really nice structure for this. We might want to add some more graph-like helper functions for easier traversal though.

@dfdx (Collaborator) commented Aug 18, 2021

Here's a draft of the loader based on Ghost.Tape. It only partially implements a single operation (Conv) and replaces all the others with identity, but with eval=false (i.e. not trying to evaluate the tape during construction) I was able to load AlexNet into this:

Tape{ONNXCtx}
  inp %1::Nothing
  inp %2::Array{Float32, 4}
  inp %3::Vector{Float32}
  inp %4::Array{Float32, 4}
  inp %5::Vector{Float32}
  inp %6::Array{Float32, 4}
  inp %7::Vector{Float32}
  inp %8::Array{Float32, 4}
  inp %9::Vector{Float32}
  inp %10::Array{Float32, 4}
  inp %11::Vector{Float32}
  inp %12::Matrix{Float32}
  inp %13::Vector{Float32}
  inp %14::Matrix{Float32}
  inp %15::Vector{Float32}
  inp %16::Matrix{Float32}
  inp %17::Vector{Float32}
  inp %18::Vector{Int64}
  %19 = conv(%1, %2)::Nothing
  %20 = broadcast(+, %19, %3)::Nothing
  %21 = identity(%20)::Nothing
  %22 = identity(%21)::Nothing
  %23 = identity(%22)::Nothing
  %24 = conv(%23, %4)::Nothing
  %25 = broadcast(+, %24, %5)::Nothing
  %26 = identity(%25)::Nothing
  %27 = identity(%26)::Nothing
  %28 = identity(%27)::Nothing
  %29 = conv(%28, %6)::Nothing
  %30 = broadcast(+, %29, %7)::Nothing
  %31 = identity(%30)::Nothing
  %32 = conv(%31, %8)::Nothing
  %33 = broadcast(+, %32, %9)::Nothing
  %34 = identity(%33)::Nothing
  %35 = conv(%34, %10)::Nothing
  %36 = broadcast(+, %35, %11)::Nothing
  %37 = identity(%36)::Nothing
  %38 = identity(%37)::Nothing
  %39 = identity(%38)::Nothing
  %40 = identity(%39)::Nothing
  %41 = identity(%40)::Nothing
  %42 = identity(%41)::Nothing
  %43 = identity(%42)::Nothing
  %44 = identity(%43)::Nothing
  %45 = identity(%44)::Nothing
  %46 = identity(%45)::Nothing
  %47 = identity(%46)::Nothing

A few takeaways from this exercise:

  1. Moving from CompGraph to Tape seems to be easy. Two things that are feasible but not beautiful are multi-output nodes (which we need to destructure from a tuple) and functions with keyword arguments (for which we must use Core.kwfunc(f)).
  2. Since not all inputs have init values (e.g. data like in inp %1 above), we either need to provide example arguments for them or skip evaluation (and thus validation).
  3. Converting row-major to column-major arrays looks easier than not converting. Though, I admit this may lead to ambiguities.
  4. Currently I implement loading of NodeProto onto a tape as load_node!(::Tape, ::NodeProto, ::Val{OP}) where OP. I think we can optionally extend it with one more parameter ::Val{OpSetVersion} to easily extend it for new opsets. But since I haven't run into issues with opsets yet, I'm not sure this idea makes sense at all 😄

@darsnack (Member)

Looks like a good starting point to me. Perhaps the tape context can store a backend selection (defaulting to NNlib). Then load_node! can dispatch on both the stored OP and the selected backend value. Probably every library is going to want to use NNlib.jl, but this will make it possible to switch out libraries for the few packages that provide their own conv, etc.

@dfdx dfdx mentioned this issue Aug 19, 2021
@dfdx (Collaborator) commented Aug 19, 2021

I've added a few more operations and created the PR to make it easier to track the changes. So far adding/migrating operations is easy as long as there's a counterpart in NNlib. Unfortunately that's not always the case; a few operations that I hit are:

  • BatchNormalization
  • Dropout
  • GlobalAveragePool
  • Flatten (can use reshape instead, but would be great to have a more direct way to express it)

Adding these ops to NNlib was discussed here and there, but as far as I know nothing has actually been done. I'm ready to handle it, but since it's quite a big piece of work, it would be great to get agreement on the approach itself first.

In case we go with Ghost and NNlib, I imagine this project to have the following stages:

  1. Implement import of a reasonably large subset of ONNX operations as NNlib primitives on Ghost.Tape. This alone will be enough to load many of the pretrained models to Julia and run on new data.
  2. Implement conversion Tape + NNlib -> Flux, e.g. replace a sequence like conv(x, w) + b with c = Conv(w, b); c(x). As a result, users will be able not only to load pretrained models, but also to tweak them using a familiar interface.
  3. Implement export of Tape + NNlib to ONNX. At this stage we may want to adjust the primitives we use. For example, currently I expand ONNX's Gemm into up to 6 operations on the tape.
  4. Implement conversion Flux -> Tape + NNlib.
  5. Support alternative backends instead of NNlib, multiple opsets, etc.
  6. Support additional NN frameworks such as Knet or Avalon.

@DrChainsaw (Contributor)

Great progress!

I would add testing somewhere between steps 0 and 2, as this becomes quite an uphill battle once more ops are added. There is some functionality in ONNXNaiveNASflux you could borrow for this. One approach is to download the test data from the onnx repo as artifacts and run the same test cases they do. I found that a lot of bugs still slip through, often because the test arrays are all ones or similar. The other approach, which feels safer (but less unit-y), is testing against onnxruntime, but I suppose this might be a bit cumbersome without export functionality.

Is there a chance that the Tape + NNlib format is more difficult to work with compared to the ONNX format? Or can one add metadata in there to facilitate?

One thing which I strived for in ONNXNaiveNASflux was to do a lot of the NodeProto extraction in generic code so that adding ops would not require implementers to unpack the protos. Not sure how successful it was, as I don't think anyone but me has added any ops, and it did spread things out a little bit.

My 2 cents on two of the points above:

  3. Converting row-major to column-major arrays looks easier than not converting. Though, I admit this may lead to ambiguities.

One risk/consequence is that one needs to consider the rearrangement of dimensions when implementing all ops (e.g. reshape). Many ops also use magic index numbers and negative indices, which also require some extra thought. I haven't run into anything showstopping, but it certainly makes adding ops a bit more cumbersome. I'm still overall in favour of permuting into Julia-friendly shapes though.

  4. Currently I implement loading of NodeProto onto a tape as load_node!(::Tape, ::NodeProto, ::Val{OP}) where OP. I think we can optionally extend it with one more parameter ::Val{OpSetVersion} to easily extend it for new opsets. But since I haven't run into issues with opsets yet, I'm not sure this idea makes sense at all 😄

I think that approach makes sense. One could perhaps just default to load_node!(t::Tape, n::NodeProto, op::Val{OP}, opset::Val{N}) = load_node!(t, n, op, Val(N-1)), where hitting 0 throws an "op not supported" error. Perhaps one does not even need to wrap the opset version in a Val (or one could use a struct OpSet{N} end for clarity).
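A minimal sketch of that fallback idea (the names are placeholders for illustration, not a proposed final API):

struct OpSet{N} end

# Ran out of versions to try: the op is not implemented for any known opset.
load_node!(tape, node, op::Val, ::OpSet{0}) =
    error("operator $op not supported in any implemented opset version")

# Generic fallback: if opset N has no specialised method, try N-1.
load_node!(tape, node, op::Val, ::OpSet{N}) where {N} =
    load_node!(tape, node, op, OpSet{N-1}())

# A concrete op implemented against opset 11 (and, via the fallback, picked up
# by any later opset without a more specific method):
function load_node!(tape, node, ::Val{:Conv}, ::OpSet{11})
    # ... push the corresponding call onto the tape here ...
end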

Any thoughts on Val-dispatch vs dict of functions with op as key? One aspect is that there is perhaps a little less friction if one wants to overload an op, as they can just replace an entry in the dict, but I suppose this can be seen as a drawback too. I would suggest adding this early. If nothing else, it shows which opset versions the current version is implemented against.

@dfdx (Collaborator) commented Aug 21, 2021

Any thoughts on Val-dispatch vs dict of functions with op as key?

I see 2 ways to implement replacement of an operation:

  1. Using a global dict to store operations - by replacing the entry with that operation.
  2. Using separate methods - by adding an optional parameter designating the backend as suggested by @darsnack.

If there are two packages A and B using ONNX.jl, and B decides to replace one of the operations in the global dict, A will be affected too. On the other hand, using separate methods we can do something like this:

load_node!(tape::Tape{ONNXCtx}, nd::NodeProto, ::Val{MyOp}) = ...  # default implementation
load_node!(tape::Tape{ONNXCtx{PkgBBackend}}, nd::NodeProto, ::Val{MyOp}) = ... # implementation specific to the package B

Does it make sense to you?


Is there a chance that the Tape + NNlib format is more difficult to work with compared to the ONNX format? Or can one add metadata in there to facilitate?

All in all, Tape is a general-purpose representation of a computational graph and should be able to represent data from any similar structure, including GraphProto. It can also be extended using the context type parameter (available as a .c field). Is there any particular piece of metadata you are interested in?

Not sure how successful it was as I don't think anyone but me has added any ops and it did spread things out a little bit.

Things like Dict(nd.attribute), akpsd(...), array(init), etc. are really helpful! Not sure I'd dive into it if I had to figure them out myself.

I would add testing somewhere between step 0 and 2 as this becomes quite an uphill battle once more ops are added.

I'm not so sure about testing - I'm considering it even after step 4 is implemented. I see a lot of risk in steps 1-4 since we may hit obstacles that will affect the whole design. Maybe it will be easier to get an MVP for all four steps and then iteratively add more operations and cover them with tests. Or maybe that will be too large a piece of work and we'd better make the result of step 1 really robust before moving further. The next step for me is to add functional forms of Dropout and BatchNormalization to NNlib so that I can load at least one real model from the ONNX model zoo :)

@ToucheSir (Member)

Functional forms of dropout and alpha dropout can probably be adapted easily from FluxML/Flux.jl#1618. Norm layers are a bit trickier, but most of the pieces are in FluxML/Flux.jl#1509 and bringing them into NNlib would be great for a number of reasons.

@darsnack (Member) commented Aug 22, 2021

So far adding/migrating operations is easy as long as there's a counterpart in NNlib. Unfortunately that's not always the case

This is why I think handling multiple libraries now makes sense. These ops are in Flux but not NNlib (though functional forms of normalization layers should make their way to NNlib).

I would suggest something like having a backends field in the ONNXCtx. Then change the interface to be

load_node!(tape::Tape{ONNXCtx}, nd::NodeProto, backend, op) = nothing
load_node!(tape::Tape{ONNXCtx}, nd::NodeProto, ::Val{:NNlib}, ::Val{:Conv}) = ...

The main loading function will loop over ctx.backends calling load_node! accordingly. If the return is not nothing, then it moves on; otherwise, it calls load_node! again with the next backend. By default ctx.backends = [:NNlib, :Flux].
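A rough sketch of that loop (ONNXCtx, load_node! and the backend symbols are as used in this thread; the function name is made up):

# Try each backend in priority order; the first one with a rule for this op wins.
function route_node!(tape, ctx, nd, op::Symbol)
    for backend in ctx.backends               # e.g. [:NNlib, :Flux]
        res = load_node!(tape, nd, Val(backend), Val(op))
        res === nothing || return res
    end
    error("Could not find a rule to load node $op in backends = $(ctx.backends)")
end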

Any thoughts on Val-dispatch vs dict of functions with op as key?

My experience with a very similar project where I was translating computational graphs written in Julia to a more restricted subset of ops was that a global dict was more cumbersome to work with. We'd need to provide safe accessors to the dict to "register" ops, and the dict entries tend to grow with tuples of metadata. I found writing a dispatch rule to be a far more straightforward API that should be familiar to anyone writing Julia. It also keeps the door open to eventually have an ONNXCore that folks can import to write their own op translation rules for their library.

@dfdx (Collaborator) commented Aug 22, 2021

The main loading function will loop over ctx.backends calling load_node! accordingly.

Oh, sorry, I misinterpreted your idea and thought of backends as a way to replace particular operations, not to extend the list of supported ones. But now I'm confused - if we add Flux as a backend, ONNX.jl will become quite a heavy package and hardly suitable to be included in other deep learning libraries (e.g. it would be weird to have Flux as a dependency for Knet). In fact, I assumed the opposite direction - to move common stuff like dropout or batch normalization to NNlib so that they could be re-used by a wider set of high-level libraries.

@ToucheSir (Member) commented Aug 23, 2021

IIUC, this declaration:

load_node!(tape::Tape{ONNXCtx}, nd::NodeProto, ::Val{:NNlib}, ::Val{:Conv}) = ...

Would be in Flux or a separate package, not ONNX.jl. This repo shouldn't ever have to depend on any higher-level framework. That's another argument for a dispatch-based approach: all that other libraries need to do to hook into the system is define their own overloads. NNlib should still be the common denominator, but there are scenarios where it would be nicer to extract full layer structs as well.

@darsnack (Member)

Would be in Flux or a separate package, not ONNX.jl

Yup, exactly. It will also help iterate on this package without waiting for Flux to move stuff to NNlib (which might be slower). Eventually, we'll drop the Flux dependency.

@dfdx (Collaborator) commented Aug 23, 2021

If eventually all operations implemented in ONNX.jl will be based on NNlib, do we actually need multiple backends? If the main goal at the moment is to have at least some implementation for all operations, we can just use Flux functions directly, e.g.:

function load_node!(tape::Tape{ONNXCtx}, nd::NodeProto, ::Val{:Dropout})
    ...
    val = Flux.dropout(...)
end

and when these functions are moved to NNlib, simply replace them:

function load_node!(tape::Tape{ONNXCtx}, nd::NodeProto, ::Val{:Dropout})
    ...
    val = NNlib.dropout(...)
end

Why do we need the .backends field then?

Regarding operation overloading in specific packages, I thought about extending ONNXCtx to include the backend, e.g. (repeating my example above):

load_node!(tape::Tape{ONNXCtx}, nd::NodeProto, ::Val{MyOp}) = ...  # default implementation, i.e. NNlib-based
load_node!(tape::Tape{ONNXCtx{:Flux}}, nd::NodeProto, ::Val{MyOp}) = ... # Flux implementation

This change will be non-breaking, so we can easily postpone it.

(Not that I'm against passing the backend as an additional parameter, I just don't fully understand why we need it).

@DrChainsaw (Contributor)

I guess we are suffering a bit from branching here. Posting a few hip-shot answers to some things which seem hip-shotable to me.

Not that I'm against passing the backend as an additional parameter, I just don't fully understand why we need it

I suppose backends and translating a Tape to some other representation are two kind of redundant features and we should try to decide which one is the recommended way.

As for getting stuff into some other framework, I think this is useful due to auxiliary tools which may be specialized towards a certain framework (e.g. NaiveNASflux).

About val-dispatch:
I also like the flexibility and I agree the dict only works with single dispatch. While we don't care so much about performance, isn't there a risk that things become unbearably slow to compile, as per the performance tips? I suppose this can be mitigated with @nospecialize, but I have never used it so I don't know for sure.

About testing:
I consider it a must for people to be able to trust the package. A good way to mitigate the risk of test churn is to use test-specific functions so that if the API under test changes, one only needs to change things in one place.

After one can import e.g. VGG, chances are that one will need to manually check the output from each operation in the graph, and this effort becomes kinda wasted if it is not put into automated tests.

Anyways, I'm not a TDD zealot and I don't want to try to force people to work in a certain way. Your comment read as if tests might not even be needed in the end and that is where my main complaint lies.

@ToucheSir (Member)

@dfdx there are two overlapping goals with the ONNX effort. The first is to allow imported and exported ONNX models to run, period. That shouldn't require anything beyond NNlib and maybe Functors. The second is to open up ONNX as a storage format for higher-level libraries like Flux. This would require mostly lossless round-tripping for layers, which is not a goal of the NNlib backend. Now, it may come to pass that this approach is too finicky to be useful, so I agree with your idea to structure the dispatch such that we can postpone it.

@DrChainsaw I believe the point about testing was not that tests shouldn't exist at all, but that we shouldn't try to do too much too early. Certainly, testing outputs against some reference requires enough operations to be implemented to make it worth our while; otherwise they're just tests against NNlib by proxy. Testing whether a series of ops lowers to a specific tape representation seems much easier, and I imagine that is in the works here.

@darsnack (Member) commented Aug 23, 2021

I suppose backends and translating a Tape to some other representation are two kind of redundant features and we should try to decide which one is the recommended way.

This is correct, and I see two possible options here.

Direct Translation Approach

Right now, we seem to be going down the first one, which is to translate each node into a function in NNlib. The main issue I see here is that NNlib must be able to express the ONNX graph. For certain ops, like conv or normalization, that will probably be okay, even though it will take some time for normalization to make it in. I don't see flatten making its way into the library, and stuff like adaptive/global pooling is handled by computing PoolDims correctly in the Flux layer. As it stands, NNlib isn't a functional library of layer primitives; it operates at a lower level IMO. We could move towards NNlib as a common source of functional primitives, but it seems like most packages like Flux or Knet wouldn't benefit from such changes.

In this option, the API is to define direct translations via:

load_node!(tape::Tape{ONNXCtx}, nd::NodeProto, ::Val{MyOp}) = ...

Transformation Approach

The other option is for ONNX.jl to translate nodes into a tape of ONNX-specific structs, for example ONNX.Conv, etc., which can store metadata similar to Flux.Conv. These don't necessarily need to be runnable (meaning no importing Flux or NNlib). The goal here is to translate the raw ONNX file into a data structure that can be easily transformed via dispatch-based overloading. This keeps ONNX.jl very lightweight, and we define all the core reading/writing functionality based on a common opset that ONNX.jl controls. We could then write ONNXFlux.jl which defines a transformation to take this ONNX-specific tape and turn it into a Flux tape. We can also have ONNXRuntime.jl which actually defines the execution of this tape via the C API. And we could still allow running the ONNX tape via just NNlib, though there would effectively be code duplication with Flux.

In this option, the API is to pass in a transform function. A package defines something like

my_transform!(tape::Tape, call::ONNX.Conv) = ...

and ONNX.jl uses Ghost.replace!, etc. to provide something like:

transform_tape!(ftransform!, tape::Tape) = ...

You can see here and here where I used a similar design approach with Ghost.


The backends idea was an attempt at straddling between the two approaches. It would allow us to directly translate to NNlib function calls while still using other packages when NNlib falls short. And it would allow users to pass in a vector of backends in their preferred order when they want to. And it allows NNlib, Flux, and Knet translations to all be defined simultaneously.

@dfdx (Collaborator) commented Aug 24, 2021

@DrChainsaw

Your comment read as if tests might not even be needed in the end and that is where my main complaint lies.

Sorry I made you think so! I definitely don't propose releasing untested code; I just want to have a little bit more confidence in our approach before investing a lot of time into testing, docs, etc.

@darsnack

I don't see flatten making its way into the library, and stuff like adaptive/global pooling is handled by computing PoolDims correctly in the Flux layer.

Could you please elaborate on this? Adding functional forms of these operations to NNlib was exactly what I was targeting. Maybe it won't bring much value to Flux or Knet, which already have their own versions, but it will certainly help other (existing and potential) frameworks as well as improve unification across the infrastructure.

Regarding the transformation approach, as far as I understand, the API you described requires a one-to-one correspondence between an ONNX.Layer and a MyFramework.Layer. But that may not always be possible. For example, Flux and Knet keep the activation inside the Conv layer, but e.g. Avalon, which follows the PyTorch API, keeps them separate. This means that either on import or on export we will need to support transformations from M to N operations, e.g.:

ONNX.Conv + ONNX.Sigmoid -> Flux.Conv

This is a bit harder than replacing a single operation, but in the previous incarnation of Tape I had a utility function rewrite!() that could transform the graph according to a set of rules like:

Sigmoid(Conv(x)) => Conv(x, sigmoid)

I "lost" this function due to refactoring, but earlier or later I'm going to add it again. Using it, it should be possible not only transform ONNX.Conv to and from a framework-specific representation, but also use NNlib primitives directly, e.g.:

act(conv(x, w) .+ b) => Conv(w, b; activation=act)(x)

@darsnack (Member)

Could you please elaborate on this?

The initial (and current) focus of NNlib has been defining fast kernels for operations that need them (GEMM, conv, and pooling). The purpose is to have a common API to C libraries, etc. We could expand the scope to provide a unified functional API. As it stands, since Flux won't benefit much, I wouldn't hold my breath that this will happen quickly (it would mainly be work towards ONNX). That's okay. I'm certainly not saying that a unified API package would be a bad thing.

The main concern I have with the NNlib approach is that it cements NNlib functions as the definitions for usable ONNX ops. This poses two potential issues:

  1. We would need to make NNlib capable of expressing the ONNX spec as well as keep it in sync. I don't know much about the ONNX spec, so maybe my concerns are misplaced. But I see this as piling two purposes onto NNlib — providing fast kernels and representing ONNX graphs. I worry about a future issue where those purposes don't line up neatly (again, maybe this is worrying about a future that doesn't exist).
  2. A tape that directly contains NNlib functions might be annoying to write transformation rules for (i.e. the NNlib (f, args...) might not be the nicest format). The alternative is to extend load_node!, but I'm beginning to think that interacting directly with NodeProto is not friendly.

So, my current thoughts are to have ONNX ops like ONNX.Conv and to load a file into a tape of those ops. These would be easier to reason with for transformations. This still keeps the door open to unify all functional forms in NNlib and define the forward pass of ONNX op structs using those functions.


I had not considered the M to N transform case. Currently, 1 to N is easily doable. If adding the rewrite! back isn't too hard, it sounds like it would be useful here.

@darsnack (Member) commented Aug 24, 2021

If we did continue with the current approach though, I would still have backends be a field not a parameter. It seems constraining that all the ops should map to functions within the same (backend) package. Also, I find that it is easier to have flexible code when the default is "just using" the interface rather than a specific hardcoding (i.e. the default is backends = [:NNlib]).

@dfdx (Collaborator) commented Aug 24, 2021

Currently, 1 to N is easily doable.

My point is that 1 to N during import means N to 1 during export :)

So, my current thoughts are to have ONNX ops like ONNX.Conv and to load a file into a tape of those ops.

What concerns me in this approach is that graphs built from definition-only ops won't be executable by themselves. This means one would not be able to run ONNX models until e.g. ONNXFlux or ONNXRuntime is implemented, and, what's even more important, it will be much harder to test the code properly.

How about using NNlib when possible & convenient and implementing additional ops directly in this repo? It's quite a dirty approach, but this way we should be able to move quickly, get at least something working and then see what actually works or doesn't work for us.

If we did continue with the current approach though, I would still have backends be a field not a parameter.

I think I still don't fully understand your idea here. You say that it will be useful to have multiple implementations of load_node!() for the same op with different backends and let the user choose between them, even though all backends will be incomplete by themselves? For example:

# Conv implemented by both - NNlib and Flux backends
load_node!(tape::Tape, nd::NodeProto, ::Val{:NNlib}, ::Val{:Conv}) = ...
load_node!(tape::Tape, nd::NodeProto, ::Val{:Flux}, ::Val{:Conv}) = ...

# Sin implemented only by NNlib
load_node!(tape::Tape, nd::NodeProto, ::Val{:NNlib}, ::Val{:Sin}) = ...

# Softmax implemented only by Flux
load_node!(tape::Tape, nd::NodeProto, ::Val{:Flux}, ::Val{:Softmax}) = ...

# user chooses the backend
ONNX.load(filename, backends=[:Flux])   # will only use Flux backend, won't be able to load Sin

@darsnack (Member)

My point is that 1 to N during import means N to 1 during export :)

Ah, got it, good point!

What concerns me in this approach is that graphs built from definition-only ops won't be executable by themselves.

How about using NNlib when possible & convenient and implementing additional ops directly in this repo?

I agree, and what I'm suggesting, if we go the route where the API is "transformations," is to define all ops in this repo for consistency. The ops that map 1-1 with NNlib functions can just forward there, and other ops can be implemented in the repo as needed. So,

(layer::ONNX.Conv)(x) = NNlib.conv(...)
(layer::ONNX.Flatten)(x) = reshape(...)
(layer::ONNX.Sine)(x) = sin(...)
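For concreteness, a hypothetical sketch of such a definition-carrying op (field names invented for illustration):

import NNlib

# Stores the ONNX attributes/initializers so transformations can read them,
# while the forward pass simply forwards to NNlib.
struct Conv{W, B, P}
    weight::W
    bias::B
    pad::P
    # ... plus strides, dilations, group, etc. as needed
end

(c::Conv)(x) = NNlib.conv(x, c.weight; pad=c.pad) .+ reshape(c.bias, 1, 1, :, 1)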

Alternatively, if we don't use the transformation API, then these suggestions are not as relevant.

it will be useful to have multiple implementations of load_node!() for the same op with different backends and let the user choose between them, even though all backends will be incomplete by themselves

The usefulness would be in the flexibility of ONNX.jl's design for rule writers. As @DrChainsaw pointed out, there is overlap between the backends approach and the transformations approach. We provide a load function that uses a ruleset defined by load_node! to import the graph, looking for the best match across arbitrary backends. So, a package like ONNX2Flux will define/override some additional rules and wrap ONNX.load to pass in the appropriate backends (in order of priority) for ONNX2Flux. This basically allows downstream code to define the rules that make sense for them and re-use existing rulesets from an alternate backend as a fallback. For example, writing a rule for :Sin with a Flux backend is just extra work to re-route to Base.sin. It would make more sense to me to have

# this is already defined in ONNX.jl
load_node!(tape::Tape, nd::NodeProto, ::Val{:Base}, ::Val{:Sin}) = ...

# ONNX2Flux.jl overrides Conv
load_node!(tape::Tape, nd::NodeProto, ::Val{:Flux}, ::Val{:Conv}) = ...

And the wrapping ONNX2Flux.load uses backends = [:Flux, :NNlib, :Base].

Of course, if none of the backends passed in have a rule defined for a certain op, then we will throw an error saying "Could not find rule to load node $OP in backends = $backends.".

@dfdx (Collaborator) commented Aug 24, 2021

@darsnack Got it, thanks! I've pushed the version with multiple backends to the PR.

I agree, and what I'm suggesting, if we go the route where the API is "transformations," is to define all ops in this repo for consistency.

I like this idea. The only question I have is whether ONNX.Ops should behave like functions or like layers (i.e. structs), e.g. should :Conv be translated to

y = conv(w, x, b; pad=pad)

or

c = Conv(w, b; pad=pad)
y = c(x)

ONNX itself is purely functional with no internal fields, so I'm leaning towards the first option, but I can try to use layers instead to make it more convenient for the high-level frameworks.

@darsnack (Member)

The only question I have is whether ONNX.Ops should behave like functions or like layers

That's a great question that I don't know the best answer to 😅. It will probably be a balance between staying true to the ONNX nodes and re-packaging them to make transformations easier. I think your suggestion to start with functions, because that's what ONNX does, is a good idea.

@dfdx (Collaborator) commented Aug 28, 2021

I can now load and execute ResNet-18. It doesn't mean all operations are implemented correctly, but at least the shapes match in this limited case. I'm mostly satisfied with the API and ready to move forward. Some details:

  • We now have an almost 1-to-1 mapping between ONNX operations and our implementations backed by NNlib and Flux. "Almost" because of operations with multiple return values, for which I had to add getfield() calls to extract the individual output vars.
  • load_node!() is now dispatched using parametric type OpConfig{Backend, OpType}. The function signature has also been simplified.
  • Just as @DrChainsaw mentioned, the difference between column- and row-major indexing creates some... surprises. For instance, in the Gemm implementation I had to swap the order of arguments, which worked well in this case but may not be suitable for others.

Given the current design, I don't see risks in conversion to/from Flux.Chain anymore, but I'm starting to feel the uphill battle of the missing tests. @DrChainsaw could you please elaborate on the options you mentioned above? In particular:

One approach is to download the test data from the onnx repo as artifacts and run the same test cases they do.

I understand the relevant part in ONNXNaiveNASflux is here, and I see that you download test data from here, but I don't immediately get the rest of the logic. Is it correct that you create a single-node graph, evaluate it and compare to...?

The other approach, which feels safer (but less unit-y), is testing against onnxruntime

I don't know much about onnxruntime, could you please clarify how we can use it for testing?

@DrChainsaw (Contributor)

the difference between column- and row-major indexing creates some... surprises.

Yeah, and this tends to get worse with the more difficult-to-grok dimension-manipulating ops. For instance, with reshape, I just closed my eyes and prayed the test cases would pass when reversing the dimensions to reshape. There are much worse things than reshape in there though; this one looks pretty anxiety-inducing at a glance. Hopefully most of them just follow the pattern of either reversing the index array or the indices themselves (i.e. index i becomes N - i, where N is the number of dimensions of the input). The whole thing does put a dent in the ambition to just do the ONNX ops as they are described, though.

Is it correct that you create a single-node graph, evaluate it and compare to...?

The onnx tests consist of 1) an onnx model/node, 2) input data and 3) the expected output when 2) is fed into 1). So: 1) load the model/node, 2) load the input data and feed it into the loaded model/node, and 3) load the expected output and verify that it is identical (within the last decimal or so) to the output from step 2).
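The test data ships as a model.onnx next to test_data_set_*/input_*.pb and output_*.pb files, so the harness is roughly the sketch below (load_model and load_tensor are placeholders for whatever this package ends up exposing, and the tolerance is a guess):

using Test

function run_onnx_testcase(dir)
    model = load_model(joinpath(dir, "model.onnx"))                            # 1) the model/node
    x = load_tensor(joinpath(dir, "test_data_set_0", "input_0.pb"))            # 2) input data
    y_expected = load_tensor(joinpath(dir, "test_data_set_0", "output_0.pb"))  # 3) expected output
    @test model(x) ≈ y_expected rtol = 1e-5                                    # compare within tolerance
end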

The drawback, as I said, is that in many cases the input, and maybe also the model parameters, are just arrays with a single unique value (e.g. all ones), and this lets a lot of errors slip through. For example, incorrect padding reshuffling goes undetected with the onnx test data, iirc. Perhaps this has been corrected in later versions, so it might be worthwhile to just use the latest onnx version.

I don't know much about onnxruntime, could you please clarify how we can use it for testing?

onnxruntime just happens to be what my 5-minute search found to be the most 'canonical' onnx inference library. The way it is used in ONNXNaiveNASflux is that an op/model (e.g. Flux.Conv or some function with several ops) is exported, then some data is fed through 1) the original op/model, 2) the exported op/model using onnxruntime (through PyCall) and 3) the op/model imported by ONNXNaiveNASflux, and it is asserted that all three produce the same output (within the last decimal or so). This is for instance how I caught the incorrect padding shuffling. Without any export functionality I'm not sure how to make use of this though. One way could be to do the onnx-testdata testing with better input data.
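Roughly, the onnxruntime comparison looks like the sketch below (julia_model, the export step and the tolerance are placeholders; PyReverseDims hands the Julia array to numpy row-major with the dims reversed, which matches ONNX's layout convention):

using PyCall, Test

const ort = pyimport("onnxruntime")

function test_against_ort(julia_model, onnx_path, x)
    sess = ort.InferenceSession(onnx_path)
    input_name = first(sess.get_inputs()).name
    y_ort = first(sess.run(nothing, Dict(input_name => PyReverseDims(x))))
    y_jl = julia_model(x)
    # compare in ONNX dim order by reversing the Julia output's dims
    @test permutedims(y_jl, ndims(y_jl):-1:1) ≈ y_ort rtol = 1e-5
end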

@ToucheSir
Copy link
Member

I don't know much about onnxruntime, could you please clarify how we can use it for testing?

I assume this means testing outputs for either individual layers or groups of layers against those derived from running the same graph on onnxruntime (via PyCall). This would certainly be useful for some integration tests, but might not work as well overall because of differences in how certain operations are calculated. For example, my understanding is that NNlib.conv actually performs a convolution, whereas most python frameworks run the usual cross-correlation.

@DrChainsaw
Copy link
Contributor

DrChainsaw commented Aug 28, 2021

This would certainly be useful for some integration tests, but might not work as well overall because of differences in how certain operations are calculated.

But then the onnx model is not representing the same thing, which would be very bad. If that is the case then either onnxruntime or this package needs to change. Or perhaps you are talking about small numerical differences just from computing the same thing in a slightly different order? I'm not sure how "the usual cross-correlation" differs from "convolution". The result of convolution between two sequences can be called the cross-correlation between those two sequences (ok, there are a number of options for how to scale things), or is my signal processing too rusty?

Btw, ONNXNaiveNASflux has a lot of tests against onnxruntime, and so far they have only caught real errors in ONNXNaiveNASflux.

@dfdx
Copy link
Collaborator

dfdx commented Aug 28, 2021

Without any export functionality I'm not sure how to make use of this though.

I'll check how hard it will be to implement the export side.

@ToucheSir
Copy link
Member

@DrChainsaw good to know, I'll defer to your expertise on this :) AIUI the only difference for my specific example is that convolution flips the kernel while cross-correlation doesn't, but how everything washes out once you start talking about row vs column major I'm not sure.

Without any export functionality I'm not sure how to make use of this though. One way could be do to the onnx-testdata testing with better input data.

It seems like this scheme could still be useful? We could always bootstrap it by writing Python scripts which generate the test cases. That would avoid the need for export code until that functionality is implemented.

@dfdx
Copy link
Collaborator

dfdx commented Sep 2, 2021

I've removed the WIP status of the PR in #50 . Here's what changed from the last update:

  • added save() method and implemented export of a single operation (Add, elementwise addition)
  • added ort_test() method for testing the implementation against onnxruntime as suggested by @DrChainsaw
  • commented out most loaders & savers as untested, leaving only the aforementioned Add
  • added docstrings for the main API

Adding more features in a single PR becomes increasingly hard, so if you guys don't have hard objections to the approach I'd like to merge it and add further changes in smaller chunks. This doesn't mean the API is frozen, of course; we can still change any details in more specific PRs.

@ToucheSir
Copy link
Member

I left a couple of quick comments, but this probably deserves a review from someone with more experience handling ONNX.

@dfdx
Copy link
Collaborator

dfdx commented Sep 3, 2021

Meanwhile I'm trying to add an exporter for Conv, but I don't quite understand how to handle the dimensions, so maybe you guys can help me.

  1. For 2D convolution ONNX expects inputs as (N, C, H, W), i.e. (batch_size, num_channels, height, width). Images.jl, as well as my intuition, tells me that in Julia the dimensions should be (H, W, C, N) - similar to the matrix representation. However, the Flux documentation states that Conv expects the input to have size (W, H, C, N). Sure, working only in Flux it doesn't really matter, since the input just needs to have the same encoding as the weights, but I'm not sure it will work well for interop with ONNX.
  2. For 3D convolution, where should the depth dimension go?
  3. Flux.Conv is actually cross-correlation, right? Then why do we need flipweights?

@ToucheSir
Copy link
Member

  1. The blame points to FluxML/Flux.jl@7aa6854 which unfortunately does not seem to have a PR nor a good explanation thereof. I do think WHCN makes the most sense because it makes for the most painless transition between col -> row major when dispatching to backends like CuDNN. @johnnychen94 wrote a great summary of the trade-offs as part of add documentation for image layouts Flux.jl#1705. (A small layout-conversion sketch follows this list.)
  2. WHDCN, if we follow the same pattern as 2D convs and reverse https://pytorch.org/docs/stable/generated/torch.nn.Conv3d.html. I haven't tested if this works OOTB, but PyTorch tends to give a good picture of what the constraints are because it mirrors CuDNN behaviour.
  3. Someone please correct me if I'm wrong, but Flux.Conv may actually be a proper convolution. I'm basing this off the fact that you can replicate https://www.mathworks.com/help/images/what-is-image-filtering-in-the-spatial-domain.html using NNlib.conv without any flipping, but the row/column major transition makes everything quite confusing.
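A small layout-conversion sketch for point 1 (hypothetical helper name, not an existing API): mapping ONNX's (N, C, H, W) batch layout onto the (W, H, C, N) layout Flux expects is a full dimension reversal.

nchw_to_whcn(x::AbstractArray{<:Any,4}) = permutedims(x, (4, 3, 2, 1))

x_onnx = rand(Float32, 8, 3, 16, 16)   # (N, C, H, W)
size(nchw_to_whcn(x_onnx))             # (16, 16, 3, 8), i.e. (W, H, C, N)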

@DhairyaLGandhi
Copy link
Member

WHCN is for the col-major behaviour. And yes, conv does a proper convolution, not a cross-correlation.
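For what it's worth, here is a small check of that distinction, relying on NNlib's flipped keyword (which switches to cross-correlation); the two results differ unless the kernel is symmetric.

using NNlib

x = reshape(Float64.(1:5), 5, 1, 1)     # 1D input in (W, C, N) layout
w = reshape([1.0, 2.0, 3.0], 3, 1, 1)   # kernel in (W, Cin, Cout) layout

conv(x, w)                  # true convolution: the kernel is flipped
conv(x, w; flipped = true)  # cross-correlation: the kernel is used as-is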

@DrChainsaw
Copy link
Contributor

I had a quick look into the motivation for flipweights, as I had forgotten why I added it in the first place.

It seems like all testcases in ONNXNaiveNASflux pass if flipping is removed. This is however not strange, as flipping is done both when serializing and when deserializing, and the onnx testdata uses ones for the weights, so flipping or not makes no difference.

I did however try loading vgg with testdata from the onnx model zoo and both the mxnet version (vgg16) and the caffe version require weight flipping for the testdata to give identical (within floating point precision) results.

Another thing which confuses me is that I would have expected a permutedims to be needed somewhere to translate OIHW to WHIO, but apparently this is not needed. If you look at the generic translation from TensorProto to Array here, it just reshapes with the dimensions in reverse order, and the only other thing done is to flip the weights.
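One way to see why no explicit permutedims shows up for the raw TensorProto data (as opposed to an already-materialized Julia array): serializing an array row-major and reading the bytes back column-major with the dims reversed is itself the dimension reversal. A small check, not the package code:

A = rand(2, 3, 4)                    # the logical ONNX array, in ONNX dim order
raw = vec(permutedims(A, 3:-1:1))    # its row-major flat data, as a TensorProto stores it

B = reshape(raw, reverse(size(A)))   # what the loader does: reshape with reversed dims
B == permutedims(A, 3:-1:1)          # true: the reversed reshape already reverses the dims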

@dfdx
Copy link
Collaborator

dfdx commented Sep 11, 2021

Another thing which confuses me is that I would have expected a permutedims to be needed somewhere to translate OIHW to WHIO, but apparently this is not needed.

Testing symmetric changes turns out to be pretty hard. In this commit I do add a permutedims() (via julia2onnx/onnx2julia), and all tests pass equally before and after this change. I don't know what kind of test would distinguish between them, but here's an example of what we get with permutedims vs. only reshape:

julia> w = Float64[1 2 3; 4 5 6]
2×3 Matrix{Float64}:
 1.0  2.0  3.0
 4.0  5.0  6.0

julia> onnx2julia(w)
3×2 Matrix{Float64}:
 1.0  4.0
 2.0  5.0
 3.0  6.0

julia> reshape(w, 3, 2)
3×2 Matrix{Float64}:
 1.0  5.0
 4.0  3.0
 2.0  6.0

As you can see, instead of transposing the matrix, which would be the proper reversal of dimensions, reshape() totally changes the content. Also, the result of multiplication with a random vector is different:

julia> x = rand(1, 2)
1×2 Matrix{Float64}:
 0.0829937  0.579257

# matmul in ONNX-native dimensions
julia> x * w
1×3 Matrix{Float64}:
 2.40002  3.06227  3.72452

# matmul in Julia-native dimensions
# the shape is different, but the content is the same
julia> onnx2julia(w) * onnx2julia(x) 
3×1 Matrix{Float64}:
 2.400021221468818
 3.0622718363075028
 3.7245224511461874

# reshape-based matmul
# the content is different
julia> reshape(w, 3, 2) * reshape(x, 2, 1)
3×1 Matrix{Float64}:
 2.979278090345529
 2.0697455904780284
 3.641528705184214

Meanwhile, I've prepared the next chunk of changes; check out the diff with the current PR in dfdx#1 (I'll retarget the PR to ONNX.jl#master once #50 is merged). This new PR adds a set of converters between ONNX and Julia and tests them on the Conv operator.

@dfdx
Copy link
Collaborator

dfdx commented Sep 11, 2021

Here's the actual PR for conversions + conv: #51
