
Halide Development Roadmap #5055

Open
abadams opened this issue Jun 19, 2020 · 46 comments
Labels
discussion For talking about potential improvements & changes.

Comments

@abadams (Member) commented Jun 19, 2020

This issue serves to collect high-level areas where we want to improve or extend Halide. Reading it will let you know what is on the minds of the core Halide developers. If there's something you think we're not considering that we should be, leave a comment. This document is a continual work in progress.

This document aims to address the following high-level questions:

  • How should we organize development?
  • How do we make Halide easier to use for new users?
  • How do we make Halide easier to use for new contributors?
  • How do we keep Halide maintainable over time?
  • How do we make Halide easier to use for researchers wanting to cannibalize it, extend it, or compare to it?
  • How do we make Halide more useful on current and upcoming hardware?
  • How do we make Halide more useful for new types of application?

To the greatest extent possible we should attach actionable items to roadmap issues.

Documentation and education

The new user experience could use an audit (e.g. the README).

There are a large number of topics that are missing tutorials.

Some examples:

  • The GPU memory model (e.g. dirty bits, implicit device copies, explicitly scheduled device copies)
  • Using Func::compute_with
  • Effectively picking a good TailStrategy
  • Scheduling atomic reductions, including horizontal vector reductions
  • Generators with multiple outputs (there's a trade-off between tuples, extra channels, compute_with)
  • Using (unrolled) extra reduction dimensions for scattering to multiple sites (plus the scatter/gather intrinsics)
  • Using extern funcs and extern stages in generators
  • Calling other generators inside a generator
  • Using a Generator class defined in the process directly via JIT (Generator::realize isn't discoverable)
  • Overriding the runtime
  • Automatic differentiation
  • Integrating with OpenCV, tensorflow, pytorch, and other popular frameworks.
  • lambda
  • Buffer (a short sketch covering lambda and Buffer together follows this list)
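
The last two are small enough to sketch inline. A minimal, hedged example using lambda and Buffer together:

```cpp
// lambda builds an anonymous Func inline; Buffer holds the realized result.
#include "Halide.h"
#include <cstdio>
using namespace Halide;

int main() {
    Var x("x"), y("y");
    // One-liner Func definition via lambda, realized to an 8x8 buffer.
    Buffer<int> ramp = lambda(x, y, x + 10 * y).realize({8, 8});
    printf("ramp(3, 2) = %d\n", ramp(3, 2));  // prints 23
    return 0;
}
```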

There is not enough educational material on the Halide expert-user development flow, looping between tweaking a schedule, benchmarking/profiling it, and examining the .stmt and assembly output.

One thing we have is this: https://www.youtube.com/watch?v=UeyWo42_PS8
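
That loop, sketched under some assumptions (a pure Func `f` with no scalar parameters; the benchmark helper comes from tools/halide_benchmark.h):

```cpp
// Sketch of the expert iteration loop: tweak the schedule, emit the lowered
// Stmt and assembly for inspection, then benchmark the JIT-compiled pipeline.
#include "Halide.h"
#include "halide_benchmark.h"  // from the Halide tools/ directory
#include <cstdio>
using namespace Halide;

void iterate_on_schedule(Func f, Buffer<float> &out) {
    // 1. Tweak the schedule.
    Var x = f.args()[0];
    f.vectorize(x, 8, TailStrategy::GuardWithIf);

    // 2. Inspect the loop nest and the generated machine code.
    f.compile_to_lowered_stmt("f.stmt", {}, Text);
    f.compile_to_assembly("f.s", {}, "f");

    // 3. Benchmark, then go back to step 1.
    double t = Halide::Tools::benchmark(10, 10, [&]() { f.realize(out); });
    printf("%g ms\n", t * 1e3);
}
```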

Documentation for the developers

  • There should be a guide for how an external contributor should make their first pull request on Halide and what to expect. This is commonly in a CONTRIBUTING.md top-level document. There are also pull request templates we can create.

  • There should be a more detailed document or talk describing the entire compilation pipeline from the front-end IR to backend code to help new developers understand the entire project.

Support for extending or repurposing parts of Halide for other projects

Some things that could help:

Build system issues

We shouldn't assume companies have functioning build tools

Some companies build projects using a mix of duct tape and glue in a platform-varying way. Any configuration that goes into the build system is very painful for them (e.g. GeneratorParams for generator variants). Large numbers of binaries (e.g. one generator binary per generator) can also be painful (e.g. in Visual Studio). We should consider making GenGen.cpp friendlier to the build system (e.g. by implementing caching or depfiles) to help out these users.

Our buildbots aren't keeping up and require too much manual maintenance

Our buildbots are overloaded and have increasingly out-of-date hardware in them. Some can only be administered by employees at specific companies. We need to figure out how to increase capacity without requiring excessive manual management of them.

Runtime issues

  • The runtime includes a lot of global state, which is great for sharing things between all the Halide pipelines in a process, but if there are multiple kinds of Halide users in the same large process, things can get complicated quickly (e.g. if they want different custom allocators). One option would be removing all global state and passing the whole runtime in as a struct of function pointers.

  • While most of the important parts of the runtime can be overridden by setting function pointers, some parts of the runtime can only be overridden using weak linkage or other linker tricks, and this is problematic on some platforms in some build configurations.

  • There needs to be more top-level documentation for the runtime, describing how one may want to customize it in various situations. Currently there's just a few paragraphs at the top of HalideRuntime.h, and then documentation on the individual functions.

  • Runtime error handling is a contentious topic. The default behavior (abort on any error) is the wrong thing for production environments. There isn't much guidance or consistency on how to handle errors in production; a sketch of the existing function-pointer override follows this list.
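
As context for the first and last bullets, a minimal sketch of the function-pointer override that exists today (the handler body is illustrative):

```cpp
#include "HalideRuntime.h"
#include <cstdio>

// Custom error handler: log the message instead of calling abort(). With a
// non-aborting handler installed, pipelines report failures through their
// nonzero return codes.
void my_error_handler(void *user_context, const char *msg) {
    fprintf(stderr, "halide error: %s\n", msg);
}

void install_handlers() {
    halide_set_error_handler(my_error_handler);
}
```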

Lifecycle

Versioning

Since October 2020, Halide has used semantic versioning. The latest release is (or will soon be) v15.0.0. We should adopt some practice for keeping a changelog between versions for Halide users. Our current approach of labeling "important" PRs with release_notes has not scaled.

Packaging

Much work has been put into making Halide's CMake build amenable to third-party package maintainers. There is still more to do for cross-compiling our arm builds on x86.

We maintain a list of packaging partners here: #4660

Code reuse, modularity

How do we reuse existing Halide code without recompiling it, especially in a fast-prototyping JIT environment? An extension of extern function calls or of generators should be able to achieve this.

Building a Halide standard library

There should be a set of Halide functions people can just call or include in their programs (e.g., image resampling, FFT, winograd convolution). The longstanding issue to solve is that it's hard to compose the scheduling.

Fast prototyping

How can we make fast prototyping of algorithms in Halide easier? JIT is great for getting started, but not all platforms support it (e.g. iOS), and the step from JIT to AOT is large, in terms of what the code looks like syntactically, what the API is, and what the mental model is.
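
To make the gap concrete, here is a hedged sketch of the same trivial pipeline in both styles (names illustrative; the AOT version additionally needs GenGen build rules and a differently shaped call site):

```cpp
#include "Halide.h"
using namespace Halide;

// JIT style: define, compile, and run in one process.
Buffer<float> jit_style(Buffer<float> input) {
    Var x, y;
    Func brighter;
    brighter(x, y) = input(x, y) * 1.5f;
    return brighter.realize({input.width(), input.height()});
}

// AOT style: the same one-liner becomes a Generator, built by GenGen into a
// library, and the call site becomes a plain C function over raw buffers.
class Brighter : public Generator<Brighter> {
public:
    Input<Buffer<float>> input{"input", 2};
    Output<Buffer<float>> brighter{"brighter", 2};
    void generate() {
        Var x, y;
        brighter(x, y) = input(x, y) * 1.5f;
    }
};
HALIDE_REGISTER_GENERATOR(Brighter, brighter)
```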

Consider typical deep learning/numerical computation workflows (PyTorch, NumPy, Matlab, etc.). A user fires up an interpreter, manipulates and visualizes their data, experiments with different computation models, prints out intermediate values of their program to understand the data and debug, reruns the program multiple times on different inputs, and iterates.

Unfortunately, the current Halide workflow does not fit this very well, even with the Python frontend.

  • JIT caches are cleared every time the program instance is terminated. Even if the Halide program has not changed, rerunning the program (with different parameters or inputs) forces Halide to recompile everything. This has become a major bottleneck for fast iteration of ideas.
  • Printing intermediate values of Halide programs for debugging and visualization is painful. Either you use the cumbersome print() (and recompile the program) or you add the intermediate Halide function to the output (and recompile the program).
  • Halide's metaprogramming interface makes it less usable in a (Jupyter) notebook environment.

Two immediate work items:

  • Have an option for JIT compilation to save the result to disk and load it back automatically when cached. Related to the serialization effort.
  • Have an interpreter for Halide (or equivalently an "eager mode", c.f. TensorFlow) that defaults to some slow schedule (e.g., compute root everything with basic parallelization); a sketch of such a default follows.
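
A hedged sketch of the kind of default schedule such an eager mode might apply (`stages` is a hypothetical list of the pipeline's pure Funcs; update stages omitted):

```cpp
#include "Halide.h"
#include <vector>
using namespace Halide;

// Compute_root everything, with basic parallelism and vectorization.
void apply_slow_default_schedule(std::vector<Func> &stages) {
    for (Func &f : stages) {
        f.compute_root();
        if (f.args().empty()) continue;  // scalar Func: nothing to split
        f.vectorize(f.args().front(), 8, TailStrategy::GuardWithIf);
        if (f.args().size() >= 2) {
            f.parallel(f.args().back());
        }
    }
}
```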

GPU features

We should be able to place Funcs in texture memory and use texture sampling units to access them.

This is particularly relevant on mobile GPUs where you can't otherwise get things to use the texture cache. It's also necessary to interop with other frameworks that use texture memory (e.g. coreML).

An API to perform filtered texture sampling is needed. Ideally this would work in a cross-platform way, even if not blazingly fast everywhere; being able to validate on CPUs is very useful. There are some design issues having to do with the scope and cost of the sampler object allocations that many GPU APIs require.

Currently this has been low priority because we don't have examples where texture sampling matters a lot. Even for cases where it obviously should (e.g. bilateral guided upsample), it doesn't seem to matter much.

A good first step is supporting texture sampling on CUDA, because it doesn't require changing the way in which the original buffer is written to or allocated. An independent first step would be supporting texture memory on some GPU API without supporting filtered texture sampling. These two things can be done orthogonally.

Past issues on this topic: #1021, #1866

We should support tensor instructions.

We have support for dot product instructions on arm and ptx via within-vector reductions. The next task is nested vectorization #4873. After that we'll need to do some backend work to recognize the right set of multi-dimensional vector reductions that map to tensor cores. A relevant paper on the topic is: https://dl.acm.org/doi/10.1145/3378678.3391880
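
For readers following along, a hedged sketch of the existing within-vector reduction pattern (illustrative Funcs; whether a backend maps it to a dot-product instruction depends on types and target):

```cpp
#include "Halide.h"
using namespace Halide;

// Vectorizing the reduction dimension yields a horizontal (VectorReduce) op;
// with the right widening types (e.g. u8 * u8 accumulated into u32), backends
// can pattern-match this to instructions such as ARM's udot.
Func dot_product(Func a, Func b) {
    Var x;
    RDom r(0, 16);
    Func dot("dot");
    dot(x) = cast<uint32_t>(0);
    dot(x) += cast<uint32_t>(a(16 * x + r)) * cast<uint32_t>(b(16 * x + r));
    dot.update().vectorize(r, 16);
    return dot;
}
```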

New CPU features

ARM SVE support

Machine learning use-cases

We should be able to compile generators to tensorflow and coreML custom ops.

We can currently do this for pytorch (see apps/HelloPytorch), but it's not particularly discoverable.

We should have fully-scheduled examples of a few neural networks. We have resnet50, but it's still unscheduled.

Targeting MLIR is worth consideration as well.

This is likely a poor match, because most MLIR flavors operate at a higher level of abstraction than Halide (operations on tensors rather than loops around scalar computation).

Autoschedulers

There's lots of work to do before autoschedulers are truly useful. A list of tasks:

  • We need to figure out how to provide stable autoschedulers that work with Halide master to serve as baselines for academic work while at the same time being able to improve autoschedulers over time.

  • There needs to be a tutorial on using standalone autoschedulers, including autotuning modes for those that can autotune.

  • We need to figure out how to include them in distributions

  • There should be a hello-world autoscheduler that serves as a guide for writing a custom one.

  • There should be a one-click solution for all sorts of autoscheduling scenarios (pretraining, autotuning, heuristics-based, etc.).

  • For several autoschedulers, the generated schedules may or may not work for image sizes smaller than the provided estimates. This is lousy, because autoschedulers should be usable by people who don't understand the scheduling language and don't know to fix tail strategies.

  • loop-unroll failure in Autoscheduler #4271

Things we can deprecate

  • arm-32 (probably still need this)
  • x86 without sse4.1
@abadams (Member Author) commented Jun 19, 2020

Assigning a bunch of people who seem like they may want to contribute to top-level planning.

@jrk (Member) commented Jun 19, 2020

GPU support: I think there's a bigger, higher-level architectural issue: the memory management and runtime models for GPUs/accelerators feel broken and insufficient. We should consider rethinking them significantly to allow clearer and more explicit, predictable control (as we have on CPUs with a single unified memory space).

@jrk (Member) commented Jun 19, 2020

Modules / Libraries / reusable code / abstraction

@jrk (Member) commented Jun 19, 2020

Build system: Should we explicitly break apart build system issues for Halide development and build system issues for Halide users? I think these are mostly quite distinct and probably should be separate top-level headings.

@jrk (Member) commented Jun 19, 2020

Better accessibility and support for research within/on the Halide code base

@slomp (Contributor) commented Jun 19, 2020

Build Halide without LLVM: useful for GPU JIT and IR manipulation.

@alexreinking added the discussion label on Jun 19, 2020
@abadams (Member Author) commented Jun 19, 2020

GPU support: I think there's a bigger, higher-level architectural issue: the memory management and runtime models for GPUs/accelerators feel broken and insufficient. We should consider rethinking them significantly to allow clearer and more explicit, predictable control (as we have on CPUs with a single unified memory space).

Have you been following the store_in stuff? You just explicitly place Funcs in the memory type you want now.
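
A minimal store_in sketch, with illustrative Funcs and tile sizes:

```cpp
#include "Halide.h"
using namespace Halide;

// Stage a producer into explicitly chosen GPU shared memory within each block.
void schedule_gpu(Func blur_x, Func blur_y) {
    Var x, y, xo, yo, xi, yi;
    blur_y.gpu_tile(x, y, xo, yo, xi, yi, 16, 16);
    blur_x.compute_at(blur_y, xo)
          .store_in(MemoryType::GPUShared)
          .gpu_threads(x, y);
}
```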

@alexreinking (Member) commented Jun 19, 2020

We shouldn't assume companies have functioning build tools
Some companies build projects using a mix of duct tape and glue in a platform-varying way. Any configuration that goes into the build system is very painful for them (e.g. GeneratorParams for generator variants).

I agree with @jrk that we should distinguish build issues that affect Halide developers versus users. Generator aliases fix the GeneratorParams thing a bit, but they aren't very discoverable and aren't covered in the tutorials AFAIK. See #4054 and #3677

Large numbers of binaries (e.g. one generator binary per generator) can also be painful (e.g. in Visual Studio).

I'm not sure why this is painful in Visual Studio? Just because of how many times GenGen.cpp gets built? We could fix that by optimizing GenGen for build time. Windows users who wish to build Halide from source should use CMake. If they want to use binary releases without our CMake rules, then they're on their own. We shouldn't pay off their technical debt for nothing in return.

We should make GenGen.cpp capable of taking over some of the role of the build system (e.g. caching) to help out these users.

Properly caching Halide outputs is complicated. Outputs are a function of the Halide version (we don't currently version Halide), the autoscheduler version (if used; we also don't currently version our autoschedulers), the algorithm and schedule (can these be consistently hashed?), and the generator parameters. It's not clear to me how often this is a benefit in incremental build scenarios. If your source files changed, then typically so has your pipeline.

In CI scenarios, Halide versioning becomes more important since users would otherwise run into cache invalidation issues every time they update. Between builds, there could be some wins here, but they could also implement their own caching system by hashing the source files, Halide git commit hash, and generator parameters.

Our buildbots aren't keeping up and require too much manual maintenance
Our buildbots are overloaded and have increasingly out-of-date hardware in them. We need to figure out how to increase capacity without requiring excessive manual management of them.

Maybe one of the companies that wants Halide to work around their build system should pay for hardware and hire a full-time DevOps specialist. They could get all our buildbots configured with Ansible and set up all the Docker/virtual machine images we'd need. Failing that, they could foot the bill for a cloud-based CI service that has GPU and Android support.


Versioning and Releasing Halide

We should start versioning Halide and getting on a steady (quarterly?) release schedule. We could start at v0.1.0 so we don't imply any API stability per semantic versioning -- only v1 and above implies API stability within a major version. This would allow us to publish Halide on vcpkg / pip / APT PPAs / etc.

@jrk (Member) commented Jun 19, 2020

@abadams:

Have you been following the store_in stuff? You just explicitly place Funcs in the memory type you want now.

I was partly thinking of the heavily dynamic, lazy runtime aspects. When does memory get allocated and freed? When do copies happen? Most things in Halide are pretty static and eager, and explicitly controlled via schedules; GPU runtime behavior inherently includes a bunch of dynamic and lazy behavior, which is not clearly controlled by the schedule.

Imagine now having multiple GPUs or different accelerators in a machine. I should be able to use schedules to decompose computation across multiple GPUs, reason about and control explicit data movement between them, etc.

@abadams (Member Author) commented Jun 19, 2020

We shouldn't assume companies have functioning build tools
Some companies build projects using a mix of duct tape and glue in a platform-varying way. Any configuration that goes into the build system is very painful for them (e.g. GeneratorParams for generator variants).

I agree with @jrk that we should distinguish build issues that affect Halide developers versus users. Generator aliases fix the GeneratorParams thing a bit, but they aren't very discoverable and aren't covered in the tutorials AFAIK. See #4054 and #3677

Large numbers of binaries (e.g. one generator binary per generator) can also be painful (e.g. in Visual Studio).

I'm not sure why this is painful in Visual Studio? Just because of how many times GenGen.cpp gets built? We could fix that by optimizing GenGen for build time. Windows users who wish to build Halide from source should use CMake. If they want to use binary releases without our CMake rules, then they're on their own. We shouldn't pay off their technical debt for nothing in return.

It's painful in Visual Studio because the actual GUI stops working right once you have more than a certain number of binary targets. Shoaib says it just stops showing them, so you have no way to access them.

"People should just use X build system" is a non-solution. Halide is being used in large products that already have build systems, and it must exist within them. Punting on solving this problem entirely means that the current experience of using Halide in a company other than Google is 80% build system nightmare and 20% writing code. If we want Halide to be a useful tool, it must be able to integrate into existing build systems cleanly.

We should make GenGen.cpp capable of taking over some of the role of the build system (e.g. caching) to help out these users.

Properly caching Halide outputs is complicated. Outputs are a function of the Halide version (we don't currently version Halide), the autoscheduler version (if used; we also don't currently version our autoschedulers), the algorithm and schedule (can these be consistently hashed?), and the generator parameters. It's not clear to me how often this is a benefit in incremental build scenarios. If your source files changed, then typically so has your pipeline.

The particular problem I've seen is that people work around the number-of-binaries issue by packing all of their generators into a single binary, but then a naive dependency analysis thinks that editing any source file requires rerunning every generator (and there may be hundreds). We might be able to help. Telling people to just fix their damn build system is obviously an attractive attitude, but that's also asking them to pay down a large amount of technical debt before they can start using Halide. The outcome is that they don't use Halide.

In CI scenarios, Halide versioning becomes more important since users would otherwise run into cache invalidation issues every time they update. Between builds, there could be some wins here, but they could also implement their own caching system by hashing the source files, Halide git commit hash, and generator parameters.

Probably better to do the hashing correctly in one place upstream than have lots of incorrectly-implemented hashing schemes downstream. We're the ones who know how to hash an algorithm/schedule/Halide version correctly.

But the caching idea was just an example of how we can make life easier for people by making it possible to do some of the things that should really be happening in the build system in C++ instead/as well, so that people can use Halide without taking on the possibly-intractable task of fixing their build system first.

Our buildbots aren't keeping up and require too much manual maintenance
Our buildbots are overloaded and have increasingly out-of-date hardware in them. We need to figure out how to increase capacity without requiring excessive manual management of them.

Maybe one of the companies that wants Halide to work around their build system should pay for hardware and hire a full-time DevOps specialist. They could get all our buildbots configured with Ansible and set up all the Docker/virtual machine images we'd need. Failing that, they could foot the bill for a cloud-based CI service that has GPU and Android support.

Versioning and Releasing Halide

We should start versioning Halide and getting on a steady (quarterly?) release schedule. We could start at v0.1.0 so we don't imply any API stability per semantic versioning -- only v1 and above implies API stability within a major version. This would allow us to publish Halide on vcpkg / pip / APT PPAs / etc.

@abadams (Member Author) commented Jun 20, 2020

@abadams:

Have you been following the store_in stuff? You just explicitly place Funcs in the memory type you want now.

I was partly thinking of the heavily dynamic, lazy runtime aspects. When does memory get allocated and freed? When do copies happen? Most things in Halide are pretty static and eager, and explicitly controlled via schedules; GPU runtime behavior inherently includes a bunch of dynamic and lazy behavior, which is not clearly controlled by the schedule.

Imagine now having multiple GPUs or different accelerators in a machine. I should be able to use schedules to decompose computation across multiple GPUs, reason about and control explicit data movement between them, etc.

Generally agree, but wanted to add that you can explicitly schedule the copies using Func::copy_to_device and friends if you don't want them done lazily. If you use that, often no dirty bits come into play. The input to the Func lives only on the CPU, and the output lives only on the GPU.
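
A hedged sketch of that explicit-copy pattern (Func names illustrative): interpose a wrapper Func that is a pure copy, and schedule it as an eager host-to-device transfer instead of a lazy, dirty-bit-driven one.

```cpp
#include "Halide.h"
using namespace Halide;

void stage_input(Func input, Func consumer) {
    Func staged = input.in(consumer);  // wrapper between input and consumer
    staged.copy_to_device(DeviceAPI::CUDA).compute_root();
}
```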

@BachiLi (Contributor) commented Jun 20, 2020

I edited the top post to add:

  • A few high-level philosophical questions
  • More features that need tutorials e.g., compute_with, autodiff, and nesting generators (is it called "stub"?)
  • Documentation for the developers
  • Packages, releases
  • Code reuse/modularity
  • Fast prototyping

@alexreinking (Member) commented:

It's painful in Visual Studio because the actual GUI stops working right once you have more than a certain number of binary targets. Shoaib says it just stops showing them, so you have no way to access them.

I don't understand why this is our problem as opposed to Visual Studio's. Isn't this something their customers would complain about? I've seen this happen in Visual Studio myself, but Googling for the issue doesn't turn up much. I'll bet that if Adobe, no doubt paying hundreds of thousands of dollars for Visual Studio licenses, complained loudly enough, it could get fixed.

"People should just use X build system" is a non-solution. Halide is being used in large products that already have build systems, and it must exist within them. Punting on solving this problem entirely means that the current experience of using Halide in a company other than Google is 80% build system nightmare and 20% writing code. If we want Halide to be a useful tool, it must be able to integrate into existing build systems cleanly.

From offline discussion, it sounds like incremental building with a unified generator binary is our worst end-user story. Still, shouldering the maintenance burden for every hand-rolled, proprietary build system is also a non-solution. If we implement caching, even opt-in, we'll have to test it, keep its behavior stable, and deal with the whole can of worms that opens up.

Even our first-party CMake doesn't get it perfect because it can't assume one-generator-per-file. It can't generically establish a mapping between source files and generator invocations. We should look into Ninja depfiles for more precise dependencies in CMake.

@jrk (Member) commented Jun 21, 2020

I don't understand why this is our problem as opposed to Visual Studio's. Isn't this something their customers would complain about? I've seen this happen in Visual Studio myself, but Googling for the issue doesn't turn up much. I'll bet that if Adobe, no doubt paying hundreds of thousands of dollars for Visual Studio licenses, complained loudly enough, it could get fixed.

It may seem dumb, but it is a reality that many real users face today, and the alternatives are:

  • Don't use Halide
  • Somehow work around it

Even if Microsoft should fix it (and I think it is not likely that they could or would on any reasonable time scale), the only thing in our power to do is help support working around it. If we don't, we're effectively just shutting out some of our highest-impact potential users.

@steven-johnson (Contributor) commented:

My grab-bag of thoughts:

  • Figure out useful guidance for practical Halide-on-GPU usage and update the code/examples/docs accordingly. For instance: by far the most common question about GPU usage that I get is "how do I use Halide for mobile GPUs?" This boils down to "On iOS, use Metal; on Android, ¯\_(ツ)_/¯ (because of hardware fragmentation, crappy driver support, OpenGL being mostly useless for Halide, etc.)". It may well be that we'd be better off advising Android developers to focus on non-GPU solutions, but we'd be better off communicating that more up-front than we do currently.

  • There is an awful lot of useful/necessary information on how to effectively code in Halide that isn't written down anywhere, and has been passed around by word of mouth (e.g., the use of .stmt files to iterate on schedule development). We have to fix this. Andrew's suggestion of recording a walkthrough of this is a good start, but it will really need to get converted to text form as well.

  • I'd call our buildbot setup a disaster, but that would be an insult to actual disasters. IMHO we should really try to move as much of our testing as possible into some cloud-based solution, with local hardware (eg for GPU testing) added as needed, but the bar for improvement is low.

  • On the topic of versioning and releases, I agree with what's been said before, but will go further and suggest that we consider planned release schedules (with bugfix updates), as most other projects do; some orgs may want to continue to just track the trunk branch (as Google has done), but others can stabilize on specific versions for longer term, without having to worry as much about subtle API or behavior issues. (This would make documentation more tractable too, since we'd be able to say "this applies to Halide 3.x" or whatnot.) Of course, this means that we might need a way to decide what features should be targeted to go into any current branch, and what release schedule we'd use (eg quarterly?) etc, but that seems like a good thing to me. (The implied stability of versioning would be especially beneficial for autoscheduler adoption, as it would be much more tractable to promise that autogenerated schedules would be stable within a particular release.)

  • Autoscheduler infrastructure. For it to be well accepted, autoscheduler(s) need to not just provide good schedules, they have to be easy to integrate into an existing build setup; currently it requires either a lot of manual work (e.g. manual copy-paste of text schedules, etc) or a lot of trust in the stability of the autoscheduler (i.e. that Halide updates won't regress your schedule). Not sure what the right solution for improving this is.

  • Lower the barrier for experimenting with Halide code. Currently, the quick way to try adding Halide code to an existing project is to use the JIT... unless you are running on a system where this won't work (eg iOS). Then, if it does prove profitable, you probably need to rework your code to use AOT compilation, which has a very different API and build surface (wrap it in a Generator, move it to a separate file, add some build rules as needed, make a completely different looking set of calls). Is there a way we could make it easy to add code using JIT (or feels-like-JIT) that could be transitioned to AOT more easily? Would using the Python bindings (or a bespoke Halide 'language') instead of C++ make this any easier?

  • Runtime code model. The current model is nice mainly in that it allows for bringing up simple builds easily, but it breaks down quickly for many real-world apps (that require lots of customizations) or on systems without weak linkage (e.g Windows). It also makes it harder than necessary to realize what is a public bit of the API. IMHO we should really consider moving to a runtime model that avoids weak linkage entirely, and instead uses something like a customizable-via-struct-of-pointers approach; this would also allow us to rationalize the user-context stuff and to normalize the runtime API between AOT and JIT, but would be (eventually) a breaking API change. (Yes, I have spent some time thinking about such a design; hopefully I'll get the time to finish an actual proposal someday...)

@alexreinking (Member) commented:

  • On the topic of versioning and releases, I agree with what's been said before, but will go further [...] but that seems like a good thing to me.

👍 Fully agree here. Having a version is basically a prerequisite for inclusion into package managers, too. Having a stable API also means that shared distributions of Halide can be upgraded independently of applications, which is important if we hope FOSS will adopt us.

  • [...] try to move as much of our testing as possible into some cloud-based solution, with local hardware (eg for GPU testing) added as needed, but the bar for improvement is low.

👍 AppVeyor seems to have a reasonable set-up that allows for a mix of self-hosted (for special hardware) and cloud-hosted instances. Also, we should try to convince one or more of the multi-billion-dollar companies that employ our developers and benefit from our work to donate computing resources for this purpose.

  • Runtime code model. The current model [...] breaks down [...] on systems without weak linkage (e.g Windows). IMHO we should really consider moving to a runtime model that avoids weak linkage entirely [...]

I agree. Weak linkage and dynamic lookup into executable exports are super cool... if you're writing Linux software. Unfortunately, since they aren't standard C/C++, they're inherently non-portable and aren't modeled by CMake, so they require hacks for the supported platforms and don't work on Windows. Plus, dynamic lookup breaks a fundamental assumption about static linkage, namely that other modules won't be affected by changes to statically linked libraries. This doesn't just affect the runtime, but the plugins/autoschedulers, too. We're already planning to refactor the autoschedulers out of apps. While we're at it, we should make the interface accept a pointer to a structure in the parent process that it can populate, rather than trying to find the structure via dynamic lookup.

It also makes it harder than necessary to realize what is a public bit of the API.

See also #4651 -- as we discuss versioning, we should also discuss symbol export, since they're inter-related. At the very least, we should investigate whether -fvisibility-inlines-hidden matters in terms of binary size.

@alexreinking (Member) commented Jun 23, 2020

  • There is an awful lot of useful/necessary information on how to effectively code in Halide [...] that will really need to get converted to text form as well.

I think both @BachiLi and I have put some thought into writing Halide tutorials. I think it would be a good idea to merge our efforts 🙂

  • Lower the barrier for experimenting with Halide code. [...] you probably need to rework your code to use AOT compilation, which has a very different API and build surface

Fortunately, this is now pretty easy to do if you're using our CMake build 😉

[...] Is there a way we could make it easy to add code using JIT (or feels-like-JIT) that could be transitioned to AOT more easily?

An export API that would generate C++ code representing a Halide pipeline and schedule would be cool... but I'm not sure it would be more useful than our existing compile_to_file API. The build story would still be pretty bad.

Would using the Python bindings (or a bespoke Halide 'language') instead of C++ make this any easier?

I'm torn on the idea of having an external Halide syntax. There are some clear benefits... it would become easier to write tests, to write analysis tools, to metaprogram (maybe), provide more helpful compiler diagnostics, integrate with the Compiler Explorer, etc. But on the other hand, maybe it would just be a high-maintenance dunsel.

@abadams (Member Author) commented Jun 23, 2020

Porting JIT code to AOT is much bigger than just build system issues. All of a sudden it's staged compilation. E.g. things that were constants like memory layout and image size are now unknown.

@slomp (Contributor) commented Jul 1, 2020

Is there a way we could make it easy to add code using JIT (or feels-like-JIT) that could be transitioned to AOT more easily?

That would be a nice addition. I've been experimenting with compile_jit() with Param placeholders, then rebinding the Params later before calling realize() -- it's cumbersome; I think we could come up with something more intuitive. For example, compile_jit() could return a "callable" Pipeline which provides a function-call operator () to pass the parameters directly.
In addition, it would be nice if we could just instantiate a Generator class and call compile_jit() on it.
These changes would bridge the gap between JIT and AoT workflows and would ease development a lot.
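
For reference, the cumbersome flow described above looks roughly like this (names illustrative):

```cpp
// Compile once with Param placeholders, then rebind and realize again
// without recompiling.
#include "Halide.h"
using namespace Halide;

int main() {
    Param<float> gain("gain");
    ImageParam input(Float(32), 2, "input");
    Var x, y;
    Func f;
    f(x, y) = input(x, y) * gain;
    f.compile_jit();                 // compile once, up front

    Buffer<float> in1(64, 64), in2(64, 64);
    input.set(in1);
    gain.set(2.0f);
    Buffer<float> out1 = f.realize({64, 64});

    input.set(in2);                  // rebind and rerun; no recompile
    gain.set(0.5f);
    Buffer<float> out2 = f.realize({64, 64});
    return 0;
}
```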

On a side note, speaking of compile_to_file, we need a better way to request a dump of generated shader code. Having to use state-of-the-"ark" environment variables like HL_DEBUG_CODEGEN to do that is just... aarghh.

@abadams (Member Author) commented Jul 1, 2020

You can just instantiate a generator and call it via JIT. Generator instances have a "realize" method you can call directly, or a get_pipeline() method that gives you a Pipeline object just like when you're jitting code.
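
A hedged sketch of that path (MyGen is a hypothetical generator defined in this process; exact create/set_inputs spellings have varied across Halide versions):

```cpp
// Instantiate the generator against a JIT target, then realize it directly.
auto gen = MyGen::create(Halide::GeneratorContext(
    Halide::get_jit_target_from_environment()));
gen->set_inputs(input_buffer);
Halide::Buffer<float> out = gen->realize({640, 480});

// Or extract the Pipeline and use the ordinary JIT surface:
Halide::Pipeline p = gen->get_pipeline();
```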

@dsharletg (Contributor) commented:

On a side note, speaking of compile_to_file, we need a better way to request a dump of generated shader code. Having to use state-of-the-"ark" environment variables like HL_DEBUG_CODEGEN to do that is just... aarghh.

In general, I'd love to see a cleaner workflow for handling (inspecting, modifying, etc.) the "fat binaries" we produce.

Right now, we have a lot of targets that generate some code for a host target, and an "offload" target. This includes GPUs, OpenCL, Hexagon, etc. Currently, these are designed to produce single object files with the offloaded code embedded in them somehow, which is great for convenience and dependency management.

However, inspecting these embedded objects, or even modifying them (e.g. signing Hexagon code), is hard, and requires inspecting object files or using hooks/callbacks (mostly implemented with environment variables like HL_DEBUG_CODEGEN or HL_HEXAGON_CODE_SIGNER).

We took some small steps here, like Module, but that only partially solved the problem for the in-memory part of Halide. It would be great to think of some way to improve this. The things I have thought a lot about are:

  • A tempting way to go is to just have good tooling for working with objects/shared objects as if they were more like archives/tarballs/zip files, such that this tooling is designed to be invoked during builds. If we could have good reliable tools to extract/update global variables in object files as if they were archives, that would help with this. Object files are hard to edit though (changing the size of an embedded object is a mess).

  • Another option is compiling objects to folders. Building a fat binary pipeline would become a two-stage process: "compile_to_folder" followed by "link_folder", and inspection/modification steps could be added in between. This would actually be really tempting if it were possible to simply link a file as embedded data with a given symbol name in a standardized way. But folders introduce a lot of room for new headaches: tooling that doesn't understand folders, having to track down dependencies, and there really isn't a good way to link objects and data together in a standard way. We'd be creating a new build-system challenge for Halide users to solve.

@alexreinking (Member) commented:

I would say that any approach that does not involve inspecting an object file is inherently better than any approach that does. It's simply not portable and we try to support a variety of compilers and platforms.

@abadams (Member Author) commented Jul 1, 2020

For non-hexagon backends, .stmt files should capture the IR, and assembly output should capture the generated machine code in human readable form. Hexagon is compiled earlier in lowering, so it's tricky. We should find some way to carry along the higher-level representations of it. For other shader backends it's an escaped string constant, so it's there, but it looks like:

\tld.param.u32 \t%r11, [kernel_output_s0_v1_v1___block_id_y_2_param_8];\n\tld.param.u32 \t%r12, [kernel_output_s0_v1_v1___block_id_y_2_param_14];\n\tadd.s32 \t%r13, %r12, -8;\n\tld.param.u32 \t%r14, [kernel_output_s0_v1_v1___block_id_y_2_param_9];\n\tld.param.u32 \t%r15, [kernel_output_s0_v1_v1___block_id_y_2_param_10];\n\tmin.s32 \t%r16, %r10, %r13;\n\tld.param.u32 \t%r17, [kernel_output_s0_v1_v1___block_id_y_2_param_11];\n\tshl.b32 \t%r18, %r5, 4;\n\tld.param.u32 \t%r19, [kernel_output_s0_v1_v1___block_id_y_2_param_12];\n\tld.param.u32 \t%r20, [kernel_output_s0_v1_v1___block_id_y_2_param_15];\n\tadd.s32 \t%r21, %r20, -16;\n\tld.param.u32 \t%r22, [kernel_output_s0_v1_v1___block_id_y_2_param_13];\n\tmin.s32 \t%r23, %r18, %r21;\n\tadd.s32 \t%r24, %r11, -1;\n\tld.param.u32 \t%r25, [kernel_output_s0_v1_v1___block_id_y_2_param_16];\n\tadd.s32 \t%r26, %r16, %r7;\n\tld.param.u32 \t%r27, [kernel_output_s0_v1_v1___block_id_y_2_param_17];\n\tadd.s32 \t%r28, %r26, %r19;\n\tsetp.lt.s32 \t%p1, %r28, %r11;\n\tselp.b32 \t%r29, %r28, %r24, %p1;\n\tmax.s32 \t%r30, %r29, %r27;\n\tadd.s32 \t%r31, %r23, %r9;\n\tadd.s32 \t%r32, %r31, %r22;\n\tadd.s32 \t%r33, %r8, -1;\n\tmin.s32 \t%r34, %r32, %r33;\n\tmax.s32 \t%r35, %r34, %r14;\n\tmad.lo.s32 \t%r36, %r26, %r20, %r31;\n\tmul.wide.s32 \t%rd7, %r36, 4;\n\tadd.s64 \t%rd8, %rd6, %rd7;\n\tld.global.nc.f32 \t%f1, [%rd8];\n\tmad.lo.s32 \t%r37, %r30, %r25, %r35;\n\tadd.s32 \t%r38, %r37, %r15;\n\tmul.wide.s32 \t%rd9, %r38, 2;\n\tadd.s64 \t%rd10, %rd3, %rd9;\n\tld.global.nc.u16 \t%rs1, [%rd10];\n\tcvt.rn.f32.u16 \t%f

Maybe we should add a shader_assembly generator output?

@abadams (Member Author) commented Jul 1, 2020

Meanwhile I believe standard practice is HL_DEBUG_CODEGEN=1.

@pranavb-ca (Contributor) commented:

For non-hexagon backends, .stmt files should capture the IR [...] Maybe we should add a shader_assembly generator output?

I would argue that even just outputting a separate .stmt file for the offloaded Hexagon part of the pipeline would be a significant benefit.

@pranavb-ca (Contributor) commented:

Autoscheduler

  • For pipelines that already have (what the programmer thinks is) a tight, hand-optimized schedule, I wonder if the autoscheduler/autotuner could accept that as a starting point in the search space.

Debugging

  • A --save-temps equivalent for examining the various levels of code generation (for instance, .stmt, .s, and .o if "-e o,h" is used); see the note below.
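
For partial coverage today, the generator's -e flag can already emit several of these artifacts in one invocation (a hedged example; the generator name is illustrative):

./my_pipeline.generator -g my_pipeline -e o,h,stmt,assembly -o . target=host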

@alexreinking (Member) commented:

We should start talking about converting this into actionable items and divvying up the work.

@benzwt commented Jul 7, 2020

I think the commit messages are too short and lacking in detail.
Commit messages with more detail would help beginners look under the hood with ease. Otherwise, the learning curve is too high.

@benzwt commented Jul 7, 2020

Halide issues are not maintained very well.
It seems that the Halide team is short-handed.
Maybe we should resolve issues as fast as possible and not let them pile up.
I'm very happy to organize the solution and commit it to the documentation as soon as an issue is solved.

@alexreinking (Member) commented Jul 7, 2020

Halide issues are not maintained very well. [...] Maybe, we should solve the issue as fast as possible and don't let it pile up.

Maybe we can start by closing all the issues that were opened more than, say, 6-12 months ago and never got a comment. That would take care of 169 issues.

We have many issues that are open and quite old.

Similar to this is the number of branches that are still on this repo that have been merged or are stale. See #4567.

Both of these issues make life harder for new collaborators ("Which issues/branches are important? Where do I get started?") and deter would-be new users ("This project's maintainers don't care / are overwhelmed. The project is buggy and/or unstable").

@benzwt commented Jul 8, 2020

The community around OpenCV is huge. Many of its users find that OpenCV is not fast enough in practice and are eager to know how well Halide can perform. However, most of them get stuck on the data-structure issue (Mat <-> halide_buffer_t) and eventually give up. Halide should at least provide an official guide for this.

I suggest the Halide team write some simple examples for OpenCV users; we have a bunch of good practical applications in apps, so porting shouldn't be an issue. (A sketch of the buffer bridge appears below.)
Hopefully, the size of the Halide team will grow after attracting more of the OpenCV audience.
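
Such a guide might start from the buffer bridge itself; a hedged sketch that wraps a continuous, interleaved 8-bit cv::Mat without copying:

```cpp
#include "HalideBuffer.h"
#include <opencv2/core.hpp>
#include <cassert>

// make_interleaved matches OpenCV's interleaved-channel (BGR) layout, so the
// Mat's pixel data can be viewed as a Halide buffer with no copy.
Halide::Runtime::Buffer<uint8_t> wrap_mat(cv::Mat &mat) {
    assert(mat.isContinuous() && mat.type() == CV_8UC3);
    return Halide::Runtime::Buffer<uint8_t>::make_interleaved(
        mat.data, mat.cols, mat.rows, mat.channels());
}
```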

@abadams (Member Author) commented Jul 8, 2020

This is a good discussion. I want to provide an alternative point of view for some things raised above:

Open source projects have to strike a careful balance between rigor/discipline and ease of contribution. Things like aggressively closing issues, deleting branches, mandating clean git histories, requiring very verbose commit messages, and having very strict CI checks have real benefits but also real costs - they can make working on Halide less fun and deter contributors. Very few of our core contributors get paid to specifically maintain Halide (just Steven, I think), so you can't tell people to eat their vegetables without risking them just working on Halide less. This even applies to me: One reason I left Google in 2017 was to get a break from the day-to-day maintenance and do more research. PRs from outsiders get abandoned because we pile on requirements that they have no incentive to address.

So far we've avoided being strict about things that don't slow down development or hurt our users in obvious ways. So I don't care much about deleting branches and closing old issues. Defunct branches aren't particularly visible. Issues can be searched rather than sorted, and we have plenty of issues tagged as good for first-time contributors. I also don't care much about semantic versioning, because it's not particularly useful to the main users that I'm aware of (compared to calendar or git-commit based versioning). I have no objection to these things, and would be appreciative of anyone who wants to take it on. I certainly appreciate that semantic versioning would help with package managers. Historically we've been much stricter about CI cleanliness than about these other issues, because not being strict about keeping tests working causes cascading issues that can stall development of unrelated work (i.e. it makes development less fun). clang-format/clang-tidy was added to the mix because the fixes are automatic and it removes an entire class of nits from code review, so while I was wary, I think it actually makes life easier for regular contributors and reviewers on balance.

Some of these attitudes come at the cost of growth. E.g. clang-tidy/format as part of CI instead of done periodically after-the-fact makes life easier for regular contributors and maintainers but is another roadblock to PRs from outsiders. But growth is not an end we should target for its own sake. I want Halide to be a great tool for high-performance image processing, and I want it to be a great platform for PL research. Those objectives don't necessarily require growth, and can be in conflict with it.

Regarding OpenCV specifically, an app or tutorial that demos integrating Halide with opencv would be very welcome.

@jrk (Member) commented Jul 10, 2020

We might consider completely gutting and restarting (or at least auditing for outdated and misleading content, and properly emphasizing what we most want to emphasize today) the top-level README.

@alexreinking (Member) commented Jul 10, 2020

@abadams I am attempting to address the following three points at the top-level:

  • How do we make Halide easier to use for new contributors?
  • How do we keep Halide maintainable over time?
  • How do we make Halide easier to use for researchers wanting to cannibalize it, extend it, or compare to it?

Open source projects have to strike a careful balance between rigor/discipline and ease of contribution. Things like aggressively closing issues, deleting branches, mandating clean git histories, requiring very verbose commit messages ...

I agree with you about the commit messages / git histories, but closing issues that are more than a year old and never got a comment is not "aggressive". We could attach a bot to do it automatically, but even doing it manually, I doubt it would take more than an hour. I'd guess we could decide whether to close such an issue in less than a minute, and there are only ~150.

Issues can be searched rather than sorted, and we have plenty of issues tagged as good for first-time contributors.

That assumes that the search results are relevant. If we merge a PR that fixes an issue, then that issue should be closed. There's no way to filter out issues that are open but shouldn't be, and there are a lot of them. This is a drag on new contributors and on our own maintenance. It's also an issue for our users who run into a bug in an old version, see the open issue, and conclude that the bug is still relevant.

[...] you can't tell people to eat their vegetables without risking them just working on Halide less. This even applies to me: [...]

Would closing an issue you've fixed really make working on Halide so much less fun and such a deterrent that you or another reasonable person would consider leaving the project? I would be surprised if that were true. It's not a burden to put "Fixes #issue" in the top-level PR comment.

We're not talking about going on a full-on raw diet here, just eating something green every once in a while.

PRs from outsiders get abandoned because we pile on requirements that they have no incentive to address.

This is confusing. Asking that an outsider's PR passes the tests and meets basic static analysis / formatting checks is (a) reasonable and widely expected, and (b) less costly than whatever incentivized them to put in the work and open the PR in the first place. PRs get abandoned because our testing infrastructure is unreasonably slow.

I also don't care much about semantic versioning, because it's not particularly useful to the main users that I'm aware of (compared to calendar or git-commit based versioning). I have no objection to these things, and would be appreciative of anyone who wants to take it on. I certainly appreciate that semantic versioning would help with package managers.

Using semantic versioning, which is well understood by tooling, makes it easier for other researchers to compare to Halide in a reproducible manner. Yes, they can reference the git commit hash, but nobody does this and those values aren't directly comparable. Semantic version numbers can be registered in a package repo and then particular versions can be quickly installed for comparison.

It also adds zero additional maintenance burden over our calendar-based versioning (which I believe is great for applications, but unsuitable for libraries) because we can just version as 0.x.0 where x increases monotonically. The major version 0 implies no API promises, so we don't even need to think about symbol export issues. It also solves the problem of

[figuring] out how to provide stable autoschedulers that work with Halide master to serve as baselines for academic work while at the same time being able to improve autoschedulers over time.


E.g. clang-tidy/format as part of CI instead of done periodically after-the-fact makes life easier for regular contributors and maintainers but is another roadblock to PRs from outsiders.

Clang-tidy/format is not a roadblock to PRs for outsiders. Just the opposite. It paves the road for such PRs and makes the process faster by streamlining code review. Less burden on both the reviewers and the PR author when we don't have to argue about where the space goes around * or whether override is needed.

Some of these attitudes come at the cost of growth. [...] But growth is not an end we should target for its own sake. I want Halide to be a great tool for high-performance image processing, and I want it to be a great platform for PL research. Those objectives don't necessarily require growth, and can be in conflict with it.

No one is saying that we need to aggressively scale. I don't see how my proposed actions:

Maybe we can start by closing all the issues that were opened more than, say. 6-12 months ago and never got a comment. [...]
[implied: we should clean up the] branches that are still on this repo that have been merged or are stale. See #4567.

are promoting growth "for its own sake" or are in conflict with making Halide a great tool and research platform.

@BachiLi (Contributor) commented Jul 10, 2020

Added more details regarding fast prototyping. Let me know if it is unclear. These are from our experience developing new (differentiable) image processing pipelines in Halide.

@abadams (Member Author) commented Jul 10, 2020

@alexreinking I don't have much new to add, but there's one thing that's important to respond to for the benefit of others reading this: The correct way to cite a version of Halide is with a git commit hash. That's the current standard practice I see in papers that reference Halide in a way where the version matters (e.g. things that benchmark against our manual schedules). Everyone reading this, please keep doing that.

The growth comments weren't a response to you. They were more a response to benzwt proposing actively trying to grow by attracting OpenCV users. While it's great to be as useful as possible to the largest number of users possible, I wanted to point out that we shouldn't get distracted by growth for its own sake as I've seen some projects do (which leads to things like over-promising).

@alexreinking (Member) commented:

The correct way to cite a version of Halide is with a git commit hash. That's the current standard practice I see in papers that reference Halide in a way where the version matters (e.g. things that benchmark against our manual schedules). Everyone reading this, please keep doing that.

Is this not a consequence of our slow release schedule and versioning scheme? I cannot think of another case where I have seen a git commit hash in a citation rather than a released version number (e.g. when comparing compilers, libraries, etc.). Ideally a version number would also tag a commit, so that the two are equivalent (and equally unambiguous). That is just as good, since you should change git tags about as often as you force-push.

@abadams (Member Author) commented Jul 10, 2020

Version numbers can't refer to branches, commits not associated with any release (e.g. just after a paper author reaches out to us about a bug), etc. I see both commit hashes and version numbers in papers for different projects. I prefer commit hashes.

@BachiLi (Contributor) commented Jul 18, 2020

We should revisit and complete https://github.com/halide/halide-app-template with standalone Makefiles/CMakefiles/VS solutions/XCode projects inside

@steven-johnson (Contributor) commented:

Another thing to make plans for: when do we want to upgrade the minimum C++ requirement? I would hope that we could consider moving to require C++17 as a baseline in the short-to-medium-term future.

@BachiLi (Contributor) commented Aug 11, 2020

An idea from an offline conversation: it would be nice to have a mode where bounds information is propagated forward from the inputs (similar to NumPy/Tensor Comprehensions). That would be more suitable for building deep learning architectures, linear algebra, etc.

@jrk (Member) commented Aug 11, 2020

Part of the motivation for @BachiLi's comment above is the crazy list of estimates needed for autoscheduling resnet50: https://github.com/halide/Halide/blob/standalone_autoscheduler_gpu/apps/resnet_50_blockwise/Resnet50BlockGenerator.cpp#L286

These are all inferable using simple forward propagation from the inputs, in the style of NumPy or any ML framework.
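
For contrast with the forward-propagation idea, the status quo is manual per-dimension estimates, roughly as follows (shapes illustrative; the hypothetical output has x, y, channel, and batch dimensions):

```cpp
#include "Halide.h"
using namespace Halide;

// Every output dimension gets a manual {min, extent} estimate before
// autoscheduling.
void give_estimates(Func output) {
    output.set_estimates({{0, 224}, {0, 224}, {0, 64}, {0, 16}});
}
```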

@steven-johnson (Contributor) commented:

This is a somewhat minor issue, but I think we should consider making it a policy to prefer squash-and-merge (when possible) when merging Pull Requests; the individual commit history within a PR is useful for people reviewing the PR, but it really clogs up the history of the main branch with small commits. I can't think of a good reason for all the interstitial commits to be preserved for eternity.

@alexreinking (Member) commented:

This is a somewhat minor issue, but I think we should consider making it a policy to prefer squash-and-merge (when possible) when merging Pull Requests; the individual commit history within a PR is useful for people reviewing the PR, but it really clogs up the history of the main branch with small commits. I can't think of a good reason for all the interstitial commits to be preserved for eternity.

Relatedly, we should codify this in the standard .github/CONTRIBUTING.md file, along with anything else we want contributors to do (e.g. run clang-format).

@mpettit2253 commented:

I'd appreciate the group's take on Modular/Mojo with regard to the Halide roadmap. After listening to Lex Fridman's long interview with Chris Lattner on the subject, it seems that Mojo's goals include Halide's, especially the autoscheduling part.
Mojo is committing to a superset of Python, since Python won the ML iterative-programming contest; C++ still wins the ML deployment contest. I keep trying to get my customers' management and programming teams interested in pursuing Halide for our very large-scale computer vision and ML work, but it isn't working.

Thanks for any input. It seems that the community that has developed Halide all these years is a natural group to converge with efforts in the ML community to get at the same functionality from either a C++ foundation or a Python foundation.

@FabianSchuetze (Contributor) commented:

Would it be possible to show source locations for errors produced with generator invocations? Consider, for example, the following error:

➜  build_linux git:(main) ✗ ./conv_layer.generator -g conv_layer -e stmt_html -o . target=host                      
Unhandled exception: Error: Func "filter_im" was called with 4 arguments, but was defined with 3

Would it be possible to highlight the source location in the conv_layer_generator.cpp file?

I think highlighting source locations would facilitate working with Halide, especially when addressing bugs during code generation. I looked for a related ticket but didn't find one.
