
Change precompilation format to be more amenable to copying #32705

Closed
wants to merge 1 commit

Conversation

@timholy timholy commented Jul 26, 2019

This is a hackathon project by @vtjnash (🧠) and myself (💪). This is partway towards the end goal (with the harder thinking still to go).

As I understand it, the problem is that inferred CodeInfo objects are serialized by putting all objects into the method's roots table, and then each specialization gets stored by referring to these objects via their index. There is one roots table for all the specializations, and this has an important consequence: when you compile for new input types, what happens frequently is that you need new objects added to roots, and this changes the numbering of items in roots. Consequently the lookup is state-dependent ("which specializations do you happen to have compiled?"), and thus we can't merge this info into other things. To avoid such problems, currently we refuse to save inference info for methods for which additional specializations would change the roots table. Since that's a large fraction of all methods, that means in practice we save relatively little. This PR is designed to eliminate such confusion by ensuring that each CodeInstance gets its own private roots table.
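The order dependence described above is easy to model. The following Python sketch is a toy model, not Julia's actual serializer: `intern_root` and the object names are made up for illustration. It shows how interning objects into a single shared, append-only roots table assigns the same object different indices depending on which specialization is compiled first:

```python
# Toy model of a method-wide roots table: objects are interned on first
# use, and serialized IR refers to them by integer index.
def intern_root(roots, obj):
    """Return the index of obj in roots, appending it if new."""
    if obj not in roots:
        roots.append(obj)
    return roots.index(obj)

# Session 1: specialization A is compiled first, then B.
roots1 = []
idx_a1 = intern_root(roots1, "objA")   # index 0
idx_b1 = intern_root(roots1, "objB")   # index 1

# Session 2: same method, but B happens to be compiled first.
roots2 = []
idx_b2 = intern_root(roots2, "objB")   # index 0
idx_a2 = intern_root(roots2, "objA")   # index 1

# The same object gets a different index depending on compile order,
# so a serialized index from one session is meaningless in another.
assert idx_a1 != idx_a2
```

Giving each CodeInstance its own private roots table, as this PR does, removes the shared mutable state that makes the indices session-dependent.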

In the current state, we have successfully added the new field and gotten Julia through bootstrap. If this were relatively neutral in terms of system-image size and build time, it could presumably be merged at any point despite not yet doing anything useful. However, the .data field of the .so is 125MB vs. the expected 118MB, suggesting it should wait for more improvement.

This is designed to eliminate confusion in the serialization by ensuring
that offsets are relative to a "private" roots table. This may allow
more extensive caching of inference results, because it should eliminate
root-indexing conflicts between different instances of the same method.
timholy commented Jul 30, 2019

Having learned a bunch while doing this, I now have to ask: would a better approach be to go back to a single roots table for the method, but switch to an ihtable_t (indexable hash table) that maintains keys in a list sorted by order of entry into the hash table? That is to say, hash(obj)=>index=>obj, where the last stage is looked up by objects[index]. Whenever you add a new item not previously in the hash table you just push onto the end of objects. The advantage seems to be that it would preserve the validity of previously computed indexes while also allowing new entries to be added. Consequently if you have the "master" roots list in your running session and load a .ji file with its own roots list, you can effectively do append!(masterroots, pkgroots) and add just the missing items to masterroots. When you load the package's inferred IR you also have to update the indexes to those of masterroots, but while running that append! you can also compute a reindex vector and use that for translation.
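The proposed indexable hash table and the append!-style merge can be sketched as follows. This is a Python model of the idea, not the C ihtable_t; the class and function names are hypothetical:

```python
# Sketch of an append-only, insertion-ordered "indexable hash table":
# hash(obj) => index => obj, with indices stable because entries are
# never deleted. Merging a package's roots into the master list also
# yields the reindex vector used to translate the package's stored
# indices into master indices.
class IndexableTable:
    def __init__(self):
        self.objects = []          # index => obj, in insertion order
        self.lookup = {}           # obj => index

    def intern(self, obj):
        """Return obj's index, appending to the end if it is new."""
        if obj not in self.lookup:
            self.lookup[obj] = len(self.objects)
            self.objects.append(obj)
        return self.lookup[obj]

def merge_roots(master, pkg_objects):
    """Append any missing package roots to master; return the reindex
    vector mapping package-local index -> master index."""
    return [master.intern(obj) for obj in pkg_objects]

master = IndexableTable()
for obj in ("sin", "cos"):         # roots already present in the session
    master.intern(obj)

# A .ji file carries its own ordered roots list ("cos" already known):
reindex = merge_roots(master, ["cos", "tan"])
assert master.objects == ["sin", "cos", "tan"]
assert reindex == [1, 2]   # pkg index 0 -> master 1, pkg index 1 -> master 2
```

Previously assigned master indices are never disturbed by the merge, which is the stability property the comment above is after.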

Ordering is distinct from indexing, and our canonical Julia OrderedSet doesn't support indexing. But that's because it has decided to support deletion, and indexes (if we want them to always increase by 1) are not stable under deletion. But this is a case where we never want to delete anything, in which case indexing seems justifiable.

There would be an extra layer of indirection for each lookup, but you'd only hit that when serializing (not deserializing). In any case, it's my understanding that this is not the performance bottleneck for loading & compiling code.

However, maybe this isn't any better: is the fundamental problem that the indexing is still order-dependent? That is, if we first execute setindex! with a key of type A and then with a key of type B, that will give different ordering than if the two are compiled in the opposite order?

vtjnash commented Aug 16, 2019

cross-ref #14556

@timholy timholy added needs decision A decision on this change is needed status:DO NOT MERGE Do not merge this PR! kind:speculative Whether the change will be implemented is speculative labels Sep 19, 2019
timholy commented Sep 19, 2019

I added a couple of tags. What's holding this up is the question: is this even going in the right direction, and if not would the strategy in #32705 (comment) be more viable? (I tend to suspect the answers to these two questions are "no" and "yes," respectively, but need guidance, since the effort is not small and my time & expertise on this problem are both limited.) I am willing to keep pushing on this but from a pragmatic standpoint I will need a collaborator, or at least a reviewer making suggestions about next steps, who knows the compile chain better than I do.

timholy commented Sep 20, 2019

Bump @JeffBezanson. At JuliaCon you raised some concerns about the approach Jameson and I took here. Might #32705 (comment) be a better strategy?

erezsh commented Sep 20, 2019

Hi, most of the technical details here are still over my head, but I just wanted to ask about the general architecture of this solution.

Have you considered implementing this with a write-ahead log? That is, appending every new signature to a list (i.e. appending to the end of the file), so the entries can be replayed on the next load. I imagine it would be reasonably efficient, and much easier to implement than a dynamically updated file.

A second, unifying pass can occur at any desired interval, so its performance won't matter. It's optional, but would result in even faster modules.

This approach might even help for dealing with multiple processes of the same module, in some distant future (You'll have to forgive me, I'm not even sure how Julia does multiprocessing today).

timholy commented Sep 20, 2019

A lot of it is unfamiliar to me too. It's less a question of how you write the file, and more about what you put into it. How do you represent code in binary format? Pointers are obviously off the table, so you have to come up with a serialization, which is the subject of src/dump.c. IIUC, serialization currently uses an integer index to refer to objects stored in a method's roots table. Try this gist (instructions are in the comment near the top) and you'll see how the order in which code gets compiled can alter the roots list. Consequently the integer index that we use to serialize doesn't have an invariant meaning.

Having put this together, I'm not sure #32705 (comment) is a panacea. The original strategy suggested by Jameson is more workable, although the "merge" step will be harder, and the sheer amount of duplicated information is troubling.

Now I'm wondering about saving the roots table for each method in a .ji file, comparing it against the master copy, and renumbering things as you deserialize the file. #32705 (comment) might be useful to prevent problems from rehashing, though.

erezsh commented Sep 20, 2019

Consequently the integer index that we use to serialize doesn't have an invariant meaning.

Why is that? If you append only, the old definitions don't lose their meaning. And it's possible to "merge" new definitions by translating their indices (to avoid duplicates).

I still don't understand where the root of the problem is. When Julia compiles a new method, it adds the definition somewhere. If it were added to roots, we could just rewrite the whole file. But if not, why can't we take that definition and serialize/deserialize it on its own?

timholy commented Sep 21, 2019

Most things are possible, it's just a question of doing it 😄. The strategies you're proposing about merging are the same ones I'm proposing. Whip out that editor and go for it!

bjarthur commented Aug 1, 2020

Do I understand correctly that this PR hopes to make it possible to cache the native code compiled in a session so that it can subsequently be re-used in a new session, thereby avoiding the JIT lag? That would be huge. Kind of like an on-the-fly PackageCompiler, but without modifying the sysimage?

timholy commented Aug 1, 2020

It's not specifically aimed at native code; it's more about being able to pick up some stragglers in our current precompilation.

However, with the invalidation work having recently achieved quite a lot of success, the biggest obstacle to working on native-code caching is basically gone.

timholy commented Sep 1, 2020

Friendly bump; now, with so many other improvements to latency, I'm seeing lots of examples where inference time is a major part of getting real work done. Here's an example of profiling ProfileView on its first run (with dat, lidict = Profile.retrieve(); Profile.clear(); @profile ProfileView(dat; lidict=lidict)):

[screenshot: profile flame graph showing four blocks of activity]

First block (gray-dominated) is inference, second (yellow-dominated) is LLVM, third is mixed (including some computation), and fourth is borked pointers. My best result with @compile_options is to set optlevel=1 in Gtk, GtkReactive, and ProfileView (turning off inference seems to make things worse), which got the first @time ProfileView.view() down from about 2s to 1.8s. That is nice, but given that inference on a single call like ProfileView.view() takes ballpark 1s, getting rid of that would be a much bigger prize.

I've added a ton of precompile statements but many of them don't "take," presumably because of the reason described in the top post.

It sure would be some nice icing on the latency cake if we could make progress on this. It's more than I'm prepared to handle solo, but I'm happy to help as I can.

timholy added a commit that referenced this pull request Dec 16, 2020
For some reason (perhaps #32705?) most or all of these fail if they
are emitted as precompile statements, so this moves them into Base itself.

This drops the time for a revision down to 1.85s.
KristofferC pushed a commit that referenced this pull request Dec 17, 2020
(cherry picked from commit cc1623b)
staticfloat pushed a commit that referenced this pull request Jan 15, 2021
(cherry picked from commit cc1623b)
@vtjnash vtjnash closed this Apr 13, 2021
@vtjnash vtjnash deleted the teh/precompile branch April 13, 2021 17:45
vtjnash commented Apr 13, 2021

This is probably in the right direction, but it needs more work and is heavily conflicted now. Maybe next hackathon we'll be able to make more progress.

ElOceanografo pushed a commit to ElOceanografo/julia that referenced this pull request May 4, 2021
@DilumAluthge DilumAluthge removed the status:DO NOT MERGE Do not merge this PR! label Jun 18, 2021
timholy commented Aug 22, 2021

What about changing the format of the roots table to something like push!(roots, uuid=>list_of_pkg_roots)? That would seem to make it possible to share a single roots table without being vulnerable to package-loading order or incurring the cost of objid computation (uuid = PkgId.uuid).
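A roots table partitioned by package UUID can be modeled as below. This is a Python sketch of the idea only; the `intern` function, reference shape, and UUID strings are illustrative, not Julia's API:

```python
# Sketch of a roots table partitioned by package UUID: a serialized
# reference becomes a (uuid, local_index) pair, so indices within one
# package's sub-list never depend on which other packages were loaded,
# or in what order.
from collections import OrderedDict

roots = OrderedDict()              # uuid => list_of_pkg_roots

def intern(uuid, obj):
    """Intern obj under its owning package; return (uuid, index)."""
    pkg_roots = roots.setdefault(uuid, [])
    if obj not in pkg_roots:
        pkg_roots.append(obj)
    return (uuid, pkg_roots.index(obj))

# The order in which packages contribute roots no longer affects the
# identity of a reference, only the order of the top-level sub-lists:
ref1 = intern("uuid-PkgA", "objX")
ref2 = intern("uuid-PkgB", "objY")
assert intern("uuid-PkgA", "objX") == ref1 == ("uuid-PkgA", 0)
```

The trade-off is that every stored reference grows from one integer to a (uuid, index) pair, in exchange for load-order independence.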

@HiperMaximus

Bump

staticfloat pushed a commit that referenced this pull request Dec 23, 2022
(cherry picked from commit cc1623b)