New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Option to deterministically generate unique ids instead of randomized animal names #72
Conversation
Changes look great to me. Moreover, they seem to have solved the issue of repeating names in ONNX. 👍 . @MikeInnes Can you merge these changes so that we can continue its usage with ONNX? |
I would prefer this was just a new function (maybe I'd also like to see a test case for the repeated names if you have one; I haven't seen that happening before. I do think it makes sense to make the names a bit more stable. |
Sure, a completely new function would be fine, maybe call it gensyms_to_ids? I wonder, though, if it might be best just to eliminate the animal names entirely and use the generated identifiers - the animal names approach is going to cause problems when there is a large number of gensyms (greater than the number of animal names on animals.txt) and it's easy enough just to generate new unique identifiers via a counter or just eliminate the '#''s from the gensym symbol as you convert to string, though, I suppose the advantage to generating a unique id with a counter is that you'd be able to easily see the progression of identifier names and usage (the numbers on the current gensyms seem to skip around). |
…cally generated identifiers
Updated pull request; added gensyms_to_ids function which replaces gensyms with deterministically generated identifiers. Also: checking with Ayush about the repeated names problem; I wasn't seeing it in ONNX.jl just now prior to replacing the call to alias_gensyms with gensyms_to_ids. However I do remember seeing generated code with a duplicated animal name at some point. |
@MikeInnes A useful test case can be the results while generating the ONNX model file. A sample generated file is:
As you can see, this code is wrong. (Notice the working of |
What's your version/commit of macrotools there? |
Cloned the latest version today for testing purposes. |
Ok, can you send a version that breaks but printed without |
Generated the file again, but without
|
One problem I see is that you end up with two 'bears' in your list of animal symbols because animals.txt has entries for both "polar bear" and "bear". In the init function in src/MacroTools.jl the "polar bear" entry gets split into two separate entries (in this code):
There is already a separate bear entry in animals.txt so you end up with duplicate bears. I tried it out in the REPL (prior to the shuffle):
So 'polar' and 'bear' were split into separate entries, but there was also the previous bear entry:
The same problem exists for "seal" because there are both "elepant seal" and "seal" entries in animals.txt:
(there are also "panda" and "giant panda" entries in animals.txt, so "panda" is a duplicate as well. The random shuffle explains the intermitant nature of this problem. Probably best to eliminate spaces in the names in animals.txt. |
Ah, great catch, thanks. I'll fix that up and get this merged. |
@MikeInnes I think there should be an option to use generated tokens, rather than animal names. This will solve two problems:
Any thoughts? |
Isn't that what I just merged? |
There are a couple of problems with the current animal name scheme used in alias_gensyms:
Base.@get!(syms, x, pop!(left))
(left is a copy of the animals symbol list)
This patch generates a simple unique id in the form of "sym_id_$counter" where the counter value is incremented on each gensym encountered.
I would have preferred replacing all of the "#" characters in the gensym with blanks, however that did not work for me even though I tried the same scheme in the REPL on a gensym string and it did work (see the comment I added in alias_gensyms - maybe someone could get that working?)