Use ASCII to Unicode tables for Infix operators #567
Conversation
Force-pushed from e7abf88 to 4b1467c
@mmatera - this fills in some of the ideas from a while ago where there was confusion over formatting output. To start out, this PR handles only Infix operators; Prefix and Postfix operators will be needed. Also lacking is the level of flexibility that works via a Mathics setting. I think these are straightforward to add.
|
@rocky, thanks for working on this. I will try to review it tomorrow. |
Force-pushed from 105e3e8 to 96661ea
@rocky, I was looking at this and I am not very convinced by this approach. Encoding seems to be something at the level of the conversion of a box expression to a string (the "boxes_to_string" functions), in order to have encoding-independent evaluation. An encoding-dependent evaluation could produce different results on different platforms. For example, two different strings on one platform could compare equal on another, because two different characters could have the same "printable" representation. Also, such an approach could make debugging harder. |
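To illustrate the distinction, here is a minimal sketch of encoding applied only at the final boxes-to-string step, so that evaluation and boxing stay encoding-independent. The table contents and the function name are illustrative assumptions, not the actual Mathics API:

```python
# Hypothetical replacement table: Unicode operator glyphs and their
# ASCII fallbacks. A real table would come from mathics-scanner.
UNICODE_TO_ASCII = {
    "\u2192": "->",  # rightwards arrow, as in Rule
    "\u00d7": "*",   # multiplication sign
}


def encode_for_output(text: str, encoding: str) -> str:
    """Apply character replacement only when the target encoding
    cannot represent the character; evaluation never sees this."""
    if encoding.lower() in ("utf-8", "utf8"):
        return text  # Unicode passes through unchanged
    return "".join(UNICODE_TO_ASCII.get(ch, ch) for ch in text)
```

Because the replacement happens only after boxing, two strings that merely share a printable representation can never compare equal during evaluation.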
Yes, this is what I was referring to when I wrote:
Added was a Mathics built-in function called |
I'd like to clarify my purpose in doing this PR draft. The main goal here was to start steering work back to the area where it belongs. Historically, there was initially a very wrong-minded idea that search-and-replace of character strings on MathML output was going to fix up operators. And we still have horrible methods that allow this. After that, there was the idea that somehow scanning or parsing needs to be changed, or that a massive revision of all operators is needed to add some sort of WL-specific Unicode symbol. There is no doubt some way to turn this into a scanning and parsing problem by first writing something out as a string and then reading that back in. We see something like this going on currently, I think, in one of the
The point of this PR draft was to show that, no, none of this is needed. The goal here was to reset things back to a boxing and formatting output problem. And now is the time when one of us should be primarily dealing with Boxing and formatting.
When it comes to the details of the Boxing rules used or where this hooks in, that's not my area of expertise. I think what should be done is to see what Boxing rules WMA uses, and see what interfaces it has in support of this. This is what I was asking about previously. We should be able to see all of this at some level in a running Mathematica, which I do not have access to. We have to keep in mind, though, that WMA may have some unexposed function that acts like
But if there is a corresponding function, I would suspect it uses the operator name, not Unicode with custom WL characters. Right now, in the JSON emitter of the mathics-scanner repository, we have a character-symbol-to-Unicode translation spit out. That could easily be adapted to start out with just the operator name by stripping off the surrounding brackets in the output conversion program.
Understanding the process by which WMA emits MakeBoxes (which is probably what it does), or the thing that corresponds to
So again, for my part, what this PR does is just direct focus back to the area where this kind of thing belongs. |
Completely agree with this. In this way, this PR is very positive.
Regarding this, I spent some time doing experiments to check how this works in WMA. However, this particular part turned out to be quite hard to hack into in order to see the internal behavior. I have some notes, but I didn't have time to write them up into something communicable...
Actually, it seems to have it: WMA has tables (written in WL) that provide the replacement at the level of characters. This is one of the reasons I think the character replacement must be done at the level of the
Maybe this would be useful: in .m files, all the strings are translated to the ASCII representation, using escaped NamedCharacters. The same happens if you pass a string with special characters
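As a sketch of that translation direction (escaping non-ASCII characters to named-character form on output), assuming a tiny illustrative name table rather than the real NamedCharacters list:

```python
# Hypothetical subset of the character-to-name table; the real list
# of NamedCharacters is far larger.
UNICODE_TO_NAME = {
    "\u03b1": "Alpha",
    "\u2146": "DifferentialD",
}


def to_ascii_escapes(text: str) -> str:
    """Render a string using only ASCII, escaping special characters
    as \\[Name], the way .m files store them."""
    return "".join(
        "\\[%s]" % UNICODE_TO_NAME[ch] if ch in UNICODE_TO_NAME else ch
        for ch in text
    )
```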
Agree with it too. |
@rocky, also notice that locally, I found problems with
because the tests do not pass. Did you check that? |
Here is what I am seeing:
If this is what you are seeing too, this is a problem in packaging from a prior commit. I don't immediately see what the problem is, but a workaround is to copy the file
If you are seeing something different, please attach more details, such as a traceback, as above. |
This is definitely possible. I don't have a strong objection, but here are the downsides: it means we'd have to maintain tables in two places, which causes a data problem. If there is a discrepancy between the two, which one is right? The answer is whichever is used, but that is probably not a great answer. Also, this could slow loading down: a JSON load using ujson has to be faster than interpreting
In favor of
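For what it's worth, keeping the table as JSON and loading it with ujson when available need not complicate the code much. This is a sketch with an illustrative loader name, falling back to the standard-library json module:

```python
# Prefer the faster C parser if it is installed; the stdlib json
# module is a drop-in fallback for this usage.
try:
    import ujson as json
except ImportError:
    import json


def load_operator_table(path: str) -> dict:
    """Load an operator-name-to-character table from a JSON file
    (e.g. one emitted by mathics-scanner)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```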
Thanks for your patience and understanding. At heart you are a great person. |
I made the copy and am currently running a "make doc". That reminds me of one other thing that we have been glossing over and that we will have to deal with in the future. Take for example "DifferentialD" - right now we do not have an
In other words, when we have a pure-ASCII system and no Unicode, we should be printing an ASCII "d". This means that on an ASCII system you can't take output and feed it back in as input: here, the "d" will get confused. But that is the nature of ASCII anyway. You can't cut and paste the output of
and expect that to be valid input either. It is possible that there are a lot of operators that are not in the character tables yet. I don't recall if I finished the pass over that for these. |
The problem is that
(in the terminal interface, the default format is Notice that in OutputForm,
This is why I wonder if it makes sense to use the specific table for converting operators in the output. It would be simpler to translate them just as special characters in strings. Where operators would be important is in the input, when a query is tokenized. Still, I wonder if we shouldn't encode the query in a canonical encoding before parsing it. |
No, the problem is this one: the reference output is not encoded. For example,
here, |
Right now after parsing the representation is FullForm. I don't see any problem with that. Just the reverse. It is the clearest and most straightforward representation. It is also how it is conventionally done in all compilers and interpreters. |
The flag
What is used in straight testing without doc generation is to set
I didn't go that route because I thought that in docs we would want the Unicode symbols. However, I see that what we really want is possibly a new kind of form which outputs more correct
Possible remedies are:
For now, this is a large enough problem in and of itself that I don't think we should take it on until the other ones have been settled and are working. We don't have a solid basis (and never had a solid basis) for handling this properly. So for now, ignoring errors, which I thought was happening (and that is what I see), setting
I am not talking about the |
I don't see how this is convenient. Everything works off of the FullForm parse tree on input and M-expressions on output, and I don't see a problem with that; rather, it is a good thing. (BTW, Microsoft BASIC used character-encoded opcode names internally; it might have copied this behavior from other BASIC implementations. However, it was the only language that I know of that did this. APL (and APL/2) was like this, but that was different in that it started out assuming IBM APL-specific keyboards, so there really was no translation done at all.) |
./admin-tools/make-op-tables.sh builds the JSON tables. Also, the environment variable MATHICS_CHARACTER_ENCODING can be used to set SYSTEM_CHARACTER_ENCODING. Some adjustments to tests with DifferentialD were made so we use standard Unicode symbols, not WMA Unicode. There was a bug in gstest from a prior commit in mathics-scanner where the "pre-scanner()" was replacing strings starting with "|".
Windows workflow: CI op-tables build in document production. Test expectations fail if this isn't done, and then the output does not appear.
Force-pushed from 96661ea to babfc6b
I was thinking about it because of the problem of maintaining the "operators" table; I was wondering where that table is useful.
(with the named characters replaced by the canonical Unicode character). Then, for showing it in the front end, if "\[DifferentialD]" comes from an operator in |
Yes, that is correct. Mathics-scanner has |
OK. Then, suppose that the expression is
which produces the boxes
RowBox[{"\[Integral]", RowBox[{"F", "[", "\[Alpha]", "]"}], RowBox[{"\[DifferentialD]", "\[Alpha]"}]}]
but |
"named-characters" would work I think. Or a table adapted along those lines. |
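To make the idea concrete, here is a small sketch of walking a RowBox-like nested structure and replacing \[Name] escapes in string leaves via such a table. The inlined table is a tiny illustrative subset of what mathics-scanner provides:

```python
import re

# Illustrative subset of a named-character-to-Unicode table.
NAMED_CHARACTERS = {
    "Integral": "\u222b",
    "Alpha": "\u03b1",
    "DifferentialD": "\u2146",
}

NAMED_RE = re.compile(r"\\\[(\w+)\]")


def replace_named(leaf: str) -> str:
    """Replace \\[Name] escapes; unknown names are left untouched."""
    return NAMED_RE.sub(
        lambda m: NAMED_CHARACTERS.get(m.group(1), m.group(0)), leaf
    )


def walk_boxes(boxes):
    """Recurse through nested lists (standing in for RowBox contents),
    translating only the string leaves."""
    if isinstance(boxes, str):
        return replace_named(boxes)
    return [walk_boxes(b) for b in boxes]
```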
Then, my questions are:
|
I am sorry to report that I believe #541 is thinking about things in the old wrong way. Or that is my reading, not having access to a working Mathematica. The WL doc for ToString says in part:
My reading of this is that whenever an (output) form is involved, either you start out with parsed input (a parse tree, or an M-expression) or, if you start with a string, then this implies scanning and parsing the string. Unless I missed something, I don't see that happening in #541. I suggest seeing what the behavior of ToString is when given invalid input. For example, what is the output of
In general, if our code is using |
OK, I realized that there is something in that approach that you do not like. What I am trying to understand is what exactly it is.
Let me try to clarify my attempt. What I did there is to focus the task of producing output with a certain character encoding AFTER all the formatting (boxing) work is done. This is why the only place where I really needed to touch the code is a single routine in
Probably I did wrong by also changing the
The rest of the changes are tests and a mechanism to set the default encoding. This last part probably is better implemented in #567 than in #541. My questions are then focused on the first part (the change in mathics.format.text). So, for some reason, there is something so wrong with my approach that you think that instead of changing a line, it would be better to change several formatting routines, to apply the encoding earlier, even at the cost that, if in the end the evaluation requires another encoding, let's say,
Now, regarding your question: for these three cases, the output is exactly the same as the input: a String with a value consisting of ASCII characters. Differences come up if you pass non-ASCII characters. For example, in WMA,
Also, if you copy the character
Notice that
Probably the main problem is that I do not understand why you think that |
The idea of taking strings and doing a string-like search and replace for operators does not appear to be a concept embraced in WMA. It is like saying: I put 2 and 2 together and get 22; what's wrong with that? Well, if you are talking about strings and string concatenation, that makes sense. However, when you are talking about mathematics it makes little sense and shows a lack of understanding of standard addition. Another analogy might be trying to solve a Rubik's cube by getting all faces the same color face by face, rather than understanding that there is a permutation group which contains subgroups that don't divide up that way. In short, you are going against the grain of the philosophy of WMA which, while it might not be documented from the standpoint of an interpreter developer, does follow standard interpreter principles and has a certain logic to it.
And you could say the same thing about putting 2 and 2 together to make 22, or solving a Rubik's cube by making each face match. You can also say that for someone who doesn't understand addition or groups and subgroups this is a "natural" or "necessary" first step. For you maybe, but I'd ask you to recognize your weaknesses and just leave them for me then. And when I am weak in an area like the exact details of the behavior of specific functions or what transformation rules WMA uses, I'll leave that for you or others.
The purpose of these tests was to dig deeper into the semantics of
When expr is not parsable, the behavior can either be to leave the string alone, or to throw some sort of error. And this, too, isn't all that different from evaluating input. https://reference.wolfram.com/language/tutorial/TextualInputAndOutput.html#8971 says:
If we are short-circuiting this process in ToString by going from string to string and short-circuiting the normal way an expr gets converted to a Form, then this doesn't follow the above. Continuing:
So, to repeat: the implementation in that PR doesn't follow this approach.
Yes, but there always is a Form involved in |
But, most important, the formatting process (Format/MakeBoxes) does not take into account the
First, some strings and boxes are produced assuming the default encoding (UTF-8):
Then I set another encoding, and compare the output of the prebuilt string and boxes, against the boxes I make after setting the encoding:
Now,
|
No, it doesn't (see below). And more importantly,
In
As I mentioned before,
As for |
No, this PR is not about
Now, focusing on the subject of encoding, what I tried to show with the example is that
|
Sure, I got that. I repeat:
and from before:
And this needs to be revised or made more specific: what we are seeing is that what is needed is:
This has been, and will hang out as, a draft until we have a better handle on the above. Once that happens, either this can be closed, or revised, since a lot of it is about the mechanics needed to interface with some mathics-scanner table that assists the formatting part. |
OK, but as I tried to show before, this behavior is not compatible with WMA. So, even if your approach is better than the one in WMA, it could bring issues when we try to load packages like CellsToTeX (one of the main motivations that I have in the background for all of this). So, whatever the right implementation would be, I have (and have tried to show) several reasons to keep
|
Forget that I wrote:
which was a generalization based on how the existing Mathics code works. Pretend that I had written instead:
Are you saying that this kind of approach doesn't work? |
Ok, let's keep this:
Almost. What I am saying is that whatever makes boxes_to_text work will also make ToString work. So,
|
@mmatera Good - I think we now have a better understanding of things. I have been thinking about how to divide and separate code organization and the division of work more. (Too often we are working on the same kinds of things.) I am going to redo #541, which is what I had said I'd do a while ago. The focus there, though, is going to shift to handling the kinds of calls that involve a non-string expr, and after getting something minimal there, it can expand to exprs with different kinds of Forms. My current feeling around writing any new builtin nowadays is to have a clearer separation between the top-level built-in function and its implementation after function arguments are converted. We do this somewhat already for things that in the
I believe that the
Higher-level design and implementation is not your strong point; finding edge cases and testing against a running WMA are your strong points. So after I have a first rough cut of this, feel free to dig in, find flaws, and fix them. However, the overall structure will likely be correct, or at least not grossly incorrect. Example: the difference between an S-expression and an M-expression is slight compared to not understanding what an S-expression is, why it is important, and how it is used. There are a number of unfinished things that had been started and not finished. As I recall, #565 was close to being done, and then there was that detour of writing some code to reformat the entire codebase. (All that is needed right here is to address the format inconsistency of that one new class.) And there are a number of examples in |
Yes, the problem is that many parts are still very entangled. A good workflow example was the part of
That would be great. To do that, notice that for most of the basic cases,
I fully agree with this. This is one of the reasons that makes me avoid using time to implement new Builtins.
Seems a good plan.
Actually, I had been oscillating for some time between both approaches (using ToString to format expressions, or using MakeBoxes to produce the result of ToString). At this point, it seems that
Sure. All my previous proposals are just in the direction of reproducing the WMA behavior in general situations. But probably you are in a better position to propose a good implementation. Please just take my code as a sketch (let's say, a Play-Doh model) of how the whole thing should connect an input with an output.
Yes, that was my plan. However, to make them work I need to do a rework on how format_element / MakeBoxes works. I am working on a "Play Doh" model, but it is not finished yet. |
The Infix rule should not be adding formatting; do this in MakeBox evaluation. As a result, we now use operator-to-{ascii,unicode}; operator-to-{unicode-wl} still needs adding. Some hard newlines in docstrings on Builtins have been removed, since formatting doesn't handle that. Some methods and classes have been better alphabetized. $CharacterEncoding now respects the MATHICS_CHARACTER_ENCODING environment variable.
mathics/builtin/makeboxes.py: correct the boxing rules for infix operators started previously.
files.py, image.py: OpenRead and Import examples now need to specify the CharacterEncoding.
base.py: We start to remove formatting decisions inside MakeBox rules, in particular for Infix. For Integrate, we've noted the problem.
calculus.py: note improper MakeBox rules; add symbols which might be used in the future to start to do this correctly.
symbols.py: use fn as the more proper way to set a short name.
test_format.py: revert StandardForm and TraditionalForm formatting for DifferentialD. This is something we should address in the future.
Infix formatting
We get different results on different systems. Right now we don't have proper formatting in place for output with DifferentialD in it. So, rather than rigidly test for the presence of something wrong, we skip these tests for now.
I had a hard time understanding the flow of what's going on, and this process was both painful and made me a bit annoyed.
I was pleased, though, at @mmatera's start at moving the makeboxes code into its own module; we have a little bit of segregation of boxing routines now, so I suppose that helps a little.
It is not that the code isn't sophisticated or complicated. It is just that it is not organized and changes haven't been coordinated.
Let me describe a little of the history around this PR and the code I see, and reflect. The aim of this reflection is to show how the code has become of poorer quality, is in constant need of serious refactoring, and can get worse with uncoordinated bug fixes or feature improvements. Functionally the code does quite a bit, but at the cost of lots of code with undue complexity. I assume that much of the complicated code is there to ensure correct behavior.
However, where we find that the code is not correct or lacking, we have a hard time figuring out where to go and what to change that isn't going to impact a lot of other things.
mmatera notes a problem with infix operator behavior in the ascii-op-to-Unicode branch. I had sort of seen this when I wrote that branch initially. One thing that I noticed was wrong in my PR is that the place where the change occurs felt wrong: it is code inside a "MakeBox" rule. But this rule is defined on class creation, when registering each of the infix operators, and these rules are never changed. Something like $CharacterEncoding can change at will, so this location is too static.
Therefore I had to give up (temporarily) on making this work for $CharacterEncoding and used the less dynamic $SystemCharacterEncoding instead.
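The split between the two settings can be sketched like this; the module-level variable names are illustrative, and only MATHICS_CHARACTER_ENCODING comes from this PR:

```python
import os
import sys

# Fixed at startup from the environment: plays the role of the
# static $SystemCharacterEncoding.
SYSTEM_CHARACTER_ENCODING = os.environ.get(
    "MATHICS_CHARACTER_ENCODING", sys.getdefaultencoding()
)

# Rebindable at run time: plays the role of $CharacterEncoding,
# starting out equal to the system value.
character_encoding = SYSTEM_CHARACTER_ENCODING
```

A MakeBox rule baked in at class-registration time can only ever see the startup value; anything that honors the rebindable setting has to consult it at formatting time instead.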
Looking over this in a second pass, I see now that a rule like this is boneheaded:
formatted_output = 'MakeBoxes[Infix[{%s}," %s ",%d,%s], form]' % (
replace_items,
operator,
self.precedence,
self.grouping,
)
default_rules = {
...
"MakeBoxes[{0}, form:InputForm|OutputForm]".format(op_pattern):
formatted_output
}
Do you see why? ...
Hint: it is in the " %s " part.
This is supposed to be a rule that is called when boxing infix operators. And already it is deciding that operators should be surrounded by spaces for "InputForm" and "OutputForm".
In other words, right here in the part which dictates rules to Box Infix expressions, we are already making low-level formatting decisions: that there should be spaces around the operator, and what characters to use to represent the operator.
And when we actually get to the low-level formatting into a final string, we have lost structure. Basically, we have violated the principle that you evaluate to get an M-expression, then you Box the result, and only after Boxing is low-level formatting done.
If we want to write dozens more forms, the above isn't scalable. And we make high-quality formatting harder.
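The separation argued for here can be sketched as follows: the box rule records only the operator's name and structure, and the final formatter chooses the glyph and the spacing per target form. All names here are illustrative, not the actual Mathics API:

```python
# Hypothetical operator-name-to-glyph table, one column per charset.
OPERATOR_GLYPHS = {
    "Plus": {"ascii": "+", "unicode": "+"},
    "Rule": {"ascii": "->", "unicode": "\u2192"},
}


def box_infix(operands, operator_name):
    """Boxing keeps structure only: no spaces, no glyph choice yet."""
    return ("InfixBox", operator_name, operands)


def format_infix(box, charset="unicode", pad=True):
    """Low-level formatting: only here do we pick characters and
    decide whether the form calls for spaces around the operator."""
    _, op_name, operands = box
    glyph = OPERATOR_GLYPHS[op_name][charset]
    sep = " %s " % glyph if pad else glyph
    return sep.join(operands)
```

With this split, supporting a new Form or charset is a change to the formatter alone; the box rules never need to be touched.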
At some level this was probably noticed, at least implicitly; but when there are so few comments it is hard to know what was noticed and what was just reactive programming in response to problems that came up.
Since everything can't be done by MakeBox rules, we have do_format_xxx routines in mathics-core/mathics/core/formatter.py, as well as special routines in builtin/arithfns/basic.py and builtin/pympler/asizeof.py, and some that are done inside MakeBox() internal functions.
In short, if principles were decided or defined, they were not communicated anywhere that I can tell; so, naturally, formatting code is scattered in many places in the codebase, at several conceptual levels. Possibly some of the code is redundant or, worse, works at cross purposes.
If discussion of high-level principles was lacking, so is basic description of the code: things like docstrings on functions, or making an effort to name what a function does. For example, take get_op(), an internal function inside the evaluation method apply_infix() (leaving aside the missing apply_ prefix that I have mentioned many times before).
What is get_op() "getting"? You already pass it some sort of operator. If you look at the function, you realize it is not getting or accessing something, but rather converting or formatting.
And once you realize that, you are in a position to ask: why is this routine formatting inside a Boxing routine? In the overall architecture, isn't that a separate step? Or should the low-level formatting routines be put together?
Vagueness in the code and lack of description and discussion around it have led to very haphazard code that, in the end, does not seem fully thought out.
With all the effort spent just figuring out what the code does, by the time I understand it, I am exhausted and often not in a mood to think about how it should be designed, whether it follows a design, or how best to write it.
Let me come back to adding spaces around the operator and losing the operator structure (the " %s " part) one more time. Because of this, a "remedy" was put in place that makes things worse: a routine was added that scans strings and unconditionally changes some strings (e.g. ASCII-formatted operators) into Unicode characters. It doesn't matter whether this was done out of ignorance or willful frustration: in the end, it makes the code a bigger mess.
It is, as I say, like trying to solve a Rubik's cube by getting adjacent faces in line one at a time. In the beginning there is a certain satisfaction because you can "make progress". However, getting to the end this way is much harder than understanding what is going on and performing operations that work in conjunction with the principles and groups of the Rubik's cube.
Given all this, my inclination right now is to hold off on adding new Forms, or on correcting the existing ones to be more correct. Instead, if we just make things work as they do now, but in a logical, sensible, and extensible way, I think this would buy us the most toward then correcting the existing behavior to match the specifications more closely and adding more Forms.