Use ASCII to Unicode tables for Infix operators #567

Merged
merged 10 commits into from Oct 11, 2022

Conversation

rocky
Member

@rocky rocky commented Oct 1, 2022

./admin-tools/make-op-tables.sh builds the JSON tables.

Also, the environment variable MATHICS_CHARACTER_ENCODING can be used to set $SystemCharacterEncoding and the initial value of $CharacterEncoding.
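A minimal sketch of how an environment variable could seed the encoding setting. The helper name and the "UTF-8" fallback are assumptions made for illustration; they are not taken from mathics-core.

```python
import os


def initial_character_encoding(environ=os.environ, default="UTF-8"):
    """Return the encoding name that would seed $SystemCharacterEncoding.

    Hypothetical helper: the fallback value is an assumption.
    """
    value = environ.get("MATHICS_CHARACTER_ENCODING", "").strip()
    return value or default
```

Passing the mapping in as a parameter keeps the lookup easy to exercise in tests without touching the real process environment.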

Some adjustments to tests with DifferentialD were made so we use standard Unicode symbols not WMA unicode.

There was a bug in gstest from a prior commit in mathics-scanner where the "pre-scanner()" was replacing strings starting with "|".


I had a hard time understanding the flow of what's going on; the process was painful and left me a bit annoyed.

I was pleased, though, that @mmatera has started to move the makeboxes code into its own module, and we have a little segregation of boxing routines, so I suppose that helps a little.

It is not that the code isn't sophisticated or complicated. It is just that it is not organized and changes haven't been coordinated.

Let me describe a little of the history around this PR and the code I see, and reflect. The aim of this reflection is to show how the code has become of poorer quality, is in constant need of serious refactoring, and can get worse with uncoordinated bug fixes or feature improvements. Functionally the code does quite a bit, but at the cost of lots of code with undue complexity. I assume that much of the complicated code is there to ensure correct behavior.

However, where we find that the code is not correct or lacking, we have a hard time figuring out where to go and what to change that isn't going to impact a lot of other things.

mmatera notes a problem with infix operator behavior in the ascii-op-to-Unicode branch. I had sort of seen this when I wrote that branch initially. One thing I noticed that was wrong in my PR is that the place where the change occurs felt wrong: it is code inside a "MakeBox" rule. But this rule is defined on class creation or when registering each of the infix operators. These rules are never changed afterwards. Something like $CharacterEncoding can change at will, so this location is too static.

Therefore I had to give up (temporarily) on making this work for $CharacterEncoding. Instead, I used the less dynamic $SystemCharacterEncoding.
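The static-versus-dynamic problem can be pictured with a small sketch. Everything here (the `settings` dict, the one-entry table, the function names) is hypothetical and only stands in for the real rule registration.

```python
# Hypothetical table; the real tables are the JSON files built by make-op-tables.sh.
ASCII_TO_UNICODE = {"->": "\u2192"}

# Stand-in for a setting such as $CharacterEncoding that can change at will.
settings = {"CharacterEncoding": "ASCII"}


def operator_text(op):
    """Pick the operator's text using the encoding in effect right now."""
    if settings["CharacterEncoding"] == "ASCII":
        return op
    return ASCII_TO_UNICODE.get(op, op)


# Static: the choice is frozen when the rule string is built.
static_rule = 'Infix[{a, b}, " %s "]' % operator_text("->")


# Dynamic: the choice is deferred until formatting time.
def dynamic_rule():
    return 'Infix[{a, b}, " %s "]' % operator_text("->")


# A later change of encoding is seen only by the dynamic version.
settings["CharacterEncoding"] = "UTF-8"
```

After the last line, `static_rule` still contains `->`, while `dynamic_rule()` produces the Unicode arrow.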

Looking over this in a second pass, I see now that a rule like this is boneheaded:

formatted_output = 'MakeBoxes[Infix[{%s}," %s ",%d,%s], form]' % (
    replace_items,
    operator,
    self.precedence,
    self.grouping,
)
default_rules = {

    ...

    "MakeBoxes[{0}, form:InputForm|OutputForm]".format(op_pattern):
        formatted_output
}
Do you see why? ...

Hint: it is in the " %s " part.

This is supposed to be a rule that is called when boxing infix operators. And already it is deciding that operators should be surrounded by spaces for "InputForm" and "OutputForm".
In other words, right here in the part that dictates the rules for boxing Infix expressions, we are already making decisions about the low-level formatting: that there should be spaces around the operator, and what characters to use to represent the operator.

And when we actually get to the low-level formatting to a final string, we have lost structure. Basically we have violated the principle that you evaluate to get an M-expression, then you box the result, and only after boxing is low-level formatting done.
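As a rough sketch of the separation described above, a box can carry the operator as structure, leaving spacing and glyph choice to the final rendering step. `InfixBox`, `OPERATOR_TEXT`, and `render` are hypothetical names, not Mathics classes.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class InfixBox:
    """Hypothetical box that keeps the operator as structure, not as text."""

    operands: Tuple[str, ...]
    operator: str  # ASCII spelling, e.g. "->"


# Hypothetical per-encoding operator table.
OPERATOR_TEXT = {"UTF-8": {"->": "\u2192"}}


def render(box: InfixBox, encoding: str = "ASCII", spaced: bool = True) -> str:
    """Low-level formatting: spacing and glyph choice are decided only here."""
    glyph = OPERATOR_TEXT.get(encoding, {}).get(box.operator, box.operator)
    separator = f" {glyph} " if spaced else glyph
    return separator.join(box.operands)
```

With this shape, `render(InfixBox(("x", "y"), "->"))` gives `x -> y`, and the same box rendered with `encoding="UTF-8"` or `spaced=False` changes only the final string, not the boxed structure.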

If we want to write dozens more forms, the above isn't scalable. And we make high-quality formatting harder.

At some level this was probably noticed, at least implicitly; when there are so few comments it is hard to know what was noticed and what was just random programming in reaction to problems as they came up.
Since everything can't be done by MakeBox rules, we have do_format_xxx routines in mathics-core/mathics/core/formatter.py as well as special routines in builtin/arithfns/basic.py and builtin/pympler/asizeof.py, and some that are done inside MakeBox() internal functions.

In short, if principles were ever decided or defined, they were not communicated anywhere that I can tell; so naturally formatting code is scattered in many places in the code, at several conceptual levels. Possibly some of the code is redundant or, worse, works at cross purposes.

If discussion of high-level principles is lacking, so is basic description of the code: things like docstrings on functions, or making an effort to name what a function does. For example, take get_op(), an internal function inside the evaluation method apply_infix(), leaving aside the missing apply_ prefix that I have mentioned many times before.

What is get_op() "getting"? You already pass it some sort of operator. If you look at the function, you realize it is not getting or accessing something, but rather converting or formatting.

And once you realize that, you are in a position to ask: why is this routine formatting inside a boxing routine? In the overall architecture, isn't formatting a separate step? Shouldn't low-level formatting routines be put together?

Vagueness in the code, and a lack of description and discussion around it, has led to very haphazard code that, in the end, does not seem fully thought out.

With all the effort spent just figuring out what the code does, by the time I understand it I am exhausted and often not in a mood to think about how it should be designed, whether it follows a design, or how best to write it.

Let me come back one more time to adding spaces around the operator and losing the operator structure (the " %s " part). Because of this, a "remedy" was put in place that makes things worse. A routine was added that scans strings and unconditionally changes some strings (e.g. ASCII-formatted operators) into Unicode characters. It doesn't matter whether this was done out of ignorance or willful frustration: in the end, it makes the code a bigger mess.
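To illustrate why an unconditional scan-and-replace is fragile, here is a deliberately naive stand-in. It is not the actual routine in the code base, only a sketch of the failure mode.

```python
def naive_unicode_fixup(text: str) -> str:
    """Illustrative stand-in for a scan-and-replace pass; not the real routine."""
    return text.replace("->", "\u2192")


# The operator in a rendered rule is converted, as intended...
rule_output = naive_unicode_fixup("x -> y")

# ...but so is the content of a user string that merely contains "->".
string_output = naive_unicode_fixup('"a->b"')
```

Once the structure has been flattened to a string, the pass has no way to tell an operator from ordinary string content.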

It is, as I say, like trying to solve a Rubik's cube by getting adjacent faces in line one at a time. In the beginning there is a certain satisfaction because you can "make progress". However, getting to the end this way is much harder than understanding what is going on and performing operations that work in conjunction with the principles and groups of the Rubik's cube.

Given all this, my inclination right now is to hold off on adding new Forms or correcting the existing ones. Instead, if we first make things work as they do now, but in a logical, sensible, and extensible way, I think that would put us in the best position to then correct the existing behavior to match the specifications more closely and to add more Forms.

@rocky rocky marked this pull request as draft October 1, 2022 23:59
@rocky rocky force-pushed the ascii-op-to-unicode branch 4 times, most recently from e7abf88 to 4b1467c Compare October 2, 2022 00:28
@rocky
Member Author

rocky commented Oct 2, 2022

@mmatera - this fills in some of the ideas from a while ago where there was confusion over formatting output.

To start out with, this PR handles only Infix operators. Prefix and Postfix operators will be needed too.

Also lacking: the level of flexibility works via mathics setting SYSTEM_CHARACTER_ENCODING rather than $CHARACTER_ENCODING.

I think these are straightforward to add.

(Even though this fails CI, this works locally. I think the errors have to do with not being able to load the op tables.)

@mmatera
Contributor

mmatera commented Oct 2, 2022

@rocky, thanks for working on this. I will try to review it tomorrow.

@rocky rocky force-pushed the ascii-op-to-unicode branch 3 times, most recently from 105e3e8 to 96661ea Compare October 2, 2022 08:00
@mmatera
Contributor

mmatera commented Oct 2, 2022

@rocky, I was looking at this and I am not very convinced about this approach. Encoding seems to be something at the level of the conversion of a box expression to a string ("boxes_to_string" functions), in order to have an encoding independent evaluation.

An encoding-dependent evaluation could produce different results on different platforms. For example, two strings that differ on one platform could compare equal on another, because two different characters could have the same "printable" representation.
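A small sketch of that collision, assuming a lossy fallback table for an ASCII-only terminal. The table and function are hypothetical.

```python
# Hypothetical lossy fallback table for an ASCII-only terminal.
ASCII_FALLBACK = {"\U0001D451": "d"}  # MATHEMATICAL ITALIC SMALL D -> "d"


def to_ascii(text: str) -> str:
    """Map each character through the fallback table, leaving others alone."""
    return "".join(ASCII_FALLBACK.get(ch, ch) for ch in text)


a = "\U0001D451x"  # uses the mathematical italic d
b = "dx"  # uses a plain ASCII d
```

Here `a != b` as strings, yet `to_ascii(a) == to_ascii(b)`; if the encoding step ran during evaluation rather than at output time, a comparison of the two could give different answers depending on the platform.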

Also, such an approach could make debugging harder.

@rocky
Member Author

rocky commented Oct 2, 2022

@rocky, Encoding seems to be something at the level of the conversion of a box expression to a string ("boxes_to_string" functions), in order to have an encoding independent evaluation.

Yes, this is what I was referring to when I wrote:

Also the level of flexibility works via mathics setting SYSTEM_CHARACTER_ENCODING rather than $CHARACTER_ENCODING

I added a Mathics builtin function called AsciiOpToString[], and that is what can be used to make this adjustable at format-evaluation time. So your job, should you choose to undertake it, is to replace that fixed MakeBox rule with something that uses AsciiOpToString[] or something similar.

@rocky
Member Author

rocky commented Oct 2, 2022

I'd like to clarify my purpose in doing this PR draft. The main goal here was to start steering work back to the area where it belongs.

Historically, there was initially this very wrong-minded idea that search and replace of character strings on MathML output was going to fix up operators. And we still have horrible methods that allow this.

After that, there was this idea that somehow scanning or parsing needs to be changed, or that a massive revision of all operators is needed to add some sort of WL-specific Unicode symbol.

There is no doubt some way to turn this into a scanning and parsing problem by first writing something out as a string and then reading that back in. I think we currently see something like this in one of the Import functions, which does work this way. Although this is a way you might be able to simulate this behavior in pure WMA, it is definitely not the way WMA implements this internally when it needs to do this.

The point of this PR draft was to show that, no, none of this is needed. The goal here was to reset things back to a boxing and formatting output problem. And now is the time when one of us should be primarily dealing with Boxing and formatting.

When it comes to the details of the Boxing rules used or where this hooks in, that's not my area of expertise. I think what should be done is to see what Boxing rules WMA uses, and what interfaces it has in support of this. This is what I was asking about previously.

We should be able to see all of this at some level in a running Mathematica, which I do not have access to.

We have to keep in mind, though, that WMA may have some unexposed function that acts like AsciiOpToString[]. And I will bet money that it doesn't start out with the ASCII sequence - I used that because that is what is currently there and it is an acceptable thing to use.

But if there is a corresponding function, I would suspect it uses the operator name, not Unicode with custom WL characters. Right now in the JSON emitter of the mathics-scanner repository, we have a character-symbol-to-unicode translation spit out. That could easily be adapted to start out with just the operator name by stripping off the surrounding brackets in the output conversion program.
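Stripping the surrounding brackets could look roughly like the following. This is a sketch only; the function name and the pattern are assumptions and are not taken from the mathics-scanner conversion program.

```python
import re

# Matches a WL named-character escape such as \[Integral].
NAMED_CHARACTER = re.compile(r"\\\[([A-Za-z][A-Za-z0-9]*)\]")


def operator_name(named_character: str) -> str:
    """Return "Integral" for the escape "\\[Integral]"; other input is unchanged."""
    match = NAMED_CHARACTER.fullmatch(named_character)
    return match.group(1) if match else named_character
```

A table keyed this way would map, for example, `Integral` to its Unicode character, rather than keying on the escaped form or on an ASCII operator sequence.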

Understanding how WMA emits MakeBoxes (which is probably what it does), or the thing that corresponds to MakeBoxes[Infix[{%s}," %s ",%d,%s], form] (which is what Mathics has been using from the beginning), needs investigation and fixing. However, this needs to be done by someone who understands WMA behavior and operation better.

So again, for my part, all I want, and what this PR does, is to direct focus back to the area where this kind of thing belongs.

@mmatera
Contributor

mmatera commented Oct 4, 2022

I'd like to clarify my purpose in doing this PR draft. The main goal here was to start steering work back to the area where it belongs.

Historically, there was initially this very wrong-minded idea that search and replace of character strings on MathML output was going to fix up operators. And we still have horrible methods that allow this.

After that, there was this idea that somehow scanning or parsing needs to be changed, or that a massive revision of all operators is needed to add some sort of WL-specific Unicode symbol.

There is no doubt some way to turn this into a scanning and parsing problem by first writing something out as a string and then reading that back in. I think we currently see something like this in one of the Import functions, which does work this way. Although this is a way you might be able to simulate this behavior in pure WMA, it is definitely not the way WMA implements this internally when it needs to do this.

The point of this PR draft was to show that, no, none of this is needed. The goal here was to reset things back to a boxing and formatting output problem. And now is the time when one of us should be primarily dealing with Boxing and formatting.

Completely agree with this. In this way, this PR is very positive.

When it comes to the details of the Boxing rules used or where this hooks in, that's not my area of expertise. I think what should be done is to see what Boxing rules WMA uses, and what interfaces it has in support of this. This is what I was asking about previously.

Regarding this, I spent some time doing experiments to check how this works in WMA. However, this particular part turned out to be quite hard to hack to see the internal behavior. I have some notes, but I haven't had time to write them up into something communicable...

We should be able to see all of this at some level in a running Mathematica, which I do not have access to.

We have to keep in mind, though, that WMA may have some unexposed function that acts like AsciiOpToString[]. And I will bet money that it doesn't start out with the ASCII sequence - I used that because that is what is currently there and it is an acceptable thing to use.

Actually, it seems to have it: WMA has tables (written in WL) that provide the replacement at the level of characters. This is one of the reasons I think the character replacement must be done at the level of the boxes_to_text functions instead of the formatting functions (as this PR suggests).

But if there is a corresponding function, I would suspect it uses the operator name, not Unicode with custom WL characters. Right now in the JSON emitter of the mathics-scanner repository, we have a character-symbol-to-unicode translation spit out. That could easily be adapted to start out with just the operator name by stripping off the surrounding brackets in the output conversion program.

Maybe this would be useful: in .m files, all the strings are translated to the ASCII representation, using escaped NamedCharacters. The same happens if you pass a string with special characters through ToString with the CharacterEncoding parameter:

In[1]:= ToString["\[Integral]"]                                                 

Out[1]= ∫

In[2]:= ToString["\[Integral]", CharacterEncoding->"ASCII"]                     

Out[2]= \[Integral]
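A rough Python analogue of the behavior in that session, for illustration only. The one-entry name table and the hex-escape fallback for unnamed characters are assumptions, not a description of how WMA or Mathics implements ToString.

```python
# Hypothetical excerpt of a code-point-to-name table.
CHARACTER_NAMES = {"\u222b": "Integral"}


def to_ascii_named(text: str) -> str:
    """Replace non-ASCII characters with \\[Name] escapes when a name is known."""
    pieces = []
    for ch in text:
        if ord(ch) < 128:
            pieces.append(ch)
        elif ch in CHARACTER_NAMES:
            pieces.append("\\[" + CHARACTER_NAMES[ch] + "]")
        else:
            # Assumption: unnamed characters fall back to a \:xxxx hex escape.
            pieces.append("\\:%04x" % ord(ch))
    return "".join(pieces)
```

The point of the sketch is that the conversion works character by character on an already-built string, which is the level at which the character tables operate.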

Understanding how WMA emits MakeBoxes (which is probably what it does), or the thing that corresponds to MakeBoxes[Infix[{%s}," %s ",%d,%s], form] (which is what Mathics has been using from the beginning), needs investigation and fixing. However, this needs to be done by someone who understands WMA behavior and operation better.

So again, for my part, all I want, and what this PR does, is to direct focus back to the area where this kind of thing belongs.

Agree with it too.

@mmatera
Contributor

mmatera commented Oct 4, 2022

@rocky, also notice that locally, I found problems with

make clean && make doc

because the tests do not pass. Did you check that?

@rocky
Member Author

rocky commented Oct 4, 2022

@rocky, also notice that locally, I found problems with

make clean && make doc

because the tests do not pass. Did you check that?

Here is what I am seeing:

(cd mathics/doc/latex && make doc)
make[1]: Entering directory '/src/external-vcs/github/Mathics3/mathics-core/mathics/doc/latex'
(cd ../.. && python docpipeline.py --output --keep-going --want-sorting)
Traceback (most recent call last):
  File "docpipeline.py", line 31, in <module>
    from mathics.timing import show_lru_cache_statistics
ModuleNotFoundError: No module named 'mathics.timing'

If this is what you are seeing too, this is a packaging problem from a prior commit, when mathics.timing was split out from mathics.core.util. This should be fixed in another branch, since it is unrelated to this PR.

I don't immediately see what the problem is. But a workaround is to copy the file mathics/timing.py to wherever docpipeline.py is looking for it, which is most likely the place this got installed.

If you are seeing something different, please attach more details, like the traceback above.

@rocky
Member Author

rocky commented Oct 4, 2022

Maybe this would be useful: in .m files, all the strings are translated to the ASCII representation, using escaped NamedCharacters. The same happens if you pass a string with special characters through ToString with the CharacterEncoding parameter:

In[1]:= ToString["\[Integral]"]                                                 

Out[1]= ∫

In[2]:= ToString["\[Integral]", CharacterEncoding->"ASCII"]                     

Out[2]= \[Integral]

This is definitely possible. I don't have a strong objection, but here are the downsides: it means we'd have to maintain tables in two places, which causes a data-consistency problem. If there is a discrepancy between the two, which one is right? The answer is whichever is used, but that is probably not a great answer. Also, this could slow loading down. A JSON load using ujson has to be faster than interpreting .m files. Gark measured various loading options, and JSON.load using ujson was the fastest way to get information in.
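For reference, loading a table with ujson and falling back to the standard library can be sketched as below. The helper name and the inlined table contents are hypothetical; the real tables are the JSON files produced by the build script.

```python
import io

try:
    import ujson as json  # optional faster parser, if installed
except ImportError:
    import json  # standard-library fallback with the same load() call


def load_operator_table(stream):
    """Parse an operator table from an open text stream."""
    return json.load(stream)


# Hypothetical table contents, inlined so the sketch runs without a file.
table = load_operator_table(io.StringIO('{"->": "\\u2192"}'))
```

Because both modules expose `load()`, the rest of the code does not need to know which parser was picked.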

In favor of .m files: this is simpler and more straightforward, and our build system is simpler.

Understanding how WMA emits MakeBoxes (which is probably what it does), or the thing that corresponds to MakeBoxes[Infix[{%s}," %s ",%d,%s], form] (which is what Mathics has been using from the beginning), needs investigation and fixing. However, this needs to be done by someone who understands WMA behavior and operation better.
So again, for my part, all I want, and what this PR does, is to direct focus back to the area where this kind of thing belongs.

Agree with it too.

Thanks for your patience and understanding. At heart you are a great person.

@rocky
Member Author

rocky commented Oct 4, 2022

I made the copy and am currently running "make doc". That reminds me of one other thing that we have been glossing over and that we will have to deal with in the future.

Take for example "DifferentialD" - right now we do not have an ASCII string for that. In fact it is an operator and that is not noted either. Because it is an operator, we should have

DifferentialD: 
  ascii: "d"
  ...

In other words, when we have a pure ASCII system and no Unicode, we should be printing ASCII "d". This means that on an ASCII system you can't take output and feed it in as input. Here the "d" will get confused.

But that is the nature of ASCII anyway. You can't cut and paste the output of 5 / 2

5
-
2

and expect that to be valid input either.

It is possible that there are a lot of operators that are not in the character tables yet. I don't recall whether I finished my pass over the tables for these.

@mmatera
Contributor

mmatera commented Oct 4, 2022

I made the copy and am currently running "make doc". That reminds me of one other thing that we have been glossing over and that we will have to deal with in the future.

Take for example "DifferentialD" - right now we do not have an ASCII string for that. In fact it is an operator and that is not noted either. Because it is an operator, we should have

DifferentialD: 
  ascii: "d"
  ...

The problem is that OutputForm and StandardForm are not supposed to be something that you can re-enter. This is the goal of InputForm. In WMA,

In[1]:= Integrate[F[x],x]//OutputForm                                           

Out[1]//OutputForm= Integrate[F[x], x]

In[2]:= Integrate[F[x],x]//StandardForm                                         

Out[2]//StandardForm= ∫ F[x]  x

In[3]:= Integrate[F[x],x]//InputForm                                                                                                                                          

Out[3]//InputForm= Integrate[F[x], x]

(in the terminal interface, the default format is OutputForm)

Notice that in OutputForm, Integrate is shown just as in InputForm. On the other hand, for a rational number,

In[1]:= 5/3//OutputForm                                                                                                                                                       

                    5
Out[1]//OutputForm= -
                    3

In[2]:= 5/3//StandardForm                                                                                                                                                     

                      5
Out[2]//StandardForm= -
                      3

In[3]:= 5/3// InputForm                                                                                                                                                       

Out[3]//InputForm= 5/3

In other words, when we have a pure ASCII system and no Unicode, we should be printing ASCII "d". This means that on an ASCII system you can't take output and feed it in as input. Here the "d" will get confused.

But that is the nature of ASCII anyway. You can't cut and paste the output of 5 / 2

5
-
2

and expect that to be valid input either.

It is possible that there are a lot of operators that are not in the character tables yet. I don't recall whether I finished my pass over the tables for these.

This is why I wonder if it makes sense to use the specific table for converting operators in the output. It would be simpler to translate them just as special characters in strings. Where operators would be important is in the input, when a query is tokenized. Still, I wonder if we shouldn't encode the query in canonical encoding before parsing it.

@mmatera
Contributor

mmatera commented Oct 4, 2022

No, the problem is this one: the reference output is not encoded. For example,

(pystonmathics) mauricio@mauricio-T15thinkpad:~/Projects/mathics-core$ make doc
(cd mathics/doc/latex && make doc)
make[1]: Entering directory '/home/mauricio/Projects/mathics-core/mathics/doc/latex'
(cd ../.. && python docpipeline.py --output --keep-going --want-sorting)
Testing Mathics 5.0.3dev0
on CPython 3.8.12 (heads/v2.3.4.1_release:cd8ca63678, Jun  7 2022, 02:05:46)
using SymPy 1.8, mpmath 1.2.1, numpy 1.21.4
b'********** Examples / Curve sketching **********'
b'   1 ( 1): TEST f[x_] := 4 x / (x ^ 2 + 3 x + 5)'
b"   2 ( 2): TEST {f'[x], f''[x], f'''[x]} // Together"
b"   3 ( 3): TEST extremes = Solve[f'[x] == 0, x]"
result =!=wanted
----------------------------------------------------------------------
Test failed: Curve sketching in Manual / Examples
Manual
Result: {{x → -Sqrt[5]}, {x → Sqrt[5]}}
Wanted: {{x -> -Sqrt[5]}, {x -> Sqrt[5]}}

Here, → in the result is compared with -> in the reference.
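The mismatch can be reproduced in a few lines. The normalizing helper below is a hypothetical illustration of what a test harness would need if it compared Unicode output against ASCII reference text; it is not part of docpipeline.py.

```python
# Hypothetical inverse table; only the arrow from the failing test is listed.
UNICODE_TO_ASCII = {"\u2192": "->"}


def normalize(output: str) -> str:
    """Rewrite known Unicode operators to ASCII, for test comparison only."""
    for glyph, ascii_op in UNICODE_TO_ASCII.items():
        output = output.replace(glyph, ascii_op)
    return output


result = "{{x \u2192 -Sqrt[5]}, {x \u2192 Sqrt[5]}}"
wanted = "{{x -> -Sqrt[5]}, {x -> Sqrt[5]}}"
```

The raw strings differ, while `normalize(result)` equals `wanted`. Producing ASCII output in the first place, by setting the encoding for the test run, avoids the need for any such post-processing.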

@rocky
Member Author

rocky commented Oct 4, 2022

This is why I wonder if it makes sense to use the specific table for converting operators in the output. It would be simpler to translate them just as special characters in strings. Where operators would be important is in the input, when a query is tokenized. Still, I wonder if we shouldn't encode the query in canonical encoding before parsing it.

Right now after parsing the representation is FullForm. I don't see any problem with that. Just the reverse. It is the clearest and most straightforward representation. It is also how it is conventionally done in all compilers and interpreters.

@rocky
Member Author

rocky commented Oct 4, 2022

No, the problem is this one: the reference output is not encoded. For example,

(pystonmathics) mauricio@mauricio-T15thinkpad:~/Projects/mathics-core$ make doc
(cd mathics/doc/latex && make doc)
make[1]: Entering directory '/home/mauricio/Projects/mathics-core/mathics/doc/latex'
(cd ../.. && python docpipeline.py --output --keep-going --want-sorting)
Testing Mathics 5.0.3dev0
on CPython 3.8.12 (heads/v2.3.4.1_release:cd8ca63678, Jun  7 2022, 02:05:46)
using SymPy 1.8, mpmath 1.2.1, numpy 1.21.4
b'********** Examples / Curve sketching **********'
b'   1 ( 1): TEST f[x_] := 4 x / (x ^ 2 + 3 x + 5)'
b"   2 ( 2): TEST {f'[x], f''[x], f'''[x]} // Together"
b"   3 ( 3): TEST extremes = Solve[f'[x] == 0, x]"
result =!=wanted
----------------------------------------------------------------------
Test failed: Curve sketching in Manual / Examples
Manual
Result: {{x → -Sqrt[5]}, {x → Sqrt[5]}}
Wanted: {{x -> -Sqrt[5]}, {x -> Sqrt[5]}}

Here, → in the result is compared with -> in the reference.

The flag --keep-going is supposed to ignore errors and continue. Is that not happening?

What is used in straight testing without doc generation is to set SYSTEM_CHARACTER_ENCODING to ASCII via an Environment variable. So that is a possibility.

I didn't go that route because I thought that in docs we would want the unicode symbols. However I see what we really want is possibly a new kind of form which outputs more correct LaTeX. Right now, things are a little haphazard because we are hoping or assuming LaTeX can handle certain unicode symbols. That is what the sed hack does.

Possible remedies are:

  • ignore errors (which is what I thought was happening here, and that worked for me when I produced a PDF)
  • Set SYSTEM_CHARACTER_ENCODING to ASCII
  • don't bother running the test and assume the expected output was obtained
  • set default output to TeXForm in reading expected output and evaluated output
  • ditch this (which is what is on the plan)

For now, this is a large enough problem in and of itself that I don't think we should try to take it on until the other ones have been settled and work. We don't have a solid basis (and never had one) for handling this properly.

So for now, ignoring errors (which I thought was happening, and that is what I see), setting SYSTEM_CHARACTER_ENCODING to "ASCII", or not bothering to run tests feels like the path of least effort that will still give us a PDF as good as what we have now.

@mmatera
Contributor

mmatera commented Oct 4, 2022

This is why I wonder if it makes sense to use the specific table for converting operators in the output. It would be simpler to translate them just as special characters in strings. Where operators would be important is in the input, when a query is tokenized. Still, I wonder if we shouldn't encode the query in canonical encoding before parsing it.

Right now after parsing the representation is FullForm. I don't see any problem with that. Just the reverse. It is the clearest and most straightforward representation. It is also how it is conventionally done in all compilers and interpreters.

I am not talking about the Form, but about the encoding before parsing. Suppose in the terminal you input
a \u2A75 b
then the (Python) string "a \u2A75 b" is now directly tokenized and parsed to Equal[a,b], because "\u2A75" is the standard Unicode equivalent of "\[Equal]". My question is whether it would be convenient to convert the string to
"a \u03F5 b" (assuming WL Unicode as the canonical encoding) and then do the parsing.

@rocky
Member Author

rocky commented Oct 4, 2022

This is why I wonder if it makes sense to use the specific table for converting operators in the output. It would be simpler to translate them just as special characters in strings. Where operators would be important is in the input, when a query is tokenized. Still, I wonder if we shouldn't encode the query in canonical encoding before parsing it.

Right now after parsing the representation is FullForm. I don't see any problem with that. Just the reverse. It is the clearest and most straightforward representation. It is also how it is conventionally done in all compilers and interpreters.

I am not talking about the Form, but about the encoding before parsing. Suppose in the terminal you input a \u2A75 b then the (Python) string "a \u2A75 b" is now directly tokenized and parsed to Equal[a,b], because "\u2A75" is the standard Unicode equivalent of "\[Equal]". My question is whether it would be convenient to convert the string to "a \u03F5 b" (assuming WL Unicode as the canonical encoding) and then do the parsing.

I don't see how this is convenient. Everything works off of the FullForm parse tree on input and M-Expressions on output and I don't see a problem with that, but rather it is a good thing.

(BTW Microsoft Basic used character-encoded opcode names internally; it might have copied this behavior from other Basic implementations. However, it was the only language that I know of that did this. APL and APL/2 were like this, but that was different in that they started out assuming IBM APL-specific keyboards, so there really was no translation done at all.)

Windows workflow CI opt-tables build
in document production.

Test expectations fail if this isn't done and then the output does not
appear.
@mmatera
Contributor

mmatera commented Oct 4, 2022

I don't see how this is convenient. Everything works off of the FullForm parse tree, and I don't see a problem with that, but rather it is a good thing.

I was thinking about it because of the problem of maintaining the "operators" table. I was wondering where that table is useful.
In the output, I guess it is not needed. Once we have a boxed expression, the only thing we need is to encode strings. For example, when the expression StandardForm[Integrate[F[x],x]] is formatted, the following boxed expression is produced:

RowBox[{"\[Integral]", RowBox[{"F","[", "x","]"}], RowBox[{"\[DifferentialD]", "x"}]}]

(with the named characters replaced by the canonical Unicode character). Then, for showing it in the front end, whether "\[DifferentialD]" comes from an operator in Integrate or from something introduced by hand inside the string makes no difference: boxes_to_text applies the encoding map, and then in the terminal we get "\U0001D451", "d", or any other character that looks as similar as possible to "\U0001D451". So what is used there is not the "ascii_to_operators" table, but the table that converts the canonical encoding to the SystemEncoding (UTF-8, ASCII, or whatever is available/convenient).
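A minimal sketch of that idea, with nested Python lists standing in for RowBox expressions. The stand-alone `boxes_to_text` function and the `TO_ASCII` map here are hypothetical simplifications, not the Mathics implementation.

```python
# Hypothetical map from canonical characters to an ASCII-only system encoding.
TO_ASCII = {"\u222b": "\\[Integral]", "\U0001D451": "d"}


def boxes_to_text(box, encoding_map=None):
    """Render nested row boxes (strings or lists of boxes) to a string."""
    encoding_map = encoding_map or {}
    if isinstance(box, str):
        # The encoding map is applied per character, whatever the string's origin.
        return "".join(encoding_map.get(ch, ch) for ch in box)
    return "".join(boxes_to_text(item, encoding_map) for item in box)


# Stand-in for RowBox[{"\[Integral]", RowBox[{"F","[","x","]"}], RowBox[{"\[DifferentialD]","x"}]}]
row = ["\u222b", ["F", "[", "x", "]"], ["\U0001D451", "x"]]
```

Called without a map, the canonical characters pass through unchanged; with `TO_ASCII`, the same boxes render as `\[Integral]F[x]dx`. The box structure is identical in both cases, and only the last step depends on the encoding.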

@rocky
Member Author

rocky commented Oct 4, 2022

I don't see how this is convenient. Everything works off of the FullForm parse tree, and I don't see a problem with that, but rather it is a good thing.

I was thinking about it because of the problem of maintaining the "operators" table. I was wondering where that table is useful. In the output, I guess it is not needed. Once we have a boxed expression, the only thing we need is to encode strings. For example, when the expression StandardForm[Integrate[F[x],x]] is formatted, the following boxed expression is produced:

RowBox[{"\[Integral]", RowBox[{"F","[", "x","]"}], RowBox[{"\[DifferentialD]", "x"}]}]

(with the named characters replaced by the canonical Unicode character). Then, for showing it in the front end, if "[DifferentialD]" comes from an operator in Integrate of from something introduced by hand inside the string, it is the same: boxes_to_text applies the encoding map and then in the terminal, we get "\U0001D451" "d" or any other character that looks as similar as possible to "\U0001D451". Then, what is used there is not the "ascii_to_operators" table, but the table that converts the canonical encoding to the SystemEncoding (UFT-8, ASCII or would be available/convenient).

Yes, that is correct. Mathics-scanner has operator-to-unicode which is roughly the same. That could be used as is, or "ascii-operator-to..." could be easily changed to "operator-charname-to-unicode{,-wl}".

@mmatera
Contributor

mmatera commented Oct 4, 2022

OK. Then, suppose that the expression is

Integrate[F[\[Alpha]], \[Alpha]] // StandardForm

which produces boxes

RowBox[{"\[Integral]", RowBox[{"F","[", "\[Alpha]","]"}], RowBox[{"\[DifferentialD]", "\[Alpha]"}]}]

but \[Alpha] is not an operator. So when, and with which table, would it be encoded?
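
One candidate for characters that are not operators is a plain name-to-Unicode lookup applied to each leaf string of the boxes. The sketch below assumes a small hand-written `NAMED_CHARACTERS` dictionary and a hypothetical helper `replace_named_characters`; the real table would come from mathics-scanner and is much larger.

```python
import re

# Hypothetical excerpt of a name -> Unicode table (illustration only).
NAMED_CHARACTERS = {
    "Alpha": "\u03b1",
    "Integral": "\u222b",
    "LeftArrow": "\u2190",
}

# Matches escapes of the form \[Name].
NAMED_CHARACTER_RE = re.compile(r"\\\[([A-Za-z]+)\]")


def replace_named_characters(text: str) -> str:
    """Replace \\[Name] escapes by Unicode; unknown names stay unchanged."""
    def substitute(match: re.Match) -> str:
        return NAMED_CHARACTERS.get(match.group(1), match.group(0))
    return NAMED_CHARACTER_RE.sub(substitute, text)


print(replace_named_characters(r"F[\[Alpha]]"))      # F[α]
print(replace_named_characters(r"\[NoSuchName] x"))  # \[NoSuchName] x
```

Because the lookup is keyed on the character name, it treats \[Alpha] and \[Integral] alike and does not need to know which of them is an operator.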

@rocky
Member Author

rocky commented Oct 4, 2022

"named-characters" would work I think. Or a table adapted along those lines.

@mmatera
Contributor

mmatera commented Oct 4, 2022

Then, my questions are:

@rocky
Member Author

rocky commented Oct 5, 2022

Then, my questions are:

I am sorry to report that I believe #541 is thinking about things in the old wrong way. Or that is my reading, not having access to a working Mathematica.

The WL doc for ToString says in part:

ToString[expr]
gives a string corresponding to the printed form of expr in OutputForm
ToString[expr, form ]
gives the string corresponding to output in the specified form.

My reading of this is that whenever an (output) form is involved, either you start out with parsed input (a parse tree, or an M-expression) or, if you start with a string, this implies scanning and parsing the string. Unless I missed something, I don't see that happening in #541.

I suggest seeing what the behavior of ToString is when given invalid input. For example, what is the output of ToString["1->"], ToString["1-"] or, even more simply, ToString["-"]?

In general, if our code is using replace_with_plain_text() then that should be a red flag that we are probably doing something incorrectly.

@mmatera
Contributor

mmatera commented Oct 5, 2022

I am sorry to report that I believe #541 is thinking about things in the old wrong way. Or that is my reading, not having access to a working Mathematica.

The WL doc for ToString says in part:

ToString[expr]
gives a string corresponding to the printed form of expr in OutputForm
ToString[expr, form ]
gives the string corresponding to output in the specified form.

OK, I realize that there is something in that approach that you do not like. What I am trying to understand is what exactly it is.

My reading of this is that whenever an (output) form is involved, either you start out with parsed input (a parse tree, or an M-expression) or, if you start with a string, this implies scanning and parsing the string. Unless I missed something, I don't see that happening in #541.

Let me try to clarify my attempt. What I did there was to place the task of producing output in a certain character encoding AFTER all the formatting (boxing) work is done. This is why the only place where I really needed to touch the code is a single routine in mathics.format.text.

I was probably wrong to also change the ToString code, even though that is a straightforward change as well. However, I started there because it was the easiest place to analyze how encoding works, without being entangled with all the other formatting parts.

The rest of the changes are tests and a mechanism to set the default encoding. This last part probably is better implemented in #567 than in #541.

My questions are then focused on the first part (the change in mathics.format.text). So, for some reason, there is something so wrong with my approach that you think that, instead of changing a line, it would be better to change several formatting routines to apply the encoding earlier, even at the cost that if, in the end, the evaluation requires another encoding, let's say,
ToString[StandardForm[Integrate[F[x],x]], CharacterEncoding->"ASCII"]
then the encoding must be reverted (assuming that is possible). Maybe you are right, and you probably have strong reasons for thinking that way. What I would like to know is what these reasons are.

I suggest seeing what the behavior of ToString is when given invalid input. For example, what is the output of ToString["1->"], ToString["1-"] or, even more simply, ToString["-"]?

Now, regarding your question, for these three cases the output is exactly the same as the input: a String whose value consists of ASCII characters.

Differences come up if you pass non-ASCII characters. For example, in WMA,

In[1]:= ToString["\[LeftArrow]"]                                                
Out[1]= ←
In[2]:= ToString["\[LeftArrow]", CharacterEncoding->"ASCII"]                    
Out[2]= <-

Also, if you copy the character from the notebook interface, what you get is \[LeftArrow]. (but for a weird reason,
if you copy you get -> instead of \[RightArrow] in the clipboard...). In any case, the encoding

Notice that CharacterEncoding is an option of ToString in WMA: https://reference.wolfram.com/language/ref/ToString.html?q=ToString

In general, if our code is using replace_with_plain_text() then that should be a red flag that we are probably doing something incorrectly.

Probably the main problem is that I do not understand why you think that replace_with_plain_text should be deprecated. My guess was that it is too limited, because it handles just two possible encodings, and maybe we would like something more flexible. However, whatever the replacement is, I am pretty convinced that the effect of the encoding parameter (both from ToString and from the system) should take place there. Otherwise, we would need to spread changes across many parts of the code, which looks far less modular than the other approach.

@mmatera
Contributor

mmatera commented Oct 5, 2022

Regarding ToString, here is a more involved but interesting example:

By default, ToString uses the OutputForm

In[1]:=ToString[Integrate[F[x],x]] //InputForm
Out[1]//InputForm= "Integrate[F[x],x]"

But if you ask for StandardForm, the output is

In[2]:=ToString[\[Integral]F[x] \[DifferentialD]x, StandardForm] // InputForm
Out[2]//InputForm="\!\(\*RowBox[{\"\[Integral]\", RowBox[{RowBox[{\"F\", \"[\", \"x\", \"]\"}], RowBox[{\"\[DifferentialD]\", \"x\"}]}]}]\)"

Suppose now you ask for a specific encoding. For example,

In[3]:=ToString[Integrate[F[x], x], StandardForm, 
  CharacterEncoding -> "ASCII"] 
Out[3] = "\\!\\(\\*RowBox[{\"\\[Integral]\", RowBox[{RowBox[{\"F\", \"[\", \"x\
\", \"]\"}], RowBox[{\"\\[DifferentialD]\", \"x\"}]}]}]\\)" 

Now, if you copy the output of StandardForm[Integrate[F[x],x]]
and paste it in a text editor, you get
\[Integral]F[x] \[DifferentialD]x

@rocky
Member Author

rocky commented Oct 6, 2022

OK, I realize that there is something in that approach that you do not like. What I am trying to understand is what exactly it is.

The idea of taking strings and doing a string-like search and replace for operators does not appear to be a concept embraced in WMA.

It is like saying: I put 2 and 2 together and I get 22; what's wrong with that? Well, if you are talking about strings and string concatenation, that makes sense. However, when you are talking about mathematics it makes little sense and shows a lack of understanding of standard addition. Another analogy might be trying to solve a Rubik's cube by getting all faces the same color face by face, rather than understanding the notion that there is a permutation group which contains subgroups which don't divide up that way. In short, you are going against the grain of the philosophy of WMA which, while it might not be documented from the standpoint of an interpreter developer, does follow standard interpreter principles and has a certain logic to it.

Probably, I did wrong by changing also the ToString code, which is also a straightforward change.

And you could say the same thing about putting 2 and 2 together to make 22, or solving a Rubik's cube by making each face match. You can also say that for someone who doesn't understand addition or groups and subgroups this is a "natural" or "necessary" first step. For you maybe, but I'd ask you to recognize your weaknesses and just leave them for me then. And when I am weak in an area like the exact details of the behavior of specific functions or what transformation rules WMA uses, I'll leave that for you or others.

Now, regarding your question, for these three cases the output is exactly the same as the input: a String whose value consists of ASCII characters.

Differences come up if you pass non-ASCII characters. For example, in WMA,

In[1]:= ToString["\[LeftArrow]"]                                                
Out[1]= ←
In[2]:= ToString["\[LeftArrow]", CharacterEncoding->"ASCII"]                    
Out[2]= <-

Also, if you copy the character from the notebook interface, what you get is \[LeftArrow]. (but for a weird reason, if you copy you get -> instead of \[RightArrow] in the clipboard...). In any case, the encoding

The purpose of these tests was to dig deeper into the semantics of ToString[]. In the WMA reference, the argument of ToString is an expr. And expr in the reference, from an implementor's standpoint, represents a parsed or structured object, generally an M-expression. When the expr is a string, the behavior from your experiments implies that there is a parse of the string; if the parse succeeds then that structure is used to dictate how to produce the Form, which is the normal way that Form output to a string (Unicode or not, whatever the CharacterEncoding) is produced.

When expr is not parsable, then the behavior can either be to leave the string alone, or throw some sort of error. And this too isn't all that different than evaluating input.

https://reference.wolfram.com/language/tutorial/TextualInputAndOutput.html#8971 says:

Input: convert from a textual form to an expression
Processing: do computations on the expression [which here is nothing]
Output: convert the resulting expression [a structure] to textual form

If we are short-circuiting this process in ToString by going from string to string and short-circuiting the normal way an expr gets converted to a Form, then this doesn't follow the above.

Continuing:

When you type something like x^2 what the Wolfram Language at first sees is just the string of characters x, ^, 2. But with the usual way that the Wolfram Language is set up, it immediately knows to convert this string of characters into the expression Power[x,2].
Then, after whatever processing is possible has been done, the Wolfram Language takes the expression Power[x,2] and converts it into some kind of textual representation for output.

So, to repeat, the implementation in that PR doesn't follow this approach.
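
The three stages quoted from the tutorial can be sketched as a toy pipeline in Python. Everything here is an assumption for illustration: the tuple `("Power", "x", 2)` stands in for an M-expression, and `parse`, `evaluate` and `to_text` are hypothetical names that only handle the single `x^2` shape from the quoted text.

```python
import re


def parse(text: str):
    """Input: turn characters such as 'x^2' into the expression Power[x, 2]."""
    match = re.fullmatch(r"\s*([A-Za-z]\w*)\s*\^\s*(\d+)\s*", text)
    if match is None:
        raise ValueError(f"toy parser cannot handle {text!r}")
    return ("Power", match.group(1), int(match.group(2)))


def evaluate(expr):
    """Processing: nothing to compute here, so the expression is returned as is."""
    return expr


def to_text(expr, form: str = "InputForm") -> str:
    """Output: convert the structured expression to text in the given form."""
    head, base, exponent = expr
    if form == "FullForm":
        return f"{head}[{base}, {exponent}]"
    return f"{base}^{exponent}"


expr = evaluate(parse("x^2"))
print(to_text(expr, "FullForm"))   # Power[x, 2]
print(to_text(expr, "InputForm"))  # x^2
```

The output stage only ever sees the structured expression, never the original input string, which is the property being argued for above.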

Notice that CharacterEncoding is an option of ToString in WMA: https://reference.wolfram.com/language/ref/ToString.html?q=ToString

Yes, but there always is a Form involved in ToString and CharacterEncoding is just an option or variation added to the Form.

@mmatera
Contributor

mmatera commented Oct 6, 2022

OK, I realize that there is something in that approach that you do not like. What I am trying to understand is what exactly it is.

The idea of taking strings and doing a string-like search and replace for operators does not appear to be a concept embraced in WMA.

StringReplace https://reference.wolfram.com/language/ref/StringReplace.html has exactly that kind of behavior.

But, most importantly, the formatting process (Format/MakeBoxes) does not take into account the $CharacterEncoding parameter. To see this, please consider the following test that I ran in the WMA command-line interpreter:

First, some strings and boxes are produced assuming the default encoding (UTF-8):

In[1]:= (*Premade boxes *)                                                                                                                                                    

In[2]:= boxesint = MakeBoxes[Integrate[F[x],x]]                                                                                                                               

Out[2]= RowBox[{∫, RowBox[{RowBox[{F, [, x, ]}], RowBox[{, x}]}]}]

In[3]:= stringint = "\[Integral] F[x] \[DifferentialD]x"                                                                                                                      

Out[3]= ∫ F[x] x

Then I set another encoding, and compare the output of the prebuilt string and boxes, against the boxes I make after setting the encoding:

In[4]:= (*CharacterEncoding ASCII*)                                                                                                                                           

In[5]:= $CharacterEncoding = "ASCII"                                                                                                                                          

Out[5]= ASCII

In[6]:= stringint                                                                                                                                                             

Out[6]= \[Integral] F[x] dx

In[7]:= "\[Integral] F[x] \[DifferentialD]x"                                                                                                                                  

Out[7]= \[Integral] F[x] dx

In[8]:= boxesint                                                                                                                                                              

Out[8]= RowBox[{\[Integral], RowBox[{RowBox[{F, [, x, ]}], RowBox[{d, x}]}]}]

In[9]:= MakeBoxes[Integrate[F[x],x]]                                                                                                                                          

Out[9]= RowBox[{\[Integral], RowBox[{RowBox[{F, [, x, ]}], RowBox[{d, x}]}]}]

Now UTF-8 is restored, and the output compared:

In[10]:= (*CharacterEncoding UTF-8*)                                                                                                                                          

In[11]:= $CharacterEncoding = "UTF-8"                                                                                                                                         

Out[11]= UTF-8

In[12]:= stringint                                                                                                                                                            

Out[12]= ∫ F[x] x

In[13]:= "\[Integral] F[x] \[DifferentialD]x"                                                                                                                                 

Out[13]= ∫ F[x] x

In[14]:= boxesint                                                                                                                                                             

Out[14]= RowBox[{∫, RowBox[{RowBox[{F, [, x, ]}], RowBox[{, x}]}]}]

In[15]:= MakeBoxes[Integrate[F[x],x]]                                                                                                                                         

Out[15]= RowBox[{∫, RowBox[{RowBox[{F, [, x, ]}], RowBox[{, x}]}]}]

@rocky
Member Author

rocky commented Oct 6, 2022

The idea of taking strings and doing a string-like search and replace for operators does not appear to be a concept embraced in WMA.

StringReplace https://reference.wolfram.com/language/ref/StringReplace.html has exactly that kind of behavior.

No, it doesn't (see below). And more importantly, StringReplace isn't the same thing as ToString, which is what that PR was about. This habit, when you are wrong, of expanding things a little and going off on a tangent doesn't help solve any of the existing problems.

In StringReplace, the argument is listed as a str ; that indicates this is working on a different kind of object normally. "2" and "2" when we are discussing strings and string concatenation is okay. But not in a discussion of 2 and 2 in the context of arithmetic addition. And postulating or finding MultiplyBy10AndAdd[] doesn't change things either.

As I mentioned before, ToString[] is intended to normally work on an expr ; when the expr happens to be a string and the string can be parsed seems to be an extension of this behavior and a somewhat special case.

As for StringReplace, notice that the replacement string always has to be given. There is no implied automatic conversion as is found in replace_wl_with_plain_text(). Again, I can find no use of such a function that would be needed to implement anything in WMA.

@mmatera
Contributor

mmatera commented Oct 6, 2022

The idea of taking strings and doing a string-like search and replace for operators does not appear to be a concept embraced in WMA.

StringReplace https://reference.wolfram.com/language/ref/StringReplace.html has exactly that kind of behavior.

No, it doesn't (see below). And more importantly, StringReplace isn't the same thing as ToString, which is what that PR was about. This habit, when you are wrong, of expanding things a little and going off on a tangent doesn't help solve any of the existing problems.

No, this PR is not about ToString, but about how encodings enter into the formatting sequence. This is why I said that including the part about ToString was a mistake. Sorry for that.
The main reason for including it is that it is apparent that most of the work done in formatting must also be reproduced in ToString.

Now, focusing on the subject of encoding, what I tried to show with the example is that $CharacterEncoding does not play any role in the WMA implementation of MakeBoxes, but just in how the front end translates boxes into a printable string. Again, I acknowledge that I am not an expert on how this should be implemented. However, what I do know is what the expected behavior would be. This PR cannot reproduce that behavior, and this is why I am reluctant to merge it.

@rocky
Member Author

rocky commented Oct 6, 2022

Now, focusing on the subject of encoding, what I tried to show with the example is that $CharacterEncoding does not play any role in the WMA implementation of MakeBoxes, but just in how the front end translates boxes into a printable string.

Sure, I got that. I repeat:

The goal here was to reset things back to a boxing and formatting output problem. And now is the time when one of us should be primarily dealing with Boxing and formatting.

and from before:

So your job, should you choose to undertake this, is to replace that fixed MakeBox rule with something that uses AsciiOpToString[] or something similar.

And this needs to be revised or made more specific: what we are seeing is that what is needed is:

  • improved rules for Boxing
  • improved formatting routines that take into account $CharacterEncoding

This has been and will hang out as a draft until we have a better handle on the above. Once that happens, either this can be closed, or revised, since a lot of it is about the mechanics needed to interface with some Mathics-scanner table that assists the formatting part.

@mmatera
Contributor

mmatera commented Oct 7, 2022

Now, focusing on the subject of encoding, what I tried to show with the example is that $CharacterEncoding does not play any role in the WMA implementation of MakeBoxes, but just in how the front end translates boxes into a printable string.

Sure, I got that. I repeat:

The goal here was to reset things back to a boxing and formatting output problem. And now is the time when one of us should be primarily dealing with Boxing and formatting.

and from before:

So your job, should you choose to undertake this, is to replace that fixed MakeBox rule with something that uses AsciiOpToString[] or something similar.

OK, but as I tried to show before, this behavior is not compatible with WMA. So, even if your approach is better than the one in WMA, it could bring issues when we try to load packages like CellsToTeX (one of the main motivations that I have in the background for all of this).

So, whatever the right implementation would be, I have (and have tried to show) several reasons to keep Format/do_format/MakeBoxes agnostic about $CharacterEncoding settings. These reasons suggest to me that the right place to use this parameter is the mathics.format module, which holds the routines that convert Box expressions into strings and different file formats.

@rocky
Member Author

rocky commented Oct 7, 2022

Now, focusing on the subject of encoding, what I tried to show with the example is that $CharacterEncoding does not play any role in the WMA implementation of MakeBoxes, but just in how the front end translates boxes into a printable string.

Sure, I got that. I repeat:

The goal here was to reset things back to a boxing and formatting output problem. And now is the time when one of us should be primarily dealing with Boxing and formatting.

and from before:

So your job, should you choose to undertake this, is to replace that fixed MakeBox rule with something that uses AsciiOpToString[] or something similar.

OK, but as I tried to show before, this behavior is not compatible with WMA. So, even if your approach is better than the one in WMA, it could bring issues when we try to load packages like CellsToTeX (one of the main motivations that I have in the background for all of this).

Forget that I wrote:

is to replace that fixed MakeBox rule with something that uses AsciiOpToString[] or something similar.

which was a generalization based on how the existing Mathics code works.

Pretend that I had written instead:

  • improve rules for Boxing and how Boxing works for the different kinds of forms. From the tests, it looks like the output after Boxing could be some sort of ToString with an expression (not a composite string expression as happens in master).
  • improve formatting routines that take into account $CharacterEncoding. Alternatively, instead of AsciiOpToString[] this might be done via ToString of an expr . Either the formatting routine or ToString uses tables from MathicsScanner

Are you saying that this kind of approach doesn't work?

@mmatera
Contributor

mmatera commented Oct 7, 2022

Ok, let's keep this:

Pretend that I had written instead:

  • improve rules for Boxing and how Boxing works for the different kinds of forms. From the tests, it looks like the output after Boxing could be some sort of ToString with an expression (not a composite string expression as happens in master).
  • improve formatting routines that take into account $CharacterEncoding. Alternatively, instead of AsciiOpToString[] this might be done via ToString of an expr . Either the formatting routine or ToString uses tables from MathicsScanner

Are you saying that this kind of approach doesn't work?

Almost. What I am saying is that whatever makes boxes_to_text work will also make ToString work. So,

  • We need to improve format and MakeBoxes. That part does not involve the encoding.
  • Improve boxes_to_text so that it takes the encoding into account. Let's figure out the best way to translate strings using the mathics-scanner tables.
  • From this PR, let's keep the part that handles the settings for the default encoding.

@rocky
Member Author

rocky commented Oct 7, 2022

Ok, let's keep this:

Pretend that I had written instead:

  • improve rules for Boxing and how Boxing works for the different kinds of forms. From the tests, it looks like the output after Boxing could be some sort of ToString with an expression (not a composite string expression as happens in master).
  • improve formatting routines that take into account $CharacterEncoding. Alternatively, instead of AsciiOpToString[] this might be done via ToString of an expr . Either the formatting routine or ToString uses tables from MathicsScanner

Are you saying that this kind of approach doesn't work?

Almost. What I am saying is that whatever makes boxes_to_text work will also make ToString work. So,

  • We need to improve format and MakeBoxes. That part does not involve the encoding.
  • Improve boxes_to_text so that it takes the encoding into account. Let's figure out the best way to translate strings using the mathics-scanner tables.
  • From this PR, let's keep the part that handles the settings for the default encoding.

@mmatera Good - I think we now have a better understanding of things.

I have been thinking more about how to separate code organization and divide up the work. (Too often we are working on the same kinds of things.)

I am going to redo #541, which is what I had said I'd do a while ago. The focus there, though, is going to shift to handling the kinds of calls that involve a non-string expr and, after getting something minimal there, it can expand to expr with different kinds of Forms.

My current feeling around writing any new builtin nowadays is to have a clearer separation between the top-level built-in function and its implementation after function arguments are converted.

We do this somewhat already for things in the algorithm.py module, and something similar is done in eval_N. Let's say the internal work-horse code for ToString[] is called eval_ToString().

I believe that the eval_ToString() portion can also be used by whatever lower-level formatting routines that would get called in formatting.
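
A minimal sketch of that separation, in Python, might look as follows. Only the name `eval_ToString` comes from the discussion; the tuple-based expressions, the `render` helper, the `ToStringBuiltin` class and its `eval` method are hypothetical and do not reflect how Mathics builtins are actually declared.

```python
def render(expr, form: str) -> str:
    """Render a toy expression: a tuple (head, *args) or an atom."""
    if not isinstance(expr, tuple):
        return str(expr)
    head, *args = expr
    parts = [render(arg, form) for arg in args]
    if form == "OutputForm" and head == "Plus":
        return " + ".join(parts)
    return f"{head}[{', '.join(parts)}]"


def eval_ToString(expr, form: str = "OutputForm") -> str:
    """Work-horse: takes plain Python arguments, no option handling."""
    return render(expr, form)


class ToStringBuiltin:
    """Hypothetical top-level builtin: converts arguments, then delegates."""

    def eval(self, expr, options: dict) -> str:
        form = options.get("form", "OutputForm")
        return eval_ToString(expr, form)


expr = ("Plus", "a", ("F", "x"))
print(ToStringBuiltin().eval(expr, {}))  # a + F[x]
print(eval_ToString(expr, "FullForm"))   # Plus[a, F[x]]
```

Because `eval_ToString` takes ordinary Python values, a lower-level formatting routine could call it directly without going through the builtin's argument conversion.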

Higher-level design and implementation is not your strong point, finding edge cases, testing against a running WMA are your strong points. So after I have a first rough cut of this, feel free to dig in and find flaws and fix those flaws. However the overall structure will likely be correct or at least not grossly incorrect. Example: the difference between an S-Expression and an M-expression is slight compared to not understanding what an S-Expression is and why it is important and how it is used.

There are a number of unfinished things that had been started and haven't been finished. As I recall, #565 was close to being done, and then there was that detour of writing some code to reformat the entire codebase. (All that is needed right here is to address the format inconsistency of that one new class).

And there are a number of examples in mathics-core/test/builtin/test_makeboxes.py that need to be fixed. Would you please focus on those until I have a rough cut of ToString? Thanks.

@mmatera
Contributor

mmatera commented Oct 7, 2022

@mmatera Good - I think we now have a better understanding of things.

I have been thinking more about how to separate code organization and divide up the work. (Too often we are working on the same kinds of things.)

Yes, the problem is that many parts are still very entangled. A good workflow example was the part of the formatters that split very well the work of processing "Boxed" expressions into different formats (text, SVG, HTML). Now, what I would like to have is something similar for the "Formatting" part (Expression -> Formatted Expression -> BoxedExpression).

I am going to redo #541, which is what I had said I'd do a while ago. The focus there, though, is going to shift to handling the kinds of calls that involve a non-string expr and, after getting something minimal there, it can expand to expr with different kinds of Forms.

That would be great. To do that, notice that for most of the basic cases, ToString essentially takes the argument (a general Expression), formats it into Boxes (essentially, applies the MakeBoxes rules) and then converts the boxes into strings.
However, there are some cases that are handled in a different way (for example, Boxed expressions including GraphicsBox), so it would be great to have this implemented separately.

My current feeling around writing any new builtin nowadays is to have a clearer separation from the top-level built-in function and its implementation after function arguments are converted.

I fully agree with this. This is one of the reasons that makes me avoid using time to implement new Builtins.

We do this somewhat already for things in the algorithm.py module, and something similar is done in eval_N. Let's say the internal work-horse code for ToString[] is called eval_ToString().

Seems a good plan.

I believe that the eval_ToString() portion can also be used by whatever lower-level formatting routines that would get called in formatting.

Actually, I had been oscillating for some time between both approaches (using ToString to format expressions, or using MakeBoxes to produce the result of ToString). At this point, it seems that format_element would be the basic function on which MakeBoxes and ToString should be based. Then both functions should use (slightly) different implementations of boxes_to_text.

Higher-level design and implementation is not your strong point; finding edge cases and testing against a running WMA are your strong points. So after I have a first rough cut of this, feel free to dig in, find flaws, and fix them. However, the overall structure will likely be correct, or at least not grossly incorrect. Example: the difference between an S-Expression and an M-expression is slight compared to not understanding what an S-Expression is, why it is important, and how it is used.

Sure. All my previous proposals just aim at reproducing the WMA behavior in general situations. But you are probably in a better position to propose a good implementation. Please just take my code as a sketch (let's say, a Play Doh model) of how the whole thing should connect an input with an output.

There are a number of things that were started and haven't been finished. As I recall, #565 was close to being done, and then there was that detour of writing some code to reformat the entire codebase. (All that is needed right here is to address the format inconsistency of that one new class.)

And there are a number of examples in mathics-core/test/builtin/test_makeboxes.py that need to be fixed. Would you please focus on those until I have a rough cut of ToString? Thanks.

Yes, that was my plan. However, to make them work I need to rework how format_element / MakeBoxes works. I am working on a "Play Doh" model, but it is not finished yet.

rocky and others added 4 commits October 9, 2022 13:05
The Infix rule should not be adding formatting. Do this in makebox
evaluation.

As a result, we now use
operator-to-{ascii,unicode}. operator-to-{unicode-wl} still needs
adding.

Some hard nl's on docstrings on Builtins have been removed since
formatting doesn't handle that.

Some methods and classes have been alphabetized better.
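As a rough illustration of the table lookup described in the commit message above: the entries below are illustrative only, and the names `OPERATOR_TO_ASCII`, `OPERATOR_TO_UNICODE`, `infix_operator` and `box_infix` are hypothetical. The real tables are the JSON files built by `./admin-tools/make-op-tables.sh`; their exact layout is not shown in this thread.

```python
import json

# Stand-ins for the generated operator-to-ascii / operator-to-unicode tables.
OPERATOR_TO_ASCII = json.loads(
    '{"LessEqual": "<=", "Rule": "->", "Unequal": "!="}'
)
OPERATOR_TO_UNICODE = json.loads(
    '{"LessEqual": "\\u2264", "Rule": "\\u2192", "Unequal": "\\u2260"}'
)


def infix_operator(name: str, encoding: str) -> str:
    """Pick the operator string for `name` according to the encoding.

    Falls back to the ASCII form when the Unicode table has no entry.
    """
    ascii_form = OPERATOR_TO_ASCII[name]
    if encoding.upper().replace("-", "") in ("UTF8", "UNICODE"):
        return OPERATOR_TO_UNICODE.get(name, ascii_form)
    return ascii_form


def box_infix(name: str, operands: list, encoding: str = "ASCII") -> str:
    """Join already-formatted operands with the chosen operator.

    The Infix rule itself carries no formatting; the decision is made
    here, at boxing time.
    """
    operator = infix_operator(name, encoding)
    return f" {operator} ".join(operands)


print(box_infix("Rule", ["x", "y"]))
print(box_infix("LessEqual", ["a", "b"], encoding="UTF-8"))
```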

$CharacterEncoding now respects MATHICS_CHARACTER_ENCODING environment variable
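A small sketch of how such an environment override can be read. The function name `system_character_encoding` is hypothetical, and the fallback to `sys.getdefaultencoding()` is an assumption; the thread does not say what Mathics uses when the variable is unset.

```python
import os
import sys


def system_character_encoding(environ=os.environ) -> str:
    """Return the encoding named by MATHICS_CHARACTER_ENCODING, if set.

    The fallback below is an assumption made for this sketch.
    """
    return environ.get("MATHICS_CHARACTER_ENCODING", sys.getdefaultencoding())


# Passing a plain dict makes the behavior easy to check without
# touching the real process environment.
print(system_character_encoding({"MATHICS_CHARACTER_ENCODING": "ASCII"}))
```

From a shell, the equivalent would be setting `MATHICS_CHARACTER_ENCODING=ASCII` in the environment before starting Mathics.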
mathics/builtin/makeboxes.py: correct the boxing rules for infix
operators started previously.

files.py, image.py: OpenRead and Import examples now need to specify the
CharacterEncoding

base.We start to remove formatting decisions inside MakeBox rules.
In particular for Infix. For Integrate, we've noted the problem.

calculus.py: Note improper Makebox rules, Add symbols which might be
used in the future to start to do this correctly.

symbols.py: use fn as the more proper way to set a short name

test_format.py: revert StandardForm and TraditionalForm formatting for
DifferentialD. This is something we should address in the future.
@rocky rocky marked this pull request as ready for review October 11, 2022 10:33
We get different results on different systems. Right now we don't have
in place proper formatting for output with DifferentialD in them.

So rather than test for the presence of something rigidly wrong,
we skip these tests for now.
@rocky rocky merged commit eceb744 into master Oct 11, 2022
@rocky rocky deleted the ascii-op-to-unicode branch October 11, 2022 11:13