
revert wl-code->unicode-equivalent in parsing. Adding amstex tables #43

Closed
wants to merge 2 commits

Conversation

mmatera
Contributor

@mmatera mmatera commented Aug 30, 2022

This PR adds the table of AMSTeX symbols and fixes #42.

@rocky rocky marked this pull request as draft August 30, 2022 14:50
@rocky
Member

rocky commented Aug 30, 2022

Thanks for making the addition for amslatex.

As for DifferentialD, I will look at this when I get a chance. I need to understand why the change was made. Note that the tables are used by a number of front ends, not just by mathics-core.

@@ -148,7 +148,7 @@
 ("ConjugateTranspose", r" \uf3c9 "),
 ("HermitianConjugate", r" \uf3ce "),
 ("Integral", r" \u222b "),
-("DifferentialD", r" \U0001D451 "),
+("DifferentialD", r" \u1d451 | \uf74c"),
Contributor Author

This is something about which I am not completely sure. I guess that \[DifferentialD] should be parsed as \uf74c for the internal representation, and then it makes sense that a match with \uf74c should represent a match with the operator. \u1d451, on the other hand, is a printable representation of \[DifferentialD], so it does not need to match the operator in general parsing.
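As a sketch of what the alternation buys us (illustrative only, and using the eight-digit supplementary-plane escape `\U0001D451` rather than the `\u1d451` shown in the diff):

```python
import re

# Match either the printable standard-Unicode form of DifferentialD
# (MATHEMATICAL ITALIC SMALL D) or the WL private-use code point.
differential_d = re.compile("\U0001D451|\uf74c")

assert differential_d.match("\U0001D451") is not None  # standard Unicode form
assert differential_d.match("\uf74c") is not None      # WL-specific form
assert differential_d.match("d") is None               # plain ASCII "d" is not matched
```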

Member

If I recall correctly the intent was to allow for different kinds of input devices sending different kinds of codes.

In this situation and the one above, it may be that you are being myopic - you see a problem and consider only the aspect that you are running into problems with while possibly not understanding the larger picture.

With respect to the table change, the things to think about are not just the existing kind of tagging; changes can also be made in the translator to produce specific tables. So there might be a table that is custom to mathics-core, while mathics-scanner gets something similar but different.

And we can also consider expanding the fields in the YAML file, as has been done a number of times in the past as we discover different kinds of needs. For example, operator information was recently added; before that, unicode equivalences for non-standard WL symbols.

At any rate I need some time to look this over and the history and think about this and that may happen on the weekend.

Contributor Author

> If I recall correctly the intent was to allow for different kinds of input devices sending different kinds of codes.

Ok, but whatever the input is, what we want to pass to the tokenizer is a uniform WL string. Then, after all the evaluations are carried out, on output we want to translate the WL string to an encoding that the interpreter knows how to handle.

> In this situation and the one above, it may be that you are being myopic - you see a problem and consider only the aspect that you are running into problems with while possibly not understanding the larger picture.
>
> With respect to the table change, the things to think about are not just the existing kind of tagging; changes can also be made in the translator to produce specific tables. So there might be a table that is custom to mathics-core, while mathics-scanner gets something similar but different.
>
> And we can also consider expanding the fields in the YAML file, as has been done a number of times in the past as we discover different kinds of needs. For example, operator information was recently added; before that, unicode equivalences for non-standard WL symbols.
>
> At any rate I need some time to look this over and the history and think about this, and that may happen on the weekend.

Member

> Ok, but whatever the input is, what we want to pass to the tokenizer is a uniform WL string.

Yes, that is a new and different situation. A couple of possibilities come to mind that don't change this particular piece of code, but I'd have to think about it some more.

Member

Given what I wrote in #43 (comment), this change is okay as is.

Member

Also, I'd note in a comment that \u1d451 is there to handle standard Unicode, while \uf74c is there to handle WL-specific Unicode.

amstex_characters = {
    v["wl-unicode"]: v.get("amslatex", v.get("wl-unicode"))
    for v in data.values()
    if "amslatex" in v and "wl-unicode" in v
}
Member

Shouldn't "and" be "or"?

That is, we have an amslatex character if we have either amslatex noted in a character entry or a wl-unicode character.

Also, it looks to me like we need to separate input from output. On input it is okay to use wl-unicode, but on output we should use Unicode, since we don't have Wolfram front ends.
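A minimal sketch of how the `and` guard behaves (the `data` entries here are hypothetical, shaped like the YAML records the comprehension iterates over, not the real mathics-scanner table):

```python
# Hypothetical character entries; "FormalA" deliberately lacks "amslatex".
data = {
    "RightArrow": {"wl-unicode": "\u2192", "amslatex": r"\rightarrow"},
    "FormalA":    {"wl-unicode": "\uf800"},
}

# The comprehension as written in the PR: both fields must be present,
# so the v.get(...) fallback for "amslatex" can never actually fire.
amstex_characters = {
    v["wl-unicode"]: v.get("amslatex", v.get("wl-unicode"))
    for v in data.values()
    if "amslatex" in v and "wl-unicode" in v
}

assert amstex_characters == {"\u2192": r"\rightarrow"}
# With "and", entries like FormalA are dropped entirely; with "or" they
# would instead fall back to mapping their wl-unicode value to itself.
```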

-k: v.get("unicode-equivalent", v.get("wl-unicode"))
-for k, v in data.items()
-if "wl-unicode" in v
+k: v["wl-unicode"] for k, v in data.items() if "wl-unicode" in v
Member
@rocky rocky Sep 3, 2022

I don't understand the motivation for this change. Please explain.

The former says to me that something is a named character if it has a wl-unicode tagging, and that we want to use the unicode equivalent first and, if that doesn't exist, the wl-unicode value.

The change says that we want the wl-unicode value whether or not there is a unicode equivalent.

I can probably guess what is in your mind but I'd prefer not to guess.

Contributor Author

The point is we want to take a string in the internal representation, where you have wl-unicode characters, and you want to replace them by a LaTeX command. So, the key is the wl-unicode, and the value is the amstex code.

Contributor Author

This is why I think we need both fields available (hence the "and").

Member
@rocky rocky Sep 3, 2022

> The point is we want to take a string in the internal representation, where you have wl-unicode characters,

Yeah, this is what I thought. And this was what I was getting at in an earlier remark about being myopic and going into some area that isn't an area of expertise without hesitation or solicitation and just blasting away. That then puts pressure on me to play catch-up with what was going on in your mind via PRs. (And sloppy commits, vague PR comments, and using repositories' PRs like a local sandbox don't help here.)

I get that you'd like to use some sort of WL unicode symbol internally in mathics-core for things like operators or named symbols. However, as I mentioned before, mathics-scanner is already used in other repositories, in particular mathicsscript and mathics-pygments.

Therefore what we need are additional tables for symbols which can be used internally by mathics-core.
And the descriptions of what the tables mean also need to be expanded upon.

Contributor Author

> Yeah, this is what I thought. And this was what I was getting at in an earlier remark about being myopic and going into some area that isn't an area of expertise without hesitation or solicitation and just blasting away. That then puts pressure on me to play catch-up with what was going on in your mind via PRs. (And sloppy commits, vague PR comments, and using repositories' PRs like a local sandbox don't help here.)

No @rocky, it was not my intention to put pressure on you. I put up the PR because I thought I had understood the logic of this package. I left some issues explaining what I think is necessary to make an idea work on the mathics-core side, related to something we have previously discussed (using mathics-scanner to implement the conversion of wl-unicode characters to amstex commands). I am not in any hurry to have this finished, but I wanted to communicate the idea of what is needed and how I think we can implement it. That's all.

> I get that you'd like to use some sort of WL unicode symbol internally in mathics-core for things like operators or named symbols. However, as I mentioned before, mathics-scanner is already used in other repositories, in particular mathicsscript and mathics-pygments.

Yep, but what I am proposing does not change the behaviour with respect to the last release. It just adds some functionality and proposes reverting a change that breaks mathics-core to some extent.

> Therefore what we need are additional tables for symbols which can be used internally by mathics-core. And the descriptions of what the tables mean also need to be expanded upon.

Yes, one possibility is to generate the new tables on the mathics-core side, and then migrate them here, if that seems useful.

Member
@rocky rocky Sep 3, 2022

> No @rocky, it was not my intention to put pressure on you.

I didn't say, and definitely didn't think, that there was intention. I never ascribe to maliciousness that which is easily explained by thoughtlessness. I wanted to bring it up because it does put pressure. Anything Mathics-wise I had thought I might do this week has been scrapped to make sure we move forward on the work that was done during the week and before.

Overall I think it is good intent. And better formatting is much needed!

> (the one of using mathics-scanner to implement the conversion of wl-unicode characters to amstex commands).

Yes, this I was aware of from the discussions. But please, again, be sensitive to when you are straying outside what you know well. And when that happens, stop and ask.

> propose to revert a change that breaks mathics-core in some extent.

This is what I mean by myopic. Did you look at the commit being reverted to see why it was changed? What you consider a revert for one purpose may be a breakage for whatever reason the code was changed in the first place.

Past experience has always been that when you venture into an area outside your expertise, the code needs to be looked at more carefully. And it may need explaining, and this takes time.

> Yes, one possibility is to generate the new tables on the mathics-core side, and then migrate them to here, if it seems useful.

That probably leads to the fastest iteration in development. But again, one thing this activity has made clear to me is that we need a better description of the tables.

Contributor Author

Here 8a2dc4a is where the change in the table of NamedCharacters was introduced. There is no further description in the commit. In any case, it was introduced after the last release, so I would expect that the reversion does not cause conflicts or breakage in any of the (currently released versions of the) other packages.

Member

I have now gone over the code and its history in detail, and looked at mathicsscript and mathics-pygments.

Here is what I understand. Input has to allow standard Unicode forms because those are the input devices we currently support. As far as I am aware, we don't have keyboards that generate WL unicode symbols.

A new desire in the mathics-core project is to canonicalize various character symbols into the WL unicode symbol. That's okay and it makes sense. Another equally valid alternative would be to canonicalize to the "standard" Unicode symbol.

What is needed is a canonicalization inside the scanner, to either WL unicode or standard Unicode, whenever the other kind of unicode was input. When the canonicalization is to standard Unicode, I don't think this is an issue. I gather there is some reason for preferring WL unicode - what is it?

If input devices always produced WL unicode, or in other words if WL had not defined its own unicode for some symbols, then we wouldn't need this extra step.

But when you want to support both kinds of input, a new step is needed to do the conversion when the other form is used.

k: v.get("unicode-equivalent", v.get("wl-unicode")) is there to support Unicode input; when that field doesn't appear because it is the same as wl-unicode, we pick it up from wl-unicode.

This is also the basis for the comment about not adding a unicode-equivalent when it is not needed, that is, when it is the same as the ASCII or the wl-unicode value. That comment should be elaborated upon, but I'll do that later.

If that were followed (and #45 violates it; we should fix this and add a test for improper use of unicode-equivalent), then the scanner would convert to wl-unicode only when there is a unicode-equivalent field. A new table should be created for this.
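The fallback described above can be seen on a couple of hypothetical entries (the shapes are illustrative; "RawAt" and its value are invented for the example, not taken from the real table):

```python
# Hypothetical entries: DifferentialD has a distinct unicode-equivalent;
# a plain ASCII character does not, so it falls back to wl-unicode.
data = {
    "DifferentialD": {"wl-unicode": "\uf74c", "unicode-equivalent": "\U0001D451"},
    "RawAt":         {"wl-unicode": "@"},
}

named_characters = {
    k: v.get("unicode-equivalent", v.get("wl-unicode"))
    for k, v in data.items()
    if "wl-unicode" in v
}

assert named_characters["DifferentialD"] == "\U0001D451"  # unicode-equivalent wins
assert named_characters["RawAt"] == "@"                   # falls back to wl-unicode
```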

@rocky
Member

rocky commented Sep 5, 2022

Today I was running tests in mathics core and I am getting a failure in test/format/test_format.py:

test/format/test_format.py:687: AssertionError
_ test_makeboxes_text[Integrate[F[x], {x, a, g[b]}]-Subsuperscript[\u222b, a, g(b)]\u2062F(x)\u2062\uf74cx-form53-Non trivial SubsuperscriptBox] _

str_expr = 'Integrate[F[x], {x, a, g[b]}]'
str_expected = 'Subsuperscript[∫, a, g(b)]\u2062F(x)\u2062\uf74cx', form = <Symbol: System`TraditionalForm>
msg = 'Non trivial SubsuperscriptBox'

    @pytest.mark.parametrize(
        ("str_expr", "str_expected", "form", "msg"),
        mandatory_tests,
    )
    def test_makeboxes_text(str_expr, str_expected, form, msg):
        result = session.evaluate(str_expr)
        format_result = result.format(session.evaluation, form)
        if msg:
>           assert (
                format_result.boxes_to_text(evaluation=session.evaluation) == str_expected
            ), msg
E           AssertionError: Non trivial SubsuperscriptBox
E           assert 'Subsuperscri...2F(x)\u2062𝑑x' == 'Subsuperscri...\u2062\uf74cx'
E             - Subsuperscript[∫, a, g(b)]⁢F(x)⁢x
E             ?                                 ^
E             + Subsuperscript[∫, a, g(b)]⁢F(x)⁢𝑑x
E             ?                                 ^

And if I read this right, it is saying that it got the standard Unicode differential "d", but the WL differential "d" is what the test expects. So here I am inclined to say the test expectation needs adjusting, and the current mathics-scanner behavior is the better one.

Looking at where the behavior starts to differ, commit df948fe is the last "good" commit and commit 8a2dc4a, "More letter-like work" (with massive amounts of changes), is the first "bad" commit, although in my view it is the other way around: "More letter-like work" outputs the Unicode-rendered "d", while before it we get the WL-rendered "d".

I don't want to deal with this now, having spent too much time on the weekend looking at this stuff, which is not what I had planned to do. But I did want to report what I have found.

@mmatera
Contributor Author

mmatera commented Sep 7, 2022

> Today I was running tests in mathics core and I am getting a failure in test/format/test_format.py: [...]
>
> I don't want to deal with this now having spent too much time on the weekend looking at this stuff which is not what I had planned to do. But I did want to report what I have found.

I found a similar problem some time ago (see #284 in Mathics-core).

Probably, at the level of the pytests in mathics-core, the formatter tests should check just the ASCII (or maybe even ANSI) encoded output. But for that, we should go over the expected behavior of mathics-scanner. When I have some time, I will try to write down how I think this should go. However, I would first like to finish the format refactor, and then face this other aspect.

@rocky
Member

rocky commented Sep 18, 2022

I spent some time today going over the tables in #51

I am ready to answer any questions you have about this stuff. Note that in this PR the standard unicode for DifferentialD is \U0001D451 and that is not the same thing as \u1d451 .
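The distinction matters because Python's `\u` escape takes exactly four hex digits; a quick check:

```python
# "\u1d451" is NOT MATHEMATICAL ITALIC SMALL D (U+1D451): the \u escape
# consumes exactly four hex digits, so this is U+1D45 followed by "1".
s_wrong = "\u1d451"
s_right = "\U0001D451"  # the 8-digit \U escape reaches the supplementary plane

assert len(s_wrong) == 2
assert s_wrong == "\u1d45" + "1"
assert len(s_right) == 1
assert s_right == chr(0x1D451)
assert s_wrong != s_right
```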

This works for me as is in Mathics-core. However, there is work to be done in mathics-core to get operators to display properly, and this is not the role of either the scanner or the parser.

I can't work on this much on Sunday.

Basically the flow is that on input we should accept both WL unicode and standard Unicode. In mathics-core, after parsing, there is no Unicode or ASCII used per se for operators; we just have a node of the operator type, e.g. "And".

Right now, in the formatting rules, we are picking out the object's "operator" string value, which is wrong. Instead we should be using the named character, which we can now get based on the ASCII operator or one of the character codes, and with $CharacterEncoding turn this into the right kind of string.

@mmatera
Contributor Author

mmatera commented Sep 18, 2022

> I spent some time today going over the tables in #51

@rocky, thank you for investigating and fixing this.

> I am ready to answer any questions you have about this stuff. Note that in this PR the standard unicode for DifferentialD is \U0001D451 and that is not the same thing as \u1d451.
>
> This works for me as is in Mathics-core. However, there is work to be done in mathics-core to get operators to display properly, and this is not the role of either the scanner or the parser.
>
> I can't work on this much on Sunday.
>
> Basically the flow is that on input we should accept both WL unicode and standard Unicode. In mathics-core, after parsing, there is no Unicode or ASCII used per se for operators; we just have a node of the operator type, e.g. "And".
>
> Right now, in the formatting rules, we are picking out the object's "operator" string value, which is wrong. Instead we should be using the named character, which we can now get based on the ASCII operator or one of the character codes, and with $CharacterEncoding turn this into the right kind of string.

During the week, I am going to try fixing this in mathics-core.

@mmatera
Contributor Author

mmatera commented Sep 18, 2022

Superseded by #51

@mmatera mmatera closed this Sep 18, 2022
@rocky
Member

rocky commented Sep 18, 2022

> During the week, I am going to try fixing this in mathics-core.

@mmatera Good. Here are current thoughts.

Based on https://mathematica.stackexchange.com/a/3628/73996 ,

if you have the name of the operator, let us say it is "LeftVector", then:

ToExpression["\"\\[LeftVector]\""]

will produce its (standard) Unicode equivalent. But we want to do that only when $CharacterEncoding is not "ASCII" (and what else?); when it is "ASCII", we just want the class's "operator" value. "operator" is a vague name for "the ASCII character string sequence for the given operator".

Based on the Mathematica StackExchange answer, I suspect that there is no WMA built-in to do this kind of conversion based on $CharacterEncoding (or anything similar). So the conversion would have to be wrapped in a WMA If[], and then it would be good to create a non-WMA function for this.

But if we add a non-WMA built-in function to do this, then ToExpression[] is not needed. And the implementation can be much more efficient, since it doesn't go through the parser but instead could just consult conversion tables produced by the mathics_scanner project.

Note that in contrast to what we do now, where operator conversion, e.g. the "And" node to "&&", is done once, when done properly it has to be done every time that operator is used. But the overhead can be driven down using the standard Python cache decorator on a function call that is passed the operator (either the name or the ASCII sequence) and the current $CharacterEncoding value. (Note that in the actual call, this value is known; so the cache would hold a value for each pair of operator name and character encoding. But in reality, $CharacterEncoding wouldn't change much, if at all.)

If a new builtin is used (because there is no direct equivalent in WMA), then there is a question of what conversion table to use.

Above, I used the character name. However, our code currently records the ASCII operator equivalent. So such a function might want a conversion table from ASCII operator string to standard Unicode, and maybe a different one when going to WMA Unicode. But I don't understand how the distinction between the two would be specified.
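A minimal sketch of the cached conversion idea (the table contents and the function name are hypothetical; the real tables would come from the mathics_scanner data):

```python
from functools import lru_cache

# Hypothetical ASCII-operator -> standard-Unicode table; the real one
# would be generated from the mathics_scanner YAML data.
ASCII_TO_UNICODE = {"&&": "\u2227", "||": "\u2228", "->": "\u2192"}

@lru_cache(maxsize=None)
def operator_string(ascii_op: str, character_encoding: str) -> str:
    """Return the display string for an operator under a given encoding."""
    if character_encoding == "ASCII":
        return ascii_op  # keep the ASCII sequence as-is
    return ASCII_TO_UNICODE.get(ascii_op, ascii_op)

assert operator_string("&&", "ASCII") == "&&"
assert operator_string("&&", "UTF-8") == "\u2227"
# Repeated calls with the same (operator, encoding) pair hit the cache.
```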

@rocky rocky mentioned this pull request Sep 18, 2022
@mmatera
Contributor Author

mmatera commented Sep 19, 2022

@rocky, thanks for the reference. I have added #52 to present my thoughts about this.

@TiagoCavalcante TiagoCavalcante deleted the ams_tex branch November 28, 2022 01:54
@TiagoCavalcante TiagoCavalcante restored the ams_tex branch November 28, 2022 01:54
Development

Successfully merging this pull request may close these issues.

The parser should convert NamedCharacters into wl-code as before (1.2.4), not in unicode-equivalent