
[Discussion/Specification] Relayout/Remap legacy fonts ? #15

Closed
laicasaane opened this issue Nov 25, 2017 · 59 comments

Comments

@laicasaane
Contributor

laicasaane commented Nov 25, 2017

I've just found out that the non-breaking space is missing from the Annatar, Eldamar and Sindarin fonts. If I use that space, it falls back to the default font and the spacing is incorrect.

Also, the Glaemscribe processor seems to ignore the non-breaking space. I've added a definition for it in the charset and used it in the mode, but the output text contains only normal spaces.

@BenTalagan
Owner

Hi Laicasaane! Nice to see you here again. I've just checked these fonts and you're right. This is an interesting 'fine-tuning' matter that I had never thought about. I'll fix this and provide you with a solution today or tomorrow. I also have quite a few bug fixes/small changes here and there that I need to commit - they were waiting to be released in March, but I'll probably take the occasion to do a release during the weekend instead :)

@BenTalagan
Owner

NB : After looking more closely, the non-breaking space does not seem to be missing from the Sindarin font.

@laicasaane
Contributor Author

laicasaane commented Nov 25, 2017

You are right about Sindarin.

Just to clarify a little, here is an example of non-breaking space usage in my mode:
{PUNCTUATIONS} --> NBSPACE {_PUNCTUATIONS_} SPACE
The expected result is that p ! won't be separated at the end of a line.

@BenTalagan
Owner

Thanks for the explanation ; this is very clear, and your solution is very clever. I love it because I always put a space before punctuation signs in tengwar, and I see no case where it should not be done that way. (Now I'm seriously pondering integrating this approach into the Glaemsrafu website's transcriptions :) )

For your specific need, adding the NBSPACE character to the charsets will indeed be sufficient. I am also looking for a way not to lose non-breaking spaces when they are present in the input.

@BenTalagan
Owner

Okay. I've released version 1.1.8, and it should address your issue. It was a bit tricky because, under the hood, the engine uses functions like "trim" in one or two places, which do unpredictable things to the nbsp. To avoid this, I pre-substitute every nbsp with another character (in the Dingbats range) and allow the use of a pre-defined variable {NBSP} which matches this input character.
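Conceptually, the pre-substitution works like this (a minimal JavaScript sketch; the placeholder character and function names are illustrative, not Glaemscribe's actual internals):

// Swap nbsp for a placeholder before tokenization, so that trim()-like calls
// cannot eat it, then swap it back when producing the output.
var NBSP        = "\u00a0";
var PLACEHOLDER = "\u2750"; // some otherwise unused char, here in the Dingbats block

function protectNbsp(input)  { return input.replace(/\u00a0/g, PLACEHOLDER); }
function restoreNbsp(output) { return output.replace(/\u2750/g, NBSP); }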

I had already added this technique for another mode which I will publish in March (a Japanese tengwar mode). I wanted to be able to use the '_は' combination in the input to disambiguate the kana 'は' from the subject particle 'は' (pronounced wa), and realized that '_' could not be used in rule input because it is already used for defining word beginnings / ends. So I have also added a pre-defined variable {UNDERSCORE} for that purpose.

Fonts have been updated accordingly here.

Charsets have been updated accordingly, with the NBSP char definition.

I have also added the following rule to all tengwar modes :
{NBSP} --> NBSP

This makes it possible to keep non-breaking spaces coming from the input.

In your use case, I believe a nice addition would be to treat punctuation signs the same way whether or not they are preceded by an nbsp. This would add the nbsp only if it is missing. A simple example :

({NULL},{NBSP}) . --> NBSP PUNCT_DOT

@laicasaane
Contributor Author

laicasaane commented Nov 25, 2017

I have tested the new version, but the problem still persists. Ah, I use the mode editor for testing, I wonder if it's the cause?

screenshot

@BenTalagan
Owner

BenTalagan commented Nov 25, 2017

Nay, it's because you've also tried to make it work with a regular space. I think it will be difficult to make it work that way, because the space is used at a low level in the engine for the tokenization of words, so we're not on solid ground here (that's why there is no pre-defined {SPACE} variable that you could use here --- and the rule you've written would transcribe "space!" as " !" :D )

However, you can try to input p! and you will see that it outputs TINCO NBSP PUNCT_EXCLAM, so this was what you were trying to achieve at first.

Still, there's a way to treat regular spaces the way you want to do it : in the preprocessor with a regexp. You can replace all \s*! by  ! with a non-breaking space (and generalize for all punctuation chars).

@BenTalagan
Owner

BenTalagan commented Nov 26, 2017

Hum, maybe I didn't read one of your first posts carefully enough - sorry ^^

The expected result is that p ! won't be separated at the end of a line.

You were already explaining that you'd like to cover the regular space+punctuation case, but I missed that point because I was focusing on the rule itself. Still, I think it falls into the 'disambiguation/normalisation' category rather than into the transcription logic category, so it'd be better done with the preprocessor (in the spirit of 'correcting' the user input).

However I've just found that there's a bug with my latest patch if you want to do it with the preprocessor. I'm investigating.

@BenTalagan
Owner

BenTalagan commented Nov 26, 2017

Found the problem : this was due to my patching of special chars (underscore/nbsp) before applying the preprocessor instead of between the preprocessor and the processor. I've released a patch. Now you should be able to write things like :

\rxsubstitute "\\s*([!.:;])" " \\1"

in the preprocessor (meaning: replace a series of spaces, or nothing, followed by a punctuation sign with an nbsp followed by that punctuation sign).

Of course this would simplify your logic in the punctuation section a lot; you would only need to keep : {NBSP} --> NBSP.

You can also extend the rule to add a space AFTER the punctuation sign if it's missing (I think that was your intention).
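For clarity, here is the same substitution expressed as a plain JavaScript regexp (illustrative only, not Glaemscribe code):

// Any run of spaces (possibly empty) before a punctuation sign becomes a
// single non-breaking space.
"p!".replace(/\s*([!.:;])/g, "\u00a0$1");   // -> "p\u00a0!"
"p  !".replace(/\s*([!.:;])/g, "\u00a0$1"); // -> "p\u00a0!"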

@laicasaane
Contributor Author

I really want to avoid preprocessing as much as I can, because there are things that are simply not appropriate to preprocess.

@BenTalagan
Owner

BenTalagan commented Nov 27, 2017

Spaces are inherent to the way words are separated. Processor rule groups define classes of characters that determine how "words" are made (a series of characters from the same class). Spaces are the only characters that are special when processing the input. They are not part of words (except for the nbsp, which could be). This can't be changed; it would mean complex and useless changes in the engine.

There's a pre-processor in glaemscribe made especially for these kinds of hacks, and it offers a solution to your problem: you just have to add one rule in the preprocessor, \rxsubstitute "\\s*([!.:;])" " \\1". So why complain?

@laicasaane
Contributor Author

laicasaane commented Nov 27, 2017

I've made a test, and it seems my issue is caused by the editor: it cannot preserve the non-breaking space character and always converts it into a normal space.

My initial choice was putting normal spaces around the punctuation. But then I realized that the normal space might cause the punctuation to be moved to the next line alone.
Expected:

...(very long text)... p !

Real result:

...(very long text)... p 
!

So I want to put a non-breaking space before the punctuation instead and the whole block p ! will be moved to the next line, rather than just !.
Expected after using non-breaking space:

...(very long text)...
p !

@BenTalagan
Owner

BenTalagan commented Nov 27, 2017

I have tested with the editor yesterday and it was working. But I had to be very careful because copy-pasting nbsp sometimes replaces it with a normal space.

Be sure to have a real nbsp in the rule \rxsubstitute "\\s*([!.:;])" " \\1", just before \\1.

One way to do it is copy paste it from chrome/firefox console :

"\u00a0"

then press enter, then copy paste the resulting space.

My own nbsp did not survive the copy pastes up to github it seems ...

@laicasaane
Contributor Author

Right, the problem is copy-paste. I've printed the result's codepoints to firefox console and there is \u00a0 indeed.

@BenTalagan
Owner

BenTalagan commented Nov 27, 2017

Cool, thanks for your confirmation!

I wish I could simplify the parsing of args so that we don't have the double escapings \\ . Also it would be nice if we could enter \u00a0 directly instead of a real nbsp in the preprocessor rules but for now it does not seem to work. This is one of the things that is really old and has not been refactored.

@laicasaane
Contributor Author

laicasaane commented Nov 27, 2017

Nah, I think the problem is not copy-paste. After investigating the glaemscribe_editor.js file, I think this is the place where the editor loses all NBSPs in the output:

transcribed_selector.html(ret)

I suggest replacing \u00a0 with &nbsp; before sending it to the output panel. You should also replace \n with <br> to retain line breaks.
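A sketch of what that could look like (jQuery-style, illustrative only; transcribed_selector and ret are the names from the snippet above):

// Escape the transcription, then make nbsp and newlines explicit so the
// output panel (and copy-paste from it) keeps them.
var html = ret
  .replace(/&/g, "&amp;")        // basic HTML escaping first
  .replace(/</g, "&lt;")
  .replace(/>/g, "&gt;")
  .replace(/\u00a0/g, "&nbsp;")  // keep non-breaking spaces
  .replace(/\n/g, "<br>");       // keep line breaks
transcribed_selector.html(html);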

@BenTalagan
Owner

😲 damn you're right! This looks like a firefox 'bug' ; I'm under Vivaldi (chromium) and it does not behave the same. My test from the console :

$(".transcribed").html("1111\u00a01111")

Firefox indeed does not translate the nbsp. I guess there's no patch other than yours. I will investigate tonight. Thanks for noticing!

@laicasaane
Contributor Author

Well, I've looked around and made a little test, but it didn't solve the problem. So I think the best method would be to export the output to an actual file.

@BenTalagan
Owner

Hmm after investigating, when I input this command :

$(".transcribed").html("1111\u00a01111")

Firefox DOES put a nbsp; . It just does not show it in the elements inspector, but if you do "Edit as html" it will show the nbsp; .

What is your web browser ?

@laicasaane
Contributor Author

laicasaane commented Nov 27, 2017

Ah, yes, indeed I'd forgotten to check "Edit as HTML". But either way it's not what I want. I just want to copy the output text and paste it into Word more easily. So I was looking for a way to retain the NBSP in the editor. But it turns out that there is no solution, since no browser actually shows &nbsp; inside HTML text. So the only possible solution is to have a button to export the output text as a file. I have made a change to the Glaemscribe editor on my local machine to serve my current needs. The problem is now solved. 😃

@BenTalagan
Owner

BenTalagan commented Nov 27, 2017

Ahhhh ok 😃

It's very clear to me now. Funnily enough, I've just checked : this bug is present in both chrome and firefox; when you select and copy, nbsps are lost (at least under macos) !!

Worse than with the output, this is quite dangerous if copy-pasting the code of the mode itself (which I do between textmate and the editor), so I probably should rework the preprocessor rules syntax to be able to use '\u00a0' for safety.

Also, I should probably add a 'copy html' button to both the editor and the official glaemscribe UI later for that purpose.

Sorry for the misunderstandings in this thread; this is what you get when struggling with ghostly invisible chars! 😃 Anyway, it's cool that it's working now. And imho the solution using a preprocessor rule is simpler to write and understand, and more logical (you clean the input before processing).

@laicasaane
Contributor Author

laicasaane commented Nov 27, 2017

You don't have to say sorry. It's me who wasn't clear enough from the beginning. Thanks a lot for your support!

@BenTalagan
Owner

BenTalagan commented Nov 28, 2017

Ok, so I have rewritten the glaeml args parser to handle unicode-escaped characters (such as \u00a0), as well as a few others : \n \t \\ . This should not break any existing mode, but will allow you to write your preprocessor rules for non-breaking spaces cleanly. This is available with 1.1.12.
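As a minimal JavaScript sketch of the kind of escape handling described (this is not the actual glaeml parser, just an illustration of the idea):

// Resolve \uXXXX, \n, \t and \\ inside an argument string; everything else is
// left untouched.
function unescapeArg(arg) {
  return arg.replace(/\\(u[0-9a-fA-F]{4}|n|t|\\)/g, function (_, esc) {
    if (esc === "n") return "\n";
    if (esc === "t") return "\t";
    if (esc === "\\") return "\\";
    return String.fromCharCode(parseInt(esc.slice(1), 16)); // \uXXXX
  });
}

With something like this in place, a preprocessor argument can contain the six characters \u00a0 instead of a literal (and easily lost) non-breaking space.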

I have also added a small toolbar to the editor. It allows copying the current transcription to the clipboard. It relies on a trick (create a hidden textarea, put the transcription inside, trigger the copy). It works well under at least chrome & safari - textareas seem to behave better than regular divs, which lose nbsps when copying.
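For reference, the trick looks roughly like this (a generic jQuery sketch, not the editor's exact code):

// Hidden-textarea copy trick: textareas tend to preserve nbsp on copy better
// than regular divs. document.execCommand is the legacy clipboard API this
// kind of trick relies on.
function copyTranscription(text) {
  var area = $("<textarea>")
    .css({ position: "fixed", left: "-9999px" })
    .val(text)
    .appendTo("body");
  area[0].select();
  document.execCommand("copy");
  area.remove();
}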

Unfortunately Firefox still loses the non breaking spaces which is lame ^^; . I guess you're right, only the "Save to file" feature would be the safest.

For now, I'm happy with the current toolbar because I'm working under Vivaldi, but if you think it'd be useful to have a "save to file" button, I could easily add it as well.

@laicasaane
Contributor Author

laicasaane commented Nov 29, 2017

Thanks. Direct copying is great. But if browsers don't behave the same then you should include "save to file" button.

It's so wonderful that the parser can handle unicode escape in rxsubstitute!
Wah, does that mean we can now write {COMMA} === \u002c?

@BenTalagan
Owner

BenTalagan commented Nov 29, 2017

Thanks. Direct copying is great. But if browsers don't behave the same then you should include "save to file" button.

I see no reasons not to do it in the editor... so it's done, and pushed 😃 (and finally it works well with Firefox!)

I have quite a lot of changes for the official UI, but I'm keeping their publication for March. So I may add these export features to it too, but there are already quite a lot of options, so I always fear overloading the UX with too many features.

It's so wonderful that the parser can handle unicode escape in rxsubstitute!
Wah, does that mean we can now write {COMMA} === \u002c?

Haha. Good question but nope, sorry. Because it is only implemented for glaeml arguments - and rules are written in glaeml text nodes. You're pushing the limits of the engine 😄

However, your question is interesting. I should probably add more predefined variables for characters that are used in the glaemscribe rules syntax. For the moment, we only have :

{NULL}, {NBSP}, {UNDERSCORE}

But others like {COMMA}, {ASTERISK}, {LPAREN}, {RPAREN}, {LBRACKET} and {RBRACKET} could also be added. It's hard for me to see some use cases, but why not. It may just cost a little more processing resources.

@laicasaane
Contributor Author

laicasaane commented Nov 30, 2017

I should probably add more predefined variables for characters that are used in the glaemscribe rules syntax.
But others like {COMMA}, {ASTERISK}, {LPAREN}, {RPAREN}, {LBRACKET} and {RBRACKET} could also be added. It's hard for me to see some use cases, but why not. It may just cost a little more processing resources.

You don't have to do this kind of work, because a bunch of predefined variables for these symbols isn't what I need. I just think that being able to use unicode escapes on the right-hand side of === would be useful at some point, since my mode contains various intermediate symbols (produced by the preprocessor), and some of these symbols aren't properly displayed in text editors.

For example these rules convert numbers to their intermediate form with decimal mark, digit group separator, negative brackets:

1.2     > 1٬2
-1.2    > 【1٬2】
1,000.4 > 1٬000٫4 
\** Place negative numbers in quotes **\
\rxsubstitute "\\-(((\\d+(\\,\\d+)*)+((\\.\\d+)|(\\,\\d+)+)*)|((\\.\\d+)|(\\,\\d+)+)+)" "【\\1】"

\** Convert decimal mark **\
\rxsubstitute "[.](\\d)"      "\u066b\\1"
\rxsubstitute "(\\d)[.](\\d)" "\\1\u066b\\2"

\** Convert digit group separator **\
\rxsubstitute "^[,](\\d)"    ", \\1"
\rxsubstitute "\\s[,](\\d)"  ", \\1"

\rxsubstitute "(\\d)[,](\\d)" "\\1\u066c\\2"
\rxsubstitute "[,](\\d)"      "\u066c\\1"
{LNEGATIVE}           === 【
{RNEGATIVE}           === 】
{DECIMAL_MARK}        === ٫
{DIGIT_GROUP_MARK}    === ٬

@laicasaane
Contributor Author

However, I think you should add these symbols besides NBSP: word-joiner, word-divider, zero-width space, zero-width non-joiner.

@BenTalagan
Owner

BenTalagan commented Nov 30, 2017

However, I think you should add these symbols besides NBSP: word-joiner, word-divider, zero-width space, zero-width non-joiner.

Sure, because there are no other ways to make them work now (even if \uXXXX was working). The parsing of rules is done with regexps full of \s and calls to strip/trim, and they are unreliable concerning nbsp and other special spaces. So currently they would probably need the 'fake char' cheat to work because I don't see myself rewriting the whole rule parsing yet :)

You don't have to do this kind of work because a bunch of predefined variables for these symbols aren't what I need.

Yes, but they might prove useful at some point - imagine a mode where you'd like to use '*' at the end of some words to mark special terminations : currently it's not possible to do it because * will probably break the rule. Same problem existed for underscore, hence my patch.

I just think that if we can make unicode escape on the right-hand side of === that would be good sometime. Since my mode contains various intermediate symbols (as a result of the preprocessor), and some symbols aren't properly shown in text editors.

It could be cool to add the feature lower down, into the glaeml norm itself, and allow some escaped chars in text nodes. It would be very close to what I've done for glaeml args, but in a simpler version : only \ and \uXXXX, maybe \t and \n, though even that is not certain. You would then have it everywhere; you could even write everything as just a series of \uXXXX characters (which would be odd).

Since it is non blocking (except maybe for special spaces), I keep it in mind but will take my time to implement these few points.

@laicasaane
Contributor Author

Since it is non blocking (except maybe for special spaces), I keep it in mind but will take my time to implement these few points.

Sure, that should be just a minor enhancement. 😃

@BenTalagan BenTalagan changed the title Non-breaking space is missing Enhance handling of special spaces and glaemscribe's language reserved characters Jan 7, 2018
@BenTalagan
Owner

BenTalagan commented Aug 1, 2018

I have a feeling that there are some misunderstanding between us?

I don't think so, but I may sound very hasty since I'm under a heavy professional workload (finishing a big project) atm :) So my answers may seem inaccurate or incomplete; sorry for that.

On the contrary, I believe we're on the same level of expectation and comprehension. However, I'd like to take the time to think carefully about the implications of what we want to do before rushing and working under urgency. I'd like to avoid proposing something too exotic, not universal, etc. Thus I feel there's a need to discuss these matters with as many people as possible who are experienced in both font design and tengwar lore, to avoid designing something flawed (out of ignorance).

For example, one could be tempted to use the uppercase character range to map some bearer tengwar, but I'm not sure it's a good idea, because at the FreeTengwar font project, Mach has already thought about the possibility of having tengwar fonts with both uppercase and lowercase characters. So ideally, I think each bearer tengwa should be mapped onto a lowercase char that has an uppercase counterpart.

Another example, which gives me headaches concerning the English mode, is the handling of sa-rinci and ar-rinci. There are a lot (and I mean a lot) of versions of these infernal hooks in the latest PE, and fonts are (without exception) not well enough designed to handle them. Laying them out will be complicated.

@BenTalagan
Owner

I remember you've said that you were afraid of losing glyph data (kerning, spacing, ...) when moving glyphs around. I'm not familiar with FontForge, so could you confirm that this is the case? If we use an automated tool to move the glyphs, could that issue happen?

I don't have a straight answer to this yet, but yes, that's exactly what I'm afraid of. We should take time to look at FontForge's file format and how glyphs are pointing to each other amidst these features.

Ok, just to confirm : remapping a char by hand in FontForge breaks its kerning (the kerning info is removed completely). So it's probably better done with scripting, but one should be extra careful of how glyphs are pointing to each other with indexes.

@laicasaane
Contributor Author

I think each bearer tengwa should be mapped onto a lowercase char that has an uppercase counterpart.

Apparently. In fact, all Latin blocks combined surely have more than enough uppercase-lowercase pairs to hold Tengwar. So you don't need to worry about this.

Laying them out will be complicated.

Could you briefly describe this issue concerning sa-rincer & ar-rincer in English mode? I don't understand what is complicated in re-mapping these glyphs.

one should be extra careful of how glyphs are pointing to each other with indexes

This could only be verified after actually re-mapping the glyphs of a font, couldn't it?

@BenTalagan
Owner

BenTalagan commented Aug 1, 2018

In fact, all Latin blocks combined surely have more than enough uppercase-lowercase pairs to hold Tengwar. So you don't need to worry about this.

I'm quite confident too on that point :)

Could you briefly describe this issue concerning sa-rincer & ar-rincer in English mode? I don't understand what is complicated in re-mapping these glyphs.

There is a large number of versions of them, which are not really normalized. Long, short, inclined or not, oriented left, oriented right, attached at the top, attached at the bottom. It's sometimes hard to know if a right oriented sa-rince has a left-oriented counterpart or not (and what appearance it should have).

It's unclear to me whether multiple versions of sa-rincer could be combined or not. I haven't found any example of this in Tolkien's works. But there are cases where we are stuck; for example, a word like axes in my English mode would tend to accumulate two sa-rincer on the x (k + sa_rince short + tehta + sa_rince long). I don't have a solution for handling this yet.

The flourished sa-rince is considered a graphical variant of the sa-rince by the FreeTengwar Font project and Everson's norm, but it's called ar-rince in some versions of the quenya writing system in the latest PE, so in my opinion it would be better considered an independent tengwa.

Long sa-rincer may bear tehtar at the end of words (see the old english modes in Sauron Defeated). How do we handle complex combinations of multiple sa-rincer + tehtar?

Nothing else comes to my mind yet, but I'm pretty sure there are other problems that I can't recall. So to answer your question quickly: it's no more complicated to remap these chars than any others, BUT there is (in my opinion) a design problem concerning them which it would be nice to solve if we decide to publish a new layout.

This could only be verified after actually re-mapping the glyphs of a font, couldn't it?

I'm not sure I completely understand your question ; I think that because FontForge loses kerning info when moving a glyph, we should not do it within FontForge, but with an external script. However, this means being really careful about how things are described in the sfd format, because glyphs point to each other for various reasons (kerning is one). It seems they use their internal index to point to each other, so maybe the remapping can be done without breaking the internal indexes. I need to investigate this more atm.
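As a very rough Node.js sketch of that idea (NOT the actual remapper; it assumes the glyph headers in FontForge's .sfd text format carry a line of the form "Encoding: <slot> <unicode> <glyph index>" and that the font uses a Unicode encoding where slot and code point coincide):

const fs = require("fs");

// Rewrite the slot/code point of remapped glyphs, but keep the third field
// (the internal glyph index) untouched so that kerning and other
// glyph-to-glyph references keep working.
function remapSfd(inPath, outPath, mapping /* { oldCodePoint: newCodePoint } */) {
  const out = fs.readFileSync(inPath, "utf8").split("\n").map(line => {
    const m = line.match(/^Encoding:\s+(\d+)\s+(\d+)\s+(\d+)\s*$/);
    if (!m) return line;                     // not a glyph encoding line
    const target = mapping[Number(m[2])];
    if (target === undefined) return line;   // glyph not remapped
    return `Encoding: ${target} ${target} ${m[3]}`;
  });
  fs.writeFileSync(outPath, out.join("\n"));
}

// Hypothetical example: move a glyph from U+0021 to U+E000 in the PUA.
// remapSfd("font.sfd", "font_remapped.sfd", { 0x21: 0xE000 });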

@laicasaane
Contributor Author

laicasaane commented Aug 1, 2018

As I understand it, the new layout really has nothing to do with this sa-rincer problem, because I only suggest re-mapping all the existing glyphs in the Dan Smith layout to a new layout, which would ease our work on Tengwar documents at first. Solving the composition of sa-rincer (and other marks) would be another project, and would come afterwards, as it mainly involves designing new glyphs or new positions for existing marks.

@BenTalagan
Owner

As I understand it, the new layout really has nothing to do with this sa-rincer problem, because I only suggest re-mapping all the existing glyphs in the Dan Smith layout to a new layout, which would ease our work on Tengwar documents at first. Solving the composition of sa-rincer (and other marks) would be another project, and would come afterwards, as it mainly involves designing new glyphs or new positions for existing marks.

I do not totally agree with that way of thinking. If we design a new layout, it should be designed to solve the largest possible number of issues, so all these issues should be identified before designing the layout, so that we're not stuck afterwards. See my remark about uppercase tengwar above : remapping should take into account that one day we may have uppercase chars; so the new layout is affected by that non-existent feature in the sense that it's a good idea to map bearer tengwar onto characters that have uppercase counterparts. Likewise, even if we do not implement better versions of sa-rincer, they should still be thought of during the specification stage, e.g. where should we keep room for them? In the Latin diacritic range? Or somewhere else?

One of the first steps before developing anything is to identify all the issues that a new layout should resolve, and all the features it should offer.

By the way, I've started to write a remapper for sfd fonts. That should greatly help us to advance later on.

@BenTalagan
Owner

BenTalagan commented Aug 1, 2018

Useful links regarding line breaking matters :

Unicode line breaking algorithm
Unicode line breaking classes

Thus, concerning digits, we could easily extend our mapping to other ranges, like the Arabic numerals, as long as they belong to the 'NU' class.

@laicasaane
Contributor Author

Thanks for the links, I'll read them later. But here are some quick thoughts:

Likewise, even if we do not implement better versions of sa-rincer, they should still be thought of during the specification stage, e.g. where should we keep room for them? In the Latin diacritic range? Or somewhere else?

To me this is really not a complicated problem. It's true that we should consider every problem beforehand and leave room for non-existing glyphs. But there is plenty of room for this; we don't need to be overly careful. For example, about the uppercase tengwar: if we intend to leave room for them, the number of existing marks that can be attached to a tengwa would be doubled; there would be two sets of marks, one for lowercase tengwar and another for uppercase. And even if some unforeseen glyphs appear in the future (exotic glyphs for some languages?), in total it would hardly exceed all the Latin ranges combined.

Thus, concerning digits, we could easily extend our mapping to other ranges, like the Arabic numerals, as long as they belong to the 'NU' class.

I really don't encourage the use of any range that isn't Latin, since the modern and powerful language-aware functions of today's software can misinterpret a specific block of text and treat it differently because, to them, that block of text is in another language.

@BenTalagan
Owner

BenTalagan commented Aug 2, 2018

To me this is really not a complicated problem. It's true that we should consider every problem beforehand and leave room for non-existing glyphs. But there is plenty of room for this; we don't need to be overly careful. For example, about the uppercase tengwar: if we intend to leave room for them, the number of existing marks that can be attached to a tengwa would be doubled; there would be two sets of marks, one for lowercase tengwar and another for uppercase. And even if some unforeseen glyphs appear in the future (exotic glyphs for some languages?), in total it would hardly exceed all the Latin ranges combined.

Yes, but ideally I'd like to keep a correspondence between lowercase and uppercase tengwar: if a tengwa (let's say parma) is mapped onto p, it would be nice to have uppercase parma on P. I haven't made an exact count of the bearer tengwar, but I believe there are between 50 and 100, so the slots run out quite fast if we want to keep this regular.

As well, there are a lot of tehtar, and ideally I'd like some kind of regularity (e.g. all the a-tehtar would use variants of a: â, ä, etc.), but it's really easy to exhaust the slots, especially because the Latin ranges do not always contain all the variants of one diacritic for a, e, i, o, and u - e.g. the double acute ő and ű exist but not their a, e, i counterparts, so this prevents us from using the double acute if we want to keep the regularity.

Also, it could be nice to take into account the fact that some fonts propose variants for ligature purposes.

I really don't encourage the use of any range that isn't Latin, since the modern and powerful language-aware functions of today's software can misinterpret a specific block of text and treat it differently because, to them, that block of text is in another language.

There's simply no other choice for numbers. The only characters that belong to the numeric class in the Latin range are the digits, so, according to the line breaking spec, there's no other way than using numeric chars from another range, and respecting the numeric class is, in my opinion, the cleanest thing to do. (By the way, I've tested the string 0123456789١٢٣٤٥٦٧٨٩ in firefox, chrome, safari and openoffice; they all treat it as one entity and the wrapping works well.)

@BenTalagan
Owner

BenTalagan commented Aug 2, 2018

Hmm, I think there may be an alternative. If we make full use of the private area of the unicode range, it looks like line breaking will work flawlessly (to be verified more thoroughly). It means that, if a sequence of characters wholly belongs to the private use space, the sequence will be treated as a word. The documentation I've given above states :

[...] Unassigned code positions, private-use characters, and characters for which reliable line breaking information is not available are assigned this line breaking property. The default behavior for this class is identical to class AL (NB: alphabetic characters). Users can manually insert ZWSP or WORD JOINER around characters of class XX to allow or prevent breaks as needed. [...]

If this is indeed true, my suggestion is to stick as much as possible to the Free Tengwar Font mapping for the basic tengwar and extend that mapping for the multiple tehtar variants. It seems that this is already what Enrique Mombello has done in elfica, if we look at the private use area carefully.

That would mean copying the tengwar from their initial place to the private use area while taking care not to lose the kerning information. This seems like a nice solution, since the font would not lose its original mapping. But unfortunately that's not quite true: in the original DS mapping, punctuation signs are used to map some bearer tengwar, and in my opinion we should squash these slots to put the real elvish punctuation signs there (the Everson/FTF norm puts them in the private use area, but maybe we should have them in place of the real punctuation signs for better browser handling? This remark also holds for brackets, parentheses, quotes, etc).

So, to conclude, such versions of the fonts would not be usable easily with a standard keyboard. It's a major drawback and I don't know if it's a good way to go or not - but still, I like the idea better than what we've discussed before, since we're not reinventing anything but only extending what already exists.

@laicasaane
Contributor Author

laicasaane commented Aug 3, 2018

Using the PUA range is indeed the option most compatible with the nature of Unicode, since it's not going to break any Unicode rules. Initially I wanted to work on that option. But after considering the usability with modern software, I've noticed that the PUA will drop us into a desert where we can't easily (or possibly at all) take advantage of any tool we currently have, because the language functions are all built into the OS, and there's no way for us to specify what is what in the PUA range for the OS and/or software to understand. And since my work mostly concerns actual documents in Tengwar, not just some short texts, this time I strongly prefer the usefulness of a new Latin-based layout.

@BenTalagan BenTalagan changed the title Enhance handling of special spaces and glaemscribe's language reserved characters [Discussion/Specification] Relayout/Remap legacy fonts ? Aug 3, 2018
@BenTalagan
Owner

BenTalagan commented Aug 3, 2018

I have renamed the current topic, as it has drifted from its original subject. I am creating a new issue, #18, for the other subject that was still on hold.

@BenTalagan
Owner

BenTalagan commented Aug 3, 2018

But after considering the usability with modern software, I've noticed that the PUA will drop us into a desert where we can't easily (or possibly at all) take advantage of any tool we currently have, because the language functions are all built into the OS, and there's no way for us to specify what is what in the PUA range for the OS and/or software to understand.

Sorry for being reluctant :) , but I'm not convinced. Do you have any concrete examples where a full PUA solution would have limitations compared to a Latin solution? I'm more and more drawn to a solution that sticks to the FTF project mapping (or a mixed solution, with tengwar in the PUA but punctuation moved to the Latin blocks, for example, though I'm not even sure that's needed - the FTF mapping is already a mixed solution, and some of the punctuation chars are outside the PUA).

Having some examples would help me a lot to understand your concern.

By the way, I have finished writing the remapping tool for sfd/FontForge files (with copy/move/delete directives and kerning conservation) so technically we're good, and we can really focus on the layout debate.

@laicasaane
Contributor Author

Sorry for being reluctant :) , but I'm not convinced.

No, I understand you; I was myself drawn to the PUA layout for a long time, even before I discovered Glaemscribe.

Do you have any concrete examples

For the time being, the first limitation I've hit with the current layout and the FTF layout is that I cannot make use of page numbering in any word processor, because there is no way to define custom numeric characters, as far as I know.

we can really focus on the layout debate.

Now that we have an automated tool, can we just run some quick tests?

@BenTalagan
Owner

For the time being, the first limitation I've hit with the current layout and the FTF layout is that I cannot make use of page numbering in any word processor, because there is no way to define custom numeric characters, as far as I know.

I think I see the problem for page numbering. I've just tested to get a better impression, and these are the problems I've seen (with Open Office).

  • the control you have on page numbering is automatic and limited
  • the format can be chosen from a fixed list (standard Arabic, Roman, Greek, etc.) but they use the wrong characters for Roman numbering ; there are two Unicode ranges for this, 0x2160-0x216F and 0x2170-0x217F. These ranges would have been nice for mapping elvish digits (they are wider) but unfortunately they are not used/proposed by OpenOffice, which uses alphabetic letters instead.
  • OpenOffice does not let you use a custom character set for page numbering, which is sad because it would be really easy to implement. I don't know about other word-processing software.

As far as I can see, even with a correct Latin mapping there are things you would not be able to achieve : base 12, right-to-left order, a dot on the most significant digit; none of these elvish features would be possible. So the best you can do atm, and you're right about this, is to have Latin decimal numbering, with tengwar numerals replacing the Latin digits. Clearly, here (and I'm sure you agree with this), we're facing a limitation of the software, not of the layout. So what you describe is clearly a hack, because a Latin digit is a Latin digit, a Japanese digit is a Japanese digit, and a tengwar digit is a tengwar digit.

So I think we should stick to the FTF/Everson norm BUT with additions, especially hacks for dealing with these kinds of retro-compatibility/legacy problems. For example, I could remap a legacy font onto the FTF/Everson norm, and duplicate the tengwar decimal digits to the Latin digit slots (and, why not, duplicate some punctuation signs to Latin slots too). In addition, I would add some ranges in the PUA for mapping legacy versions of the tehtar. Such a font would solve all your problems : word spacing and breaking, page numbering. It would be clean because aligned with the norm. And it would not interfere with anything (except for the hacks).

Now that we have an automated tool, can we just run some quick tests?

The word 'quick' might be a bit too optimistic 😄 What are the needs for your test? Does the hypothetical font that I've described suit you well?

It might imply a new font, and a new cst (charset file) at minimum. The mode files should not change, we're at a lower level here. So there's still a little bit of work to do in all cases.

@laicasaane
Contributor Author

Such a font would solve all your problems

"all" is too optimistic a word, 😂 I agree with you, this problem is really a software limitation, we can never solve it by ourselves. Having some hacks can help only a little, but that's sufficient enough for now.

@machsna

machsna commented Aug 5, 2018

Thanks Talagan for pointing me to this discussion.

It appears both of you are now sympathizing with using an expanded FTF layout, mapping the tengwar numerals over our Arabic numerals, thus violating the Unicode standard. This would be justified by solving the problem of tengwar page numbers in certain applications. I have several reservations against such a solution:

  • By violating the Unicode standard, you get a problem when mixing scripts, e.g. in LaTeX code or when copy-pasting.
  • You would need a complicated solution if you wanted to transcribe tengwar texts such as PE 20 Q9b, Q10h, Q11j where Tolkien has mixed our Arabic numerals with tengwar.
  • It would be a hard-coded choice for one specific system of tengwar numerals when in fact we know of at least seven different tengwar numeral systems (see below).
  • It does not solve the problem of potential duodecimal numbering.
  • What if, say, the tengwar numeral for 3 turns out to be identical with nwalme (cf. Gildir’s analysis Mellonath Daeron : Tengwar numerals in the King's Letter)?

This last reservation brings me to a broader point: Our knowledge of numerals in tengwar is very sketchy. Up to now, we know of at least seven different numeral systems to be used with tengwar, some of them only partly or indirectly attested (I am leaving aside the various options of marking numerals with dots or bars):

  1. J.R.R. Tolkien’s numerals for 1, 3, 4, and 6 in DTS 49.
  2. J.R.R. Tolkien’s tengwar numerals in DTS 87 – 1: parma, 2: tinco, 3: calma, 4: quesse, 5: umbar, 6: ando, 7: anga, 8: ungwe, 9: unque, 0: stemless vilya.
  3. J.R.R. Tolkien’s Rúmilian numerals in DTS 87, to be used with tengwar – nota bene: these are different from the various Rúmilian numerals we know of, cf. Helios’s analysis Rúmilian Numerals.
  4. J.R.R. Tolkien’s Arabic numerals in PE 20 Q10h, Q11j, to be used with tengwar – they have a style closely resembling the tengwar, thus 1 looks like a short carrier with a dot above, 3 like alda, 6 like esse, 0 like úre; optionally, 9 may look like rómen (so there you may have an eighth system).
  5. J.R.R. Tolkien’s tengwar numerals in PE 20 Q10h, Q11j – 1: long carrier with a dot above, 2: tinco, 3: ando, 4: vilya, 5: esse, 6: silme, 7: calma, 8: anga, 9: rómen, 0: úre.
  6. The primary letters according to Christopher Tolkien in Quettar 13.
  7. Christopher Tolkien’s numerals in Quettar 13 – they look similar to J.R.R. Tolkien’s numerals from DTS 49.

In my opinion, Christopher’s numerals are certainly not the best choice. They differ from J.R.R. Tolkien’s numerals in DTS 49: the DTS 49 numerals look much more like actual tengwar than Christopher’s numerals do. Unless the primary material Christopher’s numerals are based on is published, we cannot know for sure whether they truly represent the shapes J.R.R. Tolkien had intended. I believe the only reason this system has become popular on the internet is that it was there before the internet.

Long story short: I do not think that a possible solution for tengwar page numbering is sufficient reason for mapping some system of tengwar numerals over our Arabic numerals. Instead, I believe the way to go is as follows:

  • All tengwar characters should be encoded in the Private Use Areas.
  • In a tengwar font, our Arabic numerals may be drawn in a way that resembles the tengwar so they stylistically fit, like J.R.R. Tolkien did in PE 20 (cf. the fourth numeral system above).
  • For practical problems such as page numbering, we would need specific solutions for the various applications, e.g. a LibreOffice Extension or a LaTeX package. Ideally, such a solution should have many options so people can choose between the different tengwar numeral systems. I do not have the knowledge for coding such a solution, but I am certain that it is possible (cf. e.g. a Sub AddPersianPageNumbers macro for LibreOffice Impress).

And now for something completely different:

  • Are you considering adding variant tehtar to the PUA for placing them on different width signs (as in the Dan Smith encoding)? I think this would be a very poor solution. These placement choices are better hard-coded into the font by means of the OpenType Layout. For a long time, there was a severe lack of applications capable of displaying advanced OpenType Layout operations. Basically, you were restricted to Xe(La)TeX. This has changed in recent years, with major applications such as LibreOffice, Firefox or Google Chrome now having OpenType Layout capabilities that are at least as powerful as the ones of the Adobe Creative Suite (or Cloud).
  • @laicasaane: Talagan has told me you are using a tengwar mode for Vietnamese. Do you have a description of that mode somewhere?

@BenTalagan
Owner

BenTalagan commented Aug 5, 2018

Hi Mach, and thanks a lot for taking the time to participate in this discussion, with such a neat and well-documented post! I appreciate it all the more because Glaemscribe was designed with the FTF Project in mind all along. I'm totally convinced that the FTFP is evidently the most advanced milestone and the most serious work on bringing and adapting Tolkien's writing systems to modern technologies. The FTF Project is the achievement of one or two decades of work by various dedicated people and should be, in my opinion, the obligatory basis and bedrock for future work.

I don't really feel like I'm sympathizing with the idea of violating the Unicode standard ^^; (that sounds a bit too harsh), but I nonetheless follow you completely on your argument about trying to respect (in a perfect world) the segmentation of things and the meaning of characters (hence my argument above in that same direction). Your remarks on numerals are all pertinent and convincing, and they only strengthen my first impression that such a mapping (from tengwar digits to Latin ones) should be clearly identified as a dirty hack (in the meantime, we have to cope with the fact that having all digits in the PUA prevents us from assigning them the Unicode 'NU' numeral class - but I don't think it's a major problem, since in that configuration line-breaking matters should be solved anyway). [Please note, nonetheless, that within that scope the aim of the hack is only to duplicate the tengwar digits (while still keeping them in the PUA) to solve that very specialized problem, not to alter the FTF mapping.]

That being said, we're not (at the moment) looking at porting the legacy fonts to a full OpenType solution. Laicasaane regularly reports very sharp issues regarding tengwar transcriptions, and that makes the project evolve in the right direction; because the vast majority of existing tengwar fonts are not equipped with clever OpenType features and still use the old Dan Smith mapping, I designed Glaemscribe to be able to handle all that variety (the fonts coming with their flaws). But now that the project is getting stronger and stronger, it looks like people enjoy that variety and at the same time want more: more features, more precision, more cleanliness. Thus we're in the middle of the ford. There's a need for better fonts, but (I may be wrong) imho there has been a lack of momentum in that field for a few years. My guess is that for a designer it's quite complicated to deal with all these Unicode and OpenType features (which are food for us poor engineers).

The Glaemscribe project started with a challenge proposed by my friend Didier Willis concerning the transcription of Sarati, and the first step was to port Måns Björkman's Sarati Eldamar font to OpenType so that it could handle the diacritics well. I did it using GPOS tables and it was not that easy; and it was only dealing with Sarati, which is probably easier to instrument because the diacritics are put on the other side of the main direction line. Tengwar may offer the most complex cases, because both bearers and diacritics could hypothetically change their aspect when combined (size, rotation, even shape), so we have to deal with all kinds of combinations, ligatures, stacking and so on. I consider these font design tasks out of my scope (or let's say, tangential) with regard to my work on Glaemscribe, for many reasons: first, I'm not a font designer, so there are probably real artists out there who would do far better; and secondly, I'd like to focus on the engine and the modes, and I'm already overloaded with work in that domain (sometimes I feel like I don't even manage to). But in the meantime, I'm forced to recognize that fonts are a prerequisite :D (obviously), and I always end up torn in between.

So what interests me here in Laicasaane's requests is that I see an opportunity for keeping our variety of fonts and renderings, while transitioning slowly to the next stage. It would be too complicated and too long (and maybe not feasible for me!) to instrument the legacy fonts up to a point where they would be polished and opentype-full-featured ; but it's possible for me to bring them closer to that state in a reasonable amount of time, and a unicode remapping is a logical first step.

Are you considering adding variant tehtar to the PUA for placing them on different width signs (as in the Dan Smith encoding)? I think this would be a very poor solution. These placement choices are better hard-coded into the font by means of the OpenType Layout. For a long time, there was a severe lack of applications capable of displaying advanced OpenType Layout operations. Basically, you were restricted to Xe(La)TeX. This has changed in recent years, with major applications such as LibreOffice, Firefox or Google Chrome now having OpenType Layout capabilities that are at least as powerful as the ones of the Adobe Creative Suite (or Cloud).

This is what I've been considering and what I've implemented over the last two days, in that transitional perspective ; it's not meant to be set in stone or normalized. I've used the last blocks of the PUA for that purpose, to avoid hypothetical collisions in the medium term. I do agree with you that it's a poor solution, but it's a (relatively) quick hack and it's ready for use as a stopgap. I still hope that these steps taken with Glaemscribe will motivate other people by proving that things can be done, and facilitate their efforts in working on new fonts :). (By the way, could you provide us with any news concerning the OpenType version of Telcontar? That would be awesome to have it in Glaemscribe, as it would offer the most accurate Tengwar rendering ever).

@laicasaane : that was hard work, but new font adaptations and charsets are now available in the unicode_font_remapping branch. As Mach has said very clearly, please remember that these are transitional pieces of work : elvish numerals have been copied to the Arabic numeral slots to fit your needs, so this is clearly a personalized hack (in the meantime, these fonts do not have any other Latin chars, so for the moment this does not overlap anything), and they still use the old DS tehtar variants (remapped at the end of the PUA). Could you please beta-test it to see if it fits your needs?

@machsna : Thanks a lot again, I feel like this is exactly the kind of discussion that is needed to make things evolve the right way. Please feel free to share other thoughts or participate again any time!

Cheers, Talagan.

@machsna

machsna commented Aug 6, 2018

(By the way, could you provide us with any news concerning the OpenType version of Telcontar? That would be awesome to have it in Glaemscribe, as it would offer the most accurate Tengwar rendering ever).

I forgot: Here is a proof of concept for porting the Tengwar Telcontar intelligence from SIL Graphite to the OpenType Layout: OpenType Layout test page. The SIL Graphite column will only display properly in Firefox, the only browser to support this technology. The right column displays perfectly in Firefox and Google Chrome, and almost perfectly in Safari (in the absence of any explicit feature setting, mkmk is off). If I remember correctly, the Microsoft browsers have similar issues to Safari. They can be solved with a funny CSS hack:

font-feature-settings: 'dumb' 1;

Explanation: This instructs the browsers to switch on the OpenType Layout feature dumb. The point is that no such feature exists, so a smart browser will ignore this instruction, while dumb browsers may use it to improve their font display. ☺

The only feature I could not get to work is the underlining. Anyway, underlining is perhaps a problem best solved on the application level, not on the font level.

The font is not yet ready to be released because it still lacks a few characters, especially several uppercase characters. But of course, if you must, you will find it in the SVN repositories (Arno already uses it on Tecendil).

@BenTalagan
Owner

The font is not yet ready to be released because it still lacks a few characters, especially several uppercase characters.

That's where I had left off ; I remember reading some exchanges last year on the FTF list (and even participating, if I recall correctly) and had put a 'wait for release' note in a corner of my mind.

But of course, if you must, you will find it in the SVN repositories

I'll take a look at this today and test an integration. It will probably be a hundred times easier than for the other fonts 😃 Anyway, it hadn't occurred to me that the latest version of the font in the SVN was advanced enough to be used; this is awesome news! Thanks!

@laicasaane
Contributor Author

laicasaane commented Aug 6, 2018

@machsna I initially supported the idea of a Unicode standard layout, especially after learning about the FTF Project. I liked that idea so much that I decided to change the layout of Tengwar Annatar by myself. But that was a short-lived idea: my lack of font designing skills is a huge drawback I cannot overcome. After discovering Glaemscribe, I immersed myself in making a transcriber for the Quốc ngữ script as well as revising the mode for Vietnamese. When I was finally able to put together some long documents, problems arose one after another. Thus I have to consider some quick hacks to finish my work first. Anyway, thanks for the information regarding macros; I'll see if I can make it work. If software-level problems can be solved by software-level solutions, then we don't need any hacks in the fonts, and that would be more appropriate.

And here is the mode for Vietnamese, written in English:
https://drive.google.com/open?id=0B4vpFvDhhjSmcHlwNlh2YXpfcVE

@BenTalagan BenTalagan added this to the 1.2.0 milestone Sep 15, 2018
@BenTalagan BenTalagan self-assigned this Sep 22, 2018
@BenTalagan
Owner

Closed after 1.2.0 release.
