XML importer stops importing on the escape character #500

Closed
vadi2 opened this issue Mar 31, 2017 · 5 comments

@vadi2
Member

vadi2 commented Mar 31, 2017

The XML importer (for importing packages or loading profiles) chokes on the escape character (0x1B) and does not load anything after, thus corrupting scripts that include it.

Attached is a test alias that demonstrates the problem.
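For anyone without the attachment, the failure is easy to reproduce outside Mudlet. The following is a minimal sketch using Python's standard-library expat-based parser rather than Qt's QXmlStreamReader, so it only illustrates the XML 1.0 rule and not Mudlet's actual loading code; the `<script>` element and `try_parse` helper are made up for the demonstration.

```python
import xml.etree.ElementTree as ET

ESC = "\x1b"  # the escape character, 0x1B


def try_parse(xml_text):
    """Return (True, root text) on success, or (False, error message)."""
    try:
        root = ET.fromstring(xml_text)
        return True, root.text
    except ET.ParseError as err:
        return False, str(err)


# Hypothetical script bodies; only the second contains a raw escape character.
clean = "<script>send('hello')</script>"
dirty = "<script>send('" + ESC + "[1mhello')</script>"

clean_ok, _ = try_parse(clean)
dirty_ok, dirty_err = try_parse(dirty)

print("clean parses:", clean_ok)
print("dirty parses:", dirty_ok, "-", dirty_err)
```

A conforming XML 1.0 parser has to reject the second document, which matches the "nothing after the escape is loaded" symptom.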

Launchpad Details: #LP1397594 Vadim Peretokin - 2014-11-29 20:35:01 +0000

@vadi2
Member Author

vadi2 commented Mar 31, 2017

N.B. the above file is NOT entirely viewable directly in your browser (at least not in one that validates XML); it will error out at the offending escape character!

The problem, as Vadim and I now know, is that the XML 1.0 specification PROHIBITS the ASCII "C0" group of control characters (single-byte values between 0x01 and 0x1F) EXCEPT for Tab, Line Feed and Carriage Return {0x09, 0x0A and 0x0D}. XML 1.1 relaxes this restriction, although there such codes must be entered as numeric character references, i.e. for the above escape in the form "&#27;" or "&#x1B;". That doesn't help us though, because Qt does not support XML 1.1. The documentation for QXmlStreamReader and QXmlStreamWriter is not immediately clear on this unless you read the small print: despite the apparent ability to change the XML version text on the first line of the file that the writer produces, that does not alter the fact that the reader says it is a 1.0 reader. The 1.1 specification {now in its second edition, I believe} has been around since 2004, so it is not as if it is THAT newfangled.
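The XML 1.0 rule can be written down as a small predicate. This is a sketch of the `Char` production from the XML 1.0 specification; the function names `is_xml10_char` and `first_invalid` are mine and not anything in Mudlet or Qt.

```python
def is_xml10_char(cp):
    """True if code point `cp` is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x09, 0x0A, 0x0D)       # Tab, Line Feed, Carriage Return
        or 0x20 <= cp <= 0xD7FF        # everything up to the surrogate block
        or 0xE000 <= cp <= 0xFFFD      # excludes U+FFFE and U+FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


def first_invalid(text):
    """Index of the first character XML 1.0 forbids, or -1 if there is none."""
    for index, ch in enumerate(text):
        if not is_xml10_char(ord(ch)):
            return index
    return -1


print(is_xml10_char(0x1B))        # the escape character
print(first_invalid("ab\x1bcd"))  # position where a 1.0 parser would stop
```

Note that in XML 1.0 a numeric character reference must also name a character that passes this test, so writing the escape as a reference does not get around the restriction there.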

One partial solution, which I want to check further, is to use custom entities for those control characters until Qt can parse XML 1.1 documents. The attached file uses such an entity (which comprises a leading '&', the TLA "esc" and the trailing ';' character). For display purposes that entity (and those for all the other C0 control characters) has a replacement character in the Unicode range {U+2400 to U+241F}, which holds pictorial representations of the C0 characters. Unfortunately the DejaVu series of fonts that we include with Mudlet does not include those glyphs, and the free Symbola font that I want to include in a future Mudlet version (for a very extensive range of map symbols) uses glyphs that look a lot like the IBM PC ROM ones that computer users from the MS-DOS 3.3-6.0 era might recognize. The visual effect I was hoping for is realizable using the FSF's GPLv3 FreeFonts (FreeSerif, FreeSans, FreeMono) or Red Hat's GPLv2 Liberation font set (Liberation[-]Mono, [-]Sans, [-]Sans Narrow, [-]Serif), though others will do.

However, that only covers the part about having a file that a browser and a human reader can read. It means that, provided a suitable font is available on the system, the C0 characters will be displayable in their Unicode form. What is left is that we will also need to hack the editor for the simple "command to send" type QLineEdits and the "script" multi-line edit boxes so that they "store" and "edit" the "&<2or3LetterAcronyms>;" form. These will then need to be "translated" at the point that any such codes get sent to the MUD server, OR, if also permitted, in other places where the user wants them to match MUD server output {perhaps custom telnet sub-command handling code?}.
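The "translate at send time" step might look roughly like the sketch below. It assumes the entity names are the usual lower-case ASCII abbreviations for the C0 controls ("esc", "bel" and so on); only "esc" is confirmed by the attached sample, so the rest of the table and the name `expand_c0_entities` are hypothetical.

```python
import re

# Assumed entity names: the standard abbreviations for code points 0x00-0x1F.
C0_NAMES = [
    "nul", "soh", "stx", "etx", "eot", "enq", "ack", "bel",
    "bs", "ht", "lf", "vt", "ff", "cr", "so", "si",
    "dle", "dc1", "dc2", "dc3", "dc4", "nak", "syn", "etb",
    "can", "em", "sub", "esc", "fs", "gs", "rs", "us",
]
NAME_TO_CHAR = {name: chr(code) for code, name in enumerate(C0_NAMES)}

# Matches "&xx;" or "&xxx;" using lower-case letters and digits only.
ENTITY = re.compile(r"&([a-z0-9]{2,3});")


def expand_c0_entities(text):
    """Replace known C0 entities with the real control character.

    Anything that is not in the table (for example "&amp;") is left as-is.
    """
    def replace(match):
        return NAME_TO_CHAR.get(match.group(1), match.group(0))

    return ENTITY.sub(replace, text)


# What the editor stores versus what would go to the MUD server.
stored = "&esc;[1mbold &amp; bright"
print(repr(expand_c0_entities(stored)))
```

Doing the expansion only at the point of sending keeps the stored and displayed form printable, which is the point of the entity approach.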

At the point we implement this I'd up the Mudlet package version to 1.1 (like I have in the attached sample) and start to process the file in this new way. Then, if Qt gains XML 1.1 support, not only will the first line change but we can increment our package format to 1.2, because I think THEN we'd want to change the entity definitions at the start of the file to put in the permitted numeric C0 codes directly...

tl;dr;

Anyhow I know that I must do some more research before I could compose a complete solution...

Launchpad Details: #LPC Stephen Lyons - 2014-12-03 02:54:13 +0000

@vadi2
Member Author

vadi2 commented Apr 5, 2017

It's still an issue - go to https://codepoints.net/U+000B?lang=en, press Copy to Clipboard at the top-left, and paste it at the beginning of a Mudlet script. Save and reopen Mudlet - the script is gone.

It's one of the XML invalid characters from www-01.ibm.com/support/docview.wss?uid=swg21514211&aid=1. Mudlet should strip them out from the output XML because Qt's parser isn't handling this for us, apparently.

@vadi2
Member Author

vadi2 commented Apr 5, 2017

I don't think we should be upgrading the XML format for this - that is an overkill solution. We can prevent these characters from getting into Mudlet to begin with by stripping them out when writing our save files and also by doing a pre-xml-load replacement.
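A rough sketch of the stripping idea, in Python rather than Mudlet's C++ and with the standard-library ElementTree standing in for Qt's reader and writer. The function name `strip_invalid_xml_chars` and the `<script>` element are placeholders.

```python
import re
import xml.etree.ElementTree as ET

# Everything that is NOT allowed by the XML 1.0 Char production.
INVALID_XML10 = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Drop every character an XML 1.0 parser would refuse to read."""
    return INVALID_XML10.sub("", text)


# A script that starts with U+000B and contains an escape character.
script = "\x0bsend('\x1b[1mhello')"

element = ET.Element("script")
element.text = strip_invalid_xml_chars(script)  # clean before writing
saved = ET.tostring(element, encoding="unicode")

reloaded = ET.fromstring(saved).text            # now loads without error
print(repr(reloaded))
```

The same function could be run over the raw file contents before handing them to the parser, which is the pre-load replacement. The obvious cost is that stripping is lossy: a script that really needs to send an escape character loses it.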

@SlySven
Member

SlySven commented May 4, 2017

ℹ️ We could not upgrade the XML format to switch to 1.1 anyhow. It isn't on the Qt roadmap, and they don't appear to have plans to ever do it!

It is just a pity that the documentation for the relevant methods in the QXmlStreamReader/QXmlStreamWriter classes suggests that you can specify a different XML version, when in fact the arguments are purely cosmetic and non-functional... like male nipples! 😮

@SlySven
Member

SlySven commented May 7, 2017

OK - I think I have a (neat) solution using Unicode U+FFFC {OBJECT REPLACEMENT CHARACTER} plus one from the range U+2401 to U+2424 {CONTROL PICTURE xxxx} as substitutes inside the XML file - the chances of a matching pair of code-points being used in any other circumstances being, I think, infinitesimally small...!
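To make the substitution idea concrete, here is a minimal round-trip sketch in Python. It assumes each forbidden C0 character maps to the control picture at U+2400 plus its code (so escape becomes U+FFFC followed by U+241B); that mapping, the function names, and the use of ElementTree instead of Qt are my assumptions, not the actual Mudlet implementation.

```python
import re
import xml.etree.ElementTree as ET

MARKER = "\ufffc"  # OBJECT REPLACEMENT CHARACTER

# C0 controls that XML 1.0 forbids (Tab, LF and CR are left alone).
_FORBIDDEN = re.compile("[\x01-\x08\x0b\x0c\x0e-\x1f]")
# A marker followed by one control picture from U+2401 to U+241F.
_ENCODED = re.compile("\ufffc([\u2401-\u241f])")


def encode_controls(text):
    """Swap each forbidden control for MARKER plus its control picture."""
    return _FORBIDDEN.sub(
        lambda m: MARKER + chr(0x2400 + ord(m.group(0))), text
    )


def decode_controls(text):
    """Reverse encode_controls after the XML has been parsed."""
    return _ENCODED.sub(lambda m: chr(ord(m.group(1)) - 0x2400), text)


script = "send('\x1b[1mhello')"

element = ET.Element("script")
element.text = encode_controls(script)
saved = ET.tostring(element, encoding="unicode")   # valid XML 1.0

restored = decode_controls(ET.fromstring(saved).text)
print(restored == script)
```

Unlike stripping, this keeps the control characters recoverable, at the price of mangling any text that already contains the marker pair.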
