
Files containing control characters due to dfhack cannot be loaded #14

Closed
PeridexisErrant opened this issue May 14, 2014 · 19 comments

@PeridexisErrant

A known bug with Legends Viewer is that it often crashes if the user attempts to load files exported after playing a fort (via the backup-abandon-export-reload trick).

This is because dfhack currently stores values such as workflow settings as fake historical figures in hex format, and Legends Viewer cannot parse files with such entries.

There are workarounds - such as a script that deletes any line containing such characters, or a future fix to dfhack that hides such entries from export - but it would be good to improve the parsing here.
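The line-deleting workaround mentioned above could look something like the following Python sketch. The filenames and the exact set of stripped characters are assumptions (the actual script isn't shown in this thread); dropping whole lines is crude and can unbalance tags if an entry spans several lines, which is part of why a parser-side fix would be better.

```python
import re

# XML 1.0 forbids most ASCII control characters: 0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")

def strip_bad_lines(in_path, out_path):
    """Copy an export file, dropping any line that contains a control character."""
    with open(in_path, encoding="utf-8", errors="replace") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not CONTROL_CHARS.search(line):
                dst.write(line)
```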

@JyeGuru (Contributor) commented May 14, 2014

Are you able to upload an example export? I'll take a look at it tonight. Can't guarantee anything, but my C# skills have gotten a lot better over the last week or so.

@PeridexisErrant (Author)

Sure - here it is. I've included an archive with the bug (-corrupt), and one without (-clean).

In the process, I seem to have found a new issue: extracting the files and then loading them into LV produces the expected behaviour, but trying to load the archive directly has the XML turn green and then hang at the history text indefinitely. I have no idea why this happens, but since it's not interfering with the main bug test I'll leave it for now. If you quit, you then have to manually remove the files from LV's folder, so the problem must be in the loading rather than the extraction (and maybe LV needs to clean up after an unexpected quit?).

Here's the error message I get:

'�', hexadecimal value 0x01, is an invalid character. Line 571, position 11.
at System.Xml.XmlTextReaderImpl.Throw(String res, String[] args)
at System.Xml.XmlTextReaderImpl.ParseText(Int32& startPos, Int32& endPos, Int32& outOrChars)
at System.Xml.XmlTextReaderImpl.ParseText()
at System.Xml.XmlTextReaderImpl.ParseElementContent()
at System.Xml.XmlReader.ReadStartElement()
at LegendsViewer.Legends.XMLParser.ParseProperty()
at LegendsViewer.Legends.XMLParser.ParseItem()
at LegendsViewer.Legends.XMLParser.ParseSection()
at LegendsViewer.Legends.XMLParser.Parse()
at LegendsViewer.Legends.World..ctor(String xmlFile, String historyFile, String sitesAndPopulationsFile, String mapFile)
at LegendsViewer.FileLoader.load_DoWork(Object sender, DoWorkEventArgs e)
at System.ComponentModel.BackgroundWorker.WorkerThreadStart(Object argument)

@JyeGuru (Contributor) commented May 15, 2014

So, being at work and just glancing at the export XML files... it seems that all the fake Workflow entries have a negative ID. I'll see if there's a cleaner way to handle them than just a try/catch loop.
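The negative-ID observation suggests a filter during parsing rather than a try/catch loop. A rough Python illustration of the idea (LegendsViewer itself is C#, and the field names here are assumptions about the export format):

```python
def real_figures(figures):
    """Keep only historical figures with a non-negative id.

    dfhack's fake 'workflow' entries reportedly carry negative ids,
    so filtering on the id field skips them without any try/catch.
    Each figure is assumed to be a dict with an "id" field.
    """
    return [f for f in figures if int(f.get("id", -1)) >= 0]
```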

@JyeGuru (Contributor) commented May 15, 2014

Compiled a testing version that loads the example corrupted files just fine, however it uses more RAM because it loads the whole file into memory to do the replacement of non-ascii characters. I couldn't find a good way to implement this with the XMLTextReader/StreamReader combination currently in use. May have to refactor some of that XML Reader file to fix it without the memory hog.

Testing of this version on my ~250M XML used about 1.2G RAM at peak, compared to around 800M peak in the original version. Not considerably more, but I imagine it would fail hard with massive exports.
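The whole-file approach described here can be sketched as follows (Python for illustration; the actual fix is C# and its exact replacement rules aren't shown in this thread):

```python
import re

# Characters the XML 1.0 spec disallows entirely.
INVALID_XML = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")

def clean_whole_file(path):
    """Read the entire export into memory and strip invalid characters.

    Simple, but the raw text and the cleaned copy are both held in RAM
    at once, which is consistent with the higher peak memory reported
    for this version.
    """
    with open(path, encoding="utf-8", errors="replace") as f:
        text = f.read()
    return INVALID_XML.sub("", text)
```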

Will continue looking for a better way to do this. In the meantime, a testing executable is available here: https://ci.appveyor.com/project/yukihyou/legends-viewer/build/1.13.09.8/artifacts

Edit: The push didn't work properly the first time and contained incorrect changes. Repushed below.

@JyeGuru (Contributor) commented May 15, 2014

Think I fixed it properly this time. Basically, it makes a cleaned copy of the file if the file fails to load the first time. This temp file is removed after the world is loaded.

Pro: Less memory usage while loading than previous fix
Con: More drive space needed for second copy of file
Unknown: Performance impact of the valid XML test on super-huge files

https://ci.appveyor.com/project/yukihyou/legends-viewer/build/1.13.09.9/artifacts
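In outline, the approach described above (validate first, only write a cleaned temp copy on failure) might look like this Python sketch; the real implementation uses XDocument and a FileStream in C#, so the names and parsing library here are stand-ins:

```python
import os
import re
import tempfile
import xml.etree.ElementTree as ET

INVALID_XML = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")

def ensure_loadable(xml_path):
    """Return a path to a parseable copy of xml_path.

    If the file already parses, return it unchanged (no extra I/O).
    Otherwise stream a cleaned copy to a temp file and return that;
    the caller deletes the temp file after the world is loaded.
    """
    try:
        ET.parse(xml_path)          # validity check only; result discarded
        return xml_path
    except ET.ParseError:
        fd, tmp_path = tempfile.mkstemp(suffix=".xml")
        with os.fdopen(fd, "w", encoding="utf-8") as dst, \
             open(xml_path, encoding="utf-8", errors="replace") as src:
            for line in src:        # line-at-a-time keeps memory low
                dst.write(INVALID_XML.sub("", line))
        return tmp_path
```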

@Parker147 (Owner)

I'm thinking the upfront validation of the XML file is good. I think you could optimize it, though: remove the XDocument.Load and simply regex-replace the file from the start. Since you have to read through part of the document regardless of whether it fails, you might as well fix it instead of reloading the file again. The corruption happens in the Historical Figures section of the XML, so you could even limit the regex replacement to just that section of the file, since its start and end tags would be easy to find.
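Restricting the replacement to the historical figures section could be sketched like this (Python for illustration; the tag names assume DF's legends XML wraps that section in `<historical_figures>`...`</historical_figures>`):

```python
import re

INVALID_XML = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")

def clean_histfig_section(text):
    """Strip invalid characters only between the historical-figures tags.

    Everything outside that section passes through untouched, so the
    replacement cost scales with the (relatively small) historical
    figures section rather than the whole export.
    """
    start = text.find("<historical_figures>")
    end = text.find("</historical_figures>")
    if start == -1 or end == -1:
        return text                 # section absent; nothing to do
    section = INVALID_XML.sub("", text[start:end])
    return text[:start] + section + text[end:]
```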

@JyeGuru (Contributor) commented May 15, 2014

The problem is that the XML classes (as in ALL of them!) will refuse to parse any invalid entries - the XML spec requires draconian error-handling. That's why I had to do the replacement outside an XML parser.
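The draconian error-handling point is easy to demonstrate: a conforming XML parser in any stack rejects a control character outright rather than skipping the bad entry. In Python, for instance:

```python
import xml.etree.ElementTree as ET

def parses_cleanly(text):
    """Return True if text is well-formed XML.

    Strict parsers give no option to skip or repair an invalid
    character mid-stream; the only signal is a parse error.
    """
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False
```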

If I did the replace (which forces a write to a temp file) on every file regardless of errors, it would slow everything down considerably. I'm confident that the performance loss by loading the XDocument instance (and then releasing it again straight away) is a lot less than the performance impact of writing the entire file back to disk.

I will see about generating some obscenely large history files if I get some time to leave DF running, unless you have some to test it on.

@Parker147 (Owner)

I'm concerned about XDocument having to read the entire file when we don't need it to, causing a performance issue. We really only care about the Historical Figure section, which is a relatively small part of these XML files. I agree that writing the entire file back to disk is a costly task and we would want to avoid it if possible. Could we put in some checks to see if any regex replacements actually occurred, and only write to file in that case?

@JyeGuru (Contributor) commented May 15, 2014

The regex replacement is only done if there is something to replace. The "validity check" is just XDocument safeFile = XDocument.Load(xmlFile); safeFile = null; inside a try/catch block, which literally just loads and frees the object straight away.

This uses XDocument as a validator only. If an exception is generated, then the XML is bad, and it's handed over to a FileStream that does the replacement and writes to a temp file. If there is no exception, then the filename is returned unchanged.

Doing a validation on the XML before loading it is going to be slow, no matter how it's done. I've just tried to do it with as little memory impact as possible, since that seems to be the limiting factor. Ideally, I think the whole XML reader could be refactored to use a custom iterator[1], but that would require quite a rewrite of the class.

1: http://lennilobel.wordpress.com/2009/09/02/streaming-into-linq-to-xml-using-c-custom-iterators-and-xmlreader/

@Parker147 (Owner)

This validity check is too broad, though - we only really care about invalid XML characters in the historical figures section, so why not simply search for those characters to validate the file?

@JyeGuru (Contributor) commented May 17, 2014

This will prevent any other malformed-XML exceptions in the future, in the event that something else goes horribly wrong with the exports - if another issue arises, the 'fix' for it can be implemented in the block that writes the temp file back.

Additionally, searching for those characters and/or replacing them would require streaming over the entire file before starting the XML reader, which would introduce a massive performance hit for everyone, not just the people who have those characters in their export files. (I, for example, don't use Workflow, so this would slow my exports for precisely zero benefit.)

The other option that has occurred to me is to add a tick-box to the main interface, "Fix invalid XML": simply trap the exception and report an error to the user if anything goes wrong with the read. That way the performance impact is opt-in.

@Parker147 (Owner)

Is XDocument.Load not already streaming over the entire file to look for these invalid characters?

I like the idea of giving the user the option to fix the issue. Not sure I like the checkbox since it seems like some extra clutter in the window for something that I don't think would be used very often. What about capturing the error and then just popping up a dialog to attempt fixing the file?

@PeridexisErrant (Author)

What about capturing the error and then just popping up a dialog to attempt fixing the file?

Sounds good - minimal impact if it's valid, and it eventually works for everyone else too. This would also avoid some legacy issues once dfhack fixes it at that end.

@JyeGuru (Contributor) commented May 18, 2014

I'll get on that this week. Will also add an option to overwrite the original file if they choose. :)

@PeridexisErrant (Author)

I'll get on that this week.

Awesome!

Will also add an option to overwrite the original file if they choose. :)

Check that it won't break archives - I'm pretty sure 7zip can do it, but it could be tricky to ensure it works properly for all supported compression types.

@JyeGuru (Contributor) commented May 18, 2014

I think I'll simplify it to only asking if it's not in an archive. If it's an archive, it's most likely one made by your script, which will have pre-fixed the XML anyway.

@JyeGuru (Contributor) commented May 18, 2014

Committed a fix, build available at: https://ci.appveyor.com/project/yukihyou/legends-viewer/build/1.13.09.14/artifacts

This will prompt on corrupt XML; if you say no, it stops gracefully with an error (no crash). It will then prompt again after writing out the temp file, asking if you want to overwrite the original.

Seems to work fine with the test files earlier in this thread, would appreciate comments/tests.

@Parker147 (Owner)

I think that works well.

@JyeGuru (Contributor) commented May 22, 2014

Pull request submitted
