
Files containing control characters due to dfhack cannot be loaded #14

Closed
PeridexisErrant opened this issue May 14, 2014 · 19 comments

@PeridexisErrant

A known bug with Legends Viewer is that it often crashes if the user attempts to load files exported after playing a fort (via the backup-abandon-export-reload trick).

This is because dfhack currently stores values such as workflow settings as fake historical figures in hex format, and Legends Viewer cannot parse files with such entries.

There are workarounds - such as a script that deletes any line containing such characters, or a future fix to dfhack that hides such entries from export - but it would be good to improve the parsing here.
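The line-deleting workaround mentioned above could look something like the following Python sketch. The filenames and the exact set of stripped characters are assumptions (the actual script isn't shown in this thread); dropping whole lines is crude and can unbalance tags if an entry spans several lines, which is part of why a parser-side fix would be better.

```python
import re

# XML 1.0 forbids most ASCII control characters: 0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")

def strip_bad_lines(in_path, out_path):
    """Copy an export file, dropping any line that contains a control character."""
    with open(in_path, encoding="utf-8", errors="replace") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not CONTROL_CHARS.search(line):
                dst.write(line)
```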

@JyeGuru (Contributor) commented May 14, 2014

Are you able to upload an example export? I'll take a look at it tonight. Can't guarantee anything, but my C# skills have gotten a lot better over the last week or so.

@PeridexisErrant (Author)

Sure - here it is. I've included an archive with the bug (-corrupt), and one without (-clean).

In the process, I seem to have found a new issue: extracting the files and then loading them into LV produces the expected behaviour, but trying to load the archive directly has the XML turn green and then hang at the history text indefinitely. I have no idea why this happens, but since it's not interfering with the main bug test I'll leave it for now. If you quit, you then have to manually remove the files from LV's folder, so the problem must be in the loading rather than the extraction (and maybe LV needs to clean up after an unexpected quit?).

Here's the error message I get:

'�', hexadecimal value 0x01, is an invalid character. Line 571, position 11.
at System.Xml.XmlTextReaderImpl.Throw(String res, String[] args)
at System.Xml.XmlTextReaderImpl.ParseText(Int32& startPos, Int32& endPos, Int32& outOrChars)
at System.Xml.XmlTextReaderImpl.ParseText()
at System.Xml.XmlTextReaderImpl.ParseElementContent()
at System.Xml.XmlReader.ReadStartElement()
at LegendsViewer.Legends.XMLParser.ParseProperty()
at LegendsViewer.Legends.XMLParser.ParseItem()
at LegendsViewer.Legends.XMLParser.ParseSection()
at LegendsViewer.Legends.XMLParser.Parse()
at LegendsViewer.Legends.World..ctor(String xmlFile, String historyFile, String sitesAndPopulationsFile, String mapFile)
at LegendsViewer.FileLoader.load_DoWork(Object sender, DoWorkEventArgs e)
at System.ComponentModel.BackgroundWorker.WorkerThreadStart(Object argument)

@JyeGuru (Contributor) commented May 15, 2014

So, being at work and just glancing at the export XML files... it seems that all the fake Workflow entries have a negative ID. I'll see if there's a cleaner way to handle them than just a try/catch loop.
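The negative-ID observation suggests a filter during parsing rather than a try/catch loop. A rough Python illustration of the idea (LegendsViewer itself is C#, and the field names here are assumptions about the export format):

```python
def real_figures(figures):
    """Keep only historical figures with a non-negative id.

    dfhack's fake 'workflow' entries reportedly carry negative ids,
    so filtering on the id field skips them without any try/catch.
    Each figure is assumed to be a dict with an "id" field.
    """
    return [f for f in figures if int(f.get("id", -1)) >= 0]
```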

@JyeGuru (Contributor) commented May 15, 2014

Compiled a testing version that loads the example corrupted files just fine, however it uses more RAM because it loads the whole file into memory to do the replacement of non-ascii characters. I couldn't find a good way to implement this with the XMLTextReader/StreamReader combination currently in use. May have to refactor some of that XML Reader file to fix it without the memory hog.

Testing of this version on my ~250M XML used about 1.2G RAM at peak, compared to around 800M peak in the original version. Not considerably more, but I imagine it would fail hard with massive exports.
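The whole-file approach described here can be sketched as follows (Python for illustration; the actual fix is C# and its exact replacement rules aren't shown in this thread):

```python
import re

# Characters the XML 1.0 spec disallows entirely.
INVALID_XML = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")

def clean_whole_file(path):
    """Read the entire export into memory and strip invalid characters.

    Simple, but the raw text and the cleaned copy are both held in RAM
    at once, which is consistent with the higher peak memory reported
    for this version.
    """
    with open(path, encoding="utf-8", errors="replace") as f:
        text = f.read()
    return INVALID_XML.sub("", text)
```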

Will continue looking for a better way to do this. In the meantime, a testing executable is available here: https://ci.appveyor.com/project/yukihyou/legends-viewer/build/1.13.09.8/artifacts

Edit: The push didn't work properly the first time and contained incorrect changes. Repushed below.

@JyeGuru (Contributor) commented May 15, 2014

Think I fixed it properly this time. Basically, it makes a cleaned copy of the file if the file fails to load the first time. This temp file is removed after the world is loaded.

Pro: Less memory usage while loading than previous fix
Con: More drive space needed for second copy of file
Unknown: Performance impact of the valid XML test on super-huge files

https://ci.appveyor.com/project/yukihyou/legends-viewer/build/1.13.09.9/artifacts
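In outline, the approach described above (validate first, only write a cleaned temp copy on failure) might look like this Python sketch; the real implementation uses XDocument and a FileStream in C#, so the names and parsing library here are stand-ins:

```python
import os
import re
import tempfile
import xml.etree.ElementTree as ET

INVALID_XML = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")

def ensure_loadable(xml_path):
    """Return a path to a parseable copy of xml_path.

    If the file already parses, return it unchanged (no extra I/O).
    Otherwise stream a cleaned copy to a temp file and return that;
    the caller deletes the temp file after the world is loaded.
    """
    try:
        ET.parse(xml_path)          # validity check only; result discarded
        return xml_path
    except ET.ParseError:
        fd, tmp_path = tempfile.mkstemp(suffix=".xml")
        with os.fdopen(fd, "w", encoding="utf-8") as dst, \
             open(xml_path, encoding="utf-8", errors="replace") as src:
            for line in src:        # line-at-a-time keeps memory low
                dst.write(INVALID_XML.sub("", line))
        return tmp_path
```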

@Parker147 (Owner)

I'm thinking the upfront validation of the XML file is good. I think you could optimize it, though: remove the XDocument.Load and simply regex-replace the file from the start. Since you have to read through part of the document regardless of whether it fails, you might as well fix it instead of reloading the file again. The corruption happens in the Historical Figures section of the XML, so you could even limit the regex replacement to just that section of the file, since its start and end tags would be easy to find.
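Restricting the replacement to the historical figures section could be sketched like this (Python for illustration; the tag names assume DF's legends XML wraps that section in `<historical_figures>`...`</historical_figures>`):

```python
import re

INVALID_XML = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")

def clean_histfig_section(text):
    """Strip invalid characters only between the historical-figures tags.

    Everything outside that section passes through untouched, so the
    replacement cost scales with the (relatively small) historical
    figures section rather than the whole export.
    """
    start = text.find("<historical_figures>")
    end = text.find("</historical_figures>")
    if start == -1 or end == -1:
        return text                 # section absent; nothing to do
    section = INVALID_XML.sub("", text[start:end])
    return text[:start] + section + text[end:]
```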

@JyeGuru (Contributor) commented May 15, 2014

The problem is that the XML classes (as in ALL of them!) will refuse to parse any invalid entries - the XML spec requires draconian error-handling. That's why I had to do the replacement outside an XML parser.
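The draconian error-handling point is easy to demonstrate: a conforming XML parser in any stack rejects a control character outright rather than skipping the bad entry. In Python, for instance:

```python
import xml.etree.ElementTree as ET

def parses_cleanly(text):
    """Return True if text is well-formed XML.

    Strict parsers give no option to skip or repair an invalid
    character mid-stream; the only signal is a parse error.
    """
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False
```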

If I did the replace (which forces a write to a temp file) on every file regardless of errors, it would slow everything down considerably. I'm confident that the performance loss by loading the XDocument instance (and then releasing it again straight away) is a lot less than the performance impact of writing the entire file back to disk.

I will see about generating some obscenely large history files if I get some time to leave DF running, unless you have some to test it on.

@Parker147 (Owner)

I'm concerned about XDocument having to read the entire file when we don't need it to, causing a performance issue. We really only care about the Historical Figure section, which is a relatively small part of these XML files. I agree that writing the entire file back to disk is a costly task and we would want to avoid it if possible. Could we put in some checks to see if any regex replacements actually occurred, and only write to file in that case?

@JyeGuru (Contributor) commented May 15, 2014

The regex replacement is only done if there is something to replace. The "validity check" is just XDocument safeFile = XDocument.Load(xmlFile); safeFile = null; inside a try/catch block, which literally just loads and frees the object straight away.

This uses XDocument as a validator only. If an exception is generated, then the XML is bad, and it's handed over to a FileStream that does the replacement and writes to a temp file. If there is no exception, then the filename is returned unchanged.

Doing a validation on the XML before loading it is going to be slow, no matter how it's done. I've just tried to do it with as little memory impact as possible, since that seems to be the limiting factor. Ideally, I think the whole XML reader could be refactored to use a custom iterator[1], but that would require quite a rewrite of the class.

1: http://lennilobel.wordpress.com/2009/09/02/streaming-into-linq-to-xml-using-c-custom-iterators-and-xmlreader/

@Parker147 (Owner)

This validity check is too broad, though - we only really care about invalid XML characters in the historical figures section, so why not simply search for those characters to validate the file?

@JyeGuru (Contributor) commented May 17, 2014

This will prevent any other malformed-XML exceptions in the future, in the event that something else goes horribly wrong with the exports - if another issue arises, the 'fix' for it can be implemented in the block that writes the temp file back.

Additionally, searching for those characters and/or replacing them would require streaming over the entire file before starting the XML reader, which would introduce a massive performance hit for everyone, not just the people who have those characters in their export files. (I, for example, don't use Workflow, so this would slow my exports for precisely zero benefit.)

The other option that has occurred to me is to add a tick-box to the main interface, "Fix invalid XML": simply trap the exception and report an error to the user if anything goes wrong with the read. That way the performance impact is opt-in.

@Parker147 (Owner)

Is XDocument.Load not already streaming over the entire file to look for these invalid characters?

I like the idea of giving the user the option to fix the issue. Not sure I like the checkbox since it seems like some extra clutter in the window for something that I don't think would be used very often. What about capturing the error and then just popping up a dialog to attempt fixing the file?

@PeridexisErrant (Author)

What about capturing the error and then just popping up a dialog to attempt fixing the file?

Sounds good - minimal impact if it's valid, and it eventually works for everyone else too. This would also avoid some legacy issues once dfhack fixes it at that end.

@JyeGuru (Contributor) commented May 18, 2014

I'll get on that this week. Will also add an option to overwrite the original file if they choose. :)

@PeridexisErrant (Author)

I'll get on that this week.

Awesome!

Will also add an option to overwrite the original file if they choose. :)

Check that it won't break archives - I'm pretty sure 7zip can do it, but it could be tricky to ensure it works properly for all supported compression types.

@JyeGuru (Contributor) commented May 18, 2014

I think I'll simplify it to only asking if it's not in an archive. If it's an archive, it's most likely one made by your script, which will have pre-fixed the XML anyway.

@JyeGuru (Contributor) commented May 18, 2014

Committed a fix, build available at: https://ci.appveyor.com/project/yukihyou/legends-viewer/build/1.13.09.14/artifacts

This will prompt on corrupt XML; if you say no, it stops gracefully with an error (no crash). It will then prompt again after writing out the temp file, asking if you want to overwrite the original.

Seems to work fine with the test files earlier in this thread, would appreciate comments/tests.

@Parker147 (Owner)

I think that works well.

@JyeGuru (Contributor) commented May 22, 2014

Pull request submitted
