New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor LoadHandler #5554
base: master
Are you sure you want to change the base?
Refactor LoadHandler #5554
Conversation
df6f332
to
c0c358b
Compare
const auto attributeMap = getAttributeMap(); | ||
|
||
std::string creator; | ||
auto optVersion = XmlParserHelper::getAttrib<std::string>("verison", attributeMap); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto optVersion = XmlParserHelper::getAttrib<std::string>("verison", attributeMap); | |
auto optVersion = XmlParserHelper::getAttrib<std::string>("version", attributeMap); |
To be safe, all those strings should be defined as constexpr in some header(s), and used both here and in the SaveHandler. It's the only way to ensure there is no typo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I've put all the attribute names in a file XmlNames.h
and a namespace XmlAttrs
. Two questions for you:
- Should I make it more structured, that is, define
XmlAttrs::Page::WIDTH_STR
andXmlAttrs::Stroke::WIDTH_STR
instead of justXmlAttrs::WIDTH_STR
? - I should probably do the same for tag names right?
edit: append "_STR", since there was a collision on the macOS build
I've only given it a short glance. This needs more time to evaluate. About this:
Are you sure? It may be slower with CMAKE_BUILD_TYPE = Debug, but not with Release or RelWithDebInfo. Compiler optimisation is more crucial with C++'s advanced features. |
Take your time, this is a huge amount of code for you to review. With a release build,
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is another partial review.
Overall, the code is pretty clean and easy to read, that good!
Some functions may benefit from a comment.
e.g. XmlParser::getAttributeMap(), explaining that this only reads the current node. Also, is there a way to assert that a node is only read once?
Another question: are you able to pinpoint places where a C to C++ string copy adds in complexity: meaning that the string is later only used as a const char* or so?
If so, in such places, you could use xoj::util::OwnedCString to avoid copying while maintaining mem-safety.
You may also gain a tiny bit by using xoj::util::EnumIndexedArray instead of unordered maps (e.g. in XmlParser::tagTypeToName()). Not sure you'd save a lot but maybe a little.
Ideally, having a benchmark function by function (when a similar function existed before this PR, and with -DCMAKE_BUILD_TYPE=Release) could help finding optimization possibilities.
Maybe you could add a disabled test unit which does some benchmarking, so it can be used again later, when someone else touches the LoadHandler. For a way to do so, you can look at this (unmerged) commit.
Finally: have you looked at the TestLoadHandler test unit? Does it look complete enough to catch any possible issues?
This is a very critical operation, so we must make sure nothing is broken, and we must optimize it quite a bit. I'm a bit scared of the loss of performances.
std::string decodeBase64(const char* base64data); | ||
|
||
// custom types for easier parsing using stream operators | ||
enum class Domain { ABOSLUTE, ATTACH, CLONE }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
enum class Domain { ABOSLUTE, ATTACH, CLONE }; | |
enum class Domain { ABSOLUTE, ATTACH, CLONE }; |
auto operator<<(std::ostream& stream, const XmlParserHelper::Domain domain) -> std::ostream& { | ||
switch (domain) { | ||
case XmlParserHelper::Domain::ABOSLUTE: | ||
stream << "absolute"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here as well, those strings should be constexpr defined in this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These strings are also needed in the SaveHandler. But is it worth it to make a whole "AttachmentDomain" class in a separate file ? Where else would it fit ?
About strings / C-strings: About unordered maps: About the test unit: About benchmarking: |
We don't want to expose
Yeah, the set of values is pretty small anyway.
Random documents would not even remotely model use cases, and developping them would cost a lot of time. Typical documents are fine. E.g. one very long with only strokes, one with a lot of images, one with a long pdf and some annotations on top, and so on. The important thing is to be able to evaluate each kind of data more or less separately. |
Rename XmlNames.h to XmlAttrs.h Move TagType enum to new XmlTags namespace and use it to index the constexpr names Add forwarding constructor to EnumIndexedArray
This is a partial rewrite of the load handler, as had been suggested some long time ago should be necessary in #937.
This would close #5183. There are also many smaller issues with memory safety that are addressed with this. Ideally, it should be resilient against any ill-formed files.
I tried to split the one monster class into two smaller ones:
XmlParser
reads the XML and parses all the data necessary for loading the file. It only cares about what that data says when it is relevant for parsing.LoadHandler
receives the data parsed byXmlParser
and builds the document.Commit structure:
07a9247 is a preparation for abstracting away where the XML is being read from.
4dfb59b is the addition of the whole
XmlParser
class. The load handler loads the file twice, once with the old interface, and once with the new one. A time measurement is printed so you can try to compare performance for yourself.fdb2828 is the move away from streams for parsing stroke points and pressures.
745a4a0 is the cleanup of the old interface, and of the LoadHandler class in general.
e31a102 addresses the fact that alpha values were not read in the old parser (no visual difference: alpha values are unused in the file as I understand it), and that some tests were expecting that behavior.
Error handling:
This part is not quite finished up yet, but I wanted to get some feedback first:
LoadHandler::logError()
which will print a warning right away and store the message for it to be retrieved when loading finishes. This should be extended to theXmlParser
class as well if it fits. It currently contains a lot of asserts that can be triggered by ill-formed files. Language: both would be nice (user language in the GUI and English in the console), but I couldn't figure out a way to do that nicely, so for now it's all English.Performance:
The new XML Parser is somewhat more wordy, performs more checks, and generally copies data coming from a C API into strings or similar. It is consequently slower on some files, especially small ones. However, I believe the old code was copying Base64 encoded data more than necessary, as this new parser is clearly faster as soon as the file includes images. Now at first performance was actually terrible, and then I figured out that for parsing doubles, the stream
operator>>
was about twice as slow asg_ascii_strtod()
. This seemed like a sufficiently compelling reason to use the latter.Other notes:
In theory, the separation between parser and loader would make it possible to use parallel processing (parser continues reading while the loader builds the document), but I think (I tried) that'd be more trouble than it's worth.
The parser uses libxml2, which we already rely upon for the settings XML.
I know that the
XmlParser::process*Node()
functions are all very similar, but nevertheless, there are some small differences which kept me from writing a general form.The
*InputStream::open()
functions are not used, should I make the minimal viable class, or is this fine in order to allow easy reuse?How far along is the zip file format / Where is it going to get? Is it still flexible to changes? For example attachments for images and TEX images would be so much more straightforward to parse if they were given in an attribute.
I tried to provide some kind of forward compatibility (ignoring unknown nodes and such), but I'm not sure how effective that would really be.
Finally, this is intended as a (small) stepping stone for the new file format, which will bring a completely new parser. Ideally, it should be possible to separate zip and gzip file formats a bit more. I'm certainly willing to continue into that direction if considered useful.