Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse non-null-terminated strings / Parse with std::string #158

Closed
spl opened this issue Oct 9, 2014 · 17 comments
Closed

Parse non-null-terminated strings / Parse with std::string #158

spl opened this issue Oct 9, 2014 · 17 comments

Comments

@spl
Copy link
Contributor

spl commented Oct 9, 2014

I would like to parse strings that are not null-terminated but do have a string length.

What's the best way to do that now? Implement the Stream concept? I guess it would involve something like copying the GenericStringStream, adding a length-remaining member, and modifying Peek() to return '\0' when the end is reached.

I think this would be useful for RapidJSON in general. It could be exposed as overloaded Parse(const Ch *, size_t) and ParseInsitu (Ch *, size_t) methods.

@miloyip
Copy link
Collaborator

miloyip commented Oct 9, 2014

This seems fit your need
https://github.com/miloyip/rapidjson/blob/master/include/rapidjson/memorystream.h

On Thu, Oct 9, 2014 at 6:30 PM, Sean Leather notifications@github.com
wrote:

I would like to parse strings that are not null-terminated but do have a
string length.

What's the best way to do that now? Implement the Stream concept? I guess
it would involve something like copying the GenericStringStream, adding a
length-remaining member, and modifying Peek() to return '\0' when the end
is reached.

I think this would be useful for RapidJSON in general. It could be exposed
as overloaded Parse(const Ch *, size_t) and ParseInsitu (Ch *, size_t)
methods.


Reply to this email directly or view it on GitHub
#158.

Milo Yip

http://www.cnblogs.com/miloyip/
http://weibo.com/miloyip/
http://twitter.com/miloyip/

@pah
Copy link
Contributor

pah commented Oct 9, 2014

You can do this already by using a MemoryStream, wrapped by an EncodedInputStream.

An overload GenericDocument::Parse could look like (untested):

Parse(const Ch * str, size_t sz) {
    const char* buf = (const char*) str;
    size_t    bufsz = sz * sizeof(Ch);
    MemoryStream ms(buf, bufsz);
    EncodedInputStream<Encoding, MemoryStream> is(ms);
    ParseStream(is);
    return *this;
}

Don't know if this is reasonable to add to the core API.

Edit: Oh, @miloyip was quicker.

@spl
Copy link
Contributor Author

spl commented Oct 9, 2014

Great. Thanks, guys.

@spl spl closed this as completed Oct 9, 2014
@oranjuice
Copy link

Sorry to bring it up again guys, but do you have any plans of making this a part of the main API?
I think it is an important use-case given that JSON data can easily contain bytes disguised as strings.

How about a Parse(const std::string&) ?

EDIT: Parse(const std::string&) doesn't really make sense. Sorry. const char* is better.
Thanks

@pah
Copy link
Contributor

pah commented Oct 17, 2014

I agree, that this might indeed be a useful addition. Feel free to prepare a pull-request (and/or add it to your own fork) based on the following sketch (untested!):

    // add required headers

    template <unsigned parseFlags, typename SourceEncoding>
    GenericDocument& Parse(const Ch * str, SizeType sz) {
        RAPIDJSON_ASSERT(!(parseFlags & kParseInsituFlag));
        const char* buf = (const char*) str;
        size_t    bufsz = sz * sizeof(Ch);
        MemoryStream ms(buf, bufsz);
        EncodedInputStream<SourceEncoding, MemoryStream> is(ms);
        ParseStream<parseFlags, SourceEncoding>(is);
        return *this;
    }
    template <unsigned parseFlags>
    GenericDocument& Parse(const Ch * str, SizeType sz) {
        return Parse<parseFlags, Encoding>(str, sz);
    }
    GenericDocument& Parse(const Ch * str, SizeType sz) {
        return Parse<kParseDefaultFlags>(str, sz);
    }

Please add documentation and tests (e.g. to test/unittest/documenttest.cpp) as well.

To support std::string, you could add a another set of overloads:

#if RAPIDJSON_HAS_STDSTRING
    template <unsigned parseFlags, typename SourceEncoding>
    GenericDocument& Parse(const std::basic_string<Ch>& str) {
        return Parse<parseFlags, SourceEncoding>(str.data(), str.size());
    }    
    template <unsigned parseFlags>
    GenericDocument& Parse(const std::basic_string<Ch>& str) {
        return Parse<parseFlags, Encoding>(str);
    }
    GenericDocument& Parse(const std::basic_string<Ch>& str) {
        return Parse<kParseDefaultFlags>(str);
    }
#endif // RAPIDJSON_HAS_STDSTRING

@spl
Copy link
Contributor Author

spl commented Oct 17, 2014

I would also like to see it in the interface. Since there's interest, I'll reopen this issue.

@oranjuice In addition to @pah's suggestions, you might consider adding performance tests to test/perftest/rapidjsontest.cpp like the ones I have in #165.

@spl spl reopened this Oct 17, 2014
@oranjuice
Copy link

Thanks guys. I'm not sure if I can get it to soon (if at all), but I'll try to find time.

@spl
Copy link
Contributor Author

spl commented Oct 22, 2014

Are there any problems with using only MemoryStream without wrapping it in an EncodedInputStream for UTF8? For UTF16 and UTF32, it doesn't make sense because MemoryStream deals with bytes and these encodings require multiple bytes. But for UTF8, unless I'm missing something, it should be fine.

I added this perf test:

TEST_F(RapidJson, SIMD_SUFFIX(DocumentParse_MemoryStream)) {
    for (size_t i = 0; i < kTrialCount; i++) {
        MemoryStream ms(json_, length_);
        Document doc;
        doc.ParseStream<0, UTF8<> >(ms);
        ASSERT_TRUE(doc.IsObject());
    }
}

And here are some of the results for comparison:

[ RUN      ] RapidJson.DocumentParse_MemoryPoolAllocator
[       OK ] RapidJson.DocumentParse_MemoryPoolAllocator (899 ms)
[ RUN      ] RapidJson.DocumentParse_CrtAllocator
[       OK ] RapidJson.DocumentParse_CrtAllocator (1273 ms)
[ RUN      ] RapidJson.DocumentParse_MemoryStream
[       OK ] RapidJson.DocumentParse_MemoryStream (1445 ms)
[ RUN      ] RapidJson.DocumentParseEncodedInputStream_MemoryStream
[       OK ] RapidJson.DocumentParseEncodedInputStream_MemoryStream (1833 ms)

@pah
Copy link
Contributor

pah commented Oct 22, 2014

Can you try to check, whether adding a StreamTraits specialization (to encodedstream.h) for EncodedInputStream helps?

template <typename Encoding, typename InputByteStream>
struct StreamTraits<EncodedInputStream<Encoding, InputByteStream> > {
    enum { copyOptimization = 1 };
};

I would prefer to keep the EncodedInputStream for the proposed interface if possible, in order to keep the symmetry among the different overloads. We could consider to add an option to the Encoded*Stream classes to skip the BOM support (which might make sense for the use case here).

@spl
Copy link
Contributor Author

spl commented Oct 22, 2014

I would prefer to keep the EncodedInputStream for the proposed interface if possible, in order to keep the symmetry among the different overloads.

Absolutely. I didn't mean to suggest that we do otherwise. I was just curious if my suggestion actually made sense from a purely technical standpoint.

Can you try to check, whether adding a StreamTraits specialization (to encodedstream.h) for EncodedInputStream helps?

Sure, I'll try that.

@pah
Copy link
Contributor

pah commented Oct 22, 2014

Absolutely. I didn't mean to suggest that we do otherwise. I was just curious if my suggestion actually made sense from a purely technical standpoint.

Technically, it should be sufficient to use a plain MemoryStream, if you don't need BOM support and the target encoding is UTF-8 (or simply has sizeof(Ch) == 1).

@spl
Copy link
Contributor Author

spl commented Oct 22, 2014

Thanks! That confirms my intuition.

@lichray
Copy link
Contributor

lichray commented Oct 23, 2014

Note that SIMD whitespace skipping is also only made available for StringStream.

@pah
Copy link
Contributor

pah commented Oct 24, 2014

Note that SIMD whitespace skipping is also only made available for StringStream.

I don't see a way to safely do SIMD whitespace checking for a MemoryStream (at least not with the current API), as these streams are not guaranteed to be '\0'-terminated. Consequently, the SIMD implementations may overflow in some cases.

@miloyip
Copy link
Collaborator

miloyip commented Apr 11, 2015

I am OK for PR for supporting Parse(std::string).
That may be put in v1.1 Beta.

@miloyip miloyip changed the title Parse non-null-terminated strings Parse non-null-terminated strings / Parse with std::string Apr 11, 2015
@miloyip miloyip changed the title Parse non-null-terminated strings / Parse with std::string Parse non-null-terminated strings / Parse with std::string Apr 11, 2015
@miloyip miloyip added this to the v1.1 Beta milestone Apr 24, 2015
@Climax777
Copy link

This doesn't work with ParseInsitu yet or does it?

@jmrico01
Copy link

This doesn't work with ParseInsitu yet or does it?

+1 ! I don't think it does - MemoryStream doesn't support the required write ops. Maybe we need an InsituMemoryStream?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants