Parse non-null-terminated strings / Parse with std::string #158

spl · 2014-10-09T10:30:18Z

I would like to parse strings that are not null-terminated but do have a string length.

What's the best way to do that now? Implement the Stream concept? I guess it would involve something like copying the GenericStringStream, adding a length-remaining member, and modifying Peek() to return '\0' when the end is reached.

I think this would be useful for RapidJSON in general. It could be exposed as overloaded Parse(const Ch *, size_t) and ParseInsitu(Ch *, size_t) methods.
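The stream described in the question can be sketched as follows. This is a minimal illustration, not library code: `BoundedStringStream` and its members `head_` and `end_` are hypothetical names, and the struct only mirrors the member functions that RapidJSON's Stream concept asks for (`Peek`, `Take`, `Tell` and the write stubs). It tracks an end pointer rather than a remaining-length counter, which amounts to the same thing.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical read-only stream over a buffer that is NOT null-terminated.
// Not part of RapidJSON; it only follows the shape of its Stream concept.
struct BoundedStringStream {
    typedef char Ch;

    BoundedStringStream(const Ch* src, std::size_t length)
        : src_(src), head_(src), end_(src + length) {}

    // Return the current character, or '\0' once the end is reached.
    Ch Peek() const { return src_ == end_ ? '\0' : *src_; }

    // Return the current character and advance; never moves past the end.
    Ch Take() { return src_ == end_ ? '\0' : *src_++; }

    // Number of characters consumed so far.
    std::size_t Tell() const { return static_cast<std::size_t>(src_ - head_); }

    // Write operations are meaningless for a read-only stream.
    Ch* PutBegin() { assert(false); return 0; }
    void Put(Ch) { assert(false); }
    void Flush() { assert(false); }
    std::size_t PutEnd(Ch*) { assert(false); return 0; }

    const Ch* src_;   // current read position
    const Ch* head_;  // start of the buffer
    const Ch* end_;   // one past the last valid character
};
```

Because `Peek()` and `Take()` report '\0' at the end, a consumer that stops on '\0' never reads beyond the given length, even if more bytes follow in memory.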

miloyip · 2014-10-09T11:13:32Z

This seems to fit your need:
https://github.com/miloyip/rapidjson/blob/master/include/rapidjson/memorystream.h

pah · 2014-10-09T11:18:41Z

You can do this already by using a MemoryStream, wrapped by an EncodedInputStream.

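A GenericDocument::Parse overload could look like this (untested):

// Member of GenericDocument (untested sketch).
GenericDocument& Parse(const Ch* str, size_t sz) {
    // MemoryStream works on bytes, so convert the character count to a byte count.
    const char* buf = reinterpret_cast<const char*>(str);
    size_t bufsz = sz * sizeof(Ch);
    MemoryStream ms(buf, bufsz);
    // EncodedInputStream reassembles Ch-sized code units from the bytes (and handles a BOM).
    EncodedInputStream<Encoding, MemoryStream> is(ms);
    ParseStream(is);
    return *this;
}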

Don't know if this is reasonable to add to the core API.

Edit: Oh, @miloyip was quicker.

spl · 2014-10-09T20:09:01Z

Great. Thanks, guys.

oranjuice · 2014-10-17T19:40:01Z

Sorry to bring it up again guys, but do you have any plans of making this a part of the main API?
I think it is an important use-case given that JSON data can easily contain bytes disguised as strings.

How about a Parse(const std::string&) ?

EDIT: Parse(const std::string&) doesn't really make sense. Sorry. const char* is better.
Thanks

pah · 2014-10-17T20:54:20Z

I agree that this might indeed be a useful addition. Feel free to prepare a pull request (and/or add it to your own fork) based on the following sketch (untested!):

    // add required headers

    template <unsigned parseFlags, typename SourceEncoding>
    GenericDocument& Parse(const Ch * str, SizeType sz) {
        RAPIDJSON_ASSERT(!(parseFlags & kParseInsituFlag));
        const char* buf = (const char*) str;
        size_t    bufsz = sz * sizeof(Ch);
        MemoryStream ms(buf, bufsz);
        EncodedInputStream<SourceEncoding, MemoryStream> is(ms);
        ParseStream<parseFlags, SourceEncoding>(is);
        return *this;
    }
    template <unsigned parseFlags>
    GenericDocument& Parse(const Ch * str, SizeType sz) {
        return Parse<parseFlags, Encoding>(str, sz);
    }
    GenericDocument& Parse(const Ch * str, SizeType sz) {
        return Parse<kParseDefaultFlags>(str, sz);
    }

Please add documentation and tests (e.g. to test/unittest/documenttest.cpp) as well.

To support std::string, you could add another set of overloads:

#if RAPIDJSON_HAS_STDSTRING
    template <unsigned parseFlags, typename SourceEncoding>
    GenericDocument& Parse(const std::basic_string<Ch>& str) {
        // str.size() is a size_t; the length overload above takes a SizeType.
        return Parse<parseFlags, SourceEncoding>(str.data(), static_cast<SizeType>(str.size()));
    }
    template <unsigned parseFlags>
    GenericDocument& Parse(const std::basic_string<Ch>& str) {
        return Parse<parseFlags, Encoding>(str);
    }
    GenericDocument& Parse(const std::basic_string<Ch>& str) {
        return Parse<kParseDefaultFlags>(str);
    }
#endif // RAPIDJSON_HAS_STDSTRING

spl · 2014-10-17T21:30:28Z

I would also like to see it in the interface. Since there's interest, I'll reopen this issue.

@oranjuice In addition to @pah's suggestions, you might consider adding performance tests to test/perftest/rapidjsontest.cpp like the ones I have in #165.

oranjuice · 2014-10-18T17:12:13Z

Thanks guys. I'm not sure if I can get to it soon (if at all), but I'll try to find time.

spl · 2014-10-22T07:08:42Z

Are there any problems with using only MemoryStream without wrapping it in an EncodedInputStream for UTF8? For UTF16 and UTF32, it doesn't make sense because MemoryStream deals with bytes and these encodings require multiple bytes. But for UTF8, unless I'm missing something, it should be fine.

I added this perf test:

TEST_F(RapidJson, SIMD_SUFFIX(DocumentParse_MemoryStream)) {
    for (size_t i = 0; i < kTrialCount; i++) {
        MemoryStream ms(json_, length_);
        Document doc;
        doc.ParseStream<0, UTF8<> >(ms);
        ASSERT_TRUE(doc.IsObject());
    }
}

And here are some of the results for comparison:

[ RUN      ] RapidJson.DocumentParse_MemoryPoolAllocator
[       OK ] RapidJson.DocumentParse_MemoryPoolAllocator (899 ms)
[ RUN      ] RapidJson.DocumentParse_CrtAllocator
[       OK ] RapidJson.DocumentParse_CrtAllocator (1273 ms)
[ RUN      ] RapidJson.DocumentParse_MemoryStream
[       OK ] RapidJson.DocumentParse_MemoryStream (1445 ms)
[ RUN      ] RapidJson.DocumentParseEncodedInputStream_MemoryStream
[       OK ] RapidJson.DocumentParseEncodedInputStream_MemoryStream (1833 ms)

pah · 2014-10-22T15:00:50Z

Can you check whether adding a StreamTraits specialization (in encodedstream.h) for EncodedInputStream helps?

template <typename Encoding, typename InputByteStream>
struct StreamTraits<EncodedInputStream<Encoding, InputByteStream> > {
    enum { copyOptimization = 1 };
};

I would prefer to keep the EncodedInputStream for the proposed interface if possible, in order to keep the symmetry among the different overloads. We could consider adding an option to the Encoded*Stream classes to skip the BOM support (which might make sense for the use case here).
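For readers unfamiliar with the trait, here is a simplified, self-contained sketch of how a `copyOptimization` flag can be consumed. The `StreamTraits` and `StreamLocalCopy` templates below are stand-ins written for illustration (modelled on the general idea, not copied from RapidJSON), and `ToyStream` and `Skip` are hypothetical. The point is that an opted-in stream is copied into a local object for the duration of the hot loop and written back afterwards.

```cpp
#include <cassert>
#include <cstddef>

// Default: streams are not copied before the parsing loop.
template <typename Stream>
struct StreamTraits {
    enum { copyOptimization = 0 };
};

// Generic case (copyOptimization == 0): work through a reference.
template <typename Stream, int = StreamTraits<Stream>::copyOptimization>
struct StreamLocalCopy {
    explicit StreamLocalCopy(Stream& original) : s(original) {}
    Stream& s;
};

// Opt-in case (copyOptimization == 1): work on a local copy, which the
// compiler can keep in registers, and write the state back on destruction.
template <typename Stream>
struct StreamLocalCopy<Stream, 1> {
    explicit StreamLocalCopy(Stream& original) : s(original), original_(original) {}
    ~StreamLocalCopy() { original_ = s; }
    Stream s;
private:
    Stream& original_;
};

// Toy stream used only for this illustration.
struct ToyStream {
    const char* p;
    char Take() { return *p++; }
};

// Opt the toy stream in, analogous to the specialization suggested above.
template <>
struct StreamTraits<ToyStream> {
    enum { copyOptimization = 1 };
};

// Consume n characters through the local-copy wrapper.
template <typename Stream>
void Skip(Stream& stream, std::size_t n) {
    StreamLocalCopy<Stream> copy(stream);
    for (std::size_t i = 0; i < n; ++i)
        copy.s.Take();
}
```

Whether the copy actually helps depends on how cheap the stream is to copy, which is why it is opt-in per stream type.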

spl · 2014-10-22T15:31:29Z

> I would prefer to keep the EncodedInputStream for the proposed interface if possible, in order to keep the symmetry among the different overloads.

Absolutely. I didn't mean to suggest that we do otherwise. I was just curious if my suggestion actually made sense from a purely technical standpoint.

> Can you check whether adding a StreamTraits specialization (in encodedstream.h) for EncodedInputStream helps?

Sure, I'll try that.

pah · 2014-10-22T15:41:15Z

> Absolutely. I didn't mean to suggest that we do otherwise. I was just curious if my suggestion actually made sense from a purely technical standpoint.

Technically, it should be sufficient to use a plain MemoryStream, if you don't need BOM support and the target encoding is UTF-8 (or simply has sizeof(Ch) == 1).
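If a UTF-8 buffer might still start with a BOM and you want to use a plain byte stream anyway, one option is to skip the BOM yourself before constructing the stream. The helper below is a hypothetical sketch (`Utf8BomLength` is not a RapidJSON function); it only checks for the three-byte UTF-8 BOM, EF BB BF.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of leading bytes to skip so that a UTF-8
// byte buffer starts at its first real character (3 if a BOM is present).
inline std::size_t Utf8BomLength(const char* buf, std::size_t size) {
    assert(buf != 0 || size == 0);
    const unsigned char* p = reinterpret_cast<const unsigned char*>(buf);
    if (size >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return 3;
    return 0;
}
```

The caller would then pass `buf + n` and `size - n` to the byte stream, where `n` is the returned length.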

spl · 2014-10-22T16:03:11Z

Thanks! That confirms my intuition.

lichray · 2014-10-23T21:02:39Z

Note that SIMD whitespace skipping is also only made available for StringStream.

pah · 2014-10-24T08:55:49Z

> Note that SIMD whitespace skipping is also only made available for StringStream.

I don't see a way to safely do SIMD whitespace checking for a MemoryStream (at least not with the current API), as these streams are not guaranteed to be '\0'-terminated. Consequently, the SIMD implementations, which stop only at a non-whitespace character, may read past the end of the buffer in some cases.
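To illustrate the constraint, here is a sketch of what a length-aware skip would have to look like. It is not RapidJSON's implementation and `SkipWhitespaceBounded` is a hypothetical name. The block loop is written in scalar code for portability; a SIMD version would replace the inner loop with one 16-byte load, which is only safe because the whole block is known to lie inside the buffer. That knowledge requires an end pointer, which a sentinel-terminated stream interface does not expose.

```cpp
#include <cassert>
#include <cstddef>

inline bool IsJsonWhitespace(char c) {
    return c == ' ' || c == '\n' || c == '\r' || c == '\t';
}

// Skip JSON whitespace in [p, end) without ever reading at or beyond `end`.
// Full 16-byte blocks are handled first; a scalar tail covers the remainder.
inline const char* SkipWhitespaceBounded(const char* p, const char* end) {
    assert(p <= end);
    const std::size_t kBlock = 16;
    while (static_cast<std::size_t>(end - p) >= kBlock) {
        // A SIMD implementation would test these 16 bytes with one load.
        for (std::size_t i = 0; i < kBlock; ++i)
            if (!IsJsonWhitespace(p[i]))
                return p + i;
        p += kBlock;
    }
    // Fewer than 16 bytes remain: check them one at a time.
    while (p != end && IsJsonWhitespace(*p))
        ++p;
    return p;
}
```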

miloyip · 2015-04-11T08:22:01Z

I am OK with a PR that adds support for Parse(std::string).
It could go into v1.1 Beta.

Climax777 · 2018-05-13T08:24:55Z

This doesn't work with ParseInsitu yet, or does it?

jmrico01 · 2022-03-27T21:50:32Z

> This doesn't work with ParseInsitu yet, or does it?

+1! I don't think it does: MemoryStream doesn't support the required write operations. Maybe we need an InsituMemoryStream?
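A rough sketch of what such a stream might look like, purely as an assumption: `BoundedInsituStream` is a hypothetical name, nothing like it is confirmed to exist in RapidJSON, and it has not been tried with ParseInsitu. It combines a bounded read side with write operations that overwrite bytes the reader has already consumed.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical in-situ stream over a writable buffer with an explicit length.
struct BoundedInsituStream {
    typedef char Ch;

    BoundedInsituStream(Ch* src, std::size_t length)
        : src_(src), dst_(0), head_(src), end_(src + length) {}

    // Read side: report '\0' at the end instead of reading past the buffer.
    Ch Peek() const { return src_ == end_ ? '\0' : *src_; }
    Ch Take() { return src_ == end_ ? '\0' : *src_++; }
    std::size_t Tell() const { return static_cast<std::size_t>(src_ - head_); }

    // Write side: decoded characters overwrite bytes already consumed.
    // Assumption of this sketch: the write cursor always trails the read cursor.
    Ch* PutBegin() { return dst_ = src_; }
    void Put(Ch c) { assert(dst_ != 0 && dst_ < src_); *dst_++ = c; }
    void Flush() {}
    std::size_t PutEnd(Ch* begin) { return static_cast<std::size_t>(dst_ - begin); }

    Ch* src_;   // read cursor
    Ch* dst_;   // write cursor, valid between PutBegin() and PutEnd()
    Ch* head_;  // start of the buffer
    Ch* end_;   // one past the last valid character
};
```

The terminating '\0' of a decoded string lands on or before the closing quote, so the writes stay inside the original buffer as long as the consumer reads each character before writing its replacement.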

spl closed this as completed Oct 9, 2014

spl reopened this Oct 17, 2014

miloyip added enhancement question labels Apr 11, 2015

miloyip changed the title ~~Parse non-null-terminated strings~~ Parse non-null-terminated strings / Parse with std::string Apr 11, 2015

miloyip changed the title ~~Parse non-null-terminated strings / Parse with std::string~~ Parse non-null-terminated strings / Parse with std::string Apr 11, 2015

miloyip added this to the v1.1 Beta milestone Apr 24, 2015

pah mentioned this issue Aug 31, 2015

Add api to parse string with specified length #417

Closed

miloyip mentioned this issue Feb 19, 2016

Issue158 parsestdstring #553

Merged

miloyip closed this as completed in #553 Feb 20, 2016

pah mentioned this issue Jun 14, 2016

parse with length option #659

Closed

pah mentioned this issue Dec 12, 2017

Added max length parameter for GenericStringStream #1135

Closed

n00bmind mentioned this issue Jul 27, 2023

Still no insitu parsing for non null-terminated strings? #2181

Open
