Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reading past end of JSON data in a stream #239

Open
ScottPJones opened this issue Feb 9, 2018 · 8 comments
Open

Reading past end of JSON data in a stream #239

ScottPJones opened this issue Feb 9, 2018 · 8 comments

Comments

@ScottPJones
Copy link
Contributor

While investigating an issue with my changes for better performance and to fix numeric parsing issues, I realized that parsing a number can "eat" the next character in a stream.
The way that the async.jl unit test works, by writing out a stream of JSON values, and then reading them on another process, breaks if there is not another character to separate the values, and just a number is being read.
I attempted to fix this using Base.peek, however on v0.6.2, it doesn't work for the type TCPStream.
I'm not sure what the behavior of a similar test in JavaScript or Python would be.

@TotalVerb
Copy link
Collaborator

We should put back any characters we are eating. Is there a reason peek doesn't work on TCPStream?

@ScottPJones
Copy link
Contributor Author

ScottPJones commented Feb 10, 2018

For one thing, peek isn't even exported, and it is not implemented for TCPStream in v0.6.2.
I haven't tried yet on master.

(The peek implementation looked like it might hurt performance a lot, also, at least one of the methods, that does a try / finally and saves / restores the position in the stream, after reading a single byte)
A better approach in that case would be to only restore the position in the case of reading a JSON number when finding something that isn't a digit (after a -, ., e, E or + (in an exponent), one or more digits are always required).

@TotalVerb
Copy link
Collaborator

What I mean is, is there a reason why peek is not implemented (in Base), as it seems this would be a really useful feature overall?

@ScottPJones
Copy link
Contributor Author

I'm not sure why it was left unexported in master, but at least there it is better supported (instead of just 2 methods,
it has 7, including peek(s::Base.LibuvStream) in Base at stream.jl:1182, which should handle TCPSocket.
(I misremembered the name before, it's TCPSocket instead of TCPStream)

@samoconnor
Copy link

doesn't work for the type TCPStream.

You'll likely also run into trouble with wrappers like SSLContext (almost every HTTP connection is a HTTPS connection nowdays).

To me having a peek that only returns one character is a bit arbitrary. I think it would be more useful to have a peek that returns a view of however many bytes are available in whatever buffer the IO implementation has. e.g. it would be handy to peek in the buffer for presence of e.g. \r\n\r\n before reading.

@ScottPJones
Copy link
Contributor Author

One character (really, 1 byte) look-ahead is all that the JSON format needs, and that is only for numbers (everything else is terminated by a byte that is known in advance, i.e. '}', ']', ':'or '"').
Having a function that could read the entire buffer of ready to be read bytes would also be quite useful, for other things.

@peteristhegreat
Copy link

peteristhegreat commented Dec 4, 2020

I hit this bug today while accidentally reading some SOAP responses. Having the response wiped while doing a JSON read feels like a bug. In this example the original string is preserved, but in my debugging session, I was dealing with the array output initially not a string.

julia> body = "<XML>This is not JSON</XML>\n"
"<XML>This is not JSON</XML>\n"

julia> array = Vector{UInt8}(body)
28-element Array{UInt8,1}:
 0x3c
 0x58
 0x4d
 0x4c
 0x3e
 0x54
 0x68
 0x69
 0x73
 0x20
 0x69
 0x73
 0x20
 0x6e
 0x6f
 0x74
 0x20
 0x4a
 0x53
 0x4f
 0x4e
 0x3c
 0x2f
 0x58
 0x4d
 0x4c
 0x3e
 0x0a

julia> JSON.parse(String(array))
ERROR: Unexpected character
Line: 0
Around: ...<XML>This is not JSON<...
            ^

Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] _error(::String, ::JSON.Parser.MemoryParserState) at /Users/user/.julia/packages/JSON/3rsiS/src/Parser.jl:140
 [3] parse_jsconstant(::JSON.Parser.ParserContext{Dict{String,Any},Int64,true,nothing}, ::JSON.Parser.MemoryParserState) at /Users/user/.julia/packages/JSON/3rsiS/src/Parser.jl:193
 [4] parse_value(::JSON.Parser.ParserContext{Dict{String,Any},Int64,true,nothing}, ::JSON.Parser.MemoryParserState) at /Users/user/.julia/packages/JSON/3rsiS/src/Parser.jl:170
 [5] #parse#1(::Type, ::Type{Int64}, ::Bool, ::Nothing, ::typeof(JSON.Parser.parse), ::String) at /Users/user/.julia/packages/JSON/3rsiS/src/Parser.jl:460
 [6] parse(::String) at /Users/user/.julia/packages/JSON/3rsiS/src/Parser.jl:458
 [7] top-level scope at REPL[19]:1

julia> array
0-element Array{UInt8,1}

julia> body
"<XML>This is not JSON</XML>\n"

Basically, reading off the end of the array causes the array to get dropped. I would rather it not drop the array.

It looks like this thread here is related.

JuliaLang/julia#24388

@KristofferC
Copy link
Member

KristofferC commented Dec 5, 2020

How is this related to JSON.jl?

julia> body = "<XML>This is not JSON</XML>\n"
"<XML>This is not JSON</XML>\n"

julia> array = Vector{UInt8}(body);

julia> String(array);

julia> array
UInt8[]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants