IncompleteJSONError: lexical error: invalid char in json text #33
@carlosg-m sadly, no.
Sadly, not either. Addressing both problems would require a fair amount of work. The best you can hope for is to make sure the data you are parsing is well formed before you give it to ijson. Obviously the best place to start is the code that writes the data, if you have access to it. If you don't, you can tap into the stream and correct it before it reaches ijson.
Thank you for the information, it helps a lot. We are going to try to fix the issue directly at the source, which in this case is a custom-made legacy geographic information system.
I think NaN support is something that should be seriously reconsidered. While it's not part of standard JSON, NaNs are extremely common in real life and real data. This issue is blocking me from making use of ijson. I suppose I'll try stream manipulation in the meantime, but I'm interested in finding a way to solve this (or perhaps just the python backend could get experimental support for this feature?). Is the diff as simple as the following?

```diff
diff --git a/ijson/backends/python.py b/ijson/backends/python.py
index 8efda79..1726ef0 100644
--- a/ijson/backends/python.py
+++ b/ijson/backends/python.py
@@ -8,7 +8,7 @@ from ijson import common, utils
 import codecs

-LEXEME_RE = re.compile(r'[a-z0-9eE\.\+-]+|\S')
+LEXEME_RE = re.compile(r'[a-z0-9eNE\.\+-]+|\S')

 UNARY_LEXEMES = set('[]{},')
 EOF = -1, None
@@ -184,6 +184,9 @@ def parse_value(target, multivalue, use_float):
         elif symbol == 'false':
             send(('boolean', False))
             pop()
+        elif symbol == 'NaN':
+            send(('number', float('nan')))
+            pop()
         elif symbol[0] == '"':
             send(('string', parse_string(symbol)))
             pop()
@@ -226,7 +229,7 @@ def parse_value(target, multivalue, use_float):
             if number == inf:
                 raise common.JSONError("float overflow: %s" % (symbol,))
         except:
-            if 'true'.startswith(symbol) or 'false'.startswith(symbol) or 'null'.startswith(symbol):
+            if 'true'.startswith(symbol) or 'false'.startswith(symbol) or 'null'.startswith(symbol) or 'NaN'.startswith(symbol):
                 raise common.IncompleteJSONError('Incomplete JSON content')
             raise UnexpectedSymbol(symbol, pos)
         else:
```
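The lexer half of the proposal can be sanity-checked in isolation. A minimal sketch (the `OLD_`/`NEW_` names are mine, and the real lexer also treats strings specially, which this ignores): without `N` in the character class, the literal `NaN` is split into three single-character tokens, so the parser never sees it as one symbol.

```python
import re

# Original lexer regex from ijson's python backend: 'N' is not in the
# character class, so "NaN" is lexed as three separate tokens.
OLD_LEXEME_RE = re.compile(r'[a-z0-9eE\.\+-]+|\S')
# Proposed change: adding 'N' lets the class consume the whole literal.
NEW_LEXEME_RE = re.compile(r'[a-z0-9eNE\.\+-]+|\S')

print(OLD_LEXEME_RE.findall('NaN'))         # ['N', 'a', 'N']
print(NEW_LEXEME_RE.findall('NaN'))         # ['NaN']
print(NEW_LEXEME_RE.findall('{"x": NaN}'))  # ['{', '"', 'x', '"', ':', 'NaN', '}']
```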
@Erotemic I'm not sure if that diff you posted above would be all that's necessary (I don't mean this as "there is stuff missing!"; I literally don't know, because I'd have to invest some time to think it through thoroughly). In any case it would only cover the python backend, which is definitely not what you want to be using in a production environment. The rest of the backends are not that easily modifiable (or not at all; some just load compiled, externally-provided shared libraries), so I'd be hesitant to add support for a non-standard JSON feature that is only supported by a single backend, which happens to be the slowest.

So sorry, but I won't be adding support for NaNs in ijson. Even if they are common in real life and real data, they should not be common in JSON documents. I'm sorry to hear this blocks you from using ijson, but I'd expect it to block you from using a few other JSON libraries too, since they probably won't handle NaNs either. If you have control over your data sources, you can try modifying them so they generate valid JSON; if you don't, your best bet is to tap into the stream and perform the necessary corrections to obtain valid JSON content.
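"Tapping into the stream" can be as small as a chunk-wise filter placed in front of the parser. A hedged sketch (my own helper, not part of ijson; it would also rewrite the literal text `NaN` occurring inside JSON strings, so it assumes that never happens in your data):

```python
import io
import json

def scrub_nan(stream, chunk_size=65536):
    """Yield the stream's bytes with bare NaN tokens rewritten to null.

    A two-byte tail is carried across chunks so a NaN split over a
    chunk boundary ("Na" + "N") is still rewritten.
    """
    tail = b''
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            if tail:
                yield tail.replace(b'NaN', b'null')
            return
        data = (tail + chunk).replace(b'NaN', b'null')
        tail = data[-2:]
        yield data[:-2]

# Tiny chunks exaggerate the boundary handling:
raw = io.BytesIO(b'{"temperature": NaN, "count": 3}')
fixed = b''.join(scrub_nan(raw, chunk_size=2))
print(json.loads(fixed))  # {'temperature': None, 'count': 3}
```

The corrected byte stream is strict JSON, so it can be wrapped in a file-like object and handed to any ijson backend unchanged.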
Actually it is. When I have a 12 GB JSON file, where I know there is one entry at the very top that is only a few bytes, taking a 50x slowdown to read a few bytes is more than worth it. Specifically, I have a COCO JSON file that represents an object detection dataset, where a small metadata entry sits at the top of the file. It's not uncommon for some backends to include features that are unsupported by faster, more specialized backends, especially when those features are experimental. There are very real production use cases where using a slower, more feature-rich backend is an overall benefit to the system. It is nice for all backends to have feature parity, but I think there could be more room for flexibility than your initial impressions would suggest.
This is an opinion, and I respectfully disagree. My opinion is that it was shortsighted for the RFC 8259 standard not to include NaN. NaN and null have subtly different meanings, and they are not interchangeable. Of course you aren't obliged to make or accept changes to this library, and even less obliged to change your opinion. There's nothing objectively wrong with it, and there is value in adhering to specifications, but I want to suggest that your notion of correctness is conflated with compliance. The opposing opinion is simply non-compliant, not incorrect. Food for thought.

Fortunately this library is extremely well written, and I can simply copy the python backend into a new module, modify it, and use it with the regular ijson core. For anyone else that needs the functionality, here is the code that I'm using: https://gist.github.com/Erotemic/60ec1b4ffc9961fa34a1821bb139f70f
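The semantic gap between the two is easy to demonstrate in Python itself: NaN is a floating-point value that propagates through arithmetic ("measured, but undefined"), while null (None) is the absence of a value ("never measured"):

```python
import math

nan = float('nan')

# NaN is a number: it participates in (and propagates through) arithmetic.
print(nan + 1.0)        # nan
print(math.isnan(nan))  # True
print(nan == nan)       # False -- NaN is not even equal to itself

# None is the absence of a value: arithmetic on it is a TypeError.
try:
    None + 1.0
except TypeError:
    print('None does not behave like a number')
```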
I'm confused by this issue. ijson is a JSON parser, and NaN is not part of JSON... |
True, yet support for values like NaN and Inf is common in JSON parsers (including ujson and the stdlib json library). There is no provision for NaN in RFC 8259. It's not unreasonable to deny support for it, but it's also not unreasonable to request support for it, or to fork a project to add it.
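For reference, the stdlib behaviour the comment alludes to: Python's json module accepts (and emits) the non-standard NaN/Infinity tokens by default, and strict RFC compliance is opt-in:

```python
import json
import math

# Accepted by default, even though RFC 8259 forbids these tokens:
doc = json.loads('{"score": NaN, "bound": Infinity}')
print(math.isnan(doc['score']))   # True
print(doc['bound'])               # inf

# Strict, RFC-compliant output must be requested explicitly:
try:
    json.dumps(float('nan'), allow_nan=False)
except ValueError as e:
    print(e)  # Out of range float values are not JSON compliant
```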
@Erotemic thanks for the constructive discussion, and for raising your points with respect and clarity -- that's hugely appreciated. Adding that link to your solution is also a good idea for people in the future who might find it useful.
Sorry, I meant this in the more general case, and was just guessing at your particular use case, where you indeed don't care much about performance.
My apologies, that was poor writing on my side; I meant that phrase as an observation, not as an opinion. The observation is that this package is downloaded millions of times per month, and yet the issue of NaNs appearing in JSON documents has been brought up only twice in multiple years. My assumption then is that NaNs do not appear in JSON documents that often.

I'm not sure what the story behind explicitly not supporting NaNs in the RFC is. My impression is that they didn't want to bind themselves to the IEEE 754 format, given also how they allow arbitrarily big or precise numbers (e.g., "1e400" or "3.141592653589793238462643383279") to be used, but that's just a guess. I agree that this feels like an obvious miss, but that is with the benefit of hindsight; the original specification is 16 years old, from a time when JSON wasn't as ubiquitous as it is today.

About support for NaNs in ijson, and to finish my intervention: if ijson consisted only of the python backend, I probably wouldn't mind too much about adding this support. But given that there are other backends I don't control as much (plus another one in the works), I'm much more hesitant to add maintenance burden on myself for a feature that isn't standard. If I merged this in, it would only be a matter of time before people requested support for Infinity as well.
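The "arbitrarily big or precise numbers" point is easy to see with the stdlib: the text of a JSON number is unbounded, and it is only the mapping onto IEEE 754 doubles that overflows or rounds. A small illustration (`parse_float=Decimal` is a real stdlib option; ijson's python backend exposes a similar trade-off through the `use_float` parameter visible in the diff above):

```python
import json
from decimal import Decimal

# Mapping onto a double overflows silently:
print(json.loads('1e400'))  # inf

# Keeping the textual value exact instead:
pi = json.loads('3.141592653589793238462643383279', parse_float=Decimal)
print(pi)  # 3.141592653589793238462643383279
```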
Is there a solution for parsing JSON files with syntax problems? Can this type of error be skipped?
This is the actual pipeline being applied: