
IncompleteJSONError: lexical error: invalid char in json text #33

Closed
carlosg-m opened this issue Sep 24, 2020 · 8 comments
Labels: question (Further information is requested)

Comments

carlosg-m commented Sep 24, 2020

Is there a solution for parsing JSON files with syntax problems?

Can this type of error be skipped?

IncompleteJSONError: lexical error: invalid char in json text.
          type":"Point","coordinates":[NaN,NaN],"crs":{"type":"link","
                     (right here) ------^
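
For reference, a minimal sketch that reproduces this failure with a yajl-based backend (the source of the lexical-error message above); the coordinates are illustrative:

import io
import ijson

data = b'{"type": "Point", "coordinates": [NaN, NaN]}'
# Draining the event stream raises an error at the first 'N'
# (IncompleteJSONError with the yajl backends, matching the trace above)
for event in ijson.parse(io.BytesIO(data)):
    print(event)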

This is the actual pipeline being applied:

import ijson
from pathlib import Path

# get_record() and save_parquet() are helpers defined elsewhere in the project;
# save_parquet() presumably writes the batch to disk and returns an empty list.
def get_generator(gen, export_path, batch_size):
  records = []
  for i, ob in enumerate(gen):
    records.append(get_record(ob))
    if len(records) == batch_size:
      records = save_parquet(records, i, export_path)
      print('saved:', str(i+1))
  if len(records) > 0:  # flush the final, partially-filled batch
    records = save_parquet(records, i, export_path)
    print('saved:', str(i+1))

def read_json_parquet(file_name, export_path, batch_size):
  Path(export_path).mkdir(parents=False, exist_ok=False)
  with open(file_name, 'rb') as file:
    obj = ijson.items(file, 'features.item')  # lazily stream GeoJSON features
    get_generator(obj, export_path=export_path, batch_size=batch_size)
carlosg-m changed the title from "IncompleteJSONError: lexical error: invalid char in json text." to "IncompleteJSONError: lexical error: invalid char in json text" on Sep 24, 2020
rtobar commented Sep 24, 2020

> Is there a solution for parsing JSON files with syntax problems?

@carlosg-m sadly, no.

> Can this type of error be skipped?

Sadly, no, not that either.

Addressing both problems would require a fair amount of work. In particular, the yajl-based backends (which is almost all of them) do not get access to the last valid byte in the buffer when an error is found, so there is nothing ijson can work with to fix or skip the problem, or to let the user do so.

The best you can hope for is to make sure the data you are parsing is well formed before you give it to ijson. Obviously the best place to start is the code that writes the data, if you have access to it. If you don't, you can use sed or any similar streaming mechanism (even within your own program) to change those NaNs into proper JSON content (e.g., sed 's/NaN,NaN/"nan","nan"/g' your-file > fixed-file) before the data reaches ijson.
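
To illustrate the "within your own program" option, here is a minimal sketch, assuming the NaNs appear as bare tokens as in the trace above; NaNFixingStream and the file name are hypothetical, not part of ijson's API:

import re
import ijson

class NaNFixingStream:
    """Minimal file-like wrapper that rewrites the non-standard NaN token
    into the JSON string "nan" before the bytes reach ijson. Naive on
    purpose: it would also rewrite a literal NaN inside a JSON string."""

    def __init__(self, f):
        self._f = f
        self._buf = b''  # carries a partial 'N'/'Na' across chunk reads

    def read(self, size=-1):
        while True:
            chunk = self._f.read(65536)
            data = self._buf + chunk
            if not chunk:  # EOF: emit and fix whatever is left
                self._buf = b''
                return re.sub(rb'NaN', b'"nan"', data)
            # Hold back a trailing partial token so a NaN split across
            # two chunks is still caught on the next read.
            keep = 2 if data.endswith(b'Na') else 1 if data.endswith(b'N') else 0
            out, self._buf = data[:len(data) - keep], data[len(data) - keep:]
            if out:
                return re.sub(rb'NaN', b'"nan"', out)

with open('your-file.json', 'rb') as f:  # hypothetical file name
    for feature in ijson.items(NaNFixingStream(f), 'features.item'):
        pass  # each feature now has "nan" strings instead of NaN tokens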

carlosg-m (Author)

> The best you can hope for is to make sure the data you are parsing is well formed before you give it to ijson. Obviously the best place to start is the code that writes the data, if you have access to it. If you don't, you can use sed or any similar streaming mechanism

Thank you for the information, it helps a lot. We are going to try to fix the issue directly at the source, which in this case is a custom-made legacy geographic information system.

rtobar added the question (Further information is requested) label on Sep 24, 2020
Erotemic commented Oct 28, 2022

I think NaN support is something that should be seriously reconsidered. While it's not part of standard JSON, NaNs are extremely common in real life and real data. This issue is blocking me from making use of ijson. I suppose I'll try stream manipulation in the meantime, but I'm interested in finding a way to solve this.

(or perhaps just the python backend could get experimental support for this feature?)

Is the diff as simple as:

diff --git a/ijson/backends/python.py b/ijson/backends/python.py
index 8efda79..1726ef0 100644
--- a/ijson/backends/python.py
+++ b/ijson/backends/python.py
@@ -8,7 +8,7 @@ from ijson import common, utils
 import codecs
 
 
-LEXEME_RE = re.compile(r'[a-z0-9eE\.\+-]+|\S')
+LEXEME_RE = re.compile(r'[a-z0-9eNE\.\+-]+|\S')
 UNARY_LEXEMES = set('[]{},')
 EOF = -1, None
 
@@ -184,6 +184,9 @@ def parse_value(target, multivalue, use_float):
             elif symbol == 'false':
                 send(('boolean', False))
                 pop()
+            elif symbol == 'NaN':
+                send(('number', float('nan')))
+                pop()
             elif symbol[0] == '"':
                 send(('string', parse_string(symbol)))
                 pop()
@@ -226,7 +229,7 @@ def parse_value(target, multivalue, use_float):
                     if number == inf:
                         raise common.JSONError("float overflow: %s" % (symbol,))
                 except:
-                    if 'true'.startswith(symbol) or 'false'.startswith(symbol) or 'null'.startswith(symbol):
+                    if 'true'.startswith(symbol) or 'false'.startswith(symbol) or 'null'.startswith(symbol) or 'NaN'.startswith(symbol):
                         raise common.IncompleteJSONError('Incomplete JSON content')
                     raise UnexpectedSymbol(symbol, pos)
                 else:

?

rtobar commented Oct 28, 2022

@Erotemic I'm not sure if the diff you posted above would be all that's necessary (I don't mean this as "there is stuff missing!"; I literally am not sure, because I'd have to invest some time to think it through thoroughly). In any case it would only cover the python backend -- which is definitely not what you want to be using in a production environment, when the yajl2_c backend is about 50x faster.

The rest of the backends are not that easily modifiable, though (or not modifiable at all: some just load compiled, externally-provided shared libraries), so I'd be hesitant to add support for a non-standard JSON feature that is only supported by a single backend, which happens to be the slowest.

So sorry, but I won't be adding support for NaNs in ijson. Even if they are common in real life and real data, they should not be common in JSON documents. I'm sorry to hear this blocks you from using ijson, but I'd expect it to block you from using a few other JSON libraries too, which probably won't handle NaNs either. If you have control over your data sources, you can try modifying them so they generate valid JSON; if you don't have that control, then your best bet is to tap into the stream and perform the necessary corrections to obtain valid JSON content.
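
For context on the backend discussion: ijson's documented pattern is to import a backend module in place of the top-level package, so a patched python backend could coexist with the fast default. A minimal sketch (file name hypothetical):

import ijson.backends.python as ijson_python  # pure-Python: slowest, easiest to patch

with open('data.json', 'rb') as f:
    for item in ijson_python.items(f, 'features.item'):
        pass  # same items()/parse() API as the default backend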

@Erotemic

> which is definitely not what you want to be using in a production environment

Actually it is. When I have a 12 GB JSON file, where I know there is one entry at the very top that is only a few bytes, taking a 50x slowdown to read a few bytes is more than worth it. Specifically, I have a COCO JSON file that represents an object detection dataset. The basic structure is {"info": List[dict], "categories": List[dict], "images": List[dict], "annotations": List[dict]}. There are thousands of image and annotation entries, which make the file quite large, and in this case I'm only interested in parsing the items in "info" at the very top because they contain some metadata I'm interested in.
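
A sketch of that prefix-only read under the structure described above (helper name hypothetical). It uses ijson's event-level parse API rather than items(f, 'info.item'), because exhausting the items generator would keep scanning the whole file for further matches:

from ijson.common import ObjectBuilder
import ijson

def read_coco_info(path):  # hypothetical helper name
    """Parse only the small leading "info" list of a large COCO file."""
    infos, builder = [], None
    with open(path, 'rb') as f:
        for prefix, event, value in ijson.parse(f):
            if prefix == 'info' and event == 'end_array':
                break  # done: never touches the huge arrays below
            if prefix == 'info.item' and event == 'start_map':
                builder = ObjectBuilder()
            if builder is not None:
                builder.event(event, value)
                if prefix == 'info.item' and event == 'end_map':
                    infos.append(builder.value)
                    builder = None
    return infos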

It's not uncommon for some backends to include features that are unsupported by faster, more specialized backends, especially when those features are experimental. There are very real production use cases where using a slower, more feature-rich backend is an overall benefit to the system. It is nice for all backends to have feature parity, but I think there could be more room for flexibility than your initial impression would suggest.

> they should not be common in JSON documents

This is an opinion, and I respectfully disagree. My opinion is that it was shortsighted for the RFC 8259 standard not to include NaN. NaN and null have subtly different meanings, and they are not interchangeable.

Of course you aren't obliged to make or accept changes to this library, and even less obliged to change your opinion. There's not anything objectively wrong with it, and there is value in adhering to specifications, but I want to suggest that your notion of correctness is conflated with compliance. The opposing opinion is simply non-compliant, not incorrect. Food for thought.

Fortunately this library is extremely well written and I can simply copy the python backend into a new module, modify it, and use it with the regular ijson core. For anyone else that needs the functionality, here is the code that I'm using: https://gist.github.com/Erotemic/60ec1b4ffc9961fa34a1821bb139f70f

@jpmckinney

I'm confused by this issue. ijson is a JSON parser, and NaN is not part of JSON...

Erotemic commented Oct 28, 2022

True, yet support for values like NaN and Inf is common in JSON parsers (including ujson and the stdlib json library). There is no specification for NaN in RFC 8259. It's not unreasonable to deny support for it, but it's also not unreasonable to request support for it, or to fork a project to add support for it.
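
For instance, the standard library's json module accepts and emits these tokens by default (documented behavior that can be switched off):

import json
import math

# json.loads() accepts the non-standard NaN/Infinity tokens by default...
doc = json.loads('{"coordinates": [NaN, Infinity, -Infinity]}')
print(math.isnan(doc['coordinates'][0]))  # True

# ...and json.dumps() emits them, since allow_nan defaults to True.
print(json.dumps({'x': float('nan')}))  # {"x": NaN}

# Strict RFC 8259 behavior must be requested explicitly:
# json.dumps({'x': float('nan')}, allow_nan=False) raises ValueError.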

rtobar commented Oct 30, 2022

@Erotemic thanks for the constructive discussion, and for raising your points with respect and clarity -- that's hugely appreciated. Adding that link to your solution is also a good idea for people in the future who might find it useful.

"which is definitely not what you want to be using in a production environment"
Actually it is

Sorry, I meant this in the more general case, and was just guessing at your particular use case, where you indeed don't care much about performance.

"they [nands] should not be common in JSON documents"
This is an opinion, and I respectfully disagree

My apologies, that was poor writing on my side; I meant that phrase as an observation, not as an opinion. The observation is that this package is downloaded millions of times per month, and yet the issue of NaNs appearing in JSON documents has been brought up only twice in multiple years. My assumption, then, is that NaNs do not appear in JSON documents that often.

I'm not sure what the story behind explicitly not supporting NaNs in the RFC is. My impression is that they didn't want to bind themselves to the IEEE 754 format, given also how they allow arbitrarily big or precise numbers (e.g., "1e400" or "3.141592653589793238462643383279") to be used, but that's just a guess. I agree that this feels like an obvious miss, but that is with the benefit of hindsight; the original specification is 16 years old, from a time when JSON wasn't as ubiquitous as it is today.

About support for NaNs in ijson, and to finish my intervention: if ijson consisted only of the python backend, I probably wouldn't mind too much adding this support. But given that there are other backends that I don't control as much (plus another one in the works), I'm much more hesitant to add maintenance burden on myself for a feature that isn't standard. If I merge this in, it's only a matter of time before people request support for inf/infinity, then for pi, then for these in the other backends, etc.
