
IncompleteJSONError: lexical error: invalid char in json text #33

Closed
carlosg-m opened this issue Sep 24, 2020 · 8 comments
Labels: question (Further information is requested)

Comments

carlosg-m commented Sep 24, 2020

Is there a solution for parsing JSON files with syntax problems?

Can this type of error be skipped?

IncompleteJSONError: lexical error: invalid char in json text.
          type":"Point","coordinates":[NaN,NaN],"crs":{"type":"link","
                     (right here) ------^
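
For reference, a minimal sketch that reproduces this failure with a yajl-based backend (the source of the lexical-error message above); the coordinates are illustrative:

import io
import ijson

data = b'{"type": "Point", "coordinates": [NaN, NaN]}'
# Draining the event stream raises an error at the first 'N'
# (IncompleteJSONError with the yajl backends, matching the trace above)
for event in ijson.parse(io.BytesIO(data)):
    print(event)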

This is the actual pipeline being applied:

import ijson
from pathlib import Path

# get_record() and save_parquet() are helpers defined elsewhere in the project;
# save_parquet() presumably writes the batch to disk and returns an empty list.
def get_generator(gen, export_path, batch_size):
  records = []
  for i, ob in enumerate(gen):
    records.append(get_record(ob))
    if len(records) == batch_size:
      records = save_parquet(records, i, export_path)
      print('saved:', str(i+1))
  if len(records) > 0:  # flush the final, partially-filled batch
    records = save_parquet(records, i, export_path)
    print('saved:', str(i+1))

def read_json_parquet(file_name, export_path, batch_size):
  Path(export_path).mkdir(parents=False, exist_ok=False)
  with open(file_name, 'rb') as file:
    obj = ijson.items(file, 'features.item')  # lazily stream GeoJSON features
    get_generator(obj, export_path=export_path, batch_size=batch_size)
carlosg-m changed the title from "IncompleteJSONError: lexical error: invalid char in json text." to "IncompleteJSONError: lexical error: invalid char in json text" on Sep 24, 2020
rtobar commented Sep 24, 2020

> Is there a solution for parsing JSON files with syntax problems?

@carlosg-m sadly, no.

> Can this type of error be skipped?

Sadly, no, not that either.

Addressing both problems would require a fair amount of work. In particular, the yajl-based backends (which is almost all of them) do not get access to the last valid byte in the buffer when an error is found, so there is nothing ijson can work with to fix or skip the problem, or to let the user do so.

The best you can hope for is to make sure the data you are parsing is well formed before you give it to ijson. Obviously the best place to start is the code that writes the data, if you have access to it. If you don't, you can use sed or any similar streaming mechanism (even within your own program) to change those NaNs into proper JSON content (e.g., sed 's/NaN,NaN/"nan","nan"/g' your-file > fixed-file) before the data reaches ijson.
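
To illustrate the "within your own program" option, here is a minimal sketch, assuming the NaNs appear as bare tokens as in the trace above; NaNFixingStream and the file name are hypothetical, not part of ijson's API:

import re
import ijson

class NaNFixingStream:
    """Minimal file-like wrapper that rewrites the non-standard NaN token
    into the JSON string "nan" before the bytes reach ijson. Naive on
    purpose: it would also rewrite a literal NaN inside a JSON string."""

    def __init__(self, f):
        self._f = f
        self._buf = b''  # carries a partial 'N'/'Na' across chunk reads

    def read(self, size=-1):
        while True:
            chunk = self._f.read(65536)
            data = self._buf + chunk
            if not chunk:  # EOF: emit and fix whatever is left
                self._buf = b''
                return re.sub(rb'NaN', b'"nan"', data)
            # Hold back a trailing partial token so a NaN split across
            # two chunks is still caught on the next read.
            keep = 2 if data.endswith(b'Na') else 1 if data.endswith(b'N') else 0
            out, self._buf = data[:len(data) - keep], data[len(data) - keep:]
            if out:
                return re.sub(rb'NaN', b'"nan"', out)

with open('your-file.json', 'rb') as f:  # hypothetical file name
    for feature in ijson.items(NaNFixingStream(f), 'features.item'):
        pass  # each feature now has "nan" strings instead of NaN tokens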

carlosg-m (Author)

> The best you can hope for is to make sure the data you are parsing is well formed before you give it to ijson. Obviously the best place to start is the code that writes the data, if you have access to it. If you don't, you can use sed or any similar streaming mechanism

Thank you for the information, it helps a lot. We are going to try to fix the issue directly at the source, which in this case is a custom-made legacy geographic information system.

rtobar added the question (Further information is requested) label on Sep 24, 2020
Erotemic commented Oct 28, 2022

I think NaN support is something that should be seriously reconsidered. While it's not part of standard JSON, NaNs are extremely common in real life and real data. This issue is blocking me from making use of ijson. I suppose I'll try stream manipulation in the meantime, but I'm interested in finding a way to solve this.

(or perhaps just the python backend could get experimental support for this feature?)

Is the diff as simple as:

diff --git a/ijson/backends/python.py b/ijson/backends/python.py
index 8efda79..1726ef0 100644
--- a/ijson/backends/python.py
+++ b/ijson/backends/python.py
@@ -8,7 +8,7 @@ from ijson import common, utils
 import codecs
 
 
-LEXEME_RE = re.compile(r'[a-z0-9eE\.\+-]+|\S')
+LEXEME_RE = re.compile(r'[a-z0-9eNE\.\+-]+|\S')
 UNARY_LEXEMES = set('[]{},')
 EOF = -1, None
 
@@ -184,6 +184,9 @@ def parse_value(target, multivalue, use_float):
             elif symbol == 'false':
                 send(('boolean', False))
                 pop()
+            elif symbol == 'NaN':
+                send(('number', float('nan')))
+                pop()
             elif symbol[0] == '"':
                 send(('string', parse_string(symbol)))
                 pop()
@@ -226,7 +229,7 @@ def parse_value(target, multivalue, use_float):
                     if number == inf:
                         raise common.JSONError("float overflow: %s" % (symbol,))
                 except:
-                    if 'true'.startswith(symbol) or 'false'.startswith(symbol) or 'null'.startswith(symbol):
+                    if 'true'.startswith(symbol) or 'false'.startswith(symbol) or 'null'.startswith(symbol) or 'NaN'.startswith(symbol):
                         raise common.IncompleteJSONError('Incomplete JSON content')
                     raise UnexpectedSymbol(symbol, pos)
                 else:

?

rtobar commented Oct 28, 2022

@Erotemic I'm not sure if the diff you posted above would be all that's necessary (I don't mean this as "there is stuff missing!"; I literally am not sure, because I'd have to invest some time to think it through thoroughly). In any case it would only cover the python backend -- which is definitely not what you want to be using in a production environment, when the yajl2_c backend is about 50x faster.

The rest of the backends are not that easily modifiable, though (or not modifiable at all: some just load compiled, externally-provided shared libraries), so I'd be hesitant to add support for a non-standard JSON feature that is only supported by a single backend, which happens to be the slowest.

So sorry, but I won't be adding support for NaNs in ijson. Even if they are common in real life and real data, they should not be common in JSON documents. I'm sorry to hear this blocks you from using ijson, but I'd expect it to block you from using a few other JSON libraries too, which probably won't handle NaNs either. If you have control over your data sources, you can try modifying them so they generate valid JSON; if you don't have that control, then your best bet is to tap into the stream and perform the necessary corrections to obtain valid JSON content.
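
For context on the backend discussion: ijson's documented pattern is to import a backend module in place of the top-level package, so a patched python backend could coexist with the fast default. A minimal sketch (file name hypothetical):

import ijson.backends.python as ijson_python  # pure-Python: slowest, easiest to patch

with open('data.json', 'rb') as f:
    for item in ijson_python.items(f, 'features.item'):
        pass  # same items()/parse() API as the default backend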

@Erotemic

> which is definitely not what you want to be using in a production environment

Actually it is. When I have a 12 GB JSON file, where I know there is one entry at the very top that is only a few bytes, taking a 50x slowdown to read a few bytes is more than worth it. Specifically, I have a COCO JSON file that represents an object detection dataset. The basic structure is {"info": List[dict], "categories": List[dict], "images": List[dict], "annotations": List[dict]}. There are thousands of image and annotation entries, which make the file quite large, and in this case I'm only interested in parsing the items in "info" at the very top because they contain some metadata I'm interested in.
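
A sketch of that prefix-only read under the structure described above (helper name hypothetical). It uses ijson's event-level parse API rather than items(f, 'info.item'), because exhausting the items generator would keep scanning the whole file for further matches:

from ijson.common import ObjectBuilder
import ijson

def read_coco_info(path):  # hypothetical helper name
    """Parse only the small leading "info" list of a large COCO file."""
    infos, builder = [], None
    with open(path, 'rb') as f:
        for prefix, event, value in ijson.parse(f):
            if prefix == 'info' and event == 'end_array':
                break  # done: never touches the huge arrays below
            if prefix == 'info.item' and event == 'start_map':
                builder = ObjectBuilder()
            if builder is not None:
                builder.event(event, value)
                if prefix == 'info.item' and event == 'end_map':
                    infos.append(builder.value)
                    builder = None
    return infos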

It's not uncommon for some backends to include features that are unsupported by faster, more specialized backends, especially when those features are experimental. There are very real production use cases where using a slower, more feature-rich backend is an overall benefit to the system. It is nice for all backends to have feature parity, but I think there could be more room for flexibility than your initial impression would suggest.

> they should not be common in JSON documents

This is an opinion, and I respectfully disagree. My opinion is that it was shortsighted for the RFC 8259 standard not to include NaN. NaN and null have subtly different meanings, and they are not interchangeable.

Of course you aren't obliged to make or accept changes to this library, and even less obliged to change your opinion. There's not anything objectively wrong with it, and there is value in adhering to specifications, but I want to suggest that your notion of correctness is conflated with compliance. The opposing opinion is simply non-compliant, not incorrect. Food for thought.

Fortunately this library is extremely well written and I can simply copy the python backend into a new module, modify it, and use it with the regular ijson core. For anyone else that needs the functionality, here is the code that I'm using: https://gist.github.com/Erotemic/60ec1b4ffc9961fa34a1821bb139f70f

@jpmckinney

I'm confused by this issue. ijson is a JSON parser, and NaN is not part of JSON...

Erotemic commented Oct 28, 2022

True, yet support for values like NaN and Inf is common in JSON parsers (including ujson and the stdlib json library). There is no specification for NaN in RFC 8259. It's not unreasonable to deny support for it, but it's also not unreasonable to request support for it, or to fork a project to add support for it.
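
For instance, the standard library's json module accepts and emits these tokens by default (documented behavior that can be switched off):

import json
import math

# json.loads() accepts the non-standard NaN/Infinity tokens by default...
doc = json.loads('{"coordinates": [NaN, Infinity, -Infinity]}')
print(math.isnan(doc['coordinates'][0]))  # True

# ...and json.dumps() emits them, since allow_nan defaults to True.
print(json.dumps({'x': float('nan')}))  # {"x": NaN}

# Strict RFC 8259 behavior must be requested explicitly:
# json.dumps({'x': float('nan')}, allow_nan=False) raises ValueError.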

rtobar commented Oct 30, 2022

@Erotemic thanks for the constructive discussion, and for raising your points with respect and clarity -- that's hugely appreciated. Adding that link to your solution is also a good idea for people in the future who might find it useful.

"which is definitely not what you want to be using in a production environment"
Actually it is

Sorry, I meant this in the more general case, and was just guessing at your particular use case, where you indeed don't care much about performance.

"they [nands] should not be common in JSON documents"
This is an opinion, and I respectfully disagree

My apologies, that was poor writing on my side; I meant that phrase as an observation, not as an opinion. The observation is that this package is downloaded millions of times per month, and yet the issue of NaNs appearing in JSON documents has been brought up only twice in multiple years. My assumption, then, is that NaNs do not appear in JSON documents that often.

I'm not sure what the story behind explicitly not supporting NaNs in the RFC is. My impression is that they didn't want to bind themselves to the IEEE 754 format, given also how they allow arbitrarily big or precise numbers (e.g., "1e400" or "3.141592653589793238462643383279") to be used, but that's just a guess. I agree that this feels like an obvious miss, but that is with the benefit of hindsight; the original specification is 16 years old, from a time when JSON wasn't as ubiquitous as it is today.

About support for NaNs in ijson, and to finish my intervention: if ijson consisted only of the python backend, I probably wouldn't mind too much adding this support. But given that there are other backends that I don't control as much (plus another one in the works), I'm much more hesitant to add maintenance burden on myself for a feature that isn't standard. If I merge this in, it's only a matter of time before people request support for inf/infinity, then for pi, then for these in the other backends, etc.
