Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ijson.kvitems(prefix) iterates through items even after the prefix was found and closed #61

Closed
nisace opened this issue Nov 23, 2021 · 3 comments

Comments

@nisace
Copy link

nisace commented Nov 23, 2021

Describe the bug
ijson.kvitems(prefix) iterates through items even after the prefix was found and closed.

How to reproduce

If first generated an example.json file with the following code.

import json


def generate_example(n):
    with open("example.json", "w") as f:
        json.dump({"a": {"foo": 13}, "b": {"bar": list(range(n))}}, f)

This gives the following JSON for n=5

{"a": {"foo": 13}, "b": {"bar": [0, 1, 2, 3, 4]}}

Then I tried to load only the items under the prefix a with the two following functions:

import ijson


def g():
    with open("example.json", "r") as f:
        result = list(ijson.kvitems(f, prefix="a", use_float=True))

    print(result)


def h():
    with open("example.json", "r") as f:
        under_prefix = False
        result = []
        for prefix, event, value in ijson.parse(f):

            if prefix == "a" and event == "start_map":
                under_prefix = True

            if under_prefix:
                result.append((prefix, event, value))

            if prefix == "a" and event == "end_map":
                break

    print(result)

To compare the two functions, I created a big example.json file with n=10000000.

The function h took 0.43s to execute while the function g took 2.60s. Thus it seems that ijson.kvitems(prefix) iterates through items even after the prefix was found and closed (i.e event end_map is found).

Expected behavior
ijson.kvitems(prefix) should stop iterating through items once the prefix was found and closed.

Execution information:

  • Python version: 3.7.4
  • ijson version: 3.1.4
  • ijson backend: I don't know
  • ijson installation method: pip (with poetry 1.1.5)
  • OS: Ubuntu 16.04

Thanks !

@nisace nisace added the bug Something isn't working label Nov 23, 2021
@rtobar
Copy link

rtobar commented Nov 24, 2021

@nisace thanks for the very thorough report.

I think there's a misalignment between the expectations you have for your particular JSON document and prefix, and how kvitems works for arbitrary JSON documents and prefixes. Consider the following two cases:

$> echo '[{"b": 1}, {"b": 1}]' | python -m ijson.dump -m kvitems -p item
#: key, value
-------------
0: b, 1
1: b, 1
$> echo '{"a": {"b": 1}, "a": {"b": 1}}' | python -m ijson.dump -m kvitems -p a
#: key, value
-------------
0: b, 1
1: b, 1

The first case is not uncommon, and the second is a bit weirder, in the sense that the input JSON document has duplicate keys -- but possible, as the JSON specification doesn't disallow duplicate keys.

These two examples should hopefully make it clear that kvitems's behavior should be to traverse the input document until exhaustion, as there's no way for it to know in advance when all matches for a given prefix have been found. Hence this is not really a bug, but a case of wrong expectations.

However....

I do see the opportunity to implement a new option for kvitems (e.g. stop_after_first or similar). Using this option, users could indicate that they know the prefix appears only once in the document. The option would default to False for the reasons explained. When set to True, kvitems would behave basically as you describe, yielding the key/value pairs off the prefix the first time around and then exiting.

Thoughts?

@rtobar rtobar removed the bug Something isn't working label Nov 24, 2021
@nisace
Copy link
Author

nisace commented Nov 24, 2021

@rtobar thanks for your answer. Indeed, I understand your point and agree with the stop_after_first option as a solution. That would be great !

Not sure how to implement though :)

@rtobar
Copy link

rtobar commented Dec 1, 2021

@rtobar thanks for your answer. Indeed, I understand your point and agree with the stop_after_first option as a solution. That would be great !

Not sure how to implement though :)

Yeah, I have some ideas but it could take a while. I'll close this ticket and create a new one to keep track of the potential new feature.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants