add support for iterating over key/value pairs? #18
Comments
@ltalirz that's a situation that has indeed been experienced by a number of users now, so yes, it makes sense to add support for iterating over object members in addition to being able to iterate over full objects. Instead of adding a new method I would go for a different approach: the [...] I'm not sure yet how much effort would be required to implement this, or even if it's possible (there might be some aspect of the problem I'm not seeing yet?), but I'll keep it in mind and try to experiment with it. |
Hm... a function that changes return type depending on the prefix string? Anyhow, I had a quick go and came up with this minor generalization of the example isagalaev#62 (comment), put into the form of the [...]

import ijson
from ijson.common import ObjectBuilder

def objects(prefixed_events, prefix):
    '''
    An iterator returning native Python objects constructed from the events
    under a given prefix.
    '''
    prefixed_events = iter(prefixed_events)
    try:
        key = '-'
        while True:
            current, event, value = next(prefixed_events)
            if current == prefix and event == 'map_key':  # found new object at prefix
                key = value
                builder = ObjectBuilder()
            elif current.startswith(prefix + '.' + key):  # while at this key, build the object
                builder.event(event, value)
                if event == 'end_map':  # found end of object at current key, yield
                    yield key, builder.value
    except StopIteration:
        pass

def kviter(file, prefix):
    return objects(ijson.parse(file), prefix)

f = open('data.json', 'rb')
for k, v in kviter(f, 'my_big_data'):
    print(k, v)
    break
|
Good point regarding different return types, I hadn't thought of that actually, and I think it's a good reason to require a new function, so let's go for that. Regarding its name, as you point out the more natural [...] Would you be willing to submit a PR that adds this new functionality to all backends? Unit tests would be required together with the code though. Mind you that the C backend doesn't use the code under [...] If that's not possible then I can take your code and add the missing bits, but it might take a bit longer depending on other things I have on my plate. |
Hi @rtobar, I'm very busy this week but I could perhaps make a PR for the Python implementation if you let me know where/how to add tests. Somewhere here, in this style as well? Lines 258 to 262 in 87c4a0e
|
I would try to add tests that demonstrate this working over a simple prefix (like in your example), a prefix including array elements (e.g., Unit tests, as you saw, should go into the Lines 299 to 307 in 87c4a0e
or Lines 218 to 225 in 87c4a0e
I also realized that one could actually (I think) offer an implementation of |
This new functionality, suggested in #18, allows users to iterate over (key, value) pairs representing object members for objects with a given prefix rather than iterating over the objects themselves. This opens up the possibility of iterating not only over big collections of objects, but over big objects themselves as well, without exhausting system memory. This is a feature that users have required for a time now (see #18 and isagalaev#62), so it makes sense to offer it out of the box. Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
Hi @ltalirz I actually went ahead and gave this a try myself -- adding tests and all. I started with your code, but had to change it a bit to work properly in a few cases. Could you give this a try and see if it works for your example as well? Changes are in the |
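The cases in which the earlier snippet needed changes are not listed here. Reading it, two likely ones are members whose value is a scalar or an array (it only yields on `end_map`) and members containing nested objects (the inner `end_map` triggers an early yield). The following is only a sketch of one way to handle those cases, not the code that was merged. The function name `kv_pairs`, its helper `_attach` and the hand-written `EVENTS` list are made up for illustration; the events are assumed to have the `(prefix, event, value)` shape produced by `ijson.parse`.

```python
def _attach(stack, keys, value):
    """Store value in the innermost open container."""
    parent = stack[-1]
    if isinstance(parent, dict):
        parent[keys.pop()] = value   # consume the pending map key
    else:
        parent.append(value)


def kv_pairs(events, prefix):
    """Yield (key, value) for each member of the objects found at prefix.

    events: iterable of (prefix, event, value) tuples, assumed to look
    like the output of ijson.parse().
    """
    key = None     # member currently being built, None between members
    stack = []     # containers still open inside that member
    keys = []      # map keys waiting for their value
    for current, event, value in events:
        if key is None:
            # Between members: only a map_key at the prefix starts a new one.
            if current == prefix and event == 'map_key':
                key = value
            continue
        if event == 'map_key':
            keys.append(value)
        elif event in ('start_map', 'start_array'):
            container = {} if event == 'start_map' else []
            if stack:
                _attach(stack, keys, container)
            stack.append(container)
        elif event in ('end_map', 'end_array'):
            finished = stack.pop()
            if not stack:            # outermost container closed
                yield key, finished
                key = None
        elif stack:                  # scalar inside a container
            _attach(stack, keys, value)
        else:                        # scalar as the member value itself
            yield key, value
            key = None


# Hand-written events for:
# {"my_big_data": {"k1": {"x": {"y": 1}}, "k2": [1, 2], "k3": "s"}}
EVENTS = [
    ('', 'start_map', None),
    ('', 'map_key', 'my_big_data'),
    ('my_big_data', 'start_map', None),
    ('my_big_data', 'map_key', 'k1'),
    ('my_big_data.k1', 'start_map', None),
    ('my_big_data.k1', 'map_key', 'x'),
    ('my_big_data.k1.x', 'start_map', None),
    ('my_big_data.k1.x', 'map_key', 'y'),
    ('my_big_data.k1.x.y', 'number', 1),
    ('my_big_data.k1.x', 'end_map', None),
    ('my_big_data.k1', 'end_map', None),
    ('my_big_data', 'map_key', 'k2'),
    ('my_big_data.k2', 'start_array', None),
    ('my_big_data.k2.item', 'number', 1),
    ('my_big_data.k2.item', 'number', 2),
    ('my_big_data.k2', 'end_array', None),
    ('my_big_data', 'map_key', 'k3'),
    ('my_big_data.k3', 'string', 's'),
    ('my_big_data', 'end_map', None),
    ('', 'end_map', None),
]

for k, v in kv_pairs(EVENTS, 'my_big_data'):
    print(k, v)
```

Because only one member is held in memory at a time, the object at the prefix can be arbitrarily large as long as each individual member fits.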
Thanks a lot @rtobar! I checked out the branch and it seems to work fine for my use case as well. |
Just as one performance data point:
While there's probably still room for improvement, I think that's already not too bad. Edit: I was a bit surprised to have only a factor of 10x wrt ujson, and indeed I overlooked that there were other top-level keys in the file. |
Good to see nice performance going on.
(I had missed the fact that 100k pairs alone is what takes 20s, all clear now.) It would also be good to double-check maximum memory usage in each case (via [...]) In any case, performance could indeed go up once the [...] |
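One way to compare peak memory from within Python is the standard-library `tracemalloc` module. This is only a sketch and not necessarily the tool meant above. The helper `peak_memory` and the two toy workloads are made up for illustration, standing in for a streaming and a load-everything approach. Note that `tracemalloc` only sees allocations made through Python's allocators, so memory used internally by a C library may not show up.

```python
import tracemalloc


def peak_memory(func, *args, **kwargs):
    """Run func and return (result, peak traced bytes while it ran)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def count_streaming(n):
    # Consumes items one by one, nothing is kept around.
    return sum(1 for _ in range(n))


def count_materialized(n):
    # Builds the whole list in memory before counting.
    return len(list(range(n)))


_, peak_stream = peak_memory(count_streaming, 100_000)
_, peak_list = peak_memory(count_materialized, 100_000)
print('streaming peak bytes:   ', peak_stream)
print('materialized peak bytes:', peak_list)
```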
@ltalirz I just ran the |
Sounds good! |
Changes merged to |
Great! I would say this warrants a new release :-) |
Yes, I'll try to push 2.6.0 out as time allows, hopefully not after the end of the week. |
New 2.6.0 released today, available in PyPI, includes this now. |
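For reference, a minimal sketch of how the released feature can be used, assuming it is exposed as `ijson.kvitems(file, prefix)` as described in ijson's documentation for 2.6.0 and later. The sample data, the `my_big_data` prefix (borrowed from the snippet earlier in this thread) and the helper `first_pairs` are made up for illustration. The guarded import is only there so the snippet still loads where ijson is not installed.

```python
import io
import itertools

try:
    import ijson  # third-party; kvitems is assumed to need ijson >= 2.6.0
except ImportError:
    ijson = None

# Illustrative document: the interesting object is not at the top level.
DATA = b'{"meta": {"n": 3}, "my_big_data": {"k1": {"x": 1}, "k2": [1, 2], "k3": "s"}}'


def first_pairs(raw, prefix, limit):
    """Return up to `limit` (key, value) pairs found under `prefix`."""
    pairs = ijson.kvitems(io.BytesIO(raw), prefix)
    # islice stops early, so the remaining members are never built.
    return list(itertools.islice(pairs, limit))


if ijson is not None:
    for key, value in first_pairs(DATA, 'my_big_data', 2):
        print(key, value)
```

With a real file, the `io.BytesIO` wrapper would be replaced by a file object opened in binary mode.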
Thanks for this! |
If I understand correctly, there is the built-in `items` wrapper for iterating over items in a list, but there isn't one for iterating over keys in a dictionary.
I've seen the solution for the special case when the keys are at the top level of the JSON isagalaev#62 (comment) but what if the large list of keys is not at the top level? E.g.
Would it be difficult to do an analogous function to `items` where one can specify the prefix of the dictionary to iterate over and that returns the keys and values?
I guess besides the implementation, there is also the question of what to call it.
It is perhaps a bit unfortunate that in Python 3 the natural name for the dictionary iterator returning keys and values would actually be `items` but I guess that is already taken ;-)