add support for iterating over key/value pairs? #18

ltalirz · 2020-01-15T19:33:11Z

If I understand correctly, there is the built-in items wrapper for iterating over items in a list, but there isn't one for iterating over keys in a dictionary.

I've seen the solution for the special case when the keys are at the top level of the JSON isagalaev#62 (comment) but what if the large list of keys is not at the top level? E.g.

{
 "my_big_data":
  { 
    "1": 1,
    "2": 2
  }
}

Would it be difficult to do an analogous function to items where one can specify the prefix of the dictionary to iterate over and returns the keys and values?

I guess besides the implementation, there is also the question what to call it.
It is perhaps a bit unfortunate that in python 3 the natural name for the dictionary iterator returning keys and values would actually be items but I guess that is already taken ;-)

The text was updated successfully, but these errors were encountered:

rtobar · 2020-01-16T03:50:53Z

@ltalirz that's a situation that indeed has been experienced by a number of users now, so yes, it makes sense to add support for iterating over object members in addition to being able to iterate over full objects.

Instead of adding a new method I would go for a different approach: the prefix could contains a special final character (e.g., :, | or even .) to signal that iteration should be over the object's members, not objects themselves. So in your case you could use ijson.items(f, 'my_big_data|') (or one of the other final characters), and the result of the iteration would give you two-tuples with the individual key and values.

I'm not sure yet how much effort would be required to implement this, or even if it's possible (there might be some aspect of the problem I'm not seeing yet?), but I'll keep it mind and try to experiment with it.

ltalirz · 2020-01-16T23:25:19Z

Hm... a function that changes return type depending on the prefix string?
Are you sure this is more intuitive than just using a different wrapper?

Anyhow, I had a quick go and came up with this minor generalization of the example isagalaev#62 (comment), put into the form of the objects function in the codebase.
Only tested for my use case, will test more later

import ijson
from ijson.common import ObjectBuilder

def objects(prefixed_events, prefix):
    '''
    An iterator returning native Python objects constructed from the events
    under a given prefix.
    '''
    prefixed_events = iter(prefixed_events)
    try:
        key='-'
        while True:
            current, event, value = next(prefixed_events)
            if current == prefix and event == 'map_key':  # found new object at prefix
                key=value
                builder = ObjectBuilder()
            elif current.startswith(prefix + '.' + key):  # while at this key, build the object
                builder.event(event, value)
                if event == 'end_map':  # found end of object at current key, yield
                    yield key, builder.value
    except StopIteration:
        pass

def kviter(file, prefix):
    return objects(ijson.parse(file), prefix)


f = open('data.json', 'rb')

for k,v in kviter(f, 'my_big_data'):
    print(k, v)
    break

rtobar · 2020-01-17T01:43:15Z

Good point regarding different return types, I hadn't thought of that actually, and I think it's a good reason to require a new function, so let's go for that. Regarding its name, as you point out the more natural items name is sadly already taken (and items could probably have been called objects in hindsight), but inspired in your code snipped it could be kvitems to keep names aligned.

Would you be willing to submit a PR to include this new functionality added to all backends? Unit tests would be required together with the code though. Mind you that the C backend doesn't use the code under ijson.common, so it needs to be implemented separately (and excluded from the tests until it's implemented), but I can take care of that one if that's an issue.

If that's not possible then I can take your code and add the missing bits, but might take a bit more depending on other things I have on my plate.

ltalirz · 2020-01-19T23:50:22Z

Hi @rtobar , I'm very busy this week but I could perhaps make a PR for the python implementation if you let me know where/how to add tests.

Somewhere here, in this style as well?

ijson/tests.py

Lines 258 to 262 in 87c4a0e

    
           def test_items_twodictlevels(self): 
        
               f = BytesIO(b'{"meta":{"view":{"columns":[{"id": -1}, {"id": -2}]}}}') 
        
               ids = list(self.backend.items(f, 'meta.view.columns.item.id')) 
        
               self.assertEqual(2, len(ids)) 
        
               self.assertListEqual([-2,-1], sorted(ids))

rtobar · 2020-01-20T02:21:32Z

I would try to add tests that demonstrate this working over a simple prefix (like in your example), a prefix including array elements (e.g., docs.item when used against the JSON global value in tests.py), and a prefix producing no results.

Unit tests, as you saw, should go into the Parse class in tests.py, and then they are run for each supported backend. To exclude a backend you'll need to do something like this:

ijson/tests.py

Lines 299 to 307 in 87c4a0e

    
           def test_lazy_file_reading(self): 
        
               file_type = SingleReadFile 
        
               if self.backend.__name__.endswith('.python'): 
        
                   if IS_PY2: 
        
                       # We know it doesn't work because because the decoder itself 
        
                       # is quite eager on its reading 
        
                       return 
        
                   file_type = SingleReadFileStr 
        
               self._test_item_iteration_validity(file_type)

or

ijson/tests.py

Lines 218 to 225 in 87c4a0e

    
           def test_invalid(self): 
        
               for json in INVALID_JSONS: 
        
                   # Yajl1 doesn't complain about additional data after the end 
        
                   # of a parsed object. Skipping this test. 
        
                   if self.__class__.__name__ == 'YajlParse' and json == YAJL1_PASSING_INVALID: 
        
                       continue 
        
                   with self.assertRaises(common.JSONError) as cm: 
        
                       list(self.backend.basic_parse(BytesIO(json)))

I also realized that one could actually (I think) offer an implementation of kvitems for the C backend that uses the code under ijson.common, it's just that it might not be as fast as possible. So maybe give it a go, and if it works then it'd be better to include it than not.

This new functionality, suggested in #18, allows users to iterate over (key, value) pairs representing object members for objects with a given prefix rather than iterating over the objects themselves. This opens up the possibility of iterating not only over big collections of objects, but over big objects themselves as well, without exhausting system memory. This is a feature that users have required for a time now (see #18 and isagalaev#62), so it makes sense to offer it out of the box. Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>

rtobar · 2020-01-20T03:42:57Z

Hi @ltalirz I actually went ahead and gave this a try myself -- adding tests and all. I started with your code, but had to change it a bit to work properly in a few cases. Could you give this a try and see if it works for your example as well? Changes are in the kvitems branch.

ltalirz · 2020-01-20T09:15:32Z

Thanks a lot @rtobar!

I checked out the branch and it seems to work fine for my use case as well.
I'd vote for including this already, and leaving an issue open to improve speed for other backends if needed.

ltalirz · 2020-01-20T09:26:28Z

Just as one performance data point:
On my SSD, it takes me 20s to iterate over 100k key-value pairs in a file containing ~1.4M key-value pairs (2GB in size), i.e. 0.2ms per pair.

~~This compares to 60s for parsing the entire file using json.load (0.04ms/pair) or 30s using ujson.load (0.02ms/pair).~~

While there's probably still room for improvement, I think that's already not too bad.

Edit: I was a bit surprised to have only a factor of 10x wrt ujson, and indeed I overlooked that there were other top-level keys in the file.
After removing those, the appropriate comparison is against 13s for json.load (0.01ms/pair) and 10s for ujson.load (0.007ms/pair).

rtobar · 2020-01-20T09:43:05Z

Good to see nice performance going on.

Shouldn't the two last measurements be 0.6ms/pair and 0.3ms/pair though, given the times you report? Or are you otherwise scaling the overall times for json and ujson to account only for the pair construction somehow?

(I had missed the fact that only 100k pairs is what takes 20s, all clear now)

It would also be good to double-check maximum memory usage on each case (via /usr/bin/time -v in Linux, or similar).

In any case, performance could indeed go up once the kvitems logic is implemented in C for the C backend. Right now it's implemented in python, and hence it could see a boost. I'll create a new issue for that bit of work though, which can be handled separately.

This new functionality, suggested in #18, allows users to iterate over (key, value) pairs representing object members for objects with a given prefix rather than iterating over the objects themselves. This opens up the possibility of iterating not only over big collections of objects, but over big objects themselves as well, without exhausting system memory. This is a feature that users have required for a time now (see #18 and isagalaev#62), so it makes sense to offer it out of the box. Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>

rtobar · 2020-01-20T10:27:40Z

@ltalirz I just ran the benchmark.py tool (after enhancing it a bit to allow for this), and could see a difference of ~3x, 4x in the speed of items compared to that of kvitems for the C backend, while for the other backends they yield similar times. This would indicate that this sort of speedup factor should be possible once kvitems is implemented in C.

ltalirz · 2020-01-20T10:28:59Z

Sounds good!
Let's put that benchmark info in the new issue as well.

rtobar · 2020-01-20T14:55:37Z

Changes merged to master branch, closing this issue now.

ltalirz · 2020-01-20T15:28:36Z

Great! I would say this warrants a new release :-)

rtobar · 2020-01-20T15:40:35Z

Yes, I'll try to push 2.6.0 out as time allows, hopefully not after the end of the week.

rtobar · 2020-01-28T14:18:05Z

New 2.6.0 released today, available in PyPI, includes this now.

ltalirz · 2020-01-28T14:19:12Z

Thanks for this!

rtobar added the enhancement New feature or request label Jan 16, 2020

rtobar changed the title ~~add support for iterating over keys?~~ add support for iterating over key/value pairs? Jan 20, 2020

rtobar mentioned this issue Jan 20, 2020

Implement kvitems in the C extension for increased performance. #21

Closed

rtobar closed this as completed Jan 20, 2020

ltalirz mentioned this issue Jan 21, 2020

export archive format forces reading of data.json file to memory aiidateam/aiida-core#493

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add support for iterating over key/value pairs? #18

add support for iterating over key/value pairs? #18

ltalirz commented Jan 15, 2020 •

edited

rtobar commented Jan 16, 2020

ltalirz commented Jan 16, 2020 •

edited

rtobar commented Jan 17, 2020

ltalirz commented Jan 19, 2020

rtobar commented Jan 20, 2020

rtobar commented Jan 20, 2020

ltalirz commented Jan 20, 2020

ltalirz commented Jan 20, 2020 •

edited

rtobar commented Jan 20, 2020 •

edited

rtobar commented Jan 20, 2020

ltalirz commented Jan 20, 2020

rtobar commented Jan 20, 2020

ltalirz commented Jan 20, 2020

rtobar commented Jan 20, 2020

rtobar commented Jan 28, 2020

ltalirz commented Jan 28, 2020

add support for iterating over key/value pairs? #18

add support for iterating over key/value pairs? #18

Comments

ltalirz commented Jan 15, 2020 • edited

rtobar commented Jan 16, 2020

ltalirz commented Jan 16, 2020 • edited

rtobar commented Jan 17, 2020

ltalirz commented Jan 19, 2020

rtobar commented Jan 20, 2020

rtobar commented Jan 20, 2020

ltalirz commented Jan 20, 2020

ltalirz commented Jan 20, 2020 • edited

rtobar commented Jan 20, 2020 • edited

rtobar commented Jan 20, 2020

ltalirz commented Jan 20, 2020

rtobar commented Jan 20, 2020

ltalirz commented Jan 20, 2020

rtobar commented Jan 20, 2020

rtobar commented Jan 28, 2020

ltalirz commented Jan 28, 2020

ltalirz commented Jan 15, 2020 •

edited

ltalirz commented Jan 16, 2020 •

edited

ltalirz commented Jan 20, 2020 •

edited

rtobar commented Jan 20, 2020 •

edited