
question: performant iteration of multi-thousand-points intervals #50

Closed
spock opened this issue Nov 23, 2020 · 8 comments
Labels
question Issue is a question

Comments

@spock

spock commented Nov 23, 2020

As a continuation of #49 :),

I'm now trying to read back from two IntervalDicts, compare the values assigned to each interval, modify the values, and write this all back to files.

[As a follow-up to #49: I have anywhere between 1 and 18,700 intervals per value (with ~2800 values), the average being ~216 intervals per value. (I did not collect the median, so no idea about the actual distribution 🙂 )]

I am now trying to iterate over all points (sequentially or not doesn't really matter, as long as I check all of them) with the code below (sorry for the verbosity, I like seeing it do stuff):

import sys

import portion as P
from tqdm import tqdm

# interval1 and interval2 are IntervalDict instances;
# new_interval1_dict and new_interval2_dict are defaultdict(list) instances
# mapping each value to the singleton intervals that carry it
for key in tqdm(interval1.keys(), desc="Key", position=0):
    tqdm.write(f"key: {str(key):.30}", file=sys.stderr)
    tqdm.write(f"key size: {len(key)}", file=sys.stderr)  # 17500 for the first element!
    for position in tqdm(P.iterate(key, step=1), desc="Position", position=1):
        # tqdm.write(f"pos: {str(position):.30}", file=sys.stderr)  # first value is 119
        c1 = interval1[position]
        c2 = interval2[position]
        if c2 > 0 and float(c1) / c2 >= threshold:
            # change c1 and c2 here
            filtered_counter += 1
        elif c1 > 0 and float(c2) / c1 >= threshold:
            # change c1 and c2 here
            filtered_counter += 1
        # creating plain dicts of modified values first, will then convert
        # to IntervalDict - faster this way
        new_interval1_dict[c1].append(P.singleton(position))
        new_interval2_dict[c2].append(P.singleton(position))
# TODO: convert into IntervalDict and return

The speed looks very variable, mostly in the 6-12 positions/sec range, though I've also seen peaks of 98 pos/sec.
So far 13k positions have been processed in 15 minutes; if that sample is representative, the whole run will take 26 hours.

@AlexandreDecan , you mentioned a smarter way of iterating, can you please share it? 🙂

@AlexandreDecan added the question label Nov 23, 2020
@AlexandreDecan
Owner

Hello,

I'll have a look at your code and see how we can do this in a not-so-time-consuming way ;-)
I'll try to do this tomorrow if I can have some spare time ;-)

In the meantime, can you clarify what the value of threshold is, what the purpose of filtered_counter is, and what the purpose of new_interval*_dict is? (They seem to correspond to classical dicts, since neither c1 nor c2 is actually an interval, am I right?)

@AlexandreDecan
Copy link
Owner

I already see some "small improvements" (things that could speed up the computation but shouldn't make a huge difference):

  • Instead of iterating on the keys of interval1 and then on the positions, you can iterate on interval1.domain();
  • Instead of the above, you can avoid a call to interval1[position] by iterating on .items() instead of .keys(). This is likely to double the speed at which things are done, since you won't have to do interval1[position] to retrieve c1 (see the sketch after this list);
  • It's probably more efficient to think about the distinct values rather than about the intervals (given how they are internally stored). Could you please try to use IntervalDict.combine (probably on a smaller example first) to see if it could do the job in a less time-consuming way?
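
For the second suggestion, a minimal sketch (with made-up data) of iterating on .items(), so the value comes for free with each interval:

import portion as P

# toy IntervalDict standing in for interval1
d = P.IntervalDict({P.closed(0, 3): 1, P.closed(4, 7): 2})

for interval, value in d.items():
    # value is already known for the whole interval,
    # so no per-position d[position] lookup is needed
    for position in P.iterate(interval, step=1):
        print(position, value)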

(Yeah, I know, I said I wouldn't have time until tomorrow, but it was on my mind ;-)

@spock
Author

spock commented Nov 23, 2020

  • threshold is a value I need to compare values from interval1 and interval2 with; I have to do this for all points (or rather "same-value intervals", but I don't want to code that 😛 )
  • filtered_counter just tells me how many times I had to adjust the values of intervals
  • yes, new_interval*_dicts are plain dicts; they let me avoid iterative insertion of intervals with different values directly into the IntervalDict - basically, I'm re-using your advice of grouping same-value intervals first and then inserting them into the IntervalDict in one operation (see the sketch below); thus, c1 and c2 are the values of intervals.
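
For reference, a small sketch (with hypothetical values) of that grouping trick: collect singleton intervals per value in a plain dict, then build the IntervalDict in one go:

from collections import defaultdict

import portion as P

# value -> list of singleton intervals carrying that value
new_interval1_dict = defaultdict(list)
new_interval1_dict[3].append(P.singleton(10))
new_interval1_dict[3].append(P.singleton(11))
new_interval1_dict[7].append(P.singleton(42))

# one insertion per distinct value instead of one per position;
# P.Interval(*intervals) unions the singletons into a single interval
result = P.IntervalDict(
    {P.Interval(*intervals): value for value, intervals in new_interval1_dict.items()}
)
print(result)  # roughly {[10] | [11]: 3, [42]: 7}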

I'll need to think on and try your suggestions tomorrow, thanks!

I thought I could use combine here somehow, but I still need two separate (modified) outputs...
So I'm not sure how to apply it here.
To be clearer: I am comparing values from both interval dicts, and modifying those values if certain conditions are met.
I need to maintain both outputs as separate entities...

@AlexandreDecan
Owner

Thanks for the answer. Could you elaborate a little bit on the need to maintain two separate (modified) outputs? What if, for example, we define the function provided to combine in such a way that it stores the "source" of the value (either c1 or c2)? Would it fit your needs? If so, I think we can get a not-so-ugly and not-so-time-consuming solution ;-) (e.g. func could be something like (c2, 2) if c2 > 0 and float(c1) / c2 >= threshold else (c1, 1))
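
A minimal sketch of that idea (with hypothetical data and threshold), where the combining function returns a (value, source) tuple:

import portion as P

threshold = 2.0  # hypothetical value

def pick(c1, c2):
    # keep c2 (tagged with source 2) when c1/c2 crosses the threshold,
    # otherwise keep c1 (tagged with source 1)
    if c2 > 0 and float(c1) / c2 >= threshold:
        return (c2, 2)
    return (c1, 1)

interval1 = P.IntervalDict({P.closed(0, 5): 4})
interval2 = P.IntervalDict({P.closed(0, 5): 1})

combined = interval1.combine(interval2, how=pick)
print(combined)  # {[0,5]: (1, 2)}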

@spock
Author

spock commented Nov 25, 2020

(I was away yesterday, now back to this mini-project.)

  • I could not figure out how to use interval1.domain(); all it gives me is a single large interval (and for some reason this single operation sometimes takes forever to complete, but other times works fast - not sure what is happening there) (see the small example after this list);
  • yes, iterating on .items() instead of .keys() works nicely, and has the side benefit of showing me total sizes in the progress bars 😃; it is still rather slow though, a single position iteration takes about 4 seconds
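
(A tiny example of what .domain() returns, for reference: a single, possibly non-atomic, Interval that is the union of all keys; its atomic pieces can still be iterated one by one.)

import portion as P

d = P.IntervalDict({P.closed(0, 2): 'a', P.closed(5, 9): 'b'})
dom = d.domain()
print(dom)  # [0,2] | [5,9]

for atomic in dom:
    # each atomic piece of the domain, in order
    print(atomic)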

Using tuples as values is an interesting idea!
I've just implemented and started it (basically, saving both modified c1/c2 values as a single tuple into the combined IntervalDict).
As I cannot see any progress of .combine, I ran it on a very small, non-representative sub-sample of only 50k positions.
This seems to be taking a while as well. I'll post when it returns.

(Looking at the code of combine, I wonder if the last return IntervalDict(new_items) will cause issues, too.)

Ok, it took roughly 6 minutes.
An optimistic estimate is then 7+ hours for the full interval (not taking the increased complexity into account).
I guess I should really implement this differently after all 🙂

@AlexandreDecan
Owner

I expected domain to be a bit slow given the high number of values and intervals (you said ~2800 values with an average of 216 intervals each, which means domain has to union ~604,800 intervals :-)).

I think it's probably better to copy and paste the code of combine and tweak it to your needs, to speed up the computation. For example, it seems that interval1 and interval2 share a similar domain, hence there is no need to compute their intersection and their difference. Moreover, by reusing the code of combine, you can add a progress bar on the inner loop to keep track of the progress ;-)
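
A hedged sketch of what such a stripped-down combine might look like, assuming (as suggested above) that both dicts cover exactly the same domain; combine_same_domain is a hypothetical helper, not part of portion:

import portion as P
from tqdm import tqdm

def combine_same_domain(d1, d2, how):
    # simplified combine: assumes d1 and d2 cover the same domain,
    # so no intersection/difference bookkeeping is needed
    new_items = []
    for interval, v1 in tqdm(d1.items(), desc="Combine"):
        for atomic in interval:
            # restrict d2 to this atomic piece and merge the values
            for sub, v2 in d2[atomic].items():
                new_items.append((sub, how(v1, v2)))
    return P.IntervalDict(new_items)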

Anyway, I share your feeling that an alternative implementation is probably the way to go :( I'm afraid IntervalDict is not suited for such large datasets :-/

@spock
Author

spock commented Nov 25, 2020

I've reimplemented the whole thing with numpy arrays (it turned out to be much easier than I thought) and ended up with ~30 seconds of running time for everything (reading, filtering, merging same-value intervals, and writing out in the original format).
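
(The actual numpy implementation isn't shown in the thread; one plausible shape for it, assuming positions fit into a dense 0..N range, is to keep one value per position and vectorize both the threshold test and the run merging:)

import numpy as np

threshold = 2.0  # hypothetical value
# one value per position; dense arrays standing in for the two IntervalDicts
c1 = np.array([4.0, 0.0, 3.0, 3.0])
c2 = np.array([1.0, 2.0, 3.0, 0.0])

# vectorized threshold test (guard the divisor to avoid division by zero)
mask1 = (c2 > 0) & (c1 / np.where(c2 > 0, c2, 1) >= threshold)
mask2 = (c1 > 0) & (c2 / np.where(c1 > 0, c1, 1) >= threshold)
filtered_counter = int((mask1 | mask2).sum())

# merge same-value runs back into (start, stop, value) triples
change = np.flatnonzero(np.diff(c1)) + 1
starts = np.concatenate(([0], change))
stops = np.concatenate((change, [len(c1)]))
runs = [(int(s), int(e) - 1, c1[s]) for s, e in zip(starts, stops)]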

I guess I didn't really need the features offered by portion in the first place 🤣 Oh well, it was still a good learning experience.

Thank you for all the support!

@spock spock closed this as completed Nov 25, 2020
@AlexandreDecan
Owner

You're welcome. I'm sorry I couldn't help you much with portion. At least, we now know that it cannot deal with large datasets :-D
