
question: performant iteration of multi-thousand-points intervals #50

Closed
spock opened this issue Nov 23, 2020 · 8 comments
Labels
question Issue is a question

Comments

@spock

spock commented Nov 23, 2020

As a continuation of #49 :),

I'm now trying to read back from two IntervalDicts, compare the values assigned to each interval, modify the values, and write this all back to files.

[As a follow-up to #49: I have anywhere between 1 and 18,700 intervals per value (with ~2800 values), the average being ~216 intervals per value. (I did not collect the median, so no idea about the actual distribution 🙂 )]

I am now trying to iterate over all points (sequentially or not doesn't really matter, as long as I check all of them) with the code below (sorry for the verbosity, I like seeing it do stuff):

import sys

import portion as P
from tqdm import tqdm

# interval1 and interval2 are IntervalDict instances;
# new_interval1_dict and new_interval2_dict are defaultdict(list) instances
# mapping each value to the singleton intervals that carry it
for key in tqdm(interval1.keys(), desc="Key", position=0):
    tqdm.write(f"key: {str(key):.30}", file=sys.stderr)
    tqdm.write(f"key size: {len(key)}", file=sys.stderr)  # 17500 for the first element!
    for position in tqdm(P.iterate(key, step=1), desc="Position", position=1):
        # tqdm.write(f"pos: {str(position):.30}", file=sys.stderr)  # first value is 119
        c1 = interval1[position]
        c2 = interval2[position]
        if c2 > 0 and float(c1) / c2 >= threshold:
            # change c1 and c2 here
            filtered_counter += 1
        elif c1 > 0 and float(c2) / c1 >= threshold:
            # change c1 and c2 here
            filtered_counter += 1
        # creating plain dicts of modified values first, will then convert
        # to IntervalDict - faster this way
        new_interval1_dict[c1].append(P.singleton(position))
        new_interval2_dict[c2].append(P.singleton(position))
# TODO: convert into IntervalDict and return

The speed looks very variable, mostly in the 6-12 positions/sec range, though I've also seen peaks of 98 pos/sec.
So far 13k positions have been processed in 15 minutes; if that sample is representative, the whole run will take 26 hours.

@AlexandreDecan , you mentioned a smarter way of iterating, can you please share it? 🙂

@AlexandreDecan added the question label Nov 23, 2020
@AlexandreDecan
Owner

Hello,

I'll have a look at your code and see how we can do this in a not-so-time-consuming way ;-)
I'll try to do this tomorrow if I can have some spare time ;-)

In the meantime, can you clarify what the value of threshold is, what the purpose of filtered_counter is, and what the purpose of new_interval*_dict is? (They seem to correspond to classical dicts, since neither c1 nor c2 is actually an interval, am I right?)

@AlexandreDecan
Copy link
Owner

I already see some "small improvements" (things that could speed up the computation but shouldn't make a huge difference):

  • Instead of iterating on the keys of interval1 and then on the positions, you can iterate on interval1.domain();
  • Instead of the above, you can avoid a call to interval1[position] by iterating on .items() instead of .keys(). This is likely to double the speed at which things are done, since you won't have to do interval1[position] to retrieve c1 (see the sketch after this list);
  • It's probably more efficient to think about the distinct values rather than about the intervals (given how they are internally stored). Could you please try to use IntervalDict.combine (probably on a smaller example first) to see if it could do the job in a less time-consuming way?
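
For the second suggestion, a minimal sketch (with made-up data) of iterating on .items(), so the value comes for free with each interval:

import portion as P

# toy IntervalDict standing in for interval1
d = P.IntervalDict({P.closed(0, 3): 1, P.closed(4, 7): 2})

for interval, value in d.items():
    # value is already known for the whole interval,
    # so no per-position d[position] lookup is needed
    for position in P.iterate(interval, step=1):
        print(position, value)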

(Yeah, I know, I said I wouldn't have time until tomorrow, but it was on my mind ;-)

@spock
Author

spock commented Nov 23, 2020

  • threshold is a value I need to compare values from interval1 and interval2 with; I have to do this for all points (or rather "same-value intervals", but I don't want to code that 😛 )
  • filtered_counter just tells me how many times I had to adjust the values of intervals
  • yes, new_interval*_dicts are plain dicts; they let me avoid iterative insertion of intervals with different values directly into the IntervalDict - basically, I'm re-using your advice of grouping same-value intervals first and then inserting them into the IntervalDict in one operation (see the sketch below); thus, c1 and c2 are the values of intervals.
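
For reference, a small sketch (with hypothetical values) of that grouping trick: collect singleton intervals per value in a plain dict, then build the IntervalDict in one go:

from collections import defaultdict

import portion as P

# value -> list of singleton intervals carrying that value
new_interval1_dict = defaultdict(list)
new_interval1_dict[3].append(P.singleton(10))
new_interval1_dict[3].append(P.singleton(11))
new_interval1_dict[7].append(P.singleton(42))

# one insertion per distinct value instead of one per position;
# P.Interval(*intervals) unions the singletons into a single interval
result = P.IntervalDict(
    {P.Interval(*intervals): value for value, intervals in new_interval1_dict.items()}
)
print(result)  # roughly {[10] | [11]: 3, [42]: 7}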

I'll need to think on and try your suggestions tomorrow, thanks!

I thought I could use combine here somehow, but I still need two separate (modified) outputs...
So I'm not sure how to apply it here.
To be clearer: I am comparing values from both interval dicts, and modifying those values if certain conditions are met.
I need to maintain both outputs as separate entities...

@AlexandreDecan
Owner

Thanks for the answer. Could you elaborate a little bit on the need to maintain two separate (modified) outputs? What if, for example, we define the function provided to combine in such a way that it stores the "source" of the value (either c1 or c2)? Would it fit your needs? If so, I think we can get a not-so-ugly and not-so-time-consuming solution ;-) (e.g. func could be something like (c2, 2) if c2 > 0 and float(c1) / c2 >= threshold else (c1, 1))
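
A minimal sketch of that idea (with hypothetical data and threshold), where the combining function returns a (value, source) tuple:

import portion as P

threshold = 2.0  # hypothetical value

def pick(c1, c2):
    # keep c2 (tagged with source 2) when c1/c2 crosses the threshold,
    # otherwise keep c1 (tagged with source 1)
    if c2 > 0 and float(c1) / c2 >= threshold:
        return (c2, 2)
    return (c1, 1)

interval1 = P.IntervalDict({P.closed(0, 5): 4})
interval2 = P.IntervalDict({P.closed(0, 5): 1})

combined = interval1.combine(interval2, how=pick)
print(combined)  # {[0,5]: (1, 2)}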

@spock
Author

spock commented Nov 25, 2020

(I was away yesterday, now back to this mini-project.)

  • I could not figure out how to use interval1.domain(); all it gives me is a single large interval (and for some reason this single operation sometimes takes forever to complete, but other times works fast - not sure what is happening there) (see the small example after this list);
  • yes, iterating on .items() instead of .keys() works nicely, and has the side benefit of showing me total sizes in the progress bars 😃; it is still rather slow though, a single position iteration takes about 4 seconds
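
(A tiny example of what .domain() returns, for reference: a single, possibly non-atomic, Interval that is the union of all keys; its atomic pieces can still be iterated one by one.)

import portion as P

d = P.IntervalDict({P.closed(0, 2): 'a', P.closed(5, 9): 'b'})
dom = d.domain()
print(dom)  # [0,2] | [5,9]

for atomic in dom:
    # each atomic piece of the domain, in order
    print(atomic)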

Using tuples as values is an interesting idea!
I've just implemented and started it (basically, saving both modified c1/c2 values as a single tuple into the combined IntervalDict).
As I cannot see any progress of .combine, I ran it on a very small, non-representative sub-sample of only 50k positions.
This seems to be taking a while as well. I'll post when it returns.

(Looking at the code of combine, I wonder if the last return IntervalDict(new_items) will cause issues, too.)

Ok, it took roughly 6 minutes.
An optimistic estimate is then 7+ hours for the full interval (not taking the increased complexity into account).
I guess I should really implement this differently after all 🙂

@AlexandreDecan
Owner

I expected domain to be a bit slow given the high number of values and intervals (you said ~2800 values with an average of 216 intervals each, which means domain has to union ~604,800 intervals :-)).

I think it's probably better to copy and paste the code of combine and tweak it to your needs, to speed up the computation. For example, it seems that interval1 and interval2 share a similar domain, hence there is no need to compute their intersection and their difference. Moreover, by reusing the code of combine, you can add a progress bar on the inner loop to keep track of the progress ;-)
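
A hedged sketch of what such a stripped-down combine might look like, assuming (as suggested above) that both dicts cover exactly the same domain; combine_same_domain is a hypothetical helper, not part of portion:

import portion as P
from tqdm import tqdm

def combine_same_domain(d1, d2, how):
    # simplified combine: assumes d1 and d2 cover the same domain,
    # so no intersection/difference bookkeeping is needed
    new_items = []
    for interval, v1 in tqdm(d1.items(), desc="Combine"):
        for atomic in interval:
            # restrict d2 to this atomic piece and merge the values
            for sub, v2 in d2[atomic].items():
                new_items.append((sub, how(v1, v2)))
    return P.IntervalDict(new_items)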

Anyway, I share your feeling that an alternative implementation is probably the way to go :( I'm afraid IntervalDict is not suited for such large datasets :-/

@spock
Author

spock commented Nov 25, 2020

I've reimplemented the whole thing with numpy arrays (it turned out to be much easier than I thought) and ended up with ~30 seconds of running time for everything (reading, filtering, merging same-value intervals, and writing out in the original format).
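
(The actual numpy implementation isn't shown in the thread; one plausible shape for it, assuming positions fit into a dense 0..N range, is to keep one value per position and vectorize both the threshold test and the run merging:)

import numpy as np

threshold = 2.0  # hypothetical value
# one value per position; dense arrays standing in for the two IntervalDicts
c1 = np.array([4.0, 0.0, 3.0, 3.0])
c2 = np.array([1.0, 2.0, 3.0, 0.0])

# vectorized threshold test (guard the divisor to avoid division by zero)
mask1 = (c2 > 0) & (c1 / np.where(c2 > 0, c2, 1) >= threshold)
mask2 = (c1 > 0) & (c2 / np.where(c1 > 0, c1, 1) >= threshold)
filtered_counter = int((mask1 | mask2).sum())

# merge same-value runs back into (start, stop, value) triples
change = np.flatnonzero(np.diff(c1)) + 1
starts = np.concatenate(([0], change))
stops = np.concatenate((change, [len(c1)]))
runs = [(int(s), int(e) - 1, c1[s]) for s, e in zip(starts, stops)]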

I guess I didn't really need the features offered by portion in the first place 🤣 Oh well, it was still a good learning experience.

Thank you for all the support!

@spock spock closed this as completed Nov 25, 2020
@AlexandreDecan
Owner

You're welcome. I'm sorry I couldn't help you much with portion. At least, we now know that it cannot deal with large datasets :-D
