
[WIP] Module for motif and discord discovery #185

Merged
merged 28 commits on Mar 18, 2021

Conversation

mihailescum
Contributor

I gave the motif discovery module a go (issue #184), what do you think? (Ignore all changes outside motifs.py and test_motifs.py, as we discuss this in another PR)

There are still more tests to be done, but I'm not sure how to approach them yet. Ideally I'd like a naive_k_motifs method.

@codecov-commenter

codecov-commenter commented May 24, 2020

Codecov Report

Merging #185 into master will not change coverage.
The diff coverage is 100.00%.


@@            Coverage Diff            @@
##            master      #185   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           13        13           
  Lines         1027      1032    +5     
=========================================
+ Hits          1027      1032    +5     
Impacted Files Coverage Δ
stumpy/core.py 100.00% <100.00%> (ø)
stumpy/gpu_stump.py 100.00% <100.00%> (ø)
stumpy/mstump.py 100.00% <100.00%> (ø)
stumpy/mstumped.py 100.00% <100.00%> (ø)
stumpy/stamp.py 100.00% <100.00%> (ø)
stumpy/stomp.py 100.00% <100.00%> (ø)
stumpy/stump.py 100.00% <100.00%> (ø)
stumpy/stumped.py 100.00% <100.00%> (ø)
stumpy/floss.py 100.00% <0.00%> (ø)
... and 8 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0f9a2f8...59ad4d5.

@mihailescum
Contributor Author

The tests fail because of a typo, but I'm interested in your take on the structure of the code.

@seanlaw
Contributor

seanlaw commented May 25, 2020

I like where it's going and the structure seems reasonable. The only thing that I can think of is whether we can generalize the problem to solve for motifs as well as discords (anomalies). Additionally, I have a bias that the default thresholds should be motivated by or relative to the mean/stddev of the matrix profile values. But I do like where you are going with rtol and atol, as this is consistent with NumPy. However, I wonder how many casual users will recognize this to be similar to NumPy?

I would strongly advise against having while loops in the code. It's safer to have for-loops and add breaks if necessary

@mihailescum
Contributor Author

What do you mean by motivating the motifs with mean/stddev?

Searching for discords should be really easy: set rtol=atol=0 and, as the matrix profile, pass P.max() - P (invert the profile). I haven't tried it out yet, but I think it should work.

And why do you advise against while loops? I agree on while True loops, but what's so bad about them otherwise?

@seanlaw
Contributor

seanlaw commented May 25, 2020

What do you mean by motivating the motifs with mean/stddev?

Sorry for the brevity in my previous response and, hopefully, I understand the use of atol and rtol properly. Let me elaborate on "default thresholds should be motivated by or relative to the mean/stddev of the matrix profile values" by using this "steam gen" example. So, for the global motif (i.e., where candidate_idx = np.argmin(P)), the nearest neighbor is a matrix profile distance of 5.49161982776594 units away. However, the current defaults of atol=1.0 and rtol=0.0 would actually miss the nearest neighbor since (IIUC), according to D[candidate_idx] < atol + rtol * profile_value, 5.49161982776594 would be greater than 1.0 + 0.0 * profile_value = 1.0. Instead, I would set the default rtol = 1.0 and the default atol = 0.25 * np.std(P) so that:

D[candidate_idx] < 1.0 + 0.0 * profile_value

becomes

D[candidate_idx] < 0.25 * np.std(P) + 1.0 * profile_value

This way, the atol default value is "motivated by or relative to the mean/stddev of the matrix profile values". Note that the use of 0.25 as a factor is somewhat arbitrary (it could be 1.0 or 0.5) but the key point is that it is relative to np.std(P). If all values of P are shifted up by 100 then factor * np.std(P) should still work reasonably well (and provide the user with some sensible defaults that don't require a ton of hand tuning). And if the input time series were, say, a perfect sine wave (maybe with a small bit of random noise added to it) then a motif may not be found since np.std(P) (and therefore atol) would be exceedingly small.
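To make the proposed default concrete, here is a minimal NumPy sketch; the helper name `is_neighbor` and the toy profile values are illustrative only, not stumpy's API:

```python
import numpy as np

# Hypothetical helper illustrating the proposed acceptance test
# D[candidate_idx] < atol + rtol * profile_value with the suggested
# defaults atol = 0.25 * np.std(P) and rtol = 1.0.
def is_neighbor(d, P, profile_value, atol_factor=0.25, rtol=1.0):
    atol = atol_factor * np.std(P)
    return d < atol + rtol * profile_value

P = np.array([5.5, 6.0, 7.2, 8.0, 9.1])  # toy matrix profile values

# The old defaults (atol=1.0, rtol=0.0) reject a nearest neighbor 5.49 units away:
print(5.49 < 1.0 + 0.0 * 5.49)  # False

# The proposed relative defaults accept it:
print(is_neighbor(5.49, P, profile_value=5.49))  # True
```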

Searching for discords should be really easy: set rtol=atol=0 and, as the matrix profile, pass P.max() - P (invert the profile). I haven't tried it out yet, but I think it should work.

I'm not sure I follow but I trust you 👍

And why do you advise against while loops? I agree on while True loops, but what's so bad about them otherwise?

I think, perhaps, that this is a philosophical point. There is a higher possibility that a while loop may not end due to the conditional expression (usually caused by a bug or faulty logic) and this is bad for production code as it will likely cause a memory error/leakage. However, a for loop must end since the stop iteration is explicitly stated upfront. 99.99%+ of the time, a while loop can be restructured into a much safer for loop that will avoid bugs in production. I would be happy to help think about how to restructure the two while loops if you'd like?

@seanlaw
Contributor

seanlaw commented May 25, 2020

I think, perhaps, that this is a philosophical point. There is a higher possibility that a while loop may not end due to the conditional expression (usually caused by a bug or faulty logic) and this is bad for production code as it will likely cause a memory error/leakage. However, a for loop must end since the stop iteration is explicitly stated upfront. 99.99%+ of the time, a while loop can be restructured into a much safer for loop that will avoid bugs in production. I would be happy to help think about how to restructure the two while loops if you'd like?

Can you tell that I've been burned once too many times by other people's while loops? 😭

@mihailescum
Contributor Author

Can you tell that I've been burned once too many times by other people's while loops? 😭

Only slightly 😄

Yes, I'd gladly have some help in restructuring the loops. Note that for the one in find_motifs it's not enough to use for i in range(k), because we don't necessarily find a motif in every iteration.

Searching for discords should be really easy: set rtol=atol=0 and, as the matrix profile, pass P.max() - P (invert the profile). I haven't tried it out yet, but I think it should work.

Doing find_motifs(T, P.max() - P, k, atol=0, rtol=0, min_num_neigbours=1) should yield the top k discords (min_num_neigbours has to be a new parameter), as this way you should find the peaks of the matrix profile, i.e. the discords.
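A quick NumPy sanity check of the inversion idea (toy values only, no stumpy call): the minimum of the inverted profile sits exactly at the maximum of the original profile, i.e. at the top discord.

```python
import numpy as np

P = np.array([1.0, 0.2, 3.5, 0.9, 2.7])  # toy matrix profile
P_inv = P.max() - P                       # the inverted profile

# Minimizing the inverted profile is the same as maximizing the original one:
print(np.argmin(P_inv) == np.argmax(P))  # True (both point at index 2)
```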

I see what you mean now with mean/stddev, I will implement it this way.

@seanlaw
Contributor

seanlaw commented May 25, 2020

Yes, I'd gladly have some help in restructuring the loops. Note that for the one in find_motifs it's not enough to use for i in range(k), because we don't necessarily find a motif in every iteration.

Okay, I will take a look and thank you for the heads-up!

Doing find_motifs(T, P.max() - P, k, atol=0, rtol=0, min_num_neigbours=1) should yield the top k discords (min_num_neigbours has to be a new parameter), as this way you should find the peaks of the matrix profile, i.e. the discords.

So, even though I'm Canadian (now living in the USA) and I grew up spelling "neighbours" with a "U", I think we should remove the "U" in "neighbours" and simply call this "min_neighbors". By any chance, did you come across this issue about finding anomalies (and possibly motifs) using np.diff? Could it be applicable here?

I see what you mean now with mean/stddev, I will implement it this way.

Thank you!

@mihailescum
Contributor Author

mihailescum commented May 25, 2020

I didn't follow the thread, but I think using the second derivative is overcomplicating the problem...

With this method you'd find all peaks (or, better said, candidates for peaks), but generally for motif or discord search you only need the lowest/highest ones. One could also take advantage of scipy.signal.find_peaks, but even then you'd get a lot of peaks you aren't interested in.

I think that the k-motif problem is fairly straightforward, as you look for the lowest MP values and take these subsequences as candidates for motifs. Especially since the MP was engineered for solving this problem, I think following the ideas outlined in the papers makes sense.

And searching for discords is nothing other than searching for maxima in the matrix profile. Since searching for the maxima of a function is the same as searching for the minima of the negated function, searching for discords and motifs are very similar tasks, but I think it makes sense to provide the user with two different interfaces for simplicity.

@seanlaw
Contributor

seanlaw commented May 25, 2020

Sounds good

@seanlaw
Contributor

seanlaw commented May 25, 2020

Especially since the MP was engineered for solving this problem, I think following the ideas outlined in the papers makes sense.

Can you remind me which paper(s) talk about this? I feel like I should go back and re-read them and we should reference them in the docstring Notes section

@seanlaw
Contributor

seanlaw commented May 26, 2020

Do you think it is possible to also find anomalies/outliers in multidimensional matrix profiles? Not sure if this is related to #168

@mihailescum
Contributor Author

mihailescum commented May 26, 2020

I'd recommend looking at the MP tutorial https://www.cs.ucr.edu/~eamonn/Matrix_Profile_Tutorial_Part1.pdf (especially slide 35, if I'm not mistaken; it is detailed there)

@seanlaw
Contributor

seanlaw commented May 26, 2020

Thank you. I definitely missed that part

@seanlaw
Contributor

seanlaw commented May 26, 2020

I think it makes sense to provide the user with two different interfaces for simplicity.

Do you mean two separate modules? One for motifs and one for discords? Or only two separate functions within the same module?

@mihailescum
Contributor Author

Sorry for the short answer before, I was in a hurry!

I hope some of the doubts about motif discovery could be cleared up; otherwise, please ask, as it also helps to strengthen my understanding. Maybe there are things that should be implemented in a different manner?

Generally, multidimensional motif discovery is possible (see the mstomp paper), but several changes need to be made first; in particular, mstump has to return a matrix profile subspace (see definition 13 of the paper). While this should be fairly easy to implement (although it is outside my scope right now), MDL is a different beast, and while it might be interesting, I know too little about it.

@mihailescum
Contributor Author

Do you mean two separate modules? One for motifs and one for discords? Or only two separate functions within the same module?

I was thinking about a function that calls kmotifs in the way I outlined above, so the user doesn't see how/why it works

@seanlaw
Contributor

seanlaw commented May 26, 2020

I hope some of the doubts about motif discovery could be cleared up; otherwise, please ask, as it also helps to strengthen my understanding. Maybe there are things that should be implemented in a different manner?

Honestly, I think the best thing would be to demonstrate the capability through a Tutorial with a good illustrative example. Perhaps, only with code and minimal text. While it may be more work, it usually helps to uncover the API. I think I understand what is happening for the most part. Essentially

  1. You run through the MP and look for the smallest value (i.e. a motif)
  2. Then, you use that subsequence as your Q to search for other subsequences within some threshold
  3. Once complete, you set those part(s) of the MP to np.inf and repeat until k motifs are identified
  4. This can be modified for discords by multiplying MP * -1.0 and essentially flipping the problem on its head, which is nice
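The first three steps might be sketched like this, with a naive (non-normalized) Euclidean distance profile standing in for core.mass and a bounded for-loop instead of a while-loop; all names, defaults, and toy values here are illustrative, not stumpy's actual implementation:

```python
import numpy as np

def naive_distance_profile(Q, T):
    # Stand-in for stumpy's core.mass: plain Euclidean distances of Q
    # against every subsequence of T (no z-normalization).
    m = len(Q)
    return np.array([np.linalg.norm(T[i : i + m] - Q) for i in range(len(T) - m + 1)])

def naive_motifs(T, P, m, k=2, threshold=1.0):
    # Steps 1-3 from above; step 4 (discords) would invert P first.
    P = P.astype(float).copy()
    found = []
    for _ in range(k):                  # explicit bound instead of `while True`
        if not np.isfinite(P).any():
            break
        candidate_idx = np.argmin(P)    # step 1: smallest matrix profile value
        Q = T[candidate_idx : candidate_idx + m]
        D = naive_distance_profile(Q, T)            # step 2: search with Q
        matches = np.flatnonzero(D <= threshold)
        found.append((candidate_idx, matches))
        P[matches] = np.inf             # step 3: exclude the matches and repeat
    return found

T = np.array([0.0, 1.0, 0.0, 5.0, 0.0, 1.0, 0.0])
P = np.array([0.1, 2.0, 3.0, 2.5, 0.1])  # toy matrix profile for T with m=3
print(naive_motifs(T, P, m=3))           # first motif at index 0, matched at 0 and 4
```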

I am thinking that instead of calling it motif.py we call it search.py (or discover.py) and we have a single generic _search (or _discover) function that can be called by either motifs() or discords(). Then, users would do something like:

from stumpy import search

search.motifs(mp, k=10, ...)

# or

search.discords(mp, k=10, ...)

And then, later, we might be able to add m-dimensional input support. In that case, I think it would be wise to assume that the user supplies a 1-D MP input and, therefore, it is understood that the MP is not a multi-dimensional one. But, perhaps, the dimensions of the input time series give that information away as well.

Generally, multidimensional motif discovery is possible (see the mstomp paper), but several changes need to be made first; in particular, mstump has to return a matrix profile subspace (see definition 13 of the paper). While this should be fairly easy to implement (although it is outside my scope right now),

You know, I could never understand definition 13! You'll have to explain it to me. Maybe discuss this in our DM #163?

MDL is a different beast, and while it might be interesting, I know too little about it.

Understood. I might look into MDL a little and do some research and add what I discover to #186

I was thinking about a function that calls kmotifs in the way I outlined above, so the user doesn't see how/why it works

Please see my response of doing a search.py (or discover.py) instead

@mihailescum
Contributor Author

We can definitely rename motifs.py, but I'm not so sure about a generic _search function. In the end, it would solve a specific task, e.g. motif discovery, so it should be named accordingly. However, it might make sense to do something like

def find_motifs(T, ...):
    if T is 1D:
        _find_motifs_onedimensional(T, ...)
    elif T is multi-D:
        _find_motifs_multidimensional(T, ...)

But, I'm not so sure if it makes sense to have these private functions, because probably the multidimensional case will require different arguments. What do you think?

@seanlaw
Contributor

seanlaw commented May 27, 2020

But, I'm not so sure if it makes sense to have these private functions, because probably the multidimensional case will require different arguments. What do you think?

I'm probably not seeing/understanding what is actually needed for discord/motif finding in the multi-dimensional case. So, I can't tell what the arguments would be.

What I was hoping was to rename motif.py to search.py and have the contents be:

# stumpy/search.py

def _search(T, P, k=1, excl_zone=None, atol=None, rtol=1.0, discord=False):
    """
    This is just renaming `k_motifs`, but note the `discord` argument, which controls whether we search for discords or motifs. Also notice that we call the output a more generic `top_Q` rather than `top_motifs`
    """
    ...
    while len(top_Q_indices) < k:
        ...
        occurrences = _find_occurrences(Q, T, M_T, Σ_T, excl_zone, profile_value, atol, rtol, discord)
    ...
    return top_Q_indices, top_Q_values


def _multi_search(T, P, k=1, excl_zone=None, atol=None, rtol=1.0, discord=False):
    """
    This is the multi-dimensional version of `_search`
    """
    ...
    # Something happens here but this may or may not be needed
    ...
    return top_Q_indices, top_Q_values

def _find_occurrences(
    Q, T, M_T=None, Σ_T=None, excl_zone=None, profile_value=0.0, atol=None, rtol=1.0, discord=False
):
    ...
    return occurrences

def motifs(T, P, k=10, excl_zone=None, atol=None, rtol=1.0):
    if T.ndim == 1:
        top_motifs_indices, top_motifs_values = _search(T, P, k, excl_zone, atol, rtol)
    elif T.ndim == 2:
        top_motifs_indices, top_motifs_values = _multi_search(T, P, k, excl_zone, atol, rtol)
    ...
    return top_motifs_indices, top_motifs_values

def discords(T, P, k=10, excl_zone=None, atol=None, rtol=1.0):
    """
    Notice that `discord=True`
    """
    if T.ndim == 1:
        top_discords_indices, top_discords_values = _search(T, P, k, excl_zone, atol, rtol, discord=True)
    elif T.ndim == 2:
        top_discords_indices, top_discords_values = _multi_search(T, P, k, excl_zone, atol, rtol, discord=True)
    ...
    return top_discords_indices, top_discords_values

So, as you pointed out, since the processes of finding discords and motifs are so similar, I want to make sure that we have a single algorithm/entry point that handles both discords/motifs (i.e., _search) but that they are differentiated by a single discord function argument. And motifs() and discords() are just thin wrappers around _search.

Does this help? Otherwise, maybe it is worth hopping on another call?

@mihailescum
Contributor Author

Don't you think that can easily become too convoluted? Having to keep track of if/else statements in the _search function versus calling the motifs function with slightly modified inputs?

Anyways, first I have to implement the motif search correctly 😄

About the call, sure! I have to see how and when I get back next week, but I'd definitely like to!

@seanlaw
Contributor

seanlaw commented May 28, 2020

Agreed. Let’s get something working first and let’s iterate and not be attached to what we decide now

@seanlaw
Contributor

seanlaw commented Mar 13, 2021

This is due to the possibility of passing max_matches=None and, if we were to track this case separately, I feel like everything would be more complicated. I don't think that this motif discovery is performance critical, so using built-in Python functionality should be fine.

Sorry, I missed that max_matches can be None (i.e., the number of matches may be larger). Thank you for explaining that! In that case I think everything is ready to merge except for one small issue. Currently, the _jagged_list_to_array docstring and parameter still refer to x and should be changed to a:

Fits a 2d jagged list into a 2d numpy array of the specified dtype.
The resulting array will have a shape of (len(x), v), where v is the length
of the longest list in x. 

and

x : list

@mihailescum
Contributor Author

I fixed the docstring; it should be fine now. Moreover, I slightly changed the default behavior for atol=None in the match function.

@seanlaw
Contributor

seanlaw commented Mar 13, 2021

I fixed the docstring; it should be fine now. Moreover, I slightly changed the default behavior for atol=None in the match function.

So, I don't quite understand this:

If None, defaults to the distance of Q to its closest match in T, which is not Q.

So, for motifs its default is atol = 0.25 * np.std(P) but it is different for match. If I am interpreting this correctly, the default radius for match is being set to twice the distance to its nearest neighbor? Why this number? Why not keep it consistent and have the default as atol = 0.25 * np.std(distance_profile)?

It's been so long that I don't recall how we chose 0.25. I know why we chose np.std(P) (to ensure that the radius is relative to the distances at hand) so it looks like we wanted to choose something smaller than a standard deviation away?

@mihailescum
Contributor Author

That's a good question. To be honest, coming up with good/sensible default values is really difficult, as every application requires a different notion of proximity. Consider the case of a heartbeat in an ECG (very repetitive, clearly defined pattern, thus requiring a very small threshold for close matches) and a spoken word in some audio data (noisy, differences in pronunciation, ..., thus matches can be similar while still having a somewhat large distance).

I don't think there was any reasoning with the 0.25, just to set it into relation with the matrix profile.

However, for match no matrix profile is required. Thus my reasoning was the following: in general, profile_value will be 0.0 if we don't have a MP, so only the atol is relevant. What is a sensible thing we can do without knowing the user's data? I figured it would be to return the best match of Q, no matter how far away it is (which is equivalent to setting atol=np.min(D)). Now, if we have a self-join, i.e. Q is a subsequence of T, it would be sensible to return the closest match we can find that is not Q itself. Hence I went for atol=np.min(D[D > 1e-7]), because np.min(D) would be 0.0 if T contains Q, namely at the index where Q appears in T.

My reasoning for setting rtol=1.0 is similar. Assuming atol=0.0, i.e. only rtol is important, we would like to find the closest match of Q. We also have to assume that the distance of the closest match is given in profile_value, which is exactly the case if we have a matrix profile. Thus, under these assumptions, match will return exactly the index of this closest match (and the index of Q, if Q is a subsequence of T).

Does this sound reasonable? If you have a suggestion for other default values, let me know. In the end, I think that the user will have to adjust these params anyway, and we should only give him a rough but sensible starting point.
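A tiny NumPy illustration of that self-join default (the distance profile values are made up; the 1e-7 cutoff is the small threshold mentioned above that filters out the trivial zero-distance match of Q with itself):

```python
import numpy as np

D = np.array([0.0, 4.2, 1.5, 3.3])  # toy distance profile; D[0] is Q matched against itself
atol = np.min(D[D > 1e-7])          # smallest non-trivial distance
print(atol)                          # 1.5

# With this default, the trivial self-match and the closest real match survive:
print(np.flatnonzero(D <= atol))     # [0 2]
```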

@seanlaw
Contributor

seanlaw commented Mar 13, 2021

Also, in match, what is the point/purpose of:

D = np.sum(D, axis=0) / d

This looks like it's somehow related to multi-dimensional matrix profiles but, in that case, we wouldn't squash all dimensions into one single dimension and then average.

@mihailescum
Contributor Author

Also, in match, what is the point/purpose of: D = np.sum(D, axis=0) / d

In the multidimensional case, the user would choose the k dimensions he is interested in (let's say dimensions [1, 2, 10]). In this case we would have to compute the k-dimensional distance profile over the selected dimensions (in our example, k=3). If I'm not mistaken, this is exactly defined as averaging the individual distance profiles.

For now, the user cannot select the dimensions he is interested in and can only use all available dimensions. Hence we need to average the distance profiles of all dimensions (which obviously makes no difference if we have a 1D time series).
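A toy example of the averaging step (D here is a made-up stack of per-dimension distance profiles with shape d x n):

```python
import numpy as np

d = 3
D = np.array([[1.0, 2.0, 3.0],
              [3.0, 2.0, 1.0],
              [2.0, 2.0, 2.0]])      # one distance profile per dimension

D_avg = np.sum(D, axis=0) / d        # the line under discussion
print(D_avg)                         # [2. 2. 2.]

# np.mean over axis 0 is the same computation:
print(np.allclose(D_avg, np.mean(D, axis=0)))  # True
```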

@seanlaw
Contributor

seanlaw commented Mar 13, 2021

In the multidimensional case, the user would choose the k dimensions he is interested in (let's say dimensions [1, 2, 10]). In this case we would have to compute the k-dimensional distance profile over the selected dimensions (in our example, k=3). If I'm not mistaken, this is exactly defined as averaging the individual distance profiles.

For now, the user cannot select the dimensions he is interested in and can only use all available dimensions. Hence we need to average the distance profiles of all dimensions (which obviously makes no difference if we have a 1D time series).

I see. I think the part that I was missing was that the user would have to be explicit and choose the precise k dimensions. In that case, indeed, you would simply sum up all of the k available dimensions and average. Thanks for the clarification!

Give me some time to think through the atol and rtol stuff. My mental model seems to be different from what is presented in your code and I want to first understand what your approach is doing and how it is different from what I am thinking of. I apologize for not catching this earlier but I also think it is important for us to figure this out before exposing it to users.

@mihailescum
Contributor Author

Of course, let me know when you are ready and if you have any more questions.

@seanlaw
Contributor

seanlaw commented Mar 13, 2021

However, for match no matrix profile is required. Thus my reasoning was the following: in general, profile_value will be 0.0 if we don't have a MP, so only the atol is relevant. What is a sensible thing we can do without knowing the user's data? I figured it would be to return the best match of Q, no matter how far away it is (which is equivalent to setting atol=np.min(D)). Now, if we have a self-join, i.e. Q is a subsequence of T, it would be sensible to return the closest match we can find that is not Q itself. Hence I went for atol=np.min(D[D > 1e-7]), because np.min(D) would be 0.0 if T contains Q, namely at the index where Q appears in T.

Okay, I think I understand now (and agree). Instead of 1e-7, would you mind adding a STUMPY_D_THRESHOLD = 1e-7 to the stumpy/config.py file (right above STUMPY_D_SQUARED_THRESHOLD = 1e-14) and then replacing this value in match with the constant?

My reasoning for setting rtol=1.0 is similar. Assuming atol=0.0, i.e. only rtol is important, we would like to find the closest match of Q. We also have to assume that the distance of the closest match is given in profile_value, which is exactly the case if we have a matrix profile. Thus, under these assumptions, match will return exactly the index of this closest match (and the index of Q, if Q is a subsequence of T).

After the above change then I think we're good! It would be really helpful to explain all of this in the tutorial with simple examples (i.e., if you set atol to this and rtol to this then this will happen).

@mihailescum
Contributor Author

mihailescum commented Mar 13, 2021

Actually, I've been thinking about it a bit longer. In the motif function, having a relative tolerance makes sense, to encode the intuition "if it has a low MP value, there is a very close match, so I only look at very similar matches, while if it has a large MP value, the nearest neighbor is far away, so when searching for other neighbors I also have to look further away". The absolute tolerance serves as an MP-independent threshold.

However, when matching patterns, there is no notion of an a priori distance to the closest neighbor. So it makes more sense to have only a single radius parameter (instead of atol, rtol and profile_value), which would default to np.min(D[D > STUMPY_D_THRESHOLD]), so the default behavior would be to return the closest, non-identical match.

So when match is called from within motifs, it would set tolerance = atol + MPvalue * rtol, while the user only has one parameter to worry about when looking only for matches.
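A rough sketch of that single-radius interface; toy_match and the inlined STUMPY_D_THRESHOLD are stand-ins for this discussion, not stumpy's actual API:

```python
import numpy as np

STUMPY_D_THRESHOLD = 1e-7  # assumed config constant from the discussion

def toy_match(D, tolerance=None):
    # Single-radius matching: one tolerance instead of atol/rtol/profile_value.
    if tolerance is None:
        # default: radius of the closest non-identical match
        tolerance = np.min(D[D > STUMPY_D_THRESHOLD])
    return np.flatnonzero(D <= tolerance)

D = np.array([0.0, 2.0, 0.5, 3.0])   # toy distance profile
print(toy_match(D))                   # default radius -> indices 0 and 2

# How motifs could call it, folding atol/rtol into a single radius:
atol, rtol, profile_value = 0.1, 1.0, 0.5
print(toy_match(D, tolerance=atol + profile_value * rtol))
```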

What do you think?

@seanlaw
Contributor

seanlaw commented Mar 14, 2021

I've been wondering if doing "top-K motif" is the wrong way to approach this? Before I address your comment, would you mind critiquing something first?

Let's say we have a matrix profile with a stddev of 20.0 (i.e., atol = 5.0) and its corresponding time series has a motif_1 with a nearest neighbor that is a distance of 0.1 away (i.e., a pretty close match) and another motif_2 with its nearest neighbor at a distance of 10.0 away (not really close). So, for motif_1, we'd look for other subsequences that are, at most, a distance of 0.1 + 5.0 = 5.1 away. This seems really, really big and would likely include a lot of subsequences that don't look like motif_1. The average distance in this set of subsequences would be between 0.1 and 5.1.

Then, for motif_2, we look for other subsequences that are, at most, a distance of 10.0 + 5.0 = 15.0 away. So, relatively speaking, this means that the available subsequences that match motif_2 will not likely look like motif_2 (i.e., the average distance in this set of subsequences will be between 10 and 15).

So, the set of subsequences "near" motif_1 has the highest potential to look like motif_1 while the set of subsequences "near" motif_2 will look a lot less like motif_2 (relatively speaking). And as k increases, the set of subsequences "near" the nth motif will look worse and worse. While choosing k might seem like an intuitive decision, the choice of k feels uninformed and quite arbitrary because it is clearly matrix profile/distance profile dependent.

Instead, my original thinking (which I haven't thought through fully) was to compute a threshold = max(np.min(P), np.mean(P) - 2 * np.std(P)) (i.e., this is the "classic" 2 standard deviations away from the mean but used in this case to identify "significantly" small nearest neighbor distances) and then only consider subsequences in P that have a 1-nearest-neighbor that is less than or equal to that threshold as a motif candidate. Of course, we could've been super generous and chosen the threshold as threshold = np.mean(P) or threshold = max(np.min(P), np.mean(P) - np.std(P)) but then we'd end up with a lot of candidate motifs. With the set of candidate motifs, I would just continue to use this same threshold to select the set of subsequences that match a given candidate motif. Practically, the motif function would be as follows:

# written for comprehension, not for speed/efficiency

if threshold is None:
    threshold = max(np.min(P), np.mean(P) - 2 * np.std(P))
top_k_values = []
top_k_indices = []

P[P > threshold] = np.inf

while np.any(np.isfinite(P)):
    candidate_idx = np.argmin(P)
    # The next three lines can be put in `match`. See below
    Q = T[candidate_idx : candidate_idx + m]
    distance_profile = core.mass(Q, T)
    indices = np.argwhere(distance_profile <= threshold).flatten()
    top_k_indices.append(indices)
    top_k_values.append(distance_profile[indices])
    for idx in indices:
        core.apply_exclusion_zone(P, idx, excl_zone)

So, the user doesn't get to specify k and, instead, k comes out as a result of adjusting the threshold. Also, for each motif that gets added to your top_k, by definition, you are guaranteed that each motif has at least one (or more) matches that are within the threshold.

Then, match function would do something similar like:

Q = T[candidate_idx : candidate_idx + m]
distance_profile = core.mass(Q, T)

if threshold is None:
    threshold = max(np.min(distance_profile), np.mean(distance_profile) - 2 * np.std(distance_profile))

indices = np.argwhere(distance_profile <= threshold).flatten()

return indices, distance_profile[indices]

For match we can potentially allow the user to define their own custom function for computing a threshold that is dependent on the distance_profile (think lambda function).

@mihailescum
Contributor Author

I agree on what you said, especially with this part.

So, the set of subsequences "near" motif_1 has the highest potential to look like motif_1 while the set of subsequences "near" motif_2 will look a lot less like motif_2 (relatively speaking). And as k increases, the set of subsequences "near" the nth motif will look worse and worse. While choosing k might seem like an intuitive decision, the choice of k feels uninformed and quite arbitrary because it is clearly matrix profile/distance profile dependent.

To summarize your proposal:

  • In match, replace atol and rtol with threshold, which can be None, a fixed float value, or a function of the distance profile D of Q with T.
  • The default, i.e. if threshold=None, would be threshold=max(np.min(distance_profile), np.mean(distance_profile) - 2 * np.std(distance_profile)).
  • In motifs we also replace atol and rtol with threshold, which can be None, a fixed float value, or a function.
  • If it is a value or a function, we pass it on to match via the threshold parameter. This implies it has to be a function of the distance profile of a candidate subsequence Q with T.
  • If it is None, we set it to threshold = max(np.min(P), np.mean(P) - 2 * np.std(P)), i.e. a fixed value. If the user wants more/fewer standard deviations, they can compute a different threshold and pass it to motifs as a float, because they know P at the time of calling the function.
  • We remove k, i.e. we return all motifs with a representative below the threshold. Since we return a numpy array, the user can still easily retrieve the top k motifs with motifs = stumpy.motifs(...)[:k]. Discussion: we could still keep it as a parameter, maybe renamed to max_motifs, so we can stop execution earlier if the user is only interested in the top k anyway. This would also prevent errors with [:k] if fewer than k motifs are within the threshold.
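
A sketch of the proposed default computation in motifs (the values of P are made up; the formula is the one from the summary above, computed once from the matrix profile so the user can recompute it with a different number of standard deviations and pass it in as a float):

```python
import numpy as np

# Hypothetical 1-D matrix profile P (values are made up for illustration).
P = np.array([0.2, 0.3, 1.5, 1.6, 1.7, 1.8])

# Proposed default: a fixed threshold derived from P at call time; np.min(P)
# acts as a floor when mean - 2*std would fall below the best profile value.
threshold = max(np.min(P), np.mean(P) - 2 * np.std(P))
```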

Did I get it right? Should we proceed like this?

I would like to add one more thing I noticed:

At the moment, we have the max_matches parameter. Assume the following case: max_matches=5 and a time series with two motifs. Motif A has seven matches, each with a profile value <0.5; motif B has three matches, and its top match has a matrix profile value >0.5.

In such a case, if I'm not mistaken, the current behavior would be the following: in line 134 we query the top max_matches matches and apply exclusion zones around them (line 151). Hence we would find two motifs made up of matches of motif A (the lowest five as one motif, the other two, non-excluded matches as a second motif) and only then the matches of motif B. This is not caught in the tests because I didn't think about this behavior. I will have to come up with a longer time series.

To solve it, we need to find all matches of motif A, apply an exclusion zone, and only then add the best max_matches to the output array.
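
A minimal sketch of that fix using a hypothetical helper (not the actual STUMPY implementation, and assuming a finite max_distance): gather all matches first, applying exclusion zones as we go, and truncate to max_matches only at the end:

```python
import numpy as np

def collect_matches(distance_profile, max_distance, excl_zone, max_matches):
    # Gather ALL matches below max_distance first, applying an exclusion
    # zone around each one, and only then keep the best max_matches.
    D = np.asarray(distance_profile, dtype=float).copy()
    matches = []
    while True:
        idx = int(np.argmin(D))
        if D[idx] > max_distance:
            break  # no remaining subsequence is close enough
        matches.append(idx)
        # exclude the neighborhood so trivial (overlapping) matches are skipped
        D[max(0, idx - excl_zone) : idx + excl_zone + 1] = np.inf
    return matches[:max_matches]
```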

@seanlaw
Contributor

seanlaw commented Mar 14, 2021

We could still keep it as a parameter, maybe renamed to max_motifs, so we can stop execution earlier if the user is only interested in the top k anyway. This would also prevent errors with [:k] if fewer than k motifs are within the threshold.

Yes! It's funny that I was thinking the exact same thing as I was reading your summary. It feels like there should be max_motifs and max_matches parameters for the motifs function, and a max_matches parameter for the match function?

Did I get it right? Should we proceed like this?

Yes! Do you like it? I hope it wasn't too hard to understand. I want it to be "easy" for the user to comprehend too. I welcome any criticism! This is probably one of the few places where we are significantly deviating from Keogh's work, but I think it might be a nice contribution. The only thing that I haven't thought through is how this affects the multi-dimensional case.

I would like to add one more thing I noticed:
At the moment, we have the max_matches parameter. Assume the following case: max_matches=5 and a time series with two motifs. Motif A has seven matches, each with a profile value <0.5; motif B has three matches, and its top match has a matrix profile value >0.5.
In such a case, if I'm not mistaken, the current behavior would be the following: in line 134 we query the top max_matches matches and apply exclusion zones around them (line 151). Hence we would find two motifs made up of matches of motif A (the lowest five as one motif, the other two, non-excluded matches as a second motif) and only then the matches of motif B. This is not caught in the tests because I didn't think about this behavior. I will have to come up with a longer time series.
To solve it, we need to find all matches of motif A, apply an exclusion zone, and only then add the best max_matches to the output array.

Interesting! Yes, what you said sounds correct and I am glad that you caught that.

Is threshold an appropriate word? cutoff? max_distance?

@mihailescum
Contributor Author

Thinking about it still confuses me a bit.

In the case of match, I think we can agree that max_distance is a great name, as this is exactly what the parameter controls: the maximum distance to the query to be called a match.

But what exactly do we want to control in the motif case? Is it, given a candidate subsequence Q for a representative of a new motif, the maximum distance other subsequences can have from Q, to be considered matches of the motif?

This does make sense to me, and in such a case I would also call the parameter max_distance and just pass it on to match. But the default value we chose (2 standard deviations of P) doesn't make sense and should rather be 2*std(dist(Q, T)), i.e. exactly the default of match. There shouldn't be any (default) dependency on P, because the matrix profile consists of many distances between completely different subsequences, while we are interested in bounding the maximal distance to one very specific subsequence (the current motif candidate).

Do you see my point? Do you agree?

@seanlaw
Contributor

seanlaw commented Mar 15, 2021

Thinking about it still confuses me a bit.

In the case of match, I think we can agree that max_distance is a great name, as this is exactly what the parameter controls: the maximum distance to the query to be called a match.

But what exactly do we want to control in the motif case? Is it, given a candidate subsequence Q for a representative of a new motif, the maximum distance other subsequences can have from Q, to be considered matches of the motif?

This does make sense to me, and in such a case I would also call the parameter max_distance and just pass it on to match. But the default value we chose (2 standard deviations of P) doesn't make sense and should rather be 2*std(dist(Q, T)), i.e. exactly the default of match. There shouldn't be any (default) dependency on P, because the matrix profile consists of many distances between completely different subsequences, while we are interested in bounding the maximal distance to one very specific subsequence (the current motif candidate).

Do you see my point? Do you agree?

I see your point. In the case of motif, maybe we should add a threshold or cutoff parameter that dictates whether or not a subsequence should be considered a "motif candidate"; this parameter is, by default, based on the input matrix profile, P. Separate from this threshold/cutoff parameter, we have a max_distance parameter (for motif) that is specific to match and based on the distance_profile computed at run-time in match (i.e., it never depends on P). So, you can control max_distance in match from motif (i.e., it can be passed down the stack), but max_distance itself has no real effect in motif.
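
A conceptual sketch of the two parameters (names follow this discussion; the final API may differ, and all numeric values below are made up): cutoff is derived from the matrix profile P and gates motif candidates, while max_distance is derived from the distance profile of a candidate Q against T and gates matches:

```python
import numpy as np

# cutoff gates motif *candidates*: computed from the matrix profile P.
P = np.array([0.4, 0.5, 2.0, 2.1, 2.2])
cutoff = max(np.min(P), np.mean(P) - 2 * np.std(P))
candidate_ok = P[0] <= cutoff  # does the subsequence at index 0 qualify?

# max_distance gates *matches*: computed only from the distance profile D
# of the candidate Q against T -- it never depends on P.
D = np.array([0.0, 0.25, 2.0, 2.2, 2.4])
max_distance = max(np.min(D), np.mean(D) - 2 * np.std(D))
matches = np.argwhere(D <= max_distance).flatten()
```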

Does that answer your question? Maybe I'm misunderstanding.

@mihailescum
Contributor Author

That's exactly what I was thinking too, great.

@seanlaw seanlaw left a comment

Overall, it looks great and I think that our new approach simplifies things a lot! I found some things that can be improved. Would you mind taking a look? Thanks in advance!

return top_k_indices, top_k_values


def motifs(
@seanlaw seanlaw Mar 16, 2021

So, there is a fundamental question of whether or not we should add the query motif to the list-of-lists. Currently, we identify a query motif, but it looks like we don't add that query subsequence to the results. This means it is up to the user to figure out what the query motif is for each set of indices.

I think we should include the query motif (and index) as the first element of each list.

Or maybe we return a third array that contains all of the motif indices?

@mihailescum
Contributor Author

For self joins, the query will always be in the list (the query is the subsequence of T which is closest to itself). For AB joins, it makes no sense to have the query in the list (because it is part of the "wrong" time series), but its closest match will be.

It actually makes more sense to terminate the motif search if profile_value > max_distance. profile_value is the distance to the closest match of Q in T, so if that is not within the maximal distance, there is no point in continuing. However, if profile_value <= max_distance, at least one match is ensured.
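
A sketch of that termination rule in a hypothetical candidate-selection loop (not the actual implementation; a real version would also apply exclusion zones rather than masking single entries):

```python
import numpy as np

def find_motif_candidates(P, max_distance, max_motifs):
    # Pick motif candidates from the matrix profile P, lowest value first.
    # profile_value is the distance of a candidate to its nearest neighbor;
    # once it exceeds max_distance, no later candidate can have even a
    # single match, so the search can stop early.
    P = np.asarray(P, dtype=float).copy()
    candidates = []
    while len(candidates) < max_motifs:
        idx = int(np.argmin(P))
        profile_value = P[idx]
        if profile_value > max_distance:
            break  # nearest neighbor already too far away
        candidates.append(idx)
        P[idx] = np.inf  # a real implementation would apply an exclusion zone
    return candidates
```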

@seanlaw
Contributor

Good points! Thank you

@seanlaw
Contributor

seanlaw commented Mar 18, 2021

@mexxexx Would you prefer if I handled the remaining minor changes so that you can focus your time on the tutorial? They are more like subtle refactoring. The additional changes that I am proposing should not change the return values (i.e., I will ensure that your tests still pass without alterations).

@mihailescum
Contributor Author

Hi @seanlaw, that would actually be great. I was a bit busy the last few days. I feel like you write the better docstrings anyway 😄
I will commit the latest version.

@seanlaw
Contributor

seanlaw commented Mar 18, 2021

I will commit the latest version.

@mexxexx Thank you for this wonderful and much desired contribution! As always, it is a pleasure working together with you!

@seanlaw seanlaw merged commit b8d7998 into TDAmeritrade:main Mar 18, 2021
@mihailescum mihailescum deleted the motifs branch March 18, 2021 21:34
@mihailescum
Contributor Author

Great, thank you and it was a pleasure!

@seanlaw seanlaw mentioned this pull request May 22, 2021