[WIP] Module for motif and discord discovery #185
Codecov Report
Coverage diff, master vs. #185:
  Coverage: 100.00% -> 100.00%
  Files: 13 -> 13
  Lines: 1027 -> 1032 (+5)
  Hits: 1027 -> 1032 (+5)
|
The tests fail because of a typo, but I'm interested in your take on the structure of the code. |
I like where it's going and the structure seems reasonable. The only thing that I can think of is whether we can generalize the problem to solve for motifs as well as discords (anomalies). Additionally, I have a bias that the default thresholds should be motivated by or relative to the mean/stddev of the matrix profile values. But I do like where you are going with I would strongly advise against having |
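One possible reading of a default threshold "relative to the mean/stddev of the matrix profile values" is sketched below with NumPy only. The function name `stddev_cutoff` and the `n_std` factor are illustrative assumptions; neither comes from this PR nor from the library's API.

```python
import numpy as np


def stddev_cutoff(profile, n_std=2.0):
    """Hypothetical helper: a motif cutoff placed n_std standard
    deviations below the mean of the finite matrix profile values."""
    P = np.asarray(profile, dtype=float)
    P = P[np.isfinite(P)]  # ignore inf/nan entries
    # Never go below the smallest profile value, so that at least
    # one subsequence can satisfy the cutoff.
    return max(P.mean() - n_std * P.std(), P.min())


mp = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
cutoff = stddev_cutoff(mp, n_std=1.0)
```

The point of such a default is that it scales with the data: a noisy series with large profile values gets a looser cutoff than a clean, highly repetitive one.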
What do you mean with motivating the motifs with mean/stddev? Searching for discords should be really easy: set And why do you advise against |
Sorry for the brevity in my previous response and, hopefully, I understand the use of
becomes
This way, the
I'm not sure I follow but I trust you 👍
I think, perhaps, that this is a philosophical point. There is a higher possibility that a |
Can you tell that I've been burned once too many times by other people's |
Only slightly 😄 Yes, I'd gladly have some help in restructuring the loops. note that for the one in
Doing I see what you mean now with mean/stddev, I will implement it this way. |
Okay, I will take a look and thank you for the heads-up!
So, even though I'm Canadian (now living in the USA) and I grew up spelling "neighbours" with a "U", I think we should remove the "U" in "neighbours" and simply call this "min_neighbors". By any chance did you come across this issue about finding anomalies (and possibly motifs) using
Thank you! |
I didn't follow the thread, but I think using the second derivative overcomplicates the problem... With this method you'd find all peaks (or, better said, candidates for peaks), but generally for motif or discord search you only need the lowest/highest ones. One could also take advantage of I think that the k motif problem is fairly straightforward, as you look for the lowest MP values and take these subsequences as candidates for motifs. Especially since the MP was engineered for solving this problem, I think following the ideas outlined in the papers makes sense. And searching for discords is nothing other than searching for maxima in the matrix profile. Since searching for maxima of a function is the same as searching for minima of the negated function, searching for discords and motifs are very similar tasks, but I think it makes sense to provide the user with two different interfaces for simplicity. |
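The equivalence between "highest values" and "lowest values of the negated profile" can be sketched in a few lines of NumPy. The names `k_lowest` and `k_highest` are hypothetical, and this sketch deliberately ignores exclusion zones, so neighboring indices can both be returned.

```python
import numpy as np


def k_lowest(profile, k):
    """Indices of the k smallest profile values (motif candidates)."""
    return np.argsort(profile, kind="stable")[:k].tolist()


def k_highest(profile, k):
    """Indices of the k largest profile values (discord candidates),
    found as the k smallest values of the negated profile."""
    return k_lowest(-np.asarray(profile, dtype=float), k)


mp = np.array([3.0, 0.5, 0.6, 4.0, 9.0, 8.5, 2.0, 0.4])
motif_candidates = k_lowest(mp, 2)
discord_candidates = k_highest(mp, 2)
```

Two thin public functions over one shared routine keep the user-facing interface simple while avoiding duplicated search logic.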
Sounds good |
Can you remind me which paper(s) talk about this? I feel like I should go back and re-read them and we should reference them in the docstring Notes section |
Do you think it is possible to also find anomalies/outliers in multidimensional matrix profiles? Not sure if this is related to #168 |
I'd recommend looking at the MP tutorial https://www.cs.ucr.edu/~eamonn/Matrix_Profile_Tutorial_Part1.pdf (especially slides 35 if I'm not wrong, there it is detailed) |
Thank you. I definitely missed that part |
Do you mean two separate modules? One for motifs and one for discords? Or only two separate functions within the same module? |
Sorry for the short answer before, I was in a hurry! I hope some of the doubts about motif discovery could be cleared up; if not, please ask, as it also helps to strengthen my understanding. Maybe there are things that should be implemented in a different manner? Generally, multidimensional motif discovery is possible (see the mstomp paper), but several changes need to be made first. In particular, mstump has to return a matrix profile subspace (see definition 13 of the paper). While this should be fairly easy to implement (although it is outside my scope right now), MDL is a different beast, and while it might be interesting, I know too little about it. |
I was thinking about a function that calls kmotifs in the way I outlined above, so the user doesn't see how/why it works |
Honestly, I think the best thing would be to demonstrate the capability through a Tutorial with a good illustrative example. Perhaps, only with code and minimal text. While it may be more work, it usually helps to uncover the API. I think I understand what is happening for the most part. Essentially
I am thinking that instead of calling it
And then, later, we might be able to add m-dimensional input support. In that case, I think it would be wise to assume that the user supplies a 1-D MP input and, therefore, it is understood that the MP is not a multi-dimensional one. But, perhaps, the dimensions of the input time series gives that information away as well.
You know, I could never understand definition 13! You'll have to explain it to me. Maybe discuss this in our DM #163?
Understood. I might look into MDL a little and do some research and add what I discover to #186
Please see my response of doing a |
We can definitely rename
But, I'm not so sure if it makes sense to have these private functions, because probably the multidimensional case will require different arguments. What do you think? |
I'm probably not seeing/understanding what is actually needed for discord/motif finding in the multi-dimensional case. So, I can't tell what the arguments would be. What I was hoping was to rename
So, as you pointed out, since process of finding discords and motifs are so similar then I want to make sure that we have a single algorithm/entry point that handles both discords/motifs (i.e., Does this help? Otherwise, maybe it is worth hopping on another call? |
Don't you think that can become too convoluted easily? Having to keep track of if/else statements in the Anyways, first I have to implement the motif search correctly 😄 About the call, sure! I have to see how and when I get back next week, but I'd definitely like to! |
Agreed. Let’s get something working first and let’s iterate and not be attached to what we decide now |
Sorry, I missed that
and
|
I fixed the docstring, it should be fine now. Moreover, I slightly changed the default behavior if |
So, I don't quite understand this:
So, for It's been so long that I don't recall how we chose |
That's a good question. To be honest, coming up with good/sensible default values is really difficult, as every application requires a different notion of proximity. Consider the case of a heartbeat in an ECG (very repetitive, clearly defined pattern, thus requires a very small threshold for close matches) and a spoken word in some audio data (noisy, differences in pronunciation, ... thus matches can be similar while still having a somewhat large distance). I don't think there was any reasoning with the However, for My reasoning for setting Does this sound reasonable? If you have a suggestion for other default values, let me know. In the end, I think that the user will have to adjust these params and we should only give them a rough, sensible start. |
Also, in
This looks like it's somehow related to multi-dimensional matrix profiles but, in that case, we wouldn't squash all dimensions into one single dimension and then average. |
In the multidimensional case, the user would choose the k dimensions they are interested in (let's say dimensions For now, the user cannot select the dimensions they are interested in and can only use all available dimensions. Hence we need to average the distance profiles of all dimensions (which obviously makes no difference if we have a 1D time series). |
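The averaging step described above can be sketched as follows, assuming the per-dimension distance profiles are stacked row-wise in an array of shape (d, n). The helper name `averaged_distance_profile` is hypothetical.

```python
import numpy as np


def averaged_distance_profile(per_dim_profiles):
    """Average distance profiles over all dimensions.

    A 1-D input is treated as a single dimension, so the average
    returns it unchanged.
    """
    D = np.atleast_2d(np.asarray(per_dim_profiles, dtype=float))
    return D.mean(axis=0)


one_dim = averaged_distance_profile([1.0, 2.0, 3.0])
two_dim = averaged_distance_profile([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]])
```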
I see. I think the part that I was missing was that the user would have to be explicit and choose the precise Give me some time to think through the |
Of course, let me know when you are ready and if you have any more questions. |
Okay, I think I understand now (and agree). Instead of
After the above change then I think we're good! It would be really helpful to explain all of this in the tutorial with simple examples (i.e., if you set |
Actually I was thinking about it a bit longer. In the motif function, having a relative tolerance makes sense, to encode the intuition "if it has a low MP value, there is a very close match, so I only look at very similar matches, while if it has a large MP value, the nearest neighbor is far away so when searching for other neighbors, I also have to look further away". The absolute tolerance serves as a MP independent threshold. However, when matching patterns, there is no notion of a a priori distance to the closest neighbor. So it makes more sense to have only a single So when What do you think? |
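The intuition behind combining a relative and an absolute tolerance can be illustrated with a formula in the style of `numpy.isclose`. This is only an assumption about how the two could be combined; the function name and the parameter names `atol` and `rtol` are hypothetical here.

```python
def motif_max_distance(profile_value, atol=0.0, rtol=1.0):
    """Hypothetical: the largest distance at which a subsequence still
    counts as a match for a motif whose nearest neighbor lies at
    distance profile_value.

    rtol scales with the matrix profile value (a close nearest neighbor
    gives a tight search radius), while atol is independent of it.
    """
    return atol + rtol * profile_value


tight = motif_max_distance(0.25, atol=0.5, rtol=2.0)  # low MP value
loose = motif_max_distance(1.0, atol=0.5, rtol=2.0)   # higher MP value
```

When matching a standalone query pattern there is no matrix profile value to scale against, which is why a single absolute threshold is the natural choice in that case.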
I've been wondering if doing "top-K motif" is the wrong way to approach this? Before I address your comment, would you mind critiquing something first? Let's say we have a matrix profile with a stddev of Then, for So, the set of subsequences "near" Instead, my original thinking (which I haven't thought through fully) was to compute a
So, the user doesn't get to specify Then,
For |
I agree on what you said, especially with this part.
To summarize your proposal:
Did I get it right? Should we proceed like this? I would like to add one more thing I noticed: At the moment, we have the In such a case, if I'm not mistaken, the current behavior would be the following: In line 134 we query the top To solve it, we need to match all matches of motif A, apply an exclusion zone, and only then add the best |
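The proposed order of operations (find all matches, apply an exclusion zone, and only then keep the best ones) could look roughly like the greedy selection below. The function `select_matches` and its parameter names are illustrative assumptions, not the code in this PR.

```python
import numpy as np


def select_matches(distance_profile, max_distance, excl_zone, max_matches=None):
    """Greedily pick matches in order of increasing distance, masking
    an exclusion zone around each pick so that trivially overlapping
    neighbors are not reported as separate matches."""
    D = np.array(distance_profile, dtype=float)  # work on a copy
    D[D > max_distance] = np.inf  # too far away to count as a match
    matches = []
    while max_matches is None or len(matches) < max_matches:
        idx = int(np.argmin(D))
        if not np.isfinite(D[idx]):
            break  # no candidates left
        matches.append(idx)
        D[max(0, idx - excl_zone) : idx + excl_zone + 1] = np.inf
    return matches


D = [0.0, 0.1, 3.0, 0.4, 0.5, 2.0, 0.3]
all_matches = select_matches(D, max_distance=1.0, excl_zone=1)
best_two = select_matches(D, max_distance=1.0, excl_zone=1, max_matches=2)
```

Because the exclusion zone is applied while selecting, index 1 (distance 0.1) is dropped as a trivial neighbor of index 0 rather than consuming one of the limited match slots.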
Yes! It's funny that I was thinking the exact same thing as I was reading your summary. It feels like there should be a
Yes! Do you like it? I hope it wasn't too hard to understand? I want it to be "easy" for the user to comprehend too. I welcome any criticism too! This is probably one of the few places where we are significantly deviating from Keogh's work but I think it might be a nice contribution. The only thing that I haven't thought through is how this affects the multi-dimensional case.
Interesting! Yes, what you said sounds correct and I am glad that you caught that. Is |
Thinking about it still confuses me a bit. In case of But what exactly do we want to control in the This does make sense to me and in such a case I would also call the parameter Do you see my point? Do you agree? |
I see your point. In the case of Does that answer your question? Maybe I'm misunderstanding |
That's exactly what I was thinking too, great. |
Overall, it looks great and I think that our new approach simplifies things a lot! I found some things that can be improved. Would you mind taking a look? Thanks in advance!
    return top_k_indices, top_k_values


def motifs(
So, there is a fundamental question of whether or not we should add the query motif into the list-of-lists? Currently, we identify a query motif but it looks like we don't add that query subsequence to the results. This means that it is up to the user to figure out what the query motif is for each set of indices.
I think we should include the query motif (and index) as the first element of each list.
Or maybe we return a third array that contains all of the motif indices?
For self joins, the query will always be in the list (the query is the subsequence of T which is closest to itself). For AB joins, it makes no sense to have the query in the list (because it is part of the "wrong" time series), but its closest match will be.
It actually makes more sense to terminate the motif search if profile_value > max_distance. profile_value is the distance from Q to its closest match in T, so if this is not within the maximal distance, there is no point in continuing. However, if profile_value <= max_distance, this ensures at least one match.
Good points! Thank you
@mexxexx Would you prefer if I handled the remaining minor changes so that you can focus your time on the tutorial? They are more like subtle refactoring. The additional changes that I am proposing should not change the return values (i.e., I will ensure that your tests still pass without alterations). |
Hi @seanlaw that would actually be great. I was a bit busy the last days. I feel like you write the better docstrings anyway 😄 |
@mexxexx Thank you for this wonderful and much desired contribution! As always, it is a pleasure working together with you! |
Great, thank you and it was a pleasure! |
I gave the motif discovery module a go (issue #184), what do you think? (Ignore all changes outside motifs.py and test_motifs.py, as we discuss this in another PR.) There are still more tests to be done, but I'm not sure how to approach it yet. Ideally I'd like a naive_k_motifs method.