-
Notifications
You must be signed in to change notification settings - Fork 318
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Add Tutorial for Matrix Profile XXI: MERLIN algorithm #Issue 417 #418
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
Codecov Report
@@ Coverage Diff @@
## main #418 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 35 35
Lines 2796 2796
=========================================
Hits 2796 2796 Continue to review full report at Codecov.
|
@ninimama Will you let me know if/when you think it would make sense for me to take a look? |
@seanlaw Do you want me to implement the MERLIN algorithm (Table 3 of the paper) and then let you know? Because, for now, it is just Table 1 and 2 (DRAG algorithm) that is used in MERLIN algorithm. |
Yes, it would be great if you could implement the MERLIN algorithm. In the meantime, I'll try to find some time to review the DRAG parts and I may have some questions for you. |
In case I forget to mention it, these initial versions of the code are excellent candidates for the unit tests. If you look at |
docs/Tutorial_MERLIN.ipynb
Outdated
@@ -0,0 +1,407 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are a couple of suggestions:
- Use
T
instead ofT_A
- The z-norm for
T_i
can be moved outside of thefor j
loop - Instead of returning
failure
, I think it is fine to simplyreturn C
with length0
It also looks like we may be able to parallelize the inner for j
loop
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also looks like we may be able to parallelize the inner
for j
loop
I need to read a few articles to see how this can be done. Will let you know if I get into any trouble. Did you use a similar approach in any of the stumpy modules? (so I can check it out and get some idea)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't worry about it for now. Let's focus on getting it to work first before making it fast
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have experience with this and can help assist. We'll be using the Numba package for parallelization
docs/Tutorial_MERLIN.ipynb
Outdated
@@ -0,0 +1,407 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason why we are storing the subsequence length m
inside of D
? It looks like it doesn't change and would be a constant. In fact, I would considering using two separate 1D arrays (idx
and distance
), do the computation, and combine the results into D
at the end before returning. This would make the code a lot easier to read as the variable names are obvious. Eventually, I can help rename things to be more consistent with the variable names in the code base (i.e., we typically use D
for a distance profile and so we'd likely just use discords
to represent the set of discords, etc).
Why are we converting C
into a list
? Is this necessary?
Instead of D[j_ind,-1]
please be explicit and use D[j_ind, 3]
.
I noticed that is_discord
isn't actually being used anywhere. You are setting it to True
/False
and that's it.
The use of C_indices_tot = np.arange(len(C_lst))
and C_indices = C_indices_tot[:]
seem complicated to me. I wonder if there is a way to simplify all of this referencing. The Table 2 in the paper does not seem to look as complicated
Why return a set of indices to remove? Why not just remove them before and only return the final set of discords
?
Reply via ReviewNB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason why we are storing the subsequence length
m
inside ofD
?
I misread the question. Sorry. I will modify it.
Why are we converting
C
into alist
? Is this necessary?Instead of
D[j_ind,-1]
please be explicit and useD[j_ind, 3]
.I noticed that
is_discord
isn't actually being used anywhere. You are setting it toTrue
/False
and that's it.
Thanks. I will take care of it. In fact, that was when I tried to understand the difference between the proposed algorithm in the paper and the one provided in MATLAB and what dj
is. So, I wrote as I tried to figure things out and didn't pay much attention to small details. So, I called it [WIP]. I will refine the code and will consider your comments in the next commit for sure.
The use of
C_indices_tot = np.arange(len(C_lst))
andC_indices = C_indices_tot[:]
seem complicated to me. I wonder if there is a way to simplify all of this referencing. The Table 2 in the paper does not seem to look as complicatedWhy return a set of indices to remove? Why not just remove them before and only return the final set of
discords
?
I totally agree. I need to go through it. I remember I got an error about the change in the size of set
object in the for loop. So, I tried to resolve it. I need to go back to that set
approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good and no rush! I just wanted to provide feedback while it was still fresh in my head. I can tell you that I don't do well jumping back-and-forth from PR to PR. The context switching hurts my brain :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Table 2 in the paper does not seem to look as complicated
I tried to go through the algorithm one more time. I believe the Table 2 algorithm is not complete. Let's take a look at dj
: It is initially set to inf
. And, it is being updated through the outer loop . It is not shown in the TABLE 2; but, that is what authors did in their MATLAB code. They store it and updated it. (please see below)
% Algorithm 1: Discord Refinement Phase
cand_nndists2 = containers.Map(C.keys, inf(length(C), 1), 'UniformValues', true);
cand_nn_pos = containers.Map(C.keys, NaN(length(C), 1), 'UniformValues', true);
r2 = r^2;
% I diverge from the pseudocode in the paper here. First, I have to use
% squared distances for correctness reasons in order to avoid taking a
% square root at every step where we check against best so far. Second,
% I had to change the way elements are removed from the candidate set, or
% rather I had to change the loop tracking variable. If elements are
% removed from an array while iterating over that same array, it will
% result in skipped elements in almost any environment. Instead I'm
% iterating over a copy of its hash keys, which remains constant over
% the inner 2 loops. Elements are still progressively removed from the
% candidate set.
for i = 1 : subseqcount
if isempty(C)
break;
end
keys = C.keys;
Si = (TS(i : i + L - 1) - mu(i)) ./ sig(i);
for j = 1 : length(keys)
if abs(i - keys{j}) < min_separation
continue;
end
dist2 = 0;
candidx = keys{j};
Candj = C(candidx);
nndist2 = cand_nndists2(candidx);
for k = 1 : L
dist2 = dist2 + (Si(k) - Candj(k))^2;
if dist2 >= nndist2
break;
end
end
if dist2 < r2
C.remove(candidx);
cand_nndists2.remove(candidx);
cand_nn_pos.remove(candidx);
continue;
elseif dist2 < cand_nndists2(candidx)
cand_nndists2(candidx) = dist2;
cand_nn_pos(candidx) = i;
end
end
end
I can use dictionary to store dj
(and also delete the ones (i.e. keys) that should be removed throughout the for loop). I used numpy array instead for now. If you thin dictionary or any other structure is better to store/update the discords, pleases let me know.
========================================================================
Now, there is one thing left that should be noticed for now! I took a look at the documents on their webpage and realized:
There are a few differences:
(1) It says the NN is at index 470. But I didn't get that in the result of my implementation
(2) [PROBABLY NOT IMPORTANT] The length of time series in the image above is 3000, similar to what provided in the paper. However, the length of the time series provided as a data set on the webpage of the paper is 2000.
(3) It says the actual discord value is 10.27 (above which the algorithm returns no discord). However, my implementation shows that the actual discord value is 10.43.
===================================================================
I already revised the first two functions according to your suggestion. I will implement MERLIN and then push the commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What value(s) do you get when you run the Matlab code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The discord location and its distance to NN are the same. HOWEVER, the index of the NN subsequence are not the same.
Does that imply that there are two subsequences that have the same distance to the discord? Or do you think that they reported the wrong nearest neighbor index? Can you compute the "other" nearest neighbor distance by hand and compare?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The discord location and its distance to NN are the same. HOWEVER, the index of the NN subsequence are not the same.
Does that imply that there are two subsequences that have the same distance to the discord? Or do you think that they reported the wrong nearest neighbor index? Can you compute the "other" nearest neighbor distance by hand and compare?
I just deleted my previous comment. I am going to test the output of my implementation as I think it doesn't return the corrrect NN distance. (I checked the distance between subseq and its NN (provided in my output) and their distance is 0! I am now trying to debug it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, don't forget that you can find the top discord using the more expensive stump
(i.e., compute the matrix profile and take the max). So you can at least use it to verify your results
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UPDATE:
EVERYTHING looks good! Thanks for your support. Now, I am going to do the MERLIN main algorithm,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I implemented the MERLIN algorithm using Table 3. And, I ALMOST successfully reproduce one of the figures of the paper (Fig. 11). I also compared it to the results of MATLAB code (MERLIN3.m) and the results are ALMOST the same. I used the term ALMOST because there are only two cases (out of 92 cases) where the discord indices were different ======================================== Some notes:
|
@ninimama When you say 92 cases is it only for the taxi cab dataset? Or is this for other datasets too? I think it is worthwhile applying MERLIN to a few different datasets in order to validate the approach. If it is for other datasets, can you show those in the tutorial so that I can take a look? We probably won't include all of them but I'd like to see them for my own reference. |
That is only for TAXI cab. I will plot those two discords that are different in Python compared to MATLAB. ================================================================= I will try to see if I can apply the MERLIN algorithm on any other data. As I said, the problem is that each data set provided on their webpage contains some sets of time series data, and it is hard to know which one is used in the paper. I will go through the paper again and the data sets to see if I can find at least two more data sets |
@ninimama |
…g. 17 of paper (top and botthm)
So, I applied MERLIN on Mars Science Laboratory data and visually reproduced Fig. 17 (top) and Fig. 17 (bottom). These two figures corresponds to two different sets of time series. I also got the results of MATLAB (marlin3.m) for Fig.17 (top) and I realized that out of 100 discords (with length m= ====================================================================== Some additional notes:
===================================================================== Final word (my suggestion): |
Thank you @ninimama for all your ongoing work and your nice succinct summary of the enhanced algorithm. I would agree with your general sentiment , especially given the headline description of MERLIN:
The downside of course, as @ninimama also alludes to is the increase in implementation complexity. |
Thank you @ninimama!
Can you explain why the three discords have different indices? Is it the difference between MERLIN and MERLIN3?
I agree with both you and @udf2457. If MERLIN3 is essentially producing the same results but it is drastically faster than the pseudocode in the paper then we should try to implement MERLIN3. @ninimama Aside from the added code complexity, do you see any negative reasons not to use MERLIN3? |
@seanlaw
I believe the reason is related to the implementation of MERLIN3. This is what I understand: Let's say ================================================================= I compared the results for both NYC TAXI data (Fig. 11) and also MSL-A-4 (Fig. 17-top). MSL-A-4 (Fig. 17-top), the difference is negligible and in TAXI case study the difference is small. =========================================================================== MERLIN vs MERLIN3: ==========================================================================
All in all, I believe MERLIN3 is a better choice as it provides almost the same discords but faster. Some of them might be false positive which is totally okay as the authors mentioned the same challenge for MERLIN in their paper. The question that I have in mind and might be a good idea to ask the authors is this: Does my explanation make sense? |
@ninimama in relation to the false-positives I agree. Infact if you look at the penultimate slide of the PPT in the MERLIN3 zip, the authors mention three scenarios that may lead to false-positives (constant regions, twin freaks and large magnitude changes). In relation to your suggestion of a question to ask the authors, would you like me to try to reach out to them again or will/are you do/doing that already ? If me, then are there any other questions at the back of your mind we could present at the same time ? I will could also see if I can entice one of them to review this PR and comment.... ;-) |
Thanks for the input. You are right. They discussed some of the scenarios that can lead to false positive.
I haven't asked them yet.
This is MERLIN implementation. Do you think they are interested in reviewing MERLIN as they are currently working on MERLIN3? ================================================================= I probably take a few days off and work on another stumpy project to finish that first and then will resume my work on MERLIN. |
Fair point, and given one of them did make a passing comment to me (whilst we awaited feedback from the MERLIN code author) I think you're correct in your thinking:
(i.e. "introduction to MERLIN" being the name of the zip with the MERLIN3 code in it). |
Nope, I have no additional questions at the moment. @udf2457 Please feel free to reach out to them and thanks in advance! |
Just a quick note on my recent comment: I thought again and realized updating r doesn't affect the discord and can only impact the computation time. Those few cases where the discovered discord index was different is because of the enhanced version of DRAG algorithm used by the authors. they try to eliminate some of the subsequences to speed up the computation. And that results in those few errors but it is worth it as it can reduce the computation time. |
Just landed in my Inbox in reply to your earlier question:
|
Would you please explain this comment further? because, the way I see it, is that I have if/else statement and each of those assigns different value to last_no_conflict. And, then, this variable is going to be used in the for-loop that is outside of the if-else block.
I can say last_no_conflict+1 is equivalent to first_with_conflict. Then, I can change the if/else and the for-loop that comes after accordingly. View entire conversation on ReviewNB |
It's usually best to reply to each comment within the notebook so at least the context (rough location of the original question) is preserved. Unfortunately, this review process is an imperfect system but it is clear to me that notebook comments are annotated with I'm starting to feel like this PR is getting way too long. We might want to consider starting a new PR in the near future. I recommend starting a new PR after we've completed the review of Phase 1. What do you think? |
I see your point in that it is being assigned later. Unfortunately, I still don't understand what the goal is from a conceptual level. However, from experience, when I see this kind of
Now that you have working code, the next part is refactoring to make sure that anybody who reads it can immediately see the logic. I can help but only if I understand what the goal is and I'll need your help with that. View entire conversation on ReviewNB |
Agreed. Since we now find the direction, we can focus better in a new PR.
Just to confirm, you mean I should, for instance, write a comment in another cell of notebook and asks the reader to go there to read about the comment that should be inside the code? I can do this next time for sure.
I see. I am trying to see if I can change the variables (and their names accordingly) to make it more legible. |
@seanlaw Also:
I am going to answer each of your comments in its corresponding reply section. |
This variable is automatically initialized by (fast)MERLIN and will be passed to this function when it is called inside the main (fast)MERLIN function. I added a note to the docstring. View entire conversation on ReviewNB |
Maybe it's a personal preference but I like how the matrix profile papers lay things out by first defining the phase and what it is trying to accomplish. Then, it shows the pseudocode and then goes through each line of the code. So, in a similar fashion, you can have one (Markdown) cell that describes what the overall goal is and what the general thought process is for achieving that goal. The "deeper thinking"/"rationale" behind the concepts is the most important to elaborate on and then the code should be clear/obvious enough to support the rationale. Then, start a new (Code) cell where you write the corresponding code. Finally, in a third cell, walk through the relevant sections/lines of code to explain what those lines are doing. If you ever find that the authors tried to do something "smart"/"tricky" then that would be really important to point out. We don't want those things to be hidden. But, hopefully, those non-obvious things are captured in the Notes section of the function's docstring. Usually, when you do this, you'll realize that there just might be too much code and that there is an opportunity to refactor and simplify things OR even split things out into a separate function. |
Resolved! I change the code accordingly. View entire conversation on ReviewNB |
Resolved! I also avoid pre-generating the array by creating an empty array as follows:
View entire conversation on ReviewNB |
I changed the code accordingly.
View entire conversation on ReviewNB |
esolved. Outside of for-loop:
And, inside the for-loop, I re-use it as follows:
View entire conversation on ReviewNB |
#explanation:
#NOTE: #we are trying to see if we are "allowed" to compare the so-far-discorverd candidates #(i.e. the ones found among indices [0,i]) with "at least" one subsequence from the current block! #please note that cand_index[0] is ALWAYS less than i; because, so far, we just searched indices [0, i]. #so, the condition might seem to be always true. However, what if excl_zone>block_size? see example below:
View entire conversation on ReviewNB |
the variable is changed from
The code is changed accordingly.
Also, in line #70, View entire conversation on ReviewNB |
View entire conversation on ReviewNB |
#i.e. we are allowed to do within-block comparison #if not, that means any two subsequences of the block (i.e. their starting index is in the block) are #going to have conflict. Therefore, this block will be always skipped! The only downside now is that #more candidates are going to be returned at the end of this phase! #Therefore, we can set block_size higher than excl_zone to make sure it considers the within-block comparison #to prune some candidates.
#( NOTE: # let's consider an extreme case where block-size is the len of time series. # In this case, we are not getting that much advantage from the block-by-block approach. # The idea behind the block-by-block approach is to narrow down the pair-wise comparison to # the so-far-discovered candidates and the sequences (of each block) as we move forward in the time series. # So, what is the optimal block_size? one can think about it later. #) View entire conversation on ReviewNB |
the very first comparison is between the subsequence at index i (the starting index of block) and the one at index i+excl_zone. Alternatively, here, we first start at subsequence "i+excl_zone" in the block and compare it with the preceding subsequences
View entire conversation on ReviewNB |
Already resolved and answered in a similar comment. View entire conversation on ReviewNB |
Did you click "Resolved"? I don't see the conversations either :( I wonder if it is best to let me decide whether the comment is resolved (or just leave them as unresolved as long as possible) Actually, I think what is happening is that the comments/replies for the notebook follows the notebook version. Each time you push a new version, reviewNB will display the newest version of the notebook (which will have no comments since it is "new"). So, rather than continually pushing little changes one at a time, you should commit your changes to your local branch (so I do not see them) and do not push those changes to Github until ALL of the comments are addressed. I think that this will help reduce the amount of noise and confusion. |
we are trying to see if we are "allowed" to compare the so-far-discorverd candidates (i.e. the ones found among indices [0,i]) with "at least" one subsequence from the current block! Please note that cand_index[0] is ALWAYS less than i; because, so far, we just searched indices [0, i]. So, the condition might seem to be always true. However, what if excl_zone>block_size? see example below:
EXAMPLE: let's say we are now at a block that starts at i=100, and block_size=100, and excl_zone=120! Then, the right side of inequation becomes 79. Therefore, EVEN in a case where the very-first-discovered candidate (i.e. the one at cand_index[0]) is located at 79, there is still one subsequence in the block that the candidate should be compared with. And, that subsequence is the last subsequence of the block located at index 199.
View entire conversation on ReviewNB |
Also, I replied to each of your comments separately using its corresponding reply section. However, the whole conversation is not being appeared here!!!
Did you click "Resolved"? I don't see the conversations either :(
I just checked out all the comments again and clicked on "Resolved" button. |
@ninimama Actually, I think what is happening is that the comments/replies for the notebook follows the notebook version. Each time you push a new version, reviewNB will display the newest version of the notebook (which will have no comments since it is "new"). So, rather than continually pushing little changes one at a time, you should commit your changes to your local branch (so I do not see them) and do not push those changes to Github until ALL of the comments are addressed. I think that this will help reduce the amount of noise and confusion. This is starting to be unmanageable and overwhelming. Would you mind starting a new PR (for phase I) and close this one? And, when you are ready for review, stop making changes until we've agreed that you have addressed all of my questions. This way, each version of the notebook is in a "resolved state". Unfortunately, you may need to repeat your answers. How does that sound? |
@seanlaw Yes, my inbox has been swamped with notifications. I decided against commenting about at the time, but it would be nice if the volume could be quietened down a little. ;-) |
@seanlaw |
Pull Request Checklist
Below is a simple checklist but please do not hesitate to ask for assistance!
black
(i.e.,python -m pip install black
orconda install -c conda-forge black
)flake8
(i.e.,python -m pip install flake8
orconda install -c conda-forge flake8
)pytest-cov
(i.e.,python -m pip install pytest-cov
orconda install -c conda-forge pytest-cov
)black .
in the root stumpy directoryflake8 .
in the root stumpy directory./setup.sh && ./test.sh
in the root stumpy directory