Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Add Tutorial for Matrix Profile XXI: MERLIN algorithm #Issue 417 #418

Closed
wants to merge 26 commits into from

Conversation

NimaSarajpoor
Copy link
Collaborator

Pull Request Checklist

Below is a simple checklist but please do not hesitate to ask for assistance!

  • Fork, clone, and checkout the newest version of the code
  • Create a new branch
  • Make necessary code changes
  • Install black (i.e., python -m pip install black or conda install -c conda-forge black)
  • Install flake8 (i.e., python -m pip install flake8 or conda install -c conda-forge flake8)
  • Install pytest-cov (i.e., python -m pip install pytest-cov or conda install -c conda-forge pytest-cov)
  • Run black . in the root stumpy directory
  • Run flake8 . in the root stumpy directory
  • Run ./setup.sh && ./test.sh in the root stumpy directory
  • Reference a Github issue (and create one if one doesn't already exist)

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@codecov-commenter
Copy link

codecov-commenter commented Jun 20, 2021

Codecov Report

Merging #418 (d3b68aa) into main (d15460e) will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##              main      #418   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           35        35           
  Lines         2796      2796           
=========================================
  Hits          2796      2796           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d15460e...d3b68aa. Read the comment docs.

@seanlaw
Copy link
Contributor

seanlaw commented Jun 22, 2021

@ninimama Will you let me know if/when you think it would make sense for me to take a look?

@NimaSarajpoor
Copy link
Collaborator Author

@seanlaw
The commit itself is a good size for review. However, I need to check the notebook out again and plot the discords. I will let you know in a day or two after I push the next commit. Discords are found by DRAG algorithm (Table 1 and 2 of the paper)

Do you want me to implement the MERLIN algorithm (Table 3 of the paper) and then let you know? Because, for now, it is just Table 1 and 2 (DRAG algorithm) that is used in MERLIN algorithm.

@seanlaw
Copy link
Contributor

seanlaw commented Jun 22, 2021

Do you want me to implement the MERLIN algorithm (Table 3 of the paper) and then let you know? Because, for now, it is just Table 1 and 2 (DRAG algorithm) that is used in MERLIN algorithm.

Yes, it would be great if you could implement the MERLIN algorithm. In the meantime, I'll try to find some time to review the DRAG parts and I may have some questions for you.

@seanlaw
Copy link
Contributor

seanlaw commented Jun 22, 2021

In case I forget to mention it, these initial versions of the code are excellent candidates for the unit tests. If you look at tests/naive.py you'll see basic/slow but accurate implementation of all of the algorithms and then we compare the results from our (future) fast implementation with the naive implementation.

@@ -0,0 +1,407 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are a couple of suggestions:

  1. Use T instead of T_A
  2. The z-norm for T_i can be moved outside of the for j loop
  3. Instead of returning failure , I think it is fine to simply return C with length 0 

It also looks like we may be able to parallelize the inner for j loop


Reply via ReviewNB

Copy link
Collaborator Author

@NimaSarajpoor NimaSarajpoor Jun 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also looks like we may be able to parallelize the inner for j loop

I need to read a few articles to see how this can be done. Will let you know if I get into any trouble. Did you use a similar approach in any of the stumpy modules? (so I can check it out and get some idea)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't worry about it for now. Let's focus on getting it to work first before making it fast

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have experience with this and can help assist. We'll be using the Numba package for parallelization

@@ -0,0 +1,407 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why we are storing the subsequence length m inside of D ? It looks like it doesn't change and would be a constant. In fact, I would considering using two separate 1D arrays (idx and distance), do the computation, and combine the results into D at the end before returning. This would make the code a lot easier to read as the variable names are obvious. Eventually, I can help rename things to be more consistent with the variable names in the code base (i.e., we typically use D for a distance profile and so we'd likely just use discords to represent the set of discords, etc).

Why are we converting C into a list ? Is this necessary?

Instead of D[j_ind,-1]  please be explicit and use D[j_ind, 3] .

I noticed that is_discord isn't actually being used anywhere. You are setting it to True/False and that's it.

The use of C_indices_tot = np.arange(len(C_lst)) and C_indices = C_indices_tot[:] seem complicated to me. I wonder if there is a way to simplify all of this referencing. The Table 2 in the paper does not seem to look as complicated

Why return a set of indices to remove? Why not just remove them before and only return the final set of discords ?


Reply via ReviewNB

Copy link
Collaborator Author

@NimaSarajpoor NimaSarajpoor Jun 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why we are storing the subsequence length m inside of D ?

I misread the question. Sorry. I will modify it.

Why are we converting C into a list ? Is this necessary?

Instead of D[j_ind,-1]  please be explicit and use D[j_ind, 3] .

I noticed that is_discord isn't actually being used anywhere. You are setting it to True/False and that's it.

Thanks. I will take care of it. In fact, that was when I tried to understand the difference between the proposed algorithm in the paper and the one provided in MATLAB and what dj is. So, I wrote as I tried to figure things out and didn't pay much attention to small details. So, I called it [WIP]. I will refine the code and will consider your comments in the next commit for sure.

The use of C_indices_tot = np.arange(len(C_lst)) and C_indices = C_indices_tot[:] seem complicated to me. I wonder if there is a way to simplify all of this referencing. The Table 2 in the paper does not seem to look as complicated

Why return a set of indices to remove? Why not just remove them before and only return the final set of discords ?

I totally agree. I need to go through it. I remember I got an error about the change in the size of set object in the for loop. So, I tried to resolve it. I need to go back to that set approach.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good and no rush! I just wanted to provide feedback while it was still fresh in my head. I can tell you that I don't do well jumping back-and-forth from PR to PR. The context switching hurts my brain :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Table 2 in the paper does not seem to look as complicated

image

I tried to go through the algorithm one more time. I believe the Table 2 algorithm is not complete. Let's take a look at dj: It is initially set to inf. And, it is being updated through the outer loop . It is not shown in the TABLE 2; but, that is what authors did in their MATLAB code. They store it and updated it. (please see below)

% Algorithm 1: Discord Refinement Phase
cand_nndists2 = containers.Map(C.keys, inf(length(C), 1), 'UniformValues', true);
cand_nn_pos = containers.Map(C.keys, NaN(length(C), 1), 'UniformValues', true);
r2 = r^2;

% I diverge from the pseudocode in the paper here. First, I have to use 
% squared distances for correctness reasons in order to avoid taking a 
% square root at every step where we check against best so far. Second,
% I had to change the way elements are removed from the candidate set, or
% rather I had to change the loop tracking variable. If elements are 
% removed from an array while iterating over that same array, it will 
% result in skipped elements in almost any environment. Instead I'm 
% iterating over a copy of its hash keys, which remains constant over 
% the inner 2 loops. Elements are still progressively removed from the 
% candidate set.

for i = 1 : subseqcount
    if isempty(C)
        break;
    end
    keys = C.keys;
    Si = (TS(i : i + L - 1) - mu(i)) ./ sig(i);
    for j = 1 : length(keys)
        if abs(i - keys{j}) < min_separation
            continue;
        end
        dist2 = 0;
        candidx = keys{j};
        Candj = C(candidx);
        nndist2 = cand_nndists2(candidx); 
        for k = 1 : L
            dist2 = dist2 + (Si(k) - Candj(k))^2;
            if dist2 >= nndist2
                break;
            end
        end
        if dist2 < r2
            C.remove(candidx);
            cand_nndists2.remove(candidx);
            cand_nn_pos.remove(candidx);
            continue;
        elseif  dist2 < cand_nndists2(candidx)  
            cand_nndists2(candidx) = dist2;
            cand_nn_pos(candidx) = i;
        end
    end    
end

I can use dictionary to store dj (and also delete the ones (i.e. keys) that should be removed throughout the for loop). I used numpy array instead for now. If you thin dictionary or any other structure is better to store/update the discords, pleases let me know.

========================================================================

Now, there is one thing left that should be noticed for now! I took a look at the documents on their webpage and realized:

image

There are a few differences:

(1) It says the NN is at index 470. But I didn't get that in the result of my implementation

(2) [PROBABLY NOT IMPORTANT] The length of time series in the image above is 3000, similar to what provided in the paper. However, the length of the time series provided as a data set on the webpage of the paper is 2000.

(3) It says the actual discord value is 10.27 (above which the algorithm returns no discord). However, my implementation shows that the actual discord value is 10.43.

===================================================================

I already revised the first two functions according to your suggestion. I will implement MERLIN and then push the commit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What value(s) do you get when you run the Matlab code?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The discord location and its distance to NN are the same. HOWEVER, the index of the NN subsequence are not the same.

Does that imply that there are two subsequences that have the same distance to the discord? Or do you think that they reported the wrong nearest neighbor index? Can you compute the "other" nearest neighbor distance by hand and compare?

Copy link
Collaborator Author

@NimaSarajpoor NimaSarajpoor Jun 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The discord location and its distance to NN are the same. HOWEVER, the index of the NN subsequence are not the same.

Does that imply that there are two subsequences that have the same distance to the discord? Or do you think that they reported the wrong nearest neighbor index? Can you compute the "other" nearest neighbor distance by hand and compare?

I just deleted my previous comment. I am going to test the output of my implementation as I think it doesn't return the corrrect NN distance. (I checked the distance between subseq and its NN (provided in my output) and their distance is 0! I am now trying to debug it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, don't forget that you can find the top discord using the more expensive stump (i.e., compute the matrix profile and take the max). So you can at least use it to verify your results

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UPDATE:

EVERYTHING looks good! Thanks for your support. Now, I am going to do the MERLIN main algorithm,

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seanlaw Just to say sorry for the delay in reply, yesterday was a bit nuts at $work. @ninimama good to hear you found a solution.

@NimaSarajpoor
Copy link
Collaborator Author

@seanlaw @udf2457

I implemented the MERLIN algorithm using Table 3. And, I ALMOST successfully reproduce one of the figures of the paper (Fig. 11).

I also compared it to the results of MATLAB code (MERLIN3.m) and the results are ALMOST the same.

I used the term ALMOST because there are only two cases (out of 92 cases) where the discord indices were different

========================================

Some notes:

  • The python implementation is naive and needs to be improved to enhance the efficiency in time and memory.
  • It seems the algorithm provided in the MATLAB code (MERLIN3.m) is different from the one provided in the Table3 of the paper. I think the main difference is "updating the parameter r " . Please take a look at how the author updated the parameter r in their MATLAB code, and compared it with Table 3.
  • Although different data sets are provided on the webpage of the paper, it is not really clear which samples of those data sets are used in the paper. For instance, CVP data (Section IV-B, Fig. 9) is provided as .mat file that contains 9 different sets of time series.
  • After checking out four data sets, It seems TAXI cab is a good option for reproducing the result. TAXI cab has 30min resolution. According to Fig. 11, it seems we should only consider the time series values at timestamps 00:00, 01:00, 02:00, ... (and skip the values recorded at --:30)

@seanlaw
Copy link
Contributor

seanlaw commented Jun 27, 2021

@ninimama When you say 92 cases is it only for the taxi cab dataset? Or is this for other datasets too? I think it is worthwhile applying MERLIN to a few different datasets in order to validate the approach. If it is for other datasets, can you show those in the tutorial so that I can take a look? We probably won't include all of them but I'd like to see them for my own reference.

@NimaSarajpoor
Copy link
Collaborator Author

NimaSarajpoor commented Jun 27, 2021

@seanlaw

@ninimama When you say 92 cases is it only for the taxi cab dataset? Or is this for other datasets too? I think it is worthwhile applying MERLIN to a few different datasets in order to validate the approach. If it is for other datasets, can you show those in the tutorial so that I can take a look? We probably won't include all of them but I'd like to see them for my own reference.

That is only for TAXI cab. I will plot those two discords that are different in Python compared to MATLAB.

=================================================================

I will try to see if I can apply the MERLIN algorithm on any other data. As I said, the problem is that each data set provided on their webpage contains some sets of time series data, and it is hard to know which one is used in the paper. I will go through the paper again and the data sets to see if I can find at least two more data sets

@udf2457
Copy link

udf2457 commented Jun 27, 2021

@ninimama MERLIN3.m seems to be (mostly) an attempt at optimising the algorithm with a view to introducing additional features under Merlin 3.0 in the future (see hints in the ppt file in the same zip)

@NimaSarajpoor
Copy link
Collaborator Author

NimaSarajpoor commented Jun 28, 2021

@seanlaw @udf2457

So, I applied MERLIN on Mars Science Laboratory data and visually reproduced Fig. 17 (top) and Fig. 17 (bottom). These two figures corresponds to two different sets of time series. I also got the results of MATLAB (marlin3.m) for Fig.17 (top) and I realized that out of 100 discords (with length m=100 to m=200), there are only three discords that have different index. (however, their starting position are close to each other.)

======================================================================

Some additional notes:

  • As @udf2457 mentioned, the merlin3.m is supposed to be a more efficient algorithm. (And, it is faster to some extent! I will explain it below in the next paragraph) However, I think DRAG algorithm plays a more important role as the implantation of this algorithm is changed to speed up the process of discovering the discords candidates. If you take a look at the comments in discord_discovery_gemm3.m file, you can see the authors do some eliminations to speed up the discovery process, and they avoid comparing a subsequence will ALL other subsequences. So, for instance, my implementation (which is based on Table 1 and 2 and 3) takes more than 1 hour to provide the results for the Fig 17 (bottom).

  • There is also another reason that their implementation is faster, and that is how they update r. In the paper, they reduce r by 1% if they don't find a discord for a subsequence of length m. And, as soon as they find a discord (and that is the top discord with length m), they go to discover the next discord with length m+1 and reset r to the dist of previous discord to its NN subsequence.
    HOWEVER, in the MATLAB code, they reduce r by a factor of (m-1)/m which is smaller than 0.99 for m<100. Therefore, r will decrease faster and, subsequently, a discord with dist>r will be discovered sooner. However, if m goes beyond 100, then 0.99 reduces the r faster. Since the major time of the algorithm is probably spent on lower m (I did not investigated that), their approach in marlin3.m is more efficient than the one proposed in Table3.

=====================================================================

Final word (my suggestion):
I think we should understand their approach in MATLAB code and implement that in Python, as it is faster compared to the one proposed in their paper.

@udf2457
Copy link

udf2457 commented Jun 28, 2021

Thank you @ninimama for all your ongoing work and your nice succinct summary of the enhanced algorithm.

I would agree with your general sentiment , especially given the headline description of MERLIN:

MERLIN, an algorithm that can efficiently and exactly find discords of all lengths in massive time series archives.

The downside of course, as @ninimama also alludes to is the increase in implementation complexity.

@seanlaw
Copy link
Contributor

seanlaw commented Jun 28, 2021

Thank you @ninimama!

So, I applied MERLIN on Mars Science Laboratory data and visually reproduced Fig. 17 (top) and Fig. 17 (bottom). These two figures corresponds to two different sets of time series. I also got the results of MATLAB (marlin3.m) for Fig.17 (top) and I realized that out of 100 discords (with length m=100 to m=200), there are only three discords that have different index. (however, their starting position are close to each other.)

Can you explain why the three discords have different indices? Is it the difference between MERLIN and MERLIN3?

I think we should understand their approach in MATLAB code and implement that in Python, as it is faster compared to the one proposed in their paper.

I agree with both you and @udf2457. If MERLIN3 is essentially producing the same results but it is drastically faster than the pseudocode in the paper then we should try to implement MERLIN3. @ninimama Aside from the added code complexity, do you see any negative reasons not to use MERLIN3?

@NimaSarajpoor
Copy link
Collaborator Author

NimaSarajpoor commented Jun 28, 2021

@seanlaw
I hope the following explanation finds you well.

Can you explain why the three discords have different indices? Is it the difference between MERLIN and MERLIN3?

I believe the reason is related to the implementation of MERLIN3. This is what I understand:

Let's say disc_m is the discord of length m that is different in MERLIN and MERLIN3. To see how much the two discords, found by MERLIN and MERLIN3, are different, we can compare their distance to their NN subsequence. The ideal scenario is to see if the two different discords have the same distance to their NN subsequence. Then, we can say both discords are equally good. And therefore, MERLIN and MERLIN3 have the same quality.

=================================================================

I compared the results for both NYC TAXI data (Fig. 11) and also MSL-A-4 (Fig. 17-top).
Comparing the results shows that the distance of disc_m to its NN subsequence discovered by MERLIN is bigger compared to the one discovered by MERLIN3. This shows that MERLIN discovered better discords in those few cases for both data sets.

MSL-A-4 (Fig. 17-top), the difference is negligible and in TAXI case study the difference is small.

===========================================================================

MERLIN vs MERLIN3:
MERLIN uses three different ways to update 'r' (please see line 5, 12, 16-18 of Table 3). However, MERLIN3 only uses 'r=r * (m-1)/m'. I think the value of 'r' (as threshold of distance to NN subsequence) causes such difference. Also, one might be interested to investigate the impact of MERLIN3-DRAG compared to the DRAG provided in Table 1&2 of the paper.

==========================================================================

@ninimama Aside from the added code complexity, do you see any negative reasons not to use MERLIN3?

All in all, I believe MERLIN3 is a better choice as it provides almost the same discords but faster. Some of them might be false positive which is totally okay as the authors mentioned the same challenge for MERLIN in their paper.

The question that I have in mind and might be a good idea to ask the authors is this:
Is there any particular reason why we should update 'r' more slowly as we increase the length of subsequence, m?

Does my explanation make sense?

@udf2457
Copy link

udf2457 commented Jun 28, 2021

@ninimama in relation to the false-positives I agree. Infact if you look at the penultimate slide of the PPT in the MERLIN3 zip, the authors mention three scenarios that may lead to false-positives (constant regions, twin freaks and large magnitude changes).

In relation to your suggestion of a question to ask the authors, would you like me to try to reach out to them again or will/are you do/doing that already ? If me, then are there any other questions at the back of your mind we could present at the same time ? I will could also see if I can entice one of them to review this PR and comment.... ;-)

@NimaSarajpoor
Copy link
Collaborator Author

NimaSarajpoor commented Jun 28, 2021

@udf2457

Thanks for the input. You are right. They discussed some of the scenarios that can lead to false positive.

In relation to your suggestion of a question to ask the authors, would you like me to try to reach out to them again or will/are you do/doing that already ? If me, then are there any other questions at the back of your mind we could present at the same time ?

I haven't asked them yet.
@seanlaw: Do you have any questions in mind?

I will could also see if I can entice one of them to review this PR and comment.... ;-)

This is MERLIN implementation. Do you think they are interested in reviewing MERLIN as they are currently working on MERLIN3?

=================================================================

I probably take a few days off and work on another stumpy project to finish that first and then will resume my work on MERLIN.

@udf2457
Copy link

udf2457 commented Jun 28, 2021

This is MERLIN implementation. Do you think they are interested in reviewing MERLIN as they are currently working on MERLIN3?

Fair point, and given one of them did make a passing comment to me (whilst we awaited feedback from the MERLIN code author) I think you're correct in your thinking:

In the meantime, you may be interested in our expanded "introduction to MERLIN"

(i.e. "introduction to MERLIN" being the name of the zip with the MERLIN3 code in it).

@seanlaw
Copy link
Contributor

seanlaw commented Jun 28, 2021

Nope, I have no additional questions at the moment. @udf2457 Please feel free to reach out to them and thanks in advance!

@NimaSarajpoor
Copy link
Collaborator Author

@seanlaw @udf2457

Just a quick note on my recent comment:

I thought again and realized updating r doesn't affect the discord and can only impact the computation time.

Those few cases where the discovered discord index was different is because of the enhanced version of DRAG algorithm used by the authors. they try to eliminate some of the subsequences to speed up the computation. And that results in those few errors but it is worth it as it can reduce the computation time.

@udf2457
Copy link

udf2457 commented Jun 30, 2021

@ninimama

Just landed in my Inbox in reply to your earlier question:

I believe the single update of 'r=r * (m-1)/m' our first implementation. We later found that 'r = M - (2xS)' performed better in our tests, but it also required the two initializing steps with different update equations.

Now to answer why we thought 'r=r * (m-1)/m' seemed reasonable. As we consider larger subsequence lengths, the change in maximum possible euclidean distance changes from sqrt(4L) to sqrt(4L+1). The percent difference in these sequential values decreases with longer lengths.

Copy link
Collaborator Author

Would you please explain this comment further? because, the way I see it, is that I have if/else statement and each of those assigns different value to last_no_conflict. And, then, this variable is going to be used in the for-loop that is outside of the if-else block.

I can say last_no_conflict+1 is equivalent to first_with_conflict. Then, I can change the if/else and the for-loop that comes after accordingly.


View entire conversation on ReviewNB

@seanlaw
Copy link
Contributor

seanlaw commented Dec 18, 2021

I am changing the notebook. Should I reply to every comment you provided to let you know where I address it in the notebook? (because as I change the code, the line number you mentioned in your comments are going to change)

Or, are you good to go through the changes yourself? I just want to see if I need to do anything to make the back-and-forth process easier.

It's usually best to reply to each comment within the notebook so at least the context (rough location of the original question) is preserved. Unfortunately, this review process is an imperfect system but it is clear to me that notebook comments are annotated with View entire conversation on ReviewNB at the bottom of the comment box while PR comments do not. However, instead of continually adding more and more comments to the code, keep the code unchanged (and therefore the line numbers remain mostly unchanged) and add other text before/after the code that refers to the specific line numbers as I've done so that it can be easily cross referenced.

I'm starting to feel like this PR is getting way too long. We might want to consider starting a new PR in the near future. I recommend starting a new PR after we've completed the review of Phase 1. What do you think?

Copy link
Contributor

seanlaw commented Dec 18, 2021

I see your point in that it is being assigned later. Unfortunately, I still don't understand what the goal is from a conceptual level. However, from experience, when I see this kind of range(x + 1) and then x = -1, it suggests that we might be able to do something else to simplify the logic of the code as this usually means that there is a lot of bookkeeping going on and the it is rather hard for others to follow.

Now that you have working code, the next part is refactoring to make sure that anybody who reads it can immediately see the logic. I can help but only if I understand what the goal is and I'll need your help with that.


View entire conversation on ReviewNB

@NimaSarajpoor
Copy link
Collaborator Author

I'm starting to feel like this PR is getting way too long. We might want to consider starting a new PR in the near future. I recommend starting a new PR after we've completed the review of Phase 1. What do you think?

Agreed. Since we now find the direction, we can focus better in a new PR.

It's usually best to reply to each comment within the notebook so at least the context (rough location of the original question) is preserved. Unfortunately, this review process is an imperfect system but it is clear to me that notebook comments are annotated with View entire conversation on ReviewNB at the bottom of the comment box while PR comments do not. However, instead of continually adding more and more comments to the code, keep the code unchanged (and therefore the line numbers remain mostly unchanged) and add other text before/after the code that refers to the specific line numbers as I've done so that it can be easily cross referenced.

Just to confirm, you mean I should, for instance, write a comment in another cell of notebook and asks the reader to go there to read about the comment that should be inside the code? I can do this next time for sure.

Now that you have working code, the next part is refactoring to make sure that anybody who reads it can immediately see the logic.

I see. I am trying to see if I can change the variables (and their names accordingly) to make it more legible.

@NimaSarajpoor
Copy link
Collaborator Author

@seanlaw
I think you can check out the implementation of phase 1 and the markdown in which I tried to explain the difference between MP, DRAG, MERLIN, and fastMERLIN.

Also:

It's usually best to reply to each comment within the notebook so at least the context (rough location of the original question) is preserved.

I am going to answer each of your comments in its corresponding reply section.

Copy link
Collaborator Author

This variable is automatically initialized by (fast)MERLIN and will be passed to this function when it is called inside the main (fast)MERLIN function. I added a note to the docstring.


View entire conversation on ReviewNB

@seanlaw
Copy link
Contributor

seanlaw commented Dec 18, 2021

Just to confirm, you mean I should, for instance, write a comment in another cell of notebook and asks the reader to go there to read about the comment that should be inside the code? I can do this next time for sure.

Maybe it's a personal preference but I like how the matrix profile papers lay things out by first defining the phase and what it is trying to accomplish. Then, it shows the pseudocode and then goes through each line of the code. So, in a similar fashion, you can have one (Markdown) cell that describes what the overall goal is and what the general thought process is for achieving that goal. The "deeper thinking"/"rationale" behind the concepts is the most important to elaborate on and then the code should be clear/obvious enough to support the rationale. Then, start a new (Code) cell where you write the corresponding code. Finally, in a third cell, walk through the relevant sections/lines of code to explain what those lines are doing.

If you ever find that the authors tried to do something "smart"/"tricky" then that would be really important to point out. We don't want those things to be hidden. But, hopefully, those non-obvious things are captured in the Notes section of the function's docstring.

Usually, when you do this, you'll realize that there just might be too much code and that there is an opportunity to refactor and simplify things OR even split things out into a separate function.

Copy link
Collaborator Author

Resolved! I change the code accordingly.


View entire conversation on ReviewNB

Copy link
Collaborator Author

Resolved! I also avoid pre-generating the array by creating an empty array as follows:

cand_index = np.empty(shape=(0, ), dtype=np.int64)


View entire conversation on ReviewNB

Copy link
Collaborator Author

I changed the code accordingly.

cand = np.empty(shape=(0, m))


View entire conversation on ReviewNB

Copy link
Collaborator Author

esolved. Outside of for-loop:

cand_in_block_flag = np.full(max_block_size, True)

And, inside the for-loop, I re-use it as follows:

is_cand_in_block = np.copy(cand_in_block_flag[:block_size])


View entire conversation on ReviewNB

Copy link
Collaborator Author

#explanation:

#NOTE: 
            #we are trying to see if we are "allowed" to compare the so-far-discorverd candidates
            #(i.e. the ones found among indices [0,i]) with "at least" one subsequence from the current block!
            #please note that cand_index[0] is ALWAYS less than i; because, so far, we just searched indices [0, i]. 
            #so, the condition might seem to be always true. However, what if excl_zone>block_size? see example below:
        #EXAMPLE: 
        #let's say we are now at a block that starts at i=100, and block_size=100, and excl_zone=120! 
        #Then, the right side of inequation becomes 79.
        #therefore, EVEN in a case where the very-first-discovered candidate (i.e. the one at cand_index[0]) 
        #is located at 79, there is still one subsequence in the block that the candidate should be compared with.
        #And, that is the last subsequence of the block locarted at index 199.
        #(because 199-79=120 which is the exclusion zone!)


View entire conversation on ReviewNB

Copy link
Collaborator Author

the variable is changed from last_no_conflict to first_with_conflict to avoid the confusion. (Additional note: it is possible that all so-far-discovered candidates have no conflict with the subsequences of the current block. In that case, the value of first_with_conflict is the same of last_cmp + 1 )

The code is changed accordingly.

Also, in line #70, np.where(cand_index[0] ... is corrected and changed to np.where(cand_index ... 


View entire conversation on ReviewNB

Copy link
Collaborator Author

if cand_index[0] > i - excl_zone:

          first_with_conflict = 0

else:

          # Previously, in if condition above, we checked to see if the very-first-discovered candidate is 

          #located at an index after i - excl_zone. If yes, that means ALL candidates are going to have conflict 

          #with the subsequences of the current block. So, the first_with_conflict candidate...

          #is the very-first-discovered candidate!!!

           

          # ELSE: 

          #that means there exist some candidates that do not have conflict with this current block. So,

          # to speed up computation, we can perform the matrix product between those candidates and

          # ALL the subsequences of the current block!!! 

           

          # the following line finds the very first candidate that has conflict with the current block! 

          first_with_conflict = np.where(cand_index <= i - excl_zone)[0][-1] + 1

           

          #so, all the discovered candidates in cand[:first_with_conflict, :] has no conflict with the block.


View entire conversation on ReviewNB

Copy link
Collaborator Author

if block_size > excl_zone:

#i.e. we are allowed to do within-block comparison

      #if not, that means any two subsequences of the block (i.e. their starting index is in the block) are 

      #going to have conflict. Therefore, this block will be always skipped! The only downside now is that

      #more candidates are going to be returned at the end of this phase! 

      #Therefore, we can set block_size higher than excl_zone to make sure it considers the within-block comparison 

      #to prune some candidates. 

       

      #( NOTE: 

      # let's consider an extreme case where block-size is the len of time series. 

      # In this case, we are not getting that much advantage from the block-by-block approach. 

      # The idea behind the block-by-block approach is to narrow down the pair-wise comparison to 

      # the so-far-discovered candidates and the sequences (of each block) as we move forward in the time series.

      # So, what is the optimal block_size? one can think about it later.

      #)


View entire conversation on ReviewNB

Copy link
Collaborator Author

the very first comparison is between the subsequence at index i (the starting index of block)  and the one at index i+excl_zone. Alternatively, here, we first start at subsequence "i+excl_zone" in the block  and compare it with the preceding subsequences 

         


View entire conversation on ReviewNB

Copy link
Collaborator Author

Already resolved and answered in a similar comment.


View entire conversation on ReviewNB

@seanlaw
Copy link
Contributor

seanlaw commented Dec 18, 2021

Also, I replied to each of your comments separately using its corresponding reply section. However, the whole conversation is not being appeared here!!!

Did you click "Resolved"? I don't see the conversations either :( I wonder if it is best to let me decide whether the comment is resolved (or just leave them as unresolved as long as possible)

Actually, I think what is happening is that the comments/replies for the notebook follows the notebook version. Each time you push a new version, reviewNB will display the newest version of the notebook (which will have no comments since it is "new"). So, rather than continually pushing little changes one at a time, you should commit your changes to your local branch (so I do not see them) and do not push those changes to Github until ALL of the comments are addressed. I think that this will help reduce the amount of noise and confusion.

Copy link
Collaborator Author

we are trying to see if we are "allowed" to compare the so-far-discorverd candidates (i.e. the ones found among indices [0,i]) with "at least" one subsequence from the current block! Please note that cand_index[0] is ALWAYS less than i; because, so far, we just searched indices [0, i]. So, the condition might seem to be always true. However, what if excl_zone>block_size? see example below:

      

EXAMPLE:

let's say we are now at a block that starts at i=100, and block_size=100, and excl_zone=120!  Then, the right side of inequation becomes 79. Therefore, EVEN in a case where the very-first-discovered candidate (i.e. the one at cand_index[0]) is located at 79, there is still one subsequence in the block that the candidate should be compared with. And, that subsequence is the last subsequence of the block located at index 199.


View entire conversation on ReviewNB

Copy link
Collaborator Author

NimaSarajpoor commented Dec 18, 2021

Also, I replied to each of your comments separately using its corresponding reply section. However, the whole conversation is not being appeared here!!!

Did you click "Resolved"? I don't see the conversations either :(

I just checked out all the comments again and clicked on "Resolved" button.

@seanlaw
Copy link
Contributor

seanlaw commented Dec 18, 2021

@ninimama Actually, I think what is happening is that the comments/replies for the notebook follows the notebook version. Each time you push a new version, reviewNB will display the newest version of the notebook (which will have no comments since it is "new"). So, rather than continually pushing little changes one at a time, you should commit your changes to your local branch (so I do not see them) and do not push those changes to Github until ALL of the comments are addressed. I think that this will help reduce the amount of noise and confusion.

This is starting to be unmanageable and overwhelming. Would you mind starting a new PR (for phase I) and close this one? And, when you are ready for review, stop making changes until we've agreed that you have addressed all of my questions. This way, each version of the notebook is in a "resolved state". Unfortunately, you may need to repeat your answers. How does that sound?

@udf2457
Copy link

udf2457 commented Dec 19, 2021

@seanlaw Yes, my inbox has been swamped with notifications. I decided against commenting about at the time, but it would be nice if the volume could be quietened down a little. ;-)

@NimaSarajpoor
Copy link
Collaborator Author

@seanlaw
I was about to say exactly the same thing. I think we should better start a new PR. I am going to close it now. And, I will not push any changes until we resolve all the issues/comments at each step of the reviewing process.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants