Skip to content

Fixed #117 and #130#142

Merged
seanlaw merged 2 commits intostumpy-dev:masterfrom
mihailescum:fix_mstump_ed_nan_inf
Mar 16, 2020
Merged

Fixed #117 and #130#142
seanlaw merged 2 commits intostumpy-dev:masterfrom
mihailescum:fix_mstump_ed_nan_inf

Conversation

@mihailescum
Copy link
Copy Markdown
Contributor

@mihailescum mihailescum commented Mar 13, 2020

This PR fixes the handling of nans in mstump and mstumped (issues #117 and #130 ). It also contains a bit of refactoring in order to clean the code (issue #135 )

The refactoring was done to improve readability and to have the internals more or less identical to stump.

@seanlaw
Copy link
Copy Markdown
Contributor

seanlaw commented Mar 13, 2020

I'm not sure why the tests aren't being ran here.



def _get_first_mstump_profile(start, T, m, excl_zone, M_T, Σ_T):
def _get_first_mstump_profile(start, T_A, T_B, m, excl_zone, M_T, Σ_T):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implies that we are prepared to perform AB-joins in addition to self-joins?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the first mstump profile yes, then we can directly use this function.

Currently I implemented it here to implement nan/inf handling the same way as in stump. There (for the first subsequence) all nans in T_A are replaced with zeros, while T_B still contains them, because otherwise there is no way to pass this information to core.mass.

To actually support AB-joins, there also have to be done some changes in _stump itself that are currently a bit out of range.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Frankly, I'm not sure what an AB-join for multi-dimensional matrix profile actually means. I don't remember the original authors discussing it and I am concerned with supporting this for users if we haven't thought through it.

To actually support AB-joins, there also have to be done some changes in _stump itself that are currently a bit out of range.

I'm not sure I understand, technically, _stump already supports AB-joins. Or, do you mean AB-joins that contain nan/inf in both sequences?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I think I am okay with this since it is a private function

Comment on lines 107 to 108
zone_start = max(0, start - excl_zone)
zone_stop = min(n - m + 1, start + excl_zone)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For AB-joins, this is not necessary. This is only needed for self-joins

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but I did not plan to support AB-joins at the moment.

hosts = list(dask_client.ncores().keys())
nworkers = len(hosts)
T_A = np.asarray(core.transpose_dataframe(T)).copy()
T_B = T_A.copy()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit confused as to why we need T_B? It doesn't look like its being used for anything

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to implement nan/inf handling the same way as we did in stump. There you first calculate the mean/stddev and for each subsequence to ignore you set the mean to inf. Then you set T_A=0 everywhere, where it was nan. Then you calculate the first matrix profile value, and here T_B can still be nan, because this information has to be passed to core.mass. Afterwards you set T_B to zero everywhere where it was nan and pass those two altered timeserieses to _stump, where the decision if one subsequence should be ignored is made only by looking at the mean and checking if it is inf.

So actually to support AB-joins we only need to pass T_B to _mstump and handle it from there, but I skipped this as this is a completely different issue.

Basically, at the moment T_B is only used to correctly calculate the first matrix profile value.


M_T, Σ_T = core.compute_mean_std(T, m)
right_P, right_I = _get_first_mstump_profile(start, T, m, excl_zone, M_T, Σ_T)
right_P, right_I = _get_first_mstump_profile(start, T, T, m, excl_zone, M_T, Σ_T)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems weird. I feel like _get_first_mstump_profile should detect whether one or two time series are being passed to it

Copy link
Copy Markdown
Contributor Author

@mihailescum mihailescum Mar 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stump._get_first_stump_profile doesn't, so I kept it this way in mstump. The detection is done in the main stump function.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it! I overlooked the fact that:

  1. This is a private function
  2. This sets us up in case we decide to allow AB-joins in the future (currently, this is blocked/omitted in the public function)

@seanlaw seanlaw changed the title Fixed issues #117 and #130 Fixed #117 and #130 Mar 13, 2020
@seanlaw
Copy link
Copy Markdown
Contributor

seanlaw commented Mar 13, 2020

When I run the tests locally with the master branch (prior to this PR), I notice that the tests are consistently getting stuck on tests/test_stumped_two_constant_subsequences.py. This happens on Azure Pipelines as well. We may need to break this down further again.

@seanlaw
Copy link
Copy Markdown
Contributor

seanlaw commented Mar 14, 2020

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@seanlaw
Copy link
Copy Markdown
Contributor

seanlaw commented Mar 16, 2020

@mexxexx Thank you for completing this PR! This is another great step in getting STUMPY to the next level. I really appreciate it.

Additionally, given the current global situation, I hope that you and your loved ones are staying safe out there.

@seanlaw
Copy link
Copy Markdown
Contributor

seanlaw commented Mar 16, 2020

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@mihailescum
Copy link
Copy Markdown
Contributor Author

Thank you very much, I really appreciate it. I wish you all the best too!

@seanlaw
Copy link
Copy Markdown
Contributor

seanlaw commented Mar 16, 2020

Btw, there was indeed an issue with Azure Pipelines (they changed some things that caused CI to not get triggered). I ended up adding a couple of new lines to the azure-pipelines.yml file. So, when this finishes testing and gets merged, please remember to pull the latest code.

@codecov-io
Copy link
Copy Markdown

Codecov Report

Merging #142 into master will not change coverage by %.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##            master      #142   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           12        12           
  Lines          842       836    -6     
=========================================
- Hits           842       836    -6     
Impacted Files Coverage Δ
stumpy/stump.py 100.00% <ø> (ø)
stumpy/mstump.py 100.00% <100.00%> (ø)
stumpy/mstumped.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 334e0d2...45ebd16. Read the comment docs.

@seanlaw seanlaw merged commit 514a1c4 into stumpy-dev:master Mar 16, 2020
@mihailescum mihailescum deleted the fix_mstump_ed_nan_inf branch March 18, 2020 15:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants