Add new multivariate dataset to pipeline and test in notebook #72
Merged
55 commits
c04661a rename special variable input to input_vars (maxifischer)
11d85bb shift binarized values by len_out-len_in (maxifischer)
609499e lower threshold (maxifischer)
d5163d9 Merge branch 'master' into fix/lstmad_threshold (WGierke)
11f8fad Remove scaling factor
d65c46e Fix padding
1cf13fa Merge remote-tracking branch 'origin/master' into fix/lstmad_threshold
a165731 fix binarize function, adjusted threshold (maxifischer)
4bb1294 add LSTMAD to CircleCI (maxifischer)
754fc45 Merge branch 'fix/lstmad_threshold' of https://github.com/KDD-OpenSou… (maxifischer)
5efad36 add get_optimal_threshold function to call in benchmarks binarize (maxifischer)
3853d15 Merge branch 'master' into feature/dynamic_thresholds (maxifischer)
3988733 fix build (maxifischer)
e584e73 rename, refactor threshold plots (maxifischer)
20b7864 rename th (maxifischer)
2c62a85 lint (maxifischer)
a382a88 lint more (maxifischer)
d822e81 Merge branch 'master' into feature/dynamic_thresholds (maxifischer)
7f1d7bb add differing extreme outlier experiment (maxifischer)
aabc9c1 merge master & add extremeness experiment (maxifischer)
46bd3e0 fix merge conflicts (maxifischer)
e9a4d16 lint (maxifischer)
149d6fb Merge branch 'feature/dynamic_thresholds' into extremeness_experiment (maxifischer)
873e229 Add new multivariate dataset to pipeline and test in notebook (Chaoste)
8dde2db Merge remote-tracking branch 'origin/master' into feature-40/func-dep… (Chaoste)
61083c6 merge master (maxifischer)
1912fe0 Merge branch 'master' into extremeness_experiment (maxifischer)
8907d85 lint again (maxifischer)
c34e8a7 Merge branch 'extremeness_experiment' into feature-40/func-dependencies (maxifischer)
1d4ef23 change return value to DataFrame (maxifischer)
df7cca6 comments, renaming etc (maxifischer)
47c7acd Merge remote-tracking branch 'origin/master' into feature-40/func-dep… (Chaoste)
90a1e27 add negative amplitudes (maxifischer)
37e9c04 Store notebook (Chaoste)
2e1f139 return value (Chaoste)
10062ef Merge branch 'feature-40/func-dependencies' of https://github.com/KDD… (Chaoste)
2d07d71 Refactor mutlivariate dataset and add new dim2 functions (Chaoste)
f2170a5 Lint dataset class (Chaoste)
0de017b Fix merge conflicts in main py (Chaoste)
13e1cd8 Add param for pause length (Chaoste)
632df15 Prettify multivariate dataset before adding more features (Chaoste)
c0919cb Working MUltivariate anomaly timeseries (Chaoste)
d246bde Plots figures in notebook, update code (Chaoste)
43bcdd6 flake (Chaoste)
03c8fa7 PR Review (Chaoste)
90f174f unused parameter to _ (maxifischer)
7ba7379 renaming (maxifischer)
f57569b refactored anomaly functions in new class & created experiment run wi… (maxifischer)
e8c4773 comment run_experiments in (maxifischer)
d52df1f merge master (maxifischer)
58b9441 flake8 (maxifischer)
1b75b00 clean run_pipeline and comment in (maxifischer)
6ad1bb9 refactor run_experiments into main, move experiments to base dir (maxifischer)
096aeef add CircleCI option for experiments (maxifischer)
9cd3da1 flake8 (maxifischer)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,65 @@ | ||
from .synthetic_multivariate_dataset import SyntheticMultivariateDataset | ||
import numpy as np | ||
|
||
|
||
class MultivariateAnomalyFunction: | ||
    # ----- Functions generating the anomalous dimension --------- # | ||
    # A MultivariateAnomalyFunction should return a tuple containing the following three values: | ||
    # * The values of the second dimension (array of at most `interval_length` numbers) | ||
    # * Start index of the anomalous section | ||
    # * End index (exclusive) of the anomalous section | ||
    # The last two values are ignored when generating non-anomalous data | ||
|
||
    # Get a dataset by passing the method name as string. All following parameters | ||
    # are passed through to the dataset. Raises AttributeError if the method was not found. | ||
    @staticmethod | ||
    def get_multivariate_dataset(method, *args, **kwargs): | ||
        func = getattr(MultivariateAnomalyFunction, method) | ||
        return SyntheticMultivariateDataset(func, f'Synthetic Multivariate {method} Curve Outliers', *args, **kwargs) | ||
|
||
    @staticmethod | ||
    def doubled(curve_values, anomalous, _): | ||
        factor = 4 if anomalous else 2 | ||
        return curve_values * factor, 0, len(curve_values) | ||
|
||
    @staticmethod | ||
    def inversed(curve_values, anomalous, _): | ||
        factor = -2 if anomalous else 2 | ||
        return curve_values * factor, 0, len(curve_values) | ||
|
||
    @staticmethod | ||
    def shrinked(curve_values, anomalous, _): | ||
        if not anomalous: | ||
            return curve_values, -1, -1 | ||
        else: | ||
            new_curve = curve_values[::2] | ||
            nonce = np.zeros(len(curve_values) - len(new_curve)) | ||
            values = np.concatenate([nonce, new_curve]) | ||
            return values, 0, len(values) | ||
|
||
    @staticmethod | ||
    def delayed(curve_values, anomalous, _): | ||
        if not anomalous: | ||
            return curve_values, -1, -1 | ||
        else: | ||
            # The curve in the second dimension occurs a few timestamps later | ||
            nonce = np.zeros(len(curve_values) // 10) | ||
            values = np.concatenate([nonce, curve_values]) | ||
            return values, 0, len(values) | ||
|
||
    @staticmethod | ||
    def xor(curve_values, anomalous, interval_length): | ||
        orig_amplitude = max(abs(curve_values)) | ||
        orig_amplitude *= np.sign(curve_values.mean()) | ||
        pause_length = interval_length - len(curve_values) | ||
        if not anomalous: | ||
            # No curve in the 2nd dimension while the 1st dimension shows its curve | ||
            nonce = np.zeros(len(curve_values)) | ||
            # Insert a curve with the same amplitude during the pause of the 1st dimension | ||
            new_curve = SyntheticMultivariateDataset.get_curve(pause_length, orig_amplitude) | ||
            return np.concatenate([nonce, new_curve]), -1, -1 | ||
        else: | ||
            # Anomaly: curves overlap (at the same time or at least half overlapping) | ||
            max_pause = min(len(curve_values) // 2, pause_length) | ||
            nonce = np.zeros(np.random.randint(max_pause)) | ||
            return np.concatenate([nonce, curve_values]), len(nonce), len(nonce) + len(curve_values) |
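
To make the return contract of the anomaly functions concrete, here is a minimal standalone sketch of `delayed`, copied out of the class so it runs without the package. The sample curve and the `interval_length` of 60 are made-up values for illustration, not taken from the PR.

```python
import numpy as np


def delayed(curve_values, anomalous, _interval_length):
    """Standalone copy of the `delayed` contract from the diff above."""
    if not anomalous:
        # Normal case: the second dimension mirrors the first, no anomaly span.
        return curve_values, -1, -1
    # Anomalous case: prepend zeros so the curve starts about 10% later.
    shift = np.zeros(len(curve_values) // 10)
    values = np.concatenate([shift, curve_values])
    return values, 0, len(values)


curve = np.linspace(0.0, 1.0, 40)  # stand-in for a generated curve (assumption)
interval_length = 60               # curve plus the pause that follows it (assumption)

normal_values, normal_start, normal_end = delayed(curve, False, interval_length)
anomalous_values, start, end = delayed(curve, True, interval_length)

print(len(normal_values), normal_start, normal_end)  # 40 -1 -1
print(len(anomalous_values), start, end)             # 44 0 44
```

The returned array may be shorter than `interval_length`; the dataset class fills the remainder with noisy zeros.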
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,125 @@ | ||
from typing import Tuple, Callable | ||
|
||
import numpy as np | ||
import pandas as pd | ||
from . import Dataset | ||
|
||
|
||
class SyntheticMultivariateDataset(Dataset): | ||
    def __init__(self, | ||
                 # anomaly_func: Callable returning the curve values of the 2nd dimension | ||
                 anomaly_func: Callable[[np.ndarray, bool, int], Tuple[np.ndarray, int, int]], | ||
                 name: str = 'Synthetic Multivariate Curve Outliers', | ||
                 length: int = 5000, | ||
                 mean_curve_length: int = 40,  # actual length deviates by -5 to +5 | ||
                 mean_curve_amplitude: int = 1,  # By default the magnitude varies between 0.5 and 1.5, the sign is random | ||
                 pause_range: Tuple[int, int] = (5, 75),  # min and max length of a pause | ||
                 labels_padding: int = 6, | ||
                 random_seed: int = 42, | ||
                 features: int = 2, | ||
                 file_name: str = 'synthetic_mv1.pkl'): | ||
        super().__init__(name, file_name) | ||
        self.length = length | ||
        self.mean_curve_length = mean_curve_length | ||
        self.mean_curve_amplitude = mean_curve_amplitude | ||
        self.global_noise = 0.1  # Noise added to all dimensions over the whole timeseries | ||
        self.anomaly_func = anomaly_func | ||
        self.pause_range = pause_range | ||
        self.labels_padding = labels_padding | ||
        self.random_seed = random_seed | ||
        self.features = features | ||
|
||
    @staticmethod | ||
    def get_noisy_value(x, strength=1): | ||
        return x + np.random.random(np.shape(x)) * strength - strength / 2 | ||
|
||
    # Use part of a sine wave to create a curve starting and ending with zero gradients. | ||
    # Using `length` and `amplitude` you can adjust it in both dimensions. | ||
    @staticmethod | ||
    def get_curve(length, amplitude): | ||
        # Transformed sine curve: [-1, 1] -> [0, amplitude] | ||
        def curve(t: float): | ||
            return amplitude * (np.sin(t)/2 + 0.5) | ||
        # Start and end of one curve section of the sine wave | ||
        from_ = 1.5 * np.pi | ||
        to_ = 3.5 * np.pi | ||
        return np.array([curve(t) for t in np.linspace(from_, to_, length)]) | ||
|
||
    # Randomly adjust curve size by adding noise to the passed parameters (length is truncated to an int for np.linspace) | ||
    def get_random_curve(self, length_randomness=10, amplitude_randomness=1): | ||
        is_negative = np.random.choice([True, False]) | ||
        sign = -1 if is_negative else 1 | ||
        new_length = int(self.get_noisy_value(self.mean_curve_length, length_randomness)) | ||
        new_amplitude = self.get_noisy_value(sign * self.mean_curve_amplitude, amplitude_randomness) | ||
        return self.get_curve(new_length, new_amplitude) | ||
|
||
    # The interval between two curves must be random so a detector doesn't recognize a pattern | ||
    def create_pause(self): | ||
        xmin, xmax = self.pause_range | ||
        diff = xmax - xmin | ||
        return xmin + np.random.randint(diff) | ||
|
||
    def add_global_noise(self, x): | ||
        return self.get_noisy_value(x, self.global_noise) | ||
|
||
    """ | ||
    pollution: Fraction of anomalous curves. The total number of curves is not known in advance, | ||
    so each curve is made anomalous at random with this probability. Set it to zero to avoid anomalies. | ||
    """ | ||
    def generate_data(self, pollution=0.5): | ||
        values = np.zeros((self.length, self.features)) | ||
        labels = np.zeros(self.length) | ||
        pos = self.create_pause() | ||
|
||
        # First pos data points are noise (don't start directly with curve) | ||
        values[:pos] = self.add_global_noise(values[:pos]) | ||
|
||
        while pos < self.length - self.mean_curve_length - 20: | ||
            # General outline for the repeating curves, varying height and length | ||
            curve = self.get_random_curve() | ||
            # Outlier generation in second dimension | ||
            create_anomaly = np.random.choice([False, True], p=[1-pollution, pollution]) | ||
            # After curve add pause, only noise | ||
            end_of_interval = pos + len(curve) + self.create_pause() | ||
            self.insert_features(values[pos:end_of_interval], labels[pos:end_of_interval], curve, create_anomaly) | ||
            pos = end_of_interval | ||
        # The rest of the values is noise | ||
        values[pos:] = self.add_global_noise(values[pos:]) | ||
        return pd.DataFrame(values), pd.Series(labels) | ||
|
||
    """ | ||
    Insert values for curve and following pause over all dimensions. | ||
    interval_values is modified in place, so this function doesn't return anything. | ||
    (NumPy slices are views, so slice assignment writes through to the caller's array) | ||
|
||
    """ | ||
    def insert_features(self, interval_values: np.ndarray, interval_labels: np.ndarray, | ||
                        curve: np.ndarray, create_anomaly: bool): | ||
        assert self.features == 2, 'Only two features are supported right now!' | ||
|
||
        # Insert curve and pause in first dimension (after adding the global noise) | ||
        interval_values[:len(curve), 0] = self.add_global_noise(curve) | ||
        interval_values[len(curve):, 0] = self.add_global_noise(interval_values[len(curve):, 0]) | ||
|
||
        # Get values of anomaly_func and fill missing spots with noise | ||
        # anomaly_func gets the clean curve values (not noisy) | ||
        interval_length = interval_values.shape[0] | ||
        anomaly_values, start, end = self.anomaly_func(curve, create_anomaly, interval_length) | ||
        assert len(anomaly_values) <= interval_length, f'Interval too long: {len(anomaly_values)} > {interval_length}' | ||
|
||
        interval_values[:len(anomaly_values), 1] = self.add_global_noise(anomaly_values) | ||
        # Fill interval up with noisy zero values | ||
        interval_values[len(anomaly_values):, 1] = self.add_global_noise(interval_values[len(anomaly_values):, 1]) | ||
|
||
        # Add anomaly labels with slight padding (don't start with the first interval value). | ||
        # The padding is (end - start) // labels_padding on each side | ||
        if create_anomaly: | ||
            assert end > start and start >= 0, f'Invalid anomaly indizes: {start} to {end}' | ||
            padding = (end - start) // self.labels_padding | ||
            interval_labels[start+padding:end-padding] += 1 | ||
|
||
    def load(self): | ||
        np.random.seed(self.random_seed) | ||
        X_train, y_train = self.generate_data(pollution=0) | ||
        X_test, y_test = self.generate_data(pollution=0.5) | ||
        self._data = X_train, y_train, X_test, y_test |
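
As a rough illustration of two mechanics in this file, the sketch below reproduces the sine-based curve shape and the padded labelling outside the class. `padded_labels` is a hypothetical helper written for this example (in the PR the logic lives inside `insert_features`), and the concrete numbers are assumptions chosen for illustration.

```python
import numpy as np


def get_curve(length, amplitude):
    # Sample sin(t) from 1.5*pi to 3.5*pi and rescale [-1, 1] -> [0, amplitude].
    # The segment starts and ends at a sine minimum, so the curve begins and
    # finishes at zero with a flat gradient.
    t = np.linspace(1.5 * np.pi, 3.5 * np.pi, int(length))
    return amplitude * (np.sin(t) / 2 + 0.5)


def padded_labels(interval_length, start, end, labels_padding=6):
    # Hypothetical helper: mark [start, end) as anomalous, trimmed on both
    # sides by (end - start) // labels_padding points.
    labels = np.zeros(interval_length)
    padding = (end - start) // labels_padding
    labels[start + padding:end - padding] = 1
    return labels


curve = get_curve(40, 2.0)
labels = padded_labels(interval_length=60, start=0, end=44)

print(curve[0].round(6), curve.max().round(3))  # starts at 0, peaks just below 2
print(int(labels.sum()))                        # 30 labelled points (indices 7..36)
```

Trimming the labelled span keeps the flat start and end of a curve, where an anomalous curve is hard to tell apart from noise, out of the positive class.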
You can simplify that to `assert end > start >= 0`
Also: "indizes" should be spelled "indices"
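
The suggested simplification relies on Python's chained comparisons, where `end > start >= 0` is evaluated as `end > start and start >= 0`. A small sketch comparing both forms follows; the two helper functions and the test cases are made up for this illustration and are not part of the PR.

```python
def is_valid_span_long(start, end):
    # Form used in the diff.
    return end > start and start >= 0


def is_valid_span_chained(start, end):
    # Form suggested in the review.
    return end > start >= 0


# (start, end) pairs: a valid span, an empty span, the "no anomaly" sentinel,
# a negative start, and a reversed span.
cases = [(0, 44), (3, 3), (-1, -1), (-1, 5), (5, 2)]
results = [(is_valid_span_long(s, e), is_valid_span_chained(s, e)) for s, e in cases]

for (s, e), (long_form, chained) in zip(cases, results):
    print(s, e, long_form, chained)
```

Both forms agree on every case, and only `(0, 44)` is accepted.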