
Add new multivariate dataset to pipeline and test in notebook #72

Merged · 55 commits · Jun 20, 2018
c04661a
rename special variable input to input_vars
maxifischer Jun 5, 2018
11d85bb
shift binarized values by len_out-len_in
maxifischer Jun 5, 2018
609499e
lower threshold
maxifischer Jun 5, 2018
d5163d9
Merge branch 'master' into fix/lstmad_threshold
WGierke Jun 7, 2018
11f8fad
Remove scaling factor
Jun 7, 2018
d65c46e
Fix padding
Jun 7, 2018
1cf13fa
Merge remote-tracking branch 'origin/master' into fix/lstmad_threshold
Jun 7, 2018
a165731
fix binarize function, adjusted threshold
maxifischer Jun 7, 2018
4bb1294
add LSTMAD to CircleCI
maxifischer Jun 7, 2018
754fc45
Merge branch 'fix/lstmad_threshold' of https://github.com/KDD-OpenSou…
maxifischer Jun 7, 2018
5efad36
add get_optimal_threshold function to call in benchmarks binarize
maxifischer Jun 7, 2018
3853d15
Merge branch 'master' into feature/dynamic_thresholds
maxifischer Jun 7, 2018
3988733
fix build
maxifischer Jun 7, 2018
e584e73
rename, refactor threshold plots
maxifischer Jun 7, 2018
20b7864
rename th
maxifischer Jun 8, 2018
2c62a85
lint
maxifischer Jun 8, 2018
a382a88
lint more
maxifischer Jun 8, 2018
d822e81
Merge branch 'master' into feature/dynamic_thresholds
maxifischer Jun 12, 2018
7f1d7bb
add differing extreme outlier experiment
maxifischer Jun 12, 2018
aabc9c1
merge master & add extremeness experiment
maxifischer Jun 12, 2018
46bd3e0
fix merge conflicts
maxifischer Jun 12, 2018
e9a4d16
lint
maxifischer Jun 12, 2018
149d6fb
Merge branch 'feature/dynamic_thresholds' into extremeness_experiment
maxifischer Jun 12, 2018
873e229
Add new multivariate dataset to pipeline and test in notebook
Chaoste Jun 12, 2018
8dde2db
Merge remote-tracking branch 'origin/master' into feature-40/func-dep…
Chaoste Jun 12, 2018
61083c6
merge master
maxifischer Jun 12, 2018
1912fe0
Merge branch 'master' into extremeness_experiment
maxifischer Jun 12, 2018
8907d85
lint again
maxifischer Jun 12, 2018
c34e8a7
Merge branch 'extremeness_experiment' into feature-40/func-dependencies
maxifischer Jun 12, 2018
1d4ef23
change return value to DataFrame
maxifischer Jun 12, 2018
df7cca6
comments, renaming etc
maxifischer Jun 14, 2018
47c7acd
Merge remote-tracking branch 'origin/master' into feature-40/func-dep…
Chaoste Jun 14, 2018
90a1e27
add negative amplitudes
maxifischer Jun 14, 2018
37e9c04
Store notebook
Chaoste Jun 14, 2018
2e1f139
return value
Chaoste Jun 14, 2018
10062ef
Merge branch 'feature-40/func-dependencies' of https://github.com/KDD…
Chaoste Jun 14, 2018
2d07d71
Refactor mutlivariate dataset and add new dim2 functions
Chaoste Jun 15, 2018
f2170a5
Lint dataset class
Chaoste Jun 15, 2018
0de017b
Fix merge conflicts in main py
Chaoste Jun 15, 2018
13e1cd8
Add param for pause length
Chaoste Jun 15, 2018
632df15
Prettify multivariate dataset before adding more features
Chaoste Jun 16, 2018
c0919cb
Working MUltivariate anomaly timeseries
Chaoste Jun 18, 2018
d246bde
Plots figures in notebook, update code
Chaoste Jun 18, 2018
43bcdd6
flake
Chaoste Jun 18, 2018
03c8fa7
PR Review
Chaoste Jun 18, 2018
90f174f
unused parameter to _
maxifischer Jun 19, 2018
7ba7379
renaming
maxifischer Jun 19, 2018
f57569b
refactored anomaly functions in new class & created experiment run wi…
maxifischer Jun 19, 2018
e8c4773
comment run_experiments in
maxifischer Jun 20, 2018
d52df1f
merge master
maxifischer Jun 20, 2018
58b9441
flake8
maxifischer Jun 20, 2018
1b75b00
clean run_pipeline and comment in
maxifischer Jun 20, 2018
6ad1bb9
refactor run_experiments into main, move experiments to base dir
maxifischer Jun 20, 2018
096aeef
add CircleCI option for experiments
maxifischer Jun 20, 2018
9cd3da1
flake8
maxifischer Jun 20, 2018
159 changes: 159 additions & 0 deletions notebooks/3.0-tk-data-generation.ipynb

Large diffs are not rendered by default.

181 changes: 181 additions & 0 deletions src/datasets/synthetic_multivariate_dataset.py
@@ -0,0 +1,181 @@
from typing import Tuple, Callable

import numpy as np
import pandas as pd

from .dataset import Dataset


def get_noisy_value(x, strength=1):
    return x + np.random.random(np.shape(x)) * strength - strength / 2
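As a quick sanity check (not part of the diff), the helper adds uniform noise that stays within ±`strength`/2 of the input:

```python
import numpy as np

def get_noisy_value(x, strength=1):
    # Uniform noise in [-strength/2, strength/2) added elementwise
    return x + np.random.random(np.shape(x)) * strength - strength / 2

np.random.seed(0)
noisy = get_noisy_value(np.full(1000, 5.0), strength=0.2)
print(noisy.min() >= 4.9, noisy.max() < 5.1)  # → True True
```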


# Use part of a sine wave to create a curve starting and ending with zero gradient.
# Using `length` and `amplitude` you can adjust it in both dimensions.
def get_curve(length, amplitude):
    # Transformed sine curve: [-1, 1] -> [0, amplitude]
    def curve(t: int):
        return amplitude * (np.sin(t) / 2 + 0.5)
    # Start and end of one curve section of the sine wave
    from_ = 1.5 * np.pi
    to_ = 3.5 * np.pi
    return np.array([curve(t) for t in np.linspace(from_, to_, length)])
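Because the sampling window runs from 1.5π to 3.5π, the resulting curve starts and ends at zero and peaks at exactly `amplitude` in the middle. A small standalone check (not part of the diff):

```python
import numpy as np

def get_curve(length, amplitude):
    # Transformed sine curve sampled on [1.5*pi, 3.5*pi]
    def curve(t):
        return amplitude * (np.sin(t) / 2 + 0.5)
    return np.array([curve(t) for t in np.linspace(1.5 * np.pi, 3.5 * np.pi, length)])

c = get_curve(41, 2.0)
# starts and ends at zero, peaks at the amplitude in the middle
print(np.isclose(c[0], 0), np.isclose(c[-1], 0), np.isclose(c.max(), 2.0))  # → True True True
```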


# ----- Functions generating the second dimension --------- #
# A dim2 function should return a tuple containing the following three values:
# * The values of the second dimension (array of at most `interval_length` numbers)
# * Starting point of the anomaly
# * End point of the anomaly section
# The last two values are ignored when generating non-anomalous data


def doubled_dim2(curve_values, anomalous, interval_length):

> Review comment (Contributor): In the next 4 methods you're not using `interval_length`. I think it'd make sense to rename it to `_` to indicate that it is not used.

    factor = 4 if anomalous else 2
    return curve_values * factor, 0, len(curve_values)


def inversed_dim2(curve_values, anomalous, interval_length):
    factor = -2 if anomalous else 2
    return curve_values * factor, 0, len(curve_values)


def shrinked_dim2(curve_values, anomalous, interval_length):
    if not anomalous:
        return curve_values, -1, -1
    else:
        new_curve = curve_values[::2]
        nonce = np.zeros(len(curve_values) - len(new_curve))
        values = np.concatenate([nonce, new_curve])
        return values, 0, len(values)


def delayed_dim2(curve_values, anomalous, interval_length):
    if not anomalous:
        return curve_values, -1, -1
    else:
        # The curve in the second dimension occures a few timestamps later

> Review comment (Contributor): occurs

        nonce = np.zeros(len(curve_values) // 10)
        values = np.concatenate([nonce, curve_values])
        return values, 0, len(values)


def xor_dim2(curve_values, anomalous, interval_length):
    orig_amplitude = max(abs(curve_values))
    orig_amplitude *= np.sign(curve_values.mean())
    pause_length = interval_length - len(curve_values)
    if not anomalous:
        # No curve during the other curve in the 1st dimension
        nonce = np.zeros(len(curve_values))
        # Insert a curve with the same amplitude during the pause of the 1st dimension
        new_curve = get_curve(pause_length, orig_amplitude)
        return np.concatenate([nonce, new_curve]), -1, -1
    else:
        # Anomaly: curves overlap (at the same time or at least half overlapping)
        max_pause = min(len(curve_values) // 2, pause_length)
        nonce = np.zeros(np.random.randint(max_pause))
        return np.concatenate([nonce, curve_values]), len(nonce), len(nonce) + len(curve_values)
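The dim2 contract (values, anomaly start, anomaly end) can be exercised in isolation with the simplest variant, `doubled_dim2` — a sketch, not part of the diff:

```python
import numpy as np

def get_curve(length, amplitude):
    # Same sine-based curve as in the dataset module
    def curve(t):
        return amplitude * (np.sin(t) / 2 + 0.5)
    return np.array([curve(t) for t in np.linspace(1.5 * np.pi, 3.5 * np.pi, length)])

def doubled_dim2(curve_values, anomalous, interval_length):
    # Anomalous curves are doubled again relative to the normal case
    factor = 4 if anomalous else 2
    return curve_values * factor, 0, len(curve_values)

curve = get_curve(40, 1.0)
values, start, end = doubled_dim2(curve, anomalous=True, interval_length=100)
print(start, end, np.allclose(values, curve * 4))  # → 0 40 True
```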


class SyntheticMultivariateDataset(Dataset):

    def __init__(self, name: str = 'Synthetic Multivariate Curve Outliers',
                 length: int = 5000,
                 mean_curve_length: int = 40,  # varies between -5 and +5
                 mean_curve_amplitude: int = 1,  # By default varies between -0.5 and 1.5
                 # dim2: Lambda for curve values of 2nd dimension
                 dim2: Callable[[np.ndarray, bool, int], Tuple[np.ndarray, int, int]] = doubled_dim2,
                 pause_range: Tuple[int, int] = (5, 75),  # min and max value for a pause
                 labels_padding: int = 6,
                 random_seed: int = 42,
                 features: int = 2,
                 file_name: str = 'synthetic_mv1.pkl'):
        super().__init__(name, file_name)
        self.length = length
        self.mean_curve_length = mean_curve_length
        self.mean_curve_amplitude = mean_curve_amplitude
        self.global_noise = 0.1  # Noise added to all dimensions over the whole timeseries
        self.dim2 = dim2

> Review comment (Contributor): Why "dim2"?

> Review comment (Contributor): Might be better to call it "anomaly_function". At the moment there are only 2D anomalies, another PR will solve that.

        self.pause_range = pause_range
        self.labels_padding = labels_padding
        self.random_seed = random_seed
        self.features = features

    # Randomly adjust curve size by adding noise to the passed parameters
    def get_random_curve(self, length_randomness=10, amplitude_randomness=1):
        is_negative = np.random.choice([True, False])
        sign = -1 if is_negative else 1
        new_length = get_noisy_value(self.mean_curve_length, length_randomness)
        new_amplitude = get_noisy_value(sign * self.mean_curve_amplitude, amplitude_randomness)
        return get_curve(new_length, new_amplitude)

    # The interval between two curves must be random so a detector doesn't recognize a pattern
    def create_pause(self):
        xmin, xmax = self.pause_range
        diff = xmax - xmin
        return xmin + np.random.randint(diff)

    def add_global_noise(self, x):
        return get_noisy_value(x, self.global_noise)

    """
    pollution: Portion of anomalous curves. Since the final number of curves is not
    known in advance, each curve is randomly marked anomalous with this probability.
    To avoid anomalies, set this to zero.
    """
    def generate_data(self, pollution=0.5):
        values = np.zeros((self.length, self.features))
        labels = np.zeros(self.length)
        pos = self.create_pause()

        # First pos data points are noise (don't start directly with a curve)
        values[:pos] = self.add_global_noise(values[:pos])

        while pos < self.length - self.mean_curve_length - 20:
            # General outline for the repeating curves, varying height and length
            curve = self.get_random_curve()
            # Outlier generation in second dimension
            create_anomaly = np.random.choice([False, True], p=[1 - pollution, pollution])
            # After the curve add a pause with only noise
            end_of_interval = pos + len(curve) + self.create_pause()
            self.insert_features(values[pos:end_of_interval], labels[pos:end_of_interval], curve, create_anomaly)
            pos = end_of_interval
        # The rest of the values is noise
        values[pos:] = self.add_global_noise(values[pos:])
        return pd.DataFrame(values), pd.Series(labels)

    """
    Insert values for a curve and the following pause over all dimensions.
    interval_values is changed by reference, so this function doesn't return anything
    (this is done by using numpy's place function / slice operator).
    """
    def insert_features(self, interval_values: np.ndarray, interval_labels: np.ndarray,
                        curve: np.ndarray, create_anomaly: bool):
        assert self.features == 2, 'Only two features are supported right now!'

        # Insert curve and pause in first dimension (after adding the global noise)
        interval_values[:len(curve), 0] = self.add_global_noise(curve)
        interval_values[len(curve):, 0] = self.add_global_noise(interval_values[len(curve):, 0])

        # Get values of dim2 and fill missing spots with noise.
        # The dim2 function gets the clean curve values (not noisy)
        interval_length = interval_values.shape[0]
        dim2_values, start, end = self.dim2(curve, create_anomaly, interval_length)
        assert len(dim2_values) <= interval_length, f'Interval too long: {len(dim2_values)} > {interval_length}'

        interval_values[:len(dim2_values), 1] = self.add_global_noise(dim2_values)
        # Fill the interval up with noisy zero values
        interval_values[len(dim2_values):, 1] = self.add_global_noise(interval_values[len(dim2_values):, 1])

        # Add anomaly labels with slight padding (don't start with the first interval value).
        # The padding is curve_length / padding_factor
        if create_anomaly:
            assert end > start and start >= 0, f'Invalid anomaly indizes: {start} to {end}'
> Review comment (Contributor): You can simplify that to `assert end > start >= 0`

> Review comment (Contributor): Also: indices

            padding = (end - start) // self.labels_padding
            interval_labels[start + padding:end - padding] += 1

    def load(self):
        np.random.seed(self.random_seed)
        X_train, y_train = self.generate_data(pollution=0)
        X_test, y_test = self.generate_data(pollution=0.5)
        self._data = X_train, y_train, X_test, y_test
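The label-padding step at the end of `insert_features` shrinks the labelled window by `(end - start) // labels_padding` on each side, so detectors aren't penalized for missing the very edges of an anomaly. In isolation, with hypothetical interval values (a sketch, not part of the diff):

```python
import numpy as np

labels_padding = 6           # same default as the dataset class
start, end = 10, 40          # hypothetical anomaly interval
labels = np.zeros(60)

# Shrink the labelled window by (end - start) // labels_padding on each side
padding = (end - start) // labels_padding
labels[start + padding:end - padding] += 1

print(padding, int(labels.sum()))  # → 5 20
```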
6 changes: 2 additions & 4 deletions src/evaluation/evaluator.py
@@ -163,10 +163,8 @@ def plot_threshold_comparison(self, steps=40, store=True):
         for det, ax in zip(self.detectors, axes_row):
             score = np.array(self.results[(ds.name, det.name)])

-            anomalies, _, prec, rec, f_score, f01_score, thresh = self.get_optimal_threshold(det,
-                                                                                             y_test,
-                                                                                             score,
-                                                                                             return_metrics=True)
+            anomalies, _, prec, rec, f_score, f01_score, thresh = self.get_optimal_threshold(
+                det, y_test, score, return_metrics=True)

             ax.plot(thresh, anomalies / len(y_test),
                     label=fr"anomalies ({len(y_test)} $\rightarrow$ 1)")