# Using anomaly handlers for DisjointTimeBasedCesnetDataset

### Import

In [1]:
import numpy as np
import logging

from cesnet_tszoo.utils.enums import AgreggationType, SourceType, AnomalyHandlerType, DatasetType
from cesnet_tszoo.datasets import CESNET_TimeSeries24
from cesnet_tszoo.configs import DisjointTimeBasedConfig # Disjoint dataset MUST use DisjointTimeBasedConfig

from cesnet_tszoo.utils.anomaly_handler import AnomalyHandler # For creating custom Anomaly handler

### Setting logger

In [2]:
logging.basicConfig(
    level=logging.INFO,
    format="[%(asctime)s][%(name)s][%(levelname)s] - %(message)s")

### Preparing dataset

In [3]:
disjoint_dataset = CESNET_TimeSeries24.get_dataset(
    data_root="/some_directory/",
    source_type=SourceType.IP_ADDRESSES_SAMPLE,
    aggregation=AgreggationType.AGG_10_MINUTES,
    dataset_type=DatasetType.DISJOINT_TIME_BASED,
    display_details=True)

[2025-08-26 09:05:12,381][wrapper_dataset][INFO] - Dataset is disjoint_time_based. Use cesnet_tszoo.configs.DisjointTimeBasedConfig



Dataset details:

    AgreggationType.AGG_10_MINUTES
        Time indices: range(0, 40297)
        Datetime: (datetime.datetime(2023, 10, 9, 0, 3, 49, tzinfo=datetime.timezone.utc), datetime.datetime(2024, 7, 14, 21, 50, 52, tzinfo=datetime.timezone.utc))

    SourceType.IP_ADDRESSES_SAMPLE
        Time series indices: [ 11  20 101 103 118 ... 2003134 2008461 2011839 2022235 2044888], Length=1000; use 'get_available_ts_indices' for full list
        Features with default values: {'n_flows': 0, 'n_packets': 0, 'n_bytes': 0, 'n_dest_ip': 0, 'n_dest_asn': 0, 'n_dest_ports': 0, 'tcp_udp_ratio_packets': 0.5, 'tcp_udp_ratio_bytes': 0.5, 'dir_ratio_packets': 0.5, 'dir_ratio_bytes': 0.5, 'avg_duration': 0, 'avg_ttl': 0}
        
        Additional data: ['ids_relationship', 'weekends_and_holidays']
        


### Anomaly handlers

- Anomaly handlers are implemented as classes.
    - You can create your own or use a built-in one.
- The anomaly handler is applied before `default_values` and fillers handle missing values.
- Every time series has its own anomaly handler instance.
- An anomaly handler must implement `fit` and `transform_anomalies`.
- To use an anomaly handler, the train set must be defined in the config.
- The anomaly handler is only applied to the train set.
- You can change the anomaly handler later with `update_dataset_config_and_initialize` or `apply_anomaly_handler`.
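
The points above can be sketched without the library. The following is a minimal illustration, not the library's implementation: `PercentileClipper` and the `train_series` dictionary are hypothetical names, and the 1st/99th percentile bounds are an arbitrary choice. It shows one independent handler instance per time series, fitted on that series' train data and applied in place while missing values are still present.

```python
import numpy as np


class PercentileClipper:
    """Hypothetical handler exposing the two required methods."""

    def __init__(self):
        self.low = None
        self.high = None

    def fit(self, data: np.ndarray) -> None:
        # Per-feature bounds; NaN-aware because missing values are not filled yet.
        self.low, self.high = np.nanpercentile(data, [1, 99], axis=0)

    def transform_anomalies(self, data: np.ndarray) -> np.ndarray:
        # Clip in place; NaN entries stay NaN for the later filling step.
        np.clip(data, self.low, self.high, out=data)
        return data


rng = np.random.default_rng(0)
# Two toy series, each shaped (time steps, features).
train_series = {
    11: rng.normal(10.0, 1.0, size=(200, 2)),
    20: rng.normal(1000.0, 50.0, size=(200, 2)),
}
train_series[11][5, 0] = 500.0   # injected spike
train_series[20][7, 1] = np.nan  # missing value

# One independent handler instance per series: fit, then transform.
handlers = {}
for ts_id, values in train_series.items():
    handler = PercentileClipper()
    handler.fit(values)
    handler.transform_anomalies(values)
    handlers[ts_id] = handler
```

Because each series keeps its own fitted bounds, a low-traffic series and a high-traffic series are clipped against their own statistics rather than shared ones.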

#### Built-in

In [4]:
# Options

## Supported
AnomalyHandlerType.Z_SCORE
AnomalyHandlerType.INTERQUARTILE_RANGE

<AnomalyHandlerType.INTERQUARTILE_RANGE: 'interquartile_range'>
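
To get a feel for how the two rules differ, here is a small NumPy sketch on a toy array. The thresholds (3 standard deviations for the z-score rule, a factor of 1.5 for the interquartile range) are common conventions assumed here for illustration; the exact behaviour of the built-in handlers may differ.

```python
import numpy as np

values = np.array([4.0, 5.0, 6.0, 5.0, 4.0, 6.0, 5.0, 100.0])

# Z-score rule (assumed threshold: 3 standard deviations from the mean).
mean, std = values.mean(), values.std()
z_lower, z_upper = mean - 3 * std, mean + 3 * std

# Interquartile-range rule (assumed factor: 1.5).
q25, q75 = np.percentile(values, [25, 75])
iqr = q75 - q25
iqr_lower, iqr_upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr

z_clipped = np.clip(values, z_lower, z_upper)
iqr_clipped = np.clip(values, iqr_lower, iqr_upper)

print("z-score bounds:", z_lower, z_upper)
print("IQR bounds:    ", iqr_lower, iqr_upper)
```

On this tiny sample the spike inflates the standard deviation so much that the z-score rule leaves it untouched, while the IQR rule clips it to 7.875. Quartiles are less sensitive to the outliers they are meant to detect, which matters most for short or very spiky series.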

In [5]:
config = DisjointTimeBasedConfig(train_ts=500, val_ts=None, test_ts=None, train_time_period=0.5, features_to_take=["n_flows", "n_packets"],
                           handle_anomalies_with=AnomalyHandlerType.Z_SCORE, nan_threshold=0.5, random_state=1500)
disjoint_dataset.set_dataset_config_and_initialize(config, display_config_details=True, workers=0)

[2025-08-26 09:05:12,392][disjoint_time_based_config][INFO] - Quick validation succeeded.
[2025-08-26 09:05:12,429][disjoint_time_based_config][INFO] - Anomaly handler will only be used for train set, because of nature of disjoint-time-based.
[2025-08-26 09:05:12,430][disjoint_time_based_config][INFO] - Finalization and validation completed successfully.
[2025-08-26 09:05:12,433][cesnet_dataset][INFO] - Updating config for train set.
100%|██████████| 500/500 [00:01<00:00, 366.47it/s]
[2025-08-26 09:05:13,812][cesnet_dataset][INFO] - Config initialized successfully.



Config Details
    Used for database: CESNET-TimeSeries24
    Aggregation: AgreggationType.AGG_10_MINUTES
    Source: SourceType.IP_ADDRESSES_SAMPLE

    Time series
        Train time series IDs: [182151  10158  65072  10196 338309 ... 175742 659213  11188  73422 483796], Length=60
        Val time series IDs: None
        Test time series IDs: None
    Time periods
        Train time periods: range(0, 20149)
        Val time periods: None
        Test time periods: None
    Features
        Taken features: ['n_flows', 'n_packets']
        Default values: [0. 0.]
        Time series ID included: True
        Time included: True    
        Time format: TimeFormat.ID_TIME
    Sliding window
        Sliding window size: None
        Sliding window prediction size: None
        Sliding window step size: 1
    Fillers
        Filler type: None
    Transformers
        Transformer type: None
    Anomaly handler
        Anomaly handler type (train set): z-score
    Batch sizes
        Trai

In [6]:
disjoint_dataset.get_train_df(workers=0).head(10)

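| | id_ip | id_time | n_flows | n_packets |
|---|---|---|---|---|
| 0 | 182151.0 | 0.0 | 7.0 | 7.0 |
| 1 | 182151.0 | 1.0 | 4.0 | 4.0 |
| 2 | 182151.0 | 2.0 | 4.0 | 4.0 |
| 3 | 182151.0 | 3.0 | 5.0 | 5.0 |
| 4 | 182151.0 | 4.0 | 8.0 | 8.0 |
| 5 | 182151.0 | 5.0 | 3.0 | 3.0 |
| 6 | 182151.0 | 6.0 | 6.0 | 6.0 |
| 7 | 182151.0 | 7.0 | 4.0 | 4.0 |
| 8 | 182151.0 | 8.0 | 9.0 | 12.0 |
| 9 | 182151.0 | 9.0 | 0.0 | 0.0 |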


You can also set the anomaly handler later, using either of the following calls:

In [7]:
disjoint_dataset.update_dataset_config_and_initialize(handle_anomalies_with=AnomalyHandlerType.Z_SCORE, workers=0)
# Or
disjoint_dataset.apply_anomaly_handler(handle_anomalies_with=AnomalyHandlerType.Z_SCORE, workers=0)

[2025-08-26 09:05:14,047][cesnet_dataset][INFO] - Re-initialization is required.
[2025-08-26 09:05:14,084][disjoint_time_based_config][INFO] - Anomaly handler will only be used for train set, because of nature of disjoint-time-based.
[2025-08-26 09:05:14,085][disjoint_time_based_config][INFO] - Finalization and validation completed successfully.
[2025-08-26 09:05:14,087][cesnet_dataset][INFO] - Updating config for train set.
100%|██████████| 60/60 [00:00<00:00, 292.69it/s]
[2025-08-26 09:05:14,295][cesnet_dataset][INFO] - Config initialized successfully.
[2025-08-26 09:05:14,295][cesnet_dataset][INFO] - Configuration has been changed successfuly.
[2025-08-26 09:05:14,296][cesnet_dataset][INFO] - Re-initialization is required.
[2025-08-26 09:05:14,329][disjoint_time_based_config][INFO] - Anomaly handler will only be used for train set, because of nature of disjoint-time-based.
[2025-08-26 09:05:14,329][disjoint_time_based_config][INFO] - Finalization and validation completed successfull

#### Custom

You can create your own anomaly handler. Deriving it from the `AnomalyHandler` base class is recommended.

In [8]:
class CustomAnomalyHandler(AnomalyHandler):
    """IQR-based handler: clips values outside [q25 - 1.5 * IQR, q75 + 1.5 * IQR]."""

    def __init__(self):
        self.lower_bound = None
        self.upper_bound = None
        self.iqr = None

    def fit(self, data: np.ndarray) -> None:
        # Per-feature quartiles; np.percentile yields NaN for a feature containing NaN.
        q25, q75 = np.percentile(data, [25, 75], axis=0)
        self.iqr = q75 - q25

        self.lower_bound = q25 - 1.5 * self.iqr
        self.upper_bound = q75 + 1.5 * self.iqr

    def transform_anomalies(self, data: np.ndarray) -> np.ndarray:
        mask_lower_outliers = data < self.lower_bound
        mask_upper_outliers = data > self.upper_bound

        # Replace each outlier in place with the bound of its own feature (column).
        data[mask_lower_outliers] = np.take(self.lower_bound, np.where(mask_lower_outliers)[1])
        data[mask_upper_outliers] = np.take(self.upper_bound, np.where(mask_upper_outliers)[1])

        # The data is modified in place and also returned.
        return data
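
The `np.take(bound, np.where(mask)[1])` idiom used in `transform_anomalies` may not be obvious at first glance. The standalone sketch below, on made-up values and assuming 2D data shaped (time steps, features), shows how it picks the bound that belongs to each outlier's column.

```python
import numpy as np

data = np.array([[1.0, 10.0],
                 [50.0, 20.0],
                 [2.0, 900.0]])
upper_bound = np.array([5.0, 100.0])  # one bound per feature (column)

mask = data > upper_bound                  # bounds broadcast across rows
rows, cols = np.where(mask)                # coordinates of the outliers
replacements = np.take(upper_bound, cols)  # bound of each outlier's column

# Boolean-mask assignment visits elements in the same row-major order as np.where.
data[mask] = replacements
print(data)
```

Here the outliers sit at positions (1, 0) and (2, 1), so they are replaced with 5.0 and 100.0 respectively, while all other values are left unchanged.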

In [9]:
config = DisjointTimeBasedConfig(train_ts=500, val_ts=None, test_ts=None, train_time_period=0.5, features_to_take=["n_flows", "n_packets"],
                           handle_anomalies_with=CustomAnomalyHandler, nan_threshold=0, random_state=1500)
disjoint_dataset.set_dataset_config_and_initialize(config, display_config_details=True, workers=0)

[2025-08-26 09:05:14,568][disjoint_time_based_config][INFO] - Quick validation succeeded.
[2025-08-26 09:05:14,605][disjoint_time_based_config][INFO] - Anomaly handler will only be used for train set, because of nature of disjoint-time-based.
[2025-08-26 09:05:14,606][disjoint_time_based_config][INFO] - Finalization and validation completed successfully.
[2025-08-26 09:05:14,658][cesnet_dataset][INFO] - Updating config for train set.
100%|██████████| 500/500 [00:00<00:00, 549.91it/s]
[2025-08-26 09:05:15,569][cesnet_dataset][INFO] - Config initialized successfully.



Config Details
    Used for database: CESNET-TimeSeries24
    Aggregation: AgreggationType.AGG_10_MINUTES
    Source: SourceType.IP_ADDRESSES_SAMPLE

    Time series
        Train time series IDs: [ 1368  1774 10396], Length=3
        Val time series IDs: None
        Test time series IDs: None
    Time periods
        Train time periods: range(0, 20149)
        Val time periods: None
        Test time periods: None
    Features
        Taken features: ['n_flows', 'n_packets']
        Default values: [0. 0.]
        Time series ID included: True
        Time included: True    
        Time format: TimeFormat.ID_TIME
    Sliding window
        Sliding window size: None
        Sliding window prediction size: None
        Sliding window step size: 1
    Fillers
        Filler type: None
    Transformers
        Transformer type: None
    Anomaly handler
        Anomaly handler type (train set): CustomAnomalyHandler (Custom)
    Batch sizes
        Train batch size: 32
        Val batch 

In [10]:
disjoint_dataset.get_train_df(workers=0).head(10)

| | id_ip | id_time | n_flows | n_packets |
|---|---|---|---|---|
| 0 | 1368.0 | 0.0 | 9615.0 | 314238.0 |
| 1 | 1368.0 | 1.0 | 9720.0 | 157529.0 |
| 2 | 1368.0 | 2.0 | 8880.0 | 192754.0 |
| 3 | 1368.0 | 3.0 | 9026.0 | 189354.0 |
| 4 | 1368.0 | 4.0 | 9961.0 | 307351.0 |
| 5 | 1368.0 | 5.0 | 10056.0 | 197319.0 |
| 6 | 1368.0 | 6.0 | 9833.0 | 188012.0 |
| 7 | 1368.0 | 7.0 | 10087.0 | 236297.0 |
| 8 | 1368.0 | 8.0 | 10455.0 | 233227.0 |
| 9 | 1368.0 | 9.0 | 10217.0 | 212582.0 |


You can also set the custom handler later, using either of the following calls:

In [11]:
disjoint_dataset.update_dataset_config_and_initialize(handle_anomalies_with=CustomAnomalyHandler, workers=0)
# Or
disjoint_dataset.apply_anomaly_handler(handle_anomalies_with=CustomAnomalyHandler, workers=0)

[2025-08-26 09:05:15,600][cesnet_dataset][INFO] - Re-initialization is required.
[2025-08-26 09:05:15,634][disjoint_time_based_config][INFO] - Anomaly handler will only be used for train set, because of nature of disjoint-time-based.
[2025-08-26 09:05:15,635][disjoint_time_based_config][INFO] - Finalization and validation completed successfully.
[2025-08-26 09:05:15,638][cesnet_dataset][INFO] - Updating config for train set.
100%|██████████| 3/3 [00:00<00:00, 315.63it/s]
[2025-08-26 09:05:15,650][cesnet_dataset][INFO] - Config initialized successfully.
[2025-08-26 09:05:15,650][cesnet_dataset][INFO] - Configuration has been changed successfuly.
[2025-08-26 09:05:15,651][cesnet_dataset][INFO] - Re-initialization is required.
[2025-08-26 09:05:15,684][disjoint_time_based_config][INFO] - Anomaly handler will only be used for train set, because of nature of disjoint-time-based.
[2025-08-26 09:05:15,684][disjoint_time_based_config][INFO] - Finalization and validation completed successfully.