# Using scalers for SeriesBasedCesnetDataset

### Import

In [1]:
import numpy as np
import logging

from cesnet_tszoo.utils.enums import AgreggationType, SourceType, ScalerType
from cesnet_tszoo.datasets import CESNET_TimeSeries24
from cesnet_tszoo.configs import SeriesBasedConfig # Series based dataset MUST use SeriesBasedConfig

from cesnet_tszoo.utils.scaler import Scaler # For creating custom Scaler

### Setting logger

In [2]:
logging.basicConfig(
    level=logging.INFO,
    format="[%(asctime)s][%(name)s][%(levelname)s] - %(message)s")

### Preparing dataset

In [3]:
series_based_dataset = CESNET_TimeSeries24.get_dataset(data_root="/some_directory/", source_type=SourceType.IP_ADDRESSES_SAMPLE, aggregation=AgreggationType.AGG_10_MINUTES, is_series_based=True, display_details=True)

[2025-04-09 11:45:59,458][wrapper_dataset][INFO] - Dataset is series-based. Use cesnet_tszoo.configs.SeriesBasedConfig



Dataset details:

    AgreggationType.AGG_10_MINUTES
        Time indices: range(0, 40297)
        Datetime: (datetime.datetime(2023, 10, 9, 0, 3, 49, tzinfo=datetime.timezone.utc), datetime.datetime(2024, 7, 14, 21, 50, 52, tzinfo=datetime.timezone.utc))

    SourceType.IP_ADDRESSES_SAMPLE
        Time series indices: [ 11  20 101 103 118 ... 2003134 2008461 2011839 2022235 2044888], Length=1000; use 'get_available_ts_indices' for full list
        Features with default values: {'n_flows': 0, 'n_packets': 0, 'n_bytes': 0, 'n_dest_ip': 0, 'n_dest_asn': 0, 'n_dest_ports': 0, 'tcp_udp_ratio_packets': 0.5, 'tcp_udp_ratio_bytes': 0.5, 'dir_ratio_packets': 0.5, 'dir_ratio_bytes': 0.5, 'avg_duration': 0, 'avg_ttl': 0}
        
        Additional data: ['ids_relationship', 'weekends_and_holidays']
        


### Scalers

- Scalers are implemented as classes.
    - You can create your own or use a built-in one.
- The scaler is applied after `default_values` and fillers have handled missing values.
- One scaler is shared by all time series.
- A scaler must implement `transform`.
- A scaler must implement `partial_fit` (unless it is already fitted and `partial_fit_initialized_scalers` is False).
- To use a scaler, a train set must be defined (unless the scaler is already fitted and `partial_fit_initialized_scalers` is False).
- You can change the scaler later with `update_dataset_config_and_initialize` or `apply_scaler`.
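
The contract described above can be sketched without the library. This is a minimal illustration, not the cesnet_tszoo implementation: `RunningMinMax` and the toy arrays are hypothetical, and the loop only mimics the idea that a single scaler is partially fitted on every train series and then applied to all of them.

```python
import numpy as np


class RunningMinMax:
    """Hypothetical stand-in for a scaler: tracks per-feature min/max incrementally."""

    def __init__(self):
        self.min = None
        self.max = None

    def partial_fit(self, data):
        # Merge the statistics of this batch with what has been seen so far.
        batch_min = np.min(data, axis=0)
        batch_max = np.max(data, axis=0)
        self.min = batch_min if self.min is None else np.minimum(self.min, batch_min)
        self.max = batch_max if self.max is None else np.maximum(self.max, batch_max)

    def transform(self, data):
        return (data - self.min) / (self.max - self.min)


# Two toy "time series", each with shape (time steps, features).
train_series = [
    np.array([[0.0, 10.0], [4.0, 30.0]]),
    np.array([[2.0, 20.0], [8.0, 50.0]]),
]

# One shared scaler: fitted series by series, then applied to every series.
scaler = RunningMinMax()
for series in train_series:
    scaler.partial_fit(series)

scaled = [scaler.transform(series) for series in train_series]
```

Because the statistics are shared, the second series reaches 1.0 while the first only reaches 0.5: every series is scaled relative to the global range, not its own.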

#### Built-in

In [4]:
# Options

## Supported
ScalerType.STANDARD_SCALER
ScalerType.L2_NORMALIZER
ScalerType.LOG_SCALER
ScalerType.MAX_ABS_SCALER
ScalerType.MIN_MAX_SCALER

<ScalerType.MIN_MAX_SCALER: 'min_max_scaler'>

In [5]:
config = SeriesBasedConfig(time_period=0.5, train_ts=500, features_to_take=["n_flows", "n_packets"],
                           scale_with=ScalerType.MIN_MAX_SCALER, nan_threshold=0.5, random_state=1500)
series_based_dataset.set_dataset_config_and_initialize(config, display_config_details=True, workers=0)

[2025-04-09 11:45:59,469][config][INFO] - Quick validation succeeded.
[2025-04-09 11:45:59,502][config][INFO] - Finalization and validation completed successfully.
[2025-04-09 11:45:59,505][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time period.
100%|██████████| 500/500 [00:03<00:00, 139.26it/s]
[2025-04-09 11:46:03,108][cesnet_dataset][INFO] - Dataset initialization complete. Configuration updated.
[2025-04-09 11:46:03,109][cesnet_dataset][INFO] - Config initialized successfully.



Config Details:
    Used for database: CESNET-TimeSeries24
    Aggregation: AgreggationType.AGG_10_MINUTES
    Source: SourceType.IP_ADDRESSES_SAMPLE

    Time series
        Train time series IDS: [182151  10158  65072  10196 338309 ... 175742 659213  11188  73422 483796], Length=60
        Val time series IDS: None
        Test time series IDS None
        All time series IDS [182151  10158  65072  10196 338309 ... 175742 659213  11188  73422 483796], Length=60
    Time periods
        Time period: range(0, 20149)
    Features
        Taken features: ['n_flows', 'n_packets']
        Default values: [0. 0.]
        Time series ID included: True
        Time included: True    
        Time format: TimeFormat.ID_TIME
    Fillers         
        Filler type: None
    Scalers
        Scaler type: min_max_scaler
        Are scalers premade: False
        Are premade scalers partial_fitted: False
    Batch sizes
        Train batch size: 32
        Val batch size: 64
        Test batch si

In [6]:
series_based_dataset.get_train_df(workers=0).head(10)

| | id_ip | id_time | n_flows | n_packets |
|---|---|---|---|---|
| 0 | 182151.0 | 0.0 | 7.9e-05 | 2.240487e-07 |
| 1 | 182151.0 | 1.0 | 4.5e-05 | 1.280278e-07 |
| 2 | 182151.0 | 2.0 | 4.5e-05 | 1.280278e-07 |
| 3 | 182151.0 | 3.0 | 5.6e-05 | 1.600348e-07 |
| 4 | 182151.0 | 4.0 | 9e-05 | 2.560557e-07 |
| 5 | 182151.0 | 5.0 | 3.4e-05 | 9.602087e-08 |
| 6 | 182151.0 | 6.0 | 6.8e-05 | 1.920417e-07 |
| 7 | 182151.0 | 7.0 | 4.5e-05 | 1.280278e-07 |
| 8 | 182151.0 | 8.0 | 0.000102 | 3.840835e-07 |
| 9 | 182151.0 | 9.0 | 0.0 | 0.0 |


In [7]:
series_based_dataset.get_scalers()

<cesnet_tszoo.utils.scaler.MinMaxScaler at 0x1e7920b0740>

Or change the scaler later with:

In [8]:
series_based_dataset.update_dataset_config_and_initialize(scale_with=ScalerType.MIN_MAX_SCALER, partial_fit_initialized_scalers="config", workers=0)
# Or
series_based_dataset.apply_scaler(scale_with=ScalerType.MIN_MAX_SCALER, partial_fit_initialized_scalers="config", workers=0)

[2025-04-09 11:46:03,302][cesnet_dataset][INFO] - Re-initialization is required.
[2025-04-09 11:46:03,339][config][INFO] - Finalization and validation completed successfully.
[2025-04-09 11:46:03,343][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time period.
100%|██████████| 60/60 [00:00<00:00, 411.48it/s]
[2025-04-09 11:46:03,490][cesnet_dataset][INFO] - Dataset initialization complete. Configuration updated.
[2025-04-09 11:46:03,492][cesnet_dataset][INFO] - Config initialized successfully.
[2025-04-09 11:46:03,492][cesnet_dataset][INFO] - Configuration has been changed successfuly.
[2025-04-09 11:46:03,493][cesnet_dataset][INFO] - Re-initialization is required.
[2025-04-09 11:46:03,522][config][INFO] - Finalization and validation completed successfully.
[2025-04-09 11:46:03,525][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time period.
100%|██████████| 60/60 [00:00<00:00, 424.24it/s]
[2025-04-09 11:46:03,668][cesnet_dataset]

#### Custom

You can create your own custom scaler. It is recommended to derive it from the `Scaler` base class.

In [9]:
class CustomScaler(Scaler):
    """Min-max scaler that tracks the per-feature minimum and maximum."""

    def __init__(self):
        super().__init__()

        self.max = None
        self.min = None

    def transform(self, data):
        # Scale each feature with the fitted minimum and maximum.
        return (data - self.min) / (self.max - self.min)

    def fit(self, data):
        self.partial_fit(data)

    def partial_fit(self, data):
        # First call: initialize the statistics from this batch.
        if self.max is None and self.min is None:
            self.max = np.max(data, axis=0)
            self.min = np.min(data, axis=0)
            return

        # Later calls: merge the batch statistics with the stored ones.
        self.max = np.maximum(self.max, np.max(data, axis=0))
        self.min = np.minimum(self.min, np.min(data, axis=0))
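
One thing to watch in a min-max `transform` like the one above is the division by `max - min`: a feature that is constant across the fitted data has a range of zero, so the division yields NaN. The sketch below shows one possible guard. `safe_min_max_transform` is a hypothetical helper, not part of cesnet_tszoo, and mapping constant features to 0 is only one reasonable convention.

```python
import numpy as np


def safe_min_max_transform(data, data_min, data_max):
    """Min-max scale `data`; constant features (max == min) map to 0 instead of NaN."""
    value_range = data_max - data_min
    # Replace zero ranges with 1 so the division is well defined.
    safe_range = np.where(value_range == 0, 1.0, value_range)
    return (data - data_min) / safe_range


data = np.array([[1.0, 5.0], [3.0, 5.0]])  # the second feature is constant
result = safe_min_max_transform(data, data.min(axis=0), data.max(axis=0))
```

The same guard could be placed inside the `transform` method of a custom scaler if your selected features may be constant over the train set.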

In [10]:
config = SeriesBasedConfig(time_period=0.5, train_ts=500, features_to_take=["n_flows", "n_packets"],
                           scale_with=CustomScaler, nan_threshold=0.5, random_state=1500)
series_based_dataset.set_dataset_config_and_initialize(config, display_config_details=True, workers=0)

[2025-04-09 11:46:03,680][config][INFO] - Quick validation succeeded.
[2025-04-09 11:46:03,711][config][INFO] - Finalization and validation completed successfully.
[2025-04-09 11:46:03,713][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time period.
100%|██████████| 500/500 [00:00<00:00, 1533.44it/s]
[2025-04-09 11:46:04,042][cesnet_dataset][INFO] - Dataset initialization complete. Configuration updated.
[2025-04-09 11:46:04,043][cesnet_dataset][INFO] - Config initialized successfully.



Config Details:
    Used for database: CESNET-TimeSeries24
    Aggregation: AgreggationType.AGG_10_MINUTES
    Source: SourceType.IP_ADDRESSES_SAMPLE

    Time series
        Train time series IDS: [182151  10158  65072  10196 338309 ... 175742 659213  11188  73422 483796], Length=60
        Val time series IDS: None
        Test time series IDS None
        All time series IDS [182151  10158  65072  10196 338309 ... 175742 659213  11188  73422 483796], Length=60
    Time periods
        Time period: range(0, 20149)
    Features
        Taken features: ['n_flows', 'n_packets']
        Default values: [0. 0.]
        Time series ID included: True
        Time included: True    
        Time format: TimeFormat.ID_TIME
    Fillers         
        Filler type: None
    Scalers
        Scaler type: CustomScaler (Custom)
        Are scalers premade: False
        Are premade scalers partial_fitted: False
    Batch sizes
        Train batch size: 32
        Val batch size: 64
        Test b

In [11]:
series_based_dataset.get_train_df(workers=0).head(10)

| | id_ip | id_time | n_flows | n_packets |
|---|---|---|---|---|
| 0 | 182151.0 | 0.0 | 7.9e-05 | 2.240487e-07 |
| 1 | 182151.0 | 1.0 | 4.5e-05 | 1.280278e-07 |
| 2 | 182151.0 | 2.0 | 4.5e-05 | 1.280278e-07 |
| 3 | 182151.0 | 3.0 | 5.6e-05 | 1.600348e-07 |
| 4 | 182151.0 | 4.0 | 9e-05 | 2.560557e-07 |
| 5 | 182151.0 | 5.0 | 3.4e-05 | 9.602087e-08 |
| 6 | 182151.0 | 6.0 | 6.8e-05 | 1.920417e-07 |
| 7 | 182151.0 | 7.0 | 4.5e-05 | 1.280278e-07 |
| 8 | 182151.0 | 8.0 | 0.000102 | 3.840835e-07 |
| 9 | 182151.0 | 9.0 | 0.0 | 0.0 |


In [12]:
series_based_dataset.get_scalers()

<__main__.CustomScaler at 0x1e7940ecb00>

Or change the scaler later with:

In [13]:
series_based_dataset.update_dataset_config_and_initialize(scale_with=CustomScaler, partial_fit_initialized_scalers="config", workers=0)
# Or
series_based_dataset.apply_scaler(scale_with=CustomScaler, partial_fit_initialized_scalers="config", workers=0)

[2025-04-09 11:46:04,222][cesnet_dataset][INFO] - Re-initialization is required.
[2025-04-09 11:46:04,251][config][INFO] - Finalization and validation completed successfully.
[2025-04-09 11:46:04,255][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time period.
100%|██████████| 60/60 [00:00<00:00, 482.69it/s]
[2025-04-09 11:46:04,380][cesnet_dataset][INFO] - Dataset initialization complete. Configuration updated.
[2025-04-09 11:46:04,381][cesnet_dataset][INFO] - Config initialized successfully.
[2025-04-09 11:46:04,381][cesnet_dataset][INFO] - Configuration has been changed successfuly.
[2025-04-09 11:46:04,382][cesnet_dataset][INFO] - Re-initialization is required.
[2025-04-09 11:46:04,410][config][INFO] - Finalization and validation completed successfully.
[2025-04-09 11:46:04,413][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time period.
100%|██████████| 60/60 [00:00<00:00, 466.24it/s]
[2025-04-09 11:46:04,543][cesnet_dataset]

#### Using already fitted scaler

- When `partial_fit_initialized_scaler` is False (the default), the scaler needs neither `partial_fit` nor a train set.
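
To illustrate why an already fitted scaler needs no `partial_fit`, here is a minimal sketch of a scaler whose statistics are supplied up front. `FixedRangeScaler` and its bounds are hypothetical. In the notebook you would still derive from `Scaler`, and whether the library accepts an object exactly like this one is an assumption.

```python
import numpy as np


class FixedRangeScaler:
    """Hypothetical pre-fitted scaler: bounds are given up front, so only transform exists."""

    def __init__(self, data_min, data_max):
        self.min = np.asarray(data_min, dtype=float)
        self.max = np.asarray(data_max, dtype=float)

    def transform(self, data):
        return (data - self.min) / (self.max - self.min)


# The bounds come from prior knowledge or an earlier run, not from a train set.
prefitted = FixedRangeScaler(data_min=[0.0, 0.0], data_max=[100.0, 1000.0])

val_series = np.array([[50.0, 250.0], [100.0, 1000.0]])
scaled_val = prefitted.transform(val_series)
```

Nothing here touches train data: the validation values are scaled purely with the stored bounds.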

In [14]:
config = SeriesBasedConfig(time_period=0.5, train_ts=500, features_to_take=["n_flows", "n_packets"],
                           scale_with=CustomScaler, nan_threshold=0.5, random_state=1500)
series_based_dataset.set_dataset_config_and_initialize(config, display_config_details=False, workers=0)

fitted_scaler = series_based_dataset.get_scalers()

[2025-04-09 11:46:04,549][config][INFO] - Quick validation succeeded.
[2025-04-09 11:46:04,579][config][INFO] - Finalization and validation completed successfully.
[2025-04-09 11:46:04,582][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time period.
100%|██████████| 500/500 [00:00<00:00, 1668.09it/s]
[2025-04-09 11:46:04,884][cesnet_dataset][INFO] - Dataset initialization complete. Configuration updated.
[2025-04-09 11:46:04,884][cesnet_dataset][INFO] - Config initialized successfully.


In [15]:
config = SeriesBasedConfig(time_period=0.5, train_ts=500, val_ts=500, features_to_take=["n_flows", "n_packets"],
                           scale_with=fitted_scaler, nan_threshold=0.5, random_state=999)
series_based_dataset.set_dataset_config_and_initialize(config, display_config_details=True, workers=0)

[2025-04-09 11:46:04,889][config][INFO] - Quick validation succeeded.
[2025-04-09 11:46:04,920][config][INFO] - Finalization and validation completed successfully.
[2025-04-09 11:46:04,924][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time period.
100%|██████████| 1000/1000 [00:02<00:00, 458.32it/s]
[2025-04-09 11:46:07,108][cesnet_dataset][INFO] - Dataset initialization complete. Configuration updated.
[2025-04-09 11:46:07,108][cesnet_dataset][INFO] - Config initialized successfully.



Config Details:
    Used for database: CESNET-TimeSeries24
    Aggregation: AgreggationType.AGG_10_MINUTES
    Source: SourceType.IP_ADDRESSES_SAMPLE

    Time series
        Train time series IDS: [  4380  35210 322201    190 307400 ...  29194 617662 677144 211973 612051], Length=65
        Val time series IDS: [121596 599190  10703 338309 151737 ... 252575 259860     20     11 191587], Length=70
        Test time series IDS None
        All time series IDS [  4380  35210 322201    190 307400 ... 252575 259860     20     11 191587], Length=135
    Time periods
        Time period: range(0, 20149)
    Features
        Taken features: ['n_flows', 'n_packets']
        Default values: [0. 0.]
        Time series ID included: True
        Time included: True    
        Time format: TimeFormat.ID_TIME
    Fillers         
        Filler type: None
    Scalers
        Scaler type: CustomScaler (Custom)
        Are scalers premade: True
        Are premade scalers partial_fitted: False
    

In [16]:
series_based_dataset.get_val_df(workers=0).head(10)

| | id_ip | id_time | n_flows | n_packets |
|---|---|---|---|---|
| 0 | 121596.0 | 0.0 | 0.000237 | 6e-06 |
| 1 | 121596.0 | 1.0 | 0.000361 | 9e-06 |
| 2 | 121596.0 | 2.0 | 0.000361 | 1e-05 |
| 3 | 121596.0 | 3.0 | 0.000282 | 8e-06 |
| 4 | 121596.0 | 4.0 | 0.000316 | 1e-05 |
| 5 | 121596.0 | 5.0 | 0.000395 | 9e-06 |
| 6 | 121596.0 | 6.0 | 0.000361 | 1e-05 |
| 7 | 121596.0 | 7.0 | 0.000361 | 8e-06 |
| 8 | 121596.0 | 8.0 | 0.00035 | 1e-05 |
| 9 | 121596.0 | 9.0 | 0.000248 | 5e-06 |


The example below shows that an already fitted scaler works even without a train set.

In [17]:
config = SeriesBasedConfig(time_period=0.5, val_ts=500, features_to_take=["n_flows", "n_packets"],
                           scale_with=fitted_scaler, nan_threshold=0.5, random_state=999)
series_based_dataset.set_dataset_config_and_initialize(config, display_config_details=True, workers=0)

[2025-04-09 11:46:07,316][config][INFO] - Quick validation succeeded.
[2025-04-09 11:46:07,348][config][INFO] - Finalization and validation completed successfully.
[2025-04-09 11:46:07,351][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time period.
100%|██████████| 500/500 [00:00<00:00, 1687.94it/s]
[2025-04-09 11:46:07,649][cesnet_dataset][INFO] - Dataset initialization complete. Configuration updated.
[2025-04-09 11:46:07,649][cesnet_dataset][INFO] - Config initialized successfully.



Config Details:
    Used for database: CESNET-TimeSeries24
    Aggregation: AgreggationType.AGG_10_MINUTES
    Source: SourceType.IP_ADDRESSES_SAMPLE

    Time series
        Train time series IDS: None
        Val time series IDS: [  4380  35210 322201    190 307400 ...  29194 617662 677144 211973 612051], Length=65
        Test time series IDS None
        All time series IDS [  4380  35210 322201    190 307400 ...  29194 617662 677144 211973 612051], Length=65
    Time periods
        Time period: range(0, 20149)
    Features
        Taken features: ['n_flows', 'n_packets']
        Default values: [0. 0.]
        Time series ID included: True
        Time included: True    
        Time format: TimeFormat.ID_TIME
    Fillers         
        Filler type: None
    Scalers
        Scaler type: CustomScaler (Custom)
        Are scalers premade: True
        Are premade scalers partial_fitted: False
    Batch sizes
        Train batch size: 32
        Val batch size: 64
        Test ba

In [18]:
series_based_dataset.get_val_df(workers=0).head(10)

| | id_ip | id_time | n_flows | n_packets |
|---|---|---|---|---|
| 0 | 4380.0 | 0.0 | 0.001253 | 5e-06 |
| 1 | 4380.0 | 1.0 | 0.001162 | 4e-06 |
| 2 | 4380.0 | 2.0 | 0.000756 | 3e-06 |
| 3 | 4380.0 | 3.0 | 0.000959 | 4e-06 |
| 4 | 4380.0 | 4.0 | 0.001219 | 5e-06 |
| 5 | 4380.0 | 5.0 | 0.001275 | 5e-06 |
| 6 | 4380.0 | 6.0 | 0.001456 | 6e-06 |
| 7 | 4380.0 | 7.0 | 0.001196 | 4e-06 |
| 8 | 4380.0 | 8.0 | 0.001433 | 6e-06 |
| 9 | 4380.0 | 9.0 | 0.001072 | 4e-06 |


##### Partial fitting on train set

Partially fits the already fitted scaler on the new train set as well. The scaler must implement `partial_fit`.
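
The effect of partial fitting can be seen on toy numbers. This is a sketch under the assumption of a min-max style scaler: all values are made up, and the update rule mirrors the `CustomScaler.partial_fit` defined earlier rather than any library internals.

```python
import numpy as np

# Statistics of a scaler fitted earlier (toy values, one per feature).
fitted_min = np.array([0.0, 0.0])
fitted_max = np.array([10.0, 100.0])

# New train data that exceeds the old maximum of the first feature.
new_train = np.array([[5.0, 20.0], [20.0, 80.0]])

# A partial fit merges the old statistics with those of the new data.
updated_min = np.minimum(fitted_min, new_train.min(axis=0))
updated_max = np.maximum(fitted_max, new_train.max(axis=0))

# The same sample is scaled differently before and after the partial fit.
sample = np.array([10.0, 50.0])
before = (sample - fitted_min) / (fitted_max - fitted_min)
after = (sample - updated_min) / (updated_max - updated_min)
```

Only the first feature changes (from 1.0 to 0.5), because only its range was widened by the new train data.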

In [19]:
config = SeriesBasedConfig(time_period=0.5, train_ts=500, val_ts=500, features_to_take=["n_flows", "n_packets"],
                           scale_with=fitted_scaler, partial_fit_initialized_scaler=True, nan_threshold=0.5, random_state=999)
series_based_dataset.set_dataset_config_and_initialize(config, display_config_details=True, workers=0)

[2025-04-09 11:46:07,837][config][INFO] - Quick validation succeeded.
[2025-04-09 11:46:07,870][config][INFO] - Finalization and validation completed successfully.
[2025-04-09 11:46:07,873][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time period.
100%|██████████| 1000/1000 [00:00<00:00, 1621.15it/s]
[2025-04-09 11:46:08,492][cesnet_dataset][INFO] - Dataset initialization complete. Configuration updated.
[2025-04-09 11:46:08,493][cesnet_dataset][INFO] - Config initialized successfully.



Config Details:
    Used for database: CESNET-TimeSeries24
    Aggregation: AgreggationType.AGG_10_MINUTES
    Source: SourceType.IP_ADDRESSES_SAMPLE

    Time series
        Train time series IDS: [  4380  35210 322201    190 307400 ...  29194 617662 677144 211973 612051], Length=65
        Val time series IDS: [ 10158 171452 296195 252575 479584 ...    1774  210412  405441 1604957  175742], Length=70
        Test time series IDS None
        All time series IDS [  4380  35210 322201    190 307400 ...    1774  210412  405441 1604957  175742], Length=135
    Time periods
        Time period: range(0, 20149)
    Features
        Taken features: ['n_flows', 'n_packets']
        Default values: [0. 0.]
        Time series ID included: True
        Time included: True    
        Time format: TimeFormat.ID_TIME
    Fillers         
        Filler type: None
    Scalers
        Scaler type: CustomScaler (Custom)
        Are scalers premade: True
        Are premade scalers partial_fitted: 

In [20]:
series_based_dataset.get_val_df(workers=0).head(10)

| | id_ip | id_time | n_flows | n_packets |
|---|---|---|---|---|
| 0 | 10158.0 | 0.0 | 0.001162 | 3.3e-05 |
| 1 | 10158.0 | 1.0 | 0.001366 | 4.1e-05 |
| 2 | 10158.0 | 2.0 | 0.000643 | 1.5e-05 |
| 3 | 10158.0 | 3.0 | 0.000801 | 2.5e-05 |
| 4 | 10158.0 | 4.0 | 0.000903 | 2.1e-05 |
| 5 | 10158.0 | 5.0 | 0.00079 | 2.4e-05 |
| 6 | 10158.0 | 6.0 | 0.001275 | 4.1e-05 |
| 7 | 10158.0 | 7.0 | 0.00088 | 2.2e-05 |
| 8 | 10158.0 | 8.0 | 0.000959 | 3e-05 |
| 9 | 10158.0 | 9.0 | 0.001185 | 2.9e-05 |
