# Handling missing data

This notebook uses only `TimeBasedCesnetDataset`, but the same methods work almost identically for the other dataset types.

### Import

In [1]:
import logging
import numpy as np

from cesnet_tszoo.utils.enums import AgreggationType, SourceType, FillerType, DatasetType
from cesnet_tszoo.datasets import CESNET_TimeSeries24
from cesnet_tszoo.configs import TimeBasedConfig # Time based dataset MUST use TimeBasedConfig

from cesnet_tszoo.utils.filler import Filler # For creating custom Filler

### Setting up the logger

In [2]:
logging.basicConfig(
    level=logging.INFO,
    format="[%(asctime)s][%(name)s][%(levelname)s] - %(message)s")

### Preparing dataset

In [3]:
time_based_dataset = CESNET_TimeSeries24.get_dataset(data_root="/some_directory/", source_type=SourceType.IP_ADDRESSES_SAMPLE, aggregation=AgreggationType.AGG_10_MINUTES, dataset_type=DatasetType.TIME_BASED, display_details=True)

[2025-08-26 20:06:31,721][wrapper_dataset][INFO] - Dataset is time-based. Use cesnet_tszoo.configs.TimeBasedConfig



Dataset details:

    AgreggationType.AGG_10_MINUTES
        Time indices: range(0, 40297)
        Datetime: (datetime.datetime(2023, 10, 9, 0, 3, 49, tzinfo=datetime.timezone.utc), datetime.datetime(2024, 7, 14, 21, 50, 52, tzinfo=datetime.timezone.utc))

    SourceType.IP_ADDRESSES_SAMPLE
        Time series indices: [ 11  20 101 103 118 ... 2003134 2008461 2011839 2022235 2044888], Length=1000; use 'get_available_ts_indices' for full list
        Features with default values: {'n_flows': 0, 'n_packets': 0, 'n_bytes': 0, 'n_dest_ip': 0, 'n_dest_asn': 0, 'n_dest_ports': 0, 'tcp_udp_ratio_packets': 0.5, 'tcp_udp_ratio_bytes': 0.5, 'dir_ratio_packets': 0.5, 'dir_ratio_bytes': 0.5, 'avg_duration': 0, 'avg_ttl': 0}
        
        Additional data: ['ids_relationship', 'weekends_and_holidays']
        


### Default values

- Default values replace missing values before any filler is applied.
- You can change the default values later with `update_dataset_config_and_initialize` or `set_default_values`.
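The effect of per-feature default values can be illustrated with plain NumPy. This is a minimal sketch, not the library's implementation: `apply_defaults` is a hypothetical helper, and the small array stands in for one time series with the two features used in this notebook.

```python
import numpy as np


def apply_defaults(values: np.ndarray, defaults: np.ndarray) -> np.ndarray:
    """Return a copy of `values` where NaN entries take the per-feature default."""
    # `defaults` has one entry per feature and broadcasts across the rows.
    return np.where(np.isnan(values), defaults, values)


# Rows are time steps, columns are features (n_flows, n_packets).
raw = np.array([[np.nan, np.nan],
                [np.nan, np.nan],
                [4.0, 4.0]])

# A default of 0 for both features replaces every gap.
filled = apply_defaults(raw, np.array([0.0, 0.0]))

# A NaN default leaves that feature's gaps missing.
partly_filled = apply_defaults(raw, np.array([1.0, np.nan]))
```

Observed values are never touched; only entries that were missing receive a default.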

#### Using default

- With `default_values="default"`, the default values come from the dataset itself.
- You can view the default value of each feature with `time_based_dataset.display_dataset_details()`.

In [4]:
config = TimeBasedConfig(ts_ids=[1200], train_time_period=range(0, 30), test_time_period=range(30, 80), features_to_take=['n_flows', 'n_packets'],
                         default_values="default")

time_based_dataset.set_dataset_config_and_initialize(config, display_config_details=True, workers=0)

[2025-08-26 20:06:31,728][time_config][INFO] - Quick validation succeeded.
[2025-08-26 20:06:31,784][time_config][INFO] - Finalization and validation completed successfully.
[2025-08-26 20:06:31,788][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<00:00, 661.98it/s]
[2025-08-26 20:06:31,792][cesnet_dataset][INFO] - Config initialized successfully.



Config Details
    Used for database: CESNET-TimeSeries24
    Aggregation: AgreggationType.AGG_10_MINUTES
    Source: SourceType.IP_ADDRESSES_SAMPLE

    Time series
        Time series IDS: [1200], Length=1
    Time periods
        Train time periods: range(0, 30)
        Val time periods: None
        Test time periods: range(30, 80)
        All time periods: range(0, 80)
    Features
        Taken features: ['n_flows', 'n_packets']
        Default values: [0. 0.]
        Time series ID included: True
        Time included: True    
        Time format: TimeFormat.ID_TIME
    Sliding window
        Sliding window size: None
        Sliding window prediction size: None
        Sliding window step size: 1
        Set shared size: 0
    Fillers
        Filler type: None
    Transformers
        Transformer type: None
    Anomaly handler
        Anomaly handler type: None        
    Batch sizes
        Train batch size: 32
        Val batch size: 64
        Test batch size: 128
       

In [5]:
time_based_dataset.get_train_df(workers=0).iloc[:30]

Unnamed: 0,id_ip,id_time,n_flows,n_packets
0,1200.0,0.0,0.0,0.0
1,1200.0,1.0,0.0,0.0
2,1200.0,2.0,0.0,0.0
3,1200.0,3.0,0.0,0.0
4,1200.0,4.0,0.0,0.0
5,1200.0,5.0,0.0,0.0
6,1200.0,6.0,0.0,0.0
7,1200.0,7.0,0.0,0.0
8,1200.0,8.0,0.0,0.0
9,1200.0,9.0,4.0,4.0


Or later with:

In [6]:
time_based_dataset.update_dataset_config_and_initialize(default_values="default", workers=0)
# Or
time_based_dataset.set_default_values(default_values="default", workers=0)

[2025-08-26 20:06:31,818][cesnet_dataset][INFO] - Re-initialization is required.
[2025-08-26 20:06:31,871][time_config][INFO] - Finalization and validation completed successfully.
[2025-08-26 20:06:31,876][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<?, ?it/s]
[2025-08-26 20:06:31,877][cesnet_dataset][INFO] - Config initialized successfully.
[2025-08-26 20:06:31,878][cesnet_dataset][INFO] - Configuration has been changed successfuly.
[2025-08-26 20:06:31,878][cesnet_dataset][INFO] - Re-initialization is required.
[2025-08-26 20:06:31,928][time_config][INFO] - Finalization and validation completed successfully.
[2025-08-26 20:06:31,931][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<00:00, 1000.07it/s]
[2025-08-26 20:06:31,933][cesnet_dataset][INFO] - Config initialized successfully.
[2025-08-26 20:06:31,934][cesnet_dataset][INFO] - Configuration ha

#### Setting default_values to None

In [7]:
config = TimeBasedConfig(ts_ids=[1200], train_time_period=range(0, 30), test_time_period=range(30, 80), features_to_take=['n_flows', 'n_packets'],
                         default_values=None)

time_based_dataset.set_dataset_config_and_initialize(config, display_config_details=True, workers=0)

[2025-08-26 20:06:31,938][time_config][INFO] - Quick validation succeeded.
[2025-08-26 20:06:31,991][time_config][INFO] - Finalization and validation completed successfully.
[2025-08-26 20:06:31,994][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<?, ?it/s]
[2025-08-26 20:06:31,997][cesnet_dataset][INFO] - Config initialized successfully.



Config Details
    Used for database: CESNET-TimeSeries24
    Aggregation: AgreggationType.AGG_10_MINUTES
    Source: SourceType.IP_ADDRESSES_SAMPLE

    Time series
        Time series IDS: [1200], Length=1
    Time periods
        Train time periods: range(0, 30)
        Val time periods: None
        Test time periods: range(30, 80)
        All time periods: range(0, 80)
    Features
        Taken features: ['n_flows', 'n_packets']
        Default values: [nan nan]
        Time series ID included: True
        Time included: True    
        Time format: TimeFormat.ID_TIME
    Sliding window
        Sliding window size: None
        Sliding window prediction size: None
        Sliding window step size: 1
        Set shared size: 0
    Fillers
        Filler type: None
    Transformers
        Transformer type: None
    Anomaly handler
        Anomaly handler type: None        
    Batch sizes
        Train batch size: 32
        Val batch size: 64
        Test batch size: 128
     

In [8]:
time_based_dataset.get_train_df(workers=0).iloc[:30]

Unnamed: 0,id_ip,id_time,n_flows,n_packets
0,1200.0,0.0,,
1,1200.0,1.0,,
2,1200.0,2.0,,
3,1200.0,3.0,,
4,1200.0,4.0,,
5,1200.0,5.0,,
6,1200.0,6.0,,
7,1200.0,7.0,,
8,1200.0,8.0,,
9,1200.0,9.0,4.0,4.0


Or later with:

In [9]:
time_based_dataset.update_dataset_config_and_initialize(default_values=None, workers=0)
# Or
time_based_dataset.set_default_values(default_values=None, workers=0)

[2025-08-26 20:06:32,073][cesnet_dataset][INFO] - Re-initialization is required.
[2025-08-26 20:06:32,125][time_config][INFO] - Finalization and validation completed successfully.
[2025-08-26 20:06:32,128][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<?, ?it/s]
[2025-08-26 20:06:32,130][cesnet_dataset][INFO] - Config initialized successfully.
[2025-08-26 20:06:32,130][cesnet_dataset][INFO] - Configuration has been changed successfuly.
[2025-08-26 20:06:32,131][cesnet_dataset][INFO] - Re-initialization is required.
[2025-08-26 20:06:32,182][time_config][INFO] - Finalization and validation completed successfully.
[2025-08-26 20:06:32,185][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<00:00, 1000.31it/s]
[2025-08-26 20:06:32,187][cesnet_dataset][INFO] - Config initialized successfully.
[2025-08-26 20:06:32,187][cesnet_dataset][INFO] - Configuration ha

#### Setting default_values with a single number

In [10]:
config = TimeBasedConfig(ts_ids=[1200], train_time_period=range(0, 30), test_time_period=range(30, 80), features_to_take=['n_flows', 'n_packets'],
                         default_values=0)

time_based_dataset.set_dataset_config_and_initialize(config, display_config_details=True, workers=0)

[2025-08-26 20:06:32,192][time_config][INFO] - Quick validation succeeded.
[2025-08-26 20:06:32,241][time_config][INFO] - Finalization and validation completed successfully.
[2025-08-26 20:06:32,244][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<?, ?it/s]
[2025-08-26 20:06:32,246][cesnet_dataset][INFO] - Config initialized successfully.



Config Details
    Used for database: CESNET-TimeSeries24
    Aggregation: AgreggationType.AGG_10_MINUTES
    Source: SourceType.IP_ADDRESSES_SAMPLE

    Time series
        Time series IDS: [1200], Length=1
    Time periods
        Train time periods: range(0, 30)
        Val time periods: None
        Test time periods: range(30, 80)
        All time periods: range(0, 80)
    Features
        Taken features: ['n_flows', 'n_packets']
        Default values: [0. 0.]
        Time series ID included: True
        Time included: True    
        Time format: TimeFormat.ID_TIME
    Sliding window
        Sliding window size: None
        Sliding window prediction size: None
        Sliding window step size: 1
        Set shared size: 0
    Fillers
        Filler type: None
    Transformers
        Transformer type: None
    Anomaly handler
        Anomaly handler type: None        
    Batch sizes
        Train batch size: 32
        Val batch size: 64
        Test batch size: 128
       

In [11]:
time_based_dataset.get_train_df(workers=0).iloc[:30]

Unnamed: 0,id_ip,id_time,n_flows,n_packets
0,1200.0,0.0,0.0,0.0
1,1200.0,1.0,0.0,0.0
2,1200.0,2.0,0.0,0.0
3,1200.0,3.0,0.0,0.0
4,1200.0,4.0,0.0,0.0
5,1200.0,5.0,0.0,0.0
6,1200.0,6.0,0.0,0.0
7,1200.0,7.0,0.0,0.0
8,1200.0,8.0,0.0,0.0
9,1200.0,9.0,4.0,4.0


Or later with:

In [12]:
time_based_dataset.update_dataset_config_and_initialize(default_values=0, workers=0)
# Or
time_based_dataset.set_default_values(default_values=0, workers=0)

[2025-08-26 20:06:32,267][cesnet_dataset][INFO] - Re-initialization is required.
[2025-08-26 20:06:32,318][time_config][INFO] - Finalization and validation completed successfully.
[2025-08-26 20:06:32,321][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<00:00, 1000.55it/s]
[2025-08-26 20:06:32,323][cesnet_dataset][INFO] - Config initialized successfully.
[2025-08-26 20:06:32,323][cesnet_dataset][INFO] - Configuration has been changed successfuly.
[2025-08-26 20:06:32,324][cesnet_dataset][INFO] - Re-initialization is required.
[2025-08-26 20:06:32,380][time_config][INFO] - Finalization and validation completed successfully.
[2025-08-26 20:06:32,384][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<?, ?it/s]
[2025-08-26 20:06:32,386][cesnet_dataset][INFO] - Config initialized successfully.
[2025-08-26 20:06:32,386][cesnet_dataset][INFO] - Configuration ha

#### Setting default_values with a list

- The position of each value in the list corresponds to the order of features in `features_to_take`.
- The list must contain exactly one value per feature in `features_to_take`.
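The positional rule above can be sketched as follows. `defaults_from_list` is a hypothetical helper written for illustration; the exact error the library raises on a length mismatch is not shown in this notebook, so the `ValueError` here is an assumption.

```python
import numpy as np


def defaults_from_list(features_to_take, default_values):
    """Pair each feature with the value at the same position in the list."""
    if len(default_values) != len(features_to_take):
        raise ValueError("default_values must have one entry per feature")
    # None means "leave missing", represented here as NaN.
    return {
        name: (np.nan if value is None else float(value))
        for name, value in zip(features_to_take, default_values)
    }


mapping = defaults_from_list(["n_flows", "n_packets"], [1, None])
```

Here `n_flows` gets 1 and `n_packets` stays missing, matching the `[ 1. nan]` shown in the config details below.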

In [13]:
config = TimeBasedConfig(ts_ids=[1200], train_time_period=range(0, 30), test_time_period=range(30, 80), features_to_take=['n_flows', 'n_packets'],
                         default_values=[1, None])

time_based_dataset.set_dataset_config_and_initialize(config, display_config_details=True, workers=0)

[2025-08-26 20:06:32,392][time_config][INFO] - Quick validation succeeded.
[2025-08-26 20:06:32,447][time_config][INFO] - Finalization and validation completed successfully.
[2025-08-26 20:06:32,450][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<?, ?it/s]
[2025-08-26 20:06:32,453][cesnet_dataset][INFO] - Config initialized successfully.



Config Details
    Used for database: CESNET-TimeSeries24
    Aggregation: AgreggationType.AGG_10_MINUTES
    Source: SourceType.IP_ADDRESSES_SAMPLE

    Time series
        Time series IDS: [1200], Length=1
    Time periods
        Train time periods: range(0, 30)
        Val time periods: None
        Test time periods: range(30, 80)
        All time periods: range(0, 80)
    Features
        Taken features: ['n_flows', 'n_packets']
        Default values: [ 1. nan]
        Time series ID included: True
        Time included: True    
        Time format: TimeFormat.ID_TIME
    Sliding window
        Sliding window size: None
        Sliding window prediction size: None
        Sliding window step size: 1
        Set shared size: 0
    Fillers
        Filler type: None
    Transformers
        Transformer type: None
    Anomaly handler
        Anomaly handler type: None        
    Batch sizes
        Train batch size: 32
        Val batch size: 64
        Test batch size: 128
     

In [14]:
time_based_dataset.get_train_df(workers=0).iloc[:30]

Unnamed: 0,id_ip,id_time,n_flows,n_packets
0,1200.0,0.0,1.0,
1,1200.0,1.0,1.0,
2,1200.0,2.0,1.0,
3,1200.0,3.0,1.0,
4,1200.0,4.0,1.0,
5,1200.0,5.0,1.0,
6,1200.0,6.0,1.0,
7,1200.0,7.0,1.0,
8,1200.0,8.0,1.0,
9,1200.0,9.0,4.0,4.0


Or later with:

In [15]:
time_based_dataset.update_dataset_config_and_initialize(default_values=[1, None], workers=0)
# Or
time_based_dataset.set_default_values(default_values=[1, None], workers=0)

[2025-08-26 20:06:32,475][cesnet_dataset][INFO] - Re-initialization is required.
[2025-08-26 20:06:32,530][time_config][INFO] - Finalization and validation completed successfully.
[2025-08-26 20:06:32,534][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<?, ?it/s]
[2025-08-26 20:06:32,537][cesnet_dataset][INFO] - Config initialized successfully.
[2025-08-26 20:06:32,537][cesnet_dataset][INFO] - Configuration has been changed successfuly.
[2025-08-26 20:06:32,537][cesnet_dataset][INFO] - Re-initialization is required.
[2025-08-26 20:06:32,590][time_config][INFO] - Finalization and validation completed successfully.
[2025-08-26 20:06:32,593][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<?, ?it/s]
[2025-08-26 20:06:32,596][cesnet_dataset][INFO] - Config initialized successfully.
[2025-08-26 20:06:32,596][cesnet_dataset][INFO] - Configuration has been cha

#### Setting default_values with a dictionary

- The dictionary must contain a key and a value for every feature in `features_to_take`.
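With a dictionary, the key order does not matter because each value is looked up by feature name. The sketch below shows one way to turn such a dictionary into an array ordered like `features_to_take`. `defaults_from_dict` is a hypothetical helper, and raising `KeyError` for a missing feature is an assumption rather than documented library behavior.

```python
import numpy as np


def defaults_from_dict(features_to_take, default_values):
    """Build a default-value array ordered like `features_to_take`."""
    absent = [name for name in features_to_take if name not in default_values]
    if absent:
        raise KeyError(f"no default value given for: {absent}")
    # None means "leave missing", represented here as NaN.
    return np.array(
        [np.nan if default_values[name] is None else default_values[name]
         for name in features_to_take],
        dtype=float,
    )


# Keys are deliberately given in a different order than the features.
ordered = defaults_from_dict(
    ["n_flows", "n_packets"],
    {"n_packets": None, "n_flows": 1},
)
```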

In [16]:
config = TimeBasedConfig(ts_ids=[1200], train_time_period=range(0, 30), test_time_period=range(30, 80), features_to_take=['n_flows', 'n_packets'],
                         default_values={"n_flows" : 1, "n_packets": None})

time_based_dataset.set_dataset_config_and_initialize(config, display_config_details=True, workers=0)

[2025-08-26 20:06:32,602][time_config][INFO] - Quick validation succeeded.
[2025-08-26 20:06:32,656][time_config][INFO] - Finalization and validation completed successfully.
[2025-08-26 20:06:32,658][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<?, ?it/s]
[2025-08-26 20:06:32,661][cesnet_dataset][INFO] - Config initialized successfully.



Config Details
    Used for database: CESNET-TimeSeries24
    Aggregation: AgreggationType.AGG_10_MINUTES
    Source: SourceType.IP_ADDRESSES_SAMPLE

    Time series
        Time series IDS: [1200], Length=1
    Time periods
        Train time periods: range(0, 30)
        Val time periods: None
        Test time periods: range(30, 80)
        All time periods: range(0, 80)
    Features
        Taken features: ['n_flows', 'n_packets']
        Default values: [ 1. nan]
        Time series ID included: True
        Time included: True    
        Time format: TimeFormat.ID_TIME
    Sliding window
        Sliding window size: None
        Sliding window prediction size: None
        Sliding window step size: 1
        Set shared size: 0
    Fillers
        Filler type: None
    Transformers
        Transformer type: None
    Anomaly handler
        Anomaly handler type: None        
    Batch sizes
        Train batch size: 32
        Val batch size: 64
        Test batch size: 128
     

In [17]:
time_based_dataset.get_train_df(workers=0).iloc[:30]

Unnamed: 0,id_ip,id_time,n_flows,n_packets
0,1200.0,0.0,1.0,
1,1200.0,1.0,1.0,
2,1200.0,2.0,1.0,
3,1200.0,3.0,1.0,
4,1200.0,4.0,1.0,
5,1200.0,5.0,1.0,
6,1200.0,6.0,1.0,
7,1200.0,7.0,1.0,
8,1200.0,8.0,1.0,
9,1200.0,9.0,4.0,4.0


Or later with:

In [18]:
time_based_dataset.update_dataset_config_and_initialize(default_values={"n_flows" : 1, "n_packets": None}, workers=0)
# Or
time_based_dataset.set_default_values(default_values={"n_flows" : 1, "n_packets": None}, workers=0)

[2025-08-26 20:06:32,683][cesnet_dataset][INFO] - Re-initialization is required.
[2025-08-26 20:06:32,737][time_config][INFO] - Finalization and validation completed successfully.
[2025-08-26 20:06:32,740][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<00:00, 999.83it/s]
[2025-08-26 20:06:32,742][cesnet_dataset][INFO] - Config initialized successfully.
[2025-08-26 20:06:32,743][cesnet_dataset][INFO] - Configuration has been changed successfuly.
[2025-08-26 20:06:32,743][cesnet_dataset][INFO] - Re-initialization is required.
[2025-08-26 20:06:32,798][time_config][INFO] - Finalization and validation completed successfully.
[2025-08-26 20:06:32,802][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<00:00, 1000.07it/s]
[2025-08-26 20:06:32,804][cesnet_dataset][INFO] - Config initialized successfully.
[2025-08-26 20:06:32,804][cesnet_dataset][INFO] - Configu

### Fillers

- Fillers are implemented as classes.
    - You can write your own or use a built-in one.
- One filler instance is created per time series.
- A filler is applied after the default values and usually overrides them.
- You can change the filler later with `update_dataset_config_and_initialize` or `apply_filler`.

#### Built-in

In [19]:
# Options

FillerType.FORWARD_FILLER
FillerType.LINEAR_INTERPOLATION_FILLER
FillerType.MEAN_FILLER

<FillerType.MEAN_FILLER: 'mean_filler'>

The example below shows how `ForwardFiller` fills missing values. The exception is missing values at the start of the series: there is no earlier value to carry forward, so they keep the value defined by `default_values`.

In [20]:
config = TimeBasedConfig(ts_ids=[1200], train_time_period=range(0, 30), test_time_period=range(30, 80), features_to_take=['n_flows', 'n_packets'],
                         default_values=None, fill_missing_with=FillerType.FORWARD_FILLER)

time_based_dataset.set_dataset_config_and_initialize(config, display_config_details=True, workers=0)

[2025-08-26 20:06:32,814][time_config][INFO] - Quick validation succeeded.
[2025-08-26 20:06:32,867][time_config][INFO] - Finalization and validation completed successfully.
[2025-08-26 20:06:32,871][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<00:00, 1000.07it/s]
[2025-08-26 20:06:32,874][cesnet_dataset][INFO] - Config initialized successfully.



Config Details
    Used for database: CESNET-TimeSeries24
    Aggregation: AgreggationType.AGG_10_MINUTES
    Source: SourceType.IP_ADDRESSES_SAMPLE

    Time series
        Time series IDS: [1200], Length=1
    Time periods
        Train time periods: range(0, 30)
        Val time periods: None
        Test time periods: range(30, 80)
        All time periods: range(0, 80)
    Features
        Taken features: ['n_flows', 'n_packets']
        Default values: [nan nan]
        Time series ID included: True
        Time included: True    
        Time format: TimeFormat.ID_TIME
    Sliding window
        Sliding window size: None
        Sliding window prediction size: None
        Sliding window step size: 1
        Set shared size: 0
    Fillers
        Filler type: forward_filler
    Transformers
        Transformer type: None
    Anomaly handler
        Anomaly handler type: None        
    Batch sizes
        Train batch size: 32
        Val batch size: 64
        Test batch size:

In [21]:
time_based_dataset.get_train_df(workers=0).iloc[:30]

Unnamed: 0,id_ip,id_time,n_flows,n_packets
0,1200.0,0.0,,
1,1200.0,1.0,,
2,1200.0,2.0,,
3,1200.0,3.0,,
4,1200.0,4.0,,
5,1200.0,5.0,,
6,1200.0,6.0,,
7,1200.0,7.0,,
8,1200.0,8.0,,
9,1200.0,9.0,4.0,4.0
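The table above keeps the leading gaps empty because a forward fill has nothing earlier to copy from. A minimal NumPy sketch of that idea follows. `forward_fill` is a hypothetical helper written for illustration, not the library's `ForwardFiller`, and the sample series is made up.

```python
import numpy as np


def forward_fill(column: np.ndarray) -> np.ndarray:
    """Replace each NaN with the most recent non-NaN value before it."""
    positions = np.arange(len(column))
    # Index of the latest observed value at or before each position, -1 if none yet.
    last_seen = np.maximum.accumulate(np.where(np.isnan(column), -1, positions))
    carried = column[np.clip(last_seen, 0, None)]
    # Positions with no earlier observation stay NaN.
    return np.where(last_seen >= 0, carried, np.nan)


series = np.array([np.nan, np.nan, 4.0, np.nan, 6.0, np.nan])
result = forward_fill(series)
```

In `result`, the first two entries remain missing, while the later gaps take the values 4.0 and 6.0 from the observations before them.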


Or later with:

In [22]:
time_based_dataset.update_dataset_config_and_initialize(fill_missing_with=FillerType.FORWARD_FILLER, workers=0)
# Or
time_based_dataset.apply_filler(FillerType.FORWARD_FILLER, workers=0)

[2025-08-26 20:06:32,895][cesnet_dataset][INFO] - Re-initialization is required.
[2025-08-26 20:06:32,948][time_config][INFO] - Finalization and validation completed successfully.
[2025-08-26 20:06:32,952][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<?, ?it/s]
[2025-08-26 20:06:32,954][cesnet_dataset][INFO] - Config initialized successfully.
[2025-08-26 20:06:32,954][cesnet_dataset][INFO] - Configuration has been changed successfuly.
[2025-08-26 20:06:32,954][cesnet_dataset][INFO] - Re-initialization is required.
[2025-08-26 20:06:33,006][time_config][INFO] - Finalization and validation completed successfully.
[2025-08-26 20:06:33,010][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<00:00, 1000.07it/s]
[2025-08-26 20:06:33,012][cesnet_dataset][INFO] - Config initialized successfully.
[2025-08-26 20:06:33,013][cesnet_dataset][INFO] - Configuration ha

#### Custom

You can create a custom filler; it must derive from the `Filler` base class.

In [23]:
class CustomFiller(Filler):
    # Overwrites every missing entry with -1, modifying batch_values in place.
    def fill(self, batch_values: np.ndarray, existing_indices: np.ndarray, missing_indices: np.ndarray, **kwargs):
        batch_values[missing_indices] = -1
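To see what this `fill` method does to a batch, the sketch below calls it directly on a small array. Two things here are assumptions: `missing_indices` is taken to hold the row positions of missing time steps (consistent with the output further down, where both features become -1 in the same rows), and the library's `Filler` base class is replaced by an empty stand-in so the snippet runs without cesnet_tszoo installed.

```python
import numpy as np


class Filler:
    """Empty stand-in for cesnet_tszoo.utils.filler.Filler (assumption)."""


class CustomFiller(Filler):
    def fill(self, batch_values, existing_indices, missing_indices, **kwargs):
        # Write -1 into every missing row, modifying the array in place.
        batch_values[missing_indices] = -1


# Rows are time steps, columns are features (n_flows, n_packets).
batch = np.array([[np.nan, np.nan],
                  [4.0, 4.0],
                  [np.nan, np.nan]])
existing = np.array([1])    # rows that hold observed values
missing = np.array([0, 2])  # rows that are missing

returned = CustomFiller().fill(batch, existing, missing)
```

The method returns nothing; the caller sees the change through `batch` itself, which is why the assignment inside `fill` is enough.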

In [24]:
config = TimeBasedConfig(ts_ids=[1200], train_time_period=range(0, 30), test_time_period=range(30, 80), features_to_take=['n_flows', 'n_packets'],
                         default_values=None, fill_missing_with=CustomFiller)

time_based_dataset.set_dataset_config_and_initialize(config, display_config_details=True, workers=0)

[2025-08-26 20:06:33,022][time_config][INFO] - Quick validation succeeded.
[2025-08-26 20:06:33,077][time_config][INFO] - Finalization and validation completed successfully.
[2025-08-26 20:06:33,081][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<00:00, 1000.31it/s]
[2025-08-26 20:06:33,084][cesnet_dataset][INFO] - Config initialized successfully.



Config Details
    Used for database: CESNET-TimeSeries24
    Aggregation: AgreggationType.AGG_10_MINUTES
    Source: SourceType.IP_ADDRESSES_SAMPLE

    Time series
        Time series IDS: [1200], Length=1
    Time periods
        Train time periods: range(0, 30)
        Val time periods: None
        Test time periods: range(30, 80)
        All time periods: range(0, 80)
    Features
        Taken features: ['n_flows', 'n_packets']
        Default values: [nan nan]
        Time series ID included: True
        Time included: True    
        Time format: TimeFormat.ID_TIME
    Sliding window
        Sliding window size: None
        Sliding window prediction size: None
        Sliding window step size: 1
        Set shared size: 0
    Fillers
        Filler type: CustomFiller (Custom)
    Transformers
        Transformer type: None
    Anomaly handler
        Anomaly handler type: None        
    Batch sizes
        Train batch size: 32
        Val batch size: 64
        Test batc

In [25]:
time_based_dataset.get_train_df(workers=0).iloc[:30]

Unnamed: 0,id_ip,id_time,n_flows,n_packets
0,1200.0,0.0,-1.0,-1.0
1,1200.0,1.0,-1.0,-1.0
2,1200.0,2.0,-1.0,-1.0
3,1200.0,3.0,-1.0,-1.0
4,1200.0,4.0,-1.0,-1.0
5,1200.0,5.0,-1.0,-1.0
6,1200.0,6.0,-1.0,-1.0
7,1200.0,7.0,-1.0,-1.0
8,1200.0,8.0,-1.0,-1.0
9,1200.0,9.0,4.0,4.0


Or later with:

In [26]:
time_based_dataset.update_dataset_config_and_initialize(fill_missing_with=CustomFiller, workers=0)
# Or
time_based_dataset.apply_filler(CustomFiller, workers=0)

[2025-08-26 20:06:33,106][cesnet_dataset][INFO] - Re-initialization is required.
[2025-08-26 20:06:33,164][time_config][INFO] - Finalization and validation completed successfully.
[2025-08-26 20:06:33,167][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<00:00, 1811.01it/s]
[2025-08-26 20:06:33,172][cesnet_dataset][INFO] - Config initialized successfully.
[2025-08-26 20:06:33,173][cesnet_dataset][INFO] - Configuration has been changed successfuly.
[2025-08-26 20:06:33,173][cesnet_dataset][INFO] - Re-initialization is required.
[2025-08-26 20:06:33,229][time_config][INFO] - Finalization and validation completed successfully.
[2025-08-26 20:06:33,232][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<00:00, 999.60it/s]
[2025-08-26 20:06:33,235][cesnet_dataset][INFO] - Config initialized successfully.
[2025-08-26 20:06:33,235][cesnet_dataset][INFO] - Configu

#### Only for TimeBasedCesnetDataset

Values are carried over from train to val to test, as the example below shows.

In [27]:
config = TimeBasedConfig(ts_ids=[1200], train_time_period=range(0, 30), test_time_period=range(30, 80), features_to_take=['n_flows', 'n_packets'],
                         default_values=None, fill_missing_with=FillerType.FORWARD_FILLER)

time_based_dataset.set_dataset_config_and_initialize(config, display_config_details=True, workers=0)

[2025-08-26 20:06:33,240][time_config][INFO] - Quick validation succeeded.
[2025-08-26 20:06:33,292][time_config][INFO] - Finalization and validation completed successfully.
[2025-08-26 20:06:33,296][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<00:00, 995.09it/s]
[2025-08-26 20:06:33,299][cesnet_dataset][INFO] - Config initialized successfully.



Config Details
    Used for database: CESNET-TimeSeries24
    Aggregation: AgreggationType.AGG_10_MINUTES
    Source: SourceType.IP_ADDRESSES_SAMPLE

    Time series
        Time series IDS: [1200], Length=1
    Time periods
        Train time periods: range(0, 30)
        Val time periods: None
        Test time periods: range(30, 80)
        All time periods: range(0, 80)
    Features
        Taken features: ['n_flows', 'n_packets']
        Default values: [nan nan]
        Time series ID included: True
        Time included: True    
        Time format: TimeFormat.ID_TIME
    Sliding window
        Sliding window size: None
        Sliding window prediction size: None
        Sliding window step size: 1
        Set shared size: 0
    Fillers
        Filler type: forward_filler
    Transformers
        Transformer type: None
    Anomaly handler
        Anomaly handler type: None        
    Batch sizes
        Train batch size: 32
        Val batch size: 64
        Test batch size:

In [28]:
time_based_dataset.get_train_df(workers=0).iloc[:30]

Unnamed: 0,id_ip,id_time,n_flows,n_packets
0,1200.0,0.0,,
1,1200.0,1.0,,
2,1200.0,2.0,,
3,1200.0,3.0,,
4,1200.0,4.0,,
5,1200.0,5.0,,
6,1200.0,6.0,,
7,1200.0,7.0,,
8,1200.0,8.0,,
9,1200.0,9.0,4.0,4.0


In the test set, the values of `n_flows` and `n_packets` at the start of the period were carried over from the train set.

In [29]:
time_based_dataset.get_test_df(workers=0).iloc[:30]

Unnamed: 0,id_ip,id_time,n_flows,n_packets
0,1200.0,30.0,6.0,6.0
1,1200.0,31.0,6.0,6.0
2,1200.0,32.0,6.0,6.0
3,1200.0,33.0,6.0,6.0
4,1200.0,34.0,6.0,6.0
5,1200.0,35.0,6.0,6.0
6,1200.0,36.0,6.0,6.0
7,1200.0,37.0,6.0,6.0
8,1200.0,38.0,6.0,6.0
9,1200.0,39.0,6.0,6.0
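The carry-over can be pictured as a filler that keeps its last observed value between splits. The sketch below is an illustration under that assumption, not the library's code: `CarryForward` is a hypothetical class, and the train and test arrays are made-up data for a single feature.

```python
import numpy as np


class CarryForward:
    """Hypothetical forward fill that remembers the last observed value."""

    def __init__(self):
        self.last = np.nan  # nothing observed yet

    def fill(self, column: np.ndarray) -> np.ndarray:
        out = column.copy()
        for i, value in enumerate(out):
            if np.isnan(value):
                out[i] = self.last  # reuse the remembered value
            else:
                self.last = value   # remember the newest observation
        return out


filler = CarryForward()
# The same instance processes the splits in chronological order.
train = filler.fill(np.array([np.nan, 4.0, np.nan, 6.0]))
test = filler.fill(np.array([np.nan, np.nan, 7.0]))
```

Because the instance is reused, the gaps at the start of `test` receive 6.0, the last value seen in `train`, instead of staying missing.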
