# Handling missing data

This notebook uses only `TimeBasedCesnetDataset`, but all methods work almost the same way for `SeriesBasedCesnetDataset`.

### Import

In [1]:
import logging
import numpy as np

from cesnet_tszoo.utils.enums import AgreggationType, SourceType, FillerType
from cesnet_tszoo.datasets import CESNET_TimeSeries24
from cesnet_tszoo.configs import TimeBasedConfig # Time based dataset MUST use TimeBasedConfig

from cesnet_tszoo.utils.filler import Filler # For creating custom Filler

### Setting logger

In [2]:
logging.basicConfig(
    level=logging.INFO,
    format="[%(asctime)s][%(name)s][%(levelname)s] - %(message)s")

### Preparing dataset

In [3]:
time_based_dataset = CESNET_TimeSeries24.get_dataset(data_root="/some_directory/", source_type=SourceType.IP_ADDRESSES_SAMPLE, aggregation=AgreggationType.AGG_10_MINUTES, is_series_based=False, display_details=True)

[2025-04-09 11:44:34,244][wrapper_dataset][INFO] - Dataset is time-based. Use cesnet_tszoo.configs.TimeBasedConfig



Dataset details:

    AgreggationType.AGG_10_MINUTES
        Time indices: range(0, 40297)
        Datetime: (datetime.datetime(2023, 10, 9, 0, 3, 49, tzinfo=datetime.timezone.utc), datetime.datetime(2024, 7, 14, 21, 50, 52, tzinfo=datetime.timezone.utc))

    SourceType.IP_ADDRESSES_SAMPLE
        Time series indices: [ 11  20 101 103 118 ... 2003134 2008461 2011839 2022235 2044888], Length=1000; use 'get_available_ts_indices' for full list
        Features with default values: {'n_flows': 0, 'n_packets': 0, 'n_bytes': 0, 'n_dest_ip': 0, 'n_dest_asn': 0, 'n_dest_ports': 0, 'tcp_udp_ratio_packets': 0.5, 'tcp_udp_ratio_bytes': 0.5, 'dir_ratio_packets': 0.5, 'dir_ratio_bytes': 0.5, 'avg_duration': 0, 'avg_ttl': 0}
        
        Additional data: ['ids_relationship', 'weekends_and_holidays']
        


### Default values

- Missing values are first replaced with default values, before any filler is applied.
- You can change the default values later with `update_dataset_config_and_initialize` or `set_default_values`.
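The first step can be illustrated with a minimal pure-Python sketch. `apply_defaults` is a hypothetical helper, not part of cesnet_tszoo, and plain lists stand in for the arrays the library works with. It only shows the idea: a missing value (NaN) is replaced by the default of its feature, and a default of `None` leaves the gap in place for a filler to handle.

```python
import math

def apply_defaults(rows, defaults):
    """Replace NaN entries, column by column, with the matching default.

    A default of None means "keep the value missing", so it stays NaN.
    """
    filled = []
    for row in rows:
        new_row = []
        for value, default in zip(row, defaults):
            if math.isnan(value) and default is not None:
                new_row.append(float(default))
            else:
                new_row.append(value)
        filled.append(new_row)
    return filled

nan = math.nan
rows = [[nan, nan], [nan, nan], [4.0, 4.0]]  # columns: n_flows, n_packets

with_defaults = apply_defaults(rows, [0, 0])       # gaps become 0.0
left_missing = apply_defaults(rows, [None, None])  # gaps stay NaN
print(with_defaults)
```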

#### Using default

- With `default_values="default"`, the default values are taken from the dataset itself.
- You can view the default value of each feature with `time_based_dataset.display_dataset_details()`.

In [4]:
config = TimeBasedConfig(ts_ids=[1200], train_time_period=range(0, 30), test_time_period=range(30, 80), features_to_take=['n_flows', 'n_packets'],
                         default_values="default")

time_based_dataset.set_dataset_config_and_initialize(config, display_config_details=True, workers=0)

[2025-04-09 11:44:34,249][config][INFO] - Quick validation succeeded.
[2025-04-09 11:44:34,306][config][INFO] - Finalization and validation completed successfully.
[2025-04-09 11:44:34,310][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<00:00, 52.41it/s]
[2025-04-09 11:44:34,331][cesnet_dataset][INFO] - Config initialized successfully.



Config Details
    Used for database: CESNET-TimeSeries24
    Aggregation: AgreggationType.AGG_10_MINUTES
    Source: SourceType.IP_ADDRESSES_SAMPLE

    Time series
        Time series IDS: [1200], Length=1
        Test time series IDS: None
    Time periods
        Train time periods: range(0, 30)
        Val time periods: None
        Test time periods: range(30, 80)
        All time periods: range(0, 80)
    Features
        Taken features: ['n_flows', 'n_packets']
        Default values: [0. 0.]
        Time series ID included: True
        Time included: True    
        Time format: TimeFormat.ID_TIME
    Sliding window
        Sliding window size: None
        Sliding window prediction size: None
        Sliding window step size: 1
        Set shared size: 0
    Fillers
        Filler type: None
    Scalers
        Scaler type: None
    Batch sizes
        Train batch size: 32
        Val batch size: 64
        Test batch size: 128
        All batch size: 128
    Default worke

In [5]:
time_based_dataset.get_train_df(workers=0).iloc[:30]

Unnamed: 0,id_ip,id_time,n_flows,n_packets
0,1200.0,0.0,0.0,0.0
1,1200.0,1.0,0.0,0.0
2,1200.0,2.0,0.0,0.0
3,1200.0,3.0,0.0,0.0
4,1200.0,4.0,0.0,0.0
5,1200.0,5.0,0.0,0.0
6,1200.0,6.0,0.0,0.0
7,1200.0,7.0,0.0,0.0
8,1200.0,8.0,0.0,0.0
9,1200.0,9.0,4.0,4.0


Or later with:

In [6]:
time_based_dataset.update_dataset_config_and_initialize(default_values="default", workers=0)
# Or
time_based_dataset.set_default_values(default_values="default", workers=0)

[2025-04-09 11:44:34,394][cesnet_dataset][INFO] - Re-initialization is required.
[2025-04-09 11:44:34,448][config][INFO] - Finalization and validation completed successfully.
[2025-04-09 11:44:34,453][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<00:00, 995.09it/s]
[2025-04-09 11:44:34,457][cesnet_dataset][INFO] - Config initialized successfully.
[2025-04-09 11:44:34,457][cesnet_dataset][INFO] - Configuration has been changed successfuly.
[2025-04-09 11:44:34,459][cesnet_dataset][INFO] - Re-initialization is required.
[2025-04-09 11:44:34,514][config][INFO] - Finalization and validation completed successfully.
[2025-04-09 11:44:34,518][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<00:00, 998.88it/s]
[2025-04-09 11:44:34,520][cesnet_dataset][INFO] - Config initialized successfully.
[2025-04-09 11:44:34,520][cesnet_dataset][INFO] - Configuration has 

#### Setting default_values as None

In [7]:
config = TimeBasedConfig(ts_ids=[1200], train_time_period=range(0, 30), test_time_period=range(30, 80), features_to_take=['n_flows', 'n_packets'],
                         default_values=None)

time_based_dataset.set_dataset_config_and_initialize(config, display_config_details=True, workers=0)

[2025-04-09 11:44:34,526][config][INFO] - Quick validation succeeded.
[2025-04-09 11:44:34,577][config][INFO] - Finalization and validation completed successfully.
[2025-04-09 11:44:34,580][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<?, ?it/s]
[2025-04-09 11:44:34,583][cesnet_dataset][INFO] - Config initialized successfully.



Config Details
    Used for database: CESNET-TimeSeries24
    Aggregation: AgreggationType.AGG_10_MINUTES
    Source: SourceType.IP_ADDRESSES_SAMPLE

    Time series
        Time series IDS: [1200], Length=1
        Test time series IDS: None
    Time periods
        Train time periods: range(0, 30)
        Val time periods: None
        Test time periods: range(30, 80)
        All time periods: range(0, 80)
    Features
        Taken features: ['n_flows', 'n_packets']
        Default values: [nan nan]
        Time series ID included: True
        Time included: True    
        Time format: TimeFormat.ID_TIME
    Sliding window
        Sliding window size: None
        Sliding window prediction size: None
        Sliding window step size: 1
        Set shared size: 0
    Fillers
        Filler type: None
    Scalers
        Scaler type: None
    Batch sizes
        Train batch size: 32
        Val batch size: 64
        Test batch size: 128
        All batch size: 128
    Default wor

In [8]:
time_based_dataset.get_train_df(workers=0).iloc[:30]

Unnamed: 0,id_ip,id_time,n_flows,n_packets
0,1200.0,0.0,,
1,1200.0,1.0,,
2,1200.0,2.0,,
3,1200.0,3.0,,
4,1200.0,4.0,,
5,1200.0,5.0,,
6,1200.0,6.0,,
7,1200.0,7.0,,
8,1200.0,8.0,,
9,1200.0,9.0,4.0,4.0


Or later with:

In [9]:
time_based_dataset.update_dataset_config_and_initialize(default_values=None, workers=0)
# Or
time_based_dataset.set_default_values(default_values=None, workers=0)

[2025-04-09 11:44:34,607][cesnet_dataset][INFO] - Re-initialization is required.
[2025-04-09 11:44:34,659][config][INFO] - Finalization and validation completed successfully.
[2025-04-09 11:44:34,662][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<00:00, 694.54it/s]
[2025-04-09 11:44:34,664][cesnet_dataset][INFO] - Config initialized successfully.
[2025-04-09 11:44:34,664][cesnet_dataset][INFO] - Configuration has been changed successfuly.
[2025-04-09 11:44:34,665][cesnet_dataset][INFO] - Re-initialization is required.
[2025-04-09 11:44:34,717][config][INFO] - Finalization and validation completed successfully.
[2025-04-09 11:44:34,720][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<?, ?it/s]
[2025-04-09 11:44:34,722][cesnet_dataset][INFO] - Config initialized successfully.
[2025-04-09 11:44:34,723][cesnet_dataset][INFO] - Configuration has been chan

#### Setting default_values with single number

In [10]:
config = TimeBasedConfig(ts_ids=[1200], train_time_period=range(0, 30), test_time_period=range(30, 80), features_to_take=['n_flows', 'n_packets'],
                         default_values=0)

time_based_dataset.set_dataset_config_and_initialize(config, display_config_details=True, workers=0)

[2025-04-09 11:44:34,727][config][INFO] - Quick validation succeeded.
[2025-04-09 11:44:34,779][config][INFO] - Finalization and validation completed successfully.
[2025-04-09 11:44:34,783][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<?, ?it/s]
[2025-04-09 11:44:34,785][cesnet_dataset][INFO] - Config initialized successfully.



Config Details
    Used for database: CESNET-TimeSeries24
    Aggregation: AgreggationType.AGG_10_MINUTES
    Source: SourceType.IP_ADDRESSES_SAMPLE

    Time series
        Time series IDS: [1200], Length=1
        Test time series IDS: None
    Time periods
        Train time periods: range(0, 30)
        Val time periods: None
        Test time periods: range(30, 80)
        All time periods: range(0, 80)
    Features
        Taken features: ['n_flows', 'n_packets']
        Default values: [0. 0.]
        Time series ID included: True
        Time included: True    
        Time format: TimeFormat.ID_TIME
    Sliding window
        Sliding window size: None
        Sliding window prediction size: None
        Sliding window step size: 1
        Set shared size: 0
    Fillers
        Filler type: None
    Scalers
        Scaler type: None
    Batch sizes
        Train batch size: 32
        Val batch size: 64
        Test batch size: 128
        All batch size: 128
    Default worke

In [11]:
time_based_dataset.get_train_df(workers=0).iloc[:30]

Unnamed: 0,id_ip,id_time,n_flows,n_packets
0,1200.0,0.0,0.0,0.0
1,1200.0,1.0,0.0,0.0
2,1200.0,2.0,0.0,0.0
3,1200.0,3.0,0.0,0.0
4,1200.0,4.0,0.0,0.0
5,1200.0,5.0,0.0,0.0
6,1200.0,6.0,0.0,0.0
7,1200.0,7.0,0.0,0.0
8,1200.0,8.0,0.0,0.0
9,1200.0,9.0,4.0,4.0


Or later with:

In [12]:
time_based_dataset.update_dataset_config_and_initialize(default_values=0, workers=0)
# Or
time_based_dataset.set_default_values(default_values=0, workers=0)

[2025-04-09 11:44:34,806][cesnet_dataset][INFO] - Re-initialization is required.
[2025-04-09 11:44:34,856][config][INFO] - Finalization and validation completed successfully.
[2025-04-09 11:44:34,860][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<00:00, 999.60it/s]
[2025-04-09 11:44:34,863][cesnet_dataset][INFO] - Config initialized successfully.
[2025-04-09 11:44:34,863][cesnet_dataset][INFO] - Configuration has been changed successfuly.
[2025-04-09 11:44:34,863][cesnet_dataset][INFO] - Re-initialization is required.
[2025-04-09 11:44:34,917][config][INFO] - Finalization and validation completed successfully.
[2025-04-09 11:44:34,921][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<?, ?it/s]
[2025-04-09 11:44:34,923][cesnet_dataset][INFO] - Config initialized successfully.
[2025-04-09 11:44:34,923][cesnet_dataset][INFO] - Configuration has been chan

#### Setting default_values with list

- The position of each value in the list corresponds to the order of features in `features_to_take`.
- The number of values in the list must equal the number of features used.
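The two rules above can be sketched as follows. `resolve_list_defaults` is a hypothetical helper written for illustration; the library's own validation and error type may differ. It matches values to features by position and turns `None` into NaN.

```python
import math

def resolve_list_defaults(features, values):
    """Pair each feature with the default at the same position."""
    if len(values) != len(features):
        # Assumption: a mismatch is rejected; the real error type is not shown in this notebook.
        raise ValueError(
            f"Expected {len(features)} default values, got {len(values)}"
        )
    return [math.nan if v is None else float(v) for v in values]

features = ["n_flows", "n_packets"]
resolved = resolve_list_defaults(features, [1, None])
print(dict(zip(features, resolved)))
```

Here `n_flows` gets the default 1.0 and `n_packets` stays missing.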

In [13]:
config = TimeBasedConfig(ts_ids=[1200], train_time_period=range(0, 30), test_time_period=range(30, 80), features_to_take=['n_flows', 'n_packets'],
                         default_values=[1, None])

time_based_dataset.set_dataset_config_and_initialize(config, display_config_details=True, workers=0)

[2025-04-09 11:44:34,928][config][INFO] - Quick validation succeeded.
[2025-04-09 11:44:34,979][config][INFO] - Finalization and validation completed successfully.
[2025-04-09 11:44:34,982][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<00:00, 999.12it/s]
[2025-04-09 11:44:34,984][cesnet_dataset][INFO] - Config initialized successfully.



Config Details
    Used for database: CESNET-TimeSeries24
    Aggregation: AgreggationType.AGG_10_MINUTES
    Source: SourceType.IP_ADDRESSES_SAMPLE

    Time series
        Time series IDS: [1200], Length=1
        Test time series IDS: None
    Time periods
        Train time periods: range(0, 30)
        Val time periods: None
        Test time periods: range(30, 80)
        All time periods: range(0, 80)
    Features
        Taken features: ['n_flows', 'n_packets']
        Default values: [ 1. nan]
        Time series ID included: True
        Time included: True    
        Time format: TimeFormat.ID_TIME
    Sliding window
        Sliding window size: None
        Sliding window prediction size: None
        Sliding window step size: 1
        Set shared size: 0
    Fillers
        Filler type: None
    Scalers
        Scaler type: None
    Batch sizes
        Train batch size: 32
        Val batch size: 64
        Test batch size: 128
        All batch size: 128
    Default wor

In [14]:
time_based_dataset.get_train_df(workers=0).iloc[:30]

Unnamed: 0,id_ip,id_time,n_flows,n_packets
0,1200.0,0.0,1.0,
1,1200.0,1.0,1.0,
2,1200.0,2.0,1.0,
3,1200.0,3.0,1.0,
4,1200.0,4.0,1.0,
5,1200.0,5.0,1.0,
6,1200.0,6.0,1.0,
7,1200.0,7.0,1.0,
8,1200.0,8.0,1.0,
9,1200.0,9.0,4.0,4.0


Or later with:

In [15]:
time_based_dataset.update_dataset_config_and_initialize(default_values=[1, None], workers=0)
# Or
time_based_dataset.set_default_values(default_values=[1, None], workers=0)

[2025-04-09 11:44:35,007][cesnet_dataset][INFO] - Re-initialization is required.
[2025-04-09 11:44:35,057][config][INFO] - Finalization and validation completed successfully.
[2025-04-09 11:44:35,060][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<00:00, 995.80it/s]
[2025-04-09 11:44:35,063][cesnet_dataset][INFO] - Config initialized successfully.
[2025-04-09 11:44:35,064][cesnet_dataset][INFO] - Configuration has been changed successfuly.
[2025-04-09 11:44:35,064][cesnet_dataset][INFO] - Re-initialization is required.
[2025-04-09 11:44:35,115][config][INFO] - Finalization and validation completed successfully.
[2025-04-09 11:44:35,118][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<00:00, 999.36it/s]
[2025-04-09 11:44:35,120][cesnet_dataset][INFO] - Config initialized successfully.
[2025-04-09 11:44:35,120][cesnet_dataset][INFO] - Configuration has 

#### Setting default_values with dictionary

- The dictionary must contain a key and a value for every feature in `features_to_take`.
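As a rough sketch of this rule, the hypothetical helper below (not a cesnet_tszoo function) looks up each feature by name, so the order of keys in the dictionary does not matter, and rejects a dictionary that lacks a feature. The error type is an assumption.

```python
import math

def resolve_dict_defaults(features, mapping):
    """Order the defaults by features, looking each one up by name."""
    missing = [name for name in features if name not in mapping]
    if missing:
        # Assumption: an incomplete dictionary is rejected.
        raise ValueError(f"No default value given for: {missing}")
    return [
        math.nan if mapping[name] is None else float(mapping[name])
        for name in features
    ]

features = ["n_flows", "n_packets"]
# Key order differs from the feature order on purpose.
resolved = resolve_dict_defaults(features, {"n_packets": None, "n_flows": 1})
print(resolved)
```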

In [16]:
config = TimeBasedConfig(ts_ids=[1200], train_time_period=range(0, 30), test_time_period=range(30, 80), features_to_take=['n_flows', 'n_packets'],
                         default_values={"n_flows" : 1, "n_packets": None})

time_based_dataset.set_dataset_config_and_initialize(config, display_config_details=True, workers=0)

[2025-04-09 11:44:35,126][config][INFO] - Quick validation succeeded.
[2025-04-09 11:44:35,178][config][INFO] - Finalization and validation completed successfully.
[2025-04-09 11:44:35,180][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<00:00, 997.22it/s]
[2025-04-09 11:44:35,183][cesnet_dataset][INFO] - Config initialized successfully.



Config Details
    Used for database: CESNET-TimeSeries24
    Aggregation: AgreggationType.AGG_10_MINUTES
    Source: SourceType.IP_ADDRESSES_SAMPLE

    Time series
        Time series IDS: [1200], Length=1
        Test time series IDS: None
    Time periods
        Train time periods: range(0, 30)
        Val time periods: None
        Test time periods: range(30, 80)
        All time periods: range(0, 80)
    Features
        Taken features: ['n_flows', 'n_packets']
        Default values: [ 1. nan]
        Time series ID included: True
        Time included: True    
        Time format: TimeFormat.ID_TIME
    Sliding window
        Sliding window size: None
        Sliding window prediction size: None
        Sliding window step size: 1
        Set shared size: 0
    Fillers
        Filler type: None
    Scalers
        Scaler type: None
    Batch sizes
        Train batch size: 32
        Val batch size: 64
        Test batch size: 128
        All batch size: 128
    Default wor

In [17]:
time_based_dataset.get_train_df(workers=0).iloc[:30]

Unnamed: 0,id_ip,id_time,n_flows,n_packets
0,1200.0,0.0,1.0,
1,1200.0,1.0,1.0,
2,1200.0,2.0,1.0,
3,1200.0,3.0,1.0,
4,1200.0,4.0,1.0,
5,1200.0,5.0,1.0,
6,1200.0,6.0,1.0,
7,1200.0,7.0,1.0,
8,1200.0,8.0,1.0,
9,1200.0,9.0,4.0,4.0


Or later with:

In [18]:
time_based_dataset.update_dataset_config_and_initialize(default_values={"n_flows" : 1, "n_packets": None}, workers=0)
# Or
time_based_dataset.set_default_values(default_values={"n_flows" : 1, "n_packets": None}, workers=0)

[2025-04-09 11:44:35,206][cesnet_dataset][INFO] - Re-initialization is required.
[2025-04-09 11:44:35,257][config][INFO] - Finalization and validation completed successfully.
[2025-04-09 11:44:35,260][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<00:00, 1000.31it/s]
[2025-04-09 11:44:35,262][cesnet_dataset][INFO] - Config initialized successfully.
[2025-04-09 11:44:35,262][cesnet_dataset][INFO] - Configuration has been changed successfuly.
[2025-04-09 11:44:35,263][cesnet_dataset][INFO] - Re-initialization is required.
[2025-04-09 11:44:35,314][config][INFO] - Finalization and validation completed successfully.
[2025-04-09 11:44:35,317][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<00:00, 998.17it/s]
[2025-04-09 11:44:35,319][cesnet_dataset][INFO] - Config initialized successfully.
[2025-04-09 11:44:35,319][cesnet_dataset][INFO] - Configuration has

### Fillers

- Fillers are implemented as classes.
    - You can create your own or use a built-in one.
- One filler is created per time series.
- A filler is applied after the default values and usually overrides them.
- You can change the filler later with `update_dataset_config_and_initialize` or `apply_filler`.
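Why a filler can override default values is easiest to see in a small sketch. The assumption here is that the filler is told which positions were originally missing, which the `missing_indices` argument of `Filler.fill` suggests. `mean_fill` below is a generic illustration in plain Python, not the library's `MEAN_FILLER` implementation.

```python
def mean_fill(values, missing_indices):
    """Overwrite the positions in missing_indices with the mean of the rest."""
    missing = set(missing_indices)
    observed = [v for i, v in enumerate(values) if i not in missing]
    if not observed:
        # Nothing to average, so the default values stay in place.
        return list(values)
    mean = sum(observed) / len(observed)
    return [mean if i in missing else v for i, v in enumerate(values)]

# Positions 0 and 1 were missing and already hold the default value 0.0.
after_defaults = [0.0, 0.0, 4.0, 6.0]
after_filler = mean_fill(after_defaults, missing_indices=[0, 1])
print(after_filler)
```

The early return shows one case where the default values survive: the filler has nothing to work with.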

#### Built-in

In [19]:
# Options

FillerType.FORWARD_FILLER
FillerType.LINEAR_INTERPOLATION_FILLER
FillerType.MEAN_FILLER

<FillerType.MEAN_FILLER: 'mean_filler'>

The example below shows how `ForwardFiller` fills missing values. Missing values at the beginning of the series have no earlier value to carry forward, so they keep whatever `default_values` specifies (here `None`, so they stay NaN).

In [20]:
config = TimeBasedConfig(ts_ids=[1200], train_time_period=range(0, 30), test_time_period=range(30, 80), features_to_take=['n_flows', 'n_packets'],
                         default_values=None, fill_missing_with=FillerType.FORWARD_FILLER)

time_based_dataset.set_dataset_config_and_initialize(config, display_config_details=True, workers=0)

[2025-04-09 11:44:35,330][config][INFO] - Quick validation succeeded.
[2025-04-09 11:44:35,382][config][INFO] - Finalization and validation completed successfully.
[2025-04-09 11:44:35,385][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<00:00, 1001.03it/s]
[2025-04-09 11:44:35,387][cesnet_dataset][INFO] - Config initialized successfully.



Config Details
    Used for database: CESNET-TimeSeries24
    Aggregation: AgreggationType.AGG_10_MINUTES
    Source: SourceType.IP_ADDRESSES_SAMPLE

    Time series
        Time series IDS: [1200], Length=1
        Test time series IDS: None
    Time periods
        Train time periods: range(0, 30)
        Val time periods: None
        Test time periods: range(30, 80)
        All time periods: range(0, 80)
    Features
        Taken features: ['n_flows', 'n_packets']
        Default values: [nan nan]
        Time series ID included: True
        Time included: True    
        Time format: TimeFormat.ID_TIME
    Sliding window
        Sliding window size: None
        Sliding window prediction size: None
        Sliding window step size: 1
        Set shared size: 0
    Fillers
        Filler type: forward_filler
    Scalers
        Scaler type: None
    Batch sizes
        Train batch size: 32
        Val batch size: 64
        Test batch size: 128
        All batch size: 128
    D

In [21]:
time_based_dataset.get_train_df(workers=0).iloc[:30]

Unnamed: 0,id_ip,id_time,n_flows,n_packets
0,1200.0,0.0,,
1,1200.0,1.0,,
2,1200.0,2.0,,
3,1200.0,3.0,,
4,1200.0,4.0,,
5,1200.0,5.0,,
6,1200.0,6.0,,
7,1200.0,7.0,,
8,1200.0,8.0,,
9,1200.0,9.0,4.0,4.0
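The behaviour in this output can be mimicked with a minimal forward-fill sketch. This is a generic pure-Python illustration of the technique, not the library's `ForwardFiller` code; the `default` parameter is a hypothetical stand-in for `default_values`.

```python
import math

def forward_fill(values, default=math.nan):
    """Carry the last observed value forward; leading gaps get `default`."""
    filled = []
    last_seen = None
    for value in values:
        if math.isnan(value):
            # No earlier observation yet: fall back to the default.
            filled.append(last_seen if last_seen is not None else default)
        else:
            last_seen = value
            filled.append(value)
    return filled

nan = math.nan
series = [nan, nan, 4.0, nan, nan, 6.0, nan]

filled = forward_fill(series)                     # leading gaps stay NaN
filled_zero = forward_fill(series, default=0.0)   # leading gaps become 0.0
print(filled_zero)
```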


Or later with:

In [22]:
time_based_dataset.update_dataset_config_and_initialize(fill_missing_with=FillerType.FORWARD_FILLER, workers=0)
# Or
time_based_dataset.apply_filler(FillerType.FORWARD_FILLER, workers=0)

[2025-04-09 11:44:35,460][cesnet_dataset][INFO] - Re-initialization is required.
[2025-04-09 11:44:35,512][config][INFO] - Finalization and validation completed successfully.
[2025-04-09 11:44:35,516][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<00:00, 997.93it/s]
[2025-04-09 11:44:35,518][cesnet_dataset][INFO] - Config initialized successfully.
[2025-04-09 11:44:35,519][cesnet_dataset][INFO] - Configuration has been changed successfuly.
[2025-04-09 11:44:35,519][cesnet_dataset][INFO] - Re-initialization is required.
[2025-04-09 11:44:35,573][config][INFO] - Finalization and validation completed successfully.
[2025-04-09 11:44:35,576][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<?, ?it/s]
[2025-04-09 11:44:35,579][cesnet_dataset][INFO] - Config initialized successfully.
[2025-04-09 11:44:35,579][cesnet_dataset][INFO] - Configuration has been chan

#### Custom

You can create your own custom filler, which must derive from the `Filler` base class.

In [23]:
class CustomFiller(Filler):
    def fill(self, batch_values: np.ndarray, existing_indices: np.ndarray, missing_indices: np.ndarray, **kwargs):
        # Overwrite every missing position in place with the sentinel value -1
        batch_values[missing_indices] = -1

In [24]:
config = TimeBasedConfig(ts_ids=[1200], train_time_period=range(0, 30), test_time_period=range(30, 80), features_to_take=['n_flows', 'n_packets'],
                         default_values=None, fill_missing_with=CustomFiller)

time_based_dataset.set_dataset_config_and_initialize(config, display_config_details=True, workers=0)

[2025-04-09 11:44:35,589][config][INFO] - Quick validation succeeded.
[2025-04-09 11:44:35,640][config][INFO] - Finalization and validation completed successfully.
[2025-04-09 11:44:35,644][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<00:00, 999.36it/s]
[2025-04-09 11:44:35,646][cesnet_dataset][INFO] - Config initialized successfully.



Config Details
    Used for database: CESNET-TimeSeries24
    Aggregation: AgreggationType.AGG_10_MINUTES
    Source: SourceType.IP_ADDRESSES_SAMPLE

    Time series
        Time series IDS: [1200], Length=1
        Test time series IDS: None
    Time periods
        Train time periods: range(0, 30)
        Val time periods: None
        Test time periods: range(30, 80)
        All time periods: range(0, 80)
    Features
        Taken features: ['n_flows', 'n_packets']
        Default values: [nan nan]
        Time series ID included: True
        Time included: True    
        Time format: TimeFormat.ID_TIME
    Sliding window
        Sliding window size: None
        Sliding window prediction size: None
        Sliding window step size: 1
        Set shared size: 0
    Fillers
        Filler type: CustomFiller (Custom)
    Scalers
        Scaler type: None
    Batch sizes
        Train batch size: 32
        Val batch size: 64
        Test batch size: 128
        All batch size: 12

In [25]:
time_based_dataset.get_train_df(workers=0).iloc[:30]

Unnamed: 0,id_ip,id_time,n_flows,n_packets
0,1200.0,0.0,-1.0,-1.0
1,1200.0,1.0,-1.0,-1.0
2,1200.0,2.0,-1.0,-1.0
3,1200.0,3.0,-1.0,-1.0
4,1200.0,4.0,-1.0,-1.0
5,1200.0,5.0,-1.0,-1.0
6,1200.0,6.0,-1.0,-1.0
7,1200.0,7.0,-1.0,-1.0
8,1200.0,8.0,-1.0,-1.0
9,1200.0,9.0,4.0,4.0


Or later with:

In [26]:
time_based_dataset.update_dataset_config_and_initialize(fill_missing_with=CustomFiller, workers=0)
# Or
time_based_dataset.apply_filler(CustomFiller, workers=0)

[2025-04-09 11:44:35,667][cesnet_dataset][INFO] - Re-initialization is required.
[2025-04-09 11:44:35,718][config][INFO] - Finalization and validation completed successfully.
[2025-04-09 11:44:35,720][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<00:00, 996.27it/s]
[2025-04-09 11:44:35,723][cesnet_dataset][INFO] - Config initialized successfully.
[2025-04-09 11:44:35,724][cesnet_dataset][INFO] - Configuration has been changed successfuly.
[2025-04-09 11:44:35,724][cesnet_dataset][INFO] - Re-initialization is required.
[2025-04-09 11:44:35,775][config][INFO] - Finalization and validation completed successfully.
[2025-04-09 11:44:35,778][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<00:00, 998.64it/s]
[2025-04-09 11:44:35,781][cesnet_dataset][INFO] - Config initialized successfully.
[2025-04-09 11:44:35,781][cesnet_dataset][INFO] - Configuration has 

#### Only for TimeBasedCesnetDataset

Values are carried over between sets in the order train -> val -> test, so a value observed in one set can fill gaps in the next. See the example below.

In [27]:
config = TimeBasedConfig(ts_ids=[1200], train_time_period=range(0, 30), test_time_period=range(30, 80), features_to_take=['n_flows', 'n_packets'],
                         default_values=None, fill_missing_with=FillerType.FORWARD_FILLER)

time_based_dataset.set_dataset_config_and_initialize(config, display_config_details=True, workers=0)

[2025-04-09 11:44:35,786][config][INFO] - Quick validation succeeded.
[2025-04-09 11:44:35,836][config][INFO] - Finalization and validation completed successfully.
[2025-04-09 11:44:35,839][cesnet_dataset][INFO] - Updating config on train/val/test/all and selected time series.
100%|██████████| 1/1 [00:00<00:00, 1000.31it/s]
[2025-04-09 11:44:35,842][cesnet_dataset][INFO] - Config initialized successfully.



Config Details
    Used for database: CESNET-TimeSeries24
    Aggregation: AgreggationType.AGG_10_MINUTES
    Source: SourceType.IP_ADDRESSES_SAMPLE

    Time series
        Time series IDS: [1200], Length=1
        Test time series IDS: None
    Time periods
        Train time periods: range(0, 30)
        Val time periods: None
        Test time periods: range(30, 80)
        All time periods: range(0, 80)
    Features
        Taken features: ['n_flows', 'n_packets']
        Default values: [nan nan]
        Time series ID included: True
        Time included: True    
        Time format: TimeFormat.ID_TIME
    Sliding window
        Sliding window size: None
        Sliding window prediction size: None
        Sliding window step size: 1
        Set shared size: 0
    Fillers
        Filler type: forward_filler
    Scalers
        Scaler type: None
    Batch sizes
        Train batch size: 32
        Val batch size: 64
        Test batch size: 128
        All batch size: 128
    D

In [28]:
time_based_dataset.get_train_df(workers=0).iloc[:30]

Unnamed: 0,id_ip,id_time,n_flows,n_packets
0,1200.0,0.0,,
1,1200.0,1.0,,
2,1200.0,2.0,,
3,1200.0,3.0,,
4,1200.0,4.0,,
5,1200.0,5.0,,
6,1200.0,6.0,,
7,1200.0,7.0,,
8,1200.0,8.0,,
9,1200.0,9.0,4.0,4.0


You can see that the values for `n_flows` and `n_packets` were carried over from train to test.

In [29]:
time_based_dataset.get_test_df(workers=0).iloc[:30]

Unnamed: 0,id_ip,id_time,n_flows,n_packets
0,1200.0,30.0,6.0,6.0
1,1200.0,31.0,6.0,6.0
2,1200.0,32.0,6.0,6.0
3,1200.0,33.0,6.0,6.0
4,1200.0,34.0,6.0,6.0
5,1200.0,35.0,6.0,6.0
6,1200.0,36.0,6.0,6.0
7,1200.0,37.0,6.0,6.0
8,1200.0,38.0,6.0,6.0
9,1200.0,39.0,6.0,6.0
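The carry-over seen above can be sketched with a filler object that keeps its state between calls. `CarryOverForwardFiller` is a hypothetical stand-in written in plain Python; it does not reproduce the library's classes, only the idea that the last value observed in train is still available when test is filled.

```python
import math

class CarryOverForwardFiller:
    """Forward fill that remembers the last observed value across calls."""

    def __init__(self):
        self.last_seen = None

    def fill(self, values):
        filled = []
        for value in values:
            if math.isnan(value):
                # Use the remembered value, even if it came from an earlier set.
                filled.append(
                    self.last_seen if self.last_seen is not None else math.nan
                )
            else:
                self.last_seen = value
                filled.append(value)
        return filled

nan = math.nan
filler = CarryOverForwardFiller()  # one instance for the whole time series

train = filler.fill([nan, 4.0, nan, 6.0])
test = filler.fill([nan, nan, 7.0])  # starts with gaps, filled from train
print(train, test)
```

Creating a new filler instance for the test set would instead leave its leading gaps unfilled.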
