diff --git a/getting_started/index.html b/getting_started/index.html index 23fb7fc..f6d4f93 100755 --- a/getting_started/index.html +++ b/getting_started/index.html @@ -707,6 +707,7 @@

Jupyter notebooks

Code snippets

Download a dataset and compute statistics

diff --git a/search/search_index.json b/search/search_index.json index 34c6d9c..bdcb40c 100755 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"CESNET DataZoo","text":"

This is the documentation of the CESNET DataZoo project.

The goal of this project is to provide tools for working with large network traffic datasets and to facilitate research in the traffic classification area. The core functions of the cesnet-datazoo package are:

"},{"location":"#papers","title":"Papers","text":""},{"location":"dataloaders/","title":"Using dataloaders","text":"

Apart from loading data into dataframes, the cesnet-datazoo package provides dataloaders for processing data in smaller batches.

An example of how dataloaders can be used is in cesnet_datazoo.datasets.loaders or in the following snippet:

import numpy as np\nimport pandas as pd\nfrom torch.utils.data import DataLoader\n\ndef load_from_dataloader(dataloader: DataLoader):\n    other_fields = []\n    data_ppi = []\n    data_flowstats = []\n    labels = []\n    for batch_other_fields, batch_ppi, batch_flowstats, batch_labels in dataloader:\n        other_fields.append(batch_other_fields)\n        data_ppi.append(batch_ppi)\n        data_flowstats.append(batch_flowstats)\n        labels.append(batch_labels)\n    df_other_fields = pd.concat(other_fields, ignore_index=True)\n    data_ppi = np.concatenate(data_ppi)\n    data_flowstats = np.concatenate(data_flowstats)\n    labels = np.concatenate(labels)\n    return df_other_fields, data_ppi, data_flowstats, labels\n

When a dataloader is iterated, each batch is returned as a tuple (batch_other_fields, batch_ppi, batch_flowstats, batch_labels). The batch size B is configured with the batch_size and test_batch_size config options.
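The shapes of the returned batches depend on the selected features and transforms. The following is a minimal sketch for inspecting one batch, assuming a dataset that has already been initialized with set_dataset_config_and_initialize (as on the getting started page) and the default numpy outputs:

train_dataloader = dataset.get_train_dataloader()\nbatch_other_fields, batch_ppi, batch_flowstats, batch_labels = next(iter(train_dataloader))\nprint(type(batch_other_fields))   # other fields per batch, concatenable with pd.concat as in the snippet above\nprint(batch_ppi.shape)            # per-packet information for B flows\nprint(batch_flowstats.shape)      # flow statistics features for B flows\nprint(batch_labels.shape)         # (B,) encoded class labels\n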

PPI and flow statistics features returned from dataloaders are transformed depending on the selected configuration. See the transforms page for more information.

"},{"location":"dataset_metadata/","title":"DatasetMetadata","text":"

Each dataset class has its metadata available as a DatasetMetadata instance in the metadata attribute.
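For example, assuming the dataset has been downloaded as in the getting started snippets, a few metadata fields can be read directly from the metadata attribute (the attribute names below appear in the dataset source code):

from cesnet_datazoo.datasets import CESNET_QUIC22\n\ndataset = CESNET_QUIC22(\"/datasets/CESNET-QUIC22/\", size=\"XS\")\nprint(dataset.metadata.protocol)\nprint(dataset.metadata.application_count)\nprint(dataset.metadata.background_traffic_classes)\nprint(dataset.metadata.available_dataset_sizes)\n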

"},{"location":"dataset_metadata/#metadata","title":"Metadata","text":"Name CESNET-TLS22 CESNET-QUIC22 CESNET-TLS-Year22 Protocol TLS QUIC TLS Published in 2022 2023 2023 Collected in 2021 2022 2022 Collection duration 2 weeks 4 weeks 1 year Available samples 141392195 153226273 507739073 Available dataset sizes XS, S, M, L XS, S, M, L XS, S, M, L Collection period 4.10.2021 - 17.10.2021 31.10.2022 - 27.11.2022 1.1.2022 - 31.12.2022 Missing dates in collection period 20220128, 20220129, 20220130, 20221212, 20221213, 20221229, 20221230, 20221231 Application count 191 102 180 Background traffic classes default-background, google-background, facebook-background PPI features IPT, DIR, SIZE IPT, DIR, SIZE IPT, DIR, SIZE, PUSH_FLAG Flowstats features BYTES, BYTES_REV, PACKETS, PACKETS_REV, DURATION, PPI_LEN, PPI_ROUNDTRIPS, PPI_DURATION BYTES, BYTES_REV, PACKETS, PACKETS_REV, DURATION, PPI_LEN, PPI_ROUNDTRIPS, PPI_DURATION BYTES, BYTES_REV, PACKETS, PACKETS_REV, DURATION, PPI_LEN, PPI_ROUNDTRIPS, PPI_DURATION Flowstats features boolean FLOW_ENDREASON_IDLE, FLOW_ENDREASON_ACTIVE, FLOW_ENDREASON_OTHER FLOW_ENDREASON_IDLE, FLOW_ENDREASON_ACTIVE, FLOW_ENDREASON_END, FLOW_ENDREASON_OTHER Packet histograms PHIST_SRC_SIZES, PHIST_DST_SIZES, PHIST_SRC_IPT, PHIST_DST_IPT PHIST_SRC_SIZES, PHIST_DST_SIZES, PHIST_SRC_IPT, PHIST_DST_IPT TCP features FLAG_CWR, FLAG_CWR_REV, FLAG_ECE, FLAG_ECE_REV, FLAG_URG, FLAG_URG_REV, FLAG_ACK, FLAG_ACK_REV, FLAG_PSH, FLAG_PSH_REV, FLAG_RST, FLAG_RST_REV, FLAG_SYN, FLAG_SYN_REV, FLAG_FIN, FLAG_FIN_REV FLAG_CWR, FLAG_CWR_REV, FLAG_ECE, FLAG_ECE_REV, FLAG_URG, FLAG_URG_REV, FLAG_ACK, FLAG_ACK_REV, FLAG_PSH, FLAG_PSH_REV, FLAG_RST, FLAG_RST_REV, FLAG_SYN, FLAG_SYN_REV, FLAG_FIN, FLAG_FIN_REV Other fields ID ID, SRC_IP, DST_IP, DST_ASN, SRC_PORT, DST_PORT, PROTOCOL, QUIC_VERSION, QUIC_SNI, QUIC_USERAGENT, TIME_FIRST, TIME_LAST ID, SRC_IP, DST_IP, DST_ASN, DST_PORT, PROTOCOL, TLS_SNI, TLS_JA3, TIME_FIRST, TIME_LAST Cite https://doi.org/10.1016/j.comnet.2022.109467 https://doi.org/10.1016/j.dib.2023.108888 Zenodo URL https://zenodo.org/record/7965515 https://zenodo.org/record/7963302 Related papers https://doi.org/10.23919/TMA58422.2023.10199052"},{"location":"datasets_overview/","title":"Overview of datasets","text":""},{"location":"datasets_overview/#cesnet-tls22","title":"CESNET-TLS22","text":"

CESNET-TLS22

This dataset was published in \"Fine-grained TLS services classification with reject option\" (DOI, arXiv). It was built from live traffic collected using high-speed monitoring probes at the perimeter of the CESNET2 network.

For detailed information about the dataset, see the linked paper and the dataset metadata page.

"},{"location":"datasets_overview/#cesnet-quic22","title":"CESNET-QUIC22","text":"

CESNET-QUIC22

This dataset was published in \"CESNET-QUIC22: A large one-month QUIC network traffic dataset from backbone lines\" (DOI). The QUIC protocol has the potential to replace TLS over TCP as the standard protocol for reliable and secure Internet communication. Due to its design, which makes the inspection of connection handshakes challenging, and its usage in HTTP/3, there is an increasing demand for QUIC traffic classification methods.

For detailed information about the dataset, see the linked paper and the dataset metadata page. Experiments based on this dataset were published in \"Encrypted traffic classification: the QUIC case\" (DOI).

"},{"location":"datasets_overview/#cesnet-tls-year22","title":"CESNET-TLS-Year22","text":"

CESNET-TLS-Year22

This dataset is similar to CESNET-TLS22; however, it spans the entire year 2022. It will be published in the near future.

"},{"location":"features/","title":"Features","text":"

This page provides a description of individual data features in the datasets. Features available in each dataset are listed on the dataset metadata page.

"},{"location":"features/#ppi-sequence","title":"PPI sequence","text":"

A per-packet information (PPI) sequence is a 2D matrix describing the first 30 packets of a flow. For flows shorter than 30 packets, the PPI sequence is padded with zeros. Set use_push_flags to include PUSH flags in PPI sequences, if available in the dataset.
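As an illustration only (the real matrices are produced by the package, and the row order here is just for the example), a zero-padded PPI sequence of a 4-packet flow with the IPT, DIR, and SIZE features could be built like this:

import numpy as np\n\nPPI_MAX_LEN = 30               # the datasets describe the first 30 packets\nipt = [0, 12, 3, 40]           # inter-packet times in ms, first packet set to zero\ndirections = [1, -1, -1, 1]    # packet directions encoded as +1/-1\nsizes = [517, 1350, 1350, 93]  # transport payload sizes in bytes\nppi = np.zeros((3, PPI_MAX_LEN), dtype=np.int64)\nppi[0, :len(ipt)] = ipt\nppi[1, :len(directions)] = directions\nppi[2, :len(sizes)] = sizes    # the remaining columns stay as zero padding\n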

| Name | Description |
| --- | --- |
| SIZE | Size of the transport payload |
| IPT | Inter-packet time in milliseconds. The IPT of the first packet is set to zero |
| DIR | Direction of the packet encoded as \u00b11 |
| PUSH_FLAG | Whether the push flag was set in the TCP packet |
"},{"location":"features/#flow-statistics","title":"Flow statistics","text":"

Flow statistics are standard features describing the entire flow (with the exception of the PPI_ features, which relate to the PPI sequence of the given flow). _REV features correspond to the reverse (server to client) direction.

| Name | Description |
| --- | --- |
| DURATION | Duration of the flow in seconds |
| BYTES | Number of transmitted bytes from client to server |
| BYTES_REV | Number of transmitted bytes from server to client |
| PACKETS | Number of packets transmitted from client to server |
| PACKETS_REV | Number of packets transmitted from server to client |
| PPI_LEN | Number of packets in the PPI sequence |
| PPI_DURATION | Duration of the PPI sequence in seconds |
| PPI_ROUNDTRIPS | Number of roundtrips in the PPI sequence |
| FLOW_ENDREASON_IDLE | Flow was terminated because it was idle |
| FLOW_ENDREASON_ACTIVE | Flow was terminated because it reached the active timeout |
| FLOW_ENDREASON_OTHER | Flow was terminated for other reasons |
"},{"location":"features/#packet-histograms","title":"Packet histograms","text":"

Packet histograms include binned counts of packet sizes and inter-packet times of the entire flow. There are 8 bins with a logarithmic scale; the intervals are 0\u201315, 16\u201331, 32\u201363, 64\u2013127, 128\u2013255, 256\u2013511, 512\u20131024, >1024 [ms or B]. The units are milliseconds for inter-packet times and bytes for packet sizes. The histograms are built from all packets of the entire flow, unlike PPI sequences, which describe the first 30 packets. Set use_packet_histograms to use packet histogram features, if available in the dataset.

| Name | Description |
| --- | --- |
| PSIZE_BIN{x} | Packet sizes histogram x-th bin for the forward direction |
| PSIZE_BIN{x}_REV | Packet sizes histogram x-th bin for the reverse direction |
| IPT_BIN{x} | Inter-packet times histogram x-th bin for the forward direction |
| IPT_BIN{x}_REV | Inter-packet times histogram x-th bin for the reverse direction |

On the dataset metadata page, packet histogram features are called PHIST_SRC_SIZES, PHIST_DST_SIZES, PHIST_SRC_IPT, PHIST_DST_IPT. Those are the names of database columns that are flattened to the _BIN{x} features.
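The following sketch reproduces the binning described above for packet sizes (an illustration, not the ipfixprobe implementation); the same logarithmic bins apply to inter-packet times in milliseconds, and the _BIN{x} column numbering in the datasets may be offset from the 0-based indices used here:

import numpy as np\n\n# upper bin edges for 0-15, 16-31, 32-63, 64-127, 128-255, 256-511, 512-1024, >1024\nBIN_EDGES = [16, 32, 64, 128, 256, 512, 1025]\npacket_sizes = np.array([15, 40, 90, 256, 600, 1350])\nbin_indices = np.digitize(packet_sizes, BIN_EDGES)  # 0-based bin index per packet\nhistogram = np.bincount(bin_indices, minlength=8)   # counts per bin, cf. the PSIZE_BIN{x} features\n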

"},{"location":"features/#tcp-features","title":"TCP features","text":"

Datasets with TLS over TCP traffic contain features indicating the presence of individual TCP flags in the flow. Set use_tcp_features to use a subset of flags defined in cesnet_datazoo.constants.SELECTED_TCP_FLAGS.

| Name | Description |
| --- | --- |
| FLAG_{F} | Whether F flag was present in the forward (client to server) direction |
| FLAG_{F}_REV | Whether F flag was present in the reverse (server to client) direction |
"},{"location":"features/#other-fields","title":"Other fields","text":"

Datasets contain auxiliary information about samples, such as communicating hosts, flow times, and more fields extracted from the ClientHello message. The dataset metadata page lists available fields in individual datasets. Set return_other_fields to include those fields in returned dataframes. See using dataloaders for how other fields are handled in dataloaders.
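The snippet below sketches how these optional features might be enabled together. It assumes that the option names used on this page (use_packet_histograms, use_tcp_features, use_push_flags, return_other_fields) are DatasetConfig fields; check the config reference for the exact signature. Note that use_tcp_features and use_push_flags apply only to the TLS over TCP datasets.

from cesnet_datazoo.datasets import CESNET_QUIC22\nfrom cesnet_datazoo.config import DatasetConfig, AppSelection\n\ndataset = CESNET_QUIC22(\"/datasets/CESNET-QUIC22/\", size=\"XS\")\ndataset_config = DatasetConfig(\n    dataset=dataset,\n    apps_selection=AppSelection.ALL_KNOWN,\n    train_period_name=\"W-2022-44\",\n    test_period_name=\"W-2022-45\",\n    use_packet_histograms=True,  # adds the PSIZE_BIN{x} and IPT_BIN{x} features\n    return_other_fields=True,    # adds ID, SRC_IP, QUIC_SNI, ... to returned dataframes\n)\ndataset.set_dataset_config_and_initialize(dataset_config)\ntrain_dataframe = dataset.get_train_df(flatten_ppi=True)\n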

| Name | Description |
| --- | --- |
| ID | Per-dataset unique flow identifier |
| TIME_FIRST | Timestamp of the first packet |
| TIME_LAST | Timestamp of the last packet |
| SRC_IP | Source IP address |
| DST_IP | Destination IP address |
| DST_ASN | Destination Autonomous System number |
| SRC_PORT | Source port |
| DST_PORT | Destination port |
| PROTOCOL | Transport protocol |
| TLS_SNI / QUIC_SNI | Server Name Indication domain |
| TLS_JA3 | JA3 fingerprint |
| QUIC_VERSION | QUIC protocol version |
| QUIC_USER_AGENT | User agent string if available in the QUIC Initial Packet |
"},{"location":"features/#details-about-packet-histograms-and-ppi","title":"Details about packet histograms and PPI","text":"

Due to implementation differences between the packet sequences (pstats.cpp) and packet histograms (phist.cpp) plugins of the ipfixprobe exporter, the number of packets counted in the PPI sequence and in the packet histograms can differ (even for flows shorter than 30 packets). The differences are summarized in the following table. Note that this applies to TLS over TCP datasets.

| TLS over TCP datasets | Packet histograms | PPI sequence | PACKETS and PACKETS_REV |
| --- | --- | --- | --- |
| Zero-length packets (without L4 payload, e.g. ACKs) | Not included | Not included | Included |
| Retransmissions (and out-of-order packets) | Included | Not included* | Included |
| Computed from | Entire flow | First 30 packets | Entire flow |

*The implementation for the detection of TCP retransmissions and out-of-order packets is far from perfect. Packets with a non-increasing SEQ number are skipped.
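A rough sketch of that simplification in illustrative Python (not the actual pstats.cpp logic): within each direction, a packet is kept for the PPI sequence only if its TCP sequence number increases.

def keep_increasing_seq(packets):\n    # illustrative only: packets is an iterable of (direction, tcp_seq) tuples\n    last_seq = {}\n    kept = []\n    for direction, seq in packets:\n        if direction in last_seq and seq <= last_seq[direction]:\n            continue  # treated as a retransmission or out-of-order packet and skipped\n        last_seq[direction] = seq\n        kept.append((direction, seq))\n    return kept\n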

For QUIC, there is no detection of retransmissions or out-of-order packets, and QUIC acknowledgment packets are included in both packet sequences and packet histograms.

"},{"location":"getting_started/","title":"Getting started","text":""},{"location":"getting_started/#jupyter-notebooks","title":"Jupyter notebooks","text":"

Example Jupyter notebooks are provided at https://github.com/CESNET/cesnet-tcexamples. Start with:

"},{"location":"getting_started/#code-snippets","title":"Code snippets","text":""},{"location":"getting_started/#download-a-dataset-and-compute-statistics","title":"Download a dataset and compute statistics","text":"

from cesnet_datazoo.datasets import CESNET_QUIC22\ndataset = CESNET_QUIC22(\"/datasets/CESNET-QUIC22/\", size=\"XS\")\ndataset.compute_dataset_statistics(num_samples=100_000, num_workers=0)\n
This will download the dataset, compute dataset statistics, and save them into the statistics folder inside the dataset's data root (here /datasets/CESNET-QUIC22/XS/statistics, because each dataset size has its own subfolder).

"},{"location":"getting_started/#enable-logging-and-set-the-spawn-method-on-windows","title":"Enable logging and set the spawn method on Windows","text":"

import logging\nimport multiprocessing as mp\n\nmp.set_start_method(\"spawn\") \nlogging.basicConfig(\n    level=logging.INFO,\n    format=\"[%(asctime)s][%(name)s][%(levelname)s] - %(message)s\")\n
For running on Windows, we recommend using the spawn method for creating dataloader worker processes. Set up logging to get more information from the package.

"},{"location":"getting_started/#initialize-dataset-to-create-train-validation-and-test-dataframes","title":"Initialize dataset to create train, validation, and test dataframes","text":"
from cesnet_datazoo.datasets import CESNET_QUIC22\nfrom cesnet_datazoo.config import DatasetConfig, AppSelection\n\ndataset = CESNET_QUIC22(\"/datasets/CESNET-QUIC22/\", size=\"XS\")\ndataset_config = DatasetConfig(\n    dataset=dataset,\n    apps_selection=AppSelection.ALL_KNOWN,\n    train_period_name=\"W-2022-44\",\n    test_period_name=\"W-2022-45\",\n)\ndataset.set_dataset_config_and_initialize(dataset_config)\ntrain_dataframe = dataset.get_train_df()\nval_dataframe = dataset.get_val_df()\ntest_dataframe = dataset.get_test_df()\n

The DatasetConfig class handles the configuration of datasets, and calling set_dataset_config_and_initialize initializes train, validation, and test sets with the desired configuration. Data can be read into Pandas DataFrames as shown here or via PyTorch DataLoaders. See CesnetDataset reference.
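As a quick sanity check, a baseline classifier can be trained directly on the returned dataframes. This is a sketch under a few assumptions: it reuses the dataset initialized in the snippet above with the default configuration (no other fields returned), it assumes the label column name is exposed as cesnet_datazoo.constants.APP_COLUMN, and it uses scikit-learn (already a dependency of the package) as the model.

from sklearn.ensemble import RandomForestClassifier\nfrom sklearn.metrics import accuracy_score\n\nfrom cesnet_datazoo.constants import APP_COLUMN  # assumed location of the label column constant\n\ntrain_dataframe = dataset.get_train_df(flatten_ppi=True)\ntest_dataframe = dataset.get_test_df(flatten_ppi=True)\nfeature_cols = [c for c in train_dataframe.columns if c != APP_COLUMN]\nclf = RandomForestClassifier(n_estimators=100, n_jobs=-1)\nclf.fit(train_dataframe[feature_cols], train_dataframe[APP_COLUMN])\npredictions = clf.predict(test_dataframe[feature_cols])\nprint(accuracy_score(test_dataframe[APP_COLUMN], predictions))\n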

"},{"location":"installation/","title":"Installation","text":"

Install the package with pip:

pip install cesnet-datazoo\n

or, for an editable install, with:

pip install -e git+https://github.com/CESNET/cesnet-datazoo#egg=cesnet-datazoo\n
"},{"location":"installation/#requirements","title":"Requirements","text":"

The cesnet-datazoo package requires Python >=3.10.

"},{"location":"installation/#dependencies","title":"Dependencies","text":"Name Version matplotlib numpy pandas pydantic >=2.0 PyYAML requests scikit-learn seaborn tables >=3.8.0 torch >=1.10 tqdm"},{"location":"reference_cesnet_dataset/","title":"Base dataset class","text":""},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset","title":"datasets.cesnet_dataset.CesnetDataset","text":"

The main class for accessing CESNET datasets. It handles downloading, train/validation/test splitting, and class selection. Access to data is provided through an iterable PyTorch DataLoader for batch processing and through Pandas DataFrames for loading the entire train, validation, or test set at once.

The dataset is stored in a PyTables database. The internal PyTablesDataset class is used as a wrapper that implements the PyTorch Dataset interface and is compatible with DataLoader, which provides efficient parallel loading of the data. The dataset configuration is done through the DatasetConfig class.

Intended usage:

  1. Create an instance of the dataset class with the desired size and data root. This will download the dataset if it has not already been downloaded.
  2. Create an instance of DatasetConfig and set it with set_dataset_config_and_initialize. This will initialize the dataset \u2014 select classes, split data into train/validation/test sets, and fit data scalers if needed. All is done according to the provided configuration and is cached for later use.
  3. Use get_train_dataloader or get_train_df to get training data for a classification model.
  4. Validate the model and perform hyperparameter optimization on get_val_dataloader or get_val_df.
  5. Evaluate the model on get_test_dataloader or get_test_df. A minimal sketch of steps 3-5 is shown below.
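The following sketch walks through these steps with dataloaders and an incrementally trained scikit-learn model, assuming a dataset initialized as in steps 1-2 with the default numpy outputs. The feature handling (flattening the PPI matrix and appending flow statistics) is illustrative rather than the package's recommended preprocessing.

import numpy as np\nfrom sklearn.linear_model import SGDClassifier\nfrom sklearn.metrics import accuracy_score\n\nclf = SGDClassifier()\nclasses = np.arange(dataset.get_num_classes())\nfor _, batch_ppi, batch_flowstats, batch_labels in dataset.get_train_dataloader():\n    # one feature vector per flow: flattened PPI matrix followed by flow statistics\n    x = np.concatenate([batch_ppi.reshape(len(batch_ppi), -1), batch_flowstats], axis=1)\n    clf.partial_fit(x, batch_labels, classes=classes)\ny_true, y_pred = [], []\nfor _, batch_ppi, batch_flowstats, batch_labels in dataset.get_test_dataloader():\n    x = np.concatenate([batch_ppi.reshape(len(batch_ppi), -1), batch_flowstats], axis=1)\n    y_true.append(batch_labels)\n    y_pred.append(clf.predict(x))\nprint(accuracy_score(np.concatenate(y_true), np.concatenate(y_pred)))\n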

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data_root | str | Path to the folder where the dataset will be stored. Each dataset size has its own subfolder data_root/size | required |
| size | str | Size of the dataset. Options are XS, S, M, L, ORIG. | 'S' |
| silent | bool | Whether to suppress print and tqdm output. | False |

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| name | str | Name of the dataset. |
| database_filename | str | Name of the database file. |
| database_path | str | Path to the database file. |
| servicemap_path | str | Path to the servicemap file. |
| statistics_path | str | Path to the dataset statistics folder. |
| bucket_url | str | URL of the bucket where the database is stored. |
| metadata | DatasetMetadata | Additional dataset metadata. |
| available_classes | list[str] | List of all available classes in the dataset. |
| available_dates | list[str] | List of all available dates in the dataset. |
| time_periods | dict[str, list[str]] | Predefined time periods. Each time period is a list of dates. |
| default_train_period_name | str | Default time period for training. |
| default_test_period_name | str | Default time period for testing. |

The following attributes are initialized when set_dataset_config_and_initialize is called.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| dataset_config | Optional[DatasetConfig] | Configuration of the dataset. |
| class_info | Optional[ClassInfo] | Structured information about the classes. |
| dataset_indices | Optional[IndicesTuple] | Named tuple containing train_indices, val_known_indices, val_unknown_indices, test_known_indices, test_unknown_indices. These are the indices into PyTables database that define train, validation, and test sets. |
| train_dataset | Optional[PyTablesDataset] | Train set in the form of PyTablesDataset instance wrapping the PyTables database. |
| val_dataset | Optional[PyTablesDataset] | Validation set in the form of PyTablesDataset instance wrapping the PyTables database. |
| test_dataset | Optional[PyTablesDataset] | Test set in the form of PyTablesDataset instance wrapping the PyTables database. |
| known_app_counts | Optional[DataFrame] | Known application counts in the train, validation, and test sets. |
| unknown_app_counts | Optional[DataFrame] | Unknown application counts in the validation and test sets. |
| train_dataloader | Optional[DataLoader] | Iterable PyTorch DataLoader for training. |
| train_dataloader_sampler | Optional[Sampler] | Sampler used for iterating the training dataloader. Either RandomSampler or SequentialSampler. |
| train_dataloader_drop_last | bool | Whether to drop the last incomplete batch when iterating the training dataloader. |
| val_dataloader | Optional[DataLoader] | Iterable PyTorch DataLoader for validation. |
| test_dataloader | Optional[DataLoader] | Iterable PyTorch DataLoader for testing. |

Source code in cesnet_datazoo/datasets/cesnet_dataset.py
class CesnetDataset():\n    \"\"\"\n    The main class for accessing CESNET datasets. It handles downloading, train/validation/test splitting, and class selection. Access to data is provided through:\n\n    - Iterable PyTorch DataLoader for batch processing. See [using dataloaders][using-dataloaders] for more details.\n    - Pandas DataFrame for loading the entire train, validation, or test set at once.\n\n    The dataset is stored in a [PyTables](https://www.pytables.org/) database. The internal `PyTablesDataset` class is used as a wrapper\n    that implements the PyTorch [`Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) interface\n    and is compatible with [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader),\n    which provides efficient parallel loading of the data. The dataset configuration is done through the [`DatasetConfig`][config.DatasetConfig] class.\n\n    **Intended usage:**\n\n    1. Create an instance of the [dataset class][dataset-classes] with the desired size and data root. This will download the dataset if it has not already been downloaded.\n    2. Create an instance of [`DatasetConfig`][config.DatasetConfig] and set it with [`set_dataset_config_and_initialize`][datasets.cesnet_dataset.CesnetDataset.set_dataset_config_and_initialize].\n    This will initialize the dataset \u2014 select classes, split data into train/validation/test sets, and fit data scalers if needed. All is done according to the provided configuration and is cached for later use.\n    3. Use [`get_train_dataloader`][datasets.cesnet_dataset.CesnetDataset.get_train_dataloader] or [`get_train_df`][datasets.cesnet_dataset.CesnetDataset.get_train_df] to get training data for a classification model.\n    4. Validate the model and perform the hyperparameter optimalization on [`get_val_dataloader`][datasets.cesnet_dataset.CesnetDataset.get_val_dataloader] or [`get_val_df`][datasets.cesnet_dataset.CesnetDataset.get_val_df].\n    5. Evaluate the model on [`get_test_dataloader`][datasets.cesnet_dataset.CesnetDataset.get_test_dataloader] or [`get_test_df`][datasets.cesnet_dataset.CesnetDataset.get_test_df].\n\n    Parameters:\n        data_root: Path to the folder where the dataset will be stored. Each dataset size has its own subfolder `data_root/size`\n        size: Size of the dataset. Options are `XS`, `S`, `M`, `L`, `ORIG`.\n        silent: Whether to suppress print and tqdm output.\n\n    Attributes:\n        name: Name of the dataset.\n        database_filename: Name of the database file.\n        database_path: Path to the database file.\n        servicemap_path: Path to the servicemap file.\n        statistics_path: Path to the dataset statistics folder.\n        bucket_url: URL of the bucket where the database is stored.\n        metadata: Additional [dataset metadata][metadata].\n        available_classes: List of all available classes in the dataset.\n        available_dates: List of all available dates in the dataset.\n        time_periods: Predefined time periods. 
Each time period is a list of dates.\n        default_train_period_name: Default time period for training.\n        default_test_period_name: Default time period for testing.\n\n    The following attributes are initialized when [`set_dataset_config_and_initialize`][datasets.cesnet_dataset.CesnetDataset.set_dataset_config_and_initialize] is called.\n\n    Attributes:\n        dataset_config: Configuration of the dataset.\n        class_info: Structured information about the classes.\n        dataset_indices: Named tuple containing `train_indices`, `val_known_indices`, `val_unknown_indices`, `test_known_indices`, `test_unknown_indices`. These are the indices into PyTables database that define train, validation, and test sets.\n        train_dataset: Train set in the form of `PyTablesDataset` instance wrapping the PyTables database.\n        val_dataset: Validation set in the form of `PyTablesDataset` instance wrapping the PyTables database.\n        test_dataset: Test set in the form of `PyTablesDataset` instance wrapping the PyTables database.\n        known_app_counts: Known application counts in the train, validation, and test sets.\n        unknown_app_counts: Unknown application counts in the validation and test sets.\n        train_dataloader: Iterable PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for training.\n        train_dataloader_sampler: Sampler used for iterating the training dataloader. Either [`RandomSampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.RandomSampler) or [`SequentialSampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.SequentialSampler).\n        train_dataloader_drop_last: Whether to drop the last incomplete batch when iterating the training dataloader.\n        val_dataloader: Iterable PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for validation.\n        test_dataloader: Iterable PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for testing.\n    \"\"\"\n    data_root: str\n    size: str\n    silent: bool = False\n\n    name: str\n    database_filename: str\n    database_path: str\n    servicemap_path: str\n    statistics_path: str\n    bucket_url: str\n    metadata: DatasetMetadata\n    available_classes: list[str]\n    available_dates: list[str]\n    time_periods: dict[str, list[str]]\n    default_train_period_name: str\n    default_test_period_name: str\n\n    dataset_config: Optional[DatasetConfig] = None\n    class_info: Optional[ClassInfo] = None\n    dataset_indices: Optional[IndicesTuple] = None\n    train_dataset: Optional[PyTablesDataset] = None\n    val_dataset: Optional[PyTablesDataset] = None\n    test_dataset: Optional[PyTablesDataset] = None\n    known_app_counts: Optional[pd.DataFrame] = None\n    unknown_app_counts: Optional[pd.DataFrame] = None\n    train_dataloader: Optional[DataLoader] = None\n    train_dataloader_sampler: Optional[Sampler] = None\n    train_dataloader_drop_last: bool = True\n    val_dataloader: Optional[DataLoader] = None\n    test_dataloader: Optional[DataLoader] = None\n\n    _collate_fn: Optional[Callable] = None\n    _tables_app_enum: dict[int, str]\n    _tables_cat_enum: dict[int, str]\n\n    def __init__(self, data_root: str, size: str = \"S\", database_checks_at_init: bool = False, silent: bool = False) -> None:\n        self.silent = silent\n        self.metadata = load_metadata(self.name)\n        self.size = size\n        if self.size != 
\"ORIG\":\n            if size not in self.metadata.available_dataset_sizes:\n                raise ValueError(f\"Unknown dataset size {self.size}\")\n            self.name = f\"{self.name}-{self.size}\"\n            filename, ext = os.path.splitext(self.database_filename)\n            self.database_filename = f\"{filename}-{self.size}{ext}\"\n        self.data_root = os.path.normpath(os.path.expanduser(os.path.join(data_root, self.size)))\n        self.database_path = os.path.join(self.data_root, self.database_filename)\n        self.servicemap_path = os.path.join(self.data_root, SERVICEMAP_FILE)\n        self.statistics_path = os.path.join(self.data_root, \"statistics\")\n        if not os.path.exists(self.data_root):\n            os.makedirs(self.data_root)\n        if not self._is_downloaded():\n            self._download()\n        if database_checks_at_init:\n            with tb.open_file(self.database_path, mode=\"r\") as database:\n                tables_paths = list(map(lambda x: x._v_pathname, iter(database.get_node(f\"/flows\"))))\n                num_samples = 0\n                for p in tables_paths:\n                    table = database.get_node(p)\n                    assert isinstance(table, tb.Table)\n                    if self._tables_app_enum != {v: k for k, v in dict(table.get_enum(APP_COLUMN)).items()}:\n                        raise ValueError(f\"Found mismatch between _tables_app_enum and the PyTables database enum in table {p}. Please report this issue.\")\n                    if self._tables_cat_enum != {v: k for k, v in dict(table.get_enum(CATEGORY_COLUMN)).items()}:\n                        raise ValueError(f\"Found mismatch between _tables_cat_enum and the PyTables database enum in table {p}. Please report this issue.\")\n                    num_samples += len(table)\n                if self.size == \"ORIG\" and num_samples != self.metadata.available_samples:\n                    raise ValueError(f\"Expected {self.metadata.available_samples} samples, but got {num_samples} in the database. Please delete the data root folder, update cesnet-datazoo, and redownload the dataset.\")\n                if self.size != \"ORIG\" and num_samples != DATASET_SIZES[self.size]:\n                    raise ValueError(f\"Expected {DATASET_SIZES[self.size]} samples, but got {num_samples} in the database. Please delete the data root folder, update cesnet-datazoo, and redownload the dataset.\")\n                if self.available_dates != list(map(lambda x: x.removeprefix(\"/flows/D\"), tables_paths)):\n                    raise ValueError(f\"Found mismatch between available_dates and the dates available in the PyTables database. Please report this issue.\")\n        # Add all available dates as single date time periods\n        for d in self.available_dates:\n            self.time_periods[d] = [d]\n        available_applications = sorted([app for app in pd.read_csv(self.servicemap_path, index_col=\"Tag\").index if not is_background_app(app)])\n        if len(available_applications) != self.metadata.application_count:\n            raise ValueError(f\"Found {len(available_applications)} applications in the servicemap (omitting background traffic classes), but expected {self.metadata.application_count}. 
Please report this issue.\")\n        self.available_classes = available_applications + self.metadata.background_traffic_classes\n\n    def set_dataset_config_and_initialize(self, dataset_config: DatasetConfig, disable_indices_cache: bool = False) -> None:\n        \"\"\"\n        Initialize train, validation, and test sets. Data cannot be accessed before calling this method.\n\n        Parameters:\n            dataset_config: Desired configuration of the dataset.\n            disable_indices_cache: Whether to disable caching of the dataset indices. This is useful when the dataset is used in many different configurations and you want to save disk space.\n        \"\"\"\n        self.dataset_config = dataset_config\n        self._clear()\n        self._initialize_train_val_test(disable_indices_cache=disable_indices_cache)\n\n    def get_train_dataloader(self) -> DataLoader:\n        \"\"\"\n        Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for training. The dataloader is created on the first call and then cached.\n        When the dataloader is iterated in random order, the last incomplete batch is dropped.\n        The dataloader is configured with the following config attributes:\n\n        | Dataset config               | Description                                                                                |\n        | ---------------------------- | ------------------------------------------------------------------------------------------ |\n        | `batch_size`                 | Number of samples per batch.                                                               |\n        | `train_workers`              | Number of workers for loading train data.                                                  |\n        | `train_dataloader_order`     | Whether to load train data in sequential or random order. See [config.DataLoaderOrder][].  |\n        | `train_dataloader_seed`      | Seed for loading train data in random order.                                               |\n\n        Returns:\n            Train data as an iterable dataloader. 
See [using dataloaders][using-dataloaders] for more details.\n        \"\"\"\n        if self.dataset_config is None:\n            raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting train dataloader\")\n        if not self.dataset_config.need_train_set:\n            raise ValueError(\"Train dataloader is not available when need_train_set is false\")\n        assert self.train_dataset\n        if self.train_dataloader:\n            return self.train_dataloader\n        # Create sampler according to the selected order\n        if self.dataset_config.train_dataloader_order == DataLoaderOrder.RANDOM:\n            if self.dataset_config.train_dataloader_seed is not None:\n                generator = torch.Generator()\n                generator.manual_seed(self.dataset_config.train_dataloader_seed)\n            else:\n                generator = None\n            self.train_dataloader_sampler = RandomSampler(self.train_dataset, generator=generator)\n            self.train_dataloader_drop_last = True\n        elif self.dataset_config.train_dataloader_order == DataLoaderOrder.SEQUENTIAL:\n            self.train_dataloader_sampler = SequentialSampler(self.train_dataset)\n            self.train_dataloader_drop_last = False\n        else: assert_never(self.dataset_config.train_dataloader_order)\n        # Create dataloader\n        batch_sampler = BatchSampler(sampler=self.train_dataloader_sampler, batch_size=self.dataset_config.batch_size, drop_last=self.train_dataloader_drop_last)\n        train_dataloader = DataLoader(\n            self.train_dataset,\n            num_workers=self.dataset_config.train_workers,\n            worker_init_fn=worker_init_fn,\n            collate_fn=self._collate_fn,\n            persistent_workers=self.dataset_config.train_workers > 0,\n            batch_size=None,\n            sampler=batch_sampler,)\n        if self.dataset_config.train_workers == 0:\n            self.train_dataset.pytables_worker_init()\n        self.train_dataloader = train_dataloader\n        return train_dataloader\n\n    def get_val_dataloader(self) -> DataLoader:\n        \"\"\"\n        Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for validation.\n        The dataloader is created on the first call and then cached.\n        The dataloader is configured with the following config attributes:\n\n        | Dataset config    | Description                                                       |\n        | ------------------| ------------------------------------------------------------------|\n        | `test_batch_size` | Number of samples per batch for loading validation and test data. |\n        | `val_workers`     | Number of workers for loading validation data.                    |\n\n        Returns:\n            Validation data as an iterable dataloader. 
See [using dataloaders][using-dataloaders] for more details.\n        \"\"\"\n        if self.dataset_config is None:\n            raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting validaion dataloader\")\n        if not self.dataset_config.need_val_set:\n            raise ValueError(\"Validation dataloader is not available when need_val_set is false\")\n        assert self.val_dataset is not None\n        if self.val_dataloader:\n            return self.val_dataloader\n        batch_sampler = BatchSampler(sampler=SequentialSampler(self.val_dataset), batch_size=self.dataset_config.test_batch_size, drop_last=False)\n        val_dataloader = DataLoader(\n            self.val_dataset,\n            num_workers=self.dataset_config.val_workers,\n            worker_init_fn=worker_init_fn,\n            collate_fn=self._collate_fn,\n            persistent_workers=self.dataset_config.val_workers > 0,\n            batch_size=None,\n            sampler=batch_sampler,)\n        if self.dataset_config.val_workers == 0:\n            self.val_dataset.pytables_worker_init()\n        self.val_dataloader = val_dataloader\n        return val_dataloader\n\n    def get_test_dataloader(self) -> DataLoader:\n        \"\"\"\n        Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for testing.\n        The dataloader is created on the first call and then cached.\n\n        When the dataset is used in the open-world setting, and unknown classes are defined,\n        the test dataloader returns `test_known_size` samples of known classes followed by `test_unknown_size` samples of unknown classes.\n\n        The dataloader is configured with the following config attributes:\n\n        | Dataset config    | Description                                                       |\n        | ------------------| ------------------------------------------------------------------|\n        | `test_batch_size` | Number of samples per batch for loading validation and test data. |\n        | `test_workers`    | Number of workers for loading test data.                          |\n\n        Returns:\n            Test data as an iterable dataloader. 
See [using dataloaders][using-dataloaders] for more details.\n        \"\"\"\n        if self.dataset_config is None:\n            raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting test dataloader\")\n        if not self.dataset_config.need_test_set:\n            raise ValueError(\"Test dataloader is not available when need_test_set is false\")\n        assert self.test_dataset is not None\n        if self.test_dataloader:\n            return self.test_dataloader\n        batch_sampler = BatchSampler(sampler=SequentialSampler(self.test_dataset), batch_size=self.dataset_config.test_batch_size, drop_last=False)\n        test_dataloader = DataLoader(\n            self.test_dataset,\n            num_workers=self.dataset_config.test_workers,\n            worker_init_fn=worker_init_fn,\n            collate_fn=self._collate_fn,\n            persistent_workers=False,\n            batch_size=None,\n            sampler=batch_sampler,)\n        if self.dataset_config.test_workers == 0:\n            self.test_dataset.pytables_worker_init()\n        self.test_dataloader = test_dataloader\n        return test_dataloader\n\n    def get_dataloaders(self) -> tuple[DataLoader, DataLoader, DataLoader]:\n        \"\"\"Gets train, validation, and test dataloaders in one call.\"\"\"\n        if self.dataset_config is None:\n            raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting dataloaders\")\n        train_dataloader = self.get_train_dataloader()\n        val_dataloader = self.get_val_dataloader()\n        test_dataloader = self.get_test_dataloader()\n        return train_dataloader, val_dataloader, test_dataloader\n\n    def get_train_df(self, flatten_ppi: bool = False) -> pd.DataFrame:\n        \"\"\"\n        Creates a train Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe is in sequential (datetime) order. Consider shuffling the dataframe if needed.\n\n        !!! warning \"Memory usage\"\n\n            The whole train set is loaded into memory. 
If the dataset size is larger than `'S'`, consider using `get_train_dataloader` instead.\n\n        Parameters:\n            flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n        Returns:\n            Train data as a dataframe.\n        \"\"\"\n        self._check_before_dataframe(check_train=True)\n        assert self.dataset_config is not None and self.train_dataset is not None\n        if len(self.train_dataset) > DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n            warnings.warn(f\"Train set has ({len(self.train_dataset)} samples), consider using get_train_dataloader() instead\")\n        train_dataloader = self.get_train_dataloader()\n        assert isinstance(train_dataloader.sampler, BatchSampler) and self.train_dataloader_sampler is not None\n        # Read dataloader in sequential order\n        train_dataloader.sampler.sampler = SequentialSampler(self.train_dataset)\n        train_dataloader.sampler.drop_last = False\n        feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n        df = create_df_from_dataloader(dataloader=train_dataloader,\n                                       feature_names=feature_names,\n                                       flatten_ppi=flatten_ppi,\n                                       silent=self.silent)\n        # Restore the original dataloader sampler and drop_last\n        train_dataloader.sampler.sampler = self.train_dataloader_sampler\n        train_dataloader.sampler.drop_last = self.train_dataloader_drop_last\n        return df\n\n    def get_val_df(self, flatten_ppi: bool = False) -> pd.DataFrame:\n        \"\"\"\n        Creates validation Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe is in sequential (datetime) order.\n\n        !!! warning \"Memory usage\"\n\n            The whole validation set is loaded into memory. If the dataset size is larger than `'S'`, consider using `get_val_dataloader` instead.\n\n        Parameters:\n            flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n        Returns:\n            Validation data as a dataframe.\n        \"\"\"\n        self._check_before_dataframe(check_val=True)\n        assert self.dataset_config is not None and self.val_dataset is not None\n        if len(self.val_dataset) > DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n            warnings.warn(f\"Validation set has ({len(self.val_dataset)} samples), consider using get_val_dataloader() instead\")\n        feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n        return create_df_from_dataloader(dataloader=self.get_val_dataloader(),\n                                         feature_names=feature_names,\n                                         flatten_ppi=flatten_ppi,\n                                         silent=self.silent)\n\n    def get_test_df(self, flatten_ppi: bool = False) -> pd.DataFrame:\n        \"\"\"\n        Creates test Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). 
The dataframe is in sequential (datetime) order.\n\n\n        When the dataset is used in the open-world setting, and unknown classes are defined,\n        the returned test dataframe is composed of `test_known_size` samples of known classes followed by `test_unknown_size` samples of unknown classes.\n\n\n        !!! warning \"Memory usage\"\n\n            The whole test set is loaded into memory. If the dataset size is larger than `'S'`, consider using `get_test_dataloader` instead.\n\n        Parameters:\n            flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n        Returns:\n            Test data as a dataframe.\n        \"\"\"\n        self._check_before_dataframe(check_test=True)\n        assert self.dataset_config is not None and self.test_dataset is not None\n        if len(self.test_dataset) > DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n            warnings.warn(f\"Test set has ({len(self.test_dataset)} samples), consider using get_test_dataloader() instead\")\n        feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n        return create_df_from_dataloader(dataloader=self.get_test_dataloader(),\n                                         feature_names=feature_names,\n                                         flatten_ppi=flatten_ppi,\n                                         silent=self.silent)\n\n    def get_num_classes(self) -> int:\n        \"\"\"Returns the number of classes in the current configuration of the dataset.\"\"\"\n        if self.class_info is None:\n            raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting the number of classes\")\n        return self.class_info.num_classes\n\n    def get_known_apps(self) -> list[str]:\n        \"\"\"Returns the list of known applications in the current configuration of the dataset.\"\"\"\n        if self.class_info is None:\n            raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting known apps\")\n        return self.class_info.known_apps\n\n    def get_unknown_apps(self) -> list[str]:\n        \"\"\"Returns the list of unknown applications in the current configuration of the dataset.\"\"\"\n        if self.class_info is None:\n            raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting unknown apps\")\n        return self.class_info.unknown_apps\n\n    def compute_dataset_statistics(self, num_samples: int | Literal[\"all\"] = 10_000_000, num_workers: int = 4, batch_size: int = 16384, disabled_apps: Optional[list[str]] = None) -> None:\n        \"\"\"\n        Computes dataset statistics and saves them to the `statistics_path` folder.\n\n        Parameters:\n            num_samples: Number of samples to use for computing the statistics.\n            num_workers: Number of workers for loading data.\n            batch_size: Number of samples per batch for loading data.\n            disabled_apps: List of applications to exclude from the statistics.\n        \"\"\"\n        if disabled_apps:\n            bad_disabled_apps = [a for a in disabled_apps if a not in self.available_classes]\n            if len(bad_disabled_apps) > 0:\n                raise ValueError(f\"Bad applications in disabled_apps {bad_disabled_apps}. 
Use applications available in dataset.available_classes\")\n        if not os.path.exists(self.statistics_path):\n            os.mkdir(self.statistics_path)\n        compute_dataset_statistics(database_path=self.database_path,\n                                   tables_app_enum=self._tables_app_enum,\n                                   tables_cat_enum=self._tables_cat_enum,\n                                   output_dir=self.statistics_path,\n                                   packet_histograms=self.metadata.packet_histograms,\n                                   flowstats_features_boolean=self.metadata.flowstats_features_boolean,\n                                   protocol=self.metadata.protocol,\n                                   extra_fields=not self.name.startswith(\"CESNET-TLS22\"),\n                                   disabled_apps=disabled_apps if disabled_apps is not None else [],\n                                   num_samples=num_samples,\n                                   num_workers=num_workers,\n                                   batch_size=batch_size,\n                                   silent=self.silent)\n\n    def _generate_time_periods(self) -> None:\n        time_periods = {}\n        for period in self.time_periods:\n            time_periods[period] = []\n            if period.startswith(\"W\"):\n                split = period.split(\"-\")\n                collection_year, week = int(split[1]), int(split[2])\n                for d in range(1, 8):\n                    s = datetime.date.fromisocalendar(collection_year, week, d).strftime(\"%Y%m%d\")\n                    # last week of a year can span into the following year\n                    if s not in self.metadata.missing_dates_in_collection_period and s.startswith(str(collection_year)):\n                        time_periods[period].append(s)\n            elif period.startswith(\"M\"):\n                split = period.split(\"-\")\n                collection_year, month = int(split[1]), int(split[2])\n                for d in range(1, calendar.monthrange(collection_year, month)[1]):\n                    s = datetime.date(collection_year, month, d).strftime(\"%Y%m%d\")\n                    if s not in self.metadata.missing_dates_in_collection_period:\n                        time_periods[period].append(s)\n        self.time_periods = time_periods\n\n    def _is_downloaded(self) -> bool:\n        \"\"\"Servicemap is downloaded after the database; thus if it exists, the database is also downloaded\"\"\"\n        return os.path.exists(self.servicemap_path) and os.path.exists(self.database_path)\n\n    def _download(self) -> None:\n        if not self.silent:\n            print(f\"Downloading {self.name} dataset\")\n        database_url = f\"{self.bucket_url}&file={self.database_filename}\"\n        servicemap_url = f\"{self.bucket_url}&file={SERVICEMAP_FILE}\"\n        resumable_download(url=database_url, file_path=self.database_path, silent=self.silent)\n        simple_download(url=servicemap_url, file_path=self.servicemap_path)\n\n    def _clear(self) -> None:\n        self.class_info = None\n        self.dataset_indices = None\n        self.train_dataset = None\n        self.val_dataset = None\n        self.test_dataset = None\n        self.known_app_counts = None\n        self.unknown_app_counts = None\n        self.train_dataloader = None\n        self.train_dataloader_sampler = None\n        self.train_dataloader_drop_last = True\n        self.val_dataloader = None\n        self.test_dataloader = None\n        
self._collate_fn = None\n\n    def _check_before_dataframe(self, check_train: bool = False, check_val: bool = False, check_test: bool = False) -> None:\n        if self.dataset_config is None:\n            raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting a dataframe\")\n        if self.dataset_config.return_tensors:\n            raise ValueError(\"Dataframes are not available when return_tensors is set. Use a dataloader instead.\")\n        if check_train and not self.dataset_config.need_train_set:\n            raise ValueError(\"Train dataframe is not available when need_train_set is false\")\n        if check_val and not self.dataset_config.need_val_set:\n            raise ValueError(\"Validation dataframe is not available when need_val_set is false\")\n        if check_test and not self.dataset_config.need_test_set:\n            raise ValueError(\"Test dataframe is not available when need_test_set is false\")\n\n    def _initialize_train_val_test(self, disable_indices_cache: bool = False) -> None:\n        assert self.dataset_config is not None\n        dataset_config = self.dataset_config\n        servicemap = pd.read_csv(dataset_config.servicemap_path, index_col=\"Tag\")\n        # Initialize train set\n        if dataset_config.need_train_set:\n            train_indices, train_unknown_indices, known_apps, unknown_apps = init_or_load_train_indices(dataset_config=dataset_config,\n                                                                                                        tables_app_enum=self._tables_app_enum,\n                                                                                                        servicemap=servicemap,\n                                                                                                        disable_indices_cache=disable_indices_cache,)\n            # Date weight sampling of train indices\n            if dataset_config.train_dates_weigths is not None:\n                assert dataset_config.train_size != \"all\"\n                if dataset_config.val_approach == ValidationApproach.SPLIT_FROM_TRAIN:\n                    # requested number of samples is train_size + val_known_size when using the split-from-train validation approach\n                    assert dataset_config.val_known_size != \"all\"\n                    num_samples = dataset_config.train_size + dataset_config.val_known_size\n                else:\n                    num_samples = dataset_config.train_size\n                if num_samples > len(train_indices):\n                    raise ValueError(f\"Requested number of samples for weight sampling ({num_samples}) is larger than the number of available train samples ({len(train_indices)})\")\n                train_indices = date_weight_sample_train_indices(dataset_config=dataset_config, train_indices=train_indices, num_samples=num_samples)\n        elif dataset_config.apps_selection == AppSelection.FIXED:\n            known_apps = sorted(dataset_config.apps_selection_fixed_known)\n            unknown_apps = sorted(dataset_config.apps_selection_fixed_unknown)\n            train_indices = no_indices()\n            train_unknown_indices = no_indices()\n        else:\n            raise ValueError(\"Either need train set or the fixed application selection\")\n        # Initialize validation set\n        if dataset_config.need_val_set:\n            if dataset_config.val_approach == ValidationApproach.VALIDATION_DATES:\n                val_known_indices, 
val_unknown_indices, val_data_path = init_or_load_val_indices(dataset_config=dataset_config,\n                                                                                                 known_apps=known_apps,\n                                                                                                 unknown_apps=unknown_apps,\n                                                                                                 tables_app_enum=self._tables_app_enum,\n                                                                                                 disable_indices_cache=disable_indices_cache,)\n            elif dataset_config.val_approach == ValidationApproach.SPLIT_FROM_TRAIN:\n                train_val_rng = get_fresh_random_generator(dataset_config=dataset_config, section=RandomizedSection.TRAIN_VAL_SPLIT)\n                val_data_path = dataset_config._get_train_data_path()\n                val_unknown_indices = train_unknown_indices\n                train_labels = train_indices[:, INDICES_LABEL_POS]\n                if dataset_config.train_dates_weigths is not None:\n                    assert dataset_config.val_known_size != \"all\"\n                    # When weight sampling is used, val_known_size is kept but the resulting train size can be smaller due to no enough samples in some train dates\n                    if dataset_config.val_known_size > len(train_indices):\n                        raise ValueError(f\"Requested validation size ({dataset_config.val_known_size}) is larger than the number of available train samples after weight sampling ({len(train_indices)})\")\n                    train_indices, val_known_indices = train_test_split(train_indices, test_size=dataset_config.val_known_size, stratify=train_labels, shuffle=True, random_state=train_val_rng)\n                    dataset_config.train_size = len(train_indices)\n                elif dataset_config.train_size == \"all\" and dataset_config.val_known_size == \"all\":\n                    train_indices, val_known_indices = train_test_split(train_indices, test_size=dataset_config.train_val_split_fraction, stratify=train_labels, shuffle=True, random_state=train_val_rng)\n                else:\n                    if dataset_config.val_known_size != \"all\" and  dataset_config.train_size != \"all\" and dataset_config.train_size + dataset_config.val_known_size > len(train_indices):\n                        raise ValueError(f\"Requested train size + validation size ({dataset_config.train_size + dataset_config.val_known_size}) is larger than the number of available train samples ({len(train_indices)})\")\n                    if dataset_config.train_size != \"all\" and dataset_config.train_size > len(train_indices):\n                        raise ValueError(f\"Requested train size ({dataset_config.train_size}) is larger than the number of available train samples ({len(train_indices)})\")\n                    if dataset_config.val_known_size != \"all\" and dataset_config.val_known_size > len(train_indices):\n                        raise ValueError(f\"Requested validation size ({dataset_config.val_known_size}) is larger than the number of available train samples ({len(train_indices)})\")\n                    train_indices, val_known_indices = train_test_split(train_indices,\n                                                                        train_size=dataset_config.train_size if dataset_config.train_size != \"all\" else None,\n                                                                        
test_size=dataset_config.val_known_size if dataset_config.val_known_size != \"all\" else None,\n                                                                        stratify=train_labels, shuffle=True, random_state=train_val_rng)\n        else:\n            val_known_indices = no_indices()\n            val_unknown_indices = no_indices()\n            val_data_path = None\n        # Initialize test set\n        if dataset_config.need_test_set:\n            test_known_indices, test_unknown_indices, test_data_path = init_or_load_test_indices(dataset_config=dataset_config,\n                                                                                                 known_apps=known_apps,\n                                                                                                 unknown_apps=unknown_apps,\n                                                                                                 tables_app_enum=self._tables_app_enum,\n                                                                                                 disable_indices_cache=disable_indices_cache,)\n        else:\n            test_known_indices = no_indices()\n            test_unknown_indices = no_indices()\n            test_data_path = None\n        # Fit scalers if needed\n        if (dataset_config.ppi_transform is not None and dataset_config.ppi_transform.needs_fitting or\n            dataset_config.flowstats_transform is not None and dataset_config.flowstats_transform.needs_fitting):\n            if not dataset_config.need_train_set:\n                raise ValueError(\"Train set is needed to fit the scalers. Provide pre-fitted scalers.\")\n            fit_scalers(dataset_config=dataset_config, train_indices=train_indices)\n        # Subset dataset indices based on the selected sizes and compute application counts\n        dataset_indices = IndicesTuple(train_indices=train_indices, val_known_indices=val_known_indices, val_unknown_indices=val_unknown_indices, test_known_indices=test_known_indices, test_unknown_indices=test_unknown_indices)\n        dataset_indices = subset_and_sort_indices(dataset_config=dataset_config, dataset_indices=dataset_indices)\n        known_app_counts = compute_known_app_counts(dataset_indices=dataset_indices, tables_app_enum=self._tables_app_enum)\n        unknown_app_counts = compute_unknown_app_counts(dataset_indices=dataset_indices, tables_app_enum=self._tables_app_enum)\n        # Combine known and unknown test indicies to create a single dataloader\n        assert isinstance(dataset_config.test_unknown_size, int)\n        if dataset_config.test_unknown_size > 0 and len(unknown_apps) > 0:\n            test_combined_indices = np.concatenate((dataset_indices.test_known_indices, dataset_indices.test_unknown_indices))\n        else:\n            test_combined_indices = dataset_indices.test_known_indices\n        # Create encoder the class info structure\n        encoder = LabelEncoder().fit(known_apps)\n        encoder.classes_ = np.append(encoder.classes_, UNKNOWN_STR_LABEL)\n        class_info = create_class_info(servicemap=servicemap, encoder=encoder, known_apps=known_apps, unknown_apps=unknown_apps)\n        encode_labels_with_unknown_fn = partial(_encode_labels_with_unknown, encoder=encoder, class_info=class_info)\n        # Create train, validation, and test datasets\n        train_dataset = val_dataset = test_dataset = None\n        if dataset_config.need_train_set:\n            train_dataset = PyTablesDataset(\n                
database_path=dataset_config.database_path,\n                tables_paths=dataset_config._get_train_tables_paths(),\n                indices=dataset_indices.train_indices,\n                tables_app_enum=self._tables_app_enum,\n                tables_cat_enum=self._tables_cat_enum,\n                flowstats_features=dataset_config.flowstats_features,\n                flowstats_features_boolean=dataset_config.flowstats_features_boolean,\n                flowstats_features_phist=dataset_config.flowstats_features_phist,\n                other_fields=self.dataset_config.other_fields,\n                ppi_channels=dataset_config.get_ppi_channels(),\n                ppi_transform=dataset_config.ppi_transform,\n                flowstats_transform=dataset_config.flowstats_transform,\n                flowstats_phist_transform=dataset_config.flowstats_phist_transform,\n                target_transform=encode_labels_with_unknown_fn,\n                return_tensors=dataset_config.return_tensors,)\n        if dataset_config.need_val_set:\n            assert val_data_path is not None\n            val_dataset = PyTablesDataset(\n                database_path=dataset_config.database_path,\n                tables_paths=dataset_config._get_val_tables_paths(),\n                indices=dataset_indices.val_known_indices,\n                tables_app_enum=self._tables_app_enum,\n                tables_cat_enum=self._tables_cat_enum,\n                flowstats_features=dataset_config.flowstats_features,\n                flowstats_features_boolean=dataset_config.flowstats_features_boolean,\n                flowstats_features_phist=dataset_config.flowstats_features_phist,\n                other_fields=self.dataset_config.other_fields,\n                ppi_channels=dataset_config.get_ppi_channels(),\n                ppi_transform=dataset_config.ppi_transform,\n                flowstats_transform=dataset_config.flowstats_transform,\n                flowstats_phist_transform=dataset_config.flowstats_phist_transform,\n                target_transform=encode_labels_with_unknown_fn,\n                return_tensors=dataset_config.return_tensors,\n                preload=dataset_config.preload_val,\n                preload_blob=os.path.join(val_data_path, \"preload\", f\"val_dataset-{dataset_config.val_known_size}.npz\"),)\n        if dataset_config.need_test_set:\n            assert test_data_path is not None\n            test_dataset = PyTablesDataset(\n                database_path=dataset_config.database_path,\n                tables_paths=dataset_config._get_test_tables_paths(),\n                indices=test_combined_indices,\n                tables_app_enum=self._tables_app_enum,\n                tables_cat_enum=self._tables_cat_enum,\n                flowstats_features=dataset_config.flowstats_features,\n                flowstats_features_boolean=dataset_config.flowstats_features_boolean,\n                flowstats_features_phist=dataset_config.flowstats_features_phist,\n                other_fields=self.dataset_config.other_fields,\n                ppi_channels=dataset_config.get_ppi_channels(),\n                ppi_transform=dataset_config.ppi_transform,\n                flowstats_transform=dataset_config.flowstats_transform,\n                flowstats_phist_transform=dataset_config.flowstats_phist_transform,\n                target_transform=encode_labels_with_unknown_fn,\n                return_tensors=dataset_config.return_tensors,\n                preload=dataset_config.preload_test,\n                
preload_blob=os.path.join(test_data_path, \"preload\", f\"test_dataset-{dataset_config.test_known_size}-{dataset_config.test_unknown_size}.npz\"),)\n        self.class_info = class_info\n        self.dataset_indices = dataset_indices\n        self.train_dataset = train_dataset\n        self.val_dataset = val_dataset\n        self.test_dataset = test_dataset\n        self.known_app_counts = known_app_counts\n        self.unknown_app_counts = unknown_app_counts\n        self._collate_fn = collate_fn_simple\n
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.set_dataset_config_and_initialize","title":"set_dataset_config_and_initialize","text":"
set_dataset_config_and_initialize(\n    dataset_config: DatasetConfig,\n    disable_indices_cache: bool = False,\n) -> None\n

Initialize train, validation, and test sets. Data cannot be accessed before calling this method.

Parameters:

Name Type Description Default dataset_config DatasetConfig

Desired configuration of the dataset.

required disable_indices_cache bool

Whether to disable caching of the dataset indices. This is useful when the dataset is used in many different configurations and you want to save disk space.

False Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def set_dataset_config_and_initialize(self, dataset_config: DatasetConfig, disable_indices_cache: bool = False) -> None:\n    \"\"\"\n    Initialize train, validation, and test sets. Data cannot be accessed before calling this method.\n\n    Parameters:\n        dataset_config: Desired configuration of the dataset.\n        disable_indices_cache: Whether to disable caching of the dataset indices. This is useful when the dataset is used in many different configurations and you want to save disk space.\n    \"\"\"\n    self.dataset_config = dataset_config\n    self._clear()\n    self._initialize_train_val_test(disable_indices_cache=disable_indices_cache)\n
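A minimal initialization sketch (assuming the CESNET_QUIC22 dataset class from cesnet_datazoo.datasets and the XS dataset size; the data_root path is illustrative):

from cesnet_datazoo.config import DatasetConfig
from cesnet_datazoo.datasets import CESNET_QUIC22

# Illustrative data_root path; pick any available dataset size.
dataset = CESNET_QUIC22("~/datasets/CESNET-QUIC22/", size="XS")

# Configure and initialize the train, validation, and test sets.
# Data cannot be accessed before this call.
dataset_config = DatasetConfig(dataset=dataset)
dataset.set_dataset_config_and_initialize(dataset_config)

After this call, the dataloaders and dataframes described below become available.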
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_train_dataloader","title":"get_train_dataloader","text":"
get_train_dataloader() -> DataLoader\n

Provides a PyTorch DataLoader for training. The dataloader is created on the first call and then cached. When the dataloader is iterated in random order, the last incomplete batch is dropped. The dataloader is configured with the following config attributes:

Dataset config Description batch_size Number of samples per batch. train_workers Number of workers for loading train data. train_dataloader_order Whether to load train data in sequential or random order. See config.DataLoaderOrder. train_dataloader_seed Seed for loading train data in random order.

Returns:

Type Description DataLoader

Train data as an iterable dataloader. See using dataloaders for more details.

Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_train_dataloader(self) -> DataLoader:\n    \"\"\"\n    Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for training. The dataloader is created on the first call and then cached.\n    When the dataloader is iterated in random order, the last incomplete batch is dropped.\n    The dataloader is configured with the following config attributes:\n\n    | Dataset config               | Description                                                                                |\n    | ---------------------------- | ------------------------------------------------------------------------------------------ |\n    | `batch_size`                 | Number of samples per batch.                                                               |\n    | `train_workers`              | Number of workers for loading train data.                                                  |\n    | `train_dataloader_order`     | Whether to load train data in sequential or random order. See [config.DataLoaderOrder][].  |\n    | `train_dataloader_seed`      | Seed for loading train data in random order.                                               |\n\n    Returns:\n        Train data as an iterable dataloader. See [using dataloaders][using-dataloaders] for more details.\n    \"\"\"\n    if self.dataset_config is None:\n        raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting train dataloader\")\n    if not self.dataset_config.need_train_set:\n        raise ValueError(\"Train dataloader is not available when need_train_set is false\")\n    assert self.train_dataset\n    if self.train_dataloader:\n        return self.train_dataloader\n    # Create sampler according to the selected order\n    if self.dataset_config.train_dataloader_order == DataLoaderOrder.RANDOM:\n        if self.dataset_config.train_dataloader_seed is not None:\n            generator = torch.Generator()\n            generator.manual_seed(self.dataset_config.train_dataloader_seed)\n        else:\n            generator = None\n        self.train_dataloader_sampler = RandomSampler(self.train_dataset, generator=generator)\n        self.train_dataloader_drop_last = True\n    elif self.dataset_config.train_dataloader_order == DataLoaderOrder.SEQUENTIAL:\n        self.train_dataloader_sampler = SequentialSampler(self.train_dataset)\n        self.train_dataloader_drop_last = False\n    else: assert_never(self.dataset_config.train_dataloader_order)\n    # Create dataloader\n    batch_sampler = BatchSampler(sampler=self.train_dataloader_sampler, batch_size=self.dataset_config.batch_size, drop_last=self.train_dataloader_drop_last)\n    train_dataloader = DataLoader(\n        self.train_dataset,\n        num_workers=self.dataset_config.train_workers,\n        worker_init_fn=worker_init_fn,\n        collate_fn=self._collate_fn,\n        persistent_workers=self.dataset_config.train_workers > 0,\n        batch_size=None,\n        sampler=batch_sampler,)\n    if self.dataset_config.train_workers == 0:\n        self.train_dataset.pytables_worker_init()\n    self.train_dataloader = train_dataloader\n    return train_dataloader\n
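A short usage sketch; the batch structure follows the using dataloaders page, and the training step itself is omitted:

train_dataloader = dataset.get_train_dataloader()
for batch_other_fields, batch_ppi, batch_flowstats, batch_labels in train_dataloader:
    # Each batch is a tuple of (other fields, PPI sequences, flow statistics, labels).
    pass  # run your training step on batch_ppi, batch_flowstats, and batch_labels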
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_val_dataloader","title":"get_val_dataloader","text":"
get_val_dataloader() -> DataLoader\n

Provides a PyTorch DataLoader for validation. The dataloader is created on the first call and then cached. The dataloader is configured with the following config attributes:

Dataset config Description test_batch_size Number of samples per batch for loading validation and test data. val_workers Number of workers for loading validation data.

Returns:

Type Description DataLoader

Validation data as an iterable dataloader. See using dataloaders for more details.

Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_val_dataloader(self) -> DataLoader:\n    \"\"\"\n    Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for validation.\n    The dataloader is created on the first call and then cached.\n    The dataloader is configured with the following config attributes:\n\n    | Dataset config    | Description                                                       |\n    | ------------------| ------------------------------------------------------------------|\n    | `test_batch_size` | Number of samples per batch for loading validation and test data. |\n    | `val_workers`     | Number of workers for loading validation data.                    |\n\n    Returns:\n        Validation data as an iterable dataloader. See [using dataloaders][using-dataloaders] for more details.\n    \"\"\"\n    if self.dataset_config is None:\n        raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting validaion dataloader\")\n    if not self.dataset_config.need_val_set:\n        raise ValueError(\"Validation dataloader is not available when need_val_set is false\")\n    assert self.val_dataset is not None\n    if self.val_dataloader:\n        return self.val_dataloader\n    batch_sampler = BatchSampler(sampler=SequentialSampler(self.val_dataset), batch_size=self.dataset_config.test_batch_size, drop_last=False)\n    val_dataloader = DataLoader(\n        self.val_dataset,\n        num_workers=self.dataset_config.val_workers,\n        worker_init_fn=worker_init_fn,\n        collate_fn=self._collate_fn,\n        persistent_workers=self.dataset_config.val_workers > 0,\n        batch_size=None,\n        sampler=batch_sampler,)\n    if self.dataset_config.val_workers == 0:\n        self.val_dataset.pytables_worker_init()\n    self.val_dataloader = val_dataloader\n    return val_dataloader\n
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_test_dataloader","title":"get_test_dataloader","text":"
get_test_dataloader() -> DataLoader\n

Provides a PyTorch DataLoader for testing. The dataloader is created on the first call and then cached.

When the dataset is used in the open-world setting, and unknown classes are defined, the test dataloader returns test_known_size samples of known classes followed by test_unknown_size samples of unknown classes.

The dataloader is configured with the following config attributes:

Dataset config Description test_batch_size Number of samples per batch for loading validation and test data. test_workers Number of workers for loading test data.

Returns:

Type Description DataLoader

Test data as an iterable dataloader. See using dataloaders for more details.

Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_test_dataloader(self) -> DataLoader:\n    \"\"\"\n    Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for testing.\n    The dataloader is created on the first call and then cached.\n\n    When the dataset is used in the open-world setting, and unknown classes are defined,\n    the test dataloader returns `test_known_size` samples of known classes followed by `test_unknown_size` samples of unknown classes.\n\n    The dataloader is configured with the following config attributes:\n\n    | Dataset config    | Description                                                       |\n    | ------------------| ------------------------------------------------------------------|\n    | `test_batch_size` | Number of samples per batch for loading validation and test data. |\n    | `test_workers`    | Number of workers for loading test data.                          |\n\n    Returns:\n        Test data as an iterable dataloader. See [using dataloaders][using-dataloaders] for more details.\n    \"\"\"\n    if self.dataset_config is None:\n        raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting test dataloader\")\n    if not self.dataset_config.need_test_set:\n        raise ValueError(\"Test dataloader is not available when need_test_set is false\")\n    assert self.test_dataset is not None\n    if self.test_dataloader:\n        return self.test_dataloader\n    batch_sampler = BatchSampler(sampler=SequentialSampler(self.test_dataset), batch_size=self.dataset_config.test_batch_size, drop_last=False)\n    test_dataloader = DataLoader(\n        self.test_dataset,\n        num_workers=self.dataset_config.test_workers,\n        worker_init_fn=worker_init_fn,\n        collate_fn=self._collate_fn,\n        persistent_workers=False,\n        batch_size=None,\n        sampler=batch_sampler,)\n    if self.dataset_config.test_workers == 0:\n        self.test_dataset.pytables_worker_init()\n    self.test_dataloader = test_dataloader\n    return test_dataloader\n
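A sketch of collecting test labels in the open-world setting, assuming numpy arrays are returned (i.e. return_tensors is left at its default):

import numpy as np

test_dataloader = dataset.get_test_dataloader()
labels = []
for _, _, _, batch_labels in test_dataloader:
    labels.append(batch_labels)
labels = np.concatenate(labels)
# The first test_known_size samples belong to known classes; the remaining
# test_unknown_size samples belong to unknown classes.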
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_dataloaders","title":"get_dataloaders","text":"
get_dataloaders() -> (\n    tuple[DataLoader, DataLoader, DataLoader]\n)\n

Gets train, validation, and test dataloaders in one call.

Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_dataloaders(self) -> tuple[DataLoader, DataLoader, DataLoader]:\n    \"\"\"Gets train, validation, and test dataloaders in one call.\"\"\"\n    if self.dataset_config is None:\n        raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting dataloaders\")\n    train_dataloader = self.get_train_dataloader()\n    val_dataloader = self.get_val_dataloader()\n    test_dataloader = self.get_test_dataloader()\n    return train_dataloader, val_dataloader, test_dataloader\n
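Usage is a single call (sketch):

# Equivalent to calling the three get_*_dataloader() methods separately.
train_dataloader, val_dataloader, test_dataloader = dataset.get_dataloaders()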
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_train_df","title":"get_train_df","text":"
get_train_df(flatten_ppi: bool = False) -> pd.DataFrame\n

Creates a train Pandas DataFrame. The dataframe is in sequential (datetime) order. Consider shuffling the dataframe if needed.

Memory usage

The whole train set is loaded into memory. If the dataset size is larger than 'S', consider using get_train_dataloader instead.

Parameters:

Name Type Description Default flatten_ppi bool

Whether to flatten the PPI sequence into individual columns (named IPT_X, DIR_X, SIZE_X, PUSH_X, X being the index of the packet) or keep one PPI column with 2D data.

False

Returns:

Type Description DataFrame

Train data as a dataframe.

Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_train_df(self, flatten_ppi: bool = False) -> pd.DataFrame:\n    \"\"\"\n    Creates a train Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe is in sequential (datetime) order. Consider shuffling the dataframe if needed.\n\n    !!! warning \"Memory usage\"\n\n        The whole train set is loaded into memory. If the dataset size is larger than `'S'`, consider using `get_train_dataloader` instead.\n\n    Parameters:\n        flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n    Returns:\n        Train data as a dataframe.\n    \"\"\"\n    self._check_before_dataframe(check_train=True)\n    assert self.dataset_config is not None and self.train_dataset is not None\n    if len(self.train_dataset) > DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n        warnings.warn(f\"Train set has ({len(self.train_dataset)} samples), consider using get_train_dataloader() instead\")\n    train_dataloader = self.get_train_dataloader()\n    assert isinstance(train_dataloader.sampler, BatchSampler) and self.train_dataloader_sampler is not None\n    # Read dataloader in sequential order\n    train_dataloader.sampler.sampler = SequentialSampler(self.train_dataset)\n    train_dataloader.sampler.drop_last = False\n    feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n    df = create_df_from_dataloader(dataloader=train_dataloader,\n                                   feature_names=feature_names,\n                                   flatten_ppi=flatten_ppi,\n                                   silent=self.silent)\n    # Restore the original dataloader sampler and drop_last\n    train_dataloader.sampler.sampler = self.train_dataloader_sampler\n    train_dataloader.sampler.drop_last = self.train_dataloader_drop_last\n    return df\n
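A sketch of loading the train set as a dataframe with flattened PPI columns; the column names follow get_feature_names, and the exact number of per-packet columns depends on the dataset's PPI length:

train_df = dataset.get_train_df(flatten_ppi=True)
# Per-packet columns such as IPT_1, DIR_1, SIZE_1, ... are present instead of a single PPI column.
print(train_df.shape)
print(train_df.columns[:10].tolist())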
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_val_df","title":"get_val_df","text":"
get_val_df(flatten_ppi: bool = False) -> pd.DataFrame\n

Creates a validation Pandas DataFrame. The dataframe is in sequential (datetime) order.

Memory usage

The whole validation set is loaded into memory. If the dataset size is larger than 'S', consider using get_val_dataloader instead.

Parameters:

Name Type Description Default flatten_ppi bool

Whether to flatten the PPI sequence into individual columns (named IPT_X, DIR_X, SIZE_X, PUSH_X, X being the index of the packet) or keep one PPI column with 2D data.

False

Returns:

Type Description DataFrame

Validation data as a dataframe.

Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_val_df(self, flatten_ppi: bool = False) -> pd.DataFrame:\n    \"\"\"\n    Creates validation Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe is in sequential (datetime) order.\n\n    !!! warning \"Memory usage\"\n\n        The whole validation set is loaded into memory. If the dataset size is larger than `'S'`, consider using `get_val_dataloader` instead.\n\n    Parameters:\n        flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n    Returns:\n        Validation data as a dataframe.\n    \"\"\"\n    self._check_before_dataframe(check_val=True)\n    assert self.dataset_config is not None and self.val_dataset is not None\n    if len(self.val_dataset) > DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n        warnings.warn(f\"Validation set has ({len(self.val_dataset)} samples), consider using get_val_dataloader() instead\")\n    feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n    return create_df_from_dataloader(dataloader=self.get_val_dataloader(),\n                                     feature_names=feature_names,\n                                     flatten_ppi=flatten_ppi,\n                                     silent=self.silent)\n
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_test_df","title":"get_test_df","text":"
get_test_df(flatten_ppi: bool = False) -> pd.DataFrame\n

Creates a test Pandas DataFrame. The dataframe is in sequential (datetime) order.

When the dataset is used in the open-world setting, and unknown classes are defined, the returned test dataframe is composed of test_known_size samples of known classes followed by test_unknown_size samples of unknown classes.

Memory usage

The whole test set is loaded into memory. If the dataset size is larger than 'S', consider using get_test_dataloader instead.

Parameters:

Name Type Description Default flatten_ppi bool

Whether to flatten the PPI sequence into individual columns (named IPT_X, DIR_X, SIZE_X, PUSH_X, X being the index of the packet) or keep one PPI column with 2D data.

False

Returns:

Type Description DataFrame

Test data as a dataframe.

Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_test_df(self, flatten_ppi: bool = False) -> pd.DataFrame:\n    \"\"\"\n    Creates test Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe is in sequential (datetime) order.\n\n\n    When the dataset is used in the open-world setting, and unknown classes are defined,\n    the returned test dataframe is composed of `test_known_size` samples of known classes followed by `test_unknown_size` samples of unknown classes.\n\n\n    !!! warning \"Memory usage\"\n\n        The whole test set is loaded into memory. If the dataset size is larger than `'S'`, consider using `get_test_dataloader` instead.\n\n    Parameters:\n        flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n    Returns:\n        Test data as a dataframe.\n    \"\"\"\n    self._check_before_dataframe(check_test=True)\n    assert self.dataset_config is not None and self.test_dataset is not None\n    if len(self.test_dataset) > DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n        warnings.warn(f\"Test set has ({len(self.test_dataset)} samples), consider using get_test_dataloader() instead\")\n    feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n    return create_df_from_dataloader(dataloader=self.get_test_dataloader(),\n                                     feature_names=feature_names,\n                                     flatten_ppi=flatten_ppi,\n                                     silent=self.silent)\n
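A sketch of separating the known and unknown parts of the test dataframe, assuming test_unknown_size was set to an integer in the config:

test_df = dataset.get_test_df()
n_unknown = dataset.dataset_config.test_unknown_size
known_df = test_df.iloc[:len(test_df) - n_unknown]
unknown_df = test_df.iloc[len(test_df) - n_unknown:]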
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_num_classes","title":"get_num_classes","text":"
get_num_classes() -> int\n

Returns the number of classes in the current configuration of the dataset.

Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_num_classes(self) -> int:\n    \"\"\"Returns the number of classes in the current configuration of the dataset.\"\"\"\n    if self.class_info is None:\n        raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting the number of classes\")\n    return self.class_info.num_classes\n
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_known_apps","title":"get_known_apps","text":"
get_known_apps() -> list[str]\n

Returns the list of known applications in the current configuration of the dataset.

Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_known_apps(self) -> list[str]:\n    \"\"\"Returns the list of known applications in the current configuration of the dataset.\"\"\"\n    if self.class_info is None:\n        raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting known apps\")\n    return self.class_info.known_apps\n
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_unknown_apps","title":"get_unknown_apps","text":"
get_unknown_apps() -> list[str]\n

Returns the list of unknown applications in the current configuration of the dataset.

Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_unknown_apps(self) -> list[str]:\n    \"\"\"Returns the list of unknown applications in the current configuration of the dataset.\"\"\"\n    if self.class_info is None:\n        raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting unknown apps\")\n    return self.class_info.unknown_apps\n
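A quick inspection sketch using an initialized dataset:

print(dataset.get_num_classes())      # number of classes in the current configuration
print(dataset.get_known_apps()[:10])  # first few known applications
print(dataset.get_unknown_apps())     # unknown applications (empty in the closed-world setting)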
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.compute_dataset_statistics","title":"compute_dataset_statistics","text":"
compute_dataset_statistics(\n    num_samples: int | Literal[\"all\"] = 10000000,\n    num_workers: int = 4,\n    batch_size: int = 16384,\n    disabled_apps: Optional[list[str]] = None,\n) -> None\n

Computes dataset statistics and saves them to the statistics_path folder.

Parameters:

Name Type Description Default num_samples int | Literal['all']

Number of samples to use for computing the statistics.

10000000 num_workers int

Number of workers for loading data.

4 batch_size int

Number of samples per batch for loading data.

16384 disabled_apps Optional[list[str]]

List of applications to exclude from the statistics.

None Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def compute_dataset_statistics(self, num_samples: int | Literal[\"all\"] = 10_000_000, num_workers: int = 4, batch_size: int = 16384, disabled_apps: Optional[list[str]] = None) -> None:\n    \"\"\"\n    Computes dataset statistics and saves them to the `statistics_path` folder.\n\n    Parameters:\n        num_samples: Number of samples to use for computing the statistics.\n        num_workers: Number of workers for loading data.\n        batch_size: Number of samples per batch for loading data.\n        disabled_apps: List of applications to exclude from the statistics.\n    \"\"\"\n    if disabled_apps:\n        bad_disabled_apps = [a for a in disabled_apps if a not in self.available_classes]\n        if len(bad_disabled_apps) > 0:\n            raise ValueError(f\"Bad applications in disabled_apps {bad_disabled_apps}. Use applications available in dataset.available_classes\")\n    if not os.path.exists(self.statistics_path):\n        os.mkdir(self.statistics_path)\n    compute_dataset_statistics(database_path=self.database_path,\n                               tables_app_enum=self._tables_app_enum,\n                               tables_cat_enum=self._tables_cat_enum,\n                               output_dir=self.statistics_path,\n                               packet_histograms=self.metadata.packet_histograms,\n                               flowstats_features_boolean=self.metadata.flowstats_features_boolean,\n                               protocol=self.metadata.protocol,\n                               extra_fields=not self.name.startswith(\"CESNET-TLS22\"),\n                               disabled_apps=disabled_apps if disabled_apps is not None else [],\n                               num_samples=num_samples,\n                               num_workers=num_workers,\n                               batch_size=batch_size,\n                               silent=self.silent)\n
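A sketch of a typical call; the values shown are the documented defaults apart from num_samples:

# Statistics are written to the dataset's statistics_path folder.
dataset.compute_dataset_statistics(num_samples=1_000_000,
                                   num_workers=4,
                                   batch_size=16384,
                                   disabled_apps=None)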
"},{"location":"reference_dataset_config/","title":"Config class","text":""},{"location":"reference_dataset_config/#config.DatasetConfig","title":"config.DatasetConfig","text":"

The main class for the configuration of:

  - Train, validation, and test sets (dates, sizes, validation approach).
  - Application selection: either the standard closed-world setting (only known classes) or the open-world setting (known and unknown classes).
  - Data transformations. See the transforms page for more information.
  - Dataloader options such as batch sizes, order of loading, and number of workers.

When initializing this class, pass a CesnetDataset instance to be configured and the desired configuration. Available options are here.

Attributes:

Name Type Description dataset InitVar[CesnetDataset]

The dataset instance to be configured.

data_root str

Taken from the dataset instance.

database_filename str

Taken from the dataset instance.

database_path str

Taken from the dataset instance.

servicemap_path str

Taken from the dataset instance.

flowstats_features list[str]

Taken from dataset.metadata.flowstats_features.

flowstats_features_boolean list[str]

Taken from dataset.metadata.flowstats_features_boolean.

flowstats_features_phist list[str]

Taken from dataset.metadata.packet_histograms if use_packet_histograms is true, otherwise an empty list.

other_fields list[str]

Taken from dataset.metadata.other_fields if return_other_fields is true, otherwise an empty list.

"},{"location":"reference_dataset_config/#config.DatasetConfig--configuration-options","title":"Configuration options","text":"

Attributes:

Name Type Description need_train_set bool

Use to disable the train set. Default: True

need_val_set bool

Use to disable the validation set. Default: True

need_test_set bool

Use to disable the test set. Default: True

train_period_name str

Name of the train period. See instructions.

train_dates list[str]

Dates used for creating a train set.

train_dates_weigths Optional[list[int]]

To use a non-uniform distribution of samples across train dates.

val_approach ValidationApproach

How a validation set should be created. Either split train data into train and validation or have a separate validation period. Default: SPLIT_FROM_TRAIN

train_val_split_fraction float

The fraction of validation samples when splitting from the train set. Default: 0.2

val_period_name str

Name of the validation period. See instructions.

val_dates list[str]

Dates used for creating a validation set.

test_period_name str

Name of the test period. See instructions.

test_dates list[str]

Dates used for creating a test set.

apps_selection AppSelection

How to select application classes. Default: ALL_KNOWN

apps_selection_topx int

Take top X as known.

apps_selection_background_unknown list[str]

Provide a list of background traffic classes to be used as unknown.

apps_selection_fixed_known list[str]

Provide a list of manually selected known applications.

apps_selection_fixed_unknown list[str]

Provide a list of manually selected unknown applications.

disabled_apps list[str]

List of applications to be disabled and not used at all.

min_train_samples_check MinTrainSamplesCheck

How to handle applications with not enough training samples. Default: DISABLE_APPS

min_train_samples_per_app int

Defines the threshold for not enough. Default: 100

random_state int

Fix all random processes performed during dataset initialization. Default: 420

fold_id int

To perform N-fold cross-validation, set this to 1..N. Each fold will use the same configuration but a different random seed. Default: 0

train_workers int

Number of workers for loading train data. 0 means that the data will be loaded in the main process. Default: 4

test_workers int

Number of workers for loading test data. 0 means that the data will be loaded in the main process. Default: 1

val_workers int

Number of workers for loading validation data. 0 means that the data will be loaded in the main process. Default: 1

batch_size int

Number of samples per batch. Default: 192

test_batch_size int

Number of samples per batch for loading validation and test data. Default: 2048

preload_val bool

Whether to dump the validation set with numpy.savez_compressed and preload it in future runs. Useful when running a lot of experiments with the same dataset configuration. Default: False

preload_test bool

Whether to dump the test set with numpy.savez_compressed and preload it in future runs. Default: False

train_size int | Literal['all']

Size of the train set. See instructions. Default: all

val_known_size int | Literal['all']

Size of the validation set. See instructions. Default: all

test_known_size int | Literal['all']

Size of the test set. See instructions. Default: all

val_unknown_size int | Literal['all']

Size of the unknown classes validation set. Use for evaluation in the open-world setting. Default: 0

test_unknown_size int | Literal['all']

Size of the unknown classes test set. Use for evaluation in the open-world setting. Default: 0

train_dataloader_order DataLoaderOrder

Whether to load train data in sequential or random order. Default: RANDOM

train_dataloader_seed Optional[int]

Seed for loading train data in random order. Default: None

return_other_fields bool

Whether to return auxiliary fields, such as communicating hosts, flow times, and more fields extracted from the ClientHello message. Default: False

return_tensors bool

Use for returning torch.Tensor from dataloaders. Dataframes are not available when this option is used. Default: False

use_packet_histograms bool

Whether to use packet histogram features, if available in the dataset. Default: True

use_tcp_features bool

Whether to use TCP features, if available in the dataset. Default: True

use_push_flags bool

Whether to use push flags in packet sequences, if available in the dataset. Default: False

fit_scalers_samples int | float

Used when a scaling transformation is configured and requires fitting. If a float, the fraction of train samples used for fitting; if an integer, the absolute number of samples.

ppi_transform Optional[Callable]

Transform function for PPI sequences. See the transforms page for more information. Default: None

flowstats_transform Optional[Callable]

Transform function for flow statistics. See the transforms page for more information. Default: None

flowstats_phist_transform Optional[Callable]

Transform function for packet histograms. See the transforms page for more information. Default: None
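To make the options above concrete, here is a hedged configuration sketch for an open-world experiment; it assumes AppSelection is importable from cesnet_datazoo.config alongside DatasetConfig, and the sizes are illustrative:

from cesnet_datazoo.config import AppSelection, DatasetConfig

dataset_config = DatasetConfig(
    dataset=dataset,                         # a previously created CesnetDataset instance
    apps_selection=AppSelection.TOPX_KNOWN,  # top X applications become known classes
    apps_selection_topx=100,
    train_size=1_000_000,
    val_known_size=100_000,
    test_known_size=500_000,
    test_unknown_size=50_000,                # enables open-world evaluation on the test set
    use_packet_histograms=True,
)
dataset.set_dataset_config_and_initialize(dataset_config)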

"},{"location":"reference_dataset_config/#config.DatasetConfig--how-to-configure-train-validation-and-test-sets","title":"How to configure train, validation, and test sets","text":"

There are three options for how to define train/validation/test dates.

  1. Choose a predefined time period (train_period_name, val_period_name, or test_period_name) available in dataset.time_periods and leave the list of dates (train_dates, val_dates, or test_dates) empty.
  2. Provide a list of dates and a name for the time period. The dates are checked against dataset.available_dates.
  3. Do not specify anything and use the dataset's defaults, dataset.default_train_period_name and dataset.default_test_period_name.

There are two options for configuring sizes of train/validation/test sets.

  1. Select an appropriate dataset size (default is S) when creating the CesnetDataset instance and leave train_size, val_known_size, and test_known_size with their default all value. This will create train/validation/test sets with all samples available in the selected dataset size (of course, depending on the selected dates and validation approach).
  2. Provide exact sizes in train_size, val_known_size, and test_known_size. This will create train/validation/test sets of the given sizes by taking a random subset. This is especially useful when using the ORIG dataset size and you want to control the size of experiments.

Tip

The default approach for creating a validation set is to randomly split the train data into train and validation. The second approach is to define separate validation dates. See ValidationApproach.
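For instance, option 2 above could look like the following sketch; the dates are illustrative, must exist in dataset.available_dates, and use the YYYYMMDD format checked in the source below:

dataset_config = DatasetConfig(
    dataset=dataset,
    train_period_name="my-train-period",  # a name is required when explicit dates are given
    train_dates=["20221031", "20221101", "20221102"],
    test_period_name="my-test-period",
    test_dates=["20221121", "20221122"],
    # train_size, val_known_size, and test_known_size keep their default "all" value,
    # so all samples from the selected dataset size and dates are used.
)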

Source code in cesnet_datazoo\\config.py
@dataclass(config=C)\nclass DatasetConfig():\n    \"\"\"\n    The main class for the configuration of:\n\n    - Train, validation, test sets (dates, sizes, validation approach).\n    - Application selection \u2014 either the standard closed-world setting (only *known* classes) or the open-world setting (*known* and *unknown* classes).\n    - Data transformations. See the [transforms][transforms] page for more information.\n    - Dataloader options like batch sizes, order of loading, or number of workers.\n\n    When initializing this class, pass a [`CesnetDataset`][datasets.cesnet_dataset.CesnetDataset] instance to be configured and the desired configuration. Available options are [here][config.DatasetConfig--configuration-options].\n\n    Attributes:\n        dataset: The dataset instance to be configured.\n        data_root: Taken from the dataset instance.\n        database_filename: Taken from the dataset instance.\n        database_path: Taken from the dataset instance.\n        servicemap_path: Taken from the dataset instance.\n        flowstats_features: Taken from `dataset.metadata.flowstats_features`.\n        flowstats_features_boolean: Taken from `dataset.metadata.flowstats_features_boolean`.\n        flowstats_features_phist: Taken from `dataset.metadata.packet_histograms` if `use_packet_histograms` is true, otherwise an empty list.\n        other_fields: Taken from `dataset.metadata.other_fields` if `return_other_fields` is true, otherwise an empty list.\n\n    # Configuration options\n\n    Attributes:\n        need_train_set: Use to disable the train set. `Default: True`\n        need_val_set: Use to disable the validation set. `Default: True`\n        need_test_set: Use to disable the test set. `Default: True`\n        train_period_name: Name of the train period. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets].\n        train_dates: Dates used for creating a train set.\n        train_dates_weigths: To use a non-uniform distribution of samples across train dates.\n        val_approach: How a validation set should be created. Either split train data into train and validation or have a separate validation period. `Default: SPLIT_FROM_TRAIN`\n        train_val_split_fraction: The fraction of validation samples when splitting from the train set. `Default: 0.2`\n        val_period_name: Name of the validation period. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets].\n        val_dates: Dates used for creating a validation set.\n        test_period_name: Name of the test period. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets].\n        test_dates: Dates used for creating a test set.\n\n        apps_selection: How to select application classes. `Default: ALL_KNOWN`\n        apps_selection_topx: Take top X as known.\n        apps_selection_background_unknown: Provide a list of background traffic classes to be used as unknown.\n        apps_selection_fixed_known: Provide a list of manually selected known applications.\n        apps_selection_fixed_unknown: Provide a list of manually selected unknown applications.\n        disabled_apps: List of applications to be disabled and not used at all.\n        min_train_samples_check: How to handle applications with *not enough* training samples. `Default: DISABLE_APPS`\n        min_train_samples_per_app: Defines the threshold for *not enough*. 
`Default: 100`\n\n        random_state: Fix all random processes performed during dataset initialization. `Default: 420`\n        fold_id: To perform N-fold cross-validation, set this to `1..N`. Each fold will use the same configuration but a different random seed. `Default: 0`\n        train_workers: Number of workers for loading train data. `0` means that the data will be loaded in the main process. `Default: 4`\n        test_workers: Number of workers for loading test data. `0` means that the data will be loaded in the main process. `Default: 1`\n        val_workers: Number of workers for loading validation data. `0` means that the data will be loaded in the main process. `Default: 1`\n        batch_size: Number of samples per batch. `Default: 192`\n        test_batch_size: Number of samples per batch for loading validation and test data. `Default: 2048`\n        preload_val: Whether to dump the validation set with `numpy.savez_compressed` and preload it in future runs. Useful when running a lot of experiments with the same dataset configuration. `Default: False`\n        preload_test: Whether to dump the test set with `numpy.savez_compressed` and preload it in future runs. `Default: False`\n        train_size: Size of the train set. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets]. `Default: all`\n        val_known_size: Size of the validation set. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets]. `Default: all`\n        test_known_size: Size of the test set. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets]. `Default: all`\n        val_unknown_size: Size of the unknown classes validation set. Use for evaluation in the open-world setting. `Default: 0`\n        test_unknown_size: Size of the unknown classes test set. Use for evaluation in the open-world setting. `Default: 0`\n        train_dataloader_order: Whether to load train data in sequential or random order. `Default: RANDOM`\n        train_dataloader_seed: Seed for loading train data in random order. `Default: None`\n\n        return_other_fields: Whether to return [auxiliary fields][other-fields], such as communicating hosts, flow times, and more fields extracted from the ClientHello message. `Default: False`\n        return_tensors: Use for returning `torch.Tensor` from dataloaders. Dataframes are not available when this option is used. `Default: False`\n        use_packet_histograms: Whether to use packet histogram features, if available in the dataset. `Default: True`\n        use_tcp_features: Whether to use TCP features, if available in the dataset. `Default: True`\n        use_push_flags: Whether to use push flags in packet sequences, if available in the dataset. `Default: False`\n        fit_scalers_samples: Used when scaling transformation is configured and requires fitting. Fraction of train samples used for fitting, if float. The absolute number of samples otherwise. `Default: 0.25`\n        ppi_transform: Transform function for PPI sequences. See the [transforms][transforms] page for more information. `Default: None`\n        flowstats_transform: Transform function for flow statistics. See the [transforms][transforms] page for more information. `Default: None`\n        flowstats_phist_transform: Transform function for packet histograms. See the [transforms][transforms] page for more information. 
`Default: None`\n\n    # How to configure train, validation, and test sets\n    There are three options for how to define train/validation/test dates.\n\n    1. Choose a predefined time period (`train_period_name`, `val_period_name`, or `test_period_name`) available in `dataset.time_periods` and leave the list of dates (`train_dates`, `val_dates`, or `test_dates`) empty.\n    2. Provide a list of dates and a name for the time period. The dates are checked against `dataset.available_dates`.\n    3. Do not specify anything and use the dataset's defaults `dataset.default_train_period_name` and `dataset.default_test_period_name`.\n\n    There are two options for configuring sizes of train/validation/test sets.\n\n    1. Select an appropriate dataset size (default is `S`) when creating the [`CesnetDataset`][datasets.cesnet_dataset.CesnetDataset] instance and leave `train_size`, `val_known_size`, and `test_known_size` with their default `all` value.\n    This will create train/validation/test sets with all samples available in the selected dataset size (of course, depending on the selected dates and validation approach).\n    2. Provide exact sizes in `train_size`, `val_known_size`, and `test_known_size`. This will create train/validation/test sets of the given sizes by doing a random subset.\n    This is especially useful when using the `ORIG` dataset size and want to control the size of experiments.\n\n    !!! tip Validation set\n        The default approach for creating a validation set is to randomly split the train data into train and validation. The second approach is to define separate validation dates. See [ValidationApproach][config.ValidationApproach].\n\n    \"\"\"\n    dataset: InitVar[CesnetDataset]\n    data_root: str = field(init=False)\n    database_filename: str =  field(init=False)\n    database_path: str =  field(init=False)\n    servicemap_path: str = field(init=False)\n    flowstats_features: list[str] = field(init=False)\n    flowstats_features_boolean: list[str] = field(init=False)\n    flowstats_features_phist: list[str] = field(init=False)\n    other_fields: list[str] = field(init=False)\n\n    need_train_set: bool = True\n    need_val_set: bool = True\n    need_test_set: bool = True\n    train_period_name: str = \"\"\n    train_dates: list[str] = field(default_factory=list)\n    train_dates_weigths: Optional[list[int]] = None\n    val_approach: ValidationApproach = ValidationApproach.SPLIT_FROM_TRAIN\n    train_val_split_fraction: float = 0.2\n    val_period_name: str = \"\"\n    val_dates: list[str] = field(default_factory=list)\n    test_period_name: str = \"\"\n    test_dates: list[str] = field(default_factory=list)\n\n    apps_selection: AppSelection = AppSelection.ALL_KNOWN\n    apps_selection_topx: int = 0\n    apps_selection_background_unknown: list[str] = field(default_factory=list)\n    apps_selection_fixed_known: list[str] = field(default_factory=list)\n    apps_selection_fixed_unknown: list[str] = field(default_factory=list)\n    disabled_apps: list[str] = field(default_factory=list)\n    min_train_samples_check: MinTrainSamplesCheck = MinTrainSamplesCheck.DISABLE_APPS\n    min_train_samples_per_app: int = 100\n\n    random_state: int = 420\n    fold_id: int = 0\n    train_workers: int = 4\n    test_workers: int = 1\n    val_workers: int = 1\n    batch_size: int = 192\n    test_batch_size: int = 2048\n    preload_val: bool = False\n    preload_test: bool = False\n    train_size: int | Literal[\"all\"] = \"all\"\n    val_known_size: int | Literal[\"all\"] = 
\"all\"\n    test_known_size: int | Literal[\"all\"] = \"all\"\n    val_unknown_size: int | Literal[\"all\"] = 0\n    test_unknown_size: int | Literal[\"all\"] = 0\n    train_dataloader_order: DataLoaderOrder = DataLoaderOrder.RANDOM\n    train_dataloader_seed: Optional[int] = None\n\n    return_other_fields: bool = False\n    return_tensors: bool = False\n    use_packet_histograms: bool = False\n    use_tcp_features: bool = False\n    use_push_flags: bool = False\n    fit_scalers_samples: int | float = 0.25\n    ppi_transform: Optional[Callable] = None\n    flowstats_transform: Optional[Callable] = None\n    flowstats_phist_transform: Optional[Callable] = None\n\n    def __post_init__(self, dataset: CesnetDataset):\n        \"\"\"\n        Ensures valid configuration. Catches all incompatible options and raise exceptions as soon as possible.\n        \"\"\"\n        self.data_root = dataset.data_root\n        self.servicemap_path = dataset.servicemap_path\n        self.database_filename = dataset.database_filename\n        self.database_path = dataset.database_path\n\n        if not self.need_train_set:\n            if self.apps_selection != AppSelection.FIXED:\n                raise ValueError(\"Application selection has to be fixed when need_train_set is false\")\n            if (len(self.train_dates) > 0 or self.train_period_name != \"\"):\n                raise ValueError(\"train_dates and train_period_name cannot be specified when need_train_set is false\")\n        else:\n            # Configure train dates\n            if len(self.train_dates) > 0 and self.train_period_name == \"\":\n                raise ValueError(\"train_period_name has to be specified when train_dates are set\")\n            if len(self.train_dates) == 0 and self.train_period_name != \"\":\n                if self.train_period_name not in dataset.time_periods:\n                    raise ValueError(f\"Unknown train_period_name {self.train_period_name}. Use time period available in dataset.time_periods\")\n                self.train_dates = dataset.time_periods[self.train_period_name]\n            if len(self.train_dates) == 0 and self.train_period_name == \"\":\n                self.train_period_name = dataset.default_train_period_name\n                self.train_dates = dataset.time_periods[dataset.default_train_period_name]\n        # Configure test dates\n        if not self.need_test_set:\n            if (len(self.test_dates) > 0 or self.test_period_name != \"\"):\n                raise ValueError(\"test_dates and test_period_name cannot be specified when need_test_set is false\")\n        else:\n            if len(self.test_dates) > 0 and self.test_period_name == \"\":\n                raise ValueError(\"test_period_name has to be specified when test_dates are set\")\n            if len(self.test_dates) == 0 and self.test_period_name != \"\":\n                if self.test_period_name not in dataset.time_periods:\n                    raise ValueError(f\"Unknown test_period_name {self.test_period_name}. 
Use time period available in dataset.time_periods\")\n                self.test_dates = dataset.time_periods[self.test_period_name]\n            if len(self.test_dates) == 0 and self.test_period_name == \"\":\n                self.test_period_name = dataset.default_test_period_name\n                self.test_dates = dataset.time_periods[dataset.default_test_period_name]\n        # Configure val dates\n        if not self.need_val_set:\n            if len(self.val_dates) > 0 or self.val_period_name != \"\" or self.val_approach != ValidationApproach.SPLIT_FROM_TRAIN:\n                raise ValueError(\"val_dates, val_period_name, and val_approach cannot be specified when need_val_set is false\")\n        else:\n            if self.val_approach == ValidationApproach.SPLIT_FROM_TRAIN:\n                if len(self.val_dates) > 0 or self.val_period_name != \"\":\n                    raise ValueError(\"val_dates and val_period_name cannot be specified when the validation approach is split-from-train\")\n                if not self.need_train_set:\n                    raise ValueError(\"Cannot use the split-from-train validation approach when need_train_set is false. Either use the validation-dates approach or set need_val_set to false.\")\n            elif self.val_approach == ValidationApproach.VALIDATION_DATES:\n                if len(self.val_dates) > 0 and self.val_period_name == \"\":\n                    raise ValueError(\"val_period_name has to be specified when val_dates are set\")\n                if len(self.val_dates) == 0 and self.val_period_name != \"\":\n                    if self.val_period_name not in dataset.time_periods:\n                        raise ValueError(f\"Unknown val_period_name {self.val_period_name}. Use time period available in dataset.time_periods\")\n                    self.val_dates = dataset.time_periods[self.val_period_name]\n                if len(self.val_dates) == 0 and self.val_period_name == \"\":\n                    raise ValueError(\"val_period_name and val_dates (or val_period_name from dataset.time_periods) have to be specified when the validation approach is validation-dates\")\n        # Check if train, val, and test dates are available in the dataset\n        bad_train_dates = [t for t in self.train_dates if t not in dataset.available_dates]\n        bad_val_dates = [t for t in self.val_dates if t not in dataset.available_dates]\n        bad_test_dates = [t for t in self.test_dates if t not in dataset.available_dates]\n        if len(bad_train_dates) > 0:\n            raise ValueError(f\"Bad train dates {bad_train_dates}. Use dates available in dataset.available_dates (collection period {dataset.metadata.collection_period})\" \\\n                            + (f\". These dates are missing from the dataset collection period {dataset.metadata.missing_dates_in_collection_period}\" if dataset.metadata.missing_dates_in_collection_period else \"\"))\n        if len(bad_val_dates) > 0:\n            raise ValueError(f\"Bad validation dates {bad_val_dates}. Use dates available in dataset.available_dates (collection period {dataset.metadata.collection_period})\" \\\n                            + (f\". These dates are missing from the dataset collection period {dataset.metadata.missing_dates_in_collection_period}\" if dataset.metadata.missing_dates_in_collection_period else \"\"))\n        if len(bad_test_dates) > 0:\n            raise ValueError(f\"Bad test dates {bad_test_dates}. 
Use dates available in dataset.available_dates (collection period {dataset.metadata.collection_period})\" \\\n                            + (f\". These dates are missing from the dataset collection period {dataset.metadata.missing_dates_in_collection_period}\" if dataset.metadata.missing_dates_in_collection_period else \"\"))\n        # Check time order of train, val, and test periods\n        train_dates = [datetime.strptime(date_str, \"%Y%m%d\").date() for date_str in self.train_dates]\n        test_dates = [datetime.strptime(date_str, \"%Y%m%d\").date() for date_str in self.test_dates]\n        if len(train_dates) > 0 and len(test_dates) > 0 and min(test_dates) <= max(train_dates):\n            warnings.warn(f\"Some test dates ({min(test_dates).strftime('%Y%m%d')}) are before or equal to the last train date ({max(train_dates).strftime('%Y%m%d')}). This might lead to improper evaluation and should be avoided.\")\n        if self.val_approach == ValidationApproach.VALIDATION_DATES:\n            val_dates = [datetime.strptime(date_str, \"%Y%m%d\").date() for date_str in self.val_dates]\n            if len(train_dates) > 0 and min(val_dates) <= max(train_dates):\n                warnings.warn(f\"Some validation dates ({min(val_dates).strftime('%Y%m%d')}) are before or equal to the last train date ({max(train_dates).strftime('%Y%m%d')}). This might lead to improper evaluation and should be avoided.\")\n            if len(test_dates) > 0 and min(test_dates) <= max(val_dates):\n                warnings.warn(f\"Some test dates ({min(test_dates).strftime('%Y%m%d')}) are before or equal to the last validation date ({max(val_dates).strftime('%Y%m%d')}). This might lead to improper evaluation and should be avoided.\")\n        # Configure features\n        self.flowstats_features = dataset.metadata.flowstats_features\n        self.flowstats_features_boolean = dataset.metadata.flowstats_features_boolean\n        self.other_fields = dataset.metadata.other_fields if self.return_other_fields else []\n        if self.use_packet_histograms:\n            if len(dataset.metadata.packet_histograms) == 0:\n                raise ValueError(\"This dataset does not support use_packet_histograms\")\n            self.flowstats_features_phist = dataset.metadata.packet_histograms\n        else:\n            self.flowstats_features_phist = []\n            if self.flowstats_phist_transform is not None:\n                raise ValueError(\"flowstats_phist_transform cannot be specified when use_packet_histograms is false\")\n        if dataset.metadata.protocol == Protocol.TLS:\n            if self.use_tcp_features:\n                self.flowstats_features_boolean = self.flowstats_features_boolean + SELECTED_TCP_FLAGS\n            if self.use_push_flags and \"PUSH_FLAG\" not in dataset.metadata.ppi_features:\n                raise ValueError(\"This TLS dataset does not support use_push_flags\")\n        if dataset.metadata.protocol == Protocol.QUIC:\n            if self.use_tcp_features:\n                raise ValueError(\"QUIC datasets do not support use_tcp_features\")\n            if self.use_push_flags:\n                raise ValueError(\"QUIC datasets do not support use_push_flags\")\n        # When train_dates_weigths are used, train_size and val_known_size have to be specified\n        if self.train_dates_weigths is not None:\n            if not self.need_train_set:\n                raise ValueError(\"train_dates_weigths cannot be specified when need_train_set is false\")\n            if 
len(self.train_dates_weigths) != len(self.train_dates):\n                raise ValueError(\"train_dates_weigths has to have the same length as train_dates\")\n            if self.train_size == \"all\":\n                raise ValueError(\"train_size cannot be 'all' when train_dates_weigths are speficied\")\n            if self.val_approach == ValidationApproach.SPLIT_FROM_TRAIN and self.val_known_size == \"all\":\n                raise ValueError(\"val_known_size cannot be 'all' when train_dates_weigths are speficied and validation_approach is split-from-train\")\n        # App selection\n        if self.apps_selection == AppSelection.ALL_KNOWN:\n            self.val_unknown_size = 0\n            self.test_unknown_size = 0\n            if self.apps_selection_topx != 0 or len(self.apps_selection_background_unknown) > 0 or len(self.apps_selection_fixed_known) > 0 or len(self.apps_selection_fixed_unknown) > 0:\n                raise ValueError(\"apps_selection_topx, apps_selection_background_unknown, apps_selection_fixed_known, and apps_selection_fixed_unknown cannot be specified when application selection is all-known\")\n        if self.apps_selection == AppSelection.TOPX_KNOWN:\n            if self.apps_selection_topx == 0:\n                raise ValueError(\"apps_selection_topx has to be greater than 0 when application selection is top-x-known\")\n            if len(self.apps_selection_background_unknown) > 0 or len(self.apps_selection_fixed_known) > 0 or len(self.apps_selection_fixed_unknown) > 0:\n                raise ValueError(\"apps_selection_background_unknown, apps_selection_fixed_known, and apps_selection_fixed_unknown cannot be specified when application selection is top-x-known\")\n        if self.apps_selection == AppSelection.BACKGROUND_UNKNOWN:\n            if len(self.apps_selection_background_unknown) == 0:\n                raise ValueError(\"apps_selection_background_unknown has to be specified when application selection is background-unknown\")\n            bad_apps = [a for a in self.apps_selection_background_unknown if a not in dataset.available_classes]\n            if len(bad_apps) > 0:\n                raise ValueError(f\"Bad applications in apps_selection_background_unknown {bad_apps}. Use applications available in dataset.available_classes\")\n            if self.apps_selection_topx != 0 or len(self.apps_selection_fixed_known) > 0 or len(self.apps_selection_fixed_unknown) > 0:\n                raise ValueError(\"apps_selection_topx, apps_selection_fixed_known, and apps_selection_fixed_unknown cannot be specified when application selection is background-unknown\")\n        if self.apps_selection == AppSelection.FIXED:\n            if len(self.apps_selection_fixed_known) == 0:\n                raise ValueError(\"apps_selection_fixed_known has to be specified when application selection is fixed\")\n            bad_apps = [a for a in self.apps_selection_fixed_known + self.apps_selection_fixed_unknown if a not in dataset.available_classes]\n            if len(bad_apps) > 0:\n                raise ValueError(f\"Bad applications in apps_selection_fixed_known or apps_selection_fixed_unknown {bad_apps}. 
Use applications available in dataset.available_classes\")\n            if len(self.disabled_apps) > 0:\n                raise ValueError(\"disabled_apps cannot be specified when application selection is fixed\")\n            if self.min_train_samples_per_app != 0 and self.min_train_samples_per_app != 100:\n                warnings.warn(\"min_train_samples_per_app is not used when application selection is fixed\")\n            if self.apps_selection_topx != 0 or len(self.apps_selection_background_unknown) > 0:\n                raise ValueError(\"apps_selection_topx and apps_selection_background_unknown cannot be specified when application selection is fixed\")\n        # More asserts\n        bad_disabled_apps = [a for a in self.disabled_apps if a not in dataset.available_classes]\n        if len(bad_disabled_apps) > 0:\n            raise ValueError(f\"Bad applications in disabled_apps {bad_disabled_apps}. Use applications available in dataset.available_classes\")\n        if isinstance(self.fit_scalers_samples, float) and (self.fit_scalers_samples <= 0 or self.fit_scalers_samples > 1):\n            raise ValueError(\"fit_scalers_samples has to be either float between 0 and 1 (giving the fraction of training samples used for fitting scalers) or an integer\")\n\n    def get_flowstats_features_len(self) -> int:\n        \"\"\"Gets the number of flow statistics features.\"\"\"\n        return len(self.flowstats_features) + len(self.flowstats_features_boolean) + PHIST_BIN_COUNT * len(self.flowstats_features_phist)\n\n    def get_flowstats_feature_names_expanded(self, shorter_names: bool = False) -> list[str]:\n        \"\"\"Gets names of flow statistics features. Packet histograms are expanded into bin features.\"\"\"\n        phist_mapping = {\n            \"PHIST_SRC_SIZES\": [f\"PSIZE_BIN{i}\" for i in range(1, PHIST_BIN_COUNT + 1)],\n            \"PHIST_DST_SIZES\": [f\"PSIZE_BIN{i}_REV\" for i in range(1, PHIST_BIN_COUNT + 1)],\n            \"PHIST_SRC_IPT\": [f\"IPT_BIN{i}\" for i in range(1, PHIST_BIN_COUNT + 1)],\n            \"PHIST_DST_IPT\": [f\"IPT_BIN{i}_REV\" for i in range(1, PHIST_BIN_COUNT + 1)],\n        }\n        short_names_mapping = {\n            \"FLOW_ENDREASON_IDLE\": \"FEND_IDLE\",\n            \"FLOW_ENDREASON_ACTIVE\": \"FEND_ACTIVE\",\n            \"FLOW_ENDREASON_END\": \"FEND_END\",\n            \"FLOW_ENDREASON_OTHER\": \"FEND_OTHER\",\n            \"FLAG_CWR\": \"F_CWR\",\n            \"FLAG_CWR_REV\": \"F_CWR_REV\",\n            \"FLAG_ECE\": \"F_ECE\",\n            \"FLAG_ECE_REV\": \"F_ECE_REV\",\n            \"FLAG_PSH_REV\": \"F_PSH_REV\",\n            \"FLAG_RST\": \"F_RST\",\n            \"FLAG_RST_REV\": \"F_RST_REV\",\n            \"FLAG_FIN\": \"F_FIN\",\n            \"FLAG_FIN_REV\": \"F_FIN_REV\",\n        }\n        feature_names = self.flowstats_features[:]\n        for f in self.flowstats_features_boolean:\n            if shorter_names and f in short_names_mapping:\n                feature_names.append(short_names_mapping[f])\n            else:\n                feature_names.append(f)\n        for f in self.flowstats_features_phist:\n            feature_names.extend(phist_mapping[f])\n        assert len(feature_names) == self.get_flowstats_features_len()\n        return feature_names\n\n    def get_ppi_feature_names(self) -> list[str]:\n        \"\"\"Gets the names of flattened PPI features.\"\"\"\n        ppi_feature_names = [f\"IPT_{i}\" for i in range(1, PPI_MAX_LEN + 1)] + \\\n                               [f\"DIR_{i}\" for i in range(1, 
PPI_MAX_LEN + 1)] + \\\n                               [f\"SIZE_{i}\" for i in range(1, PPI_MAX_LEN + 1)]\n        if self.use_push_flags:\n            ppi_feature_names += [f\"PUSH_{i}\" for i in range(1, PPI_MAX_LEN + 1)]\n        return ppi_feature_names\n\n    def get_ppi_channels(self) -> list[int]:\n        \"\"\"Gets the available features (channels) in PPI sequences.\"\"\"\n        if self.use_push_flags:\n            return TCP_PPI_CHANNELS\n        else:\n            return UDP_PPI_CHANNELS\n\n    def get_feature_names(self, flatten_ppi: bool = False, shorter_names: bool = False) -> list[str]:\n        \"\"\"\n        Gets feature names.\n\n        Parameters:\n            flatten_ppi: Whether to flatten PPI into individual feature names or keep one `PPI` column.\n        \"\"\"\n        feature_names = self.get_ppi_feature_names() if flatten_ppi else [\"PPI\"]\n        feature_names += self.get_flowstats_feature_names_expanded(shorter_names=shorter_names)\n        return feature_names\n\n    def _get_train_tables_paths(self) -> list[str]:\n        return list(map(lambda t: f\"/flows/D{t}\", self.train_dates))\n\n    def _get_val_tables_paths(self) -> list[str]:\n        if self.val_approach == ValidationApproach.SPLIT_FROM_TRAIN:\n            return self._get_train_tables_paths()\n        return list(map(lambda t: f\"/flows/D{t}\", self.val_dates))\n\n    def _get_test_tables_paths(self) -> list[str]:\n        return list(map(lambda t: f\"/flows/D{t}\", self.test_dates))\n\n    def _get_train_data_hash(self) -> str:\n        train_data_params = self._get_train_data_params()\n        params_hash = hashlib.sha256(json.dumps(dataclasses.asdict(train_data_params), sort_keys=True, default=str).encode()).hexdigest()\n        params_hash = params_hash[:10]\n        return params_hash\n\n    def _get_train_data_path(self) -> str:\n        if self.need_train_set:\n            params_hash = self._get_train_data_hash()\n            return os.path.join(self.data_root, \"train-data\", f\"{params_hash}_{self.random_state}\", f\"fold_{self.fold_id}\")\n        else:\n            return os.path.join(self.data_root, \"train-data\", \"default\")\n\n    def _get_train_data_params(self) -> TrainDataParams:\n        return TrainDataParams(\n            database_filename=self.database_filename,\n            train_period_name=self.train_period_name,\n            train_tables_paths=self._get_train_tables_paths(),\n            apps_selection=self.apps_selection,\n            apps_selection_topx=self.apps_selection_topx,\n            apps_selection_background_unknown=self.apps_selection_background_unknown,\n            apps_selection_fixed_known=self.apps_selection_fixed_known,\n            apps_selection_fixed_unknown=self.apps_selection_fixed_unknown,\n            disabled_apps=self.disabled_apps,\n            min_train_samples_per_app=self.min_train_samples_per_app,\n            min_train_samples_check=self.min_train_samples_check,)\n\n    def _get_val_data_params_and_path(self, known_apps: list[str], unknown_apps: list[str]) -> tuple[TestDataParams, str]:\n        assert self.val_approach == ValidationApproach.VALIDATION_DATES\n        val_data_params = TestDataParams(\n            database_filename=self.database_filename,\n            test_period_name=self.val_period_name,\n            test_tables_paths=self._get_val_tables_paths(),\n            known_apps=known_apps,\n            unknown_apps=unknown_apps,)\n        params_hash = hashlib.sha256(json.dumps(dataclasses.asdict(val_data_params), 
sort_keys=True).encode()).hexdigest()\n        params_hash = params_hash[:10]\n        val_data_path = os.path.join(self.data_root, \"val-data\", f\"{params_hash}_{self.random_state}\")\n        return val_data_params, val_data_path\n\n    def _get_test_data_params_and_path(self, known_apps: list[str], unknown_apps: list[str]) -> tuple[TestDataParams, str]:\n        test_data_params = TestDataParams(\n            database_filename=self.database_filename,\n            test_period_name=self.test_period_name,\n            test_tables_paths=self._get_test_tables_paths(),\n            known_apps=known_apps,\n            unknown_apps=unknown_apps,)\n        params_hash = hashlib.sha256(json.dumps(dataclasses.asdict(test_data_params), sort_keys=True).encode()).hexdigest()\n        params_hash = params_hash[:10]\n        test_data_path = os.path.join(self.data_root, \"test-data\", f\"{params_hash}_{self.random_state}\")\n        return test_data_params, test_data_path\n\n    @model_validator(mode=\"before\") # type: ignore\n    @classmethod\n    def check_deprecated_args(cls, values):\n        kwargs = values.kwargs\n        if \"train_period\" in kwargs:\n            warnings.warn(\"train_period is deprecated. Use train_period_name instead.\")\n            kwargs[\"train_period_name\"] = kwargs[\"train_period\"]\n        if \"val_period\" in kwargs:\n            warnings.warn(\"val_period is deprecated. Use val_period_name instead.\")\n            kwargs[\"val_period_name\"] = kwargs[\"val_period\"]\n        if \"test_period\" in kwargs:\n            warnings.warn(\"test_period is deprecated. Use test_period_name instead.\")\n            kwargs[\"test_period_name\"] = kwargs[\"test_period\"]\n        return values\n\n    def __str__(self):\n        _process_tag = yaml.emitter.Emitter.process_tag\n        _ignore_aliases = yaml.Dumper.ignore_aliases\n        yaml.emitter.Emitter.process_tag = lambda self, *args, **kw: None\n        yaml.Dumper.ignore_aliases = lambda self, *args, **kw: True\n        s = yaml.dump(dataclasses.asdict(self), sort_keys=False)\n        yaml.emitter.Emitter.process_tag = _process_tag\n        yaml.Dumper.ignore_aliases = _ignore_aliases\n        return s\n
"},{"location":"reference_dataset_config/#config.DatasetConfig-functions","title":"Functions","text":""},{"location":"reference_dataset_config/#config.DatasetConfig.get_flowstats_features_len","title":"get_flowstats_features_len","text":"
get_flowstats_features_len() -> int\n

Gets the number of flow statistics features.

Source code in cesnet_datazoo\\config.py
def get_flowstats_features_len(self) -> int:\n    \"\"\"Gets the number of flow statistics features.\"\"\"\n    return len(self.flowstats_features) + len(self.flowstats_features_boolean) + PHIST_BIN_COUNT * len(self.flowstats_features_phist)\n
"},{"location":"reference_dataset_config/#config.DatasetConfig.get_flowstats_feature_names_expanded","title":"get_flowstats_feature_names_expanded","text":"
get_flowstats_feature_names_expanded(\n    shorter_names: bool = False,\n) -> list[str]\n

Gets names of flow statistics features. Packet histograms are expanded into bin features.

Source code in cesnet_datazoo\\config.py
def get_flowstats_feature_names_expanded(self, shorter_names: bool = False) -> list[str]:\n    \"\"\"Gets names of flow statistics features. Packet histograms are expanded into bin features.\"\"\"\n    phist_mapping = {\n        \"PHIST_SRC_SIZES\": [f\"PSIZE_BIN{i}\" for i in range(1, PHIST_BIN_COUNT + 1)],\n        \"PHIST_DST_SIZES\": [f\"PSIZE_BIN{i}_REV\" for i in range(1, PHIST_BIN_COUNT + 1)],\n        \"PHIST_SRC_IPT\": [f\"IPT_BIN{i}\" for i in range(1, PHIST_BIN_COUNT + 1)],\n        \"PHIST_DST_IPT\": [f\"IPT_BIN{i}_REV\" for i in range(1, PHIST_BIN_COUNT + 1)],\n    }\n    short_names_mapping = {\n        \"FLOW_ENDREASON_IDLE\": \"FEND_IDLE\",\n        \"FLOW_ENDREASON_ACTIVE\": \"FEND_ACTIVE\",\n        \"FLOW_ENDREASON_END\": \"FEND_END\",\n        \"FLOW_ENDREASON_OTHER\": \"FEND_OTHER\",\n        \"FLAG_CWR\": \"F_CWR\",\n        \"FLAG_CWR_REV\": \"F_CWR_REV\",\n        \"FLAG_ECE\": \"F_ECE\",\n        \"FLAG_ECE_REV\": \"F_ECE_REV\",\n        \"FLAG_PSH_REV\": \"F_PSH_REV\",\n        \"FLAG_RST\": \"F_RST\",\n        \"FLAG_RST_REV\": \"F_RST_REV\",\n        \"FLAG_FIN\": \"F_FIN\",\n        \"FLAG_FIN_REV\": \"F_FIN_REV\",\n    }\n    feature_names = self.flowstats_features[:]\n    for f in self.flowstats_features_boolean:\n        if shorter_names and f in short_names_mapping:\n            feature_names.append(short_names_mapping[f])\n        else:\n            feature_names.append(f)\n    for f in self.flowstats_features_phist:\n        feature_names.extend(phist_mapping[f])\n    assert len(feature_names) == self.get_flowstats_features_len()\n    return feature_names\n
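
A small usage sketch, assuming an initialized dataset_config as elsewhere in this documentation; the shorter_names flag only abbreviates the boolean flag columns:

names = dataset_config.get_flowstats_feature_names_expanded(shorter_names=True)\nassert len(names) == dataset_config.get_flowstats_features_len()  # histograms are expanded into bin columns\nprint(names[:3])  # e.g. ['BYTES', 'BYTES_REV', 'PACKETS'], depending on the dataset and configuration\n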
"},{"location":"reference_dataset_config/#config.DatasetConfig.get_ppi_feature_names","title":"get_ppi_feature_names","text":"
get_ppi_feature_names() -> list[str]\n

Gets the names of flattened PPI features.

Source code in cesnet_datazoo\\config.py
def get_ppi_feature_names(self) -> list[str]:\n    \"\"\"Gets the names of flattened PPI features.\"\"\"\n    ppi_feature_names = [f\"IPT_{i}\" for i in range(1, PPI_MAX_LEN + 1)] + \\\n                           [f\"DIR_{i}\" for i in range(1, PPI_MAX_LEN + 1)] + \\\n                           [f\"SIZE_{i}\" for i in range(1, PPI_MAX_LEN + 1)]\n    if self.use_push_flags:\n        ppi_feature_names += [f\"PUSH_{i}\" for i in range(1, PPI_MAX_LEN + 1)]\n    return ppi_feature_names\n
"},{"location":"reference_dataset_config/#config.DatasetConfig.get_ppi_channels","title":"get_ppi_channels","text":"
get_ppi_channels() -> list[int]\n

Gets the available features (channels) in PPI sequences.

Source code in cesnet_datazoo\\config.py
def get_ppi_channels(self) -> list[int]:\n    \"\"\"Gets the available features (channels) in PPI sequences.\"\"\"\n    if self.use_push_flags:\n        return TCP_PPI_CHANNELS\n    else:\n        return UDP_PPI_CHANNELS\n
"},{"location":"reference_dataset_config/#config.DatasetConfig.get_feature_names","title":"get_feature_names","text":"
get_feature_names(\n    flatten_ppi: bool = False, shorter_names: bool = False\n) -> list[str]\n

Gets feature names.

Parameters:

Name Type Description Default flatten_ppi bool

Whether to flatten PPI into individual feature names or keep one PPI column.

False Source code in cesnet_datazoo\\config.py
def get_feature_names(self, flatten_ppi: bool = False, shorter_names: bool = False) -> list[str]:\n    \"\"\"\n    Gets feature names.\n\n    Parameters:\n        flatten_ppi: Whether to flatten PPI into individual feature names or keep one `PPI` column.\n    \"\"\"\n    feature_names = self.get_ppi_feature_names() if flatten_ppi else [\"PPI\"]\n    feature_names += self.get_flowstats_feature_names_expanded(shorter_names=shorter_names)\n    return feature_names\n
"},{"location":"reference_dataset_config/#enums-for-configuration","title":"Enums for configuration","text":"

The following enums are used for dataset configuration.

"},{"location":"reference_dataset_config/#config.ValidationApproach","title":"config.ValidationApproach","text":"

The validation approach defines which samples should be used for creating a validation set.

SPLIT_FROM_TRAIN class-attribute instance-attribute
SPLIT_FROM_TRAIN = 'split-from-train'\n

Split train data into train and validation. Scikit-learn train_test_split is used to create a random stratified validation set. The fraction of validation samples is defined in train_val_split_fraction.

VALIDATION_DATES class-attribute instance-attribute
VALIDATION_DATES = 'validation-dates'\n

Use separate validation dates to create a validation set. Validation dates need to be specified in val_dates, and the name of the validation period in val_period_name.
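
For illustration, a minimal sketch of both approaches, assuming a dataset instance created as in the Getting started section (the period names, dates, and split fraction below are illustrative values):

from cesnet_datazoo.config import DatasetConfig, ValidationApproach\n\n# Random stratified validation split taken from the train data\nconfig_split = DatasetConfig(\n    dataset=dataset,\n    train_period_name=\"W-2022-44\",\n    test_period_name=\"W-2022-46\",\n    val_approach=ValidationApproach.SPLIT_FROM_TRAIN,\n    train_val_split_fraction=0.2,\n)\n\n# Validation set built from separate dates\nconfig_dates = DatasetConfig(\n    dataset=dataset,\n    train_period_name=\"W-2022-44\",\n    test_period_name=\"W-2022-46\",\n    val_approach=ValidationApproach.VALIDATION_DATES,\n    val_period_name=\"VAL-EXAMPLE\",\n    val_dates=[\"20221107\", \"20221108\"],\n)\n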

"},{"location":"reference_dataset_config/#config.AppSelection","title":"config.AppSelection","text":"

Applications can be divided into known and unknown classes. To use a dataset in the standard closed-world setting, use ALL_KNOWN to select all applications as known. Use TOPX_KNOWN or BACKGROUND_UNKNOWN for the open-world setting and for evaluating out-of-distribution or open-set recognition methods. FIXED allows manual selection of known and unknown applications.

ALL_KNOWN class-attribute instance-attribute
ALL_KNOWN = 'all-known'\n

Use all applications as known.

TOPX_KNOWN class-attribute instance-attribute
TOPX_KNOWN = 'topx-known'\n

Use the X (apps_selection_topx) applications with the most samples as known, and the rest as unknown. Applications with the same provider are never separated, i.e., all applications of a given provider are either known or unknown.

BACKGROUND_UNKNOWN class-attribute instance-attribute
BACKGROUND_UNKNOWN = 'background-unknown'\n

Use the list of background traffic classes (apps_selection_background_unknown) as unknown, and the rest as known.

FIXED class-attribute instance-attribute
FIXED = 'fixed'\n

Manual application selection. Provide lists of known applications (apps_selection_fixed_known) and unknown applications (apps_selection_fixed_unknown).
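
For illustration, a hedged sketch of two open-world configurations; the dataset instance and period names reuse the Getting started example, and the value 100 is arbitrary:

from cesnet_datazoo.config import AppSelection, DatasetConfig\n\n# The 100 most frequent applications are known, the rest become unknown\ntopx_config = DatasetConfig(\n    dataset=dataset,\n    train_period_name=\"W-2022-44\",\n    test_period_name=\"W-2022-45\",\n    apps_selection=AppSelection.TOPX_KNOWN,\n    apps_selection_topx=100,\n)\n\n# Background traffic classes are unknown, all remaining applications are known\nbackground_config = DatasetConfig(\n    dataset=dataset,\n    train_period_name=\"W-2022-44\",\n    test_period_name=\"W-2022-45\",\n    apps_selection=AppSelection.BACKGROUND_UNKNOWN,\n    apps_selection_background_unknown=[\"default-background\", \"google-background\", \"facebook-background\"],\n)\n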

"},{"location":"reference_dataset_config/#config.MinTrainSamplesCheck","title":"config.MinTrainSamplesCheck","text":"

Depending on the selected train dates, some applications might not have enough samples for training (what counts as enough depends on the selected classification model). The threshold for the minimum number of samples can be set with min_train_samples_per_app; its default value is 100. With the DISABLE_APPS approach, such applications are disabled and not used for training or testing. With the WARN_AND_EXIT approach, a warning is printed and the script exits if applications with too few samples are encountered. To disable this check, set min_train_samples_per_app to 0.

WARN_AND_EXIT class-attribute instance-attribute
WARN_AND_EXIT = 'warn-and-exit'\n

Warn and exit if there are not enough training samples for some applications. It is up to the user to manually add these applications to disabled_apps.

DISABLE_APPS class-attribute instance-attribute
DISABLE_APPS = 'disable-apps'\n

Disable applications with not enough training samples.
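
A hedged sketch of combining these options in DatasetConfig; the threshold of 500 is an arbitrary illustration and the dataset instance is assumed to exist:

from cesnet_datazoo.config import DatasetConfig, MinTrainSamplesCheck\n\ndataset_config = DatasetConfig(\n    dataset=dataset,\n    train_period_name=\"W-2022-44\",\n    test_period_name=\"W-2022-45\",\n    min_train_samples_per_app=500,  # require at least 500 training samples per application\n    min_train_samples_check=MinTrainSamplesCheck.DISABLE_APPS,  # drop applications below the threshold\n)\n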

"},{"location":"reference_dataset_config/#config.DataLoaderOrder","title":"config.DataLoaderOrder","text":"

Validation and test sets are always loaded in sequential order, i.e., in the order of dates and times. However, the train set sometimes needs to be iterated in random order (for example, when training a neural network). Use RANDOM if your classification model requires it, and SEQUENTIAL otherwise. This setting affects only train_dataloader; the dataframe returned by get_train_df is always created in sequential order.

RANDOM class-attribute instance-attribute
RANDOM = 'random'\n

Iterate train data in random order.

SEQUENTIAL class-attribute instance-attribute
SEQUENTIAL = 'sequential'\n

Iterate train data in sequential (datetime) order.
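
A hedged sketch of requesting a reproducible random order for the training dataloader; the batch size and seed are illustrative values and the dataset instance follows Getting started:

from cesnet_datazoo.config import DataLoaderOrder, DatasetConfig\n\ndataset_config = DatasetConfig(\n    dataset=dataset,\n    train_period_name=\"W-2022-44\",\n    test_period_name=\"W-2022-45\",\n    batch_size=192,\n    train_dataloader_order=DataLoaderOrder.RANDOM,  # shuffle train batches, e.g. for neural network training\n    train_dataloader_seed=42,  # make the random order reproducible\n)\ndataset.set_dataset_config_and_initialize(dataset_config)\ntrain_dataloader = dataset.get_train_dataloader()\n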

"},{"location":"reference_datasets/","title":"Dataset classes","text":"

These are subclasses of CesnetDataset representing individual datasets available in cesnet-datazoo.

"},{"location":"reference_datasets/#datasets.datasets.CESNET_TLS22","title":"datasets.datasets.CESNET_TLS22","text":"

Bases: CesnetDataset

Dataset class for CESNET-TLS22.

Source code in cesnet_datazoo\\datasets\\datasets.py
class CESNET_TLS22(CesnetDataset):\n    \"\"\"Dataset class for [CESNET-TLS22][cesnet-tls22].\"\"\"\n    name = \"CESNET-TLS22\"\n    database_filename = \"CESNET-TLS22.h5\"\n    bucket_url = \"https://liberouter.org/datazoo/download?bucket=cesnet-tls22\"\n    available_dates = _CESNET_TLS22_AVAILABLE_DATES\n    time_periods = {\n        \"W-2021-40\": [\"20211004\", \"20211005\", \"20211006\", \"20211007\", \"20211008\", \"20211009\", \"20211010\"],\n        \"W-2021-41\": [\"20211011\", \"20211012\", \"20211013\", \"20211014\", \"20211015\", \"20211016\", \"20211017\"],\n    }\n    default_train_period_name = \"W-2021-40\"\n    default_test_period_name = \"W-2021-41\"\n    _tables_app_enum = _CESNET_TLS22_TABLES_APP_ENUM\n    _tables_cat_enum = _CESNET_TLS22_TABLES_CATEGORY_ENUM\n
"},{"location":"reference_datasets/#datasets.datasets.CESNET_QUIC22","title":"datasets.datasets.CESNET_QUIC22","text":"

Bases: CesnetDataset

Dataset class for CESNET-QUIC22.

Source code in cesnet_datazoo\\datasets\\datasets.py
class CESNET_QUIC22(CesnetDataset):\n    \"\"\"Dataset class for [CESNET-QUIC22][cesnet-quic22].\"\"\"\n    name = \"CESNET-QUIC22\"\n    database_filename = \"CESNET-QUIC22.h5\"\n    bucket_url = \"https://liberouter.org/datazoo/download?bucket=cesnet-quic22\"\n    available_dates = _CESNET_QUIC22_AVAILABLE_DATES\n    time_periods = {\n        \"W-2022-44\": [\"20221031\", \"20221101\", \"20221102\", \"20221103\", \"20221104\", \"20221105\", \"20221106\"],\n        \"W-2022-45\": [\"20221107\", \"20221108\", \"20221109\", \"20221110\", \"20221111\", \"20221112\", \"20221113\"],\n        \"W-2022-46\": [\"20221114\", \"20221115\", \"20221116\", \"20221117\", \"20221118\", \"20221119\", \"20221120\"],\n        \"W-2022-47\": [\"20221121\", \"20221122\", \"20221123\", \"20221124\", \"20221125\", \"20221126\", \"20221127\"],\n        \"W45-47\": [\"20221107\", \"20221108\", \"20221109\", \"20221110\", \"20221111\", \"20221112\", \"20221113\",\n                   \"20221114\", \"20221115\", \"20221116\", \"20221117\", \"20221118\", \"20221119\", \"20221120\",\n                   \"20221121\", \"20221122\", \"20221123\", \"20221124\", \"20221125\", \"20221126\", \"20221127\"],\n    }\n    default_train_period_name = \"W-2022-44\"\n    default_test_period_name = \"W-2022-45\"\n    _tables_app_enum = _CESNET_QUIC22_TABLES_APP_ENUM\n    _tables_cat_enum = _CESNET_QUIC22_TABLES_CATEGORY_ENUM\n
"},{"location":"reference_datasets/#datasets.datasets.CESNET_TLS_Year22","title":"datasets.datasets.CESNET_TLS_Year22","text":"

Bases: CesnetDataset

Dataset class for CESNET-TLS-Year22.

Source code in cesnet_datazoo\\datasets\\datasets.py
class CESNET_TLS_Year22(CesnetDataset):\n    \"\"\"Dataset class for [CESNET-TLS-Year22][cesnet-tls-year22].\"\"\"\n    name = \"CESNET-TLS-Year22\"\n    database_filename = \"CESNET-TLS-Year22.h5\"\n    bucket_url = \"https://liberouter.org/datazoo/download?bucket=cesnet-tls-year22\"\n    available_dates = _CESNET_TLS_YEAR22_AVAILABLE_DATES\n    time_periods = _CESNET_TLS_YEAR22_TIME_PERIODS\n    default_train_period_name = \"M-2022-9\"\n    default_test_period_name = \"M-2022-10\"\n    _tables_app_enum = _CESNET_TLS_YEAR22_TABLES_APP_ENUM\n    _tables_cat_enum = _CESNET_TLS_YEAR22_TABLES_CATEGORY_ENUM\n
"},{"location":"transforms/","title":"Transforms","text":"

The cesnet_datazoo package supports configurable transforms of input data, similar to what torchvision provides for the computer vision field. Input features are split into three groups, each with its own transformation: PPI sequences, flow statistics, and packet histograms.

Transforms are implemented in a separate package, CESNET Models. See the cesnet_models.transforms documentation for details.

Limitations

The current implementation does not support composing transformations.

"},{"location":"transforms/#available-transformations","title":"Available transformations","text":"

PPI sequences

Flow statistics

Packet histograms

More transformations will be implemented in future versions.

"},{"location":"transforms/#data-scaling","title":"Data scaling","text":"

Transformations implementing data scaling will be fitted, if needed, on a subset of training data during dataset initialization.
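
A hedged sketch of limiting how many training samples are used for fitting scalers via the fit_scalers_samples option; a float is interpreted as a fraction of training samples and an integer as an absolute count, and 0.25 is only an illustration:

from cesnet_datazoo.config import DatasetConfig\n\ndataset_config = DatasetConfig(\n    dataset=dataset,\n    train_period_name=\"W-2022-44\",\n    test_period_name=\"W-2022-45\",\n    fit_scalers_samples=0.25,  # fit scalers on 25% of the training samples\n)\ndataset.set_dataset_config_and_initialize(dataset_config)  # scalers are fitted during initialization\n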

"}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"CESNET DataZoo","text":"

This is the documentation of the CESNET DataZoo project.

The goal of this project is to provide tools for working with large network traffic datasets and to facilitate research in the traffic classification area. The core functions of the cesnet-datazoo package are:

"},{"location":"#papers","title":"Papers","text":""},{"location":"dataloaders/","title":"Using dataloaders","text":"

Apart from loading data into dataframes, the cesnet-datazoo package provides dataloaders for processing data in smaller batches.

An example of how dataloaders can be used is in cesnet_datazoo.datasets.loaders or in the following snippet:

def load_from_dataloader(dataloader: DataLoader):\n    other_fields = []\n    data_ppi = []\n    data_flowstats = []\n    labels = []\n    for batch_other_fields, batch_ppi, batch_flowstats, batch_labels in dataloader:\n        other_fields.append(batch_other_fields)\n        data_ppi.append(batch_ppi)\n        data_flowstats.append(batch_flowstats)\n        labels.append(batch_labels)\n    df_other_fields = pd.concat(other_fields, ignore_index=True)\n    data_ppi = np.concatenate(data_ppi)\n    data_flowstats = np.concatenate(data_flowstats)\n    labels = np.concatenate(labels)\n    return df_other_fields, data_ppi, data_flowstats, labels\n

When a dataloader is iterated, the returned data are in the format tuple(batch_other_fields, batch_ppi, batch_flowstats, batch_labels). Batch size B is configured with batch_size and test_batch_size config options. The shapes are:

PPI and flow statistics features returned from dataloaders are transformed depending on the selected configuration. See the transforms page for more information.

"},{"location":"dataset_metadata/","title":"DatasetMetadata","text":"

Each dataset class has its metadata available as a DatasetMetadata instance in the metadata attribute.

"},{"location":"dataset_metadata/#metadata","title":"Metadata","text":"Name CESNET-TLS22 CESNET-QUIC22 CESNET-TLS-Year22 Protocol TLS QUIC TLS Published in 2022 2023 2023 Collected in 2021 2022 2022 Collection duration 2 weeks 4 weeks 1 year Available samples 141392195 153226273 507739073 Available dataset sizes XS, S, M, L XS, S, M, L XS, S, M, L Collection period 4.10.2021 - 17.10.2021 31.10.2022 - 27.11.2022 1.1.2022 - 31.12.2022 Missing dates in collection period 20220128, 20220129, 20220130, 20221212, 20221213, 20221229, 20221230, 20221231 Application count 191 102 180 Background traffic classes default-background, google-background, facebook-background PPI features IPT, DIR, SIZE IPT, DIR, SIZE IPT, DIR, SIZE, PUSH_FLAG Flowstats features BYTES, BYTES_REV, PACKETS, PACKETS_REV, DURATION, PPI_LEN, PPI_ROUNDTRIPS, PPI_DURATION BYTES, BYTES_REV, PACKETS, PACKETS_REV, DURATION, PPI_LEN, PPI_ROUNDTRIPS, PPI_DURATION BYTES, BYTES_REV, PACKETS, PACKETS_REV, DURATION, PPI_LEN, PPI_ROUNDTRIPS, PPI_DURATION Flowstats features boolean FLOW_ENDREASON_IDLE, FLOW_ENDREASON_ACTIVE, FLOW_ENDREASON_OTHER FLOW_ENDREASON_IDLE, FLOW_ENDREASON_ACTIVE, FLOW_ENDREASON_END, FLOW_ENDREASON_OTHER Packet histograms PHIST_SRC_SIZES, PHIST_DST_SIZES, PHIST_SRC_IPT, PHIST_DST_IPT PHIST_SRC_SIZES, PHIST_DST_SIZES, PHIST_SRC_IPT, PHIST_DST_IPT TCP features FLAG_CWR, FLAG_CWR_REV, FLAG_ECE, FLAG_ECE_REV, FLAG_URG, FLAG_URG_REV, FLAG_ACK, FLAG_ACK_REV, FLAG_PSH, FLAG_PSH_REV, FLAG_RST, FLAG_RST_REV, FLAG_SYN, FLAG_SYN_REV, FLAG_FIN, FLAG_FIN_REV FLAG_CWR, FLAG_CWR_REV, FLAG_ECE, FLAG_ECE_REV, FLAG_URG, FLAG_URG_REV, FLAG_ACK, FLAG_ACK_REV, FLAG_PSH, FLAG_PSH_REV, FLAG_RST, FLAG_RST_REV, FLAG_SYN, FLAG_SYN_REV, FLAG_FIN, FLAG_FIN_REV Other fields ID ID, SRC_IP, DST_IP, DST_ASN, SRC_PORT, DST_PORT, PROTOCOL, QUIC_VERSION, QUIC_SNI, QUIC_USERAGENT, TIME_FIRST, TIME_LAST ID, SRC_IP, DST_IP, DST_ASN, DST_PORT, PROTOCOL, TLS_SNI, TLS_JA3, TIME_FIRST, TIME_LAST Cite https://doi.org/10.1016/j.comnet.2022.109467 https://doi.org/10.1016/j.dib.2023.108888 Zenodo URL https://zenodo.org/record/7965515 https://zenodo.org/record/7963302 Related papers https://doi.org/10.23919/TMA58422.2023.10199052"},{"location":"datasets_overview/","title":"Overview of datasets","text":""},{"location":"datasets_overview/#cesnet-tls22","title":"CESNET-TLS22","text":"

CESNET-TLS22

This dataset was published in \"Fine-grained TLS services classification with reject option\" (DOI, arXiv). It was built from live traffic collected using high-speed monitoring probes at the perimeter of the CESNET2 network.

For detailed information about the dataset, see the linked paper and the dataset metadata page.

"},{"location":"datasets_overview/#cesnet-quic22","title":"CESNET-QUIC22","text":"

CESNET-QUIC22

This dataset was published in \"CESNET-QUIC22: A large one-month QUIC network traffic dataset from backbone lines\" (DOI). The QUIC protocol has the potential to replace TLS over TCP as the standard protocol for reliable and secure Internet communication. Because its design makes the inspection of connection handshakes challenging, and because of its use in HTTP/3, there is an increasing demand for QUIC traffic classification methods.

For detailed information about the dataset, see the linked paper and the dataset metadata page. Experiments based on this dataset were published in \"Encrypted traffic classification: the QUIC case\" (DOI).

"},{"location":"datasets_overview/#cesnet-tls-year22","title":"CESNET-TLS-Year22","text":"

CESNET-TLS-Year22

This dataset is similar to CESNET-TLS22; however, it spans the entire year 2022. It will be published in the near future.

"},{"location":"features/","title":"Features","text":"

This page provides a description of individual data features in the datasets. Features available in each dataset are listed on the dataset metadata page.

"},{"location":"features/#ppi-sequence","title":"PPI sequence","text":"

A per-packet information (PPI) sequence is a 2D matrix describing the first 30 packets of a flow. For flows shorter than 30 packets, the PPI sequence is padded with zeros. Set use_push_flags to include PUSH flags in PPI sequences, if available in the dataset.

Name Description SIZE Size of the transport payload IPT Inter-packet time in milliseconds. The IPT of the first packet is set to zero DIR Direction of the packet encoded as \u00b11 PUSH_FLAG Whether the push flag was set in the TCP packet"},{"location":"features/#flow-statistics","title":"Flow statistics","text":"

Flow statistics are standard features describing the entire flow (with the exception of the PPI_ features, which relate to the PPI sequence of the given flow). _REV features correspond to the reverse (server to client) direction.

Name Description DURATION Duration of the flow in seconds BYTES Number of transmitted bytes from client to server BYTES_REV Number of transmitted bytes from server to client PACKETS Number of packets transmitted from client to server PACKETS_REV Number of packets transmitted from server to client PPI_LEN Number of packets in the PPI sequence PPI_DURATION Duration of the PPI sequence in seconds PPI_ROUNDTRIPS Number of roundtrips in the PPI sequence FLOW_ENDREASON_IDLE Flow was terminated because it was idle FLOW_ENDREASON_ACTIVE Flow was terminated because it reached the active timeout FLOW_ENDREASON_OTHER Flow was terminated for other reasons"},{"location":"features/#packet-histograms","title":"Packet histograms","text":"

Packet histograms include binned counts of packet sizes and inter-packet times of the entire flow. There are 8 bins with a logarithmic scale; the intervals are 0\u201315, 16\u201331, 32\u201363, 64\u2013127, 128\u2013255, 256\u2013511, 512\u20131024, >1024 [ms or B]. The units are milliseconds for inter-packet times and bytes for packet sizes. The histograms are built from all packets of the entire flow, unlike PPI sequences, which describe only the first 30 packets. Set use_packet_histograms to use packet histogram features, if they are available in the dataset.

Name Description PSIZE_BIN{x} Packet sizes histogram x-th bin for the forward direction PSIZE_BIN{x}_REV Packet sizes histogram x-th bin for the reverse direction IPT_BIN{x} Inter-packet times histogram x-th bin for the forward direction IPT_BIN{x}_REV Inter-packet times histogram x-th bin for the reverse direction

On the dataset metadata page, packet histogram features are called PHIST_SRC_SIZES, PHIST_DST_SIZES, PHIST_SRC_IPT, PHIST_DST_IPT. Those are the names of database columns that are flattened to the _BIN{x} features.
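
A hedged sketch of enabling packet histogram features and listing the expanded bin columns, using the CESNET-QUIC22 setup from Getting started:

from cesnet_datazoo.config import DatasetConfig\n\ndataset_config = DatasetConfig(\n    dataset=dataset,\n    train_period_name=\"W-2022-44\",\n    test_period_name=\"W-2022-45\",\n    use_packet_histograms=True,\n)\ndataset.set_dataset_config_and_initialize(dataset_config)\nphist_columns = [f for f in dataset_config.get_feature_names() if \"_BIN\" in f]\nprint(phist_columns)  # PSIZE_BIN1..8, PSIZE_BIN1..8_REV, IPT_BIN1..8, IPT_BIN1..8_REV\n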

"},{"location":"features/#tcp-features","title":"TCP features","text":"

Datasets with TLS over TCP traffic contain features indicating the presence of individual TCP flags in the flow. Set use_tcp_features to use a subset of flags defined in cesnet_datazoo.constants.SELECTED_TCP_FLAGS.

Name Description FLAG_{F} Whether F flag was present in the forward (client to server) direction FLAG_{F}_REV Whether F flag was present in the reverse (server to client) direction"},{"location":"features/#other-fields","title":"Other fields","text":"

Datasets contain auxiliary information about samples, such as communicating hosts, flow times, and more fields extracted from the ClientHello message. The dataset metadata page lists available fields in individual datasets. Set return_other_fields to include those fields in returned dataframes. See using dataloaders for how other fields are handled in dataloaders.

Name Description ID Per-dataset unique flow identifier TIME_FIRST Timestamp of the first packet TIME_LAST Timestamp of the last packet SRC_IP Source IP address DST_IP Destination IP address DST_ASN Destination Autonomous System number SRC_PORT Source port DST_PORT Destination port PROTOCOL Transport protocol TLS_SNI / QUIC_SNI Server Name Indication domain TLS_JA3 JA3 fingerprint QUIC_VERSION QUIC protocol version QUIC_USER_AGENT User agent string if available in the QUIC Initial Packet"},{"location":"features/#details-about-packet-histograms-and-ppi","title":"Details about packet histograms and PPI","text":"

Due to implementation differences between the packet sequences (pstats.cpp) and packet histogram (phist.cpp) plugins of the ipfixprobe exporter, the number of packets in PPI and in histograms can differ (even for flows shorter than 30 packets). The differences are summarized in the following table. Note that this applies to TLS over TCP datasets.

TLS over TCP datasets Packet histograms PPI sequence PACKETS and PACKET_REV Zero-length packets(without L4 payload, e.g. ACKs) Not included Not included Included Retransmissions(and out-of-order packets) Included Not included* Included Computed from Entire flow First 30 packets Entire flow

*The implementation for the detection of TCP retransmissions and out-of-order packets is far from perfect. Packets with a non-increasing SEQ number are skipped.

For QUIC, there is no detection of retransmissions or out-of-order packets, and QUIC acknowledgment packets are included in both packet sequences and packet histograms.

"},{"location":"getting_started/","title":"Getting started","text":""},{"location":"getting_started/#jupyter-notebooks","title":"Jupyter notebooks","text":"

Example Jupyter notebooks are provided at https://github.com/CESNET/cesnet-tcexamples. Start with:

"},{"location":"getting_started/#code-snippets","title":"Code snippets","text":""},{"location":"getting_started/#download-a-dataset-and-compute-statistics","title":"Download a dataset and compute statistics","text":"

from cesnet_datazoo.datasets import CESNET_QUIC22\ndataset = CESNET_QUIC22(\"/datasets/CESNET-QUIC22/\", size=\"XS\")\ndataset.compute_dataset_statistics(num_samples=100_000, num_workers=0)\n
This will download the dataset, compute dataset statistics, and save them into /datasets/CESNET-QUIC22/statistics.

"},{"location":"getting_started/#enable-logging-and-set-the-spawn-method-on-windows","title":"Enable logging and set the spawn method on Windows","text":"

import logging\nimport multiprocessing as mp\n\nmp.set_start_method(\"spawn\") \nlogging.basicConfig(\n    level=logging.INFO,\n    format=\"[%(asctime)s][%(name)s][%(levelname)s] - %(message)s\")\n
For running on Windows, we recommend using the spawn method for creating dataloader worker processes. Set up logging to get more information from the package.

"},{"location":"getting_started/#initialize-dataset-to-create-train-validation-and-test-dataframes","title":"Initialize dataset to create train, validation, and test dataframes","text":"
from cesnet_datazoo.datasets import CESNET_QUIC22\nfrom cesnet_datazoo.config import DatasetConfig, AppSelection\n\ndataset = CESNET_QUIC22(\"/datasets/CESNET-QUIC22/\", size=\"XS\")\ndataset_config = DatasetConfig(\n    dataset=dataset,\n    apps_selection=AppSelection.ALL_KNOWN,\n    train_period_name=\"W-2022-44\",\n    test_period_name=\"W-2022-45\",\n)\ndataset.set_dataset_config_and_initialize(dataset_config)\ntrain_dataframe = dataset.get_train_df()\nval_dataframe = dataset.get_val_df()\ntest_dataframe = dataset.get_test_df()\n

The DatasetConfig class handles the configuration of datasets, and calling set_dataset_config_and_initialize initializes train, validation, and test sets with the desired configuration. Data can be read into Pandas DataFrames as shown here or via PyTorch DataLoaders. See CesnetDataset reference.
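
A follow-up sketch: the PPI column can be flattened into per-packet columns, and auxiliary fields can be included via return_other_fields (both options are described on the features page):

dataset_config = DatasetConfig(\n    dataset=dataset,\n    apps_selection=AppSelection.ALL_KNOWN,\n    train_period_name=\"W-2022-44\",\n    test_period_name=\"W-2022-45\",\n    return_other_fields=True,  # include fields such as ID, QUIC_SNI, and TIME_FIRST in the dataframes\n)\ndataset.set_dataset_config_and_initialize(dataset_config)\ntrain_dataframe = dataset.get_train_df(flatten_ppi=True)  # IPT_1..30, DIR_1..30, SIZE_1..30 instead of one PPI column\n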

"},{"location":"installation/","title":"Installation","text":"

Install the package with pip:

pip install cesnet-datazoo\n

or for an editable install with:

pip install -e git+https://github.com/CESNET/cesnet-datazoo\n
"},{"location":"installation/#requirements","title":"Requirements","text":"

The cesnet-datazoo package requires Python >=3.10.

"},{"location":"installation/#dependencies","title":"Dependencies","text":"Name Version matplotlib numpy pandas pydantic >=2.0 PyYAML requests scikit-learn seaborn tables >=3.8.0 torch >=1.10 tqdm"},{"location":"reference_cesnet_dataset/","title":"Base dataset class","text":""},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset","title":"datasets.cesnet_dataset.CesnetDataset","text":"

The main class for accessing CESNET datasets. It handles downloading, train/validation/test splitting, and class selection. Access to data is provided through:

The dataset is stored in a PyTables database. The internal PyTablesDataset class is used as a wrapper that implements the PyTorch Dataset interface and is compatible with DataLoader, which provides efficient parallel loading of the data. The dataset configuration is done through the DatasetConfig class.

Intended usage:

  1. Create an instance of the dataset class with the desired size and data root. This will download the dataset if it has not already been downloaded.
  2. Create an instance of DatasetConfig and set it with set_dataset_config_and_initialize. This will initialize the dataset \u2014 select classes, split data into train/validation/test sets, and fit data scalers if needed. All is done according to the provided configuration and is cached for later use.
  3. Use get_train_dataloader or get_train_df to get training data for a classification model.
  4. Validate the model and perform hyperparameter optimization on get_val_dataloader or get_val_df.
  5. Evaluate the model on get_test_dataloader or get_test_df.
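
The steps above correspond to a short sketch like the following; the dataset, size, and time periods reuse the illustrative values from Getting started:

from cesnet_datazoo.config import DatasetConfig\nfrom cesnet_datazoo.datasets import CESNET_QUIC22\n\ndataset = CESNET_QUIC22(\"/datasets/CESNET-QUIC22/\", size=\"XS\")  # 1. download/open the dataset\ndataset_config = DatasetConfig(\n    dataset=dataset,\n    train_period_name=\"W-2022-44\",\n    test_period_name=\"W-2022-45\",\n)\ndataset.set_dataset_config_and_initialize(dataset_config)  # 2. select classes, split sets, fit scalers\ntrain_dataloader = dataset.get_train_dataloader()  # 3. training data\nval_dataloader = dataset.get_val_dataloader()  # 4. validation / hyperparameter tuning\ntest_dataloader = dataset.get_test_dataloader()  # 5. final evaluation\n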

Parameters:

Name Type Description Default data_root str

Path to the folder where the dataset will be stored. Each dataset size has its own subfolder data_root/size

required size str

Size of the dataset. Options are XS, S, M, L, ORIG.

'S' silent bool

Whether to suppress print and tqdm output.

False

Attributes:

Name Type Description name str

Name of the dataset.

database_filename str

Name of the database file.

database_path str

Path to the database file.

servicemap_path str

Path to the servicemap file.

statistics_path str

Path to the dataset statistics folder.

bucket_url str

URL of the bucket where the database is stored.

metadata DatasetMetadata

Additional dataset metadata.

available_classes list[str]

List of all available classes in the dataset.

available_dates list[str]

List of all available dates in the dataset.

time_periods dict[str, list[str]]

Predefined time periods. Each time period is a list of dates.

default_train_period_name str

Default time period for training.

default_test_period_name str

Default time period for testing.

The following attributes are initialized when set_dataset_config_and_initialize is called.

Attributes:

Name Type Description dataset_config Optional[DatasetConfig]

Configuration of the dataset.

class_info Optional[ClassInfo]

Structured information about the classes.

dataset_indices Optional[IndicesTuple]

Named tuple containing train_indices, val_known_indices, val_unknown_indices, test_known_indices, test_unknown_indices. These are the indices into PyTables database that define train, validation, and test sets.

train_dataset Optional[PyTablesDataset]

Train set in the form of PyTablesDataset instance wrapping the PyTables database.

val_dataset Optional[PyTablesDataset]

Validation set in the form of PyTablesDataset instance wrapping the PyTables database.

test_dataset Optional[PyTablesDataset]

Test set in the form of PyTablesDataset instance wrapping the PyTables database.

known_app_counts Optional[DataFrame]

Known application counts in the train, validation, and test sets.

unknown_app_counts Optional[DataFrame]

Unknown application counts in the validation and test sets.

train_dataloader Optional[DataLoader]

Iterable PyTorch DataLoader for training.

train_dataloader_sampler Optional[Sampler]

Sampler used for iterating the training dataloader. Either RandomSampler or SequentialSampler.

train_dataloader_drop_last bool

Whether to drop the last incomplete batch when iterating the training dataloader.

val_dataloader Optional[DataLoader]

Iterable PyTorch DataLoader for validation.

test_dataloader Optional[DataLoader]

Iterable PyTorch DataLoader for testing.

Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
class CesnetDataset():\n    \"\"\"\n    The main class for accessing CESNET datasets. It handles downloading, train/validation/test splitting, and class selection. Access to data is provided through:\n\n    - Iterable PyTorch DataLoader for batch processing. See [using dataloaders][using-dataloaders] for more details.\n    - Pandas DataFrame for loading the entire train, validation, or test set at once.\n\n    The dataset is stored in a [PyTables](https://www.pytables.org/) database. The internal `PyTablesDataset` class is used as a wrapper\n    that implements the PyTorch [`Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) interface\n    and is compatible with [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader),\n    which provides efficient parallel loading of the data. The dataset configuration is done through the [`DatasetConfig`][config.DatasetConfig] class.\n\n    **Intended usage:**\n\n    1. Create an instance of the [dataset class][dataset-classes] with the desired size and data root. This will download the dataset if it has not already been downloaded.\n    2. Create an instance of [`DatasetConfig`][config.DatasetConfig] and set it with [`set_dataset_config_and_initialize`][datasets.cesnet_dataset.CesnetDataset.set_dataset_config_and_initialize].\n    This will initialize the dataset \u2014 select classes, split data into train/validation/test sets, and fit data scalers if needed. All is done according to the provided configuration and is cached for later use.\n    3. Use [`get_train_dataloader`][datasets.cesnet_dataset.CesnetDataset.get_train_dataloader] or [`get_train_df`][datasets.cesnet_dataset.CesnetDataset.get_train_df] to get training data for a classification model.\n    4. Validate the model and perform the hyperparameter optimalization on [`get_val_dataloader`][datasets.cesnet_dataset.CesnetDataset.get_val_dataloader] or [`get_val_df`][datasets.cesnet_dataset.CesnetDataset.get_val_df].\n    5. Evaluate the model on [`get_test_dataloader`][datasets.cesnet_dataset.CesnetDataset.get_test_dataloader] or [`get_test_df`][datasets.cesnet_dataset.CesnetDataset.get_test_df].\n\n    Parameters:\n        data_root: Path to the folder where the dataset will be stored. Each dataset size has its own subfolder `data_root/size`\n        size: Size of the dataset. Options are `XS`, `S`, `M`, `L`, `ORIG`.\n        silent: Whether to suppress print and tqdm output.\n\n    Attributes:\n        name: Name of the dataset.\n        database_filename: Name of the database file.\n        database_path: Path to the database file.\n        servicemap_path: Path to the servicemap file.\n        statistics_path: Path to the dataset statistics folder.\n        bucket_url: URL of the bucket where the database is stored.\n        metadata: Additional [dataset metadata][metadata].\n        available_classes: List of all available classes in the dataset.\n        available_dates: List of all available dates in the dataset.\n        time_periods: Predefined time periods. 
Each time period is a list of dates.\n        default_train_period_name: Default time period for training.\n        default_test_period_name: Default time period for testing.\n\n    The following attributes are initialized when [`set_dataset_config_and_initialize`][datasets.cesnet_dataset.CesnetDataset.set_dataset_config_and_initialize] is called.\n\n    Attributes:\n        dataset_config: Configuration of the dataset.\n        class_info: Structured information about the classes.\n        dataset_indices: Named tuple containing `train_indices`, `val_known_indices`, `val_unknown_indices`, `test_known_indices`, `test_unknown_indices`. These are the indices into PyTables database that define train, validation, and test sets.\n        train_dataset: Train set in the form of `PyTablesDataset` instance wrapping the PyTables database.\n        val_dataset: Validation set in the form of `PyTablesDataset` instance wrapping the PyTables database.\n        test_dataset: Test set in the form of `PyTablesDataset` instance wrapping the PyTables database.\n        known_app_counts: Known application counts in the train, validation, and test sets.\n        unknown_app_counts: Unknown application counts in the validation and test sets.\n        train_dataloader: Iterable PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for training.\n        train_dataloader_sampler: Sampler used for iterating the training dataloader. Either [`RandomSampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.RandomSampler) or [`SequentialSampler`](https://pytorch.org/docs/stable/data.html#torch.utils.data.SequentialSampler).\n        train_dataloader_drop_last: Whether to drop the last incomplete batch when iterating the training dataloader.\n        val_dataloader: Iterable PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for validation.\n        test_dataloader: Iterable PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for testing.\n    \"\"\"\n    data_root: str\n    size: str\n    silent: bool = False\n\n    name: str\n    database_filename: str\n    database_path: str\n    servicemap_path: str\n    statistics_path: str\n    bucket_url: str\n    metadata: DatasetMetadata\n    available_classes: list[str]\n    available_dates: list[str]\n    time_periods: dict[str, list[str]]\n    default_train_period_name: str\n    default_test_period_name: str\n\n    dataset_config: Optional[DatasetConfig] = None\n    class_info: Optional[ClassInfo] = None\n    dataset_indices: Optional[IndicesTuple] = None\n    train_dataset: Optional[PyTablesDataset] = None\n    val_dataset: Optional[PyTablesDataset] = None\n    test_dataset: Optional[PyTablesDataset] = None\n    known_app_counts: Optional[pd.DataFrame] = None\n    unknown_app_counts: Optional[pd.DataFrame] = None\n    train_dataloader: Optional[DataLoader] = None\n    train_dataloader_sampler: Optional[Sampler] = None\n    train_dataloader_drop_last: bool = True\n    val_dataloader: Optional[DataLoader] = None\n    test_dataloader: Optional[DataLoader] = None\n\n    _collate_fn: Optional[Callable] = None\n    _tables_app_enum: dict[int, str]\n    _tables_cat_enum: dict[int, str]\n\n    def __init__(self, data_root: str, size: str = \"S\", database_checks_at_init: bool = False, silent: bool = False) -> None:\n        self.silent = silent\n        self.metadata = load_metadata(self.name)\n        self.size = size\n        if self.size != 
\"ORIG\":\n            if size not in self.metadata.available_dataset_sizes:\n                raise ValueError(f\"Unknown dataset size {self.size}\")\n            self.name = f\"{self.name}-{self.size}\"\n            filename, ext = os.path.splitext(self.database_filename)\n            self.database_filename = f\"{filename}-{self.size}{ext}\"\n        self.data_root = os.path.normpath(os.path.expanduser(os.path.join(data_root, self.size)))\n        self.database_path = os.path.join(self.data_root, self.database_filename)\n        self.servicemap_path = os.path.join(self.data_root, SERVICEMAP_FILE)\n        self.statistics_path = os.path.join(self.data_root, \"statistics\")\n        if not os.path.exists(self.data_root):\n            os.makedirs(self.data_root)\n        if not self._is_downloaded():\n            self._download()\n        if database_checks_at_init:\n            with tb.open_file(self.database_path, mode=\"r\") as database:\n                tables_paths = list(map(lambda x: x._v_pathname, iter(database.get_node(f\"/flows\"))))\n                num_samples = 0\n                for p in tables_paths:\n                    table = database.get_node(p)\n                    assert isinstance(table, tb.Table)\n                    if self._tables_app_enum != {v: k for k, v in dict(table.get_enum(APP_COLUMN)).items()}:\n                        raise ValueError(f\"Found mismatch between _tables_app_enum and the PyTables database enum in table {p}. Please report this issue.\")\n                    if self._tables_cat_enum != {v: k for k, v in dict(table.get_enum(CATEGORY_COLUMN)).items()}:\n                        raise ValueError(f\"Found mismatch between _tables_cat_enum and the PyTables database enum in table {p}. Please report this issue.\")\n                    num_samples += len(table)\n                if self.size == \"ORIG\" and num_samples != self.metadata.available_samples:\n                    raise ValueError(f\"Expected {self.metadata.available_samples} samples, but got {num_samples} in the database. Please delete the data root folder, update cesnet-datazoo, and redownload the dataset.\")\n                if self.size != \"ORIG\" and num_samples != DATASET_SIZES[self.size]:\n                    raise ValueError(f\"Expected {DATASET_SIZES[self.size]} samples, but got {num_samples} in the database. Please delete the data root folder, update cesnet-datazoo, and redownload the dataset.\")\n                if self.available_dates != list(map(lambda x: x.removeprefix(\"/flows/D\"), tables_paths)):\n                    raise ValueError(f\"Found mismatch between available_dates and the dates available in the PyTables database. Please report this issue.\")\n        # Add all available dates as single date time periods\n        for d in self.available_dates:\n            self.time_periods[d] = [d]\n        available_applications = sorted([app for app in pd.read_csv(self.servicemap_path, index_col=\"Tag\").index if not is_background_app(app)])\n        if len(available_applications) != self.metadata.application_count:\n            raise ValueError(f\"Found {len(available_applications)} applications in the servicemap (omitting background traffic classes), but expected {self.metadata.application_count}. 
Please report this issue.\")\n        self.available_classes = available_applications + self.metadata.background_traffic_classes\n\n    def set_dataset_config_and_initialize(self, dataset_config: DatasetConfig, disable_indices_cache: bool = False) -> None:\n        \"\"\"\n        Initialize train, validation, and test sets. Data cannot be accessed before calling this method.\n\n        Parameters:\n            dataset_config: Desired configuration of the dataset.\n            disable_indices_cache: Whether to disable caching of the dataset indices. This is useful when the dataset is used in many different configurations and you want to save disk space.\n        \"\"\"\n        self.dataset_config = dataset_config\n        self._clear()\n        self._initialize_train_val_test(disable_indices_cache=disable_indices_cache)\n\n    def get_train_dataloader(self) -> DataLoader:\n        \"\"\"\n        Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for training. The dataloader is created on the first call and then cached.\n        When the dataloader is iterated in random order, the last incomplete batch is dropped.\n        The dataloader is configured with the following config attributes:\n\n        | Dataset config               | Description                                                                                |\n        | ---------------------------- | ------------------------------------------------------------------------------------------ |\n        | `batch_size`                 | Number of samples per batch.                                                               |\n        | `train_workers`              | Number of workers for loading train data.                                                  |\n        | `train_dataloader_order`     | Whether to load train data in sequential or random order. See [config.DataLoaderOrder][].  |\n        | `train_dataloader_seed`      | Seed for loading train data in random order.                                               |\n\n        Returns:\n            Train data as an iterable dataloader. 
See [using dataloaders][using-dataloaders] for more details.\n        \"\"\"\n        if self.dataset_config is None:\n            raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting train dataloader\")\n        if not self.dataset_config.need_train_set:\n            raise ValueError(\"Train dataloader is not available when need_train_set is false\")\n        assert self.train_dataset\n        if self.train_dataloader:\n            return self.train_dataloader\n        # Create sampler according to the selected order\n        if self.dataset_config.train_dataloader_order == DataLoaderOrder.RANDOM:\n            if self.dataset_config.train_dataloader_seed is not None:\n                generator = torch.Generator()\n                generator.manual_seed(self.dataset_config.train_dataloader_seed)\n            else:\n                generator = None\n            self.train_dataloader_sampler = RandomSampler(self.train_dataset, generator=generator)\n            self.train_dataloader_drop_last = True\n        elif self.dataset_config.train_dataloader_order == DataLoaderOrder.SEQUENTIAL:\n            self.train_dataloader_sampler = SequentialSampler(self.train_dataset)\n            self.train_dataloader_drop_last = False\n        else: assert_never(self.dataset_config.train_dataloader_order)\n        # Create dataloader\n        batch_sampler = BatchSampler(sampler=self.train_dataloader_sampler, batch_size=self.dataset_config.batch_size, drop_last=self.train_dataloader_drop_last)\n        train_dataloader = DataLoader(\n            self.train_dataset,\n            num_workers=self.dataset_config.train_workers,\n            worker_init_fn=worker_init_fn,\n            collate_fn=self._collate_fn,\n            persistent_workers=self.dataset_config.train_workers > 0,\n            batch_size=None,\n            sampler=batch_sampler,)\n        if self.dataset_config.train_workers == 0:\n            self.train_dataset.pytables_worker_init()\n        self.train_dataloader = train_dataloader\n        return train_dataloader\n\n    def get_val_dataloader(self) -> DataLoader:\n        \"\"\"\n        Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for validation.\n        The dataloader is created on the first call and then cached.\n        The dataloader is configured with the following config attributes:\n\n        | Dataset config    | Description                                                       |\n        | ------------------| ------------------------------------------------------------------|\n        | `test_batch_size` | Number of samples per batch for loading validation and test data. |\n        | `val_workers`     | Number of workers for loading validation data.                    |\n\n        Returns:\n            Validation data as an iterable dataloader. 
See [using dataloaders][using-dataloaders] for more details.\n        \"\"\"\n        if self.dataset_config is None:\n            raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting validaion dataloader\")\n        if not self.dataset_config.need_val_set:\n            raise ValueError(\"Validation dataloader is not available when need_val_set is false\")\n        assert self.val_dataset is not None\n        if self.val_dataloader:\n            return self.val_dataloader\n        batch_sampler = BatchSampler(sampler=SequentialSampler(self.val_dataset), batch_size=self.dataset_config.test_batch_size, drop_last=False)\n        val_dataloader = DataLoader(\n            self.val_dataset,\n            num_workers=self.dataset_config.val_workers,\n            worker_init_fn=worker_init_fn,\n            collate_fn=self._collate_fn,\n            persistent_workers=self.dataset_config.val_workers > 0,\n            batch_size=None,\n            sampler=batch_sampler,)\n        if self.dataset_config.val_workers == 0:\n            self.val_dataset.pytables_worker_init()\n        self.val_dataloader = val_dataloader\n        return val_dataloader\n\n    def get_test_dataloader(self) -> DataLoader:\n        \"\"\"\n        Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for testing.\n        The dataloader is created on the first call and then cached.\n\n        When the dataset is used in the open-world setting, and unknown classes are defined,\n        the test dataloader returns `test_known_size` samples of known classes followed by `test_unknown_size` samples of unknown classes.\n\n        The dataloader is configured with the following config attributes:\n\n        | Dataset config    | Description                                                       |\n        | ------------------| ------------------------------------------------------------------|\n        | `test_batch_size` | Number of samples per batch for loading validation and test data. |\n        | `test_workers`    | Number of workers for loading test data.                          |\n\n        Returns:\n            Test data as an iterable dataloader. 
See [using dataloaders][using-dataloaders] for more details.\n        \"\"\"\n        if self.dataset_config is None:\n            raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting test dataloader\")\n        if not self.dataset_config.need_test_set:\n            raise ValueError(\"Test dataloader is not available when need_test_set is false\")\n        assert self.test_dataset is not None\n        if self.test_dataloader:\n            return self.test_dataloader\n        batch_sampler = BatchSampler(sampler=SequentialSampler(self.test_dataset), batch_size=self.dataset_config.test_batch_size, drop_last=False)\n        test_dataloader = DataLoader(\n            self.test_dataset,\n            num_workers=self.dataset_config.test_workers,\n            worker_init_fn=worker_init_fn,\n            collate_fn=self._collate_fn,\n            persistent_workers=False,\n            batch_size=None,\n            sampler=batch_sampler,)\n        if self.dataset_config.test_workers == 0:\n            self.test_dataset.pytables_worker_init()\n        self.test_dataloader = test_dataloader\n        return test_dataloader\n\n    def get_dataloaders(self) -> tuple[DataLoader, DataLoader, DataLoader]:\n        \"\"\"Gets train, validation, and test dataloaders in one call.\"\"\"\n        if self.dataset_config is None:\n            raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting dataloaders\")\n        train_dataloader = self.get_train_dataloader()\n        val_dataloader = self.get_val_dataloader()\n        test_dataloader = self.get_test_dataloader()\n        return train_dataloader, val_dataloader, test_dataloader\n\n    def get_train_df(self, flatten_ppi: bool = False) -> pd.DataFrame:\n        \"\"\"\n        Creates a train Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe is in sequential (datetime) order. Consider shuffling the dataframe if needed.\n\n        !!! warning \"Memory usage\"\n\n            The whole train set is loaded into memory. 
If the dataset size is larger than `'S'`, consider using `get_train_dataloader` instead.\n\n        Parameters:\n            flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n        Returns:\n            Train data as a dataframe.\n        \"\"\"\n        self._check_before_dataframe(check_train=True)\n        assert self.dataset_config is not None and self.train_dataset is not None\n        if len(self.train_dataset) > DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n            warnings.warn(f\"Train set has ({len(self.train_dataset)} samples), consider using get_train_dataloader() instead\")\n        train_dataloader = self.get_train_dataloader()\n        assert isinstance(train_dataloader.sampler, BatchSampler) and self.train_dataloader_sampler is not None\n        # Read dataloader in sequential order\n        train_dataloader.sampler.sampler = SequentialSampler(self.train_dataset)\n        train_dataloader.sampler.drop_last = False\n        feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n        df = create_df_from_dataloader(dataloader=train_dataloader,\n                                       feature_names=feature_names,\n                                       flatten_ppi=flatten_ppi,\n                                       silent=self.silent)\n        # Restore the original dataloader sampler and drop_last\n        train_dataloader.sampler.sampler = self.train_dataloader_sampler\n        train_dataloader.sampler.drop_last = self.train_dataloader_drop_last\n        return df\n\n    def get_val_df(self, flatten_ppi: bool = False) -> pd.DataFrame:\n        \"\"\"\n        Creates validation Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe is in sequential (datetime) order.\n\n        !!! warning \"Memory usage\"\n\n            The whole validation set is loaded into memory. If the dataset size is larger than `'S'`, consider using `get_val_dataloader` instead.\n\n        Parameters:\n            flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n        Returns:\n            Validation data as a dataframe.\n        \"\"\"\n        self._check_before_dataframe(check_val=True)\n        assert self.dataset_config is not None and self.val_dataset is not None\n        if len(self.val_dataset) > DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n            warnings.warn(f\"Validation set has ({len(self.val_dataset)} samples), consider using get_val_dataloader() instead\")\n        feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n        return create_df_from_dataloader(dataloader=self.get_val_dataloader(),\n                                         feature_names=feature_names,\n                                         flatten_ppi=flatten_ppi,\n                                         silent=self.silent)\n\n    def get_test_df(self, flatten_ppi: bool = False) -> pd.DataFrame:\n        \"\"\"\n        Creates test Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). 
The dataframe is in sequential (datetime) order.\n\n\n        When the dataset is used in the open-world setting, and unknown classes are defined,\n        the returned test dataframe is composed of `test_known_size` samples of known classes followed by `test_unknown_size` samples of unknown classes.\n\n\n        !!! warning \"Memory usage\"\n\n            The whole test set is loaded into memory. If the dataset size is larger than `'S'`, consider using `get_test_dataloader` instead.\n\n        Parameters:\n            flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n        Returns:\n            Test data as a dataframe.\n        \"\"\"\n        self._check_before_dataframe(check_test=True)\n        assert self.dataset_config is not None and self.test_dataset is not None\n        if len(self.test_dataset) > DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n            warnings.warn(f\"Test set has ({len(self.test_dataset)} samples), consider using get_test_dataloader() instead\")\n        feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n        return create_df_from_dataloader(dataloader=self.get_test_dataloader(),\n                                         feature_names=feature_names,\n                                         flatten_ppi=flatten_ppi,\n                                         silent=self.silent)\n\n    def get_num_classes(self) -> int:\n        \"\"\"Returns the number of classes in the current configuration of the dataset.\"\"\"\n        if self.class_info is None:\n            raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting the number of classes\")\n        return self.class_info.num_classes\n\n    def get_known_apps(self) -> list[str]:\n        \"\"\"Returns the list of known applications in the current configuration of the dataset.\"\"\"\n        if self.class_info is None:\n            raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting known apps\")\n        return self.class_info.known_apps\n\n    def get_unknown_apps(self) -> list[str]:\n        \"\"\"Returns the list of unknown applications in the current configuration of the dataset.\"\"\"\n        if self.class_info is None:\n            raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting unknown apps\")\n        return self.class_info.unknown_apps\n\n    def compute_dataset_statistics(self, num_samples: int | Literal[\"all\"] = 10_000_000, num_workers: int = 4, batch_size: int = 16384, disabled_apps: Optional[list[str]] = None) -> None:\n        \"\"\"\n        Computes dataset statistics and saves them to the `statistics_path` folder.\n\n        Parameters:\n            num_samples: Number of samples to use for computing the statistics.\n            num_workers: Number of workers for loading data.\n            batch_size: Number of samples per batch for loading data.\n            disabled_apps: List of applications to exclude from the statistics.\n        \"\"\"\n        if disabled_apps:\n            bad_disabled_apps = [a for a in disabled_apps if a not in self.available_classes]\n            if len(bad_disabled_apps) > 0:\n                raise ValueError(f\"Bad applications in disabled_apps {bad_disabled_apps}. 
Use applications available in dataset.available_classes\")\n        if not os.path.exists(self.statistics_path):\n            os.mkdir(self.statistics_path)\n        compute_dataset_statistics(database_path=self.database_path,\n                                   tables_app_enum=self._tables_app_enum,\n                                   tables_cat_enum=self._tables_cat_enum,\n                                   output_dir=self.statistics_path,\n                                   packet_histograms=self.metadata.packet_histograms,\n                                   flowstats_features_boolean=self.metadata.flowstats_features_boolean,\n                                   protocol=self.metadata.protocol,\n                                   extra_fields=not self.name.startswith(\"CESNET-TLS22\"),\n                                   disabled_apps=disabled_apps if disabled_apps is not None else [],\n                                   num_samples=num_samples,\n                                   num_workers=num_workers,\n                                   batch_size=batch_size,\n                                   silent=self.silent)\n\n    def _generate_time_periods(self) -> None:\n        time_periods = {}\n        for period in self.time_periods:\n            time_periods[period] = []\n            if period.startswith(\"W\"):\n                split = period.split(\"-\")\n                collection_year, week = int(split[1]), int(split[2])\n                for d in range(1, 8):\n                    s = datetime.date.fromisocalendar(collection_year, week, d).strftime(\"%Y%m%d\")\n                    # last week of a year can span into the following year\n                    if s not in self.metadata.missing_dates_in_collection_period and s.startswith(str(collection_year)):\n                        time_periods[period].append(s)\n            elif period.startswith(\"M\"):\n                split = period.split(\"-\")\n                collection_year, month = int(split[1]), int(split[2])\n                for d in range(1, calendar.monthrange(collection_year, month)[1]):\n                    s = datetime.date(collection_year, month, d).strftime(\"%Y%m%d\")\n                    if s not in self.metadata.missing_dates_in_collection_period:\n                        time_periods[period].append(s)\n        self.time_periods = time_periods\n\n    def _is_downloaded(self) -> bool:\n        \"\"\"Servicemap is downloaded after the database; thus if it exists, the database is also downloaded\"\"\"\n        return os.path.exists(self.servicemap_path) and os.path.exists(self.database_path)\n\n    def _download(self) -> None:\n        if not self.silent:\n            print(f\"Downloading {self.name} dataset\")\n        database_url = f\"{self.bucket_url}&file={self.database_filename}\"\n        servicemap_url = f\"{self.bucket_url}&file={SERVICEMAP_FILE}\"\n        resumable_download(url=database_url, file_path=self.database_path, silent=self.silent)\n        simple_download(url=servicemap_url, file_path=self.servicemap_path)\n\n    def _clear(self) -> None:\n        self.class_info = None\n        self.dataset_indices = None\n        self.train_dataset = None\n        self.val_dataset = None\n        self.test_dataset = None\n        self.known_app_counts = None\n        self.unknown_app_counts = None\n        self.train_dataloader = None\n        self.train_dataloader_sampler = None\n        self.train_dataloader_drop_last = True\n        self.val_dataloader = None\n        self.test_dataloader = None\n        
self._collate_fn = None\n\n    def _check_before_dataframe(self, check_train: bool = False, check_val: bool = False, check_test: bool = False) -> None:\n        if self.dataset_config is None:\n            raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting a dataframe\")\n        if self.dataset_config.return_tensors:\n            raise ValueError(\"Dataframes are not available when return_tensors is set. Use a dataloader instead.\")\n        if check_train and not self.dataset_config.need_train_set:\n            raise ValueError(\"Train dataframe is not available when need_train_set is false\")\n        if check_val and not self.dataset_config.need_val_set:\n            raise ValueError(\"Validation dataframe is not available when need_val_set is false\")\n        if check_test and not self.dataset_config.need_test_set:\n            raise ValueError(\"Test dataframe is not available when need_test_set is false\")\n\n    def _initialize_train_val_test(self, disable_indices_cache: bool = False) -> None:\n        assert self.dataset_config is not None\n        dataset_config = self.dataset_config\n        servicemap = pd.read_csv(dataset_config.servicemap_path, index_col=\"Tag\")\n        # Initialize train set\n        if dataset_config.need_train_set:\n            train_indices, train_unknown_indices, known_apps, unknown_apps = init_or_load_train_indices(dataset_config=dataset_config,\n                                                                                                        tables_app_enum=self._tables_app_enum,\n                                                                                                        servicemap=servicemap,\n                                                                                                        disable_indices_cache=disable_indices_cache,)\n            # Date weight sampling of train indices\n            if dataset_config.train_dates_weigths is not None:\n                assert dataset_config.train_size != \"all\"\n                if dataset_config.val_approach == ValidationApproach.SPLIT_FROM_TRAIN:\n                    # requested number of samples is train_size + val_known_size when using the split-from-train validation approach\n                    assert dataset_config.val_known_size != \"all\"\n                    num_samples = dataset_config.train_size + dataset_config.val_known_size\n                else:\n                    num_samples = dataset_config.train_size\n                if num_samples > len(train_indices):\n                    raise ValueError(f\"Requested number of samples for weight sampling ({num_samples}) is larger than the number of available train samples ({len(train_indices)})\")\n                train_indices = date_weight_sample_train_indices(dataset_config=dataset_config, train_indices=train_indices, num_samples=num_samples)\n        elif dataset_config.apps_selection == AppSelection.FIXED:\n            known_apps = sorted(dataset_config.apps_selection_fixed_known)\n            unknown_apps = sorted(dataset_config.apps_selection_fixed_unknown)\n            train_indices = no_indices()\n            train_unknown_indices = no_indices()\n        else:\n            raise ValueError(\"Either need train set or the fixed application selection\")\n        # Initialize validation set\n        if dataset_config.need_val_set:\n            if dataset_config.val_approach == ValidationApproach.VALIDATION_DATES:\n                val_known_indices, 
val_unknown_indices, val_data_path = init_or_load_val_indices(dataset_config=dataset_config,\n                                                                                                 known_apps=known_apps,\n                                                                                                 unknown_apps=unknown_apps,\n                                                                                                 tables_app_enum=self._tables_app_enum,\n                                                                                                 disable_indices_cache=disable_indices_cache,)\n            elif dataset_config.val_approach == ValidationApproach.SPLIT_FROM_TRAIN:\n                train_val_rng = get_fresh_random_generator(dataset_config=dataset_config, section=RandomizedSection.TRAIN_VAL_SPLIT)\n                val_data_path = dataset_config._get_train_data_path()\n                val_unknown_indices = train_unknown_indices\n                train_labels = train_indices[:, INDICES_LABEL_POS]\n                if dataset_config.train_dates_weigths is not None:\n                    assert dataset_config.val_known_size != \"all\"\n                    # When weight sampling is used, val_known_size is kept but the resulting train size can be smaller due to no enough samples in some train dates\n                    if dataset_config.val_known_size > len(train_indices):\n                        raise ValueError(f\"Requested validation size ({dataset_config.val_known_size}) is larger than the number of available train samples after weight sampling ({len(train_indices)})\")\n                    train_indices, val_known_indices = train_test_split(train_indices, test_size=dataset_config.val_known_size, stratify=train_labels, shuffle=True, random_state=train_val_rng)\n                    dataset_config.train_size = len(train_indices)\n                elif dataset_config.train_size == \"all\" and dataset_config.val_known_size == \"all\":\n                    train_indices, val_known_indices = train_test_split(train_indices, test_size=dataset_config.train_val_split_fraction, stratify=train_labels, shuffle=True, random_state=train_val_rng)\n                else:\n                    if dataset_config.val_known_size != \"all\" and  dataset_config.train_size != \"all\" and dataset_config.train_size + dataset_config.val_known_size > len(train_indices):\n                        raise ValueError(f\"Requested train size + validation size ({dataset_config.train_size + dataset_config.val_known_size}) is larger than the number of available train samples ({len(train_indices)})\")\n                    if dataset_config.train_size != \"all\" and dataset_config.train_size > len(train_indices):\n                        raise ValueError(f\"Requested train size ({dataset_config.train_size}) is larger than the number of available train samples ({len(train_indices)})\")\n                    if dataset_config.val_known_size != \"all\" and dataset_config.val_known_size > len(train_indices):\n                        raise ValueError(f\"Requested validation size ({dataset_config.val_known_size}) is larger than the number of available train samples ({len(train_indices)})\")\n                    train_indices, val_known_indices = train_test_split(train_indices,\n                                                                        train_size=dataset_config.train_size if dataset_config.train_size != \"all\" else None,\n                                                                        
test_size=dataset_config.val_known_size if dataset_config.val_known_size != \"all\" else None,\n                                                                        stratify=train_labels, shuffle=True, random_state=train_val_rng)\n        else:\n            val_known_indices = no_indices()\n            val_unknown_indices = no_indices()\n            val_data_path = None\n        # Initialize test set\n        if dataset_config.need_test_set:\n            test_known_indices, test_unknown_indices, test_data_path = init_or_load_test_indices(dataset_config=dataset_config,\n                                                                                                 known_apps=known_apps,\n                                                                                                 unknown_apps=unknown_apps,\n                                                                                                 tables_app_enum=self._tables_app_enum,\n                                                                                                 disable_indices_cache=disable_indices_cache,)\n        else:\n            test_known_indices = no_indices()\n            test_unknown_indices = no_indices()\n            test_data_path = None\n        # Fit scalers if needed\n        if (dataset_config.ppi_transform is not None and dataset_config.ppi_transform.needs_fitting or\n            dataset_config.flowstats_transform is not None and dataset_config.flowstats_transform.needs_fitting):\n            if not dataset_config.need_train_set:\n                raise ValueError(\"Train set is needed to fit the scalers. Provide pre-fitted scalers.\")\n            fit_scalers(dataset_config=dataset_config, train_indices=train_indices)\n        # Subset dataset indices based on the selected sizes and compute application counts\n        dataset_indices = IndicesTuple(train_indices=train_indices, val_known_indices=val_known_indices, val_unknown_indices=val_unknown_indices, test_known_indices=test_known_indices, test_unknown_indices=test_unknown_indices)\n        dataset_indices = subset_and_sort_indices(dataset_config=dataset_config, dataset_indices=dataset_indices)\n        known_app_counts = compute_known_app_counts(dataset_indices=dataset_indices, tables_app_enum=self._tables_app_enum)\n        unknown_app_counts = compute_unknown_app_counts(dataset_indices=dataset_indices, tables_app_enum=self._tables_app_enum)\n        # Combine known and unknown test indicies to create a single dataloader\n        assert isinstance(dataset_config.test_unknown_size, int)\n        if dataset_config.test_unknown_size > 0 and len(unknown_apps) > 0:\n            test_combined_indices = np.concatenate((dataset_indices.test_known_indices, dataset_indices.test_unknown_indices))\n        else:\n            test_combined_indices = dataset_indices.test_known_indices\n        # Create encoder the class info structure\n        encoder = LabelEncoder().fit(known_apps)\n        encoder.classes_ = np.append(encoder.classes_, UNKNOWN_STR_LABEL)\n        class_info = create_class_info(servicemap=servicemap, encoder=encoder, known_apps=known_apps, unknown_apps=unknown_apps)\n        encode_labels_with_unknown_fn = partial(_encode_labels_with_unknown, encoder=encoder, class_info=class_info)\n        # Create train, validation, and test datasets\n        train_dataset = val_dataset = test_dataset = None\n        if dataset_config.need_train_set:\n            train_dataset = PyTablesDataset(\n                
database_path=dataset_config.database_path,\n                tables_paths=dataset_config._get_train_tables_paths(),\n                indices=dataset_indices.train_indices,\n                tables_app_enum=self._tables_app_enum,\n                tables_cat_enum=self._tables_cat_enum,\n                flowstats_features=dataset_config.flowstats_features,\n                flowstats_features_boolean=dataset_config.flowstats_features_boolean,\n                flowstats_features_phist=dataset_config.flowstats_features_phist,\n                other_fields=self.dataset_config.other_fields,\n                ppi_channels=dataset_config.get_ppi_channels(),\n                ppi_transform=dataset_config.ppi_transform,\n                flowstats_transform=dataset_config.flowstats_transform,\n                flowstats_phist_transform=dataset_config.flowstats_phist_transform,\n                target_transform=encode_labels_with_unknown_fn,\n                return_tensors=dataset_config.return_tensors,)\n        if dataset_config.need_val_set:\n            assert val_data_path is not None\n            val_dataset = PyTablesDataset(\n                database_path=dataset_config.database_path,\n                tables_paths=dataset_config._get_val_tables_paths(),\n                indices=dataset_indices.val_known_indices,\n                tables_app_enum=self._tables_app_enum,\n                tables_cat_enum=self._tables_cat_enum,\n                flowstats_features=dataset_config.flowstats_features,\n                flowstats_features_boolean=dataset_config.flowstats_features_boolean,\n                flowstats_features_phist=dataset_config.flowstats_features_phist,\n                other_fields=self.dataset_config.other_fields,\n                ppi_channels=dataset_config.get_ppi_channels(),\n                ppi_transform=dataset_config.ppi_transform,\n                flowstats_transform=dataset_config.flowstats_transform,\n                flowstats_phist_transform=dataset_config.flowstats_phist_transform,\n                target_transform=encode_labels_with_unknown_fn,\n                return_tensors=dataset_config.return_tensors,\n                preload=dataset_config.preload_val,\n                preload_blob=os.path.join(val_data_path, \"preload\", f\"val_dataset-{dataset_config.val_known_size}.npz\"),)\n        if dataset_config.need_test_set:\n            assert test_data_path is not None\n            test_dataset = PyTablesDataset(\n                database_path=dataset_config.database_path,\n                tables_paths=dataset_config._get_test_tables_paths(),\n                indices=test_combined_indices,\n                tables_app_enum=self._tables_app_enum,\n                tables_cat_enum=self._tables_cat_enum,\n                flowstats_features=dataset_config.flowstats_features,\n                flowstats_features_boolean=dataset_config.flowstats_features_boolean,\n                flowstats_features_phist=dataset_config.flowstats_features_phist,\n                other_fields=self.dataset_config.other_fields,\n                ppi_channels=dataset_config.get_ppi_channels(),\n                ppi_transform=dataset_config.ppi_transform,\n                flowstats_transform=dataset_config.flowstats_transform,\n                flowstats_phist_transform=dataset_config.flowstats_phist_transform,\n                target_transform=encode_labels_with_unknown_fn,\n                return_tensors=dataset_config.return_tensors,\n                preload=dataset_config.preload_test,\n                
preload_blob=os.path.join(test_data_path, \"preload\", f\"test_dataset-{dataset_config.test_known_size}-{dataset_config.test_unknown_size}.npz\"),)\n        self.class_info = class_info\n        self.dataset_indices = dataset_indices\n        self.train_dataset = train_dataset\n        self.val_dataset = val_dataset\n        self.test_dataset = test_dataset\n        self.known_app_counts = known_app_counts\n        self.unknown_app_counts = unknown_app_counts\n        self._collate_fn = collate_fn_simple\n
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.set_dataset_config_and_initialize","title":"set_dataset_config_and_initialize","text":"
set_dataset_config_and_initialize(\n    dataset_config: DatasetConfig,\n    disable_indices_cache: bool = False,\n) -> None\n

Initialize train, validation, and test sets. Data cannot be accessed before calling this method.

Parameters:

Name Type Description Default dataset_config DatasetConfig

Desired configuration of the dataset.

required disable_indices_cache bool

Whether to disable caching of the dataset indices. This is useful when the dataset is used in many different configurations and you want to save disk space.

False Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def set_dataset_config_and_initialize(self, dataset_config: DatasetConfig, disable_indices_cache: bool = False) -> None:\n    \"\"\"\n    Initialize train, validation, and test sets. Data cannot be accessed before calling this method.\n\n    Parameters:\n        dataset_config: Desired configuration of the dataset.\n        disable_indices_cache: Whether to disable caching of the dataset indices. This is useful when the dataset is used in many different configurations and you want to save disk space.\n    \"\"\"\n    self.dataset_config = dataset_config\n    self._clear()\n    self._initialize_train_val_test(disable_indices_cache=disable_indices_cache)\n
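A minimal usage sketch; the dataset class, data_root path, and size below are illustrative assumptions, and default configuration values are used for everything else:
from cesnet_datazoo.config import DatasetConfig\nfrom cesnet_datazoo.datasets import CESNET_QUIC22\n\n# Assumed dataset class and data_root; any CesnetDataset subclass is configured the same way\ndataset = CESNET_QUIC22(\"/data/CESNET-QUIC22/\", size=\"XS\")\ndataset_config = DatasetConfig(dataset=dataset)  # defaults for periods, sizes, and application selection\ndataset.set_dataset_config_and_initialize(dataset_config)  # train, validation, and test sets are ready afterwards\n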
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_train_dataloader","title":"get_train_dataloader","text":"
get_train_dataloader() -> DataLoader\n

Provides a PyTorch DataLoader for training. The dataloader is created on the first call and then cached. When the dataloader is iterated in random order, the last incomplete batch is dropped. The dataloader is configured with the following config attributes:

Dataset config Description batch_size Number of samples per batch. train_workers Number of workers for loading train data. train_dataloader_order Whether to load train data in sequential or random order. See config.DataLoaderOrder. train_dataloader_seed Seed for loading train data in random order.

Returns:

Type Description DataLoader

Train data as an iterable dataloader. See using dataloaders for more details.

Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_train_dataloader(self) -> DataLoader:\n    \"\"\"\n    Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for training. The dataloader is created on the first call and then cached.\n    When the dataloader is iterated in random order, the last incomplete batch is dropped.\n    The dataloader is configured with the following config attributes:\n\n    | Dataset config               | Description                                                                                |\n    | ---------------------------- | ------------------------------------------------------------------------------------------ |\n    | `batch_size`                 | Number of samples per batch.                                                               |\n    | `train_workers`              | Number of workers for loading train data.                                                  |\n    | `train_dataloader_order`     | Whether to load train data in sequential or random order. See [config.DataLoaderOrder][].  |\n    | `train_dataloader_seed`      | Seed for loading train data in random order.                                               |\n\n    Returns:\n        Train data as an iterable dataloader. See [using dataloaders][using-dataloaders] for more details.\n    \"\"\"\n    if self.dataset_config is None:\n        raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting train dataloader\")\n    if not self.dataset_config.need_train_set:\n        raise ValueError(\"Train dataloader is not available when need_train_set is false\")\n    assert self.train_dataset\n    if self.train_dataloader:\n        return self.train_dataloader\n    # Create sampler according to the selected order\n    if self.dataset_config.train_dataloader_order == DataLoaderOrder.RANDOM:\n        if self.dataset_config.train_dataloader_seed is not None:\n            generator = torch.Generator()\n            generator.manual_seed(self.dataset_config.train_dataloader_seed)\n        else:\n            generator = None\n        self.train_dataloader_sampler = RandomSampler(self.train_dataset, generator=generator)\n        self.train_dataloader_drop_last = True\n    elif self.dataset_config.train_dataloader_order == DataLoaderOrder.SEQUENTIAL:\n        self.train_dataloader_sampler = SequentialSampler(self.train_dataset)\n        self.train_dataloader_drop_last = False\n    else: assert_never(self.dataset_config.train_dataloader_order)\n    # Create dataloader\n    batch_sampler = BatchSampler(sampler=self.train_dataloader_sampler, batch_size=self.dataset_config.batch_size, drop_last=self.train_dataloader_drop_last)\n    train_dataloader = DataLoader(\n        self.train_dataset,\n        num_workers=self.dataset_config.train_workers,\n        worker_init_fn=worker_init_fn,\n        collate_fn=self._collate_fn,\n        persistent_workers=self.dataset_config.train_workers > 0,\n        batch_size=None,\n        sampler=batch_sampler,)\n    if self.dataset_config.train_workers == 0:\n        self.train_dataset.pytables_worker_init()\n    self.train_dataloader = train_dataloader\n    return train_dataloader\n
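As a hedged sketch, requesting a reproducible random loading order could look as follows; the batch size, worker count, and seed are illustrative values:
from cesnet_datazoo.config import DataLoaderOrder, DatasetConfig\n\ndataset_config = DatasetConfig(\n    dataset=dataset,  # an already created CesnetDataset instance\n    batch_size=192,\n    train_workers=4,\n    train_dataloader_order=DataLoaderOrder.RANDOM,\n    train_dataloader_seed=42,)\ndataset.set_dataset_config_and_initialize(dataset_config)\ntrain_dataloader = dataset.get_train_dataloader()  # created on the first call and cached for later calls\n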
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_val_dataloader","title":"get_val_dataloader","text":"
get_val_dataloader() -> DataLoader\n

Provides a PyTorch DataLoader for validation. The dataloader is created on the first call and then cached. The dataloader is configured with the following config attributes:

Dataset config Description test_batch_size Number of samples per batch for loading validation and test data. val_workers Number of workers for loading validation data.

Returns:

Type Description DataLoader

Validation data as an iterable dataloader. See using dataloaders for more details.

Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_val_dataloader(self) -> DataLoader:\n    \"\"\"\n    Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for validation.\n    The dataloader is created on the first call and then cached.\n    The dataloader is configured with the following config attributes:\n\n    | Dataset config    | Description                                                       |\n    | ------------------| ------------------------------------------------------------------|\n    | `test_batch_size` | Number of samples per batch for loading validation and test data. |\n    | `val_workers`     | Number of workers for loading validation data.                    |\n\n    Returns:\n        Validation data as an iterable dataloader. See [using dataloaders][using-dataloaders] for more details.\n    \"\"\"\n    if self.dataset_config is None:\n        raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting validation dataloader\")\n    if not self.dataset_config.need_val_set:\n        raise ValueError(\"Validation dataloader is not available when need_val_set is false\")\n    assert self.val_dataset is not None\n    if self.val_dataloader:\n        return self.val_dataloader\n    batch_sampler = BatchSampler(sampler=SequentialSampler(self.val_dataset), batch_size=self.dataset_config.test_batch_size, drop_last=False)\n    val_dataloader = DataLoader(\n        self.val_dataset,\n        num_workers=self.dataset_config.val_workers,\n        worker_init_fn=worker_init_fn,\n        collate_fn=self._collate_fn,\n        persistent_workers=self.dataset_config.val_workers > 0,\n        batch_size=None,\n        sampler=batch_sampler,)\n    if self.dataset_config.val_workers == 0:\n        self.val_dataset.pytables_worker_init()\n    self.val_dataloader = val_dataloader\n    return val_dataloader\n
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_test_dataloader","title":"get_test_dataloader","text":"
get_test_dataloader() -> DataLoader\n

Provides a PyTorch DataLoader for testing. The dataloader is created on the first call and then cached.

When the dataset is used in the open-world setting, and unknown classes are defined, the test dataloader returns test_known_size samples of known classes followed by test_unknown_size samples of unknown classes.

The dataloader is configured with the following config attributes:

Dataset config Description test_batch_size Number of samples per batch for loading validation and test data. test_workers Number of workers for loading test data.

Returns:

Type Description DataLoader

Test data as an iterable dataloader. See using dataloaders for more details.

Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_test_dataloader(self) -> DataLoader:\n    \"\"\"\n    Provides a PyTorch [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for testing.\n    The dataloader is created on the first call and then cached.\n\n    When the dataset is used in the open-world setting, and unknown classes are defined,\n    the test dataloader returns `test_known_size` samples of known classes followed by `test_unknown_size` samples of unknown classes.\n\n    The dataloader is configured with the following config attributes:\n\n    | Dataset config    | Description                                                       |\n    | ------------------| ------------------------------------------------------------------|\n    | `test_batch_size` | Number of samples per batch for loading validation and test data. |\n    | `test_workers`    | Number of workers for loading test data.                          |\n\n    Returns:\n        Test data as an iterable dataloader. See [using dataloaders][using-dataloaders] for more details.\n    \"\"\"\n    if self.dataset_config is None:\n        raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting test dataloader\")\n    if not self.dataset_config.need_test_set:\n        raise ValueError(\"Test dataloader is not available when need_test_set is false\")\n    assert self.test_dataset is not None\n    if self.test_dataloader:\n        return self.test_dataloader\n    batch_sampler = BatchSampler(sampler=SequentialSampler(self.test_dataset), batch_size=self.dataset_config.test_batch_size, drop_last=False)\n    test_dataloader = DataLoader(\n        self.test_dataset,\n        num_workers=self.dataset_config.test_workers,\n        worker_init_fn=worker_init_fn,\n        collate_fn=self._collate_fn,\n        persistent_workers=False,\n        batch_size=None,\n        sampler=batch_sampler,)\n    if self.dataset_config.test_workers == 0:\n        self.test_dataset.pytables_worker_init()\n    self.test_dataloader = test_dataloader\n    return test_dataloader\n
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_dataloaders","title":"get_dataloaders","text":"
get_dataloaders() -> (\n    tuple[DataLoader, DataLoader, DataLoader]\n)\n

Gets train, validation, and test dataloaders in one call.

Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_dataloaders(self) -> tuple[DataLoader, DataLoader, DataLoader]:\n    \"\"\"Gets train, validation, and test dataloaders in one call.\"\"\"\n    if self.dataset_config is None:\n        raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting dataloaders\")\n    train_dataloader = self.get_train_dataloader()\n    val_dataloader = self.get_val_dataloader()\n    test_dataloader = self.get_test_dataloader()\n    return train_dataloader, val_dataloader, test_dataloader\n
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_train_df","title":"get_train_df","text":"
get_train_df(flatten_ppi: bool = False) -> pd.DataFrame\n

Creates a train Pandas DataFrame. The dataframe is in sequential (datetime) order. Consider shuffling the dataframe if needed.

Memory usage

The whole train set is loaded into memory. If the dataset size is larger than 'S', consider using get_train_dataloader instead.

Parameters:

Name Type Description Default flatten_ppi bool

Whether to flatten the PPI sequence into individual columns (named IPT_X, DIR_X, SIZE_X, PUSH_X, X being the index of the packet) or keep one PPI column with 2D data.

False

Returns:

Type Description DataFrame

Train data as a dataframe.

Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_train_df(self, flatten_ppi: bool = False) -> pd.DataFrame:\n    \"\"\"\n    Creates a train Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe is in sequential (datetime) order. Consider shuffling the dataframe if needed.\n\n    !!! warning \"Memory usage\"\n\n        The whole train set is loaded into memory. If the dataset size is larger than `'S'`, consider using `get_train_dataloader` instead.\n\n    Parameters:\n        flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n    Returns:\n        Train data as a dataframe.\n    \"\"\"\n    self._check_before_dataframe(check_train=True)\n    assert self.dataset_config is not None and self.train_dataset is not None\n    if len(self.train_dataset) > DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n        warnings.warn(f\"Train set has ({len(self.train_dataset)} samples), consider using get_train_dataloader() instead\")\n    train_dataloader = self.get_train_dataloader()\n    assert isinstance(train_dataloader.sampler, BatchSampler) and self.train_dataloader_sampler is not None\n    # Read dataloader in sequential order\n    train_dataloader.sampler.sampler = SequentialSampler(self.train_dataset)\n    train_dataloader.sampler.drop_last = False\n    feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n    df = create_df_from_dataloader(dataloader=train_dataloader,\n                                   feature_names=feature_names,\n                                   flatten_ppi=flatten_ppi,\n                                   silent=self.silent)\n    # Restore the original dataloader sampler and drop_last\n    train_dataloader.sampler.sampler = self.train_dataloader_sampler\n    train_dataloader.sampler.drop_last = self.train_dataloader_drop_last\n    return df\n
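A short sketch of both dataframe layouts; shuffling with pandas is an illustrative choice, not part of this method:
train_df = dataset.get_train_df(flatten_ppi=True)     # per-packet columns IPT_X, DIR_X, SIZE_X (one set per packet position) plus flow statistics\ntrain_df = train_df.sample(frac=1, random_state=42)   # the dataframe is in datetime order; shuffle if needed\n\ntrain_df_2d = dataset.get_train_df(flatten_ppi=False)  # a single PPI column holding the 2D per-flow sequences\n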
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_val_df","title":"get_val_df","text":"
get_val_df(flatten_ppi: bool = False) -> pd.DataFrame\n

Creates validation Pandas DataFrame. The dataframe is in sequential (datetime) order.

Memory usage

The whole validation set is loaded into memory. If the dataset size is larger than 'S', consider using get_val_dataloader instead.

Parameters:

Name Type Description Default flatten_ppi bool

Whether to flatten the PPI sequence into individual columns (named IPT_X, DIR_X, SIZE_X, PUSH_X, X being the index of the packet) or keep one PPI column with 2D data.

False

Returns:

Type Description DataFrame

Validation data as a dataframe.

Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_val_df(self, flatten_ppi: bool = False) -> pd.DataFrame:\n    \"\"\"\n    Creates validation Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe is in sequential (datetime) order.\n\n    !!! warning \"Memory usage\"\n\n        The whole validation set is loaded into memory. If the dataset size is larger than `'S'`, consider using `get_val_dataloader` instead.\n\n    Parameters:\n        flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n    Returns:\n        Validation data as a dataframe.\n    \"\"\"\n    self._check_before_dataframe(check_val=True)\n    assert self.dataset_config is not None and self.val_dataset is not None\n    if len(self.val_dataset) > DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n        warnings.warn(f\"Validation set has ({len(self.val_dataset)} samples), consider using get_val_dataloader() instead\")\n    feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n    return create_df_from_dataloader(dataloader=self.get_val_dataloader(),\n                                     feature_names=feature_names,\n                                     flatten_ppi=flatten_ppi,\n                                     silent=self.silent)\n
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_test_df","title":"get_test_df","text":"
get_test_df(flatten_ppi: bool = False) -> pd.DataFrame\n

Creates test Pandas DataFrame. The dataframe is in sequential (datetime) order.

When the dataset is used in the open-world setting, and unknown classes are defined, the returned test dataframe is composed of test_known_size samples of known classes followed by test_unknown_size samples of unknown classes.

Memory usage

The whole test set is loaded into memory. If the dataset size is larger than 'S', consider using get_test_dataloader instead.

Parameters:

Name Type Description Default flatten_ppi bool

Whether to flatten the PPI sequence into individual columns (named IPT_X, DIR_X, SIZE_X, PUSH_X, X being the index of the packet) or keep one PPI column with 2D data.

False

Returns:

Type Description DataFrame

Test data as a dataframe.

Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_test_df(self, flatten_ppi: bool = False) -> pd.DataFrame:\n    \"\"\"\n    Creates test Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). The dataframe is in sequential (datetime) order.\n\n\n    When the dataset is used in the open-world setting, and unknown classes are defined,\n    the returned test dataframe is composed of `test_known_size` samples of known classes followed by `test_unknown_size` samples of unknown classes.\n\n\n    !!! warning \"Memory usage\"\n\n        The whole test set is loaded into memory. If the dataset size is larger than `'S'`, consider using `get_test_dataloader` instead.\n\n    Parameters:\n        flatten_ppi: Whether to flatten the PPI sequence into individual columns (named `IPT_X`, `DIR_X`, `SIZE_X`, `PUSH_X`, *X* being the index of the packet) or keep one `PPI` column with 2D data.\n\n    Returns:\n        Test data as a dataframe.\n    \"\"\"\n    self._check_before_dataframe(check_test=True)\n    assert self.dataset_config is not None and self.test_dataset is not None\n    if len(self.test_dataset) > DATAFRAME_SAMPLES_WARNING_THRESHOLD:\n        warnings.warn(f\"Test set has ({len(self.test_dataset)} samples), consider using get_test_dataloader() instead\")\n    feature_names = self.dataset_config.get_feature_names(flatten_ppi=flatten_ppi)\n    return create_df_from_dataloader(dataloader=self.get_test_dataloader(),\n                                     feature_names=feature_names,\n                                     flatten_ppi=flatten_ppi,\n                                     silent=self.silent)\n
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_num_classes","title":"get_num_classes","text":"
get_num_classes() -> int\n

Returns the number of classes in the current configuration of the dataset.

Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_num_classes(self) -> int:\n    \"\"\"Returns the number of classes in the current configuration of the dataset.\"\"\"\n    if self.class_info is None:\n        raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting the number of classes\")\n    return self.class_info.num_classes\n
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_known_apps","title":"get_known_apps","text":"
get_known_apps() -> list[str]\n

Returns the list of known applications in the current configuration of the dataset.

Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_known_apps(self) -> list[str]:\n    \"\"\"Returns the list of known applications in the current configuration of the dataset.\"\"\"\n    if self.class_info is None:\n        raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting known apps\")\n    return self.class_info.known_apps\n
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.get_unknown_apps","title":"get_unknown_apps","text":"
get_unknown_apps() -> list[str]\n

Returns the list of unknown applications in the current configuration of the dataset.

Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def get_unknown_apps(self) -> list[str]:\n    \"\"\"Returns the list of unknown applications in the current configuration of the dataset.\"\"\"\n    if self.class_info is None:\n        raise ValueError(\"Dataset is not initialized, use set_dataset_config_and_initialize() before getting unknown apps\")\n    return self.class_info.unknown_apps\n
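A small usage sketch combining the class-related getters; the printed summary is illustrative:
num_classes = dataset.get_num_classes()\nknown_apps = dataset.get_known_apps()\nunknown_apps = dataset.get_unknown_apps()\nprint(f\"{num_classes} classes, {len(known_apps)} known apps, {len(unknown_apps)} unknown apps\")\n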
"},{"location":"reference_cesnet_dataset/#datasets.cesnet_dataset.CesnetDataset.compute_dataset_statistics","title":"compute_dataset_statistics","text":"
compute_dataset_statistics(\n    num_samples: int | Literal[\"all\"] = 10000000,\n    num_workers: int = 4,\n    batch_size: int = 16384,\n    disabled_apps: Optional[list[str]] = None,\n) -> None\n

Computes dataset statistics and saves them to the statistics_path folder.

Parameters:

Name Type Description Default num_samples int | Literal['all']

Number of samples to use for computing the statistics.

10000000 num_workers int

Number of workers for loading data.

4 batch_size int

Number of samples per batch for loading data.

16384 disabled_apps Optional[list[str]]

List of applications to exclude from the statistics.

None Source code in cesnet_datazoo\\datasets\\cesnet_dataset.py
def compute_dataset_statistics(self, num_samples: int | Literal[\"all\"] = 10_000_000, num_workers: int = 4, batch_size: int = 16384, disabled_apps: Optional[list[str]] = None) -> None:\n    \"\"\"\n    Computes dataset statistics and saves them to the `statistics_path` folder.\n\n    Parameters:\n        num_samples: Number of samples to use for computing the statistics.\n        num_workers: Number of workers for loading data.\n        batch_size: Number of samples per batch for loading data.\n        disabled_apps: List of applications to exclude from the statistics.\n    \"\"\"\n    if disabled_apps:\n        bad_disabled_apps = [a for a in disabled_apps if a not in self.available_classes]\n        if len(bad_disabled_apps) > 0:\n            raise ValueError(f\"Bad applications in disabled_apps {bad_disabled_apps}. Use applications available in dataset.available_classes\")\n    if not os.path.exists(self.statistics_path):\n        os.mkdir(self.statistics_path)\n    compute_dataset_statistics(database_path=self.database_path,\n                               tables_app_enum=self._tables_app_enum,\n                               tables_cat_enum=self._tables_cat_enum,\n                               output_dir=self.statistics_path,\n                               packet_histograms=self.metadata.packet_histograms,\n                               flowstats_features_boolean=self.metadata.flowstats_features_boolean,\n                               protocol=self.metadata.protocol,\n                               extra_fields=not self.name.startswith(\"CESNET-TLS22\"),\n                               disabled_apps=disabled_apps if disabled_apps is not None else [],\n                               num_samples=num_samples,\n                               num_workers=num_workers,\n                               batch_size=batch_size,\n                               silent=self.silent)\n
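An illustrative call; the sample count is arbitrary, and the defaults shown above work as well:
dataset.compute_dataset_statistics(num_samples=1_000_000, num_workers=4, batch_size=16384)\n# The resulting statistics files are saved to the dataset.statistics_path folder\n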
"},{"location":"reference_dataset_config/","title":"Config class","text":""},{"location":"reference_dataset_config/#config.DatasetConfig","title":"config.DatasetConfig","text":"

The main class for the configuration of:

When initializing this class, pass a CesnetDataset instance to be configured and the desired configuration. Available options are here.

Attributes:

Name Type Description dataset InitVar[CesnetDataset]

The dataset instance to be configured.

data_root str

Taken from the dataset instance.

database_filename str

Taken from the dataset instance.

database_path str

Taken from the dataset instance.

servicemap_path str

Taken from the dataset instance.

flowstats_features list[str]

Taken from dataset.metadata.flowstats_features.

flowstats_features_boolean list[str]

Taken from dataset.metadata.flowstats_features_boolean.

flowstats_features_phist list[str]

Taken from dataset.metadata.packet_histograms if use_packet_histograms is true, otherwise an empty list.

other_fields list[str]

Taken from dataset.metadata.other_fields if return_other_fields is true, otherwise an empty list.

"},{"location":"reference_dataset_config/#config.DatasetConfig--configuration-options","title":"Configuration options","text":"

Attributes:

Name Type Description need_train_set bool

Use to disable the train set. Default: True

need_val_set bool

Use to disable the validation set. Default: True

need_test_set bool

Use to disable the test set. Default: True

train_period_name str

Name of the train period. See instructions.

train_dates list[str]

Dates used for creating a train set.

train_dates_weigths Optional[list[int]]

To use a non-uniform distribution of samples across train dates.

val_approach ValidationApproach

How a validation set should be created. Either split train data into train and validation or have a separate validation period. Default: SPLIT_FROM_TRAIN

train_val_split_fraction float

The fraction of validation samples when splitting from the train set. Default: 0.2

val_period_name str

Name of the validation period. See instructions.

val_dates list[str]

Dates used for creating a validation set.

test_period_name str

Name of the test period. See instructions.

test_dates list[str]

Dates used for creating a test set.

apps_selection AppSelection

How to select application classes. Default: ALL_KNOWN

apps_selection_topx int

Take top X as known.

apps_selection_background_unknown list[str]

Provide a list of background traffic classes to be used as unknown.

apps_selection_fixed_known list[str]

Provide a list of manually selected known applications.

apps_selection_fixed_unknown list[str]

Provide a list of manually selected unknown applications.

disabled_apps list[str]

List of applications to be disabled and not used at all.

min_train_samples_check MinTrainSamplesCheck

How to handle applications with not enough training samples. Default: DISABLE_APPS

min_train_samples_per_app int

Defines the threshold for not enough. Default: 100

random_state int

Fix all random processes performed during dataset initialization. Default: 420

fold_id int

To perform N-fold cross-validation, set this to 1..N. Each fold will use the same configuration but a different random seed. Default: 0

train_workers int

Number of workers for loading train data. 0 means that the data will be loaded in the main process. Default: 4

test_workers int

Number of workers for loading test data. 0 means that the data will be loaded in the main process. Default: 1

val_workers int

Number of workers for loading validation data. 0 means that the data will be loaded in the main process. Default: 1

batch_size int

Number of samples per batch. Default: 192

test_batch_size int

Number of samples per batch for loading validation and test data. Default: 2048

preload_val bool

Whether to dump the validation set with numpy.savez_compressed and preload it in future runs. Useful when running a lot of experiments with the same dataset configuration. Default: False

preload_test bool

Whether to dump the test set with numpy.savez_compressed and preload it in future runs. Default: False

train_size int | Literal['all']

Size of the train set. See instructions. Default: all

val_known_size int | Literal['all']

Size of the validation set. See instructions. Default: all

test_known_size int | Literal['all']

Size of the test set. See instructions. Default: all

val_unknown_size int | Literal['all']

Size of the unknown classes validation set. Use for evaluation in the open-world setting. Default: 0

test_unknown_size int | Literal['all']

Size of the unknown classes test set. Use for evaluation in the open-world setting; a configuration sketch follows this list of options. Default: 0

train_dataloader_order DataLoaderOrder

Whether to load train data in sequential or random order. Default: RANDOM

train_dataloader_seed Optional[int]

Seed for loading train data in random order. Default: None

return_other_fields bool

Whether to return auxiliary fields, such as communicating hosts, flow times, and more fields extracted from the ClientHello message. Default: False

return_tensors bool

Use for returning torch.Tensor from dataloaders. Dataframes are not available when this option is used. Default: False

use_packet_histograms bool

Whether to use packet histogram features, if available in the dataset. Default: True

use_tcp_features bool

Whether to use TCP features, if available in the dataset. Default: True

use_push_flags bool

Whether to use push flags in packet sequences, if available in the dataset. Default: False

fit_scalers_samples int | float

Used when a scaling transformation is configured and requires fitting. If a float, the fraction of train samples used for fitting; otherwise, the absolute number of samples. Default: 0.25

ppi_transform Optional[Callable]

Transform function for PPI sequences. See the transforms page for more information. Default: None

flowstats_transform Optional[Callable]

Transform function for flow statistics. See the transforms page for more information. Default: None

flowstats_phist_transform Optional[Callable]

Transform function for packet histograms. See the transforms page for more information. Default: None

"},{"location":"reference_dataset_config/#config.DatasetConfig--how-to-configure-train-validation-and-test-sets","title":"How to configure train, validation, and test sets","text":"

There are three options for how to define train/validation/test dates.

  1. Choose a predefined time period (train_period_name, val_period_name, or test_period_name) available in dataset.time_periods and leave the list of dates (train_dates, val_dates, or test_dates) empty.
  2. Provide a list of dates and a name for the time period. The dates are checked against dataset.available_dates.
  3. Do not specify anything and use the dataset's defaults dataset.default_train_period_name and dataset.default_test_period_name.

There are two options for configuring sizes of train/validation/test sets.

  1. Select an appropriate dataset size (default is S) when creating the CesnetDataset instance and leave train_size, val_known_size, and test_known_size with their default all value. This will create train/validation/test sets with all samples available in the selected dataset size (of course, depending on the selected dates and validation approach).
  2. Provide exact sizes in train_size, val_known_size, and test_known_size. This will create train/validation/test sets of the given sizes by taking a random subset. This is especially useful when using the ORIG dataset size and you want to control the size of experiments (see the configuration sketch below the tip).

Tip

The default approach for creating a validation set is to randomly split the train data into train and validation. The second approach is to define separate validation dates. See ValidationApproach.
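Putting these options together, the following sketch configures the dates via predefined period names, requests exact set sizes, and keeps the default split-from-train validation. The dataset path, period names, and sizes are illustrative assumptions, and set_dataset_config_and_initialize is assumed to be the initialization entry point of CesnetDataset.

from cesnet_datazoo.config import DatasetConfig, ValidationApproach
from cesnet_datazoo.datasets import CESNET_QUIC22

# Illustrative dataset location and size; adjust to your environment.
dataset = CESNET_QUIC22("/datasets/CESNET-QUIC22/", size="S")

config = DatasetConfig(
    dataset=dataset,
    # Dates: option 1, predefined time periods from dataset.time_periods
    train_period_name="W-2022-44",
    test_period_name="W-2022-45",
    # Sizes: option 2, exact sizes instead of the default "all"
    train_size=500_000,
    val_known_size=50_000,
    test_known_size=100_000,
    # Validation: default approach, a random stratified split from the train data
    val_approach=ValidationApproach.SPLIT_FROM_TRAIN,
    train_val_split_fraction=0.2,
)
# Assumed initialization call; see the CesnetDataset reference for the exact API.
dataset.set_dataset_config_and_initialize(config)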

Source code in cesnet_datazoo\\config.py
@dataclass(config=C)\nclass DatasetConfig():\n    \"\"\"\n    The main class for the configuration of:\n\n    - Train, validation, test sets (dates, sizes, validation approach).\n    - Application selection \u2014 either the standard closed-world setting (only *known* classes) or the open-world setting (*known* and *unknown* classes).\n    - Data transformations. See the [transforms][transforms] page for more information.\n    - Dataloader options like batch sizes, order of loading, or number of workers.\n\n    When initializing this class, pass a [`CesnetDataset`][datasets.cesnet_dataset.CesnetDataset] instance to be configured and the desired configuration. Available options are [here][config.DatasetConfig--configuration-options].\n\n    Attributes:\n        dataset: The dataset instance to be configured.\n        data_root: Taken from the dataset instance.\n        database_filename: Taken from the dataset instance.\n        database_path: Taken from the dataset instance.\n        servicemap_path: Taken from the dataset instance.\n        flowstats_features: Taken from `dataset.metadata.flowstats_features`.\n        flowstats_features_boolean: Taken from `dataset.metadata.flowstats_features_boolean`.\n        flowstats_features_phist: Taken from `dataset.metadata.packet_histograms` if `use_packet_histograms` is true, otherwise an empty list.\n        other_fields: Taken from `dataset.metadata.other_fields` if `return_other_fields` is true, otherwise an empty list.\n\n    # Configuration options\n\n    Attributes:\n        need_train_set: Use to disable the train set. `Default: True`\n        need_val_set: Use to disable the validation set. `Default: True`\n        need_test_set: Use to disable the test set. `Default: True`\n        train_period_name: Name of the train period. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets].\n        train_dates: Dates used for creating a train set.\n        train_dates_weigths: To use a non-uniform distribution of samples across train dates.\n        val_approach: How a validation set should be created. Either split train data into train and validation or have a separate validation period. `Default: SPLIT_FROM_TRAIN`\n        train_val_split_fraction: The fraction of validation samples when splitting from the train set. `Default: 0.2`\n        val_period_name: Name of the validation period. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets].\n        val_dates: Dates used for creating a validation set.\n        test_period_name: Name of the test period. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets].\n        test_dates: Dates used for creating a test set.\n\n        apps_selection: How to select application classes. `Default: ALL_KNOWN`\n        apps_selection_topx: Take top X as known.\n        apps_selection_background_unknown: Provide a list of background traffic classes to be used as unknown.\n        apps_selection_fixed_known: Provide a list of manually selected known applications.\n        apps_selection_fixed_unknown: Provide a list of manually selected unknown applications.\n        disabled_apps: List of applications to be disabled and not used at all.\n        min_train_samples_check: How to handle applications with *not enough* training samples. `Default: DISABLE_APPS`\n        min_train_samples_per_app: Defines the threshold for *not enough*. 
`Default: 100`\n\n        random_state: Fix all random processes performed during dataset initialization. `Default: 420`\n        fold_id: To perform N-fold cross-validation, set this to `1..N`. Each fold will use the same configuration but a different random seed. `Default: 0`\n        train_workers: Number of workers for loading train data. `0` means that the data will be loaded in the main process. `Default: 4`\n        test_workers: Number of workers for loading test data. `0` means that the data will be loaded in the main process. `Default: 1`\n        val_workers: Number of workers for loading validation data. `0` means that the data will be loaded in the main process. `Default: 1`\n        batch_size: Number of samples per batch. `Default: 192`\n        test_batch_size: Number of samples per batch for loading validation and test data. `Default: 2048`\n        preload_val: Whether to dump the validation set with `numpy.savez_compressed` and preload it in future runs. Useful when running a lot of experiments with the same dataset configuration. `Default: False`\n        preload_test: Whether to dump the test set with `numpy.savez_compressed` and preload it in future runs. `Default: False`\n        train_size: Size of the train set. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets]. `Default: all`\n        val_known_size: Size of the validation set. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets]. `Default: all`\n        test_known_size: Size of the test set. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets]. `Default: all`\n        val_unknown_size: Size of the unknown classes validation set. Use for evaluation in the open-world setting. `Default: 0`\n        test_unknown_size: Size of the unknown classes test set. Use for evaluation in the open-world setting. `Default: 0`\n        train_dataloader_order: Whether to load train data in sequential or random order. `Default: RANDOM`\n        train_dataloader_seed: Seed for loading train data in random order. `Default: None`\n\n        return_other_fields: Whether to return [auxiliary fields][other-fields], such as communicating hosts, flow times, and more fields extracted from the ClientHello message. `Default: False`\n        return_tensors: Use for returning `torch.Tensor` from dataloaders. Dataframes are not available when this option is used. `Default: False`\n        use_packet_histograms: Whether to use packet histogram features, if available in the dataset. `Default: True`\n        use_tcp_features: Whether to use TCP features, if available in the dataset. `Default: True`\n        use_push_flags: Whether to use push flags in packet sequences, if available in the dataset. `Default: False`\n        fit_scalers_samples: Used when scaling transformation is configured and requires fitting. Fraction of train samples used for fitting, if float. The absolute number of samples otherwise. `Default: 0.25`\n        ppi_transform: Transform function for PPI sequences. See the [transforms][transforms] page for more information. `Default: None`\n        flowstats_transform: Transform function for flow statistics. See the [transforms][transforms] page for more information. `Default: None`\n        flowstats_phist_transform: Transform function for packet histograms. See the [transforms][transforms] page for more information. 
`Default: None`\n\n    # How to configure train, validation, and test sets\n    There are three options for how to define train/validation/test dates.\n\n    1. Choose a predefined time period (`train_period_name`, `val_period_name`, or `test_period_name`) available in `dataset.time_periods` and leave the list of dates (`train_dates`, `val_dates`, or `test_dates`) empty.\n    2. Provide a list of dates and a name for the time period. The dates are checked against `dataset.available_dates`.\n    3. Do not specify anything and use the dataset's defaults `dataset.default_train_period_name` and `dataset.default_test_period_name`.\n\n    There are two options for configuring sizes of train/validation/test sets.\n\n    1. Select an appropriate dataset size (default is `S`) when creating the [`CesnetDataset`][datasets.cesnet_dataset.CesnetDataset] instance and leave `train_size`, `val_known_size`, and `test_known_size` with their default `all` value.\n    This will create train/validation/test sets with all samples available in the selected dataset size (of course, depending on the selected dates and validation approach).\n    2. Provide exact sizes in `train_size`, `val_known_size`, and `test_known_size`. This will create train/validation/test sets of the given sizes by doing a random subset.\n    This is especially useful when using the `ORIG` dataset size and want to control the size of experiments.\n\n    !!! tip Validation set\n        The default approach for creating a validation set is to randomly split the train data into train and validation. The second approach is to define separate validation dates. See [ValidationApproach][config.ValidationApproach].\n\n    \"\"\"\n    dataset: InitVar[CesnetDataset]\n    data_root: str = field(init=False)\n    database_filename: str =  field(init=False)\n    database_path: str =  field(init=False)\n    servicemap_path: str = field(init=False)\n    flowstats_features: list[str] = field(init=False)\n    flowstats_features_boolean: list[str] = field(init=False)\n    flowstats_features_phist: list[str] = field(init=False)\n    other_fields: list[str] = field(init=False)\n\n    need_train_set: bool = True\n    need_val_set: bool = True\n    need_test_set: bool = True\n    train_period_name: str = \"\"\n    train_dates: list[str] = field(default_factory=list)\n    train_dates_weigths: Optional[list[int]] = None\n    val_approach: ValidationApproach = ValidationApproach.SPLIT_FROM_TRAIN\n    train_val_split_fraction: float = 0.2\n    val_period_name: str = \"\"\n    val_dates: list[str] = field(default_factory=list)\n    test_period_name: str = \"\"\n    test_dates: list[str] = field(default_factory=list)\n\n    apps_selection: AppSelection = AppSelection.ALL_KNOWN\n    apps_selection_topx: int = 0\n    apps_selection_background_unknown: list[str] = field(default_factory=list)\n    apps_selection_fixed_known: list[str] = field(default_factory=list)\n    apps_selection_fixed_unknown: list[str] = field(default_factory=list)\n    disabled_apps: list[str] = field(default_factory=list)\n    min_train_samples_check: MinTrainSamplesCheck = MinTrainSamplesCheck.DISABLE_APPS\n    min_train_samples_per_app: int = 100\n\n    random_state: int = 420\n    fold_id: int = 0\n    train_workers: int = 4\n    test_workers: int = 1\n    val_workers: int = 1\n    batch_size: int = 192\n    test_batch_size: int = 2048\n    preload_val: bool = False\n    preload_test: bool = False\n    train_size: int | Literal[\"all\"] = \"all\"\n    val_known_size: int | Literal[\"all\"] = 
\"all\"\n    test_known_size: int | Literal[\"all\"] = \"all\"\n    val_unknown_size: int | Literal[\"all\"] = 0\n    test_unknown_size: int | Literal[\"all\"] = 0\n    train_dataloader_order: DataLoaderOrder = DataLoaderOrder.RANDOM\n    train_dataloader_seed: Optional[int] = None\n\n    return_other_fields: bool = False\n    return_tensors: bool = False\n    use_packet_histograms: bool = False\n    use_tcp_features: bool = False\n    use_push_flags: bool = False\n    fit_scalers_samples: int | float = 0.25\n    ppi_transform: Optional[Callable] = None\n    flowstats_transform: Optional[Callable] = None\n    flowstats_phist_transform: Optional[Callable] = None\n\n    def __post_init__(self, dataset: CesnetDataset):\n        \"\"\"\n        Ensures valid configuration. Catches all incompatible options and raise exceptions as soon as possible.\n        \"\"\"\n        self.data_root = dataset.data_root\n        self.servicemap_path = dataset.servicemap_path\n        self.database_filename = dataset.database_filename\n        self.database_path = dataset.database_path\n\n        if not self.need_train_set:\n            if self.apps_selection != AppSelection.FIXED:\n                raise ValueError(\"Application selection has to be fixed when need_train_set is false\")\n            if (len(self.train_dates) > 0 or self.train_period_name != \"\"):\n                raise ValueError(\"train_dates and train_period_name cannot be specified when need_train_set is false\")\n        else:\n            # Configure train dates\n            if len(self.train_dates) > 0 and self.train_period_name == \"\":\n                raise ValueError(\"train_period_name has to be specified when train_dates are set\")\n            if len(self.train_dates) == 0 and self.train_period_name != \"\":\n                if self.train_period_name not in dataset.time_periods:\n                    raise ValueError(f\"Unknown train_period_name {self.train_period_name}. Use time period available in dataset.time_periods\")\n                self.train_dates = dataset.time_periods[self.train_period_name]\n            if len(self.train_dates) == 0 and self.train_period_name == \"\":\n                self.train_period_name = dataset.default_train_period_name\n                self.train_dates = dataset.time_periods[dataset.default_train_period_name]\n        # Configure test dates\n        if not self.need_test_set:\n            if (len(self.test_dates) > 0 or self.test_period_name != \"\"):\n                raise ValueError(\"test_dates and test_period_name cannot be specified when need_test_set is false\")\n        else:\n            if len(self.test_dates) > 0 and self.test_period_name == \"\":\n                raise ValueError(\"test_period_name has to be specified when test_dates are set\")\n            if len(self.test_dates) == 0 and self.test_period_name != \"\":\n                if self.test_period_name not in dataset.time_periods:\n                    raise ValueError(f\"Unknown test_period_name {self.test_period_name}. 
Use time period available in dataset.time_periods\")\n                self.test_dates = dataset.time_periods[self.test_period_name]\n            if len(self.test_dates) == 0 and self.test_period_name == \"\":\n                self.test_period_name = dataset.default_test_period_name\n                self.test_dates = dataset.time_periods[dataset.default_test_period_name]\n        # Configure val dates\n        if not self.need_val_set:\n            if len(self.val_dates) > 0 or self.val_period_name != \"\" or self.val_approach != ValidationApproach.SPLIT_FROM_TRAIN:\n                raise ValueError(\"val_dates, val_period_name, and val_approach cannot be specified when need_val_set is false\")\n        else:\n            if self.val_approach == ValidationApproach.SPLIT_FROM_TRAIN:\n                if len(self.val_dates) > 0 or self.val_period_name != \"\":\n                    raise ValueError(\"val_dates and val_period_name cannot be specified when the validation approach is split-from-train\")\n                if not self.need_train_set:\n                    raise ValueError(\"Cannot use the split-from-train validation approach when need_train_set is false. Either use the validation-dates approach or set need_val_set to false.\")\n            elif self.val_approach == ValidationApproach.VALIDATION_DATES:\n                if len(self.val_dates) > 0 and self.val_period_name == \"\":\n                    raise ValueError(\"val_period_name has to be specified when val_dates are set\")\n                if len(self.val_dates) == 0 and self.val_period_name != \"\":\n                    if self.val_period_name not in dataset.time_periods:\n                        raise ValueError(f\"Unknown val_period_name {self.val_period_name}. Use time period available in dataset.time_periods\")\n                    self.val_dates = dataset.time_periods[self.val_period_name]\n                if len(self.val_dates) == 0 and self.val_period_name == \"\":\n                    raise ValueError(\"val_period_name and val_dates (or val_period_name from dataset.time_periods) have to be specified when the validation approach is validation-dates\")\n        # Check if train, val, and test dates are available in the dataset\n        bad_train_dates = [t for t in self.train_dates if t not in dataset.available_dates]\n        bad_val_dates = [t for t in self.val_dates if t not in dataset.available_dates]\n        bad_test_dates = [t for t in self.test_dates if t not in dataset.available_dates]\n        if len(bad_train_dates) > 0:\n            raise ValueError(f\"Bad train dates {bad_train_dates}. Use dates available in dataset.available_dates (collection period {dataset.metadata.collection_period})\" \\\n                            + (f\". These dates are missing from the dataset collection period {dataset.metadata.missing_dates_in_collection_period}\" if dataset.metadata.missing_dates_in_collection_period else \"\"))\n        if len(bad_val_dates) > 0:\n            raise ValueError(f\"Bad validation dates {bad_val_dates}. Use dates available in dataset.available_dates (collection period {dataset.metadata.collection_period})\" \\\n                            + (f\". These dates are missing from the dataset collection period {dataset.metadata.missing_dates_in_collection_period}\" if dataset.metadata.missing_dates_in_collection_period else \"\"))\n        if len(bad_test_dates) > 0:\n            raise ValueError(f\"Bad test dates {bad_test_dates}. 
Use dates available in dataset.available_dates (collection period {dataset.metadata.collection_period})\" \\\n                            + (f\". These dates are missing from the dataset collection period {dataset.metadata.missing_dates_in_collection_period}\" if dataset.metadata.missing_dates_in_collection_period else \"\"))\n        # Check time order of train, val, and test periods\n        train_dates = [datetime.strptime(date_str, \"%Y%m%d\").date() for date_str in self.train_dates]\n        test_dates = [datetime.strptime(date_str, \"%Y%m%d\").date() for date_str in self.test_dates]\n        if len(train_dates) > 0 and len(test_dates) > 0 and min(test_dates) <= max(train_dates):\n            warnings.warn(f\"Some test dates ({min(test_dates).strftime('%Y%m%d')}) are before or equal to the last train date ({max(train_dates).strftime('%Y%m%d')}). This might lead to improper evaluation and should be avoided.\")\n        if self.val_approach == ValidationApproach.VALIDATION_DATES:\n            val_dates = [datetime.strptime(date_str, \"%Y%m%d\").date() for date_str in self.val_dates]\n            if len(train_dates) > 0 and min(val_dates) <= max(train_dates):\n                warnings.warn(f\"Some validation dates ({min(val_dates).strftime('%Y%m%d')}) are before or equal to the last train date ({max(train_dates).strftime('%Y%m%d')}). This might lead to improper evaluation and should be avoided.\")\n            if len(test_dates) > 0 and min(test_dates) <= max(val_dates):\n                warnings.warn(f\"Some test dates ({min(test_dates).strftime('%Y%m%d')}) are before or equal to the last validation date ({max(val_dates).strftime('%Y%m%d')}). This might lead to improper evaluation and should be avoided.\")\n        # Configure features\n        self.flowstats_features = dataset.metadata.flowstats_features\n        self.flowstats_features_boolean = dataset.metadata.flowstats_features_boolean\n        self.other_fields = dataset.metadata.other_fields if self.return_other_fields else []\n        if self.use_packet_histograms:\n            if len(dataset.metadata.packet_histograms) == 0:\n                raise ValueError(\"This dataset does not support use_packet_histograms\")\n            self.flowstats_features_phist = dataset.metadata.packet_histograms\n        else:\n            self.flowstats_features_phist = []\n            if self.flowstats_phist_transform is not None:\n                raise ValueError(\"flowstats_phist_transform cannot be specified when use_packet_histograms is false\")\n        if dataset.metadata.protocol == Protocol.TLS:\n            if self.use_tcp_features:\n                self.flowstats_features_boolean = self.flowstats_features_boolean + SELECTED_TCP_FLAGS\n            if self.use_push_flags and \"PUSH_FLAG\" not in dataset.metadata.ppi_features:\n                raise ValueError(\"This TLS dataset does not support use_push_flags\")\n        if dataset.metadata.protocol == Protocol.QUIC:\n            if self.use_tcp_features:\n                raise ValueError(\"QUIC datasets do not support use_tcp_features\")\n            if self.use_push_flags:\n                raise ValueError(\"QUIC datasets do not support use_push_flags\")\n        # When train_dates_weigths are used, train_size and val_known_size have to be specified\n        if self.train_dates_weigths is not None:\n            if not self.need_train_set:\n                raise ValueError(\"train_dates_weigths cannot be specified when need_train_set is false\")\n            if 
len(self.train_dates_weigths) != len(self.train_dates):\n                raise ValueError(\"train_dates_weigths has to have the same length as train_dates\")\n            if self.train_size == \"all\":\n                raise ValueError(\"train_size cannot be 'all' when train_dates_weigths are speficied\")\n            if self.val_approach == ValidationApproach.SPLIT_FROM_TRAIN and self.val_known_size == \"all\":\n                raise ValueError(\"val_known_size cannot be 'all' when train_dates_weigths are speficied and validation_approach is split-from-train\")\n        # App selection\n        if self.apps_selection == AppSelection.ALL_KNOWN:\n            self.val_unknown_size = 0\n            self.test_unknown_size = 0\n            if self.apps_selection_topx != 0 or len(self.apps_selection_background_unknown) > 0 or len(self.apps_selection_fixed_known) > 0 or len(self.apps_selection_fixed_unknown) > 0:\n                raise ValueError(\"apps_selection_topx, apps_selection_background_unknown, apps_selection_fixed_known, and apps_selection_fixed_unknown cannot be specified when application selection is all-known\")\n        if self.apps_selection == AppSelection.TOPX_KNOWN:\n            if self.apps_selection_topx == 0:\n                raise ValueError(\"apps_selection_topx has to be greater than 0 when application selection is top-x-known\")\n            if len(self.apps_selection_background_unknown) > 0 or len(self.apps_selection_fixed_known) > 0 or len(self.apps_selection_fixed_unknown) > 0:\n                raise ValueError(\"apps_selection_background_unknown, apps_selection_fixed_known, and apps_selection_fixed_unknown cannot be specified when application selection is top-x-known\")\n        if self.apps_selection == AppSelection.BACKGROUND_UNKNOWN:\n            if len(self.apps_selection_background_unknown) == 0:\n                raise ValueError(\"apps_selection_background_unknown has to be specified when application selection is background-unknown\")\n            bad_apps = [a for a in self.apps_selection_background_unknown if a not in dataset.available_classes]\n            if len(bad_apps) > 0:\n                raise ValueError(f\"Bad applications in apps_selection_background_unknown {bad_apps}. Use applications available in dataset.available_classes\")\n            if self.apps_selection_topx != 0 or len(self.apps_selection_fixed_known) > 0 or len(self.apps_selection_fixed_unknown) > 0:\n                raise ValueError(\"apps_selection_topx, apps_selection_fixed_known, and apps_selection_fixed_unknown cannot be specified when application selection is background-unknown\")\n        if self.apps_selection == AppSelection.FIXED:\n            if len(self.apps_selection_fixed_known) == 0:\n                raise ValueError(\"apps_selection_fixed_known has to be specified when application selection is fixed\")\n            bad_apps = [a for a in self.apps_selection_fixed_known + self.apps_selection_fixed_unknown if a not in dataset.available_classes]\n            if len(bad_apps) > 0:\n                raise ValueError(f\"Bad applications in apps_selection_fixed_known or apps_selection_fixed_unknown {bad_apps}. 
Use applications available in dataset.available_classes\")\n            if len(self.disabled_apps) > 0:\n                raise ValueError(\"disabled_apps cannot be specified when application selection is fixed\")\n            if self.min_train_samples_per_app != 0 and self.min_train_samples_per_app != 100:\n                warnings.warn(\"min_train_samples_per_app is not used when application selection is fixed\")\n            if self.apps_selection_topx != 0 or len(self.apps_selection_background_unknown) > 0:\n                raise ValueError(\"apps_selection_topx and apps_selection_background_unknown cannot be specified when application selection is fixed\")\n        # More asserts\n        bad_disabled_apps = [a for a in self.disabled_apps if a not in dataset.available_classes]\n        if len(bad_disabled_apps) > 0:\n            raise ValueError(f\"Bad applications in disabled_apps {bad_disabled_apps}. Use applications available in dataset.available_classes\")\n        if isinstance(self.fit_scalers_samples, float) and (self.fit_scalers_samples <= 0 or self.fit_scalers_samples > 1):\n            raise ValueError(\"fit_scalers_samples has to be either float between 0 and 1 (giving the fraction of training samples used for fitting scalers) or an integer\")\n\n    def get_flowstats_features_len(self) -> int:\n        \"\"\"Gets the number of flow statistics features.\"\"\"\n        return len(self.flowstats_features) + len(self.flowstats_features_boolean) + PHIST_BIN_COUNT * len(self.flowstats_features_phist)\n\n    def get_flowstats_feature_names_expanded(self, shorter_names: bool = False) -> list[str]:\n        \"\"\"Gets names of flow statistics features. Packet histograms are expanded into bin features.\"\"\"\n        phist_mapping = {\n            \"PHIST_SRC_SIZES\": [f\"PSIZE_BIN{i}\" for i in range(1, PHIST_BIN_COUNT + 1)],\n            \"PHIST_DST_SIZES\": [f\"PSIZE_BIN{i}_REV\" for i in range(1, PHIST_BIN_COUNT + 1)],\n            \"PHIST_SRC_IPT\": [f\"IPT_BIN{i}\" for i in range(1, PHIST_BIN_COUNT + 1)],\n            \"PHIST_DST_IPT\": [f\"IPT_BIN{i}_REV\" for i in range(1, PHIST_BIN_COUNT + 1)],\n        }\n        short_names_mapping = {\n            \"FLOW_ENDREASON_IDLE\": \"FEND_IDLE\",\n            \"FLOW_ENDREASON_ACTIVE\": \"FEND_ACTIVE\",\n            \"FLOW_ENDREASON_END\": \"FEND_END\",\n            \"FLOW_ENDREASON_OTHER\": \"FEND_OTHER\",\n            \"FLAG_CWR\": \"F_CWR\",\n            \"FLAG_CWR_REV\": \"F_CWR_REV\",\n            \"FLAG_ECE\": \"F_ECE\",\n            \"FLAG_ECE_REV\": \"F_ECE_REV\",\n            \"FLAG_PSH_REV\": \"F_PSH_REV\",\n            \"FLAG_RST\": \"F_RST\",\n            \"FLAG_RST_REV\": \"F_RST_REV\",\n            \"FLAG_FIN\": \"F_FIN\",\n            \"FLAG_FIN_REV\": \"F_FIN_REV\",\n        }\n        feature_names = self.flowstats_features[:]\n        for f in self.flowstats_features_boolean:\n            if shorter_names and f in short_names_mapping:\n                feature_names.append(short_names_mapping[f])\n            else:\n                feature_names.append(f)\n        for f in self.flowstats_features_phist:\n            feature_names.extend(phist_mapping[f])\n        assert len(feature_names) == self.get_flowstats_features_len()\n        return feature_names\n\n    def get_ppi_feature_names(self) -> list[str]:\n        \"\"\"Gets the names of flattened PPI features.\"\"\"\n        ppi_feature_names = [f\"IPT_{i}\" for i in range(1, PPI_MAX_LEN + 1)] + \\\n                               [f\"DIR_{i}\" for i in range(1, 
PPI_MAX_LEN + 1)] + \\\n                               [f\"SIZE_{i}\" for i in range(1, PPI_MAX_LEN + 1)]\n        if self.use_push_flags:\n            ppi_feature_names += [f\"PUSH_{i}\" for i in range(1, PPI_MAX_LEN + 1)]\n        return ppi_feature_names\n\n    def get_ppi_channels(self) -> list[int]:\n        \"\"\"Gets the available features (channels) in PPI sequences.\"\"\"\n        if self.use_push_flags:\n            return TCP_PPI_CHANNELS\n        else:\n            return UDP_PPI_CHANNELS\n\n    def get_feature_names(self, flatten_ppi: bool = False, shorter_names: bool = False) -> list[str]:\n        \"\"\"\n        Gets feature names.\n\n        Parameters:\n            flatten_ppi: Whether to flatten PPI into individual feature names or keep one `PPI` column.\n        \"\"\"\n        feature_names = self.get_ppi_feature_names() if flatten_ppi else [\"PPI\"]\n        feature_names += self.get_flowstats_feature_names_expanded(shorter_names=shorter_names)\n        return feature_names\n\n    def _get_train_tables_paths(self) -> list[str]:\n        return list(map(lambda t: f\"/flows/D{t}\", self.train_dates))\n\n    def _get_val_tables_paths(self) -> list[str]:\n        if self.val_approach == ValidationApproach.SPLIT_FROM_TRAIN:\n            return self._get_train_tables_paths()\n        return list(map(lambda t: f\"/flows/D{t}\", self.val_dates))\n\n    def _get_test_tables_paths(self) -> list[str]:\n        return list(map(lambda t: f\"/flows/D{t}\", self.test_dates))\n\n    def _get_train_data_hash(self) -> str:\n        train_data_params = self._get_train_data_params()\n        params_hash = hashlib.sha256(json.dumps(dataclasses.asdict(train_data_params), sort_keys=True, default=str).encode()).hexdigest()\n        params_hash = params_hash[:10]\n        return params_hash\n\n    def _get_train_data_path(self) -> str:\n        if self.need_train_set:\n            params_hash = self._get_train_data_hash()\n            return os.path.join(self.data_root, \"train-data\", f\"{params_hash}_{self.random_state}\", f\"fold_{self.fold_id}\")\n        else:\n            return os.path.join(self.data_root, \"train-data\", \"default\")\n\n    def _get_train_data_params(self) -> TrainDataParams:\n        return TrainDataParams(\n            database_filename=self.database_filename,\n            train_period_name=self.train_period_name,\n            train_tables_paths=self._get_train_tables_paths(),\n            apps_selection=self.apps_selection,\n            apps_selection_topx=self.apps_selection_topx,\n            apps_selection_background_unknown=self.apps_selection_background_unknown,\n            apps_selection_fixed_known=self.apps_selection_fixed_known,\n            apps_selection_fixed_unknown=self.apps_selection_fixed_unknown,\n            disabled_apps=self.disabled_apps,\n            min_train_samples_per_app=self.min_train_samples_per_app,\n            min_train_samples_check=self.min_train_samples_check,)\n\n    def _get_val_data_params_and_path(self, known_apps: list[str], unknown_apps: list[str]) -> tuple[TestDataParams, str]:\n        assert self.val_approach == ValidationApproach.VALIDATION_DATES\n        val_data_params = TestDataParams(\n            database_filename=self.database_filename,\n            test_period_name=self.val_period_name,\n            test_tables_paths=self._get_val_tables_paths(),\n            known_apps=known_apps,\n            unknown_apps=unknown_apps,)\n        params_hash = hashlib.sha256(json.dumps(dataclasses.asdict(val_data_params), 
sort_keys=True).encode()).hexdigest()\n        params_hash = params_hash[:10]\n        val_data_path = os.path.join(self.data_root, \"val-data\", f\"{params_hash}_{self.random_state}\")\n        return val_data_params, val_data_path\n\n    def _get_test_data_params_and_path(self, known_apps: list[str], unknown_apps: list[str]) -> tuple[TestDataParams, str]:\n        test_data_params = TestDataParams(\n            database_filename=self.database_filename,\n            test_period_name=self.test_period_name,\n            test_tables_paths=self._get_test_tables_paths(),\n            known_apps=known_apps,\n            unknown_apps=unknown_apps,)\n        params_hash = hashlib.sha256(json.dumps(dataclasses.asdict(test_data_params), sort_keys=True).encode()).hexdigest()\n        params_hash = params_hash[:10]\n        test_data_path = os.path.join(self.data_root, \"test-data\", f\"{params_hash}_{self.random_state}\")\n        return test_data_params, test_data_path\n\n    @model_validator(mode=\"before\") # type: ignore\n    @classmethod\n    def check_deprecated_args(cls, values):\n        kwargs = values.kwargs\n        if \"train_period\" in kwargs:\n            warnings.warn(\"train_period is deprecated. Use train_period_name instead.\")\n            kwargs[\"train_period_name\"] = kwargs[\"train_period\"]\n        if \"val_period\" in kwargs:\n            warnings.warn(\"val_period is deprecated. Use val_period_name instead.\")\n            kwargs[\"val_period_name\"] = kwargs[\"val_period\"]\n        if \"test_period\" in kwargs:\n            warnings.warn(\"test_period is deprecated. Use test_period_name instead.\")\n            kwargs[\"test_period_name\"] = kwargs[\"test_period\"]\n        return values\n\n    def __str__(self):\n        _process_tag = yaml.emitter.Emitter.process_tag\n        _ignore_aliases = yaml.Dumper.ignore_aliases\n        yaml.emitter.Emitter.process_tag = lambda self, *args, **kw: None\n        yaml.Dumper.ignore_aliases = lambda self, *args, **kw: True\n        s = yaml.dump(dataclasses.asdict(self), sort_keys=False)\n        yaml.emitter.Emitter.process_tag = _process_tag\n        yaml.Dumper.ignore_aliases = _ignore_aliases\n        return s\n
"},{"location":"reference_dataset_config/#config.DatasetConfig-functions","title":"Functions","text":""},{"location":"reference_dataset_config/#config.DatasetConfig.get_flowstats_features_len","title":"get_flowstats_features_len","text":"
get_flowstats_features_len() -> int\n

Gets the number of flow statistics features.

Source code in cesnet_datazoo\\config.py
def get_flowstats_features_len(self) -> int:\n    \"\"\"Gets the number of flow statistics features.\"\"\"\n    return len(self.flowstats_features) + len(self.flowstats_features_boolean) + PHIST_BIN_COUNT * len(self.flowstats_features_phist)\n
"},{"location":"reference_dataset_config/#config.DatasetConfig.get_flowstats_feature_names_expanded","title":"get_flowstats_feature_names_expanded","text":"
get_flowstats_feature_names_expanded(\n    shorter_names: bool = False,\n) -> list[str]\n

Gets names of flow statistics features. Packet histograms are expanded into bin features.

Source code in cesnet_datazoo\\config.py
def get_flowstats_feature_names_expanded(self, shorter_names: bool = False) -> list[str]:\n    \"\"\"Gets names of flow statistics features. Packet histograms are expanded into bin features.\"\"\"\n    phist_mapping = {\n        \"PHIST_SRC_SIZES\": [f\"PSIZE_BIN{i}\" for i in range(1, PHIST_BIN_COUNT + 1)],\n        \"PHIST_DST_SIZES\": [f\"PSIZE_BIN{i}_REV\" for i in range(1, PHIST_BIN_COUNT + 1)],\n        \"PHIST_SRC_IPT\": [f\"IPT_BIN{i}\" for i in range(1, PHIST_BIN_COUNT + 1)],\n        \"PHIST_DST_IPT\": [f\"IPT_BIN{i}_REV\" for i in range(1, PHIST_BIN_COUNT + 1)],\n    }\n    short_names_mapping = {\n        \"FLOW_ENDREASON_IDLE\": \"FEND_IDLE\",\n        \"FLOW_ENDREASON_ACTIVE\": \"FEND_ACTIVE\",\n        \"FLOW_ENDREASON_END\": \"FEND_END\",\n        \"FLOW_ENDREASON_OTHER\": \"FEND_OTHER\",\n        \"FLAG_CWR\": \"F_CWR\",\n        \"FLAG_CWR_REV\": \"F_CWR_REV\",\n        \"FLAG_ECE\": \"F_ECE\",\n        \"FLAG_ECE_REV\": \"F_ECE_REV\",\n        \"FLAG_PSH_REV\": \"F_PSH_REV\",\n        \"FLAG_RST\": \"F_RST\",\n        \"FLAG_RST_REV\": \"F_RST_REV\",\n        \"FLAG_FIN\": \"F_FIN\",\n        \"FLAG_FIN_REV\": \"F_FIN_REV\",\n    }\n    feature_names = self.flowstats_features[:]\n    for f in self.flowstats_features_boolean:\n        if shorter_names and f in short_names_mapping:\n            feature_names.append(short_names_mapping[f])\n        else:\n            feature_names.append(f)\n    for f in self.flowstats_features_phist:\n        feature_names.extend(phist_mapping[f])\n    assert len(feature_names) == self.get_flowstats_features_len()\n    return feature_names\n
"},{"location":"reference_dataset_config/#config.DatasetConfig.get_ppi_feature_names","title":"get_ppi_feature_names","text":"
get_ppi_feature_names() -> list[str]\n

Gets the names of flattened PPI features.

Source code in cesnet_datazoo\\config.py
def get_ppi_feature_names(self) -> list[str]:\n    \"\"\"Gets the names of flattened PPI features.\"\"\"\n    ppi_feature_names = [f\"IPT_{i}\" for i in range(1, PPI_MAX_LEN + 1)] + \\\n                           [f\"DIR_{i}\" for i in range(1, PPI_MAX_LEN + 1)] + \\\n                           [f\"SIZE_{i}\" for i in range(1, PPI_MAX_LEN + 1)]\n    if self.use_push_flags:\n        ppi_feature_names += [f\"PUSH_{i}\" for i in range(1, PPI_MAX_LEN + 1)]\n    return ppi_feature_names\n
"},{"location":"reference_dataset_config/#config.DatasetConfig.get_ppi_channels","title":"get_ppi_channels","text":"
get_ppi_channels() -> list[int]\n

Gets the available features (channels) in PPI sequences.

Source code in cesnet_datazoo\\config.py
def get_ppi_channels(self) -> list[int]:\n    \"\"\"Gets the available features (channels) in PPI sequences.\"\"\"\n    if self.use_push_flags:\n        return TCP_PPI_CHANNELS\n    else:\n        return UDP_PPI_CHANNELS\n
"},{"location":"reference_dataset_config/#config.DatasetConfig.get_feature_names","title":"get_feature_names","text":"
get_feature_names(\n    flatten_ppi: bool = False, shorter_names: bool = False\n) -> list[str]\n

Gets feature names.

Parameters:

Name Type Description Default flatten_ppi bool

Whether to flatten PPI into individual feature names or keep one PPI column.

False Source code in cesnet_datazoo\\config.py
def get_feature_names(self, flatten_ppi: bool = False, shorter_names: bool = False) -> list[str]:\n    \"\"\"\n    Gets feature names.\n\n    Parameters:\n        flatten_ppi: Whether to flatten PPI into individual feature names or keep one `PPI` column.\n    \"\"\"\n    feature_names = self.get_ppi_feature_names() if flatten_ppi else [\"PPI\"]\n    feature_names += self.get_flowstats_feature_names_expanded(shorter_names=shorter_names)\n    return feature_names\n
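A brief usage sketch: given an already-initialized DatasetConfig (the config variable below is an assumption), the flattened names can be used, for example, to label columns of flattened feature arrays.

# `config` is assumed to be an initialized DatasetConfig instance.
flat_names = config.get_feature_names(flatten_ppi=True, shorter_names=True)
# With flatten_ppi=True, the PPI part expands into IPT_*, DIR_*, and SIZE_* names,
# followed by the expanded flow statistics feature names.
print(len(flat_names), flat_names[:5])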
"},{"location":"reference_dataset_config/#enums-for-configuration","title":"Enums for configuration","text":"

The following enums are used for dataset configuration.

"},{"location":"reference_dataset_config/#config.ValidationApproach","title":"config.ValidationApproach","text":"

The validation approach defines which samples should be used for creating a validation set.

SPLIT_FROM_TRAIN class-attribute instance-attribute
SPLIT_FROM_TRAIN = 'split-from-train'\n

Split train data into train and validation. Scikit-learn train_test_split is used to create a random stratified validation set. The fraction of validation samples is defined in train_val_split_fraction.

VALIDATION_DATES class-attribute instance-attribute
VALIDATION_DATES = 'validation-dates'\n

Use separate validation dates to create a validation set. Validation dates need to be specified in val_dates, and the name of the validation period in val_period_name.
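A minimal sketch contrasting the two approaches; the dataset instance, period names, and split fraction are illustrative assumptions.

from cesnet_datazoo.config import DatasetConfig, ValidationApproach

# Split-from-train: no validation dates, only the split fraction is configured.
config_split = DatasetConfig(
    dataset=dataset,  # an existing CesnetDataset instance (assumption)
    val_approach=ValidationApproach.SPLIT_FROM_TRAIN,
    train_val_split_fraction=0.1,
)

# Validation-dates: a separate validation period, checked against dataset.available_dates.
config_val_dates = DatasetConfig(
    dataset=dataset,
    val_approach=ValidationApproach.VALIDATION_DATES,
    train_period_name="W-2022-44",
    val_period_name="W-2022-45",
    test_period_name="W-2022-46",
)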

"},{"location":"reference_dataset_config/#config.AppSelection","title":"config.AppSelection","text":"

Applications can be divided into known and unknown classes. To use a dataset in the standard closed-world setting, use ALL_KNOWN to select all applications as known. Use TOPX_KNOWN or BACKGROUND_UNKNOWN for the open-world setting and for the evaluation of out-of-distribution or open-set recognition methods. The FIXED option is for manual selection of known and unknown applications.

ALL_KNOWN class-attribute instance-attribute
ALL_KNOWN = 'all-known'\n

Use all applications as known.

TOPX_KNOWN class-attribute instance-attribute
TOPX_KNOWN = 'topx-known'\n

Use the first X (apps_selection_topx) most frequent (with the most samples) applications as known, and the rest as unknown. Applications with the same provider are never separated, i.e., all applications of a given provider are either known or unknown.

BACKGROUND_UNKNOWN class-attribute instance-attribute
BACKGROUND_UNKNOWN = 'background-unknown'\n

Use the list of background traffic classes (apps_selection_background_unknown) as unknown, and the rest as known.

FIXED class-attribute instance-attribute
FIXED = 'fixed'\n

Manual application selection. Provide lists of known applications (apps_selection_fixed_known) and unknown applications (apps_selection_fixed_unknown).
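A hedged sketch of two of the selections described above; the dataset instance is assumed to exist, and the class names in the fixed lists are hypothetical placeholders.

from cesnet_datazoo.config import AppSelection, DatasetConfig

# Top-X known: the 100 most frequent applications become known, the rest unknown.
config_topx = DatasetConfig(
    dataset=dataset,  # an existing CesnetDataset instance (assumption)
    apps_selection=AppSelection.TOPX_KNOWN,
    apps_selection_topx=100,
    test_unknown_size=10_000,  # also request unknown-class samples for open-world evaluation
)

# Fixed selection: known and unknown applications are listed manually.
config_fixed = DatasetConfig(
    dataset=dataset,
    apps_selection=AppSelection.FIXED,
    apps_selection_fixed_known=["app-a", "app-b"],  # hypothetical class names
    apps_selection_fixed_unknown=["app-c"],         # hypothetical class names
)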

"},{"location":"reference_dataset_config/#config.MinTrainSamplesCheck","title":"config.MinTrainSamplesCheck","text":"

Depending on the selected train dates, there might be applications without enough samples for training (what counts as enough depends on the selected classification model). The threshold for the minimum number of samples can be set with min_train_samples_per_app; its default value is 100. With the DISABLE_APPS approach, these applications will be disabled and not used for training or testing. With the WARN_AND_EXIT approach, the script will print a warning and exit if applications with not enough samples are encountered. To disable this check, set min_train_samples_per_app to 0.

WARN_AND_EXIT class-attribute instance-attribute
WARN_AND_EXIT = 'warn-and-exit'\n

Warn and exit if there are not enough training samples for some applications. It is up to the user to manually add these applications to disabled_apps.

DISABLE_APPS class-attribute instance-attribute
DISABLE_APPS = 'disable-apps'\n

Disable applications with not enough training samples.
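For illustration, a stricter check could be configured as follows; the threshold value is an assumption.

from cesnet_datazoo.config import DatasetConfig, MinTrainSamplesCheck

config = DatasetConfig(
    dataset=dataset,  # an existing CesnetDataset instance (assumption)
    min_train_samples_per_app=500,  # stricter than the default 100
    min_train_samples_check=MinTrainSamplesCheck.WARN_AND_EXIT,
)
# With WARN_AND_EXIT, underrepresented applications have to be added to disabled_apps manually.
# Setting min_train_samples_per_app=0 disables the check entirely.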

"},{"location":"reference_dataset_config/#config.DataLoaderOrder","title":"config.DataLoaderOrder","text":"

Validation and test sets are always loaded in sequential order, i.e., in the order of dates and time. However, the train set sometimes needs to be iterated in random order (for example, when training a neural network). Use RANDOM if your classification model requires it; SEQUENTIAL otherwise. This setting affects only train_dataloader; the dataframe returned from get_train_df is always created in sequential order.

RANDOM class-attribute instance-attribute
RANDOM = 'random'\n

Iterate train data in random order.

SEQUENTIAL class-attribute instance-attribute
SEQUENTIAL = 'sequential'\n

Iterate train data in sequential (datetime) order.
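A short sketch of both orders; the dataset instance and seed value are illustrative assumptions.

from cesnet_datazoo.config import DataLoaderOrder, DatasetConfig

# Random order with a fixed seed, e.g. for training a neural network.
config_random = DatasetConfig(
    dataset=dataset,  # an existing CesnetDataset instance (assumption)
    train_dataloader_order=DataLoaderOrder.RANDOM,
    train_dataloader_seed=42,
)

# Sequential (datetime) order, e.g. for models that do not depend on batch order.
config_sequential = DatasetConfig(
    dataset=dataset,
    train_dataloader_order=DataLoaderOrder.SEQUENTIAL,
)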

"},{"location":"reference_datasets/","title":"Dataset classes","text":"

These are subclasses of CesnetDataset representing individual datasets available in cesnet-datazoo.

"},{"location":"reference_datasets/#datasets.datasets.CESNET_TLS22","title":"datasets.datasets.CESNET_TLS22","text":"

Bases: CesnetDataset

Dataset class for CESNET-TLS22.

Source code in cesnet_datazoo\\datasets\\datasets.py
class CESNET_TLS22(CesnetDataset):\n    \"\"\"Dataset class for [CESNET-TLS22][cesnet-tls22].\"\"\"\n    name = \"CESNET-TLS22\"\n    database_filename = \"CESNET-TLS22.h5\"\n    bucket_url = \"https://liberouter.org/datazoo/download?bucket=cesnet-tls22\"\n    available_dates = _CESNET_TLS22_AVAILABLE_DATES\n    time_periods = {\n        \"W-2021-40\": [\"20211004\", \"20211005\", \"20211006\", \"20211007\", \"20211008\", \"20211009\", \"20211010\"],\n        \"W-2021-41\": [\"20211011\", \"20211012\", \"20211013\", \"20211014\", \"20211015\", \"20211016\", \"20211017\"],\n    }\n    default_train_period_name = \"W-2021-40\"\n    default_test_period_name = \"W-2021-41\"\n    _tables_app_enum = _CESNET_TLS22_TABLES_APP_ENUM\n    _tables_cat_enum = _CESNET_TLS22_TABLES_CATEGORY_ENUM\n
"},{"location":"reference_datasets/#datasets.datasets.CESNET_QUIC22","title":"datasets.datasets.CESNET_QUIC22","text":"

Bases: CesnetDataset

Dataset class for CESNET-QUIC22.

Source code in cesnet_datazoo\\datasets\\datasets.py
class CESNET_QUIC22(CesnetDataset):\n    \"\"\"Dataset class for [CESNET-QUIC22][cesnet-quic22].\"\"\"\n    name = \"CESNET-QUIC22\"\n    database_filename = \"CESNET-QUIC22.h5\"\n    bucket_url = \"https://liberouter.org/datazoo/download?bucket=cesnet-quic22\"\n    available_dates = _CESNET_QUIC22_AVAILABLE_DATES\n    time_periods = {\n        \"W-2022-44\": [\"20221031\", \"20221101\", \"20221102\", \"20221103\", \"20221104\", \"20221105\", \"20221106\"],\n        \"W-2022-45\": [\"20221107\", \"20221108\", \"20221109\", \"20221110\", \"20221111\", \"20221112\", \"20221113\"],\n        \"W-2022-46\": [\"20221114\", \"20221115\", \"20221116\", \"20221117\", \"20221118\", \"20221119\", \"20221120\"],\n        \"W-2022-47\": [\"20221121\", \"20221122\", \"20221123\", \"20221124\", \"20221125\", \"20221126\", \"20221127\"],\n        \"W45-47\": [\"20221107\", \"20221108\", \"20221109\", \"20221110\", \"20221111\", \"20221112\", \"20221113\",\n                   \"20221114\", \"20221115\", \"20221116\", \"20221117\", \"20221118\", \"20221119\", \"20221120\",\n                   \"20221121\", \"20221122\", \"20221123\", \"20221124\", \"20221125\", \"20221126\", \"20221127\"],\n    }\n    default_train_period_name = \"W-2022-44\"\n    default_test_period_name = \"W-2022-45\"\n    _tables_app_enum = _CESNET_QUIC22_TABLES_APP_ENUM\n    _tables_cat_enum = _CESNET_QUIC22_TABLES_CATEGORY_ENUM\n
"},{"location":"reference_datasets/#datasets.datasets.CESNET_TLS_Year22","title":"datasets.datasets.CESNET_TLS_Year22","text":"

Bases: CesnetDataset

Dataset class for CESNET-TLS-Year22.

Source code in cesnet_datazoo\\datasets\\datasets.py
class CESNET_TLS_Year22(CesnetDataset):\n    \"\"\"Dataset class for [CESNET-TLS-Year22][cesnet-tls-year22].\"\"\"\n    name = \"CESNET-TLS-Year22\"\n    database_filename = \"CESNET-TLS-Year22.h5\"\n    bucket_url = \"https://liberouter.org/datazoo/download?bucket=cesnet-tls-year22\"\n    available_dates = _CESNET_TLS_YEAR22_AVAILABLE_DATES\n    time_periods = _CESNET_TLS_YEAR22_TIME_PERIODS\n    default_train_period_name = \"M-2022-9\"\n    default_test_period_name = \"M-2022-10\"\n    _tables_app_enum = _CESNET_TLS_YEAR22_TABLES_APP_ENUM\n    _tables_cat_enum = _CESNET_TLS_YEAR22_TABLES_CATEGORY_ENUM\n
"},{"location":"transforms/","title":"Transforms","text":"

The cesnet_datazoo package supports configurable transforms of input data, similar to what torchvision provides for the computer vision field. Input features are split into three groups, each with its own transformation: PPI sequences, flow statistics, and packet histograms.

Transforms are implemented in a separate package CESNET Models. See cesnet_models.transforms documentation for details.

Limitations

The current implementation does not support composing transformations.
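One workaround, since the transform options on DatasetConfig are plain callables, is to wrap several steps in a single function. The sketch below assumes that a PPI transform receives and returns a numpy array of PPI sequences; the ready-made, fittable transforms live in cesnet_models.transforms.

import numpy as np

from cesnet_datazoo.config import DatasetConfig

def my_ppi_transform(ppi: np.ndarray) -> np.ndarray:
    # Assumption about the call convention: the transform gets a numpy array of
    # PPI sequences and must return an array of the same shape. Here the values
    # are only cast to float32; several steps could be chained inside this function.
    return ppi.astype(np.float32)

config = DatasetConfig(
    dataset=dataset,  # an existing CesnetDataset instance (assumption)
    ppi_transform=my_ppi_transform,
)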

"},{"location":"transforms/#available-transformations","title":"Available transformations","text":"

PPI sequences

Flow statistics

Packet histograms

More transformations will be implemented in future versions.

"},{"location":"transforms/#data-scaling","title":"Data scaling","text":"

Transformations implementing data scaling will be fitted, if needed, on a subset of training data during dataset initialization.
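Related to this, the fit_scalers_samples option of DatasetConfig controls how much train data is used for the fitting. A minimal sketch; the concrete scaling transform is omitted because the available classes are documented in cesnet_models.transforms.

from cesnet_datazoo.config import DatasetConfig

config = DatasetConfig(
    dataset=dataset,            # an existing CesnetDataset instance (assumption)
    # flowstats_transform=...,  # a fittable scaling transform from cesnet_models.transforms
    fit_scalers_samples=0.1,    # a float is interpreted as a fraction of the train samples
)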

"}]} \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index 7230241..78fd93c 100755 Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ