# AD_QAOA dataset preprocessing functions (AD_detection)

This file contains the functions used to preprocess the datasets that are analyzed with the `AD_QAOA` class.

### generate dataset

Generates a dataset containing both normal and outlier samples based on specified distributions and parameters. Normal and outlier samples are shuffled to randomize order, and timestamps are assigned sequentially.

`generate_dataset(normal_sample_type, normal_sample_params, outlier_sample_type, outlier_sample_params)`

#### Args

* `normal_sample_type` (str): The distribution type for normal samples. Options are 'uniform', 'normal', 'exponential', or 'poisson'.
* `normal_sample_params` (dict): Parameters for the normal sample distribution, passed as keyword arguments to the corresponding numpy random function.
* `outlier_sample_type` (str): The distribution type for outlier samples. Options are 'uniform', 'normal', 'exponential', or 'poisson'.
* `outlier_sample_params` (dict): Parameters for the outlier sample distribution, passed as keyword arguments to the corresponding numpy random function.

#### Returns

* `dataset` (list of tuples): A list of tuples, where each tuple contains a timestamp and a value. Timestamps are generated sequentially.
* `outlier_indices` (set): A set of indices representing the positions of the outlier samples in the dataset.

#### Raises

* `ValueError`: If either `normal_sample_type` or `outlier_sample_type` is unsupported.
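The behaviour described above can be illustrated with a minimal sketch. It is not the library's implementation: the name `generate_dataset_sketch`, the `seed` argument, timestamps starting at 0, and the convention that the sample count is supplied through a `size` key in each parameter dict are all assumptions made for this example.

```python
import numpy as np

def generate_dataset_sketch(normal_sample_type, normal_sample_params,
                            outlier_sample_type, outlier_sample_params,
                            seed=None):
    """Hypothetical re-creation of the documented behaviour."""
    rng = np.random.default_rng(seed)  # `seed` is an assumption, for reproducibility
    samplers = {
        "uniform": rng.uniform,          # kwargs: low, high, size
        "normal": rng.normal,            # kwargs: loc, scale, size
        "exponential": rng.exponential,  # kwargs: scale, size
        "poisson": rng.poisson,          # kwargs: lam, size
    }
    for arg_name, kind in (("normal_sample_type", normal_sample_type),
                           ("outlier_sample_type", outlier_sample_type)):
        if kind not in samplers:
            raise ValueError(f"Unsupported {arg_name}: {kind!r}")

    normal = np.atleast_1d(samplers[normal_sample_type](**normal_sample_params))
    outliers = np.atleast_1d(samplers[outlier_sample_type](**outlier_sample_params))

    # Tag every value so outliers can still be identified after shuffling.
    tagged = [(float(v), False) for v in normal] + [(float(v), True) for v in outliers]
    shuffled = [tagged[i] for i in rng.permutation(len(tagged))]

    # Assign sequential timestamps (assumed to start at 0) after the shuffle.
    dataset = [(t, value) for t, (value, _) in enumerate(shuffled)]
    outlier_indices = {t for t, (_, is_outlier) in enumerate(shuffled) if is_outlier}
    return dataset, outlier_indices

dataset, outlier_indices = generate_dataset_sketch(
    "normal", {"loc": 5, "scale": 1, "size": 18},
    "uniform", {"low": 20, "high": 30, "size": 2},
    seed=0,
)
```

With these parameters the result holds 20 samples, two of which are flagged as outliers by their position in the shuffled list.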

### scale dataset

Scales the values in the dataset to the range `[new_min, new_max]`, maintaining the relative proportions of the original values.

`scale_dataset(dataset, new_min=1, new_max=10)`

#### Args

* `dataset` (list of tuples): A list of (timestamp, value) pairs representing the time series.
* `new_min` (float, optional): The minimum value of the scaled range. Default is 1.
* `new_max` (float, optional): The maximum value of the scaled range. Default is 10.

#### Returns

* `scaled_dataset` (list of tuples): A list of (timestamp, scaled_value) pairs, where each value has been scaled to the specified range.
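One common way to scale values into a target range while keeping their relative spacing is linear min-max scaling. The sketch below assumes that this is the intended technique. The name `scale_dataset_sketch` and the handling of a constant dataset are choices made for this example, not documented behaviour.

```python
def scale_dataset_sketch(dataset, new_min=1, new_max=10):
    """Hypothetical min-max scaling of (timestamp, value) pairs."""
    values = [value for _, value in dataset]
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:
        # Assumption: a constant series maps every value to new_min.
        return [(t, float(new_min)) for t, _ in dataset]
    # Linear map: old_min -> new_min, old_max -> new_max.
    return [
        (t, new_min + (value - old_min) * (new_max - new_min) / span)
        for t, value in dataset
    ]

scaled = scale_dataset_sketch([(0, 0), (1, 5), (2, 10)])
# scaled == [(0, 1.0), (1, 5.5), (2, 10.0)]
```

Timestamps pass through unchanged; only the second element of each pair is rescaled.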

### load dataset from csv

Loads a dataset from a CSV file as (timestamp, value) pairs and returns it along with the original time values, so they can be used or displayed if needed.

`load_dataset_from_csv(file_path: str, time_column: str, value_column: str)`

#### Args

* `file_path` (str): The path to the CSV file.
* `time_column` (str): The name of the column containing time data.
* `value_column` (str): The name of the column containing value data.

#### Returns

* `dataset` (list of tuples): A list of (timestamp, value) pairs with sequentially generated timestamps.
* `original_times` (np.ndarray): An array of the original time values read from `time_column`.
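As a rough illustration of the loading step described at the top of this section, the sketch below reads a CSV file with Python's standard `csv` module, pairs each value with a sequential timestamp, and keeps the original time values alongside. The name `load_dataset_from_csv_sketch`, the use of the standard library, timestamps starting at 0, and returning the times as a plain list of strings are assumptions for this example.

```python
import csv

def load_dataset_from_csv_sketch(file_path, time_column, value_column):
    """Hypothetical loader: sequential timestamps plus the original times."""
    dataset = []
    original_times = []
    with open(file_path, newline="") as handle:
        for index, row in enumerate(csv.DictReader(handle)):
            # Keep the original time value exactly as it appears in the file.
            original_times.append(row[time_column])
            # Replace it with a sequential timestamp (assumed to start at 0).
            dataset.append((index, float(row[value_column])))
    return dataset, original_times
```

Keeping the original times separate lets the dataset use compact integer timestamps while plots and reports can still show the real time axis.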

### load partial dataset from csv

Loads a portion of the dataset from a CSV file and renumbers timestamps from 0 to (end - start).

`load_partial_dataset_from_csv(file_path, time_column, value_column, start, end)`

#### Args

* `file_path` (str): Path to the CSV file.
* `time_column` (str): Name of the column containing timestamp data.
* `value_column` (str): Name of the column containing value data.
* `start` (int): Starting index of the data range to load.
* `end` (int): Ending index of the data range to load.


#### Returns

* `dataset` (list of tuples): A list of (timestamp, value) pairs with renumbered timestamps.
* `original_times` (np.ndarray): An array of original time values from the selected range.
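The range selection and renumbering can be sketched as follows. This example assumes that `start` is inclusive and `end` is exclusive, as in Python slicing, which the description above does not state. The name `load_partial_dataset_from_csv_sketch` and the use of the standard `csv` module are likewise assumptions.

```python
import csv
from itertools import islice

def load_partial_dataset_from_csv_sketch(file_path, time_column, value_column,
                                         start, end):
    """Hypothetical partial loader for data rows [start, end)."""
    with open(file_path, newline="") as handle:
        # islice skips the first `start` data rows and stops before row `end`.
        rows = list(islice(csv.DictReader(handle), start, end))
    original_times = [row[time_column] for row in rows]
    # Renumber timestamps from 0 within the selected range.
    dataset = [(new_t, float(row[value_column])) for new_t, row in enumerate(rows)]
    return dataset, original_times
```

Under this assumption, `start=1, end=4` selects three rows, which receive the timestamps 0, 1, and 2.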

### split dataset with best batch size

Tests each available batch size against the dataset, given the desired overlap between consecutive batches, to find the size that splits the dataset in the most balanced way, then performs the split with that size.

#### Args

* `dataset` (list of tuples): The dataset to split, represented as a list of (timestamp, value) pairs.
* `overlap` (int, optional): The number of overlapping samples between consecutive batches. Default is 2.
* `batch_sizes` (list of int, optional): List of possible batch sizes to test. Default is [7, 8, 9, 10].

#### Returns

* `best_batches` (list of lists): A list of batches, where each batch is a list of (timestamp, value) pairs.
* `best_batch_size` (int): The batch size that results in the largest final batch.
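The selection logic can be sketched as below. This is an illustration under stated assumptions, not the library's code: both function names are hypothetical, consecutive batches are assumed to advance by `batch_size - overlap` samples, "largest final batch" is taken to mean the greatest number of samples in the last batch, and ties go to the first candidate tested.

```python
def split_with_overlap(dataset, batch_size, overlap):
    """Hypothetical helper: sliding windows that share `overlap` samples."""
    step = batch_size - overlap
    if step <= 0:
        raise ValueError("batch_size must be larger than overlap")
    batches = []
    for start in range(0, len(dataset), step):
        batches.append(dataset[start:start + batch_size])
        if start + batch_size >= len(dataset):
            break  # this window already reaches the end of the dataset
    return batches

def split_dataset_with_best_batch_size_sketch(dataset, overlap=2,
                                              batch_sizes=(7, 8, 9, 10)):
    """Pick the batch size whose final batch is largest, then split."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    best_batches, best_batch_size = None, None
    for size in batch_sizes:
        batches = split_with_overlap(dataset, size, overlap)
        # Strict comparison: on a tie the earlier batch size is kept.
        if best_batches is None or len(batches[-1]) > len(best_batches[-1]):
            best_batches, best_batch_size = batches, size
    return best_batches, best_batch_size

data = [(t, float(t)) for t in range(20)]
batches, size = split_dataset_with_best_batch_size_sketch(data)
# size == 8: three batches of 8 samples, each sharing 2 with its neighbour
```

For 20 samples with an overlap of 2, a batch size of 8 leaves a full final batch, while sizes 7, 9, and 10 leave final batches of 5, 6, and 4 samples respectively.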