# Benchmarking Experiments for Extended K-Prototypes (Paper Versions)

This notebook presents variations on the experimental structure of the original extension, as presented in the author's undergraduate thesis.

In [None]:
import os
import sys
current_dir = os.getcwd()

# Get the absolute path of the parent directory
parent_dir = os.path.abspath(os.path.join(current_dir, os.pardir))
sys.path.append(parent_dir)

from random import randint
from typing import Any

import pandas as pd

from benchmark_extension import Experiment

Data generation now follows a probabilistic approach for the categorical and multi-valued attributes.  
  
Categorical attributes are sampled from a user-defined distribution of categorical items.  
  
Multi-valued attributes are sampled from a tree of conditional probabilities per cluster, which simulates the way some items of a multi-valued attribute tend to co-occur with others. This is preferred over reusing the approach for categorical variables because it better represents the real-life conditions that practitioners may face.
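
The two sampling schemes described above can be illustrated with a minimal sketch. The tree layout used here (nested dicts with `item`, `prob`, and `children` keys), the helper names, and the reading that one child is chosen per node are all assumptions made for illustration; they are not taken from `benchmark_extension`, whose actual format may differ.

```python
import random

# Hypothetical tree layout (an assumption, not the benchmark_extension format):
# each node holds an 'item', the 'prob' of reaching it from its parent,
# and a list of 'children'. Children probabilities sum to one at every node.
toy_tree = {
    'item': None,  # empty root
    'prob': 1.0,
    'children': [
        {'item': 'a', 'prob': 0.7, 'children': [
            {'item': 'b', 'prob': 0.8, 'children': []},
            {'item': 'c', 'prob': 0.2, 'children': []},
        ]},
        {'item': 'd', 'prob': 0.3, 'children': []},
    ],
}


def sample_categorical(probabilities, rng):
    """Draw one category index from a complete categorical distribution."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]


def sample_multivalued(tree, rng):
    """Walk from the root to a leaf, picking one child per node, and collect the items."""
    items = []
    node = tree
    while node['children']:
        weights = [child['prob'] for child in node['children']]
        node = rng.choices(node['children'], weights=weights, k=1)[0]
        items.append(node['item'])
    return items


rng = random.Random(0)  # seeded for reproducibility
categorical_value = sample_categorical([0.4, 0.4, 0.05, 0.05, 0.05, 0.05], rng)
multival_samples = [sample_multivalued(toy_tree, rng) for _ in range(1000)]
```

Under this reading, item `b` can only appear together with item `a`, which is the kind of co-occurrence that independent per-item sampling would not reproduce.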

In [None]:
sample_configuration = {
    'n_samples': 2000,
    'n_clusters': 3,
    'class_weights': [0.33, 0.33],
    # Numeric Features
    'n_numeric_features': 5,
    'separability': 3.0,
    'noise': 0.01,
    # Categorical Features
    'n_categorical_features': 5,
    'categorical_cardinalities': [6, 6, 6, 6, 6],
    'category_distributions':(
        [
            [0.4, 0.4, 0.05, 0.05, 0.05],
            [0.4, 0.4, 0.05, 0.05, 0.05],
            [0.4, 0.4, 0.05, 0.05, 0.05],
            [0.4, 0.4, 0.05, 0.05, 0.05],
            [0.4, 0.4, 0.05, 0.05, 0.05]],
        [
            [0.05, 0.05, 0.4, 0.4, 0.05],
            [0.05, 0.05, 0.4, 0.4, 0.05],
            [0.05, 0.05, 0.4, 0.4, 0.05],
            [0.05, 0.05, 0.4, 0.4, 0.05],
            [0.05, 0.05, 0.4, 0.4, 0.05]],
        [
            [0.05, 0.05, 0.05, 0.05, 0.4],
            [0.05, 0.05, 0.05, 0.05, 0.4],
            [0.05, 0.05, 0.05, 0.05, 0.4],
            [0.05, 0.05, 0.05, 0.05, 0.4],
            [0.05, 0.05, 0.05, 0.05, 0.4]]
    ),
    # Multi-valued Features
    'n_multival_features': 5,
    'probability_trees': [],
    # Approach Settings
    'approach_settings': {
        'naive': {
            'gamma': None
        },
        'one-hot': {
            'gamma': None,
            'max_dummies': 100
        },
        'one-hot-pca': {
            'gamma': None,
            'reduced_dimensions': 0.25
        },
        'extended': {
            'gamma_c': 0.33,
            'gamma_m': 0.33,
            'theta': 0.001
        }
    }
}

The sample configuration gives each categorical attribute a bi-modal distribution whose modes differ from cluster to cluster. Harder-to-cluster configurations could also be tried, for example ones where some categories are modal in more than one cluster or where some categories are never modal.
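
As a quick check of the bi-modal structure, the sketch below completes one distribution per cluster with the implied last probability and lists the modal categories. The helper names are hypothetical, and the rows are copied from the first attribute of each cluster in the sample configuration.

```python
import math

# One row per cluster, copied from the sample configuration (first attribute)
partial_distributions = {
    'cluster 0': [0.4, 0.4, 0.05, 0.05, 0.05],
    'cluster 1': [0.05, 0.05, 0.4, 0.4, 0.05],
    'cluster 2': [0.05, 0.05, 0.05, 0.05, 0.4],
}


def complete_distribution(partial):
    """Append the probability of the omitted last category."""
    return list(partial) + [1.0 - sum(partial)]


def modal_categories(distribution, tol=1e-9):
    """Return the indices whose probability ties with the maximum, up to a tolerance."""
    top = max(distribution)
    return [i for i, p in enumerate(distribution) if math.isclose(p, top, abs_tol=tol)]


modes_per_cluster = {
    name: modal_categories(complete_distribution(partial))
    for name, partial in partial_distributions.items()
}
# modes_per_cluster == {'cluster 0': [0, 1], 'cluster 1': [2, 3], 'cluster 2': [4, 5]}
```

Note that the second mode of the third cluster is the implied sixth category, whose probability is never written out in the configuration.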

**Input Rules**:  

- Class weights should be of length `n_clusters - 1`. The missing weight will be calculated with `1 - sum(class_weights)`.

- Length of `categorical_cardinalities` should be equal to `n_categorical_features`.

- Length of `category_distributions` should be equal to `n_clusters`. Each item is a list of lists specifying a categorical distribution for each categorical attribute describing the categorical characteristics of the cluster.
  
- Each item in `category_distributions` should include one list per categorical attribute containing the sampling probabilities for all but the last category of that attribute, i.e. one value fewer than the corresponding entry in `categorical_cardinalities`. The probabilities should sum to less than one. The probability of the missing category will be calculated as `1 - sum(distribution)`.
  
- Length of `probability_trees` should be equal to `n_multival_features`.
  
- The probabilities of the children of each node in each `tree` of `probability_trees` should sum to one. This rule does not apply to leaves.
  
- In `approach_settings`, the field `reduced_dimensions` in `one-hot-pca` must be a float in the open interval $(0, 1)$.
  
- In `approach_settings`, the gamma fields (`gamma_c` and `gamma_m`) in `extended` must be floats in the interval $[0, 1)$ and must not sum to more than one.
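
The rules above can be checked before running an experiment. The following is a minimal sketch, not part of `benchmark_extension`: `validate_configuration` and `toy_config` are hypothetical names, and the tree rule is left out because the tree format is not specified here.

```python
def validate_configuration(config):
    """Return a list of rule violations (empty when the configuration passes)."""
    errors = []
    n_clusters = config['n_clusters']
    cardinalities = config['categorical_cardinalities']

    if len(config['class_weights']) != n_clusters - 1:
        errors.append('class_weights must have n_clusters - 1 entries')
    if len(cardinalities) != config['n_categorical_features']:
        errors.append('categorical_cardinalities must have n_categorical_features entries')

    distributions = config['category_distributions']
    if len(distributions) != n_clusters:
        errors.append('category_distributions must have n_clusters entries')
    for c, cluster in enumerate(distributions):
        if len(cluster) != len(cardinalities):
            errors.append(f'cluster {c}: expected one distribution per categorical attribute')
        for a, (dist, cardinality) in enumerate(zip(cluster, cardinalities)):
            if len(dist) != cardinality - 1:
                errors.append(f'cluster {c}, attribute {a}: expected cardinality - 1 probabilities')
            if sum(dist) >= 1:
                errors.append(f'cluster {c}, attribute {a}: probabilities must sum to less than one')

    if len(config['probability_trees']) != config['n_multival_features']:
        errors.append('probability_trees must have n_multival_features entries')

    settings = config['approach_settings']
    reduced = settings['one-hot-pca']['reduced_dimensions']
    if not (isinstance(reduced, float) and 0 < reduced < 1):
        errors.append('reduced_dimensions must be a float in (0, 1)')
    gammas = [settings['extended']['gamma_c'], settings['extended']['gamma_m']]
    if not all(isinstance(g, float) and 0 <= g < 1 for g in gammas):
        errors.append('gamma_c and gamma_m must be floats in [0, 1)')
    elif sum(gammas) > 1:
        errors.append('gamma_c + gamma_m must not exceed one')
    return errors


# Toy configuration (hypothetical) with a deliberately wrong probability_trees length
toy_config = {
    'n_clusters': 2,
    'class_weights': [0.5],
    'n_categorical_features': 1,
    'categorical_cardinalities': [3],
    'category_distributions': ([[0.6, 0.3]], [[0.1, 0.2]]),
    'n_multival_features': 1,
    'probability_trees': [],
    'approach_settings': {
        'one-hot-pca': {'reduced_dimensions': 0.25},
        'extended': {'gamma_c': 0.33, 'gamma_m': 0.33},
    },
}
violations = validate_configuration(toy_config)
```

Returning a list of messages rather than raising on the first failure makes it easier to fix several configuration problems in one pass.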