Example Usage: Create Own Dataset
=================================

## The Iris Dataset

In this example, we demonstrate how to use your own dataset with Metrics As Scores.
Writing code is not required, so we show two ways to achieve this goal:

1. Use the Text-based Command Line User Interface (TUI)
2. Implement the same scenario in code

The latter is useful for programmatic usage, for example when you need to create datasets in batches.

# Using the Text-based Command Line User Interface (TUI)

# Implementing the Creation of Own Dataset in Code

Here, we will go through the following steps:

1. Load the Iris data frame and transform it into the required format.
   1. Transforming the data frame
   2. Creating a manifest
2. Conduct analyses that are required for generating a scientific report.
3. Fit parametric random variables to the data.
4. Generate the densities required for the web application of Metrics As Scores.
5. Finish up.


Please note: The documentation for the code can be found at <https://mrshoenel.github.io/metrics-as-scores/>.

## Importing the Dataset

The well-known Iris dataset is included in `scikit-learn`, so we can load it directly.
It has 4 real-valued features and one label.

In [14]:
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris(as_frame=True)
df = iris.frame

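# Replace the numeric target codes with the actual labels. Assigning the mapped
# column back avoids a chained in-place replace, which newer pandas versions
# warn about (and which may not modify the data frame at all):
df['target'] = df['target'].map(dict(enumerate(iris.target_names)))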

print(df.head(5))

   sepal length (cm)  sepal width (cm)  ...  petal width (cm)  target
0                5.1               3.5  ...               0.2  setosa
1                4.9               3.0  ...               0.2  setosa
2                4.7               3.2  ...               0.2  setosa
3                4.6               3.1  ...               0.2  setosa
4                5.0               3.6  ...               0.2  setosa

[5 rows x 5 columns]


### Transforming the data frame

Metrics As Scores requires a stacked format of the data frame. We take each feature's data, the group, and the name of the feature to produce a new 3-column data frame. After we have done this for all features, we vertically stack these frames. This is implemented as a helper function, so we can just go ahead and make use of it.

In [15]:
from metrics_as_scores.tools.funcs import transform_to_MAS_dataset

df_mas = transform_to_MAS_dataset(df=df, group_col='target', feature_cols=iris.feature_names)

print(df_mas.head(5))

             Feature   Group  Value
0  sepal length (cm)  setosa    5.1
1  sepal length (cm)  setosa    4.9
2  sepal length (cm)  setosa    4.7
3  sepal length (cm)  setosa    4.6
4  sepal length (cm)  setosa    5.0
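
To make the stacked format concrete, the following is a minimal plain-Python sketch of the reshaping described above. It is not the implementation of `transform_to_MAS_dataset`; the helper name `stack_rows` and the two sample rows are made up for illustration, and only the resulting Feature/Group/Value layout mirrors the output shown above.

```python
# Plain-Python illustration of the wide -> stacked ("long") reshaping.
# `stack_rows` is a hypothetical helper, not part of Metrics As Scores.

def stack_rows(rows, group_col, feature_cols):
    """Return one (feature, group, value) tuple per feature and input row."""
    stacked = []
    for feature in feature_cols:      # build one 3-column block per feature ...
        for row in rows:              # ... and stack the blocks vertically
            stacked.append((feature, row[group_col], row[feature]))
    return stacked


# Two sample rows in the original "wide" layout (one column per feature).
wide = [
    {'sepal length (cm)': 5.1, 'sepal width (cm)': 3.5, 'target': 'setosa'},
    {'sepal length (cm)': 4.9, 'sepal width (cm)': 3.0, 'target': 'setosa'},
]

long = stack_rows(
    wide, group_col='target',
    feature_cols=['sepal length (cm)', 'sepal width (cm)'])

for feature, group, value in long:
    print(f'{feature:<18} {group:<7} {value}')
```

Two wide rows with two features each become four stacked rows, one per (feature, observation) pair.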


### Creating a Manifest

Every dataset must be accompanied by a manifest that provides some meta information about it. The manifest is also required for the next steps. When bundling and publishing a dataset, the manifest is a required file.

Metrics As Scores uses a typed dictionary that is then stored as `manifest.json`. Then, a new instance of `Dataset` is created. It is passed the transformed data frame and the manifest.

In [16]:
from metrics_as_scores.distribution.distribution import Dataset, LocalDataset

# Here, we manually create a manifest and fill out the minimum required properties.
# A manifest for a publishable dataset needs to present proper values for all keys.
manifest: LocalDataset = {}
manifest['author'] = ['First A. Author', 'Second Author']
manifest['id'] = 'iris'
manifest['colname_context'] = 'Group'
manifest['colname_data'] = 'Value'
manifest['colname_type'] = 'Feature'
# Note that this is a dictionary, where the keys are the features' (column) names
# and the values are the obligatory descriptions.
manifest['qtypes'] = { fn: f'Description for feature {fn}' for fn in iris.feature_names }
# This is a list of the available groups.
manifest['contexts'] = iris.target_names.tolist()

manifest

{'author': ['First A. Author', 'Second Author'],
 'id': 'iris',
 'colname_context': 'Group',
 'colname_data': 'Value',
 'colname_type': 'Feature',
 'qtypes': {'sepal length (cm)': 'Description for feature sepal length (cm)',
  'sepal width (cm)': 'Description for feature sepal width (cm)',
  'petal length (cm)': 'Description for feature petal length (cm)',
  'petal width (cm)': 'Description for feature petal width (cm)'},
 'contexts': ['setosa', 'versicolor', 'virginica']}
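
Since the manifest is a plain dictionary, it can be written to `manifest.json` with the standard library. The sketch below assumes an abbreviated manifest and writes into a temporary directory; the target folder is a placeholder, and Metrics As Scores may offer its own way of persisting the manifest.

```python
import json
import tempfile
from pathlib import Path

# Abbreviated copy of the manifest from above (only one feature description).
manifest = {
    'author': ['First A. Author', 'Second Author'],
    'id': 'iris',
    'colname_context': 'Group',
    'colname_data': 'Value',
    'colname_type': 'Feature',
    'qtypes': {'sepal length (cm)': 'Description for feature sepal length (cm)'},
    'contexts': ['setosa', 'versicolor', 'virginica'],
}

# Placeholder location: a temporary folder named after the dataset's ID.
dataset_dir = Path(tempfile.mkdtemp()) / manifest['id']
dataset_dir.mkdir(parents=True)
manifest_file = dataset_dir / 'manifest.json'

with open(manifest_file, 'w', encoding='utf-8') as fp:
    json.dump(manifest, fp, indent=2)

# Reading the file back yields an equal dictionary.
with open(manifest_file, 'r', encoding='utf-8') as fp:
    restored = json.load(fp)

print(restored == manifest)
```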

Let's create the actual dataset for Metrics As Scores. The `Dataset` allows us to conduct the required statistical tests effortlessly.

In [17]:
dataset_mas = Dataset(ds=manifest, df=df_mas)
# Just a test:
print(dataset_mas.quantity_types_continuous)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


## Conduct analyses that are required for generating a scientific report

Here, we will conduct the three statistical analyses that are required to generate a report.
The report itself is a Quarto template that can (but does not necessarily have to) be changed.
Having the manifest, the dataset, and the results of the following three analyses is enough.

In [18]:
# Note that we want to do a full analysis and, therefore, pass in all available features
# (called quantity types) and groups (called contexts). We also pass in a virtual "ALL"-
# context, which creates an additional group that contains all data.
test_anova = dataset_mas.analyze_ANOVA(
    qtypes=dataset_mas.quantity_types, contexts=list(dataset_mas.contexts(include_all_contexts=True)))

test_anova.head(5)

100%|██████████| 4/4 [00:00<00:00, 1001.09it/s]


               qtype       stat          pval                      across_contexts
0  sepal length (cm)  44.194514  1.250402e-23  setosa;versicolor;virginica;__ALL__
1   sepal width (cm)  24.727043  2.637466e-14  setosa;versicolor;virginica;__ALL__
2  petal length (cm)  87.738074   1.21977e-40  setosa;versicolor;virginica;__ALL__
3   petal width (cm)  85.564667  6.874646e-40  setosa;versicolor;virginica;__ALL__
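
For readers unfamiliar with the `stat` column, here is a minimal sketch of how a one-way ANOVA F statistic is computed from grouped values. It uses only the standard library and toy numbers; whether `analyze_ANOVA` computes its statistic in exactly this way is an assumption, and the function name `f_statistic` is made up for this example.

```python
from statistics import fmean


def f_statistic(groups):
    """One-way ANOVA F statistic: between-group over within-group variance."""
    all_values = [v for g in groups for v in g]
    grand_mean = fmean(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - fmean(g)) ** 2 for g in groups for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Toy data: three groups whose means are shifted by one unit each.
toy = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
print(f_statistic(toy))
```

A large statistic (with a correspondingly small p-value) indicates that the group means differ more than the spread within the groups would suggest.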


In [19]:
# Tukey's Honestly Significant Difference (HSD) test:
test_tukey = dataset_mas.analyze_TukeyHSD(qtypes=dataset_mas.quantity_types)

test_tukey.head(5)

100%|██████████| 4/4 [00:00<00:00, 2005.64it/s]


    group1      group2  meandiff   p-adj    lower   upper  reject
0  __ALL__      setosa   -0.8373     0.0  -1.1287  -0.546    True
1  __ALL__  versicolor    0.0927  0.8441  -0.1987   0.384   False
2  __ALL__   virginica    0.7447     0.0   0.4533   1.036    True
3   setosa  versicolor      0.93     0.0   0.5732  1.2868    True
4   setosa   virginica     1.582     0.0   1.2252  1.9388    True


In [20]:
# The two-sample Kolmogorov-Smirnov test:
test_ks2 = dataset_mas.analyze_distr(qtypes=dataset_mas.quantity_types, use_ks_2samp=True)

test_ks2.head(5)

100%|██████████| 4/4 [00:00<00:00, 1976.11it/s]


               qtype  stat          pval      group1      group2
0  sepal length (cm)  0.78  2.807571e-15      setosa  versicolor
1  sepal length (cm)  0.92  7.773164e-23      setosa   virginica
2  sepal length (cm)  0.56  2.537645e-11      setosa     __ALL__
3  sepal length (cm)   0.5  4.807534e-06  versicolor   virginica
4  sepal length (cm)  0.24      0.024115  versicolor     __ALL__
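
The `stat` column of a two-sample Kolmogorov-Smirnov test is the largest vertical distance between the two empirical cumulative distribution functions. The following standard-library sketch computes that distance for toy samples; it only illustrates the statistic (no p-value), and the function name `ks_2samp_statistic` is made up here rather than taken from Metrics As Scores.

```python
from bisect import bisect_right


def ks_2samp_statistic(a, b):
    """Largest absolute difference between the ECDFs of two samples."""
    a_sorted, b_sorted = sorted(a), sorted(b)

    def ecdf(sample, x):
        # Fraction of observations in the sorted sample that are <= x.
        return bisect_right(sample, x) / len(sample)

    # Both ECDFs are step functions that only change at observed values,
    # so it suffices to compare them at every observed value.
    return max(
        abs(ecdf(a_sorted, x) - ecdf(b_sorted, x))
        for x in a_sorted + b_sorted)


print(ks_2samp_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

A value of 0 means the two empirical distributions coincide, whereas a value of 1 means the samples do not overlap at all.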


## Fitting of parametric random variables to the data

Fitting random variables is required so that we can inspect and use the parametric fits in the web application.
A dataset that is to be published must contain these fits.
Metrics As Scores can fit roughly $120$ random variables, of which about $20$ are discrete.
The Iris dataset only has real-valued (continuous) features, so we will limit ourselves to continuous random variables.

In [21]:
from typing import Any
from nptyping import Float, NDArray, Shape
from metrics_as_scores.distribution.fitting import FitterPymoo
from metrics_as_scores.distribution.distribution import DistTransform
from metrics_as_scores.data.pregenerate_fit import get_data_tuple
from metrics_as_scores.data.pregenerate_distns import generate_parametric_fits
from scipy.stats._distn_infrastructure import rv_generic
from joblib import Parallel, delayed
from tqdm import tqdm


def get_data_tuples(dist_transform: DistTransform, continuous: bool) -> tuple[dict[str, float], dict[str, NDArray[Shape["*"], Float]]]:
    """
    This function helps us to prepare all combinations of data that are required
    for fitting. It was mainly taken from cli/FitParametric::_get_data_tuples(..),
    so please go there if you need to know more.
    """
    res = Parallel(n_jobs=-1)(
        delayed(get_data_tuple)(
            ds=dataset_mas, qtype=qtype, dist_transform=dist_transform, continuous_transform=continuous)
        for qtype in tqdm(dataset_mas.quantity_types))
    data_dict = {item[0]: item[1] for sublist in res for item in sublist}
    transform_values_dict = {item[0]: item[2] for sublist in res for item in sublist}
    return (transform_values_dict, data_dict)


def fit_parametric(dist_transform: DistTransform, selected_rvs_c: list[type[rv_generic]]) -> list[dict[str, Any]]:
    """
    This function was also taken from cli/FitParametric.
    """
    print('Performing distribution transforms for continuous random variables ...')
    transform_values_dict, data_dict = get_data_tuples(dist_transform=dist_transform, continuous=True)
    # We don't have discrete features, so let's use empty dictionaries.
    # Please refer to the original function in FitParametric for how to
    # do the same for discrete features, but it's pretty straightforward.
    transform_values_discrete_dict, data_discrete_dict = {}, {}

    print(f'Starting fitting of distributions for transform {dist_transform.value}, in randomized order.')
    return generate_parametric_fits(
        ds=dataset_mas,
        num_jobs=-1,
        fitter_type=FitterPymoo,
        dist_transform=dist_transform,
        selected_rvs_c=selected_rvs_c,
        selected_rvs_d=[], # No discrete fits this time ;)
        data_dict=data_dict,
        data_discrete_dict=data_discrete_dict,
        transform_values_dict=transform_values_dict,
        transform_values_discrete_dict=transform_values_discrete_dict)

For each `DistTransform`, we generate a set of parametric fits.
Instead of attempting to fit all available continuous random variables (from `Continuous_RVs`), we use a dedicated, shorter list of random variables, just so that this notebook computes faster.
In practice, you should always attempt to fit all of them, and you must do so if you intend to publish your dataset.

In [22]:
from metrics_as_scores.distribution.fitting import Continuous_RVs # There is also 'Discrete_RVs' if you need them.
from scipy.stats._continuous_distns import alpha_gen, cauchy_gen, crystalball_gen, exponnorm_gen, fisk_gen, foldcauchy_gen, genlogistic_gen, gumbel_l_gen, gumbel_r_gen, johnsonsb_gen, johnsonsu_gen, kstwobign_gen, laplace_asymmetric_gen, moyal_gen, nakagami_gen, ncf_gen, nct_gen, norm_gen, pearson3_gen, rayleigh_gen, rdist_gen, reciprocal_gen, rice_gen, skew_norm_gen, truncweibull_min_gen

# Note that this is a list of random variables that were previously observed to best fit
# (transformations of) the Iris data, it is not just made up!
selected_rvs_c = [alpha_gen, cauchy_gen, crystalball_gen, exponnorm_gen, fisk_gen, foldcauchy_gen, genlogistic_gen, gumbel_l_gen, gumbel_r_gen, johnsonsb_gen, johnsonsu_gen, kstwobign_gen, laplace_asymmetric_gen, moyal_gen, nakagami_gen, ncf_gen, nct_gen, norm_gen, pearson3_gen, rayleigh_gen, rdist_gen, reciprocal_gen, rice_gen, skew_norm_gen, truncweibull_min_gen]

fits: dict[DistTransform, list[dict[str, Any]]] = {}

for dist_transform in list(DistTransform):
    # You should actually store the result in a file with the following name:
    file_name = f'pregen_distns_{dist_transform.name}.pickle'
    
    result = fit_parametric(dist_transform=dist_transform, selected_rvs_c=selected_rvs_c)
    # Here, we'll just store the result in the above fits-dictionary:
    fits[dist_transform] = result

100%|██████████| 4/4 [00:00<00:00, 2002.05it/s]


Starting fitting of distributions for transform NONE [<none>], in randomized order.


100%|██████████| 400/400 [00:04<00:00, 88.41it/s] 
100%|██████████| 4/4 [00:00<00:00, 4005.06it/s]


Starting fitting of distributions for transform EXPECTATION [E[X] (expectation)], in randomized order.


100%|██████████| 400/400 [00:04<00:00, 92.19it/s] 
100%|██████████| 4/4 [00:00<00:00, 3972.82it/s]


Starting fitting of distributions for transform MEDIAN [Median (50th percentile)], in randomized order.


100%|██████████| 400/400 [00:04<00:00, 87.92it/s] 
100%|██████████| 4/4 [00:00<00:00, 3902.59it/s]


Starting fitting of distributions for transform MODE [Mode (most likely value)], in randomized order.


100%|██████████| 400/400 [00:04<00:00, 89.88it/s]
100%|██████████| 4/4 [00:00<00:00, 3929.09it/s]


Starting fitting of distributions for transform INFIMUM [Infimum (min. observed value)], in randomized order.


100%|██████████| 400/400 [00:03<00:00, 109.36it/s]
100%|██████████| 4/4 [00:00<00:00, 4114.08it/s]


Starting fitting of distributions for transform SUPREMUM [Supremum (max. observed value)], in randomized order.


100%|██████████| 400/400 [00:03<00:00, 102.37it/s]
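
The loop above keeps the fits in memory only. Below is a minimal sketch of how the results could be persisted under the file name pattern mentioned in the code comment (`pregen_distns_<TRANSFORM>.pickle`). The use of `pickle.dump`, the target folder, and the dummy entries are assumptions made for this example; the real results come from `fit_parametric`.

```python
import pickle
import tempfile
from pathlib import Path

# Dummy stand-in for the `fits` dictionary, keyed by the transform's name.
fits_by_transform = {
    'NONE': [{'context': 'setosa', 'rv': 'dummy_rv_a'}],
    'EXPECTATION': [{'context': 'setosa', 'rv': 'dummy_rv_b'}],
}

# Placeholder folder; a real dataset would use its own directory for fits.
fits_dir = Path(tempfile.mkdtemp())

for transform_name, result in fits_by_transform.items():
    file_name = f'pregen_distns_{transform_name}.pickle'
    with open(fits_dir / file_name, 'wb') as fp:
        pickle.dump(result, fp)

# Load one file again to check that the round trip preserves the content.
with open(fits_dir / 'pregen_distns_EXPECTATION.pickle', 'rb') as fp:
    restored = pickle.load(fp)

print(sorted(p.name for p in fits_dir.iterdir()))
```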


Let's inspect a fit.
As you can see, a best-fitting random variable was found (selected by the most appropriate statistical test).
A variety of statistical tests were performed, and each `FitResult` retains all of their results.

In [26]:
fits[DistTransform.EXPECTATION][0]

{'context': 'setosa',
 'dist_transform': 'EXPECTATION',
 'qtype': 'sepal width (cm)',
 'rv': 'fisk_gen',
 'type': 'continuous',
 'grid_idx': 29,
 'transform_value': 3.428000000000001,
 'params': {'c': 0.8316874605020181,
  'loc': 0.028000000000000906,
  'scale': 0.0557721052353025},
 'stat_tests': {'tests': {'cramervonmises_ordinary': {'pval': 3.560315283823723e-05,
    'stat': 1.7544763557922332},
   'cramervonmises_jittered': {'pval': 3.560315283823723e-05,
    'stat': 1.7544763557922332},
   'cramervonmises_2samp_ordinary': {'pval': 0.004611473102710151,
    'stat': 0.882200000000001},
   'cramervonmises_2samp_jittered': {'pval': 0.004611473102710151,
    'stat': 0.882200000000001},
   'ks_1samp_ordinary': {'pval': 6.665666358245721e-06,
    'stat': 0.3475925934892153},
   'ks_1samp_jittered': {'pval': 6.665666358245721e-06,
    'stat': 0.3475925934892153},
   'ks_2samp_ordinary': {'pval': 0.005841778142694731, 'stat': 0.34},
   'ks_2samp_jittered': {'pval': 0.005841778142694731, 's
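
Because each fit is a plain dictionary, the retained test results can be read directly. The sketch below uses a hand-copied excerpt of the structure shown above (only the `_ordinary` tests) and lists the tests ordered by p-value. It does not reproduce how Metrics As Scores selects the best fit; it only shows how to navigate the result.

```python
# Excerpt of the fit shown above, reduced to a few keys and tests.
fit = {
    'context': 'setosa',
    'qtype': 'sepal width (cm)',
    'rv': 'fisk_gen',
    'stat_tests': {'tests': {
        'cramervonmises_ordinary': {'pval': 3.560315283823723e-05, 'stat': 1.7544763557922332},
        'cramervonmises_2samp_ordinary': {'pval': 0.004611473102710151, 'stat': 0.882200000000001},
        'ks_1samp_ordinary': {'pval': 6.665666358245721e-06, 'stat': 0.3475925934892153},
        'ks_2samp_ordinary': {'pval': 0.005841778142694731, 'stat': 0.34},
    }},
}

tests = fit['stat_tests']['tests']

# Order the tests from the largest to the smallest p-value.
ranked = sorted(tests.items(), key=lambda kv: kv[1]['pval'], reverse=True)

print(f"{fit['rv']} fitted to {fit['qtype']} ({fit['context']}):")
for name, result in ranked:
    print(f"  {name:<30} stat={result['stat']:.4f}  pval={result['pval']:.3g}")
```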

## Generating the densities required for the web application of Metrics As Scores

In order to use the web application with our dataset, we need to pre-generate densities for it, as generating these during runtime could take a lot of time (depending on the dataset size) and be detrimental to the user experience.
Therefore, we pre-generate these files and trade storage space for computing time.

In [None]:
from os import cpu_count
from pathlib import Path
from metrics_as_scores.data.pregenerate import generate_parametric, generate_empirical, generate_empirical_discrete
from metrics_as_scores.distribution.distribution import Parametric, Parametric_discrete, Empirical, Empirical_discrete, KDE_approx
from sklearn.model_selection import ParameterGrid


# The following functions were taken from cli/GenerateDensitiesWorkflow.
# They are helpers for creating every required combination of density
# type and DistTransform. Also, they do this in parallel.

# In the original workflow, these directories are attributes of the workflow
# object (self.densities_dir, self.fits_dir). The paths below are placeholders
# (an assumption); point them at the corresponding folders of your dataset.
densities_dir = Path('./densities')
fits_dir = Path('./fits')

# The helpers carry their own names so that they do not shadow the imported
# generate_parametric(..) and generate_empirical(..) functions they call.
def generate_all_parametric() -> None:
    grid = dict(
        clazz = [Parametric, Parametric_discrete],
        transform = list(DistTransform))
    expanded_grid = pd.DataFrame(ParameterGrid(param_grid=grid))
    Parallel(n_jobs=-1)(delayed(generate_parametric)(dataset_mas, densities_dir, fits_dir, expanded_grid.iloc[i,]['clazz'], expanded_grid.iloc[i,]['transform']) for i in range(len(expanded_grid.index)))

def generate_all_empirical_kde() -> None:
    grid = dict(
        clazz = [Empirical, KDE_approx],
        transform = list(DistTransform))
    expanded_grid = pd.DataFrame(ParameterGrid(param_grid=grid))
    # Do not start more workers than there are combinations to compute.
    num_jobs = min(cpu_count() or 1, len(expanded_grid.index))
    Parallel(n_jobs=num_jobs)(delayed(generate_empirical)(dataset_mas, densities_dir, expanded_grid.iloc[i,]['clazz'], expanded_grid.iloc[i,]['transform']) for i in range(len(expanded_grid.index)))
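
The helpers above build one job per combination of density type and transform. The following standard-library sketch shows that expansion in isolation. The `Transform` enum and the class names as strings are stand-ins for `DistTransform` and the actual density classes, and `itertools.product` replaces `ParameterGrid`, so the ordering of the combinations may differ from the real grid.

```python
from enum import Enum
from itertools import product


class Transform(Enum):
    """Stand-in for DistTransform, using the names printed during fitting."""
    NONE = 'NONE'
    EXPECTATION = 'EXPECTATION'
    MEDIAN = 'MEDIAN'
    MODE = 'MODE'
    INFIMUM = 'INFIMUM'
    SUPREMUM = 'SUPREMUM'


# Stand-ins for the density classes of the parametric helper.
density_types = ['Parametric', 'Parametric_discrete']

# Every density type is combined with every transform: 2 x 6 = 12 jobs.
grid = [
    {'clazz': clazz, 'transform': transform}
    for clazz, transform in product(density_types, Transform)
]

print(len(grid))
for job in grid[:3]:
    print(job['clazz'], job['transform'].name)
```

Each entry of `grid` corresponds to one density file that has to be generated, which is why the work parallelizes well.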

## Finishing Up

We have successfully imported our own dataset and created parametric fits and densities for the web application.
In order for the dataset to become usable, it needs to be made available in the `datasets`-folder of Metrics As Scores and follow the directory structure of a dataset.

In [28]:
from metrics_as_scores import DATASETS_DIR

print(f'The datasets reside currently in: {str(DATASETS_DIR)}')

The datasets reside currently in: C:\repos\lnu_metrics-as-scores\datasets


For the sake of this notebook, we will copy over the **default** dataset (a template for new datasets that comes with Metrics As Scores) and then save our manifest, the transformed data, the parametric fits, and the generated densities in that folder.
Then, we ask a helper to check the consistency of our work!
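
Copying a template folder can be sketched with the standard library as shown below. All folder names here (`_default`, `fits`, `densities`) are hypothetical placeholders, and a temporary directory stands in for `DATASETS_DIR`, so the example does not touch a real installation. Consult the template that ships with Metrics As Scores for the actual directory structure.

```python
import shutil
import tempfile
from pathlib import Path

# Temporary stand-in for DATASETS_DIR.
datasets_dir = Path(tempfile.mkdtemp())

# Hypothetical template with two hypothetical sub-folders.
template_dir = datasets_dir / '_default'
(template_dir / 'fits').mkdir(parents=True)
(template_dir / 'densities').mkdir()

# Copy the whole template to a new folder named after the dataset's ID.
target_dir = datasets_dir / 'iris'
shutil.copytree(template_dir, target_dir)

print(sorted(p.name for p in target_dir.iterdir()))
```

After copying, the manifest, the transformed data, the fits, and the densities would be saved into the new folder.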