Example Usage: Create Your Own Dataset
======================================

In this example, we will demonstrate how to use your own dataset with Metrics As Scores.
Note that it is not required to interact with code, as the following can be fully accomplished using the Text-based Command Line User Interface (**TUI**).
It can be launched after installation simply by typing `mas` at the prompt.

However, this example might still be useful for programmatic usage, when you need to create datasets in a batch fashion.

Please note: The documentation for the code can be found at <https://mrshoenel.github.io/metrics-as-scores/>.


# Implementing the Creation of Your Own Dataset in Code

Here, we will go through the following steps:

1. Load the Iris Data Frame and Transform It Into the Required Format
   1. Transforming the Data Frame
   2. Creating a Manifest
   3. Using the Workflow to Initialize the Dataset
2. Conduct Analyses That Are Required for Generating a Scientific Report
3. Fitting of Parametric Random Variables to the Data
4. Generating the Densities Required for the Web Application of Metrics As Scores
5. Finishing Up: Prepare for Publication
   1. Rendering the About.pdf With Quarto
   2. Bundling the Dataset


# 1. Load the Iris Data Frame and Transform It Into the Required Format

The well-known Iris dataset is included in `scikit-learn`, so we can load it directly.
It has 4 real-valued features and one label.

In [2]:
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris(as_frame=True)
df = iris.frame

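# Let's replace the numeric class labels in the 'target' column with the actual species names.
# (Assigning the mapped column back avoids the chained in-place `replace`, which newer pandas versions no longer apply reliably.)
df['target'] = df['target'].map(dict(enumerate(iris.target_names)))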

print(df.head(5))

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target  
0  setosa  
1  setosa  
2  setosa  
3  setosa  
4  setosa  


## 1.1. Transforming the Data Frame

Metrics As Scores requires the data frame in a stacked (long) format. For each feature, we take its values, the group, and the feature's name to produce a new three-column data frame; the frames of all features are then stacked vertically. A helper function implements this, so we can use it directly.

In [3]:
from metrics_as_scores.tools.funcs import transform_to_MAS_dataset

df_mas = transform_to_MAS_dataset(df=df, group_col='target', feature_cols=iris.feature_names)

print(df_mas.head(5))

             Feature   Group  Value
0  sepal length (cm)  setosa    5.1
1  sepal length (cm)  setosa    4.9
2  sepal length (cm)  setosa    4.7
3  sepal length (cm)  setosa    4.6
4  sepal length (cm)  setosa    5.0
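
To get a feeling for what the stacked format amounts to, here is a minimal sketch that produces the same three columns with `pandas.DataFrame.melt`. It is not the implementation of `transform_to_MAS_dataset`; the toy data frame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data in "wide" format: one column per feature plus a group column.
wide = pd.DataFrame({
    'height': [1.0, 2.0, 3.0],
    'width': [0.5, 0.6, 0.7],
    'target': ['a', 'a', 'b'],
})

# Stack the feature columns on top of each other (one row per observed value).
stacked = wide.melt(
    id_vars='target', value_vars=['height', 'width'],
    var_name='Feature', value_name='Value')

# Use the same column names and order as in the output above.
stacked = stacked.rename(columns={'target': 'Group'})[['Feature', 'Group', 'Value']]
print(stacked)
```

Each feature contributes one block of rows, so a frame with two features and three observations ends up with six rows.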


## 1.2. Creating a Manifest

Every dataset must be accompanied by a manifest that provides some meta information about it. The manifest is also required for the next steps. When bundling and publishing a dataset, the manifest is a required file.
The dataset's ID in the manifest will also be used as the **name of the dataset's local folder**.

Metrics As Scores uses a typed dictionary for the manifest, which is then stored as `manifest.json`. Afterwards, a new instance of `Dataset` is created from the transformed data frame and the manifest.

In [4]:
from metrics_as_scores.distribution.distribution import Dataset, LocalDataset

# Here, we manually create a manifest and fill out the minimum required properties.
# The dataset creation workflow has a nice wizard that we can use, otherwise.
# A manifest for a publishable dataset needs to present proper values for all keys.
manifest: LocalDataset = {}
manifest['author'] = ['First A. Author', 'Second Author']
manifest['name'] = 'The Iris dataset.'
manifest['id'] = 'iris-example'
manifest['desc'] = 'The Iris dataset holds 50 observations per species.'
manifest['colname_context'] = 'Group'
manifest['colname_data'] = 'Value'
manifest['colname_type'] = 'Feature'
# Note this is a dictionary, where the keys are the features' (column) names
# and the values are the obligatory descriptions.
manifest['desc_qtypes'] = { fn: f'The {fn}' for fn in iris.feature_names }
# The feature (quantity) types are given as a dictionary, too:
manifest['qtypes'] = { fn: 'continuous' for fn in iris.feature_names }
# This is a list of the available groups.
manifest['contexts'] = iris.target_names.tolist()
# The Iris dataset has no ideal values for any of its features:
manifest['ideal_values'] = { fn: None for fn in iris.feature_names }

manifest

{'author': ['First A. Author', 'Second Author'],
 'name': 'The Iris dataset.',
 'id': 'iris-example',
 'desc': 'The Iris dataset holds 50 observations per species.',
 'colname_context': 'Group',
 'colname_data': 'Value',
 'colname_type': 'Feature',
 'desc_qtypes': {'sepal length (cm)': 'The sepal length (cm)',
  'sepal width (cm)': 'The sepal width (cm)',
  'petal length (cm)': 'The petal length (cm)',
  'petal width (cm)': 'The petal width (cm)'},
 'qtypes': {'sepal length (cm)': 'continuous',
  'sepal width (cm)': 'continuous',
  'petal length (cm)': 'continuous',
  'petal width (cm)': 'continuous'},
 'contexts': ['setosa', 'versicolor', 'virginica'],
 'ideal_values': {'sepal length (cm)': None,
  'sepal width (cm)': None,
  'petal length (cm)': None,
  'petal width (cm)': None}}
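
When manifests are created in a batch, a quick sanity check before handing them to the workflow can save time. The following is a small sketch, not part of Metrics As Scores: `manifest_problems` is a hypothetical helper, and the list of expected keys is simply taken from the example above (the library's typed dictionary may define more). It also shows that such a manifest survives a round trip through JSON.

```python
import json

# Keys used in the example manifest above. This is an assumption: the
# library's own definition of required keys may differ.
EXPECTED_KEYS = [
    'author', 'name', 'id', 'desc', 'colname_context', 'colname_data',
    'colname_type', 'desc_qtypes', 'qtypes', 'contexts', 'ideal_values',
]

def manifest_problems(manifest: dict) -> list[str]:
    """Hypothetical helper: return a list of problems (empty if none were found)."""
    problems = [f'missing key: {k}' for k in EXPECTED_KEYS if k not in manifest]
    if problems:
        return problems
    # The per-feature dictionaries should all describe the same features.
    features = set(manifest['qtypes'])
    for key in ('desc_qtypes', 'ideal_values'):
        if set(manifest[key]) != features:
            problems.append(f'{key} does not cover the same features as qtypes')
    return problems

example = {
    'author': ['First A. Author'],
    'name': 'Toy dataset',
    'id': 'toy-example',
    'desc': 'A toy dataset.',
    'colname_context': 'Group',
    'colname_data': 'Value',
    'colname_type': 'Feature',
    'desc_qtypes': {'height': 'The height'},
    'qtypes': {'height': 'continuous'},
    'contexts': ['a', 'b'],
    'ideal_values': {'height': None},
}

print(manifest_problems(example))

# A JSON round trip preserves the manifest (None is stored as null).
restored = json.loads(json.dumps(example, indent=2))
print(restored == example)
```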

## 1.3. Using the Workflow to Initialize the Dataset

In order to create a `Dataset`, we need the manifest and the data frame in the stacked format (the workflow's `org_df` parameter).
We will use the workflow that is also used in the TUI to help us with that.

In [5]:
from metrics_as_scores.cli.CreateDataset import CreateDatasetWorkflow

create = CreateDatasetWorkflow(manifest=manifest, org_df=df_mas)
# Let's also get the dataset as initialized by the workflow.
dataset = create.dataset
print(f'The dataset\'s folder will be: {str(create.dataset_dir)}')

The dataset's folder will be: C:\repos\lnu_metrics-as-scores\datasets\iris-example


Now that we have the workflow, we will exploit its convenience methods to initialize the dataset.
First, we will make sure all required directories exist, before we copy over some files from the *default* dataset and save the manifest and data frame to disk.

In [6]:
create._make_dirs()

In [7]:
create._init_dataset()

In [8]:
create._save_manifest_and_data()

Go ahead and check out the directory that was created!
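
If you prefer to inspect the folder from code, a few lines of `pathlib` are enough. The sketch below uses a hypothetical helper `list_tree` and demonstrates it on a throwaway directory, so the file names in the demonstration are placeholders and not the actual content of the dataset folder.

```python
import tempfile
from pathlib import Path

def list_tree(root: Path) -> list[str]:
    """Hypothetical helper: relative paths of everything below root (directories end with '/')."""
    lines = []
    for path in sorted(root.rglob('*')):
        rel = path.relative_to(root).as_posix()
        lines.append(f'{rel}/' if path.is_dir() else rel)
    return lines

# Self-contained demonstration on a temporary directory:
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / 'fits').mkdir()
    (demo_root / 'manifest.json').write_text('{}')
    demo_listing = list_tree(demo_root)

print('\n'.join(demo_listing))
```

For the dataset from this example, `list_tree(Path(create.dataset_dir))` would list the freshly created folder.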

# 2. Conduct Analyses That Are Required for Generating a Scientific Report

Here, we will conduct the three statistical analyses that are required to generate a report.
The report itself is a Quarto template that can (but does not necessarily have to) be changed.

You can call the methods on the `Dataset` manually or use the workflow to run the analyses.
Here, for demonstration purposes, we call them one-by-one and save the files manually.

In [9]:
# Note that we want to do a full analysis and, therefore, pass in all available features
# (called quantity types) and groups (called contexts). We also pass in a virtual "ALL"-
# context, which creates an additional group that contains all data.
test_anova = dataset.analyze_ANOVA(
    qtypes=dataset.quantity_types, contexts=list(dataset.contexts(include_all_contexts=True)))

# Let's save the result (normally done by the workflow):
test_anova.to_csv(create.path_test_ANOVA, index=False)

test_anova.head(5)

100%|██████████| 4/4 [00:00<00:00, 160.19it/s]


               qtype       stat          pval                      across_contexts
0  sepal length (cm)  44.194514  1.250402e-23  setosa;versicolor;virginica;__ALL__
1   sepal width (cm)  24.727043  2.637466e-14  setosa;versicolor;virginica;__ALL__
2  petal length (cm)  87.738074   1.21977e-40  setosa;versicolor;virginica;__ALL__
3   petal width (cm)  85.564667  6.874646e-40  setosa;versicolor;virginica;__ALL__
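
To make the `stat` column more tangible, here is a minimal, library-independent sketch of the F statistic of a one-way ANOVA, including the idea of a pooled `__ALL__` group. It is not the code behind `analyze_ANOVA`; the helper `one_way_anova_f` and the toy numbers are made up for illustration.

```python
from statistics import fmean

def one_way_anova_f(groups: list[list[float]]) -> float:
    """F statistic of a one-way ANOVA: between-group over within-group mean squares."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = fmean([x for g in groups for x in g])
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Hypothetical toy groups (contexts) for a single feature:
groups = {'a': [1.0, 2.0, 3.0], 'b': [2.0, 3.0, 4.0], 'c': [5.0, 6.0, 7.0]}
f_stat = one_way_anova_f(list(groups.values()))
print(f_stat)

# Mimic the virtual "ALL" context: an additional group that pools all data.
with_all = dict(groups)
with_all['__ALL__'] = [x for g in groups.values() for x in g]
f_with_all = one_way_anova_f(list(with_all.values()))
print(f_with_all)
```

Adding the pooled group changes the statistic, which is worth keeping in mind when comparing these values against an ANOVA across the original groups only.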


In [10]:
# Tukey's Honestly Significant Difference (HSD) test:
test_tukey = dataset.analyze_TukeyHSD(qtypes=dataset.quantity_types)
test_tukey.to_csv(create.path_test_TukeyHSD, index=False)
test_tukey.head(5)

100%|██████████| 4/4 [00:00<00:00, 1997.53it/s]


    group1      group2  meandiff   p-adj    lower   upper  reject
0  __ALL__      setosa   -0.8373     0.0  -1.1287  -0.546    True
1  __ALL__  versicolor    0.0927  0.8441  -0.1987   0.384   False
2  __ALL__   virginica    0.7447     0.0   0.4533   1.036    True
3   setosa  versicolor      0.93     0.0   0.5732  1.2868    True
4   setosa   virginica     1.582     0.0   1.2252  1.9388    True
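
When reading this table, it helps to know the sign convention of `meandiff`. Judging from the rows above, it appears to be the mean of `group2` minus the mean of `group1`. The following sketch reproduces only that column for toy data; it does not compute the adjusted p-values or confidence bounds of the actual test.

```python
from itertools import combinations
from statistics import fmean

# Hypothetical toy groups plus a pooled group, analogous to '__ALL__' above.
groups = {'a': [1.0, 2.0, 3.0], 'b': [4.0, 5.0, 6.0]}
groups['__ALL__'] = [x for g in list(groups.values()) for x in g]

# Assumed convention: meandiff = mean(group2) - mean(group1), pairs in sorted order.
mean_diffs = {
    (g1, g2): fmean(groups[g2]) - fmean(groups[g1])
    for g1, g2 in combinations(sorted(groups), 2)
}

for (g1, g2), diff in mean_diffs.items():
    print(g1, g2, diff)
```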


In [11]:
# The two-sample Kolmogorov-Smirnov test:
test_ks2 = dataset.analyze_distr(qtypes=dataset.quantity_types, use_ks_2samp=True)
test_ks2.to_csv(create.path_test_ks2samp, index=False)
test_ks2.head(5)

100%|██████████| 4/4 [00:00<00:00, 4036.87it/s]


               qtype  stat          pval      group1      group2
0  sepal length (cm)  0.78  2.807571e-15      setosa  versicolor
1  sepal length (cm)  0.92  7.773164e-23      setosa   virginica
2  sepal length (cm)  0.56  2.537645e-11      setosa     __ALL__
3  sepal length (cm)   0.5  4.807534e-06  versicolor   virginica
4  sepal length (cm)  0.24      0.024115  versicolor     __ALL__
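
The `stat` of a two-sample Kolmogorov-Smirnov test is the largest vertical distance between the empirical CDFs of the two samples. Here is a minimal sketch of that statistic using only the standard library; `ks_2samp_stat` is a hypothetical helper, and in practice a tested routine such as `scipy.stats.ks_2samp` (which also yields the p-value) is the better choice.

```python
from bisect import bisect_right

def ks_2samp_stat(a: list[float], b: list[float]) -> float:
    """Largest absolute difference between the empirical CDFs of a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)

    def ecdf(sample: list[float], x: float) -> float:
        # Fraction of observations that are less than or equal to x.
        return bisect_right(sample, x) / len(sample)

    # The maximum can only occur at one of the observed values.
    return max(abs(ecdf(a_sorted, x) - ecdf(b_sorted, x)) for x in a_sorted + b_sorted)

# Hypothetical toy samples:
stat_shifted = ks_2samp_stat([1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0])
stat_same = ks_2samp_stat([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
print(stat_shifted, stat_same)
```

A value close to 1, such as the 0.92 for setosa against virginica above, means the two samples barely overlap.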


# 3. Fitting of Parametric Random Variables to the Data

Fitting random variables is required so that we can inspect and use the parametric fits in the web application.
A dataset that is to be published must contain these fits.
Metrics As Scores can fit approximately $120$ random variables, of which approximately $20$ are discrete.
The Iris dataset only has real-valued (continuous) features, so we will limit ourselves to continuous random variables.

In [12]:
from metrics_as_scores.cli.FitParametric import FitParametricWorkflow
from metrics_as_scores.distribution.fitting import FitterPymoo

fit = FitParametricWorkflow()
# Let's manually initialize the workflow:
fit.use_ds = manifest
fit.ds = dataset
fit.df = df_mas
fit.fits_dir = create.fits_dir
fit.selected_rvs_d = [] # Do not attempt fitting any discrete random variables
fit.num_cpus = -1 # Use all available CPU cores

In [13]:
from metrics_as_scores.distribution.fitting import Continuous_RVs # There is also 'Discrete_RVs' if you need them.

# Note that this is a list of random variables that were previously observed to best fit
# (transformations of) the Iris data; it is not just made up!
from scipy.stats._continuous_distns import alpha_gen, cauchy_gen, crystalball_gen, exponnorm_gen, fisk_gen, foldcauchy_gen, genlogistic_gen, gumbel_l_gen, gumbel_r_gen, johnsonsb_gen, johnsonsu_gen, kstwobign_gen, laplace_asymmetric_gen, moyal_gen, nakagami_gen, ncf_gen, nct_gen, norm_gen, pearson3_gen, rayleigh_gen, rdist_gen, reciprocal_gen, rice_gen, skew_norm_gen, truncweibull_min_gen

# Let's use this short list in the fitting workflow:
fit.selected_rvs_c = [alpha_gen, cauchy_gen, crystalball_gen, exponnorm_gen, fisk_gen, foldcauchy_gen, genlogistic_gen, gumbel_l_gen, gumbel_r_gen, johnsonsb_gen, johnsonsu_gen, kstwobign_gen, laplace_asymmetric_gen, moyal_gen, nakagami_gen, ncf_gen, nct_gen, norm_gen, pearson3_gen, rayleigh_gen, rdist_gen, reciprocal_gen, rice_gen, skew_norm_gen, truncweibull_min_gen]

We are now ready to start the fitting process!

For each `DistTransform`, we generate a set of parametric fits.
Instead of attempting to fit all available continuous random variables (from `Continuous_RVs`), we only attempt to fit a short, dedicated list of random variables, just so that this notebook computes faster.
In practice, you should always attempt to fit all of them, and you must do so if you intend to publish your dataset.

In [14]:
from metrics_as_scores.distribution.distribution import DistTransform
from pickle import dump
from typing import Any

result: list[dict[str, Any]] = None
for dist_transform in list(DistTransform):
    # The following will compute all fits for all features for a single DistTransform
    # and save the file in the dataset's 'fits'-folder.
    print(f'Starting fitting of distributions for transform {dist_transform.value}')
    result = fit._fit_parametric(dist_transform=dist_transform, do_print=False)

    result_file = fit.fits_dir.joinpath(f'./pregen_distns_{dist_transform.name}.pickle')
    with open(file=str(result_file), mode='wb') as fp:
        dump(obj=result, file=fp)

Starting fitting of distributions for transform <none>


100%|██████████| 4/4 [00:00<00:00, 4006.98it/s]
100%|██████████| 4/4 [00:00<00:00, 2005.17it/s]
100%|██████████| 400/400 [00:04<00:00, 92.88it/s] 


Starting fitting of distributions for transform E[X] (expectation)


100%|██████████| 4/4 [00:00<00:00, 3988.88it/s]
100%|██████████| 4/4 [00:00<00:00, 4033.95it/s]
100%|██████████| 400/400 [00:04<00:00, 99.38it/s] 


Starting fitting of distributions for transform Median (50th percentile)


100%|██████████| 4/4 [00:00<00:00, 1977.51it/s]
100%|██████████| 4/4 [00:00<00:00, 4005.06it/s]
100%|██████████| 400/400 [00:04<00:00, 98.04it/s] 


Starting fitting of distributions for transform Mode (most likely value)


100%|██████████| 4/4 [00:00<00:00, 3993.62it/s]
100%|██████████| 4/4 [00:00<00:00, 4010.81it/s]
100%|██████████| 400/400 [00:04<00:00, 89.82it/s] 


Starting fitting of distributions for transform Infimum (min. observed value)


100%|██████████| 4/4 [00:00<00:00, 2002.29it/s]
100%|██████████| 4/4 [00:00<00:00, 2000.86it/s]
100%|██████████| 400/400 [00:03<00:00, 104.24it/s]


Starting fitting of distributions for transform Supremum (max. observed value)


100%|██████████| 4/4 [00:00<00:00, 2002.77it/s]
100%|██████████| 4/4 [00:00<00:00, 4006.02it/s]
100%|██████████| 400/400 [00:03<00:00, 109.15it/s]
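
To illustrate what a single fit involves conceptually, the sketch below fits three of the distributions from the list above to synthetic data and compares them with a one-sample Kolmogorov-Smirnov test. It deliberately uses SciPy's built-in maximum-likelihood `fit` and `kstest`, which is an assumption made for brevity and not the procedure of `_fit_parametric`; the data are random numbers, not the Iris data.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the values of one feature in one group.
rng = np.random.default_rng(seed=1)
data = rng.normal(loc=5.0, scale=0.4, size=200)

# Public SciPy instances of three of the generator classes used above.
candidates = {'norm': stats.norm, 'gumbel_r': stats.gumbel_r, 'cauchy': stats.cauchy}

results = {}
for name, dist in candidates.items():
    params = dist.fit(data)  # maximum-likelihood estimate of (loc, scale)
    test = stats.kstest(data, dist.cdf, args=params)
    results[name] = {'params': params, 'stat': test.statistic, 'pval': test.pvalue}

# Pick the candidate with the largest p-value as the "best" fit in this sketch.
best = max(results, key=lambda name: results[name]['pval'])
print(best, results[best])
```

Selecting by a single test's p-value is a simplification chosen for this sketch; as the inspected fit shows, Metrics As Scores records the results of many tests per fit.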


Let's inspect a fit.
You can see that a best-fitting random variable was found (selected by the most appropriate statistical test).
A variety of statistical tests were performed, and each `FitResult` retains all of their results.

In [15]:
result[0]

{'context': 'setosa',
 'dist_transform': 'SUPREMUM',
 'qtype': 'sepal width (cm)',
 'rv': 'fisk_gen',
 'type': 'continuous',
 'grid_idx': 29,
 'transform_value': 4.4,
 'params': {'c': 60517527.07863054,
  'loc': -12641545.884248791,
  'scale': 12641546.864645472},
 'stat_tests': {'tests': {'cramervonmises_ordinary': {'pval': 0.7619894052317475,
    'stat': 0.06867291332303076},
   'cramervonmises_jittered': {'pval': 0.7619894052317475,
    'stat': 0.06867291332303076},
   'cramervonmises_2samp_ordinary': {'pval': 0.8343025940612143,
    'stat': 0.05900000000000105},
   'cramervonmises_2samp_jittered': {'pval': 0.8343025940612143,
    'stat': 0.05900000000000105},
   'ks_1samp_ordinary': {'pval': 0.7033899762790617,
    'stat': 0.09655598332815263},
   'ks_1samp_jittered': {'pval': 0.7033899762790617,
    'stat': 0.09655598332815263},
   'ks_2samp_ordinary': {'pval': 0.9667464356809096, 'stat': 0.1},
   'ks_2samp_jittered': {'pval': 0.9667464356809096, 'stat': 0.1},
   'epps_singleton_2

Again, go ahead and check out the generated files in the `fits`-folder of your new dataset!
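
The pickled files can be loaded again later for your own analyses. The following sketch writes and reads a heavily shortened, made-up result list that mimics the structure shown above, and then picks, per group and feature, the candidate with the largest p-value of one test. It assumes that a file holds several candidate fits per group and feature; the choice of `ks_2samp_jittered` is arbitrary and not necessarily the criterion the library uses.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical, shortened results that mimic the structure of a fit shown above.
toy_results = [
    {'context': 'a', 'qtype': 'height', 'rv': 'norm_gen',
     'stat_tests': {'tests': {'ks_2samp_jittered': {'pval': 0.40, 'stat': 0.20}}}},
    {'context': 'a', 'qtype': 'height', 'rv': 'fisk_gen',
     'stat_tests': {'tests': {'ks_2samp_jittered': {'pval': 0.90, 'stat': 0.10}}}},
    {'context': 'b', 'qtype': 'height', 'rv': 'norm_gen',
     'stat_tests': {'tests': {'ks_2samp_jittered': {'pval': 0.70, 'stat': 0.15}}}},
]

# Round trip through a pickle file (only unpickle files from sources you trust).
with tempfile.TemporaryDirectory() as tmp:
    result_file = Path(tmp) / 'pregen_distns_toy.pickle'
    with open(result_file, mode='wb') as fp:
        pickle.dump(toy_results, fp)
    with open(result_file, mode='rb') as fp:
        loaded = pickle.load(fp)

def pval(fit: dict, test: str = 'ks_2samp_jittered') -> float:
    return fit['stat_tests']['tests'][test]['pval']

# Keep the candidate with the largest p-value per (group, feature).
best = {}
for fit in loaded:
    key = (fit['context'], fit['qtype'])
    if key not in best or pval(fit) > pval(best[key]):
        best[key] = fit

for (context, qtype), fit in best.items():
    print(context, qtype, fit['rv'], pval(fit))
```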

# 4. Generating the Densities Required for the Web Application of Metrics As Scores

In order to use the web application with our dataset, we need to pre-generate densities for it, as generating these at runtime could take a lot of time (depending on the dataset size) and be detrimental to the user experience.
Therefore, we pre-generate these files and trade storage space for computing time.

In [16]:
from metrics_as_scores.cli.GenerateDensities import GenerateDensitiesWorkflow

gendens = GenerateDensitiesWorkflow()
gendens.use_ds = manifest
gendens.ds = dataset
# We need access to the parametric fits for generating densities from them:
gendens.fits_dir = create.fits_dir
gendens.densities_dir = create.densities_dir
gendens.num_cpus = -1 # Allow parallel

We are now ready to pre-generate all densities.
Note that you should always generate all of the following, even if you do not have discrete features, because the process will generate empty densities and mark the distributions explicitly as *unfit*.
This is then exploited in the web application.

The following generates densities for five different types and six different distribution transforms (a total of $30$ files will be created).

In [17]:
gendens._generate_parametric()

In [18]:
gendens._generate_empirical_kde()

In [19]:
gendens._generate_empirical_discrete()

Check out the `densities`-folder in your dataset!
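
To illustrate the trade of storage space for computing time, here is a minimal sketch that evaluates a Gaussian kernel density estimate once on a fixed grid, so that later lookups only need the stored values. The function `gaussian_kde_on_grid`, the rule-of-thumb bandwidth, and the grid size are assumptions made for this illustration; the format and method that Metrics As Scores uses for its density files are not shown here.

```python
import math
from statistics import stdev

def gaussian_kde_on_grid(data: list[float], num_points: int = 200):
    """Evaluate a Gaussian KDE of data on an evenly spaced grid."""
    n = len(data)
    h = 1.06 * stdev(data) * n ** (-1 / 5)  # normal-reference rule of thumb
    lo, hi = min(data) - 4 * h, max(data) + 4 * h
    step = (hi - lo) / (num_points - 1)
    grid = [lo + i * step for i in range(num_points)]
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))
    density = [
        norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)
        for x in grid
    ]
    return grid, density

# Hypothetical toy sample:
data = [4.4, 4.6, 4.7, 4.9, 5.0, 5.0, 5.1, 5.4]
grid, density = gaussian_kde_on_grid(data)

# Sanity check: the density should integrate to approximately one (trapezoidal rule).
step = grid[1] - grid[0]
area = sum((density[i] + density[i + 1]) / 2 * step for i in range(len(grid) - 1))
print(round(area, 3))
```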

# 5. Finishing Up: Prepare for Publication

We have successfully imported our own dataset and created parametric fits and densities for the web application.
We should now run a consistency check.
Its result should be that only one last file is missing: the **`About.pdf`**.

In [20]:
from metrics_as_scores.cli.helpers import required_files_folders_local_dataset, validate_local_dataset_files, PathStatus

dirs, files = required_files_folders_local_dataset(local_ds_id=manifest['id'])
dirs_status, files_status = validate_local_dataset_files(dirs=dirs, files=files)

for directory in dirs_status:  # avoid shadowing the built-in `dir`
    assert dirs_status[directory] == PathStatus.OK

for file in files_status:
    if 'About.pdf' in file.name:
        assert files_status[file] == PathStatus.DOESNT_EXIST
    else:
        assert files_status[file] == PathStatus.OK


## 5.1. Rendering the About.pdf With Quarto

You will find the files `About.qmd` and `_quarto.yml` in your newly created dataset folder.
Those use the dataset's manifest and the statistical tests generated earlier to render a scientific report that is meant to be published alongside your dataset.

An `About.pdf` is required for publication.
How you produce it is ultimately up to you, of course.
In the following, we simply render the Quarto template.

In [21]:
%%cmd
quarto render ../datasets/iris-example/About.qmd --to pdf

Microsoft Windows [Version 10.0.18363.592]
(c) 2019 Microsoft Corporation. All rights reserved.

(venv) c:\repos\lnu_metrics-as-scores\notebooks>quarto render ../datasets/iris-example/About.qmd --to pdf



Starting python3 kernel...Done

Executing 'About.ipynb'
  Cell 1/13...Done
  Cell 2/13...Done
  Cell 3/13...Done
  Cell 4/13...Done
  Cell 5/13...Done
  Cell 6/13...Done
  Cell 7/13...Done
  Cell 8/13...Done
  Cell 9/13...Done
  Cell 10/13...Done
  Cell 11/13...Done
  Cell 12/13...Done
  Cell 13/13...Done

pandoc 
  to: latex
  output-file: About.tex
  standalone: true
  pdf-engine: pdflatex
  variables:
    graphics: true
    tables: true
  default-image-extension: pdf
  number-sections: true
  top-level-division: section
  
metadata
  documentclass: scrartcl
  classoption:
    - DIV=11
    - numbers=noendperiod
  papersize: letter
  header-includes:
    - '\KOMAoption{captions}{tableheading}'
  block-headings: true
  bibliography:
    - refs.bib
  title: The Iris dataset.
  author:
    - First A. Author
    - Second Author
  geometry: 'a4paper,margin=2.25cm'
  subtitle: A Dataset For _Metrics As Scores_
  jupyter: python3
  
running pdflatex - 1
  This is pdfTeX, Version 3.141592653


(venv) c:\repos\lnu_metrics-as-scores\notebooks>

## 5.2. Bundling the Dataset

Now that the dataset is complete, we can bundle it into a single Zip file.

In [22]:
from metrics_as_scores.cli.BundleOwn import BundleDatasetWorkflow

bundle = BundleDatasetWorkflow()
bundle.use_ds = manifest
bundle.ds_dir = create.dataset_dir

zip_file = bundle._make_zip()  # named `zip_file` to avoid shadowing the built-in `zip`
print(f'Zip file created at: {str(zip_file)}')

Zip file created at: C:\repos\lnu_metrics-as-scores\datasets\iris-example\dataset.zip
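
For readers who want to see what bundling a folder into a Zip file looks like in general, here is a small sketch using the standard library's `zipfile`. The helper `zip_directory` is hypothetical and simply stores every file it finds; which files the workflow's own bundling step includes is not shown in this example.

```python
import tempfile
import zipfile
from pathlib import Path

def zip_directory(src_dir: Path, zip_path: Path) -> Path:
    """Hypothetical helper: store every file below src_dir in zip_path."""
    with zipfile.ZipFile(zip_path, mode='w', compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src_dir.rglob('*')):
            # Skip directories and the archive itself (it may live inside src_dir).
            if path.is_file() and path.resolve() != zip_path.resolve():
                zf.write(path, arcname=path.relative_to(src_dir).as_posix())
    return zip_path

# Self-contained demonstration on a temporary directory with placeholder files:
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp)
    (src / 'fits').mkdir()
    (src / 'fits' / 'toy.pickle').write_bytes(b'')
    (src / 'manifest.json').write_text('{}')
    archive = zip_directory(src, src / 'dataset.zip')
    with zipfile.ZipFile(archive) as zf:
        names = sorted(zf.namelist())

print(names)
```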


When uploading the dataset, it is recommended to upload the `dataset.zip` file alongside the `About.pdf`, so it is possible to get an overview of the dataset before downloading it.
Here are three example datasets:

- Metrics and Domains From the Qualitas.class corpus (Hönel 2023b). 10 GB. <https://doi.org/10.5281/zenodo.7633949>.
- ELISA Spectrophotometer Samples (Hönel 2023a). 266 MB. <https://doi.org/10.5281/zenodo.7633989>.
- Price, weight, and other properties of over 1,200 ideal-cut and best-clarity diamonds (Hönel 2023c). 508 MB. <https://doi.org/10.5281/zenodo.7647596>.