08.02.2023 Area A Data Modeling #30

hampusnasstrom · 2023-02-15T10:47:16Z

hampusnasstrom
Feb 15, 2023
Maintainer

Meeting Notes

Tamas asked whether the statistical parameters should be stored or extracted on the fly from the data. Sandor pointed out that the data itself is not always known, in which case the statistical parameters have to be stored.

Sandor mentioned that an array can be a subset of a larger array.

Goals:

0D statistical descriptors
Search-ability

Tamas made the analogy to the "summary" of a data frame in the R programming language, which reports the following descriptors:

Numerical Array

Mean,
Min,
Max,
Median,
First Quartile,
Third Quartile,
Standard Deviation

Additionally we might need:

Skewness
Shape (dimensions) E.g. (100, 100, 2)
Preview / Static plot / binned plot → Sherjeel will investigate
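
Skewness is the one descriptor in the list above that NumPy does not provide directly. A minimal sketch, assuming the population (biased) definition as the third standardized moment over all elements; the function name `skewness` and the choice to return 0 for constant data are illustrative assumptions, not something agreed in the meeting:

```python
import numpy as np


def skewness(a) -> float:
    """Population skewness (third standardized moment) over all elements."""
    x = np.asarray(a, dtype=float).ravel()
    mu = x.mean()
    sigma = x.std()  # population standard deviation (ddof=0)
    if sigma == 0:
        # Assumption: treat constant data as having zero skewness
        return 0.0
    return float(np.mean(((x - mu) / sigma) ** 3))


symmetric = np.array([[1.0, 2.0], [3.0, 4.0]])
right_tailed = np.array([1.0, 1.0, 1.0, 10.0])

print(skewness(symmetric))     # close to zero for symmetric data
print(skewness(right_tailed))  # positive for a long right tail
```

Like the other descriptors, this is a single scalar regardless of the shape of the array.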

Context Array

Examples of implementations: xarray.DataArray, pandas.Series, NXData
Non-statistical labels we need for an array in a context:

Values,
- Type(Value) = Union(NumericalArray, Ref (also virtual))
Axes / Index / Independent variable → Ref to another array (learn from xarray)
- Type(Index) = List(Union(ContextArray, Ref)), len(Index) == len(values.shape)
- Type(Axes) = List(Union(ContextArray, Ref)) (Parametrization)
Unit
Errors and uncertainty
Quantization (digitization)
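
The labels above could be sketched as a small data structure. This is only an illustration under stated assumptions: the class names `ContextArray` and `Ref`, the `path` field, and the rule that an empty index means "no index attached" are hypothetical and not part of any NOMAD API. Only the constraint len(Index) == len(values.shape) comes from the notes.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import List, Optional, Union

import numpy as np


@dataclass(eq=False)
class Ref:
    """Hypothetical reference to an array stored elsewhere (possibly virtual)."""
    path: str


@dataclass(eq=False)
class ContextArray:
    """Hypothetical array with context: values, one index per dimension, unit."""
    values: Union[np.ndarray, Ref]
    index: List[Union[ContextArray, Ref]] = field(default_factory=list)
    unit: Optional[str] = None

    def __post_init__(self) -> None:
        # The shape is only known for in-memory values, so a Ref is not checked.
        # Assumption: an empty index means that no index is attached.
        if isinstance(self.values, np.ndarray) and self.index:
            if len(self.index) != self.values.ndim:
                raise ValueError(
                    f"expected {self.values.ndim} index entries, "
                    f"got {len(self.index)}"
                )


# A 1D independent variable and a signal that is indexed by it
time = ContextArray(values=np.linspace(0.0, 1.0, 5), unit="s")
signal = ContextArray(values=np.zeros(5), index=[time], unit="V")
print(len(signal.index) == signal.values.ndim)
```

Errors, uncertainty and quantization are left out of the sketch because their types were not discussed.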

Table of Context Array

Examples of implementations: xarray.Dataset, pandas.DataFrame
Do we need this? → Leave it for the Application definition/schema

Future Discussion Points

Virtual data (grouping data in different files)
How ContextArray should be used

Tasks

Hampus will draft an implementation in NOMAD of NumericalArray without reference
Sherjeel will investigate binning / preview

hampusnasstrom · 2023-02-17T09:01:21Z

hampusnasstrom
Feb 17, 2023
Maintainer Author

I'm working on the implementation of the array data we discussed last week. How did we intend to deal with the percentiles (Median, First Quartile, Third Quartile) for arrays with more than one dimension? We could compute them along each axis, but that defeats the purpose of ending up with scalar descriptors.

I'm guessing there is a reason that NumPy arrays have methods for mean, min, max and std and a shape attribute, whilst the percentiles need to be calculated with the separate function np.percentile().

My suggestion would be to not include these properties in the precomputed values.

3 replies

hampusnasstrom Feb 17, 2023
Maintainer Author

Although I guess shape is also not scalar so maybe we could do it along each axis.

tomio13 Feb 17, 2023
Collaborator

Data frames are practically tables (in R).
We can consider data to be a list of arrays, and then allow each array to have any shape the user wants.
Then a summary goes for every element of this list, clearing up the question of dimensions.
It is optional to require the same dimension from each element.

I would actually use numpy.quantile(data, [0.05, 0.25, 0.5, 0.75, 0.95]) to calculate:

lower outlier limit
first quartile
median
third quartile
upper outlier limit
just like for box plots.
Adding:
mean
standard deviation
skewness
are further extras.
I also think that not concentrating on existing libraries will result in a cleaner and simpler concept for a start.

hampusnasstrom Feb 17, 2023
Maintainer Author

I think we should separate out tables as something different. That would be, as you said, a list of our NumericalArray. Anyways, making a table only reduces the problem by one dimension.
It turns out that NumPy's default approach is to calculate the quantiles on the flattened array. I'm not sure how useful this would be, although for the median it might make sense.
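
A small sketch of that behaviour: with the default `axis=None`, `np.quantile` returns one value per requested quantile no matter how many dimensions the array has, whereas passing `axis` keeps the remaining dimensions. The seed and array shape are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(1.0, 0.1, (20, 30, 10))
qs = [0.25, 0.5, 0.75]

flat = np.quantile(a, qs)              # axis=None: computed over all elements
explicit = np.quantile(a.ravel(), qs)  # the same thing, flattened by hand
per_axis = np.quantile(a, qs, axis=0)  # one value per quantile and per remaining position

print(flat.shape)                    # (3,)
print(np.allclose(flat, explicit))   # True
print(per_axis.shape)                # (3, 30, 10)
```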

tomio13 · 2023-02-17T14:16:13Z

tomio13
Feb 17, 2023
Collaborator

Since every quantile only needs an ordered list of the observed values, the shape of the array does not matter.
You can consider it this way: here we provide a preview of all the data presented in one 'column', category, name it as you like. One element of the list.
Dimensions would matter if we do more detailed analysis, for which we need instructions from the user who knows where to look.

1 reply

hampusnasstrom Feb 17, 2023
Maintainer Author

Sounds good! So for the numerical array we would have these descriptors:

import json
import numpy as np

mu, sigma = 1, 0.1 # mean and standard deviation
a = np.random.normal(mu, sigma, (20, 30, 10))
qs = (0.05, 0.25, 0.5, 0.75, 0.95)
quants = np.quantile(a, qs)
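# Descriptors of the array; the statistics are computed over all elements
descriptors = {
    "dimensionality": a.ndim,
    "shape": a.shape,
    "mean": a.mean(),
    "min": a.min(),
    "max": a.max(),
    "standard_deviation": a.std(),
    # json.dumps serializes the float keys as strings
    "quantiles": dict(zip(qs, quants)),
}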

print(json.dumps(descriptors, indent=2))

{
  "dimensionality": 3,
  "shape": [
    20,
    30,
    10
  ],
  "mean": 1.0002707071723846,
  "min": 0.6440913244294681,
  "max": 1.3667379386039438,
  "standard_deviation": 0.09958481036605629,
  "quantiles": {
    "0.05": 0.8351471083009838,
    "0.25": 0.931646179009207,
    "0.5": 0.9991245263293533,
    "0.75": 1.0693072252478522,
    "0.95": 1.1614793219827306
  }
}

tomio13 · 2023-02-17T15:27:01Z

tomio13
Feb 17, 2023
Collaborator

You may want to swap / add median for the 0.5 quantile. I like the result 🙂 👍

1 reply

hampusnasstrom Feb 17, 2023
Maintainer Author

Good point, I agree! We would probably need to name the quantiles in general as property names cannot be numbers in NOMAD/Python. Something like:

{
  "dimensionality": 3,
  "shape": [
    20,
    30,
    10
  ],
  "mean": 0.9998186049542321,
  "min": 0.6356656443226303,
  "max": 1.3669845586063034,
  "standard_deviation": 0.0999057187786374,
  "ventile_1": 0.8367533792496997,
  "quartile_1": 0.931900335463369,
  "median": 1.0005747551415944,
  "quartile_3": 1.066327333292421,
  "ventile_19": 1.1633194210563471
}

We can always change the property names later. I will try to implement something and then we can test it out and iterate. Thanks for the input and the tip with the quantile function!

%%timeit
b, c, d = np.quantile(a, [0.25, 0.5, 0.75])

282 µs ± 21.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%%timeit
c = np.median(a)
b = np.percentile(a, 25)
d = np.percentile(a, 75)

634 µs ± 4.44 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
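
The flat, named layout proposed in this reply could be produced by a helper along these lines. It is a sketch only: the function name `describe` and the mapping `QUANTILE_NAMES` are hypothetical, the property names are the provisional ones from the example, and the explicit casts to built-in types are an assumption made so that integer arrays also serialize with `json`.

```python
import json

import numpy as np

# Provisional property names for the five quantiles (may change later)
QUANTILE_NAMES = {
    0.05: "ventile_1",
    0.25: "quartile_1",
    0.5: "median",
    0.75: "quartile_3",
    0.95: "ventile_19",
}


def describe(a) -> dict:
    """Return scalar descriptors of an array of any shape as plain Python types."""
    a = np.asarray(a)
    # A single call over all elements, as in the faster timing above
    quants = np.quantile(a, list(QUANTILE_NAMES))
    descriptors = {
        "dimensionality": int(a.ndim),
        "shape": [int(n) for n in a.shape],
        "mean": float(a.mean()),
        "min": float(a.min()),
        "max": float(a.max()),
        "standard_deviation": float(a.std()),
    }
    for name, value in zip(QUANTILE_NAMES.values(), quants):
        descriptors[name] = float(value)
    return descriptors


example = describe(np.arange(24).reshape(2, 3, 4))
print(json.dumps(example, indent=2))
```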
