
Commit 9d494d5

committed: update file names
1 parent 2776eba

14 files changed: 611 additions, 0 deletions

content/docs/api/_index.md

Lines changed: 165 additions & 0 deletions
@@ -0,0 +1,165 @@
---
weight: 4
title: "Modelling APIs"
bookCollapseSection: true
---

# API

The API is defined more in terms of file formats than in terms of data types. Two file formats are native to the data pipeline, and files in these formats are referred to as *data products*: TOML files and HDF5 files. TOML files store “small” parameter data, representing individual parameters. HDF5 files store structured data, encoded either as “arrays” or as “tables”. Both formats are described in more detail below, alongside the API functions used to interact with them. Data in any other file format are treated as binary blobs and are referred to as *external objects*.

Different metadata is stored about each: data products record information about their internal structure and the naming of their components, whereas external objects record information about their provenance (since data products are internal to the pipeline, their provenance is recorded separately). A single object can be both an external object and a data product, and thus have both sets of metadata recorded.

## Initialisation

The API must be initialised with the model URI and git SHA, which should then be set as run-level metadata.

## Additional metadata on write

The write functions all accept `description` and `issues` arguments.

## TOML (parameter) files

A parameter file contains representations of one or more parameters, each a single number, possibly with some associated uncertainty. Parameters may be represented as point estimates, parametric distributions, or sample data.

### File format

Parameters are stored in TOML-formatted files, with the extension “toml”, containing sections corresponding to different components. The following is an example of the internal encoding, defining three components: "`my-point-estimate`", "`my-distribution`", and "`my-samples`":

```toml
[my-point-estimate]
type = "point-estimate"
value = 0.1

[my-distribution]
type = "distribution"
distribution = "gamma"
shape = 1
scale = 2

[my-samples]
type = "samples"
samples = [1.0, 2.0, 3.0, 4.0, 5.0]
```

Point estimates are used when our knowledge of the parameter is only sufficient for a single value, with no notion of uncertainty. A point-estimate component must have `type = "point-estimate"` and a `value` that is either a float or an integer.

Distributions are used when our knowledge of a parameter can be represented by a parametric distribution. A distribution component must have `type = "distribution"`, a `distribution` set to the string name of the distribution, and other parameters determined by that distribution. The distributions required to be supported are listed below.

Samples are used when our knowledge of a parameter is represented by samples, drawn either from empirical measurements or from a posterior distribution. A samples component must have `type = "samples"` and a `samples` value that is a list of floats and integers.

#### Distributions

The supported distributions and their standardised parameter names are as follows:

| Distribution               | Standardised parameter names                |
| -------------------------- | ------------------------------------------- |
| categorical (non-standard) | bins (string array), weights (float array)  |
| gamma                      | k (float), theta (float)                    |
| normal                     | mu (float), sigma (float)                   |
| uniform                    | a (float), b (float)                        |
| poisson                    | lambda (float)                              |
| exponential                | lambda (float)                              |
| beta                       | alpha (float), beta (float)                 |
| binomial                   | n (int), p (float)                          |
| multinomial                | n (int), p (float array)                    |
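
For example, assuming the standardised parameter names are used directly as TOML keys, a normal-distribution component might be encoded as follows (a sketch; the component name is illustrative):

```toml
[my-normal-distribution]
type = "distribution"
distribution = "normal"
mu = 0.0
sigma = 1.0
```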

### API functions

`read_estimate(data_product, component) -> float or integer`

If the component is represented as a point estimate, return that value.

If the component is represented as a distribution, return the distribution mean.

If the component is represented as samples, return the sample mean (for `my-samples` above, 3.0).

`read_distribution(data_product, component) -> distribution object`

If the component is represented as a point estimate, fail.

If the component is represented as a distribution, return an object representing that distribution.

If the component is represented as samples, return an empirical distribution.

`read_samples(data_product, component) -> list of floats or integers`

If the component is represented as a point estimate, fail.

If the component is represented as a distribution, fail.

If the component is represented as samples, return the samples.

`write_estimate(data_product, component, estimate, description, issues)`

`write_distribution(data_product, component, distribution object, description, issues)`

`write_samples(data_product, component, samples, description, issues)`

## HDF5 files

<span style="font-size:12pt; color:red">Note that the following is subject to change. For example, we may want to add all of the metadata as attributes.</span>

An HDF5 file can be either a table or an array. A table is always 2-dimensional and might typically be used when each column contains a different class of data (*e.g.* integers and strings). Conversely, all elements in an array should be the same class, though the array itself might be 1-dimensional, 2-dimensional, or more (*e.g.* a 3-dimensional array comprising population counts, with rows as area, columns as age, and a third dimension representing gender).

You should create a single HDF5 file for a single dataset, unless you have a dataset that really should have been generated as multiple datasets in the first place (*e.g.* testing data mixed with care home data), in which case use your own judgement.

HDF5 files contain structured data, encoded as either an “array” or a “table”, both of which are described in more detail below.

### File format

HDF5 files are stored with the extension “h5”. Internally, each component is stored in a different (possibly nested) group, where the full path defines the component name (*e.g.* “path/to/component”). Inside the group for each component is either a value named “array” or a value named “table”. It is an error for there to be both.

#### array format

{component}/array
: An n-dimensional array of numerical data

{component}/Dimension_{i}_title
: The string name of dimension {{< katex >}}i{{< /katex >}}

{component}/Dimension_{i}_names
: String labels for dimension {{< katex >}}i{{< /katex >}}

{component}/Dimension_{i}_values
: Values for dimension {{< katex >}}i{{< /katex >}}

{component}/Dimension_{i}_units
: Units for dimension {{< katex >}}i{{< /katex >}}

{component}/units
: Units for the data in the array
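
As a concrete illustration, an array component holding population counts by area and age (a hypothetical component named “human/population”; dimension numbering shown as 1-based for illustration) might be laid out inside the HDF5 file as follows:

```
human/population/array               # 2-dimensional array of counts
human/population/Dimension_1_title   # "area"
human/population/Dimension_1_names   # string labels for each area
human/population/Dimension_2_title   # "age"
human/population/Dimension_2_values  # the age-band values
human/population/units               # units for the counts, e.g. "people"
```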

#### table format

{component}/table
: A dataframe

{component}/row_names
: String labels for the row axis

{component}/column_units
: Units for the columns
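
Similarly, a table component (again with a hypothetical component name) would contain:

```
records/cases/table          # the dataframe itself
records/cases/row_names      # string labels for the rows
records/cases/column_units   # one unit string per column
```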

### API functions

`read_array(data_product, component) -> array`

If the component does not terminate in an array-formatted value, raise an error.

Return an array, currently with no structural information.

`read_table(data_product, component) -> dataframe`

If the component does not terminate in a table-formatted value, raise an error.

Return a dataframe, with labelled columns.

`write_array(data_product, component, array, description, issues)`

If the array argument is not array-formatted, raise an error.

`write_table(data_product, component, table, description, issues)`

If the table argument is not table-formatted, raise an error.

content/docs/api/c/_index.md

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
---
weight: 5
title: "C"
bookCollapseSection: false
---

content/docs/api/cpp/_index.md

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
---
weight: 6
title: "C++"
bookCollapseSection: false
---

# cppDataPipeline

The `cppDataPipeline` library contains functions used to interface with the FAIR Data Pipeline in C++.

## Building

The library can be built using CMake and requires a C++11-compatible compiler:

```
cmake -Bbuild
cmake --build build
```

## Usage

See the [package documentation][docs] for instructions and examples.

## Source code

See the package's [code repo][repo].

[docs]: https://www.fairdatapipeline.org/cppDataPipeline/
[repo]: https://github.com/FAIRDataPipeline/cppDataPipeline

content/docs/api/fortran/_index.md

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
---
weight: 7
title: "FORTRAN"
bookCollapseSection: false
---

content/docs/api/java/_index.md

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
---
weight: 4
title: "Java"
bookCollapseSection: true
---

# javaDataPipeline

The javaDataPipeline library allows Java modellers to interface with the FAIR Data Pipeline.
It is available on Maven Central:

```gradle
dependencies {
    implementation 'org.fairdatapipeline:api:1.0.0-beta'
}
```

The main purpose of interfacing with the FAIR Data Pipeline is the recording of Coderuns: a Coderun is a recorded session of your modelling code, in which all of the inputs used and outputs generated are recorded and a [Provenance Report][prov] is generated.

The basic idea behind the FAIR Data Pipeline and Coderuns is explained on the [Data pipeline](/docs/interface/) page.

## Usage

A useful starting point to get to know the javaDataPipeline is to explore the [java Simple Model](https://github.com/FAIRDataPipeline/javaSimpleModel). This shows example [config.yaml](https://www.fairdatapipeline.org/docs/interface/config/) files, example `main(String[] args)` code that allows `fair run` to call your code, and example `Coderun()` code for reading/writing input/output Data Products.
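
For orientation only, the following sketch shows roughly what such a `main(String[] args)` method can look like. It assumes (consistent with the manual-run example on the [Debug](debug) page) that your code receives the generated job directory as its first argument and that this directory contains the working `config.yaml` and `script.sh`; the package name, the `FDP_LOCAL_TOKEN` environment variable, and the token handling are illustrative assumptions, so follow the [java Simple Model](https://github.com/FAIRDataPipeline/javaSimpleModel) and the [javaDocs][docs] for the real thing.

```java
import java.nio.file.Path;

import org.fairdatapipeline.api.Coderun; // assumed package name; check the javaDocs

public class SimpleModel {
    public static void main(String[] args) throws Exception {
        // `fair run` invokes the model with the generated job directory as its argument
        Path jobDir = Path.of(args[0]);
        Path configPath = jobDir.resolve("config.yaml");
        Path scriptPath = jobDir.resolve("script.sh");

        // How the registry token reaches the model is an assumption here (e.g. an environment variable)
        String token = System.getenv("FDP_LOCAL_TOKEN");

        try (Coderun coderun = new Coderun(configPath, scriptPath, token)) {
            // read input data products and write output data products here;
            // see the Basic Coderun page for the read/write calls
        }
    }
}
```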

The links below give a basic introduction to the javaDataPipeline; these should be read in conjunction with the [javaDocs][docs], which give a more precise specification of the Java API.

- [Basic Coderun](coderun)
- [Issues](issues)
- [Parameters](parameters)
- [HDF5](hdf5)
- [Run](run)
- [Debug](debug)

See the [package documentation][docs] for Javadoc.

## License/copyright

Copyright (c) 2021: Bram Boskamp, Biomathematics and Statistics Scotland and the Scottish COVID-19 Response Consortium

GNU GENERAL PUBLIC LICENSE Version 3

## Source code

See the package's [code repo][repo].

[docs]: https://www.fairdatapipeline.org/javaDataPipeline
[repo]: https://github.com/FAIRDataPipeline/javaDataPipeline
[prov]: /docs/data_registry/prov_report/

content/docs/api/java/coderun.md

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
---
weight: 10
---

# javaDataPipeline Coderun

The main class used to interface with the FAIR Data Pipeline in Java is the `Coderun` class.

Users should initialise the `Coderun` instance using a *try-with-resources* block or ensure that `.close()` is explicitly called upon finishing.

The `Coderun` constructor needs a `Path` to the `config.yaml`, a `Path` to the `script.sh`, and the registry authentication token.

The user would then access the input data product(s) using `coderun.get_dp_for_read(dataproduct_name)`
and the output data product(s) using `coderun.get_dp_for_write(dataproduct_name, extension)`.

From the resulting data_product we can then access either one unnamed object_component, `Object_component_read oc = dp.getComponent()`,
or any number of named object_components, `Object_component_read oc1 = dp.getComponent("ComponentName")`.

Data in FAIR Data Pipeline internal formats (HDF5, TOML) is written to (or read from) named `Object_components`, while other file formats (such as CSV) are written to (or read from) a single unnamed `Object_component`.

Here is an example of a `Coderun` writing a CSV file to one unnamed `Object_component`:

```java
try (var coderun = new Coderun(configPath, scriptPath, token)) {
    Data_product_write dp = coderun.get_dp_for_write(dataProduct, "csv");
    Object_component_write oc = dp.getComponent();
    Path p = oc.writeLink();
    write_my_csv_data_to_path(p);
}
```

And for reading:

```java
try (var coderun = new Coderun(configPath, scriptPath, token)) {
    Data_product_read dp = coderun.get_dp_for_read(dataProduct);
    Object_component_read oc = dp.getComponent();
    Path p = oc.readLink();
    read_my_csv_data_from_path(p);
}
```

Here is an example of a `Coderun` writing Samples to one named `Object_component`:

```java
try (var coderun = new Coderun(configPath, scriptPath)) {
    ImmutableSamples samples = ImmutableSamples.builder().addSamples(1, 2, 3).rng(rng).build();
    String dataProduct = "animal/dodo";
    String component1 = "example-samples-dodo1";
    Data_product_write dp = coderun.get_dp_for_write(dataProduct, "toml");
    Object_component_write oc1 = dp.getComponent(component1);
    oc1.writeSamples(samples);
}
```
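
Reading a named component back follows the same pattern. The sketch below is a plausible counterpart to the write example above; the `readSamples()` method name and return type are assumptions made by analogy with `writeSamples()`, so check the [Parameters](parameters) page and the javaDocs for the exact read API:

```java
// Sketch only: readSamples() is assumed by analogy with writeSamples(); verify against the javaDocs.
try (var coderun = new Coderun(configPath, scriptPath, token)) {
    Data_product_read dp = coderun.get_dp_for_read("animal/dodo");
    Object_component_read oc1 = dp.getComponent("example-samples-dodo1");
    var samples = oc1.readSamples(); // return type is also an assumption
    use_my_samples(samples);
}
```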

content/docs/api/java/debug.md

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
---
weight: 60
---

# Debugging / troubleshooting

## Logger

The javaDataPipeline library uses *SLF4J* (Simple Logging Facade for Java).

You can either use this with the SLF4J simple logger, or it can bind to your own favourite logging framework.
In order to set the default simple logger to more verbose logging:

- add the `org.slf4j:slf4j-simple` dependency (see the sketch below)
- set the log level:

```java
System.setProperty(org.slf4j.impl.SimpleLogger.DEFAULT_LOG_LEVEL_KEY, "TRACE");
```
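
For example, with Gradle the simple-logger dependency might be added like this (the version number is illustrative, not prescribed by the library):

```gradle
dependencies {
    // binds SLF4J to its bundled simple logger; version shown is illustrative
    implementation 'org.slf4j:slf4j-simple:1.7.36'
}
```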

## Running manually

Instead of running:

```
fair run src/main/resources/config.yaml
```

you can call `fair run` and your actual Java code separately:

```
fair run --ci src/main/resources/config.yaml
gradle run --args "/tmp/tmpXXXXXXX/data_store/jobs/<timestamp>"
```

The actual tmp location is shown in the output of the `fair run` command.

This may help with debugging and troubleshooting.

content/docs/api/java/hdf5.md

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
---
weight: 40
---

# HDF5

HDF5 files are not supported in the current version of the Java FAIR Data Pipeline API.
