
Using the Midden Dataset Editor

bryanrcarlson edited this page Sep 15, 2023 · 6 revisions

Overview

Metadata is the heart of Midden, and the tools for creating it are the Dataset Editor and the Project Editor. This page covers the Dataset Editor.

NOTE: The Dataset Editor can be accessed through the browser at: {your midden address}/editor/dataset. For example, if you deployed using GitHub Pages under an organization called "TheRads", the address will be: https://therads.github.io/Midden/editor/dataset.

Midden is designed to allow the creation of metadata early in the data lifecycle. In this vein, there are few required fields. Similarly, although Midden is designed to foster good data management practices, it prefers flexibility and agility over rigid convention. This places responsibility on the metadata creators: the quality of the resulting data catalog depends heavily on the thought that is put into the metadata.

Midden is not opinionated, but it should be used in an opinionated way.

The whole purpose of the Dataset Editor is to create a file that describes a dataset.

Workflow

In short, a person uses the Dataset Editor to describe a dataset, downloads a .midden file, and places that file alongside the dataset.

  1. Create a new metadata file by clicking the "New" button, or edit existing metadata by clicking "Upload" or by clicking "Edit" while viewing a dataset
  2. Edit the metadata fields (see "Metadata fields" below for details)
  3. Review the information by clicking "Preview"
  4. Download the .midden file by clicking "Download"
  5. Move the downloaded .midden file to the location where the associated dataset is saved

Metadata fields

Basic

This section contains essential information pertaining to the described dataset. Metadata is considered adequate if these fields are completed.

Zone

The data zone that the dataset belongs to.

NOTE: Items in the dropdown menu are populated by the zones array in the app-config.json file
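The exact contents of app-config.json depend on your deployment. As an illustrative sketch only (the zone names below are made up, and the actual schema of the array may differ), the zones array might look like this:

```json
{
  "zones": ["raw", "working", "curated"]
}
```

The same pattern applies to the other dropdown-backed fields described below (roles, tags, qualityControlTags, processingLevels, variableType, geometries, datasetStructures), each populated from its own array in app-config.json.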

Name

The name of the dataset.

NOTE: The "Name" here also determines the name of the .midden file that is created by the Editor. For example, a "Name" value of "MyDataset_v1" would result in a file called "MyDataset_v1.midden".

Project

The name of the project that the dataset belongs to.

Description

A description of the dataset. Can be short, but should include enough information for a data user to understand the basic origin and purpose of the data.

BEST PRACTICE: If the metadata is updated, a timestamp and description of the update should be included here. For example: "2021-03-03: Added variable definitions"

Contacts

Contact information for contributors to the dataset. Because Midden protects the data itself by not providing download links, the contact information here is important for potential data users to start a conversation about access.

Name

The name of the contact person.

Email

The email address of the contact person.

Role

The role of the contact person.

NOTE: Items in the dropdown menu are populated by the roles array in the app-config.json file. The default values for roles come from the ISO 19115 metadata standard, as described here: https://wiki.esipfed.org/ISO_19115-3_Codelists#CI_RoleCode

Tags

Tags (i.e. "labels", "hashtags") are used to make the dataset more discoverable. The "Catalog" supports browsing datasets by tags so users can find similar datasets that have the same tag.

It is recommended that a dataset contain at least a few tags, and that tag values be as consistent as possible.

NOTE: Items in the dropdown menu are populated by the tags array in the app-config.json file. The default values for tags are from the ISO 19115 metadata standards and the EPA metadata standards.

BEST PRACTICE: Consistency in tag names is important for data discovery. It is recommended that organizations have custom tags with an [Org] prefix defined in the app-config.json file, where "Org" can be any short term that represents the organization.
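For example, an organization abbreviated "CAF" might define prefixed custom tags in its app-config.json like the following (the tag values here are purely illustrative, and the array schema is assumed):

```json
{
  "tags": ["[CAF] soil", "[CAF] weather", "[CAF] long-term experiment"]
}
```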

Variables

This represents the variables within the dataset; i.e. the data dictionary.

Name

The name of the variable (e.g. a column header in a csv file or a field name in a shapefile/geojson).

BEST PRACTICE: Some datasets, such as Excel Workbooks, have nested data structures where variable names may be spread across different Worksheets. To specify such names, a forward slash ("/") can be used: e.g. worksheet1/myVariable.

Description

The description of the variable; should include any coded values, expected types, ranges, etc. Also describes formats (e.g. ISO 8601 dates) and meaning of qualitative values (if units do not apply).

Units

The units of the variable, if applicable.

Methods

A list of method details that are specific to the variable; sensors, analytic equipment, etc.

Quality Controls

A list of quality control checks that have been applied to the variable. These are intended to be general categories of checks that can be used to filter variables when searching for certain quality. Specific details of the quality control checks can be described in the methods field.

NOTE: Items in the dropdown menu are populated by the qualityControlTags array in the app-config.json file. The default values are those used by the Cook Agronomy Farm LTAR site, as described here: https://docs.google.com/document/d/1ufsDxVAh0E_PTHp-uGKmPzok3adPvPEYbxuFp5A8Uds

Processing

Indication of the origin of the value. Similar to the Quality Control tag, this is intended to be a general category used to filter variables when searching for a certain level of processing (e.g. raw data vs modeled data). Details of the processing should be defined in the methods field.

NOTE: Items in the dropdown menu are populated by the processingLevels array in the app-config.json file. The default values are those used by the Cook Agronomy Farm LTAR site, as described here: https://docs.google.com/document/d/1ufsDxVAh0E_PTHp-uGKmPzok3adPvPEYbxuFp5A8Uds

Type

This provides additional filtering ability and further context to the variable. Examples based on statistical fields are "discrete", "continuous", "nominal", etc.

NOTE: Items in the dropdown menu are populated by the variableType array in the app-config.json file. The default values are those used by the Cook Agronomy Farm LTAR site and are loosely based on definitions in dimensional modeling. A "dimension" describes the "who, what, where, when, why, and how". A "metric" is a measurement (quantitative or, stretching the formal definition, nominal/ordinal).

Tags

A list of tags specific to each variable.

BEST PRACTICE: Any controlled terms that can be used as an analog to the variable name can be specified here.

Height

The height of the measurement respective to the ground; positive indicates above ground, negative indicates below ground.

NOTE: This is deprecated and will likely not be used in future versions of Midden

Coverage

These fields specify the spatial and temporal coverage of the dataset.

Spatial Repeats

The number of locations of repeated measurements represented in the dataset. For example, a dataset containing soil temperature measurements at five locations, each at five depths, would have a spatial repeats value of 25 (5 × 5).

Spatial Extent

The area in which the data were collected or that the data represent. This should be expressed as a valid GeoJSON geometry: a point, line, or polygon.

NOTE: Items in the dropdown menu are populated by the geometries array in the app-config.json file.

BEST PRACTICE: Although any polygon is valid, a bounding box is recommended over a complex polygon to keep the file-size of the generated metadata small.

NOTE: Until Midden has an embedded map tool, consider using the online tool https://geojson.io to obtain valid GeoJSON. Copy the geometry object, starting with the opening curly brace "{" and including everything up to the matching closing curly brace "}". E.g.: {"type":"Polygon","coordinates":[[[...]]]}
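As an illustration (the coordinates below are made up), a bounding-box polygon in valid GeoJSON looks like the following. Note that GeoJSON coordinates are ordered [longitude, latitude], and the ring is closed by repeating the first coordinate pair as the last:

```json
{
  "type": "Polygon",
  "coordinates": [[
    [-117.09, 46.78],
    [-117.07, 46.78],
    [-117.07, 46.80],
    [-117.09, 46.80],
    [-117.09, 46.78]
  ]]
}
```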

Temporal Resolution

The frequency at which the variables of the dataset were measured. Air temperature measured every 15 minutes might have the value "15 min"; a dataset containing plant community survey data taken annually might have a value of "1 year" or "annually".

BEST PRACTICE: Be consistent with how temporal resolution is defined to make it more machine readable: e.g. choose between using 1 year or annually and do not mix them.

Temporal Extent

The starting and ending dates that contain the time the data were collected.

BEST PRACTICE: Consider using the ISO 8601 format for time-intervals: e.g. 1997-07-16/1997-07-17 corresponds to a time-period starting on July 16, 1997, and ending on July 17, 1997.

Structure

Use these fields to specify the structure of the dataset to aid in machine-readability. Ideally, a consumer of the metadata should have enough information to read the dataset without any further exploration (e.g. a person can write a script to download the data).

File Format

The format that the data are stored in. This could be a file extension (e.g. .json, .txt, .jpg), general category (e.g. tabular, image, time-series), or some standard (e.g. MIME types: text/csv, image/gif, application/java-archive).

File Path Template

A description of the directory and file structure within the dataset folder, if applicable. For example, this can be used to describe a dataset comprised of time-series files generated every hour and separated into monthly folders: {YYYY-MM}/{DD}T{hh}:{mm}_{VariableName}.csv. Under this template, a file for the variable "AirTemp" collected at 14:00 on 2021-03-03 would be 2021-03/03T14:00_AirTemp.csv.

File Path Description

A description of the file path template in which each variable is explained. E.g. "{YYYY-MM} is the four digit year (YYYY) and two digit month (MM) that the data were collected...."

Dataset Structure

A category tag that broadly indicates how the data are structured.

NOTE: Items in the dropdown menu are populated by the datasetStructures array in the app-config.json file. The default values are Single and Multiple: Single represents a dataset containing multiple files that are different versions of the dataset, and Multiple represents a dataset comprised of multiple files that can be aggregated together.

Processing

These fields are used to describe how the dataset was created and any associated products.

Methods

The methods used to generate the dataset. Depending upon scope, this could include field methods, data processing, data pipelines, and so on.

Parent Datasets

This is used to specify datasets that this dataset was derived from. Values are expected to be linked resources (URL/DOI), but a citation or reference is fine. This field is important for documenting data lineage.

NOTE: Listing the full URL of metadata in your Midden catalog is encouraged and may be formally supported in the future (perhaps as a visualization of the dependency graph).

Derived Works

This is used to indicate related products that use the dataset; published papers, presentations, decision support tools, etc.