ybrugnara edited this page Jan 9, 2020 · 24 revisions

The Station Exchange Format (SEF)

10-Dec-2019: The SEF has been updated to v1.0.0. This version is fully compatible with the previous one (v0.2.0); however, a few changes have been made in the usage guidelines. These changes are highlighted in bold italic in the following text.

This is a file format specification and R package for newly-digitised historical weather observations. It has been produced to support the work of the Copernicus Data Rescue Service, in particular to allow people rescuing observations to present them for widespread use in a simple but standard format.

This format is for land observations from fixed stations. Marine observations, and moving observing platforms, should use the IMMA format instead.

What is it, and why?

Weather data rescue is the process of getting historical weather observations off paper, into digital formats, and into use. This is typically done in two steps:

  1. A transcription step which finds observations archived on paper and produces digital versions of those observations - typically as Excel spreadsheets or in a similar format.
  2. A database-building step which converts the new digital observations into the format and schema used by an observations database, and adds the observations to the database.

These two steps are usually done by different people: the first by a large group of observation experts (transcribers), each interested in a different set of to-be-digitised observations; the second by a small group of synthesisers trying to make the best possible database. The split between the steps causes problems: the output of step one (variably-structured Excel spreadsheets) is poorly suited as the input of step two. We can't ask the transcribers to produce database-ready output, because this requires them to know too much about the precise and idiosyncratic details of each database, and we can't expect the synthesisers to work with millions of variably-structured Excel spreadsheets - partly because they would have to learn too much about the idiosyncrasies of each observation source, and partly because there are many fewer synthesisers than transcribers. The practical effect of this is that observations pile up in a transcribed-but-unusable state, and it takes too long to get them into use.

The Station Exchange Format (SEF) is a proposed new output for the transcription step. It aims to eliminate the bottleneck between the steps by specifying a single data format that's suitable both as the output of step one and the input to step two. This means the format must have two, somewhat contradictory, properties:

  1. It must be machine readable with NO human involvement – so it needs all the necessary metadata in an unambiguous arrangement. Otherwise it’s too expensive for synthesisers to read.
  2. It must be easy for non-experts to read, understand, and create. In particular we need people to look at a couple of examples, think ‘OK, I can make that’ and get on with it, rather than ‘Oh, looks tricky, I won’t do that right now’. Otherwise it’s too expensive for transcribers to create.

If SEF is successful, 99% of users will be transcribers writing it, and the problems of reading it can be confined to a couple of software libraries. So it needs to be possible to read it unambiguously, but it doesn’t matter how slow or difficult this is – it matters a great deal if it’s hard to create. The best format will be adequate for readers, but optimised for creators. That means plain text, editable in a text editor, editable in a spreadsheet program, opens in the right program when double clicked; easy to read and write in Python, R, and even Fortran.

Success in data rescue means recruiting a lot of transcribers, and getting them to make SEF files. It's reasonable to assume that every difficulty given to the SEF creator will halve the number of people willing to put their data in the format (and so reduce the data we get). So SEF needs to be so simple that you don’t even need to read the instructions. In particular we need to protect SEF creators from the Common Data Model and all similar necessary complexities.

The current design tries to be both simple enough to be obvious, and powerful enough to be useful, by having a core set of headers and columns which are obvious, and an arbitrarily extensible metadata section. Most users should be able to create a base SEF file by modifying a standard example without ever having to look up what any of the columns mean, and any community can customise the format to their precise requirements by specifying their own set of required metadata. Metadata standards can evolve with use, rather than being specified at the start.

The file format

SEF files look like this:

SEF	1.0.0
ID	Rosario_Santa_Fe
Name	Rosario de Santa Fe
Lat	-32.945
Lon	-60.333
Alt	36
Source	C3S_SouthAmerica
Link	https://data-rescue.copernicus-climate.eu/lso/1086330
Vbl	p
Stat	point
Units	hPa
Meta	Alias=Rosario|PTC=T|PGC=T|Data policy=Open|QC software=dataresqc v1.0.0
Year	Month	Day	Hour	Minute	Period	Value	Meta
1886	3	18	12	17	0	1005.14	orig=758.2mm|atb=24.8C|orig.time=8am
1886	3	18	22	17	0	866.94	orig=653.3mm|atb=25.6C|orig.time=6pm|qc=climatic_outliers
1886	3	19	12	17	0	1006.28	orig=758.6mm|atb=21.5C|orig.time=8am
1886	3	19	22	17	0	1005.43	orig=758.1mm|atb=22.9C|orig.time=6pm
1886	3	20	12	17	0	1010.00	orig=761mm|atb=18.6C|orig.time=8am
1886	3	20	22	17	0	1008.62	orig=760.5mm|atb=22.5C|orig.time=6pm

One SEF file contains observations of one variable from one station. It is a text file encoded as UTF-8. It is a tab-separated values file and should have a .tsv extension. This means it can be easily viewed and edited in any text editor or spreadsheet program (though care should be taken to preserve the tab structure and text encoding).
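As a rough illustration of how little machinery the layout needs, the sketch below splits SEF text into its 12 header lines and its data table using only the Python standard library. The function name parse_sef_text is hypothetical and is not taken from the R package or the Python API mentioned in this page; the sample is a shortened copy of the file shown above.

```python
import csv
import io

HEADER_LINES = 12  # the spec puts 12 name/value header lines before the data table


def parse_sef_text(text):
    """Split SEF text into a header dict and a list of observation dicts.

    parse_sef_text is a hypothetical helper, not part of any published package.
    """
    lines = text.splitlines()
    header = {}
    for line in lines[:HEADER_LINES]:
        # Each header line is "name<TAB>value"; a blank value stays "".
        name, _, value = line.partition("\t")
        header[name] = value
    # Line 13 holds the column names, so DictReader can pick them up itself.
    table = csv.DictReader(
        io.StringIO("\n".join(lines[HEADER_LINES:])),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,
    )
    return header, list(table)


sample = "\n".join([
    "SEF\t1.0.0",
    "ID\tRosario_Santa_Fe",
    "Name\tRosario de Santa Fe",
    "Lat\t-32.945",
    "Lon\t-60.333",
    "Alt\t36",
    "Source\tC3S_SouthAmerica",
    "Link\thttps://data-rescue.copernicus-climate.eu/lso/1086330",
    "Vbl\tp",
    "Stat\tpoint",
    "Units\thPa",
    "Meta\tAlias=Rosario|PTC=T|PGC=T|Data policy=Open|QC software=dataresqc v1.0.0",
    "Year\tMonth\tDay\tHour\tMinute\tPeriod\tValue\tMeta",
    "1886\t3\t18\t12\t17\t0\t1005.14\torig=758.2mm|atb=24.8C|orig.time=8am",
    "1886\t3\t18\t22\t17\t0\t866.94\torig=653.3mm|atb=25.6C|orig.time=6pm|qc=climatic_outliers",
])

header, rows = parse_sef_text(sample)
print(header["ID"], header["Units"], len(rows), rows[0]["Value"])
```

All values are kept as strings here, so converting Value to a number and treating NA or blank as missing is left to the caller.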

Header

The first 12 lines of the file are a series of headers, each given as a name/value pair separated by a tab. They must be in the order given below. Missing values can be given as NA or left blank. The SEF version number must be present.

  • SEF: The first three characters in the file must be SEF. The associated value is the semantic version of the format used. This enables software to recognise the format and read the rest of the file correctly. At the moment, version 1.0.0 is in use.

  • ID: This is the machine readable name of the station. It may contain only lower-case or upper-case Latin letters, numbers or the characters: - (dash), _ (underscore) or . (full stop). It must not contain blanks. There is no length limit.

  • Name: Station name - any string (excluding tabs and carriage returns). This is the human readable name of the station.

  • Lat: Latitude of the station (degrees north as decimal number).

  • Lon: Longitude of the station (degrees east as decimal number).

  • Alt: Altitude of the station (meters above sea-level).

  • Source: Source identifier. This is for making collections of SEF files and identifies a group of files from the same source. It will be set by the collector. Any string (excluding tabs and carriage returns).

  • Link: Where to find additional metadata (a web address). SEF users are strongly recommended to add their metadata to the C3S DRS metadata registry and then link to the appropriate page in that service.

  • Vbl: Name of the variable included in the file. There is a recommended list of standard variable names. Use this if possible.

  • Stat: What statistic (mean, max, min, ...) is reported from the variable. There is a recommended list of standard statistics. Use this if possible.

  • Units: Units in which the variable value is given in the file (e.g. 'hPa', 'Pa', 'K', 'm/s'). Where possible, this should be compliant with UDUNITS-2. The units in which the values were originally measured can be given in the Meta column (see Data table section).

  • Meta: Anything else. Pipe-separated (|) string of metadata entries. Each entry may be any string (excluding tabs, pipes, and carriage returns). There is a standard list of meaningful entries, but other entries can be added as necessary. Metadata specified here is assumed to apply to all observations in this file, unless overridden by the observation-specific metadata entry.
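The header rules above (fixed order, mandatory version, restricted character set for ID) are simple enough to check mechanically. The following is a minimal sketch of such a check in Python; check_header, HEADER_NAMES and ID_PATTERN are hypothetical names introduced here, and the function is not a substitute for the check_sef function of the R package, whose exact tests are not described in this page.

```python
import re

# Header names in the order required by the specification.
HEADER_NAMES = ["SEF", "ID", "Name", "Lat", "Lon", "Alt",
                "Source", "Link", "Vbl", "Stat", "Units", "Meta"]

# Latin letters, digits, dash, underscore and full stop; no blanks.
ID_PATTERN = re.compile(r"[A-Za-z0-9._-]+")


def check_header(lines):
    """Return a list of problems found in the first 12 lines of a SEF file."""
    if len(lines) < len(HEADER_NAMES):
        return ["file has fewer than 12 header lines"]
    problems = []
    values = {}
    for expected, line in zip(HEADER_NAMES, lines):
        name, _, value = line.partition("\t")
        if name != expected:
            problems.append(f"expected header {expected!r}, found {name!r}")
        values[name] = value
    # The version is the only header that may not be missing.
    if values.get("SEF", "") in ("", "NA"):
        problems.append("SEF version number is missing")
    # A blank ID counts as missing; anything else must match the pattern.
    station_id = values.get("ID", "")
    if station_id and not ID_PATTERN.fullmatch(station_id):
        problems.append(f"ID {station_id!r} contains characters that are not allowed")
    return problems


example_header = [
    "SEF\t1.0.0",
    "ID\tRosario_Santa_Fe",
    "Name\tRosario de Santa Fe",
    "Lat\t-32.945",
    "Lon\t-60.333",
    "Alt\t36",
    "Source\tC3S_SouthAmerica",
    "Link\thttps://data-rescue.copernicus-climate.eu/lso/1086330",
    "Vbl\tp",
    "Stat\tpoint",
    "Units\thPa",
    "Meta\tAlias=Rosario|PTC=T|PGC=T|Data policy=Open|QC software=dataresqc v1.0.0",
]

print(check_header(example_header))
```

The sketch deliberately ignores checks the text does not spell out, such as valid ranges for Lat and Lon or whether Vbl and Units come from the recommended lists.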

Data table

Lines 13 and onward in the file are a table of observations. Line 13 is a header, lines 14 and on are observations. Missing values can be given as NA or left blank. The table must contain these columns in this order:

  • Year: Year in which the observation was made (UTC). An integer.

  • Month: Month in which the observation was made (UTC). An integer (1-12). For annual data, it is recommended to leave this column empty (or NA) when referring to calendar years.

  • Day: Day of month in which the observation was made (UTC). An integer (1-31). For monthly, seasonal, or annual data, it is recommended to leave this column empty (or NA) when referring to calendar months or years.

  • Hour: Hour at which the observation was made (UTC). An integer (0-24). The use of 24 is recommended for daily values calculated from midnight to midnight (UTC). This is to avoid ambiguities in the date.

  • Minute: Minute at which the observation was made (UTC). An integer (0-59).

  • Period: Time period of observation (instantaneous, sum over previous 24 hours, ...). There is a table of meaningful codes.

  • Value: The observation value. It is recommended to round the value to a meaningful number of decimal places.

  • Meta: Anything else. Pipe-separated (|) string of metadata entries. Each entry may be any string (excluding tabs, pipes, and carriage returns). There is a standard list of meaningful entries, but other entries can be added as necessary. Metadata specified here only applies to this observation, and overrides any file-wide specification.
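Two of the column rules above benefit from a worked example: Hour may be 24, which standard date libraries usually reject, and observation-level Meta entries take precedence over file-level ones. The Python sketch below shows one way to handle both. The helper names are hypothetical, and treating each Meta entry as a key=value pair is an assumption based on the example file (the specification itself allows any string as an entry). Rows with missing Month, Day, Hour or Minute are not handled.

```python
from datetime import datetime, timedelta, timezone


def observation_time(row):
    """Build a UTC datetime from one data row (a dict of strings).

    Hour 24 is accepted and rolls over to 00:00 of the following day.
    """
    year, month, day = int(row["Year"]), int(row["Month"]), int(row["Day"])
    hour, minute = int(row["Hour"]), int(row["Minute"])
    midnight = datetime(year, month, day, tzinfo=timezone.utc)
    return midnight + timedelta(hours=hour, minutes=minute)


def parse_meta(text):
    """Turn 'a=1|b=2' into {'a': '1', 'b': '2'}; entries without '=' map to ''."""
    entries = {}
    if text and text != "NA":
        for entry in text.split("|"):
            key, _, value = entry.partition("=")
            entries[key.strip()] = value.strip()
    return entries


def effective_meta(file_meta, row_meta):
    """Merge file-wide and per-observation metadata; the observation wins."""
    merged = parse_meta(file_meta)
    merged.update(parse_meta(row_meta))
    return merged


row = {"Year": "1886", "Month": "3", "Day": "18", "Hour": "12", "Minute": "17",
       "Period": "0", "Value": "1005.14",
       "Meta": "orig=758.2mm|atb=24.8C|orig.time=8am"}

print(observation_time(row).isoformat())
print(effective_meta("Alias=Rosario|PTC=T|PGC=T", row["Meta"]))
```

Adding the hour and minute as a timedelta to midnight, rather than passing them to the datetime constructor, is what lets Hour 24 work without a special case.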

Examples

Examples of SEF files, alongside the original digitisation spreadsheets, metadata, and conversion scripts, can be found here.

Station relocations and homogenised data

When a station is relocated and gets new coordinates, a new SEF file should be created.

Even though the SEF was principally designed for raw data, it is also possible to use it for homogenised data. Specific metadata entries have been pre-defined for that. In the case of homogenised data, a single SEF file is sufficient. The coordinates indicated in the header must be those of the location with respect to which the data have been adjusted (usually the most recent location).

R Package

R functions are provided to facilitate reading and writing SEF files. You can install them from the R command line with:

devtools::install_github("C3S-Data-Rescue-Lot1-WP3/SEF")

In particular:

  • read_sef reads a SEF file into an R data frame.
  • read_meta reads one or more fields from the SEF header.
  • write_sef transforms an R data frame into a SEF file.
  • check_sef verifies the compliance of a SEF file with these guidelines.

Python API

Functions to manipulate SEF files are also available for Python here.

Authors and acknowledgements

This document was created by Philip Brohan (UKMO) and is currently maintained by Yuri Brugnara (University of Bern; yuri.brugnara@giub.unibe.ch). The file format specification is the responsibility of the Copernicus Data Rescue Service.
