
Math Databases


Contents

The main contents of the catalogue are available as an overview table at mathdb.mathhub.info.

Collections and databases that are not yet entered in the overview table are listed in the section TODO.

Call to action

If you have a collection or database that is not yet contained in the catalogue, let me know!

I will also be very grateful for any other comments you might have.

To do so, please open an issue with the label math db. For other ways of contacting me, see my FAU page.

TODO

Possibly more could be found on EUDAT.

MathOverflow

Look for keywords such as atlas, census, collection, or database.

Unsorted resources:

Miscellaneous

Index

  • Database Requirements
  • Updated Requirements
  • FAIR Principles
  • Contents
  • Collection features, classification

Database Requirements

Go up to index.

In alphabetical order.

  1. Citable
    The records in the database need to be citable.
  2. Collaborative
    The database should be collaborative. It needs to be easy for people to contribute, and it must be possible to find out, for any piece of data, who contributed it and when. In particular,
    1. adding a new collection should ideally be a declarative task rather than a programming one: it should suffice to describe the structure of the collection (see the sketch after this list),
    2. changes such as adding new kinds of objects, new collections, new properties, or updates to existing values need to be tracked by the database.
  3. Coverage information
    The database interfaces should make explicit the assumptions made by the system and collections, as well as provide information on the completeness of search results. For example, for graphs: "the search results are complete for your search parameters up to order 511".
  4. Decentralised
    The database should not rely on the existence of a central authority. An entry in a local copy of the database should be easily distributable to other copies without any third-party intervention.
  5. Interoperable
    Other systems (such as databases and computer algebra systems) should be able to interact with the database. This requires well-defined APIs.
  6. Non-redundant
    The objects should be stored up to isomorphism.
  7. Provenance
    In addition to information on who produced the data and when, the database needs to provide information on how the data were obtained (algorithms used, software libraries, pipelines, etc.).
  8. Searchable
    The database should have at least basic search (filter) functionality for objects.
  9. Self-explaining
    The database interfaces should make definitions of concepts and further information easily accessible.
  10. User-friendly
    The database should provide suitable interfaces. This means
    1. simple, intuitive and easily accessible interfaces for casual users (it should avoid exposing the casual user to low-level interfaces such as Python, SageMath, SQL, ...),
    2. making usage easy or easier for power users.
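
To make the Collaborative and Coverage information requirements concrete, here is a minimal sketch of what a declarative collection description could look like. The schema and every field name in it are made up for illustration and do not describe any existing system.

```python
# A hypothetical declarative description of a collection (requirement 2.1):
# describing the structure should be enough to add the collection.
cubic_vt_graphs = {
    "name": "Cubic vertex-transitive graphs",
    "object_type": "graph",
    "encoding": "graph6",            # how each object is serialised
    "properties": {                  # per-object data and their types
        "order": "integer",
        "girth": "integer",
        "is_bipartite": "boolean",
    },
    # Coverage information (requirement 3): the interface can use this to
    # tell users whether their search results are complete.
    "coverage": {"parameter": "order", "complete_up_to": 1280},
    # Contribution tracking and provenance (requirements 2.2 and 7).
    "contributed_by": "someone@example.org",
    "contributed_on": "2019-04-17",
    "produced_with": ["census algorithm", "SageMath"],
}
```

Given such a description, the database software itself could set up the storage, validate contributed records, and answer coverage queries, so adding a collection would require no custom code.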

Updated Requirements

Go up to index.

This is a draft.

FAIR

The descriptions of the principles are short excerpts from GO FAIR.

Findable

F1: (Meta)data are assigned globally unique and persistent identifiers

F1 at GO FAIR

Globally unique and persistent identifiers remove ambiguity in the meaning of your published data by assigning a unique identifier to every element of metadata and every concept/measurement in your dataset.

  • Dataset: The dataset has a globally unique and persistent identifier.
  • Datum: The data are assigned globally unique and persistent identifiers.
  • Metadata: The metadata have a globally unique and persistent identifier.

F2: Data are described with rich metadata

F2 at GO FAIR

Rich metadata allow a computer to automatically accomplish routine and tedious sorting and prioritising tasks that currently demand a lot of attention from researchers. Rich metadata implies that you should not presume that you know who will want to use your data, or for what purpose.

  • Metadata: The metadata richly describe the data.

F3: Metadata clearly and explicitly include the identifier of the data they describe

F3 at GO FAIR

The metadata and the dataset they describe are usually separate files. The association between a metadata file and the dataset should be made explicit by mentioning a dataset’s globally unique and persistent identifier in the metadata.

  • Metadata: Metadata clearly and explicitly include the identifier of the data they describe.
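
As an illustration of F1 and F3 together, a metadata record can carry both its own identifier and the identifier of the data it describes. A minimal sketch, with placeholder identifiers (not real DOIs):

```python
# Hypothetical metadata record; all identifiers and names are placeholders.
metadata = {
    "identifier": "doi:10.0000/metadata-0001",  # F1: identifier of the metadata
    "describes": "doi:10.0000/dataset-0001",    # F3: identifier of the data
    "title": "Census of cubic vertex-transitive graphs",
    "creators": ["A. Author", "B. Author"],
}
```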

F4: (Meta)data are registered or indexed in a searchable resource

F4 at GO FAIR

Identifiers and rich metadata descriptions alone will not ensure findability on the internet. Perfectly good data resources may go unused simply because no one knows they exist. Datasets should therefore be registered or indexed in searchable resources so that people and machines can discover them.

  • Dataset: The dataset is registered or indexed in a searchable resource.
  • Datum: The data are registered or indexed in a searchable resource.
  • Metadata: The metadata are registered or indexed in a searchable resource.

Accessible

A1: (Meta)data are retrievable by their identifier using a standardised communication protocol

A1 at GO FAIR

FAIR data retrieval should be mediated without specialised tools or communication methods. So, clearly define who can access the actual data, and specify how. Most users of the internet retrieve data by ‘clicking on a link’. This is a high-level interface to a low-level protocol called TCP, which the computer executes to load data in the user’s web browser.

  • Dataset: The dataset is retrievable by its identifier using a standardised communication protocol.
  • Datum: The data are retrievable by their identifiers using a standardised communication protocol.
  • Metadata: The metadata are retrievable by their identifier using a standardised communication protocol.
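
A minimal sketch of A1, assuming identifiers resolve over plain HTTPS; the resolver URL and identifier below are made up:

```python
import urllib.request

# Hypothetical resolver; HTTPS is a standardised, universally implemented protocol.
identifier = "doi:10.0000/dataset-0001"
url = "https://resolver.example.org/" + identifier

with urllib.request.urlopen(url) as response:
    dataset = response.read()  # raw bytes of the dataset (or its landing page)
```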

A1.1: The protocol is open, free and universally implementable

A1.1 at GO FAIR

To maximise data reuse, the protocol should be free (no-cost) and open (-sourced) and thus globally implementable to facilitate data retrieval. Anyone with a computer and an internet connection can access at least the metadata.

  • Dataset: The protocol to retrieve the dataset is open, free and universally implementable.
  • Datum: The protocol to retrieve the data is open, free and universally implementable.
  • Metadata: The protocol to retrieve the metadata is open, free and universally implementable.

A1.2: The protocol allows for an authentication and authorisation procedure where necessary

A1.2 at GO FAIR

This is a key, but often misunderstood, element of FAIR. The ‘A’ in FAIR does not necessarily mean ‘open’ or ‘free’. Rather, it implies that one should provide the exact conditions under which the data are accessible. Hence, even heavily protected and private data can be FAIR. Ideally, accessibility is specified in such a way that a machine can automatically understand the requirements, and then either automatically execute the requirements or alert the user to the requirements.

  • Dataset: The protocol to retrieve the dataset allows for an authentication and authorisation procedure where necessary.
  • Datum: The protocol to retrieve the data allows for an authentication and authorisation procedure where necessary.
  • Metadata: The protocol to retrieve the metadata allows for an authentication and authorisation procedure where necessary.

A2: Metadata should be accessible even when the data are no longer available

A2 at GO FAIR

Metadata are valuable in and of themselves, when planning research, especially replication studies. Even if the original data are missing, tracking down people, institutions or publications associated with the original research can be extremely useful.

  • Metadata: Metadata should be accessible even when the data are no longer available.

Interoperable

I1: (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation

I1 at GO FAIR

Humans should be able to exchange and interpret each other’s data (so preferably do not use dead languages). But this also applies to computers, meaning that data should be readable by machines without the need for specialised or ad hoc algorithms, translators, or mappings.

  • Dataset: The dataset uses a formal, accessible, shared, and broadly applicable language for knowledge representation.
  • Datum: The data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
  • Metadata: The metadata use a formal, accessible, shared, and broadly applicable language for knowledge representation.

I2: (Meta)data use vocabularies that follow the FAIR principles

I2 at GO FAIR

The controlled vocabulary used to describe datasets needs to be documented and resolvable using globally unique and persistent identifiers. This documentation needs to be easily findable and accessible by anyone who uses the dataset.

  • Dataset: The dataset uses vocabularies that follow the FAIR principles.
  • Datum: The data use vocabularies that follow the FAIR principles.
  • Metadata: The metadata use vocabularies that follow the FAIR principles.

I3: (Meta)data include qualified references to other (meta)data

I3 at GO FAIR

A qualified reference is a cross-reference that explains its intent. For example, ‘X is regulator of Y’ is a much more qualified reference than ‘X is associated with Y’ or ‘X see also Y’. The goal therefore is to create as many meaningful links as possible between (meta)data resources to enrich the contextual knowledge about the data, balanced against the time/energy involved in making a good data model. To be more concrete, you should specify if one dataset builds on another dataset, if additional datasets are needed to complete the data, or if complementary information is stored in a different dataset.

  • Dataset: The dataset includes qualified references to other (meta)data.
  • Datum: The data include qualified references to other (meta)data.
  • Metadata: The metadata include qualified references to other (meta)data.
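
A sketch of what qualified references could look like in metadata; the relation names and identifiers are illustrative, not a fixed vocabulary:

```python
# Each reference states *how* the resources are related, not just that they are.
references = [
    {"relation": "derived_from", "target": "doi:10.0000/dataset-0001"},
    {"relation": "completed_by", "target": "doi:10.0000/dataset-0002"},
    # Contrast with an unqualified cross-reference, which carries no intent:
    {"relation": "see_also", "target": "doi:10.0000/dataset-0003"},
]
```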

Reusable

R1: (Meta)data are richly described with a plurality of accurate and relevant attributes

R1 at GO FAIR

It will be much easier to find and reuse data if many labels are attached to the data. Principle R1 is related to F2, but R1 focuses on the ability of a user (machine or human) to decide if the data are actually USEFUL in a particular context.

  • Dataset: The dataset is richly described with a plurality of accurate and relevant attributes.
  • Datum: The data are richly described with a plurality of accurate and relevant attributes.
  • Metadata: The metadata are richly described with a plurality of accurate and relevant attributes.

R1.1: (Meta)data are released with a clear and accessible data usage license

R1.1 at GO FAIR

Under I, we covered elements of technical interoperability. R1.1 is about legal interoperability. What usage rights do you attach to your data? This should be described clearly. Ambiguity could severely limit the reuse of your data by organisations that struggle to comply with licensing restrictions. Clarity of licensing status will become more important with automated searches involving more licensing considerations. The conditions under which the data can be used should be clear to machines and humans.

  • Dataset: The dataset is released with a clear and accessible data usage license.
  • Metadata: The metadata are released with a clear and accessible data usage license.

Notes. This is under the assumption that the data inherit the dataset license.

R1.2: (Meta)data are associated with detailed provenance

R1.2 at GO FAIR

For others to reuse your data, they should know where the data came from (i.e., clear story of origin/history, see R1), who to cite and/or how you wish to be acknowledged. Include a description of the workflow that led to your data: Who generated or collected it? How has it been processed? Has it been published before? Does it contain data from someone else that you may have transformed or completed? Ideally, this workflow is described in a machine-readable format.

  • Dataset: The dataset is associated with detailed provenance.
  • Datum: The data are associated with detailed provenance.
  • Metadata: The metadata are associated with detailed provenance.

Notes. In most cases the value at R1.2 for data will just get copied over from the value for the dataset.
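
A machine-readable provenance record for R1.2 might look as follows; the field names and values are hypothetical, chosen for illustration:

```python
provenance = {
    "generated_by": "exhaustive generation",             # how the data were obtained
    "software": [{"name": "nauty", "version": "2.6"}],   # tools in the pipeline
    "inputs": ["doi:10.0000/dataset-0001"],              # data this dataset builds on
    "authors": ["A. Author"],
    "date": "2019-04-17",
    "previously_published": False,
}
```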

R1.3: (Meta)data meet domain-relevant community standards

R1.3 at GO FAIR

It is easier to reuse data sets if they are similar: same type of data, data organised in a standardised way, well-established and sustainable file formats, documentation (metadata) following a common template and using common vocabulary. If community standards or best practices for data archiving and sharing exist, they should be followed.

  • Dataset: The dataset meets domain-relevant community standards.
  • Datum: The data meet domain-relevant community standards.
  • Metadata: The metadata meet domain-relevant community standards.

Other Requirements

Contents

Go up to index.

Other Resources

1. Some Lists of finite Structures (Peter Jipsen)
   http://www1.chapman.edu/~jipsen/finitestructures.html
   Generated on the fly by a script.

Lists

1. Crowdsourcing project for the database of numbers of isomorphism types of finite groups (collected by Alex Konovalov)
   https://github.com/alex-konovalov/gnu/wiki/Mathematical-databases
2. DiscreteZOO
   https://github.com/DiscreteZOO/DiscreteZOO-overview/issues
3. The Encyclopedia of Graphs
   http://atlas.gregas.eu/sources
4. House of Graphs
   https://hog.grinvin.org/MetaDirectory.action

Meta papers

1. Fingerprint Databases for Theorems (Sara Billey and Bridget Tenner)
   https://sites.math.washington.edu/~billey/papers/fingerprints.pdf
2. Semantic-aware Fingerprints of Symbolic Research Data (Hans-Gert Gräbe)
   https://link.springer.com/content/pdf/10.1007%2F978-3-319-42432-3_51.pdf

Data in Software Packages

1. Optional and standard databases for Magma
   http://magma.maths.usyd.edu.au/magma/download/db/
2. Overview of data libraries in GAP
   http://www.gap-system.org/Datalib/datalib.html
3. SageMath databases
   http://doc.sagemath.org/html/en/reference/databases/index.html

Collection features, classification

Go up to index.

Some additional features to consider for databases: query API, license, ...

Sometimes it makes more sense to just generate objects on the fly. Are there systems that partly store and partly generate on the fly? (The Small Groups library in GAP?)

Fingerprints and related

Most collections contain mathematical objects (encoded in some way), and each object typically comes with some extra data. What is the nature of these data, and how do the collection's authors refer to them (i.e. as data, invariants, properties, or some kind of fingerprints)? In particular, if the collection uses fingerprints, what is their nature?

For resources on fingerprinting, see the meta papers listed above.
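
For graph collections, one concrete notion of a fingerprint is an isomorphism-invariant hash. Below is a minimal sketch using the Weisfeiler-Lehman graph hash from networkx (available in recent versions). Note that this hash is not a complete invariant: non-isomorphic graphs can occasionally share a fingerprint, so it narrows a lookup down to a few candidates, which can then be checked exactly.

```python
import networkx as nx

G = nx.petersen_graph()
# An isomorphic copy with relabelled vertices (v -> 3*v mod 10 is a bijection).
H = nx.relabel_nodes(G, {v: (3 * v) % 10 for v in G})

# The Weisfeiler-Lehman hash depends only on the isomorphism class,
# so both copies get the same fingerprint.
assert nx.weisfeiler_lehman_graph_hash(G) == nx.weisfeiler_lehman_graph_hash(H)
```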

Querying

What kind of querying does the collection support? Do the queries only use the object data or do they look at the structure of encoded objects directly in some way?
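
Both styles can be illustrated with a toy table of records in Python (using networkx; the records below stand in for a real collection). The first query touches only the stored invariants, while the second decodes each object and inspects its structure directly.

```python
import networkx as nx

def record(G):
    # Each record stores the encoded object plus precomputed object data.
    return {
        "graph6": nx.to_graph6_bytes(G, header=False).strip(),
        "order": G.number_of_nodes(),
    }

records = [record(nx.cycle_graph(6)), record(nx.petersen_graph())]

# Query over stored object data (invariants) only.
small = [r for r in records if r["order"] <= 6]

# Structural query: decode each object and inspect it directly.
bipartite = [r for r in records
             if nx.is_bipartite(nx.from_graph6_bytes(r["graph6"]))]
```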
