# SOFT7 Data Source

This notebook contains examples related to the SOFT7 data source: how to create, manage, resolve, and use SOFT7 data sources.

## Generate a SOFT7 Data Source

SOFT7 Data Source instances can be generated based on information from the following parts:

1. Data source (DB, File, Webpage, ...).
2. Generic data source parser.
3. Data source parser configuration.
4. SOFT7 entity (data model).

Together, parts 2 and 3 make up the "specific parser".
Parts 1 through 3 are provided collectively, based on the [`ResourceConfig`](https://emmc-asbl.github.io/oteapi-core/latest/all_models/#oteapi.models.ResourceConfig) from [OTEAPI Core](https://emmc-asbl.github.io/oteapi-core/).

### Resource configuration

The resource configuration, originally based on the [Data Catalog Vocabulary (DCAT)](https://www.w3.org/TR/vocab-dcat-3/), is in this case a small set of data catalog fields mapped to resource-specific values:

- `downloadUrl` or `accessUrl`:
  - `downloadUrl`: The URL of the downloadable file in a given format, e.g. a CSV file or an RDF file.

    Usage: `downloadUrl` _SHOULD_ be used for the URL at which this distribution is available directly, typically through an HTTPS GET request or SFTP.
  - `accessUrl`: A URL of the resource that gives access to a distribution of the dataset, e.g. a landing page, feed, or SPARQL endpoint.

    Usage: `accessUrl` _SHOULD_ be used for the URL of a service or location that can provide access to this distribution, typically through a Web form, query or API call.  
    `downloadUrl` is preferred for direct links to downloadable resources.
- `mediaType`: The media type of the distribution as defined by IANA [[IANA-MEDIA-TYPES](https://www.w3.org/TR/vocab-dcat-2/#bib-iana-media-types)].

  Usage: This property _SHOULD_ be used when the media type of the distribution is defined in IANA [[IANA-MEDIA-TYPES](https://www.w3.org/TR/vocab-dcat-2/#bib-iana-media-types)].
- `accessService`: A data service that gives access to the distribution of the dataset.

Note that the resource configuration **MUST** contain either `downloadUrl` _and_ `mediaType` **or** `accessUrl` _and_ `accessService`.
The remaining fields may be present in any combination, but at least one of these two pairs is required.
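
The requirement above can be sketched as a small check on a plain dictionary. This is only an illustration of the rule, not how OTEAPI Core validates a `ResourceConfig`; the function name `has_required_fields` and the example URLs are assumptions made for this sketch.

```python
def has_required_fields(config: dict) -> bool:
    """Return True if `config` holds at least one of the two required pairs.

    Hypothetical helper: it only mirrors the rule stated in the text.
    """
    # Pair 1: a direct download link together with its media type.
    has_download_pair = bool(config.get("downloadUrl")) and bool(config.get("mediaType"))
    # Pair 2: an access link together with the service that serves it.
    has_access_pair = bool(config.get("accessUrl")) and bool(config.get("accessService"))
    return has_download_pair or has_access_pair


# Example configurations (the URLs are placeholders).
complete = {"downloadUrl": "https://example.org/data.json", "mediaType": "application/json"}
incomplete = {"downloadUrl": "https://example.org/data.json"}

print(has_required_fields(complete))    # True
print(has_required_fields(incomplete))  # False
```

A configuration that mixes the pairs, for example `downloadUrl` with `accessService` only, is rejected by this check, since neither pair is complete.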

The fields described so far define the "data source" part of the SOFT7 data source.
In addition, the "generic parser" part of the SOFT7 data source is determined from either `mediaType` or `accessService`.
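
As a rough illustration of that selection step, the sketch below looks up a parser name from `mediaType` first and from `accessService` second. The registries, the parser names, and the lookup order are all assumptions made for this example; the actual mechanism used by OTEAPI Core is not shown in this notebook.

```python
# Hypothetical registries: keys and parser names are illustrative only.
PARSERS_BY_MEDIA_TYPE = {
    "application/json": "json-parser",
    "text/csv": "csv-parser",
}
PARSERS_BY_ACCESS_SERVICE = {
    "example-service": "example-service-parser",
}


def select_generic_parser(config: dict) -> str:
    """Pick a generic parser name based on `mediaType` or `accessService`."""
    media_type = config.get("mediaType")
    if media_type in PARSERS_BY_MEDIA_TYPE:
        return PARSERS_BY_MEDIA_TYPE[media_type]

    access_service = config.get("accessService")
    if access_service in PARSERS_BY_ACCESS_SERVICE:
        return PARSERS_BY_ACCESS_SERVICE[access_service]

    # Neither field matched a known parser.
    raise LookupError("No generic parser registered for this resource configuration.")


print(select_generic_parser({"mediaType": "application/json"}))  # json-parser
```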

To supply a parser configuration, one needs to know the generic parser that will be used, as well as the specific data to be retrieved.
Under "normal" circumstances, the parser configuration is stored alongside a reference to the data source for easy reusability.

Finally, a SOFT7 entity (or data model) is required to base the generated SOFT7 Data Source instance on.
Again, under "normal" circumstances, the SOFT7 entity is stored alongside a reference to the data source for easy reusability.

### Example of generating a SOFT7 Data Source

The following example shows how to generate a SOFT7 Data Source instance based on the parts described above.

In [15]:
import logging

logging.getLogger("s7").addHandler(logging.StreamHandler())

In [2]:
from s7.factories import create_entity

OPTIMADEStructure = create_entity("http://onto-ns.com/meta/1.0/OPTIMADEStructure#")
OPTIMADEStructure.model_fields["properties"].annotation.model_fields["attributes"].annotation.model_fields["properties"].annotation.model_fields

{'elements': FieldInfo(annotation=list[str], required=True, title='elements', description='The chemical symbols of the different elements present in the structure.', json_schema_extra={'x-soft7-shape': ['nelements']}),
 'nelements': FieldInfo(annotation=int, required=True, title='nelements', description='Number of different elements in the structure as an integer.', json_schema_extra={}),
 'elements_ratios': FieldInfo(annotation=list[float], required=True, title='elements_ratios', description='Relative proportions of different elements in the structure.', json_schema_extra={'x-soft7-shape': ['nelements']}),
 'chemical_formula_descriptive': FieldInfo(annotation=str, required=True, title='chemical_formula_descriptive', description='The chemical formula for a structure as a string in a form chosen by the API implementation.', json_schema_extra={}),
 'chemical_formula_reduced': FieldInfo(annotation=str, required=True, title='chemical_formula_reduced', description='The reduced chemical form

In [33]:
import os

from s7.factories import create_datasource

os.environ["OTELIB_DEBUG"] = "true"

datasource = create_datasource(
    entity="http://onto-ns.com/meta/1.0/OPTIMADEStructure",
    configs={
        "dataresource": {
            "downloadUrl": "https://optimade.materialsproject.org/v1/structures/mp-1228448",
            "mediaType": "application/json",
        },
        "mapping": {
            "mappingType": "triples",
            "prefixes": {
                "optimade": "https://optimade.materialsproject.org/v1/structures/mp-1228448#",
                "s7_top": "http://onto-ns.com/meta/1.0/OPTIMADEStructure#",
                "s7_attr": "http://onto-ns.com/meta/1.0/OPTIMADEStructureAttributes#",
                "s7_species": "http://onto-ns.com/meta/1.0/OPTIMADEStructureSpecies#",
            },
            "triples": {
                # top
                ("optimade:data.id", "", "s7_top:properties.id"),
                ("optimade:data.type", "", "s7_top:properties.type"),
                ("optimade:data.attributes", "", "s7_top:properties.attributes"),

                # attributes - dimensions
                ("optimade:data.attributes.nsites", "", "s7_attr:dimensions.nsites"),
                ("optimade:data.attributes.nelements", "", "s7_attr:dimensions.nelements"),
                # attributes - properties
                ("optimade:data.attributes.immutable_id", "", "s7_attr:properties.immutable_id"),
                ("optimade:data.attributes.last_modified", "", "s7_attr:properties.last_modified"),
                ("optimade:data.attributes.elements", "", "s7_attr:properties.elements"),
                ("optimade:data.attributes.elements_ratios", "", "s7_attr:properties.elements_ratios"),
                ("optimade:data.attributes.chemical_formula_descriptive", "", "s7_attr:properties.chemical_formula_descriptive"),
                ("optimade:data.attributes.chemical_formula_reduced", "", "s7_attr:properties.chemical_formula_reduced"),
                ("optimade:data.attributes.chemical_formula_hill", "", "s7_attr:properties.chemical_formula_hill"),
                ("optimade:data.attributes.chemical_formula_anonymous", "", "s7_attr:properties.chemical_formula_anonymous"),
                ("optimade:data.attributes.dimension_types", "", "s7_attr:properties.dimension_types"),
                ("optimade:data.attributes.nperiodic_dimensions", "", "s7_attr:properties.nperiodic_dimensions"),
                ("optimade:data.attributes.lattice_vectors", "", "s7_attr:properties.lattice_vectors"),
                ("optimade:data.attributes.cartesian_site_positions", "", "s7_attr:properties.cartesian_site_positions"),
                ("optimade:data.attributes.species_at_sites", "", "s7_attr:properties.species_at_sites"),
                ("optimade:data.attributes.structure_features", "", "s7_attr:properties.structure_features"),
                ("optimade:data.attributes.species", "", "s7_attr:properties.species"),

                # attributes.species - properties
                ("optimade:data.attributes.species.name", "", "s7_species:properties.name"),
                ("optimade:data.attributes.species.chemical_symbols", "", "s7_species:properties.chemical_symbols"),
                ("optimade:data.attributes.species.concentration", "", "s7_species:properties.concentration"),
                # attributes.species - dimensions - missing
            },
        },
    },
)
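
Each triple in the mapping above uses prefixed names such as `optimade:data.id`. The sketch below shows one way such a name could be expanded into a full IRI using the `prefixes` dictionary. The helper `expand` is hypothetical and written only for this illustration; how `s7` resolves prefixes internally is not shown here.

```python
def expand(term: str, prefixes: dict[str, str]) -> str:
    """Expand a `prefix:local` name using `prefixes`; return other terms unchanged."""
    prefix, separator, local_name = term.partition(":")
    if separator and prefix in prefixes:
        return prefixes[prefix] + local_name
    # Unknown prefix, full IRI, or empty predicate: leave as-is.
    return term


# A subset of the prefixes and triples from the configuration above.
prefixes = {
    "optimade": "https://optimade.materialsproject.org/v1/structures/mp-1228448#",
    "s7_top": "http://onto-ns.com/meta/1.0/OPTIMADEStructure#",
}
triple = ("optimade:data.id", "", "s7_top:properties.id")

subject, predicate, obj = (expand(term, prefixes) for term in triple)
print(subject)  # https://optimade.materialsproject.org/v1/structures/mp-1228448#data.id
print(obj)      # http://onto-ns.com/meta/1.0/OPTIMADEStructure#properties.id
```

Read this way, each triple links a location in the retrieved OPTIMADE response (the subject) to a dimension or property of a SOFT7 entity (the object).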

In [34]:
datasource.__getattribute__("id")

AttributeError: 

In [None]:
from s7.factories.datasource_factory import CACHE

print(CACHE)

{}
