Data dictionary capability for datasets #4015

Merged: 26 commits, Oct 6, 2023

@dafeder (Member) commented Sep 13, 2023

Allows data dictionaries to be referenced directly from a dataset or, more specifically, from a distribution. Using the describedBy field, a user can point to an existing data dictionary in DKAN.

In keeping with the DCAT-US standard, describedBy is expected to be a URL. To allow for domain-agnostic references between datasets and data dictionaries, URLs are normalized to the dkan:// stream wrapper for storage. A reference can be passed to DKAN either in that dkan:// form or as an absolute URL on the same domain the request is made to.

For instance:

Input

{
    "describedBy": "dkan://metastore/schemas/data-dictionary/items/087978c8-b629-4311-a07c-2e026192ccea",
    "describedByType": "application/vnd.tableschema+json"
}

or

{
    "describedBy": "https://imp.data.medicaid.gov/api/1/metastore/schemas/data-dictionary/items/087978c8-b629-4311-a07c-2e026192ccea",
    "describedByType": "application/vnd.tableschema+json"
}

Output

{
    "describedBy": "https://imp.data.medicaid.gov/api/1/metastore/schemas/data-dictionary/items/087978c8-b629-4311-a07c-2e026192ccea",
    "describedByType": "application/vnd.tableschema+json"
}

Storage

{
    "describedBy": "dkan://metastore/schemas/data-dictionary/items/087978c8-b629-4311-a07c-2e026192ccea",
    "describedByType": "application/vnd.tableschema+json"
}

Note that describedByType must also be set to application/vnd.tableschema+json; otherwise the datastore will not attempt to parse the data dictionary for post-import processing (data types, indexes, etc.).
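
To make the round trip above concrete, here is a minimal Python sketch of the normalization. It is not DKAN's implementation (that lives in PHP); the function names `to_storage` and `to_output` are hypothetical, and the `/api/1/` prefix is an assumption taken from the example URLs.

```python
from urllib.parse import urlsplit

# Assumption: the metastore API lives under /api/1/, as in the example URLs.
API_PREFIX = "/api/1/"
SCHEME = "dkan://"


def to_storage(url: str, request_host: str) -> str:
    """Normalize an incoming describedBy value to the dkan:// form (hypothetical helper)."""
    if url.startswith(SCHEME):
        return url  # already domain-agnostic
    parts = urlsplit(url)
    if parts.netloc == request_host and parts.path.startswith(API_PREFIX):
        # Same-domain absolute URL: drop scheme, host and API prefix.
        return SCHEME + parts.path[len(API_PREFIX):]
    return url  # this sketch leaves URLs on other domains untouched


def to_output(uri: str, base_url: str) -> str:
    """Expand a stored dkan:// URI to an absolute URL for API output (hypothetical helper)."""
    if not uri.startswith(SCHEME):
        return uri
    return base_url.rstrip("/") + API_PREFIX + uri[len(SCHEME):]
```

Both input forms shown above map to the same stored value, and the stored value expands back to the absolute URL on whichever domain serves the request.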

QA Steps

  1. Build a fresh site, visit admin/dkan/data-dictionary/settings and set the data dictionary mode to "distribution reference."

  2. Post the following data dictionary to your DKAN instance:

{
    "title": "Bike lanes data dictionary",
    "data": {
        "fields": [
            {
                "name": "objectid",
                "title": "OBJECTID",
                "type": "integer",
                "description": "Internal feature number."
            },
            {
                "name": "roadway",
                "title": "ROADWAY",
                "type": "string",
                "description": "A unique 8-character identification number assigned to a roadway or section of a roadway either On or Off the State Highway System for which information is maintained in the Department's Roadway Characteristics Inventory (RCI)."
            },
            {
                "name": "road_side",
                "title": "ROAD_SIDE",
                "type": "string",
                "constraints": {
                    "maxLength": 1,
                    "minLength": 1,
                    "enum": ["R", "L", "C"]
                },
                "description": "Side of the road. C = Composite; L = Left side; R = Right side"
            },
            {
                "name": "lncd",
                "title": "LNCD",
                "type": "integer",
                "constraints": {
                    "maxLength": 1,
                    "minLength": 1,
                    "maximum": 5,
                    "minimum": 0
                },
                "description": "Codes 0 = UNDESIGNATED; 1 = DESIGNATED; 2 = BUFFERED; 3 = COLORED; 4 = BOTH 2 AND 3; 5 = SHARROW"
            },
            {
                "name": "descr",
                "title": "DESCR",
                "type": "string",
                "constraints": {
                  "maxLength": 30,
                  "enum": ["UNDESIGNATED", "DESIGNATED"]
                },
                "description": "Designation description."
            },
            {
                "name": "begin_post",
                "title": "BEGIN_POST",
                "type": "number",
                "description": "Denotes the lowest milepoint for the record."
            },
            {
                "name": "end_post",
                "title": "END_POST",
                "type": "number",
                "description": "Denotes the highest milepoint for the record."
            },
            {
                "name": "shape_len",
                "title": "Shape_Leng",
                "type": "number",
                "description": "Length in meters"
            }
        ]
    }
}
  3. Post the following dataset to your DKAN instance:
{
    "@type": "dcat:Dataset",
    "accessLevel": "public",
    "contactPoint": {
        "fn": "Jane Doe",
        "hasEmail": "mailto:data.admin@example.com"
    },
    "description": "Data on bike lanes in Florida.",
    "distribution": [
    {
        "@type": "dcat:Distribution",
        "downloadURL": "https://demo.getdkan.org/sites/default/files/distribution/cedcd327-4e5d-43f9-8eb1-c11850fa7c55/Bike_Lane.csv",
        "mediaType": "text\/csv",
        "format": "csv",
        "title": "Florida Bike Lanes",
        "describedBy": "dkan://metastore/schemas/data-dictionary/items/b063524c-cf33-5d06-a161-770fc688d53b",
        "describedByType": "application/vnd.tableschema+json"
    }
    ],
    "identifier": "cedcd327-4e5d-43f9-8eb1-c11850fa7c55",
    "issued": "2016-06-22",
    "license": "http://opendatacommons.org/licenses/by/1.0/",
    "modified": "2016-06-22",
    "publisher": {
        "@type": "org:Organization",
        "name": "State Economic Council"
    },
    "spatial": "Florida",
    "theme": [
        "Transportation",
        "City Planning"
    ],
    "title": "Florida Bike Lanes ",
    "keyword":["bike-lanes", "streets", "infrastructure"]
}
  4. Run the import and post_import queues. The dictionary should be applied.
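
The two POST steps can be scripted. The sketch below only builds the requests; it assumes items are created by POSTing to the collection URL that the item URLs in this description hang off (`/api/1/metastore/schemas/{schema}/items`). The helper name `build_post`, the base URL and the credentials are placeholders. Running the queues is not covered.

```python
import json
from urllib.request import Request


def build_post(base_url: str, schema: str, item: dict, auth_header: str) -> Request:
    """Build (but do not send) a POST request for a metastore item.

    Assumption: the collection endpoint is
    {base_url}/api/1/metastore/schemas/{schema}/items.
    """
    url = f"{base_url.rstrip('/')}/api/1/metastore/schemas/{schema}/items"
    return Request(
        url,
        data=json.dumps(item).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": auth_header},
        method="POST",
    )
```

Pass the returned object to `urllib.request.urlopen` to send it, first with schema `data-dictionary` and the dictionary JSON, then with schema `dataset` and the dataset JSON.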

NOTE: This dictionary currently triggers a MySQL error on the begin_post field. The error is unrelated to this PR; I will figure out what's wrong and improve the QA steps.

@dafeder changed the title from "Data dictionary capability for datasets, take 3" to "Data dictionary capability for datasets" Sep 25, 2023
@paul-m (Contributor) commented Oct 3, 2023

Followed the QA steps and verified that before the post_import queue ran, the datastore table was not typed according to the data dictionary; afterwards it was.

Before:

mysql> describe datastore_7bd1ea6f608c2647e3680cff50def8ab;
+---------------+------------------+------+-----+---------+----------------+
| Field         | Type             | Null | Key | Default | Extra          |
+---------------+------------------+------+-----+---------+----------------+
| record_number | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| objectid      | text             | YES  |     | NULL    |                |
| roadway       | text             | YES  |     | NULL    |                |
| road_side     | text             | YES  |     | NULL    |                |
| lncd          | text             | YES  |     | NULL    |                |
| descr         | text             | YES  |     | NULL    |                |
| begin_post    | text             | YES  |     | NULL    |                |
| end_post      | text             | YES  |     | NULL    |                |
| shape_leng    | text             | YES  |     | NULL    |                |
+---------------+------------------+------+-----+---------+----------------+
9 rows in set (0.00 sec)

After:

mysql> describe datastore_7bd1ea6f608c2647e3680cff50def8ab;
+---------------+------------------+------+-----+---------+----------------+
| Field         | Type             | Null | Key | Default | Extra          |
+---------------+------------------+------+-----+---------+----------------+
| record_number | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| objectid      | int(11)          | YES  |     | NULL    |                |
| roadway       | text             | YES  |     | NULL    |                |
| road_side     | text             | YES  |     | NULL    |                |
| lncd          | int(11)          | YES  |     | NULL    |                |
| descr         | text             | YES  |     | NULL    |                |
| begin_post    | decimal(5,3)     | YES  |     | NULL    |                |
| end_post      | decimal(5,3)     | YES  |     | NULL    |                |
| shape_leng    | text             | YES  |     | NULL    |                |
+---------------+------------------+------+-----+---------+----------------+
9 rows in set (0.00 sec)
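
In the After listing, shape_leng is still text, while the dictionary in the QA steps names that field shape_len. A name mismatch is a plausible cause, though nothing in this thread confirms it. A small Python sketch for spotting such mismatches before running the queues (the helper name `unmatched_fields` is hypothetical, and case-insensitive matching is an assumption):

```python
def unmatched_fields(dictionary_fields, table_columns):
    """Return dictionary field names that have no same-named table column.

    Assumption: names are compared case-insensitively.
    """
    columns = {column.lower() for column in table_columns}
    return [
        field["name"]
        for field in dictionary_fields
        if field["name"].lower() not in columns
    ]


fields = [{"name": n} for n in (
    "objectid", "roadway", "road_side", "lncd",
    "descr", "begin_post", "end_post", "shape_len",
)]
columns = [
    "record_number", "objectid", "roadway", "road_side", "lncd",
    "descr", "begin_post", "end_post", "shape_leng",
]
print(unmatched_fields(fields, columns))  # ['shape_len']
```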

When posting the dataset, modifying the describedBy field to have a bad UUID resulted in a 400 Bad Request. This seems like the right thing to do.

@dafeder (Member, Author) commented Oct 3, 2023

Did you note the message in that 400 response? It should hopefully tell you why it was bad.

@paul-m (Contributor) commented Oct 3, 2023

400 error message:

HTTP/1.1 400 Bad Request
Cache-Control: must-revalidate, no-cache, private
Content-Language: en
Content-Type: application/json
Date: Tue, 03 Oct 2023 17:40:26 GMT
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Server: Apache/2.4.56 (Debian)
Set-Cookie: SSESS7890ddc970ec5af6fb5d2dfb16ce86c3=9BCBDyDwaqn5xvK-qlIIbzIhoJ4j1Nhn8peQHmooKKafabk2; expires=Thu, 26-Oct-2023 21:13:46 GMT; Max-Age=2000000; path=/; domain=.dkan-core.ddev.site; secure; HttpOnly
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-Generator: Drupal 10 (https://www.drupal.org)
Transfer-Encoding: chunked

{
  "message": "The value dkan:\/\/metastore\/schemas\/data-dictionary\/items\/b063524c-cf33-5d06-a161-770fc688d53c, derived from dkan:\/\/metastore\/schemas\/data-dictionary\/items\/b063524c-cf33-5d06-a161-770fc688d53c, is not a valid data-dictionary URI.",
  "status": 400,
  "timestamp": "2023-10-03T17:40:26+00:00"
}
Response file saved.
> 2023-10-03T104026.400.json

Response code: 400 (Bad Request); Time: 239ms (239 ms); Content length: 307 bytes (307 B)
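
The rejected identifier ends in 53c where the QA dataset uses 53b, so the URI is well formed but points at no existing dictionary. A minimal Python sketch of that kind of check follows; `check_reference` and the in-memory `known_ids` set are hypothetical stand-ins for the metastore lookup, and the error text is only modelled on the response above.

```python
import re

# Assumption: a reference is the dkan:// item path followed by a single identifier.
ITEM_URI = re.compile(r"^dkan://metastore/schemas/data-dictionary/items/(?P<id>[^/]+)$")


def check_reference(uri: str, known_ids: set) -> str:
    """Return the referenced identifier, or raise if it does not resolve."""
    match = ITEM_URI.match(uri)
    if match is None or match.group("id") not in known_ids:
        raise ValueError(f"The value {uri} is not a valid data-dictionary URI.")
    return match.group("id")
```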

@dafeder marked this pull request as ready for review October 4, 2023 15:15
Review thread on modules/metastore/src/Reference/MetastoreUrlGenerator.php (outdated, resolved)
Review thread on sonar-project.properties (outdated, resolved)
@paul-m (Contributor) left a comment

Addressed all the things. :-)

@paul-m merged commit a148554 into 2.x Oct 6, 2023
12 checks passed
@paul-m deleted the dict-refs3 branch October 6, 2023 19:36