Semantic data.gov.uk (0.8.5)
This script is fully functional (not beta, alpha, or otherwise experimental).
build.xml is an Apache Ant script that runs the following tasks:
- Downloading the latest metadata dumps from data.gov.uk (CSV)
- Cleaning the data dumps (empty lines, spaces in CSV headers, etc.)
- Converting the CSV into RDF (using TARQL)
- Uploading the RDF to a repository
- Converting text identifiers into URIs for better linking across the data
- Integrating the output of beheader into the graph (soon)
- Adding some metadata about the resulting data set (DCAT, VoID, PROV)
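The cleaning and conversion steps above can be sketched in Python. This is an illustration only: the real pipeline uses TARQL, and the example.org namespace, the `id` column, and the property URI pattern here are made up for the example.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical base namespace; the real pipeline derives URIs
# from data.gov.uk identifiers via TARQL queries.
BASE = "http://example.org/dataset/"

def csv_to_ntriples(csv_text):
    """Clean a CSV dump and emit one N-Triples statement per non-empty cell."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Clean spaces out of the CSV headers, as the Ant script does before conversion.
    reader.fieldnames = [h.strip().replace(" ", "_") for h in reader.fieldnames]
    triples = []
    for row in reader:
        if not any(row.values()):  # skip empty lines
            continue
        # Turn the text identifier into a URI for linking across the data.
        subject = "<%s%s>" % (BASE, quote(row["id"], safe=""))
        for key, value in row.items():
            if key == "id" or not value:
                continue
            predicate = "<%sproperty/%s>" % (BASE, quote(key, safe=""))
            literal = '"%s"' % value.replace('\\', '\\\\').replace('"', '\\"')
            triples.append("%s %s %s ." % (subject, predicate, literal))
    return "\n".join(triples)
```

Note how the identifier is percent-encoded before being embedded in the URI, which is the same idea as the url-encoding fix mentioned in the change log below.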
This script is run every night to update the RDF metadata.
The data model can be seen here.
Requirements:
- Apache Ant, with [ANT INSTALL]/bin directory added to your PATH environment variable
- cURL, with [CURL INSTALL] directory added to your PATH environment variable
- TARQL by Richard Cyganiak (@cygri), with [TARQL INSTALL] directory added to your PATH environment variable
- An RDF repository. Apache Fuseki is a good choice, but there are plenty of alternatives.
- Copy upload_template.properties and rename it
- Open it and fill it in. As-is, it assumes your repository requires a user:password combination
- If the requirements are fulfilled, just run
- If you have already run the process and just want to reload the data in the triple store, run
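The invocations might look like the following; the actual target names are defined in build.xml and may differ (load-backup is taken from the change log below, and whether it is the right reload target is an assumption):

```shell
# Full pipeline: download, clean, convert, upload (assumed default target)
ant

# Reload previously built RDF into the triple store (target name is an assumption)
ant load-backup
```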
- Tell me!
I would love to read your feedback/comments/suggestions!
If you have a Github account, you can create an issue.
Otherwise, you can reach me:
Change log
- Fixed malformed URLs by trimming trailing space before upload
- Detection of machine-readable resources
- Added backup-repository and load-backup targets to enable the management of the repository as a service
- Added data integration from beheader
- Fixed dcat:downloadUrl
- Fixed missing directories (csv and rdf)
- Adapted scripts and queries to data.gov.uk setup (#1)
Pre-fork change log
- Added properties dgfr:responseStatusCode, dgfr:responseTime and dgfr:availabilityCheckedOn to the ontology and API configuration
- Added direct link between organizations and published distributions (see the result in the data model)
- Added a view for unavailable resources in the API (https://www.data.maudry.com/fr/resources/unavailable)
- Icons for boolean values (true/false) are clearer now
- Added ontology documentation (OnToology, thanks @dgarijo). You can view it by following the ontology URI at http://colin.maudry.com/ontologies/dgfr/index.html
- Availability and unavailability count at dataset and organization levels
- Made SPARQL endpoint configuration more flexible
- Fixed errors in ontology
- Disabled archiving of RDF due to disk space. Will enable again when I have a clearer archiving strategy.
- Calculation of popularity points for all objects, and aggregate sums on organisations and datasets
- Integration of the data collected by beheader (availability of the distributions, content type, content length)
- Enabled ETL with previously downloaded data to bring CasanovaLD up more quickly
- Not much...
- Updated the API documentation
- Updated VoiD and PROV metadata to match the new repository location
- The RDF data is now loaded in a single atomic transaction in the repository
- Switched from Dydra (http://dydra.com) to a local Apache Fuseki instance
- Added organizations and reuses data, with all identifiers turned into URIs for full linking
- That was a lame name. Say hi to CasanovaLD!
- Improved documentation
- The data.gouv.fr explorer app, with somewhat documented APIs, is live!
- URIs have changed to match the domain of the app
- Added dgfr:visits and dcterms:keywords (as comma-separated list, meh) in the data
- Redirections to the www. address were flaky on data.gouv.fr, so I had to specify the fully resolved address (e.g. http://www.data.gouv.fr/fr/datasets.csv)
- Fixed missing properties (mismatch at conversion stage). Still no tags
- Fixed RDF dataset modification date
- Fixed resources that have spaces in their URLs (url-encode)
- Added dgfr:slug for datasets
- Configured upload and update of VoID and PROV metadata (in default graph)
- Enabled scheduled task to update data every day
- Script to download/clean/convert/publish data.gouv.fr dataset metadata
- Basic documentation