Various stuff around uplifting the French open data portal, data.gouv.fr, to the Semantic Web (Web 3.0).
This is the foundation work that fuels CasanovaLD.
I have forked this project to deal with data.gov.uk metadata. The fork became datagovuk-rdf.
build.xml is an Apache Ant script that runs the following tasks:
- Downloading the latest metadata dumps from data.gouv.fr (CSV)
- Cleaning the data dumps (empty lines, wrongly escaped quotes)
- Converting the CSV into RDF (using TARQL)
- Uploading to an RDF repository
- Converting text identifiers into URIs for better linking across the data
- Doing some smart post-processing to make the objects more interlinked in the graph, and integrating the output of beheader into the graph
- Adding some metadata about the resulting data set (DCAT, VoID, PROV)
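The cleaning step in the list above can be illustrated with a minimal Python sketch. It is not the code used by build.xml. The function name clean_csv_text is hypothetical, and the sketch assumes that a "wrongly escaped" quote means a backslash-escaped quote (\") that should be written as a doubled quote (""), which is what most CSV parsers expect. The sample assumes semicolon-delimited input.

```python
def clean_csv_text(raw: str) -> str:
    """Drop blank lines and rewrite backslash-escaped quotes as doubled quotes.

    Illustrative only: the real cleaning rules live in build.xml.
    """
    cleaned_lines = []
    for line in raw.splitlines():
        if not line.strip():
            # Skip empty or whitespace-only lines.
            continue
        # Assumption: a wrongly escaped quote looks like \" instead of "".
        cleaned_lines.append(line.replace('\\"', '""'))
    if not cleaned_lines:
        return ""
    return "\n".join(cleaned_lines) + "\n"


if __name__ == "__main__":
    sample = 'id;title\n\n1;"A \\"quoted\\" title"\n   \n2;"Plain"\n'
    print(clean_csv_text(sample), end="")
```

Note that this line-based approach would also drop blank lines that sit inside a quoted multi-line field, so it only suits dumps where each record fits on one line.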
This script is run every 2 hours to update the RDF metadata (see here in French, in English).
This wouldn't be possible, let alone so easy, without the live CSVs published by @noirbizarre for Etalab.
The data model can be seen here.
- Apache Ant, with [ANT INSTALL]/bin directory added to your PATH environment variable
- cURL, with [CURL INSTALL] directory added to your PATH environment variable
- TARQL by Richard Cyganiak (@cygri), with [TARQL INSTALL] directory added to your PATH environment variable
- An RDF repository. Apache Fuseki is a good choice, but there are plenty.
- Copy upload_template.properties and rename it upload.properties
- Open it and fill it in. As-is, the script assumes that your repository requires a user:password combination
- If the Requirements are fulfilled, just run ant in the datagouvfr-rdf root folder.
- If you have already run the process and just want to reload the data in the triple store, run ant quick.
- Tell me!
I would love to read your feedback/comments/suggestions!
If you have a GitHub account, you can create an issue.
Otherwise, you can reach me:
- by email: colin@maudry.com
- on Twitter: @CMaudry
- Externalized more configuration to config.properties
- Restructured tasks
- Trims trailing spaces off resource URLs and replaces other spaces with %20 (#52)
- Added KML formats, JSON-LD and N3 to the list of machine readable formats (#37)
- Added application/shp+zip to the list of machine readable formats
- Fixed CasanovaLD address for documentation (/.doc)
- Added backup-repository and load-backup targets to enable the management of the repository as a service
- Added dgfr:machineReadable property to distinguish machine readable resources from the others (CSV, XML, JSON, RDF, plain text, etc.)
- Added a clean version of dcat:mediaType values, without charset=
- Added properties dgfr:responseStatusCode, dgfr:responseTime and dgfr:availabilityCheckedOn to the ontology and API configuration
- Added a direct link between organizations and published distributions (see the result in the data model)
- Added a view for unavailable resources in the API (https://www.data.maudry.com/fr/resources/unavailable)
- Icons for boolean values (true/false) are clearer now
- Added ontology documentation (OnToology, thanks @dgarijo). You can view it by following the ontology URI at http://colin.maudry.com/ontologies/dgfr/index.html
- Availability and unavailability count at dataset and organization levels
- Made SPARQL endpoint configuration more flexible
- Fixed errors in ontology
- Disabled archiving of RDF due to disk space. Will enable again when I have a clearer archiving strategy.
- Calculation of popularity points for all objects, and aggregate sums on organisations and datasets
- Integration of the data collected by beheader (availability of the distributions, content type, content length)
- Enabled ETL with previously downloaded data to have CasanovaLD up quicker
- Not much...
- Updated the API documentation
- Updated VoID and PROV metadata to match the new repository location
- The RDF data is now loaded in a single atomic transaction in the repository
- Switched from Dydra (http://dydra.com) to a local Apache Fuseki instance
- Added organizations and reuses data, with all identifiers turned into URIs for full linking
- That was a lame name. Say hi to CasanovaLD!
- Improved documentation
- The data.gouv.fr explorer app, with somewhat documented APIs, is live!
- URIs have changed to match the domain of the app
- Added dgfr:visits and dcterms:keywords (as a comma-separated list, meh) in the data
- Redirections to the www. address were flaky on data.gouv.fr, so I had to specify the fully resolved address (e.g. http://www.data.gouv.fr/fr/datasets.csv)
- Fixed missing properties (mismatch at conversion stage). Still no tags
- Fixed RDF dataset modification date
- Fixed resources that have spaces in their URLs (url-encode)
- Added dgfr:slug for datasets
- Configured upload and update of VoID and PROV metadata (in default graph)
- Enabled scheduled task to update data every day
- Script to download/clean/convert/publish data.gouv.fr dataset metadata
- Basic documentation
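Two of the normalizations mentioned in the change log above (trimming and encoding spaces in resource URLs, and producing a clean media type without charset=) can be sketched in Python as follows. This is an illustration of the described behaviour, not the code from build.xml. Both function names are hypothetical, and stripping every parameter after the first semicolon (rather than only charset=) is an assumption.

```python
def normalize_resource_url(url: str) -> str:
    """Trim trailing spaces, then percent-encode the remaining spaces as %20."""
    return url.rstrip(" ").replace(" ", "%20")


def clean_media_type(value: str) -> str:
    """Return the bare media type, dropping parameters such as charset=.

    Assumption: everything after the first ';' is a parameter to discard.
    """
    return value.split(";", 1)[0].strip()


if __name__ == "__main__":
    print(normalize_resource_url("http://example.org/my file.csv  "))
    print(clean_media_type("text/csv; charset=utf-8"))
```

The URL helper only handles spaces, as the change log entry describes. Other characters that are unsafe in URLs are left untouched.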